Set Operations Using Shell Script Commands
On many occasions I’ve had lists of names or other data stored in text files. The lists were really sets of names since no name was duplicated in the list. I often wanted to quickly and easily compute the intersection, union, relative complement, and symmetric difference of those sets. Because the sets were stored in text files, it seemed to me that the easiest way to compute these set operations was from a command line or shell. So I wrote the following four bash scripts. All the scripts use existing Unix commands, such as sort, sed, and comm and therefore assume that the sets are stored in text files with only one element listed on each line. All four scripts write their output to stdout.
Beside computing the set operations, each of the scripts provides an interesting study of the basic shell commands sort, sed, and comm.
Intersection
#!/bin/bash # intersect # Purpose: Compute intersection of sets $1 and $2. # A ^ B = {x : x e A and x e B} # # Expected format of sets: # 1. One element per line. # 2. Blank lines are allowed. # 3. Comments start with '#' and continue to end of line. # 4. Leading and trailing white space will be removed. TMP1=`mktemp --tmpdir --suffix=.txt intse-XXXXX` TMP2=`mktemp --tmpdir --suffix=.txt intse-XXXXX` trap "rm -f $TMP1 $TMP2 ; exit 0" EXIT trap "rm -f $TMP1 $TMP2 ; exit 1" ABRT BUS HUP INT KILL QUIT SEGV # Remove comments, leading and trailing white space and # blank lines, and sort input files. sed -e 's/#.*$//' \ -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \ -e '/^[[:space:]]*$/d' $1 | \ sort -u > $TMP1 sed -e 's/#.*$//' \ -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \ -e '/^[[:space:]]*$/d' $2 | \ sort -u > $TMP2 comm -12 $TMP1 $TMP2 exit 0
Union
#!/bin/bash # union # Purpose: Compute the union of sets, i.e. merge sets. # A U B = {x : x e A or x e B or x e both} # # Expected format of sets: # 1. One element per line. # 2. Blank lines are allowed. # 3. Comments start with '#' and continue to end of line. # 4. Leading and trailing white space will be removed. TMP=`mktemp --tmpdir --suffix=.txt union-XXXXX` trap "rm -f $TMP ; exit 0" EXIT trap "rm -f $TMP ; exit 1" ABRT BUS HUP INT KILL QUIT SEGV # Remove comments, leading and trailing white space and # blank lines from the sets and append all sets to $TMP. for i in "$@"; do sed -e 's/#.*$//' \ -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \ -e '/^[[:space:]]*$/d' "$i" | \ sort -u >> $TMP done # Merge sets using sort with unique flag. sort -u $TMP exit 0
Relative Complement
#!/bin/bash # relcompl # Purpose: Compute the complement of set $2 relative to set $1. # In other words compute set $1 - set $2 # A \ B = {x : x e A and x ne B} # # Expected format of sets: # 1. One element per line. # 2. Blank lines are allowed. # 3. Comments start with '#' and continue to end of line. # 4. Leading and trailing white space will be removed. TMP1=`mktemp --tmpdir --suffix=.txt relc-XXXXX` TMP2=`mktemp --tmpdir --suffix=.txt relc-XXXXX` trap "rm -f $TMP1 $TMP2 ; exit 0" EXIT trap "rm -f $TMP1 $TMP2 ; exit 1" ABRT BUS HUP INT KILL QUIT SEGV # Remove comments, leading and trailing white space and # blank lines, and sort input files. sed -e 's/#.*$//' \ -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \ -e '/^[[:space:]]*$/d' $1 | \ sort -u > $TMP1 sed -e 's/#.*$//' \ -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \ -e '/^[[:space:]]*$/d' $2 | \ sort -u > $TMP2 comm -23 $TMP1 $TMP2 exit 0
Symmetric Difference
#!/bin/bash # symdif # Purpose: Compute the symmetric difference # of sets $1 and $2 # A @ B = {x : x e A or x e B but x ne both} # # Expected format of sets: # 1. One element per line. # 2. Blank lines are allowed. # 3. Comments start with '#' and continue to end of line. # 4. Leading and trailing white space will be removed. TMP1=`mktemp --tmpdir --suffix=.txt symdif-XXXXX` TMP2=`mktemp --tmpdir --suffix=.txt symdif-XXXXX` trap "rm -f $TMP1 $TMP2 ; exit 0" EXIT trap "rm -f $TMP1 $TMP2 ; exit 1" ABRT BUS HUP INT KILL QUIT SEGV # Remove comments, leading and trailing white space and # blank lines, and sort input files. sed -e 's/#.*$//' \ -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \ -e '/^[[:space:]]*$/d' $1 | \ sort -u > $TMP1 sed -e 's/#.*$//' \ -e 's/^[[:space:]]*\([^[:space:]]*\)[[:space:]]*$/\1/' \ -e '/^[[:space:]]*$/d' $2 | \ sort -u > $TMP2 comm -3 $TMP1 $TMP2 | sed 's/\t//g' exit 0