6 Combining Commands
6.1 The | Pipe
In the previous section we ended with an exercise where we counted the number of lines matching the word “Alpha” in several CSV files containing variant classifications of coronavirus samples from several countries.
We achieved this in three steps:
- Combine all CSV files into one:
cat *_variants.csv > all_countries.csv
- Create a new file containing only the lines that match our pattern:
grep "Alpha" all_countries.csv > alpha.csv
- Count the number of lines in this new file:
wc -l alpha.csv
But what if we now wanted to search for a different pattern, for example “Delta”? It seems impractical to keep creating new files every time we want to ask such a question of our data.
This is where one of the shell’s most powerful features comes in handy: the ease with which it lets us combine existing programs in new ways.
We can combine commands together using a pipe, which uses the special operator |. Here is our example using a pipe:
cat *_variants.csv | grep "Alpha" | wc -l
Notice how we now don’t specify an input to either grep or wc. The input is streamed automatically from one tool to the next through the pipe: the output of cat is sent to grep, and the output of grep is then sent to wc.
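To answer the “Delta” question from earlier, we only need to swap the pattern — no intermediate files required. Here is a self-contained sketch using a small made-up CSV (the demo file name and values are hypothetical, standing in for the course data):

```shell
# Create a tiny stand-in for the course CSV files (hypothetical data)
printf 'sample,clade\nS1,21A (Delta)\nS2,20I (Alpha; V1)\nS3,21J (Delta)\n' > demo_variants.csv

# Same pipeline as before, different pattern: count lines matching "Delta"
cat demo_variants.csv | grep "Delta" | wc -l
```

This prints 2, because two of the three sample lines contain “Delta”.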
6.2 Cut, Sort, Unique & Count
Let’s now explore a few more useful commands to manipulate text that can be combined to quickly answer useful questions about our data.
Let’s start with the command cut, which is used to extract sections from each line of its input. For example, let’s say we wanted to retrieve only the second field (or column) of our CSV file, which contains the clade classification of each of our samples:
cat *_variants.csv | cut -d "," -f 2
clade
20I (Alpha; V1)
20A
20I (Alpha; V1)
20A
... (more output omitted) ...
The two options used with this command are:
- -d defines the delimiter used to separate the different parts of the line. Because this is a CSV file, we use the comma as our delimiter. The tab is the default delimiter.
- -f defines the field, or part of the line, we want to extract. In our case, we want the second field (or column) of our CSV file. It’s worth knowing that you can specify more than one field, so for example if you had a CSV file with more columns and wanted columns 3 and 7 you could set -f 3,7.
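As a quick sketch of the multi-field form, here is cut extracting two columns from a made-up three-column CSV (the file name and values are hypothetical, not the course data):

```shell
# Hypothetical CSV with three columns
printf 'sample,clade,country\nS1,20A,UK\nS2,21K (Omicron),Portugal\n' > demo.csv

# Extract the first and third fields only
cut -d "," -f 1,3 demo.csv
```

The output keeps the comma between the selected fields: sample,country then S1,UK then S2,Portugal.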
The next command we will explore is called sort, which sorts the lines of its input alphabetically (the default) or numerically (if using the -n option). Let’s combine it with our previous command to see the result:
cat *_variants.csv | cut -d "," -f 2 | sort
19B
19B
20A
20A
20A
20A
20A
20A
20A
20A
... (more output omitted) ...
You can see that the output is now sorted alphabetically.
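The difference between alphabetical and numeric sorting is easy to see with a small made-up example (the file name and numbers are hypothetical):

```shell
# Three numbers, one per line
printf '10\n2\n1\n' > nums.txt

sort nums.txt     # alphabetical: "10" sorts before "2"
sort -n nums.txt  # numeric: 1, 2, 10
```

Alphabetically, “10” comes before “2” because lines are compared character by character; the -n option compares whole numbers instead.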
The sort command is often used in conjunction with another command: uniq. This command returns the unique lines in its input. Importantly, it only works as intended if the input is sorted, because uniq only removes duplicate lines that are adjacent. That’s why it’s often used together with sort.
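A small sketch (with made-up data) shows why sorting first matters:

```shell
# Duplicated values, but no two identical lines are adjacent
printf 'B\nA\nB\nA\n' > letters.txt

uniq letters.txt          # unsorted input: no adjacent duplicates, so nothing is removed
sort letters.txt | uniq   # sorted first: duplicates become adjacent and are collapsed
```

The first command still prints all four lines; the second prints only A and B.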
Let’s see it in action, by continuing building our command:
cat *_variants.csv | cut -d "," -f 2 | sort | uniq
19B
20A
20B
20C
20E (EU1)
20I (Alpha; V1)
21A (Delta)
21I (Delta)
21J (Delta)
21K (Omicron)
21L (Omicron)
21M (Omicron)
NA
clade
We can see that the output is now de-duplicated, so only unique values are returned. (Notice that the column header, clade, also appears as one of the values, because the header line of each file was included in the input.) And so, with a few simple commands, we’ve answered a very useful question of our data: what are the unique variants in our collection of samples?
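If we also wanted to know how many unique variants there are, we could send the de-duplicated list on to wc -l. Here is a self-contained sketch with made-up data (the demo file stands in for the course files; note the header line counts as one of the values):

```shell
# Hypothetical stand-in for the course CSV files
printf 'sample,clade\nS1,20A\nS2,21A (Delta)\nS3,20A\nS4,21K (Omicron)\n' > demo_variants.csv

# Count the unique values in the second column
cat demo_variants.csv | cut -d "," -f 2 | sort | uniq | wc -l
```

This prints 4: the three distinct clades plus the header value clade.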