Understand the logic and applications of a for loop in programming.
Write a loop that applies one or more commands separately to each file in a set of files.
Trace the values taken on by a loop variable during execution of the loop.
Explain what a dry-run is and how to implement it.
9.1 The for Loop
Loops are a programming construct which allow us to repeat a command or set of commands for each item in a list. As such they are key to productivity improvements through automation. Similar to wildcards and tab completion, using loops also reduces the amount of typing required (and hence reduces the number of typing mistakes).
Going back to our molecules directory, suppose we wanted to count the number of atoms in each of our molecules’ PDB files. As a reminder, here is the command to do this for one of our files:
cat cubane.pdb |grep"ATOM"|wc-l
Of course, we could manually then repeat this for each of our molecule files: cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb, propane.pdb.
But what if we had hundreds (or thousands!) of these files? We’ll use a loop to solve this problem, but first let’s look at the general form of a loop:
for thing in list_of_thingsdo# Indentation within the loop is not required, but aids legibilityoperation_using${thing}done
Taking our command above to count atoms, let’s create a new script called count_loop.sh, where we apply this idea:
#!/bin/bashfor filename in cubane.pdb ethane.pdb methane.pdbdo# count the number of lines containing the word "ATOM"cat${filename}|grep"ATOM"|wc-ldone
If we ran this script (bash count_loop.sh), we would get the following output:
16
8
5
When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable we specify (in this case filename). Then, the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we call for the variable’s value $filename.
In our example, at each iteration of the for loop, the variable $filename stored a different value, cycling through cubane.pdb, ethane.pdb and finally methane.pdb.
At the moment our script is not very informative of what files are being processed. But we could use some of the programming techniques we’ve already learned about to make our output even more informative. Here is an example of a modified script:
#!/bin/bashfor filename in cubane.pdb ethane.pdb methane.pdbdo# count the number of lines containing the word "ATOM"# store the result inside a variable 'natoms'natoms=$(cat${filename}|grep"ATOM"|wc-l)# print a message to the userecho"The number of atoms in ${filename} is: ${natoms}"done
If we run this script (bash count_loop.sh), we get a more informative output than before:
The number of atoms in cubane.pdb is: 16
The number of atoms in ethane.pdb is: 8
The number of atoms in methane.pdb is: 5
Note
Do not use spaces, quotes, or wildcard characters such as ’*’ or ‘?’ in filenames, as it complicates variable expansion.
Give files consistent names that are easy to match with wildcard patterns to make it easy to select them for looping.
In the example above, we wrote our code to count the number of atoms directly inside our for loop. However, in the previous section, we had already written a script - count_atoms.sh - that counts the number of atoms in a single file.
Given we already have that generic script, we could have run our task like this:
for filename in cubane.pdb ethane.pdb methane.pdbdobash count_atoms.sh $filenamedone
Here, we call our count_atoms.sh script from within the for loop. This is a very useful technique, as we can write generic scripts for a certain task, which we can then call from programming constructs such as a for loop.
9.3 Dry runs
A loop is a way to do many things at once – or to make many mistakes at once if it does the wrong thing! One way to check what a loop would do is to echo the commands it would run instead of actually running them – this is known as a dry-run.
Suppose we want to preview the commands of our count_loop.sh script, but without actually executing the command within the loop. Here is how we could have modified the previous code:
for filename in cubane.pdb ethane.pdb methane.pdbdoecho"bash count_atoms.sh $filename"done
All we’ve done is wrap our command instead of the echo command. If we run this modified code, the output is:
So, it wouldn’t actually run the command within the loop, but rather tell us what would have been run. This is a good practice when building scripts that include a for loop, because it lets us check that our code is all correct.
Can you think of a way to improve our count_loop.sh script, so that every file gets processed, but without having to type all the individual files’ names?
Answer
We can use the * wildcard in the for loop:
for filename in*.pdbdo# count the number of lines containing the word "ATOM"natoms=$(cat${filename}|grep"ATOM"|wc-l)# print a message to the userecho"The number of atoms in ${filename} is: ${natoms}"done
Searching for text
Level:
For this exercise, go to the following directory: cd ~/Desktop/data-shell/coronavirus/variants (adjust the path if your data is on a different location of your computer).
Previously, we had used the following command to count the number of “Alpha” variants in our dataset:
cat*_variants.csv |grep"Alpha"|wc-l
Write a for loop to search for several variants:
Use nano to create a new script called count_variants.sh.
Adapt the commands shown above to write a for loop to search for the variants “Alpha”, “Delta” and “Omicron”.
Print a message indicating which of the variants is being searched for.
Bonus (optional): modify the script to output the results to a CSV file called variant_counts.csv with the name of the variant as the first column and the count as the second column.
Answer
We can write the following script:
#!/bin/bashfor variant in Alpha Delta Omicrondo# count the variant occurrence across all files - save the result in a variable called "n"n=$(cat*_variants.csv |grep"${variant}"|wc-l)# print a message to the terminalecho"The number of ${variant} samples is: ${n}"done
In our for loop, we create a variable called variant to store each of the values we are iterating through.
Within the loop, we used this $variant variable in our grep command. This ensure that each time the code runs, a different variant will be searched for in our files.
We stored the result of our cat + grep + wc commands in a variable. This was so we could use this variable in our message at the end.
We used the echo command to print a message, which again uses the $variant variable as well as the $n variable, which stores the number of atoms from our previous command.
After creating the script, we ran it with bash count_variants.sh and this was the result:
The number of Alpha samples is: 38
The number of Delta samples is: 75
The number of Omicron samples is: 93
The bonus task asked to modify the code to output the results to a file. We can use the redirection operators (> / >>) to achieve this:
#!/bin/bash# outside of the loop we initiate a new file with column namesecho"variant,count"> variant_counts.csvfor variant in Alpha Delta Omicrondo# count the variant occurrence across all files - save the result in a variable called "n"n=$(cat*_variants.csv |grep"${variant}"|wc-l)# we append the variant name and its count to our file, each separated by a commaecho"${variant},${n}">> variant_counts.csvdone
If we run this modified script (bash count_variants.sh), nothing is printed to the terminal. However, a file is created in our directory, which contains the results of our analysis:
cat variant_counts.csv
variant,count
Alpha,38
Delta,75
Omicron,93
Because this is a CSV file, we could easily import it into a data analysis package (e.g. R or Python) to produce some visualisations.
Dry run
Level:
Suppose we want to set up up a directory structure to organize some experiments measuring reaction rate constants with different compounds and different temperatures.
Modify the following code to run as a dry-run (i.e. not actually execute the command inside the loop) and try to understand what would happen:
for molecule in cubane ethane methanedofor temp in 25 30 37 40domkdir$molecule-$tempdonedone
Answer
We can modify the code so that our command is quoted in an echo command: echo "mkdir $molecule-$temp".
This would be the output:
What we have done here is known as a nested loop, i.e. a loop contained within another loop. So, for each molecule in the first loop, the second loop (the nested loop) iterates over the list of temperatures, and creates a new directory for each combination.
Nested loops
Level:
In a previous exercise we created the script count_atom_type.sh, which counts specific atom types in our PDB files. Here is the code from that script:
# print a messageecho"The number of $2 atoms in $1 is:"# count carbon "C" atomscat"$1"|grep"ATOM"|grep"$2"|wc-l
(Note: if you haven’t done the previous exercise, please create a script named count_atom_type.sh and copy/paste this code into it.)
Create a new script called atom_type_loop.sh, which counts the number of Hydrogen (“H”) and Carbon (“C”) atoms in every .pdb file.
Hint
You will have to use a nested for loop, i.e. a loop within another loop. Look at the previous exercise to see the syntax for a nested loop. In this case, you want to loop through the files and loop through the two atom types.
Answer
We achieve this with two nested for loops:
the first loop goes through the files
the second loop goes through each of the types of atom that we are interested in
#!/bin/bash# first for loopfor filename in*.pdbdo# second (nested) for loopfor atom in C Hdo# run the script on the current filename and atombash count_atom_type.sh $filename$atomdonedone
Running our script with bash atom_type_loop.sh, we get the following output:
The number of C atoms in cubane.pdb is:
8
The number of H atoms in cubane.pdb is:
8
The number of C atoms in ethane.pdb is:
2
The number of H atoms in ethane.pdb is:
6
The number of C atoms in methane.pdb is:
1
The number of H atoms in methane.pdb is:
4
The number of C atoms in octane.pdb is:
8
The number of H atoms in octane.pdb is:
18
The number of C atoms in pentane.pdb is:
5
The number of H atoms in pentane.pdb is:
12
The number of C atoms in propane.pdb is:
3
The number of H atoms in propane.pdb is:
8
With only a few lines of code, we managed to perform this operation across all our files at once.
Note that, in this case, the order of the two loops was not important, it would have also worked to loop through the atom types first and then loop through the files. This may not always be the case, however, it depends on the task at hand.
Preparing pipeline input files
Level:
Let’s consider the files in the data-shell/sequencing directory (note: it’s not important to know what sequencing is). This directory contains the results of an experiment where several samples were processed in two runs of the sequencing machine (run1/ and run2/). For each sample, there are two input files, which have suffix _1.fq.gz and _2.fq.gz.
The researcher analysing these files needs to produce a CSV file, which will be used as input to a pipeline, and the format of this file should be:
Hint - don’t hesitate to look at these tips, it’s a challenging exercise!
Use a for loop to iterate through each sample (remember that each sample has two input files).
You can combine multiple wildcards in a path, for example ls run*/*_1.fq.gz would list all files in folders starting with the word “run” and all files within those folders ending in “_1.fq.gz”.
The command dirname can be used to extract the directory name from a path. For example: dirname run1/sampleA_1.fq.gz would return “run1” as the result.
Conversely, the command basename can be used to extract the name of a file from a path. For example: basename run1/sampleA_1.fq.gz would return “sampleA_1.fq.gz”. Further, you can also remove a suffix at the end of the name by passing it as a second argument to basename: basename run1/sampleA_1.fq.gz "_1.fq.gz" would only return “sampleA”.
You can store the result of a command in a variable with the syntax name=$(command). For example, dir=$(dirname run1/sampleA_1.fq.gz) would create a variable called $dir storing the result of the command, i.e. “run1”.
Answer
Based on all the hints given, here is a script that would achieve the desired result:
#!/bin/bash# first create a file with the CSV header (column names)echo"sample,input1,input2"> samplesheet.csv# for each file ending with `_1.fq.gz` (so we only process each sample once)for file in run*/*_1.fq.gzdo# extract the directory name of the filedir=$(dirname$file)# extract the prefix basename of the filebase=$(basename$file"_1.fq.gz")# append the name of the sample, and each input file pathecho"${base},${dir}/${base}_1.fq.gz,${dir}/${base}_2.fq.gz">> samplesheet.csvdone
This can be incredibly useful, especially for bioinformatic applications, when you may have to process hundreds of samples using standard pipelines.
9.5 Summary
Key Points
A for loop repeats commands once for every item in a list.
Every for loop needs a variable to refer to the item it is currently operating on.
Use $NAME to use the variable within the loop. ${NAME} can also be used.
You can use the echo command to do a dry-run of the commands in the loop to check what they would do, but without actually running them.
Two other commands that can be useful when looping through files are:
dirname to extract the directory name from a path.
basename to extract the file name from a path. Using basename <path> <suffix> will return the file name without the specified suffix (this is useful to extract the file name without the extension).