Loops
Loops are a programming construct that allows us to repeat a command or set of commands for each item in a list. As such, they are key to productivity improvements through automation. Like wildcards and tab completion, loops also reduce the amount of typing required (and hence the number of typing mistakes).
Suppose we have several hundred genome data files named basilisk.dat, minotaur.dat, and unicorn.dat. For this example, we’ll use the exercise-data/creatures directory, which has only three example files, but the principles can be applied to many more files at once.
Let’s change to that directory:
$ cd ~/Downloads/shell-lesson-data/exercise-data/creatures
and check what’s there:
$ ls
basilisk.dat minotaur.dat unicorn.dat
The structure of these files is the same: the common name, classification,
and updated date are presented on the first three lines, with DNA sequences
on the following lines. These files are pretty long, so let’s look at just the beginning of each one using the head command:
$ head -n 5 *.dat
We would like to print out the classification for each species, which is given on the second line of each file. To do this, we can combine the head command with the tail command, which works like head except that it shows lines from the bottom of its input. For example:
$ tail -n 2 minotaur.dat
TCCAGTCCCA
GCCTTCACGG
We get just the last two lines of the file, which are DNA sequences.
To get the classification for a file, we need to execute the command head -n 2 to get the first two lines, and pipe this to tail -n 1 to get the last line of those two (i.e. the second line, which we know is the classification in each file). For example:
$ head -n 2 minotaur.dat | tail -n 1
CLASSIFICATION: bos hominus
But we want to do this on every file in the directory all at once, not one at a time. We’ll use a loop to solve this problem, but first let’s look at the general form of a loop, using the pseudo-code below:
# The word "for" indicates the start of a "for-loop" command
for thing in list_of_things
# The word "do" indicates the start of the list of commands to execute
do
    # Indentation within the loop is not required, but aids legibility
    operation_using/command $thing
# The word "done" indicates the end of a loop
done
and we can apply this to our example like this:
$ for filename in *.dat
> do
> echo $filename
> head -n 2 $filename | tail -n 1
> done
basilisk.dat
CLASSIFICATION: basiliscus vulgaris
minotaur.dat
CLASSIFICATION: bos hominus
unicorn.dat
CLASSIFICATION: equus monoceros
When the shell sees the keyword for, it knows to repeat a command (or group of commands) once for each item in a list. Each time the loop runs (called an iteration), an item in the list is assigned in sequence to the variable, and the commands inside the loop are executed, before moving on to the next item in the list. Inside the loop, we get the variable’s value by putting $ in front of it. The $ tells the shell to treat something as a variable name and substitute its value in its place, rather than treat it as text or a command.
In this example, the list is three filenames: basilisk.dat, minotaur.dat, and unicorn.dat. Each time the loop iterates, we first use echo to print the value that the variable $filename currently holds. This is not necessary for the result, but it makes the output easier to follow. Next, we run the head command on the file currently referred to by $filename.
The first time through the loop, $filename is basilisk.dat. The interpreter runs the command head on basilisk.dat and pipes the first two lines to the tail command, which then prints the second line of basilisk.dat.
For the second iteration, $filename becomes minotaur.dat. This time, the shell runs head on minotaur.dat and pipes the first two lines to the tail command, which then prints the second line of minotaur.dat.
For the third iteration, $filename becomes unicorn.dat, so the shell runs the head command on that file, and tail on the output of that.
Since the list was only three items, the shell exits the loop after this.
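The walkthrough above can be reproduced without the lesson data. The sketch below is only an illustration: it builds three small stand-in files in a temporary directory (the CLASSIFICATION lines are taken from the output shown earlier; every other line is placeholder text, and the names workdir and result are made up for this example) and then runs the same loop, capturing what it prints.

```shell
# Scratch directory with three small stand-in files. Only the
# CLASSIFICATION lines come from the lesson; the rest is placeholder text.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf 'COMMON NAME: placeholder\nCLASSIFICATION: basiliscus vulgaris\nUPDATED: placeholder\nACGT\n' > basilisk.dat
printf 'COMMON NAME: placeholder\nCLASSIFICATION: bos hominus\nUPDATED: placeholder\nACGT\n' > minotaur.dat
printf 'COMMON NAME: placeholder\nCLASSIFICATION: equus monoceros\nUPDATED: placeholder\nACGT\n' > unicorn.dat

# Same loop as in the text; its output is captured in a variable.
# The variable is quoted here, which makes no difference for these names.
result=$(
    for filename in *.dat
    do
        echo "$filename"
        head -n 2 "$filename" | tail -n 1
    done
)

echo "$result"
```

Each pass through the loop contributes two lines to the output: the file name from echo, then the second line of that file from the head and tail pipeline.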
Challenge: Write your own loop
How would you write a loop that uses the echo command to print all 10 numbers from 0 to 9?
Solution
$ for number in 0 1 2 3 4 5 6 7 8 9
> do
> echo $number
> done
Challenge: Variables in loops
What is the output of the following code? Why?
$ for number in 0 1 2 3 4 5 6 7 8 9
> do
> echo number
> done
Solution
number
number
number
number
number
number
number
number
number
number
Because we forgot to use the $ to tell the shell to treat number as a variable name, the shell will print the literal text number 10 times.
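To see the two behaviours side by side, here is a small sketch using a shorter list of three numbers. It writes each loop on one line, with semicolons in place of the line breaks, and stores the output in two variables whose names (with_dollar and without_dollar) are invented for this illustration.

```shell
# With the $, the shell substitutes the current value of the variable.
with_dollar=$(for number in 0 1 2; do echo "$number"; done)

# Without the $, echo simply receives the literal word "number".
without_dollar=$(for number in 0 1 2; do echo number; done)

printf 'with $:\n%s\n' "$with_dollar"
printf 'without $:\n%s\n' "$without_dollar"
```

The loop runs three times in both cases; only what echo is given differs.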
Challenge: more variables in Loops
This exercise uses the shell-lesson-data/exercise-data/alkanes directory again. Start by changing into this directory:
$ cd ~/Downloads/shell-lesson-data/exercise-data/alkanes
ls *.pdb gives the following output:
$ ls *.pdb
cubane.pdb
ethane.pdb
methane.pdb
octane.pdb
pentane.pdb
propane.pdb
What is the output of the following code?
$ for datafile in *.pdb
> do
> ls *.pdb
> done
What is the output of the following code?
$ for datafile in *.pdb
> do
> ls $datafile
> done
Why do these two loops give different outputs?
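One way to check your answer is to count the lines each loop prints. The sketch below assumes six empty stand-in files with the alkane names used in this lesson, created in a temporary directory; the variable names are made up for the example.

```shell
# Six empty stand-in files in a scratch directory (not the real lesson data).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

# First loop: the body ignores $datafile and lists every match on each pass.
all_each_time=$(for datafile in *.pdb; do ls *.pdb; done | wc -l | tr -d ' ')

# Second loop: the body lists only the file currently held in $datafile.
one_each_time=$(for datafile in *.pdb; do ls "$datafile"; done | wc -l | tr -d ' ')

echo "first loop printed $all_each_time lines"
echo "second loop printed $one_each_time lines"
```

The first loop repeats the full listing once per file, while the second prints one name per iteration, because only the second loop body actually uses the loop variable.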
Challenge: Limiting Sets of Files
What is the output of running the following loop in the shell-lesson-data/exercise-data/alkanes directory?
$ for filename in c*
> do
> ls $filename
> done
1. No files are listed.
2. All files are listed.
3. Only cubane.pdb, octane.pdb and pentane.pdb are listed.
4. Only cubane.pdb is listed.
Solution
4 is the correct answer. * matches zero or more characters, so any file name starting with the letter c, followed by zero or more other characters, will be matched. In this directory, only cubane.pdb starts with c.
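The effect of the c* pattern can be confirmed with a quick sketch. It again assumes empty stand-in files with the lesson's alkane names in a temporary directory, and the variable matches is a name chosen for this example.

```shell
# Empty stand-in files with the lesson's names, in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch cubane.pdb ethane.pdb methane.pdb octane.pdb pentane.pdb propane.pdb

# The shell expands c* to the matching names before the loop starts.
matches=$(
    for filename in c*
    do
        ls "$filename"
    done
)

echo "$matches"
```

octane.pdb contains a c but does not begin with one, so it is not part of the list the loop iterates over.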
Challenge: Saving to a File in a Loop - Part One
In the shell-lesson-data/exercise-data/alkanes directory, what is the effect of this loop?
for file in *.pdb
do
echo $file
cat $file > test.pdb
done
1. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, pentane.pdb and propane.pdb, and the text from propane.pdb will be saved to a file called test.pdb.
2. Prints cubane.pdb, ethane.pdb, and methane.pdb, and the text from all three files would be concatenated and saved to a file called test.pdb.
3. Prints cubane.pdb, ethane.pdb, methane.pdb, octane.pdb, and pentane.pdb, and the text from propane.pdb will be saved to a file called test.pdb.
4. None of the above.
Solution
1 is the correct answer. The text from each file in turn gets written to the test.pdb file. However, the file gets overwritten on each loop iteration, so the final content of test.pdb is the text from the propane.pdb file.
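The overwriting behaviour can be observed with a small sketch. It assumes two stand-in files containing one line of placeholder text each, and it writes to .txt files (names chosen for this example) so that the output files never match the *.pdb pattern. For contrast, the second loop uses the shell's >> operator, which appends instead of overwriting.

```shell
# Two stand-in files with placeholder contents, in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "placeholder text for ethane" > ethane.pdb
echo "placeholder text for propane" > propane.pdb

# > empties the target on every iteration, so only the last file's text remains.
for file in *.pdb
do
    cat "$file" > overwritten.txt
done

# >> adds to the end of the target, so the text of every file is kept.
for file in *.pdb
do
    cat "$file" >> appended.txt
done

echo "overwritten.txt:"
cat overwritten.txt
echo "appended.txt:"
cat appended.txt
```

Because the wildcard expands in alphabetical order, propane.pdb is processed last, which is why its text is what survives in the overwritten file.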
Challenge: Spaces in Names
Spaces are used to separate the elements of the list that we are going to loop over. If one of those elements contains a space character, we need to surround it with quotes, and do the same thing to our loop variable. Suppose our data files are named:
red dragon.dat
purple unicorn.dat
To loop over these files, we would need to add double quotes like so:
$ for filename in "red dragon.dat" "purple unicorn.dat"
> do
> head -n 100 "$filename" | tail -n 20
> done
It is simpler to avoid using spaces (or other special characters) in filenames.
The files above don’t exist, so if we run the above code, the head command will be unable to find them; however, the error messages returned will show the names of the files it is expecting:
head: cannot open 'red dragon.dat' for reading: No such file or directory
head: cannot open 'purple unicorn.dat' for reading: No such file or directory
Try removing the quotes around $filename in the loop above to see the effect of the quote marks on spaces. Note that we get a result from the loop command for unicorn.dat when we run this code in the creatures directory:
head: cannot open 'red' for reading: No such file or directory
head: cannot open 'dragon.dat' for reading: No such file or directory
head: cannot open 'purple' for reading: No such file or directory
CGGTACCGAA
AAGGGTCGCG
CAAGTGTTCC
...
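To try this with files that do exist, the sketch below creates the two file names from this challenge in a temporary directory, each holding one line of placeholder text, and runs a loop with and without quotes around the variable. The variable names quoted and error_lines are invented for this illustration.

```shell
# Stand-in files whose names contain a space, in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
echo "first line of purple unicorn" > "purple unicorn.dat"
echo "first line of red dragon" > "red dragon.dat"

# Quoted: each file name reaches head as a single argument.
quoted=$(for filename in *.dat; do head -n 1 "$filename"; done)

# Unquoted: the shell splits each name at the space, so head is asked to
# open two files that do not exist. Here we count the error lines it prints.
error_lines=$(for filename in *.dat; do head -n 1 $filename; done 2>&1 >/dev/null | wc -l | tr -d ' ')

echo "$quoted"
echo "unquoted loop produced $error_lines error lines"
```

The wildcard itself copes with the spaces; it is the unquoted $filename inside the loop body that gets split into separate words.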