Shell scripts

Shell scripts#

We are finally ready to see what makes the shell such a powerful programming environment. We are going to take the commands we repeat frequently and save them in files so that we can re-run all those operations again later by typing a single command. For historical reasons, a bunch of commands saved in a file is usually called a shell script, but make no mistake — these are actually small programs.

Not only will writing shell scripts make your work faster, but also you won’t have to retype the same commands over and over again. It will also make it more accurate (fewer chances for typos) and more reproducible. If you come back to your work later (or if someone else finds your work and wants to build on it), you will be able to reproduce the same results simply by running your script, rather than having to remember or retype a long list of commands.

Our first shell script#

Let’s start by making sure we’re in the alkanes directory, and then creating a new file, middle.sh which will become our shell script:

cd ~/Downloads/shell-lesson-data/exercise-data/alkanes
nano middle.sh

Let’s edit edit the file by inserting the following line:

head -n 15 octane.pdb | tail -n 5

This is a variation on the pipe we constructed earlier, which selects lines 11-15 of the file octane.pdb. Remember, we are not running it as a command just yet; we are only saving the command in a file.

Then we save the file (Ctrl-O in nano) and exit (Ctrl-X in nano). Check that the directory alkanes now contains a file called middle.sh.

Once we have saved the file, we can ask the shell to execute the commands it contains. Our shell is called bash, so we run the following command:

bash middle.sh

Sure enough, our script’s output is exactly what we would get if we ran that pipeline directly.

Command line arguments#

What if we want to select lines from an arbitrary file? We could edit middle.sh each time to change the filename, but that would probably take longer than typing the command out again in the shell and executing it with a new file name. Instead, let’s edit middle.sh and make it more versatile:

nano middle.sh

Now, within “nano”, replace the text octane.pdb with the special variable called $1:

head -n 15 "$1" | tail -n 5

Inside a shell script, $1 means ‘the first filename (or other argument) on the command line’. We can now run our script like this:

bash middle.sh octane.pdb

or on a different file like this:

$ bash middle.sh pentane.pdb

Currently, we need to edit middle.sh each time we want to adjust the range of lines that is returned. Let’s fix that by configuring our script to instead use three command-line arguments. After the first command-line argument ($1), each additional argument that we provide will be accessible via the special variables $1, $2, $3, which refer to the first, second, third command-line arguments, respectively.

Knowing this, we can use additional arguments to define the range of lines to be passed to head and tail respectively:

nano middle.sh

head -n "$2" "$1" | tail -n "$3"

We can now run:

bash middle.sh pentane.pdb 15 5

By changing the arguments to our command, we can change our script’s behaviour:

bash middle.sh pentane.pdb 20 5

Putting comments in the code#

We can improve our script by adding some comments at the top:

nano middle.sh

# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"

A comment starts with a # character and runs to the end of the line. The computer ignores comments, but they’re invaluable for helping people (including your future self) understand and use scripts. The only caveat is that each time you modify the script, you should check that the comment is still accurate. An explanation that sends the reader in the wrong direction is worse than none at all.

Multiple command line arguments#

What if we want to process many files in a single pipeline? For example, if we want to sort our .pdb files by length, we would type:

$ wc -l *.pdb | sort -n

because wc -l lists the number of lines in the files (recall that wc stands for ‘word count’, adding the -l option means ‘count lines’ instead) and sort -n sorts things numerically. We could put this in a file, but then it would only ever sort a list of .pdb files in the current directory. If we want to be able to get a sorted list of other kinds of files, we need a way to get all those names into the script. We can’t use $1, $2, and so on because we don’t know how many files there are. Instead, we use the special variable $@, which means, ‘All of the command-line arguments to the shell script’. We also should put $@ inside double-quotes to handle the case of arguments containing spaces ("$@" is special syntax and is equivalent to "$1" "$2" …).

Here’s an example:

nano sorted.sh

# Sort files by their length.
# Usage: bash sorted.sh one_or_more_filenames
wc -l "$@" | sort -n

bash sorted.sh *.pdb ../creatures/*.dat

methane.pdb
ethane.pdb
propane.pdb
cubane.pdb
pentane.pdb
octane.pdb
../creatures/basilisk.dat
../creatures/minotaur.dat
../creatures/unicorn.dat
total

Challenge: List Unique Species

Leah has several hundred data files, each of which is formatted like this:

2013-11-05,deer,5
2013-11-05,rabbit,22
2013-11-05,raccoon,7
2013-11-06,rabbit,19
2013-11-06,deer,2
2013-11-06,fox,1
2013-11-07,rabbit,18
2013-11-07,bear,1

An example of this type of file is given in shell-lesson-data/exercise-data/animal-counts/animals.csv.

We can use the command cut -d , -f 2 animals.csv | sort | uniq to produce the unique species in animals.csv. In order to avoid having to type out this series of commands every time, a scientist may choose to write a shell script instead.

Write a shell script called species.sh that takes any number of filenames as command-line arguments and uses a variation of the above command to print a list of the unique species appearing in each of those files separately.

Solution

# Script to find unique species in csv files where species is the second data field
# This script accepts any number of file names as command line arguments

# Loop over all files
for file in $@
do
    echo "Unique species in $file:"
    # Extract species names
    cut -d , -f 2 $file | sort | uniq
done

Executable script#

Edit the middle.sh file to be as follows:

#!/bin/bash
# Select lines from the middle of a file.
# Usage: bash middle.sh filename end_line num_lines
head -n "$2" "$1" | tail -n "$3"

To make shell file executable, we need to change the permission on the file

chmod 755 middle.sh
./middle.sh