4 The Unix command line
- Recognise why the Unix command line is essential for bioinformatic analysis.
- Explain how the location of files and folders is specified from the command line.
- Memorise and apply key commands to navigate the filesystem and investigate the content of text files.
- Combine multiple commands to achieve more complex operations.
Learning the Unix command line is critical for bioinformatic analysis due to its widespread use in the field, particularly in the context of the Linux operating system. The Unix command line offers several key advantages:
- Ubiquitous in computing: it is used across many computing applications and is essential for working with remote servers, such as those in high-performance computing (HPC) environments.
- Versatile command set: provides a vast array of commands that enable intricate file manipulations, including tasks like locating and replacing text patterns. These capabilities are very useful in the field of bioinformatics.
- Scripting for automation: users can use, create and share script files to store and execute sequences of commands, facilitating automation and ensuring the reproducibility of analyses.
In summary, mastering the Unix command line enables bioinformaticians to efficiently handle data, automate workflows, and enhance the reliability of their research.
In this section we give a very brief overview of some of the key Unix commands needed to follow these materials. For a more thorough coverage of this topic, see our accompanying materials: Introduction to the Unix Command Line.
4.1 The command prompt
When you open a terminal you are presented with a command prompt, waiting for you to input a command. It will look something like this:
username@computer-name:~$ |
It gives you information about:
- Your username
- The name of your computer
- The location in your filesystem (~ indicates your home directory)
- A separator, usually the $ symbol
- The prompt (often blinking) waiting for your command input
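The location shown in the prompt can also be checked directly. The following is a minimal sketch, assuming a standard shell where ~ expands to your home directory; the variable name home_dir is only for illustration.

```shell
# ~ is shorthand for your home directory; the shell expands it
# before the command runs
home_dir=$(echo ~)
echo "$home_dir"

# pwd prints the directory you are currently in
# (the location part of the prompt)
pwd
```

The exact paths printed depend on your username and on where the terminal was opened.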
4.3 Files and folders
Here are some key commands to create directories and investigate the content of text files:
- mkdir creates a directory
- head prints the top lines of a file
- tail prints the bottom lines of a file
- less opens the file in a viewer
- wc counts lines, words and characters in a file
- grep prints lines that match a specified text pattern
To create a directory called “test” you can run:
mkdir test
To look at the top lines of a file you can use:
head genome.fasta
>NZ_CP028827.1 Vibrio cholerae strain N16961 chromosome 1, complete sequence
GTGTCATCTTCGCTATGGTTGCAATGTTTGCAACGGCTTCAGGAAGAGCTACCTGCCGCAGAATTCAGTATGTGGGTGCG
TCCGCTTCAAGCGGAGCTCAATGACAATACTCTCACTTTATTCGCCCCGAACCGCTTTGTGTTGGATTGGGTACGCGATA
AGTACCTCAATAACATCAATCGTCTGCTGATGGAATTCAGTGGCAATGATGTGCCTAATTTGCGCTTTGAAGTGGGGAGC
CGCCCTGTGGTGGCGCCAAAACCCGCGCCTGTACGTACGGCTGCGGATGTCGCGGCGGAATCGTCGGCGCCTGCGCAATT
GGCGCAGCGTAAACCTATCCATAAAACCTGGGATGATGACAGTGCTGCGGCTGATATTACTCACCGCTCAAATGTGAACC
CGAAACACAAGTTCAACAACTTCGTGGAAGGTAAATCTAACCAGTTAGGTCTGGCCGCGGCTCGCCAAGTCTCTGATAAC
CCAGGTGCGGCGTATAACCCCCTCTTTTTGTATGGCGGCACCGGTTTGGGTAAAACGCACTTGCTGCATGCGGTGGGTAA
CGCGATTGTTGATAACAACCCGAACGCTAAAGTGGTGTACATGCACTCTGAGCGTTTCGTGCAAGACATGGTAAAAGCCC
TGCAGAACAACGCGATTGAAGAATTCAAACGCTACTATCGCAGTGTAGATGCCTTGTTGATCGACGATATTCAATTCTTT
You can print only the first N lines of the file using the -n option (here, N is 2):
head -n 2 genome.fasta
>NZ_CP028827.1 Vibrio cholerae strain N16961 chromosome 1, complete sequence
GTGTCATCTTCGCTATGGTTGCAATGTTTGCAACGGCTTCAGGAAGAGCTACCTGCCGCAGAATTCAGTATGTGGGTGCG
The tail command works similarly, but prints the bottom lines of a file.
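As a minimal sketch of tail in use, the example below builds a small, made-up FASTA file so that it can be run anywhere. The file name example.fasta and its toy sequences are assumptions for illustration, not part of the course data.

```shell
# Build a small, made-up FASTA file so the example is self-contained
# (example.fasta and its contents are hypothetical)
workdir=$(mktemp -d)
printf '>seq1 toy sequence\nACGT\nGGCC\n>seq2 toy sequence\nTTAA\n' > "$workdir/example.fasta"

# Print only the last 2 lines of the file
last_lines=$(tail -n 2 "$workdir/example.fasta")
echo "$last_lines"
```

With this toy file, the two lines printed are the header of the second sequence and its single sequence line.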
To open the file in a viewer, you can use:
less genome.fasta
You can use ↑ and ↓ arrows on your keyboard to browse the file. When you want to exit you can press Q (quit).
To count the lines in a text file you can use:
wc -l genome.fasta
50601 genome.fasta
To print the lines that match a pattern in a file you can use:
grep ">" genome.fasta
>NZ_CP028827.1 Vibrio cholerae strain N16961 chromosome 1, complete sequence
>NZ_CP028828.1 Vibrio cholerae strain N16961 chromosome 2, complete sequence
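As a small variation, grep's standard -v option inverts the match, printing the lines that do not contain the pattern. The sketch below uses a made-up file (example.fasta, with toy sequences assumed for illustration) to print only the sequence lines.

```shell
# Build a small, made-up FASTA file so the example is self-contained
# (example.fasta and its contents are hypothetical)
workdir=$(mktemp -d)
printf '>seq1 toy sequence\nACGT\nGGCC\n>seq2 toy sequence\nTTAA\n' > "$workdir/example.fasta"

# -v inverts the match: print the lines that do NOT contain ">"
sequence_lines=$(grep -v ">" "$workdir/example.fasta")
echo "$sequence_lines"
```

Quoting the pattern matters here: an unquoted > would be interpreted by the shell as output redirection rather than passed to grep.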
4.4 Combining commands
You can chain multiple commands together using the pipe operator (|), which passes the output of one command to the next command as its input. For example:
grep ">" genome.fasta | wc -l
2
- First find and print the lines that match “>”
- And then count the number of lines from the output of the previous step
In this case, the wc command took its input from the pipe.
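Pipes work with any of the commands introduced earlier. As a hedged sketch, head and tail can be combined to print one specific line of a file; the file example.fasta and its toy sequences are made up for illustration.

```shell
# Build a small, made-up FASTA file so the example is self-contained
# (example.fasta and its contents are hypothetical)
workdir=$(mktemp -d)
printf '>seq1 toy sequence\nACGT\nGGCC\n>seq2 toy sequence\nTTAA\n' > "$workdir/example.fasta"

# head keeps the first 3 lines; tail then keeps the last of those,
# so the pipeline prints the 3rd line of the file
third_line=$(head -n 3 "$workdir/example.fasta" | tail -n 1)
echo "$third_line"
```

Here tail never reads the file itself: it only sees the three lines that head passed through the pipe.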
4.5 Summary
- The Unix command line is essential for bioinformatic analysis because it is widely used in the field and allows for efficient data manipulation, automation, and reproducibility.
- The location of files and folders is specified from the command line using either absolute or relative paths.
  - Absolute paths always start with / (the root of the filesystem).
  - Subsequent directory names are separated by /.
- Key commands to navigate the filesystem include: cd and ls.
- Key commands to investigate the content of files include: head, tail, less, grep and wc.
- The wildcard * can be used to match multiple files sharing part of their name.