4  The Unix command line

Learning Objectives
  • Recognise why the Unix command line is essential for bioinformatic analysis.
  • Explain how the location of files and folders is specified from the command line.
  • Memorise and apply key commands to navigate the filesystem and investigate the content of text files.
  • Combine multiple commands to achieve more complex operations.

Learning the Unix command line is critical for bioinformatic analysis because it is widely used in the field, particularly on the Linux operating system. Mastering it enables bioinformaticians to handle data efficiently, automate workflows, and enhance the reliability of their research.

In this section we give a very brief overview of some of the key Unix commands needed to follow these materials. For a more thorough coverage of this topic, see our accompanying materials: Introduction to the Unix Command Line.

4.1 The command prompt

When you open a terminal you are presented with a command prompt, waiting for you to input a command. It will look something like this:

username@computer-name:~$ |

It gives you information about:

  • Your username
  • The name of your computer
  • The location in your filesystem (~ indicates your home directory)
  • A separator, usually the $ symbol
  • The cursor (often blinking), waiting for your command input

4.3 Files and folders

Here are some key commands to create directories and investigate the content of text files:

  • mkdir creates a directory
  • head prints the top lines of a file
  • tail prints the bottom lines of a file
  • less opens the file in a viewer
  • wc counts lines, words and characters in a file
  • grep prints lines that match a specified text pattern

To create a directory called “test” you can run:

mkdir test
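As a minimal sketch of a common variation, the -p option lets mkdir create a nested directory in one step, making any missing parent directories along the way. The directory name "project/results" is a hypothetical example, not part of the course data, and ls is used here only to confirm the result.

```shell
# Create a nested directory; -p also creates the missing parent "project"
# ("project/results" is a hypothetical name used only for this example)
mkdir -p project/results

# List the contents of "project" to confirm the new directory exists
ls project
```

Without -p, mkdir would report an error here because the parent directory "project" does not exist yet.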

To look at the top lines of a file (the first 10 by default) you can use:

head genome.fasta
>NZ_CP028827.1 Vibrio cholerae strain N16961 chromosome 1, complete sequence
GTGTCATCTTCGCTATGGTTGCAATGTTTGCAACGGCTTCAGGAAGAGCTACCTGCCGCAGAATTCAGTATGTGGGTGCG
TCCGCTTCAAGCGGAGCTCAATGACAATACTCTCACTTTATTCGCCCCGAACCGCTTTGTGTTGGATTGGGTACGCGATA
AGTACCTCAATAACATCAATCGTCTGCTGATGGAATTCAGTGGCAATGATGTGCCTAATTTGCGCTTTGAAGTGGGGAGC
CGCCCTGTGGTGGCGCCAAAACCCGCGCCTGTACGTACGGCTGCGGATGTCGCGGCGGAATCGTCGGCGCCTGCGCAATT
GGCGCAGCGTAAACCTATCCATAAAACCTGGGATGATGACAGTGCTGCGGCTGATATTACTCACCGCTCAAATGTGAACC
CGAAACACAAGTTCAACAACTTCGTGGAAGGTAAATCTAACCAGTTAGGTCTGGCCGCGGCTCGCCAAGTCTCTGATAAC
CCAGGTGCGGCGTATAACCCCCTCTTTTTGTATGGCGGCACCGGTTTGGGTAAAACGCACTTGCTGCATGCGGTGGGTAA
CGCGATTGTTGATAACAACCCGAACGCTAAAGTGGTGTACATGCACTCTGAGCGTTTCGTGCAAGACATGGTAAAAGCCC
TGCAGAACAACGCGATTGAAGAATTCAAACGCTACTATCGCAGTGTAGATGCCTTGTTGATCGACGATATTCAATTCTTT

You can print only the first N lines of the file using the -n option. For example, to print the first 2 lines:

head -n 2 genome.fasta
>NZ_CP028827.1 Vibrio cholerae strain N16961 chromosome 1, complete sequence
GTGTCATCTTCGCTATGGTTGCAATGTTTGCAACGGCTTCAGGAAGAGCTACCTGCCGCAGAATTCAGTATGTGGGTGCG

The tail command works similarly, but prints the bottom lines of a file.
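To illustrate, here is a minimal sketch of tail with the same -n option. It builds a small hypothetical file called example.fasta (not part of the course data) so that it can be run anywhere.

```shell
# Create a small hypothetical FASTA file so the example is self-contained
printf '>seq1 example record\nACGTACGT\nTTGACCAA\n>seq2 example record\nGGCCTTAA\n' > example.fasta

# Print only the last 2 lines of the file
tail -n 2 example.fasta
```

For this file, the command prints the second header line and the sequence line that follows it.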

To open the file in a viewer, you can use:

less genome.fasta

You can use the ↑ and ↓ arrow keys on your keyboard to browse the file. When you want to exit, press Q (quit).

To count the lines in a text file you can use:

wc -l genome.fasta
50601 genome.fasta
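Besides lines, wc can also report words and characters. The sketch below shows the relevant options on a small hypothetical file, example.fasta, created only for this illustration.

```shell
# Create a small hypothetical FASTA file so the example is self-contained
printf '>seq1 example record\nACGTACGT\nTTGACCAA\n>seq2 example record\nGGCCTTAA\n' > example.fasta

# Count lines only
wc -l example.fasta

# Count words only (sequences of characters separated by whitespace)
wc -w example.fasta

# Count characters only (-c would count bytes instead, which is the same for plain ASCII text)
wc -m example.fasta

# With no option, wc prints the line, word and byte counts together
wc example.fasta
```

The exact spacing of the output can differ slightly between systems, but the counts are the same.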

To print the lines that match a pattern in a file you can use:

grep ">" genome.fasta
>NZ_CP028827.1 Vibrio cholerae strain N16961 chromosome 1, complete sequence
>NZ_CP028828.1 Vibrio cholerae strain N16961 chromosome 2, complete sequence
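Two details of grep are worth a short sketch. First, the quotes around > matter: without them, the shell would interpret > as output redirection rather than passing it to grep as a pattern. Second, the -v option inverts the match, printing the lines that do not contain the pattern. The file example.fasta is a hypothetical one, created only for this illustration.

```shell
# Create a small hypothetical FASTA file so the example is self-contained
printf '>seq1 example record\nACGTACGT\nTTGACCAA\n>seq2 example record\nGGCCTTAA\n' > example.fasta

# Print the header lines (the pattern is quoted so the shell does not treat > as redirection)
grep ">" example.fasta

# Print the lines that do NOT match the pattern, i.e. the sequence lines
grep -v ">" example.fasta
```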

4.4 Combining commands

You can chain multiple commands together using the pipe operator (|), which sends the output of one command to the next command as its input. For example:

grep ">" genome.fasta | wc -l
2

This command works in two steps:

  • First, grep finds and prints the lines that match “>”.
  • Then, wc -l counts the number of lines in the output of the previous step.

In this case, the wc command took its input from the pipe (the output of grep) rather than from a file.
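The same idea extends to other commands and to longer chains. As a minimal sketch, using a small hypothetical file called example.fasta created only for this illustration:

```shell
# Create a small hypothetical FASTA file so the example is self-contained
printf '>seq1 example record\nACGTACGT\nTTGACCAA\n>seq2 example record\nGGCCTTAA\n' > example.fasta

# Two steps: grep selects the header lines, tail keeps only the last one
grep ">" example.fasta | tail -n 1

# Three steps: select the header lines, keep the first one, count its words
grep ">" example.fasta | head -n 1 | wc -w
```

Each command in the chain reads the output of the command to its left, so only the first command needs to be given the file name.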

4.5 Summary

Key Points
  • The Unix command line is essential for bioinformatic analysis because it is widely used in the field and allows for efficient data manipulation, automation, and reproducibility.
  • The location of files and folders is specified from the command line using either absolute or relative paths.
    • Absolute paths always start with / (the root of the filesystem).
    • Subsequent directory names are separated by /.
  • Key commands to navigate the filesystem include: cd and ls.
  • Key commands to investigate the content of files include: head, tail, less, grep and wc.
  • The wildcard * can be used to match multiple files sharing part of their name.