Appendix A — Common file formats

This page lists some common file formats used in bioinformatics, in alphabetical order. The heading of each format links to a page with more details about it.

Generally, files can be classified into two categories: text files and binary files.

Text files are very often compressed to save storage space. A common compression format used in bioinformatics is gzip, which has the extension .gz. Many bioinformatics tools can read compressed files directly. For example, FASTQ files (used to store next-generation sequencing data) are often compressed with gzip and given the extension .fq.gz.
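Since a gzip-compressed text file is ordinary text once decompressed, a script can often treat both cases the same way. The following is a minimal Python sketch, assuming that a .gz suffix reliably indicates gzip compression; the helper name open_text is hypothetical and not part of any tool mentioned on this page.

```python
import gzip

def open_text(path):
    """Open a plain or gzip-compressed text file for reading as text."""
    # Assumption: the ".gz" suffix is a reliable sign of gzip compression.
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")  # "rt" decompresses and decodes to str
    return open(path, "r")
```

With this helper, the same loop (for line in open_text(path)) works for both reads.fq and reads.fq.gz.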

BAM (“Binary Alignment Map”)

  • Binary file.
  • Stores the same information as a SAM file, but in a compressed binary form.
  • File extension: .bam

BED (“Browser Extensible Data”)

  • Text file.
  • Stores coordinates of genomic regions.
  • File extension: .bed
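As an illustration of how BED coordinates are usually read, here is a minimal Python sketch. It assumes a tab-delimited file with no header or track lines, and it only looks at the first three columns (chromosome, start, end). BED start positions are 0-based and end positions are exclusive, so the region length is simply end minus start. The function name and the example line are hypothetical.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from the first three BED columns."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])

# Hypothetical BED line with an optional fourth (name) column.
chrom, start, end = parse_bed_line("chr1\t100\t250\tpeak_1\n")
length = end - start  # half-open interval, so no "+ 1" is needed
```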

CSV (“Comma Separated Values”)

  • Text file.
  • Stores tabular data, with the values in each row separated by commas (also see the TSV format).
  • File extension: .csv

These files can be opened with spreadsheet programs (such as Microsoft Excel). They can also be created from a spreadsheet program by going to File > Save As… and selecting “CSV (Comma delimited)” as the file format.
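Delimited tables can also be read in a script. The sketch below uses Python's standard csv module; the table contents and the helper name read_table are made up for illustration. Passing delimiter="\t" instead of the default comma would read a tab-delimited file in the same way.

```python
import csv
import io

# Hypothetical table contents, used here in place of a real file.
text = "sample,reads\nA,1200\nB,950\n"

def read_table(handle, delimiter=","):
    """Return the rows of a delimited table as a list of dictionaries."""
    return list(csv.DictReader(handle, delimiter=delimiter))

rows = read_table(io.StringIO(text))
# Every value is read as a string, so numbers need explicit conversion.
total_reads = sum(int(row["reads"]) for row in rows)
```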

FAST5

  • Binary file. More specifically, this is a Hierarchical Data Format (HDF5) file.
  • Used by Nanopore platforms to store the called sequences (in FASTQ format) as well as the raw electrical signal data from the pore.
  • File extension: .fast5

FASTA

  • Text file.
  • Stores nucleotide or amino acid sequences.
  • File extensions: .fa or .fas or .fasta
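In a FASTA file, each record starts with a header line beginning with ">" and is followed by one or more lines of sequence. A minimal Python sketch of a reader is shown below; the function name read_fasta is hypothetical, and the sketch joins wrapped sequence lines without validating the characters in them.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may be wrapped over many lines
    if header is not None:
        yield header, "".join(chunks)
```

An open file handle can be passed directly, for example dict(read_fasta(open("genome.fa"))), where genome.fa is a placeholder name.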

FASTQ

  • Text file, but often compressed with gzip.
  • Stores sequences and their quality scores.
  • File extensions: .fq or .fastq (compressed as .fq.gz or .fastq.gz)
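Each FASTQ record normally spans four lines: a name line starting with "@", the sequence, a separator line starting with "+", and one quality character per base. The Python sketch below reads such records under two assumptions: that sequences are not wrapped over multiple lines, and that qualities use the Phred+33 encoding (older data may use a different offset). The function name read_fastq is hypothetical.

```python
from itertools import islice

def read_fastq(lines):
    """Yield (name, sequence, qualities) from an iterable of FASTQ lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        name, seq, _plus, qual = record
        # Assumption: Phred+33, i.e. quality = ASCII code minus 33.
        yield name[1:], seq, [ord(c) - 33 for c in qual]
```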

GFF (“General Feature Format”)

  • Text file.
  • Stores gene coordinates and other features.
  • File extension: .gff
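A GFF line has nine tab-separated columns: sequence name, source, feature type, start, end, score, strand, phase and attributes. The following Python sketch assumes the GFF3 flavour, where attributes are written as key=value pairs separated by semicolons; the function name and the example line are hypothetical. Unlike BED, GFF coordinates are 1-based and include the end position.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
    # Assumption: GFF3-style "key=value" attributes separated by ";".
    attributes = dict(
        item.split("=", 1) for item in attrs.split(";") if "=" in item
    )
    return {
        "seqid": seqid,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }

feature = parse_gff3_line("chr1\tsrc\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=abc\n")
length = feature["end"] - feature["start"] + 1  # inclusive coordinates
```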

NEWICK

  • Text file.
  • Stores phylogenetic trees, including node names and edge lengths.
  • File extensions: .tree or .treefile
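To give a sense of what a Newick tree looks like, the Python sketch below defines a small made-up tree and extracts its leaf names with a regular expression. Nested parentheses group descendants, the number after each colon is an edge length, and the tree ends with a semicolon. This is only an illustration and assumes unquoted labels without spaces; a dedicated phylogenetics library is a better choice for real analyses.

```python
import re

# Hypothetical tree: A and B are sister leaves, C joins them at the root.
newick = "((A:0.1,B:0.2):0.3,C:0.4);"

def leaf_names(tree):
    """Return leaf labels, i.e. names that directly follow '(' or ','."""
    # Labels of internal nodes follow ')' and are therefore not matched.
    return re.findall(r"[(,]\s*([^(),:;\s]+)", tree)

leaves = leaf_names(newick)
```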

SAM (“Sequence Alignment Map”)

  • Text file.
  • Stores sequences aligned to a reference genome (also see the BAM format).
  • File extension: .sam
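Each alignment line in a SAM file has at least eleven tab-separated fields, starting with the read name, a bitwise flag, the reference name, the 1-based leftmost position, the mapping quality and the CIGAR string. Header lines start with "@". The Python sketch below extracts a few of these fields and decodes two flag bits; the function name and dictionary keys are hypothetical. For real work, a dedicated tool or library is usually preferable.

```python
def parse_sam_line(line):
    """Parse one SAM alignment line; return None for header lines."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "qname": fields[0],
        "rname": fields[2],
        "pos": int(fields[3]),             # 1-based leftmost position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "is_unmapped": bool(flag & 0x4),   # flag bit 0x4: read is unmapped
        "is_reverse": bool(flag & 0x10),   # flag bit 0x10: reverse strand
    }
```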

TSV (“Tab-Separated Values”)

  • Text file.
  • Stores tabular data, with the values in each row separated by tabs (also see the CSV format).
  • File extensions: .tsv or .txt

These files can be opened with spreadsheet programs (such as Microsoft Excel). They can also be created from a spreadsheet program by going to File > Save As… and selecting “Text (Tab delimited)” as the file format.

VCF (“Variant Call Format”)

  • Text file, but often compressed with gzip.
  • Stores SNP/indel variants.
  • File extension: .vcf (or compressed as .vcf.gz)
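In a VCF file, lines starting with "#" are metadata or the column header, and each remaining line describes one site, with the chromosome, position, ID, reference allele and alternative allele(s) in the first five tab-separated columns. The Python sketch below labels each alternative allele as a SNP or an indel using a simple length-based rule. That rule is a simplification chosen for illustration, and the function names are hypothetical.

```python
def classify_variant(ref, alt):
    """Label a REF/ALT pair as 'SNP', 'indel' or 'other' by allele length."""
    # Symbolic or missing alleles (e.g. "<DEL>", "*", ".") are not classified.
    if not set(alt.upper()) <= set("ACGTN"):
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) != len(alt):
        return "indel"
    return "other"

def read_vcf(lines):
    """Yield (chrom, pos, ref, alt, kind) for each alternative allele."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip "##" metadata lines and the "#CHROM" header
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):  # multiple ALT alleles are comma-separated
            yield chrom, int(pos), ref, alt, classify_variant(ref, alt)
```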