2 Protein structure databases

Learning Objectives

Describe the main databases that store information about protein structure (UniProt, PDB, CATH, AlphaFoldDB, InterPro)
Understand which database is most suitable for different tasks
Find and retrieve data from major structure databases
Interpret the main components of structural files (PDB and PDBx/mmCIF), including atomic coordinates, secondary structure annotations, and metadata
Convert between PDB and mmCIF formats

2.1 Overview

Modern structural biology generates a huge amount of data. Researchers solve thousands of new protein structures each year. Computational tools now add millions more predicted structures.

These data only become useful if we can store, organise, and retrieve them efficiently. For this reason, several public databases collect information about protein sequences, structures, and domains.

2.2 Why databases matter

A single protein structure contains a large amount of information. For example, a typical structural file includes:

the amino acid sequence
the 3D coordinates of every atom
chain organisation
ligands and cofactors
experimental details
references to the scientific literature

Without databases, each research group would store these data in its own format. That would make sharing and analysis very difficult.

Instead, structural biology relies on standardised repositories. These resources allow researchers to:

deposit newly solved structures
search existing structures
retrieve data in standard formats
connect structures to sequence and functional annotations

Several databases now form the core infrastructure of structural bioinformatics.

2.3 Core databases

2.3.1 UniProt

UniProt is the central database for protein sequence information. Each protein entry in UniProt includes:

the amino acid sequence
protein name and function
organism
domain annotations
cross-references to other databases
known variants and mutations

UniProt serves as the hub that links many biological resources together. For example, a UniProt entry may link to:

experimental structures in the Protein Data Bank
predicted models in AlphaFoldDB
domain annotations from InterPro
structural classifications such as CATH

When working with proteins, UniProt often provides the best starting point.

A typical workflow begins by:

Searching for a protein in UniProt.
Inspecting its annotations.
Following links to structural databases.
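
The first step of this workflow can also be scripted. The sketch below assumes that UniProt's REST service at rest.uniprot.org returns an entry in FASTA format when ".fasta" is appended to the accession. The helper names and the loose accession pattern are our own illustrative choices, not part of any UniProt library.

```python
import re
import urllib.request

# Assumed base URL of the UniProt REST service.
UNIPROT_REST = "https://rest.uniprot.org/uniprotkb"

# Deliberately loose check (6 or 10 letters/digits); an assumption for
# illustration, not the official UniProt accession specification.
ACCESSION_RE = re.compile(r"[A-Z0-9]{6}(?:[A-Z0-9]{4})?")


def uniprot_fasta_url(accession: str) -> str:
    """Build the URL that should return one UniProt entry as FASTA."""
    accession = accession.strip().upper()
    if not ACCESSION_RE.fullmatch(accession):
        raise ValueError(f"does not look like a UniProt accession: {accession!r}")
    return f"{UNIPROT_REST}/{accession}.fasta"


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Download a URL and return its body as text (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With network access, fetch_text(uniprot_fasta_url("P11473")) should return the sequence of the human vitamin D receptor, the UniProt entry referred to in the exercise at the end of this chapter.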

2.3.2 The Protein Data Bank (PDB)

The Protein Data Bank (PDB) stores experimentally determined macromolecular structures.

Researchers deposit structures into the PDB after solving them using methods such as X-ray crystallography, NMR spectroscopy or cryo-electron microscopy. Each entry contains:

atomic coordinates
experimental metadata
structural annotations
information about ligands and cofactors

The PDB currently contains hundreds of thousands of structures. These include proteins, nucleic acids, and large molecular complexes.

Each entry receives a unique four-character identifier. Examples include:

1A52 - estrogen receptor ligand-binding domain
1HCQ - estrogen receptor DNA-binding domain bound to DNA

Researchers commonly refer to structures by these codes.
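
Because the codes follow a fixed pattern, they are easy to check and to turn into download links. The following is a minimal sketch that assumes the classic pattern of one digit followed by three letters or digits, and assumes that the RCSB file server at files.rcsb.org serves entries as "<ID>.cif" or "<ID>.pdb". The function names are hypothetical.

```python
import re

# Classic four-character PDB code: a digit, then three letters or digits.
PDB_ID_RE = re.compile(r"[0-9][A-Za-z0-9]{3}")

# Assumed base URL of the RCSB download service.
RCSB_FILES = "https://files.rcsb.org/download"


def normalise_pdb_id(text: str) -> str:
    """Return the upper-case PDB code, or raise if it is malformed."""
    pdb_id = text.strip().upper()
    if not PDB_ID_RE.fullmatch(pdb_id):
        raise ValueError(f"not a four-character PDB code: {text!r}")
    return pdb_id


def rcsb_download_url(pdb_id: str, file_format: str = "cif") -> str:
    """Build a download URL for the coordinate file of one entry."""
    if file_format not in ("cif", "pdb"):
        raise ValueError("file_format must be 'cif' or 'pdb'")
    return f"{RCSB_FILES}/{normalise_pdb_id(pdb_id)}.{file_format}"
```

For example, rcsb_download_url("1hcq") points at the mmCIF file for the estrogen receptor DNA-binding domain entry listed above.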

2.3.3 AlphaFoldDB

Experimental structures remain limited compared with the number of known protein sequences. This gap motivated large-scale prediction efforts.

AlphaFoldDB provides predicted structures generated using AlphaFold. The database contains models for:

entire proteomes
millions of proteins across many organisms

Each AlphaFoldDB entry links directly to a UniProt identifier. These predicted models include useful information such as prediction confidence scores, which will be discussed later.

Predicted structures can often guide experiments, particularly when no experimental structure exists. However, they remain computational models, not experimental observations. Researchers must always interpret them with care.

2.3.4 InterPro

Proteins often contain domains - conserved structural units that perform specific functions. InterPro collects domain annotations from many specialised databases (e.g. Pfam and CATH-Gene3D) and integrates them in one place.

InterPro helps researchers identify:

conserved domains
functional motifs
protein families

For example, a transcription factor might contain:

a DNA-binding domain
a ligand-binding domain
regulatory regions

2.3.5 CATH

CATH classifies protein domains based on their three-dimensional structure. It groups them into hierarchical categories:

Class - overall secondary structure composition
Architecture - general spatial arrangement
Topology - fold connectivity
Homologous superfamily - proteins with shared ancestry

This classification helps researchers study:

structural evolution
recurring protein folds
relationships between distant proteins

In other words, CATH helps answer the question: Which proteins share similar structures, even if their sequences differ?

2.4 Choosing the right database

Each database answers different questions.

A useful rule of thumb:

Task	Best database
Find basic protein information	UniProt
Retrieve experimental structures	PDB
Explore predicted structures	AlphaFoldDB
Identify domains and motifs	InterPro
Study structural classification	CATH

In practice, researchers often move between several databases during analysis.

For example:

Start with UniProt to find the protein sequence.
Check PDB for experimental structures.
Use AlphaFoldDB if no structure exists.
Examine InterPro to identify domains.
Use CATH to look for structural relationships.
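
The second and third steps of this example (check PDB, otherwise fall back to AlphaFoldDB) can be sketched in code. We assume here that a UniProt entry downloaded as JSON lists its cross-references under the key "uniProtKBCrossReferences", each with a "database" and an "id" field, and that the entry pages live at the URLs shown. Treat both as assumptions to verify against the live services; the function names are our own.

```python
from typing import Any

# Assumed URL patterns for the human-readable entry pages.
RCSB_ENTRY = "https://www.rcsb.org/structure"
ALPHAFOLD_ENTRY = "https://alphafold.ebi.ac.uk/entry"


def pdb_cross_references(uniprot_entry: dict[str, Any]) -> list[str]:
    """Collect the PDB codes cross-referenced by a UniProt JSON entry."""
    references = uniprot_entry.get("uniProtKBCrossReferences", [])
    return [
        ref["id"]
        for ref in references
        if ref.get("database") == "PDB" and "id" in ref
    ]


def structure_links(accession: str, uniprot_entry: dict[str, Any]) -> list[str]:
    """Prefer experimental structures; fall back to the predicted model."""
    pdb_ids = pdb_cross_references(uniprot_entry)
    if pdb_ids:
        return [f"{RCSB_ENTRY}/{pdb_id}" for pdb_id in pdb_ids]
    return [f"{ALPHAFOLD_ENTRY}/{accession}"]


# Hand-written stand-in for a downloaded entry (not real UniProt output).
toy_entry = {
    "uniProtKBCrossReferences": [
        {"database": "PDB", "id": "1DB1"},
        {"database": "InterPro", "id": "HYPOTHETICAL-ID"},
    ]
}
links = structure_links("P11473", toy_entry)
```

The design mirrors the rule of thumb above: experimental structures are listed when any exist, and the predicted model is offered only as a fallback.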

2.5 File formats

Once we find a structure, we usually download it as a coordinate file.

Two formats dominate structural biology:

PDB format
PDBx/mmCIF format

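Both store the same core information. They differ mainly in how that information is laid out: the PDB format uses fixed-width text lines, whereas PDBx/mmCIF uses a more flexible table-like structure.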

2.5.1 The PDB format

The original PDB format dates back to the 1970s. It uses fixed-width text lines to describe atoms and metadata.

A typical line looks like this:

ATOM   1523  CA  LEU A 203      12.456  18.233   9.612

This contains several pieces of information:

record type (ATOM)
atom number
atom name
residue name
chain identifier
residue number
x, y, z coordinates

These coordinates define the atom’s position in three-dimensional space.
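
Because the format is fixed-width, each field is read by column position rather than by splitting on spaces. The sketch below parses the example line using the standard column ranges of an ATOM record. The AtomRecord class and parse_atom_line function are illustrative names, and the example line stops after the z coordinate, so later columns (occupancy, B-factor, element) are not read.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    serial: int      # atom number
    name: str        # atom name, e.g. "CA"
    res_name: str    # residue name, e.g. "LEU"
    chain_id: str    # chain identifier
    res_seq: int     # residue number
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> AtomRecord:
    """Parse one ATOM/HETATM line by fixed column positions."""
    record_type = line[0:6].strip()
    if record_type not in ("ATOM", "HETATM"):
        raise ValueError(f"not an atom record: {record_type!r}")
    return AtomRecord(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
    )


EXAMPLE_LINE = "ATOM   1523  CA  LEU A 203      12.456  18.233   9.612"
atom = parse_atom_line(EXAMPLE_LINE)
```

Reading by position matters because adjacent fields may touch without a separating space, for instance when a coordinate fills all eight of its columns.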

Software such as ChimeraX reads these coordinates and reconstructs the molecular structure.

2.5.2 The PDBx/mmCIF format

As structures became larger and more complex, the original PDB format showed its limits. The PDBx/mmCIF format replaced it as the official standard.

mmCIF files store the same information but use a more flexible table-like structure. They can represent:

very large complexes
extensive metadata
detailed experimental information

Most modern PDB entries now appear primarily in mmCIF format. Fortunately, most molecular visualisation tools can read both formats.
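
To make the table-like structure concrete, the sketch below reads one small "loop_" table of the kind mmCIF uses for atoms: a list of column names followed by rows of values. The fragment is hand-written for illustration and re-expresses the ATOM line from section 2.5.1. The parser is a simplification that assumes whitespace-separated values with no quoted or multi-line strings, so real files should be read with a dedicated mmCIF library.

```python
EXAMPLE_MMCIF = """\
data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1523 CA LEU A 203 12.456 18.233 9.612
#
"""


def parse_first_loop(text: str) -> list[dict[str, str]]:
    """Return the rows of the first loop_ table as column-name -> value dicts."""
    lines = iter(text.splitlines())
    for line in lines:                      # skip ahead to the first loop_
        if line.strip() == "loop_":
            break
    tags: list[str] = []
    rows: list[dict[str, str]] = []
    for line in lines:
        line = line.strip()
        if line.startswith("_") and not rows:
            tags.append(line)               # column names come first
        elif (not line or line[0] in "#_" or line == "loop_"
              or line.startswith("data_")):
            break                           # end of this table
        else:
            values = line.split()
            if len(values) != len(tags):
                raise ValueError("row length does not match number of columns")
            rows.append(dict(zip(tags, values)))
    return rows


rows = parse_first_loop(EXAMPLE_MMCIF)
x_coordinate = float(rows[0]["_atom_site.Cartn_x"])
```

Because every value is addressed by its column name rather than by character position, new columns can be added and values can grow wider without breaking existing readers.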

Structures sometimes need to be converted between formats, for example for compatibility with a particular software tool. Many programs support conversion, including ChimeraX, which we will introduce in the next section.
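
For real work, conversion is best left to such established tools. Purely to illustrate what a converter has to do, the sketch below turns one mmCIF-style atom row (a dictionary keyed by column name) into a fixed-width PDB ATOM line. The function name, the input layout and the atom-name padding rule are our simplifying assumptions. Like the example line in section 2.5.1, the output stops after the z coordinate; a complete record also carries occupancy, B-factor and element columns.

```python
def atom_site_to_pdb_line(row: dict[str, str]) -> str:
    """Format one mmCIF-style atom row as a (truncated) PDB ATOM line."""
    serial = int(row["_atom_site.id"])
    name = row["_atom_site.label_atom_id"]
    res_name = row["_atom_site.label_comp_id"]
    chain = row["_atom_site.auth_asym_id"]
    res_seq = int(row["_atom_site.auth_seq_id"])
    x, y, z = (float(row[f"_atom_site.Cartn_{axis}"]) for axis in "xyz")

    # The fixed-width layout cannot hold everything mmCIF can.
    if not 0 < serial <= 99999:
        raise ValueError("atom number does not fit in 5 columns")
    if len(chain) != 1:
        raise ValueError("PDB format allows a single-character chain ID")
    if not -999 <= res_seq <= 9999:
        raise ValueError("residue number does not fit in 4 columns")
    if len(name) > 4 or len(res_name) > 3:
        raise ValueError("atom or residue name too long for PDB format")

    # Simplification: names shorter than 4 characters start in column 14,
    # which is correct for one-letter elements such as C, N, O and S.
    padded_name = name if len(name) == 4 else f" {name:<3}"

    return (
        f"ATOM  {serial:>5} {padded_name} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


EXAMPLE_ROW = {
    "_atom_site.id": "1523",
    "_atom_site.label_atom_id": "CA",
    "_atom_site.label_comp_id": "LEU",
    "_atom_site.auth_asym_id": "A",
    "_atom_site.auth_seq_id": "203",
    "_atom_site.Cartn_x": "12.456",
    "_atom_site.Cartn_y": "18.233",
    "_atom_site.Cartn_z": "9.612",
}
pdb_line = atom_site_to_pdb_line(EXAMPLE_ROW)
```

The range checks show why conversion from mmCIF to PDB can fail for very large complexes: atom numbers above 99,999 or chain identifiers longer than one character simply have no place in the fixed columns.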

In practice, the mmCIF format has become the preferred standard, while PDB files remain common for legacy workflows.

2.5.3 Key components of structural files

Regardless of format, structural files contain several important types of information.

Atomic coordinates: These lines define the position of every atom in the structure. Visualisation software uses these coordinates to build the 3D model.
Chain and residue information: Structures often contain several molecular components, for example protein chains, DNA strands, ligands, and cofactors. Chain identifiers allow software to separate these components.
Secondary structure annotations: Many files include annotations that describe secondary structures such as helices and beta strands. These annotations help visualisation tools display the protein in cartoon form.
Metadata: Structural files also store experimental information. This includes the method used to solve the structure, the resolution, authors and publication details.
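
In a PDB-format file, these four kinds of information sit in different record types, identified by the first six columns of each line: ATOM and HETATM for coordinates, HELIX and SHEET for secondary structure, and header records such as EXPDTA for the experimental method. The following sketch tallies them. The three-line input is a hand-made fragment rather than a real entry, and summarise_pdb_text is a hypothetical helper name.

```python
from collections import Counter

# Hand-made fragment for illustration only (not a complete or real entry).
EXAMPLE_FILE = """\
EXPDTA    X-RAY DIFFRACTION
ATOM   1523  CA  LEU A 203      12.456  18.233   9.612
END
"""


def summarise_pdb_text(text: str) -> dict:
    """Count record types and pull out the experimental method, if present."""
    counts: Counter = Counter()
    method = None
    for line in text.splitlines():
        record = line[0:6].strip()       # record type lives in columns 1-6
        if not record:
            continue
        counts[record] += 1
        if record == "EXPDTA" and method is None:
            method = line[10:].strip()   # technique starts at column 11
    return {
        "method": method,
        "n_atoms": counts["ATOM"] + counts["HETATM"],
        "n_helices": counts["HELIX"],
        "n_sheet_strands": counts["SHEET"],
        "records": dict(counts),
    }


summary = summarise_pdb_text(EXAMPLE_FILE)
```

Run on a full entry, the same tally gives a quick overview of what a file contains before it is opened in a viewer.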

2.5.4 The FASTA format

FASTA is a simple text format for storing protein or nucleotide sequences. Each entry has two parts:

Header line - starts with > and usually contains an identifier and brief description.
Sequence lines - the sequence itself, written in standard one-letter codes.

Example:

>sp|P9WF37|WHIB6_MYCTU Probable transcriptional regulator WhiB6 OS=Mycobacterium tuberculosis
MRYAFAAEATTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRW
LCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA

FASTA is widely used because it is human-readable, compact, and compatible with most bioinformatics tools. UniProt allows you to download protein sequences directly in FASTA format for further analysis.
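
The two-part layout makes FASTA easy to read programmatically. The sketch below is a minimal reader that joins wrapped sequence lines into one string per entry. The function names are our own, and the accession helper assumes the "db|accession|name" header layout seen in the example above.

```python
from typing import Optional

EXAMPLE_FASTA = """\
>sp|P9WF37|WHIB6_MYCTU Probable transcriptional regulator WhiB6 OS=Mycobacterium tuberculosis
MRYAFAAEATTCNAFWRNVDMTVTALYEVPLGVCTQDPDRWTTTPDDEAKTLCRACPRRW
LCARDAVESAGAEGLWAGVVIPESGRARAFALGQLRSLAERNGYPVRDHRVSAQSA
"""


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records: list[tuple[str, str]] = []
    header: Optional[str] = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:           # close the previous entry
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before any header line")
        else:
            chunks.append(line)              # sequences may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def uniprot_accession(header: str) -> Optional[str]:
    """Extract the accession from a 'db|accession|name ...' style header."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else None


records = parse_fasta(EXAMPLE_FASTA)
header, sequence = records[0]
accession = uniprot_accession(header)
```

Joining the wrapped lines is the step most often forgotten: the line breaks inside a FASTA sequence are purely cosmetic and are not part of the sequence itself.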

2.6 Exercises

Exercise 1 - Exploring the PDB

Go to the RCSB PDB portal
Search for “vitamin D receptor”, or directly by the PDB ID: “1DB1”
- Note: switch on the Computed Structure Models (CSM) toggle if you also want the search to include computed models (for example, AlphaFold2 predictions), which are not experimentally determined

Questions:

What expression system and method were used to determine this structure?
Does this protein have any known domains?

Answer

Looking at the PDB entry for 1DB1, we can see:

From the Structure Summary tab: The structure was determined using X-ray crystallography, and the expression system used was Escherichia coli.
From the Annotations tab: We can see the “Vitamin D nuclear receptor” domain is annotated for this protein.

Further details about the protein, such as specific binding sites, are available from UniProt entry P11473, which is linked from the “Structure Summary” tab of the PDB entry.

2.7 Summary

Key Points

Structural biology relies on a network of databases that store and organise protein information.
Some of the main resources include:
- UniProt - protein sequence and functional annotation https://www.uniprot.org/
- PDB - experimentally determined structures https://www.rcsb.org/
- AlphaFoldDB - predicted protein structures https://alphafold.ebi.ac.uk/
- InterPro - domain and motif annotations https://www.ebi.ac.uk/interpro/
- CATH - structural classification of protein folds https://cathdb.github.io/
Two main file formats are used to store protein structures:
- PDB - the original format for storing three-dimensional structures, which is still used by many applications.
- PDBx/mmCIF - a modern format that is now widely adopted and can store larger structures and more complex metadata.
Protein sequences are commonly stored and exchanged in FASTA format.