3 Package Managers
- Describe the role of a package manager and list examples of package managers for different applications.
- Recognise the challenges in managing complex software environments and the role of the Conda/Mamba package manager in solving these.
- Create and use reproducible software environments using Mamba.
- Recognise some of limitations of Mamba as a package manager and how to avoid common pitfalls.
3.1 What is a package manager?
Most operating systems have package managers available, which allow the user to manage (install, remove, upgrade) their software easily. The package manager takes care of automatically downloading and installing the software we want, as well as any dependencies it requires.
There are many package managers available, some are specific to a given type of operating system, or specific to a programming language, while others are more generic. Each of these package managers will use their own repositories, meaning they have access to different sets of software (although there is often some overlap). Some examples include:
apt
is the default Linux package manager for Debian-derived distributions, such as the popular Ubuntu. It comes pre-installed and can be used to install system-level applications.homebrew
is a popular package manager for macOS, although it also works on Linux.conda
/mamba
is a package manager very popular in bioinformatics and data science communities, due to the repositories which give access to software used in these fields. It will be the main focus of this section.
Some programming languages also come with their own package managers. For example:
- The statistical software R has two main library repositories: CRAN and Bioconductor. These are installed from within the R console using the commands
install.packages()
andBiocManager::install()
, respectively. - The programming laguage Python has a package manager called
pip
, which has access to the Python Package Index (PyPI) repository.
In many cases package managers can also install software directly from code repositories such as GitHub, adding further flexibility to how we manage our scientific software.
3.2 Conda/Mamba
A popular package manager in data science, scientific computing and bioinformatics is Mamba, which is a successor to another package manager called Conda.
Conda was originally developed by Anaconda as a way to simplify the creation, distribution, and management of software environments containing different packages and dependencies. It is known for its cross-platform compatibility and relative ease of use (compared to compiling software and having the user manually install all software dependencies). Mamba is a more recent and high-performance alternative to Conda. While it maintains compatibility with Conda’s package and environment management capabilities, Mamba is designed for faster dependency resolution and installation, making it a better choice nowadays. Therefore, the rest of this section focuses on Mamba specifically.
One of the strengths of using Mamba to manage your software is that you can have different versions of your software installed alongside each other, organised in environments. Organising software packages into environments is extremely useful, as it allows to have a reproducible set of software versions that you can use and reuse in your projects.
For example, imagine you are working on two projects with different software requirements:
- Project A: requires Python 3.7, NumPy 1.15, and scikit-learn 0.20.
- Project B: requires Python 3.12, the latest version of NumPy, and TensorFlow 2.0.
If you don’t use environments, you would need to install and maintain these packages globally on your system. This can lead to several issues:
- Version conflicts: different projects may require different versions of the same library. For example, Project A might not be compatible with the latest NumPy, while Project B needs it.
- Dependency chaos: as your projects grow, you might install numerous packages, and they could interfere with each other, causing unexpected errors or instability.
- Difficulty collaborating: sharing your code with colleagues or collaborators becomes complex because they may have different versions of packages installed, leading to compatibility issues.
Mamba allows you to create self-contained software environments for each project, addressing these issues:
- Isolation: you can create a separate environment for each project. This ensures that the dependencies for one project don’t affect another.
- Software versions: you can specify the exact versions of libraries and packages required for each project within its environment. This eliminates version conflicts and ensures reproducibility.
- Ease of collaboration: sharing your code and environment file makes it easy for collaborators to replicate your environment and run your project without worrying about conflicts.
- Simplified maintenance: if you need to update a library for one project, it won’t impact others. You can manage environments separately, making maintenance more straightforward.
Another advantage of using Mamba is that the software is installed locally (by default in your home directory), without the need for admin (sudo
) permissions.
We give instructions to install Mamba on our setup page.
3.2.1 Installing software with Mamba
You can search for available packages from the anaconda.org website. Packages are organised into “channels”, which represent communities that develop and maintain the installation “recipes” for each software. The most popular channels for bioinformatics and data analysis are “bioconda” and “conda-forge”.
There are three main commands to use with Mamba:
mamba create -n ENVIRONMENT-NAME
: this command creates a new software environment, which can be named as you want. Usually people name their environments to either match the name of the main package they are installing there (e.g. an environment calledpangolin
if it’s to install the Pangolin software). Or, if you are installing several packages in the same environment, then you can name it as a topic (e.g. an environment calledrnaseq
if it contains several packages for RNA-seq data analysis).mamba install -n ENVIRONMENT-NAME NAME-OF-PACKAGE
: this command installs the desired package in the specified environment.mamba activate ENVIRONMENT-NAME
: this command “activates” the environment, which means the software installed there becomes available from the terminal.
Let’s see a concrete example. If we wanted to install packages for phylogenetic analysis, we could do:
# create an environment named "phylo"
mamba create -n phylo
# install some software in that environment
mamba install -n phylo iqtree==2.3.3 mafft==7.525
If we run the command:
mamba env list
We will get a list of environments we created, and “phylo” should be listed there. If we want to use the software we installed in that environment, then we can activate it:
mamba activate phylo
And usually this changes your terminal to have the word (phylo)
at the start of your prompt instead of (base)
.
3.2.2 Environment files
Although we can create and manage environments as shown above, it may sometimes be useful to specify an environment in a file. This is particularly useful if you want to document how your environment was created and if you want to recreate it somewhere else.
Environments can be defined using a specification file in YAML format (a simple text format often used for configuration files). For example, our phylogenetics environment above could be specified as follows:
name: phylo
channels:
- conda-forge
- bioconda
dependencies:
- iqtree==2.3.3
- mafft==7.525
We have included this example in the file demo/envs/phylo.yml
. To create the environment from the file, we can use the command:
mamba env create -f envs/phylo.yml
Note that this command is slightly different from the one we saw earlier: mamba env create -f environment.yml
as shown here is to create an environment from a file, whereas mamba create -n name-of-environment
that we saw earlier is used to create a new environment from scratch.
If you later decide to update the environment, either by adding a new software or by updating the software versions, you can run the command:
mamba env update -f envs/phylo.yml
You can practice this in an exercise below.
If you did not create an environment file at the start of your project, you can create one from an existing environment using the command mamba env export > env.yaml
3.3 Disadvantages and pitfalls
Dependency conflicts
One thing to be very careful about is how Conda/Mamba manages the dependency graph of packages to install. If you don’t specify the version of the software you want, in theory Mamba will pick the latest version available on the channel. However, this is conditional on the other packages that are installed alongside it, as some versions may be incompatible with each other, it may downgrade some packages without you realising.
Take this example, where we create a new environment called metagen
for a metagenomics project. We initiate the environment with only two packages: GTDB-tk (taxonomic classification of genomes) and Pandas (for data analysis in Python):
mamba create -n metagen gtdbtk pandas
When you run this command, Mamba will ask if you want to proceed with the installation. Before proceeding, it’s always a good idea to check which versions of the packages we are interested in are being installed.
At the time of writing, the latest version of Pandas on anaconda.org is 2.2.3, however as we run this command we can see that Mamba is installing version 2.0.3, which is a few versions behind.
Let’s be more explicit and specify we want the latest versions available for both packages (at the time of writing):
mamba create -n metagen gtdbtk==2.4.0 pandas==2.2.3
By running this command, we get an error message informing us that Mamba could not find a fully compatible environment for these software versions:
Click here to view the long error message
Could not solve for environment specs
The following packages are incompatible
├─ gtdbtk 2.4.0 is installable and it requires
│ └─ fastani 1.32.* , which requires
│ └─ boost >=1.70.0,<1.70.1.0a0 with the potential options
│ ├─ boost 1.70.0 would require
│ │ ├─ python >=2.7,<2.8.0a0 , which can be installed;
│ │ └─ python_abi * *_cp27mu, which can be installed;
│ ├─ boost 1.70.0 would require
│ │ ├─ python >=3.6,<3.7.0a0 , which can be installed;
│ │ └─ python_abi * *_cp36m, which can be installed;
│ ├─ boost 1.70.0 would require
│ │ ├─ python >=3.7,<3.8.0a0 , which can be installed;
│ │ └─ python_abi * *_cp37m, which can be installed;
│ └─ boost 1.70.0 would require
│ ├─ python >=3.8,<3.9.0a0 , which can be installed;
│ └─ python_abi * *_cp38, which can be installed;
└─ pandas 2.2.3 is uninstallable because there are no viable options
├─ pandas 2.2.3 would require
│ ├─ python >=3.12,<3.13.0a0 , which conflicts with any installable versions previously reported;
│ └─ python_abi 3.12.* *_cp312, which requires
│ └─ python 3.12.* *_cpython, which can be installed;
├─ pandas 2.2.3 would require
│ └─ python_abi 3.10.* *_cp310, which conflicts with any installable versions previously reported;
├─ pandas 2.2.3 would require
│ └─ python_abi 3.11.* *_cp311, which conflicts with any installable versions previously reported;
├─ pandas 2.2.3 would require
│ └─ python >=3.13.0rc2,<3.14.0a0 , which requires
│ └─ _python_rc , which does not exist (perhaps a missing channel);
└─ pandas 2.2.3 would require
└─ python_abi 3.9.* *_cp39, which conflicts with any installable versions previously reported.
The message is a bit hard to interpret, but generally we can see that the issue seems to be related to the Python version that is required for these two packages. We can see that GTDB-tk is only compatible with older versions of Python, whereas the latest version of Pandas requires newer versions.
How could we solve this problem? One possibility is to install each software in a separate environment. The disadvantage is that you will need to run several mamba activate
commands at every step of your analysis.
Another possibility is to find a compatible combination of package versions that is sufficient for your needs. For example, let’s say that GTDB-tk was the most critical software for which we needed to run the latest version. We could find what is the latest version of Pandas compatible with it, by forcing the GTDB-tk version, but not the other one:
mamba create -n metagen gtdbtk==2.4.0 pandas
Running this command, we can see that we would get pandas==2.0.3
. So, Pandas would be a slightly older version than currently available, but for our purposes this might not be a problem. If we were happy with this choice, then we could proceed. For reproducibility, we could save all this information in a YAML file specifying our environment:
name: metagen
channels:
- conda-forge
- bioconda
dependencies:
- pandas==2.0.3
- gtdbtk==2.4.0
Package availability
Some packages not available:
- Cell Ranger is a very popular software for processing single-cell RNA-seq data from the 10x genomics platform. However, the software is not open source and therefore not available through the bioconda channel.
- Software that is not used by a wide-enough community, and thus has no available installation recipe. For example AliView (to visualise multiple sequence alignments) or APAtrap (differential usage of alternative polyadenylation sites from RNA-seq).
Disk space
Environments can take a lot of disk space in your system. This is software-dependent, but in some cases can become quite substantial (several GB of files). Therefore, it’s good practice to:
- Remove unused environments: regularly check for environments that you no longer need and remove them using
mamba env remove --name ENV_NAME
. You can usemamba env list
to list your environments and find unused ones. - Recreate environments from YAML files: related to the previous point, always make sure to keep YAML enviroment files for your environments. This way, you can safely remove less frequently used environments and later recreate them using
mamba env create -f env.yml
. If you did not create an environment file, you can create one from an existing environment withmamba env export > env.yaml
. - Regularly clear cached packages: Mamba caches downloaded packages for faster installation in the future. However, you can clear this cache using
mamba clean --all
. This is particularly useful after upgrades, as you don’t need old versions of the packages stored in your system.
To see if this is a problem in your system, you can occasionally check the size of your Miniforge installation folder with the following command (assuming default installation path):
du --si -s $CONDA_PREFIX
Note the $CONDA_PREFIX
is an environment variable that stores the directory path to your conda/mamba installation.
There might be times when some packages/libraries are not available in a package manager. For example, it can be common to use conda/mamba but find a python library that is only available through pip
. Unfortunately, this may cause issues in your environment as pip may change your conda-installed packages, which might break the conda environment. There are a few steps one can follow to avoid this pitfalls:
- Start from a new and clean environment. If the new environment breaks you can safely remove it and start over. You can create a new environment from pre-existing ones if necessary. We will see more of this later.
- Install
pip
in your conda environment. This is important as the pip you have in your base environment is different from your new environment (will avoid conflicts). - Install any conda packages your need to get the environment ready and leave the pip install for last. Avoid switching between package managers. Start with one and finish with the other one so reversing or fixing conflicts is easier.
You can find a (checklist)[https://www.anaconda.com/blog/using-pip-in-a-conda-environment] in the anaconda webpage for good practice.
3.4 Exercises
3.5 Summary
- A package manager automates the process of installing, upgrading, configuring, and managing software packages, including their dependencies.
- Examples of package managers are
pip
(Python),apt
(Debian/Ubuntu) andconda
/mamba
(generic). - Dependency conflicts, which often arise in complex bioinformatic workflows, can be resolved by managing software in isolated environments.
- Conda/Mamba simplify these tasks by managing dependencies, creating isolated environments, and ensuring reproducible setups across different systems.
- Key Mamba commands include:
mamba create --name ENVIRONMENT-NAME
to create a new environment.mamba install -n ENVIRONMENT-NAME NAME-OF-PACKAGE
to install a package inside that environment.mamba activate ENVIRONMENT-NAME
to make the software from that environment available.mamba env create -f ENVIRONMENT-YAML-SPECIFICATION
to create an environment from a YAML file (recommended for reproducibility).mamba env update -f ENVIRONMENT-YAML-SPECIFICATION
to update an environment from a YAML file (recommended for reproducibility).
- Recognise some of limitations of Mamba as a package manager and how to avoid common pitfalls.
- There are some disadvantages/limitations of Mamba as a package manager:
- Dependencies aren’t always respected.
- Software versions are sometimes downgraded without explicit warning.
- It can be slow at resolving very complex environments.