Running the Example Pipeline¶
This page serves to familiarize new users with the basic process of running the ImmuneDB pipeline. Example input FASTQ files are provided which contain human B-cell heavy chain sequences.
Commands are listed as either being run in either the Docker container or on
the host. All immunedb_*
commands have a --help
flag which will show
all arguments and their descriptions. It is recommended for each command you
run the help flag to see options not listed in this documentation.
To begin, run the Docker container as documented:
$ docker run -v $HOME/immunedb_share:/share \
-p 8080:8080 -it arosenfeld/immunedb:v0.29.10
Metadata Specification¶
Before ImmuneDB can be run, metadata must be specified for each input file. For this example, one has already been created for you. To learn how to create a metadata file for your own data, see Creating a Metadata Sheet.
ImmuneDB Instance Creation¶
Next, we create a database for the data with:
$ immunedb_admin create example_db /share/configs
This creates a new database named example_db
and stores its configuration
in /share/configs/example_db.json
.
Identifying or Importing Sequences¶
Data can be added to the new ImmuneDB database either by importing annotated sequencing data in AIRR format, or via a built-in gene assignment method based on Zhang, et al., 2015.
For this example, there are two input FASTQ files in /example/fastq
along
with an associated metadata file. These will be used regardless of the method
you choose. There are also germline files for human and mouse included.
Option 1: Importing from AIRR Files (Recommended)¶
First, IgBLAST needs to be run on the input files. A small wrapper script is provided in the Docker container. It takes three parameters: a species, a locus (in uppercase), an input directory with FASTA/FASTQ files, and an output directory:
$ run_igblast.sh human IGH /example/fastq /example/airr
To import these files, run the immunedb_import
command:
$ immunedb_import /share/configs/example_db.json airr \
/root/germlines/igblast/human/IGHV.gapped.fasta \
/root/germlines/igblast/human/IGHJ.gapped.fasta \
/example/airr
Option 2: Annotating FASTA/FASTQ Files via Anchoring¶
Alternatively, if you’d prefer to use the built-in annotation method on
FASTA/FASTQ files, you can use the immunedb_identify
command. Note this
method is more sensitive to high mutation rates in the regions flanking the
CDR3.
$ immunedb_identify /share/configs/example_db.json \
/root/germlines/anchor/human/IGHV.gapped.fasta \
/root/germlines/anchor/human/IGHJ.gapped.fasta \
/example
Sequence Collapsing¶
After data are imported or annotated on a sample-level basis, ImmuneDB
determines the subject-level unique sequences; that is, the set of unique
sequences across all samples in each subject. Because sequences may contain
the ambiguous N
symbol, the process is not trivial string equality
checking. It is implemented in the immunedb_collapse
command.
To collapse sequences, run:
$ immunedb_collapse /share/configs/example_db.json
Clonal Assignment¶
After collapsing unique sequences across each subject they can be grouped into clones which are aggregations of sequences likely deriving from a common progenitor cell.
ImmuneDB offers two clonal inference methods, similarity and cluster. The cluster method is recommended and documented here as it more robust than similarity.
For both methods, clones are inferred in two steps: grouping sequences and then
merging similar clones. Both steps are run together with the
immunedb_clones
command
By default, only sequences with a subject-level copy number greater
than 1 are included in clones. This can be changed with the --min-copy
parameter.
In the first step of clonal inference, sequences meeting the above copy number
criteria are hierarchically clustered together such that any two sequences in a
clone must (1) have the same CDR3 length and (2) share at least 85% amino-acid
similarity in the CDR3. The similarity can be changed with --min-similarity
X
parameter where X is the minimum similarity between 0 and 1. If nucleotide
similarity should be used, --level nt
can be passed.
Note
For T-cells it is recommended the --min-similarity 1
parameter be set
but the --level
parameter by left at the default amino-acid setting.
Using both --min-similarity 1 --level nt
may lead to the creation of
spurious clones due to sequencing error. Only pass both if you’re quite
certain your sequencing error has been eliminated (e.g. by barcoding).
After this step is complete, sequences have been assigned to clones. In some cases clones may have the same CDR3 nucleotide sequence but different gene calls. This can indeed occur biologically but frequently due to mutation and sequencing error causing incorrect gene calls.
To rectify this, a second step in clonal inference is to collapse merge clones
that have the same CDR3 nucleotide sequences. In cases where this occurs, the
highest copy clone absorbs the lower copy clones. This second step can be
configured in two ways via the --reduce-difference
flag. Setting it to a
negative number (e.g. --reduce-difference -1
) disables the step entirely.
Setting it to a positive number (e.g. --reduce-difference 2
) will alter the
step’s behavior to combine clones differing by at most that number of
nucleotides. The default value is 0, so only clones with exactly the same CDR3
nucleotide sequences will be combined.
$ immunedb_clones /share/configs/example_db.json cluster
Statistics Generation¶
Two sets of statistics can be calculated in ImmuneDB:
- Clone Statistics: For each clone and sample combination, statistics on the clone’s size, mutation level, and top copy sequence
- Sample Statistics: Distribution of sequence and clone features on a per-sample basis, including gene usage, mutation level, copy number, CDR3 length.
These are calculated with the immunedb_clone_stats
and immunedb_sample_stats
commands and must be run in that order.
$ immunedb_clone_stats /share/configs/example_db.json
$ immunedb_sample_stats /share/configs/example_db.json
Selection Pressure (Optional)¶
Warning
Selection pressure calculations are time-consuming, so you can skip this step if time is limited.
Selection pressure of clones can be calculated with Baseline. To do so run:
$ immunedb_clone_pressure /share/configs/example_db.json \
/apps/baseline/Baseline_Main.r
Note, this process is relatively slow and may take some time to complete.
Clone Trees (Optional)¶
Lineage trees for clones is generated with the immunedb_clone_trees
command. The only currently supported method is neighbor-joining as provided
by Clearcut.
There are many parameters that can be changed for tree construction:
--min-seq-copies
(default 0): The minimum number copy number required for a sequence to be included in the tree.--min-seq-samples
(default 0): The minimum number samples in which a sequence must appear for it to be included in the tree.--min-mut-copies
(default 0): The minimum number of copies in which a mutation must occur to be included in the tree.--min-mut-samples
(default 0): The minimum number of samples in which a mutation must occur to be included in the tree.--exclude-stops
(defaultFalse
): Exclude sequences with a stop codon.--full-seq
(defaultFalse
): By default only the V-region of each sequence (the portion 5’ of the CDR3) is included in the tree construction. Setting this flag will use the entire sequence.
Generally we recommend using --min-seq-copies 2
.
$ immunedb_clone_trees /share/configs/example_db.json --min-seq-copies 2
Web Interface¶
ImmuneDB has a web interface to interact with a database instance. The Docker container automatically makes this available at http://localhost:8080/frontend/example_db
When you create more databases, simply replace example_db with the proper database name.
Next Steps¶
Now that the basic workflow has been covered, instructions to run ImmuneDB on your own data can be found at Running the Pipeline on Your Data.