ImmuneDB

ImmuneDB is a database-backed system to analyze and store large amounts (terabytes) of high-throughput B-cell receptor (BCR) and T-cell receptor (TCR) data. Although it can be used as a stand-alone package for comprehensive repertoire profiling, ImmuneDB excels at acting as a central data store and interface between other tools such as IgBLAST, the Immcantation Framework, MiXCR, and VDJtools via AIRR compliant importing and exporting routines.

Feature Highlights

  • Relational storage of repertoire data: Sequences, annotations, clones, lineages, and statistics are all stored in a relational database to promote consistent formatting and easy querying.
  • Consolidated metadata: Custom study, experiment, and replicate metadata is stored alongside your sequencing data in a non-redundant format to avoid inconsistencies and errors over the life of your study.
  • Web interface: ImmuneDB provides a built-in web interface for interactive exploration of data.
  • Interoperability: With AIRR compliant input and output methods, ImmuneDB can interface with other software in the AIRR ecosystem. Other output formats include Change-O and VDJtools.
  • Proven reliability: ImmuneDB is used by multiple labs to manage terabytes of data comprising billions of sequences across dozens of projects.

Quick Start

To get started immediately, please see the Docker installation instructions.

Installing Locally (advanced)

This section details how to set up ImmuneDB locally on a machine. This is a more involved process than using the Docker method but may be useful if you plan on running ImmuneDB on a remote server rather than locally.

Dependency Installation

MySQL

ImmuneDB utilizes MySQL as its underlying data store. We recommend using its drop-in replacement, MariaDB. Please consult their websites and your operating system’s package manager for installation instructions.

R (optional)

Baseline can optionally be used to calculate selection pressure on clones. This requires R to be installed along with the ade4 package. Installation is platform dependent.
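As a sketch, on systems where R is already available, ade4 can typically be installed from a shell as follows (the CRAN mirror URL is an assumption; any mirror works):

$ Rscript -e "install.packages('ade4', repos='https://cloud.r-project.org')"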

The newest version of Baseline can be downloaded here. The path to the main script will be needed for selection pressure calculations as described in Selection Pressure (Optional).

For genotyping, TIgGER must also be installed.

Bowtie2 (optional)

Bowtie2 can be used to locally align sequences which cannot be aligned using the built-in anchor method.

Clearcut (optional)

Clearcut can be used to generate lineage trees for clones. After downloading and compiling per the instructions, note the path to the clearcut executable which will be required for generating trees in Clone Trees (Optional).

ImmuneDB Installation

It is recommended that ImmuneDB be installed within a venv, which creates an environment isolated from the rest of the system.

To create a virtual environment and activate it run:

$ python3 -m venv immunedb
$ source immunedb/bin/activate

Then install ImmuneDB:

$ pip install immunedb
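To verify the installation, check that the command-line tools are now available, e.g.:

$ immunedb_admin --help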

Web Interface Installation

Please refer to the ImmuneDB Frontend installation instructions.

Running the Example Pipeline

This page serves to familiarize new users with the basic process of running the ImmuneDB pipeline. Example input FASTQ files are provided which contain human B-cell heavy chain sequences.

Commands are listed as being run either in the Docker container or on the host. All immunedb_* commands have a --help flag which shows all arguments and their descriptions. It is recommended that you run each command with --help to see options not listed in this documentation.
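For example:

Run in Docker
 $ immunedb_identify --help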

To begin, run the Docker container as documented:

Run on Host
 $ docker run -v $HOME/immunedb_share:/share \
      -p 8080:8080 -it arosenfeld/immunedb:v0.29.8

Metadata Specification

Before ImmuneDB can be run, metadata must be specified for each input file. For this example, one has already been created for you. To learn how to create a metadata file for your own data, see Creating a Metadata Sheet.

ImmuneDB Instance Creation

Next, we create a database for the data with:

Run in Docker
 $ immunedb_admin create example_db /share/configs

This creates a new database named example_db and stores its configuration in /share/configs/example_db.json.
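The configuration file is a small JSON document holding connection details. A sketch of its shape, with placeholder values (exact fields may vary by version; compare the dictionary accepted by config.init_db in the Python API section):

{
    "host": "localhost",
    "database": "example_db",
    "username": "...",
    "password": "..."
}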

Identifying or Importing Sequences

Data can be added to the new ImmuneDB database either by importing annotated sequencing data in AIRR format, or via a built-in gene assignment method based on Zhang, et al., 2015.

For this example, there are two input FASTQ files in /example/fastq along with an associated metadata file. These will be used regardless of the method you choose. There are also germline files for human and mouse included.
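Option 1: Importing AIRR-formatted Data

If your sequences have already been annotated in AIRR format (e.g. by IgBLAST), they can be imported directly with immunedb_import. A sketch following the import pattern shown later in Running the Pipeline on Your Data (the germline paths are assumptions for the example data):

Run in Docker
 $ immunedb_import /share/configs/example_db.json airr \
      /root/germlines/igblast/human/IGHV.gapped.fasta \
      /root/germlines/igblast/human/IGHJ.gapped.fasta \
      /example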

Option 2: Annotating FASTA/FASTQ Files via Anchoring

Alternatively, if you’d prefer to use the built-in annotation method on FASTA/FASTQ files, you can use the immunedb_identify command. Note this method is more sensitive to high mutation rates in the regions flanking the CDR3.

Run in Docker
 $ immunedb_identify /share/configs/example_db.json \
      /root/germlines/anchor/human/IGHV.gapped.fasta \
      /root/germlines/anchor/human/IGHJ.gapped.fasta \
      /example

Sequence Collapsing

After data are imported or annotated on a per-sample basis, ImmuneDB determines the subject-level unique sequences; that is, the set of unique sequences across all samples in each subject. Because sequences may contain the ambiguous N symbol, this is not a trivial string-equality check. It is implemented in the immunedb_collapse command.
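As a simplified illustration of the idea (not ImmuneDB’s actual implementation), a comparison that treats N as matching any base might look like:

def ambiguous_equal(seq_a, seq_b):
    """Return True if two equal-length sequences match, letting N match any base."""
    return len(seq_a) == len(seq_b) and all(
        a == b or 'N' in (a, b) for a, b in zip(seq_a, seq_b)
    )

# 'ACNT' and 'ACGT' collapse together even though the strings differ
assert ambiguous_equal('ACNT', 'ACGT')
assert not ambiguous_equal('ACTT', 'ACGT')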

To collapse sequences, run:

Run in Docker
 $ immunedb_collapse /share/configs/example_db.json

Clonal Assignment

After unique sequences are collapsed across each subject, they can be grouped into clones: aggregations of sequences likely deriving from a common progenitor cell.

ImmuneDB offers two clonal inference methods, similarity and cluster. The cluster method is recommended and documented here, as it is more robust than similarity.

For both methods, clones are inferred in two steps: grouping sequences and then merging similar clones. Both steps are run together with the immunedb_clones command.

By default, only sequences with a subject-level copy number greater than 1 are included in clones. This can be changed with the --min-copy parameter.
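For example, to include every sequence regardless of copy number, the parameter could be passed as follows (illustrative):

Run in Docker
 $ immunedb_clones /share/configs/example_db.json cluster --min-copy 1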

In the first step of clonal inference, sequences meeting the above copy-number criterion are hierarchically clustered such that any two sequences in a clone must (1) have the same CDR3 length and (2) share at least 85% amino-acid similarity in the CDR3 (for example, two 20-residue CDR3s may differ at no more than three positions). The similarity can be changed with the --min-similarity X parameter, where X is the minimum similarity between 0 and 1. If nucleotide similarity should be used instead, --level nt can be passed.

Note

For T-cells it is recommended that the --min-similarity 1 parameter be set but the --level parameter be left at the default amino-acid setting. Using both --min-similarity 1 --level nt may lead to the creation of spurious clones due to sequencing error. Only pass both if you’re quite certain your sequencing error has been eliminated (e.g. by barcoding).

After this step is complete, sequences have been assigned to clones. In some cases clones may have the same CDR3 nucleotide sequence but different gene calls. This can occur biologically, but it is frequently due to mutation and sequencing error causing incorrect gene calls.

To rectify this, the second step of clonal inference merges clones that have the same CDR3 nucleotide sequences. When this occurs, the highest-copy clone absorbs the lower-copy clones. This step can be configured in two ways via the --reduce-difference flag. Setting it to a negative number (e.g. --reduce-difference -1) disables the step entirely. Setting it to a positive number (e.g. --reduce-difference 2) alters the step’s behavior to combine clones differing by at most that number of nucleotides. The default value is 0, so only clones with exactly the same CDR3 nucleotide sequences are combined.

Run in Docker
 $ immunedb_clones /share/configs/example_db.json cluster

Statistics Generation

Two sets of statistics can be calculated in ImmuneDB:

  • Clone Statistics: For each clone and sample combination, statistics on the clone’s size, mutation level, and top-copy sequence.
  • Sample Statistics: Distributions of sequence and clone features on a per-sample basis, including gene usage, mutation level, copy number, and CDR3 length.

These are calculated with the immunedb_clone_stats and immunedb_sample_stats commands and must be run in that order.

Run in Docker
 $ immunedb_clone_stats /share/configs/example_db.json
 $ immunedb_sample_stats /share/configs/example_db.json

Selection Pressure (Optional)

Warning

Selection pressure calculations are time-consuming, so you can skip this step if time is limited.

Selection pressure of clones can be calculated with Baseline. To do so run:

Run in Docker
 $ immunedb_clone_pressure /share/configs/example_db.json \
      /apps/baseline/Baseline_Main.r

Note that this process is relatively slow and may take some time to complete.

Clone Trees (Optional)

Lineage trees for clones are generated with the immunedb_clone_trees command. The only method currently supported is neighbor-joining, as provided by Clearcut.

There are many parameters that can be changed for tree construction:

  • --min-seq-copies (default 0): The minimum copy number required for a sequence to be included in the tree.
  • --min-seq-samples (default 0): The minimum number of samples in which a sequence must appear for it to be included in the tree.
  • --min-mut-copies (default 0): The minimum number of copies in which a mutation must occur to be included in the tree.
  • --min-mut-samples (default 0): The minimum number of samples in which a mutation must occur to be included in the tree.
  • --exclude-stops (default False): Exclude sequences with a stop codon.
  • --full-seq (default False): By default only the V-region of each sequence (the portion 5’ of the CDR3) is included in the tree construction. Setting this flag will use the entire sequence.

Generally we recommend using --min-seq-copies 2.

Run in Docker
 $ immunedb_clone_trees /share/configs/example_db.json --min-seq-copies 2

Web Interface

ImmuneDB has a web interface to interact with a database instance. The Docker container automatically makes this available at http://localhost:8080/frontend/example_db.

When you create more databases, simply replace example_db with the proper database name.

Next Steps

Now that the basic workflow has been covered, instructions to run ImmuneDB on your own data can be found at Running the Pipeline on Your Data.

Running the Pipeline on Your Data

This page describes how to run the ImmuneDB pipeline on your own BCR/TCR data. It is assumed that you’ve previously tried the example pipeline and understand the basics of running commands in the Docker container.

As in the example, each code block has a header indicating whether the command should be run on the host or in the Docker container.

Copying Your Sequence Data Into Docker

Unlike in the example pipeline where sequencing data was provided, you’ll need to copy your own FASTA/FASTQ sequencing data or AIRR-formatted IgBLAST output into the Docker container.

To do so, on the host, we create a new directory in the shared directory into which we’ll copy your sequencing data. Here we’re calling it input, matching the commands below, but you may want to choose a more descriptive name. Replace PATH_TO_SEQUENCES with the path to your sequencing data.

Run on Host
$ mkdir -p $HOME/immunedb_share/input
$ cp PATH_TO_SEQUENCES $HOME/immunedb_share/input

Running IgBLAST (optional)

Note

If your data are already in AIRR-compliant IgBLAST format or you plan on using the built-in anchoring method, you can skip this step.

The following command will run IgBLAST on your files. Valid values for species and locus are:

  • SPECIES: human, mouse
  • LOCUS: IGH, IGL, IGK, TRA, TRB

Run in Docker
$ run_igblast.sh SPECIES LOCUS /share/input /share/input

For consistency with the commands in the rest of this tutorial, we’ll keep the new IgBLAST output files in /share/input and move the original FASTA/FASTQ files to /share/sequences.

Run in Docker
$ mkdir -p /share/sequences
$ mv /share/input/*.fast[aq] /share/sequences

Creating a Metadata Sheet

Next, we’ll use the immunedb_metadata command to create a template metadata file for your sequencing data. In the Docker container run:

Run in Docker
$ cd /share/input
$ immunedb_metadata --use-filenames

Note

This command expects the files to end in .fasta for FASTA, .fastq for FASTQ, or .tsv for AIRR.

This creates a metadata.tsv file in /share/input in Docker which is linked to $HOME/immunedb_share/input on the host.

The --use-filenames flag is optional, and simply populates the sample_name field with the file names stripped of their extension.

Editing the Metadata Sheet

On the host, open the metadata file in Excel or your favorite spreadsheet editor. The headers included in the file are required. You may add additional headers as necessary for your dataset (e.g. tissue, cell_subset, timepoint) so long as they follow these rules (an example sheet follows the list):

  • The headers must all be unique
  • Each header may only contain lowercase letters, numbers, and underscores
  • Each header must begin with a (lowercase) character
  • Each header must not exceed 32 characters in length
  • The values within each column cannot exceed 64 characters in length
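For illustration, a metadata sheet with custom columns following these rules might look like this (sample names and values are hypothetical):

sample_name   tissue   cell_subset   timepoint
sample1       blood    naive         week0
sample2       spleen   mature        week4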

Note

When data is missing or not necessary in a field, leave it blank or set it to NA, N/A, NULL, or None (case-insensitive).

Pipeline Steps

Much of the rest of the pipeline follows from the example pipeline’s instance creation step. To start, create a database. Here we’ll call it my_db but you’ll probably want to give it a more descriptive name:

Run in Docker
$ immunedb_admin create my_db /share/configs

Then we’ll identify or import the sequences. For this process the germline genes must be specified. The germlines are provided as FASTA files in the Docker image at /root/germlines.

Note

You can use your own germline files if you desire so long as they are IMGT gapped.

For this segment we’ll assume human B-cell heavy chains, but the process is the same for any dataset. Depending on whether you want to use IgBLAST output (recommended) or the built-in annotation method, the command will be one of the following:

Option 1: Importing from IgBLAST output (recommended):

Run in Docker
$ immunedb_import /share/configs/my_db.json airr \
     /root/germlines/igblast/human/IGHV.gapped.fasta \
     /root/germlines/igblast/human/IGHJ.gapped.fasta \
     /share/input

Option 2: Using anchoring method:

Run in Docker
$ immunedb_identify /share/configs/my_db.json \
      /root/germlines/anchor/human/IGHV.gapped.fasta \
      /root/germlines/anchor/human/IGHJ.gapped.fasta \
      /share/input

After importing or identifying sequences, continue running the pipeline from here:

Run in Docker
$ immunedb_collapse /share/configs/my_db.json

Then we assign clones. For B-cells we recommend:

Run in Docker
$ immunedb_clones /share/configs/my_db.json cluster

For T-cells we recommend:

Run in Docker
$ immunedb_clones /share/configs/my_db.json cluster --min-similarity 1

If you have a mixed dataset, you can assign clones in different ways, filtering on V-gene type. For example:

Run in Docker
$ immunedb_clones /share/configs/my_db.json cluster --gene IGHV
$ immunedb_clones /share/configs/my_db.json cluster --gene TCRB \
      --min-similarity 1

The last required step is to generate aggregate statistics:

Run in Docker
 $ immunedb_clone_stats /share/configs/my_db.json
 $ immunedb_sample_stats /share/configs/my_db.json

For B-cells, you might want to generate lineages too. The following excludes sequences with only one copy. immunedb_clone_trees has many other parameters for filtering which you can view with the --help flag or at Clone Trees (Optional).

Run in Docker
 $ immunedb_clone_trees /share/configs/my_db.json --min-seq-copies 2

Selection pressure can be run with the following. This process is quite time-consuming, even for small datasets:

Run in Docker
 $ immunedb_clone_pressure /share/configs/my_db.json \
      /apps/baseline/Baseline_Main.r

Finally, the data should be available at http://localhost:8080/frontend/my_db.

Analyzing Your Data

After all the above steps are complete, you should have a fully populated database, ready for analysis via Exporting Data to Files, Querying with SQL, and the Python API.

Modifying the Database

Databases can be modified in various ways using the immunedb_modify command.

Appending New Data

Adding new samples to a database simply requires running the steps in Running the Pipeline on Your Data on just the new FASTA/FASTQ or AIRR files. Effort has been made to reduce the amount of information that needs to be recomputed when samples are added. However, after new samples are added, all affected subjects will be entirely re-collapsed and their clones recalculated.

Changing Metadata

Metadata specified when initially populating ImmuneDB via importing or identification can be updated in two steps. First, export the metadata currently in the database with:

$ immunedb_export PATH_TO_CONFIG samples --for-update

This will generate a samples.tsv file which can be modified. Headers and values can be changed, deleted, or added.

Note

Changing the subject of any sample will require all pipeline steps after and including immunedb_collapse to be re-run.

After modifying the metadata, update the database with:

$ immunedb_modify PATH_TO_CONFIG update-metadata samples.tsv

Combining Samples

Warning

You cannot collapse samples from multiple subjects. If that functionality is desired, first modify the metadata to set the subject for each sample to be the same with update-metadata, and then run combine-samples.

One assumption ImmuneDB makes is that each sample is a biological replicate: no single cell has its BCR/TCR sequence in more than one sample. If you have technical replicates (multiple independent sequencing runs of the same biological replicate), each set should be combined into one ImmuneDB sample. To do so, add a metadata field to the database as described in Changing Metadata where all technical replicates from the same biological replicate share the same value.

For example, suppose we have two biological replicates, each with two technical replicates:

sample             subject
biorep1_techrep1   S1
biorep1_techrep2   S1
biorep2_techrep1   S1
biorep2_techrep2   S1

You would update the metadata to be:

sample             subject   collapse
biorep1_techrep1   S1        first_sample
biorep1_techrep2   S1        first_sample
biorep2_techrep1   S1        second_sample
biorep2_techrep2   S1        second_sample

And then run:

$ immunedb_modify PATH_TO_CONFIG combine-samples collapse

This will result in the four replicates being collapsed into two, using the collapse field as the new name for each:

sample          subject
first_sample    S1
second_sample   S1

Note the collapse header can be given any name you want so long as that name is passed to immunedb_modify. Further, the values in the column can be arbitrary, but they will be used as the new names of the samples after collapsing.

Deleting Samples

The following command can be used to delete samples by ID:

$ immunedb_modify PATH_TO_CONFIG delete-samples [sample_ids]

Note that deleting samples will require the subject to be re-analyzed by running all pipeline steps after and including immunedb_collapse.

Running in the Background

After you have populated your ImmuneDB database(s), you may want to leave the frontend web service running in the background. To do so, you can start ImmuneDB in detached mode with the following:

$ docker run -v $HOME/immunedb_share:/share \
     -p 8080:8080 -e IMMUNEDB_DAEMON=1 -d=true \
     arosenfeld/immunedb:v0.29.8

If you want to stop the container in the future, get its container ID with:

$ docker ps

And then run:

$ docker stop ID

Exporting Data to Files

You can use the immunedb_export command to export your data in a variety of formats.

Exporting Samples

To export sample statistics, run:

$ immunedb_export PATH_TO_CONFIG samples

After completion, a TSV file samples.tsv will be written with the following headers, one line per sample:

Field                    Description
id                       Unique numeric sample identifier
name                     Name given to the sample
subject                  Subject from which the sample originated
input_sequences          Reads input into ImmuneDB
identified               Reads successfully annotated
in_frame                 Reads in-frame
stops                    Reads with stop codons
functional               Functional reads (in-frame and no stop codons)
avg_clone_cdr3_num_nts   Average clonal CDR3 length in nucleotides
avg_clone_v_identity     Average clonal V-region identity
clones                   Total number of clones

Exporting Clones

In its most basic form, the command to export clones is:

$ immunedb_export PATH_TO_CONFIG clones

This will generate one file per sample, each with one line per clone containing the fields below. Note that instances, copies, avg_v_identity, and top_copy_seq describe the clone in the context of that sample; that is, those fields may vary for the same clone in different samples.

Field            Description
clone_id         Database-wide unique clone identifier. This number can be used to track clones across samples.
subject          Subject in which the clone was found
v_gene           V-gene of the clone
j_gene           J-gene of the clone
functional       If the clone is in-frame and contains no stop codon in the consensus (T or F)
insertions       Insertions in the clone (deprecated)
deletions        Deletions in the clone (deprecated)
cdr3_nt          CDR3 nucleotide sequence
cdr3_num_nts     CDR3 nucleotide sequence length
cdr3_aa          CDR3 amino-acid sequence
uniques          Unique sequences in the clone overall
instances        Sequence instances in the clone in the associated sample
copies           Copies in the clone in the associated sample
germline         Clonal germline sequence
parent_id        Parent ID (deprecated)
avg_v_identity   Average V-gene identity to germline
top_copy_seq     Nucleotide sequence of the top-copy sequence

The --pool-on parameter can be used to change how data are aggregated. By default it takes the value sample (as described above) but it also accepts subject or any custom metadata field(s).

For the purposes of illustration, assume we have samples with the associated metadata below.

sample    subject   tissue   subset
sample1   S1        blood    naive
sample2   S1        spleen   naive
sample3   S1        spleen   mature
sample4   S3        blood    naive

Passing --pool-on subject will generate one file per subject with the clone information aggregated across all samples in that subject. Alternatively, passing --pool-on tissue will generate one file per subject/tissue combination. You can pass multiple metadata fields to the --pool-on parameter as well. For example --pool-on tissue subset will generate one file per subject/tissue/subset combination.
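For example, the pooling variants described above correspond to invocations along these lines:

$ immunedb_export PATH_TO_CONFIG clones --pool-on subject
$ immunedb_export PATH_TO_CONFIG clones --pool-on tissue subset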

Two other common parameters are --sample-ids which restricts which samples to include in the export and --format which accepts immunedb (the default) or vdjtools for interoperability with the VDJtools suite.

Exporting Sequences

Sequences can be exported in Change-O and AIRR formats.

The basic command is:

$ immunedb_export PATH_TO_CONFIG sequences

This will generate one file per sample in Change-O format. To use AIRR format, specify --format airr. You can filter out sequences that were not assigned to a clone with the --clones-only flag.

Exporting Selection Pressure

If selection pressure was calculated with the immunedb_clone_pressure command, the results can be exported in TSV format, one row per clone/sample combination. Additionally, unless --filter samples is passed, there will be one additional row per clone with the value All Samples in the sample column, indicating the overall selection pressure on the clone.

For more information on interpreting the values, see Uduman et al., 2011 and Yaari et al., 2012.

Field                  Value
clone_id               Clone ID
subject                Subject to which the clone belongs
sample                 Sample within which the selection pressure was calculated. If All Samples, the overall selection pressure for the clone.
threshold              The threshold at which the selection pressure was calculated
expected_REGION_TYPE   The expected number of TYPE (r or s) mutations in REGION (cdr or fwr)
observed_REGION_TYPE   The observed number of TYPE (r or s) mutations in REGION (cdr or fwr)
sigma_REGION           The selection pressure in REGION
sigma_REGION_cilower   The lower bound of the confidence interval of selection in REGION
sigma_REGION_ciupper   The upper bound of the confidence interval of selection in REGION
sigma_p_REGION         The P-value of the selection in REGION

Exporting MySQL Data

The final method of exporting data is to dump the entire MySQL database to a file. This is meant as a backup method rather than for downstream analysis.

To backup run:

$ immunedb_admin backup PATH_TO_CONFIG BACKUP_PATH

To restore a backup run:

$ immunedb_admin restore PATH_TO_CONFIG BACKUP_PATH

Querying with SQL

ImmuneDB is backed by a MySQL database that can be queried directly to gather information, bypassing the Python API.

Accessing the Database

There are many ways to access the database directly. The two introduced here are connecting directly through MySQL or using immunedb_sql, which simply wraps a call to MySQL.

Directly with MySQL

From the command line, you may access an ImmuneDB database DATABASE from user USERNAME with:

$ mysql -u USERNAME -p DATABASE

This will prompt for a password and then connect to the database. This method of access is useful for quickly querying the database. To save the results of a query QUERY, run:

$ mysql -u USERNAME -p DATABASE -e "QUERY" > output
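For example, to save a list of samples (the samples table and its id and name columns are assumptions based on the data models; verify against your schema):

$ mysql -u USERNAME -p DATABASE -e "SELECT id, name FROM samples" > sample_list.tsv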

Python API

Note

This section is currently incomplete. We’re working to fill out the details of the Python API as soon as possible.

Configuration

The immunedb.common.config module provides methods to initialize a connection to a new or existing database.

Most programs using ImmuneDB will start with code similar to:

import immunedb.common.config as config


parser = config.get_base_arg_parser('Some description of the program')
# ... add any additional arguments to the parser ...
args = parser.parse_args()

session = config.init_db(args.db_config)

When this script is run, it will require at least one argument: the path to a database configuration (as generated with immunedb_admin). Using that, a Session object will be created, connected to the associated database.
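For example, if the snippet above were saved as my_script.py (a hypothetical name), it could be run with:

$ python my_script.py /share/configs/example_db.json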

One can also specify the path to a configuration directly:

import immunedb.common.config as config


session = config.init_db('path/to/config')

Alternatively a dictionary with the same information can be passed:

import immunedb.common.config as config


session = config.init_db({
    'host': '...',
    'database': '...',
    'username': '...',
    'password': '...',
})

In each case, a Session object is returned which can be used to interact with the database.

Using the Session

ImmuneDB is built using SQLAlchemy as a MySQL abstraction layer. Simply put, instead of writing SQL, the database is queried using Python constructs. Full documentation on using the session can be found in SQLAlchemy’s documentation.

Once a session is created, the models listed below can be queried.

Example Queries

Below are some example queries that demonstrate how to use the ImmuneDB API.

Clone CDR3s

Get all clones with a given V-gene and print their CDR3 AA sequences.

Input

import immunedb.common.config as config
from immunedb.common.models import Clone

session = config.init_db(...)

for clone in session.query(Clone).filter(Clone.v_gene == 'IGHV3-30'):
    print('clone {} has AAs {}'.format(clone.id, clone.cdr3_aa))

Output

clone 37884 has AAs CARGYSSSYFDYW
clone 37886 has AAs CARSRTSLSIYGVVPTGDFDSW
clone 37885 has AAs CARNGLNTVSGVVISPKYWLDPW
clone 37887 has AAs CARDLFRGVDFYYYGMDVW

Clone Frequency

Determine how many sequences appear in each sample belonging to clone 1234.

Note the CloneStats model has one entry for each clone/sample combination, plus one entry per clone where the sample_id field is null, representing the clone overall.

Input

import immunedb.common.config as config
from immunedb.common.models import CloneStats

session = config.init_db(...)
for stat in session.query(CloneStats).filter(
        CloneStats.clone_id == 1234).order_by(CloneStats.sample_id):
    print('clone {} has {} unique sequences and {} copies {}'.format(
        stat.clone_id,
        stat.unique_cnt,
        stat.total_cnt,
        ('in sample ' + stat.sample.name) if stat.sample else 'overall'))

Output

clone 1234 has 53 unique sequences and 1331 copies overall
clone 1234 has 27 unique sequences and 379 copies in sample sample1
clone 1234 has 27 unique sequences and 339 copies in sample sample3
clone 1234 has 24 unique sequences and 311 copies in sample sample4
clone 1234 has 28 unique sequences and 302 copies in sample sample10

V-gene Usage

This is a more complex query which gathers the V-gene usage of all sequences that are (a) in the subject with ID 1, (b) associated with a clone, and (c) unique to the subject, printing the genes from least to most frequent.

Input

from sqlalchemy import func

import immunedb.common.config as config
from immunedb.common.models import Sequence, SequenceCollapse

session = config.init_db(...)

subject_unique_seqs = session.query(
    func.count(Sequence.seq_id).label('count'),
    Sequence.v_gene
).join(
    SequenceCollapse
).filter(
    Sequence.subject_id == 1,
    ~Sequence.clone_id.is_(None),
    SequenceCollapse.copy_number_in_subject > 0
).group_by(
    Sequence.v_gene
).order_by(
    'count'
)

for seq in subject_unique_seqs:
    print(seq.v_gene, seq.count)

Output

# ... output truncated ...
IGHV4-34 1128
IGHV1-2 1160
IGHV3-48 1169
IGHV4-39 1310
IGHV3-7 1345
IGHV3-30|3-30-5|3-33 1607
IGHV3-23|3-23D 1626
IGHV3-21 1878

Data Models

Referencing ImmuneDB

If you use ImmuneDB, please cite the tool as:

Rosenfeld, A. M., Meng, W., Luning Prak, E. T., Hershberg, U., ImmuneDB, a Novel Tool for the Analysis, Storage, and Dissemination of Immune Repertoire Sequencing Data. Frontiers in Immunology 9 (2018).

ImmuneDB was originally announced in:

Rosenfeld, A. M., Meng, W., Luning Prak, E. T., Hershberg, U., ImmuneDB: a system for the analysis and exploration of high-throughput adaptive immune receptor sequencing data, Bioinformatics 33 (2016), no. 2, 292–293.