Igseqanalysis is a Python package for parsing and processing the NCBI-IgBlast alignment results of antibody sequences (NGS). The code can extract CDR regions and/or mutations, and cluster the extracted unique sequences with a user-defined sequence identity cutoff.
- Python 2.7 Installation instruction
- pip Installation site
- Standalone NCBI-IgBlast 1.5 or 1.6 Installation FTP site
- usearch Installation site (Optional: needed if you want to cluster CDR sequence by identity)
Note: Add igblastn
and usearch
to environment PATH. The development had been done on Red Hat Linux 6.
Change the current directory to the igseqanalysis folder and install the package using pip from the local file:
pip install igseqanalysis
After the installation, run parse_igblast -h
and if the tool is successfully installed, it will print the command line instruction for the tool.
The workflow below used the sample data that is included in the package which contains 2000 sequences. The sample_R1.fasta are VH sequences while sample_R2.fasta are VL sequences. Adjust the path accordingly for the data files if you are using your own dataset.
cat sample/sample_R1.fasta | igblast_IG | parse_igblast -q IG -t CDR > sample/sample_R1.csv
cat sample/sample_R2.fasta | igblast_IG | parse_igblast -q IG -t CDR > sample/sample_R2.csv
cat sample/tcr_alpha_sample.fasta | igblast_TCR | parse_igblast -q TCR -t CDR > sample/sample_R1.csv
cat sample/tcr_beta_sample.fasta | igblast_TCR | parse_igblast -q TCR -t CDR > sample/sample_R2.csv
Pair the VH annotation with the VL annotation by fasta ID. Each row in the output file will contain the following columns:
- fasta id
- VH V gene
- VH J gene
- CDR-H1 sequence
- CDR-H2 sequence
- CDR-H3 sequence
- VL V gene
- VL J gene
- CDR-L1 sequence
- CDR-L2 sequence
- CDR-L3 sequence
pair_by_id -l sample/sample_R1.csv -r sample/sample_R2.csv -o sample/sample.paired.tsv
Output file has the same format as the input format. Only the DNA sequences has been translated into protein sequences.
cat sample/sample.paired.tsv | translate_table -p 3,4,5,8,9,10 > sample/sample.paired.prot.tsv
If you want to count the unique CDR3 in DNA sequences, you could provide the DNA sequences as the input sample.paired.tsv
cat sample/sample.paired.prot.tsv | count_unique -p 5,10 > sample/sample.paired.CDR3.count
If you want to cluster the unique CDR3 in DNA sequences, you could provide the DNA sequences as the input sample.paired.tsv
. The -s
option will write size information for each unique fasta, which is required fo usearch clustering.
cat sample/sample.paired.prot.tsv | csv2fasta -p 5 -s > sample/sample.VH.fasta
cat sample/sample.paired.prot.tsv | csv2fasta -p 10 -s > sample/sample.VL.fasta
Clustering could efficiently reduce the effect of PCR and sequencing errors, but at expense of cluster a real unique VH/VL into another VH/VL. -id 0.88
for protein sequences allows 1 amino acid difference when CDR length is between 9 and 16, and 2 amino acid difference when CDR length is between 17 and 24. If clustering DNA sequences, -id 0.96
is similar to -id 0.88
for protein sequences. -sort size
will enable the most abundant sequence is considered as the centroid sequence of the cluster. -fulldp -maxgaps 0 -leftjust -rightjust
disallow any gaps.
usearch -cluster_fast sample/sample.VH.fasta -id 0.88 -sizein -sort size -uc sample/sample.VH.uc -fulldp -maxgaps 0 -leftjust -rightjust
usearch -cluster_fast sample/sample.VL.fasta -id 0.88 -sizein -sort size -uc sample/sample.VL.uc -fulldp -maxgaps 0 -leftjust -rightjust
format_cluster -c sample/sample.VH.uc -f sample/sample.VH.fasta > sample/sample.R1.cluster.count
format_cluster -c sample/sample.VL.uc -f sample/sample.VL.fasta > sample/sample.R2.cluster.count
The output file contains 3 columns:
- Unique CDR3 sequence
- Centroid CDR3 sequences that this sequence belongs to
- Count of the unique CDR3 sequence
Alternative to the usearch clustering in step 5-7, a tree-based clustering was implemented which clustered sequence more aggressively.
cat sample/sample.paired.prot.tsv | count_unique -p 5 | python cluster_by_count.py -c 2 > cluster.cdrh3.count
cat sample/sample.paired.prot.tsv | count_unique -p 10 | python cluster_by_count.py -c 2 > cluster.cdrl3.count