Immunoglobulin Sequence Analysis (igseqanalysis)

Igseqanalysis is a Python package for parsing and processing the NCBI-IgBlast alignment results of antibody sequences (NGS). The code can extract CDR regions and/or mutations, and cluster the extracted unique sequences with a user-defined sequence identity cutoff.

Prerequisite

Python 2.7 Installation instruction
pip Installation site
Standalone NCBI-IgBlast 1.5 or 1.6 Installation FTP site
usearch Installation site (Optional: needed if you want to cluster CDR sequence by identity)

Note: Add igblastn and usearch to environment PATH. The development had been done on Red Hat Linux 6.

Installation

Change the current directory to the igseqanalysis folder and install the package using pip from the local file:

pip install igseqanalysis

After the installation, run parse_igblast -h and if the tool is successfully installed, it will print the command line instruction for the tool.

Workflow

The workflow below used the sample data that is included in the package which contains 2000 sequences. The sample_R1.fasta are VH sequences while sample_R2.fasta are VL sequences. Adjust the path accordingly for the data files if you are using your own dataset.

1. Annotate sequence by IgBlast

Annotate Sequences in Read 1 (Immunoglobulin)

cat sample/sample_R1.fasta | igblast_IG | parse_igblast -q IG -t CDR > sample/sample_R1.csv

Annotate Sequences in Read 2 (Immunoglobulin)

cat sample/sample_R2.fasta | igblast_IG | parse_igblast -q IG -t CDR > sample/sample_R2.csv

Annotate Sequences in Read 1 (TCR)

cat sample/tcr_alpha_sample.fasta | igblast_TCR | parse_igblast -q TCR -t CDR > sample/sample_R1.csv

Annotate Sequences in Read 2 (TCR)

cat sample/tcr_beta_sample.fasta | igblast_TCR | parse_igblast -q TCR -t CDR > sample/sample_R2.csv

2. Pair VH with VL

Pair the VH annotation with the VL annotation by fasta ID. Each row in the output file will contain the following columns:

fasta id
VH V gene
VH J gene
CDR-H1 sequence
CDR-H2 sequence
CDR-H3 sequence
VL V gene
VL J gene
CDR-L1 sequence
CDR-L2 sequence
CDR-L3 sequence

pair_by_id -l sample/sample_R1.csv -r sample/sample_R2.csv -o sample/sample.paired.tsv

3. Translate DNA into protein for CDR sequences

Output file has the same format as the input format. Only the DNA sequences has been translated into protein sequences.

cat sample/sample.paired.tsv | translate_table -p 3,4,5,8,9,10 > sample/sample.paired.prot.tsv

4. Count the unique paired CDR3

If you want to count the unique CDR3 in DNA sequences, you could provide the DNA sequences as the input sample.paired.tsv

cat sample/sample.paired.prot.tsv | count_unique -p 5,10 > sample/sample.paired.CDR3.count

5. Convert CDR3 in CSV format to fasta format for usearch clustering

If you want to cluster the unique CDR3 in DNA sequences, you could provide the DNA sequences as the input sample.paired.tsv. The -s option will write size information for each unique fasta, which is required fo usearch clustering.

cat sample/sample.paired.prot.tsv | csv2fasta -p 5 -s > sample/sample.VH.fasta
cat sample/sample.paired.prot.tsv | csv2fasta -p 10 -s > sample/sample.VL.fasta

6. Clustering CDR3 by usearch

Clustering could efficiently reduce the effect of PCR and sequencing errors, but at expense of cluster a real unique VH/VL into another VH/VL. -id 0.88 for protein sequences allows 1 amino acid difference when CDR length is between 9 and 16, and 2 amino acid difference when CDR length is between 17 and 24. If clustering DNA sequences, -id 0.96 is similar to -id 0.88 for protein sequences. -sort size will enable the most abundant sequence is considered as the centroid sequence of the cluster. -fulldp -maxgaps 0 -leftjust -rightjust disallow any gaps.

usearch -cluster_fast sample/sample.VH.fasta -id 0.88 -sizein -sort size -uc sample/sample.VH.uc -fulldp -maxgaps 0 -leftjust -rightjust
usearch -cluster_fast sample/sample.VL.fasta -id 0.88 -sizein -sort size -uc sample/sample.VL.uc -fulldp -maxgaps 0 -leftjust -rightjust

7. Format the usearch result into tabular format

format_cluster -c sample/sample.VH.uc -f sample/sample.VH.fasta > sample/sample.R1.cluster.count
format_cluster -c sample/sample.VL.uc -f sample/sample.VL.fasta > sample/sample.R2.cluster.count

The output file contains 3 columns:

Unique CDR3 sequence
Centroid CDR3 sequences that this sequence belongs to
Count of the unique CDR3 sequence

8. Clustering CDR3 by count

Alternative to the usearch clustering in step 5-7, a tree-based clustering was implemented which clustered sequence more aggressively.

cat sample/sample.paired.prot.tsv | count_unique -p 5  | python cluster_by_count.py -c 2 > cluster.cdrh3.count
cat sample/sample.paired.prot.tsv | count_unique -p 10 | python cluster_by_count.py -c 2 > cluster.cdrl3.count

Name		Name	Last commit message	Last commit date
Latest commit History 165 Commits
igseqanalysis		igseqanalysis
sample		sample
.gitignore		.gitignore
COPYING		COPYING
MANIFEST.in		MANIFEST.in
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Immunoglobulin Sequence Analysis (igseqanalysis)

Prerequisite

Installation

Workflow

1. Annotate sequence by IgBlast

Annotate Sequences in Read 1 (Immunoglobulin)

Annotate Sequences in Read 2 (Immunoglobulin)

Annotate Sequences in Read 1 (TCR)

Annotate Sequences in Read 2 (TCR)

2. Pair VH with VL

3. Translate DNA into protein for CDR sequences

4. Count the unique paired CDR3

5. Convert CDR3 in CSV format to fasta format for usearch clustering

6. Clustering CDR3 by usearch

7. Format the usearch result into tabular format

8. Clustering CDR3 by count

About

Releases

Packages

Languages

License

Jincheng2009/igseqanalysis

Folders and files

Latest commit

History

Repository files navigation

Immunoglobulin Sequence Analysis (igseqanalysis)

Prerequisite

Installation

Workflow

1. Annotate sequence by IgBlast

Annotate Sequences in Read 1 (Immunoglobulin)

Annotate Sequences in Read 2 (Immunoglobulin)

Annotate Sequences in Read 1 (TCR)

Annotate Sequences in Read 2 (TCR)

2. Pair VH with VL

3. Translate DNA into protein for CDR sequences

4. Count the unique paired CDR3

5. Convert CDR3 in CSV format to fasta format for usearch clustering

6. Clustering CDR3 by usearch

7. Format the usearch result into tabular format

8. Clustering CDR3 by count

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages