Jincheng2009/igseqanalysis

Immunoglobulin Sequence Analysis (igseqanalysis)

Igseqanalysis is a Python package for parsing and processing the NCBI-IgBlast alignment results of antibody sequences (NGS). The code can extract CDR regions and/or mutations, and cluster the extracted unique sequences with a user-defined sequence identity cutoff.

Prerequisite

Note: Add igblastn and usearch to your PATH environment variable. The package was developed on Red Hat Linux 6.

Installation

Change the current directory to the igseqanalysis folder and install the package using pip from the local file:

pip install igseqanalysis

After installation, run parse_igblast -h. If the tool was installed successfully, it prints its command-line usage.

Workflow

The workflow below uses the sample data included in the package, which contains 2000 sequences. sample_R1.fasta contains the VH sequences and sample_R2.fasta contains the VL sequences. Adjust the paths to the data files if you are using your own dataset.

1. Annotate sequences with IgBlast

Annotate Sequences in Read 1 (Immunoglobulin)

cat sample/sample_R1.fasta | igblast_IG | parse_igblast -q IG -t CDR > sample/sample_R1.csv

Annotate Sequences in Read 2 (Immunoglobulin)

cat sample/sample_R2.fasta | igblast_IG | parse_igblast -q IG -t CDR > sample/sample_R2.csv

Annotate Sequences in Read 1 (TCR)

cat sample/tcr_alpha_sample.fasta | igblast_TCR | parse_igblast -q TCR -t CDR > sample/sample_R1.csv

Annotate Sequences in Read 2 (TCR)

cat sample/tcr_beta_sample.fasta | igblast_TCR | parse_igblast -q TCR -t CDR > sample/sample_R2.csv

2. Pair VH with VL

Pair the VH annotation with the VL annotation by fasta ID. Each row in the output file will contain the following columns:

  1. fasta id
  2. VH V gene
  3. VH J gene
  4. CDR-H1 sequence
  5. CDR-H2 sequence
  6. CDR-H3 sequence
  7. VL V gene
  8. VL J gene
  9. CDR-L1 sequence
  10. CDR-L2 sequence
  11. CDR-L3 sequence

pair_by_id -l sample/sample_R1.csv -r sample/sample_R2.csv -o sample/sample.paired.tsv
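
The pairing step can be pictured as a join on the fasta ID. The Python sketch below is only an illustration of that idea, not the package's implementation: it assumes each annotation row is laid out as id, V gene, J gene, CDR1, CDR2, CDR3, and it assumes that reads present in only one file are dropped. The function name and the sample rows are hypothetical.

```python
def pair_rows_by_id(left_rows, right_rows):
    """Join two annotation tables on their first column (the fasta ID).

    Assumption: rows whose ID appears in only one table are dropped.
    """
    right_by_id = {row[0]: row[1:] for row in right_rows}
    paired = []
    for row in left_rows:
        fasta_id = row[0]
        if fasta_id in right_by_id:
            # Left (VH) columns first, then the right (VL) columns without the ID.
            paired.append(row + right_by_id[fasta_id])
    return paired


# Hypothetical rows: id, V gene, J gene, CDR1, CDR2, CDR3
vh_rows = [
    ["read1", "IGHV1-2", "IGHJ4", "GGATAC", "ATCAAC", "TGTGCGAGA"],
    ["read2", "IGHV3-23", "IGHJ6", "GGATTC", "ATTAGT", "TGTGCGAAA"],
]
vl_rows = [
    ["read1", "IGKV1-5", "IGKJ1", "CAGAGT", "GATGCC", "TGTCAGCAG"],
]

for paired_row in pair_rows_by_id(vh_rows, vl_rows):
    print("\t".join(paired_row))
```

Each paired row then has the 11 columns listed above, with the fasta ID appearing once.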

3. Translate DNA into protein for CDR sequences

The output file has the same format as the input file; only the DNA sequences are translated into protein sequences.

cat sample/sample.paired.tsv | translate_table -p 3,4,5,8,9,10 > sample/sample.paired.prot.tsv
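
As a rough illustration of what translating selected columns involves, here is a minimal Python sketch using the standard genetic code. It is not the translate_table source. The column numbers 3,4,5,8,9,10 line up with the six CDR columns of the paired file if columns are counted from 0, so the sketch assumes 0-based indices; the handling of incomplete or ambiguous codons (dropped or written as X) is also an assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in frame 1; a trailing partial codon is dropped."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    # Codons containing non-ACGT characters are written as X (an assumption).
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def translate_columns(row, columns):
    """Translate only the listed 0-based columns, leaving the rest untouched."""
    return [translate(v) if i in columns else v for i, v in enumerate(row)]


row = ["read1", "IGHV1-2", "IGHJ4", "GGATAC", "ATCAAC", "TGTGCGAGA"]
print(translate_columns(row, {3, 4, 5}))
```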

4. Count the unique paired CDR3

To count the unique CDR3s as DNA sequences instead, use the untranslated file sample.paired.tsv as the input.

cat sample/sample.paired.prot.tsv | count_unique -p 5,10 > sample/sample.paired.CDR3.count
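
Counting unique paired CDR3s amounts to tallying the distinct combinations of the selected columns. The sketch below shows one way to do that in Python; the function name, the shortened rows, and the column positions are hypothetical, and the real count_unique output layout may differ.

```python
from collections import Counter


def count_unique_combinations(rows, columns):
    """Count how often each combination of the selected columns occurs."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


# Hypothetical, shortened rows: id, CDR-H3, CDR-L3
rows = [
    ["read1", "CAR", "CQQ"],
    ["read2", "CAR", "CQQ"],
    ["read3", "CAK", "CQQ"],
]

counts = count_unique_combinations(rows, columns=(1, 2))
for combination, count in counts.most_common():
    print("\t".join(combination), count, sep="\t")
```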

5. Convert CDR3 in CSV format to fasta format for usearch clustering

To cluster the unique CDR3s as DNA sequences instead, use the untranslated file sample.paired.tsv as the input. The -s option writes size information for each unique sequence in the fasta output, which is required for usearch clustering.

cat sample/sample.paired.prot.tsv | csv2fasta -p 5 -s > sample/sample.VH.fasta
cat sample/sample.paired.prot.tsv | csv2fasta -p 10 -s > sample/sample.VL.fasta
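
For readers unfamiliar with size annotations: usearch reads abundance from a size=N; field in the fasta label when -sizein is given. The Python sketch below shows what such a file could look like. The label prefix and numbering are hypothetical and are not necessarily what csv2fasta writes.

```python
from collections import Counter


def to_sized_fasta(sequences, prefix="seq"):
    """Collapse identical sequences and write one fasta record per unique
    sequence, with its abundance as a size=N; annotation in the label."""
    counts = Counter(sequences)
    lines = []
    # most_common() lists the most abundant sequences first.
    for index, (sequence, size) in enumerate(counts.most_common(), start=1):
        lines.append(f">{prefix}{index};size={size};")
        lines.append(sequence)
    return "\n".join(lines)


print(to_sized_fasta(["CAR", "CAK", "CAR"]))
```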

6. Clustering CDR3 by usearch

Clustering can effectively reduce the impact of PCR and sequencing errors, but at the expense of occasionally merging a genuinely unique VH/VL into the cluster of another VH/VL. For protein sequences, -id 0.88 allows 1 amino acid difference when the CDR length is between 9 and 16, and 2 amino acid differences when the CDR length is between 17 and 24. When clustering DNA sequences, -id 0.96 is roughly equivalent to -id 0.88 for protein sequences. -sort size makes the most abundant sequence the centroid of each cluster. -fulldp -maxgaps 0 -leftjust -rightjust together disallow any gaps.

usearch -cluster_fast sample/sample.VH.fasta -id 0.88 -sizein -sort size -uc sample/sample.VH.uc -fulldp -maxgaps 0 -leftjust -rightjust
usearch -cluster_fast sample/sample.VL.fasta -id 0.88 -sizein -sort size -uc sample/sample.VL.uc -fulldp -maxgaps 0 -leftjust -rightjust
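
The relationship between the identity cutoff and the number of tolerated differences can be checked with a few lines of Python. This sketch assumes a gap-free alignment over the full sequence length, with identity computed as matches divided by length and the cutoff treated as inclusive; usearch's exact identity definition may differ in edge cases. Integer percentages are used to avoid floating-point rounding.

```python
def max_mismatches(length, identity_percent):
    """Largest number of mismatches m that still satisfies
    (length - m) / length >= identity_percent / 100."""
    return (length * (100 - identity_percent)) // 100


print("protein, -id 0.88")
for cdr_length in (8, 9, 16, 17, 24, 25):
    print(cdr_length, max_mismatches(cdr_length, 88))

print("DNA, -id 0.96")
for cdr_length in (9, 16, 17, 24):
    print(3 * cdr_length, max_mismatches(3 * cdr_length, 96))
```

Under these assumptions, lengths 9 to 16 tolerate one difference and lengths 17 to 24 tolerate two, and the DNA cutoff of 0.96 gives the same numbers for the corresponding nucleotide lengths.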

7. Format the usearch result into tabular format

format_cluster -c sample/sample.VH.uc -f sample/sample.VH.fasta > sample/sample.R1.cluster.count
format_cluster -c sample/sample.VL.uc -f sample/sample.VL.fasta > sample/sample.R2.cluster.count

The output file contains 3 columns:

  1. Unique CDR3 sequence
  2. Centroid CDR3 sequence of the cluster that this sequence belongs to
  3. Count of the unique CDR3 sequence
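
To show where the centroid assignment comes from, the Python sketch below reads usearch's tab-separated .uc output, in which field 1 is the record type (S for a new centroid, H for a hit, C for a cluster summary), field 9 is the query label and field 10 is the target label. It only maps labels to centroid labels; it is not the format_cluster implementation, and the sample lines are made up for illustration.

```python
def centroids_from_uc(lines):
    """Map each sequence label to the label of its cluster centroid."""
    assignment = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        record_type, query, target = fields[0], fields[8], fields[9]
        if record_type == "S":
            # Seed record: the sequence starts a cluster and is its own centroid.
            assignment[query] = query
        elif record_type == "H":
            # Hit record: the sequence was assigned to an existing centroid.
            assignment[query] = target
        # "C" records summarize clusters and are skipped here.
    return assignment


# Hypothetical .uc lines
uc_lines = [
    "S\t0\t12\t*\t*\t*\t*\t*\tseq1;size=5;\t*",
    "H\t0\t12\t91.7\t.\t0\t0\t12M\tseq2;size=1;\tseq1;size=5;",
    "C\t0\t2\t*\t*\t*\t*\t*\tseq1;size=5;\t*",
]

for label, centroid in centroids_from_uc(uc_lines).items():
    print(label, centroid, sep="\t")
```

Looking up each label in the input fasta file would then give the sequence and centroid columns described above.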

8. Clustering CDR3 by count

As an alternative to the usearch clustering in steps 5-7, a tree-based clustering is also implemented, which clusters sequences more aggressively.

cat sample/sample.paired.prot.tsv | count_unique -p 5  | python cluster_by_count.py -c 2 > cluster.cdrh3.count
cat sample/sample.paired.prot.tsv | count_unique -p 10 | python cluster_by_count.py -c 2 > cluster.cdrl3.count
