Tutorial: Demultiplexing SMRT Link Iso Seq Jobs
Last Updated: 01/04/2021
DISCLAIMER: ALL CODE AND TUTORIAL IN THIS REPOSITORY IS DEVELOPER WORK AND NOT PACBIO OFFICIAL SOFTWARE. UPDATES AND CHANGES CAN HAPPEN ANYTIME. USE AT YOUR OWN RISK.
This tutorial is for demultiplexing IsoSeq jobs in SMRT Link. The scripts offered below are standalone and require no installation; however, you will need Python and the BioPython library installed.
- Identifying the Unix path for your SMRT Link job
- Demultiplexing IsoSeq jobs without a reference genome
- Demultiplexing IsoSeq jobs with a reference genome
- Making up your own classify report to manually de-multiplex
- Creating demultiplexed GFF and FASTA/FASTQ files after demultiplexing
- Python (3.7.x)
- BioPython
No installation is required. You can directly copy the scripts to your local folder:
$ mkdir ~/github
$ cd ~/github/
$ git clone https://github.com/Magdoll/cDNA_Cupcake.git
$ mkdir ~/test_dir/
$ cd ~/test_dir
$ cp ~/github/cDNA_Cupcake/post_isoseq_cluster/demux*py .
Or add them to your PATH variable:
$ export PATH=$PATH:~/github/cDNA_Cupcake/post_isoseq_cluster/
You can find the Unix path for your SMRT Link IsoSeq job by going to the "Analysis Overview" page and copying the "Path" value. In this example, the path is <path>/userdata/jobs_root/0000/0000114/0000114372.
If you ran IsoSeq without a reference genome, the demultiplexing script will output the associated full-length read counts for each HQ isoform.
$ ln -s <path>/userdata/jobs_root/0000/0000114/0000114372 job114372
$ python demux_isoseq_no_genome.py -j job114372 -o job114372.hq_isoform_fl_count.txt
The output job114372.hq_isoform_fl_count.txt will show the associated FL count for each HQ isoform:
id,NEB_5p--barcode1_3p,NEB_5p--barcode2_3p
transcript/1,0,3
transcript/4,10,3
transcript/11,3,5
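The count file is plain CSV, so you can sanity-check it with a few lines of Python. Below is a sketch (the helper name fl_totals is ours, not part of the demux scripts) that sums the FL read counts per barcode column:

```python
import csv
from collections import Counter

def fl_totals(count_file):
    """Sum full-length read counts per barcode column in a demux count file."""
    totals = Counter()
    with open(count_file) as f:
        for row in csv.DictReader(f):
            for col, val in row.items():
                if col != 'id':  # every non-id column is a barcode count
                    totals[col] += int(val)
    return totals
```

For the example output above, fl_totals would report 13 total FL reads for NEB_5p--barcode1_3p and 11 for NEB_5p--barcode2_3p.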
Instead of providing a job directory, you can also directly provide the required input files to demultiplex.
$ python demux_isoseq_no_genome.py --hq_fafq hq_transcripts.fastq \
--cluster_csv cluster_report.csv \
--classify_csv flnc_report.csv \
-o job114372.hq_isoform_fl_count.txt
If you ran IsoSeq with a reference genome, the demultiplexing script will output the associated full-length read counts for each mapped, unique isoform. Each isoform will have the ID format PB.X.Y.
$ ln -s <path>/userdata/jobs_root/0000/0000114/0000114372 job114372
$ python demux_isoseq_with_genome.py -j job114372 -o job114372.mapped_fl_count.txt
The output will look like:
id,NEB_5p--barcode1_3p,NEB_5p--barcode2_3p
PB.1.2,0,35
PB.2.4,3,1
PB.2.2,0,5
PB.2.3,0,6
PB.2.5,0,3
PB.3.1,0,2
Instead of providing a job directory, you can also directly provide the required input files to demultiplex.
$ python demux_isoseq_with_genome.py --mapped_fafq mapped.fastq \
--read_stat mapped.read_stat.txt \
--classify_csv flnc_report.csv \
-o job114372.mapped_fl_count.txt
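Since the with-genome count file uses the same CSV layout, one common follow-up is listing isoforms whose FL reads all come from a single barcode. The sketch below (the function name is hypothetical, not part of the demux scripts) does exactly that:

```python
import csv

def barcode_specific_isoforms(count_file):
    """Return {isoform_id: barcode} for isoforms whose FL reads
    all come from a single barcode."""
    specific = {}
    with open(count_file) as f:
        for row in csv.DictReader(f):
            counts = {col: int(v) for col, v in row.items() if col != 'id'}
            present = [bc for bc, n in counts.items() if n > 0]
            if len(present) == 1:  # detected in exactly one barcode
                specific[row['id']] = present[0]
    return specific
```

In the example output above, every isoform except PB.2.4 would be reported as specific to NEB_5p--barcode2_3p.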
What happens if you did not actually multiplex the samples, but merely combined different runs into a single Iso-Seq analysis, and now need to manually demultiplex?
You can make a fake classify report flnc_report.csv that is a comma-delimited CSV file with just two fields:
id,primer
m54278_181025_184153/4194470/ccs,sample1
m54278_181025_184153/4194611/ccs,sample1
m54278_181025_184153/4259972/ccs,sample1
m54278_181110_094539/74908648/ccs,sample2
m54278_181110_094539/74908650/ccs,sample2
m54278_181110_094539/74908653/ccs,sample2
Each id is a FLNC read, and the primer field tells you which sample it comes from. You can make this fake report by, say, using the movie name (ex: m54278_181025_184153), which is uniquely associated with each run, to identify the run-specific samples.
Below is an example of a Python code snippet that can be modified to make a fake FLNC report:
# map each movie name to the sample it came from
d = {'m54278_181109_232120': 'sample1',
     'm54278_181103_072310': 'sample2',
     'm54278_181110_094539': 'sample3'}

h = open('fake.flnc_report.csv', 'w')
h.write("id,primer\n")
f = open('mapped.read_stat.txt')
f.readline()  # skip the header line
for line in f:
    seqid = line.strip().split()[0]
    movie = seqid.split('/')[0]
    h.write("{0},{1}\n".format(seqid, d[movie]))
f.close()
h.close()
After you have a demultiplexed FL count file, you can create demultiplexed GFF (and optionally FASTA or FASTQ) files for each barcode group.
Use demux_by_barcode_groups.py, providing a mapped GFF file, the demux FL count file obtained from running the demux scripts above, and a list of tuples (surrounded by double quotes) to indicate the grouping of the barcodes.
usage: demux_by_barcode_groups.py [-h] [--pooled_fastx POOLED_FASTX]
pooled_gff demux_count_file output_prefix
outgroup_dict
For example, using the following, each barcode is separated into a different output:
python demux_by_barcode_groups.py \
--pooled_fastx collapse_isoforms.fastq \
collapse_isoforms.gff \
mapped_fl_count.txt \
output_demux \
"('NEB_5p--barcode1_3p','barcode1'),('NEB_5p--barcode2_3p','barcode2'),('NEB_5p--barcode3_3p','barcode3'),
('NEB_5p--barcode4_3p','barcode4')"
From this, we will get the following output files:
output_demux_barcode1_only.gff
output_demux_barcode1_only.fastq
output_demux_barcode2_only.gff
output_demux_barcode2_only.fastq
output_demux_barcode3_only.gff
output_demux_barcode3_only.fastq
output_demux_barcode4_only.gff
output_demux_barcode4_only.fastq
Alternatively, say barcodes 1 and 2 are actually two samples from the same condition, and barcodes 3 and 4 belong to another condition. We can run:
python demux_by_barcode_groups.py \
--pooled_fastx collapse_isoforms.fastq \
collapse_isoforms.gff \
mapped_fl_count.txt \
output_demux \
"('NEB_5p--barcode1_3p','conditionA'),('NEB_5p--barcode2_3p','conditionA'),('NEB_5p--barcode3_3p','conditionB'),
('NEB_5p--barcode4_3p','conditionB')"
From this, we will get the following output files:
output_demux_conditionA_only.gff
output_demux_conditionA_only.fastq
output_demux_conditionB_only.gff
output_demux_conditionB_only.fastq
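A quick way to confirm the split worked is to count the records in each demultiplexed FASTQ. The sketch below assumes the standard four-lines-per-record FASTQ layout; the filenames in the commented usage are the ones from the example above:

```python
def count_fastq_records(fastq_file):
    """Count records in a FASTQ file, assuming the standard
    four-lines-per-record layout."""
    with open(fastq_file) as f:
        return sum(1 for line in f if line.strip()) // 4

# Hypothetical usage for the condition outputs above:
# for cond in ('conditionA', 'conditionB'):
#     print(cond, count_fastq_records('output_demux_%s_only.fastq' % cond))
```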