-
Notifications
You must be signed in to change notification settings - Fork 104
Annotation and Rarefaction Wiki
Last Updated: 09/21/2020
- BioPython
All test data used in this wiki is available in the annotation/test_data directory.
make_file_for_subsampling_from_collapsed.py
: Prepare file for running subsampling (rarefaction curve).subsample.py
andsubsample_with_category
: Running subsampling.
Requirement: You must have run Iso-Seq cluster and post-processed it with the collapse script to get two files: the unique isoform FASTQ file (ex: test.fq) and an abundance file (ex: test.abundance.txt).
The subsampling will always be done using the number of full-length (FL) reads associated with each isoform. Therefore, the abundance file must contain the field count_fl
.
To prepare the file to run subsampling, use the command below:
usage: Make subsample-ready file from ToFU/Iso-Seq collapsed output
[-i INPUT_PREFIX] [-o OUTPUT_FILENAME]
[-m2 SQANTI_CLASS] [--include_single_exons]
For example:
make_file_for_subsampling_from_collapsed.py -i test -o test.for_subsampling.txt
By default, single exons are excluded. If you want to include it, use --include_single_exons
.
It will automatically look for test.gff
, test.abundance.txt
and test.rep.fq
, so make sure the three files exist.
In addition, the script can also add on the SQANTI3 classification result, should you want to subsample based on the number of known isoforms/genes.
make_file_for_subsampling_from_collapsed.py -i test -o test.for_subsampling.txt \
-m2 test.classification.txt
The output test.for_subsampling.txt will look like this:
pbid | pbgene | length | refisoform | refgene | category | fl_count |
---|---|---|---|---|---|---|
PB.15650.14 | 15650 | 2627 | xxxx | TFDP2 | novel_in_catalog | 3 |
PB.19796.10 | 19796 | 2734 | xxxx | ZNF3 | novel_in_catalog | 3 |
PB.10034.17 | 10034 | 1959 | xxxx | TCF4 | full-splice_match | 3 |
PB.18454.7 | 18454 | 2732 | xxxx | NFYA | full-splice_match | 3 |
And now we are ready to do subsampling!
At a minimum, the input file must have the following fields: pbid
(must be unique ID), length
(length of transcript), and fl_count
(number of FL reads). You can generate the file anyway you prefer, or use the scripts provided above.
The subsample script usage is:
usage: subsample.py [-h] [--by BY] [--iterations ITERATIONS] [--range RANGE]
[--min_fl_count MIN_FL_COUNT] [--step STEP]
count_filename
positional arguments:
count_filename
optional arguments:
-h, --help show this help message and exit
--by BY Unique specifier name(default: id)
--iterations ITERATIONS
Number of iterations (default: 100)
--range RANGE Length range (ex: (1000,2000), default None)
--min_fl_count MIN_FL_COUNT
Minimum FL count (default: 1)
--step STEP Step size (default: 10000)
If you don't specify the --range
option, it will use all the transcripts. You can plot rarefaction curves for each size bin individually to better understand sampling efforts in different bins.
We can subsample by the collapse script gene ID (which is determined solely based on genomic locus overlap) in the 1-3kb range of transcripts:
subsample.py --by pbgene --min_fl_count 2 --step 1000 --range "(1000,3000)" \
test.for_subsampling.txt > test.1to3k.by_pbgene.rarefaction.txt
We will now run subsampling at the gene and isoform level, then plot them using R on the same figure.
subsample.py --by refgene --min_fl_count 2 --step 1000 \
test.for_subsampling.txt > test.rarefaction.by_refgene.min_fl_2.txt
subsample.py --by refisoform --min_fl_count 2 --step 1000 \
test.for_subsampling.txt > test.rarefaction.by_refisoform.min_fl_2.txt
Alternatively, we can use SQANTI3 classification results to subsample by category:
subsample_with_category.py --by refisoform --min_fl_count 2 --step 1000 \
test.for_subsampling.all.txt > \
test.rarefaction.by_refisoform.min_fl_2.by_category.txt
You can follow the R script in the test data directory to plot rarefaction curves.
The first plot shows rarefaction curve at the gene and isoform level.
The second plot shows rarefaction curve at the isoform level, by SQANTI3 classification category.