Annotation and Rarefaction Wiki

Last Updated: 09/21/2020

Python Requirement

BioPython

Test Data

All test data used in this wiki is available in the annotation/test_data directory.

List of Scripts

make_file_for_subsampling_from_collapsed.py: Prepare file for running subsampling (rarefaction curve).
subsample.py and subsample_with_category: Running subsampling.

Detailed Documentation

Prepare file for subsampling efforts (rarefaction curve drawing)

Requirement: You must have run Iso-Seq cluster and post-processed it with the collapse script to get two files: the unique isoform FASTQ file (ex: test.fq) and an abundance file (ex: test.abundance.txt).

The subsampling will always be done using the number of full-length (FL) reads associated with each isoform. Therefore, the abundance file must contain the field count_fl.

To prepare the file to run subsampling, use the command below:

usage: Make subsample-ready file from ToFU/Iso-Seq collapsed output
       [-i INPUT_PREFIX] [-o OUTPUT_FILENAME] 
       [-m2 SQANTI_CLASS] [--include_single_exons]

For example:

make_file_for_subsampling_from_collapsed.py -i test -o test.for_subsampling.txt

By default, single exons are excluded. If you want to include it, use --include_single_exons.

It will automatically look for test.gff, test.abundance.txt and test.rep.fq, so make sure the three files exist.

In addition, the script can also add on the SQANTI3 classification result, should you want to subsample based on the number of known isoforms/genes.

make_file_for_subsampling_from_collapsed.py -i test -o test.for_subsampling.txt \
   -m2 test.classification.txt

The output test.for_subsampling.txt will look like this:

pbid	pbgene	length	refisoform	refgene	category	fl_count
PB.15650.14	15650	2627	xxxx	TFDP2	novel_in_catalog	3
PB.19796.10	19796	2734	xxxx	ZNF3	novel_in_catalog	3
PB.10034.17	10034	1959	xxxx	TCF4	full-splice_match	3
PB.18454.7	18454	2732	xxxx	NFYA	full-splice_match	3

And now we are ready to do subsampling!

Running subsampling to draw rarefaction curves

At a minimum, the input file must have the following fields: pbid (must be unique ID), length (length of transcript), and fl_count (number of FL reads). You can generate the file anyway you prefer, or use the scripts provided above.

The subsample script usage is:

usage: subsample.py [-h] [--by BY] [--iterations ITERATIONS] [--range RANGE]
                    [--min_fl_count MIN_FL_COUNT] [--step STEP]
                    count_filename

positional arguments:
  count_filename

optional arguments:
  -h, --help            show this help message and exit
  --by BY               Unique specifier name(default: id)
  --iterations ITERATIONS
                        Number of iterations (default: 100)
  --range RANGE         Length range (ex: (1000,2000), default None)
  --min_fl_count MIN_FL_COUNT
                        Minimum FL count (default: 1)
  --step STEP           Step size (default: 10000)

If you don't specify the --range option, it will use all the transcripts. You can plot rarefaction curves for each size bin individually to better understand sampling efforts in different bins.

We can subsample by the collapse script gene ID (which is determined solely based on genomic locus overlap) in the 1-3kb range of transcripts:

subsample.py --by pbgene --min_fl_count 2 --step 1000 --range "(1000,3000)" \
  test.for_subsampling.txt  > test.1to3k.by_pbgene.rarefaction.txt

We will now run subsampling at the gene and isoform level, then plot them using R on the same figure.

subsample.py --by refgene --min_fl_count 2 --step 1000 \
   test.for_subsampling.txt > test.rarefaction.by_refgene.min_fl_2.txt 

subsample.py --by refisoform --min_fl_count 2 --step 1000 \
   test.for_subsampling.txt > test.rarefaction.by_refisoform.min_fl_2.txt

Alternatively, we can use SQANTI3 classification results to subsample by category:

subsample_with_category.py --by refisoform --min_fl_count 2 --step 1000 \
    test.for_subsampling.all.txt > \
    test.rarefaction.by_refisoform.min_fl_2.by_category.txt

Plotting Rarefaction Curves in R

You can follow the R script in the test data directory to plot rarefaction curves.

The first plot shows rarefaction curve at the gene and isoform level.

The second plot shows rarefaction curve at the isoform level, by SQANTI3 classification category.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly