Skip to content

Tama Collapse

GenomeRIK edited this page Aug 15, 2018 · 73 revisions

Tama Collapse

TAMA Collapse is a tool that allows you to collapse redundant transcript models in your Iso-Seq data.

Detailed explanation of TAMA Collapse:

After you map your Iso-Seq data onto a genome assembly you will notice that within some gene models there are multiple transcripts which appear to be very similar. Collapsing is the process of figuring out which transcripts are similar enough to be collapsed into a single transcript model. While this may sound trivial, it is actually a complex problem. For example, let's take a look at the Iso-Seq Tofu algorithm for collapsing. While not specifically termed on the Tofu site there are 2 modes of collapsing allowed for Tofu Collapse which I have named "Transcription Start Site Collapse" (TSSC) and "Exon Cascade Collapse" (ECC).

TSSC will collapse all transcripts with matching 3' end, matching spice junctions, and matching number of exons. The 5' end (or transcription start site) can differ. ECC will collapse all transcripts with matching 3' end and matching splice junctions. However, the transcripts to be collapsed do not need to have the same number of exons. See below for a diagram of this behaviour:

You might be wondering why the 5' end is allowed to differ in both modes. This was done because the official Iso-Seq library preparation protocol does not include a 5' cap selection step. This means that transcripts which have been degraded from the 5' end are sequenced as well. Thus with the standard protocol it is impossible to distinguish transcripts based on their 5' end alone.

A more detailed explanation of this can be found in this paper:

https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3691-9

If you have performed 5' cap selection on your Iso-Seq library then none of these algorithms will be appropriate for your data. However, TAMA Collapse is designed to work with both 5' cap selected and standard libraries.

One question you might have regarding these algorithms is "how much difference between exon start and end sites is allowed?" Indeed, if you look closely at your Iso-Seq data, you will see that very similar looking transcripts may have slight differences in their individual exon start and end sites. I call this phenomenon "wobble". One explanation for wobble is that a sequence error occured near the splice junction which caused the mapper to shift the exon start/end site slightly. In TAMA Collapse you can define the amount of difference you want to allow between exon start/end sites. TAMA Collapse also tracks the amount of wobble that occurs for each exon start/end site so you can see if there was a significant amount of wobble for a collapsed transcript model. Large amounts of wobble are caused by something I call "wobble walking". You can see what this looks like below:

As you can see in the above diagram, the first two exon start sites match given a 10 bp threshold. The second and third exon start sites also match given the same 10 bp threshold. Thus all three exon start sites "match". You can see that this can continue on which would cause a large amount of wobble for that feature.

With TAMA Collapse you can choose to either select the feature which is most common or the feature that would make the longest exon. The default is the most common feature. Basically what this does is look at all the transcripts which match and use the most common exon start/end site (the feature with the most support).

Other collapsing algortihms usually decide on the "correct" exon start site based on either reference information (i.e. short read data or public annotation) or based on canonical intronic splice donor-acceptor pairs (GT-AG). While this may seem like the right idea, it is masking over the information in your Iso-Seq data. Thus it projects or biases the processing toward an assumption which may not be true. However, if you do want to match the splice junctions to another transcriptome you can use TAMA Merge.

TAMA Collapse creates output files that contain all the information pertaining to wobble and sourcing so that you can see where there might be an anomaly with the collapsing. This ties into one of the core tenets of TAMA which is to "flag not hide" any unusual results.

In order to run TAMA Collapse with the ECC mode, you simply have to use the cap flag parameter "-x no_cap". In order to emulate the TSSC mode with TAMA you can just give a very large threshold for the 5' end. Thus TAMA Collapse can peform the same type of collapsing as TOFU Collapse.

NOTE: As of 2017-08-02, we noticed an issue with the strand information provided by TOFU Collapse when running it on bam files generated by GMAP. This was caused by TOFU Collapse using the XS:A? field to define the strand which is designated by GMAP. GMAP had a bug which sometimes gave the wrong strand information in that field. Newer versions of GMAP should have fixed this bug. However, to address this TAMA Collapse has a file output which shows when the XS:A? field disagrees with the strand information from the SAM flag.

GMAP actually uses the splice junction intronic donor-accepter sequences (canonical GT-AG) to "correct" the strand information. So if your mapped transcript has splice junctions which show the reverse complement of the canonical donor-accepter sequences, GMAP will flip the strand in the XS:A? field. The "strand_check" output file from TAMA Collapse contains these instances.

Manual

usage: tama_collapse.py [-h] [-s] [-f] [-p] [-x] [-e] [-c] [-i] [-a] [-m] [-z]

This script collapses mapped transcript models

arguments:

  -h, --help  show this help message and exit
  -s S        Sorted sam file (required)
  -f F        Fasta file (required)
  -p P        Output prefix (required)
  -x X        Capped flag: capped or no_cap
  -e E        Collapse exon ends flag: common_ends or longest_ends (default
              common_ends)
  -c C        Coverage (default 99)
  -i I        Identity (default 85)
  -a A        5 prime threshold (default 10)
  -m M        Exon/Splice junction threshold (default 10)
  -z Z        3 prime threshold (default 10)
  -d D        Flag for merging duplicate transcript groups (default no_merge quits when duplicates are found, merge_dup will merge duplicates)
  -sj SJ      Use error threshold to prioritize the use of splice junction information from collapsing transcripts(default no_priority, activate with sj_priority)
  -sjt SJT    Threshold for detecting errors near splice junctions (default is 10bp)

Default command would look like this:

python tama_collapse.py -s mapped_reads.sam -f genome.fa -p prefix -x capped

Note:

The SAM file should be sorted and should not be compressed into BAM. The fasta file input should be for the genome assembly that was used to create the SAM file. This tool is very memory greedy. If it is running out of memory on your system please let me know and I will optimize it further for memory usage.

Detail explanation of optional arguments:

-x X Capped flag: capped or no_cap

If you performed 5' cap selection during youe Iso-Seq library preparation you should use the "-x capped" option. This will not allow transcripts to be collapses if they have a different number of exons.

If you did not do 5' cap selection (currently official Iso-Seq library preparation protocol), choose "-x no_cap". This will collapse transcripts if they have the same 3' (exons and splice junctions) end but differing number of exons on the 5' end.

Basically using "no_cap" will make TAMA Collapse behave in almost the same way as ECC Tofu Collapse. The only difference is that Tofu Collapse behaves in a more "stochastic" way. By that I mean that instead of adding each shorter model to all matching longer models, Tofu Collapse will only match the first instance it sees. This saves compute time but it means that you are not able to use all the information available to predict the final features for the final collapsed transcript models.

-e E Collapse exon ends flag: common_ends or longest_ends (default common_ends)

When collapsing, this option determines if the selected exon start/end site is defined by the most common feature or the longest feature. Longest feature means the feature which makes the exon longer.

-c C Coverage (default 99)

Coverage is defined by the percentage of the original read that is mapped to the genome. In some cases regions at the beginnning or end of the read will be clipped from the mapping representation. The amount of this clipping is what defines the coverage value.

-i I Identity (default 85)

Identity is defined as the number of bases of the read that match the bases of the genome as per the genomic mapping. Basically it is how close the read sequence is to the genomic sequence.

-a A 5 prime threshold (default 10)

The 5 prime threshold is the amount of tolerance at the 5' end of the transcript for grouping reads to be collapsed.

-m M Exon/Splice junction threshold (default 10)

The Exon/Splice junction threshold is the amount of tolerance for the splice junctions of the transcript for grouping reads to be collapsed.

-z Z 3 prime threshold (default 10)

The 3 prime threshold is the amount of tolerance for the 3' end of the transcript for grouping reads to be collapsed.

-d D Flag for merging duplicate transcript groups (default no_merge quits when duplicates are found, merge_dup will merge duplicates)

When collapsing transcripts, there can be events where different groups of transcript collapse into the same model. I call these duplicate transcript groups. By default TAMA Collapse will not merge these groups during collapsing and will exit the run prematurely with an error output. If you wish to merge these groups, use the merge_dup flag.

-sj SJ Use error threshold to prioritize the use of splice junction information from collapsing transcripts(default no_priority, activate with sj_priority)

I call this option splice junction priority. Basically if you turn this on, TAMA Collapse will search for mismatches in the regions on each side of each splice junction. The length of the regions are defined by the SJT flag. TAMA Collapse will then rank the splice junction evidence from each transcript/read as either 0, 1, 2, or 3 with 0 being highest priroity. These rankings are defined as follows:

0 - No mismatches on either side of the splice junction

1 - One mismatch on the other side of the splice junction (if we are determining the priority of an exon start, a priority of 1 would mean there is a mismatch in the sequence within the region of the previous exon's end).

2 - One mismatch on the same side of the splice junction (if we are determining the priority of an exon start, a priority of 2 would mean there is a mismatch in the sequence within the region of this exon start).

3 - There are mismatches on both sides of the splice junction

TAMA Collapse will use these rankings to decide which splice junction (exon start and exon end) to use for the final model.

-sjt SJT Threshold for detecting errors near splice junctions (default is 10bp)

This is the length threshold for the regions to observe for splice junction priority. The default is 10bp which means TAMA collapse will look at the 10bp region upstream and downstream of the splice junction.

Outputs:

  prefix.bed
  prefix_read.txt
  prefix_polya.txt
  prefix_strand_check.txt
  prefix_trans_read.bed
  prefix_trans_report.txt
  prefix_varcov.txt
  prefix_variants.txt

Detail explanation:

prefix.bed

This is a bed12 format file containing the final collapsed version of your transcriptome.

prefix_read.txt

This file contains information for each pre-collapsed read. Format as follows:

  read_id mapped_flag     accept_flag     percent_coverage        percent_identity        error_line<h;s;i;d;m>   length  cigar
  15_25_c195936/1/485     forward_strand  discarded       50.52   50.1    240;0;1;0;1     485     240H42M12896N65M1I137M

prefix_polya.txt

This file contains the reads with potential poly A truncation. The format is as follows:

  cluster_id      trans_id        strand  a_percent       a_count sequence
  15_25_c10327/1/1176     G60.1   -       85.0    17      TAAAAAAAATCAAAAAAAAA

a_percent is the percent of A's in the 20 nucleotide stretch after the 3' end in the genome. a_count is the number of A's in the 20 nucleotide stretch. The sequence shows the 20 nucleotide stretch.

prefix_strand_check.txt

This file shows instances where the sam flag strand information contrasted the GMAP strand information. Format as follows:

  read_id scaff_name      start_pos       cigar   strands
  15_25_c122605/5/833     1       81276   8S825M  +-

prefix_trans_read.bed

This file uses bed12 format to show the transcript model for each read based on the mapping prior to collapsing. You can use this file to see if there were any strange occurences during collapsing. It also contains the relationships between clusters and collapsed transcript models. The 1st subfield in the 4th column shows the final transcript ID and the 2nd subfield in the 4th column shows the cluster ID.

  1       2350    3116    G1.1;m54041_170609_082817/47972644/30_581_CCS   40      +       2350    3116    255,0,0 3       108,377,65      0,197,701

prefix_trans_report.txt

This file contains collapsing information for each transcript. Format as follows:

  transcript_id  num_clusters    high_coverage   low_coverage    high_quality_percent    low_quality_percent     start_wobble_list       end_wobble_list collapse_sj_start_err   collapse_sj_end_err     collapse_error_nuc
  G14.2          1               100.0           100.0           98.81                   98.81                   0,0,0,0,0               0,0,0,0,0       0,1,2,0,0               2,1,0,0,0               6.T.A>0;0>5M_1D;0>0;0>0

num_clusters - Number of pre-collapse support reads/FLNC/clusters

high_coverage - Highest coverage from pre-collapse support

low_coverage - Lowest coverage from pre-collapse support

high_quality_percent - Highest identity from pre-collapse support

low_quality_percent - Lowestidentity from pre-collapse support

start_wobble_list - Range of exon starts from reads supporting this exon start.

end_wobble_list - Range of exon ends from reads supporting this exon end.

collapse_sj_start_err - Splice junction priority for exon starts for highest priority support.

collapse_sj_end_err - Splice junction priority for exon ends for highest priority support.

collapse_error_nuc - Mapping mismatches occurring within threshold of splice junctions. The format for this line is as follows:

Splice junctions are delimited by ";".

Upstream and downstream of splice junction is delimited by ">".

Different mismatches from different reads for a given splice junction side are delimited with "-".

Mismatch types from the SAM CIGAR string are delimited by "_".

Single nucleotide mismatch (not indel) will display the position in relation to the splice junction and the read base and the genome base as such:

position.read base.genome base

Example:

6.T.A

prefix_varcov.txt

This file contains the coverage information for each variant detected. Format as follows:

  positions       overlap_clusters
  1_117276,1_117295       15_25_c205247/3/1197,1_14_c115955/1/1379,1_14_c126599/1/1419,1_14_c16580/3/1270,1_14_c80976/2/2080

prefix_variants.txt

This file contains the variants called. Format as follows:

  scaffold        position        type    alt_allele      count   cov_count       cluster_list
  1       95526   M       C       6       6       15_25_c79384/2/691,15_25_c28659/2/1191,1_14_c66511/2/1198,1_14_c44191/1/1461,1_14_c60980/2/1887,1_14_c76490/1/1662