-
Notifications
You must be signed in to change notification settings - Fork 26
Tama Collapse
Tama Collapse
TAMA Collapse is a tool that allows you to collapse redundant transcript models in your Iso-Seq data.
Detailed explanation of TAMA Collapse:
After you map your Iso-Seq data onto a genome assembly you will notice that within some gene models there are multiple transcripts which appear to be very similar. Collapsing is the process of figuring out which transcripts are similar enough to be collapsed into a single transcript model. While this may sound trivial, it is actually a complex problem. For example, let's take a look at the Iso-Seq Tofu algorithm for collapsing. While not specifically termed on the Tofu site there are 2 modes of collapsing allowed for Tofu Collapse which I have named "Transcription Start Site Collapse" (TSSC) and "Exon Cascade Collapse" (ECC).
TSSC will collapse all transcripts with matching 3' end, matching spice junctions, and matching number of exons. The 5' end (or transcription start site) can differ. ECC will collapse all transcripts with matching 3' end and matching splice junctions. However, the transcripts to be collapsed do not need to have the same number of exons. See below for a diagram of this behaviour:
You might be wondering why the 5' end is allowed to differ in both modes. This was done because the official Iso-Seq library preparation protocol does not include a 5' cap selection step. This means that transcripts which have been degraded from the 5' end are sequenced as well. Thus with the standard protocol it is impossible to distinguish transcripts based on their 5' end alone.
A more detailed explanation of this can be found in this paper:
https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3691-9
If you have performed 5' cap selection on your Iso-Seq library then none of these algorithms will be appropriate for your data. However, TAMA Collapse is designed to work with both 5' cap selected and standard libraries.
One question you might have regarding these algorithms is "how much difference between exon start and end sites is allowed?" Indeed, if you look closely at your Iso-Seq data, you will see that very similar looking transcripts may have slight differences in their individual exon start and end sites. I call this phenomenon "wobble". One explanation for wobble is that a sequence error occured near the splice junction which caused the mapper to shift the exon start/end site slightly. In TAMA Collapse you can define the amount of difference you want to allow between exon start/end sites. TAMA Collapse also tracks the amount of wobble that occurs for each exon start/end site so you can see if there was a significant amount of wobble for a collapsed transcript model. Large amounts of wobble are caused by something I call "wobble walking". You can see what this looks like below:
As you can see in the above diagram, the first two exon start sites match given a 10 bp threshold. The second and third exon start sites also match given the same 10 bp threshold. Thus all three exon start sites "match". You can see that this can continue on which would cause a large amount of wobble for that feature.
With TAMA Collapse you can choose to either select the feature which is most common or the feature that would make the longest exon. The default is the most common feature. Basically what this does is look at all the transcripts which match and use the most common exon start/end site (the feature with the most support).
Other collapsing algortihms usually decide on the "correct" exon start site based on either reference information (i.e. short read data or public annotation) or based on canonical intronic splice donor-acceptor pairs (GT-AG). While this may seem like the right idea, it is masking over the information in your Iso-Seq data. Thus it projects or biases the processing toward an assumption which may not be true. However, if you do want to match the splice junctions to another transcriptome you can use TAMA Merge.
TAMA Collapse creates output files that contain all the information pertaining to wobble and sourcing so that you can see where there might be an anomaly with the collapsing. This ties into one of the core tenets of TAMA which is to "flag not hide" any unusual results.
In order to run TAMA Collapse with the ECC mode, you simply have to use the cap flag parameter "-x no_cap". In order to emulate the TSSC mode with TAMA you can just give a very large threshold for the 5' end. Thus TAMA Collapse can peform the same type of collapsing as TOFU Collapse.
NOTE: As of 2017-08-02, we noticed an issue with the strand information provided by TOFU Collapse when running it on bam files generated by GMAP. This was caused by TOFU Collapse using the XS:A? field to define the strand which is designated by GMAP. GMAP had a bug which sometimes gave the wrong strand information in that field. Newer versions of GMAP should have fixed this bug. However, to address this TAMA Collapse has a file output which shows when the XS:A? field disagrees with the strand information from the SAM flag.
GMAP actually uses the splice junction intronic donor-accepter sequences (canonical GT-AG) to "correct" the strand information. So if your mapped transcript has splice junctions which show the reverse complement of the canonical donor-accepter sequences, GMAP will flip the strand in the XS:A? field. The "strand_check" output file from TAMA Collapse contains these instances.
Manual
usage: tama_collapse.py [-h] [-s] [-f] [-p] [-x] [-e] [-c] [-i] [-a] [-m] [-z]
This script collapses mapped transcript models
arguments:
-h, --help show this help message and exit -s S Sorted sam file (required) -f F Fasta file (required) -p P Output prefix (required) -x X Capped flag: capped or no_cap -e E Collapse exon ends flag: common_ends or longest_ends (default common_ends) -c C Coverage (default 99) -i I Identity (default 85) -a A 5 prime threshold (default 10) -m M Exon/Splice junction threshold (default 10) -z Z 3 prime threshold (default 10) -d D Flag for merging duplicate transcript groups (default is merge_dup will merge duplicates ,no_merge quits when duplicates are found) -sj SJ Use error threshold to prioritize the use of splice junction information from collapsing transcripts(default no_priority, activate with sj_priority) -sjt SJT Threshold for detecting errors near splice junctions (default is 10bp) -lde LDE Threshold for amount of local density error near splice junctions that is allowed (default is 1000 errors which practically means no threshold is applied) -ses SES Simple error symbol. Use this to pick the symbol used to represent matches in the simple error string for LDE output. -log LOG Turns on/off output of collapsing process. (default on, use log_off to turn off)
Default command would look like this:
python tama_collapse.py -s mapped_reads.sam -f genome.fa -p prefix -x capped
Note:
The SAM file should be sorted and should not be compressed into BAM. The fasta file input should be for the genome assembly that was used to create the SAM file. This tool is very memory greedy. If it is running out of memory on your system please let me know and I will optimize it further for memory usage.
Detail explanation of optional arguments:
-x X Capped flag: capped or no_cap
If you performed 5' cap selection during youe Iso-Seq library preparation you should use the "-x capped" option. This will not allow transcripts to be collapses if they have a different number of exons.
If you did not do 5' cap selection (currently official Iso-Seq library preparation protocol), choose "-x no_cap". This will collapse transcripts if they have the same 3' (exons and splice junctions) end but differing number of exons on the 5' end.
Basically using "no_cap" will make TAMA Collapse behave in almost the same way as ECC Tofu Collapse. The only difference is that Tofu Collapse behaves in a more "stochastic" way. By that I mean that instead of adding each shorter model to all matching longer models, Tofu Collapse will only match the first instance it sees. This saves compute time but it means that you are not able to use all the information available to predict the final features for the final collapsed transcript models.
-e E Collapse exon ends flag: common_ends or longest_ends (default common_ends)
When collapsing, this option determines if the selected exon start/end site is defined by the most common feature or the longest feature. Longest feature means the feature which makes the exon longer.
-c C Coverage (default 99)
Coverage is defined by the percentage of the original read that is mapped to the genome. In some cases regions at the beginnning or end of the read will be clipped from the mapping representation. The amount of this clipping is what defines the coverage value.
-i I Identity (default 85)
Identity is defined as the number of bases of the read that match the bases of the genome as per the genomic mapping. Basically it is how close the read sequence is to the genomic sequence.
-a A 5 prime threshold (default 10)
The 5 prime threshold is the amount of tolerance at the 5' end of the transcript for grouping reads to be collapsed.
-m M Exon/Splice junction threshold (default 10)
The Exon/Splice junction threshold is the amount of tolerance for the splice junctions of the transcript for grouping reads to be collapsed.
-z Z 3 prime threshold (default 10)
The 3 prime threshold is the amount of tolerance for the 3' end of the transcript for grouping reads to be collapsed.
-d D Flag for merging duplicate transcript groups (default no_merge quits when duplicates are found, merge_dup will merge duplicates)
When collapsing transcripts, there can be events where different groups of transcript collapse into the same model. I call these duplicate transcript groups. By default TAMA Collapse will not merge these groups during collapsing and will exit the run prematurely with an error output. If you wish to merge these groups, use the merge_dup flag.
-sj SJ Use error threshold to prioritize the use of splice junction information from collapsing transcripts(default no_priority, activate with sj_priority)
I call this option splice junction priority. Basically if you turn this on, TAMA Collapse will search for mismatches in the regions on each side of each splice junction. The length of the regions are defined by the SJT flag. TAMA Collapse will then rank the splice junction evidence from each transcript/read as either 0, 1, 2, or 3 with 0 being highest priroity. These rankings are defined as follows:
0 - No mismatches on either side of the splice junction
1 - One mismatch on the other side of the splice junction (if we are determining the priority of an exon start, a priority of 1 would mean there is a mismatch in the sequence within the region of the previous exon's end).
2 - One mismatch on the same side of the splice junction (if we are determining the priority of an exon start, a priority of 2 would mean there is a mismatch in the sequence within the region of this exon start).
3 - There are mismatches on both sides of the splice junction
TAMA Collapse will use these rankings to decide which splice junction (exon start and exon end) to use for the final model.
-sjt SJT Threshold for detecting errors near splice junctions (default is 10bp)
This is the length threshold for the regions to observe for splice junction priority. The default is 10bp which means TAMA collapse will look at the 10bp region upstream and downstream of the splice junction.
-lde LDE Threshold for amount of local density error near splice junctions that is allowed (default is 1000 errors which practically means no threshold is applied)
This is the Threshold for the number of errors that is allowed to occur on each side of the splice junction within the range specified by the -sjt argument. So for example, if you use the default SJT of 10bp and "-lde 5", then TAMA Collapse would remove all reads with at least one splice junction that has more than 5 errors withing the 10bp region downstream of the a splice junction or upstream of a splice junction. This feature is used to remove mapped reads where the splice junction mapping is likely to be erroneous due to a high local density of errors near the splice junction.
-ses SES Simple error symbol. Use this to pick the symbol used to represent matches in the simple error string for LDE output.
This argument refers to the LDE output file. It allows you to change the symbol used to represent matches in the sj_error_simple field of the prefix_local_density_error.txt file.
Outputs:
prefix.bed prefix_read.txt prefix_polya.txt prefix_strand_check.txt prefix_trans_read.bed prefix_trans_report.txt prefix_varcov.txt prefix_variants.txt prefix_local_density_error.txt
Detail explanation:
prefix.bed
This is a bed12 format file containing the final collapsed version of your transcriptome.
prefix_read.txt
This file contains information for each pre-collapsed read. Format as follows:
read_id mapped_flag accept_flag percent_coverage percent_identity error_line<h;s;i;d;m> length cigar 15_25_c195936/1/485 forward_strand discarded 50.52 50.1 240;0;1;0;1 485 240H42M12896N65M1I137M
prefix_polya.txt
This file contains the reads with potential poly A truncation. The format is as follows:
cluster_id trans_id strand a_percent a_count sequence 15_25_c10327/1/1176 G60.1 - 85.0 17 TAAAAAAAATCAAAAAAAAA
a_percent is the percent of A's in the 20 nucleotide stretch after the 3' end in the genome. a_count is the number of A's in the 20 nucleotide stretch. The sequence shows the 20 nucleotide stretch. A threshold of 70% A's within the 20bp region is used for determining possible poly-A truncation sites.
prefix_strand_check.txt
This file shows instances where the sam flag strand information contrasted the GMAP strand information. Format as follows:
read_id scaff_name start_pos cigar strands 15_25_c122605/5/833 1 81276 8S825M +-
prefix_trans_read.bed
This file uses bed12 format to show the transcript model for each read based on the mapping prior to collapsing. You can use this file to see if there were any strange occurences during collapsing. It also contains the relationships between clusters and collapsed transcript models. The 1st subfield in the 4th column shows the final transcript ID and the 2nd subfield in the 4th column shows the cluster ID.
1 2350 3116 G1.1;m54041_170609_082817/47972644/30_581_CCS 40 + 2350 3116 255,0,0 3 108,377,65 0,197,701
prefix_trans_report.txt
This file contains collapsing information for each transcript. Format as follows:
transcript_id num_clusters high_coverage low_coverage high_quality_percent low_quality_percent start_wobble_list end_wobble_list collapse_sj_start_err collapse_sj_end_err collapse_error_nuc G14.2 1 100.0 100.0 98.81 98.81 0,0,0,0,0 0,0,0,0,0 0,1,2,0,0 2,1,0,0,0 6.T.A>0;0>5M_1D;0>0;0>0
num_clusters - Number of pre-collapse support reads/FLNC/clusters
high_coverage - Highest coverage from pre-collapse support
low_coverage - Lowest coverage from pre-collapse support
high_quality_percent - Highest identity from pre-collapse support
low_quality_percent - Lowestidentity from pre-collapse support
start_wobble_list - Range of exon starts from reads supporting this exon start.
end_wobble_list - Range of exon ends from reads supporting this exon end.
collapse_sj_start_err - Splice junction priority for exon starts for highest priority support.
collapse_sj_end_err - Splice junction priority for exon ends for highest priority support.
collapse_error_nuc - Mapping mismatches occurring within threshold of splice junctions. The format for this line is as follows:
Splice junctions are delimited by ";".
Upstream and downstream of splice junction is delimited by ">".
Different mismatches from different reads for a given splice junction side are delimited with "-".
Mismatch types from the SAM CIGAR string are delimited by "_".
Single nucleotide mismatch (not indel) will display the position in relation to the splice junction and the read base and the genome base as such:
position.read base.genome base
Example:
6.T.A
Example of single splice junction error profile:
6.T.A>5M_1D
This means that 6 base pairs upstream of the splice junction there is a mismatch where T is the nucleotide found in the read for that position and A is the nucleotide found in the reference genome for that position.
For downstream of the splice junction, there are 5 matching nucleotides and then a 1 nucleotide deletion.
The symbols for this error representations are as such:
M - Matching nucleotides
I - Insertion
D- Deletion
S - Soft clipping
H - Hard clipping
This is a modified CIGAR format. For more information on CIGAR string format please see the SAM format guidelines:
https://samtools.github.io/hts-specs/SAMv1.pdf
prefix_varcov.txt
This file contains the coverage information for each variant detected. Format as follows:
positions overlap_clusters 1_117276,1_117295 15_25_c205247/3/1197,1_14_c115955/1/1379,1_14_c126599/1/1419,1_14_c16580/3/1270,1_14_c80976/2/2080
prefix_variants.txt
This file contains the variants called. Format as follows:
scaffold position type alt_allele count cov_count cluster_list 1 95526 M C 6 6 15_25_c79384/2/691,15_25_c28659/2/1191,1_14_c66511/2/1198,1_14_c44191/1/1461,1_14_c60980/2/1887,1_14_c76490/1/1662
prefix_local_density_error.txt
This file contains the log of filtering for local density error around the splice junctions ("-lde") Format as follows:
cluster_id lde_flag scaff_name start_pos end_pos strand num_exons bad_sj_num_line bad_sj_error_count_line sj_error_profile_idmsh sj_error_nuc sj_error_simple cigar m54053_170316_122436/64881153/30_577_CCS lde_fail Chr1 45296 45933 - 2 1 1>0 0,1,0,0,0>0,0,0,0,0 1D_2M>0 _______D__>__________ 229M1D5M1I26M1D2M86N99M1D36M1I56M1D24M1D26M1D42M
lde_flag - Can be lde_pass (passed quality requirments) or lde_fail (failed threshold and removed from further analysis).
bad_sj_num_line - This is the splice junction number or numbers that did not pass the LDE threshold.
bad_sj_error_count_line - This shows the number of errors on each side of each splice junction using the same format as the collapse_error_nuc line in the prefix_trans_report.txt file.
sj_error_profile_idmsh - This shows the breakdown of type and number of errors on each side of each splice junction. Same formatting conventions of bad_sj_error_count_line except that there are 5 subfields delimited by "," representing insertion,deletions,mismatch,soft_clipping,hard_clipping errors.
sj_error_nuc - This shows a precise view of the errors surrounding each splice junction. Same format conventions as the collapse_error_nuc line in the prefix_trans_report.txt file.
sj_error_simple - This shows a simplified view of the errors on each side of each splice junction. The "-ses" argument determines what character is used to represent matches. The default character for "-ses" is "_".
cigar - This is the cigar string from the SAM file.