TAMA GO: Transcript Filtering
This set of tools in TAMA-GO is used to filter out transcript models that may represent transcriptional noise or sample contamination.
Note: If you are running these tools in series, make sure to use the same source name in the generated read support file as is provided in the filelist file. The source names are important for keeping track of where reads come from. They also prevent issues with non-unique read IDs across sequencing runs.
tama_remove_polya_models_levels.py
To remove transcript models with genomic poly-A stretches immediately following the 3' end of the transcript model, use tama_remove_polya_models_levels.py. Transcript models with 3' genomic poly-A stretches are likely to be the result of either internal priming or genomic contamination.
usage: tama_remove_polya_models_levels.py [-h] [-b] [-f] [-r] [-o] [-p] [-l] [-a] [-k]
optional arguments:
  -h, --help  show this help message and exit
  -b B        Annotation bed file
  -f F        Filelist file with Poly-A file names
  -r R        Read support file
  -o O        Output prefix (required)
  -p P        Percent poly-A threshold (default of 75.0)
  -l L        Level of removal (gene or transcript level)(default is gene level)
  -a A        Remove all models with Poly-A (all_polya or singleton_polya)(default is singleton_polya)
  -k K        Keep all multi-exon models (keep_multi or remove_multi)(default is keep_multi)
Default command would look like this:
python tama_remove_polya_models_levels.py -b bed -f filelist -r readsupport -o prefix
Detailed explanation of arguments:
-b B
The bed file is the annotation bed file that is the main output of either TAMA Collapse or TAMA Merge.
-f F
The filelist file contains the names of the poly-A files that are the output from TAMA Collapse runs.
source_name  source_file
1            polya.txt
Note: Do not include the header "source_name source_file" in the filelist file.
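As a minimal sketch, the filelist content can be generated from Python. The tab delimiter, the helper name format_filelist, and the output file name are assumptions made for illustration; the text above only states that each line holds a source name and a poly-A file name, with no header line.

```python
def format_filelist(sources):
    """Return filelist text: one 'source_name<TAB>polya_file' line per source.

    No header line is written, as required by the note above.
    The tab delimiter is an assumption, not something the text states.
    """
    return "".join(f"{name}\t{polya_file}\n" for name, polya_file in sources)


# Source name and file name taken from the example above.
filelist_text = format_filelist([("1", "polya.txt")])
print(filelist_text, end="")

# To save it (hypothetical output name):
# with open("filelist.txt", "w") as handle:
#     handle.write(filelist_text)
```

Use the same source names here as in the read support file, since the tools rely on them to track where reads come from.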
-r R
The readsupport file is the output file from running tama_read_support_levels.py on the output of the TAMA Collapse run or the TAMA Merge run.
-o O
This is the output prefix. The prefix will be used to create the output file names.
-p P
Percent poly-A threshold (default of 75.0). If the model has a genomic poly-A stretch that is greater than this threshold, it is treated as a noise model.
-l L
Level of removal (gene or transcript level). This determines under what conditions a transcript model is removed. Default is gene level. Gene level means the tool only removes genes that have a single transcript and that transcript shows genomic poly-A. Thus, if a gene has multiple transcripts, none of its transcript models will be removed even if one or more of them have genomic poly-A. Transcript level means any transcript with genomic poly-A will be removed. The reason for removing on gene level is that there could be genes with real genomic poly-A such as processed pseudogenes.
-a A
Remove all models with Poly-A (all_polya or singleton_polya). Default is singleton_polya. "singleton_polya" means that only transcript models with single read support and genomic poly-A will be removed. "all_polya" means that transcript models with multiple read support are also removed when all of their supporting reads have genomic poly-A. The two modes exist because multiple read support for a model is sometimes interpreted as evidence of a true model (singleton_polya) and sometimes as the result of PCR artifacts (all_polya).
-k K
Keep all multi-exon models (keep_multi or remove_multi)(default is keep_multi). "remove_multi" will allow the tool to remove multiple exon transcript models if they have genomic poly-A. "keep_multi" will keep all multiple exon transcript models even if they have genomic poly-A. The reason to keep multiple exon models with genomic poly-A is that they likely represent real transcripts, although they may be 3' truncated. This rescues gene counts.
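The way the -p, -l, -a and -k options interact can be sketched as a single decision function. This is only one reading of the descriptions above, not the tool's actual code: the function and parameter names are hypothetical, the "gene"/"transcript" level strings are assumed, and the strict "greater than" comparison follows the wording of -p (the tool may treat the boundary differently).

```python
def has_genomic_polya(percent_polya, threshold=75.0):
    # The -p description says "greater than this threshold"; whether the
    # real tool treats the boundary value as inclusive is not stated.
    return percent_polya > threshold


def should_remove(read_percents, num_exons, transcripts_in_gene,
                  threshold=75.0, level="gene",
                  mode="singleton_polya", multi="keep_multi"):
    """Decide whether a transcript model would be removed (illustrative only).

    read_percents: percent poly-A for each read supporting the model.
    """
    # -k keep_multi: multi-exon models are always kept.
    if multi == "keep_multi" and num_exons > 1:
        return False
    # -l gene: only genes with a single transcript are candidates.
    if level == "gene" and transcripts_in_gene > 1:
        return False
    flagged = [has_genomic_polya(p, threshold) for p in read_percents]
    # -a singleton_polya: only single-read models with poly-A are removed.
    if mode == "singleton_polya":
        return len(flagged) == 1 and flagged[0]
    # -a all_polya: also remove multi-read models when every read has poly-A.
    return bool(flagged) and all(flagged)


print(should_remove([80.0], num_exons=1, transcripts_in_gene=1))
print(should_remove([80.0, 90.0], num_exons=1, transcripts_in_gene=1, mode="all_polya"))
```

Under this reading, a single-exon, single-read model in a one-transcript gene is the only case removed with all defaults.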
Outputs:
annotation.bed
polya_report.txt
polya_support.txt
trash_polya.bed
Detailed explanation:
annotation.bed
This is the filtered annotation file in bed12 format.
polya_report.txt
This is a report file which shows the mapping of gene and transcript IDs from the pre-filtered to the filtered annotation file. It also shows the read support on gene and transcript level.
old_gene_id  old_trans_id  source_line  num_reads  new_gene_id  new_trans_id  num_exons
G1           G1.1          a,b          6          G1           G1.1          10
G1           G1.2          a,b          6          G1           G1.2          10
polya_support.txt
This file shows the genomic poly-A signal for each read that was filtered out.
trans_id  source  read_id                            source_trans_id  strand  percent_polya  a_count  polya_seq
G1.114    a       m64012_181221_231243/79037480/ccs  G1.59            -       75.0           15       GAAAGAAAAAGAAAAGAAAT
G1.114    a       m64012_181221_231243/94699746/ccs  G1.59            -       75.0           15       GGAAAGAAAAAGAAAAGAAA
trash_polya.bed
This is a bed12 file of all the removed transcript models.
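The polya_support.txt columns can be checked with a short script. The sketch below parses one of the example rows shown above and recomputes the poly-A percentage as the share of A bases in polya_seq. That relationship is inferred from the sample rows (15 A's in 20 bases gives 75.0) rather than stated by the tool, and the whitespace delimiter and helper names are assumptions.

```python
POLYA_SUPPORT_COLUMNS = [
    "trans_id", "source", "read_id", "source_trans_id",
    "strand", "percent_polya", "a_count", "polya_seq",
]


def parse_polya_support_line(line):
    """Split one polya_support.txt row into a dict (whitespace-delimited is assumed)."""
    row = dict(zip(POLYA_SUPPORT_COLUMNS, line.split()))
    row["percent_polya"] = float(row["percent_polya"])
    row["a_count"] = int(row["a_count"])
    return row


def percent_a(seq):
    """Percentage of A bases in the sequence."""
    return 100.0 * seq.upper().count("A") / len(seq)


row = parse_polya_support_line(
    "G1.114 a m64012_181221_231243/79037480/ccs G1.59 - 75.0 15 GAAAGAAAAAGAAAAGAAAT"
)
print(row["a_count"], percent_a(row["polya_seq"]))
```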
tama_remove_single_read_models_levels.py
To remove transcript models with only single read support or only single source support, use tama_remove_single_read_models_levels.py. This tool allows you to remove models based on read support which may be transcriptional noise, genomic contamination, or PCR artifacts.
usage: tama_remove_single_read_models_levels.py [-h] [-b] [-r] [-o] [-l] [-k] [-s] [-n]
optional arguments:
  -h, --help  show this help message and exit
  -b B        Annotation bed file
  -r R        Read support file
  -o O        Output prefix (required)
  -l L        Level of removal (gene or transcript level)(default is gene level)
  -k K        Keep all multi-exon models (keep_multi or remove_multi)(default is remove_multi)
  -s S        Requires models to have support from at least this number of sources. Default is 1
  -n N        Requires models to have support from at least this number of reads. Default is 2
Default command would look like this:
python tama_remove_single_read_models_levels.py -b bed -r readsupport -o prefix
Detailed explanation of arguments:
-b B
The bed file is the annotation bed file that is the main output of either TAMA Collapse or TAMA Merge.
-r R
The readsupport file is the output file from running tama_read_support_levels.py on the output of the TAMA Collapse run or the TAMA Merge run.
-o O
This is the output prefix. The prefix will be used to create the output file names.
-l L
Level of removal (gene or transcript level). Default is gene level. Gene level will only remove genes with a single read. Transcript level will remove all singleton transcripts.
-k K
Keep all multi-exon models (keep_multi or remove_multi)(default is remove_multi). "remove_multi" will allow the tool to remove multiple exon transcript models if they have only single read support. "keep_multi" will keep all multiple exon transcript models regardless of the number of supporting reads.
-s S
Requires models to have support from at least this number of sources. Default is 1
-n N
Requires models to have support from at least this number of reads. Default is 2
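The -s and -n thresholds can be illustrated with a small check. This is a sketch of the two descriptions above, not the tool's code: the function name is hypothetical, and the comma-separated source string follows the source_line column shown in the report files. How the tool combines this check with -l and -k is not covered here.

```python
def meets_support(source_line, num_reads, min_sources=1, min_reads=2):
    """Return True if a model has enough sources (-s) and enough reads (-n).

    source_line: comma-separated source names, e.g. "a,b".
    """
    sources = {name for name in source_line.split(",") if name}
    return len(sources) >= min_sources and num_reads >= min_reads


print(meets_support("a,b", 6))                 # two sources, six reads
print(meets_support("b", 1))                   # a single read
print(meets_support("b", 6, min_sources=2))    # only one source
```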
Outputs:
annotation.bed
singleton.bed
singleton_report.txt
Detailed explanation:
annotation.bed
This is the filtered annotation file in bed12 format.
singleton.bed
This is a bed12 file with all the removed transcript models.
singleton_report.txt
This is a report file which shows the mapping of gene and transcript IDs from the pre-filtered to the filtered annotation file. It also shows the read support on gene and transcript level.
old_gene_id  old_trans_id  source_line  num_reads  new_gene_id  new_trans_id        num_exons
G1           G1.1          a,b          6          G1           G1.1                1
G1           G1.2          b            6          G1           removed_transcript  1
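To list which models were dropped, the report can be parsed directly. The sketch below uses the example rows shown above and treats a new_trans_id of removed_transcript as the marker for a removed model. That interpretation comes from the example, and the whitespace delimiter and helper names are assumptions.

```python
REPORT_COLUMNS = [
    "old_gene_id", "old_trans_id", "source_line", "num_reads",
    "new_gene_id", "new_trans_id", "num_exons",
]


def parse_report(text):
    """Parse report text into a list of dicts, skipping the header and blank lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "old_gene_id":
            continue
        rows.append(dict(zip(REPORT_COLUMNS, fields)))
    return rows


# Example rows from the report shown above.
sample = """old_gene_id old_trans_id source_line num_reads new_gene_id new_trans_id num_exons
G1 G1.1 a,b 6 G1 G1.1 1
G1 G1.2 b 6 G1 removed_transcript 1
"""

removed = [
    row["old_trans_id"]
    for row in parse_report(sample)
    if row["new_trans_id"] == "removed_transcript"
]
print(removed)
```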
tama_filter_primary_transcripts_orf.py
To keep only one representative transcript model per gene model based on ORF predictions, use tama_filter_primary_transcripts_orf.py. This needs the output BED12 file from the ORF/NMD pipeline.
usage: tama_filter_primary_transcripts_orf.py [-h] [-b] [-o]
optional arguments:
  -h, --help  show this help message and exit
  -b B        Annotation bed file
  -o O        Output file name
Default command would look like this:
python tama_filter_primary_transcripts_orf.py -b input.bed -o output.bed
Detailed explanation of arguments:
-b B
The bed file is the annotation bed file that is the main output of the TAMA ORF/NMD Pipeline.
-o O
This is the output file name.
Outputs:
annotation.bed
Detailed explanation:
annotation.bed
This is the filtered annotation file in bed12 format.
tama_remove_fragment_models.py
Remove transcript models that may be fragments of longer transcript models in the same annotation. This tool requires a bed12 file annotation as input.
usage: tama_remove_fragment_models.py [-h] [-f] [-o] [-m] [-e] [-s] [-id] [-cds]
optional arguments:
  -h, --help  show this help message and exit
  -f F        Bed file
  -o O        Output file prefix
  -m M        Exon ends threshold/ splice junction threshold (Default is 10)
  -e E        Trans ends wobble threshold (Default is 500)
  -s S        Single exon overlap percent threshold (Default is 20 percent)
  -id ID      Use original ID line original_id (Default is tama_id line based on gene_id;transcript_id structure
  -cds CDS    Pull CDS option. Default is tama_cds where CDS regions matching TSS and TTS are ignored if another CDS is found. Use longest_cds to pick the longest CDS
Default command would look like this:
python tama_remove_fragment_models.py -f input.bed -o output_prefix
Detailed explanation of arguments:
-f F
The bed file is the annotation bed12 file. Could be the output from TAMA Collapse, TAMA Merge, or TAMA ORF/NMD pipeline.
-o O
This is the output prefix for generating output files.
-m M
This is the splice junction wobble threshold for matching fragments to longer models. Default is 10.
-e E
This is the transcript ends wobble threshold for matching fragments to longer models. For instance, if a fragment extends 100bp past the end of the longer model and the threshold is 200bp, the fragment is absorbed into the longer model (which is longer at its other end), and the fragment's more extended end is used for the final model. In other words, the final model is made as long as the evidence allows. Default is 500.
-s S
Single exon overlap percent threshold (Default is 20 percent). When matching single exon transcript models to each other, this overlap threshold determines whether they are labeled as a match.
-id ID
Use original_id to keep the original ID line. Default is tama_id, which uses an ID line based on the gene_id;transcript_id structure.
-cds CDS
Pull CDS option. Default is tama_cds where CDS regions matching TSS and TTS are ignored if another CDS is found. Use longest_cds to pick the longest CDS.
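The transcript ends wobble (-e) example can be sketched as follows. This covers only the end check, not splice junction matching, and it is an interpretation of the description rather than the tool's code. The function name is hypothetical, coordinates are assumed to be for a plus-strand 3' end (a larger coordinate is further downstream), and an overhang exactly equal to the threshold is assumed to pass.

```python
def merged_three_prime_end(fragment_end, model_end, wobble=500):
    """Return the 3' end of the merged model, or None if the fragment does not match.

    Assumes plus-strand coordinates, so a larger value extends further downstream.
    """
    overhang = fragment_end - model_end  # > 0: fragment extends past the longer model
    if overhang > wobble:
        return None  # extends too far to count as a fragment of this model
    # Keep whichever end reaches further, making the model as long as possible.
    return max(fragment_end, model_end)


# Fragment extends 100bp past the longer model, threshold 200bp: absorbed.
print(merged_three_prime_end(1100, 1000, wobble=200))
# Fragment extends 300bp past the longer model, threshold 200bp: no match.
print(merged_three_prime_end(1300, 1000, wobble=200))
```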
Outputs:
prefix.bed
prefix_discarded.txt
Detailed explanation:
prefix.bed
This is the filtered annotation file in bed12 format.
prefix_discarded.txt
This is a bed12 file with all the shorter models which were discarded.