TAMA GO: Transcript Filtering

GenomeRIK edited this page Aug 18, 2022 · 17 revisions

This set of tools in TAMA-GO is used to filter out transcript models that may represent transcriptional noise or sample contamination.

Note: If you are running these tools in series, make sure the source names in the generated read support file match the source names in the filelist file. The source names are important for keeping track of where reads come from. They also prevent issues with non-unique read IDs across sequencing runs.

tama_remove_polya_models_levels.py

To remove transcript models with genomic poly-A stretches immediately following the 3' end of the transcript model use tama_remove_polya_models_levels.py. Transcript models with 3' genomic poly-A stretches are likely to be the result of either internal priming or genomic contamination.

usage: tama_remove_polya_models_levels.py [-h] [-b] [-f] [-r] [-o] [-p] [-l] [-a] [-k]

optional arguments:

  -h, --help  show this help message and exit
  -b B        Annotation bed file
  -f F        Filelist file with Poly-A file names
  -r R        Read support file
  -o O        Output prefix (required)
  -p P        Percent poly-A threshold (default of 75.0)
  -l L        Level of removal (gene or transcript level)(default is gene level)
  -a A        Remove all models with Poly-A (all_polya or singleton_polya)(default is singleton_polya)
  -k K        Keep all multi-exon models (keep_multi or remove_multi)(default is keep_multi)

Default command would look like this:

python tama_remove_polya_models_levels.py -b bed -f filelist -r readsupport -o prefix
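
When the filter is called from a Python pipeline, the default command above can be assembled as an argument list. The following is a minimal sketch: the function names `build_polya_command` and `run` and all file names are placeholders that are not part of TAMA, and only flags listed in the help text above are used.

```python
import subprocess


def build_polya_command(bed, filelist, read_support, prefix, percent=None):
    """Assemble the argument list for tama_remove_polya_models_levels.py."""
    cmd = [
        "python", "tama_remove_polya_models_levels.py",
        "-b", bed,            # annotation bed file
        "-f", filelist,       # filelist of poly-A files
        "-r", read_support,   # read support file
        "-o", prefix,         # output prefix
    ]
    if percent is not None:
        # -p replaces the documented default threshold of 75.0
        cmd += ["-p", str(percent)]
    return cmd


def run(cmd, dry_run=True):
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if dry_run:
        return None
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical input file names, for illustration only.
    run(build_polya_command("merged.bed", "filelist_polya.txt",
                            "read_support.txt", "filtered"))
```

With `dry_run=True` the command is only printed, which makes it easy to check the arguments before starting the tool.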

Detailed explanation of arguments:

-b B

The bed file is the annotation bed file that is the main output of either TAMA Collapse or TAMA Merge.

-f F

The filelist file contains the names of the poly-A files that are the output from TAMA Collapse runs.

  source_name    source_file
  1       polya.txt

Note: Do not include the header "source_name source_file" in the filelist file.
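
For several TAMA Collapse runs the filelist can be written with a few lines of Python. This sketch assumes the two columns are tab-separated, which the spacing of the example suggests but which should be checked against your own files. The helper name `write_polya_filelist`, the source names and the poly-A file names are placeholders.

```python
import csv
import os
import tempfile


def write_polya_filelist(sources, out_path):
    """Write one 'source_name<TAB>source_file' row per source, without a header."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for source_name, polya_file in sources:
            writer.writerow([source_name, polya_file])


if __name__ == "__main__":
    # Hypothetical sources; the names must match those in the read support file.
    sources = [("a", "run_a_polya.txt"), ("b", "run_b_polya.txt")]
    out_path = os.path.join(tempfile.gettempdir(), "filelist_polya.txt")
    write_polya_filelist(sources, out_path)
    with open(out_path) as handle:
        print(handle.read(), end="")
```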

-r R

The readsupport file is the output file from running tama_read_support_levels.py on the output of the TAMA Collapse run or the TAMA Merge run.

-o O

This is the output prefix. The prefix will be used to create the output file names.

-p P

Percent poly-A threshold (default of 75.0). If the model has a genomic poly-A stretch that is greater than this threshold, it is treated as a noise model.

-l L

Level of removal (gene or transcript level). This determines under what conditions a transcript model is removed. Default is gene level. Gene level means the tool only removes genes that have a single transcript and that transcript shows genomic poly-A. Thus, if a gene has multiple transcripts, none of its transcript models will be removed even if one or more of them have genomic poly-A. Transcript level means any transcript with genomic poly-A will be removed. The reason for removing on gene level is that there could be genes with real genomic poly-A such as processed pseudogenes.

-a A

Remove all models with Poly-A (all_polya or singleton_polya). Default is singleton_polya. "singleton_polya" means that only transcript models with single read support and genomic poly-A will be removed. "all_polya" means that transcript models with multiple read support will also be removed when all of their reads have genomic poly-A. The two modes exist because multiple read support for a model is sometimes interpreted as evidence of a true model (singleton_polya) and sometimes as the result of PCR artifacts (all_polya).

-k K

Keep all multi-exon models (keep_multi or remove_multi)(default is remove_multi). "remove_multi" will allow the tool to remove multiple exon transcript models if they have genomic poly-A. "keep_multi" will keep all multiple exon transcript models even if they have genomic poly-A. The reason to keep multiple exon models with genomic poly-A is that they are likely to represent real transcripts that may be 3' truncated. Keeping them rescues gene counts.
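
The interaction of -l, -a and -k can be summarised as a small decision function. This is only a sketch of one reading of the descriptions above, not the tool's actual code: the function name `remove_for_polya`, its arguments and the order of the checks are assumptions. `has_polya` is taken to mean that all supporting reads of the model show genomic poly-A above the -p threshold. The three options are passed explicitly rather than relying on defaults.

```python
def remove_for_polya(has_polya, num_reads, num_exons, transcripts_in_gene,
                     *, level, mode, multi):
    """Return True if a model would be discarded under this reading of -l, -a, -k.

    level: "gene" or "transcript"             (-l)
    mode:  "singleton_polya" or "all_polya"   (-a)
    multi: "keep_multi" or "remove_multi"     (-k)
    """
    if not has_polya:
        return False  # no genomic poly-A signal, nothing to filter
    if multi == "keep_multi" and num_exons > 1:
        return False  # multi-exon models are always kept
    if mode == "singleton_polya" and num_reads > 1:
        return False  # only single-read models are candidates
    if level == "gene" and transcripts_in_gene > 1:
        return False  # gene level only removes single-transcript genes
    return True


if __name__ == "__main__":
    opts = dict(level="gene", mode="singleton_polya", multi="keep_multi")
    # Single-exon, single-read model in a single-transcript gene.
    print(remove_for_polya(True, 1, 1, 1, **opts))
    # Same model, but the gene has three transcripts.
    print(remove_for_polya(True, 1, 1, 3, **opts))
```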

Outputs:

  annotation.bed
  polya_report.txt
  polya_support.txt
  trash_polya.bed

Detailed explanation:

annotation.bed

This is the filtered annotation file in bed12 format.

polya_report.txt

This is a report file which shows the mapping of gene and transcript IDs from the pre-filtered to the filtered annotation file. It also shows the read support on gene and transcript level.

  old_gene_id     old_trans_id    source_line     num_reads       new_gene_id     new_trans_id    num_exons
  G1      G1.1    a,b     6       G1      G1.1    10
  G1      G1.2    a,b     6       G1      G1.2    10

polya_support.txt

This file shows the genomic poly-A signal for each read that was filtered out.

  trans_id        source  read_id source_trans_id strand  percent_polya   a_count polya_seq
  G1.114  a       m64012_181221_231243/79037480/ccs       G1.59   -       75.0    15      GAAAGAAAAAGAAAAGAAAT
  G1.114  a       m64012_181221_231243/94699746/ccs       G1.59   -       75.0    15      GGAAAGAAAAAGAAAAGAAA
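
In the two example rows, percent_polya and a_count agree with simply counting the A bases in polya_seq (15 of 20 bases, i.e. 75.0). The sketch below recomputes both values as a sanity check. It assumes the file is tab-separated with the header shown above, and that percent_polya is the fraction of A in polya_seq; the second point is inferred from the example rows and not confirmed by the tool's documentation. The function names are illustrative.

```python
import csv
import io

# The two example rows from above, written as tab-separated text.
EXAMPLE = (
    "trans_id\tsource\tread_id\tsource_trans_id\tstrand\tpercent_polya\ta_count\tpolya_seq\n"
    "G1.114\ta\tm64012_181221_231243/79037480/ccs\tG1.59\t-\t75.0\t15\tGAAAGAAAAAGAAAAGAAAT\n"
    "G1.114\ta\tm64012_181221_231243/94699746/ccs\tG1.59\t-\t75.0\t15\tGGAAAGAAAAAGAAAAGAAA\n"
)


def percent_a(seq):
    """Percentage of A bases in seq (0.0 for an empty sequence)."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("A") / len(seq)


def recompute_polya(handle):
    """Yield (read_id, recomputed a_count, recomputed percent) per row."""
    for row in csv.DictReader(handle, delimiter="\t"):
        seq = row["polya_seq"]
        yield row["read_id"], seq.upper().count("A"), percent_a(seq)


if __name__ == "__main__":
    for read_id, a_count, percent in recompute_polya(io.StringIO(EXAMPLE)):
        print(read_id, a_count, percent)
```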

trash_polya.bed

This is a bed12 file of all the removed transcript models.


tama_remove_single_read_models_levels.py

To remove transcript models with only single read support or only single source support use tama_remove_single_read_models_levels.py. This tool allows you to remove models based on read support which may be transcriptional noise, genomic contamination, or PCR artifacts.

usage: tama_remove_single_read_models_levels.py [-h] [-b] [-r] [-o] [-l] [-k] [-s] [-n]

optional arguments:

  -h, --help  show this help message and exit
  -b B        Annotation bed file
  -r R        Read support file
  -o O        Output prefix (required)
  -l L        Level of removal (gene or transcript level)(default is gene level)
  -k K        Keep all multi-exon models (keep_multi or remove_multi)(default is remove_multi)
  -s S        Requires models to have support from at least this number of sources. Default is 1
  -n N        Requires models to have support from at least this number of
              reads. Default is 2

Default command would look like this:

python tama_remove_single_read_models_levels.py -b bed -r readsupport -o prefix

Detailed explanation of arguments:

-b B

The bed file is the annotation bed file that is the main output of either TAMA Collapse or TAMA Merge.

-r R

The readsupport file is the output file from running tama_read_support_levels.py on the output of the TAMA Collapse run or the TAMA Merge run.

-o O

This is the output prefix. The prefix will be used to create the output file names.

-l L

Level of removal (gene or transcript level). Gene level will only remove genes with a single read; transcript level will remove all singleton transcripts.

-k K

Keep all multi-exon models (keep_multi or remove_multi)(default is keep_multi). "remove_multi" will allow the tool to remove multiple exon transcript models if they have only a single read support. "keep_multi" will keep all multiple exon transcript models regardless of the number of supporting reads.

-s S

Requires models to have support from at least this number of sources. Default is 1

-n N

Requires models to have support from at least this number of reads. Default is 2
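
The -s and -n thresholds can be thought of as two minimum counts that a model has to reach. The following sketch shows that check in isolation. The function name `passes_support` is illustrative, the defaults are the documented ones (1 source, 2 reads), and the interaction with -l and -k is left out.

```python
def passes_support(sources, num_reads, min_sources=1, min_reads=2):
    """Return True if a model meets both the -s and the -n threshold.

    sources:   iterable of source names supporting the model
    num_reads: total number of supporting reads
    """
    return len(set(sources)) >= min_sources and num_reads >= min_reads


if __name__ == "__main__":
    # Six reads from two sources: kept with the default thresholds.
    print(passes_support(["a", "b"], 6))
    # A single read from one source: fails the default -n of 2.
    print(passes_support(["a"], 1))
    # Six reads from one source: fails when two sources are required (-s 2).
    print(passes_support(["b"], 6, min_sources=2))
```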

Outputs:

  annotation.bed
  singleton.bed
  singleton_report.txt

Detailed explanation:

annotation.bed

This is the filtered annotation file in bed12 format.

singleton.bed

This is a bed12 file with all the removed transcript models.

singleton_report.txt

This is a report file which shows the mapping of gene and transcript IDs from the pre-filtered to the filtered annotation file. It also shows the read support on gene and transcript level.

  old_gene_id     old_trans_id    source_line     num_reads       new_gene_id     new_trans_id    num_exons
  G1      G1.1    a,b     6       G1      G1.1    1
  G1      G1.2    b       6       G1      removed_transcript      1
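
The report can be used to list which transcripts were dropped. The sketch below assumes the report is tab-separated with the header shown above and that removed models carry `removed_transcript` in the new_trans_id column, as in the example row. The function name `removed_transcripts` is illustrative.

```python
import csv
import io

# The example report from above, written as tab-separated text.
EXAMPLE_REPORT = (
    "old_gene_id\told_trans_id\tsource_line\tnum_reads\tnew_gene_id\tnew_trans_id\tnum_exons\n"
    "G1\tG1.1\ta,b\t6\tG1\tG1.1\t1\n"
    "G1\tG1.2\tb\t6\tG1\tremoved_transcript\t1\n"
)


def removed_transcripts(handle):
    """Return the old transcript IDs of all rows marked as removed."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["old_trans_id"] for row in reader
            if row["new_trans_id"] == "removed_transcript"]


if __name__ == "__main__":
    print(removed_transcripts(io.StringIO(EXAMPLE_REPORT)))
```

For a real run, open the report file written by the tool and pass the file handle instead of the in-memory example.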

tama_filter_primary_transcripts_orf.py

Use tama_filter_primary_transcripts_orf.py to keep only one representative transcript model per gene model based on ORF predictions. This needs the output BED12 file from the ORF/NMD pipeline.

usage: tama_filter_primary_transcripts_orf.py [-h] [-b] [-o]

optional arguments:

  -h, --help  show this help message and exit
  -b B        Annotation bed file
  -o O        Output file name

Default command would look like this:

python tama_filter_primary_transcripts_orf.py -b input.bed -o output.bed

Detailed explanation of arguments:

-b B

The bed file is the annotation bed file that is the main output of the TAMA ORF/NMD Pipeline.

-o O

This is the output file name.

Outputs:

  annotation.bed

Detailed explanation:

annotation.bed

This is the filtered annotation file in bed12 format.


tama_remove_fragment_models.py

Remove transcript models that may be fragments of longer transcript models in the same annotation. This tool requires a bed12 file annotation as input.

usage: tama_remove_fragment_models.py [-h] [-f] [-o] [-m] [-e] [-s] [-id] [-cds]

optional arguments:

  -h, --help  show this help message and exit
  -f F        Bed file
  -o O        Output file prefix
  -m M        Exon ends threshold/ splice junction threshold (Default is 10)
  -e E        Trans ends wobble threshold (Default is 500)
  -s S        Single exon overlap percent threshold (Default is 20 percent)
  -id ID      Use original ID line original_id (Default is tama_id line based on gene_id;transcript_id structure)
  -cds CDS    Pull CDS option. Default is tama_cds where CDS regions matching TSS and TTS are ignored if another CDS is found. Use longest_cds to pick the longest CDS

Default command would look like this:

python tama_remove_fragment_models.py -f input.bed -o output_prefix

Detailed explanation of arguments:

-f F

The bed file is the annotation bed12 file. Could be the output from TAMA Collapse, TAMA Merge, or TAMA ORF/NMD pipeline.

-o O

This is the output prefix for generating output files.

-m M

This is the splice junction wobble threshold for matching fragments to longer models. Default is 10.

-e E

This is the transcript ends wobble threshold for matching fragments to longer models. For instance, if the fragment extends 100 bp past one end of the longer model and the threshold is 200 bp, the fragment model will be absorbed into the longer model (which is longer on the other end), and the fragment's more extended end will be used for the final model. In other words, the final model is made as long as possible given the evidence. Default is 500.
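
The end-wobble rule can be illustrated with plain coordinates. This is a simplified sketch and not the tool's implementation: models are reduced to hypothetical (start, end) pairs, splice junction matching (-m) and strand are ignored, and treating an overhang equal to the threshold as a match is an assumption.

```python
def absorb_fragment(long_model, fragment, wobble=500):
    """Merge a fragment into a longer model if its overhang is within wobble.

    Both models are (start, end) tuples. Returns the merged (start, end),
    or None if the fragment extends past either end by more than wobble.
    """
    long_start, long_end = long_model
    frag_start, frag_end = fragment
    overhang_left = max(0, long_start - frag_start)
    overhang_right = max(0, frag_end - long_end)
    if overhang_left > wobble or overhang_right > wobble:
        return None
    # Use the outermost coordinate on each side for the final model.
    return (min(long_start, frag_start), max(long_end, frag_end))


if __name__ == "__main__":
    # Fragment extends 100 bp past the right end, threshold 200 bp: absorbed.
    print(absorb_fragment((1000, 5000), (3000, 5100), wobble=200))
    # Fragment extends 300 bp past the right end, threshold 200 bp: not absorbed.
    print(absorb_fragment((1000, 5000), (3000, 5300), wobble=200))
```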

-s S

Single exon overlap percent threshold (Default is 20 percent). When single exon transcript models are compared with each other, this overlap threshold is used to decide whether they match.

-id ID

Use original ID line original_id (Default is tama_id line based on gene_id;transcript_id structure).

-cds CDS

Pull CDS option. Default is tama_cds where CDS regions matching TSS and TTS are ignored if another CDS is found. Use longest_cds to pick the longest CDS.

Outputs:

  prefix.bed
  prefix_discarded.txt

Detailed explanation:

prefix.bed

This is the filtered annotation file in bed12 format.

prefix_discarded.txt

This is a bed12 file with all the shorter models which were discarded.