TAMA GO: Read Support

GenomeRIK edited this page Mar 9, 2020 · 33 revisions

This set of tools in TAMA-GO is used to collect all the information regarding supporting reads for each gene/transcript model.

tama_read_support_levels.py

To generate a file containing read support for each transcript, use tama_read_support_levels.py. This is a versatile tool that provides read support/count information for different levels of processing. It can be used to find read support for clustered reads, for collapsed transcript models, and for merged transcript models.

usage: tama_read_support_levels.py [-h] [-f] [-m] [-o] [-d] [-mt]

optional arguments:

  -h, --help  show this help message and exit
  -f F        File list
  -m M        Merge.txt file from after merging. Use "no_merge" if there is no merge file
  -o O        Output file prefix
  -d D        Ignore duplicate read name warning with -d dup_ok, default is to flag duplicates and terminate early.
  -mt MT      Merge type flag indicates the type of merge file used. Use -mt cupcake for cupcake file. Default is TAMA merge output.

Default command would look like this:

python tama_read_support_levels.py -f filelist.txt -o prefix -m mergefile

Detailed explanation of arguments:

-f F

The filelist file lists the files from which you want to pull read support. It must be tab separated with three columns (source name, source file, file type) and no header line. The example below pulls read support for a TAMA Merge annotation that merged multiple TAMA Collapse runs; its first row only labels the columns:

  source_name    source_file    file_type
  1       source_file_1.txt     trans_read
  2       source_file_2.txt     trans_read

Note: Do not include the header "source_name source_file file_type" in the filelist file.

Note: Do not use underscores in the source names as underscores will be used as name delimiters in the output.

File types accepted are "cluster", "trans_read", and "read_support".

Use "cluster" for getting the read support of each transcript model in a TAMA collapse annotation file if the mapped reads used were from the Cluster/Polish step in the IsoSeq3 pipeline (https://github.com/PacificBiosciences/IsoSeq/blob/master/README_v3.2.md). "cluster" is the file type of the "cluster_report.csv" file generated after running IsoSeq3 Cluster.

Use "trans_read" for getting the read support of each transcript model generated from a TAMA Merge run that merged the outputs from TAMA Collapse runs. "trans_read" is the file type of the "trans_read.bed" that is generated from TAMA Collapse runs.

Use "read_support" for getting the read support of each transcript model generated from a TAMA Merge run that merged multiple merged annotations. "read_support" is the file type for the file that is produced from "tama_read_support_levels.py".
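The filelist rules above (three tab-separated columns, no header, no underscores in source names, one of three file types) can be sketched as a small writer script. The function name write_filelist and the ALLOWED_TYPES constant are hypothetical helpers for illustration only; they are not part of TAMA.

```python
import csv

# File types listed in this documentation for the third filelist column.
ALLOWED_TYPES = {"cluster", "trans_read", "read_support"}


def write_filelist(path, entries):
    """Write (source_name, source_file, file_type) rows as a headerless TSV.

    Hypothetical helper, not part of TAMA: it only enforces the rules
    stated in this documentation before writing the file.
    """
    for source_name, source_file, file_type in entries:
        if "_" in source_name:
            # Underscores are used as name delimiters in the output.
            raise ValueError(f"source name contains an underscore: {source_name!r}")
        if file_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown file type: {file_type!r}")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(entries)


if __name__ == "__main__":
    write_filelist(
        "filelist.txt",
        [
            ("1", "source_file_1.txt", "trans_read"),
            ("2", "source_file_2.txt", "trans_read"),
        ],
    )
```

Validating before writing means a bad source name is caught before the read support run starts, rather than showing up later as ambiguous IDs in the output.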

-m M

The mergefile can be the "trans_read.bed" from TAMA Collapse, the "merge.txt" file from TAMA Merge, the report file from tama_remove_polya_models_levels.py or tama_remove_single_read_models_levels.py, or the "group.txt" file from running Cupcake Collapse. If you want to generate a read support file for a file that is not the product of merging (i.e., a TAMA Collapse run on mapped FLNC reads), you can use "no_merge" here to indicate there is no merge file. Note that in this case there can only be one file in the filelist file.

-o O

This is the prefix used for the file naming of all the output files.

-d D

Ignore the duplicate read name warning with -d dup_ok. The default is to flag duplicates and terminate early. Read IDs should be unique. If you have duplicate read IDs in the source files, this may indicate an issue with the source files.

-mt MT

Merge type flag indicates the type of merge file used. Use "-mt tama" for TAMA Merge file or "trans_read.bed" file. Use "-mt cupcake" for cupcake file. Use "-mt filter" for report files produced from tama_remove_polya_models_levels.py or tama_remove_single_read_models_levels.py. Default is TAMA merge output.
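As a rough illustration of how the arguments above combine, the sketch below assembles the command line as a Python list. build_command is a hypothetical helper, not part of TAMA; it uses only the flags and values described on this page, and it assumes the script is in the current directory.

```python
# Values this page documents for the -mt flag.
MERGE_TYPES = {"tama", "cupcake", "filter"}


def build_command(filelist, prefix, merge_file=None, merge_type=None, dup_ok=False):
    """Return the argument list for tama_read_support_levels.py.

    Hypothetical helper for illustration; it does not run anything.
    """
    cmd = [
        "python", "tama_read_support_levels.py",
        "-f", filelist,
        "-o", prefix,
        # "no_merge" tells the tool that there is no merge file.
        "-m", merge_file if merge_file else "no_merge",
    ]
    if merge_type is not None:
        if merge_type not in MERGE_TYPES:
            raise ValueError(f"unknown merge type: {merge_type!r}")
        cmd += ["-mt", merge_type]
    if dup_ok:
        # Ignore the duplicate read name warning.
        cmd += ["-d", "dup_ok"]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("filelist.txt", "prefix", "merge.txt")))
```

The returned list could be passed to subprocess.run(cmd, check=True) to launch the tool from a pipeline script.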

Outputs:

  prefix_read_support.txt

Detailed explanation:

prefix_read_support.txt

This contains the read support information for each transcript model. The format is as follows:

  merge_gene_id   merge_trans_id  gene_read_count trans_read_count        source_line     support_line
  G1      G1.1    521     3       1       1:m64012_181221_231243/81005164/ccs,m64012_181221_231243/98568367/ccs,m64012_181221_231243/6686664/ccs
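For downstream analysis, one row of this file can be split into its fields as sketched below. This is a minimal sketch, not TAMA code: parse_read_support_line is a hypothetical name, and the ";" delimiter between per-source groups in support_line is an assumption, since the example above shows only a single source.

```python
def parse_read_support_line(line):
    """Parse one data row of prefix_read_support.txt into a dict.

    Hypothetical helper. The file's header row (starting with
    "merge_gene_id") should be skipped before calling this.
    """
    fields = line.rstrip("\n").split("\t")
    gene_id, trans_id, gene_count, trans_count, source_line, support_line = fields

    support = {}
    # ASSUMPTION: groups from different sources are separated by ";".
    for group in support_line.split(";"):
        # Each group looks like "source:read1,read2,...".
        source, _, reads = group.partition(":")
        support[source] = reads.split(",")

    return {
        "merge_gene_id": gene_id,
        "merge_trans_id": trans_id,
        "gene_read_count": int(gene_count),
        "trans_read_count": int(trans_count),
        "source_line": source_line,
        "support": support,
    }


if __name__ == "__main__":
    row = (
        "G1\tG1.1\t521\t3\t1\t"
        "1:m64012_181221_231243/81005164/ccs,"
        "m64012_181221_231243/98568367/ccs,"
        "m64012_181221_231243/6686664/ccs"
    )
    record = parse_read_support_line(row)
    print(record["merge_trans_id"], len(record["support"]["1"]))
```

In the example row, the three reads listed for source 1 match the trans_read_count of 3.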

Examples of usage


Generating read support for TAMA Collapse run where Cluster/Polish reads were used

Filelist.txt

  cluspol       cluster_report.csv     cluster

Command

python tama_read_support_levels.py -f filelist.txt -o tama_collapse_cluster -m trans_read.bed


Generating read support for TAMA Collapse run where FLNC reads were used

Filelist.txt

  flnc       trans_read.bed     trans_read

Command

python tama_read_support_levels.py -f filelist.txt -o tama_collapse_flnc -m no_merge


Generating read support for TAMA Merge run where TAMA Collapse annotations were merged

Filelist.txt

  flncA       a_trans_read.bed     trans_read
  flncB       b_trans_read.bed     trans_read
  flncC       c_trans_read.bed     trans_read

Command

python tama_read_support_levels.py -f filelist.txt -o tama_merge_collapse -m merge.txt


Generating read support for TAMA Merge run where TAMA Merge annotations were merged

First you need to generate read support files from all the TAMA Merge annotations that were used in this TAMA Merge annotation.

Filelist.txt

  mergeA       a_read_support.txt     read_support
  mergeB       b_read_support.txt     read_support
  mergeC       c_read_support.txt     read_support

Command

python tama_read_support_levels.py -f filelist.txt -o tama_merge_merge -m merge.txt


Generating read support for a Cupcake Collapse run where Cluster/Polish reads were used

Filelist.txt

  cluspol       cluster_report.csv    cluster

Command

python tama_read_support_levels.py -f filelist.txt -o cupcake -m collapsed.group.txt -mt cupcake


DEPRECATED TOOLS BELOW


tama_read_support_collapse_cluster.py

To find read support for each TAMA Collapse run use tama_read_support_collapse_cluster.py.

USAGE:

python tama_read_support_collapse_cluster.py trans_read.bed cluster_file output_file

trans_read.bed - This is the output file from TAMA Collapse.

cluster_file - This is the cluster file from running clustering with the official Iso-Seq pipelines. For Iso-Seq3 the file will look like "prefix.primer_5p--primer_3p.cluster"; for Iso-Seq1 it will look like "cluster_report.csv". Alternatively, if you did not do clustering and mapped the FLNC reads directly to the genome, you can use the "trans_read.bed" file as the cluster_file input.

output_file - The name that you want the output file to be called.

OUTPUT:

The format of the output file is as follows:

  gene_id trans_id        gene_num_reads  trans_num_reads cluster_line
  G1      G1.1    5       1       4_c18833:m160316_194043_42149_c100936532550000001823211106101647_s1_p0/27978/2288_50_CCS

cluster_line - This field shows the clusters and reads supporting each transcript model. The field is subdivided using ";" to delimit cluster groups, ":" to delimit the cluster name from the read names, and "," to delimit read names.
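The three delimiters of cluster_line can be unpacked as in the following sketch. parse_cluster_line is a hypothetical helper written for this page, not part of the deprecated tool; it assumes cluster names contain no ":" character.

```python
def parse_cluster_line(cluster_line):
    """Map each cluster name to the list of reads that support it.

    Hypothetical helper: ";" separates cluster groups, ":" separates the
    cluster name from its reads, and "," separates read names.
    """
    clusters = {}
    for group in cluster_line.split(";"):
        # Split on the first ":" only (assumes no ":" in cluster names).
        cluster_name, _, reads = group.partition(":")
        clusters[cluster_name] = reads.split(",")
    return clusters


if __name__ == "__main__":
    example = (
        "4_c18833:"
        "m160316_194043_42149_c100936532550000001823211106101647_s1_p0/27978/2288_50_CCS"
    )
    for name, reads in parse_cluster_line(example).items():
        print(name, len(reads))
```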

tama_read_support_merge_collapse.py

To find read support for each TAMA Merge run use tama_read_support_merge_collapse.py.

USAGE:

python tama_read_support_merge_collapse.py filelist_file output_file

filelist_file - This is a file that lists all the read support files from the TAMA Collapse runs that were merged during TAMA Merge.

The format is as follows (tab delimited):

  read_support_collapse1.txt collapse1   /path/to/file/
  read_support_collapse2.txt collapse2   /path/to/file/

read_support_collapse1.txt - This is the name of the output file from running "tama_read_support_collapse_cluster" on the TAMA Collapse run.

collapse1 - This is the prefix used in the TAMA Merge run to identify the TAMA Collapse source run.

/path/to/file/ - This is the path to the "read_support_collapse1.txt" file.

output_file - The name that you want the output file to be called.

OUTPUT:

The format of the output file is as follows:

  merge_gene_id   merge_trans_id  gene_read_support       trans_read_support      source_prefix   source_trans_line       source_read_line
  G34     G34.2   8       3       ovary,testes    ovary_G21.2,testes_G26.2        m160315_220438_42149_c100936532550000001823211106101642_s1_p0/121400/26_2480_CCS,m160316_194043_42149_c100936532550000001823211106101647_s1_p0/30762/26_2477_CCS;m160316_064304_42149_c100936532550000001823211106101644_s1_p0/108271/27_2482_CCS

merge_gene_id - The gene ID as given in the TAMA Merge output file.

merge_trans_id - The transcript ID as given in the TAMA Merge output file.

gene_read_support - The number of reads supporting this gene.

trans_read_support - The number of reads supporting this transcript.

source_prefix - A list of the sources supporting this transcript.

source_trans_line - A list of the source transcripts supporting the merged transcript model.

source_read_line - A list of the read names supporting this merged transcript model. The groups of reads supporting each source transcript are delimited by ";". The reads within each group are delimited by ",". The order of the groups of reads matches the order given in the "source_trans_line" field.
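Because the read groups follow the same order as "source_trans_line", the two fields can be paired up as sketched below. pair_sources_with_reads is a hypothetical helper for illustration, not part of the deprecated tool.

```python
def pair_sources_with_reads(source_trans_line, source_read_line):
    """Map each source transcript to the reads that support it.

    Hypothetical helper: source transcripts are separated by ",", read
    groups by ";", and reads within a group by ",". Groups are matched
    to transcripts by position.
    """
    transcripts = source_trans_line.split(",")
    groups = [group.split(",") for group in source_read_line.split(";")]
    if len(transcripts) != len(groups):
        raise ValueError("number of source transcripts and read groups differ")
    return dict(zip(transcripts, groups))


if __name__ == "__main__":
    trans_line = "ovary_G21.2,testes_G26.2"
    read_line = (
        "m160315_220438_42149_c100936532550000001823211106101642_s1_p0/121400/26_2480_CCS,"
        "m160316_194043_42149_c100936532550000001823211106101647_s1_p0/30762/26_2477_CCS;"
        "m160316_064304_42149_c100936532550000001823211106101644_s1_p0/108271/27_2482_CCS"
    )
    for transcript, reads in pair_sources_with_reads(trans_line, read_line).items():
        print(transcript, len(reads))
```

For the example row above, this gives two reads for ovary_G21.2 and one for testes_G26.2, which together match the trans_read_support value of 3.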