TAMA GO: Read Support
This set of tools in TAMA-GO is used to collect all the information regarding supporting reads for each gene/transcript model.
tama_read_support_levels.py
To generate a file containing read support for each transcript, use tama_read_support_levels.py. This is a versatile tool that provides read support/count information at different levels of processing. It can be used to find read support for clustered reads, for collapsed transcript models, and for merged transcript models.
usage: tama_read_support_levels.py [-h] [-f] [-m] [-o] [-d] [-mt]
optional arguments:
-h, --help	show this help message and exit
-f F	File list
-m M	Merge.txt file from after merging. Use "no_merge" if there is no merge file
-o O	Output file prefix
-d D	Ignore duplicate read name warning with -d dup_ok, default is to flag duplicates and terminate early.
-mt MT	Merge type flag indicates the type of merge file used. Use -mt cupcake for cupcake file. Default is TAMA merge output.
Default command would look like this:
python tama_read_support_levels.py -f filelist.txt -o prefix -m mergefile
Detailed explanation of arguments:
-f F
The filelist file contains the names of the files from which you want to pull read support. The file must be tab separated and must not include a header line. The following example pulls read support for a TAMA Merge annotation that merged multiple TAMA Collapse runs (the first row only names the columns):
source_name	source_file	file_type
1	source_file_1.txt	trans_read
2	source_file_2.txt	trans_read
Note: Do not include the header "source_name source_file file_type" in the filelist file.
Note: Do not use underscores in the source names as underscores will be used as name delimiters in the output.
File types accepted are "cluster", "trans_read", and "read_support".
Use "cluster" for getting the read support of each transcript model in a TAMA collapse annotation file if the mapped reads used were from the Cluster/Polish step in the IsoSeq3 pipeline (https://github.com/PacificBiosciences/IsoSeq/blob/master/README_v3.2.md). "cluster" is the file type of the "cluster_report.csv" file generated after running IsoSeq3 Cluster.
Use "trans_read" for getting the read support of each transcript model generated from a TAMA Merge run that merged the outputs from TAMA Collapse runs. "trans_read" is the file type of the "trans_read.bed" that is generated from TAMA Collapse runs.
Use "read_support" for getting the read support of each transcript model generated from a TAMA Merge run that merged multiple merged annotations. "read_support" is the file type for the file that is produced from "tama_read_support_levels.py".
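The filelist rules above (tab separated, no header, no underscores in source names, one of three file types) can be checked while writing the file. The following is a minimal sketch, not part of TAMA: `write_filelist` and `VALID_FILE_TYPES` are hypothetical names introduced here for illustration.

```python
import csv
import io

# File types named in this documentation.
VALID_FILE_TYPES = {"cluster", "trans_read", "read_support"}


def write_filelist(entries, handle):
    """Write (source_name, source_file, file_type) rows as tab-separated lines.

    No header row is written, because the filelist must not contain one.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for source_name, source_file, file_type in entries:
        # Underscores are used as name delimiters in the output.
        if "_" in source_name:
            raise ValueError(f"source name {source_name!r} contains an underscore")
        if file_type not in VALID_FILE_TYPES:
            raise ValueError(f"unknown file type {file_type!r}")
        writer.writerow([source_name, source_file, file_type])


# Build the example filelist in memory; use open("filelist.txt", "w") for a real file.
buffer = io.StringIO()
write_filelist(
    [
        ("1", "source_file_1.txt", "trans_read"),
        ("2", "source_file_2.txt", "trans_read"),
    ],
    buffer,
)
filelist_text = buffer.getvalue()
```

Validating at write time catches a bad source name before the tool is run, rather than after the output IDs have become ambiguous.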
-m M
The mergefile can be the "trans_read.bed" from TAMA Collapse, the "merge.txt" file from TAMA Merge, the report file from tama_remove_polya_models_levels.py or tama_remove_single_read_models_levels.py, or the "group.txt" file from running Cupcake Collapse. If you want to generate a read support file for a file that is not the product of merging (i.e. a TAMA Collapse run on mapped FLNC), use "no_merge" here to indicate that there is no merge file. In that case the filelist file can contain only one file.
-o O
This is the prefix used for the file naming of all the output files.
-d D
Ignore the duplicate read name warning with -d dup_ok. The default is to flag duplicates and terminate early. Read IDs should be unique. If you have duplicate read IDs in the source files, this may indicate an issue with the source files.
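If you want to look for duplicate read IDs yourself before deciding whether to pass -d dup_ok, a check along these lines may help. This is a sketch only and is not the check the tool performs; `find_duplicate_read_ids` is a hypothetical helper, and how you extract the read IDs from your source files is left to you.

```python
from collections import Counter


def find_duplicate_read_ids(read_ids):
    """Return the read IDs that occur more than once, sorted for stable output."""
    counts = Counter(read_ids)
    return sorted(read_id for read_id, count in counts.items() if count > 1)


# Illustrative read IDs; the second one is repeated.
example_ids = [
    "m64012_181221_231243/81005164/ccs",
    "m64012_181221_231243/98568367/ccs",
    "m64012_181221_231243/98568367/ccs",
]
duplicates = find_duplicate_read_ids(example_ids)
```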
-mt MT
Merge type flag indicates the type of merge file used. Use "-mt tama" for TAMA Merge file or "trans_read.bed" file. Use "-mt cupcake" for cupcake file. Use "-mt filter" for report files produced from tama_remove_polya_models_levels.py or tama_remove_single_read_models_levels.py. Default is TAMA merge output.
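The arguments described above can be assembled programmatically, for example when running the tool from a pipeline script. The sketch below only builds the argument list from the flags documented here; `build_command` and `MERGE_TYPES` are hypothetical names, and the script path is assumed to be in the current directory.

```python
import shlex

# Values documented for the -mt flag.
MERGE_TYPES = {"tama", "cupcake", "filter"}


def build_command(filelist, prefix, mergefile="no_merge", merge_type=None, dup_ok=False):
    """Return the argument list for one tama_read_support_levels.py run."""
    command = [
        "python", "tama_read_support_levels.py",
        "-f", filelist,
        "-o", prefix,
        "-m", mergefile,
    ]
    if merge_type is not None:
        if merge_type not in MERGE_TYPES:
            raise ValueError(f"unknown merge type {merge_type!r}")
        command += ["-mt", merge_type]
    if dup_ok:
        # Only add this after confirming that duplicate read IDs are expected.
        command += ["-d", "dup_ok"]
    return command


command = build_command("filelist.txt", "cupcake", "collapsed.group.txt", merge_type="cupcake")
command_line = shlex.join(command)  # printable form of the same command
```

The list form can be passed to `subprocess.run(command, check=True)` to execute the tool.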
Outputs:
prefix_read_support.txt
Detailed explanation:
prefix_read_support.txt
This contains the read support information for each transcript model. The format is as follows:
merge_gene_id	merge_trans_id	gene_read_count	trans_read_count	source_line	support_line
G1	G1.1	521	3	1	1:m64012_181221_231243/81005164/ccs,m64012_181221_231243/98568367/ccs,m64012_181221_231243/6686664/ccs
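A line of prefix_read_support.txt can be split into its six fields as in the sketch below. Several details are assumptions that are not stated above: that the columns are tab separated, that multiple sources in source_line are separated by ",", and that multiple source groups in support_line are separated by ";". The example shown has a single source, so only the ":" and "," delimiters are confirmed by it. `ReadSupport` and `parse_read_support_line` are hypothetical names.

```python
from typing import Dict, List, NamedTuple


class ReadSupport(NamedTuple):
    gene_id: str
    trans_id: str
    gene_read_count: int
    trans_read_count: int
    sources: List[str]
    reads_by_source: Dict[str, List[str]]


def parse_read_support_line(line):
    """Parse one data line of prefix_read_support.txt (assumed tab separated)."""
    fields = line.rstrip("\n").split("\t")
    gene_id, trans_id, gene_count, trans_count, source_line, support_line = fields
    reads_by_source = {}
    # Assumption: ";" separates source groups when there is more than one source.
    for group in support_line.split(";"):
        source, _, reads = group.partition(":")
        reads_by_source[source] = reads.split(",")
    return ReadSupport(
        gene_id, trans_id, int(gene_count), int(trans_count),
        source_line.split(","),  # assumption: "," separates multiple sources
        reads_by_source,
    )


example_line = (
    "G1\tG1.1\t521\t3\t1\t"
    "1:m64012_181221_231243/81005164/ccs,"
    "m64012_181221_231243/98568367/ccs,"
    "m64012_181221_231243/6686664/ccs"
)
record = parse_read_support_line(example_line)
```

For this example, the number of reads listed for source 1 equals trans_read_count.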
Examples of usage
Generating read support for TAMA Collapse run where Cluster/Polish reads were used
Filelist.txt
cluspol cluster_report.csv cluster
Command
python tama_read_support_levels.py -f filelist.txt -o tama_collapse_cluster -m trans_read.bed
Generating read support for TAMA Collapse run where FLNC reads were used
Filelist.txt
flnc trans_read.bed trans_read
Command
python tama_read_support_levels.py -f filelist.txt -o tama_collapse_flnc -m no_merge
Generating read support for TAMA Merge run where TAMA Collapse annotations were merged
Filelist.txt
flncA	a_trans_read.bed	trans_read
flncB	b_trans_read.bed	trans_read
flncC	c_trans_read.bed	trans_read
Command
python tama_read_support_levels.py -f filelist.txt -o tama_merge_collapse -m merge.txt
Generating read support for TAMA Merge run where TAMA Merge annotations were merged
First you need to generate read support files from all the TAMA Merge annotations that were used in this TAMA Merge annotation.
Filelist.txt
mergeA	a_read_support.txt	read_support
mergeB	b_read_support.txt	read_support
mergeC	c_read_support.txt	read_support
Command
python tama_read_support_levels.py -f filelist.txt -o tama_merge_merge -m merge.txt
Generating read support for a Cupcake Collapse run where Cluster/Polish reads were used
Filelist.txt
cluspol cluster_report.csv cluster
Command
python tama_read_support_levels.py -f filelist.txt -o cupcake -m collapsed.group.txt -mt cupcake
DEPRECATED TOOLS BELOW
tama_read_support_collapse_cluster.py
To find read support for each TAMA Collapse run use tama_read_support_collapse_cluster.py.
USAGE:
python tama_read_support_collapse_cluster.py trans_read.bed cluster_file output_file
trans_read.bed - This is the output file from TAMA Collapse.
cluster_file - This is the cluster file from running clustering with the official Iso-Seq pipelines. For Iso-Seq3 the file will look like "prefix.primer_5p--primer_3p.cluster"; for Iso-Seq1 the cluster file will look like "cluster_report.csv". Alternatively, if you did not do clustering and mapped the FLNC reads directly to the genome, you can use the "trans_read.bed" file as the cluster_file input.
output_file - The name that you want the output file to be called.
OUTPUT:
The format of the output file is as follows:
gene_id	trans_id	gene_num_reads	trans_num_reads	cluster_line
G1	G1.1	5	1	4_c18833:m160316_194043_42149_c100936532550000001823211106101647_s1_p0/27978/2288_50_CCS
cluster_line - This field shows the clusters and reads supporting each transcript model. The field is subdivided using ";" to delimit between cluster groups, ":" to delimit between cluster name and read names, and "," to delimit between read names.
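The three delimiters of cluster_line can be unpacked as in the following sketch, which maps each cluster name to its list of read names. `parse_cluster_line` is a hypothetical helper, and the code assumes that cluster names do not themselves contain ":".

```python
def parse_cluster_line(cluster_line):
    """Map each cluster name to the list of read names that support it."""
    clusters = {}
    for group in cluster_line.split(";"):        # ";" separates cluster groups
        cluster_name, _, reads = group.partition(":")  # ":" separates name from reads
        clusters[cluster_name] = reads.split(",")      # "," separates read names
    return clusters


example_cluster_line = (
    "4_c18833:"
    "m160316_194043_42149_c100936532550000001823211106101647_s1_p0/27978/2288_50_CCS"
)
clusters = parse_cluster_line(example_cluster_line)
```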
tama_read_support_merge_collapse.py
To find read support for each TAMA Merge run use tama_read_support_merge_collapse.py.
USAGE:
python tama_read_support_merge_collapse.py filelist_file output_file
filelist_file - This is a file that lists all the read support files from the TAMA Collapse runs that were merged during TAMA Merge.
The format is as follows (tab delimited):
read_support_collapse1.txt	collapse1	/path/to/file/
read_support_collapse2.txt	collapse2	/path/to/file/
read_support_collapse1.txt - This is the name of the output file from running "tama_read_support_collapse_cluster" on the TAMA Collapse run.
collapse1 - This is the prefix used in the TAMA Merge run to identify the TAMA Collapse source run
/path/to/file/ - This is the path to the "read_support_collapse1.txt" file.
output_file - The name that you want the output file to be called.
OUTPUT:
The format of the output file is as follows:
merge_gene_id	merge_trans_id	gene_read_support	trans_read_support	source_prefix	source_trans_line	source_read_line
G34	G34.2	8	3	ovary,testes	ovary_G21.2,testes_G26.2	m160315_220438_42149_c100936532550000001823211106101642_s1_p0/121400/26_2480_CCS,m160316_194043_42149_c100936532550000001823211106101647_s1_p0/30762/26_2477_CCS;m160316_064304_42149_c100936532550000001823211106101644_s1_p0/108271/27_2482_CCS
merge_gene_id - The gene ID as given in the TAMA Merge output file.
merge_trans_id - The transcript ID as given in the TAMA Merge output file.
gene_read_support - The number of reads supporting this gene.
trans_read_support - The number of reads supporting this transcript.
source_prefix - A list of the sources supporting this transcript.
source_trans_line - A list of the source transcripts supporting the merged transcript model.
source_read_line - A list of the read names supporting this merged transcript model. The groups of reads supporting each source transcript are delimited by ";". The reads within each group are delimited by ",". The order of the groups of reads matches the order given in the "source_trans_line" field.
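Because the read groups in source_read_line follow the order of source_trans_line, the two fields can be paired up directly. The sketch below does this for the example row above; `reads_by_source_transcript` is a hypothetical helper, and it assumes that source transcripts in source_trans_line are separated by ",", as they are in the example.

```python
def reads_by_source_transcript(source_trans_line, source_read_line):
    """Pair each source transcript with the group of reads that supports it."""
    transcripts = source_trans_line.split(",")
    # ";" separates the read groups, "," separates reads within a group.
    groups = [group.split(",") for group in source_read_line.split(";")]
    if len(transcripts) != len(groups):
        raise ValueError("number of read groups does not match number of source transcripts")
    return dict(zip(transcripts, groups))


support = reads_by_source_transcript(
    "ovary_G21.2,testes_G26.2",
    "m160315_220438_42149_c100936532550000001823211106101642_s1_p0/121400/26_2480_CCS,"
    "m160316_194043_42149_c100936532550000001823211106101647_s1_p0/30762/26_2477_CCS;"
    "m160316_064304_42149_c100936532550000001823211106101644_s1_p0/108271/27_2482_CCS",
)
total_reads = sum(len(reads) for reads in support.values())
```

In this example the total number of reads across the groups matches the trans_read_support value of 3.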