TAMA GO: Read Support
This set of tools in TAMA-GO is used to collect all the information regarding supporting reads for each gene/transcript model.
tama_read_support_levels.py
To generate a file containing read support for each transcript, use tama_read_support_levels.py. This is a versatile tool which provides read support/count information for different levels of processing. It can be used to find read support for clustered reads, for collapsed transcript models, and for merged transcript models.
usage: tama_read_support_levels.py [-h] [-f] [-m] [-o] [-d] [-cc]
optional arguments:
-h, --help  show this help message and exit
-f F        File list
-m M        Merge.txt file from after merging. Use "no_merge" if there is no merge file
-o O        Output file prefix
-d D        Ignore duplicate read name warning with -d dup_ok, default is to flag duplicates and terminate early.
-cc CC      Cupcake flag indicates that merge file is from cupcake output. Use -cc cupcake to turn on. Default is TAMA output.
Default command would look like this:
python tama_read_support_levels.py -f filelist.txt -o prefix -m mergefile
Detailed explanation of arguments:
-f F
The filelist file lists the files from which you want to pull read support. It must be tab separated with three columns and no header line. The example below is for pulling read support for a TAMA Merge annotation which merged multiple TAMA Collapse runs (the first line names the columns and must not appear in the actual file):
source_name	source_file	file_type
1	source_file_1.txt	trans_read
2	source_file_2.txt	trans_read
Note: Do not include the header "source_name source_file file_type" in the filelist file.
File types accepted are "cluster", "trans_read", and "read_support".
Use "cluster" for getting the read support of each transcript model in a TAMA collapse annotation file if the mapped reads used were from the Cluster/Polish step in the IsoSeq3 pipeline (https://github.com/PacificBiosciences/IsoSeq/blob/master/README_v3.2.md). "cluster" is the file type of the "cluster_report.csv" file generated after running IsoSeq3 Cluster.
Use "trans_read" for getting the read support of each transcript model generated from a TAMA Merge run that merged the outputs from TAMA Collapse runs. "trans_read" is the file type of the "trans_read.bed" that is generated from TAMA Collapse runs.
Use "read_support" for getting the read support of each transcript model generated from a TAMA Merge run that merged multiple merged annotations. "read_support" is the file type for the file that is produced from "tama_read_support_levels.py".
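The filelist layout described above can be produced with a few lines of Python. This is only a minimal sketch: the helper names format_filelist and write_filelist are hypothetical and not part of TAMA, and the only facts taken from this page are the three tab-separated columns, the absence of a header, and the three accepted file types.

```python
# Hypothetical helpers (not part of TAMA) for building the -f filelist.
# Accepted file types as listed in this documentation.
VALID_FILE_TYPES = {"cluster", "trans_read", "read_support"}


def format_filelist(entries):
    """Return filelist text: one tab-separated line per entry, no header.

    Each entry is a (source_name, source_file, file_type) tuple.
    """
    lines = []
    for source_name, source_file, file_type in entries:
        if file_type not in VALID_FILE_TYPES:
            raise ValueError(f"unsupported file type: {file_type!r}")
        lines.append(f"{source_name}\t{source_file}\t{file_type}")
    return "".join(line + "\n" for line in lines)


def write_filelist(path, entries):
    """Validate all entries first, then write the filelist to path."""
    text = format_filelist(entries)
    with open(path, "w") as handle:
        handle.write(text)
```

For example, write_filelist("filelist.txt", [("1", "source_file_1.txt", "trans_read"), ("2", "source_file_2.txt", "trans_read")]) would write the two-line filelist shown in the example above.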
-m M
The mergefile can be either the "trans_read.bed" from TAMA Collapse or the "merge.txt" file from TAMA Merge. If you want to generate the read support file for an annotation that is not the product of merging (i.e., a TAMA Collapse run on mapped FLNC), you can use "no_merge" here to indicate there is no merge file. Note that in this case there can only be one file in the filelist file.
-o O
This is the prefix used for the file naming of all the output files.
-d D
Ignore the duplicate read name warning with -d dup_ok. The default is to flag duplicates and terminate early. Read IDs should be unique. If you have duplicate read IDs in the source files, this may indicate an issue with the source files.
-cc CC
Cupcake flag indicates that merge file is from cupcake output. Use -cc cupcake to turn on. Default is TAMA output (-cc tama).
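When the tool is called from a pipeline script, the options described above can be assembled programmatically. The sketch below is an illustration only: the function build_read_support_command and its parameter names are hypothetical, and it uses only the flags and values documented on this page (-f, -o, -m, -d dup_ok, -cc cupcake).

```python
# Hypothetical helper (not part of TAMA): build the argument list for
# tama_read_support_levels.py from the options documented above.
def build_read_support_command(filelist, prefix, merge_file="no_merge",
                               allow_duplicates=False, cupcake=False,
                               script="tama_read_support_levels.py"):
    command = ["python", script, "-f", filelist, "-o", prefix, "-m", merge_file]
    if allow_duplicates:
        # Continue despite duplicate read names instead of terminating early.
        command += ["-d", "dup_ok"]
    if cupcake:
        # Merge file comes from cupcake output rather than TAMA.
        command += ["-cc", "cupcake"]
    return command


command = build_read_support_command("filelist.txt", "prefix", "mergefile")
# To actually run it:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file paths contain spaces.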
USAGE:
python tama_read_support_levels.py -f filelist -o prefix -m mergefile
trans_read.bed - This is the output file from TAMA Collapse.
cluster_file - This is the cluster file from running clustering with the official Iso-Seq pipelines. For Iso-Seq3 the file will look like "prefix.primer_5p--primer_3p.cluster"; for Iso-Seq1 the cluster file will look like "cluster_report.csv". Alternatively, if you did not do clustering and mapped the FLNC directly to the genome, you can use the "trans_read.bed" file as the cluster_file input.
output_file - The name that you want the output file to be called.
OUTPUT:
The format of the output file is as follows:
gene_id	trans_id	gene_num_reads	trans_num_reads	cluster_line
G1	G1.1	5	1	4_c18833:m160316_194043_42149_c100936532550000001823211106101647_s1_p0/27978/2288_50_CCS
cluster_line - This field shows the clusters and reads supporting each transcript model. The field is sub-divided using ";" to delimit between cluster groups, ":" to delimit between cluster name and read names, and "," to delimit between read names.
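The nested delimiters of the cluster_line field can be unpacked with plain string splitting. This is a minimal sketch, assuming the field follows exactly the ";" / ":" / "," scheme described above; the function name parse_cluster_line is hypothetical and not part of TAMA.

```python
# Hypothetical helper (not part of TAMA): split a cluster_line value into
# a mapping of cluster name -> list of read names.
def parse_cluster_line(cluster_line):
    clusters = {}
    for group in cluster_line.split(";"):      # ";" separates cluster groups
        if not group:
            continue
        # ":" separates the cluster name from its reads (split on the first one)
        cluster_name, _, reads = group.partition(":")
        read_names = [name for name in reads.split(",") if name]  # "," separates reads
        clusters.setdefault(cluster_name, []).extend(read_names)
    return clusters


example = ("4_c18833:m160316_194043_42149_c100936532550000001823211106101647"
           "_s1_p0/27978/2288_50_CCS")
clusters = parse_cluster_line(example)
```

For the example row above, clusters holds one key, "4_c18833", mapped to a list containing the single supporting read, which agrees with the trans_num_reads value of 1.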
DEPRECATED TOOLS BELOW
tama_read_support_collapse_cluster.py
To find read support for each TAMA Collapse run use tama_read_support_collapse_cluster.py.
USAGE:
python tama_read_support_collapse_cluster.py trans_read.bed cluster_file output_file
trans_read.bed - This is the output file from TAMA Collapse.
cluster_file - This is the cluster file from running clustering with the official Iso-Seq pipelines. For Iso-Seq3 the file will look like "prefix.primer_5p--primer_3p.cluster"; for Iso-Seq1 the cluster file will look like "cluster_report.csv". Alternatively, if you did not do clustering and mapped the FLNC directly to the genome, you can use the "trans_read.bed" file as the cluster_file input.
output_file - The name that you want the output file to be called.
OUTPUT:
The format of the output file is as follows:
gene_id	trans_id	gene_num_reads	trans_num_reads	cluster_line
G1	G1.1	5	1	4_c18833:m160316_194043_42149_c100936532550000001823211106101647_s1_p0/27978/2288_50_CCS
cluster_line - This field shows the clusters and reads supporting each transcript model. The field is sub-divided using ";" to delimit between cluster groups, ":" to delimit between cluster name and read names, and "," to delimit between read names.
tama_read_support_merge_collapse.py
To find read support for each TAMA Merge run use tama_read_support_merge_collapse.py.
USAGE:
python tama_read_support_merge_collapse.py filelist_file output_file
filelist_file - This is a file that lists all the read support files from the TAMA Collapse runs that were merged during TAMA Merge.
The format is as follows (tab delimited):
read_support_collapse1.txt	collapse1	/path/to/file/
read_support_collapse2.txt	collapse2	/path/to/file/
read_support_collapse1.txt - This is the name of the output file from running "tama_read_support_collapse_cluster" on the TAMA Collapse run.
collapse1 - This is the prefix used in the TAMA Merge run to identify the TAMA Collapse source run.
/path/to/file/ - This is the path to the "read_support_collapse1.txt" file.
output_file - The name that you want the output file to be called.
OUTPUT:
The format of the output file is as follows:
merge_gene_id	merge_trans_id	gene_read_support	trans_read_support	source_prefix	source_trans_line	source_read_line
G34	G34.2	8	3	ovary,testes	ovary_G21.2,testes_G26.2	m160315_220438_42149_c100936532550000001823211106101642_s1_p0/121400/26_2480_CCS,m160316_194043_42149_c100936532550000001823211106101647_s1_p0/30762/26_2477_CCS;m160316_064304_42149_c100936532550000001823211106101644_s1_p0/108271/27_2482_CCS
merge_gene_id - The gene ID as given in the TAMA Merge output file.
merge_trans_id - The transcript ID as given in the TAMA Merge output file.
gene_read_support - The number of reads supporting this gene.
trans_read_support - The number of reads supporting this transcript.
source_prefix - A list of the sources supporting this transcript.
source_trans_line - A list of the source transcripts supporting the merged transcript model.
source_read_line - A list of the read names supporting this merged transcript model. The groups of reads supporting each source transcript are delimited by ";". The reads within each group are delimited by ",". The order of the groups of reads matches the order given in the "source_trans_line" field.
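Because the read groups in source_read_line follow the order of source_trans_line, the two fields can be zipped together to recover which reads support which source transcript. The following is a minimal sketch under that assumption; the function name pair_sources_with_reads is hypothetical and not part of TAMA.

```python
# Hypothetical helper (not part of TAMA): pair each source transcript with
# the group of reads that supports it.
def pair_sources_with_reads(source_trans_line, source_read_line):
    source_transcripts = source_trans_line.split(",")   # "," separates transcripts
    # ";" separates read groups, "," separates reads within a group
    read_groups = [group.split(",") for group in source_read_line.split(";")]
    if len(source_transcripts) != len(read_groups):
        raise ValueError(
            f"{len(source_transcripts)} source transcripts but "
            f"{len(read_groups)} read groups"
        )
    # A list of pairs keeps the original order and tolerates repeated names.
    return list(zip(source_transcripts, read_groups))


# Values taken from the example output row above.
pairs = pair_sources_with_reads(
    "ovary_G21.2,testes_G26.2",
    "m160315_220438_42149_c100936532550000001823211106101642_s1_p0/121400/26_2480_CCS,"
    "m160316_194043_42149_c100936532550000001823211106101647_s1_p0/30762/26_2477_CCS;"
    "m160316_064304_42149_c100936532550000001823211106101644_s1_p0/108271/27_2482_CCS",
)
total_reads = sum(len(reads) for _, reads in pairs)
```

For the example row, the first source transcript is paired with two reads and the second with one, so total_reads agrees with the trans_read_support value of 3.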