Handling multiple samples: Creating a master transcriptome
Last Updated: 12/21/2022
What do you do if you have multiple Iso-Seq runs? How do you connect the novel isoforms from one sample to another?
There are a few existing ways to do it and they all have their pros and cons.
Tool | Input | Output | Accepts Reference? | Designed for Long Reads? |
---|---|---|---|---|
gffcompare | GTF | one combined GTF | Yes, optional | No |
Cupcake chaining | GTF, Counts, (FASTA) | GTF, Counts, (FASTA) | No | Yes |
TAMA Merge | BED | BED | Yes | Yes |
TALON | MD-tagged SAM | CSV, GTF | Yes, required | Yes |
Cupcake chaining is very simple. For each sample, you provide a collapsed GFF3/GTF, a FL count file, and optionally a reference FASTA.
Cupcake does this chaining iteratively, so as the number of samples increases, issues with matching slightly different isoforms become increasingly difficult. You may begin to see redundant isoforms (ex: there is no 1-to-1 mapping between isoforms across samples; instead, an isoform from sample 1 matches two isoforms in sample 2, etc).
TL;DR I recommend Cupcake chaining for small-scale (ex: 2-4) sample merging.
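As a rough sketch of what a Cupcake chaining run can look like: you write a config file listing each sample's collapsed output, then run `chain_samples.py`. The config keys and sample paths below are illustrative placeholders from memory; double-check the Cupcake wiki for the authoritative format.

```
# sample.config -- one SAMPLE line per sample, pointing at its collapsed output directory
SAMPLE=sample1;/path/to/sample1/
SAMPLE=sample2;/path/to/sample2/

GROUP_FILENAME=touse.group.txt
GFF_FILENAME=touse.gff
COUNT_FILENAME=touse.count.txt
```

Then run something like `chain_samples.py sample.config count_fl` to produce the chained GFF and count files.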
TAMA Merge is part of the TAMA package that is based on BED file format. It handles merging of samples more gracefully than Cupcake, but it only supports BED formats and does not appear to accept count information.
TL;DR I recommend TAMA Merge for small-scale (ex: 2-4) sample merging.
TALON was developed to handle the ENCODE4 consortium Iso-Seq samples and is database-driven. You start with a foundational DB (ex: Gencode) and incrementally add samples using a SQL database.
TL;DR I recommend TALON for moderate-to-large (ex: >4) sample merging where a reference transcriptome readily exists.
The TALON documentation on their GitHub is excellent, and below I'm simply providing an example of how to use it.
Note that in the TALON tutorial below, I'm using it slightly differently from how the TALON developers devised it (and hence, there are some changes). For example, I'm not giving TALON SAM files of aligned and cleaned full-length reads; those are very large SAM files with millions of alignments, many of which come from the same isoforms.
Instead, I'm using TALON following the isoseq3 --> Cupcake collapse --> SQANTI3 classification and filtering pathway. So my input to TALON are unique transcripts that have gone through filtering and each transcript would have an associated FL count already. This means TALON will run pretty fast since the input file is much smaller, and I'm also going to skip the TALON filtering step.
For this example, I'm using Gencode v36 as the initial database.
talon_initialize_database --f gencode.v36.annotation.gtf \
--g hg38 --a gencode36 --o myTest
TALON will only accept MD-tagged SAM files. Let's say you already ran Cupcake collapse or perhaps already went through SQANTI3 filtering. For each of the samples you want to "chain/combine", you most likely have either a GFF3 file or a FASTA file.
If you have a GFF3 file, you can re-constitute a FASTA file from the genome using, say, `bedtools getfasta` or `gffread`.

NOTE: make sure your FASTA sequence IDs are `PB.X.Y` without additional description or text, if you want to be able to link them cleanly with other reports (ex: from SQANTI3). For example, your FASTA file may have IDs that look like:

PB.1.1|chr16:23548331-23548927(-)|RPFWgene

which has extra text starting with the `|` symbol. One way to clean it up is to run the following sed command on your FASTA file:

sed -i 's/|.*//g' 5merge.collapsed.rep.fa
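To illustrate, here is the same cleanup applied to a toy FASTA with the problematic header from above (the file name is just for the demo):

```shell
# make a tiny FASTA with an ID carrying extra |-delimited text
printf '>PB.1.1|chr16:23548331-23548927(-)|RPFWgene\nACGT\n' > demo.fa
# strip everything from the first | onward, in place (GNU sed; on macOS use sed -i '')
sed -i 's/|.*//g' demo.fa
head -1 demo.fa   # >PB.1.1
```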
Now we align the FASTA file to the reference genome and make an MD-tagged SAM file. Note that we're using the `splice:hq` preset for alignment here, as it is more sensitive to small exons in HiFi read alignment.
minimap2 -ax splice:hq -uf --secondary=no -C5 hg38.fa -t30 \
5merge.collapsed.rep.fa > 5merge.collapsed.rep.fa.hg38.sam
samtools calmd 5merge.collapsed.rep.fa.hg38.sam \
hg38.fa \
--output-fmt sam > 5merge.collapsed.rep.fa.hg38.MDtagged.sam
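Before handing the SAM file to TALON, it's worth confirming the MD tags are actually present. A minimal check, shown here on a toy single-record SAM since the real file is large (with the real file you'd grep the calmd output instead):

```shell
# toy SAM record carrying an MD tag (a real file would come from samtools calmd above)
printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tMD:Z:4\n' > toy.sam
# count alignment lines that carry an MD tag; this should equal the number of alignments
grep -v '^@' toy.sam | grep -c 'MD:Z:'   # 1
```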
The config file for TALON is of the following format:
name, sample description, platform, sam file (full path)
So we can make a `config.csv` file like:
sample1,sample1,PacBio,sample1/5merge.collapsed.rep.fa.hg38.MDtagged.sam
sample2,sample2,PacBio,sample2/5merge.collapsed.rep.fa.hg38.MDtagged.sam
sample3,sample3,PacBio,sample3/5merge.collapsed.rep.fa.hg38.MDtagged.sam
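With many samples, it's easier to generate the config file with a loop. A sketch, assuming each sample's MD-tagged SAM lives in a directory named after the sample (adjust the names and paths to your layout):

```shell
# write one config line per sample: name, description, platform, SAM path
for s in sample1 sample2 sample3; do
  echo "${s},${s},PacBio,${s}/5merge.collapsed.rep.fa.hg38.MDtagged.sam"
done > config.csv
head -1 config.csv
```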
Now we add the samples to the database:
talon --f config.csv \
--db myTest.db \
--build hg38 \
--t 30 \
--cov 0.95 \
--identity 0.95 \
--o output1
At this point, you can use TALON to extract sample abundances and filter library artifacts.
However, I usually would have already run SQANTI3 filtering before I use TALON. TALON filtering and SQANTI3 filtering serve the same purpose; it just depends on whether you want to add everything to the database first and then filter, or filter first with SQANTI3 and then add the results to the database.
We can now output the union of all observed samples from the TALON db.
talon_create_GTF --db=myTest.db \
--annot=gencode36 \
-b hg38 \
--observed \
--o=combined
Note that I'm using `--observed`, so anything that appears at least once in the samples in my `config.csv` (and whatever else is added later) will be output. If you have a specific subset of samples you want to output, you can use the `--datasets` option instead.

The output will be `combined_talon_observedOnly.gtf`.
Recall that we can't use TALON to generate counts because we have given it unique transcripts per sample. To link back to the FL counts, we can use the `PB.X.Y` IDs that are in the `.abundance.txt` files you should have gotten when you ran Cupcake post-collapse abundance tallying, where the file looks like this:
pbid count_fl
PB.1.1 1
PB.2.1 1
PB.3.1 3
PB.3.2 1
PB.3.3 3
PB.3.4 1
PB.3.5 705
You can now associate these `PB.X.Y` IDs with the `PB.X.Y` IDs in the TALON output TSV file. (Remember, however, that the TSV file will contain the PBIDs from all samples, so if you are using R to do the joining, do this per sample.)
Here's an example of how to do that for one sample in R:
# read the TALON read annotation output and subset to one sample
x <- read.table('output1_talon_read_annot.tsv', sep='\t', header=T)
x.sample1 <- subset(x, dataset=='sample1')
# read the Cupcake per-sample FL abundance file
count.info <- read.table('sample1/5merge.collapsed.abundance.txt', sep='\t', header=T)
# the read_name column holds the PB.X.Y IDs, so use it as the join key
x.sample1$pbid <- x.sample1$read_name
m <- merge(x.sample1, count.info, by='pbid')
write.table(m, 'merged.count.sample1.tsv', quote=F, row.names=F, sep='\t')
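If you'd rather stay on the command line, the same per-sample join can be sketched with `sort` and `join`. The two-column files below are toy stand-ins; the real TALON TSV and abundance files have more columns, so you'd adjust the field options accordingly:

```shell
# toy stand-ins: read_name->dataset from TALON, pbid->count_fl from Cupcake
printf 'PB.1.1\tsample1\nPB.2.1\tsample1\n' | sort -k1,1 > reads.tsv
printf 'PB.1.1\t1\nPB.2.1\t5\n' | sort -k1,1 > counts.tsv
# join on the PB.X.Y ID (field 1 of both files), tab-delimited
join -t "$(printf '\t')" reads.tsv counts.tsv > merged.count.tsv
head -1 merged.count.tsv
```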