From tama_collapse to differential gene expression #44

martsper · 2020-12-29T08:57:12Z

Dear Tama developers and community,

I am looking for some inspiration and suggestions. My aim is to do differential gene expression analysis using a PacBio IsoSeq transcriptome as the reference.

I did PacBio IsoSeq and Illumina paired-end sequencing of a eukaryotic alga. The two datasets were assembled de novo into a hybrid transcriptome. I mapped the hybrid transcriptome to a publicly available reference genome of the same species (but a different strain) using GMAP, and collapsed the contigs using tama_collapse.

If I understand correctly, I now have two options for differential gene expression analysis. First, I could map the Illumina short reads to the reference genome (e.g. with STAR), count reads per tama_collapse gene locus, and normalize the counts (e.g. with DESeq2). However, I would prefer to map the Illumina short reads directly to the new hybrid transcriptome.

Here is my question: how can I get from the tama_collapse step to a reference transcriptome that can be used for differential gene expression analysis? (Preferably, I would like to generate an annotation file for the hybrid transcriptome, with information about gene loci and transcript isoforms, that can be used as input for STAR.)

Best,
Martin

GenomeRIK (Maintainer) · 2021-01-03T17:35:38Z

Hi Martin,

Thanks for posting the first discussion for TAMA!

Regarding your question, there are multiple ways of doing this.

The one I prefer is a quantification pipeline that does not need the reference genome, using Kallisto or Salmon. To use these with your TAMA annotation, you just need to convert the annotation into a FASTA file as input for these tools. You can do so with bedtools, as shown in the first step of the ORF/NMD pipeline:
https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-ORF-and-NMD-predictions

i.e.:
bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}
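Kallisto and Salmon report abundances per transcript, so gene-level differential expression usually also needs a transcript-to-gene table. The sketch below builds one from the TAMA BED file. It assumes that column 4 of the BED file holds `gene_id;transcript_id` (for example `G1;G1.2`) and that the same string ends up as the FASTA header via `-name`; depending on the bedtools version, the header may carry an extra suffix, so the keys should be checked against the actual FASTA. The function names are made up for illustration.

```python
def tx2gene_from_bed_lines(lines):
    """Return (transcript_name, gene_id) pairs from TAMA BED lines.

    Assumption: column 4 looks like 'gene_id;transcript_id', e.g. 'G1;G1.2'.
    The full column-4 string is kept as the transcript name because that is
    what 'bedtools getfasta -name' writes into the FASTA header.
    """
    pairs = []
    for line in lines:
        # Skip blank lines, comments, and UCSC-style header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            raise ValueError(f"BED line has fewer than 4 columns: {line!r}")
        name = fields[3]
        gene_id = name.split(";")[0]
        pairs.append((name, gene_id))
    return pairs


def format_tx2gene(pairs):
    """Format the pairs as a two-column, tab-separated table."""
    return "".join(f"{transcript}\t{gene}\n" for transcript, gene in pairs)


example = [
    "chr1\t100\t900\tG1;G1.1\t40\t+\t100\t900\t255,0,0\t2\t200,300\t0,500\n",
    "chr1\t100\t950\tG1;G1.2\t40\t+\t100\t950\t255,0,0\t1\t850\t0\n",
]
print(format_tx2gene(tx2gene_from_bed_lines(example)), end="")
```

The resulting table can be passed to whichever tool summarizes transcript abundances to genes before running DESeq2 or a similar package.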

If you still prefer to use the reference genome assembly, you can guide STAR with the TAMA annotation in GTF format. For this pipeline you just need to convert the TAMA BED file into a GTF file, which you can do with one of the TAMA BED-to-GTF converter tools:
https://github.com/GenomeRIK/tama/wiki/TAMA-GO:-Formatting

For your purposes I would use this one:
tama_convert_bed_gtf_ensembl_no_cds.py
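Before handing the converted GTF to STAR, it can be worth a quick sanity check. With its default settings, STAR builds transcript models from `exon` records and links them through the `transcript_id` and `gene_id` attributes. The sketch below counts genes and transcripts and reports exon lines that lack either attribute. It is a generic GTF check, not part of TAMA, and it makes no claim about what exactly the converter writes; the function name is made up for illustration.

```python
import re

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\w+)\s+"([^"]*)"')


def check_gtf_exons(lines):
    """Return (gene_count, transcript_count, problems) for GTF lines.

    'problems' is a list of (line_number, message) for malformed lines and
    for exon records without both gene_id and transcript_id.
    """
    genes, transcripts, problems = set(), set(), []
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            problems.append((number, "expected 9 tab-separated fields"))
            continue
        if fields[2] != "exon":
            continue  # only exon records are checked here
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            problems.append((number, "exon lacks gene_id or transcript_id"))
            continue
        genes.add(attrs["gene_id"])
        transcripts.add(attrs["transcript_id"])
    return len(genes), len(transcripts), problems


example = [
    'chr1\ttama\texon\t101\t300\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n',
    'chr1\ttama\texon\t601\t900\t.\t+\t.\tgene_id "G1"; transcript_id "G1.1";\n',
]
print(check_gtf_exons(example))
```

If the gene and transcript counts match the number of loci and models in the TAMA BED file and the problem list is empty, the GTF should be usable as an annotation for STAR's genome index.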

Let me know if this works for you and/or if you have more questions.

Cheers,
Richard

martsper (Author) · 2021-01-04T14:43:58Z

Hi Richard,

Thanks a lot for taking the time to explain the two approaches. It helped me understand some of the differences between kallisto/salmon and STAR (though, as I remember, kallisto/salmon don't output an alignment, which is not needed for DGE anyway).

I would like to discuss some possible issues when working with non-model organisms for which only a low-quality reference genome is available:

If I see it right, both approaches above will miss genes whose transcripts are found in the PacBio IsoSeq data but that are not encoded in the reference genome.
What about PacBio transcripts that map to the reference genome with low coverage, e.g. only 50%? With both approaches above, the soft-clipped (unmapped) part of these transcripts will be lost for differential gene expression analysis.
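To find the transcripts affected by this, one option is to compute the aligned fraction of each transcript from the CIGAR string of its SAM record. The sketch below uses one common definition of query coverage: bases in match, mismatch, or insertion operations divided by all transcript bases, with soft- and hard-clipped bases counted as unaligned. This is an assumption made for illustration and not necessarily how GMAP or TAMA computes coverage.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def query_coverage(cigar):
    """Return the fraction of transcript bases that are inside the alignment.

    M, I, = and X consume transcript bases and count as aligned.
    S and H are clipped transcript bases and count as unaligned.
    D, N and P consume no transcript bases and are ignored.
    """
    aligned = clipped = 0
    for length, op in CIGAR_RE.findall(cigar):
        if op in "MI=X":
            aligned += int(length)
        elif op in "SH":
            clipped += int(length)
    total = aligned + clipped
    return aligned / total if total else 0.0


# A 1000-base transcript of which only the first half aligns.
print(query_coverage("500M500S"))
# A fully aligned spliced transcript (N marks an intron).
print(query_coverage("300M1200N200M"))
```

Transcripts below a chosen threshold could then be set aside for reference-free clustering instead of being collapsed with their clipped part missing.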

What do you think about the following approach, which is based on suggestions made by the Cogent developers? Unmapped and badly mapped transcripts could be clustered reference-free with Cogent (or isONclust?). Cogent then tries to reconstruct the gene to which the clustered transcripts belong. The reconstructed genes could then be appended to the reference genome. Finally, the extended genome could be used as the reference for GMAP/minimap mapping, followed by transcript clustering with TAMA. The expression of the clusters can then be quantified with one of the two approaches you explained. The clustering with TAMA will generate annotations that also cover transcripts that are not encoded in the original reference genome or that map to it with low coverage.
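The "append the reconstructed genes to the reference genome" step could be sketched as follows. The code concatenates two FASTA files, gives the appended sequences a prefix so they are easy to tell apart from real chromosomes or scaffolds, and stops if two sequences would end up with the same name. The file names, the `recon_` prefix, and the function name are all hypothetical; nothing here is specific to Cogent's actual output.

```python
def append_reconstructed(reference_fa, reconstructed_fa, out_fa, prefix="recon_"):
    """Write reference plus reconstructed sequences to out_fa.

    Headers are reduced to the sequence name (text up to the first
    whitespace); names from reconstructed_fa get 'prefix' prepended.
    Returns the sequence names in output order.
    """
    names = []
    seen = set()
    with open(out_fa, "w") as out:
        for path, tag in ((reference_fa, ""), (reconstructed_fa, prefix)):
            with open(path) as handle:
                for line in handle:
                    if line.startswith(">"):
                        parts = line[1:].split()
                        if not parts:
                            raise ValueError(f"empty FASTA header in {path}")
                        name = tag + parts[0]
                        if name in seen:
                            raise ValueError(f"duplicate sequence name: {name}")
                        seen.add(name)
                        names.append(name)
                        line = f">{name}\n"
                    elif not line.endswith("\n"):
                        # Make sure the next header starts on its own line.
                        line += "\n"
                    out.write(line)
    return names
```

A call such as `append_reconstructed("genome.fa", "reconstructed_genes.fa", "genome_plus.fa")` would produce the extended reference for the second round of GMAP/minimap mapping.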

Appending the reconstructed genes to the reference genome would also have the advantage of producing all-in-one visualizations that make it possible to keep track of differences between the reference genome and the PacBio transcripts.

Best,
Martin

GenomeRIK (Maintainer) · 2021-01-06T03:07:31Z

Hi Martin,

> If I see it right, both approaches above will miss genes whose transcripts are found in the PacBio IsoSeq data but that are not encoded in the reference genome.

Yes, this is true.

> What about PacBio transcripts that map to the reference genome with low coverage, e.g. only 50%? With both approaches above, the soft-clipped (unmapped) part of these transcripts will be lost for differential gene expression analysis.

This is true and problematic if the genome assembly is not good.

> What do you think about the following approach, which is based on suggestions made by the Cogent developers? Unmapped and badly mapped transcripts could be clustered reference-free with Cogent (or isONclust?). Cogent then tries to reconstruct the gene to which the clustered transcripts belong. The reconstructed genes could then be appended to the reference genome. Finally, the extended genome could be used as the reference for GMAP/minimap mapping, followed by transcript clustering with TAMA. The expression of the clusters can then be quantified with one of the two approaches you explained. The clustering with TAMA will generate annotations that also cover transcripts that are not encoded in the original reference genome or that map to it with low coverage.

I think this is a good approach. You can use Cogent, isONclust or RATTLE for the unmapped reads. You just need to be careful with the gene models generated by the reference-free approach, since it is difficult to filter out problematic models without the genome assembly.

> Appending the reconstructed genes to the reference genome would also have the advantage of producing all-in-one visualizations that make it possible to keep track of differences between the reference genome and the PacBio transcripts.

I think this sounds like a great idea. Let me know if you have more questions regarding all of this or if I missed one of your questions.

Cheers,
Richard
