Pangenome
We will use tools supporting the variation graph data model, as described on the pangenome tools and workflows page, to build and distribute pangenome data structures from SARS-CoV-2 genomes. These models are useful for diagnostic and resequencing applications. They can also help us generate assemblies from raw sequencing data.
This is straightforward, and similar to work shared on Twitter (although that was a hack viz for a grant).
It's easy to set up, but could be interesting to analyze:
```shell
# get your SARS-CoV-2 sequences from GenBank in seqs.fa
minimap2 -cx asm20 -X seqs.fa seqs.fa >seqs.paf
seqwish -s seqs.fa -p seqs.paf -g seqs.gfa
odgi build -g seqs.gfa -s -o seqs.odgi
odgi viz -i seqs.odgi -o seqs.png -x 4000 -y 500 -R -P 5
```
We can do the same for sequences from GISAID, but users will need authorization to download them, and the data can't be redistributed.
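As a first step toward analyzing the result, a quick sanity check on the GFA written by seqwish can be useful. The following is a minimal sketch, assuming GFA 1.0 output with tab-separated S (segment), L (link), and P (path) records; the function name `gfa_summary` is made up for illustration and is not part of any of the tools above.

```shell
# Summarize a GFA 1.0 file by counting its record types.
# S = segments (nodes), L = links (edges), P = paths (one per input sequence).
# gfa_summary is a hypothetical helper name, not part of seqwish or odgi.
gfa_summary() {
    awk -F'\t' '
        # Segment sequences may be "*" (absent); only count real bases.
        $1 == "S" { segments++; if ($3 != "*") bases += length($3) }
        $1 == "L" { links++ }
        $1 == "P" { paths++ }
        END {
            printf "segments\t%d\n", segments
            printf "links\t%d\n", links
            printf "paths\t%d\n", paths
            printf "bases\t%d\n", bases
        }
    ' "$1"
}
```

Calling `gfa_summary seqs.gfa` prints four tab-separated counts. The number of paths should match the number of sequences in seqs.fa, which is a cheap way to confirm that nothing was dropped during graph induction.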
Note that we had to add in gene sequences to provide the annotations at the bottom. These are just paths like any other sequences. They get concatenated into the input sequences given to seqwish, and aligned against the set of genomes that we're working with in a separate minimap2 step (the PAF output of that is concatenated onto the end of the PAF given to seqwish as well).
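The gene annotation steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the exact commands that were run: the function name `build_annotated_graph` and the file names are hypothetical, and reusing the `asm20` preset for the gene-to-genome alignment is a guess.

```shell
# Build a graph in which gene sequences appear as extra paths.
# Arguments (all names hypothetical): genomes FASTA, genes FASTA, output prefix.
build_annotated_graph() {
    genomes=$1
    genes=$2
    prefix=$3

    # All-vs-all alignment of the genomes, as in the commands above.
    minimap2 -cx asm20 -X "$genomes" "$genomes" > "$prefix.genomes.paf" || return 1

    # Separate step: align the genes (query) against the genomes (target).
    # Assumption: the asm20 preset is reused here; the original preset is not stated.
    minimap2 -cx asm20 "$genomes" "$genes" > "$prefix.genes.paf" || return 1

    # Concatenate the genes onto the input sequences, and the gene
    # alignments onto the end of the genome alignments.
    cat "$genomes" "$genes" > "$prefix.fa" || return 1
    cat "$prefix.genomes.paf" "$prefix.genes.paf" > "$prefix.paf" || return 1

    # Induce the graph; genes become paths like any other sequence.
    seqwish -s "$prefix.fa" -p "$prefix.paf" -g "$prefix.gfa"
}
```

A call such as `build_annotated_graph seqs.fa genes.fa annotated` would leave `annotated.gfa` ready for `odgi build` and `odgi viz` as before.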
Assemble direct RNA sequencing data of SARS-CoV-2 from this paper using minimap2, seqwish, odgi, and GraphAligner.
We'll need to hack on odgi to get the pruning right. GraphAligner or a similar technique might be able to get polished versions of full length RNA reads of the viral genome.
We have already sketched out this project in a related git repo on viral pangenome assembly.
Similar to the previous, but trickier as PCR tiling amplification doesn't necessarily make a contiguous genome that we can easily work with.
Here we'd be annotating and working with SPARQL queries on an RDF version of the pangenome, as produced by vg view.
A few minor tweaks to the seqwish process would allow us to make a protein space pangenome. We need to change the following things:
- protein alignments in PAF
- (possibly) add a flag to seqwish to tell it that we're in protein space and avoid any potential reverse complements
The latter may not be necessary.
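Until such a flag exists, one possible workaround is to filter the PAF before it reaches seqwish. The sketch below keeps only forward-strand records, relying on the fact that the fifth PAF column holds the strand (`+` or `-`). It is a stand-in for the proposed seqwish change, not an implementation of it, and the name `paf_forward_only` is hypothetical.

```shell
# Keep only forward-strand PAF records (column 5 is the relative strand).
# Reads the files given as arguments, or standard input if there are none.
# paf_forward_only is a hypothetical helper name, not a seqwish feature.
paf_forward_only() {
    awk -F'\t' '$5 == "+"' "$@"
}
```

For example, `paf_forward_only proteins.paf > proteins.fwd.paf` (file names assumed) would drop any reverse-strand records an aligner might emit, so that seqwish never sees an alignment that implies a reverse complement.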
odgi could be used for processing of the graph. Except for kmer extraction functions, it is not DNA specific. The output could be visualized in tools like Schematize that provide a high-level matrix view of the pangenome.
Many tools that work on variation graphs are based in C++.
It would really help future development and maintenance to support development of tools in Rust.
One target would be a Rust implementation of the HandleGraph abstraction, as documented in the manuscript Succinct dynamic genome variation graphs, defined in libhandlegraph, and implemented in libbdsg. In our experiments, the best model was the PackedGraph, which uses Skiplists encoded in succinct integer vectors to achieve excellent memory performance.
The workflows topic group is hoping to provide the starting point(s) for pangenome workflows; better yet the pangenome workflows might be embedded in the sequencing workflows themselves. It isn't too difficult to go from bash to Nextflow, see e.g. this example.
An alternative to compressing assembly graphs in RAM on a single machine is to use distributed and partitionable data formats such as Apache Parquet, for distributed data analysis on a cluster via Apache Spark, Apache Arrow, Ray, etc. adam-gfa provides schema for GFA 1.0/2.0 and ETL to/from Apache Parquet on Spark. SPARK-25994 provides property graphs, Cypher queries, and graph algorithms on Spark.
Josiah: After some debate, I decided to create a second project for a Pangenome Browser. This depends on Variation graph construction, but it's certainly a different set of tasks that can be carried out independently. In order for a browser to be effective, we must have annotations aggregated/curated as a third task. I would still like us to coordinate closely.
Protein-level features
In my personal opinion (Ben Busby), it may be particularly productive to look at the less conserved satellite genes near the S locus.
We may also want to look at amino acid 614 of the spike protein. This is in an unstructured loop, presumably between transmembrane helices. This may be involved in immune evasion. We should look at correspondence between SARS-1 and SARS-2 at this position.
Being able to see these loci in the context of each other, as well as of the SARS-CoV-2 genomes in general, may be very beneficial for subclassing the virus with respect to human reaction.
Seconding Ben Busby's point. These types of variable regions are seen in other RNA viruses, certainly in HIV. Some are also tied to biological switches believed to control tropism between tissues (the V3 region of gp120). We might find something like that for SARS-CoV-2 if we look at variants at the protein/amino acid level.
Nucleic acid-level features
Would be interesting to see if we can look at variants near recently annotated RNA features as well. See some sources below.
HIV receptor tropism RE: V3 (some papers)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3195025/
And in HIV-2
SARS-CoV-2 RNA preprint
Update: I just emailed the corresponding author of the SARS-CoV-2 RNA preprint above to see if they are willing to share RNA modification loci from their run. Will update when they respond, but just FYI, they're in AUS.
- Erik Garrison
- Michael Heuer
- Pjotr Prins
- Josiah Seaman (Graph Genome Browser Consortium)
- Christian Fischer
- Rutger Vos (maybe, especially if using PanTools)
- Ben W. (maybe)
- Simon Heumos
- Noah Legall (maybe, I am interested in learning on what biological questions pangenomics can answer for viruses. Would be a night time endeavor for me)
- Ali Ghaffaari (maybe)
- Fawaz Dabbaghie (maybe)
- Hao Chen
- Jouni Sirén
- Simone Ciccolella
- Luca Denti
- Glenn Hickey
- Saeed Omidi (maybe)
- René Xavier
- Knut Rand
- Andrea Guarracino
- Flavia Villani
- Artem Tarasov
- Ben Busby
- Alex Gener (Knows: HIV virology, long read sequencing + analysis (sequence, some base modification workflows), some RNA-seq. Learning: graph genome software implementation.)
- Eric Dawson