Eric T. Dawson edited this page Apr 6, 2020 · 44 revisions

Pangenome and variation graph

We will use tools supporting the variation graph data model, as described at the pangenome tools and workflows page, to build and distribute pangenome data structures from SARS-CoV-2 genomes. These models are useful for diagnostic and resequencing applications. They can also help us generate assemblies from raw sequencing data.

Possible topics

Pangenome model from available genomes

This is straightforward, and similar to work shared on Twitter (although that was a hack viz for a grant).

[Image: SARS-CoV-2 pangenome viz]

It's easy to set up, but could be interesting to analyze:

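# get your SARS-CoV-2 sequences from GenBank in seqs.fa
# all-vs-all alignment with CIGARs (-c); -X skips self and dual mappings
minimap2 -cx asm20 -X seqs.fa seqs.fa >seqs.paf
# induce the variation graph from the sequences and their alignments
seqwish -s seqs.fa -p seqs.paf -g seqs.gfa
# convert the GFA to odgi's format, sorting the graph (-s)
odgi build -g seqs.gfa -s -o seqs.odgi
# render a 4000x500 pixel image of the graph and its paths
odgi viz -i seqs.odgi -o seqs.png -x 4000 -y 500 -R -P 5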

We can do the same for sequences from GISAID, but users will need authorization, and the GISAID sequences can't be redistributed.

Note that we had to add in gene sequences to provide the annotations at the bottom. These are just paths like any other sequences. They get concatenated into the input sequences given to seqwish, and aligned against the set of genomes that we're working with in a separate minimap2 step (the PAF output of that is concatenated onto the end of the PAF given to seqwish as well).
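The annotation step described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the exact commands used for the figure: the function name `add_gene_paths` and the file names `genes.fa`, `genes.paf`, `all.fa`, `all.paf`, and `all.gfa` are hypothetical, and reusing the `asm20` preset for the gene-to-genome alignment is an assumption (the page does not say which settings were used).

```shell
# Sketch: add gene sequences as extra paths in the graph.
# Usage (hypothetical): add_gene_paths seqs.fa genes.fa
add_gene_paths() {
    genomes=$1   # genome sequences, e.g. seqs.fa
    genes=$2     # gene sequences to use as annotation paths (hypothetical file)

    # all-vs-all genome alignment, as in the commands above
    minimap2 -cx asm20 -X "$genomes" "$genomes" > seqs.paf

    # separate step: align the genes (query) against the genomes (target);
    # the preset here is an assumption
    minimap2 -cx asm20 "$genomes" "$genes" > genes.paf

    # concatenate the genes onto the input sequences,
    # and the gene alignments onto the end of the genome PAF
    cat "$genomes" "$genes" > all.fa
    cat seqs.paf genes.paf > all.paf

    # induce the graph; the genes end up as paths like any other sequence
    seqwish -s all.fa -p all.paf -g all.gfa
}
```

The resulting `all.gfa` can then go through the same `odgi build` and `odgi viz` steps as before.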

Assembly of direct RNA sequencing

Assemble direct RNA sequencing data of SARS-CoV-2 from this paper using minimap2, seqwish, odgi, and GraphAligner.

We'll need to hack on odgi to get the pruning right. GraphAligner or a similar technique might be able to get polished versions of full length RNA reads of the viral genome.

We have already sketched out this project in a related git repo on viral pangenome assembly.

Assembly from short amplified reads

Similar to the previous topic, but trickier, as PCR tiling amplification doesn't necessarily produce a contiguous genome that we can easily work with.

RDF model from pangenome

Here we'd be annotating and working with SPARQL queries on an RDF version of the pangenome, as produced by vg view.
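As a rough sketch of how the RDF version might be produced, the function below converts a GFA to vg format and then asks `vg view` for Turtle output. The function name `gfa_to_turtle`, the output names `seqs.vg` and `seqs.ttl`, and the base URI passed in are all hypothetical, and the flags should be checked against `vg view --help` for the installed vg version.

```shell
# Sketch: GFA -> vg -> RDF (Turtle).
# Usage (hypothetical): gfa_to_turtle seqs.gfa http://example.org/pangenome/
gfa_to_turtle() {
    gfa=$1        # input graph in GFA, e.g. the seqwish output
    base_uri=$2   # base URI for the RDF output (placeholder value)

    # read GFA (-F) and write vg's native format (-v)
    vg view -F -v "$gfa" > seqs.vg

    # write RDF/Turtle (-t) with the given base URI (-r)
    vg view -t -r "$base_uri" seqs.vg > seqs.ttl
}
```

The Turtle file could then be loaded into a triple store for SPARQL queries.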

Protein space pangenome

A few minor tweaks to the seqwish process would allow us to make a protein space pangenome. We need to change the following things:

  • protein alignments in PAF
  • (possibly) add a flag to seqwish to tell it that we're in protein space and avoid any potential reverse complements

The latter may not be necessary.
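If a dedicated seqwish flag turns out to be unnecessary, one possible alternative is to drop any reverse-strand records from the protein PAF before graph induction. The sketch below assumes the protein aligner emits standard PAF, where column 5 holds the relative strand; the function name `forward_only_paf` and the file names in the usage line are hypothetical.

```shell
# Sketch: keep only forward-strand PAF records.
# In PAF, column 5 is the relative strand ("+" or "-").
# Usage (hypothetical): forward_only_paf prot.paf > prot.fwd.paf
forward_only_paf() {
    awk -F '\t' '$5 == "+"' "$1"
}
```

The filtered PAF would then be passed to seqwish with `-p` in place of the unfiltered one.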

odgi could be used for processing of the graph. Except for kmer extraction functions, it is not DNA specific. The output could be visualized in tools like Schematize that provide a high-level matrix view of the pangenome.

Rust versions of basic variation graph utilities and data structures

Many tools that work on variation graphs are written in C++.

Supporting tool development in Rust would really help future development and maintenance.

One target would be a Rust implementation of the HandleGraph abstraction, as documented in the manuscript Succinct dynamic genome variation graphs, defined in libhandlegraph, and implemented in libbdsg. In our experiments, the best model was the PackedGraph, which uses Skiplists encoded in succinct integer vectors to achieve excellent memory performance.

Pangenome workflows in Nextflow

The workflows topic group is hoping to provide the starting point(s) for pangenome workflows; better yet, the pangenome workflows might be embedded in the sequencing workflows themselves. It isn't too difficult to go from bash to Nextflow (see e.g. this example).

Cypher graph queries over GFA in Apache Parquet format

An alternative to compressing assembly graphs in RAM on a single machine is to use distributed and partitionable data formats such as Apache Parquet, for distributed data analysis on a cluster via Apache Spark, Apache Arrow, Ray, etc. adam-gfa provides schema for GFA 1.0/2.0 and ETL to/from Apache Parquet on Spark. SPARK-25994 provides property graphs, Cypher queries, and graph algorithms on Spark.

Communication

Josiah: After some debate, I decided to create a second project for a Pangenome Browser. This depends on Variation graph construction, but it's certainly a different set of tasks that can be carried out independently. In order for a browser to be effective, we must have annotations aggregated/curated as a third task. I would still like us to coordinate closely.

Specific use cases.

Protein-level features

In my personal opinion (Ben Busby), it may be particularly productive to look at the less conserved satellite genes near the S locus.

We may also want to look at amino acid 614 of the spike protein. This is in an unstructured loop, presumably between putative transmembrane helices. This may be involved in immune evasion. We should look at correspondence between SARS-1 and SARS-2 at this position.

Being able to see these loci in the context of each other, as well as the SARS-CoV-2 genomes in general, may be very beneficial in terms of subclassing the virus with respect to human reaction.

Seconding Ben Busby's point. These types of variable regions are seen in other RNA viruses, like HIV for sure. Some are also tied to biological switches, believed to control tropism between tissues (V3 region of gp120). We might find something like that for SARS-CoV-2 if we look at variants at the protein/amino acid level.

Nucleic acid-level features

Would be interesting to see if we can look at variants near recently annotated RNA features as well. See some sources below.

HIV receptor tropism RE: V3 (some papers)

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3195025/

V3 and V2

And in HIV-2

SARS-CoV-2 RNA preprint

Update: I just emailed the corresponding author of the SARS-CoV-2 RNA preprint above to see if they are willing to share RNA mod loci from their run. Will update when they respond, but just FYI they're in Australia.

Participants
