Rutger Vos edited this page Mar 27, 2020 · 21 revisions

Creating a tool for phylogenetic analysis of COVID-19 sequence data

Communication

For the time being, there is a #phylogeny channel on the Slack group (check out the [email protected] group for the invitation link). During the BioHackathon, we'll update this section.

Resources

Workflows

Data

Tools (brainstorm section)

  • multiple sequence alignment tools, e.g. muscle, mafft
  • phylogenetic inference tools, e.g. RAxML
  • evolutionary rate analysis tools, e.g. PAML, HyPhy
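
As a rough sketch of how the first two tool categories might be chained, the Python helpers below only build the command lines for an alignment step and a tree inference step. The function names, file names, run name, and seed are hypothetical. The sketch assumes the `mafft` and `raxmlHPC` executables are on the PATH, and the substitution model (GTRGAMMA) is just one plausible choice, not a decision made on this page.

```python
import shlex


def mafft_command(fasta_in):
    """Build the argv list for a MAFFT run (hypothetical helper).

    MAFFT writes the alignment to stdout, so the caller has to redirect it.
    """
    return ["mafft", "--auto", fasta_in]


def raxml_command(alignment, run_name, seed=12345):
    """Build the argv list for a RAxML tree search (hypothetical helper).

    -s: alignment file, -n: run name, -m: substitution model,
    -p: random seed for the parsimony starting tree.
    """
    return ["raxmlHPC", "-s", alignment, "-n", run_name,
            "-m", "GTRGAMMA", "-p", str(seed)]


# Print the commands instead of running them, so the sketch works
# without either tool installed.
print(shlex.join(mafft_command("sequences.fasta")), "> aligned.fasta")
print(shlex.join(raxml_command("aligned.fasta", "sarscov2")))
```

Passing the lists to `subprocess.run` (with stdout redirected to a file for MAFFT) would execute the steps. Keeping command construction separate makes it easy to move the same commands into a Snakemake rule or CWL tool description later.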

Ideas for projects

  • Working on the phylogeny of COVID-19 (similar to this analysis, and more connected to this article in terms of receptors and conserved sites).
  • To be implemented as a rerunnable workflow for when new sequence data become available
  • Easily deployable, runnable in public cloud
  • Connected to other COVID-19 analysis workflows and their emerging I/O standards

The current list of SARS-CoV-2 sequences in GenBank can be used for this purpose. If developed as a workflow, the project can connect to the "main" public sequence resource deliverable/task, and possibly also to the biostatistics and machine learning ones.
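
One way a workflow could pull GenBank records is through the NCBI E-utilities `efetch` endpoint. The sketch below only constructs the request URL. The helper name is hypothetical, and the accession (MN908947, the Wuhan-Hu-1 reference genome) is used purely as an illustration. How the project would actually obtain the current sequence list is not specified here.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions):
    """Return an efetch URL requesting FASTA records for the given
    nucleotide accessions (hypothetical helper)."""
    params = {
        "db": "nuccore",              # GenBank nucleotide database
        "id": ",".join(accessions),   # comma-separated accession list
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urlencode(params)


print(efetch_fasta_url(["MN908947"]))
```

The resulting URL can be fetched with `urllib.request.urlopen` or any HTTP client. For large accession lists, NCBI's usage guidelines on request rates and batching would need to be followed.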

As for technical implementation, it would make sense to build this as a rerunnable workflow (e.g. in Snakemake or CWL), which also connects it to the Workflows activity. As the available sequence data continue to grow, some of the analysis steps will become computationally expensive (for example, running BranchSiteREL or similar analyses). Hence, we should plan for scaling out to HPC or cloud infrastructure.
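
To illustrate what "rerunnable" means in practice, here is a minimal Python sketch of the timestamp check that make-like workflow engines use, in some form, to decide whether a step must run again. It is not Snakemake's or CWL's actual implementation; the function name and file names are hypothetical.

```python
import os
import tempfile


def needs_rerun(inputs, outputs):
    """Return True if any output is missing, or if any input is newer
    than the oldest output (hypothetical helper)."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    newest_input = max((os.path.getmtime(p) for p in inputs), default=0.0)
    oldest_output = min((os.path.getmtime(p) for p in outputs), default=0.0)
    return newest_input > oldest_output


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "sequences.fasta")
    aligned = os.path.join(tmp, "aligned.fasta")
    open(fasta, "w").close()
    # The alignment has not been produced yet, so the step must run.
    print(needs_rerun([fasta], [aligned]))
```

With a check like this, expensive steps are skipped unless new sequence data have arrived since the last run. A workflow engine adds dependency resolution and job scheduling on top, which is where scaling out to a cluster or cloud would come in.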

Participants
