Variation Analysis and Visualization of SARS-CoV-2 sequences in GenBank

NCBI Datasets Codeathon Team 4

The goal of this project is to identify and visualize variation (SNPs) in SARS-CoV-2 genomes.

Background

Understanding variation across SARS-CoV-2 genome sequences has important implications for epidemiology and molecular biology.

Existing tools allow researchers to visualize SARS-CoV-2 variation, but access to the underlying data is restricted. That data includes genome sequences and valuable metadata, such as where in the world a particular genome was isolated.

The SARS-CoV-2 Variation Viewer allows researchers to easily visualize SARS-CoV-2 SNPs mapped to the annotated Wuhan-Hu-1 reference genome (NC_045512.2), with a searchable tabular view showing amino acid changes, collection location, host species, and links to protein structure views in iCn3D.

Workflow

Download all SARS-CoV-2 genomes from GenBank using NCBI Datasets
Identify SNPs in these SARS-CoV-2 genomes
Correlate SNPs with SARS-CoV-2 genome metadata (also from NCBI Datasets)
Visualize SNPs and associated metadata in a graphical display

Executing the workflow

The workflow depends on:

bwa
samtools
bcftools

If those are not in $PATH, first execute:

bash bin/install_tool.sh

Next, run the workflow:

bash bin/workflow.sh
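
As a rough illustration of the dependency check described above, the sketch below reports which of the three tools are missing from $PATH. It is not part of the repository; the function name `missing_tools` is hypothetical, and only the tool names and the two script paths come from this README.

```python
import shutil

# Tools named in the dependency list above.
REQUIRED_TOOLS = ("bwa", "samtools", "bcftools")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        # Hypothetical message; the install script path is from this README.
        print("Missing:", ", ".join(missing), "- run: bash bin/install_tool.sh")
    else:
        print("All dependencies found; run: bash bin/workflow.sh")
```

Running a check like this first avoids starting the workflow only to have it fail partway through on a missing aligner or variant caller.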

Download SARS-CoV-2 genomes using NCBI Datasets

Use the NCBI Datasets command-line tool to download all available SARS-CoV-2 genome sequences (6k+ as of July 8, 2020) and associated metadata from GenBank.

Identify SNPs in SARS-CoV-2 genomes

Align 6k+ genomes using bwa mem
Call variants (identify SNPs) using bcftools, which found a total of 48,398 variants
Encode variants using SPDI format (Holmes et al., 2019)
Group variants by location and allele, resulting in 4,299 distinct variants. Variants seen in fewer than 5 genomes were discarded. [Figure: plot of those variants]

The final filtered set contains 590 viral variants.
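
The grouping and filtering steps above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: it assumes the variant calls are available as (genome ID, VCF position, reference base, alternate base) tuples, it handles single-base substitutions only, and the names `to_spdi` and `group_and_filter` are hypothetical. SPDI positions are 0-based while VCF positions are 1-based, so the position is shifted by one.

```python
from collections import defaultdict

REFERENCE = "NC_045512.2"
MIN_GENOMES = 5  # threshold stated in the text


def to_spdi(vcf_pos, ref, alt, sequence=REFERENCE):
    """Build an SPDI string for a substitution (SPDI is 0-based, VCF POS is 1-based)."""
    return f"{sequence}:{vcf_pos - 1}:{ref}:{alt}"


def group_and_filter(calls, min_genomes=MIN_GENOMES):
    """Group calls by location and allele, then drop rare variants.

    calls: iterable of (genome_id, vcf_pos, ref, alt) tuples.
    Returns {spdi: number of distinct genomes} for variants seen in
    at least min_genomes genomes.
    """
    genomes_by_variant = defaultdict(set)
    for genome_id, vcf_pos, ref, alt in calls:
        genomes_by_variant[to_spdi(vcf_pos, ref, alt)].add(genome_id)
    return {
        spdi: len(genomes)
        for spdi, genomes in genomes_by_variant.items()
        if len(genomes) >= min_genomes
    }


if __name__ == "__main__":
    # Made-up genome identifiers and calls, for illustration only.
    calls = [(f"genome_{i}", 9561, "C", "T") for i in range(6)]
    calls.append(("genome_0", 100, "A", "G"))  # seen once, so filtered out
    print(group_and_filter(calls))
```

Counting distinct genomes per variant with a set, rather than counting rows, keeps a genome from being counted twice if the same call appears more than once.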

Add metadata and annotation

The remaining 590 variants were joined to their metadata (e.g. geographic location, collection date, host) extracted from the NCBI Datasets virus genome data report.
Each variant allele was annotated with the associated protein change, using the CDS FASTA sequence data available in the NCBI Datasets data package. Only 557 variants fell within a single coding region and could be annotated:
- 290 of the variants resulted in a non-synonymous change (the amino acid changes)
- 267 of the variants resulted in a synonymous change (the amino acid is unchanged)
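
The classification into synonymous and non-synonymous changes can be sketched as below. This is an illustration under assumptions, not the repository's annotation code: it covers single-base substitutions inside one codon, uses the standard genetic code, and treats the offset as the 0-based index of the changed base within the codon (inferred from the example record in this README, where reference codon TCA at offset 1 becomes TTA). The function name `annotate` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def annotate(ref_codon, offset, alt_base, protein_position):
    """Classify a single-base substitution within one codon.

    offset is the 0-based index of the changed base within the codon;
    protein_position is the 1-based residue number in the protein.
    """
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return {
        "codon": alt_codon,
        "amino_acid": alt_aa,
        "protein_variant": f"{ref_aa}{protein_position}{alt_aa}",
        "aa_type": "synonymous" if ref_aa == alt_aa else "non_synonymous",
    }


if __name__ == "__main__":
    # Values taken from the example variant record in this README.
    print(annotate("TCA", 1, "T", 336))
```

With the values from the example record, the call yields codon TTA, amino acid L, protein variant S336L, and type non_synonymous, matching the fields shown there.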

Example variant data

    {
      "start": 9560,
      "stop": 9561,
      "reference_sequence": "C",
      "alleles": [
        {
          "allele": "T",
          "count": 6,
          "spdi": "NC_045512.2:9560:C:T",
          "Host": [
            {
              "value": "Homo sapiens",
              "count": 6
            }
          ],
          "Collection Date": [
            {
              "value": "2020-01-11",
              "count": 1
            },
            ...
          ],
          "Collection Location": [
            {
              "value": "USA: CA/North America",
              "count": 5
            },
            {
              "value": "China/Asia",
              "count": 1
            }
          ],
          "codon": "TTA",
          "amino_acid": "L",
          "protein_variant": "S336L",
          "aa_type": "non_synonymous"
        }
      ],
      "protein_name": "nsp4",
      "protein_accession": "YP_009724389.1",
      "protein_position": 336,
      "offset": 1,
      "codon": "TCA",
      "amino_acid": "S"
    },
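
To show how a record like the one above might feed a tabular view, the following sketch flattens each allele into one row and picks its most frequent collection location. It assumes only the field names visible in the example; the function name `allele_rows` and the output column names are hypothetical.

```python
def allele_rows(record):
    """Yield one flat dict per allele from a record shaped like the example above."""
    for allele in record["alleles"]:
        locations = allele.get("Collection Location", [])
        # Most frequent location, or None when no locations are listed.
        top = max(locations, key=lambda item: item["count"], default=None)
        yield {
            "spdi": allele["spdi"],
            "protein": record.get("protein_name"),
            "protein_variant": allele.get("protein_variant"),
            "aa_type": allele.get("aa_type"),
            "genomes": allele["count"],
            "top_location": top["value"] if top else None,
        }


if __name__ == "__main__":
    # Trimmed copy of the example record; omitted fields are not needed here.
    record = {
        "protein_name": "nsp4",
        "alleles": [
            {
                "allele": "T",
                "count": 6,
                "spdi": "NC_045512.2:9560:C:T",
                "Collection Location": [
                    {"value": "USA: CA/North America", "count": 5},
                    {"value": "China/Asia", "count": 1},
                ],
                "protein_variant": "S336L",
                "aa_type": "non_synonymous",
            }
        ],
    }
    for row in allele_rows(record):
        print(row)
```

Using `.get` for the annotation fields lets the same function handle variants that fall outside a single coding region and therefore carry no protein annotation.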

Visualize SNPs in a graphical display

Visualize SNPs and associated metadata in a graphical display using two views, a summary view and a detailed view:
Summary View: Visualize SNPs mapped to the SARS-CoV-2 reference genome with protein annotations
- Also includes a searchable, sortable tabular view of the data.
Detailed View: Visualize SNPs mapped to SARS-CoV-2 proteins at single amino acid resolution, with links to iCn3D protein structure views.

SARS2 Variation Viewer Live Demo

https://ncbi-codeathons.github.io/SARS2-Variation-Viewer/
