
2.3 Retrieving Virus information using NCBI Datasets

mtntsuchiya edited this page Oct 15, 2024 · 4 revisions

In addition to the genome service (covered in the previous session), the NCBI Datasets CLI provides a Virus service for retrieving viral genome sequences and metadata.

A. What's the difference between the Virus and Genome endpoints?

The data available through these endpoints originate from two distinct sources. Virus data is sourced from NCBI Virus, which uses both manual and automated curation to verify viral sequence data from the International Nucleotide Sequence Database Collaboration (INSDC) and to standardize the accompanying metadata. In contrast, the NCBI Datasets genome service provides access only to the subset of virus sequences from NCBI Virus that have been assembled and assigned an NCBI Assembly accession (GCA_/GCF_).

What does this mean for those working on viruses?

  • Data from the Virus endpoint has an additional layer of curation.
  • Data packages downloaded from the Virus endpoint will likely differ from (and be larger than) those downloaded from the Genome endpoint:
    • For SARS-CoV-2, there are currently (10/15/2024) 8,972,850 genomes available through the Virus service, and only 93 through the genome service.

      Try it yourself
      • Check the number of SARS-CoV-2 sequences in the Virus endpoint:
      datasets summary virus genome taxon sars2 --limit 0
      
      • Now check the number of sequences in the genome endpoint:
      datasets summary genome taxon sars2 --limit 0
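The two counts can also be compared in one step. The sketch below is illustrative only: `total_count` and `compare_counts` are hypothetical helper names, and it assumes `jq` is installed and that the summary output contains a `total_count` field, as in the jq examples later on this page.

```shell
# Print the total_count reported by a `datasets summary` command.
# Arguments are passed straight through, e.g.: total_count virus genome taxon sars2
total_count() {
    datasets summary "$@" --limit 0 | jq -r '.total_count'
}

# Report both counts and how many times larger the first one is.
compare_counts() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (b == 0) { print "genome count is zero"; exit 1 }
        printf "virus=%d genome=%d ratio=%.0fx\n", a, b, a / b
    }'
}

# Example (requires network access and the datasets CLI):
#   compare_counts "$(total_count virus genome taxon sars2)" \
#                  "$(total_count genome taxon sars2)"
```

With the counts quoted above (8,972,850 and 93), the Virus endpoint holds roughly 96,000 times as many SARS-CoV-2 genomes as the Genome endpoint.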
      

B. Special virus cases: cache packages for SARS-CoV-2 and Influenza

For both SARS-CoV-2 and (Alpha)Influenza, NCBI Datasets CLI provides a cache package. A cache package is pre-packaged with all genomes available for those two taxa.

A cache package is the equivalent of a grab-n-go sandwich, whereas a regular package is made to order. A cache package downloads faster because it doesn't need to be assembled on request, and because it travels through faster download channels at NCBI.

C. Retrieving genome information for Dengue virus

In this exercise, we will take a look at the genomes available for the Dengue virus.

  • Let's take a look at the options for the virus genome subcommand:
datasets download virus genome taxon --help

Download a virus genome data package by taxon (NCBI Taxonomy ID, scientific or common name for any virus at any tax rank). Virus genome data packages include genome, transcript and protein sequences, annotation and one or more data reports. Data packages are downloaded as a zip archive.

The default virus genome data package includes the following files:
  * genomic.fna (genomic sequences)
  * data_report.jsonl (data report with virus genome metadata)
  * dataset_catalog.json (a list of files and file types included in the data package)

Usage
  datasets download virus genome taxon  <taxon> [flags]

Sample Commands
  datasets download virus genome taxon sars-cov-2 --host dog --include protein
  datasets download virus genome taxon coronaviridae --host "manis javanica"

Flags
      --fast-zip-validation   Skip zip checksum validation after download


Global Flags
      --annotated                 Limit to annotated genomes
      --api-key string            Specify an NCBI API key
      --complete-only             Limit to complete genomes
      --debug                     Emit debugging info
      --filename string           Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
      --geo-location string       Limit to genomes isolated from a specified geographic location (continent or country)
      --help                      Print detailed help about a datasets command
      --host string               Limit to virus genomes isolated from a specified host species
      --include string(,string)   Specify virus genome sequence types to download
                                    * genome:     genomic sequences
                                    * cds:        nucleotide coding sequences
                                    * protein:    amino acid sequences
                                    * annotation: annotation report
                                    * biosample:  biosample report
                                    * none:       no sequence data, only primary data report
                                       (default [genome])
      --lineage string            Limit results by Pango lineage (only SARS-CoV-2)
      --no-progressbar            Hide progress bar
      --refseq                    Limit to RefSeq genomes
      --released-after string     Limit to genomes released on or after a specified date (free format, ISO 8601 YYYY-MM-DD recommended)
      --updated-after string      Limit to genomes updated on or after a specified date (free format, ISO 8601 YYYY-MM-DD recommended)
      --usa-state string          Limit to genomes isolated from a specified U.S. state (two-letter abbreviation)
      --version                   Print version of datasets

C1. Downloading genomes

  • Download all genomes
datasets download virus genome taxon 12637 --filename dengue-all.zip
  • Download reference genomes (4 genomes, Dengue virus 1-4)
datasets download virus genome taxon 12637 --refseq --filename dengue-all-ref.zip
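Once a package is downloaded, a quick sanity check is to unzip it and count the sequences. This is a minimal sketch: `count_fasta_records` is a hypothetical helper, and the `ncbi_dataset/data/` path inside the archive is an assumption about the package layout, so check `dataset_catalog.json` if the files are located elsewhere.

```shell
# Count FASTA records (header lines starting with ">") in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example, assuming the reference package was downloaded as shown above:
#   unzip -q dengue-all-ref.zip -d dengue-all-ref
#   count_fasta_records dengue-all-ref/ncbi_dataset/data/genomic.fna
#   wc -l < dengue-all-ref/ncbi_dataset/data/data_report.jsonl
```

For the reference package, the record count should match the number of reference genomes mentioned above, and the data report should contain one line per genome.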

C2. Filtering download based on metadata information

  • Look at the first record and all the fields with jq
datasets summary virus genome taxon 12637 --limit 1 | jq
  • Use dataformat to pull metadata information. Let's look at unique entries for the geo-location field.
    • datasets will retrieve the metadata;
    • dataformat will pull the information from the metadata field geo-location
    • sort will sort all the geo-location entries in alphabetical order
    • uniq -c will count the number of each unique entry
datasets summary virus genome taxon 12637 --as-json-lines | \
dataformat tsv virus-genome --fields geo-location | sort | uniq -c
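The counting pipeline above lists locations alphabetically. To see the most common locations first, the output can be re-sorted by count. The helper name `top_values` is hypothetical, and the `tail -n +2` in the usage comment assumes the first line of the TSV output is a column header.

```shell
# Read one value per line on stdin and print the N most frequent values,
# most frequent first, as "<count> <value>". N defaults to 10.
top_values() {
    n="${1:-10}"
    sort | uniq -c | sort -k1,1nr | head -n "$n" | sed 's/^ *//'
}

# Example (requires network access and the datasets CLI):
#   datasets summary virus genome taxon 12637 --as-json-lines | \
#     dataformat tsv virus-genome --fields geo-location | \
#     tail -n +2 | top_values 5
```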

  • Let's look at all genomes filtered by geo-location (Brazil)
datasets summary virus genome taxon 12637 --geo-location Brazil | jq .total_count
3721
  • The --usa-state flag filters by U.S. state (two-letter abbreviation) and is separate from the --geo-location flag. Let's check how many genomes we have from Florida using the summary subcommand and jq:
datasets summary virus genome taxon 12637 --usa-state FL | jq .total_count
526
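The filters explored with `summary` are also listed in the `download` help text shown earlier, so the same selection can be downloaded once the count looks right. The sketch below is one way to do that: `download_dengue_filtered` and the `DRY_RUN` variable are hypothetical names introduced for illustration, and combining `--geo-location` with `--complete-only` is just an example choice.

```shell
# Download complete Dengue virus genomes from one country.
# Set DRY_RUN=1 to print the command instead of running it.
download_dengue_filtered() {
    country="$1"
    outfile="$2"
    # Build the command in the positional parameters so quoting is preserved.
    set -- datasets download virus genome taxon 12637 \
        --geo-location "$country" --complete-only --filename "$outfile"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example:
#   DRY_RUN=1 download_dengue_filtered Brazil dengue-brazil-complete.zip
#   download_dengue_filtered Brazil dengue-brazil-complete.zip
```

Printing the command first makes it easy to confirm the filters before starting a potentially large download.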

Additional info: 2.4 Important resources