Skip to content

3.2 Searching SRA via GCP BigQuery

Ryan edited this page Oct 11, 2024 · 7 revisions

Exercise 1: Review metadata table contents and help docs

Add nih-sra-datastore to your datasets

You will want to pin the SRA dataset to your BigQuery Console to make it easier to access and explore the available metadata. Click the “Add Data” button on the upper left side of the screen, in the Explorer panel.

bigquery-pin

Next, select “Pin a project by name”, paste “nih-sra-datastore” into the Project name box and click Pin. It will now appear on the left side of the page in the Explorer panel.

Inspect table schema, details, and preview

Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra’ entry to reveal the metadata table. Click on the metadata table to open it in the main window.

bq_metadata_details

You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.

Compare BigQuery table contents to run selector

The SRA Run Selector allows users to search for SRA records via accession of Studies, Samples, Experiments, or Runs. Multiple query terms can be included by using a comma separated list of accessions. The requested records are displayed in a table format with 1 Run per row. Information common to all records in the query is displayed at the top under the 'Common Fields' section, while remaining metadata fields that vary between runs are included in the table underneath.

Some metadata fields can be used as filters to limit your output to a subset of the full query result and once filtering and selection of your Runs of interest is complete, the dataset can be downloaded as a list of Run accessions or a table of Runs including the metadata fields. Note that this BigQuery table includes information from the SRA record as well as associated BioSample and Bioproject Records.

Compare what you see in the BigQuery table to what is available in run selector, such as for PRJNA839090. Note that there are more fields available in BigQuery than in run selector.

run_selector

Reference existing help documentation and ask any questions you might have

Looking at what is available in BigQuery, noting differences between what is available in run selector and in BigQuery, and referencing our help documentation, do you still have any questions?


Exercise 2: Review taxonomy table and help docs

Intro to nested interval indexing

For reference, see Tropashko, 2004. Briefly, for tree-like hierarchical structures, each node in the tree can be encode as a tuple, T1: (a, b), such that a < b. If we ensure that each child tuple, T2: (c, d), falls within the range of T1, such that a<=c<b and a<=d<b; we can use these indices to quickly and easily identify all nodes above and below a given node.

Inspect table schema, details, and preview

Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra_tax_analysis_tool’ entry to reveal the constituent tables. Click on the taxonomy table to open it in the main window.

bq_tax_schema

You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.

Compare BigQuery table contents to the taxonomy web page

Compare what you see in the BigQuery table, to what is available in NCBI Taxonomy. Note that there is more information available on the website than in BigQuery.

taxonomy1

Reference existing help documentation and ask any questions you might have

Looking at what is available in BigQuery, noting differences between what is available in Taxonomy and in BigQuery, and referencing our help documentation, do you still have any questions?


Exercise 3: Review STAT results table and help docs

Inspect table schema, details, and preview

Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra_tax_analysis_tool’ entry to reveal the constituent tables. Click on the taxonomy table to open it in the main window.

bq_stat_preview

You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table. Note that the ileft and iright values as unique per SRA record and that each record may have multiple rows, one for each taxa detected.

Compare BigQuery table contents to the run analysis web page

The SRA Run Analysis page allows users to see a representation of the results of STAT for each record in SRA, to assess what taxa might be present in the record.

Compare what you see in the BigQuery table, to what is exailable on the run analysis page, such as for SRR21803617. Note that there are more fields available in BigQuery than on the run analysis page.

Reference existing help documentation and ask any questions you might have

Looking at what is available in BigQuery, noting differences between what is available on the run analysis web page and in BigQuery, and referencing our help documentation, do you still have any questions?


Project A: Find Salmonella enterica associated Illumina records in SRA

Step 1: Figuring out how to implement each filter individually

It is often useful to work out each filter individually initially to ensure they work as expected, and none return an empty result set

Query 1: Filter by Organism

Organism contains the taxonomic name the record was submitted under and is cap-sensitive. Using lower() is helpful to avoid issues with capitalization.

select *
from nih-sra-datastore.sra.metadata
where lower(organism) = "salmonella enterica"
limit 100

Query 2: Filter by Platform

The lower function is not strictly needed here as platform is a controlled vocabulary, but it may be helpful to get in the habit of using it so as to avoid pesky capitalization related errors.

select *
from nih-sra-datastore.sra.metadata
where lower(platform) = "illumina"
limit 100

Query 3: Find records containing S. enterica by STAT

Because we are selecting only on the taxonomic name, we will find records with even a single hit, likely due to trace contamination or noise. This could be controlled for by use of the self-count or total-count fields

select *
from nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
where lower(name) = "salmonella enterica"
limit 100

Step 2: Putting it all together

Now that we know all our filters work, let try putting them all together.

The opening with statement is known as a common table expression, or CTE, and is useful when constructing more complex queries. Note the two subqueries within the CTE; we are asking for records where the associated accession occurs in each (and) of two subqueries (one looking for organism and the other for STAT results)

Query 4:

with org as (
    select acc
    from nih-sra-datastore.sra.metadata
    where lower(organism) = "salmonella enterica" 
),
stat as (
    select acc
    from nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
    where lower(name) = "salmonella enterica"
)
select *
from nih-sra-datastore.sra.metadata
where lower(platform) = "illumina" and (acc in (select acc from org) or acc in (select acc from stat))

Step 3: Grouping by collection date

Its often useful to look at some descriptive statistics of the data you've identified before conducting downstream analyses. Here will consider how to modify the above query to look at the distribution of collection dates among the records previously identified using group by. group by is an aggregate function, meaning it operates over the associated columns. When used, each value in the select statement must either have an aggregate function applied to it, or else be grouped by. Count is an aggregate function summing-up the number of items in the argument, in this case, the number of unique (distinct) accessions.

Query 5:

with org as (
    select acc
    from nih-sra-datastore.sra.metadata
    where lower(organism) = "salmonella enterica" 
),
stat as (
    select acc
    from nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
    where lower(name) = "salmonella enterica"
)
select count(distinct acc) as runs, collection_date_sam
from nih-sra-datastore.sra.metadata
where lower(platform) = "illumina" and (acc in (select acc from org) or acc in (select acc from stat))
group by collection_date_sam
order by collection_date_sam desc

Next: 3.3 Using SRAToolkit to retrieve Data

Clone this wiki locally