-
Notifications
You must be signed in to change notification settings - Fork 1
3.2 Searching SRA via GCP BigQuery
You will want to pin the SRA dataset to your BigQuery Console to make it easier to access and explore the available metadata. Click the “Add Data” button on the upper left side of the screen, in the Explorer panel.
Next, select “Pin a project by name”, paste “nih-sra-datastore” into the Project name box and click Pin. It will now appear on the left side of the page in the Explorer panel.
Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra’ entry to reveal the metadata table. Click on the metadata table to open it in the main window.
You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.
The SRA Run Selector allows users to search for SRA records via accession of Studies, Samples, Experiments, or Runs. Multiple query terms can be included by using a comma separated list of accessions. The requested records are displayed in a table format with 1 Run per row. Information common to all records in the query is displayed at the top under the 'Common Fields' section, while remaining metadata fields that vary between runs are included in the table underneath.
Some metadata fields can be used as filters to limit your output to a subset of the full query result and once filtering and selection of your Runs of interest is complete, the dataset can be downloaded as a list of Run accessions or a table of Runs including the metadata fields. Note that this BigQuery table includes information from the SRA record as well as associated BioSample and Bioproject Records.
Compare what you see in the BigQuery table to what is available in run selector, such as for PRJNA839090. Note that there are more fields available in BigQuery than in run selector.
Looking at what is available in BigQuery, noting differences between what is available in run selector and in BigQuery, and referencing our help documentation, do you still have any questions?
For reference, see Tropashko, 2004. Briefly, for tree-like hierarchical structures, each node in the tree can be encode as a tuple, T1: (a, b), such that a < b. If we ensure that each child tuple, T2: (c, d), falls within the range of T1, such that a<=c<b and a<=d<b; we can use these indices to quickly and easily identify all nodes above and below a given node.
Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra_tax_analysis_tool’ entry to reveal the constituent tables. Click on the taxonomy table to open it in the main window.
You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table.
Compare what you see in the BigQuery table, to what is available in NCBI Taxonomy. Note that there is more information available on the website than in BigQuery.
Looking at what is available in BigQuery, noting differences between what is available in Taxonomy and in BigQuery, and referencing our help documentation, do you still have any questions?
Click the triangle ‘expand node’ next to the pinned ‘nih-sra-datastore’ entry in the left-hand Explorer panel to see the contents of the project. Then click the triangle ‘expand node’ next to the ‘sra_tax_analysis_tool’ entry to reveal the constituent tables. Click on the taxonomy table to open it in the main window.
You may then click on the ‘Schema’, ‘Details’, and ‘Preview’ tabs to further explore the contents of the table. The Schema tab gives information about the name of each field in the table and how that field is encoded (timestamp, string, integer, etc.). The Details tab provides information about the data underlying the table, such as the last time it was updated and how many rows the table contains. Finally, the Preview tab shows the first few rows of the table so you can get a firsthand understanding of the organization and contents of the table. Note that the ileft and iright values as unique per SRA record and that each record may have multiple rows, one for each taxa detected.
The SRA Run Analysis page allows users to see a representation of the results of STAT for each record in SRA, to assess what taxa might be present in the record.
Compare what you see in the BigQuery table, to what is exailable on the run analysis page, such as for SRR21803617. Note that there are more fields available in BigQuery than on the run analysis page.
Looking at what is available in BigQuery, noting differences between what is available on the run analysis web page and in BigQuery, and referencing our help documentation, do you still have any questions?
It is often useful to work out each filter individually initially to ensure they work as expected, and none return an empty result set
Organism contains the taxonomic name the record was submitted under and is cap-sensitive. Using lower()
is helpful to avoid issues with capitalization.
select *
from nih-sra-datastore.sra.metadata
where lower(organism) = "salmonella enterica"
limit 100
The lower function is not strictly needed here as platform is a controlled vocabulary, but it may be helpful to get in the habit of using it so as to avoid pesky capitalization related errors.
select *
from nih-sra-datastore.sra.metadata
where lower(platform) = "illumina"
limit 100
Because we are selecting only on the taxonomic name, we will find records with even a single hit, likely due to trace contamination or noise. This could be controlled for by use of the self-count or total-count fields
select *
from nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
where lower(name) = "salmonella enterica"
limit 100
Now that we know all our filters work, let try putting them all together.
The opening with
statement is known as a common table expression, or CTE, and is useful when constructing more complex queries. Note the two subqueries within the CTE; we are asking for records where the associated accession occurs in each (and
) of two subqueries (one looking for organism and the other for STAT results)
with org as (
select acc
from nih-sra-datastore.sra.metadata
where lower(organism) = "salmonella enterica"
),
stat as (
select acc
from nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
where lower(name) = "salmonella enterica"
)
select *
from nih-sra-datastore.sra.metadata
where lower(platform) = "illumina" and (acc in (select acc from org) or acc in (select acc from stat))
Its often useful to look at some descriptive statistics of the data you've identified before conducting downstream analyses. Here will consider how to modify the above query to look at the distribution of collection dates among the records previously identified using group by
. group by
is an aggregate function, meaning it operates over the associated columns. When used, each value in the select statement must either have an aggregate function applied to it, or else be grouped by. Count is an aggregate function summing-up the number of items in the argument, in this case, the number of unique (distinct
) accessions.
with org as (
select acc
from nih-sra-datastore.sra.metadata
where lower(organism) = "salmonella enterica"
),
stat as (
select acc
from nih-sra-datastore.sra_tax_analysis_tool.tax_analysis
where lower(name) = "salmonella enterica"
)
select count(distinct acc) as runs, collection_date_sam
from nih-sra-datastore.sra.metadata
where lower(platform) = "illumina" and (acc in (select acc from org) or acc in (select acc from stat))
group by collection_date_sam
order by collection_date_sam desc
This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM) and the National Institute of Allergy and Infectious disease (NIAID), National Institutes of Health