3.3 Using SRAToolkit to retrieve Data
For the remainder of this portion of the workshop, we will be considering two accessions: SRR9854072 and SRR17393369.
To begin, we will add the toolkit's bin directory to the PATH environment variable so that its tools can be called by name. Please enter the following command into your console:
export PATH=$PATH:/home/connorrp/sratoolkit.3.1.1-ubuntu64/bin
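Before moving on, it can help to confirm that the tools are actually reachable after the PATH change. The following is a minimal sketch; the function name require_tools is a hypothetical helper, not part of SRA Toolkit.

```shell
# Hypothetical helper: report any named command that cannot be found on PATH.
# Returns 0 when every command is present and 1 otherwise.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not on PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (not run here):
#   require_tools prefetch fasterq-dump vdb-config
```

If the example call prints nothing, all three tools used in this section were found.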
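For most machines where FASTQ is the expected output, we recommend running prefetch and then fasterq-dump. The timings at the end of this section show that running fasterq-dump without prefetch was the slowest approach by a significant amount.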
prefetch SRR9854072
fasterq-dump SRR9854072
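In a script, the two steps can be chained so that fasterq-dump only runs when prefetch succeeds. This is a sketch under stated assumptions: the function name fetch_fastq and the PREFETCH and FASTERQ_DUMP override variables are hypothetical conveniences, not toolkit features.

```shell
# Hypothetical wrapper: run prefetch, then fasterq-dump only if prefetch succeeded.
# PREFETCH and FASTERQ_DUMP are assumed override variables that default to the
# real tool names; they exist only to make the wrapper easy to dry-run.
fetch_fastq() {
    acc=$1
    "${PREFETCH:-prefetch}" "$acc" && "${FASTERQ_DUMP:-fasterq-dump}" "$acc"
}

# Example call (not run here):
#   fetch_fastq SRR9854072
```

Setting FASTERQ_DUMP=echo and PREFETCH=true gives a dry run that only prints the accession.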
If you want to inspect what quality score format is in the downloaded data, one way to do this is by inspecting the fastq output directly.
head SRR9854072_1.fastq
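As a rough complement to reading the head output by eye, the sketch below reports the lowest and highest ASCII codes found in the quality lines of the first records. The function name quality_range and the default of 1000 records are illustrative choices, and the code assumes four-line FASTQ records. A range that starts at 33 or above is consistent with the Phred+33 encoding.

```shell
# Hypothetical helper: print "min max" ASCII codes of the quality characters
# in the first N records (default 1000) of a four-line-per-record FASTQ file.
quality_range() {
    fastq=$1
    records=${2:-1000}
    head -n "$((records * 4))" "$fastq" |
        awk 'NR % 4 == 0' |      # keep only the quality line of each record
        tr -d '\n' |             # drop newlines so they are not counted
        od -An -v -tu1 |         # dump every byte as an unsigned decimal
        tr -s ' ' '\n' |         # one number per line
        awk 'NF { v = $1 + 0
                  if (min == "" || v < min) min = v
                  if (v > max) max = v }
             END { print min, max }'
}

# Example call (not run here):
#   quality_range SRR9854072_1.fastq
```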
SRA's SDL service can be used to locate SRA data directly. This is most useful when downloading data with a tool such as a commercial Cloud Service Provider's (CSP) Command Line Interface (CLI). In the command below, the -q and -O - options make wget write the response quietly to standard output, and the jq and column commands it is piped to make the output more readable in a console.
wget -q "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?acc=SRR9854072&accept-alternate-locations=yes" -O - | jq -r '.result[].files[].locations[] | [.service, .region, .link] | @tsv' | column -t
The base URL, which also provides documentation for using the service, is https://www.ncbi.nlm.nih.gov/Traces/sdl/2/
To see all types and locations of files available for an accession, use the locality function with at least one accession, e.g. https://www.ncbi.nlm.nih.gov/Traces/sdl/2/locality?acc=SRR24727461
To get a URL to a data file, use the retrieve function with at least one accession, often with additional options that ensure the correct type and location of file is returned in the URL, e.g. https://www.ncbi.nlm.nih.gov/Traces/sdl/2/retrieve?acc=SRR24727461&zqa=zr&location=gs.us-east1&capability=can-pay:gs
The zqa option specifies preferences or requirements for the type of data file returned. All options are listed on the SDL documentation page; in this case we use 'zr', which prefers SRA Lite but accepts Normalized depending on availability. The location option selects Google cloud services in the us-east1 region, and because Google may apply overhead charges we also need to indicate to SDL, using the capability option, that we are able to pay those charges.
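To make the role of each option concrete, the sketch below assembles a retrieve URL from its parts. The function name sdl_retrieve_url is hypothetical; the base URL and the acc, zqa, location and capability parameters are the ones shown in the example above.

```shell
# Hypothetical helper: build an SDL retrieve URL from an accession, a zqa
# preference, a location (provider.region) and a capability string.
sdl_retrieve_url() {
    acc=$1
    zqa=$2
    location=$3
    capability=$4
    printf 'https://www.ncbi.nlm.nih.gov/Traces/sdl/2/retrieve?acc=%s&zqa=%s&location=%s&capability=%s\n' \
        "$acc" "$zqa" "$location" "$capability"
}

# Reproduces the example URL from the text:
sdl_retrieve_url SRR24727461 zr gs.us-east1 can-pay:gs
```

The printed URL can then be passed to wget in the same way as the earlier retrieve command.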
If we know the exact URL for the file we would like to retrieve, we can pass that as an argument to prefetch (or fasterq-dump).
prefetch https://sra-download-internal.ncbi.nlm.nih.gov/sos5/sra-pub-zq-16/SRR009/854/SRR9854072.sralite.1
fasterq-dump SRR9854072
The toolkit is designed to configure the options and preferences that would be provided to SDL so that it returns a single best URL for accessing data. You can use the vdb-config tool, either in the interactive GUI mode (-i) or as a command line tool, to specify which data format is preferred and to provide the billing credentials that are passed along to the cloud service providers so they can charge fees such as egress or other overhead charges. Alternatively, the SRA Toolkit prefetch tool can be given a URL, or it accepts the --location option, which uses the same format as SDL to specify the provider and region to access data from.
prefetch SRR24727461 --location gs.us-east1
Given a file location, if it is in a CSP, you can use the CSP's CLI to fetch the data.
If a run was downloaded using prefetch, the toolkit will be able to locate and access the run automatically. However, if the run was downloaded manually using a cloud platform CLI, or was moved after download, you will likely need to provide the local location of the run to the toolkit. In this case a path can be given to the toolkit command in place of the accession, as with fasterq-dump below.
aws s3 cp --no-sign-request s3://sra-pub-run-odp/sra/SRR9854072/SRR9854072 ~/
fasterq-dump ~/SRR9854072
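When a script cannot know in advance whether a run was copied by hand, one option is to pass the local path when it exists and fall back to the bare accession otherwise. This is a minimal sketch; the function name run_source and its directory argument are hypothetical.

```shell
# Hypothetical helper: print DIR/ACC if that path exists, otherwise print ACC
# so the toolkit resolves the accession itself. DIR defaults to the current
# directory.
run_source() {
    acc=$1
    dir=${2:-.}
    if [ -e "$dir/$acc" ]; then
        printf '%s\n' "$dir/$acc"
    else
        printf '%s\n' "$acc"
    fi
}

# Example call (not run here):
#   fasterq-dump "$(run_source SRR9854072 "$HOME")"
```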
To force a specific version of the run, you can configure the toolkit to prefer SRA Normalized or SRA Lite with vdb-config -Q <yes|no>, where yes selects Lite and no selects Normalized. prefetch can also be asked to force Lite, and to fail if Lite is not available, with prefetch --eliminate-quals. After changing our configuration settings, we first remove the previously downloaded SRA record before running prefetch and fasterq-dump again. Here we also use the --outfile argument to specify a file name that distinguishes the output from the previously retrieved Normalized files.
vdb-config -Q yes
rm -r SRR9854072
prefetch SRR9854072
fasterq-dump --outfile SRR9854072_lite SRR9854072
We will now repeat this process for the second accession, using the vdb-config approach. Note the file sizes and the number of bases returned.
prefetch SRR17393369
fasterq-dump --outfile SRR17393369_lite SRR17393369
rm -r SRR17393369
vdb-config -Q no
prefetch SRR17393369
fasterq-dump SRR17393369
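To compare the Lite and Normalized outputs, the read and base counts can be taken straight from the FASTQ files. The sketch below assumes four-line FASTQ records; the function name fastq_stats is hypothetical. File sizes can be checked separately with ls -l.

```shell
# Hypothetical helper: count reads and bases in a four-line-per-record FASTQ.
# The sequence is the second line of each record. %.0f is used so that large
# totals print in full in awk implementations with a narrow %d.
fastq_stats() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END { printf "%.0f reads, %.0f bases\n", reads, bases }' "$1"
}

# Example call (not run here):
#   fastq_stats SRR17393369_1.fastq
```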
The bash command time can be run before any of the commands above to capture execution timings. The table below shows the timings for a single execution of each indicated command for each run.
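In an interactive console, prefixing a command with time (for example, time prefetch SRR9854072) is enough. For scripts that need to log a timing for each step, the sketch below is a coarser substitute that measures whole seconds with date; the function name elapsed is hypothetical and it is not equivalent to the sub-second figures reported by time.

```shell
# Hypothetical helper: run a command, print its wall-clock duration in whole
# seconds to standard error, and pass through the command's exit status.
elapsed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s" >&2
    return "$status"
}

# Example call (not run here):
#   elapsed prefetch SRR9854072
```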
. | Acc | SRR9854072 | SRR9854072 | SRR9854072 | SRR9854072 | SRR17393369 | SRR17393369 |
---|---|---|---|---|---|---|---|
. | Format | Normalized | Normalized | Normalized | Lite | Normalized | Lite |
Size | Bytes | 307555042 | 307555042 | 307555042 | 116221985 | 1020428245 | 817899569 |
Size | Reads | 1559786 | 1559786 | 1559786 | 1559786 | 30487546 | 30487546 |
Approach | Approach | fasterq-dump alone | prefetch | aws | prefetch | prefetch | prefetch |
Time(real) | prefetch/aws | | 4.584 | 3.185 | 2.109 | 51.369 | 9.502 |
Time(real) | fasterq-dump | 69.41 | 4.526 | 2.874 | 6.308 | 79.787 | 93.458 |
Time(real) | total | 69.41 | 9.11 | 6.059 | 8.417 | 131.156 | 102.96 |
Note that for SRR9854072 the SRA file retrieved is ~62% smaller when fetching Lite rather than Normalized, whereas for SRR17393369 it is only ~20% smaller. The smaller size of Lite files is reflected in the shorter time it takes to get a FASTQ file for each run. Comparing the prefetch approach for Normalized and Lite, the total time required was ~8% lower for SRR9854072 and ~21% lower for SRR17393369 (the larger file). Finally, note that using the cloud CLI (aws) was fastest, while using fasterq-dump without prefetch was slowest by a significant amount.
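The relative differences can be recomputed from the table as (baseline - new) / baseline. The sketch below does this with awk; the function name pct_less is hypothetical, and the inputs are the byte sizes and total times from the table.

```shell
# Hypothetical helper: percentage by which NEW is lower than BASE.
pct_less() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

pct_less 307555042 116221985    # SRR9854072 size, Normalized vs Lite
pct_less 1020428245 817899569   # SRR17393369 size, Normalized vs Lite
pct_less 9.11 8.417             # SRR9854072 total time, prefetch approach
pct_less 131.156 102.96         # SRR17393369 total time, prefetch approach
```

With the table's values this gives roughly 62.2 and 19.8 for the sizes and 7.6 and 21.5 for the total times.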
Next: 3.4 Examining Assemblies from SRA Normalized and Lite Format Files
This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM) and the National Institute of Allergy and Infectious Diseases (NIAID), National Institutes of Health.