3.3 Using SRAToolkit to retrieve Data

Ryan edited this page Oct 16, 2024 · 18 revisions

Project B: Using SRAToolkit to get a FASTQ file

For the remainder of this portion of the workshop, we will be considering two accessions: SRR9854072 and SRR17393369.

To begin, we will add the toolkit's bin directory to the PATH environment variable so that the toolkit commands can be called by name. Enter the following command in your console:

export PATH=$PATH:/home/connorrp/sratoolkit.3.1.1-ubuntu64/bin
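After changing PATH it can help to confirm that the shell actually finds the toolkit commands. The following is a minimal sketch, not part of the toolkit: check_tools is a hypothetical helper name, and it relies only on the standard `command -v` lookup.

```shell
# Report which of the named commands can be found on the current PATH.
# Returns non-zero if at least one of them is missing.
# (check_tools is a hypothetical helper, not an SRA Toolkit command.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if tool_path=$(command -v "$tool"); then
            echo "found    $tool -> $tool_path"
        else
            echo "missing  $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the tool names that appear later on this page:
# check_tools prefetch fasterq-dump vdb-config
```

If any tool is reported as missing, the path given in the export command probably does not match where the toolkit was unpacked on your machine.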

Approach 1: Using a bare accession with prefetch

For most machines where FASTQ is the expected output, we recommend running prefetch and then fasterq-dump. prefetch downloads the run to local storage, and fasterq-dump then converts the local copy to FASTQ.

prefetch SRR9854072 

fasterq-dump SRR9854072

What File Format was retrieved?

If you want to inspect what quality score format is in the downloaded data, one way to do this is by inspecting the fastq output directly.

head SRR9854072_1.fastq
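Reading the first few records by eye works for a quick check, but a tally of the quality characters across the whole file can be easier to interpret. The sketch below assumes the common four-lines-per-record FASTQ layout (so every fourth line is a quality string); quality_chars is a hypothetical helper name.

```shell
# Print each distinct quality character in a FASTQ file with its count.
# Assumes 4 lines per record, so lines 4, 8, 12, ... hold quality strings.
# (quality_chars is a hypothetical helper, not an SRA Toolkit command.)
quality_chars() {
    awk 'NR % 4 == 0 {
             n = length($0)
             for (i = 1; i <= n; i++) count[substr($0, i, 1)]++
         }
         END { for (c in count) print c, count[c] }' "$1" | LC_ALL=C sort
}

# Example call on the file inspected above:
# quality_chars SRR9854072_1.fastq
```

A file whose quality lines contain only one or two distinct characters would suggest simplified quality scores, while a wide spread of characters would suggest full per-base scores. That reading of the output is an assumption of this sketch rather than something the toolkit reports.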

Approach 2: Using the SDL Service to Find Locations for File-Types for an Accession

SRA's SDL service can be used to locate SRA data directly. This is most useful when downloading data with a tool such as a commercial Cloud Service Provider's (CSP) Command Line Interface (CLI). In the command below, the response from SDL is piped through jq and column, which only serve to make the output more readable in a console.

wget -q "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?acc=SRR9854072&accept-alternate-locations=yes" -O - | jq -r '.result[].files[].locations[] | [.service, .region, .link] | @tsv' | column -t

Documentation for the service is available at its base URL: https://www.ncbi.nlm.nih.gov/Traces/sdl/2/

To see all types and locations of files available for an accession, use the locality function with at least one accession, e.g. https://www.ncbi.nlm.nih.gov/Traces/sdl/2/locality?acc=SRR24727461

To get a URL to a data file, use the retrieve function with at least one accession, often with additional options to ensure that the returned URL points to the correct type and location of file, e.g. https://www.ncbi.nlm.nih.gov/Traces/sdl/2/retrieve?acc=SRR24727461&zqa=zr&location=gs.us-east1&capability=can-pay:gs

The zqa option specifies preferences or requirements for the type of data file returned. All options are listed in the SDL documentation page; in this case we use 'zr', which prefers SRA Lite but accepts Normalized depending on availability. We specify Google Cloud in the us-east1 region with the location option (gs.us-east1), and because Google may apply overhead charges, we also need to indicate to SDL that we are able to pay for those charges using the capability option.
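The retrieve URL is an accession followed by optional query parameters, so it can be assembled in a small shell function. This is a sketch only: sdl_retrieve_url is a hypothetical helper name, it covers just the acc, zqa, location and capability parameters mentioned on this page, and it does not URL-encode its arguments.

```shell
# Build an SDL "retrieve" URL from an accession plus optional settings.
# Arguments: accession [zqa] [location] [capability]; empty ones are skipped.
# (sdl_retrieve_url is a hypothetical helper; values are not URL-encoded.)
sdl_retrieve_url() {
    acc=${1:-}
    zqa=${2:-}
    location=${3:-}
    capability=${4:-}
    url="https://www.ncbi.nlm.nih.gov/Traces/sdl/2/retrieve?acc=${acc}"
    if [ -n "$zqa" ]; then url="${url}&zqa=${zqa}"; fi
    if [ -n "$location" ]; then url="${url}&location=${location}"; fi
    if [ -n "$capability" ]; then url="${url}&capability=${capability}"; fi
    echo "$url"
}

# Example call reproducing the retrieve URL from the text:
# sdl_retrieve_url SRR24727461 zr gs.us-east1 can-pay:gs
```

The resulting string can then be passed to wget in the same way as the retrieve command shown earlier.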

Using toolkit to retrieve data from a specific location

If we know the exact URL for the file we would like to retrieve, we can pass that as an argument to prefetch (or fasterq-dump).

prefetch https://sra-download-internal.ncbi.nlm.nih.gov/sos5/sra-pub-zq-16/SRR009/854/SRR9854072.sralite.1

fasterq-dump SRR9854072

The toolkit is designed to be configured with the options and preferences that would otherwise be provided to SDL, so that it resolves a single best URL for accessing data. You can use the vdb-config tool, either in interactive mode (-i) or as a command-line tool, to specify which data format is preferred and to provide the billing credentials that are passed to the cloud service providers for fees such as egress or other overhead charges. However, prefetch can also be given a URL directly, or it accepts the --location option, which uses the same format as SDL to specify a provider and region to access data from:

prefetch SRR24727461 --location gs.us-east1

Using CSP CLI to copy data

If a file is located in a CSP, you can use that CSP's CLI to fetch the data.

If a run was downloaded using prefetch, the toolkit will be able to locate and access the run automatically. However, if the run was downloaded manually using a cloud platform CLI, or was moved after download, you will likely need to tell the toolkit where the run is stored locally. In this case a path to the file can be provided in place of the accession, as in the fasterq-dump call below.

aws s3 cp --no-sign-request s3://sra-pub-run-odp/sra/SRR9854072/SRR9854072 ~/

fasterq-dump ~/SRR9854072

Approach 3: Force toolkit to retrieve SRA Lite Format

To force a specific version of the run, you can configure the toolkit to prefer SRA Normalized or SRA Lite with vdb-config -Q <yes|no> (yes = Lite, no = Normalized). prefetch can also be asked to force Lite, and to fail if Lite is not available, with prefetch --eliminate-quals. After changing the configuration, we first remove the previously downloaded SRA record before running prefetch and fasterq-dump again. Here we also use the --outfile argument to specify a file name, to distinguish the output from the previously retrieved Normalized files.

vdb-config -Q yes

rm -r SRR9854072

prefetch SRR9854072

fasterq-dump --outfile SRR9854072_lite SRR9854072

Part 2: Repeat the process for the second accession

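We will now repeat this process for the second accession, using the vdb-config approach. Because vdb-config -Q yes is still in effect from the previous step, the first download is the Lite version; we then switch back to Normalized and download again. Note the file sizes and the number of bases returned.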

prefetch SRR17393369

fasterq-dump --outfile SRR17393369_lite SRR17393369

rm -r SRR17393369

vdb-config -Q no

prefetch SRR17393369

fasterq-dump SRR17393369
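To note the number of reads and bases in each output, the FASTQ files can be summarized with awk. The sketch below assumes the four-lines-per-record FASTQ layout, where the second line of each record holds the sequence; fastq_stats is a hypothetical helper name.

```shell
# Count reads and bases in a FASTQ file.
# Assumes 4 lines per record, so lines 2, 6, 10, ... hold the sequences.
# (fastq_stats is a hypothetical helper, not an SRA Toolkit command.)
fastq_stats() {
    awk 'NR % 4 == 2 { reads++; bases += length($0) }
         END { printf "reads=%.0f bases=%.0f\n", reads, bases }' "$1"
}

# Example calls (replace with the FASTQ file names produced on your machine):
# fastq_stats path/to/lite_output.fastq
# fastq_stats path/to/normalized_output.fastq
# File sizes in bytes can be compared with: wc -c path/to/file
```

Running it on the Lite and Normalized outputs of the same accession gives a quick check of whether the two formats return the same reads and bases.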

Evaluating the performance of different approaches

The bash command time can be run before any of the commands above to capture execution timings. In the table below the timings for a single execution of each indicated command for each run are shown.

| Accession | SRR9854072 | SRR9854072 | SRR9854072 | SRR9854072 | SRR17393369 | SRR17393369 |
| --- | --- | --- | --- | --- | --- | --- |
| Format | Normalized | Normalized | Normalized | Lite | Normalized | Lite |
| Size (bytes) | 307555042 | 307555042 | 307555042 | 116221985 | 1020428245 | 817899569 |
| Size (reads) | 1559786 | 1559786 | 1559786 | 1559786 | 30487546 | 30487546 |
| Approach | fastq alone | prefetch | aws | prefetch | prefetch | prefetch |
| Time (real), prefetch/aws | - | 4.584 | 3.185 | 2.109 | 51.369 | 9.502 |
| Time (real), fasterq-dump | 69.41 | 4.526 | 2.874 | 6.308 | 79.787 | 93.458 |
| Time (real), total | 69.41 | 9.11 | 6.059 | 8.417 | 131.156 | 102.96 |
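When a download step and a conversion step are timed separately, it can be convenient to label each measurement. The following is a rough sketch of a wrapper that does this; timed is a hypothetical helper name, and because it relies on `date +%s` it only resolves whole seconds, so it is coarser than the `time` output used for the table.

```shell
# Run a command and report its wall-clock duration in whole seconds.
# Usage: timed LABEL COMMAND [ARGS...]
# The report goes to stderr so it does not mix with the command's stdout.
# (timed is a hypothetical helper; it is coarser than the shell's own "time".)
timed() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) s (exit status $status)" >&2
    return "$status"
}

# Example calls mirroring the two steps timed in the table:
# timed prefetch prefetch SRR9854072
# timed fasterq-dump fasterq-dump SRR9854072
```

The wrapper passes the command's exit status through, so a failed download is still visible to the calling shell.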

Note that for SRR9854072 the SRA file retrieved is ~62% smaller when fetching Lite rather than Normalized (116221985 vs. 307555042 bytes), whereas for SRR17393369 it is only ~20% smaller. The smaller Lite files are reflected in the shorter time it takes to get a FASTQ file for each run. Comparing the prefetch approach for Normalized and Lite, the total time was ~8% shorter for SRR9854072 and ~21% shorter for SRR17393369 (the larger file). Finally, note that for SRR9854072 using the cloud CLI (aws) was fastest, while using fasterq-dump without prefetch was slowest by a wide margin.
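The percentage comparisons can be recomputed directly from the table values. The following is a small sketch using awk for the floating-point arithmetic; pct_reduction is a hypothetical helper name.

```shell
# Print how much smaller (or shorter) "after" is than "before", in percent.
# Usage: pct_reduction BEFORE AFTER
# (pct_reduction is a hypothetical helper, not an SRA Toolkit command.)
pct_reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# Size in bytes, Normalized vs. Lite:
# pct_reduction 307555042 116221985      # SRR9854072
# pct_reduction 1020428245 817899569     # SRR17393369
# Total time with prefetch, Normalized vs. Lite:
# pct_reduction 9.11 8.417               # SRR9854072
# pct_reduction 131.156 102.96           # SRR17393369
```

Keep in mind that each timing in the table comes from a single execution, so small differences in the time-based percentages may not be reproducible.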


Next: 3.4 Examining Assemblies from SRA Normalized and Lite Format Files