This module provides scripts to download and extract SRA files for high-throughput genomic data from NCBI (GEO portal), using an NCBI .soft file.
- docker
- root access may be required (depending on how docker is installed)
- 13.8 GB of free memory (docker image)
docker pull opoirion/ssrge
mkdir /<Results data folder>/
cd /<Results data folder>/
PATHDATA=`pwd`
The pipeline consists of 3 steps for downloading the data and 4 steps for aligning and calling SNVs:
# Download
docker run --rm opoirion/ssrge download_soft_file -h
docker run --rm opoirion/ssrge download_sra -h
docker run --rm opoirion/ssrge extract_sra -h
Let's download and process 2 samples from GSE79457 in a project named test_n2:
# download of the soft file containing the metadata for GSE79457
docker run --rm -v $PATHDATA:/data/results/:Z opoirion/ssrge download_soft_file -project_name test_n2 -soft_id GSE79457
# download sra files
docker run --rm -v $PATHDATA:/data/results/:Z opoirion/ssrge download_sra -project_name test_n2 -max_nb_samples 2
# extract sra files
docker run --rm -v $PATHDATA:/data/results/:Z opoirion/ssrge extract_sra -project_name test_n2
# remove sra files (optional)
docker run --rm -v $PATHDATA:/data/results/:Z opoirion/ssrge rm_sra -project_name test_n2
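The docker calls above can also be assembled from a script. The following is only a minimal sketch and is not part of the repository: the helper names `ssrge_command`, `pipeline_commands` and `run_pipeline` are hypothetical, and the only image name, sub-commands and flags used are those shown in the walkthrough. It is written to run under both Python 2.7 and Python 3.

```python
import subprocess

IMAGE = "opoirion/ssrge"  # docker image pulled above


def ssrge_command(step, path_data, project_name, extra=()):
    """Build the `docker run` argument list for one pipeline step.

    Hypothetical helper: it only reproduces the command shape shown above.
    """
    volume = "{0}:/data/results/:Z".format(path_data)
    return (["docker", "run", "--rm", "-v", volume, IMAGE, step,
             "-project_name", project_name] + list(extra))


def pipeline_commands(path_data, project_name, soft_id, max_nb_samples):
    """Return the four commands of the walkthrough, in execution order."""
    return [
        ssrge_command("download_soft_file", path_data, project_name,
                      ["-soft_id", soft_id]),
        ssrge_command("download_sra", path_data, project_name,
                      ["-max_nb_samples", str(max_nb_samples)]),
        ssrge_command("extract_sra", path_data, project_name),
        # optional clean-up of the downloaded .sra files
        ssrge_command("rm_sra", path_data, project_name),
    ]


def run_pipeline(commands):
    """Run each command in turn; check_call raises on the first failure."""
    for command in commands:
        subprocess.check_call(command)
```

Under these assumptions, `run_pipeline(pipeline_commands(path, "test_n2", "GSE79457", 2))`, with `path` set to the absolute path of the results folder, would issue the same four calls as the shell session. Building the argument lists separately from running them makes it possible to print and inspect the commands first.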
- python 2 (>=2.7)
- The only external software needed is fastq-dump, which extracts the .sra files. The path to the executable must be set in the config file or passed as an argument
- A folder with the name of the project must be created, and the absolute path to that folder must be set in the config file or passed as an argument
- The .soft file for the project must be downloaded from the NCBI GEO website and placed in the project folder (the default location)
- link for dataset description GEO webpage example
- An example soft file is also available in the ./example/ folder of the repository (default folder)
- all global variables can be set in the file ./garmire_download_ncbi_sra/config.py or passed as arguments
- argument descriptions are available at any time by invoking the -h (or -H) option or by consulting the config file:
-PROJECT_NAME The name of the project (defining the name of the folder)
-PATH_DATA The absolute path where the project will be created and the SRA files downloaded and extracted
-PATH_SOFT path to the .soft file (with the corresponding ftp addresses for the .sra files)
-NB_THREADS number of threads used to download the .sra files in parallel (default 2)
-FASTQ_DUMP path to the fastq-dump software
-FASTQ_DUMP_OPTION options used to extract the .sra files with fastq-dump ("--split-3 -B" is the default, and it is strongly recommended to keep it)
-LIMIT define the maximum number of sra files to be downloaded (default None)
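To illustrate how the .soft file and the -LIMIT option fit together, here is a minimal sketch, not the repository's parser. It assumes the GEO family SOFT layout in which each sample starts with a `^SAMPLE = ` line and lists its files on `!Sample_supplementary_file...` lines. The function name `sra_addresses` is hypothetical, and the excerpt in `EXAMPLE` is a made-up placeholder, not real GEO data.

```python
from collections import OrderedDict

# Made-up placeholder excerpt (not real GEO records), for illustration only.
EXAMPLE = """\
^SAMPLE = GSM_A
!Sample_title = placeholder cell 1
!Sample_supplementary_file_1 = ftp://example.org/sra/SRX_A
^SAMPLE = GSM_B
!Sample_supplementary_file_1 = ftp://example.org/sra/SRX_B
^SAMPLE = GSM_C
!Sample_supplementary_file_1 = ftp://example.org/sra/SRX_C
""".splitlines()


def sra_addresses(soft_lines, limit=None):
    """Map each sample id to the ftp addresses listed for it.

    `limit` plays the role of the documented -LIMIT option: None means
    no cap, otherwise parsing stops once `limit` samples were collected.
    """
    samples = OrderedDict()
    current = None
    for raw in soft_lines:
        line = raw.strip()
        if line.startswith("^SAMPLE") and "=" in line:
            if limit is not None and len(samples) >= limit:
                break
            current = line.split("=", 1)[1].strip()
            samples[current] = []
        elif line.startswith("^"):
            # another entity (series, platform, ...) closes the sample
            current = None
        elif (current is not None and "=" in line
              and line.startswith("!Sample_supplementary_file")):
            address = line.split("=", 1)[1].strip()
            if address.startswith("ftp://"):
                samples[current].append(address)
    return samples
```

With the placeholder excerpt, `sra_addresses(EXAMPLE, limit=2)` keeps only the first two samples. In this sketch a sample without any ftp address still counts toward the limit; the actual scripts may count differently.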
Move to the folder of the git project (https://github.com/lanagarmire/SSrGE.git):
cd SSrGE
- Set the global variables in the config file (garmire_download_ncbi_sra/config.py) or pass them each time as arguments
- [optional] Running the tests:
python ./test/test_download.py -v
- download and extract data (by default, the .sra files listed in the example .soft file):
python garmire_download_ncbi_sra/download_data.py
- download and extract data (passing options as arguments):
python garmire_download_ncbi_sra/download_data.py -NB_THREADS 5 -PATH_SOFT tutut/...
- extract SRA file
python garmire_download_ncbi_sra/extract_data.py
- remove SRA file
python garmire_download_ncbi_sra/remove_sra.py
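For orientation, the extraction step amounts to calling fastq-dump on each downloaded .sra file with the options given in FASTQ_DUMP_OPTION. The sketch below shows one way such a command could be assembled; it is an assumption, not the code of extract_data.py. The function names are hypothetical, and the use of fastq-dump's `-O` (output directory) flag is a choice made for this example.

```python
import shlex
import subprocess

DEFAULT_OPTION = "--split-3 -B"  # default value documented for FASTQ_DUMP_OPTION


def fastq_dump_command(sra_file, out_dir, fastq_dump="fastq-dump",
                       option=DEFAULT_OPTION):
    """Build the argument list that extracts one .sra file into out_dir.

    `fastq_dump` stands for the FASTQ_DUMP path and `option` for
    FASTQ_DUMP_OPTION; writing to out_dir with -O is an assumption.
    """
    return [fastq_dump] + shlex.split(option) + ["-O", out_dir, sra_file]


def extract(sra_file, out_dir, **kwargs):
    """Run fastq-dump on one file; check_call raises if it fails."""
    subprocess.check_call(fastq_dump_command(sra_file, out_dir, **kwargs))
```

Splitting the option string with `shlex.split` keeps quoted values intact if a custom FASTQ_DUMP_OPTION is supplied.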
- Developer: Olivier Poirion (PhD)
- contact: [email protected]