Download SRA files from NCBI (GEO)

This module provides scripts to download and extract SRA files for high-throughput genomic data from NCBI (GEO portal), using the NCBI .soft file of a GEO dataset.

SRA project download using docker

Requirements

  • docker
  • root access may be required (to run docker)
  • 13.8 GB of free disk space (for the docker image)

installation (local)

docker pull opoirion/ssrge
mkdir /<Results data folder>/
cd /<Results data folder>/
PATHDATA=`pwd`

usage

The full pipeline consists of 3 steps for downloading the data and 4 steps for aligning and calling SNVs. The 3 download steps are listed below; use -h to display the options of each step:

# Download
docker run --rm opoirion/ssrge download_soft_file -h
docker run --rm opoirion/ssrge download_sra -h
docker run --rm opoirion/ssrge extract_sra -h

example

Let's download and process 2 samples from GSE79457 in a project named test_n2:

# download of the soft file containing the metadata for GSE79457
docker run --rm -v $PATHDATA:/data/results/:Z opoirion/ssrge download_soft_file -project_name test_n2 -soft_id GSE79457
# download sra files
docker run --rm -v $PATHDATA:/data/results/:Z opoirion/ssrge download_sra -project_name test_n2 -max_nb_samples 2
# extract sra files
docker run --rm -v $PATHDATA:/data/results/:Z opoirion/ssrge extract_sra -project_name test_n2
# remove sra files (optional)
docker run --rm -v $PATHDATA:/data/results/:Z opoirion/ssrge rm_sra -project_name test_n2
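
The commands above differ only in the step name and its options, so they can be generated from a few parameters. The following is a minimal Python 3 sketch of that idea, not part of the SSrGE code base: the helper names docker_command and pipeline_commands and the /tmp/ssrge_results folder are hypothetical, while the image name, mount point, step names and options are copied from the example above. It only prints the commands (a dry run) and does not call docker.

```python
import os
import shlex

IMAGE = "opoirion/ssrge"          # image name from the instructions above
CONTAINER_DIR = "/data/results/"  # mount point used in the example above


def docker_command(step, path_data, *step_args):
    """Return the argument list for one pipeline step run inside the container."""
    volume = "{0}:{1}:Z".format(os.path.abspath(path_data), CONTAINER_DIR)
    return ["docker", "run", "--rm", "-v", volume, IMAGE, step] + list(step_args)


def pipeline_commands(path_data, project_name, soft_id,
                      max_nb_samples=None, remove_sra=False):
    """List the docker commands for download, extraction and optional cleanup."""
    project = ["-project_name", project_name]
    download_args = list(project)
    if max_nb_samples is not None:
        download_args += ["-max_nb_samples", str(max_nb_samples)]
    commands = [
        docker_command("download_soft_file", path_data,
                       *(project + ["-soft_id", soft_id])),
        docker_command("download_sra", path_data, *download_args),
        docker_command("extract_sra", path_data, *project),
    ]
    if remove_sra:
        commands.append(docker_command("rm_sra", path_data, *project))
    return commands


# Dry run: print the commands instead of executing them.
# "/tmp/ssrge_results" is a hypothetical results folder.
for command in pipeline_commands("/tmp/ssrge_results", "test_n2", "GSE79457",
                                 max_nb_samples=2, remove_sra=True):
    print(" ".join(shlex.quote(part) for part in command))
```

To actually run a step, each argument list could be passed to subprocess.check_call, which stops at the first failing command.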

Installation from GitHub (not up to date: use the docker image for now)

Requirements

  • python 2 (>=2.7)
  • The only external software needed is fastq-dump, which extracts the .sra files. The path to the executable must be set in the config file or passed as an argument
  • A folder with the name of the project must be created, and the absolute path to that folder must be set in the config file or passed as an argument
  • The .soft file of the project must be downloaded from the NCBI GEO website and placed in the project folder (the default location)
    • link for dataset description GEO webpage example
    • An example soft file is also available in the ./example/ folder of the repository (default folder)
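
The requirements above can be checked before starting a download. The following is a minimal Python 3 sketch, not part of the repository: the function name check_requirements is hypothetical, and it assumes that the project folder is located at PATH_DATA/PROJECT_NAME, which is one reading of the options described in this README.

```python
import os
import shutil


def check_requirements(path_data, project_name,
                       fastq_dump="fastq-dump", soft_file=None):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # fastq-dump can be a bare name (searched on PATH) or a full path.
    if shutil.which(fastq_dump) is None:
        problems.append(
            "fastq-dump executable not found: {0}".format(fastq_dump))

    # The data path is expected to be absolute.
    if not os.path.isabs(path_data):
        problems.append("path is not absolute: {0}".format(path_data))

    # Assumption: the project folder is PATH_DATA/PROJECT_NAME.
    project_dir = os.path.join(path_data, project_name)
    if not os.path.isdir(project_dir):
        problems.append("project folder not found: {0}".format(project_dir))

    # The .soft file is only checked when its path is given explicitly.
    if soft_file is not None and not os.path.isfile(soft_file):
        problems.append(".soft file not found: {0}".format(soft_file))

    return problems


# Example: check a project named test_n2 under the current directory.
for problem in check_requirements(os.getcwd(), "test_n2"):
    print("WARNING: " + problem)
```

Returning a list rather than raising on the first problem lets all missing requirements be reported in one pass.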

configuration

  • all global variables can be set in the file ./garmire_download_ncbi_sra/config.py or passed as command-line arguments
  • the description of the arguments can be displayed at any time with the -h (or -H) option, or found in the config file:
-PROJECT_NAME    The name of the project (defining the name of the folder)
-PATH_DATA    The absolute path where the project will be created and the SRA files downloaded and extracted
-PATH_SOFT    path toward the .soft file (with the corresponding ftp addresses for the .sra files)
-NB_THREADS    number of threads (parallel downloads) to use for downloading sra files (default 2)
-FASTQ_DUMP    path to the fastq-dump software
-FASTQ_DUMP_OPTION    options used to extract the sra files with fastq-dump ("--split-3 -B" is the default, and it is strongly recommended to keep it)
-LIMIT    define the maximum number of sra files to be downloaded (default None)
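
To illustrate how these options fit together, here is a minimal Python sketch of a command-line parser that mirrors the list above, plus a helper that assembles a fastq-dump call from the parsed values. It is an illustration only and not the project's actual parser: build_parser and fastq_dump_command are hypothetical names, the default value "fastq-dump" for -FASTQ_DUMP is an assumption, and the way the project really invokes fastq-dump (including the -O output directory flag used here) may differ.

```python
import argparse
import shlex


def build_parser():
    """Parser mirroring the options listed above (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Download and extract SRA files (sketch)")
    parser.add_argument("-PROJECT_NAME",
                        help="name of the project (and of its folder)")
    parser.add_argument("-PATH_DATA",
                        help="absolute path where the project is created")
    parser.add_argument("-PATH_SOFT", help="path to the .soft file")
    parser.add_argument("-NB_THREADS", type=int, default=2,
                        help="number of parallel downloads")
    # Assumption: fall back to a fastq-dump found on PATH.
    parser.add_argument("-FASTQ_DUMP", default="fastq-dump",
                        help="path to the fastq-dump executable")
    parser.add_argument("-FASTQ_DUMP_OPTION", default="--split-3 -B",
                        help="options passed to fastq-dump")
    parser.add_argument("-LIMIT", type=int, default=None,
                        help="maximum number of .sra files to download")
    return parser


def fastq_dump_command(args, sra_file, out_dir):
    """Assemble a fastq-dump call; -O sets the output directory."""
    options = shlex.split(args.FASTQ_DUMP_OPTION)
    return [args.FASTQ_DUMP] + options + ["-O", out_dir, sra_file]


# "SRR0000000.sra" and "fastq" are placeholder names for this example.
args = build_parser().parse_args(["-NB_THREADS", "5", "-LIMIT", "2"])
print(fastq_dump_command(args, "SRR0000000.sra", "fastq"))
```

Splitting the option string with shlex.split keeps "--split-3" and "-B" as separate arguments, so the resulting list can be passed to subprocess without going through a shell.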

usage

Move to the folder of the git project (https://github.com/lanagarmire/SSrGE.git):

cd SSrGE
  • Set the global variables in the config file (garmire_download_ncbi_sra/config.py), or pass them as arguments on each call
  • [optional] Run the tests:
python ./test/test_download.py -v
  • Download and extract data (by default, the .sra files listed in the example .soft file are downloaded):
python garmire_download_ncbi_sra/download_data.py
  • Download and extract data (passing options as arguments):
python garmire_download_ncbi_sra/download_data.py -NB_THREADS 5 -PATH_SOFT tutut/...
  • Extract the SRA files:
python garmire_download_ncbi_sra/extract_data.py
  • Remove the SRA files:
python garmire_download_ncbi_sra/remove_sra.py

contact and credentials