A Python 3 pipeline for submitting CRAM files generated by ArrayExpress to the European Nucleotide Archive (ENA).
Get a copy of the project and install the Python dependencies with pip install -r requirements.txt.
Or build the docker image and skip the next section.
Start the central scheduler
luigid
and visit it on port 8082, e.g. http://localhost:8082. Then run a pipeline and follow the progress in your browser. If the scheduler is not running on your own machine, replace localhost with the name of the local (or farm) server you are working on.
export ena_user=webin-xxx
export ena_password=xxxxxxxx
luigi --module pipeline SubmitSpecies --species oryza_sativa
docker run \
-e "ena_user=webin-xxx" \
-e "ena_password=xxxxxxxx" \
<image> SubmitSpecies --species oryza_sativa
This will
- make a request to getRunsByOrganism on the ArrayExpress API to fetch a list of all Oryza sativa CRAM files which have been marked as 'Complete'
- upload each CRAM file to the European Nucleotide Archive (ENA) FTP server
- collect metadata required for the submission
- create 'submission' and 'analysis' XML documents required for programmatic submission to the ENA and submit them
- store the resulting submission and analysis accessions in an SQLite database
Add --test --limit 3 to the luigi command to submit to the ENA test server (results are not publicly visible) and to submit only 3 CRAM files instead of all of them.
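The last step in the list above, storing accessions in SQLite, can be sketched with Python's built-in sqlite3 module. This is a minimal illustration and not the pipeline's actual code: the table name, column names, helper functions, and the accession values in the usage note are all hypothetical.

```python
import sqlite3


def open_db(path):
    """Open the database and make sure the (hypothetical) table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        " cram_file TEXT PRIMARY KEY,"
        " submission_accession TEXT NOT NULL,"
        " analysis_accession TEXT NOT NULL)"
    )
    return conn


def store_accessions(conn, cram_file, submission_acc, analysis_acc):
    """Record the accessions for one CRAM file.

    INSERT OR IGNORE makes a repeated call for the same file a no-op,
    so re-running after a partial failure does not create duplicates.
    """
    conn.execute(
        "INSERT OR IGNORE INTO submissions VALUES (?, ?, ?)",
        (cram_file, submission_acc, analysis_acc),
    )
    conn.commit()


def is_submitted(conn, cram_file):
    """Return True if accessions are already stored for this file."""
    row = conn.execute(
        "SELECT 1 FROM submissions WHERE cram_file = ?", (cram_file,)
    ).fetchone()
    return row is not None
```

For example, after store_accessions(conn, "a.cram", "ERA0000001", "ERZ0000001") (made-up values), is_submitted(conn, "a.cram") returns True, and a second store_accessions call for the same file leaves the stored row unchanged.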
export ena_user=webin-xxx
export ena_password=xxxxxxxx
luigi --module pipeline SubmitAllSpecies
docker run \
-e "ena_user=webin-xxx" \
-e "ena_password=xxxxxxxx" \
<image> SubmitAllSpecies
This will make a request to getOrganisms on the ArrayExpress API to fetch a list of all plant species, and run SubmitSpecies (described above) for each.
Multiple workers can be run in parallel on the same host by adding the --workers parameter, e.g. luigi --module pipeline SubmitAllSpecies --workers 8. Since this pipeline is limited by the throughput of the ArrayExpress and ENA FTP servers, adding more workers than that will not improve performance.
Every step, from discovering CRAM files through collecting metadata to submitting to the ENA, is implemented as a luigi Task.
This makes it easy to deal with the failures that will inevitably happen when some of the 30k+ long-running, interdependent tasks fail. Instead of cleaning up after failed tasks, resetting state, or being forced to start again from scratch, we can rely on luigi to check the completion of (atomic) tasks and resume safely.
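The idea behind "check completion and resume safely" can be illustrated without luigi. The sketch below uses only the standard library and is not luigi's API: run_once is a hypothetical helper that treats a task as complete when its output file exists, and writes that file atomically so a crash never leaves a half-written output that looks finished.

```python
import os
import tempfile


def run_once(output_path, produce):
    """Run produce() only if output_path is missing.

    The result is written to a temporary file in the same directory and
    then renamed into place, so the output either exists in full or not
    at all. Returns True if the work was done, False if it was skipped.
    """
    if os.path.exists(output_path):
        return False  # completed on a previous run, nothing to do

    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(produce())
        os.replace(tmp_path, output_path)  # atomic rename on the same filesystem
    except BaseException:
        # Leave no trace of the failed attempt; the next run starts clean.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return True
```

If produce() raises, no output file is left behind, so the next invocation simply tries again. If it succeeded earlier, the call returns immediately. This is the property that lets a run of many thousands of tasks be restarted without manual cleanup.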
Programmatic submission to the European Nucleotide Archive requires the creation of 'submission' and 'analysis' XML documents, following the provided schemas.
These documents are created with the help of generateDS, which generates an API to match the schemas. The generated code is in ena/schema.
Should the schemas change, a new version of the API can be generated with
generateDS.py -o "SRA_analysis.py" -s "SRA_analysis_sub.py" SRA.analysis.xsd
generateDS.py -o "SRA_submission.py" -s "SRA_submission_sub.py" SRA.submission.xsd
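For orientation, the general shape of a 'submission' document can be sketched with the standard library's xml.etree.ElementTree instead of the generated generateDS classes. This is an assumption-laden illustration: the element and attribute names (SUBMISSION, ACTIONS, ACTION, ADD, source, schema) reflect one common form of the ENA submission schema and should be checked against SRA.submission.xsd, and build_submission_xml is a hypothetical helper, not part of this project.

```python
import xml.etree.ElementTree as ET


def build_submission_xml(analysis_file):
    """Build a minimal submission document that adds one analysis file.

    Element names are assumed from the ENA submission schema; verify
    them against SRA.submission.xsd before relying on this.
    """
    submission = ET.Element("SUBMISSION")
    actions = ET.SubElement(submission, "ACTIONS")
    action = ET.SubElement(actions, "ACTION")
    # The ADD action points at the accompanying analysis XML document.
    ET.SubElement(action, "ADD", {"source": analysis_file, "schema": "analysis"})
    return ET.tostring(submission, encoding="unicode")
```

The generated generateDS API serves the same purpose but mirrors the full schema, which is why it is regenerated rather than edited by hand when the schemas change.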
Luigi will print a summary of all work that has been done. Here's a simplified example:
===== Luigi Execution Summary =====
Scheduled 40132 tasks of which:
* 7923 present dependencies were encountered:
- 7918 StoreEnaSubmissionResult(...)
* 32186 ran successfully:
- 10705 StoreEnaSubmissionResult(...)
- 36 SubmitSpecies(species=aegilops_tauschii) ...
- 10705 SubmitToEna(...)
- 10702 UploadCramToENA(...)
* 11 failed:
- 8 SubmitToEna(...)
- 3 UploadCramToENA(...)
* 3 were left pending, among these:
* 3 were missing external dependencies:
- 1 SubmitAllSpecies(limit=0)
- 2 SubmitSpecies(species=oryza_sativa) and SubmitSpecies(species=zea_mays)
present dependency means the task has completed successfully on a previous run of luigi. StoreEnaSubmissionResult is the task that stores accessions returned from ENA in SQLite. So this section tells us that 7918 files have been submitted before.
ran successfully means the task completed successfully just now. So we can see that 10705 files have been submitted, and 36 species have been processed fully.
failed means just that: errors are in the log or the scheduler web interface. 8 files couldn't be submitted to the ENA REST endpoint. 3 files couldn't be uploaded to the ENA FTP server.
These failures also affect the tasks that depend on them, which are listed next as left pending. We can see that oryza_sativa and zea_mays were affected by the submission and upload failures.
The good news is that the 32186 tasks that ran successfully won't be run again. On the next run, only the 11 failed tasks and the tasks left pending because of them need to be processed.
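When runs are scripted, the per-status counts can be pulled out of the summary text. The sketch below assumes the summary has the layout shown in the simplified example above, with status lines of the form "* <count> <label>:". parse_summary is a hypothetical helper, and the real luigi output may differ in detail.

```python
import re

# Matches status lines such as "* 11 failed:" and captures count and label.
SUMMARY_LINE = re.compile(r"^\s*\*\s+(\d+)\s+(.+?):\s*$")


def parse_summary(text):
    """Return a dict mapping each status label to its task count."""
    counts = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts
```

Applied to the example summary, parse_summary(text)["failed"] is 11, which a wrapper script could use to decide whether another run is needed.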