Group 1: First Weekend Homework
Using one of the packaging or container systems described (e.g., Conda, Guix, or Docker), prepare a working environment to run the examples. Now try to run the workflows using the tools presented and appreciate the different approaches to execute the same example.
Our group used the Ubuntu 18.04 Devel and Docker image as our base image. We deployed the image on one of the instances that we created in-session this week. We then cloned the repository provided by the authors and built our working environment using the Debian Dockerfile. This was a relatively smooth process, but we realized that the build command needed a build-context destination (the trailing "." from the provided code):
$ docker build -t scalability_debian -f Docker/Dockerfile.debian .
Our attempts to run the four example workflows encountered many errors. We've detailed here the steps taken to resolve these issues.
- Start a new instance, or re-run git clone after renaming the "data" folder, in case the data clusters were modified by previously running other workflow tools such as Snakemake.
- Clone scalability-reproducibility-chapter
$ git clone https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter.git
$ cd scalability-reproducibility-chapter
- Set up a virtual environment
$ virtualenv -p python2 venv
# Create a virtual environment; python3 can be used as well
$ source venv/bin/activate
# Activate the environment before installing cwltool
Check how to set up a virtual environment.
- Install cwlref-runner
$ pip install cwlref-runner
- Run the example
$ CWL/workflow.cwl --clusters data
This failed with the error "Command '['docker', 'pull', 'biocontainers/clustal-omega']' returned non-zero exit status 1." at the clustal step.
- Something is wrong with the image 'biocontainers/clustal-omega'. Searching for this container on Docker Hub (https://hub.docker.com/r/biocontainers/clustal-omega/tags) shows three version tags but no "latest" tag, so a tag must be specified explicitly when pulling this image.
$ cd CWL
$ nano clustalo.cwl
dockerPull: biocontainers/clustal-omega:v1.2.1_cv5
$ cd ..
$ CWL/workflow.cwl --clusters data
This failed with the error "Docker image pal2nal not found" at the pal2nal step.
- A similar issue with the pulled image. Look up the image name and tag (https://bioconda.github.io/recipes/pal2nal/README.html) and update the dockerPull entry in the pal2nal.cwl file.
$ nano pal2nal.cwl
dockerPull: quay.io/biocontainers/pal2nal:14.1--pl526_0
$ CWL/workflow.cwl --clusters data
This failed with the error "Command '['docker', 'pull', 'quay.io/biocontainers/paml']' returned non-zero exit status 1." at the codeml step.
- The same issue again. Look up the image name and tag (https://quay.io/repository/biocontainers/paml?tag=latest&tab=tags) and update the dockerPull entry in the codeml.cwl file.
$ nano codeml.cwl
dockerPull: quay.io/biocontainers/paml:4.9--h14c3975_4
$ CWL/workflow.cwl --clusters data
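The three nano edits above can also be applied non-interactively with sed. A sketch, using a generated sample file rather than the real repository (the file contents and paths below are illustrative; in the repo the files live under CWL/, and the same pattern applies to pal2nal.cwl and codeml.cwl with their respective images):

```shell
# Create an illustrative sample file standing in for CWL/clustalo.cwl.
workdir=$(mktemp -d)
cat > "$workdir/clustalo.cwl" <<'EOF'
hints:
  DockerRequirement:
    dockerPull: biocontainers/clustal-omega
EOF

# Pin the tag found on Docker Hub (v1.2.1_cv5) by rewriting the dockerPull line.
sed -i 's|dockerPull: biocontainers/clustal-omega.*|dockerPull: biocontainers/clustal-omega:v1.2.1_cv5|' "$workdir/clustalo.cwl"

# Show the patched line.
patched=$(grep 'dockerPull' "$workdir/clustalo.cwl")
echo "$patched"
```

Scripting the edit this way makes the fix itself reproducible, instead of relying on a manual nano session.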
- Install Guix by downloading and running the installer shell script. Download it with:
$ wget https://git.savannah.gnu.org/cgit/guix.git/plain/etc/guix-install.sh
and run the shell script to install:
$ sh guix-install.sh
There are also alternative installation methods; however, installing through the shell script is more reliable and is the recommended approach.
- After Guix is installed, run the command below to install GWL:
guix package -i gwl
There is a high chance that you will get the warning message below:
guile: warning: failed to install locale
hint: Consider installing the 'glibc-utf8-locales' or 'glibc-locales' package and defining 'GUIX_LOCPATH', along these lines:
guix package -i glibc-utf8-locales
export GUIX_LOCPATH="$HOME/.guix-profile/lib/locale"
See the "Application Setup" section in the manual, for more info.
guix: workflow: command not found
Try 'guix --help' for more information.
This warning may seem harmless, since it is not reported as an error. However, it is crucial to fix it; otherwise the following steps will not work.
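The fix suggested by the hint can be applied directly; a sketch of the commands (assuming bash, and assuming the default Guix profile location; the profile file path is an assumption, adjust for your shell):

```shell
# Install the locale package the hint suggests and point Guix at it.
guix package -i glibc-utf8-locales
export GUIX_LOCPATH="$HOME/.guix-profile/lib/locale"

# Persist the variable for future shells (assumes bash; adjust for your shell).
echo 'export GUIX_LOCPATH="$HOME/.guix-profile/lib/locale"' >> "$HOME/.bash_profile"
```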
The solution is a bit tricky and can take some time. To fix the problem, follow the steps under this thread.
- After that, change to the directory where your GWL workflow files exist:
cd scalability-reproducibility-chapter/GWL
- If you have not fixed the warning from step 2, running the command below will give an error that there is no such command as "workflow". However, if you have fixed the problem, you will have no errors.
guix workflow -r example-workflow
Unfortunately, at this point the guix workflow command cannot find and execute the workflow files, even though they exist!
- Install Snakemake from source
$ git clone https://bitbucket.org/snakemake/snakemake.git
$ cd snakemake
$ virtualenv -p python3 .venv
$ source .venv/bin/activate
$ python setup.py install
Check how to install from source code
- Prepare Miniconda3 for installing packages
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
Check how to setup miniconda-3
Reopen the terminal; you can then use the new conda command to install software packages.
- Prepare the Bioconda Snakemake tool
$ conda install -y -c bioconda snakemake=4.2.0
- Run the example from the repository tree
$ cd Snakemake
$ snakemake -n
# -n performs a dry run; omit it to actually execute the workflow
- The run produced the results folder [paml_results].
- It was difficult to present the workflow graph.
Install Nextflow:
curl -s https://get.nextflow.io | bash
- Requires Java (not on the image, despite the information indicated in the "tags"); this was installed following the instructions on: link
$ ./nextflow run Nextflow/workflow.nf -with-docker evolutionarygenomics/scalability -with-dag workflow.pdf
- Requires Graphviz; this was installed following the instructions on: link (package link to Graphviz).
- This script generated a PDF showing the workflow.
Compare the different syntaxes used by the tools to define a workflow and explore how each tool describes the processes and the dependencies in a different way.
CWL (shortest code; higher risk of running into issues with each .cwl file; relatively long runtime)
Script – three steps:
- “extract_clusters”: the input is the directory of clusters; the three output files are proteins, nucleotides, and names. This step sorts the input clusters one by one, based on their basenames (is this a more simplified process than in the other two tools?)
- “per_cluster_workflow.cwl”: uses the output from “extract_clusters” to create the results file, and is scattered over the clusters (CWL's scatter construct runs the sub-workflow once per cluster)
- Time used: 1:25:36
- INFO [job codeml_72] Max memory used: 962MiB
- INFO [job codeml_72] completed success
GWL:
- Requires modules from Guix and includes deployment; requires many more lines of code
- Then defines three processes, similar to snakemake
- Code is built in to create separate processes for the different clusters, only requires one line to run (in theory)
Snakemake: defines 3 rules:
- Clustal: sorts inputs into two groups – “guidetree” and “align”; whatever doesn’t go into “guidetree” goes into “align” (?)
- Pal2nal: takes the results of “align” and runs them through pal2nal, producing input for paml (?)
- Codeml: takes the results of the previous step, runs them through codeml, and outputs the results as a text file
- Job counts:
  count  jobs
  1      all
  72     clustal
  72     codeml
  72     pal2nal
  217    total
- Duration: within 10 minutes (fastest one)
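The job counts above add up as expected: each of the 72 clusters passes through the three rules, plus one aggregate "all" job. A quick sanity check (a sketch; the numbers are taken from the run above):

```shell
# Sanity-check the Snakemake job count: 72 clusters x 3 rules + 1 "all" job.
clusters=72
rules=3                        # clustal, pal2nal, codeml
total=$((clusters * rules + 1))
echo "$total"                  # expected: 217
```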
Nextflow: defines 3 processes, also similar to Snakemake and GWL
- Duration : 3h 52m 53s
- CPU hours: 10.2
- Succeeded: 216
- Can nicely present the workflow graph.
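The Nextflow success count lines up the same way: 72 clusters run through 3 processes, with no aggregate job (a sketch based on the run above):

```shell
# 72 clusters, each run through clustal, pal2nal, and codeml.
nf_succeeded=$((72 * 3))
echo "$nf_succeeded"           # expected: 216
```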
Use the Amazon EC2 calculation sheet, and calculate how much it would cost to store 100 GB in S3, and execute a calculation on 100 “large” nodes, each reading 20 GB of data. Do the same for another cloud provider.
Pricing will depend on runtime and access frequency.
Amazon: 100 instances of m5.16xlarge + 100 GB of S3 storage
- 100% usage: 238,016.84
- 75% usage: 178,987.94
- 50% usage: 119,960.09
- 25% usage: 60,455.60
- 10% usage: 24,626.65
- 5% usage: 12,465.03
Google: 100 instances of c2-standard-8 (8 vCPUs, 32 GB RAM) + 100 GB storage
- 100% usage: 24,396.43
- 50% usage: 14,339.92
- 25% usage: 7,618.57
- 10% usage: 3,047.43
- 5% usage: 1,523.71
Microsoft Azure: 100 instances (20 GB each) + 100 GB storage: $233,286/year
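The usage tiers above reflect a simple model: the compute portion scales with usage while storage is a small, roughly fixed add-on. A sketch of that model with purely hypothetical rates (the per-hour and per-GB-month prices below are illustrative assumptions, not the calculators' actual figures):

```shell
# Hypothetical rates for illustration only.
hourly_rate=3.00        # $/hour per instance (assumed)
instances=100
hours=730               # roughly one month
storage_gb=100
storage_rate=0.023      # $/GB-month (assumed)
usage=50                # percent of the month the instances run

# Monthly cost = compute (scaled by usage) + fixed storage.
cost=$(awk -v r=$hourly_rate -v n=$instances -v h=$hours \
           -v s=$storage_gb -v sr=$storage_rate -v u=$usage \
           'BEGIN { printf "%.2f", n*r*h*(u/100) + s*sr }')
echo "$cost"            # expected with these rates: 109502.30
```

Note how the storage term (a few dollars) is dwarfed by compute, which is why the tiered quotes shrink almost linearly with usage.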
- Creating a reproducible study is complex, even when it is a stated goal. There are a lot of different components to consider and work through.
- Cloud services are expensive! I think this is important to keep in mind when designing a study workflow, in particular how to use resources efficiently and effectively, especially when resources are limited.
- One of the things that might help would be the use of containers for full reproducibility.
- Additional commenting within the code would also be useful - particularly about dependencies and packages needed to execute scripts.
This project, and the course in general, have also made me wonder what responsibility the researcher has to ensure long-term reproducibility. How does one balance the time investment against ensuring that the study will remain runnable in the future? What standards or best practices should we use as guidance?