
Group 1: First Weekend Homework

kwchen16 edited this page Jul 22, 2019 · 19 revisions

First Weekend Homework

Deliverables:

Q1 - Answer.

Using one of the packaging or container systems described (e.g., Conda, Guix, or Docker), prepare a working environment to run the examples. Now try to run the workflows using the tools presented and appreciate the different approaches to execute the same example.

Our group used the Ubuntu 18.04 Devel and Docker image as our base image. We deployed the image on one of the instances we created in session this week, cloned the repository provided by the authors, and built our working environment from the Debian Dockerfile. This was a relatively smooth process, but we realized that the build command needs a destination context: docker build -t scalability_debian -f Docker/Dockerfile.debian . (the trailing "." had to be added to the provided code).

Our attempts to run the four example workflows encountered many errors. We've detailed here the steps taken to resolve these issues.

A. CWL

  1. Start a new instance, or re-clone the repository after renaming the existing "data" folder, in case the data clusters were modified by a previous run of another workflow tool such as Snakemake.

  2. Git clone scalability-reproducibility-chapter:

    $ git clone https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter.git
    $ cd scalability-reproducibility-chapter

  3. Set up a virtual environment:

    $ virtualenv -p python2 venv    # create a virtual environment (python3 works as well)
    $ source venv/bin/activate      # activate the environment before installing cwltool

    Check how to set up the environment.

  4. Install cwlref-runner:

    $ pip install cwlref-runner

  5. Run the example:

    $ CWL/workflow.cwl --clusters data

    We got the error "Command '['docker', 'pull', 'biocontainers/clustal-omega']' returned non-zero exit status 1." from the clustal step.

    • Something is wrong with the image 'biocontainers/clustal-omega'. If you look this container up on Docker Hub (https://hub.docker.com/r/biocontainers/clustal-omega/tags), you can see that it has three version tags but no "latest" tag. This means a tag must be specified when pulling the image, so edit the dockerPull line:

    $ cd CWL
    $ nano clustalo.cwl
      (set) dockerPull: biocontainers/clustal-omega:v1.2.1_cv5
    $ cd ..

    $ CWL/workflow.cwl --clusters data

    We got the error "Docker image pal2nal not found" from the pal2nal step.

    • A similar issue with the pulled image. Look up the image source (name and tag/version, see https://bioconda.github.io/recipes/pal2nal/README.html) and revise the dockerPull line in the pal2nal.cwl file:

    $ nano pal2nal.cwl
      (set) dockerPull: quay.io/biocontainers/pal2nal:14.1--pl526_0

    $ CWL/workflow.cwl --clusters data

    We got the error "Command '['docker', 'pull', 'quay.io/biocontainers/paml']' returned non-zero exit status 1." from the codeml step.

    $ CWL/workflow.cwl --clusters data
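Both image fixes above amount to the same thing: pinning an explicit tag in the step's Docker requirement. A minimal sketch of the corrected field, assuming the requirement sits in a hints block (the repo's actual .cwl files may structure this differently):

```cwl
hints:
  DockerRequirement:
    dockerPull: biocontainers/clustal-omega:v1.2.1_cv5
```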

B. GWL

  1. Install Guix by downloading and running the installer shell script. You can download the script with:

    $ wget https://git.savannah.gnu.org/cgit/guix.git/plain/etc/guix-install.sh

    and run the shell script to install:

    $ sh guix-install.sh

There are alternative installation methods as well; however, installing through the shell script is more reliable and is the recommended route.

  2. After Guix is installed, run the command below to install GWL:

    guix package -i gwl

    There is a high chance that you will get the warning message given below:

    guile: warning: failed to install locale

    hint: Consider installing the glibc-utf8-locales or glibc-locales package and defining GUIX_LOCPATH, along these lines:

    guix package -i glibc-utf8-locales

    export GUIX_LOCPATH="$HOME/.guix-profile/lib/locale"

    See the "Application Setup" section in the manual for more info.

    guix: workflow: command not found
    Try guix --help for more information.

    This warning may seem harmless, since it is not reported as an error. However, it is crucial to fix this problem; otherwise, the following steps will not work.

    The solution is a bit tricky and can take some time. To fix the problem, follow the steps under this thread.

  3. After that, run the command below to change into the directory where your GWL workflow files exist.

    cd scalability-reproducibility-chapter/GWL

  4. Run the command below. If you have not fixed the warning from step 2, it fails with the error "No such command as workflow"; if you have fixed it, there are no errors.

    guix workflow -r example-workflow

    Unfortunately, at this point the guix workflow command cannot find and execute the workflow files, even though the files exist!

C. Snakemake

  1. Install Snakemake from source:

    $ git clone https://bitbucket.org/snakemake/snakemake.git
    $ cd snakemake
    $ virtualenv -p python3 .venv
    $ source .venv/bin/activate
    $ python setup.py install

Check how to install from source code

  2. Prepare Miniconda3 for installing packages:

    $ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    $ bash Miniconda3-latest-Linux-x86_64.sh

Check how to setup miniconda-3

Reopen a new terminal; you can then use the new conda command to install software packages.

  3. Prepare the bioconda Snakemake tool:

    $ conda install -y -c bioconda snakemake=4.2.0

  4. Run the example from the repository tree:

    $ cd Snakemake
    $ snakemake -n

  5. We got the results folder paml_results.

  6. It was difficult to present the workflow graph.

D. Nextflow

curl -s https://get.nextflow.io | bash

  • Requires Java (not on the image, despite the information indicated in the "tags"); this was installed following the instructions at: link.

    ./nextflow run Nextflow/workflow.nf -with-docker evolutionarygenomics/scalability -with-dag workflow.pdf

  • Requires Graphviz; this was installed following the instructions at: link (package link to graphviz).
  • This script generated a PDF showing the workflow.

Q2 - Answer.

Compare the different syntaxes used by the tools to define a workflow and explore how each tool describes the processes and the dependencies in a different way.

CWL (shortest code; higher risk of issues with each .cwl file; relatively long run time)

Script – three steps:

  • “extract_clusters”: the input is the directory of clusters, and the three output files are proteins, nucleotides, and names; this step sorts the input clusters one by one based on their basenames (is this a more simplified process than the other two?)
  • “per_cluster_workflow.cwl”: uses the output from “extract_clusters” to create the results files, scattering the per-cluster work over the clusters (CWL's "scatter" feature, rather than a scatter plot)
  • Time used: 1:25:36
  • INFO [job codeml_72] Max memory used: 962MiB
  • INFO [job codeml_72] completed success

GWL (longest code; most complex and the easiest to run into issues)

  • Requires modules from Guix and includes deployment; requires many more lines of code
  • Then defines three processes, similar to Snakemake
  • The code is built to create separate processes for the different clusters, and requires only one line to run (in theory)

Snakemake (third shortest code; easy to get working; fastest running)

Defines 3 rules:

  • clustal: sorts inputs into two groups – “guidetree” and “align”; whatever doesn’t go into “guidetree” goes into “align” (?)
  • pal2nal: takes the results of “align” and runs them through paml (?)
  • codeml: takes the results of the previous step, runs them through codeml and outputs the results as a text file
  • Job counts:

    count  jobs
    1      all
    72     clustal
    72     codeml
    72     pal2nal
    217    total

  • Duration: within 10 minutes (the fastest one)
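For comparison with the other tools' syntax, a Snakemake rule like the clustal one above can be sketched roughly as follows. This is a hedged sketch, not the repo's actual Snakefile: the paths, wildcard name, and output names are illustrative guesses.

```
rule clustal:
    input:
        "clusters/{cluster}.fa"              # hypothetical input path
    output:
        align="alignments/{cluster}.aln",    # hypothetical output paths
        guidetree="guidetrees/{cluster}.dnd"
    shell:
        "clustalo -i {input} --out {output.align} --guidetree-out {output.guidetree}"
```

Snakemake infers the dependency graph by matching these declared inputs and outputs across rules, which is why only the final targets need to be requested on the command line.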

Nextflow (the easiest one to get working; second shortest code; relatively long run time)

Defines 3 processes, also similar to Snakemake and GWL

  • Duration : 3h 52m 53s
  • CPU hours: 10.2
  • Succeeded: 216
  • Can nicely present the workflow graph.

Q3 - Answer.

Use the Amazon EC2 calculation sheet, and calculate how much it would cost to store 100 GB in S3, and execute a calculation on 100 “large” nodes, each reading 20 GB of data. Do the same for another cloud provider.

Pricing will depend on runtime and access frequency.

Amazon: 100 instances of m5.16xlarge + 100 GB of S3 storage

  • 100% usage: $238,016.84
  • 75% usage: $178,987.94
  • 50% usage: $119,960.09
  • 25% usage: $60,455.60
  • 10% usage: $24,626.65
  • 5% usage: $12,465.03

Google: 100 instances of C2-standard-8 (vCPUs: 8, RAM: 32 GB) + 100 GB of storage

  • 100% usage: $24,396.43
  • 50% usage: $14,339.92
  • 25% usage: $7,618.57
  • 10% usage: $3,047.43
  • 5% usage: $1,523.71

Microsoft Azure: 100 instances (20 GB each) + 100 GB of storage: $233,286/year
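As a sanity check on figures like these, a back-of-envelope script is useful. The unit prices below are assumptions chosen for illustration (roughly standard object storage and a large on-demand instance rate), not values taken from the providers' calculators:

```shell
# Back-of-envelope monthly cost at 100% usage. All unit prices below are
# assumed illustrative values, not authoritative rates.
S3_GB=100
S3_PRICE_PER_GB=0.023    # assumed $/GB-month for standard object storage
NODES=100
NODE_HOURLY=3.072        # assumed on-demand $/hour for one "large" node
HOURS_PER_MONTH=730

# awk does the floating-point arithmetic; %.2f keeps cents
storage=$(awk "BEGIN{printf \"%.2f\", $S3_GB * $S3_PRICE_PER_GB}")
compute=$(awk "BEGIN{printf \"%.2f\", $NODES * $NODE_HOURLY * $HOURS_PER_MONTH}")
total=$(awk "BEGIN{printf \"%.2f\", $storage + $compute}")
echo "storage: \$$storage/month  compute: \$$compute/month  total: \$$total/month"
```

At these assumed rates the compute term dominates (on the order of $224k per month), the same order of magnitude as the 100%-usage Amazon figure above; the storage cost is essentially noise by comparison.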

Conclusions:

1) What We Learned

  • Creating a reproducible study is complex, even when it is a stated goal. There are many different components to consider and work through.

2) What was Surprising

  • Cloud services are expensive! I think this is important to keep in mind when designing a study workflow, in particular how to use resources efficiently and effectively, especially when resources are limited.

3) What We Would Do Differently

  • One of the things that might help would be the use of containers for full reproducibility.
  • Additional commenting within the code would also be useful - particularly about dependencies and packages needed to execute scripts.

This project, and the course in general, has also made me wonder what responsibility the researcher has to ensure long-term reproducibility. How does one balance the time investment against ensuring that the study will still be runnable in the future? What standards and best practices should we use as guidance?