
Group 2 Homework 1

Steph Prince edited this page Jul 22, 2019 · 18 revisions
  1. Working environments to run example workflows

Snakemake

git clone https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter
docker pull evolutionarygenomics/scalability_snakemake
docker run -v ${HOME}/scalability-reproducibility-chapter/:/repo -it evolutionarygenomics/scalability_snakemake sh
cd repo
cd Snakemake
snakemake
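The `snakemake` command above looks for a Snakefile of rules in the current directory. A minimal, hypothetical Snakefile (not the chapter's actual one) illustrates the structure:

```shell
# Write a minimal, hypothetical Snakefile to show what `snakemake` executes:
# rules with declared inputs/outputs and a shell command (illustrative only).
cat > Snakefile.example <<'EOF'
rule all:
    input: "hello.txt"

rule make_hello:
    output: "hello.txt"
    shell: "echo hello > {output}"
EOF
grep -c "rule" Snakefile.example   # prints 2 (two rules defined)
```

Running `snakemake -s Snakefile.example` would build `hello.txt` by resolving the dependency of `all` on `make_hello`.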

CWL

git clone https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter
cd Docker
nano Dockerfile.cwl

Edit the Dockerfile so that it contains the following:

FROM conda/miniconda3
RUN conda config --add channels conda-forge
RUN conda install -y perl=5.22.0
RUN conda install -y -c bioconda paml=4.9 clustalo=1.2.4 wget=1.19.1
ADD pal2nal.pl /usr/local/bin/pal2nal.pl
RUN chmod +x /usr/local/bin/pal2nal.pl

# install the CWL reference engine
RUN apt-get -y update
RUN apt-get -y install cwltool
RUN apt-get -y install curl
RUN curl -sL http://deb.nodesource.com/setup_10.x | bash -
RUN apt-get -y install nodejs

Then save the file and return to the command line:

docker build -f Dockerfile.cwl -t ${DOCKERHUB_USERNAME}/cwl .
docker run -v ${HOME}/scalability-reproducibility-chapter/:/repo -it ${DOCKERHUB_USERNAME}/cwl sh
cd repo
CWL/workflow.cwl --clusters data

Troubleshooting: We ran Snakemake first and then CWL. However, Snakemake writes files into the data directory, which then disrupts CWL's ability to run the workflow correctly. Pulling from the GitHub repo again to get a fresh data folder solved this problem.
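Instead of re-cloning, the data directory can be reset with git, assuming the offending files are either modified tracked files or untracked outputs. A self-contained sketch (paths and file names are illustrative, not the chapter's actual files):

```shell
# Demo: restore a dirty data/ directory with git rather than re-cloning.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
mkdir data && echo "clean" > data/input.txt
git add data
git -c user.email=you@example.com -c user.name=you commit -qm "initial data"
echo "dirty" > data/input.txt    # simulate Snakemake modifying a tracked file
echo "extra" > data/output.txt   # simulate an untracked output file
git checkout -- data             # restore tracked files to the committed state
git clean -fdq data              # delete untracked outputs
cat data/input.txt               # prints: clean
```

In the actual repo this amounts to `git checkout -- data && git clean -fd data` from the clone's root.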

NextFlow

git clone https://github.com/EvolutionaryGenomics/scalability-reproducibility-chapter
sudo apt update
sudo apt install openjdk-11-jre-headless
cd scalability-reproducibility-chapter
curl -s https://get.nextflow.io | bash
./nextflow run Nextflow/workflow.nf -with-docker evolutionarygenomics/scalability

Troubleshooting: We initially thought that Nextflow was supposed to be run inside a Docker container. We later realized that the -with-docker flag is what tells Nextflow which Docker image to use.
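Rather than passing -with-docker on every run, Docker can be enabled by default in a `nextflow.config` file; a sketch, assuming the image name from the command above:

```shell
# Write an example nextflow.config that makes Docker the default, so that
# `./nextflow run Nextflow/workflow.nf` would need no -with-docker flag.
cat > nextflow.config.example <<'EOF'
docker.enabled = true
process.container = 'evolutionarygenomics/scalability'
EOF
grep "docker.enabled" nextflow.config.example   # prints: docker.enabled = true
```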

  2. Compare the different syntaxes used by the tools to define a workflow and explore how each tool describes the processes and the dependencies in a different way.

GWL requires loading many modules and setting definitions.

Nextflow is relatively simple: inputs are given as a short series of file names, and the workflow is designed as a set of processes.

Snakemake is the most similar to Nextflow in that each has discrete steps to which you provide inputs in a straightforward manner. The workflow is designed as a set of rules.

CWL's organization is fundamentally different: it relies on a directory of files to define the inputs as well as the input files themselves. The workflow is designed as a set of steps.
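A hypothetical side-by-side sketch of the same alignment step in Snakemake and Nextflow syntax (illustrative only; these are not the chapter's actual workflow files):

```shell
# Snakemake expresses a step as a rule with declared input/output files:
cat > rule.smk.example <<'EOF'
rule align:
    input: "genes.fa"
    output: "genes.aln"
    shell: "clustalo -i {input} -o {output}"
EOF

# Nextflow expresses the same step as a process with input/output channels:
cat > process.nf.example <<'EOF'
process align {
    input:  path genes_fa
    output: path "genes.aln"
    script: "clustalo -i $genes_fa -o genes.aln"
}
EOF
```

The rule names the files directly; the process receives them through channels, which is what lets Nextflow schedule steps across executors.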

  3. Use the Amazon EC2 calculation sheet, and calculate how much it would cost to store 100 GB in S3, and execute a calculation on 100 “large” nodes, each reading 20 GB of data. Do the same for another cloud provider.

    It depends on how you are reading and treating the data in your process. On Amazon EC2, using extra-large a1.4xlarge servers (enough memory to read all 20 GB at once), the compute costs about $1,265 for 1 hr/day over one month. If you can use less RAM and run a1.large servers instead, it is about $159 for one month. S3 storage of 100 GB is $2.30/month. On Microsoft Azure, 100 GB of storage is $2.08/month, and the compute cost is about $1,500 for 4-core, 28 GB RAM nodes at 30 hrs/month.
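The S3 figure can be reproduced from the per-GB rate (assuming S3 Standard at $0.023 per GB-month):

```shell
# 100 GB at the assumed S3 Standard rate of $0.023 per GB-month.
awk 'BEGIN { printf "$%.2f/month\n", 100 * 0.023 }'   # prints: $2.30/month
```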

  4. What we learned: Reproducible computing is a mythical creature. It is made all the more complex by open-source software that may not have fully maintained error messages, resource pages, or even code bases. This means you should be careful about which platforms/code bases you adopt, and proceed with caution if there is limited support, because the next person may not be able to figure out what you did easily. Even Docker may not properly account for every future reproduction.

What was surprising: We were surprised that so many files and workflows could break in so little time after publication. We were also surprised at how expensive computing hours on AWS are and how cheap it is to store data.

What we would do differently: If we were the authors of the paper, we would give very explicit step-by-step directions for how to pull the Docker images and run the different workflows. The paper was unclear about the intended order of operations. We would also provide Docker images set to auto-build so that they could be better maintained as software packages are updated. From a project management perspective, GitHub's project management tools are limiting because it is difficult to have conversations in context. We also need to be more explicit about making cards that only one person is responsible for, to make progress tracking easier.
