This repository has been archived by the owner on Aug 9, 2023. It is now read-only.

April updates #159

Merged
merged 28 commits into from
Apr 28, 2021
655f3b4
genomics pipeline using CDK
itzhapaz Mar 16, 2021
660d044
CDK instructions
itzhapaz Mar 16, 2021
7160346
Delete cdk.context.json
itzhapaz Mar 16, 2021
cfa359e
remove context file
itzhapaz Mar 16, 2021
d7fe024
Merge branch 'aws-genomics-cdk' of github.com:itzhapaz/aws-genomics-w…
itzhapaz Mar 16, 2021
7030a12
update job queue and compute environment guidance
wleepang Mar 26, 2021
2fd1b1e
Add docs Custom Distribution section
crabba Mar 26, 2021
fb0d2bf
ensure non-shimmed aws is used by host instance
wleepang Mar 30, 2021
486c124
add ebs-autoscale logs to cloudwatch config
wleepang Mar 30, 2021
1826068
Merge pull request #138 from itzhapaz/aws-genomics-cdk
wleepang Mar 31, 2021
dbf14a9
Add docs Custom Deployment
crabba Mar 28, 2021
c3872bf
update job queue and compute environment guidance
wleepang Mar 26, 2021
af31aa8
ensure non-shimmed aws is used by host instance
wleepang Mar 30, 2021
3bfb1d4
add ebs-autoscale logs to cloudwatch config
wleepang Mar 30, 2021
acb9ad2
Revise docs for Custom Deployment
crabba Apr 1, 2021
63ef2b7
created workflows
Apr 8, 2021
182e361
examples cleanup
itzhapaz Apr 8, 2021
aafd09c
Merge pull request #154 from itzhapaz/develop/cdk-constructs
wleepang Apr 9, 2021
043728e
Merge pull request #152 from crabba/custom-deploy
wleepang Apr 9, 2021
74debf6
add server timeouts to cromwell.conf
henriqueribeiro Apr 14, 2021
2b9716b
add aws config file
henriqueribeiro Apr 14, 2021
47a9f94
Add nextflow-and-core template
crabba Apr 15, 2021
56461a0
Merge branch 'master' of github.com:aws-samples/aws-genomics-workflow…
crabba Apr 16, 2021
7deb10b
Require namespace for nextflow-and-core
crabba Apr 21, 2021
c5c43ee
Merge pull request #157 from aws-samples/develop/cdk-constructs
wleepang Apr 23, 2021
2d06382
Merge pull request #155 from henriqueribeiro/config_files
wleepang Apr 23, 2021
ad992af
quote the conditional as well
wleepang Apr 23, 2021
918ed9d
Merge pull request #156 from crabba/all-in-one
wleepang Apr 23, 2021
68 changes: 68 additions & 0 deletions docs/core-env/build-custom-distribution.md
@@ -0,0 +1,68 @@
# Building Custom Resources

This section describes how to build and upload templates and artifacts to use in a customized deployment. Once uploaded, the locations of the templates and artifacts are used when deploying the Nextflow on AWS Batch solution (see [Customized Deployment](custom-deploy.md)).

## Building a Custom Distribution

This step involves building a distribution of templates and artifacts from the solution's source code.

First, create a local clone of the [Genomics Workflows on AWS](https://github.com/aws-samples/aws-genomics-workflows) source code. The code base contains several directories:

* `_scripts/`: Shell scripts for building and uploading the customized distribution of templates and artifacts
* `docs/`: Source code for the documentation, written in [Markdown](https://markdownguide.org) for the [MkDocs](https://mkdocs.org) publishing platform. This documentation may be modified, expanded, and contributed to in the same way as source code.
* `src/`: Source code for the components of the solution:
    * `containers/`: CodeBuild buildspec files for building AWS-specific container images and pushing them to ECR
        * `_common/`
            * `build.sh`: A generic build script that first builds a base image for a container, then builds an AWS-specific image
            * `entrypoint.aws.sh`: A generic entrypoint script that wraps a call to a binary tool in the container with handlers for data staging from/to S3
        * `nextflow/`
            * `Dockerfile`
            * `nextflow.aws.sh`: Docker entrypoint script to execute the Nextflow workflow on AWS Batch
    * `ebs-autoscale/`
        * `get-amazon-ebs-autoscale.sh`: Script to retrieve and install [Amazon EBS Autoscale](https://github.com/awslabs/amazon-ebs-autoscale)
    * `ecs-additions/`: Scripts to be installed on ECS host instances to support the distribution
        * `awscli-shim.sh`: Installed as `/opt/aws-cli/bin/aws` and mounted onto the container; allows container images without full glibc to use the AWS CLI v2 through supplied shared libraries (especially libz) and `LD_LIBRARY_PATH`.
        * `ecs-additions-common.sh`: Utility script to install `fetch_and_run.sh`, Nextflow and Cromwell shims, and swap space
        * `ecs-additions-cromwell-linux2-worker.sh`:
        * `ecs-additions-cromwell.sh`:
        * `ecs-additions-nextflow.sh`:
        * `ecs-additions-step-functions.sh`:
        * `fetch_and_run.sh`: Uses AWS CLI to download and run scripts and zip files from S3
        * `provision.sh`: Appended to the userdata in the launch template created by [gwfcore-launch-template](custom-deploy.md). Starts the SSM Agent, ECS Agent, and Docker; runs `get-amazon-ebs-autoscale.sh`, `ecs-additions-common.sh`, and the orchestrator-specific `ecs-additions-` scripts.
    * `lambda/`: Lambda functions to create, modify or delete ECR registries or CodeBuild jobs
    * `templates/`: CloudFormation templates for the solution stack, as described in [Customized Deployment](custom-deploy.md)

## Deploying a Custom Distribution

The script `_scripts/deploy.sh` will create a custom distribution of artifacts and templates from files in the source tree, then upload this distribution to an S3 bucket. It will optionally also build and deploy a static documentation site from the Markdown documentation files. Its usage is:

```sh
deploy.sh [--site-bucket BUCKET] [--asset-bucket BUCKET]
          [--asset-profile PROFILE] [--deploy-region REGION]
          [--public] [--verbose]
          STAGE

--site-bucket BUCKET        Deploy documentation site to BUCKET
--asset-bucket BUCKET       Deploy assets to BUCKET
--asset-profile PROFILE     Use PROFILE for AWS CLI commands
--deploy-region REGION      Deploy in region REGION
--public                    Deploy to public bucket with '--acl public-read' (Default false)
--verbose                   Display more output
STAGE                       'test' or 'production'
```
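
As a rough sketch of a typical invocation, the snippet below assembles a command line for the `test` stage and prints it instead of running it. The bucket name, profile name, and region are placeholders (assumptions, not values from the source), and it assumes the script is called from the repository root. Only the flags themselves come from the usage text above.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with your own bucket, profile, and region.
ASSET_BUCKET="my-deployment-bucket"
ASSET_PROFILE="default"
DEPLOY_REGION="us-east-1"
STAGE="test"

# Assemble the command from the flags listed in the usage text.
deploy_cmd() {
    printf '%s' "_scripts/deploy.sh"
    printf ' --asset-bucket %s' "$ASSET_BUCKET"
    printf ' --asset-profile %s' "$ASSET_PROFILE"
    printf ' --deploy-region %s' "$DEPLOY_REGION"
    printf ' %s\n' "$STAGE"
}

# Print the command for review; run it yourself once it looks right.
deploy_cmd
```

Printing first makes it easy to check the target bucket and region before anything is uploaded.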

When running this script from the command line, use the value `test` for the stage. This will deploy the templates and artifacts into a directory `test` in your deployment bucket:

```
$ aws s3 ls s3://my-deployment-bucket/test/
PRE artifacts/
PRE templates/
```

Use these values when deploying a customized installation, as described in [Customized Deployment](custom-deploy.md), sections 'Artifacts and Nested Stacks' and 'Nextflow'. In the example from above, the values to use would be:

* Artifact S3 Bucket Name: `my-deployment-bucket`
* Artifact S3 Prefix: `test/artifacts`
* Template Root URL: `https://my-deployment-bucket.s3.amazonaws.com/test/templates`
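
The mapping from bucket and stage to these three values is mechanical, so it can be sketched as a small helper. The function name `distribution_params` is hypothetical; the output format simply mirrors the example values listed above.

```shell
#!/bin/sh
# Derive the three deployment values from a bucket name and a stage.
# Both arguments are placeholders; the URL form mirrors the example above.
distribution_params() {
    bucket="$1"
    stage="$2"
    printf 'Artifact S3 Bucket Name: %s\n' "$bucket"
    printf 'Artifact S3 Prefix: %s/artifacts\n' "$stage"
    printf 'Template Root URL: https://%s.s3.amazonaws.com/%s/templates\n' "$bucket" "$stage"
}

distribution_params "my-deployment-bucket" "test"
```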

The use of `production` for stage is reserved for deployments from a Travis CI/CD environment; this usage will deploy into a subdirectory named after the current release tag.
58 changes: 58 additions & 0 deletions docs/core-env/custom-deploy.md
@@ -0,0 +1,58 @@
# Customized Deployment

Deployments of the 'Nextflow on AWS Batch' solution are based on nested CloudFormation templates, and on artifacts comprising scripts, software packages, and configuration files. The templates and artifacts are stored in S3 buckets, and their S3 URLs are used when launching the top-level template and as parameters to that template's deployment.

## VPC
The quick start link deploys the [AWS VPC Quickstart](https://aws.amazon.com/quickstart/architecture/vpc/), which creates a VPC with up to 4 Availability Zones, each with a public subnet and a private subnet with NAT Gateway access to the Internet.

## Genomics Workflow Core
This quick start link deploys the CloudFormation template `gwfcore-root.template.yaml` for the Genomics Workflow Core (GWFCore) from the [Genomics Workflows on AWS](https://github.com/aws-samples/aws-genomics-workflows) solution. This template launches a number of nested templates, as shown below:

* Root Stack __gwfcore-root__ - Top level template for Genomics Workflow Core
* S3 Stack __gwfcore-s3__ - S3 bucket (new or existing) for storing analysis results
* IAM Stack __gwfcore-iam__ - Creates IAM roles to use with AWS Batch scalable genomics workflow environment
* Code Stack __gwfcore-code__ - Creates AWS CodeCommit repos and CodeBuild projects for Genomics Workflows Core assets and artifacts
* Launch Template Stack __gwfcore-launch-template__ - Creates an EC2 Launch Template for AWS Batch based genomics workflows
* Batch Stack __gwfcore-batch__ - Deploys resources for an AWS Batch environment that is suitable for genomics, including default and high-priority JobQueues

### Root Stack
The quick start solution links to the CloudFormation console, where the 'Amazon S3 URL' field is prefilled with the S3 URL of a copy of the root stack template, hosted in the public S3 bucket __aws-genomics-workflows__.

<img src="https://dpkk088kye7gn.cloudfront.net/aws-genomics-workflows/docs/images/custom-deploy-0.png"
alt="custom-deploy-0"
width="100%" height="100%"
class="screenshot" />

To use a customized root stack, upload your modified stack template to an S3 bucket (see [Building a Custom Distribution](build-custom-distribution.md)), and specify that template's URL in 'Amazon S3 URL'.

### Artifacts and Nested Stacks
The subsequent screen, 'Specify Stack Details', allows for customization of the deployed resources in the 'Distribution Configuration' section.

<img src="https://dpkk088kye7gn.cloudfront.net/aws-genomics-workflows/docs/images/custom-deploy-1.png"
alt="custom-deploy-1"
width="70%" height="70%"
class="screenshot" />

* __Artifact S3 Bucket Name__ and __Artifact S3 Prefix__ define the location of the artifacts uploaded prior to this deployment. By default, pre-prepared artifacts are stored in the __aws-genomics-workflows__ bucket.
* __Template Root URL__ defines the bucket and prefix used to store nested templates, called by the root template.

To use your own modified artifacts or nested templates, build and upload as described in [Building a Custom Distribution](build-custom-distribution.md), and specify the bucket and prefix in the fields above.
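
The same customization could in principle be done from the AWS CLI rather than the console. The sketch below only prints a candidate `aws cloudformation create-stack` command. The stack name, the object key of the root template within the bucket, and the three `ParameterKey` names are all assumptions made for illustration; check them against the Parameters section of your uploaded `gwfcore-root.template.yaml`. The template is likely to require further parameters (for example networking and S3 settings) that are not shown here.

```shell
#!/bin/sh
# Placeholders (assumptions): bucket, stage, and stack name are illustrative.
BUCKET="my-deployment-bucket"
STAGE="test"
STACK_NAME="gwfcore-custom"
TEMPLATE_ROOT_URL="https://${BUCKET}.s3.amazonaws.com/${STAGE}/templates"
# Assumption: the root template sits directly under the templates prefix.
# Locate the real key with: aws s3 ls --recursive "s3://${BUCKET}/${STAGE}/templates/"
ROOT_TEMPLATE_URL="${TEMPLATE_ROOT_URL}/gwfcore-root.template.yaml"

# Print (do not run) a create-stack command that points at the custom distribution.
# The ParameterKey names below are hypothetical stand-ins for the console fields.
gwfcore_create_stack_cmd() {
    cat <<EOF
aws cloudformation create-stack \\
  --stack-name ${STACK_NAME} \\
  --template-url ${ROOT_TEMPLATE_URL} \\
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \\
  --parameters \\
    ParameterKey=ArtifactBucketName,ParameterValue=${BUCKET} \\
    ParameterKey=ArtifactBucketPrefix,ParameterValue=${STAGE}/artifacts \\
    ParameterKey=TemplateRootUrl,ParameterValue=${TEMPLATE_ROOT_URL}
EOF
}

gwfcore_create_stack_cmd
```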

## Workflow Orchestrators
### Nextflow
This quick start deploys the Nextflow template `nextflow-resources.template.yaml`, which launches one nested stack:

* Root Stack __nextflow-resources__ - Creates resources specific to running Nextflow on AWS
* Container Build Stack __container-build__ - Creates resources for building a Docker container image using CodeBuild, storing the image in ECR, and optionally creating a corresponding Batch Job Definition

The Nextflow root stack is specified in the same way as the GWFCore root stack, above, and a location for a modified root stack may be specified as with the Core stack.

The subsequent 'Specify Stack Details' screen has fields allowing the customization of the Nextflow deployment.

<img src="https://dpkk088kye7gn.cloudfront.net/aws-genomics-workflows/docs/images/nextflow-0.png"
alt="nextflow-0"
width="70%" height="70%"
class="screenshot" />

* __S3NextflowPrefix__, __S3LogsDirPrefix__, and __S3WorkDirPrefix__ specify the path within the GWFCore bucket in which to store per-run data and log files.
* __TemplateRootUrl__ specifies the path to the nested templates called by the Nextflow root template, as with the GWFCore root stack.
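
For scripted deployments, these fields can be collected in a CloudFormation parameters file. The sketch below emits one using the four parameter names listed above; every value is a placeholder (an assumption), and the helper name `nextflow_params_json` is hypothetical. The stack may need additional parameters that this section does not cover.

```shell
#!/bin/sh
# Emit a CloudFormation parameters file for the Nextflow stack.
# Keys are the field names listed above; every value is a placeholder.
nextflow_params_json() {
    template_root_url="$1"
    cat <<EOF
[
  {"ParameterKey": "S3NextflowPrefix", "ParameterValue": "_nextflow"},
  {"ParameterKey": "S3LogsDirPrefix", "ParameterValue": "logs"},
  {"ParameterKey": "S3WorkDirPrefix", "ParameterValue": "runs"},
  {"ParameterKey": "TemplateRootUrl", "ParameterValue": "${template_root_url}"}
]
EOF
}

nextflow_params_json "https://my-deployment-bucket.s3.amazonaws.com/test/templates"
```

Redirecting the output to a file such as `nextflow-params.json` allows it to be passed to `aws cloudformation create-stack` with `--parameters file://nextflow-params.json`.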
25 changes: 11 additions & 14 deletions docs/core-env/setup-aws-batch.md
@@ -46,8 +46,8 @@ A complete AWS Batch environment consists of the following:

1. A Compute Environment that utilizes [EC2 Spot instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) for cost-effective computing
2. A Compute Environment that utilizes EC2 on-demand (e.g. [public pricing](https://aws.amazon.com/ec2/pricing/on-demand/)) instances for high-priority work that can't risk job interruptions or delays due to insufficient Spot capacity.
3. A default Job Queue that utilizes the Spot compute environment first, but spills over to the on-demand compute environment if defined capacity limits (i.e. Max vCPUs) are reached.
4. A priority Job Queue that leverages the on-demand and Spot CE's (in that order) and has higher priority than the default queue.
3. A default Job Queue that solely utilizes the Spot compute environment. This is for jobs where timeliness isn't a constraint and that can wait for the right instances to become available, as well as handle interruption. It also ensures the most cost savings.
4. A priority Job Queue that leverages the on-demand, and optionally Spot, CE's (in that order) and has higher priority than the default queue. This is for jobs that cannot handle interruption, and need to be executed immediately.

### Automated via CloudFormation

@@ -81,7 +81,7 @@ You can create several compute environments to suit your needs. Below we'll cre
6. In the "Service role" drop down, select the `AWSBatchServiceRole` you created previously
7. In the "Instance role" drop down, select the `ecsInstanceRole` you created previously
8. For "Provisioning model" select "On-Demand"
9. "Allowed instance types" will be already populated with "optimal" - which is a mixture of M4, C4, and R4 instances.
9. "Allowed instance types" will be already populated with "optimal" - which is a mixture of M4, C4, and R4 instances. This should be sufficient for demonstration purposes. In a production setting, it is recommended to specify the instance families and sizes most appropriate for the jobs the CE will support. For the On-Demand CE, selecting newer instance types is beneficial as they tend to have better price per performance.
10. "Allocation strategy" will already be set to `BEST_FIT`. This is recommended for on-demand based compute environments as it ensures the most cost efficiency.
11. In the "Launch template" drop down, select the `genomics-workflow-template` you created previously
12. Set Minimum and Desired vCPUs to 0.
@@ -112,7 +112,7 @@ Click on "Create"
6. In the "Service role" drop down, select the `AWSBatchServiceRole` you created previously
7. In the "Instance role" drop down, select the `ecsInstanceRole` you created previously
8. For "Provisioning model" select "Spot"
9. "Allowed instance types" will be already populated with "optimal" - which is a mixture of M4, C4, and R4 instances.
9. "Allowed instance types" will be already populated with "optimal" - which is a mixture of M4, C4, and R4 instances. This should be sufficient for demonstration purposes. In a production setting, it is recommended to specify the instance families and sizes most appropriate for the jobs the CE will support. For the SPOT CE a wider diversity of instance types is recommended to maximize the pools from which capacity can be drawn. Limiting the size of instances is also recommended to avoid scheduling too many jobs on a SPOT instance that could be interrupted.
10. "Allocation strategy" will already be set to `SPOT_CAPACITY_OPTIMIZED`. This is recommended for Spot based compute environments as it ensures the most compute capacity is available for your jobs.
11. In the "Launch template" drop down, select the `genomics-workflow-template` you created previously
12. Set Minimum and Desired vCPUs to 0.
@@ -135,20 +135,18 @@ Job queues can be associated with one or more compute environments in a preferre
Below we'll create two job queues:

* A "Default" job queue
* A "High Priority" job queue
* A "Priority" job queue

Both job queues will use both compute environments you created previously.

##### Create a "default" job queue

This queue is intended for jobs that do not require urgent completion, and can handle potential interruption. This queue will schedule jobs to:
This queue is intended for jobs that do not require urgent completion, and can handle potential interruption. This queue will schedule jobs to only the "spot" compute environment.

1. The "spot" compute environment
2. The "ondemand" compute environment
!!! note
    It is not recommended to configure a job queue to "spillover" from Spot to On-Demand. Doing so could lead to Insufficient Capacity Errors, leaving Batch unable to schedule jobs and the jobs stuck in "RUNNABLE".

in that order.

Because it primarily leverages Spot instances, it will also be the most cost effective job queue.
Because it leverages Spot instances, it will also be the most cost effective job queue.

* Go to the AWS Batch Console
* Click on "Job queues"
@@ -157,8 +155,7 @@ Because it primarily leverages Spot instances, it will also be the most cost eff
* Set "Priority" to 1
* Under "Connected compute environments for this queue", using the drop down menu:

1. Select the "spot" compute environment you created previously, then
2. Select the "ondemand" compute environment you created previously
1. Select the "spot" compute environment you created previously

* Click on "Create Job Queue"

@@ -169,7 +166,7 @@ This queue is intended for jobs that are urgent and **cannot** handle potential
1. The "ondemand" compute environment
2. The "spot" compute environment

in that order.
in that order. In this queue configuration, Batch will schedule jobs to the "ondemand" compute environment first. When the number of Max vCPUs for that environment is reached, Batch will begin scheduling jobs to the "spot" compute environment. The use of the "spot" compute environment is optional, and is used to help drain pending jobs from the queue faster.
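
The ordering described here corresponds to the `--compute-environment-order` option of `aws batch create-job-queue`. The sketch below builds such a command and prints it without running it. The helper name `job_queue_cmd`, the queue name `priority`, the priority value `100`, and the compute environment names `ondemand` and `spot` are assumptions; use the names and values from your own environment.

```shell
#!/bin/sh
# Build a create-job-queue command whose compute environments are tried in the
# order given. Queue name, priority value, and environment names are assumptions.
job_queue_cmd() {
    name="$1"; priority="$2"; shift 2
    cmd="aws batch create-job-queue --job-queue-name ${name} --state ENABLED --priority ${priority} --compute-environment-order"
    order=1
    for ce in "$@"; do
        cmd="${cmd} order=${order},computeEnvironment=${ce}"
        order=$((order + 1))
    done
    printf '%s\n' "$cmd"
}

# On-demand first, Spot second; drop "spot" to keep the queue on-demand only.
job_queue_cmd priority 100 ondemand spot
```

AWS Batch evaluates queues with a larger priority integer first, so any value above the default queue's priority of 1 gives this queue precedence.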

* Go to the AWS Batch Console
* Click on "Job queues"
6 changes: 6 additions & 0 deletions docs/extra.css
@@ -57,4 +57,10 @@

.md-header, .md-footer, .md-footer-nav, .md-footer-meta {
background-color: #232f3e !important;
}

.screenshot {
float: left;
margin: 10px;
border: 1px solid lightgrey;
}
Binary file added docs/images/custom-deploy-0.png
Binary file added docs/images/custom-deploy-1.png
Binary file added docs/images/nextflow-0.png
4 changes: 4 additions & 0 deletions mkdocs.yml
@@ -9,6 +9,8 @@ nav:
- Permissions: core-env/create-iam-roles.md
- Compute Resources: core-env/create-custom-compute-resources.md
- AWS Batch: core-env/setup-aws-batch.md
- Customized Deployment: core-env/custom-deploy.md
- Building a Custom Distribution: core-env/build-custom-distribution.md
# - Containerized Tooling:
# - Introduction: containers/container-introduction.md
# - Examples: containers/container-examples.md
@@ -57,3 +59,5 @@ extra:
s3:
bucket: docs.opendata.aws
prefix: genomics-workflows

use_directory_urls: false
9 changes: 9 additions & 0 deletions src/aws-genomics-cdk/.gitignore
@@ -0,0 +1,9 @@
*.js
!jest.config.js
*.d.ts
node_modules

# CDK asset staging directory
.cdk.staging
cdk.out
cdk.context.json
6 changes: 6 additions & 0 deletions src/aws-genomics-cdk/.npmignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
*.ts
!*.d.ts

# CDK asset staging directory
.cdk.staging
cdk.out