This repository has been archived by the owner on Aug 9, 2023. It is now read-only.

Merge pull request #159 from aws-samples/master
April updates
wleepang authored Apr 28, 2021
2 parents 7b8e77c + 918ed9d commit 1676f13
Showing 59 changed files with 10,628 additions and 25 deletions.
68 changes: 68 additions & 0 deletions docs/core-env/build-custom-distribution.md
@@ -0,0 +1,68 @@
# Building Custom Resources

This section describes how to build and upload templates and artifacts to use in a customized deployment. Once uploaded, the locations of the templates and artifacts are used when deploying the Nextflow on AWS Batch solution (see [Customized Deployment](custom-deploy.md)).

## Building a Custom Distribution

This step involves building a distribution of templates and artifacts from the solution's source code.

First, create a local clone of the [Genomics Workflows on AWS](https://github.com/aws-samples/aws-genomics-workflows) source code. The code base contains several directories:

* `_scripts/`: Shell scripts for building and uploading the customized distribution of templates and artifacts
* `docs/`: Source code for the documentation, written in [Markdown](https://markdownguide.org) for the [MkDocs](https://mkdocs.org) publishing platform. The documentation may be modified, expanded, and contributed in the same way as the source code.
* `src/`: Source code for the components of the solution:
    * `containers/`: CodeBuild buildspec files for building AWS-specific container images and pushing them to ECR
        * `_common/`
            * `build.sh`: A generic build script that first builds a base image for a container, then builds an AWS-specific image
            * `entrypoint.aws.sh`: A generic entrypoint script that wraps a call to a binary tool in the container with handlers for staging data from/to S3
        * `nextflow/`
            * `Dockerfile`
            * `nextflow.aws.sh`: Docker entrypoint script that executes the Nextflow workflow on AWS Batch
    * `ebs-autoscale/`
        * `get-amazon-ebs-autoscale.sh`: Script to retrieve and install [Amazon EBS Autoscale](https://github.com/awslabs/amazon-ebs-autoscale)
    * `ecs-additions/`: Scripts to be installed on ECS host instances to support the distribution
        * `awscli-shim.sh`: Installed as `/opt/aws-cli/bin/aws` and mounted into containers; allows container images without a full glibc to use AWS CLI v2 via the supplied shared libraries (especially libz) and `LD_LIBRARY_PATH`
        * `ecs-additions-common.sh`: Utility script to install `fetch_and_run.sh`, the Nextflow and Cromwell shims, and swap space
        * `ecs-additions-cromwell-linux2-worker.sh`:
        * `ecs-additions-cromwell.sh`:
        * `ecs-additions-nextflow.sh`:
        * `ecs-additions-step-functions.sh`:
        * `fetch_and_run.sh`: Uses the AWS CLI to download and run scripts and zip files from S3
        * `provision.sh`: Appended to the user data in the launch template created by [gwfcore-launch-template](custom-deploy.md); starts the SSM agent, the ECS agent, and Docker, then runs `get-amazon-ebs-autoscale.sh`, `ecs-additions-common.sh`, and the orchestrator-specific `ecs-additions-` scripts
    * `lambda/`: Lambda functions to create, modify, or delete ECR registries or CodeBuild jobs
    * `templates/`: CloudFormation templates for the solution stack, as described in [Customized Deployment](custom-deploy.md)
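
For illustration, the CLI shim described above might look like the following minimal sketch. This is a hypothetical reconstruction, not the actual `awscli-shim.sh`; the `dist/aws` path and `AWS_CLI_HOME` variable are assumptions based on the `/opt/aws-cli` convention noted above.

```sh
#!/bin/bash
# Hypothetical sketch of the AWS CLI v2 shim -- the real awscli-shim.sh may differ.
# Installed as /opt/aws-cli/bin/aws and mounted into containers, it prepends the
# CLI's bundled shared libraries (e.g. libz) to LD_LIBRARY_PATH so that images
# without a full glibc stack can still run the CLI, then hands off to the real binary.
AWS_CLI_HOME="${AWS_CLI_HOME:-/opt/aws-cli}"
export LD_LIBRARY_PATH="${AWS_CLI_HOME}/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
exec "${AWS_CLI_HOME}/dist/aws" "$@"
```

Because the shim occupies the mount point's `bin/aws`, containers can invoke `aws` normally with no changes to their images.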

## Deploying a Custom Distribution

The script `_scripts/deploy.sh` will create a custom distribution of artifacts and templates from files in the source tree, then upload this distribution to an S3 bucket. It will optionally also build and deploy a static documentation site from the Markdown documentation files. Its usage is:

```sh
deploy.sh [--site-bucket BUCKET] [--asset-bucket BUCKET]
[--asset-profile PROFILE] [--deploy-region REGION]
[--public] [--verbose]
STAGE

--site-bucket BUCKET Deploy documentation site to BUCKET
--asset-bucket BUCKET Deploy assets to BUCKET
--asset-profile PROFILE Use PROFILE for AWS CLI commands
--deploy-region REGION Deploy in region REGION
--public Deploy to public bucket with '--acl public-read' (Default false)
--verbose Display more output
STAGE 'test' or 'production'
```

When running this script from the command line, use the value `test` for the stage. This will deploy the templates and artifacts into a directory `test` in your deployment bucket:

```
$ aws s3 ls s3://my-deployment-bucket/test/
PRE artifacts/
PRE templates/
```
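
A listing like the one above could result from an invocation such as the following (the bucket name, profile, and region are hypothetical):

```sh
# Hypothetical invocation: build the distribution and upload it under the
# 'test' prefix of my-deployment-bucket, using a named AWS CLI profile.
./_scripts/deploy.sh \
    --asset-bucket my-deployment-bucket \
    --asset-profile default \
    --deploy-region us-east-1 \
    test
```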

Use these values when deploying a customized installation, as described in the [Customized Deployment](custom-deploy.md) sections 'Artifacts and Nested Stacks' and 'Nextflow'. In the example above, the values to use would be:

* Artifact S3 Bucket Name: `my-deployment-bucket`
* Artifact S3 Prefix: `test/artifacts`
* Template Root URL: `https://my-deployment-bucket.s3.amazonaws.com/test/templates`
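
As a sketch, the mapping from deployment bucket and stage to these three values can be expressed as a small helper. The bucket name is hypothetical, and the URL assumes the default virtual-hosted-style S3 address shown above.

```sh
# Sketch: derive the three deployment parameter values from a bucket and stage.
# The naming convention follows the listing above; the bucket name is hypothetical.
deploy_params() {
    local bucket="$1" stage="$2"
    echo "Artifact S3 Bucket Name: ${bucket}"
    echo "Artifact S3 Prefix: ${stage}/artifacts"
    echo "Template Root URL: https://${bucket}.s3.amazonaws.com/${stage}/templates"
}

deploy_params my-deployment-bucket test
```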

The `production` stage is reserved for deployments from the Travis CI/CD environment; it deploys into a subdirectory named after the current release tag.
58 changes: 58 additions & 0 deletions docs/core-env/custom-deploy.md
@@ -0,0 +1,58 @@
# Customized Deployment

Deployments of the 'Nextflow on AWS Batch' solution are based on nested CloudFormation templates, and on artifacts comprising scripts, software packages, and configuration files. The templates and artifacts are stored in S3 buckets, and their S3 URLs are used when launching the top-level template and as parameters to that template's deployment.

## VPC
The quick start link deploys the [AWS VPC Quickstart](https://aws.amazon.com/quickstart/architecture/vpc/), which creates a VPC with up to 4 Availability Zones, each with a public subnet and a private subnet with NAT Gateway access to the Internet.

## Genomics Workflow Core
This quick start link deploys the CloudFormation template `gwfcore-root.template.yaml` for the Genomics Workflow Core (GWFCore) from the [Genomics Workflows on AWS](https://github.com/aws-samples/aws-genomics-workflows) solution. This template launches a number of nested templates, as shown below:

* Root Stack __gwfcore-root__ - Top level template for Genomics Workflow Core
* S3 Stack __gwfcore-s3__ - S3 bucket (new or existing) for storing analysis results
* IAM Stack __gwfcore-iam__ - Creates IAM roles to use with the AWS Batch scalable genomics workflow environment
* Code Stack __gwfcore-code__ - Creates AWS CodeCommit repos and CodeBuild projects for Genomics Workflows Core assets and artifacts
* Launch Template Stack __gwfcore-launch-template__ - Creates an EC2 Launch Template for AWS Batch based genomics workflows
* Batch Stack __gwfcore-batch__ - Deploys resources for an AWS Batch environment suitable for genomics, including default and high-priority job queues

### Root Stack
The quick start solution links to the CloudFormation console, where the 'Amazon S3 URL' field is prefilled with the S3 URL of a copy of the root stack template, hosted in the public S3 bucket __aws-genomics-workflows__.

<img src="https://dpkk088kye7gn.cloudfront.net/aws-genomics-workflows/docs/images/custom-deploy-0.png"
alt="custom-deploy-0"
width="100%" height="100%"
class="screenshot" />

To use a customized root stack, upload your modified stack template to an S3 bucket (see [Building a Custom Distribution](build-custom-distribution.md)), and specify that template's URL in 'Amazon S3 URL'.

### Artifacts and Nested Stacks
The subsequent screen, 'Specify Stack Details', allows for customization of the deployed resources in the 'Distribution Configuration' section.

<img src="https://dpkk088kye7gn.cloudfront.net/aws-genomics-workflows/docs/images/custom-deploy-1.png"
alt="custom-deploy-1"
width="70%" height="70%"
class="screenshot" />

* __Artifact S3 Bucket Name__ and __Artifact S3 Prefix__ define the location of the artifacts uploaded prior to this deployment. By default, pre-prepared artifacts are stored in the __aws-genomics-workflows__ bucket.
* __Template Root URL__ defines the bucket and prefix used to store nested templates, called by the root template.

To use your own modified artifacts or nested templates, build and upload them as described in [Building a Custom Distribution](build-custom-distribution.md), and specify the bucket and prefix in the fields above.
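
Equivalently, a customized deployment can be launched with the AWS CLI instead of the console. The sketch below is illustrative only: the parameter keys shown are assumptions, so verify them against the Parameters section of `gwfcore-root.template.yaml` before use.

```sh
# Illustrative sketch only -- the ParameterKey names are assumptions; check
# the Parameters section of gwfcore-root.template.yaml for the actual names.
aws cloudformation create-stack \
    --stack-name gwfcore \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
    --template-url https://my-deployment-bucket.s3.amazonaws.com/test/templates/gwfcore-root.template.yaml \
    --parameters \
        ParameterKey=ArtifactBucketName,ParameterValue=my-deployment-bucket \
        ParameterKey=ArtifactBucketPrefix,ParameterValue=test/artifacts \
        ParameterKey=TemplateRootUrl,ParameterValue="https://my-deployment-bucket.s3.amazonaws.com/test/templates"
```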

## Workflow Orchestrators
### Nextflow
This quick start deploys the Nextflow template `nextflow-resources.template.yaml`, which launches one nested stack:

* Root Stack __nextflow-resources__ - Creates resources specific to running Nextflow on AWS
* Container Build Stack __container-build__ - Creates resources for building a Docker container image using CodeBuild, storing the image in ECR, and optionally creating a corresponding Batch Job Definition

The Nextflow root stack is specified in the same way as the GWFCore root stack above; as with the core stack, a location for a modified root stack may be specified.

The subsequent 'Specify Stack Details' screen has fields allowing the customization of the Nextflow deployment.

<img src="https://dpkk088kye7gn.cloudfront.net/aws-genomics-workflows/docs/images/nextflow-0.png"
alt="nextflow-0"
width="70%" height="70%"
class="screenshot" />

* __S3NextflowPrefix__, __S3LogsDirPrefix__, and __S3WorkDirPrefix__ specify the paths within the GWFCore bucket in which to store per-run data and log files.
* __TemplateRootUrl__ specifies the path to the nested templates called by the Nextflow root template, as with the GWFCore root stack.
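
For illustration only, the prefix parameters might compose into S3 locations like the following. The actual layout is defined by `nextflow-resources.template.yaml`; the bucket name and prefix values here are assumptions.

```sh
# Illustration only: how the prefix parameters might compose into S3 URIs.
# The actual layout is defined by the template; all names here are assumptions.
nextflow_s3_paths() {
    local bucket="$1" nf_prefix="$2" logs_prefix="$3" work_prefix="$4"
    echo "logs:    s3://${bucket}/${nf_prefix}/${logs_prefix}"
    echo "workdir: s3://${bucket}/${nf_prefix}/${work_prefix}"
}

nextflow_s3_paths my-gwfcore-bucket _nextflow logs runs
```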
25 changes: 11 additions & 14 deletions docs/core-env/setup-aws-batch.md
@@ -46,8 +46,8 @@ A complete AWS Batch environment consists of the following:

1. A Compute Environment that utilizes [EC2 Spot instances](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-spot-instances.html) for cost-effective computing
2. A Compute Environment that utilizes EC2 on-demand (e.g. [public pricing](https://aws.amazon.com/ec2/pricing/on-demand/)) instances for high-priority work that can't risk job interruptions or delays due to insufficient Spot capacity.
3. A default Job Queue that utilizes the Spot compute environment first, but spills over to the on-demand compute environment if defined capacity limits (i.e. Max vCPUs) are reached.
4. A priority Job Queue that leverages the on-demand and Spot CE's (in that order) and has higher priority than the default queue.
3. A default Job Queue that solely utilizes the Spot compute environment. This is for jobs where timeliness isn't a constraint: they can wait for the right instances to become available and can tolerate interruption. It also ensures the most cost savings.
4. A priority Job Queue that leverages the on-demand, and optionally Spot, CEs (in that order) and has higher priority than the default queue. This is for jobs that cannot tolerate interruption and need to be executed immediately.

### Automated via CloudFormation

@@ -81,7 +81,7 @@ You can create several compute environments to suit your needs. Below we'll cre
6. In the "Service role" drop down, select the `AWSBatchServiceRole` you created previously
7. In the "Instance role" drop down, select the `ecsInstanceRole` you created previously
8. For "Provisioning model" select "On-Demand"
9. "Allowed instance types" will be already populated with "optimal" - which is a mixture of M4, C4, and R4 instances.
9. "Allowed instance types" will be already populated with "optimal" - which is a mixture of M4, C4, and R4 instances. This should be sufficient for demonstration purposes. In a production setting, it is recommended to specify the instance families and sizes most appropriate for the jobs the CE will support. For the On-Demand CE, selecting newer instance types is beneficial as they tend to have better price-performance.
10. "Allocation strategy" will already be set to `BEST_FIT`. This is recommended for on-demand based compute environments as it ensures the most cost efficiency.
11. In the "Launch template" drop down, select the `genomics-workflow-template` you created previously
12. Set Minimum and Desired vCPUs to 0.
@@ -112,7 +112,7 @@ Click on "Create"
6. In the "Service role" drop down, select the `AWSBatchServiceRole` you created previously
7. In the "Instance role" drop down, select the `ecsInstanceRole` you created previously
8. For "Provisioning model" select "Spot"
9. "Allowed instance types" will be already populated with "optimal" - which is a mixture of M4, C4, and R4 instances.
9. "Allowed instance types" will be already populated with "optimal" - which is a mixture of M4, C4, and R4 instances. This should be sufficient for demonstration purposes. In a production setting, it is recommended to specify the instance families and sizes most appropriate for the jobs the CE will support. For the Spot CE, a wider diversity of instance types is recommended, to maximize the pools from which capacity can be drawn. Limiting instance sizes is also recommended, to avoid scheduling too many jobs onto a single Spot instance that could be interrupted.
10. "Allocation strategy" will already be set to `SPOT_CAPACITY_OPTIMIZED`. This is recommended for Spot based compute environments as it ensures the most compute capacity is available for your jobs.
11. In the "Launch template" drop down, select the `genomics-workflow-template` you created previously
12. Set Minimum and Desired vCPUs to 0.
@@ -135,20 +135,18 @@ Job queues can be associated with one or more compute environments in a preferre
Below we'll create two job queues:

* A "Default" job queue
* A "High Priority" job queue
* A "Priority" job queue

Both job queues will use both compute environments you created previously.

##### Create a "default" job queue

This queue is intended for jobs that do not require urgent completion, and can handle potential interruption. This queue will schedule jobs to:
This queue is intended for jobs that do not require urgent completion, and can tolerate interruption. This queue will schedule jobs only to the "spot" compute environment.

1. The "spot" compute environment
2. The "ondemand" compute environment
!!! note
    It is not recommended to configure a job queue to "spill over" from Spot to On-Demand. Doing so could lead to Insufficient Capacity Errors, leaving Batch unable to schedule jobs and jobs stuck in the "RUNNABLE" state.

in that order.

Because it primarily leverages Spot instances, it will also be the most cost effective job queue.
Because it leverages Spot instances, it will also be the most cost-effective job queue.

* Go to the AWS Batch Console
* Click on "Job queues"
@@ -157,8 +155,7 @@ Because it primarily leverages Spot instances, it will also be the most cost eff
* Set "Priority" to 1
* Under "Connected compute environments for this queue", using the drop down menu:

1. Select the "spot" compute environment you created previously, then
2. Select the "ondemand" compute environment you created previously
1. Select the "spot" compute environment you created previously

* Click on "Create Job Queue"

@@ -169,7 +166,7 @@ This queue is intended for jobs that are urgent and **cannot** handle potential
1. The "ondemand" compute environment
2. The "spot" compute environment

in that order.
in that order. In this configuration, Batch schedules jobs to the "ondemand" compute environment first; when that environment's Max vCPUs limit is reached, Batch begins scheduling jobs to the "spot" compute environment. Including the "spot" compute environment is optional; it helps drain pending jobs from the queue faster.
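
For reference, the same preference ordering can be expressed with the AWS CLI rather than the console; the queue and compute environment names below are hypothetical:

```sh
# Sketch: a higher-priority queue that prefers the "ondemand" CE and spills
# over to "spot" only to drain a backlog. All names here are hypothetical.
aws batch create-job-queue \
    --job-queue-name priority \
    --state ENABLED \
    --priority 100 \
    --compute-environment-order \
        order=1,computeEnvironment=ondemand \
        order=2,computeEnvironment=spot
```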

* Go to the AWS Batch Console
* Click on "Job queues"
6 changes: 6 additions & 0 deletions docs/extra.css
@@ -57,4 +57,10 @@

.md-header, .md-footer, .md-footer-nav, .md-footer-meta {
background-color: #232f3e !important;
}

.screenshot {
  float: left;
margin: 10px;
border: 1px solid lightgrey;
}
Binary file added docs/images/custom-deploy-0.png
Binary file added docs/images/custom-deploy-1.png
Binary file added docs/images/nextflow-0.png
4 changes: 4 additions & 0 deletions mkdocs.yml
@@ -9,6 +9,8 @@ nav:
- Permissions: core-env/create-iam-roles.md
- Compute Resources: core-env/create-custom-compute-resources.md
- AWS Batch: core-env/setup-aws-batch.md
- Customized Deployment: core-env/custom-deploy.md
- Building a Custom Distribution: core-env/build-custom-distribution.md
# - Containerized Tooling:
# - Introduction: containers/container-introduction.md
# - Examples: containers/container-examples.md
@@ -57,3 +59,5 @@ extra:
s3:
bucket: docs.opendata.aws
prefix: genomics-workflows

use_directory_urls: false
9 changes: 9 additions & 0 deletions src/aws-genomics-cdk/.gitignore
@@ -0,0 +1,9 @@
*.js
!jest.config.js
*.d.ts
node_modules

# CDK asset staging directory
.cdk.staging
cdk.out
cdk.context.json
6 changes: 6 additions & 0 deletions src/aws-genomics-cdk/.npmignore
@@ -0,0 +1,6 @@
*.ts
!*.d.ts

# CDK asset staging directory
.cdk.staging
cdk.out