EMR Capacity Optimized and Instance Selector CLI (#130)
* Removed i3s, added AWS Instance Selector, Cloud9

* capacity-optimized and 15 instance types

* Updated to reflect latest console changes

* Cloud9 Console Link

* Cloud9 Images

* Removed SIA Image

* Cloud9

* Cloud9

* Savings Summary - 32 Task Spot Unit

* Removed Warning on old Spot limits

* Updated to 32 instances to support 4xl

* Cloud9

* PR Review

* PR Review

* Spell Check

* Format Fix

* Format Changes

* Format Changes

* Format Changes

* Format Changes, replacing "\" with "  "

Co-authored-by: Jagdeep <[email protected]>
jagpk authored Jan 8, 2021
1 parent a2f01cd commit 77479c0
Showing 28 changed files with 222 additions and 128 deletions.

## Overview

Welcome! In this workshop you will assume the role of a data engineer, tasked with optimizing the organization's costs for running Spark applications, using Amazon EMR and EC2 Spot Instances.

{{% notice info %}} The **estimated time** for completing the workshop is 60-90 minutes and the **estimated cost** for running the workshop's resources in your AWS account is less than $2.\
The **learning objective** for the workshop is to become familiar with the best practices and tooling that are available to you for cost optimizing your EMR clusters running Spark applications, using Spot Instances. {{% /notice %}}
* [Amazon EC2 Spot Instances](https://aws.amazon.com/ec2/spot/) offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand prices. EC2 can interrupt Spot Instances with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. Some examples are analytics, containerized workloads, high-performance computing (HPC), stateless web servers, rendering, CI/CD, and other test and development workloads.

## About Spot Instances in Analytics workloads
The most important best practice when using Spot Instances is to be flexible with the EC2 instance types that our application can run on, in order to be able to access many spare capacity pools (a combination of EC2 instance type and an Availability Zone), as well as achieve our desired capacity from a different instance type in case some of our Spot capacity in the EMR cluster is interrupted, when EC2 needs the spare capacity back.
It's possible to run Spark applications in a single cluster that is running on multiple different instance types; we just need to right-size our executors and use the EMR Instance Fleets configuration option in order to meet the Spot diversification best practice. We'll look into that in detail during this workshop.
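One common way to keep executor sizing consistent across a mix of instance types is to pin the executor dimensions in the cluster's Spark configuration. Below is a hypothetical sketch using the EMR configurations JSON format (the `spark-defaults` classification is standard EMR syntax; the core and memory values are illustrative and should be tuned so that whole executors divide evenly into each instance type in your fleet):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.cores": "4",
      "spark.executor.memory": "18G",
      "spark.dynamicAllocation.enabled": "true"
    }
  }
]
```

With 4 cores and roughly 18 GB per executor, for example, one executor fits an r5.xlarge, two fit an r5.2xlarge, and so on, so any instance type in the fleet can host complete executors without wasted capacity.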

{{% notice warning %}}
If the Name tag Key was not enabled as a Cost Allocation Tag, you will not be able to filter by the Name tag in Cost Explorer.
{{% /notice %}}


Let's use Cost Explorer to analyze the costs of running our EMR application.
1. Navigate to Cost Explorer by opening the AWS Management Console -> click your username in the top right corner -> click **My Billing Dashboard** -> click **Cost Explorer** in the left pane, or [click here](https://console.aws.amazon.com/billing/home#/costexplorer) for a direct link.
2. We know that we gave our EMR cluster a unique Name tag, so let's filter according to it. In the right pane, click Tags -> Name -> enter "**EMRTransientCluster1**".
3. Instead of the default 45-day view, let's narrow down the time span to just the day when we ran the cluster. In the date selection dropdown, mark that day as both start and end.
4. You are now looking at the total cost of running the cluster (**$0.30**), including EMR, EC2, EBS, and possible AWS Cross-Region data transfer costs, depending on where you ran your cluster relative to where the S3 dataset is located (in N. Virginia).
5. Group by **Usage Type** to get a breakdown of the costs.

![costexplorer](/images/running-emr-spark-apps-on-spot/costexplorer1.png)

* EU-SpotUsage:r5.xlarge: the instance type that ran in the EMR Task Instance Fleet and accrued the largest cost, since EMR launched 10 instances ($0.17)
* EU-BoxUsage:r5.xlarge: the EMR costs. [Click here](https://aws.amazon.com/emr/pricing/) to learn more about EMR pricing. ($0.06)
* EU-EBS:VolumeUsage.gp2: EBS volumes that were attached to the EC2 Instances in the cluster - these got tagged automatically. ($0.03)
* EU-SpotUsage:r5a.xlarge & EU-SpotUsage:m4.xlarge: the EC2 Spot price for the other instances in the cluster (Master and Core). ($0.02 combined)

If you have access to Cost Explorer, have a look around and see what you can find by slicing and dicing with filtering and grouping. For example, what happens if you filter by **Purchase Option = Spot** & **Group by = Instance Type**?
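The same analysis can be scripted against the Cost Explorer API. Here is a hedged sketch using the AWS CLI (`aws ce get-cost-and-usage` is a real command; the date range is an example - replace it with the day you ran your cluster):

```shell
#!/usr/bin/env bash
# Build the Cost Explorer filter for our cluster's unique Name tag.
cat > ce-filter.json <<'EOF'
{
  "Tags": {
    "Key": "Name",
    "Values": ["EMRTransientCluster1"]
  }
}
EOF

# Sanity-check the JSON before using it.
python3 -m json.tool ce-filter.json > /dev/null && echo "filter OK"

# The actual query requires AWS credentials, so it is left commented out:
# aws ce get-cost-and-usage \
#   --time-period Start=2021-01-07,End=2021-01-08 \
#   --granularity DAILY \
#   --metrics UnblendedCost \
#   --group-by Type=DIMENSION,Key=USAGE_TYPE \
#   --filter file://ce-filter.json
```

Grouping by the USAGE_TYPE dimension mirrors step 5 above, returning the same EU-SpotUsage / EU-BoxUsage / EU-EBS breakdown as JSON.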

---
title: "Automations and monitoring"
weight: 110
---

When adopting EMR into your analytics flows and data processing pipelines, you will want to launch EMR clusters and run jobs programmatically. There are many ways to do so with the AWS SDKs, which can run in different environments such as Lambda functions invoked by AWS Data Pipeline or AWS Step Functions, with third-party tools like Apache Airflow, and more.

#### (Optional) Examine the JSON configuration for EMR Instance Fleets
In this section we will simply look at a CLI command that can be used to start an identical cluster to the one we started from the console. This makes it easy to configure your EMR clusters with the AWS Management Console and get a CLI runnable command with one click.
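For illustration, here is a hedged sketch of what such a command could look like for a task instance fleet (`aws emr create-cluster` and its `--instance-fleets` option are real; the instance types, weights, subnet IDs, and release label are placeholders - the console's CLI export gives you the exact command for your cluster):

```shell
#!/usr/bin/env bash
# Task instance fleet definition: several instance types, Spot target capacity.
cat > task-fleet.json <<'EOF'
[
  {
    "InstanceFleetType": "TASK",
    "TargetSpotCapacity": 32,
    "InstanceTypeConfigs": [
      { "InstanceType": "r5.xlarge",  "WeightedCapacity": 4 },
      { "InstanceType": "r5.2xlarge", "WeightedCapacity": 8 },
      { "InstanceType": "r4.xlarge",  "WeightedCapacity": 4 }
    ]
  }
]
EOF

# Validate the fleet definition locally.
python3 -m json.tool task-fleet.json > /dev/null && echo "fleet config OK"

# A create-cluster call would reference the fleet config (credentials required,
# so it is commented out; a full command also needs master and core fleets):
# aws emr create-cluster \
#   --release-label emr-5.30.1 \
#   --instance-fleets file://task-fleet.json \
#   --ec2-attributes SubnetIds=subnet-aaaa,subnet-bbbb
```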
#### (Optional) Set up CloudWatch Events for Cluster and/or Step failures
Much like we set up a CloudWatch Event rule for EC2 Spot Interruptions to be sent to our email via an SNS notification, we can also set up rules to send out notifications or perform automations when an EMR cluster fails to start, or a Task on the cluster fails. This is useful for monitoring purposes.

In this example, let's set up a notification for when our EMR step fails.
1. In the AWS Management Console, go to CloudWatch -> Events -> Rules and click **Create Rule**.
2. Under Service Name select **EMR**, and under Event Type select **State Change**.
3. Check **Specific detail type(s)** and from the dropdown menu, select **EMR Step Status Change**.
4. Check **Specific state(s)** and from the dropdown menu, select **FAILED**.
![cwemrstep](/images/running-emr-spark-apps-on-spot/emrstatechangecwevent.png)
5. In the targets menu, click **Add target**, select **SNS topic** and from the dropdown menu, select the SNS topic you created and click **Configure details**.
6. Provide a name for the rule and click **Create rule**.
7. You can test that the rule works by following the same steps to start a cluster, but providing a bad parameter when submitting the step - for example, a non-existent location for the Spark application or results bucket.
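Behind the console steps above, the rule matches on an event pattern. It should look roughly like this (a sketch; `aws.emr` and the `EMR Step Status Change` detail type are the documented values for this event):

```json
{
  "source": ["aws.emr"],
  "detail-type": ["EMR Step Status Change"],
  "detail": {
    "state": ["FAILED"]
  }
}
```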
hidden: true
---

1. In the EMR Management Console, check that the cluster is in the **Terminated** state. If it isn't, then you can terminate it from the console.
2. Go to the [Cloud9 Dashboard](https://console.aws.amazon.com/cloud9/home) and delete your environment.
3. Delete the VPC you deployed via CloudFormation by going to the CloudFormation service in the AWS Management Console, selecting the VPC stack (the default name is Quick-Start-VPC) and clicking **Delete**. Make sure that the deletion completes successfully (this should take around 1 minute) - the status of the stack will be DELETE_COMPLETE and the stack will move to the Deleted list of stacks.
4. Delete your S3 bucket from the AWS Management Console - choose the bucket from the list of buckets and hit the Delete button. This approach will also empty the bucket and delete all existing objects in the bucket.
5. Delete the Athena table by going to the Athena service in the AWS Management Console, find the **emrworkshopresults** Athena table, click the three dots icon next to the table and select **Delete table**.
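The S3 and CloudFormation steps can also be scripted with the AWS CLI. A cautious sketch (the bucket name is a placeholder; the script prints the commands in dry-run mode - set `DRY_RUN=false` to actually delete resources):

```shell
#!/usr/bin/env bash
set -euo pipefail

BUCKET="my-emr-workshop-bucket"   # placeholder - use your results bucket name
STACK="Quick-Start-VPC"           # default VPC stack name from the workshop
DRY_RUN=true                      # flip to false to really delete resources

run() {
  if [ "$DRY_RUN" = true ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

# Empty and remove the results bucket.
run aws s3 rb "s3://$BUCKET" --force
# Delete the VPC stack and wait until its status is DELETE_COMPLETE.
run aws cloudformation delete-stack --stack-name "$STACK"
run aws cloudformation wait stack-delete-complete --stack-name "$STACK"
```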
---
title: "Update to the latest AWS CLI"
chapter: false
weight: 20
comment: default install now includes aws-cli/1.15.83
---

{{% notice tip %}}
For this workshop, please ignore warnings about the version of pip being used.
{{% /notice %}}

1. Run the following command to view the current version of aws-cli:
```
aws --version
```

1. Update to the latest version:
```
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
. ~/.bash_profile
```

1. Confirm you have a newer version:
```
aws --version
```
---
title: "Create a Workspace"
chapter: false
weight: 15
---

{{% notice warning %}}
If you are running the workshop on your own, the Cloud9 workspace should be built by an IAM user with Administrator privileges, not the root account user. Please ensure you are logged in as an IAM user, not the root
account user.
{{% /notice %}}

{{% notice info %}}
If you are at an AWS hosted event, follow the instructions on the region that should be used to launch resources
{{% /notice %}}

{{% notice tip %}}
Ad blockers, JavaScript disablers, and tracking blockers should be disabled for
the Cloud9 domain, or connecting to the workspace might be impacted.
Cloud9 requires third-party cookies. You can whitelist the [specific domains](https://docs.aws.amazon.com/cloud9/latest/user-guide/troubleshooting.html#troubleshooting-env-loading).
{{% /notice %}}

### Launch Cloud9:

- Go to [Cloud9 Console](https://console.aws.amazon.com/cloud9/home)
- Select **Create environment**
- Name it **emrworkshop**, and take all other defaults
- When it comes up, customize the environment by closing the **welcome tab**
and **lower work area**, and opening a new **terminal** tab in the main work area:
![c9before](/images/running-emr-spark-apps-on-spot/c9before.png)

- Your workspace should now look like this:
![c9after](/images/running-emr-spark-apps-on-spot/c9after.png)

- If you like this theme, you can choose it yourself by selecting **View / Themes / Solarized / Solarized Dark**
in the Cloud9 workspace menu.
Select the correct tab, depending on where you are running the workshop:

#### Thank you

We hope you found this workshop educational, and that it will help you adopt Spot Instances into your Spark applications running on Amazon EMR, in order to optimize your costs.
If you have any feedback or questions, click the "**Feedback / Questions?**" link in the left pane to reach out to the authors of the workshop.

#### Other Resources:
Visit the [**Amazon EMR on EC2 Spot Instances**](https://aws.amazon.com/ec2/spot/use-case/emr/) page for more information, customer case studies and videos.  
Read the blog post: [**Best practices for running Apache Spark applications using Amazon EC2 Spot Instances with Amazon EMR**](https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-spark-applications-using-amazon-ec2-spot-instances-with-amazon-emr/)  
Watch the AWS Online Tech-Talk: [**Best Practices for Running Spark Applications Using Spot Instances on EMR - AWS Online Tech Talks**](https://www.youtube.com/watch?v=u5dFozl1fW8)
weight: 30

When adopting Spot Instances into your workload, it is recommended to be flexible about how you launch it in terms of Availability Zones and instance types. This flexibility lets you achieve the required scale from multiple Spot capacity pools (a combination of an EC2 instance type and an Availability Zone), or from one capacity pool that has sufficient capacity. It also decreases the impact on your workload in case some of the Spot capacity is interrupted with a 2-minute notice when EC2 needs the capacity back, and allows EMR to replenish the capacity with a different instance type.

With EMR instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet (Master, Core, Task). When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. You can specify up to five EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets. You can also select multiple subnets for different Availability Zones.

{{% notice info %}}
[Click here](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html) to learn more about EMR Instance Fleets in the official documentation.
{{% /notice %}}

**When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options you specify, and will select the Spot Instances with the lowest chance of getting interrupted, for the lowest cost.**
While a cluster is running, if Amazon EC2 reclaims a Spot Instance or if an instance fails, Amazon EMR tries to replace the instance with any of the instance types that you specify in your fleet. This makes it easier to regain capacity in case some of the instances get interrupted by EC2 when it needs the Spot capacity back.


These options do not exist within the default EMR configuration option "Uniform Instance Groups", hence we will be using EMR Instance Fleets only.

As an enhancement to the default EMR instance fleets cluster configuration, the allocation strategy feature is available in EMR version **5.12.1 and later**. With allocation strategy:
* On-Demand instances use a lowest-price strategy, which launches the lowest-priced instances first.
* Spot instances use a [capacity-optimized](https://aws.amazon.com/about-aws/whats-new/2020/06/amazon-emr-uses-real-time-capacity-insights-to-provision-spot-instances-to-lower-cost-and-interruption/) allocation strategy, which allocates instances from most-available Spot Instance pools and lowers the chance of interruptions. This allocation strategy is appropriate for workloads that have a higher cost of interruption such as persistent EMR clusters running Apache Spark, Apache Hive, and Presto.

{{% notice note %}}
This allocation strategy option also lets you specify **up to 15 EC2 instance types on the task instance fleet**. By default, Amazon EMR allows a maximum of 5 instance types for each instance fleet. By enabling the allocation strategy, you can diversify your Spot request for the task instance fleet across 15 instance pools. With more instance type diversification, Amazon EMR has more capacity pools to allocate capacity from, which makes it easier to obtain your desired compute capacity.
{{% /notice %}}

{{% notice info %}}
[Click here](https://aws.amazon.com/blogs/big-data/optimizing-amazon-emr-for-resilience-and-cost-with-capacity-optimized-spot-instances/) for an in-depth blog post about the capacity-optimized allocation strategy for Amazon EMR instance fleets.
{{% /notice %}}
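In the instance fleet API, the allocation strategy appears as part of a fleet's launch specifications. A sketch of the relevant fragment (field names follow the EMR `create-cluster` JSON; the timeout values are illustrative):

```json
{
  "LaunchSpecifications": {
    "SpotSpecification": {
      "TimeoutDurationMinutes": 10,
      "TimeoutAction": "SWITCH_TO_ON_DEMAND",
      "AllocationStrategy": "capacity-optimized"
    },
    "OnDemandSpecification": {
      "AllocationStrategy": "lowest-price"
    }
  }
}
```

The Spot timeout settings control what happens if EMR cannot get Spot capacity within the window - here it would fall back to On-Demand rather than fail the provisioning.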