diff --git a/content/running_spark_apps_with_emr_on_spot_instances/_index.md b/content/running_spark_apps_with_emr_on_spot_instances/_index.md index 32bd061d..fa910fd4 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/_index.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/_index.md @@ -7,7 +7,7 @@ pre: "" ## Overview -Welcome! In this workshop you will assume the role of a data engineer, tasked with optimizing the organization's costs for running Spark applications, using Amazon EMR and EC2 Spot Instances.\ +Welcome! In this workshop you will assume the role of a data engineer, tasked with optimizing the organization's costs for running Spark applications, using Amazon EMR and EC2 Spot Instances. {{% notice info %}} The **estimated time** for completing the workshop is 60-90 minutes and the **estimated cost** for running the workshop's resources in your AWS account is less than $2.\ The **learning objective** for the workshop is to become familiar with the best practices and tooling that are available to you for cost optimizing your EMR clusters running Spark applications, using Spot Instances. {{% /notice %}} @@ -20,6 +20,6 @@ The **learning objective** for the workshop is to become familiar with the best * [Amazon EC2 Spot Instances] (https://aws.amazon.com/ec2/spot/) offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand prices. EC2 can interrupt Spot Instances with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. Some examples are analytics, containerized workloads, high-performance computing (HPC), stateless web servers, rendering, CI/CD, and other test and development workloads. ## About Spot Instances in Analytics workloads -The most important best practice when using Spot Instances is to be flexible with the EC2 instance types that our application can run on, in order to be able to access many spare capacity pools (a combination of EC2 instance type and an Availability Zone), as well as achieve our desired capacity from a different instance type in case some of our Spot capacity in the EMR cluster is interrupted, when EC2 needs the spare capacity back.\ +The most important best practice when using Spot Instances is to be flexible with the EC2 instance types that our application can run on, in order to be able to access many spare capacity pools (a combination of EC2 instance type and an Availability Zone), as well as achieve our desired capacity from a different instance type in case some of our Spot capacity in the EMR cluster is interrupted, when EC2 needs the spare capacity back. It's possible to run Spark applications in a single cluster that is running on multiple different instance types, we'll just need to right-size our executors and use the EMR Instance Fleets configuration option in order to meet the Spot diversification best practice. We'll look into that in detail during this workshop. 
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md b/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md index 06acd21c..456c368f 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md @@ -22,19 +22,19 @@ If the Name tag Key was not enabled as a Cost Allocation Tag, you will not be ab {{% /notice %}} -Let's use Cost Explorer to analyze the costs of running our EMR application.\ -1. Navigate to Cost Explorer by opening the AWS Management Console -> Click your username in the top right corner -> click **My Billing Dashboard** -> click **Cost Explorer in the left pane**. or [click here] (https://console.aws.amazon.com/billing/home#/costexplorer) for a direct link.\ -2. We know that we gave our EMR cluster a unique Name tag, so let's filter according to it. In the right pane, click Tags -> Name -> enter "**EMRTransientCluster1**"\ -3. Instead of the default 45 days view, let's narrow down the time span to just the day when we ran the cluster. In the data selection dropdown, mark that day as start and end.\ -4. You are now looking at the total cost to run the cluster (**$0.30**), including: EMR, EC2, EBS, and possible AWS Cross-Region data transfer costs, depending on where you ran your cluster relative to where the S3 dataset is located (in N. Virginia).\ +Let's use Cost Explorer to analyze the costs of running our EMR application. +1. Navigate to Cost Explorer by opening the AWS Management Console -> Click your username in the top right corner -> click **My Billing Dashboard** -> click **Cost Explorer in the left pane**. or [click here] (https://console.aws.amazon.com/billing/home#/costexplorer) for a direct link. +2. We know that we gave our EMR cluster a unique Name tag, so let's filter according to it. In the right pane, click Tags -> Name -> enter "**EMRTransientCluster1**" +3. Instead of the default 45 days view, let's narrow down the time span to just the day when we ran the cluster. In the data selection dropdown, mark that day as start and end. +4. You are now looking at the total cost to run the cluster (**$0.30**), including: EMR, EC2, EBS, and possible AWS Cross-Region data transfer costs, depending on where you ran your cluster relative to where the S3 dataset is located (in N. Virginia). 5. Group by **Usage Type** to get a breakdown of the costs ![costexplorer](/images/running-emr-spark-apps-on-spot/costexplorer1.png) -* EU-SpotUsage:r5.xlarge: This was the instance type that ran in the EMR Task Instance fleet and accrued the largest cost, since EMR launched 10 instances ($0.17)\ -* EU-BoxUsage:r5.xlarge: The EMR costs. [Click here] (https://aws.amazon.com/emr/pricing/) to learn more about EMR pricing. ($0.06)\ -* EU-EBS:VolumeUsage.gp2: EBS volumes that were attached to my EC2 Instances in the cluster - these got tagged automatically. ($0.03)\ -* EU-SpotUsage:r5a.xlarge & EU-SpotUsage:m4.xlarge: EC2 Spot price for the other instances in my cluster (Master and Core) ($0.02 combined)\ +* EU-SpotUsage:r5.xlarge: This was the instance type that ran in the EMR Task Instance fleet and accrued the largest cost, since EMR launched 10 instances ($0.17) +* EU-BoxUsage:r5.xlarge: The EMR costs. [Click here] (https://aws.amazon.com/emr/pricing/) to learn more about EMR pricing. ($0.06) +* EU-EBS:VolumeUsage.gp2: EBS volumes that were attached to my EC2 Instances in the cluster - these got tagged automatically. 
($0.03) +* EU-SpotUsage:r5a.xlarge & EU-SpotUsage:m4.xlarge: EC2 Spot price for the other instances in my cluster (Master and Core) ($0.02 combined) If you have access to Cost Explorer, have a look around and see what you can find by slicing and dicing with filtering and grouping. For example, what happens if you filter by **Purchase Option = Spot** & **Group by = Instance Type**? diff --git a/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md b/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md index b29875d8..3b7acc78 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md @@ -3,7 +3,7 @@ title: "Automations and monitoring" weight: 110 --- -When adopting EMR into your analytics flows and data processing pipelines, you will want to launch EMR clusters and run jobs in a programmatic manner. There are many ways to do so with AWS SDKs that can run in different environments like Lambda Functions, invoked by AWS Data Pipeline or AWS Step Functions, with third party tools like Apache Airflow, and more. \ +When adopting EMR into your analytics flows and data processing pipelines, you will want to launch EMR clusters and run jobs in a programmatic manner. There are many ways to do so with AWS SDKs that can run in different environments like Lambda Functions, invoked by AWS Data Pipeline or AWS Step Functions, with third party tools like Apache Airflow, and more. #### (Optional) Examine the JSON configuration for EMR Instance Fleets In this section we will simply look at a CLI command that can be used to start an identical cluster to the one we started from the console. This makes it easy to configure your EMR clusters with the AWS Management Console and get a CLI runnable command with one click. @@ -16,12 +16,12 @@ In this section we will simply look at a CLI command that can be used to start a #### (Optional) Set up CloudWatch Events for Cluster and/or Step failures Much like we set up a CloudWatch Event rule for EC2 Spot Interruptions to be sent to our email via an SNS notification, we can also set up rules to send out notifications or perform automations when an EMR cluster fails to start, or a Task on the cluster fails. This is useful for monitoring purposes. -In this example, let's set up a notification for when our EMR step failed.\ -1. In the AWS Management Console, go to Cloudwatch -> Events -> Rules and click **Create Rule**.\ -2. Under Service Name select EMR, and under Event Type select **State Change**.\ -3. Check **Specific detail type(s)** and from the dropdown menu, select **EMR Step Status Change**\ -4. Check **Specific states(s)** and from the dropdown menu, select **FAILED**.\ +In this example, let's set up a notification for when our EMR step failed. +1. In the AWS Management Console, go to Cloudwatch -> Events -> Rules and click **Create Rule**. +2. Under Service Name select EMR, and under Event Type select **State Change**. +3. Check **Specific detail type(s)** and from the dropdown menu, select **EMR Step Status Change** +4. Check **Specific states(s)** and from the dropdown menu, select **FAILED**. ![cwemrstep](/images/running-emr-spark-apps-on-spot/emrstatechangecwevent.png) -5. In the targets menu, click **Add target**, select **SNS topic** and from the dropdown menu, select the SNS topic you created and click **Configure details**.\ -6. 
Provide a name for the rule and click **Create rule**.\ +5. In the targets menu, click **Add target**, select **SNS topic** and from the dropdown menu, select the SNS topic you created and click **Configure details**. +6. Provide a name for the rule and click **Create rule**. 7. You can test that the rule works by following the same steps to start a cluster, but providing a bad parameter when submitting the step, for example - a non existing location for the Spark application or results bucket. \ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/cleanup_ownaccount.md b/content/running_spark_apps_with_emr_on_spot_instances/cleanup_ownaccount.md index fc2e5be5..cd9e56a1 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/cleanup_ownaccount.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/cleanup_ownaccount.md @@ -6,8 +6,7 @@ hidden: true --- 1. In the EMR Management Console, check that the cluster is in the **Terminated** state. If it isn't, then you can terminate it from the console. -2. Delete the VPC you deployed via CloudFormation, by going to the CloudFormation service in the AWS Management Console, selecting the VPC stack (default name is Quick-Start-VPC) and click the Delete option. Make sure that the deletion has completed successfully (this should take around 1 minute), the status of the stack will be DELETE_COMPLETE (the stack will move to the Deleted list of stacks). -3. Delete your S3 bucket from the AWS Management Console - choose the bucket from the list of buckets and hit the Delete button. This approach will also empty the bucket and delete all existing objects in the bucket. -4. Delete the Athena table by going to the Athena service in the AWS Management Console, find the **emrworkshopresults** Athena table, click the three dots icon next to the table and select **Delete table**. - - +2. Go to the [Cloud9 Dashboard](https://console.aws.amazon.com/cloud9/home) and delete your environment. +3. Delete the VPC you deployed via CloudFormation, by going to the CloudFormation service in the AWS Management Console, selecting the VPC stack (default name is Quick-Start-VPC) and click the Delete option. Make sure that the deletion has completed successfully (this should take around 1 minute), the status of the stack will be DELETE_COMPLETE (the stack will move to the Deleted list of stacks). +4. Delete your S3 bucket from the AWS Management Console - choose the bucket from the list of buckets and hit the Delete button. This approach will also empty the bucket and delete all existing objects in the bucket. +5. Delete the Athena table by going to the Athena service in the AWS Management Console, find the **emrworkshopresults** Athena table, click the three dots icon next to the table and select **Delete table**. \ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/cloud9-awscli.md b/content/running_spark_apps_with_emr_on_spot_instances/cloud9-awscli.md new file mode 100644 index 00000000..95d0eafb --- /dev/null +++ b/content/running_spark_apps_with_emr_on_spot_instances/cloud9-awscli.md @@ -0,0 +1,28 @@ +--- +title: "Update to the latest AWS CLI" +chapter: false +weight: 20 +comment: default install now includes aws-cli/1.15.83 +--- + +{{% notice tip %}} +For this workshop, please ignore warnings about the version of pip being used. +{{% /notice %}} + +1. Run the following command to view the current version of aws-cli: +``` +aws --version +``` + +1. 
Update to the latest version: +``` +curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +unzip awscliv2.zip +sudo ./aws/install +. ~/.bash_profile +``` + +1. Confirm you have a newer version: +``` +aws --version +``` diff --git a/content/running_spark_apps_with_emr_on_spot_instances/cloud9-workspace.md b/content/running_spark_apps_with_emr_on_spot_instances/cloud9-workspace.md new file mode 100644 index 00000000..6150fd29 --- /dev/null +++ b/content/running_spark_apps_with_emr_on_spot_instances/cloud9-workspace.md @@ -0,0 +1,35 @@ +--- +title: "Create a Workspace" +chapter: false +weight: 15 +--- + +{{% notice warning %}} +If you are running the workshop on your own, the Cloud9 workspace should be built by an IAM user with Administrator privileges, not the root account user. Please ensure you are logged in as an IAM user, not the root +account user. +{{% /notice %}} + +{{% notice info %}} +If you are at an AWS hosted event, follow the instructions on the region that should be used to launch resources +{{% /notice %}} + +{{% notice tip %}} +Ad blockers, javascript disablers, and tracking blockers should be disabled for +the cloud9 domain, or connecting to the workspace might be impacted. +Cloud9 requires third-party-cookies. You can whitelist the [specific domains]( https://docs.aws.amazon.com/cloud9/latest/user-guide/troubleshooting.html#troubleshooting-env-loading). +{{% /notice %}} + +### Launch Cloud9: + +- Go to [Cloud9 Console](https://console.aws.amazon.com/cloud9/home) +- Select **Create environment** +- Name it **emrworkshop**, and take all other defaults +- When it comes up, customize the environment by closing the **welcome tab** +and **lower work area**, and opening a new **terminal** tab in the main work area: +![c9before](/images/running-emr-spark-apps-on-spot/c9before.png) + +- Your workspace should now look like this: +![c9after](/images/running-emr-spark-apps-on-spot/c9after.png) + +- If you like this theme, you can choose it yourself by selecting **View / Themes / Solarized / Solarized Dark** +in the Cloud9 workspace menu. diff --git a/content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md b/content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md index 5559b946..de83359f 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md @@ -17,10 +17,10 @@ Select the correct tab, depending on where you are running the workshop: #### Thank you -We hope you found this workshop educational, and that it will help you adopt Spot Instances into your Spark applications running on Amazon EMR, in order to optimize your costs.\ +We hope you found this workshop educational, and that it will help you adopt Spot Instances into your Spark applications running on Amazon EMR, in order to optimize your costs. If you have any feedback or questions, click the "**Feedback / Questions?**" link in the left pane to reach out to the authors of the workshop. #### Other Resources: -Visit the [**Amazon EMR on EC2 Spot Instances**] (https://aws.amazon.com/ec2/spot/use-case/emr/) page for more information, customer case studies and videos. 
\ -Read the blog post: [**Best practices for running Apache Spark applications using Amazon EC2 Spot Instances with Amazon EMR**] (https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-spark-applications-using-amazon-ec2-spot-instances-with-amazon-emr/) \ +Visit the [**Amazon EMR on EC2 Spot Instances**] (https://aws.amazon.com/ec2/spot/use-case/emr/) page for more information, customer case studies and videos. +Read the blog post: [**Best practices for running Apache Spark applications using Amazon EC2 Spot Instances with Amazon EMR**] (https://aws.amazon.com/blogs/big-data/best-practices-for-running-apache-spark-applications-using-amazon-ec2-spot-instances-with-amazon-emr/) Watch the AWS Online Tech-Talk: [**Best Practices for Running Spark Applications Using Spot Instances on EMR - AWS Online Tech Talks**] (https://www.youtube.com/watch?v=u5dFozl1fW8) diff --git a/content/running_spark_apps_with_emr_on_spot_instances/emr_instance_fleets.md b/content/running_spark_apps_with_emr_on_spot_instances/emr_instance_fleets.md index 0104f3a9..d2484d0e 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/emr_instance_fleets.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/emr_instance_fleets.md @@ -5,14 +5,24 @@ weight: 30 When adopting Spot Instances into your workload, it is recommended to be flexible around how to launch your workload in terms of Availability Zone and Instance Types. This is in order to be able to achieve the required scale from multiple Spot capacity pools (a combination of EC2 instance type in an availability zone) or one capacity pool which has sufficient capacity, as well as decrease the impact on your workload in case some of the Spot capacity is interrupted with a 2-minute notice when EC2 needs the capacity back, and allow EMR to replenish the capacity with a different instance type. -With EMR instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet (Master, Core, Task). When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. You can specify up to five EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets. You can also select multiple subnets for different Availability Zones.\ +With EMR instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet (Master, Core, Task). When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. You can specify up to five EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets. You can also select multiple subnets for different Availability Zones. {{% notice info %}} [Click here] (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html) to learn more about EMR Instance Fleets in the official documentation. {{% /notice %}} -**When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options you specify, and will select the Spot Instances with the lowest chance of getting interrupted, for the lowest cost.** +While a cluster is running, if Amazon EC2 reclaims a Spot Instance or if an instance fails, Amazon EMR tries to replace the instance with any of the instance types that you specify in your fleet. This makes it easier to regain capacity in case some of the instances get interrupted by EC2 when it needs the Spot capacity back. 
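+
+As a rough illustration of what such a fleet definition looks like outside the console, here is a minimal, hypothetical sketch using the `aws emr create-cluster` CLI — the instance types, weights and target capacities below are placeholder assumptions, not the workshop's configuration, and later in the workshop you will see that the console can export the exact command for the cluster you build:
+
+```
+# Illustrative sketch only - all values are placeholders
+aws emr create-cluster \
+  --name "EMRTransientCluster1" \
+  --release-label emr-5.30.0 \
+  --applications Name=Hadoop Name=Spark Name=Ganglia \
+  --use-default-roles \
+  --instance-fleets '[
+    {"InstanceFleetType":"MASTER","TargetSpotCapacity":1,
+     "InstanceTypeConfigs":[{"InstanceType":"m5.xlarge","WeightedCapacity":1}]},
+    {"InstanceFleetType":"CORE","TargetSpotCapacity":4,
+     "InstanceTypeConfigs":[{"InstanceType":"r5.xlarge","WeightedCapacity":4},
+                            {"InstanceType":"r4.xlarge","WeightedCapacity":4}]},
+    {"InstanceFleetType":"TASK","TargetSpotCapacity":32,
+     "InstanceTypeConfigs":[{"InstanceType":"r5.xlarge","WeightedCapacity":4},
+                            {"InstanceType":"r5.2xlarge","WeightedCapacity":8}]}
+  ]'
+```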
- -While a cluster is running, if Amazon EC2 reclaims a Spot Instance or if an instance fails, Amazon EMR tries to replace the instance with any of the instance types that you specify in your fleet. This makes it easier to regain capacity in case some of the instances get interrupted by EC2 when it needs the Spot capacity back.\ These options do not exist within the default EMR configuration option "Uniform Instance Groups", hence we will be using EMR Instance Fleets only. + +As an enhancement to the default EMR instance fleets cluster configuration, the allocation strategy feature is available in EMR version **5.12.1 and later**. With allocation strategy: +* On-Demand instances use a lowest-price strategy, which launches the lowest-priced instances first. +* Spot instances use a [capacity-optimized] (https://aws.amazon.com/about-aws/whats-new/2020/06/amazon-emr-uses-real-time-capacity-insights-to-provision-spot-instances-to-lower-cost-and-interruption/) allocation strategy, which allocates instances from most-available Spot Instance pools and lowers the chance of interruptions. This allocation strategy is appropriate for workloads that have a higher cost of interruption such as persistent EMR clusters running Apache Spark, Apache Hive, and Presto. + +{{% notice note %}} +This allocation strategy option also lets you specify **up to 15 EC2 instance types on task instance fleet**. By default, Amazon EMR allows a maximum of 5 instance types for each type of instance fleet. By enabling allocation strategy, you can diversify your Spot request for task instance fleet across 15 instance pools. With more instance type diversification, Amazon EMR has more capacity pools to allocate capacity from, this allows you to get more compute capacity. +{{% /notice %}} + +{{% notice info %}} +[Click here] (https://aws.amazon.com/blogs/big-data/optimizing-amazon-emr-for-resilience-and-cost-with-capacity-optimized-spot-instances/) for an in-depth blog post about capacity-optimized allocation strategy for Amazon EMR instance fleets. +{{% /notice %}} \ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md b/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md index e0b1c541..b4405e36 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md @@ -6,8 +6,8 @@ weight: 95 In this section we will look at the utilization of our EC2 Spot Instances while the application is running, and examine how many Spark executors are running. ### EMR Management Console -To get started, let's check that your EMR cluster and Spark application are running.\ -1. In our EMR Cluster page, the status of the cluster will either be Starting (in which case you can see the status of the hardware in the Summary or Hardware tabs) or Running.\ +To get started, let's check that your EMR cluster and Spark application are running. +1. In our EMR Cluster page, the status of the cluster will either be Starting (in which case you can see the status of the hardware in the Summary or Hardware tabs) or Running. 2. Move to the Steps tab, and your Spark application will either be Pending (for the cluster to start) or Running. 
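+
+If you prefer a terminal, a quick way to check the same state is with the AWS CLI (a small sketch — the cluster ID below is a placeholder, copy yours from the console):
+
+```
+# Cluster state: STARTING / RUNNING / WAITING / TERMINATED ...
+aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX --query 'Cluster.Status.State'
+
+# Step states: PENDING / RUNNING / COMPLETED ...
+aws emr list-steps --cluster-id j-XXXXXXXXXXXXX --query 'Steps[].{Name:Name,State:Status.State}'
+```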
{{% notice note %}} @@ -15,57 +15,57 @@ In this step, when you look at the utilization of the EMR cluster, do not expect {{% /notice %}} ### Using Ganglia, YARN ResourceManager and Spark History Server -The recommended approach to connect to the web interfaces running on our EMR cluster is to use SSH tunneling. [Click here] (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html) to learn more about connecting to EMR interfaces.\ +The recommended approach to connect to the web interfaces running on our EMR cluster is to use SSH tunneling. [Click here] (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html) to learn more about connecting to EMR interfaces. For the purpose of this workshop, and since we started our EMR cluster in a VPC public subnet, we can allow access in the EC2 Security Group in order to reach the TCP ports on which the web interfaces are listening on. {{% notice warning %}} Normally you would not run EMR in a public subnet and open TCP access to the master instance, this faster approach is just used for the purpose of the workshop. {{% /notice %}} -To allow access to your IP address to reach the EMR web interfaces via EC2 Security Groups:\ -1. In your EMR cluster page, in the AWS Management Console, go to the **Summary** tab\ -2. Click on the ID of the security under **Security groups for Master**\ -3. Check the Security Group with the name **ElasticMapReduce-master**\ -4. In the lower pane, click the **Inbound tab** and click the **Edit**\ -5. Click **Add Rule**. Under Type, select **All Traffic**, under Source, select **My IP**\ +To allow access to your IP address to reach the EMR web interfaces via EC2 Security Groups: +1. In your EMR cluster page, in the AWS Management Console, go to the **Summary** tab +2. Click on the ID of the security under **Security groups for Master** +3. Check the Security Group with the name **ElasticMapReduce-master** +4. In the lower pane, click the **Inbound tab** and click the **Edit** +5. Click **Add Rule**. Under Type, select **All Traffic**, under Source, select **My IP** 6. Click **Save** {{% notice note %}} While the Ganglia web interface uses TCP port 80, the YARN ResourceManager web interface uses TCP port 8088 and the Spark History Server uses TCP port 18080, which might not allowed for outbound traffic on your Internet connection. If you are using a network connection that blocks these ports (or in other words, doesn't allow non-well known ports) then you will not be able to reach the YARN ResourceManager web interface and Spark History Server. You can either skip that part of the workshop, use a different Internet connection (i.e mobile hotspot) or consider using the more complex method of SSH tunneling described [here] (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html) {{% /notice %}} -Go back to the Summary tab in your EMR cluster page, and you will see links to tools in the **Connections** section (you might need to refresh the page).\ -1. Click **Ganglia** to open the web interface.\ -2. Have a look around. Take notice of the heatmap (**Server Load Distribution**). Notable graphs are:\ -* **Cluster CPU last hour** - this will show you the CPU utilization that our Spark application consumed on our EMR cluster. you should see that utilization varied and reached around 70%.\ -* **Cluster Memory last hour** - this will show you how much memory we started the cluster with, and how much Spark actually consumed.\ -3. 
Go back to the Summary page and click the **Resource Manager** link.\ -4. On the left pane, click **Nodes**, and in the node table, you should see the number of containers that each node ran.\ -5. Go back to the Summary page and click the **Spark History Server** link.\ -6. Click on the App ID in the table (where App Name = Amazon reviews word count) and go to the **Executors** tab\ +Go back to the Summary tab in your EMR cluster page, and you will see links to tools in the **Connections** section (you might need to refresh the page). +1. Click **Ganglia** to open the web interface. +2. Have a look around. Take notice of the heatmap (**Server Load Distribution**). Notable graphs are: +* **Cluster CPU last hour** - this will show you the CPU utilization that our Spark application consumed on our EMR cluster. you should see that utilization varied and reached around 70%. +* **Cluster Memory last hour** - this will show you how much memory we started the cluster with, and how much Spark actually consumed. +3. Go back to the Summary page and click the **Resource Manager** link. +4. On the left pane, click **Nodes**, and in the node table, you should see the number of containers that each node ran. +5. Go back to the Summary page and click the **Spark History Server** link. +6. Click on the App ID in the table (where App Name = Amazon reviews word count) and go to the **Executors** tab 7. You can again see the number of executors that are running in your EMR cluster under the **Executors table** ### Using CloudWatch Metrics -EMR emits several useful metrics to CloudWatch metrics. You can use the AWS Management Console to look at the metrics in two ways:\ -1. In the EMR console, under the Monitoring tab in your Cluster's page\ +EMR emits several useful metrics to CloudWatch metrics. You can use the AWS Management Console to look at the metrics in two ways: +1. In the EMR console, under the Monitoring tab in your Cluster's page 2. By browsing to the CloudWatch service, and under Metrics, searching for the name of your cluster (copy it from the EMR Management Console) and clicking **EMR > Job Flow Metrics** {{% notice note %}} The metrics will take a few minutes to populate. {{% /notice %}} -Some notable metrics:\ -* **AppsRunning** - you should see 1 since we only submitted one step to the cluster.\ -* **ContainerAllocated** - this represents the number of containers that are running on your cluster, on the Core and Task Instance Fleets. These would the be Spark executors and the Spark Driver.\ -* **MemoryAllocatedMB** & **MemoryAvailableMB** - you can graph them both to see how much memory the cluster is actually consuming for the wordcount Spark application out of the memory that the instances have.\ +Some notable metrics: +* **AppsRunning** - you should see 1 since we only submitted one step to the cluster. +* **ContainerAllocated** - this represents the number of containers that are running on your cluster, on the Core and Task Instance Fleets. These would the be Spark executors and the Spark Driver. +* **MemoryAllocatedMB** & **MemoryAvailableMB** - you can graph them both to see how much memory the cluster is actually consuming for the wordcount Spark application out of the memory that the instances have. #### Terminate the cluster When you are done examining the cluster and using the different UIs, terminate the EMR cluster from the EMR management console. This is not the end of the workshop though - we still have some interesting steps to go. 
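+
+(You can also terminate it from the AWS CLI — a one-line sketch, assuming you substitute your own cluster ID:)
+
+```
+# Terminates the whole cluster, including the Core and Task fleet instances
+aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX
+```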
#### Number of executors in the cluster -With 40 Spot Units in the Task Instance Fleet, EMR launched either 10 * xlarge (running one executor) or 5 * 2xlarge instances (running 2 executors), so the Task Instance Fleet provides 10 executors / containers to the cluster.\ +With 32 Spot Units in the Task Instance Fleet, EMR launched either 8 * xlarge (running one executor) or 4 * 2xlarge instances (running 2 executors) or 2 * 4xlarge instances (running 4 executors), so the Task Instance Fleet provides 8 executors / containers to the cluster. The Core Instance Fleet launched one xlarge instance, able to run one executor. -{{%expand "Question: Did you see more than 11 containers in CloudWatch Metrics and in YARN ResourceManager? if so, do you know why? Click to expand the answer" %}} +{{%expand "Question: Did you see more than 9 containers in CloudWatch Metrics and in YARN ResourceManager? if so, do you know why? Click to expand the answer" %}} Your Spark application was configured to run in Cluster mode, meaning that the **Spark driver is running on the Core node**. Since it is counted as a container, this adds a container to our count, but it is not an executor. {{% /expand%}} diff --git a/content/running_spark_apps_with_emr_on_spot_instances/fleet_config_options.md b/content/running_spark_apps_with_emr_on_spot_instances/fleet_config_options.md index c758f7ee..d3d50f12 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/fleet_config_options.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/fleet_config_options.md @@ -11,11 +11,11 @@ While our cluster is starting (7-8 minutes) and the step is running (4-10 minute Since Amazon EC2 Spot Instances [changed the pricing model and bidding is no longer required] (https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/), we have an optional "Max-price" field for our Spot requests, which would limit how much we're willing to pay for the instance. It is recommended to leave this value at 100% of the On-Demand price, in order to avoid limiting our instance diversification. We are going to pay the Spot market price regardless of the Maximum price that we can specify, and setting a higher max price does not increase the chance of getting Spot capacity nor does it decrease the chance of getting your Spot Instances interrupted when EC2 needs the capacity back. You can see the current Spot price in the AWS Management Console under EC2 -> Spot Requests -> **Pricing History**. #### Each instance counts as X units -This configuration allows us to give each instance type in our diversified fleet a weight that will count towards our Total units. By default, this weight is configured as the number of YARN VCores that the instance type has by default (this would typically equate to the number of EC2 vCPUs) - this way it's easy to set the Total units to the number of vCPUs we want our cluster to run with, and EMR will select the best instances while taking into account the required number of instances to run. For example, if r4.xlarge is the instance type that EMR found to be the least likely to be interrupted and has the lowest price out of our selection, its weight is 4 and our total units (only Spot) is 40, then 10 * r4.xlarge instances will be launched by EMR in the fleet. 
-If my Spark application is memory driven, I can set the total units to the total amount of memory I want my cluster to run with, and change the "Each instance counts as" field to the total memory of the instance, leaving aside some memory for the operating system and other processes. For example, for the r4.xlarge I can set its weight to 25. If I then set up the Total units to 500 then EMR will bring up 20 * r4.xlrage instances in the fleet. Since our executor size is 18 GB, one executor will run on this instance type. +This configuration allows us to give each instance type in our diversified fleet a weight that will count towards our Total units. By default, this weight is configured as the number of YARN VCores that the instance type has by default (this would typically equate to the number of EC2 vCPUs) - this way it's easy to set the Total units to the number of vCPUs we want our cluster to run with, and EMR will select the best instances while taking into account the required number of instances to run. For example, if r4.xlarge is the instance type that EMR found to be the least likely to be interrupted, its weight is 4 and our total units (only Spot) is 32, then 8 * r4.xlarge instances will be launched by EMR in the fleet. +If my Spark application is memory driven, I can set the total units to the total amount of memory I want my cluster to run with, and change the "Each instance counts as" field to the total memory of the instance, leaving aside some memory for the operating system and other processes. For example, for the r4.xlarge I can set its weight to 25. If I then set up the Total units to 500 then EMR will bring up 20 * r4.xlarge instances in the fleet. Since our executor size is 18 GB, one executor will run on this instance type. #### Defined duration This option will allow you run your EMR Instance Fleet on Spot Blocks, which are uninterrupted Spot Instances, available for 1-6 hours, at a lower discount compared to Spot Instances. #### Provisioning timeout -You can determine that after a set amount of minutes, if EMR is unable to provision your selected Spot Instances due to lack of capacity, it will either start On-Demand instances instead, or terminate the cluster. This can be determined according to the business definition of the cluster or Spark application - if it is SLA bound and should complete even at On-Demand price, then the "Switch to On-Demand" option might be suitable. However, make sure you diversify the instance types in the fleet when looking to use Spot Instances, before you look into failing over to On-Demand. Also, try to select instance types with lower interruption rates according to the [Spot Instance Advisor] (https://aws.amazon.com/ec2/spot/instance-advisor/) \ No newline at end of file +You can determine that after a set amount of minutes, if EMR is unable to provision your selected Spot Instances due to lack of capacity, it will either start On-Demand instances instead, or terminate the cluster. This can be determined according to the business definition of the cluster or Spark application - if it is SLA bound and should complete even at On-Demand price, then the "Switch to On-Demand" option might be suitable. However, make sure you diversify the instance types in the fleet when looking to use Spot Instances, before you look into failing over to On-Demand. 
\ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/initial_event.md b/content/running_spark_apps_with_emr_on_spot_instances/initial_event.md index d128143b..bccfca7f 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/initial_event.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/initial_event.md @@ -5,7 +5,7 @@ disableToc: true hidden: true --- -Create an S3 bucket - we will use this for our Spark application code (which will be provided later) and the Spark application's results.\ +Create an S3 bucket - we will use this for our Spark application code (which will be provided later) and the Spark application's results. Refer to the **Create a Bucket** page in the [Amazon S3 Getting Started Guide] (https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) You don't need to create a VPC, as the workshop account already has a default VPC that we will use in this workshop. \ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/initial_ownaccount.md b/content/running_spark_apps_with_emr_on_spot_instances/initial_ownaccount.md index dd67c12f..370831b6 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/initial_ownaccount.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/initial_ownaccount.md @@ -5,15 +5,15 @@ disableToc: true hidden: true --- -1. Create an S3 bucket - we will use this for our Spark application code (which will be provided later) and the Spark application's results.\ +1. Create an S3 bucket - we will use this for our Spark application code (which will be provided later) and the Spark application's results. Refer to the **Create a Bucket** page in the [Amazon S3 Getting Started Guide] (https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) -2. Deploy a new VPC that will be used to run your EMR cluster in the workshop.\ -a. Open the ["Modular and Scalable VPC Architecture Quick stage page"] (https://aws.amazon.com/quickstart/architecture/vpc/) and go to the "How to deploy" tab, Click the ["Launch the Quick Start"] (https://fwd.aws/mm853) link.\ -b. Select your desired region to run the workshop from the top right corner of the AWS Management Console and click **Next**.\ -c. Provide a name for the stack or leave it as **Quick-Start-VPC**.\ -d. Under **Availability Zones**, select three availability zones from the list, and set the **Number of Availability Zones** to **3**.\ -e. Under **Create private subnets** select **false**.\ -f. click **Next** and again **Next** in the next screen.\ -g. Click **Create stack**.\ +2. Deploy a new VPC that will be used to run your EMR cluster in the workshop. +a. Open the ["Modular and Scalable VPC Architecture Quick stage page"] (https://aws.amazon.com/quickstart/architecture/vpc/) and go to the "How to deploy" tab, Click the ["Launch the Quick Start"] (https://fwd.aws/mm853) link. +b. Select your desired region to run the workshop from the top right corner of the AWS Management Console and click **Next**. +c. Provide a name for the stack or leave it as **Quick-Start-VPC**. +d. Under **Availability Zones**, select three availability zones from the list, and set the **Number of Availability Zones** to **3**. +e. Under **Create private subnets** select **false**. +f. click **Next** and again **Next** in the next screen. +g. Click **Create stack**. The stack creation should take under 2 minutes and the status of the stack will be **CREATE_COMPLETE**. 
\ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-1.md b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-1.md index fe0f4cff..3871810c 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-1.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-1.md @@ -8,15 +8,15 @@ In this step we'll launch our first cluster, which will run solely on Spot Insta Normally our dataset on S3 would be located in the same region where we are going to run our EMR clusters. In this workshop, it is fine if you are running EMR in a different region, and the Spark application will work against the dataset which is located in the N. Virginia region. This will be negligible in terms of price and performance. {{% /notice %}} -To launch the cluster, follow these steps:\ +To launch the cluster, follow these steps: -1. [Open the EMR console] (https://console.aws.amazon.com/elasticmapreduce/home) in the region where you are looking to launch your cluster.\ -1. Click "**Create Cluster**"\ -1. Click "**Go to advanced options**"\ -1. Select the latest EMR release, and in the list of components, only leave **Hadoop** checked and also check **Spark** and **Ganglia** (we will use it later to monitor our cluster)\ -1. Under "**Steps (Optional)**" -> Step type drop down menu, select "**Spark application**" and click **Add step**, then add the following details in the Add step dialog window:\ +1. [Open the EMR console] (https://console.aws.amazon.com/elasticmapreduce/home) in the region where you are looking to launch your cluster. +1. Click "**Create Cluster**" +1. Click "**Go to advanced options**" +1. Select the latest EMR release, and in the list of components, only leave **Hadoop** checked and also check **Spark** and **Ganglia** (we will use it later to monitor our cluster) +1. Under "**Steps (Optional)**" -> Step type drop down menu, select "**Spark application**" and click **Add step**, then add the following details in the Add step dialog window: -* **Spark-submit options**: here we will configure the memory and core count for each executor, as described in the previous section. Use these settings (make sure you have two '-' chars):\ +* **Spark-submit options**: here we will configure the memory and core count for each executor, as described in the previous section. Use these settings (make sure you have two '-' chars): ``` --executor-memory 18G --executor-cores 4 ``` @@ -35,7 +35,7 @@ exit() Then add the location of the file under the **Application location** field, i.e: s3://\/script.py -* **Arguments**: Here we will configure the location of where Spark will write the results of the job. Enter: s3://\/results/\ +* **Arguments**: Here we will configure the location of where Spark will write the results of the job. Enter: s3://\/results/ * **Action on failure**: Leave this on *Continue* and click **Add** to save the step. 
![sparksubmit](/images/running-emr-spark-apps-on-spot/sparksubmitstep1.png) diff --git a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md index 8b50d105..e3f8b807 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md @@ -13,25 +13,22 @@ The workshop focuses on running Spot Instances across all the cluster node types {{% /notice %}} #### **Master node**: -Unless your cluster is very short-lived and the runs are cost-driven, avoid running your Master node on a Spot Instance. We suggest this because a Spot interruption on the Master node terminates the entire cluster. \ -For the purpose of this workshop, we will run the Master node on a Spot Instance as we simulate a relatively short lived job running on a transient cluster. There will not be business impact if the job fails due to a Spot interruption and later re-started.\ -Click **Add / remove instance types to fleet** and select two relatively small and cheap instance types - i.e c4.large and m4.large and check Spot under target capacity. EMR will only provision one instance, but will select the best instance type for the Master node based on price and available capacity. +Unless your cluster is very short-lived and the runs are cost-driven, avoid running your Master node on a Spot Instance. We suggest this because a Spot interruption on the Master node terminates the entire cluster. +For the purpose of this workshop, we will run the Master node on a Spot Instance as we simulate a relatively short lived job running on a transient cluster. There will not be business impact if the job fails due to a Spot interruption and later re-started. +Click **Add / remove instance types to fleet** and select two relatively cheaper instance types - i.e c5.xlarge and m5.xlarge and check Spot under target capacity. EMR will only provision one instance, but will select the best instance type for the Master node from the Spot instance pools with the optimal capacity. ![FleetSelection1](/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png) #### **Core Instance Fleet**: -Avoid using Spot Instances for Core nodes if your Spark applications use HDFS. That prevents a situation where Spot interruptions cause data loss for data that was written to the HDFS volumes on the instances. For short-lived applications on transient clusters, as is the case in this workshop, we are going to run our Core nodes on Spot Instances.\ -When using EMR Instance Fleets, one Core node is mandatory. Since we want to scale out and run our Spark application on our Task nodes, let's stick to the one mandatory Core node. We will specify **4 Spot units**, and select instance types that count as 4 units and will allow to run one executor.\ -Under the core node type, Click **Add / remove instance types to fleet** and select instance types that have 4 vCPUs and enough memory to run an executor (given the 18G executor size), for example: +Avoid using Spot Instances for Core nodes if your Spark applications use HDFS. That prevents a situation where Spot interruptions cause data loss for data that was written to the HDFS volumes on the instances. For short-lived applications on transient clusters, as is the case in this workshop, we are going to run our Core nodes on Spot Instances. +When using EMR Instance Fleets, one Core node is mandatory. 
Since we want to scale out and run our Spark application on our Task nodes, let's stick to the one mandatory Core node. We will specify **4 Spot units**, and select instance types that count as 4 units and will allow to run one executor. +Under the core node type, click **Add / remove instance types to fleet** and select instance types that you noted before as suitable to run an executor (given the 18G executor size), for example: ![FleetSelection2](/images/running-emr-spark-apps-on-spot/emrinstancefleets-core1.png) #### **Task Instance Fleet**: Our task nodes will only run Spark executors and no HDFS DataNodes, so this is a great fit for scaling out and increasing the parallelization of our application's execution, to achieve faster execution times. -Under the task node type, Click **Add / remove instance types to fleet** and select the 5 instance types you noted before as suitable for our executor size and that had suitable interruption rates in the Spot Instance Advisor.\ -Since our executor size is 4 vCPUs, and each instance counts as the number of its vCPUs towards the total units, let's specify **40 Spot units** in order to run 10 executors, and allow EMR to select the best instance type in the Task Instance Fleet to run the executors on. In this example, it will either start 10 * r4.xlarge / r5.xlarge / i3.xlarge **or** 5 * r5.2xlarge / r4.2xlarge in EMR Task Instance Fleet. -{{% notice warning %}} -If you are using your own AWS account (not an account that was created for you in an AWS event), Keep reading: if your account is new, or you've never launched Spot Instances in the account, your ability to launch Spot Instances could be limited. To overcome this, please make sure you launch no more than 3 instances in the Task Instance Fleet. You can do this, for example, by only providing instances that count as 8 units, and specify 24 for Spot units.\ -{{% /notice %}} +Under the task node type, click **Add / remove instance types to fleet** and select **up to 15 instance types** you noted before as suitable for your executor size. +Since our executor size is 4 vCPUs, and each instance counts as the number of its vCPUs towards the total units, let's specify **32 Spot units** in order to run 8 executors, and allow EMR to select the best instance type in the Task Instance Fleet to run the executors on. ![FleetSelection3](/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png) diff --git a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-3.md b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-3.md index f5e4d283..32c77804 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-3.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-3.md @@ -3,9 +3,9 @@ title: "Launch a cluster - Steps 3&4" weight: 80 --- -Under "**Tags**", tag your instance with a recognizable Name tag so that you'll be able to see it later in the cost reports. For example:\ -Key=Name\ -Value (Optional)=EMRTransientCluster1\ +Under "**Tags**", tag your instance with a recognizable Name tag so that you'll be able to see it later in the cost reports. 
For example: +Key=Name +Value (Optional)=EMRTransientCluster1 ![tags](/images/running-emr-spark-apps-on-spot/emrtags.png) Click **Next** to go to **Step 4: Security** diff --git a/content/running_spark_apps_with_emr_on_spot_instances/prerequisites_notes.md b/content/running_spark_apps_with_emr_on_spot_instances/prerequisites_notes.md index a7fced03..e380937b 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/prerequisites_notes.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/prerequisites_notes.md @@ -2,13 +2,13 @@ title: "Prerequisites and initial steps" weight: 10 --- -#### General requirements and notes:\ +#### General requirements and notes: 1. This workshop is self-paced. The instructions will walk you through achieving the workshop's goal using the AWS Management Console. 2. While the workshop provides step by step instructions, **please do take a moment to look around and understand what is happening at each step** as this will enhance your learning experience. The workshop is meant as a getting started guide, but you will learn the most by digesting each of the steps and thinking about how they would apply in your own environment and in your own organization. You can even consider experimenting with the steps to challenge yourself. -#### Preparation steps:\ +#### Preparation steps: Select the correct tab, depending on where you are running the workshop: {{< tabs name="EventorOwnAccount" >}} {{< tab name="In your own account" include="initial_ownaccount.md" />}} diff --git a/content/running_spark_apps_with_emr_on_spot_instances/right_sizing_executors.md b/content/running_spark_apps_with_emr_on_spot_instances/right_sizing_executors.md index 95d6cd38..46a7beb7 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/right_sizing_executors.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/right_sizing_executors.md @@ -18,14 +18,14 @@ If we keep approximately the same vCPU:Mem ratio (1:4.5) for our job and avoid g EMR by default places limits on executor sizes in two different ways, this is in order to avoid having the executor consume too much memory and interfere with the operating system and other processes running on the instance. 1. [for each instance type differently] (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-task-config.html#emr-hadoop-task-jvm) in the default YARN configuration options. -Let's have a look at a few examples of instances that have our approximate vCPU:Mem ratio:\ -r4.xlarge: yarn.scheduler.maximum-allocation-mb 23424\ -r4.2xlarge: yarn.scheduler.maximum-allocation-mb 54272\ -r5.xlarge: yarn.scheduler.maximum-allocation-mb 24576\ +Let's have a look at a few examples of instances that have our approximate vCPU:Mem ratio: +r4.xlarge: yarn.scheduler.maximum-allocation-mb 23424 +r4.2xlarge: yarn.scheduler.maximum-allocation-mb 54272 +r5.xlarge: yarn.scheduler.maximum-allocation-mb 24576 2. With the Spark on YARN configuration option which was [introduced in EMR version 5.22] (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew-history.html#emr-5220-whatsnew): spark.yarn.executor.memoryOverheadFactor and defaults to 0.1875 (18.75% of the spark.yarn.executor.memoryOverhead setting ) -So we can conclude that if we decrease our executor size to ~18GB, we'll be able to use r4.xlarge and basically any of the R family instance types (or i3 that have the same vCPU:Mem ratio) as vCPU and Memory grows linearly within family sizes. 
If EMR will select an r4.2xlarge instance type from the list of supported instance types that we'll provide to EMR Instance Fleets, then it will run more than 1 executor on each instance, due to Spark dynamic allocation being enabled by default. +So we can conclude that if we decrease our executor size to ~18GB, we'll be able to use r4.xlarge and basically any of the R family instance types (`1:8 vCPU:Mem ratio`) as vCPU and Memory grows linearly within family sizes. If EMR will select an r4.2xlarge instance type from the list of supported instance types that we'll provide to EMR Instance Fleets, then it will run more than 1 executor on each instance, due to Spark dynamic allocation being enabled by default. ![tags](/images/running-emr-spark-apps-on-spot/sparkmemory.png) diff --git a/content/running_spark_apps_with_emr_on_spot_instances/selecting_instance_types.md b/content/running_spark_apps_with_emr_on_spot_instances/selecting_instance_types.md index 3d8497c0..f3ab1ad2 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/selecting_instance_types.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/selecting_instance_types.md @@ -3,40 +3,65 @@ title: "Selecting instance types" weight: 50 --- -Let's use our newly acquired knowledge around Spark executor sizing in order to select the EC2 Instance Types that will be used in our EMR cluster.\ -EMR clusters run Master, Core and Task node types. [Click here] (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html) to read more about the different node types. +Let's use our newly acquired knowledge around Spark executor sizing in order to select the EC2 Instance Types that will be used in our EMR cluster. We determined that in order to be flexible and allow running on multiple instance types, we will submit our Spark application with **"–executor-memory=18GB –executor-cores=4"**. -We determined that in order to be flexible and allow running on multiple instance types, we will submit our Spark application with **"–executor-memory=18GB –executor-cores=4"**, +To apply the instance diversification best practices while meeting the application constraints defined in the previous section, we can add different instances sizes from the current generation, such as R5 and R4. We can even include variants, such as R5d instance types (local NVMe-based SSDs) and R5a instance types (powered by AMD processors). -We can use the [Spot Instance Advisor] (https://aws.amazon.com/ec2/spot/instance-advisor/) page to find the relevant instance types with sufficient number of vCPUs and RAM, and use this opportunity to also select instance types with low interruption rates. \ -For example: r5.2xlarge has 8 vCPUs and 64 GB of RAM, so EMR will automatically run 2 executors that will consume 36 GB of RAM and still leave free RAM for the operating system and other processes.\ -However, at the time of writing, when looking at the EU (Ireland) region in the Spot Instance advisor, the r5.2xlarge instance type is showing an interruption rate of >20%.\ -Instead, we'll focus on instance types with lower interruption rates and suitable vCPU/Memory ratio. As an example, at the time of writing, in the EU (Ireland) region, these could be: r4.xlarge, r4.2xlarge, i3.xlarge, i3.2xlarge, r5d.xlarge +{{% notice info %}} +There are over 275 different instance types available on EC2 which can make the process of selecting appropriate instance types difficult. 
**[amazon-ec2-instance-selector](https://github.com/aws/amazon-ec2-instance-selector)** helps you select compatible instance types for your application to run on. The command line interface can be passed resource criteria like vCPUs, memory, network performance, and much more, and then returns the available, matching instance types. +{{% /notice %}} -![Spot Instance Advisor](/images/running-emr-spark-apps-on-spot/spotinstanceadvisor1.png) +We will use **amazon-ec2-instance-selector** to help us select the relevant instance +types with a sufficient number of vCPUs and amount of RAM. -{{% notice note %}} -Spot Instance interruption rates are dynamic, the above just provides a real world example from a specific time and would probably be different when you are performing this workshop. -{{% /notice %}} +Let's first install amazon-ec2-instance-selector in the Cloud9 IDE: + +``` +curl -Lo ec2-instance-selector https://github.com/aws/amazon-ec2-instance-selector/releases/download/v1.3.0/ec2-instance-selector-`uname | tr '[:upper:]' '[:lower:]'`-amd64 && chmod +x ec2-instance-selector +sudo mv ec2-instance-selector /usr/local/bin/ +ec2-instance-selector --version +``` -To keep our flexibility in place and be able to provide multiple instance types for our EMR cluster, we need to make sure that our executor size will be under the EMR YARN limitation that we saw in the previous step, +Now that you have ec2-instance-selector installed, you can run +`ec2-instance-selector --help` to understand how you can use it to select +instances that match your workload requirements. -**Your first task**: Find and take note of 5 instance types in the region where you have created your VPC to run your EMR cluster, which will allow running executors with at least 4 vCPUs and 30+ GB of RAM, and also have low Spot interruption rates (maximum 10-15%). +For the purpose of this workshop, we will select instances based on the following criteria: + * Instances that have a minimum of 4 vCPUs and a maximum of 16 vCPUs + * Instances with a vCPU:Memory ratio of 1:8, the same as the R instance family + * Instances with x86_64 CPU architecture and no GPUs + * Instances that belong to the current generation + * Exclude instance types that are not supported by EMR, such as R5n, R5ad and R5b, as well as the z1d, I and D instance families, which are priced higher than the R family. We do this by adding a deny list with the regular expression `.*n.*|.*ad.*|.*b.*|^[zid].*`. -{{%expand "Click here to see a hint for the task" %}} -Instance types with sufficient Memory and vCPUs for our executor size, as well as suitable for our desired vCPU:Mem ratio, and are also under the default memory EMR limitations:\ +{{% notice info %}} +[Click here] (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html) to see which instance types Amazon EMR supports. +{{% /notice %}} + +Run the following command with the above criteria to get the list of instances.
-**Recommended for the workshop:**\ -- r4.xlarge and larger\ -- r5.xlarge and larger\ -- r5a.xlarge and larger\ -- r5d.xlarge and larger\ -- i3.xlarge and larger\ +```bash +ec2-instance-selector --vcpus-min 4 --vcpus-max 16 --vcpus-to-memory-ratio 1:8 --cpu-architecture x86_64 --current-generation --gpus 0 --deny-list '.*n.*|.*ad.*|.*b.*|^[zid].*' +``` -**Previous generation instance types:**\ -- r3.xlarge and larger\ -- i2.xlarge and larger\ -you will notice that these instance types have double the vCores as they do vCPU, as reflected in the EMR instance selection window - this is an EMR optimization method. Feel free to use these as well, but note that the executor calculations that we're referring to in the workshop will differ. Also, these previous generation instance types will perform slower and the application will take longer to complete.\ -Also note that not all instance types exist in all regions. -{{% /expand%}} +Internally, ec2-instance-selector calls the [DescribeInstanceTypes](https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeInstanceTypes.html) API for the specific region and filters +the instances based on the criteria selected in the command line. The above command should display a list like the one that follows (results might differ depending on the region). We will use these instances as part of our EMR Core and Task Instance Fleets. +``` +r4.2xlarge +r4.4xlarge +r4.xlarge +r5.2xlarge +r5.4xlarge +r5.xlarge +r5a.2xlarge +r5a.4xlarge +r5a.xlarge +r5d.2xlarge +r5d.4xlarge +r5d.xlarge +``` + +{{% notice note %}} +You are encouraged to explore the options that `ec2-instance-selector` provides and run a few commands with it to familiarize yourself with the tool. +For example, try running the same command as before with the extra parameter **`--output table-wide`**. +{{% /notice %}} \ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/simulating_recovery.md b/content/running_spark_apps_with_emr_on_spot_instances/simulating_recovery.md index 70ada879..fa6fc8be 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/simulating_recovery.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/simulating_recovery.md @@ -15,7 +15,7 @@ This is an optional step that will take approximately 20 minutes more than the o [Click here] (https://aws.amazon.com/blogs/big-data/spark-enhancements-for-elasticity-and-resiliency-on-amazon-emr/) for an in-depth blog post about Spark's resiliency in EMR {{% /notice %}} -#### Step objectives:\ +#### Step objectives: 1. Observe that EMR replenishes the target capacity if some EC2 Instances fail, are terminated/stopped, or receive an EC2 Spot Interruption. 2. Observe that the Spark application running in your EMR cluster still completes successfully, despite losing executors due to instance terminations.
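You don't have to wait for a real Spot interruption to observe this behavior. One simple way to simulate losing Spot capacity, if you prefer the command line, is to terminate one of the cluster's task instances yourself and watch EMR replace it. The following is a minimal sketch only, assuming the AWS CLI is configured in your environment and that you substitute your own cluster ID for the `j-XXXXXXXXXXXXX` placeholder:

```bash
# List the EC2 instance IDs currently running in the cluster's Task Instance Fleet
# (replace j-XXXXXXXXXXXXX with your own cluster ID)
aws emr list-instances \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-fleet-type TASK \
  --instance-states RUNNING \
  --query 'Instances[].Ec2InstanceId' \
  --output text

# Terminate one of the returned instances to simulate lost Spot capacity,
# then watch the cluster's Hardware tab as the fleet replenishes the target capacity
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
```

The instance ID above is a placeholder; whichever instance you terminate, the Spark application should keep running and the fleet should return to its target capacity shortly afterwards.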
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/tracking_spot_interruptions.md b/content/running_spark_apps_with_emr_on_spot_instances/tracking_spot_interruptions.md index eeb72d06..1c8b5d15 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/tracking_spot_interruptions.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/tracking_spot_interruptions.md @@ -13,7 +13,7 @@ In most cases, when running fault-tolerant workloads, we don't really need to tr Let's set up an email notification for when Spot interruptions occur, so if there are any failures in our EMR applications, we'll be able to check if the failures correlate to a Spot interruption. #### Creating an SNS topic for the notifications -1. Create a new SNS topic and subscribe to the topic with your email address\ +1. Create a new SNS topic and subscribe to the topic with your email address (if you prefer the AWS CLI, a minimal sketch of these two steps is included at the end of this document). For guidance, you can follow steps #1 & #2 in the [Amazon SNS getting started guide] (https://docs.aws.amazon.com/sns/latest/dg/sns-getting-started.html) 1. You will receive an email with the subject "AWS Notification - Subscription Confirmation". Click the "**Confirm subscription**" link in the email in order to allow SNS to send email to the endpoint (your email). diff --git a/static/images/running-emr-spark-apps-on-spot/c9after.png b/static/images/running-emr-spark-apps-on-spot/c9after.png new file mode 100644 index 00000000..d1298ba6 Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/c9after.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/c9before.png b/static/images/running-emr-spark-apps-on-spot/c9before.png new file mode 100644 index 00000000..aea8466c Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/c9before.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/cliexport.png b/static/images/running-emr-spark-apps-on-spot/cliexport.png index 13394af8..f0570fd5 100644 Binary files a/static/images/running-emr-spark-apps-on-spot/cliexport.png and b/static/images/running-emr-spark-apps-on-spot/cliexport.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png index a74c2331..b76554ec 100644 Binary files a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png and b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png index eb7c99fd..645e6af3 100644 Binary files a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png and b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/emrinstancefleetsnetwork.png b/static/images/running-emr-spark-apps-on-spot/emrinstancefleetsnetwork.png index 307deb99..36df46f2 100644 Binary files a/static/images/running-emr-spark-apps-on-spot/emrinstancefleetsnetwork.png and b/static/images/running-emr-spark-apps-on-spot/emrinstancefleetsnetwork.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/savingssummary.png b/static/images/running-emr-spark-apps-on-spot/savingssummary.png index a0253598..b4f98494 100644 Binary files a/static/images/running-emr-spark-apps-on-spot/savingssummary.png and b/static/images/running-emr-spark-apps-on-spot/savingssummary.png
differ diff --git a/static/images/running-emr-spark-apps-on-spot/spotinstanceadvisor1.png b/static/images/running-emr-spark-apps-on-spot/spotinstanceadvisor1.png deleted file mode 100644 index 8dce7a91..00000000 Binary files a/static/images/running-emr-spark-apps-on-spot/spotinstanceadvisor1.png and /dev/null differ
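As a companion to the SNS notification steps above: if you would rather create the topic and the email subscription from the AWS CLI instead of the console, here is a minimal sketch, assuming the CLI is configured and that the topic name, ARN and email address below are placeholders you replace with your own values:

```bash
# Create the SNS topic that will receive the Spot interruption notifications
aws sns create-topic --name EMRSpotInterruptionNotifications

# Subscribe your email address to the topic, using the TopicArn returned by the previous command
# (the ARN and the email address below are placeholders)
aws sns subscribe \
  --topic-arn arn:aws:sns:eu-west-1:111122223333:EMRSpotInterruptionNotifications \
  --protocol email \
  --notification-endpoint you@example.com
```

You still need to confirm the subscription from the email that SNS sends, exactly as described in the console-based steps.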