
Merge pull request #6 from jagpk/EMR-AllocationStrategies
EMR Allocation Strategies with Instance Selector
jagpk authored Dec 15, 2020
2 parents a2f01cd + 6a917a2 commit 4a695ad
Showing 17 changed files with 148 additions and 49 deletions.
@@ -6,8 +6,7 @@ hidden: true
---

1. In the EMR Management Console, check that the cluster is in the **Terminated** state. If it isn't, then you can terminate it from the console.
2. Delete the VPC you deployed via CloudFormation by going to the CloudFormation service in the AWS Management Console, selecting the VPC stack (the default name is Quick-Start-VPC) and clicking the Delete option. Make sure that the deletion completes successfully (this should take around 1 minute); the status of the stack will be DELETE_COMPLETE (the stack will move to the Deleted list of stacks).
3. Delete your S3 bucket from the AWS Management Console - choose the bucket from the list of buckets and hit the Delete button. This approach will also empty the bucket and delete all existing objects in the bucket.
4. Delete the Athena table by going to the Athena service in the AWS Management Console, find the **emrworkshopresults** Athena table, click the three dots icon next to the table and select **Delete table**.


2. Go to the [Cloud9 Dashboard](https://console.aws.amazon.com/cloud9/home) and delete your environment.
3. Delete the VPC you deployed via CloudFormation by going to the CloudFormation service in the AWS Management Console, selecting the VPC stack (the default name is Quick-Start-VPC) and clicking the Delete option. Make sure that the deletion completes successfully (this should take around 1 minute); the status of the stack will be DELETE_COMPLETE (the stack will move to the Deleted list of stacks).
4. Delete your S3 bucket from the AWS Management Console - choose the bucket from the list of buckets and hit the Delete button. This approach will also empty the bucket and delete all existing objects in the bucket.
5. Delete the Athena table by going to the Athena service in the AWS Management Console: find the **emrworkshopresults** Athena table, click the three-dots icon next to the table and select **Delete table**. (An equivalent CLI-based cleanup is sketched below.)
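
If you prefer to script the cleanup, the same steps can be run from the AWS CLI. This is a hedged sketch: IDs and bucket names are placeholders, the stack name is the workshop default, and the Glue database name is an assumption (Athena tables are stored in the Glue Data Catalog).

```
# Cleanup sketch - replace the <...> placeholders with your own resource identifiers.
aws emr terminate-clusters --cluster-ids <your-cluster-id>
aws cloud9 delete-environment --environment-id <your-environment-id>
aws cloudformation delete-stack --stack-name Quick-Start-VPC
aws s3 rb s3://<your-bucket-name> --force              # --force empties the bucket before deleting it
aws glue delete-table --database-name default --name emrworkshopresults   # assumes the table lives in the "default" database
```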
@@ -0,0 +1,28 @@
---
title: "Update to the latest AWS CLI"
chapter: false
weight: 20
comment: default install now includes aws-cli/1.15.83
---

{{% notice tip %}}
For this workshop, please ignore warnings about the version of pip being used.
{{% /notice %}}

1. Run the following command to view the current version of aws-cli:
```
aws --version
```

1. Update to the latest version:
```
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
. ~/.bash_profile
```

1. Confirm you have a newer version:
```
aws --version
```
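
The output should now report a 2.x build of the CLI. The exact string depends on when you run the workshop, but it will look roughly like this (illustrative example):
```
aws-cli/2.1.15 Python/3.7.3 Linux/4.14.209-160.339.amzn2.x86_64 exe/x86_64.amzn.2 prompt/off
```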
@@ -0,0 +1,35 @@
---
title: "Create a Workspace"
chapter: false
weight: 15
---

{{% notice warning %}}
If you are running the workshop on your own, the Cloud9 workspace should be built by an IAM user with Administrator privileges, not the root account user. Please ensure you are logged in as an IAM user, not the root
account user.
{{% /notice %}}

{{% notice info %}}
If you are at an AWS-hosted event, follow the event instructions regarding which region to use when launching resources.
{{% /notice %}}

{{% notice tip %}}
Ad blockers, JavaScript disablers, and tracking blockers should be disabled for
the Cloud9 domain, or connecting to the workspace might be impacted.
Cloud9 requires third-party cookies. You can whitelist the [specific domains](https://docs.aws.amazon.com/cloud9/latest/user-guide/troubleshooting.html#troubleshooting-env-loading).
{{% /notice %}}

### Launch Cloud9:

- Go to [Cloud9 Console](https://console.aws.amazon.com/cloud9/home)
- Select **Create environment**
- Name it **emrworkshop**, and take all other defaults
- When it comes up, customize the environment by closing the **welcome tab**
and **lower work area**, and opening a new **terminal** tab in the main work area:
![c9before](/images/running-emr-spark-apps-on-spot/c9before.png)

- Your workspace should now look like this:
![c9after](/images/running-emr-spark-apps-on-spot/c9after.png)

- If you like this theme, you can choose it yourself by selecting **View / Themes / Solarized / Solarized Dark**
in the Cloud9 workspace menu.
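
The console steps above are all this workshop needs, but if you prefer the command line, a Cloud9 environment can also be created with the AWS CLI. This is a sketch under assumptions - the instance type is illustrative, and the `--image-id` value and whether it is required depend on your CLI version:

```
# Creates a Cloud9 EC2 environment named "emrworkshop" (illustrative parameters)
aws cloud9 create-environment-ec2 \
  --name emrworkshop \
  --instance-type t3.small \
  --image-id amazonlinux-2-x86_64   # value is an assumption; see `aws cloud9 create-environment-ec2 help` for valid image IDs
```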
@@ -0,0 +1,19 @@
---
title: "EMR Allocation Strategies"
weight: 35
---

As an enhancement to the default EMR instance fleets cluster configuration, the allocation strategy feature is available in EMR version 5.12.1 and later. It optimizes the allocation of instance fleet capacity and lets you choose a target strategy for each cluster node type.

* On-Demand instances use a lowest-price strategy, which launches the lowest-priced instances first.
* Spot instances use a capacity-optimized strategy, which launches Spot instances from Spot instance pools that have optimal capacity for the number of instances that are launching.

{{% notice note %}}
The allocation strategy option also lets you specify up to 15 EC2 instance types for the task instance fleet when creating your cluster, as opposed to the maximum of 5 allowed by the default EMR instance fleet configuration.
{{% /notice %}}

The capacity-optimized allocation strategy for Spot instances uses real-time capacity data to allocate instances from the Spot instance pools with the optimal capacity for the number of instances that are launching. This allocation strategy is appropriate for workloads that have a higher cost of interruption. Examples include long-running jobs and multi-tenant persistent clusters running Apache Spark, Apache Hive, and Presto. This allocation strategy lets you specify up to 15 EC2 instance types on task instance fleets to diversify your Spot requests and get steep discounts. Previously, instance fleets allowed a maximum of five instance types. You can now diversify your Spot requests across these 15 pools within each Availability Zone and prioritize deploying into a deeper capacity pool to lower the chance of interruptions. With more instance type diversification, Amazon EMR has more capacity pools to allocate capacity from, and chooses the Spot Instances which are least likely to be interrupted.
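
For readers creating the cluster from the AWS CLI rather than the console, the strategies are requested per instance fleet under `LaunchSpecifications`. The sketch below is illustrative only - the release label, instance types and weights are examples, and the MASTER and CORE fleets a real cluster also needs are omitted for brevity:

```
# Illustrative sketch - not the exact workshop command. MASTER and CORE instance fleets omitted.
aws emr create-cluster \
  --name "emr-spot-workshop" \
  --release-label emr-5.30.1 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-fleets '[
    {
      "InstanceFleetType": "TASK",
      "TargetSpotCapacity": 32,
      "InstanceTypeConfigs": [
        {"InstanceType": "r4.xlarge",  "WeightedCapacity": 4},
        {"InstanceType": "r5.xlarge",  "WeightedCapacity": 4},
        {"InstanceType": "r4.2xlarge", "WeightedCapacity": 8}
      ],
      "LaunchSpecifications": {
        "OnDemandSpecification": {"AllocationStrategy": "lowest-price"},
        "SpotSpecification": {
          "TimeoutDurationMinutes": 10,
          "TimeoutAction": "SWITCH_TO_ON_DEMAND",
          "AllocationStrategy": "capacity-optimized"
        }
      }
    }
  ]'
```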

{{% notice info %}}
[Click here](https://aws.amazon.com/blogs/big-data/optimizing-amazon-emr-for-resilience-and-cost-with-capacity-optimized-spot-instances/) for an in-depth blog post about the capacity-optimized allocation strategy for Amazon EMR instance fleets.
{{% /notice %}}
@@ -64,8 +64,8 @@ Some notable metrics:\
When you are done examining the cluster and using the different UIs, terminate the EMR cluster from the EMR management console. This is not the end of the workshop though - we still have some interesting steps to go.

#### Number of executors in the cluster
With 40 Spot Units in the Task Instance Fleet, EMR launched either 10 * xlarge (running one executor) or 5 * 2xlarge instances (running 2 executors), so the Task Instance Fleet provides 10 executors / containers to the cluster.\
With 32 Spot Units in the Task Instance Fleet, EMR launched either 8 * xlarge (running one executor) or 4 * 2xlarge instances (running 2 executors) or 2 * 4xlarge instances (running 4 executors), so the Task Instance Fleet provides 8 executors / containers to the cluster.\
The Core Instance Fleet launched one xlarge instance, able to run one executor.
{{%expand "Question: Did you see more than 11 containers in CloudWatch Metrics and in YARN ResourceManager? if so, do you know why? Click to expand the answer" %}}
{{%expand "Question: Did you see more than 9 containers in CloudWatch Metrics and in YARN ResourceManager? if so, do you know why? Click to expand the answer" %}}
Your Spark application was configured to run in Cluster mode, meaning that the **Spark driver is running on the Core node**. Since it is counted as a container, this adds a container to our count, but it is not an executor.
{{% /expand%}}
@@ -11,11 +11,11 @@ While our cluster is starting (7-8 minutes) and the step is running (4-10 minute
Since Amazon EC2 Spot Instances [changed the pricing model and bidding is no longer required](https://aws.amazon.com/blogs/compute/new-amazon-ec2-spot-pricing/), we have an optional "Max-price" field for our Spot requests, which would limit how much we're willing to pay for the instance. It is recommended to leave this value at 100% of the On-Demand price, in order to avoid limiting our instance diversification. We are going to pay the Spot market price regardless of the maximum price that we specify, and setting a higher max price neither increases the chance of getting Spot capacity nor decreases the chance of getting your Spot Instances interrupted when EC2 needs the capacity back. You can see the current Spot price in the AWS Management Console under EC2 -> Spot Requests -> **Pricing History**.

#### Each instance counts as X units
This configuration allows us to give each instance type in our diversified fleet a weight that will count towards our Total units. By default, this weight is configured as the number of YARN VCores that the instance type has by default (this would typically equate to the number of EC2 vCPUs) - this way it's easy to set the Total units to the number of vCPUs we want our cluster to run with, and EMR will select the best instances while taking into account the required number of instances to run. For example, if r4.xlarge is the instance type that EMR found to be the least likely to be interrupted and has the lowest price out of our selection, its weight is 4 and our total units (only Spot) is 40, then 10 * r4.xlarge instances will be launched by EMR in the fleet.
If my Spark application is memory driven, I can set the total units to the total amount of memory I want my cluster to run with, and change the "Each instance counts as" field to the total memory of the instance, leaving aside some memory for the operating system and other processes. For example, for the r4.xlarge I can set its weight to 25. If I then set up the Total units to 500 then EMR will bring up 20 * r4.xlrage instances in the fleet. Since our executor size is 18 GB, one executor will run on this instance type.
This configuration allows us to give each instance type in our diversified fleet a weight that counts towards our Total units. By default, this weight is configured as the number of YARN vCores that the instance type has (which typically equates to the number of EC2 vCPUs) - this way it's easy to set the Total units to the number of vCPUs we want our cluster to run with, and EMR will select the best instances while taking into account the required number of instances to run. For example, if r4.xlarge is the instance type that EMR found to be the least likely to be interrupted, its weight is 4 and our total units (only Spot) is 32, then 8 * r4.xlarge instances will be launched by EMR in the fleet.
If my Spark application is memory driven, I can set the total units to the total amount of memory I want my cluster to run with, and change the "Each instance counts as" field to the total memory of the instance, leaving aside some memory for the operating system and other processes. For example, for the r4.xlarge I can set its weight to 25. If I then set the Total units to 500, EMR will bring up 20 * r4.xlarge instances in the fleet. Since our executor size is 18 GB, one executor will run on this instance type.
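
Expressed as an instance fleet fragment, the memory-driven weighting described above could look roughly like this (the weights are illustrative - approximately the instance memory in GB minus some headroom for the OS and YARN overhead):
```
"TargetSpotCapacity": 500,
"InstanceTypeConfigs": [
  {"InstanceType": "r4.xlarge",  "WeightedCapacity": 25},
  {"InstanceType": "r4.2xlarge", "WeightedCapacity": 55}
]
```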

#### Defined duration
This option allows you to run your EMR Instance Fleet on Spot Blocks, which are uninterrupted Spot Instances, available for 1-6 hours, at a lower discount compared to Spot Instances.

#### Provisioning timeout
You can determine that after a set amount of minutes, if EMR is unable to provision your selected Spot Instances due to lack of capacity, it will either start On-Demand instances instead, or terminate the cluster. This can be determined according to the business definition of the cluster or Spark application - if it is SLA bound and should complete even at On-Demand price, then the "Switch to On-Demand" option might be suitable. However, make sure you diversify the instance types in the fleet when looking to use Spot Instances, before you look into failing over to On-Demand. Also, try to select instance types with lower interruption rates according to the [Spot Instance Advisor] (https://aws.amazon.com/ec2/spot/instance-advisor/)
You can specify that, if EMR is unable to provision your selected Spot Instances within a set number of minutes due to lack of capacity, it will either start On-Demand instances instead or terminate the cluster. This can be decided according to the business definition of the cluster or Spark application - if it is SLA-bound and should complete even at On-Demand price, then the "Switch to On-Demand" option might be suitable. However, make sure you diversify the instance types in the fleet when looking to use Spot Instances, before you look into failing over to On-Demand.
@@ -15,23 +15,20 @@ The workshop focuses on running Spot Instances across all the cluster node types
#### **Master node**:
Unless your cluster is very short-lived and the runs are cost-driven, avoid running your Master node on a Spot Instance. We suggest this because a Spot interruption on the Master node terminates the entire cluster. \
For the purpose of this workshop, we will run the Master node on a Spot Instance as we simulate a relatively short-lived job running on a transient cluster. There will be no business impact if the job fails due to a Spot interruption and is later restarted.\
Click **Add / remove instance types to fleet** and select two relatively small and cheap instance types - i.e c4.large and m4.large and check Spot under target capacity. EMR will only provision one instance, but will select the best instance type for the Master node based on price and available capacity.
Click **Add / remove instance types to fleet** and select two relatively cheap instance types - e.g. c5.xlarge and m5.xlarge - and check Spot under target capacity. EMR will only provision one instance, but will select the best instance type for the Master node from the Spot instance pools with the optimal capacity.
![FleetSelection1](/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png)


#### **Core Instance Fleet**:
Avoid using Spot Instances for Core nodes if your Spark applications use HDFS. That prevents a situation where Spot interruptions cause data loss for data that was written to the HDFS volumes on the instances. For short-lived applications on transient clusters, as is the case in this workshop, we are going to run our Core nodes on Spot Instances.\
When using EMR Instance Fleets, one Core node is mandatory. Since we want to scale out and run our Spark application on our Task nodes, let's stick to the one mandatory Core node. We will specify **4 Spot units**, and select instance types that count as 4 units and can run one executor.\
Under the core node type, Click **Add / remove instance types to fleet** and select instance types that have 4 vCPUs and enough memory to run an executor (given the 18G executor size), for example:
Under the core node type, click **Add / remove instance types to fleet** and select instance types that you noted before as suitable to run an executor (given the 18G executor size), for example:
![FleetSelection2](/images/running-emr-spark-apps-on-spot/emrinstancefleets-core1.png)

#### **Task Instance Fleet**:
Our task nodes will only run Spark executors and no HDFS DataNodes, so this is a great fit for scaling out and increasing the parallelization of our application's execution, to achieve faster execution times.
Under the task node type, Click **Add / remove instance types to fleet** and select the 5 instance types you noted before as suitable for our executor size and that had suitable interruption rates in the Spot Instance Advisor.\
Since our executor size is 4 vCPUs, and each instance counts as the number of its vCPUs towards the total units, let's specify **40 Spot units** in order to run 10 executors, and allow EMR to select the best instance type in the Task Instance Fleet to run the executors on. In this example, it will either start 10 * r4.xlarge / r5.xlarge / i3.xlarge **or** 5 * r5.2xlarge / r4.2xlarge in EMR Task Instance Fleet.
{{% notice warning %}}
If you are using your own AWS account (not an account that was created for you in an AWS event), Keep reading: if your account is new, or you've never launched Spot Instances in the account, your ability to launch Spot Instances could be limited. To overcome this, please make sure you launch no more than 3 instances in the Task Instance Fleet. You can do this, for example, by only providing instances that count as 8 units, and specify 24 for Spot units.\
{{% /notice %}}
Under the task node type, click **Add / remove instance types to fleet** and select up to 15 instance types you noted before as suitable for our executor size.\
Since our executor size is 4 vCPUs, and each instance counts as the number of its vCPUs towards the total units, let's specify **32 Spot units** in order to run 8 executors, and allow EMR to select the best instance type in the Task Instance Fleet to run the executors on.

![FleetSelection3](/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png)
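
One way to shortlist candidate instance types for the fleet is the [amazon-ec2-instance-selector](https://github.com/aws/amazon-ec2-instance-selector) tool referenced in this commit's title. The command below is a sketch - flag names and units can differ between tool versions, so check `ec2-instance-selector --help`:

```
# Find up to 15 x86_64 instance types with 4 vCPUs and ~32 GiB of memory (enough for an 18 GB executor)
ec2-instance-selector --vcpus 4 --memory 32 --cpu-architecture x86_64 --max-results 15
```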

@@ -25,7 +25,7 @@ r5.xlarge: yarn.scheduler.maximum-allocation-mb 24576\
2. With the Spark on YARN configuration option [introduced in EMR version 5.22](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew-history.html#emr-5220-whatsnew): spark.yarn.executor.memoryOverheadFactor, which defaults to 0.1875 (i.e. the memory overhead is calculated as 18.75% of the executor memory)


So we can conclude that if we decrease our executor size to ~18GB, we'll be able to use r4.xlarge and basically any of the R family instance types (or i3 that have the same vCPU:Mem ratio) as vCPU and Memory grows linearly within family sizes. If EMR will select an r4.2xlarge instance type from the list of supported instance types that we'll provide to EMR Instance Fleets, then it will run more than 1 executor on each instance, due to Spark dynamic allocation being enabled by default.
So we can conclude that if we decrease our executor size to ~18GB, we'll be able to use r4.xlarge and basically any of the R family instance types, as vCPU and memory grow linearly within a family's sizes. If EMR selects an r4.2xlarge instance type from the list of supported instance types that we provide to EMR Instance Fleets, it will run more than one executor on each instance, due to Spark dynamic allocation being enabled by default.
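
As a quick sanity check of the ~18GB figure, using the 0.1875 overhead factor and the 24576 MB maximum allocation noted above:
```
executor memory:              18 GB
memory overhead (18.75%):     18 GB * 0.1875 ≈ 3.4 GB
total YARN container size:    ≈ 21.4 GB  ->  fits under yarn.scheduler.maximum-allocation-mb (24576 MB ≈ 24 GB)
```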

![tags](/images/running-emr-spark-apps-on-spot/sparkmemory.png)
