diff --git a/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md b/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md
index b3c782e0..3c8c487d 100644
--- a/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md
+++ b/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md
@@ -1,6 +1,6 @@
 ---
 title: "Examining the cluster"
-weight: 95
+weight: 90
 ---
 
 In this section we will look at the utilization of our EC2 Spot Instances while the application is running, and examine how many Spark executors are running.
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md
index 39e103bf..627a8383 100644
--- a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md
+++ b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md
@@ -3,7 +3,7 @@
 title: "Launch a cluster - Step 2"
 weight: 70
 ---
 
-Under "**Instance group configuration**", select Instance Fleets. Under Network, select the VPC that you deployed using the CloudFormation template earlier in the workshop (or the default VPC if you're running the workshop in an AWS event), and select all subnets in the VPC. When you select multiple subnets, the EMR cluster will still be started in a single Availability Zone, but EMR Instance Fleets will make the best instance type selection based on available capacity and price across the multiple availability zones that you specified. Aslo, click on the checkbox "Apply allocation strategy" to leverage lowest-price allocation for On-Demand Instances and Capacity-Optimized allocation for Spot Instances; this will also allow you to configure up to 15 instance types on the Task Instance fleet.
+Under "**Instance group configuration**", select Instance Fleets. Under Network, select the VPC that you deployed using the CloudFormation template earlier in the workshop (or the default VPC if you're running the workshop at an AWS event), and select all subnets in the VPC. When you select multiple subnets, the EMR cluster will still be launched in a single Availability Zone, but EMR Instance Fleets will make the best instance type selection based on available capacity and price across the Availability Zones that you specified. Also, select the "Apply allocation strategy" checkbox to leverage lowest-price allocation for On-Demand Instances and capacity-optimized allocation for Spot Instances; this will also allow you to configure up to 15 instance types for the Task Instance fleet.
 
 ![FleetSelection1](/images/running-emr-spark-apps-on-spot/emrinstancefleetsnetwork.png)
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/scaling_emr_cluster.md b/content/running_spark_apps_with_emr_on_spot_instances/scaling_emr_cluster.md
new file mode 100644
index 00000000..a1141ed6
--- /dev/null
+++ b/content/running_spark_apps_with_emr_on_spot_instances/scaling_emr_cluster.md
@@ -0,0 +1,56 @@
+---
+title: "Scaling EMR cluster"
+weight: 95
+---
+
+While you can always manually adjust the number of core or task nodes (EC2 instances) in your Amazon EMR cluster, you can also use EMR auto scaling to adjust the cluster size automatically in response to changing workloads, without any manual intervention.
+
+In this section, we are going to enable automatic scaling for the cluster using **[Amazon EMR Managed Scaling](https://aws.amazon.com/blogs/big-data/introducing-amazon-emr-managed-scaling-automatically-resize-clusters-to-lower-cost/)**. With EMR Managed Scaling, you specify minimum and maximum compute limits for your cluster, and Amazon EMR automatically resizes it for best performance and resource utilization, constantly monitoring key workload metrics to optimize the cluster size.
+
+{{% notice note %}}
+EMR Managed Scaling is supported for Apache Spark, Apache Hive and YARN-based workloads on Amazon EMR versions 5.30.1 and above.
+{{% /notice %}}
+
+### Enable Managed Scaling
+
+1. In your EMR cluster page, in the AWS Management Console, go to the **Summary** tab.
+1. Copy the cluster **ID** from the **Summary** tab.
+1. Open a shell terminal in the Cloud9 environment that you created at the beginning of this workshop.
+1. Run the following command after replacing **CLUSTER-ID** with the ID you copied earlier.
+
+```bash
+aws emr put-managed-scaling-policy \
+    --cluster-id CLUSTER-ID \
+    --managed-scaling-policy "ComputeLimits={
+       UnitType=InstanceFleetUnits,
+       MinimumCapacityUnits=8,
+       MaximumCapacityUnits=16,
+       MaximumOnDemandCapacityUnits=0,
+       MaximumCoreCapacityUnits=8
+    }"
+```
+
+In the command above, we have set:
+
+* the minimum and maximum capacity for the worker nodes to 8 and 16 units respectively.
+* **MaximumOnDemandCapacityUnits** to 0, so that the cluster only uses EC2 Spot Instances.
+* **MaximumCoreCapacityUnits** to 8, to allow scaling of the core nodes.
+
+
+### Managed Scaling in Action
+
+Now we will submit more jobs to the cluster to trigger scaling.
+
+1. In your EMR cluster page, in the AWS Management Console, go to the **Steps** tab.
+1. Select the Spark application that you created during cluster creation and click **Clone step**.
+![jobCloning](/images/running-emr-spark-apps-on-spot/emrsparkjobcloning.png)
+1. On the next screen, click **Add**. Wait a few moments until the step is in the **Running** state.
+1. Go to the **Events** tab to see the scaling events.
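+The `put-managed-scaling-policy` command above can also be scripted so you can review the exact command before running it. A minimal sketch, where `j-XXXXXXXXXXXXX` is a placeholder for your cluster ID and the actual AWS calls are left commented out for review:
+
+```bash
+# Placeholder -- replace with the cluster ID copied from the Summary tab.
+CLUSTER_ID="j-XXXXXXXXXXXXX"
+
+# The same ComputeLimits as in the command above, kept in one variable.
+POLICY="ComputeLimits={UnitType=InstanceFleetUnits,MinimumCapacityUnits=8,MaximumCapacityUnits=16,MaximumOnDemandCapacityUnits=0,MaximumCoreCapacityUnits=8}"
+
+# Print the full command so you can review it before running it.
+echo aws emr put-managed-scaling-policy --cluster-id "$CLUSTER_ID" --managed-scaling-policy "$POLICY"
+
+# Apply the policy (uncomment once the command above looks right):
+# aws emr put-managed-scaling-policy --cluster-id "$CLUSTER_ID" --managed-scaling-policy "$POLICY"
+
+# Read the policy back to confirm it was applied:
+# aws emr get-managed-scaling-policy --cluster-id "$CLUSTER_ID"
+```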
+![scalingEvent](/images/running-emr-spark-apps-on-spot/emrsparkscalingevent.png)
+
+With this configuration, your EMR cluster will automatically scale based on the compute requirements of the job, adding EC2 Spot Instances while staying within the configured limits.
+
+{{% notice note %}}
+Managed Scaling now also has the capability to prevent scaling down instances that store intermediate shuffle data for Apache Spark. Intelligently scaling down clusters without removing the instances that store intermediate shuffle data prevents job re-attempts and re-computations, which leads to better performance and lower cost.
+**[Click here](https://aws.amazon.com/about-aws/whats-new/2022/03/amazon-emr-managed-scaling-shuffle-data-aware/)** for more details.
+{{% /notice %}}
\ No newline at end of file
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/selecting_instance_types.md b/content/running_spark_apps_with_emr_on_spot_instances/selecting_instance_types.md
index e84e7dc9..06d6d199 100644
--- a/content/running_spark_apps_with_emr_on_spot_instances/selecting_instance_types.md
+++ b/content/running_spark_apps_with_emr_on_spot_instances/selecting_instance_types.md
@@ -58,7 +58,10 @@
 r5a.4xlarge
 r5a.xlarge
 r5d.2xlarge
 r5d.4xlarge
-r5d.xlarge
+r5d.xlarge
+r6i.2xlarge
+r6i.4xlarge
+r6i.xlarge
 ```
 
 {{% notice note %}}
diff --git a/static/images/running-emr-spark-apps-on-spot/emrsparkjobcloning.png b/static/images/running-emr-spark-apps-on-spot/emrsparkjobcloning.png
new file mode 100644
index 00000000..8609970b
Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/emrsparkjobcloning.png differ
diff --git a/static/images/running-emr-spark-apps-on-spot/emrsparkscalingevent.png b/static/images/running-emr-spark-apps-on-spot/emrsparkscalingevent.png
new file mode 100644
index 00000000..be0211d2
Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/emrsparkscalingevent.png differ