diff --git a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md
index dcf62bd1..09fd1f23 100644
--- a/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md
+++ b/content/running_spark_apps_with_emr_on_spot_instances/launching_emr_cluster-2.md
@@ -13,15 +13,13 @@ The workshop focuses on running Spot Instances across all the cluster node types
 {{% /notice %}}
 #### **Master node**:
-Unless your cluster is very short-lived and the runs are cost-driven, avoid running your Master node on a Spot Instance. We suggest this because a Spot interruption on the Master node terminates the entire cluster.
-For the purpose of this workshop, we will run the Master node on a Spot Instance as we simulate a relatively short lived job running on a transient cluster. There will not be business impact if the job fails due to a Spot interruption and later re-started.
-Click **Add / remove instance types to fleet** and select two relatively cheaper instance types - i.e c5.xlarge and m5.xlarge and check Spot under target capacity. EMR will only provision one instance, but will select the best instance type for the Master node from the Spot instance pools with the optimal capacity.
+Unless your cluster is very short-lived and the runs are cost-driven, avoid running your Master node on a Spot Instance. We suggest this because a Spot interruption on the Master node terminates the entire cluster. For the purpose of this workshop, we will run the Master node on an EC2 On-Demand Instance, but you can use an EC2 Spot Instance for relatively short-lived jobs running on a transient cluster, where a job failure due to a Spot interruption has no business impact and the job can be restarted later. Click **Add / remove instance types to fleet** and select two relatively cheaper instance types (e.g. c5.xlarge and m5.xlarge), and check Spot under target capacity. EMR will only provision one instance, but will select the best instance type for the Master node from the Spot instance pools with the optimal capacity.
 ![FleetSelection1](/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png)
 #### **Core Instance Fleet**:
-Avoid using Spot Instances for Core nodes if your Spark applications use HDFS. That prevents a situation where Spot interruptions cause data loss for data that was written to the HDFS volumes on the instances. For short-lived applications on transient clusters, as is the case in this workshop, we are going to run our Core nodes on Spot Instances.
-When using EMR Instance Fleets, one Core node is mandatory. Since we want to scale out and run our Spark application on our Task nodes, let's stick to the one mandatory Core node. We will specify **4 Spot units**, and select instance types that count as 4 units and will allow to run one executor.
+Avoid using Spot Instances for Core nodes if your Spark applications use HDFS. That prevents a situation where Spot interruptions cause data loss for data that was written to the HDFS volumes on the instances. We will also use On-Demand Instances for Core nodes. You can use Spot Instances for Core nodes with transient clusters.
+When using EMR Instance Fleets, one Core node is mandatory. Since we want to scale out and run our Spark application on our Task nodes, let's stick to the one mandatory Core node. We will specify **4 On-Demand units**, and select instance types that count as 4 units and allow running one executor.
 Under the core node type, click **Add / remove instance types to fleet** and select instance types that you noted before as suitable to run an executor (given the 18G executor size), for example:
 ![FleetSelection2](/images/running-emr-spark-apps-on-spot/emrinstancefleets-core1.png)
@@ -43,10 +41,10 @@ EMR Managed Scaling is supported for Apache Spark, Apache Hive and YARN-based wo
 {{% /notice %}}
 1. Select the checkbox for **Enable Cluster Scaling** in **Cluster scaling** section.
-1. Set **MinimumCapacityUnits** to **36**, which includes core node capacity units plus capacity units for 8 task executors.
-1. Set **MaximumCapacityUnits** to **68**, keeping same capacity units for core nodes but allowing scaling for task nodes.
-1. Set **MaximumOnDemandCapacityUnits** to **0**, use EC2 Spot instances only for both Code Nodes and Task Nodes.
-1. Set **MaximumCoreCapacityUnits** to **4**, keeping same capacity units for core nodes.
+1. Set **MinimumCapacityUnits** to **36**, the minimum allowed EC2 capacity in the cluster, which includes core nodes (4 units) and task nodes (8 task executors * 4 units).
+1. Set **MaximumCapacityUnits** to **72**, the maximum allowed EC2 capacity in the cluster, to allow scaling of core and task nodes.
+1. Set **MaximumOnDemandCapacityUnits** to **8**, allowing core nodes to scale using On-Demand Instances.
+1. Set **MaximumCoreCapacityUnits** to **8**, allowing core nodes to scale from 4 to 8 units.
 ![emrmanagedscaling](/images/running-emr-spark-apps-on-spot/emrmanagedscaling.png)
 {{% notice note %}}
diff --git a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-core1.png b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-core1.png
index 113ae4a3..008ce23a 100644
Binary files a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-core1.png and b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-core1.png differ
diff --git a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png
index b76554ec..becf9328 100644
Binary files a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png and b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-master.png differ
diff --git a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png
index 645e6af3..9ab52ba7 100644
Binary files a/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png and b/static/images/running-emr-spark-apps-on-spot/emrinstancefleets-task2.png differ
diff --git a/static/images/running-emr-spark-apps-on-spot/emrmanagedscaling.png b/static/images/running-emr-spark-apps-on-spot/emrmanagedscaling.png
index 45892c89..28e1aea7 100644
Binary files a/static/images/running-emr-spark-apps-on-spot/emrmanagedscaling.png and b/static/images/running-emr-spark-apps-on-spot/emrmanagedscaling.png differ
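
Review note: the new Managed Scaling values in this change (36, 72, 8, 8) can be sanity-checked with a few lines of Python. This is only a sketch under stated assumptions: the helper `compute_limits` and the constant `UNITS_PER_EXECUTOR` are hypothetical names introduced here, and the code assumes every instance type in the fleets is weighted as 4 units, as the changed text describes. The resulting dictionary follows the `ComputeLimits` shape used by the EMR managed scaling policy.

```python
# Sanity check for the Managed Scaling limits proposed in this change.
# Assumption: every instance type in the core and task fleets is weighted
# as 4 capacity units and hosts one Spark executor.
UNITS_PER_EXECUTOR = 4


def compute_limits(core_units, min_task_executors, max_total_units, max_core_units):
    """Build a ComputeLimits mapping for an instance-fleet cluster (hypothetical helper)."""
    # Minimum capacity = core units + units for the minimum number of task executors.
    min_units = core_units + min_task_executors * UNITS_PER_EXECUTOR
    if not core_units <= max_core_units <= max_total_units:
        raise ValueError("core limits must fit inside the cluster maximum")
    if min_units > max_total_units:
        raise ValueError("minimum capacity exceeds maximum capacity")
    return {
        "UnitType": "InstanceFleetUnits",
        "MinimumCapacityUnits": min_units,
        "MaximumCapacityUnits": max_total_units,
        # Core nodes run On-Demand in this change, so the On-Demand cap
        # is set equal to the core cap.
        "MaximumOnDemandCapacityUnits": max_core_units,
        "MaximumCoreCapacityUnits": max_core_units,
    }


limits = compute_limits(core_units=4, min_task_executors=8,
                        max_total_units=72, max_core_units=8)
policy = {"ComputeLimits": limits}
print(policy)
```

If the same limits were applied outside the console, a dictionary of this shape could be passed as `ManagedScalingPolicy` to boto3's `put_managed_scaling_policy` together with a real `ClusterId`; that call is not made here because it needs AWS credentials and a running cluster.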