awslabs · ranshn · Jul 3, 2019 · Jun 17, 2019 · Jun 17, 2019 · Jun 18, 2019
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/_index.md b/content/running_spark_apps_with_emr_on_spot_instances/_index.md
@@ -2,31 +2,24 @@
 title: "Running Spark apps with EMR on Spot Instances"
 date: 2019-01-24T09:05:54Z
 weight: 60
-draft: true
 pre: "<b>⁃ </b>"
 ---
 
-## This workshop is still in draft! ping [email protected] for any concerns.
-
 ## Overview
 
-In this workshop you will assume the role of a data engineer, tasked with building a platform that will allow your organization to run data processing jobs, specifically Apache Spark applications. 
-
-The requirements for the platform are:
+Welcome! In this workshop you will assume the role of a data engineer, tasked with optimizing your organization's costs for running Spark applications, using Amazon EMR and EC2 Spot Instances.\
 
-1. Use a managed service - in order to avoid the heavy lifting of installing, maintaining and upgrading compute clusters that run Apache Hadoop framework software, mainly Spark.
-2. Be secure - allow network level isolation and encryption at rest and in transit.
-3. Be cost optimized - use Amazon EC2 Spot Instances, as well as easily run transient clusters (that will be spun up just to run a job and then spun down) where possible in order to cost optimize.
-4. Decouple compute and storage - in order to allow to elastically scale your processing power independently from having to provision more storage for your clusters. 
-5. Be self-healing in order to decrease operations overhead - if a compute node fails, the cluster will automatically replace it and continue running the job.
+The **estimated time** for completing the workshop is 60-90 minutes and the **estimated cost** for running the workshop's resources in your AWS account is less than $2.\
+The **learning objective** for the workshop is to become familiar with the best practices and tooling that are available to you for cost optimizing your EMR clusters running Spark applications, using Spot Instances.
 
-
-## The decision is simple - <span style="color:#ff9900">***Amazon EMR***</span> fulfills all the requirements. 
+## Recap - Amazon EMR and EC2 Spot Instances
 
 * [Amazon EMR] (https://aws.amazon.com/emr/) provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run other popular distributed frameworks such as [Apache Spark] (https://aws.amazon.com/emr/details/spark/), [HBase] (https://aws.amazon.com/emr/details/hbase/), [Presto] (https://aws.amazon.com/emr/details/presto/), and [Flink] (https://aws.amazon.com/blogs/big-data/use-apache-flink-on-amazon-emr/) in EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB. EMR Notebooks, based on the popular Jupyter Notebook, provide a development and collaboration environment for ad hoc querying and exploratory analysis.
   EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.
 
 * [Amazon EC2 Spot Instances] (https://aws.amazon.com/ec2/spot/) offer spare compute capacity available in the AWS Cloud at steep discounts compared to On-Demand prices. EC2 can interrupt Spot Instances with two minutes of notification when EC2 needs the capacity back. You can use Spot Instances for various fault-tolerant and flexible applications. Some examples are analytics, containerized workloads, high-performance computing (HPC), stateless web servers, rendering, CI/CD, and other test and development workloads.
 
-### About Spot Instances in Analytics workloads
-The most important best practice when using Spot Instances is to be flexible with the EC2 instance types that our application can run on, in order to be able to access many spare capacity pools, as well as get our desired capacity from a different instance type in case some of our Spot capacity in the EMR cluster is interrupted, when EC2 needs the spare capacity back. It's possible to run Spark applications in a single cluster that is running on multiple different instance types, we'll just need to right-size our executors and use the EMR Instance Fleets configuration option in order to meet the Spot diversification best practice.
+## About Spot Instances in Analytics workloads
+The most important best practice when using Spot Instances is to be flexible with the EC2 instance types that our application can run on. This gives us access to many spare capacity pools (a capacity pool is a combination of an EC2 instance type and an Availability Zone), and lets us regain our desired capacity from a different instance type if some of the Spot capacity in the EMR cluster is interrupted when EC2 needs the spare capacity back.\
+It's possible to run Spark applications in a single cluster that runs on multiple different instance types; we just need to right-size our executors and use the EMR Instance Fleets configuration option in order to meet the Spot diversification best practice. We'll look into that in detail during this workshop.
+
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md b/content/running_spark_apps_with_emr_on_spot_instances/analyzing_costs.md
@@ -0,0 +1,37 @@
+---
+title: "Analyzing costs"
+weight: 145
+---
+
+In this section we will use AWS Cost Explorer to look at the costs of our EMR cluster, including the underlying EC2 Spot Instances.
+{{% notice note %}}
+It will take 24-48 hours for your usage to appear in Cost Explorer, so you can plan to come back to this step later to check the costs of running the workshop. If your organization administrator has not granted you access to Billing information, then you will not be able to access Cost Explorer, but you can look at the examples provided below.
+{{% /notice %}}
+
+In Step 4 of the EMR cluster launch, we tagged the cluster with the following Tag: Key=**Name**, Value=**EMRTransientCluster1**. This tag can be used to identify resources in your AWS accounts, and can also be used to identify the costs associated with usage in case the tag Key has been enabled as a Cost Allocation Tag. [Click here] (https://aws.amazon.com/answers/account-management/aws-tagging-strategies/) to learn more about tagging in AWS.
+
+
+### Analyzing costs with AWS Cost Explorer
+[AWS Cost Explorer] (https://aws.amazon.com/aws-cost-management/aws-cost-explorer/) has an easy-to-use interface that lets you visualize, understand, and manage your AWS costs and usage over time. You can analyze cost and usage data both at a high level (e.g. how much did I pay for EMR?) and for highly specific requests (e.g. the cost for a specific instance type in a specific account with a specific tag).
+
+{{% notice note %}}
+If the Name tag Key was not enabled as a Cost Allocation Tag, you will not be able to filter/group according to it in Cost Explorer, but you can still gather data like cost for the EMR service, instance types, etc.
+{{% /notice %}}
+
+
+Let's use Cost Explorer to analyze the costs of running our EMR application.\
+1. Navigate to Cost Explorer by opening the AWS Management Console -> click your username in the top right corner -> click **My Billing Dashboard** -> click **Cost Explorer** in the left pane. Alternatively, [click here] (https://console.aws.amazon.com/billing/home#/costexplorer) for a direct link.\
+2. We know that we gave our EMR cluster a unique Name tag, so let's filter according to it. In the right pane, click Tags -> Name -> enter "**EMRTransientCluster1**"\
+3. Instead of the default 45-day view, let's narrow down the time span to just the day when we ran the cluster. In the date selection dropdown, mark that day as both the start and the end.\
+4. You are now looking at the total cost to run the cluster (**$0.30**), including: EMR, EC2, EBS, and possible AWS Cross-Region data transfer costs, depending on where you ran your cluster relative to where the S3 dataset is located (in N. Virginia).\
+5. Group by **Usage Type** to get a breakdown of the costs
+
+![costexplorer](/images/running-emr-spark-apps-on-spot/costexplorer1.png)
+
+* EU-SpotUsage:r5.xlarge: This was the instance type that ran in the EMR Task Instance fleet and accrued the largest cost, since EMR launched 10 instances ($0.17)\
+* EU-BoxUsage:r5.xlarge: The EMR costs. [Click here] (https://aws.amazon.com/emr/pricing/) to learn more about EMR pricing. ($0.06)\
+* EU-EBS:VolumeUsage.gp2: EBS volumes that were attached to my EC2 Instances in the cluster - these got tagged automatically. ($0.03)\
+* EU-SpotUsage:r5a.xlarge & EU-SpotUsage:m4.xlarge: EC2 Spot price for the other instances in my cluster (Master and Core) ($0.02 combined)\
+
+If you have access to Cost Explorer, have a look around and see what you can find by slicing and dicing with filtering and grouping. For example, what happens if you filter by **Purchase Option = Spot** & **Group by = Instance Type**?
+
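The Cost Explorer steps in this file (filter by the **Name** tag, narrow to a single day, group by **Usage Type**) can also be approximated through the Cost Explorer API. The following is a minimal sketch rather than part of the workshop: it assumes `boto3` is installed, that your credentials allow `ce:GetCostAndUsage`, and that the Name tag key has been enabled as a Cost Allocation Tag. The function names and the example date are hypothetical.

```python
from datetime import date, timedelta


def build_cost_query(run_day: date, cluster_name: str) -> dict:
    """Build GetCostAndUsage parameters for one day, filtered by the Name tag."""
    return {
        "TimePeriod": {
            "Start": run_day.isoformat(),
            # The End date is exclusive, so one full day needs End = Start + 1 day.
            "End": (run_day + timedelta(days=1)).isoformat(),
        },
        "Granularity": "DAILY",
        "Metrics": ["UnblendedCost"],
        "Filter": {"Tags": {"Key": "Name", "Values": [cluster_name]}},
        "GroupBy": [{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    }


def fetch_cost_by_usage_type(run_day: date, cluster_name: str) -> dict:
    """Call Cost Explorer and return {usage_type: cost}. Pagination is ignored here."""
    import boto3  # imported lazily so the query can be built without AWS access

    ce = boto3.client("ce")
    response = ce.get_cost_and_usage(**build_cost_query(run_day, cluster_name))
    costs = {}
    for period in response["ResultsByTime"]:
        for group in period["Groups"]:
            usage_type = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            costs[usage_type] = costs.get(usage_type, 0.0) + amount
    return costs


if __name__ == "__main__":
    # Hypothetical run date; replace it with the day you ran the cluster.
    print(build_cost_query(date(2019, 6, 17), "EMRTransientCluster1"))
```

As with the console, usage may take 24-48 hours to appear, so an empty result shortly after running the cluster is expected.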
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md b/content/running_spark_apps_with_emr_on_spot_instances/automations_monitoring.md
@@ -1,7 +1,6 @@
 ---
 title: "Automations and monitoring"
 weight: 110
-draft: true
 ---
 
 When adopting EMR into your analytics flows and data processing pipelines, you will want to launch EMR clusters and run jobs in a programmatic manner. There are many ways to do so with AWS SDKs that can run in different environments like Lambda Functions, invoked by AWS Data Pipeline or AWS Step Functions, with third party tools like Apache Airflow, and more. \
@@ -12,17 +11,17 @@ In this section we will simply look at a CLI command that can be used to start a
 1. In the AWS Management Console, under the EMR service, go to your cluster, and click the **AWS CLI export** button.
 2. Find the --instance-fleets parameter, and copy the contents of the parameter including the brackets:
 ![cliexport](/images/running-emr-spark-apps-on-spot/cliexport.png)
-3. Paste the data into a JSON validator like [JSON Lint] (https://jsonlint.com/) and vlaidate the JSON file. this will make it easy to see the Instance Fleets configuration we configured in the console, in a JSON format, that can be re-used when you launch your cluster programmatically. 
+3. Paste the data into a JSON validator like [JSON Lint] (https://jsonlint.com/) and validate it. This makes it easy to see the Instance Fleets configuration that we set up in the console in JSON format, which can be reused when you launch your cluster programmatically.
 
 #### (Optional) Set up CloudWatch Events for Cluster and/or Step failures
 Much like we set up a CloudWatch Event rule for EC2 Spot Interruptions to be sent to our email via an SNS notification, we can also set up rules to send out notifications or perform automations when an EMR cluster fails to start, or a Task on the cluster fails. This is useful for monitoring purposes.
 
-In this example, let's set up a notification for when our EMR step failed.
+In this example, let's set up a notification for when an EMR step fails.\
 1. In the AWS Management Console, go to Cloudwatch -> Events -> Rules and click **Create Rule**.\
-2. Under Service Name select EMR, and under Event Type select State Change.\
-3. Check **Specific detail type(s) and from the dropdown menu, select **EMR Step Status Change**\
-4. Check Specific states(s) and from the dropdown menu, select **FAILED**.\
+2. Under Service Name select EMR, and under Event Type select **State Change**.\
+3. Check **Specific detail type(s)** and from the dropdown menu, select **EMR Step Status Change**\
+4. Check **Specific state(s)** and from the dropdown menu, select **FAILED**.\
 ![cwemrstep](/images/running-emr-spark-apps-on-spot/emrstatechangecwevent.png)
-5. In the targets menu, click **Add target**, select SNS topic and from the dropdown menu, select the SNS topic you created and click **Configure details**.\
-6. Provide a name for the rule and click **Create rule**\
+5. In the targets menu, click **Add target**, select **SNS topic** and from the dropdown menu, select the SNS topic you created and click **Configure details**.\
+6. Provide a name for the rule and click **Create rule**.\
 7. You can test that the rule works by following the same steps to start a cluster, but providing a bad parameter when submitting the step, for example - a non existing location for the Spark application or results bucket.
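The console steps for the step-failure rule can be sketched programmatically as well. This is an illustration under stated assumptions, not the workshop's own method: it assumes `boto3` is installed, that the SNS topic already exists, and that the topic's access policy allows CloudWatch Events to publish to it (the console sets this up for you; the sketch does not). The function names, rule name and target ID are hypothetical.

```python
import json


def emr_step_failed_pattern() -> dict:
    """Event pattern matching EMR step state changes that end in FAILED."""
    return {
        "source": ["aws.emr"],
        "detail-type": ["EMR Step Status Change"],
        "detail": {"state": ["FAILED"]},
    }


def create_step_failure_rule(rule_name: str, sns_topic_arn: str) -> None:
    """Create the rule and point it at an existing SNS topic."""
    import boto3  # imported lazily so the pattern can be inspected without AWS access

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(emr_step_failed_pattern()),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        # "sns-notification" is an arbitrary, hypothetical target ID.
        Targets=[{"Id": "sns-notification", "Arn": sns_topic_arn}],
    )


if __name__ == "__main__":
    print(json.dumps(emr_step_failed_pattern(), indent=2))
```

Printing the pattern first and comparing it with the **Event Pattern Preview** in the console is a low-risk way to check that the two match before creating anything.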
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md b/content/running_spark_apps_with_emr_on_spot_instances/conclusions_and_cleanup.md
@@ -1,19 +1,18 @@
 ---
 title: "Conclusions and cleanup"
 weight: 150
-draft: true
 ---
 
-**Congratulations!** you have reached the end of the workshop. In this workshop, you learned about the need to be flexible with EC2 instance types when using Spot Instances, and how to size your Spark executors to allow for this flexibility. You ran a Spark application solely on Spot Instances using EMR Instance Fleets, verified the results of the application, and saw the cost savings that you achieved by running the application on Spot Instances.
+**Congratulations!** You have reached the end of the workshop. In this workshop, you learned about the need to be flexible with EC2 instance types when using Spot Instances, and how to size your Spark executors to allow for this flexibility. You ran a Spark application solely on Spot Instances using EMR Instance Fleets, you verified the results of the application, and you saw the cost savings that you achieved by running the application on Spot Instances.
 
 #### Cleanup
 
-1. Our EMR cluster has already been terminated after the Spark application we submitted finished running. Just to be on the safe side, you can visit the EMR console and check that the cluster is in the **Terminated** state.
+1. Our EMR cluster should already have been terminated after the Spark application we submitted finished running. Just to be on the safe side, or if you didn't use **Auto-terminate cluster after the last step is completed**, you can visit the EMR console and check that the cluster is in the **Terminated** state. If it isn't, you can terminate it from the console.
 2. Delete the VPC you deployed via CloudFormation, by going to the CloudFormation service in the AWS Management Console, selecting the VPC stack (default name is Quick-Start-VPC) and click the Delete option. Make sure that the deletion has completed successfully (this should take around 1 minute), the status of the stack will be DELETE_COMPLETE (the stack will move to the Deleted list of stacks).
 3. Delete your S3 bucket from the AWS Management Console - choose the bucket from the list of buckets and hit the Delete button. This approach will also empty the bucket and delete all existing objects in the bucket.
 4. Delete the Athena table by going to the Athena service in the AWS Management Console, find the **emrworkshopresults** Athena table, click the three dots icon next to the table and select **Delete table**.
 
 #### Thank you
 
-We hope this workshop was educational, and that it will help you adopt Spot Instances into your Spark applications running on Amazon EMR in order to optimize your costs.\
+We hope you found this workshop educational, and that it will help you adopt Spot Instances into your Spark applications running on Amazon EMR, in order to optimize your costs.\
 If you have any feedback or questions, click the "**Feedback / Questions?**" link in the left pane to reach out to the authors of the workshop.
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/emr_instance_fleets.md b/content/running_spark_apps_with_emr_on_spot_instances/emr_instance_fleets.md
@@ -1,12 +1,18 @@
 ---
 title: "EMR Instance Fleets"
 weight: 30
-draft: true
 ---
 
-With EMR instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet (Master, Core, Task). When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. You can specify up to five EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets. You can also select multiple subnets for different Availability Zones.
+When adopting Spot Instances into your workload, it is recommended to be flexible about the Availability Zones and instance types that your workload can launch in. This flexibility lets you achieve the required scale from multiple Spot capacity pools (a capacity pool is a combination of an EC2 instance type and an Availability Zone) or from one capacity pool that has sufficient capacity. It also decreases the impact on your workload if some of the Spot capacity is interrupted with a 2-minute notice when EC2 needs the capacity back, and allows EMR to replenish the capacity with a different instance type.
+
+With EMR instance fleets, you specify target capacities for On-Demand Instances and Spot Instances within each fleet (Master, Core, Task). When the cluster launches, Amazon EMR provisions instances until the targets are fulfilled. You can specify up to five EC2 instance types per fleet for Amazon EMR to use when fulfilling the targets. You can also select multiple subnets for different Availability Zones.\
+
+{{% notice info %}}
+[Click here] (https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html) to learn more about EMR Instance Fleets in the official documentation.
+{{% /notice %}}
 
 **When Amazon EMR launches the cluster, it looks across those subnets to find the instances and purchasing options you specify, and will select the Spot Instances with the lowest chance of getting interrupted, for the lowest cost.**
 
 
-While a cluster is running, if Amazon EC2 reclaims a Spot Instance or if an instance fails, Amazon EMR tries to replace the instance with any of the instance types that you specify in your fleet. This makes it easier to regain capacity in case some of the instances get interrupted by EC2 when it needs the Spot capacity back.
+While a cluster is running, if Amazon EC2 reclaims a Spot Instance or if an instance fails, Amazon EMR tries to replace the instance with any of the instance types that you specify in your fleet. This makes it easier to regain capacity in case some of the instances get interrupted by EC2 when it needs the Spot capacity back.\
+These options do not exist within the default EMR configuration option "Uniform Instance Groups", hence we will be using EMR Instance Fleets only.
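To make the Instance Fleets description in this file more concrete, here is a minimal sketch of a Spot-only Task fleet definition, in the shape used by the `InstanceFleets` list of the EMR `RunJobFlow` API. It is an illustration only: the function name, fleet name, weights and target capacity are hypothetical, and the five-type limit is taken from the text of this page.

```python
from typing import Dict

MAX_INSTANCE_TYPES_PER_FLEET = 5  # limit as stated in the workshop text


def build_task_fleet(weighted_types: Dict[str, int], target_spot_capacity: int) -> dict:
    """Build a Spot-only Task fleet from {instance_type: weighted_capacity}."""
    if not weighted_types:
        raise ValueError("at least one instance type is required")
    if len(weighted_types) > MAX_INSTANCE_TYPES_PER_FLEET:
        raise ValueError(
            f"at most {MAX_INSTANCE_TYPES_PER_FLEET} instance types per fleet"
        )
    return {
        "Name": "TaskFleet",  # hypothetical fleet name
        "InstanceFleetType": "TASK",
        "TargetSpotCapacity": target_spot_capacity,
        "InstanceTypeConfigs": [
            {"InstanceType": instance_type, "WeightedCapacity": weight}
            for instance_type, weight in weighted_types.items()
        ],
    }


if __name__ == "__main__":
    # Hypothetical weights: each instance type counts as 4 units towards the target.
    fleet = build_task_fleet({"r5.xlarge": 4, "r5a.xlarge": 4, "m4.xlarge": 4}, 40)
    print(fleet)
```

Listing several instance types with the same weight is what gives EMR room to fulfil the target from whichever capacity pools are available, and to replace interrupted instances with a different type.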
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/emr_uniform_groups.md b/content/running_spark_apps_with_emr_on_spot_instances/emr_uniform_groups.md