diff --git a/content/running_spark_apps_with_emr_on_spot_instances/cloud9-awscli.md b/content/running_spark_apps_with_emr_on_spot_instances/cloud9-awscli.md index 80810716..a6427cb9 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/cloud9-awscli.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/cloud9-awscli.md @@ -10,26 +10,32 @@ For this workshop, please ignore warnings about the version of pip being used. {{% /notice %}} 1. Uninstall the AWS CLI 1.x by running: -```bash -sudo pip uninstall -y awscli -``` + ```bash + sudo pip uninstall -y awscli + ``` 1. Install the AWS CLI 2.x by running the following command: -``` -curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" -unzip awscliv2.zip -sudo ./aws/install -. ~/.bash_profile -``` + ``` + curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" + unzip awscliv2.zip + sudo ./aws/install + . ~/.bash_profile + ``` 1. Confirm you have a newer version: -``` -aws --version -``` + ``` + aws --version + ``` 1. Create an SSH Key pair so you can then SSH into the EMR cluster -```bash -aws ec2 create-key-pair --key-name emr-workshop-key-pair --query "KeyMaterial" --output text > emr-workshop-key-pair.pem -chmod 400 emr-workshop-key-pair.pem -``` + ```bash + aws ec2 create-key-pair --key-name emr-workshop-key-pair --query "KeyMaterial" --output text > emr-workshop-key-pair.pem + chmod 400 emr-workshop-key-pair.pem + ``` + +1. 
Install jq + + ```bash + sudo yum -y install jq + ``` \ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md b/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md index 96a5be75..eb2190c9 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/examining_cluster.md @@ -3,36 +3,38 @@ title: "Examining the cluster" weight: 90 --- -In this section we will look at the utilization of our EC2 Spot Instances while the application is running, and examine how many Spark executors are running. +In this section you will look at the utilization of instance fleets and examine Spark executors while the Spark application is running. ### EMR Management Console To get started, let's check that your EMR cluster and Spark application are running. -1. In our EMR Cluster page, the status of the cluster will either be Starting (in which case you can see the status of the hardware in the Summary or Hardware tabs) or Running. -2. Move to the Steps tab, and your Spark application will either be Pending (for the cluster to start) or Running. +1. In our EMR Cluster page, the status of the cluster will either be **Starting** or **Running**. If the status is **Starting**, you can see the status of the instance fleets in the Hardware tab while you wait for the cluster to reach the **Running** stage. +2. Move to the Steps tab; the Spark application will either be **Pending** or **Running**. If the status is **Pending**, wait for the Spark application to reach the **Running** stage. -{{% notice note %}} -In this step, when you look at the utilization of the EMR cluster, do not expect to see full utilization of vCPUs and Memory on the EC2 instances, as the wordcount Spark application we are running is not very resource intensive and is just used for demo purposes. 
-{{% /notice %}} - -### Using Ganglia, YARN ResourceManager and Spark History Server -To connect to the web interfaces running on our EMR cluster you need to use SSH tunneling. [Click here](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html) to learn more about connecting to EMR interfaces. +### EMR On-cluster application user interfaces +To connect to the application user interfaces running on our EMR cluster you need to use SSH tunneling. [Click here](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html) to learn more about connecting to EMR interfaces. First, we need to grant SSH access from the Cloud9 environment to the EMR cluster master node: 1. In your EMR cluster page, in the AWS Management Console, go to the **Summary** tab -2. Click on the ID of the security group in **Security groups for Master** -3. Check the Security Group with the name **ElasticMapReduce-master** -4. In the lower pane, click the **Inbound tab** and click the **Edit inbound rules** -5. Click **Add Rule**. Under Type, select **SSH**, under Source, select **Custom**. As the Cloud9 environment and the EMR cluster are on the default VPC, introduce the CIDR of your Default VPC (e.g. 172.16.0.0/16). To check your VPC CIDR, go to the [VPC console](https://console.aws.amazon.com/vpc/home?#) and look for the CIDR of the **Default VPC**. -6. Click **Save** +1. Click on the ID of the security group in **Security groups for Master** +1. Check the Security Group with the name **ElasticMapReduce-master** +1. In the lower pane, click the **Inbound tab** and click the **Edit inbound rules** +1. Click **Add Rule**. Under Type, select **SSH**, under Source, select **Custom**. As the Cloud9 environment and the EMR cluster are on the default VPC, introduce the CIDR of your Default VPC (e.g. 172.16.0.0/16). To check your VPC CIDR, go to the [VPC console](https://console.aws.amazon.com/vpc/home?#) and look for the CIDR of the **Default VPC**. +1. 
Click **Save** + +At this stage, you will be able to ssh into the EMR master node. -At this stage, we'll be able to ssh into the EMR master node. First we will access the Ganglia web interface to look at cluster metrics: +{{% notice note %}} +In the following steps, you might not see full utilization of vCPUs and Memory on the EC2 instances because the wordcount demo Spark application is not very resource intensive. +{{% /notice %}} + +#### Access Resource Manager web interface 1. Go to the EMR Management Console, click on your cluster, and open the **Application user interfaces** tab. You'll see the list of on-cluster application interfaces. -2. Copy the master node DNS name from one of the interface urls, it will look like ec2.xx-xxx-xxx-xxx..compute.amazonaws.com -3. Establish an SSH tunnel to port 80, where Ganglia is bound, executing the below command on your Cloud9 environment (update the command with your master node DNS name): +1. Copy the **Master public DNS** from the **Summary** section, it will look like ec2.xx-xxx-xxx-xxx..compute.amazonaws.com +1. Establish an SSH tunnel to port 8088, where Resource Manager is bound, by executing the below command on your Cloud9 environment (update the command with your master node DNS name): ``` - ssh -i ~/environment/emr-workshop-key-pair.pem -N -L 8080:ec2-###-##-##-###.compute-1.amazonaws.com:80 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com + ssh -i ~/environment/emr-workshop-key-pair.pem -N -L 8080:ec2-###-##-##-###.compute-1.amazonaws.com:8088 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com ``` You'll get a message saying the authenticity of the host can't be established. Type 'yes' and hit enter. The message will look similar to the following: @@ -44,42 +46,73 @@ At this stage, we'll be able to ssh into the EMR master node. First we will acce Are you sure you want to continue connecting (yes/no)? ``` -4. 
Now, on your Cloud9 environment, click on the "Preview" menu on the top and then click on "Preview Running Application". You'll see a browser window opening on the environment with an Apache test page. on the URL, append /ganglia/ to access the Ganglia Interface. The url will look like https://xxxxxx.vfs.cloud9.eu-west-1.amazonaws.com/ganglia/. -![Cloud9-Ganglia](/images/running-emr-spark-apps-on-spot/cloud9-ganglia.png) -5. Click on the button next to "Browser" (arrow inside a box) to open Ganglia in a dedicated browser page.Have a look around. Take notice of the heatmap (**Server Load Distribution**). Notable graphs are: -* **Cluster CPU last hour** - this will show you the CPU utilization that our Spark application consumed on our EMR cluster. you should see that utilization varied and reached around 70%. -* **Cluster Memory last hour** - this will show you how much memory we started the cluster with, and how much Spark actually consumed. +1. Now, on your Cloud9 environment, click on the **Preview** menu on the top and then click on **Preview Running Application**. +![Cloud9-preview-application](/images/running-emr-spark-apps-on-spot/cloud9-preview-application.png) + +1. You'll see a browser window open within the Cloud9 environment with a **refused connection error** page. Click on the button next to **Browser** (arrow inside a box) to open the web UI in a dedicated browser page. +![Cloud9-resource-manager-pop-out](/images/running-emr-spark-apps-on-spot/cloud9-resource-manager-pop-out.png) + +1. On the left pane, click on **Nodes**: + +* If the Spark App is **Running**, then in the **Cluster Metrics** table the **Containers Running** will be **18**. In the **Cluster Nodes Metrics** table, the number of **Active Nodes** will be **17** (1 core node with CORE Label and 16 task nodes without any Node Label). -Now, let's look at the **Resource Manager** application user interface. 
+* If the Spark App is **Completed**, then **Containers Running** will be 0, **Active Nodes** will be **1** (1 core node with CORE Label) and 16 **Decommissioned Nodes** (16 task nodes will be decommissioned by EMR managed cluster scaling). -1. Go to the Cloud9 terminal where you have established the ssh connection, and press ctrl+c to close it. -1. Create an SSH tunnel to the cluster master node on port 8088 by running this command (update the command with your master node DNS name): +![Cloud9-Resource-Manager](/images/running-emr-spark-apps-on-spot/cloud9-resource-manager-browser.png) + +### Challenge + +Now that you are familiar with EMR web interfaces, can you try to access **Ganglia** and **Spark History Server** application user interfaces? + +{{% notice tip %}} +Go to **Application user interfaces** tab to see the user interfaces URLs for **Ganglia** and **Spark History Server**. +{{% /notice %}} + +{{% expand "Show answers" %}} + + +#### Access Ganglia user interface + +1. Go to the Cloud9 terminal where you have established the ssh tunnel, and press ctrl+c to close the tunnel used by the previous web UI. +1. Establish an SSH tunnel to port 80, where Ganglia is bound, by executing the below command on your Cloud9 environment (update the command with your master node DNS name): ``` - ssh -i ~/environment/emr-workshop-key-pair.pem -N -L 8080:ec2-###-##-##-###.compute-1.amazonaws.com:8088 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com + ssh -i ~/environment/emr-workshop-key-pair.pem -N -L 8080:ec2-###-##-##-###.compute-1.amazonaws.com:80 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com ``` -1. Now, on your browser, update the URL to "/cluster" i.e. https://xxxxxx.vfs.cloud9.eu-west-1.amazonaws.com/cluster -1. On the left pane, click Nodes, and in the node table, you should see the number of containers that each node ran. +1. 
Now, go back to the browser where Resource Manager was running, and append /ganglia/ to the URL to access the Ganglia interface. The URL should look like: https://xxxxxx.vfs.cloud9.eu-west-1.amazonaws.com/ganglia/ -Now, let's look at **Spark History Server** application user interface: -1. Go to the Cloud9 terminal where you have established the ssh connection, and press ctrl+c to close it. -1. Create an SSH tunnel to the cluster master node on port 18080 by running this command (update the command with your master node DNS name): +1. Take notice of the heatmap (**Server Load Distribution**). Notable graphs are: + + * **Cluster CPU last hour** - this will show you the CPU utilization that our Spark application consumed on our EMR cluster. You should see that utilization varied and reached around 70%. + * **Cluster Memory last hour** - this will show you how much memory we started the cluster with, and how much Spark actually consumed. +![Cloud9-Ganglia-Browser](/images/running-emr-spark-apps-on-spot/cloud9-ganglia-browser.png) + + +#### Access Spark History Server application user interface + +1. Go to the Cloud9 terminal where you have established the ssh tunnel, and press ctrl+c to close the tunnel used by the previous web UI. +1. Establish an SSH tunnel to port 18080, where Spark History Server is bound, by executing the below command on your Cloud9 environment (update the command with your master node DNS name): + ``` ssh -i ~/environment/emr-workshop-key-pair.pem -N -L 8080:ec2-###-##-##-###.compute-1.amazonaws.com:18080 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com ``` -1. Now, on your browser, go to the base URL of your Cloud9 environment i.e. https://xxxxxx.vfs.cloud9.eu-west-1.amazonaws.com/ +1. Now, on your browser, go to the base URL of your Cloud9 preview application i.e. https://xxxxxx.vfs.cloud9.eu-west-1.amazonaws.com/ 1. Click on the App ID in the table (where App Name = Amazon reviews word count) and go to the **Executors** tab 1. 
You can again see the number of executors that are running in your EMR cluster under the **Executors table** + ![Cloud9-Spark-History-Server](/images/running-emr-spark-apps-on-spot/cloud9-spark-history-server.png) +{{% /expand %}} ### Using CloudWatch Metrics + EMR emits several useful metrics to CloudWatch metrics. You can use the AWS Management Console to look at the metrics in two ways: + 1. In the EMR console, under the **Monitoring** tab in your cluster's page -2. By browsing to the CloudWatch service, and under Metrics, searching for the name of your cluster (copy it from the EMR Management Console) and clicking **EMR > Job Flow Metrics** +1. By browsing to the CloudWatch service, and under Metrics, searching for the name of your cluster (copy it from the EMR Management Console) and clicking **EMR > Job Flow Metrics** {{% notice note %}} The metrics will take a few minutes to populate. @@ -91,7 +124,7 @@ Some notable metrics: * **ContainerAllocated** - this represents the number of containers that are running on core and task fleets. These would the be Spark executors and the Spark Driver. * **Memory allocated MB** & **Memory available MB** - you can graph them both to see how much memory the cluster is actually consuming for the wordcount Spark application out of the memory that the instances have. -#### Managed Scaling in Action +### Managed Scaling in Action You enabled managed cluster scaling and EMR scaled out to 64 Spot units in the task fleet. EMR could have launched either 16 * xlarge (running one executor per xlarge) or 8 * 2xlarge instances (running 2 executors per 2xlarge) or 4 * 4xlarge instances (running 4 executors pe r4xlarge), so the task fleet provides 16 executors / containers to the cluster. The core fleet launched one xlarge instance and it will run one executor / container, so in total 17 executors / containers will be running in the cluster. 
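The executor count described in **Managed Scaling in Action** can be sanity-checked with a little shell arithmetic. This is a minimal sketch, assuming weights of 4, 8 and 16 Spot units for xlarge, 2xlarge and 4xlarge instances (inferred from the 64-unit target and the instance counts quoted in that paragraph); the function and variable names are illustrative only and are not part of the workshop.

```bash
#!/usr/bin/env bash
# Spot units that managed scaling reached in the task fleet (from the text).
target_units=64

# executors_for UNITS_PER_INSTANCE EXECUTORS_PER_INSTANCE
# Prints how many executors the task fleet provides if it is filled
# entirely with one instance size. The weights passed in are assumptions.
executors_for() {
  local units_per_instance=$1
  local executors_per_instance=$2
  local instances=$(( target_units / units_per_instance ))
  echo $(( instances * executors_per_instance ))
}

task_xlarge=$(executors_for 4 1)     # 16 instances, 1 executor each
task_2xlarge=$(executors_for 8 2)    # 8 instances, 2 executors each
task_4xlarge=$(executors_for 16 4)   # 4 instances, 4 executors each

# The core fleet runs one xlarge instance with one executor / container.
core_executors=1
total=$(( task_xlarge + core_executors ))

echo "task fleet executors: ${task_xlarge} / ${task_2xlarge} / ${task_4xlarge}"
echo "total executors in the cluster: ${total}"
```

Whichever instance size fills the task fleet, the sketch arrives at the same 16 task executors, which is why the total stays at 17 regardless of the mix that EMR launches.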
diff --git a/content/running_spark_apps_with_emr_on_spot_instances/initial_event.md b/content/running_spark_apps_with_emr_on_spot_instances/initial_event.md index bccfca7f..db89c8d6 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/initial_event.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/initial_event.md @@ -6,6 +6,6 @@ hidden: true --- Create an S3 bucket - we will use this for our Spark application code (which will be provided later) and the Spark application's results. -Refer to the **Create a Bucket** page in the [Amazon S3 Getting Started Guide] (https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) +Refer to the **Create a Bucket** page in the *["Amazon S3 Getting Started Guide"](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html)* You don't need to create a VPC, as the workshop account already has a default VPC that we will use in this workshop. \ No newline at end of file diff --git a/content/running_spark_apps_with_emr_on_spot_instances/initial_ownaccount.md b/content/running_spark_apps_with_emr_on_spot_instances/initial_ownaccount.md index 370831b6..15f24fc3 100644 --- a/content/running_spark_apps_with_emr_on_spot_instances/initial_ownaccount.md +++ b/content/running_spark_apps_with_emr_on_spot_instances/initial_ownaccount.md @@ -9,7 +9,7 @@ hidden: true Refer to the **Create a Bucket** page in the [Amazon S3 Getting Started Guide] (https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) 2. Deploy a new VPC that will be used to run your EMR cluster in the workshop. -a. Open the ["Modular and Scalable VPC Architecture Quick stage page"] (https://aws.amazon.com/quickstart/architecture/vpc/) and go to the "How to deploy" tab, Click the ["Launch the Quick Start"] (https://fwd.aws/mm853) link. +a. 
Open the *["Modular and Scalable VPC Architecture Quick Start"](https://aws.amazon.com/quickstart/architecture/vpc/)* page and go to the **How to deploy** tab, then click the *["Launch the Quick Start"](https://fwd.aws/mm853)* link. b. Select your desired region to run the workshop from the top right corner of the AWS Management Console and click **Next**. c. Provide a name for the stack or leave it as **Quick-Start-VPC**. d. Under **Availability Zones**, select three availability zones from the list, and set the **Number of Availability Zones** to **3**. diff --git a/static/images/running-emr-spark-apps-on-spot/cloud9-ganglia-browser.png b/static/images/running-emr-spark-apps-on-spot/cloud9-ganglia-browser.png new file mode 100644 index 00000000..287599ad Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/cloud9-ganglia-browser.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/cloud9-ganglia-pop-out.png b/static/images/running-emr-spark-apps-on-spot/cloud9-ganglia-pop-out.png new file mode 100644 index 00000000..e575f5b2 Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/cloud9-ganglia-pop-out.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/cloud9-preview-application.png b/static/images/running-emr-spark-apps-on-spot/cloud9-preview-application.png new file mode 100644 index 00000000..2ebf03d1 Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/cloud9-preview-application.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/cloud9-previewfail.png b/static/images/running-emr-spark-apps-on-spot/cloud9-previewfail.png new file mode 100644 index 00000000..58748302 Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/cloud9-previewfail.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager-browser.png b/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager-browser.png new file mode 
100644 index 00000000..4b99ed76 Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager-browser.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager-pop-out.png b/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager-pop-out.png new file mode 100644 index 00000000..debd9684 Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager-pop-out.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager.png b/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager.png new file mode 100644 index 00000000..22c5bf92 Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/cloud9-resource-manager.png differ diff --git a/static/images/running-emr-spark-apps-on-spot/cloud9-spark-history-server.png b/static/images/running-emr-spark-apps-on-spot/cloud9-spark-history-server.png new file mode 100644 index 00000000..5558e26a Binary files /dev/null and b/static/images/running-emr-spark-apps-on-spot/cloud9-spark-history-server.png differ