From bfdc4b0d21203819c1f07f70f85815b1850c1789 Mon Sep 17 00:00:00 2001 From: "Dingheng (Bruce) Zhang" Date: Thu, 20 Oct 2022 08:04:41 -0700 Subject: [PATCH] Update SageMaker Training Compiler MNMG Example Notebook for PT1.11 (#3611) * update mnmg notebook and test file * edit parameters for estimators * fix format * edit by comments and update learning rate * turn off amp * change dataset from sst2 to wikitext * edit package install and add comments for ptxla * fix comments * fix grammar Co-authored-by: BruceZhang@eitug --- ...nguage-modeling-multi-gpu-multi-node.ipynb | 339 ++++++++++-------- .../scripts/launch_sm_training_compiler.py | 9 - .../scripts/requirements.txt | 2 + .../scripts/run_clm.py | 2 + .../scripts/run_mlm.py | 3 + 5 files changed, 192 insertions(+), 163 deletions(-) delete mode 100644 sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/launch_sm_training_compiler.py create mode 100644 sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/requirements.txt diff --git a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/language-modeling-multi-gpu-multi-node.ipynb b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/language-modeling-multi-gpu-multi-node.ipynb index ac6c157bfe..a0f2876580 100644 --- a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/language-modeling-multi-gpu-multi-node.ipynb +++ b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/language-modeling-multi-gpu-multi-node.ipynb @@ -2,30 +2,30 @@ "cells": [ { "cell_type": "markdown", - "id": "perfect-paraguay", + "id": "922da91e", "metadata": {}, "source": [ - "# Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Multi-Node Multi-GPU Training" + "# Compile and Train the GPT2 Model using the Transformers Trainer API with the wikitext Dataset for Multi-Node Multi-GPU Training" ] }, { "cell_type": "markdown", - "id": "facial-classification", + "id": "e2762d09", "metadata": {}, "source": [ "1. [Introduction](#Introduction) \n", - "2. [Development Environment and Permissions](#Development-Environment-and-Permissions)\n", + "2. [Development Environment](#Development-Environment)\n", " 1. [Installation](#Installation) \n", - " 2. [Permissions](#Permissions)\n", + " 2. [SageMaker Environment](#SageMaker-Environment)\n", "3. [SageMaker Training Job](#SageMaker-Training-Job) \n", - " 1. [Training with Native PyTorch](#Training-with-Native-PyTorch) \n", - " 2. [Training with Optimized PyTorch](#Training-with-Optimized-PyTorch) \n", - " 3. [Analysis](#Analysis) " + " 1. [Training with Native PyTorch + SM DDP](#Training-with-Native-PyTorch-+-SM-DDP) \n", + " 2. [Training with SageMaker Training Compiler](#Training-with-SageMaker-Training-Compiler) \n", + "4. [Analysis](#Analysis) " ] }, { "cell_type": "markdown", - "id": "outer-citizenship", + "id": "2d6f733f", "metadata": {}, "source": [ "## SageMaker Training Compiler Overview\n", @@ -38,16 +38,16 @@ "\n", "## Introduction\n", "\n", - "In this demo, you'll use Hugging Face's `transformers` and `datasets` libraries with Amazon SageMaker Training Compiler to train the `gpt-2` model on the `Stanford Sentiment Treebank v2 (SST2)` dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. 
\n", + "In this demo, you'll use Hugging Face's transformers and datasets libraries with Amazon SageMaker Training Compiler to train the gpt-2 model on the wikitext dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. \n", "\n", - "**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, `Python 3 (PyTorch x.y Python 3.x CPU Optimized)` or `conda_pytorch_p36` respectively.\n", + "**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, Python 3 (PyTorch x.y Python 3.x CPU Optimized) or conda_pytorch_p36 respectively.\n", "\n", - "**NOTE:** This notebook uses four `ml.p3.16xlarge` instances that have multiple GPUs. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). " + "**NOTE:** This notebook uses four ml.p4d.24xlarge instances that have multiple GPUs. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). " ] }, { "cell_type": "markdown", - "id": "authentic-phoenix", + "id": "6fe11605", "metadata": {}, "source": [ "## Development Environment " @@ -55,38 +55,44 @@ }, { "cell_type": "markdown", - "id": "ethical-dubai", + "id": "293eff00", "metadata": {}, "source": [ - "### Installation\n", - "\n", - "This example notebook requires the **SageMaker Python SDK v2.70.0** and **transformers v4.11.0**." + "### Installation" + ] + }, + { + "cell_type": "markdown", + "id": "86e7a516", + "metadata": {}, + "source": [ + "This example notebook requires the **SageMaker Python SDK v2.108.0** and **transformers v4.21**." ] }, { "cell_type": "code", "execution_count": null, - "id": "caring-equivalent", + "id": "9782c8b3", "metadata": {}, "outputs": [], "source": [ - "!pip install -U sagemaker==2.70.0" + "!pip install \"sagemaker>=2.108.0\" botocore boto3 awscli typing-extensions pandas numpy --upgrade" ] }, { "cell_type": "code", "execution_count": null, - "id": "dutch-october", + "id": "de3ca816", "metadata": {}, "outputs": [], "source": [ - "!pip install transformers==4.11.0" + "!pip install \"transformers==4.21\" --upgrade" ] }, { "cell_type": "code", "execution_count": null, - "id": "proprietary-breath", + "id": "ad210e7d", "metadata": {}, "outputs": [], "source": [ @@ -102,10 +108,10 @@ }, { "cell_type": "markdown", - "id": "orange-chess", + "id": "64570f5d", "metadata": {}, "source": [ - "**NOTE:** Copy and run the following code if you need to upgrade ipywidgets for `datasets` library and restart the kernel. This is needed if the installation is not applied to the current kernel.\n", + "**NOTE:** Copy and run the following code if you need to upgrade ipywidgets for datasets library and restart the kernel. 
This is needed if the installation is not applied to the current kernel.\n", "\n", "```python\n", "%%capture\n", @@ -118,16 +124,16 @@ }, { "cell_type": "markdown", - "id": "upper-plumbing", + "id": "c47f3664", "metadata": {}, "source": [ - "### SageMaker environment " + "### SageMaker Environment " ] }, { "cell_type": "code", "execution_count": null, - "id": "front-fetish", + "id": "88e13af0", "metadata": {}, "outputs": [], "source": [ @@ -152,36 +158,26 @@ }, { "cell_type": "markdown", - "id": "respiratory-morris", + "id": "2025db42", "metadata": {}, "source": [ "## SageMaker Training Job\n", "\n", - "To create a SageMaker training job, we use a `HuggingFace` estimator. Using the estimator, you can define which training script should SageMaker use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.\n", - "\n", - "When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `HuggingFace` Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.\n", + "To create a SageMaker training job, we use a HuggingFace/PyTorch estimator. Using the estimator, you can define which training script should SageMaker use through entry_point, which instance_type to use for training, which hyperparameters to pass, and so on.\n", "\n", - "In the following section, you learn how to set up two versions of the SageMaker `HuggingFace` estimator, a native one without the compiler and an optimized one with the compiler." - ] - }, - { - "cell_type": "markdown", - "id": "retired-albert", - "metadata": {}, - "source": [ - "### Training Setup" + "When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the HuggingFace Deep Learning Container, uploads your training script, and downloads the data from sagemaker_session_bucket into the container at /opt/ml/input/data." ] }, { "cell_type": "code", "execution_count": null, - "id": "domestic-richardson", + "id": "b07a2709", "metadata": {}, "outputs": [], "source": [ "# Here we configure the training job. Please configure the appropriate options below:\n", "\n", - "EPOCHS = 400\n", + "EPOCHS = 10\n", "\n", "# Choose between Causal Language Model and Masked Language Model\n", "LANGUAGE_MODELING_LOSS = \"clm\" # or \"mlm\"\n", @@ -194,83 +190,141 @@ "\n", "# For more information about the options, please look into the training scripts\n", "\n", - "# SageMaker Training Compiler currently only supports training on GPU\n", "# Select Instance type for training\n", - "INSTANCE_TYPE = \"ml.p3.16xlarge\" # ml.p3.8xlarge is easily available. However, ml.p3.16xlarge provides better performance.\n", + "INSTANCE_TYPE = \"ml.p4d.24xlarge\"\n", "NUM_INSTANCES = 2\n", + "# Since ml.p4d.24xlarge instance has 8 GPUs, we set num_gpus_per_instance to 8\n", "num_gpus_per_instance = 8" ] }, { "cell_type": "markdown", - "id": "final-newark", + "id": "46c5650c", + "metadata": {}, + "source": [ + "First, we define some basic parameters common to all estimators.\n", + "\n", + "**Note**: We recommend you to turn the SageMaker Debugger's profiling and debugging tools off to avoid additional overheads." 
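The estimator arguments defined in the next cell register `metric_definitions`, which tell SageMaker how to extract numeric training metrics from the job's log stream. As a minimal, self-contained illustration of how those regexes behave (the log line below is made up, but it mirrors the dictionary format the Hugging Face Trainer prints), each `Regex` simply captures the first number that follows its key:

```python
import re

# Hypothetical log line in the format the Hugging Face Trainer prints at the end
# of training; SageMaker applies each metric Regex to lines like this in the log stream.
log_line = "{'train_runtime': 1234.5, 'train_samples_per_second': 98.7, 'train_loss': 3.21, 'epoch': 10.0}"

metric_regexes = {
    "summary_train_runtime": r"'train_runtime': ([0-9.]*)",
    "summary_train_samples_per_second": r"'train_samples_per_second': ([0-9.]*)",
    "summary_train_loss": r"'train_loss': ([0-9.]*)",
    "epoch": r"'epoch': ([0-9.]*)",
}

for name, pattern in metric_regexes.items():
    match = re.search(pattern, log_line)
    if match:
        print(f"{name} = {float(match.group(1))}")
```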
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "63846356", "metadata": {}, + "outputs": [], "source": [ - "### Training with Native PyTorch" + "estimator_args = dict(\n", + " entry_point=f\"run_{LANGUAGE_MODELING_LOSS}.py\",\n", + " source_dir=\"./scripts\",\n", + " instance_type=INSTANCE_TYPE,\n", + " instance_count=NUM_INSTANCES,\n", + " role=role,\n", + " py_version=\"py38\",\n", + " volume_size=512,\n", + " disable_profiler=True, # Disabling SageMaker Profiler to avoid overheads during benchmarking\n", + " debugger_hook_config=False, # Disabling SageMaker Debugger to avoid overheads during benchmarking\n", + " base_job_name=\"trcomp-pt-example\",\n", + " metric_definitions=[\n", + " {\"Name\": \"summary_train_runtime\", \"Regex\": \"'train_runtime': ([0-9.]*)\"},\n", + " {\n", + " \"Name\": \"summary_train_samples_per_second\",\n", + " \"Regex\": \"'train_samples_per_second': ([0-9.]*)\",\n", + " },\n", + " {\"Name\": \"summary_train_steps_per_second\", \"Regex\": \"'train_steps_per_second': ([0-9.]*)\"},\n", + " {\"Name\": \"summary_train_loss\", \"Regex\": \"'train_loss': ([0-9.]*)\"},\n", + " {\"Name\": \"epoch\", \"Regex\": \"'epoch': ([0-9.]*)\"},\n", + " {\"Name\": \"train_loss\", \"Regex\": \"'loss': ([0-9.]*)\"},\n", + " {\"Name\": \"learning_rate\", \"Regex\": \"'learning_rate': ([0-9.]*)\"},\n", + " ],\n", + ")" ] }, { "cell_type": "markdown", - "id": "representative-emerald", + "id": "359ea82b", "metadata": {}, "source": [ - "The batch size below is the maximum batch we could fit into the memory of an `ml.p3.16xlarge` instance. If you change the model, instance type, sequence length, and other parameters, you need to do some experiments to find the largest batch size that will fit into GPU memory. We also use AMP for faster training." + "Next, we define some basic arguments to be passed to the training script." ] }, { "cell_type": "code", "execution_count": null, - "id": "statutory-electricity", + "id": "c45ecc3b", "metadata": {}, "outputs": [], "source": [ - "from sagemaker.huggingface import HuggingFace\n", - "\n", "# hyperparameters are passed to the training entrypoint as arguments\n", "hyperparameters = {\n", " MODEL_CONFIG: MODEL_NAME,\n", " \"tokenizer_name\": TOKENIZER_NAME,\n", - " \"dataset_name\": \"glue\",\n", - " \"dataset_config_name\": \"sst2\",\n", + " \"dataset_name\": \"wikitext\",\n", + " \"dataset_config_name\": \"wikitext-103-v1\",\n", " \"do_train\": True,\n", - " \"do_eval\": True,\n", - " \"fp16\": True,\n", - " \"per_device_train_batch_size\": 10,\n", - " \"per_device_eval_batch_size\": 16,\n", - " \"overwrite_output_dir\": True,\n", + " \"do_eval\": False,\n", " \"num_train_epochs\": EPOCHS,\n", - " \"output_dir\": \"/opt/ml/model\",\n", " SEQ_LEN_ARG: 512,\n", - " \"logging_strategy\": \"epoch\",\n", + " \"overwrite_output_dir\": True,\n", " \"save_strategy\": \"no\",\n", - "}\n", + " \"evaluation_strategy\": \"no\",\n", + " \"logging_strategy\": \"epoch\",\n", + " \"output_dir\": \"/opt/ml/model\",\n", + " \"dataloader_drop_last\": True,\n", + " \"preprocessing_num_workers\": 12,\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "f15ae3eb", + "metadata": {}, + "source": [ + "In the following sections, we will create estimators and start training." 
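SageMaker passes the `hyperparameters` dictionary to the entry point as command-line arguments, which `run_clm.py`/`run_mlm.py` parse with Hugging Face's `HfArgumentParser`. The sketch below only illustrates that mapping for a subset of the keys defined above; the real serialization and quoting are handled by the SageMaker training toolkit inside the container:

```python
# Illustrative only: roughly how the hyperparameters dictionary becomes
# command-line flags for the training script inside the container.
sample_hyperparameters = {
    "dataset_name": "wikitext",
    "dataset_config_name": "wikitext-103-v1",
    "do_train": True,
    "num_train_epochs": 10,
    "save_strategy": "no",
    "output_dir": "/opt/ml/model",
}

cli_args = []
for key, value in sample_hyperparameters.items():
    cli_args += [f"--{key}", str(value)]

print("python run_clm.py " + " ".join(cli_args))
```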
+ ] + }, + { + "cell_type": "markdown", + "id": "97d35764", + "metadata": {}, + "source": [ + "### Training with Native PyTorch + SM DDP" + ] + }, + { + "cell_type": "markdown", + "id": "4b3575d9", + "metadata": {}, + "source": [ + "The batch size below is the maximum batch we could fit into the memory of a ml.p4d.24xlarge instance. If you change the model, instance type, sequence length, and other parameters, you need to do some experiments to find the largest batch size that will fit into GPU memory.\n", + "\n", + "This example uses HuggingFace training script run_clm.py, which you can find it inside the scripts folder." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4778f414", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.pytorch import PyTorch\n", + "\n", + "hyperparameters[\"per_device_train_batch_size\"] = 13\n", "\n", "# The original LR was set for a batch of 32. Here we are scaling learning rate with batch size.\n", "hyperparameters[\"learning_rate\"] = (\n", - " float(\"5e-5\")\n", - " / 32\n", - " * hyperparameters[\"per_device_train_batch_size\"]\n", - " * num_gpus_per_instance\n", - " * NUM_INSTANCES\n", + " float(\"5e-5\") / 32 * hyperparameters[\"per_device_train_batch_size\"]\n", ")\n", "\n", "# configure the training job\n", - "native_estimator = HuggingFace(\n", - " entry_point=f\"run_{LANGUAGE_MODELING_LOSS}.py\",\n", - " source_dir=\"./scripts\",\n", - " instance_type=INSTANCE_TYPE,\n", - " instance_count=NUM_INSTANCES,\n", - " role=role,\n", - " volume_size=200,\n", - " transformers_version=\"4.11\",\n", - " pytorch_version=\"1.9\",\n", - " py_version=\"py38\",\n", + "native_estimator = PyTorch(\n", + " **estimator_args,\n", + " framework_version=\"1.11\",\n", " hyperparameters=hyperparameters,\n", " distribution={\n", " \"smdistributed\": {\"dataparallel\": {\"enabled\": True}}\n", - " }, # Use SageMaker Data Parallel to train across nodes/GPUs.\n", - " disable_profiler=True, # Disabling SageMaker Profiler to avoid overhead during benchmarking\n", - " debugger_hook_config=False, # Disabling SageMaker Debugger to avoid overhead during benchmarking\n", + " }, # Use SageMaker Distributed Data Parallel to train across nodes/GPUs.\n", ")\n", "\n", "# Start the training job\n", @@ -280,73 +334,48 @@ }, { "cell_type": "markdown", - "id": "aggressive-treaty", + "id": "3a6d8e80", "metadata": {}, "source": [ - "### Training with Optimized PyTorch" + "### Training with SageMaker Training Compiler " ] }, { "cell_type": "markdown", - "id": "perfect-trinity", + "id": "d288f21d", "metadata": {}, "source": [ "Compilation through Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. Note that if you want to change the batch size, you must adjust the learning rate appropriately.\n", + "Below, we have scaled the learning rate linearly with the increase in batch size.\n", "\n", - "**Note:** We recommend you to turn the SageMaker Debugger's profiling and debugging tools off when you use compilation to avoid additional overheads.\n", - "\n", - "Here, instead of using the `distribution` kwarg to launch a multi node training job, we use a wrapper script to setup an inter-node communication using `torch_xla.distributed.sm_dist`, which has been optimized to work with SageMaker Training Compiler." 
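As a quick numerical check of that linear scaling rule (a base learning rate of 5e-5 for a reference batch of 32, per the comments in the training cells), the per-device batch sizes used in the two jobs (13 for the SM DDP run above, 25 for the compiler-enabled run below) give:

```python
# Linear learning-rate scaling used in both estimators: base LR 5e-5 at batch 32.
BASE_LR, BASE_BATCH = 5e-5, 32

for per_device_batch in (13, 25):  # SM DDP run and Training Compiler run, respectively
    scaled_lr = BASE_LR / BASE_BATCH * per_device_batch
    print(f"per_device_train_batch_size={per_device_batch} -> learning_rate={scaled_lr:.2e}")

# per_device_train_batch_size=13 -> learning_rate=2.03e-05
# per_device_train_batch_size=25 -> learning_rate=3.91e-05
```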
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "coordinate-burning", - "metadata": {}, - "outputs": [], - "source": [ - "!pygmentize ./scripts/launch_sm_training_compiler.py" + "**Note**: We are using distribution mechanism pytorchxla which is a compiler aware method of distributed training." ] }, { "cell_type": "code", "execution_count": null, - "id": "bacterial-multiple", + "id": "a550ef7b", "metadata": {}, "outputs": [], "source": [ "from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig\n", "\n", - "# To use SageMaker Training Compiler in a Distributed setting, please use a wrapper script to invoke your training script\n", - "hyperparameters[\"training_script\"] = f\"run_{LANGUAGE_MODELING_LOSS}.py\"\n", - "\n", "# with SageMaker Training Compiler we are able to fit a larger batch into memory\n", - "hyperparameters[\"per_device_train_batch_size\"] = 20\n", + "hyperparameters[\"per_device_train_batch_size\"] = 25\n", "\n", "# The original LR was set for a batch of 32. Here we are scaling learning rate with batch size.\n", "hyperparameters[\"learning_rate\"] = (\n", - " float(\"5e-5\")\n", - " / 32\n", - " * hyperparameters[\"per_device_train_batch_size\"]\n", - " * num_gpus_per_instance\n", - " * NUM_INSTANCES\n", + " float(\"5e-5\") / 32 * hyperparameters[\"per_device_train_batch_size\"]\n", ")\n", "\n", "# configure the training job\n", "optimized_estimator = HuggingFace(\n", - " entry_point=\"launch_sm_training_compiler.py\", # Wrapper around training script that enables multi node training\n", - " compiler_config=TrainingCompilerConfig(), # We are enabling SageMaker Training Compiler here !\n", - " source_dir=\"./scripts\",\n", - " instance_type=INSTANCE_TYPE,\n", - " instance_count=NUM_INSTANCES,\n", - " role=role,\n", - " volume_size=200,\n", - " py_version=\"py38\",\n", - " transformers_version=\"4.11.0\",\n", - " pytorch_version=\"1.9.0\",\n", + " compiler_config=TrainingCompilerConfig(),\n", + " transformers_version=\"4.21\",\n", + " pytorch_version=\"1.11\",\n", " hyperparameters=hyperparameters,\n", - " disable_profiler=True, # Disable SageMaker Profiler to avoid overhead during benchmarking\n", - " debugger_hook_config=False, # Disable SageMaker Debugger to avoid overhead during benchmarking\n", + " distribution={\"pytorchxla\": {\"enabled\": True}},\n", + " **estimator_args,\n", ")\n", "\n", "# start the training job\n", @@ -356,7 +385,7 @@ }, { "cell_type": "markdown", - "id": "regular-country", + "id": "cd07fd0e", "metadata": {}, "source": [ "### Wait for training jobs to complete\n" @@ -365,7 +394,7 @@ { "cell_type": "code", "execution_count": null, - "id": "specific-afternoon", + "id": "aa99c58d", "metadata": {}, "outputs": [], "source": [ @@ -381,7 +410,7 @@ }, { "cell_type": "markdown", - "id": "hindu-seven", + "id": "13a6ed8f", "metadata": {}, "source": [ "## Analysis" @@ -389,7 +418,7 @@ }, { "cell_type": "markdown", - "id": "tender-moral", + "id": "8f3a568c", "metadata": {}, "source": [ "**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new HuggingFace estimator. 
For example:\n", @@ -401,7 +430,7 @@ }, { "cell_type": "markdown", - "id": "measured-contributor", + "id": "8c238cf7", "metadata": {}, "source": [ "### Load logs of the training job *with* SageMaker Training Compiler" @@ -410,7 +439,7 @@ { "cell_type": "code", "execution_count": null, - "id": "southwest-radio", + "id": "ea71078b", "metadata": {}, "outputs": [], "source": [ @@ -422,7 +451,7 @@ }, { "cell_type": "markdown", - "id": "numerous-sunglasses", + "id": "948dba2b", "metadata": {}, "source": [ "### Load logs of the training job *without* SageMaker Training Compiler" @@ -431,7 +460,7 @@ { "cell_type": "code", "execution_count": null, - "id": "eleven-package", + "id": "5d19e18c", "metadata": {}, "outputs": [], "source": [ @@ -443,7 +472,7 @@ }, { "cell_type": "markdown", - "id": "metric-cheat", + "id": "4e186c13", "metadata": {}, "source": [ "### Create helper functions for analysis" @@ -452,7 +481,7 @@ { "cell_type": "code", "execution_count": null, - "id": "random-party", + "id": "aa34ebfd", "metadata": {}, "outputs": [], "source": [ @@ -493,7 +522,7 @@ }, { "cell_type": "markdown", - "id": "immune-election", + "id": "b6862bdc", "metadata": {}, "source": [ "### Plot Optimized vs Native Training Throughput\n", @@ -504,7 +533,7 @@ { "cell_type": "code", "execution_count": null, - "id": "speaking-quantity", + "id": "6e30398b", "metadata": { "scrolled": true }, @@ -525,7 +554,7 @@ { "cell_type": "code", "execution_count": null, - "id": "simple-senate", + "id": "0a17264c", "metadata": {}, "outputs": [], "source": [ @@ -534,16 +563,16 @@ "plt.title(\"Training Throughput \\n (Higher is better)\")\n", "plt.ylabel(\"Samples/sec\")\n", "\n", - "plt.bar(x=[1], height=native_throughput, label=\"Baseline PT\", width=0.35)\n", - "plt.bar(x=[1.5], height=optimized_throughput, label=\"Compiler-enhanced PT\", width=0.35)\n", + "plt.bar(x=[1], height=native_throughput, label=\"PT + SM DDP\", width=0.35)\n", + "plt.bar(x=[1.5], height=optimized_throughput, label=\"PT + Compiler\", width=0.35)\n", "\n", "plt.xlabel(\" ====> {} Compiler savings <====\".format(avg_speedup))\n", - "plt.xticks(ticks=[1, 1.5], labels=[\"Baseline PT\", \"Compiler-enhanced PT\"])" + "plt.xticks(ticks=[1, 1.5], labels=[\"PT + SM DDP\", \"PT + Compiler\"])" ] }, { "cell_type": "markdown", - "id": "ceramic-exception", + "id": "6796db9d", "metadata": {}, "source": [ "### Convergence of Training Loss\n", @@ -554,7 +583,7 @@ { "cell_type": "code", "execution_count": null, - "id": "boolean-induction", + "id": "d99dc081", "metadata": {}, "outputs": [], "source": [ @@ -566,14 +595,14 @@ "plt.title(\"Plot of Training Loss\")\n", "plt.xlabel(\"Epoch\")\n", "plt.ylabel(\"Training Loss\")\n", - "plt.plot(vanilla_epochs, vanilla_loss, label=\"Baseline PT\")\n", - "plt.plot(optimized_epochs, optimized_loss, label=\"Compiler-enhanced PT\")\n", + "plt.plot(vanilla_epochs, vanilla_loss, label=\"PT + SM DDP\")\n", + "plt.plot(optimized_epochs, optimized_loss, label=\"PT + Compiler\")\n", "plt.legend()" ] }, { "cell_type": "markdown", - "id": "promotional-glass", + "id": "fe3cb58a", "metadata": {}, "source": [ "### Training Stats\n", @@ -585,17 +614,19 @@ { "cell_type": "code", "execution_count": null, - "id": "thick-salvation", + "id": "7bb74930", "metadata": {}, "outputs": [], "source": [ - "pd.DataFrame([n[\"summary\"], o[\"summary\"]], index=[\"Native\", \"Compiler-enhanced\"])" + "import pandas as pd\n", + "\n", + "pd.DataFrame([n[\"summary\"], o[\"summary\"]], index=[\"PT + SM DDP\", \"PT + Compiler\"])" ] }, { "cell_type": "code", 
"execution_count": null, - "id": "changing-leave", + "id": "e605617c", "metadata": {}, "outputs": [], "source": [ @@ -611,7 +642,7 @@ }, { "cell_type": "markdown", - "id": "pacific-syria", + "id": "973ddcb9", "metadata": {}, "source": [ "### Total Billable Time\n", @@ -622,7 +653,7 @@ { "cell_type": "code", "execution_count": null, - "id": "developed-bride", + "id": "0f31fc1a", "metadata": {}, "outputs": [], "source": [ @@ -637,30 +668,30 @@ { "cell_type": "code", "execution_count": null, - "id": "interior-landscape", + "id": "ec70b709", "metadata": {}, "outputs": [], "source": [ "Billable = {}\n", - "Billable[\"Native\"] = BillableTimeInSeconds(native_estimator.latest_training_job.name)\n", - "Billable[\"Optimized\"] = BillableTimeInSeconds(optimized_estimator.latest_training_job.name)\n", + "Billable[\"PT + SM DDP\"] = BillableTimeInSeconds(native_estimator.latest_training_job.name)\n", + "Billable[\"PT + Compiler\"] = BillableTimeInSeconds(optimized_estimator.latest_training_job.name)\n", "pd.DataFrame(Billable, index=[\"BillableSecs\"])" ] }, { "cell_type": "code", "execution_count": null, - "id": "paperback-trash", + "id": "0f1bb1ee", "metadata": {}, "outputs": [], "source": [ - "speedup = (Billable[\"Native\"] - Billable[\"Optimized\"]) * 100 / Billable[\"Native\"]\n", + "speedup = (Billable[\"PT + SM DDP\"] - Billable[\"PT + Compiler\"]) * 100 / Billable[\"PT + SM DDP\"]\n", "print(f\"SageMaker Training Compiler integrated PyTorch was {int(speedup)}% faster in summary.\")" ] }, { "cell_type": "markdown", - "id": "assured-spread", + "id": "3e64821c", "metadata": {}, "source": [ "## Clean up\n", @@ -671,7 +702,7 @@ { "cell_type": "code", "execution_count": null, - "id": "mathematical-islam", + "id": "0ede2e27", "metadata": {}, "outputs": [], "source": [ @@ -692,7 +723,7 @@ }, { "cell_type": "markdown", - "id": "grand-refrigerator", + "id": "8000452a", "metadata": {}, "source": [ "Also, to find instructions on cleaning up resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*." @@ -701,9 +732,9 @@ ], "metadata": { "kernelspec": { - "display_name": "conda_pytorch_p36", + "display_name": "conda_pytorch_p38", "language": "python", - "name": "conda_pytorch_p36" + "name": "conda_pytorch_p38" }, "language_info": { "codemirror_mode": { @@ -715,7 +746,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.13" + "version": "3.8.12" } }, "nbformat": 4, diff --git a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/launch_sm_training_compiler.py b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/launch_sm_training_compiler.py deleted file mode 100644 index 655af389a2..0000000000 --- a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/launch_sm_training_compiler.py +++ /dev/null @@ -1,9 +0,0 @@ -import subprocess -import sys - -if __name__ == "__main__": - arguments_command = " ".join([arg for arg in sys.argv[1:]]) - """ - The following line will take care of setting up inter node communication as well as managing intra node workers for each GPU. 
- """ - subprocess.check_call("python -m torch_xla.distributed.sm_dist " + arguments_command, shell=True) diff --git a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/requirements.txt b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/requirements.txt new file mode 100644 index 0000000000..db6140254d --- /dev/null +++ b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/requirements.txt @@ -0,0 +1,2 @@ +transformers == 4.21.1 +datasets == 1.18.4 \ No newline at end of file diff --git a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/run_clm.py b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/run_clm.py index d2a473f6b6..b6591ce4d0 100644 --- a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/run_clm.py +++ b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/run_clm.py @@ -435,6 +435,7 @@ def tokenize_function(examples): remove_columns=column_names, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on dataset", + keep_in_memory=True, ) if data_args.block_size is None: @@ -484,6 +485,7 @@ def group_texts(examples): num_proc=data_args.preprocessing_num_workers, load_from_cache_file=not data_args.overwrite_cache, desc=f"Grouping texts in chunks of {block_size}", + keep_in_memory=True, ) if training_args.do_train: diff --git a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/run_mlm.py b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/run_mlm.py index a0d3302566..fd8c9cd978 100644 --- a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/run_mlm.py +++ b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_multiple_node/scripts/run_mlm.py @@ -454,6 +454,7 @@ def tokenize_function(examples): remove_columns=[text_column_name], load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on dataset line_by_line", + keep_in_memory=True, ) else: # Otherwise, we tokenize every text, then concatenate them together before splitting them in smaller parts. @@ -470,6 +471,7 @@ def tokenize_function(examples): remove_columns=column_names, load_from_cache_file=not data_args.overwrite_cache, desc="Running tokenizer on every text in dataset", + keep_in_memory=True, ) # Main data processing function that will concatenate all texts from our dataset and generate chunks of @@ -503,6 +505,7 @@ def group_texts(examples): num_proc=data_args.preprocessing_num_workers, load_from_cache_file=not data_args.overwrite_cache, desc=f"Grouping texts in chunks of {max_seq_length}", + keep_in_memory=True, ) if training_args.do_train:
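The `keep_in_memory=True` flag added to the `map()` calls in `run_clm.py` and `run_mlm.py` keeps the tokenized and grouped datasets in RAM instead of writing Arrow cache files to disk during preprocessing. A standalone sketch of the same pattern on a tiny throwaway dataset (hypothetical example, not part of the training scripts):

```python
from datasets import Dataset

# Tiny in-memory dataset standing in for the wikitext splits used in the notebook.
ds = Dataset.from_dict({"text": ["hello world", "sagemaker training compiler"]})

def to_upper(batch):
    return {"text": [t.upper() for t in batch["text"]]}

# keep_in_memory=True holds the transformed table in RAM rather than writing an
# Arrow cache file, mirroring the map() calls patched above.
upper = ds.map(to_upper, batched=True, keep_in_memory=True)
print(upper["text"])
```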