
Commit

Update SageMaker Training Compiler Example Notebooks for PT1.11 (aws#3592)

* update pytorch_single_gpu_single_node example notebooks

* edit estimator from PyTorch to HuggingFace

* update parameters and fix grammar for roberta-base and bert-base-cased notebook

* update parameters for albert-base-v2 notebook and reformat it

* fix grammar mistake

* fix syntax errors and update albert-base-v2 analysis part

* fix pandas and numpy versions

* rerun tests

* edit code format

Co-authored-by: Bruce Zhang <[email protected]>
Co-authored-by: Aaron Markham <[email protected]>
Co-authored-by: atqy <[email protected]>
4 people committed Oct 28, 2022
1 parent 80535a3 commit cfe6046
Showing 6 changed files with 189 additions and 180 deletions.
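The heart of the rendered notebook diff below is the estimator swap: the un-compiled baseline now uses the plain SageMaker PyTorch estimator on PyTorch 1.11, while the compiler-enabled run keeps the HuggingFace estimator (transformers 4.21.1, PyTorch 1.11) with Training Compiler turned on. A minimal sketch of the two configurations follows; it is not the exact notebook code, and the role, S3 paths, and hyperparameter values are placeholders.

```python
# Minimal sketch (placeholders, not the exact notebook code): native baseline vs.
# SageMaker Training Compiler-enabled training job.
import sagemaker
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()
train_path = "s3://<your-bucket>/sst2/train"  # placeholder S3 URIs
test_path = "s3://<your-bucket>/sst2/test"
hyperparameters = {"epochs": 7, "train_batch_size": 18, "model_name": "roberta-base"}

# Baseline: generic PyTorch 1.11 container, no compiler.
native_estimator = PyTorch(
    entry_point="fine_tune_with_huggingface.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    py_version="py38",
    framework_version="1.11.0",
    hyperparameters=hyperparameters,
    disable_profiler=True,
    debugger_hook_config=False,
)

# Optimized: HuggingFace container with SageMaker Training Compiler enabled.
optimized_estimator = HuggingFace(
    entry_point="fine_tune_with_huggingface.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    py_version="py38",
    transformers_version="4.21.1",
    pytorch_version="1.11.0",
    hyperparameters=hyperparameters,
    disable_profiler=True,
    debugger_hook_config=False,
    compiler_config=TrainingCompilerConfig(),
)

native_estimator.fit({"train": train_path, "test": test_path}, wait=False)
optimized_estimator.fit({"train": train_path, "test": test_path}, wait=False)
```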

Large diffs are not rendered by default.

@@ -0,0 +1,2 @@
transformers == 4.21.1
datasets == 1.18.4
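These two pins are presumably a `requirements.txt` shipped alongside the training scripts (the file path is not shown in this view); when such a file sits in the estimator's `source_dir`, the SageMaker framework container installs it before the entry point runs. A hypothetical sanity check, not part of this commit, that the pinned versions actually landed inside the training container:

```python
# Hypothetical sanity check (not part of this commit): verify the pinned
# libraries from requirements.txt were installed in the training container.
import datasets
import transformers

assert transformers.__version__ == "4.21.1", transformers.__version__
assert datasets.__version__ == "1.18.4", datasets.__version__
print(f"transformers {transformers.__version__}, datasets {datasets.__version__}")
```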

Large diffs are not rendered by default.

@@ -0,0 +1,2 @@
transformers == 4.21.1
datasets == 1.18.4
@@ -38,11 +38,11 @@
"\n",
"## Introduction\n",
"\n",
"In this demo, you'll use Hugging Face's `transformers` and `datasets` libraries with Amazon SageMaker Training Compiler to train the `RoBERTa` model on the `Stanford Sentiment Treebank v2 (SST2)` dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. \n",
"In this demo, you'll use Hugging Face's transformers and datasets libraries with Amazon SageMaker Training Compiler to train the RoBERTa model on the Stanford Sentiment Treebank v2 (SST2) dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. \n",
"\n",
"**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, `Python 3 (PyTorch x.y Python 3.x CPU Optimized)` or `conda_pytorch_p36` respectively.\n",
"**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, Python 3 (PyTorch x.y Python 3.x CPU Optimized) or conda_pytorch_p36 respectively.\n",
"\n",
"**NOTE:** This notebook uses two `ml.p3.2xlarge` instances that have single GPU. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). "
"**NOTE:** This notebook uses two ml.p3.2xlarge instances that have single GPU. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). "
]
},
{
@@ -58,7 +58,7 @@
"source": [
"### Installation\n",
"\n",
"This example notebook requires the **SageMaker Python SDK v2.70.0** and **transformers v4.11.0**."
"This example notebook requires the **SageMaker Python SDK v2.108.0** and **transformers v4.21**."
]
},
{
@@ -67,7 +67,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install sagemaker botocore boto3 awscli --upgrade"
"!pip install \"sagemaker>=2.108.0\" botocore boto3 awscli \"torch==1.11.0\" --upgrade"
]
},
{
@@ -76,7 +76,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install -U transformers datasets --upgrade"
"!pip install -U \"transformers==4.21.1\" datasets --upgrade"
]
},
{
@@ -99,7 +99,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Copy and run the following code if you need to upgrade ipywidgets for `datasets` library and restart kernel. This is only needed when prerpocessing is done in the notebook.\n",
"Copy and run the following code if you need to upgrade ipywidgets for datasets library and restart kernel. This is only needed when preprocessing is done in the notebook.\n",
"\n",
"```python\n",
"%%capture\n",
@@ -164,7 +164,7 @@
"\n",
"If you'd like to try other training datasets later, you can simply use this method.\n",
"\n",
"For this example notebook, we prepared the `SST2` dataset in the public SageMaker sample file S3 bucket. The following code cells show how you can directly load the dataset and convert to a HuggingFace DatasetDict."
"For this example notebook, we prepared the SST2 dataset in the public SageMaker sample file S3 bucket. The following code cells show how you can directly load the dataset and convert to a HuggingFace DatasetDict."
]
},
{
@@ -173,7 +173,7 @@
"source": [
"## Preprocessing\n",
"\n",
"We download and preprocess the `SST2` dataset from the `s3://sagemaker-sample-files/datasets` bucket. After preprocessing, we'll upload the dataset to the `sagemaker_session_bucket`, which will be used as a data channel for the training job."
"We download and preprocess the SST2 dataset from the s3://sagemaker-sample-files/datasets bucket. After preprocessing, we'll upload the dataset to the sagemaker_session_bucket, which will be used as a data channel for the training job."
]
},
{
@@ -253,9 +253,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Uploading data to `sagemaker_session_bucket`\n",
"## Uploading data to sagemaker_session_bucket\n",
"\n",
"We are going to use the new `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload our preprocessed dataset to S3."
"We are going to use the new FileSystem [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload our preprocessed dataset to S3."
]
},
{
@@ -284,11 +284,11 @@
"source": [
"## SageMaker Training Job\n",
"\n",
"To create a SageMaker training job, we use a `HuggingFace` estimator. Using the estimator, you can define which fine-tuning script should SageMaker use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.\n",
"To create a SageMaker training job, we use a HuggingFace/PyTorch estimator. Using the estimator, you can define which fine-tuning script should SageMaker use through entry_point, which instance_type to use for training, which hyperparameters to pass, and so on.\n",
"\n",
"When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `HuggingFace` Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.\n",
"When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the HuggingFace Deep Learning Container, uploads your training script, and downloads the data from sagemaker_session_bucket into the container at /opt/ml/input/data.\n",
"\n",
"In the following section, you learn how to set up two versions of the SageMaker `HuggingFace` estimator, a native one without the compiler and an optimized one with the compiler."
"In the following section, you learn how to set up two versions of the SageMaker HuggingFace/PyTorch estimator, a native one without the compiler and an optimized one with the compiler."
]
},
{
@@ -302,7 +302,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Set up an option for fine-tuning or full training. Set `FINE_TUNING = 1` for fine-tuning and using `fine_tune_with_huggingface.py`. Set `FINE_TUNING = 0` for full training and using `full_train_roberta_with_huggingface.py`."
"Set up an option for fine-tuning or full training. `FINE_TUNING = 1` is for fine-tuning, and it will use fine_tune_with_huggingface.py. `FINE_TUNING = 0` is for full training, and it will use full_train_roberta_with_huggingface.py."
]
},
{
@@ -318,7 +318,7 @@
"FULL_TRAINING = not FINE_TUNING\n",
"\n",
"# Fine tuning is typically faster and is done for fewer epochs\n",
"EPOCHS = 4 if FINE_TUNING else 100\n",
"EPOCHS = 7 if FINE_TUNING else 100\n",
"\n",
"TRAINING_SCRIPT = (\n",
" \"fine_tune_with_huggingface.py\" if FINE_TUNING else \"full_train_roberta_with_huggingface.py\"\n",
@@ -340,7 +340,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `train_batch_size` in the following code cell is the maximum batch that can fit into the memory of an `ml.p3.2xlarge` instance. If you change the model, instance type, sequence length, and other parameters, you need to do some experiments to find the largest batch size that will fit into GPU memory."
"The `train_batch_size` in the following code cell is the maximum batch that can fit into the memory of the ml.p3.2xlarge instance. If you change the model, instance type, sequence length, and other parameters, you need to do some experiments to find the largest batch size that will fit into GPU memory."
]
},
{
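As an aside to the batch-size note above: one way to run that experiment is to probe increasing batch sizes on the target GPU until an out-of-memory error appears. A rough local sketch, not part of the notebook, assuming a roberta-base classifier and an illustrative sequence length of 128:

```python
# Rough sketch (illustrative, not part of the notebook): find the largest batch
# size that survives one forward/backward pass on the available GPU.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base").cuda()
seq_len = 128  # assumption: match the max sequence length used for training

for batch_size in (8, 12, 16, 18, 24, 32):
    try:
        input_ids = torch.randint(
            0, model.config.vocab_size, (batch_size, seq_len), device="cuda"
        )
        labels = torch.zeros(batch_size, dtype=torch.long, device="cuda")
        model(input_ids=input_ids, labels=labels).loss.backward()
        model.zero_grad(set_to_none=True)
        torch.cuda.empty_cache()
        print(f"batch_size={batch_size} fits")
    except RuntimeError as err:  # CUDA OOM surfaces as a RuntimeError on PyTorch 1.11
        print(f"batch_size={batch_size} does not fit: {err}")
        break
```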
@@ -349,7 +349,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig\n",
"from sagemaker.pytorch import PyTorch\n",
"\n",
"# hyperparameters, which are passed into the training job\n",
"hyperparameters = {\"epochs\": EPOCHS, \"train_batch_size\": 18, \"model_name\": \"roberta-base\"}\n",
@@ -367,26 +367,26 @@
"outputs": [],
"source": [
"# configure the training job\n",
"huggingface_estimator = HuggingFace(\n",
"native_estimator = PyTorch(\n",
" entry_point=TRAINING_SCRIPT,\n",
" source_dir=\"./scripts\",\n",
" instance_type=INSTANCE_TYPE,\n",
" instance_count=1,\n",
" role=role,\n",
" py_version=\"py38\",\n",
" transformers_version=\"4.11.0\",\n",
" pytorch_version=\"1.9.0\",\n",
" transformers_version=\"4.21.1\",\n",
" framework_version=\"1.11.0\",\n",
" volume_size=volume_size,\n",
" hyperparameters=hyperparameters,\n",
" disable_profiler=True,\n",
" debugger_hook_config=False,\n",
")\n",
"\n",
"# start training with our uploaded datasets as input\n",
"huggingface_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n",
"native_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n",
"\n",
"# The name of the training job.\n",
"huggingface_estimator.latest_training_job.name"
"native_estimator.latest_training_job.name"
]
},
{
@@ -417,7 +417,9 @@
"hyperparameters[\"learning_rate\"] = float(\"5e-5\") / 32 * hyperparameters[\"train_batch_size\"]\n",
"\n",
"# If checkpointing is enabled with higher epoch numbers, your disk requirements will increase as well\n",
"volume_size = 60 + 2 * hyperparameters[\"epochs\"]"
"volume_size = 60 + 2 * hyperparameters[\"epochs\"]\n",
"\n",
"from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig"
]
},
{
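The hunk above scales the learning rate linearly with the batch size, anchored at 5e-5 for a reference batch size of 32. A quick check of what that yields (values are illustrative; the compiler-enabled job may use a larger batch size than the native one):

```python
# Quick check (illustrative): the linear learning-rate scaling used above.
base_lr, reference_batch_size = 5e-5, 32
train_batch_size = 18  # the single-GPU batch size used for the native job earlier
print(base_lr / reference_batch_size * train_batch_size)  # 2.8125e-05
```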
@@ -435,8 +437,8 @@
" instance_count=1,\n",
" role=role,\n",
" py_version=\"py38\",\n",
" transformers_version=\"4.11.0\",\n",
" pytorch_version=\"1.9.0\",\n",
" transformers_version=\"4.21.1\",\n",
" pytorch_version=\"1.11.0\",\n",
" volume_size=volume_size,\n",
" hyperparameters=hyperparameters,\n",
" disable_profiler=True,\n",
@@ -463,10 +465,10 @@
"metadata": {},
"outputs": [],
"source": [
"waiter = huggingface_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
"waiter = native_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
" \"training_job_completed_or_stopped\"\n",
")\n",
"waiter.wait(TrainingJobName=huggingface_estimator.latest_training_job.name)\n",
"waiter.wait(TrainingJobName=native_estimator.latest_training_job.name)\n",
"waiter = optimized_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
" \"training_job_completed_or_stopped\"\n",
")\n",
@@ -494,14 +496,14 @@
"outputs": [],
"source": [
"# container image used for native training job\n",
"print(f\"container image used for training job: \\n{huggingface_estimator.image_uri}\\n\")\n",
"print(f\"container image used for training job: \\n{native_estimator.image_uri}\\n\")\n",
"\n",
"# s3 uri where the native trained model is located\n",
"print(f\"s3 uri where the trained model is located: \\n{huggingface_estimator.model_data}\\n\")\n",
"print(f\"s3 uri where the trained model is located: \\n{native_estimator.model_data}\\n\")\n",
"\n",
"# latest training job name for this estimator\n",
"print(\n",
" f\"latest training job name for this estimator: \\n{huggingface_estimator.latest_training_job.name}\\n\"\n",
" f\"latest training job name for this estimator: \\n{native_estimator.latest_training_job.name}\\n\"\n",
")"
]
},
@@ -514,16 +516,16 @@
"%%capture native\n",
"\n",
"# access the logs of the native training job\n",
"huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)"
"native_estimator.sagemaker_session.logs_for_job(native_estimator.latest_training_job.name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new HuggingFace estimator. For example:\n",
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new PyTorch estimator. For example:\n",
"```python\n",
"huggingface_estimator = HuggingFace.attach(\"your_huggingface_training_job_name\")\n",
"native_estimator = PyTorch.attach(\"your_huggingface_training_job_name\")\n",
"```"
]
},
@@ -626,9 +628,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot and compare throughputs of compiled training and native training\n",
"### Plot and compare throughput of compiled training and native training\n",
"\n",
"Visualize average throughputs as reported by HuggingFace and see potential savings."
"Visualize average throughput as reported by HuggingFace and see potential savings."
]
},
{
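A minimal sketch of what that throughput comparison could look like, assuming the average samples-per-second figures have already been parsed out of each job's Trainer logs; the numbers below are placeholders, not measured results:

```python
# Minimal sketch (placeholder numbers, not measured results): bar chart comparing
# average training throughput of the native and compiler-optimized jobs.
import matplotlib.pyplot as plt

avg_throughput = {"Native": 95.0, "Optimized": 135.0}  # samples/second, illustrative
plt.bar(list(avg_throughput), list(avg_throughput.values()))
plt.ylabel("samples / second")
plt.title("Average training throughput")
plt.show()
```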
@@ -782,7 +784,7 @@
"outputs": [],
"source": [
"Billable = {}\n",
"Billable[\"Native\"] = BillableTimeInSeconds(huggingface_estimator.latest_training_job.name)\n",
"Billable[\"Native\"] = BillableTimeInSeconds(native_estimator.latest_training_job.name)\n",
"Billable[\"Optimized\"] = BillableTimeInSeconds(optimized_estimator.latest_training_job.name)\n",
"pd.DataFrame(Billable, index=[\"BillableSecs\"])"
]
@@ -823,7 +825,7 @@
" sm.stop_training_job(TrainingJobName=name)\n",
"\n",
"\n",
"stop_training_job(huggingface_estimator.latest_training_job.name)\n",
"stop_training_job(native_estimator.latest_training_job.name)\n",
"stop_training_job(optimized_estimator.latest_training_job.name)"
]
},
@@ -841,9 +843,9 @@
"hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9"
},
"kernelspec": {
"display_name": "conda_pytorch_latest_p36",
"display_name": "conda_pytorch_p38",
"language": "python",
"name": "conda_pytorch_latest_p36"
"name": "conda_pytorch_p38"
},
"language_info": {
"codemirror_mode": {
@@ -855,7 +857,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
"version": "3.8.12"
}
},
"nbformat": 4,
@@ -0,0 +1,2 @@
transformers == 4.21.1
datasets == 1.18.4
