
Update SageMaker Training Compiler Example Notebooks for PT1.11 #3592

Merged: 13 commits, Sep 23, 2022

@@ -0,0 +1,2 @@
transformers == 4.21.1
datasets == 1.18.4

@@ -0,0 +1,2 @@
transformers == 4.21.1
datasets == 1.18.4
@@ -38,11 +38,11 @@
"\n",
"## Introduction\n",
"\n",
"In this demo, you'll use Hugging Face's `transformers` and `datasets` libraries with Amazon SageMaker Training Compiler to train the `RoBERTa` model on the `Stanford Sentiment Treebank v2 (SST2)` dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. \n",
"In this demo, you'll use Hugging Face's transformers and datasets libraries with Amazon SageMaker Training Compiler to train the RoBERTa model on the Stanford Sentiment Treebank v2 (SST2) dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. \n",
"\n",
"**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, `Python 3 (PyTorch x.y Python 3.x CPU Optimized)` or `conda_pytorch_p36` respectively.\n",
"**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, Python 3 (PyTorch x.y Python 3.x CPU Optimized) or conda_pytorch_p36 respectively.\n",
"\n",
"**NOTE:** This notebook uses two `ml.p3.2xlarge` instances that have single GPU. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). "
"**NOTE:** This notebook uses two ml.p3.2xlarge instances that have single GPU. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). "
]
},
{
@@ -58,7 +58,7 @@
"source": [
"### Installation\n",
"\n",
"This example notebook requires the **SageMaker Python SDK v2.70.0** and **transformers v4.11.0**."
"This example notebook requires the **SageMaker Python SDK v2.108.0** and **transformers v4.21**."
]
},
{
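Before running the install cells below, it can help to confirm what the kernel already has. This is a minimal sanity-check sketch (not part of the PR's notebook) that assumes `sagemaker` and `transformers` are importable in the current kernel.

```python
# Minimal version sanity check (sketch; not part of the PR's notebook)
import sagemaker
import transformers
from packaging import version

print("sagemaker:", sagemaker.__version__)
print("transformers:", transformers.__version__)

# The updated notebook targets SageMaker Python SDK >= 2.108.0 and transformers 4.21.x
assert version.parse(sagemaker.__version__) >= version.parse("2.108.0")
assert version.parse(transformers.__version__) >= version.parse("4.21.0")
```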
@@ -67,7 +67,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install sagemaker botocore boto3 awscli --upgrade"
"!pip install \"sagemaker>=2.108.0\" botocore boto3 awscli \"torch==1.11.0\" --upgrade"
]
},
{
@@ -76,7 +76,7 @@
"metadata": {},
"outputs": [],
"source": [
"!pip install -U transformers datasets --upgrade"
"!pip install -U \"transformers==4.21.1\" datasets --upgrade"
]
},
{
@@ -99,7 +99,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Copy and run the following code if you need to upgrade ipywidgets for `datasets` library and restart kernel. This is only needed when prerpocessing is done in the notebook.\n",
"Copy and run the following code if you need to upgrade ipywidgets for datasets library and restart kernel. This is only needed when preprocessing is done in the notebook.\n",
"\n",
"```python\n",
"%%capture\n",
@@ -164,7 +164,7 @@
"\n",
"If you'd like to try other training datasets later, you can simply use this method.\n",
"\n",
"For this example notebook, we prepared the `SST2` dataset in the public SageMaker sample file S3 bucket. The following code cells show how you can directly load the dataset and convert to a HuggingFace DatasetDict."
"For this example notebook, we prepared the SST2 dataset in the public SageMaker sample file S3 bucket. The following code cells show how you can directly load the dataset and convert to a HuggingFace DatasetDict."
]
},
{
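The loading cells themselves are collapsed in this diff. A rough sketch of the idea follows; the S3 keys, delimiter, and column names are placeholders (the notebook defines the real ones), and only `load_dataset` from the `datasets` library is taken as given.

```python
# Sketch only: the S3 keys and file layout below are hypothetical placeholders.
from datasets import load_dataset

# Copy the raw splits locally first (sagemaker-sample-files is a public bucket)
!aws s3 cp s3://sagemaker-sample-files/datasets/text/SST2/sst2.train ./sst2.train
!aws s3 cp s3://sagemaker-sample-files/datasets/text/SST2/sst2.test ./sst2.test

# The generic "csv" builder returns a DatasetDict keyed by split
dataset = load_dataset(
    "csv",
    data_files={"train": "./sst2.train", "test": "./sst2.test"},
    delimiter="\t",                   # assumed separator
    column_names=["label", "text"],   # assumed column layout
)
print(dataset)
```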
@@ -173,7 +173,7 @@
"source": [
"## Preprocessing\n",
"\n",
"We download and preprocess the `SST2` dataset from the `s3://sagemaker-sample-files/datasets` bucket. After preprocessing, we'll upload the dataset to the `sagemaker_session_bucket`, which will be used as a data channel for the training job."
"We download and preprocess the SST2 dataset from the s3://sagemaker-sample-files/datasets bucket. After preprocessing, we'll upload the dataset to the sagemaker_session_bucket, which will be used as a data channel for the training job."
]
},
{
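The preprocessing cells are collapsed here. The typical shape of that step is sketched below, under the assumption that the notebook tokenizes with the `roberta-base` tokenizer and that the text/label column names match the sketch above.

```python
# Sketch of the tokenization step (column names are assumptions, not the notebook's exact code)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    # Fixed-length padding keeps batch shapes static, which suits the Training Compiler
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=128)

dataset = dataset.map(tokenize, batched=True)
dataset = dataset.rename_column("label", "labels")
dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
```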
@@ -253,9 +253,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Uploading data to `sagemaker_session_bucket`\n",
"## Uploading data to sagemaker_session_bucket\n",
"\n",
"We are going to use the new `FileSystem` [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload our preprocessed dataset to S3."
"We are going to use the new FileSystem [integration](https://huggingface.co/docs/datasets/filesystems.html) to upload our preprocessed dataset to S3."
]
},
{
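The upload cells are collapsed in the diff; with `datasets` 1.18 the pattern is roughly as follows. The S3 prefixes are illustrative, `sagemaker_session_bucket` is assumed to be defined earlier in the notebook, and `s3fs` must be installed for the filesystem integration.

```python
# Rough sketch of the S3 upload via the datasets filesystem integration (datasets 1.x API)
from datasets.filesystems import S3FileSystem  # requires s3fs

s3 = S3FileSystem()  # picks up the notebook's AWS credentials

# Illustrative prefixes; sagemaker_session_bucket is assumed defined earlier in the notebook
training_input_path = f"s3://{sagemaker_session_bucket}/datasets/sst2/train"
test_input_path = f"s3://{sagemaker_session_bucket}/datasets/sst2/test"

dataset["train"].save_to_disk(training_input_path, fs=s3)
dataset["test"].save_to_disk(test_input_path, fs=s3)
```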
@@ -284,11 +284,11 @@
"source": [
"## SageMaker Training Job\n",
"\n",
"To create a SageMaker training job, we use a `HuggingFace` estimator. Using the estimator, you can define which fine-tuning script should SageMaker use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.\n",
"To create a SageMaker training job, we use a HuggingFace/PyTorch estimator. Using the estimator, you can define which fine-tuning script should SageMaker use through entry_point, which instance_type to use for training, which hyperparameters to pass, and so on.\n",
"\n",
"When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `HuggingFace` Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.\n",
"When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the HuggingFace Deep Learning Container, uploads your training script, and downloads the data from sagemaker_session_bucket into the container at /opt/ml/input/data.\n",
"\n",
"In the following section, you learn how to set up two versions of the SageMaker `HuggingFace` estimator, a native one without the compiler and an optimized one with the compiler."
"In the following section, you learn how to set up two versions of the SageMaker HuggingFace/PyTorch estimator, a native one without the compiler and an optimized one with the compiler."
]
},
{
@@ -302,7 +302,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Set up an option for fine-tuning or full training. Set `FINE_TUNING = 1` for fine-tuning and using `fine_tune_with_huggingface.py`. Set `FINE_TUNING = 0` for full training and using `full_train_roberta_with_huggingface.py`."
"Set up an option for fine-tuning or full training. `FINE_TUNING = 1` is for fine-tuning, and it will use fine_tune_with_huggingface.py. `FINE_TUNING = 0` is for full training, and it will use full_train_roberta_with_huggingface.py."
]
},
{
@@ -318,7 +318,7 @@
"FULL_TRAINING = not FINE_TUNING\n",
"\n",
"# Fine tuning is typically faster and is done for fewer epochs\n",
"EPOCHS = 4 if FINE_TUNING else 100\n",
"EPOCHS = 7 if FINE_TUNING else 100\n",
"\n",
"TRAINING_SCRIPT = (\n",
" \"fine_tune_with_huggingface.py\" if FINE_TUNING else \"full_train_roberta_with_huggingface.py\"\n",
@@ -340,7 +340,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The `train_batch_size` in the following code cell is the maximum batch that can fit into the memory of an `ml.p3.2xlarge` instance. If you change the model, instance type, sequence length, and other parameters, you need to do some experiments to find the largest batch size that will fit into GPU memory."
"The `train_batch_size` in the following code cell is the maximum batch that can fit into the memory of the ml.p3.2xlarge instance. If you change the model, instance type, sequence length, and other parameters, you need to do some experiments to find the largest batch size that will fit into GPU memory."
]
},
{
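One practical way to run that experiment, sketched here for a local CUDA device rather than as part of the notebook, is to step through candidate batch sizes until a forward/backward pass runs out of memory. It ignores optimizer state, so treat the result as a starting point rather than a guarantee.

```python
# Local sketch for probing the largest batch size that fits in GPU memory.
# Assumes a CUDA device and the torch/transformers versions installed above.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base").cuda()
seq_len = 128

for bs in (8, 12, 16, 18, 20, 24, 32):
    try:
        batch = {
            "input_ids": torch.randint(0, 50000, (bs, seq_len)).cuda(),
            "attention_mask": torch.ones(bs, seq_len, dtype=torch.long).cuda(),
            "labels": torch.zeros(bs, dtype=torch.long).cuda(),
        }
        model(**batch).loss.backward()   # forward + backward allocates activations and grads
        model.zero_grad(set_to_none=True)
        print(f"batch size {bs}: fits")
    except RuntimeError:                 # CUDA OOM surfaces as a RuntimeError
        print(f"batch size {bs}: out of memory")
        break
    finally:
        torch.cuda.empty_cache()
```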
@@ -349,7 +349,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig\n",
"from sagemaker.pytorch import PyTorch\n",
"\n",
"# hyperparameters, which are passed into the training job\n",
"hyperparameters = {\"epochs\": EPOCHS, \"train_batch_size\": 18, \"model_name\": \"roberta-base\"}\n",
@@ -367,26 +367,26 @@
"outputs": [],
"source": [
"# configure the training job\n",
"huggingface_estimator = HuggingFace(\n",
"native_estimator = PyTorch(\n",
" entry_point=TRAINING_SCRIPT,\n",
" source_dir=\"./scripts\",\n",
" instance_type=INSTANCE_TYPE,\n",
" instance_count=1,\n",
" role=role,\n",
" py_version=\"py38\",\n",
" transformers_version=\"4.11.0\",\n",
" pytorch_version=\"1.9.0\",\n",
" transformers_version=\"4.21.1\",\n",
" framework_version=\"1.11.0\",\n",
" volume_size=volume_size,\n",
" hyperparameters=hyperparameters,\n",
" disable_profiler=True,\n",
" debugger_hook_config=False,\n",
")\n",
"\n",
"# start training with our uploaded datasets as input\n",
"huggingface_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n",
"native_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n",
"\n",
"# The name of the training job.\n",
"huggingface_estimator.latest_training_job.name"
"native_estimator.latest_training_job.name"
]
},
{
@@ -417,7 +417,9 @@
"hyperparameters[\"learning_rate\"] = float(\"5e-5\") / 32 * hyperparameters[\"train_batch_size\"]\n",
"\n",
"# If checkpointing is enabled with higher epoch numbers, your disk requirements will increase as well\n",
"volume_size = 60 + 2 * hyperparameters[\"epochs\"]"
"volume_size = 60 + 2 * hyperparameters[\"epochs\"]\n",
"\n",
"from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig"
]
},
{
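The learning-rate line in this hunk applies a linear scaling rule: 5e-5 is treated as the reference rate for a batch size of 32 and scaled in proportion to the batch size actually used. A quick worked example (the batch size here is illustrative):

```python
# Worked example of the linear learning-rate scaling used above
reference_lr, reference_batch = 5e-5, 32
train_batch_size = 24  # illustrative value
learning_rate = reference_lr / reference_batch * train_batch_size
print(learning_rate)  # 3.75e-05
```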
@@ -435,8 +437,8 @@
" instance_count=1,\n",
" role=role,\n",
" py_version=\"py38\",\n",
" transformers_version=\"4.11.0\",\n",
" pytorch_version=\"1.9.0\",\n",
" transformers_version=\"4.21.1\",\n",
" pytorch_version=\"1.11.0\",\n",
" volume_size=volume_size,\n",
" hyperparameters=hyperparameters,\n",
" disable_profiler=True,\n",
@@ -463,10 +465,10 @@
"metadata": {},
"outputs": [],
"source": [
"waiter = huggingface_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
"waiter = native_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
" \"training_job_completed_or_stopped\"\n",
")\n",
"waiter.wait(TrainingJobName=huggingface_estimator.latest_training_job.name)\n",
"waiter.wait(TrainingJobName=native_estimator.latest_training_job.name)\n",
"waiter = optimized_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
" \"training_job_completed_or_stopped\"\n",
")\n",
@@ -494,14 +496,14 @@
"outputs": [],
"source": [
"# container image used for native training job\n",
"print(f\"container image used for training job: \\n{huggingface_estimator.image_uri}\\n\")\n",
"print(f\"container image used for training job: \\n{native_estimator.image_uri}\\n\")\n",
"\n",
"# s3 uri where the native trained model is located\n",
"print(f\"s3 uri where the trained model is located: \\n{huggingface_estimator.model_data}\\n\")\n",
"print(f\"s3 uri where the trained model is located: \\n{native_estimator.model_data}\\n\")\n",
"\n",
"# latest training job name for this estimator\n",
"print(\n",
" f\"latest training job name for this estimator: \\n{huggingface_estimator.latest_training_job.name}\\n\"\n",
" f\"latest training job name for this estimator: \\n{native_estimator.latest_training_job.name}\\n\"\n",
")"
]
},
@@ -514,16 +516,16 @@
"%%capture native\n",
"\n",
"# access the logs of the native training job\n",
"huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)"
"native_estimator.sagemaker_session.logs_for_job(native_estimator.latest_training_job.name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new HuggingFace estimator. For example:\n",
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new PyTorch estimator. For example:\n",
"```python\n",
"huggingface_estimator = HuggingFace.attach(\"your_huggingface_training_job_name\")\n",
"native_estimator = PyTorch.attach(\"your_huggingface_training_job_name\")\n",
"```"
]
},
@@ -626,9 +628,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot and compare throughputs of compiled training and native training\n",
"### Plot and compare throughput of compiled training and native training\n",
"\n",
"Visualize average throughputs as reported by HuggingFace and see potential savings."
"Visualize average throughput as reported by HuggingFace and see potential savings."
]
},
{
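The plotting cells are collapsed here. One way to do this, assuming the captured log output in `native` and `optimized` contains the Trainer's `train_samples_per_second` metric, is sketched below; the regex and variable names are assumptions, not the notebook's exact code.

```python
# Sketch: extract the Trainer-reported throughput from the captured logs and compare.
import re
import pandas as pd

def throughput_from_logs(captured):
    # The Trainer prints lines like "train_samples_per_second = 123.45" in its final metrics
    matches = re.findall(r"train_samples_per_second\s*=\s*([0-9.]+)", captured.stdout)
    return float(matches[-1]) if matches else None

throughput = {
    "Native": throughput_from_logs(native),
    "Optimized": throughput_from_logs(optimized),
}
pd.DataFrame(throughput, index=["samples/sec"]).plot(kind="bar")
```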
@@ -782,7 +784,7 @@
"outputs": [],
"source": [
"Billable = {}\n",
"Billable[\"Native\"] = BillableTimeInSeconds(huggingface_estimator.latest_training_job.name)\n",
"Billable[\"Native\"] = BillableTimeInSeconds(native_estimator.latest_training_job.name)\n",
"Billable[\"Optimized\"] = BillableTimeInSeconds(optimized_estimator.latest_training_job.name)\n",
"pd.DataFrame(Billable, index=[\"BillableSecs\"])"
]
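`BillableTimeInSeconds` is defined in a collapsed cell above; a minimal sketch of such a helper, using the `BillableTimeInSeconds` field of the DescribeTrainingJob API response, could look like this.

```python
# Minimal sketch of a billable-time helper (the notebook's own definition is collapsed above)
import boto3

def BillableTimeInSeconds(name):
    sm = boto3.client("sagemaker")
    return sm.describe_training_job(TrainingJobName=name)["BillableTimeInSeconds"]
```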
@@ -823,7 +825,7 @@
" sm.stop_training_job(TrainingJobName=name)\n",
"\n",
"\n",
"stop_training_job(huggingface_estimator.latest_training_job.name)\n",
"stop_training_job(native_estimator.latest_training_job.name)\n",
"stop_training_job(optimized_estimator.latest_training_job.name)"
]
},
@@ -841,9 +843,9 @@
"hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9"
},
"kernelspec": {
"display_name": "conda_pytorch_latest_p36",
"display_name": "conda_pytorch_p38",
"language": "python",
"name": "conda_pytorch_latest_p36"
"name": "conda_pytorch_p38"
},
"language_info": {
"codemirror_mode": {
@@ -855,7 +857,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
"version": "3.8.12"
}
},
"nbformat": 4,
@@ -0,0 +1,2 @@
transformers == 4.21.1
datasets == 1.18.4