Update SageMaker Training Compiler Example Notebooks for PT1.11 #3592

Merged
merged 13 commits on Sep 23, 2022
@@ -563,11 +563,11 @@
"source": [
"## SageMaker Training Job\n",
"\n",
"To create a SageMaker training job, we use a `HuggingFace` estimator. Using the estimator, you can define which fine-tuning script should SageMaker use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.\n",
"To create a SageMaker training job, we use a `PyTorch` estimator. Using the estimator, you can define which fine-tuning script should SageMaker use through `entry_point`, which `instance_type` to use for training, which `hyperparameters` to pass, and so on.\n",
"\n",
"When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `HuggingFace` Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.\n",
"When a SageMaker training job starts, SageMaker takes care of starting and managing all the required machine learning instances, picks up the `PyTorch` Deep Learning Container, uploads your training script, and downloads the data from `sagemaker_session_bucket` into the container at `/opt/ml/input/data`.\n",
"\n",
"In the following section, you learn how to set up two versions of the SageMaker `HuggingFace` estimator, a native one without the compiler and an optimized one with the compiler."
"In the following section, you learn how to set up two versions of the SageMaker `PyTorch` estimator, a native one without the compiler and an optimized one with the compiler."
]
},
{
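The cell above notes that the channels passed to `estimator.fit()` are downloaded into the container at `/opt/ml/input/data`. A minimal sketch of how a training script typically locates those channels, assuming they were named `train` and `test` in the `fit()` call:

```python
# Inside the training container, SageMaker mounts each input channel under
# /opt/ml/input/data/<channel> and exposes its path via an SM_CHANNEL_* variable.
import os

train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
test_dir = os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test")

print(f"train data: {train_dir}")
print(f"test data: {test_dir}")
```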
@@ -581,7 +581,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Below, we run a native PyTorch training job with the `HuggingFace` estimator on a `ml.p3.2xlarge` instance. \n",
"Below, we run a native PyTorch training job with the `PyTorch` estimator on a `ml.p3.2xlarge` instance. \n",
"\n",
"We run a batch size of 28 on our native training job and 52 on our Training Compiler training job to make an apples to apples comparision. These batch sizes along with the max_length variable get us close to 100% GPU memory utilization.\n",
"\n",
@@ -596,7 +596,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.huggingface import HuggingFace\n",
"from sagemaker.pytorch import PyTorch\n",
"\n",
"batch_size_native = 28\n",
"learning_rate_native = float(\"3e-5\") / 32 * batch_size_native\n",
@@ -622,26 +622,26 @@
"metadata": {},
"outputs": [],
"source": [
"huggingface_estimator = HuggingFace(\n",
"native_estimator = PyTorch(\n",
" entry_point=\"qa_trainer_huggingface.py\",\n",
" source_dir=\"./scripts\",\n",
" instance_type=\"ml.p3.2xlarge\",\n",
" instance_count=1,\n",
" role=role,\n",
" py_version=\"py38\",\n",
" transformers_version=\"4.11.0\",\n",
" pytorch_version=\"1.9.0\",\n",
" transformers_version=\"4.21.1\",\n",
" framework_version=\"1.11.0\",\n",
" volume_size=volume_size,\n",
" hyperparameters=hyperparameters,\n",
" disable_profiler=True,\n",
" debugger_hook_config=False,\n",
")\n",
"\n",
"# starting the train job with our uploaded datasets as input\n",
"huggingface_estimator.fit({\"train\": training_input_path, \"test\": eval_input_path}, wait=False)\n",
"native_estimator.fit({\"train\": training_input_path, \"test\": eval_input_path}, wait=False)\n",
"\n",
"# The name of the training job. You might need to note this down in case you lose connection to your notebook.\n",
"huggingface_estimator.latest_training_job.name"
"native_estimator.latest_training_job.name"
]
},
{
@@ -668,8 +668,8 @@
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig\n",
"\n",
"from sagemaker.pytorch import PyTorch\n",
"from sagemaker.training_compiler.config import TrainingCompilerConfig\n",
"# an updated max batch size that can fit into GPU memory with compiler\n",
"batch_size = 52\n",
"\n",
@@ -697,16 +697,16 @@
"metadata": {},
"outputs": [],
"source": [
"compile_estimator = HuggingFace(\n",
"compile_estimator = PyTorch(\n",
" entry_point=\"qa_trainer_huggingface.py\",\n",
" compiler_config=TrainingCompilerConfig(),\n",
" source_dir=\"./scripts\",\n",
" instance_type=\"ml.p3.2xlarge\",\n",
" instance_count=1,\n",
" role=role,\n",
" py_version=\"py38\",\n",
" transformers_version=\"4.11.0\",\n",
" pytorch_version=\"1.9.0\",\n",
" transformers_version=\"4.21.1\",\n",
" framework_version=\"1.11.0\",\n",
" volume_size=volume_size,\n",
" hyperparameters=hyperparameters,\n",
" disable_profiler=True,\n",
@@ -728,10 +728,10 @@
"source": [
"# Wait for training jobs to complete.\n",
"\n",
"waiter = huggingface_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
"waiter = native_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
" \"training_job_completed_or_stopped\"\n",
")\n",
"waiter.wait(TrainingJobName=huggingface_estimator.latest_training_job.name)\n",
"waiter.wait(TrainingJobName=native_estimator.latest_training_job.name)\n",
"waiter = compile_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
" \"training_job_completed_or_stopped\"\n",
")\n",
@@ -759,14 +759,14 @@
"outputs": [],
"source": [
"# container image used for native training job\n",
"print(f\"container image used for training job: \\n{huggingface_estimator.image_uri}\\n\")\n",
"print(f\"container image used for training job: \\n{native_estimator.image_uri}\\n\")\n",
"\n",
"# s3 uri where the native trained model is located\n",
"print(f\"s3 uri where the trained model is located: \\n{huggingface_estimator.model_data}\\n\")\n",
"print(f\"s3 uri where the trained model is located: \\n{native_estimator.model_data}\\n\")\n",
"\n",
"# latest training job name for this estimator\n",
"print(\n",
" f\"latest training job name for this estimator: \\n{huggingface_estimator.latest_training_job.name}\\n\"\n",
" f\"latest training job name for this estimator: \\n{native_estimator.latest_training_job.name}\\n\"\n",
")"
]
},
@@ -779,16 +779,16 @@
"%%capture native\n",
"\n",
"# access the logs of the native training job\n",
"huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)"
"native_estimator.sagemaker_session.logs_for_job(native_estimator.latest_training_job.name)"
]
},
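The `%%capture native` cell stores everything `logs_for_job` prints in the `native` variable; the analysis of those logs is collapsed in this diff. A hedged sketch of extracting throughput figures from the captured text, assuming the Hugging Face `Trainer` log format:

```python
import re

# `native` is the CapturedIO object created by the %%capture cell above;
# logs_for_job() writes the CloudWatch log stream to stdout.
log_text = native.stdout

# Trainer prints summary lines such as "train_samples_per_second = 12.34"
# (the exact wording is an assumption here).
throughput = re.findall(r"train_samples_per_second\D*([0-9.]+)", log_text)
print(throughput)
```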
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new HuggingFace estimator. For example:\n",
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new native estimator. For example:\n",
"```python\n",
"huggingface_estimator = HuggingFace.attach(\"your_huggingface_training_job_name\")\n",
"native_estimator = native.attach(\"your_native_training_job_name\")\n",
"```"
]
},
@@ -833,9 +833,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new HuggingFace estimator. For example:\n",
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new native estimator. For example:\n",
"```python\n",
"optimized_est = HuggingFace.attach(\"your_compiled_huggingface_training_job_name\")\n",
"optimized_est = native.attach(\"your_compiled_native_training_job_name\")\n",
"```"
]
},
@@ -945,7 +945,7 @@
"source": [
"sm = boto3.client(\"sagemaker\")\n",
"native_job = sm.describe_training_job(\n",
" TrainingJobName=huggingface_estimator.latest_training_job.name\n",
" TrainingJobName=native_estimator.latest_training_job.name\n",
")\n",
"\n",
"compile_job = sm.describe_training_job(TrainingJobName=compile_estimator.latest_training_job.name)\n",
@@ -1051,7 +1051,7 @@
" sm.stop_training_job(TrainingJobName=name)\n",
"\n",
"\n",
"stop_training_job(huggingface_estimator.latest_training_job.name)\n",
"stop_training_job(native_estimator.latest_training_job.name)\n",
"stop_training_job(compile_estimator.latest_training_job.name)"
]
},
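Only the tail of the `stop_training_job` helper is visible in this hunk. A hedged sketch of the status guard such a helper usually needs, since stopping a job that has already finished results in an error:

```python
import boto3

sm = boto3.client("sagemaker")


def stop_training_job(name):
    # Only jobs still InProgress can be stopped; a completed or stopped job
    # would make StopTrainingJob fail, so check the status first.
    status = sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
    if status == "InProgress":
        sm.stop_training_job(TrainingJobName=name)
```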
@@ -1070,9 +1070,9 @@
},
"instance_type": "ml.p3.2xlarge",
"kernelspec": {
"display_name": "conda_pytorch_p36",
"display_name": "conda_pytorch_p38",
"language": "python",
"name": "conda_pytorch_p36"
"name": "conda_pytorch_p38"
},
"language_info": {
"codemirror_mode": {
@@ -1084,7 +1084,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
"version": "3.8.12"
}
},
"nbformat": 4,
@@ -0,0 +1,7 @@
transformers == 4.21.1
accelerate
torch >= 1.3
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
evaluate
@@ -404,7 +404,7 @@
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.huggingface import HuggingFace\n",
"from sagemaker.pytorch import PyTorch\n",
"\n",
"hyperparameters = {\"epochs\": 5, \"train_batch_size\": 14, \"model_name\": \"bert-base-cased\"}\n",
"\n",
@@ -421,11 +421,11 @@
"metadata": {},
"outputs": [],
"source": [
"# By setting the hyperparameters in the HuggingFace Estimator below\n",
"# By setting the hyperparameters in the PyTorch Estimator below\n",
"# and using the AutoModelForSequenceClassification class in the train.py script\n",
"# we can fine-tune the bert-base-cased pretrained Transformer for sequence classification\n",
"\n",
"huggingface_estimator = HuggingFace(\n",
"native_estimator = PyTorch(\n",
" entry_point=\"train.py\",\n",
" source_dir=\"./scripts\",\n",
" instance_type=\"ml.p3.2xlarge\",\n",
@@ -434,18 +434,18 @@
" py_version=\"py38\",\n",
" base_job_name=\"native-sst-bert-base-cased-p3-2x-pytorch-190\",\n",
" volume_size=volume_size,\n",
" transformers_version=\"4.11.0\",\n",
" pytorch_version=\"1.9.0\",\n",
" transformers_version=\"4.21.1\",\n",
" framework_version=\"1.11.0\",\n",
" hyperparameters=hyperparameters,\n",
" disable_profiler=True,\n",
" debugger_hook_config=False,\n",
")\n",
"\n",
"# starting the train job with our uploaded datasets as input\n",
"huggingface_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n",
"native_estimator.fit({\"train\": training_input_path, \"test\": test_input_path}, wait=False)\n",
"\n",
"# The name of the training job. You might need to note this down in case your kernel crashes.\n",
"huggingface_estimator.latest_training_job.name"
"native_estimator.latest_training_job.name"
]
},
{
@@ -472,7 +472,8 @@
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig\n",
"from sagemaker.pytorch import PyTorch\n",
"from sagemaker.training_compiler.config import TrainingCompilerConfig\n",
"\n",
"hyperparameters = {\n",
" \"epochs\": 5,\n",
@@ -493,13 +494,13 @@
"metadata": {},
"outputs": [],
"source": [
"# By setting the hyperparameters in the HuggingFace Estimator below\n",
"# By setting the hyperparameters in the PyTorch Estimator below\n",
"# and using the AutoModelForSequenceClassification class in the train.py script\n",
"# the bert-base-cased pretrained Transformer is fine-tuned for sequence classification\n",
"\n",
"# Importantly, the TrainingCompilerConfig() is passed below to enable the SageMaker Training Compiler\n",
"\n",
"sm_training_compiler_estimator = HuggingFace(\n",
"sm_training_compiler_estimator = PyTorch(\n",
" entry_point=\"train.py\",\n",
" source_dir=\"./scripts\",\n",
" instance_type=\"ml.p3.2xlarge\",\n",
@@ -508,8 +509,8 @@
" py_version=\"py38\",\n",
" base_job_name=\"sm-compiled-sst-bert-base-cased-p3-2x-pytorch-190\",\n",
" volume_size=volume_size,\n",
" transformers_version=\"4.11.0\",\n",
" pytorch_version=\"1.9.0\",\n",
" transformers_version=\"4.21.1\",\n",
" framework_version=\"1.11.0\",\n",
" compiler_config=TrainingCompilerConfig(),\n",
" hyperparameters=hyperparameters,\n",
" disable_profiler=True,\n",
@@ -538,10 +539,10 @@
"metadata": {},
"outputs": [],
"source": [
"waiter = huggingface_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
"waiter = native_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
" \"training_job_completed_or_stopped\"\n",
")\n",
"waiter.wait(TrainingJobName=huggingface_estimator.latest_training_job.name)\n",
"waiter.wait(TrainingJobName=native_estimator.latest_training_job.name)\n",
"waiter = sm_training_compiler_estimator.sagemaker_session.sagemaker_client.get_waiter(\n",
" \"training_job_completed_or_stopped\"\n",
")\n",
@@ -569,14 +570,14 @@
"outputs": [],
"source": [
"# container image used for native training job\n",
"print(f\"container image used for training job: \\n{huggingface_estimator.image_uri}\\n\")\n",
"print(f\"container image used for training job: \\n{native_estimator.image_uri}\\n\")\n",
"\n",
"# s3 uri where the native trained model is located\n",
"print(f\"s3 uri where the trained model is located: \\n{huggingface_estimator.model_data}\\n\")\n",
"print(f\"s3 uri where the trained model is located: \\n{native_estimator.model_data}\\n\")\n",
"\n",
"# latest training job name for this estimator\n",
"print(\n",
" f\"latest training job name for this estimator: \\n{huggingface_estimator.latest_training_job.name}\\n\"\n",
" f\"latest training job name for this estimator: \\n{native_estimator.latest_training_job.name}\\n\"\n",
")"
]
},
@@ -591,16 +592,16 @@
"%%capture native\n",
"\n",
"# access the logs of the native training job\n",
"huggingface_estimator.sagemaker_session.logs_for_job(huggingface_estimator.latest_training_job.name)"
"native_estimator.sagemaker_session.logs_for_job(native_estimator.latest_training_job.name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new HuggingFace estimator. For example:\n",
"**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new native estimator. For example:\n",
"```python\n",
"huggingface_estimator = HuggingFace.attach(\"your_huggingface_training_job_name\")\n",
"native_estimator = HuggingFace.attach(\"your_native_training_job_name\")\n",
"```"
]
},
@@ -916,7 +917,7 @@
" sm.stop_training_job(TrainingJobName=name)\n",
"\n",
"\n",
"stop_training_job(huggingface_estimator.latest_training_job.name)\n",
"stop_training_job(native_estimator.latest_training_job.name)\n",
"stop_training_job(sm_training_compiler_estimator.latest_training_job.name)"
]
},
@@ -933,9 +934,9 @@
"hash": "c281c456f1b8161c8906f4af2c08ed2c40c50136979eaae69688b01f70e9f4a9"
},
"kernelspec": {
"display_name": "conda_pytorch_p36",
"display_name": "conda_pytorch_p38",
"language": "python",
"name": "conda_pytorch_p36"
"name": "conda_pytorch_p38"
},
"language_info": {
"codemirror_mode": {
@@ -947,7 +947,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.13"
"version": "3.8.12"
}
},
"nbformat": 4,
@@ -0,0 +1,7 @@
transformers == 4.21.1
accelerate
torch >= 1.3
datasets >= 1.8.0
sentencepiece != 0.1.92
protobuf
evaluate