Creating new notebook for training compiler TensorFlow support. Single Node
Showing 1 changed file with 165 additions and 0 deletions.
sagemaker-training-compiler/tensorflow/single_gpu_single_node/vision-transformer.ipynb
@@ -0,0 +1,165 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "cf50d633",
"metadata": {},
"source": [
"# Compile and Train a Vision Transformer model on the ImageNet Dataset for Image Classification on a Single-Node Single-GPU"
]
},
{
"cell_type": "markdown",
"id": "31e66f8f",
"metadata": {},
"source": [
"1. [Introduction](#Introduction)\n",
"2. [Development Environment and Permissions](#Development-Environment-and-Permissions)\n",
" 1. [Installation](#Installation)\n",
" 2. [SageMaker environment](#SageMaker-environment)\n",
"3. [Processing](#Preprocessing)\n", | ||
" 1. [Tokenization](#Tokenization)\n", | ||
" 2. [Uploading data to sagemaker_session_bucket](#Uploading-data-to-sagemaker_session_bucket)\n", | ||
"4. [SageMaker Training Job](#SageMaker-Training-Job)\n", | ||
" 1. [Training with Native TensorFlow](#Training-with-Native-TensorFlow) \n", | ||
" 2. [Training with Optimized TensorFlow](#Training-with-Optimized-TensorFlow) \n", | ||
" 3. [Analysis](#Analysis)\n", | ||
"5. [Clean Up](#Clean-Up)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "5e3c714a", | ||
"metadata": {}, | ||
"source": [ | ||
"## SageMaker Training Compiler Overview\n", | ||
"\n", | ||
"SageMaker Training Compiler is a capability of SageMaker that makes these hard-to-implement optimizations to reduce training time on GPU instances. The compiler optimizes DL models to accelerate training by more efficiently using SageMaker machine learning (ML) GPU instances. SageMaker Training Compiler is available at no additional charge within SageMaker and can help reduce total billable time as it accelerates training. \n", | ||
"\n", | ||
"SageMaker Training Compiler is integrated into the AWS Deep Learning Containers (DLCs). Using the SageMaker Training Compiler enabled AWS DLCs, you can compile and optimize training jobs on GPU instances with minimal changes to your code. Bring your deep learning models to SageMaker and enable SageMaker Training Compiler to accelerate the speed of your training job on SageMaker ML instances for accelerated computing. \n", | ||
"\n", | ||
"For more information, see [SageMaker Training Compiler](https://docs.aws.amazon.com/sagemaker/latest/dg/training-compiler.html) in the *Amazon SageMaker Developer Guide*.\n", | ||
"\n", | ||
"## Introduction\n", | ||
"\n", | ||
"In this demo, you'll use Hugging Face's `transformers` and `datasets` libraries with Amazon SageMaker Training Compiler to train the `RoBERTa` model on the `Stanford Sentiment Treebank v2 (SST2)` dataset. To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on. \n", | ||
"\n", | ||
"**NOTE:** You can run this demo in SageMaker Studio, SageMaker notebook instances, or your local machine with AWS CLI set up. If using SageMaker Studio or SageMaker notebook instances, make sure you choose one of the PyTorch-based kernels, `Python 3 (PyTorch x.y Python 3.x CPU Optimized)` or `conda_pytorch_p36` respectively.\n", | ||
"\n", | ||
"**NOTE:** This notebook uses two `ml.p3.2xlarge` instances that have single GPU. If you don't have enough quota, see [Request a service quota increase for SageMaker resources](https://docs.aws.amazon.com/sagemaker/latest/dg/regions-quotas.html#service-limit-increase-request-procedure). " | ||
] | ||
}, | ||
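{
"cell_type": "markdown",
"id": "a1f3b2c4",
"metadata": {},
"source": [
"### Installation\n",
"\n",
"The table of contents refers to an installation and SageMaker environment step. The cell below is a minimal sketch of that step: it upgrades the SageMaker Python SDK and prints the session's default bucket and execution role. The version pin and the use of `sagemaker.get_execution_role()` are assumptions for illustration; adjust them to your own environment (for example, pass an explicit IAM role ARN when running outside SageMaker)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d9e0f1",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: upgrade the SageMaker Python SDK and inspect the session defaults.\n",
"# The exact version pin is an assumption; any release with TensorFlow\n",
"# Training Compiler support should work.\n",
"!pip install 'sagemaker>=2.70' boto3 --upgrade --quiet\n",
"\n",
"import sagemaker\n",
"\n",
"sess = sagemaker.Session()\n",
"print('sagemaker SDK version:', sagemaker.__version__)\n",
"print('default bucket:', sess.default_bucket())\n",
"print('execution role:', sagemaker.get_execution_role())  # assumes a SageMaker notebook/Studio environment\n"
]
},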
{
"cell_type": "code",
"execution_count": null,
"id": "bc2ccaab",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Cloning into '/var/folders/5r/j40pqpnd4lv66lxzrjmygjzr0000gs/T/tmpvf3uunxo'...\n",
"Note: switching to 'v2.9.2'.\n",
"\n",
"You are in 'detached HEAD' state. You can look around, make experimental\n",
"changes and commit them, and you can discard any commits you make in this\n",
"state without impacting any branches by switching back to a branch.\n",
"\n",
"If you want to create a new branch to retain commits you create, you may\n",
"do so (now or later) by using -c with the switch command. Example:\n",
"\n",
" git switch -c <new-branch-name>\n",
"\n",
"Or undo this operation with:\n",
"\n",
" git switch -\n",
"\n",
"Turn off this advice by setting config variable advice.detachedHead to false\n",
"\n",
"HEAD is now at 675d26469 Make preprocess_ops visible from tensorflow_models import.\n"
]
}
],
"source": [ | ||
"from sagemaker.tensorflow import TensorFlow\n", | ||
"from sagemaker.training_compiler.config import TrainingCompilerConfig\n", | ||
"\n", | ||
"import boto3\n", | ||
"\n", | ||
"HOPPER_IMAGE_URI='669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-tensorflow-training:2.9.0-gpu-py39-cu112-ubuntu20.04-sagemaker-pr-1839-2022-05-17-00-38-02'\n", | ||
"epochs=1\n", | ||
"batch = 56\n", | ||
"train_steps = int(30000*epochs/batch)\n", | ||
"steps_per_loop = train_steps//10\n", | ||
"overrides=\\\n", | ||
"f\"runtime.enable_xla=False,\"\\\n", | ||
"f\"runtime.num_gpus=1,\"\\\n", | ||
"f\"runtime.distribution_strategy=one_device,\"\\\n", | ||
"f\"runtime.mixed_precision_dtype=float16,\"\\\n", | ||
"f\"task.train_data.global_batch_size={batch},\"\\\n", | ||
"f\"task.train_data.input_path=/opt/ml/input/data/training/caltech*,\"\\\n", | ||
"f\"task.train_data.cache=False,\"\\\n", | ||
"f\"trainer.train_steps={train_steps},\"\\\n", | ||
"f\"trainer.steps_per_loop={steps_per_loop},\"\\\n", | ||
"f\"trainer.summary_interval={steps_per_loop},\"\\\n", | ||
"f\"trainer.checkpoint_interval={train_steps},\"\\\n", | ||
"f\"task.model.backbone.type=vit,\"\n", | ||
"estimator = TensorFlow(\n", | ||
" git_config={\n", | ||
" 'repo': 'https://github.com/tensorflow/models.git',\n", | ||
" 'branch': 'v2.9.2',\n", | ||
" },\n", | ||
" source_dir='.',\n", | ||
" entry_point='official/projects/vit/train.py',\n", | ||
" model_dir=False,\n", | ||
" instance_type='ml.p3.2xlarge',\n", | ||
" instance_count=1,\n", | ||
" image_uri=HOPPER_IMAGE_URI,\n", | ||
" hyperparameters={\n", | ||
" TrainingCompilerConfig.HP_ENABLE_COMPILER : False,\n", | ||
" 'experiment': 'vit_imagenet_pretrain',\n", | ||
" 'mode' : 'train',\n", | ||
" 'model_dir': '/opt/ml/model',\n", | ||
" 'params_override' : overrides,\n", | ||
" },\n", | ||
" debugger_hook_config=None,\n", | ||
" disable_profiler=True,\n", | ||
" max_run=60*60*12, #12 hours\n", | ||
" base_job_name='native-tf29-vit',\n", | ||
" role=boto3.client('iam').get_role(RoleName='SageMaker-Execution-Role-For-PyTest')['Role']['Arn'],\n", | ||
" )\n", | ||
"estimator.fit(inputs='s3://collection-of-ml-datasets/Caltech-256-tfrecords')\n" | ||
] | ||
}, | ||
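{
"cell_type": "markdown",
"id": "c3d4e5f6",
"metadata": {},
"source": [
"### Training with Optimized TensorFlow\n",
"\n",
"The cell below is a sketch of the compiler-enabled counterpart to the native job above, assuming the same container image, dataset, and hyperparameter overrides. It reuses the `overrides` string from the previous cell and differs only in that `TrainingCompilerConfig.HP_ENABLE_COMPILER` is set to `True` and the job gets a different base name (`optimized-tf29-vit`, a name chosen here for illustration)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e5f6a7",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the same ViT training job with SageMaker Training Compiler enabled.\n",
"# Everything except the compiler flag and the job name mirrors the native job above.\n",
"optimized_estimator = TensorFlow(\n",
"    git_config={\n",
"        'repo': 'https://github.com/tensorflow/models.git',\n",
"        'branch': 'v2.9.2',\n",
"    },\n",
"    source_dir='.',\n",
"    entry_point='official/projects/vit/train.py',\n",
"    model_dir=False,\n",
"    instance_type='ml.p3.2xlarge',\n",
"    instance_count=1,\n",
"    image_uri=HOPPER_IMAGE_URI,\n",
"    hyperparameters={\n",
"        TrainingCompilerConfig.HP_ENABLE_COMPILER : True,  # enable the Training Compiler\n",
"        'experiment': 'vit_imagenet_pretrain',\n",
"        'mode' : 'train',\n",
"        'model_dir': '/opt/ml/model',\n",
"        'params_override' : overrides,\n",
"    },\n",
"    debugger_hook_config=None,\n",
"    disable_profiler=True,\n",
"    max_run=60*60*12,  # 12 hours\n",
"    base_job_name='optimized-tf29-vit',\n",
"    role=boto3.client('iam').get_role(RoleName='SageMaker-Execution-Role-For-PyTest')['Role']['Arn'],\n",
")\n",
"optimized_estimator.fit(inputs='s3://collection-of-ml-datasets/Caltech-256-tfrecords')\n"
]
},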
{
"cell_type": "code",
"execution_count": null,
"id": "2a6209ce",
"metadata": {},
"outputs": [],
"source": []
},
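{
"cell_type": "markdown",
"id": "e5f6a7b8",
"metadata": {},
"source": [
"### Analysis\n",
"\n",
"A minimal sketch of the comparison promised in the table of contents: once both training jobs have finished, read the billable seconds reported by `DescribeTrainingJob` for each of them. The field name comes from the SageMaker API; the `optimized_estimator` variable is the one defined in the sketch above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f6a7b8c9",
"metadata": {},
"outputs": [],
"source": [
"# Compare billable time between the native and the compiler-enabled job.\n",
"native_desc = estimator.latest_training_job.describe()\n",
"optimized_desc = optimized_estimator.latest_training_job.describe()\n",
"\n",
"print('Native billable seconds:   ', native_desc['BillableTimeInSeconds'])\n",
"print('Optimized billable seconds:', optimized_desc['BillableTimeInSeconds'])\n"
]
}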
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 5
}