Update SMMP GPT sample (aws#3433)
* update smp

* update smp

* fp16 change

* minor fix

* minor fix

* pin transformer version

* Update SMMP notebooks

* update gpt2 script

* update notebook

* minor fix

* minor fix

* minor fix

* minor fix

* fix

* update gptj script and notebook

* update memory tracker

* minor fix

* fix

* fix gptj notebook

* Update training/distributed_training/pytorch/model_parallel/gpt-j/11_train_gptj_smp_tensor_parallel_notebook.ipynb

Co-authored-by: Miyoung <[email protected]>

* Fix typos & expressions

* reformat

Co-authored-by: Miyoung <[email protected]>
Co-authored-by: Aaron Markham <[email protected]>
3 people authored and atqy committed Oct 28, 2022
1 parent 177a728 commit 3f52f05
Showing 23 changed files with 1,009 additions and 5,521 deletions.
@@ -21,7 +21,7 @@
  "This notebook depends on the following files and folders:\n",
  "\n",
  "1. `train_gptj_smp_script.py`: This is an entrypoint script that is passed to the PyTorch estimator in the notebook instructions. This script is responsible for end-to-end training of the GPT-J model with SMP. The script has additional comments at places where the SMP API is used.\n",
- "2. `fp16`: This folder is used for 16-bit float training, which contains an fp16 optimizer and various fp16 utilities.\n",
+ "2. `memory_tracker.py`: This contains the functions to track memory usage.\n",
  "3. `learning_rates.py`: This contains the functions for the learning rate schedule.\n",
  "4. `requirements.txt`: This installs the dependencies, such as the right version of Hugging Face transformers.\n",
  "5. `preprocess.py`: This downloads and preprocesses the sst2/glue dataset.\n",
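The new `memory_tracker.py` helper replaces the deleted `fp16` folder in the file list above. As a rough sketch of what such a tracker can look like (illustrative only, built on PyTorch's standard CUDA memory APIs; this is not the repository's actual implementation):

```python
# Illustrative sketch of a GPU memory tracker, NOT the repo's memory_tracker.py.
import torch


def memory_status(tag=""):
    """Print current, reserved, and peak CUDA memory for this process's device."""
    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    dev = torch.cuda.current_device()
    gib = 2**30
    allocated = torch.cuda.memory_allocated(dev) / gib  # live tensors
    reserved = torch.cuda.memory_reserved(dev) / gib    # allocator cache
    peak = torch.cuda.max_memory_allocated(dev) / gib   # high-water mark
    print(f"[{tag}] allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")
```

A training script would call something like `memory_status("after step")` around forward and backward passes to spot leaks or allocator fragmentation per rank.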
@@ -11,7 +11,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "This notebook walks you through how to use the tensor parallelism feature provided by the SageMaker model parallelism library. You'll learn how to train the GPT-J model with tensor parallelism on the GLUE sst2 dataset.\n",
+ "This notebook walks you through how to use the tensor parallelism feature provided by the SageMaker model parallelism library. You'll learn how to run FP16 training of the GPT-J model with tensor parallelism on the GLUE sst2 dataset.\n",
  "\n",
  "## Install and Upgrade Libraries\n",
  "\n",
@@ -82,7 +82,7 @@
  "import os\n",
  "\n",
  "from sagemaker import get_execution_role\n",
- "from sagemaker.huggingface import HuggingFace\n",
+ "from sagemaker.pytorch import PyTorch\n",
  "from smexperiments.experiment import Experiment\n",
  "from smexperiments.trial import Trial\n",
  "import boto3\n",
@@ -611,6 +611,7 @@
  " \"activation_checkpointing\": 1,\n",
  " \"activation_strategy\": \"each\",\n",
  " \"optimize\": \"speed\",\n",
+ " \"zipped_data\": 0,\n",
  " # below flag loads model and optimizer state from checkpoint_s3_uri\n",
  " # 'load_partial': 1,\n",
  "}\n",
@@ -809,7 +810,7 @@
  "cell_type": "markdown",
  "metadata": {},
  "source": [
- "### Create a SageMaker HuggingFace 🤗 Estimator\n",
+ "### Create a SageMaker PyTorch Estimator\n",
  "\n",
  "The following cell constructs a PyTorch estimator using the parameters defined above. To see how the SageMaker tensor parallelism modules and functions are applied to the script, see the `train_gptj_smp_tensor_parallel_script.py` file and the private preview documentation. "
 ]
@@ -826,7 +827,7 @@
  " kwargs[\"security_group_ids\"] = [fsx_security_group_id]\n",
  " kwargs[\"subnets\"] = [fsx_subnet]\n",
  "\n",
- "smp_estimator = HuggingFace(\n",
+ "smp_estimator = PyTorch(\n",
  " entry_point=\"train_gptj_smp_tensor_parallel_script.py\",\n",
  " source_dir=os.getcwd(),\n",
  " role=role,\n",
@@ -851,18 +852,16 @@
  " \"partitions\": hyperparameters[\"pipeline_parallel_degree\"],\n",
  " \"shard_optimizer_state\": hyperparameters[\"shard_optimizer_state\"] > 0,\n",
  " \"prescaled_batch\": hyperparameters[\"prescaled_batch\"] > 0,\n",
- " \"fp16_params\": hyperparameters[\"fp16\"] > 0,\n",
+ " \"fp16\": hyperparameters[\"fp16\"] > 0,\n",
  " \"optimize\": hyperparameters[\"optimize\"],\n",
  " \"auto_partition\": False if hyperparameters[\"manual_partition\"] else True,\n",
  " \"default_partition\": 0,\n",
- " \"fp16_params\": hyperparameters[\"fp16\"] > 0,\n",
- " \"optimize\": hyperparameters[\"optimize\"],\n",
  " },\n",
  " }\n",
  " },\n",
  " },\n",
- " pytorch_version=\"1.10\",\n",
- " transformers_version=\"4.17\",\n",
+ " framework_version=\"1.12\",\n",
  " py_version=\"py38\",\n",
  " output_path=s3_output_bucket,\n",
  " checkpoint_s3_uri=checkpoint_s3_uri if not use_fsx else None,\n",
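Pieced together from the hunks above, the updated estimator call looks roughly like the following consolidated sketch. It assumes the notebook's earlier variables (`role`, `hyperparameters`, `s3_output_bucket`, `checkpoint_s3_uri`, `use_fsx`) are defined; the instance type and count, the `tensor_parallel_degree` key, and the MPI settings are illustrative assumptions not shown in this diff.

```python
# Consolidated sketch of the updated estimator call (illustrative values;
# assumes role, hyperparameters, s3_output_bucket, checkpoint_s3_uri, use_fsx exist).
import os

from sagemaker.pytorch import PyTorch

smp_estimator = PyTorch(
    entry_point="train_gptj_smp_tensor_parallel_script.py",
    source_dir=os.getcwd(),
    role=role,
    instance_type="ml.p4d.24xlarge",  # illustrative
    instance_count=1,                 # illustrative
    framework_version="1.12",         # replaces pytorch_version/transformers_version
    py_version="py38",
    distribution={
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "tensor_parallel_degree": hyperparameters["tensor_parallel_degree"],  # assumed key
                    "partitions": hyperparameters["pipeline_parallel_degree"],
                    "shard_optimizer_state": hyperparameters["shard_optimizer_state"] > 0,
                    "prescaled_batch": hyperparameters["prescaled_batch"] > 0,
                    "fp16": hyperparameters["fp16"] > 0,  # renamed from fp16_params
                    "optimize": hyperparameters["optimize"],
                    "auto_partition": False if hyperparameters["manual_partition"] else True,
                    "default_partition": 0,
                },
            },
        },
        "mpi": {"enabled": True, "processes_per_host": 8},  # assumption: SMP jobs run under MPI
    },
    hyperparameters=hyperparameters,
    output_path=s3_output_bucket,
    checkpoint_s3_uri=checkpoint_s3_uri if not use_fsx else None,
)
```

Note the `fp16_params` -> `fp16` rename and the switch from the HuggingFace estimator's `pytorch_version`/`transformers_version` pair to the PyTorch estimator's single `framework_version`, matching the hunks above.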

This file was deleted.

