diff --git a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_single_node/language-modeling/gpt-2.ipynb b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_single_node/language-modeling/gpt-2.ipynb
index 401cad21c7..295af7429f 100644
--- a/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_single_node/language-modeling/gpt-2.ipynb
+++ b/sagemaker-training-compiler/huggingface/pytorch_multiple_gpu_single_node/language-modeling/gpt-2.ipynb
@@ -2,7 +2,7 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "d6e9eb74",
+   "id": "09281cc4",
    "metadata": {},
    "source": [
     "# Compile and Train the GPT2 Model using the Transformers Trainer API with the SST2 Dataset for Single-Node Multi-GPU Training"
@@ -10,7 +10,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "333f1154",
+   "id": "5608a157",
    "metadata": {},
    "source": [
     "1. [Introduction](#Introduction) \n",
@@ -25,7 +25,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "439f1d71",
+   "id": "f60e1bd1",
    "metadata": {},
    "source": [
     "## SageMaker Training Compiler Overview\n",
@@ -47,7 +47,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "e2997e9c",
+   "id": "2e95db60",
    "metadata": {},
    "source": [
     "## Development Environment "
@@ -55,7 +55,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5af26c1f",
+   "id": "5aa60f30",
    "metadata": {},
    "source": [
     "### Installation\n",
@@ -66,17 +66,17 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "18ac683c",
+   "id": "a0e55d93",
    "metadata": {},
    "outputs": [],
    "source": [
-    "!pip install \"sagemaker>=2.108.0\" botocore boto3 awscli --upgrade"
+    "!pip install \"sagemaker>=2.108.0\" botocore boto3 awscli pandas numpy --upgrade"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "f1437715",
+   "id": "e0d0f005",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -92,7 +92,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "17d79fbe",
+   "id": "99c4d79b",
    "metadata": {},
    "source": [
     "Copy and run the following code if you need to upgrade IPython widgets for `datasets` library and restart kernel. This is only needed when preprocessing is done in the notebook.\n",
@@ -108,7 +108,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f46dc857",
+   "id": "10b9d397",
    "metadata": {},
    "source": [
     "### SageMaker environment "
@@ -117,7 +117,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "44730963",
+   "id": "cd39452f",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -142,7 +142,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d87ea153",
+   "id": "fd7db139",
    "metadata": {},
    "source": [
     "## SageMaker Training Job\n",
@@ -159,7 +159,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "40036a0b",
+   "id": "831e0a92",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -194,7 +194,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "7479a143",
+   "id": "81e31c27",
    "metadata": {},
    "source": [
     "Next, we define some basic arguments to be passed to the training script."
@@ -203,7 +203,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "a341a9b4",
+   "id": "7d4fb991",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -231,7 +231,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "45a85dc8",
+   "id": "ab907b5a",
    "metadata": {},
    "source": [
     "In the following sections, we will create estimators and start training.\n",
@@ -248,7 +248,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "cdf61a8c",
+   "id": "e29fd7fe",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -283,7 +283,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3e13585e",
+   "id": "f31232d4",
    "metadata": {},
    "source": [
     "### Training with Optimized PyTorch"
@@ -291,7 +291,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "25c16bb2",
+   "id": "ad364eb4",
    "metadata": {},
    "source": [
     "Compilation through Training Compiler changes the memory footprint of the model. Most commonly, this manifests as a reduction in memory utilization and a consequent increase in the largest batch size that can fit on the GPU. Note that when you change the batch size, you must adjust the learning rate appropriately. Below, we have scaled the learning rate linearly with the increase in batch size.\n",
@@ -302,7 +302,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "859d6577",
+   "id": "e6638a72",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -339,7 +339,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "cf30f0ad",
+   "id": "0e58bbc1",
    "metadata": {},
    "source": [
     "### Wait for training jobs to complete"
@@ -348,7 +348,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "d46ace7d",
+   "id": "d08f7879",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -361,7 +361,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3b530fd0",
+   "id": "bd16a442",
    "metadata": {},
    "source": [
     "## Analysis"
@@ -369,7 +369,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "3be6032e",
+   "id": "80674ab7",
    "metadata": {},
    "source": [
     "**Note:** If the estimator object is no longer available due to a kernel break or refresh, you need to directly use the training job name and manually attach the training job to a new HuggingFace estimator. For example:\n",
@@ -383,7 +383,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "7bf96f3a",
+   "id": "c99a9d7c",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -393,7 +393,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fb29f9a5",
+   "id": "a18bc4fa",
    "metadata": {},
    "source": [
     "### Load logs of the training job *with* SageMaker Training Compiler"
@@ -402,7 +402,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "74a70159",
+   "id": "036f3a33",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -414,7 +414,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "d7e6d03f",
+   "id": "8af3f031",
    "metadata": {},
    "source": [
     "### Load logs of the training job *without* SageMaker Training Compiler"
@@ -423,7 +423,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "8bf1320c",
+   "id": "c809ed2b",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -435,7 +435,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "1072f6af",
+   "id": "b16c33d2",
    "metadata": {},
    "source": [
     "### Create helper functions for analysis"
@@ -444,7 +444,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "b0f987cf",
+   "id": "f56d5e71",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -485,7 +485,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "754cc075",
+   "id": "5b2cfcfe",
    "metadata": {},
    "source": [
     "### Plot Optimized vs Native Training Throughput\n",
@@ -496,7 +496,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "faeb13fa",
+   "id": "ec96cd7a",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -515,7 +515,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "860d539a",
+   "id": "fc5368c9",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -533,7 +533,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "5114a33b",
+   "id": "7fc85afb",
    "metadata": {},
    "source": [
     "### Convergence of Training Loss\n",
@@ -544,7 +544,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "116d8611",
+   "id": "7fa6b008",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -563,7 +563,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "fe0838b9",
+   "id": "f3810076",
    "metadata": {},
    "source": [
     "### Training Stats\n",
@@ -574,7 +574,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "60f9246e",
+   "id": "83f2b56c",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -586,7 +586,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "404971ae",
+   "id": "35f81eb2",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -603,7 +603,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "174f54df",
+   "id": "638fb396",
    "metadata": {},
    "source": [
     "### Total Billable Time\n",
@@ -614,7 +614,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "c502dcbd",
+   "id": "ed587500",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -629,7 +629,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "1518fe79",
+   "id": "93063cf8",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -642,7 +642,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "5139d4bd",
+   "id": "1fb0bcf2",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -652,7 +652,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "94663edc",
+   "id": "63ba51ec",
    "metadata": {},
    "source": [
     "## Clean up\n",
@@ -663,7 +663,7 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "416aff1d",
+   "id": "80275fe5",
    "metadata": {},
    "outputs": [],
    "source": [
@@ -684,7 +684,7 @@
   },
   {
    "cell_type": "markdown",
-   "id": "b91720a9",
+   "id": "c0a5f087",
    "metadata": {},
    "source": [
     "Also, to find instructions on cleaning up resources, see [Clean Up](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-cleanup.html) in the *Amazon SageMaker Developer Guide*."