diff --git a/README.md b/README.md index 25c9d52f1e..f2088a64db 100644 --- a/README.md +++ b/README.md @@ -187,6 +187,7 @@ These examples showcase unique functionality available in Amazon SageMaker. They - [Host Multiple Models with SKLearn](advanced_functionality/multi_model_sklearn_home_value) shows how to deploy multiple models to a realtime hosted endpoint using a multi-model enabled SKLearn container. - [SageMaker Training and Inference with Script Mode](sagemaker-script-mode) shows how to use custom training and inference scripts, similar to those you would use outside of SageMaker, with SageMaker's prebuilt containers for various frameworks like Scikit-learn, PyTorch, and XGBoost. - [Host Models with NVidia Triton Server](sagemaker-triton) shows how to deploy models to a realtime hosted endpoint using [Triton](https://developer.nvidia.com/nvidia-triton-inference-server) as the model inference server. +- [Heterogeneous Clusters Training in TensorFlow or PyTorch](training/heterogeneous-clusters/README.md) shows how to train using TensorFlow tf.data.service (distributed data pipeline) or PyTorch (with gRPC) on top of Amazon SageMaker Heterogeneous clusters to overcome CPU bottlenecks by including different instance types (GPU/CPU) in the same training job. ### Amazon SageMaker Neo Compilation Jobs diff --git a/index.rst b/index.rst index e153475283..53ddf2b0de 100644 --- a/index.rst +++ b/index.rst @@ -185,7 +185,7 @@ More examples sagemaker-script-mode/index training/bring_your_own_container training/management - + training/heterogeneous-clusters/index .. 
toctree:: :maxdepth: 1 diff --git a/sagemaker-datawrangler/readme.md b/sagemaker-datawrangler/readme.md index 8b13789179..d6c963f90d 100644 --- a/sagemaker-datawrangler/readme.md +++ b/sagemaker-datawrangler/readme.md @@ -1 +1,41 @@ +![Amazon SageMaker Data Wrangler](https://github.com/aws/amazon-sagemaker-examples/raw/main/_static/sagemaker-banner.png) + +# Amazon SageMaker Data Wrangler Examples + +Example flows that demonstrate how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler. + +## :books: Background + +[Amazon SageMaker Data Wrangler](https://aws.amazon.com/sagemaker/data-wrangler/) reduces the time it takes to aggregate and prepare data for ML. From a single interface in SageMaker Studio, you can import data from Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, and Amazon SageMaker Feature Store, and in just a few clicks SageMaker Data Wrangler will automatically load, aggregate, and display the raw data. It will then make conversion recommendations based on the source data, transform the data into new features, validate the features, and provide visualizations with recommendations on how to remove common sources of error such as incorrect labels. Once your data is prepared, you can build fully automated ML workflows with Amazon SageMaker Pipelines or import that data into Amazon SageMaker Feature Store. + + + +The [SageMaker example notebooks](https://sagemaker-examples.readthedocs.io/en/latest/) are Jupyter notebooks that demonstrate the usage of Amazon SageMaker. + +## :hammer_and_wrench: Setup + +Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio. Use this section to learn how to access and get started using Data Wrangler. Do the following: + +* Complete each step in [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html#data-wrangler-getting-started-prerequisite). 
+ +* Follow the procedure in [Access Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-getting-started.html#data-wrangler-getting-started-access) to start using Data Wrangler. + + + + +## :notebook: Examples + +### **[Tabular DataFlow](tabular-dataflow/README.md)** + +This example provides a quick walkthrough of how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler for a tabular dataset. + +### **[Timeseries DataFlow](timeseries-dataflow/readme.md)** + +This example provides a quick walkthrough of how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler for a time series dataset. + +### **[Joined DataFlow](joined-dataflow/readme.md)** + +This example provides a quick walkthrough of how to aggregate and prepare data for Machine Learning using Amazon SageMaker Data Wrangler for a joined dataset. + + diff --git a/training/heterogeneous-clusters/.gitignore b/training/heterogeneous-clusters/.gitignore new file mode 100644 index 0000000000..f6a49187cf --- /dev/null +++ b/training/heterogeneous-clusters/.gitignore @@ -0,0 +1,11 @@ +.venv/ +.DS_Store +data/MyMNIST +pt.grpc.local/data/* +pt.grpc.local/__pycache__ +pt.grpc.local/profile +tf.data.service.sagemaker/data +tf.data.service.sagemaker/code/__pycache__ +tf.data.service.local/data +pt.grpc.sagemaker/data +tf.data.service.sagemaker/__pycache__ diff --git a/training/heterogeneous-clusters/README.md b/training/heterogeneous-clusters/README.md new file mode 100644 index 0000000000..8fc73d4764 --- /dev/null +++ b/training/heterogeneous-clusters/README.md @@ -0,0 +1,38 @@ +# Heterogeneous Clusters +SageMaker Training Heterogeneous Clusters allows you to run one training job +that includes instances of different types, for example a GPU instance like +ml.p4d.24xlarge and a CPU instance like ml.c5.18xlarge. 
+ +One primary use case is offloading CPU-intensive tasks like image +pre-processing (data augmentation) from the GPU instance to a dedicated +CPU instance, so you can fully utilize the expensive GPUs, and arrive at +an improved time and cost to train. + +You'll find TensorFlow (tf.data.service) and PyTorch (a custom gRPC-based distributed data loader) examples on how to utilize heterogeneous clusters in your training jobs. You can reuse these examples when enabling your own training workload to use heterogeneous clusters. + +![Hetero job diagram](tf.data.service.sagemaker/images/basic-heterogeneous-job.png) + +## Examples: + +### TensorFlow examples +- [**TensorFlow's tf.data.service running locally**](tf.data.service.local/README.md): +This example runs the tf.data.service locally on your machine (not on SageMaker). It helps you get familiar with tf.data.service and run small-scale, quick experiments. + +- [**TensorFlow's tf.data.service with Amazon SageMaker Training Heterogeneous Clusters**](tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb): +This TensorFlow example runs a homogeneous training job and compares its results with a Heterogeneous Clusters SageMaker training job that runs with two instance groups: + - `data_group` - this group has two ml.c5.18xlarge instances to which data augmentation is offloaded. + - `dnn_group` - this group runs one ml.p4d.24xlarge instance (8 GPUs) in a Horovod/MPI distribution. + +### PyTorch examples +- [**PyTorch with gRPC distributed dataloader running locally**](pt.grpc.local/README.md): +This PyTorch example runs a training job split into two processes locally on your machine (not on SageMaker). It helps you get familiar with the gRPC distributed data loader and run small-scale, quick experiments. 
+ +- [**PyTorch with gRPC distributed dataloader Heterogeneous Clusters training job example**](pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb): +This PyTorch example runs a heterogeneous SageMaker training job that uses gRPC to offload data augmentation to a CPU-based server. + + +### Hello world example +- [**Hetero Training Job - Hello world**](hello.world.sagemaker/README.md): +This basic example runs a heterogeneous training job consisting of two instance groups. Each group includes a different instance_type. +Each instance prints its instance group information and exits. +Note: This example only shows how to orchestrate a training job with different instance types. For actual code to help with a distributed data loader, see the TF or PT examples above. \ No newline at end of file diff --git a/training/heterogeneous-clusters/hello.world.sagemaker/helloworld-example.ipynb b/training/heterogeneous-clusters/hello.world.sagemaker/helloworld-example.ipynb new file mode 100644 index 0000000000..e990cf22ce --- /dev/null +++ b/training/heterogeneous-clusters/hello.world.sagemaker/helloworld-example.ipynb @@ -0,0 +1,467 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Heterogeneous Cluster - a hello world training job\n", + "\n", + "This basic example shows how to run a Heterogeneous Clusters training job consisting of two instance groups. Each instance group includes a different instance type. 
Each instance prints its environment information, including its instance group, and exits.\n", + "\n", + "You can retrieve environment information in either of the following ways:\n", + " - **Option 1**: Read instance group information using the convenient `sagemaker_training.environment.Environment` class.\n", + " - **Option 2**: Read instance group information from `/opt/ml/input/config/resourceconfig.json`.\n", + " \n", + " \n", + "Note: This notebook does not demonstrate offloading data preprocessing to `data_group` and deep neural network training to `dnn_group`. Those examples are covered in the [TensorFlow's tf.data.service based Amazon SageMaker Heterogeneous Clusters for training](../tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb) and [PyTorch and gRPC distributed dataloader based Amazon SageMaker Heterogeneous Clusters for training](../pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb) notebooks." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A. Setting up SageMaker Studio notebook\n", + "#### Before you start\n", + "Ensure you have selected the Python 3 (_TensorFlow 2.6 Python 3.8 CPU Optimized_) image for your SageMaker Studio notebook, and that it is running on the _ml.t3.medium_ instance type.\n", + "\n", + "#### Step 1 - Upgrade SageMaker SDK and dependent packages\n", + "Heterogeneous Clusters for Amazon SageMaker model training was [announced](https://aws.amazon.com/about-aws/whats-new/2022/07/announcing-heterogeneous-clusters-amazon-sagemaker-model-training) on 07/08/2022. This feature release requires up-to-date SageMaker SDK and boto3 client libraries." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: boto3 in /usr/local/lib/python3.8/site-packages (1.24.72)\n", + "Collecting boto3\n", + " Downloading boto3-1.24.83-py3-none-any.whl (132 kB)\n", + " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.5/132.5 kB 2.5 MB/s eta 0:00:00\n", + "Requirement already satisfied: botocore in /usr/local/lib/python3.8/site-packages (1.27.72)\n", + "Collecting botocore\n", + " Downloading botocore-1.27.83-py3-none-any.whl (9.2 MB)\n", + " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.2/9.2 MB 42.5 MB/s eta 0:00:00\n", + "Requirement already satisfied: awscli in /usr/local/lib/python3.8/site-packages (1.25.73)\n", + "Collecting awscli\n", + " Downloading awscli-1.25.84-py3-none-any.whl (3.9 MB)\n", + " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.9/3.9 MB 35.4 MB/s eta 0:00:00\n", + "Requirement already satisfied: sagemaker in /usr/local/lib/python3.8/site-packages (2.109.0)\n", + "Collecting sagemaker\n", + " Downloading sagemaker-2.110.0.tar.gz (576 kB)\n", + " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 576.0/576.0 kB 9.9 MB/s eta 0:00:00\n", + " Preparing metadata (setup.py): started\n", + " Preparing metadata (setup.py): finished with status 'done'\n", + "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /usr/local/lib/python3.8/site-packages (from boto3) (0.6.0)\n", + "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /usr/local/lib/python3.8/site-packages (from boto3) (0.10.0)\n", + "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /usr/local/lib/python3.8/site-packages (from botocore) (1.25.11)\n", + "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.8/site-packages (from botocore) (2.8.2)\n", + "Requirement already satisfied: colorama<0.4.5,>=0.2.5 in 
/usr/local/lib/python3.8/site-packages (from awscli) (0.4.3)\n", + "Requirement already satisfied: PyYAML<5.5,>=3.10 in /usr/local/lib/python3.8/site-packages (from awscli) (5.4.1)\n", + "Requirement already satisfied: docutils<0.17,>=0.10 in /usr/local/lib/python3.8/site-packages (from awscli) (0.15.2)\n", + "Requirement already satisfied: rsa<4.8,>=3.1.2 in /usr/local/lib/python3.8/site-packages (from awscli) (4.7.2)\n", + "Requirement already satisfied: attrs<22,>=20.3.0 in /usr/local/lib/python3.8/site-packages (from sagemaker) (21.2.0)\n", + "Requirement already satisfied: google-pasta in /usr/local/lib/python3.8/site-packages (from sagemaker) (0.2.0)\n", + "Requirement already satisfied: numpy<2.0,>=1.9.0 in /usr/local/lib/python3.8/site-packages (from sagemaker) (1.19.5)\n", + "Requirement already satisfied: protobuf<4.0,>=3.1 in /usr/local/lib/python3.8/site-packages (from sagemaker) (3.19.1)\n", + "Requirement already satisfied: protobuf3-to-dict<1.0,>=0.1.5 in /usr/local/lib/python3.8/site-packages (from sagemaker) (0.1.5)\n", + "Requirement already satisfied: smdebug_rulesconfig==1.0.1 in /usr/local/lib/python3.8/site-packages (from sagemaker) (1.0.1)\n", + "Requirement already satisfied: importlib-metadata<5.0,>=1.4.0 in /usr/local/lib/python3.8/site-packages (from sagemaker) (4.8.2)\n", + "Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.8/site-packages (from sagemaker) (21.3)\n", + "Requirement already satisfied: pandas in /usr/local/lib/python3.8/site-packages (from sagemaker) (1.2.5)\n", + "Requirement already satisfied: pathos in /usr/local/lib/python3.8/site-packages (from sagemaker) (0.2.8)\n", + "Collecting schema\n", + " Downloading schema-0.7.5-py2.py3-none-any.whl (17 kB)\n", + "Requirement already satisfied: zipp>=0.5 in /usr/local/lib/python3.8/site-packages (from importlib-metadata<5.0,>=1.4.0->sagemaker) (3.6.0)\n", + "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.8/site-packages 
(from packaging>=20.0->sagemaker) (3.0.6)\n", + "Requirement already satisfied: six in /usr/local/lib/python3.8/site-packages (from protobuf3-to-dict<1.0,>=0.1.5->sagemaker) (1.16.0)\n", + "Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/lib/python3.8/site-packages (from rsa<4.8,>=3.1.2->awscli) (0.4.8)\n", + "Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.8/site-packages (from pandas->sagemaker) (2021.3)\n", + "Requirement already satisfied: dill>=0.3.4 in /usr/local/lib/python3.8/site-packages (from pathos->sagemaker) (0.3.4)\n", + "Requirement already satisfied: ppft>=1.6.6.4 in /usr/local/lib/python3.8/site-packages (from pathos->sagemaker) (1.6.6.4)\n", + "Requirement already satisfied: pox>=0.3.0 in /usr/local/lib/python3.8/site-packages (from pathos->sagemaker) (0.3.0)\n", + "Requirement already satisfied: multiprocess>=0.70.12 in /usr/local/lib/python3.8/site-packages (from pathos->sagemaker) (0.70.12.2)\n", + "Collecting contextlib2>=0.5.5\n", + " Downloading contextlib2-21.6.0-py2.py3-none-any.whl (13 kB)\n", + "Building wheels for collected packages: sagemaker\n", + " Building wheel for sagemaker (setup.py): started\n", + " Building wheel for sagemaker (setup.py): finished with status 'done'\n", + " Created wheel for sagemaker: filename=sagemaker-2.110.0-py2.py3-none-any.whl size=791666 sha256=5e4f859fef28f399b5eb60568410a22ddb2c42bbc357d0b3eae61587a14ca679\n", + " Stored in directory: /root/.cache/pip/wheels/ad/56/4f/4c5b1ed9fb3a725a634741aa293beb6fad882af965e2ccb6ae\n", + "Successfully built sagemaker\n", + "Installing collected packages: contextlib2, schema, botocore, boto3, awscli, sagemaker\n", + " Attempting uninstall: botocore\n", + " Found existing installation: botocore 1.27.72\n", + " Uninstalling botocore-1.27.72:\n", + " Successfully uninstalled botocore-1.27.72\n", + " Attempting uninstall: boto3\n", + " Found existing installation: boto3 1.24.72\n", + " Uninstalling boto3-1.24.72:\n", + " Successfully 
uninstalled boto3-1.24.72\n", + " Attempting uninstall: awscli\n", + " Found existing installation: awscli 1.25.73\n", + " Uninstalling awscli-1.25.73:\n", + " Successfully uninstalled awscli-1.25.73\n", + " Attempting uninstall: sagemaker\n", + " Found existing installation: sagemaker 2.109.0\n", + " Uninstalling sagemaker-2.109.0:\n", + " Successfully uninstalled sagemaker-2.109.0\n", + "Successfully installed awscli-1.25.84 boto3-1.24.83 botocore-1.27.83 contextlib2-21.6.0 sagemaker-2.110.0 schema-0.7.5\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\n" + ] + } + ], + "source": [ + "%%bash\n", + "python3 -m pip install --upgrade boto3 botocore awscli sagemaker" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 2 - Restart the notebook kernel " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#import IPython\n", + "#IPython.Application.instance().kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 3 - Validate SageMaker Python SDK and TensorFlow versions\n", + "Ensure the output of the cell below reflects:\n", + "\n", + "- SageMaker Python SDK version 2.98.0 or above, \n", + "- boto3 1.24 or above \n", + "- botocore 1.27 or above \n", + "- TensorFlow 2.6 or above " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Name: sagemaker\n", + "Version: 2.110.0\n", + "---\n", + "Name: boto3\n", + "Version: 1.24.83\n", + "---\n", + "Name: botocore\n", + "Version: 1.27.83\n", + "---\n", + "Name: 
tensorflow\n", + "Version: 2.6.2\n", + "---\n", + "Name: protobuf\n", + "Version: 3.19.1\n" + ] + } + ], + "source": [ + "!pip show sagemaker boto3 botocore tensorflow protobuf |egrep 'Name|Version|---'" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### B. Run a heterogeneous cluster training job\n", + "\n", + "#### Step 1: Set up training environment\n", + "Import the required libraries that enable you to use Heterogeneous clusters for training. In this step, you are also inheriting this notebook's IAM role and SageMaker session. " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import json\n", + "import datetime\n", + "\n", + "import sagemaker\n", + "from sagemaker import get_execution_role\n", + "from sagemaker.tensorflow import TensorFlow\n", + "from sagemaker.instance_group import InstanceGroup\n", + "\n", + "sess = sagemaker.Session()\n", + "role = get_execution_role()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 2: Define instance groups \n", + "Here we define instance groups. Each instance group includes a different instance type." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "data_group = InstanceGroup(\"data_group\", \"ml.c5.xlarge\", 1)\n", + "dnn_group = InstanceGroup(\"dnn_group\", \"ml.m4.xlarge\", 1) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 3: Review the \"hello world\" training code" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mjson\u001b[39;49;00m\n", + "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36mos\u001b[39;49;00m\n", + "\u001b[34mimport\u001b[39;49;00m \u001b[04m\u001b[36msys\u001b[39;49;00m\n", + "\u001b[34mfrom\u001b[39;49;00m \u001b[04m\u001b[36msagemaker_training\u001b[39;49;00m \u001b[34mimport\u001b[39;49;00m environment \u001b[37m# This module is present on the DLC images, or you can install it with pip install sagemaker_training\u001b[39;49;00m\n", + "\n", + "\u001b[34mif\u001b[39;49;00m \u001b[31m__name__\u001b[39;49;00m == \u001b[33m\"\u001b[39;49;00m\u001b[33m__main__\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m:\n", + " \n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33m\"\u001b[39;49;00m\u001b[33mOption-1: Read instance group information from the sagemaker_training.environment.Environment class\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " env = environment.Environment() \n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.is_hetero: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.is_hetero\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.current_host: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.current_host\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " 
\u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.current_instance_type: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.current_instance_type\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.current_instance_group: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.current_instance_group\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.current_instance_group_hosts: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.current_instance_group_hosts\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.instance_groups: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.instance_groups\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.instance_groups_dict: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.instance_groups_dict\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.distribution_hosts: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.distribution_hosts\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.distribution_instance_groups: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00menv.distribution_instance_groups\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \n", + "\n", + " file_path = \u001b[33m'\u001b[39;49;00m\u001b[33m/opt/ml/input/config/resourceconfig.json\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\n", + " 
\u001b[36mprint\u001b[39;49;00m(\u001b[33m\"\u001b[39;49;00m\u001b[33mOption-2: Read instance group information from \u001b[39;49;00m\u001b[33m{file_path}\u001b[39;49;00m\u001b[33m.\u001b[39;49;00m\u001b[33m\\\u001b[39;49;00m\n", + "\u001b[33m You\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33mll need to parse the json yourself. This doesn\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33mt require an additional library.\u001b[39;49;00m\u001b[33m\\n\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \n", + " \u001b[34mwith\u001b[39;49;00m \u001b[36mopen\u001b[39;49;00m(file_path, \u001b[33m'\u001b[39;49;00m\u001b[33mr\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m) \u001b[34mas\u001b[39;49;00m f:\n", + " config = json.load(f)\n", + "\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33m{\u001b[39;49;00mfile_path\u001b[33m}\u001b[39;49;00m\u001b[33m dump = \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mjson.dumps(config, indent=\u001b[34m4\u001b[39;49;00m, sort_keys=\u001b[34mTrue\u001b[39;49;00m)\u001b[33m}\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m)\n", + " \n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.is_hetero: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m\u001b[33minstance_groups\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m \u001b[35min\u001b[39;49;00m config\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33mcurrent_host=\u001b[39;49;00m\u001b[33m{\u001b[39;49;00mconfig[\u001b[33m'\u001b[39;49;00m\u001b[33mcurrent_host\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m]\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " 
\u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33mcurrent_instance_type=\u001b[39;49;00m\u001b[33m{\u001b[39;49;00mconfig[\u001b[33m'\u001b[39;49;00m\u001b[33mcurrent_instance_type\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m]\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.current_instance_group: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mconfig[\u001b[33m'\u001b[39;49;00m\u001b[33mcurrent_group_name\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m]\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.current_instance_group_hosts: TODO\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.instance_groups: TODO\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.instance_groups_dict: \u001b[39;49;00m\u001b[33m{\u001b[39;49;00mconfig[\u001b[33m'\u001b[39;49;00m\u001b[33minstance_groups\u001b[39;49;00m\u001b[33m'\u001b[39;49;00m]\u001b[33m}\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.distribution_hosts: TODO\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n", + " \u001b[36mprint\u001b[39;49;00m(\u001b[33mf\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m\u001b[33menv.distribution_instance_groups: TODO\u001b[39;49;00m\u001b[33m\"\u001b[39;49;00m)\n" + ] + } + ], + "source": [ + "!pygmentize source_dir/train.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 4: Configure the Estimator\n", + "In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that 
defines how to use the container to train. This includes the configuration we need to invoke SageMaker training." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "estimator = TensorFlow(\n", + " entry_point='train.py',\n", + " source_dir='./source_dir',\n", + " #instance_type='ml.m4.xlarge',\n", + " #instance_count=1,\n", + " instance_groups = [data_group, dnn_group,],\n", + " framework_version='2.9.1',\n", + " py_version='py39',\n", + " role=role,\n", + " volume_size=10,\n", + " max_run=3600,\n", + " disable_profiler=True,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 5: Submit the training job\n", + "Here you are submitting the heterogeneous cluster training job. " + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2022-09-30 17:23:58 Starting - Starting the training job...\n", + "2022-09-30 17:24:26 Starting - Preparing the instances for training.........\n", + "2022-09-30 17:25:56 Downloading - Downloading input data...\n", + "2022-09-30 17:26:22 Training - Downloading the training image...............\n", + "2022-09-30 17:28:53 Training - Training image download completed. 
Training in progress....\n", + "2022-09-30 17:29:24 Uploading - Uploading generated training model\n", + "2022-09-30 17:29:24 Completed - Training job completed\n", + "..Training seconds: 0\n", + "Billable seconds: 0\n" + ] + } + ], + "source": [ + "estimator.fit(\n", + " job_name='hello-world-heterogenous' + \n", + " '-' + datetime.datetime.utcnow().strftime(\"%Y%m%dT%H%M%SZ\"),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 6: Review the logs for environment information\n", + "\n", + "Wait for the training job to finish, and review its logs in the AWS Console (click on **View logs** from the **Training Jobs** node in **Amazon SageMaker Console**) You'll find two logs: Algo1, Algo2. Examine the printouts on each node on how to retrieve instance group environment information. An example is shown here:\n", + "\n", + "```\n", + "Option-1: Read instance group information from the sagemaker_training.environment.Environment class\n", + "env.is_hetero: True\n", + "env.current_host: algo-1\n", + "env.current_instance_type: ml.c5.xlarge\n", + "env.current_instance_group: data_group\n", + "env.current_instance_group_hosts: ['algo-1']\n", + "env.instance_groups: ['data_group', 'dnn_group']\n", + "\n", + "Option-2: Read instance group information from {file_path}. You'll need to parse the json yourself. 
This doesn't require an additional library.\n", + "/opt/ml/input/config/resourceconfig.json dump = {\n", + " \"current_group_name\": \"data_group\",\n", + " \"current_host\": \"algo-1\",\n", + " \"current_instance_type\": \"ml.c5.xlarge\",\n", + " \"hosts\": [\n", + " \"algo-1\",\n", + " \"algo-2\"\n", + " ],\n", + " \"instance_groups\": [\n", + " {\n", + " \"hosts\": [\n", + " \"algo-1\"\n", + " ],\n", + " \"instance_group_name\": \"data_group\",\n", + " \"instance_type\": \"ml.c5.xlarge\"\n", + " },\n", + " {\n", + " \"hosts\": [\n", + " \"algo-2\"\n", + " ],\n", + " \"instance_group_name\": \"dnn_group\",\n", + " \"instance_type\": \"ml.m4.xlarge\"\n", + " }\n", + " ],\n", + " \"network_interface_name\": \"eth0\"\n", + "}\n", + "env.is_hetero: True\n", + "current_host=algo-1\n", + "current_instance_type=ml.c5.xlarge\n", + "env.current_instance_group: data_group\n", + "env.current_instance_group_hosts: TODO\n", + "env.instance_groups: TODO\n", + "env.instance_groups_dict: [{'instance_group_name': 'data_group', 'instance_type': 'ml.c5.xlarge', 'hosts': ['algo-1']}, {'instance_group_name': 'dnn_group', 'instance_type': 'ml.m4.xlarge', 'hosts': ['algo-2']}]\n", + "env.distribution_hosts: TODO\n", + "env.distribution_instance_groups: TODO\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### C. Next steps\n", + "\n", + "In this notebook, we demonstrated how to retrieve the environment information, and differentiate which instance group an instance belongs to. Based on this, you can build logic to offload data processing tasks in your training job to a dedicated instance group. 
To understand how that can be done with a real-world example, we suggest going through the following notebook examples: \n", + "\n", + "- [TensorFlow's tf.data.service based Amazon SageMaker Heterogeneous Clusters for training](../tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb)\n", + "- [PyTorch and gRPC distributed dataloader based Amazon SageMaker Heterogeneous Clusters for training](../pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb)" + ] + } + ], + "metadata": { + "instance_type": "ml.t3.medium", + "kernelspec": { + "display_name": "Python 3.9.7 ('.venv': venv)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + }, + "vscode": { + "interpreter": { + "hash": "77c0de85c2cb739aa5100af7b92fb9d2075368f0e653f4148499a56c989df5f7" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/heterogeneous-clusters/hello.world.sagemaker/source_dir/train.py b/training/heterogeneous-clusters/hello.world.sagemaker/source_dir/train.py new file mode 100644 index 0000000000..1884e20fab --- /dev/null +++ b/training/heterogeneous-clusters/hello.world.sagemaker/source_dir/train.py @@ -0,0 +1,38 @@ +import json +import os +import sys +from sagemaker_training import environment # This module is present on the DLC images, or you can install it with pip install sagemaker_training + +if __name__ == "__main__": + + print("Option-1: Read instance group information from the sagemaker_training.environment.Environment class") + env = environment.Environment() + print(f"env.is_hetero: {env.is_hetero}") + print(f"env.current_host: {env.current_host}") + print(f"env.current_instance_type: {env.current_instance_type}") + print(f"env.current_instance_group: {env.current_instance_group}") + 
print(f"env.current_instance_group_hosts: {env.current_instance_group_hosts}") + print(f"env.instance_groups: {env.instance_groups}") + print(f"env.instance_groups_dict: {env.instance_groups_dict}") + print(f"env.distribution_hosts: {env.distribution_hosts}") + print(f"env.distribution_instance_groups: {env.distribution_instance_groups}") + + + file_path = '/opt/ml/input/config/resourceconfig.json' + print(f"Option-2: Read instance group information from {file_path}.\ + You'll need to parse the json yourself. This doesn't require an additional library.\n") + + with open(file_path, 'r') as f: + config = json.load(f) + + print(f'{file_path} dump = {json.dumps(config, indent=4, sort_keys=True)}') + + print(f"env.is_hetero: {'instance_groups' in config}") + print(f"current_host={config['current_host']}") + print(f"current_instance_type={config['current_instance_type']}") + print(f"env.current_instance_group: {config['current_group_name']}") + print(f"env.current_instance_group_hosts: TODO") + print(f"env.instance_groups: TODO") + print(f"env.instance_groups_dict: {config['instance_groups']}") + print(f"env.distribution_hosts: TODO") + print(f"env.distribution_instance_groups: TODO") diff --git a/training/heterogeneous-clusters/index.rst b/training/heterogeneous-clusters/index.rst new file mode 100644 index 0000000000..55b3125889 --- /dev/null +++ b/training/heterogeneous-clusters/index.rst @@ -0,0 +1,49 @@ +###################### +Heterogeneous Clusters +###################### + +SageMaker Training Heterogeneous Clusters allows you to run one training job +that includes instances of different types, for example a GPU instance like +ml.p4d.24xlarge and a CPU instance like ml.c5.18xlarge. + +One primary use case is offloading CPU-intensive tasks like image +pre-processing (data augmentation) from the GPU instance to a dedicated +CPU instance, so you can fully utilize the expensive GPUs, and arrive at +an improved time and cost to train. + +.. 
admonition:: More resources: + + - `SageMaker heterogeneous cluster developer guide `_ + + +See the following example notebooks: + +Hello World +==================================== +This minimal example launches a heterogeneous cluster training job, prints environment information, and exits. + +.. toctree:: + :maxdepth: 1 + + hello.world.sagemaker/helloworld-example + + +TensorFlow +==================================== +This example is a reusable implementation of a heterogeneous cluster with TensorFlow's tf.data.service. + +.. toctree:: + :maxdepth: 1 + + tf.data.service.sagemaker/hetero-tensorflow-restnet50 + + +PyTorch +==================================== +This example is a reusable implementation of a heterogeneous cluster with a gRPC-based data loader. + +.. toctree:: + :maxdepth: 1 + + pt.grpc.sagemaker/hetero-pytorch-mnist + diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed.proto b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed.proto new file mode 100644 index 0000000000..94de2cd212 --- /dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed.proto @@ -0,0 +1,14 @@ +syntax = "proto3"; + +service DatasetFeed { + rpc get_examples(Dummy) returns (stream Example) {} + rpc shutdown(Dummy) returns (Dummy) {} +} + +message Dummy { +} + +message Example { + bytes image = 1; + bytes label = 2; +} \ No newline at end of file diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed_pb2.py b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed_pb2.py new file mode 100644 index 0000000000..78575b8888 --- /dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed_pb2.py @@ -0,0 +1,47 @@ +# -*- coding: utf-8 -*- +# Generated by the protocol buffer compiler. DO NOT EDIT! 
+# source: dataset_feed.proto +"""Generated protocol buffer code.""" +from google.protobuf import descriptor as _descriptor +from google.protobuf import descriptor_pool as _descriptor_pool +from google.protobuf import message as _message +from google.protobuf import reflection as _reflection +from google.protobuf import symbol_database as _symbol_database +# @@protoc_insertion_point(imports) + +_sym_db = _symbol_database.Default() + + + + +DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\x12\x64\x61taset_feed.proto\"\x07\n\x05\x44ummy\"\'\n\x07\x45xample\x12\r\n\x05image\x18\x01 \x01(\x0c\x12\r\n\x05label\x18\x02 \x01(\x0c\x32Q\n\x0b\x44\x61tasetFeed\x12$\n\x0cget_examples\x12\x06.Dummy\x1a\x08.Example\"\x00\x30\x01\x12\x1c\n\x08shutdown\x12\x06.Dummy\x1a\x06.Dummy\"\x00\x62\x06proto3') + + + +_DUMMY = DESCRIPTOR.message_types_by_name['Dummy'] +_EXAMPLE = DESCRIPTOR.message_types_by_name['Example'] +Dummy = _reflection.GeneratedProtocolMessageType('Dummy', (_message.Message,), { + 'DESCRIPTOR' : _DUMMY, + '__module__' : 'dataset_feed_pb2' + # @@protoc_insertion_point(class_scope:Dummy) + }) +_sym_db.RegisterMessage(Dummy) + +Example = _reflection.GeneratedProtocolMessageType('Example', (_message.Message,), { + 'DESCRIPTOR' : _EXAMPLE, + '__module__' : 'dataset_feed_pb2' + # @@protoc_insertion_point(class_scope:Example) + }) +_sym_db.RegisterMessage(Example) + +_DATASETFEED = DESCRIPTOR.services_by_name['DatasetFeed'] +if _descriptor._USE_C_DESCRIPTORS == False: + + DESCRIPTOR._options = None + _DUMMY._serialized_start=22 + _DUMMY._serialized_end=29 + _EXAMPLE._serialized_start=31 + _EXAMPLE._serialized_end=70 + _DATASETFEED._serialized_start=72 + _DATASETFEED._serialized_end=153 +# @@protoc_insertion_point(module_scope) diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed_pb2_grpc.py b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed_pb2_grpc.py new file mode 100644 index 0000000000..b37fe7aad6 --- 
/dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/dataset_feed_pb2_grpc.py @@ -0,0 +1,99 @@ +# Generated by the gRPC Python protocol compiler plugin. DO NOT EDIT! +"""Client and server classes corresponding to protobuf-defined services.""" +import grpc + +import dataset_feed_pb2 as dataset__feed__pb2 + + +class DatasetFeedStub(object): + """Missing associated documentation comment in .proto file.""" + + def __init__(self, channel): + """Constructor. + + Args: + channel: A grpc.Channel. + """ + self.get_examples = channel.unary_stream( + '/DatasetFeed/get_examples', + request_serializer=dataset__feed__pb2.Dummy.SerializeToString, + response_deserializer=dataset__feed__pb2.Example.FromString, + ) + self.shutdown = channel.unary_unary( + '/DatasetFeed/shutdown', + request_serializer=dataset__feed__pb2.Dummy.SerializeToString, + response_deserializer=dataset__feed__pb2.Dummy.FromString, + ) + + +class DatasetFeedServicer(object): + """Missing associated documentation comment in .proto file.""" + + def get_examples(self, request, context): + """Missing associated documentation comment in .proto file.""" + context.set_code(grpc.StatusCode.UNIMPLEMENTED) + context.set_details('Method not implemented!') + raise NotImplementedError('Method not implemented!') + + def shutdown(self, request, context): + """Missing associated documentation comment in .proto file.""" + context.set_code(grpc.StatusCode.UNIMPLEMENTED) + context.set_details('Method not implemented!') + raise NotImplementedError('Method not implemented!') + + +def add_DatasetFeedServicer_to_server(servicer, server): + rpc_method_handlers = { + 'get_examples': grpc.unary_stream_rpc_method_handler( + servicer.get_examples, + request_deserializer=dataset__feed__pb2.Dummy.FromString, + response_serializer=dataset__feed__pb2.Example.SerializeToString, + ), + 'shutdown': grpc.unary_unary_rpc_method_handler( + servicer.shutdown, + request_deserializer=dataset__feed__pb2.Dummy.FromString, + 
response_serializer=dataset__feed__pb2.Dummy.SerializeToString, + ), + } + generic_handler = grpc.method_handlers_generic_handler( + 'DatasetFeed', rpc_method_handlers) + server.add_generic_rpc_handlers((generic_handler,)) + + + # This class is part of an EXPERIMENTAL API. +class DatasetFeed(object): + """Missing associated documentation comment in .proto file.""" + + @staticmethod + def get_examples(request, + target, + options=(), + channel_credentials=None, + call_credentials=None, + insecure=False, + compression=None, + wait_for_ready=None, + timeout=None, + metadata=None): + return grpc.experimental.unary_stream(request, target, '/DatasetFeed/get_examples', + dataset__feed__pb2.Dummy.SerializeToString, + dataset__feed__pb2.Example.FromString, + options, channel_credentials, + insecure, call_credentials, compression, wait_for_ready, timeout, metadata) + + @staticmethod + def shutdown(request, + target, + options=(), + channel_credentials=None, + call_credentials=None, + insecure=False, + compression=None, + wait_for_ready=None, + timeout=None, + metadata=None): + return grpc.experimental.unary_unary(request, target, '/DatasetFeed/shutdown', + dataset__feed__pb2.Dummy.SerializeToString, + dataset__feed__pb2.Dummy.FromString, + options, channel_credentials, + insecure, call_credentials, compression, wait_for_ready, timeout, metadata) diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/code/launcher.py b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/launcher.py new file mode 100644 index 0000000000..371663f461 --- /dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/launcher.py @@ -0,0 +1,101 @@ +import sys +import time +from typing import Optional + +# instance group names +DATA_GROUP = 'data_group' +DNN_GROUP = 'dnn_group' + +def start_child_process(name : str, additional_args=[]) -> int: + import subprocess + params = ["python", f"./{name}"] + sys.argv[1:] + additional_args + print(f'Opening process: {params}') + p = 
subprocess.run(params) + print(f'Process {name} closed with returncode={p.returncode}') + if p.returncode == -15 or p.returncode == -9: + print(f'Received SIGTERM|SIGKILL which is normal termination for pytorch data service to avoid hanging process') + return 0 + return p.returncode + + +def start_data_group(dispatcher_host : str) -> int: + return start_child_process('train_data.py', ["--dispatcher_host", dispatcher_host]) + + +def start_dnn_group(dispatcher_host : Optional[str]) -> int: + additional_args = [] if dispatcher_host is None else ["--dispatcher_host", dispatcher_host] + return start_child_process('train_dnn.py', additional_args) + + +def get_group_first_host(instance_groups, target_group_name): + return instance_groups[target_group_name]['hosts'][0] + +def shutdown_pt_data_service_with_retries(dispatcher_host : str): + for i in range(0,12): + try: + if i>0: + sleeptime = 10 + print(f'Will attempt {i} time to shutdown in {sleeptime} seconds') + time.sleep(sleeptime) + _shutdown_data_service(dispatcher_host) + break + except Exception as e: + print(f'Failed to shutdown dispatcher in {dispatcher_host} due to: {e}') + + +def _shutdown_data_service(dispatcher_host : str): + SHUTDOWN_PORT = 16000 + print(f'Shutting down data service dispatcher via: [{dispatcher_host}:{SHUTDOWN_PORT}]') + import socket + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + s.connect((dispatcher_host, SHUTDOWN_PORT)) + print(f'Shutdown request sent to {dispatcher_host}:{SHUTDOWN_PORT}') + + +def split_to_instance_group_train_script() -> int: + from sagemaker_training import environment + env = environment.Environment() + # try: + # from sagemaker_training import environment + # env = environment.Environment() + # except ImportError: + # class Object(object): + # pass + + # env = Object() + # env.is_hetero = True + # env.current_host = 'dummyhost' + # env.instance_groups_dict = {DATA_GROUP : {'hosts': ['dummyhost']}} + # env.current_instance_group = DNN_GROUP + # 
env.current_instance_type = 'dummyinstance' + + print(f'env.is_hetero={env.is_hetero}') + print(f'current_host={env.current_host}') + + if env.is_hetero: + dispatcher_host = get_group_first_host(env.instance_groups_dict, DATA_GROUP) + first_host_in_dnn_group = get_group_first_host(env.instance_groups_dict, DNN_GROUP) + print(f'current_instance_type={env.current_instance_type}') + print(f'current_group_name={env.current_instance_group}') + print(f'dispatcher_host={dispatcher_host}') + if env.current_instance_group == DATA_GROUP: + return start_data_group(dispatcher_host) + elif env.current_instance_group == DNN_GROUP: + returncode = start_dnn_group(dispatcher_host) + # first host in DNN group takes care of shutting down the dispatcher + if env.current_host == first_host_in_dnn_group: + shutdown_pt_data_service_with_retries(dispatcher_host) + return returncode + else: + raise Exception(f'Unknown instance group: {env.current_instance_group}') + + else: # not hetero + return start_dnn_group(dispatcher_host=None) + +if __name__ == "__main__": + try: + returncode = split_to_instance_group_train_script() + exit(returncode) + except Exception as e: + print(f'Failed due to {e}. 
exiting with returncode=1') + sys.exit(1) \ No newline at end of file diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/code/requirements.txt b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/requirements.txt new file mode 100644 index 0000000000..5d406f6b34 --- /dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/requirements.txt @@ -0,0 +1,2 @@ +torchvision +grpcio-tools diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train.py b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train.py new file mode 100644 index 0000000000..1963d940bf --- /dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train.py @@ -0,0 +1,127 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F +import torch.optim as optim +from torchvision import datasets, transforms +import time +import logging +import sys +import os +import json + +logger = logging.getLogger(__name__) +logger.setLevel(logging.DEBUG) +logger.addHandler(logging.StreamHandler(sys.stdout)) + +class Net(nn.Module): + def __init__(self): + super(Net, self).__init__() + self.conv1 = nn.Conv2d(1, 32, 3, 1) + self.conv2 = nn.Conv2d(32, 64, 3, 1) + self.dropout1 = nn.Dropout(0.25) + self.dropout2 = nn.Dropout(0.5) + self.fc1 = nn.Linear(9216, 128) + self.fc2 = nn.Linear(128, 10) + def forward(self, x): + x = self.conv1(x) + x = F.relu(x) + x = self.conv2(x) + x = F.relu(x) + x = F.max_pool2d(x, 2) + x = self.dropout1(x) + x = torch.flatten(x, 1) + x = self.fc1(x) + x = F.relu(x) + x = self.fc2(x) + output = F.log_softmax(x, dim=1) + return output + +class MyMNIST(datasets.MNIST): + ''' + A personalized extension of the MNIST class in which we + modify the __len__ operation to return the maximum value + of int32 so that we do not run out of data. 
+ ''' + + def __init__(self, batch_size : int, iterations : int, **kwargs): + + super().__init__(**kwargs) + self.batch_size = batch_size + self.iterations = iterations + + def __len__(self) -> int: + size = self.batch_size * self.iterations + return size + + def __getitem__(self, index: int): + return super(MyMNIST, self).__getitem__(index % len(self.data)) + +def main(args): + use_cuda = torch.cuda.is_available() + device = torch.device("cuda" if use_cuda else "cpu") + train_kwargs = {'batch_size': args.batch_size, + 'num_workers': args.num_data_workers, + 'pin_memory': args.pin_memory + } + logger.info ('Training job started...') + transform=transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.1307,), (0.3081,)), + transforms.GaussianBlur(11) + ]) + dataset = MyMNIST(batch_size=args.batch_size, iterations=args.iterations, root='./data', train=True, + transform=transform, download=True) + train_loader = torch.utils.data.DataLoader(dataset, + **train_kwargs) + model = Net().to(device) + optimizer = optim.Adadelta(model.parameters()) + model.train() + t = time.perf_counter() + for idx, (data, target) in enumerate(train_loader, start=1): + data, target = data.to(device), target.to(device) + optimizer.zero_grad() + output = model(data) + loss = F.nll_loss(output, target) + loss.backward() + optimizer.step() + if device.type == 'cpu' or idx % 10 == 0: + logger.info( + f'{idx}: avg step time: {(time.perf_counter()-t)/idx}') + logger.info('Training completed!') + save_model(model, args.model_dir) + +def save_model(model, model_dir): + logger.info("Saving the model") + path = os.path.join(model_dir, "model.pth") + torch.save(model.cpu().state_dict(), path) + return + +def read_args(): + import argparse + parser = argparse.ArgumentParser() + + parser.add_argument("--batch-size", type=int, default=4, + help="Input batch size for training",) + parser.add_argument("--iterations", type=int, default=10, + help="Based on no. 
of cpu per training instance",) + parser.add_argument("--num-data-workers", type=int, default=1, metavar="N", + help="Based on no. of cpu per training instance type in data group",) + parser.add_argument("--num-dnn-workers", type=int, default=1, metavar="N", + help="Based on no. of cpu per training instance type in dnn group, ideally should match to grpc-workers",) + parser.add_argument("--grpc-workers", type=int, default=1, metavar="N", + help="No. of grpc server workers to start",) + parser.add_argument("--pin-memory", type=bool, default=1, + help="pin to GPU memory (default: True)",) + parser.add_argument("--seed", type=int, default=1, + help="random seed (default: 1)",) + parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"])) + parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"]) + parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"]) + parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"]) + #parser.add_argument("--test", type=str, default=os.environ["SM_CHANNEL_TESTING"]) + parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"]) + parser.add_argument("--dispatcher_host", type=str) + return parser.parse_args() + +if __name__ == '__main__': + main(read_args()) diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train_data.py b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train_data.py new file mode 100644 index 0000000000..2d35ba4b26 --- /dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train_data.py @@ -0,0 +1,172 @@ +import multiprocessing as mp +from concurrent import futures + +import grpc +import torch +from torchvision import datasets, transforms + +import dataset_feed_pb2 +import dataset_feed_pb2_grpc +import logging +import sys + +# Logging initialization +logger = logging.getLogger(__name__) +logger.setLevel(logging.DEBUG) 
+logger.addHandler(logging.StreamHandler(sys.stdout)) + +# The following class implements the data feeding service +class DatasetFeedService(dataset_feed_pb2_grpc.DatasetFeedServicer): + def __init__(self, q, kill_event): + ''' + param q: A shared queue containing data batches + param kill: Kill event for graceful shutdown + ''' + self.q = q + self.kill_event = kill_event + + + def get_examples(self, request, context): + while True: + #print('DEBUG: get_examples') + example = self.q.get() + yield dataset_feed_pb2.Example(image=example[0], + label=example[1]) + + + def shutdown(self, request, context): + logger.info("Received shutdown request - Not implemented") + # from main_grpc_client import shutdown_data_service + # shutdown_data_service() + context.set_code(grpc.StatusCode.OK) + context.set_details('Shutting down') + return dataset_feed_pb2.Dummy() + + +# The data loading and preprocessing logic. +# We chose to keep the existing logic unchanged, just instead +# of feeding the model, the dataloader feeds a shared queue +class MyMNIST(datasets.MNIST): + ''' + A personalized extension of the MNIST class in which we + modify the __len__ operation to return the maximum value + of int32 so that we do not run out of data. 
+ ''' + + def __init__(self, batch_size : int, iterations : int, **kwargs): + + super().__init__(**kwargs) + self.batch_size = batch_size + self.iterations = iterations + + def __len__(self) -> int: + size = self.batch_size * self.iterations + return size + + def __getitem__(self, index: int): + return super(MyMNIST, self).__getitem__(index % len(self.data)) + + +def fill_queue(q,kill, args): + + MyMNIST.mirrors = ["https://sagemaker-sample-files.s3.amazonaws.com/datasets/image/MNIST/"] + train_kwargs = {'batch_size': args.batch_size, + 'num_workers': args.num_data_workers} + transform=transforms.Compose([ + transforms.ToTensor(), + transforms.Normalize((0.1307,), (0.3081,)), + transforms.GaussianBlur(11) + ]) + dataset = MyMNIST(batch_size=args.batch_size, iterations=args.iterations, root='./data', train=True, + transform=transform, download=True) + loader = torch.utils.data.DataLoader(dataset, **train_kwargs) + for batch_idx, (data, target) in enumerate(loader): + if kill.is_set(): + logger.info('kill signal received, exiting fill_queue') + break + added = False + while not added and not kill.is_set(): + try: + # convert the data to bytestrings and add to queue + q.put((data.numpy().tobytes(), + target.type(torch.int8).numpy().tobytes()), + timeout=1) + #print(f'DEBUG: Added example to queue') + added = True + except: + continue + logger.info('Finished filling queue with dataset.') + + +def start(kill_event, args): + q = mp.Queue(maxsize=32) + queuing_process = mp.Process(target=fill_queue, args=(q, kill_event, args)) + queuing_process.start() + logger.info('Started queuing process.') + + server = grpc.server(futures.ThreadPoolExecutor(max_workers=args.grpc_workers)) + dataset_feed_pb2_grpc.add_DatasetFeedServicer_to_server( + DatasetFeedService(q, kill_event), server) + server.add_insecure_port('[::]:6000') + server.start() + logger.info('gRPC Data Server started at port 6000.') + return queuing_process,server + + +def shutdown(queuing_process, grpc_server): + 
logger.info('Shutting down...') + logger.info('Stopping gRPC server...') + grpc_server.stop(2).wait() + logger.info('Stopping queuing process...') + queuing_process.join(1) + queuing_process.terminate() + logger.info('Shutdown done.') + import os, time + os.system('kill -9 %d' % os.getpid()) + + +def wait_for_shutdown_signal(): + SHUTDOWN_PORT = 16000 + import socket + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.bind(('', SHUTDOWN_PORT)) + s.listen(1) + logger.info('Awaiting shutdown signal on port {}'.format(SHUTDOWN_PORT)) + conn, addr = s.accept() + print('Received shutdown signal from: ', addr) + try: + conn.close() + s.close() + except Exception as e: + logger.info(e) + + +def serve(args): + kill_event = mp.Event() # an mp.Event for graceful shutdown + queue_data_loader_process, grpc_server = start(kill_event, args) + wait_for_shutdown_signal() + kill_event.set() + shutdown(queue_data_loader_process, grpc_server) + +def read_args(): + import argparse + parser = argparse.ArgumentParser() + parser.add_argument("--batch-size", type=int, default=4, metavar="N", + help="input batch size for training",) + parser.add_argument("--num-data-workers", type=int, default=1, metavar="N", + help="based on no. of cpu per training instance",) + parser.add_argument("--num-dnn-workers", type=int, default=1, + help="based on no. of cpu per training instance",) + parser.add_argument("--iterations", type=int, default=10, metavar="N", + help="The number of iterations per epoch (multiply of 10)",) + parser.add_argument("--grpc-workers", type=int, default=1, metavar="N", + help="No. 
of gRPC server workers",) + parser.add_argument("--pin-memory", type=bool, default=1, + help="pin to GPU memory (default: True)",) + parser.add_argument("--first_data_host", type=str) + args, unknown = parser.parse_known_args() + return args + + +if __name__ == "__main__": + serve(read_args()) diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train_dnn.py b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train_dnn.py new file mode 100644 index 0000000000..6dfa6f59f7 --- /dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/code/train_dnn.py @@ -0,0 +1,179 @@ +import torch +import torch.nn as nn +import torch.nn.functional as F +import torch.optim as optim +from torchvision import datasets, transforms +import time +import grpc +import dataset_feed_pb2_grpc +import dataset_feed_pb2 +import logging +import sys +import json +import os + +#Pass environment variables to detect heterogenous host names +from sagemaker_training import environment + + +logger = logging.getLogger(__name__) +logger.setLevel(logging.DEBUG) +logger.addHandler(logging.StreamHandler(sys.stdout)) + +# Based on https://github.com/pytorch/examples/blob/master/mnist/main.py +class Net(nn.Module): + def __init__(self): + super(Net, self).__init__() + self.conv1 = nn.Conv2d(1, 32, 3, 1) + self.conv2 = nn.Conv2d(32, 64, 3, 1) + self.dropout1 = nn.Dropout(0.25) + self.dropout2 = nn.Dropout(0.5) + self.fc1 = nn.Linear(9216, 128) + self.fc2 = nn.Linear(128, 10) + def forward(self, x): + x = self.conv1(x) + x = F.relu(x) + x = self.conv2(x) + x = F.relu(x) + x = F.max_pool2d(x, 2) + x = self.dropout1(x) + x = torch.flatten(x, 1) + x = self.fc1(x) + x = F.relu(x) + x = self.fc2(x) + output = F.log_softmax(x, dim=1) + return output + + +# Decode binary data from SM_CHANNEL_TRAINING +# Decode and preprocess data +# Create map dataset +class RemoteDataset(torch.utils.data.IterableDataset): + ''' + An iterable PyTorch dataset that opens a connection to the + gRPC server and 
reads from a stream of data batches + ''' + + def __init__(self, data_host, batch_size, iterations): + self.data_host = data_host + self.batch_size = batch_size + self.iterations = iterations + + + def __len__(self) -> int: + size = self.batch_size * self.iterations + return size + + def get_stub(self): + channel = grpc.insecure_channel(f'{self.data_host}:6000', + # overwrite the default max message length + options=[('grpc.max_receive_message_length', + 200 * 1024 * 1024)]) + + try: + # print('Waiting for gRPC data server to be ready...') + grpc.channel_ready_future(channel).result(timeout=30) + except grpc.FutureTimeoutError: + logger.error('ERROR: Timeout connecting to gRPC data server. Check that it is running.') + raise + #print('Connected to gRPC data server.') + + return dataset_feed_pb2_grpc.DatasetFeedStub(channel,) + + + def __iter__(self): + import numpy as np + + examples = self.get_stub().get_examples(dataset_feed_pb2.Dummy()) + for s in examples: + image = torch.tensor(np.frombuffer(s.image, + dtype=np.float32)).reshape( + [self.batch_size, 1, 28, 28]) + label = torch.tensor(np.frombuffer(s.label, + dtype=np.int8)).reshape( + [self.batch_size]).type(torch.int64) + yield image, label + + + # def shutdown_remote(self): + # print('Calling remote server to shutdown') + # self.get_stub().shutdown(dataset_feed_pb2.Dummy()) + + +def main(args): + logger.info ('Training job started...') + use_cuda = args.num_gpus > 0 + device = torch.device("cuda" if use_cuda > 0 else "cpu") + + torch.manual_seed(args.seed) + if use_cuda: + torch.cuda.manual_seed(args.seed) + + train_kwargs = {'batch_size': None, #the data is already batched + 'num_workers': args.num_dnn_workers, + 'pin_memory': args.pin_memory + } + + dataset = RemoteDataset(args.dispatcher_host, args.batch_size, args.iterations) + train_loader = torch.utils.data.DataLoader(dataset, + **train_kwargs) + model = Net().to(device) + optimizer = optim.Adadelta(model.parameters()) + model.train() + t = 
time.perf_counter() + for idx, (data, target) in enumerate(train_loader, start=1): + data, target = data.to(device), target.to(device) + optimizer.zero_grad() + output = model(data) + loss = F.nll_loss(output, target) + loss.backward() + optimizer.step() + if device.type == 'cpu' or idx % 10 == 0: + logger.info( + f'{idx}: avg step time: {(time.perf_counter()-t)/idx}') + + # TODO: exit the loop through the iterator stopping by itself + if idx*args.batch_size==(dataset.__len__()): + break + + save_model(model, args.model_dir) + logger.info ('Training job completed!') + + +def save_model(model, model_dir): + logger.info("Saving the model") + path = os.path.join(model_dir, "model.pth") + torch.save(model.cpu().state_dict(), path) + return + + +def read_args(): + import argparse + parser = argparse.ArgumentParser() + + parser.add_argument("--batch-size", type=int, default=4, + help="Input batch size for training",) + parser.add_argument("--iterations", type=int, default=10, + help="Based on no. of cpu per training instance",) + parser.add_argument("--num-data-workers", type=int, default=1, metavar="N", + help="Based on no. of cpu per training instance type in data group",) + parser.add_argument("--num-dnn-workers", type=int, default=1, metavar="N", + help="Based on no. of cpu per training instance type in dnn group, ideally should match to grpc-workers",) + parser.add_argument("--grpc-workers", type=int, default=1, metavar="N", + help="No. 
of grpc server workers to start",) + parser.add_argument("--pin-memory", type=bool, default=1, + help="pin to GPU memory (default: True)",) + parser.add_argument("--seed", type=int, default=1, + help="random seed (default: 1)",) + parser.add_argument("--hosts", type=list, default=json.loads(os.environ["SM_HOSTS"])) + parser.add_argument("--current-host", type=str, default=os.environ["SM_CURRENT_HOST"]) + parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"]) + parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"]) + #parser.add_argument("--test", type=str, default=os.environ["SM_CHANNEL_TESTING"]) + parser.add_argument("--num-gpus", type=int, default=os.environ["SM_NUM_GPUS"]) + parser.add_argument("--dispatcher_host", type=str) + return parser.parse_args() + + +if __name__ == '__main__': + main(read_args()) diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb b/training/heterogeneous-clusters/pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb new file mode 100644 index 0000000000..7146988464 --- /dev/null +++ b/training/heterogeneous-clusters/pt.grpc.sagemaker/hetero-pytorch-mnist.ipynb @@ -0,0 +1,520 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "85da9619", + "metadata": {}, + "source": [ + "# PyTorch example demonstrating Amazon SageMaker Heterogeneous Clusters for model training\n", + "\n", + "---\n", + "### Description\n", + "Heterogeneous clusters enable launching training jobs that use multiple instance types in a single job. This capability can improve your training cost and speed by running different parts of the model training on the most suitable instance type. This use case typically happens in computer vision DL training, where training is bottlenecked on the CPU resources needed for data augmentation, leaving the expensive GPU underutilized. 
Heterogeneous clusters enable you to add more CPU resources to fully utilize GPUs, thus increasing training speed and cost-efficiency. For more details, you can find the documentation of this feature [here](https://docs.aws.amazon.com/sagemaker/latest/dg/train-heterogeneous-cluster.html).\n", + "\n", + "This notebook demonstrates how to use the Heterogeneous Cluster feature of SageMaker Training with PyTorch. The notebook works with the Python 3 (_PyTorch 1.11 Python 3.8 CPU Optimized_) image of a SageMaker Studio notebook, and runs on the _ml.t3.medium_ instance type.\n", + "\n", + "The notebook covers:\n", + "- Setting up SageMaker Studio Notebook \n", + "- Setting up the Training environment \n", + "- Submitting a Training job\n", + "- Monitoring and visualizing the CloudWatch metrics\n", + "- Comparing time-to-train and cost-to-train\n", + "- Conclusion \n", + "\n", + "In this sample notebook, we have taken the PyTorch model based on this [official MNIST example](https://github.com/pytorch/examples/tree/main/MNIST). We modified the training code to be heavy on data pre-processing. We are going to train this model in both Homogeneous and Heterogeneous Cluster modes. You select the mode by setting `IS_HETERO` to `False` or `True` in section **B.2 Configure environment variables**. \n", + "\n", + "Homogeneous Training Job - In this baseline we observe an ml.p3.2xlarge with an under-utilized GPU due to a CPU bottleneck. \n", + "\"homogeneous-training \n", + "\n", + "Heterogeneous Training Job - Here we add an ml.c5.9xlarge instance for extra CPU cores, to allow increased GPU usage of the ml.p3.2xlarge instance, and improve cost-efficiency. Both jobs run the same training code, training data set, pre-processing, and other relevant parameters.\n", + "\"heterogeneous-training\n", + "\n", + "In a homogeneous cluster training job, the data pre-processing and Deep Neural Network (DNN) training code run on the same instance. 
However, in a heterogeneous cluster training job, the data pre-processing code runs on the CPU nodes (hereafter referred to as **data_group or data group**), whereas the Deep Neural Network (DNN) training code runs on the GPU nodes (hereafter referred to as **dnn_group or dnn group**). The inter-node communication between the data and dnn groups is handled by a generic implementation of a [gRPC client-server interface](https://grpc.io/docs/languages/python/basics/).  \n", + "\n", + "The script (`launcher.py`) has the logic to detect (using SageMaker environment variables) whether the node it is running on belongs to data_group or dnn_group. If it is data_group, it spawns a separate process by executing `train_data.py`. This script runs a gRPC server service for serving processed training batches using [Protocol Buffers](https://developers.google.com/protocol-buffers/docs/overview). The gRPC server running on the data_group listens on a specific port (e.g. 6000). As documented in the code (`train_data.py`), we have chosen an implementation that keeps the data loading logic intact, where data batches are entered into a shared queue. The `get_samples` function of the `DataFeedService` pulls batches from the same queue and sends them to the client in the form of a continuous data stream. While the data is being served, the main entrypoint script `launcher.py` listens on port 16000 for a shutdown request coming from the gRPC client, i.e. the dnn group. The `train_data.py` process waits for a shutdown action from the parent process. \n", + "\n", + "If the node belongs to dnn_group, the main training script (`launcher.py`) spawns a separate set of processes by executing `train_dnn.py`. The script runs the gRPC client code and the DNN component of the training job. It consumes the processed training data from the gRPC server. We have defined an iterable PyTorch dataset, RemoteDataset, that opens a connection to the gRPC server, and reads from a stream of data batches. 
Once the model is trained with all the batches of training data, the gRPC client exits, and the parent process `launcher.py` sends a shutdown request on port 16000. This tells the gRPC server to shut down, and signals the end of the training job. \n", + "\n", + "Here is what the workflow looks like:\n", + "\n", + "\n", + "\n", + "This example notebook runs a training job on 2 instances, 1 in each node group. The data_group uses ml.c5.9xlarge whereas dnn_group uses ml.p3.2xlarge.\n", + "\n", + "This notebook refers to the following files and folders:\n", + "\n", + "- Folders: \n", + "  - `code`: this has the training (data pre-processing and dnn) scripts, and grpc client-server start and shutdown scripts\n", + "  - `images`: contains images referred to in the notebook\n", + "- Files: \n", + "  - `launcher.py`: entry point training script. This script is executed on all the nodes irrespective of which group they belong to. This is a parent process that decides whether to spawn the data pre-processing or the dnn component of the training job. It also handles the shutdown logic for the gRPC server. \n", + "  - `train_data.py`, `dataset_feed_pb2.py`, `dataset_feed_pb2_grpc.py`: these scripts run on the data_group nodes and are responsible for setting up, starting, and shutting down the grpc-server.\n", + "  - `train_dnn.py`: this script runs dnn code on the training data set. It fetches preprocessed data from the data_group node as a stream using gRPC client-server communication. It also sends a shutdown request after all the iterations on the preprocessed training data set. \n", + "  - `requirement.txt`: defines the packages required for gRPC \n", + "  - `train.py`: this script is the entry point script for SageMaker homogeneous cluster training. This script is picked up when you choose IS_HETERO = False. It uses a local dataset and runs both the data pre-processing and the dnn component on the same node. 
" + ] + }, + { + "cell_type": "markdown", + "id": "1f98cde9", + "metadata": {}, + "source": [ + "### Security group updates if running in a private VPC\n", + "This section is relevant if you plan to [run in a private VPC](https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html) (passing `subnets` and `security_group_ids` parameters when defining an Estimator). \n", + "SageMaker documentation recommends you [add](https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html#train-vpc-vpc) a rule for your security group that allows inbound connections between members of the same security group, for all TCP communication. This also covers the gRPC-related traffic between instances:\n", + "- the data_group instances listen on port 6000 for connections from all nodes. This stream is not encrypted. You can change the code to encrypt the connection if needed.\n", + "- the data_group instances listen on port 16000 for a shutdown signal from all nodes." + ] + }, + { + "cell_type": "markdown", + "id": "fd1e5aca", + "metadata": {}, + "source": [ + "### A. Setting up SageMaker Studio notebook\n", + "\n", + "#### Step 1 - Upgrade SageMaker SDK and dependent packages \n", + "Heterogeneous Clusters for Amazon SageMaker model training was [announced](https://aws.amazon.com/about-aws/whats-new/2022/07/announcing-heterogeneous-clusters-amazon-sagemaker-model-training) on 07/08/2022. As a first step, ensure you have updated versions of the SageMaker SDK, PyTorch, and the Boto3 client that enable this feature."
+ ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "54ff1687", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "python3 -m pip install --upgrade boto3 botocore awscli sagemaker" + ] + }, + { + "cell_type": "markdown", + "id": "0d20b2f3", + "metadata": {}, + "source": [ + "#### Step 2 - Restart the notebook kernel " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "229e1b18", + "metadata": {}, + "outputs": [], + "source": [ + "#import IPython\n", + "#IPython.Application.instance().kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "id": "a9592cda", + "metadata": {}, + "source": [ + "#### Step 3 - Validate SageMaker Python SDK and PyTorch versions\n", + "Ensure the output of the cell below reflects:\n", + "\n", + "- SageMaker Python SDK version 2.98.0 or above, \n", + "- boto3 1.24 or above \n", + "- botocore 1.27 or above \n", + "- PyTorch 1.10 or above " + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "0b0e3202", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Name: sagemaker\n", + "Version: 2.109.0\n", + "---\n", + "Name: torch\n", + "Version: 1.10.2+cpu\n", + "---\n", + "Name: boto3\n", + "Version: 1.24.72\n", + "---\n", + "Name: botocore\n", + "Version: 1.27.72\n" + ] + } + ], + "source": [ + "!pip show sagemaker torch boto3 botocore |egrep 'Name|Version|---'" + ] + }, + { + "cell_type": "markdown", + "id": "9176d868", + "metadata": {}, + "source": [ + "--------------\n", + "### B. 
Setting up the Training environment\n", + "\n", + "#### Step 1 - Import SageMaker components and set up the IAM role and Amazon S3 bucket" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "594fce53", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "arn:aws:iam::776941257690:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole\n", + "s3://sagemaker-us-east-1-776941257690/DEMO-MNIST\n" + ] + } + ], + "source": [ + "import os\n", + "import json\n", + "import datetime\n", + "import os\n", + "\n", + "import sagemaker\n", + "from sagemaker.pytorch import PyTorch\n", + "from sagemaker import get_execution_role\n", + "from sagemaker.instance_group import InstanceGroup\n", + "\n", + "\n", + "sess = sagemaker.Session()\n", + "\n", + "role = get_execution_role()\n", + "\n", + "output_path = \"s3://\" + sess.default_bucket() + \"/DEMO-MNIST\"\n", + "print(role)\n", + "print(output_path)" + ] + }, + { + "cell_type": "markdown", + "id": "165bca04", + "metadata": {}, + "source": [ + "#### Step 2 - Configure environment variables \n", + "This step defines whether to run the training job in heterogeneous cluster mode. It also defines the instance groups, the number of nodes in each group, and the hyperparameter values. For baselining, run a homogeneous cluster training job by setting `IS_HETERO = False`. This will let both the data pre-processing and DNN code run on the same node i.e. `ml.p3.2xlarge`. \n", + "\n", + "\n", + "Test configuration (if running training on p3.2xl or g5.2xl as dnn_group instance type, and c5.2xl as data_group instance type; training duration: 7-8 mins): \n", + "`num-data-workers: 4` \n", + "`grpc-workers: 4` \n", + "`num-dnn-workers: 4` \n", + "`pin-memory: True` \n", + "`iterations : 100` \n", + "\n", + "Performance configuration (if running training on p3.2xl as dnn_group instance type, and c5.9xl as data_group instance type OR training in homogeneous cluster mode i.e. 
g5.8xl): (training duration - 30 mins) \n", + "`num-data-workers: 32` \n", + "`grpc-workers: 2` \n", + "`num-dnn-workers: 2` \n", + "`pin-memory: True` \n", + "`iterations : 4800`\n", + "\n", + "Performance configuration (if running training on p3.2xl in homogeneous cluster mode): \n", + "`num-data-workers: 8` \n", + "`grpc-workers: 2` \n", + "`num-dnn-workers: 2` \n", + "`pin-memory: True` \n", + "`iterations : 2400`\n", + "\n", + "Note: This PyTorch example has not been tested with multiple instances in an instance group. " + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "0d65707b", + "metadata": {}, + "outputs": [], + "source": [ + "IS_CLOUD_JOB = True\n", + "IS_HETERO = True # if set to false, uses homogeneous cluster\n", + "PT_DATA_MODE = \"service\" if IS_HETERO else \"local\" # local | service\n", + "IS_DNN_DISTRIBUTION = False # Distributed Training with DNN nodes not tested, set it to False\n", + "\n", + "data_group = InstanceGroup(\n", + " \"data_group\", \"ml.c5.9xlarge\", 1\n", + ") # 36 vCPU #change the instance type if IS_HETERO=True\n", + "dnn_group = InstanceGroup(\n", + " \"dnn_group\", \"ml.p3.2xlarge\", 1\n", + ") # 8 vCPU #change the instance type if IS_HETERO=True\n", + "\n", + "kwargs = dict()\n", + "kwargs[\"hyperparameters\"] = {\n", + " \"batch-size\": 8192,\n", + " \"num-data-workers\": 4, # This number drives the avg. step time. More workers help parallel pre-processing of data. Recommendation: Total no. of cpu 'n' = 'num-data-workers'+'grpc-workers'+ 2 (reserved)\n", + " \"grpc-workers\": 4, # No. of workers serving pre-processed data to DNN group (gRPC client). See the formula above.\n", + " \"num-dnn-workers\": 4, # Modify this no. to be less than the number of CPU cores of your training instances in dnn group\n", + " \"pin-memory\": True, # Pin to GPU memory\n", + " \"iterations\": 100, # No. 
of iterations in an epoch (must be a multiple of 10).\n", + "}\n", + "\n", + "if IS_HETERO:\n", + " kwargs[\"instance_groups\"] = [data_group, dnn_group]\n", + " entry_point = \"launcher.py\"\n", + "else:\n", + " kwargs[\"instance_type\"] = (\n", + " \"ml.p3.2xlarge\" if IS_CLOUD_JOB else \"local\"\n", + " ) # change the instance type if IS_HETERO=False\n", + " kwargs[\"instance_count\"] = 1\n", + " entry_point = \"train.py\"\n", + "\n", + "if IS_DNN_DISTRIBUTION:\n", + " processes_per_host_dict = {\n", + " \"ml.g5.xlarge\": 1,\n", + " \"ml.g5.12xlarge\": 4,\n", + " \"ml.p3.8xlarge\": 4,\n", + " \"ml.p4d.24xlarge\": 8,\n", + " }\n", + " kwargs[\"distribution\"] = {\n", + " \"mpi\": {\n", + " \"enabled\": True,\n", + " \"processes_per_host\": processes_per_host_dict[dnn_group.instance_type],\n", + " \"custom_mpi_options\": \"-x NCCL_DEBUG=INFO\",\n", + " },\n", + " }\n", + " if IS_HETERO:\n", + " kwargs[\"distribution\"][\"instance_groups\"] = [dnn_group]\n", + "\n", + " print(f\"distribution={kwargs['distribution']}\")" + ] + }, + { + "cell_type": "markdown", + "id": "4ff19e24", + "metadata": {}, + "source": [ + "#### Step 3: Set up the Estimator\n", + "In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training."
+ ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "94f4c8ce", + "metadata": {}, + "outputs": [], + "source": [ + "estimator = PyTorch(\n", + " framework_version=\"1.11.0\", # 1.10.0 or later\n", + " py_version=\"py38\", # Python v3.8\n", + " role=role,\n", + " entry_point=entry_point,\n", + " source_dir=\"code\",\n", + " volume_size=10,\n", + " max_run=4800,\n", + " disable_profiler=True,\n", + " debugger_hook_config=False,\n", + " **kwargs,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "a81dcab6", + "metadata": {}, + "source": [ + "#### Step 4: Download the MNIST Data and Upload it to S3 bucket\n", + "\n", + "This is an optional step for now. The training job downloads the data on its run directly from MNIST website to the data_group node (grpc server). " + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "d0534973", + "metadata": {}, + "outputs": [], + "source": [ + "import logging\n", + "import boto3\n", + "from botocore.exceptions import ClientError\n", + "\n", + "# Download training and testing data from a public S3 bucket\n", + "\n", + "\n", + "def download_from_s3(data_dir=\"./data\", train=True):\n", + " \"\"\"Download MNIST dataset and convert it to numpy array\n", + "\n", + " Args:\n", + " data_dir (str): directory to save the data\n", + " train (bool): download training set\n", + "\n", + " Returns:\n", + " None\n", + " \"\"\"\n", + "\n", + " if not os.path.exists(data_dir):\n", + " os.makedirs(data_dir)\n", + "\n", + " if train:\n", + " images_file = \"train-images-idx3-ubyte.gz\"\n", + " labels_file = \"train-labels-idx1-ubyte.gz\"\n", + " else:\n", + " images_file = \"t10k-images-idx3-ubyte.gz\"\n", + " labels_file = \"t10k-labels-idx1-ubyte.gz\"\n", + "\n", + " # download objects\n", + " s3 = boto3.client(\"s3\")\n", + " bucket = f\"sagemaker-sample-files\"\n", + " for obj in [images_file, labels_file]:\n", + " key = os.path.join(\"datasets/image/MNIST\", obj)\n", + " dest = os.path.join(data_dir, 
obj)\n", + " if not os.path.exists(dest):\n", + " s3.download_file(bucket, key, dest)\n", + " return\n", + "\n", + "\n", + "download_from_s3(\"./data\", True)\n", + "download_from_s3(\"./data\", False)" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "2d699654", + "metadata": {}, + "outputs": [], + "source": [ + "# Upload to the default bucket\n", + "\n", + "prefix = \"DEMO-MNIST\"\n", + "bucket = sess.default_bucket()\n", + "loc = sess.upload_data(path=\"./data\", bucket=bucket, key_prefix=prefix)\n", + "\n", + "channels = {\"training\": loc, \"testing\": loc}" + ] + }, + { + "cell_type": "markdown", + "id": "48352f04", + "metadata": {}, + "source": [ + "## C. Submit the training job\n", + "\n", + "The job runs for the predefined iterations. DNN instance group sends a shutdown request to data group after done with the training. You can see the following entries in the CloudWatch logs of dnn instance. A job with 4800 iterations finishes in 29 mins in a Heterogeneous cluster composed of 1x ml.c5.9xlarge as data node and 1x ml.p3.2xlarge as DNN node.\n", + "\n", + "Note: The console output of billing seconds can be ignored. 
See the AWS console > SageMaker > Training Job for the exact billing seconds.\n", + "\n", + "Log excerpt from algo-1 (DNN instance)\n", + "```\n", + "4780: avg step time: 0.19709917231025106\n", + "INFO:__main__:4780: avg step time: 0.19709917231025106\n", + "4790: avg step time: 0.19694106239373696\n", + "INFO:__main__:4790: avg step time: 0.19694106239373696\n", + "4800: avg step time: 0.196784295383125\n", + "Saving the model\n", + "INFO:__main__:4800: avg step time: 0.196784295383125\n", + "INFO:__main__:Saving the model\n", + "Training job completed!\n", + "INFO:__main__:Training job completed!\n", + "Process train_dnn.py closed with returncode=0\n", + "Shutting downdata service dispatcher via: [algo-2:16000]\n", + "shutdown request sent to algo-2:16000\n", + "2022-08-16 01:15:05,555 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.\n", + "2022-08-16 01:15:05,555 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.\n", + "2022-08-16 01:15:05,556 sagemaker-training-toolkit INFO Reporting training SUCCESS\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31cb6cae", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2022-09-15 00:55:22 Starting - Starting the training job...\n", + "2022-09-15 00:55:50 Starting - Preparing the instances for training.........\n", + "2022-09-15 00:57:10 Downloading - Downloading input data.." + ] + } + ], + "source": [ + "estimator.fit(\n", + " inputs=channels,\n", + " job_name=\"pt-hetero\"\n", + " + \"-\"\n", + " + \"H-\"\n", + " + str(IS_HETERO)[0]\n", + " + \"-\"\n", + " + datetime.datetime.utcnow().strftime(\"%Y%m%dT%H%M%SZ\"),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "3ea2e092", + "metadata": {}, + "source": [ + "## D. 
Monitoring Instance Metrics for GPU and CPU utilization\n", + "\n", + "Click on **View instance metrics** from the **Training jobs** node in the **Amazon SageMaker Console**. In the run above, all 30 vCPU of the data node (algo-1) are approx. 100% utilized, and the GPU utilization is at 100% at frequent intervals in the DNN node (algo-2). To rescale the CloudWatch Metrics to 100% on CPU utilization for algo-1 and algo-2, use the CloudWatch \"Add Math\" feature and divide the value by the no. of cores on those instance types.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "430fb45e", + "metadata": {}, + "source": [ + "## E. Comparing time-to-train and cost-to-train\n", + "\n", + "Let's continue with the above example i.e. train a heavy data pre-processing (CPU intensive) model (MNIST) requiring only 1 GPU. We start with ml.p3.2xlarge (1xV100 GPU, 8x vCPU) in homogeneous cluster mode to get the baseline performance numbers. Due to the no. of CPU cores, we could not go beyond 8 data loader workers for data pre-processing. The avg. step cost was `7.6 cents` and the avg. step time was `1.19 seconds`. \n", + "\n", + "Our objective is to reduce the cost and speed up the model training time. The first choice here is to scale up the instance type in the same family. If we had moved to the next instance type (4 GPUs) in the P3 family, the GPUs would have been underutilized. In this case, we needed a higher vCPU-to-GPU ratio. Assuming there is no suitable instance type in another instance family, or the model is incompatible with the CPU/GPU architectures of other instance families, we are constrained to use ml.p3.2xlarge. The only way then to get a higher vCPU-to-GPU ratio is to use the SageMaker Heterogeneous Cluster feature, which enables customers to offload data pre-processing logic to CPU-only instance types, for example ml.c5. In the next test, we offloaded the CPU intensive work i.e. data preprocessing to ml.c5.9xlarge (36 vCPU) and continued using ml.p3.2xlarge for DNN. The avg. 
step cost was `1.9 cents` and the avg. step time was `0.18 seconds`. \n", + "\n", + "In summary, we reduced the avg. step cost by about 4 times (from 7.6 to 1.9 cents), and the avg. step time by about 6.5 times. This was possible because with the higher CPU count, we could use 32 data loader workers (compared to 8 with p3.2xl) to preprocess the data, and kept the GPU close to 100% utilized at frequent intervals. Note: These numbers are just a sample; you have to benchmark with your own model and dataset to determine the exact price-performance benefits. \n", + "\n", + "## F. Conclusion\n", + "In this notebook, we demonstrated how to leverage the heterogeneous cluster feature of SageMaker Training to achieve better price performance. To get started you can copy this example project, and only change the `train_dnn.py` script.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.9.7 ('.venv': venv)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + }, + "vscode": { + "interpreter": { + "hash": "77c0de85c2cb739aa5100af7b92fb9d2075368f0e653f4148499a56c989df5f7" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/images/heterogeneous-cluster-diagram.png b/training/heterogeneous-clusters/pt.grpc.sagemaker/images/heterogeneous-cluster-diagram.png new file mode 100644 index 0000000000..c84c185d31 Binary files /dev/null and b/training/heterogeneous-clusters/pt.grpc.sagemaker/images/heterogeneous-cluster-diagram.png differ diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/images/heterogeneous-instance-metrics.png 
b/training/heterogeneous-clusters/pt.grpc.sagemaker/images/heterogeneous-instance-metrics.png new file mode 100644 index 0000000000..d812e2f46b Binary files /dev/null and b/training/heterogeneous-clusters/pt.grpc.sagemaker/images/heterogeneous-instance-metrics.png differ diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/images/homogeneous-cluster-diagram.png b/training/heterogeneous-clusters/pt.grpc.sagemaker/images/homogeneous-cluster-diagram.png new file mode 100644 index 0000000000..ae3119d4b6 Binary files /dev/null and b/training/heterogeneous-clusters/pt.grpc.sagemaker/images/homogeneous-cluster-diagram.png differ diff --git a/training/heterogeneous-clusters/pt.grpc.sagemaker/images/pytorch-heterogeneous-workflow.png b/training/heterogeneous-clusters/pt.grpc.sagemaker/images/pytorch-heterogeneous-workflow.png new file mode 100644 index 0000000000..c40b09aeed Binary files /dev/null and b/training/heterogeneous-clusters/pt.grpc.sagemaker/images/pytorch-heterogeneous-workflow.png differ diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/cloudwatch-metric-definitions/heterogenenous-workload.json b/training/heterogeneous-clusters/tf.data.service.sagemaker/cloudwatch-metric-definitions/heterogenenous-workload.json new file mode 100644 index 0000000000..5b3736fc35 --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/cloudwatch-metric-definitions/heterogenenous-workload.json @@ -0,0 +1,30 @@ +{ + "metrics": [ + [ { "expression": "100*(m1/9600)", "label": "DNN1 CPU/100%", "id": "e1" } ], + [ { "expression": "100*(m2/800)", "label": "DNN1 GPU/100%", "id": "e2" } ], + [ { "expression": "100*(m3/7200)", "label": "DATA1 CPU/100%", "id": "e3" } ], + [ { "expression": "100*(m4/7200)", "label": "DATA2 CPU/100%", "id": "e4" } ], + [ "/aws/sagemaker/TrainingJobs", "CPUUtilization", "Host", "hetero-tf-data-service-Dnode2-wrkrs-1-20220922T214326Z/algo-1", { "id": "m1", "yAxis": "left", "visible": false } ], + [ ".", 
"GPUUtilization", ".", ".", { "id": "m2", "visible": false } ], + [ ".", "CPUUtilization", ".", "hetero-tf-data-service-Dnode2-wrkrs-1-20220922T214326Z/algo-2", { "id": "m3", "visible": false } ], + [ "...", "hetero-tf-data-service-Dnode2-wrkrs-1-20220922T214326Z/algo-3", { "id": "m4", "visible": false } ] + ], + "sparkline": true, + "view": "timeSeries", + "stacked": false, + "region": "us-east-1", + "stat": "Average", + "period": 60, + "setPeriodToTimeRange": true, + "yAxis": { + "left": { + "min": 0, + "max": 100, + "label": "% Utilization", + "showUnits": false + } + }, + "legend": { + "position": "bottom" + } +} \ No newline at end of file diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/cloudwatch-metric-definitions/homogenous-workload copy.json b/training/heterogeneous-clusters/tf.data.service.sagemaker/cloudwatch-metric-definitions/homogenous-workload copy.json new file mode 100644 index 0000000000..c514eced5a --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/cloudwatch-metric-definitions/homogenous-workload copy.json @@ -0,0 +1,26 @@ +{ + "sparkline": true, + "metrics": [ + [ { "expression": "100*(m1/9600)", "label": "CPU/100%", "id": "e1" } ], + [ { "expression": "100*(m2/800)", "label": "GPU/100%", "id": "e2" } ], + [ "/aws/sagemaker/TrainingJobs", "CPUUtilization", "Host", "hetero-tf-data-local-Dnode1-wrkrs-1-20220921T213920Z/algo-1", { "id": "m1", "visible": false } ], + [ ".", "GPUUtilization", ".", ".", { "id": "m2", "visible": false } ] + ], + "view": "timeSeries", + "stacked": false, + "region": "us-east-1", + "stat": "Average", + "period": 60, + "setPeriodToTimeRange": true, + "yAxis": { + "left": { + "min": 0, + "max": 100, + "label": "% Utilization", + "showUnits": false + } + }, + "legend": { + "position": "bottom" + } +} \ No newline at end of file diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/code/launcher.py 
b/training/heterogeneous-clusters/tf.data.service.sagemaker/code/launcher.py new file mode 100644 index 0000000000..d67b27af7d --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/code/launcher.py @@ -0,0 +1,123 @@ +import sys +import os +import time +from typing import Optional +import subprocess + +# instance group names +DATA_GROUP = 'data_group' +DNN_GROUP = 'dnn_group' + + +def start_child_process_async(name : str, additional_args=[]) -> int: + #TODO: Find a way to stream stdout and stderr to the parent process + params = ["python", f"./{name}"] + sys.argv[1:] + additional_args + print(f'Opening process async: {params}') + p = subprocess.Popen(params) + print(f'Process {name} started') + return p.pid + + +def start_child_process(name : str, additional_args=[]) -> int: + params = ["python", f"./{name}"] + sys.argv[1:] + additional_args + print(f'Opening process: {params}') + p = subprocess.run(params) + print(f'Process {name} closed with returncode={p.returncode}') + return p.returncode + + +def start_data_group(dispatcher_host : str) -> int: + return start_child_process('train_data.py', ["--dispatcher_host", dispatcher_host]) + + +def not_mpi_or_rank_0() -> bool: + return 'OMPI_COMM_WORLD_LOCAL_RANK' not in os.environ or os.environ['OMPI_COMM_WORLD_LOCAL_RANK'] == '0' + + +def start_dnn_group(dispatcher_host : Optional[str]) -> int: + if dispatcher_host is not None: + args = ["--dispatcher_host", dispatcher_host] + # Start a tf.data.service worker processes for each host in the DNN group + # to take advantage of its CPU resources. 
+ # Start once per instance, not per MPI process + if not_mpi_or_rank_0(): + start_child_process_async('train_data.py', args) + else: + args = [] + return start_child_process('train_dnn.py', args) + + +def get_group_first_host(instance_groups, target_group_name): + return instance_groups[target_group_name]['hosts'][0] + + +def is_mpi_and_not_world_rank_0() -> bool: + return 'OMPI_COMM_WORLD_RANK' in os.environ and os.environ['OMPI_COMM_WORLD_RANK'] != '0' + + +def shutdown_tf_data_service_with_retries(hosts : list): + # only world rank 0 process should shutdown the dispatcher + if is_mpi_and_not_world_rank_0(): + return + + completed_hosts = [] + for host in hosts: + for i in range(0,12): + try: + if i>0: + sleeptime = 10 + print(f'Will attempt {i} time to shutdown in {sleeptime} seconds') + time.sleep(sleeptime) + + if host not in completed_hosts: + _shutdown_data_service(host) + completed_hosts.append(host) + break + except Exception as e: + print(f'Failed to shutdown dispatcher in {host} due to: {e}') + + +def _shutdown_data_service(dispatcher_host : str): + SHUTDOWN_PORT = 16000 + print(f'Shutting down tf.data.service dispatcher via: [{dispatcher_host}:{SHUTDOWN_PORT}]') + import socket + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + s.connect((dispatcher_host, SHUTDOWN_PORT)) + print(f'Shutdown request sent to {dispatcher_host}:{SHUTDOWN_PORT}') + + +def split_to_instance_group_train_script() -> int: + from sagemaker_training import environment + env = environment.Environment() + + print(f'env.is_hetero={env.is_hetero}') + print(f'current_host={env.current_host}') + + if env.is_hetero: + dispatcher_host = get_group_first_host(env.instance_groups_dict, DATA_GROUP) + first_host_in_dnn_group = get_group_first_host(env.instance_groups_dict, DNN_GROUP) + print(f'current_instance_type={env.current_instance_type}') + print(f'current_group_name={env.current_instance_group}') + print(f'dispatcher_host={dispatcher_host}') + if env.current_instance_group 
== DATA_GROUP: + return start_data_group(dispatcher_host) + elif env.current_instance_group == DNN_GROUP: + returncode = start_dnn_group(dispatcher_host) + # first host in DNN group will take care of shutting down the dispatcher + if env.current_host == first_host_in_dnn_group: + hosts = env.instance_groups_dict[DATA_GROUP]['hosts'] + env.instance_groups_dict[DNN_GROUP]['hosts'] + shutdown_tf_data_service_with_retries(hosts) + return returncode + else: + raise Exception(f'Unknown instance group: {env.current_instance_group}') + + else: # not heterogeneous + return start_dnn_group(dispatcher_host=None) + +if __name__ == "__main__": + try: + returncode = split_to_instance_group_train_script() + sys.exit(returncode) + except Exception as e: + print(f'Failed due to {e}. exiting with returncode=1') + sys.exit(1) \ No newline at end of file diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/code/requirements.txt b/training/heterogeneous-clusters/tf.data.service.sagemaker/code/requirements.txt new file mode 100644 index 0000000000..64b956a22e --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/code/requirements.txt @@ -0,0 +1,2 @@ +protobuf==3.20.1 +tensorflow-addons==0.17.0 \ No newline at end of file diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/code/train_data.py b/training/heterogeneous-clusters/tf.data.service.sagemaker/code/train_data.py new file mode 100644 index 0000000000..62248d1e68 --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/code/train_data.py @@ -0,0 +1,68 @@ +from tensorflow.data.experimental.service import DispatchServer, WorkerServer, DispatcherConfig, WorkerConfig + +def wait_for_shutdown_signal(dispatcher, workers): + SHUTDOWN_PORT = 16000 + import socket + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.bind(('', SHUTDOWN_PORT)) + s.listen(1) + print('Awaiting shutdown signal on port {}'.format(SHUTDOWN_PORT)) + conn, addr = s.accept() + 
print('Received shutdown signal from: ', addr) + try: + conn.close() + s.close() + except Exception as e: + print(e) + + if dispatcher is not None: # dispatcher runs only on the 1st data instance + print('Stopping dispatcher.') + dispatcher._stop() + print('Joining dispatcher') + dispatcher.join() + + for i,worker in enumerate(workers, start=0): + print(f'Stopping worker {i}') + worker._stop() + print(f'Joining worker {i}') + worker.join() + +def create_worker(workerIndex : int, dispatcher_host : str, current_host : str) -> WorkerServer: + port = 6001 + workerIndex + w_config = WorkerConfig(port=port, + dispatcher_address=f'{dispatcher_host}:6000', + worker_address=f'{current_host}:{port}') + print(f'Starting tf.data.service WorkerServer {w_config}') + worker = WorkerServer(w_config) + return worker + +def start_dispatcher_and_worker(dispatcher_host : str, current_host : str, num_of_data_workers : int): + assert(dispatcher_host is not None) + + if current_host == dispatcher_host: + print(f'starting Dispatcher (dispatcher_host={dispatcher_host})') + d_config = DispatcherConfig(port=6000) + dispatcher = DispatchServer(d_config) + else: + dispatcher = None + + workers = [ create_worker(i, dispatcher_host, current_host) for i in range(num_of_data_workers) ] + print(f'Finished starting dispatcher and {num_of_data_workers} workers') + + wait_for_shutdown_signal(dispatcher, workers) + + +"This function read mode command line argument" +def read_args(): + import argparse, os + parser = argparse.ArgumentParser() + parser.add_argument("--dispatcher_host", type=str) + parser.add_argument("--current_host", type=str, default=os.environ["SM_CURRENT_HOST"]) + parser.add_argument("--num_of_data_workers", type=int) + args, unknown = parser.parse_known_args() + return args + + +if __name__ == "__main__": + args = read_args() + start_dispatcher_and_worker(args.dispatcher_host, args.current_host, args.num_of_data_workers) \ No newline at end of file diff --git 
a/training/heterogeneous-clusters/tf.data.service.sagemaker/code/train_dnn.py b/training/heterogeneous-clusters/tf.data.service.sagemaker/code/train_dnn.py new file mode 100644 index 0000000000..cd938b46c7 --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/code/train_dnn.py @@ -0,0 +1,153 @@ +from tensorflow.keras.layers.experimental import preprocessing +from tensorflow.keras.applications.resnet50 import ResNet50 +import tensorflow_addons as tfa +import tensorflow as tf +import os +import horovod.tensorflow.keras as hvd + + +# dilation filter +def dilate(image, label): + dilateFilter = tf.zeros([3, 3, 3], tf.uint8) + image = tf.expand_dims(image, 0) + image = tf.nn.dilation2d( + image, dilateFilter, strides=[1, 1, 1, 1], + dilations=[1, 1, 1, 1], + padding='SAME', + data_format='NHWC') + image = tf.squeeze(image) + return image, label + + +# blur filter +def blur(image, label): + image = tfa.image.gaussian_filter2d(image=image, + filter_shape=(11, 11), sigma=0.8) + return image, label + +# rescale filter +def rescale(image, label): + image = preprocessing.Rescaling(1.0 / 255)(image) + return image, label + + +# augmentation filters +def augment(image, label): + data_augmentation = tf.keras.Sequential( + [preprocessing.RandomFlip("horizontal"), + preprocessing.RandomRotation(0.1), + preprocessing.RandomZoom(0.1)]) + image = data_augmentation(image) + return image, label + + +# This function generates a dataset consisting of 32x32x3 random images +# and a corresponding random label representing 10 different classes. 
+# As this dataset is randomly generated, you should not expect the model +# to converge in a meaningful way. That doesn't matter, as our intent is +# only to measure data pipeline and DNN optimization throughput +def generate_artificial_dataset(): + import numpy as np + x_train = np.random.randint(0, 255, size=(32000, 32, 32, 3), dtype=np.uint8) + y_train = np.random.randint(0, 10, size=(32000,1)) + train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)) + return train_dataset + + +def get_dataset(batch_size : int, use_tf_data_service : bool, dispatcher_host : str): + autotune = tf.data.experimental.AUTOTUNE + options = tf.data.Options() + options.experimental_deterministic = False + + ds = generate_artificial_dataset().shuffle(10000).repeat() + + ds = ds.map(dilate, num_parallel_calls=autotune) + ds = ds.map(blur, num_parallel_calls=autotune) + ds = ds.map(rescale,num_parallel_calls=autotune) + ds = ds.map(augment, num_parallel_calls=autotune) + ds = ds.batch(batch_size) + + if use_tf_data_service: + ds = ds.apply(tf.data.experimental.service.distribute( + processing_mode="parallel_epochs", + service=f'grpc://{dispatcher_host}:6000',), + ) + + #ds = ds.take(1).cache().repeat() + ds = ds.prefetch(autotune) + return ds + +# This function reads the command-line arguments +def read_args(): + import argparse + parser = argparse.ArgumentParser() + parser.add_argument('--tf_data_mode', type=str, default='local', + help="'service' distributed dataset using tf.data.service. 
'local' use standard tf.data") + parser.add_argument('--steps_per_epoch', type=int, default=1) + parser.add_argument('--batch_size', type=int) + parser.add_argument('--epochs', type=int, default=1) + parser.add_argument("--n_gpus", type=str, + default=os.environ.get("SM_NUM_GPUS")) + parser.add_argument("--dispatcher_host", type=str) + parser.add_argument("--num_of_data_workers", type=int, default=1) + parser.add_argument("--output_data_dir", type=str, + default=os.environ.get("SM_OUTPUT_DATA_DIR")) + parser.add_argument("--model_dir", type=str, + default=os.environ.get("SM_MODEL_DIR")) + parser.add_argument("--checkpoint-path",type=str,default="/opt/ml/checkpoints",help="Path where checkpoints are saved.") + args = parser.parse_args() + return args + +if __name__ == "__main__": + args = read_args() + hvd.init() + # Horovod: pin GPU to be used to process local rank (one GPU per process) + gpus = tf.config.experimental.list_physical_devices('GPU') + print(str(gpus)) + for gpu in gpus: + tf.config.experimental.set_memory_growth(gpu, True) + if gpus: + print(f'hvd.local_rank() {hvd.local_rank()}') + tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU') + + model = ResNet50(weights=None, + input_shape=(32, 32, 3), + classes=10) + + model.compile(loss=tf.losses.SparseCategoricalCrossentropy(), + optimizer=tf.optimizers.Adam()) + # Horovod: adjust learning rate based on number of GPUs. 
+ scaled_lr = 0.001 * hvd.size() + opt = tf.optimizers.Adam(scaled_lr) + opt = hvd.DistributedOptimizer( + opt, backward_passes_per_step=1, average_aggregated_gradients=True) + + model.compile(loss=tf.losses.SparseCategoricalCrossentropy(), + optimizer=opt, + experimental_run_tf_function=False) + + callbacks = [ + hvd.callbacks.BroadcastGlobalVariablesCallback(0), + hvd.callbacks.MetricAverageCallback(), + hvd.callbacks.LearningRateWarmupCallback(initial_lr=scaled_lr, warmup_epochs=3, verbose=1), + ] + # Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them. + if hvd.rank() == 0: + path = os.path.join(args.checkpoint_path, 'checkpoint-{epoch}.h5') + callbacks.append(tf.keras.callbacks.ModelCheckpoint(path)) + + # Horovod: write logs on worker 0. + verbose = 1 if hvd.rank() == 0 else 0 + + assert(args.tf_data_mode == 'local' or args.tf_data_mode == 'service') + print(f'Running in {args.tf_data_mode} tf_data_mode.') + dataset = get_dataset(batch_size = args.batch_size, use_tf_data_service=(args.tf_data_mode == 'service'), dispatcher_host = args.dispatcher_host) + + model.fit( dataset, + steps_per_epoch=args.steps_per_epoch, + callbacks=callbacks, + epochs=args.epochs, + verbose=2,) + + if hvd.rank() == 0: + model.save(os.path.join(args.model_dir, '000000001')) \ No newline at end of file diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb b/training/heterogeneous-clusters/tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb new file mode 100644 index 0000000000..61cdc56cf5 --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/hetero-tensorflow-restnet50.ipynb @@ -0,0 +1,1136 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# TensorFlow's tf.data.service with Amazon SageMaker Training Heterogeneous Clusters\n", + "\n", + "---\n", + "### Introduction\n", + "\n", + "Heterogeneous clusters enable 
launching training jobs that use multiple instance types in a single job. This capability can improve your training cost and speed by running different parts of the model training on the most suitable instance type. This use case typically arises in computer vision (CV) deep learning (DL) training, where training is bottlenecked on the CPU resources needed for data augmentation, leaving the expensive GPUs underutilized. Heterogeneous clusters let you add CPU resources so that the GPUs are fully utilized, increasing training speed and cost-efficiency. For more details, see the documentation of this feature [here](https://docs.aws.amazon.com/sagemaker/latest/dg/train-heterogeneous-cluster.html).\n", + "\n", + "This notebook demonstrates how to use Heterogeneous Clusters with TensorFlow's [tf.data.service](https://www.TensorFlow.org/api_docs/python/tf/data/experimental/service). It trains a CPU-intensive DL CV workload and compares cost and performance between homogeneous and heterogeneous training configurations. \n", + "\n", + "💡To get started quickly with heterogeneous clusters, we suggest you reuse the provided code as a quick way to migrate your workload from a local tf.data pipeline to a distributed tf.data.service pipeline. You'll need to change [code/train_dnn.py](./code/train_dnn.py), while keeping [code/train_data.py](./code/train_data.py) and [code/launcher.py](code/launcher.py) intact. 
This is explained below in the [Workload Details] section.\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook covers:\n", + "- A guide to switching from a homogeneous job (single instance type) to a heterogeneous job (multiple instance types)\n", + "- How to use heterogeneous clusters with TensorFlow's tf.data.service\n", + "- Set up an Amazon SageMaker Studio notebook \n", + "- Run a homogeneous cluster training job \n", + "- Run a heterogeneous cluster training job \n", + "- Compare time and cost to train between homogeneous and heterogeneous clusters\n", + "- Conclusion\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A guide to switching from a homogeneous to a heterogeneous job\n", + "\n", + "This notebook runs and compares these two workloads:\n", + "\n", + "Homogeneous Training Job - The image shows the GPUs of an ml.p4d.24xlarge instance under-utilized due to a CPU bottleneck. \n", + "(Image: homogeneous training job)\n", + " \n", + "Heterogeneous Training Job - The image shows two ml.c5.18xlarge instances added for their extra CPU cores, which reduces the CPU bottleneck and increases GPU usage, improving training speed and cost-efficiency. \n", + "(Image: heterogeneous training job)\n", + "\n", + "In each workload, the training data is an artificially generated dataset consisting of 32x32x3 images with random pixel values, and a corresponding random label representing 10 different classes. As this dataset is randomly generated, you should not expect the model to converge in a meaningful way. This doesn't matter, as our intent is only to measure data pipeline and neural network optimization throughput, expressed in epoch/step time. \n", + "The model we use is [ResNet50](https://www.TensorFlow.org/api_docs/python/tf/keras/applications/ResNet50). 
Both workloads use an 8-GPU instance, ml.p4d.24xlarge, and use Horovod for data parallelism. 
\n", + "\n", + "The heterogeneous job will include two instance groups:\n", + "- **data_group** - A group of CPU instances that will run data pre-processing code.\n", + "- **dnn_group** - A group of GPU instances that will run Deep Neural Network training code.\n", + "\n", + "In this example, the inter-node communication between the CPU and GPU instance groups is implemented using the [TensorFlow data service feature](https://www.TensorFlow.org/api_docs/python/tf/data/experimental/service). This feature allows offloading a configurable amount of preprocessing work to worker machines. Note that SageMaker heterogeneous clusters do not provide out-of-the-box support for communication between instance groups; it is up to the user to implement it (we provide a reference implementation here).\n", + "\n", + "This notebook refers to the following files and folders:\n", + "- [code/train_dnn.py](./code/train_dnn.py) - A standard TF training script with a single reference to tf.data.service when setting up the tf.data pipeline. This script will be executed on the GPU instances belonging to the dnn_group.\n", + "- [code/train_data.py](./code/train_data.py) - This script starts the tf.data.service processes: a tf.data.service Dispatcher and tf.data.service Workers. You shouldn't need to edit this script when adapting the example to your workload.\n", + "- [code/launcher.py](./code/launcher.py) - The entry point training script. This is the script that SageMaker Training will start on all instances (all instance groups must share the same entry point script in heterogeneous clusters). `launcher.py` is responsible for detecting the instance group the instance belongs to and starting `train_dnn.py` and `train_data.py` accordingly. 
It is also responsible for shutting down the tf.data.service processes once the training script (`train_dnn.py`) completes, so that all instances exit, allowing the SageMaker training job to complete. 
\n", + "On every instance, `launcher.py` will use `train_data.py` to start a tf.data.service worker server (as all instance types have CPUs that could be used for preprocessing). `launcher.py` will start a single tf.data.service dispatcher server (on the first instance of the `data_group`). \n", + "`launcher.py` will start the `train_dnn.py` script on all GPU instances (`dnn_group` instances).\n", + "\n", + "#### Learn more about tf.data.service processes\n", + "`tf.data.service Dispatcher` - The dispatcher server acts as the control plane for tf.data.service; it is responsible for registering worker servers and assigning preprocessing tasks to them. Each training job has a single Dispatcher, which runs on the first instance of the `data_group` and listens on port 6000.\n", + "`tf.data.service Workers` - Worker servers carry out the data processing. Each instance could have one or more workers (listening on ports 6001/6002/...).\n", + "\n", + "##### Defining what part of your pipeline runs in which instance group\n", + " When you apply `tf.data.experimental.service.distribute()` to your dataset, all preprocessing operations defined before the apply will run on the tf.data.service workers, and all dataset operations defined afterwards will run in the local process. All instances will need access to a dataset you'll make available through a SageMaker training data channel. 
You do have the option of limiting which instance group will see which training data channel.\n", + "\n", + "The figure below shows the sequence of events when setting up and running a tf.data.service based heterogeneous cluster training job.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Security groups update if running in a private VPC\n", + "This section is relevant if you plan to [run in a private VPC](https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html) (passing `subnets` and `security_group_ids` parameters when defining an Estimator). 
\n", + "SageMaker documentation recommends you [add](https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html#train-vpc-vpc) a rule for your security group that allows inbound connections between members of the same security group, for all TCP communication. This will also cover the tf.data.service related traffic between instances:\n", + "- The tf.data.service Dispatcher node listens for incoming connections on port 6000 (configurable) from all nodes.\n", + "- tf.data.service Workers listen on ports 6001-6006 for connections from all nodes.\n", + "- Each node listens on port 16000 for a tf.data.service shutdown signal from all nodes." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### A. Set up SageMaker Studio notebook\n", + "#### Before you start\n", + "Ensure you have selected the Python 3 (_TensorFlow 2.6 Python 3.8 CPU Optimized_) image for your SageMaker Studio notebook, and that it is running on an _ml.t3.medium_ instance type.\n", + "\n", + "#### Step 1 - Upgrade SageMaker SDK and dependent packages \n", + "Heterogeneous Clusters for Amazon SageMaker model training was [announced](https://aws.amazon.com/about-aws/whats-new/2022/07/announcing-heterogeneous-clusters-amazon-sagemaker-model-training) on 07/08/2022. This feature release requires an updated SageMaker SDK and Boto3 client." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: boto3 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (1.24.72)\n", + "Collecting boto3\n", + " Downloading boto3-1.24.80-py3-none-any.whl (132 kB)\n", + " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 132.5/132.5 kB 925.9 kB/s eta 0:00:00\n", + "Requirement already satisfied: botocore in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (1.27.72)\n", + "Collecting botocore\n", + " Downloading botocore-1.27.80-py3-none-any.whl (9.1 MB)\n", + " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.1/9.1 MB 16.4 MB/s eta 0:00:00\n", + "Requirement already satisfied: awscli in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (1.25.73)\n", + "Collecting awscli\n", + " Downloading awscli-1.25.81-py3-none-any.whl (3.9 MB)\n", + " ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.9/3.9 MB 38.4 MB/s eta 0:00:00\n", + "Requirement already satisfied: sagemaker in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (2.109.0)\n", + "Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from boto3) (0.6.0)\n", + "Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/jmespath-1.0.0-py3.9.egg (from boto3) (1.0.0)\n", + "Requirement already satisfied: urllib3<1.27,>=1.25.4 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/urllib3-1.26.9-py3.9.egg (from botocore) (1.26.9)\n", + "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/python_dateutil-2.8.2-py3.9.egg (from botocore) (2.8.2)\n", + "Requirement already satisfied: PyYAML<5.5,>=3.10 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from awscli) 
(5.4.1)\n", + "Requirement already satisfied: docutils<0.17,>=0.10 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from awscli) (0.16)\n", + "Requirement already satisfied: colorama<0.4.5,>=0.2.5 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from awscli) (0.4.4)\n", + "Requirement already satisfied: rsa<4.8,>=3.1.2 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from awscli) (4.7.2)\n", + "Requirement already satisfied: importlib-metadata<5.0,>=1.4.0 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/importlib_metadata-4.11.3-py3.9.egg (from sagemaker) (4.11.3)\n", + "Requirement already satisfied: pathos in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/pathos-0.2.8-py3.9.egg (from sagemaker) (0.2.8)\n", + "Requirement already satisfied: protobuf3-to-dict<1.0,>=0.1.5 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/protobuf3_to_dict-0.1.5-py3.9.egg (from sagemaker) (0.1.5)\n", + "Requirement already satisfied: pandas in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/pandas-1.4.2-py3.9-macosx-10.9-x86_64.egg (from sagemaker) (1.4.2)\n", + "Requirement already satisfied: numpy<2.0,>=1.9.0 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from sagemaker) (1.22.4)\n", + "Requirement already satisfied: attrs<22,>=20.3.0 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/attrs-20.3.0-py3.9.egg (from sagemaker) (20.3.0)\n", + "Requirement already satisfied: smdebug-rulesconfig==1.0.1 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/smdebug_rulesconfig-1.0.1-py3.9.egg (from sagemaker) (1.0.1)\n", + "Requirement already satisfied: google-pasta in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/google_pasta-0.2.0-py3.9.egg (from sagemaker) (0.2.0)\n", + "Requirement already satisfied: protobuf<4.0,>=3.1 in 
/Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from sagemaker) (3.20.1)\n", + "Requirement already satisfied: packaging>=20.0 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/packaging-21.3-py3.9.egg (from sagemaker) (21.3)\n", + "Requirement already satisfied: zipp>=0.5 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/zipp-3.7.0-py3.9.egg (from importlib-metadata<5.0,>=1.4.0->sagemaker) (3.7.0)\n", + "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/pyparsing-3.0.7-py3.9.egg (from packaging>=20.0->sagemaker) (3.0.7)\n", + "Requirement already satisfied: six in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from protobuf3-to-dict<1.0,>=0.1.5->sagemaker) (1.15.0)\n", + "Requirement already satisfied: pyasn1>=0.1.3 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages (from rsa<4.8,>=3.1.2->awscli) (0.4.8)\n", + "Requirement already satisfied: pytz>=2020.1 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/pytz-2022.1-py3.9.egg (from pandas->sagemaker) (2022.1)\n", + "Requirement already satisfied: dill>=0.3.4 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/dill-0.3.4-py3.9.egg (from pathos->sagemaker) (0.3.4)\n", + "Requirement already satisfied: multiprocess>=0.70.12 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/multiprocess-0.70.12.2-py3.9.egg (from pathos->sagemaker) (0.70.12.2)\n", + "Requirement already satisfied: pox>=0.3.0 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/pox-0.3.0-py3.9.egg (from pathos->sagemaker) (0.3.0)\n", + "Requirement already satisfied: ppft>=1.6.6.4 in /Users/gili/dev/hetro-training/.venv/lib/python3.9/site-packages/ppft-1.6.6.4-py3.9.egg (from pathos->sagemaker) (1.6.6.4)\n", + "Installing collected packages: botocore, boto3, awscli\n", + " Attempting uninstall: botocore\n", + " Found existing 
installation: botocore 1.27.72\n", + " Uninstalling botocore-1.27.72:\n", + " Successfully uninstalled botocore-1.27.72\n", + " Attempting uninstall: boto3\n", + " Found existing installation: boto3 1.24.72\n", + " Uninstalling boto3-1.24.72:\n", + " Successfully uninstalled boto3-1.24.72\n", + " Attempting uninstall: awscli\n", + " Found existing installation: awscli 1.25.73\n", + " Uninstalling awscli-1.25.73:\n", + " Successfully uninstalled awscli-1.25.73\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "sagemaker-training 4.2.2 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Successfully installed awscli-1.25.81 boto3-1.24.80 botocore-1.27.80\n" + ] + } + ], + "source": [ + "%%bash\n", + "python3 -m pip install --upgrade boto3 botocore awscli sagemaker" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 2 - Restart the notebook kernel " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#import IPython\n", + "#IPython.Application.instance().kernel.do_shutdown(True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 3 - Validate SageMaker Python SDK and TensorFlow versions\n", + "Ensure the output of the cell below reflects:\n", + "\n", + "- SageMaker Python SDK version 2.98.0 or above, \n", + "- boto3 1.24 or above \n", + "- botocore 1.27 or above \n", + "- TensorFlow 2.6 or above " + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Name: sagemaker\n", + "Version: 2.109.0\n", + "---\n", + "Name: boto3\n", 
+ "Version: 1.24.80\n", + "---\n", + "Name: botocore\n", + "Version: 1.27.80\n", + "---\n", + "Name: tensorflow\n", + "Version: 2.8.0\n", + "---\n", + "Name: protobuf\n", + "Version: 3.20.1\n" + ] + } + ], + "source": [ + "!pip show sagemaker boto3 botocore tensorflow protobuf |egrep 'Name|Version|---'" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "import json\n", + "import datetime\n", + "\n", + "import sagemaker\n", + "from sagemaker import get_execution_role\n", + "from sagemaker.instance_group import InstanceGroup\n", + "\n", + "sess = sagemaker.Session()\n", + "role = get_execution_role()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### C. Run a homogeneous training job\n", + "#### Step 1: Set up the training environment\n", + "In this step, we define and submit a homogeneous training job. It uses a single instance type (ml.p4d.24xlarge) with 8 GPUs. The analysis of the job will show that it is CPU bound and therefore its GPUs are underutilized." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "import datetime\n", + "from sagemaker.tensorflow import TensorFlow\n", + "from sagemaker.instance_group import InstanceGroup\n", + "import os\n", + "\n", + "hyperparameters = {\n", + " \"epochs\": 10,\n", + " \"steps_per_epoch\": 500,\n", + " \"batch_size\": 1024,\n", + " \"tf_data_mode\": \"local\", # We won't be using tf.data.service ('service') for this homogeneous job\n", + " \"num_of_data_workers\": 0, # We won't be using tf.data.service ('service') for this homogeneous job\n", + "}\n", + "\n", + "estimator = TensorFlow(\n", + " entry_point=\"launcher.py\",\n", + " source_dir=\"code\",\n", + " framework_version=\"2.9.1\",\n", + " py_version=\"py39\",\n", + " role=role,\n", + " volume_size=10,\n", + " max_run=1800, # 30 minutes\n", + " disable_profiler=True,\n", + " instance_type=\"ml.p4d.24xlarge\",\n", + " instance_count=1,\n", + " hyperparameters=hyperparameters,\n", + " distribution={\n", + " \"mpi\": {\n", + " \"enabled\": True,\n", + " \"processes_per_host\": 8, # 8 GPUs per host\n", + " \"custom_mpi_options\": \"--NCCL_DEBUG WARN\",\n", + " },\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 2: Submit the training job\n", + "\n", + "Note: For the logs, click on **View logs** from the **Training Jobs** node in **Amazon SageMaker Console**. 
\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2022-09-24 11:28:23 Starting - Starting the training job......\n", + "2022-09-24 11:29:08 Starting - Preparing the instances for training........................\n", + "2022-09-24 11:33:34 Downloading - Downloading input data\n", + "2022-09-24 11:33:34 Training - Downloading the training image..................\n", + "2022-09-24 11:37:00 Training - Training image download completed. Training in progress..2022-09-24 11:37:05.792579: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "2022-09-24 11:37:05.801314: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\n", + "2022-09-24 11:37:06.269740: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "2022-09-24 11:37:13,412 sagemaker-training-toolkit INFO Imported framework sagemaker_tensorflow_container.training\n", + "2022-09-24 11:37:14,075 sagemaker-training-toolkit INFO Installing dependencies from requirements.txt:\n", + "/usr/local/bin/python3.9 -m pip install -r requirements.txt\n", + "Collecting protobuf==3.20.1\n", + "Downloading protobuf-3.20.1-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)\n", + "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 39.6 MB/s eta 0:00:00\n", + "Collecting tensorflow-addons==0.17.0\n", + "Downloading tensorflow_addons-0.17.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)\n", + "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 51.6 MB/s eta 0:00:00\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.9/site-packages (from tensorflow-addons==0.17.0->-r 
requirements.txt (line 2)) (21.3)\n", + "Requirement already satisfied: typeguard>=2.7 in /usr/local/lib/python3.9/site-packages (from tensorflow-addons==0.17.0->-r requirements.txt (line 2)) (2.13.3)\n", + "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/lib/python3.9/site-packages (from packaging->tensorflow-addons==0.17.0->-r requirements.txt (line 2)) (3.0.9)\n", + "Installing collected packages: protobuf, tensorflow-addons\n", + "Attempting uninstall: protobuf\n", + "Found existing installation: protobuf 3.19.4\n", + "Uninstalling protobuf-3.19.4:\n", + "Successfully uninstalled protobuf-3.19.4\n", + "Attempting uninstall: tensorflow-addons\n", + "Found existing installation: tensorflow-addons 0.17.1\n", + "Uninstalling tensorflow-addons-0.17.1:\n", + "Successfully uninstalled tensorflow-addons-0.17.1\n", + "ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n", + "tf-models-official 2.9.1 requires tensorflow~=2.9.0, which is not installed.\n", + "tensorflow-gpu 2.9.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.\n", + "tensorboard 2.9.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.\n", + "sagemaker-training 4.1.4.dev0 requires protobuf<3.20,>=3.9.2, but you have protobuf 3.20.1 which is incompatible.\n", + "Successfully installed protobuf-3.20.1 tensorflow-addons-0.17.0\n", + "WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. 
It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv\n", + "[notice] A new release of pip available: 22.1.2 -> 22.2.2\n", + "[notice] To update, run: pip install --upgrade pip\n", + "2022-09-24 11:37:24,079 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.\n", + "2022-09-24 11:37:24,079 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.\n", + "2022-09-24 11:37:24,258 sagemaker-training-toolkit INFO Starting MPI run as worker node.\n", + "2022-09-24 11:37:24,258 sagemaker-training-toolkit INFO Creating SSH daemon.\n", + "2022-09-24 11:37:24,274 sagemaker-training-toolkit INFO Waiting for MPI workers to establish their SSH connections\n", + "2022-09-24 11:37:24,274 sagemaker-training-toolkit INFO Env Hosts: ['algo-1'] Hosts: ['algo-1:8'] process_per_hosts: 8 num_processes: 8\n", + "2022-09-24 11:37:24,276 sagemaker-training-toolkit INFO Network interface name: eth0\n", + "2022-09-24 11:37:24,368 sagemaker-training-toolkit INFO Invoking user script\n", + "Training Env:\n", + "{\n", + " \"additional_framework_parameters\": {\n", + " \"sagemaker_mpi_custom_mpi_options\": \"--NCCL_DEBUG WARN\",\n", + " \"sagemaker_mpi_enabled\": true,\n", + " \"sagemaker_mpi_num_of_processes_per_host\": 8\n", + " },\n", + " \"channel_input_dirs\": {},\n", + " \"current_host\": \"algo-1\",\n", + " \"current_instance_group\": \"homogeneousCluster\",\n", + " \"current_instance_group_hosts\": [\n", + " \"algo-1\"\n", + " ],\n", + " \"current_instance_type\": \"ml.p4d.24xlarge\",\n", + " \"distribution_hosts\": [\n", + " \"algo-1\"\n", + " ],\n", + " \"distribution_instance_groups\": [\n", + " \"homogeneousCluster\"\n", + " ],\n", + " \"framework_module\": \"sagemaker_tensorflow_container.training:main\",\n", + " \"hosts\": [\n", + " \"algo-1\"\n", + " ],\n", + " \"hyperparameters\": {\n", + " \"batch_size\": 1024,\n", + " \"epochs\": 10,\n", + " \"model_dir\": 
\"/opt/ml/model\",\n", + " \"num_of_data_workers\": 0,\n", + " \"steps_per_epoch\": 500,\n", + " \"tf_data_mode\": \"local\"\n", + " },\n", + " \"input_config_dir\": \"/opt/ml/input/config\",\n", + " \"input_data_config\": {},\n", + " \"input_dir\": \"/opt/ml/input\",\n", + " \"instance_groups\": [\n", + " \"homogeneousCluster\"\n", + " ],\n", + " \"instance_groups_dict\": {\n", + " \"homogeneousCluster\": {\n", + " \"instance_group_name\": \"homogeneousCluster\",\n", + " \"instance_type\": \"ml.p4d.24xlarge\",\n", + " \"hosts\": [\n", + " \"algo-1\"\n", + " ]\n", + " }\n", + " },\n", + " \"is_hetero\": false,\n", + " \"is_master\": true,\n", + " \"job_name\": \"homogeneous-20220924T112821Z-1\",\n", + " \"log_level\": 20,\n", + " \"master_hostname\": \"algo-1\",\n", + " \"model_dir\": \"/opt/ml/model\",\n", + " \"module_dir\": \"s3://sagemaker-us-east-1-331113010199/homogeneous-20220924T112821Z-1/source/sourcedir.tar.gz\",\n", + " \"module_name\": \"launcher\",\n", + " \"network_interface_name\": \"eth0\",\n", + " \"num_cpus\": 96,\n", + " \"num_gpus\": 8,\n", + " \"output_data_dir\": \"/opt/ml/output/data\",\n", + " \"output_dir\": \"/opt/ml/output\",\n", + " \"output_intermediate_dir\": \"/opt/ml/output/intermediate\",\n", + " \"resource_config\": {\n", + " \"current_host\": \"algo-1\",\n", + " \"current_instance_type\": \"ml.p4d.24xlarge\",\n", + " \"current_group_name\": \"homogeneousCluster\",\n", + " \"hosts\": [\n", + " \"algo-1\"\n", + " ],\n", + " \"instance_groups\": [\n", + " {\n", + " \"instance_group_name\": \"homogeneousCluster\",\n", + " \"instance_type\": \"ml.p4d.24xlarge\",\n", + " \"hosts\": [\n", + " \"algo-1\"\n", + " ]\n", + " }\n", + " ],\n", + " \"network_interface_name\": \"eth0\"\n", + " },\n", + " \"user_entry_point\": \"launcher.py\"\n", + "}\n", + "Environment variables:\n", + "SM_HOSTS=[\"algo-1\"]\n", + "SM_NETWORK_INTERFACE_NAME=eth0\n", + 
"SM_HPS={\"batch_size\":1024,\"epochs\":10,\"model_dir\":\"/opt/ml/model\",\"num_of_data_workers\":0,\"steps_per_epoch\":500,\"tf_data_mode\":\"local\"}\n", + "SM_USER_ENTRY_POINT=launcher.py\n", + "SM_FRAMEWORK_PARAMS={\"sagemaker_mpi_custom_mpi_options\":\"--NCCL_DEBUG WARN\",\"sagemaker_mpi_enabled\":true,\"sagemaker_mpi_num_of_processes_per_host\":8}\n", + "SM_RESOURCE_CONFIG={\"current_group_name\":\"homogeneousCluster\",\"current_host\":\"algo-1\",\"current_instance_type\":\"ml.p4d.24xlarge\",\"hosts\":[\"algo-1\"],\"instance_groups\":[{\"hosts\":[\"algo-1\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.p4d.24xlarge\"}],\"network_interface_name\":\"eth0\"}\n", + "SM_INPUT_DATA_CONFIG={}\n", + "SM_OUTPUT_DATA_DIR=/opt/ml/output/data\n", + "SM_CHANNELS=[]\n", + "SM_CURRENT_HOST=algo-1\n", + "SM_CURRENT_INSTANCE_TYPE=ml.p4d.24xlarge\n", + "SM_CURRENT_INSTANCE_GROUP=homogeneousCluster\n", + "SM_CURRENT_INSTANCE_GROUP_HOSTS=[\"algo-1\"]\n", + "SM_INSTANCE_GROUPS=[\"homogeneousCluster\"]\n", + "SM_INSTANCE_GROUPS_DICT={\"homogeneousCluster\":{\"hosts\":[\"algo-1\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.p4d.24xlarge\"}}\n", + "SM_DISTRIBUTION_INSTANCE_GROUPS=[\"homogeneousCluster\"]\n", + "SM_IS_HETERO=false\n", + "SM_MODULE_NAME=launcher\n", + "SM_LOG_LEVEL=20\n", + "SM_FRAMEWORK_MODULE=sagemaker_tensorflow_container.training:main\n", + "SM_INPUT_DIR=/opt/ml/input\n", + "SM_INPUT_CONFIG_DIR=/opt/ml/input/config\n", + "SM_OUTPUT_DIR=/opt/ml/output\n", + "SM_NUM_CPUS=96\n", + "SM_NUM_GPUS=8\n", + "SM_MODEL_DIR=/opt/ml/model\n", + "SM_MODULE_DIR=s3://sagemaker-us-east-1-331113010199/homogeneous-20220924T112821Z-1/source/sourcedir.tar.gz\n", + "SM_TRAINING_ENV={\"additional_framework_parameters\":{\"sagemaker_mpi_custom_mpi_options\":\"--NCCL_DEBUG 
WARN\",\"sagemaker_mpi_enabled\":true,\"sagemaker_mpi_num_of_processes_per_host\":8},\"channel_input_dirs\":{},\"current_host\":\"algo-1\",\"current_instance_group\":\"homogeneousCluster\",\"current_instance_group_hosts\":[\"algo-1\"],\"current_instance_type\":\"ml.p4d.24xlarge\",\"distribution_hosts\":[\"algo-1\"],\"distribution_instance_groups\":[\"homogeneousCluster\"],\"framework_module\":\"sagemaker_tensorflow_container.training:main\",\"hosts\":[\"algo-1\"],\"hyperparameters\":{\"batch_size\":1024,\"epochs\":10,\"model_dir\":\"/opt/ml/model\",\"num_of_data_workers\":0,\"steps_per_epoch\":500,\"tf_data_mode\":\"local\"},\"input_config_dir\":\"/opt/ml/input/config\",\"input_data_config\":{},\"input_dir\":\"/opt/ml/input\",\"instance_groups\":[\"homogeneousCluster\"],\"instance_groups_dict\":{\"homogeneousCluster\":{\"hosts\":[\"algo-1\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.p4d.24xlarge\"}},\"is_hetero\":false,\"is_master\":true,\"job_name\":\"homogeneous-20220924T112821Z-1\",\"log_level\":20,\"master_hostname\":\"algo-1\",\"model_dir\":\"/opt/ml/model\",\"module_dir\":\"s3://sagemaker-us-east-1-331113010199/homogeneous-20220924T112821Z-1/source/sourcedir.tar.gz\",\"module_name\":\"launcher\",\"network_interface_name\":\"eth0\",\"num_cpus\":96,\"num_gpus\":8,\"output_data_dir\":\"/opt/ml/output/data\",\"output_dir\":\"/opt/ml/output\",\"output_intermediate_dir\":\"/opt/ml/output/intermediate\",\"resource_config\":{\"current_group_name\":\"homogeneousCluster\",\"current_host\":\"algo-1\",\"current_instance_type\":\"ml.p4d.24xlarge\",\"hosts\":[\"algo-1\"],\"instance_groups\":[{\"hosts\":[\"algo-1\"],\"instance_group_name\":\"homogeneousCluster\",\"instance_type\":\"ml.p4d.24xlarge\"}],\"network_interface_name\":\"eth0\"},\"user_entry_point\":\"launcher.py\"}\n", + 
"SM_USER_ARGS=[\"--batch_size\",\"1024\",\"--epochs\",\"10\",\"--model_dir\",\"/opt/ml/model\",\"--num_of_data_workers\",\"0\",\"--steps_per_epoch\",\"500\",\"--tf_data_mode\",\"local\"]\n", + "SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate\n", + "SM_HP_BATCH_SIZE=1024\n", + "SM_HP_EPOCHS=10\n", + "SM_HP_MODEL_DIR=/opt/ml/model\n", + "SM_HP_NUM_OF_DATA_WORKERS=0\n", + "SM_HP_STEPS_PER_EPOCH=500\n", + "SM_HP_TF_DATA_MODE=local\n", + "PYTHONPATH=/opt/ml/code:/usr/local/bin:/usr/local/lib/python39.zip:/usr/local/lib/python3.9:/usr/local/lib/python3.9/lib-dynload:/usr/local/lib/python3.9/site-packages:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg:/usr/local/lib/python3.9/site-packages/pyinstrument-3.4.2-py3.9.egg:/usr/local/lib/python3.9/site-packages/pyinstrument_cext-0.2.4-py3.9-linux-x86_64.egg\n", + "Invoking script with the following command:\n", + "mpirun --host algo-1:8 -np 8 --allow-run-as-root --display-map --tag-output -mca btl_tcp_if_include eth0 -mca oob_tcp_if_include eth0 -mca plm_rsh_no_tree_spawn 1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca orte_abort_on_non_zero_status 1 -mca btl_vader_single_copy_mechanism none -x NCCL_MIN_NRINGS=4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG=WARN -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD=/usr/local/lib/python3.9/site-packages/gethostname.cpython-39-x86_64-linux-gnu.so -x SM_HOSTS -x SM_NETWORK_INTERFACE_NAME -x SM_HPS -x SM_USER_ENTRY_POINT -x SM_FRAMEWORK_PARAMS -x SM_RESOURCE_CONFIG -x SM_INPUT_DATA_CONFIG -x SM_OUTPUT_DATA_DIR -x SM_CHANNELS -x SM_CURRENT_HOST -x SM_CURRENT_INSTANCE_TYPE -x SM_CURRENT_INSTANCE_GROUP -x SM_CURRENT_INSTANCE_GROUP_HOSTS -x SM_INSTANCE_GROUPS -x SM_INSTANCE_GROUPS_DICT -x SM_DISTRIBUTION_INSTANCE_GROUPS -x SM_IS_HETERO -x SM_MODULE_NAME -x SM_LOG_LEVEL -x SM_FRAMEWORK_MODULE -x SM_INPUT_DIR -x SM_INPUT_CONFIG_DIR -x SM_OUTPUT_DIR -x SM_NUM_CPUS -x SM_NUM_GPUS -x SM_MODEL_DIR -x SM_MODULE_DIR -x SM_TRAINING_ENV -x SM_USER_ARGS -x 
SM_OUTPUT_INTERMEDIATE_DIR -x SM_HP_BATCH_SIZE -x SM_HP_EPOCHS -x SM_HP_MODEL_DIR -x SM_HP_NUM_OF_DATA_WORKERS -x SM_HP_STEPS_PER_EPOCH -x SM_HP_TF_DATA_MODE -x PYTHONPATH /usr/local/bin/python3.9 -m mpi4py launcher.py --batch_size 1024 --epochs 10 --model_dir /opt/ml/model --num_of_data_workers 0 --steps_per_epoch 500 --tf_data_mode local\n", + "Data for JOB [7555,1] offset 0 Total slots allocated 8\n", + " ======================== JOB MAP ========================\n", + " Data for node: ip-10-0-215-180#011Num slots: 8#011Max slots: 0#011Num procs: 8\n", + " #011Process OMPI jobid: [7555,1] App: 0 Process rank: 0 Bound: N/A\n", + " #011Process OMPI jobid: [7555,1] App: 0 Process rank: 1 Bound: N/A\n", + " #011Process OMPI jobid: [7555,1] App: 0 Process rank: 2 Bound: N/A\n", + " #011Process OMPI jobid: [7555,1] App: 0 Process rank: 3 Bound: N/A\n", + " #011Process OMPI jobid: [7555,1] App: 0 Process rank: 4 Bound: N/A\n", + " #011Process OMPI jobid: [7555,1] App: 0 Process rank: 5 Bound: N/A\n", + " #011Process OMPI jobid: [7555,1] App: 0 Process rank: 6 Bound: N/A\n", + " #011Process OMPI jobid: [7555,1] App: 0 Process rank: 7 Bound: N/A\n", + " =============================================================\n", + "[1,mpirank:1,algo-1]:env.is_hetero=False\n", + "[1,mpirank:1,algo-1]:current_host=algo-1\n", + "[1,mpirank:1,algo-1]:Opening process: ['python', './train_dnn.py', '--batch_size', '1024', '--epochs', '10', '--model_dir', '/opt/ml/model', '--num_of_data_workers', '0', '--steps_per_epoch', '500', '--tf_data_mode', 'local']\n", + "[1,mpirank:4,algo-1]:env.is_hetero=False\n", + "[1,mpirank:4,algo-1]:current_host=algo-1\n", + "[1,mpirank:4,algo-1]:Opening process: ['python', './train_dnn.py', '--batch_size', '1024', '--epochs', '10', '--model_dir', '/opt/ml/model', '--num_of_data_workers', '0', '--steps_per_epoch', '500', '--tf_data_mode', 'local']\n", + "[1,mpirank:5,algo-1]:env.is_hetero=False\n", + "[1,mpirank:5,algo-1]:current_host=algo-1\n", + 
"[1,mpirank:5,algo-1]:Opening process: ['python', './train_dnn.py', '--batch_size', '1024', '--epochs', '10', '--model_dir', '/opt/ml/model', '--num_of_data_workers', '0', '--steps_per_epoch', '500', '--tf_data_mode', 'local']\n", + "[1,mpirank:7,algo-1]:env.is_hetero=False\n", + "[1,mpirank:7,algo-1]:current_host=algo-1\n", + "[1,mpirank:7,algo-1]:Opening process: ['python', './train_dnn.py', '--batch_size', '1024', '--epochs', '10', '--model_dir', '/opt/ml/model', '--num_of_data_workers', '0', '--steps_per_epoch', '500', '--tf_data_mode', 'local']\n", + "[1,mpirank:0,algo-1]:env.is_hetero=False\n", + "[1,mpirank:0,algo-1]:current_host=algo-1\n", + "[1,mpirank:0,algo-1]:Opening process: ['python', './train_dnn.py', '--batch_size', '1024', '--epochs', '10', '--model_dir', '/opt/ml/model', '--num_of_data_workers', '0', '--steps_per_epoch', '500', '--tf_data_mode', 'local']\n", + "[1,mpirank:6,algo-1]:env.is_hetero=False\n", + "[1,mpirank:6,algo-1]:current_host=algo-1\n", + "[1,mpirank:6,algo-1]:Opening process: ['python', './train_dnn.py', '--batch_size', '1024', '--epochs', '10', '--model_dir', '/opt/ml/model', '--num_of_data_workers', '0', '--steps_per_epoch', '500', '--tf_data_mode', 'local']\n", + "[1,mpirank:3,algo-1]:env.is_hetero=False\n", + "[1,mpirank:3,algo-1]:current_host=algo-1\n", + "[1,mpirank:3,algo-1]:Opening process: ['python', './train_dnn.py', '--batch_size', '1024', '--epochs', '10', '--model_dir', '/opt/ml/model', '--num_of_data_workers', '0', '--steps_per_epoch', '500', '--tf_data_mode', 'local']\n", + "[1,mpirank:2,algo-1]:env.is_hetero=False\n", + "[1,mpirank:2,algo-1]:current_host=algo-1[1,mpirank:2,algo-1]:\n", + "[1,mpirank:2,algo-1]:Opening process: ['python', './train_dnn.py', '--batch_size', '1024', '--epochs', '10', '--model_dir', '/opt/ml/model', '--num_of_data_workers', '0', '--steps_per_epoch', '500', '--tf_data_mode', 'local']\n", + "[1,mpirank:1,algo-1]:2022-09-24 11:37:25.276381: W 
tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:2,algo-1]:2022-09-24 11:37:25.276382: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:4,algo-1]:2022-09-24 11:37:25.276384: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:1,algo-1]:2022-09-24 11:37:25.276524: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\n", + "[1,mpirank:2,algo-1]:2022-09-24 11:37:25.276524: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\n", + "[1,mpirank:4,algo-1]:2022-09-24 11:37:25.276524: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. 
The timeline writer thread will not be started, future recorded events will be dropped.\n", + "[1,mpirank:0,algo-1]:2022-09-24 11:37:25.290991: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:7,algo-1]:2022-09-24 11:37:25.290987: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:3,algo-1]:2022-09-24 11:37:25.290990: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:5,algo-1]:2022-09-24 11:37:25.290991: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:6,algo-1]:2022-09-24 11:37:25.290990: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:0,algo-1]:2022-09-24 11:37:25.291121: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\n", + "[1,mpirank:7,algo-1]:2022-09-24 11:37:25.291122: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\n", + "[1,mpirank:3,algo-1]:2022-09-24 11:37:25.291124: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\n", + "[1,mpirank:5,algo-1]:2022-09-24 11:37:25.291124: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. The timeline writer thread will not be started, future recorded events will be dropped.\n", + "[1,mpirank:6,algo-1]:2022-09-24 11:37:25.291121: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:105] SageMaker Profiler is not enabled. 
The timeline writer thread will not be started, future recorded events will be dropped.\n", + "[1,mpirank:4,algo-1]:2022-09-24 11:37:25.310966: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:2,algo-1]:2022-09-24 11:37:25.310966: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:1,algo-1]:2022-09-24 11:37:25.310966: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:0,algo-1]:2022-09-24 11:37:25.325878: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:7,algo-1]:2022-09-24 11:37:25.325873: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:3,algo-1]:2022-09-24 11:37:25.325878: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:6,algo-1]:2022-09-24 11:37:25.326012: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:5,algo-1]:2022-09-24 11:37:25.326064: W tensorflow/core/profiler/internal/smprofiler_timeline.cc:460] Initializing the SageMaker Profiler.\n", + "[1,mpirank:6,algo-1]:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:5', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:6', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:7', device_type='GPU')]\n", + "[1,mpirank:0,algo-1]:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), 
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:5', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:6', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:7', device_type='GPU')]\n", + "[1,mpirank:0,algo-1]:hvd.local_rank() 0\n", + "[1,mpirank:1,algo-1]:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:5', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:6', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:7', device_type='GPU')]\n", + "[1,mpirank:6,algo-1]:hvd.local_rank() 6\n", + "[1,mpirank:1,algo-1]:hvd.local_rank() 1\n", + "[1,mpirank:5,algo-1]:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:5', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:6', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:7', device_type='GPU')][1,mpirank:5,algo-1]:\n", + "[1,mpirank:5,algo-1]:hvd.local_rank() 5\n", + "[1,mpirank:2,algo-1]:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', 
device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:5', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:6', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:7', device_type='GPU')]\n", + "[1,mpirank:3,algo-1]:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:5', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:6', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:7', device_type='GPU')]\n", + "[1,mpirank:4,algo-1]:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:5', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:6', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:7', device_type='GPU')]\n", + "[1,mpirank:2,algo-1]:hvd.local_rank() 2\n", + "[1,mpirank:7,algo-1]:[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:4', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:5', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:6', 
device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:7', device_type='GPU')]\n", + "[1,mpirank:3,algo-1]:hvd.local_rank() 3\n", + "[1,mpirank:4,algo-1]:hvd.local_rank() 4\n", + "[1,mpirank:7,algo-1]:hvd.local_rank() 7\n", + "[1,mpirank:3,algo-1]:Running in local tf_data_mode.\n", + "[1,mpirank:6,algo-1]:Running in local tf_data_mode.\n", + "[1,mpirank:5,algo-1]:Running in local tf_data_mode.\n", + "[1,mpirank:0,algo-1]:Running in local tf_data_mode.\n", + "[1,mpirank:7,algo-1]:Running in local tf_data_mode.\n", + "[1,mpirank:2,algo-1]:Running in local tf_data_mode.\n", + "[1,mpirank:1,algo-1]:Running in local tf_data_mode.\n", + "[1,mpirank:4,algo-1]:Running in local tf_data_mode.\n", + "[1,mpirank:1,algo-1]:Epoch 1/10\n", + "[1,mpirank:4,algo-1]:Epoch 1/10\n", + "[1,mpirank:2,algo-1]:Epoch 1/10\n", + "[1,mpirank:3,algo-1]:Epoch 1/10\n", + "[1,mpirank:5,algo-1]:Epoch 1/10\n", + "[1,mpirank:7,algo-1]:Epoch 1/10\n", + "[1,mpirank:0,algo-1]:Epoch 1/10\n", + "[1,mpirank:6,algo-1]:Epoch 1/10\n", + "[1,mpirank:5,algo-1]:Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-39-x86_64-linux-gnu.so not found\n", + "[1,mpirank:5,algo-1]:If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.\n", + "[1,mpirank:5,algo-1]:Warning! MPI libs are missing, but python applications are still available.\n", + "[1,mpirank:0,algo-1]:Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-39-x86_64-linux-gnu.so not found\n", + "[1,mpirank:0,algo-1]:If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.\n", + "[1,mpirank:0,algo-1]:Warning! 
MPI libs are missing, but python applications are still available.\n", + "[1,mpirank:7,algo-1]:Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-39-x86_64-linux-gnu.so not found\n", + "[1,mpirank:7,algo-1]:If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.\n", + "[1,mpirank:7,algo-1]:Warning! MPI libs are missing, but python applications are still available.\n", + "[1,mpirank:3,algo-1]:Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-39-x86_64-linux-gnu.so not found\n", + "[1,mpirank:3,algo-1]:If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.\n", + "[1,mpirank:3,algo-1]:Warning! MPI libs are missing, but python applications are still available.\n", + "[1,mpirank:1,algo-1]:Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-39-x86_64-linux-gnu.so not found\n", + "[1,mpirank:1,algo-1]:If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.\n", + "[1,mpirank:1,algo-1]:Warning! MPI libs are missing, but python applications are still available.\n", + "[1,mpirank:2,algo-1]:Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-39-x86_64-linux-gnu.so not found\n", + "[1,mpirank:2,algo-1]:If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.\n", + "[1,mpirank:2,algo-1]:Warning! 
MPI libs are missing, but python applications are still available.\n", + "[1,mpirank:4,algo-1]:Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-39-x86_64-linux-gnu.so not found\n", + "[1,mpirank:4,algo-1]:If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.\n", + "[1,mpirank:4,algo-1]:Warning! MPI libs are missing, but python applications are still available.\n", + "[1,mpirank:6,algo-1]:Extension horovod.torch has not been built: /usr/local/lib/python3.9/site-packages/horovod/torch/mpi_lib/_mpi_lib.cpython-39-x86_64-linux-gnu.so not found\n", + "[1,mpirank:6,algo-1]:If this is not expected, reinstall Horovod with HOROVOD_WITH_PYTORCH=1 to debug the build error.\n", + "[1,mpirank:6,algo-1]:Warning! MPI libs are missing, but python applications are still available.\n", + "[1,mpirank:1,algo-1]:[2022-09-24 11:37:33.366 algo-1:177 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\n", + "[1,mpirank:4,algo-1]:[2022-09-24 11:37:33.366 algo-1:178 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\n", + "[1,mpirank:5,algo-1]:[2022-09-24 11:37:33.366 algo-1:179 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\n", + "[1,mpirank:0,algo-1]:[2022-09-24 11:37:33.366 algo-1:181 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\n", + "[1,mpirank:3,algo-1]:[2022-09-24 11:37:33.366 algo-1:183 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\n", + "[1,mpirank:2,algo-1]:[2022-09-24 11:37:33.366 algo-1:184 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\n", + "[1,mpirank:6,algo-1]:[2022-09-24 11:37:33.366 algo-1:182 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\n", + "[1,mpirank:7,algo-1]:[2022-09-24 11:37:33.366 algo-1:180 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None\n", + "[1,mpirank:1,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: 
\"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:2,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:4,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:5,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:6,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:3,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:1,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:2,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:5,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. 
Did you mean \"!=\"?\n", + "[1,mpirank:4,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:6,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:7,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:0,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:3,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:7,algo-1]:/usr/local/lib/python3.9/site-packages/smdebug-1.0.17b20220701-py3.9.egg/smdebug/profiler/system_metrics_reader.py:63: SyntaxWarning: \"is not\" with a literal. Did you mean \"!=\"?\n", + "[1,mpirank:1,algo-1]:[2022-09-24 11:37:33.581 algo-1:177 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.\n", + "[1,mpirank:2,algo-1]:[2022-09-24 11:37:33.581 algo-1:184 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.\n", + "[1,mpirank:0,algo-1]:[2022-09-24 11:37:33.582 algo-1:181 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. 
Profiler is disabled.\n", + "[1,mpirank:5,algo-1]:[2022-09-24 11:37:33.582 algo-1:179 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.\n", + "[1,mpirank:7,algo-1]:[2022-09-24 11:37:33.582 algo-1:180 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.\n", + "[1,mpirank:6,algo-1]:[2022-09-24 11:37:33.582 algo-1:182 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.\n", + "[1,mpirank:3,algo-1]:[2022-09-24 11:37:33.582 algo-1:183 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.\n", + "[1,mpirank:4,algo-1]:[2022-09-24 11:37:33.582 algo-1:178 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.\n", + "[1,mpirank:2,algo-1]:[2022-09-24 11:37:33.639 algo-1:184 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\n", + "[1,mpirank:1,algo-1]:[2022-09-24 11:37:33.639 algo-1:177 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\n", + "[1,mpirank:4,algo-1]:[2022-09-24 11:37:33.639 algo-1:178 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\n", + "[1,mpirank:3,algo-1]:[2022-09-24 11:37:33.639 algo-1:183 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\n", + "[1,mpirank:7,algo-1]:[2022-09-24 11:37:33.639 algo-1:180 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\n", + "[1,mpirank:6,algo-1]:[2022-09-24 11:37:33.639 algo-1:182 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\n", + "[1,mpirank:0,algo-1]:[2022-09-24 
11:37:33.639 algo-1:181 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\n", + "[1,mpirank:5,algo-1]:[2022-09-24 11:37:33.639 algo-1:179 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.\n", + "[1,mpirank:1,algo-1]:[2022-09-24 11:37:33.640 algo-1:177 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.\n", + "[1,mpirank:4,algo-1]:[2022-09-24 11:37:33.640 algo-1:178 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.\n", + "[1,mpirank:2,algo-1]:[2022-09-24 11:37:33.640 algo-1:184 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.\n", + "[1,mpirank:3,algo-1]:[2022-09-24 11:37:33.640 algo-1:183 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.\n", + "[1,mpirank:7,algo-1]:[2022-09-24 11:37:33.640 algo-1:180 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.\n", + "[1,mpirank:6,algo-1]:[2022-09-24 11:37:33.640 algo-1:182 INFO hook.py:201] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.\n", + "[1,mpirank:0,algo-1]:[2022-09-24 11:37:33.640 algo-1:181 INFO hook.py:201] tensorboard_dir has not been set for the hook. 
SMDebug will not be exporting tensorboard summaries.\n", + "[1,mpirank:2,algo-1]:[2022-09-24 11:37:33.640 algo-1:184 INFO hook.py:254] Saving to /opt/ml/output/tensors\n", + "[1,mpirank:1,algo-1]:[2022-09-24 11:37:33.640 algo-1:177 INFO hook.py:254] Saving to /opt/ml/output/tensors\n", + "[1,mpirank:4,algo-1]:[2022-09-24 11:37:33.640 algo-1:178 INFO hook.py:254] Saving to /opt/ml/output/tensors\n", + "[1,mpirank:2,algo-1]:[2022-09-24 11:37:33.640 algo-1:184 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\n", + "[1,mpirank:4,algo-1]:[2022-09-24 11:37:33.640 algo-1:178 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\n", + "[1,mpirank:1,algo-1]:[2022-09-24 11:37:33.640 algo-1:177 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\n", + "[1,mpirank:5,algo-1]:[2022-09-24 11:37:33.640 algo-1:179 INFO hook.py:201] tensorboard_dir has not been set for the hook. 
SMDebug will not be exporting tensorboard summaries.\n", + "[1,mpirank:3,algo-1]:[2022-09-24 11:37:33.640 algo-1:183 INFO hook.py:254] Saving to /opt/ml/output/tensors\n", + "[1,mpirank:3,algo-1]:[2022-09-24 11:37:33.640 algo-1:183 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\n", + "[1,mpirank:2,algo-1]:[2022-09-24 11:37:33.640 algo-1:184 INFO hook.py:421] Monitoring the collections: losses, sm_metrics, metrics\n", + "[1,mpirank:7,algo-1]:[2022-09-24 11:37:33.640 algo-1:180 INFO hook.py:254] Saving to /opt/ml/output/tensors\n", + "[1,mpirank:6,algo-1]:[2022-09-24 11:37:33.640 algo-1:182 INFO hook.py:254] Saving to /opt/ml/output/tensors\n", + "[1,mpirank:1,algo-1]:[2022-09-24 11:37:33.640 algo-1:177 INFO hook.py:421] Monitoring the collections: losses, metrics, sm_metrics\n", + "[1,mpirank:4,algo-1]:[2022-09-24 11:37:33.640 algo-1:178 INFO hook.py:421] Monitoring the collections: losses, metrics, sm_metrics\n", + "[1,mpirank:0,algo-1]:[2022-09-24 11:37:33.640 algo-1:181 INFO hook.py:254] Saving to /opt/ml/output/tensors\n", + "[1,mpirank:7,algo-1]:[2022-09-24 11:37:33.640 algo-1:180 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\n", + "[1,mpirank:6,algo-1]:[2022-09-24 11:37:33.640 algo-1:182 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\n", + "[1,mpirank:0,algo-1]:[2022-09-24 11:37:33.641 algo-1:181 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\n", + "[1,mpirank:3,algo-1]:[2022-09-24 11:37:33.641 algo-1:183 INFO hook.py:421] Monitoring the collections: losses, sm_metrics, metrics\n", + "[1,mpirank:6,algo-1]:[2022-09-24 11:37:33.641 algo-1:182 INFO hook.py:421] Monitoring the collections: sm_metrics, metrics, losses\n", + "[1,mpirank:7,algo-1]:[2022-09-24 11:37:33.641 algo-1:180 INFO hook.py:421] Monitoring the 
collections: sm_metrics, losses, metrics\n", + "[1,mpirank:0,algo-1]:[2022-09-24 11:37:33.641 algo-1:181 INFO hook.py:421] Monitoring the collections: sm_metrics, metrics, losses\n", + "[1,mpirank:5,algo-1]:[2022-09-24 11:37:33.641 algo-1:179 INFO hook.py:254] Saving to /opt/ml/output/tensors\n", + "[1,mpirank:5,algo-1]:[2022-09-24 11:37:33.641 algo-1:179 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.\n", + "[1,mpirank:5,algo-1]:[2022-09-24 11:37:33.641 algo-1:179 INFO hook.py:421] Monitoring the collections: metrics, losses, sm_metrics\n", + "[1,mpirank:0,algo-1]:NCCL version 2.10.3+cuda11.2\n", + "[1,mpirank:2,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2246s vs `on_train_batch_end` time: 0.6465s). Check your callbacks.\n", + "[1,mpirank:2,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2246s vs `on_train_batch_end` time: 0.6465s). Check your callbacks.\n", + "[1,mpirank:3,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2247s vs `on_train_batch_end` time: 0.6464s). Check your callbacks.\n", + "[1,mpirank:3,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2247s vs `on_train_batch_end` time: 0.6464s). Check your callbacks.\n", + "[1,mpirank:0,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2237s vs `on_train_batch_end` time: 0.6464s). Check your callbacks.\n", + "[1,mpirank:0,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2237s vs `on_train_batch_end` time: 0.6464s). 
Check your callbacks.\n", + "[1,mpirank:5,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2247s vs `on_train_batch_end` time: 0.6464s). Check your callbacks.\n", + "[1,mpirank:5,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2247s vs `on_train_batch_end` time: 0.6464s). Check your callbacks.\n", + "[1,mpirank:7,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2241s vs `on_train_batch_end` time: 0.6464s). Check your callbacks.\n", + "[1,mpirank:6,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2247s vs `on_train_batch_end` time: 0.6465s). Check your callbacks.\n", + "[1,mpirank:7,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2241s vs `on_train_batch_end` time: 0.6464s). Check your callbacks.\n", + "[1,mpirank:6,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2247s vs `on_train_batch_end` time: 0.6465s). Check your callbacks.\n", + "[1,mpirank:1,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2255s vs `on_train_batch_end` time: 0.6463s). Check your callbacks.\n", + "[1,mpirank:1,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2255s vs `on_train_batch_end` time: 0.6463s). Check your callbacks.\n", + "[1,mpirank:4,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2250s vs `on_train_batch_end` time: 0.6464s). 
Check your callbacks.\n", + "[1,mpirank:4,algo-1]:WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.2250s vs `on_train_batch_end` time: 0.6464s). Check your callbacks.\n", + "[1,mpirank:7,algo-1]:500/500 - 121s - loss: 2.4081 - lr: 0.0033 - 121s/epoch - 242ms/step\n", + "[1,mpirank:1,algo-1]:500/500 - 121s - loss: 2.4081 - lr: 0.0033 - 121s/epoch - 242ms/step\n", + "[1,mpirank:2,algo-1]:500/500 - 121s - loss: 2.4081 - lr: 0.0033 - 121s/epoch - 242ms/step\n", + "[1,mpirank:4,algo-1]:500/500 - 121s - loss: 2.4081 - lr: 0.0033 - 121s/epoch - 242ms/step\n", + "[1,mpirank:6,algo-1]:500/500 - 121s - loss: 2.4081 - lr: 0.0033 - 121s/epoch - 242ms/step\n", + "[1,mpirank:5,algo-1]:500/500 - 121s - loss: 2.4081 - lr: 0.0033 - 121s/epoch - 242ms/step\n", + "[1,mpirank:1,algo-1]:Epoch 2/10\n", + "[1,mpirank:5,algo-1]:Epoch 2/10\n", + "[1,mpirank:7,algo-1]:Epoch 2/10\n", + "[1,mpirank:6,algo-1]:Epoch 2/10\n", + "[1,mpirank:2,algo-1]:Epoch 2/10\n", + "[1,mpirank:4,algo-1]:Epoch 2/10\n", + "[1,mpirank:3,algo-1]:500/500 - 121s - loss: 2.4081 - lr: 0.0033 - 121s/epoch - 242ms/step\n", + "[1,mpirank:3,algo-1]:Epoch 2/10\n", + "[1,mpirank:0,algo-1]:500/500 - 122s - loss: 2.4081 - lr: 0.0033 - 122s/epoch - 245ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 2/10\n", + "[1,mpirank:1,algo-1]:500/500 - 100s - loss: 2.3881 - lr: 0.0057 - 100s/epoch - 199ms/step\n", + "[1,mpirank:4,algo-1]:500/500 - 100s - loss: 2.3881 - lr: 0.0057 - 100s/epoch - 199ms/step\n", + "[1,mpirank:7,algo-1]:500/500 - 100s - loss: 2.3881 - lr: 0.0057 - 100s/epoch - 199ms/step\n", + "[1,mpirank:6,algo-1]:500/500 - 100s - loss: 2.3881 - lr: 0.0057 - 100s/epoch - 199ms/step\n", + "[1,mpirank:5,algo-1]:500/500 - 100s - loss: 2.3881 - lr: 0.0057 - 100s/epoch - 199ms/step\n", + "[1,mpirank:3,algo-1]:500/500 - 100s - loss: 2.3881 - lr: 0.0057 - 100s/epoch - 199ms/step\n", + "[1,mpirank:2,algo-1]:500/500 - 100s - loss: 2.3881 - lr: 0.0057 - 100s/epoch - 199ms/step\n", + 
"[1,mpirank:7,algo-1]:Epoch 3/10\n", + "[1,mpirank:5,algo-1]:Epoch 3/10\n", + "[1,mpirank:2,algo-1]:Epoch 3/10\n", + "[1,mpirank:6,algo-1]:Epoch 3/10\n", + "[1,mpirank:4,algo-1]:Epoch 3/10\n", + "[1,mpirank:3,algo-1]:Epoch 3/10\n", + "[1,mpirank:1,algo-1]:Epoch 3/10\n", + "[1,mpirank:0,algo-1]:500/500 - 99s - loss: 2.3881 - lr: 0.0057 - 99s/epoch - 199ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 3/10\n", + "[1,mpirank:6,algo-1]:500/500 - 103s - loss: 2.3532 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:0,algo-1]:\n", + "[1,mpirank:0,algo-1]:Epoch 3: finished gradual learning rate warmup to 0.008.\n", + "[1,mpirank:7,algo-1]:500/500 - 103s - loss: 2.3532 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:5,algo-1]:500/500 - 103s - loss: 2.3532 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:2,algo-1]:500/500 - 103s - loss: 2.3532 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:1,algo-1]:500/500 - 103s - loss: 2.3532 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:3,algo-1]:500/500 - 103s - loss: 2.3532 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:6,algo-1]:Epoch 4/10\n", + "[1,mpirank:5,algo-1]:Epoch 4/10\n", + "[1,mpirank:2,algo-1]:Epoch 4/10\n", + "[1,mpirank:7,algo-1]:Epoch 4/10\n", + "[1,mpirank:1,algo-1]:Epoch 4/10\n", + "[1,mpirank:3,algo-1]:Epoch 4/10\n", + "[1,mpirank:4,algo-1]:500/500 - 103s - loss: 2.3532 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:4,algo-1]:Epoch 4/10\n", + "[1,mpirank:0,algo-1]:500/500 - 103s - loss: 2.3532 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 4/10\n", + "[1,mpirank:7,algo-1]:500/500 - 103s - loss: 2.3199 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:6,algo-1]:500/500 - 103s - loss: 2.3199 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:4,algo-1]:500/500 - 103s - loss: 2.3199 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:2,algo-1]:500/500 - 103s - loss: 2.3199 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + 
"[1,mpirank:5,algo-1]:500/500 - 103s - loss: 2.3199 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:1,algo-1]:500/500 - 103s - loss: 2.3199 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:3,algo-1]:500/500 - 103s - loss: 2.3199 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:6,algo-1]:Epoch 5/10\n", + "[1,mpirank:7,algo-1]:Epoch 5/10\n", + "[1,mpirank:4,algo-1]:Epoch 5/10\n", + "[1,mpirank:1,algo-1]:Epoch 5/10\n", + "[1,mpirank:2,algo-1]:Epoch 5/10\n", + "[1,mpirank:3,algo-1]:Epoch 5/10\n", + "[1,mpirank:5,algo-1]:Epoch 5/10\n", + "[1,mpirank:0,algo-1]:500/500 - 103s - loss: 2.3199 - lr: 0.0080 - 103s/epoch - 206ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 5/10\n", + "[1,mpirank:2,algo-1]:500/500 - 100s - loss: 2.3071 - lr: 0.0080 - 100s/epoch - 200ms/step\n", + "[1,mpirank:6,algo-1]:500/500 - 100s - loss: 2.3071 - lr: 0.0080 - 100s/epoch - 200ms/step\n", + "[1,mpirank:2,algo-1]:Epoch 6/10\n", + "[1,mpirank:7,algo-1]:500/500 - 100s - loss: 2.3071 - lr: 0.0080 - 100s/epoch - 200ms/step\n", + "[1,mpirank:5,algo-1]:500/500 - 100s - loss: 2.3071 - lr: 0.0080 - 100s/epoch - 200ms/step\n", + "[1,mpirank:1,algo-1]:500/500 - 100s - loss: 2.3071 - lr: 0.0080 - 100s/epoch - 200ms/step\n", + "[1,mpirank:6,algo-1]:Epoch 6/10\n", + "[1,mpirank:3,algo-1]:500/500 - 100s - loss: 2.3071 - lr: 0.0080 - 100s/epoch - 200ms/step\n", + "[1,mpirank:5,algo-1]:Epoch 6/10\n", + "[1,mpirank:7,algo-1]:Epoch 6/10\n", + "[1,mpirank:1,algo-1]:Epoch 6/10\n", + "[1,mpirank:3,algo-1]:Epoch 6/10\n", + "[1,mpirank:4,algo-1]:500/500 - 100s - loss: 2.3071 - lr: 0.0080 - 100s/epoch - 200ms/step\n", + "[1,mpirank:4,algo-1]:Epoch 6/10\n", + "[1,mpirank:0,algo-1]:500/500 - 100s - loss: 2.3071 - lr: 0.0080 - 100s/epoch - 200ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 6/10\n", + "[1,mpirank:7,algo-1]:500/500 - 94s - loss: 2.3043 - lr: 0.0080 - 94s/epoch - 188ms/step\n", + "[1,mpirank:4,algo-1]:500/500 - 94s - loss: 2.3043 - lr: 0.0080 - 94s/epoch - 188ms/step\n", + 
"[1,mpirank:5,algo-1]:500/500 - 94s - loss: 2.3043 - lr: 0.0080 - 94s/epoch - 188ms/step\n", + "[1,mpirank:2,algo-1]:500/500 - 94s - loss: 2.3043 - lr: 0.0080 - 94s/epoch - 188ms/step\n", + "[1,mpirank:3,algo-1]:500/500 - 94s - loss: 2.3043 - lr: 0.0080 - 94s/epoch - 188ms/step\n", + "[1,mpirank:4,algo-1]:Epoch 7/10\n", + "[1,mpirank:5,algo-1]:Epoch 7/10\n", + "[1,mpirank:3,algo-1]:Epoch 7/10\n", + "[1,mpirank:7,algo-1]:Epoch 7/10\n", + "[1,mpirank:1,algo-1]:500/500 - 94s - loss: 2.3043 - lr: 0.0080 - 94s/epoch - 188ms/step\n", + "[1,mpirank:2,algo-1]:Epoch 7/10\n", + "[1,mpirank:1,algo-1]:Epoch 7/10\n", + "[1,mpirank:6,algo-1]:500/500 - 94s - loss: 2.3043 - lr: 0.0080 - 94s/epoch - 188ms/step\n", + "[1,mpirank:6,algo-1]:Epoch 7/10\n", + "[1,mpirank:0,algo-1]:500/500 - 94s - loss: 2.3043 - lr: 0.0080 - 94s/epoch - 189ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 7/10\n", + "[1,mpirank:3,algo-1]:500/500 - 97s - loss: 2.3031 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:5,algo-1]:500/500 - 97s - loss: 2.3031 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:2,algo-1]:500/500 - 97s - loss: 2.3031 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:1,algo-1]:500/500 - 97s - loss: 2.3031 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:3,algo-1]:Epoch 8/10\n", + "[1,mpirank:5,algo-1]:Epoch 8/10\n", + "[1,mpirank:2,algo-1]:Epoch 8/10\n", + "[1,mpirank:6,algo-1]:500/500 - 97s - loss: 2.3031 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:7,algo-1]:500/500 - 97s - loss: 2.3031 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:6,algo-1]:Epoch 8/10\n", + "[1,mpirank:1,algo-1]:Epoch 8/10\n", + "[1,mpirank:7,algo-1]:Epoch 8/10\n", + "[1,mpirank:4,algo-1]:500/500 - 97s - loss: 2.3031 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:4,algo-1]:Epoch 8/10\n", + "[1,mpirank:0,algo-1]:500/500 - 97s - loss: 2.3031 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 8/10\n", + "[1,mpirank:3,algo-1]:500/500 - 96s - loss: 
2.3027 - lr: 0.0080 - 96s/epoch - 192ms/step\n", + "[1,mpirank:5,algo-1]:500/500 - 96s - loss: 2.3027 - lr: 0.0080 - 96s/epoch - 192ms/step\n", + "[1,mpirank:4,algo-1]:500/500 - 96s - loss: 2.3027 - lr: 0.0080 - 96s/epoch - 192ms/step\n", + "[1,mpirank:7,algo-1]:500/500 - 96s - loss: 2.3027 - lr: 0.0080 - 96s/epoch - 192ms/step\n", + "[1,mpirank:2,algo-1]:500/500 - 96s - loss: 2.3027 - lr: 0.0080 - 96s/epoch - 192ms/step\n", + "[1,mpirank:6,algo-1]:500/500 - 96s - loss: 2.3027 - lr: 0.0080 - 96s/epoch - 192ms/step\n", + "[1,mpirank:5,algo-1]:Epoch 9/10\n", + "[1,mpirank:4,algo-1]:Epoch 9/10\n", + "[1,mpirank:7,algo-1]:Epoch 9/10\n", + "[1,mpirank:3,algo-1]:Epoch 9/10\n", + "[1,mpirank:2,algo-1]:Epoch 9/10\n", + "[1,mpirank:6,algo-1]:Epoch 9/10\n", + "[1,mpirank:1,algo-1]:500/500 - 96s - loss: 2.3027 - lr: 0.0080 - 96s/epoch - 192ms/step\n", + "[1,mpirank:1,algo-1]:Epoch 9/10\n", + "[1,mpirank:0,algo-1]:500/500 - 96s - loss: 2.3027 - lr: 0.0080 - 96s/epoch - 192ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 9/10\n", + "[1,mpirank:2,algo-1]:500/500 - 105s - loss: 2.3021 - lr: 0.0080 - 105s/epoch - 210ms/step\n", + "[1,mpirank:3,algo-1]:500/500 - 105s - loss: 2.3021 - lr: 0.0080 - 105s/epoch - 210ms/step\n", + "[1,mpirank:1,algo-1]:500/500 - 105s - loss: 2.3021 - lr: 0.0080 - 105s/epoch - 210ms/step\n", + "[1,mpirank:5,algo-1]:500/500 - 105s - loss: 2.3021 - lr: 0.0080 - 105s/epoch - 210ms/step\n", + "[1,mpirank:2,algo-1]:Epoch 10/10\n", + "[1,mpirank:1,algo-1]:Epoch 10/10\n", + "[1,mpirank:4,algo-1]:500/500 - 105s - loss: 2.3021 - lr: 0.0080 - 105s/epoch - 210ms/step\n", + "[1,mpirank:6,algo-1]:500/500 - 105s - loss: 2.3021 - lr: 0.0080 - 105s/epoch - 210ms/step\n", + "[1,mpirank:3,algo-1]:Epoch 10/10\n", + "[1,mpirank:5,algo-1]:Epoch 10/10\n", + "[1,mpirank:6,algo-1]:Epoch 10/10\n", + "[1,mpirank:4,algo-1]:Epoch 10/10\n", + "[1,mpirank:7,algo-1]:500/500 - 105s - loss: 2.3021 - lr: 0.0080 - 105s/epoch - 210ms/step\n", + "[1,mpirank:7,algo-1]:Epoch 10/10\n", + 
"[1,mpirank:0,algo-1]:500/500 - 105s - loss: 2.3021 - lr: 0.0080 - 105s/epoch - 209ms/step\n", + "[1,mpirank:0,algo-1]:Epoch 10/10\n", + "[1,mpirank:6,algo-1]:500/500 - 97s - loss: 2.3013 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:7,algo-1]:500/500 - 97s - loss: 2.3013 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:3,algo-1]:500/500 - 97s - loss: 2.3013 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:2,algo-1]:500/500 - 97s - loss: 2.3013 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:5,algo-1]:500/500 - 97s - loss: 2.3013 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:1,algo-1]:500/500 - 97s - loss: 2.3013 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:4,algo-1]:500/500 - 97s - loss: 2.3013 - lr: 0.0080 - 97s/epoch - 194ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 97s - loss: 2.3013 - lr: 0.0080 - 97s/epoch - 193ms/step\n", + "[1,mpirank:4,algo-1]:Process train_dnn.py closed with returncode=0\n", + "[1,mpirank:6,algo-1]:Process train_dnn.py closed with returncode=0\n", + "[1,mpirank:3,algo-1]:Process train_dnn.py closed with returncode=0\n", + "[1,mpirank:1,algo-1]:Process train_dnn.py closed with returncode=0\n", + "[1,mpirank:5,algo-1]:Process train_dnn.py closed with returncode=0\n", + "[1,mpirank:7,algo-1]:Process train_dnn.py closed with returncode=0\n", + "[1,mpirank:2,algo-1]:Process train_dnn.py closed with returncode=0\n", + "[1,mpirank:0,algo-1]:WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 5 of 53). 
These functions will not be directly callable after loading.\n", + "[1,mpirank:0,algo-1]:INFO:tensorflow:Assets written to: /opt/ml/model/000000001/assets\n", + "[1,mpirank:0,algo-1]:INFO:tensorflow:Assets written to: /opt/ml/model/000000001/assets\n", + "[1,mpirank:0,algo-1]:Process train_dnn.py closed with returncode=0\n", + "2022-09-24 11:54:50,061 sagemaker-training-toolkit INFO Waiting for the process to finish and give a return code.\n", + "2022-09-24 11:54:50,061 sagemaker-training-toolkit INFO Done waiting for a return code. Received 0 from exiting process.\n", + "2022-09-24 11:54:50,062 sagemaker-training-toolkit INFO Reporting training SUCCESS\n", + "\n", + "2022-09-24 11:55:05 Uploading - Uploading generated training model\n", + "2022-09-24 11:55:36 Completed - Training job completed\n", + "Training seconds: 1337\n", + "Billable seconds: 1337\n" + ] + } + ], + "source": [ + "from start_job_utils import fit_with_retries\n", + "fit_with_retries(5, estimator, \n", + " job_name=\"homogeneous-\" + datetime.datetime.utcnow().strftime(\"%Y%m%dT%H%M%SZ\"),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 3: Analyzing the homogeneous training job throughput and resource usage\n", + "We'll examine CPU and GPU usage, then epoch time and step time.\n", + "\n", + "**CPU and GPU usage analysis** \n", + "\n", + "In the screenshot below we observe that nearly all 96 vCPUs of the instance are utilized, while GPU utilization is only ~45%. With more vCPUs available, we could likely raise GPU utilization and, with it, job throughput.\n", + "\n", + "Note: To view your own job, click **View instance metrics** from the **Training jobs** node in the **Amazon SageMaker Console**. Then, to rescale the CloudWatch metrics to 100% CPU utilization for algo-1 and algo-2, use the CloudWatch \"Add Math\" feature and average over the number of vCPUs/GPUs on those instance types. 
We captured metrics definitions used to produce this graph [here](./cloudwatch-metric-definitions/homogenous-workload%20copy.json). \n", + "" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Epoch time and step time analysis**\n", + "\n", + "For 2nd and 3rd epochs the below should print out: 105s/epoch - 209ms/step." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture homogeneous_logs\n", + "estimator.sagemaker_session.logs_for_job(estimator.latest_training_job.name)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Printing step time for epochs and steps for homogeneous-20220923T231801Z\n", + "[1,mpirank:0,algo-1]:500/500 - 117s - loss: 2.4153 - lr: 0.0033 - 117s/epoch - 234ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 92s - loss: 2.3755 - lr: 0.0057 - 92s/epoch - 184ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 92s - loss: 2.3472 - lr: 0.0080 - 92s/epoch - 184ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 92s - loss: 2.3175 - lr: 0.0080 - 92s/epoch - 184ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 92s - loss: 2.3066 - lr: 0.0080 - 92s/epoch - 183ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 90s - loss: 2.3043 - lr: 0.0080 - 90s/epoch - 181ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 94s - loss: 2.3028 - lr: 0.0080 - 94s/epoch - 189ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 92s - loss: 2.3024 - lr: 0.0080 - 92s/epoch - 184ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 93s - loss: 2.3021 - lr: 0.0080 - 93s/epoch - 185ms/step\n", + "[1,mpirank:0,algo-1]:500/500 - 89s - loss: 2.3018 - lr: 0.0080 - 89s/epoch - 177ms/step\n" + ] + } + ], + "source": [ + "print(f\"Printing step time for epochs and steps for {estimator.latest_training_job.name}\")\n", + "for line in homogeneous_logs.stdout.split(\"\\n\"):\n", + " if \"mpirank:0\" in line and \"/epoch\" in 
line:\n", + " print(line)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### D. Run a heterogeneous cluster training job\n", + "\n", + "#### Step 1: Set up training environment\n", + "We'll now run a training job in heterogeneous cluster mode. \n", + "Note the changes from the homogeneous cluster job: \n", + "- We define two new instance groups and pass them to the `estimator` through the `instance_groups` parameter, which replaces the homogeneous parameters `instance_type` and `instance_count`.\n", + "- In the `distribution` parameter for Horovod we added a new parameter `instance_groups` that limits the MPI cluster to the `dnn_group`. The MPI cluster should include only the GPU nodes that run Horovod (which needs MPI). The `data_group` instances should not be part of the MPI cluster, as they set up their own `tf.data.service` cluster.\n", + "\n", + "More on the two instance groups config we use:\n", + "- `data_group` - two ml.c5.18xlarge instances, each with 72 vCPUs, to handle data preprocessing: reading data from S3, preprocessing it, and forwarding it to the `dnn_group`.\n", + "- `dnn_group` - a single ml.p4d.24xlarge instance, with 8 GPUs and 96 vCPUs, to handle deep neural network optimization (forward and backward passes). To fully utilize the 96 vCPUs in the `dnn_group`, we'll start data workers on all the instances in both groups, so we have 240 vCPUs (96+72+72) in total available for preprocessing (minus the vCPUs used by the neural network optimization process).\n", + "\n", + "There are three Python scripts to know about:\n", + "The 1st is `train_dnn.py` - This is your training script for the neural network; edit it to match your own use case. Note that this script isn't aware of the heterogeneous cluster setup, except when it initializes the tf.data dataset by calling this line: `ds = ds.apply(tf.data.experimental.service.distribute(...))`. 
\n", + "The 2nd and 3rd scripts, which should not need editing when adapting to your own use case, do the heavy lifting required for using tf.data.service over the heterogeneous cluster feature. \n", + "`train_data.py` includes functions to start and stop tf.data.service processes such as the dispatcher and WorkerServer. \n", + "`launcher.py` has several responsibilities: \n", + "- Serves as the single entry point script for all instances in all instance groups (SageMaker will start the same script on all instances).\n", + "- Identifies which instance group the node belongs to, and starts the relevant script accordingly (`train_dnn.py` or `train_data.py` or sometimes both).\n", + "- Takes measures to ensure that tf.data.service processes shut down when training completes, as the training job completes only when all instances exit.\n", + "- Allows starting more than one process (for example, on the dnn_group instances we'll run both `train_dnn.py` and a tf.data.service worker to utilize the instance CPUs)." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "import datetime\n", + "from sagemaker.tensorflow import TensorFlow\n", + "from sagemaker.instance_group import InstanceGroup\n", + "from sagemaker.inputs import TrainingInput\n", + "import os\n", + "\n", + "hyperparameters = {\n", + " \"epochs\": 10,\n", + " \"steps_per_epoch\": 500,\n", + " \"batch_size\": 1024,\n", + " \"tf_data_mode\": \"service\", # Using tf.data.service for this heterogeneous cluster job\n", + " \"num_of_data_workers\": 1, # One tf.data.service worker per node\n", + "}\n", + "\n", + "# Group for CPU instances to run tf.data.service dispatcher/worker processes.\n", + "data_group = InstanceGroup(\"data_group\", \"ml.c5.18xlarge\", 2)\n", + "# Group for deep neural network (dnn) with accelerators (e.g., GPU, FPGA, etc.)\n", + "dnn_group = InstanceGroup(\"dnn_group\", \"ml.p4d.24xlarge\", 1)\n", + "\n", + "estimator2 = TensorFlow(\n", + " 
entry_point=\"launcher.py\",\n", + " source_dir=\"code\",\n", + " framework_version=\"2.9.1\",\n", + " py_version=\"py39\",\n", + " role=role,\n", + " volume_size=10,\n", + " max_run=1800, # 30 minutes\n", + " disable_profiler=True,\n", + " # instance_type='ml.p4d.24xlarge',\n", + " # instance_count=1,\n", + " instance_groups=[data_group, dnn_group],\n", + " hyperparameters=hyperparameters,\n", + " distribution={\n", + " \"mpi\": {\n", + " \"enabled\": True,\n", + " \"processes_per_host\": 8, # p4d.24xlarge has 8 GPUs per host\n", + " \"custom_mpi_options\": \"--NCCL_DEBUG WARN\",\n", + " },\n", + " \"instance_groups\": [dnn_group], # Apply distribution strategy to the dnn_group only\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 2: Submit the training job\n", + "\n", + "Note1: For the logs, click on **View logs** from the **Training Jobs** node in **Amazon SageMaker Console**. \n", + "Note2: Ignore the 0 billable seconds shown below. See actual billable seconds in the AWS web console > SageMaker > Training Jobs > this job." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2022-09-23 23:52:33 Starting - Starting the training job......\n", + "2022-09-23 23:53:17 Starting - Preparing the instances for training........................\n", + "2022-09-23 23:57:33 Downloading - Downloading input data...\n", + "2022-09-23 23:57:48 Training - Downloading the training image...........................\n", + "2022-09-24 00:02:25 Training - Training image download completed. 
Training in progress....................................................\n", + "2022-09-24 00:11:23 Uploading - Uploading generated training model...\n", + "2022-09-24 00:11:59 Completed - Training job completed\n" + ] + } + ], + "source": [ + "from start_job_utils import fit_with_retries\n", + "fit_with_retries(5, estimator2, \n", + " job_name=\"heterogeneous-\" + datetime.datetime.utcnow().strftime(\"%Y%m%dT%H%M%SZ\"),\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Step 3: Analyze the heterogeneous cluster training job's throughput and resource usage\n", + "We'll examine CPU and GPU usage, then epoch time and step time.\n", + "\n", + "**CPU and GPU usage analysis** \n", + "\n", + "In the screenshot below we observe that GPU usage has increased to 74% (compared to ~45% in the homogeneous training run), which is what we were aiming for. CPU usage on all 3 instances is close to 80%. \n", + " \n", + "Note: To view your own job, click **View instance metrics** from the **Training jobs** node in the **Amazon SageMaker Console**. Then, to rescale the CloudWatch metrics to 100% CPU utilization for algo-1 and algo-2, use the CloudWatch \"Add Math\" feature and average over the number of vCPUs/GPUs on those instance types. We captured the metrics definitions used to produce this graph [here](./cloudwatch-metric-definitions/heterogenenous-workload.json). \n", + "\n", + "\n", + "**Epoch time and step time analysis** \n", + "\n", + "From the 2nd epoch onwards you should see this printout in the logs of the dnn_group instance (p4d.24xlarge): 43s/epoch - 86ms/step.\n", + "Note that the names Algo1, Algo2, and Algo3 are assigned to the instances randomly on each execution, so you'll need to open each instance's logs to find the dnn_group instance." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## E. Comparing time-to-train and cost-to-train\n", + "The table below summarizes both jobs. 
We can see that:\n", + "- The heterogeneous job is 2.2x faster to train (86ms/step) than the homogeneous job (192ms/step).\n", + "- The heterogeneous job is about 47% cheaper to train than the homogeneous job, even though it costs more per hour ($45/hour vs $37.7/hour) because of the two extra c5.18xlarge instances it includes (`$45 = $37.7 + 2 * $3.67`). \n", + "The cost-to-train formula we used: relative hourly price `($45/$37.7)` times relative time-to-train `(86ms/192ms)` = `($45/$37.7) * (86ms/192ms)` = ~0.53, so the heterogeneous job costs about 53% of what the homogeneous job costs. \n", + "\n", + "\"results\n", + "\n", + "## F. Conclusion\n", + "In this notebook, we demonstrated how to leverage the heterogeneous cluster feature of SageMaker Training with TensorFlow to achieve better price performance and faster training. To get started, you can copy this example project and change `train_dnn.py` to match your workload. To run the job, you can use this notebook or `start_job.py`." + ] + } + ], + "metadata": { + "instance_type": "ml.t3.medium", + "kernelspec": { + "display_name": "Python 3.9.7 ('.venv': venv)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + }, + "vscode": { + "interpreter": { + "hash": "77c0de85c2cb739aa5100af7b92fb9d2075368f0e653f4148499a56c989df5f7" + } + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/TensorFlow-Hetero-Instance-Metrics.png b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/TensorFlow-Hetero-Instance-Metrics.png new file mode 100644 index 0000000000..21c27a3528 Binary files /dev/null and b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/TensorFlow-Hetero-Instance-Metrics.png differ diff --git 
a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/basic-heterogeneous-job.png b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/basic-heterogeneous-job.png new file mode 100644 index 0000000000..8fbcdb04ea Binary files /dev/null and b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/basic-heterogeneous-job.png differ diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/basic-homogeneous-job.png b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/basic-homogeneous-job.png new file mode 100644 index 0000000000..6b281b4010 Binary files /dev/null and b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/basic-homogeneous-job.png differ diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/heterogeneous cluster diagrams (5) (1).xml b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/heterogeneous cluster diagrams (5) (1).xml new file mode 100644 index 0000000000..3c3ecb2977 --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/heterogeneous cluster diagrams (5) (1).xml @@ -0,0 +1 @@ 
+7Z1bd5u42sc/TdbqXODFQZwuk6bNZLedSdN2pu1NF2CcuLWNBzunfvotAcIghIxtDpKt9N3v2BhjIYSeH//noDPj9fz5KvaW9x+icTg709Xx85lxeabrmuta8D9oy0u6xdSMdMNdPB1nO202fJr+DrONarb1YToOV6Ud11E0W0+X5Y1BtFiEwbq0zYvj6Km82ySalX916d2FlQ2fAm9W3frvdLy+z7Zqqrr54M9wenef/bRjZh/MPbxztmF1742jp8Im482Z8TqOonX6av78OpyhzsP9kn1vET6v0SfX43+82UPWLNvJj3AbrqKHOAgvw1UQT5frKIZfirON6d4/zoxz3Iownnqz6W9vPY0WymMYr+B/070es128rMtiyoGz3/wUzr3FehpcemvvdbRYe9NFGDc5evrtdTxd3L2frsPYm6XXbh0u1qWzXsbRMozX2ai5X6/R9T4/09/C/4O7R7Po7mW0CoOHeLp+GXlz73e0GI3DR/jxJHpYjJMWwDfjqXcXe3Plcbp6yFsGt3twKJquaSmBMZ4oINADxQeOC9+q9sRyPD9w7aQlb9M2X99eV7p1p1bBUTy9WyjTxWoJxynqy7dBNF9GC3jmK/jGAZ6j+uZEMS2gK8DTbMU1HVMJ/cnYt8wJ8MOg3a5ZvazW4VyZo3sWXg+4RQWm4ZqurQDdMRQwsUzFUV1V8ULLnrhjI3SBW+wU+II+EvCnlLGJP8qGO33o53ddXB4Xuwx4TZABr8kBf3IDPvJ/ImOlqzPPhwYzOa9P0JJ88H4lw+pzDA+bNFH9X+RnAxkOmWA6+/yyzK7GCn5hjr6grLPdlZ94Z3wjvflqXN4sDPfX+P7qu33l/9YeLhVNy0dmfuet1i/Y5MWoz0J0APXMuHi6h8P209JLxsATNPJw2/16jhqtwZdwyMPO8GbnM3ix4bZ1hHaYwI75lB0R7TX2VvfJEdGb1TqOfoWv4WWLkx801OQPflK967ObA/1K+FzaFI8nX5ZwpKctLRj91Spco076p3gzGoUzvgqjebiOX+DO2SEVDVvtbIwBkL1/2th84Gbb7gvm3tCIOeQuP/hm1DScsBjXSoBpjNl6ObkJMLl13ilgYsIHAm2iTIzAgp0CYKf4qqqommFaque7+gTUdUrehqenp9GTMYpidIrwcC46U3QS8DaGc4KyeoHD+llZwB801miubPMMSz1+h554FDibPQTrhzhE7fBBGHqBo6ggNBUQwgvuWC4ci2PTmvgusH3b7/QEmXfoNgPTtE06mqxhm+C9/TZpUHAPpxXYmtSYNW2Ofnv19+WPL+dvXq71N/Gv9ev3itXPbKGaQWAaWqCMdaApwA0MeGOM0WUybNsOTKBbky6sPmp+avhrIeBMt2bw44tpySRb/z2gp8SLILWZ52gQ3/mvTHgF4JWAx1VLL//YfAO+usv+mxwXGWbqodEHyip5/EaH17Tlc/Uo57O7CFpSVYH/u/zrr+ShPXpYZluWYDzSwfPMi+EjdfZ7ySVBP1luBtw83Wyj0o2HfktRlPFioSS/Al+Xf2E759gNOUdrwjRe9i6AIxhd+4vJdDYrUMxkElpBQOObsY2mOfQJxCh4Q3xGx7tUdHNA5NHgrxeJxzSqxKPZZpV44FzdMfHYQhOPLYlHEo8kHi6IZxd7JQmofwKq6H7X9u1n9fqnPf14/fLu1r/9crGcKVp2izxu5G/MEdCgLhrhjIpwJpUY1ETRUFappIE+XUTxHF0dEnfWkxHsSQ92dPw4hTsnVgOZigLGpC0ok8x+SkpJIymyxQJOJ7VUUgsLFaSoRwGrLH6YLgUFdIr4YTXQPoYSvvSG6HcqEpdpEwqXOYDCpQtCdrpkOMlwkuG4YDipWvHFbFtUq7blJg3jOzwFgt8Dc6Q5bcpNGpKb4O/kelPpF4p0YTaki1aEpVAbm6FNowrXsg3P4ktYslSCJlUKTZoU0NCM9kDDFAQ0TAkaEjQkaHABGrtYGwkefIFHY85QaZz
RROcpYwUVIYjDKPlBiuQwnsxe/gt/TVfLf6am8vbuzdU3QzGs9tSK+yie/kZdzoaOWrWpGXAMRhca4bbS9Cpd6DS3leq0RheMaygAczBbL0lEkogkkYFJhGGPJHjwAh5UBLgL4c9NAyW49xaLrI9z08828uH4LsT+hChe30d30cKbvdlsvShjwGaf91Hy2I8M+c9wvX7Jkmi8h3VURoPUnOM0ForhH3uhM6GGsFiBA2dEluEv3qpUE6NnBniNEBpny6Sb0Lm3jwxxOIMj6TEkjlA1/9lXb6JpQpHYY6KVhQxLJ8x3esbZtwiPV96MXbhCFIKQrCAOKxzJ1N41EFimZtrwDlds4MO+tdUQNsUdK77rT0zLtwIfdAsEFzNv8esveJ2vL5s/c+D5s62rVe2XsTYJVMsKFcOx0JAPVcUxVKBYQTgBkAJVL5NyG55G5w22NNsPoUFRggm8PYGnaopnm2PF0cKxAUFWNw19IGYA8BTj0BujjvGe4P8fJyoTFSTwvgrcU9nsh1GCHdgKH8DXZctfcTGQosB8Oh6nkAHnwN+enxwKYcYSmbfE2pgXZ+Yl3JKc0YUX/LpLgKQUAYv+cGxESiGaRsRKqNn7wvcMw3UndLiwalWFLOE3a+pZnjjaCjq8lPDkQJBQCM1C0UaO6Rb/QPmQ0WQC29oOU4gSTCvDZsVhCqk/HLf+wDBUUnvgQXtIvRJLqscD3tOhgpVv5PNAQX0mJbgi6RL4+Sej4OZYbvFxrIwyiGzKUNBAJHpYo+a8zmtzUIw//PcWHeAC2expuPksC3utSBEbyiiLF/mBcteGWgCd96j/bqLVdJ2aphx4SBLyo/U6mtOYqcBTNB9MEXrUs2qAaBajAd8Z6N29l/Tp/DmZHUbe0wqM4KAewzniOkBNvFjG6YvyPitjQPcL0IngDlMbAV3N/7JDF5wxNiXSw7VHWntJRI4ghNPyo48whBOGAPiG6ykB8H0FqNA6+a6lKdCGO2GguUbotFzzhQPCEdr+l0yDtPp8WX2ZGbxbZnDjYicnmATsOGQ0RcNYTbu9UE1hCpzIWiZSopASBS+IIjN7hSOYDVYWUnrQ5fozhJ0YodiJ6GGFunL2AO9/FO6irjcZPz8b5IPe4yPBAyn4MIwsUapXTSsazf5zR4eCAZNMA26YIKq3mCDKuB4CEAKz9RIbJDZIbBgaG1oyQRInhseJGhhghmDSZ2jAtPc9BGbCCx2/fM2+n7z5ht6MTPz28rn44eVL9q4c0KkfFJ6pZSOqGJ5JDzrKen7goM1WyAIITRa1lkKSBXdkcSTzu4zDZM2d4sRhMmf2k4zO3LmkGKD5QW7iEPZeEEKTN078LulB/RjvkIbSMMuHEUCzLBySEgpKH5DsAhVcBYgCKrYwLV7f4aD7BoDigE8yc6TNAE/G9RcWbGRZDHHARkomxy2ZEBZNhoLypYEQhbeywiXXi9XaWySpw1epX6xxdEWwfED3YvJ1JXOqbeeNpgUs6oInDw+8MNE/aupq8kfLOkF/Qzpc3LLDxckiBLcWyjL1jmURURJRe3kYk/Qg6UHSw570UGuPJEHwQhBteVHYOanH50WhLv9hZR261YuCrdQxeFFEyTFltl7iggC4cCTz+8l7UZhzp+BelO6dQcfmRUncJ4Qj5baQeLyLt4ReL4N+odjJq8I6Sexa9UKgmhnqCFUFKfw5VkksUcAIb+nPqyJKqimz9RJ0BAAdqYscty5yK+tqiCKFhIHeACb0zdQsK2H0UwkDXZgBK5GDsvdG06vuG4vivcFPOV2pMZjpxIQU3PqTg5TJxHDHdqArhuWoCjBCT/HtALbKNk3P9QxXtVtOJOIAUoQ24dKGMp7/x9NH6uM/eqhWMnuCnv8zk1JblaL4xD+fjehLfpVlgXxz0oby1i6aZSfG5vXNlx1bdAhssGtacAwbOVM0jPIokEavVDEj2p03aDDcMAjcMJrhRtexIrlnTEzcONWaHhI3RMONsvUrGB0pH/C
FPptYzKQuyQGhmHf7hGKmOsagoZjba2ANZUUdm9OQy1x9EtOMtuy5FcaMSteCdC3wBgp1ZkdyAi+ccMiTP7s2FcdP/tLNsBexuPp+bobOgUXoolx6yw+3wgCLfO4XzZxLG9qimyF99lQTa6KsUnOCPl9EiU2p6P3z2aimdnULzoVdG+M8ww/PNURm6lXJ3TBAY1yr6vVozb3BLsjFMeQI5d6oRp9mncOd88PVOHV+CF0/TD/V+mESgkSDIMIOE5awYIukyMEHoF3CqcZbww6Na0hgTOzABgJ2OaudfBpD2TAL6CNitQ2gV6wYbbkssnpU61ZM6GJRuiwWJX0P0vfAh50mp31pioc3xdDoncfh3RTVM3/1R405pu3ENMkGOwSx13IPmwoP34qf1ZR7IB7wx2bojAFNErBsOF9MzojlOdCbG28N+2mRbNFVrSo/5F+tpY2t9bixzr+9HndmAI+gkoQhdDChwWkwIXtVM8qNL2ftU63bIKGVAq371dJuOdu6Olj6KQne8tMVZczzW8zi3yj+VStgPBU+ZIsX7DIToooXuHjl0OKF0DUZTnZRcBHtgBQvjlu8KE73EoGHFy7grHp+mz2a/EClMqDdqhUw9MJzDLkzW8hgR1T2KWRoZ6W6lfq2wpWCKBkZvw2tZGQoo45U1c5KZO1bpKq3ilOG0FGWBqdRluypmDHrSKsghREJxIcpCtgaSGFEXGHESGDLG9fSGLkDm8DY0R0DEtiuAOaFzoSa+GoFDryRq3SFalwmUadwfDzmYBY+T9d5HXP4Om2MbWZvN41Bb3Bb+uAzp0s+69fTJHTMi8FpzAvzTMg5QcKUhCkJUxKmenGWcQxT68korQOrLqdLNHrCGqjKdlTKu7HRSmeiVRmcOPU9uRrheTKqnieH4nkyOvY84XtPUIRoeeaQnifpeZKepz09T3UmQCLy8F6oQlkvSjptkD7Lo0zZ+M5/ZSIvypkOj6uWXv5RTaA9n91F8J5SFfi/tGRLVuEr2VKTdbytYpiHDqooynixSAuGwdflQ21HBvY6buVwlT5qhWWJr58TLUdJnGJDoYjjlFHEsswKilDrh9lds4jQS7AZcgk2ySKSRfhgkV3skmQUPhiln0gZk5+Unx39NOUVXslqGxrpfxElroaPDKFCXI1LrNGGy3bsFmdzHsfeS2GHbPG8ahgOpjK1XB3EVkuK5Nb9TZOU/tIWDBX0YwqdBYVbzxvMyaAf6aeSfqqh/FQyG0r6qbaijOB+KqJSq+VUxaFB/FRCZ0i1PXNIbUhqQ1Ibkn6qo9OAKsvP4A1IeKD6rtAHyioRQZD/StOWz3Q3lY7lwPTaF/VA+oJ86U9Wlr9r4r3SkfcK0QN2X5V+YTtIuA1BohXvVaiNzdCmKUOuZRuexZf3ylLJJG7LqobSUP1XTteM4grNKK5kFMkoklG4YJRdDJZkF77YpTGqqDRUyegU9lH8OE1WPkKv4CxfSyYs+QIfRskPshU+wMaIHKxi3Efx9DfqcjahlBxPu9PJYCii2SSK2HoVRbDeVqrp7nSLIkAVGUVw6yWKSBSRKMKHXEIzSJI8+CAPreD4ba84LmAvYNNrcdy2I2UECY4BWe0RPoJjFHWk6mrWpG6jYUxnt2gYcn/eomGA0GvsgFr7ynE0DG1SlMZKRsFIdM87Za/wEWyTZBSMuFEw7dQEBuyig4LUBC4rOBg0hq4IjO8yQYGB05p50gpIAefkBBxZEZgvtQacpSWk0JD0nuD/T6ORqLYY76vAPZXNfmyzPHglut31l/1rztHLzLGrzJXypGrJYy91ho/icjUiCZEyZJEBMekZZ9/qTgsRumodELFqHWPKkRZBSiIShk9NEjnlUnXt1v1NHQ0H0FYJRoRCr/3JqXHSN76zjqBYb+aREhR7cOuFwh5ZrFeyjmQdmQQ9CLJxTEAdZlmhWJi+sqy0xllWb74alzcLw/01vr/6bl/5v7WHS0Vjk9tJZ1nBzilrRcBtWCRQU83
WOIpx2QTgKGbreeMoaSClZ+zkPGO7GCz5/MCH90y8LCu6HWgaO3PyWVZKThQ4tle1qyhCS7JyjI5JRJQYHWbrJYlIEpEkIpOsJHhsAw8qAtyF8OemgRLce4tF1sdbTH97eVVaa06ixk4hnNRUl+zUXTxO0VlE79hskLbhLCo7sag/pxcT5duL0qlIL6ZLKCo1YTq75lgpukb8kF0SJHf+wlkpy+ps76ghxl0jLGyJmEEljhmQ/qtThOJtjh+mkThJxw92iJ8dHnpM71x26PE6fF6XSaTiNSGli/l0PE6hBw7g356fHAoxQ2bd4HHNizPzEm5JzvPCC37dJYBUWp8J/dFiYdD7rH3qThnczFmeoojAiXUBL0d2Amd5/lArSkk2iDPP7MH8YY1Uyy382SWjr2juyDHUwp9e/ok2k6kZo0xYFOA0gFiaGKm7nJzuIgPjOddcUm/MkurpQaWbFSz4I1+PCy+ESYlLSboEfv7JKLh3llt8OyujCfBYTOCJHtaoka9T40+HDPjvLTrABaKAabj5bAEnCZoEs6GZsmiTHyh39KgFoHqPevUmWk3XqRnLwYokLj9ar6M5jc0K3EbzSBXhSiXgStuEt8B3Bnp37yU9PX9O5oyR97QCIzjUx3DmuA5QEy+WcfqivM/KGNIZZZDLLgFrBPQNC2XH3pI97tojreswGUtoSOpp7uMOksIQAN9wPSUAvq8AFQ4537U0BWKAEwaaa4Q4ae+IIElohChZFwkOfIDD5RRamzXs0LjGuo+JHdhWvunCPhwXOlFsslitaVRDOjsudcLoXWEtlVzaRz7Oy8d5PmwxOe1LU9yvKc6tIzan1/btZ/X6pz39eP3y7ta//XKxnCk4luLRmz1kQ/Rhla/D12ehkjN6vMCWlFZ6NAClGAj17EuSQVXEx8+F3dZXzdfrwyxgmcXhsev+Z23UVyWGVsOxZAwxbvYtTFwKdKFHytjsUJmKGoREh0I2dmdjmbqjAehjeefE6v2uPXa3beYRrO9BM7CgaobURDY16T7Vz32HSpB2MNolvvNfQfRCe+mwnSrx+o+q3qgX0o5zuTFtUVlxHNYxmumLhBxJjKcdXJnFkca4V+ufUNSRCjS3NM/sVWy61/nH3mP+abfkQz4ZabtNRpWpxLMg0Dtbh0S/U0zp2bv/KcZpaYoh54hideriWu0czxdkIAVt5HQwadjbJg3NxKkMOEIBHDZr4JZrI5U4sOaMXCKpZJ+phbowdQwB/Md4sRgtX/D13mlJasZt1kC54nhJag0Q7InTjvtbkpo5hYmpW+m1arXUraRuJXUruST1CepXcknqWnNRxDBZLIXIUK4uBGlUGaXrNakZ101YSLElpEhIkZDCBaTINanFhRfBq6XoTQNwZLUUTdVJFgF6lUU6XpSacRWFZREZ6CNZRLIIHywi66Vwjx7pslOKXgMDtFUG6TEGxTIbwgbfEibZrsoDQ4TeYpVCTIuMWy8tsrTI0iLLVQal+e2x1IehMc2yYKsMDlfVbGukFLZyvZY/OyDeih5Mo+QskyNQs6pmZ+3WDjFwt4tJPDjijjPikTUWZEEzWdBsmIJmbT8FVQdL2wvyyLpsJKwVMwaohFa3JCF9ROgHohm3SxLmkeh1Eep75dTsz2bNE2VUjtlMJ9eF7qjgrF3+GdM2ikfbtjterbqPYrMGVi3FpERdQEqUazVKNJRo2C4a6p0zVS9o2D3hcoKGDXPScPnToZGugYDWReqgARpCV12mfcep9YTQZRulO4CijFkm6wtbQedsr8TGbGpoPbGxRncWM6+RKADdal6jSR+c2eHVkWq6emlcKPiRYYBkaIaDPy/AQHXw40azH1jZK6II4eLXHMLFr3Pi4hd63Qyj1mUrXfzcMap08Z+Ci1+p9UHJB/S+deNN7VHvLvzgJQEYSTLGn9E8QiuTRQ8r1JGzB1RnAjEzqjKQ3N3qz8ivsdwreLA5Ophynx4GHkXJjrFS8CGU/ABs286ukL1
T6D49Up+sM12K3B8KByrLp1dgwAA0GtC6xgGhy0IbPd3yEgckDkgcaFz4+gDjIylieIqowYDdVzY12EUCeo8J1A9y7oJsZGzVGfGOAzt3W8EDocsFGLJcgDh4cCTz9NG7Y/dy2YHj8Dx2fxqceB5ZtQMa+4MAzR90E4ew94IQmjzkE8IH9WO8QwP/EAEmy8IhG6crsCsIcOF4wiFxgIotTIvX94qjB0acgbKvRNGMkVlaUpQ4YudLihpClyYwZGkCcbhHyiLHLYsQBk8muxyl1AHYVQn4kjq4jDrXCBeJZoGRYeou0DTbgs85VvmAqQpzcAy6W/nR4tF23P2sZtmTfLFQc5cz7CcBEnudxCQd0HIoaC+h7eLMxBLYBljedHCFh3dNIRETCFnhtmlsKWHpd6hwANgVDkSVDPAU2rdkkI3a7FH14HICKEC2LCKAka0NqSEAoUsLgNqwLqkhHJ9JkhoC1xrCrayTwaF0kHLGksotaM0OBUfyIXJBT44mZdmH4mrxObgst1DLymjCKzqTV6KHNWrk69TGZ+BB5LvAf2/RAS6QsZ+Gm8+ytQIreV8bGCmn8OcHyoMx1QIPvUe9ehOtpuvUdvnReh3NawM8K1xVYC5axGgRjPBJFgNEs6Uj4DsDvbv3km6ePyezxMh7WoERHOdjOFdcB6h9F8s4fVHeZ2VgTiucSt7GWjGm+wLQhOpg2CNXL3BRJoJtyUsxwajretC4LrWgwKSfKDCFIQC+4XpKAHwfPk7AMee7lqZAJHDCQHON0GlZpeEAmITGiZLRkTzBC09QTX0Y6E1sPTurgmNbn2sgpK3PIYBTc48uzIDppcQC7Bpe87Rgxi3q+lIjo2szLnRKSdshXsKY8cnEcMd2oCuG5agKMEJP8e0Atso2Tc/1DFe1W14iVJrxw8y4NKOMx/Lx9JH6YI5keiUzKei5PLMqlQfzsmci3TafjZZgPNJBZSHGsqch35w0ory1i3Y5z/AL5xpiM/Xq5gs37XItZAhf79yiQ0CInXLCMQgdIHr0Sjw1nijcOcVlOFUeBZFKnInejJxMIjyjdWwSOtUGnGqqjcQm0bCJsOKE8SzYLKmM8IV03HpaLOlpkZ4WHZBgYVgjW6uwxUDOFUto5wpu/cnhhXSuiIYX0rnCJUJQrTszz0P579u969tP395Nrt9ePqzsKFbHisa29T0kesArHb/kixqgN4XVp9DbzaoGyTu8rIHoCSK2SdhXG5QP0XrCBGsE8GFJZTTkrpUk2NeUN744kplepkzIMgw1N17RVgiZVMGeUEQpxICTJsjy021mSbBGwImYUxlH3/WzmSSi462tJUSdBfYsxxtgcgcuTwUlfxHFc9TB9VI+WWk0hJekWmsUvSSLje5CNoUC2PgHGpbA/udBAd+mj/r03YMS2V9fwtkH7XV5rY/TqYFtEiv4uVZVHQcuRR5vswo264pICJEQwscJSggRE0LasEwDMAt7UjxpZtndb1DTmewMjePwGxTnu1CZ/f74bfz34ufHa+v3i/LPX2b8Nt2tWEH79ov65//eTpbRV//G+PLug/c81VK7PrQXoh2w4CgZQlqlXa0S+5ryNi8eiYE9emfBttLTjJlTnPrZjHldVs9u4rbR+XDb1EyB7DQLAdw27Kldum22gg1H6QpSMZGKycmxqThAJ4Tbhj3L8Yb6A4DLFHPEZXKJ1OvFau0t4L2qq1fQDi8L1DHdghzB8gHdqMnXlbv0y0XmqAXgBi6TBkuCN8x1IBIrJib6R0vFsJK/ak5H+jegw8VSCYeLW112VDMpDhfN1FvTRRhPM5IdJDvwcIKSHcRkh1pLNAA/9CDaCAkPrflP2At2HZ//5PyXqX/89e/HpRG78x/zN/H4g5+hcdF/whh3R+E/4WghLmlnWvKfcLo82ZGYzJP3nzBmTnH8J4M4gQTwn6Bn7P39J9RlQtJD7rFMSM305jJJRVzfiFMrWgi0UIg6Mgy3+OdYJY1EASO8pUdnint
ClCMFESmI8Aaq4tAdz+uEsCc33nB/eBmkUtKS3oEGe21RjstLyeLe+/ltyEoXqlF13NBqVOIySZ3JMbjq+EmAioilHY/Eyh09ou1hR/G9d9J2tIti1Yekos5no8AcudwUBD/kXIw9y3Xzdh6vkoyqHwiDR8uXPzopPl5zi7KXiOWY1ISqPs5T1U8LkKwGmrFaiyE2rMEoWU2ymmS1TquJbwAgt6AVE8SNGoNnBUmRhYjWv/46Oyig9W5LQGttMtSgAa2T0AoCGlKMbSTID2hUHYeDwFVGBttJmFTpp5F+GklF3QSu1hicASihhzxdIRHhIBlAF1UGkA6bvXjFJVdjbeqw6V4EwEP5FIhFigBSBODJYaNLO7qncyB9+FQTg6KsUouSSv6JWenWobM0RnqPDp3hztXJ/T3wWiKQVP/pdxHZ4U4dC3SLRd8uInbBe47ZUCgXUe0Ctdw5kFxy2Th+HEinVJlfsqNkx6EcSBvgYFlk0mTx41KSte37qG0fzav1g89aK2yfHv2gsvYGYKLNsZa1J7QfF9vqgvk2AMV+d1/V3gAnZMClu0q6qySD9VzVvqFN4gdVaqeAE0IVKgrsUZLFYC9j00NJlj1Lq6w+ffw3fPvtdhX8c/3z8u5/IXC/vEt3K5ZWYex2DKVVjFNa8+boTAv7mvI2vx2JlTz50iqMKVGc0iqDnIQApVWOozS9wV5oR9jyK3hqF6U0fdZEYJbUEUUzRqahFv6II3ZfbMU4pSV5pCoiVRHe0FUc3hO4cr3RU5eehtLBXu6GL6VjYG2CjgFaJULWcsrHSJWX7Gub8XIex95LYbeMuWp/yK35nbf77Q9fpC2gf1sx9JG55cTS4Vw5sdax5pQW5Dk6m8i+prxN4kdi3o9ezhlAROC/PqvBriQvrkBgDyMQtFufVQMjlZAMnBF5nB5EglOqOy9FAikS8AZE4lCEeBVZDU4XYBgAV5ZUVkEpGMp9IVoTPSWaXeajbEJsjO3JKMttYZ1GmYOii2/vr54//fjw8fv/wkt7cf7VMD8zIeg401AGzE9eGTzmniiaWtUvgD1yM+2sEMJq0zJQwMgkxI4DolRqR+mJUFgYAuAbrqcEwPcVoMLL47uWpkD4csJAc40QV8WRJlwwIWBw/mwY3lmbyNEppDBu/JMmFKptr6SZ1kaziGjcZf2Rvey4VilCqoOKBaflkDrqyGjPgjMCq07CgsscUmnBB7HgPUQ0CmlBeatM3n9xkSMuHtJ1cZDOCoNMzHf/re+n7uXfuvbn37/+dD96xsbGCUZsx1EVJPupz0k0jsqjVFMNoDGaIZ5JrGZ4AN/VDlzJd5LvJN9xUyNksPogjAlCcmhLvqbGATYqCrDJKqwd6IcqR+Ds7YeqXZj7xMBH+qEIuAEVuAH2yNYqfNO9E4qxdvxJII50QknEOUknFOPGl+iCoSPIjeWGG6A5dZJL0LySWQivQFZt7KxRKbMGEcCbUmb1Zcsmnx8mD6/f/3VjBR//db2nd/H1J0VJJ5NaBOm6bBkFT5L+3Cnnp2P7rBMhuVpeu6xgnIFLsc65yW5DfmBcvhMxzzJSV0bqSsLquchZU3M1hODCnBIluFz4NfLJTiTjze4i1J+vxovFj2R5vz+2iCObqieFHf3tjqLNKoLZZV41IRhzY4FoBFPOQrqAO79WUTaS/hq9GyGxqbSBfG+XN2jVd+gY5Q3ke7u8QSMPrxG/r5ENLGyovCsdXiV+Xy00MEnBompKNdrRDqKNToCeSpdpdHWULhCJfrOq0OCPf8DB8AMPhh/5YGDhYlHNyprfUJrKtKX34QR1h0HR2C4dy9IMUhobCkdN/OUcR80qjmqmRsFRvWscxS2ROCpxlIcTlDgqJo5SmYMbwjRbFoGFJEwqxDGry9R0psbEt+3VZWD3PuZ6UgFZwufp+mvh9TdktyEape8unzMznrx5wciy9uL1eRxHTxuMCBdjvCWYQYM9RatGU+rVUOvalFUsMu5Hq7JGWha
u6p4be6Ezoa5XnX+jFkiK0+f3t//MnZ8Xf15e/fh58fz0GP6pfb9NdytW/53/e6V4/sv5zffF50vz5VxzV1/TqXnoCjvtcArumVPgFBGN3LYqpoxhzE8p1iOx1EdfKGYfBND4RIBt9w1jXj/p6r+N44o0WuGeJvJaZqYr3wXe6JP3iJZB/4Cmy8bS2quKyDdPv69GD+vlw5omvf3BVOga+BmT8kBNoE5nQl21ytD22kHN6xDtWGWIIvjkqlIVqdjTQd/Vh3AmuzrSbM0qyULZuwMrErllqQngfPkuahGxhpJENW5RTepswutsRwKqzNMg7OyrsrXkSl7S+WTLHrHsc+5cpmPIZISKSsGTjx+nQaig/8Ipu0wmtY+I7QQ73Ufx9DfqAnb0UynaidR0zNAZA5qm4+i+YVmlwOdBXU54kdPc5eRQXE5aBgtFl1OeuNWClMN45j8JOJB2UPqbpL+pEzgo25ueKaAHLVNABGg9hklL/Imw1VwGMRUNsAxikkFMRx7EpON63TlRAqdKlHj5rHIQk94aUbLuRT6g8kiMrwQrKZwd4wPD1iCmKnTwIzO1XK1DQMa8TCqTqzdxqGSLmR2oOV0sLqcP1/dA+Xt+eff9r/fey9NTxJXmFGpjM7RpFOJatuFZQzIBjqvLmcC2q0yAzX9JZcJjuQUmqL2EfACBNBrCGw0JQyeqMjHMTc9EwJjkThgHaiz/PkHNgGn0twc1H7xIJs3dtD2EuBRLTQuGrmWD4mR2eWm9sy5eK/eri9f63Fn9Ao9X03S3Yozxe83T/rN/Be/+Z3g/Pv/4T/O/oOAqtBufq3gCQC4XYQDC7tcsd1k9Fh5t+ZFIR1nNiqBnbYfXgBNiGxEN47aITsbNJiOheZOxjjESuvZpgO9IaIb1kZHQO8Yw6yP4jdvQG6OerixiujXYuBrKPF0kEcxqjlzUWGYqrQF4qWLYFIW+KmrNIDaZvMbVqqh5ElnBS3Vw/DKoVXhEWj2VWDpVGzmmW/wD5UP2ELxsnhBdSeVIKkdSOepEOaIb2FeEmeTKu2TyiYXDEVXDmtUkW93d3qBK558SdxMlDWwZxejoFryY9SFNVFJ6iuKkzlcRj2qfJhs4sOpCbArYVOWRnlxLFlEzx8A1HQuuJZdSMgcQ+U4HOJYYD+oSDyQe8HCCEg/ExIOSjSAswgBE0IMkKSAOaKPE9xeto+Qm/e8hXCEL7c2RjVz4q2W9pTYyTePVH43EDItpsLc7n8omfQ9XFBwJ8cvX4ptC8R30dlN9J3mHy+/s7rCqq55TcmThSkC43E+zSkCkN62WXYrT7efrO3/xKQh/v7eNJ+35ze079fr2rOL6Ytwh3bi+Mg5SR7njaV+xpUfxxOKHjpjzHf3O7mbi3SZrMwagdAdJd5AEym6Bkj2T8UY/XPqWxfGRHabovJ6hbKUd6++kQcjKcrpEwzcs42Dt3C+2eGOaaiXuBueaF+QbnJJelG9azD5nGFYeAEXKN1K+OTlrKw4qNpFvsEEYQLHp4alBQMXGKD/XreCdtAqrkk3VJbMJ8A7Hma+uLlxlR2nHHlra6UqiwYpRLgqlIo29TTKq6jT0us82W+5pJO40jGtm3EydijuKOlJdvC56HgODw3UOi60xHJ0gsH0jn22NzPJy3PKRaiKfz+PYeynslgVS1TfZrDTZZbYMWAd/wWJ/AT+5HPAFQD77pN0ynEhn88PATBNHncxZU3X7JlgC3vEC3tFrgVxqRa0Kz+wJjjcO5tIRIIB2lweQl9ZcplV4qnB9lr8JLWodzfsPwa+QqJBdm2hQi/HU8kYERsN/b9EB6soeEZyu27amWRVOz3YulQTCkezvUd/dRKvpOrW1frReR/PmJYmKDw1biix5iT2DbybTZ9SONEg+jN88piWT0nREaiWmrMeHUyctAmcNNatFUJAmcXh9UZrE21qQJhmpLNxj2QlZ+G30FoYA+IbrKQHwfQWo0Cj7rqUp0I45YaC5Roglbn7o7eh
F2Yar5hpV49CzhNdDPpuAEl47HrtP9w/rcfSErnRikuB/P8de8IsVkK21FJF99fOz9jK59r483n/65/fyjb/89veMiQ9COPX0inZiVov9uBSXntmeS6+2a/mwm9KlJ7z1kIrP8So+zNOotxikfRgAFRjz3gmjwhrVAP8xXixGy5ca+9x4+SuHaaC3L3+F/G6r7IF5p2zzanY5paYwVQMoFyIcCgtM06rE+mAK2BLr0/3i5s4JsYGIhkUCk/DAdApsUDU0fGRsOxIB0iuDyvwezgCuZIB9pQFibQCgNov27Z4AXEkAkgAkAUgCaIUAinaGDwRwTx4Baqz+HjWArc1cvV+sbilAtgAKlABdaiAvpSAwpYwcLTR3/4RpagBwTchwLV8U57bx3/GXN8+RG5y796ZlPP12gg9f0t2KMbeMgtY81BLeNXZVJ+M+QTmOqcEXjNIXzkqBomctB4HiGAjJRZxy0bZgMcZdJkjwnjhkcPSRonuAh9VyYEtfQZY9rKMgZJBl43ALjRZuoaJQ/evVj826WfAYsANqc6/OmiVPN1Ru0lpxByg3VQGmuXizYzFgVoJVla/YN1/fxX83aUsAaGV8UbK3h1YEBuWyf4o9Iur5dZ8ggysPSjbilI2kZiQ1Iz7IkHkaDKPIjX6E5zreMI5nAtJpBDTzHhbwwsSJPHhoBGntw107EaReHGQ0olLgpAohA8WXApBXkdkSYYpry7TgRmI8V58ED0jTJyNMZYRpJzxQMhEDEEAPkqGQ5r8195HONNEiuI/KFVgGcR8Zt1er6bdv0W9r+V75a/4+fv8U/XtWcR99f/vP3Pl58efl1Y+fF89Pj+Gf2vfbM27cRxVhw8ILXud0oxHM0FIRFYus+7LNEVX9guoQ92KXjih8A54CXIlombcJ6oz7VTqipCOqe0eUzifCbLtvGAZMOqL2d0RlkttiQbqhunA7FVHhFNxO2a02nNtJt10Co9pZiNIgvE6aVT5CD04nQ3IQ1xwklTfhlbcjoUC206nOAPLjcjL4BDaeWad7l1PtY9wxu5xMsuDsIC4nxhP0SdCANHzS5SRdTkfpcupBHBTS/LfmcgJMEx0uxudxHD0hQWEGDeQ0KFvl7Q6jffKK9l5Lskae2N3RVF6oYKvfrRY/9vBLMSr1dOOXKhYtZRRfO1QgIUDJJIu6NV1MgHQ2mWozN1jregs4IcIS0TxLvxM/jCH9TvUTCG8cs+2+6aGU3An6nUwU610oIZitKYxWKOnE82Qyue/4PE/ZzTaQ50kdqZpllcClpXwnp3xQQlTqwe1kSgziGoOk+ia8+nYkEMg8DYb948fxZPJJbD3CDvDRdfqMihkl05GKpoJZmF+nVjQpi8kmu2lSOk2Tylee3EhK30qKEl1fKtfKa09tqkJMhgFFXYjxuNr1EpMGAJQFbrrTikyXrAZIhjo3FYuqyzXaOpkPXqMXDbGyIkZEHmDqSIzO0SsPEnp3dc2x7z3ejDvzVLaZ4/axiUtVlWt1qDUsOnQhcLGyw6iYJysO7sJRoK6kYNspY0Ct+aFaLqt8QTOJG7TLlDGOFrCW5lvWLpQALZrrTtAFomXtwi5cdxZi8DeQXVYd+eoOXVpKNF9ddncNmCWm2aDEJ5lbbV8e61FBkgtW8Y020h0n3XF8gB3zNIo2jR//m1yuKl98GrNFYT3yAn34G/SoLFb9AU1e8Avn0OJPvACvXV7BE/8h+BUSgDL/90rx/Jfzm++Lz5fmy7nmrr6GTDqJHtZwpgxfp4b7jOo6g//eogNcIAs+DTefZWtUEdqObtuaZlXUoGznsbe6z310GGveox68iVbTdWqo/Gi9juYU7lkjtauKRwW4KtJO7lzMzjdZwyuZl+GbyfQZtSOlrzBO1oRdYSnt3ku6ef6czG8j72mlq6Osx4dLZ7NJwchQq8lsZmY+So4/UM9COyaz1Q4xPqDmSCb/zjEkDAHwDddTAuD7ClChXfVdS1MgfThhoLlGiJdzOyI
2E5pcCpakzkb0TAGMqeCkEYDuUUqS9RvICTY75StL+i+avF3MUR9P6KZB1HAxcXxIwUxRA1Q6Xy4S6wY8mCr5qCn8hC0llBPNu2bPLtL6NalIUtOH7MyXFouSEI+tDQJEq4aWPQ66ehzMBpZVNaq0Qiad29RTSjGRNlXaVGlT+7SpMqkDi8Tj6SPVC448zEqmiSI3+CycrKtucHyU1dJb4G2w/d4dCkZVfWRXgofVGhqHuKBUF/cubE6a0or33GZnk2z3nqOQyFVm1gvacNIJDMd51ele5IFMrKYq2OU0E16YAC+y41agQDMGedLmKFdCmhTp6T5GVDoFKqg1UQMI3uyZ7qQJ5UDNm507IaLmDVxQfTx3Kb7Z7i0xR+H0+9xenAbzSqiQUHFyUPFl8WsRPS1uoEnmkyWGN3rNpW7LZVq97qTuycSpkbrzTxo/1tYHgx+h1G25/JhSOdUKP9VKhJBSN2V24Y10e7WpXUndsFNhP8AOSLSEZNYnpIVNgB3c+sH71bMOvpnZW9XBt6eQ9a6Ed4wMWAmnPH/bgyjh6glBg4gWR5KU8CR1CtCwnwXjRyZvOZ9AQLr5N5EGFFzhhq4c4I8xLrz5alzeLAz31/j+6rt95f/WHi4VnU0LO0kHVbvdTzYTBAFc6y8bDCYuB1iw2R2bbEbv8mGxmXcas/W83WiSPSR7nBx7lKf8ng2xOODTkRHObSM2ptf27Wf1+qc9/Xj98u7Wv/1ysZzli0E/erOHbIBe3XzBRjo3qjgpGPZdEK5W2w2rn5dl+TvNss625ynQGk2ad92kaktFmnddVYXjEttrinGumPD6Z2OcfIUNL35fMLyAYniNBoaXuCLNLgE+8E4FE8vyA7UmIqV2YuEC7buQWXn5MeoKaja7eiI1xyC57mk/VOoVbqmQSAeBLCShWCNxO022V99QA+7IsXTVclzLNDVbL485Rx+hjUAzbNfQVKLsc9MS0poLRvkPwGO5ZaS0DBdei0Iryr/SUoFFTSfuJ9tgNtoid9eJO4hdXXG/WwxT/WaWK+ukTetrpVqfupndlCAdx2iX+M5/BacTtJcO26kSr/+oirWwT8/j0Bu/+qNWbiVmhqoW2lzs3F57q6aiFq51US2eVbhtdwg+Kt7QjCmx/ulJHakAL1L8Unje6qkMVtXwxlG0Lu6OSC1NVDfe/B8=7V1bd5s6Fv41ecQLCYnLYy5Np3PSc7KaXqbzcpYAYdNg8IAcx/31I3GxwWDsJDgHJWq62iAJ3ZD0fdrae+vMuJw/fkzJYvY58Wl0BnX/8cy4OoPQMRH/VwSsiwDk4CJgmoZ+EQS2AXfhb1oG6mXoMvRp1kjIkiRi4aIZ6CVxTD3WCCNpmqyayYIkapa6IFPaCrjzSNQO/RH6bFaGAl3fRvyLhtNZWbSNy4g5qRKXAdmM+MmqFmR8ODMu0yRhxW/zx0saib6r+iX8L8r0rx74w/3817cvN/Of//t9pZWZxfSRieSf/O8kWpZ1dTa5fqFZskw9ekUzLw0XLEn5O2kZWCT++8w47ysho2lIovA3YWESaw80zfj/xasPZRJS9m3aUVpZkTs6JzELvSvCyGUSMxLGND0m9+Jtlobx9CZkNCVR8ZEZjdnh7lmkyYKmrBxzM8bEaDk/g9f8L88jiZLpepJRb5mGbD0hc/I7iSc+feDRQbKM/bxa/MEPyTQlc+0hzJab6vJwAiHGDjY1z/ADDXnQ01xkO/xRtwLTJq7nWHlNrouGfPryqfUBnlQrPgfCaayFcbbgo1x08LWXzBdJzLsj4w82Irbu4kDDJoIaIsDSHGxjjbqB75o4QC71hu2abJ0xOtfmYsLzj8RDdIQNBzuWhqBtaCgwsWbrjq4RalqB4xvUQU69U/gv3cOjiu0YxVVUOVmeMHE2Ezk9YgS9eBIBmScRUJPo3U2ixP0l4BPqEXE5guftOoNmxMfJhct/mYpfzn/c8RSXUbL0qziezSa6HPN8dHlh9HW9KL8RWWW
aV7wDD83SGoZtpnXG1hUap6LzqMgA8PJWMz6o7xYkHwwrTj942IzNozKaDzo/5J/2kn+ONH/fCPI/PI6Pw6mYKh6PF511MUvS8LfovOptPp14n5LovEzJEpF9xksTs4kGYgLpm0aL1PRxhzIcWGnKdKkffFuIutImMyFZRpnoxe/1iW3U+ucjTeaUpWueuCzarGhTOTKhVT6vttwFVWlmNdqC0M5yNN3kvR1rL1kQiw8r64pY1X5sS2JvS+5uf/7pp+fAQtc/zoFH0l/fBl6+9jVDx56HDeBpPh9tGnI8gy9iPm+Lb1iW5WEEzUCt7AdW9pN3CgowcBwQaIHhmbxTEO8UV9c1HRjY1InrwADt65RNHVar1WRlTJJUNJFn54iWikZAqPGVTcvWfHY+ajEv0GACEoZsYaPHp2L/qWUsXXpsmVJRDxdRSjxb0xHFGqL8g9umI4YhNgPXQZZruSdtYO/0bIDpUyoBdbH5vNY5TlznNfBmfDnkxRfQfQJyIL5EwQ96uEIn9k8pLy70tM3AO4IDoCMpgH6YAlQwH+V4PTKsNppYbRhtrDZMp43V0DwxViOZoXrvkqWQWiG1QmqF1J0NHBFSzkgclzDej5PGdpnsAkrqT+ld+RjzocTxr4mdScpmyTSJSXSTiM1tjpi/KGPrUgxPlixp4ikfj+n6P+X7+cNP8TDBulkFXD3Wo6/W9adb3kKOVKJba3tnUdFnoXF9NHamK48aGEmntC8/c0B0T2nEx/kD3cnhNEhtSC2sN0Yqrb+ISHz/J8eDT1fHfwo8bFvaa6QPAk83TaoZtimQkeqabehIMz0aIA4AOimPnl7SCvPUrTCB5VI+5TUv4NCOiA40YmFfswH1DQ5sEBtQ8aandysG2DKhrlnI5d1q6ZRXxfE113EDbLqm56LTAqgihicUzh/BKr5lNN3DLJabqF46gXvZRLJkvBPo5eaYXeB3kDSk6/znWmTQkryX9CMIo6grOR/IyT3dSeyTbLahKpUw/kY0+DbJQlYgkJswlsz3SuvbUv4tlRF1L3kOgNVz2V5RJMnHIX8IwkdRj4tFEopcPjzkA7LIJJuRvJfnj/mUnJBVhiZ5f49M2ADMprABATCBuCVvsOy2uKEKOxWHwTJTmIFRX0kbXlnaALBtABdpAAPEK8QRk+gW0RwHAw/7lgsde2yg8ub34r2Dd4tzb1Fc7gqFNZr+CGOhKnYYs81ezC5wtVJbK/Gq0EHTdxG2Dyp3UW6eePfLxcSr+iebNOttXKQlsuotdOf9buZ/mjGwiNJ128vP52tRRhHlIfHDo+a8zK/0sRgPl6IwMcozvuXmn2db7Kjg19B3ZP02amEvsDrO5YEBJ/i0+GvKjL8D71cV/ippv5L2v3mG8YpAfyv0y6FeSax3AV+oj1eA04v01uBIX56JPwnnSezNRF80ds9WWxKwgfjno/CQMvnqxL0JwgC2MbhLNw6apwVgS2YAthQAKwBWAKwAeLgtfhMypd7mf7+75ON8D/Y/ZN4mshf77XFhf31zLyUPADuad2Zb8Q7if4AI2DITgYHlpIoIKCKgiMD7JgJ19JSaBuwhAE9X+eu3jqOxfy5M3oUYP+LAGHpHau8d0t17DJl4DUwwROVz+aZZPW9fFQ/1N19b6a8E6YNKf2hAYlFjDl3EoQo7WjewLOFWqBxseQvaoS32DhsouqZ8aTvQW/k4O/kYO/kUXdfKJx/3m0a/mO1IbRA4UnvAcer8vYrm4sB2H0pzUWkuvmNq/Yrk6FYsylk+NaD+4UIYRd7x7HIhUPeZyeYF3oUZH0y1xP0uQPoNJpqCFNgvHwmiZMV5W8omvNeJSzK6xxqx9BjwtbCwcMammWDhjYJBRQdQWxzC9zv7Wc2pxCHVWYycDKGq/dgogoKlt7tqK4HI2xaIHIJKqYUkF4UO4R7Qd+ux/SgPelGe5aqDDWjvMj+o2yqUQS0VyV2jg3no+9E+HtC0vRwVBzDspkAAd3giQF3nIac
mAHJ7URupGzVFAI7DOkAxMj1qax7weYVMYGqOY0PN0aEVcNoMAR1YcjICAiA1PDYQRGosPM+/iYB3ju2fyX2+UP2ZMOomyT3/9VOcMRJ7Qk1iHk2YMZlTP1zO92Bn8YW1jOc1F3lpcZmTGPx5Plo7l36MrQPKMRg7Krwz7eaOFzsdCgBd2vi7gvLBAc+QGvAMBXgyA57a8aod79gg/UVAKDUF2IPlT9cZAP3+9A4qDZSH/6WaQHX033/sv1E02OoW/KzHjclJUAVbBxUGKomuRBoD5o7JgwWeqTJg7TgW3mT8ujoDQGrXhGCkvgmfdd4+NNv7h9QGhj6mUXoDSm/gHfNxGQkS7BdmnM5p4pFql/rEcWBD7RI4xujULsGxzhbhkBKdV/W2CKUWz8CRimeeB9tvw90iHPiMSLEPxT4U+5CLfRwQz5zAi/NoqACwpaUCUgsi4JsSRLwNKgAGNqBVVEBRAUUF5KIC/S6YX/NCB+tom9CGYMLCvXKJ8RAPR1riIbW7ZDhSf8nvmngMbMuqiIciHop4yEU8+v1Ij5N4TPTc49SWfDjYkvhUZEjlktdlJFI7kK5qrxjJeBgJVDoZipEoRiIhIxH+urQ7mj7steMULi+1rJaiX3e1XyLypLugd26f8gm1A6/zfgrP5p9zdEYrVlMDE3UYaQLYoUoK0KmtVqQWSQyNmcpqRVmtKKuV92210oJBZYmSL7X9l1jUxQxbicKHbagEmg/HOp80ZN3tA6lvqwAjva7iebv9t+G48eR2JGqzrzb774ZaviovMMlcbLJjN1vk4Kp3BP17uVizfFHvlQv8KpI9QTTQf2bxJNHAzrUYo973G7jz5ubOrf+ukebgZEBq0f/Q+Kl2/mrnr3b+73vn3412Um///6DpZnvfgu37WmQvXPfbWCZLxicU5c2I87ps7paqieX5z7XI4EIAa0i3cR0eG3lyAMwrfLHPvWPjVqzKj+ONaPRtkoWsABg3YSyZdzh6ZELc0Htldv1eLFg9l+3NL+LKpy9/CMJHUY+LhfCcQNMPD/k87vY5TVYZmmy8eP1defEaG2OxsbPjUtow0cSELc4CnTZlMYwJtE7LWqS24hypEafiLIqzKM4iJ2epA7zUTOUrTedhTPZxFdaI7hcuHLhyswRm3lsezbLDAgaXePfTXCTxV8F0ugUPuyoKmNo+6lJRsKFrvPBezVMAf7lubFQUOvxqdrnVPLkfaakv1hzaMFBBvoJ8BfnvG/KbSCk16H9ZxnE+yPXzxWIP8qdFGo1UKfrBv/+WTQX+R4E/hh36idWl7q+K/lJfNDm0dZ5Cf4X+Cv3fN/p3QKbUFOCKusvpdK9qgd+I7jeEPHBDpIL+o6AfjwT6q/zlhP6hzeAU9CvoV9D/vqG/CZVSo/5d8QGhLqwt9kB/UZbmbVL0o3//zZEV+ntrPrn8/MT/EPwXSog37vvmA8iwJm2NRbvjJODUpopVqyQlBOpGSUUIFCFQhGBAQtCBotKwAlFukrBa0o+i9z+L0c4D/w8= \ No newline at end of file diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/heterogeneous cluster diagrams (5).xml b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/heterogeneous cluster diagrams (5).xml new file mode 100644 index 0000000000..ee0dcb5f4b --- /dev/null +++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/heterogeneous cluster diagrams (5).xml @@ -0,0 +1 @@ 
+7Z1bl5rI2sc/jWtlLnBxKEAuu9NJJ3uSmUxOk+QmCxC7Tas4SJ/y6XcVJzkUJSpIlTydd7+jiFgURf1/PKcaaS+XT9eBvb5970+9xUiVp08j7WqkqqpsIPwfsuU53oIsPd5wE8yn8SZlu+HT/LeXbJSTrffzqbcp7Bj6/iKcr4sbXX+18tywsM0OAv+xuNvMXxR/dW3feJUNn1x7Ud3673wa3iZbFVnefvDGm9/cJj890ZMPlna6c7Jhc2tP/cfcJu3VSHsZ+H4Yv1o+vfQWpPPSfkm+t/KeQvLJ2+lXe3GfNMucZEf46G38+8D1rryNG8zXoR/gLwXJxnjvnyPtIm2FF8ztxfy3Hc79lfTgBRv833ivh2QXO+mygHLg5Dc/eUt7Fc7dKzu0X/qr0J6vvKDJ0eNvh8F8dfNuHnqBvYivXeitwsJZrwN/7QVhMmpuw5Bc74uR+hr/H97dX/g3z+ON594H8/B5bC/t3/5qPPUe8Mcz/341jVqA30zn9k1gL6WH+eY+axnebquqrlu6IbnadCYhV3UlB00s/FY2Z8bEdlzLjFryOm7z249vK926V6vwKJ7frKT5arPG45T05WvXX679FT7zDX4zQfZEdvSZpBtIlZCtmJKlT3TJc2ZTx9BnyPHcdrtm87wJvaW0JPcsvh54i4x0zdItU0LqRJPQzNCliWzJku0Z5syaap6FrHyn4Bf0kZB+Shmb6UfJcKcP/eyuC4rjYp8Brwgy4BUY8IMb8L7zi4iVKi9sBwtmdF6fsJK8t++iYfU5wIeNmij/z3eSgYyHjDtffH5eJ1djg7+wJF+QwmR36Ve6c3ojqdkYzO6xTficiltAescju8oj7fLxFg/QT2s7utqPWM7xtttwSZqn4Jd4cOPTthcXC3xZ8bbQJzvMcBd8So5I9pram9voiOTNJgz8O+8lvkBB9IOaHP3hT+JWkkN6T7W3fLIpmM6+rPGYjluak/fNxgtJd3zN33Za7oyvPX/phcEz3jn5Fd1M7vkUR1K5ftyKO7KSbbc5XdeU0mRxkx17OzwazkyqIDOTCjOTMDNT552CZrpiWcpMmmmugTsF4U5xZFmSFU03ZNux1Bmq65SsDY+Pj+NHbewH5BTx4SxypuQkVFXCt7m0ecYD+Ela4R/UQjLRtXmGhR6/IY8rEp6g7t3wPvBIOxzkebY7kWTk6RLy8AWfGBYei1PdmDkWMh3T6fQEmffiLnVo2iaVzL+4TXgOfR01yL3FEwhuTaxETZujfrz+++rnl4tXz2/VV8Fd+PKdZJxmtpB119U1xZWmKlIkZLkavjGm5DJppmm6OlKNWReSTZofq3atgo9UY4E/vpzjFzdhpHLxBiKSBeU1/rv30w+kTfTQi7tIVpT10/bD9CgXixsfXy5Zwv8jpxA9K/v362STq4+VydPCDvCDbPJ7UV+Snyw2A2/ONY3KFDb5LUmS8GWwpehX8JvCL+TpQm9IF0oTkrCTdy4eZOTyXM7mi0WOHTxlqnsmjSosw9Rsg3yC4QWP2c/keFcSbl5/oGHIRdDQ5SpoKDoFNBStPdDQBQENHUADQANAgwvQ2EdtADz4Ao/GnCHTOCOcjYns4z4KHuZ44EdTO5nPa7GCihClw0jZQfLkMJ0tnv/z7uab9de5Lr2+eXX9XZM0oz1rxa0fzH+TLmdDR8FcsT9w9EYXil6kC0Wt0kUqqwW6kCet0QXjGgrAHMzWA4kAiQCJ9EwiDD0C8OAFPKgIcOPhn5u7kntrr1ZJH2fSzxZ5b3rjpf4EPwhv/Rt/ZS9ebbdeFjFgu887P3rsJ0L+ywvD5ySWwL4P/SIaxHKeevMpwj+1vcnMpQm/4U7wjJgJP2nsLtnP37tUzVETRQ4JU6dRBC0iQ+At8Eh68EpHqMp/8tUP/jyiyNRjohQNGYZaku/4BJNvFUZhrhn7cIUoBAGsIA4rnMnU3jUQGLqim/gOl0zk4L41ZQ83xZpKjuXMdMMxXAd1CwSXC3t19xe
+zm+vmj9zpNNlW1er2i9TZebKhuFJ2sQgQ96TpYkmI8lwvRnCFCjbiSm34Wl03mBDMR0P65PkzvDtiWxZkWxTn0oTxZtqGGRVXVN7YgaETzHw7CnpGPsR//9pZGWigkS6r4T3lLb7pShhMlECP5OHReWvuBjKRoHlfDqNIQPPgb9tJzoUwYw1kbdIbfTLkX6Ft0RndGm7dzcRkOQgYRb9pbERMYUoSilWQk7e576naZY1m7GsCkmQY9Ku0TZYLocdRovo8FykkeNAQirZLCRlPNGt/B8qHtKfzXBb22EKUxCmMIEphGEKsD+ct/2BIVRge+DB9hB7JdZUjwe+pz0ptXwTnwcJ6tMpwRVRl+DPP2k5N8d6h49joxVBZBuNTwMR/z4kzXmZpShQxB//e00OcEk0e+5tP1vh6YBmithSRtF4kR0oc23IOdB5R/rvg7+Zh7E0ZcBTJiHHD0N/SWOmHE/RfDB56JFH1QDRJEYDv9PIu1s76tPlUzQ7jO3HDRrjQT3Fc8RblzTxch3EL4r7bLQe3S9ILQV36MoYqXL2lxw654wxKZEeljlOkagFb8xEEMJp+dFHGMLxPIQczbIlFzmOhGSsTo5lKBLW8InnKpbmTVpOfeGAcITW/4I0gOrzpfpzquq7sQ4SvQ9unBc6CWQZqfi4cuHlH1UOaC9UEw+WOHjmr79GxdiZNZqOVdRmqKZGQjVXqyxSs/gLeUJRFCaitByrOZt5hkv1oExN8gjDV6zmZFKOpmgYq2m2F6qpCJOuBvlqYKIAEwUniLKH3gDB8EEwW6zMpfSQy/XGw53ok9gJ/35DunJxj+9/Eu4ih9uMn18N8kFv0yPhA0npYRhZolSvmpIXzdPnjvYFA3jaKsJAwwRRtcUEUcb1EIAQmK0HbABsAGzoGxtakiDAif5xogYGmCGY9BkaMfX+BIGZ+EIHz9+S70dvvpM3Yz19e/WU//DqOXlXDOhU2w3PVJIhlg/PpEchqS3CSONYi1bIAglNFrVKAWTBHVmcyfwOcZisqVKcOEzmRD7I6MzUAbFZ26tGfhBE84N8CDzce66HJW8a+V3igzpBukMcSpP5OuJfY7o71rlDUkJB6QOSXaCCqwBRdGzgJ1Mce80kSQM+y5kjbQZ4Mq6/sGADZTHEARswmZy3yaSkaBAKypcNpFR4Kylc8na1Ce1VlDp8HfvFGkdXuOt7ci9GX5cSp9pu3mhawKIuePL4wAud/KOmrkZ/tKwT8tenw8UqOlwmSYTgzkJZutqxWUSURNSTPIwBPQA9AD0cSA+1egQEwQtBtOVFYeekDsCL8uqbdvVhpVl309vrH+a181u5v5KMpId3elFS2RLQiyJKjimz9YALAuDCmczvg/eiMKdKwb0o3TuDzs2LErlPSo6Uj7nE4328JfR6GfQLxU5eHYKTxGyROdqtmSGPSVWQ3N/EKBhLJDROt5zOqyJKqimz9QA6AoAO2EXO2y7yEepqiGIK8Vy1AUzklumDShinqYRBLkyPlchR0XujqFX3jUHx3qRPOV1ZY7JVV4WElMGugjqbadbUdFVJMyayhDTPlhzTxa0ydd22bM2SzZYTiTiAFKElHDSU8fw/nT9QH//JQ7WU6Al5/k8kpbYqRf6Jf7kY05f8KpoFss1RG4pbu2iWGenPyw9f9mzRMbDBrmnBMWxkTNEwyiNHGielikWp3VmDesMNrYQbWjPc6DpWJHOEiYkbQ63pAbghGm4U1S8nOmA+4At9trGYUV2SI0Ixbw4JxYztGL2GYu6ugdWXik5MTkMuM+uTmDLasudWGBkF1wK4FngDhTrZAU7ghROOefJn16bi+Mkf3AwHEYulHuZm6BxYhC7Kpbb8cCsMsMBzv2hyDhraopshfvaUIzWRNrGckM9XfqQpFXv/cjGuqV3dgnNh38ZMnvCHFwohM/m64G7ooTGWUfV6tObeYBfk4hhyhHJvVKNPk87hzvlhKZw6P4SuH6YOtX4YQJBoEFTS4ZI
S5rQIjBx8ANoVnmrsEHdoUEMC09IObCBgl7Pay6fRl4YZSB2XVttAakXFaMtllatHta5iQheLUqFYFPgewPfAh06Xp32Q4v6lGIveReDdzEk98xd/1MgxbSemJGvsEMSTlnvYVnj4nv+sptxD6QF/qnuTKaKZBAwTzxezUWl5DvLmgx3iflpFW1RZqZofsq/GPdlKPe7U8L+7HrfeIr2ctJKEJnQwocZpMCF7VTPKjQ+z9lDrNgC0UqD1sFraLWdbVwfLaUqCt/x0RRnz/Baz+NcP7moNGI+5D9nGC3aZCVGNF2nxyr6NF0LXZBjsouAi6gAYL87beJGf7gGB+zdc4Fn14mPyaPKTlMrAulVrwFBzzzHlndmGDHZE5SkNGcqoULdS3VW4UlRLRgJ0J7ZkJCgjj2XZTOteHVik6mQVpzShoyw1TqMs2VMxY9YBVQDDCADxcRaFdPIHw4i4hhEtgi17Wktj5R3YBMaO7uiRwPYFMNubzKiJr4Y7wTdyla5Ijcso6hSPj4cMzLyneZjVMcev48aYevJ22xjyJm1LL3yWFvsUz9MkdMyLxmnMC/NMynMCwBTAFMAUwNRJnGUcw1Q4G8d1YOX1fE1Gj1cDVcmOUnE3NlqpTLQqghOnvidLKXmetKrnaULxPGkde57Se09QhGh55gDPE3iewPN0oOepTgIAkfv3QuXKelHSad34WZ5kygY3zgudeFFGKj6uXHj5RzWB9mJx4+N7Spbw/+KSLUmFr2hLTdbxrophNjmoJEnT1SouGIZfFw+1GxnY67gVw1VOUSssSXz9HNlypMgp1heKTCZFFDEMvYIi1PphZtcsIvQSbBoswQYsAizCB4vso0vAKHwwymkiZXR+Un729NMUV3gtV9tQyv4XYeNqeskQysXVWKU12tKyHfvF2VwEgf2c2yFZPK8ahpNSmVysDmLKBYvkzv11vWz6i1vQV9CPLnQWVNp63mAOgn7ATwV+qr78VJANBX6qneQiuJ+qVKnVmFSNQ734qYTOkGp75gDbENiGwDYEfqqzswFVlp9JNxDDA9V3RT6QNpERhPivFGX9RHdTqak5ML72eXsgfUG++Ccry9818V6pxHtF6CF1XxV+YTdIWA1BohXvladMdc+kWYYsw9Rsgy/vlSGXk7gNoxpKQ/VfTbpmFEtoRrGAUYBRgFG4YJR9BAvYhS92aYwqMg1VEjrFfRQ8zKOVj8grPMvXkgnLfJEeRsoOshM+0FZEjrZi3PrB/DfpcjahFBxP+9NJbyiimGUUMdUqiqT2tkJN90m3KIJkkVEkbT2gCKAIoAgf5hKaIAF58EEeSs7x215xXMRewOakxXHbjpQRNTgGJcVIegmOkeSxrMpJC7qNhtEn+0XDlPfnLRoGCb3GDqrVV46jYWiTIogVRMEAumedclD4SCpBEAUjbhRMOzWBEbvooCA1gYsWnBQ0+q4InN5lggIDpzXzQAXAgDM4Aw5UBObLWoNGcQkpMiTtR/z/42gkqhan+0p4T2m7H1uWe69Et7/95fCac/Qyc+wqc4U8qVG31pleisvVGElKKUNGOSAmPsHkW93ZQoSuWodErFrHmHJAEcAkAjA8NJPIkEvVtVv3N3Y0HEFbBRgRCr1aJKfGSd/prSZesd7EIyUo9qStFwp7oFgvsA6wDiRB94JsHBPQJ/vGe29HRkH5c4B/JRo/8v98p4aINvgLS/IFKUx2l36lO6d09OqbdvVhpVl309vrH+a181u5v5IUdrWdvYKJ6bHDEd4kR0xDczKcqjCNHP316G+SssykZMghVE2wRhbF46Qp7bmcGNdKAEBhtp43QAHlAZfT4FxOuwQGQLx/N1SHqdYkIPZUqdZK41RrumywzTeDTrXGnVNiFathpWClxXIwjMsmLKtwakwBVgFWGRyr7CNYwC58sYs4qdZ0HWgaQDv4VGtJkUvrJ8lmFUVomdaK3F6qNeMiCosiEKkLKAIowgeKQKq1AORBZYAbD//c3JX
cW3u1Svp4h/b3nl1NCRXZOzRk1DgEJB/ZQe8Qo0VsODJStmL50FPLR2ehsoxBIixbiJg2LM6sB0EbQ2TAXdEOzLl1kNEOLebb0DuXnW+Dn5TDovBWvATlR/XlfDqNNR4P4N+2Ex2KqHxSNgMfV78c6Vd4S3Sel7Z7dxPxQGFRQvJHCwDNRUzIO8M7Kc/6eL5c4V5O2jXKcmHzqs8UhFYkPhnEaX7NsYJfNC9IyniiW/k/VDxkmxVDGKNKWOnnNEsGJAXMCoMzK0D2F+cmhdjbsKZ6Msj6BFJq0Ca+DBIbqFPiLqIuwZ9/0nLui/UO38VGawI4BhNw/PuQNPJlTAV0qMD/XpMDXBLVn3vbz1Z4kqBZGLb0UrRJZAfKHBlyDqDekV794G/mYSxjGUiVCcvxw9Bf0lgsx2k0j0sepuRRNfw0Cd/A7zTy7taOenr5FM0ZY/txg8Z4qE/xzPHWJU28XAfxi+I+G61PZ4tWXlsQGWOkytlfcuwdJVIsc6yUrCat+14MoSHpRHMfd5DkeQg5mmVLLnIcCcl4yDmWoUgYAyaeq1ial2amnxEkCY0QBXUBcOALHOZUcHBj0STIENw4L8iKpvhK4OPKhZd/VFGivUBQPFgaLcHdRiCoRgJBV6ssDrS6yDcbckwm5LQcCDqbeYZLda1MTfJoxFcgqGKUgy+axoF2DgCm0ABgDhQAwEoCVhLuEGcPvQIC4oOAruZYJUPcoUENGExLO7ARoOn6vRzXM5XM8po0ulYV644rmjJ6V1iphhV8QapBqvmQ6vK0D1LcvxTHVWaltIYAvah4+jFThtWmC9CJJcOGyoUMp8cXU4bVlu3CIMMgwyDDRxUWT+d0EOH+RThJICGdMF+T29arkeNkR6m4G1uWW0xW7EuWFaskymr12XhCEeWuyxmoQucQqrU+QRBlEGUQ5T5yCCsSAPLcvzx35bAnrgvU0HWxl08dHedTV9k5juBTz6GIYVQXHqP61M2uWUTonEOV05xDYBFgkcGxyD66BIzCB6N0akJgJxIIYUJQ6xbR6tuEIHQovHqiGw9kG2QbZBtMCMLKc4dVm9XR6ao2q8dVbVZPGqwvWtVmpRx7YOgN4/XLC4K2DilCh+urEK4PkAKQwgek7KNYAC98wYvgZZvVpnkCULZZkdUyixgNCzfrHddtVoXOR1AhHwFYBFiEDxaBus3co0ccq1qDAo+5D5nKr51FakJJkCmLOvWRmJA6SMTU47T1oMegx6DHXCQmgPjyIr5dV2DW2KtM97OWQq5Q4GFrK9TVZfae5uG3pKnk9Xfyeqwn766ech9dPSdv8Kh6yMAk7tEmizbkpw+m7Im1ukPGMikCmb2s7qAJveB22nreiAdK4cI6E7DORD/rTLT9FFQdLFNl5sqG4UnaxCBd6cnSRJORZLjeDGGqlu3Erw3LZRwIa1oyRb74o4bQyjuw0Uw9Es0K5CIUp7WIWWl45E7MSm/A02JWK+YfVWgYUgWEofKNDAQEBAQEdBw6tB3I3hMBdQ9yHBNQUtckFaSaUlZqE/ppWjODa39RpZSVVU156cVjJHTVjLT1vCEDaAF4jAbqMUrndADh/n1G7dWT1PapQsGpCFcKV1GiKHuRYKGLRWhQLAIkGCSYKwmGapK8SXAbz8HsJbLFlGBe4iaFXihag4WiQYJBgrmSYHgK5kaCt+tu2jfeezuKao3yW9/4S//GW3n+/YZ05OJ+E0afhQFuSnR3y798p0a5N/hgS3Iw6TY+DD6KlBxjI6WHkLIDsLW9xaJO9OTH8hrLhWTIvnCgFEKYTJR5GNAQjQaUrnFA6DpQ2oluecABwAHAgcaLPh8hPkAR/VNEDQaQSxjMXcm9tVerpI93aD277tLJEy3UdsPsUNMwu3RHAcPshK7ApEEFJnHw4Ezm6bMP/jsoQAy1fLV6inPr/jQ4jnNL6yVt1vaqIOR15ZgQrRzTh8DDved6WPKmI/JUHB/UCdId4hSrrDRT/GvM0kzr3CE
b54CyizKF3lNYJIlK1cey8WE5n05jaMGT82/biQ5FsGVN8gIjcdQvR/oV3hJ16qXt3t1EgFNYlIL8lfIMEMtugafJFb5Uye+NMkNDHmiY4shDiiUq+kokRRvrmpz7Kx3Rn81w40r3Qau5lkJXe9Kg2pM43ANmkfM2i5QEDzKIz9LUgdilnjg3dfDAAErJRaIYaKzpqoUUxTTwc45RPGBsdKnUXLgIAvs5t1sCX7W/alV+NH+0PXfHL+Lfp385i15odoanqSqRep3EJJ209byRzpnMxABsXQAb7xYe3m0KkTGhZFb4mCvPso/tYI+yUYhdNmoAJoN0tm0FF5JRmzyqHl2jSR7LJSMCGptKnzYEJHS9prT1vCkrSBLYEAZnQ/gIxcc4NB3EnLGmcgtZBk1KI/kIuZAnR52yklYWNKPlwGW9g1o2WhNeUZm84t+HpJEvYxhIwKNUnAj/e00OcEnEfu5tP1vhmYFW5WgLI8W6SNmBsmBMOcdD70ivfvA38zDWLscPQ3/ZfAGuHHPRIkbzYJSeZD5ANFmNC7/TyLtbO+rm5VM0S4ztxw0a43E+xXPFW5e073IdxC+K+2y0lNNyp5K1sZbGul9To2R10Myxpea4KDGC7chL0dG46yU2UGqOExOY1IECk+ch5GiWLbnIcfAzAh5zjmUoEkaCiecqluZNWrbScABMQuNEQXSAJ3jhCarUe67aROvZWRUca31mAylrfQYBnMo9uTA9ppdqJU+EolVk3KAu2TnWupZxoVNK2g7xEkbGZzPNmpquKmnGRJaQ5tmSY7q4Vaau25atWbLZcgkqkPHjZBxklPFYPp0/UB/MiZleSiSFPJcnqlJ5MC96JuJty8V4jaZjFVXWti56GrLNUSOKW7to1+QJf+FCIWwmX3/4wk27LINo48u9W3QMCLFTTjgGoSOMHiclnhpPVNo5+ZXNZR4NIpU4E7UZOeml8IzWsUnoVBs01FQbwCbRsKmk4iXxzGkWWEb4QjpuPS0GeFrA06KiMlhoxthUKmzRk3PFENq5krZ+cHgBzhXR8AKcK1wiBFXdmXke0n/fby3HfPz+5+zt66v7jekH8lRS2Fp/gkQPfKWD52xFT/Imt6Qnebtd0zN6ly7qKXqCiKmX9NVExUO0njDBGgF8KClEQ+5bSYJ9TXnjizOZ6SFlAsow1Nx47CWPziupgj339KqxadJEufx0m1kSrBEwEDmFOPqun82AiM63tpYQdRbYsxxvgMkduDzmLPkrP1iSDq435ZcrjXr4klRrjZKX5WKj+5BNrgB2+gMNS2B/vZfQ9/mDOv/zXvLNb8/e4r3yUlKbrm9xXjWwdbP4AG8ZVes4sijm8TarYLOuCEAIQAgfJwgQIiaEtKFMPTALe1IcNLPs7zeo6Ux2hsaZ+g3yE6AnLX7/83369+rXP2+N38/S17/04HW8W76C9scv8pv/vZ6t/W/OB+3Ln+/tp7ny3IeFpB2w4CgZAlRpX1ViX1Pe5sUzEdizdxbsKj3NmCjFqZ/NmMahenYTt43Kh9umZgpkp1mcl9uGrQJn77ZhjYCBgA1YTMBiwhubigN0Qrht2LMcb6jfA7jMU464ii6R/Ha1Ce0VvldV+Rrr8DpHHfMdyOGu78mNGn1duom/nGeOWgBu4DJpsCR4w1yHUmLFTCf/aKkYRvRXzemI/3p0uBhyyeFiVZcdVXSKw0XR1dbsIoynGWAHYAceThDYQUx2qFWiHvjhBEYbIeGhNf8Je8GuAfhPLu509Z+7f/9Za4G1/Ll8FUzfOwkr5/0njIEoov+Eo4W4QGda8p9wujzZmUjm4P0njIlSHP9JL04gAfwn5Bn7cP8JdZmQ+JAHLBNSM71ZTFIZhG8kXdWDv4VC5LGmWfm/iVGwkUhonG45oTPFGhDlgEEEDCK8gao4dMfzOiHsyY033O/fDFIpaUnvQI29tijH5aWguPdhfptypQtZqzpuaDUq0zJJnZlj0qrjgwAVEUs7nonKnT2
iHaCj6b03aB3tolj1Mamoy8XY1ccWNwXBjzkX7cBy3bydx4soo+onweDx+vmPToqP19yi7CViOSY1oaqP81T100BlVkPNWK3FEBvWYARWA1YDVuu0mvgWADIFrUgQN9aYdFYAisxFtP711+iogNabHQGttclQvQa0zjzDdWlIMTWJQb5HUZ1MOAhcZWSwDUJSwU8Dfhqgom4CV2sEpwdKOEGerpCIcJQZQBXVDAAOm4N4xSqvxtrUYdO9ESAdykMgFjACgBGAJ4eNCjp6oHMgfviUI0GRNrGixCb/SFa6deistbF6QodOf+c6yfw9+FoSkJS/nnYR2f5OPTXQrVandhGxC95zzIZCuYhqF6jlzoFklZeN48eBNKTK/MCOwI59OZC2wMFS5LJk8eNSgtr2p6ht7y+r9YNHrRW2j49+VFl7DTHR5lzL2pdsP1aq1Tn51hBFv7uvaq+hAQk4uKvAXQUMduKq9g01iR9UqZ0CBoQqVBQ4oCSLxl7G5gQlWdoqrbL59M+/3uvvHzfu17e/rm7+5yHry5/xbvnSKozdBCytog1pzZuzkxb2NeVtfjsTlRx8aRXGDChOaZVeTkKA0irnUZpeYy+0M4TyK6kK9FqaPjkZpBesI5KijXVNzv2Vjth9sRVtSEvygFUErCK8oas4vCdw5XrtRF06DEsHe7kbzi0dPGCAUomQNSbFY8SGluRr2/FyEQT2c263hLlqf8iq+Z3Xh+2PX8QtoH9b0tSxvuPE4uFcObHWsWZIC/KcnSayrylvk/iZyPvZm3N6MCLwX59VY1eSH4SBwGyRDNqtz6qgsVwyGUzG5eOcwEgwpLrzYCQAIwFvQCQORYhXkVXjdAGGHnBlTWUVkoIh3eaiNclTot5lPso2xEbbnYyy3hXWqRU5yL/8/u766dPP9//8+J93Za4uvmn6ZyYEnWcaSo/5yRuNx9wTSZGr9gtkjq3EdpYLYTVpGShorJeMHUdEqdSO0oFQmOch5GiWLbnIcSQk48vjWIYiYfiaeK5iaV5aFQckXDBDQO/82TC8szaRo1NIYdz4gyYUqrZX0kxro1lEFHeoP3KQjiuVIqQqqig4LYd0Io+19hScEVg1CAWHHFJQ8F4U/AQRjUIqKG+VyU9fXOSMi4d0XRyks8IgM/3P/8LbuXX1t6q8+fvujfWPrW01TjBiO4+qIMlPfY6icWQeTTXVABqtGeLppdUMj+C72oELfAd8B3zHTY2Q3uqDMCYI4NCWfE2NA2xkEmCTVFg70g9VjMA52A9VuzD3wMAH/FAluEEVuEHm2FQqfNO9E4qxdvwgEAecUIA4g3RCMW58QJcUOtxMLLfcgOV0El2C5pXMPHwFkmpjo0alzBpEAG9LmdWXLZt9vp/dv3z31wfD/edfy378M3j7SZLiyaQWQbouW0bBk6g/63N+Tq/PaikkV8lql+XEGVkUdc4kuw3zA+PyDUSeIVIXInWBsE5c5KypXPVhcGFOiQAul06N+WQvkrEXNz7pzxfT1epntLzfHzuMI9uqJ7kdnd2Oou0qgsll3jQhGH2rQDSCKWYhXeKdX8okG0l9Sd6NibGpsKH83ixuUKrvyDGKG8rvzeIGpXx4pfT7SrmBuQ2Vd4XDy6Xfl3MNjFKwqDalGtvRHkYbtQR6Mt1Mo8rjeIFI8ptVC0368U88GH6mg+FnNhhYuJi3ZiXNb2iaSmxL77wZ6Q6NYmO7mhiGopVNY33hqJ5+OcNRvYqjiq5QcFTtGkfTlgCOAo7ycIKAo2LiKJU5uCFMvWUjsJCESYU4ZnWZms5UmPi2u7oM7t6HzJ6UQxbvaR5+y73+TnQbo1H87uopkfHozXOKLKEdhBdB4D9uMcJbTdMt7gIL9pysGk2pV0Ota1O0YpXjfpQqa8Rl4aruuantTWbU9aqzb8S9vW/13x+vvy4nvy7fXF3//HX59PjgvVF+fIx3y1f/Xf57LdnO88WHH6vPV/rzhWJtvnktwk1X1X9Zg24gnCK
iyO2qYsoYtfyUYj0TpT77QjGHIIBymmu1LwLsum8Y0/igq/82jitSaIV7mpjXEpmufBfZ40/2A1kG/T2ZLhub1l5UjHzL+Puyfx+u70Oa6e0PpoWugZ8xKg/UBOpUJtRVqwztrh3UvA7RnlWGKAafzKpUa+NpUn2IPXO0mskujxVTMQpmoeTdkRWJrKKpCaX58l3UImINJUA1blEN7GzC29nOBFSZp1HS2RdFteTKvKTyyZYnxLLPmXOZjiGzMSkqhU8+eJi7nkT+i6fsIpnUPiK2E+x06wfz36QL2NFPhWinsk1H9yZTRLPpTFRHM4xC4HOvLqd0kdPM5TShuJyUBBbyLqcscasFUw7jmX8QcAA6CP4m8Dd1AgdFvTkxBZzAlikgArQew6RE/kTcai6DmPICDEFMEMR05kFMalqEOyNKNKkSZbp8VjGISW2NKFn3Ih9QeSbiC2AFhrNzfGDYGcRUhQ5+zEwtV+sQkDGvosrk8ofAk5LFzI60OV2urub3b2+R9Pfy6ubHX+/s58dHnyubk6dMdc+kUYhlmJpt9MkEaVxdxgSmWWWCVP4LVqZ0LLfABLWXkA8gANEQXjQAhgZqZWLIzYmJgDHJDRgHapT/kKBmxBT93UHNRy+SSXM37Q4hLsRS04Kh45PfN8b46sr407h8Kd1uLl+qy8nmDj1cz+Pd8jHG7xRb+c+8c//8n2b//PzzP8X5QqKtWuOMY1fxRKi8XISGSrpfs9xl9VjpaMuOVHaU1awIOmo7vAYNiG1EFMZdEZ2MewsioXkzY51jJHTt0wDfkdAMsYFI6D1jmNUx/sZHz56Snq4sYroz2LgayjxfRRHMcoZc1FhmKq0hfKkC3BSJvipqzSDWmbzG1aqoWRJZzkt1yvhl1CKRtbx6amnpVGU80a38Hyoe8gTBy/qA6AosR2A5AstRJ5YjusC+KMkkV94lnU8s7I+oGtasLrPVzccPpNL5p8jdREkDW/sBObqBL2Z9SBOVlB79IKrzlcej2qfJBg6suhCbHDbV8kjXriWjVDNHS2s65lxLFqVkDirlOx3hWGI8qAMeAB7wcIKAB2LiQUEjSorQAxGcwCQpIA4o48j354d+dJP+d+9tiELbS6KRK2ezrldqLbFpvPijkTHDYAr2budTUdIPcEXhkRA8f8u/yRXfIW+31Xeid2n5nf0dVnXVcwqOrLQSUFrup1kloLI3bXSQ6+vz2xtn9cn1fr8ztUfl6dXHP+W3H0cV1xfjlmmTg+Rx5ng61NhyQuOJwQ8dMec7+p3dzcS7y6zNGG/gDgJ3EABlt0DJnsl4ox8ufcvi+MiOs+i8XJBspT3r78RByNJ6vibD1yviYO3cL7bxRtflStxNmmueM9+kKel5802L2ecMYeUBUMB8A+abwamtOKjYxHyTCkIPFpsTPDUIaLHRis91G3wnbbyqyabqktkGeHvTxFdXF66yp2nH7Nu005WJJrUYZUah2Ehj7jIZVe009LrPJtvcc5hxp2FcM+PuarkKoGyl66JnMTDyceae5NjaRC0R2KGRz6ZSzvKaWMUj1UQ+XwSB/ZzbLQmkqm+yXmmyxWwZMo7+gsH+QvrkcsQXUPnZJ+6W/ox0Jj8MzJQ46mTOmqrbl2AAvPMFvLO3BXJpK2rV8Mye4HjjYC4dAQLY7rIA8sKay7QKTxWuT/I3saLW0bxz7955pQrZtYkGtRhPLW9Uwmj87zU5QF3ZoxKnq6apKEaF05OdCyWB0kj2d6TvPvibeRhrreOHob9sXpIo/9Cwo8iSHekZfjObP5F2xEHyXvDqIS6ZFKcjUisxJT3en3XSKOGsJie1CHKmyTRmPm+aTLe1YJpkpLJwj2UDUvhd9OZ5CDmaZUsuchwJyViUHctQJKxjE89VLM1LTdz80NvZG2UbrpqrVcXhxCa8E+SzCWjCa8dj9+n2Ppz6j+RKR5KE//s5sN07VkC20lJE9vWvz8rz7K395eH209ff61fO+vvfCyY+COH
UUyu2E71a7MeiuPT09lx6tV3Lh26CS0949QCLz/lafJinUa8YZX3oARUY896AUSEkNcB/Tler8fq5Rp8bL381YQr07uWviN9tkzww75VtXs0up9QUptoAioUI+8ICXTcqsT4pBeyI9el+cfPJgNhARGEBYBIemIbABlWh4SNjewIIEF8ZUub3eAawgAEONQ2U1gZAcrNo3+4JwAICAAIAAgACaIUA8jrDBwJYg0eAGtU/oAawsZ2rD4vVLQTI5kCBEqBLDeSlFASmlJGjheYenjBNDQCuCRmOO3HfmNvp38GXV0++5V5Yt7qhPf6euO+/xLvlY24ZFa57qCW8b+yqWo77RMU4pgZf0ApfGBUCRUctB4GmMRDARZxy0a5gMcZNJUjwnjhkcPaRogeAh9FyYMupgixPsI6CkEGWjcMtFFq4hUxC9d9ufm7XzcLHwB1Qm3s1apY83dByE9eKO8JyUzXANDfe7FkMmJVgVWu/Obz4b3qftpy2hJBSxBcpeXtsRWBULPsnmeNSPb/uE2TSyoPARpyyEdiMwGbEBxkyT4MhitzYj9K5jjeM45mAVBoBLez7Fb4wQWQePDaCtPbhrp0IUjtwExqRKXBSCyGnji9FKKsisyPCNK0t04IbifFcPQgeAOmDCFOIMO2EBwoS0QMBnMBkKKT8t+Y+UpkSLYL7qFiBhQ/3kfbxejP//t3/bazfSX8t3wXvHv1/RxX30Y/XX5eTX5dvrq5//rp8enzw3ig/PrZIKUcaNox0weuMbpQSM7RURMUo133Z5YiqfkGelO7FLh1R6Q04BLgSUZl3GdQZtyc4osAR1b0jSuUTYXbdNwy9AkfU4Y6oxOS2WpXdUF24nfKoAG4n2l3ZsttJNa0SRrWzEKVW8jopRvEIJ3A6acBBXHMQWN6Et7ydCQWynU51AsiPy0njE9h4Zp3uXU61j3Hn7HLSywVne3E5MZ6gB0EDIHzgcgKX01m6nE5gHBRS/ltzOSGmRHur6UUQ+I/EoLDAAjl3i6q822F0SF7RwWtJ1pgn9nc0FRcq2Ol3G3Xpl2KU7mkFZfJFSxnF1441kJRASS8XdWu6mEDZ2aTLzdxgrdtb0IAIS0R5Br8TP4wBfqf6CYQ3jtl135yglNwA/U46ifXOlRBM1hQmK5R04nnSmdw3aM8TahGuskW4ZcUwCuDSUr7TpHjQklHpBG4nHTCIawwC65vw1rczgUDmaTD0jx/Hk84nsZ0QdpBDrtNnUswomo5kMhUsvOw6tWKTMphssp9NSqXZpLKVJ7cmpe8FixLdvlSsldeetWnU2HyUtwsxnl9bDprREKIscNOdrUi3ytUAy6HOTY1F1eUaTbWcD15jL+pjZcUUEXmAqTMRnbO3PAD07uuaY997vIk781R2yXH72MSlVZVr61BrWHTsQuBiZYdRMQ8qDu7DUaiupGDbKWNIrvmhWi6rfEHRSzdolyljHC1gDfINtQsBoEVz3Qm6QDTULuzCdWcQBn+F2WXTka/u2KWlzthXZ7ZIW7ksMcVEBT5J3GqH8tgJLUiwYBXfaAPuOHDH8QF2zNPIaxo//jdYripbfDpli9x65Dn6cLboUVms+j2ZvPAXLrDiz2w3Xbu8gifOvXvnlQBl+e+1ZDvPFx9+rD5f6c8XirX55jHpxL8P8UzpvYwlfkR1neF/r8kBLomCz73tZ8kaVSXbjmqaimJUrEHJzlN7c5v56FKseUd68IO/mYexUDl+GPpLCveExNpVxaMcXOVpJ3MuJucbreEVzcv4zWz+RNoR05cXRGvCblJT2q0ddfPyKZrfxvbjRpXHSY/3l85mlg1GmlxNZtMT+Sg4/lA9C+2ZzFY7xPiAmjOZ/DvHEM9DyNEsW3KR40hIxrrqWIYiYfqYeK5iaV66nNsZsZnQ5JJTkjqNODEFMKaCQSMA3aMUJes3MCeY7JSvJOk/L3lHPbh3kXWtlWq46Gl8SE6mqAEqnS8XmdoNeJAqeNQUfsIGE8pA867
ZswuoX5OKJDV9yM58abEoSemxtUGAaKPnPvbQaEVnk4FlVEWVVsikc00dUooJaCpoKmjqKTUVkjpSI/F0/kD1ghMPs5TYRIkbfOHNwqobPD3KZm2v0m24/fYNCUaVHaIr7v0mxOIQ5CzV+b1zm6OmtOI9N9nZJLu95yQkcpPIes42HHUCw3FedbrneSAxVlMt2MU0E16YIF1kx6pAgaL18qTNUa4ESAp4us8RlYZABbUS1YPBmz3TDZpQjrR5s3MnRLR5IwtVH88tim+2eyXmKJz+kNuL02BegAqAisFBxZfV3cp/XH3AkswnS/Qves1N3YbFVL3uTN2z2aTG1J19csxjbZtB33yZug2LHymFqVb4qRYQAkzdlNmFN9I9qaZ2ZerGnYr7AXdAZEuIZv2SaWEbYIe3vrfvTmwH387srdrBd6eQ8WAJ7yBPTKU8f5u9WMLlAUGDiIoDJCU8SQ0BGg5TMH7M5C3nE3BON6QLfT/M7XpNbps4cl979X8=7V1bd5s6Fv41ecQLCYnLYy5Np3PSc7KaXqbzcpYAYdNg8IAcx/31I3GxwWDsJDgHJWq62iAJ3ZD0fdrae+vMuJw/fkzJYvY58Wl0BnX/8cy4OoPQMRH/VwSsiwDk4CJgmoZ+EQS2AXfhb1oG6mXoMvRp1kjIkiRi4aIZ6CVxTD3WCCNpmqyayYIkapa6IFPaCrjzSNQO/RH6bFaGAl3fRvyLhtNZWbSNy4g5qRKXAdmM+MmqFmR8ODMu0yRhxW/zx0saib6r+iX8L8r0rx74w/3817cvN/Of//t9pZWZxfSRieSf/O8kWpZ1dTa5fqFZskw9ekUzLw0XLEn5O2kZWCT++8w47ysho2lIovA3YWESaw80zfj/xasPZRJS9m3aUVpZkTs6JzELvSvCyGUSMxLGND0m9+Jtlobx9CZkNCVR8ZEZjdnh7lmkyYKmrBxzM8bEaDk/g9f8L88jiZLpepJRb5mGbD0hc/I7iSc+feDRQbKM/bxa/MEPyTQlc+0hzJab6vJwAiHGDjY1z/ADDXnQ01xkO/xRtwLTJq7nWHlNrouGfPryqfUBnlQrPgfCaayFcbbgo1x08LWXzBdJzLsj4w82Irbu4kDDJoIaIsDSHGxjjbqB75o4QC71hu2abJ0xOtfmYsLzj8RDdIQNBzuWhqBtaCgwsWbrjq4RalqB4xvUQU69U/gv3cOjiu0YxVVUOVmeMHE2Ezk9YgS9eBIBmScRUJPo3U2ixP0l4BPqEXE5guftOoNmxMfJhct/mYpfzn/c8RSXUbL0qziezSa6HPN8dHlh9HW9KL8RWWWaV7wDD83SGoZtpnXG1hUap6LzqMgA8PJWMz6o7xYkHwwrTj942IzNozKaDzo/5J/2kn+ONH/fCPI/PI6Pw6mYKh6PF511MUvS8LfovOptPp14n5LovEzJEpF9xksTs4kGYgLpm0aL1PRxhzIcWGnKdKkffFuIutImMyFZRpnoxe/1iW3U+ucjTeaUpWueuCzarGhTOTKhVT6vttwFVWlmNdqC0M5yNN3kvR1rL1kQiw8r64pY1X5sS2JvS+5uf/7pp+fAQtc/zoFH0l/fBl6+9jVDx56HDeBpPh9tGnI8gy9iPm+Lb1iW5WEEzUCt7AdW9pN3CgowcBwQaIHhmbxTEO8UV9c1HRjY1InrwADt65RNHVar1WRlTJJUNJFn54iWikZAqPGVTcvWfHY+ajEv0GACEoZsYaPHp2L/qWUsXXpsmVJRDxdRSjxb0xHFGqL8g9umI4YhNgPXQZZruSdtYO/0bIDpUyoBdbH5vNY5TlznNfBmfDnkxRfQfQJyIL5EwQ96uEIn9k8pLy70tM3AO4IDoCMpgH6YAlQwH+V4PTKsNppYbRhtrDZMp43V0DwxViOZoXrvkqWQWiG1QmqF1J0NHBFSzkgclzDej5PGdpnsAkrqT+ld+RjzocTxr4mdScpmyTSJSXSTiM1tjpi/KGPrUgxPlixp4ikfj+n
6P+X7+cNP8TDBulkFXD3Wo6/W9adb3kKOVKJba3tnUdFnoXF9NHamK48aGEmntC8/c0B0T2nEx/kD3cnhNEhtSC2sN0Yqrb+ISHz/J8eDT1fHfwo8bFvaa6QPAk83TaoZtimQkeqabehIMz0aIA4AOimPnl7SCvPUrTCB5VI+5TUv4NCOiA40YmFfswH1DQ5sEBtQ8aandysG2DKhrlnI5d1q6ZRXxfE113EDbLqm56LTAqgihicUzh/BKr5lNN3DLJabqF46gXvZRLJkvBPo5eaYXeB3kDSk6/znWmTQkryX9CMIo6grOR/IyT3dSeyTbLahKpUw/kY0+DbJQlYgkJswlsz3SuvbUv4tlRF1L3kOgNVz2V5RJMnHIX8IwkdRj4tFEopcPjzkA7LIJJuRvJfnj/mUnJBVhiZ5f49M2ADMprABATCBuCVvsOy2uKEKOxWHwTJTmIFRX0kbXlnaALBtABdpAAPEK8QRk+gW0RwHAw/7lgsde2yg8ub34r2Dd4tzb1Fc7gqFNZr+CGOhKnYYs81ezC5wtVJbK/Gq0EHTdxG2Dyp3UW6eePfLxcSr+iebNOttXKQlsuotdOf9buZ/mjGwiNJ128vP52tRRhHlIfHDo+a8zK/0sRgPl6IwMcozvuXmn2db7Kjg19B3ZP02amEvsDrO5YEBJ/i0+GvKjL8D71cV/ippv5L2v3mG8YpAfyv0y6FeSax3AV+oj1eA04v01uBIX56JPwnnSezNRF80ds9WWxKwgfjno/CQMvnqxL0JwgC2MbhLNw6apwVgS2YAthQAKwBWAKwAeLgtfhMypd7mf7+75ON8D/Y/ZN4mshf77XFhf31zLyUPADuad2Zb8Q7if4AI2DITgYHlpIoIKCKgiMD7JgJ19JSaBuwhAE9X+eu3jqOxfy5M3oUYP+LAGHpHau8d0t17DJl4DUwwROVz+aZZPW9fFQ/1N19b6a8E6YNKf2hAYlFjDl3EoQo7WjewLOFWqBxseQvaoS32DhsouqZ8aTvQW/k4O/kYO/kUXdfKJx/3m0a/mO1IbRA4UnvAcer8vYrm4sB2H0pzUWkuvmNq/Yrk6FYsylk+NaD+4UIYRd7x7HIhUPeZyeYF3oUZH0y1xP0uQPoNJpqCFNgvHwmiZMV5W8omvNeJSzK6xxqx9BjwtbCwcMammWDhjYJBRQdQWxzC9zv7Wc2pxCHVWYycDKGq/dgogoKlt7tqK4HI2xaIHIJKqYUkF4UO4R7Qd+ux/SgPelGe5aqDDWjvMj+o2yqUQS0VyV2jg3no+9E+HtC0vRwVBzDspkAAd3giQF3nIacmAHJ7URupGzVFAI7DOkAxMj1qax7weYVMYGqOY0PN0aEVcNoMAR1YcjICAiA1PDYQRGosPM+/iYB3ju2fyX2+UP2ZMOomyT3/9VOcMRJ7Qk1iHk2YMZlTP1zO92Bn8YW1jOc1F3lpcZmTGPx5Plo7l36MrQPKMRg7Krwz7eaOFzsdCgBd2vi7gvLBAc+QGvAMBXgyA57a8aod79gg/UVAKDUF2IPlT9cZAP3+9A4qDZSH/6WaQHX033/sv1E02OoW/KzHjclJUAVbBxUGKomuRBoD5o7JgwWeqTJg7TgW3mT8ujoDQGrXhGCkvgmfdd4+NNv7h9QGhj6mUXoDSm/gHfNxGQkS7BdmnM5p4pFql/rEcWBD7RI4xujULsGxzhbhkBKdV/W2CKUWz8CRimeeB9tvw90iHPiMSLEPxT4U+5CLfRwQz5zAi/NoqACwpaUCUgsi4JsSRLwNKgAGNqBVVEBRAUUF5KIC/S6YX/NCB+tom9CGYMLCvXKJ8RAPR1riIbW7ZDhSf8nvmngMbMuqiIciHop4yEU8+v1Ij5N4TPTc49SWfDjYkvhUZEjlktdlJFI7kK5qrxjJeBgJVDoZipEoRiIhIxH+urQ7mj7steMULi+1rJaiX3e1XyLypLugd26f8gm1A6/zfgrP5p9zdEYrVlMDE3UYaQLYoUo
K0KmtVqQWSQyNmcpqRVmtKKuV92210oJBZYmSL7X9l1jUxQxbicKHbagEmg/HOp80ZN3tA6lvqwAjva7iebv9t+G48eR2JGqzrzb774ZaviovMMlcbLJjN1vk4Kp3BP17uVizfFHvlQv8KpI9QTTQf2bxJNHAzrUYo973G7jz5ubOrf+ukebgZEBq0f/Q+Kl2/mrnr3b+73vn3412Um///6DpZnvfgu37WmQvXPfbWCZLxicU5c2I87ps7paqieX5z7XI4EIAa0i3cR0eG3lyAMwrfLHPvWPjVqzKj+ONaPRtkoWsABg3YSyZdzh6ZELc0Htldv1eLFg9l+3NL+LKpy9/CMJHUY+LhfCcQNMPD/k87vY5TVYZmmy8eP1defEaG2OxsbPjUtow0cSELc4CnTZlMYwJtE7LWqS24hypEafiLIqzKM4iJ2epA7zUTOUrTedhTPZxFdaI7hcuHLhyswRm3lsezbLDAgaXePfTXCTxV8F0ugUPuyoKmNo+6lJRsKFrvPBezVMAf7lubFQUOvxqdrnVPLkfaakv1hzaMFBBvoJ8BfnvG/KbSCk16H9ZxnE+yPXzxWIP8qdFGo1UKfrBv/+WTQX+R4E/hh36idWl7q+K/lJfNDm0dZ5Cf4X+Cv3fN/p3QKbUFOCKusvpdK9qgd+I7jeEPHBDpIL+o6AfjwT6q/zlhP6hzeAU9CvoV9D/vqG/CZVSo/5d8QGhLqwt9kB/UZbmbVL0o3//zZEV+ntrPrn8/MT/EPwXSog37vvmA8iwJm2NRbvjJODUpopVqyQlBOpGSUUIFCFQhGBAQtCBotKwAlFukrBa0o+i9z+L0c4D/w8= \ No newline at end of file diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/homogeneous-vs-heterogeneous-results-table.png b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/homogeneous-vs-heterogeneous-results-table.png new file mode 100644 index 0000000000..1a9a992130 Binary files /dev/null and b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/homogeneous-vs-heterogeneous-results-table.png differ diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics Heterogeneous cpu and gpu usage.png b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics Heterogeneous cpu and gpu usage.png new file mode 100644 index 0000000000..b40e494d9b Binary files /dev/null and b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics Heterogeneous cpu and gpu usage.png differ diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics Heterogeneous cpu and gpu usage.png.png b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics Heterogeneous cpu and gpu usage.png.png new file mode 
100644
index 0000000000..33a8139f4c
Binary files /dev/null and b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics Heterogeneous cpu and gpu usage.png.png differ
diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics homogeneous cpu and gpu usage.png b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics homogeneous cpu and gpu usage.png
new file mode 100644
index 0000000000..0c3d3f3ebf
Binary files /dev/null and b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/metrics homogeneous cpu and gpu usage.png differ
diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/images/tf.data.service-diagram.png b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/tf.data.service-diagram.png
new file mode 100644
index 0000000000..f08f27335d
Binary files /dev/null and b/training/heterogeneous-clusters/tf.data.service.sagemaker/images/tf.data.service-diagram.png differ
diff --git a/training/heterogeneous-clusters/tf.data.service.sagemaker/start_job_utils.py b/training/heterogeneous-clusters/tf.data.service.sagemaker/start_job_utils.py
new file mode 100644
index 0000000000..2fd7e1d02e
--- /dev/null
+++ b/training/heterogeneous-clusters/tf.data.service.sagemaker/start_job_utils.py
@@ -0,0 +1,37 @@
+import time
+from sagemaker.estimator import EstimatorBase
+
+def fit_with_retries(retries: int, estimator: EstimatorBase, *args, **kwargs):
+    """Run estimator.fit() with retries on temporary issues such as a capacity error or an exceeded resource limit.
+    Example invocation: fit_with_retries(5, estimator, job_name="my-job-name")
+
+    Args:
+        retries (int): Maximum number of attempts when a retryable exception is raised
+        estimator (EstimatorBase): The estimator whose fit(...) is called
+        *args: Positional arguments to pass to fit()
+        **kwargs: Keyword arguments to pass to fit()
+    Returns:
+        None
+    """
+    orig_job_name = kwargs['job_name'] if 'job_name' in kwargs and kwargs['job_name'] else None
+    for i in range(1, retries+1):
+        try:
+            # Ensure job_name is unique between retries (if specified)
+            if orig_job_name:
+                kwargs['job_name'] = orig_job_name + f'-{i}'
+            estimator.fit(*args, **kwargs)
+            break
+        except Exception as e:
+            if not ('CapacityError' in str(e) or 'ResourceLimitExceeded' in str(e)):
+                raise e
+            print(f'Caught error: {type(e).__name__}: {e}')
+            if i == retries:
+                print(f'Giving up after {retries} failed attempts.')
+                raise e
+            else:
+                if 'ResourceLimitExceeded' in str(e):
+                    seconds = 10
+                    print(f'ResourceLimitExceeded: Sleeping {seconds}s before retrying.')
+                    time.sleep(seconds)
+                print(f'Retrying attempt: {i+1}/{retries}')
+                continue
\ No newline at end of file
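The retry behaviour of `fit_with_retries` in `start_job_utils.py` can be checked locally without AWS access. The block below is a minimal sketch, not the repository's code: `FlakyEstimator` is a hypothetical stand-in for a SageMaker estimator (it is not part of the SageMaker SDK), and the helper is a condensed copy of the pattern in the diff. It illustrates that each attempt receives a distinct `job_name` suffix and that only errors whose message contains `CapacityError` or `ResourceLimitExceeded` are retried.

```python
import time


class FlakyEstimator:
    """Hypothetical stand-in for a SageMaker estimator; not part of the SDK."""

    def __init__(self, failures, message="CapacityError: simulated capacity shortage"):
        self.failures = failures  # number of fit() calls that should fail
        self.message = message    # error text raised on each failing call
        self.job_names = []       # job names seen across attempts

    def fit(self, job_name=None):
        self.job_names.append(job_name)
        if len(self.job_names) <= self.failures:
            raise RuntimeError(self.message)


def fit_with_retries(retries, estimator, *args, **kwargs):
    """Condensed copy of the retry pattern from the diff, for illustration only."""
    orig_job_name = kwargs.get("job_name")
    for i in range(1, retries + 1):
        try:
            # Suffix the job name so every attempt uses a unique name.
            if orig_job_name:
                kwargs["job_name"] = f"{orig_job_name}-{i}"
            estimator.fit(*args, **kwargs)
            return
        except Exception as e:
            retryable = "CapacityError" in str(e) or "ResourceLimitExceeded" in str(e)
            # Re-raise errors that are not retryable, or when attempts are exhausted.
            if not retryable or i == retries:
                raise
            # As in the diff, back off only when a resource limit was hit.
            if "ResourceLimitExceeded" in str(e):
                time.sleep(10)


estimator = FlakyEstimator(failures=2)
fit_with_retries(5, estimator, job_name="hetero-train")
print(estimator.job_names)
```

With two simulated capacity failures, the third attempt succeeds, so the stub records three job names ending in `-1`, `-2`, and `-3`. An error with any other message propagates on the first attempt.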