diff --git a/sagemaker-triton/TensorFlow/Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb b/sagemaker-triton/TensorFlow/Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb new file mode 100644 index 0000000000..18d3922922 --- /dev/null +++ b/sagemaker-triton/TensorFlow/Deploy-TensorFlow-Model-Using-NVIDIA-Triton.ipynb @@ -0,0 +1,461 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6d6ee543", + "metadata": {}, + "source": [ + "# Deploy a TensorFlow Model using NVIDIA Triton on SageMaker" + ] + }, + { + "cell_type": "markdown", + "id": "7634d547", + "metadata": {}, + "source": [ + "Amazon SageMaker is a fully managed service for data science and machine learning workflows. It helps data scientists and developers to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML.\n", + "\n", + "Now, NVIDIA Triton Inference Server can be used to serve models for inference in Amazon SageMaker. Thanks to the new NVIDIA Triton container image, you can easily serve ML models and benefit from the performance optimizations, dynamic batching, and multi-framework support provided by NVIDIA Triton. Triton helps maximize the utilization of GPU and CPU, further lowering the cost of inference.\n", + "\n", + "This example will showcase how to deploy a pre-trained TensorFlow model using NVIDIA Triton on SageMaker.\n", + "\n", + "The model used here was pre-trained on the MNIST dataset. See this [Deploy a Trained TensorFlow V2 Model example](https://github.com/aws/amazon-sagemaker-examples/blob/1c5da8941bc933b176b56a93157073d5645d8cdf/frameworks/tensorflow/get_started_mnist_deploy.ipynb) for the training of the model. \n", + "\n", + "## Contents\n", + "1. [Introduction to NVIDIA Triton Server](#Introduction-to-NVIDIA-Triton-Server)\n", + "1. [Set up the environment](#Set-up-the-environment)\n", + "1. [Transform TensorFlow Model structure](#Transform-TensorFlow-Model-structure)\n", + " 1. [Inspect the model using the saved_model_cli](#Inspect-the-model-using-the-saved_model_cli)\n", + " 1. [Create the config.pbtxt](#Create-the-config.pbtxt)\n", + " 1. [Create the tar ball in the required Triton structure](#Create-the-tar-ball-in-the-required-Triton-structure)\n", + "1. [Deploy model to SageMaker Endpoint](#Deploy-model-to-SageMaker-Endpoint)\n", + "1. [Clean up](#Clean-up)" + ] + }, + { + "cell_type": "markdown", + "id": "51ab88ee", + "metadata": {}, + "source": [ + "## Introduction to NVIDIA Triton Server\n", + "\n", + "[NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/) was developed specifically to enable scalable, cost-effective, and easy deployment of models in production. NVIDIA Triton Inference Server is open-source inference serving software that simplifies the inference serving process and provides high inference performance.\n", + "\n", + "Some key features of Triton are:\n", + "* **Support for Multiple frameworks**: Triton can be used to deploy models from all major frameworks. Triton supports TensorFlow GraphDef, TensorFlow SavedModel, ONNX, PyTorch TorchScript, TensorRT, RAPIDS FIL for tree based models, and OpenVINO model formats. \n", + "* **Model pipelines**: Triton model ensemble represents a pipeline of one or more models or pre/post processing logic and the connection of input and output tensors between them. A single inference request to an ensemble will trigger the execution of the entire pipeline.\n", + "* **Concurrent model execution**: Multiple models (or multiple instances of the same model) can run simultaneously on the same GPU or on multiple GPUs for different model management needs.\n", + "* **Dynamic batching**: For models that support batching, Triton has multiple built-in scheduling and batching algorithms that combine individual inference requests together to improve inference throughput. These scheduling and batching decisions are transparent to the client requesting inference.\n", + "* **Diverse CPUs and GPUs**: The models can be executed on CPUs or GPUs for maximum flexibility and to support heterogeneous computing requirements.\n", + "\n", + "**Note**: This initial release of NVIDIA Triton on SageMaker will only support a single model. Future releases will have multi-model support. A minimal `config.pbtxt` configuration file is **required** in the model artifacts. This release doesn't support inferring the model config automatically.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "5cf5f5fc", + "metadata": {}, + "source": [ + "## Set up the environment\n", + "\n", + "Download the pre-trained TensorFlow model from a public S3 bucket.\n", + "Also define the IAM role that will give SageMaker access to the model artifacts and the NVIDIA Triton ECR image.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "13469557", + "metadata": {}, + "outputs": [], + "source": [ + "%%time\n", + "import boto3\n", + "\n", + "# use the region-specific saved model object\n", + "region = boto3.Session().region_name\n", + "saved_model = \"s3://sagemaker-sample-files/datasets/image/MNIST/model/tensorflow-training-2020-11-20-23-57-13-077/model.tar.gz\"\n", + "!aws s3 cp $saved_model models/SavedModel/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "76af8c28", + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker\n", + "\n", + "sm_session = sagemaker.Session()\n", + "role = sagemaker.get_execution_role()\n", + "bucket_name = sm_session.default_bucket()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "31b31768", + "metadata": {}, + "outputs": [], + "source": [ + "account_id_map = {\n", + " \"us-east-1\": \"785573368785\",\n", + " \"us-east-2\": \"007439368137\",\n", + " \"us-west-1\": \"710691900526\",\n", + " \"us-west-2\": \"301217895009\",\n", + " \"eu-west-1\": \"802834080501\",\n", + " \"eu-west-2\": \"205493899709\",\n", + " \"eu-west-3\": \"254080097072\",\n", + " \"eu-north-1\": \"601324751636\",\n", + " \"eu-south-1\": \"966458181534\",\n", + " \"eu-central-1\": \"746233611703\",\n", + " \"ap-east-1\": \"110948597952\",\n", + " \"ap-south-1\": \"763008648453\",\n", + " \"ap-northeast-1\": \"941853720454\",\n", + " \"ap-northeast-2\": \"151534178276\",\n", + " \"ap-southeast-1\": \"324986816169\",\n", + " \"ap-southeast-2\": \"355873309152\",\n", + " \"cn-northwest-1\": \"474822919863\",\n", + " \"cn-north-1\": \"472730292857\",\n", + " \"sa-east-1\": \"756306329178\",\n", + " \"ca-central-1\": \"464438896020\",\n", + " \"me-south-1\": \"836785723513\",\n", + " \"af-south-1\": \"774647643957\",\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd92a880", + "metadata": {}, + "outputs": [], + "source": [ + "if region not in account_id_map.keys():\n", + " raise (\"UNSUPPORTED REGION\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0cc3ddf8", + "metadata": {}, + "outputs": [], + "source": [ + "base = \"amazonaws.com.cn\" if region.startswith(\"cn-\") else \"amazonaws.com\"\n", + "triton_image_uri = \"{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:21.08-py3\".format(\n", + " account_id=account_id_map[region], region=region, base=base\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "39414be7", + "metadata": {}, + "outputs": [], + "source": [ + "!tar -xf models/SavedModel/model.tar.gz -C models/SavedModel/" + ] + }, + { + "cell_type": "markdown", + "id": "a1e5faca", + "metadata": {}, + "source": [ + "## Transform TensorFlow Model structure\n", + "\n", + "\n", + "The model that we want to deploy currently has the following structure:\n", + "\n", + "```\n", + "00000000\n", + " ├── saved_model.pb\n", + " ├── assets/\n", + " └── variables/\n", + " ├── variables.data-00000-of-00001\n", + " └── variables.index\n", + "```\n", + "For Triton, the model needs to have the following structure:\n", + "```\n", + "\n", + "├── config.pbtxt\n", + "└── 1/\n", + " └── model.savedmodel\n", + " ├── saved_model.pb\n", + " ├── assets/\n", + " └── variables/\n", + " ├── variables.data-00000-of-00001\n", + " └── variables.index\n", + " \n", + "\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "392b33db", + "metadata": {}, + "outputs": [], + "source": [ + "! mkdir -p models/TritonModel/MNIST/1\n", + "! cp models/SavedModel/00000000 --recursive ./models/TritonModel/MNIST/1/model.savedmodel/" + ] + }, + { + "cell_type": "markdown", + "id": "4c21b8be", + "metadata": {}, + "source": [ + "### Inspect the model using the `saved_model_cli`\n", + "\n", + "In order to create the `config.pbtxt` we need to confirm the model inputs and outputs (Signature).\n", + "We use the `saved_model_cli` to inspect the model and take note of the input and output shape." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "42b58467", + "metadata": {}, + "outputs": [], + "source": [ + "!saved_model_cli show --all --dir {\"models/SavedModel/00000000\"}" + ] + }, + { + "cell_type": "markdown", + "id": "1332701d", + "metadata": {}, + "source": [ + "### Create the config.pbtxt \n", + "\n", + "Triton requires a [Model Configuration file](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) known as a `config.pbtxt`. We create one below in the correct directory.\n", + "\n", + "The `name` in the `config.pbtxt` must match the name of our model directory. In this case we will use `MNIST`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0f843f41", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile models/TritonModel/MNIST/config.pbtxt\n", + "name: \"MNIST\"\n", + "platform: \"tensorflow_savedmodel\"\n", + "max_batch_size: 0\n", + "\n", + "instance_group {\n", + " count: 1\n", + " kind: KIND_GPU\n", + "}\n", + "\n", + "dynamic_batching {\n", + "\n", + "}\n", + "\n", + "input [\n", + " {\n", + " name: \"input_1\"\n", + " data_type: TYPE_FP32\n", + " dims: [-1, 28, 28, 1]\n", + " }\n", + "]\n", + "output [\n", + " {\n", + " name: \"output_1\"\n", + " data_type: TYPE_FP32\n", + " dims: [-1, 10]\n", + " }\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "af89bd5e", + "metadata": {}, + "outputs": [], + "source": [ + "model_location = f\"s3://{bucket_name}/TritonModel/TritonModel.tar.gz\"" + ] + }, + { + "cell_type": "markdown", + "id": "311b6185", + "metadata": {}, + "source": [ + "### Create the tar ball in the required Triton structure" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e69c75c4", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "cd models/TritonModel/ \n", + "tar -czvf TritonModel.tar.gz MNIST/" + ] + }, + { + "cell_type": "markdown", + "id": "32dfbd14", + "metadata": {}, + "source": [ + "### Upload the new tar ball containing the Triton model structure to s3" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1bec7fb4", + "metadata": {}, + "outputs": [], + "source": [ + "!aws s3 cp models/TritonModel/TritonModel.tar.gz $model_location" + ] + }, + { + "cell_type": "markdown", + "id": "bedf430a", + "metadata": {}, + "source": [ + "## Deploy model to SageMaker Endpoint\n", + "We start off by creating a sagemaker model from the model files we uploaded to s3 in the previous step.\n", + "\n", + "In this step we also provide an additional Environment Variable i.e. `SAGEMAKER_TRITON_DEFAULT_MODEL_NAME` which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to s3. This variable is optional in case of a single model. In case of ensemble models, this key has to be specified for Triton to startup in SageMaker.\n", + "\n", + "Additionally, customers can set `SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT` and `SAGEMAKER_TRITON_THREAD_COUNT` for optimizing the thread counts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9097c998", + "metadata": {}, + "outputs": [], + "source": [ + "from sagemaker.model import Model\n", + "\n", + "tensorflow_model = Model(\n", + " model_data=model_location,\n", + " role=role,\n", + " env={\"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME\": \"MNIST\"},\n", + " image_uri=triton_image_uri,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ea99bee4", + "metadata": {}, + "outputs": [], + "source": [ + "from datetime import datetime\n", + "\n", + "date = datetime.now().strftime(\"%Y-%m-%d-%H-%m-%S\")\n", + "\n", + "endpoint_name = f\"Triton-MNIST-{date}\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "376ca9af", + "metadata": {}, + "outputs": [], + "source": [ + "predictor = tensorflow_model.deploy(\n", + " initial_instance_count=1,\n", + " instance_type=\"ml.g4dn.xlarge\",\n", + " endpoint_name=endpoint_name,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a1e48701", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "import json\n", + "\n", + "payload = {\n", + " \"inputs\": [\n", + " {\n", + " \"name\": \"input_1\",\n", + " \"shape\": [4, 28, 28, 1],\n", + " \"datatype\": \"FP32\",\n", + " \"data\": np.random.rand(4, 28, 28, 1).tolist(),\n", + " }\n", + " ]\n", + "}\n", + "runtime_sm_client = boto3.client(\"sagemaker-runtime\")\n", + "response = runtime_sm_client.invoke_endpoint(\n", + " EndpointName=endpoint_name,\n", + " ContentType=\"application/octet-stream\",\n", + " Body=json.dumps(payload),\n", + ")\n", + "\n", + "predictions = json.loads(response[\"Body\"].read())[\"outputs\"][0][\"data\"]\n", + "predictions = np.array(predictions, dtype=np.float32)\n", + "predictions = np.argmax(predictions)\n", + "predictions" + ] + }, + { + "cell_type": "markdown", + "id": "f0f2a7d2", + "metadata": {}, + "source": [ + "## Clean up\n", + "We strongly recommend to delete the Real-time endpoint created to stop incurring cost when finished with the example" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8f9e893f", + "metadata": {}, + "outputs": [], + "source": [ + "sm_client = boto3.client(\"sagemaker\")\n", + "# Delete endpoint\n", + "sm_client.delete_endpoint(EndpointName=endpoint_name)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "conda_tensorflow2_p38", + "language": "python", + "name": "conda_tensorflow2_p38" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}