From dc7c06034fbba9cd395e38adf2486bba3f93e428 Mon Sep 17 00:00:00 2001
From: Qingwei Li
Date: Thu, 8 Sep 2022 17:28:23 -0400
Subject: [PATCH 01/14] CLI upgrade

---
 ...PT-J-6B-model-parallel-inference-DJL.ipynb | 437 ++++++++++++++++++
 1 file changed, 437 insertions(+)
 create mode 100644 advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb

diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
new file mode 100644
index 0000000000..d40325d877
--- /dev/null
+++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
@@ -0,0 +1,437 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "bc5ab391",
+ "metadata": {},
+ "source": [
+ "# Serve large models on SageMaker with model parallel inference and DJLServing"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "98b43ca5",
+ "metadata": {},
+ "source": [
+ "In this notebook, we explore how to host a large language model on SageMaker using model parallelism from DeepSpeed and DJLServing.\n",
+ "\n",
+ "Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x, with models such as OpenAI’s 175 billion parameter GPT-3 and similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically demonstrated positive relationship between model size and accuracy: more is better. With easy access to model zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.\n",
+ "\n",
+ "Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques, which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.\n",
+ "\n",
+ "In this notebook, we deploy a PyTorch GPT-J model from Hugging Face with 6 billion parameters across two GPUs on an Amazon SageMaker ml.g5.48xlarge instance. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. 
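\n",
+ "\n",
+ "As a back-of-the-envelope check on the sizing (our own arithmetic, not an official figure): 6 billion float32 parameters occupy about 6e9 x 4 bytes = 24 GB for the weights alone, so a tensor parallel degree of 2 leaves roughly 12 GB of weights per GPU, which fits within the 24 GB of a single NVIDIA A10G on the g5 instance while leaving headroom for activations and the DeepSpeed kernels. 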
" + ] + }, + { + "cell_type": "markdown", + "id": "81c2bdf4", + "metadata": {}, + "source": [ + "## Step 1: Creating image for SageMaker endpoint\n", + "We first pull the docker image djl-serving:0.18.0-deepspeed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2876d11c", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "docker pull deepjavalibrary/djl-serving:0.18.0-deepspeed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73d0ff93", + "metadata": {}, + "outputs": [], + "source": [ + "!docker images" + ] + }, + { + "cell_type": "markdown", + "id": "e822977b", + "metadata": {}, + "source": [ + "You should see the image `djl-serving` listed from running the code above. Please note the `IMAGE ID`. We will need it for the next step." + ] + }, + { + "cell_type": "markdown", + "id": "6c695144", + "metadata": {}, + "source": [ + "### Push image to ECR\n", + "The following code pushes the `djl-serving` image, downloaded from previous step, to ECR. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47ab31d1", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%%sh\n", + "\n", + "# The name of our container\n", + "img=djl_deepspeed\n", + "\n", + "\n", + "account=$(aws sts get-caller-identity --query Account --output text)\n", + "\n", + "# Get the region defined in the current configuration\n", + "region=$(aws configure get region)\n", + "\n", + "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${img}:latest\"\n", + "\n", + "# If the repository doesn't exist in ECR, create it.\n", + "aws ecr describe-repositories --repository-names \"${img}\" > /dev/null 2>&1\n", + "\n", + "if [ $? -ne 0 ]\n", + "then\n", + " aws ecr create-repository --repository-name \"${img}\" > /dev/null\n", + "fi\n", + "\n", + "# Get the login command from ECR and execute it directly\n", + "aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}\n", + "\n", + "\n", + "# # Build the docker image locally with the image name and then push it to ECR\n", + "image_id=$(docker images -q | head -n1)\n", + "docker tag $image_id ${fullname}\n", + "\n", + "docker push $fullname" + ] + }, + { + "cell_type": "markdown", + "id": "1ac32e96", + "metadata": {}, + "source": [ + "## Step 2: Create a `model.py` and `serving.properties`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f4864eb", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile model.py\n", + "\n", + "from djl_python import Input, Output\n", + "import os\n", + "import deepspeed\n", + "import torch\n", + "from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer\n", + "\n", + "predictor = None\n", + "\n", + "def get_model():\n", + " model_name = 'EleutherAI/gpt-j-6B'\n", + " tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '2'))\n", + " local_rank = int(os.getenv('LOCAL_RANK', '0'))\n", + " model = AutoModelForCausalLM.from_pretrained(model_name, revision=\"float32\", torch_dtype=torch.float32)\n", + " tokenizer = AutoTokenizer.from_pretrained(model_name)\n", + " \n", + " model = deepspeed.init_inference(model,\n", + " mp_size=tensor_parallel,\n", + " dtype=model.dtype,\n", + " replace_method='auto',\n", + " replace_with_kernel_inject=True)\n", + " generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)\n", + " return generator\n", + "\n", + "\n", + "def handle(inputs: Input) -> None:\n", + " global predictor\n", + " if not 
predictor:\n",
+ " predictor = get_model()\n",
+ "\n",
+ " if inputs.is_empty():\n",
+ " # Model server makes an empty call to warmup the model on startup\n",
+ " return None\n",
+ "\n",
+ " data = inputs.get_as_string()\n",
+ " result = predictor(data, do_sample=True, min_length=200, max_new_tokens=256)\n",
+ " return Output().add(result)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f02b6929",
+ "metadata": {},
+ "source": [
+ "### Setup serving.properties\n",
+ "\n",
+ "User needs to add engine Rubikon as shown below. If you would like to control how many worker groups, you can set\n",
+ "\n",
+ "```\n",
+ "gpu.minWorkers=1\n",
+ "gpu.maxWorkers=1\n",
+ "```\n",
+ "by adding these lines in the below file. By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2c5ea96a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%writefile serving.properties\n",
+ "\n",
+ "engine=Rubikon"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f44488e6",
+ "metadata": {},
+ "source": [
+ "The code below creates the SageMaker model file (`model.tar.gz`) and uploads it to S3. "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5a536439",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import sagemaker, boto3\n",
+ "\n",
+ "session = sagemaker.Session()\n",
+ "account = session.account_id()\n",
+ "region = session.boto_region_name\n",
+ "img = 'djl_deepspeed'\n",
+ "fullname = account+'.dkr.ecr.'+region+'amazonaws.com/'+img+':latest'\n",
+ "\n",
+ "bucket = session.default_bucket()\n",
+ "path = 's3://' + bucket + '/DEMO-djl-big-model/'"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9965dd7c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%sh\n",
+ "if [ -d gpt-j ]; then\n",
+ " rm -d -r gpt-j\n",
+ "fi #always start fresh\n",
+ "\n",
+ "mkdir -p gpt-j\n",
+ "mv model.py gpt-j\n",
+ "mv serving.properties gpt-j\n",
+ "tar -czvf gpt-j.tar.gz gpt-j/\n",
+ "#aws s3 cp gpt-j.tar.gz {path}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "db47f969",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!aws s3 cp gpt-j.tar.gz {path}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c507e3ef",
+ "metadata": {},
+ "source": [
+ "## Step 3: Create SageMaker endpoint"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f96c494a",
+ "metadata": {},
+ "source": [
+ "First let us make sure we have the lastest awscli"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0b665515",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip3 install --upgrade --user awscli"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "32589338",
+ "metadata": {},
+ "source": [
+ "You should see two images from code above. Please note the image name similar to`.dkr.ecr.us-east-1.amazonaws.com/djl_deepspeed`. This is the ECR image URL that we need for later use. \n",
+ "\n",
+ "Now we create our [SageMaker model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html). Make sure you provide an IAM role that SageMaker can assume to access model artifacts and docker image for deployment on ML compute hosting instances. In addition, you also use the IAM role to manage permissions the inference code needs. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. 
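\n",
+ "\n",
+ "If you are running this notebook on a SageMaker notebook instance or in Studio, one way to look up a candidate role ARN is the SageMaker Python SDK helper below (a sketch; confirm the printed role actually carries the permissions described above):\n",
+ "\n",
+ "```python\n",
+ "import sagemaker\n",
+ "\n",
+ "# Prints the ARN of the IAM role this notebook environment runs with\n",
+ "print(sagemaker.get_execution_role())\n",
+ "```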
\n", + "\n", + " You must enter ECR image name, S3 path for the model file, and an execution-role-arn in the code below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "026d27d2", + "metadata": {}, + "outputs": [], + "source": [ + "!aws sagemaker create-model \\\n", + "--model-name gpt-j \\\n", + "--primary-container \\\n", + "Image=,ModelDataUrl={path},Environment={TENSOR_PARALLEL_DEGREE=2} \\\n", + "--execution-role-arn " + ] + }, + { + "cell_type": "markdown", + "id": "22d2fc2b", + "metadata": {}, + "source": [ + "Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84e25dd4", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "aws sagemaker create-endpoint-config \\\n", + " --region $(aws configure get region) \\\n", + " --endpoint-config-name gpt-j-config \\\n", + " --production-variants '[\n", + " {\n", + " \"ModelName\": \"gpt-j\",\n", + " \"VariantName\": \"AllTraffic\",\n", + " \"InstanceType\": \"ml.g5.48xlarge\",\n", + " \"InitialInstanceCount\": 1,\n", + " \"ModelDataDownloadTimeoutInSeconds\": 1800,\n", + " \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600\n", + " }\n", + " ]'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "962a1aef", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "aws sagemaker create-endpoint \\\n", + "--endpoint-name gpt-j \\\n", + "--endpoint-config-name gpt-j-config" + ] + }, + { + "cell_type": "markdown", + "id": "2dc2a85a", + "metadata": {}, + "source": [ + "The creation of the SageMaker endpoint might take a while. After the endpoint is created, you can test it out using the following code. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ed7a325", + "metadata": {}, + "outputs": [], + "source": [ + "import boto3, json\n", + "\n", + "client = boto3.client('sagemaker-runtime')\n", + "\n", + "endpoint_name = \"gpt-j\" # Your endpoint name.\n", + "content_type = \"text/plain\" # The MIME type of the input data in the request body.\n", + "# accept = \"...\" # The desired MIME type of the inference in the response.\n", + "payload = \"Amazon.com is the best\" # Payload for inference.\n", + "response = client.invoke_endpoint(\n", + " EndpointName=endpoint_name, \n", + " ContentType=content_type,\n", + " Body=payload\n", + " )\n", + "print(response['Body'].read())" + ] + }, + { + "cell_type": "markdown", + "id": "92e83c91", + "metadata": {}, + "source": [ + "## Step 4: Clean up" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a15980a3", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "aws sagemaker delete-endpoint --endpoint-name gpt-j" + ] + }, + { + "cell_type": "markdown", + "id": "2eff050b", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, you used tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once allowing for faster inference latency when a low batch size is used. Here, we used open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.\n", + "\n", + "As a next step, you can experiment with larger models from Hugging Face such as GPT-NeoX. 
You can also adjust the tensor parallel degree to see the impact to latency with models of different sizes." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.8.9 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.9" + }, + "vscode": { + "interpreter": { + "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} From 875c110b2ab4953d1273848398dd4b1227a80f60 Mon Sep 17 00:00:00 2001 From: atqy Date: Thu, 8 Sep 2022 16:49:11 -0700 Subject: [PATCH 02/14] reformat --- ...PT-J-6B-model-parallel-inference-DJL.ipynb | 53 ++++++++++--------- 1 file changed, 29 insertions(+), 24 deletions(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index d40325d877..c36008b831 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -135,19 +135,26 @@ "\n", "predictor = None\n", "\n", + "\n", "def get_model():\n", - " model_name = 'EleutherAI/gpt-j-6B'\n", - " tensor_parallel = int(os.getenv('TENSOR_PARALLEL_DEGREE', '2'))\n", - " local_rank = int(os.getenv('LOCAL_RANK', '0'))\n", - " model = AutoModelForCausalLM.from_pretrained(model_name, revision=\"float32\", torch_dtype=torch.float32)\n", + " model_name = \"EleutherAI/gpt-j-6B\"\n", + " tensor_parallel = int(os.getenv(\"TENSOR_PARALLEL_DEGREE\", \"2\"))\n", + " local_rank = int(os.getenv(\"LOCAL_RANK\", \"0\"))\n", + " model = AutoModelForCausalLM.from_pretrained(\n", + " model_name, revision=\"float32\", torch_dtype=torch.float32\n", + " )\n", " tokenizer = AutoTokenizer.from_pretrained(model_name)\n", - " \n", - " model = deepspeed.init_inference(model,\n", - " mp_size=tensor_parallel,\n", - " dtype=model.dtype,\n", - " replace_method='auto',\n", - " replace_with_kernel_inject=True)\n", - " generator = pipeline(task='text-generation', model=model, tokenizer=tokenizer, device=local_rank)\n", + "\n", + " model = deepspeed.init_inference(\n", + " model,\n", + " mp_size=tensor_parallel,\n", + " dtype=model.dtype,\n", + " replace_method=\"auto\",\n", + " replace_with_kernel_inject=True,\n", + " )\n", + " generator = pipeline(\n", + " task=\"text-generation\", model=model, tokenizer=tokenizer, device=local_rank\n", + " )\n", " return generator\n", "\n", "\n", @@ -190,7 +197,7 @@ "source": [ "%%writefile serving.properties\n", "\n", - "engine=Rubikon" + "engine = Rubikon" ] }, { @@ -213,11 +220,11 @@ "session = sagemaker.Session()\n", "account = session.account_id()\n", "region = session.boto_region_name\n", - "img = 'djl_deepspeed'\n", - "fullname = account+'.dkr.ecr.'+region+'amazonaws.com/'+img+':latest'\n", + "img = \"djl_deepspeed\"\n", + "fullname = account + \".dkr.ecr.\" + region + \"amazonaws.com/\" + img + \":latest\"\n", "\n", "bucket = session.default_bucket()\n", - "path = 's3://' + bucket + '/DEMO-djl-big-model/'" + "path = \"s3://\" + bucket + \"/DEMO-djl-big-model/\"" ] }, { @@ -362,18 +369,16 @@ "source": [ "import boto3, json\n", "\n", - "client = 
boto3.client('sagemaker-runtime')\n", + "client = boto3.client(\"sagemaker-runtime\")\n", "\n", - "endpoint_name = \"gpt-j\" # Your endpoint name.\n", - "content_type = \"text/plain\" # The MIME type of the input data in the request body.\n", + "endpoint_name = \"gpt-j\" # Your endpoint name.\n", + "content_type = \"text/plain\" # The MIME type of the input data in the request body.\n", "# accept = \"...\" # The desired MIME type of the inference in the response.\n", - "payload = \"Amazon.com is the best\" # Payload for inference.\n", + "payload = \"Amazon.com is the best\" # Payload for inference.\n", "response = client.invoke_endpoint(\n", - " EndpointName=endpoint_name, \n", - " ContentType=content_type,\n", - " Body=payload\n", - " )\n", - "print(response['Body'].read())" + " EndpointName=endpoint_name, ContentType=content_type, Body=payload\n", + ")\n", + "print(response[\"Body\"].read())" ] }, { From f211fb239efca196b7ab0d49ad67f431dd021889 Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Thu, 8 Sep 2022 21:32:13 -0400 Subject: [PATCH 03/14] grammatical changes --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index c36008b831..d979d03d7d 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -313,7 +313,7 @@ "id": "22d2fc2b", "metadata": {}, "source": [ - "Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. " + "Note that we configure `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. " ] }, { @@ -373,7 +373,6 @@ "\n", "endpoint_name = \"gpt-j\" # Your endpoint name.\n", "content_type = \"text/plain\" # The MIME type of the input data in the request body.\n", - "# accept = \"...\" # The desired MIME type of the inference in the response.\n", "payload = \"Amazon.com is the best\" # Payload for inference.\n", "response = client.invoke_endpoint(\n", " EndpointName=endpoint_name, ContentType=content_type, Body=payload\n", @@ -407,7 +406,7 @@ "source": [ "## Conclusion\n", "\n", - "In this notebook, you used tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once allowing for faster inference latency when a low batch size is used. Here, we used open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.\n", + "In this notebook, you use tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once allowing for faster inference latency when a low batch size is used. Here, we use open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.\n", "\n", "As a next step, you can experiment with larger models from Hugging Face such as GPT-NeoX. 
You can also adjust the tensor parallel degree to see the impact to latency with models of different sizes." ] From 9b2d62cd214cf4c8633bef2908f1372787bc2591 Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Fri, 9 Sep 2022 17:05:49 -0400 Subject: [PATCH 04/14] boto3 version --- ...PT-J-6B-model-parallel-inference-DJL.ipynb | 112 ++++++++++-------- 1 file changed, 61 insertions(+), 51 deletions(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index d979d03d7d..920965f7e6 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -22,6 +22,16 @@ "In this notebook, we deploy a PyTorch GPT-J model from Hugging Face with 6 billion parameters across two GPUs on an Amazon SageMaker ml.g5.48xlarge instance. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. " ] }, + { + "cell_type": "code", + "execution_count": null, + "id": "d6ed354b", + "metadata": {}, + "outputs": [], + "source": [ + "!pip install boto3==1.24.68" + ] + }, { "cell_type": "markdown", "id": "81c2bdf4", @@ -221,8 +231,7 @@ "account = session.account_id()\n", "region = session.boto_region_name\n", "img = \"djl_deepspeed\"\n", - "fullname = account + \".dkr.ecr.\" + region + \"amazonaws.com/\" + img + \":latest\"\n", - "\n", + "fullname = account+'.dkr.ecr.'+region+'.amazonaws.com/'+img+':latest'\n", "bucket = session.default_bucket()\n", "path = \"s3://\" + bucket + \"/DEMO-djl-big-model/\"" ] @@ -253,7 +262,7 @@ "metadata": {}, "outputs": [], "source": [ - "!aws s3 cp gpt-j.tar.gz {path}" + "model_s3_url = sagemaker.s3.S3Uploader.upload('gpt-j.tar.gz', path, kms_key=None, sagemaker_session=session)" ] }, { @@ -266,77 +275,79 @@ }, { "cell_type": "markdown", - "id": "f96c494a", + "id": "32589338", "metadata": {}, "source": [ - "First let us make sure we have the lastest awscli" + "Now we create our [SageMaker model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model). Make sure your execution role has access to your model artifacts and ECR image. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. " ] }, { "cell_type": "code", "execution_count": null, - "id": "0b665515", + "id": "026d27d2", "metadata": {}, "outputs": [], "source": [ - "!pip3 install --upgrade --user awscli" + "from datetime import datetime\n", + "sm_client = boto3.client('sagemaker')\n", + "\n", + "time_stamp = datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n", + "model_name = 'gpt-j-' + time_stamp\n", + "\n", + "create_model_response = sm_client.create_model(\n", + " ModelName = model_name,\n", + " ExecutionRoleArn = session.get_caller_identity_arn(),\n", + " PrimaryContainer = {\n", + " 'Image': fullname,\n", + " 'ModelDataUrl': model_s3_url,\n", + " })" ] }, { "cell_type": "markdown", - "id": "32589338", + "id": "22d2fc2b", "metadata": {}, "source": [ - "You should see two images from code above. Please note the image name similar to`.dkr.ecr.us-east-1.amazonaws.com/djl_deepspeed`. This is the ECR image URL that we need for later use. 
\n", - "\n", - "Now we create our [SageMaker model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html). Make sure you provide an IAM role that SageMaker can assume to access model artifacts and docker image for deployment on ML compute hosting instances. In addition, you also use the IAM role to manage permissions the inference code needs. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. \n", - "\n", - " You must enter ECR image name, S3 path for the model file, and an execution-role-arn in the code below." + "Now we creates an endpoint configuration that SageMaker hosting services uses to deploy models. Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. " ] }, { "cell_type": "code", "execution_count": null, - "id": "026d27d2", + "id": "84e25dd4", "metadata": {}, "outputs": [], "source": [ - "!aws sagemaker create-model \\\n", - "--model-name gpt-j \\\n", - "--primary-container \\\n", - "Image=,ModelDataUrl={path},Environment={TENSOR_PARALLEL_DEGREE=2} \\\n", - "--execution-role-arn " + "initial_instance_count = 1\n", + "instance_type = \"ml.g5.48xlarge\"\n", + "variant_name = \"AllTraffic\" \n", + "endpoint_config_name = \"t-j-config-\"+time_stamp\n", + "\n", + "production_variants = [\n", + " {\n", + " \"VariantName\": variant_name,\n", + " \"ModelName\": model_name,\n", + " \"InitialInstanceCount\": initial_instance_count,\n", + " \"InstanceType\": instance_type,\n", + " 'ModelDataDownloadTimeoutInSeconds':1800,\n", + " 'ContainerStartupHealthCheckTimeoutInSeconds':3600\n", + " }\n", + "]\n", + "\n", + "endpoint_config = {\n", + " \"EndpointConfigName\": endpoint_config_name,\n", + " \"ProductionVariants\": production_variants,\n", + "}\n", + "\n", + "ep_conf_res = sm_client.create_endpoint_config(**endpoint_config)" ] }, { "cell_type": "markdown", - "id": "22d2fc2b", + "id": "e4b3bc26", "metadata": {}, "source": [ - "Note that we configure `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "84e25dd4", - "metadata": {}, - "outputs": [], - "source": [ - "%%sh\n", - "aws sagemaker create-endpoint-config \\\n", - " --region $(aws configure get region) \\\n", - " --endpoint-config-name gpt-j-config \\\n", - " --production-variants '[\n", - " {\n", - " \"ModelName\": \"gpt-j\",\n", - " \"VariantName\": \"AllTraffic\",\n", - " \"InstanceType\": \"ml.g5.48xlarge\",\n", - " \"InitialInstanceCount\": 1,\n", - " \"ModelDataDownloadTimeoutInSeconds\": 1800,\n", - " \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600\n", - " }\n", - " ]'" + "We are ready to create an endpoint using the model and the endpoint configuration created from above steps. 
" ] }, { @@ -346,10 +357,10 @@ "metadata": {}, "outputs": [], "source": [ - "%%sh\n", - "aws sagemaker create-endpoint \\\n", - "--endpoint-name gpt-j \\\n", - "--endpoint-config-name gpt-j-config" + "endpoint_name = \"gpt-j\" + time_stamp\n", + "ep_res = sm_client.create_endpoint(\n", + " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", + ")" ] }, { @@ -367,7 +378,7 @@ "metadata": {}, "outputs": [], "source": [ - "import boto3, json\n", + "import json\n", "\n", "client = boto3.client(\"sagemaker-runtime\")\n", "\n", @@ -395,8 +406,7 @@ "metadata": {}, "outputs": [], "source": [ - "%%sh\n", - "aws sagemaker delete-endpoint --endpoint-name gpt-j" + "sm_client.delete_endpoint(endpoint_name)" ] }, { From 59702b56848714f5605058484a1b3c6237b7a242 Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Fri, 9 Sep 2022 17:11:00 -0400 Subject: [PATCH 05/14] boto3 version-with minor change --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index 920965f7e6..a3761e2ab9 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -233,7 +233,7 @@ "img = \"djl_deepspeed\"\n", "fullname = account+'.dkr.ecr.'+region+'.amazonaws.com/'+img+':latest'\n", "bucket = session.default_bucket()\n", - "path = \"s3://\" + bucket + \"/DEMO-djl-big-model/\"" + "path = \"s3://\" + bucket + \"/DEMO-djl-big-model\"" ] }, { From 40de44c515b7ccabe4face09f66310d464ac3acd Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Fri, 9 Sep 2022 17:59:22 -0400 Subject: [PATCH 06/14] serving.perperties remove empty line --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 1 - 1 file changed, 1 deletion(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index a3761e2ab9..a077783bed 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -206,7 +206,6 @@ "outputs": [], "source": [ "%%writefile serving.properties\n", - "\n", "engine = Rubikon" ] }, From 0fd8a5577d792bda5692ee3f89eb314b53fbbd58 Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Fri, 9 Sep 2022 18:24:29 -0400 Subject: [PATCH 07/14] set env variable for tensor_parallel_degree --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 3 +++ 1 file changed, 3 insertions(+) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index a077783bed..9d1fdfd18f 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -299,6 +299,9 @@ " PrimaryContainer = {\n", " 'Image': fullname,\n", " 'ModelDataUrl': model_s3_url,\n", + " 'Environment': {\n", + " 'TENSOR_PARALLEL_DEGREE': '2'\n", + " }\n", " })" ] }, From 
841d4d2a486fad16f7a01b57aa24b607730297dd Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Fri, 9 Sep 2022 18:30:54 -0400 Subject: [PATCH 08/14] grammatic fix --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index 9d1fdfd18f..178263f57c 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -310,7 +310,7 @@ "id": "22d2fc2b", "metadata": {}, "source": [ - "Now we creates an endpoint configuration that SageMaker hosting services uses to deploy models. Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. " + "Now we creates an endpoint configuration that SageMaker hosting services uses to deploy models. Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to accommodate the large size of our model. " ] }, { From bbea1276efb8392e409265b3d3ca5ed3bb9acf2d Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Fri, 9 Sep 2022 18:51:22 -0400 Subject: [PATCH 09/14] black-nb --- ...PT-J-6B-model-parallel-inference-DJL.ipynb | 36 ++++++++++--------- 1 file changed, 19 insertions(+), 17 deletions(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index 178263f57c..93d2491c6c 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -230,7 +230,7 @@ "account = session.account_id()\n", "region = session.boto_region_name\n", "img = \"djl_deepspeed\"\n", - "fullname = account+'.dkr.ecr.'+region+'.amazonaws.com/'+img+':latest'\n", + "fullname = account + \".dkr.ecr.\" + region + \".amazonaws.com/\" + img + \":latest\"\n", "bucket = session.default_bucket()\n", "path = \"s3://\" + bucket + \"/DEMO-djl-big-model\"" ] @@ -261,7 +261,9 @@ "metadata": {}, "outputs": [], "source": [ - "model_s3_url = sagemaker.s3.S3Uploader.upload('gpt-j.tar.gz', path, kms_key=None, sagemaker_session=session)" + "model_s3_url = sagemaker.s3.S3Uploader.upload(\n", + " \"gpt-j.tar.gz\", path, kms_key=None, sagemaker_session=session\n", + ")" ] }, { @@ -288,21 +290,21 @@ "outputs": [], "source": [ "from datetime import datetime\n", - "sm_client = boto3.client('sagemaker')\n", + "\n", + "sm_client = boto3.client(\"sagemaker\")\n", "\n", "time_stamp = datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n", - "model_name = 'gpt-j-' + time_stamp\n", + "model_name = \"gpt-j-\" + time_stamp\n", "\n", "create_model_response = sm_client.create_model(\n", - " ModelName = model_name,\n", - " ExecutionRoleArn = session.get_caller_identity_arn(),\n", - " PrimaryContainer = {\n", - " 'Image': fullname,\n", - " 'ModelDataUrl': model_s3_url,\n", - " 'Environment': {\n", - " 'TENSOR_PARALLEL_DEGREE': '2'\n", - " }\n", - " })" + " ModelName=model_name,\n", + " ExecutionRoleArn=session.get_caller_identity_arn(),\n", + " PrimaryContainer={\n", + " \"Image\": fullname,\n", 
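+ "        # model_s3_url is the model.tar.gz uploaded above; TENSOR_PARALLEL_DEGREE is read by model.py\n",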
+ " \"ModelDataUrl\": model_s3_url,\n", + " \"Environment\": {\"TENSOR_PARALLEL_DEGREE\": \"2\"},\n", + " },\n", + ")" ] }, { @@ -322,8 +324,8 @@ "source": [ "initial_instance_count = 1\n", "instance_type = \"ml.g5.48xlarge\"\n", - "variant_name = \"AllTraffic\" \n", - "endpoint_config_name = \"t-j-config-\"+time_stamp\n", + "variant_name = \"AllTraffic\"\n", + "endpoint_config_name = \"t-j-config-\" + time_stamp\n", "\n", "production_variants = [\n", " {\n", @@ -331,8 +333,8 @@ " \"ModelName\": model_name,\n", " \"InitialInstanceCount\": initial_instance_count,\n", " \"InstanceType\": instance_type,\n", - " 'ModelDataDownloadTimeoutInSeconds':1800,\n", - " 'ContainerStartupHealthCheckTimeoutInSeconds':3600\n", + " \"ModelDataDownloadTimeoutInSeconds\": 1800,\n", + " \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600,\n", " }\n", "]\n", "\n", From 43c4e9d93e87b3b30f1069fc384d0c1413cf6669 Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Fri, 9 Sep 2022 18:58:47 -0400 Subject: [PATCH 10/14] grammatical change --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index 93d2491c6c..618bb8db77 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -312,7 +312,7 @@ "id": "22d2fc2b", "metadata": {}, "source": [ - "Now we creates an endpoint configuration that SageMaker hosting services uses to deploy models. Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to accommodate the large size of our model. " + "Now we create an endpoint configuration that SageMaker hosting services uses to deploy models. Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to accommodate the large size of our model. 
" ] }, { From d4555e7254248eeeeb5a451cb4686c52bc919a6e Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Fri, 9 Sep 2022 19:02:23 -0400 Subject: [PATCH 11/14] endpoint_name fix --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 1 - 1 file changed, 1 deletion(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index 618bb8db77..211465f464 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -386,7 +386,6 @@ "\n", "client = boto3.client(\"sagemaker-runtime\")\n", "\n", - "endpoint_name = \"gpt-j\" # Your endpoint name.\n", "content_type = \"text/plain\" # The MIME type of the input data in the request body.\n", "payload = \"Amazon.com is the best\" # Payload for inference.\n", "response = client.invoke_endpoint(\n", From 850c7953a00a1ec27ca070c608cc0e89681e66b5 Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Mon, 12 Sep 2022 13:03:26 -0400 Subject: [PATCH 12/14] "By" cap --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index 211465f464..fc9c8dcf62 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -195,7 +195,7 @@ "gpu.minWorkers=1\n", "gpu.maxWorkers=1\n", "```\n", - "by adding these lines in the below file. By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`." + "By adding these lines in the below file. By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`." ] }, { From 428e5d1092bdc00cc9ec57c54744ead04b16dee6 Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Mon, 12 Sep 2022 13:05:56 -0400 Subject: [PATCH 13/14] minor change --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index fc9c8dcf62..74565437de 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -189,13 +189,13 @@ "source": [ "### Setup serving.properties\n", "\n", - "User needs to add engine Rubikon as shown below. If you would like to control how many worker groups, you can set\n", + "User needs to add engine Rubikon as shown below. If you would like to control how many worker groups, you can set by adding these lines in the below file.\n", "\n", "```\n", "gpu.minWorkers=1\n", "gpu.maxWorkers=1\n", "```\n", - "By adding these lines in the below file. By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`." 
+ "By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`." ] }, { From 1f7e37dff005342211c67697deea7422a10b76de Mon Sep 17 00:00:00 2001 From: Qingwei Li Date: Mon, 19 Sep 2022 11:26:44 -0400 Subject: [PATCH 14/14] docker tag call improvement --- .../GPT-J-6B-model-parallel-inference-DJL.ipynb | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb index 74565437de..da11c73cb7 100644 --- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -114,8 +114,7 @@ "\n", "\n", "# # Build the docker image locally with the image name and then push it to ECR\n", - "image_id=$(docker images -q | head -n1)\n", - "docker tag $image_id ${fullname}\n", + "docker tag deepjavalibrary/djl-serving:0.18.0-deepspeed ${fullname}\n", "\n", "docker push $fullname" ]