diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb new file mode 100644 index 0000000000..d979d03d7d --- /dev/null +++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb @@ -0,0 +1,441 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "bc5ab391", + "metadata": {}, + "source": [ + "# Serve large models on SageMaker with model parallel inference and DJLServing" + ] + }, + { + "cell_type": "markdown", + "id": "98b43ca5", + "metadata": {}, + "source": [ + "In this notebook, we explore how to host a large language model on SageMaker using model parallelism from DeepSpeed and DJLServing.\n", + "\n", + "Language models have recently exploded in both size and popularity. In 2018, BERT-large entered the scene and, with its 340M parameters and novel transformer architecture, set the standard on NLP task accuracy. Within just a few years, state-of-the-art NLP model size has grown by more than 500x with models such as OpenAI’s 175 billion parameter GPT-3 and similarly sized open source Bloom 176B raising the bar on NLP accuracy. This increase in the number of parameters is driven by the simple and empirically-demonstrated positive relationship between model size and accuracy: more is better. With easy access from models zoos such as Hugging Face and improved accuracy in NLP tasks such as classification and text generation, practitioners are increasingly reaching for these large models. However, deploying them can be a challenge because of their size.\n", + "\n", + "Model parallelism can help deploy large models that would normally be too large for a single GPU. With model parallelism, we partition and distribute a model across multiple GPUs. Each GPU holds a different part of the model, resolving the memory capacity issue for the largest deep learning models with billions of parameters. This notebook uses tensor parallelism techniques which allow GPUs to work simultaneously on the same layer of a model and achieve low latency inference relative to a pipeline parallel solution.\n", + "\n", + "In this notebook, we deploy a PyTorch GPT-J model from Hugging Face with 6 billion parameters across two GPUs on an Amazon SageMaker ml.g5.48xlarge instance. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. " + ] + }, + { + "cell_type": "markdown", + "id": "81c2bdf4", + "metadata": {}, + "source": [ + "## Step 1: Creating image for SageMaker endpoint\n", + "We first pull the docker image djl-serving:0.18.0-deepspeed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2876d11c", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "docker pull deepjavalibrary/djl-serving:0.18.0-deepspeed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "73d0ff93", + "metadata": {}, + "outputs": [], + "source": [ + "!docker images" + ] + }, + { + "cell_type": "markdown", + "id": "e822977b", + "metadata": {}, + "source": [ + "You should see the image `djl-serving` listed from running the code above. Please note the `IMAGE ID`. We will need it for the next step." + ] + }, + { + "cell_type": "markdown", + "id": "6c695144", + "metadata": {}, + "source": [ + "### Push image to ECR\n", + "The following code pushes the `djl-serving` image, downloaded from previous step, to ECR. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47ab31d1", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%%sh\n", + "\n", + "# The name of our container\n", + "img=djl_deepspeed\n", + "\n", + "\n", + "account=$(aws sts get-caller-identity --query Account --output text)\n", + "\n", + "# Get the region defined in the current configuration\n", + "region=$(aws configure get region)\n", + "\n", + "fullname=\"${account}.dkr.ecr.${region}.amazonaws.com/${img}:latest\"\n", + "\n", + "# If the repository doesn't exist in ECR, create it.\n", + "aws ecr describe-repositories --repository-names \"${img}\" > /dev/null 2>&1\n", + "\n", + "if [ $? -ne 0 ]\n", + "then\n", + " aws ecr create-repository --repository-name \"${img}\" > /dev/null\n", + "fi\n", + "\n", + "# Get the login command from ECR and execute it directly\n", + "aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}\n", + "\n", + "\n", + "# # Build the docker image locally with the image name and then push it to ECR\n", + "image_id=$(docker images -q | head -n1)\n", + "docker tag $image_id ${fullname}\n", + "\n", + "docker push $fullname" + ] + }, + { + "cell_type": "markdown", + "id": "1ac32e96", + "metadata": {}, + "source": [ + "## Step 2: Create a `model.py` and `serving.properties`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6f4864eb", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile model.py\n", + "\n", + "from djl_python import Input, Output\n", + "import os\n", + "import deepspeed\n", + "import torch\n", + "from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer\n", + "\n", + "predictor = None\n", + "\n", + "\n", + "def get_model():\n", + " model_name = \"EleutherAI/gpt-j-6B\"\n", + " tensor_parallel = int(os.getenv(\"TENSOR_PARALLEL_DEGREE\", \"2\"))\n", + " local_rank = int(os.getenv(\"LOCAL_RANK\", \"0\"))\n", + " model = AutoModelForCausalLM.from_pretrained(\n", + " model_name, revision=\"float32\", torch_dtype=torch.float32\n", + " )\n", + " tokenizer = AutoTokenizer.from_pretrained(model_name)\n", + "\n", + " model = deepspeed.init_inference(\n", + " model,\n", + " mp_size=tensor_parallel,\n", + " dtype=model.dtype,\n", + " replace_method=\"auto\",\n", + " replace_with_kernel_inject=True,\n", + " )\n", + " generator = pipeline(\n", + " task=\"text-generation\", model=model, tokenizer=tokenizer, device=local_rank\n", + " )\n", + " return generator\n", + "\n", + "\n", + "def handle(inputs: Input) -> None:\n", + " global predictor\n", + " if not predictor:\n", + " predictor = get_model()\n", + "\n", + " if inputs.is_empty():\n", + " # Model server makes an empty call to warmup the model on startup\n", + " return None\n", + "\n", + " data = inputs.get_as_string()\n", + " result = predictor(data, do_sample=True, min_tokens=200, max_new_tokens=256)\n", + " return Output().add(result)" + ] + }, + { + "cell_type": "markdown", + "id": "f02b6929", + "metadata": {}, + "source": [ + "### Setup serving.properties\n", + "\n", + "User needs to add engine Rubikon as shown below. If you would like to control how many worker groups, you can set\n", + "\n", + "```\n", + "gpu.minWorkers=1\n", + "gpu.maxWorkers=1\n", + "```\n", + "by adding these lines in the below file. By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2c5ea96a", + "metadata": {}, + "outputs": [], + "source": [ + "%%writefile serving.properties\n", + "\n", + "engine = Rubikon" + ] + }, + { + "cell_type": "markdown", + "id": "f44488e6", + "metadata": {}, + "source": [ + "The code below creates the SageMaker model file (`model.tar.gz`) and upload it to S3. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5a536439", + "metadata": {}, + "outputs": [], + "source": [ + "import sagemaker, boto3\n", + "\n", + "session = sagemaker.Session()\n", + "account = session.account_id()\n", + "region = session.boto_region_name\n", + "img = \"djl_deepspeed\"\n", + "fullname = account + \".dkr.ecr.\" + region + \"amazonaws.com/\" + img + \":latest\"\n", + "\n", + "bucket = session.default_bucket()\n", + "path = \"s3://\" + bucket + \"/DEMO-djl-big-model/\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9965dd7c", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "if [ -d gpt-j ]; then\n", + " rm -d -r gpt-j\n", + "fi #always start fresh\n", + "\n", + "mkdir -p gpt-j\n", + "mv model.py gpt-j\n", + "mv serving.properties gpt-j\n", + "tar -czvf gpt-j.tar.gz gpt-j/\n", + "#aws s3 cp gpt-j.tar.gz {path}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db47f969", + "metadata": {}, + "outputs": [], + "source": [ + "!aws s3 cp gpt-j.tar.gz {path}" + ] + }, + { + "cell_type": "markdown", + "id": "c507e3ef", + "metadata": {}, + "source": [ + "## Step 3: Create SageMaker endpoint" + ] + }, + { + "cell_type": "markdown", + "id": "f96c494a", + "metadata": {}, + "source": [ + "First let us make sure we have the lastest awscli" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0b665515", + "metadata": {}, + "outputs": [], + "source": [ + "!pip3 install --upgrade --user awscli" + ] + }, + { + "cell_type": "markdown", + "id": "32589338", + "metadata": {}, + "source": [ + "You should see two images from code above. Please note the image name similar to`.dkr.ecr.us-east-1.amazonaws.com/djl_deepspeed`. This is the ECR image URL that we need for later use. \n", + "\n", + "Now we create our [SageMaker model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html). Make sure you provide an IAM role that SageMaker can assume to access model artifacts and docker image for deployment on ML compute hosting instances. In addition, you also use the IAM role to manage permissions the inference code needs. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. \n", + "\n", + " You must enter ECR image name, S3 path for the model file, and an execution-role-arn in the code below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "026d27d2", + "metadata": {}, + "outputs": [], + "source": [ + "!aws sagemaker create-model \\\n", + "--model-name gpt-j \\\n", + "--primary-container \\\n", + "Image=,ModelDataUrl={path},Environment={TENSOR_PARALLEL_DEGREE=2} \\\n", + "--execution-role-arn " + ] + }, + { + "cell_type": "markdown", + "id": "22d2fc2b", + "metadata": {}, + "source": [ + "Note that we configure `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84e25dd4", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "aws sagemaker create-endpoint-config \\\n", + " --region $(aws configure get region) \\\n", + " --endpoint-config-name gpt-j-config \\\n", + " --production-variants '[\n", + " {\n", + " \"ModelName\": \"gpt-j\",\n", + " \"VariantName\": \"AllTraffic\",\n", + " \"InstanceType\": \"ml.g5.48xlarge\",\n", + " \"InitialInstanceCount\": 1,\n", + " \"ModelDataDownloadTimeoutInSeconds\": 1800,\n", + " \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600\n", + " }\n", + " ]'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "962a1aef", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "aws sagemaker create-endpoint \\\n", + "--endpoint-name gpt-j \\\n", + "--endpoint-config-name gpt-j-config" + ] + }, + { + "cell_type": "markdown", + "id": "2dc2a85a", + "metadata": {}, + "source": [ + "The creation of the SageMaker endpoint might take a while. After the endpoint is created, you can test it out using the following code. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5ed7a325", + "metadata": {}, + "outputs": [], + "source": [ + "import boto3, json\n", + "\n", + "client = boto3.client(\"sagemaker-runtime\")\n", + "\n", + "endpoint_name = \"gpt-j\" # Your endpoint name.\n", + "content_type = \"text/plain\" # The MIME type of the input data in the request body.\n", + "payload = \"Amazon.com is the best\" # Payload for inference.\n", + "response = client.invoke_endpoint(\n", + " EndpointName=endpoint_name, ContentType=content_type, Body=payload\n", + ")\n", + "print(response[\"Body\"].read())" + ] + }, + { + "cell_type": "markdown", + "id": "92e83c91", + "metadata": {}, + "source": [ + "## Step 4: Clean up" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a15980a3", + "metadata": {}, + "outputs": [], + "source": [ + "%%sh\n", + "aws sagemaker delete-endpoint --endpoint-name gpt-j" + ] + }, + { + "cell_type": "markdown", + "id": "2eff050b", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this notebook, you use tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once allowing for faster inference latency when a low batch size is used. Here, we use open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.\n", + "\n", + "As a next step, you can experiment with larger models from Hugging Face such as GPT-NeoX. You can also adjust the tensor parallel degree to see the impact to latency with models of different sizes." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3.8.9 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.9" + }, + "vscode": { + "interpreter": { + "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" + } + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}