diff --git a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
index d979d03d7d..74565437de 100644
--- a/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
+++ b/advanced_functionality/pytorch_deploy_large_GPT_model/GPT-J-6B-model-parallel-inference-DJL.ipynb
@@ -22,6 +22,16 @@
     "In this notebook, we deploy a PyTorch GPT-J model from Hugging Face with 6 billion parameters across two GPUs on an Amazon SageMaker ml.g5.48xlarge instance. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. "
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d6ed354b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install boto3==1.24.68"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "81c2bdf4",
@@ -179,13 +189,13 @@
    "source": [
     "### Setup serving.properties\n",
     "\n",
-    "User needs to add engine Rubikon as shown below. If you would like to control how many worker groups, you can set\n",
+    "You need to add the engine Rubikon as shown below. If you would like to control how many worker groups are created, you can set that by adding these lines to the file below.\n",
     "\n",
     "```\n",
     "gpu.minWorkers=1\n",
     "gpu.maxWorkers=1\n",
     "```\n",
-    "by adding these lines in the below file. By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`."
+    "By default, we will create as many worker groups as possible based on `gpu_numbers/tensor_parallel_degree`."
    ]
   },
   {
@@ -196,7 +206,6 @@
    "outputs": [],
    "source": [
     "%%writefile serving.properties\n",
-    "\n",
     "engine = Rubikon"
    ]
   },
@@ -221,10 +230,9 @@
    "account = session.account_id()\n",
    "region = session.boto_region_name\n",
    "img = \"djl_deepspeed\"\n",
-   "fullname = account + \".dkr.ecr.\" + region + \"amazonaws.com/\" + img + \":latest\"\n",
-   "\n",
+   "fullname = account + \".dkr.ecr.\" + region + \".amazonaws.com/\" + img + \":latest\"\n",
    "bucket = session.default_bucket()\n",
-   "path = \"s3://\" + bucket + \"/DEMO-djl-big-model/\""
+   "path = \"s3://\" + bucket + \"/DEMO-djl-big-model\""
    ]
   },
   {
@@ -253,7 +261,9 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "!aws s3 cp gpt-j.tar.gz {path}"
+    "model_s3_url = sagemaker.s3.S3Uploader.upload(\n",
+    "    \"gpt-j.tar.gz\", path, kms_key=None, sagemaker_session=session\n",
+    ")"
    ]
   },
   {
@@ -266,77 +276,82 @@
   },
   {
    "cell_type": "markdown",
-   "id": "f96c494a",
+   "id": "32589338",
    "metadata": {},
    "source": [
-    "First let us make sure we have the lastest awscli"
+    "Now we create our [SageMaker model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model). Make sure your execution role has access to your model artifacts and ECR image. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. "
    ]
   },
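+  {
+   "cell_type": "markdown",
+   "id": "0c47ad11",
+   "metadata": {},
+   "source": [
+    "(Illustrative addition from review, not in the original notebook.) Before calling `create_model`, it can help to print the three inputs assembled above, namely the execution role ARN, the ECR image URI, and the S3 model artifact, since a typo in any of them only surfaces later as a deployment failure."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3f8b6c2d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative sanity check: print the inputs the next cell passes to create_model.\n",
+    "print(\"Role ARN:  \", session.get_caller_identity_arn())\n",
+    "print(\"ECR image: \", fullname)\n",
+    "print(\"Model data:\", model_s3_url)"
+   ]
+  },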
" ] }, { "cell_type": "code", "execution_count": null, - "id": "0b665515", + "id": "026d27d2", "metadata": {}, "outputs": [], "source": [ - "!pip3 install --upgrade --user awscli" + "from datetime import datetime\n", + "\n", + "sm_client = boto3.client(\"sagemaker\")\n", + "\n", + "time_stamp = datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n", + "model_name = \"gpt-j-\" + time_stamp\n", + "\n", + "create_model_response = sm_client.create_model(\n", + " ModelName=model_name,\n", + " ExecutionRoleArn=session.get_caller_identity_arn(),\n", + " PrimaryContainer={\n", + " \"Image\": fullname,\n", + " \"ModelDataUrl\": model_s3_url,\n", + " \"Environment\": {\"TENSOR_PARALLEL_DEGREE\": \"2\"},\n", + " },\n", + ")" ] }, { "cell_type": "markdown", - "id": "32589338", + "id": "22d2fc2b", "metadata": {}, "source": [ - "You should see two images from code above. Please note the image name similar to`.dkr.ecr.us-east-1.amazonaws.com/djl_deepspeed`. This is the ECR image URL that we need for later use. \n", - "\n", - "Now we create our [SageMaker model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html). Make sure you provide an IAM role that SageMaker can assume to access model artifacts and docker image for deployment on ML compute hosting instances. In addition, you also use the IAM role to manage permissions the inference code needs. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. \n", - "\n", - " You must enter ECR image name, S3 path for the model file, and an execution-role-arn in the code below." + "Now we create an endpoint configuration that SageMaker hosting services uses to deploy models. Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to accommodate the large size of our model. " ] }, { "cell_type": "code", "execution_count": null, - "id": "026d27d2", + "id": "84e25dd4", "metadata": {}, "outputs": [], "source": [ - "!aws sagemaker create-model \\\n", - "--model-name gpt-j \\\n", - "--primary-container \\\n", - "Image=,ModelDataUrl={path},Environment={TENSOR_PARALLEL_DEGREE=2} \\\n", - "--execution-role-arn " + "initial_instance_count = 1\n", + "instance_type = \"ml.g5.48xlarge\"\n", + "variant_name = \"AllTraffic\"\n", + "endpoint_config_name = \"t-j-config-\" + time_stamp\n", + "\n", + "production_variants = [\n", + " {\n", + " \"VariantName\": variant_name,\n", + " \"ModelName\": model_name,\n", + " \"InitialInstanceCount\": initial_instance_count,\n", + " \"InstanceType\": instance_type,\n", + " \"ModelDataDownloadTimeoutInSeconds\": 1800,\n", + " \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600,\n", + " }\n", + "]\n", + "\n", + "endpoint_config = {\n", + " \"EndpointConfigName\": endpoint_config_name,\n", + " \"ProductionVariants\": production_variants,\n", + "}\n", + "\n", + "ep_conf_res = sm_client.create_endpoint_config(**endpoint_config)" ] }, { "cell_type": "markdown", - "id": "22d2fc2b", + "id": "e4b3bc26", "metadata": {}, "source": [ - "Note that we configure `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. 
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "84e25dd4", - "metadata": {}, - "outputs": [], - "source": [ - "%%sh\n", - "aws sagemaker create-endpoint-config \\\n", - " --region $(aws configure get region) \\\n", - " --endpoint-config-name gpt-j-config \\\n", - " --production-variants '[\n", - " {\n", - " \"ModelName\": \"gpt-j\",\n", - " \"VariantName\": \"AllTraffic\",\n", - " \"InstanceType\": \"ml.g5.48xlarge\",\n", - " \"InitialInstanceCount\": 1,\n", - " \"ModelDataDownloadTimeoutInSeconds\": 1800,\n", - " \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600\n", - " }\n", - " ]'" + "We are ready to create an endpoint using the model and the endpoint configuration created from above steps. " ] }, { @@ -346,10 +361,10 @@ "metadata": {}, "outputs": [], "source": [ - "%%sh\n", - "aws sagemaker create-endpoint \\\n", - "--endpoint-name gpt-j \\\n", - "--endpoint-config-name gpt-j-config" + "endpoint_name = \"gpt-j\" + time_stamp\n", + "ep_res = sm_client.create_endpoint(\n", + " EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n", + ")" ] }, { @@ -367,11 +382,10 @@ "metadata": {}, "outputs": [], "source": [ - "import boto3, json\n", + "import json\n", "\n", "client = boto3.client(\"sagemaker-runtime\")\n", "\n", - "endpoint_name = \"gpt-j\" # Your endpoint name.\n", "content_type = \"text/plain\" # The MIME type of the input data in the request body.\n", "payload = \"Amazon.com is the best\" # Payload for inference.\n", "response = client.invoke_endpoint(\n", @@ -395,8 +409,7 @@ "metadata": {}, "outputs": [], "source": [ - "%%sh\n", - "aws sagemaker delete-endpoint --endpoint-name gpt-j" + "sm_client.delete_endpoint(endpoint_name)" ] }, {