Boto3 version notebook #3597

Merged (14 commits), Sep 12, 2022
@@ -22,6 +22,16 @@
"In this notebook, we deploy a PyTorch GPT-J model from Hugging Face with 6 billion parameters across two GPUs on an Amazon SageMaker ml.g5.48xlarge instance. DeepSpeed is used for tensor parallelism inference while DJLServing handles inference requests and the distributed workers. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d6ed354b",
"metadata": {},
"outputs": [],
"source": [
"!pip install boto3==1.24.68"
]
},
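{
"cell_type": "markdown",
"id": "b3f1a2c0",
"metadata": {},
"source": [
"As a quick sanity check (this cell is an illustrative addition, not part of the original flow), we can confirm that the pinned version is the one the kernel actually loads. If the printed version does not match, restart the kernel so the freshly installed package takes effect."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4d2e3f1",
"metadata": {},
"outputs": [],
"source": [
"import boto3\n",
"\n",
"# Verify the boto3 version pinned above is active in this kernel.\n",
"print(boto3.__version__)"
]
},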
{
"cell_type": "markdown",
"id": "81c2bdf4",
@@ -179,13 +189,13 @@
"source": [
"### Setup serving.properties\n",
"\n",
"User needs to add engine Rubikon as shown below. If you would like to control how many worker groups, you can set\n",
"User needs to add engine Rubikon as shown below. If you would like to control how many worker groups, you can set by adding these lines in the below file.\n",
"\n",
"```\n",
"gpu.minWorkers=1\n",
"gpu.maxWorkers=1\n",
"```\n",
"by adding these lines in the below file. By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`."
"By default, we will create as much worker group as possible based on `gpu_numbers/tensor_parallel_degree`."
]
},
{
@@ -196,7 +206,6 @@
"outputs": [],
"source": [
"%%writefile serving.properties\n",
"\n",
"engine = Rubikon"
]
},
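{
"cell_type": "markdown",
"id": "d5e3f4a2",
"metadata": {},
"source": [
"For illustration only (we do not write this variant, so the file written above stays as-is), pinning the worker group count would make the file read:\n",
"\n",
"```\n",
"engine = Rubikon\n",
"gpu.minWorkers=1\n",
"gpu.maxWorkers=1\n",
"```"
]
},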
@@ -221,10 +230,9 @@
"account = session.account_id()\n",
"region = session.boto_region_name\n",
"img = \"djl_deepspeed\"\n",
"fullname = account + \".dkr.ecr.\" + region + \"amazonaws.com/\" + img + \":latest\"\n",
"\n",
"fullname = account + \".dkr.ecr.\" + region + \".amazonaws.com/\" + img + \":latest\"\n",
"bucket = session.default_bucket()\n",
"path = \"s3://\" + bucket + \"/DEMO-djl-big-model/\""
"path = \"s3://\" + bucket + \"/DEMO-djl-big-model\""
]
},
{
@@ -253,7 +261,9 @@
"metadata": {},
"outputs": [],
"source": [
"!aws s3 cp gpt-j.tar.gz {path}"
"model_s3_url = sagemaker.s3.S3Uploader.upload(\n",
" \"gpt-j.tar.gz\", path, kms_key=None, sagemaker_session=session\n",
")"
]
},
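{
"cell_type": "markdown",
"id": "e6f4a5b3",
"metadata": {},
"source": [
"Optionally (an extra check, not in the original notebook), we can list the prefix to confirm the artifact uploaded where we expect. `S3Downloader.list` is part of the SageMaker Python SDK."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f7a5b6c4",
"metadata": {},
"outputs": [],
"source": [
"from sagemaker.s3 import S3Downloader\n",
"\n",
"# Confirm gpt-j.tar.gz landed under the expected S3 prefix.\n",
"print(S3Downloader.list(path, sagemaker_session=session))"
]
},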
{
@@ -266,77 +276,82 @@
},
{
"cell_type": "markdown",
"id": "f96c494a",
"id": "32589338",
"metadata": {},
"source": [
"First let us make sure we have the lastest awscli"
"Now we create our [SageMaker model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model). Make sure your execution role has access to your model artifacts and ECR image. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b665515",
"id": "026d27d2",
"metadata": {},
"outputs": [],
"source": [
"!pip3 install --upgrade --user awscli"
"from datetime import datetime\n",
"\n",
"sm_client = boto3.client(\"sagemaker\")\n",
"\n",
"time_stamp = datetime.now().strftime(\"%Y-%m-%d-%H-%M-%S\")\n",
"model_name = \"gpt-j-\" + time_stamp\n",
"\n",
"create_model_response = sm_client.create_model(\n",
" ModelName=model_name,\n",
" ExecutionRoleArn=session.get_caller_identity_arn(),\n",
" PrimaryContainer={\n",
" \"Image\": fullname,\n",
" \"ModelDataUrl\": model_s3_url,\n",
" \"Environment\": {\"TENSOR_PARALLEL_DEGREE\": \"2\"},\n",
" },\n",
")"
]
},
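{
"cell_type": "markdown",
"id": "a8b6c7d5",
"metadata": {},
"source": [
"As an optional check (an addition for illustration), `describe_model` confirms the model was registered with the image and artifact location we intended."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9c7d8e6",
"metadata": {},
"outputs": [],
"source": [
"# Inspect the registered model's container image and model data location.\n",
"desc = sm_client.describe_model(ModelName=model_name)\n",
"print(desc[\"PrimaryContainer\"][\"Image\"])\n",
"print(desc[\"PrimaryContainer\"][\"ModelDataUrl\"])"
]
},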
{
"cell_type": "markdown",
"id": "32589338",
"id": "22d2fc2b",
"metadata": {},
"source": [
"You should see two images from code above. Please note the image name similar to`<AWS_account_ID>.dkr.ecr.us-east-1.amazonaws.com/djl_deepspeed`. This is the ECR image URL that we need for later use. \n",
"\n",
"Now we create our [SageMaker model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html). Make sure you provide an IAM role that SageMaker can assume to access model artifacts and docker image for deployment on ML compute hosting instances. In addition, you also use the IAM role to manage permissions the inference code needs. Please check out our SageMaker Roles [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for more details. \n",
"\n",
" <span style=\"color:red\"> You must enter ECR image name, S3 path for the model file, and an execution-role-arn</span> in the code below."
"Now we create an endpoint configuration that SageMaker hosting services uses to deploy models. Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to accommodate the large size of our model. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "026d27d2",
"id": "84e25dd4",
"metadata": {},
"outputs": [],
"source": [
"!aws sagemaker create-model \\\n",
"--model-name gpt-j \\\n",
"--primary-container \\\n",
"Image=<ECR image>,ModelDataUrl={path},Environment={TENSOR_PARALLEL_DEGREE=2} \\\n",
"--execution-role-arn <your execution-role-arn>"
"initial_instance_count = 1\n",
"instance_type = \"ml.g5.48xlarge\"\n",
"variant_name = \"AllTraffic\"\n",
"endpoint_config_name = \"t-j-config-\" + time_stamp\n",
"\n",
"production_variants = [\n",
" {\n",
" \"VariantName\": variant_name,\n",
" \"ModelName\": model_name,\n",
" \"InitialInstanceCount\": initial_instance_count,\n",
" \"InstanceType\": instance_type,\n",
" \"ModelDataDownloadTimeoutInSeconds\": 1800,\n",
" \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600,\n",
" }\n",
"]\n",
"\n",
"endpoint_config = {\n",
" \"EndpointConfigName\": endpoint_config_name,\n",
" \"ProductionVariants\": production_variants,\n",
"}\n",
"\n",
"ep_conf_res = sm_client.create_endpoint_config(**endpoint_config)"
]
},
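{
"cell_type": "markdown",
"id": "c1d8e9f7",
"metadata": {},
"source": [
"Optionally (illustration only), we can read the configuration back to verify the extended timeouts were recorded on the variant."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d2e9f1a8",
"metadata": {},
"outputs": [],
"source": [
"# Verify the timeout settings on the production variant.\n",
"conf = sm_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)\n",
"print(conf[\"ProductionVariants\"][0][\"ModelDataDownloadTimeoutInSeconds\"])\n",
"print(conf[\"ProductionVariants\"][0][\"ContainerStartupHealthCheckTimeoutInSeconds\"])"
]
},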
{
"cell_type": "markdown",
"id": "22d2fc2b",
"id": "e4b3bc26",
"metadata": {},
"source": [
"Note that we configure `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "84e25dd4",
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"aws sagemaker create-endpoint-config \\\n",
" --region $(aws configure get region) \\\n",
" --endpoint-config-name gpt-j-config \\\n",
" --production-variants '[\n",
" {\n",
" \"ModelName\": \"gpt-j\",\n",
" \"VariantName\": \"AllTraffic\",\n",
" \"InstanceType\": \"ml.g5.48xlarge\",\n",
" \"InitialInstanceCount\": 1,\n",
" \"ModelDataDownloadTimeoutInSeconds\": 1800,\n",
" \"ContainerStartupHealthCheckTimeoutInSeconds\": 3600\n",
" }\n",
" ]'"
"We are ready to create an endpoint using the model and the endpoint configuration created from above steps. "
]
},
{
@@ -346,10 +361,10 @@
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"aws sagemaker create-endpoint \\\n",
"--endpoint-name gpt-j \\\n",
"--endpoint-config-name gpt-j-config"
"endpoint_name = \"gpt-j\" + time_stamp\n",
"ep_res = sm_client.create_endpoint(\n",
" EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name\n",
")"
]
},
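{
"cell_type": "markdown",
"id": "e3f1a2b9",
"metadata": {},
"source": [
"Endpoint creation is asynchronous. A boto3 waiter (added here for convenience, not part of the original steps) blocks until the endpoint reaches `InService`; with a model of this size, this can take a while."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f4a2b3c1",
"metadata": {},
"outputs": [],
"source": [
"# Block until the endpoint is InService before sending any requests.\n",
"waiter = sm_client.get_waiter(\"endpoint_in_service\")\n",
"waiter.wait(\n",
"    EndpointName=endpoint_name,\n",
"    WaiterConfig={\"Delay\": 60, \"MaxAttempts\": 60},\n",
")\n",
"print(sm_client.describe_endpoint(EndpointName=endpoint_name)[\"EndpointStatus\"])"
]
},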
{
@@ -367,11 +382,10 @@
"metadata": {},
"outputs": [],
"source": [
"import boto3, json\n",
"import json\n",
"\n",
"client = boto3.client(\"sagemaker-runtime\")\n",
"\n",
"endpoint_name = \"gpt-j\" # Your endpoint name.\n",
"content_type = \"text/plain\" # The MIME type of the input data in the request body.\n",
"payload = \"Amazon.com is the best\" # Payload for inference.\n",
"response = client.invoke_endpoint(\n",
@@ -395,8 +409,7 @@
"metadata": {},
"outputs": [],
"source": [
"%%sh\n",
"aws sagemaker delete-endpoint --endpoint-name gpt-j"
"sm_client.delete_endpoint(endpoint_name)"
]
},
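{
"cell_type": "markdown",
"id": "a5b3c4d2",
"metadata": {},
"source": [
"To avoid leaving resources behind (an optional addition), the endpoint configuration and the model can be deleted as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b6c4d5e3",
"metadata": {},
"outputs": [],
"source": [
"# Remove the remaining resources created in this notebook.\n",
"sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)\n",
"sm_client.delete_model(ModelName=model_name)"
]
},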
{