Boto3 version notebook #3597

Merged · 14 commits · Sep 12, 2022
grammatical changes
Qingwei Li committed Sep 9, 2022
commit f211fb239efca196b7ab0d49ad67f431dd021889
@@ -313,7 +313,7 @@
"id": "22d2fc2b",
"metadata": {},
"source": [
"Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. "
"Note that we configure `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. "
]
},
{
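For reference, here is a minimal sketch of where these two timeouts are typically passed when creating the endpoint configuration with boto3; the endpoint config name, model name, instance type, and timeout values below are illustrative assumptions, not taken from the notebook.

```python
import boto3

sm_client = boto3.client("sagemaker")

# Hypothetical names and values for illustration only.
sm_client.create_endpoint_config(
    EndpointConfigName="gpt-j-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "gpt-j",
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # Give the endpoint extra time to download and load a multi-GB model.
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        }
    ],
)
```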
@@ -373,7 +373,6 @@
"\n",
"endpoint_name = \"gpt-j\" # Your endpoint name.\n",
"content_type = \"text/plain\" # The MIME type of the input data in the request body.\n",
"# accept = \"...\" # The desired MIME type of the inference in the response.\n",
"payload = \"Amazon.com is the best\" # Payload for inference.\n",
"response = client.invoke_endpoint(\n",
" EndpointName=endpoint_name, ContentType=content_type, Body=payload\n",
@@ -407,7 +406,7 @@
"source": [
"## Conclusion\n",
"\n",
"In this notebook, you used tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once allowing for faster inference latency when a low batch size is used. Here, we used open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.\n",
"In this notebook, you use tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once allowing for faster inference latency when a low batch size is used. Here, we use open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.\n",
"\n",
"As a next step, you can experiment with larger models from Hugging Face such as GPT-NeoX. You can also adjust the tensor parallel degree to see the impact to latency with models of different sizes."
]
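To illustrate the suggestion about adjusting the tensor parallel degree: with DJL Serving's DeepSpeed engine the degree is usually set in the model's serving.properties file. The sketch below writes such a file; the engine name, property key, and value of 4 are assumptions based on common DJL Serving configurations, not settings taken from this notebook.

```python
# Minimal sketch: write a serving.properties file asking DJL Serving to
# shard the model across 4 GPUs with the DeepSpeed engine (assumed settings).
properties = [
    "engine=DeepSpeed",
    "option.tensor_parallel_degree=4",  # number of GPUs that share each layer
]

with open("serving.properties", "w") as f:
    f.write("\n".join(properties) + "\n")
```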