
Commit

grammatical changes
Qingwei Li committed Sep 9, 2022
1 parent 875c110 commit f211fb2
Showing 1 changed file with 2 additions and 3 deletions.
@@ -313,7 +313,7 @@
"id": "22d2fc2b",
"metadata": {},
"source": [
"Note that we configured `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. "
"Note that we configure `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` to acommodate the large size of our model. "
]
},
{
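For the timeout note in the cell above: `ModelDataDownloadTimeoutInSeconds` and `ContainerStartupHealthCheckTimeoutInSeconds` are fields on the endpoint configuration's production variant. The following is a minimal boto3 sketch of where they would be set; the config name, model name, instance type, and timeout values are illustrative assumptions, not taken from this notebook.

```python
import boto3

sm_client = boto3.client("sagemaker")

# Hypothetical names and values for illustration only; the notebook's real
# endpoint configuration is outside this diff hunk.
sm_client.create_endpoint_config(
    EndpointConfigName="gpt-j-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "gpt-j-model",
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # Give the endpoint extra time to download and load a multi-GB model.
            "ModelDataDownloadTimeoutInSeconds": 2400,
            "ContainerStartupHealthCheckTimeoutInSeconds": 2400,
        }
    ],
)
```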
@@ -373,7 +373,6 @@
"\n",
"endpoint_name = \"gpt-j\" # Your endpoint name.\n",
"content_type = \"text/plain\" # The MIME type of the input data in the request body.\n",
"# accept = \"...\" # The desired MIME type of the inference in the response.\n",
"payload = \"Amazon.com is the best\" # Payload for inference.\n",
"response = client.invoke_endpoint(\n",
" EndpointName=endpoint_name, ContentType=content_type, Body=payload\n",
@@ -407,7 +406,7 @@
"source": [
"## Conclusion\n",
"\n",
"In this notebook, you used tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once allowing for faster inference latency when a low batch size is used. Here, we used open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.\n",
"In this notebook, you use tensor parallelism to partition a large language model across multiple GPUs for low latency inference. With tensor parallelism, multiple GPUs work on the same model layer at once allowing for faster inference latency when a low batch size is used. Here, we use open source DeepSpeed as the model parallel library to partition the model and open source Deep Java Library Serving as the model serving solution.\n",
"\n",
"As a next step, you can experiment with larger models from Hugging Face such as GPT-NeoX. You can also adjust the tensor parallel degree to see the impact to latency with models of different sizes."
]
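To experiment with the tensor parallel degree mentioned in the conclusion, one option is to regenerate the `serving.properties` file consumed by DJL Serving with a different `option.tensor_parallel_degree` and redeploy. This is a minimal sketch only; the full property set (model location, entry point, and so on) comes from the notebook and is not reproduced here.

```python
# Hedged sketch: write a serving.properties with a different tensor parallel
# degree before re-packaging and re-deploying the model. Property names follow
# DJL Serving conventions; the values shown are illustrative.
serving_properties = """\
engine=DeepSpeed
option.tensor_parallel_degree=4
"""

with open("serving.properties", "w") as f:
    f.write(serving_properties)
```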
