Update torchserve docs (kubeflow#1271)
* Update torchserve doc
* Fix autoscaling/canary example
* Reorganize torchserve examples
* Add bert example

Showing 21 changed files with 421 additions and 259 deletions.

# Autoscaling

KFServing supports both the Knative Pod Autoscaler (KPA) and the Kubernetes Horizontal Pod Autoscaler (HPA). The features and limitations of each autoscaler are listed below.

IMPORTANT: If you want to use the Kubernetes Horizontal Pod Autoscaler (HPA), you must install the [HPA extension](https://knative.dev/docs/install/any-kubernetes-cluster/#optional-serving-extensions) after you install Knative Serving.

Knative Pod Autoscaler (KPA)
- Part of the Knative Serving core and enabled by default once Knative Serving is installed.
- Supports scale-to-zero functionality.
- Does not support CPU-based autoscaling.

Horizontal Pod Autoscaler (HPA)
- Not part of the Knative Serving core, and must be enabled after Knative Serving installation.
- Does not support scale-to-zero functionality.
- Supports CPU-based autoscaling.
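
If you choose the HPA path, the autoscaler class and metric are selected with Knative annotations on the InferenceService, in the same way the `autoscaling.knative.dev/target` annotation is used in the soft limit example below. The following is a minimal sketch, assuming the HPA extension is installed and that KFServing propagates these annotations to the underlying Knative revision; the 80% CPU target is an illustrative value:

```yaml
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve"
  annotations:
    # Switch from the default KPA to the HPA autoscaler class
    autoscaling.knative.dev/class: "hpa.autoscaling.knative.dev"
    # Scale on CPU utilization instead of request concurrency
    autoscaling.knative.dev/metric: "cpu"
    # Target average CPU utilization (percent) across pods
    autoscaling.knative.dev/target: "80"
spec:
  predictor:
    pytorch:
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/torchserve/image_classifier"
```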

## Create InferenceService with concurrency target

### Soft limit

You can configure the InferenceService with the annotation `autoscaling.knative.dev/target` to set a soft limit. The soft limit is a targeted limit rather than a strictly enforced bound; if there is a sudden burst of requests, this value can be exceeded.

```yaml
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve"
  annotations:
    autoscaling.knative.dev/target: "10"
spec:
  predictor:
    pytorch:
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/torchserve/image_classifier"
```

### Hard limit

You can also configure the InferenceService with the `containerConcurrency` field to set a hard limit. The hard limit is an enforced upper bound: if concurrency reaches the hard limit, surplus requests are buffered and must wait until enough capacity is free to execute them.

```yaml
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve"
spec:
  predictor:
    containerConcurrency: 10
    pytorch:
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/torchserve/image_classifier"
```
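
Scaling bounds can be combined with the concurrency settings above. The following is a minimal sketch assuming the v1beta1 `minReplicas` and `maxReplicas` fields on the predictor spec; the bounds themselves are illustrative values. Keeping `minReplicas` at 1 also keeps one pod warm instead of scaling to zero:

```yaml
apiVersion: "serving.kubeflow.org/v1beta1"
kind: "InferenceService"
metadata:
  name: "torchserve"
spec:
  predictor:
    # Keep at least one pod warm and cap scale-out at five pods (illustrative bounds)
    minReplicas: 1
    maxReplicas: 5
    # Hard per-pod concurrency limit, as in the example above
    containerConcurrency: 10
    pytorch:
      protocolVersion: v2
      storageUri: "gs://kfserving-examples/models/torchserve/image_classifier"
```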

### Create the InferenceService

```bash
kubectl apply -f torchserve.yaml
```

Expected Output

```bash
inferenceservice.serving.kubeflow.org/torchserve created
```

## Run inference with concurrent requests

The first step is to [determine the ingress IP and ports](../../../README.md#determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

Install the hey load generator:

```bash
go get -u github.com/rakyll/hey
```

Send concurrent inference requests:

```bash
MODEL_NAME=mnist
SERVICE_HOSTNAME=$(kubectl get inferenceservice torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
./hey -m POST -z 30s -D ./mnist.json -host ${SERVICE_HOSTNAME} http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict
```

### Check the pods that are scaled up

`hey` generates 50 concurrent requests by default, so with a container concurrency target of 10 the InferenceService scales out to 50 / 10 = 5 pods.

```bash
kubectl get pods -n kfserving-test
NAME                                                              READY   STATUS        RESTARTS   AGE
torchserve-predictor-default-cj2d8-deployment-69444c9c74-67qwb    2/2     Terminating   0          103s
torchserve-predictor-default-cj2d8-deployment-69444c9c74-nnxk8    2/2     Terminating   0          95s
torchserve-predictor-default-cj2d8-deployment-69444c9c74-rq8jq    2/2     Running       0          50m
torchserve-predictor-default-cj2d8-deployment-69444c9c74-tsrwr    2/2     Running       0          113s
torchserve-predictor-default-cj2d8-deployment-69444c9c74-vvpjl    2/2     Running       0          109s
torchserve-predictor-default-cj2d8-deployment-69444c9c74-xvn7t    2/2     Terminating   0          103s
```

5 changes: 3 additions & 2 deletions in ...mples/v1beta1/torchserve/autoscaling.yaml → ...1/torchserve/autoscaling/autoscaling.yaml

# TorchServe example with Huggingface BERT model

In this example we show how to serve [Huggingface Transformers with TorchServe](https://github.com/pytorch/serve/tree/master/examples/Huggingface_Transformers) on KFServing.

## Model archive file creation

Clone the [pytorch/serve](https://github.com/pytorch/serve) repository, navigate to `examples/Huggingface_Transformers`, and follow the steps for creating the MAR file, including the serialized model and other dependent files. TorchServe supports both eager mode and TorchScript models; here we save the pretrained model in eager mode.

```bash
torch-model-archiver --model-name BERTSeqClassification --version 1.0 \
  --serialized-file Transformer_model/pytorch_model.bin \
  --handler ./Transformer_handler_generalized.py \
  --extra-files "Transformer_model/config.json,./setup_config.json,./Seq_classification_artifacts/index_to_name.json"
```

## Create the InferenceService

Apply the CRD:

```bash
kubectl apply -f bert.yaml
```

Expected Output

```bash
inferenceservice.serving.kubeflow.org/torchserve-bert created
```

## Run a prediction

The first step is to [determine the ingress IP and ports](../../../../README.md#determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=torchserve-bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -n <namespace> -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/BERTSeqClassification:predict -d @./sample_text.txt
```

Expected Output

```bash
*   Trying 44.239.20.204...
* Connected to a881f5a8c676a41edbccdb0a394a80d6-2069247558.us-west-2.elb.amazonaws.com (44.239.20.204) port 80 (#0)
> PUT /v1/models/BERTSeqClassification:predict HTTP/1.1
> Host: torchserve-bert.kfserving-test.example.com
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Length: 79
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< cache-control: no-cache; no-store, must-revalidate, private
< content-length: 8
< date: Wed, 04 Nov 2020 10:54:49 GMT
< expires: Thu, 01 Jan 1970 00:00:00 UTC
< pragma: no-cache
< x-request-id: 4b54d3ac-185f-444c-b344-b8a785fdeb50
< x-envoy-upstream-service-time: 2085
< server: istio-envoy
<
* Connection #0 to host torchserve-bert.kfserving-test.example.com left intact
Accepted
```

## Captum Explanations

To understand the word importances and attributions when we make an explanation request, we use Captum Insights with the Huggingface Transformers pretrained model.

```bash
MODEL_NAME=torchserve-bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice ${MODEL_NAME} -n <namespace> -o jsonpath='{.status.url}' | cut -d "/" -f 3)

curl -v -H "Host: ${SERVICE_HOSTNAME}" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/BERTSeqClassification:explain -d @./sample_text.txt
```

Expected output

```bash
*   Trying ::1:8080...
* Connected to localhost (::1) port 8080 (#0)
> POST /v1/models/BERTSeqClassification:explain HTTP/1.1
> Host: torchserve-bert.default.example.com
> User-Agent: curl/7.73.0
> Accept: */*
> Content-Length: 84
> Content-Type: application/x-www-form-urlencoded
>
Handling connection for 8080
* upload completely sent off: 84 out of 84 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< content-length: 292
< content-type: application/json; charset=UTF-8
< date: Sun, 27 Dec 2020 05:53:52 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 5769
<
* Connection #0 to host localhost left intact
{"explanations": [{"importances": [0.0, -0.6324463574494716, -0.033115653530477414, 0.2681695752722339, -0.29124745608778546, 0.5422589681903883, -0.3848768219546909, 0.0],
"words": ["[CLS]", "bloomberg", "has", "reported", "on", "the", "economy", "[SEP]"], "delta": -0.0007350619859377225}]}
```

apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: "torchserve-bert"
spec:
  predictor:
    pytorch:
      protocolVersion: v2
      storageUri: gs://kfserving-examples/models/torchserve/huggingface
      # storageUri: pvc://model-pv-claim
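
BERT inference is compute heavy, so you may want to schedule the predictor on a GPU or give it explicit resource bounds. Below is a minimal sketch of the same manifest with `resources` added under the predictor; the request/limit values and the `nvidia.com/gpu` resource name (which requires the NVIDIA device plugin) are assumptions to adjust for your cluster:

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: "torchserve-bert"
spec:
  predictor:
    pytorch:
      protocolVersion: v2
      storageUri: gs://kfserving-examples/models/torchserve/huggingface
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
          nvidia.com/gpu: "1"  # assumed GPU resource; drop this line for CPU-only serving
```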

inference_address=http://0.0.0.0:8085
management_address=http://0.0.0.0:8081
number_of_netty_threads=4
job_queue_size=10
model_store=/mnt/models/model-store
model_snapshot={"name":"startup.cfg","modelCount":1,"models":{"bert":{"1.0":{"defaultVersion":true,"marName":"BERTSeqClassification.mar","minWorkers":1,"maxWorkers":5,"batchSize":1,"maxBatchDelay":5000,"responseTimeout":120}}}}