add text embedding API example doc
Signed-off-by: Yaliang Wu <[email protected]>
ylwu-amzn committed Jan 25, 2023
1 parent 1ba4de8 commit db760fd
Showing 2 changed files with 293 additions and 2 deletions.
291 changes: 291 additions & 0 deletions docs/model_serving_framework/text_embedding_model_examples.md
@@ -0,0 +1,291 @@
The model serving framework (released in 2.4) supports running NLP models inside an OpenSearch cluster.
It currently supports text embedding models only. This document shows examples of how to upload
and run text embedding models via the ml-commons REST APIs. We use [Huggingface](https://huggingface.co/) models to build these examples.
Read the [ml-commons doc](https://opensearch.org/docs/latest/ml-commons-plugin/model-serving-framework/) to learn more.

We build the examples with the Huggingface sentence transformers model [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2).

From 2.5, we support uploading [TorchScript](https://pytorch.org/docs/stable/jit.html) and [ONNX](https://onnx.ai/) models.

Note:
- This doc doesn't cover how to trace models to TorchScript/ONNX in detail; the tracing snippets below are rough, illustrative sketches only.
- The model serving framework is an experimental feature. If you find a bug or have a suggestion, feel free to open a GitHub issue.

# 0. Prepare cluster

We suggest starting dedicated ML nodes to separate ML workloads from data nodes. From 2.5, ml-commons runs ML tasks only on ML nodes by default.
If you want to run some test models on data nodes, you can disable the cluster setting `plugins.ml_commons.only_run_on_ml_node`:

```
PUT /_cluster/settings
{
  "persistent" : {
    "plugins.ml_commons.only_run_on_ml_node" : false
  }
}
```

2.5 also adds a native memory circuit breaker to avoid out-of-memory errors when loading too many models.
By default, the native memory threshold is 90%; if memory usage exceeds the threshold, ml-commons throws an exception.
For testing purposes, you can set `plugins.ml_commons.native_memory_threshold` to 100 to disable it:

```
PUT _cluster/settings
{
  "persistent" : {
    "plugins.ml_commons.native_memory_threshold" : 100
  }
}
```
ml-commons supports several other settings that you can tune according to your requirements. Find more in the [ml-commons settings doc](https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings).

# 1. Torchscript
We can trace the example model [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to TorchScript in two ways:

## 1.1 trace sentence transformers model
If you have [`sentence-transformers`](https://pypi.org/project/sentence-transformers/) installed, you can trace the model directly: `SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')`.
A sentence transformers model already includes post-processing logic, so there is no need to specify `pooling_mode`/`normalize_result` when uploading the model.
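
For illustration, the tracing might look roughly like the sketch below. This is an assumption-laden sketch, not the exact script used to produce the example zip in Step 1; it assumes `sentence-transformers` and `torch` are installed, and it passes the tokenized features as a single dict argument with `strict=False` because the model's forward pass consumes and returns dicts.

```
# Rough sketch: trace a sentence-transformers model to TorchScript.
# Not the exact script used to produce the example zip below.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model.eval()

# Tokenize an example sentence to get representative tracing inputs.
features = model.tokenize(["today is sunny"])

# strict=False: the model's forward pass consumes and returns dicts.
traced = torch.jit.trace(
    model,
    ({"input_ids": features["input_ids"], "attention_mask": features["attention_mask"]},),
    strict=False,
)
torch.jit.save(traced, "all-MiniLM-L6-v2_sentence-transformer.pt")
# The uploaded zip typically bundles a traced model like this with its tokenizer.json.
```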

- Step 1: upload the model. This step saves the model to the model index.
```
# Sample request
POST /_plugins/_ml/models/_upload
{
  "name": "sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://github.com/ylwu-amzn/ml-commons/blob/2.x_custom_m_helper/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}
# Sample response
{
  "task_id": "zgla5YUB1qmVrJFlwzW-",
  "status": "CREATED"
}
```
Then you can query the task by calling the get task API. When the task completes, you will see a `model_id` in the response.
```
GET /_plugins/_ml/tasks/<task_id>
# Sample request
GET /_plugins/_ml/tasks/zgla5YUB1qmVrJFlwzW-
# Sample response
{
  "model_id": "zwla5YUB1qmVrJFlwzXJ",
  "task_type": "UPLOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "0TLL4hHxRv6_G3n6y1l0BQ"
  ],
  "create_time": 1674590208957,
  "last_update_time": 1674590211718,
  "is_async": true
}
```
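
Rather than checking by hand, a client can loop on the get task API until the task finishes. Here is a minimal Python sketch (illustrative only, not part of ml-commons; it assumes the `requests` package and a local, unsecured cluster at `http://localhost:9200`, and `wait_for_task` is a hypothetical helper name). The same helper works for the load task in Step 2.

```
# Sketch: poll the get-task API until an upload/load task completes.
import time
import requests

OPENSEARCH = "http://localhost:9200"  # assumed local, unsecured test cluster

def wait_for_task(task_id, timeout_s=600, poll_s=2):
    """Return the task document once its state is COMPLETED."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        task = requests.get(f"{OPENSEARCH}/_plugins/_ml/tasks/{task_id}").json()
        if task.get("state") == "COMPLETED":
            return task
        if task.get("state") == "FAILED":
            raise RuntimeError(f"task {task_id} failed: {task}")
        time.sleep(poll_s)
    raise TimeoutError(f"task {task_id} did not complete within {timeout_s}s")

upload_task = wait_for_task("zgla5YUB1qmVrJFlwzW-")  # task_id from the upload response
model_id = upload_task["model_id"]
```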
- Step 2: deploy/load the model. This step reads the model content from the model index and deploys it to the target nodes.

```
# Deploy to all nodes
POST /_plugins/_ml/models/<model_id>/_load
# Deploy to specific nodes
POST /_plugins/_ml/models/<model_id>/_load
{
  "node_ids": [ "<node_id>", "<node_id>" ]  # replace "<node_id>" with your own node IDs
}
# Sample request
POST /_plugins/_ml/models/zwla5YUB1qmVrJFlwzXJ/_load
# Sample response
{
  "task_id": "0Alb5YUB1qmVrJFlADVT",
  "status": "CREATED"
}
```
Similar to the upload task, you can query the load task by calling the get task API (the polling sketch above works here too). When the task completes, you can call the predict API.

```
# Sample request
GET /_plugins/_ml/tasks/0Alb5YUB1qmVrJFlADVT
# Sample response
{
  "model_id": "zwla5YUB1qmVrJFlwzXJ",
  "task_type": "LOAD_MODEL",
  "function_name": "TEXT_EMBEDDING",
  "state": "COMPLETED",
  "worker_node": [
    "0TLL4hHxRv6_G3n6y1l0BQ"
  ],
  "create_time": 1674590224467,
  "last_update_time": 1674590226409,
  "is_async": true
}
```

- Step 3: inference/predict

```
POST /_plugins/_ml/_predict/text_embedding/<model_id>
# Sample request
POST /_plugins/_ml/_predict/text_embedding/zwla5YUB1qmVrJFlwzXJ
{
  "text_docs": [ "today is sunny" ],
  "return_number": true,
  "target_response": [ "sentence_embedding" ]
}
# Sample response
{
  "inference_results": [
    {
      "output": [
        {
          "name": "sentence_embedding",
          "data_type": "FLOAT32",
          "shape": [
            384
          ],
          "data": [
            -0.023314998,
            0.08975688,
            0.07847973,
            ...
          ]
        }
      ]
    }
  ]
}
```
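
If you call the predict API from an application, a minimal Python sketch might look like this (illustrative only; it assumes the `requests` package and a local, unsecured cluster at `http://localhost:9200`):

```
# Sketch: call the text_embedding predict API from Python.
import requests

OPENSEARCH = "http://localhost:9200"   # assumed local, unsecured test cluster
model_id = "zwla5YUB1qmVrJFlwzXJ"      # model_id from the load step above

resp = requests.post(
    f"{OPENSEARCH}/_plugins/_ml/_predict/text_embedding/{model_id}",
    json={
        "text_docs": ["today is sunny"],
        "return_number": True,
        "target_response": ["sentence_embedding"],
    },
)
resp.raise_for_status()

# Pull the embedding out of the response shape shown above.
vector = resp.json()["inference_results"][0]["output"][0]["data"]
print(len(vector))  # 384 for all-MiniLM-L6-v2
```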
- Step 4 (optional): profile

You can use the profile API to get model deployment information and monitor inference latency. Refer to [this doc](https://opensearch.org/docs/latest/ml-commons-plugin/api/#profile).

By default, it monitors the last 100 predict requests. You can tune the [plugins.ml_commons.monitoring_request_count](https://opensearch.org/docs/latest/ml-commons-plugin/cluster-settings/#predict-monitoring-requests) setting to control how many requests are monitored.

```
# Sample request
POST /_plugins/_ml/profile/models/yQlW5YUB1qmVrJFlPDXc
# Sample response
{
  "nodes": {
    "0TLL4hHxRv6_G3n6y1l0BQ": {
      "models": {
        "yQlW5YUB1qmVrJFlPDXc": {
          "model_state": "LOADED",
          "predictor": "org.opensearch.ml.engine.algorithms.text_embedding.TextEmbeddingModel@1a0b0793",
          "target_worker_nodes": [
            "0TLL4hHxRv6_G3n6y1l0BQ"
          ],
          "worker_nodes": [
            "0TLL4hHxRv6_G3n6y1l0BQ"
          ],
          "model_inference_stats": { // in milliseconds; time spent in the model itself
            "count": 10,
            "max": 35.021633,
            "min": 31.924348,
            "average": 33.9418092,
            "p50": 34.0341065,
            "p90": 34.8487421,
            "p99": 35.00434391
          },
          "predict_request_stats": { // in milliseconds; end-to-end time, including the model and all other parts
            "count": 10,
            "max": 36.037992,
            "min": 32.903162,
            "average": 34.9731029,
            "p50": 35.073967999999994,
            "p90": 35.868510300000004,
            "p99": 36.02104383
          }
        }
      }
    }
  }
}
```
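
To pull just the latency numbers out of that response, a small Python sketch (same assumptions as the earlier sketches: the `requests` package and a local, unsecured cluster):

```
# Sketch: extract per-node predict latency stats from the profile response.
import requests

OPENSEARCH = "http://localhost:9200"   # assumed local, unsecured test cluster
model_id = "yQlW5YUB1qmVrJFlPDXc"      # model_id from the sample above

profile = requests.post(f"{OPENSEARCH}/_plugins/_ml/profile/models/{model_id}").json()
for node_id, node in profile.get("nodes", {}).items():
    stats = node["models"][model_id].get("predict_request_stats", {})
    if stats:
        print(f"{node_id}: count={stats['count']}, p90={stats['p90']:.2f} ms")
```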
- Step 5: unload the model. This step removes the model from the nodes; the model document won't be deleted from the model index.

```
# Unload one model
POST /_plugins/_ml/models/<model_id>/_unload
# Unload all models
POST /_plugins/_ml/models/_unload
# Sample request
POST /_plugins/_ml/models/zwla5YUB1qmVrJFlwzXJ/_unload
# Sample response
{
  "0TLL4hHxRv6_G3n6y1l0BQ": { # node id
    "stats": {
      "zwla5YUB1qmVrJFlwzXJ": "unloaded"
    }
  }
}
```

## 1.2 trace huggingface transformers model
Without [`sentence-transformers`](https://pypi.org/project/sentence-transformers/) installed, you can trace the model loaded with `AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')`.
A model traced this way doesn't include post-processing, so you have to specify the post-processing logic with `pooling_mode` and `normalize_result`.
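
For illustration, tracing the plain Huggingface model might look roughly like this (again an assumption-laden sketch, not the exact script used to produce the example zip below; `torchscript=True` configures the model to return tuples, which tracing requires):

```
# Rough sketch: trace a plain Huggingface transformers model to TorchScript.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModel.from_pretrained(name, torchscript=True)
tokenizer = AutoTokenizer.from_pretrained(name)
model.eval()

inputs = tokenizer("today is sunny", return_tensors="pt")
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))
torch.jit.save(traced, "all-MiniLM-L6-v2_huggingface.pt")
```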

Supported pooling methods: `mean`, `mean_sqrt_len`, `max`, `weightedmean`, `cls`.

The only difference from section 1.1 is the upload request body; for load/predict/profile/unload, refer to ["1.1 trace sentence transformers model"](#11-trace-sentence-transformers-model).

```
# Sample request
POST /_plugins/_ml/models/_upload
{
  "name": "sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "huggingface_transformers",
    "pooling_mode": "mean",
    "normalize_result": "true"
  },
  "url": "https://github.com/ylwu-amzn/ml-commons/blob/2.x_custom_m_helper/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_huggingface.zip?raw=true"
}
```

# 2. ONNX
You can export a PyTorch model to ONNX, then upload and run it with the ml-commons APIs.
This example ONNX model also needs the post-processing logic specified with `pooling_mode` and `normalize_result`.
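
For illustration, one common way to export the model is `torch.onnx.export` (a rough sketch under assumptions, not the exact script used to produce the example zip below):

```
# Rough sketch: export the Huggingface model to ONNX.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
model = AutoModel.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
model.eval()

inputs = tokenizer("today is sunny", return_tensors="pt")
torch.onnx.export(
    model,
    (inputs["input_ids"], inputs["attention_mask"]),
    "all-MiniLM-L6-v2.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    # Dynamic axes so batch size and sequence length are not baked in.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=14,
)
```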

Supported pooling methods: `mean`, `mean_sqrt_len`, `max`, `weightedmean`, `cls`.

The only difference from section 1.1 is the upload request body; for load/predict/profile/unload, refer to ["1.1 trace sentence transformers model"](#11-trace-sentence-transformers-model).

```
# Sample request
POST /_plugins/_ml/models/_upload
{
  "name": "sentence-transformers/all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "test model",
  "model_format": "ONNX",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "huggingface_transformers",
    "pooling_mode": "mean",
    "normalize_result": "true"
  },
  "url": "https://github.com/ylwu-amzn/ml-commons/raw/2.x_custom_m_helper/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_onnx.zip?raw=true"
}
```
@@ -230,7 +230,7 @@ private MLModelState getNewModelState(
) {
Set<String> loadTaskNodes = loadingModels.get(modelId);
if (loadTaskNodes != null && loadTaskNodes.size() > 0 && state != MLModelState.LOADING) {
- // no
+ // If some node/nodes are loading the model and model state is not LOADING, then set model state as LOADING.
return MLModelState.LOADING;
}
int currentWorkerNodeCount = modelWorkerNodes.containsKey(modelId) ? modelWorkerNodes.get(modelId).size() : 0;
@@ -254,7 +254,7 @@ private MLModelState getNewModelState(
// it happens.
log
.warn(
"Model {} loaded on more nodes [{}] than planing worker node[{}]",
"Model {} loaded on more nodes [{}] than planning worker node [{}]",
modelId,
currentWorkerNodeCount,
planningWorkerNodeCount
