The model serving framework (released in 2.4) supports running NLP models inside an OpenSearch cluster. It currently supports only text embedding models. This document shows examples of how to upload and run text embedding models through the ml-commons REST APIs. Read the ml-commons doc to learn more.
The examples are built with the Hugging Face sentence transformers model sentence-transformers/all-MiniLM-L6-v2.
Starting with 2.5, ml-commons supports uploading TorchScript and ONNX models.
Note:
- This doc doesn't cover in detail how to trace models to TorchScript/ONNX.
- The model serving framework is an experimental feature. If you find a bug or have a suggestion, feel free to open a GitHub issue.
We suggest starting dedicated ML nodes to separate ML workloads from data nodes. Starting with 2.5, ml-commons runs ML tasks only on ML nodes by default.
If you want to run test models on data nodes, you can disable the cluster setting plugins.ml_commons.only_run_on_ml_node:
PUT /_cluster/settings
{
"persistent" : {
"plugins.ml_commons.only_run_on_ml_node" : false
}
}
In 2.5 we also added a native memory circuit breaker to avoid out-of-memory (OOM) errors when too many models are loaded.
By default, the native memory threshold is 90%. If native memory usage exceeds the threshold, an exception is thrown.
For testing purposes, you can set plugins.ml_commons.native_memory_threshold to 100 to disable the circuit breaker:
PUT _cluster/settings
{
"persistent" : {
"plugins.ml_commons.native_memory_threshold" : 100
}
}
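If you script your test setup, the two settings above can be sent in a single request body. The following is a minimal sketch that only builds the JSON body for PUT /_cluster/settings; the helper name build_ml_settings is hypothetical, and the 0 to 100 range check is an assumption based on the threshold being a percentage.

```python
import json


def build_ml_settings(only_run_on_ml_node=None, native_memory_threshold=None):
    """Build a body for PUT /_cluster/settings (hypothetical helper, not part of ml-commons)."""
    persistent = {}
    if only_run_on_ml_node is not None:
        persistent["plugins.ml_commons.only_run_on_ml_node"] = bool(only_run_on_ml_node)
    if native_memory_threshold is not None:
        # Assumption: the threshold is a percentage, so only 0..100 makes sense.
        if not 0 <= native_memory_threshold <= 100:
            raise ValueError("native_memory_threshold must be between 0 and 100")
        persistent["plugins.ml_commons.native_memory_threshold"] = native_memory_threshold
    return {"persistent": persistent}


# Testing setup from this doc: allow data nodes and disable the circuit breaker.
body = build_ml_settings(only_run_on_ml_node=False, native_memory_threshold=100)
print(json.dumps(body, indent=2))
```

Both values are meant for testing only; on a production cluster you would normally keep the defaults.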
ml-commons supports several other settings that you can tune to your requirements. See the ml-commons settings doc for details.
The example model sentence-transformers/all-MiniLM-L6-v2 can be traced to TorchScript in two ways.
If you have sentence-transformers installed, you can load the model directly with SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2') and then trace it with TorchScript.
Refer to the opensearch-py-ml doc and the opensearch-py-ml code: sentencetransformermodel.py#save_as_pt
A sentence transformer model already includes post-processing logic, so there is no need to specify pooling_mode/normalize_result when uploading the model.
Starting with the 2.6 release, ml-commons supports registering pre-trained models. Starting with 2.8, a model group ID is required to register a model. More details about model groups here
- Create a model group:
POST /_plugins/_ml/model_groups/_register
{
"name": "test_model_group_public",
"description": "This is a public model group"
}
# Sample response
{
"model_group_id": "7IjOsYgBFp6IJxCceZ1-",
"status": "CREATED"
}
Now we can use that model group ID to register a model.
- Step 1: upload the model. This step saves the model to the model index.
# Sample request
POST /_plugins/_ml/models/_register
{
"name": "huggingface/sentence-transformers/all-MiniLM-L12-v2",
"version": "1.0.1",
"model_format": "TORCH_SCRIPT",
"model_group_id": "7IjOsYgBFp6IJxCceZ1-"
}
# Sample response
{
"task_id": "zgla5YUB1qmVrJFlwzW-",
"status": "CREATED"
}
Then you can check the task by calling the get task API. When the task has completed, the response contains model_id.
GET /_plugins/_ml/tasks/<task_id>
# Sample request
GET /_plugins/_ml/tasks/zgla5YUB1qmVrJFlwzW-
# Sample response
{
"model_id": "zwla5YUB1qmVrJFlwzXJ",
"task_type": "UPLOAD_MODEL",
"function_name": "TEXT_EMBEDDING",
"state": "COMPLETED",
"worker_node": [
"0TLL4hHxRv6_G3n6y1l0BQ"
],
"create_time": 1674590208957,
"last_update_time": 1674590211718,
"is_async": true
}
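Because registering is asynchronous, a client usually polls the get task API until the state is COMPLETED. The sketch below shows one way to do that. It takes the HTTP call as a get_task callable so it does not depend on any particular client library; wait_for_task and its parameters are hypothetical names, and the RUNNING and FAILED states and the error field are assumptions that do not appear in the sample response above.

```python
import time


def wait_for_task(get_task, task_id, poll_seconds=1.0, max_polls=60, sleep=time.sleep):
    """Poll a task until it completes and return its model_id.

    get_task is any callable that takes a task id and returns the parsed JSON
    of GET /_plugins/_ml/tasks/<task_id> as a dict.
    """
    for _ in range(max_polls):
        task = get_task(task_id)
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        if state == "FAILED":  # assumption: failed tasks report this state
            raise RuntimeError(f"task {task_id} failed: {task.get('error')}")
        sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not complete after {max_polls} polls")


# Usage with canned responses instead of a live cluster.
canned = iter([
    {"state": "CREATED"},
    {"state": "RUNNING"},
    {"model_id": "zwla5YUB1qmVrJFlwzXJ", "state": "COMPLETED"},
])
model_id = wait_for_task(lambda _task_id: next(canned), "zgla5YUB1qmVrJFlwzW-",
                         sleep=lambda _seconds: None)
print(model_id)
```

In real use, get_task would wrap whatever HTTP client you already use to talk to the cluster. The same loop works for the load task in step 2.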
We can also register a model from a URL. To do that, first update the following cluster setting:
PUT _cluster/settings
{
"persistent" : {
"plugins.ml_commons.allow_registering_model_via_url" : true
}
}
Now we can register the model by URL:
POST /_plugins/_ml/models/_register
{
"name": "sentence-transformers/all-MiniLM-L6-v2",
"version": "1.0.1",
"description": "This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and can be used for tasks like clustering or semantic search.",
"model_task_type": "TEXT_EMBEDDING",
"model_format": "TORCH_SCRIPT",
"model_content_hash_value": "c15f0d2e62d872be5b5bc6c84d2e0f4921541e29fefbef51d59cc10a8ae30e0f",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "sentence_transformers",
"all_config": "{\"_name_or_path\":\"nreimers/MiniLM-L6-H384-uncased\",\"architectures\":[\"BertModel\"],\"attention_probs_dropout_prob\":0.1,\"gradient_checkpointing\":false,\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":384,\"initializer_range\":0.02,\"intermediate_size\":1536,\"layer_norm_eps\":1e-12,\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":6,\"pad_token_id\":0,\"position_embedding_type\":\"absolute\",\"transformers_version\":\"4.8.2\",\"type_vocab_size\":2,\"use_cache\":true,\"vocab_size\":30522}"
},
"model_group_id": "7IjOsYgBFp6IJxCceZ1-",
"url": "https://artifacts.opensearch.org/models/ml-models/huggingface/sentence-transformers/all-MiniLM-L6-v2/1.0.1/torch_script/sentence-transformers_all-MiniLM-L6-v2-1.0.1-torch_script.zip"
}
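Two fields in the request above are easy to get wrong when you prepare your own model zip: model_content_hash_value and all_config. The sketch below assumes that the hash is the SHA-256 hex digest of the zip file (the 64-character value in the sample is consistent with that) and that all_config is the model's config serialized as a JSON string. The helper names are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def encode_all_config(config):
    """Serialize a config dict into the escaped JSON string used for all_config."""
    return json.dumps(config, separators=(",", ":"))


# Demo on a small temporary file standing in for the model zip.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
demo_hash = sha256_of_file(tmp.name)
os.remove(tmp.name)

demo_config = encode_all_config({"model_type": "bert", "hidden_size": 384})
print(demo_hash)
print(demo_config)
```

When the whole request body is later serialized to JSON, the quotes inside all_config are escaped automatically, which produces the backslash-heavy form shown in the sample request.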
- Step 2: deploy/load the model. This step reads the model content from the model index and deploys it to nodes.
# Deploy to all nodes
POST /_plugins/_ml/models/<model_id>/_load
# Deploy to specific nodes
POST /_plugins/_ml/models/<model_id>/_load
{
"node_ids": [ "<node_id>", "<node_id>" ] # replace "node_id" with your own node id
}
# Sample request
POST /_plugins/_ml/models/zwla5YUB1qmVrJFlwzXJ/_load
# Sample response
{
"task_id": "0Alb5YUB1qmVrJFlADVT",
"status": "CREATED"
}
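The two variants of the load request differ only in whether a body with node_ids is sent. A minimal sketch of that choice follows; build_load_request is a hypothetical helper that returns the method, path, and optional body without sending anything.

```python
def build_load_request(model_id, node_ids=None):
    """Return (method, path, body) for the load API described above.

    body is None when the model should be deployed to all eligible nodes.
    """
    path = f"/_plugins/_ml/models/{model_id}/_load"
    body = {"node_ids": list(node_ids)} if node_ids else None
    return "POST", path, body


# Deploy to all nodes.
all_nodes = build_load_request("zwla5YUB1qmVrJFlwzXJ")
# Deploy to one specific node (node id taken from the sample responses in this doc).
one_node = build_load_request("zwla5YUB1qmVrJFlwzXJ", ["0TLL4hHxRv6_G3n6y1l0BQ"])
print(all_nodes)
print(one_node)
```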
As with uploading, you can check the task by calling the get task API. When the task has completed, you can run the predict API.
# Sample request
GET /_plugins/_ml/tasks/0Alb5YUB1qmVrJFlADVT
# Sample response
{
"model_id": "zwla5YUB1qmVrJFlwzXJ",
"task_type": "LOAD_MODEL",
"function_name": "TEXT_EMBEDDING",
"state": "COMPLETED",
"worker_node": [
"0TLL4hHxRv6_G3n6y1l0BQ"
],
"create_time": 1674590224467,
"last_update_time": 1674590226409,
"is_async": true
}
- Step 3: inference/predict
POST /_plugins/_ml/_predict/text_embedding/<model_id>
# Sample request
POST /_plugins/_ml/_predict/text_embedding/zwla5YUB1qmVrJFlwzXJ
{
"text_docs": [ "today is sunny" ],
"return_number": true,
"target_response": [ "sentence_embedding" ]
}
# Sample response
{
"inference_results": [
{
"output": [
{
"name": "sentence_embedding",
"data_type": "FLOAT32",
"shape": [
384
],
"data": [
-0.023314998,
0.08975688,
0.07847973,
...
]
}
]
}
]
}
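A typical next step is to pull the vectors out of the predict response and compare them. The sketch below parses a response with the shape shown above and computes cosine similarity in plain Python. The function names are hypothetical, the demo vector is shortened to the three values visible in the sample, and it is an assumption that inference_results contains one entry per input text.

```python
import math


def extract_embeddings(response, name="sentence_embedding"):
    """Collect the data arrays of all outputs with the given name."""
    vectors = []
    for result in response["inference_results"]:
        for output in result["output"]:
            if output["name"] == name:
                vectors.append(output["data"])
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Shortened stand-in for the sample response (real vectors have 384 values).
sample = {
    "inference_results": [
        {"output": [{"name": "sentence_embedding", "data_type": "FLOAT32",
                     "shape": [3], "data": [-0.023314998, 0.08975688, 0.07847973]}]}
    ]
}
vectors = extract_embeddings(sample)
print(cosine_similarity(vectors[0], vectors[0]))
```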
- Step 4 (optional): profile
You can use the profile API to get model deployment information and monitor inference latency. Refer to this doc.
By default, it monitors the last 100 predict requests. You can tune the setting plugins.ml_commons.monitoring_request_count to control how many requests are monitored.
# Sample request
GET /_plugins/_ml/profile/models/zwla5YUB1qmVrJFlwzXJ
# Sample response
{
"nodes": {
"0TLL4hHxRv6_G3n6y1l0BQ": { # node id
"models": {
"zwla5YUB1qmVrJFlwzXJ": { # model id
"model_state": "LOADED",
"predictor": "org.opensearch.ml.engine.algorithms.text_embedding.TextEmbeddingDenseModel@1a0b0793",
"target_worker_nodes": [ # plan to deploy model to these nodes
"0TLL4hHxRv6_G3n6y1l0BQ"
],
"worker_nodes": [ # model deployed to these nodes
"0TLL4hHxRv6_G3n6y1l0BQ"
],
"model_inference_stats": { // in Millisecond, time used in model part
"count": 10,
"max": 35.021633,
"min": 31.924348,
"average": 33.9418092,
"p50": 34.0341065,
"p90": 34.8487421,
"p99": 35.00434391
},
"predict_request_stats": { // in Millisecond, end-to-end time including model and all other parts
"count": 10,
"max": 36.037992,
"min": 32.903162,
"average": 34.9731029,
"p50": 35.073967999999994,
"p90": 35.868510300000004,
"p99": 36.02104383
}
}
}
}
}
}
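Since predict_request_stats is end-to-end time and model_inference_stats is the time spent in the model, their difference gives a rough view of the overhead outside the model. The following sketch computes that per node from a profile response; overhead_by_node is a hypothetical helper, and the demo input keeps only the fields it needs from the sample above.

```python
def overhead_by_node(profile, model_id):
    """Average non-model time in milliseconds per node for one model."""
    result = {}
    for node_id, node in profile["nodes"].items():
        stats = node.get("models", {}).get(model_id)
        if not stats:
            continue  # model is not deployed on this node
        result[node_id] = (stats["predict_request_stats"]["average"]
                           - stats["model_inference_stats"]["average"])
    return result


# Trimmed copy of the sample profile response.
profile = {
    "nodes": {
        "0TLL4hHxRv6_G3n6y1l0BQ": {
            "models": {
                "zwla5YUB1qmVrJFlwzXJ": {
                    "model_inference_stats": {"count": 10, "average": 33.9418092},
                    "predict_request_stats": {"count": 10, "average": 34.9731029},
                }
            }
        }
    }
}
overhead = overhead_by_node(profile, "zwla5YUB1qmVrJFlwzXJ")
print(overhead)
```

For the sample numbers this comes to roughly one millisecond per request spent outside the model.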
- Step 5: unload the model. This step removes the model from the nodes. The model document is not deleted from the model index.
# Unload one model
POST /_plugins/_ml/models/<model_id>/_unload
# Unload all models
POST /_plugins/_ml/models/_unload
# Sample request
POST /_plugins/_ml/models/zwla5YUB1qmVrJFlwzXJ/_unload
# Sample response
{
"0TLL4hHxRv6_G3n6y1l0BQ": { # node id
"stats": {
"zwla5YUB1qmVrJFlwzXJ": "unloaded"
}
}
}
Without sentence-transformers installed, you can load this model with AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2') and trace it.
A model traced this way doesn't include post-processing, so you have to specify the post-processing logic with pooling_mode and normalize_result.
Supported pooling methods: mean, mean_sqrt_len, max, weightedmean, cls.
The only difference is the upload request body; for load/predict/profile/unload, follow the same steps as in "1.1 trace sentence transformers model".
# Sample request
POST /_plugins/_ml/models/_upload
{
"name": "sentence-transformers/all-MiniLM-L6-v2",
"version": "1.0.0",
"description": "test model",
"model_format": "TORCH_SCRIPT",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "huggingface_transformers",
"pooling_mode":"mean",
"normalize_result":"true"
},
"url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_huggingface.zip?raw=true"
}
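To make the two post-processing options more concrete, here is a conceptual sketch of what "pooling_mode": "mean" and "normalize_result": "true" usually mean for a transformer's token embeddings. It is plain Python for illustration only and is not the ml-commons implementation; the function names and the use of an attention mask to skip padding tokens are assumptions borrowed from how sentence-transformers commonly describes mean pooling.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose attention mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


# Three tokens with 2-dimensional embeddings; the last token is padding.
tokens = [[1.0, 2.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
print(pooled, sentence_embedding)
```

The other pooling modes replace only the mean_pool step, for example by taking the element-wise maximum (max) or the first token's vector (cls).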
You can export a PyTorch model to ONNX, then upload and run it with the ml-commons APIs.
This example ONNX model also needs the post-processing logic to be specified with pooling_mode and normalize_result.
Supported pooling methods: mean, mean_sqrt_len, max, weightedmean, cls.
The only difference is the upload request body; for load/predict/profile/unload, follow the same steps as in "1.1 trace sentence transformers model".
# Sample request
POST /_plugins/_ml/models/_upload
{
"name": "sentence-transformers/all-MiniLM-L6-v2",
"version": "1.0.0",
"description": "test model",
"model_format": "ONNX",
"model_config": {
"model_type": "bert",
"embedding_dimension": 384,
"framework_type": "huggingface_transformers",
"pooling_mode":"mean",
"normalize_result":"true"
},
"url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_onnx.zip?raw=true"
}
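The TorchScript and ONNX upload requests differ only in model_format and url, so a small builder can produce both and catch an unsupported pooling mode before the request is sent. This is a sketch under the assumption that the fields shown in the two samples are sufficient; build_upload_body is a hypothetical helper, and the URL in the demo is a placeholder rather than a real artifact.

```python
SUPPORTED_POOLING_MODES = {"mean", "mean_sqrt_len", "max", "weightedmean", "cls"}
SUPPORTED_FORMATS = {"TORCH_SCRIPT", "ONNX"}


def build_upload_body(model_format, url, pooling_mode="mean", normalize_result=True):
    """Build an upload body for a Hugging Face model traced without post-processing."""
    if model_format not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported model_format: {model_format}")
    if pooling_mode not in SUPPORTED_POOLING_MODES:
        raise ValueError(f"unsupported pooling_mode: {pooling_mode}")
    return {
        "name": "sentence-transformers/all-MiniLM-L6-v2",
        "version": "1.0.0",
        "description": "test model",
        "model_format": model_format,
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": 384,
            "framework_type": "huggingface_transformers",
            "pooling_mode": pooling_mode,
            # Mirrors the samples above, which pass the flag as a string.
            "normalize_result": "true" if normalize_result else "false",
        },
        "url": url,
    }


# Placeholder URL for illustration only.
onnx_body = build_upload_body("ONNX", "https://example.com/all-MiniLM-L6-v2_onnx.zip")
print(onnx_body["model_format"], onnx_body["model_config"]["pooling_mode"])
```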