Add multimodal search/sparse search/pre- and post-processing function documentation #5168
Conversation
Signed-off-by: Fanit Kolchina <[email protected]>
> **PREREQUISITE**
>
> Before using the `text_embedding` processor, you must set up a machine learning (ML) model and provide the model ID when creating the processor.
As part of 2.11, we're releasing the default `model_id` feature; the documentation PR for it is #5060. Users can set up a processor as part of a search pipeline, and it will inject the `model_id`.
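As a minimal sketch of the feature described above, the following builds the body of a search pipeline that injects a default model ID into neural queries. This assumes the 2.11 `neural_query_enricher` request processor; the pipeline name and model ID are placeholders.

```python
import json

# Placeholder model ID for illustration only.
DEFAULT_MODEL_ID = "bQ1J8ooBpBj3wT4HVUsb"

# Sketch of a search pipeline definition that injects a default model_id
# into neural queries, per the 2.11 default-model feature. The processor
# name "neural_query_enricher" is an assumption to be verified against
# the 2.11 release notes.
pipeline_body = {
    "request_processors": [
        {
            "neural_query_enricher": {
                "default_model_id": DEFAULT_MODEL_ID
            }
        }
    ]
}

# The pipeline would then be created with something like:
# PUT /_search/pipeline/default_model_pipeline
print(json.dumps(pipeline_body, indent=2))
```

With such a pipeline set as the index default, `neural` queries can omit `model_id`.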
_query-dsl/specialized/neural.md
Field | Data type | Required/Optional | Description
:--- | :--- | :--- | :---
`query_text` | String | Optional | The query text from which to generate vector embeddings. You must specify at least one `query_text` or `query_image`.
`query_image` | Binary | Optional | The query image from which to generate vector embeddings. You must specify at least one `query_text` or `query_image`.
The data type for `query_image` must be String; we can mention in the description that it's a string containing a Base64-encoded image. `binary` is the type of the OpenSearch field if the user wants to store the image.
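To illustrate the point above, here is a small sketch that builds a `neural` query in which `query_image` is a Base64-encoded string rather than raw binary. The field name `passage_embedding`, the model ID, and the image bytes are placeholders.

```python
import base64
import json

# Placeholder values for illustration only.
model_id = "aVeif4oB5Vm0Tdw8zYO2"
image_bytes = b"\x89PNG\r\n\x1a\n"  # stand-in for real image bytes

# Per the review comment, query_image is a String: the Base64 encoding
# of the image, not an OpenSearch binary field.
query = {
    "query": {
        "neural": {
            "passage_embedding": {
                "query_text": "Wild west",
                "query_image": base64.b64encode(image_bytes).decode("ascii"),
                "model_id": model_id,
                "k": 5,
            }
        }
    }
}

print(json.dumps(query)[:60])
```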
`description` | String | Optional | A brief description of the processor.
`tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type.

## Using the processor
Can we state somewhere that users can use multiple processors in their pipeline definition in case they need to generate embeddings for multiple fields?
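To make the suggestion above concrete, a sketch of an ingest pipeline definition with two `text_embedding` processors, one per field that needs embeddings. The field names, pipeline description, and model ID are placeholders.

```python
import json

# Hypothetical pipeline with one text_embedding processor per source
# field; each processor maps a text field to its own embedding field.
pipeline = {
    "description": "Generate embeddings for two text fields",
    "processors": [
        {
            "text_embedding": {
                "model_id": "my-model-id",
                "field_map": {"title": "title_embedding"},
            }
        },
        {
            "text_embedding": {
                "model_id": "my-model-id",
                "field_map": {"body": "body_embedding"},
            }
        },
    ],
}

print(json.dumps(pipeline, indent=2))
```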
grand_parent: Connecting to remote models
---

# Bedrock connector
I think we need to get a strong "go" for referencing the Bedrock connector/models. @ylwu-amzn, could you please check?

## Using neural search

- [Text search]({{site.url}}{{site.baseurl}}/search-plugins/neural-text-search/): Uses text-based embedding models to search text data.
- [Multimodal search]({{site.url}}{{site.baseurl}}/search-plugins/neural-multimodal-search/): Uses vision-language embedding models to search text and image data.
For search, the difference between text-only and text/image is one `query_image` keyword in the search request. Is the main purpose of splitting text and text/image into two sections basically to provide some examples of how to set up ingestion, since that part requires a new processor? No strong opinion, just want to understand the reasoning behind this.
Yes, that's correct. The "old" syntax is in one file (because the old syntax is still supported, correct?) and the new syntax is in another file. Providing an end-to-end example helps the user follow it on one page. Also, the `field_map` is not set up in the same way: in the old syntax, the user just maps `text_field: embedding_field`, but in the new syntax, the mapping is different.
```
{% include copy-curl.html %}

To eliminate passing the model ID with each neural query request, you can set a default model on a k-NN index or a field. To learn more, see [Setting a default model on an index or field]({{site.url}}{{site.baseurl}}/search-plugins/neural-text-search/#setting-a-default-model-on-an-index-or-field).
Good that you put this reference here.

## Step 4: Search the index using neural search

To perform vector search on your index, use the `neural` query clause either in the [k-NN plugin API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api/#search-model) or [Query DSL]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/index/) queries. You can refine the results by using a [k-NN search filter]({{site.url}}{{site.baseurl}}/search-plugins/knn/filter-search-knn/).
Should we also mention that search can take only text, only image, or both?
@@ -73,7 +73,89 @@ The `action` parameter supports the following options.
| `url` | String | Required. Sets the connection endpoint at which the action takes place. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/extensibility/index#adding-trusted-endpoints). |
| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. |
| `request_body` | String | Required. Sets the parameters contained inside the request body of the action. The parameters must include `\"inputText\"`, which specifies how users of the connector should construct the request payload for the `action_type`. |
| `pre_process_function` | String | Optional. A built-in or custom Painless script to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:<br> - `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models<br> - `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models <br> - `connector.pre_process.default.embedding` that you can use to preprocess documents in neural search requests so they are in the format that the model expects (OpenSearch 2.11 or later). For more information, see [default functions](#default-preprocessing-and-post-processing-functions). |
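As a sketch of how a connector action might select one of the built-in functions listed in the table above, the following builds an action body. The endpoint, `request_body` template, and post-process function name are illustrative assumptions; only the `pre_process_function` value comes from the table.

```python
import json

# Hypothetical connector action using the built-in OpenAI pre-process
# function instead of a custom Painless script. The URL and request_body
# are placeholders, and the post-process name is assumed to follow the
# connector.post_process.* pattern.
action = {
    "action_type": "predict",
    "method": "POST",
    "url": "https://api.openai.com/v1/embeddings",
    "headers": {"Content-Type": "application/json"},
    "request_body": '{ "input": ${parameters.input}, "model": "${parameters.model}" }',
    "pre_process_function": "connector.pre_process.openai.embedding",
    "post_process_function": "connector.post_process.openai.embedding",
}

print(json.dumps(action, indent=2))
```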
in the format that the model expects -> in the format that ml-commons can process with the default preprocessor.

The default pre- and post-processing functions translate between the format that the model expects and the format that [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) expects.

Call the default pre- and post-processing functions instead of writing a custom Painless script when connecting to the following text embedding models:
when connecting to the following text embedding models -> when connecting to the following text embedding models, or to your own text embedding model when it's deployed on a remote server, e.g., SageMaker.

- [Pretrained models provided by OpenSearch](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/)
- [OpenAI remote models](https://platform.openai.com/docs/api-reference/embeddings)
- [Cohere remote models](https://docs.cohere.com/reference/embed)
The OpenAI embedding model will use `connector.pre/post_process.openai.embedding`, not the default pre/post processor. The Cohere embedding model will use `connector.pre/post_process.cohere.embedding`, not the default pre/post processor.

## Step 2: Create an index for ingestion

In order to use the text embedding processor defined in your pipelines, create a rank features index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`.
the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension -> the `passage_embedding` field must be mapped as a `rank_features` type.
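Applying the correction above, a sketch of what the index mapping might look like. The index settings structure and pipeline name are placeholders; the field names come from the example.

```python
import json

# Hypothetical index body per the review correction: passage_embedding
# mapped as rank_features (not knn_vector), passage_text as text, and
# the ingest pipeline from the previous step as the default pipeline.
index_body = {
    "settings": {"default_pipeline": "nlp-ingest-pipeline"},  # placeholder name
    "mappings": {
        "properties": {
            "passage_embedding": {"type": "rank_features"},
            "passage_text": {"type": "text"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```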
@kolchfa-aws Excellent job on this. Let me know if you have any questions. Thanks!
@@ -73,7 +73,100 @@ The `action` parameter supports the following options.
| `url` | String | Required. Sets the connection endpoint at which the action takes place. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/extensibility/index#adding-trusted-endpoints). |
| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. |
| `request_body` | String | Required. Sets the parameters contained inside the request body of the action. The parameters must include `\"inputText\"`, which specifies how users of the connector should construct the request payload for the `action_type`. |
| `pre_process_function` | String | Optional. A built-in or custom Painless script to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:<br> - `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models<br> - `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models <br> - `connector.pre_process.default.embedding` that you can use to preprocess documents in neural search requests so they are in the format that ML Commons can process with the default preprocessor (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). |
| `post_process_function` | String | Optional. A built-in or custom Painless script to post-process the model output data. OpenSearch provides the following built-in post-process functions that you can call directly:<br> - `connector.post_process.cohere.embedding` for [Cohere text embedding models](https://docs.cohere.com/reference/embed)<br> - `connector.post_process.openai.embedding` for [OpenAI text embedding models](https://platform.openai.com/docs/api-reference/embeddings) <br> - `connector.post_process.default.embedding` that you can use to post-process documents in the model response so that they are in the format that neural search expects (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). |
We've formatted "Cohere text embedding models" and "OpenAI text embedding models" differently in this and the preceding line (different words used and different text included in the link). Should they match?
```
{% include copy-curl.html %}

Before the document is ingested into the index, the ingest pipeline runs the `sparse_encoding` processor on the document, generating vector embeddings for the `passage_text` field. The indexed document contains the `passage_text` field that has the original text and the `passage_embedding` field that has the vector embeddings.
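To make the paragraph above concrete, a sketch of what an indexed document might look like after the `sparse_encoding` processor runs. The token weights are invented for illustration; sparse encoding produces a token-to-weight map rather than a dense vector.

```python
# Hypothetical post-pipeline document: the original text plus a
# token -> weight map produced by the sparse encoding model.
doc = {
    "passage_text": "hello world",
    "passage_embedding": {
        "hello": 2.35,   # invented weights for illustration
        "world": 1.17,
    },
}

print(sorted(doc["passage_embedding"]))
```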
Confirm that the second sentence shouldn't be "The indexed document contains the `passage_text` field, which contains the original text, and the `passage_embedding` field, which contains the vector embeddings."

## Step 2: Create an index for ingestion

In order to use the text embedding processor defined in your pipelines, create a k-NN index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`.
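A sketch of the k-NN index body the paragraph describes. The dimension 768 and the pipeline name are placeholders; the dimension must match the embedding model being used.

```python
import json

# Hypothetical k-NN index: passage_embedding is a knn_vector whose
# dimension matches the model, passage_text is plain text, and the
# ingest pipeline from the previous step is the default pipeline.
index_body = {
    "settings": {"index.knn": True, "default_pipeline": "nlp-ingest-pipeline"},
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 768,  # placeholder: must match the model
            },
            "passage_text": {"type": "text"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```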
Confirm that the first instance of "pipelines" shouldn't be singular.
```
{% include copy-curl.html %}

Before the document is ingested into the index, the ingest pipeline runs the `text_embedding` processor on the document, generating text embeddings for the `passage_text` field. The indexed document contains the `passage_text` field that has the original text and the `passage_embedding` field that has the vector embeddings.
Same comment re: the second sentence.

## Setting a default model on an index or field

A [`neural`]({{site.url}}{{site.baseurl}}/query-dsl/specialized/neural/) query requires a model ID for generating vector embeddings. To eliminate passing the model ID with each neural query request, you can set a default model on a k-NN index or a field.
"avoid" instead of "eliminate"?
… documentation (opensearch-project#5168)

* Add multimodal search documentation
* Text image embedding processor
* Add prerequisite
* Change query text
* Added bedrock connector tutorial and renamed ML TOC
* Name changes and rewording
* Change connector link
* Change link
* Implemented tech review comments
* Link fix and field name fix
* Add default text embedding preprocessing and post-processing functions
* Add sparse search documentation
* Fix links
* Pre/post processing function tech review comments
* Fix link
* Sparse search tech review comments
* Apply suggestions from code review
* Implemented doc review comments
* Add actual test sparse pipeline response
* Added tested examples
* Added model choice for sparse search
* Remove Bedrock connector
* Implemented tech review feedback
* Add that the model must be deployed to neural search
* Apply suggestions from code review
* Link fix
* Add session token to sagemaker blueprint
* Formatted bullet points the same way
* Specified both model types in neural sparse query
* Added more explanation for default pre/post-processing functions
* Remove framework and extensibility references
* Minor rewording

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Fixes #5108
Fixes #5105
Fixes #5081
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.