Add multimodal search/sparse search/pre- and post-processing function documentation #5168

Merged
merged 35 commits into from
Oct 16, 2023

Conversation

kolchfa-aws
Collaborator

@kolchfa-aws kolchfa-aws commented Oct 6, 2023

Fixes #5108
Fixes #5105
Fixes #5081

Checklist

  • By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and subject to the Developers Certificate of Origin.
    For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@kolchfa-aws kolchfa-aws self-assigned this Oct 6, 2023
@kolchfa-aws kolchfa-aws added release-notes PR: Include this PR in the automated release notes v2.11.0 labels Oct 6, 2023
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>

> **PREREQUISITE**
>
> Before using the `text_embedding` processor, you must set up a machine learning (ML) model and provide the model ID when creating the processor.
Member

As part of 2.11, we're releasing a default model_id feature; its documentation PR is #5060. The user can set up a processor as part of a search pipeline, and it will inject the model_id.
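
For illustration, the default model setup mentioned in this comment might look like the following. This is a minimal sketch, assuming the `neural_query_enricher` search request processor and its `default_model_id` and `neural_field_default_id` settings; the helper function, the model IDs, and the field name are hypothetical placeholders.

```python
import json


def default_model_search_pipeline(default_model_id, field_model_ids=None):
    """Build a search pipeline body whose request processor injects a model ID
    into neural queries that omit one."""
    enricher = {"default_model_id": default_model_id}
    if field_model_ids:
        # Optional per-field model IDs that take the place of the default.
        enricher["neural_field_default_id"] = dict(field_model_ids)
    return {"request_processors": [{"neural_query_enricher": enricher}]}


# Placeholder IDs; the body is meant for a PUT to /_search/pipeline/my-pipeline.
pipeline_body = default_model_search_pipeline(
    "my-default-model-id", {"passage_embedding": "my-field-model-id"}
)
print(json.dumps(pipeline_body, indent=2))
```

The per-field map is optional, so a pipeline with a single default model only needs the first argument.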

Field | Data type | Required/Optional | Description
:--- | :--- | :--- | :---
`query_text` | String | Optional | The query text from which to generate vector embeddings. You must specify at least one `query_text` or `query_image`.
`query_image` | Binary | Optional | The query image from which to generate vector embeddings. You must specify at least one `query_text` or `query_image`.
Member

The data type for query_image must be string; we can mention in the description that it's a string containing a base64-encoded image. Binary is the type of the OpenSearch field if the user wants to store the image.
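
To make the distinction in this comment concrete, here is a minimal sketch of turning raw image bytes into the string form that `query_image` would carry. The helper name is hypothetical, and the sample bytes are only the 8-byte PNG signature, not a real image.

```python
import base64


def encode_query_image(image_bytes: bytes) -> str:
    """Return the Base64 string form of raw image bytes."""
    return base64.b64encode(image_bytes).decode("ascii")


# Stand-in bytes (the PNG file signature), used only to exercise the helper.
sample = b"\x89PNG\r\n\x1a\n"
encoded = encode_query_image(sample)
print(encoded)
```

In practice the bytes would come from reading an image file in binary mode; the encoded result is a plain string and can be placed in a JSON request body.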

`description` | String | Optional | A brief description of the processor. |
`tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |

## Using the processor
Member

Can we state somewhere that the user can use multiple processors in their pipeline definition if they need to generate embeddings for multiple fields?
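
A pipeline of the kind this comment asks about could be sketched as follows. It assumes the `text_embedding` processor syntax with `model_id` and `field_map`; the helper function, field names, and model ID are hypothetical.

```python
import json


def multi_field_embedding_pipeline(model_id, field_to_embedding):
    """Build an ingest pipeline body with one text_embedding processor per
    source field."""
    processors = [
        {"text_embedding": {"model_id": model_id, "field_map": {source: target}}}
        for source, target in field_to_embedding.items()
    ]
    return {
        "description": "Generates embeddings for several text fields",
        "processors": processors,
    }


# Hypothetical field names; each source field gets its own processor.
multi_pipeline = multi_field_embedding_pipeline(
    "my-model-id",
    {"title": "title_embedding", "passage_text": "passage_embedding"},
)
print(json.dumps(multi_pipeline, indent=2))
```

Using one processor per field also leaves room to give each processor a different model ID if the fields need different models.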

grand_parent: Connecting to remote models
---

# Bedrock connector
Member

I think we need to get a strong "go" for referencing the Bedrock connector/models. @ylwu-amzn, could you please check?


## Using neural search
- [Text search]({{site.url}}{{site.baseurl}}/search-plugins/neural-text-search/): Uses text-based embedding models to search text data.
- [Multimodal search]({{site.url}}{{site.baseurl}}/search-plugins/neural-multimodal-search/): Uses vision-language embedding models to search text and image data.
Member

For search, the difference between text-only and text/image is a single query_image keyword in the search request. Is the main purpose of splitting text and text/image into two sections to provide examples of how to set up ingestion, since that part requires a new processor? No strong opinion, just want to understand the reasoning behind this.

Collaborator Author

Yes, that's correct. The "old" syntax is in one file (because the old syntax is still supported, correct?) and the new syntax is in another file. Providing an end-to-end example helps the user follow it on one page. Also, the field_map is not set up in the same way: in the old syntax, the user just maps text_field: embedding_field but in the new syntax, the syntax is different.
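
The difference between the two `field_map` styles described in this reply can be sketched side by side. This is a hedged illustration, assuming the `text_embedding` and `text_image_embedding` processor layouts; the helper functions, field names, and model ID are hypothetical.

```python
def text_embedding_processor(model_id, text_field, embedding_field):
    """Text-only syntax: field_map maps the text field to the embedding field."""
    return {
        "text_embedding": {
            "model_id": model_id,
            "field_map": {text_field: embedding_field},
        }
    }


def text_image_embedding_processor(model_id, embedding_field,
                                   text_field=None, image_field=None):
    """Multimodal syntax: the embedding field is named separately, and
    field_map names the text and/or image source fields."""
    field_map = {}
    if text_field is not None:
        field_map["text"] = text_field
    if image_field is not None:
        field_map["image"] = image_field
    if not field_map:
        raise ValueError("Provide a text field, an image field, or both")
    return {
        "text_image_embedding": {
            "model_id": model_id,
            "embedding": embedding_field,
            "field_map": field_map,
        }
    }


text_only = text_embedding_processor("my-model-id", "passage_text", "passage_embedding")
multimodal = text_image_embedding_processor(
    "my-model-id", "vector_embedding",
    text_field="image_description", image_field="image_binary",
)
print(text_only)
print(multimodal)
```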


To eliminate passing the model ID with each neural query request, you can set a default model on a k-NN index or a field. To learn more, see [Setting a default model on an index or field]({{site.url}}{{site.baseurl}}/search-plugins/neural-text-search/#setting-a-default-model-on-an-index-or-field).
Member

Good that you put this reference here.


## Step 4: Search the index using neural search

To perform vector search on your index, use the `neural` query clause either in the [k-NN plugin API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api/#search-model) or [Query DSL]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/index/) queries. You can refine the results by using a [k-NN search filter]({{site.url}}{{site.baseurl}}/search-plugins/knn/filter-search-knn/).
Member

Should we also mention that search can take only text, only an image, or both?
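
The three query shapes raised in this comment can be sketched with one builder. This assumes the `neural` query clause with `query_text`, `query_image`, `model_id`, and `k`; the helper function, field name, and values are hypothetical.

```python
def neural_query(field, model_id, k, query_text=None, query_image=None):
    """Build a neural query from text, a Base64-encoded image string, or both."""
    if query_text is None and query_image is None:
        raise ValueError("Specify query_text, query_image, or both")
    clause = {"model_id": model_id, "k": k}
    if query_text is not None:
        clause["query_text"] = query_text
    if query_image is not None:
        clause["query_image"] = query_image  # Base64-encoded string
    return {"query": {"neural": {field: clause}}}


text_query = neural_query("vector_embedding", "my-model-id", 5, query_text="Orange table")
image_query = neural_query("vector_embedding", "my-model-id", 5, query_image="iVBORw0KGgo=")
both_query = neural_query(
    "vector_embedding", "my-model-id", 5,
    query_text="Orange table", query_image="iVBORw0KGgo=",
)
print(both_query)
```

Rejecting a call with neither input mirrors the requirement in the field table that at least one of `query_text` or `query_image` is specified.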

@@ -73,7 +73,89 @@ The `action` parameter supports the following options.
| `url` | String | Required. Sets the connection endpoint at which the action takes place. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/extensibility/index#adding-trusted-endpoints). |
| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. |
| `request_body` | String | Required. Sets the parameters contained inside the request body of the action. The parameters must include `\"inputText\"`, which specifies how users of the connector should construct the request payload for the `action_type`. |
| `pre_process_function` | String | Optional. A built-in or custom Painless script to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:<br> - `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models<br> - `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models <br> - `connector.pre_process.default.embedding` that you can use to preprocess documents in neural search requests so they are in the format that the model expects (OpenSearch 2.11 or later). For more information, see [default functions](#default-preprocessing-and-post-processing-functions). |
Contributor

in the format that the model expects -> in the format that ml-commons can process with the default preprocessor.


The default pre- and post-processing functions translate between the format that the model expects and the format that [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) expects.

Call the default pre- and post-processing functions instead of writing a custom Painless script when connecting to the following text embedding models:
Contributor

when connecting to the following text embedding models -> when connecting to the following text embedding models or your own text embedding model when they're deployed on a remote server, e.g. SageMaker.


- [Pretrained models provided by OpenSearch](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/)
- [OpenAI remote models](https://platform.openai.com/docs/api-reference/embeddings)
- [Cohere remote models](https://docs.cohere.com/reference/embed)
Contributor

The OpenAI embedding model will use connector.pre/post_process.openai.embedding, not the default pre/post processor. The Cohere embedding model will use connector.pre/post_process.cohere.embedding, not the default pre/post processor.
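
The selection rule in this comment can be summarized in a small lookup. This is a sketch only: the function names are the built-in ones listed in the table rows above, while the helper function and the provider keys are hypothetical.

```python
# Provider-specific built-in functions as (pre-process, post-process) pairs.
PROVIDER_FUNCTIONS = {
    "cohere": (
        "connector.pre_process.cohere.embedding",
        "connector.post_process.cohere.embedding",
    ),
    "openai": (
        "connector.pre_process.openai.embedding",
        "connector.post_process.openai.embedding",
    ),
}

# Fallback for other text embedding models, such as one on a remote server.
DEFAULT_FUNCTIONS = (
    "connector.pre_process.default.embedding",
    "connector.post_process.default.embedding",
)


def embedding_functions(provider):
    """Return the connector action fields for the given provider."""
    pre, post = PROVIDER_FUNCTIONS.get(provider.lower(), DEFAULT_FUNCTIONS)
    return {"pre_process_function": pre, "post_process_function": post}


print(embedding_functions("OpenAI"))
print(embedding_functions("sagemaker"))
```

The returned dictionary is meant to be merged into a connector `action` definition alongside `url`, `headers`, and `request_body`.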

@hdhalter hdhalter added the 3 - Tech review PR: Tech review in progress label Oct 10, 2023
Signed-off-by: Fanit Kolchina <[email protected]>
@kolchfa-aws kolchfa-aws changed the title Add multimodal search documentation Add multimodal search/sparse search/pre- and post-processing in connectors documentation Oct 10, 2023
@kolchfa-aws kolchfa-aws changed the title Add multimodal search/sparse search/pre- and post-processing in connectors documentation Add multimodal search/sparse search/pre- and post-processing function documentation Oct 10, 2023

## Step 2: Create an index for ingestion

In order to use the text embedding processor defined in your pipelines, create a rank features index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`.
Contributor

the passage_embedding field must be mapped as a k-NN vector with a dimension that matches the model dimension -> the passage_embedding field must be mapped as a rank_features type.
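
With the correction from this comment applied, the index for sparse search might be created with a body like the one below. This is a minimal sketch, assuming a `rank_features` mapping for the embedding field and the `default_pipeline` index setting; the helper function and the pipeline name are hypothetical.

```python
import json


def sparse_index_body(pipeline_name, text_field, embedding_field):
    """Build an index body that maps the embedding field as rank_features."""
    return {
        "settings": {"default_pipeline": pipeline_name},
        "mappings": {
            "properties": {
                embedding_field: {"type": "rank_features"},
                text_field: {"type": "text"},
            }
        },
    }


sparse_body = sparse_index_body(
    "my-sparse-ingest-pipeline", "passage_text", "passage_embedding"
)
print(json.dumps(sparse_body, indent=2))
```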

Collaborator

@natebower natebower left a comment

@kolchfa-aws Excellent job on this. Let me know if you have any questions. Thanks!

@@ -73,7 +73,100 @@ The `action` parameter supports the following options.
| `url` | String | Required. Sets the connection endpoint at which the action takes place. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/extensibility/index#adding-trusted-endpoints). |
| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. |
| `request_body` | String | Required. Sets the parameters contained inside the request body of the action. The parameters must include `\"inputText\"`, which specifies how users of the connector should construct the request payload for the `action_type`. |
| `pre_process_function` | String | Optional. A built-in or custom Painless script to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:<br> - `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models<br> - `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models <br> - `connector.pre_process.default.embedding` that you can use to preprocess documents in neural search requests so they are in the format that ML Commons can process with the default preprocessor (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). |
| `post_process_function` | String | Optional. A built-in or custom Painless script to post-process the model output data. OpenSearch provides the following built-in post-process functions that you can call directly:<br> - `connector.post_process.cohere.embedding` for [Cohere text embedding models](https://docs.cohere.com/reference/embed)<br> - `connector.post_process.openai.embedding` for [OpenAI text embedding models](https://platform.openai.com/docs/api-reference/embeddings) <br> - `connector.post_process.default.embedding` that you can use to post-process documents in the model response so that they are in the format that neural search expects (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). |
Collaborator

We've formatted "Cohere text embedding models" and "OpenAI text embedding models" differently in this and the preceding line (different words used and different text included in the link). Should they match?


Before the document is ingested into the index, the ingest pipeline runs the `sparse_encoding` processor on the document, generating vector embeddings for the `passage_text` field. The indexed document contains the `passage_text` field that has the original text and the `passage_embedding` field that has the vector embeddings.
Collaborator

Confirm that the second sentence shouldn't be "The indexed document contains the passage_text field, which contains the original text, and the passage_embedding field, which contains the vector embeddings."


## Step 2: Create an index for ingestion

In order to use the text embedding processor defined in your pipelines, create a k-NN index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`.
Collaborator

Confirm that the first instance of "pipelines" shouldn't be singular.
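
For reference, the k-NN index described in the excerpt above could be created with a body along these lines. This is a sketch under stated assumptions: it uses the `index.knn` and `default_pipeline` settings and a `knn_vector` mapping, and the helper function, pipeline name, and dimension of 768 are hypothetical (the dimension must match the model actually in use).

```python
import json


def knn_index_body(pipeline_name, text_field, embedding_field, dimension):
    """Build a k-NN index body whose vector dimension matches the model."""
    if dimension <= 0:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {"index.knn": True, "default_pipeline": pipeline_name},
        "mappings": {
            "properties": {
                embedding_field: {"type": "knn_vector", "dimension": dimension},
                text_field: {"type": "text"},
            }
        },
    }


knn_body = knn_index_body("my-ingest-pipeline", "passage_text", "passage_embedding", 768)
print(json.dumps(knn_body, indent=2))
```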


Before the document is ingested into the index, the ingest pipeline runs the `text_embedding` processor on the document, generating text embeddings for the `passage_text` field. The indexed document contains the `passage_text` field that has the original text and the `passage_embedding` field that has the vector embeddings.
Collaborator

Same comment re: the second sentence.


## Setting a default model on an index or field

A [`neural`]({{site.url}}{{site.baseurl}}/query-dsl/specialized/neural/) query requires a model ID for generating vector embeddings. To eliminate passing the model ID with each neural query request, you can set a default model on a k-NN index or a field.
Collaborator

"avoid" instead of "eliminate"?
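
What the excerpt above enables can be sketched in two request bodies: one that makes a search pipeline the index default, and one neural query that then omits the model ID. This assumes the `index.search.default_pipeline` setting; the helper functions, pipeline name, and field name are hypothetical.

```python
def default_search_pipeline_setting(pipeline_name):
    """Index settings body that makes a search pipeline the index default."""
    return {"index.search.default_pipeline": pipeline_name}


def neural_query_without_model_id(field, query_text, k):
    """Neural query that relies on the default model instead of model_id."""
    return {"query": {"neural": {field: {"query_text": query_text, "k": k}}}}


settings_body = default_search_pipeline_setting("my-default-model-pipeline")
query_body = neural_query_without_model_id("passage_embedding", "Hi world", 100)
print(settings_body)
print(query_body)
```

The settings body would be sent to the index `_settings` endpoint once; after that, each search request can drop `model_id`.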

kolchfa-aws and others added 5 commits October 12, 2023 11:17
Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
@hdhalter hdhalter added 5 - Editorial review PR: Editorial review in progress and removed 4 - Doc review PR: Doc review in progress labels Oct 12, 2023
@kolchfa-aws kolchfa-aws merged commit a97c719 into main Oct 16, 2023
4 checks passed
harshavamsi pushed a commit to harshavamsi/documentation-website that referenced this pull request Oct 31, 2023
… documentation (opensearch-project#5168)

* Add multimodal search documentation

Signed-off-by: Fanit Kolchina <[email protected]>

* Text image embedding processor

Signed-off-by: Fanit Kolchina <[email protected]>

* Add prerequisite

Signed-off-by: Fanit Kolchina <[email protected]>

* Change query text

Signed-off-by: Fanit Kolchina <[email protected]>

* Added bedrock connector tutorial and renamed ML TOC

Signed-off-by: Fanit Kolchina <[email protected]>

* Name changes and rewording

Signed-off-by: Fanit Kolchina <[email protected]>

* Change connector link

Signed-off-by: Fanit Kolchina <[email protected]>

* Change link

Signed-off-by: Fanit Kolchina <[email protected]>

* Implemented tech review comments

Signed-off-by: Fanit Kolchina <[email protected]>

* Link fix and field name fix

Signed-off-by: Fanit Kolchina <[email protected]>

* Add default text embedding preprocessing and post-processing functions

Signed-off-by: Fanit Kolchina <[email protected]>

* Add sparse search documentation

Signed-off-by: Fanit Kolchina <[email protected]>

* Fix links

Signed-off-by: Fanit Kolchina <[email protected]>

* Pre/post processing function tech review comments

Signed-off-by: Fanit Kolchina <[email protected]>

* Fix link

Signed-off-by: Fanit Kolchina <[email protected]>

* Sparse search tech review comments

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Melissa Vagi <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Implemented doc review comments

Signed-off-by: Fanit Kolchina <[email protected]>

* Add actual test sparse pipeline response

Signed-off-by: Fanit Kolchina <[email protected]>

* Added tested examples

Signed-off-by: Fanit Kolchina <[email protected]>

* Added model choice for sparse search

Signed-off-by: Fanit Kolchina <[email protected]>

* Remove Bedrock connector

Signed-off-by: Fanit Kolchina <[email protected]>

* Implemented tech review feedback

Signed-off-by: Fanit Kolchina <[email protected]>

* Add that the model must be deployed to neural search

Signed-off-by: Fanit Kolchina <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nathan Bower <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>

* Link fix

Signed-off-by: Fanit Kolchina <[email protected]>

* Add session token to sagemaker blueprint

Signed-off-by: Fanit Kolchina <[email protected]>

* Formatted bullet points the same way

Signed-off-by: Fanit Kolchina <[email protected]>

* Specified both model types in neural sparse query

Signed-off-by: Fanit Kolchina <[email protected]>

* Added more explanation for default pre/post-processing functions

Signed-off-by: Fanit Kolchina <[email protected]>

* Remove framework and extensibility references

Signed-off-by: Fanit Kolchina <[email protected]>

* Minor rewording

Signed-off-by: Fanit Kolchina <[email protected]>

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
vagimeli added a commit that referenced this pull request Dec 21, 2023
… documentation (#5168)

@kolchfa-aws kolchfa-aws deleted the multimodal-search branch March 28, 2024 21:51
Labels
5 - Editorial review PR: Editorial review in progress release-notes PR: Include this PR in the automated release notes v2.11.0
9 participants