Add multimodal search/sparse search/pre- and post-processing function documentation #5168
Conversation
Signed-off-by: Fanit Kolchina <[email protected]>
> **PREREQUISITE**
>
> Before using the `text_embedding` processor, you must set up a machine learning (ML) model and provide the model ID when creating the processor.
As part of 2.11, we're releasing the default `model_id` feature; the documentation PR for it is #5060. Users can set up a processor as part of a search pipeline, and it will inject the `model_id`.
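As a minimal sketch of the feature described above, the following builds the body of a search pipeline that injects a default model ID into neural queries. This assumes the 2.11 `neural_query_enricher` request processor; the pipeline name and model ID are placeholders.

```python
import json

# Placeholder model ID for illustration only.
DEFAULT_MODEL_ID = "bQ1J8ooBpBj3wT4HVUsb"

# Sketch of a search pipeline definition that injects a default model_id
# into neural queries, per the 2.11 default-model feature. The processor
# name "neural_query_enricher" is an assumption to be verified against
# the 2.11 release notes.
pipeline_body = {
    "request_processors": [
        {
            "neural_query_enricher": {
                "default_model_id": DEFAULT_MODEL_ID
            }
        }
    ]
}

# The pipeline would then be created with something like:
# PUT /_search/pipeline/default_model_pipeline
print(json.dumps(pipeline_body, indent=2))
```

With such a pipeline set as the index default, `neural` queries can omit `model_id`.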
_query-dsl/specialized/neural.md
Field | Data type | Required/Optional | Description
:--- | :--- | :--- | :---
`query_text` | String | Optional | The query text from which to generate vector embeddings. You must specify at least one `query_text` or `query_image`.
`query_image` | Binary | Optional | The query image from which to generate vector embeddings. You must specify at least one `query_text` or `query_image`.
The data type for `query_image` must be String; we can mention in the description that it's a string containing a Base64-encoded image. `binary` is the type of the OpenSearch field if the user wants to store the image.
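To illustrate the point above, here is a small sketch that builds a `neural` query in which `query_image` is a Base64-encoded string rather than raw binary. The field name `passage_embedding`, the model ID, and the image bytes are placeholders.

```python
import base64
import json

# Placeholder values for illustration only.
model_id = "aVeif4oB5Vm0Tdw8zYO2"
image_bytes = b"\x89PNG\r\n\x1a\n"  # stand-in for real image bytes

# Per the review comment, query_image is a String: the Base64 encoding
# of the image, not an OpenSearch binary field.
query = {
    "query": {
        "neural": {
            "passage_embedding": {
                "query_text": "Wild west",
                "query_image": base64.b64encode(image_bytes).decode("ascii"),
                "model_id": model_id,
                "k": 5,
            }
        }
    }
}

print(json.dumps(query)[:60])
```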
`description` | String | Optional | A brief description of the processor.
`tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type.

## Using the processor
Can we state somewhere that users can use multiple processors in their pipeline definition in case they need to generate embeddings for multiple fields?
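To make the suggestion above concrete, a sketch of an ingest pipeline definition with two `text_embedding` processors, one per field that needs embeddings. The field names, pipeline description, and model ID are placeholders.

```python
import json

# Hypothetical pipeline with one text_embedding processor per source
# field; each processor maps a text field to its own embedding field.
pipeline = {
    "description": "Generate embeddings for two text fields",
    "processors": [
        {
            "text_embedding": {
                "model_id": "my-model-id",
                "field_map": {"title": "title_embedding"},
            }
        },
        {
            "text_embedding": {
                "model_id": "my-model-id",
                "field_map": {"body": "body_embedding"},
            }
        },
    ],
}

print(json.dumps(pipeline, indent=2))
```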
grand_parent: Connecting to remote models
---

# Bedrock connector
I think we need to get a strong "go" for referencing the Bedrock connector/models. @ylwu-amzn, could you please check?

## Using neural search

- [Text search]({{site.url}}{{site.baseurl}}/search-plugins/neural-text-search/): Uses text-based embedding models to search text data.
- [Multimodal search]({{site.url}}{{site.baseurl}}/search-plugins/neural-multimodal-search/): Uses vision-language embedding models to search text and image data.
For search, the difference between text-only and text/image is one `query_image` keyword in the search request. Is the main purpose of splitting text and text/image into two sections basically to provide some examples of how to set up ingestion, since that part requires a new processor? No strong opinion, just want to understand the reasoning behind this.
Yes, that's correct. The "old" syntax is in one file (because the old syntax is still supported, correct?) and the new syntax is in another file. Providing an end-to-end example helps the user follow it on one page. Also, the `field_map` is not set up in the same way: in the old syntax, the user just maps `text_field: embedding_field`, but in the new syntax, the mapping is different.
```
{% include copy-curl.html %}

To eliminate passing the model ID with each neural query request, you can set a default model on a k-NN index or a field. To learn more, see [Setting a default model on an index or field]({{site.url}}{{site.baseurl}}/search-plugins/neural-text-search/#setting-a-default-model-on-an-index-or-field).
Good that you put this reference here.

## Step 4: Search the index using neural search

To perform vector search on your index, use the `neural` query clause either in the [k-NN plugin API]({{site.url}}{{site.baseurl}}/search-plugins/knn/api/#search-model) or [Query DSL]({{site.url}}{{site.baseurl}}/opensearch/query-dsl/index/) queries. You can refine the results by using a [k-NN search filter]({{site.url}}{{site.baseurl}}/search-plugins/knn/filter-search-knn/).
Should we also mention that search can take only text, only image, or both?
@@ -73,7 +73,89 @@ The `action` parameter supports the following options.
| `url` | String | Required. Sets the connection endpoint at which the action takes place. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/extensibility/index#adding-trusted-endpoints). |
| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. |
| `request_body` | String | Required. Sets the parameters contained inside the request body of the action. The parameters must include `\"inputText\"`, which specifies how users of the connector should construct the request payload for the `action_type`. |
| `pre_process_function` | String | Optional. A built-in or custom Painless script to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:<br> - `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models<br> - `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models <br> - `connector.pre_process.default.embedding` that you can use to preprocess documents in neural search requests so they are in the format that the model expects (OpenSearch 2.11 or later). For more information, see [default functions](#default-preprocessing-and-post-processing-functions). |
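As a sketch of how a connector action might select one of the built-in functions listed in the table above, the following builds an action body. The endpoint, `request_body` template, and post-process function name are illustrative assumptions; only the `pre_process_function` value comes from the table.

```python
import json

# Hypothetical connector action using the built-in OpenAI pre-process
# function instead of a custom Painless script. The URL and request_body
# are placeholders, and the post-process name is assumed to follow the
# connector.post_process.* pattern.
action = {
    "action_type": "predict",
    "method": "POST",
    "url": "https://api.openai.com/v1/embeddings",
    "headers": {"Content-Type": "application/json"},
    "request_body": '{ "input": ${parameters.input}, "model": "${parameters.model}" }',
    "pre_process_function": "connector.pre_process.openai.embedding",
    "post_process_function": "connector.post_process.openai.embedding",
}

print(json.dumps(action, indent=2))
```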
in the format that the model expects -> in the format that ml-commons can process with the default preprocessor.

The default pre- and post-processing functions translate between the format that the model expects and the format that [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) expects.

Call the default pre- and post-processing functions instead of writing a custom Painless script when connecting to the following text embedding models:
when connecting to the following text embedding models -> when connecting to the following text embedding models, or to your own text embedding model when it's deployed on a remote server, e.g., SageMaker.

- [Pretrained models provided by OpenSearch](https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/)
- [OpenAI remote models](https://platform.openai.com/docs/api-reference/embeddings)
- [Cohere remote models](https://docs.cohere.com/reference/embed)
The OpenAI embedding model will use `connector.pre/post_process.openai.embedding`, not the default pre/post processor. The Cohere embedding model will use `connector.pre/post_process.cohere.embedding`, not the default pre/post processor.

## Step 2: Create an index for ingestion

In order to use the text embedding processor defined in your pipelines, create a rank features index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`.
the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension -> the `passage_embedding` field must be mapped as a `rank_features` type.
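Applying the correction above, a sketch of what the index mapping might look like. The index settings structure and pipeline name are placeholders; the field names come from the example.

```python
import json

# Hypothetical index body per the review correction: passage_embedding
# mapped as rank_features (not knn_vector), passage_text as text, and
# the ingest pipeline from the previous step as the default pipeline.
index_body = {
    "settings": {"default_pipeline": "nlp-ingest-pipeline"},  # placeholder name
    "mappings": {
        "properties": {
            "passage_embedding": {"type": "rank_features"},
            "passage_text": {"type": "text"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```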
@kolchfa-aws Excellent job on this. Let me know if you have any questions. Thanks!
@@ -73,7 +73,100 @@ The `action` parameter supports the following options.
| `url` | String | Required. Sets the connection endpoint at which the action takes place. This must match the regex expression for the connection used when [adding trusted endpoints]({{site.url}}{{site.baseurl}}/ml-commons-plugin/extensibility/index#adding-trusted-endpoints). |
| `headers` | JSON object | Sets the headers used inside the request or response body. Default is `ContentType: application/json`. If your third-party ML tool requires access control, define the required `credential` parameters in the `headers` parameter. |
| `request_body` | String | Required. Sets the parameters contained inside the request body of the action. The parameters must include `\"inputText\"`, which specifies how users of the connector should construct the request payload for the `action_type`. |
| `pre_process_function` | String | Optional. A built-in or custom Painless script to preprocess the input data. OpenSearch provides the following built-in preprocess functions that you can call directly:<br> - `connector.pre_process.cohere.embedding` for [Cohere](https://cohere.com/) embedding models<br> - `connector.pre_process.openai.embedding` for [OpenAI](https://openai.com/) embedding models <br> - `connector.pre_process.default.embedding` that you can use to preprocess documents in neural search requests so they are in the format that ML Commons can process with the default preprocessor (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). |
| `post_process_function` | String | Optional. A built-in or custom Painless script to post-process the model output data. OpenSearch provides the following built-in post-process functions that you can call directly:<br> - `connector.post_process.cohere.embedding` for [Cohere text embedding models](https://docs.cohere.com/reference/embed)<br> - `connector.post_process.openai.embedding` for [OpenAI text embedding models](https://platform.openai.com/docs/api-reference/embeddings) <br> - `connector.post_process.default.embedding` that you can use to post-process documents in the model response so that they are in the format that neural search expects (OpenSearch 2.11 or later). For more information, see [built-in functions](#built-in-pre--and-post-processing-functions). |
We've formatted "Cohere text embedding models" and "OpenAI text embedding models" differently in this and the preceding line (different words used and different text included in the link). Should they match?
```
{% include copy-curl.html %}

Before the document is ingested into the index, the ingest pipeline runs the `sparse_encoding` processor on the document, generating vector embeddings for the `passage_text` field. The indexed document contains the `passage_text` field that has the original text and the `passage_embedding` field that has the vector embeddings.
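To make the paragraph above concrete, a sketch of what an indexed document might look like after the `sparse_encoding` processor runs. The token weights are invented for illustration; sparse encoding produces a token-to-weight map rather than a dense vector.

```python
# Hypothetical post-pipeline document: the original text plus a
# token -> weight map produced by the sparse encoding model.
doc = {
    "passage_text": "hello world",
    "passage_embedding": {
        "hello": 2.35,   # invented weights for illustration
        "world": 1.17,
    },
}

print(sorted(doc["passage_embedding"]))
```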
Confirm that the second sentence shouldn't be "The indexed document contains the `passage_text` field, which contains the original text, and the `passage_embedding` field, which contains the vector embeddings."

## Step 2: Create an index for ingestion

In order to use the text embedding processor defined in your pipelines, create a k-NN index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the `field_map` are mapped as correct types. Continuing with the example, the `passage_embedding` field must be mapped as a k-NN vector with a dimension that matches the model dimension. Similarly, the `passage_text` field should be mapped as `text`.
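A sketch of the k-NN index body the paragraph describes. The dimension 768 and the pipeline name are placeholders; the dimension must match the embedding model being used.

```python
import json

# Hypothetical k-NN index: passage_embedding is a knn_vector whose
# dimension matches the model, passage_text is plain text, and the
# ingest pipeline from the previous step is the default pipeline.
index_body = {
    "settings": {"index.knn": True, "default_pipeline": "nlp-ingest-pipeline"},
    "mappings": {
        "properties": {
            "passage_embedding": {
                "type": "knn_vector",
                "dimension": 768,  # placeholder: must match the model
            },
            "passage_text": {"type": "text"},
        }
    },
}

print(json.dumps(index_body, indent=2))
```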
Confirm that the first instance of "pipelines" shouldn't be singular.
```
{% include copy-curl.html %}

Before the document is ingested into the index, the ingest pipeline runs the `text_embedding` processor on the document, generating text embeddings for the `passage_text` field. The indexed document contains the `passage_text` field that has the original text and the `passage_embedding` field that has the vector embeddings.
Same comment re: the second sentence.

## Setting a default model on an index or field

A [`neural`]({{site.url}}{{site.baseurl}}/query-dsl/specialized/neural/) query requires a model ID for generating vector embeddings. To eliminate passing the model ID with each neural query request, you can set a default model on a k-NN index or a field.
"avoid" instead of "eliminate"?
… documentation (opensearch-project#5168)

* Add multimodal search documentation
* Text image embedding processor
* Add prerequisite
* Change query text
* Added bedrock connector tutorial and renamed ML TOC
* Name changes and rewording
* Change connector link
* Change link
* Implemented tech review comments
* Link fix and field name fix
* Add default text embedding preprocessing and post-processing functions
* Add sparse search documentation
* Fix links
* Pre/post processing function tech review comments
* Fix link
* Sparse search tech review comments
* Apply suggestions from code review
* Implemented doc review comments
* Add actual test sparse pipeline response
* Added tested examples
* Added model choice for sparse search
* Remove Bedrock connector
* Implemented tech review feedback
* Add that the model must be deployed to neural search
* Apply suggestions from code review
* Link fix
* Add session token to sagemaker blueprint
* Formatted bullet points the same way
* Specified both model types in neural sparse query
* Added more explanation for default pre/post-processing functions
* Remove framework and extensibility references
* Minor rewording

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Fixes #5108
Fixes #5105
Fixes #5081
Checklist
For more information on following Developer Certificate of Origin and signing off your commits, please check here.