Add multimodal search/sparse search/pre- and post-processing function documentation (opensearch-project#5168)

* Add multimodal search documentation
* Text image embedding processor
* Add prerequisite
* Change query text
* Added bedrock connector tutorial and renamed ML TOC
* Name changes and rewording
* Change connector link
* Change link
* Implemented tech review comments
* Link fix and field name fix
* Add default text embedding preprocessing and post-processing functions
* Add sparse search documentation
* Fix links
* Pre/post processing function tech review comments
* Fix link
* Sparse search tech review comments
* Apply suggestions from code review
* Implemented doc review comments
* Add actual test sparse pipeline response
* Added tested examples
* Added model choice for sparse search
* Remove Bedrock connector
* Implemented tech review feedback
* Add that the model must be deployed to neural search
* Apply suggestions from code review
* Link fix
* Add session token to sagemaker blueprint
* Formatted bullet points the same way
* Specified both model types in neural sparse query
* Added more explanation for default pre/post-processing functions
* Remove framework and extensibility references
* Minor rewording

---------

Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: kolchfa-aws <[email protected]>
Co-authored-by: Melissa Vagi <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
3 people authored and harshavamsi committed Oct 31, 2023
1 parent ee51b9c commit e0359e8
Showing 23 changed files with 1,553 additions and 607 deletions.
147 changes: 147 additions & 0 deletions _api-reference/ingest-apis/processors/sparse-encoding.md
@@ -0,0 +1,147 @@
---
layout: default
title: Sparse encoding
parent: Ingest processors
grand_parent: Ingest APIs
nav_order: 240
---

# Sparse encoding

The `sparse_encoding` processor is used to generate sparse vector embeddings, consisting of tokens and their weights, from text fields for [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/) using sparse retrieval.

**PREREQUISITE**<br>
Before using the `sparse_encoding` processor, you must set up a machine learning (ML) model. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/) and [Semantic search]({{site.url}}{{site.baseurl}}/ml-commons-plugin/semantic-search/).
{: .note}

The following is the syntax for the `sparse_encoding` processor:

```json
{
  "sparse_encoding": {
    "model_id": "<model_id>",
    "field_map": {
      "<input_field>": "<vector_field>"
    }
  }
}
```
{% include copy-curl.html %}

#### Configuration parameters

The following table lists the required and optional parameters for the `sparse_encoding` processor.

| Name | Data type | Required | Description |
|:---|:---|:---|:---|
| `model_id` | String | Required | The ID of the model that will be used to generate the embeddings. The model must be deployed in OpenSearch before it can be used in neural search. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/) and [Semantic search]({{site.url}}{{site.baseurl}}/ml-commons-plugin/semantic-search/). |
| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to a `rank_features` field. |
| `field_map.<input_field>` | String | Required | The name of the field from which to obtain text for generating vector embeddings. |
| `field_map.<vector_field>` | String | Required | The name of the vector field in which to store the generated vector embeddings. |
| `description` | String | Optional | A brief description of the processor. |
| `tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
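As a concrete illustration, a processor definition that uses the optional parameters might look like the following. This is a sketch: the `tag` and `description` values are illustrative, not required values.

```json
{
  "sparse_encoding": {
    "model_id": "aP2Q8ooBpBj3wT4HVS8a",
    "field_map": {
      "passage_text": "passage_embedding"
    },
    "description": "A processor that generates sparse embeddings from passage_text",
    "tag": "sparse-encoding-v1"
  }
}
```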

## Using the processor

Follow these steps to use the processor in a pipeline. You must provide a model ID when creating the processor. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/).

**Step 1: Create a pipeline.**

The following example request creates an ingest pipeline where the text from `passage_text` will be converted into sparse vector embeddings, which will be stored in `passage_embedding`:

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "aP2Q8ooBpBj3wT4HVS8a",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}

**Step 2 (Optional): Test the pipeline.**

It is recommended that you test your pipeline before you ingest documents.
{: .tip}

To test the pipeline, run the following query:

```json
POST _ingest/pipeline/nlp-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_index": "testindex1",
      "_id": "1",
      "_source": {
        "passage_text": "hello world"
      }
    }
  ]
}
```
{% include copy-curl.html %}

#### Response

The response confirms that in addition to the `passage_text` field, the processor has generated sparse vector embeddings, stored as token-weight pairs, in the `passage_embedding` field:

```json
{
  "docs" : [
    {
      "doc" : {
        "_index" : "testindex1",
        "_id" : "1",
        "_source" : {
          "passage_embedding" : {
            "!" : 0.8708904,
            "door" : 0.8587369,
            "hi" : 2.3929274,
            "worlds" : 2.7839446,
            "yes" : 0.75845814,
            "##world" : 2.5432441,
            "born" : 0.2682308,
            "nothing" : 0.8625516,
            "goodbye" : 0.17146169,
            "greeting" : 0.96817183,
            "birth" : 1.2788506,
            "come" : 0.1623208,
            "global" : 0.4371151,
            "it" : 0.42951578,
            "life" : 1.5750692,
            "thanks" : 0.26481047,
            "world" : 4.7300377,
            "tiny" : 0.5462298,
            "earth" : 2.6555297,
            "universe" : 2.0308156,
            "worldwide" : 1.3903781,
            "hello" : 6.696973,
            "so" : 0.20279501,
            "?" : 0.67785245
          },
          "passage_text" : "hello world"
        },
        "_ingest" : {
          "timestamp" : "2023-10-11T22:35:53.654650086Z"
        }
      }
    }
  ]
}
```
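After validating the pipeline, you can create an index that stores the generated embeddings in a `rank_features` field, ingest a document through the pipeline, and search it with a `neural_sparse` query. The following requests are a sketch: the index name `my-nlp-index` and the query text are illustrative, and the `model_id` is the one used in the pipeline above.

```json
PUT /my-nlp-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "rank_features"
      }
    }
  }
}
```

```json
PUT /my-nlp-index/_doc/1
{
  "passage_text": "hello world"
}
```

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "hi planet",
        "model_id": "aP2Q8ooBpBj3wT4HVS8a"
      }
    }
  }
}
```

Because `default_pipeline` is set, the ingested document is processed by `nlp-ingest-pipeline` automatically; alternatively, you can pass `?pipeline=nlp-ingest-pipeline` on the index request instead of setting a default pipeline.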

## Next steps

- To learn how to use the `neural_sparse` query for a sparse search, see [Neural sparse query]({{site.url}}{{site.baseurl}}/query-dsl/specialized/neural-sparse/).
- To learn more about sparse neural search, see [Sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/).
- To learn more about using models in OpenSearch, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/).
- For a semantic search tutorial, see [Semantic search]({{site.url}}{{site.baseurl}}/ml-commons-plugin/semantic-search/).
128 changes: 128 additions & 0 deletions _api-reference/ingest-apis/processors/text-embedding.md
@@ -0,0 +1,128 @@
---
layout: default
title: Text embedding
parent: Ingest processors
grand_parent: Ingest APIs
nav_order: 260
---

# Text embedding

The `text_embedding` processor is used to generate vector embeddings from text fields for [neural search]({{site.url}}{{site.baseurl}}/search-plugins/neural-search/).

**PREREQUISITE**<br>
Before using the `text_embedding` processor, you must set up a machine learning (ML) model. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/) and [Semantic search]({{site.url}}{{site.baseurl}}/ml-commons-plugin/semantic-search/).
{: .note}

The following is the syntax for the `text_embedding` processor:

```json
{
  "text_embedding": {
    "model_id": "<model_id>",
    "field_map": {
      "<input_field>": "<vector_field>"
    }
  }
}
```
{% include copy-curl.html %}

#### Configuration parameters

The following table lists the required and optional parameters for the `text_embedding` processor.

| Name | Data type | Required | Description |
|:---|:---|:---|:---|
| `model_id` | String | Required | The ID of the model that will be used to generate the embeddings. The model must be deployed in OpenSearch before it can be used in neural search. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/) and [Semantic search]({{site.url}}{{site.baseurl}}/ml-commons-plugin/semantic-search/). |
| `field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to a vector field. |
| `field_map.<input_field>` | String | Required | The name of the field from which to obtain text for generating text embeddings. |
| `field_map.<vector_field>` | String | Required | The name of the vector field in which to store the generated text embeddings. |
| `description` | String | Optional | A brief description of the processor. |
| `tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
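As a concrete illustration, a processor definition that uses the optional parameters might look like the following. This is a sketch: the `tag` and `description` values are illustrative, not required values.

```json
{
  "text_embedding": {
    "model_id": "bQ1J8ooBpBj3wT4HVUsb",
    "field_map": {
      "passage_text": "passage_embedding"
    },
    "description": "A processor that generates text embeddings from passage_text",
    "tag": "text-embedding-v1"
  }
}
```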

## Using the processor

Follow these steps to use the processor in a pipeline. You must provide a model ID when creating the processor. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/).

**Step 1: Create a pipeline.**

The following example request creates an ingest pipeline where the text from `passage_text` will be converted into text embeddings and the embeddings will be stored in `passage_embedding`:

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
  "description": "A text embedding pipeline",
  "processors": [
    {
      "text_embedding": {
        "model_id": "bQ1J8ooBpBj3wT4HVUsb",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}

**Step 2 (Optional): Test the pipeline.**

It is recommended that you test your pipeline before you ingest documents.
{: .tip}

To test the pipeline, run the following query:

```json
POST _ingest/pipeline/nlp-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_index": "testindex1",
      "_id": "1",
      "_source": {
        "passage_text": "hello world"
      }
    }
  ]
}
```
{% include copy-curl.html %}

#### Response

The response confirms that in addition to the `passage_text` field, the processor has generated text embeddings in the `passage_embedding` field:

```json
{
  "docs": [
    {
      "doc": {
        "_index": "testindex1",
        "_id": "1",
        "_source": {
          "passage_embedding": [
            -0.048237972,
            -0.07612712,
            0.3262124,
            ...
            -0.16352308
          ],
          "passage_text": "hello world"
        },
        "_ingest": {
          "timestamp": "2023-10-05T15:15:19.691345393Z"
        }
      }
    }
  ]
}
```
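After validating the pipeline, you can create an index that stores the generated embeddings in a `knn_vector` field, ingest a document through the pipeline, and search it with a `neural` query. The following requests are a sketch: the index name `my-nlp-index` and the query text are illustrative, the `dimension` value (768 here) must match the output dimension of your specific model, and the `model_id` is the one used in the pipeline above.

```json
PUT /my-nlp-index
{
  "settings": {
    "index.knn": true,
    "default_pipeline": "nlp-ingest-pipeline"
  },
  "mappings": {
    "properties": {
      "passage_text": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "engine": "lucene"
        }
      }
    }
  }
}
```

```json
PUT /my-nlp-index/_doc/1
{
  "passage_text": "hello world"
}
```

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural": {
      "passage_embedding": {
        "query_text": "hi planet",
        "model_id": "bQ1J8ooBpBj3wT4HVUsb",
        "k": 5
      }
    }
  }
}
```

Because `default_pipeline` is set, the ingested document is processed by `nlp-ingest-pipeline` automatically; alternatively, you can pass `?pipeline=nlp-ingest-pipeline` on the index request instead of setting a default pipeline.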

## Next steps

- To learn how to use the `neural` query for text search, see [Neural query]({{site.url}}{{site.baseurl}}/query-dsl/specialized/neural/).
- To learn more about neural text search, see [Text search]({{site.url}}{{site.baseurl}}/search-plugins/neural-text-search/).
- To learn more about using models in OpenSearch, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/ml-framework/).
- For a semantic search tutorial, see [Semantic search]({{site.url}}{{site.baseurl}}/ml-commons-plugin/semantic-search/).