Add example to text chunking processor documentation (#6794)
* add search document example for text chunking and embedding pipeline (Signed-off-by: yuye-aws <[email protected]>)
* tune document (Signed-off-by: yuye-aws <[email protected]>)
* Add the text chunking page (Signed-off-by: Fanit Kolchina <[email protected]>)
* correct example (Signed-off-by: yuye-aws <[email protected]>)
* Update _search-plugins/text-chunking.md (Co-authored-by: Nathan Bower <[email protected]>; Signed-off-by: Yuye Zhu <[email protected]>)
* Update _search-plugins/text-chunking.md (Co-authored-by: Nathan Bower <[email protected]>; Signed-off-by: Yuye Zhu <[email protected]>)
* resolve review comments (Signed-off-by: yuye-aws <[email protected]>)
* Move cascading section to processor file (Signed-off-by: Fanit Kolchina <[email protected]>)

Signed-off-by: yuye-aws <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Yuye Zhu <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
1 parent 5d9edcb · commit d676a79

Showing 4 changed files with 128 additions and 115 deletions.
---
layout: default
title: Text chunking
nav_order: 65
---

# Text chunking
Introduced 2.13
{: .label .label-purple }

To split long text into passages, you can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor in order to obtain embeddings for each chunked passage. For more information about the processor parameters, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/). Before you start, follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model. The following example preprocesses text by splitting it into passages and then produces embeddings using the `text_embedding` processor.

## Step 1: Create a pipeline

The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_chunk_embedding` field:

```json
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "LMLPWY4BROvhdbtgETaI",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}
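
Before ingesting any documents, you can test the pipeline using the Simulate Pipeline API. The following request is a quick verification sketch; it assumes that the model referenced in the pipeline has already been registered and deployed:

```json
POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}
```
{% include copy-curl.html %}

The response contains the `passage_chunk` array produced by the `text_chunking` processor along with the corresponding `passage_chunk_embedding` values, so you can confirm the chunking configuration before indexing any documents.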

## Step 2: Create an index for ingestion

To use the ingest pipeline, you need to create a k-NN index. The `passage_chunk_embedding` field must be of the `nested` type. The `knn.dimension` field must be set to the number of dimensions of your model:

```json
PUT testindex
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
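
If you're unsure of the number of dimensions for your model, you can look it up by retrieving the model's details. The following request uses the placeholder model ID from Step 1; for pretrained models, the response typically includes the dimension in `model_config.embedding_dimension`:

```json
GET /_plugins/_ml/models/LMLPWY4BROvhdbtgETaI
```
{% include copy-curl.html %}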

## Step 3: Ingest documents into the index

To ingest a document into the index created in the previous step, send the following request:

```json
POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
{
  "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}
```
{% include copy-curl.html %}
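
To verify how the document was chunked, you can retrieve it from the index. This is an illustrative request: replace `<document_id>` with the `_id` value returned in the ingest response:

```json
GET testindex/_doc/<document_id>
```
{% include copy-curl.html %}

The document source contains the original `passage_text` field, the `passage_chunk` array of passages, and a `passage_chunk_embedding` entry for each passage.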

## Step 4: Search the index using neural search

You can use a `nested` query to perform a vector search on your index. We recommend setting `score_mode` to `max`, which sets the document score to the highest score among all of its passage embeddings:

```json
GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "-tHZeI4BdQKclr136Wl7"
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
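
Because the passage embeddings are stored in the document `_source`, search responses can be large. As an optional refinement to the preceding query, you can exclude the embeddings from the returned documents using source filtering:

```json
GET testindex/_search
{
  "_source": {
    "excludes": [
      "passage_chunk_embedding"
    ]
  },
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "-tHZeI4BdQKclr136Wl7"
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}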