Add example to text chunking processor documentation (#6794)
* add search document example for text chunking and embedding pipeline (Signed-off-by: yuye-aws <[email protected]>)
* tune document (Signed-off-by: yuye-aws <[email protected]>)
* Add the text chunking page (Signed-off-by: Fanit Kolchina <[email protected]>)
* correct example (Signed-off-by: yuye-aws <[email protected]>)
* Update _search-plugins/text-chunking.md (Co-authored-by: Nathan Bower <[email protected]>; Signed-off-by: Yuye Zhu <[email protected]>)
* Update _search-plugins/text-chunking.md (Co-authored-by: Nathan Bower <[email protected]>; Signed-off-by: Yuye Zhu <[email protected]>)
* resolve review comments (Signed-off-by: yuye-aws <[email protected]>)
* Move cascading section to processor file (Signed-off-by: Fanit Kolchina <[email protected]>)

Signed-off-by: yuye-aws <[email protected]>
Signed-off-by: Fanit Kolchina <[email protected]>
Signed-off-by: Yuye Zhu <[email protected]>
Co-authored-by: Fanit Kolchina <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
Co-authored-by: kolchfa-aws <[email protected]>
1 parent 5d9edcb · commit d676a79

Showing 4 changed files with 128 additions and 115 deletions.
---
layout: default
title: Text chunking
nav_order: 65
---

# Text chunking
Introduced 2.13
{: .label .label-purple }

To split long text into passages, you can use a `text_chunking` processor as a preprocessing step for a `text_embedding` or `sparse_encoding` processor in order to obtain embeddings for each chunked passage. For more information about the processor parameters, see [Text chunking processor]({{site.url}}{{site.baseurl}}/ingest-pipelines/processors/text-chunking/). Before you start, follow the steps outlined in the [pretrained model documentation]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/) to register an embedding model. The following example preprocesses text by splitting it into passages and then produces embeddings using the `text_embedding` processor.

## Step 1: Create a pipeline

The following example request creates an ingest pipeline that converts the text in the `passage_text` field into chunked passages, which will be stored in the `passage_chunk` field. The text in the `passage_chunk` field is then converted into text embeddings, and the embeddings are stored in the `passage_chunk_embedding` field:

```json
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "LMLPWY4BROvhdbtgETaI",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}
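
Before ingesting any documents, you can test the pipeline using the Simulate Pipeline API. The following request is a quick verification sketch; it assumes that the model referenced in the pipeline has already been registered and deployed:

```json
POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
      }
    }
  ]
}
```
{% include copy-curl.html %}

The response contains the `passage_chunk` array produced by the `text_chunking` processor along with the corresponding `passage_chunk_embedding` values, so you can confirm the chunking configuration before indexing any documents.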

## Step 2: Create an index for ingestion

To use the ingest pipeline, you need to create a k-NN index. The `passage_chunk_embedding` field must be of the `nested` type. The `knn.dimension` field must be set to the number of dimensions of your model:

```json
PUT testindex
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
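
If you're unsure of the number of dimensions for your model, you can look it up by retrieving the model's details. The following request uses the placeholder model ID from Step 1; for pretrained models, the response typically includes the dimension in `model_config.embedding_dimension`:

```json
GET /_plugins/_ml/models/LMLPWY4BROvhdbtgETaI
```
{% include copy-curl.html %}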

## Step 3: Ingest documents into the index

To ingest a document into the index created in the previous step, send the following request:

```json
POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
{
  "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}
```
{% include copy-curl.html %}
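
To verify how the document was chunked, you can retrieve it from the index. This is an illustrative request: replace `<document_id>` with the `_id` value returned in the ingest response:

```json
GET testindex/_doc/<document_id>
```
{% include copy-curl.html %}

The document source contains the original `passage_text` field, the `passage_chunk` array of passages, and a `passage_chunk_embedding` entry for each passage.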

## Step 4: Search the index using neural search

You can use a `nested` query to perform a vector search on your index. We recommend setting `score_mode` to `max`, which sets the document score to the highest score among all of its passage embeddings:

```json
GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "-tHZeI4BdQKclr136Wl7"
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
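
Because the passage embeddings are stored in the document `_source`, search responses can be large. As an optional refinement to the preceding query, you can exclude the embeddings from the returned documents using source filtering:

```json
GET testindex/_search
{
  "_source": {
    "excludes": [
      "passage_chunk_embedding"
    ]
  },
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "-tHZeI4BdQKclr136Wl7"
          }
        }
      }
    }
  }
}
```
{% include copy-curl.html %}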