From e45d880f50b40b9ec38379bd2bd673b5b0e115e5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Wed, 18 Sep 2024 12:03:10 +0200
Subject: [PATCH] [DOCS] Gives more details to the load data step of the
 semantic search tutorials (#113088) (#113095)

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
---
 .../semantic-search-elser.asciidoc            | 26 ++++++++++++++-----
 .../semantic-search-inference.asciidoc        | 17 +++++++-----
 .../semantic-search-semantic-text.asciidoc    | 17 +++++++-----
 3 files changed, 39 insertions(+), 21 deletions(-)

diff --git a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
index 11aec59a00b30..5309b24fa37c9 100644
--- a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
@@ -117,15 +117,15 @@ All unique passages, along with their IDs, have been extracted from that data se
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
 IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to train the model.
-It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
+We use this sample dataset in the tutorial because it is easily accessible for demonstration purposes.
 You can use a different data set to test the workflow and become familiar with it.
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[File Uploader] in the UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-elser]]
@@ -161,6 +161,18 @@ GET _tasks/
 You can also open the Trained Models UI, select the Pipelines tab under ELSER to
 follow the progress.
 
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
+
+[source,console]
+----
+POST _tasks/<task_id>/_cancel
+----
+// TEST[skip:TBD]
+
+
 [discrete]
 [[text-expansion-query]]
 ==== Semantic search by using the `sparse_vector` query
diff --git a/docs/reference/search/search-your-data/semantic-search-inference.asciidoc b/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
index 8ad36f17530d1..360d835560b50 100644
--- a/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
@@ -67,12 +67,12 @@ It consists of 200 queries, each accompanied by a list of relevant text passages
 All unique passages, along with their IDs, have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-infer]]
@@ -91,7 +91,10 @@ GET _tasks/
 ----
 // TEST[skip:TBD]
 
-You can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
 
 [source,console]
 ----
diff --git a/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc b/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
index e2cc2d8c62219..709d17091164c 100644
--- a/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
@@ -96,11 +96,12 @@ a list of relevant text passages. All unique passages, along with their IDs,
 have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI. Assign the name `id` to the first column and `content` to
-the second column. The index name is `test-data`. Once the upload is complete,
-you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 
 [discrete]
@@ -137,8 +138,10 @@ GET _tasks/
 ------------------------------------------------------------
 // TEST[skip:TBD]
 
-It is recommended to cancel the reindexing process if you don't want to wait
-until it is fully complete which might take a long time for an inference endpoint with few assigned resources:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
 
 [source,console]
 ------------------------------------------------------------
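
Beyond the cancellation approach the patch documents, it can be handy to bound the reindex up front when trying the shortened workflow. The sketch below is not part of the patch: it assumes the `test-data` source index from the tutorials and uses `my-index` as a placeholder for whichever destination index (and ingest pipeline or mapping) the tutorial being followed has created. `max_docs` caps how many documents are reindexed, `source.size` only sets the scroll batch size, and `wait_for_completion=false` returns a task ID that can be used with the `GET _tasks/<task_id>` and `POST _tasks/<task_id>/_cancel` requests shown above.

[source,console]
----
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 50
  },
  "dest": {
    "index": "my-index"
  },
  "max_docs": 1000
}
----
// TEST[skip:TBD]

If the returned task ID is misplaced, running reindex tasks and their IDs can be recovered with the task management API:

[source,console]
----
GET _tasks?actions=*reindex&detailed=true
----
// TEST[skip:TBD]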