From e45d880f50b40b9ec38379bd2bd673b5b0e115e5 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Wed, 18 Sep 2024 12:03:10 +0200
Subject: [PATCH] [DOCS] Gives more details to the load data step of the
 semantic search tutorials (#113088) (#113095)

Co-authored-by: Liam Thompson <32779855+leemthompo@users.noreply.github.com>
---
 .../semantic-search-elser.asciidoc            | 26 ++++++++++++++-----
 .../semantic-search-inference.asciidoc        | 17 +++++++-----
 .../semantic-search-semantic-text.asciidoc    | 17 +++++++-----
 3 files changed, 39 insertions(+), 21 deletions(-)

diff --git a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
index 11aec59a00b30..5309b24fa37c9 100644
--- a/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-elser.asciidoc
@@ -117,15 +117,15 @@ All unique passages, along with their IDs, have been extracted from that data se
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
 IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to train the model.
-It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
+We use this sample dataset in the tutorial because it is easily accessible for demonstration purposes.
 You can use a different data set to test the workflow and become familiar with it.
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[File Uploader] in the UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-elser]]
@@ -161,6 +161,18 @@ GET _tasks/
 You can also open the Trained Models UI, select the Pipelines tab under ELSER to
 follow the progress.
 
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
+
+[source,console]
+----
+POST _tasks/<task_id>/_cancel
+----
+// TEST[skip:TBD]
+
+
 [discrete]
 [[text-expansion-query]]
 ==== Semantic search by using the `sparse_vector` query
diff --git a/docs/reference/search/search-your-data/semantic-search-inference.asciidoc b/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
index 8ad36f17530d1..360d835560b50 100644
--- a/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-inference.asciidoc
@@ -67,12 +67,12 @@ It consists of 200 queries, each accompanied by a list of relevant text passages
 All unique passages, along with their IDs, have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI.
-Assign the name `id` to the first column and `content` to the second column.
-The index name is `test-data`.
-Once the upload is complete, you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 [discrete]
 [[reindexing-data-infer]]
@@ -91,7 +91,10 @@ GET _tasks/
 ----
 // TEST[skip:TBD]
 
-You can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
 
 [source,console]
 ----
diff --git a/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc b/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
index e2cc2d8c62219..709d17091164c 100644
--- a/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
+++ b/docs/reference/search/search-your-data/semantic-search-semantic-text.asciidoc
@@ -96,11 +96,12 @@ a list of relevant text passages. All unique passages, along with their IDs,
 have been extracted from that data set and compiled into a
 https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].
 
-Download the file and upload it to your cluster using the
-{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
-in the {ml-app} UI. Assign the name `id` to the first column and `content` to
-the second column. The index name is `test-data`. Once the upload is complete,
-you can see an index named `test-data` with 182469 documents.
+Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
+After your data is analyzed, click **Override settings**.
+Under **Edit field names**, assign `id` to the first column and `content` to the second.
+Click **Apply**, then **Import**.
+Name the index `test-data`, and click **Import**.
+After the upload is complete, you will see an index named `test-data` with 182,469 documents.
 
 
 [discrete]
@@ -137,8 +138,10 @@ GET _tasks/
 ------------------------------------------------------------
 // TEST[skip:TBD]
 
-It is recommended to cancel the reindexing process if you don't want to wait
-until it is fully complete which might take a long time for an inference endpoint with few assigned resources:
+Reindexing large datasets can take a long time.
+You can test this workflow using only a subset of the dataset.
+Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
+The following API request will cancel the reindexing task:
 
 [source,console]
 ------------------------------------------------------------
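
Beyond the cancellation approach the patch documents, it can be handy to bound the reindex up front when trying the shortened workflow. The sketch below is not part of the patch: it assumes the `test-data` source index from the tutorials and uses `my-index` as a placeholder for whichever destination index (and ingest pipeline or mapping) the tutorial being followed has created. `max_docs` caps how many documents are reindexed, `source.size` only sets the scroll batch size, and `wait_for_completion=false` returns a task ID that can be used with the `GET _tasks/<task_id>` and `POST _tasks/<task_id>/_cancel` requests shown above.

[source,console]
----
POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 50
  },
  "dest": {
    "index": "my-index"
  },
  "max_docs": 1000
}
----
// TEST[skip:TBD]

If the returned task ID is misplaced, running reindex tasks and their IDs can be recovered with the task management API:

[source,console]
----
GET _tasks?actions=*reindex&detailed=true
----
// TEST[skip:TBD]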