[DOCS] Gives more details to the load data step of the semantic search tutorials (#113088) (#113095)

Co-authored-by: Liam Thompson <[email protected]>
szabosteve and leemthompo authored Sep 18, 2024
1 parent abf3c0f commit e45d880
Showing 3 changed files with 39 additions and 21 deletions.
@@ -117,15 +117,15 @@ All unique passages, along with their IDs, have been extracted from that data set and compiled into a
https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

IMPORTANT: The `msmarco-passagetest2019-top1000` dataset was not utilized to train the model.
It is only used in this tutorial as a sample dataset that is easily accessible for demonstration purposes.
We use this sample dataset in the tutorial because it is easily accessible for demonstration purposes.
You can use a different data set to test the workflow and become familiar with it.

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI.
Assign the name `id` to the first column and `content` to the second column.
The index name is `test-data`.
Once the upload is complete, you can see an index named `test-data` with 182469 documents.
Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[File Uploader] in the UI.
After your data is analyzed, click **Override settings**.
Under **Edit field names**, assign `id` to the first column and `content` to the second.
Click **Apply**, then **Import**.
Name the index `test-data`, and click **Import**.
After the upload is complete, you will see an index named `test-data` with 182,469 documents.
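
To inspect the file before uploading it, the column assignment described above can be sketched in Python.
This is a minimal sketch, assuming the TSV file has no header row and exactly two tab-separated columns.
The function name `read_passages` and the sample rows are hypothetical and are not part of the tutorial.

[source,python]
----
import csv
import io


def read_passages(tsv_text):
    """Parse two-column TSV text into dicts with `id` and `content` keys."""
    # QUOTE_NONE keeps quotation marks inside passages as literal characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    docs = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        docs.append({"id": row[0], "content": row[1]})
    return docs


# Hypothetical sample rows, for illustration only.
sample = '0\tFirst example passage.\n1\tSecond "quoted" passage.\n'
docs = read_passages(sample)
print(docs[0])
----

Each resulting dictionary mirrors one document in the `test-data` index: the first column becomes `id` and the second becomes `content`.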

[discrete]
[[reindexing-data-elser]]
@@ -161,6 +161,18 @@ GET _tasks/<task_id>

You can also open the Trained Models UI and select the Pipelines tab under ELSER to follow the progress.

Reindexing large datasets can take a long time.
You can test this workflow using only a subset of the dataset.
Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
The following API request will cancel the reindexing task:

[source,console]
----
POST _tasks/<task_id>/_cancel
----
// TEST[skip:TBD]
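
The same cancel request can be issued from a script instead of the console.
The following is a minimal sketch that uses only the Python standard library.
It assumes an unsecured cluster at `http://localhost:9200` and a placeholder task ID; both values, and the helper names `build_cancel_request` and `cancel_task`, are hypothetical.
A real cluster will usually also require credentials and TLS settings, which this sketch does not handle.

[source,python]
----
import json
import urllib.request


def build_cancel_request(base_url, task_id):
    """Build (but do not send) a POST request to the task cancel endpoint."""
    url = f"{base_url.rstrip('/')}/_tasks/{task_id}/_cancel"
    return urllib.request.Request(url, method="POST")


def cancel_task(base_url, task_id):
    """Send the cancel request and return the decoded JSON response."""
    with urllib.request.urlopen(build_cancel_request(base_url, task_id)) as resp:
        return json.load(resp)


# Placeholder values: replace them with your cluster URL and the task ID
# returned by the reindex request.
request = build_cancel_request("http://localhost:9200", "node_id:12345")
print(request.get_method(), request.full_url)
----

Building the request separately from sending it makes it possible to check the method and URL before anything reaches the cluster.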


[discrete]
[[text-expansion-query]]
==== Semantic search by using the `sparse_vector` query
@@ -67,12 +67,12 @@ It consists of 200 queries, each accompanied by a list of relevant text passages
All unique passages, along with their IDs, have been extracted from that data set and compiled into a
https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI.
Assign the name `id` to the first column and `content` to the second column.
The index name is `test-data`.
Once the upload is complete, you can see an index named `test-data` with 182469 documents.
Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
After your data is analyzed, click **Override settings**.
Under **Edit field names**, assign `id` to the first column and `content` to the second.
Click **Apply**, then **Import**.
Name the index `test-data`, and click **Import**.
After the upload is complete, you will see an index named `test-data` with 182,469 documents.

[discrete]
[[reindexing-data-infer]]
@@ -91,7 +91,10 @@ GET _tasks/<task_id>
----
// TEST[skip:TBD]

You can also cancel the reindexing process if you don't want to wait until the reindexing process is fully complete which might take hours for large data sets:
Reindexing large datasets can take a long time.
You can test this workflow using only a subset of the dataset.
Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
The following API request will cancel the reindexing task:

[source,console]
----
@@ -96,11 +96,12 @@ a list of relevant text passages. All unique passages, along with their IDs,
have been extracted from that data set and compiled into a
https://github.com/elastic/stack-docs/blob/main/docs/en/stack/ml/nlp/data/msmarco-passagetest2019-unique.tsv[tsv file].

Download the file and upload it to your cluster using the
{kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer]
in the {ml-app} UI. Assign the name `id` to the first column and `content` to
the second column. The index name is `test-data`. Once the upload is complete,
you can see an index named `test-data` with 182469 documents.
Download the file and upload it to your cluster using the {kibana-ref}/connect-to-elasticsearch.html#upload-data-kibana[Data Visualizer] in the {ml-app} UI.
After your data is analyzed, click **Override settings**.
Under **Edit field names**, assign `id` to the first column and `content` to the second.
Click **Apply**, then **Import**.
Name the index `test-data`, and click **Import**.
After the upload is complete, you will see an index named `test-data` with 182,469 documents.


[discrete]
@@ -137,8 +138,10 @@ GET _tasks/<task_id>
------------------------------------------------------------
// TEST[skip:TBD]

It is recommended to cancel the reindexing process if you don't want to wait
until it is fully complete which might take a long time for an inference endpoint with few assigned resources:
Reindexing large datasets can take a long time.
You can test this workflow using only a subset of the dataset.
Do this by cancelling the reindexing process, and only generating embeddings for the subset that was reindexed.
The following API request will cancel the reindexing task:

[source,console]
------------------------------------------------------------