Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search latency tracking - Coordinator node #8386

Merged
merged 61 commits into from
Sep 20, 2023

Conversation

buddharajusahil
Copy link
Contributor

@buddharajusahil buddharajusahil commented Jun 30, 2023

Description:

This PR is addressing this issue: #8381

We are trying to add search stats to coordinator level node stats. This keeps track of the total time, current requests, and total requests of each request phase.

Also added support for general coordinator stats as well on the node level.

Manual Testing:

Case 1:

First command:

curl -X PUT "localhost:9200/test2?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1
  }
}
'

1 shard test

Insert Data:

curl -X POST "localhost:9200/_bulk?pretty" -H 'Content-Type: application/json' -d'
{ "index" : { "_index" : "test2", "_id" : "1" } }
{ "field1" : "value1" }
{ "index" : { "_index" : "test2", "_id" : "2" } }
{ "field2" : "value2" }
{ "index" : { "_index" : "test2", "_id" : "3" } }
{ "field3" : "value3" }
{ "index" : { "_index" : "test2", "_id" : "4" } }
{ "field4" : "value4" }
'

Search Query

curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": {
      "field1": {
        "value": "*v",
        "boost": 1.0
      }
    }
  }
}
'

This is an unsuccessful search request
We will grab the node stats now with this command:

curl -XGET 'localhost:9200/_nodes/runTask-0/stats/indices/search?pretty=true'

"indices" : {
  "search" : {
    "open_contexts" : 0,
    "query_total" : 1,
    "query_time_in_millis" : 7,
    "query_current" : 0,
    "fetch_total" : 1,
    "fetch_time_in_millis" : 1,
    "fetch_current" : 0,
    "scroll_total" : 0,
    "scroll_time_in_millis" : 0,
    "scroll_current" : 0,
    "point_in_time_total" : 0,
    "point_in_time_time_in_millis" : 0,
    "point_in_time_current" : 0,
    "suggest_total" : 0,
    "suggest_time_in_millis" : 0,
    "suggest_current" : 0,
    "coordinator" : {
      "dfs_prequery_time_in_millis" : 0,
      "dfs_prequery_current" : 0,
      "dfs_prequery_total" : 0,
      "canmatch_time_in_millis" : 0,
      "canmatch_current" : 0,
      "canmatch_total" : 0,
      "query_time_in_millis" : 30,
      "query_current" : 0,
      "query_total" : 1,
      "fetch_time_in_millis" : 1,
      "fetch_current" : 0,
      "fetch_total" : 1,
      "expand_time_in_millis" : 0,
      "expand_current" : 0,
      "expand_total" : 1
    }
  }
 }

This is the expected output, as the phases after query will take minimal time due to the query phase resulting in 0 hits

If we instead run this query:

curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "query_string": { "query": "abc" } }
}'

We get this response for the coordinator stats:

"indices" : {
  "search" : {
    "open_contexts" : 0,
    "query_total" : 1,
    "query_time_in_millis" : 7,
    "query_current" : 0,
    "fetch_total" : 1,
    "fetch_time_in_millis" : 1,
    "fetch_current" : 0,
    "scroll_total" : 0,
    "scroll_time_in_millis" : 0,
    "scroll_current" : 0,
    "point_in_time_total" : 0,
    "point_in_time_time_in_millis" : 0,
    "point_in_time_current" : 0,
    "suggest_total" : 0,
    "suggest_time_in_millis" : 0,
    "suggest_current" : 0,
    "coordinator" : {
      "dfs_prequery_time_in_millis" : 0,
      "dfs_prequery_current" : 0,
      "dfs_prequery_total" : 0,
      "canmatch_time_in_millis" : 0,
      "canmatch_current" : 0,
      "canmatch_total" : 0,
      "query_time_in_millis" : 25,
      "query_current" : 0,
      "query_total" : 1,
      "fetch_time_in_millis" : 1,
      "fetch_current" : 0,
      "fetch_total" : 1,
      "expand_time_in_millis" : 0,
      "expand_current" : 0,
      "expand_total" : 1
    }
  }
 }

So now if we run both queries subsequently, we get these coordinator stats:

"indices" : {
  "search" : {
    "open_contexts" : 0,
    "query_total" : 2,
    "query_time_in_millis" : 9,
    "query_current" : 0,
    "fetch_total" : 2,
    "fetch_time_in_millis" : 1,
    "fetch_current" : 0,
    "scroll_total" : 0,
    "scroll_time_in_millis" : 0,
    "scroll_current" : 0,
    "point_in_time_total" : 0,
    "point_in_time_time_in_millis" : 0,
    "point_in_time_current" : 0,
    "suggest_total" : 0,
    "suggest_time_in_millis" : 0,
    "suggest_current" : 0,
    "coordinator" : {
      "dfs_prequery_time_in_millis" : 0,
      "dfs_prequery_current" : 0,
      "dfs_prequery_total" : 0,
      "canmatch_time_in_millis" : 0,
      "canmatch_current" : 0,
      "canmatch_total" : 0,
      "query_time_in_millis" : 55,
      "query_current" : 0,
      "query_total" : 2,
      "fetch_time_in_millis" : 1,
      "fetch_current" : 0,
      "fetch_total" : 2,
      "expand_time_in_millis" : 0,
      "expand_current" : 0,
      "expand_total" : 2
    }
  }
 }

This makes sense because each individual query took 34 and 7 ms respectively, and it has correctly added up these stats.

Case 2:

In this case, we will perform a DFS query then fetch action instead of query then fetch action. We will populate the date in the same way with the same settings in case 1 except with 5 shards.

The command we will use is:

curl -XGET 'localhost:9200/_search?pretty&search_type=dfs_query_then_fetch' -H 'Content-Type: application/json' -d'
{
  "query": { "query_string": { "query": "abc" } }
}'

The response we get when we ask for coordinator stats is now:

"indices" : {
  "search" : {
    "open_contexts" : 0,
    "query_total" : 5,
    "query_time_in_millis" : 31,
    "query_current" : 0,
    "fetch_total" : 0,
    "fetch_time_in_millis" : 0,
    "fetch_current" : 0,
    "scroll_total" : 0,
    "scroll_time_in_millis" : 0,
    "scroll_current" : 0,
    "point_in_time_total" : 0,
    "point_in_time_time_in_millis" : 0,
    "point_in_time_current" : 0,
    "suggest_total" : 0,
    "suggest_time_in_millis" : 0,
    "suggest_current" : 0,
    "coordinator" : {
      "dfs_prequery_time_in_millis" : 32,
      "dfs_prequery_current" : 0,
      "dfs_prequery_total" : 1,
      "canmatch_time_in_millis" : 0,
      "canmatch_current" : 0,
      "canmatch_total" : 0,
      "query_time_in_millis" : 10,
      "query_current" : 0,
      "query_total" : 1,
      "fetch_time_in_millis" : 3,
      "fetch_current" : 0,
      "fetch_total" : 1,
      "expand_time_in_millis" : 0,
      "expand_current" : 0,
      "expand_total" : 1
    }
  }
}

This is as expected. If we then run a normal query then fetch from above:

curl -XGET 'localhost:9200/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": { "query_string": { "query": "abc" } }
}'

Now the coordinator stats are:

"indices" : {
  "search" : {
    "open_contexts" : 0,
    "query_total" : 10,
    "query_time_in_millis" : 33,
    "query_current" : 0,
    "fetch_total" : 0,
    "fetch_time_in_millis" : 0,
    "fetch_current" : 0,
    "scroll_total" : 0,
    "scroll_time_in_millis" : 0,
    "scroll_current" : 0,
    "point_in_time_total" : 0,
    "point_in_time_time_in_millis" : 0,
    "point_in_time_current" : 0,
    "suggest_total" : 0,
    "suggest_time_in_millis" : 0,
    "suggest_current" : 0,
    "coordinator" : {
      "dfs_prequery_time_in_millis" : 32,
      "dfs_prequery_current" : 0,
      "dfs_prequery_total" : 1,
      "canmatch_time_in_millis" : 0,
      "canmatch_current" : 0,
      "canmatch_total" : 0,
      "query_time_in_millis" : 17,
      "query_current" : 0,
      "query_total" : 2,
      "fetch_time_in_millis" : 3,
      "fetch_current" : 0,
      "fetch_total" : 2,
      "expand_time_in_millis" : 0,
      "expand_current" : 0,
      "expand_total" : 2
    }
  }
}

Performance Testing:

  • Test Config
    • c4.2xlarge - 3 data nodes, 3 master nodes - c4.xlarge
    • Test cluster setup
      • data node count: 3, c4.2xlarge
      • master node count: 3, c4.2xlarge
      • Open Search Version: 2.5
    • Workload Configuration
      • dataset: nyc_taxis

Latency:
image

Throughput Comparison:
image

Operation Counts:
image

CPU Usage:
image

Garbage Collection:
image

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Jul 5, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Jul 5, 2023

Gradle Check (Jenkins) Run Completed with:

@buddharajusahil buddharajusahil changed the title Coordinator stats Search latency tracking - Coordinator node Jul 5, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Jul 6, 2023

Gradle Check (Jenkins) Run Completed with:

sgup432 and others added 2 commits September 19, 2023 15:40
Removing unused variable from SearchRequestStats
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

sgup432 and others added 4 commits September 19, 2023 17:45
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

sgup432 and others added 2 commits September 20, 2023 09:31
Removing RequestStats file as not needed
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Copy link
Collaborator

@msfroh msfroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @sgup432! Looks good to me.

I'm going to keep retrying the Gradle check to see if we can get a build without the flaky test failures.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@msfroh msfroh merged commit b944194 into opensearch-project:main Sep 20, 2023
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/OpenSearch that referenced this pull request Sep 24, 2023
We are trying to add search stats to coordinator level node stats. This keeps track of the total time, current requests, and total requests of each request phase.

Also added support for general coordinator stats as well on the node level.

Signed-off-by: sahil buddharaju <[email protected]>
Signed-off-by: sahil <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Co-authored-by: sahil buddharaju <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar <[email protected]>
brusic pushed a commit to brusic/OpenSearch that referenced this pull request Sep 25, 2023
We are trying to add search stats to coordinator level node stats. This keeps track of the total time, current requests, and total requests of each request phase.

Also added support for general coordinator stats as well on the node level.

Signed-off-by: sahil buddharaju <[email protected]>
Signed-off-by: sahil <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Co-authored-by: sahil buddharaju <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar <[email protected]>
Signed-off-by: Ivan Brusic <[email protected]>
sgup432 added a commit to sgup432/OpenSearch that referenced this pull request Sep 29, 2023
We are trying to add search stats to coordinator level node stats. This keeps track of the total time, current requests, and total requests of each request phase.

Also added support for general coordinator stats as well on the node level.

Signed-off-by: sahil buddharaju <[email protected]>
Signed-off-by: sahil <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Co-authored-by: sahil buddharaju <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
vikasvb90 pushed a commit to vikasvb90/OpenSearch that referenced this pull request Oct 10, 2023
We are trying to add search stats to coordinator level node stats. This keeps track of the total time, current requests, and total requests of each request phase.

Also added support for general coordinator stats as well on the node level.

Signed-off-by: sahil buddharaju <[email protected]>
Signed-off-by: sahil <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Co-authored-by: sahil buddharaju <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar <[email protected]>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
We are trying to add search stats to coordinator level node stats. This keeps track of the total time, current requests, and total requests of each request phase.

Also added support for general coordinator stats as well on the node level.

Signed-off-by: sahil buddharaju <[email protected]>
Signed-off-by: sahil <[email protected]>
Signed-off-by: Sagar Upadhyaya <[email protected]>
Signed-off-by: Sagar <[email protected]>
Co-authored-by: sahil buddharaju <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar Upadhyaya <[email protected]>
Co-authored-by: Sagar <[email protected]>
Signed-off-by: Shivansh Arora <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants