[RFC] Tracking Search Pipeline Execution #16705

junweid62 · 2024-11-22T18:37:23Z

Is your feature request related to a problem? Please describe

With the expansion of search pipeline processors, tracking data transformations and understanding data flow through complex processors is becoming challenging. The introduction of ML inference processors, which can manipulate model inputs and outputs, increases the need for a tool to visualize and debug the flow of data across these processors. Such functionality would aid in troubleshooting, help optimize pipeline configurations, and provide transparency for end-to-end transformations of search requests and responses.

As search pipeline processors grow in complexity, there is an increasing need to:

- Track how data flows and transforms through each processor.
- Debug data transformations and pinpoint any failures within the pipeline.
- View the end-to-end pipeline execution for both the request and response sides of a search.

This capability would also be valuable for frontend plugins like the Flow Framework, helping users configure and test complex ingest and search pipelines.

Describe the solution you'd like

Adding `verbose_pipeline` Parameter to Search Request [Preferred]

Overview

In this approach, the verbose_pipeline parameter is introduced as a query parameter in the search request URL. When used in conjunction with the search_pipeline parameter, it activates a debugging mode, allowing detailed tracking of search pipeline processor execution without requiring a new API or changes to the Explain API.

Pros

Minimal Changes to Existing Workflow:
- No need for a new API endpoint; the debugging functionality is seamlessly integrated into the existing search request.
Backward Compatibility:
- The verbose_pipeline parameter is optional and defaults to false. Existing search requests remain unaffected unless explicitly updated to include verbose_pipeline=true.
Alignment with OpenSearch Design:
- Consistent with the design of existing search features, such as the profile query parameter.

Cons

Performance Impact:
- Activating verbose mode may slightly increase computational load due to additional processor-level logging, primarily for debugging purposes. By integrating with the existing search backpressure mechanism, the system can dynamically manage resource usage, ensuring stability while allowing detailed debugging during low-load periods.

Example Request

GET /my_index/_search?search_pipeline=my_debug_pipeline&verbose_pipeline=true

Example Response

{
  "took": 15,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 50,
      "relation": "eq"
    },
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my_index",
        "_id": "1",
        "_score": 1.0,
        "_source": { "field": "value" }
      }
    ]
  },
  "processor_result": [
    {
      "processor": "filter_query",
      "status": "success",
      "execution_time": 3,
      "input": { "query": { "match_all": {} } },
      "output": { "query": { "filtered_query": { "match_all": {} } } }
    },
    {
      "processor": "collapse",
      "status": "success",
      "execution_time": 5,
      "input": { "hits": [...] },
      "output": { "collapsed_hits": [...] }
    }
  ]
}

Common Fields for All Processors

Each processor, regardless of type, will include the following common fields:

processor: The name or type of the processor (e.g., filter_query, collapse).
status: Indicates whether the processor completed successfully (success) or encountered an error (failure).
execution_time: The time taken by the processor to execute, in milliseconds.
input: The input data provided to the processor. The structure of this field varies depending on the processor type.
output: The transformed data output by the processor. The structure of this field varies depending on the processor type.
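
As a rough illustration of how a client might consume these common fields, here is a minimal Python sketch. The `ProcessorResult` class and both helper functions are hypothetical names chosen for this example; only the field names and the `processor_result` key come from the proposal above, and the final response shape may differ.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ProcessorResult:
    """One entry of the proposed `processor_result` array (hypothetical client-side model)."""
    processor: str
    status: str           # "success" or "failure"
    execution_time: int   # milliseconds, per the RFC
    input: Any            # shape depends on the processor type
    output: Any           # shape depends on the processor type


def parse_processor_results(response: dict) -> List[ProcessorResult]:
    """Read the verbose output; a missing key is treated as verbose mode being off."""
    return [ProcessorResult(**entry) for entry in response.get("processor_result", [])]


def first_failure(results: List[ProcessorResult]) -> Optional[ProcessorResult]:
    """Return the first processor that reported a failure, or None."""
    return next((r for r in results if r.status == "failure"), None)


# Sample data modeled on the example response; the failure is invented for the demo.
sample = {
    "processor_result": [
        {"processor": "filter_query", "status": "success", "execution_time": 3,
         "input": {"query": {"match_all": {}}},
         "output": {"query": {"filtered_query": {"match_all": {}}}}},
        {"processor": "collapse", "status": "failure", "execution_time": 5,
         "input": {"hits": []}, "output": None},
    ]
}
results = parse_processor_results(sample)
failed = first_failure(results)
```

Because every entry shares the same five fields, a debugging tool could walk the list in order and stop at the first entry whose status is failure.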

Request Processor Fields

For processors that handle the incoming search request:

input: The original search request before processing (e.g., the query, filters, and other parameters).
output: The modified search request after this processor has applied its transformations.

Example:

{
  "processor": "filter_query",
  "status": "success",
  "execution_time": 3,
  "input": { "query": { "match_all": {} } },
  "output": { "query": { "filtered_query": { "match_all": {} } } }
}

Search Phase Result Processor Fields

For processors that handle intermediate results during the search phase:

input: The set of search hits or results passed into this processor.
output: The modified or filtered set of search hits after the processor has completed its operation.

{
  "processor": "normalization-processor",
  "status": "success",
  "execution_time": 5,
  "input": {
    "hits": [
      { "_index": "my_index", "_id": "1", "_score": 1.0, "_source": { "field": "value1" } },
      { "_index": "my_index", "_id": "2", "_score": 0.9, "_source": { "field": "value2" } }
    ]
  },
  "output": {
    "hits": [
      { "_index": "my_index", "_id": "1", "_score": 1.0, "_source": { "field": "value1" } }
    ]
  }
}

Response Processor Fields

For processors that handle the final search response:

input: The raw search response from the previous phase or processor.
output: The final transformed response to be returned to the client.

{
  "processor": "rerank",
  "status": "success",
  "execution_time": 4,
  "input": { "hits": [ ... ] },
  "output": { "hits": [ ... ] }
}

Verbose Mode Support Across Search Pipeline Configurations

The verbose mode is designed to seamlessly integrate with all ways of using a search pipeline, ensuring consistent debugging capabilities regardless of the method chosen. Below is an overview of how verbose mode supports different search pipeline configurations:

Default Search Pipeline

PUT /my_index/_settings
{
  "index.search.default_pipeline": "my_pipeline"
}

GET /my_index/_search?verbose_pipeline=true

Specified Search Pipeline by ID

GET /my_index/_search?search_pipeline=my_pipeline&verbose_pipeline=true

Ad-Hoc (Temporary) Search Pipeline

POST /my_index/_search?verbose_pipeline=true
{
  "query": {
    "match": { "text_field": "some search text" }
  },
  "search_pipeline": {
    "request_processors": [
      {
        "filter_query": {
          "query": { "term": { "visibility": "public" } }
        }
      }
    ],
    "response_processors": [
      {
        "collapse": {
          "field": "category"
        }
      }
    ]
  }
}
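
The three configurations differ only in where the pipeline is defined, so the verbose flag can be attached the same way in each case. The sketch below builds the request path and body in Python without sending anything; `search_path` and `AD_HOC_BODY` are hypothetical names for this example, and `verbose_pipeline` is the parameter proposed in this RFC, not an existing API.

```python
from urllib.parse import urlencode


def search_path(index, pipeline_id=None, verbose=True):
    """Build the _search path for the default or by-ID pipeline cases."""
    params = {}
    if pipeline_id is not None:
        params["search_pipeline"] = pipeline_id   # omit to use the index default pipeline
    if verbose:
        params["verbose_pipeline"] = "true"       # proposed parameter, defaults to false
    query = urlencode(params)
    return f"/{index}/_search" + (f"?{query}" if query else "")


# Ad-hoc case: the pipeline travels in the request body, the flag stays in the URL.
AD_HOC_BODY = {
    "query": {"match": {"text_field": "some search text"}},
    "search_pipeline": {
        "request_processors": [
            {"filter_query": {"query": {"term": {"visibility": "public"}}}}
        ],
        "response_processors": [{"collapse": {"field": "category"}}],
    },
}

default_case = search_path("my_index")
by_id_case = search_path("my_index", "my_pipeline")
ad_hoc_case = (search_path("my_index"), AD_HOC_BODY)
```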

Related component

Search

Describe alternatives you've considered

No response

Additional context

No response


pyek-bot · 2024-11-22T19:20:56Z

Hi, thanks for the detailed information! I'm pretty new to the OpenSearch project. I'm trying to understand: when the verbose_pipeline parameter is passed, where does it get its information from? Is it from a map, a system index, or logs? I'd like to know where this information is persisted or what its source is.

junweid62 · 2024-11-22T21:53:46Z

> Hi, thanks for the detailed information! I'm pretty new to the OpenSearch project. I'm trying to understand: when the verbose_pipeline parameter is passed, where does it get its information from? Is it from a map, a system index, or logs? I'd like to know where this information is persisted or what its source is.

Thanks for your question! When the verbose_pipeline parameter is passed, the information is not persisted anywhere; it is generated dynamically during request execution. The source of the information is the actual processing flow of the search pipeline in memory. Each processor in the pipeline records its input, output, status, and execution time as the request flows through.

This information is collected directly from the execution of the processors and returned as part of the response. It is not stored in a system index, map, or logs, which helps keep the feature lightweight and avoids adding unnecessary overhead to the system.

Hope this clarifies! Let me know if you have any follow-up questions!
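
To make the in-memory collection concrete, here is a minimal Python sketch of the idea, not the OpenSearch implementation. `run_with_trace` and `add_visibility_filter` are hypothetical names, and stopping at the first failure with an empty output is an assumption made for the example.

```python
import copy
import time


def run_with_trace(processors, request):
    """Run (name, function) pairs in order and record each step in memory."""
    trace = []
    current = request
    for name, process in processors:
        # Snapshot the input so that later mutation cannot alter the record.
        snapshot = copy.deepcopy(current)
        start = time.perf_counter()
        try:
            result = process(current)
            status = "success"
        except Exception:
            # Assumption: a failed step records no output and ends the run.
            result, status = None, "failure"
        elapsed_ms = int((time.perf_counter() - start) * 1000)
        trace.append({
            "processor": name,
            "status": status,
            "execution_time": elapsed_ms,
            "input": snapshot,
            "output": copy.deepcopy(result),
        })
        if status == "failure":
            break
        current = result
    return current, trace


def add_visibility_filter(request):
    """Stand-in for a request processor: wrap the query in a bool filter."""
    return {**request, "query": {"bool": {
        "must": [request["query"]],
        "filter": [{"term": {"visibility": "public"}}],
    }}}


final, trace = run_with_trace(
    [("filter_query", add_visibility_filter)],
    {"query": {"match_all": {}}},
)
```

Nothing here touches an index or a log file: the trace lives only as long as the request and is returned alongside the result.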

owaiskazi19 · 2024-11-26T21:41:10Z

> By integrating with the existing search backpressure mechanism, the system can dynamically manage resource usage

Since this param is part of the search request, search backpressure would already be integrated.

owaiskazi19 · 2024-11-26T21:41:43Z

@msfroh @reta @andrross @dblock can you take a look at this RFC and share your thoughts?

msfroh · 2024-11-26T22:45:57Z

I think it's a good idea. As search pipelines get more complicated, getting step-by-step logging from each processor will be useful (though the response can get quite large -- kind of like profiler output). I discussed doing something like this with @mingshl.

As a minor correction to the section Search Phase Result Processor Fields, the phase results are just doc IDs and scores (if I recall correctly). Still, being able to see how scores were processed, e.g. by the hybrid query score normalizer, would be pretty nice. I was discussing ways of getting "explain" output from the normalizer with @martin-gaievski, and I think this could solve that problem.

If you need somewhere to store the verbose output, the PipelinedRequest object that flows through the pipelines might be an option. At the end, you could copy the output from that into the final SearchResponse that gets returned to the client.

owaiskazi19 · 2024-11-26T22:56:52Z

> (though the response can get quite large -- kind of like profiler output)

To handle this, do you think we could add a size parameter, or limit the response to a small number of hits (say 5), since we only need to see how each processor transforms the data? We probably don't need the entire SearchResponse.

> If you need somewhere to store the verbose output

I don't think we have to store it. We could read directly from the ProcessorResultMap and return it in the SearchResponse.
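
One way the size limit suggested above could work is to cap every hits list inside a verbose entry before it is returned. The Python sketch below is only an illustration; `truncate_hits` and its default limit of 5 are hypothetical and not part of the proposal.

```python
def truncate_hits(value, limit=5):
    """Return a copy of `value` with every list stored under a "hits" key cut to `limit` items."""
    if isinstance(value, dict):
        return {
            key: truncate_hits(item[:limit] if key == "hits" and isinstance(item, list) else item, limit)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [truncate_hits(item, limit) for item in value]
    return value


# Invented sample entry with more hits than the limit allows.
entry = {
    "processor": "collapse",
    "input": {"hits": [{"_id": str(i)} for i in range(8)]},
    "output": {"hits": [{"_id": "0"}]},
}
trimmed = truncate_hits(entry)
```

The function builds new containers instead of editing in place, so the full result passed on to the next processor is left untouched.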

mingshl · 2024-11-26T23:05:21Z

Glad to see this RFC is cut! It would be helpful for debugging and tracking search processors. I hope this can be added to ingest pipelines too. Can the ingest pipeline take a similar design? That way, if we decide to do the same for ingest pipelines, we can keep it consistent.

Wondering if we can reuse PipelineProcessingContext, which can carry a map of context between processors.

owaiskazi19 · 2024-11-26T23:09:32Z

> Can the ingest pipeline take a similar design? That way, if we decide to do the same for ingest pipelines, we can keep it consistent.

We already have verbose for ingest pipelines: https://opensearch.org/docs/latest/ingest-pipelines/simulate-ingest/#query-parameters

junweid62 · 2024-11-26T23:20:13Z

> Wondering if we can reuse PipelineProcessingContext, which can carry a map of context between processors.

Thanks for the suggestion! I just checked the code base, and PipelineProcessingContext does seem like a good fit for this purpose. It looks like it can effectively carry debug information between processors. I'll explore this further and see how it aligns with the verbose mode implementation.

junweid62 · 2024-11-26T23:32:07Z

> As a minor correction to the section Search Phase Result Processor Fields, the phase results are just doc IDs and scores (if I recall correctly). Still, being able to see how scores were processed, e.g. by the hybrid query score normalizer, would be pretty nice. I was discussing ways of getting "explain" output from the normalizer with @martin-gaievski, and I think this could solve that problem.

Thanks for the feedback! I’ll update the section on Search Phase Result Processor Fields to reflect that the phase results are just doc IDs and scores. Thanks for pointing that out.

> If you need somewhere to store the verbose output, the PipelinedRequest object that flows through the pipelines might be an option. At the end, you could copy the output from that into the final SearchResponse that gets returned to the client.

I’ll also take a closer look at using the PipelinedRequest object for storing verbose output.

reta · 2024-11-27T06:52:10Z

Thanks @junweid62, certainly +1 to the feature. The only concern I have is that we are probably introducing too many knobs on the search side:

- the search request has profile, which we could use to trace the search request in detail
- yes, we do have a verbose setting for ingest pipelines (as you pointed out), and introducing it for consistency would probably have made sense, but ingest does not have profile and/or explain (AFAIK; maybe that's what we could bring to the ingest side instead?)

From my perspective, profile should be sufficient; there is no need to introduce a verbose setting on the search side.

junweid62 · 2024-11-27T18:02:24Z

> Thanks @junweid62, certainly +1 to the feature. The only concern I have is that we are probably introducing too many knobs on the search side:
>
> - the search request has profile, which we could use to trace the search request in detail
> - yes, we do have a verbose setting for ingest pipelines (as you pointed out), and introducing it for consistency would probably have made sense, but ingest does not have profile and/or explain (AFAIK; maybe that's what we could bring to the ingest side instead?)
>
> From my perspective, profile should be sufficient; there is no need to introduce a verbose setting on the search side.

Thanks for the feedback. profile is fundamentally designed to provide timing-related insights, as it focuses on performance debugging. verbose_pipeline serves a different purpose that complements profile rather than overlapping with it:

The profile API focuses on timing information and payload metrics (e.g., size/quantity), which are excellent for debugging performance bottlenecks. As highlighted in the OpenSearch documentation:

"The Profile API provides timing information about the execution of individual components of a search request. Using the Profile API, you can debug slow requests and understand how to improve their performance."

It doesn't track logical transformations or interim values between processors. This makes verbose_pipeline the right tool for cases where users need to understand how data evolves across the pipeline, especially with increasingly complex processors like those involving ML inference.

reta · 2024-11-28T06:17:37Z

> It doesn't track logical transformations or interim values between processors. This makes verbose_pipeline the right tool for cases where users need to understand how data evolves across the pipeline, especially with increasingly complex processors like those involving ML inference.

Thanks @junweid62, got it. I think I misunderstood the scope a bit (and from the comments above, there seems to be a mix of intermediate data and time-related insights like execution_time). I think we should expand a bit here:

- update the profile stats with timing-related insights
- add the verbose_pipeline (as you suggested) with data-related insights

What do you think? Thanks!

junweid62 added the enhancement, untriaged, and RFC labels Nov 22, 2024

opensearch-infra bot added this to OpenSearch Roadmap Nov 22, 2024

github-project-automation bot moved this to New in OpenSearch Roadmap Nov 22, 2024

github-actions bot added the Search label Nov 22, 2024

github-project-automation bot added this to Search Project Board Nov 22, 2024

github-project-automation bot moved this to 🆕 New in Search Project Board Nov 22, 2024

sandeshkr419 removed the untriaged label Nov 27, 2024
