[RFC] Performance metrics framework #6533

Open
rishabhmaurya opened this issue Mar 3, 2023 · 0 comments
Labels
enhancement · RFC · Roadmap:Stability/Availability/Resiliency

rishabhmaurya (Contributor) commented Mar 3, 2023

Is your feature request related to a problem? Please describe.
Yes. Tasks such as Source Peer Recovery, Target Peer Recovery, Snapshots, Merges, Shard Initialization, Ultrawarm migrations, etc. can consume significant system resources and impact search and indexing latency. The impact is hard to quantify with the current state of the metrics measured around them. It's not just the lack of metrics, but also the limitations of the measurement frameworks - the Resource Tracking Framework, the Stats API, and Performance Analyzer - in measuring consumption at the desired granularity.

Describe the solution you'd like
The proposed framework uses concepts from OpenTelemetry such as Trace, Span, Context, Context Propagation, Event, Instrument, and Meter.
The example below demonstrates creating spans, emitting events, and metering resource usage with OpenTelemetry for Source Peer Recovery in OpenSearch, using code adapted from the OTel Java manual to explain the intended framework -

Create Span, Set Attributes

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Scope;

// setSpanKind takes the SpanKind enum rather than a string
Span sourcePeerRecoverySpan = tracer.spanBuilder("Source Peer Recovery")
    .setSpanKind(SpanKind.INTERNAL)
    .startSpan();

// Make the span the current span
try (Scope scope = sourcePeerRecoverySpan.makeCurrent()) {
    sourcePeerRecoverySpan.setAttribute("index_name", "");
    sourcePeerRecoverySpan.setAttribute("shard_number", "");
    sourcePeerRecoverySpan.setAttribute("type_of_recovery", "");
    sourcePeerRecoverySpan.setAttribute("target_node", "");
    sourcePeerRecoverySpan.setAttribute("thread_id", "");
    sourcePeerRecoverySpan.setAttribute("thread_name", "");
} finally {
    sourcePeerRecoverySpan.end();
}
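Peer recovery work hops across threads of the generic threadpool, so the Context Propagation concept listed above is what keeps this span current across those hops. A minimal sketch using the standard OTel Java context API (the executor variable is illustrative, not an existing OpenSearch field):

import io.opentelemetry.context.Context;

// Wrap work handed to another thread so it runs with the current context;
// Span.current() on the worker thread then returns sourcePeerRecoverySpan.
Runnable sendFilesStep = Context.current().wrap(() -> {
    Span.current().addEvent("send_files_scheduled");
});
genericExecutor.execute(sendFilesStep); // hypothetical handle to the generic threadpool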

Add Events

// Build the event attributes. The original intent was to combine the span's own
// attributes with event-specific ones, but the OTel API exposes no span.attributes()
// getter, so the combined set has to be built explicitly.
// (requires io.opentelemetry.api.common.AttributeKey and Attributes)
Attributes eventAttributes = Attributes.of(
    AttributeKey.longKey("thread_id"), 0L,
    AttributeKey.stringKey("thread_name"), "");

sourcePeerRecoverySpan.addEvent("start_acquire_retention_lease", eventAttributes);

...

sourcePeerRecoverySpan.addEvent("end_acquire_retention_lease", eventAttributes);

Metering Usage

// Build an asynchronous instrument, e.g. an ObservableGauge
// (requires: import static io.opentelemetry.api.common.AttributeKey.stringKey;)
meter
  .gaugeBuilder("cpu_usage")
  .setDescription("CPU Usage")
  .setUnit("ms")
  .buildWithCallback(measurement -> {
    measurement.record(getCpuUsage(), Attributes.of(stringKey("eventID"), ""));
  });
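getCpuUsage() above is left undefined; purely as an assumption for illustration, it could be backed by the JVM's ThreadMXBean, which reports per-thread CPU time in nanoseconds:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper backing the gauge callback above: CPU time consumed by
// the current thread, converted from nanoseconds to milliseconds.
static double getCpuUsage() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    return threads.getCurrentThreadCpuTime() / 1_000_000.0;
}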

Just as the Resource Tracking Framework records CPU and memory allocation at the task level, it can also serve as one of the Instruments, metering CPU usage, memory usage, and thread contention at the thread level, with the meter's value observed at the start and end events of any operation.
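A sketch of that idea (the recovery step call is a hypothetical stand-in, and thread CPU time is read directly here rather than through the Resource Tracking Framework): read the value when the start event fires and again at the end event, so the end event carries the delta consumed in between.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

ThreadMXBean bean = ManagementFactory.getThreadMXBean();

long cpuAtStart = bean.getCurrentThreadCpuTime();
sourcePeerRecoverySpan.addEvent("start_acquire_retention_lease",
    Attributes.of(AttributeKey.longKey("cpu_time"), cpuAtStart));

acquireRetentionLease(); // hypothetical stand-in for the actual recovery step

// The end event records the CPU consumed by the step itself.
sourcePeerRecoverySpan.addEvent("end_acquire_retention_lease",
    Attributes.of(AttributeKey.longKey("cpu_time"),
        bean.getCurrentThreadCpuTime() - cpuAtStart));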

A sample log entry would look like -

"span": {
  "name": "unique to a peer recovery on a shard",
  "kind": "INTERNAL",
  "attributes": {
     "thread.id": ""
     "thread.name": "generic [T10]"
     "shard_number": ""
     "index_name": ""
     "source_node": ""
     "target_node": ""
     "cpu_time": ""
     "memory": ""
     "thread_contention": ""
  },
  "start_time": "2023-03-06T12:00:00Z",
  "end_time": "2023-03-06T12:01:00Z",
  "trace_id": "<unique to a peer recovery on a shard>"
  "parent_id": "<id_of parent span e.g. phase_1 for send_files>"
  "events": [
    {
      "name": "start_acquire_retention_lease",
      "attributes": {
        "cpu_time": ""
      },
      "timestamp": "2023-03-06T12:00:30Z"
    },
	{
      "name": "end_acquire_retention_lease",
      "attributes": {
        "cpu_time": ""
      },
      "timestamp": "2023-03-06T13:00:30Z"
    }
  ]
}

Decorating events with system-level metrics using Performance Analyzer

The Performance Analyzer (PA) agent (which runs as a separate process) has a Metric Processor that today parses the start and end events generated by the PA plugin in shared system memory and decorates the event metrics with system resource consumption. For example, shard search/bulk requests are decorated later with OS/node statistics here.

Similarly, the events generated above can be integrated with PA events to decorate them with system resource metrics, which is useful for understanding resource bottlenecks at the finer granularity of node -> shard -> operation -> thread. Integration would require the following enhancements (a sketch follows the list) -

  1. A start event and an end event would be needed in addition to the complete metric object described above.
  2. PA's MetricProcessor uses a sampling approach, collecting node/OS statistics every 5 seconds. For background tasks like peer recovery, whose count doesn't grow with the number of search/write requests, collecting these metrics as the start and end events are generated would be more accurate and would add little overhead, given the nature of these tasks.
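A minimal sketch of enhancement 1, with the writer name and event names assumed purely for illustration (PA's actual shared-memory event format and writer API differ):

// Hypothetical writer emitting the paired start/end events that the PA agent's
// Metric Processor could then decorate with OS/node statistics.
void emitRecoveryEvents(String spanId) {
    paEventWriter.write("start_source_peer_recovery", spanId, System.currentTimeMillis());
    // ... recovery work ...
    paEventWriter.write("end_source_peer_recovery", spanId, System.currentTimeMillis());
}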

Once integrated, we will see these metrics associated with each of these events - metric reference: https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/
Decorating events with system-level metrics may not be useful for everyone, and it requires the PA agent to be installed and enabled on the node.

Both the event generation and the decoration would be opt-in features, disabled by default.
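A sketch of how the opt-in could be exposed, using OpenSearch's standard Setting API; the setting keys are placeholders, not agreed-upon names:

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Placeholder keys; both features default to off.
public static final Setting<Boolean> PERF_EVENTS_ENABLED =
    Setting.boolSetting("performance.metrics.events.enabled", false,
        Property.NodeScope, Property.Dynamic);

public static final Setting<Boolean> PERF_EVENT_DECORATION_ENABLED =
    Setting.boolSetting("performance.metrics.decoration.enabled", false,
        Property.NodeScope, Property.Dynamic);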

System Overhead

The event generation part shouldn't add any overhead beyond what we already incur while recording task resource usage in the Resource Tracking Framework. This can be confirmed by building the prototype and running benchmarks.

Event decoration with system-level metrics should be done with care and not at the per-HTTP/shard-request level. The number of background tasks such as peer recoveries, snapshots, etc. is proportional to the rate of data growth, the indexing policy, or cluster events, and doesn't grow with the number of requests. So in theory the system overhead should be minimal and acceptable given the number of events they generate. Also, event decoration happens outside the OpenSearch process in the PA agent and is optional, so regular OpenSearch users who don't use this feature will see no impact.

If we see utility in it, I can start working on the prototype and benchmark the overhead introduced in the system. I'm thinking of doing it in two phases, as event generation and decoration are independent, and I will prioritize the former.

This should take care of use cases like - #4401

Describe alternatives you've considered
Resource Tracking Framework, Stats API, Performance Analyzer.

Utilities

  • Auto-tune capabilities can be enriched by consuming these metrics.
  • OpenSearch Benchmark can be integrated with these resource consumption metrics to prevent performance regressions around resource consumption.
  • Easier investigation of performance bottlenecks.

Examples

  • Having a shard's system resource consumption for operations like snapshots, peer recovery, shard initialization, segment merges, Ultrawarm migration, etc. could help cluster admins tweak configurations such as the number of shards/shrink policy, the filesystem for an index, the merge policy, the scheduling of snapshots/peer recovery, and the Ultrawarm migration policy. Tweaking these settings with an understanding of system resource utilization can eventually reduce cost through optimal index configurations.
  • When the system is under stress and the impact of these tasks can be viewed live, they can be deferred (when possible) to reduce the load on the system.
  • Plugin developers can use this framework to instrument new tasks, such as tasks scheduled by JobScheduler, to understand their impact. A plugin's impact can otherwise go unnoticed, since plugins are tested separately and are not part of the OpenSearch benchmark.

Appendix

Source Peer Recovery Spans
| Span | Parent Span | ThreadPool | Dimensions/Attributes | Enter Code Path | Exit Code Path |
|---|---|---|---|---|---|
| recover | None | generic | index_name, shard_number, type_of_recovery (Peer/Local), target_node | `private void recover(StartRecoveryRequest request, ActionListener<RecoveryResponse> listener) {` | None |
| acquire_retention_lease | recover | generic | same_as_parent | `final SetOnce<RetentionLease> retentionLeaseRef = new SetOnce<>();` | |
| phase_1 | recover | generic | same_as_parent | | |
| send_file_info | phase_1 | generic | same_as_parent | `final List<String> phase1FileNames = new ArrayList<>();` | |
| send_files | phase_1 | generic | same_as_parent | `void sendFiles(Store store, StoreFileMetadata[] files, IntSupplier translogOps, ActionListener<Void> listener) {` | |
| create_retention_lease | phase_1 | generic | same_as_parent | `void createRetentionLease(final long startingSeqNo, ActionListener<RetentionLease> listener) {` | |
| clean_files | phase_1 | generic | same_as_parent | | |
| prepare_target_for_translog | recover | generic | same_as_parent | `void prepareTargetForTranslog(int totalTranslogOps, ActionListener<TimeValue> listener) {` | |
| phase_2 / send_snapshot_step | recover | generic | same_as_parent | | |
| finalize_recovery | recover | generic | same_as_parent | `void finalizeRecovery(long targetLocalCheckpoint, long trimAboveSeqNo, ActionListener<Void> listener) throws IOException {` | |
| relocated/handoff | recover | generic | same_as_parent | `final Consumer<StepListener> forceSegRepConsumer = shard.indexSettings().isSegRepEnabled()` | |
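To make the parent/child relationships in the table concrete, a hedged sketch of how recover() might open the root span with phase_1 nested under it (names are illustrative; this is not the actual RecoverySourceHandler code):

// Illustrative nesting per the table: "recover" is the root span, "phase_1"
// is its child, and send_files etc. would in turn be children of phase_1.
Span recoverSpan = tracer.spanBuilder("recover")
    .setSpanKind(SpanKind.INTERNAL)
    .startSpan();
try (Scope ignored = recoverSpan.makeCurrent()) {
    // setParent is redundant while recoverSpan is current, but makes the
    // hierarchy explicit.
    Span phase1Span = tracer.spanBuilder("phase_1")
        .setParent(Context.current().with(recoverSpan))
        .startSpan();
    try (Scope alsoIgnored = phase1Span.makeCurrent()) {
        // send_file_info, send_files, create_retention_lease, clean_files ...
    } finally {
        phase1Span.end();
    }
} finally {
    recoverSpan.end();
}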
 