[RFC] Performance metrics framework #6533

Open
rishabhmaurya opened this issue Mar 3, 2023 · 0 comments
Labels
enhancement · RFC · Roadmap:Stability/Availability/Resiliency

rishabhmaurya (Contributor) commented Mar 3, 2023

Is your feature request related to a problem? Please describe.
Yes. Tasks such as Source Peer Recovery, Target Peer Recovery, Snapshots, Merges, Shard Initialization, Ultrawarm migrations, etc. can consume significant system resources and impact search and indexing latency. The impact is hard to quantify with the current state of the metrics measured around them. It's not just the lack of metrics, but also the limitations of the measurement frameworks - the Resource Tracking Framework, the Stats API, and Performance Analyzer - in measuring consumption at the desired granularity.

Describe the solution you'd like
The proposed framework uses concepts from OpenTelemetry such as Trace, Span, Context, Context Propagation, Event, Instrument, and Meter.
The example below demonstrates creating spans, emitting events, and metering resource usage with OpenTelemetry for Source Peer Recovery in OpenSearch, using code adapted from the OTel Java manual to explain the intended framework -

Create Span, Set Attributes

import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Scope;

// setSpanKind takes the SpanKind enum rather than a string
Span sourcePeerRecoverySpan = tracer.spanBuilder("Source Peer Recovery")
    .setSpanKind(SpanKind.INTERNAL)
    .startSpan();

// Make the span the current span
try (Scope scope = sourcePeerRecoverySpan.makeCurrent()) {
    sourcePeerRecoverySpan.setAttribute("index_name", "");
    sourcePeerRecoverySpan.setAttribute("shard_number", "");
    sourcePeerRecoverySpan.setAttribute("type_of_recovery", "");
    sourcePeerRecoverySpan.setAttribute("target_node", "");
    sourcePeerRecoverySpan.setAttribute("thread_id", "");
    sourcePeerRecoverySpan.setAttribute("thread_name", "");
} finally {
    sourcePeerRecoverySpan.end();
}
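Peer recovery work hops across threads of the generic threadpool, so the Context Propagation concept listed above is what keeps this span current across those hops. A minimal sketch using the standard OTel Java context API (the executor variable is illustrative, not an existing OpenSearch field):

import io.opentelemetry.context.Context;

// Wrap work handed to another thread so it runs with the current context;
// Span.current() on the worker thread then returns sourcePeerRecoverySpan.
Runnable sendFilesStep = Context.current().wrap(() -> {
    Span.current().addEvent("send_files_scheduled");
});
genericExecutor.execute(sendFilesStep); // hypothetical handle to the generic threadpool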

Add Events

// Build the event attributes. The original intent was to combine the span's own
// attributes with event-specific ones, but the OTel API exposes no span.attributes()
// getter, so the combined set has to be built explicitly.
// (requires io.opentelemetry.api.common.AttributeKey and Attributes)
Attributes eventAttributes = Attributes.of(
    AttributeKey.longKey("thread_id"), 0L,
    AttributeKey.stringKey("thread_name"), "");

sourcePeerRecoverySpan.addEvent("start_acquire_retention_lease", eventAttributes);

...

sourcePeerRecoverySpan.addEvent("end_acquire_retention_lease", eventAttributes);

Metering Usage

// Build an asynchronous instrument, e.g. an ObservableGauge
// (requires: import static io.opentelemetry.api.common.AttributeKey.stringKey;)
meter
  .gaugeBuilder("cpu_usage")
  .setDescription("CPU Usage")
  .setUnit("ms")
  .buildWithCallback(measurement -> {
    measurement.record(getCpuUsage(), Attributes.of(stringKey("eventID"), ""));
  });
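getCpuUsage() above is left undefined; purely as an assumption for illustration, it could be backed by the JVM's ThreadMXBean, which reports per-thread CPU time in nanoseconds:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical helper backing the gauge callback above: CPU time consumed by
// the current thread, converted from nanoseconds to milliseconds.
static double getCpuUsage() {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    return threads.getCurrentThreadCpuTime() / 1_000_000.0;
}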

Just as the Resource Tracking Framework records CPU and memory allocation at the task level, it can also serve as one of the Instruments, metering CPU usage, memory usage, and thread contention at the thread level, with the meter's value observed at the start and end events of any operation.
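A sketch of that idea (the recovery step call is a hypothetical stand-in, and thread CPU time is read directly here rather than through the Resource Tracking Framework): read the value when the start event fires and again at the end event, so the end event carries the delta consumed in between.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

ThreadMXBean bean = ManagementFactory.getThreadMXBean();

long cpuAtStart = bean.getCurrentThreadCpuTime();
sourcePeerRecoverySpan.addEvent("start_acquire_retention_lease",
    Attributes.of(AttributeKey.longKey("cpu_time"), cpuAtStart));

acquireRetentionLease(); // hypothetical stand-in for the actual recovery step

// The end event records the CPU consumed by the step itself.
sourcePeerRecoverySpan.addEvent("end_acquire_retention_lease",
    Attributes.of(AttributeKey.longKey("cpu_time"),
        bean.getCurrentThreadCpuTime() - cpuAtStart));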

A sample log entry would look like -

"span": {
  "name": "unique to a peer recovery on a shard",
  "kind": "INTERNAL",
  "attributes": {
     "thread.id": ""
     "thread.name": "generic [T10]"
     "shard_number": ""
     "index_name": ""
     "source_node": ""
     "target_node": ""
     "cpu_time": ""
     "memory": ""
     "thread_contention": ""
  },
  "start_time": "2023-03-06T12:00:00Z",
  "end_time": "2023-03-06T12:01:00Z",
  "trace_id": "<unique to a peer recovery on a shard>"
  "parent_id": "<id_of parent span e.g. phase_1 for send_files>"
  "events": [
    {
      "name": "start_acquire_retention_lease",
      "attributes": {
        "cpu_time": ""
      },
      "timestamp": "2023-03-06T12:00:30Z"
    },
	{
      "name": "end_acquire_retention_lease",
      "attributes": {
        "cpu_time": ""
      },
      "timestamp": "2023-03-06T13:00:30Z"
    }
  ]
}

Decorating events with system-level metrics using Performance Analyzer

The Performance Analyzer (PA) agent (which runs as a separate process) has a Metric Processor that today parses the start and end events generated by the PA plugin in shared system memory and decorates the event metrics with system resource consumption. For example, shard search/bulk requests are decorated later with OS/node statistics here.

Similarly, the events generated above can be integrated with PA events to decorate them with system resource metrics, which is useful for understanding resource bottlenecks at the finer granularity of node -> shard -> operation -> thread. Integration would require the following enhancements (a sketch follows the list) -

  1. A start event and an end event would be needed in addition to the complete metric object described above.
  2. PA's MetricProcessor uses a sampling approach, collecting node/OS statistics every 5 seconds. For background tasks like peer recovery, whose count doesn't grow with the number of search/write requests, collecting these metrics as the start and end events are generated would be more accurate and would add little overhead, given the nature of these tasks.
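A minimal sketch of enhancement 1, with the writer name and event names assumed purely for illustration (PA's actual shared-memory event format and writer API differ):

// Hypothetical writer emitting the paired start/end events that the PA agent's
// Metric Processor could then decorate with OS/node statistics.
void emitRecoveryEvents(String spanId) {
    paEventWriter.write("start_source_peer_recovery", spanId, System.currentTimeMillis());
    // ... recovery work ...
    paEventWriter.write("end_source_peer_recovery", spanId, System.currentTimeMillis());
}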

Once integrated, we will see these metrics associated with each of these events - metric reference: https://opensearch.org/docs/latest/monitoring-your-cluster/pa/reference/
Decorating events with system-level metrics may not be useful for everyone, and it requires the PA agent to be installed and enabled on the node.

Both the event generation and the decoration would be opt-in features, disabled by default.
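A sketch of how the opt-in could be exposed, using OpenSearch's standard Setting API; the setting keys are placeholders, not agreed-upon names:

import org.opensearch.common.settings.Setting;
import org.opensearch.common.settings.Setting.Property;

// Placeholder keys; both features default to off.
public static final Setting<Boolean> PERF_EVENTS_ENABLED =
    Setting.boolSetting("performance.metrics.events.enabled", false,
        Property.NodeScope, Property.Dynamic);

public static final Setting<Boolean> PERF_EVENT_DECORATION_ENABLED =
    Setting.boolSetting("performance.metrics.decoration.enabled", false,
        Property.NodeScope, Property.Dynamic);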

System Overhead

The event generation part shouldn't add any overhead beyond what we already incur while recording task resource usage in the Resource Tracking Framework. This can be confirmed by building the prototype and running benchmarks.

Event decoration with system-level metrics should be done with care and not at the per-HTTP/shard-request level. The number of background tasks such as peer recoveries, snapshots, etc. is proportional to the rate of data growth, the indexing policy, or cluster events, and doesn't grow with the number of requests. So in theory the system overhead should be minimal and acceptable given the number of events they generate. Also, event decoration happens outside the OpenSearch process in the PA agent and is optional, so regular OpenSearch users who don't use this feature will see no impact.

If we see utility in it, I can start working on the prototype and benchmark the overhead introduced in the system. I'm thinking of doing it in two phases, as event generation and decoration are independent, and I will prioritize the former.

This should take care of use cases like - #4401

Describe alternatives you've considered
Resource Tracking Framework, Stats API, Performance Analyzer.

Utilities

  • Auto-tune capabilities can be enriched by consuming these metrics.
  • OpenSearch Benchmark can be integrated with these resource consumption metrics to prevent performance regressions around resource consumption.
  • Easier investigation of performance bottlenecks.

Examples

  • Having a shard's system resource consumption for operations like snapshots, peer recovery, shard initialization, segment merges, Ultrawarm migration, etc. could help cluster admins tweak configurations such as the number of shards/shrink policy, the filesystem for an index, the merge policy, the scheduling of snapshots/peer recovery, and the Ultrawarm migration policy. Tweaking these settings with an understanding of system resource utilization can eventually reduce cost through optimal index configurations.
  • When the system is under stress and the impact of these tasks can be viewed live, they can be deferred (when possible) to reduce the load on the system.
  • Plugin developers can use this framework to instrument new tasks, such as tasks scheduled by JobScheduler, to understand their impact. A plugin's impact can otherwise go unnoticed, since plugins are tested separately and are not part of the OpenSearch benchmark.

Appendix

Source Peer Recovery Spans
| Span | Parent Span | ThreadPool | Dimensions/Attributes | Enter Code Path | Exit Code Path |
|---|---|---|---|---|---|
| recover | None | generic | index_name, shard_number, type_of_recovery (Peer/Local), target_node | `private void recover(StartRecoveryRequest request, ActionListener<RecoveryResponse> listener) {` | None |
| acquire_retention_lease | recover | generic | same_as_parent | `final SetOnce<RetentionLease> retentionLeaseRef = new SetOnce<>();` | |
| phase_1 | recover | generic | same_as_parent | | |
| send_file_info | phase_1 | generic | same_as_parent | `final List<String> phase1FileNames = new ArrayList<>();` | |
| send_files | phase_1 | generic | same_as_parent | `void sendFiles(Store store, StoreFileMetadata[] files, IntSupplier translogOps, ActionListener<Void> listener) {` | |
| create_retention_lease | phase_1 | generic | same_as_parent | `void createRetentionLease(final long startingSeqNo, ActionListener<RetentionLease> listener) {` | |
| clean_files | phase_1 | generic | same_as_parent | | |
| prepare_target_for_translog | recover | generic | same_as_parent | `void prepareTargetForTranslog(int totalTranslogOps, ActionListener<TimeValue> listener) {` | |
| phase_2 / send_snapshot_step | recover | generic | same_as_parent | | |
| finalize_recovery | recover | generic | same_as_parent | `void finalizeRecovery(long targetLocalCheckpoint, long trimAboveSeqNo, ActionListener<Void> listener) throws IOException {` | |
| relocated/handoff | recover | generic | same_as_parent | `final Consumer<StepListener> forceSegRepConsumer = shard.indexSettings().isSegRepEnabled()` | |
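To make the parent/child relationships in the table concrete, a hedged sketch of how recover() might open the root span with phase_1 nested under it (names are illustrative; this is not the actual RecoverySourceHandler code):

// Illustrative nesting per the table: "recover" is the root span, "phase_1"
// is its child, and send_files etc. would in turn be children of phase_1.
Span recoverSpan = tracer.spanBuilder("recover")
    .setSpanKind(SpanKind.INTERNAL)
    .startSpan();
try (Scope ignored = recoverSpan.makeCurrent()) {
    // setParent is redundant while recoverSpan is current, but makes the
    // hierarchy explicit.
    Span phase1Span = tracer.spanBuilder("phase_1")
        .setParent(Context.current().with(recoverSpan))
        .startSpan();
    try (Scope alsoIgnored = phase1Span.makeCurrent()) {
        // send_file_info, send_files, create_retention_lease, clean_files ...
    } finally {
        phase1Span.end();
    }
} finally {
    recoverSpan.end();
}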
 