Print long running work for debugging #564

gregschohn · 2024-04-09T16:33:10Z

Description

(Adapted from the Jira task...)

To allow users and developers to understand the Replayer’s activities, we use Otel traces to describe the major phases that captures and their derived counterparts go through. However, traces are only able to be viewed/used within X-Ray, Jaeger, etc when the trace was closed. Unfortunately, that means that traces for the most problematic activities that don’t complete cannot provide any value to the user or the developer.

This feature attempts to remedy that shortcoming by registering each trace that is created within an ordered hashmap and removing those entries upon their completion. Periodically, a nanny thread can scan that map to dump the oldest traces and all of their attributes to a specific “still-running-activities” logfile. That will allow an end-user or a developer to monitor the logfile and be able to get invaluable clues for the next steps in diagnosing many issues.

Below is sample output from one scheduled dump of active work on my replayer (which has one stuck request and two behind it). As per the default log4j2.properties file, this will be in the log directory and named "longRunningActivity.log".

Category Enhancement, OE Improvement
Why these changes are required? To make it easier to find stuck work
What is the old behavior before changes and new behavior after changes? Now there's a file that shows which activities are long-running.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1618

Testing

Unit testing

Check List

New functionality includes testing
- All tests pass, including unit test, integration test and doctest
New functionality has been documented
Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Total requests outstanding: 3 ([INFO ] 2024-04-11 03:48:59,061)
Oldest of 3 outstanding requests that are past thresholds ([WARN ] 2024-04-11 03:48:59,062)
  age=PT2M29.880878375S [431059092] waiting for 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33.0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1.1.0 to be queued and run through TrafficStreamLimiter[…] ([WARN ] 2024-04-11 03:48:59,062)
  age=PT2M29.871312416S [2011658475] waiting for 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33.0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1.2.1 to be queued and run through TrafficStreamLimiter[…] ([WARN ] 2024-04-11 03:48:59,062)
  age=PT2M29.866209S [258354658] waiting for 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33.0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1.3.2 to be queued and run through TrafficStreamLimiter[…] ([WARN ] 2024-04-11 03:48:59,063)

 ([WARN ] 2024-04-11 03:48:59,063)
Oldest of 1 scopes for 'tcpConnection' that are past thresholds that are not otherwise reported below  ([WARN ] 2024-04-11 03:48:59,063)
  age=PT2M29.717513166S, start=2024-04-11T03:46:29.344495Z id=<<924942282>> tcpConnection: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} parentRef=<<454898831>>... ([WARN ] 2024-04-11 03:48:59,063)
Oldest of 4 scopes for 'trafficStreamLifetime' that are past thresholds that are not otherwise reported below  ([WARN ] 2024-04-11 03:48:59,063)
  age=PT2M29.865783542S, start=2024-04-11T03:46:29.196498Z id=<<1770852636>> trafficStreamLifetime: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} parentRef=<<454898831>>... ([WARN ] 2024-04-11 03:48:59,064)
Oldest of 3 scopes for 'httpTransaction' that are past thresholds that are not otherwise reported below  ([WARN ] 2024-04-11 03:48:59,064)
  age=PT2M29.868227667S, start=2024-04-11T03:46:29.194225Z id=<<1814951883>> httpTransaction: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 2, sourceRequestIndex: 2}
    age=PT2M29.871021875S, start=2024-04-11T03:46:29.191510Z id=<<1893987183>> trafficStreamLifetime: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} parentRef=<<454898831>>... ([WARN ] 2024-04-11 03:48:59,064)
  age=PT2M29.873687958S, start=2024-04-11T03:46:29.188976Z id=<<1385879879>> httpTransaction: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 1, sourceRequestIndex: 1}
    age=PT2M29.879445S, start=2024-04-11T03:46:29.183279Z id=<<142261320>> trafficStreamLifetime: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} parentRef=<<454898831>>... ([WARN ] 2024-04-11 03:48:59,064)
Oldest of 1 scopes for 'receivingResponse' that are past thresholds that are not otherwise reported below  ([WARN ] 2024-04-11 03:48:59,064)
  age=PT2M29.545403625S, start=2024-04-11T03:46:29.517509Z id=<<1760885942>> receivingResponse: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 0, sourceRequestIndex: 0}
    age=PT2M29.769283042S, start=2024-04-11T03:46:29.293730Z id=<<1128824353>> targetTransaction: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 0, sourceRequestIndex: 0}
      age=PT2M29.90971975S, start=2024-04-11T03:46:29.153355Z id=<<754707140>> httpTransaction: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 0, sourceRequestIndex: 0}
        age=PT2M29.927628375S, start=2024-04-11T03:46:29.135511Z id=<<1544322553>> trafficStreamLifetime: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33}
          age=PT2M29.930582709S, start=2024-04-11T03:46:29.132597Z id=<<454898831>> channel: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} ([WARN ] 2024-04-11 03:48:59,064)



 ([WARN ] 2024-04-09 22:38:01,140)

…ime that it is closed as a first-class feature of the coreUtilities tracing library. A number of changes were made to existing interfaces to reduce the scope of 'naked' Spans so that it's easier to audit if they're being used directly to close spans. The getCurrentSpan() needs to be public because it's in an interface and used internally, but it should really be private and require friend access only within its package. However, I'm not sure how to do better guardrails in Java. Signed-off-by: Greg Schohn <[email protected]>

# Conflicts: # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java

… from the business logic. Some of the new class's constructors have been simplified/streamlined a bit at the expense of not creating a JsonTransfomer automatically. Signed-off-by: Greg Schohn <[email protected]>

…class. TopLevel deals with the total lifecycle (shutdown) of critical resources while Core does the orchestration between various top-level components. Signed-off-by: Greg Schohn <[email protected]>

…ontexts, etc, tying their log levels to their ages. Signed-off-by: Greg Schohn <[email protected]>

…or to output the deepest items in the context tree. Signed-off-by: Greg Schohn <[email protected]>

Beefed up unit tests for the ActiveContextMonitor too. Signed-off-by: Greg Schohn <[email protected]>

…tor every 30s. Policies are to cap the items at 64 items. Items > 10 minutes are logged at ERROR, 60s are at WARN, and > 30s are INFO (debug>5, trace>2). I've also modified the log4j2.properties file to log this stream to a new file, which is rotated every hour and we keep up to 4 of them. Signed-off-by: Greg Schohn <[email protected]>

peternied · 2024-04-09T20:29:39Z

@gregschohn ~~Before this comes out of draft could you fill in the PR template?~~

[Edit: Looks good, thanks for the details]

Some fixes were within tests and others improved the ability to test. The repeating scheduled task to log the activity now includes a header of how many requests are outstanding so that the file has something happening in quiescent periods (so that a user doesn't think that the monitoring broke). That's independent of the ActivityMonitor itself so that the tests don't need to worry about it. I've also cleaned up the log4j2.properties file for replayer tests. Signed-off-by: Greg Schohn <[email protected]>

codecov · 2024-04-09T22:49:54Z

Codecov Report

Attention: Patch coverage is 74.81032% with 166 lines in your changes are missing coverage. Please review.

Project coverage is 76.08%. Comparing base (b04d2a1) to head (ad7cd40).
Report is 8 commits behind head on main.

❗ Current head ad7cd40 differs from pull request most recent head 0e2ce10. Consider uploading reports for the commit 0e2ce10 to get more accurate results

Files	Patch %	Lines
...nsearch/migrations/replay/TrafficReplayerCore.java	71.13%	33 Missing and 23 partials ⚠️
...rch/migrations/replay/TrafficReplayerTopLevel.java	70.32%	43 Missing and 11 partials ⚠️
...h/migrations/replay/util/ActiveContextMonitor.java	84.35%	23 Missing ⚠️
.../opensearch/migrations/replay/TrafficReplayer.java	0.00%	19 Missing ⚠️
...ch/migrations/tracing/CompositeContextTracker.java	63.63%	4 Missing ⚠️
...opensearch/migrations/tracing/BaseSpanContext.java	75.00%	0 Missing and 3 partials ⚠️
...h/migrations/replay/util/OrderedWorkerTracker.java	83.33%	2 Missing and 1 partial ⚠️
...ficcapture/kafkaoffloader/KafkaCaptureFactory.java	50.00%	1 Missing ⚠️
...ns/tracing/ActiveContextTrackerByActivityType.java	95.00%	0 Missing and 1 partial ⚠️
...opensearch/migrations/tracing/RootOtelContext.java	90.00%	1 Missing ⚠️
... and 1 more

Additional details and impacted files

@@            Coverage Diff             @@
##               main     #564    +/-   ##
==========================================
  Coverage     76.07%   76.08%            
- Complexity     1410     1500    +90     
==========================================
  Files           155      162     +7     
  Lines          6074     6340   +266     
  Branches        548      563    +15     
==========================================
+ Hits           4621     4824   +203     
- Misses         1082     1143    +61     
- Partials        371      373     +2

Flag	Coverage Δ
unittests	`76.08% <74.81%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…ith a reference (System.hashId of the context) rather than reprinting them. Those items NOT printed do NOT impact the counting, so more contexts will be eligible to log. The old mode is still supported and can be run via logTopOpenActivities(false). Signed-off-by: Greg Schohn <[email protected]>

mikaylathompson · 2024-04-10T21:42:49Z

...ities/src/testFixtures/java/org/opensearch/migrations/tracing/BacktracingContextTracker.java

+            if (oldItem != null) {
+                var oldCtx = scopedContextToCallDetails.entrySet().stream().findFirst().get().getKey();
+                if (oldCtx.equals(ctx)) {
+                    log.error("check");


is this correct? it looks like a debug line

Thanks. I've improved it to be more helpful

mikaylathompson · 2024-04-11T10:13:41Z

...icCapture/coreUtilities/src/main/java/org/opensearch/migrations/tracing/BaseSpanContext.java

+        this.startTimeNano = System.nanoTime();
+        this.startTimeInstant = Instant.now();


I want to make sure I understand why we're using each of these:

startTimeNano is used to measure the duration of the span precisely

startTimeInstant is used to indicate when it started (but is a bit less precise?)

Is that correct?

Pretty much. startTime is for logs correlation. I've added these comments to the base interface to explain...

public interface IWithStartTimeAndAttributes extends IInstrumentationAttributes { /** * This is used to calculate the precise duration of the span. System.nanoTime() is guaranteed to be monotonic * and not susceptible to clock fluctuations due to system time resets */ long getStartTimeNano(); /** * This is by some ContextTrackers to log the recorded wall-time so that it can be easier to find the event * within logs. Notice that if the system clock is reset (which should be rare), there could be duplicated * values at different points in time. */ Instant getStartTimeInstant();

mikaylathompson · 2024-04-11T10:48:06Z

.../trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayerTopLevel.java

+
+        @Override
+        public boolean isEmpty() {
+            return map. isEmpty();


extremely minor: extra space here

mikaylathompson · 2024-04-11T11:00:07Z

...rafficReplayer/src/main/java/org/opensearch/migrations/replay/util/ActiveContextMonitor.java

+        ageToLevelEdgeMapRef.set(levelShowsAgeOlderThanMap.entrySet().stream()
+                .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey,
+                        (x,y) -> { throw Lombok.sneakyThrow(new IllegalStateException("Shouldn't have any merges")); },
+                        TreeMap::new)));
+    }


Can you explain what's going on here?

I needed the merge function so that I could specify what kind of map to create.
I've changed the code to just make two maps

ageToLevelEdgeMapRef.set(new TreeMap<>(levelShowsAgeOlderThanMap.entrySet().stream() .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey))));

It is called in the constructor and within unit tests and the number of items is 5. Hopefully that's easier to read & worth the 'cost'.

mikaylathompson · 2024-04-11T11:04:17Z

...rafficReplayer/src/main/java/org/opensearch/migrations/replay/util/ActiveContextMonitor.java

+                                Function<DiagnosticTrackableCompletableFuture<String, Void>, String> formatWorkItem,
+                                BiConsumer<Level, Supplier<String>> logger,
+                                Predicate<Level> logLevelIsEnabled,
+                                Map<Level,Duration> levelShowsAgeOlderThanMap) {


okay, this variable name keeps throwing me for a loop. is it intended to mean "map that correlates logging level with the duration at which it 'kicks in'", more or less?

loggingLevelForAgeMap? I'm not sure it's big improvement

Yes. I'm using a point-map to represent a a non-duplicative and fully-covering interval map. That's why I'm using the "olderThan" part and keying off of level rather than time, but I think that the other approaches would be about the same. (I've never been comfortable using floor/ceiling search operations... even when I understand them, the cognitive load is high)

mikaylathompson · 2024-04-11T11:07:01Z

...rafficReplayer/src/main/java/org/opensearch/migrations/replay/util/ActiveContextMonitor.java

+                    if (cad.items.isEmpty()) { return Optional.<Level>empty(); }
+                    final var sample = cad.items.get(0);
+                    logger.accept(getHigherLevel(Optional.of(sample.getValue()), Optional.of(Level.INFO)).get(), () ->
+                            OLDEST_ITEMS_FROM_GROUP_LABEL_PREAMBLE + cad.totalScopes + " scopes that are past " +


Is it necessary to use OLDEST_ITEMS_FROM_GROUP_LABEL_PREAMBLE vs. the string itself, given that there are plenty of other direct strings in this line?

No, After 3 (I think) replications, SonarLint complains. We're at 2, so I've inlined it.
(once upon a time, it was 3 ;) )

mikaylathompson · 2024-04-11T11:12:05Z

...rafficReplayer/src/main/java/org/opensearch/migrations/replay/util/ActiveContextMonitor.java

+        return aOuter.map(a -> bOuter.filter(b -> a.toInt() <= b.toInt()).orElse(a))
+                .or(() -> bOuter);


All we're doing here is comparing the int values of each level, right? It seems like a bewilderingly complex line to just do that

Sortof - the types are Optional where Level is an enum (hence the toInt()). The Optional parts require the map and two ors. If I had a monad on Optional to say, take the one that was present OR do this comparison, that would be cleaner, but I don't know of one

mikaylathompson · 2024-04-11T11:17:40Z

...rafficReplayer/src/main/java/org/opensearch/migrations/replay/util/OrderedWorkerTracker.java

+
+    @Override
+    public boolean isEmpty() {
+        return primaryMap. isEmpty();


tiny nit: another extra space

mikaylathompson · 2024-04-11T11:19:48Z

TrafficCapture/trafficReplayer/src/main/resources/log4j2.properties

+appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.layout.pattern = %msg ([%-5level] %d{yyyy-MM-dd HH:mm:ss,SSS}{UTC})%n
+appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.type = Policies
+appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.time.type = TimeBasedTriggeringPolicy
+appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.time.interval = 60


am I remembering correctly that the unit for the interval value here is the smallest unit of the filePattern? So this means every 60 minutes? This could clearly vary enormously, but do you have a sense of how large this file might be within that hour?

60 represents minutes.

Assuming that we are rendering 64 scopes for each one of our 67 classes and that each line is about 200 bytes (this will be dependent upon the number of attributes, but 200 bytes is where we're at today), at 120 observations per file or per hour, we'd be at ~100MB as the theoretical max. Typically, this file should be much smaller, with a couple span types being open for longer (channel, trafficStream, maybe even some responses or requests), but that should be < 10% of the total. We may need to eventually supply the time edges as a command line argument (especially if one would want to disable the feature entirely).

This was all put into bigger files to minimize the number of times that a user would need to rerun tail -f as the underlying file handle changes.

mikaylathompson · 2024-04-11T11:20:28Z

TrafficCapture/trafficReplayer/src/test/resources/log4j2.properties

+#logger.RSO.name = org.opensearch.migrations.replay.RequestSenderOrchestrator
+#logger.RSO.level = TRACE
+#logger.RSO.additivity = false
+#logger.RSO.appenderRef.RSO.ref = Console


should this be gone for good?

sure - done

Signed-off-by: Greg Schohn <[email protected]>

gregschohn added 8 commits April 5, 2024 11:40

Merge branch 'main' into PrintLongRunningWorkForDebugging

c3f8477

# Conflicts: # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java

Refactor "Core" parts of TrafficReplayerTopLevel out into a separate …

02027a1

…class. TopLevel deals with the total lifecycle (shutdown) of critical resources while Core does the orchestration between various top-level components. Signed-off-by: Greg Schohn <[email protected]>

WIP - wire up some repeating log statements to print out the active c…

9856daa

…ontexts, etc, tying their log levels to their ages. Signed-off-by: Greg Schohn <[email protected]>

Get some reasonable policies together to setup the ActiveContextMonit…

69f2c6d

…or to output the deepest items in the context tree. Signed-off-by: Greg Schohn <[email protected]>

Add logging requests that are outstanding alongside the contexts.

c58661d

Beefed up unit tests for the ActiveContextMonitor too. Signed-off-by: Greg Schohn <[email protected]>

gregschohn marked this pull request as ready for review April 9, 2024 22:42

gregschohn requested review from AndreKurait, chelma, kartg, lewijacn, mikaylathompson, okhasawn and sumobrian as code owners April 9, 2024 22:42

gregschohn mentioned this pull request Apr 9, 2024

Print long running work for debugging #554

Closed

4 tasks

Merge branch 'main' into PrintLongRunningWorkForDebuggingPart2

8134dbb

gregschohn changed the title ~~Print long running work for debugging part2~~ Print long running work for debugging Apr 10, 2024

gregschohn added 2 commits April 10, 2024 23:54

Merge branch 'main' into PrintLongRunningWorkForDebuggingPart2

ad7cd40

mikaylathompson approved these changes Apr 11, 2024

View reviewed changes

PR Feedback, adding comments and simplifying a bit of code.

0e2ce10

Signed-off-by: Greg Schohn <[email protected]>

gregschohn merged commit 4fe2e9e into opensearch-project:main Apr 11, 2024
5 checks passed

gregschohn deleted the PrintLongRunningWorkForDebuggingPart2 branch April 11, 2024 12:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Print long running work for debugging #564

Print long running work for debugging #564

gregschohn commented Apr 9, 2024 •

edited

Loading

peternied commented Apr 9, 2024 •

edited

Loading

codecov bot commented Apr 9, 2024 •

edited

Loading

mikaylathompson Apr 10, 2024

gregschohn Apr 11, 2024

mikaylathompson Apr 11, 2024

gregschohn Apr 11, 2024

mikaylathompson Apr 11, 2024

mikaylathompson Apr 11, 2024

gregschohn Apr 11, 2024

mikaylathompson Apr 11, 2024

gregschohn Apr 11, 2024

mikaylathompson Apr 11, 2024

gregschohn Apr 11, 2024

mikaylathompson Apr 11, 2024

gregschohn Apr 11, 2024

mikaylathompson Apr 11, 2024

gregschohn Apr 11, 2024

mikaylathompson Apr 11, 2024

gregschohn Apr 11, 2024

mikaylathompson Apr 11, 2024

gregschohn Apr 11, 2024

		this.startTimeNano = System.nanoTime();
		this.startTimeInstant = Instant.now();

		return aOuter.map(a -> bOuter.filter(b -> a.toInt() <= b.toInt()).orElse(a))
		.or(() -> bOuter);

Print long running work for debugging #564

Print long running work for debugging #564

Conversation

gregschohn commented Apr 9, 2024 • edited Loading

Description

Issues Resolved

Testing

Check List

peternied commented Apr 9, 2024 • edited Loading

codecov bot commented Apr 9, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gregschohn commented Apr 9, 2024 •

edited

Loading

peternied commented Apr 9, 2024 •

edited

Loading

codecov bot commented Apr 9, 2024 •

edited

Loading