Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print long running work for debugging #564

Conversation

gregschohn
Copy link
Collaborator

@gregschohn gregschohn commented Apr 9, 2024

Description

(Adapted from the Jira task...)

To allow users and developers to understand the Replayer’s activities, we use Otel traces to describe the major phases that captures and their derived counterparts go through. However, traces are only able to be viewed/used within X-Ray, Jaeger, etc when the trace was closed. Unfortunately, that means that traces for the most problematic activities that don’t complete cannot provide any value to the user or the developer.

This feature attempts to remedy that shortcoming by registering each trace that is created within an ordered hashmap and removing those entries upon their completion. Periodically, a nanny thread can scan that map to dump the oldest traces and all of their attributes to a specific “still-running-activities” logfile. That will allow an end-user or a developer to monitor the logfile and be able to get invaluable clues for the next steps in diagnosing many issues.

Below is sample output from one scheduled dump of active work on my replayer (which has one stuck request and two behind it). As per the default log4j2.properties file, this will be in the log directory and named "longRunningActivity.log".

  • Category Enhancement, OE Improvement
  • Why these changes are required? To make it easier to find stuck work
  • What is the old behavior before changes and new behavior after changes? Now there's a file that shows which activities are long-running.

Issues Resolved

https://opensearch.atlassian.net/browse/MIGRATIONS-1618

Testing

Unit testing

Check List

  • New functionality includes testing
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Total requests outstanding: 3 ([INFO ] 2024-04-11 03:48:59,061)
Oldest of 3 outstanding requests that are past thresholds ([WARN ] 2024-04-11 03:48:59,062)
  age=PT2M29.880878375S [431059092] waiting for 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33.0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1.1.0 to be queued and run through TrafficStreamLimiter[…] ([WARN ] 2024-04-11 03:48:59,062)
  age=PT2M29.871312416S [2011658475] waiting for 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33.0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1.2.1 to be queued and run through TrafficStreamLimiter[…] ([WARN ] 2024-04-11 03:48:59,062)
  age=PT2M29.866209S [258354658] waiting for 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33.0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1.3.2 to be queued and run through TrafficStreamLimiter[…] ([WARN ] 2024-04-11 03:48:59,063)

 ([WARN ] 2024-04-11 03:48:59,063)
Oldest of 1 scopes for 'tcpConnection' that are past thresholds that are not otherwise reported below  ([WARN ] 2024-04-11 03:48:59,063)
  age=PT2M29.717513166S, start=2024-04-11T03:46:29.344495Z id=<<924942282>> tcpConnection: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} parentRef=<<454898831>>... ([WARN ] 2024-04-11 03:48:59,063)
Oldest of 4 scopes for 'trafficStreamLifetime' that are past thresholds that are not otherwise reported below  ([WARN ] 2024-04-11 03:48:59,063)
  age=PT2M29.865783542S, start=2024-04-11T03:46:29.196498Z id=<<1770852636>> trafficStreamLifetime: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} parentRef=<<454898831>>... ([WARN ] 2024-04-11 03:48:59,064)
Oldest of 3 scopes for 'httpTransaction' that are past thresholds that are not otherwise reported below  ([WARN ] 2024-04-11 03:48:59,064)
  age=PT2M29.868227667S, start=2024-04-11T03:46:29.194225Z id=<<1814951883>> httpTransaction: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 2, sourceRequestIndex: 2}
    age=PT2M29.871021875S, start=2024-04-11T03:46:29.191510Z id=<<1893987183>> trafficStreamLifetime: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} parentRef=<<454898831>>... ([WARN ] 2024-04-11 03:48:59,064)
  age=PT2M29.873687958S, start=2024-04-11T03:46:29.188976Z id=<<1385879879>> httpTransaction: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 1, sourceRequestIndex: 1}
    age=PT2M29.879445S, start=2024-04-11T03:46:29.183279Z id=<<142261320>> trafficStreamLifetime: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} parentRef=<<454898831>>... ([WARN ] 2024-04-11 03:48:59,064)
Oldest of 1 scopes for 'receivingResponse' that are past thresholds that are not otherwise reported below  ([WARN ] 2024-04-11 03:48:59,064)
  age=PT2M29.545403625S, start=2024-04-11T03:46:29.517509Z id=<<1760885942>> receivingResponse: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 0, sourceRequestIndex: 0}
    age=PT2M29.769283042S, start=2024-04-11T03:46:29.293730Z id=<<1128824353>> targetTransaction: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 0, sourceRequestIndex: 0}
      age=PT2M29.90971975S, start=2024-04-11T03:46:29.153355Z id=<<754707140>> httpTransaction: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33, replayerRequestIndex: 0, sourceRequestIndex: 0}
        age=PT2M29.927628375S, start=2024-04-11T03:46:29.135511Z id=<<1544322553>> trafficStreamLifetime: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33}
          age=PT2M29.930582709S, start=2024-04-11T03:46:29.132597Z id=<<454898831>> channel: attribs={connectionId: 0a3e1afffef52b37-0000000b-00001261-2ff8f6ba641959b5-8e651bd1, nodeId: 9feb1902-0fd2-40a3-ad0d-9f6b924f0b33} ([WARN ] 2024-04-11 03:48:59,064)



 ([WARN ] 2024-04-09 22:38:01,140)

…ime that it is closed as a first-class feature of the coreUtilities tracing library.

A number of changes were made to existing interfaces to reduce the scope of 'naked' Spans so that it's easier to audit if they're being used directly to close spans.  The getCurrentSpan() needs to be public because it's in an interface and used internally, but it should really be private and require friend access only within its package.  However, I'm not sure how to do better guardrails in Java.

Signed-off-by: Greg Schohn <[email protected]>
# Conflicts:
#	TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java
… from the business logic.

Some of the new class's constructors have been simplified/streamlined a bit at the expense of not creating a JsonTransfomer automatically.

Signed-off-by: Greg Schohn <[email protected]>
…class.

TopLevel deals with the total lifecycle (shutdown) of critical resources while Core does the orchestration between various top-level components.

Signed-off-by: Greg Schohn <[email protected]>
…ontexts, etc, tying their log levels to their ages.

Signed-off-by: Greg Schohn <[email protected]>
…or to output the deepest items in the context tree.

Signed-off-by: Greg Schohn <[email protected]>
Beefed up unit tests for the ActiveContextMonitor too.

Signed-off-by: Greg Schohn <[email protected]>
…tor every 30s.

Policies are to cap the items at 64 items.  Items > 10 minutes are logged at ERROR, 60s are at WARN, and > 30s are INFO (debug>5, trace>2).
I've also modified the log4j2.properties file to log this stream to a new file, which is rotated every hour and we keep up to 4 of them.

Signed-off-by: Greg Schohn <[email protected]>
@peternied
Copy link
Member

peternied commented Apr 9, 2024

@gregschohn Before this comes out of draft could you fill in the PR template?

[Edit: Looks good, thanks for the details]

Some fixes were within tests and others improved the ability to test.
The repeating scheduled task to log the activity now includes a header of how many requests are outstanding so that the file has something happening in quiescent periods (so that a user doesn't think that the monitoring broke).  That's independent of the ActivityMonitor itself so that the tests don't need to worry about it.
I've also cleaned up the log4j2.properties file for replayer tests.

Signed-off-by: Greg Schohn <[email protected]>
@gregschohn gregschohn marked this pull request as ready for review April 9, 2024 22:42
Copy link

codecov bot commented Apr 9, 2024

Codecov Report

Attention: Patch coverage is 74.81032% with 166 lines in your changes are missing coverage. Please review.

Project coverage is 76.08%. Comparing base (b04d2a1) to head (ad7cd40).
Report is 8 commits behind head on main.

❗ Current head ad7cd40 differs from pull request most recent head 0e2ce10. Consider uploading reports for the commit 0e2ce10 to get more accurate results

Files Patch % Lines
...nsearch/migrations/replay/TrafficReplayerCore.java 71.13% 33 Missing and 23 partials ⚠️
...rch/migrations/replay/TrafficReplayerTopLevel.java 70.32% 43 Missing and 11 partials ⚠️
...h/migrations/replay/util/ActiveContextMonitor.java 84.35% 23 Missing ⚠️
.../opensearch/migrations/replay/TrafficReplayer.java 0.00% 19 Missing ⚠️
...ch/migrations/tracing/CompositeContextTracker.java 63.63% 4 Missing ⚠️
...opensearch/migrations/tracing/BaseSpanContext.java 75.00% 0 Missing and 3 partials ⚠️
...h/migrations/replay/util/OrderedWorkerTracker.java 83.33% 2 Missing and 1 partial ⚠️
...ficcapture/kafkaoffloader/KafkaCaptureFactory.java 50.00% 1 Missing ⚠️
...ns/tracing/ActiveContextTrackerByActivityType.java 95.00% 0 Missing and 1 partial ⚠️
...opensearch/migrations/tracing/RootOtelContext.java 90.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@            Coverage Diff             @@
##               main     #564    +/-   ##
==========================================
  Coverage     76.07%   76.08%            
- Complexity     1410     1500    +90     
==========================================
  Files           155      162     +7     
  Lines          6074     6340   +266     
  Branches        548      563    +15     
==========================================
+ Hits           4621     4824   +203     
- Misses         1082     1143    +61     
- Partials        371      373     +2     
Flag Coverage Δ
unittests 76.08% <74.81%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gregschohn gregschohn changed the title Print long running work for debugging part2 Print long running work for debugging Apr 10, 2024
…ith a reference (System.hashId of the context) rather than reprinting them.

Those items NOT printed do NOT impact the counting, so more contexts will be eligible to log.
The old mode is still supported and can be run via logTopOpenActivities(false).

Signed-off-by: Greg Schohn <[email protected]>
if (oldItem != null) {
var oldCtx = scopedContextToCallDetails.entrySet().stream().findFirst().get().getKey();
if (oldCtx.equals(ctx)) {
log.error("check");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this correct? it looks like a debug line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I've improved it to be more helpful

Comment on lines +27 to +28
this.startTimeNano = System.nanoTime();
this.startTimeInstant = Instant.now();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to make sure I understand why we're using each of these:

  • startTimeNano is used to measure the duration of the span precisely
  • startTimeInstant is used to indicate when it started (but is a bit less precise?)

Is that correct?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much. startTime is for logs correlation. I've added these comments to the base interface to explain...

public interface IWithStartTimeAndAttributes extends IInstrumentationAttributes {
    /**
     * This is used to calculate the precise duration of the span.  System.nanoTime() is guaranteed to be monotonic
     * and not susceptible to clock fluctuations due to system time resets
     */
    long getStartTimeNano();
    /**
     * This is by some ContextTrackers to log the recorded wall-time so that it can be easier to find the event
     * within logs. Notice that if the system clock is reset (which should be rare), there could be duplicated
     * values at different points in time.
     */
    Instant getStartTimeInstant();


@Override
public boolean isEmpty() {
return map. isEmpty();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extremely minor: extra space here

Comment on lines 94 to 98
ageToLevelEdgeMapRef.set(levelShowsAgeOlderThanMap.entrySet().stream()
.collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey,
(x,y) -> { throw Lombok.sneakyThrow(new IllegalStateException("Shouldn't have any merges")); },
TreeMap::new)));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what's going on here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed the merge function so that I could specify what kind of map to create.
I've changed the code to just make two maps

        ageToLevelEdgeMapRef.set(new TreeMap<>(levelShowsAgeOlderThanMap.entrySet().stream()
                .collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey))));

It is called in the constructor and within unit tests and the number of items is 5. Hopefully that's easier to read & worth the 'cost'.

Function<DiagnosticTrackableCompletableFuture<String, Void>, String> formatWorkItem,
BiConsumer<Level, Supplier<String>> logger,
Predicate<Level> logLevelIsEnabled,
Map<Level,Duration> levelShowsAgeOlderThanMap) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, this variable name keeps throwing me for a loop. is it intended to mean "map that correlates logging level with the duration at which it 'kicks in'", more or less?

loggingLevelForAgeMap? I'm not sure it's big improvement

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I'm using a point-map to represent a a non-duplicative and fully-covering interval map. That's why I'm using the "olderThan" part and keying off of level rather than time, but I think that the other approaches would be about the same. (I've never been comfortable using floor/ceiling search operations... even when I understand them, the cognitive load is high)

if (cad.items.isEmpty()) { return Optional.<Level>empty(); }
final var sample = cad.items.get(0);
logger.accept(getHigherLevel(Optional.of(sample.getValue()), Optional.of(Level.INFO)).get(), () ->
OLDEST_ITEMS_FROM_GROUP_LABEL_PREAMBLE + cad.totalScopes + " scopes that are past " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to use OLDEST_ITEMS_FROM_GROUP_LABEL_PREAMBLE vs. the string itself, given that there are plenty of other direct strings in this line?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, After 3 (I think) replications, SonarLint complains. We're at 2, so I've inlined it.
(once upon a time, it was 3 ;) )

Comment on lines +133 to +134
return aOuter.map(a -> bOuter.filter(b -> a.toInt() <= b.toInt()).orElse(a))
.or(() -> bOuter);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All we're doing here is comparing the int values of each level, right? It seems like a bewilderingly complex line to just do that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sortof - the types are Optional where Level is an enum (hence the toInt()). The Optional parts require the map and two ors. If I had a monad on Optional to say, take the one that was present OR do this comparison, that would be cleaner, but I don't know of one


@Override
public boolean isEmpty() {
return primaryMap. isEmpty();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny nit: another extra space

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.layout.pattern = %msg ([%-5level] %d{yyyy-MM-dd HH:mm:ss,SSS}{UTC})%n
appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.type = Policies
appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.time.type = TimeBasedTriggeringPolicy
appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.time.interval = 60
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am I remembering correctly that the unit for the interval value here is the smallest unit of the filePattern? So this means every 60 minutes? This could clearly vary enormously, but do you have a sense of how large this file might be within that hour?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

60 represents minutes.

Assuming that we are rendering 64 scopes for each one of our 67 classes and that each line is about 200 bytes (this will be dependent upon the number of attributes, but 200 bytes is where we're at today), at 120 observations per file or per hour, we'd be at ~100MB as the theoretical max. Typically, this file should be much smaller, with a couple span types being open for longer (channel, trafficStream, maybe even some responses or requests), but that should be < 10% of the total. We may need to eventually supply the time edges as a command line argument (especially if one would want to disable the feature entirely).

This was all put into bigger files to minimize the number of times that a user would need to rerun tail -f as the underlying file handle changes.

#logger.RSO.name = org.opensearch.migrations.replay.RequestSenderOrchestrator
#logger.RSO.level = TRACE
#logger.RSO.additivity = false
#logger.RSO.appenderRef.RSO.ref = Console
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be gone for good?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure - done

@gregschohn gregschohn merged commit 4fe2e9e into opensearch-project:main Apr 11, 2024
5 checks passed
@gregschohn gregschohn deleted the PrintLongRunningWorkForDebuggingPart2 branch April 11, 2024 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants