-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Print long running work for debugging #564
Print long running work for debugging #564
Conversation
…ime that it is closed as a first-class feature of the coreUtilities tracing library. A number of changes were made to existing interfaces to reduce the scope of 'naked' Spans so that it's easier to audit if they're being used directly to close spans. The getCurrentSpan() needs to be public because it's in an interface and used internally, but it should really be private and require friend access only within its package. However, I'm not sure how to do better guardrails in Java. Signed-off-by: Greg Schohn <[email protected]>
# Conflicts: # TrafficCapture/trafficReplayer/src/main/java/org/opensearch/migrations/replay/TrafficReplayer.java
… from the business logic. Some of the new class's constructors have been simplified/streamlined a bit at the expense of not creating a JsonTransfomer automatically. Signed-off-by: Greg Schohn <[email protected]>
…class. TopLevel deals with the total lifecycle (shutdown) of critical resources while Core does the orchestration between various top-level components. Signed-off-by: Greg Schohn <[email protected]>
…ontexts, etc, tying their log levels to their ages. Signed-off-by: Greg Schohn <[email protected]>
…or to output the deepest items in the context tree. Signed-off-by: Greg Schohn <[email protected]>
Beefed up unit tests for the ActiveContextMonitor too. Signed-off-by: Greg Schohn <[email protected]>
…tor every 30s. Policies are to cap the items at 64 items. Items > 10 minutes are logged at ERROR, 60s are at WARN, and > 30s are INFO (debug>5, trace>2). I've also modified the log4j2.properties file to log this stream to a new file, which is rotated every hour and we keep up to 4 of them. Signed-off-by: Greg Schohn <[email protected]>
@gregschohn [Edit: Looks good, thanks for the details] |
Some fixes were within tests and others improved the ability to test. The repeating scheduled task to log the activity now includes a header of how many requests are outstanding so that the file has something happening in quiescent periods (so that a user doesn't think that the monitoring broke). That's independent of the ActivityMonitor itself so that the tests don't need to worry about it. I've also cleaned up the log4j2.properties file for replayer tests. Signed-off-by: Greg Schohn <[email protected]>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #564 +/- ##
==========================================
Coverage 76.07% 76.08%
- Complexity 1410 1500 +90
==========================================
Files 155 162 +7
Lines 6074 6340 +266
Branches 548 563 +15
==========================================
+ Hits 4621 4824 +203
- Misses 1082 1143 +61
- Partials 371 373 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
…ith a reference (System.hashId of the context) rather than reprinting them. Those items NOT printed do NOT impact the counting, so more contexts will be eligible to log. The old mode is still supported and can be run via logTopOpenActivities(false). Signed-off-by: Greg Schohn <[email protected]>
if (oldItem != null) { | ||
var oldCtx = scopedContextToCallDetails.entrySet().stream().findFirst().get().getKey(); | ||
if (oldCtx.equals(ctx)) { | ||
log.error("check"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this correct? it looks like a debug line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I've improved it to be more helpful
this.startTimeNano = System.nanoTime(); | ||
this.startTimeInstant = Instant.now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to make sure I understand why we're using each of these:
- startTimeNano is used to measure the duration of the span precisely
- startTimeInstant is used to indicate when it started (but is a bit less precise?)
Is that correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much. startTime is for logs correlation. I've added these comments to the base interface to explain...
public interface IWithStartTimeAndAttributes extends IInstrumentationAttributes {
/**
* This is used to calculate the precise duration of the span. System.nanoTime() is guaranteed to be monotonic
* and not susceptible to clock fluctuations due to system time resets
*/
long getStartTimeNano();
/**
* This is by some ContextTrackers to log the recorded wall-time so that it can be easier to find the event
* within logs. Notice that if the system clock is reset (which should be rare), there could be duplicated
* values at different points in time.
*/
Instant getStartTimeInstant();
|
||
@Override | ||
public boolean isEmpty() { | ||
return map. isEmpty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extremely minor: extra space here
ageToLevelEdgeMapRef.set(levelShowsAgeOlderThanMap.entrySet().stream() | ||
.collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey, | ||
(x,y) -> { throw Lombok.sneakyThrow(new IllegalStateException("Shouldn't have any merges")); }, | ||
TreeMap::new))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain what's going on here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed the merge function so that I could specify what kind of map to create.
I've changed the code to just make two maps
ageToLevelEdgeMapRef.set(new TreeMap<>(levelShowsAgeOlderThanMap.entrySet().stream()
.collect(Collectors.toMap(Map.Entry::getValue, Map.Entry::getKey))));
It is called in the constructor and within unit tests and the number of items is 5. Hopefully that's easier to read & worth the 'cost'.
Function<DiagnosticTrackableCompletableFuture<String, Void>, String> formatWorkItem, | ||
BiConsumer<Level, Supplier<String>> logger, | ||
Predicate<Level> logLevelIsEnabled, | ||
Map<Level,Duration> levelShowsAgeOlderThanMap) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, this variable name keeps throwing me for a loop. is it intended to mean "map that correlates logging level with the duration at which it 'kicks in'", more or less?
loggingLevelForAgeMap
? I'm not sure it's big improvement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I'm using a point-map to represent a a non-duplicative and fully-covering interval map. That's why I'm using the "olderThan" part and keying off of level rather than time, but I think that the other approaches would be about the same. (I've never been comfortable using floor/ceiling search operations... even when I understand them, the cognitive load is high)
if (cad.items.isEmpty()) { return Optional.<Level>empty(); } | ||
final var sample = cad.items.get(0); | ||
logger.accept(getHigherLevel(Optional.of(sample.getValue()), Optional.of(Level.INFO)).get(), () -> | ||
OLDEST_ITEMS_FROM_GROUP_LABEL_PREAMBLE + cad.totalScopes + " scopes that are past " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to use OLDEST_ITEMS_FROM_GROUP_LABEL_PREAMBLE
vs. the string itself, given that there are plenty of other direct strings in this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, After 3 (I think) replications, SonarLint complains. We're at 2, so I've inlined it.
(once upon a time, it was 3 ;) )
return aOuter.map(a -> bOuter.filter(b -> a.toInt() <= b.toInt()).orElse(a)) | ||
.or(() -> bOuter); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All we're doing here is comparing the int values of each level, right? It seems like a bewilderingly complex line to just do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sortof - the types are Optional where Level is an enum (hence the toInt()). The Optional parts require the map and two ors. If I had a monad on Optional to say, take the one that was present OR do this comparison, that would be cleaner, but I don't know of one
|
||
@Override | ||
public boolean isEmpty() { | ||
return primaryMap. isEmpty(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny nit: another extra space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.layout.pattern = %msg ([%-5level] %d{yyyy-MM-dd HH:mm:ss,SSS}{UTC})%n | ||
appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.type = Policies | ||
appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.time.type = TimeBasedTriggeringPolicy | ||
appender.ALL_ACTIVE_WORK_MONITOR_LOGFILE.policies.time.interval = 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am I remembering correctly that the unit for the interval value here is the smallest unit of the filePattern
? So this means every 60 minutes? This could clearly vary enormously, but do you have a sense of how large this file might be within that hour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
60 represents minutes.
Assuming that we are rendering 64 scopes for each one of our 67 classes and that each line is about 200 bytes (this will be dependent upon the number of attributes, but 200 bytes is where we're at today), at 120 observations per file or per hour, we'd be at ~100MB as the theoretical max. Typically, this file should be much smaller, with a couple span types being open for longer (channel, trafficStream, maybe even some responses or requests), but that should be < 10% of the total. We may need to eventually supply the time edges as a command line argument (especially if one would want to disable the feature entirely).
This was all put into bigger files to minimize the number of times that a user would need to rerun tail -f as the underlying file handle changes.
#logger.RSO.name = org.opensearch.migrations.replay.RequestSenderOrchestrator | ||
#logger.RSO.level = TRACE | ||
#logger.RSO.additivity = false | ||
#logger.RSO.appenderRef.RSO.ref = Console |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be gone for good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure - done
Signed-off-by: Greg Schohn <[email protected]>
Description
(Adapted from the Jira task...)
To allow users and developers to understand the Replayer’s activities, we use Otel traces to describe the major phases that captures and their derived counterparts go through. However, traces are only able to be viewed/used within X-Ray, Jaeger, etc when the trace was closed. Unfortunately, that means that traces for the most problematic activities that don’t complete cannot provide any value to the user or the developer.
This feature attempts to remedy that shortcoming by registering each trace that is created within an ordered hashmap and removing those entries upon their completion. Periodically, a nanny thread can scan that map to dump the oldest traces and all of their attributes to a specific “still-running-activities” logfile. That will allow an end-user or a developer to monitor the logfile and be able to get invaluable clues for the next steps in diagnosing many issues.
Below is sample output from one scheduled dump of active work on my replayer (which has one stuck request and two behind it). As per the default log4j2.properties file, this will be in the log directory and named "longRunningActivity.log".
Issues Resolved
https://opensearch.atlassian.net/browse/MIGRATIONS-1618
Testing
Unit testing
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.