Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Report transport message size per action #94543

Conversation

DaveCTurner
Copy link
Contributor

@DaveCTurner DaveCTurner commented Mar 20, 2023

Adds to the transport stats a histogram of transport message sizes for each transport action.

Closes #88151

Adds to the transport stats a histogram of transport message sizes for
each transport action.
@DaveCTurner DaveCTurner added >enhancement :Distributed Coordination/Network Http and internode communication implementations v8.8.0 labels Mar 20, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team (obsolete) label Mar 20, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor comments. I wonder if we should add a brief .yml test to validate the structure of the response. And also, I think we can add documentation for the new stats in this PR.

requestId,
version,
header.getCompressionScheme(),
ResponseStatsConsumer.NONE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the response stats instead here, except if short circuited?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect we should track the response stats if short-circuited too? 905e5ad does that, but I can rework this some more if you'd prefer to drop the response stats for short-circuited ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see, we short-circuit if the action is not known, as well as on other up-front exceptions. afaa718 captures the response stats for known actions even if short-circuited.

* n: < 2^(n+3)
* ...
* 25: < 2^28 == 256MiB
* 26: >= 2^28
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we missing 2 buckets? >= 512MiB and >= 1GiB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured by the time we got to 256MiB the messages were already unreasonably large, but you're right it's only 2 buckets away from the max size anyway. Added in 80eeac4.

@@ -41,7 +43,8 @@ public TransportStats(
long txCount,
long txSize,
long[] inboundHandlingTimeBucketFrequencies,
long[] outboundHandlingTimeBucketFrequencies
long[] outboundHandlingTimeBucketFrequencies,
List<TransportActionStats> transportActionStats
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more intuitive to have the stats be a Map - also to avoid that TransportActionStats is a fragment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done in f2e42f1

Comment on lines 206 to 207
// Stats came from before v8.7
assert Version.CURRENT.major == Version.V_8_0_0.major;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not exactly sure what the purpose here is? I am guessing it is to ensure that we cleanup something here when major switches to 9?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah except it should have mentioned 7.0 not 8.0. Fixed now.

return builder.endObject();
}

static void histogramToXContent(XContentBuilder builder, long[] sizeHistogram, String fieldName) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fieldName seems to be passed "histogram" always except in tests, could it not be hardcoded instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I think I used the shorter one to avoid some line breaking mess but it's not necessary any more. Done in e24b97e.

Comment on lines 54 to 79
final int[] bucketBounds = TransportActionStatsTracker.getBucketUpperBounds();
assert sizeHistogram.length == bucketBounds.length + 1;
builder.startArray(fieldName);

int firstBucket = 0;
long remainingCount = 0L;
for (int i = 0; i < sizeHistogram.length; i++) {
if (remainingCount == 0) {
firstBucket = i;
}
remainingCount += sizeHistogram[i];
}

for (int i = firstBucket; i < sizeHistogram.length && 0 < remainingCount; i++) {
builder.startObject();
if (i > 0) {
builder.humanReadableField("ge_bytes", "ge", ByteSizeValue.ofBytes(bucketBounds[i - 1]));
}
if (i < bucketBounds.length) {
builder.humanReadableField("lt_bytes", "lt", ByteSizeValue.ofBytes(bucketBounds[i]));
}
builder.field("count", sizeHistogram[i]);
builder.endObject();
remainingCount -= sizeHistogram[i];
}
builder.endArray();
Copy link
Contributor

@idegtiarenko idegtiarenko Mar 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we consider org.HdrHistogram.Histogram to simplify rendering json and omitting empty buckets and also for determining the bucket by the size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand how this could make things simpler, but maybe I'm missing something.

.sorted(Comparator.comparing(RequestHandlerRegistry::getAction))
.map(RequestHandlerRegistry::getStats)
.toList();
.collect(Collectors.toMap(RequestHandlerRegistry::getAction, RequestHandlerRegistry::getStats, (a, b) -> a, TreeMap::new));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re (a, b) -> a, do we expect a duplicate action somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but we've got to say something here (there's no overload that accepts a mapFactory without a mergeFunction) and this is the shortest thing we can say

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, okay. NIT: we have org.elasticsearch.common.util.Maps#toUnmodifiableSortedMap that only accept key/value functions and could be used here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL thanks, that's what I wanted 😄

@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you.

Copy link
Contributor

@henningandersen henningandersen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

showing the total amount of traffic and a histogram of message sizes for
incoming requests and outgoing responses.
+
.Properties of `actions.*.requests` and `actions.*.responses`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The below seems to describe the histograms, can we make that explicit here? And also include documentation of the outer level count and total_size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh oops good catch. Fixed in 63d5d73.

header.isHandshake(),
message.takeBreakerReleaseControl()
);
} catch (Exception e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I see a potential for exceptions in the code above. Fine to keep this, but if there is no exception expected, perhaps we can assert false here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I couldn't either but I'm paranoid about leaking a response here (or in future). Assertion added in 20ee73d.

import java.io.IOException;

public record TransportActionStats(
String action,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field is no longer used and I think we should remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done in c1247b0.

Comment on lines -181 to +198
assert Version.CURRENT.major == Version.V_8_0_0.major;
assert Version.CURRENT.major == Version.V_7_0_0.major + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the difference, can you explain the motivation for changing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When Version.CURRENT becomes 9.0 we remove all the v7.x version constants, so by mentioning one here we are forced to remove this dead code to fix the compile error.

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Mar 28, 2023
@elasticsearchmachine elasticsearchmachine merged commit 4ef9965 into elastic:main Mar 28, 2023
@DaveCTurner DaveCTurner deleted the 2023-03-19-per-action-transport-stats branch March 28, 2023 16:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Coordination/Network Http and internode communication implementations >enhancement Team:Distributed Meta label for distributed team (obsolete) v8.8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Track volume of transport traffic broken down by action/category
4 participants