Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Lucene segment-level fields stats #111123

Merged
merged 10 commits into from
Jul 23, 2024
Merged

Add Lucene segment-level fields stats #111123

merged 10 commits into from
Jul 23, 2024

Conversation

dnhatn
Copy link
Member

@dnhatn dnhatn commented Jul 20, 2024

This change returns the total number of fields at the segment level, allowing for a more accurate estimate of the memory used by Lucene. The new estimate is expected to be closer to the actual memory usage than the current estimate using the index-level field count, due to the non-trivial overhead incurred by each Lucene segment. Two new fields are introduced: total_segment_fields, which represents the total number of fields at the segment level, and average_fields_per_segment. The overhead per field in segments with fewer fields is larger than in segments with many fields.

@dnhatn dnhatn force-pushed the lucene-fields branch 2 times, most recently from 8126d91 to 6d02fc3 Compare July 20, 2024 18:07
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to having these stats available! I left a couple of comments.

Comment on lines 102 to 103
builder.field(Fields.TOTAL_SEGMENT_FIELDS, totalSegments);
builder.field(Fields.AVERAGE_FIELDS_PER_SEGMENT, totalSegments == 0 ? 0 : totalSegmentFields / totalSegments);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the average per node is nice but could we also include the raw counts in the output so we can aggregate this across nodes more easily?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks David. Yes, we send these raw stats between nodes and return only the average in the rest response, as including the number of segments in the mappings section doesn't make sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including the number of segments in the mappings section doesn't make sense

Although I see what you mean, I disagree that this is a good reason to omit it. The number of segments might not be available elsewhere (hence why we've added it to NodeMappingStats) or at least there may be some variance between the number used in this computation and the one exposed elsewhere. Let's have the raw figures here as well as the average please.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I pushed e42a69f

@dnhatn dnhatn added :StorageEngine/Mapping The storage related side of mappings >enhancement labels Jul 23, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @dnhatn, I've created a changelog YAML for you.

@dnhatn dnhatn marked this pull request as ready for review July 23, 2024 05:09
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall change looks ok but I'd rather see the raw values in the output as well as the computed average (see this comment).

Copy link
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Nhat. Very nice to have these stats available !

Should we also update the docs (node-stats.asciidoc) ?

LGTM, pending David's suggestion to expose the raw values as well

@dnhatn dnhatn requested a review from DaveCTurner July 23, 2024 14:31
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dnhatn
Copy link
Member Author

dnhatn commented Jul 23, 2024

@DaveCTurner @andreidan Thanks for reviewing :).

@dnhatn dnhatn merged commit f275dff into elastic:main Jul 23, 2024
15 checks passed
@dnhatn dnhatn deleted the lucene-fields branch July 23, 2024 15:52
@dnhatn
Copy link
Member Author

dnhatn commented Jul 23, 2024

Should we also update the docs (node-stats.asciidoc) ?

@andreidan ++ I updated it in 9eb89c9.

elasticsearchmachine pushed a commit that referenced this pull request Jul 23, 2024
The `search_shards` API is not available in serverless. This PR replaces
its usage in the newly added test with the `search` API with profiling.

Relates #111123
dnhatn added a commit that referenced this pull request Aug 2, 2024
Previously, we returned the number of segments and the total number of 
fields in those segments in NodeMappingStats (see #111123). However, the
total number of fields returned in that PR might be very inaccurate for
indices having large mappings but only a small number of actual fields.

This change returns a more accurate total number of fields using the 
Lucene FieldInfos from those segments. Since we need to acquire a
searcher to compute this stats, we opt to compute it after a shard is
refreshed and cache the result.

Relates #111123
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this pull request Sep 4, 2024
Previously, we returned the number of segments and the total number of 
fields in those segments in NodeMappingStats (see elastic#111123). However, the
total number of fields returned in that PR might be very inaccurate for
indices having large mappings but only a small number of actual fields.

This change returns a more accurate total number of fields using the 
Lucene FieldInfos from those segments. Since we need to acquire a
searcher to compute this stats, we opt to compute it after a shard is
refreshed and cache the result.

Relates elastic#111123
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants