Speed up Building Indices Lookup in Metadata #83241
We can speed this up by almost a third when the index count is large and most indices are part of a data stream, by avoiding the double lookup for the data stream abstraction in the way done here. I also simplified the loop iteration a little to use the slightly faster cursor and removed a needless conditional in the loop.
Pinging @elastic/es-data-management (Team:Data Management)
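To make the "double lookup" concrete, here is a minimal sketch. It does not use the real Elasticsearch classes: `DataStream`, `DataStreamAbstraction`, `doubleLookup` and `singleLookup` are hypothetical stand-ins, and the "before" variant is an assumption about what the old code did, based only on the description above. The idea is to create the abstraction once per data stream and store that same object under each backing index name, so no second tree-map lookup by data stream name is needed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

final class IndicesLookupSketch {
    // Hypothetical, simplified stand-ins for the real types.
    record DataStream(String name, List<String> indexNames) {}
    record DataStreamAbstraction(DataStream dataStream) {}

    /** Assumed "before": map index -> data stream, then fetch the abstraction from the tree map again. */
    static Map<String, DataStreamAbstraction> doubleLookup(List<DataStream> dataStreams) {
        SortedMap<String, DataStreamAbstraction> indicesLookup = new TreeMap<>();
        Map<String, DataStream> indexToDataStream = new HashMap<>();
        for (DataStream ds : dataStreams) {
            indicesLookup.put(ds.name(), new DataStreamAbstraction(ds));
            for (String index : ds.indexNames()) {
                indexToDataStream.put(index, ds);
            }
        }
        Map<String, DataStreamAbstraction> parents = new HashMap<>();
        for (Map.Entry<String, DataStream> e : indexToDataStream.entrySet()) {
            // Second lookup: walks the tree map, comparing strings along the way.
            parents.put(e.getKey(), indicesLookup.get(e.getValue().name()));
        }
        return parents;
    }

    /** "After": keep the abstraction in a local variable and reuse it for every backing index. */
    static Map<String, DataStreamAbstraction> singleLookup(List<DataStream> dataStreams) {
        SortedMap<String, DataStreamAbstraction> indicesLookup = new TreeMap<>();
        Map<String, DataStreamAbstraction> parents = new HashMap<>();
        for (DataStream ds : dataStreams) {
            DataStreamAbstraction dsAbstraction = new DataStreamAbstraction(ds);
            indicesLookup.put(ds.name(), dsAbstraction);
            for (String index : ds.indexNames()) {
                parents.put(index, dsAbstraction);
            }
        }
        return parents;
    }
}
```

Both variants produce the same index-to-parent mapping; the second one simply never goes back to the tree map.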
final String name = entry.key;
final IndexMetadata indexMetadata = entry.value;
final IndexAbstraction.DataStream parent = indexToDataStreamLookup.get(name);
assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))
assertion change is unrelated, just felt the need to clean this up and make it a little clearer what we're asserting :)
totally a nit and feel free to ignore, but since we are using streams, I would prefer to keep .map(Index::getName), i.e.:
assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))
assert parent == null || parent.getIndices().stream().map(Index::getName).anyMatch(name::equals)
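For readers comparing the two assertion forms, the sketch below shows that they test the same condition. `Index` here is a hypothetical one-field stand-in for the real class, and `lambdaForm` / `mappedForm` are illustrative names only.

```java
import java.util.List;

final class AssertFormsSketch {
    // Hypothetical stand-in exposing only getName().
    record Index(String name) {
        String getName() {
            return name;
        }
    }

    /** Form used in the PR: a single lambda that compares names. */
    static boolean lambdaForm(List<Index> indices, String name) {
        return indices.stream().anyMatch(index -> name.equals(index.getName()));
    }

    /** Form suggested in review: map to names first, then match with a method reference. */
    static boolean mappedForm(List<Index> indices, String name) {
        return indices.stream().map(Index::getName).anyMatch(name::equals);
    }
}
```

The choice between the two is stylistic; both short-circuit on the first match.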
LGTM.
IndexAbstraction existing = indicesLookup.put(
    dataStream.getName(),
    new IndexAbstraction.DataStream(dataStream, aliases)
);
assert existing == null : "duplicate data stream for " + dataStream.getName();

for (Index i : dataStream.getIndices()) {
    indexToDataStreamLookup.put(i.getName(), dataStream);
    indexToDataStreamLookup.put(i.getName(), dsAbstraction);
I wonder if we need to build indexToDataStreamLookup at all? Could we not simply populate indicesLookup directly here instead and ignore those below (asserting that they are data stream owned instead of the duplicate check)?
Right, we don't ... now comes the interesting question of whether this would actually be faster.
Looking at the profiling for the full method (after my change here):
for a completely degenerate case where we have about as many indices as data streams (one index per DS) ...
the hash-map building barely shows up relative to the tree-map.
So the question becomes: is it cheaper to do 2 puts (or I guess one put and one computeIfAbsent) to the tree-map (the second put in the indices loop would be to a larger tree-map than the one in the DS loop), or to do one put into an on-average smaller tree-map and instead build the technically redundant hash-map?
Profiling suggests that the hash-map approach is faster, I'd say. And in the real world, where you'd have multiple indices per DS, the decision would be even more in favour of having the hash-map, I think.
The problem we're running into is that the tree-map is super lopsided because we have the shared .ds prefix for pretty much all indices and then some more sharing for the specific DS, leading to loads of comparisons.
We should've used a different naming scheme for the internal DS indices I guess ... (in fact I think we still could).
Thanks Henning! I went with my solution for now because of the way our tree-map is painful. But I think we might want to improve that in a follow-up.
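As a rough illustration of the shared-prefix point, the sketch below counts how many characters a comparator has to examine while inserting keys into a `java.util.TreeMap`, once with a common prefix and once without. It is not a benchmark and does not use Elasticsearch code: `charsCompared`, the key format and the prefix are all assumptions made for the example. It only shows that every comparison between keys sharing a prefix has to scan that prefix before reaching a differing character.

```java
import java.util.Comparator;
import java.util.TreeMap;

final class PrefixComparisonSketch {
    /** Inserts n zero-padded keys with the given prefix and returns the characters examined by the comparator. */
    static long charsCompared(String prefix, int n) {
        long[] chars = {0};
        // Plain lexicographic comparison that counts every character position it looks at.
        Comparator<String> counting = (a, b) -> {
            int len = Math.min(a.length(), b.length());
            for (int i = 0; i < len; i++) {
                chars[0]++;
                int diff = a.charAt(i) - b.charAt(i);
                if (diff != 0) {
                    return diff;
                }
            }
            return a.length() - b.length();
        };
        TreeMap<String, Integer> map = new TreeMap<>(counting);
        for (int i = 0; i < n; i++) {
            map.put(prefix + String.format("%06d", i), i);
        }
        return chars[0];
    }

    public static void main(String[] args) {
        System.out.println("with prefix:    " + charsCompared(".ds-logs-", 10_000));
        System.out.println("without prefix: " + charsCompared("", 10_000));
    }
}
```

Because the prefix does not change the relative order of the keys, both runs perform the same comparisons; the prefixed run just pays for the extra characters each time. A hash map avoids these ordered comparisons on insert, which fits the profiling observation above.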
* upstream/master: (100 commits)
  Avoid duplicate _type fields in v7 compat layer (elastic#83239)
  Bump bundled JDK to 17.0.2+8 (elastic#83243)
  [DOCS] Correct header syntax (elastic#83275)
  Add unit tests for indices.recovery.max_bytes_per_sec default values (elastic#83261)
  [DOCS] Add note that write indices are not replicated (elastic#82997)
  Add notes on indexing to kNN search guide (elastic#83188)
  Fix get-snapshot-api :docs:integTest (elastic#83273)
  FilterPathBasedFilter support match fieldname with dot (elastic#83178)
  Fix compilation issues in example-plugins (elastic#83258)
  fix ClusterStateListener javadoc (elastic#83246)
  Speed up Building Indices Lookup in Metadata (elastic#83241)
  Mute whole suite for elastic#82502 (elastic#83252)
  Make PeerFinder log messages happier (elastic#83222)
  [Docs] Add supported _terms_enum field types (elastic#83244)
  Add an aggregator for IPv4 and IPv6 subnets (elastic#82410)
  [CI] Fix 70_time_series/default sort yaml test failures (elastic#83217)
  Update test-failure Issue Template to include "needs:triage" label elastic#83226
  Add an index->step cache to the PolicyStepsRegistry (elastic#82316)
  Improve support for joda datetime to java datetime transition in Painless (elastic#83099)
  Fix joda migration for week based methods in Painless (elastic#83232)
  ...
# Conflicts:
#	x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/TransportRollupAction.java
relates #77466