Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up Building Indices Lookup in Metadata #83241

Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -1792,7 +1792,7 @@ static SortedMap<String, IndexAbstraction> buildIndicesLookup(
ImmutableOpenMap<String, IndexMetadata> indices
) {
SortedMap<String, IndexAbstraction> indicesLookup = new TreeMap<>();
Map<String, DataStream> indexToDataStreamLookup = new HashMap<>();
Map<String, IndexAbstraction.DataStream> indexToDataStreamLookup = new HashMap<>();
// If there are no indices, then skip data streams. This happens only when metadata is read from disk
if (dataStreamMetadata != null && indices.size() > 0) {
Map<String, List<String>> dataStreamToAliasLookup = new HashMap<>();
Expand All @@ -1817,35 +1817,27 @@ static SortedMap<String, IndexAbstraction> buildIndicesLookup(
assert dataStream.getIndices().isEmpty() == false;

List<String> aliases = dataStreamToAliasLookup.getOrDefault(dataStream.getName(), List.of());
final IndexAbstraction.DataStream dsAbstraction = new IndexAbstraction.DataStream(dataStream, aliases);
IndexAbstraction existing = indicesLookup.put(
dataStream.getName(),
new IndexAbstraction.DataStream(dataStream, aliases)
);
assert existing == null : "duplicate data stream for " + dataStream.getName();

for (Index i : dataStream.getIndices()) {
indexToDataStreamLookup.put(i.getName(), dataStream);
indexToDataStreamLookup.put(i.getName(), dsAbstraction);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we need to build indextToDataStreamLookup at all? Could we not simply populate indicesLookup directly here instead and ignore those below (asserting that they are data stream owned instead of the duplicate check)?

Copy link
Member Author

@original-brownbear original-brownbear Jan 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right we don't ... now comes the interesting question of whether this would actually be faster.

Looking at the profiling for the full method (after my change here):

image

for a completely degenerate case where we have about as many indices as datastreams (one index per DS) ...
the hash map building barely shows up relative to the tree-map.

So the question becomes, is it cheaper to do 2 puts (or I guess one put and one computeIfAbsent) to the tree-map (the second put in the indices loop would be to a larger tree map than the one in the DS loop) or do one put into an on average smaller tree map and instead build the technically redundant hash-map.

Profiling suggests that this is faster I'd say. And in the real world, where you'd have multiple indices per DS the decision would be even more in favour of having the hash-map I think.
The problem we're running into is that the tree-map is super lob sided because we have the shared .ds prefix for all indices pretty much and then some more sharing for the specific DS, leading to loads of comparisons.
We should've used a different naming schema for the internal DS indices I guess ... (in fact I think we still could).

}
}
}

Map<String, List<IndexMetadata>> aliasToIndices = new HashMap<>();
for (var indexMetadata : indices.values()) {
ConcreteIndex index;
DataStream parent = indexToDataStreamLookup.get(indexMetadata.getIndex().getName());
if (parent != null) {
assert parent.getIndices()
.stream()
.map(Index::getName)
.collect(Collectors.toList())
.contains(indexMetadata.getIndex().getName())
: "Expected data stream [" + parent.getName() + "] to contain index " + indexMetadata.getIndex();
index = new ConcreteIndex(indexMetadata, (IndexAbstraction.DataStream) indicesLookup.get(parent.getName()));
} else {
index = new ConcreteIndex(indexMetadata);
}

IndexAbstraction existing = indicesLookup.put(indexMetadata.getIndex().getName(), index);
for (var entry : indices) {
final String name = entry.key;
final IndexMetadata indexMetadata = entry.value;
final IndexAbstraction.DataStream parent = indexToDataStreamLookup.get(name);
assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertion change is unrelated, just felt the need to clean this up and make it a little clearer what we're asserting :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

totally a nit and feel free to ignore, but since we are using streams, I would prefer to keep .map(Index::getName), i.e.:

Suggested change
assert parent == null || parent.getIndices().stream().anyMatch(index -> name.equals(index.getName()))
assert parent == null || parent.getIndices().stream().map(Index::getName).anyMatch(name::equals)

: "Expected data stream [" + parent.getName() + "] to contain index " + indexMetadata.getIndex();
IndexAbstraction existing = indicesLookup.put(name, new ConcreteIndex(indexMetadata, parent));
assert existing == null : "duplicate for " + indexMetadata.getIndex();

for (var aliasCursor : indexMetadata.getAliases()) {
Expand Down