Fix ShardSplittingQuery to respect nested documents. #27398
Conversation
Today, if nested docs are used in an index that is split, the operation will only work correctly if the index is not routing partitioned or if routing is used. This change fixes the query that selects the docs to delete so that it also selects all of a parent's nested docs. Closes elastic#27378
@elasticmachine test this please
I left some minor comments regarding naming and caching, otherwise LGTM
};
}

private void markChildDocs(BitSet parentDocs, BitSet deletedDocs) {
Why deletedDocs? It's the opposite, no? The bitset of the matching parent+children, allDocs?
naming issue, I will fix. In theory it will hold all docs that need to be deleted by the IndexWriter.
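For readers following along, here is a minimal sketch of the kind of marking loop being discussed, assuming Lucene's `org.apache.lucene.util.BitSet` API and that only root docs are marked initially; the method and parameter names follow the discussion and are not copied from the PR:

```java
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSet;

class ChildDocMarking {
    /**
     * For every root doc already marked for deletion, also mark its nested docs.
     * In Lucene block joins the children of a root doc occupy the doc ids directly
     * before it, so the children of a marked parent are all ids between the
     * previous parent (exclusive) and the marked parent (exclusive).
     */
    static void markChildDocs(BitSet parentDocs, BitSet docsToDelete) {
        int doc = 0;
        while (doc < docsToDelete.length()
                && (doc = docsToDelete.nextSetBit(doc)) != DocIdSetIterator.NO_MORE_DOCS) {
            // previous parent before the marked doc, or -1 if this is the first block
            int previousParent = doc == 0 ? -1 : parentDocs.prevSetBit(doc - 1);
            for (int child = previousParent + 1; child < doc; child++) {
                docsToDelete.set(child); // nested docs share their parent's fate
            }
            doc++;
        }
    }
}
```

Since only bits below the current cursor are ever added, the loop never revisits the children it just marked.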
/*
 * this is used internally to obtain a bitset for parent documents. We don't cache this since we never access the same reader more
 * than once
 */
We warm this query per segment in BitsetFilterCache#BitSetProducerWarmer, so why not use the per-segment cache directly?
we use this only as a delete-by-query which is executed on a recovery-private index writer. There is no point in caching it and it won't have a cache hit either.
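As a rough illustration of the uncached producer being described here, a sketch against the Lucene 7-era `IndexSearcher`/`Weight` API that Elasticsearch used at the time, where `Queries.newNonNestedFilter()` is the Elasticsearch helper that matches only root docs:

```java
import org.apache.lucene.index.IndexReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.util.BitSet;
import org.elasticsearch.common.lucene.search.Queries;

class ParentBitSets {
    // Builds the parent-doc bitset lazily per leaf, without going through
    // BitsetFilterCache: during a split the recovery-private IndexWriter sees
    // each reader exactly once, so a cache entry could never be hit again.
    static BitSetProducer newParentDocBitSetProducer() {
        return context -> {
            Query rootDocsQuery = Queries.newNonNestedFilter(); // matches root docs only
            IndexReaderContext topLevelContext = ReaderUtil.getTopLevelContext(context);
            IndexSearcher searcher = new IndexSearcher(topLevelContext);
            searcher.setQueryCache(null); // deliberately uncached
            Weight weight = searcher.createNormalizedWeight(rootDocsQuery, false);
            Scorer scorer = weight.scorer(context);
            return scorer == null ? null : BitSet.of(scorer.iterator(), context.reader().maxDoc());
        };
    }
}
```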
I left a comment
Ok I missed the recovery-private thing. Thanks
This looks good @s1monw! I left a few questions and small remarks.
@@ -83,53 +104,91 @@ public void testSplitOnRouting() throws IOException {
    .numberOfShards(numShards)
    .setRoutingNumShards(numShards * 1000000)
    .numberOfReplicas(0).build();
boolean hasNested = true || randomBoolean();
s/true || randomBoolean();/randomBoolean();/
oh yeah 🗡
IndexMetaData metaData = IndexMetaData.builder("test")
    .settings(Settings.builder().put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT))
    .numberOfShards(numShards)
    .setRoutingNumShards(numShards * 1000000)
    .numberOfReplicas(0).build();
boolean hasNested = true;randomBoolean();
same here
assert indexMetaData.isRoutingPartitionedIndex() == false;
findSplitDocs(IdFieldMapper.NAME, includeInShard, leafReader, bitSet::set);
} else {
    final BitSet parentBitSet;
    final IntPredicate includeDoc;
So just to double check, the includeDoc is to ensure that only root docs get selected, and later we select the nested docs of the selected root docs in markChildDocs(...), right?
correct
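To make the confirmed flow concrete, here is a hypothetical sketch of the two-step selection; names such as `shouldDeleteRootDoc` are stand-ins for the real `_id`/`_routing` lookup, and it reuses the `markChildDocs` sketch shown earlier:

```java
import java.io.IOException;
import java.util.function.IntPredicate;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.util.BitSet;
import org.apache.lucene.util.FixedBitSet;

class SplitDocSelection {
    // Step 1: only root docs can match directly (includeDoc).
    // Step 2: extend the selection to their nested children.
    static FixedBitSet selectDocsToDelete(LeafReader leafReader, boolean hasNested,
                                          BitSetProducer parentBitSetProducer,
                                          IntPredicate shouldDeleteRootDoc) throws IOException {
        FixedBitSet docsToDelete = new FixedBitSet(leafReader.maxDoc());
        final BitSet parentBitSet;
        final IntPredicate includeDoc;
        if (hasNested) {
            parentBitSet = parentBitSetProducer.getBitSet(leafReader.getContext());
            includeDoc = parentBitSet::get; // only root docs are considered here
        } else {
            parentBitSet = null;
            includeDoc = doc -> true;       // no nested docs: every doc is a root doc
        }
        for (int doc = 0; doc < leafReader.maxDoc(); doc++) {
            // in the real query this decision comes from the _id/_routing values
            if (includeDoc.test(doc) && shouldDeleteRootDoc.test(doc)) {
                docsToDelete.set(doc);
            }
        }
        if (parentBitSet != null) {
            ChildDocMarking.markChildDocs(parentBitSet, docsToDelete); // pull in the children
        }
        return docsToDelete;
    }
}
```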
 * this is used internally to obtain a bitset for parent documents. We don't cache this since we never access the same reader more
 * than once
 */
private static BitSetProducer newParentDocBitSetProducer() {
👍
}

@Override
public boolean matches() throws IOException {
I was trying to understand why this works, because of the forward iteration here (with nested, we usually seek backwards with BitSet.prevSetBit(...)). So this works because all live doc ids (root docs and nested docs) are evaluated in order.
correct, I will leave a comment
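The idea being confirmed here could be sketched like the following, a hypothetical forward-only two-phase check (not the PR's exact code) where `parentMatches` stands in for the per-parent `_routing`/`_id` decision:

```java
import java.io.IOException;
import java.util.function.IntPredicate;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.TwoPhaseIterator;
import org.apache.lucene.util.BitSet;

class BlockMatching {
    // Nested docs are written before their root doc, so the parent of the current
    // doc id is parentDocs.nextSetBit(doc). Because the approximation visits doc
    // ids in order, the parent's decision is computed once per block and reused
    // for every nested doc in that block, with no backwards seeking needed.
    static TwoPhaseIterator blockMatcher(DocIdSetIterator approximation,
                                         BitSet parentDocs,
                                         IntPredicate parentMatches) {
        return new TwoPhaseIterator(approximation) {
            int nextParent = -1;
            boolean nextParentMatches;

            @Override
            public boolean matches() throws IOException {
                int doc = approximation.docID();
                if (doc > nextParent) {
                    // first doc of a new block: look ahead to the block's parent
                    // (in block-indexed segments every doc has a parent at or after it)
                    nextParent = parentDocs.nextSetBit(doc);
                    nextParentMatches = parentMatches.test(nextParent);
                }
                return nextParentMatches;
            }

            @Override
            public float matchCost() {
                return 1.0f; // rough constant cost per visited doc
            }
        };
    }
}
```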
@jimczi @martijnvg I pushed new commits if you wanna take another look
LGTM
* master:
  Stop skipping REST test after backport of #27056
  Fix default value of ignore_unavailable for snapshot REST API (#27056)
  Add composite aggregator (#26800)
  Fix `ShardSplittingQuery` to respect nested documents. (#27398)
  [Docs] Restore section about multi-level parent/child relation in parent-join (#27392)
  Add TcpChannel to unify Transport implementations (#27132)
  Add note on plugin distributions in plugins folder
  Remove implementations of `TransportChannel` (#27388)
  Update Google SDK to version 1.23 (#27381)
  Fix Gradle 4.3.1 compatibility for logging (#27382)
  [Test] Change Elasticsearch startup timeout to 120s in packaging tests
  Docs/windows installer (#27369)