
Fix ShardSplittingQuery to respect nested documents. #27398

Merged
4 commits merged into elastic:master on Nov 16, 2017

Conversation

@s1monw (Contributor) commented Nov 15, 2017

Today, if nested docs are used in an index that is split, the operation
only works correctly if the index is not routing-partitioned, or if
routing is used. This change fixes the query that selects the docs to
delete so that it also selects the nested docs of every selected parent.

Closes #27378

@s1monw added labels :Core/Infra/Core (Core issues without another label), >bug, v6.1.0, v7.0.0 on Nov 15, 2017
@s1monw requested review from martijnvg and jimczi on November 15, 2017 at 15:51
@s1monw (Contributor, Author) commented Nov 16, 2017

@elasticmachine test this please

@jimczi (Contributor) left a comment

I left some minor comments regarding naming and caching, otherwise LGTM

};
}

private void markChildDocs(BitSet parentDocs, BitSet deletedDocs) {

jimczi:
Why deletedDocs? It's the opposite, no? The bitset of the matching parents plus their children, i.e. allDocs?


s1monw (Author):
naming issue, I will fix. In theory it will hold all docs that need to be deleted by the IndexWriter.
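
For illustration, here is a minimal sketch of what such a method can look like (a reconstruction for this discussion, not necessarily the merged code). It leans on Lucene's block-join encoding, where nested children are indexed immediately before their parent, so the children of a marked parent are exactly the docs between the previous parent and the marked one:

import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSet;

// Sketch: extend the marks in docsToDelete (which initially holds matching
// parent docs only) so that they also cover each marked parent's children.
private static void markChildDocs(BitSet parentDocs, BitSet docsToDelete) {
    int currentDeleted = 0;
    while (currentDeleted < docsToDelete.length()
            && (currentDeleted = docsToDelete.nextSetBit(currentDeleted)) != DocIdSetIterator.NO_MORE_DOCS) {
        // everything between the previous parent and the deleted parent is a
        // nested child of the deleted parent; delete it as well
        int previousParent = parentDocs.prevSetBit(Math.max(0, currentDeleted - 1));
        for (int doc = previousParent + 1; doc < currentDeleted; doc++) {
            docsToDelete.set(doc);
        }
        currentDeleted++;
    }
}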

/*
* this is used internally to obtain a bitset for parent documents. We don't cache this since we never access the same reader more
* than once
*/

jimczi:
We warm this query per segment in BitsetFilterCache#BitSetProducerWarmer, so why not use the per-segment cache directly?


s1monw (Author):
we use this only for a delete-by-query that is executed on a recovery-private index writer. There is no point in caching it, and it won't get a cache hit either.
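
To make the trade-off concrete, an uncached producer can be as simple as the following sketch (written against a recent Lucene API; the parentFilter parameter is illustrative, the real code derives its own parent filter):

import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;
import org.apache.lucene.search.join.BitSetProducer;
import org.apache.lucene.util.BitSet;

// Sketch: build the parent bitset from scratch for every leaf. A cache could
// never produce a hit here because the recovery-private IndexWriter never
// exposes the same reader twice.
private static BitSetProducer newParentDocBitSetProducer(Query parentFilter) {
    return context -> {
        IndexSearcher searcher = new IndexSearcher(ReaderUtil.getTopLevelContext(context));
        searcher.setQueryCache(null); // deliberately uncached
        Weight weight = searcher.createWeight(searcher.rewrite(parentFilter), ScoreMode.COMPLETE_NO_SCORES, 1f);
        Scorer scorer = weight.scorer(context);
        return scorer == null ? null : BitSet.of(scorer.iterator(), context.reader().maxDoc());
    };
}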


s1monw (Author):
I left a comment


jimczi:
Ok I missed the recovery-private thing. Thanks


@martijnvg (Member) left a comment

This looks good @s1monw! I left a few questions and small remarks.

@@ -83,53 +104,91 @@ public void testSplitOnRouting() throws IOException {
.numberOfShards(numShards)
.setRoutingNumShards(numShards * 1000000)
.numberOfReplicas(0).build();
boolean hasNested = true || randomBoolean();

martijnvg (Member):
s/true || randomBoolean();/randomBoolean();/


s1monw (Author):
oh yeah 🗡

IndexMetaData metaData = IndexMetaData.builder("test")
.settings(Settings.builder().put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT))
.numberOfShards(numShards)
.setRoutingNumShards(numShards * 1000000)
.numberOfReplicas(0).build();
boolean hasNested = true;randomBoolean();

martijnvg (Member):
same here

assert indexMetaData.isRoutingPartitionedIndex() == false;
findSplitDocs(IdFieldMapper.NAME, includeInShard, leafReader, bitSet::set);
} else {
final BitSet parentBitSet;
final IntPredicate includeDoc;

martijnvg (Member):
So just to double check, the includeDoc is to ensure that only root docs get selected and later we select the nested docs of the selected root docs in markChildDocs(...), right?


s1monw (Author):
correct
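
A toy end-to-end example of that two-step selection (hypothetical doc layout and shard-ownership rule, with the child-marking logic inlined to keep it standalone):

import java.util.BitSet;
import java.util.function.IntPredicate;

public class TwoStepSelectionDemo {
    public static void main(String[] args) {
        // block-join layout: docs 0..2 are children of parent 3,
        // docs 4..5 are children of parent 6
        BitSet parentBitSet = new BitSet();
        parentBitSet.set(3);
        parentBitSet.set(6);
        // pretend the shard being built does not own parent 3
        IntPredicate includeDoc = doc -> parentBitSet.get(doc) && doc == 3;
        BitSet docsToDelete = new BitSet();
        // step 1: select matching root docs only
        for (int doc = 0; doc < 7; doc++) {
            if (includeDoc.test(doc)) {
                docsToDelete.set(doc);
            }
        }
        // step 2: extend each mark to the parent's children, i.e. the docs
        // between the previous parent and the marked one (what markChildDocs does)
        for (int parent = docsToDelete.nextSetBit(0); parent >= 0;
                parent = docsToDelete.nextSetBit(parent + 1)) {
            int prevParent = parent > 0 ? parentBitSet.previousSetBit(parent - 1) : -1;
            docsToDelete.set(prevParent + 1, parent); // children only, parent already set
        }
        // prints delete for docs 0..3 and keep for docs 4..6
        for (int doc = 0; doc < 7; doc++) {
            System.out.println("doc " + doc + (docsToDelete.get(doc) ? " -> delete" : " -> keep"));
        }
    }
}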

* this is used internally to obtain a bitset for parent documents. We don't cache this since we never access the same reader more
* than once
*/
private static BitSetProducer newParentDocBitSetProducer() {

martijnvg (Member):
👍

}

@Override
public boolean matches() throws IOException {

martijnvg (Member):
I was trying to understand why this works, because of the forward iteration here (with nested, we usually seek backwards (BitSet.prevSetBit(...))). So this works because all live doc ids (root docs and nested docs) are evaluated in order.


s1monw (Author):
correct, I will leave a comment
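
To spell the invariant out with a toy example (not the ES code): children always precede their parent, and live doc ids are visited in ascending order, so the parent that decides a doc's fate is always found by scanning forward with nextSetBit; no backwards seek via prevSetBit is needed:

import java.util.BitSet;

public class ForwardIterationDemo {
    public static void main(String[] args) {
        // children 0..2 belong to parent 3, children 4..5 to parent 6
        BitSet parents = new BitSet();
        parents.set(3);
        parents.set(6);
        for (int doc = 0; doc < 7; doc++) {
            // for a parent this returns the doc itself; for a child it returns
            // the first parent at or after it, i.e. its own parent
            int decidingParent = parents.nextSetBit(doc);
            System.out.println("doc " + doc + " is decided by parent " + decidingParent);
        }
    }
}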

@s1monw (Contributor, Author) commented Nov 16, 2017

@jimczi @martijnvg I pushed new commits if you wanna take another look


@martijnvg (Member) left a comment

LGTM

@s1monw merged commit 303e0c0 into elastic:master on Nov 16, 2017
@s1monw deleted the fix_split_on_nested branch on November 16, 2017 at 10:35
s1monw added a commit that referenced this pull request Nov 16, 2017
jasontedor added a commit that referenced this pull request Nov 16, 2017
* master:
  Stop skipping REST test after backport of #27056
  Fix default value of ignore_unavailable for snapshot REST API (#27056)
  Add composite aggregator (#26800)
  Fix `ShardSplittingQuery` to respect nested documents. (#27398)
  [Docs] Restore section about multi-level parent/child relation in parent-join (#27392)
  Add TcpChannel to unify Transport implementations (#27132)
  Add note on plugin distributions in plugins folder
  Remove implementations of `TransportChannel` (#27388)
  Update Google SDK to version 1.23 (#27381)
  Fix Gradle 4.3.1 compatibility for logging (#27382)
  [Test] Change Elasticsearch startup timeout to 120s in packaging tests
  Docs/windows installer (#27369)
@jimczi added the v7.0.0-beta1 label and removed the v7.0.0 label on Feb 7, 2019
Successfully merging this pull request may close: shard splitting doesn't always respect nested docs (#27378)