-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for index sorting with document blocks #12829
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a few potshots ...
lucene/core/src/test/org/apache/lucene/index/TestIndexSorting.java
Outdated
Show resolved
Hide resolved
writer.addDocuments(Arrays.asList(doc, new Document())); | ||
}); | ||
assertEquals( | ||
"only the last document in the block must contain a numeric doc values field named: parent", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will prevent us from handling multiple levels of nested child documents -- with a sort. I guess that would require multiple bitsets so you can do different kinds of joins. Seems kind of involved, but I think it is possible today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what the semantics of this would be. I think we should not violate the semantics of the addDocuments
API that guarantees the insert order being preserved across merges etc. from that perspective I think it's actually good that we don't do that?
Also if you rely on a certain order you should sort it ahead of time or at retrieval time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msokolov I think multiple joins (nested levels) would work, except the user cannot use this parent DV field to record the sub-structure? Another field must be added by the user to record the child -> grandchild blocks. I agree it'd be nice if this parent field would accommodate more than one level of nesting, but maybe that can be a followon improvement.
This field would only handle a single level (the entire block of child and grandchild docs).
In general, I like the idea of making block joins more of a first-class citizen. I have been thinking for a long time about changing how blocks are identified from using bitsets to using a doc-value field where only parents documents have a value for the field, and the value must be the number of child documents that the parent has. This PR would like to identify blocks with a doc-values field as well, which is interesting. This makes me wonder if Lucene could be adding this doc value field for root documents automatically, similarly to what it is doing for soft deletes? (Possibly in a follow-up PR) |
I like @jpountz's idea to make the value of this field be the number of children. It is simple and makes sense, and is pretty close to having the degree of flexibility that the current setup has. I'm a little worried about giving up functionality, but I think if we had a list of parent-fields rather than a single parent-field that would cover what we can do today? Maybe one is enough - you can still do hierarchies with one field, but today's API actually let's you define multiple overlapping relations. |
I don't think we give up any functionality. can you elaborate what functionality you are referring to? I don't think we should have a list of parent fields that IW requires, what would that list be used for? I also don't understand how the APU lets you define overlapping relations that this change would prevent you from? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @s1monw -- I made a first pass review!
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java
Outdated
Show resolved
Hide resolved
writer.addDocuments(Arrays.asList(doc, new Document())); | ||
}); | ||
assertEquals( | ||
"only the last document in the block must contain a numeric doc values field named: parent", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@msokolov I think multiple joins (nested levels) would work, except the user cannot use this parent DV field to record the sub-structure? Another field must be added by the user to record the child -> grandchild blocks. I agree it'd be nice if this parent field would accommodate more than one level of nesting, but maybe that can be a followon improvement.
This field would only handle a single level (the entire block of child and grandchild docs).
* it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there | ||
* is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. | ||
* | ||
* @param parentField the name of a numeric doc values field that marks the last document of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make it clearer that you only use this for index-time sorting? It has no meaning at search time (and should not be set)?
Or ... does it have search-time meaning? Can I use this to retrieve whole blocks in my search hits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well I guess we could do something with that field at search time. Yet, at this point it's indexing time only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not change Sort
if possible. It's weird that it needs to know about a parentField
, and IW should take care during static sort time (flush, merge) to validate that what it (IWC) knows as the parent field is consistent with the index-time sort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikemccand do you mean we should rather make this a property of IWC rather than sort? That would be no issue. I used sort here since it's the only thing that is using this?
* document blocks indexed with {@link | ||
* org.apache.lucene.index.IndexWriter#addDocuments(Iterable)} or it's update relatives. This | ||
* is required for indices that use index sorting in combination with document blocks in order | ||
* to maintain the document order of the blocks documents. Index sorting will effectively |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the Sort is otherwise free to break up the block? I.e. this parentField
will override sort fields that might attempt to sort child docs away from one another?
This is a neat idea too -- would this maybe make completing the join at search time more efficient? Today we scan for the next set bit, but I guess with this approach, we'd still scan for the next
This is a nice idea too, but I think we'd still need user input to give us a free field name to use for this purpose. |
I think we have always been able to model a multi-level hierarchy. Suppose I index a book broken into documents representing paragraphs grouped into sections, chapters, and parts; so it's a five-level hierarchy. There can be documents at every level and we want to be able to do roll-ups to any level. If I index a book as a block, we support this today. I might be confused, but I didn't think that would work with this arrangement. However maybe it does? I guess one describes every component document as having the book as a parent, and then later can still use a different parent bitset to do the roll-ups to other levels? |
@mikemccand @jpountz thanks for your ideas. I'd love to flash this out more before we add anything we write to the index. Today we'd only use this for sorting but if that field can be useful to speed up or improve searching I am happy to extend this further. @msokolov there is no reduction of features here. you can do all the things you have done before except of re-sorting the do cuments within a block. This is a API guarantee that we try to fix here. You can still add another field to each doc specifying which hierarchy level it belongs to and then do |
@s1monw that makes sense. I think I was confusing index-time changes and query-time changes. This whole piece of functionality is a little confusing given how loosely coupled these things are! |
@mikemccand @jpountz I went down a hybrid way. This change now only requires the user to specify the field name and IW adds it to the last doc in the block if it's configured. I will for sure add some more tests to this once we settled on the approach. The only think I am torn on is, if we set the num of children as a value for the DV field then I guess we should have a good usecase for that? |
Thanks @s1monw -- I'll try to review soon.
One thing this would be helpful for is stronger consistency check in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phew, I made a first pass review! Brain hurts now ...
Net/net I think I like this approach, but I'm worried about performance impact during indexing (I can help test this), and about changing Sort
at all to know about parent fields. I left lots of other small comments ... thanks @s1monw!
* it produces a tie, then the second SortField is used to break the tie, etc. Finally, if there | ||
* is still a tie after all SortFields are checked, the internal Lucene docid is used to break it. | ||
* | ||
* @param parentField the name of a numeric doc values field that marks the last document of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not change Sort
if possible. It's weird that it needs to know about a parentField
, and IW should take care during static sort time (flush, merge) to validate that what it (IWC) knows as the parent field is consistent with the index-time sort?
lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java
Outdated
Show resolved
Hide resolved
...ne/backward-codecs/src/test/org/apache/lucene/backward_index/TestBackwardsCompatibility.java
Show resolved
Hide resolved
.setIndexSort(sort) | ||
.setMergePolicy(newLogMergePolicy()))) { | ||
// add 10 docs | ||
for (int i = 0; i < 10; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is testing indexing doc blocks into an old (pre-10.0) index right? In this case does IW still add the DV field to the parent doc? The test here seems not to be configuring a parent field, so I guess no? Only indices created 10.0+ will do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, unless we 1. backport this which is possible and 2. if you set a parent field in the old index. yet I can't fully test that just yet since it's not backported. makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this makes sense -- if you have a pre-10.0 index, then none of the new code is run. You must create a new index to get the stronger checking.
(Or, 9.10.x index if we backport).
lucene/core/src/java/org/apache/lucene/index/DocumentsWriterPerThread.java
Show resolved
Hide resolved
Function<IndexSorter.DocComparator, IndexSorter.DocComparator> comparatorWrapper = in -> in; | ||
|
||
if (state.segmentInfo.getHasBlocks() && indexSort.getParentField() != null) { | ||
final DocIdSetIterator readerValues = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check that this is non-null and throw CorruptIndexException
or so if it is null?
lucene/core/src/java/org/apache/lucene/index/IndexingChain.java
Outdated
Show resolved
Hide resolved
|
||
/** | ||
* Wraps the given field in a reserved field and registers it as reserved. Only DWPT should do | ||
* this to mark fields as private / reserved to prevent this fieldname to be used from the outside |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get that we are recording the name so it cannot be created with a different type, and indeed prevent it from being added to a document except by IndexingChain -- are there any other restrictions on the use of the field? Could we clear up the javadoc to be more explicit? I think eg I can still get a DV iterator over this field and read its values from "outside". Maybe we could say "prevent this fieldname from being used when writing documents ..."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to tighten up the language.
@mikemccand I agree we should not add this to sort but rather tread it the same way we treat the softDeletes field. it's essentially the same thing from an IW perspective. I will go ahead an make that change. |
One small observation here: one can use the I'm not sure what performance difference this makes, but it should only help speed up indexing throughput. E.g. luceneserver (the search engine behind jirasearch and soon now githubsearch) does this in its bulk indexing API, I think. This saves the IW/DWPT overhead, but comes with some risk if your blocks are too big since IW cannot flush until the block is done. With this change, I think such usage would still be fine if you have no index sort? But if you have an index sort, then IW will always do the parent block validation / tracking? I think net/net this is fine, I just wanted to call it out, for crazy people trying to eek out every last bit of indexing throughput :) |
I opened mikemccand/luceneutil#252 to try to measure the performance change of |
"cannot configure [" | ||
+ parentFieldName | ||
+ "] as parent document field ; this index uses [" | ||
+ fieldName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I am still confused here -- this looks backwards to me (fieldName
and parentFieldName
should be swapped)?
I.e. parentFieldName
was derived from the index as the already established parent field, and now a newly indexed document / current IWC is attempting to set fieldName
as the parent field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is confusing from it's name. The parentFieldName
comes from IWC and fieldname is the incoming field. I think it's right the way it is, maybe we need to do some rewording altogether but when you look at the tests I think the error message makes sense?
I would like to find a way that users don't have to specify these fieldnames altogether but that should be a follow up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to find a way that users don't have to specify these fieldnames altogether but that should be a follow up
Yeah +1 to defer.
Phew, I fired up a GitHub workspace to spelunk the code/PR better! I think (maybe?) I understand what's going on and my confusion! I need more caffeine now ...
I think this code is supposed to only catch the addIndexes
mismatch case? Because, by design, if user is indexing documents, they cannot tickle this code -- the parentFieldName
(from IWC) and the newly born FieldInfo
(in-memory DWPT segment) will also use IWC's configured parent field, so these ifs should never trigger from simply indexing documents?
Maybe we can rename parentFieldName
to iwcParentFieldName
making its provenance clear? Then, by contrast, it's more clear that fieldName
is the new foreign entity that must be validated against IWC's config.
And then I find this index
super confusing -- in the test case it seems to refer to the incoming index (via addIndexes
) and not to the currently open index that your IndexWriter
is pointing to?
How about changing the wording for all of these exceptions to something like the current index is configured with parentName=X; cannot add external index with parentName=Y
or so?
throw new IllegalArgumentException( | ||
"this index has [" | ||
+ fieldName | ||
+ "] as parent document field already but parent document field is not configured in IWC"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here? parentFieldName
was derived from the segments in the current index (not set), and now a document (current IWC) is trying to set fieldName
as the parent field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phew! I'm tired :)
}) | ||
.getMessage(); | ||
assertEquals( | ||
"cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See I think this exception message is backwards :) Shouldn't it be:
cannot configure [foobar] as parent document field ; this index uses [foo] as parent document field already
I.e. "this index" is referring to dir2
/w2
(the index into which we are adding an external index), and it is configured to use foo
as the parent field, while the incoming index via addIndexes
(dir1
) is attempting (illegally) to change that to foobar
?
Edit: OK I think this all comes down to this index
being super confusing to me :) If it's referring to the incoming (newly added) index via addIndexes, the message makes sense. But I would think this index
means the one you opened with IW...
if (info.isParentField()) { | ||
if (parentField != null && parentField.equals(info.name) == false) { | ||
throw new IllegalArgumentException( | ||
"multiple parent fields [" + info.name + ", " + parentField + "]"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we improve this wording @s1monw ^^? It is confusing because it does not actually say specifically what is wrong (that you must have only one parent field).
@@ -78,6 +80,7 @@ public FieldInfos(FieldInfo[] infos) { | |||
boolean hasPointValues = false; | |||
boolean hasVectorValues = false; | |||
String softDeletesField = null; | |||
String parentField = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe fix these silly pointless initializers ^^?
This FieldInfos.java
is some of the oldest, most "pristine"/"untouched" code in Lucene ;) You can almost count the tree rings ...
}) | ||
.getMessage(); | ||
assertEquals( | ||
"cannot configure [foo] as parent document field ; this index uses [foobar] as parent document field already", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here?
"cannot configure [" | ||
+ parentFieldName | ||
+ "] as parent document field ; this index uses [" | ||
+ fieldName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to find a way that users don't have to specify these fieldnames altogether but that should be a follow up
Yeah +1 to defer.
Phew, I fired up a GitHub workspace to spelunk the code/PR better! I think (maybe?) I understand what's going on and my confusion! I need more caffeine now ...
I think this code is supposed to only catch the addIndexes
mismatch case? Because, by design, if user is indexing documents, they cannot tickle this code -- the parentFieldName
(from IWC) and the newly born FieldInfo
(in-memory DWPT segment) will also use IWC's configured parent field, so these ifs should never trigger from simply indexing documents?
Maybe we can rename parentFieldName
to iwcParentFieldName
making its provenance clear? Then, by contrast, it's more clear that fieldName
is the new foreign entity that must be validated against IWC's config.
And then I find this index
super confusing -- in the test case it seems to refer to the incoming index (via addIndexes
) and not to the currently open index that your IndexWriter
is pointing to?
How about changing the wording for all of these exceptions to something like the current index is configured with parentName=X; cannot add external index with parentName=Y
or so?
@mikemccand I did another pass on it and change the wording. I think I know why you are confused and I tired to adress it. We use the exact same wording in the soft deletes case. I will work on this in a sep. issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay, thanks @s1monw -- looks great!!
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents. Relates to apache#12711
A parent field can now safely be added to indices that are created on or after 9.9.x that have no doc blocks as well as newly created indices. Adding a parent field to older indices is not permitted. Relates to apache#12829
This change prevents users from adding a parent field to an existing index. Parent field must be added before any documents are added to the index to prevent documents without the parent field from being indexed and later to be treated as child documents upon merge. Relates to apache#12829
This change prevents users from adding a parent field to an existing index. Parent field must be added before any documents are added to the index to prevent documents without the parent field from being indexed and later to be treated as child documents upon merge. Relates to #12829
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents. Relates to apache#12711
This change prevents users from adding a parent field to an existing index. Parent field must be added before any documents are added to the index to prevent documents without the parent field from being indexed and later to be treated as child documents upon merge. Relates to apache#12829
Today index sorting will most likely break document blocks added with `IndexWriter#addDocuments(...)` and `#updateDocuments(...)` since the index sorter has no indication of what documents are part of a block. This change automatically adds a marker field to parent documents if configured in `IWC`. These marker documents are optional unless document blocks are indexed and index sorting is configured. In this case indexing blocks will fail unless a parent field is configured. Index sorting will preserve document blocks during sort. Documents within a block not be reordered by the sorting algorithm and will sort along side their parent documents. Relates to #12711
This change prevents users from adding a parent field to an existing index. Parent field must be added before any documents are added to the index to prevent documents without the parent field from being indexed and later to be treated as child documents upon merge. Relates to #12829
This PR piggy-backs on recent changes in Lucene 9.11.1 (apache/lucene#12829, apache/lucene#13341), setting the parent doc when nested fields are present. This allows moving nested documents along with parent ones during sorting. With this change, sorting is now allowed on fields outside nested objects. Sorting on fields within nested objects is still not supported (throws an exception). Fixes #107349
Today index sorting with likely break document blocks added with
IndexWriter#addDocuments(...)
and friends since the index sorter has no indication of what documents are part of a block. This change proposes a marker field as a requirement for parent documents if the block API is used in conjunction with index sorting.At this point this change requires a
NumericDV
field only present on the last document of a block or on every document indexed as an individual document iff a parent field is configured on the Sort. This can potentially be extended to aTerm
which is not as straight forward since validations might be more difficult and today postings are not available when we sort the flushed segment.Relates to #12711