Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ghost fields and postings/points [LUCENE-10357] #11393

Closed
asfimport opened this issue Jan 5, 2022 · 4 comments
Closed

Ghost fields and postings/points [LUCENE-10357] #11393

asfimport opened this issue Jan 5, 2022 · 4 comments

Comments

@asfimport
Copy link

On doc values and norms, we require that codec APIs return non-null empty instances on fields that have the feature turned on on their FieldInfo, even if all values have been merged away.

However postings and points have the choice: they may either return empty instances or null. See e.g. BasePostingsFormatTestCase#testGhosts for instance.

I fear that this could be a source of bugs, as a caller could be tempted to assume that he would get non-null terms on a FieldInfo that has IndexOptions that are not NONE. Should we introduce a contract that FieldsProducer (resp. PointsReader) must return a non-null instance when postings (resp. points) are indexed?


Migrated from LUCENE-10357 by Adrien Grand (@jpountz), updated Jul 13 2022
Pull requests: #907

@javanna
Copy link
Contributor

javanna commented Aug 26, 2022

I've just seen an occurrence of this, FieldExistsQuery#count checks if getPointDimensionCount is greather than 0, and it may happen that such condition is true and the following getPointValues throws a NullPointerException.

@javanna
Copy link
Contributor

javanna commented Aug 31, 2022

@jpountz I am looking into fixing the points issue that I mentioned above. We could for instance ensure that Lucene90PointsReader never returns null and instead return an empty instance. On the other hand it looks like many callers of this method already check for null, and I see that getValues also throws exception in case FieldInfo#getPointDimensionCount is 0, which means that callers can't blindly call getValues without consulting FieldInfo first. I wonder then, should we fix FieldExistsQuery to handle the null return value, or adapt all other existing callers to not handle null? I think the former is quicker, the latter is cleaner, especially if like you said above "codec APIs are required to return non-null empty instances on fields that have the feature turned on on their FieldInfo, even if all values have been merged away."

@jpountz
Copy link
Contributor

jpountz commented Sep 1, 2022

I see that getValues also throws exception in case FieldInfo#getPointDimensionCount is 0, which means that callers can't blindly call getValues without consulting FieldInfo first

It's a bit more complicated than that. Callers indeed cannot call PointsReader#getValues blindly, which is a codec API that should only be called on fields that have points enabled. However, callers can call LeafReader#getPointValues blindly, the user-facing API, which internally checks whether the field is indexed with points to know whether it should forward to the PointsReader#getValues codec API or return null. Queries are expected to always interact with points through LeafReader#getPointValues, not PointsReader#getValues. If we changed PointsReader#getValues to never return null, LeafReader#getPointValues would still return null on fields that do not exist or that do not have points enabled.

jpountz added a commit to jpountz/lucene that referenced this issue Sep 20, 2022
Introduction of dynamic pruning for string sorts (apache#11669) introduced a bug with
string sorts and ghost fields, triggering a `NullPointerException` because the
code assumes that `LeafReader#terms` is not null if the field is indexed
according to field infos.

This commit fixes the issue and adds tests for ghost fields across all sort
types.

Hopefully we can simplify and remove the null check in the future when we
improve handling of ghost fields (apache#11393).
javanna added a commit to javanna/lucene that referenced this issue Sep 20, 2022
getPointValues may currently return null for unknown fields or fields that don't index points.
It can happen that a field no longer has points for any document in a segment after delete+merge, which
causes field info to think that the field is there and has points, yet when calling getPointValues null
is returned.

With this change, we prevent getPointValues from returning null for ghost fields, it will instead return an
empty instance of PointValues.

Relates to apache#11393
javanna added a commit to javanna/lucene that referenced this issue Sep 20, 2022
getPointValues may currently return null for unknown fields or fields that don't index points.
It can happen that a field no longer has points for any document in a segment after delete+merge, which
causes field info to think that the field is there and has points, yet when calling getPointValues null
is returned.

With this change, we prevent getPointValues from returning null for ghost fields, it will instead return an
empty instance of PointValues.

Relates to apache#11393
javanna added a commit to javanna/lucene that referenced this issue Sep 20, 2022
FieldExistsQuery checks if there are points for a certain field, and then retrieves the
corresponding point values. When all documents that had points for a certain field have
been deleted from a certain segments, as well as merged away, field info may report
that there are points yet the corresponding point values are null.

With this change we add a null check in FieldExistsQuery. Long term, we will likely want
to prevent this situation from happening.

Relates apache#11393
javanna added a commit to javanna/lucene that referenced this issue Sep 20, 2022
FieldExistsQuery checks if there are points for a certain field, and then retrieves the
corresponding point values. When all documents that had points for a certain field have
been deleted from a certain segments, as well as merged away, field info may report
that there are points yet the corresponding point values are null.

With this change we add a null check in FieldExistsQuery. Long term, we will likely want
to prevent this situation from happening.

Relates apache#11393
jpountz added a commit that referenced this issue Sep 20, 2022
Introduction of dynamic pruning for string sorts (#11669) introduced a bug with
string sorts and ghost fields, triggering a `NullPointerException` because the
code assumes that `LeafReader#terms` is not null if the field is indexed
according to field infos.

This commit fixes the issue and adds tests for ghost fields across all sort
types.

Hopefully we can simplify and remove the null check in the future when we
improve handling of ghost fields (#11393).
jpountz added a commit that referenced this issue Sep 20, 2022
Introduction of dynamic pruning for string sorts (#11669) introduced a bug with
string sorts and ghost fields, triggering a `NullPointerException` because the
code assumes that `LeafReader#terms` is not null if the field is indexed
according to field infos.

This commit fixes the issue and adds tests for ghost fields across all sort
types.

Hopefully we can simplify and remove the null check in the future when we
improve handling of ghost fields (#11393).
jpountz added a commit that referenced this issue Sep 20, 2022
Introduction of dynamic pruning for string sorts (#11669) introduced a bug with
string sorts and ghost fields, triggering a `NullPointerException` because the
code assumes that `LeafReader#terms` is not null if the field is indexed
according to field infos.

This commit fixes the issue and adds tests for ghost fields across all sort
types.

Hopefully we can simplify and remove the null check in the future when we
improve handling of ghost fields (#11393).
jpountz pushed a commit that referenced this issue Sep 20, 2022
FieldExistsQuery checks if there are points for a certain field, and then retrieves the
corresponding point values. When all documents that had points for a certain field have
been deleted from a certain segments, as well as merged away, field info may report
that there are points yet the corresponding point values are null.

With this change we add a null check in FieldExistsQuery. Long term, we will likely want
to prevent this situation from happening.

Relates #11393
jpountz pushed a commit that referenced this issue Sep 20, 2022
FieldExistsQuery checks if there are points for a certain field, and then retrieves the
corresponding point values. When all documents that had points for a certain field have
been deleted from a certain segments, as well as merged away, field info may report
that there are points yet the corresponding point values are null.

With this change we add a null check in FieldExistsQuery. Long term, we will likely want
to prevent this situation from happening.

Relates #11393
jpountz pushed a commit that referenced this issue Sep 20, 2022
FieldExistsQuery checks if there are points for a certain field, and then retrieves the
corresponding point values. When all documents that had points for a certain field have
been deleted from a certain segments, as well as merged away, field info may report
that there are points yet the corresponding point values are null.

With this change we add a null check in FieldExistsQuery. Long term, we will likely want
to prevent this situation from happening.

Relates #11393
@jpountz
Copy link
Contributor

jpountz commented Nov 10, 2022

I had hoped that getting rid of ghost fields would automatically help avoid some bugs but after looking into it for both postings and points (thanks @javanna and @shahrs87 !) it looks like it's not actually possible to make things significantly better. Hence closing this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants