Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUCENE-10428: Avoid infinite loop under error conditions. #711

Merged
merged 6 commits into from
Mar 3, 2022

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Feb 24, 2022

No description provided.

@dblock
Copy link
Contributor

dblock commented Mar 1, 2022

Definitely needs tests.

What can we do to get this change into the next version of Lucene? I am worried that we still don't have a consistent way to reproduce this issue.

@jpountz
Copy link
Contributor Author

jpountz commented Mar 1, 2022

I was hoping for a 👍 from Ankit before merging.

Regarding tests, I was not able to identify the case that would break the existing logic, so I'm not sure what new test to add. There are existing tests for this class in TestMaxScoreSumPropagator.

@dblock
Copy link
Contributor

dblock commented Mar 1, 2022

@jpountz I'd write a test that uses invalid input to do more than 2 iterations and expect IllegalStateException with the right values/description. Otherwise no test confirms that it can't loop infinitely.

@dblock
Copy link
Contributor

dblock commented Mar 1, 2022

@jainankitk
Copy link
Contributor

@jpountz I'd write a test that uses invalid input to do more than 2 iterations and expect IllegalStateException with the right values/description. Otherwise no test confirms that it can't loop infinitely.

@dblock I guess @jpountz's is trying to say that we don't have an existing case (set of inputs / repro) which runs for more than 2 iterations. But, I agree with @dblock's point as well to test this code somehow.

Should we parameterize the number of iterations parameter and use it as test hook for getting the method to throw IllegalArgumentException? Also, does this exception reflect as 4xx or 5xx on the client side?

@vigyasharma
Copy link
Contributor

@jpountz Out of curiosity, what is the implication of returning whatever minScore get's computed after 2 iterations? Like we do a best case effort to converge, but return with whatever we get after n attempts?
I guess it could be bad if the function doesn't keep converging with every iteration, and stopping abruptly at n iters can land in some random, divergent values.. I'm not sure how this works.

Also, do you have any pointers on what user should look at once she gets this exception? I understand it's too early as we're not able to repro the issue, so is the plan to keep a jira open to collate more data around how frequently people run into this exception?

// Important: use ulp of minScoreSum and not minScore to make sure that we
// converge quickly.
minScore -= Math.ulp(minScoreSum);
// this should converge in at most two iterations:
// - one because of the subtraction rounding error
// - one because of the error introduced by sumUpperBound
assert ++iters <= 2 : iters;
if (iter > 2) {
throw new IllegalStateException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also publish the last computed value of minScore here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included this part of jpountz#2.

@vigyasharma
Copy link
Contributor

Should we parameterize the number of iterations parameter and use it as test hook for getting the method to throw IllegalArgumentException?

I'm not sure how useful that is as a test. Would also avoid parameterizing a value (maxIterations) that is not supposed to change across different callers. But I like the idea of collating data from users who run into this exception (via the params printed in stacktrace). Maybe add a ToDo to test/repro/fix the edge case later?

@dblock
Copy link
Contributor

dblock commented Mar 1, 2022

Should we parameterize the number of iterations parameter and use it as test hook for getting the method to throw IllegalArgumentException?

I'm not sure how useful that is as a test.

I won't argue on this specific approach, but I have hard time accepting that we can fix a real bug, and write no new tests.

@msokolov
Copy link
Contributor

msokolov commented Mar 2, 2022

I won't argue on this specific approach, but I have hard time accepting that we can fix a real bug, and write no new tests.

Please feel free to contribute one, this is open source after all. Nobody is here to do your bidding.

@jpountz
Copy link
Contributor Author

jpountz commented Mar 2, 2022

Out of curiosity, what is the implication of returning whatever minScore get's computed after 2 iterations? Like we do a best case effort to converge, but return with whatever we get after n attempts?

@vigyasharma This might cause Lucene to miss some top hits. The point of this code is to propagate minimum competitive scores to child clauses. For instance imagine that you have a disjunction with 3 clauses, whose max scores are 1, 2 and 3. If the collector sets a min competitive score of 5.5 on the disjunction at some point, then we can deduce that the first clause must have a score that it at least 5.5-2-3=0.5, the second clause must have a score that is at least 5.5-1-3=1.5 and the 3rd clause must have a score that is at least 5.5-1-2=2.5. This is very simple in principle but the fact that operations on floating-point numbers produce rounded values makes it a bit tricky, hence the current logic. The risk with your proposal is that we would return a score that is too high, e.g. 0.6 on the first clause instead of 0.5, which would in-turn make Lucene skip hits whose score is between 0.5 and 0.6 on the first clause even though some of them might be top hits. The only "safe" return value would be 0, which would in-turn disable skipping on the child clause.

@jpountz
Copy link
Contributor Author

jpountz commented Mar 2, 2022

@dblock Yeah I'm unhappy about this change too. FWIW this isn't really a bug fix, this mostly turns a potential bug that we haven't managed to understand into a different bug that has less severe implications and would make the root cause easier to identify.

@dblock
Copy link
Contributor

dblock commented Mar 2, 2022

I think we have a case of floating point here.

If you change getMinCompetitiveScore to take a double or minScoreSum, it becomes easy to reproduce. It also almost always starts tripping testNClausesRandomScore and often test2ClausesRandomScore.

private float getMinCompetitiveScore(double minScoreSum, double sumOfOtherMaxScores)
  public void testGetMinCompetitiveScore() throws Exception {   

    List<FakeScorer> scorers = new ArrayList<>();
    scorers.add(new FakeScorer(0.39404958f));
    scorers.add(new FakeScorer(0.0028006434f));
    scorers.add(new FakeScorer(0.24155897f));

    float minScoreSum = 2.7810444831848145f;
    MaxScoreSumPropagator p = new MaxScoreSumPropagator(scorers);
    p.setMinCompetitiveScore(minScoreSum);
  }
org.apache.lucene.search.TestMaxScoreSumPropagator > testGetMinCompetitiveScore STANDARD_ERROR
    minScoreSum=2.7810442447662354, sumOfOtherMaxScores=0.24435961246490479
    minScoreSum=2.7810442447662354, sumOfOtherMaxScores=0.39685022830963135
    minScoreSum=2.7810442447662354, sumOfOtherMaxScores=0.6356085538864136

org.apache.lucene.search.TestMaxScoreSumPropagator > testGetMinCompetitiveScore FAILED
    java.lang.AssertionError: 3
        at __randomizedtesting.SeedInfo.seed([4705BBC182B572C6:C8E41A539F61A77B]:0)
        at org.apache.lucene.search.MaxScoreSumPropagator.getMinCompetitiveScore(MaxScoreSumPropagator.java:160)
        at org.apache.lucene.search.MaxScoreSumPropagator.setMinCompetitiveScore(MaxScoreSumPropagator.java:121)
        at org.apache.lucene.search.TestMaxScoreSumPropagator.testGetMinCompetitiveScore(TestMaxScoreSumPropagator.java:244)

I wasn't able to trip it with a float minScoreSum though with the values in the issue, but it has to exist, given that it happened in production. Or there's something different about that setup than my linux test machine? JVM 11.0.12+7-LTS vs. 17? I can't find anything conclusive. This works:

org.apache.lucene.search.TestMaxScoreSumPropagator > testGetMinCompetitiveScore STANDARD_ERROR
    minScoreSum=2.7810442, sumOfOtherMaxScores=0.24435961246490479
    minScoreSum=2.7810442, sumOfOtherMaxScores=0.39685022830963135
    minScoreSum=2.7810442, sumOfOtherMaxScores=0.6356085538864136

It looks like converging in getMinCompetitiveScore relies on truncating (rounding?) doubles to floats, so this looks suspicious to me.

Any ideas of what we can do next? I will gladly write a unit test on top of the proposed changes in this PR if I can get to reproduce it.

@jpountz
Copy link
Contributor Author

jpountz commented Mar 2, 2022

It looks like converging in getMinCompetitiveScore relies on truncating (rounding?) doubles to floats, so this looks suspicious to me.

I certainly wish it was simpler, but it's done this way on purpose. The way that boolean queries combine scores of their sub clauses together is by summing up the float scores into a double and finally casting back that double back to a float. So this method undoes this operation by taking the float score sum and subtracting the sum of other scores as a double as a first approximation of the minimum score that this clause must have.

If you change getMinCompetitiveScore to take a double or minScoreSum, it becomes easy to reproduce.

That's expected. If minScoreSum is a double then we end up subtracting ulps of a double from a float, which doesn't make much sense.

it has to exist, given that it happened in production.

Right, it's bad. My current thinking is that there are 3 main options:

  • A bug in this code,
  • A bug in other code that breaks assumptions that this code is making, e.g. a query that produces negative or NaN scores.
  • A problem in something external like a JVM bug or bad RAM.

@dblock
Copy link
Contributor

dblock commented Mar 2, 2022

Thank you for the detailed explanation, this makes a lot of sense.

  • A bug in other code that breaks assumptions that this code is making, e.g. a query that produces negative or NaN scores.

This gave me the idea that we can use negative values for a test at least. I PRed jpountz#2 into your branch: it reproduces this issue with negative scores, and can now validate the new expectation of an IllegalStateException being thrown.

@dblock
Copy link
Contributor

dblock commented Mar 2, 2022

I won't argue on this specific approach, but I have hard time accepting that we can fix a real bug, and write no new tests.

Please feel free to contribute one, this is open source after all. Nobody is here to do your bidding.

jpountz#2

@jainankitk
Copy link
Contributor

jainankitk commented Mar 2, 2022

This gave me the idea that we can use negative values for a test at least. I PRed jpountz#2 into your branch: it reproduces this issue with negative scores, and can now validate the new expectation of an IllegalStateException being thrown.

Thats great! and much better than parameterizing for testing IllegalStateException. I have reviewed and approved the changes for jpountz#2

Copy link
Contributor

@jainankitk jainankitk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This combined with changes from @dblock in jpountz#2 is good to go!

@jpountz jpountz merged commit 44a2a82 into apache:main Mar 3, 2022
@jpountz jpountz deleted the lucene10428 branch March 3, 2022 08:42
@jpountz
Copy link
Contributor Author

jpountz commented Mar 3, 2022

Thanks all for the feedback and contributions!

jpountz added a commit that referenced this pull request Mar 3, 2022
@dblock
Copy link
Contributor

dblock commented Mar 3, 2022

Thanks for fixing the issue @jpountz and everyone for your support and ideas!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants