Made DocIdsWriter use DISI when reading documents with an IntersectVisitor #13149

antonha · 2024-03-01T09:54:26Z

Instead of calling IntersectVisitor.visit for each doc in the readDelta16 and readInts32 methods, create a DocIdSetIterator and call IntersectVisitor.visit(DocIdSetIterator) instead.

This seems to make Lucene faster at some sorting and range querying tasks - I saw 35-45% reduction in execution time. In learnt this through running this benchmark setup by Elastic: https://github.com/elastic/elasticsearch-opensearch-benchmark.

The hypothesis is that this is due to fewer virtual calls being made - once per BKD leaf, instead of once per document. Note that this is only measurable if the readInts methods have been called with at least 3 implementation of the IntersectVisitor interface - otherwise the JIT inlining takes away the virtual call. In real life Lucene deployments, I would judge that it is very likely that at least 3 implementations are used. For more details on method etc, there are details in this blog post: https://blunders.io/posts/es-benchmark-4-inlining

I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutils benchmarks should I be using for these changes?

mikemccand

This looks like a great optimization! I just had one small concern about advance.

mikemccand · 2024-03-06T16:50:59Z

lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java

+
+    @Override
+    public int advance(int target) throws IOException {
+      while (index < count && scratch[index] < target) {


Do we even need to implement this method? Can we throw UOE instead? It's a bit scary that it's O(N), though, in practice, scratch is a smallish buffer?

Maybe we don't need to - but I don't see why this would be scarier than advancing using the nextDoc method? But I might be missing something - if you'd like that change, I'll make it. It might be that a DISI is not the right interface for visiting docids within the BKD tree - but I'm guessing changing that would be larger and more controversial.

For large segments the buffer size would almost always be 512 - but there would be a lot of buffers.

I tried changing the advance method to just throw an throw new UnsupportedOperationException() - and tests do pass - I can commit that if you prefer that. I guess we should add some kind of comment of why it is unsupported though?

It might be that a DISI is not the right interface for visiting docids within the BKD tree - but I'm guessing changing that would be larger and more controversial.

Yeah indeed DISI may not be right, but let's not try to fix that here/now? Perhaps there are use cases that might want to call .advance on the DISI? Maybe some sort of curious combined filtering / points query?

OK you convinced me!: let's leave the method as is. You're right that if such a use case existed today, it is suffering through the linear scan of .visit calls already, so this is no worse.

Yeah indeed DISI may not be right, but let's not try to fix that here/now?
Agreed, but I'm tempted to experiment with it in another PR :)

gautamworah96 · 2024-03-06T18:05:09Z

lucene/core/src/java/org/apache/lucene/util/bkd/DocIdsWriter.java

-      visitor.visit(scratch[i]);
-    }
+    scratchDocIdSetIterator.reset(count);
+    visitor.visit(scratchDocIdSetIterator);


@antonha Maybe add a small comment here explaining that we found using a DISI was much faster than visting docs individually?

People usually attach the PR id to the comment in case someone wants to dig deeper. Many a times, I have learnt about why something was implemented a certain way through these comments.

There is a comment higher up in the file, next to the ScratchDocIdSetIterator. Would moving that down here and referring to this PR make things better?

That comment is good enough. I had not noticed it I guess..

mikemccand

@antonha could you add a CHANGES.txt entry explaining the optimization? Thanks!

antonha · 2024-03-09T07:39:49Z

Thanks for the approval!

I do however think that it would be good to prove this performance improvement in luceneutil before merging, to make the benchmark more easily reproducible than the benchmarking I have done. I have started looking into it, but will have more time for it next week. Does that sound reasonable? The only downside I see is a later merge of this PR.

@antonha could you add a CHANGES.txt entry explaining the optimization? Thanks!

Yes, will do once I have run some luceneutil benchmarks (if we want to wait for that)

mikemccand · 2024-03-10T21:32:37Z

I tried benchmarking this with luceneutil, but did not see any significant change with the default benchmark - I suspect that I'm using the wrong luceneutil tasks to see any major difference. Which luceneutils benchmarks should I be using for these changes?

Confusingly named, the IntsNRQ task is the only task using points, I think. It runs simple 1D range queries.

The geospatial benchmarks (a different benchmark than the normal/nightly luceneutil) use multi-dimensional points.

Still, given that you saw gains in the "OpenSearch vs Elasticsearch" benchmarks, even if the results are flat with the existing luceneutil benchmarks, I think we should just merge the change. The night after we can watch Lucene's nightly benchmarks and see if the nightly box measured anything.

(Hmm, curiously/separately, it looks like something caused a jump in some geo tasks' performance e.g. distance filter ... I'll try to find the cause and add an annotation!).

antonha · 2024-03-13T06:41:46Z

I spent some time "proving" this is luceneutil - luceneutil/pull/257 adds a reproduction - if run with wikimediumall, optimize = True for indexing and commitPoint = 'single'.

The reason that the larger segment is needed is that Lucene otherwise chooses to store document ids in the BKD leaves as int24, meaning that the optimization in this PR does nothing. With the luceneutil changes, I get the following output when comparing this PR to master:

                           TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                    OrHighNotLow      527.76      (7.9%)      516.38      (8.0%)   -2.2% ( -16% -   14%) 0.392
                    OrHighNotMed      523.37      (7.3%)      512.63      (7.5%)   -2.1% ( -15% -   13%) 0.384
                   OrHighNotHigh      418.72      (6.4%)      411.39      (6.3%)   -1.8% ( -13% -   11%) 0.384
           BrowseMonthTaxoFacets       11.50     (22.5%)       11.32     (26.2%)   -1.6% ( -41% -   60%) 0.839
                   OrNotHighHigh      419.16      (5.4%)      412.88      (5.4%)   -1.5% ( -11% -    9%) 0.381
                     AndHighHigh       43.95      (5.4%)       43.42      (3.8%)   -1.2% (  -9% -    8%) 0.413
            HighIntervalsOrdered        4.60      (5.6%)        4.55      (6.0%)   -1.1% ( -12% -   11%) 0.561
               HighTermMonthSort    13015.73      (3.3%)    12889.39      (4.0%)   -1.0% (  -7% -    6%) 0.401
                          Fuzzy1      285.37      (3.3%)      282.87      (2.2%)   -0.9% (  -6% -    4%) 0.320
        AndHighHighDayTaxoFacets        9.00      (4.6%)        8.92      (4.8%)   -0.8% (  -9% -    8%) 0.577
                    HighSpanNear        4.59      (2.8%)        4.55      (2.8%)   -0.8% (  -6% -    4%) 0.356
                    OrNotHighLow     1091.32      (2.8%)     1082.56      (3.3%)   -0.8% (  -6% -    5%) 0.402
                     LowSpanNear       12.72      (2.4%)       12.64      (2.5%)   -0.7% (  -5% -    4%) 0.377
            HighTermTitleBDVSort        5.69      (2.4%)        5.65      (2.5%)   -0.7% (  -5% -    4%) 0.387
                          Fuzzy2      165.14      (2.4%)      164.10      (1.8%)   -0.6% (  -4% -    3%) 0.343
                     MedSpanNear       12.61      (1.9%)       12.53      (2.2%)   -0.6% (  -4% -    3%) 0.327
                        HighTerm      578.67      (6.5%)      575.09      (5.1%)   -0.6% ( -11% -   11%) 0.739
            MedTermDayTaxoFacets       19.91      (5.1%)       19.79      (4.3%)   -0.6% (  -9% -    9%) 0.683
          OrHighMedDayTaxoFacets        3.43     (12.0%)        3.41     (11.4%)   -0.5% ( -21% -   25%) 0.885
                         MedTerm      748.03      (5.6%)      744.26      (4.5%)   -0.5% ( -10% -   10%) 0.755
         AndHighMedDayTaxoFacets       32.72      (1.5%)       32.57      (1.6%)   -0.5% (  -3% -    2%) 0.335
                         LowTerm      526.39      (2.7%)      524.40      (2.7%)   -0.4% (  -5% -    5%) 0.653
                HighSloppyPhrase        9.56      (2.3%)        9.53      (2.9%)   -0.3% (  -5% -    5%) 0.689
                      HighPhrase       33.59      (3.8%)       33.49      (3.8%)   -0.3% (  -7% -    7%) 0.803
             MedIntervalsOrdered       14.95      (4.2%)       14.92      (4.4%)   -0.2% (  -8% -    8%) 0.876
       BrowseDayOfYearSSDVFacets        6.37      (1.4%)        6.36      (1.3%)   -0.2% (  -2% -    2%) 0.665
                 MedSloppyPhrase       15.68      (1.6%)       15.65      (2.8%)   -0.2% (  -4% -    4%) 0.799
                      AndHighMed      109.07      (3.9%)      108.94      (3.2%)   -0.1% (  -6% -    7%) 0.915
                 LowSloppyPhrase       12.64      (1.5%)       12.63      (2.2%)   -0.1% (  -3% -    3%) 0.855
                    OrNotHighMed      377.59      (3.0%)      377.64      (3.3%)    0.0% (  -6% -    6%) 0.989
           BrowseMonthSSDVFacets        6.65      (0.4%)        6.65      (0.4%)    0.0% (   0% -    0%) 0.829
                       OrHighLow      541.18      (3.3%)      541.47      (3.3%)    0.1% (  -6% -    6%) 0.960
             LowIntervalsOrdered       12.42      (4.0%)       12.43      (3.9%)    0.1% (  -7% -    8%) 0.948
     BrowseRandomLabelSSDVFacets        5.35      (1.3%)        5.36      (1.1%)    0.1% (  -2% -    2%) 0.809
                       MedPhrase       47.26      (2.3%)       47.31      (2.4%)    0.1% (  -4% -    4%) 0.889
                         Prefix3      246.21      (4.9%)      246.48      (7.3%)    0.1% ( -11% -   12%) 0.956
                       LowPhrase       34.24      (2.5%)       34.30      (2.4%)    0.2% (  -4% -    5%) 0.824
               HighTermTitleSort       98.67      (2.9%)       98.84      (4.6%)    0.2% (  -7% -    7%) 0.884
                      OrHighHigh       42.99      (8.8%)       43.10      (8.6%)    0.3% ( -15% -   19%) 0.924
            BrowseDateSSDVFacets        2.24     (13.7%)        2.25     (13.2%)    0.6% ( -23% -   31%) 0.895
                         Respell      255.75      (1.7%)      257.22      (1.7%)    0.6% (  -2% -    4%) 0.281
       BrowseDayOfYearTaxoFacets        7.39      (1.9%)        7.44      (4.1%)    0.7% (  -5% -    6%) 0.489
                      AndHighLow      961.38      (3.8%)      970.95      (4.1%)    1.0% (  -6% -    9%) 0.427
            BrowseDateTaxoFacets        7.08      (1.3%)        7.16      (4.6%)    1.1% (  -4% -    7%) 0.291
                      TermDTSort       77.89      (7.4%)       78.84      (4.8%)    1.2% ( -10% -   14%) 0.535
                        PKLookup      250.25      (3.3%)      253.54      (2.7%)    1.3% (  -4% -    7%) 0.169
           HighTermDayOfYearSort      107.64      (2.6%)      109.12      (2.1%)    1.4% (  -3% -    6%) 0.064
                       OrHighMed      109.62      (6.2%)      111.24      (6.1%)    1.5% ( -10% -   14%) 0.447
     BrowseRandomLabelTaxoFacets        6.17      (5.0%)        6.27      (3.9%)    1.6% (  -6% -   11%) 0.253
                        Wildcard      183.93      (3.7%)      187.03      (4.4%)    1.7% (  -6% -   10%) 0.193
                         LongNRQ       42.83      (6.1%)       83.47     (34.5%)   94.9% (  51% -  144%) 0.000
                          IntNRQ       20.29      (3.4%)       53.24     (34.2%)  162.4% ( 120% -  207%) 0.000

The interesting parts here is in the bottom two lines - IntNRQ and LongNRQ becomes much faster. it might be that I messed up the "minimal" part of reproduction, maybe all that was needed is the single-segment and taskCountPerCat increase.

Regardless, it looks promising - a 94% to 162% increase in QPS for range queries with this PR in the slightly modified benchmark.

jpountz · 2024-03-13T15:28:15Z

Would we be subject to the same issue if/when 3+ different implementations of DocIdSetIterator get used in IntersectVisitor#visit?

antonha · 2024-03-13T17:03:29Z

Would we be subject to the same issue if/when 3+ different implementations of DocIdSetIterator get used in IntersectVisitor#visit?

Yes.

Your question makes me think that maybe we should not add a new DocIdSetIterator for this PR - maybe that was your thought as well?

jpountz · 2024-03-13T17:47:19Z

Relatedly indeed, I was wondering if the API should expose an IntsRef or something like that, so that there is a single virtual call per block of doc IDs anyway (IntsRef cannot be extended, it's a single impl).

antonha · 2024-03-13T22:53:42Z

@jpountz I had a quick look at the code, and it seems to me like there are, with this PR, only two implementations used for the DISI used for the IntersectVisitor#visit method - which is good in the sense that it would probably be fine to merge it now, but bad in the sense that it would make adding a third hurt performance.

Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?

jpountz · 2024-03-14T14:19:07Z

++ on progress over perfection

That said, I wonder if this change is legal: DocIdSetIterator must return doc IDs in order, but it looks like it wouldn't always be the case with your change?

antonha · 2024-03-15T09:53:05Z

That said, I wonder if this change is legal: DocIdSetIterator must return doc IDs in order, but it looks like it wouldn't always be the case with your change?

You are correct - I had (falsely) assumed that the document ids in the DocIdsWriter were written in order - thus the PR as of now (dacb2a5d3ad707a555ed1fa2925a082d81aa4443) is not a legal change.

The good news is that this should not have affected the benchmarks - neither the PointRangeQuery nor the NumericLeafComparator seems to rely on the DISI being ordered. They just iterate until DocIdSetIterator.NO_MORE_DOCS.

I will try adding a visit() method taking an IntsRef (I believe that is what you meant @jpountz?).

…sitor. Instead of calling IntersectVisitor.visit for each doc in the readDelta16 and readInts32 methods, create a DocIdSetIterator and call IntersectVisitor.visit(DocIdSetIterator) instead. This seems to make Lucene faster at sorting and range querying tasks - the hypothesis being that it is due to fewer virtual calls.

antonha · 2024-03-15T11:44:43Z

I've changed the readInts16/34 methods to now use the IntsRef instead - I kind of like how it turned out, but the downside is that we will have to add many implementations to see the benefits. I've added the 2 in the PointRangeQuery - if we want to continue down this path, I can add more.

One thing that I would love input on is whether or not to manually inline the visit(docid) version or not. E.g. I now do:

for (int i = ref.offset; i < ref.offset + ref.length; i++) {
  result.clear(ref.ints[i]);
  cost[0]--;
 }

While we could do

for (int i = ref.offset; i < ref.offset + ref.length; i++) {
  visit(ref.ints[i]);
 }

The latter would, most likely, be inlined by the JVM. The first is manually inlined, so we don't need to trust the JVM on it. Thoughts?

It is important to note that just having the default method in IntersectVisitor would not be good, since that would do virtual calls to visit(int). Copy-pasting that method body into each implementation would however, most likely, inline better.

antonha · 2024-03-15T18:20:16Z

I benchmarked the current commit (e826696fd235cb1ef215745944cac6e34faa0d20) using the same modified lucenebench as above, it seems to like this change even better:

TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
LongNRQ       41.11      (6.6%)      107.13     (13.4%)  160.6% ( 131% -  193%) 0.000
 IntNRQ       23.23      (7.0%)       85.58     (19.0%)  268.4% ( 226% -  316%) 0.000

(the baseline being master)

gf2121 · 2024-03-18T06:33:11Z

lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java

@@ -185,6 +186,13 @@ public void visit(DocIdSetIterator iterator) throws IOException {
            adder.add(iterator);
          }

+          @Override
+          public void visit(IntsRef ref) {
+            for (int i = ref.offset; i < ref.offset + ref.length; i++) {


Can we delegate this to BulkAdder and take advantage of System#arraycopy in BufferAdder ?

That would probably be better. I don't think we should expect huge performance increases from it (since it would only affect narrow range queries). But it would make a large difference if a third implementation of the adder is ever used.

I have limited time this week to look at it - I will try to find some time for it though!

I unfortunately did not find time enough for this this week - and since this PR seems to be wrapping up I will leave it out. It looks like a simple change but I got stuck implementing tests (which I feel would be very needed). I will leave this for future improvement!

jpountz · 2024-03-19T08:55:48Z

I will try adding a visit() method taking an IntsRef (I believe that is what you meant @jpountz?).

This is what I meant indeed.

Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2?

antonha · 2024-03-24T21:02:07Z

Before merging I'd be curious to better understand why the JVM doesn't optimize this better. Presumably, it should be able to resolve the virtual call once for the entire for loop rather than doing it again on every iteration? I wonder if there is actually a performance bug, or if we are just insufficiently warming up the JVM and this for loop never gets compiled through C2?

Valid questions.

I ran the same luceneutils benchmark with -XX:+LogCompilation and -XX:PrintInlining and I can see c2/level4 compilations being done, including BKDPointTree::addAll and DocIdsWriter::readInts32. The Elasticsearch benchmarks mentioned in my post (where we saw similar results) were also run long enough that I would be very surprised if c2 had not kicked in.

I also tried reproducing this behavior in a shorter java program for demonstration purposes - with virtual calls similar to the IntersectVisitor. The program has 3 implementations of an interface processing int values, with both single value and batch implementations:

import java.util.Random;

public class C2Inlining {
  static int ITERATIONS = 50_000;
  static int NUM_VALUES = 100_000;
  public static void main(String[] args) {
    //Generate ints
    Random r = new Random();
    int[] arr = new int[NUM_VALUES];
    for (int i = 0; i < NUM_VALUES; i++) {
      arr[i] = r.nextInt();
    }

    IntProcessor[] intProcessors = {
        new IntProcessor1(),
        new IntProcessor2(),
        //Comment this last one out to trigger bimorphic behaviour
        new IntProcessor3()
    };
    processOneByOne(intProcessors, arr);
    processBatch(intProcessors, arr);
  }

  private static void processOneByOne(IntProcessor[] intProcessors, int[] arr) {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      for (IntProcessor intProcessor : intProcessors) {
        for (int value : arr) {
          intProcessor.process(value);
        }
      }
    }
    long end = System.nanoTime();
    long took = end - start;
    System.out.printf("One-by-one: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
  }

  private static void processBatch(IntProcessor[] intProcessors, int[] arr) {
    long start = System.nanoTime();
    for (int i = 0; i < ITERATIONS; i++) {
      for (IntProcessor intProcessor : intProcessors) {
        intProcessor.process(arr);
      }
    }
    long end = System.nanoTime();
    long took = end - start;
    System.out.printf("Batch: Time per iteration: %.3f ms%n", (((double) took) / ITERATIONS) / 1_000_000d);
  }

  interface IntProcessor {
    void process(int i);
    void process(int[] arr);
  }

  static class IntProcessor1 implements IntProcessor {

    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }

  static class IntProcessor2 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }

  static class IntProcessor3 implements IntProcessor {
    static int value;

    @Override
    public void process(int i) {
      value = i;
    }

    @Override
    public void process(int[] arr) {
      for (int i = 0; i < arr.length; i++) {
        value = arr[i];
      }
    }
  }
}

I ran this in this form, and with one implementation commented out (to trigger the bimorphic inlining behavior). The timing results are quite extreme (running with jdk21 and -server)

Variant	One-by-One	Batch
Three implementations	1.101 ms/iter	0.018 ms/iter
Two implementations	0.198 ms/iter	0.012 ms/iter

I.e. with three implementations, batching is ~60 times faster than single implementations. With two, it is ~16 times faster.
Both versions saw c2 compilations.

This is the same pattern that we see with the lucene code - although the difference is much more extreme (I'm guessing due to the implementation). I do think that we can draw the conclusion that helping the JVM with batch versions of virtual calls can help performance significantly.

One might argue that the JVM should be able to figure this out on it's own. But until it does, let's maybe help it a bit?

jpountz

Thanks for double checking that the problem still occurs with significant warmup. The change looks reasonable, I'd like to have someone else take a look though. @mikemccand Would you be able to do that since you already started looking?

Ideally we'd add a query that adds another implementation of an IntersectVisitor to the nightly benchmarks before merging that PR so that we can see the performance bump?

jpountz · 2024-03-25T21:46:26Z

lucene/core/src/java/org/apache/lucene/search/PointRangeQuery.java

+          public void visit(IntsRef ref) throws IOException {
+            for (int i = ref.offset; i < ref.offset + ref.length; i++) {
+              result.clear(ref.ints[i]);
+              cost[0]--;


you could move this outside of the loop and decrement by ref.length in one go?

yes, done :)

antonha · 2024-03-26T22:09:37Z

Ideally we'd add a query that adds another implementation of an IntersectVisitor to the nightly benchmarks before merging that PR so that we can see the performance bump?

Yes - this would be ideal imo. Some caveats that I found while testing:

PointRangeQuerys are executed with different IntersectVisitors depending on expected number of matches, which means that several query intervals need to be used in each JVM. In luceneutil, I believe that this corresponds to setting taskCountPerCat high.
To see significant change, the number of documents in the segments needs to either fit into the 16bit or 32bit implementation of bkd.DocIdsWriter - this was not the case with the simple localrun.py script out of the box.
It is possible that we see a performance decrease when we add more IntersectVisitor variants, since the benchmark becomes more realistically slow with the additional implementations.

mikemccand · 2024-03-28T12:13:57Z

Do we believe that we can merge this PR and then continue with changing the BKD visit API in a later change, or should we try to change the abstraction in this PR?

+1 -- PNP!

In luceneutil, I believe that this corresponds to setting taskCountPerCat high.

The nightly benchy uses taskCountPerCat=5 -- is that enough to reveal the impact of this optimization?

But, the nightly benchy does not forceMerge down to one segment, so, we may not have the right number of docs to use 16 or 32 bit int encoding in the BKD file.

Actually, I don't think we should block this change on fixing nightly benchy to reveal its gains? Your above luceneutil repro is sufficient? I'm wary of allowing benchmarking to unduly hold up otherwise good Lucene changes in general... it's a helpful tool, not a crutch ;)

mikemccand

Looks great -- I love the state this iterated to. Thank you for the deep analysis, and persisting through the feedback @antonha. This is a very exciting optimization!

antonha · 2024-03-30T22:45:01Z

Thank you all for helping to get this PR into a better state. The only thing that still irked me is that I couldn't get the tests to execute all newly added methods. I made a few changes in the last commit to:

Add a new test case which calls doTestRandomLongs with 20 000 - without this I couldn't get the IntsRef to trigger often enough. This should maybe be a @Nightly?
Added a higher chance of having 0 missing values in doTestRandomLongs - otherwise the inverse intersect visitor would only have a chance of happening every 100 runs, which I thought a bit low.

I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes? After this, what's the protocol? Do I merge the PR myself or should a committer push the button?

mikemccand · 2024-04-01T14:33:28Z

Add a new test case which calls doTestRandomLongs with 20 000 - without this I couldn't get the IntsRef to trigger often enough. This should maybe be a @Nightly?

How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...

I would love it if you, @mikemccand or @jpountz, could give a thumbs up to these last changes?

I'll have a look!

Do I merge the PR myself or should a committer push the button?

One of we committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge.

mikemccand

Just a tiny javadoc wording improvement. Thanks @antonha!

mikemccand · 2024-04-01T14:34:13Z

lucene/core/src/java/org/apache/lucene/index/PointValues.java

@@ -298,6 +299,17 @@ default void visit(DocIdSetIterator iterator) throws IOException {
      }
    }

+    /**
+     * Similar to {@link IntersectVisitor#visit(int)}, but a bulk visit and implements may have


implements -> implementations?

whoops, fixed!

mikemccand · 2024-04-01T14:43:23Z

Add a new test case which calls doTestRandomLongs with 20 000 - without this I couldn't get the IntsRef to trigger often enough. This should maybe be a @Nightly?

How long does the test take to run? It'd be nice to exercise the optimized path in "ordinary" (non-nightly) test runs too ...

I timed this a bit ... looks like it's ~.25 - .5 seconds. I think that's OK to run always.

antonha · 2024-04-01T16:21:25Z

One of we committers must merge it! It sounds like we are super close ... I'll try to review today and maybe merge.

Sounds great - the javadoc fix is done. Thanks a lot for having a look and timing the test!

mikemccand · 2024-04-01T16:46:46Z

Thanks @antonha -- sorry, maybe also add a CHANGES.txt entry? This is an exciting opto!

antonha · 2024-04-01T18:45:35Z

Thanks @antonha -- sorry, maybe also add a CHANGES.txt entry? This is an exciting opto!

My bad - I should have added one the first time you asked 🙈. I've Added one now, let me know if you feel like it lacks something!

mikemccand

Thank you @antonha!

…sitor (#13149) * Made DocIdsWriter use DISI when reading documents with an IntersectVisitor. Instead of calling IntersectVisitor.visit for each doc in the readDelta16 and readInts32 methods, create a DocIdSetIterator and call IntersectVisitor.visit(DocIdSetIterator) instead. This seems to make Lucene faster at sorting and range querying tasks - the hypothesis being that it is due to fewer virtual calls. * Spotless * Changed bulk iteration to use IntsRef instead. * Clearer comment. * Decrease cost outside loop * Added test to make inverse intsref be used. * Wording improvement in javadoc. * Added CHANGES.txt entry.

mikemccand · 2024-04-02T14:00:47Z

OK I merged a backported to 9.11.0 -- I think that's safe: we added a new default method to IntersectVisitor.

antonha force-pushed the docidswriter_disi_visit branch from 0319617 to dacb2a5 Compare March 1, 2024 19:10

msfroh mentioned this pull request Mar 4, 2024

Improve search performance for numeric sort queries opensearch-project/OpenSearch#10867

Open

mikemccand reviewed Mar 6, 2024

View reviewed changes

gautamworah96 reviewed Mar 6, 2024

View reviewed changes

mikemccand approved these changes Mar 8, 2024

View reviewed changes

antonha mentioned this pull request Mar 13, 2024

Minimal proof for Bulk DocIdSetIterator for Lucene PR 13149 mikemccand/luceneutil#257

Open

antonha added 3 commits March 15, 2024 11:30

Spotless

8287e85

Changed bulk iteration to use IntsRef instead.

60495d4

antonha force-pushed the docidswriter_disi_visit branch from b5fbbf1 to 60495d4 Compare March 15, 2024 10:31

Clearer comment.

e826696

gf2121 reviewed Mar 18, 2024

View reviewed changes

jpountz approved these changes Mar 25, 2024

View reviewed changes

Decrease cost outside loop

2b6328b

mikemccand approved these changes Mar 28, 2024

View reviewed changes

antonha force-pushed the docidswriter_disi_visit branch from 013a520 to cc39cc1 Compare March 30, 2024 22:40

Added test to make inverse intsref be used.

cc39cc1

mikemccand approved these changes Apr 1, 2024

View reviewed changes

Wording improvement in javadoc.

95e3bcc

Added CHANGES.txt entry.

288ab03

mikemccand approved these changes Apr 2, 2024

View reviewed changes

mikemccand merged commit 4eeb3a7 into apache:main Apr 2, 2024
3 checks passed

mikemccand added a commit that referenced this pull request Apr 2, 2024

#13149: move CHANGES entry under 9.11.0

bf193a7

mikemccand added this to the 9.11.0 milestone Apr 2, 2024

iverase mentioned this pull request Jan 14, 2025

Implement IntersectVisitor#visit(IntsRef) whenever it makes sense #14138

Merged

Made DocIdsWriter use DISI when reading documents with an IntersectVisitor #13149

Made DocIdsWriter use DISI when reading documents with an IntersectVisitor #13149

Conversation

antonha commented Mar 1, 2024

mikemccand left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikemccand left a comment

Choose a reason for hiding this comment

antonha commented Mar 9, 2024

mikemccand commented Mar 10, 2024

antonha commented Mar 13, 2024

jpountz commented Mar 13, 2024

antonha commented Mar 13, 2024

jpountz commented Mar 13, 2024

antonha commented Mar 13, 2024

jpountz commented Mar 14, 2024

antonha commented Mar 15, 2024

antonha commented Mar 15, 2024

antonha commented Mar 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jpountz commented Mar 19, 2024

antonha commented Mar 24, 2024

jpountz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

antonha commented Mar 26, 2024

mikemccand commented Mar 28, 2024

mikemccand left a comment

Choose a reason for hiding this comment

antonha commented Mar 30, 2024

mikemccand commented Apr 1, 2024

mikemccand left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mikemccand commented Apr 1, 2024

antonha commented Apr 1, 2024

mikemccand commented Apr 1, 2024

antonha commented Apr 1, 2024

mikemccand left a comment

Choose a reason for hiding this comment

mikemccand commented Apr 2, 2024

antonha commented Mar 15, 2024 •

edited

Loading