internal/keyspan: modify FragmentIterator seek semantics
Alter the semantics of SeekGE and SeekLT on FragmentIterator.
Previously, FragmentIterator's seek operations were defined only in
terms of span start keys. This commit changes the seek operations to be
defined in terms of the keys contained by the span. A SeekGE now seeks
to the first span containing a key ≥ the seek key, and a SeekLT now
seeks to the last span containing a key < seek key. These new semantics
match the typical top-level iterator use.
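
As a rough illustration of the difference, the sketch below models both the old and new seek definitions over a sorted slice of non-overlapping spans. The `Span` type with string keys and the helper names (`seekGEByStart`, `seekGE`, `seekLT`) are assumptions made for this example only; they are not the types or functions in internal/keyspan.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical stand-in for a key span: the half-open range
// [Start, End). String keys keep the sketch short.
type Span struct {
	Start, End string
}

// seekGEByStart models the previous semantics: the index of the first span
// whose start key is >= key, or -1 if there is none.
func seekGEByStart(spans []Span, key string) int {
	i := sort.Search(len(spans), func(j int) bool { return spans[j].Start >= key })
	if i == len(spans) {
		return -1
	}
	return i
}

// seekGE models the new semantics: the index of the first span containing a
// key >= key. Because End is exclusive, that is the first span with End > key.
func seekGE(spans []Span, key string) int {
	i := sort.Search(len(spans), func(j int) bool { return spans[j].End > key })
	if i == len(spans) {
		return -1
	}
	return i
}

// seekLT models the new SeekLT semantics: the index of the last span
// containing a key < key, which is the last span with Start < key.
func seekLT(spans []Span, key string) int {
	return sort.Search(len(spans), func(j int) bool { return spans[j].Start >= key }) - 1
}

func main() {
	spans := []Span{{"a", "c"}, {"e", "g"}}
	fmt.Println(seekGEByStart(spans, "b")) // 1: [e, g) is the first span starting at or after b
	fmt.Println(seekGE(spans, "b"))        // 0: [a, c) contains b
	fmt.Println(seekLT(spans, "f"))        // 1: [e, g) contains keys < f
}
```

With the seek key `b`, the start-key definition skips past `[a, c)` even though that span covers `b`, which is the case the containment definition is meant to handle.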

With these new semantics, SeekLT can still be implemented in terms of a
simple span start key seek. Seeking to the last span containing a key <
seek key is equivalent to seeking to the last span with a start key less
than the given key.

However, SeekGE implementations now require an extra key comparison and
sometimes a Next. Since the top-level iterator requires the containment
semantics anyway, this key comparison and Next are only being moved down
the stack into the interface implementation. When using the keyspan
merging iterator, the keyspan.MergingIter's SeekGE implementation
performs a SeekLT per-level, which suffers no additional overhead.
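
A minimal sketch of where the extra comparison and Next come from, assuming an implementation that can only search by start key. The `iter` type below is hypothetical and uses string keys; it follows the SeekLT-then-Next pattern rather than reproducing any iterator in the package.

```go
package main

import (
	"fmt"
	"sort"
)

// Span is a hypothetical half-open key range [Start, End).
type Span struct {
	Start, End string
}

// iter is a hypothetical fragment iterator over sorted, non-overlapping spans.
type iter struct {
	spans []Span
	index int
}

// current returns the span at the current position, or nil when exhausted.
func (i *iter) current() *Span {
	if i.index < 0 || i.index >= len(i.spans) {
		return nil
	}
	return &i.spans[i.index]
}

// SeekLT positions on the last span with Start < key.
func (i *iter) SeekLT(key string) *Span {
	i.index = sort.Search(len(i.spans), func(j int) bool { return i.spans[j].Start >= key }) - 1
	return i.current()
}

// Next advances to the following span.
func (i *iter) Next() *Span {
	if i.index < len(i.spans) {
		i.index++
	}
	return i.current()
}

// SeekGE positions on the first span containing a key >= key. It seeks by
// start key, then pays one extra comparison against End and, when the span
// found ends at or before key, one Next.
func (i *iter) SeekGE(key string) *Span {
	s := i.SeekLT(key)
	if s == nil || s.End <= key {
		return i.Next()
	}
	return s
}

func main() {
	it := &iter{spans: []Span{{"a", "c"}, {"e", "g"}}}
	fmt.Println(*it.SeekGE("b")) // {a c}: the span straddles b, no Next needed
	fmt.Println(*it.SeekGE("c")) // {e g}: [a, c) ends at c, so one Next
}
```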

With the MergingIter and DefragmentingIter implementations, these new
semantics reduce the amount of work performed during a seek. The
previous iterator stack's SeekGE looked like (left-to-right, top-down):

                         InterleavingIter.SeekGE
                                    │
                   ╭────────────────┴───────────────╮
                   │                                │
        DefragmentingIter.SeekLT         DefragmentingIter.Next()
                   │                                │
       ╭───────────┴───╮                            │
       │               │                            │
 MergingIter.SeekLT    ├── defragmentFwd            ├── defragmentFwd
       │               │                            │
       │               ╰── defragmentBwd            ╰── defragmentFwd
       ╰───────────╮
                   │
       ╭───────────┴───────────╮
       │                       │
 MergingIter.SeekGE      MergingIter.Prev
       │
       ╰─╶╶ per level╶╶ ─╮
                         │
             ╭───────────┴───────────╮
             │                       │
         <?>.SeekLT              <?>.Next

The new iterator stack's SeekGE, assuming it doesn't hit the new
defragmenting fast path, looks like:

                         InterleavingIter.SeekGE
                                    │
                         DefragmentingIter.SeekGE
                                    │
                   ╭────────────────┴───────────────╮
                   │                                ├── defragmentBwd*
             MergingIter.SeekGE                     │
                   │                                ╰── defragmentFwd
                   ╰─╶╶ per level╶╶ ─╮
                                     │
                                     │
                                     ├── <?>.SeekLT
                                     │
                                     ╰── <?>.Next

* — The call to defragmentBackward during SeekGE may now sometimes be
    elided, specifically if the span discovered by MergingIter.SeekGE does
    not contain the seek key within its bounds.
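
The elision can be sketched as follows, assuming a simplified model in which fragments carry a single comparable value and two fragments defragment when they abut and share that value. `Fragment`, `mergeable` and this `seekGE` are hypothetical names for illustration; the real DefragmentingIter drives an underlying iterator rather than indexing a slice.

```go
package main

import "fmt"

// Fragment is a hypothetical fragment [Start, End) whose Value stands in for
// the keys it carries.
type Fragment struct {
	Start, End, Value string
}

// mergeable reports whether b abuts a and carries the same value.
func mergeable(a, b Fragment) bool {
	return a.End == b.Start && a.Value == b.Value
}

// seekGE returns the defragmented span for the first fragment containing a
// key >= key, and whether a backward defragmentation pass was attempted.
func seekGE(frags []Fragment, key string) (span Fragment, wentBackward, ok bool) {
	idx := -1
	for j, f := range frags {
		if f.End > key {
			idx = j
			break
		}
	}
	if idx < 0 {
		return Fragment{}, false, false
	}
	lo := idx
	if frags[idx].Start <= key {
		// The fragment straddles key, so earlier abutting fragments may
		// belong to the same defragmented span.
		wentBackward = true
		for lo > 0 && mergeable(frags[lo-1], frags[lo]) {
			lo--
		}
	}
	// Otherwise Start > key: an abutting predecessor would end after key and
	// would have been found by the seek instead, so no backward pass is needed.
	hi := lo
	for hi+1 < len(frags) && mergeable(frags[hi], frags[hi+1]) {
		hi++
	}
	return Fragment{frags[lo].Start, frags[hi].End, frags[lo].Value}, wentBackward, true
}

func main() {
	frags := []Fragment{{"a", "c", "x"}, {"c", "e", "x"}, {"g", "i", "y"}}
	fmt.Println(seekGE(frags, "d")) // {a e x} true true
	fmt.Println(seekGE(frags, "f")) // {g i y} false true
}
```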

Note that in this interface, there are no calls to any of the leaf
FragmentIterator's SeekGE methods which would suffer the extra key
comparison and Next. Instead, the MergingIter calls SeekLT and
unconditionally Nexts each of the leaves as part of its logic to
fragment bounds across levels.

This reduced work for seeks has a large impact on the MVCCGet and
MVCCScan microbenchmarks in the presence of range keys.

```
name                                                                      old time/op    new time/op     delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8/numRangeKeys=0-24         6.30µs ± 1%     6.22µs ± 2%      ~     (p=0.095 n=5+5)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8/numRangeKeys=1-24         11.5µs ± 1%     10.3µs ± 1%    -9.95%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8/numRangeKeys=100-24        118µs ± 1%       79µs ± 2%   -33.14%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8/numRangeKeys=0-24        23.9µs ± 1%     24.1µs ± 2%      ~     (p=0.310 n=5+5)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8/numRangeKeys=1-24        31.7µs ± 2%     29.6µs ± 1%    -6.65%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8/numRangeKeys=100-24       109µs ± 1%       69µs ± 2%   -36.58%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8/numRangeKeys=0-24        100µs ± 1%       99µs ± 3%      ~     (p=0.310 n=5+5)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8/numRangeKeys=1-24        110µs ± 1%      106µs ± 2%    -3.24%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8/numRangeKeys=100-24      197µs ± 2%      153µs ± 1%   -22.75%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8/numRangeKeys=0-24          3.74µs ± 1%     3.57µs ± 1%    -4.47%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8/numRangeKeys=1-24          6.01µs ± 1%     4.93µs ± 2%   -17.86%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8/numRangeKeys=100-24        66.1µs ± 1%     28.8µs ± 1%   -56.35%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8/numRangeKeys=0-24         20.4µs ± 1%     20.4µs ± 1%      ~     (p=0.690 n=5+5)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8/numRangeKeys=1-24         25.9µs ± 1%     23.9µs ± 3%    -7.79%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8/numRangeKeys=100-24       89.3µs ± 1%     50.2µs ± 2%   -43.76%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8/numRangeKeys=0-24        98.7µs ± 1%     97.9µs ± 1%      ~     (p=0.151 n=5+5)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8/numRangeKeys=1-24         106µs ± 1%      103µs ± 1%    -2.63%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8/numRangeKeys=100-24       179µs ± 3%      131µs ± 2%   -26.75%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24            10.9µs ± 3%     10.7µs ± 1%      ~     (p=0.151 n=5+5)
MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=1-24            17.9µs ± 1%     16.1µs ± 2%   -10.35%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=100-24           172µs ± 1%       94µs ± 2%   -45.23%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=2/valueSize=64/numRangeKeys=0-24            13.0µs ± 1%     13.1µs ± 2%      ~     (p=0.690 n=5+5)
MVCCScan_Pebble/rows=1/versions=2/valueSize=64/numRangeKeys=1-24            21.1µs ± 1%     19.0µs ± 2%    -9.70%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=2/valueSize=64/numRangeKeys=100-24           158µs ± 1%       83µs ± 3%   -47.57%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64/numRangeKeys=0-24           20.3µs ± 1%     20.1µs ± 2%      ~     (p=0.151 n=5+5)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64/numRangeKeys=1-24           30.6µs ± 2%     27.6µs ± 1%    -9.70%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64/numRangeKeys=100-24          160µs ± 2%       88µs ± 3%   -45.10%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=100/valueSize=64/numRangeKeys=0-24          41.5µs ± 1%     41.1µs ± 1%    -0.97%  (p=0.048 n=5+5)
MVCCScan_Pebble/rows=1/versions=100/valueSize=64/numRangeKeys=1-24          50.9µs ± 1%     48.6µs ± 2%    -4.67%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=100/valueSize=64/numRangeKeys=100-24         140µs ± 2%       94µs ± 1%   -32.43%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=1/valueSize=64/numRangeKeys=0-24           15.6µs ± 2%     16.1µs ± 1%    +3.21%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=1/valueSize=64/numRangeKeys=1-24           24.0µs ± 1%     23.1µs ± 2%    -3.87%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=1/valueSize=64/numRangeKeys=100-24          117µs ± 1%       78µs ± 2%   -33.17%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=2/valueSize=64/numRangeKeys=0-24           20.1µs ± 1%     20.4µs ± 1%    +1.30%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=2/valueSize=64/numRangeKeys=1-24           30.1µs ± 1%     28.5µs ± 1%    -5.25%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=2/valueSize=64/numRangeKeys=100-24          109µs ± 2%       70µs ± 1%   -36.07%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=10/valueSize=64/numRangeKeys=0-24          37.7µs ± 2%     38.1µs ± 1%      ~     (p=0.056 n=5+5)
MVCCScan_Pebble/rows=10/versions=10/valueSize=64/numRangeKeys=1-24          55.5µs ± 2%     53.9µs ± 1%    -2.79%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=10/valueSize=64/numRangeKeys=100-24         140µs ± 2%      101µs ± 1%   -27.99%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=100/valueSize=64/numRangeKeys=0-24         97.5µs ± 3%     96.1µs ± 2%      ~     (p=0.095 n=5+5)
MVCCScan_Pebble/rows=10/versions=100/valueSize=64/numRangeKeys=1-24          117µs ± 4%      115µs ± 1%      ~     (p=0.151 n=5+5)
MVCCScan_Pebble/rows=10/versions=100/valueSize=64/numRangeKeys=100-24        216µs ± 1%      176µs ± 4%   -18.57%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24          50.9µs ± 1%     53.2µs ± 2%    +4.37%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=1-24          75.3µs ± 2%     74.8µs ± 2%      ~     (p=0.548 n=5+5)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=100-24         184µs ± 3%      141µs ± 1%   -23.61%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=2/valueSize=64/numRangeKeys=0-24          68.5µs ± 2%     70.7µs ± 1%    +3.27%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=2/valueSize=64/numRangeKeys=1-24           100µs ± 3%      102µs ± 1%      ~     (p=0.222 n=5+5)
MVCCScan_Pebble/rows=100/versions=2/valueSize=64/numRangeKeys=100-24         192µs ± 1%      149µs ± 2%   -22.12%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64/numRangeKeys=0-24          172µs ± 0%      176µs ± 1%    +2.48%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64/numRangeKeys=1-24          266µs ± 2%      271µs ± 2%    +1.88%  (p=0.032 n=5+5)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64/numRangeKeys=100-24        404µs ± 3%      364µs ± 3%    -9.69%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=100/valueSize=64/numRangeKeys=0-24         579µs ± 1%      578µs ± 1%      ~     (p=0.690 n=5+5)
MVCCScan_Pebble/rows=100/versions=100/valueSize=64/numRangeKeys=1-24         704µs ± 1%      706µs ± 2%      ~     (p=0.690 n=5+5)
MVCCScan_Pebble/rows=100/versions=100/valueSize=64/numRangeKeys=100-24       965µs ± 2%      923µs ± 3%    -4.45%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=64/numRangeKeys=0-24          357µs ± 1%      372µs ± 1%    +4.34%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=64/numRangeKeys=1-24          529µs ± 2%      546µs ± 1%    +3.26%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=64/numRangeKeys=100-24        757µs ± 1%      691µs ± 1%    -8.72%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=2/valueSize=64/numRangeKeys=0-24          496µs ± 2%      507µs ± 2%      ~     (p=0.222 n=5+5)
MVCCScan_Pebble/rows=1000/versions=2/valueSize=64/numRangeKeys=1-24          758µs ± 1%      778µs ± 1%    +2.63%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=2/valueSize=64/numRangeKeys=100-24        968µs ± 2%      904µs ± 1%    -6.58%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=10/valueSize=64/numRangeKeys=0-24        1.46ms ± 4%     1.50ms ± 2%      ~     (p=0.421 n=5+5)
MVCCScan_Pebble/rows=1000/versions=10/valueSize=64/numRangeKeys=1-24        2.36ms ± 2%     2.35ms ± 1%      ~     (p=0.841 n=5+5)
MVCCScan_Pebble/rows=1000/versions=10/valueSize=64/numRangeKeys=100-24      2.97ms ± 5%     2.91ms ± 2%      ~     (p=0.151 n=5+5)
MVCCScan_Pebble/rows=1000/versions=100/valueSize=64/numRangeKeys=0-24       5.12ms ± 3%     5.08ms ± 3%      ~     (p=0.690 n=5+5)
MVCCScan_Pebble/rows=1000/versions=100/valueSize=64/numRangeKeys=1-24       6.38ms ± 2%     6.34ms ± 2%      ~     (p=0.548 n=5+5)
MVCCScan_Pebble/rows=1000/versions=100/valueSize=64/numRangeKeys=100-24     8.11ms ± 3%     7.97ms ± 5%      ~     (p=0.310 n=5+5)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24        3.56ms ± 1%     3.37ms ± 1%    -5.61%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=1-24        5.32ms ± 1%     5.12ms ± 2%    -3.90%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=100-24      6.35ms ± 1%     6.25ms ± 1%    -1.59%  (p=0.016 n=5+5)
MVCCScan_Pebble/rows=10000/versions=2/valueSize=64/numRangeKeys=0-24        4.91ms ± 2%     4.90ms ± 1%      ~     (p=1.000 n=5+5)
MVCCScan_Pebble/rows=10000/versions=2/valueSize=64/numRangeKeys=1-24        7.41ms ± 1%     7.26ms ± 1%    -2.10%  (p=0.032 n=5+5)
MVCCScan_Pebble/rows=10000/versions=2/valueSize=64/numRangeKeys=100-24      8.48ms ± 1%     8.42ms ± 1%      ~     (p=0.095 n=5+5)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64/numRangeKeys=0-24       14.3ms ± 3%     14.4ms ± 1%      ~     (p=0.310 n=5+5)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64/numRangeKeys=1-24       22.7ms ± 2%     22.6ms ± 2%      ~     (p=0.690 n=5+5)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64/numRangeKeys=100-24     27.7ms ± 3%     28.0ms ± 3%      ~     (p=0.548 n=5+5)
MVCCScan_Pebble/rows=10000/versions=100/valueSize=64/numRangeKeys=0-24      51.8ms ± 1%     50.4ms ± 5%      ~     (p=0.151 n=5+5)
MVCCScan_Pebble/rows=10000/versions=100/valueSize=64/numRangeKeys=1-24      64.0ms ± 6%     63.0ms ± 4%      ~     (p=0.690 n=5+5)
MVCCScan_Pebble/rows=10000/versions=100/valueSize=64/numRangeKeys=100-24    83.4ms ± 7%     84.3ms ± 4%      ~     (p=0.841 n=5+5)

name                                                                      old speed      new speed       delta
MVCCGet_Pebble/batch=false/versions=1/valueSize=8/numRangeKeys=0-24       1.27MB/s ± 2%   1.28MB/s ± 2%      ~     (p=0.119 n=5+5)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8/numRangeKeys=1-24        696kB/s ± 1%    774kB/s ± 1%   +11.21%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=false/versions=1/valueSize=8/numRangeKeys=100-24     70.0kB/s ± 0%  100.0kB/s ± 0%   +42.86%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8/numRangeKeys=0-24       336kB/s ± 2%    330kB/s ± 0%      ~     (p=0.095 n=5+4)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8/numRangeKeys=1-24       250kB/s ± 0%    270kB/s ± 0%    +8.00%  (p=0.016 n=4+5)
MVCCGet_Pebble/batch=false/versions=10/valueSize=8/numRangeKeys=100-24    70.0kB/s ± 0%  114.0kB/s ± 5%   +62.86%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8/numRangeKeys=0-24     80.0kB/s ± 0%   80.0kB/s ± 0%      ~     (all equal)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8/numRangeKeys=1-24     70.0kB/s ± 0%   76.0kB/s ± 8%      ~     (p=0.167 n=5+5)
MVCCGet_Pebble/batch=false/versions=100/valueSize=8/numRangeKeys=100-24   40.0kB/s ± 0%   50.0kB/s ± 0%   +25.00%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8/numRangeKeys=0-24        2.14MB/s ± 1%   2.24MB/s ± 1%    +4.68%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8/numRangeKeys=1-24        1.33MB/s ± 1%   1.62MB/s ± 2%   +21.77%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=1/valueSize=8/numRangeKeys=100-24       120kB/s ± 0%    280kB/s ± 0%  +133.33%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8/numRangeKeys=0-24        390kB/s ± 0%    390kB/s ± 0%      ~     (all equal)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8/numRangeKeys=1-24        310kB/s ± 0%    340kB/s ± 0%    +9.68%  (p=0.016 n=5+4)
MVCCGet_Pebble/batch=true/versions=10/valueSize=8/numRangeKeys=100-24     90.0kB/s ± 0%  160.0kB/s ± 0%   +77.78%  (p=0.008 n=5+5)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8/numRangeKeys=0-24      80.0kB/s ± 0%   80.0kB/s ± 0%      ~     (all equal)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8/numRangeKeys=1-24      80.0kB/s ± 0%   80.0kB/s ± 0%      ~     (all equal)
MVCCGet_Pebble/batch=true/versions=100/valueSize=8/numRangeKeys=100-24    44.0kB/s ±14%   60.0kB/s ± 0%   +36.36%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=0-24          5.90MB/s ± 3%   6.00MB/s ± 1%      ~     (p=0.119 n=5+5)
MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=1-24          3.57MB/s ± 1%   3.98MB/s ± 2%   +11.53%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=1/valueSize=64/numRangeKeys=100-24         370kB/s ± 0%    678kB/s ± 2%   +83.24%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=2/valueSize=64/numRangeKeys=0-24          4.91MB/s ± 1%   4.90MB/s ± 2%      ~     (p=0.730 n=5+5)
MVCCScan_Pebble/rows=1/versions=2/valueSize=64/numRangeKeys=1-24          3.04MB/s ± 1%   3.36MB/s ± 2%   +10.73%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=2/valueSize=64/numRangeKeys=100-24         404kB/s ± 1%    772kB/s ± 3%   +91.09%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64/numRangeKeys=0-24         3.15MB/s ± 1%   3.19MB/s ± 2%      ~     (p=0.167 n=5+5)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64/numRangeKeys=1-24         2.09MB/s ± 2%   2.32MB/s ± 1%   +10.70%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=10/valueSize=64/numRangeKeys=100-24        400kB/s ± 0%    730kB/s ± 3%   +82.50%  (p=0.016 n=4+5)
MVCCScan_Pebble/rows=1/versions=100/valueSize=64/numRangeKeys=0-24        1.54MB/s ± 1%   1.56MB/s ± 1%    +1.17%  (p=0.048 n=5+5)
MVCCScan_Pebble/rows=1/versions=100/valueSize=64/numRangeKeys=1-24        1.26MB/s ± 1%   1.32MB/s ± 2%    +4.93%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1/versions=100/valueSize=64/numRangeKeys=100-24       460kB/s ± 2%    680kB/s ± 0%   +47.83%  (p=0.016 n=5+4)
MVCCScan_Pebble/rows=10/versions=1/valueSize=64/numRangeKeys=0-24         41.0MB/s ± 2%   39.7MB/s ± 1%    -3.13%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=1/valueSize=64/numRangeKeys=1-24         26.6MB/s ± 1%   27.7MB/s ± 2%    +4.03%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=1/valueSize=64/numRangeKeys=100-24       5.46MB/s ± 1%   8.17MB/s ± 2%   +49.56%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=2/valueSize=64/numRangeKeys=0-24         31.8MB/s ± 1%   31.4MB/s ± 1%    -1.28%  (p=0.024 n=5+5)
MVCCScan_Pebble/rows=10/versions=2/valueSize=64/numRangeKeys=1-24         21.3MB/s ± 1%   22.5MB/s ± 1%    +5.52%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=2/valueSize=64/numRangeKeys=100-24       5.85MB/s ± 2%   9.15MB/s ± 1%   +56.37%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=10/valueSize=64/numRangeKeys=0-24        17.0MB/s ± 2%   16.8MB/s ± 1%      ~     (p=0.056 n=5+5)
MVCCScan_Pebble/rows=10/versions=10/valueSize=64/numRangeKeys=1-24        11.5MB/s ± 2%   11.9MB/s ± 1%    +2.82%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=10/valueSize=64/numRangeKeys=100-24      4.56MB/s ± 2%   6.33MB/s ± 1%   +38.89%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10/versions=100/valueSize=64/numRangeKeys=0-24       6.57MB/s ± 3%   6.66MB/s ± 2%      ~     (p=0.087 n=5+5)
MVCCScan_Pebble/rows=10/versions=100/valueSize=64/numRangeKeys=1-24       5.47MB/s ± 4%   5.58MB/s ± 1%      ~     (p=0.135 n=5+5)
MVCCScan_Pebble/rows=10/versions=100/valueSize=64/numRangeKeys=100-24     2.97MB/s ± 1%   3.65MB/s ± 4%   +22.98%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=0-24         126MB/s ± 1%    120MB/s ± 2%    -4.18%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=1-24        85.0MB/s ± 3%   85.6MB/s ± 2%      ~     (p=0.548 n=5+5)
MVCCScan_Pebble/rows=100/versions=1/valueSize=64/numRangeKeys=100-24      34.7MB/s ± 4%   45.4MB/s ± 1%   +30.87%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=2/valueSize=64/numRangeKeys=0-24        93.4MB/s ± 2%   90.5MB/s ± 1%    -3.17%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=2/valueSize=64/numRangeKeys=1-24        63.8MB/s ± 3%   62.9MB/s ± 1%      ~     (p=0.222 n=5+5)
MVCCScan_Pebble/rows=100/versions=2/valueSize=64/numRangeKeys=100-24      33.4MB/s ± 1%   42.9MB/s ± 2%   +28.42%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64/numRangeKeys=0-24       37.2MB/s ± 0%   36.3MB/s ± 1%    -2.42%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64/numRangeKeys=1-24       24.1MB/s ± 2%   23.6MB/s ± 2%    -1.84%  (p=0.032 n=5+5)
MVCCScan_Pebble/rows=100/versions=10/valueSize=64/numRangeKeys=100-24     15.9MB/s ± 3%   17.6MB/s ± 3%   +10.73%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=100/versions=100/valueSize=64/numRangeKeys=0-24      11.1MB/s ± 1%   11.1MB/s ± 1%      ~     (p=0.635 n=5+5)
MVCCScan_Pebble/rows=100/versions=100/valueSize=64/numRangeKeys=1-24      9.09MB/s ± 2%   9.07MB/s ± 2%      ~     (p=0.643 n=5+5)
MVCCScan_Pebble/rows=100/versions=100/valueSize=64/numRangeKeys=100-24    6.63MB/s ± 2%   6.94MB/s ± 3%    +4.68%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=64/numRangeKeys=0-24        179MB/s ± 1%    172MB/s ± 1%    -4.16%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=64/numRangeKeys=1-24        121MB/s ± 2%    117MB/s ± 1%    -3.16%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=1/valueSize=64/numRangeKeys=100-24     84.5MB/s ± 1%   92.6MB/s ± 1%    +9.56%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=2/valueSize=64/numRangeKeys=0-24        129MB/s ± 2%    126MB/s ± 3%      ~     (p=0.222 n=5+5)
MVCCScan_Pebble/rows=1000/versions=2/valueSize=64/numRangeKeys=1-24       84.4MB/s ± 1%   82.3MB/s ± 1%    -2.57%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=2/valueSize=64/numRangeKeys=100-24     66.1MB/s ± 2%   70.8MB/s ± 1%    +7.04%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=1000/versions=10/valueSize=64/numRangeKeys=0-24      43.7MB/s ± 4%   42.7MB/s ± 2%      ~     (p=0.421 n=5+5)
MVCCScan_Pebble/rows=1000/versions=10/valueSize=64/numRangeKeys=1-24      27.2MB/s ± 2%   27.3MB/s ± 1%      ~     (p=0.841 n=5+5)
MVCCScan_Pebble/rows=1000/versions=10/valueSize=64/numRangeKeys=100-24    21.6MB/s ± 5%   22.0MB/s ± 2%      ~     (p=0.135 n=5+5)
MVCCScan_Pebble/rows=1000/versions=100/valueSize=64/numRangeKeys=0-24     12.5MB/s ± 3%   12.6MB/s ± 3%      ~     (p=0.690 n=5+5)
MVCCScan_Pebble/rows=1000/versions=100/valueSize=64/numRangeKeys=1-24     10.0MB/s ± 2%   10.1MB/s ± 2%      ~     (p=0.548 n=5+5)
MVCCScan_Pebble/rows=1000/versions=100/valueSize=64/numRangeKeys=100-24   7.89MB/s ± 3%   8.04MB/s ± 5%      ~     (p=0.310 n=5+5)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=0-24       180MB/s ± 1%    190MB/s ± 1%    +5.94%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=1-24       120MB/s ± 1%    125MB/s ± 2%    +4.06%  (p=0.008 n=5+5)
MVCCScan_Pebble/rows=10000/versions=1/valueSize=64/numRangeKeys=100-24     101MB/s ± 1%    102MB/s ± 1%    +1.60%  (p=0.016 n=5+5)
MVCCScan_Pebble/rows=10000/versions=2/valueSize=64/numRangeKeys=0-24       130MB/s ± 2%    131MB/s ± 1%      ~     (p=1.000 n=5+5)
MVCCScan_Pebble/rows=10000/versions=2/valueSize=64/numRangeKeys=1-24      86.3MB/s ± 1%   88.2MB/s ± 1%    +2.14%  (p=0.032 n=5+5)
MVCCScan_Pebble/rows=10000/versions=2/valueSize=64/numRangeKeys=100-24    75.5MB/s ± 1%   76.0MB/s ± 1%      ~     (p=0.095 n=5+5)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64/numRangeKeys=0-24     44.7MB/s ± 3%   44.4MB/s ± 1%      ~     (p=0.310 n=5+5)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64/numRangeKeys=1-24     28.2MB/s ± 2%   28.3MB/s ± 2%      ~     (p=0.690 n=5+5)
MVCCScan_Pebble/rows=10000/versions=10/valueSize=64/numRangeKeys=100-24   23.2MB/s ± 3%   22.9MB/s ± 3%      ~     (p=0.548 n=5+5)
MVCCScan_Pebble/rows=10000/versions=100/valueSize=64/numRangeKeys=0-24    12.4MB/s ± 1%   12.7MB/s ± 5%      ~     (p=0.151 n=5+5)
MVCCScan_Pebble/rows=10000/versions=100/valueSize=64/numRangeKeys=1-24    10.0MB/s ± 5%   10.2MB/s ± 5%      ~     (p=0.643 n=5+5)
MVCCScan_Pebble/rows=10000/versions=100/valueSize=64/numRangeKeys=100-24  7.69MB/s ± 7%   7.60MB/s ± 4%      ~     (p=0.841 n=5+5)
```

Close cockroachdb#1829.
Informs cockroachdb/cockroach#83049.
jbowens committed Dec 15, 2022
1 parent 4e4c71d commit 76029d5
Showing 19 changed files with 451 additions and 172 deletions.
58 changes: 33 additions & 25 deletions internal/keyspan/defragment.go
@@ -119,10 +119,10 @@ const (
 //
 // Seeking (SeekGE, SeekLT) poses an obstacle to defragmentation. A seek may
 // land on a physical fragment in the middle of several fragments that must be
-// defragmented. A seek first degfragments in the opposite direction of
-// iteration to find the beginning of the defragmented span, and then
-// defragments in the iteration direction, ensuring it's found a whole
-// defragmented span.
+// defragmented. A seek that lands in a fragment straddling the seek key must
+// first degfragment in the opposite direction of iteration to find the
+// beginning of the defragmented span, and then defragments in the iteration
+// direction, ensuring it's found a whole defragmented span.
 type DefragmentingIter struct {
 	// DefragmentingBuffers holds buffers used for copying iterator state.
 	*DefragmentingBuffers
@@ -205,8 +205,9 @@ func (i *DefragmentingIter) Close() error {
 	return i.iter.Close()
 }

-// SeekGE seeks the iterator to the first span with a start key greater than or
-// equal to key and returns it.
+// SeekGE moves the iterator to the first span covering a key greater than or
+// equal to the given key. This is equivalent to seeking to the first span with
+// an end key greater than the given key.
 func (i *DefragmentingIter) SeekGE(key []byte) *Span {
 	i.iterSpan = i.iter.SeekGE(key)
 	if i.iterSpan == nil {
@@ -216,30 +217,28 @@ func (i *DefragmentingIter) SeekGE(key []byte) *Span {
 		i.iterPos = iterPosCurr
 		return i.iterSpan
 	}
-	// Save the current span and peek backwards.
-	i.saveCurrent()
-	i.iterSpan = i.iter.Prev()
-	if i.iterSpan != nil && i.equal(i.curr.Start, i.iterSpan.End) && i.checkEqual(i.iterSpan, &i.curr) {
-		// A continuation. The span we originally landed on and defragmented
-		// backwards has a true Start key < key. To obey the FragmentIterator
-		// contract, we must not return this defragmented span. Defragment
-		// forward to finish defragmenting the span in the forward direction.
-		i.defragmentForward()
-
-		// Now we must be on a span that truly has a defragmented Start key >
-		// key.
+	// If the span starts strictly after key, we know there mustn't be an
+	// earlier span that ends at i.iterSpan.Start, otherwise i.iter would've
+	// returned that span instead.
+	if i.comparer.Compare(i.iterSpan.Start, key) > 0 {
 		return i.defragmentForward()
 	}

-	// The span previous to i.curr does not defragment, so we should return it.
-	// Next the underlying iterator back onto the span we previously saved to
-	// i.curr and then defragment forward.
-	i.iterSpan = i.iter.Next()
+	// The span we landed on has a Start bound ≤ key. There may be additional
+	// fragments before this span. Defragment backward to find the start of the
+	// defragmented span.
+	i.defragmentBackward()
+	if i.iterPos == iterPosPrev {
+		// Next once back onto the span.
+		i.iterSpan = i.iter.Next()
+	}
+	// Defragment the full span from its start.
 	return i.defragmentForward()
 }

-// SeekLT seeks the iterator to the last span with a start key less than
-// key and returns it.
+// SeekLT moves the iterator to the last span covering a key less than the
+// given key. This is equivalent to seeking to the last span with a start
+// key less than the given key.
 func (i *DefragmentingIter) SeekLT(key []byte) *Span {
 	i.iterSpan = i.iter.SeekLT(key)
 	if i.iterSpan == nil {
@@ -249,7 +248,16 @@ func (i *DefragmentingIter) SeekLT(key []byte) *Span {
 		i.iterPos = iterPosCurr
 		return i.iterSpan
 	}
-	// Defragment forward to find the end of the defragmented span.
+	// If the span ends strictly before key, we know there mustn't be a later
+	// span that starts at i.iterSpan.End, otherwise i.iter would've returned
+	// that span instead.
+	if i.comparer.Compare(i.iterSpan.End, key) < 0 {
+		return i.defragmentBackward()
+	}
+
+	// The span we landed on has a End bound ≥ key. There may be additional
+	// fragments after this span. Defragment forward to find the end of the
+	// defragmented span.
 	i.defragmentForward()
 	if i.iterPos == iterPosNext {
 		// Prev once back onto the span.
25 changes: 7 additions & 18 deletions internal/keyspan/interleaving_iter.go
@@ -824,27 +824,16 @@ func (i *InterleavingIter) interleaveBackward() (*base.InternalKey, base.LazyVal
 	}
 }

-// keyspanSeekGE seeks the keyspan iterator to the first span covering k ≥ key.
-// Note that this differs from the FragmentIterator.SeekGE semantics, which
-// seek to the first span with a start key ≥ key.
-func (i *InterleavingIter) keyspanSeekGE(key []byte, prefix []byte) {
-	// Seek using SeekLT to look for a span that starts before key, with an end
-	// boundary extending beyond key.
-	i.span = i.keyspanIter.SeekLT(key)
-	if i.span == nil || i.cmp(i.span.End, key) <= 0 {
-		// The iterator is exhausted in the reverse direction, or the span we
-		// found ends before key. Next to the first key with a start ≥ key.
-		i.span = i.keyspanIter.Next()
-	}
+// keyspanSeekGE seeks the keyspan iterator to the first span covering a key ≥ k.
+func (i *InterleavingIter) keyspanSeekGE(k []byte, prefix []byte) {
+	i.span = i.keyspanIter.SeekGE(k)
 	i.checkForwardBound(prefix)
 	i.savedKeyspan()
 }

-// keyspanSeekLT seeks the keyspan iterator to the last span covering k < key.
-// Note that this differs from the FragmentIterator.SeekLT semantics, which
-// seek to the last span with a start key < key.
-func (i *InterleavingIter) keyspanSeekLT(key []byte) {
-	i.span = i.keyspanIter.SeekLT(key)
+// keyspanSeekLT seeks the keyspan iterator to the last span covering a key < k.
+func (i *InterleavingIter) keyspanSeekLT(k []byte) {
+	i.span = i.keyspanIter.SeekLT(k)
 	i.checkBackwardBound()
 	// The current span's start key is not guaranteed to be less than key,
 	// because of the bounds enforcement. Consider the following example:
Expand All @@ -857,7 +846,7 @@ func (i *InterleavingIter) keyspanSeekLT(key []byte) {
//
// This problem is a consequence of the SeekLT's exclusive search key and
// the fact that we don't perform bounds truncation at every leaf iterator.
if i.span != nil && i.truncated && i.cmp(i.truncatedSpan.Start, key) >= 0 {
if i.span != nil && i.truncated && i.cmp(i.truncatedSpan.Start, k) >= 0 {
i.span = nil
}
i.savedKeyspan()
16 changes: 11 additions & 5 deletions internal/keyspan/iter.go
@@ -18,12 +18,14 @@ import (
 // longer lifetimes but implementations need only guarantee stability until the
 // next positioning method.
 type FragmentIterator interface {
-	// SeekGE moves the iterator to the first span whose start key is greater
-	// than or equal to the given key.
+	// SeekGE moves the iterator to the first span covering a key greater than
+	// or equal to the given key. This is equivalent to seeking to the first
+	// span with an end key greater than the given key.
 	SeekGE(key []byte) *Span

-	// SeekLT moves the iterator to the last span whose start key is less than
-	// the given key.
+	// SeekLT moves the iterator to the last span covering a key less than the
+	// given key. This is equivalent to seeking to the last span with a start
+	// key less than the given key.
 	SeekLT(key []byte) *Span

 	// First moves the iterator to the first span.
@@ -104,19 +106,23 @@ func (i *Iter) Init(cmp base.Compare, spans []Span) {
 func (i *Iter) SeekGE(key []byte) *Span {
 	// NB: manually inlined sort.Search is ~5% faster.
 	//
+	// Define f(j) = true iff the span i.spans[j] is strictly before `key`
+	// (equivalently, i.spans[j].End ≤ key.)
+	//
 	// Define f(-1) == false and f(n) == true.
 	// Invariant: f(index-1) == false, f(upper) == true.
 	i.index = 0
 	upper := len(i.spans)
 	for i.index < upper {
 		h := int(uint(i.index+upper) >> 1) // avoid overflow when computing h
 		// i.index ≤ h < upper
-		if i.cmp(key, i.spans[h].Start) > 0 {
+		if i.cmp(key, i.spans[h].End) >= 0 {
 			i.index = h + 1 // preserves f(i-1) == false
 		} else {
 			upper = h // preserves f(j) == true
 		}
 	}
+
 	// i.index == upper, f(i.index-1) == false, and f(upper) (= f(i.index)) ==
 	// true => answer is i.index.
 	if i.index >= len(i.spans) {
4 changes: 2 additions & 2 deletions internal/keyspan/level_iter.go
@@ -187,6 +187,7 @@ func (l *LevelIter) SeekGE(key []byte) *Span {
 	f := l.findFileGE(key)
 	if f != nil && l.keyType == manifest.KeyTypeRange && l.cmp(key, f.SmallestRangeKey.UserKey) < 0 {
 		prevFile := l.files.Prev()
+		l.files.Next()
 		if prevFile != nil {
 			// We could unconditionally return an empty span between the seek key and
 			// f.SmallestRangeKey, however if this span is to the left of all range
@@ -202,7 +203,6 @@ func (l *LevelIter) SeekGE(key []byte) *Span {
 			//
 			// TODO(bilal): Investigate ways to be able to return straddle spans in
 			// cases similar to the above, while still retaining correctness.
-			l.files.Next()
 			// Return a straddling key instead of loading the file.
 			l.iterFile = f
 			if err := l.Close(); err != nil {
@@ -237,6 +237,7 @@ func (l *LevelIter) SeekLT(key []byte) *Span {
 	f := l.findFileLT(key)
 	if f != nil && l.keyType == manifest.KeyTypeRange && l.cmp(f.LargestRangeKey.UserKey, key) < 0 {
 		nextFile := l.files.Next()
+		l.files.Prev()
 		if nextFile != nil {
 			// We could unconditionally return an empty span between f.LargestRangeKey
 			// and the seek key, however if this span is to the right of all range keys
@@ -252,7 +253,6 @@ func (l *LevelIter) SeekLT(key []byte) *Span {
 			//
 			// TODO(bilal): Investigate ways to be able to return straddle spans in
 			// cases similar to the above, while still retaining correctness.
-			l.files.Prev()
 			// Return a straddling key instead of loading the file.
 			l.iterFile = f
 			if err := l.Close(); err != nil {