Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Flatbush RTree for Spatial Joins #13079

Merged
merged 5 commits into from
Aug 22, 2019
Merged

Conversation

jagill
Copy link
Contributor

@jagill jagill commented Jul 10, 2019

There are memory problems with spatial joins, potentially due to the
large number of objects created by the JTS RTree. A Flatbush is an
RTree with a low memory footprint, minimal heap allocation, and fast
lookup. As the name would suggest, it is very flat -- the tree itself
is just a DoubleBigArray, with a couple other small fields. Lookup is
also pretty fast: the main difficulty is some index ninjitsu to handle
parent/child relationships.

This PR is branched off of #13235, which contains formatting and
other changes that are good independent of this one.

== RELEASE NOTES ==
General
--------
* Use Hilbert Packed RTree (Flatbush) instead of the JTS STR RTree for spatial joins.  The Flatbush is a flattened tree that uses vastly fewer heap allocations, and has faster build and query times.

@jagill
Copy link
Contributor Author

jagill commented Jul 10, 2019

cc @viczhang861 @mbasmanova

@jagill
Copy link
Contributor Author

jagill commented Jul 11, 2019

cc @jwass

@jagill jagill force-pushed the flatbush branch 4 times, most recently from 5475e37 to 57bcd97 Compare July 16, 2019 21:49
@jagill jagill changed the title WIP Initial implementation of Flatbush RTree Use Flatbush RTree instead of JTS RTree for Spatial Joins Jul 25, 2019
@jagill jagill force-pushed the flatbush branch 2 times, most recently from fd4d8db to 823a83b Compare July 31, 2019 22:11
@jagill jagill force-pushed the flatbush branch 2 times, most recently from 87e0eac to 9123371 Compare August 13, 2019 02:33
@mbasmanova
Copy link
Contributor

@jagill James, thanks for replacing JTS R-Tree with the one that doesn't create so many objects. Where can I read about the Flatbush you are using? Is https://github.com/mourner/flatbush/ the right place?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore order of rows in spatial join tests

Looks good, but I would split it into two commits: (1) reformat the code; (2) switch to assertOperatorEqualsIgnoreOrder

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add HasExtent interface

Looks good.

return this;
}

public long getSizeInBytes()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

public interface HasExtent
{
Rectangle getExtent();
long getSizeInBytes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I prefer getEstimatedSize or getEstimatedSizeInBytes because often we can't know exactly how much memory an object is using and return an estimate or approximation.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing comments as I ran out of time.

Implement Flatbush Hilbert RTree

Commit message is indented (most likely unintentionally) - would you fix that?

*/
public class Flatbush<T extends HasExtent>
{
private static final int FLATBUSH_INSTANCE_SIZE = ClassLayout.parseClass(Flatbush.class).instanceSize();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FLATBUSH_INSTANCE_SIZE -> INSTANCE_SIZE by convention

checkArgument(degree > 0, "degree must be positive");
this.degree = degree;
this.items = requireNonNull(items, "items is null");
this.levelOffsets = Flatbush.calculateLevelIndices(items.length, degree);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Flatbush.calculateLevelIndices -> calculateLevelIndices
  • calculateLevelIndices -> calculateLevelOffsets to avoid using multiple terms for the same thing

this.degree = degree;
this.items = requireNonNull(items, "items is null");
this.levelOffsets = Flatbush.calculateLevelIndices(items.length, degree);
this.tree = this.buildTree();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this.buildTree() -> buildTree()

*/
private static int[] calculateLevelIndices(int numItems, int degree)
{
List<Integer> levelIndices = new ArrayList<Integer>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • levelIndices -> offsets
  • new ArrayList() -> new ArrayList<>()
        List<Integer> offsets = new ArrayList<>();

{
List<Integer> levelIndices = new ArrayList<Integer>();
// Leaf nodes start at 0, root is the last element.
levelIndices.add(Integer.valueOf(0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer.valueOf(0) -> 0

offsets.add(0);

return items;
}

public int getHeight()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method is not used

DoubleBigArray tree = new DoubleBigArray(Double.NaN);
tree.ensureCapacity(levelOffsets[levelOffsets.length - 1] + ENVELOPE_SIZE);

sortByHilberIndex(this.items);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this.items -> items

sortByHilberIndex(this.items);

int writeOffset = 0;
int readOffset;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this variable is not used until next loop; move it there

}

int numChildren = items.length;
for (int childLevel = 0; childLevel < levelOffsets.length - 1; childLevel++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

childLevel -> level

@jagill
Copy link
Contributor Author

jagill commented Aug 15, 2019

@mbasmanova For further reading on the flatbush, the mourner/flatbush repo is a good start; it also links to the wikipedia page which describes the general class/algorith: https://en.wikipedia.org/wiki/Hilbert_R-tree#Packed_Hilbert_R-trees

Note in this case we only sort the leaf notes, and rely on the hierarchy of the Hilbert curve to keep all parent nodes roughly in the right place.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jagill Finished reading Implement Flatbush Hilbert RTree commit. Overall looks good. Here are some comments.

yMax = max(yMax, tree.get(readOffset++));

// Every `degree` children, make a parent
if (readOffset % (degree * ENVELOPE_SIZE) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (child % degree == 0) { might be clearer and make the comment unnecessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts re:

if (child % degree == 0) { might be clearer and make the comment unnecessary

?

}
}
// Add any "dangling" parent not added above
if (readOffset % (degree * ENVELOPE_SIZE) != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (numChildren % degree > 0) { might be clearer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts re:

if (readOffset % degree > 0) { might be clearer and make the comment unnecessary

?

int level = 0;
while (numItems > 1) {
// The level capacity will be the smallest multiple of degree gte numItems
int capacity = ((numItems + degree - 1) / degree) * degree;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capacity -> numChildren to match buildTree?

import com.facebook.presto.array.DoubleBigArray;
import com.facebook.presto.geospatial.Rectangle;
import it.unimi.dsi.fastutil.ints.IntArrayList;
import org.locationtech.jts.index.ItemVisitor;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not add this dependency. This interface is trivial. Let's just make our own.

    public interface ItemVisitor<T>
    {
        void visitItem(T item);
    }

}
}

public List<T> findIntersections(Rectangle query)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used only in TestFlatbush. I'd move it there and use results::add:

    private static List<Rectangle> findIntersections(Flatbush<Rectangle> flatbush, Rectangle rectangle)
    {
        List<Rectangle> results = new ArrayList<>();
        flatbush.findIntersections(rectangle, results::add);
        return results;
    }

}

@Test
public void testRandomRects100Probe1000Build()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Don't abbreviate.
  • I think a better structure would to be have helper method take two lists of rectangles and have the tests compose these lists as they see fit.
    @Test
    public void testRandomRectanglesSmall()
    {
        Random random = new Random(42);
        assertRectangles(makeRectangles(random, 1_000), makeRectangles(random, 100));
    }

    @Test
    public void testRandomRectanglesMedium()
    {
        Random random = new Random(123);
        assertRectangles(makeRectangles(random, 10_000), makeRectangles(random, 1_000));
    }

    @Test
    public void testRandomRectanglesLarge()
    {
        Random random = new Random(321);
        assertRectangles(makeRectangles(random, 50_000), makeRectangles(random, 5_000));
    }

    private static void assertRectangles(List<Rectangle> buildRectanges, List<Rectangle> probeRectanges)
    {
        Flatbush<Rectangle> rtree = new Flatbush<>(buildRectanges.toArray(new Rectangle[] {}));
        for (Rectangle probe : probeRectanges) {
            List<Rectangle> expected = buildRectanges.stream()
                    .filter(rectangle -> rectangle.intersects(probe))
                    .collect(toImmutableList());
            assertEqualsNoOrder(findIntersections(rtree, probe), expected);
        }
    }

    private static void assertEqualsNoOrder(List<Rectangle> actual, List<Rectangle> expected)
    {
        List<Rectangle> actualSorted = actual.stream().sorted(RECTANGLE_COMPARATOR).collect(toImmutableList());
        List<Rectangle> expectedSorted = expected.stream().sorted(RECTANGLE_COMPARATOR).collect(toImmutableList());

        assertEquals(actualSorted, expectedSorted);
    }

OGCGeometryWrapper pointW = new OGCGeometryWrapper(POINT_W);

Flatbush rtree = new Flatbush<OGCGeometryWrapper>(new OGCGeometryWrapper[] {
pointX, pointY, pointZ, pointW
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to the previous line:

Flatbush<OGCGeometryWrapper> rtree = new Flatbush<>(new OGCGeometryWrapper[] {pointX, pointY, pointZ, pointW});


List<OGCGeometryWrapper> resultsA = rtree.findIntersections(octagonA.getExtent());
resultsA.sort(Comparator.naturalOrder());
assertEquals(resultsA, Arrays.asList(pointX, pointY));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use assertEqualsNoOrder helper function

        assertEqualsNoOrder(findIntersections(rtree, octagonA.getExtent()), ImmutableList.of(pointX, pointY), Comparator.naturalOrder());
        assertEqualsNoOrder(findIntersections(rtree, octagonB.getExtent()), ImmutableList.of(pointY, pointZ), Comparator.naturalOrder());

    private static <T> void assertEqualsNoOrder(List<T> actual, List<T> expected, Comparator<T> comparator)
    {
        List<T> actualSorted = actual.stream().sorted(comparator).collect(toImmutableList());
        List<T> expectedSorted = expected.stream().sorted(comparator).collect(toImmutableList());

        assertEquals(actualSorted, expectedSorted);
    }

return createFromEsriGeometry(new Point(x, y), null);
}

public static double randomDouble(Random rng, double minVal, double maxVal)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Don't abbreviate
  • rng -> random, minVal -> min, maxVal -> max
  • makeRandomRectangle -> makeRectangle; make it private
  • makeRandomRectangles -> makeRectanges
public static List<Rectangle> makeRectangles(Random random, int count)
    {
        return IntStream.range(0, count)
                .mapToObj(i -> makeRectangle(random))
                .collect(toImmutableList());
    }

    private static double randomDouble(Random random, double min, double max)
    {
        return min + random.nextDouble() * (max - min);
    }

    /*
     * Make a random rectangle "near" origin of size "about" size.
     */
    private static Rectangle makeRectangle(Random random)
    {
        double minX = randomDouble(random, -100, 100);
        double minY = randomDouble(random, -100, 100);
        double sizeX = randomDouble(random, 0.0, 10);
        double sizeY = randomDouble(random, 0.0, 10);
        return new Rectangle(minX, minY, minX + sizeX, minY + sizeY);
    }


import static com.esri.core.geometry.ogc.OGCGeometry.createFromEsriGeometry;

public final class TestUtils
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very generic name. Perhaps, RectangleTestUtils?

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Flatbush instead of JTS STR RTree for spatial joins

This is awesome!!!

I'd change commit title to Use Flatbush R-Tree in spatial join

@@ -155,11 +112,11 @@ private static Envelope getEnvelope(OGCGeometry ogcGeometry)

IntArrayList matchingPositions = new IntArrayList();

Envelope envelope = getEnvelope(probeGeometry);
rtree.query(envelope, item -> {
Rectangle queryRect = GeometryUtils.getExtent(probeGeometry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import + inline: rtree.findIntersections(getExtent(probeGeometry), item -> {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't inline, because it's used in the closure (for testReferencePoint).

@@ -221,4 +178,60 @@ public void appendTo(int joinPosition, PageBuilder pageBuilder, int outputChanne
outputChannelOffset++;
}
}

public static final class GeometryWithPosition
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move this class in a separate commit?

@mbasmanova mbasmanova self-assigned this Aug 15, 2019
@jagill jagill changed the title Use Flatbush RTree instead of JTS RTree for Spatial Joins Use Flatbush RTree for Spatial Joins Aug 15, 2019
@jagill jagill force-pushed the flatbush branch 3 times, most recently from 3bff965 to 05cff9d Compare August 16, 2019 14:53
Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Flatbush instead of JTS STR RTree for spatial joins

Very nice. Thank you!

Envelope envelope = getEnvelope(probeGeometry);
rtree.query(envelope, item -> {
GeometryWithPosition geometryWithPosition = (GeometryWithPosition) item;
Rectangle queryRect = getExtent(probeGeometry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not abbreviate: queryRect -> query or rectangle

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding benchmarks. Some of them show very high error bounds. Would you adjust the settings on the benchmarks to bring errors down?

293.261 ± 41.055
444.502 ± 86.900

{
private static final int SEED = 613;

public static void main(String[] args)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move main method to the end of the class; ditto other benchmarks

@Param({"8", "16", "32"})
private int rtreeDegree;
@Param({"1000", "3000", "10000", "30000", "100000", "300000", "1000000"})
private int numBuildRects;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't abbreviate: numBuildRects -> numRectangles and buildRects -> rectangles

public static class BenchmarkData
{
@Param({"1000", "3000", "10000", "30000", "100000", "300000", "1000000"})
private int numBuildEnvs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't abbreviate

@Param({"1000", "3000", "10000", "30000", "100000", "300000", "1000000"})
private int numBuildEnvs;

private List<Envelope> buildEnvs;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't abbreviate throughout this commit

@mbasmanova
Copy link
Contributor

CC: @pettyjamesm

@jagill jagill force-pushed the flatbush branch 2 times, most recently from 803aa45 to c79c59a Compare August 17, 2019 03:55
@jagill
Copy link
Contributor Author

jagill commented Aug 20, 2019

cc @mourner: This PR ports your Flatbush to Java for use with Presto. Just thought you should know, and happy to hear any feedback you have!

@jagill
Copy link
Contributor Author

jagill commented Aug 20, 2019

cc @rawrunprotected We used your log(n) 2D hilbert curve algorithm here. Just wanted to let you know, and to see if you had any feedback! We had to port to Java signed ints (wrapped in a long), because Java doesn't have unsigned ints :(.

@rawrunprotected
Copy link

cc @rawrunprotected We used your log(n) 2D hilbert curve algorithm here. Just wanted to let you know, and to see if you had any feedback! We had to port to Java signed ints (wrapped in a long), because Java doesn't have unsigned ints :(.

Thanks for letting me know, awesome to see my stuff being used :)

I haven't touched Java in a decade, but I think the code should work fine with just plain ints and without explicit masking to 16bit intermediates. Bitwise operations don't care about signedness, and the code already uses logical right shifts (>>>), so unsigned vs signed doesn't really matter at all. The only caveat is that you'll want to zero-extend the result at the very end to long in order to allow using it as a non-negative offset. That can be done with a quick pair of shifts. See attached, but I only did some very superficial testing ;)

hilbert.txt

@mourner
Copy link

mourner commented Aug 20, 2019

@jagill this is awesome, thanks a lot for letting me know! Can you share the benchmark results here to get an idea for the improvement?

@rawrunprotected thanks so much for chiming in here and for the amazing original implementation! In the JS version (which this PR ports), I use it like this, although it could probably be simplified further because I limit the input to 16-bit uints (plenty for spatial indexing needs) so that the resulting hilbert values can be stored in a uint32 array.

jagill added 5 commits August 20, 2019 17:45
Spatial joins don't have a defined output order, so don't test for that.
RTrees can take anything that exposes a bounding-box extent.  Create an
interface for this (that the RTree will use later), and implement it for
Rectangle.  Also, add a containment function in Rectangle, and implement
helper functions for HasExtent in GeometryUtils.
There are memory problems with spatial joins, potentially due to the
large number of objects created by the JTS RTree.  The JTS tree used for
the build side of spatial joins creates many objects and may not report
its memory precisely.  A Flatbush is a flattened, bulk-loaded immutable
RTree that stores the tree as a single array of doubles.  Its internal
fields are just a few primitive fields, a shared pointer to the input
geometries, a small (~5-6 size) array of ints, and a DoubleBigArray of a
size approximately the same as the input geometries.

Construction involves sorting the geometries by their Hilbert curve
index, which has good hierarchical properties and makes construction of
the rest of the tree straightforward.
The flatbush provides performance benefits over the JTS tree, with over
4x query speed at 1M build geometries, and 2.5x faster builds.

Benchmarks (included in change)
Query Benchmark                     (numBuildRects)  Mode  Cnt    Score    Error  Units
JTS STR RTree
BenchmarkJtsStrTreeQuery.rtreeQuery            1000  avgt   20    0.439 ±  0.006  us/op
BenchmarkJtsStrTreeQuery.rtreeQuery            3000  avgt   20    0.778 ±  0.010  us/op
BenchmarkJtsStrTreeQuery.rtreeQuery           10000  avgt   20    1.675 ±  0.017  us/op
BenchmarkJtsStrTreeQuery.rtreeQuery           30000  avgt   20    4.208 ±  0.565  us/op
BenchmarkJtsStrTreeQuery.rtreeQuery          100000  avgt   20   18.533 ±  0.786  us/op
BenchmarkJtsStrTreeQuery.rtreeQuery          300000  avgt   20   59.184 ±  1.175  us/op
BenchmarkJtsStrTreeQuery.rtreeQuery         1000000  avgt   20  250.904 ± 10.420  us/op

Flatbush (degree 16)
BenchmarkFlatbushQuery.rtreeQuery              1000  avgt   20    0.578 ±  0.019  us/op
BenchmarkFlatbushQuery.rtreeQuery              3000  avgt   20    0.902 ±  0.013  us/op
BenchmarkFlatbushQuery.rtreeQuery             10000  avgt   20    1.667 ±  0.018  us/op
BenchmarkFlatbushQuery.rtreeQuery             30000  avgt   20    3.329 ±  0.033  us/op
BenchmarkFlatbushQuery.rtreeQuery            100000  avgt   20    9.435 ±  0.599  us/op
BenchmarkFlatbushQuery.rtreeQuery            300000  avgt   20   27.564 ±  0.851  us/op
BenchmarkFlatbushQuery.rtreeQuery           1000000  avgt   20   72.725 ±  0.582  us/op

Build Benchmark                     (numBuildRects)  Mode  Cnt     Score    Error  Units
JTS STR RTree
BenchmarkJtsStrTreeBuild.buildRtree            1000  avgt   20     0.306 ±  0.020  ms/op
BenchmarkJtsStrTreeBuild.buildRtree            3000  avgt   20     1.077 ±  0.023  ms/op
BenchmarkJtsStrTreeBuild.buildRtree           10000  avgt   20     4.542 ±  0.055  ms/op
BenchmarkJtsStrTreeBuild.buildRtree           30000  avgt   20    15.837 ±  0.414  ms/op
BenchmarkJtsStrTreeBuild.buildRtree          100000  avgt   20    77.911 ±  2.883  ms/op
BenchmarkJtsStrTreeBuild.buildRtree          300000  avgt   20   309.898 ±  7.493  ms/op
BenchmarkJtsStrTreeBuild.buildRtree         1000000  avgt   20  1325.381 ± 67.566  ms/op

Flatbush (degree 16)
BenchmarkFlatbushBuild.buildRtree              1000  avgt   20     0.503 ±  0.007  ms/op
BenchmarkFlatbushBuild.buildRtree              3000  avgt   20     1.786 ±  0.017  ms/op
BenchmarkFlatbushBuild.buildRtree             10000  avgt   20     3.213 ±  0.154  ms/op
BenchmarkFlatbushBuild.buildRtree             30000  avgt   20     9.667 ±  0.389  ms/op
BenchmarkFlatbushBuild.buildRtree            100000  avgt   20    30.333 ±  0.559  ms/op
BenchmarkFlatbushBuild.buildRtree            300000  avgt   20    98.058 ±  2.014  ms/op
BenchmarkFlatbushBuild.buildRtree           1000000  avgt   20   367.947 ±  5.481  ms/op
There are memory problems with spatial joins, potentially due to the
large number of objects created by the JTS RTree.  The JTS tree used for
the build side of spatial joins creates many objects and may not report
its memory precisely.  A Flatbush creates very few objects, with a low
memory footprint, as well as faster build/query times.
@jagill
Copy link
Contributor Author

jagill commented Aug 20, 2019

@rawrunprotected Thanks! That little change sped up build times about 10%. Prior to using (the first version of) your code, the building bottleneck was the Hilbert index calculation, using the bit twiddling massively sped up the sorting.

@mourner I've included some benchmarks in the description of the commit that adds them: c1d6b4d . In general, both querying and building are about 4x faster when we get to 1M geometries. Critically, it has a better scaling factor than the JTS STR RTree.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add HasExtent interface

Looks good.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement Flatbush Hilbert RTree

Looks good.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Flatbush instead of JTS STR RTree for spatial joins

Looks good.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignore order of rows in spatial join tests

Thanks.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Flatbush RTree for Spatial Joins

Looks good.

Copy link
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jagill Looks good to me. This is a very nice and desirable improvement for the spatial joins. Thanks for researching and implementing such an elegant fix.

}

HilbertIndex hilbert = new HilbertIndex(totalExtent);
Arrays.parallelSort(items, Comparator.comparing(item ->
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jagill is it inefficient to compute the hilbert index dynamically like this? Won't there be many computations of the same index?

Copy link
Contributor Author

@jagill jagill Oct 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index is calculated on average O(ln N) times per entry, as you said. I originally used davidmoten/hilbert-curve, and (via profiling) index calculation did indeed dominate construction CPU. However, this new calculation is very efficient (all bit twiddling), and so now it's only a few % of construction CPU.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, makes sense, and certainly simpler than maintaining a pre-computed value.


private boolean doesIntersect(Rectangle query, int nodeIndex)
{
return query.getXMax() >= tree.get(nodeIndex) // xMin
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this logic compute "query contains node", or "query intersects node"?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants