[BUG] `GeoColumnAccessors` should respect slicing #771

thomcom · 2022-11-02T19:53:39Z

I think I solved this, PR coming.

Describe the bug
The most visible interface for a GeoSeries respects slicing:

geoseries = cuspatial.from_geopandas(geopandas.read_file(geopandas.get_path("naturalearth_lowres")))
shorter = geoseries[0:3]
print(shorter)
0    MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1    POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2    POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
Name: geometry, dtype: geometry

However, the underlying GeoColumnAccessor that allows us to access the GeoArrow buffers, does not:

shorter.polygons.geometry_offset
0        0
1        3
2        4
3        5
4       35
      ... 
173    284
174    285
175    286
176    287
177    288
Length: 178, dtype: int32

Expected behavior
Users should not have to slice their GeoSeries and also slice the underlying buffers. When a sliced GeoSeries has GeoColumnAccessors called, they should return only the coordinates that are part of the GeoSeries.

The text was updated successfully, but these errors were encountered:

thomcom · 2022-11-03T14:26:54Z

I'm puzzling over a substantial problem and am going to start the conversation about it. Most new algorithms depend on the GeoArrow format, that is that the offsets buffer for a range of $n$ features is always $n+1$ offsets. The offsets include the first and "last" position of every feature, including the final one.

The arrow DenseUnion doesn't seem to respect this - that is, the offsets buffer of the DenseUnion is not length $n+1$.

This is important because of slicing. So far we've always tested our algorithms using the "complete" data source, so that all of the points returned by the various buffer calls always respect all the points in the original buffer.

point_in_polygon revealed a flaw in this approach because of the 31 polygon limit - given an original data source like the naturalearth_lowres database, one needs to slice only 31 polygons from this source. Therefore a sliced dataframe should only contain the points that were sliced, so that subsequent accesses to its buffers remain accurate.

I tried to solve this issue by modifying the offset buffer accessors to use the offsets buffer of the union before they return their results. This works well, with one major problem - if I slice two elements out of an offsets buffer, I only get two elements in the offsets buffer, not the $n+1$ tail.

I'd like to keep our design where the underlying points array is not modified, both for memory usage reasons and also to avoid having to sub-slice all the points in order to make a copy. How can we change an offsets buffer slice operation to return element $n+1$ at the tail?

thomcom · 2022-11-03T14:27:46Z

It seems like I might have to slice the geometry data, and not just the union_offsets and input_types, so that each GeoSeries fully represents itself.

I was able to implement the above discussion easily, where a sliced dataframe only returns the offsets that were sliced, also, but this immediately broke all of the point_linestring_distance tests since they expect 1 extra value in all of the offsets lists.

If I could just do a slice + 1, that'd solve the issue, but I'm not sure how. Slicing can take many forms, particularly list or slice, that are not trivial (or possible?) to identify what $n+1$ is. Maybe I'm just overlooking something simple.

This is a simple example of a failing test that demonstrates what I'm talking about

def test_point_geometry_offset():
    gs = gpd.GeoSeries(
        [
            MultiPoint([Point(0, 1), Point(1, 0)]),
            MultiPoint([Point(0, 2), Point(2, 0)]),
            MultiPoint([Point(0, 3), Point(3, 0)]),
            MultiPoint([Point(0, 4), Point(4, 0)]),
        ]
    )
    cugs = cuspatial.from_geopandas(gs)
    sliced = cugs[0:2]
    assert len(sliced.multipoints.geometry_offset) == 3

The dataframe has been sliced from four multipoints down to two, but sliced.multipoints.geometry_offset is not length 3, it still has all of the offsets in it. The solution is to slice the geometry offsets, but I don't have a trivial way of slicing to n+1.
If I slice the geometry offsets, all the algorithms that expect n+1 break.

thomcom · 2022-11-03T15:45:18Z

I think I figured it out.

thomcom · 2022-11-03T17:44:36Z

There are potentially a lot of tests for this, I'm working on that now.

Closes #771 This PR modifies the production of `geometry_offset`, `part_offset`, and `ring_offset` by sampling the existing values in a `GeoSeries` before returning their various offsets. This has the effect of using `cudf.ListSeries` to re-pack any features into a new, dense `GeoColumn`, then returning the offsets based on it. Previously, `GeoSeries` that had been modified by slicing would have the appearance of the sliced elements, but when `_offset` buffers were used they would return the full original offset buffer that the sliced `GeoSeries` had originated from. This was a problem because it made slicing useless for our algorithms. I also modify the `core/spatial/distance.py` and `core/spatial/nearest_points.py` files to use `as_column(linestrings.lines.geometry_offset)` instead of `linestrings.lines.geometry_offset._column` because there doesn't appear to be a reason to use a `cudf.Series` to wrap the offset buffers. They are private methods essentially, don't need indexes, and will eventually be factored out so that they're hidden from the user. I wrote a new test file `test_geocolumn_accessor.py` to exercise the new {`geometry_buffer`...} accessors for all geometry types. Finally I added tests for a base case, a more complicated case, and a case with noncontiugous slices to the inputs of `test_point_linestring_distance.py`, validating that the changes have exactly the effect we need. Authors: - H. Thomson Comer (https://github.com/thomcom) Approvers: - Michael Wang (https://github.com/isVoid) URL: #776

thomcom added bug Something isn't working Needs Triage Need team to review and classify labels Nov 2, 2022

thomcom self-assigned this Nov 2, 2022

GPUtester added this to cuSpatial Nov 2, 2022

jarmak-nv moved this to Todo in cuSpatial Nov 3, 2022

thomcom mentioned this issue Nov 3, 2022

Better slicing via union_offsets #776

Merged

3 tasks

rapids-bot bot closed this as completed in #776 Nov 8, 2022

Repository owner moved this from Todo to Done in cuSpatial Nov 8, 2022

harrism removed the Needs Triage Need team to review and classify label Dec 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] `GeoColumnAccessors` should respect slicing #771

[BUG] `GeoColumnAccessors` should respect slicing #771

thomcom commented Nov 2, 2022 •

edited

Loading

thomcom commented Nov 3, 2022 •

edited

Loading

thomcom commented Nov 3, 2022 •

edited

Loading

thomcom commented Nov 3, 2022

thomcom commented Nov 3, 2022

[BUG] GeoColumnAccessors should respect slicing #771

[BUG] GeoColumnAccessors should respect slicing #771

Comments

thomcom commented Nov 2, 2022 • edited Loading

thomcom commented Nov 3, 2022 • edited Loading

thomcom commented Nov 3, 2022 • edited Loading

thomcom commented Nov 3, 2022

thomcom commented Nov 3, 2022

[BUG] `GeoColumnAccessors` should respect slicing #771

[BUG] `GeoColumnAccessors` should respect slicing #771

thomcom commented Nov 2, 2022 •

edited

Loading

thomcom commented Nov 3, 2022 •

edited

Loading

thomcom commented Nov 3, 2022 •

edited

Loading