Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] GeoColumnAccessors should respect slicing #771

Closed
thomcom opened this issue Nov 2, 2022 · 4 comments · Fixed by #776
Closed

[BUG] GeoColumnAccessors should respect slicing #771

thomcom opened this issue Nov 2, 2022 · 4 comments · Fixed by #776
Assignees
Labels
bug Something isn't working

Comments

@thomcom
Copy link
Contributor

thomcom commented Nov 2, 2022

I think I solved this, PR coming.

Describe the bug
The most visible interface for a GeoSeries respects slicing:

geoseries = cuspatial.from_geopandas(geopandas.read_file(geopandas.get_path("naturalearth_lowres")))
shorter = geoseries[0:3]
print(shorter)
0    MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1    POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2    POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
Name: geometry, dtype: geometry

However, the underlying GeoColumnAccessor that allows us to access the GeoArrow buffers, does not:

shorter.polygons.geometry_offset
0        0
1        3
2        4
3        5
4       35
      ... 
173    284
174    285
175    286
176    287
177    288
Length: 178, dtype: int32

Expected behavior
Users should not have to slice their GeoSeries and also slice the underlying buffers. When a sliced GeoSeries has GeoColumnAccessors called, they should return only the coordinates that are part of the GeoSeries.

@thomcom thomcom added bug Something isn't working Needs Triage Need team to review and classify labels Nov 2, 2022
@thomcom thomcom self-assigned this Nov 2, 2022
@thomcom
Copy link
Contributor Author

thomcom commented Nov 3, 2022

I'm puzzling over a substantial problem and am going to start the conversation about it. Most new algorithms depend on the GeoArrow format, that is that the offsets buffer for a range of $n$ features is always $n+1$ offsets. The offsets include the first and "last" position of every feature, including the final one.

The arrow DenseUnion doesn't seem to respect this - that is, the offsets buffer of the DenseUnion is not length $n+1$.

This is important because of slicing. So far we've always tested our algorithms using the "complete" data source, so that all of the points returned by the various buffer calls always respect all the points in the original buffer.

point_in_polygon revealed a flaw in this approach because of the 31 polygon limit - given an original data source like the naturalearth_lowres database, one needs to slice only 31 polygons from this source. Therefore a sliced dataframe should only contain the points that were sliced, so that subsequent accesses to its buffers remain accurate.

I tried to solve this issue by modifying the offset buffer accessors to use the offsets buffer of the union before they return their results. This works well, with one major problem - if I slice two elements out of an offsets buffer, I only get two elements in the offsets buffer, not the $n+1$ tail.

I'd like to keep our design where the underlying points array is not modified, both for memory usage reasons and also to avoid having to sub-slice all the points in order to make a copy. How can we change an offsets buffer slice operation to return element $n+1$ at the tail?

@thomcom
Copy link
Contributor Author

thomcom commented Nov 3, 2022

It seems like I might have to slice the geometry data, and not just the union_offsets and input_types, so that each GeoSeries fully represents itself.

I was able to implement the above discussion easily, where a sliced dataframe only returns the offsets that were sliced, also, but this immediately broke all of the point_linestring_distance tests since they expect 1 extra value in all of the offsets lists.

If I could just do a slice + 1, that'd solve the issue, but I'm not sure how. Slicing can take many forms, particularly list or slice, that are not trivial (or possible?) to identify what $n+1$ is. Maybe I'm just overlooking something simple.

This is a simple example of a failing test that demonstrates what I'm talking about

def test_point_geometry_offset():
    gs = gpd.GeoSeries(
        [
            MultiPoint([Point(0, 1), Point(1, 0)]),
            MultiPoint([Point(0, 2), Point(2, 0)]),
            MultiPoint([Point(0, 3), Point(3, 0)]),
            MultiPoint([Point(0, 4), Point(4, 0)]),
        ]
    )
    cugs = cuspatial.from_geopandas(gs)
    sliced = cugs[0:2]
    assert len(sliced.multipoints.geometry_offset) == 3

The dataframe has been sliced from four multipoints down to two, but sliced.multipoints.geometry_offset is not length 3, it still has all of the offsets in it. The solution is to slice the geometry offsets, but I don't have a trivial way of slicing to n+1.
If I slice the geometry offsets, all the algorithms that expect n+1 break.

@thomcom
Copy link
Contributor Author

thomcom commented Nov 3, 2022

I think I figured it out.

@jarmak-nv jarmak-nv moved this to Todo in cuSpatial Nov 3, 2022
@thomcom
Copy link
Contributor Author

thomcom commented Nov 3, 2022

There are potentially a lot of tests for this, I'm working on that now.

@rapids-bot rapids-bot bot closed this as completed in #776 Nov 8, 2022
rapids-bot bot pushed a commit that referenced this issue Nov 8, 2022
Closes #771 

This PR modifies the production of `geometry_offset`, `part_offset`, and `ring_offset` by sampling the existing values in a `GeoSeries` before returning their various offsets. This has the effect of using `cudf.ListSeries` to re-pack any features into a new, dense `GeoColumn`, then returning the offsets based on it.

Previously, `GeoSeries` that had been modified by slicing would have the appearance of the sliced elements, but when `_offset` buffers were used they would return the full original offset buffer that the sliced `GeoSeries` had originated from. This was a problem because it made slicing useless for our algorithms.

I also modify the `core/spatial/distance.py` and `core/spatial/nearest_points.py` files to use `as_column(linestrings.lines.geometry_offset)` instead of `linestrings.lines.geometry_offset._column` because there doesn't appear to be a reason to use a `cudf.Series` to wrap the offset buffers. They are private methods essentially, don't need indexes, and will eventually be factored out so that they're hidden from the user.

I wrote a new test file `test_geocolumn_accessor.py` to exercise the new {`geometry_buffer`...} accessors for all geometry types. 

Finally I added tests for a base case, a more complicated case, and a case with noncontiugous slices to the inputs of `test_point_linestring_distance.py`, validating that the changes have exactly the effect we need.

Authors:
  - H. Thomson Comer (https://github.com/thomcom)

Approvers:
  - Michael Wang (https://github.com/isVoid)

URL: #776
Repository owner moved this from Todo to Done in cuSpatial Nov 8, 2022
@harrism harrism removed the Needs Triage Need team to review and classify label Dec 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants