[BUG] GeoColumnAccessors should respect slicing #771
Comments
I'm puzzling over a substantial problem and am going to start the conversation about it. Most new algorithms depend on the GeoArrow format, that is that the offsets buffer for a range of The arrow DenseUnion doesn't seem to respect this - that is, the offsets buffer of the DenseUnion is not length

This is important because of slicing. So far we've always tested our algorithms using the "complete" data source, so that all of the points returned by the various buffer calls always respect all the points in the original buffer. `point_in_polygon` revealed a flaw in this approach because of the 31 polygon limit - given an original data source like the `naturalearth_lowres` database, one needs to slice only 31 polygons from this source. Therefore a sliced dataframe should only contain the points that were sliced, so that subsequent accesses to its buffers remain accurate.

I tried to solve this issue by modifying the offset buffer accessors to use the offsets buffer of the union before they return their results. This works well, with one major problem - if I slice two elements out of an offsets buffer, I only get two elements in the offsets buffer, not the

I'd like to keep our design where the underlying points array is not modified, both for memory usage reasons and also to avoid having to sub-slice all the points in order to make a copy. How can we change an offsets buffer slice operation to return element
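To make the off-by-one concrete, here is a minimal pure-Python sketch of the problem described above. It does not use cuSpatial or cudf; the `points` data, the `features` helper, and the chosen slice bounds are all hypothetical and exist only for illustration. It shows that an offsets list describing n features needs n + 1 entries, and that slicing the offsets with the same bounds as the features loses the closing offset.

```python
# Hypothetical data: 8 points grouped into 4 multipoints of 2 points each.
points = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7)]
geometry_offset = [0, 2, 4, 6, 8]  # 4 features need 5 offsets


def features(offsets, points):
    """Decode every feature described by an offsets list."""
    return [points[a:b] for a, b in zip(offsets[:-1], offsets[1:])]


all_features = features(geometry_offset, points)

# Select features 1 and 2. Reusing the feature bounds on the offsets
# keeps only two offsets, which describe a single feature.
naive = geometry_offset[1:3]

# Extending the upper bound by one keeps the closing offset.
fixed = geometry_offset[1:3 + 1]

naive_features = features(naive, points)
fixed_features = features(fixed, points)
```

Note that `fixed` still indexes into the original, unmodified `points` list, which matches the goal of not copying the underlying coordinates.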
It seems like I might have to slice the geometry data, and not just the union_offsets and input_types, so that each GeoSeries fully represents itself. I was able to implement the above discussion easily, so that a sliced dataframe only returns the offsets that were sliced, but this immediately broke all of the `point_linestring_distance` tests since they expect 1 extra value in all of the offsets lists. If I could just do a slice + 1, that'd solve the issue, but I'm not sure how. Slicing can take many forms, particularly list or slice, that are not trivial (or possible?) to identify what

This is a simple example of a failing test that demonstrates what I'm talking about
The dataframe has been sliced from four multipoints down to two, but `sliced.multipoints.geometry_offset` is not length 3, it still has all of the offsets in it. The solution is to slice the geometry offsets, but I don't have a trivial way of slicing to n+1.
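For the contiguous case, one way to slice to n+1 can be sketched in plain Python as follows. This is an illustration only, not the cuSpatial implementation: `slice_offsets` is a hypothetical helper, and it assumes the key is a Python `slice` with a step of 1. It relies on the standard `slice.indices` method to normalize negative and missing bounds before extending the upper bound by one.

```python
def slice_offsets(offsets, key):
    """Return the n + 1 offsets covering the features picked by a unit-step slice.

    Hypothetical helper: `offsets` has one more entry than there are features,
    and `key` is a Python slice over the features.
    """
    n_features = len(offsets) - 1
    start, stop, step = key.indices(n_features)
    if step != 1:
        # Strided selections are not contiguous in the original buffer.
        raise ValueError("only step-1 slices are supported in this sketch")
    stop = max(stop, start)  # an empty selection still keeps one offset
    return offsets[start:stop + 1]


geometry_offset = [0, 2, 4, 6, 8]  # four multipoints

two_features = slice_offsets(geometry_offset, slice(1, 3))
first_two = slice_offsets(geometry_offset, slice(None, 2))
last_two = slice_offsets(geometry_offset, slice(-2, None))
empty = slice_offsets(geometry_offset, slice(3, 1))
```

The returned offsets are not rebased to zero, so they remain valid only against the original coordinate buffer. Selections given as index lists or strided slices do not fit this approach, because the selected features are no longer adjacent in that buffer.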
I think I figured it out.
There are potentially a lot of tests for this, I'm working on that now.
Closes #771

This PR modifies the production of `geometry_offset`, `part_offset`, and `ring_offset` by sampling the existing values in a `GeoSeries` before returning their various offsets. This has the effect of using `cudf.ListSeries` to re-pack any features into a new, dense `GeoColumn`, then returning the offsets based on it. Previously, a `GeoSeries` that had been modified by slicing would have the appearance of the sliced elements, but when its `_offset` buffers were used they would return the full original offset buffer that the sliced `GeoSeries` had originated from. This was a problem because it made slicing useless for our algorithms.

I also modify the `core/spatial/distance.py` and `core/spatial/nearest_points.py` files to use `as_column(linestrings.lines.geometry_offset)` instead of `linestrings.lines.geometry_offset._column` because there doesn't appear to be a reason to use a `cudf.Series` to wrap the offset buffers. They are essentially private methods, don't need indexes, and will eventually be factored out so that they're hidden from the user.

I wrote a new test file `test_geocolumn_accessor.py` to exercise the new {`geometry_buffer`...} accessors for all geometry types. Finally, I added tests for a base case, a more complicated case, and a case with noncontiguous slices to the inputs of `test_point_linestring_distance.py`, validating that the changes have exactly the effect we need.

Authors:
- H. Thomson Comer (https://github.com/thomcom)

Approvers:
- Michael Wang (https://github.com/isVoid)

URL: #776
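The re-packing idea can be illustrated with a small pure-Python sketch. It is only an analogy for what the PR describes: plain lists stand in for the `cudf` columns, and the `repack` function, its parameters, and the sample data are hypothetical. The selected features are gathered into a new dense buffer, and fresh offsets are built from a running sum of the feature lengths, so the result always has n + 1 offsets, even for noncontiguous selections.

```python
from itertools import accumulate


def repack(points, offsets, indices):
    """Gather the selected features into a new dense buffer with fresh offsets.

    Hypothetical helper: `indices` may be any sequence of feature positions,
    contiguous or not.
    """
    selected = [points[offsets[i]:offsets[i + 1]] for i in indices]
    new_points = [p for feature in selected for p in feature]
    # Running sum of feature lengths, prefixed with the leading zero.
    new_offsets = [0, *accumulate(len(feature) for feature in selected)]
    return new_points, new_offsets


# Hypothetical data: four multipoints holding 1, 3, 2, and 2 points.
points = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7)]
geometry_offset = [0, 1, 4, 6, 8]

# A noncontiguous selection: the first and last features.
new_points, new_offsets = repack(points, geometry_offset, [0, 3])
```

Unlike slicing the offsets alone, this approach copies the selected coordinates, trading some memory for offsets that start at zero and describe exactly the selected features.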
I think I solved this, PR coming.
Describe the bug
The most visible interface for a `GeoSeries` respects slicing:
However, the underlying `GeoColumnAccessor` that allows us to access the `GeoArrow` buffers, does not:
Expected behavior
Users should not have to slice their `GeoSeries` and also slice the underlying buffers. When a sliced `GeoSeries` has `GeoColumnAccessors` called, they should return only the coordinates that are part of the `GeoSeries`.