Feature request: Support for FixedSizeList #4014

Closed
kylebarron opened this issue Jul 14, 2022 · 11 comments

Comments

@kylebarron
Contributor

Along with @stuartlynn, I've been working on https://github.com/kylebarron/geopolars to extend polars to add support for geospatial data, much like GeoPandas extends Pandas (see also polars issues #1830, #3208).

With Arrow, the whole ecosystem benefits when a common memory layout is used. There's been a lot of work in https://github.com/geopandas/geo-arrow-spec to define common ways to store vector geospatial data (points, lines, polygons, etc) in Arrow memory. Right now, two alternate layouts are defined in the spec:

  • geoarrow.wkb: use a Binary column where geometries are stored in Well-Known Binary format. WKB is common in the geo-world, but this is a less performant storage format; coordinates can't be accessed with zero copy and parsing is O(n).
  • An Arrow-native encoding using a combination of List and FixedSizeList (spec). This is more performant because geometry access to any coordinate is possible in O(1) time and zero-copy access is possible. For example:
    • 2D Points: geoarrow.point: FixedSizeList<f64>[2]
    • 2D Lines: geoarrow.linestring: List<FixedSizeList<f64>[2]>
    • 2D MultiPolygons: geoarrow.multipolygon: List<List<List<FixedSizeList<f64>[2]>>>

Therefore, to support the current version of the geo-arrow-spec, FixedSizeList would be a necessary data type.
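To make the difference between the two layouts concrete, here is a minimal pure-Python sketch using only the standard library; it is not pyarrow or polars code. It assumes little-endian WKB Point records and models a `FixedSizeList<f64>[2]` column as one flat buffer of doubles. The helper names `wkb_point`, `read_wkb_points` and `native_point` are hypothetical and exist only for illustration.

```python
import struct
from array import array

# WKB Point: 1-byte byte-order flag (1 = little endian), uint32 geometry
# type (1 = Point), then x and y as float64.
WKB_POINT = struct.Struct("<BIdd")


def wkb_point(x, y):
    """Encode one 2D point as little-endian WKB (illustrative helper)."""
    return WKB_POINT.pack(1, 1, x, y)


def read_wkb_points(blobs):
    """Decode every WKB blob; each geometry has to be parsed before use."""
    points = []
    for blob in blobs:
        _order, _geom_type, x, y = WKB_POINT.unpack(blob)
        points.append((x, y))
    return points


def native_point(coords, i, size=2):
    """Read row i of a FixedSizeList<f64>[size] column held in one flat buffer."""
    start = i * size  # plain index arithmetic, no parsing step
    return tuple(coords[start:start + size])


points = [(4.9, 52.37), (-0.13, 51.51), (2.35, 48.86)]

# geoarrow.wkb: one binary value per row.
wkb_column = [wkb_point(x, y) for x, y in points]

# geoarrow.point: a single contiguous buffer x0, y0, x1, y1, ...
flat_coords = array("d", [v for point in points for v in point])

decoded = read_wkb_points(wkb_column)
third = native_point(flat_coords, 2)
```

In the WKB case every row has to be unpacked before a coordinate is usable, whereas in the flat buffer the position of any coordinate follows directly from the row index and the fixed list size.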

Arrow2 supports FixedSizeList. Beyond that, I don't know the polars codebase well enough to know how much work it would be to add and support FixedSizeList. Would it be possible to reuse existing List support for FixedSizeList?

Thoughts? I would be open to submitting a PR for this as well.

Appendix

Current Behavior

When trying to load this Arrow file (cities-geoarrow.arrow.zip), with schema:

pyarrow.Table
name: string
geometry: fixed_size_list<xy: double not null>[2]
  child 0, xy: double not null

into Polars using table = pyarrow.feather.read_table(path); polars.from_arrow(table) it errors with:

Cannot create polars series from FixedSizeList(Field { name: "xy", data_type: Float64, is_nullable: false, metadata: {} }, 2) type

Example files:

  • cities-geoarrow.arrow.zip: dataset with Point geometries in a fixed_size_list<xy: double not null>[2] column
  • nationalpark.arrow.zip: dataset with MultiPolygon geometries in a column:
    geometry: list<item: list<item: list<item: fixed_size_list<xy: double not null>[2]>>>
      child 0, item: list<item: list<item: fixed_size_list<xy: double not null>[2]>>
          child 0, item: list<item: fixed_size_list<xy: double not null>[2]>
              child 0, item: fixed_size_list<xy: double not null>[2]
                  child 0, xy: double not null
    
@ritchie46
Member

I have been thinking about this, and it is something that might fit in the scope of polars eventually. It is a lot of work, and currently it offers little benefit over the default list type. Eventually I'd like geo types under the polars umbrella, but I first want to mature the default use case rather than fight a battle on two fronts.

@cjermain
Contributor

Can Structs be used instead of FixedSizeLists? For 2-3 data points, I'm wondering if the list properties are relevant.

@ritchie46
Member

On our side that should work if we were to implement geo types.

@kylebarron
Contributor Author

kylebarron commented Jul 20, 2022

My goal is to be compliant with the GeoArrow specification in development. At this point, the spec defines a nested list format where the inner array is a FixedSizeList. To the geo world, this is kind of the "best of both worlds" because the logical layout matches the coordinates array from GeoJSON while the physical Arrow layout is a flat array of coords.
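As a rough illustration of that logical-versus-physical split, the following standard-library sketch flattens GeoJSON-style linestrings into an offsets array plus one interleaved coordinate buffer, which is approximately how a `List<FixedSizeList<f64>[2]>` column is laid out. It assumes non-null rows and int32 offsets; `to_arrow_like` and `vertex` are hypothetical helper names, not part of GeoArrow, Arrow2, or polars.

```python
from array import array


def to_arrow_like(linestrings):
    """Flatten GeoJSON-style linestrings into (offsets, coords).

    offsets[i]..offsets[i + 1] is the vertex range of row i, as in an Arrow
    List array; coords is the interleaved x, y child buffer.
    """
    offsets = array("i", [0])
    coords = array("d")
    for line in linestrings:
        for x, y in line:
            coords.extend((x, y))
        offsets.append(offsets[-1] + len(line))
    return offsets, coords


def vertex(offsets, coords, row, k):
    """Return vertex k of linestring `row` using index arithmetic only."""
    start, end = offsets[row], offsets[row + 1]
    if not 0 <= k < end - start:
        raise IndexError("vertex index out of range")
    base = (start + k) * 2
    return coords[base], coords[base + 1]


# Logical view: the same nesting as a GeoJSON "coordinates" member.
lines = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)],
    [(10.0, 10.0), (11.0, 12.0)],
]

offsets, coords = to_arrow_like(lines)
second_vertex_of_second_line = vertex(offsets, coords, 1, 1)
```

The nested Python lists mirror what a user sees, while `offsets` and `coords` are the only two buffers that actually need to be stored or shared.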

> implement geo types

My preference is to not use a polars Object type Series containing rust structs defined in the geo crate, because this has the usual non-Arrow drawbacks including serialization and deserialization costs every time data is loaded or shared with a program outside of polars.

Today, geopolars still has an extra copy from Arrow data into geo structs, but my long-term goal is to work with the geo crate to restructure their algorithms around geometry traits, so that geometry data in Arrow can be accessed zero-copy (see georust/geo#838, georust/geo#67).

> Can Structs be used instead of FixedSizeLists?

To clarify, are you referring to rust structs or Arrow structs? Early on in GeoArrow discussions, an Arrow Struct format was proposed, but this was decided against because it is nearly identical to the physical layout of the nested list approach, while lacking the easier logical API of the nested lists.

> It is a lot of work with currently not much benefit with regard to the default list type

I'm sympathetic to the extra dev overhead of new data types. I wonder whether it would be possible to add some sort of minimal "container" data type that just wraps Arrow arrays but doesn't have full polars support otherwise. In the current approach of geopolars, we don't need or use any polars-specific methods on the geometry column (but the point is for users to access polars operations on all their non-geometry columns); we just access the underlying arrow data, pass it to an algorithm, and create a new series. So all we need is a way to store this more "custom" column data layout in a column alongside the rest of a polars DataFrame.

@kylebarron
Contributor Author

kylebarron commented Jul 20, 2022

> For 2-3 data points, I'm wondering if the list properties are relevant

Not sure I understand. For a geometry column of type Point, each row of the geometry column would contain only 2 or 3 numbers. But for a geometry column of type Polygon, each row could contain thousands of vertices. E.g. for a GeoDataFrame representing countries where each row is one country and the geometry column includes the country's boundary, a single unsimplified geometry could include tens of thousands of vertices.

@ritchie46
Member

There are more requests for fixedsizelist + extension types so that we can deal with tensor types. I want to add fixedsizelist type as a minimal type. One that can be put into a DataFrame and supports minimal aggregations and take functionality. That should allow third parties to work with more of the arrow spec + polars.

@kylebarron
Contributor Author

kylebarron commented Sep 23, 2022

As a heads up: the GeoArrow community is reconsidering using a struct type instead of FixedSizeList for the inner coordinate format. If the only reason to implement FixedSizeList was for the geo use case, it might be worth holding off for now until that discussion is resolved 🙂 (I understand you might want to implement FixedSizeList anyways to support tensors)

@jondo2010

@ritchie46 is there any news on supporting FixedSizeList in Polars? What would be involved in adding support?

@ritchie46
Member

> @ritchie46 is there any news on supporting FixedSizeList in Polars? What would be involved in adding support?

It would need a PR similar to this one: #5122.

I would accept such a PR. It's just a few hours of work.

@JackKelly

> There are more requests for fixedsizelist + extension types so that we can deal with tensor types

I'd absolutely love to be able to use tensor types within Polars! (I'm currently using xarray, which is awesome but uses Pandas + Dask).

@kylebarron
Contributor Author

Closing this given #8943. Created #9112 to track Arrow extension types if there's interest in that as well.

5 participants