DISCUSSION: allow external libraries to define a custom Block #17144

Closed
jorisvandenbossche opened this issue Aug 1, 2017 · 10 comments
Labels
API Design Internals Related to non-user accessible pandas implementation

Comments

@jorisvandenbossche
Member

I am opening this issue because I want to try adding a custom GeometryBlock to GeoPandas (joint work with Matthew Rocklin in geopandas/geopandas#467). The ultra-short motivation: we want to store integers (pointers to C objects) in a column, but box them to shapely Python objects when the user interacts with the column (repr, accessing an element, ...).
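To make the "store integers, box on access" idea concrete, here is a minimal sketch in plain Python. It is not the GeoPandas or pandas implementation: `FakeGeometry` and `HandleColumn` are hypothetical names, the integers are arbitrary handles rather than real C pointers, and a real version would wrap the handle in a shapely object instead.

```python
from array import array


class FakeGeometry:
    """Hypothetical stand-in for a shapely object; wraps one integer handle."""

    def __init__(self, handle):
        self.handle = handle

    def __repr__(self):
        return f"<FakeGeometry handle={self.handle}>"


class HandleColumn:
    """Hypothetical column: stores plain integers, boxes them only on access."""

    def __init__(self, handles):
        # 'q' = signed 64-bit integers, stored contiguously like a numeric column
        self._data = array("q", handles)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing stays in the unboxed integer representation
            return HandleColumn(self._data[key])
        # Scalar access boxes the integer into a Python object
        return FakeGeometry(self._data[key])

    def __repr__(self):
        # The repr also goes through the boxed objects
        return "[" + ", ".join(repr(self[i]) for i in range(len(self))) + "]"


col = HandleColumn([101, 102, 103])
first = col[0]   # boxed object
tail = col[1:]   # still an integer-backed column
```

The point of the design is that bulk operations such as slicing never create Python objects; only element access and the repr pay the boxing cost.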

I am of course free to try this :-), but I wanted to raise this because it has some consequences. With the "allow external libraries" in the issue title, I mean the following:

  • agree that this is 'OK', which means that we try not to break the Block API (to a certain extent of course, or try to change it in a backwards-compatible way)
  • accept some changes to pandas to make this possible where needed (as long as they are only some internal clean-ups)

I don't think we plan many internal refactorings for pandas 0.x / 1.x, so in that regard the Block API should/could remain rather stable (of course for 2.0 this is a whole other issue).

So this issue can serve as general discussion for this (or if people have input or feedback) and as a reference for when changes in pandas are made for this.

cc @pandas-dev/pandas-core

@jorisvandenbossche
Member Author

#17143 is an example issue of a small change.

Based on my first experiments, it seems that implementing the GeometryBlock is reasonably feasible. The repr (with above PR), (re)indexing, slicing, accessing elements, some operations, ... are already working, although it is of course possible that these were just the easy parts and that the can of worms only opens now, when trying to fix the remaining problems :-)

@jorisvandenbossche jorisvandenbossche added API Design Internals Related to non-user accessible pandas implementation labels Aug 1, 2017
@TomAugspurger
Contributor

TomAugspurger commented Aug 3, 2017 via email

(of course for 2.0 this is a whole other issue).

I think this deserves more than a parenthetical :) We want to avoid introducing new APIs that will break with pandas 2, as your GeometryBlock would (I think). That said, I think it's worthwhile, even if the internals of geopandas will need to be updated for pandas 2.

@jorisvandenbossche
Member Author

I think it is perfectly reasonable to assume that the current GeometryBlock that would be included in geopandas will only work for pandas 1.x, and that we will have to rework this for pandas 2.x. Maybe the main constraint for pandas 2.x in this regard would be not to support such blocks, but to at least have a similar (hopefully cleaner) mechanism to let external libraries extend pandas.

@TomAugspurger
Contributor

Maybe the main constraint for pandas 2.x in this regard would be not to support such blocks, but to at least have a similar (hopefully cleaner) mechanism to let external libraries extend pandas.

Yes, I was going to suggest that, but I don't want to put more work on Wes and others' plate :) I wouldn't really consider this a hard requirement for the initial pandas 2, but at some point it would be good to have.

@wesm
Member

wesm commented Aug 3, 2017

I think we'll be able to make user defined types much simpler. For example, a Latitude-Longitude type could be embedded in struct<latitude: double, longitude: double>. Ultimately the block manager is going away, but I don't think this should prevent useful work from happening in current pandas.
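As a rough illustration of what embedding a latitude-longitude pair in a struct of two doubles means at the memory level, here is a sketch using only Python's standard `struct` module. It is not how pandas 2 or any columnar library would implement it; the helper names `pack_points` and `get_point` and the sample coordinates are made up for the example.

```python
import struct

# Little-endian record of two doubles: (latitude, longitude)
LATLON = struct.Struct("<dd")


def pack_points(points):
    """Pack (latitude, longitude) pairs into one contiguous buffer."""
    buf = bytearray(LATLON.size * len(points))
    for i, (lat, lon) in enumerate(points):
        LATLON.pack_into(buf, i * LATLON.size, lat, lon)
    return buf


def get_point(buf, i):
    """Read record i back as a (latitude, longitude) tuple."""
    return LATLON.unpack_from(buf, i * LATLON.size)


buf = pack_points([(52.37, 4.90), (40.71, -74.01)])
second = get_point(buf, 1)
```

Each record occupies 16 bytes, so the whole column is a single flat buffer with no per-element Python objects.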

As an aside, it seems more and more likely that the optimal route for pandas2 will be a separate codebase, while factoring out reusable components of pandas 0.x that do not need to have knowledge of the low level internals.

@mrocklin
Contributor

mrocklin commented Aug 3, 2017

The GeoPandas case is a bit more complex than storing structs. We need to store (and track) pointers to an external library, GEOS. This is the library that backs essentially every geospatial system, including Postgres' PostGIS.

Currently our array-like-geometry object tracks references so that we can free the GEOS pointers at the appropriate time. Is handling pointers to external libraries within scope for Pandas 2? This is a bit atypical.
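The lifetime problem described here can be sketched without GEOS at all. The following assumes a hypothetical allocator (`fake_alloc` / `fake_free`) standing in for the C library, and uses `weakref.finalize` to release handles when the owning Python object goes away; the actual GeoPandas reference tracking may work quite differently.

```python
import gc
import weakref

# Hypothetical stand-in for a C library: the set of currently allocated handles
_live_handles = set()


def fake_alloc(handle):
    _live_handles.add(handle)
    return handle


def fake_free(handle):
    _live_handles.discard(handle)


def _free_all(handles):
    """Finalizer callback: release every handle the owner held."""
    for h in handles:
        fake_free(h)


class OwnedHandles:
    """Hypothetical owner of a batch of external handles."""

    def __init__(self, handles):
        self._handles = [fake_alloc(h) for h in handles]
        # The callback receives a copy of the list and never references self,
        # otherwise the finalizer would keep the object alive forever.
        self._finalizer = weakref.finalize(self, _free_all, list(self._handles))


owner = OwnedHandles([1, 2, 3])
live_before = set(_live_handles)   # handles are allocated while owner exists
del owner
gc.collect()
live_after = set(_live_handles)    # handles were released by the finalizer
```

A generic struct-style column has no place for this kind of cleanup hook, which is why pointer-backed data is a harder case than plain value types.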

@jbrockmendel
Member

It looks like the set of recognized Block subclasses is hard-coded in internals.form_blocks and internals.make_block. (It also looks like some of the logic in these two functions could be shared.) It wouldn't be too hard to have these functions refer to a registry that brave souls could experiment with.

@jschendel
Member

Can this be closed now that we have the extension array interface, and through that an ExtensionBlock? Or is this something we want in addition to that?

@TomAugspurger
Contributor

Yes I think extension block is no longer necessary.

@jorisvandenbossche
Member Author

And hooray that they are no longer necessary! The extension array interface is much better
