DEPR: require _constructor/_constructor_sliced to return a class #51772
Comments
@phofl since passing Managers to the constructors came up in another issue: my preferred solution is to do this deprecation, then implement something like |
+1 |
With the idea that I think for internal purposes, more consistently using something like |
Good catch on an implicit assumption worth making explicit! I was thinking we'd call |
But then you can't deprecate |
_constructor._from_mgr would only work if _constructor reliably returned a class. So the deprecation is for allowing non-class _constructor |
Ah, yes, of course. Still, the second part of my question still holds I think? I.e. what would be the benefit of doing |
For the non-_sliced version either way would work |
Looks like we call _constructor_sliced with a Manager object in _box_col_values, _ixs, and xs |
+1 in deprecation / change to require a class |
(for context I worked implementing the
I've just opened geopandas/geopandas#2845 to test what this might look like on the geopandas side. Generally, it's a bit awkward because for GeoPandas we have subclass typecasting logic which we only introduce for

Seems like it's possible from a quick test creating dummy classes, but not the most elegant. For

Aside, GeoPandas is currently patching over the fact that |
This has already been fixed in pandas more than a year ago (#46018). So we only still have that monkeypatch in geopandas for compat with older pandas versions. |
We chatted about this at the dev meeting this week as well. What is not yet fully clear to me is what is the exact motivation for wanting to change this.
As I mentioned in the meeting as well, and as @m-richards shows, it's not that this is technically difficult for GeoPandas (we just need to create another subclass with only a custom |
Sorry, I'm probably not adding much value to the discussion here then, I'm unfortunately not that up to date on pandas dev stuff. But I will briefly comment on this
This is certainly a nice mental model for someone who isn't super familiar with pandas internals. If pandas does have a need or desire to make this |
No, no, your feedback is useful and very much appreciated! (My mention that this was discussed at a meeting was just to give some context to my comment, as it was partly responding to things that were discussed in the meeting.) |
Just noticed that we have |
I've re-read through #51772. If this was never meant to work in the first place, and it was never documented to work, then it seems perfectly fine to deprecate. Especially if it simplifies the codebase and if it's not technically difficult to work around in geopandas.

Regarding the point that it'd be easy to write a code check to ensure that .constructor is only ever used assuming it's a class: I'm not sure I agree. At least, not statically, and anything involving inspecting annotations tends to be noticeably slower. [reposted from #52420] |
It's true that it is not explicitly documented that it can be a generic callable, but I think I can also say that it is not explicitly documented that it must be a class. We only speak about "constructor" (in prose docs and in the name of the methods), and IMO that doesn't have to indicate a class. I think "constructor" is generally understood as "a callable to construct an object". Personally I don't know what the "intention" was when this was added 12 years ago (I could dig up a commit of that time that calls this "factory functions", but I don't want to put any intent in those words). But the fact is that this has been used for quite a while as a callable (and not just by geopandas). Anyway, I have already said that it is technically possible for subclasses to adapt to this proposal (you just put your current callable in a |
What I mean is a simple pre-commit check like the following:

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 43b3699907..6384f94774 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -292,6 +292,13 @@ repos:
         files: ^pandas/tests/extension/base/
         exclude: ^pandas/tests/extension/base/base\.py$
         types_or: [python, cython, rst]
+    -   id: unwanted-patterns-constructor
+        name: Unwanted patterns for _constructor
+        language: pygrep
+        entry: |
+            (?x)
+            \._constructor\.
+            |\._constructor_sliced\.
+            |\._constructor_expanddim\.
+        types: [python]

This catches it if someone does (sidenote: the above wouldn't actually pass right now, because we also have |
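To make concrete what a check like this would and would not flag, here is a minimal sketch that runs the same verbose regex over a few source lines, one line at a time, using Python's `re` module. The sample lines and the `flagged_lines` helper are made up for illustration; this is not how pre-commit itself is invoked.

```python
import re

# Same pattern as the proposed `entry:` block; (?x) enables verbose mode,
# so the newlines between the alternatives are ignored.
PATTERN = re.compile(
    "(?x)\n"
    r"\._constructor\." "\n"
    r"|\._constructor_sliced\." "\n"
    r"|\._constructor_expanddim\." "\n"
)


def flagged_lines(source_lines):
    """Return the lines that contain attribute access on a constructor (hypothetical helper)."""
    return [line for line in source_lines if PATTERN.search(line)]


# Illustrative sample lines, not taken from the pandas code base.
samples = [
    "result = self.apply(self.obj._constructor.value_counts, **kwargs)",
    "return self._constructor(new_data).__finalize__(self)",
    "out = obj._constructor_sliced.some_method",
    "wide = obj._constructor_expanddim(data)",
]

flagged = flagged_lines(samples)
```

Only attribute access directly on the constructor (a trailing `.`) is flagged; plain calls such as `self._constructor(new_data)` pass, which is the usage that stays valid whether or not `_constructor` returns a class.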
Could you clarify which concerns/questions exactly you'd like a response to please? It looks to me like the overall conclusion is
Relying on a regex (like the one you posted) to prevent bugs sounds dangerous.

I kind of feel like flipping the question around: is there a strong enough reason to not do this? I may be misunderstanding, but moving the callable to a |
Can you explain why you think it would be dangerous? (We have plenty of similar pre-commit checks that do some regex) |
sure - the other ones are more for linting issues, rather than for preventing bugs |
OK, I understand, it's indeed less about style or patterns we just don't like. Now, I would personally still not categorize it as dangerous, if it is only to help catch the pattern. In the end it is still up to the reviewers (and our test suite, and the test suites of subclasses) to catch those things. It also doesn't cause bugs in pandas itself, only potentially for subclasses. But we have plenty of ways that we can do something wrong for subclasses that now need to be caught by the human reviewer and which can easily be missed. For example, ensuring that we always use |
It's essentially #51772 (comment) and #51772 (comment) (the latest comments here without any response, before you revived the discussion last week).
Let's try to make both of those points more concrete with an example from #51765 (comment) (the case where Brock correctly commented that the initial code in the PR was wrong because we currently can't assume

# original code:
result = self.apply(Series.value_counts, **kwargs)
# the PR changed this to:
result = self.apply(self.obj._constructor.value_counts, **kwargs)

where The above relies on calling the The above code relies on

result = self.apply(lambda ser: ser.value_counts(**kwargs))

This assumes that the object passed to the function by

So first, I am personally not convinced that this is making the fix less straightforward (but of course this is only one example, and there are probably other examples where it might be more complicated to not rely on the class methods).

But this also illustrates the second point why this can complicate things for subclasses: assume a subclass has overridden

The exact example using Series.value_counts here is not perfect to illustrate this, because it is quite unlikely that selecting a subset of the series (for each group) would give a different class than the parent Series. But in the context of a DataFrame applying a method on each of its columns, that is certainly not unrealistic. For example, assume the following pseudo code:
Using this pattern of calling a method on one of the constructor attributes could in this case lead to wrong behaviour if not every column (

BTW, there are also other reasons for wanting to have a custom callable, apart from being able to be flexible in which class to return. Another reason is to have custom logic relying on characteristics of the calling

(sorry, in the end it turned out not to be a summary, but rather a lengthy but hopefully somewhat illustrative post) |
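The difference between looking a method up on a fixed class and calling it on each object can be sketched with toy classes. This is not the pseudo code from the comment above (which did not survive in this copy of the thread), and `ToySeries`, `ToyGeoSeries`, `describe_kind` and `apply_per_column` are all hypothetical stand-ins rather than pandas or geopandas APIs.

```python
class ToySeries:
    """Hypothetical stand-in for a base Series class."""

    def __init__(self, values):
        self.values = list(values)

    def describe_kind(self):
        return "plain"


class ToyGeoSeries(ToySeries):
    """Hypothetical subclass that overrides a method of the base class."""

    def describe_kind(self):
        return "geo"


def apply_per_column(columns, func):
    # Apply func to every column, as a frame-level apply might do.
    return [func(col) for col in columns]


# A "frame" whose columns are not all of the same class.
columns = [ToySeries([1, 2]), ToyGeoSeries(["POINT (0 0)"])]

# Pattern A: the method is taken from one fixed class, so the
# subclass override is silently bypassed for the second column.
via_class = apply_per_column(columns, ToySeries.describe_kind)

# Pattern B: the method is called on each object, so dispatch
# follows the actual type of every column.
via_object = apply_per_column(columns, lambda col: col.describe_kind())
```

Under these assumptions `via_class` treats both columns as plain, while `via_object` respects the override on the second column, which is the kind of subtle misbehaviour for subclasses that the comment describes.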
The fact that this has caused bugs and development burden in the not-so-distant past is one concern. Another big one is that this is a blocker to implementing perf-improving _from_mgr method (#52132), which is itself a blocker for deprecating allowing managers in

Allowing this particular usage significantly impedes pandas development. And AFAICT the use case geopandas needs it for can be better supported by just registering an accessor, which is actually intentionally supported. |
I might have missed it, but I don't see it mentioned in that PR that it is being blocked by this issue (the only mention I see is about my alternative implementation proposal that would make use of a callable |
(DataFrame|Series)Subclass._constructor(_sliced|_expanddim)? does not currently need to return a class. It can return a callable. This means that we cannot write obj._constructor_sliced.some_method, which has caused bugs/confusion at times. Most recently this surprised a contributor in #51765 and makes the fix there less straightforward.

AFAIK the only/main subclass that relies on this is geopandas, which inspects the arguments to decide which subclass to use. IIUC this could be accomplished with a __new__ method just as easily. cc @jorisvandenbossche

We could even go as far as deprecating _constructor entirely in favor of type(self) and just having _constructor_sliced/_constructor_expanddim.
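For readers less familiar with the problem, here is a rough, pandas-free sketch of both situations: a `_constructor` that returns a plain function (so attribute access on it fails), and a `__new__`-based variant where `_constructor` is a real class but the returned type still depends on the arguments. All names here (`ToySeries`, `CallableCtorSeries`, `NewBasedSeries`, `from_values`) are hypothetical, and the "all strings" test merely stands in for whatever inspection a real subclass would do.

```python
class ToySeries:
    """Hypothetical stand-in for a base Series class."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def _constructor(self):
        return type(self)

    @classmethod
    def from_values(cls, values):
        # Hypothetical alternate constructor, reachable only through a class.
        return cls(values)


class CallableCtorSeries(ToySeries):
    """_constructor returns a function that picks the class from the data."""

    @property
    def _constructor(self):
        def _pick(values):
            if all(isinstance(v, str) for v in values):
                return CallableCtorSeries(values)
            return ToySeries(values)

        return _pick


class NewBasedSeries(ToySeries):
    """Same typecasting idea, but done in __new__ so _constructor is a class."""

    def __new__(cls, values):
        if all(isinstance(v, str) for v in values):
            return super().__new__(cls)  # __init__ runs afterwards
        # Returning a non-instance of cls skips NewBasedSeries.__init__.
        return ToySeries(values)

    @property
    def _constructor(self):
        return NewBasedSeries


s = CallableCtorSeries(["a", "b"])
try:
    s._constructor.from_values(["x"])
    attribute_access_failed = False
except AttributeError:
    # A plain function has no `from_values` attribute.
    attribute_access_failed = True

n = NewBasedSeries(["a", "b"])
downcast = n._constructor.from_values([1, 2])  # falls back to ToySeries
kept = n._constructor.from_values(["x"])  # stays a NewBasedSeries
```

The trade-off in the `__new__` variant is that calling the class may return an instance of a different class, which some readers find surprising; whether that is acceptable for a real subclass is exactly what the thread is discussing.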