Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: accepting Manager objects in DataFrame/Series #52419

Merged
merged 48 commits into from
Oct 17, 2023

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jbrockmendel jbrockmendel added Internals Related to non-user accessible pandas implementation Deprecate Functionality to remove in pandas labels Apr 4, 2023
@mroeschke
Copy link
Member

@jorisvandenbossche looks like this deprecation would affect pyarrow. Any thoughts?

@jorisvandenbossche
Copy link
Member

That would certainly be a problem for pyarrow. There is also #52132 which adds a separate method to construct from a manager. I would assume we would need to do something like that first?

@jbrockmendel it would help if you could add some more context in the top post of your PRs, short explainer, link to relevant PRs, or open an issue about this that can be linked to from the different PRs

@jbrockmendel
Copy link
Member Author

I would assume we would need to do something like [implement _from_mgr] first?

We'd need to have that by the time we enforce the deprecation, yes.

@jreback
Copy link
Contributor

jreback commented Apr 13, 2023

+1 here

I agree that we still need a way for a power user to construct from collections of 2D arrays of homogenized dtypes. So that begs the question why don't we just do this?

we have from_arrays why not from_2darrays (or better name)? this pretty much solves the downstream problem with efficiency

@jbrockmendel
Copy link
Member Author

should use DeprecationWarning instead of FutureWarning?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback merged commit b2a622e into pandas-dev:main Oct 17, 2023
33 checks passed
@jbrockmendel jbrockmendel deleted the depr-fastpath branch October 17, 2023 22:24
@rhshadrach
Copy link
Member

On main, I'm now seeing:

df = pd.DataFrame({'a': [1, 1, 2]})
df.to_parquet('temp.parquet')
pd.read_parquet('temp.parquet')
# DeprecationWarning: Passing a BlockManager to DataFrame is deprecated and will raise in a future version. Use public APIs instead.

@jbrockmendel
Copy link
Member Author

Yes, pyarrow needs to update their usage.

@jorisvandenbossche
Copy link
Member

I think this is way too noisy if every user sees this. I would propose to revert this for now, and only put it back in when there is a released version of pyarrow that doesn't trigger those warnings.

@jbrockmendel
Copy link
Member Author

Without this deprecation, will pyarrow ever actually start using public APIs?

Isn't the point of using a DeprecationWarning instead of FutureWarning to assure appropriate visibility?

If we do revert, it should be just before release, not immediate.

@jorisvandenbossche
Copy link
Member

Isn't the point of using a DeprecationWarning instead of FutureWarning to assure appropriate visibility?

Yeah, that's true, but for some reason it doesn't seem to be working here. See Richard's snippet above, which I can reproduce on main without any special filterwarnings settings. So something we should investigate why this is happening.

Without this deprecation, will pyarrow ever actually start using public APIs?

I am planning to simply move pyarrow to whatever new private API is needed to achieve the same (I suppose DataFrame._from_mgr?).
As mentioned above, I think there are valid use cases to allow power users to access those APIs to be able create a DataFrame from a Manager (also subclasses will need to be able to do this, BTW. And especially for something like pyarrow we want that the integration can be as optimal as possible). I fully agree that this API shouldn't be the main DataFrame constructor, so I am fully on board with this deprecation. I am only assuming we will provide an alternative.

@jorisvandenbossche
Copy link
Member

Isn't the point of using a DeprecationWarning instead of FutureWarning to assure appropriate visibility?

Yeah, that's true, but for some reason it doesn't seem to be working here. See Richard's snippet above, which I can reproduce on main without any special filterwarnings settings. So something we should investigate why this is happening.

So it's actually quite simple ;) We need to set the stacklevel correctly for how this is used, i.e. we want to warn if someone passes a manager to a DataFrame themselves, and so for that use case, the stacklevel is always two (this deprecation doesn't live deeply nested where it can be reached in multiple ways, it's simply in DataFrame/Series.__init__, and only when calling this init we should show the warning)

PR to fix this -> #55591

@rhshadrach
Copy link
Member

I also wasn't aware of this drawback of using find_stack_level, I think this would be good to know.

cc @pandas-dev/pandas-core see @jorisvandenbossche's comment immediately above.

@jorisvandenbossche
Copy link
Member

Yes, we will need to re-evaluate other DeprecationWarnings as well

@jorisvandenbossche
Copy link
Member

After further working on #55591, I discovered that correcting the stacklevel unfortunately does not fix all cases (eg calling pyarrow_table.to_pandas() still incorrectly triggers the warning).
I think this is related to the DataFrame constructor being called from cython on the pyarrow side, and cython might mess up the stack? (I am moving this construction from cython to python in my PR to update pyarrow (apache/arrow#38374), but that of course doesn't help for released versions)

Although this usage won't be as widespread as read_parquet, this might still be a reason to revert it temporarily ..

@jorisvandenbossche
Copy link
Member

Another option is maybe to, initially, set the stacklevel to 1. That would circumvent the cython issue (and avoid the question of reverting temporarily), and ensure the warning isn't seen by pandas/pyarrow users. It would also result in not seeing the warning when doing it manually (i.e. user doing pd.DataFrame(mgr)), but that's 1) the less likely use case, and 2) then in a next step we can bump the stacklevel back to 2.

@jorisvandenbossche
Copy link
Member

I implemented the stacklevel=1 idea in my PR at #55591

@rhshadrach
Copy link
Member

I'm also seeing this warning hit by subclassing a pandas DataFrame:

pandas/pandas/core/frame.py

Lines 648 to 654 in 2d2d67d

def _constructor_from_mgr(self, mgr, axes):
if self._constructor is DataFrame:
# we are pandas.DataFrame (or a subclass that doesn't override _constructor)
return self._from_mgr(mgr, axes=axes)
else:
assert axes is mgr.axes
return self._constructor(mgr)

It's triggered by the test test_groupby_preserves_subclass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants