Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add downcast as method to DataFrame and Series #51641

Closed
wants to merge 13 commits into from

Conversation

phofl
Copy link
Member

@phofl phofl commented Feb 26, 2023

@jbrockmendel adding downcast as first step towards getting rid of the keyword for fillna and related

@jbrockmendel
Copy link
Member

Do we have an idea how often anything other than "infer" is used? im not wild about a try/except for the non-infer cases, and without it users might as well just user .astype

@phofl
Copy link
Member Author

phofl commented Feb 26, 2023

Fair point. I don't think very often, but not sure. We could just try it with infer and see if there are any reports that people are using it?

@jbrockmendel
Copy link
Member

We could just try it with infer and see if there are any reports that people are using it?

you mean just enable "infer" in the new method and see if anyone asks for the other options? that seems fine

@phofl
Copy link
Member Author

phofl commented Feb 27, 2023

Yes exactly

@phofl
Copy link
Member Author

phofl commented Feb 27, 2023

Removed the argument

if using_copy_on_write():
result = self.copy(deep=False)
else:
result = self.copy(deep=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would have expected this to be handled within the Manager method. am i wrong to be surprised?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn’t really matter, when we supported dict like inputs this was better, but can move it to the manager now



class TestDowncast:
def test_downcast(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parametrize over frame_or_series?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes it more complicated

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the alternative is to implement an analogous test in the series tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just do

if frame_or_series is Series:
    obj = obj["A"]
    expected = expected["A"]

result = ...

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a series test already, should have commented. Are you ok with that?

@jbrockmendel
Copy link
Member

couple comments, otherwise LGTM. id like to make sure we have buy-in from the rest of the team on doing this and deprecating the keyword

@phofl
Copy link
Member Author

phofl commented Mar 5, 2023

Sounds good to me. Should we ping on the issue?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this what to_numeric does ?
we rejected this method in the past for that reason

what's the rationale here for adding?

@jreback
Copy link
Contributor

jreback commented Mar 5, 2023

+1 on the deprecation itself

@phofl
Copy link
Member Author

phofl commented Mar 5, 2023

to_numeric doesn't work well on DataFrames, but also it changes the size of your dtype if possible, downcast here only coerces from float to int if possible

@jreback
Copy link
Contributor

jreback commented Mar 5, 2023

sure but maybe that's a better option

eg enhancing to_numeric to accept a frame and also a less strict option

@phofl
Copy link
Member Author

phofl commented Mar 5, 2023

Hm I think calling to_numeric on an already numeric object is a bit counterintuitive? Should we rather mirror the functionality from to_numeric here and centralise down casting logic here?

@jreback
Copy link
Contributor

jreback commented Mar 5, 2023

that's another option

yeah just shouldnt be different very similar options -

@phofl
Copy link
Member Author

phofl commented Mar 5, 2023

@jbrockmendel thoughts?

@jorisvandenbossche
Copy link
Member

downcast here only coerces from float to int if possible

This might be confusing for a method called "downcast"? When I saw this issue, I assumed it would be about a method that would actually downcast things like int64 -> int32 etc (that's also what the downcast keyword in to_numeric does).
The current docstring in the PR also isn't very clear about that this is the only thing it does.

Adding the actual downcasting logic that to_numeric has might make sense, then.

But, personally, I am not sure this is worth a completely new DataFrame/Series method.

And if we want something like this, I am wondering if it would be more interesting to have something more general like a "infer dtypes" (or "optimize" dtypes) method (although it would then be a but unfortunate that we already have both an infer_objects and convert_dtypes that both do something a bit different). "Inferring dtypes" could have options to enable several kinds of dtype optimizations, like float->int, smaller bitwidths for float/int, inferring object, ... (and later maybe things like converting to categorical for certain characteristics such as low cardinality strings)

@phofl
Copy link
Member Author

phofl commented Mar 13, 2023

Yeah I agree that we should add the down casting logic from to_numeric.

Right now all of them are doing something different. Goal is to centralise this here and also get rid of the inconsistent behavior in fillna and friends.

@jbrockmendel
Copy link
Member

How to centralize/standardize downcasting in a minimal set of methods is hard, as is naming.

The "real" motivation is getting the downcasting out of fillna etc. I think the lazy thing to do is deprecate that first then see if anyone complains, then add this if necessary. That avoids the difficult part (choosing a name).

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Apr 14, 2023
@jbrockmendel
Copy link
Member

The "real" motivation is getting the downcasting out of fillna etc. I think the lazy thing to do is deprecate that first then see if anyone complains, then add this if necessary. That avoids the difficult part (choosing a name).

i've stumbled upon the annoying downcast-inside-fillna/where again. Gonna go ahead and deprecate at least some of that.

@phofl
Copy link
Member Author

phofl commented May 5, 2023

Sounds good, let's close this for now till we get actual complaints. Should I take care of downcast keywords?

@phofl phofl closed this May 5, 2023
@phofl phofl deleted the downcast branch August 28, 2023 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants