BUG: concat with empty dataframe with columns passed and nonempty dataframe coerces dtype to object #45637
Comments
This was an intended change; see https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#ignoring-dtypes-in-concat-with-empty-or-all-na-columns. Your empty DataFrame has dtype object, and the common dtype between float and object is object. cc @jbrockmendel to be sure. |
Yes this was intended. |
This was intentionally changed, but as I also commented before on the PR (#43507 (comment)) and as illustrated by this report: it is a backwards-incompatible change (and it removes functionality for preserving the dtype that was coded intentionally), and IMO we should do it with a deprecation warning instead. |
moved to 1.4.2 |
Not ignoring dtypes of a dataframe with all missing values makes sense to me. But may I ask for an example where this behavior is helpful with empty dataframes? The issue with the current behavior is that if an empty dataframe is present, information about dtypes of all other non-empty dataframes participating in concatenation is erased. We are forced to filter the input to prevent this loss of dtype information. I can't imagine why anyone would not want to do that. |
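The filtering workaround described above could look roughly like the following. This is only a sketch: the helper name concat_nonempty is made up for illustration and is not part of pandas, and the column names are arbitrary.

```python
import pandas as pd


def concat_nonempty(frames):
    """Concatenate frames, dropping empty ones first (hypothetical helper)."""
    nonempty = [f for f in frames if not f.empty]
    if not nonempty:
        # Every input is empty: fall back to a plain concat so the
        # column labels are still carried over to the result.
        return pd.concat(frames)
    return pd.concat(nonempty)


empty = pd.DataFrame(columns=["a", "b"])  # columns default to object dtype
data = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

result = concat_nonempty([empty, data])
print(result.dtypes)
```

Because the empty frame never reaches concat, the dtypes of the non-empty frame are kept regardless of how a given pandas version treats empty inputs.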
The general principle here is that the resulting dtype should not depend on whether any of the frames are empty or not (i.e. we want to avoid "values-dependent behavior"). Looks like there are still some inconsistencies in the Series vs DataFrame behavior that I had previously thought eliminated:
|
I would propose to revert that original change, and restore the special case for empty / all-NaN dataframes. Yes, this keeps some values-dependent behaviour, while ideally we strictly look at dtypes in this case. But as long as we don't have a better way to deal with empty columns without specific dtype information (apart from using object dtype), I think practicality beats purity and it is worth it to keep the special case and avoid such a breaking change. |
@jorisvandenbossche above you suggested deprecating. IIUC the idea was that 1.5 would have the old behavior and 2.0 would have the current behavior. Is that a) a correct understanding of what you were suggesting here and b) are you now suggesting something different? |
I am not fully sure, but in any case both options (also a deprecation) require it to be reverted. |
Yes, and we use object dtype for empty or float64 dtype for all NaN by default. So those indeed have a specific dtype by default, but that doesn't mean that this dtype conveys the correct information about that column (eg such an all-NaN column can be introduced in a reindex operation (see my original example at #43507 (comment)), but nothing about this operation says that it should be a float64 column). And so ignoring the dtype or not in those cases matters for the result. |
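For reference, the default dtypes mentioned here can be checked with a small sketch like this one (the frames and column names are arbitrary, chosen only for illustration):

```python
import pandas as pd

# An empty frame created with only `columns` gets object dtype by default.
empty = pd.DataFrame(columns=["a", "b"])
print(empty.dtypes)

# Reindexing to add a missing column introduces an all-NaN column,
# which gets float64 dtype by default even though nothing in the
# operation says the column should hold floats.
df = pd.DataFrame({"a": [1, 2]})
reindexed = df.reindex(columns=["a", "b"])
print(reindexed.dtypes)
```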
In general according to the version policy https://pandas.pydata.org/pandas-docs/dev/development/policies.html
So I guess that discussion can be kept independent of reverting #43507 for 1.4.x? |
As I mentioned there, the reindex in that example (
Trying to whittle down the issue: would you only ignore the empty/all-NaN cases when they are object/float64, respectively? So suppose a user has:
For
If we can't have both, then I think res_B should take priority since the user was explicit about what they wanted. |
Yes, but this is only a dummy example, and back then I answered to that with the following (#43507 (comment)):
So we have deprecated a feature for this, explicitly saying to users they can use
Yes, and that's also what we did in the past more or less (eg we didn't ignore all-NaT datetime64). I would maybe leave out the "respectively", so for now consider both empty and all-NaN for both dtypes to keep things simpler (although we should check what we did exactly before).
Long term, I think this is the way to go (something like this is what I meant with the "a better way to deal with empty columns without specific dtype information" above). |
The revert is not straightforward since there have been some changes to the concat code since #43507, e.g. removing code in #43577 and changing signatures in #43626 and #43606. I made a couple of attempts at reverting these in different orders to reduce the number of conflicts to resolve manually, and will revisit again soon. Are we agreed that we want to revert #43507 for 1.4.x and, in a separate PR targeted to main/1.5, add a deprecation warning instead? |
moving to 1.4.3 |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When using concat with an empty dataframe with the columns argument passed, and a nonempty dataframe, the dtypes of the resulting dataframe are coerced to object.
Expected Behavior
I would expect the dtypes to be taken from the nonempty dataframe, as was the behavior in previous versions of pandas.
This issue can be avoided if dtypes are explicitly passed, which maybe is intentional, but still it is unexpected.
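A minimal sketch of the scenario and of the explicit-dtype workaround, assuming two float columns (this is not the original reproducible example; the names and values are made up):

```python
import pandas as pd

data = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# Empty frame with only `columns` passed: its columns default to object dtype.
# The resulting dtypes of this concat vary by pandas version, as reported here.
empty_untyped = pd.DataFrame(columns=["a", "b"])
res_untyped = pd.concat([empty_untyped, data])
print(res_untyped.dtypes)

# Empty frame with dtypes given explicitly: both inputs are float64,
# so there is nothing to coerce.
empty_typed = pd.DataFrame(columns=["a", "b"]).astype("float64")
res_typed = pd.concat([empty_typed, data])
print(res_typed.dtypes)
```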
Pandas 1.3.5 behavior:
Installed Versions