Concatenating rows with Int64 datatype coerces to object #24768

Closed
janvanrijn opened this issue Jan 14, 2019 · 16 comments · Fixed by #36278
Labels
Dtype Conversions: Unexpected or buggy dtype conversions
ExtensionArray: Extending pandas with custom dtypes or arrays.
Reshaping: Concat, Merge/Join, Stack/Unstack, Explode

@janvanrijn

janvanrijn commented Jan 14, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

df_a = pd.DataFrame({'a': [-1, -1, -1]})
df_b = pd.DataFrame({'b': [1, 1, 1]})
df_a['a'] = df_a['a'].astype('Int64')
df_b['b'] = df_b['b'].astype('Int64')

total = pd.concat([df_a, df_b], ignore_index=True)

print(total.dtypes)

output:

a    object
b    object
dtype: object

Problem description

When running the exact same code with floats, the float dtype is preserved, which is analogous to the expected output. The coercion to object also happens with the append function:

import pandas as pd

df_a = pd.DataFrame({'a': [-1, -1, -1]})
df_b = pd.DataFrame({'b': [1, 1, 1]})
df_a['a'] = df_a['a'].astype('Int64')
df_b['b'] = df_b['b'].astype('Int64')

total = df_a.append([df_b], ignore_index=True)

print(total.dtypes)

Expected Output

a    Int64
b    Int64
dtype: object
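
For readers who hit this on an affected version, one possible workaround is to cast the affected columns back after concatenating. The sketch below is an illustration only, not the fix that closed this issue; `concat_keep_int64` is a hypothetical helper name, and it uses `pd.concat` because `DataFrame.append` was removed in pandas 2.0. On versions where `concat` already preserves `Int64`, the cast is simply skipped.

```python
import pandas as pd


def concat_keep_int64(frames):
    """Concatenate frames, then restore Int64 on columns that had it in any input.

    Hypothetical helper for illustration; not part of pandas.
    """
    total = pd.concat(frames, ignore_index=True)
    # Collect the names of all columns that were nullable Int64 in some input.
    int64_cols = {
        col
        for frame in frames
        for col, dtype in frame.dtypes.items()
        if str(dtype) == "Int64"
    }
    for col in int64_cols:
        # Only cast when concat did not keep the nullable integer dtype.
        if str(total[col].dtype) != "Int64":
            total[col] = total[col].astype("Int64")
    return total


df_a = pd.DataFrame({"a": [-1, -1, -1]}).astype("Int64")
df_b = pd.DataFrame({"b": [1, 1, 1]}).astype("Int64")

total = concat_keep_int64([df_a, df_b])
print(total.dtypes)
```

Because the two frames have different columns, the result has six rows, with missing values filling the half of each column that the other frame did not provide.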

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+997.ga197837
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
WillAyd added the Dtype Conversions and ExtensionArray labels Jan 14, 2019
@WillAyd
Member

WillAyd commented Jan 14, 2019

Thanks for the report and makes sense. Investigation and PRs are always welcome!

WillAyd changed the title from "concatenating rows with int datatype" to "Concatenating rows with Int64 datatype coerces to object" Jan 14, 2019
@janvanrijn
Author

I like the title change that you proposed, as this is actually another issue that I wanted to open. However, the problem that I tried to point out in this issue is that there are now 6 rows containing a 1 (column B), whereas we would expect only 3 such rows.

@janvanrijn
Author

In short, I would like to propose restoring the original title of this issue, as all the code and expected output point to 'unexpected behavior when concatenating rows with int datatype'. I will open a second issue with proper expected output on 'Concatenating rows with Int64 datatype coerces to object' (some more things go wrong there that are not outlined in this issue).

@TomAugspurger
Contributor

@janvanrijn can you check the example in your original post? When you assign df_a['b'] = ..., it no longer matches your "working" example

import pandas as pd

df_a = pd.DataFrame({'a': [-1, -1, -1]})
df_b = pd.DataFrame({'b': [1, 1, 1]})

total = pd.concat([df_a, df_b], ignore_index=True)

as the columns are different. Perhaps you mean df_b['b'] = ... instead.

@janvanrijn
Author

Ouch, sloppy. My sincere apologies.

Then only the 'Concatenating rows with Int64 datatype coerces to object' remains.

@TomAugspurger
Contributor

TomAugspurger commented Jan 14, 2019 via email

@janvanrijn
Author

Updated.

@TomAugspurger
Contributor

Thanks. So the issue is that concatenating integer and empty coerces to object

In [33]: pd.concat([df_a, pd.DataFrame(index=[0, 1])], ignore_index=True, sort=True).dtypes
Out[33]:
a    object
dtype: object

The logic in get_empty_dtype_and_na in internals/concat.py needs to be updated. I suspect that things can be simplified somewhat now that we have EAs.
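
The reduced case above can be turned into a small check that is easy to rerun against different pandas versions. This is only a sketch: `concat_with_empty_rows` is a hypothetical helper name, and the printed dtype depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd


def concat_with_empty_rows(frame, n_extra):
    """Concatenate `frame` with a frame that has n_extra rows and no columns.

    Hypothetical helper for illustration; mirrors the reduced example above.
    """
    filler = pd.DataFrame(index=range(n_extra))  # rows only, no columns
    return pd.concat([frame, filler], ignore_index=True, sort=True)


df_a = pd.DataFrame({"a": pd.array([-1, -1, -1], dtype="Int64")})
result = concat_with_empty_rows(df_a, 2)

print(result.dtypes)
# True when the nullable integer dtype survived the concat, False otherwise.
print("dtype preserved:", result["a"].dtype == df_a["a"].dtype)
```

The extra rows come back as missing values in column `a`; whether the column stays `Int64` or falls back to `object` is exactly the behavior discussed in this issue.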

TomAugspurger added the Reshaping and Difficulty Intermediate labels Jan 14, 2019
@jreback
Contributor

jreback commented Jan 14, 2019

this is correct
we changed this

@jreback
Contributor

jreback commented Jan 14, 2019

sorry that was for int64

concat with empty is not tested at all for EA types

@TomAugspurger
Contributor

TomAugspurger commented Jan 14, 2019 via email

@janvanrijn
Author

Also, when making an Int64 column the index, it is coerced to a float type.

@TomAugspurger
Contributor

@janvanrijn that's a separate issue. #22861 / #23223

@jreback
Contributor

jreback commented Jan 28, 2019

Also, when making an Int64 column the index, it is coerced to a float type.

is not implemented

@wuhaochen
Contributor

wuhaochen commented Feb 15, 2019

Yeah, just to make this clear, the following are two different cases

In [10]: pd.concat([pd.DataFrame({"A": [1, 2]}), pd.DataFrame()]).dtypes
Out[10]:
A    int64
dtype: object

In [11]: pd.concat([pd.DataFrame({"A": [1, 2]}), pd.DataFrame(columns=['A'])]).dtypes
Out[11]:
A    object
dtype: object

Concatenating with a newly-created column should preserve the dtype (if possible).

On Mon, Jan 14, 2019 at 2:20 PM Jeff Reback wrote:
sorry that was for int64
concat with empty is not tested at all for EA types

In such a case, concatenating float64 data doesn't preserve the dtype, though. Is that considered a bug?
It would be a bit weird if the behaviors for int and float were not the same.

pd.concat([pd.DataFrame({"A": [1., 2.]}), pd.DataFrame(columns=['A'])]).dtypes

A float64
dtype: object
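
To compare the int, float, and nullable-integer cases side by side on a given installation, a loop like the following can help. It is a sketch under assumptions: `dtype_after_concat_with_empty` is a hypothetical helper name, and the results differ between pandas versions (newer versions may also emit a FutureWarning about empty entries), so the output is deliberately not shown.

```python
import pandas as pd


def dtype_after_concat_with_empty(values, dtype):
    """Return the dtype of column A after concatenating with an empty column A.

    Hypothetical helper for illustration; not part of pandas.
    """
    left = pd.DataFrame({"A": pd.Series(values, dtype=dtype)})
    right = pd.DataFrame(columns=["A"])  # column exists but holds no rows
    return pd.concat([left, right]).dtypes["A"]


cases = [("int64", [1, 2]), ("float64", [1.0, 2.0]), ("Int64", [1, 2])]
results = {dtype: dtype_after_concat_with_empty(values, dtype) for dtype, values in cases}

for dtype, resulting in results.items():
    print(dtype, "->", resulting)
```

Running this on the version in question shows directly whether int and float behave the same way, without relying on outputs copied between comments.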

@TomAugspurger
Contributor

TomAugspurger commented Feb 15, 2019 via email
