Concatenating rows with Int64 datatype coerces to object #24768

janvanrijn · 2019-01-14T17:22:02Z

Code Sample, a copy-pastable example if possible

import pandas as pd

df_a = pd.DataFrame({'a': [-1, -1, -1]})
df_b = pd.DataFrame({'b': [1, 1, 1]})
df_a['a'] = df_a['a'].astype('Int64')
df_b['b'] = df_b['b'].astype('Int64')

total = pd.concat([df_a, df_b], ignore_index=True)

print(total.dtypes)

output:

a    object
b    object
dtype: object

Problem description

When running the exact same code with floats, the output is similar to the expected output. The coercion also happens with the append function:

import pandas as pd

df_a = pd.DataFrame({'a': [-1, -1, -1]})
df_b = pd.DataFrame({'b': [1, 1, 1]})
df_a['a'] = df_a['a'].astype('Int64')
df_b['b'] = df_b['b'].astype('Int64')

total = df_a.append([df_b], ignore_index=True)

print(total.dtypes)

Expected Output

a    Int64
b    Int64
dtype: object
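A possible workaround, sketched under the assumption that the coerced columns contain only integers and missing values, is to cast them back after concatenating. The name `restored` is illustrative and not part of the original report:

```python
import pandas as pd

df_a = pd.DataFrame({'a': [-1, -1, -1]}).astype('Int64')
df_b = pd.DataFrame({'b': [1, 1, 1]}).astype('Int64')

total = pd.concat([df_a, df_b], ignore_index=True)

# Cast both columns back to the nullable integer dtype. On pandas versions
# that already preserve Int64 through concat, this changes nothing.
restored = total.astype({'a': 'Int64', 'b': 'Int64'})
print(restored.dtypes)
```

This only papers over the coercion after the fact; it does not address why concat picks object in the first place.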

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+997.ga197837
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 2.2.2
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

WillAyd · 2019-01-14T17:23:23Z

Thanks for the report and makes sense. Investigation and PRs are always welcome!

janvanrijn · 2019-01-14T17:28:48Z

I like the title change that you proposed, as this is actually another issue that I wanted to open. The problem that I tried to point out in this issue, however, is that there are now 6 rows with a 1 in them (column b), whereas we would only expect 3 rows with a 1 in them.

janvanrijn · 2019-01-14T17:31:59Z

In short, I would like to propose keeping the original title of this issue, as all the code and expected output point to the 'unexpected behavior when concatenating rows with int datatype'. I will open a second issue with proper expected output on 'Concatenating rows with Int64 datatype coerces to object' (there are some more things that go wrong there that are not outlined in this issue).

TomAugspurger · 2019-01-14T17:56:48Z

@janvanrijn can you check the example in your original post? When you assign df_a['b'] = ..., it no longer matches your "working" example

import pandas as pd

df_a = pd.DataFrame({'a': [-1, -1, -1]})
df_b = pd.DataFrame({'b': [1, 1, 1]})

total = pd.concat([df_a, df_b], ignore_index=True)

as the columns are different. Perhaps you mean df_b['b'] = ... instead.

janvanrijn · 2019-01-14T19:16:23Z

Ouch, sloppy. My sincere apologies.

Then only the 'Concatenating rows with Int64 datatype coerces to object' remains.

TomAugspurger · 2019-01-14T19:21:18Z

Can you ensure that the original post has correct values for the example and expected output?

janvanrijn · 2019-01-14T19:44:10Z

Updated.

TomAugspurger · 2019-01-14T20:13:41Z

Thanks. So the issue is that concatenating an Int64 column with an empty frame coerces to object

In [33]: pd.concat([df_a, pd.DataFrame(index=[0, 1])], ignore_index=True, sort=True).dtypes
Out[33]:
a    object
dtype: object

The logic in get_empty_dtype_and_na in internals/concat.py needs to be updated. I suspect that things can be simplified somewhat now that we have EAs.
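To check whether a given pandas version still shows this coercion, a small diagnostic along the following lines may help. It is a rough sketch; `changed_dtypes` is a hypothetical helper written for this illustration, not a pandas function:

```python
import pandas as pd


def changed_dtypes(frames, **concat_kwargs):
    """Map each column whose dtype changed to (input dtypes, result dtype)."""
    result = pd.concat(frames, **concat_kwargs)
    changed = {}
    for col in result.columns:
        # Collect the dtype of this column in every input frame that has it.
        before = {str(f[col].dtype) for f in frames if col in f.columns}
        after = str(result[col].dtype)
        if before != {after}:
            changed[col] = (sorted(before), after)
    return changed


df_a = pd.DataFrame({'a': pd.array([-1, -1, -1], dtype='Int64')})
empty = pd.DataFrame(index=[0, 1])

# An empty dict means every column kept its dtype on this pandas version.
print(changed_dtypes([df_a, empty], ignore_index=True, sort=True))
```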

jreback · 2019-01-14T20:19:26Z

this is correct
we changed this

jreback · 2019-01-14T20:20:25Z

sorry that was for int64

concat with empty is not tested at all for EA types

TomAugspurger · 2019-01-14T20:24:04Z

Yeah, just to make this clear, the following are two different cases

```python
In [10]: pd.concat([pd.DataFrame({"A": [1, 2]}), pd.DataFrame()]).dtypes
Out[10]:
A    int64
dtype: object

In [11]: pd.concat([pd.DataFrame({"A": [1, 2]}), pd.DataFrame(columns=['A'])]).dtypes
Out[11]:
A    object
dtype: object
```

Concatenating with a newly-created column should preserve the dtype (if possible).
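One way to see why these two cases differ is to look at the empty frames on their own. This is only an illustration of the inputs, not of the concat internals, and the variable names are made up for the example:

```python
import pandas as pd

no_columns = pd.DataFrame()
empty_column = pd.DataFrame(columns=['A'])

# The first frame has no columns, so it contributes nothing to column "A".
print(no_columns.dtypes)

# The second frame has a zero-length column "A" that already carries a dtype
# (typically object), which then takes part in choosing the result dtype.
print(empty_column.dtypes)
```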

janvanrijn · 2019-01-28T21:29:26Z

Also, when making an Int64 column the index, it is coerced to a float type.

TomAugspurger · 2019-01-28T21:38:31Z

@janvanrijn that's a separate issue. #22861 / #23223

jreback · 2019-01-28T22:06:20Z

> Also, when making an Int64 column the index, it is coerced to a float type.

is not implemented

wuhaochen · 2019-02-15T16:58:07Z

In such a case, concatenating float64 data doesn't preserve the dtype though. Is that considered a bug?
It is a bit weird if the behaviors for int and float are not the same.

pd.concat([pd.DataFrame({"A": [1., 2.]}), pd.DataFrame(columns=['A'])]).dtypes

A    float64
dtype: object
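To compare the behaviors side by side, the same concat can be run for several dtypes in a loop. This is a sketch only; the resulting dtypes depend on the installed pandas version, so no expected output is given:

```python
import warnings

import pandas as pd

results = {}
for dtype in ['int64', 'float64', 'Int64']:
    left = pd.DataFrame({'A': [1, 2]}).astype(dtype)
    right = pd.DataFrame(columns=['A'])
    with warnings.catch_warnings():
        # Some pandas versions emit a FutureWarning about empty entries here.
        warnings.simplefilter('ignore', FutureWarning)
        results[dtype] = str(pd.concat([left, right])['A'].dtype)

# Maps each input dtype to the dtype of column "A" after the concat.
print(results)
```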

TomAugspurger · 2019-02-15T17:08:56Z

Not sure whether it's a bug or just not fully implemented yet. The general rule for concatenating a mixture of EAs and non-EAs is to coerce everything to object. We have an open issue about a type-based multiple dispatch for concat.

WillAyd added the Dtype Conversions (unexpected or buggy dtype conversions) and ExtensionArray (extending pandas with custom dtypes or arrays) labels Jan 14, 2019

WillAyd changed the title ~~concatenating rows with int datatype~~ Concatenating rows with Int64 datatype coerces to object Jan 14, 2019

TomAugspurger added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Difficulty Intermediate labels Jan 14, 2019

jbrockmendel removed Effort Medium labels Oct 21, 2019

phofl mentioned this issue Sep 10, 2020

Concatenating rows with Int64 datatype coerces to object #36278

Merged

4 tasks

jreback added this to the 1.2 milestone Sep 13, 2020

jreback closed this as completed in #36278 Sep 13, 2020
