Series iteration and to_dict methods *sometimes* return underlying storage type vs. Python object #25969

Open
boydgreenfield opened this issue Apr 3, 2019 · 5 comments
Labels
API - Consistency (Internal Consistency of API/Behavior); Bug; Dtype Conversions (Unexpected or buggy dtype conversions)

Comments

boydgreenfield commented Apr 3, 2019

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

s1 = pd.Series({"a": np.int64(64), "b": 10})
for v in s1.to_dict().values():
    print(type(v))  # prints <class 'int'> 2x

s2 = pd.Series({"a": np.int64(64), "b": 10, "c": "ABC"})
for v in s2.to_dict().values():
    print(type(v))  # prints <class 'numpy.int64'> for first variable "a"

for k, v in s1.items():
    print(k, type(v))  # prints <class 'int'> 2x

for k, v in s2.items():
    print(k, type(v))  # prints <class 'numpy.int64'> again for the first variable "a"

Problem description

pd.Series.to_dict can return different types for the same values depending on the composition of the Series. This also affects iteration, e.g., for k, v in series.items(): .... This is inconsistent and, critically, leads to hard-to-debug type issues downstream, especially around JSON conversion (the built-in json module and many other serializers raise an error when they encounter NumPy scalar types).
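
A minimal sketch of the JSON failure described above. To stay independent of the pandas version, it uses a plain dict as a stand-in for what s2.to_dict() returned in this report (that stand-in is an assumption). The helper name `to_builtin` is hypothetical, and the `default` hook of `json.dumps` is only one possible workaround, not the fix pandas adopted.

```python
import json

import numpy as np

# Stand-in for the dict that s2.to_dict() returned in this report
# (assumption: "a" is still a NumPy scalar, "b" and "c" are built-ins).
record = {"a": np.int64(64), "b": 10, "c": "ABC"}

try:
    json.dumps(record)
    failed = False
except TypeError:
    # np.int64 is not a subclass of int, so the encoder rejects it.
    failed = True


def to_builtin(obj):
    """Hypothetical fallback for json.dumps: unwrap NumPy scalars."""
    if isinstance(obj, np.generic):
        return obj.item()  # e.g. np.int64 -> int
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


encoded = json.dumps(record, default=to_builtin)
print(failed, encoded)
```

The `default` hook is only called for objects the encoder cannot handle itself, so built-in ints and strings pass through untouched.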

I cannot find this exact issue open in the issue tracker, though there are a number of related issues including:

Expected Output

Expected output is for type coercion to Python ints to occur regardless of the exact column composition in the Series. #24908 is a related issue for DataFrame coercions with irregular behavior happening as a result.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 10.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
@mroeschke
Member

This probably occurs because s2 is object dtype and it's trying to preserve the dtype of each input argument while the arguments in s1 can both be coerced to int64.

Investigation and PRs welcome~
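
A small sketch to check the dtype explanation above, reusing s1 and s2 from the original report. The helper `to_python_dict` is hypothetical (it is not a pandas API). It shows one way a caller might normalize values without relying on how a given pandas version behaves.

```python
import numpy as np
import pandas as pd

s1 = pd.Series({"a": np.int64(64), "b": 10})
s2 = pd.Series({"a": np.int64(64), "b": 10, "c": "ABC"})

# s1 holds only integers, so it is stored as a single int64 array;
# s2 mixes ints and a str, so it falls back to object dtype.
print(s1.dtype, s2.dtype)


def to_python_dict(series):
    """Hypothetical helper: like to_dict(), but unwraps NumPy scalars."""
    return {
        key: value.item() if isinstance(value, np.generic) else value
        for key, value in series.items()
    }


plain = to_python_dict(s2)
value_types = {key: type(value) for key, value in plain.items()}
print(value_types)
```

Values that are already built-in Python objects are left untouched, so the helper should be safe to apply whether or not the Series returns NumPy scalars.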

@mroeschke mroeschke added the Bug label Apr 4, 2019
@mroeschke mroeschke added this to the Contributions Welcome milestone Apr 4, 2019
@drew-heenan
Contributor

drew-heenan commented Apr 7, 2019

I'm having a go at this issue - quick note @boydgreenfield, it looks like iterating over a Series object as in the last two loops in your example results in an iteration only over the int values in the Series. Did you mean to iterate over s1.items() or similar?

@boydgreenfield
Author

@drew-heenan Yes, you're right, I meant .items(). I have updated the code snippet above. Thanks for taking a look at the issue!

boydgreenfield added a commit to onecodex/onecodex that referenced this issue Apr 14, 2019
boydgreenfield added a commit to onecodex/onecodex that referenced this issue Apr 14, 2019
@jbrockmendel jbrockmendel added the API - Consistency Internal Consistency of API/Behavior label Sep 22, 2020
@jreback jreback modified the milestones: Contributions Welcome, 1.3 Feb 12, 2021
@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Feb 12, 2021
@simonjayhawkins
Member

from #37648 (comment)

This resolves the issue of return types from to_dict. #25969 also discusses return types from .items(), which relates to an outstanding NumPy issue numpy/numpy#14139, and I don't address that part here atm

@simonjayhawkins simonjayhawkins modified the milestones: 1.3, Contributions Welcome Jun 11, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@ghost

ghost commented Dec 8, 2022

Some findings about the root cause of this casting issue on Series.items(): #50125 (comment)

6 participants