Non-unique integers coerced to float during UInt64Index creation with explicit #29526

oguzhanogreden · 2019-11-10T09:14:13Z

Noticed this thanks to @jschendel's comment over here. I'll use this to learn more about data types and try to suggest a reasonable solution.

Somewhat relevant: #15832, #18400

Code Sample, a copy-pastable example if possible

from pandas import UInt64Index

index1 = 7606741985629028552
index2 = 17876870360202815256

UInt64Index([index1, index2])[0]
# Returns: 7606741985629028352

# These will return the input value:
UInt64Index([index1])[0]  
UInt64Index([index2])[0]
UInt64Index([index1, index1])[0]

Problem description

The numpy array creation here coerces the integers to float, even though specifying the dtype explicitly would prevent this behavior.

Expected Output

UInt64Index contains precisely the input values.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 0.25.3
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None

Replicates on current master as well.


jschendel · 2019-11-11T05:51:14Z

Yeah, the root cause of this looks to be numpy's inference rules.

From what I can tell numpy looks to be inferring float64 instead of uint64 when both of the following conditions are met:

input data contains a number supported by both int64 and uint64 (0 to 2**63 - 1)
input data contains a number supported by only uint64 (2**63 to 2**64 - 1)

In [1]: import numpy as np; np.__version__
Out[1]: '1.17.3'

In [2]: np.array([1, 2**63])
Out[2]: array([1.00000000e+00, 9.22337204e+18])

This works fine if uint64 is explicitly specified:

In [3]: np.array([1, 2**63], dtype="uint64")
Out[3]: array([                  1, 9223372036854775808], dtype=uint64)
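Since the explicit dtype avoids the float detour, one way to sidestep inference is to decide the dtype from the value range before building the array. The following is only a sketch: `pick_integer_dtype` is a hypothetical helper, not something numpy or pandas provides, and it assumes a non-empty sequence of Python ints.

```python
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def pick_integer_dtype(values):
    """Hypothetical helper: name a 64-bit integer dtype that holds every value exactly.

    Assumes `values` is a non-empty sequence of Python ints.
    """
    lo, hi = min(values), max(values)
    if lo < 0:
        # Negative values rule out uint64, so everything must fit in int64.
        if lo < -(2**63) or hi > INT64_MAX:
            raise OverflowError("no single 64-bit integer dtype fits these values")
        return "int64"
    if hi <= INT64_MAX:
        return "int64"
    if hi <= UINT64_MAX:
        # Mixed "small" and "uint64-only" values land here instead of float64.
        return "uint64"
    raise OverflowError("value exceeds the uint64 range")


print(pick_integer_dtype([1, 2**63]))      # uint64
print(pick_integer_dtype([1, 2**63 - 1]))  # int64
```

The returned name could then be passed along as `np.array(values, dtype=...)`, so the mixed-range case never goes through float64.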

If both values are in the int64 range the dtype is correctly inferred as int64:

In [4]: np.array([1, 2**63 - 1])
Out[4]: array([                  1, 9223372036854775807])

In [5]: _.dtype
Out[5]: dtype('int64')

If both values are in the uint64-only range the dtype is correctly inferred as uint64:

In [6]: np.array([2**63, 2**63 + 1])
Out[6]: array([9223372036854775808, 9223372036854775809], dtype=uint64)

The intermediate conversion to float can cause precision loss starting at 2**53 + 1, which is the first integer that can't be represented exactly by float64:

In [7]: np.float64(2**53), np.float64(2**53 + 1)
Out[7]: (9007199254740992.0, 9007199254740992.0)
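To tie this back to the value in the original report, here is a small pure-Python sketch. It relies on Python's built-in `float` being an IEEE 754 double, the same format as float64, so it needs neither numpy nor pandas. The variable names are only illustrative.

```python
# Value from the original report; it lies between 2**62 and 2**63.
index1 = 7606741985629028552

# Round-trip through a double, as happens when the array is inferred as float64.
round_tripped = int(float(index1))

# Doubles carry a 53-bit significand, so in [2**62, 2**63) adjacent
# representable values are 2**(62 - 52) = 1024 apart.
spacing = 2 ** (62 - 52)

print(round_tripped)            # 7606741985629028352
print(index1 - round_tripped)   # 200
print(round_tripped % spacing)  # 0
```

The round-tripped value matches what `UInt64Index([index1, index2])[0]` returned in the report, which is consistent with the float64 detour being the source of the discrepancy.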

I'll look into this on the numpy side to see if this is the expected inference behavior or a bug.

This was referenced Nov 10, 2019

Makes NumericIndex constructor dtype aware #29529

Merged

Correct type inference for UInt64Index during access #29420

Merged

jschendel added the Bug, Dtype Conversions, and Index labels Nov 11, 2019

jschendel added this to the 1.0 milestone Nov 11, 2019

jreback closed this as completed in #29529 Nov 17, 2019
