BUG: Unable to create a MultiIndex with `nan` values in nullable `Float` dtypes #39984

galipremsagar · 2021-02-23T02:44:21Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({'a': pd.Series([1, 2, None], dtype='Int64'), 'b': pd.Float64Dtype().__from_arrow__(pa.array([0.2, np.nan, None]))})
>>> df
      a     b
0     1   0.2
1     2   NaN
2  <NA>  <NA>
>>> pd.MultiIndex.from_frame(df)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 662, in from_frame
    return cls.from_arrays(columns, sortorder=sortorder, names=names)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 462, in from_arrays
    codes, levels = factorize_from_iterables(arrays)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2632, in factorize_from_iterables
    return map(list, zip(*(factorize_from_iterable(it) for it in iterables)))
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2632, in <genexpr>
    return map(list, zip(*(factorize_from_iterable(it) for it in iterables)))
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2604, in factorize_from_iterable
    cat = Categorical(values, ordered=False)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 359, in __init__
    dtype = CategoricalDtype(categories, dtype.ordered)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 160, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 314, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 508, in validate_categories
    raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null

Problem description

Should `nan` be allowed in the first place when creating a nullable floating array? If yes, then there is an issue when creating a MultiIndex from the DataFrame: `MultiIndex.from_frame` raises `ValueError: Categorical categories cannot be null`.
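One possible way to sidestep the error is sketched below, under stated assumptions. The column is built directly with `pd.arrays.FloatingArray` as a stand-in for the pyarrow route in the report, and `nan_to_na` is a hypothetical helper (not a pandas API) that masks every `NaN` as `<NA>` before the index is built. It is only appropriate if `NaN` and `<NA>` do not need to stay distinct.

```python
import numpy as np
import pandas as pd

# Stand-in for the pyarrow route in the report: a Float64 array holding
# an unmasked NaN (position 1) and a masked <NA> (position 2).
values = np.array([0.2, np.nan, 0.0])
mask = np.array([False, False, True])
df = pd.DataFrame(
    {
        "a": pd.Series([1, 2, None], dtype="Int64"),
        "b": pd.Series(pd.arrays.FloatingArray(values, mask)),
    }
)


def nan_to_na(series):
    """Hypothetical helper: return a Float64 Series with every NaN masked as <NA>."""
    # Fill masked slots with NaN so one isnan() call finds both kinds of missing value.
    data = series.to_numpy(dtype="float64", na_value=np.nan)
    cleaned = pd.arrays.FloatingArray(data, np.isnan(data))
    return pd.Series(cleaned, index=series.index, name=series.name)


# Normalize the float column, then build the index from the cleaned frame.
df["b"] = nan_to_na(df["b"])
index = pd.MultiIndex.from_frame(df)
```

After normalization, `df["b"].isna()` reports both missing positions, and the resulting levels no longer contain an unmasked `NaN`.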

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 7d32926
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 20.3.0
Version : Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.2.2
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : 5.29.0
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@galipremsagar

Fixes: #7367, #7446

This PR upgrades pandas to `1.2.2` in `cudf`. Changes include:

- [x] Bumping up `pandas` version.
- [x] Fixing `isin` behavior, which now takes types into account: pandas-dev/pandas#38781
- [x] `CategoricalColumn.__setitem__` will no longer allow setting values that are not in the existing categories.
- [x] Introduced the `cudf.core._compat.PANDAS_GE_120` variable for backward compatibility.
- [x] Updated usages of `pd.core.tools.datetimes._guess_datetime_format` to `pd.core.tools.datetimes.guess_datetime_format`.
- [x] Introduced `std` & `median` in `DateTimeColumn`.
- [x] Fixed incorrect handling of `StringMethods` passed as an input to methods in string APIs.
- [x] Fixed a typo in calling `is_valid` of `Scalar`.
- [x] Removed unnecessary special handling in `TimeDeltaColumn.sum` logic for empty inputs.
- [x] Introduced passing `dtype='float64'` wherever an empty series is created, since pandas will soon default to `object` dtype if no type is passed and we don't have an `object` dtype that closely resembles that of pandas.
- [x] Fixed deprecation warnings of `Index.__or__` and `Index.__xor__` by replacing them with the `union` & `symmetric_difference` APIs.
- [x] Introduced mapping of our `float32` & `float64` dtypes to the pandas nullable dtypes `Float32Dtype` & `Float64Dtype` when `nullable=True` in `to_pandas`.
- [x] With the introduction of nullable float dtypes, there is an issue in creating a `MultiIndex` from a dataframe: pandas-dev/pandas#39984, so introduced a workaround in our `MultiIndex.__repr__` code.
- [x] Removed usages of `check_less_precise` in our code base, as it is deprecated and replaced with `rtol` & `atol`. Retained its usages in our testing APIs for backward compatibility.
- [x] Removed a good number of `xfail` cases that now pass because of resolved issues in both `pandas` & `cudf`.
- [x] Did some miscellaneous code cleanup in pytests.
- [x] Fixed pytests that would fail when run in parallel because shared pytest params were being manipulated in place.
- [x] Followed a standard import pattern across pytest files. Some files did `from pandas import Series` and some did `from cudf.core import Series`, so both patterns were removed in favor of a simple `import cudf` & `import pandas as pd`, to avoid confusion while debugging test failures across multiple files. (This change was made in all pytest files touched as part of the pandas upgrade; similar changes can be made in the future for other files as they are touched.)
- [x] Fixed an issue with assigning `np.nan` values to a `CategoricalColumn` and fixed the related `__repr__` code: #7446

Authors:
- GALI PREM SAGAR (@galipremsagar)

Approvers:
- Keith Kraus (@kkraus14)
- AJ Schmidt (@ajschmidt8)

URL: #7375

jbrockmendel · 2022-01-09T00:48:59Z

works on master, needs test
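A regression test for this could look roughly like the sketch below. It is an illustration under assumptions, not the test that was eventually merged: the test name is hypothetical, the frame is built with `pd.arrays.FloatingArray` instead of pyarrow so that the sketch has no extra dependency, and it only checks that construction succeeds on a pandas version where the bug is fixed.

```python
import numpy as np
import pandas as pd


def test_from_frame_nullable_float_with_nan():
    # GH 39984 (hypothetical test name). Position 1 holds an unmasked NaN,
    # position 2 is masked as <NA>, mirroring the frame in the report.
    floats = pd.arrays.FloatingArray(
        np.array([0.2, np.nan, 0.0]),
        np.array([False, False, True]),
    )
    df = pd.DataFrame(
        {
            "a": pd.Series([1, 2, None], dtype="Int64"),
            "b": pd.Series(floats),
        }
    )

    # On affected versions this call raised
    # "ValueError: Categorical categories cannot be null".
    result = pd.MultiIndex.from_frame(df)

    assert list(result.names) == ["a", "b"]
    assert len(result) == 3
```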

BarkotBeyene · 2022-07-29T18:55:50Z

I'll be working on the test for this issue.

galipremsagar added the Bug and Needs Triage labels Feb 23, 2021

galipremsagar mentioned this issue Feb 23, 2021

[REVIEW] Upgrade pandas to 1.2 rapidsai/cudf#7375

Merged

19 tasks

mzeitlin11 added the MultiIndex, NA - MaskedArrays, and Needs Discussion labels and removed the Needs Triage label Jul 1, 2021

isVoid mentioned this issue Dec 6, 2021

Move drop_duplicates, drop_na, _gather, take to IndexFrame and create their _base_index counterparts rapidsai/cudf#9807

Merged

jbrockmendel added the Needs Tests label and removed the Needs Discussion label Jan 9, 2022

mroeschke added good first issue and removed Bug labels Jul 6, 2022

mroeschke mentioned this issue Aug 8, 2022

TST: GH39984 Addition to tests #47981

Closed

5 tasks

BarkotBeyene mentioned this issue Aug 11, 2022

TST: GH39984 Addition to tests #48042

Merged

5 tasks

mroeschke closed this as completed in #48042 Aug 26, 2022
