BUG: Unable to create a MultiIndex with nan values in nullable Float dtypes #39984

Closed
2 of 3 tasks
galipremsagar opened this issue Feb 23, 2021 · 2 comments · Fixed by #48042
Labels
good first issue MultiIndex NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Tests Unit test(s) needed to prevent regressions

Comments

@galipremsagar

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

>>> import numpy as np
>>> import pandas as pd
>>> import pyarrow as pa
>>> df = pd.DataFrame({'a':pd.Series([1, 2, None], dtype='Int64'), 'b':pd.Float64Dtype().__from_arrow__(pa.array([0.2, np.nan, None]))})
>>> df
      a     b
0     1   0.2
1     2   NaN
2  <NA>  <NA>
>>> pd.MultiIndex.from_frame(df)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 662, in from_frame
    return cls.from_arrays(columns, sortorder=sortorder, names=names)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/indexes/multi.py", line 462, in from_arrays
    codes, levels = factorize_from_iterables(arrays)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2632, in factorize_from_iterables
    return map(list, zip(*(factorize_from_iterable(it) for it in iterables)))
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2632, in <genexpr>
    return map(list, zip(*(factorize_from_iterable(it) for it in iterables)))
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 2604, in factorize_from_iterable
    cat = Categorical(values, ordered=False)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/arrays/categorical.py", line 359, in __init__
    dtype = CategoricalDtype(categories, dtype.ordered)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 160, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 314, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/Users/pgali/PycharmProjects/del/venv1/lib/python3.7/site-packages/pandas/core/dtypes/dtypes.py", line 508, in validate_categories
    raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null

Problem description

Should NaN be allowed at all when creating a nullable floating array? If so, creating a MultiIndex from a dataframe that contains such a column fails with the error above.
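One possible way to sidestep the error, sketched here only for illustration, is to collapse NaN into <NA> before building the index. The helper name `nan_to_na` is hypothetical and not part of pandas. The sketch assumes that `pd.array(..., dtype="Float64")` masks NaN as <NA>, and it builds the mixed NaN/<NA> column with `pd.arrays.FloatingArray` instead of pyarrow so that it runs without extra dependencies.

```python
import numpy as np
import pandas as pd


def nan_to_na(series: pd.Series) -> pd.Series:
    """Hypothetical helper: rebuild a nullable float Series with NaN masked as <NA>."""
    # Converting to a plain float64 NumPy array turns every <NA> into NaN ...
    values = series.to_numpy(dtype="float64", na_value=np.nan)
    # ... and pd.array(..., dtype="Float64") is assumed to mask every NaN as <NA>.
    return pd.Series(
        pd.array(values, dtype="Float64"), index=series.index, name=series.name
    )


# Float64 column holding an unmasked NaN (row 1) and a masked <NA> (row 2),
# built by passing the data and the mask to FloatingArray directly.
b = pd.arrays.FloatingArray(
    np.array([0.2, np.nan, 0.0]),
    np.array([False, False, True]),
)
df = pd.DataFrame({"a": pd.Series([1, 2, None], dtype="Int64"), "b": b})

# Normalize the float column, then build the MultiIndex.
df["b"] = nan_to_na(df["b"])
mi = pd.MultiIndex.from_frame(df)
print(mi)
```

This trades away the NaN/<NA> distinction, so it only fits cases where the two are meant to be treated as the same kind of missing value.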

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 7d32926
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 20.3.0
Version : Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8
pandas : 1.2.2
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : 5.29.0
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@galipremsagar galipremsagar added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 23, 2021
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Feb 26, 2021
Fixes: #7367, #7446

This PR upgrades pandas to `1.2.2` in `cudf`. Changes include:

- [x] Bumping up `pandas` version.
- [x] Fixing `isin` behavior, which now takes types into account: pandas-dev/pandas#38781
- [x] `CategoricalColumn.__setitem__` will now not allow setting of values that are not in existing categories.
- [x] Introduced `cudf.core._compat.PANDAS_GE_120` variable to maintain backward compatibility.
- [x] Updated usages of `pd.core.tools.datetimes._guess_datetime_format` to `pd.core.tools.datetimes.guess_datetime_format`
- [x] Introduced `std` & `median` in `DateTimeColumn`.
- [x] Fixed incorrect handling of passing `StringMethods` as an input to methods in string APIs.
- [x] Fixed a typo in calling `is_valid` of `Scalar`.
- [x] Removed unnecessary special handling in `TimeDeltaColumn.sum` logic for empty inputs.
- [x] Introduced passing `dtype='float64'` wherever there is an empty series being created since pandas will soon be defaulting to `object` dtype if no type is passed and we don't have a perfectly resembling `object` dtype as that of pandas.
- [x] Fixed deprecation warnings of `Index.__or__` and `Index.__xor__` by replacing with `union` & `symmetric_difference` APIs.
- [x] Introduced mapping of our `float32` & `float64` dtypes to pandas Nullable dtypes `Float32Dtype` & `Float64Dtype` when `nullable=True` in `to_pandas`.
- [x] With introduction of nullable float dtypes, there is an issue in creating `MultiIndex` from dataframe: pandas-dev/pandas#39984, so introduced a workaround in our `MultiIndex.__repr__` code.
- [x] Removed usages of `check_less_precise` in our code-base as this is deprecated and is replaced with `rtol` & `atol`. Retained its usages in our testing APIs for backward compatibility.
- [x] Removed a good number of `xfail` cases which are actually passing right now because of resolved issues in both `pandas` & `cudf`.
- [x] Did some miscellaneous code-cleanup in pytests.
- [x] Fixed pytests that will fail when run in parallel due to access to shared pytest params being manipulated inplace.
- [x] Follow a standard import pattern across pytest files, some files do `from pandas import Series` and some do `from cudf.core import Series`. So removed both patterns and doing only simple `import cudf` & `import pandas as pd` to avoid confusion while debugging test failures across multiple files. (Made this change in all pytest files which I had to touch as part of pandas upgrade, we can make similar changes in future for the files which we touch).
- [x] Fix issue with assigning `np.nan` values to a `CategoricalColumn` and fix related `__repr__` code: #7446

Authors:
  - GALI PREM SAGAR (@galipremsagar)

Approvers:
  - Keith Kraus (@kkraus14)
  - AJ Schmidt (@ajschmidt8)

URL: #7375
@mzeitlin11 mzeitlin11 added MultiIndex NA - MaskedArrays Related to pd.NA and nullable extension arrays Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 1, 2021
@jbrockmendel
Member

works on master, needs test

@jbrockmendel jbrockmendel added Needs Tests Unit test(s) needed to prevent regressions and removed Needs Discussion Requires discussion from core team before further action labels Jan 9, 2022
@BarkotBeyene
Contributor

I'll be working on the test for this issue.
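
A regression test for this could look roughly like the sketch below. It is not the test that was merged. The test name is made up, and it assumes a pandas version in which the issue is already fixed. To avoid a pyarrow dependency it builds the column with `pd.arrays.FloatingArray`, which is assumed to produce the same state as the `__from_arrow__` call in the report (NaN kept in the data, <NA> carried by the mask).

```python
import numpy as np
import pandas as pd


def test_from_frame_nan_in_nullable_float():
    # GH 39984: NaN stored in the data of a nullable Float64 column
    # should not prevent MultiIndex.from_frame from succeeding.
    b = pd.arrays.FloatingArray(
        np.array([0.2, np.nan, 0.0]),    # data; row 1 holds an unmasked NaN
        np.array([False, False, True]),  # mask; row 2 is <NA>
    )
    df = pd.DataFrame({"a": pd.Series([1, 2, None], dtype="Int64"), "b": b})

    # On affected versions this call raised
    # "ValueError: Categorical categories cannot be null".
    result = pd.MultiIndex.from_frame(df)

    assert result.nlevels == 2
    assert len(result) == 3
    assert list(result.names) == ["a", "b"]
```

The assertions only check the shape and names of the result. A stricter test would compare against an expected index, for example one built with `MultiIndex.from_arrays`.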

5 participants