[BUG] For an arrow table that contains string columns and is converted from pandas, the from_arrow fails after slice because the length does not match. #12463

Closed
infzo opened this issue Jan 3, 2023 · 3 comments · Fixed by #12665
Assignees
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@infzo

infzo commented Jan 3, 2023

Describe the bug
For an Arrow table that contains string columns and was converted from pandas, `cudf.DataFrame.from_arrow` fails after the table is sliced, because the index length no longer matches the number of rows.

Steps/Code to reproduce bug

import cudf
import pyarrow
import pandas as pd

cdf = pd.DataFrame.from_dict({'a': ['aa', 'bb', 'cc'], 'b': [1, 2, 3]})
print(cdf)

tbl = pyarrow.Table.from_pandas(cdf)
print(tbl)

tbl_slice = tbl.slice(0, 2)
print(tbl_slice)

gdf = cudf.DataFrame.from_arrow(tbl_slice)

>>> import cudf
>>> import pyarrow
>>> import pandas as pd
>>>
>>>
>>> cdf = pd.DataFrame.from_dict({'a': ['aa', 'bb', 'cc'], 'b': [1, 2, 3]})
>>> print(cdf)
    a  b
0  aa  1
1  bb  2
2  cc  3
>>>
>>>
>>> tbl = pyarrow.Table.from_pandas(cdf)
>>> print(tbl)
pyarrow.Table
a: string
b: int64
----
a: [["aa","bb","cc"]]
b: [[1,2,3]]
>>>
>>>
>>> tbl_slice = tbl.slice(0, 2)
>>> print(tbl_slice)
pyarrow.Table
a: string
b: int64
----
a: [["aa","bb"]]
b: [[1,2]]
>>>
>>>
>>> gdf = cudf.DataFrame.from_arrow(tbl_slice)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/site-packages/cudf/core/dataframe.py", line 4458, in from_arrow
    out = out.set_index(
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/site-packages/cudf/core/dataframe.py", line 2453, in set_index
    df.index = idx
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/site-packages/cudf/core/dataframe.py", line 1027, in __setattr__
    super().__setattr__(key, col)
  File "/opt/huawei/release/bi/uxdf/envPkg/Miniconda3/envs/uxdf_server/lib/python3.9/site-packages/cudf/core/indexed_frame.py", line 341, in index
    raise ValueError(
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
>>>

Expected behavior
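`cudf.DataFrame.from_arrow` should convert the sliced table without error and return a DataFrame containing the two rows of the slice.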

Environment overview (please complete the following information)

  • Environment location: Cloud(HuaweiCloud)
  • Method of cuDF install: conda

Environment details
Not found.

@infzo infzo added Needs Triage Need team to review and classify bug Something isn't working labels Jan 3, 2023
@galipremsagar galipremsagar added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jan 3, 2023
@galipremsagar galipremsagar self-assigned this Jan 3, 2023
@galipremsagar
Contributor

galipremsagar commented Jan 3, 2023

@infzo Thanks for reporting this issue. It is caused by the pandas index metadata attached to the Arrow table. We traced the problem to pyarrow and raised it upstream: apache/arrow#15178

Until that is resolved, we recommend using the following workaround for a sliced table:

In [17]: cudf.DataFrame.from_pandas(tbl_slice.to_pandas())
Out[17]: 
    a  b
0  aa  1
1  bb  2

@galipremsagar galipremsagar added not a bug bug Something isn't working wontfix This will not be worked on and removed bug Something isn't working Python Affects Python cuDF API. not a bug labels Jan 3, 2023
@galipremsagar
Contributor

Closing this issue, as an upstream issue has been raised: apache/arrow#15178

@galipremsagar galipremsagar added Python Affects Python cuDF API. and removed wontfix This will not be worked on labels Jan 31, 2023
@galipremsagar
Contributor

Reopening, since Arrow might not fix this issue on their end. It is handled in cuDF instead: #12665

@galipremsagar galipremsagar reopened this Jan 31, 2023
rapids-bot bot pushed a commit that referenced this issue Feb 2, 2023
Fixes: #12463 

This PR handles any kind of outdated pandas index metadata in `from_arrow` by ignoring it.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #12665

2 participants