[BUG] cuDF cannot create list of struct dataframe using dict or from pandas #7561

Closed
devavret opened this issue Mar 10, 2021 · 3 comments · Fixed by #8244
Labels
bug Something isn't working Python Affects Python cuDF API.

Comments

@devavret
Contributor

cuDF loses the field names of the struct inside a list when creating from a dict:

In [1]: import cudf

In [2]: import pandas as pd

In [3]: cudf.DataFrame({
   ...:             "family": [
   ...:                 [None, {"human?": True, "deets": {"weight": 2.4, "age": 27}}],
   ...:                 [
   ...:                     {"human?": None, "deets": {"weight": 5.3, "age": 25}},
   ...:                     {"human?": False, "deets": {"weight": 8.0, "age": 31}},
   ...:                     {"human?": False, "deets": None},
   ...:                 ],
   ...:                 [],
   ...:                 [{"human?": None, "deets": {"weight": 6.9, "age": None}}],
   ...:             ]
   ...:         })
Out[3]: 
                                              family
0    [None, {'0': {'0': 27.0, '1': 2.4}, '1': True}]
1  [{'0': {'0': 25.0, '1': 5.3}, '1': None}, {'0'...
2                                                 []
3          [{'0': {'0': None, '1': 6.9}, '1': None}]

or from pandas:

In [4]: pdf = pd.DataFrame({
   ...:             "family": [
   ...:                 [None, {"human?": True, "deets": {"weight": 2.4, "age": 27}}],
   ...:                 [
   ...:                     {"human?": None, "deets": {"weight": 5.3, "age": 25}},
   ...:                     {"human?": False, "deets": {"weight": 8.0, "age": 31}},
   ...:                     {"human?": False, "deets": None},
   ...:                 ],
   ...:                 [],
   ...:                 [{"human?": None, "deets": {"weight": 6.9, "age": None}}],
   ...:             ]
   ...:         })

In [5]: cudf.from_pandas(pdf)
Out[5]: 
                                              family
0    [None, {'0': {'0': 27.0, '1': 2.4}, '1': True}]
1  [{'0': {'0': 25.0, '1': 5.3}, '1': None}, {'0'...
2                                                 []
3          [{'0': {'0': None, '1': 6.9}, '1': None}]

The expected dataframe has field names:

In [6]: pdf
Out[6]: 
                                              family
0  [None, {'human?': True, 'deets': {'weight': 2....
1  [{'human?': None, 'deets': {'weight': 5.3, 'ag...
2                                                 []
3  [{'human?': None, 'deets': {'weight': 6.9, 'ag...

devavret added the "bug" (Something isn't working) and "Needs Triage" (Need team to review and classify) labels on Mar 10, 2021
kkraus14 added the "Python" (Affects Python cuDF API.) label and removed the "Needs Triage" label on Mar 26, 2021

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@isVoid
Contributor

isVoid commented May 12, 2021

RCA: field name information is lost during the round trip to libcudf, and the column index is used to construct the field name instead. (A non-nested struct column preserves its field names because it goes through a different code path.)

str(i): dtype_from_column_view(cv.child(i))

To resolve this, cuDF needs to pass the field names from the pyarrow column along to the reconstructed cuDF column.
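
To make the failure mode concrete, here is a toy model of it in plain Python. It is only a sketch: `ToyColumnView` and `toy_dtype_from_column_view` are hypothetical stand-ins, not cuDF or libcudf APIs. The one assumption carried over from the RCA is that the column coming back from libcudf knows its children by position only, so the struct keys are rebuilt as `str(i)`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyColumnView:
    """Hypothetical stand-in for a column view: a type kind plus unnamed children."""
    kind: str  # "struct", "list", or a leaf type name such as "float64"
    children: List["ToyColumnView"] = field(default_factory=list)


def toy_dtype_from_column_view(cv):
    """Rebuild a nested type description from positional information only."""
    if cv.kind == "struct":
        # No field names are available here, so the child index becomes the key.
        return {
            str(i): toy_dtype_from_column_view(child)
            for i, child in enumerate(cv.children)
        }
    if cv.kind == "list":
        # A list has exactly one child: its element type.
        return [toy_dtype_from_column_view(cv.children[0])]
    return cv.kind


# Shape of the "family" column: list<struct<struct<float64, float64>, bool>>
family = ToyColumnView("list", [
    ToyColumnView("struct", [
        ToyColumnView("struct", [ToyColumnView("float64"), ToyColumnView("float64")]),
        ToyColumnView("bool"),
    ]),
])

recovered = toy_dtype_from_column_view(family)
print(recovered)  # [{'0': {'0': 'float64', '1': 'float64'}, '1': 'bool'}]
```

The `'0'`/`'1'` keys match the ones in the reported output; the original names (`deets`, `human?`, `age`, `weight`) are not recoverable from this structure alone.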

@isVoid
Contributor

isVoid commented May 13, 2021

RCA update: synced with @shwina offline. Inside the cuDF type system, column._copy_type_metadata handles field name transfer. However, when a cuDF column is constructed from pyarrow, no function copies the type from the pyarrow array. The plan is to copy the type upon construction in from_arrow.
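
The planned type copy can be pictured as a recursive walk over two type trees of the same shape. The sketch below is a toy illustration in plain Python, not the cuDF implementation: `copy_field_names` is a hypothetical helper, struct types are modelled as dicts, list types as one-element lists, and leaf types as strings.

```python
def copy_field_names(positional, named):
    """Return `positional` with its struct keys replaced by the keys of `named`.

    Both arguments are toy type descriptions of the same shape:
    dict = struct, one-element list = list, str = leaf type.
    """
    if isinstance(named, dict):
        # Struct: children are matched by position; the key is taken from `named`.
        return {
            name: copy_field_names(child, named[name])
            for name, child in zip(named, positional.values())
        }
    if isinstance(named, list):
        # List: recurse into the single element type.
        return [copy_field_names(positional[0], named[0])]
    # Leaf: keep the type from the positional side.
    return positional


# Type rebuilt from positional information only (names lost).
positional = [{"0": {"0": "float64", "1": "float64"}, "1": "bool"}]
# Type as the source array describes it (names intact).
named = [{"deets": {"age": "float64", "weight": "float64"}, "human?": "bool"}]

restored = copy_field_names(positional, named)
print(restored)
# [{'deets': {'age': 'float64', 'weight': 'float64'}, 'human?': 'bool'}]
```

Matching children by position relies on both trees listing fields in the same order, which is an assumption of this sketch.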

rapids-bot bot pushed a commit that referenced this issue May 20, 2021
Closes #7561 

This PR makes sure that, upon constructing a cudf object, nested types from the pyarrow array are copied to the cudf object. This should handle arbitrary nesting of `Lists` and `Structs`. For decimal types, precision is copied from the array.

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Keith Kraus (https://github.com/kkraus14)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)

URL: #8244
devavret added a commit to devavret/cudf that referenced this issue Aug 3, 2021