Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix consumption of CPU-backed interchange protocol dataframes #11392

Conversation

shwina
Copy link
Contributor

@shwina shwina commented Jul 28, 2022

Description

Closes #11245. This PR fixes a bug in our code that consumes a protocol DataFrame that is backed by CPU memory.

This enables using the from_dataframe() function to construct DataFrames from other libraries:

In [12]: pdf = pd.DataFrame({'x': [1, 2, 3], 'y': [1.0, 2.0, np.nan], 'z': ['a', 'b', 'c']})

In [13]: vdf = vaex.from_dict({'x': [1, 2, 3], 'y': [1.0, 2.0, np.nan]})

In [14]: cudf.from_dataframe(pdf, allow_copy=True)
Out[14]: 
   x    y  z
0  1  1.0  a
1  2  2.0  b
2  3  NaN  c

In [15]: cudf.from_dataframe(vdf, allow_copy=True)
Out[15]: 
   x    y
0  1  1.0
1  2  2.0
2  3  NaN

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina changed the base branch from branch-22.08 to branch-22.10 July 28, 2022 16:24
@github-actions github-actions bot added CMake CMake build issue conda Java Affects Java cuDF API. Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jul 28, 2022
@shwina shwina changed the title Fix interchange protocol error cross device Fix consumption of CPU-backed interchange protocol dataframes Jul 28, 2022
@github-actions
Copy link

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@shwina shwina changed the base branch from branch-22.12 to branch-23.02 December 2, 2022 21:28
@ttnghia ttnghia removed the request for review from a team December 2, 2022 22:24
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shwina This popped back up on my radar after you bumped the branch. If this is still active, I think it only needs a bit more work to be complete?

python/cudf/cudf/tests/test_df_protocol.py Show resolved Hide resolved
python/cudf/cudf/core/df_protocol.py Show resolved Hide resolved
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nitpicks, but otherwise looks good.

Comment on lines 51 to 56
class _MaskKind(enum.IntEnum):
NON_NULLABLE = (0,)
NAN = (1,)
SENTINEL = (2,)
BITMASK = (3,)
BYTEMASK = 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it deliberate that BYTEMASK is an int, but the other mask kinds are tuples? (They both turn into ints on construction, but the tuple instantiation looks kind of weird.

Comment on lines 741 to 752
if not allow_copy:
raise TypeError(
"This operation must copy data from CPU to GPU. "
"Set `allow_copy=True` to allow it."
)
else:
dbuf = rmm.DeviceBuffer(ptr=buf.ptr, size=buf.bufsize)
return _CuDFBuffer(
as_buffer(dbuf, exposed=True),
protocol_dtype_to_cupy_dtype(data_type),
allow_copy,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style, I (at least) find double-negated conditions hard to parse. Perhaps have the truthy case in the if, and the exceptional case in the else branch:

if allow_copy:
    dbuf = ...
    return ...
else:
    raise TypeError(...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@ajschmidt8 ajschmidt8 removed the request for review from a team December 14, 2022 01:11
@ajschmidt8
Copy link
Member

Removing ops-codeowners from the required reviews since it doesn't seem there are any file changes that we're responsible for. Feel free to add us back if necessary.

@shwina shwina changed the base branch from branch-23.02 to branch-23.04 January 26, 2023 16:47
@vyasr
Copy link
Contributor

vyasr commented Apr 5, 2023

@shwina what's the status of this PR?

@github-actions github-actions bot added the ci label Apr 5, 2023
@shwina shwina changed the base branch from branch-23.04 to branch-23.06 April 5, 2023 18:28
@shwina
Copy link
Contributor Author

shwina commented Apr 5, 2023

Thanks @vyasr and apologies all for letting this PR go stale - I've updated the PR to target 23.06, and I believe addressed all the review comments.

@bdice, @wence-, would you be willing to take a final look? Thank you for the suggested improvements!

@shwina shwina requested review from bdice and wence- April 5, 2023 18:56
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation looks good, a minor request to add a little more testing in the pandas<->cudf interop tests

python/cudf/cudf/core/df_protocol.py Show resolved Hide resolved
)
def test_from_cpu_df(pandas_df):
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
cudf.from_dataframe(df, allow_copy=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cudf.from_dataframe(df, allow_copy=True)
cdf = cudf.from_dataframe(df, allow_copy=True)
assert_eq(df, cdf)

Add a case where there are nulls in the pandas frame?

Also can we check the round-trip in the opposite direction? (Assuming that is supported too).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I added some null tests.

Also can we check the round-trip in the opposite direction? (Assuming that is supported too).

Today, that segfaults because Pandas doesn't check for the device type of the incoming buffers and tries to interpret the buffer pointers as host pointers :(

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor change for docstring syntax, otherwise LGTM.

python/cudf/cudf/core/df_protocol.py Outdated Show resolved Hide resolved
@GregoryKimball GregoryKimball removed the request for review from elstehle April 17, 2023 16:50
Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, apologies for the delay

@shwina
Copy link
Contributor Author

shwina commented Apr 28, 2023

/merge

@rapids-bot rapids-bot bot merged commit ca2ca37 into rapidsai:branch-23.06 Apr 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CMake CMake build issue Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Raise informative error when interchanging non-cudf dataframes
6 participants