Concatenate dictionary of objects along axis=1 #15623

er-eis · 2024-04-30T21:59:21Z

Description

Note: This work is heavily based on amanlai's PR raised here. I wasn't able to base my branch off amanlai's because that branch was deleted.

Closes #15115.
Unlike pandas.concat, cudf.concat doesn't work with a dictionary of objects. The following code raises an error.

import cudf

d = {
    'first': cudf.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    'second': cudf.DataFrame({'A': [5, 6], 'B': [7, 8]}),
}

# Raises an error before this change
cudf.concat(d, axis=1)

This commit resolves this issue.
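For context, when pandas.concat is given a dictionary with axis=1, the dictionary keys become the outer level of the resulting column labels. The snippet below is only a plain-Python model of that labelling, assuming each "frame" is a dict of column name to values; `concat_columns` is a hypothetical helper and is not how cudf or pandas implement it.

```python
# Plain-Python model: each "frame" maps a column name to a list of values.
frames = {
    "first": {"A": [1, 2], "B": [3, 4]},
    "second": {"A": [5, 6], "B": [7, 8]},
}


def concat_columns(frames_by_key):
    """Model axis=1 concatenation of a dict: prefix each column with its dict key."""
    combined = {}
    for key, frame in frames_by_key.items():
        for column, values in frame.items():
            # The (key, column) tuple plays the role of a two-level column label.
            combined[(key, column)] = list(values)
    return combined


result = concat_columns(frames)
print(list(result))
# [('first', 'A'), ('first', 'B'), ('second', 'A'), ('second', 'B')]
```

The row values are untouched; only the column labels gain an outer level taken from the dictionary keys.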

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
^ need me to create an entry in the CHANGELOG.md?

Original change here rapidsai#3188 Why were we casting to "float64" in the old testcase? Maybe related to this comment? rapidsai#3188 (comment)

copy-pr-bot · 2024-04-30T21:59:24Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

er-eis · 2024-04-30T22:01:37Z

to the reviewer(s):

please take a look here 0736e4e#diff-1d997d95893af6a665d5803d179f472c673a7d4e1a0a04a305bd2f1c4a66d957L237

any idea why we were casting to 'float64' previously? tried tracking down why, see my commit message

bdice · 2024-04-30T22:15:29Z

/okay to test

bdice · 2024-04-30T22:21:02Z

any idea why we were casting to 'float64' previously?

My best guess is that this is a historical artifact and is no longer necessary. I think your change to remove that cast is the right move -- thanks!

I triggered CI and we can see if it passes tests. If it fails, we can investigate further.

er-eis · 2024-04-30T22:25:30Z

My best guess is that this is a historical artifact and is no longer necessary.

it weirds me out that the test has been passing all this time, but i'm a noob around these parts so i trust you!

wence-

Thanks for resurrecting this! This code is really fiddly, so great job unpicking things. I have a few suggestions and questions for cleanup.

python/cudf/cudf/core/reshape.py

wence- · 2024-05-01T09:20:19Z

/ok to test

wence- · 2024-05-03T14:54:05Z

/ok to test

wence- · 2024-05-03T14:54:17Z

Sorry, I made a sequence of mess-ups, but hopefully we've got there.

wence- · 2024-05-03T14:58:12Z

/ok to test

er-eis · 2024-05-03T15:18:05Z

@wence- do we prefer set().union(*(map(type, obj._data.keys()) for obj in objs)) over {type(name) for o in objs for name in o._data.keys()} ?

{type(name) for o in objs for name in o._data.keys()} is:

an order of magnitude faster
imo, more readable/pythonic

you can test the execution time by setting a breakpoint on line 427 and running this:

import time

# Time the set().union(...) version
before = time.perf_counter()
set().union(*(map(type, obj._data.keys()) for obj in objs))
union_elapsed = time.perf_counter() - before
# Time the set-comprehension version
before = time.perf_counter()
{type(name) for o in objs for name in o._data.keys()}
comprehension_elapsed = time.perf_counter() - before
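A single perf_counter reading can be noisy, so a repeated measurement may be more convincing. The following is a rough sketch using timeit outside the debugger; `FakeFrame` is a hypothetical stand-in that only mimics the `_data` mapping used above, so the numbers it prints say nothing definitive about real cudf objects.

```python
import timeit


class FakeFrame:
    """Hypothetical stand-in exposing a `_data` mapping keyed by column name."""

    def __init__(self, names):
        self._data = {name: None for name in names}


# Column names of mixed types, repeated across many objects.
objs = [FakeFrame(["A", "B", 0, 1.5]) for _ in range(100)]


def via_union():
    return set().union(*(map(type, obj._data.keys()) for obj in objs))


def via_comprehension():
    return {type(name) for o in objs for name in o._data.keys()}


# Both expressions should produce the same set of key types.
assert via_union() == via_comprehension()

union_seconds = timeit.timeit(via_union, number=1_000)
comprehension_seconds = timeit.timeit(via_comprehension, number=1_000)
print(f"union: {union_seconds:.4f}s, comprehension: {comprehension_seconds:.4f}s")
```

Which variant wins, and by how much, depends on the number of objects and columns, so it is worth rerunning with sizes that resemble the real workload.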

wence- · 2024-05-03T15:21:14Z

Ah sorry. Too much time writing Haskell. Please go back to your approach.

er-eis · 2024-05-03T15:21:37Z

sure, making a commit now and merging in latest changes

wence- · 2024-05-03T15:26:14Z

/ok to test

bdice · 2024-05-03T15:29:10Z

python/cudf/cudf/tests/test_concat.py

+@pytest.mark.parametrize(
+    "d",
+    [
+        {"first": cudf.DataFrame({"A": [1, 2], "B": [3, 4]})},


It's best if we can avoid creating instances of GPU objects (cudf.DataFrame) in the parametrize arguments. Those are executed at test collection time rather than at test runtime, and make the test suite slow to launch due to a large number of small host-device copies. Let's defer construction using a pattern like this:

@pytest.mark.parametrize(
    "d",
    [
        {"first": {"A": [1, 2], "B": [3, 4]}},
        # ...
    ],
)
def test_concat_dictionary(d, axis):
    # Convert dict-of-dicts to dict-of-DataFrames to avoid raw GPU objects in the parameters
    d = {k: cudf.DataFrame(v) for k, v in d.items()}
    result = cudf.concat(d, axis=axis)
    expected = cudf.from_pandas(
        pd.concat({k: df.to_pandas() for k, df in d.items()}, axis=axis)
    )
    assert_eq(expected, result)
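The collection-time cost can be illustrated without a GPU. In this sketch `FakeFrame` is a hypothetical stand-in that counts how often it is constructed, and the two lists play the role of parametrize arguments; it is an illustration of the timing difference, not pytest's actual machinery.

```python
class FakeFrame:
    """Hypothetical stand-in for a GPU-backed object; counts constructions."""

    constructed = 0

    def __init__(self, data):
        type(self).constructed += 1
        self.data = data


# Eager: the object is built as soon as the parameter list is evaluated,
# which for decorator arguments happens at import (collection) time.
eager_params = [{"first": FakeFrame({"A": [1, 2]})}]
built_before_any_test = FakeFrame.constructed

# Deferred: the parameter list holds plain dicts, so nothing is built yet.
deferred_params = [{"first": {"A": [1, 2]}}]
built_after_deferred_list = FakeFrame.constructed


def run_case(spec):
    # Mirrors the conversion step performed inside the test body.
    return {k: FakeFrame(v) for k, v in spec.items()}


frames = run_case(deferred_params[0])
print(built_before_any_test, built_after_deferred_list, FakeFrame.constructed)
```

With the deferred form, construction only happens for the cases that actually run, which is the property the suggestion above is after.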

will do! let me know if you'd like me to clean up the other tests in this file in the same manner

i passed in the reference to each class for the tests, let me know if this causes weirdness during test collection and i'll make a simple map.

Yes, that’s totally fine. No need to refactor the rest of the file in this PR. A separate PR would be welcome if you’re interested.

bdice · 2024-05-03T17:37:16Z

/ok to test

er-eis · 2024-05-03T18:50:14Z

@bdice @wence- i think we're good to merge?

bdice · 2024-05-03T21:47:51Z

/merge

bdice · 2024-05-03T21:48:31Z

Thanks @er-eis! I think you mentioned you had a follow-up PR planned? Please feel free to open an issue documenting any next steps that are needed, even if you don't have time to contribute a PR.

er-eis · 2024-05-04T00:47:49Z

@bdice on it!

@bdice and @wence- , thanks for the great reviews, this was fun!

er-eis added 4 commits April 30, 2024 17:20

Work from amanlai

7e89e43

Tests

1a767fb

Remove extraneous testcase

9cebaa2

Fix some legacy tests

0736e4e

Original change here rapidsai#3188 Why were we casting to "float64" in the old testcase? Maybe related to this comment? rapidsai#3188 (comment)

github-actions bot added the Python (Affects Python cuDF API) label Apr 30, 2024

Merge branch 'branch-24.06' into er-eis/allow-concat-on-frame-dict

f4c1e77

er-eis marked this pull request as ready for review April 30, 2024 22:08

er-eis requested a review from a team as a code owner April 30, 2024 22:08

er-eis requested review from vyasr and charlesbluca April 30, 2024 22:08

bdice added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Apr 30, 2024

bdice assigned er-eis Apr 30, 2024

wence- requested changes May 1, 2024

Address PR comments, add failing testcase

a55ca67

er-eis force-pushed the er-eis/allow-concat-on-frame-dict branch from 3e8bbc9 to a55ca67 May 1, 2024 13:52

er-eis added 7 commits May 1, 2024 09:52

Merge branch 'branch-24.06' into er-eis/allow-concat-on-frame-dict

b08e862

Remove extraneous check

8a3bfb0

Simplify type check, ensure only index concat

de91bd9

Merge branch 'branch-24.06' into er-eis/allow-concat-on-frame-dict

29afda0

Simplify type check

8047569

Fix type check

a1e0949

Simplify type check

3136955

er-eis mentioned this pull request May 1, 2024

More explicit index concat er-eis/cudf#2

Closed

3 tasks

wence- reviewed May 3, 2024

python/cudf/cudf/core/reshape.py (outdated, resolved)

OMG

8bd0be7

wence- reviewed May 3, 2024

python/cudf/cudf/core/reshape.py (outdated, resolved)

Edit by fixed point iteration

b5de816

Pythonic object name set uniqueness

2124759

bdice reviewed May 3, 2024

Avoid create GPU instances during test collection

b5b9116

er-eis force-pushed the er-eis/allow-concat-on-frame-dict branch from e9c7239 to b5b9116 May 3, 2024 15:54

er-eis added 4 commits May 3, 2024 11:54

Merge branch 'branch-24.06' into er-eis/allow-concat-on-frame-dict

0a5a91b

Dedent conditional

4ba8e3a

Can not -> Cannot

8bf95de

Add 'columns' to tests

516d28b

er-eis requested a review from bdice May 3, 2024 16:01

bdice approved these changes May 3, 2024

rapids-bot bot merged commit 2ff60d6 into rapidsai:branch-24.06 May 3, 2024
70 checks passed

This was referenced May 4, 2024

[BUG] Concat Index behavior diverts from pandas #15649

Open

[BUG] test_concat file instantiates GPU objects in the parametrize arguments #15651

Open

er-eis deleted the er-eis/allow-concat-on-frame-dict branch May 4, 2024 05:03
