Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concatenate dictionary of objects along axis=1 #15623

Merged

Conversation

er-eis
Copy link
Contributor

@er-eis er-eis commented Apr 30, 2024

Description

Note: This work is heavily based off amanlai's PR raised here, wasn't able to base my branch off amanlai's due to deleted branch.

Closes #15115.
Unlike pandas.concat, cudf.concat doesn't work with a dictionary of objects. The following code raises an error.

d = {
    'first': cudf.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    'second': cudf.DataFrame({'A': [5, 6], 'B': [7, 8]}),
}

cudf.concat(d, axis=1)

This commit resolves this issue.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
    ^ need me to create an entry in the CHANGELOG.md?

er-eis added 4 commits April 30, 2024 17:20
Original change here rapidsai#3188
Why were we casting to "float64" in the old testcase?
Maybe related to this comment? rapidsai#3188 (comment)
Copy link

copy-pr-bot bot commented Apr 30, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Apr 30, 2024
@er-eis
Copy link
Contributor Author

er-eis commented Apr 30, 2024

to the reviewer(s):

please take a look here 0736e4e#diff-1d997d95893af6a665d5803d179f472c673a7d4e1a0a04a305bd2f1c4a66d957L237

any idea why we were casting to 'float64' previously? tried tracking down why, see my commit message

@er-eis er-eis marked this pull request as ready for review April 30, 2024 22:08
@er-eis er-eis requested a review from a team as a code owner April 30, 2024 22:08
@er-eis er-eis requested review from vyasr and charlesbluca April 30, 2024 22:08
@bdice
Copy link
Contributor

bdice commented Apr 30, 2024

/okay to test

@bdice bdice added feature request New feature or request non-breaking Non-breaking change labels Apr 30, 2024
@bdice
Copy link
Contributor

bdice commented Apr 30, 2024

any idea why we were casting to 'float64' previously?

My best guess is that this a historical artifact and is no longer necessary. I think your change to remove that cast is the right move -- thanks!

I triggered CI and we can see if it passes tests. If it fails, we can investigate further.

@er-eis
Copy link
Contributor Author

er-eis commented Apr 30, 2024

My best guess is that this a historical artifact and is no longer necessary.

it weirds me out that the test has been passing all this time, but i'm a noob around these parts so i trust you!

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for resurrecting this! This code is really fiddly, so great job unpicking things. I have a few suggestions and questions for cleanup.

python/cudf/cudf/core/reshape.py Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
@wence-
Copy link
Contributor

wence- commented May 1, 2024

/ok to test

@er-eis er-eis force-pushed the er-eis/allow-concat-on-frame-dict branch from 3e8bbc9 to a55ca67 Compare May 1, 2024 13:52
@er-eis er-eis mentioned this pull request May 1, 2024
3 tasks
@wence-
Copy link
Contributor

wence- commented May 3, 2024

/ok to test

@wence-
Copy link
Contributor

wence- commented May 3, 2024

Sorry, I made a sequence of mess-ups, but hopefully we've got there.

@wence-
Copy link
Contributor

wence- commented May 3, 2024

/ok to test

@er-eis
Copy link
Contributor Author

er-eis commented May 3, 2024

@wence- do we prefer set().union(*(map(type, obj._data.keys()) for obj in objs)) over {type(name) for o in objs for name in o._data.keys()} ?

{type(name) for o in objs for name in o._data.keys()} is:

  • an order of magnitude faster
  • imo, more readable/pythonic

you can test the execution time by setting a breakpoint on line 427 and running this:

before = time.perf_counter()
set().union(*(map(type, obj._data.keys()) for obj in objs))
after_changed = time.perf_counter() - before
b = time.perf_counter()
{type(name) for o in objs for name in o._data.keys()}
a_changed = time.perf_counter() - b

@wence-
Copy link
Contributor

wence- commented May 3, 2024

@wence- do we prefer set().union(*(map(type, obj._data.keys()) for obj in objs)) over {type(name) for o in objs for name in o._data.keys()} ?

{type(name) for o in objs for name in o._data.keys()} is:

  • an order of magnitude faster
  • imo, more readable/pythonic

you can test the execution time by setting a breakpoint on line 427 and running this:

before = time.perf_counter()
set().union(*(map(type, obj._data.keys()) for obj in objs))
after_changed = time.perf_counter() - before
b = time.perf_counter()
{type(name) for o in objs for name in o._data.keys()}
a_changed = time.perf_counter() - b

Ah sorry. Too much time writing Haskell. Please go back to your approach.

@er-eis
Copy link
Contributor Author

er-eis commented May 3, 2024

sure, making a commit now and merging in latest changes

@wence-
Copy link
Contributor

wence- commented May 3, 2024

/ok to test

@pytest.mark.parametrize(
"d",
[
{"first": cudf.DataFrame({"A": [1, 2], "B": [3, 4]})},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's best if we can avoid creating instances of GPU objects (cudf.DataFrame) in the parametrize arguments. Those are executed at test collection time rather than at test runtime, and make the test suite slow to launch due to a large number of small host-device copies. Let's defer construction using a pattern like this:

@pytest.mark.parametrize(
    "d",
    [
        {"first": {"A": [1, 2], "B": [3, 4]}},
        # ...
    ],
)
def test_concat_dictionary(d, axis):
    # Convert dict-of-dicts to dict-of-DataFrames to avoid raw GPU objects in the parameters
    d = {k: cudf.DataFrame(v) for k, v in input.items()}

    result = cudf.concat(d, axis=axis)
    expected = cudf.from_pandas(
        pd.concat({k: df.to_pandas() for k, df in d.items()}, axis=axis)
    )
    assert_eq(expected, result)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do! let me know if you'd like me to clean up the other tests in this file in the same manner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i passed in the reference to each class for the tests, let me know if this causes weirdness during test collection and i'll make a simple map.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that’s totally fine. No need to refactor the rest of the file in this PR. A separate PR would be welcome if you’re interested.

python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/reshape.py Show resolved Hide resolved
@er-eis er-eis force-pushed the er-eis/allow-concat-on-frame-dict branch from e9c7239 to b5b9116 Compare May 3, 2024 15:54
@er-eis er-eis requested a review from bdice May 3, 2024 16:01
@bdice
Copy link
Contributor

bdice commented May 3, 2024

/ok to test

@er-eis
Copy link
Contributor Author

er-eis commented May 3, 2024

@bdice @wence- i think we're good to merge?

@bdice
Copy link
Contributor

bdice commented May 3, 2024

/merge

@rapids-bot rapids-bot bot merged commit 2ff60d6 into rapidsai:branch-24.06 May 3, 2024
70 checks passed
@bdice
Copy link
Contributor

bdice commented May 3, 2024

Thanks @er-eis! I think you mentioned you had a follow-up PR planned? Please feel free to open an issue documenting any next steps that are needed, even if you don't have time to contribute a PR.

@er-eis
Copy link
Contributor Author

er-eis commented May 4, 2024

@bdice on it!

@bdice and @wence- , thanks for the great reviews, this was fun!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Allow cudf.concat to process a dictionary of frames / add keys=
3 participants