Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concatenate dictionary of objects along axis=1 #15160

Closed
wants to merge 7 commits into from
Closed

Concatenate dictionary of objects along axis=1 #15160

wants to merge 7 commits into from

Conversation

amanlai
Copy link

@amanlai amanlai commented Feb 27, 2024

Closes #15115.

Unlike pandas.concat, cudf.concat doesn't work with a dictionary of objects. The following code raises an error.

d = {
    'first': cudf.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    'second': cudf.DataFrame({'A': [5, 6], 'B': [7, 8]}),
}

cudf.concat(d, axis=1)

This commit resolves this issue.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copy link

copy-pr-bot bot commented Feb 27, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 27, 2024
@shwina shwina added non-breaking Non-breaking change feature request New feature or request labels Feb 28, 2024
@shwina
Copy link
Contributor

shwina commented Feb 28, 2024

/ok to test

@shwina
Copy link
Contributor

shwina commented Feb 28, 2024

/ok to test

@shwina
Copy link
Contributor

shwina commented Feb 28, 2024

/ok to test

@shwina
Copy link
Contributor

shwina commented Feb 28, 2024

/ok to test

@shwina
Copy link
Contributor

shwina commented Feb 29, 2024

/ok to test

@shwina
Copy link
Contributor

shwina commented Feb 29, 2024

Thanks @amanlai - I pushed a couple of quick changes to your branch, which simply make it so that we access the ._column_names attribute rather than .columns when trying to get the column names.

The latter can be surprisingly expensive in cuDF and therefore we forbid its use within the code base.

Fingers crossed that we will now pass all tests 🤞 after which I think this should be good to go!

@shwina
Copy link
Contributor

shwina commented Feb 29, 2024

@amanlai - after getting rid of the errors from using .columns, it looks like we're down to three unit test failures:

FAILED test_concat.py::test_concat_dictionary[d1] - NotImplementedError: Unsupported column type passed to create an Index: <class 'cudf.core.column.lists.ListColumn'>
FAILED test_concat.py::test_concat_dictionary[d3] - AssertionError: DataFrame.columns are different
FAILED test_concat.py::test_concat_dict_incorrect_type - AttributeError: 'Index' object has no attribute 'index'

Do you want to debug and fix those up? If you're having trouble, let us know and we're glad to help!

…e object (pulled unnecessary changes in previous commit)
@amanlai
Copy link
Author

amanlai commented Mar 1, 2024

@shwina I did push some changes to reshape.py that should fix those issues. I also refactored it a bit so that concatenation of a dictionary that includes both Series and DF works as well and added a test case for that in test_concat.py as well.

@shwina
Copy link
Contributor

shwina commented Mar 3, 2024

/ok to test

@shwina
Copy link
Contributor

shwina commented Mar 4, 2024

Hmm, with the latest changes we're seeing a ton of new failures:

2024-03-03T12:20:33.5760738Z == 575 failed, 100874 passed, 2092 skipped, 867 xfailed in 2584.73s (0:43:04) ==

Are you able to run the tests locally to debug? It may be helpful to ensure that you have all tests passing locally, rather than waiting on CI runs to report the results. Typically, I do something like:

pytest -n <ncpus> cudf/tests/

@amanlai
Copy link
Author

amanlai commented Mar 4, 2024

@shwina I had tested them locally and the test cases were all working fine. However, I don't know why but some files were modified (by my system?) and when I pushed the changes I screwed up and did git add . which pushed the unnecessary changes as well. I force pushed the previous version of those files but it may have screwed up even more. I'm just not good at git. I'm sorry that I'm taking too much of your time.

@shwina
Copy link
Contributor

shwina commented Mar 5, 2024

No worries at all, and no need to apologize! Can you tell me what the changes we should keep since the commit before last?

@amanlai
Copy link
Author

amanlai commented Apr 1, 2024

@shwina I have no idea why so many tests are failing. I checked on my end and there are no errors. I feel like I have spent enough time on this PR. I want to let somebody else take a crack at it. Sorry that I wasted so much of your time. Cheers.

@amanlai amanlai closed this Apr 1, 2024
@amanlai amanlai deleted the fea-concat-dict-axis-1 branch April 1, 2024 21:59
rapids-bot bot pushed a commit that referenced this pull request May 3, 2024
Note: This work is heavily based off [amanlai's](https://github.com/amanlai) PR [raised here](#15160), wasn't able to base my branch off amanlai's due to deleted branch.
 
> Closes #15115.
>Unlike `pandas.concat`, `cudf.concat` doesn't work with a dictionary of objects. The following code raises an error.
```python
d = {
    'first': cudf.DataFrame({'A': [1, 2], 'B': [3, 4]}),
    'second': cudf.DataFrame({'A': [5, 6], 'B': [7, 8]}),
}

cudf.concat(d, axis=1)
```
>This commit resolves this issue.

Authors:
  - https://github.com/er-eis
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)

URL: #15623
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Allow cudf.concat to process a dictionary of frames / add keys=
2 participants