Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow indexing by selecting duplicate labels #16514

Merged
merged 1 commit into from
Aug 12, 2024

Conversation

mroeschke
Copy link
Contributor

Description

xref #16507

I would say this was a bug before because we would silently return a new DataFrame with just len(set(column_labels)) when selecting by column. Now this operation raises since duplicate column labels are generally not supported.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@mroeschke mroeschke added bug Something isn't working Python Affects Python cuDF API. non-breaking Non-breaking change labels Aug 8, 2024
@mroeschke mroeschke requested a review from a team as a code owner August 8, 2024 21:22
Copy link
Contributor

@brandon-b-miller brandon-b-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mroeschke . I noticed pandas supports this, do you feel this is a common pattern its worth supporting in cuDF over the longer term?

@bdice
Copy link
Contributor

bdice commented Aug 12, 2024

Or conversely, does pandas plan to remove this functionality?

@mroeschke
Copy link
Contributor Author

I would say pandas does not plan on ever removing duplicate label functionality, but also that duplicate labels are not that commonplace in pandas.

It would be nice if cudf would eventually support duplicate column labels (I'll open an issue), but I don't think it's high priority.

@mroeschke
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit a3dc14f into rapidsai:branch-24.10 Aug 12, 2024
88 checks passed
@mroeschke mroeschke deleted the bug/indexing/duplicates branch August 12, 2024 17:57
rapids-bot bot pushed a commit to rapidsai/cugraph that referenced this pull request Aug 15, 2024
cc: @rlratzel @ChuckHastings 

This PR addresses failures seen in certain PRs (like [here](https://github.com/rapidsai/cugraph/actions/runs/10372270389/job/28718471674?pr=4606#step:7:5269)) due to a [recent change](rapidsai/cudf#16514) to `cudf` that disallows selecting duplicate column labels.

---

In `hypergraph.py`, this PR modifies `_create_hyper_edges` and `_create_direct_edges` to ensure that DataFrames are being indexed by non-duplicate column values.

This is done by taking a list that includes duplicates (`fs`), and removing the non-unique values
```python
fs = list(set(fs))
```


_This part requires some attention from the author of the unit test @jnke2016_

In `test_hypergraph.py`, this PR adds the `check_like=True` arg to `assert_frame_equals` function because the ordering of the columns is different for the two DFs.

Authors:
  - Ralph Liu (https://github.com/nv-rliu)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)
  - Chuck Hastings (https://github.com/ChuckHastings)
  - Paul Taylor (https://github.com/trxcllnt)
  - Joseph Nke (https://github.com/jnke2016)

URL: #4610
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants