-
Notifications
You must be signed in to change notification settings - Fork 917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disallow indexing by selecting duplicate labels #16514
Disallow indexing by selecting duplicate labels #16514
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mroeschke . I noticed pandas supports this, do you feel this is a common pattern its worth supporting in cuDF over the longer term?
Or conversely, does pandas plan to remove this functionality? |
I would say pandas does not plan on ever removing duplicate label functionality, but also that duplicate labels are not that commonplace in pandas. It would be nice if cudf would eventually support duplicate column labels (I'll open an issue), but I don't think it's high priority. |
/merge |
cc: @rlratzel @ChuckHastings This PR addresses failures seen in certain PRs (like [here](https://github.com/rapidsai/cugraph/actions/runs/10372270389/job/28718471674?pr=4606#step:7:5269)) due to a [recent change](rapidsai/cudf#16514) to `cudf` that disallows selecting duplicate column labels. --- In `hypergraph.py`, this PR modifies `_create_hyper_edges` and `_create_direct_edges` to ensure that DataFrames are being indexed by non-duplicate column values. This is done by taking a list that includes duplicates (`fs`), and removing the non-unique values ```python fs = list(set(fs)) ``` _This part requires some attention from the author of the unit test @jnke2016_ In `test_hypergraph.py`, this PR adds the `check_like=True` arg to `assert_frame_equals` function because the ordering of the columns is different for the two DFs. Authors: - Ralph Liu (https://github.com/nv-rliu) Approvers: - Rick Ratzel (https://github.com/rlratzel) - Chuck Hastings (https://github.com/ChuckHastings) - Paul Taylor (https://github.com/trxcllnt) - Joseph Nke (https://github.com/jnke2016) URL: #4610
Description
xref #16507
I would say this was a bug before because we would silently return a new DataFrame with just
len(set(column_labels))
when selecting by column. Now this operation raises since duplicate column labels are generally not supported.Checklist