Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Raise duplicate column error in DataFrame.rename #10120

Merged
merged 1 commit into from
Jan 27, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 10 additions & 5 deletions python/cudf/cudf/core/column_accessor.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# Copyright (c) 2021, NVIDIA CORPORATION.
# Copyright (c) 2021-2022, NVIDIA CORPORATION.

from __future__ import annotations

Expand Down Expand Up @@ -523,14 +523,19 @@ def rename_column(x):
raise IndexError(
f"Too many levels: Index has only 1 level, not {level+1}"
)

if isinstance(mapper, Mapping):
new_names = (
new_col_names = [
mapper.get(col_name, col_name) for col_name in self.keys()
)
]
Comment on lines -527 to +530
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not immediately obvious to me - why opt for a list instead of a tuple here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't change a tuple to a list, it changes a generator to a list. There's no equivalent to a list comprehension for a tuple because the syntax (x for x in ...) doesn't create a tuple, it creates a generator. The reason this change is necessary is that you cannot query the length of a generator, so the error check below won't work.

Aside: If you're not familiar with generators, they are what you get when you write a function that uses yield instead of return. The purpose of generators is not to materialize the entire set of data at once, but to instead produce the data one at a time. One of the most obvious uses of this is to save memory, but there are also lots of other reasons you might prefer a generator to a list such as wanting to produce an arbitrary number of elements from an infinite sequence. Here's a simple example illustrating the difference:

>>> def f(x):                                                                                                                                                                                     
...     print(f"Called f({x})")                                                                       
...     return x**2
>>> for _ in (f(x) for x in range(3)):
...     print("Hello")                                                                                
... 
Called f(0)
Hello
Called f(1)
Hello
Called f(2)
Hello
>>> for _ in [f(x) for x in range(3)]:                                                                
...     print("Hello")
... 
Called f(0)
Called f(1)
Called f(2)
Hello
Hello
Hello

else:
new_names = (mapper(col_name) for col_name in self.keys())
new_col_names = [mapper(col_name) for col_name in self.keys()]

if len(new_col_names) != len(set(new_col_names)):
raise ValueError("Duplicate column names are not allowed")

ca = ColumnAccessor(
dict(zip(new_names, self.values())),
dict(zip(new_col_names, self.values())),
level_names=self.level_names,
multiindex=self.multiindex,
)
Expand Down
8 changes: 8 additions & 0 deletions python/cudf/cudf/tests/test_dataframe.py
Original file line number Diff line number Diff line change
Expand Up @@ -9083,3 +9083,11 @@ def test_dataframe_assign_cp_np_array():
gdf[[f"f_{i}" for i in range(n)]] = cp_ndarray

assert_eq(pdf, gdf)


def test_dataframe_rename_duplicate_column():
gdf = cudf.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5]})
with pytest.raises(
ValueError, match="Duplicate column names are not allowed"
):
gdf.rename(columns={"a": "b"}, inplace=True)