Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Fix dataframe setitem with ndarray types #10056

Merged
merged 2 commits into from
Jan 14, 2022
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
18 changes: 16 additions & 2 deletions python/cudf/cudf/core/dataframe.py
Original file line number Diff line number Diff line change
Expand Up @@ -1123,7 +1123,15 @@ def __setitem__(self, arg, value):
for col_name in self._data:
self._data[col_name][mask] = value
else:
if isinstance(value, DataFrame):
if isinstance(value, (cupy.ndarray, np.ndarray)):
_setitem_with_dataframe(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the array was being converted to dataframe in order to make use of _setitem_with_dataframe. Ideally I wish we can separate the logic of column selection from this helper and directly replace the columns targeted. But that's out of the scope for sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree, it would be nice to just convert the array to a column without having to deal with all the DataFrame junk but I don't see an easy way to do that at present :(

input_df=self,
replace_df=cudf.DataFrame(value),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we can make use of factory method Dataframe._from_data here instead of public ctor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check this but realized we have the logic that handles ndarray's in DataFrame constructor and will need this type of inputs to go through that code-paths.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_from_data assumes that the inputs are already columns and it won't call as_column. We could manually call that function and construct a suitable data dict, but again this feels like grounds for a future refactor rather than something to do now. At some point we should review all of cudf to see where we construct Frames via constructor rather than using one of the faster factories, I think many if not most of them can be rewritten.

input_cols=arg,
mask=None,
ignore_index=True,
)
elif isinstance(value, DataFrame):
_setitem_with_dataframe(
input_df=self,
replace_df=value,
Expand Down Expand Up @@ -6393,13 +6401,15 @@ def _setitem_with_dataframe(
replace_df: DataFrame,
input_cols: Any = None,
mask: Optional[cudf.core.column.ColumnBase] = None,
ignore_index: bool = False,
):
"""
This function sets item dataframes relevant columns with replacement df
:param input_df: Dataframe to be modified inplace
:param replace_df: Replacement DataFrame to replace values with
:param input_cols: columns to replace in the input dataframe
:param mask: boolean mask in case of masked replacing
:param ignore_index: Whether to conduct index equality and reindex
"""

if input_cols is None:
Expand All @@ -6410,7 +6420,11 @@ def _setitem_with_dataframe(
"Number of Input Columns must be same replacement Dataframe"
)

if len(input_df) != 0 and not input_df.index.equals(replace_df.index):
if (
not ignore_index
and len(input_df) != 0
and not input_df.index.equals(replace_df.index)
):
replace_df = replace_df.reindex(input_df.index)

for col_1, col_2 in zip(input_cols, replace_df.columns):
Expand Down
11 changes: 11 additions & 0 deletions python/cudf/cudf/tests/test_dataframe.py
Original file line number Diff line number Diff line change
Expand Up @@ -9030,3 +9030,14 @@ def test_dataframe_add_suffix():
expected = pdf.add_suffix("_item")

assert_eq(got, expected)


def test_dataframe_assign_cp_np_array():
m, n = 5, 3
cp_ndarray = cupy.random.randn(m, n)
pdf = pd.DataFrame({f"f_{i}": range(m) for i in range(n)})
gdf = cudf.DataFrame({f"f_{i}": range(m) for i in range(n)})
pdf[[f"f_{i}" for i in range(n)]] = cupy.asnumpy(cp_ndarray)
gdf[[f"f_{i}" for i in range(n)]] = cp_ndarray

assert_eq(pdf, gdf)