Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Fix dataframe setitem with ndarray types #10056

Merged
merged 2 commits into from
Jan 14, 2022

Conversation

galipremsagar
Copy link
Contributor

Fixes: #9928

This PR fixes 2d array assignment in setitem

@galipremsagar galipremsagar added bug Something isn't working Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer non-breaking Non-breaking change labels Jan 14, 2022
@galipremsagar galipremsagar requested a review from a team as a code owner January 14, 2022 17:29
@galipremsagar galipremsagar self-assigned this Jan 14, 2022
@galipremsagar galipremsagar added the 3 - Ready for Review Ready for review by team label Jan 14, 2022
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love for our setitem implementations to not need so many special cases, but I addressing that is probably out of scope for now. I don't see a much more elegant way to resolve this that doesn't also do extra work.

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Jan 14, 2022
Copy link
Contributor

@isVoid isVoid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 small question + random rants. Overall it's good and shouldn't block it from merging.

@@ -1123,7 +1123,15 @@ def __setitem__(self, arg, value):
for col_name in self._data:
self._data[col_name][mask] = value
else:
if isinstance(value, DataFrame):
if isinstance(value, (cupy.ndarray, np.ndarray)):
_setitem_with_dataframe(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the array was being converted to dataframe in order to make use of _setitem_with_dataframe. Ideally I wish we can separate the logic of column selection from this helper and directly replace the columns targeted. But that's out of the scope for sure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I agree, it would be nice to just convert the array to a column without having to deal with all the DataFrame junk but I don't see an easy way to do that at present :(

if isinstance(value, (cupy.ndarray, np.ndarray)):
_setitem_with_dataframe(
input_df=self,
replace_df=cudf.DataFrame(value),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any chance we can make use of factory method Dataframe._from_data here instead of public ctor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check this but realized we have the logic that handles ndarray's in DataFrame constructor and will need this type of inputs to go through that code-paths.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_from_data assumes that the inputs are already columns and it won't call as_column. We could manually call that function and construct a suitable data dict, but again this feels like grounds for a future refactor rather than something to do now. At some point we should review all of cudf to see where we construct Frames via constructor rather than using one of the faster factories, I think many if not most of them can be rewritten.

@codecov
Copy link

codecov bot commented Jan 14, 2022

Codecov Report

Merging #10056 (e6f09c2) into branch-22.02 (967a333) will decrease coverage by 0.07%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02   #10056      +/-   ##
================================================
- Coverage         10.49%   10.41%   -0.08%     
================================================
  Files               119      119              
  Lines             20305    20540     +235     
================================================
+ Hits               2130     2139       +9     
- Misses            18175    18401     +226     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.66% <0.00%> (-0.25%) ⬇️
python/dask_cudf/dask_cudf/core.py 70.85% <0.00%> (-0.17%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
... and 21 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 12adb8a...e6f09c2. Read the comment docs.

@galipremsagar
Copy link
Contributor Author

@gpucibot merge

@galipremsagar galipremsagar removed the request for review from charlesbluca January 14, 2022 23:21
@galipremsagar
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8c8d6ef into rapidsai:branch-22.02 Jan 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] 2D Array Assignment
3 participants