Rewrites `column.__setitem__`, use `boolean_mask_scatter` #10202
Conversation
Looks solid to me. I have some minor suggestions for improvement, but this already helps clean up some gnarly branching.
Comments attached.
python/cudf/cudf/_lib/copying.pyx (outdated)
"""Assign the `ith` row in input column to the row correspond to the `ith`
`True` value in the target column.
I think this docstring could be clearer. I had to read a few levels of the C++ source to really understand what this was doing. Is this suggestion accurate?
```diff
-"""Assign the `ith` row in input column to the row correspond to the `ith`
-`True` value in the target column.
+"""Copy the target columns, replacing masked rows with input data.
+
+The ``input_`` data can be a list of columns or a list of scalars.
+A list of input columns will be used to replace corresponding rows in the
+target columns for which the boolean mask is ``True``. A list of input
+scalars will replace all rows in the target columns for which the boolean
+mask is ``True``.
```
Yes, I know how confusing this is. Everything in here is accurate. However, I would like to add a note highlighting that the `input_` columns need not have the same number of rows as the target columns. Instead, they should have as many rows as there are `True` values in the boolean mask. As an example, it's totally valid to have:
```
input = [42]
boolean_mask = [F, F, F, T]
target = [1, 2, 3, 4]
-----
result = [1, 2, 3, 42]
```
Which is something pretty subtle from my point of view.
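For readers following along, the semantics discussed above can be sketched with a NumPy stand-in. This is illustrative only: the function name and validation here are ours, and the real `boolean_mask_scatter` operates on libcudf columns on the GPU, not NumPy arrays.

```python
import numpy as np

def boolean_mask_scatter(input_, boolean_mask, target):
    """Sketch: the nth row of ``input_`` replaces the row of ``target``
    at the position of the nth ``True`` in ``boolean_mask``."""
    mask = np.asarray(boolean_mask, dtype=bool)
    out = np.array(target)
    rows = np.asarray(input_)
    # The subtle constraint from the thread: input length must equal the
    # number of True entries in the mask, not the target length.
    if len(rows) != int(mask.sum()):
        raise ValueError("input must have one row per True entry in the mask")
    out[mask] = rows
    return out

# The example from the thread: one input row, one True in the mask.
result = boolean_mask_scatter([42], [False, False, False, True], [1, 2, 3, 4])
# result -> [1, 2, 3, 42]
```

Note that a two-row input against a mask with two `True` entries works the same way, regardless of the target's length.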
How about:

Copy the target columns, replacing masked rows with input data.
The ``input_`` data can be a list of columns or a list of scalars.
A list of input columns will be used to replace corresponding rows in the
target columns for which the boolean mask is ``True``. For the nth ``True``
in the boolean mask, the nth row in ``input_`` is used as the replacement.
(Followed by the scalar case.)
Yeah, that edit seems fine. It's super weird behavior to explain...
I'd like to coordinate with the libcudf team to revisit this API and make it less "weird". If anything, the current behavior has a hard constraint (needed to guarantee the API's correctness) that cannot be verified without introspecting the data (the number of `True` values in the boolean mask). I don't know if that violates any API design rules, but it certainly makes the API flaky for users.
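To make the point above concrete, here is a hedged sketch (names ours, NumPy standing in for device data) of why the precondition is data-dependent: validating the call requires counting the `True` entries, which on the GPU would imply a device reduction before the operation can even be declared valid.

```python
import numpy as np

def validate_boolean_mask_scatter(num_input_rows, boolean_mask):
    """Illustrative precondition check for a boolean-mask scatter."""
    # The constraint depends on the mask's *contents*, not just its
    # shape: we must count the True entries first. On device data this
    # count is a reduction, i.e. real work before validation succeeds.
    n_true = int(np.count_nonzero(boolean_mask))
    if num_input_rows != n_true:
        raise ValueError(
            f"expected {n_true} input rows (one per True mask entry), "
            f"got {num_input_rows}"
        )
    return True
```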
```diff
@@ -321,22 +322,22 @@ def _fill(
     if end <= begin or begin >= self.size:
         return self if inplace else self.copy()

-    fill_scalar = as_device_scalar(fill_value, self.dtype)
+    device_value = as_device_scalar(fill_value, self.dtype)
```
If possible, we want to try to use a `cudf.Scalar` in the Python layer and then get the `DeviceScalar` via `.device_value` in Cython right before it hits libcudf. This should avoid any unnecessary synchronization from interacting with the scalar here, such as when calling `is_valid()`.
Thanks! I'm sure this is why I used `cudf.Scalar` before.
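The lazy-materialization pattern being suggested can be sketched in plain Python. These classes are stand-ins, not cudf's actual implementation: the point is that the host-side wrapper defers the host-to-device transfer (and any implied synchronization) until the value is actually needed at the libcudf boundary.

```python
class DeviceScalar:
    """Stand-in for a scalar living in GPU memory (hypothetical)."""
    def __init__(self, value, dtype):
        # In the real library this would copy the value to device memory.
        self.value = value
        self.dtype = dtype

class Scalar:
    """Host-side wrapper that materializes the device copy lazily."""
    def __init__(self, value, dtype):
        self._host_value = value
        self.dtype = dtype
        self._device_value = None

    @property
    def device_value(self):
        # Pay for the transfer only when libcudf actually needs the
        # scalar; repeated accesses reuse the cached device copy.
        if self._device_value is None:
            self._device_value = DeviceScalar(self._host_value, self.dtype)
        return self._device_value
```

Usage: construct `Scalar(42, "int64")` freely in the Python layer; no device interaction happens until `.device_value` is read in Cython.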
@isVoid I only had a couple of minor comments. I can approve once the other open threads are resolved.
I don't have any further comments on this PR, but I agree with @isVoid's proposal to try to simplify the logic and documentation of `boolean_mask_scatter` in both libcudf and our Python bindings.

From what I could see, `Column.__setitem__` is the only use case of libcudf's `boolean_mask_scatter`, so we may be able to specialize its design or API a bit more for the task at hand. In particular, we might be doing some work to build a gather map that could be rewritten with a simpler Thrust algorithm or transform iterator. I'm not sure exactly what a refactor would require without further inspection, but from a brief review of the current implementation it certainly seems possible to reduce the LOC and improve performance.
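The gather-map idea mentioned above can be illustrated with NumPy (an illustrative sketch, not libcudf's implementation): the boolean mask is reduced to the indices of its `True` entries, and the scatter then becomes a plain indexed write.

```python
import numpy as np

def mask_to_scatter_map(boolean_mask):
    # Indices of the True entries: row i of the input is written to
    # position scatter_map[i] of the target.
    return np.flatnonzero(boolean_mask)

mask = np.array([False, True, False, True])
scatter_map = mask_to_scatter_map(mask)   # positions 1 and 3
target = np.zeros(4, dtype=int)
target[scatter_map] = np.array([10, 20])  # indexed write replaces the scatter
```

In CUDA terms this corresponds to a stream compaction over the mask followed by a scatter, which is the kind of work a transform iterator might fold into a single pass.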
Do you know why tests are currently failing?

I can't replicate these regressions locally. I'll take a look at them if they fail in CI again.
Codecov Report

```
@@            Coverage Diff             @@
##       branch-22.04    #10202    +/-   ##
================================================
+ Coverage       10.42%    10.62%   +0.20%
================================================
  Files             119       122       +3
  Lines           20603     20965     +362
================================================
+ Hits             2148      2228      +80
- Misses          18455     18737     +282
```

Continue to review the full report at Codecov.
To close the loop: the tests that were failing were tests of the dataframe protocol that were accessing invalid data corresponding to null elements. It is not clear why this particular PR triggered those failures, but we've decided not to try to address that here. 1754580 was a minimal changeset to move this PR forward, and we can tackle improving those tests at another time.
@gpucibot merge
This PR is part of #10153
Part of #10153. Aside from the two harder cases, `boolean_mask_scatter` and `sample`, which are addressed in #10202 and #10262, this PR tackles the rest of the refactors in `copying.pyx`; in combination with the other two, it should address all interface refactors in `copying.pyx`.

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10359
Closes #8667

This PR rewrites `column.__setitem__` and calls `boolean_mask_scatter` if keys and values meet certain criteria. Benchmarks show that at low-order problem sizes (~10K), there is a 30% speedup for aligned values and 10% for unaligned values. Note that the standard deviation of the unaligned case is quite high after the refactor. For larger problem sizes, performance is largely unaffected.
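The PR does not spell out the dispatch criteria here, but the shape of the check can be sketched with NumPy (the function name and exact conditions are our illustration, not the PR's code): the key must be a boolean mask covering the whole column, and the value must be either a scalar or a column with one row per `True` entry.

```python
import numpy as np

def can_use_boolean_mask_scatter(key, value, target_len):
    """Illustrative check: is a boolean-mask scatter applicable?"""
    key = np.asarray(key)
    # Key must be a boolean mask spanning the entire target column.
    if key.dtype != np.bool_ or len(key) != target_len:
        return False
    # A scalar value is broadcast to every masked row.
    if np.ndim(value) == 0:
        return True
    # A column value needs exactly one row per True entry in the mask.
    return len(value) == int(np.count_nonzero(key))
```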
Benchmarks: 10K and 1M element runs (result tables not included in this excerpt).