Rewrites column.__setitem__, Use boolean_mask_scatter #10202

Merged

Contributor

@isVoid isVoid commented Feb 3, 2022

closes #8667

This PR rewrites column.__setitem__ so that it calls boolean_mask_scatter when the key and value meet certain criteria.
Benchmarks show that at small problem sizes (around 10K rows) there is a roughly 30% speedup for aligned values and roughly 10% for unaligned values. Note that the standard deviation of the unaligned case is considerably higher after the refactor. For larger problem sizes (1M rows), performance is largely unaffected.
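As a quick sanity check on the quoted percentages, the speedups can be recomputed from the mean timings in the 10K benchmark tables of this PR. The snippet below is only an arithmetic sketch; the dictionary and the `speedup` helper are illustrative names, not part of cuDF.

```python
# Mean timings copied from the 10K-row benchmark tables in this PR.
# Each entry is (before, after); units follow the table they come from.
means_10k = {
    "boolean_mask_col_aligned": (1.7270, 1.3064),          # ms
    "boolean_mask_col_unaligned": (1118.7759, 1008.6033),  # us
    "boolean_mask_scalar": (781.4038, 542.2607),           # us
}


def speedup(before, after):
    """Ratio of the old mean time to the new mean time (>1 means faster)."""
    return before / after


for name, (before, after) in means_10k.items():
    print(f"{name}: {speedup(before, after):.2f}x")
# boolean_mask_col_aligned: 1.32x
# boolean_mask_col_unaligned: 1.11x
# boolean_mask_scalar: 1.44x
```

These ratios match the relative values that pytest-benchmark prints in parentheses next to the "Mean" column.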

Benchmarks:

10K
---------------------------------------------------------------- benchmark 'boolean_mask_col_aligned': 2 tests -----------------------------------------------------------------
Name (time in ms)                                      Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[boolean_mask_col_aligned] (afte)     1.2809 (1.0)      1.6781 (1.0)      1.3064 (1.0)      0.0364 (1.0)      1.2996 (1.0)      0.0137 (1.0)         22;34     761
column_setitem[boolean_mask_col_aligned] (befo)     1.7024 (1.33)     2.3863 (1.42)     1.7270 (1.32)     0.0523 (1.43)     1.7187 (1.32)     0.0138 (1.01)        20;31     563
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------- benchmark 'boolean_mask_col_unaligned': 2 tests --------------------------------------------------------------------------
Name (time in us)                                            Min                   Max                  Mean             StdDev                Median                IQR            Outliers  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[boolean_mask_col_unaligned] (afte)       972.3390 (1.0)      1,520.7559 (1.29)     1,008.6033 (1.0)      77.0920 (9.45)       984.0354 (1.0)      13.2429 (1.45)       83;132     984
column_setitem[boolean_mask_col_unaligned] (befo)     1,106.3821 (1.14)     1,179.5689 (1.0)      1,118.7759 (1.11)      8.1539 (1.0)      1,116.2200 (1.13)      9.1530 (1.0)        149;30     874
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------- benchmark 'boolean_mask_scalar': 2 tests ---------------------------------------------------------------------
Name (time in us)                                   Min                 Max                Mean             StdDev              Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[boolean_mask_scalar] (afte)     532.2532 (1.0)      605.4689 (1.0)      542.2607 (1.0)      11.1111 (1.10)     537.9058 (1.0)      5.2921 (1.0)       178;213    1461
column_setitem[boolean_mask_scalar] (befo)     770.1530 (1.45)     863.1549 (1.43)     781.4038 (1.44)     10.0834 (1.0)      778.3370 (1.45)     7.0044 (1.32)       114;90    1219
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'integer_scatter_map_col': 2 tests ----------------------------------------------------------------
Name (time in ms)                                     Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[integer_scatter_map_col] (afte)     1.4785 (1.0)      1.9170 (1.21)     1.5176 (1.01)     0.0438 (3.91)     1.5098 (1.00)     0.0171 (1.37)        18;26     644
column_setitem[integer_scatter_map_col] (befo)     1.4882 (1.01)     1.5802 (1.0)      1.5084 (1.0)      0.0112 (1.0)      1.5068 (1.0)      0.0124 (1.0)        140;22     650
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------- benchmark 'integer_scatter_map_scalar': 2 tests ----------------------------------------------------------------------
Name (time in us)                                          Min                   Max                Mean             StdDev              Median               IQR            Outliers  Rounds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[integer_scatter_map_scalar] (afte)     878.2479 (1.0)      1,343.0519 (1.39)     897.1496 (1.01)     27.5109 (3.23)     892.9770 (1.00)     7.8208 (1.0)         29;60    1074
column_setitem[integer_scatter_map_scalar] (befo)     879.3280 (1.00)       966.9410 (1.0)      890.8573 (1.0)       8.5287 (1.0)      888.7504 (1.0)      8.4981 (1.09)       171;50    1086
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------- benchmark 'stride-1_slice_col': 2 tests -----------------------------------------------------------------------
Name (time in us)                                  Min                   Max                Mean             StdDev              Median                IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[stride-1_slice_col] (afte)     752.6411 (1.0)        852.3620 (1.0)      775.1790 (1.0)      10.9726 (1.0)      772.6796 (1.0)      14.1604 (1.32)       245;23    1152
column_setitem[stride-1_slice_col] (befo)     974.8179 (1.30)     1,307.2360 (1.53)     991.1287 (1.28)     25.8696 (2.36)     985.5330 (1.28)     10.6919 (1.0)         30;51     763
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------- benchmark 'stride-1_slice_scalar': 2 tests -------------------------------------------------------------------
Name (time in us)                                    Min                 Max               Mean            StdDev             Median               IQR            Outliers  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[stride-1_slice_scalar] (afte)     87.3711 (1.16)     134.7861 (1.0)      89.7566 (1.15)     2.4861 (1.0)      89.5601 (1.16)     1.6061 (1.13)        95;87    2106
column_setitem[stride-1_slice_scalar] (befo)     75.3789 (1.0)      136.7659 (1.01)     78.0297 (1.0)      4.8186 (1.94)     77.0842 (1.0)      1.4230 (1.0)       122;184    2403
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------- benchmark 'stride-2_slice_col': 2 tests ----------------------------------------------------------------------
Name (time in us)                                  Min                 Max                Mean             StdDev              Median                IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[stride-2_slice_col] (afte)     684.7882 (1.02)     972.5131 (1.01)     712.5983 (1.04)     54.4972 (3.04)     693.8996 (1.02)     10.3808 (1.18)      109;163    1338
column_setitem[stride-2_slice_col] (befo)     672.3758 (1.0)      964.4001 (1.0)      684.2917 (1.0)      17.9163 (1.0)      679.5955 (1.0)       8.7875 (1.0)        85;106    1368
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------- benchmark 'stride-2_slice_scalar': 2 tests --------------------------------------------------------------------
Name (time in us)                                     Min                 Max                Mean            StdDev              Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
column_setitem[stride-2_slice_scalar] (afte)     302.0421 (1.04)     374.3470 (1.0)      307.5677 (1.04)     4.3768 (1.0)      306.4690 (1.04)     2.4854 (1.17)      258;253    2532
column_setitem[stride-2_slice_scalar] (befo)     290.4800 (1.0)      378.1999 (1.01)     295.3950 (1.0)      4.6778 (1.07)     294.1729 (1.0)      2.1292 (1.0)       273;324    2977
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

1M
------------------------------ benchmark 'boolean_mask_col_aligned': 2 tests ------------------------------
Name (time in ms)                                       Min                Max               Mean          
-----------------------------------------------------------------------------------------------------------
column_setitem[boolean_mask_col_aligned] (afte)     75.3847 (1.0)      79.0559 (1.0)      76.2878 (1.0)    
column_setitem[boolean_mask_col_aligned] (befo)     76.3708 (1.01)     79.7394 (1.01)     77.0892 (1.01)   
-----------------------------------------------------------------------------------------------------------

------------------------------ benchmark 'boolean_mask_col_unaligned': 2 tests ------------------------------
Name (time in ms)                                         Min                Max               Mean          
-------------------------------------------------------------------------------------------------------------
column_setitem[boolean_mask_col_unaligned] (afte)     46.5199 (1.0)      48.3434 (1.0)      46.9222 (1.0)    
column_setitem[boolean_mask_col_unaligned] (befo)     46.6314 (1.00)     48.5938 (1.01)     47.1492 (1.00)   
-------------------------------------------------------------------------------------------------------------

------------------------------ benchmark 'boolean_mask_scalar': 2 tests ------------------------------
Name (time in ms)                                  Min                Max               Mean          
------------------------------------------------------------------------------------------------------
column_setitem[boolean_mask_scalar] (afte)     17.0548 (1.0)      17.8006 (1.0)      17.5430 (1.0)    
column_setitem[boolean_mask_scalar] (befo)     18.4329 (1.08)     18.6918 (1.05)     18.5073 (1.05)   
------------------------------------------------------------------------------------------------------

-------------------------------- benchmark 'integer_scatter_map_col': 2 tests -------------------------------
Name (time in ms)                                       Min                 Max                Mean          
-------------------------------------------------------------------------------------------------------------
column_setitem[integer_scatter_map_col] (afte)     115.7189 (1.01)     120.0585 (1.0)      116.6452 (1.0)    
column_setitem[integer_scatter_map_col] (befo)     114.7481 (1.0)      122.7263 (1.02)     117.5000 (1.01)   
-------------------------------------------------------------------------------------------------------------

------------------------------ benchmark 'integer_scatter_map_scalar': 2 tests ------------------------------
Name (time in ms)                                         Min                Max               Mean          
-------------------------------------------------------------------------------------------------------------
column_setitem[integer_scatter_map_scalar] (afte)     57.9951 (1.0)      62.2284 (1.0)      59.8864 (1.0)    
column_setitem[integer_scatter_map_scalar] (befo)     60.9071 (1.05)     62.2952 (1.00)     61.6422 (1.03)   
-------------------------------------------------------------------------------------------------------------

------------------------------ benchmark 'stride-1_slice_col': 2 tests ------------------------------
Name (time in ms)                                 Min                Max               Mean          
-----------------------------------------------------------------------------------------------------
column_setitem[stride-1_slice_col] (afte)     56.9203 (1.0)      58.0924 (1.0)      57.4940 (1.0)    
column_setitem[stride-1_slice_col] (befo)     58.1888 (1.02)     59.2996 (1.02)     58.5722 (1.02)   
-----------------------------------------------------------------------------------------------------

-------------------------------- benchmark 'stride-1_slice_scalar': 2 tests -------------------------------
Name (time in us)                                     Min                 Max                Mean          
-----------------------------------------------------------------------------------------------------------
column_setitem[stride-1_slice_scalar] (afte)     287.1200 (1.08)     515.4130 (1.24)     298.1191 (1.09)   
column_setitem[stride-1_slice_scalar] (befo)     265.6982 (1.0)      415.9641 (1.0)      273.5206 (1.0)    
-----------------------------------------------------------------------------------------------------------

------------------------------ benchmark 'stride-2_slice_col': 2 tests ------------------------------
Name (time in ms)                                 Min                Max               Mean          
-----------------------------------------------------------------------------------------------------
column_setitem[stride-2_slice_col] (afte)     29.1543 (1.01)     31.0217 (1.00)     29.6341 (1.00)   
column_setitem[stride-2_slice_col] (befo)     28.9718 (1.0)      30.8824 (1.0)      29.5045 (1.0)    
-----------------------------------------------------------------------------------------------------

--------------------------------- benchmark 'stride-2_slice_scalar': 2 tests --------------------------------
Name (time in us)                                     Min                   Max                Mean          
-------------------------------------------------------------------------------------------------------------
column_setitem[stride-2_slice_scalar] (afte)     780.7089 (1.00)     1,407.5749 (1.0)      817.0583 (1.0)    
column_setitem[stride-2_slice_scalar] (befo)     777.0571 (1.0)      2,036.4628 (1.45)     832.0608 (1.02)   
-------------------------------------------------------------------------------------------------------------

@isVoid isVoid requested a review from a team as a code owner February 3, 2022 00:28
@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 3, 2022
@isVoid isVoid added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 3, 2022
Contributor Author

isVoid commented Feb 10, 2022

rerun tests

Contributor

@vyasr vyasr left a comment

Looks solid to me. I have some minor suggestions for improvement, but this already helps clean up some gnarly branching.

Contributor

@bdice bdice left a comment

Comments attached.

Comment on lines 589 to 590
"""Assign the `ith` row in input column to the row correspond to the `ith`
`True` value in the target column.
Contributor

I think this docstring could be clearer. I had to read a few levels of the C++ source to really understand what this was doing. Is this suggestion accurate?

Suggested change
- """Assign the `ith` row in input column to the row correspond to the `ith`
- `True` value in the target column.
+ """Copy the target columns, replacing masked rows with input data.
+ The ``input_`` data can be a list of columns or as a list of scalars.
+ A list of input columns will be used to replace corresponding rows in the
+ target columns for which the boolean mask is ``True``. A list of input
+ scalars will replace all rows in the target columns for which the boolean
+ mask is ``True``.

Contributor Author

@isVoid isVoid Feb 14, 2022

Yes, I know how confusing this is. Everything here is accurate. However, I would like to add a note highlighting that the input_ columns need not have the same number of rows as the target columns. Instead, their row count should equal the number of True values in the boolean mask. As an example, it's totally valid to have:

input = [42]
boolean_mask = [F, F, F, T]
target = [1, 2, 3, 4]
-----
result = [1, 2, 3, 42]

Which is something pretty subtle from my point of view.
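The column-input semantics described above can be written out as a small pure-Python reference. This is a sketch on plain lists under the behavior stated in this thread, not the libcudf or cuDF implementation; the function name `boolean_mask_scatter_column` is made up for illustration.

```python
def boolean_mask_scatter_column(input_, target, boolean_mask):
    """Return a copy of `target` in which the nth True position of
    `boolean_mask` is replaced by the nth row of `input_`.

    Hypothetical reference helper operating on Python lists.
    """
    if len(boolean_mask) != len(target):
        raise ValueError("boolean_mask and target must have the same length")
    num_true = sum(1 for m in boolean_mask if m)
    if len(input_) != num_true:
        raise ValueError(
            "input_ must have one row per True value in boolean_mask"
        )
    # Consume input rows in order, one per True position in the mask.
    source = iter(input_)
    return [next(source) if m else t for t, m in zip(target, boolean_mask)]


# The example from the comment above: one input row, one True in the mask.
print(boolean_mask_scatter_column([42], [1, 2, 3, 4], [False, False, False, True]))
# [1, 2, 3, 42]
```

Note that `input_` has a single row while `target` has four; the two lengths are unrelated except through the mask.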

Contributor Author

How about:

Copy the target columns, replacing masked rows with input data.
    
    The ``input_`` data can be a list of columns or as a list of scalars.
    A list of input columns will be used to replace corresponding rows in the
    target columns for which the boolean mask is ``True``. For the nth ``True``
    in the boolean mask, the nth row in ``input_`` is used to replace. (followed
    by scalar's case.)

Contributor

Yeah, that edit seems fine. It's super weird behavior to explain...

Contributor Author

I'd like to coordinate with the libcudf team to revisit this API and make it less "weird". If anything, the current behavior has a hard constraint (needed to guarantee API correctness) that cannot be checked without introspecting the data (the number of True values in the boolean mask). I don't know if that violates any API design rules, but it certainly makes the API flaky for users.
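To make the concern concrete, here is a minimal sketch showing that the precondition depends on the mask's contents, not on any shape. `scatter_is_valid` is a hypothetical helper on Python lists; on a GPU column the equivalent count would presumably require a reduction over device data, which is an assumption and not something this thread spells out.

```python
def scatter_is_valid(num_input_rows, boolean_mask):
    """Check the row-count precondition for a column-input scatter.

    Every mask value has to be read to count the True entries, so the
    check cannot be answered from lengths alone.
    """
    return num_input_rows == sum(1 for m in boolean_mask if m)


# Two masks with identical length but different True counts.
mask_a = [False, True, False, True]
mask_b = [False, True, True, True]

print(scatter_is_valid(2, mask_a))  # True
print(scatter_is_valid(2, mask_b))  # False
```

Both calls pass the same input row count and the same mask length, yet only one satisfies the constraint.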

@@ -321,22 +322,22 @@ def _fill(
         if end <= begin or begin >= self.size:
             return self if inplace else self.copy()
 
-        fill_scalar = as_device_scalar(fill_value, self.dtype)
+        device_value = as_device_scalar(fill_value, self.dtype)
Contributor

If possible, we want to use a cudf.Scalar in the Python layer and then get the DeviceScalar via .device_value in Cython right before it hits libcudf. This should avoid any unnecessary synchronization from interacting with the scalar here, such as when calling is_valid().
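The idea in this comment can be illustrated with a toy model of a scalar that keeps its value on the host and only materializes a device copy when asked. `LazyScalar` is a hypothetical stand-in written for this sketch; it does not reproduce cudf.Scalar's actual implementation, and the "device copy" is simulated with a tuple.

```python
class LazyScalar:
    """Toy host-side scalar with lazy device materialization (hypothetical)."""

    def __init__(self, value):
        self._host_value = value
        self._device_value = None
        self.device_copies = 0  # counts simulated host-to-device transfers

    def is_valid(self):
        # Answered from the host value; no device interaction is needed.
        return self._host_value is not None

    @property
    def device_value(self):
        # Materialize the device copy on first access only.
        if self._device_value is None:
            self._device_value = ("device", self._host_value)
            self.device_copies += 1
        return self._device_value


fill = LazyScalar(7)
print(fill.is_valid(), fill.device_copies)  # True 0
_ = fill.device_value                       # the single point of transfer
print(fill.device_copies)                   # 1
```

In this model, Python-layer logic such as validity checks touches only host state, and the transfer is deferred to the last moment before the value would be handed to a device routine.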

Contributor Author

Thanks! I'm sure this is why I used cudf.Scalar before.

@isVoid isVoid requested a review from vyasr February 15, 2022 18:22
@isVoid isVoid requested a review from bdice February 15, 2022 18:22
Contributor

bdice commented Feb 15, 2022

@isVoid I only had a couple minor comments. I can approve once the other open threads are resolved.

@isVoid isVoid requested a review from bdice February 16, 2022 00:43
Contributor

@bdice bdice left a comment

I don't have any further comments on this PR, but I agree with @isVoid's proposal to try and simplify the logic and documentation of boolean_mask_scatter in both libcudf and our Python bindings.

From what I could see, Column.__setitem__ is the only use case of libcudf's boolean_mask_scatter, so we may be able to specialize its design or API a bit more for the task at hand. In particular, we might be doing some work to make a gather map that could be rewritten with a simpler Thrust algorithm or transform iterator? I'm not sure exactly what a refactor would require without further inspection, but it certainly seems possible to reduce the LOC and improve performance from a brief review of the current implementation.
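One way to picture the kind of simplification speculated about here is a prefix-sum formulation: an exclusive scan over the mask gives, for every position, the index of the input row it would read. The sketch below uses plain Python lists and `itertools.accumulate`; it is not libcudf's current implementation, and whether a Thrust version (for example `thrust::exclusive_scan` followed by a transform) would actually be faster is untested here.

```python
from itertools import accumulate


def scatter_via_scan(input_, target, boolean_mask):
    """Boolean-mask scatter expressed as an exclusive scan plus a transform.

    Hypothetical sketch; assumes len(input_) equals the number of True
    values in boolean_mask.
    """
    flags = [1 if m else 0 for m in boolean_mask]
    # Exclusive prefix sum: offsets[i] = number of True values before i.
    offsets = [0] + list(accumulate(flags))[:-1]
    # Element-wise transform: pick from input_ where masked, else keep target.
    return [
        input_[offsets[i]] if flags[i] else target[i]
        for i in range(len(target))
    ]


print(scatter_via_scan([10, 20], [1, 2, 3, 4], [True, False, True, False]))
# [10, 2, 20, 4]
```

Each output element depends only on its own position and the scanned offsets, which is what would make the final step a candidate for a transform iterator.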

Contributor

@vyasr vyasr left a comment

Some minor suggestions for cleanup, but this looks fine to me. I agree with @bdice and @isVoid that we could improve this further by cleaning up the boolean_mask_scatter implementation.

Contributor

vyasr commented Feb 16, 2022

Do you know why tests are currently failing?

Contributor Author

isVoid commented Feb 17, 2022

I can't replicate these regressions locally. I'll take a look at them if they fail in CI again.

codecov bot commented Feb 23, 2022

Codecov Report

Merging #10202 (1754580) into branch-22.04 (a7d88cd) will increase coverage by 0.20%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.04   #10202      +/-   ##
================================================
+ Coverage         10.42%   10.62%   +0.20%     
================================================
  Files               119      122       +3     
  Lines             20603    20965     +362     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18737     +282     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <ø> (ø)
python/cudf/cudf/_version.py 0.00% <ø> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/methods.py 0.00% <ø> (ø)
... and 58 more


Contributor

vyasr commented Feb 23, 2022

To close the loop, the tests that were failing were tests of the dataframe protocol that were accessing invalid data corresponding to null elements. It is not clear why this particular PR triggered those failures, but we've decided not to try to address that here. 1754580 was a minimal changeset to move this PR forward, and we can tackle improving those tests at another time.

Contributor

vyasr commented Feb 23, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a72479f into rapidsai:branch-22.04 Feb 23, 2022
Contributor Author

isVoid commented Feb 25, 2022

This PR is part of #10153

rapids-bot bot pushed a commit that referenced this pull request Mar 8, 2022
Part of #10153 

Aside from the two harder cases, `boolean_mask_scatter` and `sample`, which were addressed in #10202 and #10262, this PR tackles the rest of the refactors in `copying.pyx`. In combination with the other two, this PR should address all interface refactors in `copying.pyx`.

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10359
Successfully merging this pull request may close these issues.

[FEA] Use libcudf's boolean mask scattering for setitem