Rewrites `sample` API #10262

isVoid · 2022-02-10T00:18:40Z

This PR rewrites the `sample` API. Functionally, the API now accepts either a cupy random state or a numpy random state. If a host (numpy) random state is passed, the sampled rows should match the pandas result given the same initial state and the same sequence of operations. If a device (cupy) random state is passed instead, we should expect higher performance when the sample count is large.
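The reproducibility guarantee for host random states can be illustrated without a GPU. The following is a minimal sketch that uses the standard library's `random.Random` as a stand-in for `numpy.random.RandomState`; `sample_rows` is a hypothetical helper and is not part of cudf or pandas.

```python
import random


def sample_rows(rows, n, rng):
    """Pick n distinct rows using the caller-supplied generator."""
    positions = rng.sample(range(len(rows)), n)
    return [rows[i] for i in positions]


rows = [f"row{i}" for i in range(10)]

# Two generators with the same initial state, driven through the same
# sequence of operations, must produce the same rows at every step.
rng_a = random.Random(42)
rng_b = random.Random(42)

first_a = sample_rows(rows, 3, rng_a)
first_b = sample_rows(rows, 3, rng_b)
second_a = sample_rows(rows, 3, rng_a)
second_b = sample_rows(rows, 3, rng_b)

print(first_a == first_b, second_a == second_b)
```

The point of the sketch is that the match depends on both the seed and the order of calls: a generator that has already been used for another draw is in a different state.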

Structurally, this PR refactors the existing code into two sub-methods, one for each axis to sample from. The sub-methods are type annotated.
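A rough sketch of that structure, assuming a toy table represented as a dict of columns. The names `_sample_axis_0`, `_sample_axis_1`, and `Table` are illustrative and are not taken from the cudf source.

```python
import random
from typing import Dict, List

Table = Dict[str, List[int]]  # toy stand-in: column name -> column values


def _sample_axis_0(table: Table, n: int, rng: random.Random) -> Table:
    """Sample n rows; every column is gathered with the same row positions."""
    num_rows = len(next(iter(table.values()))) if table else 0
    positions = rng.sample(range(num_rows), n)
    return {name: [col[i] for i in positions] for name, col in table.items()}


def _sample_axis_1(table: Table, n: int, rng: random.Random) -> Table:
    """Sample n columns; rows are left untouched."""
    names = rng.sample(list(table), n)
    return {name: table[name] for name in names}


def sample(table: Table, n: int, axis: int = 0, seed: int = 0) -> Table:
    """Check `axis` once, then dispatch to the per-axis helper."""
    rng = random.Random(seed)
    if axis == 0:
        return _sample_axis_0(table, n, rng)
    if axis == 1:
        return _sample_axis_1(table, n, rng)
    raise ValueError(f"axis must be 0 or 1, got {axis!r}")


table = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [100, 200, 300, 400]}
rows = sample(table, n=2, axis=0)
cols = sample(table, n=2, axis=1)
```

Splitting by axis keeps each helper's signature and return type simple enough to annotate, which is harder when one function handles both cases.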

Sampling from cudf.Index/cudf.MultiIndex is deprecated.
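A deprecation of this kind usually keeps the old call working while emitting a warning. Below is a minimal sketch, assuming a `FutureWarning` and a made-up message; `ToyIndex` is a hypothetical class, not `cudf.Index`, and the actual warning category and text used in this PR may differ.

```python
import random
import warnings


class ToyIndex:
    """Hypothetical stand-in for an index type; not cudf.Index."""

    def __init__(self, values):
        self.values = list(values)

    def sample(self, n, seed=None):
        # Keep the old behaviour working, but tell callers it is going away.
        warnings.warn(
            "Index.sample is deprecated and will be removed.",
            FutureWarning,
            stacklevel=2,
        )
        rng = random.Random(seed)
        return ToyIndex(rng.sample(self.values, n))


# Record warnings so the deprecation can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    picked = ToyIndex(range(5)).sample(2, seed=0)

print(len(picked.values), caught[0].category.__name__)
```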

This PR is breaking because:

- Users who previously called the sample API now get different rows.
- To align with the pandas API, keep_index is renamed to ignore_index and its meaning is negated.
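For callers migrating existing code, the rename comes with an inverted boolean. A small sketch of the translation, where `migrate_sample_kwargs` is a hypothetical helper written for illustration and not something this PR provides:

```python
def migrate_sample_kwargs(kwargs):
    """Return a copy of `kwargs` with `keep_index` translated to `ignore_index`."""
    new_kwargs = dict(kwargs)
    if "keep_index" in new_kwargs:
        if "ignore_index" in new_kwargs:
            raise TypeError("pass either keep_index or ignore_index, not both")
        # The meaning is inverted: keeping the index == not ignoring it.
        new_kwargs["ignore_index"] = not new_kwargs.pop("keep_index")
    return new_kwargs


old_call = {"n": 3, "keep_index": False}
new_call = migrate_sample_kwargs(old_call)
print(new_call)
```

In other words, an old call with `keep_index=False` corresponds to a new call with `ignore_index=True`, and vice versa.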

The current implementation does not depend on libcudf.copying.sample, so the Cython bindings are removed.

Performance: at 10K rows, this PR is 39% slower than the current implementation, which amounts to about 0.3 ms. At 100M rows, this PR is 7% slower when using a cupy random state.

Benchmark Axis=0

-------------------------------------------------------------------------------------- benchmark 'axis=0': 6 tests ---------------------------------------------------------------------------------------
Name (time in ms)                                                Min                   Max                  Mean              StdDev                Median                 IQR            Outliers  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisIndex-CupyRandomState] (afte)        296.7751 (455.90)     299.2855 (401.57)     297.9519 (448.88)     1.1162 (94.15)      297.7824 (451.66)     2.0472 (192.32)        2;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (afte)     4,435.3055 (>1000.0)  4,717.0815 (>1000.0)  4,507.1635 (>1000.0)  119.8772 (>1000.0)  4,452.5009 (>1000.0)  115.2876 (>1000.0)       1;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (befo)       276.1754 (424.26)     276.4792 (370.97)     276.2995 (416.26)     0.1258 (10.61)      276.3024 (419.08)     0.2010 (18.88)         1;0       5
sample_df[size10K-AxisIndex-CupyRandomState] (afte)           1.0789 (1.66)         1.2420 (1.67)         1.1238 (1.69)       0.0683 (5.76)         1.0962 (1.66)       0.0721 (6.77)          1;0       5
sample_df[size10K-AxisIndex-NumpyRandomState] (afte)          0.9018 (1.39)         1.1441 (1.54)         0.9140 (1.38)       0.0182 (1.54)         0.9094 (1.38)       0.0106 (1.0)         11;11     346
sample_df[size10K-AxisIndex-NumpyRandomState] (befo)          0.6510 (1.0)          0.7453 (1.0)          0.6638 (1.0)        0.0119 (1.0)          0.6593 (1.0)        0.0108 (1.01)        76;44     638
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

For axis=1 sampling, this PR is faster than the current implementation when a numpy random state is passed as the random_state parameter, and slower when a seed is passed instead.

Benchmark axis=1

--------------------------------------------------------------------------------- benchmark 'axis=1': 6 tests ----------------------------------------------------------------------------------
Name (time in us)                                               Min                 Max                Mean             StdDev              Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisColumn-NumpyRandomState] (afte)     173.2660 (1.0)      290.5080 (1.14)     178.2199 (1.0)       8.0913 (1.58)     175.7130 (1.0)      2.0767 (1.73)      227;419    2707
sample_df[size100M-AxisColumn-Seed] (afte)                 441.9110 (2.55)     617.1150 (2.42)     452.4197 (2.54)     14.1272 (2.76)     447.1345 (2.54)     7.9060 (6.59)      151;162    1484
sample_df[size100M-AxisColumn-Seed] (befo)                 297.1560 (1.72)     477.1500 (1.87)     307.8915 (1.73)     17.2036 (3.36)     300.5620 (1.71)     9.4080 (7.85)      159;168    1695
sample_df[size10K-AxisColumn-NumpyRandomState] (afte)      176.6440 (1.02)     254.9110 (1.0)      180.0217 (1.01)      5.1152 (1.0)      178.8940 (1.02)     1.1990 (1.0)       226;405    3542
sample_df[size10K-AxisColumn-Seed] (afte)                  451.6370 (2.61)     689.8120 (2.71)     465.9937 (2.61)     14.3921 (2.81)     463.0710 (2.64)     6.7365 (5.62)        62;91    1183
sample_df[size10K-AxisColumn-Seed] (befo)                  309.4000 (1.79)     413.9080 (1.62)     316.5210 (1.78)      7.6379 (1.49)     315.2130 (1.79)     5.4100 (4.51)        66;42     826
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
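The relative slowdowns and speedups can be recomputed from the tables above. This sketch uses the Mean column, so the results may differ slightly from the figures quoted in the description, depending on which column those were based on; `relative_change` is an illustrative helper.

```python
def relative_change(after: float, before: float) -> float:
    """Fractional change of `after` relative to `before` (positive = slower)."""
    return after / before - 1.0


# Mean times copied from the two benchmark tables.
comparisons = {
    "axis=0, 10K, numpy state (ms)": (0.9140, 0.6638),
    "axis=0, 100M, cupy state (ms)": (297.9519, 276.2995),
    "axis=1, 100M, numpy state vs. seed before (us)": (178.2199, 307.8915),
    "axis=1, 100M, seed after vs. seed before (us)": (452.4197, 307.8915),
}

changes = {
    label: relative_change(after, before)
    for label, (after, before) in comparisons.items()
}

for label, change in changes.items():
    print(f"{label}: {change:+.1%}")

# Absolute cost of the axis=0, 10K slowdown, in milliseconds.
abs_diff_10k_ms = 0.9140 - 0.6638
```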

Part of #10153

codecov · 2022-02-10T02:16:42Z

Codecov Report

Merging #10262 (f4e2686) into branch-22.04 (a7d88cd) will increase coverage by 0.07%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.04   #10262      +/-   ##
================================================
+ Coverage         10.42%   10.50%   +0.07%     
================================================
  Files               119      126       +7     
  Lines             20603    21218     +615     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18990     +535

Impacted Files	Coverage Δ
...ython/custreamz/custreamz/tests/test_dataframes.py	`99.39% <0.00%> (-0.01%)`	⬇️
python/cudf/cudf/errors.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/orc.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/_version.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/ops.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/datasets.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/parquet.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/scalar.py	`0.00% <0.00%> (ø)`
... and 45 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0af727...f4e2686. Read the comment docs.

vyasr

Almost there. I left a bunch of comments but they're mostly around docstrings and related minor improvements.

vyasr

Pretty happy with this now. I think there's still a little bit of room for improvement, but it's come far enough that I'd rather get this merged now and work on more improvements later. Address my remaining comments as you see fit, and we can iterate further later. The biggest outstanding areas for potential improvement are still tests, and we can revisit that when we rework tests as a whole anyway.

vyasr · 2022-03-02T17:53:00Z

python/cudf/cudf/tests/conftest.py

+    """Specific to `test_sample*_axis_0` tests.
+    Only testing weights array that matches type with random state.
+    """
+    _, gd_random_state, _ = random_state_tuple_axis_0


Can we rewrite this fixture to not use the random_state_tuple_axis_0? It looks like two of the conditional branches below never use this output anyway, so we generate too many cases, and we also only use a subset of the possible values when checking the type of the gd_random_state in wrapped.

Yes, I should be able to rewrite this to return a factory that accepts an extra argument for distinction.
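One way such a factory could look, written as plain Python so it runs without pytest. In a `conftest.py`, `make_weights_factory` would be the body of a fixture with `kind` taken from `request.param`. The `"array"` kind and the `use_device_state` argument are assumptions made for illustration; only `None` and `"builtin-list"` appear in this discussion.

```python
def make_weights_factory(kind):
    """Build a weights factory for one parametrized `kind`."""

    def make_weights(size, use_device_state):
        if kind is None:
            return None  # unweighted sampling
        if kind == "builtin-list":
            # A plain list is the same regardless of the random state type.
            return [1.0 / size] * size
        if kind == "array":
            # Only this branch needs to know which random state is in use.
            backend = "device" if use_device_state else "host"
            return (backend, [1.0 / size] * size)
        raise ValueError(f"unknown weights kind: {kind!r}")

    return make_weights


make_list_weights = make_weights_factory("builtin-list")
make_array_weights = make_weights_factory("array")
print(make_list_weights(4, use_device_state=True))
print(make_array_weights(4, use_device_state=False)[0])
```

With this shape, the test passes the distinguishing argument at call time, so the fixture itself no longer has to request the random state fixture.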

> so we generate too many cases,

This fixture request does not generate more test cases. The number of test cases is determined by the product of the parametrizations of all fixtures requested by the test. Here, each test case only requests an instantiation of the random state tuple. Here's a small demo:

import pytest


@pytest.fixture(params=[1, 2])
def fixture1(request):
    return request.param


@pytest.fixture(params=['a', 'b'])
def fixture2(request, fixture1):
    return fixture1


def test_f(fixture1, fixture2):
    pass

(rapids) rapids@compose:~/scratch$ pytest --collect-only pytest_play/fixture/parametrized_fixture.py
========================================================== test session starts ===========================================================
platform linux -- Python 3.8.12, pytest-7.0.1, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/yhw/dev/rapids/scratch
plugins: forked-1.4.0, xdist-2.5.0, pudb-0.7.0, benchmark-3.4.1, hypothesis-6.36.2, repeat-0.9.1
collected 4 items

<Module pytest_play/fixture/parametrized_fixture.py>
  <Function test_f[1-a]>
  <Function test_f[1-b]>
  <Function test_f[2-a]>
  <Function test_f[2-b]>
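The count in the collection output can be reproduced with a plain Cartesian product. This sketch models only the counting argument, not pytest's actual collection logic; the variable names are illustrative.

```python
from itertools import product

fixture1_params = [1, 2]
fixture2_params = ["a", "b"]

# fixture2 requests fixture1, and test_f requests fixture1 directly as well,
# but fixture1 still contributes its parameters only once to the product.
test_ids = [
    f"test_f[{p1}-{p2}]"
    for p1, p2 in product(fixture1_params, fixture2_params)
]

print(len(test_ids))
for test_id in test_ids:
    print(test_id)
```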

Yes, but in this case some of those parametrizations are redundant. For example, you will generate four different tests where request.param is None because random_state_tuple_axis_0 has four possible values, but all of those tests will do exactly the same thing. The same thing happens when request.param == "builtin-list".
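The redundancy described here can be counted directly. In this sketch the four random state values are placeholders and the `"array"` weights kind is hypothetical; only `None` and `"builtin-list"` are named in the discussion.

```python
from itertools import product

random_states = ["state0", "state1", "state2", "state3"]  # four placeholder values
weights_params = [None, "builtin-list", "array"]  # "array" is hypothetical


def behaviour(weights, state):
    """What the fixture would actually do; only "array" looks at the state."""
    if weights in (None, "builtin-list"):
        return (weights,)
    return (weights, state)


generated = list(product(weights_params, random_states))
distinct = {behaviour(w, s) for w, s in generated}
redundant = len(generated) - len(distinct)

print(len(generated), len(distinct), redundant)
```

Under these assumptions, the two state-independent branches each collapse four generated cases into one distinct behaviour.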

I agree that part of the random state tuple parametrization is redundant for the make_weights fixture. To be conservative about introducing unnecessary context into another method, I have updated the PR to reflect that.

As a purely academic point, one could also argue that requesting the random state fixture explicitly shows the dependency of the make_weights fixture on the random state fixture, while the current state of the PR is less clear about it.

I think there is a path that keeps both sides happy: parametrize the pandas random state, the cudf random state, and the checker independently, each in its own fixture, and express their dependency with a fourth fixture. It just takes a bit more work to figure out the constraints there.

isVoid · 2022-03-04T18:24:37Z

@vyasr I gave another round of updates, where I think all of the concerns are addressed. I left two tabs open in case you want to continue the discussion there. For the rest of this PR I'm moving forward to merge.

isVoid · 2022-03-04T18:27:10Z

@gpucibot merge

Part of #10153

Aside from the two harder cases, `boolean_mask_scatter` and `sample`, which have been addressed in #10202 and #10262, this PR tackles the rest of the refactors in `copying.pyx`. In combination with the other two, this PR should address all interface refactors in `copying.pyx`.

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10359

Rewrites sample

b831cc9

isVoid requested a review from a team as a code owner February 10, 2022 00:18

isVoid requested review from trxcllnt and shwina February 10, 2022 00:18

github-actions bot added the Python Affects Python cuDF API. label Feb 10, 2022

Restore support for index but add deprecation warning

ec64a98

isVoid added 3 - Ready for Review Ready for review by team breaking Breaking change improvement Improvement / enhancement to an existing function labels Feb 10, 2022

vyasr requested changes Feb 14, 2022

isVoid and others added 7 commits February 15, 2022 10:30

Better logic to compute n.

80e6cf6

Move size checks

82597e6

Consolidating error checking codes to reduce checking the same variab…

aa166b7

…le twice; Fix type annotations.

Change error message.

0466e84

Co-authored-by: Vyas Ramasubramani <[email protected]>

Inlining weights, random_state handling; rewriting weights docs…

8965538

…tring.

Rewrites make_random_state into a fixture; Optimizes code locations.

3f33dcb

Use fixture in reproducibility test

79d3d56

isVoid requested a review from vyasr February 15, 2022 20:39

vyasr requested changes Feb 15, 2022

isVoid and others added 8 commits February 15, 2022 15:42

Reverting support for cupy arrays for weights; add tests for weights.

ecd8061

Apply suggestions from code review

ce2e0fa

Co-authored-by: Vyas Ramasubramani <[email protected]>

Apply fixes from reviews

b0b6ce7

Conforming error messages.

62ee2b4

Removing redundant tests

6c34a71

Rewrites weights fixture; skip comparing error messages.

e0d78ee

Merge branch 'branch-22.04' of github.com:rapidsai/cudf into improvem…

8c0dff1

…ent/rewrite_sample

Update copyright headers

5300d02

isVoid mentioned this pull request Feb 25, 2022

Refactor cython interface: copying.pyx #10359

Merged

Revert copyright change in b08a9b5

ddc80aa

isVoid mentioned this pull request Feb 25, 2022

[FEA] Remove copying::sample from libcudf #10361

Closed

isVoid and others added 7 commits February 25, 2022 14:14

Update python/cudf/cudf/core/indexed_frame.py

c57f470

Co-authored-by: Vyas Ramasubramani <[email protected]>

Remove error mimic

f158610

Pre-commits

4f0d2ea

update axis=1 cupy random state error message

2a9c013

update reprocibility test

84d2da3

Revert test_struct copyright change.

3f797cf

Merge branch 'improvement/rewrite_sample' of github.com:isVoid/cudf i…

413d8e9

…nto improvement/rewrite_sample

github-actions bot removed the gpuCI label Feb 25, 2022

isVoid added 5 commits February 25, 2022 15:36

Test built-in iterable with argument

967a77e

doc fix

ebf6f70

Simplifying axis==1 case

20a5f9c

Simplify axis=0 case, further document unsupported argument combina…

17319b1

…tions and raise proper warning message.

Simplify series test.

ac77a23

vyasr approved these changes Mar 2, 2022

isVoid added 7 commits March 2, 2022 23:46

Update comments

12015f7

Update error message

5511585

Move nested function outside

93ffc7d

Inline random state construction

d0c7c39

Remove nested fixture request

66acd59

Use common variable

a7588b3

Parametrize series test into axis_0

f4e2686

rapids-bot bot merged commit 1e5b01f into rapidsai:branch-22.04 Mar 4, 2022
