Rewrites sample API #10262

Merged 43 commits on Mar 4, 2022

@isVoid isVoid commented Feb 10, 2022

This PR rewrites the sample API. On the function side, the API now accepts either a cupy random state or a numpy random state. If a host (numpy) random state is passed, the sampled rows should match the pandas result given the same initial state and operation sequence. If a device (cupy) random state is passed instead, we should expect higher performance when the sample count is large.
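The "same initial state and operation sequence" condition can be illustrated with a minimal sketch. It uses Python's standard-library `random.Random` as a stand-in generator rather than numpy, cupy, or cudf, and `sample_row_positions` is a hypothetical helper, not part of any of those libraries:

```python
import random
from typing import List


def sample_row_positions(n_rows: int, n: int, rng: random.Random) -> List[int]:
    """Pick n distinct row positions using the supplied generator."""
    return rng.sample(range(n_rows), n)


# Two generators created from the same seed and used identically
# produce the same row positions.
first = sample_row_positions(1000, 5, random.Random(42))
second = sample_row_positions(1000, 5, random.Random(42))
print(first == second)  # True

# A generator carries state between calls, so a later draw is only
# reproducible if the whole call sequence is replayed from the same seed.
rng_a, rng_b = random.Random(7), random.Random(7)
replay_a = [sample_row_positions(1000, 5, rng_a) for _ in range(3)]
replay_b = [sample_row_positions(1000, 5, rng_b) for _ in range(3)]
print(replay_a == replay_b)  # True
```

The same reasoning is presumably why the description ties the pandas match to a host random state: both libraries would then consume an identical stream of random numbers.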

Syntactically, this PR refactors the existing code into two sub-methods, each handling a different axis to sample from. The sub-methods are type annotated.
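As a rough sketch of that shape (not the code in this PR), a public entry point can dispatch to one type-annotated helper per axis. All names here (`Table`, `_sample_rows`, `_sample_columns`, `sample`) are hypothetical, and a plain dict of lists stands in for a dataframe:

```python
import random
from typing import Dict, List

Table = Dict[str, List[int]]  # column name -> column values (toy stand-in)


def _sample_rows(table: Table, n: int, rng: random.Random) -> Table:
    """axis=0: pick n row positions and gather them from every column."""
    n_rows = len(next(iter(table.values()))) if table else 0
    picked = rng.sample(range(n_rows), n)
    return {name: [col[i] for i in picked] for name, col in table.items()}


def _sample_columns(table: Table, n: int, rng: random.Random) -> Table:
    """axis=1: pick n column names and keep those columns whole."""
    picked = rng.sample(list(table), n)
    return {name: table[name] for name in picked}


def sample(table: Table, n: int, axis: int = 0, seed: int = 0) -> Table:
    """Public entry point: validate the axis, then dispatch."""
    rng = random.Random(seed)
    if axis == 0:
        return _sample_rows(table, n, rng)
    if axis == 1:
        return _sample_columns(table, n, rng)
    raise ValueError(f"axis must be 0 or 1, got {axis!r}")
```

Keeping the two paths separate lets each helper carry its own signature and annotations, since sampling rows and sampling columns need different inputs.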

Sampling from cudf.Index/cudf.MultiIndex is deprecated.

This PR is breaking because:

  1. Users who previously called the sample API now get different rows.
  2. To align with the pandas API, keep_index is renamed to ignore_index and its semantics are negated.
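For callers migrating across the rename, the negation can be captured in a small translation helper. This is only a sketch: `resolve_ignore_index` is a hypothetical name, and the PR description does not say that any such compatibility shim ships with cudf.

```python
import warnings
from typing import Optional


def resolve_ignore_index(
    ignore_index: bool = False, keep_index: Optional[bool] = None
) -> bool:
    """Translate the legacy keep_index flag into ignore_index.

    keep_index=True meant "keep the original index", which corresponds to
    ignore_index=False, so the value is negated.
    """
    if keep_index is None:
        return ignore_index
    warnings.warn(
        "keep_index is deprecated; pass ignore_index with the opposite value",
        FutureWarning,
        stacklevel=2,
    )
    return not keep_index
```

For example, a call site that used `keep_index=False` would now pass `ignore_index=True`.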

The current implementation does not depend on libcudf.copying.sample, so the Cython bindings are removed.

Performance: at 10K rows, this PR is 39% slower than the current implementation, which amounts to about 0.3 ms. At 100M rows, this PR is 7% slower when using a cupy random state.

Benchmark Axis=0
-------------------------------------------------------------------------------------- benchmark 'axis=0': 6 tests ---------------------------------------------------------------------------------------
Name (time in ms)                                                Min                   Max                  Mean              StdDev                Median                 IQR            Outliers  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisIndex-CupyRandomState] (afte)        296.7751 (455.90)     299.2855 (401.57)     297.9519 (448.88)     1.1162 (94.15)      297.7824 (451.66)     2.0472 (192.32)        2;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (afte)     4,435.3055 (>1000.0)  4,717.0815 (>1000.0)  4,507.1635 (>1000.0)  119.8772 (>1000.0)  4,452.5009 (>1000.0)  115.2876 (>1000.0)       1;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (befo)       276.1754 (424.26)     276.4792 (370.97)     276.2995 (416.26)     0.1258 (10.61)      276.3024 (419.08)     0.2010 (18.88)         1;0       5
sample_df[size10K-AxisIndex-CupyRandomState] (afte)           1.0789 (1.66)         1.2420 (1.67)         1.1238 (1.69)       0.0683 (5.76)         1.0962 (1.66)       0.0721 (6.77)          1;0       5
sample_df[size10K-AxisIndex-NumpyRandomState] (afte)          0.9018 (1.39)         1.1441 (1.54)         0.9140 (1.38)       0.0182 (1.54)         0.9094 (1.38)       0.0106 (1.0)         11;11     346
sample_df[size10K-AxisIndex-NumpyRandomState] (befo)          0.6510 (1.0)          0.7453 (1.0)          0.6638 (1.0)        0.0119 (1.0)          0.6593 (1.0)        0.0108 (1.01)        76;44     638
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
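The slowdown figures quoted above can be approximately re-derived from the table. The sketch below uses the Mean column and assumes the 10K comparison is between the NumPy-random-state rows; the description does not say which column or row pair was used, so small differences from the quoted 39% and 7% are expected.

```python
def percent_slower(before_ms: float, after_ms: float) -> float:
    """Relative slowdown of `after` versus `before`, in percent."""
    return (after_ms - before_ms) / before_ms * 100.0


# Mean times in milliseconds, copied from the axis=0 benchmark table.
mean_ms = {
    ("10K", "before"): 0.6638,
    ("10K", "after-numpy"): 0.9140,
    ("100M", "before"): 276.2995,
    ("100M", "after-cupy"): 297.9519,
}

small_pct = percent_slower(mean_ms[("10K", "before")], mean_ms[("10K", "after-numpy")])
small_abs = mean_ms[("10K", "after-numpy")] - mean_ms[("10K", "before")]
large_pct = percent_slower(mean_ms[("100M", "before")], mean_ms[("100M", "after-cupy")])

print(f"10K rows:  {small_pct:.1f}% slower ({small_abs:.2f} ms)")
print(f"100M rows: {large_pct:.1f}% slower")
```

On the means this gives roughly 38% (about 0.25 ms) at 10K rows and roughly 8% at 100M rows, in line with the rounded numbers in the description.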

For axis=1 sampling, this PR is faster than the current implementation when a numpy random state is passed as the random_state parameter, but slower when a seed is passed instead.

Benchmark axis=1
--------------------------------------------------------------------------------- benchmark 'axis=1': 6 tests ----------------------------------------------------------------------------------
Name (time in us)                                               Min                 Max                Mean             StdDev              Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisColumn-NumpyRandomState] (afte)     173.2660 (1.0)      290.5080 (1.14)     178.2199 (1.0)       8.0913 (1.58)     175.7130 (1.0)      2.0767 (1.73)      227;419    2707
sample_df[size100M-AxisColumn-Seed] (afte)                 441.9110 (2.55)     617.1150 (2.42)     452.4197 (2.54)     14.1272 (2.76)     447.1345 (2.54)     7.9060 (6.59)      151;162    1484
sample_df[size100M-AxisColumn-Seed] (befo)                 297.1560 (1.72)     477.1500 (1.87)     307.8915 (1.73)     17.2036 (3.36)     300.5620 (1.71)     9.4080 (7.85)      159;168    1695
sample_df[size10K-AxisColumn-NumpyRandomState] (afte)      176.6440 (1.02)     254.9110 (1.0)      180.0217 (1.01)      5.1152 (1.0)      178.8940 (1.02)     1.1990 (1.0)       226;405    3542
sample_df[size10K-AxisColumn-Seed] (afte)                  451.6370 (2.61)     689.8120 (2.71)     465.9937 (2.61)     14.3921 (2.81)     463.0710 (2.64)     6.7365 (5.62)        62;91    1183
sample_df[size10K-AxisColumn-Seed] (befo)                  309.4000 (1.79)     413.9080 (1.62)     316.5210 (1.78)      7.6379 (1.49)     315.2130 (1.79)     5.4100 (4.51)        66;42     826
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Part of #10153

@isVoid isVoid requested a review from a team as a code owner February 10, 2022 00:18
@isVoid isVoid requested review from trxcllnt and shwina February 10, 2022 00:18
@github-actions github-actions bot added the Python Affects Python cuDF API. label Feb 10, 2022
@isVoid isVoid added 3 - Ready for Review Ready for review by team breaking Breaking change improvement Improvement / enhancement to an existing function labels Feb 10, 2022
codecov bot commented Feb 10, 2022

Codecov Report

Merging #10262 (f4e2686) into branch-22.04 (a7d88cd) will increase coverage by 0.07%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.04   #10262      +/-   ##
================================================
+ Coverage         10.42%   10.50%   +0.07%     
================================================
  Files               119      126       +7     
  Lines             20603    21218     +615     
================================================
+ Hits               2148     2228      +80     
- Misses            18455    18990     +535     
Impacted Files Coverage Δ
...ython/custreamz/custreamz/tests/test_dataframes.py 99.39% <0.00%> (-0.01%) ⬇️
python/cudf/cudf/errors.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/ops.py 0.00% <0.00%> (ø)
python/cudf/cudf/datasets.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/scalar.py 0.00% <0.00%> (ø)
... and 45 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@isVoid isVoid requested a review from vyasr February 15, 2022 20:39
@vyasr vyasr left a comment

Almost there. I left a bunch of comments but they're mostly around docstrings and related minor improvements.

@github-actions github-actions bot removed the gpuCI label Feb 25, 2022
@vyasr vyasr left a comment

Pretty happy with this now. I think there's still a little bit of room for improvement, but it's come far enough that I'd rather get this merged now and work on more improvements later. Address my remaining comments as you see fit, and we can iterate further later. The biggest outstanding areas for potential improvement are still tests, and we can revisit that when we rework tests as a whole anyway.

python/cudf/cudf/tests/conftest.py
"""Specific to `test_sample*_axis_0` tests.
Only testing weights array that matches type with random state.
"""
_, gd_random_state, _ = random_state_tuple_axis_0
Contributor

Can we rewrite this fixture to not use the random_state_tuple_axis_0? It looks like two of the conditional branches below never use this output anyway, so we generate too many cases, and we also only use a subset of the possible values when checking the type of the gd_random_state in wrapped.

Contributor Author

Yes, I should be able to rewrite this to return a factory that accepts an extra argument for distinction.

> so we generate too many cases,

This fixture request does not generate more test cases. The number of test cases is determined by the product of the parametrizations of all fixtures requested by the test. Here, each test case only requests the instantiation of the random state tuple. Here's a small demo:

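import pytest


# Parametrized over two values.
@pytest.fixture(params=[1, 2])
def fixture1(request):
    return request.param


# Requests fixture1, but contributes only its own two params to the product.
@pytest.fixture(params=['a', 'b'])
def fixture2(request, fixture1):
    return fixture1


# Collected 2 x 2 = 4 times, once per (fixture1, fixture2) param pair.
def test_f(fixture1, fixture2):
    pass

(rapids) rapids@compose:~/scratch$ pytest --collect-only pytest_play/fixture/parametrized_fixture.py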
========================================================== test session starts ===========================================================
platform linux -- Python 3.8.12, pytest-7.0.1, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/yhw/dev/rapids/scratch
plugins: forked-1.4.0, xdist-2.5.0, pudb-0.7.0, benchmark-3.4.1, hypothesis-6.36.2, repeat-0.9.1
collected 4 items                                                                                                                        

<Module pytest_play/fixture/parametrized_fixture.py>
  <Function test_f[1-a]>
  <Function test_f[1-b]>
  <Function test_f[2-a]>
  <Function test_f[2-b]>
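The four collected items correspond to the Cartesian product of the two fixtures' parameter lists. A minimal sketch with `itertools.product` reproduces the same IDs; it only mirrors the count and naming shown in the output above and is not how pytest builds its test IDs internally:

```python
from itertools import product

fixture1_params = [1, 2]
fixture2_params = ["a", "b"]

# One test instance per (fixture1, fixture2) parameter pair. fixture2
# requesting fixture1 adds no extra factor to the product.
test_ids = [
    f"test_f[{p1}-{p2}]" for p1, p2 in product(fixture1_params, fixture2_params)
]

print(len(test_ids))  # 4
print(test_ids)
```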

Contributor

Yes, but in this case some of those parametrizations are redundant. For example, you will generate four different tests where request.param is None because random_state_tuple_axis_0 has four possible values, but all of those tests will do exactly the same thing. The same thing happens when request.param == "builtin-list".
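The redundancy described here can be counted with a small model. The labels below are placeholders: the thread only states that random_state_tuple_axis_0 has four possible values and that the None and "builtin-list" cases ignore it, so the state names and the "matching-array" weights kind are assumptions made for illustration.

```python
from itertools import product

# Four random-state parametrizations (names are hypothetical placeholders).
random_state_kinds = ["state-0", "state-1", "state-2", "state-3"]
# Weights parametrizations; "matching-array" is an assumed third case.
weight_kinds = [None, "builtin-list", "matching-array"]


def uses_random_state(weight_kind) -> bool:
    """Only some weights cases depend on the random state's type."""
    return weight_kind not in (None, "builtin-list")


generated = list(product(random_state_kinds, weight_kinds))

# Collapse combinations whose fixture output cannot differ.
distinct = {
    (state if uses_random_state(weights) else None, weights)
    for state, weights in generated
}

print(len(generated), len(distinct))  # 12 6
```

Under these assumptions, half of the generated combinations yield a weights value that is identical to another combination's, which is the overlap the comment points out.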

Contributor Author

I agree that part of the random state tuple parametrization is redundant for the make_weights fixture. To be conservative about introducing unnecessary context into another method, I have changed the PR to reflect that.

For the sake of pure academic debate, it can also be argued that requesting the random state fixture explicitly shows the dependency that the make_weights fixture has on the random state fixture, whereas the current state of the PR isn't as clear.

Contributor Author

I think there is a path that makes both worlds happy: parametrize the pandas random state, the cudf random state, and the checker independently, each in its own fixture, and express their dependency with a fourth fixture. It just takes a bit more work to figure out the constraints there.

isVoid commented Mar 4, 2022

@vyasr I pushed another round of updates, which I think addresses all of the concerns. I left two tabs open in case you want to continue the discussion there. For the rest of this PR, I'm moving forward to merge.

isVoid commented Mar 4, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1e5b01f into rapidsai:branch-22.04 Mar 4, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 8, 2022
Part of #10153

Aside from the two harder cases, `boolean_mask_scatter` and `sample`, which were addressed in #10202 and #10262, this PR tackles the rest of the refactors in `copying.pyx`. Combined with the other two, this PR should address all interface refactors in `copying.pyx`.

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10359