-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrites sample
API
#10262
Rewrites sample
API
#10262
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10262 +/- ##
================================================
+ Coverage 10.42% 10.50% +0.07%
================================================
Files 119 126 +7
Lines 20603 21218 +615
================================================
+ Hits 2148 2228 +80
- Misses 18455 18990 +535
Continue to review full report at Codecov.
|
…le twice; Fix type annotations.
Co-authored-by: Vyas Ramasubramani <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there. I left a bunch of comments but they're mostly around docstrings and related minor improvements.
Co-authored-by: Vyas Ramasubramani <[email protected]>
Co-authored-by: Vyas Ramasubramani <[email protected]>
…nto improvement/rewrite_sample
…tions and raise proper warning message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty happy with this now. I think there's still a little bit of room for improvement, but it's come far enough that I'd rather get this merged now and work on more improvements later. Address my remaining comments as you see fit, and we can iterate further later. The biggest outstanding areas for potential improvement are still tests, and we can revisit that when we rework tests as a whole anyway.
python/cudf/cudf/tests/conftest.py
Outdated
"""Specific to `test_sample*_axis_0` tests. | ||
Only testing weights array that matches type with random state. | ||
""" | ||
_, gd_random_state, _ = random_state_tuple_axis_0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rewrite this fixture to not use the random_state_tuple_axis_0
? It looks like two of the conditional branches below never use this output anyway, so we generate too many cases, and we also only use a subset of the possible values when checking the type of the gd_random_state
in wrapped
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I should be able to rewrite this to return a factory that accepts an extra argument for distinction.
so we generate too many cases,
This fixture request does not generate more test cases. The number of test cases is determined by the product of all parametrization of all fixtures requested by the test case. Here, for each of the test case it is only requesting the instantiation of the random state tuple. Here's a small demo:
@pytest.fixture(params=[1, 2])
def fixture1(request):
return request.param
@pytest.fixture(params=['a', 'b'])
def fixture2(request, fixture1):
return fixture1
def test_f(fixture1, fixture2):
pass
(rapids) rapids@compose:~/scratch$ pytest --collect-only pytest_play/fixture/parametrized_fixture.py
========================================================== test session starts ===========================================================
platform linux -- Python 3.8.12, pytest-7.0.1, pluggy-1.0.0
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/yhw/dev/rapids/scratch
plugins: forked-1.4.0, xdist-2.5.0, pudb-0.7.0, benchmark-3.4.1, hypothesis-6.36.2, repeat-0.9.1
collected 4 items
<Module pytest_play/fixture/parametrized_fixture.py>
<Function test_f[1-a]>
<Function test_f[1-b]>
<Function test_f[2-a]>
<Function test_f[2-b]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but in this case some of those parametrizations are redundant. For example, you will generate four different tests where request.param
is None
because random_state_tuple_axis_0
has four possible values, but all of those tests will do exactly the same thing. The same thing happens when request.param == "builtin-list"
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree that part of the parametrization of random state tuple for make_weights
fixture is redundant. If we are to be very conservative about introducing unnecessary context to another method, the state of PR is changed to reflect that.
For pure academic debates, it can also be argued that requesting random state fixture explicitly shows the dependency of make_weights
fixture has on the random state fixture, while the state of PR isn't as clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there exists a path to make both world happy - one would require parameterize pandas random state, cudf random state and checker independently into their own fixture and express their dependency with a forth fixture. Just a bit more work to figure out the constraints there.
@vyasr I gave another round of updates, where I think all of the concerns are addressed. I left two tabs open in case you want to continue the discussion there. For the rest of this PR I'm moving forward to merge. |
@gpucibot merge |
Part of #10153 Aside from the two harder cases: `boolean_mask_scatter` and `sample` that's been addressed in #10202 and #10262 , this PR tackles rest of refactors that's in `copying.pyx`, in combination of the other two, this PR should address all interface refactor in `copying.pyx`. Authors: - Michael Wang (https://github.com/isVoid) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) URL: #10359
This PR rewrites sample API. On function side, this API now accepts a cupy random state or a numpy random state. If a host (numpy) random state is accpeted, the sampled rows should match the result with pandas given the same initial state and operation sequence. On the other hand, if given a device random state, we should expect higher performance if the sample count is large.
Syntatically, this PR refactors existing code into two sub-method that deals with different axis to sample from. The sub-methods are type annotated.
Sampling from
cudf.Index/cudf.MultiIndex
is deprecated.This PR is breaking because:
sample
API now gets different rows.keep_index
is renamed toignore_index
and its semantic is negated.Current implementation does not depend on
libcudf.copying.sample
, thus cython bindings are removed.Performance: at 10K rows, this PR is 39% slower than current. Amounting for 0.3ms. At 100M rows, this PR is 7% slower using cupy random state.
Benchmark Axis=0
On
axis=1
sample, this PR is faster than current if provided a numpy random state forrandom_state
parameter, while slower if provided a seed instead.Benchmark axis=1
Part of #10153