This PR rewrites the `sample` API. Functionally, the API now accepts either a cupy random state or a numpy random state. If a host (numpy) random state is passed, the sampled rows should match the pandas result given the same initial state and operation sequence. If a device (cupy) random state is passed instead, we should expect higher performance when the sample count is large.

Structurally, this PR refactors the existing code into two sub-methods, one for each axis that can be sampled from. The sub-methods are type annotated. Sampling from `cudf.Index`/`cudf.MultiIndex` is deprecated.

This PR is breaking because:

1. Users who previously called the `sample` API now get different rows.
2. To align with the pandas API, `keep_index` is renamed to `ignore_index` and its semantics are negated.

The implementation in this PR does not depend on `libcudf.copying.sample`, so the Cython bindings are removed.

Performance: at 10K rows, this PR is 39% slower than the existing implementation, which amounts to 0.3 ms. At 100M rows, this PR is 7% slower when using a cupy random state.
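For callers migrating across the rename, the negated flag can be illustrated with a small translation helper. This is only a sketch: `translate_sample_kwargs` is a hypothetical name and is not part of cudf or of this PR. It assumes nothing beyond what the description states, namely that `keep_index` becomes `ignore_index` with the opposite meaning.

```python
def translate_sample_kwargs(kwargs):
    """Hypothetical shim: rewrite the legacy `keep_index` flag as `ignore_index`.

    Not part of cudf; it only illustrates the renamed, negated parameter.
    """
    kwargs = dict(kwargs)  # copy so the caller's dict is left untouched
    if "keep_index" in kwargs:
        if "ignore_index" in kwargs:
            raise TypeError("pass either keep_index or ignore_index, not both")
        # Keeping the original index is the same as not ignoring it.
        kwargs["ignore_index"] = not kwargs.pop("keep_index")
    return kwargs
```

A call site that used `df.sample(n=3, keep_index=False)` would then be written as `df.sample(**translate_sample_kwargs({"n": 3, "keep_index": False}))`, which is equivalent to passing `ignore_index=True` directly.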
<details>
<summary>Benchmark axis=0</summary>

```
------------------------------------------------------------------------------------- benchmark 'axis=0': 6 tests -------------------------------------------------------------------------------------
Name (time in ms)                                                       Min                   Max                  Mean              StdDev                Median                 IQR  Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisIndex-CupyRandomState] (afte)      296.7751 (455.90)     299.2855 (401.57)     297.9519 (448.88)      1.1162 (94.15)     297.7824 (451.66)     2.0472 (192.32)       2;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (afte)  4,435.3055 (>1000.0)  4,717.0815 (>1000.0)  4,507.1635 (>1000.0)  119.8772 (>1000.0)  4,452.5009 (>1000.0)  115.2876 (>1000.0)       1;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (befo)     276.1754 (424.26)     276.4792 (370.97)     276.2995 (416.26)      0.1258 (10.61)     276.3024 (419.08)      0.2010 (18.88)       1;0       5
sample_df[size10K-AxisIndex-CupyRandomState] (afte)           1.0789 (1.66)         1.2420 (1.67)         1.1238 (1.69)       0.0683 (5.76)         1.0962 (1.66)       0.0721 (6.77)       1;0       5
sample_df[size10K-AxisIndex-NumpyRandomState] (afte)          0.9018 (1.39)         1.1441 (1.54)         0.9140 (1.38)       0.0182 (1.54)         0.9094 (1.38)        0.0106 (1.0)     11;11     346
sample_df[size10K-AxisIndex-NumpyRandomState] (befo)           0.6510 (1.0)          0.7453 (1.0)          0.6638 (1.0)        0.0119 (1.0)          0.6593 (1.0)       0.0108 (1.01)     76;44     638
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

When sampling along `axis=1`, this PR is faster than the existing implementation if a numpy random state is passed as the `random_state` parameter, and slower if a seed is passed instead.
<details>
<summary>Benchmark axis=1</summary>

```
----------------------------------------------------------------------- benchmark 'axis=1': 6 tests -----------------------------------------------------------------------
Name (time in us)                                                   Min              Max             Mean          StdDev           Median            IQR  Outliers  Rounds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisColumn-NumpyRandomState] (afte)   173.2660 (1.0)  290.5080 (1.14)   178.2199 (1.0)   8.0913 (1.58)   175.7130 (1.0)  2.0767 (1.73)   227;419    2707
sample_df[size100M-AxisColumn-Seed] (afte)              441.9110 (2.55)  617.1150 (2.42)  452.4197 (2.54)  14.1272 (2.76)  447.1345 (2.54)  7.9060 (6.59)   151;162    1484
sample_df[size100M-AxisColumn-Seed] (befo)              297.1560 (1.72)  477.1500 (1.87)  307.8915 (1.73)  17.2036 (3.36)  300.5620 (1.71)  9.4080 (7.85)   159;168    1695
sample_df[size10K-AxisColumn-NumpyRandomState] (afte)   176.6440 (1.02)   254.9110 (1.0)  180.0217 (1.01)    5.1152 (1.0)  178.8940 (1.02)   1.1990 (1.0)   226;405    3542
sample_df[size10K-AxisColumn-Seed] (afte)               451.6370 (2.61)  689.8120 (2.71)  465.9937 (2.61)  14.3921 (2.81)  463.0710 (2.64)  6.7365 (5.62)     62;91    1183
sample_df[size10K-AxisColumn-Seed] (befo)               309.4000 (1.79)  413.9080 (1.62)  316.5210 (1.78)   7.6379 (1.49)  315.2130 (1.79)  5.4100 (4.51)     66;42     826
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

Part of #10153

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10262