[FEA] Use List of Columns instead of Frames in Cython API #10153
Comments
This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268.

Related issue: #1268
Related PRs: #9154, #9166, #9492

Next steps:
- [ ] Fix the symmetry problem ([PR 10098](#10098 (comment))): avoid computing the covariance/correlation between the same columns twice
- [ ] Consolidate both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov` bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))

Authors:
- Mayank Anand (https://github.com/mayankanand007)
- Michael Wang (https://github.com/isVoid)
- Sheilah Kirui (https://github.com/skirui-source)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889
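The symmetry item in the next steps above can be illustrated with a small pure-Python sketch. It is not the cudf implementation: `covariance` and `cov_matrix` are hypothetical helpers, and columns are plain lists. The idea is only that each unordered pair of columns is computed once and the result is mirrored.

```python
from itertools import combinations_with_replacement


def covariance(xs, ys):
    """Sample covariance (ddof=1) of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def cov_matrix(columns):
    """Build a symmetric covariance matrix, computing each pair only once."""
    names = list(columns)
    result = {name: {} for name in names}
    # combinations_with_replacement yields (a, a), (a, b), (b, b), ...
    # so (b, a) is never computed separately.
    for a, b in combinations_with_replacement(names, 2):
        value = covariance(columns[a], columns[b])
        result[a][b] = value
        result[b][a] = value  # mirror instead of recomputing
    return result


cols = {"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]}
matrix = cov_matrix(cols)
```

For `k` columns this performs `k * (k + 1) / 2` covariance computations instead of `k * k`.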
This PR rewrites the `sample` API. Functionally, the API now accepts either a cupy random state or a numpy random state. If a host (numpy) random state is passed, the sampled rows should match the pandas result given the same initial state and operation sequence. If a device (cupy) random state is passed, we should expect higher performance when the sample count is large.

Syntactically, this PR refactors the existing code into two sub-methods, one for each axis to sample from. The sub-methods are type annotated. Sampling from `cudf.Index`/`cudf.MultiIndex` is deprecated.

This PR is breaking because:
1. Users who previously called the `sample` API now get different rows.
2. To align with the pandas API, `keep_index` is renamed to `ignore_index` and its semantics are negated.

The current implementation does not depend on `libcudf.copying.sample`, so the Cython bindings are removed.

Performance: at 10K rows, this PR is 39% slower than the current implementation, which amounts to 0.3 ms. At 100M rows, this PR is 7% slower when using a cupy random state.
<details>
<summary>Benchmark Axis=0</summary>

```
------------------------------------------------------------ benchmark 'axis=0': 6 tests ------------------------------------------------------------
Name (time in ms)                                      Min                   Max                   Mean                  StdDev                Median                IQR                   Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisIndex-CupyRandomState] (afte)   296.7751 (455.90)     299.2855 (401.57)     297.9519 (448.88)     1.1162 (94.15)        297.7824 (451.66)     2.0472 (192.32)       2;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (afte)  4,435.3055 (>1000.0)  4,717.0815 (>1000.0)  4,507.1635 (>1000.0)  119.8772 (>1000.0)    4,452.5009 (>1000.0)  115.2876 (>1000.0)    1;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (befo)  276.1754 (424.26)     276.4792 (370.97)     276.2995 (416.26)     0.1258 (10.61)        276.3024 (419.08)     0.2010 (18.88)        1;0       5
sample_df[size10K-AxisIndex-CupyRandomState] (afte)    1.0789 (1.66)         1.2420 (1.67)         1.1238 (1.69)         0.0683 (5.76)         1.0962 (1.66)         0.0721 (6.77)         1;0       5
sample_df[size10K-AxisIndex-NumpyRandomState] (afte)   0.9018 (1.39)         1.1441 (1.54)         0.9140 (1.38)         0.0182 (1.54)         0.9094 (1.38)         0.0106 (1.0)          11;11     346
sample_df[size10K-AxisIndex-NumpyRandomState] (befo)   0.6510 (1.0)          0.7453 (1.0)          0.6638 (1.0)          0.0119 (1.0)          0.6593 (1.0)          0.0108 (1.01)         76;44     638
------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

For `axis=1` sampling, this PR is faster than the current implementation when a numpy random state is passed as the `random_state` parameter, and slower when a seed is passed instead.
<details>
<summary> Benchmark axis=1 </summary>

```
------------------------------------------------------------ benchmark 'axis=1': 6 tests ------------------------------------------------------------
Name (time in us)                                       Min              Max              Mean             StdDev           Median           IQR              Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisColumn-NumpyRandomState] (afte)  173.2660 (1.0)   290.5080 (1.14)  178.2199 (1.0)   8.0913 (1.58)    175.7130 (1.0)   2.0767 (1.73)    227;419   2707
sample_df[size100M-AxisColumn-Seed] (afte)              441.9110 (2.55)  617.1150 (2.42)  452.4197 (2.54)  14.1272 (2.76)   447.1345 (2.54)  7.9060 (6.59)    151;162   1484
sample_df[size100M-AxisColumn-Seed] (befo)              297.1560 (1.72)  477.1500 (1.87)  307.8915 (1.73)  17.2036 (3.36)   300.5620 (1.71)  9.4080 (7.85)    159;168   1695
sample_df[size10K-AxisColumn-NumpyRandomState] (afte)   176.6440 (1.02)  254.9110 (1.0)   180.0217 (1.01)  5.1152 (1.0)     178.8940 (1.02)  1.1990 (1.0)     226;405   3542
sample_df[size10K-AxisColumn-Seed] (afte)               451.6370 (2.61)  689.8120 (2.71)  465.9937 (2.61)  14.3921 (2.81)   463.0710 (2.64)  6.7365 (5.62)    62;91     1183
sample_df[size10K-AxisColumn-Seed] (befo)               309.4000 (1.79)  413.9080 (1.62)  316.5210 (1.78)  7.6379 (1.49)    315.2130 (1.79)  5.4100 (4.51)    66;42     826
------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

Part of #10153

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10262
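As a rough illustration of the `sample` changes described in this PR, the sketch below uses only the Python standard library. It is not cudf code: `sample_rows` is a hypothetical helper, and `random.Random` stands in for the numpy or cupy random state. It shows two points from the description: a caller-supplied random state makes the sampled rows reproducible, and `ignore_index=True` discards the original labels (the negation of the old `keep_index=True`).

```python
import random


def sample_rows(index, rows, n, rng, ignore_index=False):
    """Sample n rows without replacement using a caller-supplied RNG.

    `rng` is a random.Random instance here; it is only a stand-in for the
    numpy/cupy random state mentioned in the PR description.
    """
    positions = rng.sample(range(len(rows)), n)
    sampled = [rows[i] for i in positions]
    if ignore_index:
        # Drop the original labels and renumber from zero.
        new_index = list(range(n))
    else:
        # Keep the labels of the sampled rows.
        new_index = [index[i] for i in positions]
    return new_index, sampled


index = ["a", "b", "c", "d", "e"]
rows = [10, 20, 30, 40, 50]

# Same initial state and same operation sequence give the same rows.
first = sample_rows(index, rows, 3, random.Random(42))
second = sample_rows(index, rows, 3, random.Random(42))

# ignore_index=True keeps the same rows but resets the labels.
reset = sample_rows(index, rows, 3, random.Random(42), ignore_index=True)
```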
Part of #10153

Aside from the two harder cases, `boolean_mask_scatter` and `sample`, which have been addressed in #10202 and #10262, this PR tackles the rest of the refactors in `copying.pyx`. In combination with the other two, this PR should address all interface refactors in `copying.pyx`.

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10359
Part of #10153

This PR refactors the `filling.repeat` Cython API to accept a list of columns instead of a Frame object. In this PR I'm also trying out a possibly better pattern for letting index and indexed_frame share logic, which might become a solution for #10068.

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10371
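The "list of columns instead of a Frame" pattern that this issue tracks can be sketched in plain Python. Everything here is hypothetical and for illustration only: `repeat_columns` plays the role of the low-level routine, and `TinyFrame` is a stand-in for a Frame, with columns modelled as Python lists.

```python
def repeat_columns(columns, count):
    """Low-level routine: repeat every element of each column `count` times.

    It only sees a list of columns and knows nothing about names or indexes.
    """
    return [[value for value in col for _ in range(count)] for col in columns]


class TinyFrame:
    """Hypothetical stand-in for a Frame: column names plus column data."""

    def __init__(self, data):
        self.names = list(data)
        self.columns = [list(col) for col in data.values()]

    def repeat(self, count):
        # Unpack to a plain list of columns, call the low-level routine,
        # then rebuild the frame (names and all) on the Python side.
        new_columns = repeat_columns(self.columns, count)
        return TinyFrame(dict(zip(self.names, new_columns)))


frame = TinyFrame({"a": [1, 2], "b": ["x", "y"]})
repeated = frame.repeat(2)
```

Keeping metadata handling in the Python wrapper means the low-level function has a single, simple input type.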
Part of #10153

This PR changes the APIs in `groupby.pyx` to accept a list of columns as input instead of a Frame. This change affects both keys and values. The `Groupby` object now stores only a list of columns in its `keys` attribute, and the other APIs (`groups`, `aggregate`, `shift`, `replace_nulls`) now accept only a list of columns as their value columns. The `aggregation` communication protocol has changed from a dictionary mapping column names to lists of agg names to a list of lists of agg names. See the changes in `_normalize_aggs` for details.

This PR also tries to simplify the post-processing of the `result` frame in the `agg` method, now that we have finer control in pure Python. I attempted to rewrite `aggregate_internal` and `scan_internal`, but the attempt proved futile because the unified aggregation object is a cdef type, which precludes moving the aggregation filtering step out of its current place. I also tried unifying aggregation and scan with a Cython fused type, but that did not work due to limitations of using fused types with C++ templated types in Cython.

Overall, the performance of the `agg` call is on par with the main branch, with a -3% to 13% performance difference depending on the agg types.
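The protocol change described above (from a dictionary keyed by column name to a positional list of lists) can be sketched as follows. This is not the actual `_normalize_aggs`; `to_positional_aggs` is a hypothetical helper that assumes agg specs are either a single agg name, a list of names, or a dict from column name to one of those.

```python
def to_positional_aggs(column_names, aggs):
    """Return one list of agg names per value column, in column order."""
    if isinstance(aggs, str):
        # A single agg name applies to every column.
        return [[aggs] for _ in column_names]
    if isinstance(aggs, dict):
        normalized = []
        for name in column_names:
            requested = aggs.get(name, [])
            if isinstance(requested, str):
                requested = [requested]
            normalized.append(list(requested))
        return normalized
    # Otherwise assume a list of agg names applied to every column.
    return [list(aggs) for _ in column_names]


by_name = to_positional_aggs(["a", "b"], {"a": ["sum", "max"], "b": "min"})
broadcast = to_positional_aggs(["a", "b"], "sum")
```

Because the result is aligned with the list of value columns by position, the lower layer no longer needs column names at all.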
<details>
<summary>Raw Benchmark</summary>

```
================================================== 36 passed in 33.48s ==================================================
(rapids) rapids@compose:~/scratch/cudf_benchmarks$ ./compare.sh bench_groupby.py

--------------------------------------------------------------- benchmark 'False-False-agg1-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-agg1-100] (afte)      2.5090 (1.0)    2.8418 (1.0)    2.5280 (1.0)    0.0290 (2.40)   2.5229 (1.0)    0.0103 (1.05)   15;19     273
groupby_agg[False-False-agg1-100] (befo)      2.7681 (1.10)   2.8441 (1.00)   2.7877 (1.10)   0.0121 (1.0)    2.7849 (1.10)   0.0098 (1.0)    60;26     252

--------------------------------------------------------------- benchmark 'False-False-agg1-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-agg1-10000] (afte)    2.7803 (1.0)    3.4156 (1.05)   2.8131 (1.0)    0.0548 (1.57)   2.8007 (1.0)    0.0253 (1.0)    10;12     252
groupby_agg[False-False-agg1-10000] (befo)    3.0402 (1.09)   3.2407 (1.0)    3.1571 (1.12)   0.0348 (1.0)    3.1535 (1.13)   0.0509 (2.01)   39;6      236

----------------------------------------------------------------- benchmark 'False-False-agg1-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-agg1-1000000] (afte)  13.2601 (1.0)   14.0128 (1.01)  13.4242 (1.0)   0.1056 (1.28)   13.4004 (1.0)   0.0286 (1.0)    5;8       68
groupby_agg[False-False-agg1-1000000] (befo)  13.5150 (1.02)  13.9165 (1.0)   13.6015 (1.01)  0.0826 (1.0)    13.5944 (1.01)  0.0696 (2.43)   8;5       66

--------------------------------------------------------------- benchmark 'False-False-agg2-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-agg2-100] (afte)      2.5342 (1.0)    2.8621 (1.0)    2.5591 (1.0)    0.0431 (3.18)   2.5509 (1.0)    0.0106 (1.01)   13;18     273
groupby_agg[False-False-agg2-100] (befo)      2.8797 (1.14)   2.9507 (1.03)   2.8997 (1.13)   0.0136 (1.0)    2.8965 (1.14)   0.0105 (1.0)    52;28     227

--------------------------------------------------------------- benchmark 'False-False-agg2-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-agg2-10000] (afte)    2.7922 (1.0)    3.2884 (1.0)    2.8205 (1.0)    0.0473 (1.40)   2.8118 (1.0)    0.0096 (1.0)    10;18     251
groupby_agg[False-False-agg2-10000] (befo)    3.1491 (1.13)   3.4791 (1.06)   3.1752 (1.13)   0.0338 (1.0)    3.1687 (1.13)   0.0108 (1.12)   6;17      172

----------------------------------------------------------------- benchmark 'False-False-agg2-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-agg2-1000000] (afte)  13.4699 (1.0)   14.6287 (1.0)   13.6020 (1.0)   0.1359 (1.0)    13.5769 (1.0)   0.0270 (1.0)    3;8       69
groupby_agg[False-False-agg2-1000000] (befo)  13.6079 (1.01)  29.8318 (2.04)  14.0777 (1.03)  1.9806 (14.57)  13.7795 (1.01)  0.0567 (2.10)   2;6       68

--------------------------------------------------------------- benchmark 'False-False-sum-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-sum-100] (afte)       2.1667 (1.0)    2.2855 (1.0)    2.1831 (1.0)    0.0146 (1.49)   2.1802 (1.0)    0.0111 (1.14)   25;16     301
groupby_agg[False-False-sum-100] (befo)       2.4142 (1.11)   2.4782 (1.08)   2.4319 (1.11)   0.0098 (1.0)    2.4309 (1.11)   0.0097 (1.0)    65;15     278

--------------------------------------------------------------- benchmark 'False-False-sum-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-sum-10000] (afte)     2.4293 (1.0)    2.6593 (1.0)    2.4493 (1.0)    0.0206 (1.66)   2.4455 (1.0)    0.0115 (1.10)   17;19     278
groupby_agg[False-False-sum-10000] (befo)     2.6646 (1.10)   2.7706 (1.04)   2.6832 (1.10)   0.0124 (1.0)    2.6811 (1.10)   0.0105 (1.0)    49;14     257

---------------------------------------------------------------- benchmark 'False-False-sum-1000000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-False-sum-1000000] (afte)   9.3678 (1.0)    21.0480 (2.07)  9.6817 (1.0)    1.2252 (16.49)  9.5286 (1.0)    0.0342 (1.28)   1;9       89
groupby_agg[False-False-sum-1000000] (befo)   9.6830 (1.03)   10.1832 (1.0)   9.7434 (1.01)   0.0743 (1.0)    9.7238 (1.02)   0.0266 (1.0)    6;6       86

--------------------------------------------------------------- benchmark 'False-True-agg1-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-agg1-100] (afte)       2.4392 (1.0)    2.7474 (1.06)   2.4598 (1.0)    0.0287 (2.07)   2.4545 (1.0)    0.0103 (1.0)    10;17     278
groupby_agg[False-True-agg1-100] (befo)       2.5183 (1.03)   2.6017 (1.0)    2.5354 (1.03)   0.0139 (1.0)    2.5332 (1.03)   0.0134 (1.30)   51;18     268

--------------------------------------------------------------- benchmark 'False-True-agg1-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-agg1-10000] (afte)     2.7196 (1.0)    3.2290 (1.06)   2.7446 (1.0)    0.0462 (2.17)   2.7359 (1.0)    0.0106 (1.00)   11;17     257
groupby_agg[False-True-agg1-10000] (befo)     2.7807 (1.02)   3.0590 (1.0)    2.8039 (1.02)   0.0213 (1.0)    2.8004 (1.02)   0.0106 (1.0)    16;18     251

----------------------------------------------------------------- benchmark 'False-True-agg1-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-agg1-1000000] (afte)   13.2259 (1.01)  13.7344 (1.0)   13.3449 (1.00)  0.0797 (1.0)    13.3288 (1.00)  0.0322 (1.41)   5;8       69
groupby_agg[False-True-agg1-1000000] (befo)   13.0875 (1.0)   14.1552 (1.03)  13.3135 (1.0)   0.1325 (1.66)   13.2901 (1.0)   0.0229 (1.0)    4;7       68

--------------------------------------------------------------- benchmark 'False-True-agg2-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-agg2-100] (afte)       2.4580 (1.0)    2.5791 (1.0)    2.4792 (1.0)    0.0174 (1.92)   2.4756 (1.0)    0.0121 (1.37)   21;14     277
groupby_agg[False-True-agg2-100] (befo)       2.6094 (1.06)   2.6686 (1.03)   2.6260 (1.06)   0.0091 (1.0)    2.6255 (1.06)   0.0088 (1.0)    66;21     264

--------------------------------------------------------------- benchmark 'False-True-agg2-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-agg2-10000] (afte)     2.7218 (1.0)    2.8843 (1.0)    2.7415 (1.0)    0.0180 (1.0)    2.7383 (1.0)    0.0116 (1.12)   21;16     257
groupby_agg[False-True-agg2-10000] (befo)     2.8771 (1.06)   3.1227 (1.08)   2.8956 (1.06)   0.0185 (1.03)   2.8922 (1.06)   0.0104 (1.0)    16;16     244

----------------------------------------------------------------- benchmark 'False-True-agg2-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-agg2-1000000] (afte)   13.4555 (1.01)  13.7924 (1.0)   13.5244 (1.00)  0.0601 (1.0)    13.5099 (1.00)  0.0362 (1.0)    7;6       70
groupby_agg[False-True-agg2-1000000] (befo)   13.3841 (1.0)   13.9437 (1.01)  13.4948 (1.0)   0.0773 (1.29)   13.4768 (1.0)   0.0443 (1.22)   5;5       68

--------------------------------------------------------------- benchmark 'False-True-sum-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-sum-100] (afte)        2.1270 (1.0)    2.2397 (1.0)    2.1435 (1.0)    0.0158 (1.01)   2.1407 (1.0)    0.0105 (1.0)    27;22     302
groupby_agg[False-True-sum-100] (befo)        2.1881 (1.03)   2.3309 (1.04)   2.2048 (1.03)   0.0156 (1.0)    2.2014 (1.03)   0.0111 (1.06)   35;30     297

--------------------------------------------------------------- benchmark 'False-True-sum-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-sum-10000] (afte)      2.4018 (1.0)    2.6107 (1.0)    2.4183 (1.0)    0.0198 (1.16)   2.4149 (1.0)    0.0108 (1.12)   14;14     277
groupby_agg[False-True-sum-10000] (befo)      2.4406 (1.02)   2.6840 (1.03)   2.4606 (1.02)   0.0170 (1.0)    2.4585 (1.02)   0.0097 (1.0)    15;14     274

--------------------------------------------------------------- benchmark 'False-True-sum-1000000': 2 tests ----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[False-True-sum-1000000] (afte)    9.4459 (1.01)   10.0397 (1.0)   9.4983 (1.0)    0.0706 (1.0)    9.4846 (1.0)    0.0216 (1.0)    4;6       89
groupby_agg[False-True-sum-1000000] (befo)    9.3064 (1.0)    10.2732 (1.02)  9.5150 (1.00)   0.1107 (1.57)   9.4933 (1.00)   0.0239 (1.10)   6;10      88

---------------------------------------------------------------- benchmark 'True-False-agg1-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-agg1-100] (afte)       4.3327 (1.0)    4.4800 (1.0)    4.3504 (1.0)    0.0202 (1.0)    4.3457 (1.0)    0.0103 (1.0)    10;16     181
groupby_agg[True-False-agg1-100] (befo)       4.6486 (1.07)   12.4651 (2.78)  4.8006 (1.10)   0.7100 (35.18)  4.6664 (1.07)   0.0191 (1.86)   10;19     170

--------------------------------------------------------------- benchmark 'True-False-agg1-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-agg1-10000] (afte)     4.9246 (1.0)    5.1165 (1.0)    4.9491 (1.0)    0.0269 (1.0)    4.9407 (1.0)    0.0133 (1.06)   16;19     164
groupby_agg[True-False-agg1-10000] (befo)     5.2464 (1.07)   5.6002 (1.09)   5.2700 (1.06)   0.0370 (1.38)   5.2623 (1.07)   0.0126 (1.0)    10;17     154

----------------------------------------------------------------- benchmark 'True-False-agg1-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-agg1-1000000] (afte)   36.5089 (1.00)  37.2874 (1.0)   36.8305 (1.0)   0.2321 (1.0)    36.7404 (1.0)   0.2208 (1.0)    7;5       28
groupby_agg[True-False-agg1-1000000] (befo)   36.3558 (1.0)   47.0329 (1.26)  37.7670 (1.03)  2.7313 (11.77)  36.8183 (1.00)  0.8527 (3.86)   2;3       26

--------------------------------------------------------------- benchmark 'True-False-agg2-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-agg2-100] (afte)       4.6287 (1.0)    5.2921 (1.02)   4.6918 (1.0)    0.1017 (4.64)   4.6526 (1.0)    0.0496 (3.27)   21;23     167
groupby_agg[True-False-agg2-100] (befo)       4.9776 (1.08)   5.1737 (1.0)    5.0060 (1.07)   0.0219 (1.0)    4.9995 (1.07)   0.0152 (1.0)    18;10     161

--------------------------------------------------------------- benchmark 'True-False-agg2-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-agg2-10000] (afte)     5.2022 (1.0)    6.7622 (1.16)   5.2405 (1.0)    0.1267 (2.98)   5.2219 (1.0)    0.0157 (1.0)    2;16      155
groupby_agg[True-False-agg2-10000] (befo)     5.5802 (1.07)   5.8531 (1.0)    5.6166 (1.07)   0.0424 (1.0)    5.6041 (1.07)   0.0206 (1.31)   11;14     147

----------------------------------------------------------------- benchmark 'True-False-agg2-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-agg2-1000000] (afte)   37.9639 (1.0)   38.7598 (1.0)   38.2381 (1.0)   0.1221 (1.0)    38.2346 (1.00)  0.0583 (1.0)    2;2       27
groupby_agg[True-False-agg2-1000000] (befo)   38.0569 (1.00)  41.5735 (1.07)  38.7983 (1.01)  1.1968 (9.80)   38.1696 (1.0)   0.6344 (10.88)  5;5       26

--------------------------------------------------------------- benchmark 'True-False-sum-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-sum-100] (afte)        3.6893 (1.0)    4.2792 (1.03)   3.7130 (1.0)    0.0580 (4.15)   3.7022 (1.0)    0.0079 (1.0)    10;16     206
groupby_agg[True-False-sum-100] (befo)        4.0016 (1.08)   4.1370 (1.0)    4.0218 (1.08)   0.0140 (1.0)    4.0180 (1.09)   0.0097 (1.23)   27;17     188

--------------------------------------------------------------- benchmark 'True-False-sum-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-sum-10000] (afte)      4.2660 (1.0)    4.6651 (1.0)    4.2913 (1.0)    0.0493 (2.97)   4.2799 (1.0)    0.0097 (1.0)    10;21     185
groupby_agg[True-False-sum-10000] (befo)      4.5702 (1.07)   4.7321 (1.01)   4.5904 (1.07)   0.0166 (1.0)    4.5858 (1.07)   0.0134 (1.37)   24;8      172

----------------------------------------------------------------- benchmark 'True-False-sum-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-False-sum-1000000] (afte)    30.5871 (1.00)  30.9527 (1.0)   30.6797 (1.00)  0.0628 (1.0)    30.6720 (1.00)  0.0421 (1.0)    4;3       32
groupby_agg[True-False-sum-1000000] (befo)    30.5386 (1.0)   31.8930 (1.03)  30.6654 (1.0)   0.2383 (3.80)   30.6013 (1.0)   0.0573 (1.36)   1;4       31

--------------------------------------------------------------- benchmark 'True-True-agg1-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-True-agg1-100] (afte)        4.2812 (1.0)    4.5815 (1.0)    4.3304 (1.0)    0.0495 (1.43)   4.3134 (1.0)    0.0647 (4.80)   22;4      173
groupby_agg[True-True-agg1-100] (befo)        4.4126 (1.03)   4.7356 (1.03)   4.4357 (1.02)   0.0348 (1.0)    4.4253 (1.03)   0.0135 (1.0)    14;18     158

--------------------------------------------------------------- benchmark 'True-True-agg1-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-True-agg1-10000] (afte)      4.8505 (1.0)    5.3411 (1.0)    4.8882 (1.0)    0.0596 (1.49)   4.8693 (1.0)    0.0240 (1.41)   12;15     166
groupby_agg[True-True-agg1-10000] (befo)      4.9857 (1.03)   5.3869 (1.01)   5.0191 (1.03)   0.0399 (1.0)    5.0089 (1.03)   0.0170 (1.0)    9;15      160

----------------------------------------------------------------- benchmark 'True-True-agg1-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-True-agg1-1000000] (afte)    36.5387 (1.01)  55.8017 (1.52)  37.3622 (1.03)  3.6965 (48.22)  36.5756 (1.00)  0.0882 (2.75)   1;3       27
groupby_agg[True-True-agg1-1000000] (befo)    36.3456 (1.0)   36.7584 (1.0)   36.4209 (1.0)   0.0767 (1.0)    36.4014 (1.0)   0.0320 (1.0)    1;4       27

--------------------------------------------------------------- benchmark 'True-True-agg2-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-True-agg2-100] (afte)        4.5713 (1.0)    5.1548 (1.06)   4.6064 (1.0)    0.0621 (4.49)   4.5886 (1.0)    0.0203 (1.51)   13;22     170
groupby_agg[True-True-agg2-100] (befo)        4.7628 (1.04)   4.8752 (1.0)    4.7832 (1.04)   0.0138 (1.0)    4.7795 (1.04)   0.0134 (1.0)    29;9      167

--------------------------------------------------------------- benchmark 'True-True-agg2-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-True-agg2-10000] (afte)      5.1343 (1.0)    5.4159 (1.0)    5.1769 (1.0)    0.0517 (1.36)   5.1590 (1.0)    0.0179 (1.21)   16;22     157
groupby_agg[True-True-agg2-10000] (befo)      5.3567 (1.04)   5.6432 (1.04)   5.3858 (1.04)   0.0379 (1.0)    5.3785 (1.04)   0.0147 (1.0)    7;12      152

----------------------------------------------------------------- benchmark 'True-True-agg2-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-True-agg2-1000000] (afte)    38.0357 (1.00)  38.2935 (1.00)  38.1159 (1.00)  0.0597 (1.0)    38.1014 (1.00)  0.0846 (1.0)    6;1       27
groupby_agg[True-True-agg2-1000000] (befo)    37.9134 (1.0)   38.2851 (1.0)   38.0201 (1.0)   0.0929 (1.55)   37.9944 (1.0)   0.1066 (1.26)   7;1       26

--------------------------------------------------------------- benchmark 'True-True-sum-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-True-sum-100] (afte)         3.7452 (1.0)    4.0287 (1.0)    3.8009 (1.0)    0.0408 (1.0)    3.7968 (1.0)    0.0503 (1.0)    29;3      131
groupby_agg[True-True-sum-100] (befo)         3.8752 (1.03)   4.4384 (1.10)   3.9316 (1.03)   0.0608 (1.49)   3.9265 (1.03)   0.0504 (1.00)   4;3       148

---------------------------------------------------------------- benchmark 'True-True-sum-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
groupby_agg[True-True-sum-10000] (afte) 4.4442 (1.0) 11.3511 (2.35) 4.5582 (1.0) 0.5829 (24.78) 4.4741 (1.0) 0.0323 (2.85) 3;19 171 groupby_agg[True-True-sum-10000] (befo) 4.5676 (1.03) 4.8264 (1.0) 4.5913 (1.01) 0.0235 (1.0) 4.5871 (1.03) 0.0114 (1.0) 15;16 168 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------- benchmark 'True-True-sum-1000000': 2 tests ----------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ groupby_agg[True-True-sum-1000000] (afte) 30.5326 (1.00) 33.6395 (1.02) 31.2355 (1.0) 0.9563 (1.20) 30.6933 (1.0) 0.9663 (1.0) 5;3 30 groupby_agg[True-True-sum-1000000] (befo) 30.4080 (1.0) 33.0341 (1.0) 31.2527 (1.00) 0.7946 (1.0) 30.9808 (1.01) 1.2781 (1.32) 11;0 30 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ ``` </details> [Benchmark code](https://github.com/isVoid/cudf_benchmarks/blob/9d9644eaa5301df7894c2fe4b1ba317396240518/bench_groupby.py#L23-L42) Authors: - Michael Wang (https://github.com/isVoid) - Bradley Dice (https://github.com/bdice) Approvers: - Vyas Ramasubramani (https://github.com/vyasr) - Bradley Dice (https://github.com/bdice) URL: #10419
This PR covers much of the low-hanging fruit for #10153. All APIs that previously accepted Frames now accept a list of columns in the following files:

- hash.pyx
- interop.pyx
- join.pyx
- partitioning.pyx
- quantiles.pyx
- reshape.pyx
- search.pyx
- transform.pyx
- lists.pyx
- string/combine.pyx

This PR covers point 5 in the follow-ups to #9889.

Also, in `join.pyx`, the GIL was not released when dispatching work to libcudf. This PR fixes that.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10463
This PR contributes to #10153 by refactoring all Cython APIs in `transpose.pyx` and `sort.pyx` to accept a list of columns as input. It also includes several minor improvements to the code base; see the comments below for details.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Ashwin Srinath (https://github.com/shwina)

URL: #10675
This PR refactors `merge_sorted` in `merge.pyx` to accept a list of columns, contributing to #10153.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Ashwin Srinath (https://github.com/shwina)

URL: #10698
@isVoid Should we close this, make a new issue for the IO files, and label it "good first issue"?
I would still like to keep tracking this work in this issue. The I/O file refactors in particular are the hardest of the group, so I don't think moving them to a separate issue will help lower the barrier.
I'm going to close this issue. With pylibcudf coming, we are going to have to rethink the internals of cudf further. The list-of-columns approach we took here is certainly going to be closer to the final end state, but it is no longer directly applicable to the future design.
Following the removal of the `Table` class (#9315) and a pilot study that changed several Cython APIs to use lists of columns (#9558), we want to extend that effort and replace the remaining uses of `Frame` in the Cython API with lists of columns. The main arguments for this change are:

- `table_view` is essentially a thin wrapper around `vector<column_view>`, which matches the list-of-columns concept.

This is a meta issue that tracks the candidate APIs awaiting this change and the progress of the refactor.
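To make the argument above concrete, here is a minimal pure-Python sketch of the two calling conventions. It is not cudf code: `Column`, `Frame`, `tile_frame`, and `tile_columns` are hypothetical stand-ins invented for illustration, and the "tile" operation is simulated with plain Python lists.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Column:
    """Hypothetical stand-in for a column; it just holds Python values."""
    values: list


@dataclass
class Frame:
    """Hypothetical stand-in for a Frame: columns plus metadata (names)."""
    data: Dict[str, Column]


def tile_frame(source: Frame, count: int) -> Frame:
    # Frame-based style: the low-level routine has to know about names
    # and rebuild a Frame, even though it only needs the columns.
    return Frame(
        {name: Column(col.values * count) for name, col in source.data.items()}
    )


def tile_columns(source_columns: List[Column], count: int) -> List[Column]:
    # List-of-columns style: the argument mirrors a vector of column views,
    # and the routine neither receives nor returns any metadata.
    return [Column(col.values * count) for col in source_columns]


frame = Frame({"a": Column([1, 2]), "b": Column([3, 4])})

# The Python layer strips the metadata before the call and re-attaches it after.
tiled = tile_columns(list(frame.data.values()), 2)
result = Frame(dict(zip(frame.data.keys(), tiled)))
```

In this sketch the metadata handling lives entirely with the caller, so the low-level function stays a plain columns-in, columns-out routine.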
The following list was compiled by searching the C++ code base for `\(\n?.*table_view.*(\n|\))` and finding the corresponding Cython wrappers.

- `Scatter`: should we expand the Cython interface to support multi-column scatter?
- `table_empty_like`
- `table_slice`
- `table_split`
- `boolean_mask_scatter_table`: see [FEA] Use libcudf's boolean mask scattering for setitem #8667.
- `sample`
- `repeat`
- `groupby.__cinit__`
- `groupby.groups`
- `groupby.shift`
- `groupby.replace_nulls`
- `hash`
- `to_dlpack`
- `join`/`semi_join`: lhs, rhs
- `merge_sorted`
- `partition`
- `quantiles`
- `interleave_columns`
- `tile`
- `search_sorted`
- `is_sorted`
- `order_by`
- `digitize`
- `rank_columns`
- `table_encode`
- `transpose`
- `write_csv`: note index argument
- `write_orc`
- `write_parquet`
- `explode_outer`
- `concatenate_rows`
- `concatenate`
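For anyone who wants to reproduce the search, the pattern quoted above can be tried with Python's `re` module. This is only a sketch: the declarations in `SAMPLE_HEADER` are made up for illustration and are not copied from libcudf.

```python
import re

# The search pattern quoted in this issue, unchanged.
TABLE_VIEW_PARAM = re.compile(r"\(\n?.*table_view.*(\n|\))")

# Made-up declarations for illustration only; not real libcudf signatures.
SAMPLE_HEADER = """\
std::unique_ptr<table> tile(
  table_view const& input,
  size_type count);

bool is_sorted(table_view const& input);

std::unique_ptr<column> make_numeric_column(size_type size);
"""

# Collect the text of every match so it can be inspected by eye.
matches = [m.group(0) for m in TABLE_VIEW_PARAM.finditer(SAMPLE_HEADER)]

for text in matches:
    print(repr(text))
```

Because `.` does not match a newline by default, the pattern only finds a `table_view` that sits on the same line as the opening parenthesis or on the line immediately after it. A `table_view` parameter further down a long argument list would need a manual check.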
The second step after this is to remove `utils.table_view_from_table` completely.