[FEA] Use List of Columns instead of Frames in Cython API #10153

isVoid · 2022-01-28T00:56:35Z

Following the removal of the Table class (#9315) and a pilot study that changed several Cython APIs to use lists of columns (#9558), we want to extend that effort and replace the use of Frame with lists of columns throughout the Cython API. The main arguments for this change are:

- Libcudf's `table_view` is essentially a thin wrapper around `vector<column_view>`, which matches the list-of-columns concept.
- In the API envisioned by the data consortium, dataframes do not have column names.
- From a developer's perspective, we want to rely on simpler data structures in the Cython layer for easier debugging.
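To make the arguments above concrete, here is a minimal pure-Python sketch of the intended division of labor. `Column`, `reverse_columns`, and `reverse_frame` are hypothetical stand-ins invented for illustration; they are not cudf classes or bindings. The assumption is that the low-level function sees only an ordered list of columns, while the Python layer owns the column names.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Column:
    """Hypothetical stand-in for a cudf column: it only holds values."""
    values: list


def reverse_columns(columns: Sequence[Column]) -> List[Column]:
    """Stand-in for a Cython binding: takes and returns a plain list of columns."""
    return [Column(col.values[::-1]) for col in columns]


def reverse_frame(data: Dict[str, Column]) -> Dict[str, Column]:
    """Python-layer caller: strips the names before the call, reattaches them after."""
    names = list(data)
    result = reverse_columns([data[name] for name in names])
    return dict(zip(names, result))


frame = {"a": Column([1, 2, 3]), "b": Column([4, 5, 6])}
out = reverse_frame(frame)
print({name: col.values for name, col in out.items()})
```

Because the low-level function never touches names or an index, it can be inspected and tested with nothing more than a list.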

This is a meta issue that tracks the candidate APIs awaiting change and the progress of the refactor.

The following list was compiled by searching for `$\n?.*table_view.*(\n|$)` in the C++ code base and finding the corresponding wrappers in Cython.
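A search of this kind can be sketched with Python's `re` module. The pattern below is a simplified line-oriented variant, not the exact expression quoted in the issue, and the C++ declarations in `SOURCE` are abbreviated, illustrative text rather than verbatim libcudf headers.

```python
import re

# Illustrative, simplified declarations (not copied from libcudf headers).
SOURCE = """\
std::unique_ptr<table> gather(table_view const& source_table,
                              column_view const& gather_map);
std::unique_ptr<column> interleave_columns(table_view const& input);
size_type distinct_count(column_view const& input);
"""

# With re.MULTILINE, ^ and $ anchor at line boundaries, so each match is one
# whole line that mentions table_view.
PATTERN = re.compile(r"^.*table_view.*$", re.MULTILINE)

matches = PATTERN.findall(SOURCE)
for line in matches:
    print(line.strip())
```

Each matching declaration then has to be traced by hand to its Cython wrapper; the regex only narrows down the candidates.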

The second step after this is to remove `utils.table_view_from_table` completely.


This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268.

Related issue: #1268
Related PRs: #9154, #9166, #9492

Next steps:
- [ ] Fix symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/correlation between the same columns twice
- [ ] Consolidate both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov` bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))

Authors:
- Mayank Anand (https://github.com/mayankanand007)
- Michael Wang (https://github.com/isVoid)
- Sheilah Kirui (https://github.com/skirui-source)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889
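The symmetry item in the next steps can be illustrated with a small pure-Python sketch. It is not the cudf implementation; `sample_cov` and `symmetric_cov` are hypothetical helpers. The idea is to evaluate each unordered pair of columns once and mirror the value, since cov(x, y) equals cov(y, x).

```python
from itertools import combinations_with_replacement


def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def symmetric_cov(data):
    """Fill the full covariance table while evaluating each unordered pair once."""
    result = {}
    evaluations = 0
    for left, right in combinations_with_replacement(list(data), 2):
        value = sample_cov(data[left], data[right])
        evaluations += 1
        result[(left, right)] = value
        result[(right, left)] = value  # mirror instead of recomputing
    return result, evaluations


data = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
cov, evaluations = symmetric_cov(data)
print(evaluations, len(cov))  # 6 evaluations fill all 9 entries
```

For n columns this needs n(n+1)/2 evaluations instead of n².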

This PR rewrites the `sample` API. Functionally, the API now accepts either a cupy random state or a numpy random state. If a host (numpy) random state is provided, the sampled rows should match the pandas result given the same initial state and operation sequence. If a device (cupy) random state is provided, we should expect higher performance when the sample count is large. Structurally, this PR refactors the existing code into two sub-methods, one for each axis to sample from. The sub-methods are type annotated. Sampling from `cudf.Index`/`cudf.MultiIndex` is deprecated.

This PR is breaking because:
1. Users who previously called the `sample` API now get different rows.
2. To align with the pandas API, `keep_index` is renamed to `ignore_index` and its semantics are negated.

The current implementation does not depend on `libcudf.copying.sample`, so the Cython bindings are removed.

Performance: at 10K rows, this PR is 39% slower than the current implementation, amounting to 0.3 ms. At 100M rows, this PR is 7% slower using a cupy random state.
<details>
<summary>Benchmark Axis=0</summary>

```
------------------------------------------------------------------------------------------ benchmark 'axis=0': 6 tests ------------------------------------------------------------------------------------------
Name (time in ms)                                      Min                   Max                   Mean                  StdDev                Median                IQR                   Outliers  Rounds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisIndex-CupyRandomState] (afte)   296.7751 (455.90)     299.2855 (401.57)     297.9519 (448.88)     1.1162 (94.15)        297.7824 (451.66)     2.0472 (192.32)       2;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (afte)  4,435.3055 (>1000.0)  4,717.0815 (>1000.0)  4,507.1635 (>1000.0)  119.8772 (>1000.0)    4,452.5009 (>1000.0)  115.2876 (>1000.0)    1;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (befo)  276.1754 (424.26)     276.4792 (370.97)     276.2995 (416.26)     0.1258 (10.61)        276.3024 (419.08)     0.2010 (18.88)        1;0       5
sample_df[size10K-AxisIndex-CupyRandomState] (afte)    1.0789 (1.66)         1.2420 (1.67)         1.1238 (1.69)         0.0683 (5.76)         1.0962 (1.66)         0.0721 (6.77)         1;0       5
sample_df[size10K-AxisIndex-NumpyRandomState] (afte)   0.9018 (1.39)         1.1441 (1.54)         0.9140 (1.38)         0.0182 (1.54)         0.9094 (1.38)         0.0106 (1.0)          11;11     346
sample_df[size10K-AxisIndex-NumpyRandomState] (befo)   0.6510 (1.0)          0.7453 (1.0)          0.6638 (1.0)          0.0119 (1.0)          0.6593 (1.0)          0.0108 (1.01)         76;44     638
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

On `axis=1` sample, this PR is faster than current if provided a numpy random state for `random_state` parameter, while slower if provided a seed instead.
<details>
<summary> Benchmark axis=1 </summary>

```
----------------------------------------------------------------------------- benchmark 'axis=1': 6 tests -----------------------------------------------------------------------------
Name (time in us)                                       Min               Max               Mean              StdDev            Median            IQR               Outliers  Rounds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisColumn-NumpyRandomState] (afte)  173.2660 (1.0)    290.5080 (1.14)   178.2199 (1.0)    8.0913 (1.58)     175.7130 (1.0)    2.0767 (1.73)     227;419   2707
sample_df[size100M-AxisColumn-Seed] (afte)              441.9110 (2.55)   617.1150 (2.42)   452.4197 (2.54)   14.1272 (2.76)    447.1345 (2.54)   7.9060 (6.59)     151;162   1484
sample_df[size100M-AxisColumn-Seed] (befo)              297.1560 (1.72)   477.1500 (1.87)   307.8915 (1.73)   17.2036 (3.36)    300.5620 (1.71)   9.4080 (7.85)     159;168   1695
sample_df[size10K-AxisColumn-NumpyRandomState] (afte)   176.6440 (1.02)   254.9110 (1.0)    180.0217 (1.01)   5.1152 (1.0)      178.8940 (1.02)   1.1990 (1.0)      226;405   3542
sample_df[size10K-AxisColumn-Seed] (afte)               451.6370 (2.61)   689.8120 (2.71)   465.9937 (2.61)   14.3921 (2.81)    463.0710 (2.64)   6.7365 (5.62)     62;91     1183
sample_df[size10K-AxisColumn-Seed] (befo)               309.4000 (1.79)   413.9080 (1.62)   316.5210 (1.78)   7.6379 (1.49)     315.2130 (1.79)   5.4100 (4.51)     66;42     826
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

Part of #10153

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10262
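The `keep_index` to `ignore_index` rename in the `sample` PR negates the flag, which is easy to get wrong when updating call sites. The sketch below shows the mapping only. `resolve_ignore_index` is a hypothetical helper for user code; the PR does not describe any such shim, and the defaults assumed here (`keep_index=True` before, `ignore_index=False` after) are assumptions based on the pandas convention.

```python
import warnings


def resolve_ignore_index(ignore_index=None, keep_index=None):
    """Hypothetical helper: translate the old `keep_index` flag to `ignore_index`."""
    if keep_index is not None:
        if ignore_index is not None:
            raise TypeError("pass only one of `keep_index` and `ignore_index`")
        warnings.warn(
            "`keep_index` was renamed to `ignore_index` with negated meaning",
            FutureWarning,
            stacklevel=2,
        )
        return not keep_index  # the semantics are negated
    # Assumed default: keep the original index, i.e. ignore_index=False.
    return False if ignore_index is None else ignore_index


with warnings.catch_warnings(record=True):
    warnings.simplefilter("always")
    old_true = resolve_ignore_index(keep_index=True)
    old_false = resolve_ignore_index(keep_index=False)

print(old_true, old_false)  # False True
```

A call that used `keep_index=False` before therefore becomes `ignore_index=True` after the change.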

Part of #10153

Aside from the two harder cases, `boolean_mask_scatter` and `sample`, which are addressed in #10202 and #10262, this PR tackles the rest of the refactors in `copying.pyx`. Together with those two PRs, it should cover all interface refactors in `copying.pyx`.

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10359

Part of #10153

This PR refactors the `filling.repeat` Cython API to accept a list of columns instead of a Frame object. In this PR I'm also trying out a possibly better pattern for sharing logic between index and indexed_frame, which might become a solution for #10068.

Authors:
- Michael Wang (https://github.com/isVoid)
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10371

Part of #10153

This PR changes the APIs in `groupby.pyx` to accept a list of columns as input, not a Frame. This change affects both keys and values. The `Groupby` object now stores only a list of columns in the `keys` attribute, and the other APIs (`groups`, `aggregate`, `shift`, `replace_nulls`) now accept only a list of columns as their value columns. The `aggregation` communication protocol has changed from a dictionary mapping column names to lists of agg names into a list of lists of agg names. See the changes in `_normalize_aggs` for details.

This PR also tries to simplify the post-processing of the `result` frame in the `agg` method, now that we have finer control in pure Python. I attempted to rewrite `aggregate_internal` and `scan_internal`, but the attempt was futile: the unified aggregation object is a cdef type, which precludes moving the aggregation filtering step out of its current place. I also tried unifying aggregation and scan with a Cython fused type, but this did not work because of limitations in using fused types with C++ templated types in Cython.

Overall, the performance of the `agg` call is on par with the main branch, with a performance difference of -3% to 13% depending on the aggregation type.
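The protocol change described above, from a name-keyed dictionary to a positional list of lists, can be sketched as follows. `to_positional_aggs` is a hypothetical helper written for illustration; it is not the actual `_normalize_aggs`, whose details are in the PR.

```python
from typing import Dict, List, Sequence


def to_positional_aggs(
    aggs_by_name: Dict[str, List[str]], column_names: Sequence[str]
) -> List[List[str]]:
    """Convert {column name: [agg names]} into one list of agg names per column.

    The i-th inner list belongs to the i-th value column, so the lower layer
    never needs to know the column names.
    """
    return [list(aggs_by_name.get(name, [])) for name in column_names]


value_names = ["x", "y", "z"]
old_style = {"x": ["sum", "max"], "y": ["mean"]}
new_style = to_positional_aggs(old_style, value_names)
print(new_style)  # [['sum', 'max'], ['mean'], []]
```

The Python layer keeps `value_names` and uses it to relabel the result columns after the call returns.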
<details> <summary>Raw Benchmark</summary> ``` ========================================================================== 36 passed in 33.48s ========================================================================== (rapids) rapids@compose:~/scratch/cudf_benchmarks$ ./compare.sh bench_groupby.py --------------------------------------------------------------- benchmark 'False-False-agg1-100': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-False-agg1-100] (afte) 2.5090 (1.0) 2.8418 (1.0) 2.5280 (1.0) 0.0290 (2.40) 2.5229 (1.0) 0.0103 (1.05) 15;19 273 groupby_agg[False-False-agg1-100] (befo) 2.7681 (1.10) 2.8441 (1.00) 2.7877 (1.10) 0.0121 (1.0) 2.7849 (1.10) 0.0098 (1.0) 60;26 252 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-False-agg1-10000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-False-agg1-10000] (afte) 2.7803 (1.0) 3.4156 (1.05) 2.8131 (1.0) 0.0548 (1.57) 2.8007 (1.0) 0.0253 (1.0) 10;12 252 groupby_agg[False-False-agg1-10000] (befo) 3.0402 (1.09) 3.2407 (1.0) 3.1571 (1.12) 0.0348 (1.0) 3.1535 (1.13) 0.0509 (2.01) 39;6 236 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
----------------------------------------------------------------- benchmark 'False-False-agg1-1000000': 2 tests ----------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-False-agg1-1000000] (afte) 13.2601 (1.0) 14.0128 (1.01) 13.4242 (1.0) 0.1056 (1.28) 13.4004 (1.0) 0.0286 (1.0) 5;8 68 groupby_agg[False-False-agg1-1000000] (befo) 13.5150 (1.02) 13.9165 (1.0) 13.6015 (1.01) 0.0826 (1.0) 13.5944 (1.01) 0.0696 (2.43) 8;5 66 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-False-agg2-100': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-False-agg2-100] (afte) 2.5342 (1.0) 2.8621 (1.0) 2.5591 (1.0) 0.0431 (3.18) 2.5509 (1.0) 0.0106 (1.01) 13;18 273 groupby_agg[False-False-agg2-100] (befo) 2.8797 (1.14) 2.9507 (1.03) 2.8997 (1.13) 0.0136 (1.0) 2.8965 (1.14) 0.0105 (1.0) 52;28 227 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-False-agg2-10000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-False-agg2-10000] (afte) 2.7922 (1.0) 3.2884 (1.0) 2.8205 (1.0) 0.0473 (1.40) 2.8118 (1.0) 0.0096 (1.0) 10;18 251 groupby_agg[False-False-agg2-10000] (befo) 3.1491 (1.13) 3.4791 (1.06) 3.1752 (1.13) 0.0338 (1.0) 3.1687 (1.13) 0.0108 (1.12) 6;17 172 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------- benchmark 'False-False-agg2-1000000': 2 tests ----------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-False-agg2-1000000] (afte) 13.4699 (1.0) 14.6287 (1.0) 13.6020 (1.0) 0.1359 (1.0) 13.5769 (1.0) 0.0270 (1.0) 3;8 69 groupby_agg[False-False-agg2-1000000] (befo) 13.6079 (1.01) 29.8318 (2.04) 14.0777 (1.03) 1.9806 (14.57) 13.7795 (1.01) 0.0567 (2.10) 2;6 68 --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-False-sum-100': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ groupby_agg[False-False-sum-100] (afte) 2.1667 (1.0) 2.2855 (1.0) 2.1831 (1.0) 0.0146 (1.49) 2.1802 (1.0) 
0.0111 (1.14) 25;16 301 groupby_agg[False-False-sum-100] (befo) 2.4142 (1.11) 2.4782 (1.08) 2.4319 (1.11) 0.0098 (1.0) 2.4309 (1.11) 0.0097 (1.0) 65;15 278 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ --------------------------------------------------------------- benchmark 'False-False-sum-10000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-False-sum-10000] (afte) 2.4293 (1.0) 2.6593 (1.0) 2.4493 (1.0) 0.0206 (1.66) 2.4455 (1.0) 0.0115 (1.10) 17;19 278 groupby_agg[False-False-sum-10000] (befo) 2.6646 (1.10) 2.7706 (1.04) 2.6832 (1.10) 0.0124 (1.0) 2.6811 (1.10) 0.0105 (1.0) 49;14 257 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------- benchmark 'False-False-sum-1000000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-False-sum-1000000] (afte) 9.3678 (1.0) 21.0480 (2.07) 9.6817 (1.0) 1.2252 (16.49) 9.5286 (1.0) 0.0342 (1.28) 1;9 89 groupby_agg[False-False-sum-1000000] (befo) 9.6830 (1.03) 10.1832 (1.0) 9.7434 (1.01) 0.0743 (1.0) 9.7238 (1.02) 0.0266 (1.0) 6;6 86 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-True-agg1-100': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ groupby_agg[False-True-agg1-100] (afte) 2.4392 (1.0) 2.7474 (1.06) 2.4598 (1.0) 0.0287 (2.07) 2.4545 (1.0) 0.0103 (1.0) 10;17 278 groupby_agg[False-True-agg1-100] (befo) 2.5183 (1.03) 2.6017 (1.0) 2.5354 (1.03) 0.0139 (1.0) 2.5332 (1.03) 0.0134 (1.30) 51;18 268 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ --------------------------------------------------------------- benchmark 'False-True-agg1-10000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-True-agg1-10000] (afte) 2.7196 (1.0) 3.2290 (1.06) 2.7446 (1.0) 0.0462 (2.17) 2.7359 (1.0) 0.0106 (1.00) 11;17 257 groupby_agg[False-True-agg1-10000] (befo) 2.7807 (1.02) 3.0590 (1.0) 2.8039 (1.02) 0.0213 (1.0) 2.8004 (1.02) 0.0106 (1.0) 16;18 251 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------- benchmark 'False-True-agg1-1000000': 2 tests 
----------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-True-agg1-1000000] (afte) 13.2259 (1.01) 13.7344 (1.0) 13.3449 (1.00) 0.0797 (1.0) 13.3288 (1.00) 0.0322 (1.41) 5;8 69 groupby_agg[False-True-agg1-1000000] (befo) 13.0875 (1.0) 14.1552 (1.03) 13.3135 (1.0) 0.1325 (1.66) 13.2901 (1.0) 0.0229 (1.0) 4;7 68 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-True-agg2-100': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ groupby_agg[False-True-agg2-100] (afte) 2.4580 (1.0) 2.5791 (1.0) 2.4792 (1.0) 0.0174 (1.92) 2.4756 (1.0) 0.0121 (1.37) 21;14 277 groupby_agg[False-True-agg2-100] (befo) 2.6094 (1.06) 2.6686 (1.03) 2.6260 (1.06) 0.0091 (1.0) 2.6255 (1.06) 0.0088 (1.0) 66;21 264 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ --------------------------------------------------------------- benchmark 'False-True-agg2-10000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
groupby_agg[False-True-agg2-10000] (afte) 2.7218 (1.0) 2.8843 (1.0) 2.7415 (1.0) 0.0180 (1.0) 2.7383 (1.0) 0.0116 (1.12) 21;16 257 groupby_agg[False-True-agg2-10000] (befo) 2.8771 (1.06) 3.1227 (1.08) 2.8956 (1.06) 0.0185 (1.03) 2.8922 (1.06) 0.0104 (1.0) 16;16 244 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------- benchmark 'False-True-agg2-1000000': 2 tests ----------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-True-agg2-1000000] (afte) 13.4555 (1.01) 13.7924 (1.0) 13.5244 (1.00) 0.0601 (1.0) 13.5099 (1.00) 0.0362 (1.0) 7;6 70 groupby_agg[False-True-agg2-1000000] (befo) 13.3841 (1.0) 13.9437 (1.01) 13.4948 (1.0) 0.0773 (1.29) 13.4768 (1.0) 0.0443 (1.22) 5;5 68 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-True-sum-100': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-True-sum-100] (afte) 2.1270 (1.0) 2.2397 (1.0) 2.1435 (1.0) 0.0158 (1.01) 2.1407 (1.0) 0.0105 (1.0) 27;22 302 groupby_agg[False-True-sum-100] (befo) 2.1881 (1.03) 2.3309 (1.04) 2.2048 (1.03) 0.0156 (1.0) 2.2014 (1.03) 0.0111 (1.06) 35;30 297 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-True-sum-10000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-True-sum-10000] (afte) 2.4018 (1.0) 2.6107 (1.0) 2.4183 (1.0) 0.0198 (1.16) 2.4149 (1.0) 0.0108 (1.12) 14;14 277 groupby_agg[False-True-sum-10000] (befo) 2.4406 (1.02) 2.6840 (1.03) 2.4606 (1.02) 0.0170 (1.0) 2.4585 (1.02) 0.0097 (1.0) 15;14 274 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'False-True-sum-1000000': 2 tests ---------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[False-True-sum-1000000] (afte) 9.4459 (1.01) 10.0397 (1.0) 9.4983 (1.0) 0.0706 (1.0) 9.4846 (1.0) 0.0216 (1.0) 4;6 89 groupby_agg[False-True-sum-1000000] (befo) 9.3064 (1.0) 10.2732 (1.02) 9.5150 (1.00) 0.1107 (1.57) 9.4933 (1.00) 0.0239 (1.10) 6;10 88 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------- benchmark 'True-False-agg1-100': 2 tests 
--------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[True-False-agg1-100] (afte) 4.3327 (1.0) 4.4800 (1.0) 4.3504 (1.0) 0.0202 (1.0) 4.3457 (1.0) 0.0103 (1.0) 10;16 181 groupby_agg[True-False-agg1-100] (befo) 4.6486 (1.07) 12.4651 (2.78) 4.8006 (1.10) 0.7100 (35.18) 4.6664 (1.07) 0.0191 (1.86) 10;19 170 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'True-False-agg1-10000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[True-False-agg1-10000] (afte) 4.9246 (1.0) 5.1165 (1.0) 4.9491 (1.0) 0.0269 (1.0) 4.9407 (1.0) 0.0133 (1.06) 16;19 164 groupby_agg[True-False-agg1-10000] (befo) 5.2464 (1.07) 5.6002 (1.09) 5.2700 (1.06) 0.0370 (1.38) 5.2623 (1.07) 0.0126 (1.0) 10;17 154 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------- benchmark 'True-False-agg1-1000000': 2 tests ----------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
groupby_agg[True-False-agg1-1000000] (afte) 36.5089 (1.00) 37.2874 (1.0) 36.8305 (1.0) 0.2321 (1.0) 36.7404 (1.0) 0.2208 (1.0) 7;5 28 groupby_agg[True-False-agg1-1000000] (befo) 36.3558 (1.0) 47.0329 (1.26) 37.7670 (1.03) 2.7313 (11.77) 36.8183 (1.00) 0.8527 (3.86) 2;3 26 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- --------------------------------------------------------------- benchmark 'True-False-agg2-100': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ groupby_agg[True-False-agg2-100] (afte) 4.6287 (1.0) 5.2921 (1.02) 4.6918 (1.0) 0.1017 (4.64) 4.6526 (1.0) 0.0496 (3.27) 21;23 167 groupby_agg[True-False-agg2-100] (befo) 4.9776 (1.08) 5.1737 (1.0) 5.0060 (1.07) 0.0219 (1.0) 4.9995 (1.07) 0.0152 (1.0) 18;10 161 ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ --------------------------------------------------------------- benchmark 'True-False-agg2-10000': 2 tests --------------------------------------------------------------- Name (time in ms) Min Max Mean StdDev Median IQR Outliers Rounds -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- groupby_agg[True-False-agg2-10000] (afte) 5.2022 (1.0) 6.7622 (1.16) 5.2405 (1.0) 0.1267 (2.98) 5.2219 (1.0) 0.0157 (1.0) 2;16 155 groupby_agg[True-False-agg2-10000] (befo) 5.5802 (1.07) 5.8531 (1.0) 5.6166 (1.07) 0.0424 (1.0) 5.6041 (1.07) 0.0206 (1.31) 11;14 147 
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-False-agg2-1000000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-agg2-1000000] (afte)  37.9639 (1.0)   38.7598 (1.0)   38.2381 (1.0)   0.1221 (1.0)    38.2346 (1.00)  0.0583 (1.0)    2;2       27
groupby_agg[True-False-agg2-1000000] (befo)  38.0569 (1.00)  41.5735 (1.07)  38.7983 (1.01)  1.1968 (9.80)   38.1696 (1.0)   0.6344 (10.88)  5;5       26
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-False-sum-100': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-sum-100] (afte)       3.6893 (1.0)    4.2792 (1.03)   3.7130 (1.0)    0.0580 (4.15)   3.7022 (1.0)    0.0079 (1.0)    10;16     206
groupby_agg[True-False-sum-100] (befo)       4.0016 (1.08)   4.1370 (1.0)    4.0218 (1.08)   0.0140 (1.0)    4.0180 (1.09)   0.0097 (1.23)   27;17     188
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-False-sum-10000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-sum-10000] (afte)     4.2660 (1.0)    4.6651 (1.0)    4.2913 (1.0)    0.0493 (2.97)   4.2799 (1.0)    0.0097 (1.0)    10;21     185
groupby_agg[True-False-sum-10000] (befo)     4.5702 (1.07)   4.7321 (1.01)   4.5904 (1.07)   0.0166 (1.0)    4.5858 (1.07)   0.0134 (1.37)   24;8      172
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-False-sum-1000000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-sum-1000000] (afte)   30.5871 (1.00)  30.9527 (1.0)   30.6797 (1.00)  0.0628 (1.0)    30.6720 (1.00)  0.0421 (1.0)    4;3       32
groupby_agg[True-False-sum-1000000] (befo)   30.5386 (1.0)   31.8930 (1.03)  30.6654 (1.0)   0.2383 (3.80)   30.6013 (1.0)   0.0573 (1.36)   1;4       31
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-agg1-100': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg1-100] (afte)       4.2812 (1.0)    4.5815 (1.0)    4.3304 (1.0)    0.0495 (1.43)   4.3134 (1.0)    0.0647 (4.80)   22;4      173
groupby_agg[True-True-agg1-100] (befo)       4.4126 (1.03)   4.7356 (1.03)   4.4357 (1.02)   0.0348 (1.0)    4.4253 (1.03)   0.0135 (1.0)    14;18     158
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-agg1-10000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg1-10000] (afte)     4.8505 (1.0)    5.3411 (1.0)    4.8882 (1.0)    0.0596 (1.49)   4.8693 (1.0)    0.0240 (1.41)   12;15     166
groupby_agg[True-True-agg1-10000] (befo)     4.9857 (1.03)   5.3869 (1.01)   5.0191 (1.03)   0.0399 (1.0)    5.0089 (1.03)   0.0170 (1.0)    9;15      160
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-agg1-1000000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg1-1000000] (afte)   36.5387 (1.01)  55.8017 (1.52)  37.3622 (1.03)  3.6965 (48.22)  36.5756 (1.00)  0.0882 (2.75)   1;3       27
groupby_agg[True-True-agg1-1000000] (befo)   36.3456 (1.0)   36.7584 (1.0)   36.4209 (1.0)   0.0767 (1.0)    36.4014 (1.0)   0.0320 (1.0)    1;4       27
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-agg2-100': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg2-100] (afte)       4.5713 (1.0)    5.1548 (1.06)   4.6064 (1.0)    0.0621 (4.49)   4.5886 (1.0)    0.0203 (1.51)   13;22     170
groupby_agg[True-True-agg2-100] (befo)       4.7628 (1.04)   4.8752 (1.0)    4.7832 (1.04)   0.0138 (1.0)    4.7795 (1.04)   0.0134 (1.0)    29;9      167
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-agg2-10000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg2-10000] (afte)     5.1343 (1.0)    5.4159 (1.0)    5.1769 (1.0)    0.0517 (1.36)   5.1590 (1.0)    0.0179 (1.21)   16;22     157
groupby_agg[True-True-agg2-10000] (befo)     5.3567 (1.04)   5.6432 (1.04)   5.3858 (1.04)   0.0379 (1.0)    5.3785 (1.04)   0.0147 (1.0)    7;12      152
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-agg2-1000000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg2-1000000] (afte)   38.0357 (1.00)  38.2935 (1.00)  38.1159 (1.00)  0.0597 (1.0)    38.1014 (1.00)  0.0846 (1.0)    6;1       27
groupby_agg[True-True-agg2-1000000] (befo)   37.9134 (1.0)   38.2851 (1.0)   38.0201 (1.0)   0.0929 (1.55)   37.9944 (1.0)   0.1066 (1.26)   7;1       26
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-sum-100': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-sum-100] (afte)        3.7452 (1.0)    4.0287 (1.0)    3.8009 (1.0)    0.0408 (1.0)    3.7968 (1.0)    0.0503 (1.0)    29;3      131
groupby_agg[True-True-sum-100] (befo)        3.8752 (1.03)   4.4384 (1.10)   3.9316 (1.03)   0.0608 (1.49)   3.9265 (1.03)   0.0504 (1.00)   4;3       148
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-sum-10000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-sum-10000] (afte)      4.4442 (1.0)    11.3511 (2.35)  4.5582 (1.0)    0.5829 (24.78)  4.4741 (1.0)    0.0323 (2.85)   3;19      171
groupby_agg[True-True-sum-10000] (befo)      4.5676 (1.03)   4.8264 (1.0)    4.5913 (1.01)   0.0235 (1.0)    4.5871 (1.03)   0.0114 (1.0)    15;16     168
-------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------- benchmark 'True-True-sum-1000000': 2 tests ----------------------------------------
Name (time in ms)                            Min             Max             Mean            StdDev          Median          IQR             Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-sum-1000000] (afte)    30.5326 (1.00)  33.6395 (1.02)  31.2355 (1.0)   0.9563 (1.20)   30.6933 (1.0)   0.9663 (1.0)    5;3       30
groupby_agg[True-True-sum-1000000] (befo)    30.4080 (1.0)   33.0341 (1.0)   31.2527 (1.00)  0.7946 (1.0)    30.9808 (1.01)  1.2781 (1.32)   11;0      30
-------------------------------------------------------------------------------------------------------------------------------------------------------------
```
</details>

[Benchmark code](https://github.com/isVoid/cudf_benchmarks/blob/9d9644eaa5301df7894c2fe4b1ba317396240518/bench_groupby.py#L23-L42)

Authors:
- Michael Wang (https://github.com/isVoid)
- Bradley Dice (https://github.com/bdice)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Bradley Dice (https://github.com/bdice)

URL: #10419
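For readers unfamiliar with the benchmark tables above: the value in parentheses after each timing is its ratio to the best (smallest) value in that column. A minimal sketch of that arithmetic, using the mean times of the `True-False-sum-100` benchmark; the helper name `relative_to_best` is made up for illustration and is not part of pytest-benchmark:

```python
def relative_to_best(values):
    """Return each value divided by the smallest value, rounded to 2 places."""
    best = min(values)
    return [round(v / best, 2) for v in values]


# Mean times in ms for 'True-False-sum-100', copied from the table:
# the "after" run first, then the "before" run.
mean_after, mean_before = 3.7130, 4.0218

ratios = relative_to_best([mean_after, mean_before])
print(ratios)  # [1.0, 1.08], matching the (1.0) and (1.08) shown in the table
```

Read this way, the refactored code is about 8% faster on the mean for this small case, while the 1,000,000-row cases differ by 1% or less.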

This PR covers much of the low-hanging fruit for #10153. All APIs that accepted Frames now accept a list of columns in the following files:

- hash.pyx
- interop.pyx
- join.pyx
- partitioning.pyx
- quantiles.pyx
- reshape.pyx
- search.pyx
- transform.pyx
- lists.pyx
- string/combine.pyx

This PR covers point 5 in the follow-ups to #9889. Also, in `join.pyx`, the GIL was not released when dispatching work to libcudf; this PR fixes that.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10463
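The shape of this refactor can be illustrated with a toy, pure-Python sketch. Everything here is hypothetical: `Frame`, `interleave_columns_sketch`, and `frame_interleave` are stand-ins invented for illustration, plain Python lists play the role of device columns, and none of this is the actual cuDF or libcudf code. The point is only that the low-level routine receives a bare list of columns, while column names stay with the caller.

```python
from dataclasses import dataclass
from typing import List, Sequence

Column = List[int]  # hypothetical stand-in for a device column


@dataclass
class Frame:
    """Toy stand-in for a named collection of columns."""
    names: List[str]
    columns: List[Column]


def interleave_columns_sketch(columns: Sequence[Column]) -> Column:
    """Low-level routine: sees only a list of equal-length columns, no names."""
    if not columns:
        return []
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    # Row-wise interleave: row 0 of every column, then row 1, and so on.
    return [col[i] for i in range(n) for col in columns]


def frame_interleave(frame: Frame) -> Column:
    """Caller in the Python layer: unpacks the columns and keeps the names."""
    return interleave_columns_sketch(frame.columns)


frame = Frame(names=["a", "b"], columns=[[1, 2, 3], [4, 5, 6]])
result = frame_interleave(frame)
print(result)  # [1, 4, 2, 5, 3, 6]
```

Because the low-level function never touches `names`, it can be tested and debugged with plain lists, which is the simplification the issue argues for.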

This PR contributes to #10153 by refactoring all Cython APIs in `transpose.pyx` and `sort.pyx` to accept a list of columns as input. It also includes several minor improvements to the code base; see the comments below for details.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Ashwin Srinath (https://github.com/shwina)

URL: #10675

This PR refactors `merge_sorted` in `merge.pyx` to accept a list of columns, contributing to #10153.

Authors:
- Michael Wang (https://github.com/isVoid)

Approvers:
- Ashwin Srinath (https://github.com/shwina)

URL: #10698

GregoryKimball · 2022-06-28T04:05:03Z

@isVoid Should we close this, make a new issue for the IO files, and label it "good first issue"?

isVoid · 2022-07-15T01:27:13Z

I would still like to keep tracking this here, especially since the I/O file refactors are the hardest of the group. I don't think moving them to a separate issue will help lower the barrier.

vyasr · 2024-05-14T00:06:18Z

I'm going to close this issue. With pylibcudf coming we are going to have to rethink the internals of cudf further. Ultimately the list of columns approach we took here is certainly going to be closer to the final end state, but it is no longer directly applicable to future design.

isVoid added feature request New feature or request Needs Triage Need team to review and classify labels Jan 28, 2022

isVoid self-assigned this Jan 28, 2022

isVoid added the improvement Improvement / enhancement to an existing function label Feb 3, 2022

vyasr mentioned this issue Feb 17, 2022

Add covariance for sort groupby (python) #9889

Merged

5 tasks

This was referenced Feb 25, 2022

Refactor cython interface: copying.pyx #10359

Merged

Rewrites column.__setitem__, Use boolean_mask_scatter #10202

Merged

Rewrites sample API #10262

Merged

Refactor filling.repeat API #10371

Merged

isVoid mentioned this issue Mar 11, 2022

Use list of columns for methods in Groupby.pyx #10419

Merged

isVoid mentioned this issue Mar 19, 2022

Use Lists of Columns for Various Files #10463

Merged

isVoid changed the title ~~[FEA] Use List of Columns instead of Frames~~ [FEA] Use List of Columns instead of Frames in Cython API Apr 15, 2022

isVoid mentioned this issue Apr 16, 2022

Cython API Refactor: transpose.pyx, sort.pyx #10675

Merged

isVoid mentioned this issue Apr 20, 2022

Cython API refactor: merge.pyx #10698

Merged

GregoryKimball added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jun 28, 2022

vyasr closed this as completed May 14, 2024
