[FEA] Use List of Columns instead of Frames in Cython API #10153

Closed
45 of 51 tasks
isVoid opened this issue Jan 28, 2022 · 3 comments
Labels: feature request, improvement, Python

isVoid commented Jan 28, 2022

Following the removal of the Table class (#9315) and a pilot study that changed several Cython APIs to use lists of columns (#9558), we want to extend that effort and replace the use of Frame with lists of columns throughout the Cython API. The main arguments for this change are:

  1. Libcudf's table_view is essentially a thin wrapper around vector<column_view>, which matches the list-of-columns concept.
  2. In the vision of the data consortium API, dataframes do not have column names.
  3. From a developer's perspective, we want to rely on simpler data structures in the Cython layer for easier debugging.
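The layering these arguments point to can be sketched in pure Python. This is only an illustration: `Column`, `gather`, and the `Frame` class below are hypothetical stand-ins (plain lists instead of device columns), not cuDF's actual classes. The low-level routine sees only a list of columns, and names are re-attached in the Python layer.

```python
from typing import Dict, List

# Hypothetical stand-in for a device column: a plain Python list.
Column = List[int]


def gather(columns: List[Column], indices: List[int]) -> List[Column]:
    """Name-agnostic low-level routine: list of columns in, list of columns out."""
    return [[col[i] for i in indices] for col in columns]


class Frame:
    """Hypothetical Python-layer container that owns the column names."""

    def __init__(self, data: Dict[str, Column]):
        self._data = dict(data)

    def take(self, indices: List[int]) -> "Frame":
        # Strip the names, call the low-level routine, then re-attach the names.
        result_columns = gather(list(self._data.values()), indices)
        return Frame(dict(zip(self._data.keys(), result_columns)))


frame = Frame({"a": [10, 20, 30], "b": [1, 2, 3]})
print(frame.take([2, 0])._data)
```

With this split, the low-level function can be exercised and debugged with nothing more than nested lists.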

This is a meta issue that tracks the candidate APIs awaiting this change and the progress of the refactor.

The following list was compiled by searching for \(\n?.*table_view.*(\n|\)) in the C++ code base and finding the corresponding wrappers in Cython.
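As a rough illustration of how such a search could be scripted, the sketch below applies the pattern with Python's `re` module. The leading `(\w+)\s*` capture group is an addition made here to report function names, and `SAMPLE_HEADER` is a simplified, made-up excerpt rather than text copied from the libcudf headers.

```python
import re
from typing import List

# Pattern from the issue, prefixed with a capture group (an assumption added
# here) so the name of the function that takes a table_view can be reported.
PATTERN = re.compile(r"(\w+)\s*\(\n?.*table_view.*(?:\n|\))")


def find_table_view_functions(cpp_source: str) -> List[str]:
    """Return the sorted names of functions whose parameter list mentions table_view."""
    return sorted({match.group(1) for match in PATTERN.finditer(cpp_source)})


# Simplified, illustrative declarations; hypothetical_unary does not exist in libcudf.
SAMPLE_HEADER = """
std::unique_ptr<table> repeat(
  table_view const& input_table,
  size_type count);

std::unique_ptr<column> hypothetical_unary(column_view const& input);

bool is_sorted(cudf::table_view const& table,
               std::vector<order> const& column_order);
"""

print(find_table_view_functions(SAMPLE_HEADER))
```

The pattern only looks at the opening parenthesis and the line that follows it, so a `table_view` parameter that appears later in a long signature would be missed; the results still need a manual pass.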

  • copying.pyx:
  • filling.pyx:
    • repeat
  • groupby.pyx:
    • groupby.__cinit__
    • groupby.groups
    • groupby.shift
    • groupby.replace_nulls
  • hash.pyx:
    • hash
  • interop.pyx:
    • to_dlpack
  • join.pyx:
    • join/semi_join: lhs, rhs
  • merge.pyx:
    • merge_sorted
  • partitioning.pyx:
    • partition
  • quantile.pyx:
    • quantiles
  • reshape.pyx:
    • interleave_columns
    • tile
  • search.pyx:
    • search_sorted
  • sort.pyx:
    • is_sorted
    • order_by
    • digitize
    • rank_columns
  • transform.pyx:
    • table_encode
  • transpose.pyx:
    • transpose
  • csv.pyx:
    • write_csv: note index argument
  • orc.pyx:
    • write_orc
  • parquet.pyx:
    • write_parquet
  • lists.pyx:
    • explode_outer
    • concatenate_rows
  • string/combine.pyx:
    • concatenate

The second step after this is to remove utils.table_view_from_table completely.

@isVoid isVoid added feature request New feature or request Needs Triage Need team to review and classify labels Jan 28, 2022
@isVoid isVoid self-assigned this Jan 28, 2022
@isVoid isVoid added the improvement Improvement / enhancement to an existing function label Feb 3, 2022
rapids-bot bot pushed a commit that referenced this issue Feb 17, 2022
This PR adds the functionality to perform `.cov()` on a `GroupBy` object and completes #1268

Related issue: #1268
Related PRs: #9154, #9166, #9492 

Next steps:

- [ ] Fix symmetry problem [PR 10098](#10098 (comment)): avoid computing the covariance/correlation between the same columns twice
- [ ] Consolidate both `cov()` and `corr()`
- [ ] Fix #10303
- [ ] Add `cov` bindings in `aggregation.pyx` (separate PR): [comment](#9889 (comment))
- [ ] Simplify `combine_columns` after #10153 covers `interleave_columns`: [comment](#9889 (comment))
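
One possible way to address the symmetry item, sketched in pure Python rather than with cuDF: enumerate each unordered column pair once and mirror the value. The helper names `sample_cov` and `cov_matrix` are hypothetical and this is not the implementation used in the PR.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple


def sample_cov(x: List[float], y: List[float]) -> float:
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def cov_matrix(columns: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Compute each unordered pair once and fill both (a, b) and (b, a)."""
    result = {}
    for a, b in combinations_with_replacement(list(columns), 2):
        value = sample_cov(columns[a], columns[b])
        result[(a, b)] = value
        result[(b, a)] = value
    return result


print(cov_matrix({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]}))
```

For k columns this evaluates k(k+1)/2 pairs instead of k².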

Authors:
  - Mayank Anand (https://github.com/mayankanand007)
  - Michael Wang (https://github.com/isVoid)
  - Sheilah Kirui (https://github.com/skirui-source)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9889
rapids-bot bot pushed a commit that referenced this issue Mar 4, 2022
This PR rewrites the sample API. Functionally, the API now accepts a cupy random state or a numpy random state. If a host (numpy) random state is passed, the sampled rows should match the pandas result given the same initial state and operation sequence. On the other hand, given a device random state, we should expect higher performance when the sample count is large.

Syntactically, this PR refactors the existing code into two sub-methods, one for each axis to sample from. The sub-methods are type annotated.

Sampling from `cudf.Index/cudf.MultiIndex` is deprecated. 

This PR is breaking because:
1. Users who previously called the `sample` API now get different rows.
2. To align with the pandas API, `keep_index` is renamed to `ignore_index` and its semantics are negated.
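
The renamed parameter can be illustrated with a small standard-library sketch. `keep_index_to_ignore_index` and `sample_rows` are hypothetical helpers written for this note; they use Python's `random` module rather than numpy or cupy and are not cuDF's implementation.

```python
import random
from typing import List, Tuple


def keep_index_to_ignore_index(keep_index: bool) -> bool:
    """Translate the old argument to the new one: the meaning is negated."""
    return not keep_index


def sample_rows(
    index: List[str], values: List[int], n: int, seed: int, ignore_index: bool = False
) -> Tuple[List, List[int]]:
    """Sample n rows without replacement; optionally reset the index to 0..n-1."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(values)), n)
    sampled_values = [values[p] for p in positions]
    if ignore_index:
        return list(range(n)), sampled_values
    return [index[p] for p in positions], sampled_values


# A call that used keep_index=False before now passes ignore_index=True.
print(sample_rows(["x", "y", "z"], [1, 2, 3], n=2, seed=0,
                  ignore_index=keep_index_to_ignore_index(False)))
```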

The current implementation does not depend on `libcudf.copying.sample`, so the Cython bindings are removed.

Performance: at 10K rows, this PR is 39% slower than the current implementation, amounting to 0.3ms. At 100M rows, this PR is 7% slower when using a cupy random state.
<details>
<summary>Benchmark Axis=0</summary>

```
-------------------------------------------------------------------------------------- benchmark 'axis=0': 6 tests ---------------------------------------------------------------------------------------
Name (time in ms)                                                Min                   Max                  Mean              StdDev                Median                 IQR            Outliers  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisIndex-CupyRandomState] (afte)        296.7751 (455.90)     299.2855 (401.57)     297.9519 (448.88)     1.1162 (94.15)      297.7824 (451.66)     2.0472 (192.32)        2;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (afte)     4,435.3055 (>1000.0)  4,717.0815 (>1000.0)  4,507.1635 (>1000.0)  119.8772 (>1000.0)  4,452.5009 (>1000.0)  115.2876 (>1000.0)       1;0       5
sample_df[size100M-AxisIndex-NumpyRandomState] (befo)       276.1754 (424.26)     276.4792 (370.97)     276.2995 (416.26)     0.1258 (10.61)      276.3024 (419.08)     0.2010 (18.88)         1;0       5
sample_df[size10K-AxisIndex-CupyRandomState] (afte)           1.0789 (1.66)         1.2420 (1.67)         1.1238 (1.69)       0.0683 (5.76)         1.0962 (1.66)       0.0721 (6.77)          1;0       5
sample_df[size10K-AxisIndex-NumpyRandomState] (afte)          0.9018 (1.39)         1.1441 (1.54)         0.9140 (1.38)       0.0182 (1.54)         0.9094 (1.38)       0.0106 (1.0)         11;11     346
sample_df[size10K-AxisIndex-NumpyRandomState] (befo)          0.6510 (1.0)          0.7453 (1.0)          0.6638 (1.0)        0.0119 (1.0)          0.6593 (1.0)        0.0108 (1.01)        76;44     638
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
</details>
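
The quoted slowdown figures appear to correspond to the Min column of the benchmark above. Assuming that reading, the arithmetic can be reproduced as follows; the `slowdown` helper is introduced here only for illustration.

```python
def slowdown(before_ms: float, after_ms: float):
    """Return (absolute difference in ms, relative slowdown in percent)."""
    delta = after_ms - before_ms
    return delta, 100.0 * delta / before_ms


# Min column, 10K rows: NumpyRandomState before vs. NumpyRandomState after.
delta_10k, pct_10k = slowdown(0.6510, 0.9018)
# Min column, 100M rows: NumpyRandomState before vs. CupyRandomState after.
delta_100m, pct_100m = slowdown(276.1754, 296.7751)

print(f"10K rows:  +{delta_10k:.2f} ms ({pct_10k:.0f}% slower)")
print(f"100M rows: +{delta_100m:.2f} ms ({pct_100m:.0f}% slower)")
```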

For `axis=1` sampling, this PR is faster than the current implementation when a numpy random state is passed as the `random_state` parameter, and slower when a seed is passed instead.
<details>
<summary> Benchmark axis=1 </summary>

```
--------------------------------------------------------------------------------- benchmark 'axis=1': 6 tests ----------------------------------------------------------------------------------
Name (time in us)                                               Min                 Max                Mean             StdDev              Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
sample_df[size100M-AxisColumn-NumpyRandomState] (afte)     173.2660 (1.0)      290.5080 (1.14)     178.2199 (1.0)       8.0913 (1.58)     175.7130 (1.0)      2.0767 (1.73)      227;419    2707
sample_df[size100M-AxisColumn-Seed] (afte)                 441.9110 (2.55)     617.1150 (2.42)     452.4197 (2.54)     14.1272 (2.76)     447.1345 (2.54)     7.9060 (6.59)      151;162    1484
sample_df[size100M-AxisColumn-Seed] (befo)                 297.1560 (1.72)     477.1500 (1.87)     307.8915 (1.73)     17.2036 (3.36)     300.5620 (1.71)     9.4080 (7.85)      159;168    1695
sample_df[size10K-AxisColumn-NumpyRandomState] (afte)      176.6440 (1.02)     254.9110 (1.0)      180.0217 (1.01)      5.1152 (1.0)      178.8940 (1.02)     1.1990 (1.0)       226;405    3542
sample_df[size10K-AxisColumn-Seed] (afte)                  451.6370 (2.61)     689.8120 (2.71)     465.9937 (2.61)     14.3921 (2.81)     463.0710 (2.64)     6.7365 (5.62)        62;91    1183
sample_df[size10K-AxisColumn-Seed] (befo)                  309.4000 (1.79)     413.9080 (1.62)     316.5210 (1.78)      7.6379 (1.49)     315.2130 (1.79)     5.4100 (4.51)        66;42     826
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
</details>

Part of #10153

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10262
rapids-bot bot pushed a commit that referenced this issue Mar 8, 2022
Part of #10153 

Aside from the two harder cases, `boolean_mask_scatter` and `sample`, which have been addressed in #10202 and #10262, this PR tackles the rest of the refactors in `copying.pyx`. In combination with the other two, this PR should address all interface refactors in `copying.pyx`.

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10359
rapids-bot bot pushed a commit that referenced this issue Mar 18, 2022
Part of #10153 

This PR refactors the `filling.repeat` Cython API to accept a list of columns instead of a Frame object. In this PR I'm also trying out a possibly better pattern for index and indexed_frame to share logic, which might become a solution for #10068.

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10371
rapids-bot bot pushed a commit that referenced this issue Mar 18, 2022
Part of #10153 

This PR changes the APIs in `groupby.pyx` to accept a list of columns as input, not a Frame. This change affects both keys and values. The `Groupby` object now only stores a list of columns in the `keys` attribute, and the other APIs (`groups`, `aggregate`, `shift`, `replace_nulls`) now only accept a list of columns as their value columns. The `aggregation` communication protocol has changed from a dictionary mapping column names to lists of agg names to a list of lists of agg names. See the changes in `_normalize_aggs` for details.
This PR also tries to simplify the post-processing of the `result` frame in the `agg` method now that we have finer control in pure Python.
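
The protocol change can be pictured with a minimal sketch. `to_positional_aggs` is a hypothetical helper written for this note, not the actual `_normalize_aggs`; it only shows how a name-keyed mapping might become a positional list of lists aligned with the value columns.

```python
from typing import Dict, List, Union


def to_positional_aggs(
    column_names: List[str], aggs_by_name: Dict[str, Union[str, List[str]]]
) -> List[List[str]]:
    """Convert {column name: agg names} into one list of agg names per column."""
    positional = []
    for name in column_names:
        requested = aggs_by_name.get(name, [])
        if isinstance(requested, str):
            requested = [requested]  # allow a single agg name as shorthand
        positional.append(list(requested))
    return positional


# Old style: keyed by column name. New style: aligned with the list of columns.
print(to_positional_aggs(["a", "b", "c"], {"a": ["sum", "min"], "c": "max"}))
```

Because the result is positional, the Cython layer no longer needs to know column names; the Python layer keeps them and matches results back by position.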

I attempted to rewrite `aggregate_internal` and `scan_internal`, but the attempt was futile because the unified aggregation object is a cdef type, which precludes moving the aggregation filtering step out of its current place. I also tried unifying aggregation and scan with a Cython fused type, but did not succeed due to limitations of using fused types with C++ templated types in Cython.

Overall, the performance of the `agg` call is on par with the main branch, with a -3% to 13% performance difference depending on the agg type.

<details>
<summary>Raw Benchmark</summary>

```
========================================================================== 36 passed in 33.48s ==========================================================================
(rapids) rapids@compose:~/scratch/cudf_benchmarks$ ./compare.sh bench_groupby.py 

--------------------------------------------------------------- benchmark 'False-False-agg1-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                               Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-agg1-100] (afte)     2.5090 (1.0)      2.8418 (1.0)      2.5280 (1.0)      0.0290 (2.40)     2.5229 (1.0)      0.0103 (1.05)        15;19     273
groupby_agg[False-False-agg1-100] (befo)     2.7681 (1.10)     2.8441 (1.00)     2.7877 (1.10)     0.0121 (1.0)      2.7849 (1.10)     0.0098 (1.0)         60;26     252
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-False-agg1-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                                 Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-agg1-10000] (afte)     2.7803 (1.0)      3.4156 (1.05)     2.8131 (1.0)      0.0548 (1.57)     2.8007 (1.0)      0.0253 (1.0)         10;12     252
groupby_agg[False-False-agg1-10000] (befo)     3.0402 (1.09)     3.2407 (1.0)      3.1571 (1.12)     0.0348 (1.0)      3.1535 (1.13)     0.0509 (2.01)         39;6     236
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'False-False-agg1-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                    Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-agg1-1000000] (afte)     13.2601 (1.0)      14.0128 (1.01)     13.4242 (1.0)      0.1056 (1.28)     13.4004 (1.0)      0.0286 (1.0)           5;8      68
groupby_agg[False-False-agg1-1000000] (befo)     13.5150 (1.02)     13.9165 (1.0)      13.6015 (1.01)     0.0826 (1.0)      13.5944 (1.01)     0.0696 (2.43)          8;5      66
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-False-agg2-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                               Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-agg2-100] (afte)     2.5342 (1.0)      2.8621 (1.0)      2.5591 (1.0)      0.0431 (3.18)     2.5509 (1.0)      0.0106 (1.01)        13;18     273
groupby_agg[False-False-agg2-100] (befo)     2.8797 (1.14)     2.9507 (1.03)     2.8997 (1.13)     0.0136 (1.0)      2.8965 (1.14)     0.0105 (1.0)         52;28     227
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-False-agg2-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                                 Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-agg2-10000] (afte)     2.7922 (1.0)      3.2884 (1.0)      2.8205 (1.0)      0.0473 (1.40)     2.8118 (1.0)      0.0096 (1.0)         10;18     251
groupby_agg[False-False-agg2-10000] (befo)     3.1491 (1.13)     3.4791 (1.06)     3.1752 (1.13)     0.0338 (1.0)      3.1687 (1.13)     0.0108 (1.12)         6;17     172
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'False-False-agg2-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                    Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-agg2-1000000] (afte)     13.4699 (1.0)      14.6287 (1.0)      13.6020 (1.0)      0.1359 (1.0)      13.5769 (1.0)      0.0270 (1.0)           3;8      69
groupby_agg[False-False-agg2-1000000] (befo)     13.6079 (1.01)     29.8318 (2.04)     14.0777 (1.03)     1.9806 (14.57)    13.7795 (1.01)     0.0567 (2.10)          2;6      68
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-False-sum-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                              Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-sum-100] (afte)     2.1667 (1.0)      2.2855 (1.0)      2.1831 (1.0)      0.0146 (1.49)     2.1802 (1.0)      0.0111 (1.14)        25;16     301
groupby_agg[False-False-sum-100] (befo)     2.4142 (1.11)     2.4782 (1.08)     2.4319 (1.11)     0.0098 (1.0)      2.4309 (1.11)     0.0097 (1.0)         65;15     278
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-False-sum-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                                Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-sum-10000] (afte)     2.4293 (1.0)      2.6593 (1.0)      2.4493 (1.0)      0.0206 (1.66)     2.4455 (1.0)      0.0115 (1.10)        17;19     278
groupby_agg[False-False-sum-10000] (befo)     2.6646 (1.10)     2.7706 (1.04)     2.6832 (1.10)     0.0124 (1.0)      2.6811 (1.10)     0.0105 (1.0)         49;14     257
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------- benchmark 'False-False-sum-1000000': 2 tests ---------------------------------------------------------------
Name (time in ms)                                  Min                Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-False-sum-1000000] (afte)     9.3678 (1.0)      21.0480 (2.07)     9.6817 (1.0)      1.2252 (16.49)    9.5286 (1.0)      0.0342 (1.28)          1;9      89
groupby_agg[False-False-sum-1000000] (befo)     9.6830 (1.03)     10.1832 (1.0)      9.7434 (1.01)     0.0743 (1.0)      9.7238 (1.02)     0.0266 (1.0)           6;6      86
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-True-agg1-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                              Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-agg1-100] (afte)     2.4392 (1.0)      2.7474 (1.06)     2.4598 (1.0)      0.0287 (2.07)     2.4545 (1.0)      0.0103 (1.0)         10;17     278
groupby_agg[False-True-agg1-100] (befo)     2.5183 (1.03)     2.6017 (1.0)      2.5354 (1.03)     0.0139 (1.0)      2.5332 (1.03)     0.0134 (1.30)        51;18     268
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-True-agg1-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                                Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-agg1-10000] (afte)     2.7196 (1.0)      3.2290 (1.06)     2.7446 (1.0)      0.0462 (2.17)     2.7359 (1.0)      0.0106 (1.00)        11;17     257
groupby_agg[False-True-agg1-10000] (befo)     2.7807 (1.02)     3.0590 (1.0)      2.8039 (1.02)     0.0213 (1.0)      2.8004 (1.02)     0.0106 (1.0)         16;18     251
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'False-True-agg1-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                   Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-agg1-1000000] (afte)     13.2259 (1.01)     13.7344 (1.0)      13.3449 (1.00)     0.0797 (1.0)      13.3288 (1.00)     0.0322 (1.41)          5;8      69
groupby_agg[False-True-agg1-1000000] (befo)     13.0875 (1.0)      14.1552 (1.03)     13.3135 (1.0)      0.1325 (1.66)     13.2901 (1.0)      0.0229 (1.0)           4;7      68
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-True-agg2-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                              Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-agg2-100] (afte)     2.4580 (1.0)      2.5791 (1.0)      2.4792 (1.0)      0.0174 (1.92)     2.4756 (1.0)      0.0121 (1.37)        21;14     277
groupby_agg[False-True-agg2-100] (befo)     2.6094 (1.06)     2.6686 (1.03)     2.6260 (1.06)     0.0091 (1.0)      2.6255 (1.06)     0.0088 (1.0)         66;21     264
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-True-agg2-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                                Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-agg2-10000] (afte)     2.7218 (1.0)      2.8843 (1.0)      2.7415 (1.0)      0.0180 (1.0)      2.7383 (1.0)      0.0116 (1.12)        21;16     257
groupby_agg[False-True-agg2-10000] (befo)     2.8771 (1.06)     3.1227 (1.08)     2.8956 (1.06)     0.0185 (1.03)     2.8922 (1.06)     0.0104 (1.0)         16;16     244
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'False-True-agg2-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                   Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-agg2-1000000] (afte)     13.4555 (1.01)     13.7924 (1.0)      13.5244 (1.00)     0.0601 (1.0)      13.5099 (1.00)     0.0362 (1.0)           7;6      70
groupby_agg[False-True-agg2-1000000] (befo)     13.3841 (1.0)      13.9437 (1.01)     13.4948 (1.0)      0.0773 (1.29)     13.4768 (1.0)      0.0443 (1.22)          5;5      68
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-True-sum-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-sum-100] (afte)     2.1270 (1.0)      2.2397 (1.0)      2.1435 (1.0)      0.0158 (1.01)     2.1407 (1.0)      0.0105 (1.0)         27;22     302
groupby_agg[False-True-sum-100] (befo)     2.1881 (1.03)     2.3309 (1.04)     2.2048 (1.03)     0.0156 (1.0)      2.2014 (1.03)     0.0111 (1.06)        35;30     297
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-True-sum-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                               Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-sum-10000] (afte)     2.4018 (1.0)      2.6107 (1.0)      2.4183 (1.0)      0.0198 (1.16)     2.4149 (1.0)      0.0108 (1.12)        14;14     277
groupby_agg[False-True-sum-10000] (befo)     2.4406 (1.02)     2.6840 (1.03)     2.4606 (1.02)     0.0170 (1.0)      2.4585 (1.02)     0.0097 (1.0)         15;14     274
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'False-True-sum-1000000': 2 tests ----------------------------------------------------------------
Name (time in ms)                                 Min                Max              Mean            StdDev            Median               IQR            Outliers  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[False-True-sum-1000000] (afte)     9.4459 (1.01)     10.0397 (1.0)      9.4983 (1.0)      0.0706 (1.0)      9.4846 (1.0)      0.0216 (1.0)           4;6      89
groupby_agg[False-True-sum-1000000] (befo)     9.3064 (1.0)      10.2732 (1.02)     9.5150 (1.00)     0.1107 (1.57)     9.4933 (1.00)     0.0239 (1.10)         6;10      88
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------- benchmark 'True-False-agg1-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                              Min                Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-agg1-100] (afte)     4.3327 (1.0)       4.4800 (1.0)      4.3504 (1.0)      0.0202 (1.0)      4.3457 (1.0)      0.0103 (1.0)         10;16     181
groupby_agg[True-False-agg1-100] (befo)     4.6486 (1.07)     12.4651 (2.78)     4.8006 (1.10)     0.7100 (35.18)    4.6664 (1.07)     0.0191 (1.86)        10;19     170
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-False-agg1-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                                Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-agg1-10000] (afte)     4.9246 (1.0)      5.1165 (1.0)      4.9491 (1.0)      0.0269 (1.0)      4.9407 (1.0)      0.0133 (1.06)        16;19     164
groupby_agg[True-False-agg1-10000] (befo)     5.2464 (1.07)     5.6002 (1.09)     5.2700 (1.06)     0.0370 (1.38)     5.2623 (1.07)     0.0126 (1.0)         10;17     154
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'True-False-agg1-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                   Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-agg1-1000000] (afte)     36.5089 (1.00)     37.2874 (1.0)      36.8305 (1.0)      0.2321 (1.0)      36.7404 (1.0)      0.2208 (1.0)           7;5      28
groupby_agg[True-False-agg1-1000000] (befo)     36.3558 (1.0)      47.0329 (1.26)     37.7670 (1.03)     2.7313 (11.77)    36.8183 (1.00)     0.8527 (3.86)          2;3      26
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-False-agg2-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                              Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-agg2-100] (afte)     4.6287 (1.0)      5.2921 (1.02)     4.6918 (1.0)      0.1017 (4.64)     4.6526 (1.0)      0.0496 (3.27)        21;23     167
groupby_agg[True-False-agg2-100] (befo)     4.9776 (1.08)     5.1737 (1.0)      5.0060 (1.07)     0.0219 (1.0)      4.9995 (1.07)     0.0152 (1.0)         18;10     161
------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-False-agg2-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                                Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-agg2-10000] (afte)     5.2022 (1.0)      6.7622 (1.16)     5.2405 (1.0)      0.1267 (2.98)     5.2219 (1.0)      0.0157 (1.0)          2;16     155
groupby_agg[True-False-agg2-10000] (befo)     5.5802 (1.07)     5.8531 (1.0)      5.6166 (1.07)     0.0424 (1.0)      5.6041 (1.07)     0.0206 (1.31)        11;14     147
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'True-False-agg2-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                   Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-agg2-1000000] (afte)     37.9639 (1.0)      38.7598 (1.0)      38.2381 (1.0)      0.1221 (1.0)      38.2346 (1.00)     0.0583 (1.0)           2;2      27
groupby_agg[True-False-agg2-1000000] (befo)     38.0569 (1.00)     41.5735 (1.07)     38.7983 (1.01)     1.1968 (9.80)     38.1696 (1.0)      0.6344 (10.88)         5;5      26
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-False-sum-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-sum-100] (afte)     3.6893 (1.0)      4.2792 (1.03)     3.7130 (1.0)      0.0580 (4.15)     3.7022 (1.0)      0.0079 (1.0)         10;16     206
groupby_agg[True-False-sum-100] (befo)     4.0016 (1.08)     4.1370 (1.0)      4.0218 (1.08)     0.0140 (1.0)      4.0180 (1.09)     0.0097 (1.23)        27;17     188
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-False-sum-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                               Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-sum-10000] (afte)     4.2660 (1.0)      4.6651 (1.0)      4.2913 (1.0)      0.0493 (2.97)     4.2799 (1.0)      0.0097 (1.0)         10;21     185
groupby_agg[True-False-sum-10000] (befo)     4.5702 (1.07)     4.7321 (1.01)     4.5904 (1.07)     0.0166 (1.0)      4.5858 (1.07)     0.0134 (1.37)         24;8     172
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'True-False-sum-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                  Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-False-sum-1000000] (afte)     30.5871 (1.00)     30.9527 (1.0)      30.6797 (1.00)     0.0628 (1.0)      30.6720 (1.00)     0.0421 (1.0)           4;3      32
groupby_agg[True-False-sum-1000000] (befo)     30.5386 (1.0)      31.8930 (1.03)     30.6654 (1.0)      0.2383 (3.80)     30.6013 (1.0)      0.0573 (1.36)          1;4      31
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-True-agg1-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg1-100] (afte)     4.2812 (1.0)      4.5815 (1.0)      4.3304 (1.0)      0.0495 (1.43)     4.3134 (1.0)      0.0647 (4.80)         22;4     173
groupby_agg[True-True-agg1-100] (befo)     4.4126 (1.03)     4.7356 (1.03)     4.4357 (1.02)     0.0348 (1.0)      4.4253 (1.03)     0.0135 (1.0)         14;18     158
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-True-agg1-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                               Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg1-10000] (afte)     4.8505 (1.0)      5.3411 (1.0)      4.8882 (1.0)      0.0596 (1.49)     4.8693 (1.0)      0.0240 (1.41)        12;15     166
groupby_agg[True-True-agg1-10000] (befo)     4.9857 (1.03)     5.3869 (1.01)     5.0191 (1.03)     0.0399 (1.0)      5.0089 (1.03)     0.0170 (1.0)          9;15     160
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'True-True-agg1-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                  Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg1-1000000] (afte)     36.5387 (1.01)     55.8017 (1.52)     37.3622 (1.03)     3.6965 (48.22)    36.5756 (1.00)     0.0882 (2.75)          1;3      27
groupby_agg[True-True-agg1-1000000] (befo)     36.3456 (1.0)      36.7584 (1.0)      36.4209 (1.0)      0.0767 (1.0)      36.4014 (1.0)      0.0320 (1.0)           1;4      27
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-True-agg2-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                             Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg2-100] (afte)     4.5713 (1.0)      5.1548 (1.06)     4.6064 (1.0)      0.0621 (4.49)     4.5886 (1.0)      0.0203 (1.51)        13;22     170
groupby_agg[True-True-agg2-100] (befo)     4.7628 (1.04)     4.8752 (1.0)      4.7832 (1.04)     0.0138 (1.0)      4.7795 (1.04)     0.0134 (1.0)          29;9     167
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-True-agg2-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                               Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg2-10000] (afte)     5.1343 (1.0)      5.4159 (1.0)      5.1769 (1.0)      0.0517 (1.36)     5.1590 (1.0)      0.0179 (1.21)        16;22     157
groupby_agg[True-True-agg2-10000] (befo)     5.3567 (1.04)     5.6432 (1.04)     5.3858 (1.04)     0.0379 (1.0)      5.3785 (1.04)     0.0147 (1.0)          7;12     152
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'True-True-agg2-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                  Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-agg2-1000000] (afte)     38.0357 (1.00)     38.2935 (1.00)     38.1159 (1.00)     0.0597 (1.0)      38.1014 (1.00)     0.0846 (1.0)           6;1      27
groupby_agg[True-True-agg2-1000000] (befo)     37.9134 (1.0)      38.2851 (1.0)      38.0201 (1.0)      0.0929 (1.55)     37.9944 (1.0)      0.1066 (1.26)          7;1      26
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------- benchmark 'True-True-sum-100': 2 tests ---------------------------------------------------------------
Name (time in ms)                            Min               Max              Mean            StdDev            Median               IQR            Outliers  Rounds
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-sum-100] (afte)     3.7452 (1.0)      4.0287 (1.0)      3.8009 (1.0)      0.0408 (1.0)      3.7968 (1.0)      0.0503 (1.0)          29;3     131
groupby_agg[True-True-sum-100] (befo)     3.8752 (1.03)     4.4384 (1.10)     3.9316 (1.03)     0.0608 (1.49)     3.9265 (1.03)     0.0504 (1.00)          4;3     148
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

---------------------------------------------------------------- benchmark 'True-True-sum-10000': 2 tests ---------------------------------------------------------------
Name (time in ms)                              Min                Max              Mean            StdDev            Median               IQR            Outliers  Rounds
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-sum-10000] (afte)     4.4442 (1.0)      11.3511 (2.35)     4.5582 (1.0)      0.5829 (24.78)    4.4741 (1.0)      0.0323 (2.85)         3;19     171
groupby_agg[True-True-sum-10000] (befo)     4.5676 (1.03)      4.8264 (1.0)      4.5913 (1.01)     0.0235 (1.0)      4.5871 (1.03)     0.0114 (1.0)         15;16     168
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------- benchmark 'True-True-sum-1000000': 2 tests -----------------------------------------------------------------
Name (time in ms)                                 Min                Max               Mean            StdDev             Median               IQR            Outliers  Rounds
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
groupby_agg[True-True-sum-1000000] (afte)     30.5326 (1.00)     33.6395 (1.02)     31.2355 (1.0)      0.9563 (1.20)     30.6933 (1.0)      0.9663 (1.0)           5;3      30
groupby_agg[True-True-sum-1000000] (befo)     30.4080 (1.0)      33.0341 (1.0)      31.2527 (1.00)     0.7946 (1.0)      30.9808 (1.01)     1.2781 (1.32)         11;0      30
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```

</details>

[Benchmark code](https://github.com/isVoid/cudf_benchmarks/blob/9d9644eaa5301df7894c2fe4b1ba317396240518/bench_groupby.py#L23-L42)
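
A note on reading the tables above: as far as I understand pytest-benchmark's output, the number in parentheses is each value divided by the best (smallest) value in that column. The sketch below reproduces the ratios for the `True-False-agg1-100` mean times, using values copied from the table. `relative_to_best` is a hypothetical helper written for this illustration, not part of pytest-benchmark.

```python
def relative_to_best(values):
    """Return each value divided by the smallest value, rounded to 2 places."""
    best = min(values.values())
    return {name: round(v / best, 2) for name, v in values.items()}


# Mean times in ms for groupby_agg[True-False-agg1-100], copied from the table.
means = {"after": 4.3504, "before": 4.8006}
ratios = relative_to_best(means)
print(ratios)
```

A ratio of 1.0 marks the faster run in that column; the other value shows how many times slower the other run was.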

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)

URL: #10419
@isVoid isVoid changed the title [FEA] Use List of Columns instead of Frames [FEA] Use List of Columns instead of Frames in Cython API Apr 15, 2022
rapids-bot bot pushed a commit that referenced this issue Apr 19, 2022
This PR covers much of the low-hanging fruit for #10153. All APIs that accepted Frames now accept a list of columns in the following files:

- hash.pyx
- interop.pyx
- join.pyx
- partitioning.pyx
- quantiles.pyx
- reshape.pyx
- search.pyx
- transform.pyx
- lists.pyx
- string/combine.pyx

This PR covers point 5 in the follow-ups to #9889.
Also, in `join.pyx`, the GIL was not released when dispatching work to libcudf. This PR fixes that.
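
The calling convention described above can be sketched in plain Python. This is an illustration only: `reverse_columns` is a hypothetical stand-in for a Cython binding, Python lists stand in for cuDF columns, and none of these names exist in cuDF. The binding sees only an ordered list of columns; the caller strips the names before the call and reattaches them by position afterward.

```python
from typing import Dict, List

Column = List[int]  # stand-in for a cuDF column (assumption for illustration)


def reverse_columns(columns: List[Column]) -> List[Column]:
    """Hypothetical binding: takes and returns an ordered list of columns."""
    return [col[::-1] for col in columns]


def reverse_frame(frame: Dict[str, Column]) -> Dict[str, Column]:
    """Caller side: drop names, call the binding, reattach names by position."""
    names = list(frame)
    result = reverse_columns([frame[name] for name in names])
    return dict(zip(names, result))


frame = {"a": [1, 2, 3], "b": [4, 5, 6]}
out = reverse_frame(frame)
```

Keeping names out of the binding layer mirrors the motivation in the issue: libcudf's `table_view` carries no column names, so the wrapper has nothing extra to track.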

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10463
rapids-bot bot pushed a commit that referenced this issue Apr 19, 2022
This PR contributes to #10153 by refactoring all Cython APIs in `transpose.pyx` and `sort.pyx` to accept a list of columns as input.

This PR also includes several minor improvements in the code base, see comments below for detail.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #10675
rapids-bot bot pushed a commit that referenced this issue Apr 22, 2022
This PR refactors `merge_sorted` in `merge.pyx` to accept a list of columns, contributing to #10153.

Authors:
  - Michael Wang (https://github.com/isVoid)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)

URL: #10698
@GregoryKimball GregoryKimball added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jun 28, 2022
@GregoryKimball
Contributor

@isVoid Should we close this, make a new issue for the IO files, and label it "good first issue"?

@isVoid
Contributor Author

isVoid commented Jul 15, 2022

I would still like to keep tracking this work in this issue. The I/O file refactors in particular are the hardest of the group, so I don't think moving them to a separate issue will help lower the barrier.

@vyasr
Contributor

vyasr commented May 14, 2024

I'm going to close this issue. With pylibcudf coming, we are going to have to rethink the internals of cudf further. Ultimately, the list-of-columns approach we took here is certainly going to be closer to the final end state, but it is no longer directly applicable to future design.

@vyasr vyasr closed this as completed May 14, 2024