[FEA] DataFrame level support for value_counts function #5169
@jrhemstad I think he means for the entire column, similar to series-level reductions. Think of it as the same as groupby, but there is only one group. This shouldn't be done by just calling groupby with a key table that contains one column of all 0's, because then we'd be needlessly building a hash table and would lose the performance advantage that motivated this request.
But you aren't making one group for the whole column; you are making one group per value and doing a count on each group. What he's asking for is something that runs independently on all columns of a table.
Yes, value_counts will run independently on all columns of a table.
So this is a histogram per column. It could be done with groupby, but it would be a separate groupby per column. I think first we should build value_counts for a single column if that is requested, and then evaluate whether looping over columns on the CPU is sufficient. Without a performance comparison or a repro of poor performance, it's hard to motivate further optimization.
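For reference, a per-column groupby produces essentially the same counts as the existing Series-level API (a minimal sketch, assuming a cuDF DataFrame `df` with a hypothetical column 'C1'):

```python
# Counting occurrences of each value in one column via groupby
per_column_counts = df.groupby("C1").size()

# Roughly equivalent to the existing Series-level API
# (the ordering of the result may differ)
per_column_counts_vc = df["C1"].value_counts()
```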
@harrism That's OK. We can close these tickets. It looks like you don't think there will be a speedup.
No, that's not what I'm saying. There's always an opportunity to speed up over iterating in Python. Can you share example application code or benchmarks demonstrating what you want to speed up?
I don't think this needs libcudf support. I believe this can be implemented reasonably efficiently in the Python layer with something like |
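A minimal sketch of what such a Python-layer implementation could look like, looping over columns and reusing the existing Series-level API (this is an assumption about the approach, not the snippet the comment referred to; the helper name is hypothetical):

```python
import cudf

def value_counts_per_column(df: cudf.DataFrame) -> dict:
    # Hypothetical helper: iterate over columns in Python, reusing the
    # existing Series.value_counts for each one.
    return {col: df[col].value_counts() for col in df.columns}
```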
Add functionality for value_counts() in DataFrame. Resolves #5169
Authors:
- https://github.com/martinfalisse
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
URL: #10813
Is your feature request related to a problem? Please describe.
I wish cuDF could run value_counts at the DataFrame level.
As an example, we need a version of the value_counts function at the DataFrame level that runs on the columns in parallel. Right now, value_counts (and other functions as well) is only available at the Series level, which only lets us run it one column at a time. For instance, if we have 'C1', 'C2', 'C3', ..., 'C20' as categorical columns, we have to call value_counts for 'C1', then for 'C2', ..., then for 'C20', one by one. I think this sequential approach costs us a significant amount of time, because the value counts for 'C1', 'C2', 'C3', ..., 'C20' could be computed simultaneously in one CUDA kernel.
Also, some of these operations could possibly be fused into one or a couple of CUDA kernels.
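For illustration, a minimal sketch of the current one-column-at-a-time pattern (the column names and data are hypothetical):

```python
import cudf

# Hypothetical frame with 20 categorical columns C1..C20
df = cudf.DataFrame({f"C{i}": ["a", "b", "a", "c"] for i in range(1, 21)})

# Today each column is handled by its own Series.value_counts call,
# so the work is issued sequentially rather than in a single fused kernel.
counts = {col: df[col].value_counts() for col in df.columns}
```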
Describe the solution you'd like
Create CUDA kernels for these functions that run on the columns of a DataFrame in parallel.
Describe alternatives you've considered
One easy and fast solution might be to create N streams, where N is the number of columns in the DataFrame, and then make N kernel calls using these streams. This might not be as fast as the single-kernel approach, since there is some overhead to creating streams, but it could provide some speedup. Alternatively, the streams could be created once at the beginning of the program at the cuDF level and reused whenever needed.
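A rough sketch of the stream-per-column idea, using CuPy as a stand-in because cuDF's Python API does not expose CUDA streams; whether the per-column work actually overlaps depends on the kernels involved, so this is an assumption rather than a measured result:

```python
import cupy as cp

# Hypothetical integer-coded categorical columns as device arrays
columns = [cp.random.randint(0, 10, size=1_000_000) for _ in range(20)]

streams = [cp.cuda.Stream(non_blocking=True) for _ in columns]
results = []
for col, stream in zip(columns, streams):
    with stream:
        # unique + counts is one way to get per-column value counts
        values, counts = cp.unique(col, return_counts=True)
        results.append((values, counts))

# Wait for all per-column work to finish before using the results
for stream in streams:
    stream.synchronize()
```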