[FEA] DataFrame level support for value_counts function #5169

Closed
oyilmaz-nvidia opened this issue May 12, 2020 · 10 comments · Fixed by #10813
Labels
feature request New feature or request good first issue Good for newcomers Python Affects Python cuDF API.

Comments

@oyilmaz-nvidia

Is your feature request related to a problem? Please describe.
I wish cuDF could run value_counts at the DataFrame level.

For example, we need a DataFrame-level version of value_counts that runs on the columns in parallel. Right now, value_counts (like some other functions) is only available at the Series level, so it can only be run on one column at a time. For instance, if we have categorical columns 'C1', 'C2', 'C3', ..., 'C20', we have to call value_counts for 'C1', then for 'C2', and so on through 'C20', one by one. I think this sequential approach costs us a significant amount of time, because the value counts for 'C1', 'C2', 'C3', ..., 'C20' could be computed simultaneously in one CUDA kernel.

Also, some of these operations could possibly be fused into one or a few CUDA kernels.
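
To make the situation described above concrete, here is a minimal sketch of the per-column loop. It uses pandas as a CPU stand-in (an assumption; cuDF mirrors much of the pandas API, but this snippet was not run on a GPU), and the column names and values are hypothetical.

```python
import pandas as pd  # CPU stand-in; assumption: cudf could be swapped in here

# Hypothetical frame with three categorical columns.
df = pd.DataFrame({
    "C1": ["a", "b", "a", "c"],
    "C2": ["x", "x", "y", "x"],
    "C3": ["p", "q", "q", "q"],
})

# One Series-level value_counts call per column, issued one after another.
per_column_counts = {name: df[name].value_counts() for name in df.columns}

for name, counts in per_column_counts.items():
    print(name, counts.to_dict())
```

Each iteration is an independent call, which is the sequential pattern the request wants to avoid.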

Describe the solution you'd like
Create CUDA kernels for these functions that run on the columns of a DataFrame in parallel.

Describe alternatives you've considered
One easy and fast solution might be to create N streams, where N is the number of columns in the DataFrame, and then issue N kernel calls on these streams. This might not be as fast as the single-kernel approach, since creating the streams adds a delay, but it could provide some speedup. The streams could perhaps be created at the beginning of a program at the cuDF level and reused whenever needed.

@oyilmaz-nvidia oyilmaz-nvidia added Needs Triage Need team to review and classify feature request New feature or request labels May 12, 2020
@oyilmaz-nvidia oyilmaz-nvidia changed the title [FEA] DataFrame level support for value_counts [FEA] DataFrame level support for value_counts function May 12, 2020
@jrhemstad
Contributor

value_counts is just a groupby count, yes? If so, we're already doing this in a single fused kernel.

@devavret
Contributor

@jrhemstad I think he means for the entire column. Similar to series level reductions. Think the same as groupby but there is only one group. This shouldn't be done by just calling groupby with a key table that contains one column of all 0's because then we'd be needlessly building a hash table, and lose the performance advantage that's the motivation behind this request.

@harrism
Member

harrism commented May 13, 2020

But you aren't making one group for the whole column, you are making one group per value, and doing a count on each group.

What he's asking for is something that runs independently on all columns of a table.
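
The "one group per value, then count each group" view can be illustrated for a single column. This is a rough sketch in pandas (an assumed stand-in for cuDF), with made-up data; it only shows that the two formulations give the same counts, not how either is implemented internally.

```python
import pandas as pd  # CPU stand-in; assumption: cudf behaves analogously

# Hypothetical single categorical column.
s = pd.Series(["a", "b", "a", "c", "a"], name="C1")

# Series-level value_counts.
counts_vc = s.value_counts()

# The same result phrased as a groupby: one group per distinct value,
# then the size of each group.
counts_gb = s.groupby(s).size()

# The counts agree; only the ordering of the result differs.
same = counts_vc.sort_index().to_dict() == counts_gb.sort_index().to_dict()
print(same)
```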

@oyilmaz-nvidia
Author

Yes, value_counts will run independently on all columns of a table.

@harrism
Member

harrism commented May 14, 2020

So this is a histogram per column. It could be done with groupby, but it would be a separate groupby per column. I think first we should build value_counts for a single column if that is requested, and then evaluate whether looping over columns on the CPU is sufficient. Without a performance comparison or a repro of poor performance, it's hard to motivate further optimization.
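
A performance comparison of the kind asked for here could start from a small timing harness like the one below. This is only a sketch under stated assumptions: it uses pandas on the CPU, the helper name `time_per_column_value_counts` and the synthetic 20-column frame are hypothetical, and a GPU measurement would presumably also need to account for asynchronous execution, which is not handled here.

```python
import time

import pandas as pd  # CPU stand-in; assumption: cudf could be swapped in


def time_per_column_value_counts(df, repeats=5):
    """Return the best wall-clock time (seconds) for looping
    Series.value_counts over every column of df."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for name in df.columns:
            df[name].value_counts()
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical workload: 20 integer-coded categorical columns, 10,000 rows.
df = pd.DataFrame(
    {f"C{i}": [j % (i + 1) for j in range(10_000)] for i in range(1, 21)}
)

elapsed = time_per_column_value_counts(df)
print(f"per-column loop over {df.shape[1]} columns: {elapsed:.6f} s")
```

Running the same harness against any alternative implementation would give the baseline needed to judge whether further optimization is worthwhile.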

@oyilmaz-nvidia
Author

@harrism That’s Ok. We can close these tickets. Looks like you don’t think there will be a speedup.

@harrism
Member

harrism commented May 14, 2020

No, that's not what I'm saying. There's always an opportunity to speed up over iterating in Python. Do you have example application code or benchmarks demonstrating what you want to speed up?

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels May 29, 2020
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@beckernick beckernick removed libcudf Affects libcudf (C++/CUDA) code. inactive-30d labels Nov 2, 2021
@beckernick
Member

I don't think this needs libcudf support. I believe this can be implemented reasonably efficiently in the Python layer with something like df.groupby(list_of_all_columns).size().
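
As a rough illustration of the suggestion above, here is the groupby expression in pandas (an assumed stand-in; the frame and its values are hypothetical). Note that grouping on all columns counts distinct rows, that is, combinations of values across columns, rather than producing a separate histogram for each column.

```python
import pandas as pd  # CPU stand-in; assumption: cudf supports the same calls

# Hypothetical two-column frame.
df = pd.DataFrame({
    "C1": ["a", "a", "b", "a"],
    "C2": ["x", "x", "y", "y"],
})

# Group on every column and take the size of each group: one count per
# distinct row.
list_of_all_columns = list(df.columns)
row_counts = df.groupby(list_of_all_columns).size()

# pandas' own DataFrame.value_counts yields the same counts (ordering aside).
matches_pandas = (
    row_counts.sort_index().to_dict() == df.value_counts().sort_index().to_dict()
)
print(row_counts.to_dict(), matches_pandas)
```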

@beckernick beckernick added the good first issue Good for newcomers label Nov 2, 2021
rapids-bot bot pushed a commit that referenced this issue Jun 1, 2022
Add functionality for value_counts() in DataFrame. Resolves #5169

Authors:
  - https://github.com/martinfalisse

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #10813

6 participants