[FEA] DataFrame level support for value_counts function #5169
@jrhemstad I think he means for the entire column, similar to series-level reductions. Think of it as the same as groupby, but there is only one group. This shouldn't be done by just calling groupby with a key table that contains one column of all 0's, because then we'd be needlessly building a hash table and would lose the performance advantage that motivated this request.
But you aren't making one group for the whole column; you are making one group per value and doing a count on each group. What he's asking for is something that runs independently on all columns of a table.
Yes, value_counts will run independently on all columns of a table.
So this is a histogram per column. It could be done with groupby, but it would be a separate groupby per column. I think first we should build value_counts for a single column if that is requested, and then evaluate whether looping over columns on the CPU is sufficient. Without a performance comparison or a repro of poor performance, it's hard to motivate further optimization.
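For reference, a per-column groupby produces essentially the same counts as the existing Series-level API (a minimal sketch, assuming a cuDF DataFrame `df` with a hypothetical column 'C1'):

```python
# Counting occurrences of each value in one column via groupby
per_column_counts = df.groupby("C1").size()

# Roughly equivalent to the existing Series-level API
# (the ordering of the result may differ)
per_column_counts_vc = df["C1"].value_counts()
```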
@harrism That's OK. We can close these tickets. It looks like you don't think there will be a speedup.
No, that's not what I'm saying. There's always an opportunity to speed up over iterating in Python. Can you share example application code or benchmarks demonstrating what you want to speed up?
I don't think this needs libcudf support. I believe this can be implemented reasonably efficiently in the Python layer with something like |
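A minimal sketch of what such a Python-layer implementation could look like, looping over columns and reusing the existing Series-level API (this is an assumption about the approach, not the snippet the comment referred to; the helper name is hypothetical):

```python
import cudf

def value_counts_per_column(df: cudf.DataFrame) -> dict:
    # Hypothetical helper: iterate over columns in Python, reusing the
    # existing Series.value_counts for each one.
    return {col: df[col].value_counts() for col in df.columns}
```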
Add functionality for value_counts() in DataFrame. Resolves #5169
Authors:
- https://github.com/martinfalisse
Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
URL: #10813
Is your feature request related to a problem? Please describe.
I wish cuDF could run value_counts at the DataFrame level.
As an example, we need a version of the value_counts function at the DataFrame level that runs on the columns in parallel. Right now, value_counts (and other functions as well) is only available at the Series level, which only lets us run it one column at a time. For instance, if we have 'C1', 'C2', 'C3', ..., 'C20' as categorical columns, we have to call value_counts for 'C1', then for 'C2', ..., then for 'C20', one by one. I think this sequential approach costs us a significant amount of time, because the value counts for 'C1', 'C2', 'C3', ..., 'C20' could be computed simultaneously in one CUDA kernel.
Also, some of these operations could possibly be fused into one or a couple of CUDA kernels.
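For illustration, a minimal sketch of the current one-column-at-a-time pattern (the column names and data are hypothetical):

```python
import cudf

# Hypothetical frame with 20 categorical columns C1..C20
df = cudf.DataFrame({f"C{i}": ["a", "b", "a", "c"] for i in range(1, 21)})

# Today each column is handled by its own Series.value_counts call,
# so the work is issued sequentially rather than in a single fused kernel.
counts = {col: df[col].value_counts() for col in df.columns}
```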
Describe the solution you'd like
Create CUDA kernels for these functions that run on the columns of a DataFrame in parallel.
Describe alternatives you've considered
One easy and fast solution might be to create N streams, where N is the number of columns in the DataFrame, and then make N kernel calls using these streams. This might not be as fast as the single-kernel approach, since there is some overhead to creating streams, but it could provide some speedup. Alternatively, the streams could be created once at the beginning of the program at the cuDF level and reused whenever needed.
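A rough sketch of the stream-per-column idea, using CuPy as a stand-in because cuDF's Python API does not expose CUDA streams; whether the per-column work actually overlaps depends on the kernels involved, so this is an assumption rather than a measured result:

```python
import cupy as cp

# Hypothetical integer-coded categorical columns as device arrays
columns = [cp.random.randint(0, 10, size=1_000_000) for _ in range(20)]

streams = [cp.cuda.Stream(non_blocking=True) for _ in columns]
results = []
for col, stream in zip(columns, streams):
    with stream:
        # unique + counts is one way to get per-column value counts
        values, counts = cp.unique(col, return_counts=True)
        results.append((values, counts))

# Wait for all per-column work to finish before using the results
for stream in streams:
    stream.synchronize()
```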