[FEA] Zero-copy slice should compute the null counts for the slices #3600
Comments
Would it make sense to make the computation an optional flag (maybe defaulting to true)? Right now slice()/split() are nice and lightweight; being able to turn off this additional computation for specific cases might be handy.
I have a useful extension of this that might be worth incorporating. Currently, when a string column is sliced, the size of the chars child column is not updated on the CPU. This makes sense, because you have to reach into GPU memory and read the offsets column to get that info. For the strings implementation of contiguous_split(), I need to be able to compute a buffer size that includes the chars child subcolumn in a fast way, at scale. Currently you can get this info on the CPU by calling strings_column_view.chars(); however, that function ends up reading device memory directly. That will not be fast enough for the scale contiguous_split() wants to run at (potentially hundreds of thousands of columns). So, if we're writing a kernel that computes null counts on the validity vectors, it might also make sense to compute the size field of the chars child column for string columns. This way, strings_column_view.chars() would become "free".
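For reference, the chars size of a slice can be derived from the offsets child column: in the Arrow-style strings layout, rows [begin, end) cover offsets[end] - offsets[begin] bytes of chars. The host sketch below (the helper name is invented for illustration) shows the computation; in cuDF the offsets live in device memory, which is why doing this lookup per slice is expensive:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side sketch: compute the chars size of a strings slice
// from the offsets child column. In the Arrow-style layout, the character
// bytes covered by rows [begin, end) are offsets[end] - offsets[begin].
// In cuDF the offsets are in device memory, so each such lookup is a
// device read unless the sizes are computed in bulk by a kernel.
int32_t sliced_chars_size(std::vector<int32_t> const& offsets,
                          std::size_t begin,
                          std::size_t end)
{
  return offsets[end] - offsets[begin];
}
```

For example, with offsets {0, 3, 5, 9, 10} (four strings of lengths 3, 2, 4, 1), slicing rows [1, 3) covers 9 - 3 = 6 chars bytes.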
I think that makes sense.
To expand on Jake's comment: "cudf::slice should be updated to compute the null count of each output slice in a single kernel." This should hold true for tables as well. A major use case for slice() at scale is contiguous_split(). My standard test case has been:
This yields a total of 128k output columns. The performance of contiguous_split() at that scale becomes dominated by the number of kernel calls. So a solution that computes the null counts at the split level (whole output tables) in a single kernel, instead of at the column level, is highly desirable. It's the difference between 256 additional kernel calls and 128k additional kernel calls.
Yes, providing an optional flag is one option, but I think computing null counts only when the original column has a valid null count is another; this applies to copy_range and other functions as well. Null counts may not be necessary in all applications, and when they are unnecessary, seemingly cheap operations can become expensive due to computing null counts, which can be surprising to users. Let me know what you think.
AFAIK the table split just calls the column split on each column, so this should already be taken care of.
That would require some mechanism to detect if the original column has a "valid" null count, which I am hoping to avoid.
I generally dislike flags like this in public APIs. They expose too much implementation detail. I'd want to see how expensive computing the null counts for all splits is before making a final call.
I agree that we don't need to decide on this now, but libcudf++ already provides a mechanism to invalidate a null count (cudf/cpp/include/cudf/column/column_view.hpp, line 132 in 072785e),
so I'm not sure adding a function to peek at whether a column_view object has a valid null count further compromises the abstraction level.
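As a sketch of the "peek" being debated (the struct and accessor name here are invented for illustration and are not the cuDF API; UNKNOWN_NULL_COUNT mirrors libcudf's sentinel for an invalidated count):

```cpp
#include <cassert>

// Illustration only: a column_view-like type that can report whether its
// cached null count is known, without forcing the lazy (device-side)
// computation. UNKNOWN_NULL_COUNT mirrors libcudf's sentinel value; the
// has_known_null_count() accessor is hypothetical.
constexpr int UNKNOWN_NULL_COUNT = -1;

struct column_view_like {
  int cached_null_count = UNKNOWN_NULL_COUNT;

  bool has_known_null_count() const
  {
    return cached_null_count != UNKNOWN_NULL_COUNT;
  }
};
```

Such a peek would let callers (e.g. slice) skip the null-count kernel entirely when the parent column's count was never computed.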
See #3579 for conversation on this topic. |
Fixed by #3698 |
Is your feature request related to a problem? Please describe.
The zero-copy slice API returns a `vector<column_view>` of sub-views into another column. Currently, the returned `column_view`s have an internal null count of `UNKNOWN_NULL_COUNT`, which means each sub-view's null count will be computed lazily upon invocation of `column_view::null_count()`.

If the number of partitions is large, this could mean a significant amount of overhead to compute the null count for each sub-view: each sub-view would require its own kernel launch to count a relatively small number of bits. It would be much faster to compute the null count for each returned sub-view in a single kernel within the `slice` function.

Describe the solution you'd like
`cudf::slice` should be updated to compute the null count of each output slice in a single kernel. Basically, the problem is: given a bitmask and a list of pairs of bit indices like

[ {begin0, end0}, {begin1, end1}, ... {beginN, endN} ]

we want to produce the count of unset bits between each `begin(i), end(i)`, so the output is

[ count0, count1, ... countN ]

The output `column_view`s' null counts can then be updated accordingly.

@harrism suggested a good implementation idea:
Additional context
Related: #3579 (comment)
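The segmented count described under "Describe the solution you'd like" can be sketched as follows. This is a serial host-side version with an invented name (segmented_null_counts), not the cuDF implementation; the actual fix would compute all ranges in a single device kernel, e.g. using per-word popcounts:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Host sketch of the segmented null-count problem: given a validity bitmask
// (bit set = valid, bit clear = null) and a list of [begin, end) bit ranges,
// produce the number of unset bits in each range. A single cuDF kernel would
// compute all of these counts in one launch instead of one launch per slice.
std::vector<std::size_t> segmented_null_counts(
    std::vector<uint32_t> const& bitmask,
    std::vector<std::pair<std::size_t, std::size_t>> const& ranges)
{
  std::vector<std::size_t> counts;
  counts.reserve(ranges.size());
  for (auto const& range : ranges) {
    std::size_t nulls = 0;
    for (std::size_t bit = range.first; bit < range.second; ++bit) {
      bool const valid = (bitmask[bit / 32] >> (bit % 32)) & 1u;
      if (!valid) { ++nulls; }
    }
    counts.push_back(nulls);
  }
  return counts;
}
```

For example, with the 8-row bitmask 0b10110101, the ranges {0, 4} and {4, 8} yield null counts 2 and 1, respectively.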