
[FEA] Zero-copy slice should compute the null counts for the slices #3600

Closed
jrhemstad opened this issue Dec 13, 2019 · 9 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
@jrhemstad
Contributor

jrhemstad commented Dec 13, 2019

Is your feature request related to a problem? Please describe.

The zero-copy slice API returns a vector<column_view> of sub-views into another column.

Currently, the returned column_views have an internal null count of UNKNOWN_NULL_COUNT, which means each sub-view's null count will be computed lazily upon invocation of column_view::null_count().

If the number of partitions is large, this could mean a significant amount of overhead to compute the null count for each sub-view: each sub-view would require its own kernel launch to count a relatively small number of bits.

It would be much faster to compute the null count for each returned sub-view in a single kernel within the slice function.

Describe the solution you'd like

cudf::slice should be updated to compute the null count of each output slice in a single kernel.

Basically, the problem is: given a bitmask and a list of pairs of bit indices like [ {begin0, end0}, {begin1, end1}, ... {beginN, endN} ],

we want to produce the count of unset bits between each begin(i), end(i), so the output is [ count0, count1, ... countN ].

The output column_views' null counts can then be updated accordingly.
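A host-side reference of the computation (hypothetical function and parameter names; on the GPU the outer loop over ranges would collapse into a single segmented-reduction kernel) might look like:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side sketch: count unset (null) bits of a bitmask in
// each [begin, end) bit range. One output count per input range.
std::vector<int> segmented_null_counts(
    std::vector<uint32_t> const& bitmask,
    std::vector<std::pair<int, int>> const& ranges) {
  std::vector<int> counts;
  counts.reserve(ranges.size());
  for (auto const& [begin, end] : ranges) {
    int nulls = 0;
    for (int bit = begin; bit < end; ++bit) {
      // A set bit means "valid"; an unset bit means "null".
      bool valid = (bitmask[bit / 32] >> (bit % 32)) & 1u;
      if (!valid) ++nulls;
    }
    counts.push_back(nulls);
  }
  return counts;
}
```

This is only a correctness reference; the point of the issue is to do all ranges in one kernel rather than one launch per sub-view.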

@harrism suggested a good implementation idea:

I think it's a matter of a cub::DeviceSegmentedReduce with segment start/end index iterators and an input transform iterator which converts bit indices to word indices, masks off any unused portion of the current word, and uses an inverted popc to return the number of null bits in the word. Done.
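A sketch of the per-word transform described above (hypothetical helper name; the portable bit-counting loop stands in for the device `__popc` intrinsic):

```cpp
#include <cstdint>

// Hypothetical helper: count null (unset) bits of one 32-bit bitmask word,
// restricted to bit positions [begin_bit, end_bit) within that word,
// where 0 <= begin_bit <= end_bit <= 32.
inline int null_count_in_word(uint32_t word, int begin_bit, int end_bit) {
  // Build a mask selecting only the bits in [begin_bit, end_bit).
  uint32_t num_bits = static_cast<uint32_t>(end_bit - begin_bit);
  uint32_t mask =
      (num_bits == 32u) ? ~0u : (((1u << num_bits) - 1u) << begin_bit);
  // Invert the word so null (0) bits become 1, then count set bits in the
  // masked region. On the GPU this would be __popc(~word & mask).
  uint32_t bits = ~word & mask;
  int count = 0;
  for (; bits != 0; bits &= bits - 1) ++count;
  return count;
}
```

Feeding this per-word count through a segmented reduction over word indices would give each slice's null count in a single kernel.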

Additional context
Related: #3579 (comment)

@jrhemstad jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. labels Dec 13, 2019
@nvdbaranec
Contributor

Would it make sense to make the computation an optional flag (maybe defaulting to true)? Right now slice/split() are nice and lightweight. Being able to turn off this additional computation for specific cases might be handy.

@nvdbaranec
Contributor

I have a useful extension of this that might be worth incorporating:

Currently, when a string column is sliced, the size of the chars child column is not updated on the CPU. This makes sense, because you have to reach into GPU memory to check the offsets column to get that info.

For the strings implementation of contiguous_split() I need to be able to compute a buffer size that includes the chars child subcolumn in a fast way, at scale. Currently, you can get this info on the CPU by calling strings_column_view.chars(); however, that function ends up reading device memory directly to get the info. This will not be fast enough for the scale that contiguous_split() wants to run at (potentially hundreds of thousands of columns).

So, if we're writing a kernel that computes null_counts() on the validity vector, it might also make sense to compute the size field of the chars child column for string columns. This way, strings_column_view.chars() would become "free".
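The quantity itself is cheap once the offsets are readable: the chars size of a slice is just the difference of two offsets. A host-side sketch (hypothetical function name; in libcudf the offsets live in device memory, which is exactly why folding these two loads into the null-count kernel is attractive):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: the chars size of a sliced strings column over rows
// [begin_row, end_row) is offsets[end_row] - offsets[begin_row], since
// offsets[i] is the starting character position of row i and the final
// entry is the total character count.
int64_t sliced_chars_size(std::vector<int32_t> const& offsets,
                          int begin_row, int end_row) {
  return static_cast<int64_t>(offsets[end_row]) - offsets[begin_row];
}
```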

@harrism
Member

harrism commented Dec 19, 2019

I think that makes sense.

@nvdbaranec
Contributor

To expand on Jake's comment:

"cudf::slice should be updated to compute the null count of each output slice in a single kernel."

This should hold true for tables as well. A major use case for slice() at scale is contiguous_split(). My standard test case has been:

  • 6 GB of fixed width data total (random floats and ints)
  • Table with 512 columns
  • 256 splits

This yields a total of 128k output columns. The performance of contiguous_split() at that scale becomes dominated by the number of kernel calls. So a solution that computes the null counts at the split level (whole output tables) in a single kernel, instead of at the column level, is highly desirable. It's the difference between 256 additional kernel calls and 128k additional kernel calls.
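Restating the kernel-count arithmetic above (constants taken from the test case; names are illustrative only):

```cpp
// Back-of-envelope for the test case above:
// 512 columns x 256 splits = 131072 (~128k) output columns.
constexpr int num_columns = 512;
constexpr int num_splits = 256;
constexpr int output_columns = num_columns * num_splits;  // 131072

// Lazy per-column null counts cost one kernel launch per output column;
// a per-split (table-level) count costs one launch per split.
constexpr int per_column_kernels = output_columns;  // ~128k launches
constexpr int per_split_kernels = num_splits;       // 256 launches
```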

@seunghwak
Contributor

Would it make sense to make the computation an optional flag (maybe defaulting to true)? Right now slice/split() are nice and lightweight. Being able to turn off this additional computation for specific cases might be handy.

Yes, providing an optional flag is one way, and I think computing null counts only if the original column has a valid null count is another; this applies to copy_range and other functions as well.

Null counts may not be necessary in all applications, and when they are unnecessary, seemingly cheap operations can become expensive due to the null-count computation, which can be surprising to users.

Let me know what you guys think.

@jrhemstad
Contributor Author

This should also hold true for tables as well

AFAIK the table split just calls the column split on each column, so this should already be taken care of.

I think computing null counts only if the original column has a valid null count is another

That would require some mechanism to detect if the original column has a "valid" null count, which I am hoping to avoid.

Would it make sense to make the computation an optional flag (maybe defaulting to true)?

I generally dislike flags like this in public APIs. It exposes too much implementation detail. I'd want to see how expensive the null counts for all splits is before making a final call.

@seunghwak
Contributor

I think computing null counts only if the original column has a valid null count is another

That would require some mechanism to detect if the original column has a "valid" null count, which I am hoping to avoid.

Would it make sense to make the computation an optional flag (maybe defaulting to true)?

I generally dislike flags like this in public APIs. It exposes too much implementation detail. I'd want to see how expensive the null counts for all splits is before making a final call.

I agree that we don't need to decide on this now, but libcudf++ already provides a mechanism to invalidate a null count (set_null_count(UNKNOWN_NULL_COUNT)), so I'm not sure adding a function to peek whether a column_view object has a valid null count further compromises the abstraction level.

@jrhemstad
Contributor Author

I'm not sure adding a function to peek whether a column_view object has a valid null count further compromises the abstraction level.

See #3579 for conversation on this topic.

@kkraus14
Collaborator

Fixed by #3698
