[FEA] provide API to check if null_count is known #3579
Comments
Yes, it would be easy to do, but I'm pretty reluctant to do something like this. It kind of defeats the point of the abstraction. Users aren't really supposed to know or care what the internal state of the column is as it's an implementation detail.
The only reason I can imagine why you wouldn't want to compute the |
I guess that works. I'll just stop sending the null count altogether and we can revisit it if it becomes an issue in the future. |
Yes, but it's an expensive detail that is abstracted in a way that appears to be cheap. Abstracting away the performance implications seems problematic for a performance-oriented library.

For example, consider column concatenation. If all of the input columns have a cached null count, then the CPU can trivially compute the final null count as it iterates the columns to prepare for the concatenation kernel. If one or more columns have an unknown null count, it is probably more efficient to mark the result column with an unknown null count and defer the computation. However, always marking the result column unknown means we will always need to run a kernel to find the null count, even when all of the concatenation inputs had known counts. Many algorithms simply want to know whether they need to deal with validity at all, rather than the actual null counts.

As another example, consider an algorithm that takes two inputs, the first with an unknown null count and the second with a known null count > 0. It doesn't make sense to run a kernel on the first input to compute its null count, given the second input already tells the algorithm it needs to deal with validity. IMO an algorithm implementation should only compute the null count of its inputs if all of the inputs have unknown null counts and it's cheaper to compute the null counts than it is to just deal with the validity.

I'm also a bit surprised the |
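A minimal sketch of the concatenation bookkeeping described above, assuming a hypothetical `column_meta` struct with an optional cached null count (these are illustrative types, not libcudf APIs):

```cpp
#include <optional>
#include <vector>

// Hypothetical per-column metadata for illustration only.
struct column_meta {
  int size;
  std::optional<int> cached_null_count;  // empty == null count unknown
};

// If every input has a cached null count, the output's count is just the sum;
// otherwise leave it unknown and defer the (kernel-based) computation.
std::optional<int> concat_null_count(std::vector<column_meta> const& inputs)
{
  int total = 0;
  for (auto const& c : inputs) {
    if (!c.cached_null_count) { return std::nullopt; }  // defer: unknown
    total += *c.cached_null_count;
  }
  return total;  // known without launching any kernel
}
```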
Let's assume we add `is_null_count_known()`. In your example of 2 inputs:

```cpp
void some_function(column_view lhs, column_view rhs){
  bool do_i_care_about_nulls = false;
  if(lhs.is_null_count_known() && rhs.is_null_count_known()){
    // null counts don't require a kernel
    do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls();
  }
  else if(lhs.is_null_count_known()){
    do_i_care_about_nulls = lhs.has_nulls();
  }
  else if(rhs.is_null_count_known()){
    do_i_care_about_nulls = rhs.has_nulls();
  }
  else{
    do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls(); // invokes kernel
  }
}
```

In contrast, it could be:

```cpp
void some_function(column_view lhs, column_view rhs){
  bool do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls();
}
```

Will the former potentially have better performance? Probably. Is the latter vastly less complex and therefore easier to maintain? Definitely.

In general, adding an additional

To me, the potential performance impact of the

I can always be swayed to change my mind (and I'm not the final arbiter of what we do). If a majority of people think the

Personally, I'd need to see a profile that shows the |
@nvdbaranec ran across this cost when prototyping the alternative approach to `split`. Dave, can you provide the necessary details?

As for complexity, yes, it's necessarily more complex. No argument there. However, a utility method could help isolate the complexity to only a few places in the code. For example, there could be a variadic utility like:

```cpp
bool has_nulls(column_view const& c, ...);
```

This would hide the complexity for the many places in the code that only want to know whether to invoke the never-null kernel or to allocate a validity buffer and call the nullable one. |
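A sketch of what such a variadic helper might look like, using a C++ parameter pack rather than C-style varargs; `is_null_count_known()` here is the accessor proposed in this issue, not an existing libcudf API:

```cpp
// Hypothetical helper: returns true if any input column has (or may have) nulls,
// computing a null count via kernel only when it cannot be avoided.
template <typename... Views>
bool has_nulls(Views const&... views)
{
  // If any column with a known count has nulls, we are done: no kernel needed.
  bool any_known_nulls = ((views.is_null_count_known() && views.has_nulls()) || ...);
  if (any_known_nulls) { return true; }

  // If every count is known (and none had nulls), there are no nulls.
  bool all_known = (views.is_null_count_known() && ...);
  if (all_known) { return false; }

  // Otherwise fall back to computing the unknown counts (launches kernels).
  return (views.has_nulls() || ...);
}
```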
Right. So the test case was: 128k calls to a simple copy-in-place CUDA kernel.
* With all null_count() overhead eliminated, the time for these 128k calls was 500 milliseconds.
* Prior to eliminating that overhead, the total for 128k calls was 10,000 milliseconds, a factor of 20x. The difference came from 2 things:
  * The call to column_device_view::create() was quietly calling null_count() internally, and the incoming column had UNKNOWN_NULL_COUNT.
  * My code was doing the standard thing of declaring a device_scalar and initializing it to zero. This scalar was passed to the kernel and then the value was read back afterwards. So that's a memory allocation and two cudaMemcpy calls.
* By artificially bashing the null count on the incoming column to 0, I skipped the work of all the null_count() calls happening in column_device_view::create(). This dropped the time down to 3,600 ms.
So that's > 6 seconds of time disappearing into null_count() recomputation on 128k input columns. I did further testing on the remaining 3,600 ms and all of it (except the 500 ms of actual kernel time) came from the declaration of the scalar (an alloc plus a memcpy to initialize) and the reading back of the value at the end (another memcpy).
* By keeping a global device_scalar around and initializing the count to 0 in the kernel, the time dropped down to 2,700 ms. So at that point the remaining time was entirely in calling the .value() function, which I believe results in a cudaMemcpy back to main memory.
* Removing that .value() read dropped the total time down to 500 ms, all of which showed up as cleanly clustered kernel calls in nsight-sys.
The key takeaway here is that removing 128k null_count() calls dropped the time from 10,000 ms to 3,600 ms. Quite a bit.
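For illustration, a minimal standalone CUDA sketch of that pattern (names and sizes are made up; this is not the actual test code): one device counter is allocated once and reused across all launches, initialized on the device rather than by a per-call allocation and host-side memset, and read back only once at the end.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void copy_in_place(int* data, int n, int* null_count_out)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] = data[i]; }     // stand-in for the real copy work
  if (i == 0) { *null_count_out = 0; }  // result written on device; no host-side init needed
}

int main()
{
  int const n = 1 << 20;
  int *d_data, *d_count;
  cudaMalloc(&d_data, n * sizeof(int));
  cudaMalloc(&d_count, sizeof(int));        // allocated once, reused for every call

  for (int call = 0; call < 1024; ++call) { // 128k calls in the real experiment
    copy_in_place<<<(n + 255) / 256, 256>>>(d_data, n, d_count);
  }

  // Reading the value forces a device-to-host copy; in the experiment above a
  // per-call readback accounted for the remaining overhead, so do it only once.
  int h_count = 0;
  cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
  printf("null count: %d\n", h_count);

  cudaFree(d_data);
  cudaFree(d_count);
  return 0;
}
```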
|
Total incoming data size there was 6 GB, divided into 512 columns split 256 ways (512 × 256 = 131,072 output columns, i.e. the ~128k kernel calls above).
|
@nvdbaranec's use case doesn't quite convince me because it is a situation where The |
Are these independent copies, or are you copying from 128K columns into a single or a few columns? It seems to me that the desired optimization is to avoid unnecessarily computing null count on inputs if there is going to be a consolidation into fewer columns so that the final null count computation can be cheaper. Is that a correct assessment? |
@harrism Speaking of avoiding unnecessary computation of the null count: I just noticed that by combining column_view with the lazy null count, we are likely to recompute the null count for every operation. The column_view copies the null count from a column. If a column_view has no null count cached and one is needed, it will compute a new one, but it will not update the null count of the column it came from. Every API takes a column_view, not a column, and typically the view is recreated each time because views are supposed to be short-lived. So if I use the same column for more than one operation, the null count will likely be recomputed each time, especially if we have common system code that forces the null count to be computed for a lot of operations. |
This isn't quite right. Once a `column`'s null count has been computed, it is cached inside the column, so later views created from it do not recompute it.

Example:

```cpp
cudf::column c(...);        // internally, null_count == UNKNOWN_NULL_COUNT
column_view v1 = c;         // column::operator column_view() invokes column::null_count(), which launches a kernel to compute the null count
column_view v2 = c;         // c's null count is now known internally, no kernel is invoked
column_view v3 = c;         // c's null count is still known, no kernel
mutable_column_view m1 = c; // column::operator mutable_column_view() invalidates c's internal null count
column_view v4 = c;         // unless column::set_null_count() was invoked, c's internal null count is still unknown; invokes a kernel to compute it
```
|
Another thought. The root of this issue seems to stem from the fact that the zero-copy slice API returns a

I understand that the null count of each partition isn't strictly needed to satisfy |
Computing them in parallel will be a cost of mere milliseconds, so it makes sense. Moreover, I think it's a matter of a cub::DeviceSegmentedReduce with segment start/end index iterators and an input transform iterator which converts bit indices to word indices, masks off any unused portion of the current word, and uses an inverted |
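A minimal, hedged sketch of that idea, simplified to a per-bit (rather than per-word) transform iterator; the function and struct names are illustrative, not libcudf APIs:

```cuda
#include <cub/cub.cuh>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cuda_runtime.h>
#include <cstdint>

// Returns 1 for a null element (unset validity bit), 0 otherwise.
struct is_null_bit {
  uint32_t const* mask;  // validity bitmask: bit == 1 means valid
  __device__ int operator()(int bit_index) const {
    uint32_t word = mask[bit_index / 32];
    return ((word >> (bit_index % 32)) & 1u) ? 0 : 1;
  }
};

// Computes the null count of many bitmask segments with one segmented reduction.
// d_segment_offsets holds num_segments + 1 bit offsets delimiting the segments.
void segmented_null_counts(uint32_t const* d_mask,
                           int const* d_segment_offsets,
                           int num_segments,
                           int* d_null_counts,
                           cudaStream_t stream)
{
  auto in = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                            is_null_bit{d_mask});
  void* d_temp   = nullptr;
  size_t temp_sz = 0;
  // First call only queries the required temporary storage size.
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_sz, in, d_null_counts, num_segments,
                                  d_segment_offsets, d_segment_offsets + 1, stream);
  cudaMalloc(&d_temp, temp_sz);
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_sz, in, d_null_counts, num_segments,
                                  d_segment_offsets, d_segment_offsets + 1, stream);
  cudaFree(d_temp);
}
```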
null counts are sometimes needed in contexts where no stream is known. That is problematic because if the null count on a column has not yet been calculated, it will trigger a kernel launch and there is no way to indicate what stream that will occur on. To prevent this problem, we plan to change the null count of columns to something that must be provided on construction. That change will make this issue moot since it will require that a null count always be known. |
I am going to close this in favor of the discussion on #11968 since that is more in line with the current goals of libcudf. |
There is no way to tell if a null_count is set or not for a column. The reason I want this is so that when I send a column or table from one GPU to another I don't want to have to compute the null_count if it is not known, but I do want to send it if it is known.
The code to do this looks rather simple and I am happy to take a crack at it. I just want to be sure that this fits with the design goals of cudf before I get too far along with it.
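As a rough illustration of how simple the check could be, a hedged sketch: the member name `_null_count` and the standalone class are assumptions for illustration, and `UNKNOWN_NULL_COUNT` is the sentinel referenced elsewhere in this thread.

```cpp
#include <cstdint>

constexpr int32_t UNKNOWN_NULL_COUNT = -1;  // sentinel value, redefined here for the sketch

class column_view_sketch {
 public:
  // True if the cached null count is valid, i.e. null_count() would not need to
  // launch a kernel to recompute it.
  bool is_null_count_known() const noexcept { return _null_count != UNKNOWN_NULL_COUNT; }

  // When shipping a column to another GPU, the sender can skip an unknown count
  // and let the receiver defer its computation.
  int32_t cached_null_count() const noexcept { return _null_count; }

 private:
  int32_t _null_count = UNKNOWN_NULL_COUNT;
};
```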