[FEA] provide API to check if null_count is known #3579

Closed
revans2 opened this issue Dec 10, 2019 · 15 comments
Labels
feature request (New feature or request) · Spark (Functionality that helps Spark RAPIDS)

Comments

@revans2
Contributor

revans2 commented Dec 10, 2019

There is no way to tell if a null_count is set or not for a column. The reason I want this is so that when I send a column or table from one GPU to another I don't want to have to compute the null_count if it is not known, but I do want to send it if it is known.

The code to do this looks rather simple and I am happy to take a crack at it. I just want to be sure that this fits with the design goals of cudf before I get too far along with it.
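
For concreteness, here is roughly what the requested check could look like (an editorial sketch, not existing cudf code; the class below is only a stand-in for cudf::column / cudf::column_view, which already cache a null count using UNKNOWN_NULL_COUNT as the "not yet computed" sentinel):

#include <cudf/types.hpp>

// Stand-in for cudf::column / cudf::column_view, just to show the shape of the
// proposed accessor.
class column_like {
 public:
  // Proposed API: a cheap, host-only check that never launches a kernel.
  bool is_null_count_known() const noexcept
  {
    return _null_count != cudf::UNKNOWN_NULL_COUNT;
  }

 private:
  cudf::size_type _null_count = cudf::UNKNOWN_NULL_COUNT;
};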

@revans2 revans2 added the feature request (New feature or request), Needs Triage (Need team to review and classify), Spark (Functionality that helps Spark RAPIDS), and libcudf++ labels on Dec 10, 2019
@jrhemstad
Contributor

I just want to be sure that this fits with the design goals of cudf before I get too far along with it

Yes, it would be easy to do, but I'm pretty reluctant to do something like this. It kind of defeats the point of the abstraction. Users aren't really supposed to know or care what the internal state of the column is as it's an implementation detail.

when I send a column or table from one GPU to another I don't want to have to compute the null_count if it is not known, but I do want to send it if it is known.

The only reason I can imagine for not wanting to compute null_count() at the origin is that the destination doesn't care about it; otherwise it doesn't seem to matter whether it gets computed at the origin or the destination. In that case, just send UNKNOWN_NULL_COUNT and avoid invoking null_count() at the origin.
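
As a sketch of that suggestion (the helper name and transport details here are hypothetical, not a cudf API), the receiver can rebuild its view with UNKNOWN_NULL_COUNT and let the count be computed only if and when something actually asks for it:

#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>

// Hypothetical receive-side helper: wrap transferred device buffers in a view
// without forcing a null count computation at either end of the transfer.
cudf::column_view view_from_received_buffers(cudf::data_type type,
                                             cudf::size_type num_rows,
                                             void const* data,
                                             cudf::bitmask_type const* null_mask)
{
  // UNKNOWN_NULL_COUNT defers the (possibly unnecessary) kernel to the first
  // caller of null_count() on the destination GPU.
  return cudf::column_view{type, num_rows, data, null_mask, cudf::UNKNOWN_NULL_COUNT};
}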

@revans2
Contributor Author

revans2 commented Dec 10, 2019

I guess that works. I'll just stop sending the null count altogether and we can revisit it if it becomes an issue in the future.

@revans2 revans2 closed this as completed Dec 10, 2019
@jlowe
Member

jlowe commented Dec 11, 2019

Users aren't really supposed to know or care what the internal state of the column is as it's an implementation detail.

Yes, but it's an expensive detail that is abstracted in a way that makes it appear cheap. Hiding the performance implications seems like it could be problematic for a performance-oriented library.

For example, consider column concatenation. If all of the columns have a cached null count, then the CPU can trivially compute the final null count as it iterates the columns to prepare for the concatenation kernel. If one or more columns have an unknown null count, it is probably more efficient to mark the final concatenated column with an unknown null count and defer the computation. However, always marking the result column unknown means we will always need to run a kernel to find the null count, even if all of the concatenation inputs had known counts.
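
As a sketch of that bookkeeping, assuming the requested is_null_count_known() accessor existed on column_view (the helper name is made up):

#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>

#include <algorithm>
#include <numeric>
#include <vector>

// If every input has a cached count, the output count is a cheap host-side sum;
// otherwise mark the result unknown and defer the counting kernel.
cudf::size_type concatenated_null_count(std::vector<cudf::column_view> const& inputs)
{
  bool const all_known = std::all_of(inputs.begin(), inputs.end(), [](auto const& c) {
    return c.is_null_count_known();  // hypothetical accessor under discussion
  });
  if (!all_known) { return cudf::UNKNOWN_NULL_COUNT; }
  return std::accumulate(inputs.begin(), inputs.end(), cudf::size_type{0},
                         [](cudf::size_type sum, auto const& c) { return sum + c.null_count(); });
}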

Many algorithms simply want to know whether they need to deal with validity at all rather than the exact null counts. As another example, consider an algorithm that takes two inputs, the first with an unknown null count and the second with a known null count > 0. It doesn't make sense to run a kernel on the first input to compute its null count when the second already tells us the algorithm needs to deal with validity. IMO the algorithm implementation should only compute the null counts of its inputs if all of the inputs have unknown null counts and it's cheaper to compute the counts than it is to just deal with the validity.

I'm also a bit surprised that value_accessor uses has_nulls() in a CUDF_EXPECTS. has_nulls() is implemented in terms of null_count(), so I believe this triggers a kernel invocation, a synchronization, and a device-to-host copy for any column with an unknown null count.

@jrhemstad
Contributor

jrhemstad commented Dec 11, 2019

As another example, consider an algorithm that takes 2 inputs, the first with an unknown null count and the second with a known null count > 0. It doesn't make sense to run a kernel on the first input to compute the null count given the second trivially knows the algorithm needs to deal with validity.

Let's assume we add column::is_null_count_known().

In your example of 2 inputs lhs, rhs, we'd have to do something like:

void some_function(column_view lhs, column_view rhs){
   bool do_i_care_about_nulls = false;
   if(lhs.is_null_count_known() && rhs.is_null_count_known()){
      // null counts don't require a kernel
      do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls();
   }
   else if(lhs.is_null_count_known()){
      do_i_care_about_nulls = lhs.has_nulls();
   }
   else if(rhs.is_null_count_known()){
      do_i_care_about_nulls = rhs.has_nulls();
   }
   else{
      do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls(); // invokes kernel
   }
}

In contrast, it could be:

void some_function(column_view lhs, column_view rhs){
   bool do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls();
}

Will the former potentially have better performance? Probably.

Is the latter vastly less complex and therefore easier to maintain? Definitely.

In general, adding an additional is_null_count_known() predicate increases the cyclomatic complexity by a factor of 2^n (where n is the number of columns being considered).

To me, the potential performance impact of the null_count() abstraction is worth it because of how much it simplifies code (and therefore reduces maintenance costs and potential bugs).

I can always be swayed to change my mind (and I'm not the final arbiter of what we do). If a majority of people think the null_count() abstraction is too costly, we can revisit it.

Personally, I'd need to see a profile that shows the null_count() abstraction has a significant performance impact in a real workflow before I could be swayed. Otherwise this feels like premature optimization.

@jrhemstad jrhemstad reopened this Dec 11, 2019
@jrhemstad jrhemstad removed the Needs Triage (Need team to review and classify) label on Dec 11, 2019
@jlowe
Member

jlowe commented Dec 11, 2019

@nvdbaranec ran across this cost when prototyping the alternative approach to split. Dave, can you provide the necessary details?

As for complexity, yes, it's necessarily more complex. No argument there. However, a utility method could help isolate the complexity to only a few places in the code. For example, there could be a variadic utility like:

  bool has_nulls(column_view const& c, ...);

This would hide the complexity for the many places in the code that only need to know whether to invoke the never-null kernel or to allocate a validity buffer and call the nullable one.
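
A sketch of such a helper, assuming the proposed is_null_count_known() existed (the fold expressions short-circuit left to right, so cached counts are consulted before anything that could launch a kernel):

#include <cudf/column/column_view.hpp>

// Hypothetical variadic helper: "might any of these columns contain nulls?"
template <typename... Views>
bool any_may_have_nulls(Views const&... views)
{
  // Cheap pass: any cached, non-zero count settles the question without a kernel.
  if ((... || (views.is_null_count_known() && views.has_nulls()))) { return true; }
  // Fallback: columns with a cached count of zero are skipped; only columns with an
  // unknown count pay for has_nulls(), which may compute the count on the device.
  return (... || (!views.is_null_count_known() && views.has_nulls()));
}

A caller would then write something like if (any_may_have_nulls(lhs, rhs)) { ... } at the few dispatch points that choose between the nullable and never-null kernels.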

@nvdbaranec
Contributor

nvdbaranec commented Dec 11, 2019 via email

@nvdbaranec
Contributor

nvdbaranec commented Dec 11, 2019 via email

@jrhemstad
Contributor

jrhemstad commented Dec 11, 2019

@nvdbaranec's use case doesn't quite convince me because it is a situation where column_device_view::create() should probably be avoided anyway (independent of the null_count() issue), since it will also incur a device memory allocation/copy if the input is a compound column (like strings). If it's intended to be used with such an extreme number of columns, then it should probably be done in a different way.

The device_scalar component is independent of the null_count() issue.

@harrism
Member

harrism commented Dec 12, 2019

Are these independent copies, or are you copying from 128K columns into a single or a few columns? It seems to me that the desired optimization is to avoid unnecessarily computing null count on inputs if there is going to be a consolidation into fewer columns so that the final null count computation can be cheaper. Is that a correct assessment?

@revans2
Contributor Author

revans2 commented Dec 12, 2019

@harrism Speaking of avoiding unnecessary null count computation: I just noticed that combining column_view with the lazy null count means we are likely to recompute the null count for every operation.

The column_view copies the null count from a column. If a column_view has no cached null count and one is needed, it will compute a new one, but it will not update the null count of the column it came from. Every API takes a column_view, not a column, and the view is typically recreated each time because views are supposed to be short-lived. So if I use the same column for more than one operation, the null count will likely be recomputed each time, especially if we have common system code that forces the null count to be computed for many operations.

@jrhemstad
Contributor

So if I am going to use the same column for more than one operation the null count will likely be recomputed each time

This isn't quite right. Once a column's null count is computed, it is cached until it is invalidated (either via operator mutable_column_view() or by explicitly calling set_null_count(UNKNOWN_NULL_COUNT)).

Example:

cudf::column c(...); // internally, null_count == UNKNOWN_NULL_COUNT

column_view v1 = c; // column::operator column_view() will invoke column::null_count() that invokes a kernel to compute null count

column_view v2 = c; // c's null count is known internally, no kernel is invoked

column_view v3 = c; // c's null count is still known, no kernel

mutable_column_view m1 = c; // column::operator mutable_column_view() invalidates c's internal null count

column_view v4 = c; // Unless column::set_null_count() was invoked, c's internal null count is still unknown. Invokes a kernel to compute it

@jrhemstad
Contributor

jrhemstad commented Dec 12, 2019

Another thought.

The root of this issue seems to stem from the fact that the zero-copy slice API returns a vector<column_view> where each column_view's internal null count is unknown. We could probably write a kernel that computes each view's null count in parallel, so that the null counts of the column_views returned from slice/split are all already known.

I understand that the null count of each partition isn't strictly needed to satisfy contiguous_split, but it seems like that just moves the problem of computing null_count() to the consumer of the contiguous_split partitions. Either way, those null counts need to be computed, whether it's in the slice, in the contiguous_split, or on the receiving end by the consumer of the contiguous_split. So why not compute them when we have the opportunity to do it in a single kernel in the slice, as opposed to a bunch of tiny kernel calls (either at the producer or the consumer)?

@harrism
Member

harrism commented Dec 13, 2019

Computing them in parallel will cost mere milliseconds, so it makes sense. Moreover, I think it's a matter of a cub::DeviceSegmentedReduce with segment start/end index iterators and an input transform iterator that converts bit indices to word indices, masks off any unused portion of the current word, and uses an inverted popc to return the number of null bits in the word. Done.
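
A simplified sketch of that idea (one Thrust thread per output view rather than the cub::DeviceSegmentedReduce formulation above, with segment boundaries passed as plain bit offsets), just to show all the per-view counts coming out of a single launch:

#include <cudf/types.hpp>

#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/transform.h>

// One null count per segment in a single launch. Requires nvcc extended lambdas.
void segmented_null_counts(cudf::bitmask_type const* mask,        // validity bits, 1 == valid
                           cudf::size_type const* segment_begin,  // inclusive bit offsets
                           cudf::size_type const* segment_end,    // exclusive bit offsets
                           cudf::size_type* null_counts,          // one output per segment
                           cudf::size_type num_segments)
{
  thrust::transform(thrust::device,
                    thrust::counting_iterator<cudf::size_type>(0),
                    thrust::counting_iterator<cudf::size_type>(num_segments),
                    null_counts,
                    [mask, segment_begin, segment_end] __device__(cudf::size_type s) {
                      cudf::size_type nulls = 0;
                      for (auto bit = segment_begin[s]; bit < segment_end[s]; ++bit) {
                        auto const word = mask[bit / 32];
                        if (((word >> (bit % 32)) & 1u) == 0u) { ++nulls; }  // unset bit == null
                      }
                      return nulls;
                    });
}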

@vyasr
Contributor

vyasr commented Oct 17, 2022

Null counts are sometimes needed in contexts where no stream is known. That is problematic because if the null count on a column has not yet been calculated, it will trigger a kernel launch with no way to indicate which stream that launch occurs on. To prevent this problem, we plan to change the null count of columns to something that must be provided on construction. That change will make this issue moot, since it will require that a null count always be known.
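
As an illustration of that direction (a sketch, not the final libcudf signature), the count would be computed or already known before the view is built, so nothing about it is lazy:

#include <cudf/column/column_view.hpp>
#include <cudf/types.hpp>

// Hypothetical helper showing the shape of the planned API: the null count is a
// required argument, so a view can never be in an "unknown" state that would force
// a lazy kernel launch on an unspecified stream.
cudf::column_view make_view(cudf::data_type type,
                            cudf::size_type num_rows,
                            void const* data,
                            cudf::bitmask_type const* null_mask,
                            cudf::size_type null_count)  // must be supplied by the caller
{
  return cudf::column_view{type, num_rows, data, null_mask, null_count};
}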

@vyasr vyasr added this to the Enable streams milestone Oct 17, 2022
@vyasr
Contributor

vyasr commented Oct 21, 2022

I am going to close this in favor of the discussion on #11968 since that is more in line with the current goals of libcudf.

@vyasr vyasr closed this as completed Oct 21, 2022