[FEA] provide API to check if null_count is known #3579
Comments
Yes, it would be easy to do, but I'm pretty reluctant to do something like this. It kind of defeats the point of the abstraction. Users aren't really supposed to know or care what the internal state of the column is as it's an implementation detail.
The only reason I can imagine why you wouldn't want to compute the |
I guess that works. I'll just stop sending the null count altogether and we can revisit it if it becomes an issue in the future. |
Yes, but it's an expensive detail that is abstracted in a way that appears to be cheap. Abstracting away the performance implications seems problematic for a performance-oriented library.

For example, consider column concatenation. If all of the input columns have a cached null count, then the CPU can trivially compute the final null count as it iterates the columns to prepare for the concatenation kernel. If one or more columns have an unknown null count, it is probably more efficient to mark the result column with an unknown null count and defer the computation. However, always marking the result column unknown means we will always need to run a kernel to find the null count, even when all of the concatenation inputs had known counts. Many algorithms simply want to know whether they need to deal with validity at all, rather than the actual null counts.

As another example, consider an algorithm that takes two inputs, the first with an unknown null count and the second with a known null count > 0. It doesn't make sense to run a kernel on the first input to compute its null count, given the second input already tells the algorithm it needs to deal with validity. IMO an algorithm implementation should only compute the null count of its inputs if all of the inputs have unknown null counts and it's cheaper to compute the null counts than it is to just deal with the validity.

I'm also a bit surprised the |
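A minimal sketch of the concatenation bookkeeping described above, assuming a hypothetical `column_meta` struct with an optional cached null count (these are illustrative types, not libcudf APIs):

```cpp
#include <optional>
#include <vector>

// Hypothetical per-column metadata for illustration only.
struct column_meta {
  int size;
  std::optional<int> cached_null_count;  // empty == null count unknown
};

// If every input has a cached null count, the output's count is just the sum;
// otherwise leave it unknown and defer the (kernel-based) computation.
std::optional<int> concat_null_count(std::vector<column_meta> const& inputs)
{
  int total = 0;
  for (auto const& c : inputs) {
    if (!c.cached_null_count) { return std::nullopt; }  // defer: unknown
    total += *c.cached_null_count;
  }
  return total;  // known without launching any kernel
}
```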
Let's assume we add `is_null_count_known()`. In your example of 2 inputs:

```cpp
void some_function(column_view lhs, column_view rhs){
  bool do_i_care_about_nulls = false;
  if(lhs.is_null_count_known() && rhs.is_null_count_known()){
    // null counts don't require a kernel
    do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls();
  }
  else if(lhs.is_null_count_known()){
    do_i_care_about_nulls = lhs.has_nulls();
  }
  else if(rhs.is_null_count_known()){
    do_i_care_about_nulls = rhs.has_nulls();
  }
  else{
    do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls(); // invokes kernel
  }
}
```

In contrast, it could be:

```cpp
void some_function(column_view lhs, column_view rhs){
  bool do_i_care_about_nulls = lhs.has_nulls() or rhs.has_nulls();
}
```

Will the former potentially have better performance? Probably. Is the latter vastly less complex and therefore easier to maintain? Definitely.

In general, adding an additional

To me, the potential performance impact of the

I can always be swayed to change my mind (and I'm not the final arbiter of what we do). If a majority of people think the

Personally, I'd need to see a profile that shows the |
@nvdbaranec ran across this cost when prototyping the alternative approach to `split`. Dave, can you provide the necessary details?

As for complexity, yes, it's necessarily more complex. No argument there. However, a utility method could help isolate the complexity to only a few places in the code. For example, there could be a variadic utility like:

```cpp
bool has_nulls(column_view const& c, ...);
```

This would hide the complexity for the many places in the code that only want to know whether to invoke the never-null kernel or to allocate a validity buffer and call the nullable one. |
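A sketch of what such a variadic helper might look like, using a C++ parameter pack rather than C-style varargs; `is_null_count_known()` here is the accessor proposed in this issue, not an existing libcudf API:

```cpp
// Hypothetical helper: returns true if any input column has (or may have) nulls,
// computing a null count via kernel only when it cannot be avoided.
template <typename... Views>
bool has_nulls(Views const&... views)
{
  // If any column with a known count has nulls, we are done: no kernel needed.
  bool any_known_nulls = ((views.is_null_count_known() && views.has_nulls()) || ...);
  if (any_known_nulls) { return true; }

  // If every count is known (and none had nulls), there are no nulls.
  bool all_known = (views.is_null_count_known() && ...);
  if (all_known) { return false; }

  // Otherwise fall back to computing the unknown counts (launches kernels).
  return (views.has_nulls() || ...);
}
```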
Right. So the test case was: 128k calls to a simple copy-in-place CUDA kernel.
* With all null_count() overhead eliminated, the time for these 128k calls was 500 milliseconds.
* Prior to eliminating that overhead, the total for 128k calls was 10,000 milliseconds, a factor of 20x. The difference came from 2 things:
  * The call to column_device_view::create() was quietly calling null_count() internally, and the incoming column had UNKNOWN_NULL_COUNT.
  * My code was doing the standard thing of declaring a device_scalar and initializing it to zero. This scalar was passed to the kernel and then the value was read back afterwards. So that's a memory allocation and two cudaMemcpy calls.
* By artificially bashing the null count on the incoming column to 0, I skipped the work of all the null_count() calls happening in column_device_view::create(). This dropped the time down to 3,600 ms.
So that's > 6 seconds of time disappearing into null_count() recomputation on 128k input columns. I did further testing on the remaining 3,600 ms and all of it (except the 500 ms of actual kernel time) came from the declaration of the scalar (an alloc plus a memcpy to initialize) and the reading back of the value at the end (another memcpy).
* By keeping a global device_scalar around and initializing the count to 0 in the kernel, the time dropped down to 2,700 ms. So at that point the remaining time was entirely in calling the .value() function, which I believe results in a cudaMemcpy back to main memory.
* Removing that .value() read dropped the total time down to 500 ms, all of which showed up as cleanly clustered kernel calls in nsight-sys.
The key takeaway here is that removing 128k null_count() calls dropped the time from 10,000 ms to 3,600 ms. Quite a bit.
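For illustration, a minimal standalone CUDA sketch of that pattern (names and sizes are made up; this is not the actual test code): one device counter is allocated once and reused across all launches, initialized on the device rather than by a per-call allocation and host-side memset, and read back only once at the end.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void copy_in_place(int* data, int n, int* null_count_out)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] = data[i]; }     // stand-in for the real copy work
  if (i == 0) { *null_count_out = 0; }  // result written on device; no host-side init needed
}

int main()
{
  int const n = 1 << 20;
  int *d_data, *d_count;
  cudaMalloc(&d_data, n * sizeof(int));
  cudaMalloc(&d_count, sizeof(int));        // allocated once, reused for every call

  for (int call = 0; call < 1024; ++call) { // 128k calls in the real experiment
    copy_in_place<<<(n + 255) / 256, 256>>>(d_data, n, d_count);
  }

  // Reading the value forces a device-to-host copy; in the experiment above a
  // per-call readback accounted for the remaining overhead, so do it only once.
  int h_count = 0;
  cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
  printf("null count: %d\n", h_count);

  cudaFree(d_data);
  cudaFree(d_count);
  return 0;
}
```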
|
Total incoming data size there was 6 GB, divided into 512 columns split 256 ways (512 × 256 = 131,072 output columns, i.e. the ~128k kernel calls above).
|
@nvdbaranec's use case doesn't quite convince me because it is a situation where The |
Are these independent copies, or are you copying from 128K columns into a single or a few columns? It seems to me that the desired optimization is to avoid unnecessarily computing null count on inputs if there is going to be a consolidation into fewer columns so that the final null count computation can be cheaper. Is that a correct assessment? |
@harrism Speaking of avoiding unnecessary computation of the null count: I just noticed that by combining column_view with the lazy null count, we are likely to recompute the null count for every operation. The column_view copies the null count from a column. If a column_view has no null count cached and one is needed, it will compute a new one, but it will not update the null count of the column it came from. Every API takes a column_view, not a column, and typically the view is recreated each time because views are supposed to be short-lived. So if I use the same column for more than one operation, the null count will likely be recomputed each time, especially if we have common system code that forces the null count to be computed for a lot of operations. |
This isn't quite right. Once a `column`'s null count has been computed, it is cached inside the column, so later views created from it do not recompute it.

Example:

```cpp
cudf::column c(...);        // internally, null_count == UNKNOWN_NULL_COUNT
column_view v1 = c;         // column::operator column_view() invokes column::null_count(), which launches a kernel to compute the null count
column_view v2 = c;         // c's null count is now known internally, no kernel is invoked
column_view v3 = c;         // c's null count is still known, no kernel
mutable_column_view m1 = c; // column::operator mutable_column_view() invalidates c's internal null count
column_view v4 = c;         // unless column::set_null_count() was invoked, c's internal null count is still unknown; invokes a kernel to compute it
```
|
Another thought. The root of this issue seems to stem from the fact that the zero-copy slice API returns a

I understand that the null count of each partition isn't strictly needed to satisfy |
Computing them in parallel will be a cost of mere milliseconds, so it makes sense. Moreover, I think it's a matter of a cub::DeviceSegmentedReduce with segment start/end index iterators and an input transform iterator which converts bit indices to word indices, masks off any unused portion of the current word, and uses an inverted |
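A minimal, hedged sketch of that idea, simplified to a per-bit (rather than per-word) transform iterator; the function and struct names are illustrative, not libcudf APIs:

```cuda
#include <cub/cub.cuh>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <cuda_runtime.h>
#include <cstdint>

// Returns 1 for a null element (unset validity bit), 0 otherwise.
struct is_null_bit {
  uint32_t const* mask;  // validity bitmask: bit == 1 means valid
  __device__ int operator()(int bit_index) const {
    uint32_t word = mask[bit_index / 32];
    return ((word >> (bit_index % 32)) & 1u) ? 0 : 1;
  }
};

// Computes the null count of many bitmask segments with one segmented reduction.
// d_segment_offsets holds num_segments + 1 bit offsets delimiting the segments.
void segmented_null_counts(uint32_t const* d_mask,
                           int const* d_segment_offsets,
                           int num_segments,
                           int* d_null_counts,
                           cudaStream_t stream)
{
  auto in = thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                            is_null_bit{d_mask});
  void* d_temp   = nullptr;
  size_t temp_sz = 0;
  // First call only queries the required temporary storage size.
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_sz, in, d_null_counts, num_segments,
                                  d_segment_offsets, d_segment_offsets + 1, stream);
  cudaMalloc(&d_temp, temp_sz);
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_sz, in, d_null_counts, num_segments,
                                  d_segment_offsets, d_segment_offsets + 1, stream);
  cudaFree(d_temp);
}
```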
null counts are sometimes needed in contexts where no stream is known. That is problematic because if the null count on a column has not yet been calculated, it will trigger a kernel launch and there is no way to indicate what stream that will occur on. To prevent this problem, we plan to change the null count of columns to something that must be provided on construction. That change will make this issue moot since it will require that a null count always be known. |
I am going to close this in favor of the discussion on #11968 since that is more in line with the current goals of libcudf. |
There is no way to tell if a null_count is set or not for a column. The reason I want this is so that when I send a column or table from one GPU to another I don't want to have to compute the null_count if it is not known, but I do want to send it if it is known.
The code to do this looks rather simple and I am happy to take a crack at it. I just want to be sure that this fits with the design goals of cudf before I get too far along with it.
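As a rough illustration of how simple the check could be, a hedged sketch: the member name `_null_count` and the standalone class are assumptions for illustration, and `UNKNOWN_NULL_COUNT` is the sentinel referenced elsewhere in this thread.

```cpp
#include <cstdint>

constexpr int32_t UNKNOWN_NULL_COUNT = -1;  // sentinel value, redefined here for the sketch

class column_view_sketch {
 public:
  // True if the cached null count is valid, i.e. null_count() would not need to
  // launch a kernel to recompute it.
  bool is_null_count_known() const noexcept { return _null_count != UNKNOWN_NULL_COUNT; }

  // When shipping a column to another GPU, the sender can skip an unknown count
  // and let the receiver defer its computation.
  int32_t cached_null_count() const noexcept { return _null_count; }

 private:
  int32_t _null_count = UNKNOWN_NULL_COUNT;
};
```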