[DISCUSSION] libcudf should not introspect input data to perform error checking #5505
Comments
👍 |
I'm fine with this, but it's worth bringing up that |
Ah, misunderstood -- this is about bounds checking rather than negative index transformation. I think both incur similar overhead though (cost of a binaryop), and it's worth taking into consideration that this will impact the performance of indexing in Python (cc: @kkraus14 ) |
Whether we check bounds in libcudf or Python, either way it's a kernel launch. I wouldn't expect it to be substantively more expensive to do the bounds check outside of the |
@shwina I think we can add this in the Cython as opposed to in the Python layer to amortize some of the typical Python overheads. |
Sounds good -- I'll benchmark and report here, and we can decide based on that? |
So I ran a quick benchmark. Here are the results: With libcudf bounds checking:
Without any bounds checking:
With cudf bounds checking:
Benchmark used (basically "reversing" a column by performing a gather):

    import timeit
    import cupy as cp
    import cudf

    for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
        a = cudf.Series(cp.arange(size))
        start = timeit.default_timer()
        # Time 10 gathers that reverse the column
        for i in range(10):
            result = a.iloc[cp.arange(size-1, -1, -1)]
        end = timeit.default_timer()
        print(f"size: {size} :: time: {end-start}")

The "cudf bounds checking" is implemented in Cython with as little overhead as possible. Basically it is:

    cdef data_type c_dtype = data_type(tid)
    cdef unique_ptr[aggregation] c_agg = move(make_aggregation("max"))
    cdef unique_ptr[scalar] c_reduce_result
    cdef Scalar sc = as_scalar(source_table._num_rows)
    cdef scalar* c_sc = sc.c_value.get()
    with nogil:
        c_reduce_result = move(
            cpp_reduce(
                gather_map_view,
                c_agg,
                c_dtype
            )
        )
    py_reduce_result = Scalar.from_unique_ptr(move(c_reduce_result))
    if py_reduce_result.value > 0:
        raise RuntimeError("Index out of bounds")
I tend to be +1 for cleaner code over small performance gains :) This represents about a 10-15% performance decrease. I think if libcudf had a binop+reduce primitive though (even just for numeric types), that would allow us to separate bounds checking from gather, and help with performance on the Python side. |
Agreed. The performance difference is minimal enough to not be concerning to me.
I think you could actually eliminate the binop by instead just doing a |
How silly of me :) Edited with numbers for doing just a |
I'll put in a PR to do bounds checking in Cython for both scatter and gather. I'm happy to also throw in bounds checking removal in C++. |
That said, I think you actually need to do a |
Are you also considering removing support for negative index values? |
No. I'm just saying that if you want to keep the same bounds checking logic that exists in libcudf, you need to check for |
Got it -- yup makes sense. |
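For illustration, here is a minimal sketch of the two styles of bounds check discussed above: an elementwise comparison followed by a reduction, and a pair of min/max reductions with no elementwise pass. It uses plain Python sequences as a stand-in for device columns, and it assumes that negative indices are allowed, so the valid range is `[-num_rows, num_rows)`. The function names are hypothetical and are not cuDF or libcudf APIs.

```python
from typing import Sequence


def in_bounds_binop(gather_map: Sequence[int], num_rows: int) -> bool:
    """Elementwise comparison (the "binop") followed by an `any` reduction."""
    # Assumption: negative indices wrap around, so [-num_rows, num_rows) is valid.
    out_of_bounds = [i < -num_rows or i >= num_rows for i in gather_map]
    return not any(out_of_bounds)


def in_bounds_minmax(gather_map: Sequence[int], num_rows: int) -> bool:
    """Two reductions (min and max) instead of a comparison pass."""
    if len(gather_map) == 0:
        return True  # an empty map has nothing out of bounds
    # Both ends must be checked once negative indices are permitted.
    return min(gather_map) >= -num_rows and max(gather_map) < num_rows
```

Both functions agree on every input; the second one simply avoids materializing an intermediate column of flags, which is the saving mentioned in the thread.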
We could even add a |
This issue has been labeled |
In the interest of making this PR actionable, I would like to collect all remaining instances of this pattern in libcudf so that we know what needs to be done to address this issue: |
@jrhemstad @davidwendt any other instances that you're aware of? Please feel free to add to the list above. In case we find some cases where this is happening without even an option to turn it off, we can always rip those out later. For now I just looked through public headers and didn't find any obvious examples other than these. |
I don't know of other instances. I didn't even know these existed :) The other actionable item I'd suggest is having something about this guidance in the dev docs somewhere. |
That's a good call. I'll make a note to add that to our dev docs. |
This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and then revisit a better place once we've tackled #11481. Partly addresses #5505, #1781. Resolves #4511.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
- Jake Hemstad (https://github.com/jrhemstad)
- Bradley Dice (https://github.com/bdice)
- David Wendt (https://github.com/davidwendt)

URL: #11853
This PR removes optional validation for some APIs. Performing these validations requires data introspection, which we do not want. This PR resolves #5505.

Authors:
- Vyas Ramasubramani (https://github.com/vyasr)
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Matthew Roeschke (https://github.com/mroeschke)
- David Wendt (https://github.com/davidwendt)
- Nghia Truong (https://github.com/ttnghia)
- Jason Lowe (https://github.com/jlowe)

URL: #11938
Is your feature request related to a problem? Please describe.
A few functions in libcudf optionally validate their input data before performing the operation. For example, `cudf::gather` (cudf/cpp/include/cudf/copying.hpp, lines 62 to 66 in 31d7466) has a `check_bounds` bool parameter that enables verifying that the values in the `gather_map` are within bounds. This requires launching a kernel to introspect the gather map data (cudf/cpp/src/copying/gather.cu, lines 30 to 39 in 31d7466).

The reason for this verification is that cuDF Python expects to throw an exception if any of the values are out of bounds.

However, there is no reason for libcudf to perform this verification directly inside the `gather` implementation. The cuDF Python bindings for `gather` can easily add this bounds checking as a pre-processing step.

`cudf::repeat` is a similar example of this behavior (cudf/cpp/include/cudf/filling.hpp, lines 118 to 122 in b72e647).
Having this bounds checking inside the libcudf function is detrimental for a number of reasons:

- The `gather` implementation is quite complicated because of all the branches inside, including whether or not we need to check bounds. It's not as simple as a single `if/else` because the `check_bounds` flag interacts with other internal flags such as `allow_negative_indices`. Moving this check outside of `gather` would significantly simplify its implementation.
- `thrust::gather` does not perform any validation of the map data.

Describe the solution you'd like
Optional input data validation like in `gather` or `repeat` should be eliminated from libcudf features.

Python features that rely on this bounds checking should implement it as a pre-processing step using existing libcudf primitives, or identify new primitives that would enable the necessary validation.
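As a rough sketch of the layering proposed here, the caller can validate the map first and then invoke a gather that does no validation of its own. Plain Python lists stand in for columns, and `unchecked_gather` and `checked_gather` are hypothetical names, not cuDF or libcudf functions. Python lists happen to raise on bad indices by themselves; a device kernel would not, which is why the explicit pre-processing step matters.

```python
from typing import List, Sequence


def unchecked_gather(source: Sequence[int], gather_map: Sequence[int]) -> List[int]:
    """Stand-in for a low-level gather that trusts its inputs."""
    return [source[i] for i in gather_map]


def checked_gather(source: Sequence[int], gather_map: Sequence[int]) -> List[int]:
    """Caller-side wrapper: validate as a pre-processing step, then gather."""
    n = len(source)
    # Assumption: negative indices wrap, so valid values lie in [-n, n).
    if len(gather_map) > 0 and (min(gather_map) < -n or max(gather_map) >= n):
        raise IndexError("Index out of bounds")
    return unchecked_gather(source, gather_map)
```

With this split, callers that already know their map is valid can call the unchecked function directly and skip the cost of the check.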
Additional context
To be clear, I am not suggesting that libcudf functions perform no error checking whatsoever. I am suggesting we remove any error checking that requires introspection of input data (i.e., launching a kernel). Checks that the data has the right data type, size, etc. must still be preserved. An obvious indication that a function does this today is an optional boolean flag like those in `gather` or `repeat`.
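To make the distinction concrete, the sketch below shows a check that reads only metadata (the data type) and never the values themselves. The `Column` class, the `INTEGER_DTYPES` set, and `validate_gather_metadata` are all hypothetical and exist only for this illustration; they are not how libcudf represents columns.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    """Toy column: a dtype tag plus values (hypothetical, for illustration only)."""
    dtype: str
    data: List[float]

    @property
    def size(self) -> int:
        return len(self.data)


INTEGER_DTYPES = {"int8", "int16", "int32", "int64"}


def validate_gather_metadata(source: Column, gather_map: Column) -> None:
    """Metadata-only validation: inspects the dtype, never the values in `data`."""
    if gather_map.dtype not in INTEGER_DTYPES:
        raise TypeError(f"gather map must be an integer column, got {gather_map.dtype}")
```

A map with the right type but out-of-range values passes this check, since detecting those values would require reading the data.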