Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSSION] libcudf should not introspect input data to perform error checking #5505

Closed
jrhemstad opened this issue Jun 18, 2020 · 23 comments · Fixed by #11938
Closed

[DISCUSSION] libcudf should not introspect input data to perform error checking #5505

jrhemstad opened this issue Jun 18, 2020 · 23 comments · Fixed by #11938
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code Python Affects Python cuDF API.

Comments

@jrhemstad
Copy link
Contributor

Is your feature request related to a problem? Please describe.

A few functions in libcudf optionally validate that input data is valid before performing the operation. For example,

std::unique_ptr<table> gather(
table_view const& source_table,
column_view const& gather_map,
bool check_bounds = false,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

cudf::gather has a check_bounds bool parameter that enables verifying if the values in the gather_map are within bounds. This requires launching a kernel to introspect the gather map data:

if (bounds == out_of_bounds_policy::FAIL) {
cudf::size_type begin =
neg_indices == negative_index_policy::ALLOWED ? -source_table.num_rows() : 0;
CUDF_EXPECTS(num_destination_rows ==
thrust::count_if(rmm::exec_policy()->on(0),
gather_map.begin<map_type>(),
gather_map.end<map_type>(),
bounds_checker<map_type>{begin, source_table.num_rows()}),
"Index out of bounds.");
}

The reason for this verification is that cuDF Python expects to throw an exception if any of the values are out of bounds.

However, there is no reason for libcudf to be performing this verification directly inside of the gather implementation. The cuDF Python bindings for gather can easily add this bounds checking as a pre-processing step.

cudf::repeat is a similar example of this behavior:

std::unique_ptr<table> repeat(
table_view const& input_table,
column_view const& count,
bool check_count = false,
rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

Having this bounds checking inside the libcudf function is detrimental for a number of reasons:

  1. It complicates the libcudf implementation. The gather implementation is quite complicated because of all the branches inside, including whether or not we need to check bounds. It's not as simple as just a single if/else because the check_bounds flag interacts with other internal flags such as allow_negative_indices. It would pretty significantly simplify gathers implementation to move this check outside of gather's implementation.
  2. It violates the single responsibility principle. Verifying the validity of input data should be independent of performing the actual operation.
  3. It is not in line with the semantics of other C++ libraries. For example, thrust::gather does not perform any validation of the map data.

Describe the solution you'd like

Optional input data validation like in gather or repeat should be eliminated from libcudf features.

Python features that rely on this bounds checking should implement it as a pre-processing step using existing libcudf primitives, or identify new primitives that would enable the necessary validation.

Additional context

To be clear, I am not suggesting libcudf functions do not perform any error checking whatsoever. I am suggesting we remove any error checking that requires introspection of input data (i.e., launching a kernel). Performing checks for data to be the right data type, size, etc. must still be preserved. An obvious indication of functions doing this today are optional boolean flags like in gather or repeat.

@jrhemstad jrhemstad added feature request New feature or request proposal Change current process or code libcudf Affects libcudf (C++/CUDA) code. labels Jun 18, 2020
@harrism
Copy link
Member

harrism commented Jul 20, 2020

👍

@shwina
Copy link
Contributor

shwina commented Jul 22, 2020

I'm fine with this, but it's worth bringing up that gather performance is critical to performance in some applications, and the pre-processing associated with negative indexes was previously found to be a bottleneck: #2675

@kkraus14 kkraus14 added the Python Affects Python cuDF API. label Jul 22, 2020
@shwina
Copy link
Contributor

shwina commented Jul 22, 2020

Ah, misunderstood -- this is about bounds checking rather than negative index transformation. I think both incur similar overhead though (cost of a binaryop), and it's worth taking into consideration that this will impact the performance of indexing in Python (cc: @kkraus14 )

@jrhemstad
Copy link
Contributor Author

Ah, misunderstood -- this is about bounds checking rather than negative index transformation. I think both incur similar overhead though (cost of a binaryop), and it's worth taking into consideration that this will impact the performance of indexing in Python (cc: @kkraus14 )

Whether we check bounds in libcudf or Python, either way it's a kernel launch. I wouldn't expect it to be substantively more expensive to do the bounds check outside of the gather call.

@kkraus14
Copy link
Collaborator

@shwina I think we can add this in the Cython as opposed to in the Python layer to amortize some of the typical Python overheads.

@shwina
Copy link
Contributor

shwina commented Jul 22, 2020

Sounds good -- I'll benchmark and report here, and we can decide based on that?

@shwina
Copy link
Contributor

shwina commented Jul 23, 2020

So I ran a quick benchmark. Here are the results:

With libcudf bounds checking:

size: 100 :: time: 0.007442206988343969
size: 1000 :: time: 0.006695960997603834
size: 10000 :: time: 0.006788923987187445
size: 100000 :: time: 0.009929071005899459
size: 1000000 :: time: 0.028811413008952513
size: 10000000 :: time: 0.06236411599093117

Without any bounds checking:

size: 100 :: time: 0.006547413009684533
size: 1000 :: time: 0.005977320979582146
size: 10000 :: time: 0.005807141977129504
size: 100000 :: time: 0.008778560993960127
size: 1000000 :: time: 0.020219083002302796
size: 10000000 :: time: 0.0580876559833996

With cudf bounds checking:

size: 100 :: time: 0.010622989007970318
size: 1000 :: time: 0.010242449003271759
size: 10000 :: time: 0.009886790998280048
size: 100000 :: time: 0.01461042498704046
size: 1000000 :: time: 0.026591391011606902
size: 10000000 :: time: 0.0630068660248071

Benchmark used (basically "reversing" a column by performing a gather):

import timeit
import cupy as cp
import cudf

for size in [100, 1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    a = cudf.Series(cp.arange(size))

    start = timeit.default_timer()
    for i in range(10):
        result = a.iloc[cp.arange(size-1, -1, -1)]
    end = timeit.default_timer()

    print(f"size: {size} :: time: {end-start}")

The "cudf bounds checking" is implemented in Cython with as little overhead as possible. Basically it is a max reduction (max(gather_map)).

    cdef data_type c_dtype = data_type(tid)
    cdef unique_ptr[aggregation] c_agg = move(make_aggregation("max"))
    cdef unique_ptr[scalar] c_reduce_result
    cdef Scalar sc = as_scalar(source_table._num_rows)
    cdef scalar* c_sc = sc.c_value.get()

    with nogil:

        c_reduce_result = move(
            cpp_reduce(
                gather_map_view,
                c_agg,
                c_dtype
            )
        )

    py_reduce_result = Scalar.from_unique_ptr(move(c_reduce_result))

    if py_reduce_result.value > 0:
        raise RuntimeError("Index out of bounds")

@shwina
Copy link
Contributor

shwina commented Jul 23, 2020

I tend to be +1 for cleaner code over small performance gains :) This represents about a 10-15% performance decrease.

I think if libcudf had a binop+reduce primitive though (even just for numeric types), that would allow us to separate bounds checking from gather, and help with performance on the Python side.

@jrhemstad
Copy link
Contributor Author

I tend to be +1 for cleaner code over small performance gains :) This represents about a 10-15% performance decrease.

Agreed. The performance difference is minimal enough to not be concerning to me.

I think if libcudf had a binop+reduce primitive though (even just for numeric types), that would allow us to separate bounds checking from gather, and help with performance on the Python side.

I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

@shwina
Copy link
Contributor

shwina commented Jul 23, 2020

I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

How silly of me :) Edited with numbers for doing just a max. The difference is even less concerning now (especially for larger problem sizes).

@shwina
Copy link
Contributor

shwina commented Jul 23, 2020

I'll put in a PR to do bounds checking in Cython for both scatter and gather. I'm happy to also throw in bounds checking removal in C++.

@jrhemstad
Copy link
Contributor Author

I think you could actually eliminate the binop by instead just doing a max reduction of the gather map. All you care about is if a single value in the gather map is out of bounds, so if you just compute the max and it is out of bounds, then you know there is an error and you can throw.

How silly of me :) Edited with numbers for doing just a max. The difference is even less concerning now (especially for larger problem sizes).

That said, I think you actually need to do a min and a max reduction to account for the "negative values wrap" logic. I.e., you need to check if a value in the gather map is outside the bounds (-n, n]

@shwina
Copy link
Contributor

shwina commented Jul 23, 2020

Are you also considering removing support for negative index values?

@jrhemstad
Copy link
Contributor Author

Are you also considering removing support for negative index values?

No. I'm just saying that if you want to keep the same bounds checking logic that exists in libcudf, you need to check for (-n, n].

@shwina
Copy link
Contributor

shwina commented Jul 23, 2020

Got it -- yup makes sense.

@jrhemstad
Copy link
Contributor Author

We could even add a minmax reduction to do this in a single operation.

@github-actions
Copy link

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@vyasr
Copy link
Contributor

vyasr commented Jul 20, 2022

In the interest of making this PR actionable, I would like to collect all remaining instances of this pattern in libcudf so that we know what needs to be done to address this issue:

@vyasr
Copy link
Contributor

vyasr commented Jul 20, 2022

@jrhemstad @davidwendt any other instances that you're aware of? Please feel free to add to the list above. In case we find some cases where this is happening without even an option to turn it off, we can always rip those out later. For now I just looked through public headers and didn't find any obvious examples other than these.

@jrhemstad
Copy link
Contributor Author

I don't know of other instances. I didn't even know these existed :)

The other actionable item I'd suggest is having something about this guidance in the dev docs somewhere.

@vyasr
Copy link
Contributor

vyasr commented Jul 21, 2022

That's a good call. I'll make a note to add that to our dev docs.

@vyasr
Copy link
Contributor

vyasr commented Oct 17, 2022

We can close this once #11853 and #11938 are merged.

rapids-bot bot pushed a commit that referenced this issue Oct 18, 2022
This PR adds a section to the developer documentation about various libcudf design decisions that affect users. These policies are important for us to document and communicate consistently. I am not sure what the best place for this information is, but I think the developer docs are a good place to start since until we address #11481 we don't have a great way to publish any non-API user-facing libcudf documentation. I've created this draft PR to solicit feedback from other libcudf devs about other policies that we should be documenting in a similar manner. Once everyone is happy with the contents, I would suggest that we merge this into the dev docs for now and then revisit a better place once we've tackled #11481.

Partly addresses #5505, #1781.

Resolves #4511.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #11853
rapids-bot bot pushed a commit that referenced this issue Oct 20, 2022
This PR removes optional validation for some APIs. Performing these validations requires data introspection, which we do not want. This PR resolves #5505.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Matthew Roeschke (https://github.com/mroeschke)
  - David Wendt (https://github.com/davidwendt)
  - Nghia Truong (https://github.com/ttnghia)
  - Jason Lowe (https://github.com/jlowe)

URL: #11938
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. proposal Change current process or code Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants