
Support benchmarking kernels that cannot take an explicit stream #13

Closed
jrhemstad opened this issue Apr 22, 2021 · 7 comments · Fixed by #76
Assignees
Labels
helps: rapids Helps or needed by RAPIDS. P2: nice to have Desired, but not necessary. type: enhancement New feature or request.

Comments

@jrhemstad
Collaborator

As shown in the example below, nvbench expects all kernels to be executed on the stream provided by launch.get_stream():

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) { 
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_benchmark);

This is problematic when benchmarking functions that contain kernel calls but do not (for one reason or another) expose a stream parameter specifying where those kernels should run. It would be nice to still be able to benchmark such functions.

@alliepiper
Collaborator

For prioritization, do you have a concrete need for this or is it a "nice to have"?

@jrhemstad
Collaborator Author

I'd call it a P1.5 that we can round up to P2 :)

@alliepiper alliepiper added type: enhancement New feature or request. helps: rapids Helps or needed by RAPIDS. P2: nice to have Desired, but not necessary. labels Apr 22, 2021
@alliepiper
Collaborator

We can add some optional API on nvbench::state that would allow users to provide a specific stream. This way the timers etc will know what to do.

I knew it was only a matter of time before I ran into the owning_stream vs stream_view problem 😅

@PointKernel
Member

PointKernel commented Jan 11, 2022

I'd like to work on this issue, so I'm trying to understand the solution. My initial thought is to overload state.exec() with a version that accepts a kernel launcher taking no launch argument, so the example use case would look like this:

void my_benchmark(nvbench::state& state) {
  state.exec([]() { 
    invoke_gpu_kernel(...); // a host API invoking GPU kernels but takes no stream argument
    my_kernel<<<num_blocks, 256>>>(); // or launching a kernel with the default stream
  });
}
NVBENCH_BENCH(my_benchmark);

Here we assume that the default stream is known by nvbench::state.

Does the above general idea sound right? If so, how can we retrieve the default stream information if it's not explicitly specified (so it can be used by timers etc)?

@alliepiper
Collaborator

alliepiper commented Jan 12, 2022

My initial thought is to overload state.exec() which takes no kernel launcher argument and the example use case will look like this:

The KernelLauncher functor passed into state.exec should still take the launch object, even if it's unused. The launch object is the only way for NVBench to pass information into the KernelLauncher, so I'd like to leave that in place. We may add more members to launch down the road if needed.

Here we assume that the default stream is known by nvbench::state. How can we retrieve the default stream information if it's not explicitly specified (so it can be used by timers etc)?

Rather than assume that cudaStreamDefault will be used, we should pass an explicit stream into the state object. This way we can support cases where an explicit, non-default stream must be used.

This would look like:

void my_benchmark(nvbench::state& state) {
  state.set_cuda_stream(nvbench::cuda_stream{cudaStreamDefault, false});
  state.exec([](nvbench::launch&) { 
    invoke_gpu_kernel(...); // a host API invoking GPU kernels without passing an explicit stream
    my_kernel<<<num_blocks, 256>>>(); // or launching a kernel with the default stream
  });
}
NVBENCH_BENCH(my_benchmark);

This way, we can still support invoke_gpu_kernel if it uses some library-specific, non-default stream.


My plan for implementing this was:

  1. Update nvbench::cuda_stream to allow construction from an explicit stream, with support for owning and non-owning semantics.
  2. Update nvbench::state to add a nvbench::cuda_stream member along with setters/getters.
  3. Update nvbench::launch to hold a cuda_stream const ref instead of a value.
  4. Update the nvbench/detail/measure* classes to use the stream from the state when constructing the launch object.

For (1), this would be a new constructor:

cuda_stream(cudaStream_t stream, bool owning);
  • Default construction will continue to create a new, owning=true stream
  • If owning is true, the stream will be destroyed in the destructor.
  • The move constructor / move assignment operator need to be updated to properly free the old stream when necessary.
  • The destructor should be updated to use NVBENCH_CUDA_CALL_NOEXCEPT to avoid throwing exceptions from a destructor.

For (2):

  • The default state constructor will default construct the new cuda_stream member.
  • New nvbench::state API:
    • void set_cuda_stream(nvbench::cuda_stream&&)
    • [[nodiscard]] const nvbench::cuda_stream& get_cuda_stream() const

(3) will require removing launch's default constructor and replacing it with explicit launch(const nvbench::cuda_stream&), as well as changing the member variable to a const ref.

(4) will require modifying the measure_*_base constructors to pass the cuda_stream from the state into the launch constructor.

Does that make sense?

@jrhemstad
Collaborator Author

  1. Update nvbench::cuda_stream to allow construction from an explicit stream, with support for owning and non-owning semantics.

Since we're going with distinct types for owning/non-owning streams in libcu++ (and we'll likely eventually switch to those in nvbench), would it make more sense to also use distinct types in nvbench for now?

@alliepiper
Collaborator

I thought about it, but I'm not seeing any benefit that would justify the added complexity here. Using a single type simplifies the implementation considerably in this case.

Once the libcu++ implementation is ready I'd consider switching if it makes sense and there's a good motivation to do so.
