
Support benchmarking kernels that cannot take an explicit stream #13

Closed
jrhemstad opened this issue Apr 22, 2021 · 7 comments · Fixed by #76
Assignees
Labels
helps: rapids Helps or needed by RAPIDS. P2: nice to have Desired, but not necessary. type: enhancement New feature or request.

Comments

@jrhemstad
Collaborator

As shown in the example below, nvbench expects all kernels to be executed on the stream provided by launch.get_stream():

void my_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) { 
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_benchmark);

This is problematic when benchmarking functions that contain kernel calls but do not (for one reason or another) expose a stream parameter specifying where those kernels should run. It would be nice to still be able to benchmark such functions.

@alliepiper
Collaborator

For prioritization, do you have a concrete need for this or is it a "nice to have"?

@jrhemstad
Collaborator Author

I'd call it a P1.5 that we can round up to P2 :)

@alliepiper alliepiper added type: enhancement New feature or request. helps: rapids Helps or needed by RAPIDS. P2: nice to have Desired, but not necessary. labels Apr 22, 2021
@alliepiper
Collaborator

We can add some optional API on nvbench::state that would allow users to provide a specific stream. This way the timers etc will know what to do.

I knew it was only a matter of time before I ran into the owning_stream vs stream_view problem 😅

@PointKernel
Member

PointKernel commented Jan 11, 2022

I'd like to work on this issue, so I'm trying to understand the solution. My initial thought is to overload state.exec() with a version that accepts a kernel launcher taking no launch argument, so the example use case would look like this:

void my_benchmark(nvbench::state& state) {
  state.exec([]() { 
    invoke_gpu_kernel(...); // a host API invoking GPU kernels but takes no stream argument
    my_kernel<<<num_blocks, 256>>>(); // or launching a kernel with the default stream
  });
}
NVBENCH_BENCH(my_benchmark);

Here we assume that the default stream is known by nvbench::state.

Does the above general idea sound right? If so, how can we retrieve the default stream information if it's not explicitly specified (so it can be used by timers etc)?

@alliepiper
Collaborator

alliepiper commented Jan 12, 2022

My initial thought is to overload state.exec() which takes no kernel launcher argument and the example use case will look like this:

The KernelLauncher functor passed into state.exec should still take the launch object, even if it's unused. The launch object is the only way for NVBench to pass information into the KernelLauncher, so I'd like to leave that in place. We may add more members to launch down the road if needed.

Here we assume that the default stream is known by nvbench::state. How can we retrieve the default stream information if it's not explicitly specified (so it can be used by timers etc)?

Rather than assume that cudaStreamDefault will be used, we should pass an explicit stream into the state object. This way we can support cases where an explicit, non-default stream must be used.

This would look like:

void my_benchmark(nvbench::state& state) {
  state.set_cuda_stream(nvbench::cuda_stream{cudaStreamDefault, false});
  state.exec([](nvbench::launch&) { 
    invoke_gpu_kernel(...); // a host API invoking GPU kernels without passing an explicit stream
    my_kernel<<<num_blocks, 256>>>(); // or launching a kernel with the default stream
  });
}
NVBENCH_BENCH(my_benchmark);

This way, we can still support invoke_gpu_kernel if it uses some library-specific, non-default stream.


My plan for implementing this was:

  1. Update nvbench::cuda_stream to allow construction from an explicit stream, with support for owning and non-owning semantics.
  2. Update nvbench::state to add a nvbench::cuda_stream member along with setters/getters.
  3. Update nvbench::launch to hold a cuda_stream const ref instead of a value.
  4. Update the nvbench/detail/measure* classes to use the stream from the state when constructing the launch object.

For (1), this would be a new constructor:

cuda_stream(cudaStream_t stream, bool owning);
  • Default construction will continue to create a new, owning=true stream
  • If owning is true, the stream will be destroyed in the destructor.
  • The move constructor / move assignment operator need to be updated to properly free the old stream when necessary.
  • The destructor should be updated to use NVBENCH_CUDA_CALL_NOEXCEPT to avoid throwing exceptions from a destructor.

For (2):

  • The default state constructor will default construct the new cuda_stream member.
  • New nvbench::state API:
    • void set_cuda_stream(nvbench::cuda_stream&&)
    • [[nodiscard]] const nvbench::cuda_stream& get_cuda_stream() const

(3) will require removing launch's default constructor and replacing it with explicit launch(const nvbench::cuda_stream&), as well as changing the member variable to a const ref.

(4) will require modifying the measure_*_base constructors to pass the cuda_stream from the state into the launch constructor.

Does that make sense?

@jrhemstad
Collaborator Author

  1. Update nvbench::cuda_stream to allow construction from an explicit stream, with support for owning and non-owning semantics.

Since we're going with distinct types for owning/non-owning streams in libcu++ (and we'll likely eventually switch to those in nvbench), would it make more sense to also use distinct types in nvbench for now?

@alliepiper
Collaborator

I thought about it, but I'm not seeing any benefit that would justify the added complexity here. Using a single type simplifies the implementation considerably in this case.

Once the libcu++ implementation is ready I'd consider switching if it makes sense and there's a good motivation to do so.
