Add implicit stream benchmarking support #76

Merged (18 commits) on Feb 11, 2022
1 change: 1 addition & 0 deletions README.md
@@ -60,6 +60,7 @@ various NVBench features and usecases:
- [Enums and compile-time-constant-integral parameter axes](examples/enums.cu)
- [Reporting item/sec and byte/sec throughput statistics](examples/throughput.cu)
- [Skipping benchmark configurations](examples/skip.cu)
- [Benchmarking on a specific stream](examples/stream.cu)
- [Benchmarks that sync CUDA devices: `nvbench::exec_tag::sync`](examples/exec_tag_sync.cu)
- [Manual timing: `nvbench::exec_tag::timer`](examples/exec_tag_timer.cu)

19 changes: 19 additions & 0 deletions docs/benchmarks.md
@@ -11,6 +11,25 @@ void my_benchmark(nvbench::state& state) {
NVBENCH_BENCH(my_benchmark);
```

NVBench records the elapsed time of work on a CUDA stream for each iteration of a benchmark.
By default, NVBench creates and provides an explicit stream via `launch::get_stream()`
to pass to every stream-ordered operation.
Collaborator commented:
This isn't quite true -- for the isolated/cold measurements, each iteration is recorded, but for the batch/hot measurements, several iterations are lumped together in a single timer.

I'd also move this down into its own section -- this section is meant to give an extremely brief overview of a minimal benchmark specification and introduce key concepts. Using an explicit stream is an advanced use case that should have its own section.

I'll push a commit to this branch that restructures this a bit, since I'm pretty picky about these docs 😅
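A rough sketch of the distinction described in the comment (host-side timing and fixed counts are used purely for illustration; NVBench's actual cold/hot measurements time on the benchmark's CUDA stream and add noise/convergence logic not shown here):

```cpp
#include <chrono>
#include <vector>

// Placeholder for the stream-ordered work being measured.
void run_iteration() { /* launch and sync the kernel under test */ }

void measurement_sketch()
{
  using clock = std::chrono::steady_clock;

  // Cold / isolated: every iteration gets its own timed region and sample.
  std::vector<double> cold_samples;
  for (int i = 0; i < 100; ++i)
  {
    const auto start = clock::now();
    run_iteration();
    cold_samples.push_back(std::chrono::duration<double>(clock::now() - start).count());
  }

  // Hot / batch: many iterations share one timed region; the per-iteration
  // time is the batch total divided by the iteration count.
  const int batch_size = 100;
  const auto batch_start = clock::now();
  for (int i = 0; i < batch_size; ++i)
  {
    run_iteration();
  }
  const double seconds_per_iteration =
    std::chrono::duration<double>(clock::now() - batch_start).count() / batch_size;
  (void)seconds_per_iteration;
}
```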


Sometimes it is inconvenient or impossible to specify an explicit CUDA stream to every
stream-ordered operation. A `cudaStream_t` may be provided via `state::set_cuda_stream`.
It is assumed that all work of interest executes on or synchronizes with this stream.
```cpp
void my_benchmark(nvbench::state& state) {
cudaStream_t default_stream = 0;
state.set_cuda_stream(nvbench::cuda_stream{default_stream, false});
state.exec([](nvbench::launch&) {
my_func(); // a host API invoking GPU kernels without taking an explicit stream
my_kernel<<<num_blocks, 256>>>(); // or a kernel launched with the default stream
});
}
NVBENCH_BENCH(my_benchmark);
```
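For contrast, here is a minimal sketch of the default behavior described above, in which the functor forwards the NVBench-provided stream from `launch::get_stream()` to each stream-ordered call (the kernel name and launch configuration are placeholders, as in the snippet above):
```cpp
void my_stream_aware_benchmark(nvbench::state& state) {
  state.exec([](nvbench::launch& launch) {
    // Launch on the stream that NVBench creates and times:
    my_kernel<<<num_blocks, 256, 0, launch.get_stream()>>>();
  });
}
NVBENCH_BENCH(my_stream_aware_benchmark);
```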

There are three main components in the definition of a benchmark:

- A `KernelGenerator` callable (`my_benchmark` above)
1 change: 1 addition & 0 deletions examples/CMakeLists.txt
@@ -4,6 +4,7 @@ set(example_srcs
exec_tag_sync.cu
exec_tag_timer.cu
skip.cu
stream.cu
throughput.cu
auto_throughput.cu
)
62 changes: 62 additions & 0 deletions examples/stream.cu
@@ -0,0 +1,62 @@
/*
* Copyright 2022 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 with the LLVM exception
* (the "License"); you may not use this file except in compliance with
* the License.
*
* You may obtain a copy of the License at
*
* http://llvm.org/foundation/relicensing/LICENSE.txt
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <nvbench/nvbench.cuh>

// Grab some testing kernels from NVBench:
#include <nvbench/test_kernels.cuh>

// Thrust vectors simplify memory management:
#include <thrust/device_vector.h>

// A function to benchmark that does not expose an explicit stream argument.
void copy(int32_t *input, int32_t *output, std::size_t const num_values)
{
nvbench::copy_kernel<<<256, 256>>>(input, output, num_values);
}

// `stream_bench` copies a 64 MiB buffer of int32_t on a CUDA stream specified
// by the user.
//
// By default, NVBench creates and provides an explicit stream via
// `launch::get_stream()` to pass to every stream-ordered operation. Sometimes
// it is inconvenient or impossible to specify an explicit CUDA stream to every
// stream-ordered operation. In this case, users may specify a target stream via
// `state::set_cuda_stream`. It is assumed that all work of interest executes on
// or synchronizes with this stream.
void stream_bench(nvbench::state &state)
{
// Allocate input data:
const std::size_t num_values = 64 * 1024 * 1024 / sizeof(nvbench::int32_t);
thrust::device_vector<nvbench::int32_t> input(num_values);
thrust::device_vector<nvbench::int32_t> output(num_values);

// Use the CUDA default stream as the target stream. Note that the wrapper is
// non-owning, so the default stream will not be destroyed.
cudaStream_t default_stream = 0;
state.set_cuda_stream(
nvbench::cuda_stream{default_stream, false /*owning = false*/});

state.exec([&input, &output, num_values](nvbench::launch &) {
copy(thrust::raw_pointer_cast(input.data()),
thrust::raw_pointer_cast(output.data()),
num_values);
});
}

NVBENCH_BENCH(stream_bench);
42 changes: 36 additions & 6 deletions nvbench/cuda_stream.cuh
@@ -1,5 +1,5 @@
/*
* Copyright 2021 NVIDIA Corporation
* Copyright 2021-2022 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 with the LLVM exception
* (the "License"); you may not use this file except in compliance with
@@ -22,25 +22,55 @@

#include <cuda_runtime_api.h>

#include <memory>

namespace nvbench
{

// RAII wrapper for a cudaStream_t.
struct cuda_stream
{
cuda_stream() { NVBENCH_CUDA_CALL(cudaStreamCreate(&m_stream)); }
~cuda_stream() { NVBENCH_CUDA_CALL(cudaStreamDestroy(m_stream)); }
cuda_stream()
: m_stream{[]() {
cudaStream_t s;
NVBENCH_CUDA_CALL(cudaStreamCreate(&s));
return s;
}(),
stream_deleter{true}}
{}

cuda_stream(cudaStream_t stream, bool owning)
: m_stream{stream, stream_deleter{owning}}
{}

~cuda_stream() = default;

// move-only
cuda_stream(const cuda_stream &) = delete;
cuda_stream(cuda_stream &&) = default;
cuda_stream &operator=(const cuda_stream &) = delete;
cuda_stream(cuda_stream &&) = default;
cuda_stream &operator=(cuda_stream &&) = default;

operator cudaStream_t() const { return m_stream; }
operator cudaStream_t() const { return m_stream.get(); }

cudaStream_t get_stream() const { return m_stream.get(); }

private:
cudaStream_t m_stream;
struct stream_deleter
{
using pointer = cudaStream_t;
bool owning;

constexpr void operator()(pointer s) const noexcept
{
if (owning)
{
NVBENCH_CUDA_CALL_NOEXCEPT(cudaStreamDestroy(s));
}
}
};

std::unique_ptr<cudaStream_t, stream_deleter> m_stream;
};

} // namespace nvbench
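A brief usage sketch of the two construction modes (assuming the class as shown; `cudaStreamSynchronize` stands in for any API that accepts a raw handle):

```cpp
#include <nvbench/cuda_stream.cuh>

void cuda_stream_usage_sketch()
{
  // Owning: the wrapper creates a stream and destroys it when it goes out of scope.
  nvbench::cuda_stream owned;

  // Non-owning: wrap an existing handle (here the legacy default stream); the
  // deleter's `owning == false` branch skips cudaStreamDestroy.
  cudaStream_t existing = 0;
  nvbench::cuda_stream borrowed{existing, false /* owning */};

  // Both convert implicitly to cudaStream_t wherever a raw handle is expected.
  NVBENCH_CUDA_CALL(cudaStreamSynchronize(owned));
  NVBENCH_CUDA_CALL(cudaStreamSynchronize(borrowed));
}
```

The `using pointer = cudaStream_t;` alias in `stream_deleter` is what allows the `std::unique_ptr` to store the `cudaStream_t` handle itself rather than a pointer to one, so optional ownership comes with no extra allocation.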
3 changes: 2 additions & 1 deletion nvbench/detail/measure_cold.cu
@@ -1,5 +1,5 @@
/*
* Copyright 2021 NVIDIA Corporation
* Copyright 2021-2022 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 with the LLVM exception
* (the "License"); you may not use this file except in compliance with
@@ -39,6 +39,7 @@ namespace nvbench::detail

measure_cold_base::measure_cold_base(state &exec_state)
: m_state{exec_state}
, m_launch{m_state.get_cuda_stream()}
, m_run_once{exec_state.get_run_once()}
, m_min_samples{exec_state.get_min_samples()}
, m_max_noise{exec_state.get_max_noise()}
9 changes: 7 additions & 2 deletions nvbench/detail/measure_cupti.cu
@@ -1,5 +1,5 @@
/*
* Copyright 2021 NVIDIA Corporation
* Copyright 2021-2022 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 with the LLVM exception
* (the "License"); you may not use this file except in compliance with
@@ -169,7 +169,12 @@ std::vector<std::string> add_metrics(nvbench::state &state)
} // namespace

measure_cupti_base::measure_cupti_base(state &exec_state)
try : m_state{exec_state}, m_cupti(*m_state.get_device(), add_metrics(m_state))
try : m_state
{
exec_state
}
, m_launch{m_state.get_cuda_stream()},
m_cupti{*m_state.get_device(), add_metrics(m_state)}
Collaborator commented:
Heh. Understandably, clang-format is not a fan of initializer-scope try statements. I'll clean this up a bit in my follow up patch.

{}
catch (const std::exception &ex)
{
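For context, the construct the comment above refers to, a constructor function-try-block, looks like this in isolation (the type and its members are purely illustrative):

```cpp
#include <stdexcept>

struct widget
{
  explicit widget(int value)
  // Exceptions thrown by the member initializers land in the handler below.
  try : m_value{checked(value)}
  {}
  catch (const std::exception &)
  {
    // Log or translate here; when the handler of a constructor
    // function-try-block finishes, the exception is rethrown automatically.
  }

  static int checked(int value)
  {
    if (value < 0) { throw std::invalid_argument{"value must be non-negative"}; }
    return value;
  }

  int m_value;
};
```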
3 changes: 2 additions & 1 deletion nvbench/detail/measure_hot.cu
@@ -1,5 +1,5 @@
/*
* Copyright 2021 NVIDIA Corporation
* Copyright 2021-2022 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 with the LLVM exception
* (the "License"); you may not use this file except in compliance with
@@ -37,6 +37,7 @@ namespace nvbench::detail

measure_hot_base::measure_hot_base(state &exec_state)
: m_state{exec_state}
, m_launch{m_state.get_cuda_stream()}
, m_min_samples{exec_state.get_min_samples()}
, m_min_time{exec_state.get_min_time()}
, m_skip_time{exec_state.get_skip_time()}
9 changes: 6 additions & 3 deletions nvbench/launch.cuh
@@ -1,5 +1,5 @@
/*
* Copyright 2021 NVIDIA Corporation
* Copyright 2021-2022 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 with the LLVM exception
* (the "License"); you may not use this file except in compliance with
@@ -25,8 +25,11 @@ namespace nvbench

struct launch
{
explicit launch(const nvbench::cuda_stream &stream)
: m_stream{stream}
{}

// move-only
launch() = default;
launch(const launch &) = delete;
launch(launch &&) = default;
launch &operator=(const launch &) = delete;
@@ -38,7 +41,7 @@
};

private:
nvbench::cuda_stream m_stream;
const nvbench::cuda_stream &m_stream;
};

} // namespace nvbench
21 changes: 15 additions & 6 deletions nvbench/state.cuh
@@ -1,5 +1,5 @@
/*
* Copyright 2021 NVIDIA Corporation
* Copyright 2021-2022 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 with the LLVM exception
* (the "License"); you may not use this file except in compliance with
@@ -18,6 +18,7 @@

#pragma once

#include <nvbench/cuda_stream.cuh>
#include <nvbench/device_info.cuh>
#include <nvbench/exec_tag.cuh>
#include <nvbench/named_values.cuh>
@@ -62,6 +63,15 @@ struct state
state &operator=(const state &) = delete;
state &operator=(state &&) = default;

[[nodiscard]] const nvbench::cuda_stream &get_cuda_stream() const
{
return m_cuda_stream;
}
void set_cuda_stream(nvbench::cuda_stream &&stream)
{
m_cuda_stream = std::move(stream);
}

/// The CUDA device associated with this benchmark state. May be
/// nullopt for CPU-only benchmarks.
[[nodiscard]] const std::optional<nvbench::device_info> &get_device() const
@@ -259,11 +269,9 @@

[[nodiscard]] bool is_cupti_required() const
{
return is_l2_hit_rate_collected()
|| is_l1_hit_rate_collected()
|| is_stores_efficiency_collected()
|| is_loads_efficiency_collected()
|| is_dram_throughput_collected();
return is_l2_hit_rate_collected() || is_l1_hit_rate_collected() ||
is_stores_efficiency_collected() ||
is_loads_efficiency_collected() || is_dram_throughput_collected();
}

summary &add_summary(std::string summary_tag);
@@ -303,6 +311,7 @@ private:
std::optional<nvbench::device_info> device,
std::size_t type_config_index);

nvbench::cuda_stream m_cuda_stream;
std::reference_wrapper<const nvbench::benchmark_base> m_benchmark;
nvbench::named_values m_axis_values;
std::optional<nvbench::device_info> m_device;
21 changes: 20 additions & 1 deletion testing/state.cu
@@ -1,5 +1,5 @@
/*
* Copyright 2021 NVIDIA Corporation
* Copyright 2021-2022 NVIDIA Corporation
*
* Licensed under the Apache License, Version 2.0 with the LLVM exception
* (the "License"); you may not use this file except in compliance with
@@ -51,6 +51,24 @@ struct state_tester : public nvbench::state

using nvbench::detail::state_tester;

void test_streams()
{
dummy_bench bench;

state_tester state{bench};

// Test non-owning stream
cudaStream_t default_stream = 0;
state.set_cuda_stream(nvbench::cuda_stream{default_stream, false});
ASSERT(state.get_cuda_stream() == default_stream);

// Test owning stream
auto stream = nvbench::cuda_stream{};
auto gold = stream.get_stream();
state.set_cuda_stream(std::move(stream));
ASSERT(state.get_cuda_stream() == gold);
}

void test_params()
{
dummy_bench bench;
@@ -110,6 +128,7 @@ void test_defaults()

int main()
{
test_streams();
test_params();
test_summaries();
test_defaults();