Write Benchmark Tests
MongoDB has a function-level benchmarking library that is based on Google Benchmark v1.3.0, with a few changes to provide a more familiar interface for MongoDB developers. Compared with system-level performance tests that rely on full-fledged MongoDB servers or even clusters, function-level benchmarks make it easier to isolate code segments that may be difficult to exercise through a complete MongoDB server.
The MongoDB Benchmarks follow most of the conventions and best practices of Google Benchmark, with some deviations to make the user experience closer to that of MongoDB C++ unit tests.
```cpp
// my_bm.cpp
#include <benchmark/benchmark.h>

#include "mongo/util/processinfo.h"

namespace mongo {
namespace {

static void BM_Foo(benchmark::State& state) {
    if (state.thread_index == 0) {
        // Setup code here. None of the threads will start until all have
        // reached the start of the benchmark loop.
    }
    for (auto keepRunning : state) {
        // Test code goes here.
    }
    if (state.thread_index == 0) {
        // Teardown code here.
    }
}
BENCHMARK(BM_Foo)
    ->Range(1, 1 << 4)
    ->ThreadRange(1, ProcessInfo::getNumAvailableCores());

class MyFixture : public benchmark::Fixture {
public:
    void SetUp(benchmark::State& state) override {
        ...
    }

    void TearDown(benchmark::State& state) override {
        ...
    }
};

BENCHMARK_DEFINE_F(MyFixture, BM_Bar)(benchmark::State& state) {
    for (auto keepRunning : state) {
        // Test code goes here.
    }
}
BENCHMARK_REGISTER_F(MyFixture, BM_Bar)->...;  // same options as BENCHMARK()

}  // namespace
}  // namespace mongo
```
```python
# SConscript
env.Benchmark(
    target='my_bm',
    source=[
        'my_bm.cpp',
    ],
    LIBDEPS=[
        ...
    ],
)
```
Preferred style:
- Benchmark functions should be named `BM_UpperCamelCase`. The name must be UpperCamelCase only and must not include underscores, to avoid confusing the metrics processor.
- Benchmark file names should end with `_bm`, excluding the file extension.
- SCons `Benchmark` targets must end with `_bm`.
There are a number of existing tests in the codebase that serve as good examples.
Following the best practices below will ensure that your test has a high signal-to-noise ratio.
- Benchmark with all optimizations enabled, as you would compile a production build. This means using the toolchain g++, static linking, `dbg=off`, `opt=on`, etc.
- Use `benchmark::DoNotOptimize()` liberally to prevent your code from being optimized away; when in doubt, use it. There are a number of other mechanisms available to prevent unexpected optimizations (see the sketch after this list):
  - `benchmark::ClobberMemory()` can be used to "flush" memory and provides a read/write barrier to prevent instructions from being reordered across it.
  - `asm("")` can be used to prevent a function that has no side effects from being optimized away. See the `noinline` keyword on [this page from the GCC documentation](https://gcc.gnu.org/onlinedocs/gcc-5.3.0/gcc/Function-Attributes.html) for more detail.
- Avoid using RNGs. If one is absolutely necessary, use a fixed seed.
- Avoid disk IO.
- Avoid non-loopback networking.
- Avoid tests that stress multiple resources (e.g. CPU and memory access).
- Avoid tests that take more than 0.1 seconds in a single iteration. Consider breaking each one up into multiple tests. ("Iteration" here refers to a single pass of a `keepRunning` loop, not a single call to a `BM_` function, which can take much longer.)
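As a rough illustration of the optimization barriers mentioned above, here is a minimal sketch. The benchmark name and the work it performs are made up for this example and are not taken from the MongoDB codebase; only `benchmark::DoNotOptimize()` and `benchmark::ClobberMemory()` are Google Benchmark APIs.

```cpp
#include <benchmark/benchmark.h>

#include <vector>

namespace mongo {
namespace {

// Hypothetical benchmark used only to illustrate DoNotOptimize/ClobberMemory.
static void BM_BuildVector(benchmark::State& state) {
    for (auto keepRunning : state) {
        std::vector<int> v;
        v.reserve(100);
        for (int i = 0; i < 100; ++i) {
            v.push_back(i);
        }
        // Forces the compiler to assume the vector's storage is read, so the
        // loop above cannot be eliminated as dead code.
        benchmark::DoNotOptimize(v.data());
        // Read/write barrier: pending writes must be completed and cannot be
        // reordered across this call.
        benchmark::ClobberMemory();
    }
}
BENCHMARK(BM_BuildVector);

}  // namespace
}  // namespace mongo
```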
If you'd like to explore other functionality the Benchmark framework provides, please refer to Google Benchmark's excellent documentation. The rule of thumb is that if a piece of functionality is provided by both Google Benchmark and MongoDB, use the MongoDB version. There are a few differences as a result of this rule, including:
- `BENCHMARK_MAIN` does not need to be explicitly declared.
- `mongo::ProcessInfo` should be used for getting hardware information instead of Google Benchmark's version.
Add `my_bm.cpp` to the respective resmoke.py suite YAML file. For example, add `src/mongo/db/repl/oplog_application_bm.cpp` to `buildscripts/resmokeconfig/suites/benchmarks_replication.yml`.
```sh
ninja -f build.ninja -j 100 install-benchmarks  # causes build/benchmarks.txt to be created
./buildscripts/resmoke.py run --suite=benchmarks_first_half [/optional/path/to/benchmark/test/binary]
```
The results from the above command will be printed to the console and look like the following:

```
Run on (8 X 2800 MHz CPU s)
2018-03-08 14:55:12
-----------------------------------------------------------------------------------
Benchmark                                            Time           CPU Iterations
-----------------------------------------------------------------------------------
BM_ClockNow/poll period:0/threads:1                 29 ns         29 ns   24998482
BM_ClockNow/poll period:0/threads:1                 29 ns         29 ns   24998482
BM_ClockNow/poll period:0/threads:1                 29 ns         29 ns   24998482
BM_ClockNow/poll period:0/threads:1_mean            29 ns         29 ns   24998482
BM_ClockNow/poll period:0/threads:1_median          29 ns         29 ns   24998482
BM_ClockNow/poll period:0/threads:1_stddev           0 ns          0 ns   24998482
BM_ClockNow/poll period:0/threads:2                 14 ns         28 ns   25178950
BM_ClockNow/poll period:0/threads:2                 14 ns         27 ns   25178950
BM_ClockNow/poll period:0/threads:2                 14 ns         28 ns   25178950
BM_ClockNow/poll period:0/threads:2_mean            14 ns         28 ns   25178950
BM_ClockNow/poll period:0/threads:2_median          14 ns         28 ns   25178950
BM_ClockNow/poll period:0/threads:2_stddev           0 ns          0 ns   25178950
```
If you see a warning for CPU frequency scaling, consider turning CPU scaling off for more stable results. `cpufrequtils` is a popular tool to do this on Linux.
A few Benchmark-specific options have been added to resmoke.py:
- `--benchmarkMinTimeSecs` (default: 5) replaces `--repeat`; set a higher number to make a test run longer.
- `--benchmarkRepetitions` (default: 3) can be changed to 1 to make tests finish faster. Mean/Median/StdDev will not be available.
If a test is taking less than 2 nanoseconds per iteration, or if you believe one or more lines of code never touch memory (i.e. they directly and indirectly affect only registers), none of the compiler hints may be enough to prevent your code from being optimized away. In this case it's prudent to manually check that the compiler is not optimizing out your code. This can be done by looking at the disassembly of the benchmark binary.
The disassembly for an empty, optimized, and unstripped benchmark contains something like this:
```
1000013d9: e8 52 58 00 00     callq   22610 <__ZN9benchmark5State16StartKeepRunningEv>
1000013de: 48 89 df           movq    %rbx, %rdi
1000013e1: 48 83 c4 08        addq    $8, %rsp
1000013e5: 5b                 popq    %rbx
1000013e6: 5d                 popq    %rbp
1000013e7: e9 04 59 00 00     jmp     22788 <__ZN9benchmark5State17FinishKeepRunningEv>
```
The keywords to look out for are `callq`, `StartKeepRunning`, and `FinishKeepRunning`. If there's not much happening between `StartKeepRunning` and `FinishKeepRunning`, your test is likely doing nothing. The `addq` on line 3 is part of Google Benchmark's bookkeeping and not part of the benchmark itself.
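For comparison, here is a hypothetical benchmark (not from the MongoDB codebase) whose loop body a compiler can prove has no side effects; compiled with optimizations, it would typically produce disassembly like the listing above.

```cpp
#include <benchmark/benchmark.h>

namespace mongo {
namespace {

// The loop body has no observable side effects, so an optimizing compiler is
// free to delete it; only the StartKeepRunning/FinishKeepRunning bookkeeping
// remains, and the benchmark measures nothing.
static void BM_OptimizedAway(benchmark::State& state) {
    for (auto keepRunning : state) {
        int x = 2 + 2;  // Never read: eliminated as dead code.
        (void)x;
    }
}
BENCHMARK(BM_OptimizedAway);

}  // namespace
}  // namespace mongo
```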
If you're writing benchmarks for a new feature, putting them in a new resmoke suite is recommended to ensure faster feedback. The resmoke suite definitions are in the mongo repository under `buildscripts/resmokeconfig/suites/`.
In your new suite YAML definition, make sure to always include the `system_resource_canary_bm*` test and the `CombineBenchmarkResults` hook.
```yaml
# benchmarks_my_feature_name.yml
test_kind: benchmark_test

selector:
  root: build/benchmarks.txt
  include_files:
    # The trailing asterisk is for catching the [.exe] suffix on Windows.
    - build/**/system_resource_canary_bm*
    - build/**/path/to/my/benchmarks*

executor:
  config: {}
  hooks:
    - class: CombineBenchmarkResults
```
Also, add your files to the `exclude_files` section of `benchmarks.yml`, which is a fallback suite to pick up orphaned benchmarks.
```yaml
...
selector:
  ...
  exclude_files:
    - build/**/path/to/my/benchmarks*
...
```
Use the following template to add an Evergreen task for the `benchmarks_my_feature_name` suite. When adding the task to Enterprise RHEL 6, ensure you use the specially tuned perf distro, which will provide a better signal-to-noise ratio. Unless the suite is for testing platform-specific codepaths, it should be added to the Enterprise RHEL 6.2 (on the centos6-perf distro), Enterprise RHEL 7.0 (on the rhel70-small distro), and Enterprise Windows 2008R2 (on the windows-64-vs2015-small distro) build variants to ensure coverage of major MongoDB platforms. See `etc/evergreen_yml_components/definitions.yml`.
```yaml
# etc/evergreen_yml_components/definitions.yml
- <<: *benchmark_template
  name: benchmarks_my_feature_name
  commands:
    - func: "do benchmark setup"
    - func: "run tests"
      vars:
        resmoke_args: --suites=benchmarks_my_feature_name
        run_multiple_jobs: false
    - func: "send benchmark results"

# Add the task to a build variant.
- name: benchmarks_my_feature_name
  distros:
    - centos6-perf
```
Some benchmarks have certain thresholds that they are compared against. These thresholds are set on a per variant basis, as benchmarks may have different characteristics depending on the underlying machine. These thresholds are automatically applied when the benchmarks are run in Evergreen. You can modify these thresholds by touching this file.
The main difference between benchmarks and most other existing performance tests is that the result is presented as latency, not throughput. This means lower numbers on the graphs are better, which is not visually intuitive. To ensure consistency with existing tests, the benchmarks run through resmoke.py will report `0 - actual_latency` in the JSON report file (but not in the console output); this preserves the "higher is better" semantics, but the y-axis on the graphs will show negative numbers.
A side effect of negating the latency is that the "Thread level - Max Only" view will show the maximum number on the y-axis, which corresponds to the lowest latency, usually the thread level with the least contention.
Until EVG-3009 is implemented, Benchmark graphs should be analyzed with the "All" thread level, to ensure meaningful data for high thread levels is not ignored.
Another difference is that the absolute scale is different between latency and throughput graphs, despite higher being better for both. Since latency is the inverse of throughput, a 50% increase in latency results in only a 33% drop in throughput.
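To spell out the arithmetic: since throughput is the reciprocal of latency, scaling latency by a factor of 1.5 scales throughput by

$$\frac{1/(1.5\,L)}{1/L} = \frac{1}{1.5} \approx 0.67,$$

i.e. roughly a 33% drop.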
Google Benchmark allows you to define custom counters, where each counter is a `double`-like object. The counters are stored in the perf dashboard's "historic data" JSON file but are not visualized at the moment. Note that all benchmark results are negated, including counters.
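A minimal sketch of a custom counter follows. The benchmark name, the constant, and the counter names are invented for this example; only `state.counters` and `benchmark::Counter` are Google Benchmark APIs.

```cpp
#include <benchmark/benchmark.h>

#include <cstdint>

namespace mongo {
namespace {

// Hypothetical benchmark that reports how many "documents" each run handled.
static void BM_ProcessDocuments(benchmark::State& state) {
    constexpr int64_t kDocsPerIteration = 64;
    int64_t totalDocs = 0;
    for (auto keepRunning : state) {
        // ... work that touches kDocsPerIteration documents ...
        totalDocs += kDocsPerIteration;
    }
    // Plain counter: reported as-is alongside Time/CPU/Iterations.
    state.counters["TotalDocs"] = static_cast<double>(totalDocs);
    // Rate counter: divided by elapsed wall-clock time, giving docs/second.
    state.counters["DocsPerSecond"] =
        benchmark::Counter(static_cast<double>(totalDocs), benchmark::Counter::kIsRate);
}
BENCHMARK(BM_ProcessDocuments);

}  // namespace
}  // namespace mongo
```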