Update docs for changes since 1.2.2 #150

Merged: 9 commits, Dec 1, 2024
7 changes: 5 additions & 2 deletions README.md
@@ -28,10 +28,13 @@ julia> @b rand(1000) hash # How long does it take to hash that array?

julia> @b rand(1000) _.*5 # How long does it take to multiply it by 5 element wise?
172.970 ns (3 allocs: 7.875 KiB)
```

[Why Chairmarks?](https://Chairmarks.lilithhafner.com/stable/why)
julia> @b rand(100,100) inv,_^2,sum # Is it faster to invert, square, or sum a matrix? [THIS USAGE IS EXPERIMENTAL]
(92.917 μs (9 allocs: 129.203 KiB), 27.166 μs (3 allocs: 78.203 KiB), 1.083 μs)
```

[Tutorial](https://Chairmarks.lilithhafner.com/stable/tutorial)

[Why Chairmarks?](https://Chairmarks.lilithhafner.com/stable/why)

[API Reference](https://Chairmarks.lilithhafner.com/stable/reference)
1 change: 1 addition & 0 deletions docs/Project.toml
@@ -1,4 +1,5 @@
[deps]
BenchmarkTools = "6e4b80f9-dd63-53aa-95a3-0cdb28fa8baf"
Chairmarks = "0ca39b1e-fe0b-4e98-acfc-b1656634c4de"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterVitepress = "4710194d-e776-4893-9690-8d956a29c365"
9 changes: 9 additions & 0 deletions docs/src/explanations.md
@@ -67,6 +67,15 @@ stops respecting the requested runtime budget and so it could very well perform
more precisely than Chairmarks (it's hard to compete with a 500ms benchmark when you only have
1ms). In practice, however, Chairmarks stays pretty reliable even for fairly low runtimes.
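
For a sense of what that looks like (a sketch; `seconds` is the runtime-budget keyword accepted
by `@b` and `@be`, and the printed time is illustrative and will vary by machine):

```julia
julia> @b rand(1000) hash seconds=0.001  # ~1 ms budget instead of the default
1.791 μs
```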

When comparing different implementations of the same function, `@b rand f,g` can be more reliable
than `judge(minimum(@benchmark(f(x) setup=(x=rand()))), minimum(@benchmark(g(x) setup=(x=rand()))))`
because the former randomly interleaves calls to `f` and `g` in the same context and scope
with the same inputs while the latter runs all evaluations of `f` before all evaluations of
`g` and—typically less importantly—uses different random inputs.

!!! warning
Comparative benchmarking is experimental and may be removed or changed in future versions
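
A minimal sketch (reusing the `f`/`g` pair from the migration guide; the timings are
illustrative and will vary by machine):

```julia
julia> f() = sum(rand() for _ in 1:1000);  # baseline

julia> g() = sum(rand() for _ in 1:1010);  # ~1% more work

julia> @b f,g  # both implementations are sampled interleaved, in the same context
(1.063 μs, 1.073 μs)
```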

## How does tuning work?

First of all, what is "tuning" for? It's for tuning the number of evaluations per sample.
3 changes: 3 additions & 0 deletions docs/src/index.md
@@ -30,4 +30,7 @@ julia> @b rand(1000) hash # How long does it take to hash that array?

julia> @b rand(1000) _.*5 # How long does it take to multiply it by 5 element wise?
172.970 ns (3 allocs: 7.875 KiB)

julia> @b rand(100,100) inv,_^2,sum # Is it faster to invert, square, or sum a matrix? [THIS USAGE IS EXPERIMENTAL]
(92.917 μs (9 allocs: 129.203 KiB), 27.166 μs (3 allocs: 78.203 KiB), 1.083 μs)
```
85 changes: 84 additions & 1 deletion docs/src/migration.md
@@ -3,7 +3,7 @@ CurrentModule = Chairmarks
DocTestSetup = quote
using Chairmarks
end
DocTestFilters = [r"\d\d?\d?\.\d{3} [μmn]?s( \(.*\))?"]
DocTestFilters = [r"\d\d?\d?\.\d{3} [μmn]?s( \(.*\))?| (time: |memory:) .*% => (improvement|regression|invariant) \((5|1).00% tolerance\)"]
```

# [How to migrate from BenchmarkTools to Chairmarks](@id migration)
@@ -95,6 +95,40 @@ Benchmark results have the following fields:

Note that more fields may be added as more information becomes available.

### Comparisons

Chairmarks does not provide a `judge` function to decide if two benchmarks are significantly
different. However, you can get accurate data to inform that judgement by passing a
comma-separated list of functions to `@b` or `@be`.

!!! warning
Comparative benchmarking is experimental and may be removed or changed in future versions

```jldoctest; setup=(using BenchmarkTools)
julia> f() = sum(rand() for _ in 1:1000)
f (generic function with 1 method)

julia> g() = sum(rand() for _ in 1:1010)
g (generic function with 1 method)

julia> @b f,g
(1.121 μs, 1.132 μs)

julia> @b f,g
(1.063 μs, 1.073 μs)

julia> judge(minimum(@benchmark(f())), minimum(@benchmark(g())))
BenchmarkTools.TrialJudgement:
time: -5.91% => improvement (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)

julia> judge(minimum(@benchmark(f())), minimum(@benchmark(g())))
BenchmarkTools.TrialJudgement:
time: -0.78% => invariant (5.00% tolerance)
memory: +0.00% => invariant (1.00% tolerance)
```
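
If you want a single judge-style number from the Chairmarks result, one option (an illustrative
helper, not part of the Chairmarks API) is to compute the relative difference of the two
summarized times yourself; the comparative form returns a tuple whose elements have a `time`
field in seconds:

```julia
julia> results = @b f,g;  # tuple of summarized results, one per function

julia> pct_change = 100 * (results[2].time / results[1].time - 1);  # ≈ +0.9 here: g is ~1% slower than f
```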

### Nonconstant globals and interpolation

Like BenchmarkTools, benchmarks that include access to nonconstant globals will receive a
@@ -121,3 +155,52 @@ julia> @b rand($x) # interpolate (most familiar to BenchmarkTools users)
julia> @b x rand # put the access in the setup phase (most concise in simple cases)
15.507 ns (2 allocs: 112 bytes)
```

### `BenchmarkGroup`s

It is possible to use `BenchmarkTools.BenchmarkGroup` with Chairmarks. Replacing
`@benchmarkable` invocations with `@be` invocations and wrapping the group in a function
suffices. You don't have to run `tune!` and instead of calling `run`, call the function.
Even running `Statistics.median(suite)` works—although any custom plotting might need a
couple of tweaks.

```julia
using BenchmarkTools, Statistics

function create_benchmarks()
functions = Function[sqrt, inv, cbrt, sin, cos]
group = BenchmarkGroup()
for (index, func) in enumerate(functions)
group[index] = @benchmarkable $func(x) setup=(x=rand())
end
group
end

suite = create_benchmarks()

tune!(suite)

median(run(suite))
# edit code
median(run(suite))
```

```julia
using Chairmarks, BenchmarkTools, Statistics  # BenchmarkGroup comes from BenchmarkTools

function run_benchmarks()
functions = Function[sqrt, inv, cbrt, sin, cos]
group = BenchmarkGroup()
for (index, func) in enumerate(functions)
group[nameof(func)] = @be rand func
end
group
end

median(run_benchmarks())
# edit code
median(run_benchmarks())
```

This behavior emerged naturally rather than being intentionally designed, so expect some
rough edges. See https://github.com/LilithHafner/Chairmarks.jl/issues/70 for more info.
2 changes: 2 additions & 0 deletions docs/src/reference.md
@@ -12,12 +12,14 @@ version number if the change is not expected to cause significant disruptions.
- [`Chairmarks.Benchmark`](@ref)
- [`@b`](@ref)
- [`@be`](@ref)
- [`Chairmarks.summarize`](@ref)
- [`Chairmarks.DEFAULTS`](@ref)

```@docs
Chairmarks.Sample
Chairmarks.Benchmark
@b
@be
Chairmarks.summarize
Chairmarks.DEFAULTS
```
29 changes: 19 additions & 10 deletions docs/src/tutorial.md
@@ -89,22 +89,31 @@ julia> @b rand(100) hash

The first argument is called once per sample, and the second argument is called once per
evaluation, each time passing the result of the first argument. We can also use the special
`_` variable to refer to the output of the previous step. Here, we compare two different
implementations of the norm of a vector
`_` variable to refer to the output of the previous step. Here, we benchmark computing the
norm of a vector:

```jldoctest
julia> @b rand(100) sqrt(sum(_ .* _))
37.628 ns (2 allocs: 928 bytes)
38.373 ns (2 allocs: 928 bytes)
```

The `_` refers to the array whose norm is to be computed.

We can perform a comparison of two different implementations of the same specification by
providing a comma-separated list of functions to benchmark. Here, we compare two ways of
computing the norm of a vector:

julia> @b rand(100) sqrt(sum(x->x^2, _))
11.053 ns
!!! warning
Comparative benchmarking is experimental and may be removed or changed in future versions

```jldoctest
julia> @b rand(100) sqrt(sum(_ .* _)),sqrt(sum(x->x^2, _))
(40.373 ns (2 allocs: 928 bytes), 11.440 ns)
```

The _ refers to the array whose norm is to be computed. Both implementations are quite fast.
These measurements are on a 3.5 GHz CPU so it appears that the first implementation takes
about one clock cycle per element, with a bit of overhead. The second, on the other hand,
appears to be running much faster than that, likely because it is making use of SIMD
instructions.
This invocation pattern runs the setup function once per sample and randomly selects which
implementation to run first for each sample. This makes comparative benchmarks robust to
fluctuations in system load.
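
The returned value is an ordinary tuple, so the two results can also be inspected
programmatically (a sketch; `time` is the seconds-per-evaluation field of each result, as
described in the reference docs):

```julia
julia> results = @b rand(100) sqrt(sum(_ .* _)),sqrt(sum(x->x^2, _));

julia> results[1].time > results[2].time  # the non-allocating version is faster here
true
```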

## Common pitfalls

35 changes: 20 additions & 15 deletions docs/src/why.md
@@ -10,27 +10,21 @@ DocTestFilters = [r"\d\d?\d?\.\d{3} [μmn]?s( \(.*\))?"]

Capable of detecting a 1% difference in runtime in ideal conditions

!!! warning
Comparative benchmarking is experimental and may be removed or changed in future versions

```jldoctest
julia> f(n) = sum(rand() for _ in 1:n)
f (generic function with 1 method)

julia> @b f(1000)
1.074 μs

julia> @b f(1000)
1.075 μs

julia> @b f(1000)
1.076 μs
julia> @b f(1000), f(1010)
(1.064 μs, 1.074 μs)

julia> @b f(1010)
1.086 μs
julia> @b f(1000), f(1010)
(1.063 μs, 1.073 μs)

julia> @b f(1010)
1.087 μs

julia> @b f(1010)
1.087 μs
julia> @b f(1000), f(1010)
(1.064 μs, 1.074 μs)
```

## Efficient
@@ -89,6 +83,17 @@ julia> @b rand(100) sort(_, by=x -> exp(-x)) issorted(_, rev=true) || error()
5.358 μs (2 allocs: 1.750 KiB)
```

The function being benchmarked can be a comma-separated list of functions, in which case a tuple
of the results is returned.

!!! warning
Comparative benchmarking is experimental and may be removed or changed in future versions

```jldoctest
julia> @b rand(100) sort(_, alg=InsertionSort),sort(_, alg=MergeSort)
(1.245 μs (2 allocs: 928 bytes), 921.875 ns (4 allocs: 1.375 KiB))
```

See [`@be`](@ref) for more info

## Truthful
2 changes: 1 addition & 1 deletion src/Chairmarks.jl
@@ -17,7 +17,7 @@ module Chairmarks

using Printf

VERSION >= v"1.11.0-DEV.469" && eval(Meta.parse("public Sample, Benchmark, DEFAULTS"))
VERSION >= v"1.11.0-DEV.469" && eval(Meta.parse("public Sample, Benchmark, DEFAULTS, summarize"))
export @b, @be

include("types.jl")
29 changes: 26 additions & 3 deletions src/public.jl
@@ -5,8 +5,8 @@ Benchmark `f` and return the fastest [`Sample`](@ref).

Use [`@be`](@ref) for full results.

`@b args...` is equivalent to `summarize(@be args...)`. See the docstring for [`@be`](@ref)
for more information.
`@b args...` is equivalent to `Chairmarks.summarize(@be args...)`. See the docstring of
[`@be`](@ref) for more information.

# Examples

@@ -34,6 +34,9 @@ julia> @b (x = 0; for _ in 1:50; x = hash(x); end; x) # We can use arbitrary exp

julia> @b (x = 0; for _ in 1:5e8; x = hash(x); end; x) # This runs for a long time, so it is only run once (with no warmup)
2.447 s (without a warmup)

julia> @b rand(10) hash,objectid # Which hash algorithm is faster? [THIS USAGE IS EXPERIMENTAL]
(17.256 ns, 4.246 ns)
```
"""
macro b(args...)
@@ -148,6 +151,14 @@ At a high level, the implementation of this function looks like this
So `init` will be called once, `setup` and `teardown` will be called once per sample, and
`f` will be called `evals` times per sample.

# Experimental Features

You can pass a comma separated list of functions or expressions to `@be` and they will all
be benchmarked at the same time with interleaved samples, returning a tuple of `Benchmark`s.

!!! warning
Comparative benchmarking is experimental and may be removed or changed in future versions

# Examples

```jldoctest; filter = [r"\\d\\d?\\d?\\.\\d{3} [μmn]?s( \\(.*\\))?"=>s"RES", r"\\d+ (sample|evaluation)s?"=>s"### \\1"], setup=(using Random)
@@ -203,14 +214,26 @@ Benchmark: 3387 samples with 144 evaluations
julia> @be (x = 0; for _ in 1:5e8; x = hash(x); end; x) # This runs for a long time, so it is only run once (with no warmup)
Benchmark: 1 sample with 1 evaluation
2.488 s (without a warmup)

julia> @be rand(10) hash,objectid # Which hash algorithm is faster? [THIS USAGE IS EXPERIMENTAL]
Benchmark: 14887 samples with 436 evaluations
min 17.106 ns
median 18.922 ns
mean 20.974 ns
max 234.998 ns
Benchmark: 14887 samples with 436 evaluations
min 4.110 ns
median 4.683 ns
mean 4.979 ns
max 42.911 ns
```
"""
macro be(args...)
process_args(args)
end

"""
summarize(b::Benchmark) -> Any
`summarize(@be ...)` is equivalent to `@b ...`

Used by `@b` to summarize the output of `@be`. Currently implemented as elementwise `minimum`.
"""