Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sneaky branch-predictor remembering inputs over benchmark loops #241

Open
chethega opened this issue Nov 11, 2018 · 2 comments
Open

Sneaky branch-predictor remembering inputs over benchmark loops #241

chethega opened this issue Nov 11, 2018 · 2 comments

Comments

@chethega
Copy link

We recently discovered in JuliaLang/julia#29888 a somewhat surprising fact about modern CPU: The branch-predictor is capable of remembering astoundingly long periodic patterns. If we run a benchmark loop, then each evaluation and possibly each sample will have identical branching patterns, thus introducing a period. In other words, the sneaky little CPU learns our sample set and we have an emergent defeat device for some of our benchmarks.

At least the findall benchmarks are broken, and probably have been broken forever. I suspect that the logical indexing benchmarks are broken as well. But we should really go over all our benchmarks and figure out which ones are affected. Also, this is interesting and something to keep in mind for all our benchmarks. Indirect branch prediction (the BTB) and D-cache are something to keep in mind as well.

The likely fix is to increase the size of testsets. Long-term it would be cool to parametrize all benchmarks, and occasionally (rarely) run regression tests on our regression tests: Check that everything has the expected scaling behavior, and explain or fix surprises. Alternatively, we could regenerate new random data between runs. But afaik BenchmarkTools has no support for that (would need new feature to fix evals/sample = 1).

Demo:

julia> using Printf, BenchmarkTools

julia> function cpu_speed_ghz()
                  # if available, use reported CPU speed instead of current speed (which
                  # may be inaccurate if the CPU is idling)
                  cpu_info = Sys.cpu_info()[1]
                  m = match(r"([\d\.]+)GHz", cpu_info.model)
                  ghz = m ≡ nothing ? cpu_info.speed / 1000 : parse(Float64,  m.captures[1])
              end;

julia> const CPU_SPEED_GHZ = cpu_speed_ghz();

julia> const cpu_model = Sys.cpu_info()[1].model;

julia> begin 
           N=30_000
           list = fill(false, N); list[1:2:end].=true;
           bt0 = @belapsed findall($list)
           list .= rand(Bool, N)
           btL = @belapsed findall($list)
           time_to_cycle = 10^9/N * CPU_SPEED_GHZ
           penalty = 2*(btL-bt0)*time_to_cycle
           @printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n", 
           cpu_model, penalty/CPU_SPEED_GHZ , penalty)

           bt = bt0
           @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   2, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           for n=[100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 30_000]
               pat = rand(Bool, n)
               for i=1:n:N list[i:(i+n-1)].=pat end
               bt = @belapsed findall($list)
               @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   n, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           end
       end;

yielding:

Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz; branch-miss penalty:  9.9 ns = 19.8 cycles

Period     2:   44.81 us =    2.99 cycles per idx. Miss-rate  0.00%
Period   100:   53.22 us =    3.55 cycles per idx. Miss-rate  2.83%
Period   500:   51.52 us =    3.43 cycles per idx. Miss-rate  2.26%
Period  1000:   51.37 us =    3.42 cycles per idx. Miss-rate  2.21%
Period  2000:   57.85 us =    3.86 cycles per idx. Miss-rate  4.39%
Period  2500:   88.66 us =    5.91 cycles per idx. Miss-rate 14.77%
Period  3000:  121.78 us =    8.12 cycles per idx. Miss-rate 25.93%
Period  5000:  159.28 us =   10.62 cycles per idx. Miss-rate 38.56%
Period 10000:  182.87 us =   12.19 cycles per idx. Miss-rate 46.51%
Period 30000:  192.51 us =   12.83 cycles per idx. Miss-rate 49.75%

This is compatible with Agner Fog's tables. And it is absolutely mindboggling that the CPU manages to completely defeat patterns of length 2000. If you have a different CPU-Arch available (Ryzen? Skylake? Power?), then please post similar figures. We should increase testset sizes above the BHT limits for all realistic current and near-future CPUs.

JuliaBox's CPU gives a similar cutoff:

Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz; branch-miss penalty:  6.3 ns = 14.6 cycles

Period     2:   28.40 us =    2.18 cycles per idx. Miss-rate  0.00%
Period   100:   32.50 us =    2.49 cycles per idx. Miss-rate  2.16%
Period   500:   32.30 us =    2.48 cycles per idx. Miss-rate  2.05%
Period  1000:   32.70 us =    2.51 cycles per idx. Miss-rate  2.26%
Period  2000:   33.40 us =    2.56 cycles per idx. Miss-rate  2.63%
Period  2500:   42.70 us =    3.27 cycles per idx. Miss-rate  7.52%
Period  3000:   71.90 us =    5.51 cycles per idx. Miss-rate 22.87%
Period  5000:  102.20 us =    7.84 cycles per idx. Miss-rate 38.80%
Period 10000:  116.70 us =    8.95 cycles per idx. Miss-rate 46.42%
Period 30000:  123.40 us =    9.46 cycles per idx. Miss-rate 49.95%
@maxbennedich
Copy link

I get similar numbers/cutoff on a Skylake:

Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz; branch-miss penalty:  6.4 ns = 18.6 cycles

Period     2:   28.22 us =    2.73 cycles per idx. Miss-rate  0.00%
Period   100:   36.35 us =    3.51 cycles per idx. Miss-rate  4.21%
Period   500:   27.39 us =    2.65 cycles per idx. Miss-rate -0.43%
Period  1000:   29.46 us =    2.85 cycles per idx. Miss-rate  0.64%
Period  2000:   39.03 us =    3.77 cycles per idx. Miss-rate  5.60%
Period  2500:   67.03 us =    6.48 cycles per idx. Miss-rate 20.12%
Period  3000:   82.83 us =    8.01 cycles per idx. Miss-rate 28.31%
Period  5000:  112.05 us =   10.83 cycles per idx. Miss-rate 43.47%
Period 10000:  116.88 us =   11.30 cycles per idx. Miss-rate 45.97%
Period 30000:  124.50 us =   12.04 cycles per idx. Miss-rate 49.92%

These numbers are very consistent between runs; in particular the discrepancy for period 100 (slower than periods 500 and 1000).

@taqtiqa-mark
Copy link

I see a monotonic increase on every run. Jump point is the same on each run.

Intel(R) Celeron(R) CPU G550T @ 2.20GHz; branch-miss penalty: 10.0 ns = 21.9 cycles

Period     2:   65.21 us =    4.78 cycles per idx. Miss-rate  0.00%
Period   100:   67.70 us =    4.96 cycles per idx. Miss-rate  0.84%
Period   500:   68.99 us =    5.06 cycles per idx. Miss-rate  1.27%
Period  1000:   71.57 us =    5.25 cycles per idx. Miss-rate  2.13%
Period  2000:   82.29 us =    6.03 cycles per idx. Miss-rate  5.72%
Period  2500:   98.69 us =    7.24 cycles per idx. Miss-rate 11.21%
Period  3000:  121.89 us =    8.94 cycles per idx. Miss-rate 18.98%
Period  5000:  169.03 us =   12.40 cycles per idx. Miss-rate 34.77%
Period 10000:  198.12 us =   14.53 cycles per idx. Miss-rate 44.51%
Period 30000:  215.11 us =   15.78 cycles per idx. Miss-rate 50.21%

Keno pushed a commit that referenced this issue Feb 4, 2022
Add Documenter SSH key to TagBot
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants