all: instruction alignment optimizations for assembly routines, good starter projects #63678

Open
mauri870 opened this issue Oct 23, 2023 · 12 comments
Labels: compiler/runtime (issues related to the Go compiler and/or runtime), help wanted, NeedsFix (the path to resolution is known, but the work has not been done), Performance

@mauri870
Member

mauri870 commented Oct 23, 2023

Issue #56474 added support for instruction alignment on the amd64 architecture. This is achieved with the PCALIGN assembly pseudo-instruction, which inserts NOPs to align the next instruction to a given boundary.
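
To make the padding arithmetic concrete, here is a minimal Go sketch of how many filler bytes an alignment directive has to insert at a given code offset. It only illustrates the arithmetic and is not the assembler's actual implementation; the helper name `nopPadding` and the sample offsets are hypothetical.

```go
package main

import "fmt"

// nopPadding returns how many filler bytes are needed so that the
// instruction following an alignment directive starts at a multiple of
// align. pc is the current offset within the function's code; align is
// assumed to be a power of two such as 16 or 32.
// Hypothetical helper for illustration, not taken from the Go toolchain.
func nopPadding(pc, align uint64) uint64 {
	return (align - pc%align) % align
}

func main() {
	// Sample offsets chosen arbitrarily for the illustration.
	for _, pc := range []uint64{0, 5, 16, 37} {
		fmt.Printf("pc=%2d  PCALIGN $16 pads %2d bytes, PCALIGN $32 pads %2d bytes\n",
			pc, nopPadding(pc, 16), nopPadding(pc, 32))
	}
}
```

Larger boundaries can require more padding (up to align-1 bytes), which is one reason to apply alignment only where benchmarks show a benefit.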

Since this feature is pretty new on amd64, we haven't had much time to check which assembly routines in the runtime and libraries would benefit from instruction alignment. In most cases the effect of instruction alignment is minimal, but in critical subroutines and innermost loops it can deliver a significant performance boost.

Some examples:

There are multiple places where we might get interesting results using PCALIGN in amd64 assembly:

  • runtime functions (memmove, memclr, etc)
  • internal/bytealg
  • crypto/*
  • *amd64*.s hot loops/critical sections in assembly code

Generally a 16-byte alignment works fine, while 32-byte is better when aligning AVX2 instructions. Be careful not to overuse it: routines may end up slower than before.

To verify whether there are any speedups, run the benchmarks for the affected package with a higher count, at least -count=10, then use benchstat to compare the before/after results. I'd say anything consistently higher than 3-5% is worth submitting.
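
As a rough illustration of that rule of thumb, the Go sketch below computes the relative change between a before and after time per operation and checks it against a threshold. The function names and the sample numbers are hypothetical, and the sketch ignores statistical significance, which benchstat reports for you.

```go
package main

import "fmt"

// percentDelta reports the relative change from before to after, in percent.
// For time-per-operation metrics a negative value means the change is faster.
func percentDelta(before, after float64) float64 {
	return (after - before) * 100 / before
}

// worthSubmitting applies the rule of thumb from the text: the speedup
// should be at least threshold percent. Hypothetical helper; it does not
// check whether the difference is statistically significant.
func worthSubmitting(before, after, threshold float64) bool {
	return -percentDelta(before, after) >= threshold
}

func main() {
	// Hypothetical median times in ns/op, not taken from any real run.
	before, after := 100.0, 90.0
	fmt.Printf("delta: %+.2f%%\n", percentDelta(before, after))
	fmt.Println("worth submitting at 3%:", worthSubmitting(before, after, 3))
}
```

In practice, rely on the delta and p-value columns that benchstat prints rather than recomputing them by hand.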

Code alignment is already supported on ppc64, arm64, loong64 and riscv64. Feel free to look into improvements for these architectures as well! You can rely on qemu-static to run the benchmarks after compiling the tests with go test -c.

Happy hacking!

@mauri870 mauri870 added the Performance, help wanted, NeedsFix, and compiler/runtime labels Oct 23, 2023
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Oct 23, 2023
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Oct 23, 2023
MemclrUnaligned/0_5-16        1.821n ± 1%    1.803n ±  2%        ~ (p=0.076 n=20+10)
MemclrUnaligned/0_16-16       1.879n ± 1%    1.855n ±  1%        ~ (p=0.210 n=20+10)
MemclrUnaligned/0_64-16       2.044n ± 1%    2.044n ±  2%        ~ (p=0.871 n=20+10)
MemclrUnaligned/0_256-16      3.614n ± 1%    3.600n ±  3%        ~ (p=0.552 n=20+10)
MemclrUnaligned/0_4096-16     32.63n ± 2%    32.34n ±  3%        ~ (p=0.948 n=20+10)
MemclrUnaligned/0_65536-16    483.5n ± 3%    479.1n ±  5%        ~ (p=0.588 n=20+10)
MemclrUnaligned/1_5-16        1.800n ± 1%    1.808n ±  1%        ~ (p=0.333 n=20+10)
MemclrUnaligned/1_16-16       1.863n ± 1%    1.847n ±  2%        ~ (p=0.345 n=20+10)
MemclrUnaligned/1_64-16       2.929n ± 1%    2.107n ±  2%  -28.05% (p=0.000 n=20+10)
MemclrUnaligned/1_256-16      4.942n ± 1%    4.973n ±  3%        ~ (p=0.302 n=20+10)
MemclrUnaligned/1_4096-16     40.09n ± 1%    39.49n ±  2%        ~ (p=0.210 n=20+10)
MemclrUnaligned/1_65536-16    650.0n ± 3%    653.7n ±  4%        ~ (p=0.530 n=20+10)
MemclrUnaligned/4_5-16        1.806n ± 1%    1.812n ±  1%        ~ (p=0.291 n=20+10)
MemclrUnaligned/4_16-16       1.867n ± 1%    1.862n ±  1%        ~ (p=0.551 n=20+10)
MemclrUnaligned/4_64-16       2.946n ± 2%    2.752n ±  2%   -6.59% (p=0.000 n=20+10)
MemclrUnaligned/4_256-16      4.942n ± 1%    5.144n ±  2%   +4.08% (p=0.000 n=20+10)
MemclrUnaligned/4_4096-16     39.88n ± 1%    40.21n ±  4%        ~ (p=0.346 n=20+10)
MemclrUnaligned/4_65536-16    643.7n ± 2%    647.8n ±  4%        ~ (p=0.657 n=20+10)
MemclrUnaligned/7_5-16        1.802n ± 1%    1.801n ±  3%        ~ (p=0.481 n=20+10)
MemclrUnaligned/7_16-16       1.863n ± 1%    1.863n ±  2%        ~ (p=0.626 n=20+10)
MemclrUnaligned/7_64-16       2.947n ± 1%    2.125n ±  2%  -27.91% (p=0.000 n=20+10)
MemclrUnaligned/7_256-16      4.967n ± 1%    5.005n ±  3%        ~ (p=0.302 n=20+10)
MemclrUnaligned/7_4096-16     39.52n ± 3%    40.07n ±  3%        ~ (p=0.650 n=20+10)
MemclrUnaligned/7_65536-16    651.5n ± 3%    649.2n ±  4%        ~ (p=0.846 n=20+10)
MemclrUnaligned/0_1M-16       7.646µ ± 2%    7.618µ ±  5%        ~ (p=0.373 n=20+10)
MemclrUnaligned/0_4M-16       54.15µ ± 3%   119.05µ ± 66%        ~ (p=0.350 n=20+10)
MemclrUnaligned/0_8M-16       108.8µ ± 3%    107.0µ ±  3%        ~ (p=0.559 n=20+10)
MemclrUnaligned/0_16M-16      216.2µ ± 2%    216.3µ ±  3%        ~ (p=0.681 n=20+10)
MemclrUnaligned/0_64M-16      888.4µ ± 2%    867.3µ ±  6%        ~ (p=0.055 n=20+10)
MemclrUnaligned/1_1M-16       10.85µ ± 2%    11.00µ ±  5%   +1.37% (p=0.028 n=20+10)
MemclrUnaligned/1_4M-16       48.66µ ± 2%    47.79µ ±  1%        ~ (p=0.120 n=20+10)
MemclrUnaligned/1_8M-16       96.18µ ± 4%    97.18µ ±  5%        ~ (p=0.373 n=20+10)
MemclrUnaligned/1_16M-16      232.7µ ± 2%    276.9µ ± 19%        ~ (p=0.286 n=20+10)
MemclrUnaligned/1_64M-16      883.2µ ± 2%    892.3µ ±  4%        ~ (p=0.502 n=20+10)
MemclrUnaligned/4_1M-16       10.97µ ± 2%    11.04µ ±  4%        ~ (p=0.073 n=20+10)
MemclrUnaligned/4_4M-16       48.53µ ± 2%    45.21µ ± 11%        ~ (p=0.082 n=20+10)
MemclrUnaligned/4_8M-16       97.31µ ± 2%    96.12µ ±  3%        ~ (p=0.311 n=20+10)
MemclrUnaligned/4_16M-16      234.7µ ± 6%    241.0µ ± 42%        ~ (p=0.328 n=20+10)
MemclrUnaligned/4_64M-16      891.9µ ± 2%    875.8µ ±  3%        ~ (p=0.448 n=20+10)
MemclrUnaligned/7_1M-16       11.03µ ± 3%    10.85µ ±  4%        ~ (p=0.495 n=20+10)
MemclrUnaligned/7_4M-16       51.37µ ± 2%    48.38µ ±  2%   -5.83% (p=0.000 n=20+10)
MemclrUnaligned/7_8M-16       97.66µ ± 3%    97.83µ ±  3%        ~ (p=0.846 n=20+10)
MemclrUnaligned/7_16M-16      231.4µ ± 7%    274.7µ ± 29%        ~ (p=0.286 n=20+10)
MemclrUnaligned/7_64M-16      891.8µ ± 3%    868.4µ ±  4%        ~ (p=0.061 n=20+10)

MemclrUnaligned/0_5-16       2.558Gi ± 1%   2.583Gi ±  2%        ~ (p=0.076 n=20+10)
MemclrUnaligned/0_16-16      7.931Gi ± 1%   8.030Gi ±  1%        ~ (p=0.214 n=20+10)
MemclrUnaligned/0_64-16      29.15Gi ± 1%   29.17Gi ±  2%        ~ (p=0.914 n=20+10)
MemclrUnaligned/0_256-16     65.97Gi ± 1%   66.23Gi ±  3%        ~ (p=0.559 n=20+10)
MemclrUnaligned/0_4096-16    116.9Gi ± 2%   117.9Gi ±  3%        ~ (p=0.948 n=20+10)
MemclrUnaligned/0_65536-16   126.2Gi ± 3%   127.4Gi ±  5%        ~ (p=0.588 n=20+10)
MemclrUnaligned/1_5-16       2.587Gi ± 1%   2.575Gi ±  1%        ~ (p=0.328 n=20+10)
MemclrUnaligned/1_16-16      7.998Gi ± 1%   8.066Gi ±  2%        ~ (p=0.373 n=20+10)
MemclrUnaligned/1_64-16      20.35Gi ± 1%   28.29Gi ±  2%  +39.02% (p=0.000 n=20+10)
MemclrUnaligned/1_256-16     48.24Gi ± 1%   47.94Gi ±  3%        ~ (p=0.307 n=20+10)
MemclrUnaligned/1_4096-16    95.16Gi ± 1%   96.61Gi ±  2%        ~ (p=0.214 n=20+10)
MemclrUnaligned/1_65536-16   93.90Gi ± 3%   93.38Gi ±  4%        ~ (p=0.530 n=20+10)
MemclrUnaligned/4_5-16       2.578Gi ± 1%   2.569Gi ±  1%        ~ (p=0.286 n=20+10)
MemclrUnaligned/4_16-16      7.979Gi ± 1%   8.005Gi ±  1%        ~ (p=0.588 n=20+10)
MemclrUnaligned/4_64-16      20.24Gi ± 2%   21.67Gi ±  2%   +7.06% (p=0.000 n=20+10)
MemclrUnaligned/4_256-16     48.24Gi ± 1%   46.35Gi ±  2%   -3.92% (p=0.000 n=20+10)
MemclrUnaligned/4_4096-16    95.65Gi ± 2%   94.87Gi ±  4%        ~ (p=0.350 n=20+10)
MemclrUnaligned/4_65536-16   94.82Gi ± 2%   94.22Gi ±  5%        ~ (p=0.650 n=20+10)
MemclrUnaligned/7_5-16       2.584Gi ± 1%   2.585Gi ±  3%        ~ (p=0.475 n=20+10)
MemclrUnaligned/7_16-16      7.999Gi ± 1%   7.999Gi ±  2%        ~ (p=0.619 n=20+10)
MemclrUnaligned/7_64-16      20.22Gi ± 1%   28.05Gi ±  2%  +38.72% (p=0.000 n=20+10)
MemclrUnaligned/7_256-16     48.00Gi ± 1%   47.65Gi ±  3%        ~ (p=0.328 n=20+10)
MemclrUnaligned/7_4096-16    96.54Gi ± 3%   95.19Gi ±  3%        ~ (p=0.650 n=20+10)
MemclrUnaligned/7_65536-16   93.69Gi ± 3%   94.02Gi ±  4%        ~ (p=0.846 n=20+10)
MemclrUnaligned/0_1M-16      127.7Gi ± 2%   128.2Gi ±  5%        ~ (p=0.373 n=20+10)
MemclrUnaligned/0_4M-16      72.14Gi ± 3%   46.75Gi ± 66%        ~ (p=0.350 n=20+10)
MemclrUnaligned/0_8M-16      71.82Gi ± 3%   72.98Gi ±  3%        ~ (p=0.559 n=20+10)
MemclrUnaligned/0_16M-16     72.29Gi ± 2%   72.24Gi ±  3%        ~ (p=0.681 n=20+10)
MemclrUnaligned/0_64M-16     70.35Gi ± 2%   72.07Gi ±  5%        ~ (p=0.055 n=20+10)
MemclrUnaligned/1_1M-16      89.98Gi ± 2%   88.77Gi ±  5%   -1.35% (p=0.028 n=20+10)
MemclrUnaligned/1_4M-16      80.28Gi ± 2%   81.74Gi ±  1%        ~ (p=0.120 n=20+10)
MemclrUnaligned/1_8M-16      81.23Gi ± 4%   80.39Gi ±  5%        ~ (p=0.373 n=20+10)
MemclrUnaligned/1_16M-16     67.16Gi ± 2%   57.17Gi ± 22%        ~ (p=0.286 n=20+10)
MemclrUnaligned/1_64M-16     70.77Gi ± 2%   70.04Gi ±  4%        ~ (p=0.502 n=20+10)
MemclrUnaligned/4_1M-16      88.99Gi ± 2%   88.48Gi ±  4%        ~ (p=0.074 n=20+10)
MemclrUnaligned/4_4M-16      80.49Gi ± 2%   86.41Gi ± 10%        ~ (p=0.082 n=20+10)
MemclrUnaligned/4_8M-16      80.28Gi ± 2%   81.28Gi ±  3%        ~ (p=0.328 n=20+10)
MemclrUnaligned/4_16M-16     66.58Gi ± 6%   64.83Gi ± 29%        ~ (p=0.328 n=20+10)
MemclrUnaligned/4_64M-16     70.07Gi ± 2%   71.37Gi ±  3%        ~ (p=0.448 n=20+10)
MemclrUnaligned/7_1M-16      88.51Gi ± 3%   89.98Gi ±  4%        ~ (p=0.502 n=20+10)
MemclrUnaligned/7_4M-16      76.04Gi ± 2%   80.74Gi ±  2%   +6.19% (p=0.000 n=20+10)
MemclrUnaligned/7_8M-16      80.00Gi ± 3%   79.86Gi ±  3%        ~ (p=0.846 n=20+10)
MemclrUnaligned/7_16M-16     67.53Gi ± 7%   58.23Gi ± 24%        ~ (p=0.286 n=20+10)
MemclrUnaligned/7_64M-16     70.08Gi ± 3%   71.98Gi ±  4%        ~ (p=0.061 n=20+10)

For golang#63678
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Oct 23, 2023
@gopherbot
Contributor

Change https://go.dev/cl/537055 mentions this issue: runtime: memclr_amd64 use PCALIGN optimize

@mauri870 mauri870 added this to the Unreleased milestone Oct 23, 2023
qiulaidongfeng added a commit to qiulaidongfeng/go that referenced this issue Oct 24, 2023
AlexanderYastrebov added a commit to AlexanderYastrebov/go that referenced this issue Oct 26, 2023
goos: linux
goarch: amd64
pkg: crypto/subtle
cpu: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
                      │   master    │                HEAD                 │
                      │   sec/op    │   sec/op     vs base                │
XORBytes/8Bytes-8       10.90n ± 1%   10.96n ± 5%        ~ (p=0.617 n=10)
XORBytes/128Bytes-8     14.85n ± 2%   12.05n ± 2%  -18.82% (p=0.000 n=10)
XORBytes/2048Bytes-8    88.30n ± 2%   72.64n ± 1%  -17.73% (p=0.000 n=10)
XORBytes/32768Bytes-8   1.489µ ± 2%   1.442µ ± 1%   -3.12% (p=0.000 n=10)
geomean                 67.91n        60.99n       -10.19%

                      │    master    │                 HEAD                 │
                      │     B/s      │     B/s       vs base                │
XORBytes/8Bytes-8       700.5Mi ± 1%   696.5Mi ± 5%        ~ (p=0.631 n=10)
XORBytes/128Bytes-8     8.026Gi ± 2%   9.890Gi ± 2%  +23.22% (p=0.000 n=10)
XORBytes/2048Bytes-8    21.60Gi ± 2%   26.26Gi ± 1%  +21.55% (p=0.000 n=10)
XORBytes/32768Bytes-8   20.50Gi ± 2%   21.16Gi ± 1%   +3.21% (p=0.000 n=10)
geomean                 7.022Gi        7.819Gi       +11.34%

For golang#63678
@gopherbot
Contributor

Change https://go.dev/cl/537856 mentions this issue: crypto/subtle: use PCALIGN in xorBytes

gopherbot pushed a commit that referenced this issue Oct 26, 2023
For #63678

Change-Id: I3996873773748a6f78acc6575e70e09bb6aea979
GitHub-Last-Rev: d9129cb
GitHub-Pull-Request: #63754
Reviewed-on: https://go-review.googlesource.com/c/go/+/537856
Reviewed-by: David Chase
Reviewed-by: Keith Randall
Auto-Submit: Keith Randall
Reviewed-by: Keith Randall
LUCI-TryBot-Result: Go LUCI
AlexanderYastrebov added a commit to AlexanderYastrebov/go that referenced this issue Oct 26, 2023
goos: linux
goarch: amd64
pkg: bytes
cpu: Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz
                                │    master     │                 HEAD                  │
                                │    sec/op     │    sec/op      vs base                │
Equal/0-8                         0.2800n ± 22%   0.2865n ± 26%        ~ (p=0.075 n=10)
Equal/1-8                          18.57n ±  2%    19.34n ±  6%   +4.15% (p=0.014 n=10)
Equal/6-8                          19.07n ±  1%    19.38n ±  2%   +1.63% (p=0.014 n=10)
Equal/9-8                          19.39n ±  2%    19.05n ±  1%   -1.78% (p=0.005 n=10)
Equal/15-8                         19.46n ±  1%    19.10n ±  1%   -1.85% (p=0.000 n=10)
Equal/16-8                         19.36n ±  2%    18.95n ±  1%   -2.09% (p=0.011 n=10)
Equal/20-8                         20.20n ±  1%    19.83n ±  1%   -1.86% (p=0.001 n=10)
Equal/32-8                         20.95n ±  1%    20.84n ±  1%   -0.57% (p=0.010 n=10)
Equal/4K-8                         97.40n ±  2%    81.34n ±  3%  -16.49% (p=0.000 n=10)
Equal/4M-8                         81.74µ ±  3%    71.52µ ±  4%  -12.49% (p=0.000 n=10)
Equal/64M-8                        1.319m ±  1%    1.139m ±  3%  -13.68% (p=0.000 n=10)
EqualBothUnaligned/64_0-8          8.707n ±  4%    8.588n ±  3%        ~ (p=0.353 n=10)
EqualBothUnaligned/64_1-8          8.513n ±  3%    8.614n ±  2%        ~ (p=0.481 n=10)
EqualBothUnaligned/64_4-8          8.752n ±  3%    8.637n ±  4%        ~ (p=0.148 n=10)
EqualBothUnaligned/64_7-8          8.742n ±  3%    8.514n ±  2%        ~ (p=0.052 n=10)
EqualBothUnaligned/4096_0-8        89.87n ±  3%    70.44n ±  5%  -21.63% (p=0.000 n=10)
EqualBothUnaligned/4096_1-8        91.67n ±  5%    70.89n ±  3%  -22.67% (p=0.000 n=10)
EqualBothUnaligned/4096_4-8        90.43n ±  2%    70.52n ±  3%  -22.01% (p=0.000 n=10)
EqualBothUnaligned/4096_7-8        89.53n ±  3%    72.02n ±  5%  -19.56% (p=0.000 n=10)
EqualBothUnaligned/4194304_0-8     86.43µ ±  3%    73.40µ ±  4%  -15.07% (p=0.000 n=10)
EqualBothUnaligned/4194304_1-8     85.48µ ±  2%    75.35µ ±  1%  -11.85% (p=0.000 n=10)
EqualBothUnaligned/4194304_4-8     86.51µ ±  3%    75.44µ ±  4%  -12.80% (p=0.000 n=10)
EqualBothUnaligned/4194304_7-8     86.40µ ±  3%    74.41µ ±  3%  -13.88% (p=0.000 n=10)
EqualBothUnaligned/67108864_0-8    1.374m ±  3%    1.171m ±  3%  -14.75% (p=0.000 n=10)
EqualBothUnaligned/67108864_1-8    1.401m ±  4%    1.198m ±  4%  -14.49% (p=0.000 n=10)
EqualBothUnaligned/67108864_4-8    1.393m ±  4%    1.205m ±  4%  -13.53% (p=0.000 n=10)
EqualBothUnaligned/67108864_7-8    1.396m ±  3%    1.199m ±  4%  -14.11% (p=0.000 n=10)
geomean                            735.7n          666.7n         -9.39%

                                │    master    │                 HEAD                 │
                                │     B/s      │     B/s       vs base                │
Equal/1-8                         51.36Mi ± 2%   49.32Mi ± 6%   -3.98% (p=0.015 n=10)
Equal/6-8                         300.0Mi ± 1%   295.3Mi ± 2%   -1.57% (p=0.011 n=10)
Equal/9-8                         442.5Mi ± 2%   450.6Mi ± 1%   +1.82% (p=0.005 n=10)
Equal/15-8                        734.9Mi ± 1%   748.8Mi ± 1%   +1.90% (p=0.000 n=10)
Equal/16-8                        788.4Mi ± 2%   805.2Mi ± 1%   +2.14% (p=0.011 n=10)
Equal/20-8                        944.2Mi ± 1%   961.8Mi ± 1%   +1.87% (p=0.002 n=10)
Equal/32-8                        1.422Gi ± 0%   1.430Gi ± 1%   +0.58% (p=0.011 n=10)
Equal/4K-8                        39.17Gi ± 2%   46.90Gi ± 3%  +19.74% (p=0.000 n=10)
Equal/4M-8                        47.79Gi ± 3%   54.62Gi ± 4%  +14.27% (p=0.000 n=10)
Equal/64M-8                       47.38Gi ± 1%   54.89Gi ± 3%  +15.85% (p=0.000 n=10)
EqualBothUnaligned/64_0-8         6.845Gi ± 4%   6.940Gi ± 3%        ~ (p=0.353 n=10)
EqualBothUnaligned/64_1-8         7.002Gi ± 3%   6.919Gi ± 2%        ~ (p=0.481 n=10)
EqualBothUnaligned/64_4-8         6.811Gi ± 3%   6.901Gi ± 4%        ~ (p=0.165 n=10)
EqualBothUnaligned/64_7-8         6.819Gi ± 3%   7.002Gi ± 2%        ~ (p=0.052 n=10)
EqualBothUnaligned/4096_0-8       42.45Gi ± 3%   54.16Gi ± 5%  +27.60% (p=0.000 n=10)
EqualBothUnaligned/4096_1-8       41.61Gi ± 6%   53.82Gi ± 3%  +29.33% (p=0.000 n=10)
EqualBothUnaligned/4096_4-8       42.19Gi ± 2%   54.09Gi ± 3%  +28.22% (p=0.000 n=10)
EqualBothUnaligned/4096_7-8       42.61Gi ± 3%   52.97Gi ± 5%  +24.33% (p=0.000 n=10)
EqualBothUnaligned/4194304_0-8    45.20Gi ± 3%   53.22Gi ± 4%  +17.75% (p=0.000 n=10)
EqualBothUnaligned/4194304_1-8    45.70Gi ± 2%   51.84Gi ± 1%  +13.43% (p=0.000 n=10)
EqualBothUnaligned/4194304_4-8    45.15Gi ± 3%   51.78Gi ± 4%  +14.68% (p=0.000 n=10)
EqualBothUnaligned/4194304_7-8    45.21Gi ± 3%   52.50Gi ± 4%  +16.12% (p=0.000 n=10)
EqualBothUnaligned/67108864_0-8   45.50Gi ± 3%   53.37Gi ± 3%  +17.30% (p=0.000 n=10)
EqualBothUnaligned/67108864_1-8   44.63Gi ± 4%   52.17Gi ± 4%  +16.89% (p=0.000 n=10)
EqualBothUnaligned/67108864_4-8   44.86Gi ± 4%   51.88Gi ± 4%  +15.65% (p=0.000 n=10)
EqualBothUnaligned/67108864_7-8   44.76Gi ± 3%   52.12Gi ± 4%  +16.43% (p=0.000 n=10)
geomean                           9.734Gi        10.79Gi       +10.88%

For golang#63678
@gopherbot
Contributor

Change https://go.dev/cl/537995 mentions this issue: internal/bytealg: use PCALIGN in memequal

@mauri870 mauri870 changed the title all: instruction alignment optimizations for amd64 assembly, good starter projects all: instruction alignment optimizations for assembly routines, good starter projects Oct 28, 2023
@gopherbot
Contributor

Change https://go.dev/cl/538315 mentions this issue: crypto/subtle: use PCALIGN in xorBytes for arm64

@gopherbot
Contributor

Change https://go.dev/cl/538116 mentions this issue: internal/bytealg: optimize Count/CountString in arm64

@qiulaidongfeng
Member

@mauri870 Is such a change worth submitting?
goos: windows
goarch: amd64
pkg: crypto/sha512
cpu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
                     │   old.txt   │              new.txt               │
                     │   sec/op    │   sec/op     vs base               │
Hash8Bytes/New-16      163.5n ± 1%   160.4n ± 0%  -1.90% (p=0.000 n=10)
Hash8Bytes/Sum384-16   157.5n ± 0%   156.1n ± 0%  -0.89% (p=0.000 n=10)
Hash8Bytes/Sum512-16   158.8n ± 0%   158.5n ± 0%       ~ (p=0.075 n=10)
Hash1K/New-16          1.157µ ± 0%   1.153µ ± 1%       ~ (p=0.091 n=10)
Hash1K/Sum384-16       1.153µ ± 0%   1.141µ ± 0%  -1.00% (p=0.000 n=10)
Hash1K/Sum512-16       1.154µ ± 0%   1.147µ ± 1%  -0.61% (p=0.000 n=10)
Hash8K/New-16          8.153µ ± 0%   8.076µ ± 0%  -0.94% (p=0.000 n=10)
Hash8K/Sum384-16       8.122µ ± 0%   8.060µ ± 1%  -0.76% (p=0.000 n=10)
Hash8K/Sum512-16       8.159µ ± 0%   8.082µ ± 0%  -0.93% (p=0.000 n=10)
geomean                1.146µ        1.136µ       -0.84%

                     │   old.txt    │               new.txt               │
                     │     B/s      │     B/s       vs base               │
Hash8Bytes/New-16      46.67Mi ± 1%   47.56Mi ± 0%  +1.91% (p=0.000 n=10)
Hash8Bytes/Sum384-16   48.44Mi ± 0%   48.89Mi ± 0%  +0.92% (p=0.000 n=10)
Hash8Bytes/Sum512-16   48.06Mi ± 0%   48.14Mi ± 0%       ~ (p=0.093 n=10)
Hash1K/New-16          843.8Mi ± 0%   847.0Mi ± 1%       ~ (p=0.089 n=10)
Hash1K/Sum384-16       847.6Mi ± 0%   855.7Mi ± 0%  +0.96% (p=0.000 n=10)
Hash1K/Sum512-16       846.3Mi ± 0%   851.7Mi ± 1%  +0.64% (p=0.000 n=10)
Hash8K/New-16          958.2Mi ± 0%   967.3Mi ± 0%  +0.95% (p=0.000 n=10)
Hash8K/Sum384-16       961.9Mi ± 0%   969.3Mi ± 1%  +0.76% (p=0.000 n=10)
Hash8K/Sum512-16       957.6Mi ± 0%   966.6Mi ± 0%  +0.94% (p=0.000 n=10)
geomean                338.3Mi        341.2Mi       +0.84%
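
A note on the geomean row: benchstat summarizes each column with a geometric mean of the per-benchmark values rather than an arithmetic one, so the large Hash8K numbers do not swamp the small Hash8Bytes ones. The Go sketch below recomputes the old.txt sec/op figure from the nine values above. The `geomean` helper is written for illustration only and is an assumption about the calculation, not benchstat's own code.

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of xs. It sums logarithms instead of
// multiplying the values directly, which avoids overflow on long inputs.
// It returns 0 for an empty slice.
func geomean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

func main() {
	// sec/op values from the old.txt column, converted to nanoseconds.
	old := []float64{163.5, 157.5, 158.8, 1157, 1153, 1154, 8153, 8122, 8159}
	// The result should land close to the 1.146µ reported by benchstat.
	fmt.Printf("%.0fns\n", geomean(old))
}
```

Because the inputs are the rounded values printed in the table, the last digit can differ slightly from what benchstat reports.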

@mauri870
Member Author

@qiulaidongfeng Probably not. The difference is small enough that it could just be noise from incidental alignment shifts. Generally, if a routine really benefits from instruction alignment, you'll see a noticeable improvement (more than about 5%) that is consistently reproducible with a higher -count in benchmarks.
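
The rule of thumb above can be checked by hand from the benchstat columns. A minimal Go sketch, assuming two hypothetical helpers (`percentChange` and `worthSubmitting`) and treating the 5% figure as a cutoff; neither helper is part of benchstat or the Go toolchain, and the cutoff is only the guideline from this thread:

```go
package main

import "fmt"

// percentChange returns the relative change from before to after, in percent.
// For sec/op, a negative result means the new code is faster.
func percentChange(before, after float64) float64 {
	return (after - before) / before * 100
}

// worthSubmitting reports whether the sec/op improvement exceeds thresholdPct.
// It looks at a single benchmark only; reproducibility across runs still has
// to be judged from the benchstat p-values and a high -count.
func worthSubmitting(before, after, thresholdPct float64) bool {
	return -percentChange(before, after) > thresholdPct
}

func main() {
	// Hash8Bytes/New from the crypto/sha512 run: 163.5ns -> 160.4ns.
	fmt.Printf("sha512: %.2f%% worth=%v\n",
		percentChange(163.5, 160.4), worthSubmitting(163.5, 160.4, 5))

	// EqualBothUnaligned/4096_0 from the memequal run: 89.87ns -> 70.44ns.
	fmt.Printf("memequal: %.2f%% worth=%v\n",
		percentChange(89.87, 70.44), worthSubmitting(89.87, 70.44, 5))
}
```

The sha512 change comes out a little under 2%, below the cutoff, while the memequal change is above 20%.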

@qiulaidongfeng
Member

@mauri870 Is it worth submitting results that show significant changes in only one benchmark?
goos: windows
goarch: amd64
pkg: bytes
cpu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
                         │   old.txt   │               new.txt               │
                         │   sec/op    │   sec/op     vs base                │
IndexByte/10-16            2.542n ± 2%   2.538n ± 1%        ~ (p=0.725 n=10)
IndexByte/32-16            2.991n ± 2%   2.954n ± 1%        ~ (p=0.089 n=10)
IndexByte/4K-16            55.75n ± 1%   39.37n ± 1%  -29.38% (p=0.000 n=10)
IndexByte/4M-16            34.91µ ± 1%   34.29µ ± 1%   -1.78% (p=0.000 n=10)
IndexByte/64M-16           1.512m ± 2%   1.533m ± 2%        ~ (p=0.353 n=10)
IndexBytePortable/10-16    3.124n ± 4%   3.096n ± 2%        ~ (p=0.325 n=10)
IndexBytePortable/32-16    8.316n ± 2%   8.213n ± 2%        ~ (p=0.448 n=10)
IndexBytePortable/4K-16    846.2n ± 1%   836.9n ± 3%        ~ (p=0.393 n=10)
IndexBytePortable/4M-16    857.4µ ± 3%   853.5µ ± 1%        ~ (p=0.481 n=10)
IndexBytePortable/64M-16   13.74m ± 2%   13.65m ± 2%        ~ (p=1.000 n=10)
geomean                    1.192µ        1.144µ        -4.01%

                         │   old.txt    │                new.txt                │
                         │     B/s      │      B/s       vs base                │
IndexByte/10-16            3.663Gi ± 2%    3.669Gi ± 1%        ~ (p=0.739 n=10)
IndexByte/32-16            9.962Gi ± 2%   10.089Gi ± 1%        ~ (p=0.089 n=10)
IndexByte/4K-16            68.42Gi ± 1%    96.89Gi ± 1%  +41.60% (p=0.000 n=10)
IndexByte/4M-16            111.9Gi ± 1%    113.9Gi ± 1%   +1.81% (p=0.000 n=10)
IndexByte/64M-16           41.35Gi ± 2%    40.78Gi ± 2%        ~ (p=0.353 n=10)
IndexBytePortable/10-16    2.981Gi ± 4%    3.008Gi ± 2%        ~ (p=0.353 n=10)
IndexBytePortable/32-16    3.583Gi ± 2%    3.629Gi ± 2%        ~ (p=0.436 n=10)
IndexBytePortable/4K-16    4.508Gi ± 1%    4.558Gi ± 3%        ~ (p=0.393 n=10)
IndexBytePortable/4M-16    4.556Gi ± 3%    4.577Gi ± 2%        ~ (p=0.481 n=10)
IndexBytePortable/64M-16   4.549Gi ± 2%    4.580Gi ± 2%        ~ (p=1.000 n=10)
geomean                    10.14Gi         10.57Gi        +4.18%

@mauri870
Member Author

@qiulaidongfeng That might be happening because you are optimizing just one branch of the assembly code. Either way, it seems to be a good optimization for that particular case, and it doesn't negatively impact the others.
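
One detail in the tables above can look inconsistent at first: IndexByte/4K shows -29.38% in sec/op but +41.60% in B/s. Both numbers describe the same change, because throughput is inversely proportional to time per operation. A small Go sketch of the conversion, with a function name made up for illustration:

```go
package main

import "fmt"

// throughputGainPct converts a relative change in time per operation
// (in percent, negative meaning faster) into the matching relative change
// in throughput, assuming the bytes processed per operation are unchanged.
func throughputGainPct(timeDeltaPct float64) float64 {
	ratio := 1 + timeDeltaPct/100 // new time divided by old time
	return (1/ratio - 1) * 100
}

func main() {
	// IndexByte/4K: sec/op changed by -29.38%.
	fmt.Printf("%+.2f%%\n", throughputGainPct(-29.38))
}
```

This is also why a modest-looking time reduction turns into a larger throughput percentage: the two columns are not symmetric around zero.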

@gopherbot
Contributor

Change https://go.dev/cl/538715 mentions this issue: internal/bytealg: optimize indexbyte in amd64

gopherbot pushed a commit that referenced this issue Oct 31, 2023
For #63678

goos: darwin
goarch: arm64
pkg: strings
                          │ count_old.txt │            count_new.txt            │
                          │    sec/op     │   sec/op     vs base                │
CountHard1-8                 368.7µ ± 11%   332.0µ ± 1%   -9.95% (p=0.002 n=10)
CountHard2-8                 348.8µ ±  5%   333.1µ ± 1%   -4.51% (p=0.000 n=10)
CountHard3-8                 402.7µ ± 25%   359.5µ ± 1%  -10.75% (p=0.000 n=10)
CountTorture-8              10.536µ ± 23%   9.913µ ± 0%   -5.91% (p=0.000 n=10)
CountTortureOverlapping-8    74.86µ ±  9%   67.56µ ± 1%   -9.75% (p=0.000 n=10)
CountByte/10-8               6.905n ±  3%   6.690n ± 1%   -3.11% (p=0.001 n=10)
CountByte/32-8               3.247n ± 13%   3.207n ± 2%   -1.23% (p=0.030 n=10)
CountByte/4096-8             83.72n ±  1%   82.58n ± 1%   -1.36% (p=0.007 n=10)
CountByte/4194304-8          85.17µ ±  5%   84.02µ ± 8%        ~ (p=0.075 n=10)
CountByte/67108864-8         1.497m ±  8%   1.397m ± 2%   -6.69% (p=0.000 n=10)
geomean                      9.977µ         9.426µ        -5.53%

                     │ count_old.txt │            count_new.txt            │
                     │      B/s      │     B/s       vs base               │
CountByte/10-8         1.349Gi ±  3%   1.392Gi ± 1%  +3.20% (p=0.002 n=10)
CountByte/32-8         9.180Gi ± 11%   9.294Gi ± 2%  +1.24% (p=0.029 n=10)
CountByte/4096-8       45.57Gi ±  1%   46.20Gi ± 1%  +1.38% (p=0.007 n=10)
CountByte/4194304-8    45.86Gi ±  5%   46.49Gi ± 7%       ~ (p=0.075 n=10)
CountByte/67108864-8   41.75Gi ±  8%   44.74Gi ± 2%  +7.16% (p=0.000 n=10)
geomean                16.10Gi         16.55Gi       +2.85%

Change-Id: Ifc2173ba3a926b0fa9598372d4404b8645929d45
Reviewed-on: https://go-review.googlesource.com/c/go/+/538116
Reviewed-by: Keith Randall
Reviewed-by: Bryan Mills
Run-TryBot: shuang cui
Auto-Submit: Keith Randall
Reviewed-by: Keith Randall
LUCI-TryBot-Result: Go LUCI
gopherbot pushed a commit that referenced this issue Nov 1, 2023
goos: windows
goarch: amd64
pkg: bytes
cpu: AMD Ryzen 7 7840HS w/ Radeon 780M Graphics
                         │   old.txt   │               new.txt               │
                         │   sec/op    │   sec/op     vs base                │
IndexByte/10-16            2.613n ± 1%   2.558n ± 1%   -2.09% (p=0.014 n=10)
IndexByte/32-16            3.034n ± 1%   3.010n ± 2%        ~ (p=0.305 n=10)
IndexByte/4K-16            57.20n ± 2%   39.58n ± 2%  -30.81% (p=0.000 n=10)
IndexByte/4M-16            34.48µ ± 1%   33.83µ ± 2%   -1.87% (p=0.023 n=10)
IndexByte/64M-16           1.493m ± 2%   1.450m ± 2%   -2.89% (p=0.000 n=10)
IndexBytePortable/10-16    3.172n ± 4%   3.163n ± 2%        ~ (p=0.684 n=10)
IndexBytePortable/32-16    8.465n ± 2%   8.375n ± 3%        ~ (p=0.631 n=10)
IndexBytePortable/4K-16    852.0n ± 1%   846.6n ± 3%        ~ (p=0.971 n=10)
IndexBytePortable/4M-16    868.2µ ± 2%   856.6µ ± 2%        ~ (p=0.393 n=10)
IndexBytePortable/64M-16   13.81m ± 2%   13.88m ± 3%        ~ (p=0.684 n=10)
geomean                    1.204µ        1.148µ        -4.63%

                         │   old.txt    │               new.txt                │
                         │     B/s      │     B/s       vs base                │
IndexByte/10-16            3.565Gi ± 1%   3.641Gi ± 1%   +2.15% (p=0.015 n=10)
IndexByte/32-16            9.821Gi ± 1%   9.899Gi ± 2%        ~ (p=0.315 n=10)
IndexByte/4K-16            66.70Gi ± 2%   96.39Gi ± 2%  +44.52% (p=0.000 n=10)
IndexByte/4M-16            113.3Gi ± 1%   115.5Gi ± 2%   +1.91% (p=0.023 n=10)
IndexByte/64M-16           41.85Gi ± 2%   43.10Gi ± 2%   +2.98% (p=0.000 n=10)
IndexBytePortable/10-16    2.936Gi ± 4%   2.945Gi ± 2%        ~ (p=0.684 n=10)
IndexBytePortable/32-16    3.521Gi ± 2%   3.559Gi ± 3%        ~ (p=0.631 n=10)
IndexBytePortable/4K-16    4.477Gi ± 1%   4.506Gi ± 3%        ~ (p=0.971 n=10)
IndexBytePortable/4M-16    4.499Gi ± 2%   4.560Gi ± 2%        ~ (p=0.393 n=10)
IndexBytePortable/64M-16   4.525Gi ± 2%   4.504Gi ± 3%        ~ (p=0.684 n=10)
geomean                    10.04Gi        10.53Gi        +4.86%

For #63678

Change-Id: I0571c2b540a816d57bd6ed8bb1df4191c7992d92
GitHub-Last-Rev: 7e95b8b
GitHub-Pull-Request: #63847
Reviewed-on: https://go-review.googlesource.com/c/go/+/538715
Reviewed-by: David Chase
Reviewed-by: Keith Randall
Reviewed-by: Keith Randall
Auto-Submit: Keith Randall
TryBot-Result: Gopher Robot
@mknyszek mknyszek moved this to In Progress in Go Compiler / Runtime Nov 1, 2023
@gopherbot
Contributor

Change https://go.dev/cl/539976 mentions this issue: runtime: optimize aeshashbody with PCALIGN in amd64

AlexanderYastrebov added a commit to AlexanderYastrebov/go that referenced this issue Nov 10, 2023
For golang#63678
@gopherbot
Contributor

Change https://go.dev/cl/541756 mentions this issue: internal/bytealg: optimize Count with PCALIGN in riscv64

gopherbot pushed a commit that referenced this issue Nov 17, 2023
For #63678

goos: linux
goarch: amd64
pkg: runtime
cpu: AMD EPYC Processor
                       │  base.txt   │               16.txt                │
                       │   sec/op    │   sec/op     vs base                │
Hash5-2                  4.969n ± 1%   4.583n ± 1%  -7.75% (n=100)
Hash16-2                 4.966n ± 1%   4.536n ± 1%  -8.65% (n=100)
Hash64-2                 5.687n ± 1%   5.726n ± 1%       ~ (p=0.181 n=100)
Hash1024-2               26.73n ± 1%   25.72n ± 1%  -3.76% (n=100)
Hash65536-2              1.345µ ± 0%   1.331µ ± 0%  -1.04% (p=0.000 n=100)
HashStringSpeed-2        12.76n ± 1%   12.53n ± 1%  -1.76% (p=0.000 n=100)
HashBytesSpeed-2         20.13n ± 1%   19.96n ± 1%       ~ (p=0.176 n=100)
HashInt32Speed-2         9.065n ± 1%   9.007n ± 1%       ~ (p=0.247 n=100)
HashInt64Speed-2         9.076n ± 1%   9.027n ± 1%       ~ (p=0.179 n=100)
HashStringArraySpeed-2   33.33n ± 1%   32.94n ± 3%  -1.19% (p=0.028 n=100)
FastrandHashiter-2       16.47n ± 0%   16.54n ± 1%  +0.39% (p=0.013 n=100)
geomean                  17.85n        17.43n       -2.33%

            │   base.txt   │                16.txt                 │
            │     B/s      │      B/s       vs base                │
Hash5-2       959.7Mi ± 1%   1040.4Mi ± 1%  +8.41% (p=0.000 n=100)
Hash16-2      3.001Gi ± 1%    3.285Gi ± 1%  +9.48% (p=0.000 n=100)
Hash64-2      10.48Gi ± 1%    10.41Gi ± 1%       ~ (p=0.179 n=100)
Hash1024-2    35.68Gi ± 1%    37.08Gi ± 1%  +3.92% (p=0.000 n=100)
Hash65536-2   45.41Gi ± 0%    45.86Gi ± 0%  +1.01% (p=0.000 n=100)
geomean       8.626Gi         9.001Gi       +4.35%

Change-Id: Icf98dc935181ea5d30f7cbd5dcf284ec7aef8e9a
Reviewed-on: https://go-review.googlesource.com/c/go/+/539976
Run-TryBot: qiulaidongfeng
Reviewed-by: Keith Randall
Auto-Submit: Keith Randall
Reviewed-by: Keith Randall
TryBot-Result: Gopher Robot
Reviewed-by: David Chase
gopherbot pushed a commit that referenced this issue Nov 17, 2023
For #63678

Change-Id: I427b8756e361fd4d36984c2bdb8bc3661ac3a0b8
GitHub-Last-Rev: 981d272
GitHub-Pull-Request: #63757
Reviewed-on: https://go-review.googlesource.com/c/go/+/537995
Reviewed-by: David Chase
TryBot-Result: Gopher Robot
Reviewed-by: qiulaidongfeng
Reviewed-by: Keith Randall
Reviewed-by: Mauri de Souza Meneguzzo
Auto-Submit: Keith Randall
Reviewed-by: Keith Randall
gopherbot pushed a commit that referenced this issue Nov 22, 2023
For #63678

Benchmark on Milk-V Mars CM eMMC (Starfive/JH7110 SoC)

goos: linux
goarch: riscv64
pkg: bytes
                │ /root/bytes.old.bench │        /root/bytes.pc16.bench         │
                │        sec/op         │   sec/op     vs base                  │
Count/10                    223.9n ± 1%   220.8n ± 1%   -1.36% (p=0.001 n=10)
Count/32                    571.6n ± 0%   571.3n ± 0%        ~ (p=0.054 n=10)
Count/4K                    38.56µ ± 0%   38.55µ ± 0%   -0.01% (p=0.010 n=10)
Count/4M                    40.13m ± 0%   39.21m ± 0%   -2.28% (p=0.000 n=10)
Count/64M                   627.5m ± 0%   627.4m ± 0%   -0.01% (p=0.019 n=10)
CountEasy/10                101.3n ± 0%   101.3n ± 0%        ~ (p=1.000 n=10) ¹
CountEasy/32                139.3n ± 0%   139.3n ± 0%        ~ (p=1.000 n=10) ¹
CountEasy/4K                5.565µ ± 0%   5.564µ ± 0%   -0.02% (p=0.001 n=10)
CountEasy/4M                5.619m ± 0%   5.619m ± 0%        ~ (p=0.190 n=10)
CountEasy/64M               89.94m ± 0%   89.93m ± 0%        ~ (p=0.436 n=10)
CountSingle/10              53.80n ± 0%   46.06n ± 0%  -14.39% (p=0.000 n=10)
CountSingle/32             104.30n ± 0%   79.64n ± 0%  -23.64% (p=0.000 n=10)
CountSingle/4K             10.413µ ± 0%   7.247µ ± 0%  -30.40% (p=0.000 n=10)
CountSingle/4M             11.603m ± 0%   8.388m ± 0%  -27.71% (p=0.000 n=10)
CountSingle/64M             230.9m ± 0%   172.3m ± 0%  -25.40% (p=0.000 n=10)
CountHard1                  9.981m ± 0%   9.981m ± 0%        ~ (p=0.810 n=10)
CountHard2                  9.981m ± 0%   9.981m ± 0%        ~ (p=0.315 n=10)
CountHard3                  9.981m ± 0%   9.981m ± 0%        ~ (p=0.159 n=10)
geomean                     144.6µ        133.5µ        -7.70%
¹ all samples are equal

                │ /root/bytes.old.bench │        /root/bytes.pc16.bench         │
                │          B/s          │      B/s       vs base                │
Count/10                   42.60Mi ± 1%    43.19Mi ± 1%   +1.39% (p=0.001 n=10)
Count/32                   53.38Mi ± 0%    53.42Mi ± 0%   +0.06% (p=0.049 n=10)
Count/4K                   101.3Mi ± 0%    101.3Mi ± 0%        ~ (p=0.077 n=10)
Count/4M                   99.68Mi ± 0%   102.01Mi ± 0%   +2.34% (p=0.000 n=10)
Count/64M                  102.0Mi ± 0%    102.0Mi ± 0%        ~ (p=0.076 n=10)
CountEasy/10               94.18Mi ± 0%    94.18Mi ± 0%        ~ (p=0.054 n=10)
CountEasy/32               219.1Mi ± 0%    219.1Mi ± 0%   +0.01% (p=0.016 n=10)
CountEasy/4K               702.0Mi ± 0%    702.0Mi ± 0%   +0.00% (p=0.000 n=10)
CountEasy/4M               711.9Mi ± 0%    711.9Mi ± 0%        ~ (p=0.133 n=10)
CountEasy/64M              711.6Mi ± 0%    711.7Mi ± 0%        ~ (p=0.447 n=10)
CountSingle/10             177.2Mi ± 0%    207.0Mi ± 0%  +16.81% (p=0.000 n=10)
CountSingle/32             292.7Mi ± 0%    383.2Mi ± 0%  +30.91% (p=0.000 n=10)
CountSingle/4K             375.1Mi ± 0%    539.0Mi ± 0%  +43.70% (p=0.000 n=10)
CountSingle/4M             344.7Mi ± 0%    476.9Mi ± 0%  +38.33% (p=0.000 n=10)
CountSingle/64M            277.2Mi ± 0%    371.5Mi ± 0%  +34.05% (p=0.000 n=10)
geomean                    199.7Mi         219.8Mi       +10.10%

Change-Id: I1abf6b220b9802028f8ad5eebc8d3b7cfa3e89ea
Reviewed-on: https://go-review.googlesource.com/c/go/+/541756
Reviewed-by: David Chase
Reviewed-by: Cherry Mui
Reviewed-by: Joel Sing
Run-TryBot: M Zhuo
TryBot-Result: Gopher Robot
Reviewed-by: Wang Yaduo
Reviewed-by: Mark Ryan