runtime: use PCALIGN to align loops in memclr_amd64
Time per operation (sec/op; n = ns, µ = µs), old vs. new:
MemclrUnaligned/0_5-16        1.821n ± 1%    1.803n ±  2%        ~ (p=0.076 n=20+10)
MemclrUnaligned/0_16-16       1.879n ± 1%    1.855n ±  1%        ~ (p=0.210 n=20+10)
MemclrUnaligned/0_64-16       2.044n ± 1%    2.044n ±  2%        ~ (p=0.871 n=20+10)
MemclrUnaligned/0_256-16      3.614n ± 1%    3.600n ±  3%        ~ (p=0.552 n=20+10)
MemclrUnaligned/0_4096-16     32.63n ± 2%    32.34n ±  3%        ~ (p=0.948 n=20+10)
MemclrUnaligned/0_65536-16    483.5n ± 3%    479.1n ±  5%        ~ (p=0.588 n=20+10)
MemclrUnaligned/1_5-16        1.800n ± 1%    1.808n ±  1%        ~ (p=0.333 n=20+10)
MemclrUnaligned/1_16-16       1.863n ± 1%    1.847n ±  2%        ~ (p=0.345 n=20+10)
MemclrUnaligned/1_64-16       2.929n ± 1%    2.107n ±  2%  -28.05% (p=0.000 n=20+10)
MemclrUnaligned/1_256-16      4.942n ± 1%    4.973n ±  3%        ~ (p=0.302 n=20+10)
MemclrUnaligned/1_4096-16     40.09n ± 1%    39.49n ±  2%        ~ (p=0.210 n=20+10)
MemclrUnaligned/1_65536-16    650.0n ± 3%    653.7n ±  4%        ~ (p=0.530 n=20+10)
MemclrUnaligned/4_5-16        1.806n ± 1%    1.812n ±  1%        ~ (p=0.291 n=20+10)
MemclrUnaligned/4_16-16       1.867n ± 1%    1.862n ±  1%        ~ (p=0.551 n=20+10)
MemclrUnaligned/4_64-16       2.946n ± 2%    2.752n ±  2%   -6.59% (p=0.000 n=20+10)
MemclrUnaligned/4_256-16      4.942n ± 1%    5.144n ±  2%   +4.08% (p=0.000 n=20+10)
MemclrUnaligned/4_4096-16     39.88n ± 1%    40.21n ±  4%        ~ (p=0.346 n=20+10)
MemclrUnaligned/4_65536-16    643.7n ± 2%    647.8n ±  4%        ~ (p=0.657 n=20+10)
MemclrUnaligned/7_5-16        1.802n ± 1%    1.801n ±  3%        ~ (p=0.481 n=20+10)
MemclrUnaligned/7_16-16       1.863n ± 1%    1.863n ±  2%        ~ (p=0.626 n=20+10)
MemclrUnaligned/7_64-16       2.947n ± 1%    2.125n ±  2%  -27.91% (p=0.000 n=20+10)
MemclrUnaligned/7_256-16      4.967n ± 1%    5.005n ±  3%        ~ (p=0.302 n=20+10)
MemclrUnaligned/7_4096-16     39.52n ± 3%    40.07n ±  3%        ~ (p=0.650 n=20+10)
MemclrUnaligned/7_65536-16    651.5n ± 3%    649.2n ±  4%        ~ (p=0.846 n=20+10)
MemclrUnaligned/0_1M-16       7.646µ ± 2%    7.618µ ±  5%        ~ (p=0.373 n=20+10)
MemclrUnaligned/0_4M-16       54.15µ ± 3%   119.05µ ± 66%        ~ (p=0.350 n=20+10)
MemclrUnaligned/0_8M-16       108.8µ ± 3%    107.0µ ±  3%        ~ (p=0.559 n=20+10)
MemclrUnaligned/0_16M-16      216.2µ ± 2%    216.3µ ±  3%        ~ (p=0.681 n=20+10)
MemclrUnaligned/0_64M-16      888.4µ ± 2%    867.3µ ±  6%        ~ (p=0.055 n=20+10)
MemclrUnaligned/1_1M-16       10.85µ ± 2%    11.00µ ±  5%   +1.37% (p=0.028 n=20+10)
MemclrUnaligned/1_4M-16       48.66µ ± 2%    47.79µ ±  1%        ~ (p=0.120 n=20+10)
MemclrUnaligned/1_8M-16       96.18µ ± 4%    97.18µ ±  5%        ~ (p=0.373 n=20+10)
MemclrUnaligned/1_16M-16      232.7µ ± 2%    276.9µ ± 19%        ~ (p=0.286 n=20+10)
MemclrUnaligned/1_64M-16      883.2µ ± 2%    892.3µ ±  4%        ~ (p=0.502 n=20+10)
MemclrUnaligned/4_1M-16       10.97µ ± 2%    11.04µ ±  4%        ~ (p=0.073 n=20+10)
MemclrUnaligned/4_4M-16       48.53µ ± 2%    45.21µ ± 11%        ~ (p=0.082 n=20+10)
MemclrUnaligned/4_8M-16       97.31µ ± 2%    96.12µ ±  3%        ~ (p=0.311 n=20+10)
MemclrUnaligned/4_16M-16      234.7µ ± 6%    241.0µ ± 42%        ~ (p=0.328 n=20+10)
MemclrUnaligned/4_64M-16      891.9µ ± 2%    875.8µ ±  3%        ~ (p=0.448 n=20+10)
MemclrUnaligned/7_1M-16       11.03µ ± 3%    10.85µ ±  4%        ~ (p=0.495 n=20+10)
MemclrUnaligned/7_4M-16       51.37µ ± 2%    48.38µ ±  2%   -5.83% (p=0.000 n=20+10)
MemclrUnaligned/7_8M-16       97.66µ ± 3%    97.83µ ±  3%        ~ (p=0.846 n=20+10)
MemclrUnaligned/7_16M-16      231.4µ ± 7%    274.7µ ± 29%        ~ (p=0.286 n=20+10)
MemclrUnaligned/7_64M-16      891.8µ ± 3%    868.4µ ±  4%        ~ (p=0.061 n=20+10)
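Each row follows benchstat's layout: the median of the old samples and of the new samples, each with a ± confidence interval (95% by default), then either a percent change or "~" when the difference is not statistically significant. The p value comes from the significance test, and n=20+10 gives the old and new sample counts. The sketch below reproduces only the rule for the last column, assuming benchstat's default significance level of 0.05. The function name and signature are hypothetical. Because it works from the rounded medians printed in the table, it can disagree with benchstat in the last digit.

```go
package main

import "fmt"

// deltaCell sketches the last column of a benchstat comparison row.
// It prints "~" when p is not below alpha (not significant) and otherwise
// prints the relative change of the new median against the old median.
// Hypothetical helper; benchstat itself uses unrounded medians.
func deltaCell(oldMedian, newMedian, p, alpha float64) string {
	if p >= alpha {
		return "~"
	}
	return fmt.Sprintf("%+.2f%%", (newMedian-oldMedian)/oldMedian*100)
}
```

With the 4_64 medians (2.946n to 2.752n, p=0.000) this yields -6.59%, matching the table. With the rounded 1_1M medians (10.85µ to 11.00µ, p=0.028) it yields +1.38% where the table shows +1.37%, because benchstat computes from unrounded values.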

Throughput (B/s; Gi = GiB/s), old vs. new:
MemclrUnaligned/0_5-16       2.558Gi ± 1%   2.583Gi ±  2%        ~ (p=0.076 n=20+10)
MemclrUnaligned/0_16-16      7.931Gi ± 1%   8.030Gi ±  1%        ~ (p=0.214 n=20+10)
MemclrUnaligned/0_64-16      29.15Gi ± 1%   29.17Gi ±  2%        ~ (p=0.914 n=20+10)
MemclrUnaligned/0_256-16     65.97Gi ± 1%   66.23Gi ±  3%        ~ (p=0.559 n=20+10)
MemclrUnaligned/0_4096-16    116.9Gi ± 2%   117.9Gi ±  3%        ~ (p=0.948 n=20+10)
MemclrUnaligned/0_65536-16   126.2Gi ± 3%   127.4Gi ±  5%        ~ (p=0.588 n=20+10)
MemclrUnaligned/1_5-16       2.587Gi ± 1%   2.575Gi ±  1%        ~ (p=0.328 n=20+10)
MemclrUnaligned/1_16-16      7.998Gi ± 1%   8.066Gi ±  2%        ~ (p=0.373 n=20+10)
MemclrUnaligned/1_64-16      20.35Gi ± 1%   28.29Gi ±  2%  +39.02% (p=0.000 n=20+10)
MemclrUnaligned/1_256-16     48.24Gi ± 1%   47.94Gi ±  3%        ~ (p=0.307 n=20+10)
MemclrUnaligned/1_4096-16    95.16Gi ± 1%   96.61Gi ±  2%        ~ (p=0.214 n=20+10)
MemclrUnaligned/1_65536-16   93.90Gi ± 3%   93.38Gi ±  4%        ~ (p=0.530 n=20+10)
MemclrUnaligned/4_5-16       2.578Gi ± 1%   2.569Gi ±  1%        ~ (p=0.286 n=20+10)
MemclrUnaligned/4_16-16      7.979Gi ± 1%   8.005Gi ±  1%        ~ (p=0.588 n=20+10)
MemclrUnaligned/4_64-16      20.24Gi ± 2%   21.67Gi ±  2%   +7.06% (p=0.000 n=20+10)
MemclrUnaligned/4_256-16     48.24Gi ± 1%   46.35Gi ±  2%   -3.92% (p=0.000 n=20+10)
MemclrUnaligned/4_4096-16    95.65Gi ± 2%   94.87Gi ±  4%        ~ (p=0.350 n=20+10)
MemclrUnaligned/4_65536-16   94.82Gi ± 2%   94.22Gi ±  5%        ~ (p=0.650 n=20+10)
MemclrUnaligned/7_5-16       2.584Gi ± 1%   2.585Gi ±  3%        ~ (p=0.475 n=20+10)
MemclrUnaligned/7_16-16      7.999Gi ± 1%   7.999Gi ±  2%        ~ (p=0.619 n=20+10)
MemclrUnaligned/7_64-16      20.22Gi ± 1%   28.05Gi ±  2%  +38.72% (p=0.000 n=20+10)
MemclrUnaligned/7_256-16     48.00Gi ± 1%   47.65Gi ±  3%        ~ (p=0.328 n=20+10)
MemclrUnaligned/7_4096-16    96.54Gi ± 3%   95.19Gi ±  3%        ~ (p=0.650 n=20+10)
MemclrUnaligned/7_65536-16   93.69Gi ± 3%   94.02Gi ±  4%        ~ (p=0.846 n=20+10)
MemclrUnaligned/0_1M-16      127.7Gi ± 2%   128.2Gi ±  5%        ~ (p=0.373 n=20+10)
MemclrUnaligned/0_4M-16      72.14Gi ± 3%   46.75Gi ± 66%        ~ (p=0.350 n=20+10)
MemclrUnaligned/0_8M-16      71.82Gi ± 3%   72.98Gi ±  3%        ~ (p=0.559 n=20+10)
MemclrUnaligned/0_16M-16     72.29Gi ± 2%   72.24Gi ±  3%        ~ (p=0.681 n=20+10)
MemclrUnaligned/0_64M-16     70.35Gi ± 2%   72.07Gi ±  5%        ~ (p=0.055 n=20+10)
MemclrUnaligned/1_1M-16      89.98Gi ± 2%   88.77Gi ±  5%   -1.35% (p=0.028 n=20+10)
MemclrUnaligned/1_4M-16      80.28Gi ± 2%   81.74Gi ±  1%        ~ (p=0.120 n=20+10)
MemclrUnaligned/1_8M-16      81.23Gi ± 4%   80.39Gi ±  5%        ~ (p=0.373 n=20+10)
MemclrUnaligned/1_16M-16     67.16Gi ± 2%   57.17Gi ± 22%        ~ (p=0.286 n=20+10)
MemclrUnaligned/1_64M-16     70.77Gi ± 2%   70.04Gi ±  4%        ~ (p=0.502 n=20+10)
MemclrUnaligned/4_1M-16      88.99Gi ± 2%   88.48Gi ±  4%        ~ (p=0.074 n=20+10)
MemclrUnaligned/4_4M-16      80.49Gi ± 2%   86.41Gi ± 10%        ~ (p=0.082 n=20+10)
MemclrUnaligned/4_8M-16      80.28Gi ± 2%   81.28Gi ±  3%        ~ (p=0.328 n=20+10)
MemclrUnaligned/4_16M-16     66.58Gi ± 6%   64.83Gi ± 29%        ~ (p=0.328 n=20+10)
MemclrUnaligned/4_64M-16     70.07Gi ± 2%   71.37Gi ±  3%        ~ (p=0.448 n=20+10)
MemclrUnaligned/7_1M-16      88.51Gi ± 3%   89.98Gi ±  4%        ~ (p=0.502 n=20+10)
MemclrUnaligned/7_4M-16      76.04Gi ± 2%   80.74Gi ±  2%   +6.19% (p=0.000 n=20+10)
MemclrUnaligned/7_8M-16      80.00Gi ± 3%   79.86Gi ±  3%        ~ (p=0.846 n=20+10)
MemclrUnaligned/7_16M-16     67.53Gi ± 7%   58.23Gi ± 24%        ~ (p=0.286 n=20+10)
MemclrUnaligned/7_64M-16     70.08Gi ± 3%   71.98Gi ±  4%        ~ (p=0.061 n=20+10)
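This table reports the same measurements as throughput. A Go benchmark reports bytes per second only when it calls b.SetBytes, and benchstat prints the result with binary prefixes, so "Gi" means GiB/s. The conversion is the number of bytes cleared divided by the time per operation. The sketch below uses a hypothetical helper name and reproduces the 1_64 row from its sec/op values (2.929n old, 2.107n new).

```go
package main

// throughputGiB converts the time taken to clear n bytes (in nanoseconds
// per operation) into GiB/s, the unit benchstat shows as "Gi" for B/s.
func throughputGiB(n int, nsPerOp float64) float64 {
	bytesPerSec := float64(n) / (nsPerOp * 1e-9)
	return bytesPerSec / (1 << 30) // 1 GiB = 2^30 bytes
}
```

Because throughput is the reciprocal of time, the -28.05% time change for 1_64 appears as roughly +39% in this table (1 / 0.7195 ≈ 1.39).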

For golang#63678
qiulaidongfeng committed Oct 23, 2023
1 parent bc2124d commit 6c01665
Showing 1 changed file with 3 additions and 0 deletions.
src/runtime/memclr_amd64.s
```diff
@@ -57,6 +57,7 @@ skip_erms:
 	JE	loop_preheader_avx2
 	// TODO: for really big clears, use MOVNTDQ, even without AVX2.
 
+	PCALIGN	$16
 loop:
 	MOVOU	X15, 0(DI)
 	MOVOU	X15, 16(DI)
```
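PCALIGN $16 asks the Go assembler to insert padding (NOPs on amd64) so that the next instruction, here the first instruction after the loop label, starts on a 16-byte boundary. For a power-of-two alignment, the padding is the distance from the current offset to the next multiple of the alignment. The sketch below shows that arithmetic. It is not the assembler's source. The helper name is hypothetical, and it assumes the offset is measured from a function start that is itself at least that aligned.

```go
package main

import "fmt"

// padFor returns how many padding bytes must be emitted at offset pc so
// that the following instruction starts on an align-byte boundary.
// Assumes align is a power of two (e.g. 16 or 32 as in this commit).
// Hypothetical helper for illustration only.
func padFor(pc, align uint64) (uint64, error) {
	if align == 0 || align&(align-1) != 0 {
		return 0, fmt.Errorf("alignment %d is not a power of two", align)
	}
	// Unsigned negation wraps, so -pc & (align-1) is the distance to the
	// next multiple of align, or 0 if pc is already aligned.
	return -pc & (align - 1), nil
}
```

For example, a loop label that would land at offset 69 gets 11 bytes of padding under PCALIGN $16 and moves to offset 80. An offset that is already a multiple of the alignment gets no padding.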
```diff
@@ -89,6 +90,7 @@ loop_preheader_avx2:
 	CMPQ	BX, $0x2000000
 	JAE	loop_preheader_avx2_huge
 
+	PCALIGN	$32
 loop_avx2:
 	VMOVDQU	Y0, 0(DI)
 	VMOVDQU	Y0, 32(DI)
```
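The commit does not say why aligning the loop heads helps. One common rationale is that many x86 front ends fetch and cache decoded instructions in aligned, fixed-size blocks. A loop body that starts partway into a block can then touch one more block on every iteration. The hypothetical helper below counts how many aligned blocks a byte range overlaps. The 32-byte block size and the body lengths used with it are illustrative assumptions, not measurements of loop_avx2.

```go
package main

// blocksSpanned returns how many aligned blockSize-byte blocks the byte
// range [start, start+length) overlaps. blockSize must be non-zero.
// Hypothetical helper for illustration only.
func blocksSpanned(start, length, blockSize uint64) uint64 {
	if length == 0 {
		return 0
	}
	first := start / blockSize
	last := (start + length - 1) / blockSize
	return last - first + 1
}
```

Under these assumptions, a 32-byte body starting at offset 80 overlaps two 32-byte blocks. Once PCALIGN $32 moves it to offset 96, it overlaps only one.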
```diff
@@ -135,6 +137,7 @@ loop_preheader_avx2_huge:
 	ANDQ	$~31, DI
 	SUBQ	DI, SI
 	ADDQ	SI, BX
+	PCALIGN	$32
 loop_avx2_huge:
 	VMOVNTDQ	Y0, 0(DI)
 	VMOVNTDQ	Y0, 32(DI)
```
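The new PCALIGN aligns code. It is separate from the data alignment that the context lines of this hunk already perform. ANDQ $~31, DI rounds the destination down to a 32-byte boundary, presumably because a 256-bit VMOVNTDQ store requires a 32-byte-aligned memory operand. The next two instructions fold the pointer change back into the byte count BX. The instructions that load SI are outside this hunk, so the sketch assumes SI holds the address from which BX was counted. Under that assumption, the end of the region (DI+BX) is unchanged. The function below is a hypothetical Go transliteration that uses wrapping uint64 arithmetic, as the 64-bit instructions do.

```go
package main

// realign transliterates the three context lines of the last hunk:
//
//	ANDQ $~31, DI
//	SUBQ DI, SI
//	ADDQ SI, BX
//
// Assumption (not visible in the hunk): si holds the address that bx,
// the remaining byte count, was measured from.
func realign(di, si, bx uint64) (newDI, newBX uint64) {
	di &^= 31 // round DI down to a 32-byte boundary
	si -= di  // SI = start - aligned DI; may wrap "negative"
	bx += si  // adjust the count; wraps like the 64-bit ADDQ
	return di, bx
}
```

This holds both when DI equals the start and when DI was advanced past the start before rounding. In either case the returned DI is 32-byte aligned and DI+BX equals the original start plus count.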