feat: support neon, sse simd and dynamic dispatch #56

Merged
merged 3 commits into from
Apr 11, 2023

@xiegx94 xiegx94 commented Mar 2, 2023

Main changes

  • support both static dispatch and dynamic dispatch
  • support neon and sse architecture

@xiegx94 xiegx94 mentioned this pull request Mar 2, 2023

xiegx94 commented Mar 7, 2023

This PR provides two ways to support multi-arch dispatch: dispatch at compile time (static dispatch) and dispatch at runtime (dynamic dispatch). Dynamic dispatch is implemented with GCC/Clang function multiversioning, which prevents the dispatched functions from being inlined at compile time, so performance is worse than with static dispatch.
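To make the trade-off concrete, here is a minimal sketch of the two styles in portable C++. It is not sonic's code and does not use function multiversioning itself: `portable_sum`, `unrolled_sum`, `sum_static`, `DynamicSum`, and the `DEMO_STATIC_UNROLLED` macro are all hypothetical names, and a plain function pointer stands in for the resolver the compiler would generate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace demo {

// Two stand-in "architecture" implementations (hypothetical; real ones would
// use SIMD intrinsics).
inline uint64_t portable_sum(const uint8_t* p, size_t n) {
  uint64_t s = 0;
  for (size_t i = 0; i < n; ++i) s += p[i];
  return s;
}

inline uint64_t unrolled_sum(const uint8_t* p, size_t n) {
  uint64_t s0 = 0, s1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    s0 += p[i];
    s1 += p[i + 1];
  }
  if (i < n) s0 += p[i];
  return s0 + s1;
}

// Static dispatch: the choice is fixed by a macro at compile time, so the
// compiler sees the callee directly and is free to inline it.
#if defined(DEMO_STATIC_UNROLLED)
inline uint64_t sum_static(const uint8_t* p, size_t n) { return unrolled_sum(p, n); }
#else
inline uint64_t sum_static(const uint8_t* p, size_t n) { return portable_sum(p, n); }
#endif

// Dynamic dispatch: the choice is made once at runtime and stored in a
// function pointer. Every call goes through the pointer, which generally
// prevents inlining at the call site.
using SumFn = uint64_t (*)(const uint8_t*, size_t);

inline SumFn resolve_sum(bool cpu_has_fast_path) {
  return cpu_has_fast_path ? unrolled_sum : portable_sum;
}

class DynamicSum {
 public:
  explicit DynamicSum(bool cpu_has_fast_path)
      : fn_(resolve_sum(cpu_has_fast_path)) {
    assert(fn_ != nullptr);
  }
  uint64_t operator()(const uint8_t* p, size_t n) const { return fn_(p, n); }

 private:
  SumFn fn_;
};

}  // namespace demo
```

Both paths compute the same result; the difference is only in when the implementation is chosen and whether the call can be inlined.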

Structure of the arch folder

├── avx2
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── common
│   ├── quote_common.h
│   ├── quote_tables.h
│   ├── skip_common.h
│   ├── unicode_common.h
│   └── x86_common
│       ├── itoa.h
│       ├── quote.inc.h
│       └── skip.inc.h
├── neon
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── simd_base.h
├── simd_dispatch.h
├── simd_itoa.h
├── simd_quote.h
├── simd_skip.h
├── simd_str2int.h
├── sonic_cpu_feature.h
├── sse
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── target_macro.h
└── x86_ifuncs
    ├── base.h
    ├── ifunc_macro.h
    ├── itoa.h
    ├── quote.h
    ├── skip.h
    └── str2int.h

How to add a new function

To add a new SIMD function called foo, follow these steps:

  1. Implement foo for every arch, for example:
namespace sonic_json {
namespace internal {
namespace avx2 {

void foo() { return; }

}  // namespace avx2
}  // namespace internal
}  // namespace sonic_json
  2. Provide dynamic dispatch functions for x86 (or other platforms):
namespace sonic_json {
namespace internal {

__attribute__((target(HASWELL))) inline void foo() { return avx2::foo(); }
__attribute__((target(WESTMERE))) inline void foo() { return sse::foo(); }
__attribute__((target("default"))) inline void foo() { return sse::foo(); }

}
}
  3. If you implement foo in a new header file foo.h, provide that file for every arch and for x86_ifuncs, then add a new file simd_foo.h in the arch folder:
#pragma once

#include "simd_dispatch.h"

#include INCLUDE_ARCH_FILE(foo.h)

namespace sonic_json {
namespace internal {

SONIC_USING_ARCH_FUNC(foo);

}
}
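The static-dispatch side of step 3 can be sketched as follows. This is only an illustration under stated assumptions: the PR does not show how `SONIC_STRINGIFY` is defined, so the sketch uses a conventional two-step stringify helper, and every `DEMO_`/`demo_` name is hypothetical. It shows what the include macro expands to and how the using-declaration exposes the per-arch function.

```cpp
#include <cassert>
#include <cstring>

// Assumed two-step stringify helper (the PR does not show its definition).
#define DEMO_STRINGIFY2(x) #x
#define DEMO_STRINGIFY(x) DEMO_STRINGIFY2(x)

// Assumed static-dispatch rules for a build that targets avx2.
#define DEMO_INCLUDE_ARCH_FILE(file) DEMO_STRINGIFY(avx2/file)
#define DEMO_USING_ARCH_FUNC(func) using avx2::func

namespace demo_json {
namespace internal {

namespace avx2 {
inline int foo() { return 2; }  // stand-in for the avx2 implementation
}  // namespace avx2

namespace sse {
inline int foo() { return 1; }  // stand-in for the sse implementation
}  // namespace sse

// Pulls avx2::foo into internal::, so callers simply write internal::foo().
DEMO_USING_ARCH_FUNC(foo);

}  // namespace internal
}  // namespace demo_json

// The string literal that an #include directive would receive.
inline const char* demo_arch_header() { return DEMO_INCLUDE_ARCH_FILE(foo.h); }
```

With these definitions, `demo_json::internal::foo()` resolves to the avx2 stand-in, and `demo_arch_header()` yields the path `avx2/foo.h`.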

How to add a new architecture

To add support for a new architecture named Y86:

  1. Add a rule to sonic_cpu_feature.h that detects the Y86 macro (provided by GCC/Clang):
#if defined(__Y86__)
#define SONIC_HAVE_Y86
#endif
  2. Add a rule to simd_dispatch.h that describes how to dispatch:
#if defined(SONIC_STATIC_DISPATCH)
#if defined(SONIC_HAVE_Y86)
#define SONIC_USING_ARCH_FUNC(func) using y86::func
#define INCLUDE_ARCH_FILE(file) SONIC_STRINGIFY(y86/file)
#endif
#elif defined(SONIC_DYNAMIC_DISPATCH)
#if defined(SONIC_HAVE_Y86)
#define SONIC_USING_ARCH_FUNC(func)
#define INCLUDE_ARCH_FILE(file) SONIC_STRINGIFY(y86_ifuncs/file)
#endif
#endif
  3. Create the y86 folder and implement all SIMD functions.
  4. Create the y86_ifuncs folder and implement all multiversioned functions.
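The detection step above follows a common pattern: map a compiler-predefined macro to a project-level feature macro, then branch on the project macro. The sketch below assumes the predefined macros `__AVX2__`, `__SSE4_2__`, `__ARM_NEON`, and `__aarch64__` from GCC/Clang; the `DEMO_HAVE_*` macros and `selected_arch` are hypothetical, and which SSE level sonic actually requires is not stated in this PR.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical feature macros modeled on the SONIC_HAVE_Y86 pattern.
#if defined(__AVX2__)
#define DEMO_HAVE_AVX2
#elif defined(__SSE4_2__)
#define DEMO_HAVE_SSE
#elif defined(__ARM_NEON) || defined(__aarch64__)
#define DEMO_HAVE_NEON
#endif

namespace demo {

// Reports which per-arch folder a static-dispatch build would select.
inline const char* selected_arch() {
#if defined(DEMO_HAVE_AVX2)
  return "avx2";
#elif defined(DEMO_HAVE_SSE)
  return "sse";
#elif defined(DEMO_HAVE_NEON)
  return "neon";
#else
  return "generic";  // no SIMD backend detected in this sketch
#endif
}

}  // namespace demo
```

A new architecture would add one more branch to each of the two conditional chains, mirroring steps 1 and 2.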

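sonic's multi-architecture design supports both selecting a specific instruction set at compile time and selecting a suitable instruction set at runtime based on the platform it runs on. Both are supported because choosing at runtime prevents functions/interfaces that use SIMD from being inlined at compile time, which causes some performance loss.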

arch directory structure

├── avx2
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── common
│   ├── quote_common.h
│   ├── quote_tables.h
│   ├── skip_common.h
│   ├── unicode_common.h
│   └── x86_common
│       ├── itoa.h
│       ├── quote.inc.h
│       └── skip.inc.h
├── neon
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── simd_base.h
├── simd_dispatch.h
├── simd_itoa.h
├── simd_quote.h
├── simd_skip.h
├── simd_str2int.h
├── sonic_cpu_feature.h
├── sse
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── target_macro.h
└── x86_ifuncs
    ├── base.h
    ├── ifunc_macro.h
    ├── itoa.h
    ├── quote.h
    ├── skip.h
    └── str2int.h

avx2, sse, neon: SIMD implementations for specific architectures
common: shared implementations
x86_ifuncs: dynamic dispatch code for the x86 platform

How to add a new function

  1. Add the new function under each arch, for example:
namespace sonic_json {
namespace internal {
namespace avx2 {

void foo() { return; }

}  // namespace avx2
}  // namespace internal
}  // namespace sonic_json
  2. Add x86 dynamic dispatch support under x86_ifuncs:
namespace sonic_json {
namespace internal {

__attribute__((target(HASWELL))) inline void foo() { return avx2::foo(); }
__attribute__((target(WESTMERE))) inline void foo() { return sse::foo(); }
__attribute__((target("default"))) inline void foo() { return sse::foo(); }

}
}
  3. (Optional) If you added a new header file, add simd_foo.h under arch and a foo.h file under each arch. simd_foo.h looks like this:
#pragma once

#include "simd_dispatch.h"

#include INCLUDE_ARCH_FILE(foo.h)

namespace sonic_json {
namespace internal {

SONIC_USING_ARCH_FUNC(foo);

}
}

How to add a new architecture

Suppose there is a new architecture called Y86 and sonic needs SIMD support for it:

  1. Detect the Y86 macro in sonic_cpu_feature.h:
#if defined(__Y86__)
#define SONIC_HAVE_Y86
#endif
  2. Add a dispatch rule in simd_dispatch.h:
#if defined(SONIC_STATIC_DISPATCH)
#if defined(SONIC_HAVE_Y86)
#define SONIC_USING_ARCH_FUNC(func) using y86::func
#define INCLUDE_ARCH_FILE(file) SONIC_STRINGIFY(y86/file)
#endif
#elif defined(SONIC_DYNAMIC_DISPATCH)
#if defined(SONIC_HAVE_Y86)
#define SONIC_USING_ARCH_FUNC(func)
#define INCLUDE_ARCH_FILE(file) SONIC_STRINGIFY(y86_ifuncs/file)
#endif
#endif
  3. Add the y86 folder with the y86 implementations of all SIMD functions.
  4. Add the y86_ifuncs folder with the y86 multiversioned function implementations.


codecov-commenter commented Mar 16, 2023

Codecov Report

Merging #56 (9980dc1) into master (80cdba0) will increase coverage by 0.84%.
The diff coverage is 91.61%.

@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
+ Coverage   95.04%   95.88%   +0.84%     
==========================================
  Files          22       21       -1     
  Lines        2785     2431     -354     
==========================================
- Hits         2647     2331     -316     
+ Misses        138      100      -38     
Impacted Files Coverage Δ
include/sonic/allocator.h 90.43% <ø> (ø)
include/sonic/dom/dynamicnode.h 96.08% <ø> (ø)
include/sonic/dom/serialize.h 93.39% <ø> (ø)
include/sonic/internal/arch/avx2/base.h 100.00% <ø> (ø)
include/sonic/internal/ftoa.h 97.34% <ø> (ø)
include/sonic/internal/itoa.h 100.00% <ø> (ø)
include/sonic/internal/arch/simd_skip.h 89.23% <89.23%> (ø)
include/sonic/dom/handler.h 99.04% <100.00%> (ø)
include/sonic/dom/parser.h 94.23% <100.00%> (ø)
include/sonic/internal/arch/avx2/simd.h 100.00% <100.00%> (ø)
... and 4 more

... and 3 files with indirect coverage changes


xiegx94 commented Mar 22, 2023

Performance

| test case | master (haswell) | sse | haswell | dynamic dispatch |
|---|---|---|---|---|
| book/Decode_SonicDyn | 980 ns | 813 ns | 849 ns | 1128 ns |
| gsoc-2018/Decode_SonicDyn | 1406878 ns | 1470898 ns | 1339296 ns | 1752588 ns |
| fgo/Decode_SonicDyn | 129952490 ns | 112165070 ns | 117719364 ns | 150338769 ns |
| lottie/Decode_SonicDyn | 948184 ns | 805143 ns | 842756 ns | 1187414 ns |
| canada/Decode_SonicDyn | 4068896 ns | 3756878 ns | 3789520 ns | 4085432 ns |
| github_events/Decode_SonicDyn | 42468 ns | 39368 ns | 39716 ns | 54755 ns |
| otfcc/Decode_SonicDyn | 321242929 ns | 292141676 ns | 320360184 ns | 377427578 ns |
| poet/Decode_SonicDyn | 1611831 ns | 1572923 ns | 1534339 ns | 1743444 ns |
| citm_catalog/Decode_SonicDyn | 1212610 ns | 1137476 ns | 1217241 ns | 1439325 ns |
| twitter/Decode_SonicDyn | 194191 ns | 181451 ns | 185673 ns | 260165 ns |
| twitterescaped/Decode_SonicDyn | 572412 ns | 492546 ns | 555098 ns | 671564 ns |
| book/Encode_SonicDyn | 598 ns | 619 ns | 631 ns | 616 ns |
| gsoc-2018/Encode_SonicDyn | 702591 ns | 796307 ns | 680203 ns | 672574 ns |
| fgo/Encode_SonicDyn | 75789720 ns | 75432301 ns | 76930374 ns | 75568517 ns |
| lottie/Encode_SonicDyn | 846441 ns | 858753 ns | 839591 ns | 871623 ns |
| canada/Encode_SonicDyn | 6078378 ns | 6152427 ns | 6009102 ns | 6074922 ns |
| github_events/Encode_SonicDyn | 21617 ns | 22432 ns | 21035 ns | 20380 ns |
| otfcc/Encode_SonicDyn | 222879330 ns | 155490041 ns | 159626245 ns | 160975575 ns |
| poet/Encode_SonicDyn | 864328 ns | 840780 ns | 720925 ns | 721330 ns |
| citm_catalog/Encode_SonicDyn | 600309 ns | 533967 ns | 560394 ns | 554867 ns |
| twitter/Encode_SonicDyn | 94764 ns | 97690 ns | 93807 ns | 89528 ns |
| twitterescaped/Encode_SonicDyn | 281830 ns | 284284 ns | 263186 ns | 269042 ns |
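To read these timings relative to the master baseline, one can divide each measurement by the master (haswell) value for the same test case. The helpers below are a hypothetical convenience, not part of the PR or of the benchmark harness.

```cpp
#include <cassert>
#include <cmath>

namespace demo {

// Ratio of a measured time to the baseline; a value above 1 means slower.
inline double relative_time(double measured_ns, double baseline_ns) {
  assert(baseline_ns > 0.0);
  return measured_ns / baseline_ns;
}

// Same comparison expressed as a percentage change against the baseline.
inline double percent_change(double measured_ns, double baseline_ns) {
  return (relative_time(measured_ns, baseline_ns) - 1.0) * 100.0;
}

}  // namespace demo
```

For book/Decode_SonicDyn, `relative_time(1128, 980)` is about 1.15 (dynamic dispatch takes roughly 15% longer than master), while `relative_time(813, 980)` is about 0.83 for the sse build.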


using common::EqBytes4;
using common::SkipLiteral;
using sse::GetNextToken; // !!!Not efficency
Collaborator

What is the reason for this comment?

Contributor Author

GetNextToken is a function template, so the function multiversioning mechanism cannot be used to have the compiler select a version automatically. Here the sse version is chosen for all architectures.
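The situation described in this reply can be sketched as follows. `FindToken` is a hypothetical stand-in for a templated scanner like GetNextToken (whose real signature is not shown in this thread), and the bodies are scalar placeholders rather than SIMD code. The point is only that a using-declaration binds the name to one architecture's template for every build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace demo {

inline bool is_space(uint8_t c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r';
}

namespace sse {
// Hypothetical templated scanner: returns the index of the next token,
// optionally skipping leading whitespace.
template <bool SkipSpace>
inline size_t FindToken(const uint8_t* data, size_t pos, size_t len) {
  assert(pos <= len);
  if (SkipSpace) {
    while (pos < len && is_space(data[pos])) ++pos;
  }
  return pos;
}
}  // namespace sse

namespace avx2 {
// A second architecture's version exists, but nothing selects it below.
template <bool SkipSpace>
inline size_t FindToken(const uint8_t* data, size_t pos, size_t len) {
  return sse::FindToken<SkipSpace>(data, pos, len);
}
}  // namespace avx2

// One fixed choice for all targets, mirroring `using sse::GetNextToken;`.
using sse::FindToken;

}  // namespace demo
```

Callers write `demo::FindToken<true>(...)` and always get the sse stand-in, regardless of which CPU the binary runs on.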

@@ -479,7 +515,15 @@ struct simd256<int8_t> : num256<int8_t> {
template <>
struct simd256<uint8_t> : num256<uint8_t> {
using Base = num256<uint8_t>;
- using Base::Base;
+ // using Base::Base;
Collaborator

Why can't this be reused here? In principle, everything should be inlined under O3 optimization.

Contributor Author

It has nothing to do with inlining. No constructor is provided here, so the constructor is the default one and is not inlined. As a result, the constructor gets used under an architecture other than the specified one, which triggers a compiler error.


liuq19 commented Mar 23, 2023

Performance

test case sse haswell dynamic dispatch
book/Decode_SonicDyn 813 ns 849 ns 1128 ns
gsoc-2018/Decode_SonicDyn 1470898 ns 1339296 ns 1752588 ns
fgo/Decode_SonicDyn 112165070 ns 117719364 ns 150338769 ns
lottie/Decode_SonicDyn 805143 ns 842756 ns 1187414 ns
canada/Decode_SonicDyn 3756878 ns 3789520 ns 4085432 ns
github_events/Decode_SonicDyn 39368 ns 39716 ns 54755 ns
otfcc/Decode_SonicDyn 292141676 ns 320360184 ns 377427578 ns
poet/Decode_SonicDyn 1572923 ns 1534339 ns 1743444 ns
citm_catalog/Decode_SonicDyn 1137476 ns 1217241 ns 1439325 ns
twitter/Decode_SonicDyn 181451 ns 185673 ns 260165 ns
twitterescaped/Decode_SonicDyn 492546 ns 555098 ns 671564 ns
book/Encode_SonicDyn 619 ns 631 ns 616 ns
gsoc-2018/Encode_SonicDyn 796307 ns 680203 ns 672574 ns
fgo/Encode_SonicDyn 75432301 ns 76930374 ns 75568517 ns
lottie/Encode_SonicDyn 858753 ns 839591 ns 871623 ns
canada/Encode_SonicDyn 6152427 ns 6009102 ns 6074922 ns
github_events/Encode_SonicDyn 22432 ns 21035 ns 20380 ns
otfcc/Encode_SonicDyn 155490041 ns 159626245 ns 160975575 ns
poet/Encode_SonicDyn 840780 ns 720925 ns 721330 ns
citm_catalog/Encode_SonicDyn 533967 ns 560394 ns 554867 ns
twitter/Encode_SonicDyn 97690 ns 93807 ns 89528 ns
twitterescaped/Encode_SonicDyn 284284 ns 263186 ns 269042 ns

It would be best to post the relative performance data of the current branch versus master separately for static mode and dynamic mode; that should be clearer.


xiegx94 commented Mar 23, 2023

Performance
test case sse haswell dynamic dispatch
book/Decode_SonicDyn 813 ns 849 ns 1128 ns
gsoc-2018/Decode_SonicDyn 1470898 ns 1339296 ns 1752588 ns
fgo/Decode_SonicDyn 112165070 ns 117719364 ns 150338769 ns
lottie/Decode_SonicDyn 805143 ns 842756 ns 1187414 ns
canada/Decode_SonicDyn 3756878 ns 3789520 ns 4085432 ns
github_events/Decode_SonicDyn 39368 ns 39716 ns 54755 ns
otfcc/Decode_SonicDyn 292141676 ns 320360184 ns 377427578 ns
poet/Decode_SonicDyn 1572923 ns 1534339 ns 1743444 ns
citm_catalog/Decode_SonicDyn 1137476 ns 1217241 ns 1439325 ns
twitter/Decode_SonicDyn 181451 ns 185673 ns 260165 ns
twitterescaped/Decode_SonicDyn 492546 ns 555098 ns 671564 ns
book/Encode_SonicDyn 619 ns 631 ns 616 ns
gsoc-2018/Encode_SonicDyn 796307 ns 680203 ns 672574 ns
fgo/Encode_SonicDyn 75432301 ns 76930374 ns 75568517 ns
lottie/Encode_SonicDyn 858753 ns 839591 ns 871623 ns
canada/Encode_SonicDyn 6152427 ns 6009102 ns 6074922 ns
github_events/Encode_SonicDyn 22432 ns 21035 ns 20380 ns
otfcc/Encode_SonicDyn 155490041 ns 159626245 ns 160975575 ns
poet/Encode_SonicDyn 840780 ns 720925 ns 721330 ns
citm_catalog/Encode_SonicDyn 533967 ns 560394 ns 554867 ns
twitter/Encode_SonicDyn 97690 ns 93807 ns 89528 ns
twitterescaped/Encode_SonicDyn 284284 ns 263186 ns 269042 ns

It would be best to post the relative performance data of the current branch versus master separately for static mode and dynamic mode; that should be clearer.

Updated.

@xiegx94 xiegx94 force-pushed the feat/support-ifunc branch from 9c10f87 to 227105a Compare March 27, 2023 02:37
sum = sum * 10 + (c[i] - '0');
i++;
}
man_nd = i;
Collaborator

Why is there no neon SIMD version implemented here?

Contributor Author

I haven't figured out how to implement this function with neon yet.


__attribute__((target("default"))) inline uint8_t skip_space_safe(
const uint8_t*, size_t&, size_t, size_t&, uint64_t&) {
return 0;
Collaborator

Could returning 0 here cause problems? This is effectively the fallback logic, and it will be executed on anything other than westmere and haswell.

Contributor Author

Yes. As long as no fallback is implemented, it would be better to use a static assert here to block compilation.
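One way to turn a silent `return 0` fallback into a compile-time error is sketched below. This is not the PR's code: the architecture tags, the `HasSkipSpace` trait, and the simplified `SkipSpace` signature are all hypothetical. Because the body of a non-template function is always compiled, the sketch places the check in a template keyed on an architecture tag, so it only fires when an unsupported tag is actually used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

namespace demo {

// Hypothetical architecture tags.
struct Haswell {};
struct Westmere {};
struct Generic {};

// Trait listing the architectures that have a real implementation.
template <class Arch>
struct HasSkipSpace : std::false_type {};
template <>
struct HasSkipSpace<Haswell> : std::true_type {};
template <>
struct HasSkipSpace<Westmere> : std::true_type {};

// Returns the first non-whitespace byte at or after pos (advancing pos past
// it), or 0 at end of input. The body is a scalar placeholder.
template <class Arch>
inline uint8_t SkipSpace(const uint8_t* data, size_t& pos, size_t len) {
  static_assert(HasSkipSpace<Arch>::value,
                "SkipSpace has no implementation for this architecture");
  while (pos < len) {
    const uint8_t c = data[pos++];
    if (c != ' ' && c != '\t' && c != '\n' && c != '\r') return c;
  }
  return 0;
}

}  // namespace demo
```

Instantiating `demo::SkipSpace<demo::Generic>` fails to compile with the message above, instead of quietly returning 0 at runtime.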

}) +\
select({
"static_dispatch": static_dispatch_copts,
"dynamic_dispatch": dynamic_dispatch_copts,
Collaborator

Does dynamic mode need the mavx2 compile option? In theory it shouldn't be needed anymore.

Contributor Author

Nothing here requires the mavx2 compile option.
