Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf improvement for Intel MTL CPUs #19524

Merged
merged 4 commits into from
Feb 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 45 additions & 1 deletion onnxruntime/core/platform/windows/env.cc
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,9 @@
#include "core/common/span_utils.h"
#include "core/platform/env.h"
#include "core/platform/scoped_resource.h"
#if defined(_M_X64) && !defined(_M_ARM64EC) && defined(ONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH)
#include "core/platform/windows/hardware_core_enumerator.h"
#endif
#include <unsupported/Eigen/CXX11/ThreadPool>
#include <wil/Resource.h>

Expand Down Expand Up @@ -248,12 +251,53 @@
Sleep(static_cast<DWORD>(micros) / 1000);
}

// EIGEN_NO_CPUID is not defined in any C/C++ source code. It is a compile option.
#if defined(_M_X64) && !defined(_M_ARM64EC) && !defined(EIGEN_NO_CPUID) && defined(ONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH)

Check warning on line 255 in onnxruntime/core/platform/windows/env.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Lines should be <= 120 characters long [whitespace/line_length] [2] Raw Output: onnxruntime/core/platform/windows/env.cc:255: Lines should be <= 120 characters long [whitespace/line_length] [2]
static constexpr std::array<int, 3> kVendorID_Intel = {0x756e6547, 0x6c65746e, 0x49656e69}; // "GenuntelineI"
#endif
int WindowsEnv::DefaultNumCores() {
return std::max(1, static_cast<int>(std::thread::hardware_concurrency() / 2));
}

int WindowsEnv::GetNumPhysicalCpuCores() const {
return cores_.empty() ? DefaultNumCores() : static_cast<int>(cores_.size());
// EIGEN_NO_CPUID is not defined in any C/C++ source code. It is a compile option.
#if defined(_M_X64) && !defined(_M_ARM64EC) && !defined(EIGEN_NO_CPUID) && defined(ONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH)

Check warning on line 264 in onnxruntime/core/platform/windows/env.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Lines should be <= 120 characters long [whitespace/line_length] [2] Raw Output: onnxruntime/core/platform/windows/env.cc:264: Lines should be <= 120 characters long [whitespace/line_length] [2]
// The following code is a temporary fix for a perf problem on Intel's Meteor Lake CPUs. The Intel compute platform has

Check warning on line 265 in onnxruntime/core/platform/windows/env.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Lines should be <= 120 characters long [whitespace/line_length] [2] Raw Output: onnxruntime/core/platform/windows/env.cc:265: Lines should be <= 120 characters long [whitespace/line_length] [2]
// a hybrid architecture that some CPU cores runs significant slower than the others. If we distribute our compute work

Check warning on line 266 in onnxruntime/core/platform/windows/env.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Lines should be <= 120 characters long [whitespace/line_length] [2] Raw Output: onnxruntime/core/platform/windows/env.cc:266: Lines should be <= 120 characters long [whitespace/line_length] [2]
// evenly to all CPU cores, the slowest CPU core will drag the performance down. So, instead, we reduce the total number

Check warning on line 267 in onnxruntime/core/platform/windows/env.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Lines should be <= 120 characters long [whitespace/line_length] [2] Raw Output: onnxruntime/core/platform/windows/env.cc:267: Lines should be <= 120 characters long [whitespace/line_length] [2]
// of threads to exclude the slowest cores out.
// The following code is based on assumptions that:
// 1. All Intel hybrid CPUs should have 3 levels of cache.
// 2. If a CPU core is only associated with two levels of cache, it should be a low performance CPU core and should
// not be used.
// Since we don't know what the next Intel hybrid CPU would be like, later on we may need to rework the following code.

Check warning on line 273 in onnxruntime/core/platform/windows/env.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Lines should be <= 120 characters long [whitespace/line_length] [2] Raw Output: onnxruntime/core/platform/windows/env.cc:273: Lines should be <= 120 characters long [whitespace/line_length] [2]
// However, no matter what the code should not cause any crash. The worst is it might return 1 that
// thread pools will not be created, which is just a perf issue and does not impact usability.
// TODO: detect if CPUID instruction is available per instructions at https://wiki.osdev.org/CPUID#Checking_CPUID_availability

Check warning on line 276 in onnxruntime/core/platform/windows/env.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Missing username in TODO; it should look like "// TODO(my_username): Stuff." [readability/todo] [2] Raw Output: onnxruntime/core/platform/windows/env.cc:276: Missing username in TODO; it should look like "// TODO(my_username): Stuff." [readability/todo] [2]
int regs[4];
__cpuid(regs, 0);
bool bIsIntel =
(kVendorID_Intel[0] == regs[1]) &&
(kVendorID_Intel[1] == regs[2]) &&
(kVendorID_Intel[2] == regs[3]);
if (bIsIntel && regs[0] >= 7) {
// Query Structured Extended Feature Flags Enumeration Leaf
__cpuid(regs, 0x7);
// The bit 15 of EDX indicates if the processor is identified as a hybrid part.
bool ishybrid = regs[3] & (1 << 15);
if (ishybrid) {
// NOTE: even if ishybrid is true, it doesn't mean the processor must have P-cores and E-cores.
// On Intel CPUs we assume the HardwareCoreEnumerator::DefaultIntraOpNumThreads function would never fail.
// NOTE: due to resource restrictions, we cannot test this branch in our CI build pipelines.
return std::max(static_cast<uint32_t>(1), HardwareCoreEnumerator::DefaultIntraOpNumThreads());
} else {
return cores_.empty() ? DefaultNumCores() : static_cast<int>(cores_.size());
}
} else

Check warning on line 296 in onnxruntime/core/platform/windows/env.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 If an else has a brace on one side, it should have it on both [readability/braces] [5] Raw Output: onnxruntime/core/platform/windows/env.cc:296: If an else has a brace on one side, it should have it on both [readability/braces] [5]
#endif
{
return cores_.empty() ? DefaultNumCores() : static_cast<int>(cores_.size());
}
}

std::vector<LogicalProcessors> WindowsEnv::GetDefaultThreadAffinities() const {
Expand Down
89 changes: 89 additions & 0 deletions onnxruntime/core/platform/windows/hardware_core_enumerator.cc
Original file line number Diff line number Diff line change
@@ -0,0 +1,89 @@
// Copyright (c) Microsoft Corporation. All rights reserved.

Check warning on line 1 in onnxruntime/core/platform/windows/hardware_core_enumerator.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 At least two spaces is best between code and comments [whitespace/comments] [2] Raw Output: onnxruntime/core/platform/windows/hardware_core_enumerator.cc:1: At least two spaces is best between code and comments [whitespace/comments] [2]
// Licensed under the MIT License.

#include "hardware_core_enumerator.h"

Check warning on line 4 in onnxruntime/core/platform/windows/hardware_core_enumerator.cc

View workflow job for this annotation

GitHub Actions / Lint C++

[cpplint] reported by reviewdog 🐶 Include the directory when naming header files [build/include_subdir] [4] Raw Output: onnxruntime/core/platform/windows/hardware_core_enumerator.cc:4: Include the directory when naming header files [build/include_subdir] [4]
#include <memory>
#include <Windows.h>
#include <assert.h>

namespace onnxruntime {

struct LogicalProcessorInformation {
std::unique_ptr<char[]> Buffer;
size_t Length;
};

struct CoreCounter {
uint32_t PhysicalCores = 0;
uint32_t SocDieCores = 0;
};

static LogicalProcessorInformation GetLogicalProcessorInfos(LOGICAL_PROCESSOR_RELATIONSHIP relationship) {
DWORD length = 0;
DWORD rc = GetLogicalProcessorInformationEx(relationship, nullptr, &length);

assert(rc == FALSE);

auto processorInformationBytes = std::make_unique<char[]>(length);

rc = GetLogicalProcessorInformationEx(
relationship, reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(processorInformationBytes.get()), &length);

assert(rc == TRUE);

return {std::move(processorInformationBytes), length};
}

uint32_t CountSetBits(DWORD input) {
uint32_t c;
for (c = 0; input; c++) {
input &= input - 1;
}
return c;
}

static CoreCounter GetNumberOPhysicalAndEngineeringCores() {
auto logicalProcessorInformation = GetLogicalProcessorInfos(RelationAll);

CoreCounter cores;
DWORD dwLevel2GroupMask = 0;
DWORD dwLevel3GroupMask = 0;
size_t read = 0;
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX currentProcessorInfo = NULL;

while ((read + FIELD_OFFSET(SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX, Processor)) < logicalProcessorInformation.Length) {
currentProcessorInfo =
reinterpret_cast<PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX>(logicalProcessorInformation.Buffer.get() + read);
if ((read + currentProcessorInfo->Size) > logicalProcessorInformation.Length) {
break;
}

switch (currentProcessorInfo->Relationship) {
case RelationProcessorCore:
cores.PhysicalCores++;
break;
case RelationCache:
if (currentProcessorInfo->Cache.Level == 2) {
dwLevel2GroupMask |= currentProcessorInfo->Cache.GroupMask.Mask;
} else if (currentProcessorInfo->Cache.Level == 3) {
dwLevel3GroupMask |= currentProcessorInfo->Cache.GroupMask.Mask;
}
break;
}

read += currentProcessorInfo->Size;
}

cores.SocDieCores = CountSetBits(dwLevel2GroupMask & ~dwLevel3GroupMask);
return cores;
}

uint32_t HardwareCoreEnumerator::DefaultIntraOpNumThreads() {
// # of physical cores = # of P cores + # of E Cores + # of Soc Cores.
// # of logical cores = # of P cores x 2 (if hyper threading is enabled) + # of E cores + # of Soc Cores.
auto cores = GetNumberOPhysicalAndEngineeringCores();
// We want to use the number of physical cores, but exclude soc cores
return cores.PhysicalCores - cores.SocDieCores;
}

} // namespace onnxruntime
12 changes: 12 additions & 0 deletions onnxruntime/core/platform/windows/hardware_core_enumerator.h
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.

#pragma once
#include <stdint.h>

namespace onnxruntime {
struct HardwareCoreEnumerator {
HardwareCoreEnumerator() = delete;
static uint32_t DefaultIntraOpNumThreads();
};
} // namespace onnxruntime
19 changes: 14 additions & 5 deletions onnxruntime/core/util/thread_utils.cc
Original file line number Diff line number Diff line change
Expand Up @@ -93,22 +93,31 @@ static std::unique_ptr<ThreadPool>
CreateThreadPoolHelper(Env* env, OrtThreadPoolParams options) {
ThreadOptions to;
if (options.thread_pool_size <= 0) { // default
auto default_affinities = Env::Default().GetDefaultThreadAffinities();
if (default_affinities.size() <= 1) {
return nullptr;
}
options.thread_pool_size = static_cast<int>(default_affinities.size());
if (options.auto_set_affinity) {
#ifdef _WIN32
// Only set thread affinity on Server with auto affinity.
// On client best to let OS scheduler handle.
// On big (P-Core) / little (E-Core) CPU designs affinity overrides QoS and has high power usage
if (IsWindowsServer()) {
auto default_affinities = Env::Default().GetDefaultThreadAffinities();
if (default_affinities.size() <= 1) {
return nullptr;
}
options.thread_pool_size = static_cast<int>(default_affinities.size());
to.affinities = std::move(default_affinities);
} else {
options.thread_pool_size = Env::Default().GetNumPhysicalCpuCores();
}
#else
auto default_affinities = Env::Default().GetDefaultThreadAffinities();
if (default_affinities.size() <= 1) {
return nullptr;
}
options.thread_pool_size = static_cast<int>(default_affinities.size());
to.affinities = std::move(default_affinities);
#endif
} else {
options.thread_pool_size = Env::Default().GetNumPhysicalCpuCores();
}
}
if (options.thread_pool_size <= 1) {
Expand Down
3 changes: 2 additions & 1 deletion tools/ci_build/build.py
Original file line number Diff line number Diff line change
Expand Up @@ -1526,7 +1526,8 @@ def generate_build_tree(
ldflags = ["/profile", "/DYNAMICBASE"]
# Address Sanitizer libs do not have a Qspectre version. So they two cannot be both enabled.
if not args.enable_address_sanitizer:
cflags += ["/Qspectre"]
# Also enable a special perf patch that was made for Intel Meteor Lake mobile CPUs
cflags += ["/Qspectre", "/DONNXRUNTIME_ENABLE_INTEL_METEOR_LAKE_MOBILE_PLATFORM_PERF_PATCH"]
if config == "Release":
cflags += ["/O2", "/Ob2", "/DNDEBUG"]
elif config == "RelWithDebInfo":
Expand Down
Loading