-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Fix Linux FP exception when NUMA nodes greater than 1 #22861
Conversation
// finalizing the number of heaps. | ||
if (!pmask) | ||
{ | ||
pmask = 0xFFFFFFFFFFFFFFFF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
have you tested this on 32-bit? if so I am surprised you didn't get a compiler warning here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried building for x86 and did not see any compiler warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I tried this with the compiler explorer, it displayed a warning on x86:
[x86-64 clang 7.0.0 #1] warning: implicit conversion from 'unsigned long long' to 'uintptr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion]
@vkvenkat have you actually tried the x86 on Unix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see any warnings when I built for x86 on Ubuntu, but theoretically there should have been one. So I will use the BIT64 & BIT32 macros to fix this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vkvenkat you can use UINTPTR_MAX.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed this change & squashed commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@janvorli Some OSX CI builds are failing: Undefined symbols for architecture x86_64: "_GetCurrentProcessorNumberEx", referenced from CPUGroupInfo::CalculateCurrentProcessorNumber() in libutilcodestaticnohost.a(util.cpp.o)
. Do we need to revert to using GetProcAddress calls with exports in mscorwks_unixexports.src rather than direct API calls to fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should work. Let me give it a quick try on my local mac.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've found the problem. The libutilcodestaticnohost were missing coreclrpal library. This fixes the issue:
diff --git a/src/utilcode/staticnohost/CMakeLists.txt b/src/utilcode/staticnohost/CMakeLists.txt
index eea4d60785..e66a5de40d 100644
--- a/src/utilcode/staticnohost/CMakeLists.txt
+++ b/src/utilcode/staticnohost/CMakeLists.txt
@@ -8,5 +8,5 @@ endif(WIN32)
add_library_clr(utilcodestaticnohost STATIC ${UTILCODE_STATICNOHOST_SOURCES})
if(CLR_CMAKE_PLATFORM_UNIX)
- target_link_libraries(utilcodestaticnohost nativeresourcestring)
+ target_link_libraries(utilcodestaticnohost nativeresourcestring coreclrpal)
endif(CLR_CMAKE_PLATFORM_UNIX)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this fixes it for OSX. But I ran into this issue on Linux: ../../pal/src/libcoreclrpal.a(process.cpp.o): In function CreateProcessW: coreclr/src/pal/src/thread/process.cpp:530: multiple definition of CreateProcessW
CMakeFiles/mscordbi.dir/__/mscordac/palredefines.S.o:(.text+0x14f): first defined here
Using a nested check for CLR_CMAKE_PLATFORM_DARWIN fixes this.
thanks for doing these fixes! |
do we know why this is? this should be fixed... |
@@ -34233,6 +34236,20 @@ HRESULT GCHeap::Initialize() | |||
{ | |||
pmask &= smask; | |||
|
|||
#ifdef FEATURE_PAL | |||
// GetCurrentProcessAffinityMask can return pmask=0 and smask=0 on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we know why this is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've implemented that based on the MSDN doc, which says:
On a system with more than 64 processors, if the threads of the calling process are in a single processor group, the function sets the variables pointed to by lpProcessAffinityMask and lpSystemAffinityMask to the process affinity mask and the processor mask of active logical processors for that group. If the calling process contains threads in multiple groups, the function returns zero for both affinity masks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I think it can happen on Windows too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is on the path where if (!(GCToOSInterface::CanEnableGCCPUGroups()))
so we are saying there's < 64 procs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my 96-core machine, this API returns a pmask set to 48 cores on Windows & 0 on Linux. For an 8-core machine, we see a pmask set to 8 cores on both Windows/Linux.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, but I was trying to say that even if there are > 64 processors available to this process, we can still get to this code path if the COMPlus_GCCpuGroup=0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I am starting to doubt I understand what the If the calling process contains threads in multiple groups in the MSDN doc means. I have read it as "if the current process has affinity mask set to multiple groups" and that's how I have implemented it in the PAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right - if you have GCCpuGroup set to 0 it would be more than 64 procs available to this process while CanEnableGCCCpuGroups is FALSE.
I'm looking at the cpu group code in util.cpp. the policy it uses is a little odd to me - it enables cpu groups by default instead of checking to see if one of the configs wants to enable cpu groups and then enable it if that's the case. but by default processes do not use more than one cpu group worth of processors. what's the policy on linux? if you have > 64 procs do processes use all procs by default?
there is a discrepancy on windows and linux regardless, we should unify the behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you have > 64 procs do processes use all procs by default?
Linux has a very different way of reporting / setting affinity and there is no special handling for more or less than 64 processors. It doesn't have any groups. These are all Windows specific constructs.
There is sched_getaffinity
/ sched_setaffinity
that use a cpu_set_t
which can be manipulated as described here: https://linux.die.net/man/3/cpu_set. It is implemented as a bitset that can hold as many bits as needed for all the processors in the system.
Then there is a function numa_node_to_cpus
that fills in a cpu_set_t
with all processors belonging to the requested numa node index where the numa node index can be a value from 0 to numa_max_node() - 1
And finally numa_num_possible_cpus()
that returns the number of cpus enabled by the kernel (there is a kernel option that allows you to limit that number at boot time if needed).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my code in PAL takes these values and transforms them into the Windows style, artificially creating groups so that processors in a group belong to single NUMA node. So e.g. on my box, I have two NUMA nodes each containing 4 CPUs. So I create two groups with 4 processors each.
We fixed this with this PR. Updated the original PR description to reflect this. |
@dotnet-bot test Windows_NT x64 Release CoreFX Tests |
@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test |
@@ -8,5 +8,9 @@ endif(WIN32) | |||
add_library_clr(utilcodestaticnohost STATIC ${UTILCODE_STATICNOHOST_SOURCES}) | |||
|
|||
if(CLR_CMAKE_PLATFORM_UNIX) | |||
target_link_libraries(utilcodestaticnohost nativeresourcestring) | |||
if(CLR_CMAKE_PLATFORM_DARWIN) | |||
target_link_libraries(utilcodestaticnohost nativeresourcestring coreclrpal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The target_link_libraries can be used multiple times in an additive manner. Could you please keep the original as common and add just the `target_link_libraries(coreclrpal) conditionally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do.
@dotnet-bot test Windows_NT x64 Release CoreFX Tests please |
Looks like all the tests passed. Any additional feedback for this PR? |
@vkvenkat/@janvorli I just wanna make sure I understand what's working on Linux, if you could validate that'd be great. on a machine with multiple NUMA nodes, if complus_GCCpuGroup is not set which means GCToOSInterface::CanEnableGCCPUGroups will return FALSE, the # of GC heaps we will now create is max (64, total number of cores on the machine), even if the # of cores this process is allowed to use is only a subset of the cores (let's say it's only allowed to use 32 cores)? |
@Maoni0 I tried the above scenario by limiting the cores using
In my 96 core machine, NUMA node0 spans CPUs 0-23,48-71 and NUMA node1 spans CPUs 24-47,72-95. Here is the heap count:
|
thanks very much @vkvenkat! so the last case is incorrect, right? and on Windows it would return the right number. |
Yes, we need to find an alternative way to get the number of cores enabled in taskset when NUMA nodes > 1 on Linux. On Windows, I expect pmask to be set to the right number of cores for all cases. |
right. we should have a separate issue to track that then. I'll merge this one. thanks so much for doing this work!! |
I also did some Windows experiments with
|
excellent! thanks @vkvenkat. |
@vkvenkat looking at the Windows results, none of the cases let a process run on CPUs from multiple NUMA nodes. The pmask seems to be always pruned so that only single NUMA node is used. Is there a way to run a process on CPUs from multiple NUMA nodes on Windows? |
I never saw a pmask of 0 on Windows from |
The GC heap count is currently being set to zero when the available NUMA nodes are greater than 1 on Linux, leading to a Divide by Zero error. Reverting the GC heap count calculation logic to the version before PR #22180.
Fixed the process mask on Linux for GC threads to get affinitized to the right core & for GCHeapAffinitizeMask to control the number of heaps and processor affinities when GCCpuGroup is not set.
Also, GCToOSInterface::CanEnableGCCPUGroups always returned FALSE on Linux when NUMA nodes > 1. Some GetProcAddress calls in util.cpp were failing, which made CPUGroupInfo::InitCPUGroupInfoAPI & NumaNodeInfo::InitNumaNodeInfoAPI to return FALSE. Fixed these by changing the GetProcAddress calls to direct API calls instead as all of them are present at least from Windows 7 on.
PTAL @Maoni0 @janvorli