Fix Linux FP exception when NUMA nodes greater than 1 #22861

vkvenkat · 2019-02-26T19:44:18Z

The GC heap count is currently set to zero on Linux when more than one NUMA node is available, which leads to a divide-by-zero error (surfaced as a floating-point exception). This PR reverts the GC heap count calculation to the logic used before PR #22180.

Fixed the process mask on Linux so that GC threads are affinitized to the right cores, and so that GCHeapAffinitizeMask controls the number of heaps and the processor affinities when GCCpuGroup is not set.

Also, GCToOSInterface::CanEnableGCCPUGroups always returned FALSE on Linux when NUMA nodes > 1. Some GetProcAddress calls in util.cpp were failing, which caused CPUGroupInfo::InitCPUGroupInfoAPI and NumaNodeInfo::InitNumaNodeInfoAPI to return FALSE. Fixed this by replacing the GetProcAddress calls with direct API calls, since all of these APIs are available from Windows 7 onward.

PTAL @Maoni0 @janvorli

Maoni0 · 2019-02-26T19:49:58Z

src/gc/gc.cpp

+                // finalizing the number of heaps.
+                if (!pmask)
+                {
+                    pmask = 0xFFFFFFFFFFFFFFFF;


have you tested this on 32-bit? if so I am surprised you didn't get a compiler warning here.

I just tried building for x86 and did not see any compiler warnings.

When I tried this with the compiler explorer, it displayed a warning on x86:
[x86-64 clang 7.0.0 #1] warning: implicit conversion from 'unsigned long long' to 'uintptr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion]

@vkvenkat have you actually tried the x86 on Unix?

I didn't see any warnings when I built for x86 on Ubuntu, but theoretically there should have been one. So I will use the BIT64 & BIT32 macros to fix this.

@vkvenkat you can use UINTPTR_MAX.
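A minimal sketch of why this suggestion avoids the 32-bit warning: `UINTPTR_MAX` from `<cstdint>` is the all-ones value of `uintptr_t` at the target's native width, so no 64-bit literal has to be narrowed. The helper name `FullMaskIfZero` is hypothetical and is not code from this PR.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the PR): substitute an all-ones mask when the
// OS reports an empty process affinity mask.
constexpr uintptr_t FullMaskIfZero(uintptr_t pmask)
{
    // UINTPTR_MAX is all ones at the native pointer width (32 or 64 bits),
    // so no BIT64/BIT32 split and no truncating conversion is needed.
    return pmask != 0 ? pmask : UINTPTR_MAX;
}

// Compile-time checks that hold on both 32-bit and 64-bit targets.
static_assert(UINTPTR_MAX == std::numeric_limits<uintptr_t>::max(),
              "UINTPTR_MAX is the largest uintptr_t value");
static_assert(FullMaskIfZero(0) == UINTPTR_MAX, "an empty mask becomes all ones");
static_assert(FullMaskIfZero(0xff) == 0xff, "a non-empty mask is kept");
```

Because the constant is defined in terms of the type itself, the same line compiles cleanly for x86 and x64.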

Pushed this change & squashed commits.

@janvorli Some OSX CI builds are failing: Undefined symbols for architecture x86_64: "_GetCurrentProcessorNumberEx", referenced from CPUGroupInfo::CalculateCurrentProcessorNumber() in libutilcodestaticnohost.a(util.cpp.o). Do we need to revert to using GetProcAddress calls with exports in mscorwks_unixexports.src rather than direct API calls to fix this?

It should work. Let me give it a quick try on my local mac.

Ok, I've found the problem. libutilcodestaticnohost was missing the coreclrpal library. This fixes the issue:

diff --git a/src/utilcode/staticnohost/CMakeLists.txt b/src/utilcode/staticnohost/CMakeLists.txt
index eea4d60785..e66a5de40d 100644
--- a/src/utilcode/staticnohost/CMakeLists.txt
+++ b/src/utilcode/staticnohost/CMakeLists.txt
@@ -8,5 +8,5 @@ endif(WIN32)
 add_library_clr(utilcodestaticnohost STATIC ${UTILCODE_STATICNOHOST_SOURCES})

 if(CLR_CMAKE_PLATFORM_UNIX)
-  target_link_libraries(utilcodestaticnohost nativeresourcestring)
+  target_link_libraries(utilcodestaticnohost nativeresourcestring coreclrpal)
 endif(CLR_CMAKE_PLATFORM_UNIX)

Thanks, this fixes it for OSX. But I ran into this issue on Linux:

../../pal/src/libcoreclrpal.a(process.cpp.o): In function CreateProcessW:
coreclr/src/pal/src/thread/process.cpp:530: multiple definition of CreateProcessW
CMakeFiles/mscordbi.dir/__/mscordac/palredefines.S.o:(.text+0x14f): first defined here

Using a nested check for CLR_CMAKE_PLATFORM_DARWIN fixes this.

Maoni0 · 2019-02-26T19:50:52Z

thanks for doing these fixes!

Maoni0 · 2019-02-26T19:53:42Z

> Also, GCToOSInterface::CanEnableGCCPUGroups always returned FALSE on Linux when NUMA nodes > 1.

do we know why this is? this should be fixed...

Maoni0 · 2019-02-26T19:54:05Z

src/gc/gc.cpp

@@ -34233,6 +34236,20 @@ HRESULT GCHeap::Initialize()
            {
                pmask &= smask;

+#ifdef FEATURE_PAL
+                // GetCurrentProcessAffinityMask can return pmask=0 and smask=0 on


do we know why this is?

I've implemented that based on the MSDN doc, which says:

> On a system with more than 64 processors, if the threads of the calling process are in a single processor group, the function sets the variables pointed to by lpProcessAffinityMask and lpSystemAffinityMask to the process affinity mask and the processor mask of active logical processors for that group. If the calling process contains threads in multiple groups, the function returns zero for both affinity masks.

So I think it can happen on Windows too.

this is on the path where if (!(GCToOSInterface::CanEnableGCCPUGroups())) so we are saying there's < 64 procs.

On my 96-core machine, this API returns a pmask set to 48 cores on Windows & 0 on Linux. For an 8-core machine, we see a pmask set to 8 cores on both Windows/Linux.

I understand, but I was trying to say that even if there are > 64 processors available to this process, we can still get to this code path if the COMPlus_GCCpuGroup=0.

Actually, I am starting to doubt that I understand what "If the calling process contains threads in multiple groups" in the MSDN doc means. I have read it as "if the current process has affinity mask set to multiple groups" and that's how I have implemented it in the PAL.

you are right - if you have GCCpuGroup set to 0 there could be more than 64 procs available to this process while CanEnableGCCPUGroups is FALSE.

I'm looking at the cpu group code in util.cpp. the policy it uses is a little odd to me - it enables cpu groups by default instead of checking to see if one of the configs wants to enable cpu groups and then enable it if that's the case. but by default processes do not use more than one cpu group worth of processors. what's the policy on linux? if you have > 64 procs do processes use all procs by default?

there is a discrepancy on windows and linux regardless, we should unify the behavior.

> if you have > 64 procs do processes use all procs by default?

Linux has a very different way of reporting / setting affinity and there is no special handling for more or less than 64 processors. It doesn't have any groups. These are all Windows specific constructs.

There is sched_getaffinity / sched_setaffinity that use a cpu_set_t which can be manipulated as described here: https://linux.die.net/man/3/cpu_set. It is implemented as a bitset that can hold as many bits as needed for all the processors in the system.
Then there is a function numa_node_to_cpus that fills in a cpu_set_t with all processors belonging to the requested NUMA node index, where the index can be a value from 0 to numa_max_node() (which returns the highest node number). And finally there is numa_num_possible_cpus(), which returns the number of CPUs enabled by the kernel (there is a kernel option that allows you to limit that number at boot time if needed).

So my code in PAL takes these values and transforms them into the Windows style, artificially creating groups so that processors in a group belong to single NUMA node. So e.g. on my box, I have two NUMA nodes each containing 4 CPUs. So I create two groups with 4 processors each.
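The mapping described above can be sketched roughly as follows. This is a simplified model, not the PAL code: the `CpuGroup` struct and `BuildGroups` function are hypothetical, the per-node CPU lists are passed in directly rather than queried from libnuma, and processors are assumed to be renumbered from 0 within each group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a Windows-style processor group.
struct CpuGroup {
    int numaNode;        // NUMA node the group was built from
    uint64_t mask;       // one bit per processor within the group
    int processorCount;  // number of processors in the group
};

// Build one group per NUMA node. nodeCpus[n] lists the CPU indices that
// belong to node n (on Linux this would come from the NUMA library).
std::vector<CpuGroup> BuildGroups(const std::vector<std::vector<int>>& nodeCpus)
{
    std::vector<CpuGroup> groups;
    for (std::size_t node = 0; node < nodeCpus.size(); ++node) {
        const std::size_t count = nodeCpus[node].size();
        // A 64-bit group mask cannot describe more than 64 processors.
        assert(count <= 64);
        // Shifting a 64-bit value by 64 is undefined, so handle it separately.
        const uint64_t mask =
            (count == 64) ? ~uint64_t{0} : ((uint64_t{1} << count) - 1);
        groups.push_back({static_cast<int>(node), mask, static_cast<int>(count)});
    }
    return groups;
}
```

For the two-node box in the comment, `BuildGroups({{0, 1, 2, 3}, {4, 5, 6, 7}})` yields two groups, each with four processors and mask `0xf`.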

vkvenkat · 2019-02-26T19:58:29Z

> > Also, GCToOSInterface::CanEnableGCCPUGroups always returned FALSE on Linux when NUMA nodes > 1.
>
> do we know why this is? this should be fixed...

We fixed this with this PR. Updated the original PR description to reflect this.

vkvenkat · 2019-02-28T18:42:25Z

@dotnet-bot test Windows_NT x64 Release CoreFX Tests

vkvenkat · 2019-02-28T18:47:23Z

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test Windows_NT x64 Checked CoreFX Tests
@dotnet-bot test coreclr-ci (Test Pri0 Windows_NT x86 checked)
@dotnet-bot test coreclr-ci

janvorli · 2019-02-28T19:06:05Z

src/utilcode/staticnohost/CMakeLists.txt

@@ -8,5 +8,9 @@ endif(WIN32)
 add_library_clr(utilcodestaticnohost STATIC ${UTILCODE_STATICNOHOST_SOURCES})

 if(CLR_CMAKE_PLATFORM_UNIX)
-  target_link_libraries(utilcodestaticnohost  nativeresourcestring)
+  if(CLR_CMAKE_PLATFORM_DARWIN)
+    target_link_libraries(utilcodestaticnohost  nativeresourcestring  coreclrpal)


The target_link_libraries command can be used multiple times in an additive manner. Could you please keep the original as common and add just the `target_link_libraries` call for coreclrpal conditionally?

Sure, will do.

janvorli · 2019-02-28T21:59:57Z

@dotnet-bot test Windows_NT x64 Release CoreFX Tests please

vkvenkat · 2019-03-01T19:02:45Z

Looks like all the tests passed. Any additional feedback for this PR?

Maoni0 · 2019-03-01T20:52:42Z

@vkvenkat/@janvorli I just wanna make sure I understand what's working on Linux, if you could validate that'd be great. On a machine with multiple NUMA nodes, if COMPlus_GCCpuGroup is not set, which means GCToOSInterface::CanEnableGCCPUGroups will return FALSE, the # of GC heaps we will now create is min(64, total number of cores on the machine), even if the # of cores this process is allowed to use is only a subset of the cores (let's say it's only allowed to use 32 cores)?

vkvenkat · 2019-03-02T00:06:00Z

@Maoni0 I tried the above scenario by limiting the cores using taskset:

When the cores are limited to one NUMA node, GetCurrentProcessAffinityMask returns a non-zero mask and the heaps are set to the core count in taskset.
When the specified cores are from different NUMA nodes, GetCurrentProcessAffinityMask returns 0 and the heaps are set to 64. The GCToOSInterface::GetTotalProcessorCount() call returns the total cores on the machine even when we limit them using taskset.

In my 96 core machine, NUMA node0 spans CPUs 0-23,48-71 and NUMA node1 spans CPUs 24-47,72-95. Here is the heap count:

| Cores (taskset) | pmask (hex) | Heaps |
|---|---|---|
| All | 0 | 64 |
| 0-7 | ff | 8 |
| 0-23 | ffffff | 24 |
| 0-24 | 0 | 64 |
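The rows above are consistent with a rule like the one sketched here: count the bits in a non-zero pmask, and fall back to a cap of 64 when the mask is zero. This only reproduces the observed numbers; `HeapCountFromMask` and the constant `kMaxHeaps` are hypothetical names, and the real logic in gc.cpp may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Count the set bits in an affinity mask (portable loop, no builtins).
inline int CountSetBits(uint64_t mask)
{
    int n = 0;
    for (; mask != 0; mask &= mask - 1)  // clears the lowest set bit
        ++n;
    return n;
}

// Hypothetical helper reproducing the table: a zero mask falls back to
// min(64, total processors); otherwise one heap per allowed processor.
inline int HeapCountFromMask(uint64_t pmask, int totalProcessors)
{
    constexpr int kMaxHeaps = 64;  // assumed cap, matching the observed 64
    assert(totalProcessors > 0);
    if (pmask == 0)
        return std::min(kMaxHeaps, totalProcessors);
    return CountSetBits(pmask);
}
```

With `totalProcessors = 96`, the masks `0`, `0xff`, and `0xffffff` give 64, 8, and 24 heaps, matching the table.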

Maoni0 · 2019-03-02T00:19:15Z

thanks very much @vkvenkat! so the last case is incorrect, right? and on Windows it would return the right number.

vkvenkat · 2019-03-02T00:30:15Z

Yes, we need to find an alternative way to get the number of cores enabled in taskset when NUMA nodes > 1 on Linux. On Windows, I expect pmask to be set to the right number of cores for all cases.
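One possible way to obtain that number on Linux, shown only as a sketch, is to query the thread's affinity set directly with `sched_getaffinity` and count it with `CPU_COUNT`. This is not necessarily how the follow-up fix works; `AllowedCpuCount` is a hypothetical name, and the fixed-size `cpu_set_t` used here is assumed to be large enough for the machine (glibc's default holds 1024 CPUs).

```cpp
#include <cassert>
#include <sched.h>

// Returns the number of CPUs the calling thread is allowed to run on,
// or -1 if the affinity set could not be read.
int AllowedCpuCount()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    // pid 0 means "the calling thread"; taskset restrictions show up here.
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

Unlike a single 64-bit mask, the affinity set has no notion of groups, so a restriction such as CPUs 0-24 spanning two NUMA nodes would still be counted as 25.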

Maoni0 · 2019-03-02T00:40:59Z

right. we should have a separate issue to track that then. I'll merge this one. thanks so much for doing this work!!

vkvenkat · 2019-03-02T01:38:22Z

I also did some Windows experiments with start /affinity <hex affinity mask> <dotnet app>. In my 96 core machine, CPU group 0 spans CPUs 0-47 & CPU group 1 spans CPUs 48-95. As expected, the pmask is always non-zero and the heaps are set correctly. When the core limitation spans multiple CPU groups, only the cores specified from the first group are considered.

| Cores (start /affinity) | pmask (hex) | Heaps |
|---|---|---|
| All | ffffffffffff | 48 |
| 0-7 | ff | 8 |
| 0-23 | ffffff | 24 |
| 0-24 | 1ffffff | 25 |
| 0-49 | ffffffffffff | 48 |
| 45-55 | e00000000000 | 3 |
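The pmask values in these rows can be checked with a small worked example: take the requested CPU range, keep only the CPUs that fall inside group 0 (CPUs 0-47 on this machine), and set one bit per remaining CPU. `MaskWithinFirstGroup` is a hypothetical helper for illustration, not a Windows or CoreCLR API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bit mask of the CPUs in [first, last] that fall
// inside a first group covering CPUs [0, groupSize).
uint64_t MaskWithinFirstGroup(int first, int last, int groupSize)
{
    assert(0 <= first && first <= last);
    assert(0 < groupSize && groupSize <= 64);
    uint64_t mask = 0;
    // CPUs at or beyond groupSize belong to another group and are dropped.
    for (int cpu = first; cpu <= last && cpu < groupSize; ++cpu)
        mask |= uint64_t{1} << cpu;
    return mask;
}
```

For the range 45-55 with a 48-CPU group, only CPUs 45, 46, and 47 survive, which gives `0xe00000000000` and a heap count of 3.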

Maoni0 · 2019-03-02T02:07:38Z

excellent! thanks @vkvenkat.

janvorli · 2019-03-05T01:12:30Z

@vkvenkat looking at the Windows results, none of the cases let a process run on CPUs from multiple NUMA nodes. The pmask seems to be always pruned so that only single NUMA node is used. Is there a way to run a process on CPUs from multiple NUMA nodes on Windows?

vkvenkat · 2019-03-05T03:01:40Z

I never saw a pmask of 0 on Windows from GetProcessAffinityMask when testing in machines with multiple NUMA nodes. But to run the process on cores from different CPU groups (NUMA nodes), its threads need to be affinitized to these cores using SetThreadIdealProcessorEx. This is happening in the GC through the GCToOSInterface::SetCurrentThreadIdealAffinity wrapper in gc_heap::balance_heaps. So we are able to utilize cores from both NUMA nodes when COMPlus_GCCpuGroup is set, despite the pmask being non-zero.


Maoni0 reviewed Feb 26, 2019


janvorli reviewed Feb 28, 2019


Revert heapcount and enable CPU Groups to fix Ubuntu FPE

e7e4d02

Maoni0 approved these changes Mar 2, 2019


Maoni0 merged commit c1801e8 into dotnet:master Mar 2, 2019

vkvenkat mentioned this pull request Mar 15, 2019

Fix GetFullAffinityMask for cpuCount==64 #23276

Merged

hoyosjs mentioned this pull request Apr 12, 2019

Add PAL exports to DAC and remove linkage of PAL from static utility library #23937

Merged

Maoni0 mentioned this pull request Jan 31, 2020

return correct pmask when there are multiple NUMA nodes on linux dotnet/runtime#12161

Closed

picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022

Revert heapcount and enable CPU Groups to fix Ubuntu FPE (dotnet/core…

20e44bc

…clr#22861) Commit migrated from dotnet/coreclr@c1801e8
