This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Fix Linux FP exception when NUMA nodes greater than 1 #22861

Merged
merged 1 commit into from
Mar 2, 2019

vkvenkat commented Feb 26, 2019

The GC heap count is currently set to zero on Linux when more than one NUMA node is available, which leads to a divide-by-zero error. This PR reverts the GC heap count calculation to the logic used before PR #22180.

Fixed the process mask on Linux so that GC threads are affinitized to the right cores and so that GCHeapAffinitizeMask controls the number of heaps and the processor affinities when GCCpuGroup is not set.

Also, GCToOSInterface::CanEnableGCCPUGroups always returned FALSE on Linux when NUMA nodes > 1. Some GetProcAddress calls in util.cpp were failing, which made CPUGroupInfo::InitCPUGroupInfoAPI and NumaNodeInfo::InitNumaNodeInfoAPI return FALSE. Fixed this by replacing the GetProcAddress calls with direct API calls, since all of these APIs are available from Windows 7 onward.

PTAL @Maoni0 @janvorli

// finalizing the number of heaps.
if (!pmask)
{
    pmask = 0xFFFFFFFFFFFFFFFF;

Member

have you tested this on 32-bit? if so I am surprised you didn't get a compiler warning here.

Author

I just tried building for x86 and did not see any compiler warnings.

Member

When I tried this with the compiler explorer, it displayed a warning on x86:
[x86-64 clang 7.0.0 #1] warning: implicit conversion from 'unsigned long long' to 'uintptr_t' (aka 'unsigned int') changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion]

@vkvenkat have you actually tried the x86 build on Unix?

vkvenkat (Author) commented Feb 26, 2019

I didn't see any warnings when I built for x86 on Ubuntu, but theoretically there should have been one. So I will use the BIT64 & BIT32 macros to fix this.

Member

@vkvenkat you can use UINTPTR_MAX.
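
A minimal sketch of the UINTPTR_MAX suggestion, assuming pmask is a uintptr_t as the compiler warning quoted earlier implies. The helper name mask_or_all_processors is hypothetical and does not appear in the PR; it only illustrates why UINTPTR_MAX avoids the narrowing that the 64-bit literal causes on 32-bit targets.

```cpp
#include <cstdint>

// Hypothetical helper, not taken from the PR: substitute an all-ones mask
// when the OS reports an empty process affinity mask.
inline uintptr_t mask_or_all_processors(uintptr_t pmask)
{
    // UINTPTR_MAX is 0xFFFFFFFF where uintptr_t is 32 bits wide and
    // 0xFFFFFFFFFFFFFFFF where it is 64 bits wide, so the constant always
    // fits the type and no implicit narrowing takes place.
    return pmask != 0 ? pmask : UINTPTR_MAX;
}
```

Calling mask_or_all_processors(0) yields a mask with every bit set for the target's pointer width, while a non-zero mask is passed through unchanged.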

Author

Pushed this change & squashed commits.

Author

@janvorli Some OSX CI builds are failing: Undefined symbols for architecture x86_64: "_GetCurrentProcessorNumberEx", referenced from CPUGroupInfo::CalculateCurrentProcessorNumber() in libutilcodestaticnohost.a(util.cpp.o). Do we need to revert to using GetProcAddress calls with exports in mscorwks_unixexports.src rather than direct API calls to fix this?

Member

It should work. Let me give it a quick try on my local mac.

Member

Ok, I've found the problem. libutilcodestaticnohost was missing the coreclrpal library. This fixes the issue:

diff --git a/src/utilcode/staticnohost/CMakeLists.txt b/src/utilcode/staticnohost/CMakeLists.txt
index eea4d60785..e66a5de40d 100644
--- a/src/utilcode/staticnohost/CMakeLists.txt
+++ b/src/utilcode/staticnohost/CMakeLists.txt
@@ -8,5 +8,5 @@ endif(WIN32)
 add_library_clr(utilcodestaticnohost STATIC ${UTILCODE_STATICNOHOST_SOURCES})

 if(CLR_CMAKE_PLATFORM_UNIX)
-  target_link_libraries(utilcodestaticnohost  nativeresourcestring)
+  target_link_libraries(utilcodestaticnohost  nativeresourcestring coreclrpal)
 endif(CLR_CMAKE_PLATFORM_UNIX)

Author

Thanks, this fixes it for OSX. But I ran into this issue on Linux: ../../pal/src/libcoreclrpal.a(process.cpp.o): In function CreateProcessW: coreclr/src/pal/src/thread/process.cpp:530: multiple definition of CreateProcessW
CMakeFiles/mscordbi.dir/__/mscordac/palredefines.S.o:(.text+0x14f): first defined here

Using a nested check for CLR_CMAKE_PLATFORM_DARWIN fixes this.

Maoni0 (Member) commented Feb 26, 2019

thanks for doing these fixes!

Maoni0 (Member) commented Feb 26, 2019

> Also, GCToOSInterface::CanEnableGCCPUGroups always returned FALSE on Linux when NUMA nodes > 1.

do we know why this is? this should be fixed...

@@ -34233,6 +34236,20 @@ HRESULT GCHeap::Initialize()
{
    pmask &= smask;

#ifdef FEATURE_PAL
    // GetCurrentProcessAffinityMask can return pmask=0 and smask=0 on

Member

do we know why this is?

Member

I've implemented that based on the MSDN doc, which says:

> On a system with more than 64 processors, if the threads of the calling process are in a single processor group, the function sets the variables pointed to by lpProcessAffinityMask and lpSystemAffinityMask to the process affinity mask and the processor mask of active logical processors for that group. If the calling process contains threads in multiple groups, the function returns zero for both affinity masks.

Member

So I think it can happen on Windows too.

Member

this is on the path where if (!(GCToOSInterface::CanEnableGCCPUGroups())) so we are saying there's < 64 procs.

Author

On my 96-core machine, this API returns a pmask set to 48 cores on Windows & 0 on Linux. For an 8-core machine, we see a pmask set to 8 cores on both Windows/Linux.

Member

I understand, but I was trying to say that even if there are > 64 processors available to this process, we can still get to this code path when COMPlus_GCCpuGroup=0.

Member

Actually, I am starting to doubt that I understand what "If the calling process contains threads in multiple groups" in the MSDN doc means. I have read it as "if the current process has affinity mask set to multiple groups" and that's how I have implemented it in the PAL.

Member

you are right - if you have GCCpuGroup set to 0 there would be more than 64 procs available to this process while CanEnableGCCPUGroups is FALSE.

I'm looking at the cpu group code in util.cpp. the policy it uses is a little odd to me - it enables cpu groups by default instead of checking to see if one of the configs wants to enable cpu groups and then enable it if that's the case. but by default processes do not use more than one cpu group worth of processors. what's the policy on linux? if you have > 64 procs do processes use all procs by default?

there is a discrepancy on windows and linux regardless, we should unify the behavior.

Member

> if you have > 64 procs do processes use all procs by default?

Linux has a very different way of reporting / setting affinity and there is no special handling for more or less than 64 processors. It doesn't have any groups. These are all Windows specific constructs.

There is sched_getaffinity / sched_setaffinity that use a cpu_set_t, which can be manipulated as described here: https://linux.die.net/man/3/cpu_set. It is implemented as a bitset that can hold as many bits as needed for all the processors in the system.
Then there is a function numa_node_to_cpus that fills in a CPU mask with all processors belonging to the requested NUMA node index, where the NUMA node index can be a value from 0 to numa_max_node(). And finally there is numa_num_possible_cpus(), which returns the number of CPUs enabled by the kernel (there is a kernel option that allows you to limit that number at boot time if needed).
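
A minimal sketch of the sched_getaffinity side of this, assuming Linux with glibc. The helper name allowed_cpu_count is hypothetical and is not part of the PAL; it only shows how the affinity set of the calling thread can be read and counted. A fixed-size cpu_set_t covers CPU_SETSIZE CPUs, so a machine with more CPUs than that would need the dynamically sized CPU_ALLOC variants instead.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1   // exposes sched_getaffinity and the CPU_* macros on glibc
#endif
#include <sched.h>

// Hypothetical helper (Linux only): number of CPUs the calling thread is
// allowed to run on, or -1 if the affinity set could not be read.
inline int allowed_cpu_count()
{
    cpu_set_t set;
    CPU_ZERO(&set);
    // A pid of 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
    {
        return -1;
    }
    // CPU_COUNT returns the number of bits set in the affinity set.
    return CPU_COUNT(&set);
}
```

Because taskset works by setting this same affinity set, the count reflects a taskset restriction regardless of how the allowed CPUs are spread across NUMA nodes.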

Member

So my code in PAL takes these values and transforms them into the Windows style, artificially creating groups so that the processors in a group belong to a single NUMA node. So e.g. on my box, I have two NUMA nodes each containing 4 CPUs. So I create two groups with 4 processors each.
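
The grouping described here can be sketched roughly as follows. This is an illustration, not the PAL code: the CpuGroup struct, the build_groups function, and the node_of_cpu input table are all hypothetical names, and the CPU-to-node mapping is passed in directly so that the sketch does not depend on libnuma.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical model of a Windows-style processor group.
struct CpuGroup
{
    int numa_node;            // NUMA node the group was built from
    std::vector<int> cpus;    // OS CPU indices placed in this group
    uint64_t active_mask;     // one bit per CPU slot within the group
};

// Build one group per NUMA node. node_of_cpu[i] is the NUMA node of CPU i.
inline std::vector<CpuGroup> build_groups(const std::vector<int>& node_of_cpu)
{
    std::map<int, CpuGroup> by_node;   // ordered by node index
    for (int cpu = 0; cpu < static_cast<int>(node_of_cpu.size()); ++cpu)
    {
        CpuGroup& g = by_node[node_of_cpu[cpu]];   // zero-initialized on first use
        g.numa_node = node_of_cpu[cpu];
        // A group mask has 64 bits, so a larger node would need splitting.
        // This sketch simply ignores CPUs beyond the 64th in a node.
        if (g.cpus.size() < 64)
        {
            g.active_mask |= uint64_t{1} << g.cpus.size();
            g.cpus.push_back(cpu);
        }
    }
    std::vector<CpuGroup> groups;
    for (const auto& entry : by_node)
    {
        groups.push_back(entry.second);
    }
    return groups;
}
```

For the box in the comment, build_groups({0, 0, 0, 0, 1, 1, 1, 1}) produces two groups of four processors, each with an active mask of 0xF.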

vkvenkat (Author) commented Feb 26, 2019

> Also, GCToOSInterface::CanEnableGCCPUGroups always returned FALSE on Linux when NUMA nodes > 1.
>
> do we know why this is? this should be fixed...

We fixed this with this PR. Updated the original PR description to reflect this.

vkvenkat (Author) commented Feb 28, 2019

@dotnet-bot test Windows_NT x64 Release CoreFX Tests

vkvenkat (Author) commented

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test
@dotnet-bot test Windows_NT x64 Checked CoreFX Tests
@dotnet-bot test coreclr-ci (Test Pri0 Windows_NT x86 checked)
@dotnet-bot test coreclr-ci

@@ -8,5 +8,9 @@ endif(WIN32)
add_library_clr(utilcodestaticnohost STATIC ${UTILCODE_STATICNOHOST_SOURCES})

if(CLR_CMAKE_PLATFORM_UNIX)
  target_link_libraries(utilcodestaticnohost nativeresourcestring)
  if(CLR_CMAKE_PLATFORM_DARWIN)
    target_link_libraries(utilcodestaticnohost nativeresourcestring coreclrpal)

Member

The target_link_libraries can be used multiple times in an additive manner. Could you please keep the original as common and add just the `target_link_libraries(coreclrpal)` conditionally?

Author

Sure, will do.

janvorli (Member) commented

@dotnet-bot test Windows_NT x64 Release CoreFX Tests please

vkvenkat (Author) commented Mar 1, 2019

Looks like all the tests passed. Any additional feedback for this PR?

Maoni0 (Member) commented Mar 1, 2019

@vkvenkat/@janvorli I just wanna make sure I understand what's working on Linux, if you could validate that'd be great. on a machine with multiple NUMA nodes, if complus_GCCpuGroup is not set which means GCToOSInterface::CanEnableGCCPUGroups will return FALSE, the # of GC heaps we will now create is max (64, total number of cores on the machine), even if the # of cores this process is allowed to use is only a subset of the cores (let's say it's only allowed to use 32 cores)?

vkvenkat (Author) commented Mar 2, 2019

@Maoni0 I tried the above scenario by limiting the cores using taskset:

  • When the cores are limited to one NUMA node, GetCurrentProcessAffinityMask returns a non-zero mask and the heaps are set to the core count in taskset.
  • When the specified cores are from different NUMA nodes, GetCurrentProcessAffinityMask returns 0 and the heaps are set to 64. The GCToOSInterface::GetTotalProcessorCount() call returns the total cores on the machine even when we limit them using taskset.

On my 96-core machine, NUMA node 0 spans CPUs 0-23 and 48-71, and NUMA node 1 spans CPUs 24-47 and 72-95. Here are the resulting heap counts:

| Cores | pmask  | Heaps |
|-------|--------|-------|
| All   | 0      | 64    |
| 0-7   | ff     | 8     |
| 0-23  | ffffff | 24    |
| 0-24  | 0      | 64    |
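
The heap counts in this table are consistent with a simple model: count the set bits of pmask, and treat an empty mask as an all-ones 64-bit mask. The sketch below encodes that model only. The function names are hypothetical, a 64-bit mask is assumed, and the actual logic in GCHeap::Initialize may apply further limits.

```cpp
#include <cstdint>

// Count the set bits of an affinity mask (portable popcount).
inline unsigned count_set_bits(uint64_t mask)
{
    unsigned n = 0;
    while (mask != 0)
    {
        mask &= mask - 1;   // clear the lowest set bit
        ++n;
    }
    return n;
}

// Hypothetical model of the observed heap counts: an empty mask is replaced
// by an all-ones 64-bit mask, so it yields 64 heaps.
inline unsigned heaps_from_process_mask(uint64_t pmask)
{
    if (pmask == 0)
    {
        pmask = UINT64_MAX;
    }
    return count_set_bits(pmask);
}
```

Under this model, masks ff and ffffff give 8 and 24 heaps, and a zero mask gives 64, which is why the 0-24 row reports 64 heaps even though only 25 cores were allowed.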

Maoni0 (Member) commented Mar 2, 2019

thanks very much @vkvenkat! so the last case is incorrect, right? and on Windows it would return the right number.

vkvenkat (Author) commented Mar 2, 2019

Yes, we need to find an alternative way to get the number of cores enabled in taskset when NUMA nodes > 1 on Linux. On Windows, I expect pmask to be set to the right number of cores for all cases.

Maoni0 (Member) commented Mar 2, 2019

right. we should have a separate issue to track that then. I'll merge this one. thanks so much for doing this work!!

Maoni0 merged commit c1801e8 into dotnet:master Mar 2, 2019

vkvenkat (Author) commented Mar 2, 2019

I also did some Windows experiments with start /affinity <hex affinity mask> <dotnet app>. On my 96-core machine, CPU group 0 spans CPUs 0-47 and CPU group 1 spans CPUs 48-95. As expected, the pmask is always non-zero and the heaps are set correctly. When the core limitation spans multiple CPU groups, only the specified cores that fall in the first group are considered.

| Cores | pmask        | Heaps |
|-------|--------------|-------|
| All   | ffffffffffff | 48    |
| 0-7   | ff           | 8     |
| 0-23  | ffffff       | 24    |
| 0-24  | 1ffffff      | 25    |
| 0-49  | ffffffffffff | 48    |
| 45-55 | e00000000000 | 3     |
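
The pmask column can be reproduced with a small model of the behavior described above: take the requested CPU range and keep only the CPUs that fall inside group 0. The sketch below is an illustration of that observation, not Windows or GC code. The function name group0_mask is hypothetical, and group 0 is assumed to hold CPUs 0 through group_size - 1 with group_size at most 64.

```cpp
#include <cstdint>

// Hypothetical model: affinity mask for the inclusive CPU range
// [first, last], keeping only CPUs that fall inside group 0.
// Assumes 0 < group_size <= 64 so that every shift stays in range.
inline uint64_t group0_mask(int first, int last, int group_size)
{
    uint64_t mask = 0;
    for (int cpu = first; cpu <= last && cpu < group_size; ++cpu)
    {
        if (cpu >= 0)
        {
            mask |= uint64_t{1} << cpu;   // set the bit for this CPU
        }
    }
    return mask;
}
```

With a group size of 48, the range 45-55 keeps only CPUs 45, 46 and 47, which gives e00000000000 and therefore 3 heaps, and the range 0-49 collapses to the full group mask ffffffffffff.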

Maoni0 (Member) commented Mar 2, 2019

excellent! thanks @vkvenkat.

janvorli (Member) commented Mar 5, 2019

@vkvenkat looking at the Windows results, none of the cases let a process run on CPUs from multiple NUMA nodes. The pmask seems to be always pruned so that only a single NUMA node is used. Is there a way to run a process on CPUs from multiple NUMA nodes on Windows?

vkvenkat (Author) commented Mar 5, 2019

I never saw a pmask of 0 on Windows from GetProcessAffinityMask when testing on machines with multiple NUMA nodes. But to run the process on cores from different CPU groups (NUMA nodes), its threads need to be affinitized to these cores using SetThreadIdealProcessorEx. This happens in the GC through the GCToOSInterface::SetCurrentThreadIdealAffinity wrapper in gc_heap::balance_heaps. So we are able to utilize cores from both NUMA nodes when COMPlus_GCCpuGroup is set, despite the pmask being non-zero.
