Windows/Arm64: Use 8.1 atomic instructions if they are available #70921

Merged: 7 commits, Jun 20, 2022
7 changes: 7 additions & 0 deletions src/coreclr/utilcode/util.cpp
@@ -23,6 +23,13 @@

#ifndef DACCESS_COMPILE
UINT32 g_nClrInstanceId = 0;

#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
// Flag to check if atomics feature is available on
// the machine
bool g_arm64_atomics_present = false;
#endif

#endif //!DACCESS_COMPILE

//*****************************************************************************
1 change: 1 addition & 0 deletions src/coreclr/vm/codeman.cpp
@@ -1566,6 +1566,7 @@ void EEJitManager::SetCpuInfo()
if (IsProcessorFeaturePresent(PF_ARM_V81_ATOMIC_INSTRUCTIONS_AVAILABLE))
{
CPUCompileFlags.Set(InstructionSet_Atomics);
g_arm64_atomics_present = true;
}

// PF_ARM_V82_DP_INSTRUCTIONS_AVAILABLE (43)
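The SetCpuInfo change above follows a "detect once at startup, branch on a global flag in hot paths" pattern: the real code queries IsProcessorFeaturePresent(PF_ARM_V81_ATOMIC_INSTRUCTIONS_AVAILABLE) once and sets g_arm64_atomics_present. A minimal portable sketch of that pattern, with purely illustrative names standing in for the runtime's:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for g_arm64_atomics_present and the OS feature query;
// the real runtime asks the OS once via IsProcessorFeaturePresent.
static bool g_fast_atomics_present = false;

static bool QueryCpuForFastAtomics()
{
    // Placeholder: a real implementation would query the OS or CPU ID registers.
    return true;
}

void InitCpuInfo()
{
    // Runs once at startup, before any thread reaches the hot paths below.
    g_fast_atomics_present = QueryCpuForFastAtomics();
}

int AddOne(std::atomic<int>& v)
{
    if (g_fast_atomics_present)
    {
        // Fast path: on real 8.1 hardware this branch would use the LSE form.
        return v.fetch_add(1) + 1;
    }
    // Fallback path: LL/SC-style retry loop.
    int expected = v.load();
    while (!v.compare_exchange_weak(expected, expected + 1)) { }
    return expected + 1;
}
```

The flag is written once before any reader runs, so the hot-path branch needs no synchronization of its own and is trivially predictable.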
4 changes: 4 additions & 0 deletions src/coreclr/vm/syncblk.cpp
@@ -1652,7 +1652,11 @@ AwareLock::EnterHelperResult ObjHeader::EnterObjMonitorHelperSpin(Thread* pCurTh
}

LONG newValue = oldValue | tid;
#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
if (FastInterlockedCompareExchangeAcquire((LONG*)&m_SyncBlockValue, newValue, oldValue) == oldValue)
#else
if (InterlockedCompareExchangeAcquire((LONG*)&m_SyncBlockValue, newValue, oldValue) == oldValue)
#endif
{
return AwareLock::EnterHelperResult_Entered;
}
8 changes: 8 additions & 0 deletions src/coreclr/vm/syncblk.h
@@ -382,13 +382,21 @@ class AwareLock
LockState CompareExchange(LockState toState, LockState fromState)
{
LIMITED_METHOD_CONTRACT;
#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
return (UINT32)FastInterlockedCompareExchange((LONG *)&m_state, (LONG)toState, (LONG)fromState);
#else
return (UINT32)InterlockedCompareExchange((LONG *)&m_state, (LONG)toState, (LONG)fromState);
#endif
}

LockState CompareExchangeAcquire(LockState toState, LockState fromState)
{
LIMITED_METHOD_CONTRACT;
#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
return (UINT32)FastInterlockedCompareExchangeAcquire((LONG *)&m_state, (LONG)toState, (LONG)fromState);
#else
return (UINT32)InterlockedCompareExchangeAcquire((LONG *)&m_state, (LONG)toState, (LONG)fromState);
#endif
}

public:
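The CompareExchangeAcquire member above returns the observed value so the caller can compare it to fromState. A sketch of the same shape with std::atomic in place of the Win32 Interlocked family (SketchLock and the reduced uint32_t state are illustrative, not the runtime's types):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// AwareLock-style state transition: LockState is reduced to a plain uint32_t.
struct SketchLock
{
    std::atomic<uint32_t> m_state{0};

    // Returns the value observed at m_state; the caller compares it to fromState.
    uint32_t CompareExchangeAcquire(uint32_t toState, uint32_t fromState)
    {
        uint32_t observed = fromState;
        m_state.compare_exchange_strong(observed, toState,
                                        std::memory_order_acquire,
                                        std::memory_order_acquire);
        return observed; // unchanged on success, updated on failure
    }

    // Taking the lock succeeds only when the observed state was the expected one.
    bool TryLock() { return CompareExchangeAcquire(1, 0) == 0; }
};
```

Acquire ordering on the CAS is what prevents loads and stores inside the critical section from being reordered before the lock acquisition.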
17 changes: 17 additions & 0 deletions src/coreclr/vm/syncblk.inl
@@ -602,7 +602,11 @@ FORCEINLINE AwareLock::EnterHelperResult ObjHeader::EnterObjMonitorHelper(Thread
}

LONG newValue = oldValue | tid;
#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
if (FastInterlockedCompareExchangeAcquire((LONG*)&m_SyncBlockValue, newValue, oldValue) == oldValue)
#else
if (InterlockedCompareExchangeAcquire((LONG*)&m_SyncBlockValue, newValue, oldValue) == oldValue)
#endif
{
return AwareLock::EnterHelperResult_Entered;
}
@@ -650,7 +654,11 @@ FORCEINLINE AwareLock::EnterHelperResult ObjHeader::EnterObjMonitorHelper(Thread
return AwareLock::EnterHelperResult_UseSlowPath;
}

#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
if (FastInterlockedCompareExchangeAcquire((LONG*)&m_SyncBlockValue, newValue, oldValue) == oldValue)
@VSadov (Member), Jun 20, 2022:
Didn't we just remove some or all of the "FastInterlockedXXXX" versions?
Having "Fast" variants always feels a bit confusing. Are they unfit for some uses? Where would I want to use the slow versions?

Author reply (Member):
Yes, we wanted to patch just the hot paths (which is where we should always use the fast versions) instead of all the paths, so I reintroduced them at just a few call sites.

#else
if (InterlockedCompareExchangeAcquire((LONG*)&m_SyncBlockValue, newValue, oldValue) == oldValue)
#endif
{
return AwareLock::EnterHelperResult_Entered;
}
@@ -723,7 +731,12 @@ FORCEINLINE AwareLock::LeaveHelperAction ObjHeader::LeaveObjMonitorHelper(Thread
{
// We are leaving the lock
DWORD newValue = (syncBlockValue & (~SBLK_MASK_LOCK_THREADID));

#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
@VSadov (Member):
I think moving the static/dynamic switching logic into InterlockedCompareExchangeRelease and similar would be nicer.
It should result in simpler-looking code, and if the switching needs to change due to a new compiler or platform, there is no need to revisit all the call sites.

if (FastInterlockedCompareExchangeRelease((LONG*)&m_SyncBlockValue, newValue, syncBlockValue) != (LONG)syncBlockValue)
#else
if (InterlockedCompareExchangeRelease((LONG*)&m_SyncBlockValue, newValue, syncBlockValue) != (LONG)syncBlockValue)
#endif
{
return AwareLock::LeaveHelperAction_Yield;
}
@@ -732,7 +745,11 @@
{
// recursion and ThinLock
DWORD newValue = syncBlockValue - SBLK_LOCK_RECLEVEL_INC;
#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
if (FastInterlockedCompareExchangeRelease((LONG*)&m_SyncBlockValue, newValue, syncBlockValue) != (LONG)syncBlockValue)
#else
if (InterlockedCompareExchangeRelease((LONG*)&m_SyncBlockValue, newValue, syncBlockValue) != (LONG)syncBlockValue)
#endif
{
return AwareLock::LeaveHelperAction_Yield;
}
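The enter/leave helpers above pair an acquire CAS on entry with a release CAS on exit over the low bits of the object header. A self-contained sketch of that thin-lock discipline (ThinLock and its names are illustrative; the real header word also packs hash codes, recursion level, and sync-block indices):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Minimal thin lock: the header word holds the owning thread id, 0 when free.
struct ThinLock
{
    std::atomic<uint32_t> m_header{0};

    bool TryEnter(uint32_t tid)
    {
        uint32_t oldValue = 0;
        // Acquire: reads/writes in the critical section cannot move before
        // the lock is observed as taken.
        return m_header.compare_exchange_strong(oldValue, tid,
                                                std::memory_order_acquire,
                                                std::memory_order_relaxed);
    }

    bool Leave(uint32_t tid)
    {
        uint32_t expected = tid;
        // Release: writes in the critical section become visible before
        // the lock is observed as free.
        return m_header.compare_exchange_strong(expected, 0,
                                                std::memory_order_release,
                                                std::memory_order_relaxed);
    }
};
```

A CAS (rather than a plain store) on leave is needed because the header may have been inflated to a full sync block by another thread in the meantime; the helpers above bail to the slow path when the CAS fails.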
66 changes: 66 additions & 0 deletions src/coreclr/vm/util.hpp
@@ -25,6 +25,14 @@
#define MAX_CACHE_LINE_SIZE 64
#endif

#ifndef DACCESS_COMPILE
#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)
// Flag to check if atomics feature is available on
// the machine
extern bool g_arm64_atomics_present;
#endif
#endif

#ifndef TARGET_UNIX
// Copied from malloc.h: don't want to bring in the whole header file.
void * __cdecl _alloca(size_t);
@@ -71,6 +79,64 @@ BOOL inline FitsInU4(unsigned __int64 val)
return val == (unsigned __int64)(unsigned __int32)val;
}

#if defined(DACCESS_COMPILE)
#define FastInterlockedCompareExchange InterlockedCompareExchange
#define FastInterlockedCompareExchangeAcquire InterlockedCompareExchangeAcquire
#define FastInterlockedCompareExchangeRelease InterlockedCompareExchangeRelease
#else

#if defined(TARGET_WINDOWS) && defined(TARGET_ARM64)

FORCEINLINE LONG FastInterlockedCompareExchange(
LONG volatile *Destination,
LONG Exchange,
LONG Comperand)
{
if (g_arm64_atomics_present)
{
return (LONG) __casal32((unsigned __int32*) Destination, (unsigned __int32)Comperand, (unsigned __int32)Exchange);
@EgorBo (Member):
Technically it can be unconditionally enabled for osx-arm64 for us.

Author reply (Member):
__casal32 is an intrinsic exposed by VC++ only. For osx-arm64, we can just patch PAL_ArmInterlockedOperationBarrier to not call __sync_synchronize, right?

@EgorBo (Member):
Ah, no, that will depend on compiler flags; e.g. if someone adds -mcpu=generic to CMake, then clang will stop emitting casal for us on osx-arm64 and the barrier will be needed. An intrinsic would always be fine, since there is no pre-8.1 hardware on osx-arm64 (and there never will be, judging by Apple's clang).

Author reply (Member):
What is the default flag used for osx? Since its baseline is 8.1, it should be -mcpu=armv8.1 by default, even if we don't specify it? I don't understand how or why someone would want to use -mcpu=generic, and even if they did, it would still be v8.1-only. Correct me if I am wrong.
Can you check what code we generate today with default flags on osx for __sync_val_compare_and_swap? It should be casal; in that case, we can eliminate the __sync_synchronize for OSX. What change did you try that you mentioned in #67824 (comment)?

@EgorBo (Member), Jun 20, 2022:
I think it was -mcpu=native (I verified coreclr via objdump) without that explicit barrier for all Interlocked functions, including the one in GC, and measured the impact on TE (ampere-linux): [image: benchmark results]
We don't set march/mcpu/m* flags, so we use the defaults.

@VSadov (Member):
For osx, we should definitely remove the barriers from pal.h, because the baseline for M1 is v8.1+ AFAIK.
The barrier is a workaround for __sync_val_compare_and_swap not being a full fence.
M1 is v8.1+, so casal would be sufficient and an additional fence is redundant. But does the compiler actually use casal?

Author reply (Member):
It does: https://godbolt.org/z/8xb3x7o71. I will put a PR out.

@EgorBo (Member):
If -mcpu=generic somehow flows in, it will stop emitting casal.

@VSadov (Member), Jun 20, 2022:
Does it use casal in debug? Can it switch to the old LL/SC helper because of register pressure, or if the old implementation is found faster (it could be)?
It feels like inline asm could have more reliable guarantees.

}
else
{
return InterlockedCompareExchange(Destination, Exchange, Comperand);
}
}

FORCEINLINE LONG FastInterlockedCompareExchangeAcquire(
IN OUT LONG volatile *Destination,
IN LONG Exchange,
IN LONG Comperand
)
{
if (g_arm64_atomics_present)
{
return (LONG) __casa32((unsigned __int32*) Destination, (unsigned __int32)Comperand, (unsigned __int32)Exchange);
}
else
{
return InterlockedCompareExchangeAcquire(Destination, Exchange, Comperand);
}
}

FORCEINLINE LONG FastInterlockedCompareExchangeRelease(
IN OUT LONG volatile *Destination,
IN LONG Exchange,
IN LONG Comperand
)
{
if (g_arm64_atomics_present)
{
return (LONG) __casl32((unsigned __int32*) Destination, (unsigned __int32)Comperand, (unsigned __int32)Exchange);
}
else
{
return InterlockedCompareExchangeRelease(Destination, Exchange, Comperand);
}
}

#endif // defined(TARGET_WINDOWS) && defined(TARGET_ARM64)

#endif //defined(DACCESS_COMPILE)
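The thread above notes that __casal32 is a VC++-only intrinsic, while on GCC/clang the equivalent codegen comes from the __atomic builtins (which clang compiles to a single casal when targeting armv8.1-a or later). A hedged portable analog of the sequentially consistent helper, keeping the Interlocked-style return-the-observed-value contract (the function name is illustrative):

```cpp
#include <cassert>
#include <cstdint>

// GCC/clang analog of the VC++-only __casal32-based fast path: a full-fence
// 32-bit compare-and-swap that returns the value observed at *dest, matching
// the InterlockedCompareExchange contract.
inline uint32_t CasSeqCst32(uint32_t* dest, uint32_t comperand, uint32_t exchange)
{
    // On failure the builtin writes the observed value back into comperand;
    // on success comperand already held it, so returning it covers both cases.
    __atomic_compare_exchange_n(dest, &comperand, exchange,
                                /*weak=*/false,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return comperand;
}
```

Whether this compiles to casal or to an LL/SC loop depends entirely on the -march/-mcpu target, which is exactly the fragility the reviewers were debating for osx-arm64.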


//************************************************************************