[X86] Prefer lock or over mfence #106555

Open: wants to merge 2 commits into main
llvm/lib/Target/X86/X86.td: 36 additions, 16 deletions
@@ -754,6 +754,10 @@ def TuningUseGLMDivSqrtCosts
def TuningBranchHint: SubtargetFeature<"branch-hint", "HasBranchHint", "true",
"Target has branch hint feature">;

+def TuningAvoidMFENCE
+    : SubtargetFeature<"avoid-mfence", "AvoidMFence", "true",
+                       "Avoid MFENCE for fence seq_cst, and instead use lock or">;

Contributor: Wouldn't it be better to use the opposite definition, e.g. TuningPreferMFENCE, given that modern CPUs don't need it?

Contributor (author): I am sympathetic to that, but I am also following GCC's lead here.

vchuravy marked this conversation as resolved.

//===----------------------------------------------------------------------===//
// X86 CPU Families
// TODO: Remove these - use general tuning features to determine codegen.
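For illustration, here is the difference this tuning flag selects, sketched as LLVM IR with the resulting x86 in comments. The `(%esp)` operand matches the updated 32-bit tests below; the exact address used is a lowering detail:

    ; A sequentially consistent, cross-thread fence.
    fence seq_cst

    ; Default lowering when SSE2's MFENCE is available:
    ;   mfence
    ; With avoid-mfence: a locked no-op read-modify-write of a stack slot,
    ; which is also a full StoreLoad barrier on x86 but typically cheaper:
    ;   lock orl $0, (%esp)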
@@ -815,7 +819,8 @@ def ProcessorFeatures {
TuningSlow3OpsLEA,
TuningSlowDivide64,
TuningSlowIncDec,
-                       TuningInsertVZEROUPPER
+                       TuningInsertVZEROUPPER,
+                       TuningAvoidMFENCE
];

list<SubtargetFeature> X86_64V2Features = !listconcat(X86_64V1Features, [
@@ -831,7 +836,8 @@ def ProcessorFeatures {
TuningFastSHLDRotate,
TuningFast15ByteNOP,
TuningPOPCNTFalseDeps,
-                       TuningInsertVZEROUPPER
+                       TuningInsertVZEROUPPER,
+                       TuningAvoidMFENCE
];

list<SubtargetFeature> X86_64V3Features = !listconcat(X86_64V2Features, [
@@ -850,7 +856,8 @@ def ProcessorFeatures {
TuningPOPCNTFalseDeps,
TuningLZCNTFalseDeps,
TuningInsertVZEROUPPER,
-                       TuningAllowLight256Bit
+                       TuningAllowLight256Bit,
+                       TuningAvoidMFENCE
];

list<SubtargetFeature> X86_64V4Features = !listconcat(X86_64V3Features, [
@@ -874,15 +881,17 @@ def ProcessorFeatures {
TuningFastGather,
TuningPOPCNTFalseDeps,
TuningInsertVZEROUPPER,
-                       TuningAllowLight256Bit
+                       TuningAllowLight256Bit,
+                       TuningAvoidMFENCE
];

// Nehalem
list<SubtargetFeature> NHMFeatures = X86_64V2Features;
list<SubtargetFeature> NHMTuning = [TuningMacroFusion,
TuningSlowDivide64,
TuningInsertVZEROUPPER,
-                       TuningNoDomainDelayMov];
+                       TuningNoDomainDelayMov,
+                       TuningAvoidMFENCE];
Collaborator: Probably add TuningAvoidMFENCE to the tuning for x86-64-v2 and later?

Contributor (author): Also added it to x86-64-v1, since that ought to be equivalent to the early "core" -march targets.

Collaborator: If we do that, we might as well support it on any 64-bit-capable CPU.

Contributor (author): I believe that's what GCC did:

    +DEF_TUNE (X86_TUNE_AVOID_MFENCE, "avoid_mfence",
              m_CORE_ALL | m_BDVER | m_ZNVER | m_GENERIC)

// Westmere
list<SubtargetFeature> WSMAdditionalFeatures = [FeaturePCLMUL];
@@ -903,7 +912,8 @@ def ProcessorFeatures {
TuningFast15ByteNOP,
TuningPOPCNTFalseDeps,
TuningInsertVZEROUPPER,
-                       TuningNoDomainDelayMov];
+                       TuningNoDomainDelayMov,
+                       TuningAvoidMFENCE];
list<SubtargetFeature> SNBFeatures =
!listconcat(WSMFeatures, SNBAdditionalFeatures);

@@ -969,7 +979,8 @@ def ProcessorFeatures {
TuningAllowLight256Bit,
TuningNoDomainDelayMov,
TuningNoDomainDelayShuffle,
-                       TuningNoDomainDelayBlend];
+                       TuningNoDomainDelayBlend,
+                       TuningAvoidMFENCE];
list<SubtargetFeature> SKLFeatures =
!listconcat(BDWFeatures, SKLAdditionalFeatures);

@@ -1004,7 +1015,8 @@ def ProcessorFeatures {
TuningNoDomainDelayMov,
TuningNoDomainDelayShuffle,
TuningNoDomainDelayBlend,
-                       TuningFastImmVectorShift];
+                       TuningFastImmVectorShift,
+                       TuningAvoidMFENCE];
list<SubtargetFeature> SKXFeatures =
!listconcat(BDWFeatures, SKXAdditionalFeatures);

@@ -1047,7 +1059,8 @@ def ProcessorFeatures {
TuningNoDomainDelayMov,
TuningNoDomainDelayShuffle,
TuningNoDomainDelayBlend,
-                       TuningFastImmVectorShift];
+                       TuningFastImmVectorShift,
+                       TuningAvoidMFENCE];
list<SubtargetFeature> CNLFeatures =
!listconcat(SKLFeatures, CNLAdditionalFeatures);

@@ -1076,7 +1089,8 @@ def ProcessorFeatures {
TuningNoDomainDelayMov,
TuningNoDomainDelayShuffle,
TuningNoDomainDelayBlend,
-                       TuningFastImmVectorShift];
+                       TuningFastImmVectorShift,
+                       TuningAvoidMFENCE];
list<SubtargetFeature> ICLFeatures =
!listconcat(CNLFeatures, ICLAdditionalFeatures);

@@ -1222,7 +1236,8 @@ def ProcessorFeatures {
// Tremont
list<SubtargetFeature> TRMAdditionalFeatures = [FeatureCLWB,
FeatureGFNI];
-    list<SubtargetFeature> TRMTuning = GLPTuning;
+    list<SubtargetFeature> TRMAdditionalTuning = [TuningAvoidMFENCE];
+    list<SubtargetFeature> TRMTuning = !listconcat(GLPTuning, TRMAdditionalTuning);
list<SubtargetFeature> TRMFeatures =
!listconcat(GLPFeatures, TRMAdditionalFeatures);

@@ -1400,7 +1415,8 @@ def ProcessorFeatures {
TuningFastImm16,
TuningSBBDepBreaking,
TuningSlowDivide64,
-                       TuningSlowSHLD];
+                       TuningSlowSHLD,
+                       TuningAvoidMFENCE];
list<SubtargetFeature> BtVer2Features =
!listconcat(BtVer1Features, BtVer2AdditionalFeatures);

@@ -1429,7 +1445,8 @@ def ProcessorFeatures {
TuningFastScalarShiftMasks,
TuningBranchFusion,
TuningSBBDepBreaking,
-                       TuningInsertVZEROUPPER];
+                       TuningInsertVZEROUPPER,
+                       TuningAvoidMFENCE];
vchuravy marked this conversation as resolved.

// PileDriver
list<SubtargetFeature> BdVer2AdditionalFeatures = [FeatureF16C,
@@ -1509,7 +1526,8 @@ def ProcessorFeatures {
TuningSlowSHLD,
TuningSBBDepBreaking,
TuningInsertVZEROUPPER,
-                       TuningAllowLight256Bit];
+                       TuningAllowLight256Bit,
+                       TuningAvoidMFENCE];
list<SubtargetFeature> ZN2AdditionalFeatures = [FeatureCLWB,
FeatureRDPID,
FeatureRDPRU,
@@ -1697,7 +1715,8 @@ def : ProcModel<P, SandyBridgeModel, [
[
TuningMacroFusion,
TuningSlowUAMem16,
-    TuningInsertVZEROUPPER
+    TuningInsertVZEROUPPER,
+    TuningAvoidMFENCE
]>;
}
foreach P = ["penryn", "core_2_duo_sse4_1"] in {
@@ -1716,7 +1735,8 @@ def : ProcModel<P, SandyBridgeModel, [
[
TuningMacroFusion,
TuningSlowUAMem16,
-    TuningInsertVZEROUPPER
+    TuningInsertVZEROUPPER,
+    TuningAvoidMFENCE
]>;
}

llvm/lib/Target/X86/X86ISelLowering.cpp: 4 additions, 15 deletions
@@ -31422,21 +31422,10 @@ X86TargetLowering::lowerIdempotentRMWIntoFencedLoad(AtomicRMWInst *AI) const {
// otherwise, we might be able to be more aggressive on relaxed idempotent
// rmw. In practice, they do not look useful, so we don't try to be
// especially clever.
-  if (SSID == SyncScope::SingleThread)
-    // FIXME: we could just insert an ISD::MEMBARRIER here, except we are at
-    // the IR level, so we must wrap it in an intrinsic.
-    return nullptr;
-
-  if (!Subtarget.hasMFence())
-    // FIXME: it might make sense to use a locked operation here but on a
-    // different cache-line to prevent cache-line bouncing. In practice it
-    // is probably a small win, and x86 processors without mfence are rare
-    // enough that we do not bother.
-    return nullptr;
-
-  Function *MFence =
-      llvm::Intrinsic::getDeclaration(M, Intrinsic::x86_sse2_mfence);
-  Builder.CreateCall(MFence, {});
+  // Use `fence seq_cst` over `llvm.x86.sse2.mfence` here to get the correct
+  // lowering both for SSID == SyncScope::SingleThread and for targets where
+  // avoidMFence() || !hasMFence().
+  Builder.CreateFence(AtomicOrdering::SequentiallyConsistent, SSID);
Collaborator: Should this change be pulled out into a preliminary patch?


// Finally we can emit the atomic load.
LoadInst *Loaded = Builder.CreateAlignedLoad(
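To make the transformation concrete, here is a minimal sketch of what lowerIdempotentRMWIntoFencedLoad now emits for an idempotent RMW; the load's ordering is derived from the original operation, so the `acquire` here is illustrative:

    ; Before: an idempotent read-modify-write.
    %old = atomicrmw or ptr %p, i32 0 acquire

    ; After (roughly): the synchronization moves into a seq_cst fence, and
    ; the value is simply re-read with an atomic load.
    fence seq_cst
    %old.reload = load atomic i32, ptr %p acquire, align 4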
@@ -31524,7 +31513,7 @@ static SDValue LowerATOMIC_FENCE(SDValue Op, const X86Subtarget &Subtarget,
// cross-thread fence.
if (FenceOrdering == AtomicOrdering::SequentiallyConsistent &&
FenceSSID == SyncScope::System) {
-    if (Subtarget.hasMFence())
+    if (!Subtarget.avoidMFence() && Subtarget.hasMFence())
return DAG.getNode(X86ISD::MFENCE, dl, MVT::Other, Op.getOperand(0));

SDValue Chain = Op.getOperand(0);
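Routing through `fence` in the earlier hunk (rather than calling the MFENCE intrinsic directly) also covers the single-threaded case the removed FIXME pointed at: a fence that only constrains the current thread needs no machine instruction, just a compiler barrier. A sketch, assuming the usual x86 lowering:

    ; Orders memory operations only against the current thread (e.g. for
    ; signal handlers); on x86 this emits no instruction.
    fence syncscope("singlethread") seq_cst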
llvm/test/CodeGen/X86/atomic-idempotent.ll: 30 additions, 56 deletions
@@ -27,18 +27,16 @@ define i8 @add8(ptr %p) {
;
; X86-SLM-LABEL: add8:
; X86-SLM: # %bb.0:
-; X86-SLM-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-SLM-NEXT: xorl %eax, %eax
-; X86-SLM-NEXT: lock xaddb %al, (%ecx)
-; X86-SLM-NEXT: # kill: def $al killed $al killed $eax
+; X86-SLM-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-SLM-NEXT: lock orl $0, (%esp)
+; X86-SLM-NEXT: movzbl (%eax), %eax
; X86-SLM-NEXT: retl
;
; X86-ATOM-LABEL: add8:
; X86-ATOM: # %bb.0:
-; X86-ATOM-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-ATOM-NEXT: xorl %eax, %eax
-; X86-ATOM-NEXT: lock xaddb %al, (%ecx)
-; X86-ATOM-NEXT: # kill: def $al killed $al killed $eax
+; X86-ATOM-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-ATOM-NEXT: lock orl $0, (%esp)
+; X86-ATOM-NEXT: movzbl (%eax), %eax
; X86-ATOM-NEXT: nop
; X86-ATOM-NEXT: nop
; X86-ATOM-NEXT: retl
@@ -62,26 +60,18 @@ define i16 @or16(ptr %p) {
;
; X86-SLM-LABEL: or16:
; X86-SLM: # %bb.0:
-; X86-SLM-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-SLM-NEXT: movzwl (%ecx), %eax
-; X86-SLM-NEXT: .p2align 4, 0x90
-; X86-SLM-NEXT: .LBB1_1: # %atomicrmw.start
-; X86-SLM-NEXT: # =>This Inner Loop Header: Depth=1
-; X86-SLM-NEXT: lock cmpxchgw %ax, (%ecx)
-; X86-SLM-NEXT: jne .LBB1_1
-; X86-SLM-NEXT: # %bb.2: # %atomicrmw.end
+; X86-SLM-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-SLM-NEXT: lock orl $0, (%esp)
+; X86-SLM-NEXT: movzwl (%eax), %eax
; X86-SLM-NEXT: retl
;
; X86-ATOM-LABEL: or16:
; X86-ATOM: # %bb.0:
-; X86-ATOM-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-ATOM-NEXT: movzwl (%ecx), %eax
-; X86-ATOM-NEXT: .p2align 4, 0x90
-; X86-ATOM-NEXT: .LBB1_1: # %atomicrmw.start
-; X86-ATOM-NEXT: # =>This Inner Loop Header: Depth=1
-; X86-ATOM-NEXT: lock cmpxchgw %ax, (%ecx)
-; X86-ATOM-NEXT: jne .LBB1_1
-; X86-ATOM-NEXT: # %bb.2: # %atomicrmw.end
+; X86-ATOM-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-ATOM-NEXT: lock orl $0, (%esp)
+; X86-ATOM-NEXT: movzwl (%eax), %eax
; X86-ATOM-NEXT: nop
; X86-ATOM-NEXT: nop
; X86-ATOM-NEXT: retl
%1 = atomicrmw or ptr %p, i16 0 acquire
ret i16 %1
@@ -103,26 +93,18 @@ define i32 @xor32(ptr %p) {
;
; X86-SLM-LABEL: xor32:
; X86-SLM: # %bb.0:
-; X86-SLM-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-SLM-NEXT: movl (%ecx), %eax
-; X86-SLM-NEXT: .p2align 4, 0x90
-; X86-SLM-NEXT: .LBB2_1: # %atomicrmw.start
-; X86-SLM-NEXT: # =>This Inner Loop Header: Depth=1
-; X86-SLM-NEXT: lock cmpxchgl %eax, (%ecx)
-; X86-SLM-NEXT: jne .LBB2_1
-; X86-SLM-NEXT: # %bb.2: # %atomicrmw.end
+; X86-SLM-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-SLM-NEXT: lock orl $0, (%esp)
+; X86-SLM-NEXT: movl (%eax), %eax
; X86-SLM-NEXT: retl
;
; X86-ATOM-LABEL: xor32:
; X86-ATOM: # %bb.0:
-; X86-ATOM-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-ATOM-NEXT: movl (%ecx), %eax
-; X86-ATOM-NEXT: .p2align 4, 0x90
-; X86-ATOM-NEXT: .LBB2_1: # %atomicrmw.start
-; X86-ATOM-NEXT: # =>This Inner Loop Header: Depth=1
-; X86-ATOM-NEXT: lock cmpxchgl %eax, (%ecx)
-; X86-ATOM-NEXT: jne .LBB2_1
-; X86-ATOM-NEXT: # %bb.2: # %atomicrmw.end
+; X86-ATOM-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-ATOM-NEXT: lock orl $0, (%esp)
+; X86-ATOM-NEXT: movl (%eax), %eax
; X86-ATOM-NEXT: nop
; X86-ATOM-NEXT: nop
; X86-ATOM-NEXT: retl
%1 = atomicrmw xor ptr %p, i32 0 release
ret i32 %1
@@ -318,26 +300,18 @@ define i32 @and32 (ptr %p) {
;
; X86-SLM-LABEL: and32:
; X86-SLM: # %bb.0:
-; X86-SLM-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-SLM-NEXT: movl (%ecx), %eax
-; X86-SLM-NEXT: .p2align 4, 0x90
-; X86-SLM-NEXT: .LBB5_1: # %atomicrmw.start
-; X86-SLM-NEXT: # =>This Inner Loop Header: Depth=1
-; X86-SLM-NEXT: lock cmpxchgl %eax, (%ecx)
-; X86-SLM-NEXT: jne .LBB5_1
-; X86-SLM-NEXT: # %bb.2: # %atomicrmw.end
+; X86-SLM-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-SLM-NEXT: lock orl $0, (%esp)
+; X86-SLM-NEXT: movl (%eax), %eax
; X86-SLM-NEXT: retl
;
; X86-ATOM-LABEL: and32:
; X86-ATOM: # %bb.0:
-; X86-ATOM-NEXT: movl {{[0-9]+}}(%esp), %ecx
-; X86-ATOM-NEXT: movl (%ecx), %eax
-; X86-ATOM-NEXT: .p2align 4, 0x90
-; X86-ATOM-NEXT: .LBB5_1: # %atomicrmw.start
-; X86-ATOM-NEXT: # =>This Inner Loop Header: Depth=1
-; X86-ATOM-NEXT: lock cmpxchgl %eax, (%ecx)
-; X86-ATOM-NEXT: jne .LBB5_1
-; X86-ATOM-NEXT: # %bb.2: # %atomicrmw.end
+; X86-ATOM-NEXT: movl {{[0-9]+}}(%esp), %eax
+; X86-ATOM-NEXT: lock orl $0, (%esp)
+; X86-ATOM-NEXT: movl (%eax), %eax
; X86-ATOM-NEXT: nop
; X86-ATOM-NEXT: nop
; X86-ATOM-NEXT: retl
%1 = atomicrmw and ptr %p, i32 -1 acq_rel
ret i32 %1