Add support for inlining new atomic methods on Power #3018

Merged · 9 commits · Nov 13, 2018
4 changes: 4 additions & 0 deletions compiler/p/codegen/OMRCodeGenerator.hpp
@@ -171,6 +171,8 @@ class OMR_EXTENSIBLE CodeGenerator : public OMR::CodeGenerator

bool getSupportsIbyteswap();

bool supportsAtomicAdd() {return true;}

void generateBinaryEncodingPrologue(TR_PPCBinaryEncodingData *data);

void beginInstructionSelection();
@@ -184,6 +186,8 @@ class OMR_EXTENSIBLE CodeGenerator : public OMR::CodeGenerator
TR::Instruction *generateGroupEndingNop(TR::Node *node , TR::Instruction *preced = 0);
TR::Instruction *generateProbeNop(TR::Node *node , TR::Instruction *preced = 0);

bool inlineDirectCall(TR::Node *node, TR::Register *&resultReg);

bool isSnippetMatched(TR::Snippet *, int32_t, TR::SymbolReference *);

bool mulDecompositionCostIsJustified(int numOfOperations, char bitPosition[], char operationType[], int64_t value);
230 changes: 229 additions & 1 deletion compiler/p/codegen/OMRTreeEvaluator.cpp
@@ -22,9 +22,10 @@
#include <stdint.h> // for int32_t, etc
#include <stdio.h> // for NULL, fprintf, etc
#include <string.h> // for strstr
#include "codegen/AheadOfTimeCompile.hpp" // for AheadOfTimeCompile
#include "codegen/BackingStore.hpp" // for TR_BackingStore
#include "codegen/CodeGenerator.hpp" // for CodeGenerator
#include "codegen/CodeGenerator_inlines.hpp" // for CodeGenerator
#include "codegen/FrontEnd.hpp" // for feGetEnv, etc
#include "codegen/InstOpCode.hpp" // for InstOpCode, etc
#include "codegen/Instruction.hpp" // for Instruction
@@ -75,6 +76,7 @@
#include "infra/Bit.hpp" // for intParts
#include "infra/List.hpp" // for List, etc
#include "optimizer/Structure.hpp"
#include "p/codegen/OMRCodeGenerator.hpp"
#include "p/codegen/GenerateInstructions.hpp"
#include "p/codegen/PPCAOTRelocation.hpp"
#include "p/codegen/PPCHelperCallSnippet.hpp"
@@ -5189,6 +5191,232 @@ TR::Register *OMR::Power::TreeEvaluator::directCallEvaluator(TR::Node *node, TR:
return resultReg;
}

static TR::Register *inlineSimpleAtomicUpdate(TR::Node *node, bool isAddOp, bool isLong, bool isGetThenUpdate, TR::CodeGenerator *cg)
{
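// Generates: lwsync; an ldarx/stdcx. retry loop (with an add for the add-style ops); then sync.
// Handles the atomicAdd, atomicFetchAndAdd and atomicSwap non-helpers for 32- and 64-bit operands.
// Returns the original value when isGetThenUpdate is set, otherwise the updated value.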
TR::Node *valueAddrChild = node->getFirstChild();
TR::Node *deltaChild = NULL;
TR::Register *valueAddrReg = cg->evaluate(valueAddrChild);
TR::Register *deltaReg = NULL;
TR::Register *resultReg = cg->allocateRegister();
TR::Register *cndReg = cg->allocateRegister(TR_CCR);
TR::Register *tempReg = NULL;
bool isDeltaConstant = false;
bool isDeltaImmediate = false;
bool isDeltaImmediateShifted = false;

int32_t numDeps = 4;

int32_t delta = 0;

// Memory barrier --- NOTE: we should be able to do a test upfront to save this barrier,
// but Hursley advised to be conservative due to lack of specification.
generateInstruction(cg, TR::InstOpCode::lwsync, node);

TR::LabelSymbol *doneLabel = TR::LabelSymbol::create(cg->trHeapMemory(),cg);
TR::LabelSymbol *loopLabel = TR::LabelSymbol::create(cg->trHeapMemory(),cg);

loopLabel->setStartInternalControlFlow();
deltaChild = node->getSecondChild();

if (deltaChild->getOpCode().isLoadConst()
&& !deltaChild->getRegister()
&& deltaChild->getDataType() == TR::Int32)
{
delta = (int32_t)(deltaChild->getInt());
isDeltaConstant = true;

//determine if the constant can be represented as an immediate
if (delta <= UPPER_IMMED && delta >= LOWER_IMMED)
{
// avoid evaluating immediates for add operations
isDeltaImmediate = true;
}
else if ((delta & 0xFFFF) == 0 && ((delta & 0xFFFF0000) >> 16) <= UPPER_IMMED && ((delta & 0xFFFF0000) >> 16) >= LOWER_IMMED)
{
// avoid evaluating shifted immediates for add operations
isDeltaImmediate = true;
isDeltaImmediateShifted = true;
}
else
{
// evaluate non-immediate constants since there may be reuse
// and they have to go into a reg anyway
tempReg = cg->evaluate(deltaChild);
}
}
else
{
tempReg = cg->evaluate(deltaChild);
}

generateLabelInstruction(cg, TR::InstOpCode::label, node, loopLabel);

deltaReg = cg->allocateRegister();
if (isDeltaImmediate && !isAddOp)
{
// If argument is immediate, but the operation is not an add,
// the value must still be loaded into a register
// If argument is an immediate value and operation is an add,
// it will be used as an immediate operand in an add immediate instruction
loadConstant(cg, node, delta, deltaReg);
}
else if (!isDeltaImmediate)
{
// For non-constant arguments, use evaluated register
// For non-immediate constants, evaluate since they may be re-used
numDeps++;
generateTrg1Src1Instruction(cg, TR::InstOpCode::mr, node, deltaReg, tempReg);
}

uint8_t len = isLong ? 8 : 4;

generateTrg1MemInstruction(cg, isLong ? TR::InstOpCode::ldarx : TR::InstOpCode::lwarx, node, resultReg,
new (cg->trHeapMemory()) TR::MemoryReference(0, valueAddrReg, len, cg));

if (isAddOp)
{
if (isDeltaImmediateShifted)
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addis, node, deltaReg, resultReg, ((delta & 0xFFFF0000) >> 16));
else if (isDeltaImmediate)
generateTrg1Src1ImmInstruction(cg, TR::InstOpCode::addi, node, deltaReg, resultReg, delta);
else
generateTrg1Src2Instruction(cg, TR::InstOpCode::add, node, deltaReg, resultReg, deltaReg);
}

generateMemSrc1Instruction(cg, isLong ? TR::InstOpCode::stdcx_r : TR::InstOpCode::stwcx_r, node, new (cg->trHeapMemory()) TR::MemoryReference(0, valueAddrReg, len, cg),
deltaReg);

// We expect this store is usually successful, i.e., the following branch will not be taken
if (TR::Compiler->target.cpu.id() >= TR_PPCgp)
{
generateConditionalBranchInstruction(cg, TR::InstOpCode::bne, PPCOpProp_BranchUnlikely, node, loopLabel, cndReg);
}
else
{
generateConditionalBranchInstruction(cg, TR::InstOpCode::bne, node, loopLabel, cndReg);
}

// We deviate from the VM helper here: no-store-no-barrier instead of always-barrier
generateInstruction(cg, TR::InstOpCode::sync, node);

TR::RegisterDependencyConditions *conditions;

//Set the conditions and dependencies
conditions = new (cg->trHeapMemory()) TR::RegisterDependencyConditions((uint16_t) numDeps, (uint16_t) numDeps, cg->trMemory());

addDependency(conditions, valueAddrReg, TR::RealRegister::NoReg, TR_GPR, cg);
conditions->getPreConditions()->getRegisterDependency(0)->setExcludeGPR0();
conditions->getPostConditions()->getRegisterDependency(0)->setExcludeGPR0();
addDependency(conditions, resultReg, TR::RealRegister::NoReg, TR_GPR, cg);
conditions->getPreConditions()->getRegisterDependency(1)->setExcludeGPR0();
conditions->getPostConditions()->getRegisterDependency(1)->setExcludeGPR0();
addDependency(conditions, deltaReg, TR::RealRegister::NoReg, TR_GPR, cg);
addDependency(conditions, cndReg, TR::RealRegister::cr0, TR_CCR, cg);

if (tempReg)
{
addDependency(conditions, tempReg, TR::RealRegister::NoReg, TR_GPR, cg);
}

doneLabel->setEndInternalControlFlow();
generateDepLabelInstruction(cg, TR::InstOpCode::label, node, doneLabel, conditions);

cg->decReferenceCount(valueAddrChild);
cg->stopUsingRegister(valueAddrReg);
cg->stopUsingRegister(cndReg);

if (tempReg)
{
cg->stopUsingRegister(tempReg);
}

if (deltaChild)
{
cg->decReferenceCount(deltaChild);
}

if (isGetThenUpdate)
{
//for Get And Op, we will store the result in the result register
cg->stopUsingRegister(deltaReg);
node->setRegister(resultReg);
return resultReg;
}
else
{
//for Op And Get, we will store the return value in the delta register
//we no longer need the result register
cg->stopUsingRegister(resultReg);
node->setRegister(deltaReg);
return deltaReg;
}
}

bool OMR::Power::CodeGenerator::inlineDirectCall(TR::Node *node, TR::Register *&resultReg)
{
TR::CodeGenerator *cg = self();
TR::Compilation *comp = cg->comp();
TR::SymbolReference* symRef = node->getSymbolReference();
bool doInline = false;

if (symRef && symRef->getSymbol()->castToMethodSymbol()->isInlinedByCG())
{
bool isAddOp = false;
bool isLong = false;
bool isGetThenUpdate = false;

if (comp->getSymRefTab()->isNonHelper(symRef, TR::SymbolReferenceTable::atomicAdd32BitSymbol))
{
isAddOp = true;
isLong = false;
isGetThenUpdate = false;
doInline = true;
}
else if (comp->getSymRefTab()->isNonHelper(symRef, TR::SymbolReferenceTable::atomicAdd64BitSymbol))
{
isAddOp = true;
isLong = true;
isGetThenUpdate = false;
doInline = true;
}
else if (comp->getSymRefTab()->isNonHelper(symRef, TR::SymbolReferenceTable::atomicFetchAndAdd32BitSymbol))
{
isAddOp = true;
isLong = false;
isGetThenUpdate = true;
doInline = true;
}
else if (comp->getSymRefTab()->isNonHelper(symRef, TR::SymbolReferenceTable::atomicFetchAndAdd64BitSymbol))
{
isAddOp = true;
isLong = true;
isGetThenUpdate = true;
doInline = true;
}
else if (comp->getSymRefTab()->isNonHelper(symRef, TR::SymbolReferenceTable::atomicSwap32BitSymbol))
{
isAddOp = false;
isLong = false;
isGetThenUpdate = true;
doInline = true;
}
else if (comp->getSymRefTab()->isNonHelper(symRef, TR::SymbolReferenceTable::atomicSwap64BitSymbol))
{
isAddOp = false;
isLong = true;
isGetThenUpdate = true;
doInline = true;
}

if (doInline)
{
resultReg = inlineSimpleAtomicUpdate(node, isAddOp, isLong, isGetThenUpdate, cg);
}
}

return doInline;
}

zl-wang (Contributor) commented:

High-level issue: this will not work for a 32-bit JVM, i.e. it assumes a 64-bit JVM.

hzongaro (Contributor Author) replied:

@zl-wang Julian, thanks for catching this. Just so I'm clear, can this code be inlined on a 32-bit JVM when it's working with 32-bit operands, and on a 64-bit JVM when it's working with either 32-bit or 64-bit operands? Or can it only ever be inlined on a 64-bit JVM?

Something I just noticed this morning: I was playing with AtomicLong.getAndAdd and AtomicInteger.getAndAdd without my changes for these new intrinsics, and it looks like inline code is never generated for AtomicLong.getAndAdd on either 32-bit or 64-bit JVMs, but is always generated for AtomicInteger.getAndAdd on both. Is that expected?

zl-wang (Contributor) replied:

@hzongaro I didn't expect that. I expected the existing JVM to inline CAS of both 32- and 64-bit operands on both 32- and 64-bit JVMs. The only assumption is that the underlying CPU is 64-bit (which has been true since Java 7.1, when we discontinued support for 32-bit hardware). The case of inlining a CAS of a 64-bit operand in a 32-bit JVM needs special care, though ... you can assemble the 64-bit operand in a register only within the CAS range, relying on the fact that the CAS will definitely fail if a signal was delivered in between (i.e. the upper 32 bits of the register can be trashed at any time).

hzongaro (Contributor Author) replied:

@zl-wang Julian, I've been working on a revision that handles 64-bit values running in a 32-bit JVM, but it makes the code quite ugly. I’ll pull that out as a separate pull request, and I’ll let you decide whether it’s an important enough case to handle, as well as point out any problems with the way that I tackled the problem.
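For readers following this exchange, here is a rough, hypothetical sketch of the kind of sequence zl-wang describes — not code from this pull request — for an inlined 64-bit atomic add on a 32-bit JVM. It assumes the high and low halves of the delta arrive in separate GPRs (r4 and r7 here; all register assignments are illustrative), and that the same lwsync/sync barriers used on the 64-bit path above wrap the loop. The 64-bit delta is assembled inside the ldarx/stdcx. reservation window:

loop:
   ldarx   r5, 0, r3       # load-reserve the 64-bit value at [r3]
   rldicl  r6, r7, 0, 32   # r6 = low half of the delta, upper 32 bits cleared
   rldimi  r6, r4, 32, 0   # insert the high half (low word of r4) into bits 0..31 of r6
   add     r6, r5, r6      # new value = loaded value + assembled delta
   stdcx.  r6, 0, r3       # store-conditional; fails if the reservation was lost
   bne-    loop            # on failure, re-assemble the delta and retry

Every register whose upper 32 bits matter (r5 and r6) is live only between the ldarx and the stdcx.; if a signal or context switch trashes those bits, the reservation is also lost, so the stdcx. fails and the loop rebuilds the value before retrying. That is the "CAS will definitely fail" property the comment above relies on.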

TR::Register *OMR::Power::TreeEvaluator::performCall(TR::Node *node, bool isIndirect, TR::CodeGenerator *cg)
{
TR::SymbolReference *symRef = node->getSymbolReference();