Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up generation of load and store sequences for Power #5630

Open
aviansie-ben opened this issue Oct 27, 2020 · 13 comments
Open

Clean up generation of load and store sequences for Power #5630

aviansie-ben opened this issue Oct 27, 2020 · 13 comments

Comments

@aviansie-ben
Copy link
Contributor

Currently, the generation of load and store sequences is deeply entangled with TR::MemoryReference and has code copied in numerous different locations. Specifically, we are currently able to generate a memory reference for an arbitrary load/store node from any evaluator and then emit instructions manually for performing the load/store. For instance, loads on Power are implemented like the following (simplified for convenience):

TR::MemoryReference *memRef = TR::MemoryReference::createWithRootLoadOrStore(cg, node, 4);

generateTrg1MemInstruction(cg, TR::InstOpCode::lwz, node, trgReg, memRef);
if (needsSync)
  generateInstruction(cg, TR::InstOpCode::lwsync, node);

memRef->decReferenceCounts(cg);

However, code for emitting loads in particular is copied all over the codegen, as it is common to optimize when a child is a single-reference load by changing the opcode used to perform the load (e.g. an ibyteswap of an iload can be done in a single lwbrx instruction). This has resulted in a number of bugs and design issues that have proliferated uncontrolled throughout the Power codegen:

  • It makes it easy to forget to emit the post-load lwsync instruction for volatile loads, which can result in nondeterministic and unreproducible bugs (I've found at least a dozen locations that have this problem so far)
  • It forces code for Java volatile semantics to exist in OMR, as that code must be copied at each location that does a node-based load/store
  • It forces TR::MemoryReference to keep track of nodes whose reference counts need to be decremented, unnecessarily coupling that class to tree evaluation
  • It forces TR::MemoryReference to be tightly coupled to how unresolved symbol references work in OpenJ9
  • It places a lot of complex logic (e.g. tree evaluator optimizations) in TR::MemoryReference, making that class unnecessarily complicated when it should be quite simple in theory

To address these issues, I'd like to propose adding a new extensible class called LoadStoreHandler to the Power codegen that will handle these operations in a safe and controlled manner (this is a slightly simplified version to eliminate some unnecessary implementation details regarding 64-bit loads/stores on 32-bit systems):

class OMR_EXTENSIBLE LoadStoreHandler
   {
   public:
   static void generateLoadNodeSequence(TR::CodeGenerator *cg, TR::Register *trgReg, TR::Node *node, TR::InstOpCode::Mnemonic loadOp, uint32_t length, bool requireIndexForm=false, int64_t extraOffset=0);
   static void generateStoreNodeSequence(TR::CodeGenerator *cg, TR::Register *srcReg, TR::Node *node, TR::InstOpCode::Mnemonic storeOp, uint32_t length, bool requireIndexForm=false, int64_t extraOffset=0);

   static void generateComputeAddressSequence(TR::CodeGenerator *cg, TR::Register *addrReg, TR::Node *node, int64_t extraOffset=0);
   static void generateLoadAddressSequence(TR::CodeGenerator *cg, TR::Register *trgReg, TR::Node *node, TR::Register *addrReg, TR::InstOpCode::Mnemonic loadOp, uint32_t length, bool requireIndexForm=false);
   static void generateStoreAddressSequence(TR::CodeGenerator *cg, TR::Register *srcReg, TR::Node *node, TR::Register *addrReg, TR::InstOpCode::Mnemonic storeOp, uint32_t length, bool requireIndexForm=false);
   };

The idea here is to provide a simple interface for the evaluators to use that makes it much harder to use incorrectly and hides unnecessary details from the user. For most use cases, a simple call to generateLoadNodeSequence or generateStoreNodeSequence can be used to encompass the entire operation of loading/storing with a particular opcode. This makes it completely impossible to forget to emit memory barriers when required, as the fact that this is even happening at all is hidden from the user entirely.

For more advanced use cases, generateComputeAddressSequence can be used to compute the effective address of a load/store into a register. This register can later be passed to generateLoadAddressSequence or generateStoreAddressSequence to actually produce the load/store and corresponding barriers. While potentially more error-prone, this is required for cases where special code (e.g. write barrier handling) must run between when the child nodes are evaluated and when the memory operation is actually performed.

In order to make extension of these methods by downstream projects easier without needing to copy the OMR definitions of the methods on LoadStoreHelper, the OMR implementations would delegate to some methods from another extensible class that defines internal implementation details and isn't meant to be used directly in tree evaluators:

class NodeMemoryReference
    {
    ...
    public:
    TR::MemoryReference *getMemoryReference();
    void decReferenceCounts(TR::CodeGenerator *cg);
    };

class OMR_EXTENSIBLE LoadStoreHandlerImpl
   {
   static NodeMemoryReference generateMemoryReference(TR::CodeGenerator *cg, TR::Node *node, uint32_t length, bool requireIndexForm=false, int64_t extraOffset=0);

   static void generateLoadSequence(TR::CodeGenerator *cg, TR::Register *trgReg, TR::Node *node, TR::MemoryReference *memRef, TR::InstOpCode::Mnemonic loadOp);
   static void generateStoreSequence(TR::CodeGenerator *cg, TR::Register *srcReg, TR::Node *node, TR::MemoryReference *memRef, TR::InstOpCode::Mnemonic storeOp);
   };

Doing this has a couple of notable benefits. For starters, generation of the sequences for loads and stores is centralized into one function which uses an arbitrary TR::MemoryReference without needing to expose that detail to the evaluators. This allows both the address-based and node-based methods to be overridden by a downstream project simultaneously without requiring such code to be duplicated and allows downstream projects' implementations of these methods to delegate to the OMR versions for simple cases without needing to copy code.

Furthermore, the introduction of NodeMemoryReference as an abstract representation of a memory reference alongside a number of nodes whose registers are being used in the memory reference allows TR::MemoryReference to no longer have to worry about keeping track of nodes itself.

For now, these functions would all be implemented by delegating to the existing methods on TR::MemoryReference that we already use in order to avoid breaking changes for the time being. However, significant amounts of code could be moved out of TR::MemoryReference after an interface like this is introduced without breaking evaluators.

@fjeremic
Copy link
Contributor

Could you further explain the need for NodeMemoryReference to exist? It seems like an odd concept to me that doesn't exist elsewhere. What sort of counting or tracking with nodes and memory references need to be done?

@aviansie-ben
Copy link
Contributor Author

@fjeremic The reason that nodes need to be tracked here is that some of the nodes' registers may end up being used in the TR::MemoryReference itself, while others may have been evaluated and used prior to generation of the memory reference. The nodes whose registers are used in the memory reference must not have their reference counts decremented until after the use of the memory reference. Take the following tree for instance:

[1] aloadi
[2]  aladd
[3]   aload <temp 1>
[4]   lload <temp 2>

Assuming that node [2] is only used here, if node [3] ends up in rX and node [4] ends up in rY with node [1] evaluating into rZ, we want to generate a single ldx rZ, rX, rY instruction to perform the load. In this case, the returned memory reference will end up referring to rX and rY, so the refcounts for nodes [3] and [4] must not be decremented until after the memory reference is used, while the refcount for node [2] can be decremented immediately. Because of this, we need a way to indicate to the caller what nodes need to have their refcounts decremented after the caller is done using the memory reference.

Previously, we handled this by keeping track of a "base node" and an "index node" in TR::MemoryReference itself, but this makes no sense IMO since the vast majority of memory references do not refer to nodes and so do not need to keep track of this information. By splitting this into a separate data structure which is used only where this information needs to be kept track of, details about how a memory reference was constructed from trees can avoid leaking into TR::MemoryReference at all.

@fjeremic
Copy link
Contributor

@aviansie-ben check ArtificiallyInflateReferenceCountWhenNecessary on Z. This is handled transparently [1] by artificially incrementing the reference count of nodes which would die otherwise if decremented by evaluators before the use of the memory reference. The data is kept on a stack during evaluation. It avoids the user having to do anything and they don't have to worry about node reference counts when constructing memory references.

[1] https://github.com/eclipse/omr/blob/9ff4de0d50b9b13c3c4d6c69781249b9bcc8b392/compiler/z/codegen/OMRMemoryReference.cpp#L1203-L1287

@aviansie-ben
Copy link
Contributor Author

@fjeremic I see how that can resolve the problem, but I'm not sure I like that as a solution. Artificially increasing the reference count and then scheduling it to be decremented later seems like a bit of a hack to me and requires quite a bit of additional bookkeeping, although I have to admit it may be better from the evaluator's point of view to not have to worry about stuff like this.

That being said, I don't think the benefits of a solution like that really apply once an interface like LoadStoreHandler is put into place. The whole point of this new interface is to make it unnecessary for evaluators to construct a TR::MemoryReference referring to the address of a load/store node themselves in the first place, as doing so is inherently dangerous due to the aforementioned issues with volatile variable accesses. The NodeMemoryReference class that's being added here is completely internal to LoadStoreHandler/LoadStoreHandlerImpl, so the evaluators already don't have to worry about the node reference counts themselves.

@fjeremic
Copy link
Contributor

make it unnecessary for evaluators to construct a TR::MemoryReference referring to the address of a load/store node themselves in the first place

Will we be enforcing this by removing such constructors/factory methods on TR::MemoryReference to make sure the user isn't even able to call the APIs which construct memory references by passing in a root load or store?

@aviansie-ben
Copy link
Contributor Author

aviansie-ben commented Oct 29, 2020

Will we be enforcing this by removing such constructors/factory methods on TR::MemoryReference to make sure the user isn't even able to call the APIs which construct memory references by passing in a root load or store?

Sorry, I should have been more clear about this in my original proposal. The eventual goal is to remove these APIs on MemoryReference and in doing so both prevent misuse in evaluators and completely decouple MemoryReference from tree evaluation. The initial PR will not do this and will instead just have the generateMemoryReference function call the old APIs, but all of this code would be moved to be internal to LoadStoreHandlerImpl once OpenJ9 is moved to the new API.

@fjeremic
Copy link
Contributor

Awesome. I support your efforts then! I'll take a look once the dust settles on your eventual PR and if all looks well we can migrate the Z codegen to a similar solution as I do agree the artificial mucking with reference counts is not the greatest solution.

@aviansie-ben
Copy link
Contributor Author

@gita-omr @zl-wang Just a quick FYI on this, since it's aiming to fix and prevent a family of volatility related bugs I've found in the last week or so where the lwsync/isync barrier after a volatile load is missing when certain codegen optimizations are triggered. Specifically, these are locations like [1] where a load underneath another operation is optimized for without checking whether the symref of the load is volatile. I can confirm that it's possible to trigger this bug under certain conditions, though it can be difficult to get the trees to look exactly right for the bad optimizations to actually trigger (the volatile load must swing down and only be referenced in the one problematic tree), so this could be causing some difficult-to-reproduce race condition bugs.

[1] https://github.com/eclipse/omr/blob/4d32dfda3c999f4e1c27b7dcca73ab27c1989a0f/compiler/p/codegen/UnaryEvaluator.cpp#L520-L530

@gita-omr
Copy link
Contributor

@aviansie-ben could you please add @zl-wang and myself as reviewers? We would certainly like to take a look and discuss.

@aviansie-ben
Copy link
Contributor Author

@gita-omr There is no PR associated with this at the moment. This is simply an issue for discussing high-level design of the new API for performing these sorts of optimizations in a safe manner. I hope to have a WIP PR open by the end of the day on Monday, and I'll certainly add you as reviewers on that PR once it's open.

@aviansie-ben
Copy link
Contributor Author

I've now opened #5652 as a WIP PR for the OMR side of this work. These changes still need more testing and there are still many areas in OpenJ9 that need to be updated to use the new API.

@gita-omr
Copy link
Contributor

gita-omr commented Nov 5, 2020

@aviansie-ben would it be possible to add examples of how the new interface would be used in (1) the basic case (2) the case in #5630 (comment)

@gita-omr
Copy link
Contributor

gita-omr commented Nov 5, 2020

Nvm, found examples in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants