fix: only cache completed connection targets #2240
Signed-off-by: Daniel Bluhm <[email protected]>
Making some noise. Feedback welcome, @shaangill025 @swcurran. Anyone in particular I should tag for review?
Interesting one. Great catch, Daniel. Let's look at this one at ACA-Pug tomorrow (May 30).
Oh, cache consistency... This is definitely an issue, and the proposed change seems sound.
I don't think performance testing this is necessary as long as the code behaviour has not changed.
Integration tests failed on a previous run. I don't think the failure was relevant, but I've merged in main, triggering another run. I'll monitor and follow up on any failures.
Kudos, SonarCloud Quality Gate passed!
This PR adjusts the caching behavior for connection targets to solve some issues seen in multi-replica setups under certain circumstances.
The Problem
Given the following conditions,
If the agent that initiates the credential issuance is not the same agent that completed the DID exchange, the agent will have a connection target cached that still contains the public DID and public routing info of the remote agent, causing the remote agent (if it's also running ACA-Py) to get errors like the following:
This is because it received a message encrypted for its public DID and key, which have no connection associated with them.
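To make the failure concrete, here is a toy model of it. It is a sketch under stated assumptions, not ACA-Py code: it assumes two replicas (alice0 and alice1) that share storage but each keep their own in-memory cache, and that saving a record clears only the saving replica's cache entry. `ConnectionTarget`, `ConnRecord`, `Replica`, the state strings and the DID/key values are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConnectionTarget:
    # Hypothetical, simplified stand-in for a connection target: only the
    # DID and recipient key that outbound messages are encrypted for.
    did: str
    recipient_key: str


@dataclass
class ConnRecord:
    # Hypothetical, simplified connection record (not ACA-Py's ConnRecord).
    connection_id: str
    state: str
    target: ConnectionTarget


class Replica:
    """One agent process: shared storage, but a cache local to this process."""

    def __init__(self, name: str, store: Dict[str, ConnRecord]) -> None:
        self.name = name
        self.store = store  # shared by all replicas (like the wallet)
        self.cache: Dict[str, ConnectionTarget] = {}  # NOT shared

    def get_target(self, connection_id: str) -> ConnectionTarget:
        # Pre-fix behaviour: cache the target whatever the connection state.
        if connection_id in self.cache:
            return self.cache[connection_id]
        target = self.store[connection_id].target
        self.cache[connection_id] = target
        return target

    def save(self, record: ConnRecord) -> None:
        # Saving clears the cached target, but only in this replica's cache.
        self.store[record.connection_id] = record
        self.cache.pop(record.connection_id, None)


store: Dict[str, ConnRecord] = {}
alice0 = Replica("alice0", store)
alice1 = Replica("alice1", store)

# The exchange starts out targeting bob's public DID and key.
public = ConnectionTarget("bob-public-did", "bob-public-key")
alice0.save(ConnRecord("conn-1", "request", public))

# alice1 handles a message mid-exchange and caches the public target.
alice1.get_target("conn-1")

# alice0 completes the exchange; bob's pairwise DID and key replace the public ones.
pairwise = ConnectionTarget("bob-pairwise-did", "bob-pairwise-key")
alice0.save(ConnRecord("conn-1", "completed", pairwise))

# alice1 later issues a credential and still encrypts for bob's public key.
stale = alice1.get_target("conn-1")
print(stale.recipient_key)  # bob-public-key
```

The stored record is correct; only alice1's local cache is stale, which is why the problem depends on which replica handles which message.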
Consider the following diagram. In this diagram, alice0 and alice1 represent two replicas of ACA-Py working collectively to represent an Alice agent, and bob represents the remote agent. If load balancing happens to distribute the messages across the replicas as shown, bob sees the error pasted above.
This can also happen with a slightly different sequence of messages, as shown in this diagram:
In this case, this results in the following error on bob:
The Solution
The solution implemented by this PR is to only cache connection targets on completion of a DID Exchange or connection; in other words, only cache connection targets for connections that have reached a completed state. Given that the connection target is intentionally changed in the process of these protocols (you could even say that's kind of the whole point of the protocol), I believe there is little benefit to caching the initial connection target anyway. This notion is supported by these lines: https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/connections/models/conn_record.py#L503-L513
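The caching rule described above can be sketched as follows. This is a simplified illustration, not the actual ACA-Py diff: it assumes replica-local caches over shared storage, and the names `Replica`, `get_target`, `COMPLETED` and the DID/key strings are hypothetical. The only change from a cache-everything policy is the state check before writing to the cache.

```python
from dataclasses import dataclass
from typing import Dict

COMPLETED = "completed"  # hypothetical label for the completed state


@dataclass
class ConnectionTarget:
    did: str
    recipient_key: str


@dataclass
class ConnRecord:
    connection_id: str
    state: str
    target: ConnectionTarget


class Replica:
    """Hypothetical agent process with a cache local to the process."""

    def __init__(self, store: Dict[str, ConnRecord]) -> None:
        self.store = store  # shared storage
        self.cache: Dict[str, ConnectionTarget] = {}  # replica-local cache

    def get_target(self, connection_id: str) -> ConnectionTarget:
        cached = self.cache.get(connection_id)
        if cached is not None:
            return cached
        record = self.store[connection_id]
        # Only cache once the connection is completed; before that the
        # target is expected to change as the protocol progresses.
        if record.state == COMPLETED:
            self.cache[connection_id] = record.target
        return record.target

    def save(self, record: ConnRecord) -> None:
        # Saving still clears this replica's cached target.
        self.store[record.connection_id] = record
        self.cache.pop(record.connection_id, None)


store: Dict[str, ConnRecord] = {}
alice0 = Replica(store)
alice1 = Replica(store)

public = ConnectionTarget("bob-public-did", "bob-public-key")
alice0.save(ConnRecord("conn-1", "request", public))
early = alice1.get_target("conn-1")  # returned, but not cached

pairwise = ConnectionTarget("bob-pairwise-did", "bob-pairwise-key")
alice0.save(ConnRecord("conn-1", COMPLETED, pairwise))
fresh = alice1.get_target("conn-1")  # read from storage, then cached
print(early.recipient_key, fresh.recipient_key)  # bob-public-key bob-pairwise-key
```

Because alice1 never cached the in-progress target, its first lookup after completion goes to storage and picks up bob's pairwise DID and key.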
After every save on the connection record (which occurs at basically every step of the DID Exchange or connection protocols), we clear the cached target.
Given that the cache is being cleared, causing the connection target info to be re-read from the wallet record, I believe the changes made in this PR should not impact performance, though I have not tested this.
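The performance argument can be illustrated with a toy count of storage reads. This is not a benchmark. It assumes, as a simplification, that each protocol step saves the record (clearing the cache) and then looks up the target once before the next save, followed by a few lookups after completion. The state labels, `CountingReplica` and `run` are hypothetical.

```python
from typing import Dict, List, Tuple


class CountingReplica:
    """Toy replica that counts how often a target is read from storage."""

    def __init__(self, cache_only_completed: bool) -> None:
        self.cache_only_completed = cache_only_completed
        self.store: Dict[str, Tuple[str, str]] = {}  # id -> (state, target)
        self.cache: Dict[str, str] = {}
        self.storage_reads = 0

    def save(self, connection_id: str, state: str, target: str) -> None:
        self.store[connection_id] = (state, target)
        self.cache.pop(connection_id, None)  # every save clears the entry

    def get_target(self, connection_id: str) -> str:
        if connection_id in self.cache:
            return self.cache[connection_id]
        self.storage_reads += 1
        state, target = self.store[connection_id]
        if not self.cache_only_completed or state == "completed":
            self.cache[connection_id] = target
        return target


# Hypothetical, simplified state sequence: one save, then one lookup per step.
STEPS: List[Tuple[str, str]] = [
    ("invitation", "bob-public"),
    ("request", "bob-public"),
    ("response", "bob-pairwise"),
    ("completed", "bob-pairwise"),
]


def run(cache_only_completed: bool) -> int:
    replica = CountingReplica(cache_only_completed)
    for state, target in STEPS:
        replica.save("conn-1", state, target)
        replica.get_target("conn-1")
    for _ in range(3):  # later lookups, e.g. while issuing a credential
        replica.get_target("conn-1")
    return replica.storage_reads


print(run(cache_only_completed=False), run(cache_only_completed=True))  # 4 4
```

Under this assumption, an in-progress cache entry is always cleared by the next save before it can be reused, so both policies read storage the same number of times. Real message flows may deviate from this pattern.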
I get the feeling this change could improve other scenarios we've seen previously that motivated the creation of the redis cache plugin but I have not tested these other scenarios.