Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: only cache completed connection targets #2240

Merged

Conversation

dbluhm
Copy link
Contributor

@dbluhm dbluhm commented May 24, 2023

This PR adjusts the caching behavior for connection targets to solve some issues seen in multi-replica setups under certain circumstances.

The Problem

Given the following conditions,

  • ACA-Py is connecting to a remote agent through the remote agent's Public DID
  • ACA-Py is using more than one replica with loadbalancing such that one replica may start a DID Exchange/Connection and another may finish the exchange.
  • Following connection, ACA-Py attempts to issue a credential to the remote agent (or theoretically any other message where the protocol requires a connection)

If the agent that initiates the credential issuance is not the same agent that completed the DID exchange, the agent will have a connection target cached that still contains the public DID and public routing info of the remote agent, causing the remote agent (if it's also running ACA-Py) to get errors like the following:

2023-05-23 18:18:55,358 aries_cloudagent.core.conductor ERROR Exception in message handler:
Traceback (most recent call last):
  File "/usr/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/lib/python3.9/site-packages/aries_cloudagent/core/dispatcher.py", line 264, in handle_message
    await handler(context, responder)
  File "/usr/lib/python3.9/site-packages/aries_cloudagent/protocols/issue_credential/v2_0/handlers/cred_offer_handler.py", line 51, in handle
    raise HandlerException(
aries_cloudagent.messaging.base_handler.HandlerException: No connection or associated connectionless exchange found for credential offer

This is because it received a message encrypted for it's public DID and key, which does not have a connection associated with it.

Consider the following diagram. In this diagram, alice0 and alice1 represent two replicas of ACA-Py working collectively to represent an Alice agent. bob represents the remote agent.

offer-cred-cache-conn-failure

If the loadbalancing of the replicas happens to cause the above handling of messages by the different replicas, we see the error pasted above on bob.

This can also happen with a slightly different sequence of messages, as shown in this diagram:

issue-cred-cache-conn-failure

In this case, this results in the following error on bob:

Traceback (most recent call last):
  File "/usr/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/lib/python3.9/site-packages/aries_cloudagent/core/dispatcher.py", line 264, in handle_message
    await handler(context, responder)
  File "/usr/lib/python3.9/site-packages/aries_cloudagent/protocols/issue_credential/v2_0/handlers/cred_issue_handler.py", line 49, in handle
    raise HandlerException(
aries_cloudagent.messaging.base_handler.HandlerException: No connection or associated connectionless exchange found for credential

The Solution

The solution implemented by this PR is to only cache connection targets on completion of a DID Exchange or connection; or, in other words, only cache connection targets for connections that have reached a completed state. Given that the connection target is intentionally changed in the process of these protocols (you could even say that's kind of the whole point of the protocol), I believe there is little benefit to caching the initial connection target anyways. This notion is supported by these lines:

https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/connections/models/conn_record.py#L503-L513

    async def post_save(self, session: ProfileSession, *args, **kwargs):
        """Perform post-save actions.

        Args:
            session: The active profile session
        """
        await super().post_save(session, *args, **kwargs)

        # clear cache key set by connection manager
        cache_key = f"connection_target::{self.connection_id}"
        await self.clear_cached_key(session, cache_key)

After every save on the connection record (which will occur at basically every step of the did exchange or connection protocols), we're clearing the cache of the target.

Given that the cache is being cleared, causing the connection target info to be recalled from the wallet record, I believe the changes made in this PR should not impact performance, though I have not tested this.

I get the feeling this change could improve other scenarios we've seen previously that motivated the creation of the redis cache plugin but I have not tested these other scenarios.

@dbluhm
Copy link
Contributor Author

dbluhm commented May 27, 2023

Making some noise. Feedback welcome, @shaangill025 @swcurran. Anyone in particular I should tag for review?

@swcurran
Copy link
Contributor

Interesting one — great catch, Daniel. Lets look at this one at ACA-Pug tomorrow (May 30).

Copy link
Contributor

@Jsyro Jsyro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh cache consistency... this is definitely an issue and the proposed change seems sound.

I don't think performance testing this is necessary as long as the code behaviour has not changed.

@dbluhm
Copy link
Contributor Author

dbluhm commented Jun 1, 2023

Integration tests failed on a previous run; I don't think the failure was relevant but I've merged in main, triggering another run. I'll monitor and follow up on any failures.

@swcurran swcurran merged commit a1ec0b7 into openwallet-foundation:main Jun 1, 2023
@sonarqubecloud
Copy link

sonarqubecloud bot commented Jun 1, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants