fix: only cache completed connection targets #2240

dbluhm · 2023-05-24T16:29:47Z

This PR adjusts the caching behavior for connection targets to solve some issues seen in multi-replica setups under certain circumstances.

The Problem

Given the following conditions,

ACA-Py is connecting to a remote agent through the remote agent's Public DID
ACA-Py is using more than one replica with loadbalancing such that one replica may start a DID Exchange/Connection and another may finish the exchange.
Following connection, ACA-Py attempts to issue a credential to the remote agent (or theoretically any other message where the protocol requires a connection)

If the agent that initiates the credential issuance is not the same agent that completed the DID exchange, the agent will have a connection target cached that still contains the public DID and public routing info of the remote agent, causing the remote agent (if it's also running ACA-Py) to get errors like the following:

2023-05-23 18:18:55,358 aries_cloudagent.core.conductor ERROR Exception in message handler:
Traceback (most recent call last):
  File "/usr/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/lib/python3.9/site-packages/aries_cloudagent/core/dispatcher.py", line 264, in handle_message
    await handler(context, responder)
  File "/usr/lib/python3.9/site-packages/aries_cloudagent/protocols/issue_credential/v2_0/handlers/cred_offer_handler.py", line 51, in handle
    raise HandlerException(
aries_cloudagent.messaging.base_handler.HandlerException: No connection or associated connectionless exchange found for credential offer

This is because it received a message encrypted for it's public DID and key, which does not have a connection associated with it.

Consider the following diagram. In this diagram, alice0 and alice1 represent two replicas of ACA-Py working collectively to represent an Alice agent. bob represents the remote agent.

If the loadbalancing of the replicas happens to cause the above handling of messages by the different replicas, we see the error pasted above on bob.

This can also happen with a slightly different sequence of messages, as shown in this diagram:

In this case, this results in the following error on bob:

Traceback (most recent call last):
  File "/usr/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/lib/python3.9/site-packages/aries_cloudagent/core/dispatcher.py", line 264, in handle_message
    await handler(context, responder)
  File "/usr/lib/python3.9/site-packages/aries_cloudagent/protocols/issue_credential/v2_0/handlers/cred_issue_handler.py", line 49, in handle
    raise HandlerException(
aries_cloudagent.messaging.base_handler.HandlerException: No connection or associated connectionless exchange found for credential

The Solution

The solution implemented by this PR is to only cache connection targets on completion of a DID Exchange or connection; or, in other words, only cache connection targets for connections that have reached a completed state. Given that the connection target is intentionally changed in the process of these protocols (you could even say that's kind of the whole point of the protocol), I believe there is little benefit to caching the initial connection target anyways. This notion is supported by these lines:

https://github.com/hyperledger/aries-cloudagent-python/blob/main/aries_cloudagent/connections/models/conn_record.py#L503-L513

    async def post_save(self, session: ProfileSession, *args, **kwargs):
        """Perform post-save actions.

        Args:
            session: The active profile session
        """
        await super().post_save(session, *args, **kwargs)

        # clear cache key set by connection manager
        cache_key = f"connection_target::{self.connection_id}"
        await self.clear_cached_key(session, cache_key)

After every save on the connection record (which will occur at basically every step of the did exchange or connection protocols), we're clearing the cache of the target.

Given that the cache is being cleared, causing the connection target info to be recalled from the wallet record, I believe the changes made in this PR should not impact performance, though I have not tested this.

I get the feeling this change could improve other scenarios we've seen previously that motivated the creation of the redis cache plugin but I have not tested these other scenarios.

Signed-off-by: Daniel Bluhm <[email protected]>

dbluhm · 2023-05-27T20:09:36Z

Making some noise. Feedback welcome, @shaangill025 @swcurran. Anyone in particular I should tag for review?

swcurran · 2023-05-29T21:33:27Z

Interesting one — great catch, Daniel. Lets look at this one at ACA-Pug tomorrow (May 30).

aries_cloudagent/protocols/connections/v1_0/manager.py

Jsyro

Oh cache consistency... this is definitely an issue and the proposed change seems sound.

I don't think performance testing this is necessary as long as the code behaviour has not changed.

dbluhm · 2023-06-01T15:00:37Z

Integration tests failed on a previous run; I don't think the failure was relevant but I've merged in main, triggering another run. I'll monitor and follow up on any failures.