Define a mechanism to monitor, track and notify Connection Timeouts from a deployed Mediator #75

Closed
3 tasks
swcurran opened this issue Apr 13, 2023 · 6 comments

@swcurran
Contributor

We are seeing connection timeouts in Aries mobile wallets with a (more or less) stock Aries Mediator Service mediator. We need a way to be aware of these errors on the mediator side so that we know when and how often they occur, and so that a notification can go to the team that has deployed the mediator. This task is to figure out how to add monitoring to a deployment of the aries-mediator-service.

Suggested steps:

  • Reproduce the error consistently from a non-prod environment (dev or test). @jleach has created some scripts, and the Locust tool in the load-testing folder of this repo can be used.
  • Is there something visible on the server side that can be used for logging, tracking and notifications?
  • Where are the timeouts set, and can we increase them? The server timeout should be shorter (30s) than the client-side timeout (45s).

The logging info below is a possibility. We'd have to see what a "normal" websocket closure (including the mobile device turning off) looks like to ensure we aren't looking at false positives.

March 17th 2023, 14:59:22.548	aries-mediator-agent	2023-03-17 21:59:22,548 aries_cloudagent.transport.inbound.ws ERROR Unexpected Websocket message type received: WSMsgType.CLOSED: None, None
March 17th 2023, 14:56:39.975	aries-mediator-agent	2023-03-17 21:56:39,975 aries_cloudagent.transport.inbound.ws ERROR Unexpected Websocket message type received: WSMsgType.CLOSED: None, None
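
To get a first feel for how often this message shows up, a small script along these lines could bucket the occurrences by minute. It is only a sketch: the regular expression is derived from the two sample lines above and is assumed to hold for other log lines, and the function names are made up for illustration. It does not tell a normal closure from a timeout; it only counts.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the two sample lines above (an assumption):
# "<date> <time>,<ms> <logger> <LEVEL> <message>"
LOG_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<logger>\S+) (?P<level>[A-Z]+) (?P<message>.*)"
)
CLOSED_MARKER = "Unexpected Websocket message type received: WSMsgType.CLOSED"


def parse_line(line):
    """Return (timestamp, logger, level, message), or None if no match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return ts, match["logger"], match["level"], match["message"]


def count_closed_per_minute(lines):
    """Count CLOSED-type websocket errors, bucketed by minute."""
    buckets = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that do not follow the assumed format
        ts, _logger, level, message = parsed
        if level == "ERROR" and CLOSED_MARKER in message:
            buckets[ts.replace(second=0, microsecond=0)] += 1
    return buckets
```

Comparing these per-minute counts against periods where wallets are known to be working normally would be one way to see whether the message is a usable signal or mostly noise.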

@swcurran
Contributor Author

@usingtechnology -- note this issue as you see what is happening on the ACA-Py Mediator side.

@swcurran
Contributor Author

Assigning this to @usingtechnology and @WadeBarnes after a question from @jleach about the status of this issue. In the research being done into the mediator behaviour, have we done enough to be able to detect on the mediator side when an error in establishing a connection (either to the mediator itself, or to another agent) occurs?

Note that the answer to this might be a “no, not possible”, and we close this accordingly.

Thanks!

@WadeBarnes
Contributor

The error messages listed above are a common occurrence.

For example:
image

There are a lot of "Error" messages (noise) around what seems to be regular web socket traffic. Therefore I think the first thing that needs to happen is a review of the logging associated with that traffic, to determine what a normal web socket connection lifecycle should look like and ensure the events are logged appropriately. At the same time we could review the timeout settings and determine what values would be considered reasonable. The current web socket timing settings are ACAPY_WS_HEARTBEAT_INTERVAL=15 and ACAPY_WS_TIMEOUT_INTERVAL=60 in all environments, based on the recommendations here: openwallet-foundation/acapy#2157 (comment)
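
As part of that review, a quick sanity check of the two settings against the "server timeout shorter than client timeout" suggestion in the issue description could look roughly like this. It is a sketch under stated assumptions: the 45s client timeout is the figure suggested in the description, not a value read from any wallet, the fallback values are the ones quoted in this comment rather than ACA-Py defaults, and `check_ws_settings` is a hypothetical helper, not part of ACA-Py.

```python
import os


def check_ws_settings(env=None, client_timeout=45.0):
    """Return a list of warnings about the websocket timing settings.

    client_timeout is an assumed value (the 45s suggested in the issue
    description), not something read from the mobile wallets.
    """
    env = os.environ if env is None else env
    # Fallbacks are the values quoted in this thread, not ACA-Py defaults.
    heartbeat = float(env.get("ACAPY_WS_HEARTBEAT_INTERVAL", "15"))
    timeout = float(env.get("ACAPY_WS_TIMEOUT_INTERVAL", "60"))

    warnings = []
    if heartbeat >= timeout:
        warnings.append(
            f"heartbeat interval ({heartbeat}s) is not shorter than "
            f"the timeout interval ({timeout}s)"
        )
    if timeout >= client_timeout:
        warnings.append(
            f"server timeout ({timeout}s) is not shorter than the "
            f"assumed client timeout ({client_timeout}s)"
        )
    return warnings


if __name__ == "__main__":
    for warning in check_ws_settings():
        print("WARNING:", warning)
```

With the current 15/60 settings and an assumed 45s client timeout, this reports the second warning only, which is the mismatch worth discussing during the review.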

Related issue:

@WadeBarnes
Contributor

WadeBarnes commented May 17, 2023

Some thoughts on this ...

  • Although it would be nice to have something in the code capable of triggering notifications of actual issues, it might not scale well when dealing with multiple instances. It could result in a flood of alerts.
  • Implementing external monitoring capable of aggregating the logs/alerts/triggers from multiple ACA-Py instances would likely be more appropriate. I had a quick look at the tools available to teams in our cluster, and besides being able to create visualizations like the one above, there are no alert and notification capabilities (that I can see) available out of the box, so we'd have to wire something together for our purposes. Other teams have accomplished this in various ways.
  • We need something that allows us to monitor/query the logs for keywords and then set a threshold for when a notification would be sent. I'm deliberately trying not to solution while stating this.
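
Purely to illustrate the keyword-plus-threshold idea (not to pick a tool), the logic could be sketched as below. The class name, keyword, threshold and window are all hypothetical, and the sketch assumes events from all instances are fed into one object in timestamp order. It fires once when the threshold is crossed and stays quiet until the count drops back, which is one way to avoid the flood of alerts mentioned in the first point.

```python
from collections import deque
from datetime import datetime, timedelta


class KeywordThresholdAlert:
    """Signal when a keyword appears too often within a sliding time window."""

    def __init__(self, keyword, threshold, window):
        self.keyword = keyword
        self.threshold = threshold  # hits within the window that trigger an alert
        self.window = window        # timedelta covered by the sliding window
        self._hits = deque()        # timestamps of matching events
        self._alerting = False      # True while the count is over the threshold

    def observe(self, timestamp, message):
        """Feed one log event; return True only when an alert should be sent.

        Events are assumed to arrive in non-decreasing timestamp order.
        """
        # Drop hits that have fallen out of the window.
        cutoff = timestamp - self.window
        while self._hits and self._hits[0] <= cutoff:
            self._hits.popleft()

        if self.keyword in message:
            self._hits.append(timestamp)

        over = len(self._hits) >= self.threshold
        fire = over and not self._alerting  # notify once per excursion
        self._alerting = over
        return fire
```

A caller would create one instance, for example `KeywordThresholdAlert("WSMsgType.CLOSED", threshold=50, window=timedelta(minutes=5))` with made-up numbers, feed it the aggregated log events, and send a notification whenever `observe` returns `True`.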

@jleach
Contributor

jleach commented May 17, 2023

@swcurran Do you think this should be transferred into ACA-Py as an action item for @WadeBarnes' comments above (review params and logging), and closed once that is done? It feels a little amorphous, in that it's hard to tease out what specific changes need to take place beyond this.

@swcurran
Contributor Author

From the sounds of it, I think this request should be pushed to the BC Gov deployment repo for the mediator, and we work on the types of solutions @WadeBarnes mentioned above that work in the BC Gov context. As we find useful things, updating either or both of ACA-Py and this repo is appropriate, as documentation or code (if that makes sense).

I’m going to close this issue here — feel free to reopen if needed.
