
Destroyed agent shuts down remote agent performing background tasks #418

Closed
1 task done
emmacasolin opened this issue Jul 15, 2022 · 18 comments · Fixed by #445
Labels
bug Something isn't working r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy

Comments

@emmacasolin
Contributor

emmacasolin commented Jul 15, 2022

Describe the bug

We have several asynchronous background queues, some of which involve establishing GRPC connections (node connections) with remote agents. However, if the remote agent we're contacting is destroyed during connection establishment, this shuts down our own agent as well.

Example:

{"type":"ErrorAgentClientDestroyed","data":{"message":"","timestamp":"2022-07-15T05:59:49.016Z","data":{},"stack":"ErrorAgentClientDestroyed\n    at /home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:567:39\n    at /home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:205:22\n    at withF (/home/emma/Projects/js-polykey/node_modules/@matrixai/resources/src/utils.ts:24:18)\n    at async constructor_.withConnF (/home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:197:12)\n    at async constructor_.getClosestGlobalNodes (/home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:495:28)\n    at async constructor_.findNode (/home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:412:8)\n    at async constructor_.refreshBucket (/home/emma/Projects/js-polykey/src/nodes/NodeManager.ts:580:5)\n    at async constructor_.startRefreshBucketQueue (/home/emma/Projects/js-polykey/src/nodes/NodeManager.ts:711:9)","description":"Agent Client is destroyed","exitCode":64}}

To Reproduce

The timing is quite finicky, but you just need two agents running, and you need to kill one of them while the other is doing something like:

INFO:NodeConnectionManager:Getting connection to vutea98s5hv7qcde3elv4vc9qpqsv2oph374ql8i0ogiu106nia2g
INFO:NodeConnectionManager:existing entry found for vutea98s5hv7qcde3elv4vc9qpqsv2oph374ql8i0ogiu106nia2g
INFO:NodeConnectionManager:withConnF calling function with connection to vutea98s5hv7qcde3elv4vc9qpqsv2oph374ql8i0ogiu106nia2g

Expected behavior

For operations that occur in the background, potentially without the user even being aware of them, this should not cause the agent to shut down. While this behaviour makes sense for an operation the user chose to initiate, background tasks that the user has no control over should not be able to kill the agent.

Tasks

@emmacasolin emmacasolin added the bug Something isn't working label Jul 15, 2022
@tegefaulkes
Contributor

Seems like an oversight. Any background operation using a connection needs to catch problems with connections and either handle or ignore them. The NodeConnectionManager already handles the following errors: nodesErrors.ErrorNodeConnectionDestroyed, grpcErrors.ErrorGRPC, and agentErrors.ErrorAgentClientDestroyed. It cleans up the connection, removing it from the connection map, but it still throws them up the chain.

@CMCDragonkai
Member

With the new TaskManager, the TaskHandler for interacting with node connections in the background should now just eat this exception, and fulfill normally without error.

This is because it is expected that a connection may fail: the other node/agent service may go offline due to network problems... etc.

At the same time, if the exception were to bubble up, it would be logged as a warning on the TaskManager, and then emitted to any task promise.

Either we ignore the exception in the task handler, OR we handle it at the task promise. The decision depends on whether we actually care to await the task promise. If it's not being awaited, then we should be eating this exception in the task handler.

I believe in the nodes domain, we never await the task promises. @tegefaulkes says that it's not done for discovery either. So in both cases we just have the task handler consume these exceptions.
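
As a rough sketch, a handler consuming connection errors might look like this (the handler signature and import paths here are assumptions for illustration, not the actual TaskManager API):

    import * as nodesErrors from '../nodes/errors';
    import * as grpcErrors from '../grpc/errors';
    import * as agentErrors from '../agent/errors';
    import type NodeManager from '../nodes/NodeManager';

    // Sketch only: the handler shape is an assumption, not the actual
    // TaskManager API.
    const refreshBucketHandler =
      (nodeManager: NodeManager) =>
      async (bucketIndex: number): Promise<void> => {
        try {
          await nodeManager.refreshBucket(bucketIndex);
        } catch (e) {
          if (
            e instanceof nodesErrors.ErrorNodeConnectionDestroyed ||
            e instanceof grpcErrors.ErrorGRPC ||
            e instanceof agentErrors.ErrorAgentClientDestroyed
          ) {
            // A remote agent going offline is expected; nobody awaits this
            // task promise, so consume the error and fulfil normally.
            return;
          }
          throw e; // anything else should still surface as a task failure
        }
      };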

@CMCDragonkai
Member

CMCDragonkai commented Sep 14, 2022

In order to prevent regressions here, we should have an integration test for this. It is an integration test because we don't care about the details of how the system works; we just want to say that if another node fails, the first node should not fail. It does not matter what the first node is doing. AT ALL TIMES, the failure of a second node should not trigger a failure of the first node.

This sounds like a tests/bin test, but this is a "multi node situation": we are talking about multiple agents here.

I think we need to group our integration tests together:

  1. tests/bin - single node
  2. tests/integration/multi - multi node tests
  3. tests/integration/... - all the other integration tests (nat tests, and testnet tests)

@CMCDragonkai
Member

An idea is to use fast-check to run random interactions between the first and second node: send a SIGKILL to the second node at a random time, and the constraint is that the first node must still be alive and responsive.
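
Roughly like this (a sketch using fast-check; spawnAgent, runRandomCommand, and isAlive are hypothetical test helpers, not existing utilities):

    import fc from 'fast-check';

    test('first node survives the second node being killed at random', async () => {
      await fc.assert(
        fc.asyncProperty(
          fc.integer({ min: 0, max: 2000 }), // delay before SIGKILL, in ms
          async (killDelay) => {
            const agent1 = await spawnAgent();
            const agent2 = await spawnAgent();
            try {
              // Agent 1 starts some random interaction with agent 2
              const interaction = runRandomCommand(agent1, agent2);
              // Abruptly kill agent 2 partway through
              setTimeout(() => agent2.kill('SIGKILL'), killDelay);
              await interaction.catch(() => {}); // the command itself may fail
              // The invariant under test: agent 1 is still alive and responsive
              expect(await isAlive(agent1)).toBe(true);
            } finally {
              agent1.kill('SIGKILL');
              agent2.kill('SIGKILL');
            }
          },
        ),
        { numRuns: 25 },
      );
    });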

@tegefaulkes
Contributor

It's going to be tricky to write a good test for this. I can have two nodes do random things and kill one of them at a random time; that's not too hard to do. However, there are far too many factors here for this to really be useful. To have good coverage we would need a lot of runs every time the test executes, to cover a decent amount of scenarios, which makes it a time-consuming test. If the problem depends on the timing of operations, then it's unlikely we can recreate failure scenarios consistently across platforms. Lastly, the test requires starting a node and killing it over and over again, which makes it pretty expensive too.

Since, by design, the tasks system can't crash when a handler throws, I don't think this is a problem for background tasks anymore.

The other places where something like this can cause a problem are:

  1. Any GRPC call could fail for connection reasons, so we should be handling connection errors when making these calls.
  2. GRPC handlers should handle this as well. I think they already catch any errors and send them back down the connection; however, I'm not sure how they handle a connection failure.

@CMCDragonkai
Member

We can force a few scenarios, like when we first start up, and when we perform a call that involves a long interaction between 2 nodes. That should be enough.

@tegefaulkes
Contributor

tegefaulkes commented Sep 14, 2022

Looking at the problem some more, the error here isn't really expected at the scope of the task handlers. The expectation is that refreshBucket shouldn't throw an error. It's really a bug with findNode, and by extension getClosestGlobalNodes, throwing when it shouldn't. getClosestGlobalNodes SHOULD be catching any connection errors and skipping that node during its search. As a result, getClosestGlobalNodes shouldn't throw, and should return Promise<NodeAddress | undefined>.
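
Something like this shape (a sketch of the intent, not the actual implementation; getClosestLocalNodes and isConnectionError stand in for the real lookup and error check):

    public async getClosestGlobalNodes(
      targetNodeId: NodeId,
    ): Promise<NodeAddress | undefined> {
      // Seed the search from our own node graph (assumed helper)
      const shortlist: Array<NodeId> = await this.getClosestLocalNodes(targetNodeId);
      while (shortlist.length > 0) {
        const nextNodeId = shortlist.shift()!;
        try {
          const closestNodes = await this.getRemoteNodeClosestNodes(
            nextNodeId,
            targetNodeId,
          );
          // ... merge closestNodes into the shortlist and the node graph,
          // and return the target's address if it was found
        } catch (e) {
          // A connection failure means "skip this node", not "abort the search"
          if (isConnectionError(e)) continue;
          throw e;
        }
      }
      return undefined; // target not found: resolve, don't throw
    }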

@CMCDragonkai
Member

Seems like getClosestGlobalNodes is doing 2 things that should in the future be refactored into some sort of mutual recursion. That way one can resolve a node ID to a node address while also updating the node graph in the process. The name could then be changed to updateClosestGlobalNodes, which would return Promise<void>.

@tegefaulkes
Contributor

To check for a connection error we check for:

    e instanceof nodesErrors.ErrorNodeConnectionDestroyed ||
    e instanceof grpcErrors.ErrorGRPC ||
    e instanceof agentErrors.ErrorAgentClientDestroyed

This is a little clunky; maybe I should make an isConnectionError() utility to keep the logic of this check in one place.
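
Something like this (import paths assumed):

    import * as nodesErrors from '../nodes/errors';
    import * as grpcErrors from '../grpc/errors';
    import * as agentErrors from '../agent/errors';

    // Keeps the "was this a connection failure?" logic in one place
    function isConnectionError(e: unknown): boolean {
      return (
        e instanceof nodesErrors.ErrorNodeConnectionDestroyed ||
        e instanceof grpcErrors.ErrorGRPC ||
        e instanceof agentErrors.ErrorAgentClientDestroyed
      );
    }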

@tegefaulkes
Contributor

For the name getClosestGlobalNodes, I don't think updateClosestGlobalNodes quite fits either. It's searching the network for the target node by asking the other nodes in the network, and it's adding any nodes it contacts along the way to the node graph.

I think a better name would be searchNetworkForNode or since it's the 2nd part of findNode, findNodeFromNetwork?

tegefaulkes added a commit that referenced this issue Sep 14, 2022
`getRemoteNodeClosestNodes` was throwing a connection error in certain conditions. If it failed to connect to a node it should've just skipped that node.

#418
@tegefaulkes
Contributor

Updated description with tasks, 2-3 hours for this one.

@tegefaulkes
Contributor

Connection errors like this are only really a problem if they're not caught and handled at some point. The GRPC service handlers by design catch any error and send that through the connection as metadata. So in that case any user-triggered operations shouldn't be able to crash the agent.

That leaves background tasks and the parts of the code supporting normal operation. NodeConnectionManager functions that create connections, such as findNode, pingNode, getClosestGlobalNodes, getRemoteNodeClosestNodes and syncNodeGraph, are the likely suspects. I've reviewed and updated them to ensure that they don't throw in the case of a failed connection; instead they return a default such as false or no data.
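
The pattern is roughly this (a sketch; the method body is illustrative, not the actual pingNode code):

    public async pingNode(nodeId: NodeId): Promise<boolean> {
      try {
        await this.withConnF(nodeId, async () => {
          // ... perform the ping over the established connection
        });
        return true;
      } catch (e) {
        // A failed connection just means the node is unreachable:
        // return a default instead of throwing
        if (isConnectionError(e)) return false;
        throw e; // anything else is an unexpected bug
      }
    }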

The background tasks shouldn't throw any errors that could crash the agent unless we await the task promise. The existing task handlers don't make any connections within the handler's function, but they do call the NodeConnectionManager functions above, so any connection error that reaches a handler should be unexpected. The discovery domain still needs to be converted and looked at.

I gave 1-2 hours to task 2 since it seemed like it would take a bit of digging to be sure we're handling connection errors properly in all cases. But on review, I think most connections are triggered via a service handler at some point, so they're ultimately handled properly. Any other connections are via the NodeConnectionManager methods, and I've checked them.

As for testing this: we'd need to make a test node that accepts any GRPC calls and immediately kills the connection. Ideally we could mimic a process exiting abruptly without having to keep restarting the GRPC server; that will take some experimenting. We can have variants of this where one refuses the connection, times out while connecting, times out returning data, mimics a process crash, etc. Then we can start an agent and see if it breaks while doing nothing, or attempt specific GRPC calls against it.

Figuring out these tests will be tricky and time-consuming. The worst part is coming up with agents that fail the connections in specific ways.
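
One cheap approximation for the "process died abruptly" variant, ignoring Polykey's actual transport details, is a raw socket listener that drops every connection the moment it's accepted (sketch only; the port and behaviour are illustrative):

    import net from 'net';

    // A fake "agent" that accepts any TCP connection and tears it down
    // immediately, approximating a process that died abruptly. Variants
    // could instead never accept (connect timeout) or accept and then
    // hang (data timeout).
    const faultyAgent = net.createServer((socket) => {
      socket.destroy(); // drop the connection as soon as it's accepted
    });
    faultyAgent.listen(1314, '127.0.0.1', () => {
      console.log('fault-injection listener ready');
    });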

@CMCDragonkai
Member

Figuring out these tests will be tricky and time-consuming. The worst part is coming up with agents that fail the connections in specific ways.

I'm thinking it's best we just do things randomly without coding specific ways of crashing; fuzz test the crashing, that is. If we can randomize it sufficiently, that should be enough.

@CMCDragonkai
Member

Coding in specific ways of crashing is brittle, and ultimately will not catch things we aren't aware of. That's the whole point of fuzzing, and fast-check should help here.

@CMCDragonkai
Member

For the name getClosestGlobalNodes, I don't think updateClosestGlobalNodes quite fits either. It's searching the network for the target node by asking the other nodes in the network, and it's adding any nodes it contacts along the way to the node graph.

I think a better name would be searchNetworkForNode or since it's the 2nd part of findNode, findNodeFromNetwork?

Yea, I'm not in favour of renaming unless the function itself gets refactored via functional decomposition so that the "mutual recursion" and side effects are explicit and separated. But that's not a priority atm.

tegefaulkes added a commit that referenced this issue Sep 15, 2022
@CMCDragonkai
Member

CMCDragonkai commented Sep 16, 2022

Try with CommandPing first. So basically 2 agents, get agent 1 to ping agent 2. Randomly kill agent 2.

If that works, we use CommandClaim.
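
i.e. roughly (a sketch; startAgent and pingNode are hypothetical stand-ins for the real test utilities):

    // Two agents; agent 1 pings agent 2; agent 2 is killed mid-ping.
    test('ping survives the target agent dying', async () => {
      const agent1 = await startAgent();
      const agent2 = await startAgent();
      // Start the ping, then kill agent 2 while it is in flight
      const pinging = pingNode(agent1, agent2.nodeId);
      agent2.process.kill('SIGKILL');
      // The ping reports failure, but agent 1 itself must not crash
      await expect(pinging).resolves.toBe(false);
      expect(agent1.process.exitCode).toBeNull(); // agent 1 still running
      await agent1.stop();
    });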

@CMCDragonkai
Member

Use the test that is using 2 nodes for the syncNodeGraph test. Then just kill the second node.

@tegefaulkes
Contributor

I created a new issue MatrixAI/Polykey-CLI#8 relating to this.

tegefaulkes added a commit that referenced this issue Sep 21, 2022
`getRemoteNodeClosestNodes` was throwing a connection error in certain conditions. If it failed to connect to a node it should've just skipped that node.

#418
tegefaulkes added a commit that referenced this issue Sep 21, 2022
tegefaulkes added a commit that referenced this issue Sep 21, 2022
`getRemoteNodeClosestNodes` was throwing a connection error in certain conditions. If it failed to connect to a node it should've just skipped that node.

#418
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy label Jul 10, 2023