
Destroyed agent shuts down remote agent performing background tasks #418

Closed
1 task done
emmacasolin opened this issue Jul 15, 2022 · 18 comments · Fixed by #445
Labels
bug Something isn't working r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy

Comments

@emmacasolin
Contributor

emmacasolin commented Jul 15, 2022

Describe the bug

We have several asynchronous background queues, some of which involve establishing GRPC connections (node connections) with remote agents. However, if the remote agent we're contacting is destroyed during connection establishment, this shuts down our own agent as well.

Example:

{"type":"ErrorAgentClientDestroyed","data":{"message":"","timestamp":"2022-07-15T05:59:49.016Z","data":{},"stack":"ErrorAgentClientDestroyed\n    at /home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:567:39\n    at /home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:205:22\n    at withF (/home/emma/Projects/js-polykey/node_modules/@matrixai/resources/src/utils.ts:24:18)\n    at async constructor_.withConnF (/home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:197:12)\n    at async constructor_.getClosestGlobalNodes (/home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:495:28)\n    at async constructor_.findNode (/home/emma/Projects/js-polykey/src/nodes/NodeConnectionManager.ts:412:8)\n    at async constructor_.refreshBucket (/home/emma/Projects/js-polykey/src/nodes/NodeManager.ts:580:5)\n    at async constructor_.startRefreshBucketQueue (/home/emma/Projects/js-polykey/src/nodes/NodeManager.ts:711:9)","description":"Agent Client is destroyed","exitCode":64}}

To Reproduce

The timing is quite finicky, but you just need two agents running, and you need to kill one of them while the other is doing something like:

INFO:NodeConnectionManager:Getting connection to vutea98s5hv7qcde3elv4vc9qpqsv2oph374ql8i0ogiu106nia2g
INFO:NodeConnectionManager:existing entry found for vutea98s5hv7qcde3elv4vc9qpqsv2oph374ql8i0ogiu106nia2g
INFO:NodeConnectionManager:withConnF calling function with connection to vutea98s5hv7qcde3elv4vc9qpqsv2oph374ql8i0ogiu106nia2g

Expected behavior

For operations that occur in the background, potentially without the user even being aware of them, this should not cause the agent to shut down. While this behaviour makes sense for an operation the user chose to initiate, background tasks that the user has no control over should not be able to kill the agent.

Tasks

@emmacasolin emmacasolin added the bug Something isn't working label Jul 15, 2022
@tegefaulkes
Contributor

Seems like an oversight. Any background operation using a connection needs to catch problems with connections and either handle or ignore them. The NodeConnectionManager already handles the following errors: nodesErrors.ErrorNodeConnectionDestroyed, grpcErrors.ErrorGRPC, and agentErrors.ErrorAgentClientDestroyed. It cleans up the connection, removing it from the connection map, but it still throws them up the chain.

@CMCDragonkai
Member

With the new TaskManager, the TaskHandler for interacting with node connections in the background should now just eat this exception, and fulfill normally without error.

This is because it is expected that a connection may fail: the other node/agent service may go offline due to network problems... etc.

At the same time, if the exception were to bubble up, it would be logged as a warning on the TaskManager, and then emitted to any task promise.

Either we ignore the exception in the task handler, OR we handle it at the task promise. The decision depends on whether we actually care to await the task promise. If it's not being awaited, then we should be eating this exception in the task handler.

I believe in the nodes domain, we never await the task promises. @tegefaulkes says that it's not done for discovery either. So in both cases we just have the task handler consume these exceptions.
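
As a rough sketch, a handler consuming connection errors might look like this (the handler signature and import paths here are assumptions for illustration, not the actual TaskManager API):

    import * as nodesErrors from '../nodes/errors';
    import * as grpcErrors from '../grpc/errors';
    import * as agentErrors from '../agent/errors';
    import type NodeManager from '../nodes/NodeManager';

    // Sketch only: the handler shape is an assumption, not the actual
    // TaskManager API.
    const refreshBucketHandler =
      (nodeManager: NodeManager) =>
      async (bucketIndex: number): Promise<void> => {
        try {
          await nodeManager.refreshBucket(bucketIndex);
        } catch (e) {
          if (
            e instanceof nodesErrors.ErrorNodeConnectionDestroyed ||
            e instanceof grpcErrors.ErrorGRPC ||
            e instanceof agentErrors.ErrorAgentClientDestroyed
          ) {
            // A remote agent going offline is expected; nobody awaits this
            // task promise, so consume the error and fulfil normally.
            return;
          }
          throw e; // anything else should still surface as a task failure
        }
      };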

@CMCDragonkai
Member

CMCDragonkai commented Sep 14, 2022

In order to prevent regressions here, we should have an integration test for this. It is an integration test because we don't care about the details of how the system works; we just want to say that if another node fails, the first node should not fail. It does not matter what the first node is doing. AT ALL TIMES, the failure of a second node should not trigger a failure of the first node.

This sounds like a tests/bin test, but this is a "multi node situation": we are talking about multiple agents here.

I think we need to group our integration tests together:

  1. tests/bin - single node
  2. tests/integration/multi - multi node tests
  3. tests/integration/... - all the other integration tests (nat tests, and testnet tests)

@CMCDragonkai
Member

An idea is to use fast-check to run random interactions between the first and second node: send a SIGKILL to the second node at a random time, and the constraint is that the first node must still be alive and responsive.
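
Roughly like this (a sketch using fast-check; spawnAgent, runRandomCommand, and isAlive are hypothetical test helpers, not existing utilities):

    import fc from 'fast-check';

    test('first node survives the second node being killed at random', async () => {
      await fc.assert(
        fc.asyncProperty(
          fc.integer({ min: 0, max: 2000 }), // delay before SIGKILL, in ms
          async (killDelay) => {
            const agent1 = await spawnAgent();
            const agent2 = await spawnAgent();
            try {
              // Agent 1 starts some random interaction with agent 2
              const interaction = runRandomCommand(agent1, agent2);
              // Abruptly kill agent 2 partway through
              setTimeout(() => agent2.kill('SIGKILL'), killDelay);
              await interaction.catch(() => {}); // the command itself may fail
              // The invariant under test: agent 1 is still alive and responsive
              expect(await isAlive(agent1)).toBe(true);
            } finally {
              agent1.kill('SIGKILL');
              agent2.kill('SIGKILL');
            }
          },
        ),
        { numRuns: 25 },
      );
    });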

@tegefaulkes
Contributor

It's going to be tricky to write a good test for this. I can have two nodes do random things and kill one of them at a random time; that's not too hard to do. However, there are far too many factors here for this to really be useful. To have good coverage we would need a lot of runs every time the test executes, to cover a decent amount of scenarios, which makes it a time-consuming test. If the problem depends on the timing of operations, then it's unlikely we can recreate failure scenarios consistently across platforms. Lastly, the test requires starting a node and killing it over and over again, which makes it pretty expensive too.

Since, by design, the tasks system can't crash when a handler throws, I don't think this is a problem for background tasks anymore.

The other places where something like this can cause a problem are:

  1. Any GRPC call could fail for connection reasons, so we should be handling connection errors when making these calls.
  2. GRPC handlers should handle this as well. I think they already catch any errors and send them back down the connection; however, I'm not sure how they handle a connection failure.

@CMCDragonkai
Member

We can force a few scenarios, like when we first start up, and when we perform a call that involves a long interaction between 2 nodes. That should be enough.

@tegefaulkes
Contributor

tegefaulkes commented Sep 14, 2022

Looking at the problem some more, the error here isn't really expected at the scope of the task handlers. The expectation is that refreshBucket shouldn't throw an error. It's really a bug with findNode, and by extension getClosestGlobalNodes, throwing when it shouldn't. getClosestGlobalNodes SHOULD be catching any connection errors and skipping that node during its search. As a result, getClosestGlobalNodes shouldn't throw, and should return Promise<NodeAddress | undefined>.
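
Something like this shape (a sketch of the intent, not the actual implementation; getClosestLocalNodes and isConnectionError stand in for the real lookup and error check):

    public async getClosestGlobalNodes(
      targetNodeId: NodeId,
    ): Promise<NodeAddress | undefined> {
      // Seed the search from our own node graph (assumed helper)
      const shortlist: Array<NodeId> = await this.getClosestLocalNodes(targetNodeId);
      while (shortlist.length > 0) {
        const nextNodeId = shortlist.shift()!;
        try {
          const closestNodes = await this.getRemoteNodeClosestNodes(
            nextNodeId,
            targetNodeId,
          );
          // ... merge closestNodes into the shortlist and the node graph,
          // and return the target's address if it was found
        } catch (e) {
          // A connection failure means "skip this node", not "abort the search"
          if (isConnectionError(e)) continue;
          throw e;
        }
      }
      return undefined; // target not found: resolve, don't throw
    }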

@CMCDragonkai
Member

Seems like getClosestGlobalNodes is doing 2 things that should in the future be refactored into some sort of mutual recursion. That way one can resolve a node ID to a node address while also updating the node graph in the process. The name could then be changed to updateClosestGlobalNodes, which would return Promise<void>.

@tegefaulkes
Contributor

To check for a connection error we check for:

    e instanceof nodesErrors.ErrorNodeConnectionDestroyed ||
    e instanceof grpcErrors.ErrorGRPC ||
    e instanceof agentErrors.ErrorAgentClientDestroyed

This is a little clunky; maybe I should make an isConnectionError() utility to keep the logic of this check in one place.
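
Something like this (import paths assumed):

    import * as nodesErrors from '../nodes/errors';
    import * as grpcErrors from '../grpc/errors';
    import * as agentErrors from '../agent/errors';

    // Keeps the "was this a connection failure?" logic in one place
    function isConnectionError(e: unknown): boolean {
      return (
        e instanceof nodesErrors.ErrorNodeConnectionDestroyed ||
        e instanceof grpcErrors.ErrorGRPC ||
        e instanceof agentErrors.ErrorAgentClientDestroyed
      );
    }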

@tegefaulkes
Contributor

For the name getClosestGlobalNodes, I don't think updateClosestGlobalNodes quite fits either. It's searching the network for the target node by asking the other nodes in the network, and it's adding any nodes it contacts along the way to the node graph.

I think a better name would be searchNetworkForNode or since it's the 2nd part of findNode, findNodeFromNetwork?

tegefaulkes added a commit that referenced this issue Sep 14, 2022
`getRemoteNodeClosestNodes` was throwing a connection error in certain conditions. If it failed to connect to a node it should've just skipped that node.

#418
@tegefaulkes
Contributor

Updated description with tasks, 2-3 hours for this one.

@tegefaulkes
Contributor

Connection errors like this are only really a problem if they're not caught and handled at some point. The GRPC service handlers by design catch any error and send that through the connection as metadata. So in that case any user-triggered operations shouldn't be able to crash the agent.

That leaves background tasks and the parts of the code supporting normal operation. NodeConnectionManager functions that create connections, such as findNode, pingNode, getClosestGlobalNodes, getRemoteNodeClosestNodes and syncNodeGraph, are the likely suspects. I've reviewed and updated them to ensure that they don't throw in the case of a failed connection; instead they return a default such as false or no data.
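
The pattern is roughly this (a sketch; the method body is illustrative, not the actual pingNode code):

    public async pingNode(nodeId: NodeId): Promise<boolean> {
      try {
        await this.withConnF(nodeId, async () => {
          // ... perform the ping over the established connection
        });
        return true;
      } catch (e) {
        // A failed connection just means the node is unreachable:
        // return a default instead of throwing
        if (isConnectionError(e)) return false;
        throw e; // anything else is an unexpected bug
      }
    }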

The background tasks shouldn't throw any errors that could crash the agent unless we await the task promise. The existing task handlers don't make any connections within the handler's function, but they do call the NodeConnectionManager functions above, so any connection error that reaches a handler should be unexpected. The discovery domain still needs to be converted and looked at.

I gave 1-2 hours to task 2 since it seemed like it would take a bit of digging to be sure we're handling connection errors properly in all cases. But on review, I think most connections are triggered via a service handler at some point, so they're ultimately handled properly. Any other connections are via the NodeConnectionManager methods, and I've checked them.

As for testing this: we'd need to make a test node that accepts any GRPC calls and immediately kills the connection. Ideally we could mimic a process exiting abruptly without having to keep restarting the GRPC server; that will take some experimenting. We can have variants of this where one refuses the connection, times out while connecting, times out returning data, mimics a process crash, etc. Then we can start an agent and see if it breaks while doing nothing, or attempt specific GRPC calls against it.

Figuring out these tests will be tricky and time-consuming. The worst part is coming up with agents that fail the connections in specific ways.
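
One cheap approximation for the "process died abruptly" variant, ignoring Polykey's actual transport details, is a raw socket listener that drops every connection the moment it's accepted (sketch only; the port and behaviour are illustrative):

    import net from 'net';

    // A fake "agent" that accepts any TCP connection and tears it down
    // immediately, approximating a process that died abruptly. Variants
    // could instead never accept (connect timeout) or accept and then
    // hang (data timeout).
    const faultyAgent = net.createServer((socket) => {
      socket.destroy(); // drop the connection as soon as it's accepted
    });
    faultyAgent.listen(1314, '127.0.0.1', () => {
      console.log('fault-injection listener ready');
    });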

@CMCDragonkai
Member

Figuring out these tests will be tricky and time-consuming. The worst part is coming up with agents that fail the connections in specific ways.

I'm thinking it's best we just do things randomly without coding specific ways of crashing; fuzz test the crashing, that is. If we can randomize it sufficiently, that should be enough.

@CMCDragonkai
Member

Coding in specific ways of crashing is brittle, and ultimately will not catch things we aren't aware of. That's the whole point of fuzzing, and fast-check should help here.

@CMCDragonkai
Member

For the name getClosestGlobalNodes, I don't think updateClosestGlobalNodes quite fits either. It's searching the network for the target node by asking the other nodes in the network, and it's adding any nodes it contacts along the way to the node graph.

I think a better name would be searchNetworkForNode or since it's the 2nd part of findNode, findNodeFromNetwork?

Yea, I'm not in favour of renaming unless the function itself gets refactored via functional decomposition so that the "mutual recursion" and side effects are explicit and separated. But that's not a priority atm.

tegefaulkes added a commit that referenced this issue Sep 15, 2022
@CMCDragonkai
Member

CMCDragonkai commented Sep 16, 2022

Try with CommandPing first. So basically 2 agents, get agent 1 to ping agent 2. Randomly kill agent 2.

If that works, we use CommandClaim.
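
i.e. roughly (a sketch; startAgent and pingNode are hypothetical stand-ins for the real test utilities):

    // Two agents; agent 1 pings agent 2; agent 2 is killed mid-ping.
    test('ping survives the target agent dying', async () => {
      const agent1 = await startAgent();
      const agent2 = await startAgent();
      // Start the ping, then kill agent 2 while it is in flight
      const pinging = pingNode(agent1, agent2.nodeId);
      agent2.process.kill('SIGKILL');
      // The ping reports failure, but agent 1 itself must not crash
      await expect(pinging).resolves.toBe(false);
      expect(agent1.process.exitCode).toBeNull(); // agent 1 still running
      await agent1.stop();
    });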

@CMCDragonkai
Member

Use the test that is using 2 nodes for the syncNodeGraph test. Then just kill the second node.

@tegefaulkes
Contributor

I created a new issue MatrixAI/Polykey-CLI#8 relating to this.

tegefaulkes added a commit that referenced this issue Sep 21, 2022
`getRemoteNodeClosestNodes` was throwing a connection error in certain conditions. If it failed to connect to a node it should've just skipped that node.

#418
tegefaulkes added a commit that referenced this issue Sep 21, 2022
tegefaulkes added a commit that referenced this issue Sep 21, 2022
`getRemoteNodeClosestNodes` was throwing a connection error in certain conditions. If it failed to connect to a node it should've just skipped that node.

#418
@CMCDragonkai CMCDragonkai added the r&d:polykey:core activity 3 Peer to Peer Federated Hierarchy label Jul 10, 2023