Destroyed agent shuts down remote agent performing background tasks #418
Comments
Seems like an oversight. Any background operation using a connection needs to catch problems with the connection and either handle or ignore them.
This is because it is expected that a connection may fail: the other node/agent service may go offline due to network problems, etc. At the same time, if the exception were to bubble up, it would be logged as a warning. Either we ignore the exception in the task handler, OR we handle it in the task promise. The decision depends on whether we actually care to await the task promise. If it's not being awaited, then we should be eating this exception in the task handler. I believe in the nodes domain we never await the task promises. @tegefaulkes says that it's not done for discovery either. So in both cases we just have the task handler consume these exceptions.
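To make the "consume it in the task handler" option concrete, here is a minimal sketch. Everything in it is illustrative: the handler name, `pingRemoteNode`, and `logger` are placeholders, and the error classes stand in for the ones quoted later in this thread rather than the real imports.

```ts
// Sketch of a task handler that eats expected connection errors instead of
// letting them bubble up. The classes and declared helpers are stand-ins,
// not the actual Polykey code.
class ErrorNodeConnectionDestroyed extends Error {}
class ErrorGRPC extends Error {}
class ErrorAgentClientDestroyed extends Error {}

declare function pingRemoteNode(nodeId: string): Promise<void>;
declare const logger: { warn: (msg: string) => void };

const refreshBucketHandler = async (nodeId: string): Promise<void> => {
  try {
    // This call may open a node connection to a remote agent
    await pingRemoteNode(nodeId);
  } catch (e) {
    if (
      e instanceof ErrorNodeConnectionDestroyed ||
      e instanceof ErrorGRPC ||
      e instanceof ErrorAgentClientDestroyed
    ) {
      // A remote agent going offline is expected; since the task promise is
      // never awaited in the nodes domain, consume the error here
      logger.warn(`Connection failed while contacting ${nodeId}: ${e.message}`);
      return;
    }
    // Anything else is a genuine bug and should still surface
    throw e;
  }
};
```

Handling it on the task promise instead only makes sense if something actually awaits that promise.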
In order to prevent regressions here, we should have an integration test for this. It is an integration test because we don't want to depend on the details of how the system works. We just want to say that if another node fails, the first node should not fail. It does not matter what the first node is doing. AT ALL TIMES, the failure of a second node should not trigger a failure of the first node. I think we need to group our integration tests together.
An idea is to use fast-check to run random interactions between the first and second node. Send a SIGKILL to the second node at a random point, and the constraint is that the first node must still be alive and responsive.
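A rough shape for that fuzz test, assuming Jest and hypothetical harness helpers (`spawnAgent`, `randomOperation`, `isResponsive`) that don't exist yet:

```ts
// Fuzzing sketch: run a random interaction between two agents, SIGKILL the
// second one at a random moment, and assert the first stays responsive.
// spawnAgent/randomOperation/isResponsive are placeholders for a future harness.
import fc from 'fast-check';

declare function spawnAgent(): Promise<{ kill: (signal: string) => void }>;
declare function randomOperation(first: unknown, second: unknown): Promise<void>;
declare function isResponsive(agent: unknown): Promise<boolean>;

test('first agent survives the second agent being SIGKILLed', async () => {
  await fc.assert(
    fc.asyncProperty(fc.integer({ min: 0, max: 1000 }), async (killDelayMs) => {
      const first = await spawnAgent();
      const second = await spawnAgent();
      // Kill the second agent at a random point during the interaction
      const timer = setTimeout(() => second.kill('SIGKILL'), killDelayMs);
      await randomOperation(first, second).catch(() => {});
      clearTimeout(timer);
      // The only invariant: the first agent must still be alive and responsive
      expect(await isResponsive(first)).toBe(true);
      first.kill('SIGTERM');
    }),
    { numRuns: 25 },
  );
});
```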
It's going to be tricky to write a good test for this. I can have two nodes do random things and kill one of them at a random time. That's not too hard to do. However, there are far too many factors here for this to really be useful. To have good coverage we will need a lot of runs each time we run the test, to cover a good amount of scenarios. This will make it a time-consuming test. If the problem depends on the timing of operations, then it's unlikely we'd recreate failure scenarios consistently between platforms. Lastly, the test requires starting a node and killing it over and over again, which makes the test pretty expensive too. Since by design the tasks system can't crash when a handler throws, I don't think this is a problem for background tasks anymore. There are other places where something like this can cause a problem.
We can force a few scenarios, like when we first start up, and when we perform a call that involves a long interaction between two nodes. That should be enough.
Looking at the problem some more, the error here isn't really expected at the scope of the task handlers.
To check for a connection error we check for:

```ts
e instanceof nodesErrors.ErrorNodeConnectionDestroyed ||
  e instanceof grpcErrors.ErrorGRPC ||
  e instanceof agentErrors.ErrorAgentClientDestroyed
```

This is a little clunky, maybe I should make a utility for this check.
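For instance, a small helper along these lines could wrap the check. `isConnectionError` and the import paths are assumptions for illustration; only the error class names come from the check above.

```ts
// Hypothetical helper wrapping the connection-error check quoted above.
// The import paths assume the usual src/<domain>/errors layout.
import * as nodesErrors from './nodes/errors';
import * as grpcErrors from './grpc/errors';
import * as agentErrors from './agent/errors';

function isConnectionError(e: unknown): boolean {
  return (
    e instanceof nodesErrors.ErrorNodeConnectionDestroyed ||
    e instanceof grpcErrors.ErrorGRPC ||
    e instanceof agentErrors.ErrorAgentClientDestroyed
  );
}

// Usage: catch, test, and rethrow anything that isn't a connection failure
// try { await doNodeCall(); } catch (e) { if (!isConnectionError(e)) throw e; }
```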
As for the name, I think a better name could be found for it.
`getRemoteNodeClosestNodes` was throwing a connection error in certain conditions. If it failed to connect to a node it should've just skipped that node. #418
Updated description with tasks, 2-3 hours for this one.
Connection errors like this are only really a problem if they're not caught and handled at some point. The GRPC service handlers by design catch any error and send it through the connection as metadata, so user-triggered operations shouldn't be able to crash the agent. That leaves background tasks and the parts of the code supporting normal operation that could cause this. The background tasks shouldn't throw any errors that could crash the agent unless we await the task promise. The existing task handlers don't make any connections within the handler's function, but they do make calls that open connections indirectly.

I gave 1-2 hours to task 2 since it seems like it would take a bit of digging to be sure we're handling connection errors properly in all cases. On review, though, I think most connections are triggered via a service handler at some point, so they're ultimately handled properly.

As for testing this: we'd need to make a test node that accepts any GRPC calls and immediately kills the connection. It would be best if we can mimic a process exiting abruptly without having to keep restarting the GRPC server; that will take some experimenting. We can have variants of this where one refuses the connection, times out connecting, times out returning data, mimics a process crash, etc. Then we can start an agent and see if it breaks while doing nothing, or attempt specific GRPC calls against it. Figuring out these tests will be tricky and time-consuming. The worst part is coming up with agents that fail the connections in specific ways.
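One cheap way to get the "immediately kills the connection" variant, as a sketch: instead of a full GRPC server, a raw TCP listener that destroys every socket can stand in for a peer whose process died. The function name, port, and failure modes here are illustrative only.

```ts
// Sketch of a faulty peer for connection-failure tests. A raw TCP server is
// enough to mimic abrupt process death at the transport level without
// repeatedly starting and killing a real agent or GRPC server.
import net from 'net';

type FailureMode = 'killImmediately' | 'hang';

function createFaultyPeer(port: number, mode: FailureMode): net.Server {
  const server = net.createServer((socket) => {
    if (mode === 'killImmediately') {
      // Tear the connection down as soon as it is accepted, mimicking a crash
      socket.destroy();
    }
    // 'hang': accept the connection but never respond, to exercise timeouts
  });
  server.listen(port);
  return server;
}

// e.g. const peer = createFaultyPeer(12345, 'killImmediately');
// then point an agent's node connection at 127.0.0.1:12345 and observe it
```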
I'm thinking it's best we just do things randomly without coding specific ways of crashing. Fuzz-test the crashing, that is. If we can cover enough random cases, that should be enough.
Coding in specific ways of crashing is brittle, and ultimately will not catch things we aren't aware of. That's the whole point of fuzzing, and fast-check should help here.
Yea, I'm not in favour of renaming unless the function itself gets refactored via functional decomposition so that the "mutual recursion" and side effects are explicit and separated. But that's not a priority atm.
Try that first. If that works, we use it.
Use the test that is using 2 nodes for this.
I created a new issue MatrixAI/Polykey-CLI#8 relating to this.
Describe the bug
We have several asynchronous background queues, some of which involve establishing grpc connections (node connections) with remote agents. However, if the remote agent we're contacting is destroyed during connection establishment, then this will shut down our own agent.
Example:
To Reproduce
The timing is quite finicky, but you just need two agents running, and you need to kill one of them while the other is doing something that establishes a connection to it (such as syncing its node graph).
Expected behavior
For operations that are occurring in the background, potentially even without the user being aware of them, this should not cause the agent to shut down. While this behaviour makes sense for an operation the user chose to initiate, background tasks that the user has no control over should not be able to kill the agent.
Tasks

- `findNode`, `pingNode`, `getClosestGlobalNodes`, `getRemoteNodeClosestNodes`, and `syncNodeGraph` can't throw an error due to a connection error. 1 hour
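For this task, the intended behaviour is roughly what the linked commit describes for `getRemoteNodeClosestNodes`: skip an unreachable node instead of throwing. A sketch under assumed names (`connectAndGetClosestNodes` and `isConnectionError` are placeholders, not the real signatures):

```ts
// Sketch of "skip the unreachable node" behaviour for node graph queries.
// The declared helpers are placeholders; only the intent matches the task above.
declare function connectAndGetClosestNodes(nodeId: string): Promise<string[]>;
declare function isConnectionError(e: unknown): boolean;

async function collectClosestNodes(candidates: string[]): Promise<string[]> {
  const results: string[] = [];
  for (const nodeId of candidates) {
    try {
      results.push(...(await connectAndGetClosestNodes(nodeId)));
    } catch (e) {
      // An unreachable node is expected during normal operation:
      // skip it and keep querying the remaining candidates
      if (isConnectionError(e)) continue;
      throw e;
    }
  }
  return results;
}
```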