Connection dropped/timed out when connecting to deployed agent #414
Comments
Interestingly, I'm having a lot less difficulty connecting to deployed agents on my home wifi compared to uni wifi, and I haven't had the connection dropped like it was yesterday. This may just be our agents not being able to punch through stronger NATs, so this issue may be relevant when it comes to working on #383.
The testnet isn't behind any NAT; it has public IP addresses, so when connecting from a local agent to the testnet, there isn't any punch-through. As soon as your local agent initiates a message, your local NAT would allow reverse packets from the same IP and port.
Yes, I meant in the case of the testnet trying to connect back to me. If the conntrack entry has expired then the testnet would not be able to reach me, if that's indeed what it's attempting to do. These connection re-attempts (coming from the testnet) have continued for 20+ hours after my local agent was killed, probably longer, but that's the longest any of the agents showing this behaviour have been active.
Ah yes, now that I think about it, that can happen. What is the reason for the testnet attempting to connect back? It has something to do with DHT queries I believe, but in that case, one must expect failures to reconnect in case your local node goes offline. Is the connection attempt solely at the proxy level (proxy connection), or at the grpc connection level (node connection)? Are there other reasons? @tegefaulkes For punching back to your local node, you have to have #383 done too. Plus your node is pretty transient, since it's both mobile on the uni wifi network, and the uni network itself would have a NAT. Note that #383 isn't solely about 2 levels of NATs, but really any number of levels of NATs. Tailscale/tailnet is a good source of inspiration for how to address some of this.
Furthermore, if reconnection attempts fail for any arbitrary node on the network, they may produce very noisy logs, because there will be lots of failed network connection attempts as the network scales. I reckon we'll need to make use of structured logging to help filter out these logs.
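As a rough, hypothetical sketch of what structured logging could look like here (none of these field names or helpers are from the Polykey codebase), records carrying a machine-filterable event field would let expected reconnection failures be filtered or downsampled without string-matching on log messages:

```ts
// Hypothetical structured log record; field names are illustrative only.
type LogRecord = {
  level: 'info' | 'warn' | 'error';
  event: string;
  msg: string;
  [key: string]: unknown;
};

// Emit a structured record for a failed reconnection attempt.
function logConnectionFailure(nodeId: string, attempt: number): LogRecord {
  return {
    level: 'warn',
    event: 'node.connection.failure',
    msg: `Failed to connect to node ${nodeId}`,
    nodeId,
    attempt,
  };
}

// A log sink can then drop or downsample noisy events by key rather than
// pattern-matching on free-form messages.
const noisyEvents = new Set(['node.connection.failure']);
function shouldEmit(record: LogRecord): boolean {
  return record.level === 'error' || !noisyEvents.has(record.event);
}
```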
While my agent is online the connection attempts are at the grpc connection/node connection level, but once it goes offline I believe it's only at the proxy level (since the only logs are coming from […]).
AFAIK the only times a node will automatically try to connect to another node are: […]

If the nodes are continuously trying to connect over and over again, then I'd check: […]
This problem is similar to #418. This may also be an unhandled promise rejection. I believe our PK agent should have an unhandled promise rejection handler registered at the top level, and if it receives one, it should report it appropriately and also exit the program. We should have this done inside […].

At the moment there is no unhandled promise rejection handler. We can add this in to aid in our debugging. At the same time, we could also address #307. See: https://nodejs.org/api/process.html#event-unhandledrejection

We should have 2 new exceptions at the top level: `ErrorBinUncaughtException` and `ErrorBinUnhandledRejection`.
These 2 indicate software bugs, so they should use […]. Actually, I just remembered […].
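For reference, a minimal sketch of the kind of top-level handlers being discussed, using Node's documented `process` events (per the linked Node.js docs); the error classes here are placeholders standing in for the proposed exceptions, and the exit code is an assumption:

```ts
import process from 'process';

// Placeholders standing in for the proposed top-level exceptions; in the real
// codebase they would extend the project's own error base class.
class ErrorBinUncaughtException extends Error {}
class ErrorBinUnhandledRejection extends Error {}

// Both events indicate a software bug: report the error and exit non-zero.
process.on('uncaughtException', (e: Error) => {
  const error = new ErrorBinUncaughtException(e.message);
  process.stderr.write(`${error.constructor.name}: ${error.message}\n`);
  process.exit(1); // assumed exit code
});

process.on('unhandledRejection', (reason) => {
  const error = new ErrorBinUnhandledRejection(String(reason));
  process.stderr.write(`${error.constructor.name}: ${error.message}\n`);
  process.exit(1); // assumed exit code
});
```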
I don't have a solution for this bug yet. @tegefaulkes can add the handlers, and let's see if we can reproduce it AFTER we have solved the other bugs #418 and #415.
I just looked, […]. Right now it will set the process exit code to the error's code if it's an […]. With #414 (comment) already addressed, I don't know what else to do for this issue.
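A hedged sketch of that kind of top-level exit handling; the `exitCode` property, class name, and `main` wrapper below are assumptions for illustration, not the project's actual error hierarchy:

```ts
// Hypothetical error class carrying its own exit code.
class ErrorWithExitCode extends Error {
  constructor(message: string, public exitCode: number = 1) {
    super(message);
  }
}

// Top-level wrapper: use the error's own exit code when available,
// otherwise fall back to a generic failure code.
async function main(run: () => Promise<void>): Promise<void> {
  try {
    await run();
    process.exitCode = 0;
  } catch (e) {
    process.exitCode = e instanceof ErrorWithExitCode ? e.exitCode : 1;
  }
}

// Usage: main(async () => { /* start the agent */ });
```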
I could resolve #307 for detecting promise deadlocks.
Yes, the […]. However, we want to change this to rethrow a special set of exceptions (#414 (comment)), so that it is clear to us that these are unrecoverable errors. In fact, the presence of these indicates a software bug.
Since they are all unrecoverable errors, we can use […]. Regardless of whether it is an instance of […].
We should not use […]. I suggest putting all of these exceptions into the […].
We don't want these errors to extend […]. I'm going to rename […].
I'll hold off on that change for now until I decide the best action.
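As a sketch of how these could be grouped (the base class name, parent class, and exit code value are assumptions; the three concrete names are the ones that appear in the commit references later in the thread):

```ts
// Hypothetical base class for unrecoverable errors indicating a software bug;
// the real parent class and exit code are implementation details.
class ErrorBin extends Error {
  exitCode: number = 1;
}

// The un-recoverable top-level errors, per the linked commits.
class ErrorBinUncaughtException extends ErrorBin {}
class ErrorBinUnhandledRejection extends ErrorBin {}
class ErrorBinAsynchronousDeadlock extends ErrorBin {}

// Since they are all unrecoverable, a single check at the top level suffices;
// seeing any of them means there is a bug to fix rather than a user error.
function isUnrecoverable(e: unknown): e is ErrorBin {
  return e instanceof ErrorBin;
}
```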
Ok, well in that case, it should be […], and have the […].
Consider this: […]
Change the 2 […]. Also in the […].
Those last 2 errors aren't used by anything. They must have been forgotten and can be removed.
Un-recoverable errors include `ErrorBinUncaughtException`, `ErrorBinUnhandledRejection` and `ErrorBinAsynchronousDeadlock`. #414
Describe the bug
Occasionally, when attempting to connect to a deployed agent, we see connection timeouts. Sometimes, this also results in a connection being dropped (and shutting down both agents if the connection is attempted during `pk agent start` with the deployed agent specified as a seed node). A specific occurrence of both agents shutting down ended with this error on the local agent: […]

And this series of errors/warnings on the deployed agent: […]
To Reproduce
The exact cause of the error is still unknown; however, it may be reproducible by connecting to a deployed agent as a seed node:
`npm run polykey -- agent start --seednodes="[deployedAgentNodeId]@[deployedAgentHost]:[deployedAgentPort]" --verbose --format json`
Expected behavior
Even if a connection times out, this shouldn't kill an agent, especially not the seed node. The seed node should be able to ignore a failed connection, even if it was the one that caused the failure, since we need seed nodes to be stable.