NAT error causes validator to fail (invisible to network) #461
Comments
I think this might happen if your public IP address changes. Do you know how often this happens for you? |
I'm really not sure, since I haven't had any need to monitor this. But I've made a note of it as of today, so we shall see going forward. Most residential IP addresses are not guaranteed to be static, though, and can change as often as every couple of days. So I'm thinking there probably needs to be a more robust solution than the --public-ip flag. |
@dbrisinda What command are you using to start your node? Is your node behind a router? Does that router have UPnP? If so, try looking at your router's UPnP mappings and share them here. |
@dbrisinda Hmm and this log appears:
when your node loses connectivity? |
Does your node have high CPU usage when this happens? |
Yes, that's the log that appears. I don't know if the node has high CPU when this happens, as I haven't caught it at the moment of the error, but some time after the fact. Once that message appears, if I don't restart the node, those messages repeat with regular frequency, maybe a dozen or two identical messages in a day. |
@dbrisinda Have you tried running with the |
No, I haven't tried running with NAT Only setting on the router, since I don't really know what that means. But I will try that the next time it fails. |
Okay, the same error just appeared again after 3 days of running without interruption:
ERROR[09-30|09:53:11] nat/nat.go#116: Renewing port mapping from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:941b0148-a0f4-4f24-a549-f694da024585/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
So I changed the router settings to "NAT Only" as suggested and restarted the node. I also noticed that there is no need to port forward manually, since UPnP automatically forwards port 9651. I kept the manual port forward range 9650-9652 just in case, even though it appears redundant. I will post here within a few days if it fails as before. |
Any updates? @dbrisinda |
Still running without incident using v1.0.1... |
Okay, I just got this error message again after 7 days, but this time the node appears to still be connected to the network, as VScout.io shows it's up, as does avax.dev:
[...]
ERROR[10-09|14:49:39] nat/nat.go#116: Renewing port mapping from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
INFO [10-09|14:50:43] network/peer.go#483: beacon Nr584bLpGgbCUbZFSBaBz3Xum5wpca9Ym attempting to connect with newer version avalanche/1.0.2. You may want to update your client
Note that my public IP address has remained unchanged during this time, though my ISP doesn't guarantee it will always be so. Going to start a new v1.0.2 node in NAT Only mode as previously. |
So I thought I should also include the following as it may be helpful for diagnosing these NAT issues on macOS. After I run v1.0.2 I get the following popup in macOS (this was also happening before): And the node starts as follows with the new error (previously it reported something like the node might only be able to connect to a limited set of peers):
However, if I restart the node, and very quickly click on this "Allow" button before the node has a chance to initialize, then the node initializes without errors:
I've already added the avalanchego executable to the Security & Privacy > Firewall > Firewall Options files list that should allow incoming connections for the executable without being asked each time in a popup. However, macOS may not be allowing incoming network connections automatically (forcing the user to click on "Allow" in the popup) since the executable does not appear to be signed by a valid certificate authority (according to Apple). The second checked box below in particular stresses this point, so self-signed (adhoc) executables might not work, even though the executable is in the list of software that is allowed to receive incoming connections. Checking the signature on avalanchego gives:
So it appears the adhoc signature may be the culprit here. And since I'm compiling from scratch, I will have to sign the executable using a valid certificate authority right after compiling. Looks like I will need to research how to do this properly. But I would like to suggest it might be worthwhile to add some docs on the GitHub avalanchego repo landing page for codesigning on macOS, to make this more seamless for those compiling from source. I also noticed that the avalanchego executable in the distributed macOS zip package has no codesigning at all, so I would imagine it would have the same issues. |
If you attempt to use UPnP on the node while you have conflicting port forwarding rules on your router, your router will respond with errors when the node attempts to perform the UPnP mapping. You need to either stop using UPnP or disable any conflicting port forwarding. |
@tasinco No conflicting port forwarding on the router. UPnP seems to be working okay, as the single 9651 port is automatically forwarded when I launch the node and cleared when the node terminates. |
I just noticed that my node has logged several of these ERROR and WARN messages again as of 12 hours ago. So that would be about 3 days without errors. Notice the last one is a WARN instead of an ERROR. Though, when I look on VScout.io and avax.dev, my node still appears to be connected (yes). So this version of the node (v1.0.2) seems to be an improvement over previously, where the node would disconnect when these errors appeared:
ERROR[10-11|07:34:47] nat/nat.go#86: Renewing port mapping try #1 from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Still thinking adhoc signing could be connected to this issue... |
The Renew message was set to a warning because we cannot really confirm the route is gone. Most routers handle UPnP requests differently. One thing that could be better for your purposes is how often we retry the UPnP update. With the new 5-minute retry interval and the 30-minute map timeout, we have a better chance of getting the UPnP update through before the connection drops. I don't expect every UPnP map request to work, but as long as one succeeds before the mapping expires, it will keep the port alive. I don't understand the adhoc signing comment, as that has nothing to do with keeping the port alive. |
From your response it seems like the situation is better. Closing the issue. We continue to update the networking code and offer features. |
@tasinco Thanks for all you guys do. You are a stellar team of engineers. Hats off to you all... (Yes, the retry interval and adjusted map timeout in v1.0.2 seem to have helped the situation.)
Regarding the adhoc signing comment: because of paranoid macOS security settings, on launch of the node I always get the popup "Do you want the application avalanchego to accept incoming network connections?" I have to select the "Allow" button quickly, otherwise the node starts up with this error:
ERROR[10-12|20:49:51] main/main.go#90: UPnP or NAT-PMP router attach failed, you may not be listening publicly, please confirm the settings in your router
And if I check the forwarded ports on the router, port 9651 has NOT been forwarded via UPnP. This fails because the executable is not signed (or is adhoc signed): macOS determines that an unsigned executable may not receive incoming network connections, and the UPnP request fails. IF the executable IS properly signed (or I quickly select the "Allow" button on launch), the port is forwarded and I get no errors.
Now, after some period of time, the port mapping expires, and because the executable is not signed (or is adhoc signed), it fails to renew the port mapping, since that requires "trusted" status, which was only given once, explicitly, at launch by quickly selecting the "Allow" button. It seems that when the port mapping expires, so does the user intervention that allowed a non-signed executable to receive incoming network connections and the required port to be auto-forwarded by UPnP.
In summary, properly code signing the avalanchego executable would remove the popup asking if incoming network connections should be allowed, and would likely ensure that an expired port mapping can be remapped without failing. Is that more clear? |
UPDATE: I signed the avalanchego executable using my Apple developer certificate, and it didn't make any difference to the popup mentioned above, which still showed up on first run after a fresh build. What I noticed is that the popup only appears on the first launch of the node. If I close the node down and then relaunch it, the popup doesn't appear again, provided I previously selected "Allow". So macOS remembers the last setting to allow incoming network connections for a given executable until I compile a new version of the node. So it appears you are right. Code signing doesn't have anything to do with the UPnP issues after all. My mistake. |
An internal ticket was opened with our team to look at solutions to properly sign the macOS binaries we provide. |
Cheers. Here's a great doc that I found extremely helpful. Also supports third-party (non-Apple) signing identities: |
Describe the bug
Node runs correctly for 1-2 days, then displays a NAT error which causes the node to no longer be visible to the Avalanche network. Naturally, validator staking uptime drops significantly. The only remedy is to restart the node, which then works for another 1-2 days before the error repeats itself.
To Reproduce
Normal launch of node.
Expected behavior
Node continues to run without interruption.
Screenshots
ERROR[09-27|12:25:09] nat/nat.go#116: Renewing port mapping from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:941b0148-a0f4-4f24-a549-f694da024585/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
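For anyone who wants to track how often these renewal failures occur instead of discovering them after the fact, the following is a rough sketch of a log scanner. It relies only on the line format visible in the messages quoted in this thread; the type `renewalFailure`, the function `findRenewalFailures`, and the regular expression are hypothetical helpers rather than part of avalanchego, and the assumption that WARN lines share the same layout as ERROR lines is untested.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// renewFailureRe matches the port-mapping renewal failures quoted in this
// thread. Assumption: WARN lines use the same layout as ERROR lines.
var renewFailureRe = regexp.MustCompile(
	`^(ERROR|WARN)\s*\[(\d{2}-\d{2}\|\d{2}:\d{2}:\d{2})\] nat/nat\.go#\d+: ` +
		`Renewing port mapping .*?from external port (\d+) to internal port (\d+) failed`)

// renewalFailure holds the fields extracted from one matching log line.
type renewalFailure struct {
	Level        string
	Timestamp    string
	ExternalPort string
	InternalPort string
}

// findRenewalFailures scans log text line by line and returns every
// renewal failure it recognizes; all other lines are ignored.
func findRenewalFailures(log string) []renewalFailure {
	var failures []renewalFailure
	scanner := bufio.NewScanner(strings.NewReader(log))
	for scanner.Scan() {
		m := renewFailureRe.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		failures = append(failures, renewalFailure{
			Level:        m[1],
			Timestamp:    m[2],
			ExternalPort: m[3],
			InternalPort: m[4],
		})
	}
	return failures
}

// sampleLog reuses log lines quoted verbatim in this issue.
const sampleLog = `
ERROR[09-27|12:25:09] nat/nat.go#116: Renewing port mapping from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:941b0148-a0f4-4f24-a549-f694da024585/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
INFO [10-09|14:50:43] network/peer.go#483: beacon Nr584bLpGgbCUbZFSBaBz3Xum5wpca9Ym attempting to connect with newer version avalanche/1.0.2. You may want to update your client
ERROR[10-11|07:34:47] nat/nat.go#86: Renewing port mapping try #1 from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
`

func main() {
	for _, f := range findRenewalFailures(sampleLog) {
		fmt.Printf("%s at %s: port %s -> %s\n", f.Level, f.Timestamp, f.ExternalPort, f.InternalPort)
	}
}
```

Feeding the node's log output through something like this would give a timestamped list of failed renewals, which can then be compared against when the node actually dropped off the network.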
Operating System
macOS 10.15.6
Additional context
By submitting this issue I agree to the Terms and Conditions of the Developer Accelerator Program.