Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAT error causes validator to fail (invisible to network) #461

Closed
dbrisinda opened this issue Sep 29, 2020 · 24 comments
Closed

NAT error causes validator to fail (invisible to network) #461

dbrisinda opened this issue Sep 29, 2020 · 24 comments
Labels
bug Something isn't working

Comments

@dbrisinda
Copy link

dbrisinda commented Sep 29, 2020

Describe the bug
Node runs correctly for 1-2 days, then displays a NAT error which causes the node to no longer be visible to the Avalanche network. Naturally, validator staking uptime drops significantly. The only remedy is to restart the node, which then works for another 1-2 days before the error repeats itself.

To Reproduce
Normal launch of node.

Expected behavior
Node continues to run without interruption.

Screenshots
ERROR[09-27|12:25:09] nat/nat.go#116: Renewing port mapping from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:941b0148-a0f4-4f24-a549-f694da024585/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Operating System
macOS 10.15.6

Additional context

  • ports 9650 to 9652 are all forwarded on router (just in case)
  • can't use --public-ip since ip is not guaranteed to be static by IP provider

By submitting this issue I agree to the Terms and Conditions of the Developer Accelerator Program.

@dbrisinda dbrisinda added the bug Something isn't working label Sep 29, 2020
@danlaine
Copy link

I think this might happen if your public IP address changes. Do you know how often this happens for you?

@dbrisinda
Copy link
Author

I'm really not sure since I haven't had any need to monitor this. But I've made a note of it as of today, so we shall see going forward. But most residential IP address are not guaranteed to be static, and can change as frequently as every couple of days. So I'm thinking there probably needs to be a more robust solution than the --public-ip flag.

@danlaine
Copy link

@dbrisinda What command are you using to start your node? Is your node behind a router? Does that router have UPNP? If so, try looking at your router's UPNP mappings and share them here

@dbrisinda
Copy link
Author

Start node: ./build/avalanchego
Node is behind a router that has UPnP

Screen Shot 2020-09-29 at 4 19 51 PM

Here are the mappings; the 9651 mapping was added dynamically by UPnP, while the range 9650-9652 was added manually by me just to make sure.

Screen Shot 2020-09-29 at 4 23 02 PM

@danlaine
Copy link

@dbrisinda Hmm and this log appears:

ERROR[09-27|12:25:09] nat/nat.go#116: Renewing port mapping from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:941b0148-a0f4-4f24-a549-f694da024585/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

when your node loses connectivity?

@danlaine
Copy link

Does your node have high CPU usage when this happens?

@dbrisinda
Copy link
Author

Yes, that's the log that appears. I don't know if the node has high CPU when this happens, as I haven't caught it at the moment of the error, but some time after the fact. Once that message appears, if I don't restart the node, those messages repeat with regular frequency, maybe a dozen or two identical messages in a day.

@dbrisinda
Copy link
Author

In case this is helpful, here's a snapshot of all other ports allowing traffic using the low-security setting on the router:

Screen Shot 2020-09-29 at 4 56 39 PM

Screen Shot 2020-09-29 at 4 54 54 PM

@danlaine
Copy link

@dbrisinda Have you tried running with the NAT Only setting on your router?

@dbrisinda
Copy link
Author

No, I haven't tried running with NAT Only setting on the router, since I don't really know what that means. But I will try that the next time it fails.

@dbrisinda
Copy link
Author

Okay, same error just appeared again after 3 days running without interruption:

ERROR[09-30|09:53:11] nat/nat.go#116: Renewing port mapping from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:941b0148-a0f4-4f24-a549-f694da024585/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

So I changed the router settings to "NAT Only" as suggested. And restarted the node. I also noticed that there is no need to port forward manually, since UPnP automatically port forwards 9651. Though I kept the manual port forward range 9650-9652 just in case even though it appears redundant. I will post here within a few days if it fails as before.

@danlaine
Copy link

danlaine commented Oct 7, 2020

Any updates? @dbrisinda

@dbrisinda
Copy link
Author

Still running without incident using v1.0.1...

@dbrisinda
Copy link
Author

Okay, just got this error message again after 7 days, but this time, the node appears to still be connected to the newtork, as VScout.io shows it's up, as does avax.dev:

[...]
INFO [10-08|15:37:33] network/peer.go#483: beacon QKGoUvqcgormCoMj6yPw9isY7DX9H4mdd attempting to connect with newer version avalanche/1.0.2. You may want to update your client

ERROR[10-09|14:49:39] nat/nat.go#116: Renewing port mapping from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

INFO [10-09|14:50:43] network/peer.go#483: beacon Nr584bLpGgbCUbZFSBaBz3Xum5wpca9Ym attempting to connect with newer version avalanche/1.0.2. You may want to update your client
INFO [10-09|14:50:46] network/peer.go#483: beacon Hr78Fy8uDYiRYocRYHXp4eLCYeb8x5UuM attempting to connect with newer version avalanche/1.0.2. You may want to update your client
INFO [10-09|14:50:48] network/peer.go#483: beacon kZNuQMHhydefgnwjYX1fhHMpRNAs9my1 attempting to connect with newer version avalanche/1.0.2. You may want to update your client
[...]

Note, my public IP address has remained unchanged during this time, though my ISP provider doesn't guarantee it will always be so.

Going to start a new v1.0.2 node in NAT Only mode as previously.

@dbrisinda
Copy link
Author

So I thought I should also include the following as it may be helpful for diagnosing these NAT issues on macOS.

After I run v1.0.2 I get the following popup in macOS (this was also happening before):

Screen Shot 2020-10-10 at 1 15 36 AM

And the node starts as follows with the new error (previously it reported something like the node might only be able to connect to a limited set of peers):

Franks-iMac:/Users/dalmazio/go/src/github.com/ava-labs/avalanchego(master)$ ./build/avalanchego
     _____               .__                       .__
    /  _  \___  _______  |  | _____    ____   ____ |  |__   ____    ,_ o
   /  /_\  \  \/ /\__  \ |  | \__  \  /    \_/ ___\|  |  \_/ __ \   / //\,
  /    |    \   /  / __ \|  |__/ __ \|   |  \  \___|   Y  \  ___/    \>> |
  \____|__  /\_/  (____  /____(____  /___|  /\___  >___|  /\___  >    \\
          \/           \/          \/     \/     \/     \/     \/
ERROR[10-10|01:12:38] main/main.go#90: UPnP or NAT-PMP router attach failed, you may not be listening publicly, please confirm the settings in your router
INFO [10-10|01:12:38] node/node.go#747: Node version is: avalanche/1.0.2
[...]

However, if I restart the node, and very quickly click on this "Allow" button before the node has a chance to initialize, then the node initializes without errors:

Franks-iMac:/Users/dalmazio/go/src/github.com/ava-labs/avalanchego(master)$ ./build/avalanchego
     _____               .__                       .__
    /  _  \___  _______  |  | _____    ____   ____ |  |__   ____    ,_ o
   /  /_\  \  \/ /\__  \ |  | \__  \  /    \_/ ___\|  |  \_/ __ \   / //\,
  /    |    \   /  / __ \|  |__/ __ \|   |  \  \___|   Y  \  ___/    \>> |
  \____|__  /\_/  (____  /____(____  /___|  /\___  >___|  /\___  >    \\
          \/           \/          \/     \/     \/     \/     \/
INFO [10-10|01:19:03] nat/nat.go#70: NAT Traversal successful from external port 9651 to internal port 9651
INFO [10-10|01:19:03] node/node.go#747: Node version is: avalanche/1.0.2
[...]

I've already added the avalanchego executable to the Security & Privacy > Firewall > Firewall Options files list that should allow incoming connections for the executable without being asked each time in a popup.

However, macOS may not be allowing incoming network connections automatically (forcing the user to click on "Allow" in the popup) since the executable does not appear to be signed by a valid certificate authority (according to Apple).

The second checked box below in particular stresses this point, so self-signed (adhoc) executables might not work, even though the executable is in the list of software that is allowed to receive incoming connections.

Screen Shot 2020-10-10 at 1 33 32 AM

Checking the signature on avalanchego gives:

Franks-iMac:/Users/dalmazio/go/src/github.com/ava-labs/avalanchego(master)$ codesign -dvvv ./build/avalanchego 
Executable=/Users/dalmazio/go/src/github.com/ava-labs/avalanchego/build/avalanchego
Identifier=avalanchego-bb22ba56952876b396ed75a73bfba4b1ea782937
Format=Mach-O thin (x86_64)
CodeDirectory v=20100 size=246021 flags=0x2(adhoc) hashes=7683+2 location=system
Hash type=sha256 size=32
CandidateCDHash sha256=f49287473de8739c3498b26eba432bb01adaa348
CandidateCDHashFull sha256=f49287473de8739c3498b26eba432bb01adaa348fc6cb65cb017b9c53eaf9086
Hash choices=sha256
CMSDigest=f49287473de8739c3498b26eba432bb01adaa348fc6cb65cb017b9c53eaf9086
CMSDigestType=2
CDHash=f49287473de8739c3498b26eba432bb01adaa348
Signature=adhoc
Info.plist=not bound
TeamIdentifier=not set
Sealed Resources=none
Internal requirements count=0 size=12

So it appears the adhoc signature may be the culprit here. And since I'm compiling from scratch, I will have to sign the executable using a valid certifcate authority right after compiling.

Looks like I will need to research how to do this properly. But I would like to suggest it might be worthwhile to add some docs on the github avalanchego repo landing page for codesigning on macOS for those compiling from source to make this more seamless.

I also noticed that the avalanchego executable in the distributed macOS zip package has no codesigning at all, so I would imagine it would have the same issues.

@tasinco
Copy link
Contributor

tasinco commented Oct 12, 2020

If you attempt to use UPnP on the node and you have conflicting port forwarding rules on your router. You router will respond with errors when attempt to perform the UPnP mapping.

You need to either stop using UPnP or disable any conflicting port forwarding.

@dbrisinda
Copy link
Author

@tasinco No conflicting port forwarding on the router. The UPnP seems to be working okay, as the single 9561 port is automatically forwarded when I launch the node and cleared when the node terminates.

Screen Shot 2020-10-12 at 6 10 59 PM

@dbrisinda
Copy link
Author

I just noticed that my node has logged several of these ERROR and WARN messages again as of 12 hours ago. So that would be about 3 days without errors.

Notice the last one is a WARN instead of an ERROR. Though, when I look on VScout.io and avax.dev, my node still appears to be connected (yes). So this version of the node (v1.0.2) seems to be an improvement over previously, where the node would disconnect when these errors appeared:

ERROR[10-11|07:34:47] nat/nat.go#86: Renewing port mapping try #1 from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ERROR[10-12|06:50:35] nat/nat.go#86: Renewing port mapping try #1 from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ERROR[10-12|06:50:46] nat/nat.go#86: Renewing port mapping try #2 from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
ERROR[10-12|06:50:57] nat/nat.go#86: Renewing port mapping try #3 from external port 9651 to internal port 9651 failed with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
WARN [10-12|06:50:58] nat/nat.go#116: Renew NAT Traversal failed from external port 9651 to internal port 9651 with goupnp: error performing SOAP HTTP request: Post "http://192.168.1.254:5431/uuid:be482aa3-893c-4217-a7c3-8337c05da0bd/WANIPConnection:1": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

Still thinking adhoc signing could be connected to this issue...

@tasinco
Copy link
Contributor

tasinco commented Oct 13, 2020

The Renew message was set to a warning because we cannot really confirm the route is gone. Most routers handle UPnP requests differently.

One thing that could be better for your purposes is how often we retry the UPnP update. Based on the new retry interval 5min and the map timeout 30min, we have a better chance of getting the UPnP update before the connection drops.

I don't expect every UPnP map request will work. But as long as we have a success before the mapping expires it will keep the port alive.

I don't understand the adhoc signing comment, as that has nothing to do with keeping the port alive.

@tasinco
Copy link
Contributor

tasinco commented Oct 13, 2020

From your response it seems like situation is better. Closing the issue.

We continue to update the networking code and offer features.

@tasinco tasinco closed this as completed Oct 13, 2020
@dbrisinda
Copy link
Author

@tasinco Thanks for all you guys do. You are a stellar team of engineers. Hats off to you all...

(Yes, the retry interval and adusted map timeout in v1.0.2 seems to have helped the situation)

Regarding the adhoc signing comment...

Because of paranoid macOS security settings, for example, on launch of the node, I always get the popup: "Do you want the application avalanchego to accept incoming network connections?"

I have to select the "Allow" button quickly, otherwise the node starts up with this error:

ERROR[10-12|20:49:51] main/main.go#90: UPnP or NAT-PMP router attach failed, you may not be listening publicly, please confirm the settings in your router

And if I check the forwarded ports on the router, the 9651 port has NOT been forwarded via UPnP. The reason this fails is because the executable is not signed (or it is adhoc signed), as macOS determines that if it is not signed then it does not allow the node to receive incoming network connections, and the UPnP request fails.

IF the executable IS properly signed (or I quickly select the "Allow" button on launch) the port is forwarded and I get no errors.

Now, after some period of time, the port mapping expires, and because the executable is not signed (or is adhoc signed), it fails to renew the port mapping, since that requires "trusted" status, which was only given once at launch explicitly by quickly selecting the "Allow" button. It seems that when the port mapping expires, so does this user-intervention which allowed a non-signed executable to receive incoming network connections and the required port to be auto-forwarded by UPnP.

In summary, properly code signing the avalanchego executable would remove the popup asking if incoming network connections should be allowed, and likely ensure that expired port mapping can be remapped without failing.

Is that more clear?

@dbrisinda
Copy link
Author

UPDATE

I signed the avalanchego executable using my Apple developer certificate, and it didn't make any difference in terms of the popup mentioned above that showed up on first run. That popup still showed up after a fresh build.

What I noticed is, that popup only appears on the first launch of the node. If I close it down, and then relaunch, the popup doesn't appear again provided I previously selected "Allow". So macOS remembers the last setting to allow incoming network connections for a given executable until I compile a new version of the node.

So it appears you are right. Code signing doesn't have anything to do with the UPnP issues after all. My mistake.

@tasinco
Copy link
Contributor

tasinco commented Oct 13, 2020

An internal ticket was opened with our team to look at solutions to properly sign the macos binaries we provide.

@dbrisinda
Copy link
Author

Cheers. Here's a great doc that I found extremely helpful. Also supports third-party (non-Apple) signing identities:

https://developer.apple.com/library/archive/documentation/Security/Conceptual/CodeSigningGuide/Procedures/Procedures.html

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants