Extend netbird status command to include health information #1471

Merged: 16 commits, Jan 22, 2024

@lixmal lixmal commented Jan 16, 2024

Describe your changes

  1. Added probes that query the signal server, the management server, and the current relays whenever netbird status is run.
  • The signal server gRPC connection is tested by sending a message to a dummy peer.
  • The management gRPC connection is tested by retrieving the public key.
  • The STUN servers are tested by trying to bind to an address.
  • The TURN servers are tested by trying to allocate a session.
  2. Extended the iface code to return WireGuard status (last handshake, tx/rx bytes) per peer. This is also retrieved on demand when netbird status is run.

New status output:

[Two screenshots of the new status output]

Issue ticket number and link

As a user who is troubleshooting our client’s connectivity, I want to know if there are connection issues between the client and the management system.

We have 3 main layers that are used in connection discovery:

  • signal server (for message exchange)

  • management server (for network map)

  • STUN/TURN server(s) (for public IP discovery and relay connections)

For each of these components, we should validate if they are working as intended.

Also missing from our status command are the WireGuard handshake status, the peer transfer stats, and the endpoints NetBird uses on both ends of the connection.
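As an illustration of where the handshake time and transfer counters can come from, here is a minimal sketch that parses the key=value text a userspace WireGuard device returns for a get operation. The keys `public_key`, `last_handshake_time_sec`, `tx_bytes`, and `rx_bytes` are part of that interface; the `WGStats` type and `ParsePeerStats` function are hypothetical names, and the code merged in this PR may be structured differently.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
	"time"
)

// WGStats holds the per-peer values shown in the status output.
type WGStats struct {
	LastHandshake time.Time // zero value means no handshake yet
	TxBytes       int64
	RxBytes       int64
}

// ParsePeerStats reads a key=value dump and returns stats keyed by the
// peer public key exactly as it appears in the dump (hex-encoded there).
// Sub-second handshake precision is ignored for brevity.
func ParsePeerStats(dump string) (map[string]WGStats, error) {
	stats := make(map[string]WGStats)
	current := "" // public key of the peer section being read
	scanner := bufio.NewScanner(strings.NewReader(dump))
	for scanner.Scan() {
		key, value, ok := strings.Cut(scanner.Text(), "=")
		if !ok {
			continue
		}
		if key == "public_key" {
			current = value
			stats[current] = WGStats{}
			continue
		}
		if current == "" {
			continue // device-level keys appear before the first peer
		}
		s := stats[current]
		switch key {
		case "last_handshake_time_sec", "tx_bytes", "rx_bytes":
			n, err := strconv.ParseInt(value, 10, 64)
			if err != nil {
				return nil, err
			}
			switch key {
			case "last_handshake_time_sec":
				if n > 0 {
					s.LastHandshake = time.Unix(n, 0)
				}
			case "tx_bytes":
				s.TxBytes = n
			case "rx_bytes":
				s.RxBytes = n
			}
		}
		stats[current] = s
	}
	return stats, scanner.Err()
}
```

A peer whose `LastHandshake` is still the zero time has not completed a handshake, which is a useful troubleshooting signal on its own.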

Below, you can see a draft of the detailed status output after the changes are applied:

Peers detail:
 raspberrypi.netbird.cloud:
  NetBird IP: 100.127.45.118
  Public key: qNqWct63/h1uL+2wPS3LZKodFOar5N4Lpj4KUunrIng=
  Status: Connected
  -- detail --
  Connection type: P2P
  Direct: true
  ICE candidate (Local/Remote): srflx/host
  ICE candidate endpoints (Local/Remote): 54.35.10.87:51820/101.12.32.16:51820
  Last connection update: 2024-01-08 17:16:42
  Last Wireguard handshake: 2024-01-08 20:16:42
  Transfer status (received/sent): 41.83 KiB/174.65 KiB

 iphone.netbird.cloud:
  NetBird IP: 100.127.45.25
  Public key: qNqWct63/h1uL+lkndkvlnsdkvndslfkvndklf=
  Status: Connected
  -- detail --
  Connection type: Relay
  Direct: true
  ICE candidate (Local/Remote): srflx/relay
  ICE candidate endpoints (Local/Remote): 54.35.10.87:51820/35.10.23.111:23345
  Last connection update: 2024-01-08 17:16:42
  Last Wireguard handshake: 2024-01-08 20:16:42
  Transfer status (received/sent): 41.83 KiB/174.65 KiB

Daemon version: 0.25.3
CLI version: 0.25.3
Management: Connected to https://api.netbird.io:443
Signal: Not connected to https://signal.netbird.io:443, reason: ...
Relay: 
  https://turn.netbird.io:5555 is Available
  https://turn.netbird.io:5555 is Unavailable, reason: stun or turn error
FQDN: maycons-mbp-2.netbird.cloud
NetBird IP: 100.127.197.64/16
Interface type: Userspace
Peers count: 1/1 Connected

For the summarized status:

Daemon version: 0.25.3
CLI version: 0.25.3
Management: Connected 
Signal: Disconnected, reason: ...
Relay: 1/2 Available
FQDN: maycons-mbp-2.netbird.cloud
NetBird IP: 100.127.197.64/16
Interface type: Userspace
Peers count: 1/1 Connected
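The "Transfer status" and "Relay: 1/2 Available" lines in the drafts above could be produced along these lines. This is only a sketch: `FormatBytes`, `RelayState`, and `SummarizeRelays` are hypothetical helpers, not names taken from the NetBird code base.

```go
package main

import "fmt"

// FormatBytes renders a byte count with binary (1024-based) units,
// matching the "41.83 KiB" style used in the draft output.
func FormatBytes(n int64) string {
	const unit = 1024
	if n < unit {
		return fmt.Sprintf("%d B", n)
	}
	div, exp := int64(unit), 0
	for v := n / unit; v >= unit; v /= unit {
		div *= unit
		exp++
	}
	return fmt.Sprintf("%.2f %ciB", float64(n)/float64(div), "KMGTPE"[exp])
}

// RelayState is the probe outcome for one STUN or TURN server.
type RelayState struct {
	URI       string
	Available bool
}

// SummarizeRelays collapses the per-relay results into the single
// line used by the summarized status.
func SummarizeRelays(relays []RelayState) string {
	up := 0
	for _, r := range relays {
		if r.Available {
			up++
		}
	}
	return fmt.Sprintf("Relay: %d/%d Available", up, len(relays))
}
```

For example, `FormatBytes(42834)` yields "41.83 KiB" and `FormatBytes(178842)` yields "174.65 KiB", the two values shown in the peer detail above.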

Let’s store state recording when the last probe was run against management, signal, and relay. If the last checks passed, the status command does not probe again by default; a user can force a probe with a --force-health-probe flag. If a probe check failed, the next run of the status command probes again, regardless of the last check time.

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)
  • Extended the README / documentation, if necessary

CLAassistant commented Jan 16, 2024

CLA assistant check
All committers have signed the CLA.

@lixmal lixmal marked this pull request as ready for review January 17, 2024 08:52
pascal-fischer previously approved these changes Jan 18, 2024
This happens on PSK/pubkey mismatch and is returned by wireguard in
userspace mode.
pascal-fischer previously approved these changes Jan 18, 2024
@lixmal lixmal requested a review from pascal-fischer January 22, 2024 09:18
@lixmal lixmal merged commit a7d6632 into main Jan 22, 2024
16 checks passed
@lixmal lixmal deleted the health-probes branch January 22, 2024 11:20
Foosec pushed a commit to Foosec/netbird that referenced this pull request May 8, 2024
…o#1471)

* Adds management, signal, and relay (STUN/TURN) health probes to the status command.

* Adds a reason when the management or signal connections are disconnected.

* Adds last wireguard handshake and received/sent bytes per peer