ffmuc-mesh-vpn-wireguard: enable loadbalancing #100

Merged
merged 6 commits into from
Mar 26, 2024

awlx
Member

@awlx awlx commented Mar 22, 2024

This enables loadbalancing to actually use the supernodes to the best extent.

grische and others added 2 commits March 17, 2024 22:21
Add support for the server-side loadbalancing with wgkex v0.2.0+

Co-authored-by: DasSkelett <[email protected]>
@awlx awlx requested review from grische and maurerle March 22, 2024 12:38
@awlx awlx self-assigned this Mar 22, 2024
@blocktrron
Member

Do you have empirical data on the effectiveness of communicating load information to clients?

I thought a lot about whether this is necessary at tunnel establishment, but I concluded that letting statistics do the work for me should suffice, especially since the selection logic is uniform (well, mostly uniform, depending on the connecting client's version).

This also has the advantage that a client has a statistical chance of establishing a connection to a different peer if it cannot reach the single host your broker now communicates to it. This could be fixed by an ordered list of candidates with a decreasing selection weight. However, it is arguable whether this is something to consider.

Coming from distributing clients on a wireless network, the major benefit I'd extrapolate from such a solution is that it would allow the connection-terminating end to continuously evaluate the need for reselection, especially since judging the effectiveness of target selection depends primarily on bandwidth usage, which is neither known at connection establishment nor static. But this is a different topic. Nevertheless, did you take this into consideration and, if so, did you draw a conclusion on the topic?

@awlx
Member Author

awlx commented Mar 22, 2024

[Image: Screenshot 2024-03-22 at 17 56 44]

This PR is especially meant to avoid situations like the one in the screenshot above, when we do supernode maintenance and all nodes pile up on one GW after the gateways have rebooted.

We have not yet planned a gateway switch after the initial connection.

@T0biii
Contributor

T0biii commented Mar 22, 2024

Another feature is that we can add gateways without new firmware.


# Get the number of configured peers and randomly select one
NUMBER_OF_PEERS=$(uci -q show wireguard | grep -E -ce "peer_[0-9]+.endpoint")
PEER="$(awk -v min=1 -v max="$NUMBER_OF_PEERS" 'BEGIN{srand(); print int(min+rand()*(max-min+1))}')"
Contributor

I think we should still change this for v1, because I recently found out that this random value is only "random" based on the system time, which is why so many clients choose the same gateway from the list. Maybe there is another way to generate the random number without depending on the system time. This shouldn't be a blocker for a merge, though, as v2 offers significantly better load balancing.

Contributor

@grische grische Mar 26, 2024

It seems that our awk implementation of srand() only uses second precision, which is indeed not very helpful:

# i=0; while [[ $i -lt 15 ]]; do date -Ins | tr -d '\n' && echo -n " rand()=" && awk 'BEGIN{srand(); print rand()}'; i=$((i+1)); done
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051
2024-03-26T13:32:18,000000000+01:00 rand()=0.764051

I can confirm this is the same with Ubuntu 22.04 on a desktop system (but it does show nanoseconds on date --iso=ns):

$ i=0; while [[ $i -lt 15 ]]; do date -Ins | tr -d '\n' && echo -n " rand()=" && awk 'BEGIN{srand(); print rand()}'; i=$((i+1)); done
2024-03-26T13:53:25,716105751+01:00 rand()=0.49871
2024-03-26T13:53:25,717883612+01:00 rand()=0.49871
2024-03-26T13:53:25,719541005+01:00 rand()=0.49871
2024-03-26T13:53:25,721102173+01:00 rand()=0.49871
2024-03-26T13:53:25,722682426+01:00 rand()=0.49871
...

As $RANDOM is not available in busybox sh, an alternative would be to use busybox' hexdump + /dev/urandom:

# i=0; while [[ $i -lt 15 ]]; do i=$((i+1)); date -Ins | tr -d '\n' && echo -n " rand()=" && hexdump -n 4 -e '"%u"' </dev/urandom && echo; done
2024-03-26T13:36:00,000000000+01:00 rand()=2986735940
2024-03-26T13:36:00,000000000+01:00 rand()=234726918
2024-03-26T13:36:00,000000000+01:00 rand()=4249724515
2024-03-26T13:36:00,000000000+01:00 rand()=165910888
2024-03-26T13:36:00,000000000+01:00 rand()=2380433412
2024-03-26T13:36:00,000000000+01:00 rand()=3772558562
2024-03-26T13:36:00,000000000+01:00 rand()=277626919
2024-03-26T13:36:00,000000000+01:00 rand()=2142351523
2024-03-26T13:36:00,000000000+01:00 rand()=1096050454
2024-03-26T13:36:00,000000000+01:00 rand()=1464321489
2024-03-26T13:36:00,000000000+01:00 rand()=2153772455
2024-03-26T13:36:00,000000000+01:00 rand()=2246216180
2024-03-26T13:36:00,000000000+01:00 rand()=3724409659
2024-03-26T13:36:00,000000000+01:00 rand()=2829435099
2024-03-26T13:36:00,000000000+01:00 rand()=4106529008
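
As a rough illustration of the hexdump + /dev/urandom idea above, a peer index could be derived as sketched below. This is not necessarily what the follow-up PR does. The function names random_u32 and pick_peer, the example peer count, and the od fallback for systems without hexdump are assumptions made for this sketch.

```sh
#!/bin/sh
# Print a random unsigned 32-bit integer read from /dev/urandom.
# hexdump is the tool proposed in the thread; the od branch is only a
# fallback added for this sketch so it also runs where hexdump is missing.
random_u32() {
    if command -v hexdump >/dev/null 2>&1; then
        hexdump -n 4 -e '"%u"' </dev/urandom
    else
        od -An -N4 -tu4 </dev/urandom | tr -d ' \n'
    fi
}

# Print a peer index in the range 1..$1 (hypothetical helper).
pick_peer() {
    number_of_peers=$1
    echo $(( $(random_u32) % number_of_peers + 1 ))
}

# Example value; the script under review counts the uci peer entries instead.
NUMBER_OF_PEERS=4
PEER=$(pick_peer "$NUMBER_OF_PEERS")
echo "selected peer: $PEER"
```

The modulo introduces a slight bias that should be negligible for a handful of peers. The sketch also assumes that the shell's arithmetic can hold a 32-bit unsigned value.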

Contributor

I opened a separate PR for this: #101

@blocktrron
Member

@awlx Does the loadbalancing solve the issue if the gateways are offline at the moment of re-evaluation anyhow? Or do you advertise gateways which will re-appear soon even if they are currently offline?

Nothing blocking this PR (it's your package anyhow), but I have a hard time understanding how this fixes the problem you describe. The piling up happens because the gateway selection is limited to the time of re-evaluation. But the piling up would also happen with loadbalancing, as gateways which are offline would not be considered as candidates?

@awlx
Member Author

awlx commented Mar 22, 2024

If the gateways are offline for longer than a minute, yes. But we time the reboots so that 3 gateways are always online :).

@blocktrron
Member

Understood. But how does this differ from a random reselection from the gateway list on the device then? This would also give a reasonable chance of each node selecting a different gateway, wouldn't it?

@maurerle
Member

If every node selects a random gateway from a list and all nodes are rebooted, you get a random distribution.

If you have maintenance on one of n gateways, all nodes would still establish a connection to the available n-1 gateways. Once they have established the connection, they do not reconnect, so the last maintained gateway stays with fewer clients.
This can be fixed using the loadbalancing here.
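
The scenario described above can be made concrete with a small sketch. It assumes, purely for illustration, that the broker always hands out the gateway with the fewest connected clients. The actual wgkex policy is not described in this thread, and the gateway names, client counts and function names are made up.

```sh
#!/bin/sh
# Client counts per gateway right after maintenance on gw4 (made-up numbers).
# Format: one "name count" pair per line.
LOADS="gw1 100
gw2 100
gw3 100
gw4 0"

# Print the name of the gateway with the fewest clients.
least_loaded() {
    printf '%s\n' "$1" | sort -k2,2n | head -n 1 | cut -d' ' -f1
}

# Assign one new client to the least-loaded gateway and print the updated table.
assign_client() {
    target=$(least_loaded "$1")
    printf '%s\n' "$1" | awk -v t="$target" '{ if ($1 == t) $2 = $2 + 1; print $1, $2 }'
}

# 40 nodes connect one after another; each is sent to the emptiest gateway.
i=0
while [ "$i" -lt 40 ]; do
    LOADS=$(assign_client "$LOADS")
    i=$((i + 1))
done
printf '%s\n' "$LOADS"
```

Under this assumed policy every newly connecting node lands on the gateway that just came back, whereas a uniform random choice would spread the same 40 nodes over all four gateways and leave the imbalance in place for longer.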

# Parse the returned JSON in a Lua script, returning the endpoint address, port, pubkey and first allowed IP, separated by newlines
if ! data=$(lua /lib/gluon/gluon-mesh-wireguard-vxlan/parse-wgkex-response.lua "$WGKEX_DATA"); then
	logger -p err -t checkuplink "Parsing wgkex broker data failed"
	exit 1
fi
Member

I would perhaps handle the "Parsing wgkex broker data failed" case by falling back to v1 instead of not connecting at all?

Member Author

Sounds like a good idea :).

Member Author

@grische @DasSkelett something to consider?

Contributor

@T0biii T0biii Mar 23, 2024

Contributor

@maurerle thanks a lot, that's a very good idea. I added the patches + some suggestions of mine on top of the current PR.
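
The fallback suggested in this review thread could look roughly like the sketch below. The patches that were actually added are not shown here, so parse_broker_response, select_peer_v1 and choose_peer are hypothetical stand-ins, and the "parser" only checks that the input looks like a JSON object.

```sh
#!/bin/sh
# Hypothetical stand-in for the Lua parser: succeeds only if the
# response looks like a JSON object, otherwise returns non-zero.
parse_broker_response() {
    case "$1" in
        "{"*"}") echo "parsed" ;;
        *) return 1 ;;
    esac
}

# Hypothetical stand-in for the v1 path, which picks a peer from the
# locally configured list instead of asking the broker.
select_peer_v1() {
    echo "v1-local-peer"
}

# Use the broker answer if it parses; otherwise log and fall back to v1.
choose_peer() {
    if data=$(parse_broker_response "$1"); then
        echo "v2:$data"
    else
        # The script under review logs via logger; stderr keeps the sketch portable.
        echo "Parsing wgkex broker data failed, falling back to v1" >&2
        echo "v1:$(select_peer_v1)"
    fi
}

choose_peer '{"placeholder": true}'
choose_peer 'garbage'
```

The point of the design is that a broken or unexpected broker response degrades to the old behaviour instead of leaving the node without an uplink.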

@blocktrron
Member

@maurerle But the loadbalancing does not engage when a connection to one of the remaining gateways has been established:

I feel like I don't see the point everyone else is seeing, but for your scenario to be viable, a node would need to re-evaluate its existing choice, wouldn't it?

@awlx
Member Author

awlx commented Mar 23, 2024

The random distribution is also a problem: when we do maintenance in one PoP, the nodes would try to reach the "deactivated" gateways and take longer to reconnect. With the new approach they don't even see the gateways in maintenance. It's also easier to add or remove gateways.

This is the first iteration of loadbalancing. In the end we also want the nodes to re-evaluate every X minutes whether they are still on the best GW, but this requires quite some tuning, otherwise we get flapping nodes.
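
One common way to keep such a periodic re-evaluation from flapping is a switching threshold (hysteresis): a node only moves if the best alternative is better by a clear margin. The sketch below is not part of this PR. The function name should_switch, the use of client counts as the load metric, and the threshold value are all assumptions for illustration.

```sh
#!/bin/sh
# Succeed (exit status 0) only if switching gateways looks worthwhile.
# $1: load of the current gateway, $2: load of the best alternative,
# $3: minimum difference required before a switch is allowed.
should_switch() {
    current_load=$1
    best_load=$2
    threshold=$3
    [ $((current_load - best_load)) -gt "$threshold" ]
}

# A small difference is ignored, so the node stays where it is.
if should_switch 120 110 25; then echo "switch"; else echo "stay"; fi

# A large difference exceeds the threshold, so the node would switch.
if should_switch 120 40 25; then echo "switch"; else echo "stay"; fi
```

With these example numbers the first call prints "stay" and the second prints "switch". Choosing the threshold and the re-evaluation interval is the tuning effort mentioned above.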

@blocktrron
Member

Thanks for the explanation, it is now clearer to me what you are trying to achieve.

@awlx awlx merged commit 9fc30b8 into freifunk-gluon:master Mar 26, 2024
2 checks passed
grische added a commit to grische/site-ffm that referenced this pull request Mar 27, 2024
The new version of ffmuc-mesh-vpn-wireguard-vxlan supports load-balancing
of clients using wgkex.

For details, see
- freifunkMUC/wgkex#87
- freifunk-gluon/community-packages#100
- freifunk-gluon/community-packages#101
- freifunk-gluon/community-packages#102
github-actions bot pushed a commit to freifunkMUC/site-ffm that referenced this pull request Mar 27, 2024
The new version of ffmuc-mesh-vpn-wireguard-vxlan supports load-balancing
of clients using wgkex.

For details, see
- freifunkMUC/wgkex#87
- freifunk-gluon/community-packages#100
- freifunk-gluon/community-packages#101
- freifunk-gluon/community-packages#102

(cherry picked from commit fc42990)
grische added a commit to grische/community-packages that referenced this pull request Apr 6, 2024
* ffmuc-mesh-vpn-wireguard: Add support for the server-side loadbalancing with wgkex v0.2.0+
---------

Co-authored-by: Grische <[email protected]>
Co-authored-by: DasSkelett <[email protected]>
Co-authored-by: Tobias <[email protected]>
grische added a commit to grische/site-ffm that referenced this pull request Apr 6, 2024
The new version of ffmuc-mesh-vpn-wireguard-vxlan supports load-balancing
of clients using wgkex.

For details, see
- freifunkMUC/wgkex#87
- freifunk-gluon/community-packages#100
- freifunk-gluon/community-packages#101
- freifunk-gluon/community-packages#102