Support UDP/TCP port fowarding to a host without setting up a tun #1179

Open
wants to merge 35 commits into
base: master

@cre4ture cre4ture commented Jul 14, 2024

This is intended to implement #1014

Build-Status twin PR: cre4ture#1

How to use it is described in the example config.yml:

# By using port forwarding (port tunnels) it's possible to establish connections
# from/into the nebula network without using a tun/tap device and thus without requiring root access
# on the host. Port forwarding is only supported when "tun.user" is set to true and thus
# a user-tun is used instead of a real one.
# IMPORTANT: For incoming tunnels, don't forget to also open the firewall for the relevant ports.
port_forwarding:
  outbound:
  # format of local and remote address: <host/ip>:<port>
  #- local_address: 127.0.0.1:3399
  #  remote_address: 192.168.100.92:4499
     # format of protocols lists (yml-list): [tcp], [udp], [tcp, udp]
  #  protocols: [tcp, udp]
  inbound:
  # format of forward_address: <host/ip>:<port>
  #- port: 5599
  #  forward_address: 127.0.0.1:5599
     # format of protocols lists (yml-list): [tcp], [udp], [tcp, udp]
  #  protocols: [tcp, udp]
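
To make the `outbound` entry concrete, here is a minimal sketch of what forwarding a local TCP port to a remote address amounts to. It uses only the Go standard library and plain `net.Dial`; the PR itself dials through the gvisor userspace stack, which is not shown. The names `forwardTCP` and `roundTrip` are hypothetical and not taken from the PR, and the echo server only stands in for a service on the remote nebula host.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net"
)

// forwardTCP accepts connections on ln and relays each one to remoteAddr.
// Assumption: a real implementation would dial through the userspace
// network stack instead of net.Dial.
func forwardTCP(ln net.Listener, remoteAddr string) {
	for {
		local, err := ln.Accept()
		if err != nil {
			return // listener was closed
		}
		go func() {
			defer local.Close()
			remote, err := net.Dial("tcp", remoteAddr)
			if err != nil {
				log.Printf("dial %s: %v", remoteAddr, err)
				return
			}
			defer remote.Close()
			// Relay both directions; stop as soon as one side finishes.
			done := make(chan struct{}, 2)
			go func() { io.Copy(remote, local); done <- struct{}{} }()
			go func() { io.Copy(local, remote); done <- struct{}{} }()
			<-done
		}()
	}
}

// roundTrip sends msg through a forwarder to a loopback echo server
// and returns what comes back.
func roundTrip(msg string) (string, error) {
	echo, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer echo.Close()
	go func() {
		for {
			c, err := echo.Accept()
			if err != nil {
				return
			}
			go func() { defer c.Close(); io.Copy(c, c) }()
		}
	}()

	fwd, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}
	defer fwd.Close()
	go forwardTCP(fwd, echo.Addr().String())

	conn, err := net.Dial("tcp", fwd.Addr().String())
	if err != nil {
		return "", err
	}
	defer conn.Close()
	if _, err := conn.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(conn, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("ping")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(got)
}
```

In terms of the config above, `fwd` plays the role of `local_address` and the echo server's address plays the role of `remote_address`.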

So far I have only done some basic manual tests with netcat. They resulted in the expected behavior.
Further tests were done by @sybrensa and also by myself.

These points are still open:

  • namings like "tunnel" vs. "forwarding", "ingoing" vs. "inbound", ... Please help me in finding the right terminology as I'm not an expert in this topic.
  • mutexes needed? I honestly didn't yet explicitly check for potential concurrent accesses. This was a TODO, now DONE.
  • clean shutdown - Currently it gives some errors when shutting down. This was also a TODO, now DONE.
  • memory leaks - I think DONE.
  • performance tunings - DONE, but could be more
  • 2nd iteration of performance tunings - DONE, but could be more
  • reloadable (SIGHUP) port forwarding configuration

I did file copy tests. What I got was 1/2 the rate and around 4 times more CPU usage compared to the case with kernel-tun.
I tried multiple different improvements on my code, but I fear it's actually more a limitation of the gvisor netstack :-/

Any help or idea is welcome.

Update: My file copy tests now achieve around 90% of the speed at 160% of the CPU load of nebula. This is a significant improvement and acceptable (at least for my use cases). Thanks @akernet for the support here.

Thanks for the contribution! Before we can merge this, we need @cre4ture to sign the Salesforce Inc. Contributor License Agreement.

akernet commented Jul 21, 2024

Thanks a ton for taking this project on!

I took a stab at this before finding your changes here, it's here if you want to take a look.

In particular e764eb adds a script that sets up a two device loopback tunnel (without root) and runs an iperf3 speed test.

I was able to improve throughput by ~7x by adding some buffering to the UserDevice (a16fdb) which you might want to try too. With this change it is running at 1.6 Gbit/s with a 1300 MTU which makes it quite usable to me at least.

I still think there are several improvements left on the table:

  • Re-use allocations in this buffering by using something like sync.Pool,
  • Or even better, avoid these copies all together, although that would require some bigger changes in the rest of Nebula.
  • Depending on how the internal locking of gvisor works you could probably gain some by running several net stacks in parallel that handle their individual set of connections (distributing packets based on their source address+destination port). This would not improve single socket connections though.
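
The first suggestion can be sketched as follows. This is only an illustration of the `sync.Pool` pattern for packet buffers, not the code from either branch; `packetPool`, `copyPacket` and `release` are hypothetical names, and the buffer size simply reuses the 1300 MTU mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

const mtu = 1300 // MTU quoted in the comment above

// packetPool hands out reusable packet buffers so that the buffering layer
// does not allocate a fresh slice for every packet.
var packetPool = sync.Pool{
	New: func() any {
		b := make([]byte, mtu)
		return &b // pool pointers so that Put itself does not allocate
	},
}

// copyPacket copies pkt into a pooled buffer and returns the buffer
// together with the number of valid bytes in it.
func copyPacket(pkt []byte) (*[]byte, int) {
	bp := packetPool.Get().(*[]byte)
	n := copy(*bp, pkt)
	return bp, n
}

// release hands the buffer back to the pool once the consumer is done.
func release(bp *[]byte) {
	packetPool.Put(bp)
}

func main() {
	bp, n := copyPacket([]byte("hello"))
	fmt.Printf("%q\n", (*bp)[:n])
	release(bp)
}
```

The caller must not touch the buffer after `release`, which is the main cost of this pattern compared to letting the garbage collector handle each packet.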

@cre4ture
Author

Thanks a ton for taking this project on!

I took a stab at this before finding your changes here, it's here if you want to take a look.

In particular e764eb adds a script that sets up a two device loopback tunnel (without root) and runs an iperf3 speed test.

I was able to improve throughput by ~7x by adding some buffering to the UserDevice (a16fdb) which you might want to try too. With this change it is running at 1.6 Gbit/s with a 1300 MTU which makes it quite usable to me at least.

I still think there are several improvements left on the table:

* Re-use allocations in this buffering by using something like `sync.Pool`,

* Or even better, avoid these copies all together, although that would require some bigger changes in the rest of Nebula.

* Depending on how the internal locking of gvisor works you could probably gain some by running several net stacks in parallel that handle their individual set of connections (distributing packets based on their source address+destination port). This would not improve single socket connections though.

Hey, cool. I could use some support, especially regarding performance and automated testing, but also with other Go specifics, as I'm rather inexperienced there.

I tried to do some performance profiling with the help of pprof.
The results point me to problems in gvisor itself, as far as I can read from this diagram:
(image: pprof diagram)

It seems that it's doing some dynamic memory allocation. But from a first glance at the relevant source code I could not see the reason for it. Do you have an idea? I'm actually already failing to upgrade gvisor to the latest version. Something is wrong with it such that it complains about multiple packages in a directory. But how can this have gone undiscovered by the maintainers of gvisor? I have the feeling that I'm doing something wrong... :-/

I will for sure try your idea with the buffered queues.
I didn't yet look in detail into your branch. But it seems that we had some similar approach using the gvisor stack.
I would be glad if you could do a review of my code. It seems that the maintainers of the repo were so far busy with other stuff. ;-)

@johnmaguire
Collaborator

Hi @cre4ture - this PR looks very interesting! I wanted to let you know that the maintainers best suited to review this PR are currently working on #6 which requires some major rework of Nebula, so it will probably be a bit before we are able to dig in. That being said, thanks for the contribution!

@cre4ture
Author

cre4ture commented Jul 22, 2024

@akernet I added your test and afterwards also the performance improvement via cherry-picking. Hope this is fine for you :-)

Your performance improvement led to a 6x improvement for the test in your test-script commit on my hardware. I will do a further test where I use this on a real-world example (the file-copy test with two separate machines) to confirm this improvement. I'm a bit concerned because the pprof profiling pointed me in a different direction. At least this is how I interpret it right now. I will let you know.

@cre4ture
Author

cre4ture commented Jul 22, 2024

@akernet

It seems to me that the improvement with the buffered pipes has no impact on my testing scenario with the file copy between two different machines. I'm using a "private cloud" server storage called "garage" mounted via "rclone". The two machines are my laptop and a NUC PC ("server"), both connected via ethernet cable (1 Gbit/s) to my local network. I use the official release binary on the laptop and, for testing purposes, exchange the binary on the server side. I measure the throughput rate (nautilus file copy) on the laptop and the CPU load on the server side. These are my results:

1. test-case: official released binary - kernel tun:
95.5 MB/second
CPU load on server side:
88 % nebula
71 % garage

2. test-case: local build with buffered pipes - gvisor tun+stack:
52.8 MB/second
CPU load server side:
197 % nebula
34 % garage

3. test-case: local build no buffered pipes - gvisor tun+stack:
52.3 MB/second
CPU load server side:
195 % nebula
35 % garage

4. test-case: local build with the buffered pipes - but kernel tun (as in 1. test-case):
99.5 MB/second
94% nebula
72% garage

The last test is meant to demonstrate that the locally compiled binary has performance comparable to the release binary.
What exactly this means is not yet clear to me. It seems that the buffered pipes do not have a significant impact for this test scenario. But I can't reason about it.

In general, the userspace tun/stack has an impact of about a factor of 4, which results from 2x the CPU load and 1/2 the throughput rate put together.
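
The factor of 4 can be checked from the numbers above by comparing the CPU load spent per MB/s of throughput in test case 1 and test case 2. The sketch below does only that arithmetic; `cpuPerMB` is a hypothetical helper, and "percent of one core per MB/s" is just one reasonable way to combine the two measurements.

```go
package main

import "fmt"

// cpuPerMB returns the CPU load (in percent of one core) spent per MB/s
// of throughput.
func cpuPerMB(cpuPercent, mbPerSec float64) float64 {
	return cpuPercent / mbPerSec
}

func main() {
	kernelTun := cpuPerMB(88, 95.5) // test case 1: release binary, kernel tun
	userTun := cpuPerMB(197, 52.8)  // test case 2: gvisor tun+stack, buffered pipes
	fmt.Printf("kernel tun: %.2f, gvisor: %.2f, ratio: %.1f\n",
		kernelTun, userTun, userTun/kernelTun)
}
```

The resulting ratio lands slightly above 4, consistent with the estimate of 2x CPU load combined with half the throughput.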

@akernet

akernet commented Jul 23, 2024

I would be glad if you could do a review of my code. It seems that the maintainers of the repo were so far busy with other stuff. ;-)

I'm also new to go but I'll try to find some time in the next days! :)

I added your test and afterwards also the performance improvement via cherry-picking. Hope this is fine for you :-)

Ofc!

I'm using a "private cloud" server storage called "garage"

Cool, I've been looking at garage for the past weeks so nice to see that you are using it with Nebula!

It seems that the buffered pipes do not have a significant impact for this test-scenarios. But I can't reason about it.

Yeah this is a bit strange. How is the system load overall, is it close to max? The buffering should not make Nebula more efficient when it comes to CPU, in fact it's probably gonna be more costly due to the copying. What it does is decoupling the outside and inside (gvisor) parts, allowing them to run in parallel. If the system is at limits already this is unlikely to help, however. I'll try to do some benchmarking in CPU limited scenarios too, gvisor is always gonna use some additional resources but the 66% of your pprof run seems a bit high.
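
The decoupling described here can be illustrated with a buffered channel between the two sides. This is a simplified sketch, not the actual UserDevice change: `pipeline` is a hypothetical function, the consumer only counts bytes, and the per-packet copy is the extra cost mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

// pipeline moves packets from an "outside" producer to an "inside" consumer
// through a channel of the given capacity and returns the bytes consumed.
// With capacity > 0 the producer can run ahead of the consumer instead of
// handing over one packet at a time.
func pipeline(packets [][]byte, capacity int) int {
	queue := make(chan []byte, capacity)
	var wg sync.WaitGroup
	total := 0

	wg.Add(1)
	go func() { // inside part: stands in for the userspace network stack
		defer wg.Done()
		for p := range queue {
			total += len(p)
		}
	}()

	for _, p := range packets { // outside part: stands in for the UDP reader
		cp := make([]byte, len(p)) // the copy that makes buffering cost CPU
		copy(cp, p)
		queue <- cp
	}
	close(queue)
	wg.Wait()
	return total
}

func main() {
	fmt.Println(pipeline([][]byte{[]byte("ab"), []byte("c")}, 64))
}
```

If all cores are already busy, the two sides cannot actually run in parallel, which matches the expectation that buffering does not help on a saturated system.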

@cre4ture
Author

cre4ture commented Jul 26, 2024

Update regarding performance:
Yesterday I could achieve a significant improvement for my filecopy test-scenario:

"performance 5":

89.8 MB/second
cpu-load server side:
163 % CPU nebula
63 % CPU garage

This brings the copy-speed to almost the reference (99MB/second) - only 10% difference remaining.
The CPU-load of nebula is only + ~60% now.

So overall we are at around 175% of the reference. Which is significant compared to the initial ~ 400%.

@akernet I think the main difference comes from a further improvement of your performance tuning "buffering to UserDevice". I achieved it by additionally avoiding data-copy and dynamic allocation steps (I think).

@akernet akernet left a comment

Really nice! Cool that you got the buffer reuse to work

cmd/nebula/main.go Outdated
// It's there to avoid a fresh dynamic memory allocation of 1 byte
// each time it's used.
var BYTE_SLICE_ONE []byte = []byte{1}

I would suggest moving this to a separate PR since it affects normal configs too

examples/config.yml Outdated
examples/config.yml Outdated
examples/config.yml Outdated
port-forwarder/fwd_tcp.go Outdated
port-forwarder/fwd_tcp.go Outdated
default:
}

rn, r_err := from.Read(buf)

I wonder if this could be replaced with io.Copy()

Author

It can. I tried this once, but it didn't improve the performance. That's why I went on with other experiments. But now that we have solved the issue to a large extent, I will introduce it again, as it simplifies the code. Thanks for pointing it out.

Author

I added a variant with io.Copy().
Problem: It seems to slightly decrease performance. So I'm wondering whether I should rather keep the original implementation.

port-forwarder/fwd_tcp.go Outdated
port-forwarder/fwd_tcp.go Outdated
@cre4ture cre4ture force-pushed the feature/try_with_gvisor_stack branch from e2523ca to 08c3922 Compare July 30, 2024 21:54

Thanks for the contribution! Before we can merge this, we need @akernet to sign the Salesforce Inc. Contributor License Agreement.

@cre4ture cre4ture force-pushed the feature/try_with_gvisor_stack branch 3 times, most recently from 9718675 to 3a03eb9 Compare August 2, 2024 21:34
@cre4ture cre4ture requested a review from akernet August 3, 2024 09:48
@cre4ture
Author

cre4ture commented Aug 3, 2024

@akernet please re-review. It also seems that you need to sign the CLA, as I cherry-picked from your branch to honor your contribution. :-)

@cre4ture cre4ture force-pushed the feature/try_with_gvisor_stack branch from 5be9458 to 24169d7 Compare August 5, 2024 21:57
@johnmaguire
Collaborator

@cre4ture I'm not sure off-hand. But the fact that it succeeded on most platforms and failed on one again makes me wonder if it's a flaky test. I probably would've tried re-running the CI w/o changes to see whether it failed on a repeat run. If not, then we need to think about whether there's any timing issue that could affect that test. (I haven't looked at the test code really, but just as an example, maybe the updated CA bundle hasn't finished writing to disk when the test starts. Or maybe there was a silent error doing so?)

@johnmaguire
Collaborator

FYI we recently merged #1181 which caused conflicts. One conflict is caused by returning *gonet.TCPConn in lieu of net.Conn. Is it necessary to return the former, or can we continue to return the interface? Thanks!

…isor_stack

# Conflicts:
#	examples/go_service/main.go
#	service/service.go
@cre4ture
Author

FYI we recently merged #1181 which caused conflicts. One conflict is caused by returning *gonet.TCPConn in lieu of net.Conn. Is it necessary to return the former, or we can continue to return the interface? Thanks!

I merged the changes from main. It seems that the interface is OK.
Please tell me if you prefer a rebase, and/or a squash commit/merge.
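
The trade-off behind this question can be sketched with the standard library alone: a function can keep returning the `net.Conn` interface, and callers that need something extra can ask for it with a type assertion. In the sketch, `*net.TCPConn` stands in for `*gonet.TCPConn`; whether the gvisor type offers the same `CloseWrite` method is an assumption that is not checked here, and `dialService`, `halfCloser`, `closeWrite` and `demo` are hypothetical names.

```go
package main

import (
	"fmt"
	"log"
	"net"
)

// dialService returns the net.Conn interface; the concrete type behind it
// stays an implementation detail.
func dialService(addr string) (net.Conn, error) {
	return net.Dial("tcp", addr)
}

// halfCloser is the extra capability a caller might want beyond net.Conn.
type halfCloser interface {
	CloseWrite() error
}

// closeWrite half-closes c if its concrete type supports it.
func closeWrite(c net.Conn) (supported bool, err error) {
	hc, ok := c.(halfCloser)
	if !ok {
		return false, nil
	}
	return true, hc.CloseWrite()
}

// demo connects to a loopback listener and tries a half-close.
func demo() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()

	client, err := dialService(ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer client.Close()

	server, err := ln.Accept() // the kernel has already completed the handshake
	if err != nil {
		return false, err
	}
	defer server.Close()

	return closeWrite(client)
}

func main() {
	ok, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("half-close supported:", ok)
}
```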

udp/udp_linux.go Outdated
@@ -315,6 +321,10 @@ func (u *StdConn) getMemInfo(meminfo *[unix.SK_MEMINFO_VARS]uint32) error {

func (u *StdConn) Close() error {
//TODO: this will not interrupt the read loop

@cre4ture cre4ture Sep 14, 2024

@johnmaguire I think I have the test stable now, at least on Linux. The issue was an unclean shutdown of the nebula service at the end of each test, which caused confusion in the following tests.

The change here in this file solved the issue.
I will check over the next few days whether there are some sleeps to clean up, now that I have found the issue.
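
The general pattern behind a clean shutdown of a read loop is: close the socket so the blocked read returns, then wait for the loop to exit before returning from `Close`. The sketch below shows this with Go's standard `net.UDPConn`, where closing the connection does unblock a pending read. It is not the change made to `udp_linux.go`, whose diff is not shown in full here; the `listener` type and `start` function are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net"
	"sync"
)

// listener owns a UDP socket and the goroutine that reads from it.
type listener struct {
	conn *net.UDPConn
	wg   sync.WaitGroup
}

// start opens a loopback UDP socket on an ephemeral port and starts reading.
func start() (*listener, error) {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1)})
	if err != nil {
		return nil, err
	}
	l := &listener{conn: conn}
	l.wg.Add(1)
	go l.readLoop()
	return l, nil
}

// readLoop blocks in ReadFromUDP until a packet arrives or the socket closes.
func (l *listener) readLoop() {
	defer l.wg.Done()
	buf := make([]byte, 1300)
	for {
		if _, _, err := l.conn.ReadFromUDP(buf); err != nil {
			if !errors.Is(err, net.ErrClosed) {
				log.Printf("read: %v", err)
			}
			return
		}
	}
}

// Close closes the socket, which unblocks the pending read, and then waits
// until the read loop has really exited.
func (l *listener) Close() error {
	err := l.conn.Close()
	l.wg.Wait()
	return err
}

func main() {
	l, err := start()
	if err != nil {
		log.Fatal(err)
	}
	if err := l.Close(); err != nil {
		log.Fatal(err)
	}
	fmt.Println("read loop stopped")
}
```

Because `Close` only returns after the loop has exited, a following test cannot race against a reader left over from the previous one.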

@cre4ture cre4ture force-pushed the feature/try_with_gvisor_stack branch from ed51dd9 to 84d1a26 Compare September 15, 2024 09:19
return s.eg.Wait()
err := s.eg.Wait()

s.ipstack.Destroy()

@cre4ture cre4ture Sep 15, 2024

Adding this line made the Windows tests stable. They now run 200+ times in a row without issue.

@@ -323,6 +329,14 @@ func (u *RIOConn) Close() error {
windows.PostQueuedCompletionStatus(u.rx.iocp, 0, 0, nil)
windows.PostQueuedCompletionStatus(u.tx.iocp, 0, 0, nil)

u.rx.mu.Lock() // for waiting till active reader is done
Author

I had to run the test 1000 times to achieve stable reproduction on my Windows 11 machine. It is fixed with this change.
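
The idea of taking the reader's mutex in `Close` to wait for an active reader can be sketched in isolation. The example below is a simplified model, not the `RIOConn` code: a channel stands in for the completion-port wake-up, and `rxRing`, `read` and `released` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// rxRing models a receive path whose reader holds mu for the whole read.
type rxRing struct {
	mu       sync.Mutex
	wake     chan struct{} // stand-in for the wake-up posted in Close
	packets  chan []byte
	released bool // guarded by mu
}

func newRxRing() *rxRing {
	return &rxRing{wake: make(chan struct{}), packets: make(chan []byte)}
}

// read blocks until a packet arrives or the ring is woken up for shutdown.
// It reports false once the ring is shutting down or already released.
func (r *rxRing) read() ([]byte, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.released {
		return nil, false
	}
	select {
	case p := <-r.packets:
		return p, true
	case <-r.wake:
		return nil, false
	}
}

// Close wakes a blocked reader, then takes mu, which only succeeds after
// the active reader has returned. Resources are released under the lock.
// Close must be called at most once.
func (r *rxRing) Close() {
	close(r.wake)
	r.mu.Lock()
	r.released = true
	r.mu.Unlock()
}

func main() {
	r := newRxRing()
	done := make(chan struct{})
	go func() {
		defer close(done)
		for {
			if _, ok := r.read(); !ok {
				return
			}
		}
	}()
	r.Close()
	<-done
	fmt.Println("reader stopped, resources released")
}
```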

@cre4ture cre4ture force-pushed the feature/try_with_gvisor_stack branch from a2cbb62 to cd510b3 Compare September 16, 2024 21:08
@cre4ture
Author

@johnmaguire after extensive testing and a few findings I dare to claim that the tests are now stable on Windows and Linux. Can you please approve another workflow run on this PR? And have a look at what is still to be done to merge this?

@ExplodingDragon

@cre4ture Any new changes? Currently it seems to be incompatible with the master branch.

…isor_stack

# Conflicts:
#	service/service_test.go
@cre4ture
Author

cre4ture commented Dec 10, 2024

@ExplodingDragon thanks for the notification. I merged the changes in. I think the PR is ready for review. It's not clear when the Maintainers will find time for it.

@maggie44

Is it possible to also add the ability to map two cidrs against each other?

  • local_cidr: 10.0.0.0/24
    remote_cidr: 192.168.100.0/24

@cre4ture
Author

Is it possible to also add the ability to map two cidrs against each other?

* local_cidr: 10.0.0.0/24
  remote_cidr: 192.168.100.0/24

Hello maggie44,

thanks for your interesting question. I can't really answer that without further studying the documentation myself.
If nebula supports it, then it's probably only possible with a real tun and thus root access.
I think this would be a good starting point for you:
https://nebula.defined.net/docs/config/tun/#tunroutes

hope this helps you a step forward.

@maggie44

maggie44 commented Dec 13, 2024

I meant in relation to this PR. Based on the example in the initial post it is configured like this:

port_forwarding:
  outbound:
  # format of local and remote address: <host/ip>:<port>
  #- local_address: 127.0.0.1:3399
  #  remote_address: 192.168.100.92:4499
     # format of protocols lists (yml-list): [tcp], [udp], [tcp, udp]
  #  protocols: [tcp, udp]

I was querying whether it could also be configured like this:

port_forwarding:
  outbound:
  #- local_address: 10.0.0.0/24:3399
  #  remote_address: 192.168.100.0/24:4499
     # format of protocols lists (yml-list): [tcp], [udp], [tcp, udp]
  #  protocols: [tcp, udp]

But I see now that the local address would need an interface? Which would defeat the object of removing the TUN device.

The goal is to be able to map a lot of devices to the local ports or address without TUN, and without having super long lists of IPs in the config file.

@maggie44

maggie44 commented Dec 13, 2024

Best I can think of right now would be to have TCP forwarding from a local IP, like 127.0.0.1:3399, routed through to an IP based on path. This is very much an API related feature. For example:

http://127.0.0.1:3399/192.168.100.0

Where 192.168.100.0 is a path. Traffic routes in to 127.0.0.1:3399 which avoids requiring the TUN, then based on path (192.168.100.0) proxies traffic to the required host. That would allow opening TCP connections to an unlimited amount of connected hosts (I'm intentionally avoiding talking about UDP in relation to this, as the paths wouldn't be compatible).

Or for port compatibility:

http://127.0.0.1:3399/proxy?target=192.168.100.0:4040

Straying off the original intent of this PR though, initially I was thinking of it only as a CIDR mapping which would have been a smaller change. Looking at the Service function example there might already be the ability to do this anyway.
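
The query-parameter variant of this idea could start from something like the sketch below, which only extracts and validates the target from a request URL; the actual proxying through nebula is not shown and is not part of this PR. `targetFromURL` is a hypothetical helper, and restricting the host to an IP literal is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net"
	"net/url"
)

// targetFromURL extracts the "target" query parameter and checks that it
// has the form <ip>:<port>.
func targetFromURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	target := u.Query().Get("target")
	if target == "" {
		return "", errors.New("missing target parameter")
	}
	host, port, err := net.SplitHostPort(target)
	if err != nil {
		return "", err
	}
	if net.ParseIP(host) == nil {
		return "", fmt.Errorf("target host %q is not an IP address", host)
	}
	return net.JoinHostPort(host, port), nil
}

func main() {
	target, err := targetFromURL("http://127.0.0.1:3399/proxy?target=192.168.100.0:4040")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(target)
}
```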
