This is a continuation of a bug I've come across with the DAOS filesystem. By the looks of it, UCX does not work behind a NAT.
Steps to Reproduce
Set up a DAOS server with the client hosted on a virt-manager VM that uses a NAT'ed network. The same applies to any other NAT.
What's happening:
While running the command 'daos pool query tank', a UCX connection gets established. Somewhere along the way, UCX tries to negotiate the session to a dynamic port in the higher range. The resulting SYN does not get routed properly by the client's NAT. I know this is technically a NAT config problem, but I was wondering whether this is a known issue and whether someone knows of a workaround or solution that doesn't involve advanced NAT table rules.
I've tried using UCX_TCP_CM_REUSEADDR and UCX_TCP_PORT_RANGE, but neither helps in this scenario.
Setup and versions
debian 12
ucx+tcp
daos 2.4
Additional information (depending on the issue)
Attached pcap output (packets.zip).
Also:
[1701709287.814849] [elster-storage:3353 :a] tcp_sockcm_ep.c:1124 UCX DEBUG server created an endpoint on tcp_sockcm 0x7f918402f8c0 id: -1 state: 1
[1701709287.814853] [elster-storage:3353 :a] async.c:230 UCX DEBUG added async handler 0x7f9174001a90 [id=723 ref 1] uct_tcp_sa_data_handler() to hash
[1701709287.814860] [elster-storage:3353 :a] async.c:508 UCX DEBUG listening to async event fd 723 events 0x5 mode thread_spinlock
[1701709287.815390] [elster-storage:3353 :a] sock.c:967 UCX DEBUG check ifname for socket on 10.35.0.110:0
[1701709287.815498] [elster-storage:3353 :a] sock.c:985 UCX DEBUG matching ip found iface on eno1
[1701709287.815503] [elster-storage:3353 :a] tcp_sockcm_ep.c:648 UCX DEBUG fd 723: remote_data: (field_mask=15) dev_addr: <invalid address family> (length=6), conn_priv_data_length=47
[1701709287.815505] [elster-storage:3353 :a] wireup_cm.c:1130 UCX DEBUG server received a connection request on the rdmacm sockaddr transport (worker=0x7f9184039eb0 cm=0x7f918402f8c0 worker_cms_index=0)
[1701709287.815529] [elster-storage:3353 :1] ucp_ep.c:354 UCX DEBUG created ep 0x7f918804b000 to <no debug data> conn_request on uct_listener
[1701709287.815609] [elster-storage:3353 :1] wireup.c:1071 UCX DEBUG ep 0x7f918804b000: am_lane 1 wireup_msg_lane 1 cm_lane 0 keepalive_lane <none> reachable_mds 0x1
[1701709287.815613] [elster-storage:3353 :1] wireup.c:1094 UCX DEBUG ep 0x7f918804b000: lane[0]: cm <unknown>
[1701709287.815617] [elster-storage:3353 :1] wireup.c:1094 UCX DEBUG ep 0x7f918804b000: lane[1]: 0:tcp/eno1.0 md[0] -> addr[0].md[0]/tcp/sysdev[255] rma_bw#0 am am_bw#0 wireup
[1701709287.815620] [elster-storage:3353 :1] tcp_ep.c:259 UCX DEBUG tcp_ep 0x7f9184068190: created on iface 0x7f91840376b0, fd -1
[1701709287.815623] [elster-storage:3353 :1] wireup_ep.c:543 UCX DEBUG ep 0x7f918804b000: wireup_ep 0x7f9184296c70 created next_ep 0x7f9184068190 to <no debug data> using tcp/eno1
[1701709287.816574] [elster-storage:3353 :1] tcp_cm.c:96 UCX DEBUG tcp_ep 0x7f9184068190: CLOSED -> CONNECTING for the [10.35.0.110:60099]<->[192.168.50.111:40695]:125 connection [-:Rx]
elster-storage INFO 2023/12/04 09:01:27 daos_engine:0 [1701709287.816587] [elster-storage:3353 :1] tcp_cm.c:96 UCX DEBUG tcp_ep 0x7f9184068190: CONNECTING -> CONNECTING for the [10.35.0.110:60099]<->[192.168.50.111:40695]:125 connection [-:Rx]
[1701709287.818856] [elster-storage:3353 :1] sock.c:325 UCX ERROR connect(fd=724, dest_addr=192.168.50.111:40695) failed: Connection refused
elster-storage INFO 2023/12/04 09:01:27 daos_engine:0 [1701709287.818864] [elster-storage:3353 :1] wireup_cm.c:1239 UCX WARN server ep 0x7f918804b000 failed to connect to remote address on device eno1, tl_bitmap 0x1 0x0, status Destination is unreachable
[1701709287.818883] [elster-storage:3353 :1] async.c:155 UCX DEBUG removed async handler 0x7f9174001a90 [id=723 ref 1] uct_tcp_sa_data_handler() from hash
[1701709287.818888] [elster-storage:3353 :1] async.c:561 UCX DEBUG removing async handler 0x7f9174001a90 [id=723 ref 1] uct_tcp_sa_data_handler()
[1701709287.818894] [elster-storage:3353 :1] async.c:170 UCX DEBUG release async handler 0x7f9174001a90 [id=723 ref 0] uct_tcp_sa_data_handler()
[1701709287.818908] [elster-storage:3353 :1] ucp_ep.c:1209 UCX DEBUG ep 0x7f918804b000: destroy
[1701709287.818909] [elster-storage:3353 :1] ucp_ep.c:1459 UCX DEBUG ep 0x7f918804b000: cleanup lanes
[1701709287.818911] [elster-storage:3353 :1] ucp_ep.c:1469 UCX DEBUG ep 0x7f918804b000: pending & destroy uct_ep[1]=0x7f9184296c70
[1701709287.818914] [elster-storage:3353 :1] wireup_ep.c:471 UCX DEBUG ep 0x7f918804b000: destroy wireup ep 0x7f9184296c70
[1701709287.818916] [elster-storage:3353 :1] ucp_ep.c:1267 UCX DEBUG ep 0x7f918804b000: unprogress iface 0x7f91840376b0 tcp/eno1
[1701709287.819885] [elster-storage:3353 :1] tcp_ep.c:358 UCX DEBUG tcp_ep 0x7f9184068190: purge outstanding operations with status Request canceled
[1701709287.819895] [elster-storage:3353 :1] tcp_cm.c:96 UCX DEBUG tcp_ep 0x7f9184068190: CONNECTING -> CLOSED for the [10.35.0.110:60099]<->[192.168.50.111:40695]:125 connection [-:-]
[1701709287.819897] [elster-storage:3353 :1] tcp_ep.c:408 UCX DEBUG tcp_ep 0x7f9184068190: destroyed on iface 0x7f91840376b0
^ Notice how, after the successful UCX connection with 10.35.0.110:31416, at packet #32 10.35.0.110 tries to open a new connection to port 57213 on the client's machine. The client's NAT does not route this connection.
What could be going on here, and what would be the correct approach to solving this issue? Any advice on NAT config is also appreciated, though ideally I'd love to solve this with as little NAT configuration as possible.
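For anyone trying to reproduce this without DAOS in the loop, here is a minimal sketch of a TCP probe that mimics the connect-back seen in the log above (`connect(fd=724, dest_addr=192.168.50.111:40695) failed: Connection refused`). The helper name `probe_tcp` and the 2-second timeout are my own choices for illustration and are not part of UCX or DAOS. It only checks whether a plain TCP connect to a given address succeeds, is refused, or times out.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt a single TCP connect and classify the outcome.

    Hypothetical helper for illustration; returns one of
    "open", "refused", "timeout", or "error: <ERRNO NAME>".
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # Something answered with RST, or an ICMP error was mapped to ECONNREFUSED.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        # Any other socket error, e.g. EHOSTUNREACH or ENETUNREACH.
        return f"error: {errno.errorcode.get(exc.errno, exc.errno)}"
    finally:
        sock.close()
```

Running something like `probe_tcp("192.168.50.111", 40695)` on the server host while the client is listening should show whether the client's private address is reachable from the server side at all. The port is taken from the log above and will differ between runs.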
Thank you very much for your time.
Since all engines need to be able to communicate, the different network interfaces must be on the same subnet, or you must configure routing across the different subnets.
So just make sure that everyone is seeing each other.
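As a quick sanity check of the "same subnet" condition, a small sketch using Python's standard `ipaddress` module is shown here. The helper name `same_subnet` is made up for illustration, and the /24 prefix length in the usage note is an assumption, since the actual netmasks are not given in this thread.

```python
import ipaddress


def same_subnet(addr_a: str, addr_b: str, prefix_len: int) -> bool:
    """Return True if addr_b falls inside addr_a's network.

    Hypothetical helper; prefix_len is the netmask length of addr_a's
    interface and must be supplied by the caller.
    """
    network = ipaddress.ip_interface(f"{addr_a}/{prefix_len}").network
    return ipaddress.ip_address(addr_b) in network
```

With the two addresses from the log and an assumed /24 netmask, `same_subnet("10.35.0.110", "192.168.50.111", 24)` returns `False`, so the server needs an explicit route (or a NAT rule) to reach the client's address.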