prov/tcp: introduce TCP_NO_CONNECT flag #10534

Draft: wants to merge 1 commit into base v1.22.x

5 changes: 5 additions & 0 deletions include/rdma/fi_ext.h
@@ -78,6 +78,11 @@ enum {
 	FI_OPT_EFA_WRITE_IN_ORDER_ALIGNED_128_BYTES,	/* bool */
 };

+/* provider specific op flags range between 60-63 */
+enum {
+	FI_TCP_NO_CONNECT = (1ULL << 60)
+};
+
 struct fi_fid_export {
 	struct fid **fid;
 	uint64_t flags;
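
As a side note on that 60-63 range: a compile-time guard (not part of this PR, just a sketch) can document the constraint that provider-specific op flags stay inside the reserved bits:

```c
#include <assert.h>
#include <rdma/fi_ext.h>

/* Hypothetical guard, not in the PR: provider-specific op flags must
 * stay within the reserved bits 60-63. */
static_assert((FI_TCP_NO_CONNECT & ~(0xFULL << 60)) == 0,
	      "FI_TCP_NO_CONNECT must lie in the provider-specific range");
```
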
9 changes: 9 additions & 0 deletions man/fi_tcp.7.md
@@ -32,6 +32,15 @@ The following features are supported
 *Shared Rx Context*
 : The tcp provider supports shared receive context

+# PROVIDER SPECIFIC OPERATION FLAGS
+: The tcp provider supports the following op flags
+
+*FI_TCP_NO_CONNECT*
+: This flag indicates that operations should fail if there is no
+  existing connection to the remote peer. In that case, an FI_ENOTCONN
+  error should be expected.
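
For illustration only (not part of the PR or the man page): one way an application might request this behavior on a single operation is through the msg-variant calls, which take per-operation flags; this matches how the PR's sendmsg/writemsg paths pass their flags argument through to xnet_get_conn. The buffer, descriptor, and helper names below are placeholders.

```c
/* Sketch: fail instead of connecting, on a single send.  ep, buf, len,
 * desc, and dest_addr are assumed to be set up elsewhere;
 * handle_unreachable_peer() is a hypothetical application helper. */
struct iovec iov = { .iov_base = buf, .iov_len = len };
struct fi_msg msg = {
	.msg_iov = &iov,
	.desc = &desc,
	.iov_count = 1,
	.addr = dest_addr,
};

ssize_t ret = fi_sendmsg(ep, &msg, FI_TCP_NO_CONNECT);
if (ret == -FI_ENOTCONN)
	handle_unreachable_peer(dest_addr);
```
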
Member:

I would not make this a flag that's checked on every operation (i.e. it's okay to connect if it's a send, but not a write?). It makes more sense either applied to the entire rdm endpoint or to a specific peer.

Contributor Author (@ooststep), Dec 11, 2024:

In the target problematic scenario, the rdm endpoint is used for some peers where this flag is needed and some peers where it would not be, so applying it to the entire rdm endpoint was undesirable.

Member:

This doesn't make sense. An RDM endpoint is unconnected. Exposing low-level connection implementation details is not desirable. There could be multiple connections to the same peer. The connection might be in another process (e.g. Pony Express or SNAP or whatever it's called). It could be in the kernel (RDS).

This still isn't a per-operation flag. At best it's per-peer, but even that use case seems questionable. It's like putting half of an RDM endpoint behind a firewall, but having the other half ignore it. Apply firewall semantics to the entire RDM endpoint. If some peers are outside the firewall but some are inside, require 2 endpoints with some sort of per-EP configuration.
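
For reference, libfabric already has an endpoint-wide mechanism for default operation flags: fi_control with FI_SETOPSFLAG. A sketch of the per-EP configuration suggested above; whether the tcp provider would accept this provider-specific bit through that path is an assumption, and no_connect_ep is a placeholder:

```c
/* Sketch: make the no-connect behavior the endpoint-wide default for
 * transmit operations.  no_connect_ep is a placeholder endpoint; the
 * provider accepting this provider-specific bit here is an assumption. */
uint64_t flags = FI_TRANSMIT | FI_TCP_NO_CONNECT;
int ret = fi_control(&no_connect_ep->fid, FI_SETOPSFLAG, &flags);
if (ret)
	fprintf(stderr, "FI_SETOPSFLAG: %s\n", fi_strerror(-ret));
```
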

Contributor:

I think I also agree with the first part (i.e. exposing low-level connection state is a bad idea), though providing 2 types of endpoints is also difficult for the user.

To provide some context as to what problem we were trying to solve: in a client-server configuration where the client is behind a firewall, we have cases where, for example, client A sends a message to server A, and server A then attempts an emulated tcp RMA to client B through an fi_write() call (without a prior connection established from client B to server A). In that case, the tcp provider leaves server A stuck attempting to establish a connection, and an fi_cancel on the RMA cannot complete, as cancellation is not currently supported by the tcp provider.

So what we wanted was a way to get a completion and an error back when server A is not able to reach client B. I'm open to other means, but having to manage 2 endpoints also seems cumbersome; I would be happy, though, if we don't have to expose any connection logic.
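
For concreteness, a sketch of server A's write path under the proposed flag; iov, desc, rma_iov, the address variables, and reply_with_error() are placeholders:

```c
/* Sketch: server A attempts the emulated RMA to client B but refuses
 * to initiate a connection, so it gets -FI_ENOTCONN back immediately
 * instead of hanging on a connect the firewall will never allow. */
struct fi_msg_rma msg = {
	.msg_iov = &iov,
	.desc = &desc,
	.iov_count = 1,
	.addr = client_b_addr,
	.rma_iov = &rma_iov,
	.rma_iov_count = 1,
};

ssize_t ret = fi_writemsg(ep, &msg, FI_TCP_NO_CONNECT);
if (ret == -FI_ENOTCONN)
	reply_with_error(client_a_addr, FI_ENOTCONN);
```
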

@jolivier23, Dec 11, 2024:

@shefty Just to add a bit of context beyond what Jerome said: this use case came from Parallelstore (Google's DAOS service). We control only the server side of the equation and rely on the user to do client configuration (as we don't control their VMs, network configuration, or processes). Opening the firewall to server-to-client connections is not common for services and requires users to do it explicitly, or their writes will simply hang. We want to remove this requirement, as it is becoming a major scale issue for onboarding new customers because users don't read documentation.

Member:

For the alternative, I would only configure the EPs at the server, as that's where the actual problem occurs. The client behavior is unchanged.

Reply:

I think the problem is really on the client. The client endpoint is the one behind the firewall and can't be reached (unless the server has already connected to it). Can we encode something into the client-side URI to indicate it can't be reached? The advantage of that approach from our end is that the client could have complete control over telling the server whether or not it can handle the error (e.g. it can support getting back an error indicating that the server couldn't connect, and handle it appropriately).

Member:

It's the server's behavior that should change.

From the viewpoint of the client, everything works. It wants to talk to server A and can do so. That server A wants to pass off the response to some other system that the client doesn't know about is related to the storage architecture, not the client SW.

I don't think pushing this detail into the apps is the best option. But you can work around this in the client by having the client send some sort of 'hello' message to every storage server during initialization, to poke holes through the firewall. That pushes the burden onto every client app that might want to use DAOS.

Alternatively, you can configure the server SW to be firewall aware, so that it avoids forwarding requests to servers not already communicating with the client.

Or, change the protocol around handling firewalls. Have server A tell the client to retry its request with server B, rather than forwarding it internally.

There are likely other options for this. But I would avoid picking one which encoded these details in the SW API.
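
A rough sketch of the 'hello' workaround described above, assuming fi_send on the client's RDM endpoint; server_addrs, server_count, and log_unreachable() are placeholders:

```c
/* Client-side workaround sketch: contact every storage server once at
 * startup so the firewall sees outbound connections from the client,
 * letting later server-initiated traffic reuse those connections. */
static const char hello[] = "hello";

for (size_t i = 0; i < server_count; i++) {
	ssize_t ret;

	do {
		ret = fi_send(ep, hello, sizeof(hello), NULL,
			      server_addrs[i], NULL);
	} while (ret == -FI_EAGAIN);

	if (ret)
		log_unreachable(server_addrs[i]);
}
```
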

Member:

It would probably be better if this thread were copied into an issue for continued discussion, rather than attaching it to this PR.

+
 # RUNTIME PARAMETERS

 The tcp provider may be configured using several environment variables. A
7 changes: 5 additions & 2 deletions prov/tcp/src/xnet.h
@@ -98,6 +98,9 @@ typedef void xnet_profile_t;
 #define XNET_MIN_MULTI_RECV 16384
 #define XNET_PORT_MAX_RANGE (USHRT_MAX)

+/* provider specific op flags */
+#define TCP_NO_CONNECT FI_TCP_NO_CONNECT
Contributor:

Any reason not to use FI_TCP_NO_CONNECT directly?

Contributor Author (@ooststep), Nov 15, 2024:

The provider specific flags are typically defined within the provider header files without any FI prefixing and used internally. I wanted to have a definition here as a placeholder so that it's obvious there are existing provider op flags that need to be accounted for.

Member:

@ooststep This is true when the API flag differs from the provider flag, or when the scopes of the flags differ. E.g. the API flags cover a 64-bit range, but the provider flag maps to some 16-bit range in the protocol.
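
To illustrate that distinction with a made-up example: a provider whose wire protocol carries only 16 bits of flags would need to remap the 64-bit API flag, and that is when a separate provider-local definition stops being a pure alias. Everything below is hypothetical, not code from this PR:

```c
/* Made-up example: the wire protocol only carries 16 bits of flags, so
 * the provider cannot reuse the API bit positions directly and must
 * translate.  XNET_PROTO_NO_CONNECT is hypothetical. */
#define XNET_PROTO_NO_CONNECT	(1 << 5)

static inline uint16_t xnet_map_op_flags(uint64_t api_flags)
{
	uint16_t proto_flags = 0;

	if (api_flags & FI_TCP_NO_CONNECT)
		proto_flags |= XNET_PROTO_NO_CONNECT;

	return proto_flags;
}
```
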


 extern struct fi_provider xnet_prov;
 extern struct util_prov xnet_util_prov;
 void xnet_init_infos(void);
@@ -304,8 +307,8 @@ struct xnet_rdm {

 int xnet_rdm_ep(struct fid_domain *domain, struct fi_info *info,
 		struct fid_ep **ep_fid, void *context);
-ssize_t xnet_get_conn(struct xnet_rdm *rdm, fi_addr_t dest_addr,
-		      struct xnet_conn **conn);
+ssize_t xnet_get_conn(struct xnet_rdm *rdm, fi_addr_t addr,
+		      struct xnet_conn **conn, uint64_t flags);
 struct xnet_ep *xnet_get_rx_ep(struct xnet_rdm *rdm, fi_addr_t addr);
 void xnet_freeall_conns(struct xnet_rdm *rdm);
42 changes: 21 additions & 21 deletions prov/tcp/src/xnet_rdm.c
@@ -77,7 +77,7 @@ xnet_rdm_send(struct fid_ep *ep_fid, const void *buf, size_t len,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -97,7 +97,7 @@ xnet_rdm_sendv(struct fid_ep *ep_fid, const struct iovec *iov,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -117,7 +117,7 @@ xnet_rdm_sendmsg(struct fid_ep *ep_fid, const struct fi_msg *msg,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, msg->addr, &conn);
+	ret = xnet_get_conn(rdm, msg->addr, &conn, flags);
 	if (ret)
 		goto unlock;

@@ -137,7 +137,7 @@ xnet_rdm_inject(struct fid_ep *ep_fid, const void *buf,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -157,7 +157,7 @@ xnet_rdm_senddata(struct fid_ep *ep_fid, const void *buf, size_t len,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -178,7 +178,7 @@ xnet_rdm_injectdata(struct fid_ep *ep_fid, const void *buf, size_t len,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -245,7 +245,7 @@ xnet_rdm_tsend(struct fid_ep *ep_fid, const void *buf, size_t len,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -267,7 +267,7 @@ xnet_rdm_tsendv(struct fid_ep *ep_fid, const struct iovec *iov,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -288,7 +288,7 @@ xnet_rdm_tsendmsg(struct fid_ep *ep_fid, const struct fi_msg_tagged *msg,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, msg->addr, &conn);
+	ret = xnet_get_conn(rdm, msg->addr, &conn, flags);
 	if (ret)
 		goto unlock;

@@ -308,7 +308,7 @@ xnet_rdm_tinject(struct fid_ep *ep_fid, const void *buf,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -329,7 +329,7 @@ xnet_rdm_tsenddata(struct fid_ep *ep_fid, const void *buf, size_t len,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -350,7 +350,7 @@ xnet_rdm_tinjectdata(struct fid_ep *ep_fid, const void *buf, size_t len,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -384,7 +384,7 @@ xnet_rdm_read(struct fid_ep *ep_fid, void *buf, size_t len,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, src_addr, &conn);
+	ret = xnet_get_conn(rdm, src_addr, &conn, rdm->util_ep.rx_op_flags);
 	if (ret)
 		goto unlock;

@@ -406,7 +406,7 @@ xnet_rdm_readv(struct fid_ep *ep_fid, const struct iovec *iov,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, src_addr, &conn);
+	ret = xnet_get_conn(rdm, src_addr, &conn, rdm->util_ep.rx_op_flags);
 	if (ret)
 		goto unlock;

@@ -427,7 +427,7 @@ xnet_rdm_readmsg(struct fid_ep *ep_fid, const struct fi_msg_rma *msg,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, msg->addr, &conn);
+	ret = xnet_get_conn(rdm, msg->addr, &conn, flags);
 	if (ret)
 		goto unlock;

@@ -448,7 +448,7 @@ xnet_rdm_write(struct fid_ep *ep_fid, const void *buf,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -470,7 +470,7 @@ xnet_rdm_writev(struct fid_ep *ep_fid, const struct iovec *iov,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -491,7 +491,7 @@ xnet_rdm_writemsg(struct fid_ep *ep_fid, const struct fi_msg_rma *msg,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, msg->addr, &conn);
+	ret = xnet_get_conn(rdm, msg->addr, &conn, flags);
 	if (ret)
 		goto unlock;

@@ -512,7 +512,7 @@ xnet_rdm_inject_write(struct fid_ep *ep_fid, const void *buf,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -535,7 +535,7 @@ xnet_rdm_writedata(struct fid_ep *ep_fid, const void *buf,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;

@@ -557,7 +557,7 @@ xnet_rdm_inject_writedata(struct fid_ep *ep_fid, const void *buf,

 	rdm = container_of(ep_fid, struct xnet_rdm, util_ep.ep_fid);
 	ofi_genlock_lock(&xnet_rdm2_progress(rdm)->rdm_lock);
-	ret = xnet_get_conn(rdm, dest_addr, &conn);
+	ret = xnet_get_conn(rdm, dest_addr, &conn, rdm->util_ep.tx_op_flags);
 	if (ret)
 		goto unlock;
43 changes: 23 additions & 20 deletions prov/tcp/src/xnet_rdm_cm.c
@@ -317,45 +317,48 @@ xnet_alloc_conn(struct xnet_rdm *rdm, struct util_peer_addr *peer)
 	return conn;
 }

-static struct xnet_conn *
-xnet_add_conn(struct xnet_rdm *rdm, struct util_peer_addr *peer)
+static ssize_t
+xnet_add_conn(struct xnet_rdm *rdm, struct util_peer_addr *peer,
+	      struct xnet_conn **conn, uint64_t flags)
 {
-	struct xnet_conn *conn;
-
 	assert(xnet_progress_locked(xnet_rdm2_progress(rdm)));
-	conn = ofi_idm_lookup(&rdm->conn_idx_map, peer->index);
-	if (conn)
-		return conn;
+	*conn = ofi_idm_lookup(&rdm->conn_idx_map, peer->index);
+	if (*conn)
+		return 0;

-	conn = xnet_alloc_conn(rdm, peer);
-	if (!conn)
-		return NULL;
+	if (flags & TCP_NO_CONNECT)
+		return -FI_ENOTCONN;

-	if (ofi_idm_set(&rdm->conn_idx_map, peer->index, conn) < 0) {
-		xnet_free_conn(conn);
+	*conn = xnet_alloc_conn(rdm, peer);
+	if (!(*conn))
+		return -FI_ENOMEM;
+
+	if (ofi_idm_set(&rdm->conn_idx_map, peer->index, *conn) < 0) {
+		xnet_free_conn(*conn);
 		XNET_WARN_ERR(FI_LOG_EP_CTRL, "ofi_idm_set", -FI_ENOMEM);
-		return NULL;
+		return -FI_ENOMEM;
 	}

-	conn->flags |= XNET_CONN_INDEXED;
-	return conn;
+	(*conn)->flags |= XNET_CONN_INDEXED;
+	return 0;
 }

 /* The returned conn is only valid if the function returns success.
  * This is called from data transfer ops, which return ssize_t, so
  * we return that rather than int.
  */
 ssize_t xnet_get_conn(struct xnet_rdm *rdm, fi_addr_t addr,
-		      struct xnet_conn **conn)
+		      struct xnet_conn **conn, uint64_t flags)
 {
 	struct util_peer_addr **peer;
 	ssize_t ret;

 	assert(xnet_progress_locked(xnet_rdm2_progress(rdm)));
 	peer = ofi_av_addr_context(rdm->util_ep.av, addr);
-	*conn = xnet_add_conn(rdm, *peer);
-	if (!*conn)
-		return -FI_ENOMEM;
+	ret = xnet_add_conn(rdm, *peer, conn, flags);
+	if (ret)
+		return ret;

 	if (!(*conn)->ep) {
 		ret = xnet_rdm_connect(*conn);

@@ -444,8 +447,8 @@ static void xnet_process_connreq(struct fi_eq_cm_entry *cm_entry)
 		goto reject;
 	}

-	conn = xnet_add_conn(rdm, peer);
-	if (!conn)
+	ret = xnet_add_conn(rdm, peer, &conn, 0);
+	if (ret)
 		goto put;

 	FI_INFO(&xnet_prov, FI_LOG_EP_CTRL, "connreq for %p\n", conn);