DAOS-16365 client: intercept MPI_Init() to avoid nested call #14992

Merged Aug 29, 2024 (5 commits)
Changes from 4 commits
32 changes: 32 additions & 0 deletions src/client/dfuse/pil4dfs/int_dfs.c
@@ -160,6 +160,8 @@ static long int page_size;
#define DAOS_INIT_NOT_RUNNING 0
#define DAOS_INIT_RUNNING 1

static _Atomic uint64_t mpi_init_count;

static long int daos_initing;
_Atomic bool d_daos_inited;
static bool daos_debug_inited;
@@ -467,6 +469,8 @@ static int (*next_posix_fallocate64)(int fd, off64_t offset, off64_t len);
static int (*next_tcgetattr)(int fd, void *termios_p);
/* end NOT supported by DAOS */

static int (*next_mpi_init)(int *argc, char ***argv);

/* to do!! */
/**
 * static char * (*org_realpath)(const char *pathname, char *resolved_path);
@@ -1020,6 +1024,25 @@ consume_low_fd(void)
        return rc;
}

int
MPI_Init(int *argc, char ***argv)
{
        int rc;

        if (next_mpi_init == NULL) {
                next_mpi_init = dlsym(RTLD_NEXT, "MPI_Init");
                D_ASSERT(next_mpi_init != NULL);
        }

        atomic_fetch_add_relaxed(&mpi_init_count, 1);
        rc = next_mpi_init(argc, argv);
        atomic_fetch_add_relaxed(&mpi_init_count, -1);
        return rc;
}

int
PMPI_Init(int *argc, char ***argv) __attribute__((alias("MPI_Init")));

/** determine whether a path (both relative and absolute) is on DAOS or not. If yes,
 * returns parent object, item name, full path of parent dir, full absolute path, and
 * the pointer to struct dfs_mt.
@@ -1117,6 +1140,15 @@ query_path(const char *szInput, int *is_target_path, struct dcache_rec **parent,
        uint64_t status_old = DAOS_INIT_NOT_RUNNING;
        bool rc_cmp_swap;

        /* Check whether MPI_Init() is running. If yes, pass to the original
         * libc functions. Avoid possible zeInit reentrancy/nested call.
         */

        if (atomic_load_relaxed(&mpi_init_count) > 0) {
knard38 (Contributor) commented on Aug 23, 2024:
I am not sure I perfectly understand the _relaxed semantics, but for this test it would probably be better to use a stricter atomic ordering (from my understanding of the GCC documentation).

wiliamhuang (Contributor, Author) replied:
https://en.cppreference.com/w/cpp/atomic/memory_order
From my understanding, the atomicity of the operation is already guaranteed with "memory_order_relaxed".
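
To illustrate that point, a minimal standalone sketch (not DAOS code; compile with -pthread): even with memory_order_relaxed, fetch-add is a single atomic read-modify-write, so concurrent updates to the counter are never lost; relaxed only drops ordering guarantees relative to other memory accesses.

#include <inttypes.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static _Atomic uint64_t counter;

static void *
worker(void *arg)
{
        (void)arg;
        /* Relaxed fetch-add: no ordering constraints, but still atomic. */
        for (int i = 0; i < 1000000; i++)
                atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
        return NULL;
}

int
main(void)
{
        pthread_t t[4];

        for (int i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
                pthread_join(t[i], NULL);
        /* Always prints 4000000: no increment is lost despite relaxed ordering. */
        printf("%" PRIu64 "\n", atomic_load_explicit(&counter, memory_order_relaxed));
        return 0;
}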

knard38 (Contributor) commented on Aug 25, 2024:
Indeed, you are right; I had misunderstood the following sentence from the documentation at https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync:
"__ATOMIC_RELAXED implies no inter-thread ordering constraints."
In the end, that documentation (https://gcc.gnu.org/wiki/Atomic/GCCMM/AtomicSync) is actually quite clear on the different memory models used for inter-thread synchronization.

However, I still have a concern with this synchronization design pattern:

Thread A:
Executes line 1147 and the condition is false
Is stopped by the scheduler

Thread B:
Executes line 1037
Starts executing line 1038 and is interrupted by the scheduler during the execution of next_mpi_init()

Thread A:
Executes line 1158 and the following lines

From my understanding we still have the race issue.
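
To make the interleaving concrete, a hypothetical, simplified sketch of the window being described (names and sleeps are illustrative only, not the actual DAOS code): thread A can read the counter as zero, get preempted, and then continue toward daos_init() even though thread B has since entered the real MPI_Init().

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static _Atomic int mpi_init_count;

static void *
thread_a(void *arg)                     /* stands in for the query_path() caller */
{
        (void)arg;
        if (atomic_load_explicit(&mpi_init_count, memory_order_relaxed) > 0) {
                puts("A: MPI_Init() in progress, fall back to dfuse");
                return NULL;
        }
        usleep(1000);                   /* preempted here, as in the scenario above */
        puts("A: continues toward daos_init() while B may be inside MPI_Init()");
        return NULL;
}

static void *
thread_b(void *arg)                     /* stands in for the MPI_Init() wrapper */
{
        (void)arg;
        atomic_fetch_add_explicit(&mpi_init_count, 1, memory_order_relaxed);
        usleep(5000);                   /* the real MPI_Init() would run here */
        atomic_fetch_sub_explicit(&mpi_init_count, 1, memory_order_relaxed);
        return NULL;
}

int
main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}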

wiliamhuang (Contributor, Author) commented on Aug 26, 2024:
@knard-intel Thank you very much for your comments! I will think more about this.
The issue we encountered was in MPI applications on Aurora. The hang was caused by a deadlock in nested calls of zeInit() in the Intel Level Zero driver within the same thread.
Our goal was to avoid daos_init() being called inside MPI_Init(); during that window, all I/O requests are forwarded to dfuse.
We do not know whether we will have issues if thread A is calling daos_init() while thread B starts calling MPI_Init().

knard38 (Contributor) replied on Aug 26, 2024:
Thanks for the explanation, which makes sense to me :-)

                *is_target_path = 0;
                goto out_normal;
        }

        /* daos_init() is expensive to call. We call it only when necessary. */

        /* Check whether daos_init() is running. If yes, pass to the original
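
For context (not part of the diff): the wrapper above only takes effect when the pil4dfs interception library is preloaded, so the dynamic linker resolves the application's MPI_Init() to the pil4dfs definition, which bumps mpi_init_count and forwards to the real MPI_Init() via dlsym(RTLD_NEXT). A hedged usage sketch, with an illustrative library path:

/* hello_mpi.c - trivial MPI program to exercise the interposed MPI_Init().
 * Build: mpicc -o hello_mpi hello_mpi.c
 * Run:   LD_PRELOAD=/path/to/libpil4dfs.so mpirun -n 2 ./hello_mpi
 * (the library path is illustrative, not taken from this PR)
 */
#include <mpi.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
        int rank;

        /* Resolved to the pil4dfs wrapper when the library is preloaded. */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d initialized\n", rank);
        MPI_Finalize();
        return 0;
}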