DAOS-13194 gurt: environment APIs hook #12220
To prevent a known race caused by the lack of locking in the Glibc environment APIs (getenv()/[un]setenv()/putenv()/clearenv()), they have been overloaded and strengthened in Gurt with hooks that all use a common lock/mutex. Libgurt is the preferred place for this because it is the lowest layer in DAOS, so it will be the earliest to be loaded, which ensures the hooks are installed as early as possible and could avoid the need for LD_PRELOAD. This addresses the main lack of multi-thread protection in the Glibc APIs but does not handle all unsafe use cases (such as changing or removing an env var after its value address has already been obtained by a previous getenv(), ...). Required-githooks: true Signed-off-by: Bruno Faccini <[email protected]>
Style warning(s) for job https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-12220/1/
Please review https://wiki.hpdd.intel.com/display/DC/Coding+Rules
Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/1/execution/node/145/log
I had forgotten to remove debug fprintf()s to stderr, and there were some cosmetic fixes requested by our set of code beautifiers. Required-githooks: true Signed-off-by: Bruno Faccini <[email protected]>
LGTM. No errors found by checkpatch.
Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/2/execution/node/1015/log
Unexpectedly, Glibc does not seem to be bound early when some commands start, so try to dlopen() it. Required-githooks: true Signed-off-by: Bruno Faccini <[email protected]>
LGTM. No errors found by checkpatch.
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/3/execution/node/301/log
Test stage Build RPM on Leap 15.4 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/3/execution/node/298/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/3/execution/node/341/log
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/3/execution/node/442/log
All is in the title !!... Required-githooks: true Signed-off-by: Bruno Faccini <[email protected]>
LGTM. No errors found by checkpatch.
On the second attempt to bind a Glibc symbol, use the handle returned by dlopen(). Required-githooks: true Signed-off-by: Bruno Faccini <[email protected]>
LGTM. No errors found by checkpatch.
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/5/execution/node/1209/log
The 3 functional test failures seem to be related to DAOS-12883/DAOS-13488 ...
src/gurt/misc.c
Outdated
    D_MUTEX_LOCK(&hook_env_lock);
    if (real_getenv == NULL) {
        real_getenv = (char * (*)(const char *))dlsym(RTLD_NEXT, "getenv");
        if (real_getenv == NULL) {
            /* Glibc symbols could not be resolved !!... */
            void *handle;

            handle = dlopen("libc.so.6", RTLD_LAZY);
            D_ASSERT(handle != NULL);
            real_getenv = (char * (*)(const char *))dlsym(handle, "getenv");
        }
        D_ASSERT(real_getenv != NULL);
    }

    p = real_getenv(name);
    D_MUTEX_UNLOCK(&hook_env_lock);
Seems like this block is repeated across the code 5 times, once for each call, with the only difference being the symbol being looked up and used. Can this be simplified to use a shared function across those calls instead?
Something like
    char *getenv(const char *name)
    {
        return shared_call(&real_getenv, "getenv");
    }
    int setenv(const char *name, const char *value, int overwrite)
    {
        return shared_call(&real_setenv, "setenv");
    }
    etc...
Ok, you are right!
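A minimal sketch of the shared-helper idea suggested above, assuming plain pthread calls in place of the D_MUTEX_* macros. The names `resolve_libc_sym`, `locked_getenv`, and `locked_setenv` are hypothetical and are not taken from the merged code; in the actual hook the wrappers would carry the libc names so that they interpose on them.

```c
#define _GNU_SOURCE /* for RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

/* One lock shared by every environment wrapper, as in the PR description. */
static pthread_mutex_t hook_env_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Hypothetical helper: resolve a libc symbol on first use. It tries the next
 * object in the search order first, then falls back to an explicit dlopen()
 * of libc. The caller must hold hook_env_lock.
 */
static void *
resolve_libc_sym(void **slot, const char *sym)
{
	if (*slot == NULL) {
		*slot = dlsym(RTLD_NEXT, sym);
		if (*slot == NULL) {
			void *handle = dlopen("libc.so.6", RTLD_LAZY);

			assert(handle != NULL);
			*slot = dlsym(handle, sym);
		}
		assert(*slot != NULL);
	}
	return *slot;
}

static void *real_getenv_sym;
static void *real_setenv_sym;

/* Named locked_* (an assumption) so the sketch does not interpose on libc. */
char *
locked_getenv(const char *name)
{
	char *(*fn)(const char *);
	char *p;

	pthread_mutex_lock(&hook_env_lock);
	fn = (char *(*)(const char *))resolve_libc_sym(&real_getenv_sym, "getenv");
	p  = fn(name);
	pthread_mutex_unlock(&hook_env_lock);
	return p;
}

int
locked_setenv(const char *name, const char *value, int overwrite)
{
	int (*fn)(const char *, const char *, int);
	int rc;

	pthread_mutex_lock(&hook_env_lock);
	fn = (int (*)(const char *, const char *, int))resolve_libc_sym(&real_setenv_sym, "setenv");
	rc = fn(name, value, overwrite);
	pthread_mutex_unlock(&hook_env_lock);
	return rc;
}
```

Only the symbol name and the call signature differ between wrappers, so the lookup and fallback logic lives in one place. Because the signatures differ, each wrapper still has to make its own typed call.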
src/gurt/misc.c
Outdated
static int (*real_unsetenv)(const char *);
static int (*real_clearenv)(void);

char *getenv(const char *name)
The question I have here is whether this actually works in practice. Do calls to getenv from dependent libraries such as libfabric get intercepted by the hooks in libgurt?
As I have already indicated in the DAOS-13194 Jira ticket:
I have manually verified that this approach works for the daos/dfuse/dmg/daos_agent/daos_server/daos_engine/daos_server_helper binaries.
src/gurt/misc.c
Outdated

    D_MUTEX_LOCK(&hook_env_lock);
    if (real_getenv == NULL) {
        real_getenv = (char * (*)(const char *))dlsym(RTLD_NEXT, "getenv");
dlsym returns a void *. You don't need to cast it
Ah... and do you know where this rule comes from? Early Clang definitions, kernel developers, ...?
void * will alias to anything, it never needs a cast.
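For context on the cast discussion, a small sketch of the two forms. In C, a `void *` converts implicitly to any object pointer type, but ISO C does not define the conversion between `void *` and a function pointer; POSIX requires it to work for dlsym() results, and some compilers may diagnose the uncast form in pedantic mode. The helper name `lookup_getenv` and the typedef are hypothetical.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

typedef char *(*getenv_fn)(const char *);

/* Hypothetical helper for illustration: look up getenv through dlsym(). */
static getenv_fn
lookup_getenv(void)
{
	/* dlopen(NULL, ...) gives a handle for the main program and its dependencies. */
	void     *handle = dlopen(NULL, RTLD_LAZY);
	getenv_fn fn;

	assert(handle != NULL);

	/* Form 1: explicit cast, as in the hunk under review. */
	fn = (getenv_fn)dlsym(handle, "getenv");

	/*
	 * Form 2: no cast, relying on an implicit conversion:
	 *     fn = dlsym(handle, "getenv");
	 * Common compilers accept this by default, but it is a conversion
	 * between an object pointer and a function pointer, so a strictly
	 * conforming build may warn about it.
	 */
	return fn;
}
```

Either form yields the same pointer on a POSIX system; the choice is mostly about which diagnostics the project's build flags enable.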
src/gurt/misc.c
Outdated
{
    int rc;

    D_MUTEX_LOCK(&hook_env_lock);
I might suggest using pthread_once and initializing all of the hooks at once?
Ah yes, you are right !!
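The pthread_once suggestion could look roughly like the sketch below, which binds every hook in a single one-time initializer so the lock only protects the real call. It assumes plain pthread calls rather than the DAOS macros; `bind_sym`, `hook_env_init`, `once_getenv`, and `once_setenv` are hypothetical names, and only two of the five wrappers are shown.

```c
#define _GNU_SOURCE /* for RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>
#include <stddef.h>

static char *(*real_getenv)(const char *);
static int (*real_setenv)(const char *, const char *, int);
static int (*real_unsetenv)(const char *);
static int (*real_putenv)(char *);
static int (*real_clearenv)(void);

static pthread_once_t  hook_env_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t hook_env_lock = PTHREAD_MUTEX_INITIALIZER;

/* Resolve one libc symbol, falling back to an explicit dlopen() of libc. */
static void *
bind_sym(const char *sym)
{
	void *addr = dlsym(RTLD_NEXT, sym);

	if (addr == NULL) {
		void *handle = dlopen("libc.so.6", RTLD_LAZY);

		assert(handle != NULL);
		addr = dlsym(handle, sym);
	}
	assert(addr != NULL);
	return addr;
}

/* Runs exactly once, no matter how many threads race on the first call. */
static void
hook_env_init(void)
{
	real_getenv   = (char *(*)(const char *))bind_sym("getenv");
	real_setenv   = (int (*)(const char *, const char *, int))bind_sym("setenv");
	real_unsetenv = (int (*)(const char *))bind_sym("unsetenv");
	real_putenv   = (int (*)(char *))bind_sym("putenv");
	real_clearenv = (int (*)(void))bind_sym("clearenv");
}

char *
once_getenv(const char *name)
{
	char *p;

	pthread_once(&hook_env_once, hook_env_init);
	pthread_mutex_lock(&hook_env_lock);
	p = real_getenv(name);
	pthread_mutex_unlock(&hook_env_lock);
	return p;
}

int
once_setenv(const char *name, const char *value, int overwrite)
{
	int rc;

	pthread_once(&hook_env_once, hook_env_init);
	pthread_mutex_lock(&hook_env_lock);
	rc = real_setenv(name, value, overwrite);
	pthread_mutex_unlock(&hook_env_lock);
	return rc;
}
```

With initialization handled by pthread_once(), the critical section shrinks to the forwarded call itself.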
        D_ASSERT(real_getenv != NULL);
    }

    p = real_getenv(name);
If you separate init from this, the lock here should really only protect the real call.
If you separate init from this, the lock here should really only protect the real call.
Well, the initialisation of the "real_..." variables is racy too, but maybe I can switch to using "atomic" variables then?
Finally, I have switched to using atomics for the initialisation part, and to an rw-lock (instead of a simple mutex, mainly to allow concurrent getenv()s) for the exception part.
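A rough sketch of what the atomics plus rw-lock combination described above might look like, using C11 `<stdatomic.h>` and a POSIX rwlock. This is an illustration under assumptions, not the merged DAOS code: `bind_atomic`, `rw_getenv`, and `rw_setenv` are hypothetical names, and the real hooks would use the libc names.

```c
#define _GNU_SOURCE /* for RTLD_NEXT */
#include <assert.h>
#include <dlfcn.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Readers (getenv) may run concurrently; writers (setenv, ...) are exclusive. */
static pthread_rwlock_t hook_env_rwlock = PTHREAD_RWLOCK_INITIALIZER;

static _Atomic(void *) real_getenv_addr;
static _Atomic(void *) real_setenv_addr;

/* Lazily resolve a libc symbol without taking the lock. */
static void *
bind_atomic(_Atomic(void *) *slot, const char *sym)
{
	void *addr = atomic_load(slot);

	if (addr == NULL) {
		addr = dlsym(RTLD_NEXT, sym);
		if (addr == NULL) {
			void *handle = dlopen("libc.so.6", RTLD_LAZY);

			assert(handle != NULL);
			addr = dlsym(handle, sym);
		}
		assert(addr != NULL);
		/* Racing threads resolve the same address, so a plain store suffices. */
		atomic_store(slot, addr);
	}
	return addr;
}

char *
rw_getenv(const char *name)
{
	char *(*fn)(const char *) = (char *(*)(const char *))bind_atomic(&real_getenv_addr, "getenv");
	char *p;

	pthread_rwlock_rdlock(&hook_env_rwlock);
	p = fn(name);
	pthread_rwlock_unlock(&hook_env_rwlock);
	return p;
}

int
rw_setenv(const char *name, const char *value, int overwrite)
{
	int (*fn)(const char *, const char *, int) =
	    (int (*)(const char *, const char *, int))bind_atomic(&real_setenv_addr, "setenv");
	int rc;

	pthread_rwlock_wrlock(&hook_env_rwlock);
	rc = fn(name, value, overwrite);
	pthread_rwlock_unlock(&hook_env_rwlock);
	return rc;
}
```

As the PR description notes, the pointer returned by the getenv() wrapper is only protected while the lock is held; a later setenv() or unsetenv() can still invalidate it.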
This has a conflict.
To fix conflicts !!... Required-githooks: true Signed-off-by: Bruno Faccini <[email protected]>
Ok, just merged with latest master ... really sorry to my reviewers :-(
LGTM. No errors found by checkpatch.
Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12220/16/testReport/
The PR has not passed the unit test stage.
Please rebase on master and re-push.
LGTM. No errors found by checkpatch.
Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12220/17/testReport/
No comment ....... Signed-off-by: Bruno Faccini <[email protected]>
LGTM. No errors found by checkpatch.
After a merge with the latest master, a forced test rerun, and a new merge with the latest master, I was able to get 2 consecutive successful CI sessions, with no more of the apparently unrelated unit-test errors (due to "nvme_control_ctests" unexpectedly missing).
To prevent a known race caused by the lack of locking in the Glibc environment APIs (getenv()/[un]setenv()/putenv()/clearenv()), they have been overloaded and strengthened in Gurt with hooks that all use a common lock/mutex.
Libgurt is the preferred place for this because it is the lowest layer in DAOS, so it will be the earliest to be loaded, which ensures the hooks are installed as early as possible and could avoid the need for LD_PRELOAD.
This addresses the main lack of multi-thread protection in the Glibc APIs but does not handle all unsafe use cases (such as changing or removing an env var after its value address has already been obtained by a previous getenv(), ...).
Required-githooks: true