Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-13194 gurt: environment APIs hook #12220

Merged
merged 16 commits into from
Jul 13, 2023
Merged

Conversation

bfaccini
Copy link
Contributor

In order to prevent known race to occur due to lack of locking in Glibc environment APIs (getenv()/[uns]setenv()/ putenv()/clearenv()), they have been overloaded and strengthened in Gurt with hooks now all using a common lock/mutex.

Libgurt is the preferred place for this as it is the lowest layer in DAOS, so it will be the earliest to be loaded and will ensure the hook to be installed as early as possible and could prevent usage of LD_PRELOAD.

This will address the main lack of multi-thread protection in the Glibc APIs but do not handle all unsafe use-cases (like the change/removal of an env var when its value address has already been grabbed by a previous getenv(), ...).

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

In order to prevent known race to occur due to lack of
locking in Glibc environment APIs (getenv()/[uns]setenv()/
putenv()/clearenv()), they have been overloaded and
strengthened in Gurt with hooks now all using a common
lock/mutex.

Libgurt is the preferred place for this as it is the lowest
layer in DAOS, so it will be the earliest to be loaded and
will ensure the hook to be installed as early as possible
and could prevent usage of LD_PRELOAD.

This will address the main lack of multi-thread protection
in the Glibc APIs but do not handle all unsafe use-cases
(like the change/removal of an env var when its value address
has already been grabbed by a previous getenv(), ...).

Required-githooks: true

Signed-off-by: Bruno Faccini <[email protected]>
@bfaccini bfaccini self-assigned this May 26, 2023
@github-actions
Copy link

github-actions bot commented May 26, 2023

Bug-tracker data:
Ticket title is 'frontera: ior segfault when starting read phase'
Status is 'In Review'
Labels: 'TACC_Frontera'
https://daosio.atlassian.net/browse/DAOS-13194

@github-actions github-actions bot added priority Ticket has high priority (automatically managed) release-2.4 PR is eventually targeted for 2.4 labels May 26, 2023
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
src/gurt/misc.c Outdated Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

I had forgotten to remove debug fprintf()s to stderr, and
there was some cosmetic fixes requested by our set of
code beautifiers...

Required-githooks: true

Signed-off-by: Bruno Faccini <[email protected]>
@daosbuild1 daosbuild1 dismissed their stale review May 26, 2023 11:38

Updated patch

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/2/execution/node/1015/log

Unexpectedly Glibc seems not be already/early binded upon
some commands start, so try to dlopen() it !!...

Required-githooks: true

Signed-off-by: Bruno Faccini <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/3/execution/node/301/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15.4 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/3/execution/node/298/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/3/execution/node/341/log

@daosbuild1
Copy link
Collaborator

All is in the title !!...

Required-githooks: true

Signed-off-by: Bruno Faccini <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

Upon 2nd try to bind Glibc symbol, use the handle we
got from dlopen() !!

Required-githooks: true

Signed-off-by: Bruno Faccini <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12220/5/execution/node/1209/log

@bfaccini
Copy link
Contributor Author

The 3 functional tests failures seems to be related to DAOS-12883/DAOS-13488 ...

src/gurt/misc.c Outdated
Comment on lines 1226 to 1241
D_MUTEX_LOCK(&hook_env_lock);
if (real_getenv == NULL) {
real_getenv = (char * (*)(const char *))dlsym(RTLD_NEXT, "getenv");
if (real_getenv == NULL) {
/* Glibc symbols could not be resolved !!... */
void *handle;

handle = dlopen("libc.so.6", RTLD_LAZY);
D_ASSERT(handle != NULL);
real_getenv = (char * (*)(const char *))dlsym(handle, "getenv");
}
D_ASSERT(real_getenv != NULL);
}

p = real_getenv(name);
D_MUTEX_UNLOCK(&hook_env_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this block is repeated across the code 5 times, once for each call, with the only difference being the symbol being looked up and used. Can this be simplified to use a shared function across those calls instead?
Something like

char *getenv(const char *name)
{
   return shared_call(&real_getenv, "getenv");
}

char *setenv(const char *name)
{
   return shared_call(&real_setenv, "setenv");
}

etc...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, you are right!

src/gurt/misc.c Outdated
static int (*real_unsetenv)(const char *);
static int (*real_clearenv)(void);

char *getenv(const char *name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the question is I have here is whether this actually works in practice. Do calls to getenv from dependent libraries such as libfabric get intercepted by hooks in libgurt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I have already indicated in DAOS-13194 JiRA/ticket :

I have manually verified that this approach works for the daos/dfuse/dmg/daos_agent/daos_server/daos_engine/daos_server_helper binaries.

src/gurt/misc.c Outdated

D_MUTEX_LOCK(&hook_env_lock);
if (real_getenv == NULL) {
real_getenv = (char * (*)(const char *))dlsym(RTLD_NEXT, "getenv");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dlsym returns a void *. You don't need to cast it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... and do you know where does this rule come from ? Early Clang definitions, Kernel developers, ...??

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

void * will alias to anything, it never needs a cast.

src/gurt/misc.c Outdated
{
int rc;

D_MUTEX_LOCK(&hook_env_lock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest using pthread_once and initializing all of the hooks at once?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, you are right !!

D_ASSERT(real_getenv != NULL);
}

p = real_getenv(name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you separate init from this, the lock here should really only protect the real call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you separate init from this, the lock here should really only protect the real call.

Well, initialisation of the "real_..." variables is racy too, but may be I can switch to using "atomic" variables then ?!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally, I have switched to using Atomics for the initialisation part and also switched to rw-lock usage (instead of simple mutex, mainly to allow concurrent getenv()s) for the exception part.

@bfaccini bfaccini requested a review from a team June 19, 2023 09:03
@ashleypittman
Copy link
Contributor

This has a conflict.

To fix conflicts !!...

Required-githooks: true

Signed-off-by: Bruno Faccini <[email protected]>
@bfaccini bfaccini dismissed stale reviews from jolivier23 and soumagne via 36e5b40 June 20, 2023 09:04
@bfaccini bfaccini requested review from soumagne and jolivier23 June 20, 2023 09:05
@bfaccini
Copy link
Contributor Author

This has a conflict.

Ok, just merged with latest master ... really sorry for my reviewers :-(

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12220/16/testReport/

@bfaccini bfaccini requested a review from a team June 29, 2023 05:20
Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR has not passed unit test stage.
please rebase with master and repush

@bfaccini bfaccini requested a review from mchaarawi July 10, 2023 16:59
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12220/17/testReport/

No comment .......

Signed-off-by: Bruno Faccini <[email protected]>
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@bfaccini bfaccini requested a review from a team July 13, 2023 07:54
@bfaccini
Copy link
Contributor Author

After a merge with latest master, a forced test rerun, a new merge with latest master, I have been able to get 2 consecutive and successful CI sessions with no more apparently unrelated unit-tests errors (due to "nvme_control_ctests" unexpectedly missing) !!...

@ashleypittman ashleypittman merged commit 7d2bc9b into master Jul 13, 2023
@ashleypittman ashleypittman deleted the bfaccini/DAOS-13194_2nd branch July 13, 2023 08:06
mjmac pushed a commit that referenced this pull request Nov 8, 2023
In order to prevent known race to occur due to lack of
locking in Glibc environment APIs (getenv()/[uns]setenv()/
putenv()/clearenv()), they have been overloaded and
strengthened in Gurt with hooks now all using a common
lock/mutex.

Libgurt is the preferred place for this as it is the lowest
layer in DAOS, so it will be the earliest to be loaded and
will ensure the hook to be installed as early as possible
and could prevent usage of LD_PRELOAD.

This will address the main lack of multi-thread protection
in the Glibc APIs but do not handle all unsafe use-cases
(like the change/removal of an env var when its value address
has already been grabbed by a previous getenv(), ...).

Change-Id: I38cda09746ddb4e79f0297fee26c2a22e1cb881b
Signed-off-by: Bruno Faccini <[email protected]>
mjmac pushed a commit that referenced this pull request Jan 2, 2024
In order to prevent known race to occur due to lack of
locking in Glibc environment APIs (getenv()/[uns]setenv()/
putenv()/clearenv()), they have been overloaded and
strengthened in Gurt with hooks now all using a common
lock/mutex.

Libgurt is the preferred place for this as it is the lowest
layer in DAOS, so it will be the earliest to be loaded and
will ensure the hook to be installed as early as possible
and could prevent usage of LD_PRELOAD.

This will address the main lack of multi-thread protection
in the Glibc APIs but do not handle all unsafe use-cases
(like the change/removal of an env var when its value address
has already been grabbed by a previous getenv(), ...).

Change-Id: I38cda09746ddb4e79f0297fee26c2a22e1cb881b
Signed-off-by: Bruno Faccini <[email protected]>
@mjmac mjmac mentioned this pull request Jan 2, 2024
jolivier23 pushed a commit that referenced this pull request Feb 28, 2024
In order to prevent known race to occur due to lack of
locking in Glibc environment APIs (getenv()/[uns]setenv()/
putenv()/clearenv()), they have been overloaded and
strengthened in Gurt with hooks now all using a common
lock/mutex.

Libgurt is the preferred place for this as it is the lowest
layer in DAOS, so it will be the earliest to be loaded and
will ensure the hook to be installed as early as possible
and could prevent usage of LD_PRELOAD.

This will address the main lack of multi-thread protection
in the Glibc APIs but do not handle all unsafe use-cases
(like the change/removal of an env var when its value address
has already been grabbed by a previous getenv(), ...).

Signed-off-by: Bruno Faccini <[email protected]>
jolivier23 pushed a commit that referenced this pull request Mar 12, 2024
In order to prevent known race to occur due to lack of
locking in Glibc environment APIs (getenv()/[uns]setenv()/
putenv()/clearenv()), they have been overloaded and
strengthened in Gurt with hooks now all using a common
lock/mutex.

Libgurt is the preferred place for this as it is the lowest
layer in DAOS, so it will be the earliest to be loaded and
will ensure the hook to be installed as early as possible
and could prevent usage of LD_PRELOAD.

This will address the main lack of multi-thread protection
in the Glibc APIs but do not handle all unsafe use-cases
(like the change/removal of an env var when its value address
has already been grabbed by a previous getenv(), ...).

Change-Id: I38cda09746ddb4e79f0297fee26c2a22e1cb881b
Signed-off-by: Bruno Faccini <[email protected]>
jolivier23 pushed a commit that referenced this pull request Apr 10, 2024
In order to prevent known race to occur due to lack of
locking in Glibc environment APIs (getenv()/[uns]setenv()/
putenv()/clearenv()), they have been overloaded and
strengthened in Gurt with hooks now all using a common
lock/mutex.

Libgurt is the preferred place for this as it is the lowest
layer in DAOS, so it will be the earliest to be loaded and
will ensure the hook to be installed as early as possible
and could prevent usage of LD_PRELOAD.

This will address the main lack of multi-thread protection
in the Glibc APIs but do not handle all unsafe use-cases
(like the change/removal of an env var when its value address
has already been grabbed by a previous getenv(), ...).

Change-Id: I38cda09746ddb4e79f0297fee26c2a22e1cb881b
Signed-off-by: Bruno Faccini <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

8 participants