Fix Race Condition #637

Merged on May 5, 2023 (7 commits)

Contributor

@kkloberdanz commented May 3, 2023

Related: MONGOCRYPT-526

Fix race condition by moving critical section into the mutex protected scope.

See the following repo for a tool to test the race condition fixed in this PR: https://github.com/kkloberdanz/libmongocrypt_stresstest

@kkloberdanz marked this pull request as ready for review May 3, 2023 23:12

if (dropped_last_ref) {
Contributor

The reason for this boolean and copying the old state was to avoid doing this work within the protection of the mutex. If this is moved into that scope, then the code can be simplified to do the destruction directly without the state copying.
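The simplification suggested here can be sketched roughly as follows. This is a minimal illustration, not libmongocrypt's actual code: `shared_state`, `state_acquire`, `state_release`, and `g_teardown_count` are hypothetical names, and a `malloc`ed buffer stands in for the loaded library. Because teardown happens while the mutex is still held, no flag such as `dropped_last_ref` and no copy of the old state is needed, and no other thread can observe the state between the refcount reaching zero and the resource being released.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the global library state. */
typedef struct {
    int refcount;
    char *resource; /* stands in for the loaded library handle */
} shared_state;

static shared_state g_state = {0, NULL};
static pthread_mutex_t g_state_mtx = PTHREAD_MUTEX_INITIALIZER;
static int g_teardown_count = 0; /* counts teardowns, for illustration only */

/* Take a reference, creating the resource on first use. */
void state_acquire(void) {
    pthread_mutex_lock(&g_state_mtx);
    if (g_state.refcount == 0) {
        g_state.resource = malloc(16);
    }
    g_state.refcount++;
    pthread_mutex_unlock(&g_state_mtx);
}

/* Drop a reference. The last owner tears down the resource while still
 * holding the lock, so the decrement and the destruction are one atomic
 * step from the point of view of other threads. */
void state_release(void) {
    pthread_mutex_lock(&g_state_mtx);
    assert(g_state.refcount > 0);
    g_state.refcount--;
    if (g_state.refcount == 0) {
        free(g_state.resource);
        g_state.resource = NULL;
        g_teardown_count++;
    }
    pthread_mutex_unlock(&g_state_mtx);
}
```

The trade-off is the one described in the comment above: destruction now runs under the lock, so other threads wait while it completes, in exchange for simpler code with no unlocked window.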

src/mongocrypt.c Outdated
Comment on lines 614 to 621
 #ifndef __linux__
-mcr_dll_close(old_state.dll);
+mcr_dll_close(g_csfle_state.dll);
 #endif
-/// NOTE: On Linux, skip closing the CSFLE library itself, since a bug in
-/// the way ld-linux and GCC interact causes static destructors to not run
-/// during dlclose(). Still, free the error string:
-mstr_free(old_state.dll.error_string);
+/// NOTE: On Linux, skip closing the CSFLE library itself, since a bug in
+/// the way ld-linux and GCC interact causes static destructors to not run
+/// during dlclose(). Still, free the error string:
+mstr_free(g_csfle_state.dll.error_string);
 }
Contributor Author

Do you recall how this dlclose() bug manifested itself? If I get rid of the #ifndef __linux__ above and run it on Linux, I don't see any issues. I'm considering removing the #ifndef guard.

Contributor

This is a very obscure, painful bug that appears to be related to application shutdown. The crypt_shared library depends on global destructors, and the behavior of Linux dynamic libraries does not play nicely with C++ static lifetime semantics. The static destructors remain registered with atexit() even if we unload the library, leading to crashes during shutdown. I wasn't able to replicate it myself, but downstream drivers ran into it almost immediately, and this was the only reasonable fix.

Contributor Author

Gotcha! I'll restore the #ifndef __linux__ guard.

Contributor

@vector-of-bool May 4, 2023

You can add a link to this ticket: https://jira.mongodb.org/browse/SERVER-63710

Correction: Not a crash, but faulty leak-detection warnings.

Contributor

Yes, please add a link to the ticket – I still think this is only showing on Linux because that’s where leak detection runs, not because Linux is doing something buggy.

Contributor

@kevinAlbs left a comment

LGTM with a fix to a possible double-free.

/// NOTE: On Linux, skip closing the CSFLE library itself, since a bug in
/// the way ld-linux and GCC interact causes static destructors to not run
/// during dlclose(). Still, free the error string:
mstr_free(g_csfle_state.dll.error_string);
Contributor

The mcr_dll_close call also frees g_csfle_state.dll.error_string. I think this needs to be in an #else:

#ifndef __linux__
            mcr_dll_close(g_csfle_state.dll);
#else
            /// NOTE: On Linux, skip closing the CSFLE library itself, since a bug in
            /// the way ld-linux and GCC interact causes static destructors to not run
            /// during dlclose(). Still, free the error string:
            mstr_free(g_csfle_state.dll.error_string);
#endif

I do not know why, but ASAN does not report this as a double free for me.
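To make the double-free concern concrete, here is a small self-contained sketch. All names (`fake_dll`, `fake_str_free`, `fake_dll_close`, `make_dll`, `teardown`, `g_free_calls`) are hypothetical stand-ins, not the real `mcr_dll` or `mstr` API. The only assumption taken from the comment above is that the close call also frees the error string. With the `#else`, exactly one of the two branches is compiled, so the string is freed once on every platform.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the library handle plus its error string. */
typedef struct {
    void *handle;
    char *error_string;
} fake_dll;

int g_free_calls = 0; /* counts non-NULL frees of error strings */

/* Free an error string and record that a free happened. */
void fake_str_free(char *s) {
    if (s != NULL) {
        g_free_calls++;
    }
    free(s);
}

/* Assumed to mirror the behavior described above: closing the library
 * also frees its error string. A real version would dlclose() the handle. */
void fake_dll_close(fake_dll dll) {
    fake_str_free(dll.error_string);
}

/* Build a fake_dll that owns a heap copy of msg. */
fake_dll make_dll(const char *msg) {
    fake_dll d = {NULL, NULL};
    d.error_string = malloc(strlen(msg) + 1);
    if (d.error_string != NULL) {
        strcpy(d.error_string, msg);
    }
    return d;
}

/* Exactly one branch is compiled, so the string is freed exactly once. */
void teardown(fake_dll *dll) {
#ifndef __linux__
    fake_dll_close(*dll); /* frees the error string as part of closing */
#else
    fake_str_free(dll->error_string); /* library stays loaded on Linux */
#endif
    dll->handle = NULL;
    dll->error_string = NULL;
}
```

Without the `#else`, a non-Linux build would run both the close call and the explicit free on the same pointer. Resetting the pointers to `NULL` after teardown is an extra safeguard added in this sketch, not something taken from the PR.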

@kkloberdanz merged commit 1525897 into mongodb:master May 5, 2023
@kkloberdanz deleted the kyle/fix-race-condition branch May 5, 2023 14:31
kkloberdanz added a commit that referenced this pull request May 5, 2023