Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require explicit, single-threaded initialization #91

Merged
merged 13 commits into from
Sep 18, 2023

Conversation

exarkun
Copy link
Member

@exarkun exarkun commented Sep 14, 2023

Fixes #89

This changes the underlying C library to require an initial call to fec_init before fec_new (and therefore the other APIs) can succeed. This avoids the problem where multiple threads call fec_new and implicitly trigger parallel initialization leading to corruption of some zfec internal state.

Calls to the C function fec_new will now return NULL if a call to the C function fec_init has not yet completed.

This change is not visible to the Python bindings because (a) they always call into fec while holding the GIL and so there is no chance for multithreaded usage and (b) the Python module initializer in the bindings now calls the initialization function automatically.

This change is visible to the Haskell bindings because the Haskell API can really make parallel calls into the fec library from multiple OS threads. Therefore, there is a new Haskell API, Codec.FEC.initialize, which must be used before Codec.FEC.new. If it is not, Codec.FEC.new will throw an exception.

This is an incompatible change to the Haskell API so it comes with a version bump for those bindings.

Copy link
Member

@hacklschorsch hacklschorsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Respect!

Scene from 1995 "Hackers"

fec_initialized = 1;

void
fec_init (void) {
Copy link

@itamarst itamarst Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread synchronization isn't immediate. If thread 1 and 2 are both started, and thread 1 calls init, thread 2 won't necessarily see updated versions of the memory modified by thread 1, and even if it does it may be inconsistent across different memory locations. Put another way, modifying the static does not create a happens-before relationship, except for threads started after init_fec() is called.

The solution is to make fec_initialized an atomic, which is supported by modern C, and then read/write it with acquire/release atomic ordering.

https://marabos.nl/atomics/ is good book about this; Rust's model is based on C++'s model, and C's model is probably also C++'s model, so the semantics will all be the same even if the APIs are slightly different.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another alternative is to just create a static lock that needs to be acquired to initialized or check initialization.(Acquiring/releasing a lock is equivalent to acquire/release semantic on an atomic variable, so it too creates a happens-before relationship.)

This would simplify the API in that you don't need to make fec_init() public, you could just always call it. It would also make fec_new() slightly slower, but probably not in a very meaningful way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, great points here. From some simple experiments, it looks like neither C11 "atomics" nor "threads" are available from the MSVC we get from current CI (eg, stdatomic.h exists but merely including it leads to numerous errors). I don't quite understand this since we're using the Windows 2022 image and https://github.com/actions/runner-images/blob/main/images/win/Windows2022-Readme.md says this includes MSVC 17.7. The build failures have a version number that looks like "14.35" ... but maybe that's not the MSVC version? Maybe https://reactos.org/wiki/Visual_Studio_Versions is suggesting that "14.35" actually means "17.7". I don't know.

So for now I'm going with "documentation". Hopefully later someone can figure out how to get at atomics/threads on Windows and we can strengthen the implementation.

Copy link
Member

@hacklschorsch hacklschorsch Sep 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't a mutex suffice here? Maybe that is available?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A statically-initializable mutex here would work, yes. There isn't one in C99 so we'd need a third-party library. There is probably one but I am reluctant to give zfec it's first third-party build-time dependency. It seems likely that pthreads offers some kind of solution here, which would at least solve the problem for Linux (maybe macOS too?), but it's not the obvious pthread_mutex_t because those are not statically initializable (so all you do is move the initialization problem from "coordinate zfec initialization" to "coordinate zfec initialization mutex initialization"). Windows probably has its own solution but I don't know what it is offhand either (there's at least 5 solutions that aren't statically initializable, for sure).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have vague memory that certain Python releases require certain MSVC versions? So older Python might get older MSVC or something.

@exarkun
Copy link
Member Author

exarkun commented Sep 14, 2023

There was a segfault during one of the macOS test runs. I managed to reproduce this (at least, it appears to show up in the same test method) on my Linux laptop on the branch (after ~800 iterations) and on master (after ~1800 iterations) - so I'm supposing for now that's some pre-existing issue.

It's nice to have access to them on CI but more importantly showing build
progress prevents the build from timing out...
@itamarst
Copy link

Might be able to find the problem with one of clang's sanitizers, or with valgrind.

The test suite takes a long time to run, particularly on CI, this might help
it a bit.  Also the number of successes can be overridden on the command line
if necessary.
@exarkun exarkun merged commit 92faaca into tahoe-lafs:master Sep 18, 2023
@exarkun exarkun deleted the 89.corrupt-results branch September 18, 2023 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Using multithreading, sometimes the results are wrong
3 participants