-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Sire OpenMM context minimiser isn't thread safe #259
Comments
This is an issue with Sire since the segmentation fault doesn't happen if I minimise the context within the dynamics object using |
I noticed that the vendored LBFGS source files are quite different to the current OpenMM ones here. However, syncing them makes no difference to the segmentation fault. |
This is a result of Sire releasing the GIL during the minimisation function. The following diff fixes things:
I've added this to my other fix branch and will use this for now. |
While this works locally, I'm seeing hangs when running on |
Yes, it's just running in serial if the GIL isn't released. Will need to figure out why releasing it is causing a segmentation fault. |
This does the job:
|
Interesting that you needed to do that explicitly. I've seen something like this before when default arguments were passed to the function from Python. The SireBase GIL functionality would release the GIL on function exit (e.g. both return and when raise exception) but this happened after destruction of the default arguments. As these were connected to Python objects, their destruction caused some change in the Python interpreter state while the GIL was held, hence a segfault. This is why the auto-wrapped functions that had default arguments weren't wrapped with the SireBase GIL code. My guess is that there is something here that links back to Python that is being deallocated on function return before the GIL is being re-acquired. The Python GIL is a dark art ;-) |
Yes, very puzzling. I'm just pleased that the solution was easy. I'll also make sure the thread state is being reset if an exception is thrown. I think the code already handles that via the data log anyway. |
Yes, it appears that everything is handled via try/catch within the main |
I'm using threads to run multiple minimisations in parallel and see segmentation faults. The LBFGS code claims that it is thread safe, so maybe this is an issue with OpenMM, or the way that Sire is interfacing with things.
For the current replica exchange implementation I create contexts when a runner is instantiated and keep them alive for the duration of the simulation, with threads uses to execute them in parallel. When using processes, it's only possible to create the context within the process due to a CUDA initialisation issue, plus the fact that contexts can't be pickled. Running dynamics works absolutely fine with threads, i.e. I can use threads to manage different contexts on multiple GPUs, so this isn't any issue with running multiple contexts in parallel via threads. It's exclusively a problem with minimisation. Currently I'm minimising in serial, then running the dynamics in parallel.
The text was updated successfully, but these errors were encountered: