
[BUG] Sire OpenMM context minimiser isn't thread safe #259

Closed
lohedges opened this issue Nov 4, 2024 · 9 comments · Fixed by #263
Labels
bug Something isn't working

Comments

@lohedges
Contributor

lohedges commented Nov 4, 2024

I'm using threads to run multiple minimisations in parallel and am seeing segmentation faults. The LBFGS code claims to be thread safe, so this may be an issue with OpenMM, or with the way that Sire interfaces with it.

For the current replica exchange implementation I create contexts when a runner is instantiated and keep them alive for the duration of the simulation, with threads used to execute them in parallel. When using processes, it's only possible to create the context within the process, due to a CUDA initialisation issue and the fact that contexts can't be pickled. Running dynamics works absolutely fine with threads, i.e. I can use threads to manage different contexts on multiple GPUs, so this isn't an issue with running multiple contexts in parallel via threads; it's exclusively a problem with minimisation. For now I'm minimising in serial, then running the dynamics in parallel.
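As a rough sketch of the threading pattern described above (the worker function and device indices here are hypothetical stand-ins, since the real code would drive Sire/OpenMM contexts bound to specific GPUs):

```python
from concurrent.futures import ThreadPoolExecutor

def run_dynamics(replica_index, device_index):
    # Hypothetical placeholder: in the real runner this would execute
    # dynamics on a long-lived OpenMM context bound to the given GPU.
    return (replica_index, device_index)

num_replicas = 4
devices = [0, 1]  # hypothetical GPU indices

# One thread per device; each replica is submitted to the pool and
# round-robined across the available GPUs.
with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    futures = [
        pool.submit(run_dynamics, i, devices[i % len(devices)])
        for i in range(num_replicas)
    ]
    results = [f.result() for f in futures]
```

This mirrors the "contexts created once, executed in parallel via threads" design; the process-based alternative would need to create each context inside its own process, as noted above.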

@lohedges lohedges added the bug Something isn't working label Nov 4, 2024
@lohedges lohedges self-assigned this Nov 4, 2024
@lohedges
Contributor Author

lohedges commented Nov 4, 2024

This is an issue with Sire since the segmentation fault doesn't happen if I minimise the context within the dynamics object using openmm.LocalEnergyMinimizer directly. I'll do that for now.

@lohedges
Contributor Author

lohedges commented Nov 4, 2024

I noticed that the vendored LBFGS source files are quite different from the current OpenMM ones. However, syncing them makes no difference to the segmentation fault.

@lohedges lohedges changed the title [BUG] OpenMM minimiser doesn't appear to be thread safe [BUG] Sire OpenMM context minimiser isn't thread safe Nov 4, 2024
@lohedges
Contributor Author

lohedges commented Nov 8, 2024

This is a result of Sire releasing the GIL during the minimisation function. The following diff fixes things:

diff --git a/wrapper/Convert/SireOpenMM/openmmminimise.cpp b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
index dd5a5f53..b8334ff5 100644
--- a/wrapper/Convert/SireOpenMM/openmmminimise.cpp
+++ b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
@@ -629,8 +629,6 @@ namespace SireOpenMM
             timeout = std::numeric_limits<double>::max();
         }

-        auto gil = SireBase::release_gil();
-
         const OpenMM::System &system = context.getSystem();

         int num_particles = system.getNumParticles();

I've added this to my other fix branch and will use this for now.

lohedges added a commit that referenced this issue Nov 8, 2024
@lohedges
Contributor Author

lohedges commented Nov 8, 2024

While this works locally, I'm seeing hangs when running on neogodzilla. I'll continue to use the OpenMM minimiser directly for now.

@lohedges
Contributor Author

lohedges commented Nov 8, 2024

Yes, it's just running in serial if the GIL isn't released. I'll need to figure out why releasing it causes the segmentation fault.
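A minimal pure-Python illustration of why releasing the GIL matters here: if a long native call holds the GIL, the threads serialise; if it releases the GIL (below, `time.sleep` stands in for a GIL-releasing minimisation call), they overlap.

```python
import threading
import time

def gil_releasing_work():
    # time.sleep releases the GIL, like a well-behaved long native call.
    time.sleep(0.2)

start = time.perf_counter()
threads = [threading.Thread(target=gil_releasing_work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Four 0.2 s calls overlap rather than taking 0.8 s in serial.
print(f"elapsed: {elapsed:.2f} s")
```

If the worker instead held the GIL for the whole call (as the minimiser does without the release), the four calls would run back to back, which matches the serial behaviour observed above.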

@lohedges
Contributor Author

lohedges commented Nov 8, 2024

This does the job:

diff --git a/wrapper/Convert/SireOpenMM/openmmminimise.cpp b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
index dd5a5f53..2f08f450 100644
--- a/wrapper/Convert/SireOpenMM/openmmminimise.cpp
+++ b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
@@ -50,6 +50,7 @@
 #include <limits.h> // CHAR_BIT
 #include <sstream>
 #include <stdint.h> // uint64_t
+#include <Python.h>

 inline auto is_ieee754_nan(double const x)
     -> bool
@@ -619,6 +620,8 @@ namespace SireOpenMM
                                     double starting_k, double ratchet_scale,
                                     double max_constraint_error, double timeout)
     {
+        PyThreadState *_save = PyEval_SaveThread();
+
         if (max_iterations < 0)
         {
             max_iterations = std::numeric_limits<int>::max();
@@ -629,8 +632,6 @@ namespace SireOpenMM
             timeout = std::numeric_limits<double>::max();
         }

-        auto gil = SireBase::release_gil();
-
         const OpenMM::System &system = context.getSystem();

         int num_particles = system.getNumParticles();
@@ -1105,6 +1106,8 @@ namespace SireOpenMM
                                            CODELOC);
         }

+        PyEval_RestoreThread(_save);
+
         return data.getLog().join("\n");
     }

@chryswoods
Contributor

Interesting that you needed to do that explicitly. I've seen something like this before when default arguments were passed to the function from Python. The SireBase GIL functionality would release the GIL on function exit (i.e. on both return and when an exception was raised), but this happened after destruction of the default arguments. As these were connected to Python objects, their destruction caused a change in the Python interpreter state while the GIL was held, hence a segfault. This is why the auto-wrapped functions that had default arguments weren't wrapped with the SireBase GIL code.

My guess is that there is something here that links back to Python that is being deallocated on function return before the GIL is being re-acquired. The Python GIL is a dark art ;-)

@lohedges
Contributor Author

lohedges commented Nov 9, 2024

Yes, very puzzling. I'm just pleased that the solution was easy. I'll also make sure the thread state is being reset if an exception is thrown. I think the code already handles that via the data log anyway.

@lohedges
Contributor Author

Yes, it appears that everything is handled via try/catch within the main minimise_openmm_context function, so this should be okay.
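The restore-on-exception guarantee discussed above can be sketched in Python with a context manager (the names and the `state` flag are illustrative only, standing in for the `PyEval_SaveThread`/`PyEval_RestoreThread` pair in the C++ patch):

```python
from contextlib import contextmanager

# Illustrative flag standing in for the interpreter's thread state.
state = {"gil_held": True}

@contextmanager
def released_gil():
    state["gil_held"] = False   # stands in for PyEval_SaveThread()
    try:
        yield
    finally:
        state["gil_held"] = True  # stands in for PyEval_RestoreThread()

try:
    with released_gil():
        raise RuntimeError("minimisation failed")
except RuntimeError:
    pass

print(state["gil_held"])  # the thread state is restored despite the error
```

The `try`/`finally` plays the role of the `try`/`catch` in `minimise_openmm_context`: whatever path the minimisation takes out of the function, the saved thread state is restored before control returns to Python.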
