
[BUG] Sire OpenMM context minimiser isn't thread safe #259

Closed
lohedges opened this issue Nov 4, 2024 · 9 comments · Fixed by #263
Labels
bug Something isn't working

Comments

@lohedges
Contributor

lohedges commented Nov 4, 2024

I'm using threads to run multiple minimisations in parallel and am seeing segmentation faults. The LBFGS code claims to be thread safe, so this may be an issue with OpenMM, or with the way that Sire interfaces with it.

For the current replica exchange implementation I create contexts when a runner is instantiated and keep them alive for the duration of the simulation, with threads used to execute them in parallel. When using processes, it's only possible to create the context within the process, due to a CUDA initialisation issue and the fact that contexts can't be pickled. Running dynamics works absolutely fine with threads, i.e. I can use threads to manage different contexts on multiple GPUs, so this isn't an issue with running multiple contexts in parallel via threads; it's exclusively a problem with minimisation. For now I'm minimising in serial, then running the dynamics in parallel.
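As a rough sketch of the threading pattern described above (the worker function and device indices here are hypothetical stand-ins, since the real code would drive Sire/OpenMM contexts bound to specific GPUs):

```python
from concurrent.futures import ThreadPoolExecutor

def run_dynamics(replica_index, device_index):
    # Hypothetical placeholder: in the real runner this would execute
    # dynamics on a long-lived OpenMM context bound to the given GPU.
    return (replica_index, device_index)

num_replicas = 4
devices = [0, 1]  # hypothetical GPU indices

# One thread per device; each replica is submitted to the pool and
# round-robined across the available GPUs.
with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    futures = [
        pool.submit(run_dynamics, i, devices[i % len(devices)])
        for i in range(num_replicas)
    ]
    results = [f.result() for f in futures]
```

This mirrors the "contexts created once, executed in parallel via threads" design; the process-based alternative would need to create each context inside its own process, as noted above.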

@lohedges lohedges added the bug Something isn't working label Nov 4, 2024
@lohedges lohedges self-assigned this Nov 4, 2024
@lohedges
Contributor Author

lohedges commented Nov 4, 2024

This is an issue with Sire since the segmentation fault doesn't happen if I minimise the context within the dynamics object using openmm.LocalEnergyMinimizer directly. I'll do that for now.

@lohedges
Contributor Author

lohedges commented Nov 4, 2024

I noticed that the vendored LBFGS source files are quite different from the current OpenMM ones. However, syncing them makes no difference to the segmentation fault.

@lohedges lohedges changed the title [BUG] OpenMM minimiser doesn't appear to be thread safe [BUG] Sire OpenMM context minimiser isn't thread safe Nov 4, 2024
@lohedges
Contributor Author

lohedges commented Nov 8, 2024

This is a result of Sire releasing the GIL during the minimisation function. The following diff fixes things:

diff --git a/wrapper/Convert/SireOpenMM/openmmminimise.cpp b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
index dd5a5f53..b8334ff5 100644
--- a/wrapper/Convert/SireOpenMM/openmmminimise.cpp
+++ b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
@@ -629,8 +629,6 @@ namespace SireOpenMM
             timeout = std::numeric_limits<double>::max();
         }

-        auto gil = SireBase::release_gil();
-
         const OpenMM::System &system = context.getSystem();

         int num_particles = system.getNumParticles();

I've added this to my other fix branch and will use this for now.

lohedges added a commit that referenced this issue Nov 8, 2024
@lohedges
Contributor Author

lohedges commented Nov 8, 2024

While this works locally, I'm seeing hangs when running on neogodzilla. I'll continue to use the OpenMM minimiser directly for now.

@lohedges
Contributor Author

lohedges commented Nov 8, 2024

Yes, it's just running in serial if the GIL isn't released. I'll need to figure out why releasing it causes the segmentation fault.
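A minimal pure-Python illustration of why releasing the GIL matters here: if a long native call holds the GIL, the threads serialise; if it releases the GIL (below, `time.sleep` stands in for a GIL-releasing minimisation call), they overlap.

```python
import threading
import time

def gil_releasing_work():
    # time.sleep releases the GIL, like a well-behaved long native call.
    time.sleep(0.2)

start = time.perf_counter()
threads = [threading.Thread(target=gil_releasing_work) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

# Four 0.2 s calls overlap rather than taking 0.8 s in serial.
print(f"elapsed: {elapsed:.2f} s")
```

If the worker instead held the GIL for the whole call (as the minimiser does without the release), the four calls would run back to back, which matches the serial behaviour observed above.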

@lohedges
Contributor Author

lohedges commented Nov 8, 2024

This does the job:

diff --git a/wrapper/Convert/SireOpenMM/openmmminimise.cpp b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
index dd5a5f53..2f08f450 100644
--- a/wrapper/Convert/SireOpenMM/openmmminimise.cpp
+++ b/wrapper/Convert/SireOpenMM/openmmminimise.cpp
@@ -50,6 +50,7 @@
 #include <limits.h> // CHAR_BIT
 #include <sstream>
 #include <stdint.h> // uint64_t
+#include <Python.h>

 inline auto is_ieee754_nan(double const x)
     -> bool
@@ -619,6 +620,8 @@ namespace SireOpenMM
                                     double starting_k, double ratchet_scale,
                                     double max_constraint_error, double timeout)
     {
+        PyThreadState *_save = PyEval_SaveThread();
+
         if (max_iterations < 0)
         {
             max_iterations = std::numeric_limits<int>::max();
@@ -629,8 +632,6 @@ namespace SireOpenMM
             timeout = std::numeric_limits<double>::max();
         }

-        auto gil = SireBase::release_gil();
-
         const OpenMM::System &system = context.getSystem();

         int num_particles = system.getNumParticles();
@@ -1105,6 +1106,8 @@ namespace SireOpenMM
                                            CODELOC);
         }

+        PyEval_RestoreThread(_save);
+
         return data.getLog().join("\n");
     }

@chryswoods
Contributor

Interesting that you needed to do that explicitly. I've seen something like this before when default arguments were passed to the function from Python. The SireBase GIL functionality would release the GIL on function exit (i.e. on both return and when an exception was raised), but this happened after destruction of the default arguments. As these were connected to Python objects, their destruction caused a change in the Python interpreter state while the GIL was held, hence a segfault. This is why the auto-wrapped functions that had default arguments weren't wrapped with the SireBase GIL code.

My guess is that there is something here that links back to Python that is being deallocated on function return before the GIL is being re-acquired. The Python GIL is a dark art ;-)

@lohedges
Contributor Author

lohedges commented Nov 9, 2024

Yes, very puzzling. I'm just pleased that the solution was easy. I'll also make sure the thread state is being reset if an exception is thrown. I think the code already handles that via the data log anyway.

@lohedges
Contributor Author

Yes, it appears that everything is handled via try/catch within the main minimise_openmm_context function, so this should be okay.
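The restore-on-exception guarantee discussed above can be sketched in Python with a context manager (the names and the `state` flag are illustrative only, standing in for the `PyEval_SaveThread`/`PyEval_RestoreThread` pair in the C++ patch):

```python
from contextlib import contextmanager

# Illustrative flag standing in for the interpreter's thread state.
state = {"gil_held": True}

@contextmanager
def released_gil():
    state["gil_held"] = False   # stands in for PyEval_SaveThread()
    try:
        yield
    finally:
        state["gil_held"] = True  # stands in for PyEval_RestoreThread()

try:
    with released_gil():
        raise RuntimeError("minimisation failed")
except RuntimeError:
    pass

print(state["gil_held"])  # the thread state is restored despite the error
```

The `try`/`finally` plays the role of the `try`/`catch` in `minimise_openmm_context`: whatever path the minimisation takes out of the function, the saved thread state is restored before control returns to Python.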
