
Fix race condition involving wrapper lookup #865

Open
wants to merge 3 commits into master
Conversation

colesbury

See #864

}

static bool nb_try_inc_ref(PyObject *obj) noexcept {
#if 0 && defined(Py_GIL_DISABLED) && PY_VERSION_HEX >= 0x030E00A5
Owner

Is the 0 intentional here?

@@ -1766,16 +1834,18 @@ PyObject *nb_type_put(const std::type_info *cpp_type,
     PyTypeObject *tp = Py_TYPE(seq.inst);

     if (nb_type_data(tp)->type == cpp_type) {
-        Py_INCREF(seq.inst);
-        return seq.inst;
+        if (nb_try_inc_ref(seq.inst)) {
Owner

I think that if nb_try_inc_ref fails, what we actually want to do is to break out of the loop that considers the different Python objects associated with that pointer address. We tried the only possible match, and it cannot be returned.

Ditto below.

There are also two other places where this nb_try_inc_ref will be needed, in nb_type_put_p below (which mirrors the function here, but for polymorphic C++ objects).

Author

I think breaking out here would create more wrappers than necessary. The case I'm thinking about is where there are multiple lookups before the to-be-deallocated wrapper is removed from inst_c2p:

T1: starts to deallocate the wrapper object, but hasn't removed it from inst_c2p yet.
T2: looks up the wrapper and sees a zero refcount. The entry is converted to a linked list and a new wrapper is inserted at the end.
...
T3: looks up the wrapper. The first entry is still the zero-refcount wrapper. The second entry is the wrapper we want. If we break out early, we'll create a third wrapper, which doesn't seem ideal.

Owner

In nanobind (and pybind11, for that matter), a single pointer can be associated with multiple Python instances.

For example, you could have

struct A {
    B b;
    /* ... */
};

with both an A* pointer and a B* pointer being simultaneously alive through Python wrappers. These can be created and GCed independently from each other.

The nb_type_put() function converts a C++ pointer into an associated Python object (if one exists) or it creates a new one. It walks through a chain of objects that are associated with a given pointer address until it finds one of the right type. There can only be one of the right type.

Your commit adds a failure mechanism, where it turns out that one wrapper is in the process of being deleted, and we need to create a new one. In this case, it (AFAIK) doesn't make sense to try the other types associated with the current pointer, since we already found the one. Hence the suggestion to break and immediately create a new wrapper. Arguably this is pretty minor and unlikely to have a big performance impact.

src/nb_type.cpp Outdated
#if 0 && defined(Py_GIL_DISABLED) && PY_VERSION_HEX >= 0x030E00A5
PyUnstable_EnableTryIncRef(obj);
#elif defined(Py_GIL_DISABLED)
// TODO: Replace with PyUnstable_Object_EnableTryIncRef when available.
Owner

What's the difference between PyUnstable_Object_EnableTryIncRef and PyUnstable_EnableTryIncRef?

Author

No difference, just the name changed. I'll update the comment.

@@ -40,6 +40,72 @@ static PyObject **nb_weaklist_ptr(PyObject *self) {
return weaklistoffset ? (PyObject **) ((uint8_t *) self + weaklistoffset) : nullptr;
}

static void nb_enable_try_inc_ref(PyObject *obj) noexcept {
#if 0 && defined(Py_GIL_DISABLED) && PY_VERSION_HEX >= 0x030E00A5
Owner

Is the 0 intentional here?

Author

The APIs haven't been added to CPython yet, but I wanted to get this PR ready while the issue was still fresh in my mind. We can either:

  1. Delete this code path for now
  2. Wait until the APIs are added
  3. Or anything else you prefer

Owner
@wjakob wjakob Jan 18, 2025

I think it's OK to commit the code as is, we can remove the #if 0 part once the change to CPython goes in.

Do I understand correctly that the version below now has an optimization (specific to object construction) that won't be in the official API?

@wjakob
Owner

wjakob commented Jan 17, 2025

Hi Sam, thank you for tracking this down and making a patch. I left some comments inside. This added requirement to invoke nb_enable_try_inc_ref on each new instance sounds expensive to me. Wouldn't it be better to provide an API to directly allocate try-inc-ref-enabled objects via PyObject_New() and PyType_GenericAlloc()?

@colesbury
Author

I'll measure the overhead of nb_enable_try_inc_ref. Adding a whole new allocation API like PyObject_New() to allocate try-inc-ref-enabled objects seems like it would be a tough sell.

@hawkinsp
Contributor

I confirmed that this PR fixes a race that @vfdev-5 and I were trying to track down in the JAX test suite when run under free-threading (reduced example: https://gist.github.com/vfdev-5/d46bcdc73231bd1e2b2f85c40de9f890)

@colesbury
Author

colesbury commented Jan 17, 2025

I benchmarked some simple wrapping code using the current main branch of CPython (with free threading), roughly:

struct Foo {};
nb::class_<Foo>(m, "Foo");
m.def("new_foo", []() -> Foo * { return new Foo(); });

And then calling t.new_foo() in a loop to measure the combined time for wrapper creation and deallocation.

On my M1 Mac, the difference is basically within the noise margin: each iteration takes ~80 ns. The extra overhead seems to be maybe 1-2 ns, but there's similar variation between runs.
On my Intel Linux machine, I'm seeing ~129 ns with an extra cost of ~6 ns.

We can optimize nb_enable_try_inc_ref (and PyUnstable_EnableTryIncRef) a bit more for the case where it's called immediately after object construction. That brings down the overhead on Intel so the iteration cost is ~129.5 ns (well within the noise margin; not sure 0.5 ns is meaningful here) with the inlined implementation or ~131 ns with the out-of-line call to PyUnstable_EnableTryIncRef.

We only call `nb_enable_try_inc_ref` during object construction when we
have the only reference to the object. We can safely overwrite
`ob_ref_shared` without using an atomic compare-exchange.

This speeds up creating Python object wrappers by ~5 ns on my Intel
machine. It doesn't seem to matter as much on Apple's M1 chip.
@wjakob
Owner

wjakob commented Jan 18, 2025

Thank you @colesbury, I left 2 more comments.

By the way, in src/common.cpp:1182, there is the maybe_make_immortal function that contains FT implementation details.
Is there an official API for this step that could be used?
