
Fix race condition involving wrapper lookup #865

Open
wants to merge 3 commits into master
Conversation

colesbury

See #864

}

static bool nb_try_inc_ref(PyObject *obj) noexcept {
#if 0 && defined(Py_GIL_DISABLED) && PY_VERSION_HEX >= 0x030E00A5
Owner

Is the 0 intentional here?

@@ -1766,16 +1834,18 @@ PyObject *nb_type_put(const std::type_info *cpp_type,
     PyTypeObject *tp = Py_TYPE(seq.inst);

     if (nb_type_data(tp)->type == cpp_type) {
-        Py_INCREF(seq.inst);
-        return seq.inst;
+        if (nb_try_inc_ref(seq.inst)) {
Owner

I think that if nb_try_inc_ref fails, what we actually want to do is to break out of the loop that considers the different Python objects associated with that pointer address. We tried the only possible match, and it cannot be returned.

Ditto below.

There are also two other places where this nb_try_inc_ref will be needed, in nb_type_put_p below (which mirrors the function here, but for polymorphic C++ objects).

Author

I think breaking out here would create more wrappers than necessary. The case I'm thinking about is where there are multiple lookups before the to-be-deallocated wrapper is removed from inst_c2p:

T1: starts to deallocate the wrapper object, but hasn't removed it from inst_c2p yet.
T2: looks up the wrapper and sees a zero refcount. The entry is converted to a linked list and a new wrapper is inserted at the end.
...
T3: looks up the wrapper. The first entry is still the zero-refcount wrapper. The second entry is the wrapper we want. If we break out early, we'll create a third wrapper, which doesn't seem ideal.

Owner

In nanobind (and pybind11, for that matter), a single pointer can be associated with multiple Python instances.

For example, you could have

struct A {
    B b;
    /* ... */
};

with both an A* pointer and a B* pointer being simultaneously alive through Python wrappers. These can be created and GCed independently from each other.

The nb_type_put() function converts a C++ pointer into an associated Python object (if one exists) or it creates a new one. It walks through a chain of objects that are associated with a given pointer address until it finds one of the right type. There can only be one of the right type.

Your commit adds a failure mechanism, where it turns out that one wrapper is in the process of being deleted, and we need to create a new one. In this case, it (AFAIK) doesn't make sense to try the other types associated with the current pointer, since we already found the one. Hence the suggestion to break and immediately create a new wrapper. Arguably this is pretty minor and unlikely to have a big performance impact.

src/nb_type.cpp Outdated
#if 0 && defined(Py_GIL_DISABLED) && PY_VERSION_HEX >= 0x030E00A5
PyUnstable_EnableTryIncRef(obj);
#elif defined(Py_GIL_DISABLED)
// TODO: Replace with PyUnstable_Object_EnableTryIncRef when available.
Owner

What's the difference between PyUnstable_Object_EnableTryIncRef and PyUnstable_EnableTryIncRef?

Author

No difference, just the name changed. I'll update the comment.

@@ -40,6 +40,72 @@ static PyObject **nb_weaklist_ptr(PyObject *self) {
return weaklistoffset ? (PyObject **) ((uint8_t *) self + weaklistoffset) : nullptr;
}

static void nb_enable_try_inc_ref(PyObject *obj) noexcept {
#if 0 && defined(Py_GIL_DISABLED) && PY_VERSION_HEX >= 0x030E00A5
Owner

Is the 0 intentional here?

Author

The APIs haven't been added to CPython yet, but I wanted to get this PR ready while the issue was still fresh in my mind. We can either:

  1. Delete this code path for now
  2. Wait until the APIs are added
  3. Or anything else you prefer

Owner
@wjakob wjakob Jan 18, 2025

I think it's OK to commit the code as is, we can remove the #if 0 part once the change to CPython goes in.

Do I understand correctly that the version below now has an optimization (specific to object construction) that won't be in the official API?

@wjakob
Owner

wjakob commented Jan 17, 2025

Hi Sam, thank you for tracking this down and making a patch. I left some comments inside. This added requirement to invoke nb_enable_try_inc_ref on each new instance sounds expensive to me. Wouldn't it be better to provide an API to directly allocate try-inc-ref-enabled objects via PyObject_New() and PyType_GenericAlloc()?

@colesbury
Author

I'll measure the overhead of nb_enable_try_inc_ref. Adding a whole new allocation API like PyObject_New() to allocate try-inc-ref-enabled objects seems like it would be a tough sell.

@hawkinsp
Contributor

I confirmed that this PR fixes a race that @vfdev-5 and I were trying to track down in the JAX test suite when run under free-threading (reduced example: https://gist.github.com/vfdev-5/d46bcdc73231bd1e2b2f85c40de9f890)

@colesbury
Author

colesbury commented Jan 17, 2025

I benchmarked some simple wrapping code using the current main branch of CPython (with free threading), roughly:

struct Foo {};
nb::class_<Foo>(m, "Foo");
m.def("new_foo", []() -> Foo * { return new Foo(); });

And then calling t.new_foo() in a loop to measure the combined time for wrapper creation and deallocation.

On my M1 Mac, the difference is basically within the noise margin: each iteration takes ~80 ns. The extra overhead seems to be maybe 1-2 ns, but there's similar variation between runs.
On my Intel Linux machine, I'm seeing ~129 ns with an extra cost of ~6 ns.

We can optimize nb_enable_try_inc_ref (and PyUnstable_EnableTryIncRef) a bit more for the case where it's called immediately after object construction. That brings down the overhead on Intel so the iteration cost is ~129.5 ns (well within the noise margin; not sure 0.5 ns is meaningful here) with the inlined implementation or ~131 ns with the out-of-line call to PyUnstable_EnableTryIncRef.

We only call `nb_enable_try_inc_ref` during object construction when we
have the only reference to the object. We can safely overwrite
`ob_ref_shared` without using an atomic compare-exchange.

This speeds up creating Python object wrappers by ~5 ns on my Intel
machine. It doesn't seem to matter as much on Apple's M1 chip.
@wjakob
Owner

wjakob commented Jan 18, 2025

Thank you @colesbury, I left 2 more comments.

By the way, in src/common.cpp:1182, there is the maybe_make_immortal function that contains FT implementation details.
Is there an official API for this step that could be used?
