Rare class init / atomic property deadlock #284

ERobsham · 2024-03-08T18:52:45Z

While investigating some intermittent deadlocks being observed in a processes that is part of a suite of software I help maintain, I think I ended up tracking down a potential deadlock.

(Partially due to how pointer hashes could collide for the for the 'spinlocks' used in a few places... but I'm not sure any hash algorithm would ensure this couldn't be hit, just help reduce the rate it happens at)

The process seems to only lock up during initial startup. If gets through initial startup, it seems like it continues chugging along without any hiccups. The following rabbit hole I'm about to describe does use this fact as a basis for making some assumptions on which code paths are taken.

Stack trace of the deadlocked thread (this is a multi threaded application, but none of the other threads are really interesting to look at):

Thread 1 (Thread 24814.24814):
#0  0x0000ffffa4ba0120 in clock_nanosleep () 
#1  0x0000ffffa4ba570c in nanosleep () 
#2  0x0000ffffa4ba55e4 in sleep () 
#3  0x0000ffffa4d4d9e0 in ?? ()
#4  0x0000ffffa4d4dec4 in objc_sync_enter ()
#5  0x0000ffffa4d3e6dc in ?? ()
#6  0x0000ffffa4d46b34 in objc_msg_lookup_sender ()
#7  0x0000ffffa4d4bfe0 in objc_retain ()
#8  0x0000ffffa4d4eb64 in objc_getProperty ()
#9  0x0000ffffa14a2fa0 in -[SomeObj start] (self=0xaaaafd5ab608, _cmd=<optimized out>)
... remaining trace isn't very relevant

So it looks like inside objc_getProperty we attempt to retain, while inside a spin_lock(0xaaaafd5ab608 + propOffset)

OBJC_PUBLIC id objc_getProperty(id obj, SEL _cmd, ptrdiff_t offset, BOOL isAtomic) {
	// ...
    char *addr = (char*)obj;
	addr += offset;
    // ...
	if (isAtomic) {                                 // in our case, `isAtomic` is `true`
		volatile int *lock = lock_for_pointer(addr);
		lock_spinlock(lock);                        // locked on lock_for_pointer(obj + offset) 
		ret = *(id*)addr;
		ret = objc_retain(ret);                     // <==== we're inside here @ bt #7
	    // ...

At this point we've taken the lock_spinlock(0xaaaafd5ab608 + propOffset)

At #6 the objc_msg_lookup_sender() calls the objc_msg_lookup_internal() version, and where this deadlock only happens during initial startup, I'm assuming we're hitting the 'uninstalled dtable' path:

static __attribute__((always_inline)) struct objc_slot2 *objc_msg_lookup_internal(id *receiver, SEL selector, uint64_t *version) {
    // ...
	struct objc_slot2 * result = objc_dtable_lookup(class->dtable, selector->index);
	if (UNLIKELY(0 == result)) {
		dtable_t dtable = dtable_for_class(class);
		/* Install the dtable if it hasn't already been initialized. */
		if (dtable == uninstalled_dtable) {
			objc_send_initialize(*receiver);   // <===== `receiver == (0xaaaafd5ab608 + propOffset)`
            // ...

Inside objc_send_initialize(), we essentially hit a @synchronized() block:

OBJC_PUBLIC void objc_send_initialize(id object) {
    Class class = classForObject(object);
    // ...
    Class meta = class->isa;
    // ...
    LOCK_OBJECT_FOR_SCOPE((id)meta);  // <=== essentially `meta == [(0xaaaafd5ab608 + propOffset).class]->isa`
    // ...
}

Essentially LOCK_OBJECT_FOR_SCOPE() is a macro that runs objc_sync_enter() / objc_sync_exit() automatically for the current lexical scope, like a @synchronized block would. So now we're at #4, and we attempt to either loop up or create the lock for this property:

OBJC_PUBLIC int objc_sync_enter(id object) {
	if ((object == 0) || isSmallObject(object)) { return 0; }
	struct reference_list *list = referenceListForObject(object, YES);  // <== looking up the `reference_list` can cause a `lock_spinlock()`
                                                                        //     to be taken on the `object` when `create == YES`
    // ...
}

Once referenceListForObject() is called with create == YES, then within referenceListForObject(), if the reference_list->lock needs to be initialized, we take another spin lock, this time on [(0xaaaafd5ab608 + propOffset).class]->isa essentially:

static struct reference_list* referenceListForObject(id object, BOOL create) {
	if (class_isMetaClass(object->isa)) {
		Class cls = (Class)object;
		if ((NULL == cls->extra_data) && create) {
			volatile int *lock = lock_for_pointer(cls);
            // ...
			lock_spinlock(lock);            //  <====  maybe here?
			// ...
		}
		return cls->extra_data;
	}
	Class hiddenClass = findHiddenClass(object);
	if ((NULL == hiddenClass) && create) {
		volatile int *lock = lock_for_pointer(object);
		lock_spinlock(lock);                //  <====  maybe here?
		// ...
	}
	return hiddenClass ? object_getIndexedIvars(hiddenClass) : NULL;
}

In either code path where the lock needs to be initialized, we're essentially now taking a second lock_for_pointer(), and I'm thinking in some cases the pointer hashes collide, making us lock_spinlock() on the same spinlock for both pointers... and in turn, deadlocking the main thread.

Unfortunately, I haven't figured out a good solution for this one yet. But I at least wanted to get this issue posted, in case anyone else has seen weird behavior with atomic properties, or this sparks any ideas on how to fix the issue.

The text was updated successfully, but these errors were encountered:

ERobsham · 2024-03-08T20:14:55Z

I'm still trying to work through this, but initially I'm reaching for adding another set of spinlocks as a fairly straightforward solution?

So something along the lines of dropping int spinlocks[spinlock_count]; from properties.m and replacing it with two sets of spinlocks in spinlock.h:

PRIVATE int prop_spinlocks[spinlock_count];
PRIVATE int associate_spinlocks[spinlock_count];

And then update from one lock_for_pointer(..) function to having matching prop_lock_for_pointer(..) / associate_lock_for_pointer(..) functions that properties.m / associate.m would use respectively.

I'm not 100% on this, but it does appear as though the associate.m spinlock usage is more closely related to ensuring class / associated lock initializations are done, where the properties.m spinlock usage is more just for memory management of the actual properties backing storage..?

And that should at least ensure objc_getProperty() - > ... ->referenceListForObject() spinlocks could never collide.

Update: added PR #285 as a better illustration of the idea.

davidchisnall · 2024-03-09T08:19:18Z

I think that is the right solution, yes. I, a bit confused about how you get here though. The retain call will create a dtable only if the class of the object has not been sent any messages before. I’m not sure what would trigger that. How are you assigning an object as an ivar of another object without ever sending it any messages? I guess it must be an unsafe property or it would have been sent a retain message on the way in.

ERobsham · 2024-03-09T13:11:19Z

That is a great question... I didn't actually look into the application code (I was just brought that stack trace, and told it was only happening during initial startup, which is what lead me down this path). I wouldn't be surprised as I have definitely seen a number of direct ivar assignments of a properties backing storage in the codebase in question. I can definitely confirm if that is the case though.

ERobsham · 2024-03-09T13:36:35Z

Well now I'm confused too. The initial assignment for the property I was told produced this specific stack trace was assigned via the property accessor, IE:

self.atomicProp = [[[SomeClass alloc] initWith:...] autorelease];

...? That would definitely mean it had to run through objc_retain() while outside the properties spinlock at least once...

davidchisnall · 2024-03-09T17:30:17Z

The +alloc there should have initialised the class, so we shouldn’t hit that path at all.

Can you run with Asan or valgrind? You would see the same symptoms if the ivar contained a dangling pointer.

davidchisnall · 2024-03-10T13:28:15Z

Wait... did you say that this class is also directly assigning to the ivar? That's almost certainly UB: if you have an atomic property, then you must use it to get the atomicity. If you are assigning to an ivar directly and it is concurrently assigned to via an atomic property then this is not atomic and may introduce temporal safety bugs.

ERobsham · 2024-03-10T13:37:11Z

I thought it was a possibility, in the codebase there is ~20 years worth of code, and some of it definitely has ... less than ideal 'features', lets say?

However, I did double check the specific application code the stack trace above came from and it does seem to be uniformly using property accessors correctly (no unsafe direct accesses).

ERobsham · 2024-03-10T14:05:39Z

Can you run with Asan or valgrind?

Yeah, I can try to get another repro with valgrind. I might even try to create a more simple (more importantly, non-proprietary) example I can share if I can get something that'll somewhat reliably repro this issue (I think getting those pointer hashes to collide is the key, which makes this pretty random).

You would see the same symptoms if the ivar contained a dangling pointer.

So the flow leading up to this deadlock is super simple, and nothing other than an initial assignment then method call happens on the properties. Basically flows like:

`main()` creates an instance of the main class -> 
`[[MainAppClass alloc] init]` where all the properties are initialized, 
        then right before it exits `-init` it calls `-start` -> 
`[self start]` this is where most of the properties are 'started', and 
        `[self.atomicProperty start]` gets run. In this case, we ended up 
        deadlocked inside the `objc_getProperty() -> objc_retain() -> ... ` call chain.

Im not too sure how we'd get to a point where there is a dangling pointer for that property, where that is all very straightforward, and even all happens on a single thread...?
But I'm also not sure how we'd get through an +alloc / objc_setProperty() -> objc_retain() / etc previously, and then hit this class init path either...?

Unfortunately, I couldn't find another path through the code that would lead to taking out two spinlocks... But it feels like there must be something else that I'm still missing here...

davidchisnall · 2024-03-12T08:26:01Z

There’s one other way you could get there: if the isa pointer of the object in the ivar is corrupted (e.g. to point to a different, u initialised, class).

ERobsham · 2024-03-13T23:57:10Z

Okay, so had some other fun to attend to the last couple of days, but the latest update is: its looking like the compiler might be reordering / 'optimizing' things on us?

Still need to test out a theory, but at first glance when stepping through in gdb with all the libobjc symbols loaded in from our toolchain, it's now looking like the ret var in objc_getProperty() might still be uninitialized when its getting retain called on it. Almost as if whatever was on the stack previously where the function gets pushed on is what getting the retain called on it, instead of the calculated addr.

Anyways, still chipping away at this one between some other tasks, so I'll update again once I get a couple theories confirmed/squashed.

davidchisnall · 2024-03-16T12:35:00Z

That shouldn't be possible, but if it is happening then you can try adding an atomic signal fence after the spinlock acquire. It may also be that LLVM is now better at optimising atomics and we need to make the spinlock code use proper C11 atomics with acquire / release semantics to ensure that the reordering doesn't happen.

ERobsham · 2024-04-02T20:13:40Z

Well I'm no closer to understanding the why, but I do have a bit more data at least. I tried out reordering the assignment of ret a bit, to try and make sure it happened before any volatile variable access (in objc_getProperty(..)):

-	id ret;
+	id ret = *(id*)addr;
	if (isAtomic)
	{
		volatile int *lock = lock_for_pointer(addr);
		lock_spinlock(lock);
-		ret = *(id*)addr;
		ret = objc_retain(ret);
		unlock_spinlock(lock);
		ret = objc_autoreleaseReturnValue(ret);
	}

But my coworker still hit this same issue again (This time we got some extra symbols loaded into gdb, so the stack trace is much clearer):

(gdb) bt
#0  0x0000ffff98102124 in clock_nanosleep () from target:/lib/libc.so.6
#1  0x0000ffff9810770c in nanosleep () from target:/lib/libc.so.6
#2  0x0000ffff981075e4 in sleep () from target:/lib/libc.so.6
#3  0x0000ffff982af9f0 in lock_spinlock (spinlock=0xffff982db32c) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/spinlock.h:77
#4  referenceListForObject (object=0xaaaafdf962d0, create=<optimized out>) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/associate.m:294
#5  0x0000ffff982afed4 in objc_sync_enter (object=0x0) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/associate.m:398
#6  0x0000ffff982a06ec in objc_send_initialize (object=<optimized out>) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/dtable.c:718
#7  0x0000ffff982a8b44 in objc_msg_lookup_internal (receiver=0xffffd772abd0, selector=0xffff982d8d50, version=0x0) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/sendmsg2.c:107
#8  objc_msg_lookup_sender (receiver=0xffffd772abd0, selector=0xffff982d8d50, sender=<optimized out>) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/sendmsg2.c:200
#9  0x0000ffff982adff0 in retain (obj=<optimized out>) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/arc.m:291
#10 objc_retain (obj=<optimized out>) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/arc.m:571
#11 0x0000ffff982b0b7c in objc_getProperty (obj=<optimized out>, _cmd=<optimized out>, offset=<optimized out>, isAtomic=<optimized out>) at /usr/src/debug/libobjc2/2.1.0+AUTOINC+5105c6e747-r0/git/properties.m:35
#12 0x0000ffff94a1370c in -[OurClass ourMainThreadMethodCallThatIsDecendsFromMain] (self=0xaaaafdc53428, _cmd=<optimized out>) at ____.m:4570
...

This time we can definitively see that we're hitting the objc_send_initialize() path I assumed in the beginning, but it still doesn't explain the why we're hitting this path.

And those receiver=0xffffd772abd0 values are still throwing me for a loop. I expected those to be something like the self=0xaaaafdc53428 + an offset, but that is wayyy off. Which is what was leading me down the path that the ret value was uninitialized when the retain got called with it.

ERobsham mentioned this issue Mar 8, 2024

Potential atomic property/init deadlock fix #285

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rare class init / atomic property deadlock #284

Rare class init / atomic property deadlock #284

ERobsham commented Mar 8, 2024

ERobsham commented Mar 8, 2024 •

edited

Loading

davidchisnall commented Mar 9, 2024

ERobsham commented Mar 9, 2024

ERobsham commented Mar 9, 2024

davidchisnall commented Mar 9, 2024

davidchisnall commented Mar 10, 2024

ERobsham commented Mar 10, 2024

ERobsham commented Mar 10, 2024

davidchisnall commented Mar 12, 2024

ERobsham commented Mar 13, 2024

davidchisnall commented Mar 16, 2024

ERobsham commented Apr 2, 2024

Rare class init / atomic property deadlock #284

Rare class init / atomic property deadlock #284

Comments

ERobsham commented Mar 8, 2024

ERobsham commented Mar 8, 2024 • edited Loading

davidchisnall commented Mar 9, 2024

ERobsham commented Mar 9, 2024

ERobsham commented Mar 9, 2024

davidchisnall commented Mar 9, 2024

davidchisnall commented Mar 10, 2024

ERobsham commented Mar 10, 2024

ERobsham commented Mar 10, 2024

davidchisnall commented Mar 12, 2024

ERobsham commented Mar 13, 2024

davidchisnall commented Mar 16, 2024

ERobsham commented Apr 2, 2024

ERobsham commented Mar 8, 2024 •

edited

Loading