Sporadic panic in async-io thread with esp-sys-svc async TLS example #84

jothan · 2024-03-21T14:26:24Z

Code: https://github.com/esp-rs/esp-idf-svc/blob/4fff46bba1be66ae3c6945d3b8bda30a589f6f6b/examples/tls_async.rs

Crate versions:
Cargo.lock

I am running the example on a ESPcam board, with an esp32 (revision v3.0) Xtensa module (Tinker AI ESP32-S, to be exact).

The example usually crashes once or twice after reset before working.

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

[profile.de[v]
debug = true
opt-level = "s"
codegen-units = 1

The text was updated successfully, but these errors were encountered:

Vollbrecht · 2024-04-09T17:40:26Z

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

To be more concrete you never saw it before applying this optimization?

jothan · 2024-04-09T17:54:58Z

I've only managed to reproduce this with an optimized debug build with codegen-units=1.

To be more concrete you never saw it before applying this optimization?

To be honest, it seems to depend on code layout. It might trigger in other conditions, I have not managed to narrow it down. It smells like a race condition of some sort.

notgull · 2024-04-09T17:59:53Z

cc @ivmarkov Any idea why this would be happening?

notgull · 2024-04-10T00:18:32Z

The issue actually appears to be in our locking implementation, specifically OnceCell.

ivmarkov · 2024-04-10T02:31:14Z

cc @ivmarkov Any idea why this would be happening?

No idea. Will try to reproduce once I'm back.

jothan · 2024-04-24T17:44:36Z

@ivmarkov have you had any luck reproducing this ? Let me know if you need further input and testing from my part.

ivmarkov · 2024-04-24T18:13:50Z

Couldn't get to there yet. Sorry about that! Very likely will try to do something related to async-io in the next couple of days.

npmenard · 2024-04-24T20:59:38Z

I am running in the same issue as described by @jothan

I have done some investigation and I believe the issue boils down to the pinning of event_listener In this case the address generated is not 32 bit aligned (ex : 0x3ffe860f not sure if this matters) but more importantly point back to an area of the stack where the first byte is not 0x0. Using GDB i can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread) since the remain bytes are 0 maybe 0xffe8610 would be the correct pinned address?
Interestingly if I add a dummy read before pinning like :

let event_listener = EventListener::new();
let _ = event_listener.is_listening();
pin!(event_listener);

The pinned address will be 32bit aligned pointing to a stack area where all bytes are 0 and the bug doesn't happen.

For reference here are the two assembly dump of the function
not working
working

Hope that helps @ivmarkov

ivmarkov · 2024-05-01T18:39:47Z

@npmenard Thank you for the detailed analysis!

I can reproduce the issue (though I haven't connected with GDB to examine the address), and I confirm that the dummy listener.is_listening(); is fixing the issue for me too - but only in a more trivial case I have. In a more complex setup, I have another crash - something related to thread locals - but let's put that aside for a moment.

(Also, not sure if that matters, but I'm consistently reproducing the issue with opt-levels "s" and "z" (haven't tried with other opt levels yet, will do)).

As for:

I have done some investigation and I believe the issue boils down to the pinning of event_listener In this case the address generated is not 32 bit aligned (ex : 0x3ffe860f not sure if this matters)

I assume by address 0x3ffe860f you mean the &mut EventListener which is actually stored in the Pin struct?
Yes, not being 32-bit aligned sounds suspicious, as long as the Event:Listener contains (recursively) at least one integer type which is 4 bytes. As in i32 or u32. If that's the case, the whole structure needs to be aligned to 4 bytes of course, or else the xtensa cpu would generate a trap.

but more importantly point back to an area of the stack where the first byte is not 0x0. Using GDB i can see that the value it points to is 0x3f (which is the MSB of an address used during the creation of the thread) since the remain bytes are 0 maybe 0xffe8610 would be the correct pinned address?

Even more suspicious. This sounds a bit like a bug in the LLVM xtensa backend.

@MabezDev - sorry for pulling you here, but do you think you and/or Andrei can help? There are assembly dumps attached to the previous comment.
I can also try to simplify and provide a small(er) test case, though with async-io in the picture, it would be an ESP IDF IO example in the end...

npmenard · 2024-05-03T17:09:39Z

I assume by address 0x3ffe860f you mean the &mut EventListener which is actually stored in the Pin struct?
Yes, not being 32-bit aligned sounds suspicious, as long as the Event:Listener contains (recursively) at least one integer type which is 4 bytes. As in i32 or u32. If that's the case, the whole structure needs to be aligned to 4 bytes of course, or else the xtensa cpu would generate a trap.

It's the address that would be printed when formatting event_listener with {:p} after it was pinned for example :
log::info!( "pin_ev_ptr {:p}", event_listener);
In my case the cpu doesn't generate a trap when loading at an unaligned address. To illustrate see is_listening assembly dump and the registers status when we check self.listener.listener.is_some() (l32i.n a2, a2, 0). Register a2 contains the unaligned address and once the load is executed it will contain 3f when it should be 0.

I have noticed something odd but I am not sure if it will shade some light, whenever the bug occurs the value printed by
log::info!( "pin_ev_ptr {:p} {:x}", event_listener, *(&(&**(&event_listener.pointer))) as *const _ as u32); are different.

Lastly I am attaching a backtrace showing the code path that writes at the location in stack where ultimatley the Pin will point to. It happens during WindowOverflow8 exception is triggered when setting a thread local variable.

Hopefully this helps a bit. Let me know if there is anything else I can do to help.

jothan · 2024-05-13T17:20:43Z

With the new Rust 1.78 for ESP32 (in pre-release in the rust-build project), debug assertions for std are now enabled and seem to catch a similar alignment problem with an assertion in core::ptr::read::precondition_check.

Full backtrace: backtrace.txt

Partial backtrace:

Guru Meditation Error: Core  0 panic'ed (Unhandled debug exception).                                                                                          
Debug exception reason: BREAK instr                                                                                                                           
Core  0 register dump:                                                                                                                                        
PC      : 0x400f0d06  PS      : 0x00060c36  A0      : 0x800f0ab0  A1      : 0x3ffcdfc0                                                                        
0x400f0d06 - core::panicking::panic_nounwind_fmt::runtime                                                                                                     
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/panicking.rs:91                                                                
A2      : 0x3ffce060  A3      : 0x00000040  A4      : 0x000001c0  A5      : 0x3ffce220                                                                        
A6      : 0x00060b23  A7      : 0x00060b20  A8      : 0x800f0d01  A9      : 0x3ffcdfa0                                                                        
A10     : 0x00000000  A11     : 0x00000040  A12     : 0x000000f8  A13     : 0x3ffce4e0                                                                        
A14     : 0x3fff3fc8  A15     : 0x400f0cf4  SAR     : 0x00000018  EXCCAUSE: 0x00000001                                                                        
0x400f0cf4 - core::ptr::read::precondition_check                                                                                                              
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/intrinsics.rs:2797                                                             
EXCVADDR: 0x00000000  LBEG    : 0x4000c2e0  LEND    : 0x4000c2f6  LCOUNT  : 0x00000000                                                                        
                                                                                                                                                              
                                                                                                                                                              
Backtrace: 0x400f0d03:0x3ffcdfc0 0x400f0aad:0x3ffcdfe0 0x400f0beb:0x3ffce5c0 0x400ee62d:0x3ffce860 0x400f14bb:0x3ffce880 0x400f148d:0x3ffce8b0 0x400d6e7e:0x3\
ffce8e0 0x400e3afe:0x3ffce930 0x400decc1:0x3ffced20 0x401a1419:0x3ffcf150 0x400d6a52:0x3ffcf170 0x400d479b:0x3ffcf230 0x400d6450:0x3ffcf350 0x400e82f1:0x3ffc\
f5a0 0x400da0c2:0x3ffcf880 0x400da994:0x3ffcf8a0 0x4011b067:0x3ffcf8d0 0x4011b04e:0x3ffcf8f0 0x40120ff3:0x3ffcf910 0x4012cf40:0x3ffcf930                      
0x400f0d03 - core::ptr::read::precondition_check                                                                                                              
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1282                                                                
0x400f0aad - core::ptr::read                                                                                                                                  
    at /home/user/.rustup/toolchains/esp/lib/rustlib/src/rust/library/core/src/ptr/mod.rs:1276                                                                
0x400f0beb - async_lock::once_cell::OnceCell<T>::get_or_try_init_blocking                                                                                     
    at /home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-lock-3.3.0/src/once_cell.rs:441                                                  
0x400ee62d - async_lock::once_cell::OnceCell<T>::get_or_init_blocking

jothan · 2024-05-13T18:20:43Z

Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line.

ivmarkov · 2024-05-13T18:29:21Z

Looking at the backtraces, it would seem that OnceCell does something weird that causes EventListener trouble down the line.

Why would OnceCell be doing something weird and moreover, this something only happens with esp idf + xtensa and not on x86 and (likely) arm? I still find it more likely that the root cause is a miscompilation bug in the llvm xtensa backend.

Will open a bug at the xtensa llvm fork shortly btw.

And will test (again) with riscv32imc on the esp idf, though I'm relatively sure the issue did not happen there.

jothan · 2024-05-13T18:32:22Z

@ivmarkov Doing something weird in this case is triggering the miscompilation. I don't doubt your analysis.

ivmarkov · 2024-05-13T19:04:32Z

Other than dereferencing a raw pointer, I don't see anything else weird, unfortunately...

ivmarkov · 2024-05-23T16:42:08Z

I know there is little to no progress here, for which I apologize. In the meantime, what I found to work for me is setting the version of async-io back to "2.0.0" (as in async-io = "=2.0.0").

We'll resume the work on finding the root cause shortly.

notgull · 2024-06-01T19:26:21Z

Does this issue still occur with async-lock v3.4.0?

jothan · 2024-06-01T23:00:17Z

@notgull It still panics in the same spot with debug assertions on (Rust 1.78), however, it does not crash when they are off.

jothan · 2024-08-01T15:34:27Z

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

ivmarkov · 2024-08-01T15:43:12Z

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

@MabezDev sorry for the ping ^^^

@jothan I know this is not really a proper fix, but you can give async-io-mini a try until a proper fix is in-place. async-io-mini is API-compatible with async-io as long as you don't use the async-io timers - ~~which you anyway probably shouldn't as on the ESP IDF they have a minimum res of 10ms~~ (UPDATE: on a second thought, this is likely not true).

jothan · 2024-08-01T16:38:26Z

@ivmarkov Any news on the LLVM front ? I can reproduce unaligned pointers on the latest 1.80 toolchain.

@MabezDev sorry for the ping ^^^

@jothan I know this is not really a proper fix, but you can give async-io-mini a try until a proper fix is in-place. async-io-mini is API-compatible with async-io as long as you don't use the async-io timers - ~~which you anyway probably shouldn't as on the ESP IDF they have a minimum res of 10ms~~ (UPDATE: on a second thought, this is likely not true).

I might give async-io-mini a try, my blocker is mostly my use of smol-hyper (that depends on regular async-io). I am also using some async-io timers, but I can always use some other timers.

bugadani · 2024-09-11T08:30:41Z

Did any of you end up opening an issue in https://github.com/espressif/llvm-project ? I can't find it.

ivmarkov · 2024-09-11T09:09:14Z

Did any of you end up opening an issue in https://github.com/espressif/llvm-project ? I can't find it.

I didn't.
I really feel uncomfortable opening it and providing details myself, as my knowledge in xtensa is exactly 0%.
I keep pushing though.

Would you like to take it upon yourself? Would really appreciate that and you might be the right person, given your xtensa mis-compilations' history, so to say.

bugadani · 2024-09-11T11:54:52Z

I can reliably reproduce a panic with this:

use core::pin::pin;

use esp_idf_svc::io;

// Don't forget to raise the CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT in `sdkconfig.defaults` to > 4K so that the
// `async-io` background thread can work fine
pub fn main() {
    esp_idf_svc::sys::link_patches();
    esp_idf_svc::log::EspLogger::initialize_default();
    io::vfs::initialize_eventfd(5).unwrap();

    // You can use `esp_idf_svc::hal::task::block_on` as well
    async_io::block_on(pin!(async move { Result::<_, anyhow::Error>::Ok(()) }));
}

thread 'async-io' panicked at C:\Users\bugad\.rustup\toolchains\esp\lib\rustlib\src\rust\library\core\src\panicking.rs:219:5:
unsafe precondition(s) violated: ptr::read requires that the pointer argument is aligned and non-null

But removing pin! just makes this work.

This sounds a bit different from the report here, so can someone please verify if this is the actual issue?

ivmarkov · 2024-09-11T12:39:43Z

I can reliably reproduce a panic with this:

use core::pin::pin;

use esp_idf_svc::io;

// Don't forget to raise the CONFIG_PTHREAD_TASK_STACK_SIZE_DEFAULT in `sdkconfig.defaults` to > 4K so that the
// `async-io` background thread can work fine
pub fn main() {
    esp_idf_svc::sys::link_patches();
    esp_idf_svc::log::EspLogger::initialize_default();
    io::vfs::initialize_eventfd(5).unwrap();

    // You can use `esp_idf_svc::hal::task::block_on` as well
    async_io::block_on(pin!(async move { Result::<_, anyhow::Error>::Ok(()) }));
}

thread 'async-io' panicked at C:\Users\bugad\.rustup\toolchains\esp\lib\rustlib\src\rust\library\core\src\panicking.rs:219:5:
unsafe precondition(s) violated: ptr::read requires that the pointer argument is aligned and non-null

But removing pin! just makes this work.

This sounds a bit different from the report here, so can someone please verify if this is the actual issue?

Is it?
To me, it sounds suspiciously close to this and the following comments.

It is as if the code generated by the pin! macro results in a mis-aligned pointer address, due to mis-compilation?

pin! generates simply:

$crate::pin::Pin::<&mut _> { __pointer: &mut { $value } }

According to the report I quoted, the above code results in:
a) ~~__pointer containing misaligned address~~ UPDATE: or more likely calling Deref on the Pin type
b) This address seems to be wrong (off by 1 byte)

ivmarkov · 2024-09-11T12:48:02Z

Actually @npmenard says something even more interesting:

I have noticed something odd but I am not sure if it will shade some light, whenever the bug occurs the value printed by
log::info!( "pin_ev_ptr {:p} {:x}", event_listener, *(&(&**(&event_listener.pointer))) as *const _ as u32); are different.

Sounds like Deref on Pin produces bogus code.

bugadani · 2024-09-11T14:37:40Z

Well I guess we can blame pin, but that might also be independent of our issue. What's weird is that pin seems to carbon-copy whatever you give it, which may be slightly problematic for a self-referencing generator.

Whether this is related to an architecture-specific crash... I don't know.

ivmarkov · 2024-09-11T16:27:52Z

Well I guess we can blame pin, but that might also be independent of our issue.

I'm not blaming pin. But so far, it seems we can only reproduce the mis-compilation when pin is used. I'm sure the mis-compilation has a bigger blast radius, but so far we have seen it happening when pin is somehow involved.

What's weird is that pin seems to carbon-copy whatever you give it, which may be slightly problematic for a self-referencing generator.

I would not expect anything else given that Rust arrays are Copy? self-referencing generators that you get from async are certainly not, but I think this point is (hopefully) irrelevant for the issue at hand.

bugadani · 2024-09-11T16:36:34Z

Why do you expect &mut array to duplicate the array on stack?

Regardless, I can reproduce the issue (using different opt levels, I get very different crashes and occasionally something working), I'm trying to minimize it somehow.

ivmarkov · 2024-09-11T16:48:20Z

Why do you expect &mut array to duplicate the array on stack?

Because the pin! macro does not seem to simply expand to a &mut array?
I find this comment relevant (line 1959). It talks about moving, but since arrays are Copy, what happens is copying instead of moving, I guess. Would you agree?

Regardless, I can reproduce the issue (using different opt levels, I get very different crashes and occasionally something working), I'm trying to minimize it somehow.

Good news - if you can minimize it, you would be our hero! :)

bugadani · 2024-09-12T11:58:26Z

So, in my case, the root cause of the crash is:

Reactor is a 448 byte struct, with alignment 64
OnceCell initializes this struct in an async closure, and taking the value out of a Ready is the pointer read where my app crashes
The cause of the crash is that Ready and its contents aren't placed on a 64-byte allocated address. (in my case, 0xsomethingsomethingA0, or 0x...30)

Now let's see if I can break a non-idf application with this knowledge :)

ivmarkov · 2024-09-12T12:06:46Z

So, in my case, the root cause of the crash is:

Reactor is a 448 byte struct, with alignment 64

Alignment of 64 bytes? :o How is that even possible? What might require such an alignment?
I think on xtensa, the maximum alignment is 64 bits which is only 8 bytes?

OnceCell initializes this struct in an async closure, and taking the value out of a Ready is the pointer read where my app crashes

The cause of the crash is that Ready and its contents aren't placed on a 64-byte allocated address. (in my case, 0xsomethingsomethingA0, or 0x...30)

Yes this is aligned to 64bits, but obviously not to 64 bYtes.

bugadani · 2024-09-12T12:20:28Z

https://doc.rust-lang.org/std/mem/fn.align_of_val.html

Returns the ABI-required minimum alignment of the type of the value that val points to in bytes.

The culprit is ConcurrentQueue, it raises the alignment requirement (with a contained CachePadded type, which is repr(align(64))).

ivmarkov · 2024-09-12T12:23:37Z

I think you are getting there!
It might well be, that the LLVM xtensa backend simply does not support such an alignment. Or would that be the linker? Not sure...

ivmarkov · 2024-09-17T12:14:18Z

@bugadani Did you already open an issue at https://github.com/espressif/llvm-project ?
Can't see any.

bugadani · 2024-09-17T12:22:25Z

I've reported the issue internally

ivmarkov · 2024-09-17T13:11:21Z

Not sure what is that internal repo, but... I hope you'll inform us how it is developing :)

jothan · 2024-09-30T13:32:20Z

I just read about the atomics-in-psram issue. I am using both cores with PSRAM on an ESP32-cam board. I am boxing some futures on the heap that might end up in PSRAM. I think I'm going to revert to a single threaded executor or disable the second core entirely for stability's sake.

ivmarkov · 2024-10-16T08:39:04Z

@bugadani Is there any development in the internal repo that you could share with us all?

bugadani · 2024-10-16T09:22:46Z

I don't have any updates.

jothan mentioned this issue Apr 9, 2024

Sporadic crash when running async TLS example esp-rs/esp-idf-svc#407

Closed

notgull transferred this issue from smol-rs/async-io Apr 10, 2024

ivmarkov mentioned this issue May 12, 2024

ESP32 Web Socket Listener, TCP Server ivmarkov/edge-net#15

Closed

ivmarkov mentioned this issue Jun 3, 2024

socket leak? ivmarkov/async-io-mini#1

Closed

ivmarkov mentioned this issue Aug 21, 2024

Async TcpStream cannot send larger Amounts of Data esp-rs/esp-idf-svc#472

Closed

ivmarkov mentioned this issue Sep 25, 2024

Long Running Futures Eventually Stall ivmarkov/edge-executor#3

Open

Sporadic panic in async-io thread with esp-sys-svc async TLS example #84

Sporadic panic in async-io thread with esp-sys-svc async TLS example #84

Comments

jothan commented Mar 21, 2024

Vollbrecht commented Apr 9, 2024

jothan commented Apr 9, 2024

notgull commented Apr 9, 2024

notgull commented Apr 10, 2024

ivmarkov commented Apr 10, 2024

jothan commented Apr 24, 2024

ivmarkov commented Apr 24, 2024

npmenard commented Apr 24, 2024 • edited Loading

ivmarkov commented May 1, 2024

npmenard commented May 3, 2024 • edited Loading

jothan commented May 13, 2024

jothan commented May 13, 2024

ivmarkov commented May 13, 2024

jothan commented May 13, 2024

ivmarkov commented May 13, 2024

ivmarkov commented May 23, 2024

notgull commented Jun 1, 2024

jothan commented Jun 1, 2024

jothan commented Aug 1, 2024

ivmarkov commented Aug 1, 2024 • edited Loading

jothan commented Aug 1, 2024

bugadani commented Sep 11, 2024

ivmarkov commented Sep 11, 2024

bugadani commented Sep 11, 2024

ivmarkov commented Sep 11, 2024 • edited Loading

ivmarkov commented Sep 11, 2024

bugadani commented Sep 11, 2024

ivmarkov commented Sep 11, 2024

bugadani commented Sep 11, 2024

ivmarkov commented Sep 11, 2024 • edited Loading

bugadani commented Sep 12, 2024

ivmarkov commented Sep 12, 2024

bugadani commented Sep 12, 2024 • edited Loading

ivmarkov commented Sep 12, 2024 • edited Loading

ivmarkov commented Sep 17, 2024

bugadani commented Sep 17, 2024

ivmarkov commented Sep 17, 2024

jothan commented Sep 30, 2024

ivmarkov commented Oct 16, 2024

bugadani commented Oct 16, 2024

npmenard commented Apr 24, 2024 •

edited

Loading

npmenard commented May 3, 2024 •

edited

Loading

ivmarkov commented Aug 1, 2024 •

edited

Loading

ivmarkov commented Sep 11, 2024 •

edited

Loading

ivmarkov commented Sep 11, 2024 •

edited

Loading

bugadani commented Sep 12, 2024 •

edited

Loading

ivmarkov commented Sep 12, 2024 •

edited

Loading