Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic operations in PSRAM (esp. with 3rd party libraries) #2027

Open
ProfFan opened this issue Aug 28, 2024 · 23 comments
Open

Atomic operations in PSRAM (esp. with 3rd party libraries) #2027

ProfFan opened this issue Aug 28, 2024 · 23 comments
Labels
status:blocked Unable to progress - dependent on another task upstream The issue lies with a dependency

Comments

@ProfFan
Copy link
Contributor

ProfFan commented Aug 28, 2024

AFAIK the current atomic operations on the S3 uses native atomics. However this creates a problem: a 3rd party library could be using atomics on their data structure placed on the SPI PSRAM.

In the IDF, they handled it by first checking for PSRAM address, then try acquire the external RAM lock, then do CAS: https://github.com/espressif/esp-idf/blob/c9df77efbf871d4c3ae9fb828778ff8c4ab36804/components/esp_hw_support/cpu.c#L199

I wonder if a similar thing should be done for the S3 (for example, a PR to portable-atomic that adds a feature for a similar implementation?

Based on discussions with @ivmarkov

@bugadani
Copy link
Contributor

I don't see a simple way out of this without introducing our own, "PSRAM-safe" atomics. Which will be a usability and interoperability nightmare, because library developers will not likely use it, or even know about it.

Turning portable-atomic into a shim of sorts for us is at least technically possible, but still doesn't solve every issue - for MCUs that can use the core atomic types, those ones can't really be shimmed for us to provide our own implementation.

I don't think we can extend portable-atomic to know about our particular memory mappings.

@MabezDev
Copy link
Member

Do we know if an exception is raised whilst accessing atomics in psram, or does it just silently fail? If we can detect it, that at the very least opens up some options for us.

@ivmarkov
Copy link

Do we know if an exception is raised whilst accessing atomics in psram, or does it just silently fail? If we can detect it, that at the very least opens up some options for us.

I doubt any exception is raised. From the APP/PRO CPU POV, accessing PSRAM is like accessing regular internal RAM. And it ends up accessing regular internal RAM after all - thanks to the caching MMU.

With that said, I'm not in the position to discuss a possible fix, as I do not understand the operation of the MMUs to a deeper-enough detail. All I know is that ESP-IDF provides a special xchg atomic implemented in software, which is to be used when PSRAM is enabled, as pointed to by @ProfFan .

With that said, whoever wrote and maintains this is probably in that position. (@bjoernQ - is that you?).

I would be happy to know though why hardware atomics cannot work with PSRAM:

  • Is it because the two CPUs might have their own separate dram caches for the same psram; why would we even want such a setup?
  • Or is it because an atomic might span across two cache-lines (the end of one, and the beginning of the other) and as such the writing might actually not be atomic at all?

Anyway, I guess me and @ProfFan are just raising the flag that there is an issue with atomics and PSRAM, and somebody needs to look into it. For both ESP-IDF and baremetal by the way, as the ESP-IDF multicore targets are also happily using only hardware atomics ATM; and once we know of a fix, we should probably change the definition of the upstream multicore STD targets from just using hardware atomics to emitting calls to the atomic intrinsic functions, so that the user code can use whatever ESP-IDF had put in the definition of these intrinsic functions, depending on user settings).

Ah, one more question, couldn't resist:

  • Would hardware atomics work correctly, if the programmer is ensuring that the relevant PSRAM memory region is accessed only by one of the two CPUs? As in, do we even have a workaround for the current situation...

@bjoernQ
Copy link
Contributor

bjoernQ commented Aug 30, 2024

With that said, whoever wrote and maintains this is probably in that position. (@bjoernQ - is that you?).

Yes, I ported the code and no I don't know. I also doubt an exception is raised there but we can just give it a try

I'm also curious about the "why doesn't it just work"

@MabezDev
Copy link
Member

I'll try and find those answers and get back to you.

In the meantime, the only truly bullet-proof way to solve this right now is to have a target variant where we disable hardware atomics, falling back to portable-atomic or something else to ensure atomicity (note: this is exactly what esp-idf does, when the work around is enabled). I'd really like to avoid doing this though if we can help it.

@ivmarkov
Copy link

I'll try and find those answers and get back to you.

In the meantime, the only truly bullet-proof way to solve this right now is to have a target variant where we disable hardware atomics, falling back to portable-atomic or something else to ensure atomicity (note: this is exactly what esp-idf does, when the work around is enabled). I'd really like to avoid doing this though if we can help it.

The question is, is portable-atomic even having a feature that would enable atomics (at least on bare-metal, for ESP-IDF we have other options) to operate correctly.

The critical-section one is tempting, but do you remember why our support for u64 on multicore riscv got kicked out from upstream STD? Because it essentially used disable-all-interrupts-critical-sections-plus-spinlocks and they said using spinloocks for atomic operations defeats their very purpose in the first place.

Also if you notice, the software PSRAM-safe xchg impl in the ESP-IDF does something different... yet similar. They seem to implement the "fallible" xchg variant only, which can sporadically fail. But then the question is, what would the user do when presented with such an xchg impl, other than re-trying in a loop, which suspiciously looks like a cousin to spinlocks anyway...

@ProfFan
Copy link
Contributor Author

ProfFan commented Aug 30, 2024

  • Would hardware atomics work correctly, if the programmer is ensuring that the relevant PSRAM memory region is accessed only by one of the two CPUs?

I think the answer currently for the S3 is most likely true but I am not sure. It's probably good to open an issue in esp-idf and tag @igrr for help 🍭

for MCUs that can use the core atomic types, those ones can't really be shimmed for us to provide our own implementation.

If we have a custom target (for example xtensa-none-elf-esp32s3) this can be put as a part in core right? As we build core with the application anyway.

The critical-section one is tempting, but do you remember why our support for u64 on multicore riscv got kicked out from upstream STD? Because it essentially used disable-all-interrupts-critical-sections-plus-spinlocks and they said using spinloocks for atomic operations defeats their very purpose in the first place.

I wonder what they did for Cortex-M0/M0+ and such? See https://doc.rust-lang.org/std/sync/atomic/index.html#portability
thumbv6m does not support CAS at all.

@ivmarkov
Copy link

I wonder what they did for Cortex-M0/M0+ and such? See https://doc.rust-lang.org/std/sync/atomic/index.html#portability thumbv6m does not support CAS at all.

If Cortex-M0/M0+ is always single-core (is it?) then disable-all-interrupts-one-way-or-another as a strategy does work fine and since it is a single core, you don't have the "is a spinlock a lock or just waiting" issue.

By the way... by reading the opening paragraph in the "Portability section"... namely:
"
All atomic types in this module are guaranteed to be lock-free if they’re available. This means they don’t internally acquire a global mutex. Atomic types and operations are not guaranteed to be wait-free. This means that operations like fetch_or may be implemented with a compare-and-swap loop.
"

I'm now even more puzzled as to why they kicked out our u64 multicore support...
I mean, if we talk about "spinlocks" (that we used in that implementation...)

  • Are these "locks" in the terminology of that section? <-- I DOUBT this now
  • Or are these "waits" in the terminology of that section <-- I now believe this is the case, and if so, our spinlocking impl should've not been kicked out in the first place...

@bugadani
Copy link
Contributor

If Cortex-M0/M0+ is always single-core (is it?)

No, see RP2040

@ivmarkov
Copy link

ivmarkov commented Aug 30, 2024

If Cortex-M0/M0+ is always single-core (is it?)

No, see RP2040

Ah so you mean for rp2040 they use a spinlock+disable-all-interrupts software impl? If yes, then this only reinforces my suspicion, that our u64 atomic impl Shall Be Re-Instated...

I guess at the time (and now still) I was not understanding the subject matter well enough to object their decision on technical ground...

@ivmarkov
Copy link

...and if spinlocking+disable-all-interrupts is fine as a strategy, then the critical-section impl of portable-atomic is good enough of a workaround for multicore.

@ProfFan
Copy link
Contributor Author

ProfFan commented Aug 30, 2024

No, see RP2040

Beaten me in 5 secs lol

Ah so you mean for rp2040 they use a spinlock+disable-all-interrupts software impl? If yes, then this only reinforces my suspicion, that our u64 atomic impl Shall Be Re-Instated...

Probably no, see rust-lang/rust#99668

...and if spinlocking+disable-all-interrupts is fine as a strategy, then the critical-section impl of portable-atomic is good enough of a workaround for multicore.

For real-time applications like what we use ESPs for I think this is worse than the IDF way of "best effort" weak CAS, but can live without if it's too complicated to make something similar in Rust.

@ivmarkov
Copy link

No, see RP2040

Beaten me in 5 secs lol

Ah so you mean for rp2040 they use a spinlock+disable-all-interrupts software impl? If yes, then this only reinforces my suspicion, that our u64 atomic impl Shall Be Re-Instated...

Probably no, see rust-lang/rust#99668

TL;DR :(
If you can summarize for me, what does anybody use for CAS in multicore M0?
Because I don't think - whatever is used - can avoid using a loop and a loop is as close to a spinlock as you can get? As both are essentially just wait-and-check OR wait-and-retry which is essentially the same?

...and if spinlocking+disable-all-interrupts is fine as a strategy, then the critical-section impl of portable-atomic is good enough of a workaround for multicore.

For real-time applications like what we use ESPs for I think this is worse than the IDF way of "best effort" weak CAS, but can live without if it's too complicated to make something similar in Rust.

Because the weak cas esp idf impl does not disable interrupts (which is... slow?) or if not why then?

Because a strong cas impl on top of a weak cas would also require a loop anyway...

@MabezDev
Copy link
Member

MabezDev commented Sep 11, 2024

I would be happy to know though why hardware atomics cannot work with PSRAM

Whilst it's related to the cache/MMU setup, the reason it doesn't work isn't PSRAM is treated too much like flash. Unfortunately, the appropriate signals to provide the required locking (which aren't needed when caching flash) were also not implemented for caching PSRAM, meaning the CPU will happily do atomic modifications on PSRAM memory without actually ensuring it's an atomic operation.

As you mentioned, PSRAM is treated just like DRAM in the esp32 and esp32s3, and because of this, we can't use the ATOMCTL register to control what the CPU does when it executes s32c1l on external memories.

Would hardware atomics work correctly, if the programmer is ensuring that the relevant PSRAM memory region is accessed only by one of the two CPUs? As in, do we even have a workaround for the current situation

Hmm, I'm not sure if this would work, consider the case of running one core but sharing an atomic variable between an interrupt. I believe these would be two different accesses, but I suppose the CPU (being a single core in this case) can only do one thing at a time so access would be sequential?

So in summary, we have no way of detecting this, we need to be proactive when using PSRAM. Essentially for every atomic access, we need to

  • Check if the variable is located in PSRAM
  • If it is, also acquire a global spinlock to ensure atomicity

I guess the question is where we do this. I already suggested a different target as a nuclear option but we can probably do better. Maybe we can do better, perhaps with something like -Ctarget-feature=-feature to disable atomics and fall back to libcalls that have the work around in place when psram is being used in the program?

@MabezDev
Copy link
Member

From our discussion in the meeting, we need to determine whether enabling portable-atomic/critical-section even on targets with native atomics will use a critical section. If it does, we at least have a nice-ish work around for the issue as we can detect PSRAM use and enable this feature.

This is not ideal though as it incurs a cost to all atomics, even ones stored in internal memory. Longer term it would be better to have a solution like esp-idf's where it checks where the atomic is before deciding whether to guard access with a spinlock.

@bugadani
Copy link
Contributor

As we're using atomic compare_exchange in our multi-core critical section implementation, can we enable portable-atomic/critical-section without things blowing up?

@MabezDev
Copy link
Member

As we're using atomic compare_exchange in our multi-core critical section implementation, can we enable portable-atomic/critical-section without things blowing up?

Ah I forgot about that, good catch. Fortunately, I think we can control this by using core::sync::atomic::* in that impl, and ensuring that the global spinlock gets placed in internal memory via a linker attribute.

@taiki-e
Copy link

taiki-e commented Oct 6, 2024

we need to determine whether enabling portable-atomic/critical-section even on targets with native atomics will use a critical section.

portable-atomic/critical-section is ignored if the target has atomic (i.e., cfg!(target_has_atomic = "ptr") is true).

I'm open to adding a new cfg to adjust the behavior here, but do not intend to change the default or add a new Cargo feature because:

  • Due to bad behavior of the cargo v1 resolver, other platform-specific dependencies enable this feature, causing build failures or silent performance degradation.
  • Some people unconditionally enable the portable-atomic/critical-section feature without reading documentation, e.g., regression with portable-atomics in 1.20.0 matklad/once_cell#264.
  • Not officially documented, but in the current portable-atomic, if cfg!(target_has_atomic = "N") is true, portable_atomic::Atomic{I,U}N is lock-free1, but changing this behavior risks introducing issues into the code that assumes it.

Footnotes

  1. Except in the case of a compiler or target-spec bug, such as https://github.com/rust-lang/rust/issues/117305, https://github.com/rust-lang/rust/issues/117306

@tom-borcin tom-borcin added this to the 0.22.0 milestone Oct 7, 2024
@MabezDev MabezDev added the upstream The issue lies with a dependency label Oct 8, 2024
@MabezDev
Copy link
Member

MabezDev commented Oct 8, 2024

Thanks @taiki-e ❤️!

I'm open to adding a new cfg to adjust the behavior here, but do not intend to change the default

I think this entirely reasonable, I have a couple of questions:

  • Do you have something in mind for the impl, or would you like us to have a go and PR to portable-atomic?
  • What do you think the name of the feature should be?
  • Do you think there is any scope for a conditional check? For instance we don't need to use the critical section if the atomic is placed in internal memory, so it would be nice to run a small address range check prior before committing to acquiring the lock.

@taiki-e
Copy link

taiki-e commented Oct 8, 2024

  • Do you have something in mind for the impl, or would you like us to have a go and PR to portable-atomic?
  • What do you think the name of the feature should be?
  • Do you think there is any scope for a conditional check? For instance we don't need to use the critical section if the atomic is placed in internal memory, so it would be nice to run a small address range check prior before committing to acquiring the lock.

I think there are two options on what implementation to use:

  • Add an option to use the (sharded) global lock based fallback (currently used for wide atomics on platforms where CAS is available) even if the corresponding native atomic operations are available.
    • If this is done, we also need to add an option to disable interrupts in the lock based fallback.
  • Add an option to use critical-section impl not the lock based fallback.

(The former is more efficient but more complex to implement (although there is already code related to interrupt disabling).)

And even two options on when to use it:

  • Platform-independent Implementation that does it in all cases.
  • Platform-specific implementation that does it only for a specific address

(The latter is more efficient but more complex as well.)

Of course the cfg name will be different depending on which option is selected.

Could you tell me which of these options you would actually like?

@MabezDev
Copy link
Member

I think there are two options on what implementation to use:

It sounds like using the critical-section impl would be more straight forward I think, but if you have some of the scaffolding ready for the global sharded lock, then it might be worth trying that. Either of these approaches would be acceptable to us.

And even two options on when to use it:

For our specific use case, the latter is more attractive to me. Note that we only need to do this on four targets, xtensa-esp32-none-elf, xtensa-esp32-esp-idf, xtensa-esp32s3-none-elf, xtensa-esp32s3-esp-idf. Are you suggesting special casing some address ranges for these targets, or do you think a more general solution of somehow supplying an address range to portable-atomic?

@taiki-e
Copy link

taiki-e commented Oct 22, 2024

It sounds like using the critical-section impl would be more straight forward I think, but if you have some of the scaffolding ready for the global sharded lock, then it might be worth trying that. Either of these approaches would be acceptable to us.

Note that we only need to do this on four targets, xtensa-esp32-none-elf, xtensa-esp32-esp-idf, xtensa-esp32s3-none-elf, xtensa-esp32s3-esp-idf.

Is it safe to call interrupt disable instruction from user code in xtensa-*-esp-idf? For example, RISC-V and Arm usually restrict the use of such instructions in user mode. If the same is true for Xtensa esp-idf, there will be no option to use anything other than critical-section.

Are you suggesting special casing some address ranges for these targets, or do you think a more general solution of somehow supplying an address range to portable-atomic?

I think this depends on the actual spec and guarantees. If it is something that is clearly guaranteed in the spec, then I guess you can hard code it, but if it is something like “vendor may change it in the future” or “different for each chip or firmware configuration” then we can't hard code the value into the portable-atomic code.

@MabezDev
Copy link
Member

Is it safe to call interrupt disable instruction from user code in xtensa-*-esp-idf? For example, RISC-V and Arm usually restrict the use of such instructions in user mode. If the same is true for Xtensa esp-idf, there will be no option to use anything other than critical-section.

@ivmarkov Could you clarify this, the details are a bit fuzzy for me.

I think this depends on the actual spec and guarantees. If it is something that is clearly guaranteed in the spec, then I guess you can hard code it, but if it is something like “vendor may change it in the future” or “different for each chip or firmware configuration” then we can't hard code the value into the portable-atomic code.

The silicon has a hardcoded address space for PSRAM, it will never change. It is dependant on the chip configuration (at time of production) this means, the esp32 and esp32-s3 have different address spaces for PSRAM so we'll need a solution that accounts for that (I believe we can achieve this with cfg's). Firmware configuration has no impact.

@MabezDev MabezDev added the status:blocked Unable to progress - dependent on another task label Nov 8, 2024
@MabezDev MabezDev removed this from the 0.22.0 milestone Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status:blocked Unable to progress - dependent on another task upstream The issue lies with a dependency
Projects
Status: Todo
Development

No branches or pull requests

7 participants