-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
arm64: test92 infloops at least with M1 (apple's CPU) #160
Comments
After the reading, the data is written back for checking. The address is different from the region locked for atomic update, so it should not be a problem. In the arm64 doc there is a section for these instructions: B2.7.3 Load-Acquire, Store-Release It is hard to understand, but I don't see any restrictions for memory writes. Am I missing something? At least they should warn about this side effect, or is it a cpu bug? |
While a CPU bug is a possibility, I assume that it would be unlikely considering how common those are. The most likely reason is AFAIK that the behaviour is permitted by the specification (in B2.9.3) as spelled with:
While I assume it would be possible that someone would code a critical section that updates a close enough memory address that it will fail as in the test, I consider that sort of situation unlikely, which is why I was wondering what sort of atomic access was being modeled by the test. It is worth also mentioning that the recommendation says:
|
What does "exclusive access" exactly means? It is accessed only by the current thread, looks exclusive enough. The test does the basic atomic access, reads the memory, and writes another value if nobody updated the target value. |
Perhaps this is the cause: the "exclusive access" on M1 breaks even if the cpu - who has the exclusive access - writes something to the implementation defined "exclusive range". From API definition, we could say that writing data during the atomic operation has undefined behavior. I hope at least reading memory is not. |
The specification allows up to a maximum exclusive granule of 512 words to be reserved after `ldxr`, so writes close enough to the address that was loaded could result on the reservation being invalidated and `stxr` failing. Instead of writing back the just read data, do a second read to validate the first one and compare them internally. Fixes: zherczeg#160
I see where this can be confusing; another ARM cpu (Cortex A53) I got a hold of, doesn't invalidate the reservation if doing a regular store (without using Reads seem to be ok (although we might be missing memory barriers for some use cases and also to deal correctly with misalignment), so decided to rewrite the test logic doing reads instead but in the process got myself in an infloop with x86_64 because the "temp" register wasn't set to the "expected" value. Is there a way to implement this interface in a way that would be less brittle? |
The specification allow up to a maximum exclusive granule of 512 words to be reserved by `ldxr`, so writes close enough to the address that was loaded could result on the reservation being invalidated and `stxr` failing. Instead of writing back the values, do a second normal read to validate the first one and compare them internally. While at it make the test logic slightly easier to read. Fixes: zherczeg#160
The specification allow up to a maximum exclusive granule of 512 words to be reserved by `ldxr`, so writes close enough to the address that was loaded could result on the reservation being invalidated and `stxr` failing. Instead of writing back the values, do a second normal read to validate the first one and compare them internally. While at it make the test logic slightly easier to read. Fixes: zherczeg#160
I think the best solution for this problem is adding a limitation between the load/store pair: memory should not be written. There are already several rules for this instruction pair, this should not make much difference. The test system should be changed to store the loaded data after the store is completed. The loads should be checked by the test system. The loaded values can be stored temporarily in registers r0-r2 before they are written if needed. Moving data between registers is allowed. I wanted to work on this soon, but if you want it, you can do it. |
We don't have to deal with misalignment, the instructions are expected to use with the correct alignment. There are several rules for using the load/store pair, if any of them is not satisfied, the operation is undefined. |
It looks like I forgot to add the misalignment rule, it can be done in the test update patch. |
There was already a limitation that said:
But still the test did, and we were lucky it broke in an obvious way with the M1. It seems there might be other restrictions that could be imposed by the targeted CPUs, for example RISCV seems to restrict how many instructions or even which ones can be used in the critical section and POWER might clear the reservation even with the reads if it hits the same cache line. I am also unclear on how this implementation should be used, currently it can be fine for implementing an atomic counter but not a mutex which is why I am surprised it doesn't have a "memory order" constrain. if the CPU implementation is done with LL/SC then the "temp_reg" is irrelevant, and if with CAS then the load/store separation seems artificial and forces passing state through the "temp_reg". |
No longer an issue with HEAD, API concerns remain but better to be handled independently. Some documentation update or sample code (better if not a toy one) might be worth adding to the "tutorial" though |
The issue is caused on the first test, but all others are likely affected as well, as the logic is similar. What kind of atomic operation was being modeled in them?
From what I see the problem might be with the test itself, which invalidates the atomic read for SLJIT_R1 (AKA x1) by writing it back to an address just 8 bytes away (hence likely hitting the same cache line) as shown by the following disassembly of the generated code:
If the store is done outside of the ldxr/stxr segment or at least 24 bytes away then it succeeds.
The text was updated successfully, but these errors were encountered: