arm64: test92 infloops at least with M1 (apple's CPU) #160

Closed
carenas opened this issue Apr 17, 2023 · 11 comments · May be fixed by #162

Comments

@carenas (Contributor) commented Apr 17, 2023

The issue is triggered by the first test, but all the others are likely affected as well, since the logic is similar. What kind of atomic operation was being modeled in them?

From what I can see, the problem might be in the test itself, which invalidates the exclusive reservation taken for SLJIT_R1 (AKA x1) by writing the value back to an address just 8 bytes away (hence likely hitting the same cache line), as shown by the following disassembly of the generated code:

   0xffffb7fd0da0:	ldxr	x1, [x19]
   0xffffb7fd0da4:	str	x1, [x19,#8]
   0xffffb7fd0da8:	mov	x0, x1
   0xffffb7fd0dac:	mov	x1, #0xffffffffffffd980    	// #-9856
   0xffffb7fd0db4:	stxr	w9, x1, [x19]
   0xffffb7fd0db8:	cmp	w9, #0x0
   0xffffb7fd0dbc:	b.ne	0xffffb7fd0da0

If the store is done outside the ldxr/stxr sequence, or at least 24 bytes away, then it succeeds.

@zherczeg (Owner)

After the read, the data is written back for checking. That address is different from the region locked for the atomic update, so it should not be a problem. The arm64 documentation has a section on these instructions:

B2.7.3 Load-Acquire, Store-Release

It is hard to understand, but I don't see any restrictions on memory writes. Am I missing something? At the very least they should warn about this side effect, or is it a CPU bug?

@carenas (Contributor, Author) commented Apr 18, 2023

While a CPU bug is a possibility, I assume it is unlikely considering how common these CPUs are.

The most likely explanation, AFAIK, is that the behaviour is permitted by the specification (in B2.9.3), which says:

A Load-Exclusive instruction marks a small block of memory for exclusive access. The size of the marked block is IMPLEMENTATION DEFINED

While I suppose someone could write a critical section that updates a memory address close enough to fail as in the test, that sort of situation seems unlikely to me, which is why I was wondering what sort of atomic access the test was modeling.

It is also worth mentioning that the recommendation says:

In some implementations the CTR identifies the Exclusives reservation granule, see CTR_EL0. Otherwise, software must assume that the maximum Exclusives reservation granule, 512 words, is implemented.

@zherczeg (Owner)

What does "exclusive access" mean exactly? The memory is accessed only by the current thread, which looks exclusive enough. The test does a basic atomic access: it reads the memory and writes another value if nobody has updated the target.

@zherczeg (Owner)

Perhaps this is the cause: the "exclusive access" on the M1 breaks even when the CPU that holds the exclusive access writes something inside the implementation-defined "exclusive range". In the API definition we could say that writing memory during the atomic operation is undefined behavior. I hope at least reading memory is not.

carenas added a commit to carenas/sljit that referenced this issue Apr 23, 2023
The specification allows up to a maximum exclusive granule of 512 words
to be reserved after `ldxr`, so writes close enough to the address that
was loaded could result in the reservation being invalidated and `stxr`
failing.

Instead of writing back the just-read data, do a second read to validate
the first one and compare them internally.

Fixes: zherczeg#160
@carenas (Contributor, Author) commented Apr 23, 2023

I see why this can be confusing: another ARM CPU I got hold of (a Cortex-A53) does not invalidate the reservation when a regular store (not using stxr) from the same CPU hits the address that was loaded inside the critical section, but the Apple one does.

Reads seem to be OK (although we might be missing memory barriers for some use cases, and also to deal correctly with misalignment), so I decided to rewrite the test logic to use reads instead, but in the process I got myself into an infinite loop on x86_64 because the "temp" register wasn't set to the "expected" value.

Is there a way to implement this interface in a way that would be less brittle?

carenas added a commit to carenas/sljit that referenced this issue Apr 23, 2023
The specification allows up to a maximum exclusive granule of 512 words
to be reserved by `ldxr`, so writes close enough to the address that
was loaded could result in the reservation being invalidated and `stxr`
failing.

Instead of writing back the values, do a second normal read to validate
the first one and compare them internally.  While at it, make the test
logic slightly easier to read.

Fixes: zherczeg#160
@zherczeg (Owner)

I think the best solution for this problem is to add a restriction between the load/store pair: memory must not be written. There are already several rules for this instruction pair, so one more should not make much difference.

The test system should be changed to store the loaded data only after the conditional store has completed, and the loads should then be checked by the test system. The loaded values can be kept temporarily in registers r0-r2 before they are written out if needed; moving data between registers is allowed.

I wanted to work on this soon, but if you want, you can do it.

@zherczeg (Owner)

We don't have to deal with misalignment; the instructions are expected to be used with the correct alignment. There are several rules for using the load/store pair, and if any of them is not satisfied, the operation is undefined.

@zherczeg (Owner)

It looks like I forgot to add the misalignment rule; it can be done in the test update patch.

@carenas (Contributor, Author) commented Apr 24, 2023

I think the best solution for this problem is adding a limitation between the load/store pair: memory should not be written. There are already several rules for this instruction pair, this should not make much difference.

There was already a limitation that said:

it should not modify the memory area targeted by the atomic operation between the load/store operations

But the test still did, and we were lucky that it broke in an obvious way on the M1.

It seems there might be other restrictions imposed by the targeted CPUs: for example, RISC-V seems to restrict how many instructions (or even which ones) can be used in the critical section, and POWER might clear the reservation even on reads that hit the same cache line.

I am also unclear on how this implementation is meant to be used; currently it can be fine for implementing an atomic counter but not a mutex, which is why I am surprised it doesn't have a "memory order" constraint.

If the CPU implementation is done with LL/SC then the "temp_reg" is irrelevant, and if it is done with CAS then the load/store separation seems artificial and forces passing state through the "temp_reg".

@carenas (Contributor, Author) commented May 2, 2023

No longer an issue with HEAD; API concerns remain, but they are better handled independently.

Some documentation updates or sample code (ideally not a toy example) might be worth adding to the "tutorial", though.

@carenas carenas closed this as completed May 2, 2023