arm64: test92 infloops at least with M1 (apple's CPU) #160

Closed
carenas opened this issue Apr 17, 2023 · 11 comments · May be fixed by #162

Comments

@carenas (Contributor) commented Apr 17, 2023

The issue is triggered by the first test, but all the others are likely affected as well, since the logic is similar. What kind of atomic operation was being modeled in them?

From what I can see, the problem might be in the test itself, which invalidates the exclusive reservation taken for SLJIT_R1 (AKA x1) by writing the value back to an address just 8 bytes away (hence likely hitting the same cache line), as shown by the following disassembly of the generated code:

   0xffffb7fd0da0:	ldxr	x1, [x19]
   0xffffb7fd0da4:	str	x1, [x19,#8]
   0xffffb7fd0da8:	mov	x0, x1
   0xffffb7fd0dac:	mov	x1, #0xffffffffffffd980    	// #-9856
   0xffffb7fd0db4:	stxr	w9, x1, [x19]
   0xffffb7fd0db8:	cmp	w9, #0x0
   0xffffb7fd0dbc:	b.ne	0xffffb7fd0da0

If the store is done outside the ldxr/stxr sequence, or at least 24 bytes away, then it succeeds.

@zherczeg (Owner)

After the read, the data is written back for checking. That address is different from the region locked for the atomic update, so it should not be a problem. The arm64 documentation has a section on these instructions:

B2.7.3 Load-Acquire, Store-Release

It is hard to understand, but I don't see any restrictions on memory writes. Am I missing something? At the very least they should warn about this side effect, or is it a CPU bug?

@carenas (Contributor, Author) commented Apr 18, 2023

While a CPU bug is a possibility, I assume it is unlikely considering how common these CPUs are.

The most likely explanation, AFAIK, is that the behaviour is permitted by the specification (in B2.9.3), which says:

A Load-Exclusive instruction marks a small block of memory for exclusive access. The size of the marked block is IMPLEMENTATION DEFINED

While I suppose someone could write a critical section that updates a memory address close enough to fail as in the test, that sort of situation seems unlikely to me, which is why I was wondering what sort of atomic access the test was modeling.

It is also worth mentioning that the recommendation says:

In some implementations the CTR identifies the Exclusives reservation granule, see CTR_EL0. Otherwise, software must assume that the maximum Exclusives reservation granule, 512 words, is implemented.

@zherczeg (Owner)

What does "exclusive access" mean exactly? The memory is accessed only by the current thread, which looks exclusive enough. The test does a basic atomic access: it reads the memory and writes another value if nobody has updated the target.

@zherczeg (Owner)

Perhaps this is the cause: the "exclusive access" on the M1 breaks even when the CPU that holds the exclusive access writes something inside the implementation-defined "exclusive range". In the API definition we could say that writing memory during the atomic operation is undefined behavior. I hope at least reading memory is not.

carenas added a commit to carenas/sljit that referenced this issue Apr 23, 2023
The specification allows up to a maximum exclusive granule of 512 words
to be reserved after `ldxr`, so writes close enough to the address that
was loaded could result in the reservation being invalidated and `stxr`
failing.

Instead of writing back the just-read data, do a second read to validate
the first one and compare them internally.

Fixes: zherczeg#160
@carenas (Contributor, Author) commented Apr 23, 2023

I see why this can be confusing: another ARM CPU I got hold of (a Cortex-A53) does not invalidate the reservation when a regular store (not using stxr) from the same CPU hits the address that was loaded inside the critical section, but the Apple one does.

Reads seem to be OK (although we might be missing memory barriers for some use cases, and also to deal correctly with misalignment), so I decided to rewrite the test logic to use reads instead, but in the process I got myself into an infinite loop on x86_64 because the "temp" register wasn't set to the "expected" value.

Is there a way to implement this interface in a way that would be less brittle?

carenas added a commit to carenas/sljit that referenced this issue Apr 23, 2023
The specification allows up to a maximum exclusive granule of 512 words
to be reserved by `ldxr`, so writes close enough to the address that
was loaded could result in the reservation being invalidated and `stxr`
failing.

Instead of writing back the values, do a second normal read to validate
the first one and compare them internally.  While at it, make the test
logic slightly easier to read.

Fixes: zherczeg#160
@zherczeg (Owner)

I think the best solution for this problem is to add a restriction between the load/store pair: memory must not be written. There are already several rules for this instruction pair, so one more should not make much difference.

The test system should be changed to store the loaded data only after the conditional store has completed, and the loads should then be checked by the test system. The loaded values can be kept temporarily in registers r0-r2 before they are written out if needed; moving data between registers is allowed.

I wanted to work on this soon, but if you want, you can do it.

@zherczeg (Owner)

We don't have to deal with misalignment; the instructions are expected to be used with the correct alignment. There are several rules for using the load/store pair, and if any of them is not satisfied, the operation is undefined.

@zherczeg (Owner)

It looks like I forgot to add the misalignment rule; it can be done in the test update patch.

@carenas (Contributor, Author) commented Apr 24, 2023

I think the best solution for this problem is adding a limitation between the load/store pair: memory should not be written. There are already several rules for this instruction pair, this should not make much difference.

There was already a limitation that said:

it should not modify the memory area targeted by the atomic operation between the load/store operations

But the test still did, and we were lucky that it broke in an obvious way on the M1.

It seems there might be other restrictions imposed by the targeted CPUs: for example, RISC-V seems to restrict how many instructions (or even which ones) can be used in the critical section, and POWER might clear the reservation even on reads that hit the same cache line.

I am also unclear on how this implementation is meant to be used; currently it can be fine for implementing an atomic counter but not a mutex, which is why I am surprised it doesn't have a "memory order" constraint.

If the CPU implementation is done with LL/SC then the "temp_reg" is irrelevant, and if it is done with CAS then the load/store separation seems artificial and forces passing state through the "temp_reg".

@carenas (Contributor, Author) commented May 2, 2023

No longer an issue with HEAD; API concerns remain, but they are better handled independently.

Some documentation updates or sample code (ideally not a toy example) might be worth adding to the "tutorial", though.

@carenas carenas closed this as completed May 2, 2023