Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use atomics where available #8

Merged
merged 12 commits into from
Nov 9, 2017
Merged

use atomics where available #8

merged 12 commits into from
Nov 9, 2017

Conversation

japaric
Copy link
Member

@japaric japaric commented Oct 31, 2017

cc #5
cc @pftbest

@japaric
Copy link
Member Author

japaric commented Oct 31, 2017

@homunkulus try

@homunkulus
Copy link
Contributor

⌛ Trying commit f9a3dfc with merge 4a527af...

japaric pushed a commit that referenced this pull request Oct 31, 2017
use atomics where available

cc #5
cc @pftbest
@homunkulus
Copy link
Contributor

☀️ Test successful - status-travis
State: approved= try=True

@japaric
Copy link
Member Author

japaric commented Oct 31, 2017

@homunkulus try

@homunkulus
Copy link
Contributor

⌛ Trying commit 158d19b with merge b519403...

japaric pushed a commit that referenced this pull request Oct 31, 2017
use atomics where available

cc #5
cc @pftbest
@japaric
Copy link
Member Author

japaric commented Oct 31, 2017

SUMMARY: ThreadSanitizer: data race /checkout/src/liballoc/heap.rs:104 in _$LT$alloc..heap..Heap$u20$as$u20$alloc..allocator..Alloc$GT$::dealloc::h42931c7bdbec6994

Hmm, looks like the Travis environment doesn't do demangling for some reason.

@homunkulus
Copy link
Contributor

💔 Test failed - status-travis

@japaric
Copy link
Member Author

japaric commented Oct 31, 2017

@homunkulus try

@homunkulus
Copy link
Contributor

⌛ Trying commit a93f857 with merge 5fcb8c4...

japaric pushed a commit that referenced this pull request Oct 31, 2017
use atomics where available

cc #5
cc @pftbest
@homunkulus
Copy link
Contributor

💔 Test failed - status-travis

@japaric
Copy link
Member Author

japaric commented Oct 31, 2017

I tested running the sanitizer on a minimal Ubuntu 16.04 install and demangling works there. Then I remembered that Travis is using 14.04, which I haven't tested.

@pftbest
Copy link

pftbest commented Nov 2, 2017

All single core systems are supported.

I think this statement might be not exactly true. There are some single core systems that can execute instructions out-of-order, and can break this code.

Good news is that Cortex-M is not one of them yet:

The Cortex-M processors never perform memory accesses out of order compared to instruction flow, however, the architecture does not prohibit this in future implementations. ARMv7-M code written to be portable to ARMv7-AR processors, like Cortex-A9, must already take account of this ordering model.

I believe that such platforms like Cortex-A9 all have atomic support, so this wouldn't matter in practice, but running non-atomic code on them would probably be a mistake, even if they have only one core.

@japaric
Copy link
Member Author

japaric commented Nov 7, 2017

Good point, @pftbest. I think we should note that (out-of-order execution) in the documentation. I think it's best to ask the user to check the disassembly if they are using a single core system with "no atomics"; if the code is wrong the fix will involve patching the implementation for that specific architecture / target (i.e. #[cfg]) as I don't think there's a general way to insert instruction / memory barriers via LLVM intrinsics.

which works on thumbv6m-none-eabi and probably other targets with max-atomic-width = 0
@japaric
Copy link
Member Author

japaric commented Nov 8, 2017

I ended up re-implementing AtomicUsize using atomic load/store (LLVM) intrinsics. This way I was able to drop the conditional code while retaining support for thumbv6m-none-eabi.

@pftbest
Copy link

pftbest commented Nov 8, 2017

That's a great idea.

@japaric
Copy link
Member Author

japaric commented Nov 8, 2017

I changed the load_acquire for load_relaxed. The previous version produced two memory barriers; I think we only need one: the one between the data read / write and the the head / tail pointer update.

So, this code:

fn exti0() {
    unsafe {
        RB.split().0.enqueue(0).unwrap();
    }
}

Produced:

080001c0 <EXTI0>:
 80001c0:       f240 0000       movw    r0, #0
 80001c4:       f2c2 0000       movt    r0, #8192       ; 0x2000
 80001c8:       6803            ldr     r3, [r0, #0]
 80001ca:       6842            ldr     r2, [r0, #4]
 80001cc:       f3bf 8f5f       dmb     sy
 80001d0:       1c51            adds    r1, r2, #1
 80001d2:       f001 010f       and.w   r1, r1, #15
 80001d6:       4299            cmp     r1, r3
 80001d8:       bf1f            itttt   ne
 80001da:       eb00 0282       addne.w r2, r0, r2, lsl #2
 80001de:       2300            movne   r3, #0
 80001e0:       6093            strne   r3, [r2, #8]
 80001e2:       f3bf 8f5f       dmbne   sy
 80001e6:       bf1c            itt     ne
 80001e8:       6041            strne   r1, [r0, #4]
 80001ea:       4770            bxne    lr
 80001ec:       b580            push    {r7, lr}
 80001ee:       466f            mov     r7, sp
 80001f0:       f7ff ffe2       bl      80001b8 <core::result::unwrap_failed>

With the change, it now produces:

080001f0 <EXTI0>:
 80001f0:       f240 0000       movw    r0, #0
 80001f4:       f2c2 0000       movt    r0, #8192       ; 0x2000
 80001f8:       6803            ldr     r3, [r0, #0]
 80001fa:       6842            ldr     r2, [r0, #4]
 80001fc:       1c51            adds    r1, r2, #1
 80001fe:       f001 010f       and.w   r1, r1, #15
 8000202:       4299            cmp     r1, r3
 8000204:       bf1f            itttt   ne
 8000206:       eb00 0282       addne.w r2, r0, r2, lsl #2
 800020a:       2300            movne   r3, #0
 800020c:       6093            strne   r3, [r2, #8]
 800020e:       f3bf 8f5f       dmbne   sy
 8000212:       bf1c            itt     ne
 8000214:       6041            strne   r1, [r0, #4]
 8000216:       4770            bxne    lr
 8000218:       b580            push    {r7, lr}
 800021a:       466f            mov     r7, sp
 800021c:       f7ff ffcc       bl      80001b8 <core::result::unwrap_failed>

TSan seems to be OK with the change.

@japaric
Copy link
Member Author

japaric commented Nov 8, 2017

@homunkulus try

japaric pushed a commit that referenced this pull request Nov 8, 2017
use atomics where available

cc #5
cc @pftbest
@homunkulus
Copy link
Contributor

⌛ Trying commit af5fdf3 with merge b31015c...

@homunkulus
Copy link
Contributor

💔 Test failed - status-travis

@japaric
Copy link
Member Author

japaric commented Nov 8, 2017

@homunkulus try

@homunkulus
Copy link
Contributor

⌛ Trying commit 7806bc9 with merge 14a5689...

japaric pushed a commit that referenced this pull request Nov 8, 2017
use atomics where available

cc #5
cc @pftbest
@homunkulus
Copy link
Contributor

💔 Test failed - status-travis

@japaric
Copy link
Member Author

japaric commented Nov 9, 2017

@homunkulus try

@homunkulus
Copy link
Contributor

⌛ Trying commit 5ff961c with merge 89e5733...

japaric pushed a commit that referenced this pull request Nov 9, 2017
use atomics where available

cc #5
cc @pftbest
@homunkulus
Copy link
Contributor

☀️ Test successful - status-travis
State: approved= try=True

also

- add a "`split` freezes the ring buffer" compile fail test
- hide compile-fail doc tests
- add scoped threads tests
@japaric
Copy link
Member Author

japaric commented Nov 9, 2017

@homunkulus r+

@homunkulus
Copy link
Contributor

📌 Commit 30ea33c has been approved by japaric

@homunkulus
Copy link
Contributor

⌛ Testing commit 30ea33c with merge 612bf44...

japaric pushed a commit that referenced this pull request Nov 9, 2017
use atomics where available

cc #5
cc @pftbest
@homunkulus
Copy link
Contributor

☀️ Test successful - status-travis
Approved by: japaric
Pushing 612bf44 to master...

@homunkulus homunkulus merged commit 30ea33c into master Nov 9, 2017
@japaric japaric deleted the atomic branch November 9, 2017 01:42
japaric added a commit that referenced this pull request May 10, 2022
these changes optimize `Vec<u8, 1024>::clone` down to these operations

1. reserve the stack space (1028 bytes on 32-bit ARM) and leave it uninitialized
2. zero the `len` field
3. memcpy `len` bytes of data from the parent

analyzed source code

``` rust
use heapless::Vec;

fn clone(vec: &Vec<u8, 1024>) {
    let mut vec = vec.clone();
    black_box(&mut vec);
}

fn black_box<T>(val: &mut T) {
    unsafe { asm!("// {0}", in(reg) val) }
}
```

machine code with `lto = fat`, `codegen-units = 1` and `opt-level = 'z'` ('z' instead of 3 to avoid loop unrolling and keep the machine code readable)

``` armasm
00020100 <clone>:
   20100:              b5d0             push    {r4, r6, r7, lr}
   20102:              af02             add     r7, sp, #8
   20104:              f5ad 6d81        sub.w   sp, sp, #1032   ; 0x408
   20108:              2300             movs    r3, #0
   2010a:              c802             ldmia   r0!, {r1}
   2010c:              9301             str     r3, [sp, #4]
   2010e:              aa01             add     r2, sp, #4
   20110:       /--/-X b141             cbz     r1, 20124 <clone+0x24>
   20112:       |  |   4413             add     r3, r2
   20114:       |  |   f810 4b01        ldrb.w  r4, [r0], #1
   20118:       |  |   3901             subs    r1, #1
   2011a:       |  |   711c             strb    r4, [r3, #4]
   2011c:       |  |   9b01             ldr     r3, [sp, #4]
   2011e:       |  |   3301             adds    r3, #1
   20120:       |  |   9301             str     r3, [sp, #4]
   20122:       |  \-- e7f5             b.n     20110 <clone+0x10>
   20124:       \----> a801             add     r0, sp, #4
   20126:              f50d 6d81        add.w   sp, sp, #1032   ; 0x408
   2012a:              bdd0             pop     {r4, r6, r7, pc}
```

note that it's not optimizing step (3) to an actual `memcpy` because we lack the 'trait specialization' code that libstd uses

---

before `clone` was optimized to

1. reserve and zero (`memclr`) 1028 (!?) bytes of stack space
2. (unnecessarily) runtime check if `len` is equal or less than 1024 (capacity) -- this included a panicking branch
3. memcpy `len` bytes of data from the parent
japaric added a commit that referenced this pull request May 10, 2022
these changes optimize `Vec<u8, 1024>::clone` down to these operations

1. reserve the stack space (1028 bytes on 32-bit ARM) and leave it uninitialized
2. zero the `len` field
3. memcpy `len` bytes of data from the parent

analyzed source code

``` rust
use heapless::Vec;

fn clone(vec: &Vec<u8, 1024>) {
    let mut vec = vec.clone();
    black_box(&mut vec);
}

fn black_box<T>(val: &mut T) {
    unsafe { asm!("// {0}", in(reg) val) }
}
```

machine code with `lto = fat`, `codegen-units = 1` and `opt-level = 'z'` ('z' instead of 3 to avoid loop unrolling and keep the machine code readable)

``` armasm
00020100 <clone>:
   20100:              b5d0             push    {r4, r6, r7, lr}
   20102:              af02             add     r7, sp, #8
   20104:              f5ad 6d81        sub.w   sp, sp, #1032   ; 0x408
   20108:              2300             movs    r3, #0
   2010a:              c802             ldmia   r0!, {r1}
   2010c:              9301             str     r3, [sp, #4]
   2010e:              aa01             add     r2, sp, #4
   20110:       /--/-X b141             cbz     r1, 20124 <clone+0x24>
   20112:       |  |   4413             add     r3, r2
   20114:       |  |   f810 4b01        ldrb.w  r4, [r0], #1
   20118:       |  |   3901             subs    r1, #1
   2011a:       |  |   711c             strb    r4, [r3, #4]
   2011c:       |  |   9b01             ldr     r3, [sp, #4]
   2011e:       |  |   3301             adds    r3, #1
   20120:       |  |   9301             str     r3, [sp, #4]
   20122:       |  \-- e7f5             b.n     20110 <clone+0x10>
   20124:       \----> a801             add     r0, sp, #4
   20126:              f50d 6d81        add.w   sp, sp, #1032   ; 0x408
   2012a:              bdd0             pop     {r4, r6, r7, pc}
```

note that it's not optimizing step (3) to an actual `memcpy` because we lack the 'trait specialization' code that libstd uses

---

before `clone` was optimized to

1. reserve and zero (`memclr`) 1028 (!?) bytes of stack space
2. (unnecessarily) runtime check if `len` is equal or less than 1024 (capacity) -- this included a panicking branch
3. memcpy `len` bytes of data from the parent
bors bot added a commit that referenced this pull request May 10, 2022
290: optimize the codegen of Vec::clone r=japaric a=japaric

these changes optimize `Vec<u8, 1024>::clone` down to these operations

1. reserve the stack space (1028 bytes on 32-bit ARM) and leave it uninitialized
2. zero the `len` field
3. memcpy `len` bytes of data from the parent

analyzed source code

``` rust
use heapless::Vec;

fn clone(vec: &Vec<u8, 1024>) {
    let mut vec = vec.clone();
    black_box(&mut vec);
}

fn black_box<T>(val: &mut T) {
    unsafe { asm!("// {0}", in(reg) val) }
}
```

machine code with `lto = fat`, `codegen-units = 1` and `opt-level = 'z'` ('z' instead of 3 to avoid loop unrolling and keep the machine code readable)

``` armasm
00020100 <clone>:
   20100:              b5d0             push    {r4, r6, r7, lr}
   20102:              af02             add     r7, sp, #8
   20104:              f5ad 6d81        sub.w   sp, sp, #1032   ; 0x408
   20108:              2300             movs    r3, #0
   2010a:              c802             ldmia   r0!, {r1}
   2010c:              9301             str     r3, [sp, #4]
   2010e:              aa01             add     r2, sp, #4
   20110:       /--/-X b141             cbz     r1, 20124 <clone+0x24>
   20112:       |  |   4413             add     r3, r2
   20114:       |  |   f810 4b01        ldrb.w  r4, [r0], #1
   20118:       |  |   3901             subs    r1, #1
   2011a:       |  |   711c             strb    r4, [r3, #4]
   2011c:       |  |   9b01             ldr     r3, [sp, #4]
   2011e:       |  |   3301             adds    r3, #1
   20120:       |  |   9301             str     r3, [sp, #4]
   20122:       |  \-- e7f5             b.n     20110 <clone+0x10>
   20124:       \----> a801             add     r0, sp, #4
   20126:              f50d 6d81        add.w   sp, sp, #1032   ; 0x408
   2012a:              bdd0             pop     {r4, r6, r7, pc}
```

note that it's not optimizing step (3) to an actual `memcpy` because we lack the 'trait specialization' code that libstd uses

---

before `clone` was optimized to

1. reserve and zero (`memclr`) 1028 (!?) bytes of stack space
2. (unnecessarily) runtime check if `len` is equal or less than 1024 (capacity) -- this included a panicking branch
3. memcpy `len` bytes of data from the parent

Co-authored-by: Jorge Aparicio <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants