
memcpy implementation is too large on embedded #93265

Open
piegamesde opened this issue Jan 24, 2022 · 10 comments
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@piegamesde
Contributor

I used the following code to copy a memory region around on an embedded riscv32im-unknown-none-elf target:

core::ptr::copy_nonoverlapping(0xe0000000 as *const u8, 0x80000000 as *mut u8, 256 * 1024);

Modulo some checks, this lowers to a call into a memcpy implementation that looks like this in a disassembler:

700002e8 <memcpy>:
700002e8:       00f00693                li      a3,15
700002ec:       08c6f863                bgeu    a3,a2,7000037c <memcpy+0x94>
700002f0:       40a006b3                neg     a3,a0
700002f4:       0036f813                andi    a6,a3,3
700002f8:       010503b3                add     t2,a0,a6
700002fc:       02080063                beqz    a6,7000031c <memcpy+0x34>
70000300:       00050793                mv      a5,a0
70000304:       00058693                mv      a3,a1
70000308:       00068703                lb      a4,0(a3)
7000030c:       00e78023                sb      a4,0(a5)
70000310:       00178793                addi    a5,a5,1
70000314:       00168693                addi    a3,a3,1
70000318:       fe77e8e3                bltu    a5,t2,70000308 <memcpy+0x20>
7000031c:       010582b3                add     t0,a1,a6
70000320:       41060833                sub     a6,a2,a6
70000324:       ffc87893                andi    a7,a6,-4
70000328:       0032f593                andi    a1,t0,3
7000032c:       011386b3                add     a3,t2,a7
70000330:       04058c63                beqz    a1,70000388 <memcpy+0xa0>
70000334:       ffc2f613                andi    a2,t0,-4
70000338:       00062583                lw      a1,0(a2) # 40000 <_hart_stack_size+0x3f800>
7000033c:       07105463                blez    a7,700003a4 <memcpy+0xbc>
70000340:       00329713                slli    a4,t0,0x3
70000344:       01877313                andi    t1,a4,24
70000348:       40e00733                neg     a4,a4
7000034c:       01877e13                andi    t3,a4,24
70000350:       00460613                addi    a2,a2,4
70000354:       00062703                lw      a4,0(a2)
70000358:       0065d5b3                srl     a1,a1,t1
7000035c:       01c717b3                sll     a5,a4,t3
70000360:       00b7e5b3                or      a1,a5,a1
70000364:       00b3a023                sw      a1,0(t2)
70000368:       00438393                addi    t2,t2,4
7000036c:       00460613                addi    a2,a2,4
70000370:       00070593                mv      a1,a4
70000374:       fed3e0e3                bltu    t2,a3,70000354 <memcpy+0x6c>
70000378:       02c0006f                j       700003a4 <memcpy+0xbc>
7000037c:       00050693                mv      a3,a0
70000380:       02c04863                bgtz    a2,700003b0 <memcpy+0xc8>
70000384:       0440006f                j       700003c8 <memcpy+0xe0>
70000388:       01105e63                blez    a7,700003a4 <memcpy+0xbc>
7000038c:       00028593                mv      a1,t0
70000390:       0005a603                lw      a2,0(a1) # e0000000 <_sstack+0x5ffa0000>
70000394:       00c3a023                sw      a2,0(t2)
70000398:       00438393                addi    t2,t2,4
7000039c:       00458593                addi    a1,a1,4
700003a0:       fed3e8e3                bltu    t2,a3,70000390 <memcpy+0xa8>
700003a4:       011285b3                add     a1,t0,a7
700003a8:       00387613                andi    a2,a6,3
700003ac:       00c05e63                blez    a2,700003c8 <memcpy+0xe0>
700003b0:       00c68633                add     a2,a3,a2
700003b4:       00058703                lb      a4,0(a1)
700003b8:       00e68023                sb      a4,0(a3)
700003bc:       00168693                addi    a3,a3,1
700003c0:       00158593                addi    a1,a1,1
700003c4:       fec6e8e3                bltu    a3,a2,700003b4 <memcpy+0xcc>
700003c8:       00008067                ret

For something that could just be a loop copying bytes, this is huge. I did not have a deep look at it, but as far as I can tell, the main loop at the end is as small as one would expect. Before it, though, there is a word-copy fast path for larger operations: the function starts with li a3,15; bgeu a3,a2,<memcpy+0x94>, so that whole 0x90-byte-long part is skipped (falling through to a plain byte loop instead) whenever the length is at most 15.

Since I am aiming for the smallest possible binary size, I'd consider this a bug. More specifically, I have the following issues:

  • If the copy length is statically known at compile time (like in this case), the special cases that don't apply should be omitted
  • If compiling for size (-Os), no special casing should be done at all since it increases the binary size but is not necessary.

I tried reproducing this in godbolt, but did not manage to because of the tool's limitations. As far as I can tell, the memcpy comes from an LLVM intrinsic. Therefore, one must build for an embedded target (otherwise libc's memcpy will be used instead) and build a self-contained binary (otherwise only the call to some LLVM function is emitted, without its code being shown). This is sadly not supported by godbolt at the moment. The issue might apply to multiple or all bare-metal targets, although I was only able to check riscv32*-unknown-none.

@piegamesde
Contributor Author

I just found rust-lang/compiler-builtins#339, which looks similar (although the complaint there is about speed, not size), so maybe this issue should be transferred to the compiler-builtins repo?

@bjorn3
Member

bjorn3 commented Jan 26, 2022

If the copy length is statically known at compile time (like in this case), the special cases that don't apply should be omitted

A single memcpy implementation is used for all memory copies of unknown or large size. For small memory copies, LLVM emits inline code rather than a memcpy call. -Copt-level=s/z may affect the threshold at which a copy is considered large, though.
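For illustration, a hedged sketch of that threshold behavior (the function names are mine, and the exact cutoff depends on the target and opt-level):

use core::ptr;

// Small, compile-time-known size: typically lowered to a few inline
// loads/stores with no memcpy call at all.
pub unsafe fn copy_small(dst: *mut u8, src: *const u8) {
    ptr::copy_nonoverlapping(src, dst, 8);
}

// Large known size: typically lowered to a call to the out-of-line memcpy.
pub unsafe fn copy_large(dst: *mut u8, src: *const u8) {
    ptr::copy_nonoverlapping(src, dst, 256 * 1024);
}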

If compiling for size (-Os), no special casing should be done at all since it increases the binary size but is not necessary.

AFAIK there is no way for Rust code to detect which optimization level it was compiled with.

Does the specific riscv cpu you use support fast unaligned memory access? If so, enabling the mem-unaligned feature of compiler-builtins should cut down most of the code. If not, enabling it may slow copies down by >100(!) times, due to having to trap into machine mode on every unaligned memory access to emulate it.

By the way, the actual implementation of memcpy in compiler-builtins is https://github.com/rust-lang/compiler-builtins/blob/ea0cb5b589cc498d629c545e9bae600301ba6aed/src/mem/impls.rs#L28
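For illustration, a conceptual sketch (not the actual compiler-builtins code) of why mem-unaligned shrinks the function: with unaligned word accesses, the alignment prologue and the shift machinery seen in the disassembly above disappear:

use core::ptr;

// Conceptual sketch only: copy in word-sized chunks using unaligned
// loads/stores, then finish the tail byte by byte. This is only safe and
// fast on hardware that supports unaligned accesses.
pub unsafe fn memcpy_words_unaligned(dest: *mut u8, src: *const u8, n: usize) {
    let words = n / 4;
    for i in 0..words {
        let w = ptr::read_unaligned(src.add(i * 4) as *const u32);
        ptr::write_unaligned(dest.add(i * 4) as *mut u32, w);
    }
    for i in words * 4..n {
        *dest.add(i) = *src.add(i);
    }
}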

@piegamesde
Contributor Author

Does the specific riscv cpu you use support fast unaligned memory access?

As far as I know, RISC-V does not allow unaligned memory access at all.


Now I kind of understand how the result I see is produced. LLVM decides not to inline the call and instead simply calls the one-size-fits-all memcpy implementation. Its size usually does not matter, since it is amortized across all call sites.

So to resolve my issue, I'd need to either force-inline the call from the caller's side or make use of different intrinsics depending on the compiler options. Since I don't think either of these is feasible in Rust at the moment, I'd be better off using a custom memcpy implementation where possible (or overriding the symbol in the linker or something).
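For the record, a minimal sketch of such an override, assuming the memcpy symbol provided by compiler-builtins is weakly linked on the target (details vary by target and linker):

// Minimal size-optimized memcpy. Whether this definition actually replaces
// the compiler-builtins one depends on the symbol being weak on this target.
#[no_mangle]
pub unsafe extern "C" fn memcpy(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    for i in 0..n {
        *dest.add(i) = *src.add(i);
    }
    dest
}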

@bjorn3
Member

bjorn3 commented Jan 27, 2022

As far as I know, RISC-V does not allow unaligned memory access at all.

Looks like it is optional to support unaligned memory accesses. I thought it was mandatory to emulate them in machine mode if the underlying hardware doesn't support them, but it seems not. In any case, your specific cpu may support them, in which case enabling mem-unaligned is still valid.

@pca006132

For arm, we can override the weakly linked __aeabi_memcpy function with an implementation suited to the specific application, e.g. optimized for speed or space. Wondering if it is possible to override memcpy like this for other targets.
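For reference, a sketch of what such an override can look like, assuming the ARM RTABI signature (which, unlike C's memcpy, returns nothing):

// Overriding the weakly linked __aeabi_memcpy on an ARM EABI target.
// Per the RTABI it takes (dest, src, n) and has no return value.
#[no_mangle]
pub unsafe extern "C" fn __aeabi_memcpy(dest: *mut u8, src: *const u8, n: usize) {
    for i in 0..n {
        *dest.add(i) = *src.add(i);
    }
}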

@benjaminmordaunt

@pca006132 I'm interested in the arm case, for the armv4t-none-eabi target. What I really want is for the call to __aeabi_memcpy4 that is inserted by a copy_from_slice operation on a core::slice to be:

  1. Inlined - I'm effectively building my own memcpy, so don't want a memcpy inside a memcpy, but do want the LLVM optimization goodness.
  2. Specialized to the specific data size, alignment etc. that I am using in my code.

Is this possible? I don't want any external calls at all, since each function (using -ffunction-sections) is linked independently into the final binary, and extra compiler-generated functions are a no-no. At the same time, manually copying between src and dst in an iterator produces some really poorly optimized code. (I expect to see some LDMIA/STMIA instructions for bulk transfers, but just see ldr and mov...)

@benjaminmordaunt

I suppose on a more general note - is it possible to force Rust/LLVM to always generate inlined versions of memcpy calls?

@bjorn3
Member

bjorn3 commented Dec 27, 2022

LLVM will codegen memcpy calls with a known size inline if that size is below a certain threshold. This is not done through inlining (memory intrinsics generally can't be inlined, as LLVM would likely replace the body with a call to the same memory intrinsic again later on; in addition, rustc makes sure they never participate in LTO, to prevent linker errors and recursive calls) but by literally replacing the call with a hardcoded instruction sequence for the given size. Not sure if you can force LLVM to not generate memcpy calls.

@clubby789
Contributor

@rustbot label +A-llvm

@rustbot rustbot added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Mar 30, 2023
@Noratrieb Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023
@benjaminmordaunt

Has there been any progress on this? There is currently no way to prevent Rust (or LLVM) from generating calls to __builtin_memcpy and friends, even when shoving every option under the sun at the compiler, including #![no_builtins]. (Btw, there is a related issue here which seems to have gone cold - that attribute is severely lacking documentation and, as far as I've seen, doesn't actually prevent builtin generation for a certain selection.)

I think the underlying issue is that the LLVM IR produced for ::clone/::copy methods always ends up containing llvm.memcpy intrinsics, and I can't see any way of customising that. In my case, I could probably benefit from a switch that turns all emitted llvm.memcpy into llvm.memcpy.inline (see here).
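In the meantime, one pessimizing workaround (a sketch of the idea, not a recommendation) is to copy through volatile accesses, which keeps LLVM's loop-idiom recognition from turning the loop back into a memcpy call:

// Volatile accesses prevent LLVM from raising this loop into a memcpy call,
// at the cost of forcing byte-at-a-time codegen.
pub unsafe fn copy_no_builtin(mut dst: *mut u8, mut src: *const u8, n: usize) {
    for _ in 0..n {
        dst.write_volatile(src.read_volatile());
        dst = dst.add(1);
        src = src.add(1);
    }
}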

(For context, the project is OpenNitro. The BIOS contains its own hacked-together, super-old memory copying routine which, if anything, should be called. I'm trying to achieve rough binary parity with the original blobs...)
