memcpy implementation is too large on embedded #93265
Comments
I just found rust-lang/compiler-builtins#339 which looks similar (although the complaint there is speed and not size), so maybe this issue should be transferred to the compiler-builtins repo?
A single memcpy implementation is used for all memory copies of unknown or large size. For small memory copies LLVM emits inline code rather than a memcpy call.
AFAIK there is no way for Rust code to detect which optimization level it has been compiled with. Does the specific RISC-V CPU you use support fast unaligned memory access? If so enabling the By the way the actual implementation of memcpy in compiler-builtins is https://github.com/rust-lang/compiler-builtins/blob/ea0cb5b589cc498d629c545e9bae600301ba6aed/src/mem/impls.rs#L28 |
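Rust source cannot query the optimization level directly, but when building with Cargo a build script can read the `OPT_LEVEL` environment variable and turn it into a `cfg` flag. The following is a minimal sketch of that workaround, not something proposed in this thread; the cfg names `opt_size` and `opt_speed` and the helper `cfg_for_opt_level` are hypothetical.

```rust
// build.rs (sketch): expose the Cargo profile's opt-level as a cfg flag.
use std::env;

/// Maps the value of Cargo's OPT_LEVEL variable to a cfg name.
/// Both cfg names are made up for this example.
fn cfg_for_opt_level(level: &str) -> Option<&'static str> {
    match level {
        "s" | "z" => Some("opt_size"),
        "2" | "3" => Some("opt_speed"),
        _ => None,
    }
}

fn main() {
    // Cargo sets OPT_LEVEL when it runs a build script; when the script is
    // run by hand the variable is absent and no cfg is emitted.
    let level = env::var("OPT_LEVEL").unwrap_or_default();
    if let Some(name) = cfg_for_opt_level(&level) {
        println!("cargo:rustc-cfg={}", name);
    }
}
```

Crate code could then select a smaller routine behind `#[cfg(opt_size)]`. This only reflects the profile of the crate that owns the build script, and it does not change what the prebuilt memcpy in compiler-builtins looks like.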
As far as I know, RISC-V does not allow unaligned memory access at all. Now I kind of understand how the result I see is produced. LLVM somehow decides not to inline the function call and then simply calls to the one-size-fits-all So to resolve my issue I'd need to either force-inline a function call at the call site or make use of different intrinsics based on the compiler options. Since I don't think any of these is feasible in Rust at the moment, I'd be better off using a custom |
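A custom copy routine of the kind mentioned here could be as small as a single byte loop. The sketch below is an illustration under assumptions: the name `copy_bytes` is hypothetical, and whether such a function can be exported under the symbol name memcpy to replace the built-in one depends on the target and its linkage rules.

```rust
/// Copies `n` bytes from `src` to `dest`, one byte at a time.
///
/// # Safety
/// Both pointers must be valid for `n` bytes and the regions must not overlap.
pub unsafe fn copy_bytes(dest: *mut u8, src: *const u8, n: usize) -> *mut u8 {
    let mut i = 0;
    while i < n {
        // Volatile accesses should keep the optimizer from recognising the
        // loop as a copy idiom and replacing it with a call to memcpy.
        unsafe { dest.add(i).write_volatile(src.add(i).read_volatile()) };
        i += 1;
    }
    dest
}

fn main() {
    let src = [1u8, 2, 3, 4, 5];
    let mut dst = [0u8; 5];
    unsafe { copy_bytes(dst.as_mut_ptr(), src.as_ptr(), src.len()) };
    assert_eq!(dst, src);
}
```

The trade-off is speed: one load and one store per byte is slow for large buffers, which is the cost of the smaller code.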
Looks like it is optional to support unaligned memory accesses. I thought it was mandatory to emulate them in machine mode if the underlying hardware doesn't support them, but it seems not. In any case your specific CPU may support them, in which case enabling |
For arm, we can override the weakly linked |
@pca006132 I'm interested in the arm case for the armv4t-none-eabi target. What I really want is a call to
Is this possible? I don't want any external calls at all, as each function (using -ffunction-sections) is linked independently into the final binary and extra compiler-generated functions are a no-no. At the same time, manually copying between src and dst in an iterator produces some really poorly optimized code. (I expect to see some LDMIA/STMIA instructions for bulk transfers, but just see ldr and mov...) |
I suppose on a more general note - is it possible to force Rust/LLVM to always generate inlined versions of memcpy calls? |
LLVM will codegen memcpy calls with a known size inline if this size is below a certain threshold. This is not done through inlining (memory intrinsics generally can't be inlined, as LLVM will likely replace the inlined code with a call to said memory intrinsic again later on; in addition, rustc makes sure that they never participate in LTO, to prevent linker errors and recursive calls) but by literally replacing the call with a hardcoded sequence for the given size. Not sure if you can force LLVM to not generate memcpy calls. |
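The difference between a known and an unknown size can be seen by comparing two small functions. This is a sketch for inspection purposes: the function names are hypothetical, and the exact threshold and the generated code depend on the target and the optimization level.

```rust
/// Length known at compile time (8 bytes): with optimizations enabled,
/// LLVM will usually expand this into a few loads and stores.
pub fn copy_fixed(dst: &mut [u8; 8], src: &[u8; 8]) {
    dst.copy_from_slice(src);
}

/// Length only known at run time: expect a call to the memcpy symbol.
pub fn copy_dynamic(dst: &mut [u8], src: &[u8]) {
    // copy_from_slice panics on a length mismatch, so trim to the shorter one.
    let n = dst.len().min(src.len());
    dst[..n].copy_from_slice(&src[..n]);
}

fn main() {
    let src = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let mut a = [0u8; 8];
    copy_fixed(&mut a, &src);
    assert_eq!(a, src);

    let mut b = vec![0u8; 5];
    copy_dynamic(&mut b, &src);
    assert_eq!(b, [1, 2, 3, 4, 5]);
}
```

Compiling with `rustc -O --emit asm` for the target of interest shows which of the two ends up as a call.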
@rustbot label +A-llvm |
Has there been any progress on this? There is currently no way to prevent Rust (or LLVM) from generating calls to I think the underlying issue is that the LLVM IR being produced for (For context, the project is OpenNitro. The BIOS contains its own hacked together super-old memory copying routine which, if anything, should be called. I'm trying to achieve somewhat binary parity with the original blobs...) |
I used the following code to copy a memory region around on an embedded riscv32im-unknown-none-elf target:
Modulo some checks, this resolves to a jump to a memcpy implementation that looks like this in a disassembler:
For something that could just be a loop copying bytes, this is huge. I did not have a deep look at it, but as far as I can tell, the main loop at the end is as small as one would expect. Before it, though, there is some special casing for small copy operations: it starts with li a3,15; bgeu a3,a2,<memcpy+0x94>, so that whole 0x90-byte part is skipped if the length is at most 15.
Since I am aiming for the smallest possible binary size, I'd consider this a bug. More specifically, I have the following issues:
-Os), no special casing should be done at all since it increases the binary size but is not necessary.
I tried reproducing this in godbolt, but I did not manage to because of the tool's limitations. As far as I could tell, the memcpy comes from an LLVM intrinsic. Therefore, one must build for an embedded target (otherwise, libc's memcpy will be used instead) and build a self-contained binary (otherwise, it only emits the jump to some LLVM function without showing its code). This is sadly not supported by godbolt at the moment. The issue might apply to multiple or all bare metal targets, although I was only able to check for riscv32*-unknown-none.
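The shape of the routine described in the issue body can be sketched in safe Rust. This is an illustration, not the compiler-builtins code: it reads the quoted `li a3,15; bgeu a3,a2` as "branch past the extra code if 15 >= length", and the names `THRESHOLD` and `copy_with_threshold` are hypothetical.

```rust
const WORD: usize = core::mem::size_of::<usize>();
// Assumed cut-off, chosen to match the comparison against 15 quoted above.
const THRESHOLD: usize = 16;

/// Copies min(dst.len(), src.len()) bytes and returns that count.
pub fn copy_with_threshold(dst: &mut [u8], src: &[u8]) -> usize {
    let n = dst.len().min(src.len());
    let mut i = 0;
    if n >= THRESHOLD {
        // Word-at-a-time path: this is the extra code that a size-optimized
        // build could drop entirely.
        while i + WORD <= n {
            let mut w = [0u8; WORD];
            w.copy_from_slice(&src[i..i + WORD]);
            let word = usize::from_ne_bytes(w);
            dst[i..i + WORD].copy_from_slice(&word.to_ne_bytes());
            i += WORD;
        }
    }
    // Byte loop: handles short copies and the tail of long ones.
    while i < n {
        dst[i] = src[i];
        i += 1;
    }
    n
}

fn main() {
    let src: Vec<u8> = (0..20).collect();
    let mut long = [0u8; 20];
    assert_eq!(copy_with_threshold(&mut long, &src), 20);
    assert_eq!(&long[..], &src[..]);

    let mut short = [0u8; 5];
    assert_eq!(copy_with_threshold(&mut short, &src), 5);
    assert_eq!(short, [0, 1, 2, 3, 4]);
}
```

Removing the `if n >= THRESHOLD` block leaves only the byte loop, which is the size-optimized behaviour the issue asks for.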