32-bit ARM builds fail as single process uses >3 GiB memory #4320
Just a quick note. What do you expect us to do about how Rust builds the binaries, or about how library crates are built? Must be something in there if it suddenly happened. Also, building the binaries does work on GitHub at least. |
If the failure point is at the linking step, maybe disabling LTO can help? Does compiling in debug mode finish correctly at least? |
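A minimal sketch (my suggestion, not from the original posts) of how both ideas could be tested, assuming the sqlite feature set used elsewhere in this thread; Cargo's profile environment variables override Cargo.toml for a single invocation:

# Override LTO for one release build without touching Cargo.toml
CARGO_PROFILE_RELEASE_LTO=off cargo build --release --features sqlite
# Plain debug build (no LTO by default)
cargo build --features sqlite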
Also, note #4308, which isn't in 1.30.3. |
What happens if you use main? |
I asked this myself. At least I wanted to make you aware of the issue, as this should affect others as well. And probably someone with more Rust build/cargo knowledge has an idea how to work around the issue.
I see you use Docker
I am not 100% sure whether it is the linking step, but the LTO is disabled when e.g. using
Will try it as well. |
It's always good to point this out of course. I just wondered, since not much changed in Vaultwarden itself except the crates.
It should indeed.
👍🏻 |
The same happened to me. Apart from the problem with Handlebars, which I fixed by using version 5.1.0, I could compile Vaultwarden with |
Although, it affects Debian Bullseye, Bookworm and Trixie alike, Bullseye with LibSSL1.1 and the others with LibSSL3, and quite different (C) toolchain versions as well. Rust itself is installed via rustup instead of Debian packages. And vaultwarden v1.30.1 still builds fine, so to me it looks more like the crate dependency versions make the difference. And this is of course nasty to track down. I should have tested ... okay the
Is there a way to verify that LTO is disabled? I see the Building with
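For what it's worth, a rough way to see which LTO-related flags cargo actually passes to rustc is to build verbosely and inspect the rustc invocations (a sketch; the exact flags can vary by cargo version):

# Print the LTO/codegen flags from the verbose rustc command lines
cargo build --release --features sqlite -v 2>&1 \
  | grep -oE -- '-C (lto|embed-bitcode|codegen-units)[^ ]*' | sort | uniq -c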
That is interesting. I saw that this new profile was recently added and thought about trying it, though I do not expect it to produce much smaller binaries in our case, since we run |
Could you provide a bit more detail on the hosts you build it on? And you also mentioned qemu and different architectures. |
I also wonder what happens if you try an |
I use two different hosts:
On the Odroid XU4, I am using this image. It has some scripts for system setup purposes, but is at its core a minimal Debian Bookworm. I can actually try to replicate it on any other Debian-based live image via GitHub Actions, or even Ubuntu (which is 99% the same in relevant regards). But Debian does not seem to offer them for ARM: https://www.debian.org/CD/live/ I will try to do a build with proceeding |
If you want docker to be able to run armxxx images locally, you need binfmt support on your host. We use that same principle to create the final containers per architecture. We just pull in the armv6, armv7 or aarch64 container and run ... Technically you can do the same with a docker image:
docker run --rm -it -e QEMU_CPU=arm1176 --platform=linux/arm/v5 debian:bookworm-slim bash
root@bc44eb3f2c25:/# uname -a
Linux bc44eb3f2c25 6.7.3-zen1-2-zen #1 ZEN SMP PREEMPT_DYNAMIC Fri, 02 Feb 2024 17:03:56 +0000 armv6l GNU/Linux
root@bc44eb3f2c25:/#
That will use QEMU emulation for all binaries within that container. The same happens for us on GitHub in the workflows: vaultwarden/.github/workflows/release.yml, lines 65 to 68 in 897bdf8
There we load binfmt support, so we can run that architecture the same way. |
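For anyone reproducing this locally, a sketch of registering the binfmt handlers on the host (assuming the tonistiigi/binfmt helper image; Debian's qemu-user-static + binfmt-support packages achieve the same):

# Register QEMU handlers for 32-bit and 64-bit ARM (privileged, one-off)
docker run --privileged --rm tonistiigi/binfmt --install arm,arm64
# Afterwards foreign-architecture containers run transparently via QEMU
docker run --rm --platform=linux/arm/v7 debian:bookworm-slim uname -m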
Okay great. Installing the |
Looking at your odroid, it should be |
and probably also |
Yes, but it does not matter, as it fails on all 32-bit ARM systems the same way (when using Debian and the same set of commands). |
Have you tried the cross-compiler? For the Raspberry Pi armv6 I haven't found a better solution than the qemu-builder for the moment, and I think sooner or later it will be impossible to build Vaultwarden on 32-bit machines, as the crates will keep growing and growing. |
That is how we do it. We cross-compile for the target architecture. |
First of all, cross-compiling is of course an option. But as said, to rule out surprises and to assure that the linked and available shared libraries 100% match, also on Raspbian systems, I prefer to do builds within the target userland. But indeed, as long as there is no way to somehow reduce ... Currently running the test with the Debian Bookworm Docker container:
apt update
apt install qemu-user-static binfmt-support
cat << '_EOF_' > vaultwarden.sh
#!/usr/bin/env sh
set -e
apt-get update
apt-get -y install curl gcc libc6-dev pkg-config libssl-dev git
curl -sSfo rustup-init.sh 'https://sh.rustup.rs'
chmod +x rustup-init.sh
# ARMv7: Workaround for failing crates index update in emulated 32-bit ARM environments: https://github.com/rust-lang/cargo/issues/8719#issuecomment-1516492970
# ARMv8: Workaround for increased cargo fetch RAM usage: https://github.com/rust-lang/cargo/issues/10583
export CARGO_REGISTRIES_CRATES_IO_PROTOCOL='sparse' CARGO_NET_GIT_FETCH_WITH_CLI='true'
./rustup-init.sh -y --profile minimal --default-toolchain none
rm rustup-init.sh
export PATH="$HOME/.cargo/bin:$PATH"
curl -fLO 'https://github.com/dani-garcia/vaultwarden/archive/1.30.3.tar.gz'
tar xf '1.30.3.tar.gz'
rm '1.30.3.tar.gz'
cd 'vaultwarden-1.30.3'
cargo build --features sqlite --release
_EOF_
docker run --platform=linux/arm/v7 debian:bookworm-slim sh -c "$(<vaultwarden.sh)"
... I did this within a VirtualBox VM running Debian Bookworm. I should have enabled nested virtualization first (which requires disabling the core isolation > memory integrity security feature on Windows 11) to speed things up. It is running, but quite slowly ... Btw, the |
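As a side note, a hypothetical variant of the same invocation that bind-mounts the script instead of inlining it (avoids relying on the bash-only "$(<file)" expansion on the host):

docker run --rm --platform=linux/arm/v7 \
  -v "$PWD/vaultwarden.sh:/vaultwarden.sh:ro" \
  debian:bookworm-slim sh /vaultwarden.sh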
@MichaIng we do cross-compiling too. Building via qemu takes a very long time. While not measured, it was certainly more than double the time. Also, sometimes qemu can cause strange issues which are hard to debug. Most of the time it works, but much slower. |
And there it failed the same way within the Docker container:
If |
But, what is the reason for not cross-compiling? |
Quoting myself:
Basically to assure that the userland on the build system, hence the linked libraries on the build host, exactly matches the one on the target system. And depending on the toolchain, it is also much easier to set up, compared to installing a cross-compiler and multiarch libraries and assuring they are used throughout the toolchain. E.g. Python builds with Rust code have an issue of losing architecture information along the way: 32-bit ARM wheels compiled on a 32-bit userland/OS with a 64-bit kernel (the default since Raspberry Pi 4 and 5, even on a 32-bit userland/OS) are strangely marked as |
Nothing very thorough. On my old RPi 1 everything is as slow as geology, so I didn't notice a very significant drop in performance. |
Another item: why not use the pre-compiled MUSL binaries? Those are distro independent. |
Where do you provide those? Or do you mean to extract them from Docker images? However, as we have our own build scripts and GitHub workflows already, it feels better to also use them and control the builds, in case of flags, profiles etc. And I guess you e.g. do not provide RISC-V |
@MichaIng could you try the following please? Replace the following part in Cargo.toml:
[profile.release]
strip = "debuginfo"
lto = "fat"
codegen-units = 1
With:
[profile.release]
strip = "debuginfo"
lto = "thin"
And test again? |
Actually, this might be better for your use case; run this before you run the build:
export CARGO_PROFILE_RELEASE_LTO=thin CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16 CARGO_PROFILE_RELEASE_STRIP=symbols |
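A usage sketch of those overrides, assuming the same feature flags as the script earlier in this thread:

# Apply the suggested profile overrides for a single build, no Cargo.toml edit
export CARGO_PROFILE_RELEASE_LTO=thin \
       CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16 \
       CARGO_PROFILE_RELEASE_STRIP=symbols
cargo build --features sqlite --release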
I actually think that my previous post will help you. I was looking at the diff between ... The main benefit will be the ... I also set the ... I also added the ... I tested this myself on my system via a docker container, and it looked like it didn't come above 4 GiB. |
That works for me. |
I also added a new release profile. That might be useful too once merged. |
Thank you for that, it's going to be very useful for me. |
@FlakyPi can you verify if this works?
export CARGO_PROFILE_RELEASE_LTO=thin CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 CARGO_PROFILE_RELEASE_STRIP=symbols
If so, then we can close this issue since the PR for using |
It crashes with ... It worked with ... |
Hmm, then I'll have to change the profile. |
It seems (as discussed here: #4320) a single codegen unit still makes it crash. This sets it to the default of 16 that Rust uses for the release profile.
Ok that is merged. Since that seems to solve the issue I'm going to close this one. If not please reopen. |
Many thanks guys, and sorry for my late reply. Since >1 codegen units and thin LTO both seem to potentially worsen performance, and neither is present in the
Although, docs say that ... I am also confused why more parallelisation (codegen units) uses less memory, while I would usually expect it to consume more memory. Did someone test |
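One way to get numbers for that question (a sketch, assuming GNU time is available as /usr/bin/time and the same feature flags as above) is to compare a clean build with 1 vs. 16 codegen units:

cargo clean
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1  /usr/bin/time -v cargo build --features sqlite --release
cargo clean
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=16 /usr/bin/time -v cargo build --features sqlite --release
# Compare the "Maximum resident set size" lines of the two runs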
@MichaIng it's difficult for me to really test it, actually. The main thing is, the profile changed since you noticed compiling went wrong. We changed from ... Also, 16 codegen units release memory when they are done, and it's not one process; that might help on low-end systems maybe? I think thin with 16 is the best bet, since that was the previous default. |
I tested the ... Then I tested with ... Currently running without both, and afterwards with both, just to have the full picture. EDIT: Okay, now I am confused, as the build went through without either of the two settings changed, using max 2.14 GiB memory, hence even a little lower than with ... EDIT2:
On my Odroid XU4, I get:
This is on Arch Linux ARM armv7 with https://gitlab.archlinux.org/archlinux/packaging/packages/vaultwarden/-/blob/cb935a55918ef8cace6455426f9c68b7687dd29d/PKGBUILD, but modified for
|
Builds pass with |
@polyzen probably because there is a lot of extra code per database feature and it also needs to link with one extra library. |
I accidentally built with an older version when doing the above tests, which explains why it succeeded with the
Max memory was btw obtained in a loop which checked ... What I take from this is that the max memory usage with ... Does someone know whether the panic stack trace gives any meaningful information when all symbols are removed? Otherwise I suggest adding |
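For reference, a sketch of such a loop (my reconstruction, not the exact one used): sample used RAM + swap once per second and keep the highest value seen, in MiB:

max=0
while :; do
    # Sum the "used" columns of Mem and Swap from free(1)
    used=$(free -m | awk '/^Mem:/ {m=$3} /^Swap:/ {s=$3} END {print m+s}')
    [ "$used" -gt "$max" ] && max=$used
    printf 'current: %s MiB, max: %s MiB\n' "$used" "$max"
    sleep 1
done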
Thanks for all the testing. |
Sure 🙂. I guess otherwise the differences were smaller. |
Not per se, since you changed build parameters. |
Err, right, it depends on which flags the dependencies were compiled with, or, as far as we know, simply on their size. However, everything was recompiled on every build, so we now have an idea of which flag/option has which effect. |
Subject of the issue
When building vaultwarden on 32-bit ARM systems, it fails at the last compilation step, when assembling/linking the final vaultwarden binary. I first recognised that our GitHub Actions workflow failed building vaultwarden v1.30.3. This compiles on the public GitHub Actions runners within a QEMU-emulated container, throwing the following errors:
I then tested it natively on an Odroid XU4, which fails with:
I suspect both to be the same underlying issue, but the build inside the container is probably aborted by the host/container engine.
The same works well with x86_64 and aarch64 builds, natively and within the same QEMU container setup.
I monitored some stats during the build on the Odroid XU4:
These are the seconds around the failure. RAM size is 2 GiB, and I created an 8 GiB swap space. The last build step of the vaultwarden crate/binary utilises a single CPU core (the XU4 has 8 cores, so 1 core maxed out is 12.5% CPU usage, the way I obtained it above) with a single process, and RAM + swap usage seems to crack the 3 GiB limit for a single process, which would explain the issue. LPAE allows a larger overall memory size/usage, but the utilisation for a single process is still limited.
I verified this again by monitoring the process resident and virtual memory usage in htop: one rustc process during the last build step, with 4 threads. The build fails when the virtual memory usage crosses 3052 MiB, i.e. quite precisely the 32-bit per-process memory limit.
Since we have successful builds/packages with vaultwarden v1.30.1, I tried building v1.30.2, which expectedly fails the same way, as it differs from v1.30.3 by just 2 tiny, surely unrelated commits. v1.30.1 still builds fine, so the culprit is to be found between v1.30.1 and v1.30.2, probably just dependency crates which grew in size.
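A sketch of how the same per-process figures can be sampled without htop (assuming a single rustc process remains at that point; values printed in MiB):

while pid=$(pgrep -n rustc); do
    # vsz/rss are reported in KiB by ps; convert to MiB
    ps -o vsz=,rss= -p "$pid" | awk '{printf "rustc VIRT: %d MiB, RES: %d MiB\n", $1/1024, $2/1024}'
    sleep 5
done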
Not sure whether there is an easy solution/workaround. Of course we could try cross-compiling, but I actually would like to avoid that, as it is difficult to assure that the correct shared libraries are linked, especially when building for ARMv6 Raspbian systems.
Deployment environment
Install method: source build
Clients used:
Reverse proxy and version:
MySQL/MariaDB or PostgreSQL version:
Other relevant details:
Steps to reproduce
On Debian (any version):
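The original reproduction block did not survive here; presumably it matched the build script shared earlier in this thread. A condensed sketch under that assumption:

# Install build dependencies and rustup, then build the release tarball
apt-get update
apt-get -y install curl gcc libc6-dev pkg-config libssl-dev git
curl -sSf 'https://sh.rustup.rs' | sh -s -- -y --profile minimal
. "$HOME/.cargo/env"
curl -fLO 'https://github.com/dani-garcia/vaultwarden/archive/1.30.3.tar.gz'
tar xf 1.30.3.tar.gz
cd vaultwarden-1.30.3
cargo build --features sqlite --release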
Expected behaviour
Actual behaviour
Troubleshooting data