Occasional "attempting free on address which was not malloc()-ed" crash in CI
#78749
Comments
I've come across this several times, and it was discussed some on RocketChat. Since re-running the failed check almost always works for me, I suspect it's a memory issue causing some corruption of the runner, similar to how we previously had issues with compilation in SCU builds.
This is failing on the SCU build, and these errors started after enabling it, so it's probably related.
Sorry I've been out of action for a bit (my new PC arrived yesterday and is almost set up for compiling). We are planning to modify the SCU build slightly soon so the CI can specify a maximum SCU file size, which will limit RAM use; at the moment it will be very high, especially with sanitizers. I can help investigate this next week, but feel free to turn off SCU for this action temporarily, or use it on a simpler (non-sanitizer / non-test) build.
SCU build on the sanitizer / test build?
If we want to run SCU in CI, we need to verify that running it on the sanitizer / tests run (because it is the slowest) is actually the best choice. Although in theory SCU should accelerate CI too, when I looked at the actual build times (for CI anyway) they weren't necessarily faster than the regular build. This may be due to e.g. the CI caching, the larger translation units, or simply running low on RAM. There's also no guarantee that the SCU build will operate exactly like the regular build if there are code errors. The gold standard for sanitizer errors should probably be the regular build imo (although an SCU sanitizer build will also reveal different bugs).
Some ideas for what might be happening
An SCU build has fewer, larger files compared to a normal build with lots of small files, so the two can stress things in different ways. If the build itself is working OK, but the sanitizers are failing on a test, then that would seem to indicate e.g. a corrupt build, maybe an order-of-construction / destruction problem, or a timing difference exposing a race condition. I'm not an expert on GitHub Actions, but here are some possible ideas.
Some things we might try to figure out the culprit:
Order of construction / destruction
(This has only a small chance of being the problem, but it is worth covering in case the reader is not familiar with it.) Order of construction / destruction bugs (usually globals / statics / possibly singletons) can be extremely nasty. They are why many programmers prefer explicit construction / destruction functions for globals rather than relying on constructors / destructors, i.e. an order defined in code rather than determined by the compiler. Whether or not this is causing the problem, we should probably consider having a test for order of construction / destruction in the CI. This is an Achilles heel of C++: the order of construction and destruction of globals between translation units is undefined, and it can change from build to build, resulting in bugs on some builds but not on others. See:
With an SCU build in particular, for any globals within the larger translation unit the order is now determined by where they appear in the "uber file", so it can expose existing order-of-construction bugs.
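As a hedged illustration of that failure mode (all file and symbol names below are invented for the example; they are not from Godot's code base): two globals in different .cpp files, where one's constructor writes into the other. The cross-translation-unit initialization order is unspecified in standard C++, while an SCU build pins it to the include order of the uber file, which can hide or expose the bug.

// registry.cpp (hypothetical)
#include <map>
#include <string>

// A global with a non-trivial constructor.
std::map<std::string, int> g_registry;

// plugin.cpp (hypothetical)
extern std::map<std::string, int> g_registry;

// This static object's constructor touches g_registry. If plugin.cpp's
// globals happen to be initialized before registry.cpp's, it writes into a
// not-yet-constructed map: a classic order-of-construction bug that may only
// surface on some builds, or only under sanitizers.
struct AutoRegister {
	AutoRegister() { g_registry["plugin"] = 1; }
};
static AutoRegister g_auto_register;

// scu_chunk.gen.cpp (hypothetical SCU "uber file")
// In an SCU build the initialization order within the chunk is fixed by the
// include order below rather than left to the toolchain.
#include "registry.cpp"
#include "plugin.cpp"

// The safer pattern mentioned above avoids doing work in static constructors
// and instead uses explicit functions, e.g. registry_init() and
// plugin_register(), called in a well-defined order from startup code.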
For a scratch build, it does have a significant impact.
But indeed when there is a valid cache and the change doesn't impact core, it might not make as big a difference.
A lot of running stuff in GH actions later:
Details
#!/bin/bash -ex
i=0
# Build the editor once with ASAN/UBSAN and SCU enabled (same options as the CI job).
scons platform=linuxbsd target=editor tests=no verbose=yes warnings=extra werror=yes module_text_server_fb_enabled=yes dev_build=yes scu_build=yes debug_symbols=no use_asan=yes use_ubsan=yes linker=gold
# Open and close the regression test project in a loop until the sanitizer log check fails.
while true; do
((i+=1))
echo $"Attempt $i"
# Start each attempt from a clean editor state and a fresh copy of the test project.
rm -rf test_project
rm -rf ~/.local/share/godot/
rm -rf ~/.config/godot/
rm -rf ~/.cache/godot/
unzip 4.0.zip
mv "regression-test-project-4.0" "test_project"
bin/godot.linuxbsd.editor.dev.x86_64.san --audio-driver Dummy --editor --quit --path test_project 2>&1 | tee sanitizers_log.txt || true
# Stop as soon as check_ci_log.py finds a sanitizer error in the log.
misc/scripts/check_ci_log.py sanitizers_log.txt || break;
done
Thanks for testing, so the SCU stuff was a red herring and just a coincidence. This occasional CI crash is still a recent regression that happened around the time we merged the SCU CI change (#78462), so it might come from another PR merged around that time, i.e. in the past 2-3 weeks. Or possibly from a change to the GitHub Actions environment.
CC @warriormaster12 @RandomShaper
@RedworkDE @RandomShaper could be that this flag was missing 🤔 https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkPipelineCacheCreateFlagBits.html
That flag reduces thread safety, so to say. It puts the responsibility of ensuring the writes to the PSO cache are synchronized on the application. In Godot, we can use it since the
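For context, here is a minimal sketch (not Godot's actual code; the wrapper and mutex names are assumptions for illustration) of what using VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT looks like in plain Vulkan: the driver skips its own internal locking for the cache, so the application has to serialize every pipeline creation that uses it.

#include <vulkan/vulkan.h>
#include <mutex>

// Hypothetical wrapper, for illustration only.
struct ExternallySyncedPipelineCache {
	VkPipelineCache cache = VK_NULL_HANDLE;
	std::mutex mutex; // application-side synchronization required by the flag

	void create(VkDevice device) {
		VkPipelineCacheCreateInfo info = {};
		info.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
		// Promoted to core in Vulkan 1.3 from VK_EXT_pipeline_creation_cache_control.
		info.flags = VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT;
		vkCreatePipelineCache(device, &info, nullptr, &cache);
	}

	VkResult create_pipeline(VkDevice device, const VkGraphicsPipelineCreateInfo *pipeline_info, VkPipeline *r_pipeline) {
		// With EXTERNALLY_SYNCHRONIZED set, concurrent pipeline creation
		// against the same cache must be serialized by the application.
		std::lock_guard<std::mutex> lock(mutex);
		return vkCreateGraphicsPipelines(device, cache, 1, pipeline_info, nullptr, r_pipeline);
	}
};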
Sadly we still get intermittent crashes on ASAN builds on CI with the same error. Here's an example from a PR which was rebased to include #80296. The stack trace is different:
@RandomShaper What do you think, would an option to disable the pipeline cache completely for GitHub Actions be a feasible solution?
So is this still caused by the Vulkan PSO cache? I thought that the new crash was unrelated. Can we please try a CI build with it disabled, to be sure?
The original intermittent crash was caused by #76348, and #80296 changed the symptoms, but we still get intermittent crashes. So unless another PR in the meantime introduced another intermittent memory corruption issue that was hidden by the previous crash, it's likely that the PSO cache is still the issue. How can we disable it?
At the moment the only option is to comment out code related to pipeline caching. Commenting out the function
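For reference, a pipeline cache is optional at the API level in plain Vulkan, so one way to rule it out (sketch only; Godot's rendering driver wraps this in its own abstraction, so the real change would happen there) is to pass VK_NULL_HANDLE instead of a cache handle:

#include <vulkan/vulkan.h>

// Sketch only, not Godot code: with VK_NULL_HANDLE as the pipelineCache
// argument, the driver neither reads from nor writes to any cache for this
// call, which effectively disables PSO caching for it.
VkResult create_pipeline_without_cache(VkDevice device, const VkGraphicsPipelineCreateInfo &pipeline_info, VkPipeline &r_pipeline) {
	return vkCreateGraphicsPipelines(device, VK_NULL_HANDLE /* no pipeline cache */, 1, &pipeline_info, nullptr, &r_pipeline);
}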
Have there been new cases of the editor crashing, or has the issue by any chance resolved itself? I might be very wrong, but from what I've looked at so far, none of the new pull request actions have experienced this issue.
I haven't seen it in the past couple of weeks, while it used to be very frequent. Weird.
I guess we could keep following the situation and close the issue if nothing changes in maybe another couple of weeks?
My pet theory is that #81771 fixed these CI Vulkan pipeline cache crashes. Earlier, I noticed some weirdness with Vulkan when run with
That sounds plausible indeed. Either way, we can close this as resolved.
Godot version
master CI builds
System information
Linux / Editor with doubles and GCC sanitizers
Issue description
The "Linux / Editor with doubles and GCC sanitizers" build can fail with an "attempting free on address which was not malloc()-ed" crash in the "Open and close editor (Vulkan)" step.
https://github.com/godotengine/godot/actions/runs/5366302204/jobs/9735786296#step:15:138
https://github.com/godotengine/godot/actions/runs/5389199469/jobs/9783070708?pr=78740#step:15:138
The failure doesn't seem related to either PR (in fact one of them is just a docs change). I have only seen this issue these two times, but it's still probably worth looking into.
Steps to reproduce
Open and close https://github.com/godotengine/regression-test-project/archive/4.0.zip using the Vulkan renderer.
Minimal reproduction project
https://github.com/godotengine/regression-test-project/archive/4.0.zip