-
Notifications
You must be signed in to change notification settings - Fork 102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Miscompilation of Bevy with MSVC #1249
Comments
Are you able to attach a debugger and get a backtrace when it crashes? |
Yes, but I'm not sure what to look for. It might be hitting a UD2 instruction but I don't really think so, that's about all that stands out to me. The stack is just a bunch of offsets in the assembly. I tried building from source with Edit: Also just noticed that |
I'm probably not going to get around debugging this myself soon as I don't have a windows dev environment for rust set up. I unfortunately also don't have a clue where the bug could be, so I can't do a blind attempt at fixing it either. Linux has the best support as that is what I use myself. *Bsd also works really well as it is close enough to Linux. macOS should work too most of the time, but support is a tiny bit worse as it uses a different object file format. Windows support is not the best as you noticed. Only the MinGW toolchain is supported at all. |
That's unfortunate. I don't have the expertise to debug any further than I have, but hopefully someone will be able to pick this up at a later date. |
https://github.com/bjorn3/rustc_codegen_cranelift/pull/1255 shows an ABI incompatibility with MSVC. Debugging that one should probably be easier than this. |
Hey @PROMETHIA-27, would you be able to test this branch to see if it fixes the issues you are experiencing? It contains a fix for an ABI issue that sounds suspiciously like what you are hitting. |
I was unable to build the project (kept getting |
That's great! 🎉
I haven't really tested anything with the jit, I think I disabled it early on for windows when porting the testsuite into rust. Hopefully it's something simple 😄 |
Nice!
Cranelift-jit should set the mappings to RX before an attempt at executing the code is made. |
We do run the cranelift test suite on the JIT, and it runs fine on windows. But I don't know much beyond that. |
So:
I think that's a regression overall? IIRC I had the default app actually running last time |
Yeah, this sounds like a regression, I'll try to reproduce that locally The JIT segfaulting is also something I'm able to reproduce, but It was segfaulting before, so I don't think that one is a regression. |
I'm having some issues reproducing this. I've cloned bevy, and ran
This compiles fine all the way until the last linking step, and then it seems to be stuck on the last linking step. The linker is using 100% of one cpu core, and has been running for the last 10 mins. Did you also run into this? I tried both on the build output:
Edit: Ok, so I re read this issue and it looks like you are using a hello world application (presumably this?) and compiling bevy as a dependency. When I tried to do that, everything compiled and ran as expected. Here's a repo with what I tried. Can you test that and see if it crashes for you? |
I tested two projects: one was hello world, fn main() {
println!("Hello, world!");
} and the other was a bevy default app. use bevy::prelude::*;
fn main() {
App::new().add_plugins(DefaultPlugins).run();
} Your repo does seem to work fine. |
Right, with the default app, I run into the linker being stuck issue. 😕 I'm going to try and troubleshoot that |
So, I let this run for a while with the default app, and it turns out its just really really slow. 18min to compile slow 🧐
But at least I can reproduce something, I'll track this down and see where it gets. Is that the error that you are also seeing? Edit: It seems some people are also hitting this with LLVM, so it may be an issue with the VS version I'm using Edit 2: I replaced the default linker with Edit 3: (sorry I don't want to spam notifications) I re-tried the bevy example I wonder if the issues we are seeing are all related to MSVC linker stuff? I'm using VS2019 with this linker version:
|
No, the error I got was a segfault (or window's equivalent, STATUS_ACCESS_VIOLATION) and it didn't take an abnormal amount of time to compile. |
It looks like you are using VS2022, which is the latest version. I downloaded and installed VS2022 (
I'm a bit at a loss here... Can you send a backtrace or run it in WinDbg to see where its crashing? |
I took a look with WinDbg but I still don't know what I'm looking for. It seemed to crash on a function pointer call? It segfaults on The call stack doesn't seem useful, it's just one frame named Nothing else pops out at me as being useful/relevant. Is cranelift able to build .pdbs? I think that would be useful but it doesn't seem to. Or embed more debug info. Normally windbg can actually point back to source code from assembly, but with cranelift builds it seems to be completely lacking any debug info. While messing around with the loaded plugins I managed to get it to crash with a panic, which also seemed to raise an ILLEGAL_INSTRUCTION error, after the panic printed. Edit: Also tried getting a backtrace with RUST_BACKTRACE=1, didn't work. Edit 2: The problem is even narrower than I thought! It's crashing at app construction time, not when the app is running. Just adding default plugins and not calling |
Not necessarily. If you have a miscompilation, minor unrelated changes can cause the miscompiled part to flip between crashing and not crashing for example due to a different value in a register somewhere or due to different code layout. Unfortunately miscompilations are pretty hard to debug. Right now I'm debugging a miscompilation that happens on the main branch of Cranelift (not 0.88.1 like cg_clif uses right now. 0.88.0 had a another miscompilation on AArch64, so I had to skip it) |
I do have a pdb next to the exe on the target dir, but to be honest I've never checked if they work. I did notice that a regular rust pdb for the bevy_hello_world has 350MiB but the ones produced by cg_clif have 50MiB, so maybe we aren't including a lot of info on it? I tried a bunch of plugin combinations and different orders on my machine, but none of them seemed to cause that issue. Yesterday I had the thought "Hey, maybe everything magically works on my machine, lets try a different one", so I added CI today to the But then another thing occurred to me, In that CI run I'm using the cg_clif artifacts from CI and last time you tried and it worked it was also a CI artifact. So maybe that figures into it? Could you try running it on your machine with the CI artifacts? Looking at the commit log between the last version that worked and now, the other thing that jumps out at me was the parallelization effort by @bjorn3. Although I think that would be unlikely since it looks you can reproduce this consistently. Another thing that is different is there are some slight differences in the stack probing code that you tried last time (and worked) and the code that actually was included upstream, so maybe there is a bug there? If the artifact thing fails, I'm going to try to revert to a old version of cranelift with the new stack probing code to see if that triggers something. Thanks for all your effort debugging this! Its been very helpful on narrowing things down! |
cg_clif only knows how to generate DWARF debuginfo. As such there is no debuginfo on Windows. Not sure why there is a pdb that large generated. If you know of any crate to generate codeview debuginfo please let me know.
It is. |
It's been a few minutes since I started building with the msvc artifact, so I think you're onto something. Will update when it's finished but I suspect that because of the toolchain override, when I built it built a gnu toolchain cargo-clif, but I use stable-msvc normally. So I can't imagine that was a nice interaction. Next I'll try building with my build on stable-gnu and see if that works. EDIT: Should've waited one more minute. Just finished, in 11min, and I got the same event loop not-main-thread error that was mentioned. My build did not work, but something interesting I noticed is that either artifact force installs a matching nightly toolchain when run, while my build does not. gnu artifact not only built but ran perfectly. |
That sounds like exactly what I've been seeing when linking with the default linker, using
Building But if I understand this correctly you had a
I'm not too sure about why this happens. |
Even with gimli-rs/object#475 we are still failing to compile bevy (with the same behaviour as in https://github.com/bjorn3/rustc_codegen_cranelift/issues/1249#issuecomment-1259733923), however now linking is pretty much instant, which helps with testing this. I'm going to look into the linker warnings above and the illegal instruction on that comment. |
bevy_reflect
with cranelift causes STATUS_ACCESS_VIOLATION during compilation
I was looking at this today for fun, and I notice that the panic observed in this thread happens is this check on winit: if !attributes.any_thread && thread_id != main_thread_id() {
panic!(
"Initializing the event loop outside of the main thread is a significant \
cross-platform compatibility hazard. If you absolutely need to create an \
EventLoop on a different thread, you can use the \
`EventLoopBuilderExtWindows::any_thread` function."
);
} That check the current thread id against this function: /// Returns the id of the main thread.
///
/// Windows has no real API to check if the current executing thread is the "main thread", unlike
/// macOS.
///
/// Windows will let us look up the current thread's id, but there's no API that lets us check what
/// the id of the main thread is. We would somehow need to get the main thread's id before a
/// developer could spin off any other threads inside of the main entrypoint in order to emulate the
/// capabilities of other platforms.
///
/// We can get the id of the main thread by using CRT initialization. CRT initialization can be used
/// to setup global state within a program. The OS will call a list of function pointers which
/// assign values to a static variable. To have get a hold of the main thread id, we need to place
/// our function pointer inside of the `.CRT$XCU` section so it is called before the main
/// entrypoint.
///
/// Full details of CRT initialization can be found here:
/// <https://docs.microsoft.com/en-us/cpp/c-runtime-library/crt-initialization?view=msvc-160>
fn main_thread_id() -> u32 {
static mut MAIN_THREAD_ID: u32 = 0;
/// Function pointer used in CRT initialization section to set the above static field's value.
// Mark as used so this is not removable.
#[used]
#[allow(non_upper_case_globals)]
// Place the function pointer inside of CRT initialization section so it is loaded before
// main entrypoint.
//
// See: https://doc.rust-lang.org/stable/reference/abi.html#the-link_section-attribute
#[link_section = ".CRT$XCU"]
static INIT_MAIN_THREAD_ID: unsafe fn() = {
unsafe fn initer() {
MAIN_THREAD_ID = GetCurrentThreadId();
}
initer
};
unsafe { MAIN_THREAD_ID }
} Which involves using CRT initialization to get the main thread id. Considering that there is also a linker warning about CRT sections, this appears to be the problem. In fact, wrapping the Removing that panic check from |
Thanks for the investigation!
That is very much possible. I will need to look into what the expected attributes are. |
I built a small example using the winit code above and we really are not emitting those symbols! test.rs
When we run it, we always get 0 with cg-clif
Listing the symbol table for a rustc produced binary shows the CRT symbols. But a CLIF one does not have them!
Doing a
The MS article does mention that multiple initializers are supported in the same section, but that should also be straightforward. @bjorn3 I'm not too sure where the best place within cg_clif is to handle this, but it should be fairly easy to mimic this output! I'd be happy to do it with some guidance |
The section contents seem to be identical:
The symbol table is indeed missing
I think a change to
With cg_clif |
Both changes will need to be made in object I think. |
Does Bevy work on Windows with https://github.com/bjorn3/rustc_codegen_cranelift/actions/runs/4276291179? This updates Cranelift to the latest commit on the main branch which includes bytecodealliance/wasmtime#5619. This version will be released in ~3 weeks, at which point I will update the master branch of cg_clif to use it. |
@bjorn3 Can confirm that winit and Bevy are working. |
Thanks for testing! |
I can also confirm it works! Got the first-person example in bevy_mod_wanderlust working. Seems to run into vulkan validation errors but that doesn't seem to be significant or consistent. |
Further good news! JIT mode works on windows!
Also! This required some setup to convert bevy to dynamic. I don't think bevy's built-in dynamic feature would work, since I think that would statically link a crate in which dynamically links the rest. To do this, I created a crate in the same workspace called There's one issue though; when the JIT compiler tries to resolve dependencies dynamically, it uses LoadLibraryExW on windows via libloading. It can load the stdlib just fine; that always works. But crates (always?) depend on the stdlib when compiled into their .dlls, When it tries to resolve their stdlib dependency, it can't (usually) find it. To fix this I was able to just copy the stdlib.dll into the target/debug/deps folder, which lets it find it. Some alternatives:
I think copying the stdlib to deps automatically would probably be the most doable; has the least amount of non-portable code/assumptions involved. |
Indeed. Rustc spawns a new thread to run on. For regular jit mode this thread is where the program runs. For lazy jit mode another thread is spawned to run the program and the main rustc thread is used for compiling new functions on the fly. I believe winit has a function to allow the event loop on a different thread on platforms that support it (which at least Linux does). You can try modifying bevy_winit to use it.
Right, I believe there is a variant of COFF with a higher limitation, but the object crate doesn't know how to write it afaik.
You figured it out. I expected that cargo would take care of adding libstd to the dynamic linker search path, but maybe it only does when running an executable and not when running rustc as is done in jit mode?
I can try this.
I don't like messing with the target dir behind the back of cargo. |
Guaranteeing that stdlib is loaded first would also fix it, I missed that possibility. That would be best I think. I was able to accomplish it for my setup by simply .rev()'ing the dependencies, but I'm not sure if stdlib always shows up last, or what happens in more complex setups. |
Thanks for the idea! That is actually correct. |
Cool! Wrt this:
Could this be reversed, and have rustc be split off into the new thread, and have the program run on main? I feel like this would probably be a more optimal answer than using a dubious feature + patching to get winit to not complain. |
I tested it myself- works perfectly. Would this be a good candidate to PR? |
Yes, but it did require re-introducing a custom rustc driver which broke sccache and |
Isn't it already using a custom driver? driver/jit.rs? |
That is a different kind of driver. The |
Ah, I see the issue. Greedy JIT mode works fine by just swapping the loop/call from the thread spawn but lazy jit dies because some TLS stuff is no longer available. |
Huh, I wouldn't expect lazy jit to break tls accesses any more than jit mode in general. That is I did expect tls accesses made by any jitted code to result in an error (or was it a panic?) for both modes but tls accesses made by code in the dylibs to work just fine in both modes. Lazy jit mode is even more experimental than regular jit mode, but if you have a somewhat minimal reproducing example I can take a look. |
im tried compiling my bevy project with bjorn3/rustc_codegen_cranelift/actions/runs/4276291179 but it crashes with a weird error. i am unsure of the meaning and if its possible/how to fix it, curious if anyone else has seen this before i also compiled from source and tried using that and it resulted in the same error so im assuming this is something related to my project or one of my dependencies Details
|
Could you open a separate issue for that? This is unrelated to the MSVC trouble OP has. I will dig into what exactly causes this issue as soon as I have time. I'm busy for the next few days unfortunately. |
Compiling on windows with the latest CI artifact build.
Full error message:
Not sure what's causing this. Happens consistently, so it's not random, but I have no idea what's in
bevy_reflect
that causes it. It's a pretty simple crate, doesn't do anything particularly weird.Unfortunately this crate is depended on by the most important bevy crates, so bevy as a whole can't be compiled on cranelift.
Not certain if this is limited to windows/just my PC, unable to test it anywhere else. Going to try and play with it to see if I can learn anything about it.
EDIT: This appears to not apply to the main branch of bevy. Instead, the main branch has the same error on
bevy_hierarchy
.EDIT 2: This appears to also not apply to compiling the main branch directly, rather than as a git dependency. That works fine.
EDIT 3: After some testing, these appear to be the conditions of the crash:
&[T]
does not trigger it; only a[T]
. As such this does not occur if nesting types, unless the nesting is to produce a multidimensional array. This also does not occur if the field itself is an array or slice. Not all dynamically sized types cause this issue;str
is fine. Not all types taking a const generic parameter cause this issue; aCustom<const T: u32>
is fine.#[derive(Component)]
, referring tobevy_ecs
'sComponent
trait. Manual impls do not cause this issue. The manual impl is as follows:No other derive I tested causes this issue.
This issue occurs for the
Children
type inbevy_hierarchy
which has the following definition:Unable to identify the problem in
bevy_reflect
. There are no instances of#[derive(Component)]
, but many other macros which are significantly more complex thanComponent
.The text was updated successfully, but these errors were encountered: