This repository has been archived by the owner on Nov 5, 2018. It is now read-only.

SIGILL in <crossbeam_channel::flavors::zero::Channel<T>>::recv #76

Closed
matklad opened this issue Jul 20, 2018 · 12 comments

Comments


matklad commented Jul 20, 2018

Hi! I am seeing this very weird SIGILL. Unfortunately, I can't give nice steps to reproduce because there are a lot of moving parts here. I am not even sure that it is a problem with crossbeam-channel. However, I think it makes sense to report the issue nonetheless :)

I've recently added crossbeam channel to RLS: rust-lang/rls#923.

I don't actually send any messages and simply wait for channels to close, taking advantage of the select! macro. Specifically, I see SIGILL in this bit of code:

https://github.com/rust-lang-nursery/rls/blob/0b9254b7dcdf52bba50a6e477d7891fb23adf62b/src/concurrency.rs#L44-L57
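For context, the pattern is roughly the sketch below. This is a simplified illustration written against std::sync::mpsc instead of crossbeam-channel (to stay version-agnostic and avoid the select! syntax), and the Jobs type and its methods only approximate the real RLS code: each job holds a Sender that is never used to send anything, and wait_for_all just blocks until every channel reports disconnection.

use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Tracks background jobs. Each job owns a Sender that is dropped when the
// job finishes, so the matching Receiver observes a disconnection.
struct Jobs {
    tokens: Vec<Receiver<()>>,
}

impl Jobs {
    fn new() -> Jobs {
        Jobs { tokens: Vec::new() }
    }

    // Spawn a job; keep a Receiver whose only purpose is to notice when the
    // job's Sender is dropped.
    fn spawn<F: FnOnce() + Send + 'static>(&mut self, job: F) {
        let (tx, rx): (Sender<()>, Receiver<()>) = channel();
        self.tokens.push(rx);
        thread::spawn(move || {
            let _token = tx; // dropped (closing the channel) when `job` returns
            job();
        });
    }

    // Block until every job has finished, i.e. every channel is disconnected.
    // No messages are ever sent; recv() returning Err means "sender dropped".
    fn wait_for_all(&mut self) {
        for rx in self.tokens.drain(..) {
            while rx.recv().is_ok() {}
        }
    }
}

fn main() {
    let mut jobs = Jobs::new();
    jobs.spawn(|| println!("job 1 done"));
    jobs.spawn(|| println!("job 2 done"));
    jobs.wait_for_all();
    println!("all jobs finished");
}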

The most interesting part is that I get the SIGILL only when testing RLS inside the rust-lang/rust repository. That is, cargo test inside RLS itself works.

I have rust-lang/rust checked out locally at commit 29ee65411c46b8f701bd1f241725092cb1b347e6, with the src/tools/rls submodule checked out to matklad/rls@746b0f4.

With this setup, running ./x.py test src/tools/rls SIGILLs. See the rls commit above for the precise point where it happens.

So, this is definitely not the most self-contained and easy to reproduce bug report, but this is the best I have at this time :-)


matklad commented Jul 20, 2018

Oh, and here's what the failure looks like:


Thread 2 "test::fail_unin" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7fffef9ff700 (LWP 14791)]
0x000055555572334a in <crossbeam_channel::flavors::zero::Channel<T>>::recv ()
(gdb) bt
#0  0x000055555572334a in <crossbeam_channel::flavors::zero::Channel<T>>::recv ()
#1  0x000055555570a753 in rls::concurrency::Jobs::wait_for_all ()
#2  0x00005555556dc719 in rls::actions::InitActionContext::wait_for_concurrent_jobs ()
#3  0x00005555556bee25 in rls::test::harness::expect_messages ()
#4  0x0000555555755d02 in rls::test::fail_uninitialized_request ()
#5  0x00007ffff5ee4f5f in <F as alloc::boxed::FnBox<A>>::call_box ()


matklad commented Jul 20, 2018

I've tried replacing crossbeam_channel with mutexes and condvars, and now the test blocks forever, which kinda hints that my code is to blame. Can it be that crossbeam aborts on some misuses of the API (I see some aborts in the code)? If that is the case, then printing a message might be helpful :)

EDIT: nvm, replacing crossbeam_channel with Mutex/Condvar did fix the issue. I had forgotten to actually notify on the condvar, and that kinda explains the deadlock :-)

EDIT: just in case, here's condvar impl: rust-lang/rls#951
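For reference, here is a minimal sketch of that Mutex/Condvar approach (a hypothetical JobCounter, not the actual code from rust-lang/rls#951), including the notify call that was missing in the first attempt:

use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// A simple job counter: increment when a job starts, decrement (and notify)
// when it finishes; wait_for_all() blocks until the count drops to zero.
#[derive(Clone)]
struct JobCounter {
    inner: Arc<(Mutex<usize>, Condvar)>,
}

impl JobCounter {
    fn new() -> JobCounter {
        JobCounter { inner: Arc::new((Mutex::new(0), Condvar::new())) }
    }

    fn start(&self) {
        let (lock, _cvar) = &*self.inner;
        *lock.lock().unwrap() += 1;
    }

    fn finish(&self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() -= 1;
        // This is the notification that was initially forgotten; without it,
        // wait_for_all() blocks forever.
        cvar.notify_all();
    }

    fn wait_for_all(&self) {
        let (lock, cvar) = &*self.inner;
        let mut count = lock.lock().unwrap();
        while *count > 0 {
            count = cvar.wait(count).unwrap();
        }
    }
}

fn main() {
    let counter = JobCounter::new();
    for i in 0..4 {
        let c = counter.clone();
        c.start();
        thread::spawn(move || {
            println!("job {} finished", i);
            c.finish();
        });
    }
    counter.wait_for_all();
    println!("all jobs finished");
}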

matklad added a commit to matklad/rls that referenced this issue Jul 20, 2018
When building RLS inside rust-lang/rust, I see a weird SIGILL in
crossbeam_channel:
crossbeam-rs/crossbeam-channel#76

Switching to a condvar for synchronization seems to fix the issue.

ghost commented Jul 20, 2018

I cannot reproduce the issue. :(

Does it also happen in debug mode? It'd help to see which line in the code caused the SIGILL.

It would also help to see where the other threads were stuck at the moment of the SIGILL. For example, it would be telling if some other thread was inside zero::Channel::<T>::send() when zero::Channel::<T>::recv() hit the SIGILL.
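(For readers unfamiliar with the zero flavor: it is a zero-capacity, rendezvous-style channel, so every send() must pair up with a recv(). Below is a minimal illustration of that pairing, using std's sync_channel(0) rather than crossbeam's internals.)

use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::Duration;

// A zero-capacity ("rendezvous") channel: send() blocks until a receiver is
// ready to take the value, and recv() blocks until a sender shows up. This is
// why it matters whether another thread is parked in send() while recv() hits
// the SIGILL.
fn main() {
    let (tx, rx) = sync_channel::<&str>(0);

    let sender = thread::spawn(move || {
        // Blocks here until the main thread calls recv().
        tx.send("hello").unwrap();
    });

    thread::sleep(Duration::from_millis(100));
    // The sender has been parked inside send(); this recv() completes the
    // rendezvous and unblocks it.
    assert_eq!(rx.recv().unwrap(), "hello");
    sender.join().unwrap();
}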


ghost commented Jul 20, 2018

Oh, it could be because ./x.py test src/tools/rls automatically switches the src/tools/rls submodule from 746b0f4 to 2b57851. Is there something I can do to avoid that?


matklad commented Jul 20, 2018

Is there something I can do to avoid that?

I don't know of a proper way to do this; I just make a commit in the rust-lang/rust repo that updates the submodule.


ghost commented Jul 20, 2018

Ok, I've managed to reproduce the issue.

Do you know how to run the test in gdb? This is what I'm getting:

$ /home/stjepan/work/rust/build/x86_64-unknown-linux-gnu/stage2-tools/x86_64-unknown-linux-gnu/release/deps/rls-3ec593da9fd41bf7                                                                    
/home/stjepan/work/rust/build/x86_64-unknown-linux-gnu/stage2-tools/x86_64-unknown-linux-gnu/release/deps/rls-3ec593da9fd41bf7: error while loading shared libraries: librustc_driver-123213898594acb6.so: cannot open shared object file: No such file or directory

Also, is there a way to run the test in debug mode rather than release?


matklad commented Jul 20, 2018

I've used the following:

LD_LIBRARY_PATH="/home/matklad/projects/rust/build/x86_64-unknown-linux-gnu/stage2/lib/rustlib/x86_64-unknown-linux-gnu/lib:$LD_LIBRARY_PATH" gdb --args ~/projects/rust/build/x86_64-unknown-linux-gnu/stage2-tools/x86_64-unknown-linux-gnu/release/deps/rls-3ec593da9fd41bf7 --test-threads 1 test::fail_uninitialized_request --nocapture


matklad commented Jul 20, 2018

Also, is there a way to run the test in debug mode rather than release?

Don't really know about this. I actually don't understand the specifics of x.py :(

One thing I've tried, though, is to set up stage2 as a custom rustup toolchain and then do cargo t inside src/tools/rls, but that didn't trigger the behavior.

One thing I realize now is that I've always tested in debug mode inside the rls repo. Maybe --release is important for triggering the issue?


matklad commented Jul 20, 2018

@stjepang yep! Running cargo test --release in the RLS repo triggers the bug! Sorry for making you compile rustc :D


matklad commented Jul 20, 2018

As a sanity check, test --release works as expected with the condvar version.

ghost closed this as completed in #77 on Jul 20, 2018

ghost commented Jul 20, 2018

I've just published v0.2.3 that should fix this bug.
Can you confirm that it's resolved on your machine, too? :)


matklad commented Jul 20, 2018

Yep, works on my machine!

Thanks for the swift fix!
