
[MacOS] nightly causes runtime segfault when crate binary is compiled with --release #50586

Closed
jjpe opened this issue May 9, 2018 · 20 comments
Labels: A-thread-locals (Area: thread local storage (TLS)), C-bug (Category: this is a bug), O-macos (Operating system: macOS), P-high (High priority), regression-from-stable-to-stable (Performance or correctness regression from one stable version to another)

Comments

@jjpe

jjpe commented May 9, 2018

On OS X 10.10, when I compile a crate I wrote for work and run it with cargo run --release, the resulting binary segfaults when I try certain actions. Oddly enough, debug builds do not seem to exhibit the segfault at all.

This bug is tricky to describe, as I have no clue what causes the generated binary to segfault.
Just to rule out that factor: the crate I'm building has no unsafe code.
However, I do know that with compilers up to and including rustc 1.26.0-nightly (392645394 2018-03-15), the generated crate binary runs fine.

Starting with rustc 1.26.0-nightly (55c984ee5 2018-03-16), something changed that causes this segfault on OS X.

For context and some additional information, please see the discussion on users.rust-lang.org.

@kennytm kennytm added O-macos Operating system: macOS regression-from-stable-to-stable Performance or correctness regression from one stable version to another. C-bug Category: This is a bug. labels May 9, 2018
@kennytm
Member

kennytm commented May 9, 2018

PRs in the range 3926453...55c984e:

* 55c984ee5d Auto merge of #48097 - oli-obk:did_you_mean_to_run_rls_with_clippy, r=nrc
* 3b6412b943 Auto merge of #48896 - alexcrichton:bitcode-in-object, r=michaelwoerister
* cc34ca1c97 Auto merge of #48818 - michaelwoerister:issue-47309, r=eddyb
* 5f3996c3ec Auto merge of #48813 - sinkuu:build_in_assert_macro, r=alexcrichton
* a7170b0412 Auto merge of #48524 - abonander:check-macro-stability, r=petrochenkov
* 36b6687318 Auto merge of #49051 - kennytm:rollup, r=kennytm
* db2f0ae1f4 Rollup merge of #48972 - mark-i-m:rustdoc_readme, r=QuietMisdreavus
* fb49ae0260 Rollup merge of #49049 - Centril:fix/flatten-fusediterator, r=bluss
* a199fb2319 Rollup merge of #49037 - estebank:coherence-tweaks, r=nikomatsakis
* 64490ff883 Rollup merge of #48990 - ExpHP:dont-drop-the-bomb, r=estebank
* 6fbd03393b Rollup merge of #48853 - Songbird0:addrparseerror_documentation_improvement, r=GuillaumeGomez
* 90230ce372 Rollup merge of #49050 - snf:fix_oom_asmjs, r=alexcrichton
* 55a0075a20 Rollup merge of #49042 - kennytm:fix-e0307-typo, r=rkruppe
* 77cf02de18 Rollup merge of #49024 - draganmladjenovic:mips64_cabi_sret, r=sanxiyn
* 5a7aa6cae5 Rollup merge of #49007 - estebank:follow-up-47574, r=oli-obk
* bf270bb66a Rollup merge of #48965 - alexcrichton:add-sha-feature, r=eddyb
* 97b489ef39 Rollup merge of #48961 - Songbird0:rustdoc_doctests_assertions, r=GuillaumeGomez
* 292c6ca041 Rollup merge of #48959 - alexcrichton:signext, r=eddyb
* 35b008cb92 Rollup merge of #48957 - Eijebong:rand, r=michaelwoerister
* e1d19df9a5 Rollup merge of #48922 - petrochenkov:asunder, r=nikomatsakis
* 68a602efa9 Rollup merge of #48892 - alexcrichton:thinlto-again, r=Mark-Simulacrum
* da88827267 Rollup merge of #48875 - jcowgill:mips-test-fixes, r=sanxiyn
* a45b79ddb7 Rollup merge of #48706 - ehuss:main-not-found-in-crate, r=estebank

@jjpe
Author

jjpe commented May 9, 2018

@kennytm Does the regression-from-stable-to-stable label apply to regressions on nightly as well?

@kennytm
Member

kennytm commented May 9, 2018

@jjpe It is "stable-to-stable" since the regression happened in 1.26, the upcoming stable version.

@nikomatsakis
Contributor

Hmm, I have a "nagging feeling" this may fall under the heading of "LLVM codegen regression" (#50422). Without a way to reproduce it, this is going to be hard to track down. Here are some ideas you might try, though:

  • See if enabling or disabling lto in your "release" configuration makes any difference, perhaps?
    • I forget how to do that :P but can figure it out if needed; add something like lto = false to the [profile.release] section in Cargo.toml, probably (see the sketch after this list)
  • Try to get a test case you can give to us :)
  • Or, bisect yourself through the commits in this range. I think, though, that because this regression is so old, we'll have to do the builds from source by hand.
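For reference, the setting lives under [profile.release] in Cargo.toml; the accepted boolean syntax is lto = true / lto = false. A minimal sketch (the values here are illustrative, not taken from the affected crate):

# Cargo.toml
[profile.release]
# Flip between `true` and `false` to see whether LTO affects the segfault.
lto = false

cargo build --release and cargo run --release pick up these profile settings automatically.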

@nikomatsakis nikomatsakis added the P-high High priority label May 10, 2018
@nikomatsakis
Contributor

triage: P-high

However, this is not assigned to anyone yet, as I don't know of any immediate steps forward available to us.

@alexcrichton
Member

One thing this may be, which I've seen a lot of lately, is a misaligned memory access due to misaligned TLS slots. To diagnose that, though, we'd need to take a look at a gdb backtrace as well as the faulting instruction.
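A sketch of how one might capture both on macOS using lldb (the binary path is a placeholder; gdb works similarly with run, bt, and x/i $pc):

$ lldb target/release/<binary>
(lldb) run                # run until the crash
(lldb) bt                 # backtrace at the faulting frame
(lldb) disassemble --pc   # show the faulting instruction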

@kennytm
Member

kennytm commented May 11, 2018

TLS... that reminds me of #48866. Any chance you're using Xcode 7?

@daschl

daschl commented May 15, 2018

I think I may have more info on this one, since I may have run into it as well. I'm on OS X 10.11.6.

The code works fine on 1.25 but immediately segfaults on 1.26. It's also using the regex crate; here is the valgrind info:

==75691== Process terminating with default action of signal 11 (SIGSEGV)
==75691==  General Protection Fault
==75691==    at 0x100151B0B: thread_local::thread_id::get (in target/release/oida)
==75691==    by 0x1000FC3E5: regex::re_unicode::Regex::captures (in target/release/oida)
==75691==    by 0x100032CEB: oida::index::EventIndex::feed (in target/release/oida)
==75691==    by 0x10003C742: oida::main (in target/release/oida)
==75691==    by 0x10002ECE5: std::rt::lang_start::{{closure}} (in target/release/oida)
==75691==    by 0x100172477: _ZN3std9panicking3try7do_call17hc7b9e6190a1d9f3eE.llvm.7628423903506348692 (rt.rs:59)
==75691==    by 0x10018655E: __rust_maybe_catch_panic (lib.rs:102)
==75691==    by 0x100170E01: std::rt::lang_start_internal (panicking.rs:285)
==75691==    by 0x10003E16B: main (in target/release/oida)

A couple more data points:

  • It works in a debug build but fails with release
  • The code depends on regex 1.0, which itself depends on thread_local 0.3.5; the latter seems to be the trigger
  • It works with lto = true on 1.26 but fails with lto = false

I'll see if I can come up with an isolated test.

@daschl

daschl commented May 15, 2018

Actually, that was easier than expected; it boils down to this, which segfaults:

extern crate thread_local; // the thread_local crate (0.3.5 at the time, per the data points above)

use thread_local::ThreadLocal;

fn main() {
    // Merely creating a per-object thread local and reading it back
    // is enough to segfault in a release build on the affected setup.
    let tls: ThreadLocal<u32> = ThreadLocal::new();
    assert_eq!(tls.get(), None);
}
$ valgrind target/release/examples/repro
==80783== Memcheck, a memory error detector
==80783== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==80783== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==80783== Command: target/release/examples/repro
==80783== 
--80783-- run: /usr/bin/dsymutil "target/release/examples/repro"
==80783== 
==80783== Process terminating with default action of signal 11 (SIGSEGV)
==80783==  General Protection Fault
==80783==    at 0x100002CAB: thread_local::thread_id::get (in target/release/examples/repro)
==80783==    by 0x1000020DB: repro::main (in target/release/examples/repro)
==80783==    by 0x100001CE5: std::rt::lang_start::{{closure}} (in target/release/examples/repro)
==80783==    by 0x10000B227: _ZN3std9panicking3try7do_call17hc7b9e6190a1d9f3eE.llvm.7628423903506348692 (rt.rs:59)
==80783==    by 0x10001BFAE: __rust_maybe_catch_panic (lib.rs:102)
==80783==    by 0x100009D51: std::rt::lang_start_internal (panicking.rs:285)
==80783==    by 0x10000222B: main (in target/release/examples/repro)
==80783== 
==80783== HEAP SUMMARY:
==80783==     in use at exit: 22,839 bytes in 193 blocks
==80783==   total heap usage: 259 allocs, 66 frees, 28,607 bytes allocated
==80783== 
==80783== LEAK SUMMARY:
==80783==    definitely lost: 0 bytes in 0 blocks
==80783==    indirectly lost: 0 bytes in 0 blocks
==80783==      possibly lost: 2,064 bytes in 1 blocks
==80783==    still reachable: 184 bytes in 2 blocks
==80783==         suppressed: 20,591 bytes in 190 blocks
==80783== Rerun with --leak-check=full to see details of leaked memory
==80783== 
==80783== For counts of detected and suppressed errors, rerun with: -v
==80783== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Segmentation fault: 11

Let me know if there is more I can do to help debug this.

@jjpe
Author

jjpe commented May 15, 2018

I tried turning on LTO, and that definitely makes the segfault go away on OS X.
In addition, I built the code with --release on Linux and Windows 10, and there are no regressions there that I can detect at the moment.

As for where the issue occurs, the code sample I mentioned on rust-lang.org is this:

    pub fn read_regex(&mut self, regex: &Regex) -> LexResult<Token> {
        self.ensure_has_input()?;
        let (m_start, m_end) = match regex.find(self.remaining_source_code()) {
            Some(regex_match) if regex_match.start() == 0 =>
                (regex_match.start(),  regex_match.end()),
            _ => return Err(LexErr::ExpectedRegexMatch {
                position: self.position()
            }),
        };
        let start = self.position() + m_start;
        let end   = self.position() + m_end;
        self.position = end;
        Ok(Token::new(start..end))
    }

The segfault here happens in the regex.find(self.remaining_source_code()) method call (as opposed to its argument self.remaining_source_code(), which works as it should). So it doesn't only happen in Regex::captures().

@daschl

daschl commented May 15, 2018

@jjpe I'm curious: can you run the binary under valgrind so we can see where the stack ends up when it segfaults? I'd like to check whether thread locals are also the culprit here.

@kennytm kennytm added the A-thread-locals Area: Thread local storage (TLS) label May 15, 2018
@jjpe
Author

jjpe commented May 15, 2018

I redacted some details because this is proprietary software, but I think the parts relevant to this thread (no pun intended) are still in there. Specifically, it looks like you're correct about thread locals being the issue:

$ sudo valgrind ./target/release/sxi
Password:
==17563== Memcheck, a memory error detector
==17563== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==17563== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==17563== Command: ./target/release/sxi
==17563==
--17563-- run: /usr/bin/dsymutil "./target/release/sxi"
Press Ctrl-D or enter "quit" to exit.
>> "foo"
==17563==
==17563== Process terminating with default action of signal 11 (SIGSEGV)
==17563==  General Protection Fault
==17563==    at 0x1004BDD4B: thread_local::thread_id::get (in ./target/release/sxi)
==17563==    by 0x100471960: regex::re_unicode::Regex::find_at (in ./target/release/sxi)
==17563==    by 0x10046EBB0: regex::re_unicode::Regex::find (in ./target/release/sxi)
==17563==    by 0x100409777: PARSER_CRATE::lexer::Lexer::read_regex (in ./target/release/sxi)
==17563==    by 0x10040A47B: PARSER_CRATE::lex_rules::whitespace (in ./target/release/sxi)
==17563==    by 0x10030CE3E: core::ops::function::Fn::call (in ./target/release/sxi)
                          [redacted]
==17563==    by 0x100407BF0: core::ops::function::Fn::call (in ./target/release/sxi)
==17563==
==17563== HEAP SUMMARY:
==17563==     in use at exit: 42,555 bytes in 432 blocks
==17563==   total heap usage: 520 allocs, 88 frees, 52,289 bytes allocated
==17563==
==17563== LEAK SUMMARY:
==17563==    definitely lost: 0 bytes in 0 blocks
==17563==    indirectly lost: 0 bytes in 0 blocks
==17563==      possibly lost: 0 bytes in 0 blocks
==17563==    still reachable: 7,575 bytes in 15 blocks
==17563==         suppressed: 34,980 bytes in 417 blocks
==17563== Rerun with --leak-check=full to see details of leaked memory
==17563==
==17563== For counts of detected and suppressed errors, rerun with: -v
==17563== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@alexcrichton
Member

From what we know so far, the issue here is a bug in older linkers on OS X which seems to be fixed in more recent versions. AFAIK we don't know what the bug is, only that newer versions of the linker do not produce binaries that fault, whereas older versions do.

One alternative is to compile Rust code with MACOSX_DEPLOYMENT_TARGET=10.6, which will cause libstd to use a different TLS strategy that shouldn't hit this linker bug.
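Concretely, that means setting the variable when compiling; a sketch (the clean step forces a full rebuild so the new deployment target actually takes effect):

$ cargo clean
$ MACOSX_DEPLOYMENT_TARGET=10.6 cargo build --release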

@daschl

daschl commented May 16, 2018

@alexcrichton do you have an idea why it still works with 1.25 then? Is it because of the LLVM upgrade which triggered this linker bug?

@alexcrichton
Member

@daschl LLVM upgrades bring a huge amount of changes, so my guess is that something entirely unrelated is triggering a preexisting bug. I'd doubt that an LLVM change which caused this to worsen would be the actual cause of the underlying issue.

@jjpe
Author

jjpe commented May 16, 2018

Well, at least for now enabling LTO is the "workaround of least resistance", because I don't need to care about env vars, and it doesn't seem to bring any performance regressions in my case either. So until I get a chance to upgrade the OS, I'll just use that.

I'm not sure what this means for the bug, though.

@pnkfelix
Member

Visiting for triage. Assigning myself to evaluate the current status and what our options are.

@kennytm
Member

kennytm commented Jul 12, 2018

If it is TLS, it might be fixed by #51828. Someone needs to confirm.
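One way to confirm, assuming the minimal reproduction from above: update to the latest nightly and rebuild without any lto override, e.g.

$ rustup update nightly
$ cargo +nightly clean
$ cargo +nightly run --release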

@jjpe
Author

jjpe commented Jul 13, 2018

I just built my crate using cargo +nightly run --release, with lto = true removed from Cargo.toml.
cargo +nightly -V reports cargo 1.29.0-nightly (af9e40c26 2018-07-05) (which is a tad odd, since I ran rustup update right before that, and rustup reported having updated nightly to rustc 1.29.0-nightly (64f7de921 2018-07-12)).

I cannot reproduce the issue anymore.

@SimonSapin
Contributor

Thank you @jjpe for confirming that this is fixed!

(The Cargo version that you get together with Nightly is based on the src/tools/cargo submodule in this repository, which is only updated occasionally. This is why cargo +nightly -V can show a date older than rustc +nightly -V. In both cases the date is that of the latest merge commit made by @bors in the corresponding repository.)
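For illustration, the two dates side by side (output reconstructed from the version strings in the comment above):

$ cargo +nightly -V
cargo 1.29.0-nightly (af9e40c26 2018-07-05)
$ rustc +nightly -V
rustc 1.29.0-nightly (64f7de921 2018-07-12)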
