Reland: Improve performance of global code by emitting fewer atomic barriers. #47636
The good news is that it also seems to happen on Linux i686, so we might have a hope of debugging it :)
So this reproduces pretty consistently on 32-bit Linux by just doing
Force-pushed from b368475 to 2623a10, then from 2623a10 to abf54fd.
Finally caught this in a trace with debug info: https://julialang-dumps.s3.amazonaws.com/reports/2022-12-21T12-04-45-maleadt.tar.zst. Note that this is a 32-bit process, so you either need to replay this on a 32-bit system, or use a build of rr that supports replaying 32-bit processes (i.e. not the one from rr_jll).
Some digging in rr. This crashes when accessing an invalid field of an object:
Note the invalid
Not only does the field get overwritten here, the object is entirely different (i.e. the object's header is different). I'm not sure whether that means that the
Force-pushed from 847c633 to 707682e.
So this consistently crashes during LinearAlgebra/special with the above GC corruption when removing the atomic barrier between every toplevel instruction. To be sure, the latest commit puts it back, showing that the issue isn't caused by either switching to monotonic ordering or late-inserting barriers before calls. @vtjnash Do you have any thoughts on how this could be related? Could you explain the full purpose of these barriers? I thought they were only required to ensure a correct world age, so I'm not sure how removing those loads could result in GC corruption.
Bump. It'd be a bit unfortunate to have another minor version with this bug 🙂
Force-pushed from ab65997 to a34b0ed.
Rebased. Looks like the error doesn't happen anymore? I couldn't try locally, because this time even a non-debug build failed due to address space exhaustion. EDIT: using the i686 Buildkite artifact, I'm running the LinearAlgebra/special test suite a bunch of times right now, and it hasn't failed yet. Looking good?
Why are we relaxing this from acquire to monotonic?
I think you suggested that?
I don't remember
Do you want me to remove that part?
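For readers following the acquire vs. monotonic question: LLVM's `monotonic` ordering corresponds to C++'s `memory_order_relaxed`. Below is a minimal sketch of the difference between the two kinds of load. The names `world_counter`, `payload`, `publish`, `load_acquire` and `load_monotonic` are hypothetical stand-ins for illustration, not Julia's actual runtime symbols or codegen, and the sketch makes no claim about which ordering this PR needs.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration only (not Julia runtime symbols).
static std::atomic<std::size_t> world_counter{0}; // a published "age" counter
static int payload = 0;                           // plain data written before publishing

// Writer side: write the data, then publish the new counter with release ordering.
void publish(int value, std::size_t age) {
    payload = value;
    world_counter.store(age, std::memory_order_release);
}

// Acquire load: if it observes a value written by a release store, every write
// that preceded that store (here: payload) is guaranteed visible to this thread.
std::size_t load_acquire() {
    return world_counter.load(std::memory_order_acquire);
}

// Relaxed load (LLVM "monotonic"): the load itself is atomic, but it gives no
// ordering guarantee for other memory such as payload when used across threads.
std::size_t load_monotonic() {
    return world_counter.load(std::memory_order_relaxed);
}
```

On a single thread both loads return the same value; the difference only matters when another thread performs the store, where only the acquire load makes the earlier `payload` write safe to read.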
Force-pushed from a34b0ed to 6cc1542, then from 6cc1542 to d2daaa5.
Backporting these changes on top of 1.9 still results in segfaults on i686 while running LinearAlgebra/special... @vtjnash if you want, I can try generating an rr trace for you to take a look at (I'm slightly concerned that the underlying issue may still be present here, and it would also be useful to have this on 1.9).
Yes, that would be helpful
Re-land #47578, which got reverted because of a GC issue on win32. I haven't debugged this yet, but I want to have an open PR so that I don't lose track of it.
Fixes #47561