-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[lld][coff] undefined symbols from lazy-loaded archives #107365
Comments
@llvm/issue-subscribers-lld-coff Author: Mike Hommey (glandium)
STR:
- Download https://drive.google.com/file/d/1HhG-LO4MU1GGOLstHlFBi7rOrwKfNDbZ/view?usp=sharing
- Unpack the archive and enter the `repro` directory.
- Run `lld-link @response.txt`
Expected result:
Actual result:
What happens is that the symbols, by the time A crude retry, like the following, works around the issue:
But somehow, there is some non-determinism, as when running the command repeatedly, sometimes, it fails with Cc: @mstorsjo |
Thanks, I can reproduce this. As additional context - I can reproduce the nondeterminism on Linux. (Some underlying details about how files are loaded in LLD differ between Linux and Windows, where they are loaded asynchronously on Windows, but that doesn't seem to play any role here.) I wonder if #85290 makes any difference here - although I doubt it. |
It doesn't, but interestingly it makes strdup sometimes appear as undefined symbol, while it never happens with an unpatched tree. |
Yeah I’ve noticed that with a recent Unreal Engine build. I’ll fix it. |
Wasn’t that part of the original behavior too? From your original issue description:
|
No, I never got strdup as undefined symbol before applying my patch. |
Ah, I misunderstood - that was after applying your patch - I see. FWIW, I did see strdup as undefined symbol, occasionally, with an unpatched build. |
Maybe I didn't try hard enough. Maybe the patch makes it more likely, too. |
@mstorsjo Ignoring the strdup issue, do you think the crude retry patch from the report is good enough as a start? |
Well it looks, as you say, a bit crude. In one sense, this looks pretty much like what we do in So in one sense, it feels like a case where it's a tradeoff between always doing two passes through the symbols, or only doing it when we have a reason to believe we might need another round - and how costly is it to guess whether we need to make another round? Or should we make |
The indeterminism of the error for |
I now know what's up with
The second scenario is:
I'm not sure yet how to fix this properly. |
Good job in tracking this down! I wasn't really aware that I wonder if we should restructure Alternatively, can we in the case
include a check kinda like this:
before checking whether it is defined? |
I was actually just testing that right now, and it works, but your suggestion to add a resolveWeakAlias sounds simpler. |
and, unsurprisingly, it works. I'll go with this. Do you prefer this is addressed in a separate PR? Considering the non-determinism involved, I'm not sure this specific part can have a corresponding testcase. |
I'd prefer a separate PR, although they unfortunately kinda clash, touching the same area. I'm fine with the second PR being based on top of the first once, and you can add a note that it's based on that and reviews should focus on the second commit only. As for a testcase - yeah, given the nondeterminism it's a bit tricky. But it would still be good with some sort of testcase that can trigger the codepaths, even if at least occasionally. |
Is this something that would be considered uplifting to the release/19.x branch? (I'm going to apply them downstream anyways, + ec4d5a6, without which I'm also getting undefined symbols for the non-import strdup on other link commands) |
Maybe, I guess. It's not a regression fix, but if it's a low-risk, small and selfcontained bugfix I guess it can be considered for backporting too. |
Moving the discussion from the PR back here, where this is more relevant: I actually got things to link without the fix, by tweaking the command line lld-link is called with (which, incidentally, and annoyingly, -reproduce doesn't keep). And by using link.exe with a command line closer to the original, I actually get a more useful warning about what might, ultimately, be the root cause of all that's going on:
Sadly, it doesn't say what it conflicts with. I'm not sure yet where the problems comes from, but a) this is probably a situation lld-link should also warn about b) considering the PR ends up converting errors into different warnings that lead to a binary that doesn't work, it might be better to leave the situation as is with the errors. Better to fail to link than to produce something utterly broken. |
I finally found where this is all coming from. Here's the story, starting with the missing details in the original report:
At this point, I think the only thing that lld-link would need to fix here is to emit LNK4098. |
Although... as demonstrated by the testcase in #109082, the bug that it fixes is real... but I've only hit it by happenstance. |
…invokes r=firefox-build-system-reviewers,sergesanspaille See llvm/llvm-project#107365 (comment) for an example of how things can go wrong when they are passed first. I don't think there was a specific reason for them to have been put first in the first place. Differential Revision: https://phabricator.services.mozilla.com/D222899
…invokes r=firefox-build-system-reviewers,sergesanspaille See llvm/llvm-project#107365 (comment) for an example of how things can go wrong when they are passed first. I don't think there was a specific reason for them to have been put first in the first place. Differential Revision: https://phabricator.services.mozilla.com/D222899 UltraBlame original commit: f6cff71bfc6cba2b1d37d79fce2cb6b5d0b00852
…invokes r=firefox-build-system-reviewers,sergesanspaille See llvm/llvm-project#107365 (comment) for an example of how things can go wrong when they are passed first. I don't think there was a specific reason for them to have been put first in the first place. Differential Revision: https://phabricator.services.mozilla.com/D222899 UltraBlame original commit: f6cff71bfc6cba2b1d37d79fce2cb6b5d0b00852
Yes, that's a proper fix indeed. If this case is unlikely to come up otherwise, I guess it's not strictly needed to fix it, but then again, the fix isn't all that ugly either so it could be worth fixing it in any case. Do you have opinions on which way to go about it? As for the conflicts between static and dynamic CRT - that's good info, that properly explains how you'd run into this situation. Even despite that, I don't see why the result of that setup would crash though, no reason specific to LLD at least. But perhaps there are bits within the statically linked CRT objects that break if you'd mix and match like this? |
…invokes r=firefox-build-system-reviewers,sergesanspaille See llvm/llvm-project#107365 (comment) for an example of how things can go wrong when they are passed first. I don't think there was a specific reason for them to have been put first in the first place. Differential Revision: https://phabricator.services.mozilla.com/D222899
…invokes r=firefox-build-system-reviewers,sergesanspaille See llvm/llvm-project#107365 (comment) for an example of how things can go wrong when they are passed first. I don't think there was a specific reason for them to have been put first in the first place. Differential Revision: https://phabricator.services.mozilla.com/D222899
…invokes r=firefox-build-system-reviewers,sergesanspaille See llvm/llvm-project#107365 (comment) for an example of how things can go wrong when they are passed first. I don't think there was a specific reason for them to have been put first in the first place. Differential Revision: https://phabricator.services.mozilla.com/D222899 UltraBlame original commit: f6cff71bfc6cba2b1d37d79fce2cb6b5d0b00852
STR:
repro
directory.lld-link @response.txt
Expected result:
-reproduce
didn't store the manifest file, you may want to edit the response file to remove the manifest-related flags, in which case the expected result is a successful link)Actual result:
What happens is that the symbols, by the time
resolveRemainingUndefines
is called, are still undefined, butisLazy()
is true for them, andresolveRemainingUndefines
doesn't handle that case.A crude retry, like the following, works around the issue:
But somehow, there is some non-determinism, as when running the command repeatedly, sometimes, it fails with
strdup
marked as undefined symbol.Cc: @mstorsjo
The text was updated successfully, but these errors were encountered: