zig prereq races as of 0.11.0-dev.1782+b52be973d #14815
Comments
I do have If that helps, I could send over the archive privately. |
I have a CI machine in production where I can reproduce this.
I will do my best to keep its state until tomorrow. If there is a command I can run there, I am all ears. |
@motiejus and I have been discussing this issue on IRC, and he shared with me a tarball of the zig-cache directory from the affected system. We have made the following observations:
The only artifacts found in o/ apart from 3726 empty directories are:
What I find curious about this is that zig only creates a directory inside o/ if it is about to write a build artifact there. I can't fathom why there would be 3726 empty directories here, apart from a third party going in and deleting files.
As for a plan of attack here, I'd like to finish my work on #14647. I'm hoping to land this branch within 1 week. There is a high chance that, if there is a race condition in zig's caching system which is causing this issue, it will be reproduced by either my local computer or the CI when testing this branch. If it does not get reproduced, that suggests we should look into a different possible cause. Perhaps something more straightforward, hiding under our very noses.
Meanwhile, #14821 was merged yesterday, addressing what could be a related issue to this one. That issue was indeed being triggered reliably by my work in the branch mentioned above. Another avenue of attack that I will pursue is to examine the caching system and look for an equivalent bug to the one that was recently solved. |
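For anyone wanting to check whether their own cache shows the same symptom, here is a minimal Python sketch that counts empty and non-empty subdirectories of an o/ directory. The function name `count_empty_dirs` and the path in the usage note are illustrative assumptions, not part of zig or of the analysis above.

```python
import os


def count_empty_dirs(o_dir):
    """Return (empty, non_empty) counts for the immediate subdirectories of o_dir."""
    empty = 0
    non_empty = 0
    with os.scandir(o_dir) as entries:
        for entry in entries:
            # Only look at real directories; skip files and symlinks.
            if not entry.is_dir(follow_symlinks=False):
                continue
            with os.scandir(entry.path) as children:
                if any(True for _ in children):
                    non_empty += 1
                else:
                    empty += 1
    return empty, non_empty
```

Usage would be something like `count_empty_dirs("/path/to/zig-cache/o")`, where the path is whatever cache directory is being inspected (hypothetical here).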
I have landed #14647, and as hoped, I have seen some sporadic issues that look related to this one. For example, in https://github.com/ziglang/zig/actions/runs/4444638053/jobs/7803016844:
Likewise, in https://github.com/ziglang/zig/actions/runs/4444638053/jobs/7803016681:
This is likely a related issue, and has a completely straightforward solution: #14978
So, I will continue by addressing the tail end of issues that my build-parallel branch has exposed. I think it is very likely that, in a few weeks' time, as these follow-up issues are resolved, this issue (the one I am commenting on) will be solved as well. |
More information in the commit message and ziglang/zig#14815
To capture/troubleshoot such cases I am considering to run a Also, #14923 would have been resolved much quicker with In this specific case we would see the stack trace immediately where the Specific questions to maintainers (sorry for previous pings):
Use case for publishing
|
I find myself quite often creating ReleaseSafe builds and putting them to production for certain experiments:
- Debug info is for stack traces. An ongoing example where those would help is ziglang#14815.
- Safety checks would have saved a couple of mine and @kubkon's hours in ziglang#15098.
This is a breaking change for scripts that make Zig releases -- I will submit another PR to zig-bootstrap and release-cutter after this is merged.
Sporadic failure observed on a CI run mac-debug (#15285 (comment)).
2023-04-18T08:30:11.3938460Z Install the project...
2023-04-18T08:30:11.4063650Z -- Install configuration: "Debug"
2023-04-18T08:57:06.4523620Z ++ pwd
2023-04-18T08:57:06.4799110Z + stage3/bin/zig build test docs --zig-lib-dir /Users/runner/work/zig/zig/build/../lib -Denable-macos-sdk -Dstatic-llvm -Dskip-non-native --search-prefix /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af
2023-04-18T09:23:00.6724990Z zig test ReleaseFast native: error: thread 183881 panic: reached unreachable code
2023-04-18T09:23:00.6854820Z /Users/runner/work/zig/zig/lib/std/os.zig:962:23: 0x100e24071 in ftruncate (zig)
2023-04-18T09:23:00.6969080Z .INVAL => unreachable, // Handle not open for writing
2023-04-18T09:23:00.7074050Z ^
2023-04-18T09:23:00.7194090Z /Users/runner/work/zig/zig/lib/std/fs/file.zig:254:30: 0x100c90736 in setEndPos (zig)
2023-04-18T09:23:00.7302450Z try os.ftruncate(self.handle, length);
2023-04-18T09:23:00.7403820Z ^
2023-04-18T09:23:00.7916250Z /Users/runner/work/zig/zig/lib/std/Build/Cache.zig:870:55: 0x100c903c3 in writeManifest (zig)
2023-04-18T09:23:00.8029440Z try manifest_file.setEndPos(contents.items.len);
2023-04-18T09:23:00.8029780Z ^
2023-04-18T09:23:00.8030070Z /Users/runner/work/zig/zig/src/Compilation.zig:2122:26: 0x100cff3d0 in update (zig)
2023-04-18T09:23:00.8030380Z man.writeManifest() catch |err| {
2023-04-18T09:23:00.8030600Z ^
2023-04-18T09:23:00.8031030Z /Users/runner/work/zig/zig/src/main.zig:3384:36: 0x100d2f1c6 in serve (zig)
2023-04-18T09:23:00.8031350Z try comp.update(main_progress_node);
2023-04-18T09:23:00.8031600Z ^
2023-04-18T09:23:00.8031880Z /Users/runner/work/zig/zig/src/main.zig:3202:31: 0x100b98d1e in buildOutputType (zig)
2023-04-18T09:23:00.8032790Z test_exec_args.items,
2023-04-18T09:23:00.8033040Z ^
2023-04-18T09:23:00.8038770Z /Users/runner/work/zig/zig/src/main.zig:273:31: 0x100b6b62f in mainArgs (zig)
2023-04-18T09:23:00.8039320Z return buildOutputType(gpa, arena, args, .zig_test);
2023-04-18T09:23:00.8039580Z ^
2023-04-18T09:23:00.8039840Z /Users/runner/work/zig/zig/src/main.zig:211:20: 0x100b6a887 in main (zig)
2023-04-18T09:23:00.8060930Z return mainArgs(gpa, arena, args);
2023-04-18T09:23:00.8061200Z ^
2023-04-18T09:23:00.8061470Z /Users/runner/work/zig/zig/lib/std/start.zig:609:37: 0x100b6d0b3 in main (zig)
2023-04-18T09:23:00.8061760Z const result = root.main() catch |err| {
2023-04-18T09:23:00.8061990Z ^
2023-04-18T09:23:00.8062190Z ???:?:?: 0x7fff203e7f3c in ??? (???)
2023-04-18T09:23:00.8062590Z ???:?:?: 0x12 in ??? (???)
2023-04-18T09:23:00.8062770Z
2023-04-18T09:23:00.8062960Z zig test ReleaseFast native: error: the following command terminated unexpectedly:
2023-04-18T09:23:26.2318440Z /Users/runner/work/zig/zig/build/stage3/bin/zig test /Users/runner/work/zig/zig/test/behavior.zig -OReleaseFast --cache-dir /Users/runner/work/zig/zig/build/zig-local-cache --global-cache-dir /Users/runner/work/zig/zig/build/zig-global-cache --name test -I /Users/runner/work/zig/zig/test -L /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/runner/work/zig/zig/lib --listen=-
2023-04-18T09:23:26.2319480Z zig test ReleaseFast native: error: thread 183882 panic: reached unreachable code
2023-04-18T09:23:26.2424740Z /Users/runner/work/zig/zig/lib/std/os.zig:962:23: 0x107649071 in ftruncate (zig)
2023-04-18T09:23:26.2526060Z .INVAL => unreachable, // Handle not open for writing
2023-04-18T09:23:26.2640040Z ^
2023-04-18T09:23:26.2743480Z /Users/runner/work/zig/zig/lib/std/fs/file.zig:254:30: 0x1074b5736 in setEndPos (zig)
2023-04-18T09:23:26.2852980Z try os.ftruncate(self.handle, length);
2023-04-18T09:23:26.2964870Z ^
2023-04-18T09:23:26.3068300Z /Users/runner/work/zig/zig/lib/std/Build/Cache.zig:870:55: 0x1074b53c3 in writeManifest (zig)
2023-04-18T09:23:26.3177990Z try manifest_file.setEndPos(contents.items.len);
2023-04-18T09:23:26.3279350Z ^
2023-04-18T09:23:26.3381060Z /Users/runner/work/zig/zig/src/Compilation.zig:2122:26: 0x1075243d0 in update (zig)
2023-04-18T09:23:26.3482580Z man.writeManifest() catch |err| {
2023-04-18T09:23:26.3583880Z ^
2023-04-18T09:23:26.3685600Z /Users/runner/work/zig/zig/src/main.zig:3384:36: 0x1075541c6 in serve (zig)
2023-04-18T09:23:26.3787030Z try comp.update(main_progress_node);
2023-04-18T09:23:26.3889390Z ^
2023-04-18T09:23:26.3990850Z /Users/runner/work/zig/zig/src/main.zig:3202:31: 0x1073bdd1e in buildOutputType (zig)
2023-04-18T09:23:26.4042130Z test_exec_args.items,
2023-04-18T09:23:26.4042810Z ^
2023-04-18T09:23:26.4043350Z /Users/runner/work/zig/zig/src/main.zig:273:31: 0x10739062f in mainArgs (zig)
2023-04-18T09:23:26.4043910Z return buildOutputType(gpa, arena, args, .zig_test);
2023-04-18T09:23:26.4044370Z ^
2023-04-18T09:23:26.4044840Z /Users/runner/work/zig/zig/src/main.zig:211:20: 0x10738f887 in main (zig)
2023-04-18T09:23:26.4045340Z return mainArgs(gpa, arena, args);
2023-04-18T09:23:26.4045760Z ^
2023-04-18T09:23:26.4046240Z /Users/runner/work/zig/zig/lib/std/start.zig:609:37: 0x1073920b3 in main (zig)
2023-04-18T09:23:26.4046760Z const result = root.main() catch |err| {
2023-04-18T09:23:26.4047210Z ^
2023-04-18T09:23:26.4047650Z ???:?:?: 0x7fff203e7f3c in ??? (???)
2023-04-18T09:23:26.4048070Z ???:?:?: 0x13 in ??? (???)
2023-04-18T09:23:26.4048390Z
2023-04-18T09:23:26.4050030Z zig test ReleaseFast native: error: the following command terminated unexpectedly:
2023-04-18T09:23:26.4053880Z /Users/runner/work/zig/zig/build/stage3/bin/zig test /Users/runner/work/zig/zig/test/behavior.zig -lc -OReleaseFast --cache-dir /Users/runner/work/zig/zig/build/zig-local-cache --global-cache-dir /Users/runner/work/zig/zig/build/zig-global-cache --name test -I /Users/runner/work/zig/zig/test -L /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af/lib -I /Users/runner/zig+llvm+lld+clang-x86_64-macos-none-0.11.0-dev.2441+eb19f73af/include --zig-lib-dir /Users/runner/work/zig/zig/lib --listen=- |
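The trace above ends in ftruncate hitting the branch commented "Handle not open for writing". As a small illustration of that OS-level condition only, the Python sketch below opens a file read-only and attempts to truncate it through that descriptor. The helper name `truncate_read_only` is hypothetical, and the sketch makes no claim about how the manifest handle ended up in that state inside the compiler.

```python
import errno
import os


def truncate_read_only(path):
    """Open path read-only, try ftruncate on it, and return the errno name (None on success)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.ftruncate(fd, 0)
        return None
    except OSError as e:
        # POSIX systems report that the descriptor is not open for writing,
        # typically as EINVAL (some platforms may use EBADF).
        return errno.errorcode.get(e.errno, str(e.errno))
    finally:
        os.close(fd)
```

In Zig's std.os.ftruncate at the revision shown in the trace, that error code is mapped to `unreachable`, which is why the failure surfaces as a panic rather than a returned error.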
Fixed by #15351. However there are still more issues left |
I spent 2 hours today troubleshooting this failure:
The The relevant code is here: Lines 395 to 435 in a1aa55e
We have 7 processes racing to populate
pub fn main() void {
    doNothing(0);
}
fn doNothing(arg: u0) void {
    _ = arg;
}
// run
//
They all report "cached", which prompts the question, where did the file get written from? I almost want to suspect hash collision, since it would explain this error output, if the colliding hash used a different filename than "tmp.zig".
When I run this locally on my linux host, Perhaps there is a false positive cache hit happening here? I don't see how that could be happening based on the logic in Cache.zig however. The only thing I can think of right now is the possibility that file system writes on macOS are asynchronous in the sense that file system operations on one file done after another file are not guaranteed to be observed in the same order by another process. Pseudocode example:
If this sequence of operations is expected to be possible on macOS then there is a design flaw in Zig's cache system. Perhaps @jamii has some insight on this last point? |
This is probably not correct even with many linux filesystems, depending on the settings. Operations on directories (ie creating files) can be reordered arbitrarily except before/after an fsync on the directory. See eg https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf. I think at minimum you need:
It's roughly the same situation as reads/writes between threads, where fsync is a combined read+write barrier. On mac the fsync might have to be |
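To make the ordering point concrete, here is a minimal Python sketch of a commonly used POSIX pattern: write to a temporary file, fsync the file, rename it into place, then fsync the containing directory. It is a sketch under assumptions and does not reproduce the exact list from the comment above; the function name `publish_file` and the ".tmp" suffix are illustrative, and the directory fsync assumes a POSIX system where a directory can be opened read-only.

```python
import os


def publish_file(dir_path, name, data):
    """Write data to dir_path/name so the contents are synced before the name appears."""
    tmp_path = os.path.join(dir_path, name + ".tmp")
    final_path = os.path.join(dir_path, name)

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # barrier for the file contents
    finally:
        os.close(fd)

    os.replace(tmp_path, final_path)  # atomically publish under the final name

    dir_fd = os.open(dir_path, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # barrier for the directory entry itself
    finally:
        os.close(dir_fd)
    return final_path
```

The design choice is that a reader which finds the final name never sees a half-written file, because the name only appears after the contents were synced.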
I'd recommend writing the caching code so that it can be run against a mocked filesystem which reorders operations at random unless prevented by an fsync. Or even just throws an error if one process tries to read from a file that another process has modified and not fsynced. |
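A minimal Python sketch of the simpler variant of this suggestion (raise an error when one process reads a file that another process has modified and not yet fsynced) might look as follows. The names `MockFS` and `UnsyncedReadError` are hypothetical, processes are modeled as plain integer ids, and random reordering is deliberately left out.

```python
class UnsyncedReadError(Exception):
    """Raised when a process reads a file another process wrote but did not fsync."""


class MockFS:
    def __init__(self):
        self._data = {}   # path -> bytes
        self._dirty = {}  # path -> pid of the last writer that has not fsynced

    def write(self, pid, path, data):
        self._data[path] = data
        self._dirty[path] = pid

    def fsync(self, pid, path):
        # Only the process that made the write can clear its own dirty mark.
        if self._dirty.get(path) == pid:
            del self._dirty[path]

    def read(self, pid, path):
        writer = self._dirty.get(path)
        if writer is not None and writer != pid:
            raise UnsyncedReadError(
                f"pid {pid} read {path!r} before pid {writer} fsynced it"
            )
        return self._data[path]
```

Caching logic written against an interface like this could be exercised in tests with several simulated process ids, so that a missing fsync shows up as a deterministic error instead of a rare CI flake.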
Thank you @jamii! Appreciate your insight here. |
This should be fixed by #15641. The previously described behavior was possible because:
|
Thank you for your fix, @jacobly0 ! @andrewrk mentioned there is still a race in Windows:
Do we have a tracking issue for this? I need a reference to put into the hermetic_cc_toolchain workaround. |
From ziglang/zig#14815 (comment): > @motiejus to be clear, this unrelated issue is not affected by a cache > clear, and can be resolved by simply rerunning the build (and not > getting very unlucky again).
Thanks! Closing this one and simplifying my workaround in uber/hermetic_cc_toolchain#74 now. |
From @jacobly in ziglang/zig#14815 (comment): > @motiejus to be clear, this unrelated issue is not affected by a cache > clear, and can be resolved by simply rerunning the build (and not > getting very unlucky again).
Zig notably fixes ziglang/zig#14815; also, by @jacobly in ziglang/zig#14815 (comment): > @motiejus to be clear, this unrelated issue is not affected by a cache > clear, and can be resolved by simply rerunning the build (and not > getting very unlucky again).
Dear maintainers,
We still observe some very-hard-to-reproduce races when building libc/rt prerequisites.
Some anecdotal evidence:
error: FileNotFound
When executed on a fresh installation, zig build-exe toolchain/launcher.zig (exact command) sometimes fails with:
This happens only on a fresh $ZIG_GLOBAL_CACHE_DIR (which we keep in /tmp/bazel-zig-cc). We have seen this happen on Darwin x86_64 and Darwin M1. We may have seen it on Linux, but I no longer have the logs to verify. My memory is poor.
I tried to reproduce this on my macOS machine overnight, without success. But we have been receiving a couple of complaints a week consistently over the last few weeks. Note that the sample size is quite large.
libcompiler_rt.a: No such file or directory
This happened on our CI yesterday:
Unfortunately, I can no longer access the build host or its global cache dir. It may be related.
Summary
I understand this is very little information to troubleshoot effectively. Here are the steps I am trying to do:
FileNotFound) on any Linux machine and instruct the engineer to re-run the command under strace, to see which file they are missing. However, this was not reported on Linux for the last week or so: either it did not happen, or people learned to remove the cache directory and move on. Since this happens more on OSX, it would make sense to debug it here. However, our engineers cannot run dtruss for compliance reasons.
Food for thought: is it time to reconsider how error context is propagated during the build phase, so errors could be augmented with additional context?