
Verify checksum of static lib in build script #545

Open
bartlomieju opened this issue Nov 30, 2020 · 14 comments
@bartlomieju
Member

As an alternative approach to fixing the truncated or malformed archive (offset to next archive member past the end of the archive after member /) error (#543, denoland/deno#8449), we should create checksum files in CI and later verify those checksums against the downloaded static lib during the build process.
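A minimal sketch of what that verification could look like in build.rs, assuming a SHA-256 checksum file published next to the static lib and the sha2 crate (file names and crate choice are assumptions, not the final design):

```rust
// build.rs (sketch)
use sha2::{Digest, Sha256};
use std::io::Read;
use std::{fs, io};

/// Compare the SHA-256 of the downloaded static lib against the checksum
/// generated in CI; fail the build on any mismatch.
fn verify_checksum(lib_path: &str, expected_hex: &str) -> io::Result<()> {
    let mut file = fs::File::open(lib_path)?;
    let mut hasher = Sha256::new();
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        hasher.update(&buf[..n]);
    }
    let actual_hex: String = hasher
        .finalize()
        .iter()
        .map(|b| format!("{:02x}", b))
        .collect();
    if actual_hex != expected_hex.trim() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            format!("checksum mismatch for {}: got {}", lib_path, actual_hex),
        ));
    }
    Ok(())
}

fn main() {
    // Hypothetical file names; the real names would come from the CI step
    // that produces the checksum files.
    let expected = fs::read_to_string("librusty_v8.a.sha256")
        .expect("checksum file missing");
    verify_checksum("librusty_v8.a", &expected)
        .expect("static lib failed checksum verification");
}
```

A mismatch would then surface as a clear build error instead of the confusing "truncated or malformed archive" message from the linker.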

@mash-graz
Contributor

hmm -- @bnoordhuis already suggested this kind of workaround in an early answer to my reports about this defect in our current build system.

maybe he still hasn't changed his mind about the actual cause of this issue and still rejects any explanation that interprets the trouble as a consequence of inter-process race conditions in our build system. i don't know -- he simply hasn't responded to my additional analyses, suggested practical solutions and patch attempts so far.

i spent a lot of time last weekend studying these symptoms and their causes. it's a rather puzzling and hard-to-reproduce issue, like most errors caused by parallel processes and unsynchronized data access.

i still can't give an exact explanation of why cargo runs two instances of build.rs in our particular case at all, but reports about other issues caused by parallel execution in cargo (rust-lang/cargo#851, rust-lang/cargo#1569, rust-lang/cargo#5575) and recent changes in the build system (https://internals.rust-lang.org/t/evaluating-pipelined-rustc-compilation/10199) suggest that we should always account for this kind of behavior and avoid any side effects of parallel invocation in our build scripts, because parallel runs of the build script may happen in cargo for many reasons.

utilizing checksums, or renaming/moving the completed file with an atomic operation once the download is done, are IMHO just very unsatisfying workarounds which do not address the actual cause of the defect. in fact, they will most likely just trigger a second parallel download attempt instead of blocking any further processing until the still ongoing first download finally becomes ready. that's why i suggested and used lock files as a much more adequate answer to this challenge in my PR (#543).
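A minimal sketch of that lock-file idea, using the fs2 crate for illustration (not necessarily the crate used in #543); the OS drops the advisory lock automatically if the holding process dies:

```rust
// build.rs (sketch of the lock-file approach)
use fs2::FileExt;
use std::fs::OpenOptions;
use std::path::Path;

fn download_once(lib_path: &Path, lock_path: &Path) {
    // Every build-script instance opens the same lock file...
    let lock_file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open(lock_path)
        .expect("cannot open lock file");

    // ...and blocks here until it holds the exclusive lock, so a second
    // instance simply waits while the first one is still downloading.
    lock_file.lock_exclusive().expect("cannot acquire lock");

    if !lib_path.exists() {
        download_static_lib(lib_path); // hypothetical helper
    }

    // Released explicitly here, and implicitly if the process dies.
    lock_file.unlock().expect("cannot release lock");
}

fn download_static_lib(_lib_path: &Path) {
    // placeholder for the actual download logic
}
```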

@bnoordhuis
Contributor

I went over this with @bartlomieju yesterday and we concluded that there's no way for the build script to exhibit such behavior barring bugs in cargo. Possible, of course, but not reproducible. Neither of us really liked the idea of landing fixes for a poorly understood problem.

Something like cargo build --all-targets (mentioned in one of the issues you link to) can't trigger it either, AFAICT. That might kick off parallel downloads but to separate directories.

I saw the PR you opened but that uses lockf() / fcntl(F_SETLK) under the hood and that's its own can of worms. Consider file locks on NFS, for example - they have all kinds of interesting failure modes.

I'm not completely opposed to some kind of mutual exclusion (in a "can't hurt - hopefully" fashion) but I'd use something like std::fs::OpenOptions::new().create_new(true) instead.
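A rough sketch of that create_new(true) idea (marker file name and polling strategy are just assumptions): whichever build-script instance manages to create the marker does the download, and every other instance waits for the finished lib to show up.

```rust
// build.rs (sketch)
use std::fs::{self, OpenOptions};
use std::io::ErrorKind;
use std::path::Path;
use std::{thread, time::Duration};

fn ensure_static_lib(lib_path: &Path, marker_path: &Path) {
    match OpenOptions::new().write(true).create_new(true).open(marker_path) {
        // We won the race: download, then remove the marker.
        Ok(_marker) => {
            download_static_lib(lib_path); // hypothetical helper
            let _ = fs::remove_file(marker_path);
        }
        // Another instance is already downloading: wait for it to finish.
        Err(e) if e.kind() == ErrorKind::AlreadyExists => {
            while marker_path.exists() {
                thread::sleep(Duration::from_millis(500));
            }
            assert!(lib_path.exists(), "download by the other process failed");
        }
        Err(e) => panic!("cannot create marker file: {}", e),
    }
}

fn download_static_lib(_lib_path: &Path) {
    // placeholder for the actual download logic
}
```

The known weakness (and the reason for the "can't hurt - hopefully" caveat above) is that a crash of the downloading process leaves a stale marker behind, which a real implementation would have to age out.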

@mash-graz
Contributor

mash-graz commented Dec 1, 2020

as already mentioned, there are many(!) reasons why cargo may invoke the build script twice in parallel. you can hardly prevent this behavior or foresee all coming changes in rust's build system that will reintroduce this kind of problem in one way or another. IMHO it's therefore practically unavoidable to solve this issue properly!

I saw the PR you opened but that uses lockf() / fcntl(F_SETLK) under the hood and that's its own can of worms. Consider file locks on NFS, for example - they have all kinds of interesting failure modes.

yes -- all the available cross-platform file locking solutions for rust are somewhat questionable, but it's even harder to handle the same task better with some new, creative home-made invention. the rust std lib unfortunately only supports mutex/semaphore handling for intra-process synchronization out of the box, with no simple and reliable solution for inter-process barrier/lock handling. i have therefore chosen one of the simplest and smallest external crates for this purpose, which at least handles the most desirable requirements of such a mechanism (e.g. automatically removing the locks in case of an interrupted process) as long as you respect a few minor limitations (e.g. that on windows you have to use the same open file handle within your application, not just the same file path).

if you prefer any of the other available and well-tested cross-platform lock file implementations or external rust crates for practical reasons, i wouldn't mind -- that's up to you! -- but ignoring the real cause of our particular problem and debating workarounds that will hardly solve the actual issue as consistently, simply and reliably as a properly implemented lock file solution is hard for me to comprehend -- sorry. :(

@bnoordhuis
Contributor

cargo may invoke the build script twice in parallel

Can you link me to discussion or documentation that supports that claim? None of the issues you linked to seem to make that claim, not even as bug reports.

If you're referring to pipelining, I don't think that's relevant. Look at cargo build --build-plan: the build step that compiles librusty_v8.rlib and librusty_v8.rmeta doesn't start until the build step that executes build.rs completes.

@mash-graz
Contributor

Can you link me to discussion or documentation that supports that claim?

i think the cargo build -vv logs quoted in denoland/deno#8449 verify this claim quite well.

but i could try to build a demonstration in GitLab's CI system tonight, if you are still not satisfied with my explanations.

it's only important that you compile the package on a machine with a very slow network and more than one core, because cargo build -j1 will indeed hide the issue.

@mash-graz
Contributor

mash-graz commented Dec 1, 2020

o.k. here is the demonstration:

i wasn't able to reproduce the issue by compiling only the rusty_v8 crate in an isolated context, but as soon as i use the build context of the deno crate it becomes perfectly reproducible.

i utilized a slightly modified repo of rusty_v8, where a single time.sleep(...) line was added to the python download script between fetching chunks of data to simulate a slow network link:

https://gitlab.com/mash-graz/rusty_v8_slow_network/-/commit/7b294ef60b0ab8fb86a06b40052944cb0429d0f5

and in a modified repo of deno i used this customized version of rusty_v8 instead of the original one for a perfectly reproducible CI-based build, which executes the command cargo build --release -j4 -vv in a common debian-based rust 1.48 container.

the output of this build process and the reproduced failure is documented at:

https://gitlab.com/mash-graz/deno_slow_network/-/jobs/884293446

it's again the already reported bad archive: truncated or malformed archive error and the failed build.

i hope this practical demonstration and reproducible recipe helps you finally trust my claims.

@mash-graz
Contributor

@bnoordhuis
did you already find some time to look into this practical demonstration of the issue?

@mash-graz
Contributor

i just added references to a few other issues affected by this bug to this ticket and to the suggested solution/explanation...

@mash-graz
Contributor

#426 and denoland/deno#7517 are very likely other issues affected by this bug...

@bartlomieju
Member Author

Closing in favor of #543

@bartlomieju
Member Author

Actually it's still an issue to resolve

@bartlomieju reopened this Dec 27, 2020
@mash-graz
Contributor

Actually it's still an issue to resolve

could you please explain the reasons for reopening?
did you observe any remaining build problems?

at least on my machine it seems to work.

nevertheless i also wonder whether this issue couldn't be fixed in an even more pleasant and general manner:

i.e. the present fix only solves the issues related to the one huge static lib download, because that was the only obvious source of trouble on my machine, but the build script may download some other smaller components as well on other systems. perhaps this could explain similar symptoms of flakiness affecting the CI system on mac etc., and in the long run we should also replace the download process with a rust implementation instead of the python script (a sketch follows below)...
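A minimal sketch of such a rust-based download step, assuming the ureq crate and a download-then-rename pattern so that a partially written file is never visible under the final name:

```rust
// sketch of a pure-Rust replacement for the python download script
use std::fs::{self, File};
use std::io;
use std::path::Path;

fn download(url: &str, dest: &Path) -> io::Result<()> {
    // Write to a temporary file first...
    let tmp = dest.with_extension("part");
    let response = ureq::get(url)
        .call()
        .map_err(|e| io::Error::new(io::ErrorKind::Other, e))?;
    let mut reader = response.into_reader();
    let mut out = File::create(&tmp)?;
    io::copy(&mut reader, &mut out)?;
    out.sync_all()?;
    // ...then atomically move it into place once it is complete.
    fs::rename(&tmp, dest)
}
```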

but these are ideas for improvement which i only came up with while reflecting on this case after writing the first patch...

so it's only a first small step forward, which hopefully eliminates at least one obvious source of trouble.

@bnoordhuis
Contributor

File locking and checksumming are orthogonal issues. Checksumming is to detect bit flips and truncated downloads.

@mash-graz
Contributor

mash-graz commented Dec 28, 2020

File locking and checksumming are orthogonal issues. Checksumming is to detect bit flips and truncated downloads.

yes, these are indeed two different topics.

i was more interested in fixing the actual bug, which is/was only falsely reported as a "truncated file" but was in fact caused by other reasons. it would indeed be another useful improvement to compress and pack the relevant binary payloads in some archive format (zip, rar, 7z etc.) that provides basic checksum validation automatically on decompression. this would also improve the download times drastically!
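For illustration, a gzip-compressed payload already behaves this way: the flate2 crate checks the stream's built-in CRC32 while decompressing, so a corrupted or truncated download should fail right at this step (crate choice and file names are assumptions):

```rust
// sketch: unpack a downloaded .gz payload; the gzip CRC32 is verified
// as the stream is read, so io::copy errors out on a corrupted archive.
use flate2::read::GzDecoder;
use std::fs::File;
use std::io;

fn unpack(archive_path: &str, dest_path: &str) -> io::Result<u64> {
    let mut decoder = GzDecoder::new(File::open(archive_path)?);
    let mut out = File::create(dest_path)?;
    io::copy(&mut decoder, &mut out)
}
```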
