x/vgo: use tar instead of zip for package archives #24057
If I had to make a suggestion as to what to use instead, I would recommend tar archives: they could also be imported as-is into e.g. the Debian archive (or other Linux distributions), which typically don’t support zip. |
I asked about zip and tar files too in golang-dev - search for "zip" in https://groups.google.com/forum/m/#!topic/golang-dev/MNQwgYHMEcY. I don't know how to share single messages on mobile, unfortunately. |
Having dealt extensively with both the TAR and ZIP formats, I have concluded that both are terrible formats and consistent support for them is awful. However, in terms of better world-wide support, I vote for TAR. Here is my assessment of the advantages and disadvantages of each:
The main advantage of ZIP is random access to individual files. I'm not sure that feature is a deal breaker, though. There are ways to scan through a TAR archive once and build an index to provide random access between files and within a file. |
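The indexing idea mentioned above can be sketched in Go. This is only an illustration, not anything vgo does: the names `entry`, `buildSampleTar`, `indexTar`, and `readEntry` are hypothetical, the archive is built in memory, and the sketch assumes plain regular files (no sparse entries) and relies on `archive/tar` leaving the underlying reader positioned at the start of an entry's data once `Next` returns.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// entry records where one file's data lives inside the tar stream.
// (Hypothetical type for this sketch.)
type entry struct {
	Offset int64 // byte offset of the file data from the start of the archive
	Size   int64 // length of the file data in bytes
}

// buildSampleTar returns a small in-memory tar archive with made-up contents.
func buildSampleTar() []byte {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	files := []struct{ name, body string }{
		{"a.txt", "first file"},
		{"b.txt", "second file"},
	}
	for _, f := range files {
		hdr := &tar.Header{Name: f.name, Typeflag: tar.TypeReg, Mode: 0644, Size: int64(len(f.body))}
		if err := tw.WriteHeader(hdr); err != nil {
			panic(err)
		}
		if _, err := tw.Write([]byte(f.body)); err != nil {
			panic(err)
		}
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// indexTar makes a single sequential pass over the archive and records the
// data offset and size of every regular file.
func indexTar(data []byte) (map[string]entry, error) {
	r := bytes.NewReader(data)
	tr := tar.NewReader(r)
	idx := make(map[string]entry)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return idx, nil
		}
		if err != nil {
			return nil, err
		}
		if hdr.Typeflag != tar.TypeReg {
			continue
		}
		// Assumption: the header has just been consumed, so the current
		// position of the underlying reader is where this entry's data starts.
		off, err := r.Seek(0, io.SeekCurrent)
		if err != nil {
			return nil, err
		}
		idx[hdr.Name] = entry{Offset: off, Size: hdr.Size}
	}
}

// readEntry jumps straight to one file's bytes using the index.
func readEntry(data []byte, e entry) (string, error) {
	sr := io.NewSectionReader(bytes.NewReader(data), e.Offset, e.Size)
	b, err := io.ReadAll(sr)
	return string(b), err
}

func main() {
	data := buildSampleTar()
	idx, err := indexTar(data)
	if err != nil {
		panic(err)
	}
	body, err := readEntry(data, idx["b.txt"])
	if err != nil {
		panic(err)
	}
	fmt.Printf("b.txt at offset %d: %q\n", idx["b.txt"].Offset, body)
}
```

Note that this only works on an uncompressed tar stream; once the archive is gzip-compressed as a whole, seeking to an offset requires additional machinery.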
On the computer now, pasting @rsc's answer to my question about the decision to use ZIP:
|
Somewhat disagree. Essentially:
Agreed. That's the main benefit of ZIP.
For a cold cache, it's still going to be a lot of data. Imagine building Kubernetes or JuJu for the first time. For some empirical data:
Some observations:
|
Also, I don't think random-access is that important. The purpose of the container format is for the proxies to transfer the package sources+testdata across the network, which is almost certainly in a streaming fashion. I can imagine random-access is a useful property for the cache, but that seems to be an implementation detail. The external world should not care if the internal cache implementation ends up storing them as ZIP archives, SQLite databases, protobufs, flatbuffers, regular files, etc. |
I want to add my voice in support of tar over zip for all the reasons @dsnet mentioned. |
This use case sounds ideal for catar, with the down-side of obscurity.
|
@stapelberg's original report conflates restrictions imposed by vgo on the content of a Go module with the container format. It's not the use of zip that's the problem in that transcript. It's that vgo quite intentionally supports only plain files (no devices, no symlinks, no sparse files) and no executable bits.
Go modules hold Go source code for building with the Go toolchain. Basic testdata is fine, but in general, for portability, modules must not be sources for other special kinds of files. Even if vgo were storing modules as tar.gz files, it would still be using an unpacker (x/vgo/vendor/cmd/go/internal/modfetch/unzip.go) that ignores symlinks and executable bits. Those do not belong in source archives, and their presence is more likely to be portability problems or attack vectors than innocent code. These restrictions would have been in place from day 1 if Go code had been responsible for putting source code on disk; the only reason symlinks and executables work today is because we delegated that to version control tools. Not anymore.
So as far as the original report is concerned ("zip files lose symbolic links and executable permissions"), sorry, but working as intended, and not because of zip files. It looks like github.com/prometheus/procfs needs to find a different way to initialize its test environment; it can no longer assume the special files are available directly from the source code tree.
The discussion here then turned to merits of tar vs zip more generally. Note that vgo downloads zip files from GitHub and tgz files from Gerrit and transcodes both into zip files of the proper format (with the right file tree structure) for saving locally. The format served by popular code hosting sites is therefore not relevant. We still need to supply a tool to turn a local git/etc repo into a tree of zip files for proxies, static hosting, etc, but that will happen too. In both cases, vgo itself is writing the zip file.
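To make the "plain files only, no executable bits" policy concrete, here is a minimal Go sketch of that kind of filter. It is not the code in modfetch/unzip.go, which is not reproduced in this thread; `buildSampleZip` and `plainFiles` are hypothetical names, the archive contents are made up, and the choice to clear the execute bits with `&^ 0111` is an assumption about how such a policy could be applied.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"os"
)

// buildSampleZip returns an in-memory zip holding a plain file, an
// executable script, and a symlink (all made-up entries).
func buildSampleZip() []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	entries := []struct {
		name, body string
		mode       os.FileMode
	}{
		{"main.go", "package main\n", 0644},
		{"script.sh", "#!/bin/sh\n", 0755},
		{"link", "main.go", os.ModeSymlink | 0777}, // symlink body is its target
	}
	for _, e := range entries {
		hdr := &zip.FileHeader{Name: e.name, Method: zip.Deflate}
		hdr.SetMode(e.mode)
		w, err := zw.CreateHeader(hdr)
		if err != nil {
			panic(err)
		}
		if _, err := w.Write([]byte(e.body)); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// plainFiles returns, for each regular file in the archive, the permission
// bits an unpacker following this policy would write. Symlinks, devices and
// other special entries are skipped; execute bits are cleared.
func plainFiles(data []byte) (map[string]os.FileMode, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	out := make(map[string]os.FileMode)
	for _, f := range zr.File {
		mode := f.Mode()
		if !mode.IsRegular() {
			continue // not a plain file: ignore it
		}
		out[f.Name] = mode.Perm() &^ 0111 // drop all execute bits
	}
	return out, nil
}

func main() {
	modes, err := plainFiles(buildSampleZip())
	if err != nil {
		panic(err)
	}
	for _, name := range []string{"main.go", "script.sh", "link"} {
		if m, ok := modes[name]; ok {
			fmt.Printf("%s kept with mode %v\n", name, m)
		} else {
			fmt.Printf("%s skipped\n", name)
		}
	}
}
```

The same filter could sit behind a tar reader instead, which is the point being made: the restriction is a property of the unpacker, not of the container format.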
The interchange format ends up being almost an internal detail. The reasons I prefer zip instead of tar are, in order:
Mainly these considerations are for ease of debugging and poking around, but it's easy to envision actual vgo features (like vgo verify on subsets of a module) that would require random access. Honestly, if we need to pick an archive file format in 2018, I really think one that can't list the files in the archive or get to a specific file without processing the entire archive - that is, can't do it in O(1) time instead of O(n) time - is just not in the running.
The rest of my comment is really all in the margins. With lots of respect for all your work on both archive/tar and archive/zip, @dsnet, I disagree with your conclusion that tar is more well-specified or simpler:
As far as compression ratios, I agree that tar, being uncompressed, admits more effective compression. But your empirical data shows that the cost is you must give up random access. I'm not willing to do that. I take your point about having a different on-disk-cache vs network transfer format, but that's added complexity and eliminates the simplicity of having a Go package proxy that serves out of an actual cmd/go download cache. In fact if we just write a few more metadata files in the cache (which I intend to do), GOPROXY=file://$GOPATH/src/v will work. Different context but same point as above: one format is better than two.
So as far as the new title ("use tar instead of zip for package archives"), again sorry, but working as intended.
P.S. Years ago, when I had internalized "everything associated with Windows is awful and sucks and everything associated with Unix is the one true way" (see this thread), I remember Bryan Ford showing me VXA, a really beautiful system. When he got to the part where he mentioned using the zip file format, I remember this visceral "Ugh! Why would you do that? It's awful." But in fact I'd been blinded by my priors and (as usual) Bryan's design was exactly right. So especially since Go developers who frequent our issue tracker seem to tend toward being Unix developers, a cautionary note to anyone reading this issue: if you recognize that you're having a similar gut reaction like "clearly tar is better than zip", I'd encourage you to try to step back and examine where that's coming from. (Or maybe I was the only one who fell into that trap, in which case ignore this note.) |
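The random-access property argued for in the comment above can be illustrated with Go's archive/zip. This is a sketch under stated assumptions: `buildSampleZip`, `listNames`, and `readMember` are hypothetical helpers, the module-like file names are invented, and the archive lives in memory rather than in a download cache.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// buildSampleZip returns a small in-memory zip with made-up file names.
func buildSampleZip() []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	files := []struct{ name, body string }{
		{"example.com/m@v1.0.0/go.mod", "module example.com/m\n"},
		{"example.com/m@v1.0.0/a.go", "package m\n"},
		{"example.com/m@v1.0.0/b/b.go", "package b\n"},
	}
	for _, f := range files {
		w, err := zw.Create(f.name)
		if err != nil {
			panic(err)
		}
		if _, err := w.Write([]byte(f.body)); err != nil {
			panic(err)
		}
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// listNames returns the member names. zip.NewReader parses the central
// directory stored at the end of the archive; no file data is decompressed.
func listNames(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var names []string
	for _, f := range zr.File {
		names = append(names, f.Name)
	}
	return names, nil
}

// readMember decompresses exactly one member. The lookup walks the in-memory
// directory entries, but the data of the other members is never read.
func readMember(data []byte, name string) (string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return "", err
	}
	for _, f := range zr.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		b, err := io.ReadAll(rc)
		return string(b), err
	}
	return "", fmt.Errorf("member %q not found", name)
}

func main() {
	data := buildSampleZip()
	names, err := listNames(data)
	if err != nil {
		panic(err)
	}
	fmt.Println("members:", names)
	body, err := readMember(data, "example.com/m@v1.0.0/b/b.go")
	if err != nil {
		panic(err)
	}
	fmt.Printf("b/b.go: %q\n", body)
}
```

With a tar.gz, both operations would require decompressing and scanning the stream from the beginning, unless a separate index is maintained alongside the archive.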
SGTM. Thanks for your well-written reply.
Using the same format for both the proxy and cache certainly has its advantages. If so, then table-of-contents and random-access will be necessary. Given these constraints, I agree zip is the right choice.
SGTM.
My opinions on tar and zip are influenced by frustration working with both these formats. Interestingly, I actually supported zip some time ago, and flipped my opinion after working more on zip. I may be just pessimistic towards the state of the world. Fortunately, I certainly don't support creating our own archive format. Fun fact: I'm writing this from a Windows machine. |
I would expect a lot of repositories to contain shell scripts for various things, including administrative actions and test helpers that are run in subprocesses. Go itself contains 59 executable files in the repository. ( Please accept that repositories want to contain executable files. |
@tv42 The purpose of Go modules is to provide Go source files for inclusion in a Go build. A Go build uses only Go files (and now |
@dolmen You've just engineered a solution where running "go generate" in vendored dependencies no longer works. |
This is easily visible in repositories such as https://github.com/prometheus/procfs, which contains executable files (e.g. ttar) and symbolic links (e.g. fixtures/self):
While permissions apparently can be stuffed into extended attributes (see https://stackoverflow.com/a/13633772/712014), it seems that GitHub and/or vgo don’t support that. Symlink support was added to zip years ago as per https://serverfault.com/a/265678/100772, but doesn’t seem to be used by GitHub and/or vgo either.