I wrote a little tarfs package that I've found very useful.
If you ever have to work with tar files in Go, you might find it useful too.
Even if you don't, you might still learn something interesting by reading this post.
Tar is short for "tape archive". It's one of many ways to take a bunch of files and stuff them into a single file. This is useful because many tools are built to handle single files and not entire filesystems. You can also ignore the differences in filesystem implementations by forcing everything to serialize to a tar file.
Notably, since the tar format is meant for tape archives, it's not designed to allow for quick random access. Tape drives are relatively slow, and archives imply that they aren't meant for online access.
There are several flavors of tar, but let's ignore that for now and focus on a simplified model of the format.
A tar file is just a series of records that alternate between a 512-byte header and the actual file contents (if there are any). The header contains metadata for the next file, including its name and size, which tells you how many bytes to read after the header. After the end of each file, there is zero padding until the next 512-byte boundary.
After the last file, there are (at least) two 512 byte blocks of zeros that indicate the End Of Archive (EOA).
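To make the blocking concrete, here's a tiny sketch (not from any particular library) of where the next header lands, given a header's offset and the size it records:

const blockSize = 512

// nextHeaderOffset returns the offset of the header that follows an entry
// whose header begins at headerOffset and whose contents are size bytes,
// accounting for zero padding up to the next 512-byte boundary.
func nextHeaderOffset(headerOffset, size int64) int64 {
	padded := (size + blockSize - 1) / blockSize * blockSize
	return headerOffset + blockSize + padded
}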
This format is easy to both produce and consume in a streaming manner, which means it's amenable to bash one-liners, which probably explains its ubiquity.
Given the amount of zero-padding and text duplication common in the headers, tars are generally compressed with gzip to form a .tar.gz file, which is similarly amenable to bash one-liners.
If you search for how to do this, you'll find a straightforward implementation (inlined below).
// Untar takes a destination path and a reader; a tar reader loops over the tarfile
// creating the file structure at 'dst' along the way, and writing any files
func Untar(dst string, r io.Reader) error {
	gzr, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	defer gzr.Close()

	tr := tar.NewReader(gzr)

	for {
		header, err := tr.Next()

		switch {
		// if no more files are found return
		case err == io.EOF:
			return nil
		// return any other error
		case err != nil:
			return err
		// if the header is nil, just skip it (not sure how this happens)
		case header == nil:
			continue
		}

		// the target location where the dir/file should be created
		target := filepath.Join(dst, header.Name)

		// the following switch could also be done using fi.Mode(), not sure if there
		// a benefit of using one vs. the other.
		// fi := header.FileInfo()

		// check the file type
		switch header.Typeflag {
		// if its a dir and it doesn't exist create it
		case tar.TypeDir:
			if _, err := os.Stat(target); err != nil {
				if err := os.MkdirAll(target, 0755); err != nil {
					return err
				}
			}
		// if it's a file create it
		case tar.TypeReg:
			f, err := os.OpenFile(target, os.O_CREATE|os.O_RDWR, os.FileMode(header.Mode))
			if err != nil {
				return err
			}
			// copy over contents
			if _, err := io.Copy(f, tr); err != nil {
				return err
			}
			// manually close here after each file operation; defering would cause each file close
			// to wait until all operations have completed.
			f.Close()
		}
	}
}
This kind of works, but there are a handful of problems.
Unless you can completely trust the tar file (i.e. you created it yourself), it's not safe to just extract it to a temporary directory.
This is due to a category of vulnerability called Zip Slip.
A maliciously crafted tar file can have paths that escape the target directory using relative filenames (like ../../evil.sh) and overwrite existing files with arbitrary content.
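One common mitigation (a sketch assuming path/filepath, strings, and os are imported, and not a complete defense on its own) is to reject any entry whose joined path escapes the destination:

// insideDst reports whether joining name onto dst stays within dst.
// This catches "../" style escapes; it does not handle symlink tricks.
func insideDst(dst, name string) bool {
	target := filepath.Join(dst, name)
	rel, err := filepath.Rel(dst, target)
	if err != nil {
		return false
	}
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(os.PathSeparator))
}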
This implementation only handles directories and regular files. There are other file types that will just get silently ignored, most notably symlinks and hardlinks.
It also assumes that the target directory already exists for each regular file. Most tar implementations do indeed include any directories before files in those directories, but this isn't guaranteed, so you may end up trying to write a file to a non-existent directory.
This is just me being pedantic, but this function assumes the input is gzipped, which isn't necessarily the case. Sometimes you are just dealing with a tar file and don't need to gunzip it first. In fact, assuming you need to gunzip it first constrains your thinking to serial access patterns, which makes it difficult to see a nicer solution.
One problem I'm fairly familiar with is reading a docker archive tarball.
I've described the layout of a docker archive here, but let me repeat the relevant parts.
A docker archive is a tarball that contains a manifest.json file and a bunch of other blob files. The manifest.json file is an array, so it can contain multiple images. In order to know which blobs you need to extract for a given image (and in which order), you need to open and parse the manifest.json file first.
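For reference, each element of that array looks roughly like this; these are just the fields that matter here, following the docker save manifest format:

// manifestEntry sketches the fields of a docker archive's manifest.json
// needed to extract one image: its config blob and its ordered layer blobs.
type manifestEntry struct {
	Config   string   `json:"Config"`
	RepoTags []string `json:"RepoTags"`
	Layers   []string `json:"Layers"`
}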
This means we really want random access to the tar filesystem, so how can we get it?
There are a couple strategies here.
The most obvious thing to do is untar the entire thing and access it through the normal filesystem.
Docker does this, but it's a lot more complicated than you might expect. It involves chroots, unshare, re-execing itself, unix pipes, and a bunch of additional complexity to convince the Go runtime to play nicely with all of that.
There's also file metadata within the tar that you may or may not want to preserve, like permissions, ownership, access times, and xattrs (which dictate capabilities), all of which have security implications.
From comments on their implementation:
// The implementation is buggy, and some bugs may be load-bearing.
// Here be dragons.
Note that I'm not trying to disparage docker's implementation at all, it just reflects the complicated reality of the problem. Sometimes you do actually need to write out the files to disk, which indeed requires a lot of this ceremony if you have to handle untrusted inputs.
Let's look at a different approach to the same problem.
This is in go-containerregistry, which means I'm at least partially responsible for it, so it's fair game to disparage.
Rather than untarring the entire thing to disk to give us random access, we can instead just re-open the tar file and read through it again until we find what we're looking for. This ends up being pretty slow, but it also avoids all the complexity and safety concerns of untarring it to disk.
Both of these solutions were written before Go introduced the io/fs package, which provides a really nice filesystem abstraction. In fact, fs.FS is exactly the interface you usually want for a tar file. It's a bit of a shame that the archive/tar package doesn't implement this interface, but it's not too hard to do it yourself, which is exactly what I've done in my tarfs package.
Notably, I'm not the first person to do this, but I do like my implementation the best.
When creating a new tarfs.FS, we first have to iterate over the whole tar file to build an index of the file offsets. This is a little slow, but it's actually a requirement for correctness, and it can be amortized over many reads by saving the index for later use.
The index would look something like this:
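Each entry pairs a copy of the 512-byte header block with the offset of that file's data in the original tar; the type below is an illustrative sketch rather than the real definition.

// indexEntry is an illustrative sketch, not the actual tarfs type.
type indexEntry struct {
	Header [512]byte // verbatim copy of the tar header block
	Offset int64     // where this entry's file data starts in the original tar
}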
Notice that this would only be 520 bytes per file instead of the entire file size, so it's relatively small. Also, given that headers are pretty similar, this index should compress very well. There is probably a much more compact format we could use to store these things, but it's nice to keep the data identical to the original headers.
In theory, we could store only the offsets to the header data in the original tar file so that we don't store it twice, but by duplicating the header data, we can render the entire filesystem from just the index without actually having access to the original tar file. This technique is useful to me for quickly rendering results on dag.dev, since I can just store the indexes locally and render the filesystem view without hitting the registry.
Separately, we have an in-memory mapping of filename to the index entry to give us random access to any file metadata. This is just a map[string]int that gives us the array index of the file entry in the index described above.
With this structure, every tarfs.FS.Open() is just two map accesses, and every Read() is just a single ReadAt() call on the underlying os.File.
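Roughly what that looks like in code, with illustrative names rather than the real tarfs types: Open looks the name up in the map to find the index entry, and each Read becomes a ReadAt at the recorded offset.

// file is a sketch of an open tarfs file: it reads its contents via ReadAt
// on the underlying tar, starting at the data offset recorded in the index.
type file struct {
	ra     io.ReaderAt // typically the underlying *os.File
	offset int64       // where the file's data starts in the tar
	size   int64       // from the parsed header
	cursor int64       // current read position within the file
}

func (f *file) Read(p []byte) (int, error) {
	if f.cursor >= f.size {
		return 0, io.EOF
	}
	if remaining := f.size - f.cursor; int64(len(p)) > remaining {
		p = p[:remaining]
	}
	n, err := f.ra.ReadAt(p, f.offset+f.cursor)
	f.cursor += int64(n)
	return n, err
}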
Also, because I know that ~all my workloads end up immediately calling fs.WalkDir on the result, as part of the indexing I generate a pre-sorted map[string][]fs.DirEntry so that calling ReadDir doesn't have to allocate.
For the stateless FS.ReadDir method, we just return the whole []fs.DirEntry as-is, and for the stateful File.ReadDir method, we just return a slice of the requested size and maintain a cursor for the given position.
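The stateful one has to follow fs.ReadDirFile semantics: n <= 0 means return everything remaining, while a positive n returns at most n entries and eventually io.EOF. A sketch of that cursor logic (not the actual tarfs code):

// dir is a sketch of an open directory: the pre-sorted entries plus a
// cursor tracking how many have already been returned.
type dir struct {
	entries []fs.DirEntry
	cursor  int
}

func (d *dir) ReadDir(n int) ([]fs.DirEntry, error) {
	remaining := d.entries[d.cursor:]
	if n <= 0 {
		d.cursor = len(d.entries)
		return remaining, nil
	}
	if len(remaining) == 0 {
		return nil, io.EOF
	}
	if n > len(remaining) {
		n = len(remaining)
	}
	d.cursor += n
	return remaining[:n], nil
}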
Somewhat unexpectedly, this was the most difficult part of getting tarfs to pass fstest.TestFS.
The star of the show here is really io.ReaderAt, which enables concurrent and efficient access to the bytes of the underlying tar file. This will be even more apparent if I ever write about how this composes with gsip and Range requests.
To compare these approaches, I've summoned a random tar file from Docker Hub:
crane blob ubuntu@sha256:9c704ecd0c694c4cbdd85e589ac8d1fc3fd8f890b7f3731769a5b169eb495809 | gunzip > ubuntu.tar
Let's get a general sense of its properties:
$ wc -c ubuntu.tar
80559104 ubuntu.tar # 80.6MB
$ tarp < ubuntu.tar | jq '.Typeflag' | sort | uniq -c
2579 48 # Regular files
2 49 # Hardlinks
197 50 # Symlinks
658 53 # Directories
$ tarp < ubuntu.tar | jq 'select(.Typeflag == 48) | .Size' | jq -s 'add'
78046512 # 78.0MB of regular file data
I've written three methods for accessing tar files:
- untar, which writes everything to a temporary directory, then deletes it afterwards,
- tarfs, which uses the tarfs package, and
- scantar, which re-scans through to access each file.
The dumb benchmark I have here is to read three files and write them to stdout.
untar < ubuntu.tar 0.04s user 0.54s system 98% cpu 0.588 total
tarfs < ubuntu.tar 0.02s user 0.01s system 97% cpu 0.028 total
scantar < ubuntu.tar 0.02s user 0.01s system 93% cpu 0.033 total
You can see that untar is the slowest because it has to actually write everything out to disk and also clean it up. Since tarfs and scantar are read-only operations, they are both super fast.
We can cheat a little bit to give tarfs an advantage by looking at a bunch of files that are near the end of the tar file.
tarfs < ubuntu.tar 0.02s user 0.01s system 94% cpu 0.035 total
scantar < ubuntu.tar 0.08s user 0.05s system 98% cpu 0.128 total
We can see that tarfs is basically unaffected, whereas scantar is much slower because it's reading through almost the whole tar file for every file access.
A more realistic use case might be using Walk to visit every file in the tar, so I did that to compare tarfs and untar, generating this graph:
The Y axis is in microseconds, and the X axis is mostly each file being accessed, but I also added a data point at the end for untar's RemoveAll.
Why does it look like this?
With tarfs, we have a single open syscall, then the initial scan through to generate the index takes a bunch of read syscalls, then "walking" the filesystem is entirely in userspace. Reading each individual file is also exclusively pread syscalls.
You can see the overall time spent is almost evenly split between the initial scan and the subsequent accesses.
Indexing the tar takes ~26 milliseconds, then reading all the files takes another ~17ms.
This makes sense, because indexing has to read through every byte and parse the tar headers, whereas we are reading only the file data portions when we walk the filesystem.
With untar, not only do we have to open, write, and close (and sometimes mkdir) each file during the initial untar; we also have to open, read, and close each file during the subsequent accesses, so we have ~3x the overhead just for reading each file.
We have to call getdirentries to then walk the filesystem, and finally the tmpdir cleanup requires unlinkat syscalls for each file.
Untarring everything takes ~360ms, then reading all the files takes ~100ms, then cleanup takes ~230ms.
Less than 15% of the execution time is spent actually accessing the data.
Note that I'm not showing the scantar results here because we don't have a way to Walk the filesystem without some preprocessing. In cases where your workload is actually this contrived, the existing tar.Reader API is probably fine for you, and you would expect this to take about as long as tarfs takes to index the tar file (~26ms).
There are two scenarios where you might just want to use the plain tar package and not tarfs:
- You only have an io.Reader and not an io.ReaderAt, e.g. if you're streaming the tar file from a network connection.
- You are looking for a single entry in the tar file and are okay with exiting early.
If you find yourself in the second scenario, you should be careful. It is not generally correct to assume that the first matching entry in a tar file is what you want to extract because tar files may contain duplicate entries.
Let me explain.
First, let's create a tar file with a single entry:
echo -n "Hello, world" > hello.txt && tar -cf out.tar hello.txt
We can list the files:
tar -tvf out.tar
-rw-r--r-- 0 jonjohnson staff 12 Jun 20 16:21 hello.txt
We can even extract the file:
tar -Oxf out.tar hello.txt
Hello, world
Instead of using -c to create a tar file, we can use -r to update it:
echo '!' >> hello.txt && tar -rf out.tar hello.txt
And we'll see that there are two entries:
tar -tvf out.tar
-rw-r--r-- 0 jonjohnson staff 12 Jun 20 16:21 hello.txt
-rw-r--r-- 0 jonjohnson staff 14 Jun 20 16:21 hello.txt
If we try to extract the file as before, we'll get both:
tar -Oxf out.tar hello.txt
Hello, worldHello, world!
But if you actually write it out to disk, you'll see that the last file wins:
mkdir out && tar -xf out.tar -C out && cat out/hello.txt
Hello, world!
For this reason, in the general case, you should read through the entire tar file before returning anything, because the file you care about might have been overwritten.
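"Last one wins" in code just means scanning to EOF and remembering the most recent match rather than returning early. A minimal sketch, assuming the usual archive/tar, fmt, and io imports:

// lastMatch scans the entire tar stream and returns the contents of the
// final entry named name, mirroring "last one wins" extraction semantics.
func lastMatch(r io.Reader, name string) ([]byte, error) {
	tr := tar.NewReader(r)
	var contents []byte
	found := false
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		if hdr.Name != name {
			continue
		}
		contents, err = io.ReadAll(tr)
		if err != nil {
			return nil, err
		}
		found = true
	}
	if !found {
		return nil, fmt.Errorf("%s not found", name)
	}
	return contents, nil
}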
There are definitely situations where you know the layout of a tar file ahead of time (e.g. an APK package) and can stop reading when you get to the first matching file, but a general purpose tar library can't make that assumption.
Right now io/fs is kind of weird about symlinks, and I'm waiting for all of that to shake out before I commit to anything in particular.
In most of my uses of tarfs, I actually don't want to Open() symlinks or hardlinks because I want to preserve them in the output archives I'm producing.
There's a Sys method on fs.FileInfo that I use to return the original *tar.Header, which is how I handle these things in practice.
This is one place where the performance stuff above is a little misleading, but there are only two hardlinks in the entire archive, so it's not significant enough to affect the results.
(Since the time of writing, I've added symlink chasing to tarfs, but I don't care to update the rest of this post.)
I wouldn't attempt to put tarfs into production quite yet, but eventually I will extract it from its experimental home under targz.