vendor/ verification #121
I've tackled some verification of binaries in the past for a tool I built using GPG and SHA256: https://github.com/thoughtbot/ftree/tree/master/dist#verifying
@calebthompson Thanks for the link! I'll have a look through your stuff as soon as I can. Quickly, though - our goal here is verifying a source tree, not the compiled binary (maybe that matters, maybe it doesn't). I suspect we also have some thinking to do about what we're verifying against.
Definitely. We could pipe the entire tree into either tool to generate the signatures.
Some time ago the folks at Gentoo had to tackle this exact same problem for their package manager, Portage; they ended up generating a digest for the contents of every first-level directory of the package manager's tree (which is file-based and quite huge).
That is completely unnecessary for our case, though. Listing a directory's data and piping it through an external hashing tool (md5sum¹, say) is a naive approach and prone to errors; a better solution is to hash the directory's data and its metadata together (stored as an archive). Yet even considering data+metadata doesn't solve the problem that content ordering differs from one file system to another, even on the same OS. Therefore, listing the directory's data+metadata through external tools can't produce consistent hashes on different OSes, leading to undefined behaviour and false positives. The only viable solution (which is the same one used by Gentoo's Portage and Git's tree-ish IDs, by the way) is to do the content sorting directly within the tool itself. The Go standard library already offers everything that is required, from file-system abstraction to archive generation, in a cross-platform way, thus guaranteeing the uniqueness of the hashes across different platforms. As a final note, this solution would be marginally susceptible to length-extension attacks depending on the hash function used. ¹ md5sum is used only as an example for brevity; any hash function would work here.
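To make the archive-and-hash idea above concrete, here is a minimal sketch (not from this thread; the treehash/digestTree names are made up) that streams a deterministic tar rendering of a tree into SHA-256 using only the standard library, relying on filepath.Walk's lexical ordering for the content sorting. A real implementation would also need to normalize ownership and permission metadata, which this sketch ignores.

```go
package treehash

import (
	"archive/tar"
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"
)

// digestTree streams a deterministic tar rendering of the tree rooted at
// root into SHA-256. filepath.Walk visits entries in lexical order, which
// keeps the archive - and therefore the digest - stable across platforms.
func digestTree(root string) (string, error) {
	h := sha256.New()
	tw := tar.NewWriter(h)

	err := filepath.Walk(root, func(path string, fi os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		hdr, err := tar.FileInfoHeader(fi, "")
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel) // os-agnostic name inside the archive
		hdr.ModTime = time.Time{}        // zero volatile metadata so it can't perturb the hash
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if !fi.Mode().IsRegular() {
			return nil // only regular files contribute contents
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f)
		return err
	})
	if err != nil {
		return "", err
	}
	if err := tw.Close(); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}
```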
This is great info and context - thanks. Some quick thoughts:
I'm not sure it's exactly the same. This process is occurring on a user's system, probably when a
So, that's central-to-user; we have a user-to-user problem. That may not end up mattering, if only because there's no uniform, reliable way to verify a trusted upstream source. I'm just noting it while I absorb this information.
I think you may have missed the original reasoning for this: if we are to admit the possibility of vendor pruning (#120), and especially multiple/customizable types of pruning, then it would be valid to have a But, there also may have to be additional files in vendor to allow for processing/code generation type tasks. That's much harder to integrate.
The info about the issues with command-driven tooling is useful, but don't worry - we were never considering doing this any way other than directly within the tool, using the facilities provided in stdlib.
I wonder if we need to tackle security at all? We inherit security from the underlying vcs. We inherit both vulnerabilities and future improvements this way. But more specifically, if security is paramount, the user should commit vendored code into their own repo. In this situation the only two possibilities for the code to be compromised are during the initial clone or in the user's repo. If the initial clone is suspect, then no security of ours can detect that. If the user's repo was compromised, then the lock file could be changed as well. In other words, what is the use case of layering extra security on top of the vcs?
@mikijov Security is its own issue, and yeah, we do have to seriously address it. However, this issue is about verification - related to, but independent of, security. Without this verification mechanism, it's much more difficult for us to check that the contents of a given vendor/ directory are what's expected. @karrick was working on this during the Gophercon hack day, and I think we have an algorithm together to do the hashing: karrick@9449c59
While working on the algorithm to recursively hash the file system contents rooted at a specified file system object, I came across a possible gotcha when we follow symbolic links, which made me consider that this may be why the standard library's filepath.Walk skips them. The naive solution to preventing an infinite loop is quite simple, but does not scale. We could punt on this issue for now, pretend that there are no symbolic links that would cause an infinite loop, and continue with the solution we discussed at GopherCon, but I wanted you to be aware of it. Or if you have any suggestions, please let me know.
Maybe you have a suggestion for quickly detecting whether the file system node has been visited already. My initial idea was to use a map and perform a quick key check, although this method involves a constantly growing map on the heap. However, in order to insulate from the underlying os-specific data structures, there is no way I have found to grab the underlying file identity to use as a key. Therefore the only way I can think of to correctly determine whether the current node has been visited is to use os.SameFile, as in the code below:
```go
package fs

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sort"
	"strconv"

	"github.com/pkg/errors"
)

var pathSeparator = string(os.PathSeparator)

// HashFromNode returns a deterministic hash of the file system node specified
// by pathname, performing a breadth-first traversal of directories, while
// ignoring any directory child named "vendor".
//
// While filepath.Walk could have been used, that standard library function
// skips symbolic links, and for now, it's a design requirement for this
// function to follow symbolic links.
//
// WARNING: This function loops indefinitely when it encounters a loop caused by
// poorly created symbolic links.
func HashFromNode(pathname string) (hash string, err error) {
	// bool argument: whether or not to prevent file system loops due to
	// symbolic links
	return hashFromNode(pathname, false)
}

func hashFromNode(pathname string, preventLoops bool) (hash string, err error) {
	h := sha256.New()
	var fileInfos []os.FileInfo

	// Initialize a work queue with the os-agnostic cleaned up pathname.
	pathnameQueue := []string{filepath.Clean(pathname)}

	for len(pathnameQueue) > 0 {
		// NOTE: pop a pathname from the queue
		pathname, pathnameQueue = pathnameQueue[0], pathnameQueue[1:]

		fi, er := os.Stat(pathname)
		if er != nil {
			err = errors.Wrap(er, "cannot stat")
			return
		}

		fh, er := os.Open(pathname)
		if er != nil {
			err = errors.Wrap(er, "cannot open")
			return
		}

		// NOTE: Optionally disable checks to prevent infinite recursion when a
		// symbolic link causes an infinite loop, because this method does not
		// scale.
		if preventLoops {
			// Have we visited this node already?
			for _, seen := range fileInfos {
				if os.SameFile(fi, seen) {
					goto skipNode
				}
			}
			fileInfos = append(fileInfos, fi)
		}

		// NOTE: Write pathname to hash, because hash ought to be as much a
		// function of the names of the files and directories as their
		// contents. Added benefit is that empty directories affect the final
		// hash value.
		//
		// Ignore return values from writing to the hash, because hash write
		// always returns a nil error.
		_, _ = h.Write([]byte(pathname))

		if fi.IsDir() {
			childrenNames, er := fh.Readdirnames(0) // 0: read names of all children
			if er != nil {
				err = errors.Wrap(er, "cannot read directory")
				return
			}
			// NOTE: Sort children names to ensure deterministic ordering of
			// the contents of each directory, so the hash remains the same
			// even if the operating system returns the same values in a
			// different order on a subsequent invocation.
			sort.Strings(childrenNames)
			for _, childName := range childrenNames {
				switch childName {
				case ".", "..", "vendor":
					// skip
				default:
					pathnameQueue = append(pathnameQueue, pathname+pathSeparator+childName)
				}
			}
		} else {
			// NOTE: Format the file size as a base 10 integer, and ignore
			// return values from writing to the hash, because hash write
			// always returns a nil error.
			_, _ = h.Write([]byte(strconv.FormatInt(fi.Size(), 10)))
			_, er = io.Copy(h, fh)
			err = errors.Wrap(er, "cannot read file") // errors.Wrap only wraps non-nil, so elide checking here
		}

	skipNode:
		// NOTE: Close the file handle to the open directory or file.
		if er = fh.Close(); err == nil {
			err = errors.Wrap(er, "cannot close")
		}
		if err != nil {
			return // early termination if error
		}
	}

	hash = fmt.Sprintf("%x", h.Sum(nil))
	return
}
```
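For a sense of how such a function might be wired up, a hypothetical caller could look like the following; the import path, the vendored project, and the expected digest are placeholders, and in a real run the expected value would come from the lock file.

```go
package main

import (
	"fmt"
	"log"

	// hypothetical import path for the package containing HashFromNode above
	"example.com/dep/internal/fs"
)

func main() {
	// Hash one vendored project; the expected value below is a placeholder
	// standing in for whatever digest the lock records.
	got, err := fs.HashFromNode("vendor/github.com/pkg/errors")
	if err != nil {
		log.Fatal(err)
	}
	want := "<digest recorded in the lock>"
	if got != want {
		fmt.Printf("vendor tree digest mismatch: got %s\n", got)
	}
}
```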
@karrick i've actually spent way, way too much time thinking about the roiling hellscape that is symlinks. i've spent some time investigating extant symlink cycle detection algorithms, and they...suck. just, categorically, suck. turns out, graphs are hard. however, for once, i think i have a happy answer to a symlink-related question! 😄 basically: the hasher should never, ever traverse symlinks. if it encounters one, then it should instead hash the contents/address of the link itself (basically, the output of readlink). i believe this is sufficient for all cases because of some key invariants:
(it is not a coincidence that these invariants mirror those of snapshot-based version control systems) i'm not entirely sure of this approach (i never am with symlinks, because symlinks 😱💥💀), so lemme walk it through. symlinks can vary along a number of independent dimensions that are relevant for our purposes:
as long as we do not embed a guarantee in the hashing process that symlinks reaching outside of a project tree must exist/be valid, then we're fine to simply record the target of the symlink. if the symlink escapes the tree (which, if we want to defend against, is something we should enforce elsewhere), then we're making no correctness guarantee anyway. if it does not, then the structure of the hashing algorithm guarantees that the bits contained in/under the referent will be reached through normal, non-symlink means. given that, there's no added value to hashing the bits again; it's sufficient to simply record that a link exists. the only real question in my mind with this is how we record the fact - in a cross-platform way - that a given file is a symlink in the hashing input itself, so that the hasher disambiguates between a normal file containing an address and a symlink with that same address.
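a rough sketch of the no-traversal idea, purely illustrative: the walker would use os.Lstat rather than os.Stat so the link itself is observed, and then record the link's name and target instead of descending into it. the "symlink" marker below is one possible answer to the disambiguation question above; the helper name is made up.

```go
package fs

import (
	"hash"
	"os"
	"path/filepath"
)

// hashSymlink records a symbolic link's own name and target in the hash
// without following the link. The literal "symlink" marker is one way to
// keep a regular file whose contents happen to be a path from hashing
// identically to a link that points at that path.
func hashSymlink(h hash.Hash, pathname string) error {
	target, err := os.Readlink(pathname)
	if err != nil {
		return err
	}
	_, _ = h.Write([]byte(pathname))
	_, _ = h.Write([]byte("symlink"))
	_, _ = h.Write([]byte(filepath.ToSlash(target))) // normalize separators across platforms
	return nil
}
```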
@sdboyer, thanks for the feedback. I spent a bit of time today looking at the surrounding code. I'll update my fork tomorrow based on your feedback, and send a PR your way for the hashing code.
sounds great!
yeah, i hadn't previously carved out a place for it. i have some basic thoughts on where to fit it in: first and simplest would be to just add a workhorse func. then, we'll want something like

```go
type MismatchType uint8

const (
	NoMismatch MismatchType = iota
	EmptyDigestInLock
	DigestMismatchInLock
	NotInLock
	NotInTree
)
```

i feel like these signatures and names may be enough to suggest the desired behavior, but lmk if more detail would be helpful. (also, we need a better name than that.) eventually, i'd like to see this func called in a more flexible way from further up. but, let's have a first PR just focus on these two functions, and the concomitant additions.
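purely as a sketch of the kind of comparison func these constants imply - all names and signatures here are hypothetical, reusing the HashFromNode draft from earlier in the thread, not the actual dep/gps API:

```go
package fs

import (
	"os"
	"path/filepath"
)

// VerifyVendor compares each per-project digest recorded in the lock against
// a freshly computed digest of the corresponding tree under vendor/, and
// reports disagreements keyed by project root.
func VerifyVendor(vendorRoot string, lockDigests map[string]string) (map[string]MismatchType, error) {
	result := make(map[string]MismatchType)
	for projectRoot, want := range lockDigests {
		if want == "" {
			result[projectRoot] = EmptyDigestInLock
			continue
		}
		path := filepath.Join(vendorRoot, projectRoot)
		if _, err := os.Stat(path); os.IsNotExist(err) {
			result[projectRoot] = NotInTree
			continue
		}
		got, err := HashFromNode(path)
		if err != nil {
			return nil, err
		}
		if got != want {
			result[projectRoot] = DigestMismatchInLock
		} else {
			result[projectRoot] = NoMismatch
		}
	}
	// A scan of vendorRoot would be needed to flag trees with no lock entry
	// at all (NotInLock); that is omitted here for brevity.
	return result, nil
}
```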
Someone feel free to correct me if I'm wrong, but IIRC one of the reasons that file size prefixed its contents when This comment doesn't really change the algorithm much, but I thought I'd point that out. We can always remove the file size from the hash in the future, and worst case people will re-download their dependencies another time.
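One reason a size (or some other separator) in front of each file's contents matters: without it, two different trees can feed identical byte streams to the hash. A tiny, purely illustrative demonstration:

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

func main() {
	// Two different "trees" that feed identical bytes to the hash when file
	// contents are simply concatenated...
	h1 := sha256.New()
	h1.Write([]byte("ab"))
	h1.Write([]byte("c"))

	h2 := sha256.New()
	h2.Write([]byte("a"))
	h2.Write([]byte("bc"))

	fmt.Println(bytes.Equal(h1.Sum(nil), h2.Sum(nil))) // true: collision by construction

	// ...and the same two trees with each file's length written before its
	// contents, as the algorithm above does, which restores the boundary.
	h3 := sha256.New()
	h3.Write([]byte("2ab"))
	h3.Write([]byte("1c"))

	h4 := sha256.New()
	h4.Write([]byte("1a"))
	h4.Write([]byte("2bc"))

	fmt.Println(bytes.Equal(h3.Sum(nil), h4.Sum(nil))) // false
}
```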
I've got some interesting preliminary benchmarks from a new custom directory walking function. First we establish a baseline using the
Then we walk the directory tree using the standard library
Then we walk the directory tree using a new faster walk function.
I know some of that is due to virtual file system caching, but it's at least a promising start.
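For anyone wanting to reproduce this kind of comparison, a minimal Go benchmark harness along these lines would work; benchRoot and fastWalk are placeholders, not the code used for the measurements above.

```go
package walk_test

import (
	"os"
	"path/filepath"
	"testing"
)

// Hypothetical root to walk; in practice this would be a large checkout,
// such as a populated vendor/ tree.
const benchRoot = "testdata/big-tree"

// fastWalk stands in for the custom walker being measured.
func fastWalk(root string, fn func(path string, isDir bool) error) error {
	return nil // placeholder
}

func BenchmarkStdlibWalk(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if err := filepath.Walk(benchRoot, func(path string, fi os.FileInfo, err error) error {
			return err
		}); err != nil {
			b.Fatal(err)
		}
	}
}

func BenchmarkFastWalk(b *testing.B) {
	for i := 0; i < b.N; i++ {
		if err := fastWalk(benchRoot, func(path string, isDir bool) error { return nil }); err != nil {
			b.Fatal(err)
		}
	}
}
```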
After many more benchmarks, the fast walking example consistently performs at 2x the speed of the version that uses the standard library filepath.Walk.
The modified code works well on Linux and Mac, but not on Windows, because different syscalls are available, as expected. I think
I have extracted the directory walking logic to https://github.com/karrick/godirwalk, which now works with unix and Windows. Incidentally, it even corrects a subtle bug relative to the standard library on Windows. On Windows, a symlink to a directory has multiple type mode bits set. The logic in
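A small illustration of the ordering that matters here, assuming the bug is the one described above (on Windows a symlink to a directory can carry both the symlink and directory mode bits, so the symlink bit has to be tested before IsDir); the helper name is made up:

```go
package fs

import "os"

// nodeKind classifies a path without following symbolic links. The order of
// the checks matters: a symlink to a directory can report both ModeSymlink
// and ModeDir, so the symlink bit must be tested before IsDir.
func nodeKind(pathname string) (string, error) {
	fi, err := os.Lstat(pathname)
	if err != nil {
		return "", err
	}
	switch {
	case fi.Mode()&os.ModeSymlink != 0:
		return "symlink", nil // even if ModeDir is also set
	case fi.IsDir():
		return "directory", nil
	default:
		return "file", nil
	}
}
```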
sounds like a good time for a PR to switch out our internal impl and use godirwalk for the hasher, then 😄
Just wondering what the state of this issue is - it seems like @karrick came up with a cross-platform method of creating a hash of each vendored dependency. What steps have to be taken next to push resolution of this important issue forward? |
@sdboyer in nixos the problem of generating a hash of a git repository is already solved. i've added some notes here: https://github.com/nixcloud/dep2nix#golang-dep-git-repo-hash-extension i love the approach used there. here is the difference to yarn:
that said, using the NAR file specification along with the libraries needed for building the hash (implemented as go library using
@sdboyer do you have any license concerns using
dep needs a way of verifying that the contents of vendor/ are what's expected. This serves three purposes - among them, dep status is very limited in what it can report about vendor/ if it can't verify that vendor/ is actually what's expected.
The simple, obvious solution is to deterministically hash (SHA256, I imagine) the file trees of dependencies, then store the resulting digest in the lock. However, the need to perform vendor stripping (#120) complicates this, as that creates perfectly valid situations in which the code under vendor/ might differ from the upstream source. The more choices around stripping modes we create, the more complex and error-prone this challenge becomes.
Or, we might simply rely on the hashing performed by the underlying VCS - indeed, we already record "revisions" in the lock, so we're already sorta doing this. Relying on it is triply disqualified, though:
The only path forward I've come up with so far is to compute and record hashes for individual files contained in dependencies (perhaps before stripping, perhaps after). This would allow us to be more adaptive about assessing package integrity - e.g., we can know that it's OK if certain files are missing.
The downside is that these lists of file hash digests would be potentially huge (e.g., if you import k8s), though that might be mitigated by placing them under vendor/ rather than directly in the lock - sdboyer/gps#69. Also, without the larger security model in place, I don't know if disaggregating hashing to the individual file level might compromise some broader notion of integrity that will prove important.
We might also record just the project unit-level hash against some future day where we've reinvented GOPATH and are no longer relying on vendor.
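If the per-file direction above were taken, the recorded data might be a map from slash-separated relative path to digest. A hedged sketch of computing it (illustrative only, not dep's implementation):

```go
package fs

import (
	"crypto/sha256"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// fileDigests returns a digest for every regular file under root, keyed by
// slash-separated relative path, which is the kind of per-file listing that
// could be recorded alongside (or inside) vendor/ rather than in the lock.
func fileDigests(root string) (map[string]string, error) {
	digests := make(map[string]string)
	err := filepath.Walk(root, func(path string, fi os.FileInfo, err error) error {
		if err != nil || !fi.Mode().IsRegular() {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		digests[filepath.ToSlash(rel)] = fmt.Sprintf("%x", h.Sum(nil))
		return nil
	})
	return digests, err
}
```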