Merkle verification of non-content-addressed data #11919

Open
Ericson2314 opened this issue Nov 20, 2024 · 0 comments

Problem

In a world where CA derivations take over, this is a non-issue: using exclusively content-addressing store paths means that we are effectively constructing Merkle DAGs, that is, doing deep content addressing for entire closures:

  • A single content-addressing store path verifies the entire closure of the object it refers to, because references (as store paths) affect the calculation, and those are likewise content-addressing.
  • A resolved derivation (one that only has inputSrcs) likewise has a completely unambiguous input closure, allowing it to serve as a properly shallow trace key.

However, a "mixed store" with some content-addressed and some input-addressed store objects doesn't have any of these nice properties, because a single input-addressed reference "breaks" the transitive guarantees, meaning we don't really know anything about the input-addressed object or its closure (not even the content-addressed objects in its closure).

Despite my slowness, I am not worried about the technical changes to Nix and Hydra that allow us to start using content-addressing derivations "for real". Rather, I am worried about software that contains pathological self-references, for which we'll have no choice but to continue using store paths that are fixed at build time (they could be input-addressed or just randomly generated, it doesn't matter) so as to avoid rewrites. I hope such software is rare or unimportant, but I don't know whether that will be the case.

Solution

Just because some store objects are "mounted" at non-content-addressed store paths doesn't mean we need to give up on content-addressing! The escape hatch is simple: we can use a content address in addition to a store path to lock down the store object's contents.

Indeed, we already do a version of this with "NAR hashes": we use those even when the object is input-addressed or content-addressed in a non-NAR way. It just happens that NAR hashes are not adequate for this task because they only track individual objects, not closures.

Imagine a "deep NAR hash" that is a combination of the NAR hash of the store object's own files and its references, represented as a map from "store path" to "deep NAR hash".

Note: the "NAR" part is unimportant here. We might well want to replace the NAR file system hashing with something better too. The "deep" generalization is agnostic to how the file system objects of a store object are hashed.

struct DeepNARHash {
    Hash h;
    // Combine the object's own NAR hash with the deep hashes of its references.
    static DeepNARHash calc(Hash narHash, std::map<StorePath, DeepNARHash> references);
};

This inductive structure gives us the Merkle hashing for whole-closure verification that we want.
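To make the induction concrete, here is a minimal sketch of how such a calculation could look. Everything in it is an assumption for illustration: `Hash` and `StorePath` are toy stand-ins for the real Nix types, `fnv1a` is a non-cryptographic placeholder for a real hash function, the serialization format is made up, and the store path in `appDeepHash` is hypothetical.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy stand-ins (assumptions); Nix has its own Hash and StorePath types.
using StorePath = std::string;
struct Hash { std::uint64_t v; };
bool operator==(const Hash & a, const Hash & b) { return a.v == b.v; }

// Placeholder hash (FNV-1a, NOT cryptographic); a real design needs a secure hash.
Hash fnv1a(const std::string & data) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : data) { h ^= c; h *= 1099511628211ULL; }
    return Hash{h};
}

struct DeepNARHash {
    Hash h;
    static DeepNARHash calc(Hash narHash, const std::map<StorePath, DeepNARHash> & references);
};

// Hash the object's own NAR hash together with each (store path, deep hash)
// pair of its references. std::map iterates in sorted key order, so the
// result does not depend on insertion order.
DeepNARHash DeepNARHash::calc(Hash narHash, const std::map<StorePath, DeepNARHash> & references) {
    std::string buf = "deep:" + std::to_string(narHash.v);
    for (const auto & ref : references) {
        buf += "|" + std::to_string(ref.first.size()) + ":" + ref.first
             + "=" + std::to_string(ref.second.h.v);
    }
    return DeepNARHash{fnv1a(buf)};
}

// Deep hash of an "app" object that references a single "lib" object
// mounted at a hypothetical, non-content-addressed store path.
DeepNARHash appDeepHash(const std::string & appContents, const std::string & libContents) {
    DeepNARHash lib = DeepNARHash::calc(fnv1a(libContents), {});
    std::map<StorePath, DeepNARHash> refs;
    refs.emplace("/nix/store/aaaa-lib", lib);
    return DeepNARHash::calc(fnv1a(appContents), refs);
}
```

In this sketch, changing the contents of the referenced object changes the deep hash of the referrer even though the referrer's own files, and therefore its plain NAR hash, stay the same. That is the property a plain NAR hash lacks.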

We update ValidPathInfo with

- std::set<StorePath> references;
+ std::map<StorePath, DeepNARHash> references;
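With references recorded as a map, verifying a closure could look roughly like the sketch below. It is an illustration only: `PathInfo` is a heavily simplified, hypothetical stand-in for `ValidPathInfo`, `Store`, `verifyClosure`, and `closureVerifies` are invented names, `fnv1a` is a non-cryptographic placeholder hash, and self-references are assumed to be excluded so that the recursion terminates.

```cpp
#include <cstdint>
#include <map>
#include <string>

using StorePath = std::string;            // toy stand-in
struct Hash { std::uint64_t v; };         // toy stand-in
struct DeepNARHash { Hash h; };
bool operator==(const DeepNARHash & a, const DeepNARHash & b) { return a.h.v == b.h.v; }

// Placeholder hash (FNV-1a, NOT cryptographic).
Hash fnv1a(const std::string & data) {
    std::uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : data) { h ^= c; h *= 1099511628211ULL; }
    return Hash{h};
}

DeepNARHash calcDeep(Hash narHash, const std::map<StorePath, DeepNARHash> & references) {
    std::string buf = "deep:" + std::to_string(narHash.v);
    for (const auto & ref : references)
        buf += "|" + std::to_string(ref.first.size()) + ":" + ref.first
             + "=" + std::to_string(ref.second.h.v);
    return DeepNARHash{fnv1a(buf)};
}

// Hypothetical, simplified ValidPathInfo: narHash stands for the hash of
// what is actually found at the path.
struct PathInfo {
    Hash narHash;
    std::map<StorePath, DeepNARHash> references;
};
using Store = std::map<StorePath, PathInfo>;

// Check that `path` and, recursively, everything it references match the
// expected deep hashes. Assumes no self-references or cycles.
bool verifyClosure(const Store & store, const StorePath & path, const DeepNARHash & expected) {
    auto it = store.find(path);
    if (it == store.end()) return false;
    for (const auto & ref : it->second.references)
        if (!verifyClosure(store, ref.first, ref.second)) return false;
    return calcDeep(it->second.narHash, it->second.references) == expected;
}

// Two-object toy store; `libOnDisk` is what is actually present at the lib path.
bool closureVerifies(const std::string & libOnDisk) {
    const StorePath libPath = "/nix/store/aaaa-lib", appPath = "/nix/store/bbbb-app";
    // Deep hashes recorded when the closure was built with lib contents "lib v1".
    DeepNARHash libDeep = calcDeep(fnv1a("lib v1"), {});
    std::map<StorePath, DeepNARHash> appRefs;
    appRefs.emplace(libPath, libDeep);
    DeepNARHash appDeep = calcDeep(fnv1a("app"), appRefs);

    Store store;
    store.emplace(libPath, PathInfo{fnv1a(libOnDisk), {}});
    store.emplace(appPath, PathInfo{fnv1a("app"), appRefs});
    return verifyClosure(store, appPath, appDeep);
}
```

Here `closureVerifies("lib v1")` succeeds, while swapping in different contents at the lib path makes verification of the app fail, even though the app's own files and recorded reference set are untouched.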

For the shallow traces of CA derivations, we likewise want to make the inputs a type parameter of DerivationNew, so we have

using BasicDerivation = DerivationNew<std::set<StorePath>>;

using Derivation = DerivationNew<std::pair<
    std::set<StorePath>,
    DerivedPathMap<std::set<OutputName>>
>>;

// new
using UnambigousDerivation = DerivationNew<std::map<StorePath, DeepNARHash>>;
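As a rough illustration of why this helps shallow traces, the sketch below derives a lookup key from a derivation whose inputs carry deep hashes. The field set of `DerivationNew`, the `shallowTraceKey` function, the key format, and the store path are all assumptions made for this example, not the actual Nix definitions.

```cpp
#include <cstdint>
#include <map>
#include <string>

using StorePath = std::string;                  // toy stand-in
struct DeepNARHash { std::uint64_t h; };        // toy stand-in

// Hypothetical minimal shape: real derivations have many more fields.
template<typename Inputs>
struct DerivationNew {
    std::string builder;
    Inputs inputs;
};

using UnambigousDerivation = DerivationNew<std::map<StorePath, DeepNARHash>>;

// Serialize everything that determines the build into a key. Because each
// input carries its deep hash, the key pins down the whole input closure,
// not just the input store paths.
std::string shallowTraceKey(const UnambigousDerivation & drv) {
    std::string key = std::to_string(drv.builder.size()) + ":" + drv.builder;
    for (const auto & input : drv.inputs)
        key += "|" + std::to_string(input.first.size()) + ":" + input.first
             + "=" + std::to_string(input.second.h);
    return key;
}

// Helper for the example: one input at a fixed, hypothetical store path.
UnambigousDerivation makeDrv(std::uint64_t libDeepHash) {
    UnambigousDerivation drv;
    drv.builder = "/bin/sh";
    drv.inputs.emplace("/nix/store/aaaa-lib", DeepNARHash{libDeepHash});
    return drv;
}
```

Two derivations that name the same input store path but disagree on its deep hash get different keys here, so a trace recorded for one cannot be mistaken for the other.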

This restores the properties we want. They no longer come from the store paths (unless a store path happens to be content-addressing, in which case they still do), but from the new deep NAR hash. In addition, it vastly lowers the stakes for input-addressing.

We don't need to worry about the quality of input addressing or about collisions, because accidental frankenbuilds are no longer possible. This is why it is fine to imagine that we just randomly generate non-CA paths: even if there were collisions, we would detect them when different actions wanted to "use" the same store path for different store objects.
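The collision check described above might be sketched as follows. `PathRegistry`, `Registration`, and `registerPath` are hypothetical names introduced only for this example, and `DeepNARHash` is reduced to a toy integer wrapper.

```cpp
#include <cstdint>
#include <map>
#include <string>

using StorePath = std::string;                  // toy stand-in
struct DeepNARHash { std::uint64_t h; };        // toy stand-in
bool operator==(const DeepNARHash & a, const DeepNARHash & b) { return a.h == b.h; }

enum class Registration { Added, AlreadyPresent, Conflict };

// Remembers which deep hash each store path is "mounted" with and reports a
// conflict when the same path is claimed for a different store object.
class PathRegistry {
    std::map<StorePath, DeepNARHash> known;
public:
    Registration registerPath(const StorePath & path, const DeepNARHash & deep) {
        auto res = known.emplace(path, deep);   // does not overwrite an existing entry
        if (res.second) return Registration::Added;
        return res.first->second == deep ? Registration::AlreadyPresent
                                         : Registration::Conflict;
    }
};
```

Registering the same path twice with the same deep hash is harmless, while a second registration with a different deep hash is reported as a conflict instead of silently producing a frankenbuild.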

@Ericson2314 Ericson2314 changed the title Merkel verification of non-content-addressed data Merkle verification of non-content-addressed data Nov 20, 2024