Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partially implement cache hash API in zig #4635

Closed
wants to merge 45 commits into from

Conversation

leroycep
Copy link
Contributor

@leroycep leroycep commented Mar 5, 2020

Will close #4311

This pull request translates the C++ cache_hash API into zig, under std.cache_hash. Currently it only supports adding files before hitting the cache, doesn't implement the many different cache_* functions, doesn't check files' modification times, and has a memory leak somewhere.

It's a start.

  • open manifest files with a lock
  • cache_* functions
    • cache_buf (will probably merge into cache_val)
    • cache_file
    • cache_val(var)
      • Not sure if this is the way to go, or if I should copy the C API and make a function for every type
  • Functions to add dependencies
    • CacheHash.add(value)
    • CacheHash.addFile(path)
    • CacheHash.addFile(path) but after the initial hash has been computed
  • cache_hit
    • check file modification times
      • check for problematic timestamps
        • implement a cross-platform high resolution calendar timestamp function in std.time
  • Make user pass in max file size to addFile* functions

@Vexu
Copy link
Member

Vexu commented Mar 5, 2020

Not sure if this is the way to go, or if I should copy the C API and make a function for every type

The C API should only be copied in stage2.zig.

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work so far. I gave a couple suggestions to consider at your leisure

lib/std/cache_hash.zig Outdated Show resolved Hide resolved
pub const CacheHash = struct {
alloc: *Allocator,
blake3: Blake3,
manifest_dir: []const u8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet this could be improved by using a std.fs.Dir handle rather than a string path name

const decoder = base64.Base64Decoder.init(base64_alphabet, base64_pad_char);
const BIN_DIGEST_LEN = 32;

pub const CacheHashFile = struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The full path to this struct is std.cache_hash.CacheHashFile so you can remove redundancy by renaming this to simply File. To avoid clashing with fs.File I would suggest removing const File = fs.File; and instead using fs.File when you need that one.

Similar comment for functions which have cache_ in their name.

@leroycep leroycep force-pushed the feature-cache-hash-zig branch from 2e5adc1 to e6ed5d9 Compare March 7, 2020 05:46
@leroycep
Copy link
Contributor Author

leroycep commented Mar 7, 2020

There are a few pieces missing from the zig std library that are needed to do a full port of the c cache_hash API:

  • Access to a file identifiers
    • FILE_INTERNAL_INFORMATION.IndexNumber on windows
    • Stat.ino on linux
  • An API (probably in std.fs) to open files and put a lock on them

Should I just put all the relevant code in cache_hash, or should I make separate pull requests that add this functionality to the zig std library?

@andrewrk
Copy link
Member

andrewrk commented Mar 8, 2020

Should I just put all the relevant code in cache_hash, or should I make separate pull requests that add this functionality to the zig std library?

For these things let's go the std lib route.

debug.assert(self.manifest_file == null);

const val_type = @TypeOf(val);
switch (@typeInfo(val_type)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I wonder if this can be simplified with something like:

self.blake3.update(mem.asBytes(&val));`

Plus some checks for zero-sized types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good idea, I'll try it out. There will also need to be a check for boolean values, because booleans are handled differently (or least were in the C version, but I see no reason to change that).

I think I'll also try to detect fs.File types and have special logic for them.

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this at a point where it could be merged as is and then improved? Let me know when you're ready for a review.

lib/std/fs.zig Outdated Show resolved Hide resolved
@leroycep
Copy link
Contributor Author

This PR is pretty close, just a few things to add. I think the biggest thing I need to add is logic for adding files after the hash has been generated (cache_add_file in the c code I think). It would also be nice to have a few more tests that make sure that CacheHash doesn't have any false positives.

@leroycep leroycep force-pushed the feature-cache-hash-zig branch from 240a9cb to 69699ba Compare April 16, 2020 02:19
@leroycep
Copy link
Contributor Author

leroycep commented Apr 16, 2020

Okay, I've changed addFile to return the index of the cache_hash.File entry. This means that after checking for a cache hit, you can get the contents of the file that were used in the hashing, like so:

const src_file_idx = try cache_hash.addFile("src/foo.txt");
if (cache_hash.hit()) |hash| {
     // cache hit
} else {
    const src_contents = cache_hash.files.items[src_file_idx].?;
    // compile stuff
}

I'm not sure this is the best API to have. Any feedback welcome!

Copy link
Member

@andrewrk andrewrk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! And thanks for your patience on how long it has taken for me to review it.

I gave some areas to look into improving - let me know when you're ready for round 2.

lib/std/cache_hash.zig Outdated Show resolved Hide resolved
lib/std/cache_hash.zig Outdated Show resolved Hide resolved
Comment on lines +373 to +390
fn is_problematic_timestamp(file_mtime_ns: i64) bool {
const now_ms = time.milliTimestamp();
const file_mtime_ms = @divFloor(file_mtime_ns, time.millisecond);
return now_ms == file_mtime_ms;
}
Copy link
Member

@andrewrk andrewrk Apr 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this has the same behavior as the c++ code:

static bool is_problematic_timestamp(const OsTimeStamp *fs_clock) {
    OsTimeStamp wall_clock = os_timestamp_calendar();
    // First make all the least significant zero bits in the fs_clock, also zero bits in the wall clock.
    if (fs_clock->nsec == 0) {
        wall_clock.nsec = 0;
        if (fs_clock->sec == 0) {
            wall_clock.sec = 0;
        } else {
            wall_clock.sec &= (-1ull) << ctzll(fs_clock->sec);
        }
    } else {
        wall_clock.nsec &= (-1ull) << ctzll(fs_clock->nsec);
    }
    return wall_clock.nsec == fs_clock->nsec && wall_clock.sec == fs_clock->sec;
}

std.time.milliTimeStamp is not quite what we want here. It looks like std.time.milliTimeStamp is too high level, and std.time does not actually offer what we need: a cross-platform high resolution calendar timestamp. This will require a new function added to std.time. I believe that std.time.milliTimeStamp could then be reworked to call this other function, and do a small amount of logic on top of it.

leroycep added 23 commits April 30, 2020 15:42
It was causing a segfault on `mipsel` architecture, not sure why other
architectures weren't affected.
Instead of releasing the manifest file when an error occurs, it is
only released when when `CacheHash.release` is called. This maps better
to what a zig user expects when they do `defer cache_hash.release()`.
It checks whether the cache will respond correctly to inputs that don't
initially depend on filesystem state. In that case, we have to check
for the existence of a manifest file, instead of relying on reading the
list of entries to tell us if the cache is invalid.
If a user doesn't care that the manifest failed to be written, they can
simply ignore it. The program will still work; that particular cache
item will simply not be cached.
This makes it possible for the user to retrieve the contents of the
file without running into data races.
It occured when the manifest file was manually edited to include an extra
file. Now it will simply copy the file name in the manifest file
People using the API as intended would never trigger this assertion
anyway, but if someone has a non standard use case, I see no reason
to make the program panic.
@andrewrk
Copy link
Member

andrewrk commented May 3, 2020

@leroycep I'm juggling a few things so I might not be paying close attention on this. Let me know when there is somewhere I can focus my attention to unblock you on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"cache hash" API in the std lib (or at least in src-self-hosted)
4 participants