-
Notifications
You must be signed in to change notification settings - Fork 605
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Add a mechanism to avoid including unnecessary files in the trace #2529
Conversation
This is more a kick off point for discussion than a serious PR at this point, but I figure it's always better to discuss with working code than in the abstract. We're planning to start collecting large numbers of rr traces from failing CI jobs. However, at the moment, these traces are a bit large for comfort. However, they don't need to be that large, because a good fraction of their size comes from binaries that we're already storing on our artifact server, and thus don't need to store separately in the trace. This PR adds an `--index-dir` option to `rr pack`. rr will index each such directory, cataloguing all contained files by content hash. It will then symlink the index directory into the trace, rewrite the mmap entries to point into this symlink and as usual delete any unreferenced files. The idea is to send this directory over to the replay machine, rewrite the symlink to a directory containing the same immutable binaries that were in the index directory and replay the trace that way. This works ok, and is probably what we'll do for the short term, but it's not a satisfying long-term solution, since we don't record any information whatsoever about what was in the index directory. I'm thinking a better solution would be to have a custom `external` record type that we could use with a custom packer to record exactly where on our artifact server the relevant binary can be found (which is easy, since those binaries are already content-addressed). That said, the effort to implement that seemed not-insubstantial, so this is the poor man's version of that. If the version in this PR is intersting, I'm happy to clean it up and get it in properly, but as I said, I'd like to personally move away from it in the medium term, so I don't need it merged.
It makes sense to have some kind of mechanism for this, preferably one that doesn't constrain the mechanics of the external storage. But I wonder if this actually needs rr changes? What if:
Then before replay:
|
I.e. I don't see a reason to modify the rr trace data itself to record that some files are stored externally. That would only make sense if rr itself was going to retrieve those files before/during replay. But making rr do that just seems like unnecessary coupling. |
Yes, I think that's about right, but I think it would be good to somehow signal to rr that this has been done such that it can error during replay if the custom decompression step hasn't been done. Perhaps the packer script should embed a human-readable error string in the trace for rr to emit if it tries to replay something that hasn't been unpacked yet. |
Yeah that would make sense. Maybe we should just have a well-known file name that rr checks for; if that file exists, then |
That seems like a good solution. |
Seems like this PR was found to be "not the way to go". I suggest to close it and possibly have an issue or PR for "let rr look for the magic file and print its output", if that is still relevant. |
Yes. |
This is more a kick off point for discussion than a serious PR at this point,
but I figure it's always better to discuss with working code than in the abstract.
We're planning to start collecting large numbers of rr traces from failing CI jobs.
However, at the moment, these traces are a bit large for comfort. However, they
don't need to be that large, because a good fraction of their size comes from
binaries that we're already storing on our artifact server, and thus don't need
to store separately in the trace. This PR adds an
--index-dir
option torr pack
.rr will index each such directory, cataloguing all contained files by content hash.
It will then symlink the index directory into the trace, rewrite the mmap entries
to point into this symlink and as usual delete any unreferenced files. The idea is
to send this directory over to the replay machine, rewrite the symlink to a directory
containing the same immutable binaries that were in the index directory and replay
the trace that way.
This works ok, and is probably what we'll do for the short term, but it's not a
satisfying long-term solution, since we don't record any information whatsoever
about what was in the index directory. I'm thinking a better solution would be
to have a custom
external
record type that we could use with a custom packerto record exactly where on our artifact server the relevant binary can be found
(which is easy, since those binaries are already content-addressed). That said,
the effort to implement that seemed not-insubstantial, so this is the poor man's
version of that.
If the version in this PR is interesting, I'm happy to clean it up and get it in
properly, but as I said, I'd like to personally move away from it in the medium
term, so I don't need it merged.