-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integration with commit evolution #1
Comments
Hi, @arxanas! Thanks for the questions. I'm interested in supporting use cases in other git tools. Here are the answers:
Yes. The slides can be found here: https://github.com/facebookexperimental/eden/blob/master/eden/scm/slides/201904-segmented-changelog/segmented-changelog.pdf
You can find some benchmark data here: facebook/sapling@e12b6c8
Correctness is taken seriously. There are unit tests for sure. Some core algorithms (such as gca, ranges) were fuzz-tested against other known correct algorithms (see https://github.com/facebookexperimental/eden/tree/master/eden/scm/lib/dag/fuzz). The
Do you mean detached heads? The current design is to build up dag based on all references. So if you access something that's not referred by git references, it will not be found. In theory it's possible to have some logic to update the DAG on demand. But that's not implemented yet.
As long as it compiles.
There was a big change, though. That is switching from sync to async Rust. It was done at upstream. If you want sync interface then I might add some sync adapters. Aside from that, the core APIs (
I can take a look exposing APIs to control heads instead of reading git references. It shouldn't be too hard.
I think it wouldn't be too hard to support customizing revset functions so you can define your own You can create separate issues if you want to track the above 2 features more precisely.
For better performance. Ideally |
Thank you very much for your detailed response and for the references. This is really fantastic work by you and your team.
Sorry, I wasn't clear about what I meant, but it's great to hear that it's battle-tested. I meant: do the data structures give false positives/false negatives or under-/over-approximations in some cases (like e.g. bloom filters)? But it's clear now from your references that they give exact answers.
Is the async interface already checked in/published to crates.io? I don't see it in the API documentation. By "upstream", do you mean that it was checked into https://github.com/facebookexperimental/eden, but the dependencies haven't been updated yet for this crate? Regardless, I don't think using sync versus async will be a problem on my end. |
There are no bloom filters being used. That said, keep in mind that
Then it iterates though all commits examining the date Unlike git, the revset is a powerful language so it is possible to short-circuit the search, something like:
Assuming
Yes. We'd like to set up publishing crates directly from there. However
That's good to know. I think I can skip the sync adapter interface then. |
@martinvonz You might already be familiar with Eden, but if not, then this thread might interest you. The Eden DAG implementation turns out to be pretty easy to integrate with, and it implements a host of useful operations (enough to power the revset engine in this project). If you don't already have a solution for complex graph operations for https://github.com/martinvonz/jj, you might want to integrate with this. Note that it's GPL-2; I considered it worth it to relicense Here's the example Git integration: https://github.com/facebookexperimental/eden/blob/e5527715b701a4f72a1d474be546226264289d37/eden/scm/lib/dag/gitdag/src/gitdag.rs. You just need to provide an initial set of leaf nodes and a function to get the parents of a given node. |
Thanks, @arxanas! Yes, I'm familiar with Eden and I know Jun/quark-zju from when we both worked on Mercurial :) I already have support for revsets in It sounds like I should at least be able to use the Also, thanks for the heads up about the license. I'll have to check what Google's policy is. |
Currently, merge-base calculation doesn't properly reuse already-performed work. The merge-base cache only caches exact OID pairs in order to speed up queries. If you immediately query for the merge-base of a descendant commit for which you already know the merge-base, the entire merge-base calculation has to be repeated. There are some tricks that could be used to speed up merge-base calculation (including Git's `commit-graph` feature), but the Eden source control management system (by Facebook) is written in Rust and has a high-quality persistent directed acyclic graph implementation, which aims to service merge-base queries in <1ms for multi-million commit repositories. It also provides a lot of other DAG operations for free. For example, this `find_path_to_merge_base` operation has a large comment explaining how it's buggy: https://github.com/arxanas/git-branchless/blob/1f74da335bb2ff92d7a8178f957abd1d483750cf/src/core/rewrite/plan.rs#L331-L399 We get a correct implementation using the Eden DAG, along with enough operations to implement a Mercurial-style revset query engine. See more discussion at quark-zju/gitrevset#1, including links explaining the underlying data structure and performance numbers.
Currently, merge-base calculation doesn't properly reuse already-performed work. The merge-base cache only caches exact OID pairs in order to speed up queries. If you immediately query for the merge-base of a descendant commit for which you already know the merge-base, the entire merge-base calculation has to be repeated. There are some tricks that could be used to speed up merge-base calculation (including Git's `commit-graph` feature), but the Eden source control management system (by Facebook) is written in Rust and has a high-quality persistent directed acyclic graph implementation, which aims to service merge-base queries in <1ms for multi-million commit repositories. It also provides a lot of other DAG operations for free. For example, this `find_path_to_merge_base` operation has a large comment explaining how it's buggy: https://github.com/arxanas/git-branchless/blob/1f74da335bb2ff92d7a8178f957abd1d483750cf/src/core/rewrite/plan.rs#L331-L399 We get a correct implementation using the Eden DAG, along with enough operations to implement a Mercurial-style revset query engine. See more discussion at quark-zju/gitrevset#1, including links explaining the underlying data structure and performance numbers.
Hi @quark-zju, this is a super cool project. I would like to integrate it into my project at https://github.com/arxanas/git-branchless, which simulates the workflows at companies like Facebook. I had a few questions I was hoping you could help me with.
Data structures?
What data structures are used to implement the DAG? I found this remark:
at https://docs.rs/esl01-dag/0.1.1/esl01_dag/struct.IdDag.html, but I didn't find the associated slides (are they publicly available?). What kind of performance can I expect for various operations?
What kind of correctness guarantees can I expect? If I query the DAG for a node which hasn't yet been observed, what happens? Can I use it in a multi-threaded or multi-process context?
How stable is the DAG API? To what degree can I rely on it?
Performance with reference updates?
The performance for initializing the DAG when running
git-revs
is quite good on the repository I'm testing with (maybe 30 seconds, compared to minutes when runninggit commit-graph
instead, but I didn't even measure the time because it took so long). But subsequent invocations take two or three seconds at a minimum.My guess is that it's because crawling all the references here: https://docs.rs/gitdag/0.1.2/src/gitdag/gitdag.rs.html#82. I think you mentioned somewhere in the documentation that it will be slow if there are a lot of references. In the case of
git-branchless
, we keep track of the commit graph heads ourselves, and we don't care about remote references, so I should be able to significantly speed it up. However, I can't pass in my ownGitDag
to this library. Should I change the API, and if so, what changes do you recommend?Commit evolution?
git-branchless
implements its own commit evolution feature, not based on the reflog (see https://github.com/arxanas/git-branchless/wiki/Architecture). So I don't want the reflog-based commit evolution implementation here: https://github.com/quark-zju/gitrevset/blob/master/src/mutation.rs#L13. Similarly to the above, if I want to swap out the implementation for this behavior with my own, do I need to change thegit-revset
API, and if so, what changes do you recommend?add_heads_and_flush
For this function
DagPersistent::add_heads_and_flush
: https://docs.rs/esl01-dag/0.2.1/esl01_dag/namedag/struct.NameDag.html#method.add_heads_and_flush, why does it care about the difference between master names and non-master names?git-branchless
relies on a main branch, but I don't see why the DAG itself cares about which branches are "main".The text was updated successfully, but these errors were encountered: