Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration with commit evolution #1

Closed
arxanas opened this issue Jul 20, 2021 · 5 comments
Closed

Integration with commit evolution #1

arxanas opened this issue Jul 20, 2021 · 5 comments

Comments

@arxanas
Copy link

arxanas commented Jul 20, 2021

Hi @quark-zju, this is a super cool project. I would like to integrate it into my project at https://github.com/arxanas/git-branchless, which simulates the workflows at companies like Facebook. I had a few questions I was hoping you could help me with.

Data structures?

What data structures are used to implement the DAG? I found this remark:

See slides/201904-segmented-changelog/segmented-changelog.pdf for pretty graphs about how segments help with ancestry queries.

at https://docs.rs/esl01-dag/0.1.1/esl01_dag/struct.IdDag.html, but I didn't find the associated slides (are they publicly available?). What kind of performance can I expect for various operations?

What kind of correctness guarantees can I expect? If I query the DAG for a node which hasn't yet been observed, what happens? Can I use it in a multi-threaded or multi-process context?

How stable is the DAG API? To what degree can I rely on it?

Performance with reference updates?

The performance for initializing the DAG when running git-revs is quite good on the repository I'm testing with (maybe 30 seconds, compared to minutes when running git commit-graph instead, but I didn't even measure the time because it took so long). But subsequent invocations take two or three seconds at a minimum.

My guess is that it's because crawling all the references here: https://docs.rs/gitdag/0.1.2/src/gitdag/gitdag.rs.html#82. I think you mentioned somewhere in the documentation that it will be slow if there are a lot of references. In the case of git-branchless, we keep track of the commit graph heads ourselves, and we don't care about remote references, so I should be able to significantly speed it up. However, I can't pass in my own GitDag to this library. Should I change the API, and if so, what changes do you recommend?

Commit evolution?

git-branchless implements its own commit evolution feature, not based on the reflog (see https://github.com/arxanas/git-branchless/wiki/Architecture). So I don't want the reflog-based commit evolution implementation here: https://github.com/quark-zju/gitrevset/blob/master/src/mutation.rs#L13. Similarly to the above, if I want to swap out the implementation for this behavior with my own, do I need to change the git-revset API, and if so, what changes do you recommend?

add_heads_and_flush

For this function DagPersistent::add_heads_and_flush: https://docs.rs/esl01-dag/0.2.1/esl01_dag/namedag/struct.NameDag.html#method.add_heads_and_flush, why does it care about the difference between master names and non-master names? git-branchless relies on a main branch, but I don't see why the DAG itself cares about which branches are "main".

@quark-zju
Copy link
Owner

quark-zju commented Jul 20, 2021

Hi, @arxanas! Thanks for the questions. I'm interested in supporting use cases in other git tools.

Here are the answers:

but I didn't find the associated slides (are they publicly available?)

Yes. The slides can be found here: https://github.com/facebookexperimental/eden/blob/master/eden/scm/slides/201904-segmented-changelog/segmented-changelog.pdf

What kind of performance can I expect for various operations?

You can find some benchmark data here: facebook/sapling@e12b6c8

What kind of correctness guarantees can I expect?

Correctness is taken seriously. There are unit tests for sure. Some core algorithms (such as gca, ranges) were fuzz-tested against other known correct algorithms (see https://github.com/facebookexperimental/eden/tree/master/eden/scm/lib/dag/fuzz). The dag crate has completely replaced the old revlog-based commit graph algorithms and been running in production for months.

If I query the DAG for a node which hasn't yet been observed, what happens?

Do you mean detached heads? The current design is to build up dag based on all references. So if you access something that's not referred by git references, it will not be found. In theory it's possible to have some logic to update the DAG on demand. But that's not implemented yet.

Can I use it in a multi-threaded or multi-process context?

As long as it compiles.

How stable is the DAG API? To what degree can I rely on it?

There was a big change, though. That is switching from sync to async Rust. It was done at upstream. If you want sync interface then I might add some sync adapters.

Aside from that, the core APIs (DagAlgorithm) are kind of stable practically speaking, although there was no explicit guarantees.

In the case of git-branchless, we keep track of the commit graph heads ourselves, and we don't care about remote references,

I can take a look exposing APIs to control heads instead of reading git references. It shouldn't be too hard.

Similarly to the above, if I want to swap out the implementation for this behavior with my own, do I need to change the git-revset API

I think it wouldn't be too hard to support customizing revset functions so you can define your own predecessorsand override the builtin one. I can take a look (in two weeks).

You can create separate issues if you want to track the above 2 features more precisely.

why does it care about the difference between master names and non-master names?

For better performance. Ideally ::master is just a single span (ex. 0:1000) instead of multiple spans (ex. 0:500 + 502:1001). So it pre-allocates a large chunk of ids for the master branch. See also https://github.com/facebookexperimental/eden/blob/master/eden/scm/newdoc/dev/internals/SegmentedChangelog.rst#groups

@arxanas
Copy link
Author

arxanas commented Jul 21, 2021

Thank you very much for your detailed response and for the references. This is really fantastic work by you and your team.

Correctness is taken seriously.

Sorry, I wasn't clear about what I meant, but it's great to hear that it's battle-tested. I meant: do the data structures give false positives/false negatives or under-/over-approximations in some cases (like e.g. bloom filters)? But it's clear now from your references that they give exact answers.

There was a big change, though. That is switching from sync to async Rust. It was done at upstream. If you want sync interface then I might add some sync adapters.

Is the async interface already checked in/published to crates.io? I don't see it in the API documentation. By "upstream", do you mean that it was checked into https://github.com/facebookexperimental/eden, but the dependencies haven't been updated yet for this crate? Regardless, I don't think using sync versus async will be a problem on my end.

@quark-zju
Copy link
Owner

quark-zju commented Jul 21, 2021

bloom filters

There are no bloom filters being used. That said, keep in mind that dag only cares about the graph, not commit. dag does not store or handle commit message, date, user, etc. It's this crate that deals with date, user, messages and currently if you specify something like:

date(2020)

Then it iterates though all commits examining the date

Unlike git, the revset is a powerful language so it is possible to short-circuit the search, something like:

first(date(2020))::

Assuming date(2020) is lazy, first(date(2020)) will stop the date iteration once founding the first matching entry. Note that date is not actually lazy in this crate yet. But it can be made so, and in hg it is lazy.

By "upstream", do you mean that it was checked into https://github.com/facebookexperimental/eden

Yes. We'd like to set up publishing crates directly from there. However crates.io does not support 2fac, which is required to officially publish crates. So I was using esl01- as a workaround and that hasn't been updated.

I don't think using sync versus async will be a problem on my end.

That's good to know. I think I can skip the sync adapter interface then.

@arxanas arxanas closed this as completed Jul 22, 2021
@arxanas
Copy link
Author

arxanas commented Aug 28, 2021

@martinvonz You might already be familiar with Eden, but if not, then this thread might interest you. The Eden DAG implementation turns out to be pretty easy to integrate with, and it implements a host of useful operations (enough to power the revset engine in this project).

If you don't already have a solution for complex graph operations for https://github.com/martinvonz/jj, you might want to integrate with this. Note that it's GPL-2; I considered it worth it to relicense git-branchless to use it.

Here's the example Git integration: https://github.com/facebookexperimental/eden/blob/e5527715b701a4f72a1d474be546226264289d37/eden/scm/lib/dag/gitdag/src/gitdag.rs. You just need to provide an initial set of leaf nodes and a function to get the parents of a given node.

@martinvonz
Copy link

Thanks, @arxanas! Yes, I'm familiar with Eden and I know Jun/quark-zju from when we both worked on Mercurial :)

I already have support for revsets in jj. I want to support both Git and my other backend there, so I thought it would be hard to use this project there (and I think I only heard about it after I started implementing my own revsets anyway).

It sounds like I should at least be able to use the dag::Dag, so I'll have to look into that some day. It would be nice to get the speed of the segmented changelog. One constraint I have is that I've worked hard to make sure that jj does not require locking, so the append-only-ness of the segmented changelog is concerning (I'm worried that two concurrent processes or threads race to append data to the file). I'm sure there's a solution to that, though.

Also, thanks for the heads up about the license. I'll have to check what Google's policy is.

arxanas added a commit to arxanas/git-branchless that referenced this issue Sep 2, 2021
Currently, merge-base calculation doesn't properly reuse already-performed work. The merge-base cache only caches exact OID pairs in order to speed up queries. If you immediately query for the merge-base of a descendant commit for which you already know the merge-base, the entire merge-base calculation has to be repeated.

There are some tricks that could be used to speed up merge-base calculation (including Git's `commit-graph` feature), but the Eden source control management system (by Facebook) is written in Rust and has a high-quality persistent directed acyclic graph implementation, which aims to service merge-base queries in <1ms for multi-million commit repositories.

It also provides a lot of other DAG operations for free. For example, this `find_path_to_merge_base` operation has a large comment explaining how it's buggy:

https://github.com/arxanas/git-branchless/blob/1f74da335bb2ff92d7a8178f957abd1d483750cf/src/core/rewrite/plan.rs#L331-L399

We get a correct implementation using the Eden DAG, along with enough operations to implement a Mercurial-style revset query engine.

See more discussion at quark-zju/gitrevset#1, including links explaining the underlying data structure and performance numbers.
arxanas added a commit to arxanas/git-branchless that referenced this issue Sep 2, 2021
Currently, merge-base calculation doesn't properly reuse already-performed work. The merge-base cache only caches exact OID pairs in order to speed up queries. If you immediately query for the merge-base of a descendant commit for which you already know the merge-base, the entire merge-base calculation has to be repeated.

There are some tricks that could be used to speed up merge-base calculation (including Git's `commit-graph` feature), but the Eden source control management system (by Facebook) is written in Rust and has a high-quality persistent directed acyclic graph implementation, which aims to service merge-base queries in <1ms for multi-million commit repositories.

It also provides a lot of other DAG operations for free. For example, this `find_path_to_merge_base` operation has a large comment explaining how it's buggy:

https://github.com/arxanas/git-branchless/blob/1f74da335bb2ff92d7a8178f957abd1d483750cf/src/core/rewrite/plan.rs#L331-L399

We get a correct implementation using the Eden DAG, along with enough operations to implement a Mercurial-style revset query engine.

See more discussion at quark-zju/gitrevset#1, including links explaining the underlying data structure and performance numbers.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants