Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement depth/shallow-clone support #67

Merged
merged 14 commits into from
Feb 26, 2019

Conversation

nickurak
Copy link

Fixes #65

Note that this PR is intended to be read/reviewed commit-by-commit as opposed to via the "Files changed tab". (That is to say, each commit does one small piece of work that can be reviewed and understood separately, and thus is probably easier to read that way).

This is my first PR to this repository, please don't hold back on feedback.

Jeremy Nickurak added 3 commits February 22, 2019 16:00
While the default is to act on the repository local properties, being
explicit here makes it obvious that these properties are only being
set for test purposes.
"assert False" doesn't provide pytest much information to give a
useful error. "assert True" is a no-op equivalent to "pass".

Using a canary value to check if the exception was hit allows the
error to be more meaningful in case of a failing test.
One can specify the branch to track in the same 'git' command as the
clone. This can make the operation a little quicker if the desired
branch is significantly different from the default (typically
'master') branch.

It will also facilitate the desired branch being checked out in future
commits that implement shallow clones.
@nickurak
Copy link
Author

Note that these changes don't yet implement anything in the link-generator -- you'll need to add a "&depth=X" parameter to your URLs, or specify the NBGITPULLER_DEPTH environment variable in your jupyter environment for it to have any affect.

The goal is that, until we build some UI, it'll be especially hard for any of these changes to have any impact. We can verify that this functionality is what we're looking for before exposing a user-facing change, and then consider what user-facing changes and/or default changes we want.

@betatim
Copy link
Member

betatim commented Feb 24, 2019

This looks like a nice PR!

In https://github.com/jupyter/repo2docker we investigated making shallow clones as a way to speed up things. This was maybe 6months ago or so and we ended up giving up on it. IIRC the problem was either that we struggled to un-shallow the clone reliably when we needed that and/or that git clone --depth 5 ... followed by git checkout a-branch-that-hasnt-been-changed-in-last-five-commits would break in weird ways. This is the code we use now which sometimes makes a shallow clone and this is the related PR/discussion.

I am here to find out if the state of things has changed (-> we can update repo2docker) and share the experience of shallow clones.

@nickurak
Copy link
Author

As far as I know, we don't have a use case for converting shallow clones to full clones here.

I'll have to rely on the existing project maintainers for advice on whether changing branches is important -- but that is something that's normally not allowed in a shallow clone (and indeed, is something that could defeat much of the purpose of a shallow clone -- as you'd need to fetch a bunch of content from other branches that might include the large files you're trying to avoid).

Copy link
Contributor

@yuvipanda yuvipanda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style comments, but otherwise this is awesome.

I haven't done commit by commit review since my Gerrit days, and this was a joy to do on this PR. Thank you! <3

@betatim I think in nbgitpuller's case we'll never really have to move from shallow to full clone, so I think it's ok.

@@ -64,7 +65,13 @@ def initialize_repo(self):
"""

logging.info('Repo {} doesn\'t exist. Cloning...'.format(self.repo_dir))
yield from execute_cmd(['git', 'clone', '--branch', self.branch_name, self.git_url, self.repo_dir])
def clone_args():
yield from ['git', 'clone']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very cool! However, for clarity I'd prefer it if we just constructed a list directly instead.

@@ -40,10 +41,14 @@ require([

GitSync.prototype.start = function() {
// Start git pulling handled by SyncHandler, declared in handlers.py
var syncUrl = this.baseUrl + 'git-pull/api?' + $.param({
syncUrlParams = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there should be a 'var' here, since otherwise this variable is now global.

@yuvipanda
Copy link
Contributor

This is an awesome PR, @nickurak. I'll be happy to merge once the minor comments are addressed. I'd also love to have you on as a maintainer. What do you think?

@nickurak
Copy link
Author

@yuvipanda Last two commits should address those issues.

I'd like to squash those fixups in before this gets merged if you don't mind, or you can do that yourself -- but I'll let you look at the changes separately this way before I do so.

@yuvipanda
Copy link
Contributor

@nickurak thanks! This looks good to me, and am happy to merge anytime you want. I personally leave the fixups in the commit history, but am happy to have it be in a state you feel good about.

Jeremy Nickurak added 11 commits February 25, 2019 15:28
If provided to the GitPuller, a 'depth' parameter will be passed along
to git, resulting in a shallow clone. Shallow clones with a depth of
'N' only include history going back 'N' commits. This can be helpful
in reducing the size of the cloned repository, and in the amount of
time it takes to clone that repository, especially when large files
were added and later removed/edited.

As a special-case, a non-positive value for depth (eg, 0 or negative)
can be specified to explicitly request all history.

For more information, see:
https://git-scm.com/docs/git-clone#git-clone---depthltdepthgt
This allows the default "depth" option to be configured via the
NBGITPULLER_DEPTH environment variables, or standard traitlet
behavior.
These classes had quite a bit in common, and factoring out their
changes makes each a little more specific to the unique values they
add.
git modifies its behavior between plain file-path remotes and
URI-specified remotes. For example, shallow clones are ignored for
plain file-path remotes, but enabled for URI-specified remotes.

Using the full URI specification makes our tests behave more like
real-world nbgitpuller use cases.
This makes invocation with pytest's "-s" option easier to read.
These tests verify that basic shallow clones, and environment-variable
configured shallow clones behave as expected.
This was repeated several times throughout the test module, and with
varying levels of detail and bugs (missing space between strings,
missed stripping of trailing newline).
… defaults

Since creating a long_remote implicitly creates git repositories with
names that conflict with the defaults, it's easy for tests to create
repositories that conflict with that file path. Using more arbitrary
names reduces the likelihood of a collision.
@nickurak
Copy link
Author

Thanks @yuvipanda. I have a whole spiel on git storytelling I can get into about the fixups -- but yes I'd rather have them in this state where I've used autosquash to correct the commits in place.

This last push is ready to go. It also adds 3 more commits -- some more refactoring in the tests, and another test (making sure a pull on top of a shallow clone works).

If there's no other issues, merge at your convenience (but please don't squash-merge ;) ).

@yuvipanda yuvipanda merged commit 06e62f6 into jupyterhub:master Feb 26, 2019
@yuvipanda
Copy link
Contributor

@nickurak awesome :) I've merged. Would love to hear your spiel on git storytelling though.

@betatim
Copy link
Member

betatim commented Feb 26, 2019

I came for the shallow clone advice/expertise and stayed for the git usage pattern!

@nickurak
Copy link
Author

Well I'm gradually collecting material to make a talk on this but, here's the short(ish) version:

Git history, done carefully, is incredibly useful for future users of the code base. It's a permanent record of not only the how, but more importantly the why individual steps were taken. It's great for future authors (or tho original authors ourselves, since it's easy to forget what we did ;) ) trying to understand what's there, and what changes make sense in that context.

In contrast, the details on how and why we got to a particular written history of how and why we did something are (and I'm veering into personal experience and opinion here) less important. It's the meta-how-and-why. It's not the thing that's helpful performing the archeology of the codebase down the road.

Another angle on this is that the practice of making your git history look like you wished you'd done it (as opposed to how it actually did it) is helpful in retrospecting and learning. For example (and this isn't something i pushed on in this PR), consider this sequence of events:

  1. Developer introduces functionality
  2. Reviewer suggests adding test coverage
  3. Developer adds test coverage
  4. Reviewer suggests that TDD would have been helpful in providing coverage, and structuring the code around testability
  5. Developer performs a rebase to move their tests up to the beginning of their history, and goes through the steps of letting those tests fail, and observing them switch from red to green as the functionality comes together.

Now that git history shows a compelling story for the archeologists trying to figure out how and why the code works the way it does, AND the developer has a better idea of how they might approach that work later. As a side-effect they've also gone through more self-review of their own code (maybe even before code review happened).

That's not to say the meta-how-and-why isn't useful -- it's just useful in a different way. It's potentially useful in learning the ropes of the lifecycle of a PR. It's also information which is at least partially captured anyways here in the github PR discussion.

I do wish that git had a better use flow for "here's a string of history that isn't the authoritative version (even if it's a bit of a lie), but still exists if you want to understand how it got here." Github is also especially bad at this -- eg, there's no way to see those fixup commits I made now, and if you're not careful the order of commits in the "commits" tab can easily become non-sensical. But that's getting into yet another rant ;)

(Credit, BTW, for various versions of some of these thoughts goes to past collaborators, especially Steve Bromling, who's been super helpful in devevloping the tools to execute and talk about these git usage patterns).

@betatim
Copy link
Member

betatim commented Feb 26, 2019

I like the idea of re-ordering commits, especially in the "write code, suggest to add tests, add tests, move tests" story. Is there a good UI tool that helps with that workflow? I find that despite using git a lot every day I've only recently become comfortable with using git rebase -i HEAD~5 to reorder commits (and still have a high probability of shooting myself in the foot while doing so) and hardly ever attempt to split one commit into smaller ones retroactively. I sometimes use git add -p, I very rarely meet anyone who knows it exists.

All this makes me think: I need a better GUI tool for this because I would love to use the style of working you described more often.

@nickurak
Copy link
Author

I mostly just use git add -p and an alias for git rebase -i (*) and some other odd posixy shell glue as required, but I live and die by the command line most days, so might not be a great example. That said, I really enjoy it -- I really have fun rewriting history this way, and moreso as I get better at it.

If you're an emacs user, its magit mode can do some wonderfully powerful stuff with git history rewriting. I've seen it done, and while I usually edit code in emacs, I'm wary of going in too deep into the emacs lifestyle ;)

This might be a good place for somebody to come up with a dedicated tool for crafting git history. It might even be worth something more jupyter-centric, given the specialized types of work that happens in notebooks.

(*) Actually, git rebase --interactive --onto $(git merge-base -a HEAD @{upstream}) to make sure that I'm not inadvertently doing a real-change-of-base-commit while rewriting (**)

(**) Actually actually, git rebase --interactive --onto $(git merge-base -a HEAD @{upstream}) --exec 'git commit --amend --date=now --no-edit' while working on repos stored on GitHub, to work around isaacs/github#386

@choldgraf
Copy link
Member

um this is a super interesting post-merge PR conversation :-)

This is one reason I like the rust-lang RFC page. If you dig into the PR conversations you can see the process by which decisions have been made (or not made)

@yuvipanda
Copy link
Contributor

Nice! My first set of serious code review happened on Gerrit, which follows the same philosophy as @nickurak. In fact, it also implements the wishlist item - you have both a 'this is how my commit should have been from the start' as well as a 'this is how this commit got to being here'.

You can see this in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/454346. It's one single commit, that has been amended about 13 times so far. Reviewers leave comments, and the author (or someone else!) amends the commit to incorporate it. Gerrit keeps a record of each amend, so you can see how the individual commit has changed over time. When merged, it is merged as one commit. You can have multiple commits referencing each other, with easy web based rebase. So when stuff gets merged, you get the clean easy to understand archeological history, and it's wonderful.

The downside is that the UX is very hard unless you sortof understand a good chunk of git. There's a 'git review' UI thing that makes some of this easier, but it still breaks quite a bit. GitHub's review flow is much easier to teach to people, so that is the tradeoff here.

Either way, different philosophies! GitHub is picking up stuff from Gerrit - for example, the 'allow maintainers to push changes' is amazing and a core part of how people use Gerrit. If someone complains about a nit, you can ask them to go fix it themselves. Often I'll just fix it instead of complaining about it :D You can do that on GitHub now...

I'm not holding my breath for the fundamental change though - gerrit's code review is per-commit, while here it's per-branch. I think per-branch is much easier for people to understand - 'commits are immutable, branches can be updated'. This isn't true at all in Git, but hey...

This is a great conversation! <3

@yuvipanda
Copy link
Contributor

@nickurak I think I just realized that the traitlet isn't actually set when the config is set in jupyter_notebook_config.py. You need to pass the 'NotebookApp' object as parent to the Configurable constructor - like https://github.com/yuvipanda/nbresuse/blob/master/nbresuse/__init__.py#L94.

This lets you put stuff like the following in a jupyter_notebook_config.py:

c.GitPuller.clone_depth = 1

And have that be read by the handlers. This is very useful, and used throughout a lot of places.

Do you think you can implement that too?

@nickurak
Copy link
Author

@yuvipanda I'm not sure what needs to happen there. (I haven't used traitlets before this PR).

I got as far as making these changes: https://gist.github.com/nickurak/4b8e83b483833ed77b34b81d3f76c328

But adding c.GitPuller.clone_depth=9 to ~/.jupyter/jupyter_notebook_config.py didn't seem to have the desired effect.

@yuvipanda
Copy link
Contributor

@nickurak can you check if 'parent' is being passed through to the super() constructor?

@nickurak
Copy link
Author

@yuvipanda yep, confirmed via pdb -- it's in kwargs in the parent constructor:

  /home/atrus/.virtualenvs/callysto/lib/python3.7/site-packages/nbgitpuller-0.6.1-py3.7.egg/nbgitpuller/handlers.py(66)get()
-> gp = GitPuller(repo, branch, repo_dir, depth=depth, parent=self.settings['nbapp'])
  /home/atrus/.virtualenvs/callysto/lib/python3.7/site-packages/nbgitpuller-0.6.1-py3.7.egg/nbgitpuller/pull.py(72)__init__()
-> super(Configurable, self).__init__(**newargs)
> /home/atrus/.virtualenvs/callysto/lib/python3.7/site-packages/traitlets/traitlets.py(988)__init__()
-> def __init__(self, *args, **kwargs):
(Pdb) kwargs['parent']
<notebook.notebookapp.NotebookApp object at 0x7f3967c380b8>

@nickurak
Copy link
Author

nickurak commented Mar 6, 2019

Fixed. Tracking the issue in #74, and resolved in #75.

(For those who were paying attention to my git history and TDD rants, #75 follows that pattern, introducing failing tests in one commit which are caused to pass by later commits).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants