Implement depth/shallow-clone support #67

nickurak · 2019-02-22T23:02:31Z

Fixes #65

Note that this PR is intended to be read/reviewed commit-by-commit as opposed to via the "Files changed tab". (That is to say, each commit does one small piece of work that can be reviewed and understood separately, and thus is probably easier to read that way).

This is my first PR to this repository, please don't hold back on feedback.

While the default is to act on the repository local properties, being explicit here makes it obvious that these properties are only being set for test purposes.

"assert False" doesn't provide pytest much information to give a useful error. "assert True" is a no-op equivalent to "pass". Using a canary value to check if the exception was hit allows the error to be more meaningful in case of a failing test.

One can specify the branch to track in the same 'git' command as the clone. This can make the operation a little quicker if the desired branch is significantly different from the default (typically 'master') branch. It will also facilitate the desired branch being checked out in future commits that implement shallow clones.

nickurak · 2019-02-22T23:06:10Z

Note that these changes don't yet implement anything in the link-generator -- you'll need to add a "&depth=X" parameter to your URLs, or specify the NBGITPULLER_DEPTH environment variable in your jupyter environment for it to have any affect.

The goal is that, until we build some UI, it'll be especially hard for any of these changes to have any impact. We can verify that this functionality is what we're looking for before exposing a user-facing change, and then consider what user-facing changes and/or default changes we want.

betatim · 2019-02-24T08:16:33Z

This looks like a nice PR!

In https://github.com/jupyter/repo2docker we investigated making shallow clones as a way to speed up things. This was maybe 6months ago or so and we ended up giving up on it. IIRC the problem was either that we struggled to un-shallow the clone reliably when we needed that and/or that git clone --depth 5 ... followed by git checkout a-branch-that-hasnt-been-changed-in-last-five-commits would break in weird ways. This is the code we use now which sometimes makes a shallow clone and this is the related PR/discussion.

I am here to find out if the state of things has changed (-> we can update repo2docker) and share the experience of shallow clones.

nickurak · 2019-02-25T17:25:45Z

As far as I know, we don't have a use case for converting shallow clones to full clones here.

I'll have to rely on the existing project maintainers for advice on whether changing branches is important -- but that is something that's normally not allowed in a shallow clone (and indeed, is something that could defeat much of the purpose of a shallow clone -- as you'd need to fetch a bunch of content from other branches that might include the large files you're trying to avoid).

yuvipanda

Minor style comments, but otherwise this is awesome.

I haven't done commit by commit review since my Gerrit days, and this was a joy to do on this PR. Thank you! <3

@betatim I think in nbgitpuller's case we'll never really have to move from shallow to full clone, so I think it's ok.

yuvipanda · 2019-02-23T02:44:51Z

nbgitpuller/pull.py

@@ -64,7 +65,13 @@ def initialize_repo(self):
        """

        logging.info('Repo {} doesn\'t exist. Cloning...'.format(self.repo_dir))
-        yield from execute_cmd(['git', 'clone', '--branch', self.branch_name, self.git_url, self.repo_dir])
+        def clone_args():
+            yield from ['git', 'clone']


This is very cool! However, for clarity I'd prefer it if we just constructed a list directly instead.

yuvipanda · 2019-02-25T19:07:21Z

nbgitpuller/static/index.js

@@ -40,10 +41,14 @@ require([

    GitSync.prototype.start = function() {
        // Start git pulling handled by SyncHandler, declared in handlers.py
-        var syncUrl = this.baseUrl + 'git-pull/api?' + $.param({
+        syncUrlParams = {


there should be a 'var' here, since otherwise this variable is now global.

yuvipanda · 2019-02-25T19:09:33Z

This is an awesome PR, @nickurak. I'll be happy to merge once the minor comments are addressed. I'd also love to have you on as a maintainer. What do you think?

nickurak · 2019-02-25T20:45:34Z

@yuvipanda Last two commits should address those issues.

I'd like to squash those fixups in before this gets merged if you don't mind, or you can do that yourself -- but I'll let you look at the changes separately this way before I do so.

yuvipanda · 2019-02-25T21:36:27Z

@nickurak thanks! This looks good to me, and am happy to merge anytime you want. I personally leave the fixups in the commit history, but am happy to have it be in a state you feel good about.

If provided to the GitPuller, a 'depth' parameter will be passed along to git, resulting in a shallow clone. Shallow clones with a depth of 'N' only include history going back 'N' commits. This can be helpful in reducing the size of the cloned repository, and in the amount of time it takes to clone that repository, especially when large files were added and later removed/edited. As a special-case, a non-positive value for depth (eg, 0 or negative) can be specified to explicitly request all history. For more information, see: https://git-scm.com/docs/git-clone#git-clone---depthltdepthgt

This allows the default "depth" option to be configured via the NBGITPULLER_DEPTH environment variables, or standard traitlet behavior.

These classes had quite a bit in common, and factoring out their changes makes each a little more specific to the unique values they add.

git modifies its behavior between plain file-path remotes and URI-specified remotes. For example, shallow clones are ignored for plain file-path remotes, but enabled for URI-specified remotes. Using the full URI specification makes our tests behave more like real-world nbgitpuller use cases.

This makes invocation with pytest's "-s" option easier to read.

These tests verify that basic shallow clones, and environment-variable configured shallow clones behave as expected.

This was repeated several times throughout the test module, and with varying levels of detail and bugs (missing space between strings, missed stripping of trailing newline).

… defaults Since creating a long_remote implicitly creates git repositories with names that conflict with the defaults, it's easy for tests to create repositories that conflict with that file path. Using more arbitrary names reduces the likelihood of a collision.

nickurak · 2019-02-25T22:33:22Z

Thanks @yuvipanda. I have a whole spiel on git storytelling I can get into about the fixups -- but yes I'd rather have them in this state where I've used autosquash to correct the commits in place.

This last push is ready to go. It also adds 3 more commits -- some more refactoring in the tests, and another test (making sure a pull on top of a shallow clone works).

If there's no other issues, merge at your convenience (but please don't squash-merge ;) ).

yuvipanda · 2019-02-26T05:38:17Z

@nickurak awesome :) I've merged. Would love to hear your spiel on git storytelling though.

betatim · 2019-02-26T06:27:05Z

I came for the shallow clone advice/expertise and stayed for the git usage pattern!

nickurak · 2019-02-26T18:28:14Z

Well I'm gradually collecting material to make a talk on this but, here's the short(ish) version:

Git history, done carefully, is incredibly useful for future users of the code base. It's a permanent record of not only the how, but more importantly the why individual steps were taken. It's great for future authors (or tho original authors ourselves, since it's easy to forget what we did ;) ) trying to understand what's there, and what changes make sense in that context.

In contrast, the details on how and why we got to a particular written history of how and why we did something are (and I'm veering into personal experience and opinion here) less important. It's the meta-how-and-why. It's not the thing that's helpful performing the archeology of the codebase down the road.

Another angle on this is that the practice of making your git history look like you wished you'd done it (as opposed to how it actually did it) is helpful in retrospecting and learning. For example (and this isn't something i pushed on in this PR), consider this sequence of events:

Developer introduces functionality
Reviewer suggests adding test coverage
Developer adds test coverage
Reviewer suggests that TDD would have been helpful in providing coverage, and structuring the code around testability
Developer performs a rebase to move their tests up to the beginning of their history, and goes through the steps of letting those tests fail, and observing them switch from red to green as the functionality comes together.

Now that git history shows a compelling story for the archeologists trying to figure out how and why the code works the way it does, AND the developer has a better idea of how they might approach that work later. As a side-effect they've also gone through more self-review of their own code (maybe even before code review happened).

That's not to say the meta-how-and-why isn't useful -- it's just useful in a different way. It's potentially useful in learning the ropes of the lifecycle of a PR. It's also information which is at least partially captured anyways here in the github PR discussion.

I do wish that git had a better use flow for "here's a string of history that isn't the authoritative version (even if it's a bit of a lie), but still exists if you want to understand how it got here." Github is also especially bad at this -- eg, there's no way to see those fixup commits I made now, and if you're not careful the order of commits in the "commits" tab can easily become non-sensical. But that's getting into yet another rant ;)

(Credit, BTW, for various versions of some of these thoughts goes to past collaborators, especially Steve Bromling, who's been super helpful in devevloping the tools to execute and talk about these git usage patterns).

betatim · 2019-02-26T18:51:08Z

I like the idea of re-ordering commits, especially in the "write code, suggest to add tests, add tests, move tests" story. Is there a good UI tool that helps with that workflow? I find that despite using git a lot every day I've only recently become comfortable with using git rebase -i HEAD~5 to reorder commits (and still have a high probability of shooting myself in the foot while doing so) and hardly ever attempt to split one commit into smaller ones retroactively. I sometimes use git add -p, I very rarely meet anyone who knows it exists.

All this makes me think: I need a better GUI tool for this because I would love to use the style of working you described more often.

nickurak · 2019-02-26T21:19:48Z

I mostly just use git add -p and an alias for git rebase -i (*) and some other odd posixy shell glue as required, but I live and die by the command line most days, so might not be a great example. That said, I really enjoy it -- I really have fun rewriting history this way, and moreso as I get better at it.

If you're an emacs user, its magit mode can do some wonderfully powerful stuff with git history rewriting. I've seen it done, and while I usually edit code in emacs, I'm wary of going in too deep into the emacs lifestyle ;)

This might be a good place for somebody to come up with a dedicated tool for crafting git history. It might even be worth something more jupyter-centric, given the specialized types of work that happens in notebooks.

(*) Actually, git rebase --interactive --onto $(git merge-base -a HEAD @{upstream}) to make sure that I'm not inadvertently doing a real-change-of-base-commit while rewriting (**)

(**) Actually actually, git rebase --interactive --onto $(git merge-base -a HEAD @{upstream}) --exec 'git commit --amend --date=now --no-edit' while working on repos stored on GitHub, to work around isaacs/github#386

choldgraf · 2019-02-26T21:45:25Z

um this is a super interesting post-merge PR conversation :-)

This is one reason I like the rust-lang RFC page. If you dig into the PR conversations you can see the process by which decisions have been made (or not made)

yuvipanda · 2019-02-26T22:31:44Z

Nice! My first set of serious code review happened on Gerrit, which follows the same philosophy as @nickurak. In fact, it also implements the wishlist item - you have both a 'this is how my commit should have been from the start' as well as a 'this is how this commit got to being here'.

You can see this in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/454346. It's one single commit, that has been amended about 13 times so far. Reviewers leave comments, and the author (or someone else!) amends the commit to incorporate it. Gerrit keeps a record of each amend, so you can see how the individual commit has changed over time. When merged, it is merged as one commit. You can have multiple commits referencing each other, with easy web based rebase. So when stuff gets merged, you get the clean easy to understand archeological history, and it's wonderful.

The downside is that the UX is very hard unless you sortof understand a good chunk of git. There's a 'git review' UI thing that makes some of this easier, but it still breaks quite a bit. GitHub's review flow is much easier to teach to people, so that is the tradeoff here.

Either way, different philosophies! GitHub is picking up stuff from Gerrit - for example, the 'allow maintainers to push changes' is amazing and a core part of how people use Gerrit. If someone complains about a nit, you can ask them to go fix it themselves. Often I'll just fix it instead of complaining about it :D You can do that on GitHub now...

I'm not holding my breath for the fundamental change though - gerrit's code review is per-commit, while here it's per-branch. I think per-branch is much easier for people to understand - 'commits are immutable, branches can be updated'. This isn't true at all in Git, but hey...

This is a great conversation! <3

yuvipanda · 2019-02-27T07:44:15Z

@nickurak I think I just realized that the traitlet isn't actually set when the config is set in jupyter_notebook_config.py. You need to pass the 'NotebookApp' object as parent to the Configurable constructor - like https://github.com/yuvipanda/nbresuse/blob/master/nbresuse/__init__.py#L94.

This lets you put stuff like the following in a jupyter_notebook_config.py:

c.GitPuller.clone_depth = 1

And have that be read by the handlers. This is very useful, and used throughout a lot of places.

Do you think you can implement that too?

nickurak · 2019-02-27T21:14:52Z

@yuvipanda I'm not sure what needs to happen there. (I haven't used traitlets before this PR).

I got as far as making these changes: https://gist.github.com/nickurak/4b8e83b483833ed77b34b81d3f76c328

But adding c.GitPuller.clone_depth=9 to ~/.jupyter/jupyter_notebook_config.py didn't seem to have the desired effect.

yuvipanda · 2019-02-28T06:27:43Z

@nickurak can you check if 'parent' is being passed through to the super() constructor?

nickurak · 2019-02-28T18:18:28Z

@yuvipanda yep, confirmed via pdb -- it's in kwargs in the parent constructor:

  /home/atrus/.virtualenvs/callysto/lib/python3.7/site-packages/nbgitpuller-0.6.1-py3.7.egg/nbgitpuller/handlers.py(66)get()
-> gp = GitPuller(repo, branch, repo_dir, depth=depth, parent=self.settings['nbapp'])
  /home/atrus/.virtualenvs/callysto/lib/python3.7/site-packages/nbgitpuller-0.6.1-py3.7.egg/nbgitpuller/pull.py(72)__init__()
-> super(Configurable, self).__init__(**newargs)
> /home/atrus/.virtualenvs/callysto/lib/python3.7/site-packages/traitlets/traitlets.py(988)__init__()
-> def __init__(self, *args, **kwargs):
(Pdb) kwargs['parent']
<notebook.notebookapp.NotebookApp object at 0x7f3967c380b8>

nickurak · 2019-03-06T22:43:38Z

Fixed. Tracking the issue in #74, and resolved in #75.

(For those who were paying attention to my git history and TDD rants, #75 follows that pattern, introducing failing tests in one commit which are caused to pass by later commits).

Jeremy Nickurak added 3 commits February 22, 2019 16:00

Be explicit in tests about configuring repo-local properties

0fd118f

While the default is to act on the repository local properties, being explicit here makes it obvious that these properties are only being set for test purposes.

yuvipanda requested changes Feb 25, 2019

View reviewed changes

yuvipanda approved these changes Feb 25, 2019

View reviewed changes

Jeremy Nickurak added 11 commits February 25, 2019 15:28

Allow nbgitpuller handler to take and pass along a depth arg

34dc9e6

Pass along 'depth' parameter in status.html template

e1b25cf

Switch to traitlets for "depth" configuration option

d0ae4b3

This allows the default "depth" option to be configured via the NBGITPULLER_DEPTH environment variables, or standard traitlet behavior.

tests: refactor util classes

6ba6399

These classes had quite a bit in common, and factoring out their changes makes each a little more specific to the unique values they add.

tests: cleanup newlines around Puller output

4a6badb

This makes invocation with pytest's "-s" option easier to read.

tests: test depth/shallow-clone functionality

397bc7a

These tests verify that basic shallow clones, and environment-variable configured shallow clones behave as expected.

tests: factor out the "pull everything" pattern

ed19f2b

This was repeated several times throughout the test module, and with varying levels of detail and bugs (missing space between strings, missed stripping of trailing newline).

tests: test pulling on top of shallow-clone

e7cd350

nickurak force-pushed the shallow-clones branch from 1253d0c to e7cd350 Compare February 25, 2019 22:30

yuvipanda merged commit 06e62f6 into jupyterhub:master Feb 26, 2019

nickurak mentioned this pull request Mar 6, 2019

Pass nbapp along to GitPuller so it can read from our configuration #75

Merged

ianabc mentioned this pull request Mar 14, 2019

Nbgitpuller and updates callysto/docker-stacks#26

Closed

jlperla mentioned this pull request Aug 9, 2019

Test shallow clone for the ngitpuller links QuantEcon/lecture-source-jl#594

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement depth/shallow-clone support #67

Implement depth/shallow-clone support #67

nickurak commented Feb 22, 2019

nickurak commented Feb 22, 2019

betatim commented Feb 24, 2019

nickurak commented Feb 25, 2019

yuvipanda left a comment

yuvipanda Feb 23, 2019

yuvipanda Feb 25, 2019

yuvipanda commented Feb 25, 2019

nickurak commented Feb 25, 2019

yuvipanda commented Feb 25, 2019

nickurak commented Feb 25, 2019

yuvipanda commented Feb 26, 2019

betatim commented Feb 26, 2019

nickurak commented Feb 26, 2019

betatim commented Feb 26, 2019

nickurak commented Feb 26, 2019

choldgraf commented Feb 26, 2019

yuvipanda commented Feb 26, 2019

yuvipanda commented Feb 27, 2019

nickurak commented Feb 27, 2019

yuvipanda commented Feb 28, 2019

nickurak commented Feb 28, 2019

nickurak commented Mar 6, 2019

Implement depth/shallow-clone support #67

Implement depth/shallow-clone support #67

Conversation

nickurak commented Feb 22, 2019

nickurak commented Feb 22, 2019

betatim commented Feb 24, 2019

nickurak commented Feb 25, 2019

yuvipanda left a comment

Choose a reason for hiding this comment

yuvipanda Feb 23, 2019

Choose a reason for hiding this comment

yuvipanda Feb 25, 2019

Choose a reason for hiding this comment

yuvipanda commented Feb 25, 2019

nickurak commented Feb 25, 2019

yuvipanda commented Feb 25, 2019

nickurak commented Feb 25, 2019

yuvipanda commented Feb 26, 2019

betatim commented Feb 26, 2019

nickurak commented Feb 26, 2019

betatim commented Feb 26, 2019

nickurak commented Feb 26, 2019

choldgraf commented Feb 26, 2019

yuvipanda commented Feb 26, 2019

yuvipanda commented Feb 27, 2019

nickurak commented Feb 27, 2019

yuvipanda commented Feb 28, 2019

nickurak commented Feb 28, 2019

nickurak commented Mar 6, 2019