Implement depth/shallow-clone support #67
Conversation
While the default is to act on the repository-local properties, being explicit here makes it obvious that these properties are only being set for test purposes.
"assert False" doesn't provide pytest much information to give a useful error. "assert True" is a no-op equivalent to "pass". Using a canary value to check if the exception was hit allows the error to be more meaningful in case of a failing test.
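The canary pattern described can be sketched like this (the `parse_depth` helper and test are illustrative, not code from the PR):

```python
def parse_depth(value):
    # Hypothetical helper (not from the PR): parse a depth setting,
    # rejecting non-integer input with a ValueError.
    return int(value)

def test_bad_depth_is_rejected():
    hit_exception = False  # the canary value
    try:
        parse_depth('not-a-number')
    except ValueError:
        hit_exception = True
    # On failure, pytest reports the message below, rather than a bare
    # "assert False" with no context.
    assert hit_exception, 'expected parse_depth to reject a non-integer'

test_bad_depth_is_rejected()
```

On recent pytest versions, `with pytest.raises(ValueError):` expresses the same intent even more directly.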
One can specify the branch to track in the same 'git' command as the clone. This can make the operation a little quicker if the desired branch is significantly different from the default (typically 'master') branch. It will also facilitate the desired branch being checked out in future commits that implement shallow clones.
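Concretely, the combined invocation looks something like this (a sketch; the argument names mirror those used in `pull.py`):

```python
def clone_command(git_url, repo_dir, branch_name):
    # One git invocation that clones and checks out the desired branch,
    # instead of cloning the default branch and switching afterwards.
    return ['git', 'clone', '--branch', branch_name, git_url, repo_dir]

cmd = clone_command('https://example.com/repo.git', 'repo', 'gh-pages')
```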
Note that these changes don't yet implement anything in the link-generator -- you'll need to add a "&depth=X" parameter to your URLs, or specify the NBGITPULLER_DEPTH environment variable in your jupyter environment, for them to have any effect. The goal is that, until we build some UI, these changes will be hard to trigger and have essentially no user-visible impact. We can verify that this functionality is what we're looking for before exposing a user-facing change, and then consider what user-facing and/or default changes we want.
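A sketch of how that lookup might work (the precedence shown here, query parameter over environment variable, is an assumption; only the `depth` parameter and `NBGITPULLER_DEPTH` names come from the PR):

```python
import os
from urllib.parse import urlparse, parse_qs

def effective_depth(url, environ=os.environ):
    # Prefer an explicit &depth=X query parameter, fall back to the
    # NBGITPULLER_DEPTH environment variable, else None (full clone).
    qs = parse_qs(urlparse(url).query)
    if 'depth' in qs:
        return int(qs['depth'][0])
    if 'NBGITPULLER_DEPTH' in environ:
        return int(environ['NBGITPULLER_DEPTH'])
    return None
```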
This looks like a nice PR! In https://github.com/jupyter/repo2docker we investigated making shallow clones as a way to speed things up. This was maybe 6 months ago or so, and we ended up giving up on it. IIRC the problem was that we struggled to un-shallow the clone reliably when we needed to. I am here to find out if the state of things has changed (-> we can update repo2docker) and to share the experience of shallow clones.
As far as I know, we don't have a use case for converting shallow clones to full clones here. I'll have to rely on the existing project maintainers for advice on whether changing branches is important -- but that is something that's normally not allowed in a shallow clone (and indeed, is something that could defeat much of the purpose of a shallow clone -- as you'd need to fetch a bunch of content from other branches that might include the large files you're trying to avoid).
Minor style comments, but otherwise this is awesome.
I haven't done commit by commit review since my Gerrit days, and this was a joy to do on this PR. Thank you! <3
@betatim I think in nbgitpuller's case we'll never really have to move from shallow to full clone, so I think it's ok.
nbgitpuller/pull.py
Outdated
```diff
@@ -64,7 +65,13 @@ def initialize_repo(self):
         """
         logging.info('Repo {} doesn\'t exist. Cloning...'.format(self.repo_dir))
-        yield from execute_cmd(['git', 'clone', '--branch', self.branch_name, self.git_url, self.repo_dir])
+        def clone_args():
+            yield from ['git', 'clone']
```
This is very cool! However, for clarity I'd prefer it if we just constructed a list directly instead.
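The two styles compared, with illustrative contents rather than the PR's exact arguments:

```python
def clone_args_generator():
    # Generator style, as originally written in the PR:
    yield from ['git', 'clone']
    yield from ['--branch', 'main']

def clone_args_list():
    # List style, as the reviewer suggests: build the argv directly.
    args = ['git', 'clone']
    args += ['--branch', 'main']
    return args

# Both produce the same command line.
assert list(clone_args_generator()) == clone_args_list()
```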
nbgitpuller/static/index.js
Outdated
```diff
@@ -40,10 +41,14 @@ require([
 GitSync.prototype.start = function() {
     // Start git pulling handled by SyncHandler, declared in handlers.py
-    var syncUrl = this.baseUrl + 'git-pull/api?' + $.param({
+    syncUrlParams = {
```
there should be a 'var' here, since otherwise this variable is now global.
This is an awesome PR, @nickurak. I'll be happy to merge once the minor comments are addressed. I'd also love to have you on as a maintainer. What do you think?
@yuvipanda Last two commits should address those issues. I'd like to squash those fixups in before this gets merged if you don't mind, or you can do that yourself -- but I'll let you look at the changes separately this way before I do so.
@nickurak thanks! This looks good to me, and am happy to merge anytime you want. I personally leave the fixups in the commit history, but am happy to have it be in a state you feel good about.
If provided to the GitPuller, a 'depth' parameter will be passed along to git, resulting in a shallow clone. Shallow clones with a depth of 'N' only include history going back 'N' commits. This can reduce the size of the cloned repository and the time it takes to clone it, especially when large files were added and later removed or edited. As a special case, a non-positive value for depth (e.g., 0 or negative) can be specified to explicitly request all history. For more information, see: https://git-scm.com/docs/git-clone#git-clone---depthltdepthgt
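The flag construction this implies can be sketched as (a hypothetical helper, not the PR's exact code):

```python
def depth_args(depth):
    # Positive depth -> shallow clone, keeping the last `depth` commits.
    # Zero or negative (the special case above) -> no flag, full history.
    if depth is not None and depth > 0:
        return ['--depth', str(depth)]
    return []
```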
This allows the default "depth" option to be configured via the NBGITPULLER_DEPTH environment variable, or via standard traitlet behavior.
These classes had quite a bit in common, and factoring out their changes makes each a little more specific to the unique values they add.
git modifies its behavior between plain file-path remotes and URI-specified remotes. For example, shallow clones are ignored for plain file-path remotes, but enabled for URI-specified remotes. Using the full URI specification makes our tests behave more like real-world nbgitpuller use cases.
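One way tests can force the URI form is to convert the path with `pathlib` (a sketch; the helper name is invented):

```python
from pathlib import Path

def as_clone_url(path):
    # Convert a plain filesystem path into a file:// URL so git treats
    # it like a real remote (e.g. honouring --depth, which it silently
    # ignores for plain-path local clones).
    return Path(path).resolve().as_uri()
```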
This makes invocation with pytest's "-s" option easier to read.
These tests verify that basic shallow clones and environment-variable-configured shallow clones behave as expected.
This was repeated several times throughout the test module, with varying levels of detail and bugs (a missing space between strings, a missed stripping of a trailing newline).
… defaults Since creating a long_remote implicitly creates git repositories with names that conflict with the defaults, it's easy for tests to create repositories that conflict with that file path. Using more arbitrary names reduces the likelihood of a collision.
(force-pushed from 1253d0c to e7cd350)
Thanks @yuvipanda. I have a whole spiel on git storytelling I can get into about the fixups -- but yes, I'd rather have them in this state, where I've used autosquash to correct the commits in place. This last push is ready to go. It also adds 3 more commits -- some more refactoring in the tests, and another test (making sure a pull on top of a shallow clone works). If there are no other issues, merge at your convenience (but please don't squash-merge ;) ).
@nickurak awesome :) I've merged. Would love to hear your spiel on git storytelling though.
I came for the shallow clone advice/expertise and stayed for the git usage pattern!
Well, I'm gradually collecting material to make a talk on this, but here's the short(ish) version: git history, done carefully, is incredibly useful for future users of the code base. It's a permanent record of not only the how, but more importantly the why, of the individual steps that were taken. It's great for future authors (or the original authors ourselves, since it's easy to forget what we did ;) ) trying to understand what's there, and what changes make sense in that context.

In contrast, the record of how and why we arrived at that written history (and I'm veering into personal experience and opinion here) is less important. It's the meta-how-and-why. It's not the thing that's helpful when performing the archeology of the codebase down the road.

Another angle on this is that the practice of making your git history look like you wish you'd done it (as opposed to how you actually did it) is helpful for retrospecting and learning. For example (and this isn't something I pushed on in this PR), consider this sequence of events: a developer writes some code, a reviewer suggests adding tests, the developer adds the tests in a follow-up commit, and then the developer rewrites the history so the tests sit alongside the code they exercise.
Now that git history shows a compelling story for the archeologists trying to figure out how and why the code works the way it does, AND the developer has a better idea of how they might approach that work later. As a side effect, they've also done more self-review of their own code (maybe even before code review happened).

That's not to say the meta-how-and-why isn't useful -- it's just useful in a different way. It's potentially useful in learning the ropes of the lifecycle of a PR. It's also information which is at least partially captured anyway here in the GitHub PR discussion.

I do wish git had a better flow for "here's a string of history that isn't the authoritative version (even if it's a bit of a lie), but still exists if you want to understand how it got here." GitHub is also especially bad at this -- e.g., there's no way to see those fixup commits I made now, and if you're not careful the order of commits in the "commits" tab can easily become nonsensical. But that's getting into yet another rant ;)

(Credit, BTW, for various versions of some of these thoughts goes to past collaborators, especially Steve Bromling, who's been super helpful in developing the tools to execute and talk about these git usage patterns.)
I like the idea of re-ordering commits, especially in the "write code, suggest to add tests, add tests, move tests" story. Is there a good UI tool that helps with that workflow? I find that, despite using git a lot every day, I've only recently become comfortable with doing that kind of rewriting. All this makes me think: I need a better GUI tool for this, because I would love to use the style of working you described more often.
I mostly just use the command line for this. If you're an emacs user, its git tooling is worth a look too. This might be a good place for somebody to come up with a dedicated tool for crafting git history. It might even be worth building something more jupyter-centric, given the specialized types of work that happen in notebooks.
Um, this is a super interesting post-merge PR conversation :-) This is one reason I like the rust-lang RFC pages: if you dig into the PR conversations, you can see the process by which decisions were made (or not made).
Nice! My first set of serious code review happened on Gerrit, which follows the same philosophy as @nickurak. In fact, it also implements the wishlist item: you have both a 'this is how my commit should have been from the start' as well as a 'this is how this commit got to being here'. You can see this in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/454346. It's one single commit that has been amended about 13 times so far. Reviewers leave comments, and the author (or someone else!) amends the commit to incorporate them. Gerrit keeps a record of each amend, so you can see how the individual commit has changed over time. When merged, it is merged as one commit. You can have multiple commits referencing each other, with easy web-based rebase. So when stuff gets merged, you get the clean, easy-to-understand archeological history, and it's wonderful.

The downside is that the UX is very hard unless you sort of understand a good chunk of git. There's a 'git review' UI thing that makes some of this easier, but it still breaks quite a bit. GitHub's review flow is much easier to teach to people, so that is the tradeoff here. Either way, different philosophies!

GitHub is picking up stuff from Gerrit - for example, the 'allow maintainers to push changes' feature is amazing and a core part of how people use Gerrit. If someone complains about a nit, you can ask them to go fix it themselves. Often I'll just fix it instead of complaining about it :D You can do that on GitHub now. I'm not holding my breath for the fundamental change though - Gerrit's code review is per-commit, while here it's per-branch. I think per-branch is much easier for people to understand - 'commits are immutable, branches can be updated'. This isn't true at all in git, but hey...

This is a great conversation! <3
@nickurak I think I just realized that the traitlet isn't actually set when the config is set in jupyter_notebook_config.py. You need to pass the 'NotebookApp' object as parent to the Configurable constructor - like https://github.com/yuvipanda/nbresuse/blob/master/nbresuse/__init__.py#L94. This lets you put stuff like the following in a jupyter_notebook_config.py:
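The configuration would be along these lines (the `c.GitPuller.depth` key is an assumption based on the PR's `depth` traitlet, not a confirmed name):

```python
# jupyter_notebook_config.py (illustrative only; the exact class and
# trait names depend on how nbgitpuller registers its Configurable)
c.GitPuller.depth = 1  # make clones shallow by default, one commit deep
```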
And have that be read by the handlers. This is very useful, and used in a lot of places. Do you think you can implement that too?
@yuvipanda I'm not sure what needs to happen there. (I haven't used traitlets before this PR). I got as far as making these changes: https://gist.github.com/nickurak/4b8e83b483833ed77b34b81d3f76c328 But adding
@nickurak can you check if 'parent' is being passed through to the super() constructor?
@yuvipanda yep, confirmed via pdb -- it's in kwargs in the parent constructor:
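The pass-through under discussion, sketched with plain classes standing in for the real traitlets `Configurable`:

```python
# Simplified stand-ins (not the real traitlets classes): `parent` must
# travel via **kwargs into the base constructor so the child
# Configurable can inherit config from the NotebookApp.
class FakeConfigurable:
    def __init__(self, **kwargs):
        self.parent = kwargs.get('parent')

class FakeGitPuller(FakeConfigurable):
    def __init__(self, git_url, **kwargs):
        self.git_url = git_url
        super().__init__(**kwargs)  # `parent` survives to the base class

app = object()  # stand-in for the NotebookApp instance
puller = FakeGitPuller('https://example.com/repo.git', parent=app)
assert puller.parent is app
```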
Fixes #65
Note that this PR is intended to be read/reviewed commit-by-commit, as opposed to via the "Files changed" tab. (That is to say, each commit does one small piece of work that can be reviewed and understood separately, and is thus probably easier to read that way.)
This is my first PR to this repository, please don't hold back on feedback.