Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GSoC-2020] Git Repository Size Estimation #931

Merged
merged 60 commits into from
Aug 25, 2020

Conversation

rishabhBudhouliya
Copy link
Contributor

@rishabhBudhouliya rishabhBudhouliya commented Jul 20, 2020

GitToolChooser

Description: This class aims to add the functionality of recommending a git implementation on the basis of the size of a repository which has a strong correlation to the performance of git fetch (from Performance Benchmarks).

We have two heuristics to calculate the size of a git repository:

  • Using cached .git dir to estimate the size of a repository
  • Providing an extension point which upon implementation can use APIs exposed by git service providers like Github, GitLab etc to fetch size of repo.

How are we doing this?

  • The class can instantiate using a constructor which checks if cache exists and calculates the size if it does. Else, it checks if we have an implementation(s) for the extension point provided by the class.
  • Currently, the GitToolChooser is planning to be supported by the Github branch source plugin, Gitlab branch source plugin and the Gitea branch source plugin.

Assumptions:

  • We believe that the implementers of the extension point have the user credentials in order to query size of a repository on the basis of its URL.
  • We are instantiating GitToolChooser using an instance of AbstractGitSCMSourcejust as the GitSCMTelescope does.
  • This scope of usage of this API is not limited to the AbstractGitSCMSource, it can be used in other places provided we have the repository url.

Architecture Diagram
Screenshot 2020-07-29 at 1 47 41 AM

Checklist

  • I have read the CONTRIBUTING doc
  • I have referenced the Jira issue related to my changes in one or more commit messages
  • I have added tests that verify my changes
  • Unit tests pass locally with my changes
  • I have added documentation as necessary
  • No Javadoc warnings were introduced with my changes
  • No spotbugs warnings were introduced with my changes
  • Documentation in README has been updated as necessary
  • Online help has been added and reviewed for any new or modified fields
  • I have interactively tested my changes
  • Any dependent changes have been merged and published in upstream modules (like git-client-plugin)

Types of changes

  • Dependency or infrastructure update
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Further comments

To avoid issues with the existing functionality of the plugin, we have used the exact same logic to resolve git tool before making a recommendation as is provided by GitUtils.resolveGitTool.

@rishabhBudhouliya
Copy link
Contributor Author

cc @omkar-dsd

Copy link
Member

@fcojfernandez fcojfernandez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend you always to add tests. It doesn't only help you to test and check your code, but they also help you to polish your code since you're creating a "real" scenario

And it will help us to see how your code works during the review as well!

@rishabhBudhouliya
Copy link
Contributor Author

I'd recommend you always to add tests. It doesn't only help you to test and check your code, but they also help you to polish your code since you're creating a "real" scenario

And it will help us to see how your code works during the review as well!

@fcojfernandez I agree and I am in the process of writing some, will commit asap.

@MarkEWaite MarkEWaite added the enhancement Improvement or new feature label Jul 20, 2020
@omkar-dsd
Copy link

@rishabhBudhouliya Are you including it in this PR itself?

}
}

private boolean getSizeFromAPI(String repoUrl) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getSizeFromAPI does not return the size as one would expect from its name.

Naming is definitely one of the toughest part of development :P . imo, it should be something like: setSizeFromCache and setSizeFromAPI, also it will maintain consistency among these two methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out, I have changed it.

private long sizeOfRepo = 0L;
private String implementation;
private String gitTool;
public static final int SIZE_TO_SWITCH = 5000;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Define the scale (Bytes, KB, MB, etc.) in comment, else it can be confusing 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, while writing the tests it has deemed to me that it might be cleaner to use a non-primitive data type instead of a primitive data type.

The API might return varying sizes, we need to take that case into account as well.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you could perhaps have an abstract method which defaults to KB to use for any necessary conversions. Perhaps a different PR for something like that if we have time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinharringa Great suggestion, I have a seperate PR for this specification.

a "NONE" string
Also, size which is decided to switch implementation has been increased
to 50K because 5 MiB is experimentally an approximation where behavior
of jgit starts to change.
Note: The current test cases target only the scenario with heuristic 1,
i.e the cache determination option.
@rishabhBudhouliya
Copy link
Contributor Author

@fcojfernandez @omkar-dsd I have added the test cases. Currently, they cover scenarios where only cached .git estimation is used.

For API extension point testing, I've seen that we have a @TestExtension annotation which allows to implement the extension for a test. The concern with this approach will be that we might need to use external dependencies the git plugin hasn't included like OkHTTP.

@omkar-dsd
Copy link

Woops, looks like it is failing for the Windows environment. :| @rishabhBudhouliya can you check? Is it the same case that was noticed earlier last month?

@fcojfernandez
Copy link
Member

For API extension point testing, I've seen that we have a @TestExtension annotation which allows to implement the extension for a test. The concern with this approach will be that we might need to use external dependencies the git plugin hasn't included like OkHTTP.

You can mock the response using mockito or wiremock depending on your needs.

String tool = repoSizeEstimator.getGitTool();

// The class should make no recommendation since it can't find a .git cached directory
assertThat(tool.equals("NONE"), is(true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very simple case and hence quite readable, but it's a good idea to start using the methods that allow you to have more readable tests:

  • You could use assertTrue(tool.equas("NONE")
  • You could use
Suggested change
assertThat(tool.equals("NONE"), is(true));
assertThat(tool, is("NONE"));

which is even much clearer: you expect that tool is "NONE". You can even read what you expect

/*
Since the size of repository is 21.785 KiBs, the estimator should suggest "jgit" as an implementation
*/
assertThat(repoSizeEstimator.getGitTool(), containsString("git"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an example of what I meant above: Super-readable. check that repoSizeExtimagor contains a String "git"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand, you and @MarkEWaite have made this clear often in PR submissions, please bear with me as I slowly grasp its importance.

* A class which allows {@link AbstractGitSCMSource} to estimate the size of a repository from a distance
* without requiring a local checkout.
*/
public class GitRepoSizeEstimator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is still a bit confusing: GitRepoSizeEstimator but it only offers the Git Tool to use, nothing about Size

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean to change the name, but offer the size data as well

Copy link
Contributor Author

@rishabhBudhouliya rishabhBudhouliya Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My assumption was that the class should only provide the right recommendation for git implementation. To create a more coherent name, I suggest GitToolAdvisor or maybe something a bit less dramatic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should also return size? While writing the class I couldn't imagine the benefit of providing the size of the repository.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's weird and confusing to have a GitRepoSizeEstimator class that is estimating something that is never returned and instead it's returning the tool to use. Something we can change in the last minute

Copy link
Contributor Author

@rishabhBudhouliya rishabhBudhouliya Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, GitToolSuggestion? A GitToolSuggestion.getGitTool() seems like a good way to tell that the developer that we wish to suggest a better approach in terms of git implementation.

Whatever everyone agrees upon is fine by me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really bad at naming methods and classes, so I will let others suggest something. for me it's fine if the class refers an actual behaviour

Comment on lines 143 to 147
Jenkins jenkins = Jenkins.getInstanceOrNull();
if (jenkins == null) {
return null;
}
return jenkins.getExtensionList(RepositorySizeAPI.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extension should not be available if Jenkins instance is null, so the get() method will do the job for you.

Suggested change
Jenkins jenkins = Jenkins.getInstanceOrNull();
if (jenkins == null) {
return null;
}
return jenkins.getExtensionList(RepositorySizeAPI.class);
return Jenkins.get().getExtensionList(RepositorySizeAPI.class);

Comment on lines 151 to 159
private boolean setSizeFromAPI(String repoUrl) {
for (RepositorySizeAPI r: Objects.requireNonNull(RepositorySizeAPI.all())) {
if (r != null) {
sizeOfRepo = r.getSizeOfRepository(repoUrl);
return true;
}
}
return false;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are you trying to get here? r is never null and all will return a non-null object. it might be empty, but not null.
then, this is always returning the first extension in the list, whatever the size of this and whatever the extensions in Jenkins.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to give you more context: Any class extending RepositorySizeApi and with the @Extension annotation, will be loaded when Jenkins is launched (or when the plugin is installed) and it will be available.
Jenkins.get().getExtensionList(RepositorySizeAPI.class) will look for any class that matches that criteria and returning them in a list.

Comment on lines 165 to 167
public String getGitTool() {
return gitTool;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it should return the GitTool object. @MarkEWaite what do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for returning a String is this portion of code in AbstractGitSCMSource:

GitTool tool = resolveGitTool(context.gitTool(), listener);
            if (tool != null) {
                git.using(tool.getGitExe());
            }
            GitClient client = git.getClient();

We create a git client using the git.using(tool.getGitExe()). If we instantiate a GitRepoSizeEstimator just before it, it can provided a String recommendedGitExe to git.using() to create the client with the optimal implementation.

@rishabhBudhouliya
Copy link
Contributor Author

@fcojfernandez The latest commit 0fbdf85 has added the filter which was done as you have suggested in previous comments.

The ExtensionPoint contract now has a new method public abstract boolean acceptsRemote(String remote) which will put the decision of implementing the correct remote on the implementers instead of on the git plugin.

The ExtensionList expects RepositorySizeAPI.class to be of size 1 now.

.filter(r -> r.acceptsRemote(repoUrl))
.collect(Collectors.toList());

if (acceptedRepository.size() == 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't know how the implementer plugins will accept the remote/url. If there are more than one that match the url, what should we do? Take the first? return a false? My suggestion (and it's only a suggestion and I'm open to discuss:

Suggested change
if (acceptedRepository.size() == 1) {
if (acceptedRepository.size() > 0) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid point. Currently the code chooses to avoid the whole situation by returning false. What if, instead of asking for a size of type long, we ask for a non-primitive custom class which encapsulates details like

  • size type (in KiBs, in MiBs),
  • from which provider (for ex: Github or GitLab)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have coded something like this in a different branch, I can raise a separate PR for you and others to review.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you will still have the same question to answer: what will you return if more than one implementer extensions are marked as valid? The first one or null? Or do you will try one by one until one retrieves a valid object? those are the design question you have to try to answer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After searching for an answer for sometime, I've come up with certain ideas:

  • Accept a singleton instance: Assume that it is not possible to have two extension instances of the same type, i.e, if the URL is from Github provider, assume that only Github Branch Source Plugin is going to implement it. It doesn't make sense to me that any other plugin would want to implement it. If we find out that more than one does, we reject all instances.

Advantage: The extension point contract is simple at this moment, from git plugin's point of view, we can't assume to know what classes will implement it, so we ask them a simple question: Do you want to implement it? Yes or no.
If we add another rule on the contract, a method which returns the type of provider implementing the extension point, it adds a little bit of complexity.

Disadvantage: We miss out on the correct instance which provides the size for our estimation.

  • Add another method on the contract: Ask for the provider's name, say a method like public abstract String returnProviderName() which returns strings like "Github" or "Gitlab" which we can then use to filter out more than one instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to answer bit by bit, but starting from the end:

Add another method on the contract: Ask for the provider's name, say a method like public abstract String returnProviderName() which returns strings like "Github" or "Gitlab" which we can then use to filter out more than one instance.

I don't think that's a good idea because you would need a mechanism to distinguish the "provider" in the git-plugin itself and as Mark stated very well, it's better to leave that responsibility to the implementer. See what I mean:

  • Let's suppose we add the distinguishWhenUnclear method.
  • If implementer plugins have a really good accept method, so your URL is only accepted by 1 plugin, then your distinguish method is not actually needed.
  • If implementer plugins have vague accept method, so more than one is accepting the URL, you have to distinguish between them. Then why do you need them to implement the accept method?

If we see the provider name is useful we can add it, but not for filtering purposes.

Accept a singleton instance: Assume that it is not possible to have two extension instances of the same type, i.e, if the URL is from Github provider, assume that only Github Branch Source Plugin is going to implement it. It doesn't make sense to me that any other plugin would want to implement it. If we find out that more than one does, we reject all instances.

That's a valid solution of course. We can return a singleton object or a list filter in the Estimator class. Both solutions are equally valid and will lead to the same end. What I wanted to do was to make you think is about what behaviour we want when more than one Extension objects are accepting and why. You've given me the answer with:

If we find out that more than one does, we reject all instances.

for the what, and

Assume that it is not possible to have two extension instances of the same type, i.e, if the URL is from Github provider, assume that only Github Branch Source Plugin is going to implement it. It doesn't make sense to me that any other plugin would want to implement it

for the why. Now my opinion/suggestion
I agree with if the URL is from Github provider, assume that only Github Branch Source Plugin is going to implement it. It doesn't make sense to me that any other plugin would want to implement it, but just an error in a pattern can make your implementation return a false positive. We can assume that an Extension that has returned a false positive will return an error when it tries to get the size using an API that is not valid for it, so if we return the full list of Extensions, it's enough if we accept the first one that is able to give us a number and not an error. This way we can distinguish between implementations and the probability of getting a false number is actually low. In other words: it's a less aggressive approach
But that's my opinion/suggestion and it does not mean that it makes sense or it's the best approach. I'd like your thoughts about that and also I'd like other mentors's opinions here, @MarkEWaite @justinharringa @omkar-dsd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's a good idea because you would need a mechanism to distinguish the "provider" in the git-plugin itself and as Mark stated very well, it's better to leave that responsibility to the implementer. See what I mean:

Understood, it might only increase complexity within the git plugin which is not be necessary.

@fcojfernandez That seems like a neat approach to me, assume that if there is a false positive, there is a very low chance that numerical size returned by it would be non-zero.
If I've understood you correctly, if we have multiple implementations in the extension list, we check the size and only accept the non-zero size instance.

This leads me to ask another question to you, what if some plugin implements our method of getSizeFromRepository() and does so incorrectly? Do we expect an exception thrown from the that erroneous method? I wouldn't assume that because our method definition doesn't throw any exception.
Should we expect a size = 0 if things go wrong from the implementers side?

Copy link
Member

@fcojfernandez fcojfernandez Jul 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A repo might be empty, so 0 might or not be a wrong value per se, but I don't think it's a use case we should care (actually unlikely).
Apart from that, adding a safe check just in case any exception that happens is not properly handled by the implementer plugin is just free. It's possible that the implementer is not checking the final endpoint, for example. IMO, better safe than sorry, and if the Extension throws an exception or return 0, then you can discarded the value in both situation.

regarding my comment

a safe check just in case any exception that happens is not properly handled by the implementer plugin

When you delegate the responsibility, it's always a good idea to have a check. Keep that always in mind because even when you want the exception to be thrown, you can catch it, process it and re-throw it

* A class which allows {@link AbstractGitSCMSource} to estimate the size of a repository from a distance
* without requiring a local checkout.
*/
public class GitRepoSizeEstimator {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's weird and confusing to have a GitRepoSizeEstimator class that is estimating something that is never returned and instead it's returning the tool to use. Something we can change in the last minute

Comment on lines 162 to 164
public String getGitTool() {
return gitTool;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recovering the comment missing after the last commit:
I was wondering if we just return the GitTool. What do you think @MarkEWaite ?

Comment on lines 112 to 115
List<RepositorySizeAPI> acceptedRepository = Objects.requireNonNull(RepositorySizeAPI.all())
.stream()
.filter(r -> r.acceptsRemote(repoUrl))
.collect(Collectors.toList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, this filter might be done in a method of the Extension, so here we have something like

Suggested change
List<RepositorySizeAPI> acceptedRepository = Objects.requireNonNull(RepositorySizeAPI.all())
.stream()
.filter(r -> r.acceptsRemote(repoUrl))
.collect(Collectors.toList());
List<RepositorySizeAPI> acceptedRepository = RepositorySizeAPI.getValids(repoUrl);

.filter(r -> r.acceptsRemote(repoUrl))
.collect(Collectors.toList());

if (acceptedRepository.size() == 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot know how the implementer plugins will accept the URL, so we might have more than one. What should we do? Pick up the first one? Consider we cannot know what to do and return false? My suggestion (just a suggestion, I'm happy to open discussion) would be to pick the first one up

Suggested change
if (acceptedRepository.size() == 1) {
if (acceptedRepository.size() > 0) {

@rishabhBudhouliya
Copy link
Contributor Author

@fcojfernandez So I while I was looking to find a way to uniquely identifying multiple instances which acceptRemote(), I got into thinking of how the Extension Point is designed on a Singleton pattern.

I was reading the documentation and I found out that for cases where users should be able to configure zero or more instances of some things you define the Describable/Descriptor pattern is preferrable.

For reference, the BuildChooser extension point provides a descriptor which is used to identify specific instances.

We could change the design pattern to address our needs better.

@fcojfernandez
Copy link
Member

@rishabhBudhouliya I've always used the Extension pattern for these cases and I've used the Descriptor pattern only for objects used in the UI, but it doesn't mean that I'm right. I'd recommend you to rise the question in the mailing list. But please don't rise a question named as GSoC ... but something concrete to the differences between both patterns.

rishabhBudhouliya and others added 9 commits August 17, 2020 10:39
The process of recommending an optimal git impl is in two stages:
Stage 1: Determine if the user additional behavior contains a feature which is not supported
by JGit - see UnsupportedCommand usage
Stage 2: Instantiating GitToolChooser to recommend a git tool
Here, the GitToolChooser uses the repository URL to determine a recommendation for the git
implementation.
This method can be called by an extension to convey if a certain extension
feature is supported by JGit or not.

After this information is provided by the an extenions or multiple extensions, the
command can determine if JGit is to be used or not.

Even if one extension is not supporting a feature, JGit will not be recommended.
By implementing this method, this extension is conveying that in the case a user wants to use
timeout, JGit will not be able to perform the git client operations as it is not yet supported.
Copy link
Contributor

@MarkEWaite MarkEWaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two optional comments and one change that is needed before merge (or at least before release).

The disableGitToolChooser global configuration switch needs an online help page to match with the online help available from the other switches on the same page.

Add a file src/main/resources/hudson/plugins/git/GitSCM/help-disableGitToolChooser.html based on the content of src/main/resources/hudson/plugins/git/GitSCM/help-hideCredentials.html that describes more details of the git chooser

README.adoc Outdated Show resolved Hide resolved
pom.xml Outdated Show resolved Hide resolved
Includes a null pointer exception fix for UnsupportedCommand
Copy link
Contributor

@MarkEWaite MarkEWaite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're ready to merge. Excellent work, @rishabhBudhouliya

@MarkEWaite MarkEWaite dismissed fcojfernandez’s stale review August 25, 2020 01:50

Concerns raised by @fcojfernandez are resolved

@MarkEWaite MarkEWaite merged commit 612f554 into jenkinsci:master Aug 25, 2020
@multani
Copy link

multani commented Sep 1, 2020

Hello, I think this PR introduces a NullPointerException in specific cases, in particular in this changeset:

  • line 846: chooser is initialized to a null value
  • line 847: getUserRemoteConfigs() may return no values (under which circumstances, I'm not 100% sure yet). chooser in that case is never modified
  • line 852: chooser.getGitTool()) raises NullPointerException.

I opened JENKINS-63572 for that.

@MarkEWaite
Copy link
Contributor

Thanks @multani . You'll find comments in JENKINS-63572 that describe how you can avoid the null pointer exception and make your Pipeline a little faster by avoiding a redundant checkout operation.

@rishabhBudhouliya is preparing a pull request to fix JENKINS-63519 and will include this fix in that pull request as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants