-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[GSoC-2020] Git Repository Size Estimation #931
[GSoC-2020] Git Repository Size Estimation #931
Conversation
cc @omkar-dsd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend you always to add tests. It doesn't only help you to test and check your code, but they also help you to polish your code since you're creating a "real" scenario
And it will help us to see how your code works during the review as well!
@fcojfernandez I agree and I am in the process of writing some, will commit asap. |
@rishabhBudhouliya Are you including it in this PR itself? |
} | ||
} | ||
|
||
private boolean getSizeFromAPI(String repoUrl) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getSizeFromAPI
does not return the size as one would expect from its name.
Naming is definitely one of the toughest part of development :P . imo, it should be something like: setSizeFromCache
and setSizeFromAPI
, also it will maintain consistency among these two methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out, I have changed it.
private long sizeOfRepo = 0L; | ||
private String implementation; | ||
private String gitTool; | ||
public static final int SIZE_TO_SWITCH = 5000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define the scale (Bytes, KB, MB, etc.) in comment, else it can be confusing 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, while writing the tests it has deemed to me that it might be cleaner to use a non-primitive data type instead of a primitive data type.
The API might return varying sizes, we need to take that case into account as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you could perhaps have an abstract method which defaults to KB to use for any necessary conversions. Perhaps a different PR for something like that if we have time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinharringa Great suggestion, I have a seperate PR for this specification.
a "NONE" string Also, size which is decided to switch implementation has been increased to 50K because 5 MiB is experimentally an approximation where behavior of jgit starts to change.
Note: The current test cases target only the scenario with heuristic 1, i.e the cache determination option.
@fcojfernandez @omkar-dsd I have added the test cases. Currently, they cover scenarios where only cached .git estimation is used. For API extension point testing, I've seen that we have a |
Woops, looks like it is failing for the Windows environment. :| @rishabhBudhouliya can you check? Is it the same case that was noticed earlier last month? |
You can mock the response using mockito or wiremock depending on your needs. |
String tool = repoSizeEstimator.getGitTool(); | ||
|
||
// The class should make no recommendation since it can't find a .git cached directory | ||
assertThat(tool.equals("NONE"), is(true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a very simple case and hence quite readable, but it's a good idea to start using the methods that allow you to have more readable tests:
- You could use assertTrue(tool.equas("NONE")
- You could use
assertThat(tool.equals("NONE"), is(true)); | |
assertThat(tool, is("NONE")); |
which is even much clearer: you expect that tool is "NONE". You can even read what you expect
/* | ||
Since the size of repository is 21.785 KiBs, the estimator should suggest "jgit" as an implementation | ||
*/ | ||
assertThat(repoSizeEstimator.getGitTool(), containsString("git")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's an example of what I meant above: Super-readable. check that repoSizeExtimagor contains a String "git"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand, you and @MarkEWaite have made this clear often in PR submissions, please bear with me as I slowly grasp its importance.
* A class which allows {@link AbstractGitSCMSource} to estimate the size of a repository from a distance | ||
* without requiring a local checkout. | ||
*/ | ||
public class GitRepoSizeEstimator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is still a bit confusing: GitRepoSizeEstimator
but it only offers the Git Tool to use, nothing about Size
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mean to change the name, but offer the size data as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My assumption was that the class should only provide the right recommendation for git implementation. To create a more coherent name, I suggest GitToolAdvisor
or maybe something a bit less dramatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should also return size? While writing the class I couldn't imagine the benefit of providing the size of the repository.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's weird and confusing to have a GitRepoSizeEstimator
class that is estimating something that is never returned and instead it's returning the tool to use. Something we can change in the last minute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, GitToolSuggestion
? A GitToolSuggestion.getGitTool()
seems like a good way to tell that the developer that we wish to suggest a better approach in terms of git implementation.
Whatever everyone agrees upon is fine by me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm really bad at naming methods and classes, so I will let others suggest something. for me it's fine if the class refers an actual behaviour
Jenkins jenkins = Jenkins.getInstanceOrNull(); | ||
if (jenkins == null) { | ||
return null; | ||
} | ||
return jenkins.getExtensionList(RepositorySizeAPI.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extension should not be available if Jenkins instance is null, so the get()
method will do the job for you.
Jenkins jenkins = Jenkins.getInstanceOrNull(); | |
if (jenkins == null) { | |
return null; | |
} | |
return jenkins.getExtensionList(RepositorySizeAPI.class); | |
return Jenkins.get().getExtensionList(RepositorySizeAPI.class); |
private boolean setSizeFromAPI(String repoUrl) { | ||
for (RepositorySizeAPI r: Objects.requireNonNull(RepositorySizeAPI.all())) { | ||
if (r != null) { | ||
sizeOfRepo = r.getSizeOfRepository(repoUrl); | ||
return true; | ||
} | ||
} | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are you trying to get here? r
is never null and all will return a non-null object. it might be empty, but not null.
then, this is always returning the first extension in the list, whatever the size of this and whatever the extensions in Jenkins.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to give you more context: Any class extending RepositorySizeApi and with the @Extension
annotation, will be loaded when Jenkins is launched (or when the plugin is installed) and it will be available.
Jenkins.get().getExtensionList(RepositorySizeAPI.class)
will look for any class that matches that criteria and returning them in a list.
public String getGitTool() { | ||
return gitTool; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if it should return the GitTool object. @MarkEWaite what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for returning a String
is this portion of code in AbstractGitSCMSource:
GitTool tool = resolveGitTool(context.gitTool(), listener);
if (tool != null) {
git.using(tool.getGitExe());
}
GitClient client = git.getClient();
We create a git client using the git.using(tool.getGitExe())
. If we instantiate a GitRepoSizeEstimator just before it, it can provided a String recommendedGitExe
to git.using()
to create the client with the optimal implementation.
@fcojfernandez The latest commit 0fbdf85 has added the filter which was done as you have suggested in previous comments. The ExtensionPoint contract now has a new method The |
.filter(r -> r.acceptsRemote(repoUrl)) | ||
.collect(Collectors.toList()); | ||
|
||
if (acceptedRepository.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know how the implementer plugins will accept the remote/url. If there are more than one that match the url, what should we do? Take the first? return a false? My suggestion (and it's only a suggestion and I'm open to discuss:
if (acceptedRepository.size() == 1) { | |
if (acceptedRepository.size() > 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Valid point. Currently the code chooses to avoid the whole situation by returning false
. What if, instead of asking for a size of type long, we ask for a non-primitive custom class which encapsulates details like
- size type (in KiBs, in MiBs),
- from which provider (for ex: Github or GitLab)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have coded something like this in a different branch, I can raise a separate PR for you and others to review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you will still have the same question to answer: what will you return if more than one implementer extensions are marked as valid? The first one or null? Or do you will try one by one until one retrieves a valid object? those are the design question you have to try to answer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After searching for an answer for sometime, I've come up with certain ideas:
- Accept a singleton instance: Assume that it is not possible to have two extension instances of the same type, i.e, if the URL is from Github provider, assume that only Github Branch Source Plugin is going to implement it. It doesn't make sense to me that any other plugin would want to implement it. If we find out that more than one does, we reject all instances.
Advantage: The extension point contract is simple at this moment, from git plugin's point of view, we can't assume to know what classes will implement it, so we ask them a simple question: Do you want to implement it? Yes or no.
If we add another rule on the contract, a method which returns the type of provider implementing the extension point, it adds a little bit of complexity.
Disadvantage: We miss out on the correct instance which provides the size for our estimation.
- Add another method on the contract: Ask for the provider's name, say a method like
public abstract String returnProviderName()
which returns strings like "Github" or "Gitlab" which we can then use to filter out more than one instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to answer bit by bit, but starting from the end:
Add another method on the contract: Ask for the provider's name, say a method like public abstract String returnProviderName() which returns strings like "Github" or "Gitlab" which we can then use to filter out more than one instance.
I don't think that's a good idea because you would need a mechanism to distinguish the "provider" in the git-plugin itself and as Mark stated very well, it's better to leave that responsibility to the implementer. See what I mean:
- Let's suppose we add the
distinguishWhenUnclear
method. - If implementer plugins have a really good
accept
method, so your URL is only accepted by 1 plugin, then yourdistinguish
method is not actually needed. - If implementer plugins have vague
accept
method, so more than one is accepting the URL, you have to distinguish between them. Then why do you need them to implement theaccept
method?
If we see the provider name is useful we can add it, but not for filtering purposes.
Accept a singleton instance: Assume that it is not possible to have two extension instances of the same type, i.e, if the URL is from Github provider, assume that only Github Branch Source Plugin is going to implement it. It doesn't make sense to me that any other plugin would want to implement it. If we find out that more than one does, we reject all instances.
That's a valid solution of course. We can return a singleton object or a list filter in the Estimator class. Both solutions are equally valid and will lead to the same end. What I wanted to do was to make you think is about what behaviour we want when more than one Extension objects are accepting and why. You've given me the answer with:
If we find out that more than one does, we reject all instances.
for the what
, and
Assume that it is not possible to have two extension instances of the same type, i.e, if the URL is from Github provider, assume that only Github Branch Source Plugin is going to implement it. It doesn't make sense to me that any other plugin would want to implement it
for the why. Now my opinion/suggestion
I agree with if the URL is from Github provider, assume that only Github Branch Source Plugin is going to implement it. It doesn't make sense to me that any other plugin would want to implement it
, but just an error in a pattern can make your implementation return a false positive. We can assume that an Extension that has returned a false positive will return an error when it tries to get the size using an API that is not valid for it, so if we return the full list of Extensions, it's enough if we accept the first one that is able to give us a number and not an error. This way we can distinguish between implementations and the probability of getting a false number is actually low. In other words: it's a less aggressive approach
But that's my opinion/suggestion and it does not mean that it makes sense or it's the best approach. I'd like your thoughts about that and also I'd like other mentors's opinions here, @MarkEWaite @justinharringa @omkar-dsd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's a good idea because you would need a mechanism to distinguish the "provider" in the git-plugin itself and as Mark stated very well, it's better to leave that responsibility to the implementer. See what I mean:
Understood, it might only increase complexity within the git plugin which is not be necessary.
@fcojfernandez That seems like a neat approach to me, assume that if there is a false positive, there is a very low chance that numerical size returned by it would be non-zero.
If I've understood you correctly, if we have multiple implementations in the extension list, we check the size and only accept the non-zero size instance.
This leads me to ask another question to you, what if some plugin implements our method of getSizeFromRepository()
and does so incorrectly? Do we expect an exception thrown from the that erroneous method? I wouldn't assume that because our method definition doesn't throw any exception.
Should we expect a size = 0
if things go wrong from the implementers side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A repo might be empty, so 0 might or not be a wrong value per se, but I don't think it's a use case we should care (actually unlikely).
Apart from that, adding a safe check just in case any exception that happens is not properly handled by the implementer plugin is just free. It's possible that the implementer is not checking the final endpoint, for example. IMO, better safe than sorry, and if the Extension throws an exception or return 0, then you can discarded the value in both situation.
regarding my comment
a safe check just in case any exception that happens is not properly handled by the implementer plugin
When you delegate the responsibility, it's always a good idea to have a check. Keep that always in mind because even when you want the exception to be thrown, you can catch it, process it and re-throw it
src/test/java/jenkins/plugins/git/GitRepoSizeEstimatorTest.java
Outdated
Show resolved
Hide resolved
src/test/java/jenkins/plugins/git/GitRepoSizeEstimatorTest.java
Outdated
Show resolved
Hide resolved
src/test/java/jenkins/plugins/git/GitRepoSizeEstimatorTest.java
Outdated
Show resolved
Hide resolved
src/test/java/jenkins/plugins/git/GitRepoSizeEstimatorTest.java
Outdated
Show resolved
Hide resolved
* A class which allows {@link AbstractGitSCMSource} to estimate the size of a repository from a distance | ||
* without requiring a local checkout. | ||
*/ | ||
public class GitRepoSizeEstimator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's weird and confusing to have a GitRepoSizeEstimator
class that is estimating something that is never returned and instead it's returning the tool to use. Something we can change in the last minute
public String getGitTool() { | ||
return gitTool; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recovering the comment missing after the last commit:
I was wondering if we just return the GitTool. What do you think @MarkEWaite ?
List<RepositorySizeAPI> acceptedRepository = Objects.requireNonNull(RepositorySizeAPI.all()) | ||
.stream() | ||
.filter(r -> r.acceptsRemote(repoUrl)) | ||
.collect(Collectors.toList()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, this filter might be done in a method of the Extension, so here we have something like
List<RepositorySizeAPI> acceptedRepository = Objects.requireNonNull(RepositorySizeAPI.all()) | |
.stream() | |
.filter(r -> r.acceptsRemote(repoUrl)) | |
.collect(Collectors.toList()); | |
List<RepositorySizeAPI> acceptedRepository = RepositorySizeAPI.getValids(repoUrl); |
.filter(r -> r.acceptsRemote(repoUrl)) | ||
.collect(Collectors.toList()); | ||
|
||
if (acceptedRepository.size() == 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot know how the implementer plugins will accept the URL, so we might have more than one. What should we do? Pick up the first one? Consider we cannot know what to do and return false? My suggestion (just a suggestion, I'm happy to open discussion) would be to pick the first one up
if (acceptedRepository.size() == 1) { | |
if (acceptedRepository.size() > 0) { |
src/test/java/jenkins/plugins/git/GitRepoSizeEstimatorTest.java
Outdated
Show resolved
Hide resolved
@fcojfernandez So I while I was looking to find a way to uniquely identifying multiple instances which I was reading the documentation and I found out that for cases where For reference, the BuildChooser extension point provides a descriptor which is used to identify specific instances. We could change the design pattern to address our needs better. |
@rishabhBudhouliya I've always used the Extension pattern for these cases and I've used the Descriptor pattern only for objects used in the UI, but it doesn't mean that I'm right. I'd recommend you to rise the question in the mailing list. But please don't rise a question named as |
Co-authored-by: Mark Waite <[email protected]>
The process of recommending an optimal git impl is in two stages: Stage 1: Determine if the user additional behavior contains a feature which is not supported by JGit - see UnsupportedCommand usage Stage 2: Instantiating GitToolChooser to recommend a git tool Here, the GitToolChooser uses the repository URL to determine a recommendation for the git implementation.
This method can be called by an extension to convey if a certain extension feature is supported by JGit or not. After this information is provided by the an extenions or multiple extensions, the command can determine if JGit is to be used or not. Even if one extension is not supporting a feature, JGit will not be recommended.
By implementing this method, this extension is conveying that in the case a user wants to use timeout, JGit will not be able to perform the git client operations as it is not yet supported.
- Additionally, shift the code of searching and matching for credentials to a function called `lookupScanCredentials` (avoid branching)
Cover all the branches in the method
Let user decide if they need to disable performance enhancements incase of any undocumented issues.
… Disable performance enchancement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two optional comments and one change that is needed before merge (or at least before release).
The disableGitToolChooser
global configuration switch needs an online help page to match with the online help available from the other switches on the same page.
Add a file src/main/resources/hudson/plugins/git/GitSCM/help-disableGitToolChooser.html
based on the content of src/main/resources/hudson/plugins/git/GitSCM/help-hideCredentials.html
that describes more details of the git chooser
Includes a null pointer exception fix for UnsupportedCommand
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're ready to merge. Excellent work, @rishabhBudhouliya
Concerns raised by @fcojfernandez are resolved
Hello, I think this PR introduces a
I opened JENKINS-63572 for that. |
Thanks @multani . You'll find comments in JENKINS-63572 that describe how you can avoid the null pointer exception and make your Pipeline a little faster by avoiding a redundant checkout operation. @rishabhBudhouliya is preparing a pull request to fix JENKINS-63519 and will include this fix in that pull request as well. |
GitToolChooser
Description: This class aims to add the functionality of recommending a git implementation on the basis of the size of a repository which has a strong correlation to the performance of git fetch (from Performance Benchmarks).
We have two heuristics to calculate the size of a git repository:
size
of repo.How are we doing this?
Assumptions:
GitToolChooser
using an instance ofAbstractGitSCMSource
just as theGitSCMTelescope
does.Architecture Diagram
Checklist
Types of changes
Further comments
To avoid issues with the existing functionality of the plugin, we have used the exact same logic to resolve git tool before making a recommendation as is provided by
GitUtils.resolveGitTool
.