Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JENKINS-63539] Use additional repo URL variants to find cache #947

Closed
wants to merge 27 commits into from

Conversation

MarkEWaite
Copy link
Contributor

@MarkEWaite MarkEWaite commented Aug 27, 2020

JENKINS-63539 Use additional repo URL variants to find cache

Estimation of repository size uses a cache key based on the repository URL. There are several different forms of repository URL that can represent the same git repository.

For example, the following URL's all point to the same repository

This pull request permutes the repository URL provided by the user into several other forms in hopes that one of the forms already has a cached copy that can be used for local estimation of repository size. This assumes that local Java processing of strings is much faster than a REST API call to the hosting provider to request the size of the repository.

Checklist

  • I have read the CONTRIBUTING doc
  • I have added tests that verify my changes
  • Unit tests pass locally with my changes
  • I have added documentation as necessary
  • No Javadoc warnings were introduced with my changes
  • No spotbugs warnings were introduced with my changes
  • Documentation in README has been updated as necessary
  • Online help has been added and reviewed for any new or modified fields
  • I have interactively tested my changes
  • Any dependent changes have been merged and published in upstream modules (like git-client-plugin)

Types of changes

  • New feature (non-breaking change which adds functionality)

@MarkEWaite MarkEWaite added the enhancement Improvement or new feature label Aug 27, 2020
Copy link
Contributor

@rishabhBudhouliya rishabhBudhouliya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, while creating decideAndUseCache I didn't think about the semantics of the URLs. Thanks for doing this, it is a great addition.

src/main/java/jenkins/plugins/git/GitToolChooser.java Outdated Show resolved Hide resolved
@MarkEWaite
Copy link
Contributor Author

@rishabhBudhouliya I realized that I need more tests of the newly added method. I'll add those tests before we merge this change. The statement coverage of the functions looks reasonable but the number of assertions for those statements is far too low.

@MarkEWaite MarkEWaite marked this pull request as draft August 27, 2020 15:20
@MarkEWaite MarkEWaite marked this pull request as ready for review August 27, 2020 21:15
@MarkEWaite MarkEWaite changed the title Use additional repo URL variants to find cache JENKINS-63539 Use additional repo URL variants to find cache Aug 27, 2020
@MarkEWaite MarkEWaite changed the title JENKINS-63539 Use additional repo URL variants to find cache [JENKINS-63539] Use additional repo URL variants to find cache Aug 28, 2020
@MarkEWaite
Copy link
Contributor Author

The tests lead me to #948 while trying to understand the branch coverage report.

No need to require command line git on the master for this case
Estimation of repository size uses a cache key based on
the repository URL.  There are several different forms of
repository URL that can represent the same git repository.

For example, the following URL's all point to the same repository

* git://github.com/jenkinsci/git-plugin
* [email protected]:jenkinsci/git-plugin.git
* https://github.com/jenkinsci/git-plugin
* https://github.com/jenkinsci/git-plugin.git
* ssh://[email protected]/jenkinsci/git-plugin.git

This pull request permutes the repository URL provided by the user into
several other forms in hopes that one of the forms already has a cached
copy that can be used for local estimation of repository size.  This
assumes that local Java processing of strings is much faster than a REST
API call to the hosting provider to request the size of the repository.

Includes tests to confirm remote alternatives

The alternatives work well for repository URL formats for GitHub, GitLab,
Gitea, and most other providers.  The alternatives do not work well with
Bitbucket URL formats because Bitbucket applies the user name differently
than the other providers.

Use random values in locations where the value of the random will have
no affect on the assertions.

Do not return null from remote alternatives

Cleaner interface if it always returns a value
@MarkEWaite
Copy link
Contributor Author

@rishabhBudhouliya I reworked the commit to better match the cases that are being tested. I believe the new implementation is simpler, clearer, and more accurate. It also has many more tests than the first implementation I created.

No need for protocolPatterns array outside the method.
Consistent with the logging of final exit
If there are multiple cached copies of the repository, be pessimistic
and assume that the largest cache is the best approximation of
repository size.
Reduce risk of test dependency on specific sizes
The default initial capacity is 16 with 75% fill factor.  This code will
insert roughly 10 entries.
Filter the trailing slash in the pattern matcher since any number of
slashes on the end of the repository URL do not change the URL.
Repository size estimate is a very coarse decision criteria.  Outdated
information is unlikely to cause serious harm.  Caching the repository
URL and the size in memory seems like a very low cost way to avoid disc
access (for local cache) and network access (for REST API calls).
The job will generally fail when no remote configs are defined, but it
should not throw a Java exception due to a user configuration error.
Job will usually fail in other ways, but a null pointer exception is
not a friendly way to show a configuration error to a user.
sizeOfRepo = FileUtils.sizeOfDirectory(cacheDir);
sizeOfRepo = (sizeOfRepo/1000); // Conversion from Bytes to Kilo Bytes
long clientRepoSize = FileUtils.sizeOfDirectory(cacheDir) / 1024;
if (clientRepoSize > sizeOfRepo) {
Copy link
Contributor

@rishabhBudhouliya rishabhBudhouliya Sep 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case would we have multiple caches for the same git repository?
Would it happen if two different multi branch projects work on a different version of the same git repository?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that one case that might happen is two jobs that refer to a remote repository with different URLs and with different update frequencies. If one of the caches is only updated once a week and the other is updated once an hour, the "once an hour" repository may have a larger size than the "once a week" repository. This assumes the larger of the two values is the better approximation of repository size.

@MarkEWaite
Copy link
Contributor Author

I've realized this is implemented incorrectly. It needs rework to use much simpler code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants