[JENKINS-63539] Use additional repo URL variants to find cache #947

MarkEWaite · 2020-08-27T01:24:39Z

JENKINS-63539 Use additional repo URL variants to find cache

Estimation of repository size uses a cache key based on the repository URL. There are several different forms of repository URL that can represent the same git repository.

For example, the following URLs all point to the same repository:

git://github.com/jenkinsci/git-plugin
git@github.com:jenkinsci/git-plugin.git
https://github.com/jenkinsci/git-plugin
https://github.com/jenkinsci/git-plugin.git
ssh://git@github.com/jenkinsci/git-plugin.git

This pull request permutes the repository URL provided by the user into several other forms in hopes that one of the forms already has a cached copy that can be used for local estimation of repository size. This assumes that local Java processing of strings is much faster than a REST API call to the hosting provider to request the size of the repository.
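As an illustration of the permutation idea, here is a minimal sketch. It is not the code from this pull request: the class name `RepoUrlVariants`, the method `variants`, and the three regular expressions are assumptions made for the example. It also assumes the SSH forms use the `git` user and that the URL carries no embedded credentials.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RepoUrlVariants {
    // Hypothetical patterns; each captures the host (group 1) and the
    // owner/repository path (group 2), dropping ".git" and trailing slashes.
    private static final Pattern[] PATTERNS = {
        Pattern.compile("^(?:git|https?)://([^/]+)/(.+?)(?:\\.git)?/*$"),
        Pattern.compile("^ssh://git@([^/]+)/(.+?)(?:\\.git)?/*$"),
        Pattern.compile("^git@([^:/]+):(.+?)(?:\\.git)?/*$"),
    };

    /** Returns the URL as given plus its equivalent forms; never returns null. */
    public static Set<String> variants(String url) {
        Set<String> result = new LinkedHashSet<>();
        if (url == null || url.isEmpty()) {
            return result; // nothing to permute
        }
        result.add(url);
        for (Pattern pattern : PATTERNS) {
            Matcher m = pattern.matcher(url);
            if (!m.matches()) {
                continue;
            }
            String host = m.group(1);
            String path = m.group(2);
            // The five forms listed in the description above.
            result.add("git://" + host + "/" + path);
            result.add("git@" + host + ":" + path + ".git");
            result.add("https://" + host + "/" + path);
            result.add("https://" + host + "/" + path + ".git");
            result.add("ssh://git@" + host + "/" + path + ".git");
            break; // the first matching pattern is enough
        }
        return result;
    }
}
```

A caller would try each returned string as a cache key and stop at the first one that has a cached copy. A URL that matches none of the patterns is returned unchanged as the only entry.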

Checklist

Types of changes

New feature (non-breaking change which adds functionality)

rishabhBudhouliya

This is great. While creating decideAndUseCache, I didn't think about the semantics of the URLs. Thanks for doing this, it is a great addition.

src/main/java/jenkins/plugins/git/GitToolChooser.java

MarkEWaite · 2020-08-27T15:20:45Z

@rishabhBudhouliya I realized that I need more tests of the newly added method. I'll add those tests before we merge this change. The statement coverage of the functions looks reasonable but the number of assertions for those statements is far too low.

MarkEWaite · 2020-08-28T03:25:24Z

The tests led me to #948 while trying to understand the branch coverage report.

Estimation of repository size uses a cache key based on the repository URL. There are several different forms of repository URL that can represent the same git repository.

For example, the following URLs all point to the same repository:

git://github.com/jenkinsci/git-plugin
git@github.com:jenkinsci/git-plugin.git
https://github.com/jenkinsci/git-plugin
https://github.com/jenkinsci/git-plugin.git
ssh://git@github.com/jenkinsci/git-plugin.git

This pull request permutes the repository URL provided by the user into several other forms in hopes that one of the forms already has a cached copy that can be used for local estimation of repository size. This assumes that local Java processing of strings is much faster than a REST API call to the hosting provider to request the size of the repository.

Includes tests to confirm remote alternatives

The alternatives work well for repository URL formats for GitHub, GitLab, Gitea, and most other providers. The alternatives do not work well with Bitbucket URL formats because Bitbucket applies the user name differently than the other providers.

Use random values in locations where the random value will have no effect on the assertions.

Do not return null from remote alternatives

Cleaner interface if it always returns a value

MarkEWaite · 2020-08-29T02:33:44Z

@rishabhBudhouliya I reworked the commit to better match the cases that are being tested. I believe the new implementation is simpler, clearer, and more accurate. It also has many more tests than the first implementation I created.

Repository size estimate is a very coarse decision criterion. Outdated information is unlikely to cause serious harm. Caching the repository URL and its size in memory seems like a very low-cost way to avoid disk access (for the local cache) and network access (for REST API calls).
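A minimal sketch of such an in-memory cache, for illustration only. The class `RepoSizeCache`, its methods, and the `slowLookup` parameter are hypothetical names, not the plugin's API. The sketch assumes the expensive lookup (directory scan or REST call) is passed in as a function and that stale entries are acceptable until the cache is cleared.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class RepoSizeCache {
    // Maps repository URL to estimated size; entries never expire in this sketch.
    private static final Map<String, Long> SIZES = new ConcurrentHashMap<>();

    /**
     * Returns the remembered size for the URL, calling the slow lookup
     * only the first time the URL is seen.
     */
    public static long sizeInKiB(String repoUrl, Function<String, Long> slowLookup) {
        return SIZES.computeIfAbsent(repoUrl, slowLookup);
    }

    /** Forgets all remembered sizes, for example at the start of a test. */
    public static void clear() {
        SIZES.clear();
    }
}
```

Because the estimate only steers a coarse decision, the sketch makes no attempt to refresh or expire entries.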

rishabhBudhouliya · 2020-09-07T06:48:00Z

src/main/java/jenkins/plugins/git/GitToolChooser.java

-                    sizeOfRepo = FileUtils.sizeOfDirectory(cacheDir);
-                    sizeOfRepo = (sizeOfRepo/1000); // Conversion from Bytes to Kilo Bytes
+                    long clientRepoSize = FileUtils.sizeOfDirectory(cacheDir) / 1024;
+                    if (clientRepoSize > sizeOfRepo) {


In which case would we have multiple caches for the same git repository?
Would it happen if two different multibranch projects work on different versions of the same git repository?

I think that one case that might happen is two jobs that refer to a remote repository with different URLs and with different update frequencies. If one of the caches is only updated once a week and the other is updated once an hour, the "once an hour" repository may have a larger size than the "once a week" repository. This assumes the larger of the two values is the better approximation of repository size.
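The "largest cache wins" rule can be sketched as follows. This is an illustration under stated assumptions, not the code under review: the class `LargestCacheEstimate` and both method names are hypothetical, and the directory size is computed with `java.nio.file` rather than the `FileUtils.sizeOfDirectory` call shown in the diff so that the example has no external dependencies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

final class LargestCacheEstimate {
    /** Sums the sizes of all regular files under dir, in KiB. */
    public static long directorySizeKiB(Path dir) {
        try (Stream<Path> files = Files.walk(dir)) {
            long bytes = files.filter(Files::isRegularFile).mapToLong(p -> {
                try {
                    return Files.size(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }).sum();
            return bytes / 1024;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the size of the largest existing cache directory, or 0 if none exist. */
    public static long estimateKiB(List<Path> cacheDirs) {
        long sizeOfRepo = 0;
        for (Path dir : cacheDirs) {
            if (!Files.isDirectory(dir)) {
                continue; // no cached copy under this URL variant
            }
            long clientRepoSize = directorySizeKiB(dir);
            if (clientRepoSize > sizeOfRepo) {
                sizeOfRepo = clientRepoSize; // be pessimistic: keep the largest
            }
        }
        return sizeOfRepo;
    }
}
```

Taking the maximum means a rarely updated cache cannot pull the estimate below what a more frequently updated cache already shows.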

MarkEWaite · 2020-09-12T02:35:45Z

I've realized this is implemented incorrectly. It needs rework to use much simpler code.

MarkEWaite added the enhancement Improvement or new feature label Aug 27, 2020

MarkEWaite requested a review from rishabhBudhouliya August 27, 2020 01:25

rishabhBudhouliya approved these changes Aug 27, 2020

src/main/java/jenkins/plugins/git/GitToolChooser.java (outdated)

MarkEWaite marked this pull request as draft August 27, 2020 15:20

MarkEWaite marked this pull request as ready for review August 27, 2020 21:15

MarkEWaite changed the title ~~Use additional repo URL variants to find cache~~ JENKINS-63539 Use additional repo URL variants to find cache Aug 27, 2020

MarkEWaite changed the title ~~JENKINS-63539 Use additional repo URL variants to find cache~~ [JENKINS-63539] Use additional repo URL variants to find cache Aug 28, 2020

Use JGit for caching decision

cfed0dd

No need to require command line git on the master for this case

MarkEWaite force-pushed the find-more-caches branch from 16de1ef to b7f1c62 Compare August 29, 2020 02:28

MarkEWaite force-pushed the find-more-caches branch from b7f1c62 to 410f70e Compare August 29, 2020 02:31

MarkEWaite added 12 commits August 29, 2020 05:58

Move array initialization inside method

be4edc4

No need for protocolPatterns array outside the method.

Log early exit from alternatives generator

1b11caf

Consistent with the logging of final exit

Use size estimate from largest cache

9fd24bb

If there are multiple cached copies of the repository, be pessimistic and assume that the largest cache is the best approximation of repository size.

Use random sizes within range

abe2421

Reduce risk of test dependency on specific sizes

Mention repo URL is git tool log message

31ecea6

No benefit from initial capacity of LinkedHashSet

c87d025

The default initial capacity is 16 with a 0.75 load factor, so the set does not resize until it holds more than 12 entries. This code will insert roughly 10 entries.

Cache variants were not handling URLs that ended with '/'

afb270f

Filter the trailing slash in the pattern matcher since any number of slashes at the end of the repository URL does not change which repository the URL refers to.

Check the remoteAlternatives empty string case

9cc9dc7

Do not include credentials in checkout unless required

e3af88b

[JENKINS-63572] Test for NPE if no remote configs are defined

7176254

The job will generally fail when no remote configs are defined, but it should not throw a Java exception due to a user configuration error.

[JENKINS-63572] Avoid NPE if remote configs is empty

ea58c75

Job will usually fail in other ways, but a null pointer exception is not a friendly way to show a configuration error to a user.
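The guard described in the last two commits can be sketched like this. It is a simplified illustration: remote configs are reduced to a plain list of URL strings, and the class `RemoteConfigGuard` and method `firstRemoteUrl` are hypothetical names rather than the plugin's own types.

```java
import java.util.List;
import java.util.Optional;

final class RemoteConfigGuard {
    /**
     * Returns the first configured remote URL, or an empty Optional when the
     * user has defined no remotes, instead of failing with an exception.
     */
    public static Optional<String> firstRemoteUrl(List<String> remoteUrls) {
        if (remoteUrls == null || remoteUrls.isEmpty()) {
            return Optional.empty(); // configuration error, reported elsewhere
        }
        return Optional.ofNullable(remoteUrls.get(0));
    }
}
```

The caller can then skip the size estimate when no URL is present and let the job report the configuration problem in its usual way.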

MarkEWaite mentioned this pull request Sep 4, 2020

Extension: RepositorySizeGithubAPI to get size of a repo jenkinsci/github-branch-source-plugin#316

Open

8 tasks

MarkEWaite added 3 commits September 6, 2020 19:07

Add GitToolChooser logging to diagnose test failures

d2cd2d7

Use unique project name in GitToolChooserTest

ae3f031

Also use isWindows() rather than the SystemUtils check

Fix compilation error from prior check

1ec4598

MarkEWaite added 3 commits September 6, 2020 19:18

Fix workflow syntax error from merge

7a13a85

Better formatting of logging

da79822

Report cache dir location when found

cf9aa8c

rishabhBudhouliya reviewed Sep 7, 2020

MarkEWaite added 7 commits September 7, 2020 13:37

Better log message when cache entry found

8f6058d

Add call to clear repository cache

d670743

Clear repository size cache on test entry

987b3c0

Merge branch 'master' into find-more-caches

dbd2c7a

Merge branch 'master' into find-more-caches

f74d8ed

Remove System.out from test

ee2cce4

Fix merge mistakes

ea73f1e

MarkEWaite closed this Sep 12, 2020

rishabhBudhouliya mentioned this pull request Sep 17, 2020

[JENKINS-63539] Expand repo cache lookup for better size estimates #958

Merged

10 tasks

MarkEWaite deleted the find-more-caches branch September 18, 2020 16:52
