WIP: Optimize for big repository on repo home page to make it load less than 1s and /main less than 8s #6045

lunny · 2019-02-12T11:10:20Z

I tested it locally. The recent linux repo home page loads time from 8s to 2s if enabled memory last commit cache, 8s to 2.8s if enabled boltdb last commit cache.

update: now it only 1.9s even boltdb on my MacOS since we only get simple commit infomation.

update(2019-02-22): added ls tree cache and files limit to 1000, now aport will /main directory will load less than 8s on my Macbook pro. I will add a log cache and need more work to clean up the codes.

blocked by go-gitea/git#145

codecov-io · 2019-02-12T11:41:37Z

Codecov Report

Merging #6045 into master will decrease coverage by 0.13%.
The diff coverage is 13.41%.

@@            Coverage Diff             @@
##           master    #6045      +/-   ##
==========================================
- Coverage   38.83%   38.69%   -0.14%     
==========================================
  Files         354      343      -11     
  Lines       50183    49128    -1055     
==========================================
- Hits        19488    19010     -478     
+ Misses      27870    27366     -504     
+ Partials     2825     2752      -73

Impacted Files	Coverage Δ
routers/init.go	`65.78% <0%> (-2.71%)`	⬇️
modules/cache/lastcommit/memory.go	`0% <0%> (ø)`
modules/cache/lastcommit/boltdb.go	`0% <0%> (ø)`
modules/setting/setting.go	`47.85% <100%> (+0.08%)`	⬆️
routers/repo/view.go	`47.3% <100%> (+6.21%)`	⬆️
modules/setting/git.go	`40% <40%> (ø)`
modules/cache/lastcommit/init.go	`63.63% <63.63%> (ø)`
modules/auth/repo_form.go	`39% <0%> (-4.25%)`	⬇️
models/repo_list.go	`63.29% <0%> (-3.56%)`	⬇️
routers/org/org.go	`7.31% <0%> (-1.99%)`	⬇️
... and 70 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 035c82a...3c25048. Read the comment docs.

lafriks · 2019-02-12T13:47:38Z

Why not just reuse already existing caching functionality like commit counts are cached?

lunny · 2019-02-12T14:17:52Z

I think that cache infrastructures are too complicated. and I think the commit counting cache should also be used a separated cache service to do that.

clandmeter · 2019-02-12T14:44:31Z

Thanks a lot @lunny for trying to improve this usability concern.

I can confirm that using boltdb improves performance the most for me.

memory:
	initial load 120 seconds
	second load 60-70 seconds

boltdb:
	second load 40-50 seconds

none:
	initial load 120 seconds
	second load 130 seconds

url to try is http://51.15.253.24:3000/clandmeter/aports/src/branch/master/main

What really would make a difference on repo layouts like this would be a pager with a commit list limit set in app.ini. Or if that's not possible a hard limit like github does.

lunny · 2019-02-12T15:05:50Z

@clandmeter have you enabled [cache] on your app.ini. That will also help on commit counting. We have more improvements work to do. Maybe we should follow how github did that. limit 1000 files and hide all the last commit message?

kolaente

Do you have another pr for code.gitea.io/git?

kolaente · 2019-02-12T14:03:13Z

modules/cache/lastcommit/boltdb.go

+	err := c.db.View(func(tx *bolt.Tx) error {
+		b := tx.Bucket(c.bucket)
+		v := b.Get([]byte(getKey(repoPath, ref, entryPath)))
+		if v == nil || len(v) <= 0 {


You can simplify that whole thing with something like if len(v) == 0 {, if v is nil, len(v) will return 0.

clandmeter · 2019-02-12T15:27:56Z

@lunny i enabled cache (redis) as this was a vanilla setup. the load is now around 30-40 seconds.

Yes a hard limit could be an option.
If the limit amount could be a setting users would be able to fine tune it based on system spec.

Github does the same: https://github.com/alpinelinux/aports/tree/master/main

This loads in a few seconds 1-3. I don't think gitea can do that even with a 1k limit. The amount of objects is now around 2k so i would assume at best it would load in 15 seconds.

Even with caching I see a lot of git commands (git rev-list -1 xxx) beeing executed in the background. I guess the cache doesn't cover all of the git lookups.

lunny · 2019-02-13T02:51:00Z

@clandmeter yes. I need send several PRs for all the optimizations since not all the git command be cached. I will send another PR to limit the display files number according user setting.

clandmeter · 2019-02-13T08:41:23Z

@lunny I wonder why you choose to use boltdb but did not add a Redis option? Redis is already used for caching and would provide much better scaling when your instance would grow?

lafriks · 2019-02-13T09:14:52Z

@lunny imho it would be better to implement type that uses internally existing caching methods. This way we would support also redis etc

lunny · 2019-02-13T14:30:52Z

@clandmeter It's easy to add redis.
@lafriks Will consider that.

silverwind · 2019-02-13T18:09:44Z

docs/content/doc/advanced/config-cheat-sheet.en-us.md

+## Git - LastCommitCache settings (`git.last_commit_cache`)
+
+- `TYPE`: **none**: Cache type, could be empty, `memory` or `boltdb`.
+- `DATA_PATH`: ****: Cache dir when type is boltdb.


Can we have a default path?

The default type is none means do not use cache. For small repository, it's enough. So the default path is empty.

lunny · 2019-02-16T12:27:53Z

@lafriks added old cache configuration support.

lunny · 2019-02-16T12:28:52Z

@clandmeter added redis support and reduced half of git commands when list last commit information.

clandmeter · 2019-02-16T23:30:22Z

@lunny nice thx.
What is the config option to enable redis for commit cache?
Also applying this PR on master will segfault when using boltdb.

[Macaron] 2019-02-16 23:29:38: Started GET /clandmeter/aports/src/branch/master/main for 172.16.254.10
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x55cf09b37a95]

goroutine 1043 [running]:
code.gitea.io/gitea/modules/cache/lastcommit.(*BoltDBCache).Get(0x0, 0xc001a12540, 0x31, 0xc001a23b90, 0x28, 0xc0036648d0, 0x9, 0xc000513708, 0x55cf08d95855, 0xc000a35c80)
        /home/clandmeter/go/src/code.gitea.io/gitea/modules/cache/lastcommit/boltdb.go:57 +0xe5
code.gitea.io/gitea/vendor/code.gitea.io/git.targetedSearch(0xc000c86280, 0xc0024263c0, 0x55cf0a7f7860, 0x0)
        /home/clandmeter/go/src/code.gitea.io/gitea/vendor/code.gitea.io/git/commit_info.go:105 +0x5aa
created by code.gitea.io/gitea/vendor/code.gitea.io/git.getCommitsInfo
        /home/clandmeter/go/src/code.gitea.io/gitea/vendor/code.gitea.io/git/commit_info.go:247 +0x29c

lunny · 2019-02-17T02:18:19Z

Sorry, I changed the config option DATA_PATH to CONN_STR.

clandmeter · 2019-02-17T07:52:27Z

@lunny could you update the ini sample on how to use the new options? I am unable to make redis work. I can only make memory work.

techknowlogick · 2019-02-22T14:43:06Z

modules/cache/ls_tree/cache.go

@@ -0,0 +1,30 @@
+package ls_tree


Copyright title please

techknowlogick · 2019-02-22T14:43:30Z

modules/cache/ls_tree/init.go

@@ -0,0 +1,28 @@
+package ls_tree


Same for this file

lunny · 2019-02-22T14:45:16Z

@clandmeter I send another commit today and I think that will make aport /main loading less than 8s(max 1000 files). And I will add another cache soon.

clandmeter · 2019-02-22T17:08:26Z

@lunny im not able to make this work anymore.
What is the correct way to add redis caching?
I used:

; Last commit cache
[git.last_commit_cache]
USE_DEFAULT_CACHE = true

[cache]
ADAPTER = redis
HOST = network=tcp,addr=:6379,db=0,pool_size=100,idle_timeout=180

which results in

[Macaron] 2019-02-22 17:02:14: Started GET /alpine/aports for 172.16.254.10
panic: interface conversion: interface {} is string, not *git.Commit

goroutine 1049 [running]:
code.gitea.io/gitea/modules/cache/lastcommit.(*lastCommitCache).Get(0xc00031a9e0, 0xc001989b00, 0x2d, 0xc001a2b5c0, 0x28, 0xc003a3f4a0, 0x9, 0x0, 0x0, 0x0)
        /home/clandmeter/go/src/code.gitea.io/gitea/modules/cache/lastcommit/cache.go:23 +0x103
code.gitea.io/gitea/vendor/code.gitea.io/git.targetedSearch(0xc00003b630, 0xc003a55620, 0x563ca2460060, 0xc00031a9e0)
        /home/clandmeter/go/src/code.gitea.io/gitea/vendor/code.gitea.io/git/commit_info.go:105 +0x5aa
created by code.gitea.io/gitea/vendor/code.gitea.io/git.getCommitsInfo
        /home/clandmeter/go/src/code.gitea.io/gitea/vendor/code.gitea.io/git/commit_info.go:247 +0x29c

…d speed

clandmeter · 2019-02-23T15:12:44Z

@lunny yes after retesting boltdb it seems broken.

Regarding redis, it looks like it doesn't cache anything. Compared to memory caching you can see the second run will do none or almost no git lookups anymore. With redis this keeps happening making it much slower. Funny thing is is does check for redis availability because if i turn it of it will return error 500.

I have some small remarks for your reference.

To my understanding a redis lookup should be much faster than a git lookup. So if you enable global cache with redis why not automatically enable it for the other cache options?

Also what does the following mean:

repos which have less than commits number will enable this cache, 0 will always cache if TYPE is not empty

This reads as repo with more than 1000 commits (default setting) have caching disabled?
The amount of commits in a tree has nothing to do with this caching option, or should i read this differently?

Thanks for looking into it, and ill keep following this PR.

filipnavara · 2019-04-24T18:36:09Z

After re-running my Bloom filter experiment from December based on this article I think that's a better approach for solving the problem. It offers a way to accelerate the listing for any branch or history point. I realise the burden of proof is on me and unfortunately I may not be able to deliver it in timely manner. This may still be the way to go short-term but if someone decides to invest time into it there are viable alternatives. I'm now going through the discussion on the Git mailing list to read up on the various attempts to implement it and whether it got any traction since last year.

filipnavara · 2019-05-01T16:17:53Z

I still don't have much to share about my Bloom filter experiments in terms of code, but I do have some preliminary numbers from testing them on the Linux repository. Since this PR was started Gitea underwent some changes that changed some performance characteristics, mostly for the better. Migrating to go-git allowed me to experiment with caching some intermediate results that can efficiently help answer many different types of history queries such as the one used for repository listing.

This is all driven by the GitHub enigineers philosophy:

Generally speaking, caching specific results to queries is a weak approach to
performance in complex systems. What you want to do instead is caching
intermediate steps of the computation, to be able to efficiently answer any kind
of query.

Listing the recent commits in Linux repository has few challenges:

To answer the question "What files are the root directory at HEAD revision and when were they changed last time?" it is necessary to traverse 8954 commits. This actually means loading 8954 commit objects and just as many tree objects from the repository.
The Linux repository is highly packed and these objects are often delta encoded so the actual number of objects that have to be loaded is higher than that.
go-git is not too efficient for huge pack files with delta encoded objects. Due to some implementation decisions about caching objects by their respective hash loading a delta encoded object using offset-delta encoding requires building a reverse offset-hash mapping that is quite expensive.

How can this be addressed?

For most of the commit objects we don't actually need to load the whole object. We only care about the tree hash and the parent commits. Only for the last few commits we actually display we care about the commit message, autor and other properties. Turns out that this is already something tackled in regular Git by a precomputed commit-graph file. It stores minimal information needed for walking the commit graph in a form that is efficient for random access. In the case of Linux repository this file takes around 46 Mb (pack file is 2,5Gb + 200Mb index) for all commits in the repository in all branches. Using the commit-graph file saves loading nearly all of the commit objects. Work is underway to implement the support for this in go-git, but Gitea changes are needed to take full advantage of it (eg. scheduling when to rebuild the commit-graph files and executing it in the background).
The number of loaded tree objects could be reduced by querying Bloom filters that can efficiently answer whether a certain path may have been changed in a given commit. With 99% probability (and 0% false negatives) the filters can correctly answer whether we need to load and compare two given commits or whether we can skip over to the parent commit straight away. In Linux repository this allows to skip loading trees for roughly 4738 of the 8954 commits (for comparison, in Rails repository it filters out 19432 of 22997 commits). The file size of the stored bloom filters is roughly 26 Mb.
The go-git bottleneck with building the reverse offset-hash map for a packfile can be eliminated entirely. It's just a bit tricky to do so while keeping the external API shape intact. I've done some changes upstream that significantly improved the time to build this map, but it still takes a little over 1 second to do so.

Overall, with all the above performance enhancements I was able to get the load time of the repository listing page down to respectable 1.2 second. This general performance improvement translates to different listings, older points in history or other branches.

clandmeter · 2019-05-10T08:26:05Z

@lunny any idea when this is going to be finalized/implemented?

lunny · 2019-05-10T10:46:58Z

@clandmeter will continue work on this PR. I wish this could be released in v1.9

stale · 2019-07-09T11:23:34Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

kolaente · 2019-07-09T11:59:08Z

@lunny any updates on this?

lunny · 2019-07-09T13:22:15Z

This PR is outdate and I will send another PR to do some optimization.

michaelshiel · 2019-08-28T20:29:42Z

We are evaluating Gitea, and already love the project very much so thank you for that!
Unfortunately, our project is very large and after pushing all branches to my Gitea test server, the main page takes 1m40s to load... are there any updates in this area?

lunny · 2019-08-29T01:49:25Z

Which gitea version are you using? @michaelshiel

lunny · 2019-08-29T01:56:43Z

@typeless when you have many sub directories in a directory and many commits, it is still slow.

michaelshiel · 2019-08-29T13:35:18Z

Version: Gitea v1.10.0+dev-226-g256b17817 built with GNU Make 4.2.1, go1.12.9 : bindata, sqlite, sqlite_unlock_notify

10,145 commits, 517 branches, and 1.9GB total repo size.
87,911 files in 14,214 folders

michaelshiel · 2019-08-30T13:28:22Z

To update, the fault appears to have been on my end. I initially was running the latest image in docker on OSX (with delegated volume mappings), running on Ubuntu 18.04 has no performance problems so far and any page load is sub 1 second. Sorry for the noise, and thanks again!

stale · 2019-10-29T13:51:38Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

kolaente · 2019-11-01T19:16:32Z

@lunny any updates on this?

lunny · 2019-11-02T00:28:43Z

@kolaente I will start another PR after some time. This PR is too far away current master.

stale · 2020-01-01T01:06:05Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

lunny · 2020-01-30T05:08:04Z

This has been replaced by #10069

lunny added the type/enhancement An improvement of existing functionality label Feb 12, 2019

lunny force-pushed the lunny/cache_file_last_commit branch from 5175afd to f17fa2d Compare February 12, 2019 11:31

GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 12, 2019

lunny changed the title ~~Added two last commit cache implementations to speed up repo page load speed~~ Added two last commit cache implementations to speed up repo page loading Feb 12, 2019

kolaente reviewed Feb 12, 2019

View reviewed changes

silverwind reviewed Feb 13, 2019

View reviewed changes

lunny force-pushed the lunny/cache_file_last_commit branch from f17fa2d to 9c33992 Compare February 16, 2019 12:25

lunny changed the title ~~Added two last commit cache implementations to speed up repo page loading~~ Optimize for big repository on repo home page to make it load less than 1s Feb 22, 2019

lunny changed the title ~~Optimize for big repository on repo home page to make it load less than 1s~~ WIP: Optimize for big repository on repo home page to make it load less than 1s Feb 22, 2019

techknowlogick reviewed Feb 22, 2019

View reviewed changes

lunny changed the title ~~WIP: Optimize for big repository on repo home page to make it load less than 1s~~ WIP: Optimize for big repository on repo home page to make it load less than 1s and /main less than 8s Feb 22, 2019

lunny added 2 commits February 23, 2019 14:48

added two last commit cache implementations to speed up repo page loa…

63693b2

…d speed

fixed lints

ec9c119

filipnavara mentioned this pull request Apr 19, 2019

Improve listing performance by using go-git #6478

Merged

stale bot added the issue/stale label Jul 9, 2019

stale bot removed the issue/stale label Jul 9, 2019

stale bot added the issue/stale label Oct 29, 2019

stale bot removed the issue/stale label Nov 1, 2019

stale bot added the issue/stale label Jan 1, 2020

lunny added the issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented label Jan 1, 2020

stale bot removed the issue/stale label Jan 1, 2020

lunny mentioned this pull request Jan 30, 2020

Cache last commit to accelerate the repository directory page visit #10069

Merged

lunny closed this Jan 30, 2020

lunny deleted the lunny/cache_file_last_commit branch January 30, 2020 05:08

go-gitea locked and limited conversation to collaborators Nov 24, 2020

WIP: Optimize for big repository on repo home page to make it load less than 1s and /main less than 8s #6045

WIP: Optimize for big repository on repo home page to make it load less than 1s and /main less than 8s #6045

Conversation

lunny commented Feb 12, 2019 • edited Loading

codecov-io commented Feb 12, 2019 • edited Loading

Codecov Report

lafriks commented Feb 12, 2019

lunny commented Feb 12, 2019

clandmeter commented Feb 12, 2019 • edited Loading

lunny commented Feb 12, 2019 • edited Loading

kolaente left a comment

Choose a reason for hiding this comment

kolaente Feb 12, 2019

Choose a reason for hiding this comment

clandmeter commented Feb 12, 2019

lunny commented Feb 13, 2019

clandmeter commented Feb 13, 2019

lafriks commented Feb 13, 2019

lunny commented Feb 13, 2019

silverwind Feb 13, 2019

Choose a reason for hiding this comment

lunny Feb 16, 2019

Choose a reason for hiding this comment

lunny commented Feb 16, 2019

lunny commented Feb 16, 2019

clandmeter commented Feb 16, 2019

lunny commented Feb 17, 2019

clandmeter commented Feb 17, 2019

techknowlogick Feb 22, 2019

Choose a reason for hiding this comment

techknowlogick Feb 22, 2019

Choose a reason for hiding this comment

lunny commented Feb 22, 2019

clandmeter commented Feb 22, 2019

clandmeter commented Feb 23, 2019

filipnavara commented Apr 24, 2019

filipnavara commented May 1, 2019 • edited Loading

clandmeter commented May 10, 2019

lunny commented May 10, 2019

stale bot commented Jul 9, 2019

kolaente commented Jul 9, 2019

lunny commented Jul 9, 2019

michaelshiel commented Aug 28, 2019

lunny commented Aug 29, 2019

lunny commented Aug 29, 2019 • edited Loading

michaelshiel commented Aug 29, 2019 • edited Loading

michaelshiel commented Aug 30, 2019

stale bot commented Oct 29, 2019

kolaente commented Nov 1, 2019

lunny commented Nov 2, 2019

stale bot commented Jan 1, 2020

lunny commented Jan 30, 2020

lunny commented Feb 12, 2019 •

edited

Loading

codecov-io commented Feb 12, 2019 •

edited

Loading

clandmeter commented Feb 12, 2019 •

edited

Loading

lunny commented Feb 12, 2019 •

edited

Loading

filipnavara commented May 1, 2019 •

edited

Loading

lunny commented Aug 29, 2019 •

edited

Loading

michaelshiel commented Aug 29, 2019 •

edited

Loading