Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Optimize for big repository on repo home page to make it load less than 1s and /main less than 8s #6045

Closed
wants to merge 6 commits into from

Conversation

lunny
Copy link
Member

@lunny lunny commented Feb 12, 2019

I tested it locally. The recent linux repo home page loads time from 8s to 2s if enabled memory last commit cache, 8s to 2.8s if enabled boltdb last commit cache.

update: now it only 1.9s even boltdb on my MacOS since we only get simple commit infomation.

update(2019-02-22): added ls tree cache and files limit to 1000, now aport will /main directory will load less than 8s on my Macbook pro. I will add a log cache and need more work to clean up the codes.

blocked by go-gitea/git#145

@lunny lunny added the type/enhancement An improvement of existing functionality label Feb 12, 2019
@lunny lunny force-pushed the lunny/cache_file_last_commit branch from 5175afd to f17fa2d Compare February 12, 2019 11:31
@codecov-io
Copy link

codecov-io commented Feb 12, 2019

Codecov Report

Merging #6045 into master will decrease coverage by 0.13%.
The diff coverage is 13.41%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #6045      +/-   ##
==========================================
- Coverage   38.83%   38.69%   -0.14%     
==========================================
  Files         354      343      -11     
  Lines       50183    49128    -1055     
==========================================
- Hits        19488    19010     -478     
+ Misses      27870    27366     -504     
+ Partials     2825     2752      -73
Impacted Files Coverage Δ
routers/init.go 65.78% <0%> (-2.71%) ⬇️
modules/cache/lastcommit/memory.go 0% <0%> (ø)
modules/cache/lastcommit/boltdb.go 0% <0%> (ø)
modules/setting/setting.go 47.85% <100%> (+0.08%) ⬆️
routers/repo/view.go 47.3% <100%> (+6.21%) ⬆️
modules/setting/git.go 40% <40%> (ø)
modules/cache/lastcommit/init.go 63.63% <63.63%> (ø)
modules/auth/repo_form.go 39% <0%> (-4.25%) ⬇️
models/repo_list.go 63.29% <0%> (-3.56%) ⬇️
routers/org/org.go 7.31% <0%> (-1.99%) ⬇️
... and 70 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 035c82a...3c25048. Read the comment docs.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Feb 12, 2019
@lafriks
Copy link
Member

lafriks commented Feb 12, 2019

Why not just reuse already existing caching functionality like commit counts are cached?

@lunny
Copy link
Member Author

lunny commented Feb 12, 2019

I think that cache infrastructures are too complicated. and I think the commit counting cache should also be used a separated cache service to do that.

@clandmeter
Copy link

clandmeter commented Feb 12, 2019

Thanks a lot @lunny for trying to improve this usability concern.

I can confirm that using boltdb improves performance the most for me.

memory:
	initial load 120 seconds
	second load 60-70 seconds

boltdb:
	second load 40-50 seconds

none:
	initial load 120 seconds
	second load 130 seconds

url to try is http://51.15.253.24:3000/clandmeter/aports/src/branch/master/main

What really would make a difference on repo layouts like this would be a pager with a commit list limit set in app.ini. Or if that's not possible a hard limit like github does.

@lunny
Copy link
Member Author

lunny commented Feb 12, 2019

@clandmeter have you enabled [cache] on your app.ini. That will also help on commit counting. We have more improvements work to do. Maybe we should follow how github did that. limit 1000 files and hide all the last commit message?

@lunny lunny changed the title Added two last commit cache implementations to speed up repo page load speed Added two last commit cache implementations to speed up repo page loading Feb 12, 2019
Copy link
Member

@kolaente kolaente left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have another pr for code.gitea.io/git?

err := c.db.View(func(tx *bolt.Tx) error {
b := tx.Bucket(c.bucket)
v := b.Get([]byte(getKey(repoPath, ref, entryPath)))
if v == nil || len(v) <= 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify that whole thing with something like if len(v) == 0 {, if v is nil, len(v) will return 0.

@clandmeter
Copy link

@lunny i enabled cache (redis) as this was a vanilla setup. the load is now around 30-40 seconds.

Yes a hard limit could be an option.
If the limit amount could be a setting users would be able to fine tune it based on system spec.

Github does the same: https://github.com/alpinelinux/aports/tree/master/main

This loads in a few seconds 1-3. I don't think gitea can do that even with a 1k limit. The amount of objects is now around 2k so i would assume at best it would load in 15 seconds.

Even with caching I see a lot of git commands (git rev-list -1 xxx) beeing executed in the background. I guess the cache doesn't cover all of the git lookups.

@lunny
Copy link
Member Author

lunny commented Feb 13, 2019

@clandmeter yes. I need send several PRs for all the optimizations since not all the git command be cached. I will send another PR to limit the display files number according user setting.

@clandmeter
Copy link

@lunny I wonder why you choose to use boltdb but did not add a Redis option? Redis is already used for caching and would provide much better scaling when your instance would grow?

@lafriks
Copy link
Member

lafriks commented Feb 13, 2019

@lunny imho it would be better to implement type that uses internally existing caching methods. This way we would support also redis etc

@lunny
Copy link
Member Author

lunny commented Feb 13, 2019

@clandmeter It's easy to add redis.
@lafriks Will consider that.

## Git - LastCommitCache settings (`git.last_commit_cache`)

- `TYPE`: **none**: Cache type, could be empty, `memory` or `boltdb`.
- `DATA_PATH`: ****: Cache dir when type is boltdb.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a default path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default type is none means do not use cache. For small repository, it's enough. So the default path is empty.

@lunny lunny force-pushed the lunny/cache_file_last_commit branch from f17fa2d to 9c33992 Compare February 16, 2019 12:25
@lunny
Copy link
Member Author

lunny commented Feb 16, 2019

@lafriks added old cache configuration support.

@lunny
Copy link
Member Author

lunny commented Feb 16, 2019

@clandmeter added redis support and reduced half of git commands when list last commit information.

@clandmeter
Copy link

@lunny nice thx.
What is the config option to enable redis for commit cache?
Also applying this PR on master will segfault when using boltdb.

[Macaron] 2019-02-16 23:29:38: Started GET /clandmeter/aports/src/branch/master/main for 172.16.254.10
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x55cf09b37a95]

goroutine 1043 [running]:
code.gitea.io/gitea/modules/cache/lastcommit.(*BoltDBCache).Get(0x0, 0xc001a12540, 0x31, 0xc001a23b90, 0x28, 0xc0036648d0, 0x9, 0xc000513708, 0x55cf08d95855, 0xc000a35c80)
        /home/clandmeter/go/src/code.gitea.io/gitea/modules/cache/lastcommit/boltdb.go:57 +0xe5
code.gitea.io/gitea/vendor/code.gitea.io/git.targetedSearch(0xc000c86280, 0xc0024263c0, 0x55cf0a7f7860, 0x0)
        /home/clandmeter/go/src/code.gitea.io/gitea/vendor/code.gitea.io/git/commit_info.go:105 +0x5aa
created by code.gitea.io/gitea/vendor/code.gitea.io/git.getCommitsInfo
        /home/clandmeter/go/src/code.gitea.io/gitea/vendor/code.gitea.io/git/commit_info.go:247 +0x29c

@lunny
Copy link
Member Author

lunny commented Feb 17, 2019

Sorry, I changed the config option DATA_PATH to CONN_STR.

@clandmeter
Copy link

@lunny could you update the ini sample on how to use the new options? I am unable to make redis work. I can only make memory work.

@lunny lunny changed the title Added two last commit cache implementations to speed up repo page loading Optimize for big repository on repo home page to make it load less than 1s Feb 22, 2019
@lunny lunny changed the title Optimize for big repository on repo home page to make it load less than 1s WIP: Optimize for big repository on repo home page to make it load less than 1s Feb 22, 2019
@@ -0,0 +1,30 @@
package ls_tree
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copyright title please

@@ -0,0 +1,28 @@
package ls_tree
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for this file

@lunny
Copy link
Member Author

lunny commented Feb 22, 2019

@clandmeter I send another commit today and I think that will make aport /main loading less than 8s(max 1000 files). And I will add another cache soon.

@lunny lunny changed the title WIP: Optimize for big repository on repo home page to make it load less than 1s WIP: Optimize for big repository on repo home page to make it load less than 1s and /main less than 8s Feb 22, 2019
@clandmeter
Copy link

@lunny im not able to make this work anymore.
What is the correct way to add redis caching?
I used:

; Last commit cache
[git.last_commit_cache]
USE_DEFAULT_CACHE = true

[cache]
ADAPTER = redis
HOST = network=tcp,addr=:6379,db=0,pool_size=100,idle_timeout=180

which results in

[Macaron] 2019-02-22 17:02:14: Started GET /alpine/aports for 172.16.254.10
panic: interface conversion: interface {} is string, not *git.Commit

goroutine 1049 [running]:
code.gitea.io/gitea/modules/cache/lastcommit.(*lastCommitCache).Get(0xc00031a9e0, 0xc001989b00, 0x2d, 0xc001a2b5c0, 0x28, 0xc003a3f4a0, 0x9, 0x0, 0x0, 0x0)
        /home/clandmeter/go/src/code.gitea.io/gitea/modules/cache/lastcommit/cache.go:23 +0x103
code.gitea.io/gitea/vendor/code.gitea.io/git.targetedSearch(0xc00003b630, 0xc003a55620, 0x563ca2460060, 0xc00031a9e0)
        /home/clandmeter/go/src/code.gitea.io/gitea/vendor/code.gitea.io/git/commit_info.go:105 +0x5aa
created by code.gitea.io/gitea/vendor/code.gitea.io/git.getCommitsInfo
        /home/clandmeter/go/src/code.gitea.io/gitea/vendor/code.gitea.io/git/commit_info.go:247 +0x29c

@clandmeter
Copy link

@lunny yes after retesting boltdb it seems broken.

Regarding redis, it looks like it doesn't cache anything. Compared to memory caching you can see the second run will do none or almost no git lookups anymore. With redis this keeps happening making it much slower. Funny thing is is does check for redis availability because if i turn it of it will return error 500.

I have some small remarks for your reference.

To my understanding a redis lookup should be much faster than a git lookup. So if you enable global cache with redis why not automatically enable it for the other cache options?

Also what does the following mean:

repos which have less than commits number will enable this cache, 0 will always cache if TYPE is not empty

This reads as repo with more than 1000 commits (default setting) have caching disabled?
The amount of commits in a tree has nothing to do with this caching option, or should i read this differently?

Thanks for looking into it, and ill keep following this PR.

@filipnavara
Copy link
Contributor

After re-running my Bloom filter experiment from December based on this article I think that's a better approach for solving the problem. It offers a way to accelerate the listing for any branch or history point. I realise the burden of proof is on me and unfortunately I may not be able to deliver it in timely manner. This may still be the way to go short-term but if someone decides to invest time into it there are viable alternatives. I'm now going through the discussion on the Git mailing list to read up on the various attempts to implement it and whether it got any traction since last year.

@filipnavara
Copy link
Contributor

filipnavara commented May 1, 2019

I still don't have much to share about my Bloom filter experiments in terms of code, but I do have some preliminary numbers from testing them on the Linux repository. Since this PR was started Gitea underwent some changes that changed some performance characteristics, mostly for the better. Migrating to go-git allowed me to experiment with caching some intermediate results that can efficiently help answer many different types of history queries such as the one used for repository listing.

This is all driven by the GitHub enigineers philosophy:

Generally speaking, caching specific results to queries is a weak approach to
performance in complex systems. What you want to do instead is caching
intermediate steps of the computation, to be able to efficiently answer any kind
of query.

Listing the recent commits in Linux repository has few challenges:

  1. To answer the question "What files are the root directory at HEAD revision and when were they changed last time?" it is necessary to traverse 8954 commits. This actually means loading 8954 commit objects and just as many tree objects from the repository.
  2. The Linux repository is highly packed and these objects are often delta encoded so the actual number of objects that have to be loaded is higher than that.
  3. go-git is not too efficient for huge pack files with delta encoded objects. Due to some implementation decisions about caching objects by their respective hash loading a delta encoded object using offset-delta encoding requires building a reverse offset-hash mapping that is quite expensive.

How can this be addressed?

  • For most of the commit objects we don't actually need to load the whole object. We only care about the tree hash and the parent commits. Only for the last few commits we actually display we care about the commit message, autor and other properties. Turns out that this is already something tackled in regular Git by a precomputed commit-graph file. It stores minimal information needed for walking the commit graph in a form that is efficient for random access. In the case of Linux repository this file takes around 46 Mb (pack file is 2,5Gb + 200Mb index) for all commits in the repository in all branches. Using the commit-graph file saves loading nearly all of the commit objects. Work is underway to implement the support for this in go-git, but Gitea changes are needed to take full advantage of it (eg. scheduling when to rebuild the commit-graph files and executing it in the background).
  • The number of loaded tree objects could be reduced by querying Bloom filters that can efficiently answer whether a certain path may have been changed in a given commit. With 99% probability (and 0% false negatives) the filters can correctly answer whether we need to load and compare two given commits or whether we can skip over to the parent commit straight away. In Linux repository this allows to skip loading trees for roughly 4738 of the 8954 commits (for comparison, in Rails repository it filters out 19432 of 22997 commits). The file size of the stored bloom filters is roughly 26 Mb.
  • The go-git bottleneck with building the reverse offset-hash map for a packfile can be eliminated entirely. It's just a bit tricky to do so while keeping the external API shape intact. I've done some changes upstream that significantly improved the time to build this map, but it still takes a little over 1 second to do so.

Overall, with all the above performance enhancements I was able to get the load time of the repository listing page down to respectable 1.2 second. This general performance improvement translates to different listings, older points in history or other branches.

@clandmeter
Copy link

@lunny any idea when this is going to be finalized/implemented?

@lunny
Copy link
Member Author

lunny commented May 10, 2019

@clandmeter will continue work on this PR. I wish this could be released in v1.9

@stale
Copy link

stale bot commented Jul 9, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

@stale stale bot added the issue/stale label Jul 9, 2019
@kolaente
Copy link
Member

kolaente commented Jul 9, 2019

@lunny any updates on this?

@stale stale bot removed the issue/stale label Jul 9, 2019
@lunny
Copy link
Member Author

lunny commented Jul 9, 2019

This PR is outdate and I will send another PR to do some optimization.

@michaelshiel
Copy link

We are evaluating Gitea, and already love the project very much so thank you for that!
Unfortunately, our project is very large and after pushing all branches to my Gitea test server, the main page takes 1m40s to load... are there any updates in this area?

@lunny
Copy link
Member Author

lunny commented Aug 29, 2019

Which gitea version are you using? @michaelshiel

@lunny
Copy link
Member Author

lunny commented Aug 29, 2019

@typeless when you have many sub directories in a directory and many commits, it is still slow.

@michaelshiel
Copy link

michaelshiel commented Aug 29, 2019

Version: Gitea v1.10.0+dev-226-g256b17817 built with GNU Make 4.2.1, go1.12.9 : bindata, sqlite, sqlite_unlock_notify

10,145 commits, 517 branches, and 1.9GB total repo size.
87,911 files in 14,214 folders

@michaelshiel
Copy link

To update, the fault appears to have been on my end. I initially was running the latest image in docker on OSX (with delegated volume mappings), running on Ubuntu 18.04 has no performance problems so far and any page load is sub 1 second. Sorry for the noise, and thanks again!

@stale
Copy link

stale bot commented Oct 29, 2019

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

@stale stale bot added the issue/stale label Oct 29, 2019
@kolaente
Copy link
Member

kolaente commented Nov 1, 2019

@lunny any updates on this?

@stale stale bot removed the issue/stale label Nov 1, 2019
@lunny
Copy link
Member Author

lunny commented Nov 2, 2019

@kolaente I will start another PR after some time. This PR is too far away current master.

@stale
Copy link

stale bot commented Jan 1, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions.

@stale stale bot added the issue/stale label Jan 1, 2020
@lunny lunny added the issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented label Jan 1, 2020
@stale stale bot removed the issue/stale label Jan 1, 2020
@lunny
Copy link
Member Author

lunny commented Jan 30, 2020

This has been replaced by #10069

@lunny lunny closed this Jan 30, 2020
@lunny lunny deleted the lunny/cache_file_last_commit branch January 30, 2020 05:08
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
issue/confirmed Issue has been reviewed and confirmed to be present or accepted to be implemented lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. type/enhancement An improvement of existing functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.