feat(performance): change data.json to trie #12711
Conversation
/cc @Moocar
This is super interesting! @Moocar is working on a related change to lazy load page metadata, which will make this PR as it stands unmergeable, but moving to a trie is definitely needed and we'd love your help moving forward to land it! We basically don't use reach's matching algorithm at all, so that's not a problem. The main thing we need to support is detecting whether a route should be handled client side. Otherwise it's just exact matches.
That'll be awesome. Do you have any word on the state of @Moocar's change? Would it be worth turning that into a kind of "page manifest improvements" epic and having me rebase my branch onto there?
Maybe better names (or use an object/struct):
I do agree that the key names aren't great. But I was thinking that since it's just a serialised version of the site's pages, it didn't necessarily need to be human readable and should be as small as possible. I'll wait to hear what's going to happen with this feature before making any more changes.
Hey @bennetthardwick, cool stuff! The work I'm doing is best summarized by #11982. In a nutshell, there won't be a global page manifest, and I've also separated the match-paths into their own collection.
So the logic to find a page is now as follows:
I'm still working on it, but I think the above strategy will mean we won't need a more complex data structure to look up paths. The number of pages in memory will depend on how many links a user has clicked, and how many have been prefetched as a result. I'm also assuming that most sites have very few matchPaths. But if that assumption is wrong, then a trie-based lookup could potentially be used for that part.
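If I'm reading the strategy right, a minimal sketch of it might look like the following (the names `loadedPages`, `matchPaths`, and `matchesSplat` are hypothetical, not the actual implementation):

```javascript
// Sketch of the lazy lookup strategy described above: only pages the user has
// visited or prefetched are kept in memory, exact matches are checked first,
// then the (assumed small) list of matchPaths is scanned linearly.
const loadedPages = new Map(); // path -> page metadata, filled as links are clicked/prefetched
const matchPaths = [];         // [{ matchPath, path }], assumed to be few per site

function matchesSplat(matchPath, rawPath) {
  // Minimal splat support only: "/app/*" matches any "/app/..." path.
  if (matchPath.endsWith("*")) {
    return rawPath.startsWith(matchPath.slice(0, -1));
  }
  return matchPath === rawPath;
}

function findPage(rawPath) {
  // 1. Exact match against pages already in memory: O(1).
  if (loadedPages.has(rawPath)) {
    return loadedPages.get(rawPath);
  }
  // 2. Linear scan of matchPaths: fine if there are only a handful.
  for (const { matchPath, path } of matchPaths) {
    if (matchesSplat(matchPath, rawPath)) {
      return loadedPages.get(path) || null;
    }
  }
  return null;
}
```

Under this model the linear scan only ever touches the matchPath entries, not every page on the site, which is why a trie would mainly matter for sites with many matchPaths.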
It's worth noting that the trie data structure is literally built for this task. Whilst yes, it's not as necessary, it's very possible that someone will have a hundred or so links on a single page. The added complexity is negligible compared to the potential gains, even in a case where only linked pages are listed.
@me4502 If a site defines hundreds of matchPath (regex) paths, then for sure, the trie will start to help. But for most sites, I'm assuming there will only ever be a few of those, in which case the new lookup should be fast enough. Let me know if I'm missing something. I'm definitely up for introducing a trie for that lookup if it proves necessary.
Hey @Moocar, I'm really excited about these changes, because they address another one of our pain points (the size of the page manifest).
@bennetthardwick Great to hear! Hard to say when the first cut will be ready, but hopefully in the next week. I'll try and remember to ping this PR once it's up, as I'd really appreciate some testing on large sites.
Hey @Moocar, any updates on this? We set a site live last week with about 25k pages using this change, and it fixed all our performance issues. You can test out the site here if you like. We still feel like using this data structure is necessary, even with smaller sites.
@bennetthardwick It's working on my machine (TM). Just documenting, testing, etc., and then I'll be releasing a series of PRs to make the reviewer's job easier (it touches 58 files so far). Since it's such a large change to Gatsby, I don't imagine it will be merged soon. But I'll shoot through the branch name when it's ready for testing. Hopefully this week (for real this time). Great to hear that the trie works for the large sites. My PR will remove the need for a global page manifest.
@bennetthardwick pinged you on #13004 |
This issue is fixed by #13004 |
Description
Gatsby sites that have a lot of pages (50k–60k) spend a lot of run time in the `findPage` method when prefetching links. This is because the method loops through every member of the `page-manifest.json` file to search for matching paths.

To fix this, I've changed the `page-manifest.json` file to use a trie-based data structure. This allows near-constant-time lookups and path matching, rather than the O(n) cost of an array search. It could also be used in the future for lazy loading chunked `page-manifest.json` files for particular path segments, and for optimising the dev 404 page into a tree view, since it struggles to load when rendering large quantities of pages.

One concern I have with this change is that it removes the `@reach/router` match function. While the trie version does support splats (`*`) for `matchPath`, it doesn't support more complex routes such as route parameters (which are not currently used by Gatsby).

You can see the benefits of the change by using the `gatsby-dev-cli` with the gatsby-intense-benchmark. Our company has been working on a Gatsby site which has 30k pages at the moment, with the intention of scaling to 100k+. We believe improvements to runtime performance are crucial to meeting this goal.
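To illustrate the idea, a minimal segment-based trie with splat support might look like this (a sketch under my own naming, not the PR's actual code; `PathTrie` and its node shape are hypothetical):

```javascript
// Minimal path trie sketch: paths are split on "/" and stored one segment per
// node; a "*" child acts as a splat that matches all remaining segments,
// mirroring the splat support described for matchPath.
class PathTrie {
  constructor() {
    this.root = { children: {}, page: null };
  }

  insert(path, page) {
    let node = this.root;
    for (const segment of path.split("/").filter(Boolean)) {
      if (!node.children[segment]) {
        node.children[segment] = { children: {}, page: null };
      }
      node = node.children[segment];
    }
    node.page = page;
  }

  find(path) {
    let node = this.root;
    for (const segment of path.split("/").filter(Boolean)) {
      if (node.children[segment]) {
        node = node.children[segment];
      } else if (node.children["*"]) {
        // Splat: matches this segment and everything after it.
        return node.children["*"].page;
      } else {
        return null;
      }
    }
    return node.page;
  }
}
```

Lookup cost here scales with the depth of the requested path (number of segments), not with the total number of pages on the site, which is where the win over a linear scan of 50k+ manifest entries comes from.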
Related Issues