Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache.json is large due to containing multiples copies of the same data #345

Closed
1 of 5 tasks
thescientist13 opened this issue May 3, 2020 · 3 comments · Fixed by #346
Closed
1 of 5 tasks

cache.json is large due to containing multiples copies of the same data #345

thescientist13 opened this issue May 3, 2020 · 3 comments · Fixed by #346
Assignees
Labels
bug Something isn't working CLI Content as Data P0 Critical issue that should get addressed ASAP v0.5.1
Milestone

Comments

@thescientist13
Copy link
Member

thescientist13 commented May 3, 2020

Type of Change

  • New Feature Request
  • Documentation / Website
  • Improvement / Suggestion
  • Bug
  • Other (please clarify below)

Summary

Moving the discussion from #317 (comment), but after the 0.5.0 release, there are some serious concerns with cache.json generation. Docs page is transfering over 2.6MB of JSON data!

  1. It's being over fetched (why!?)
  2. It's huge! Docs page
    image

This also means all index.html pages are huge too...
Screen Shot 2020-05-03 at 11 48 19 AM

Details

Thoughts on quick wins here:

  1. Fix overfetching by caching in the client. (but would also be good to know why this is happening)
  2. Reduce duplication of query results within a single cache.json file (e.g. about/cache.json) because deepmerge just keeps doubling up the previous file contents, resulting in duplicates over and over again?

For something like docs/cache.json, this could bring the size down to 35k! We should also try and add a "budget" spec for this to make sure the size can't explode again without it failing at least.

Later down the road, we should look to reduce duplication of query results across all cache.json files - see #317

@thescientist13 thescientist13 added bug Something isn't working P0 Critical issue that should get addressed ASAP Content as Data labels May 3, 2020
@thescientist13 thescientist13 added this to the MVP milestone May 3, 2020
@thescientist13
Copy link
Member Author

thescientist13 commented May 3, 2020

Did some more discovery on this today and I realize what the crux of the issue is for the file size concerns, anyway.

So for every page that gets serialized, that will make a GraphQL call (e.g. per ${page}-template.js), cache.js will create a JSON file of that query data for every one of those.

So even for something like /docs/*, which only needs 3 (?) GraphQL calls total, we are getting a ${hash}-cache.json file for each sub-route that is essentially all just duplicates.

So when we deepmerge all that into a single cache.json to load into the client / HTML when serializing... things will explode in size, as we are seeing. 💥
Screen Shot 2020-05-03 at 12 53 29 PM

I recall now as part of #115 one of the reasons for this approach was that trying to read / write to "shared" .json files asynchronously was resulting in thread unsafe writing operations to the same file at the same time, leading to corrupted JSON that would fail to get read / parsed correctly in serialize.js.


So I did try again to generate cache.json files per query / per top level route, and now we see exactly what we would expect in count and size.

// hash against the query instead
const md5 = crypto.createHash('md5').update(query).digest('hex');

Screen Shot 2020-05-03 at 12 51 31 PM

Of course, this returns us to the problem of overlapping writes from time to time, even if adding something like this

if (!fs.existsSync(targetFile)) {
  await fs.mkdirs(targetDir, { recursive: true });
  await fs.writeFile(path.join(targetFile), cache, 'utf8');
}
query from route /guides/netlify-cms
query hash 157d00b65a839ebbd1a72ae5a3884080
==================================
==================================
==================================
query from route /plugins/
query hash 157d00b65a839ebbd1a72ae5a3884080
==================================
==================================
query from route /guides/cloudflare-workers-deployment
query hash 157d00b65a839ebbd1a72ae5a3884080
==================================
==================================
query from route /plugins/index-hooks
query hash 157d00b65a839ebbd1a72ae5a3884080
==================================
SyntaxError: /Users/owenbuckley/Workspace/project-evergreen/repos/greenwood/public/docs/2bc8e256a25844b37c22af93673c67e3-cache.json: Unexpected token B in JSON at position 3343

I think though maybe since we won'y really need deepmerge with this solution, and so it will be a fair tradeoff to add proper-lockfile instead, and then I think things will return to normal? (though we can of course still continue to optimize data fetching even further, but we can do that in other issues).

@thescientist13 thescientist13 changed the title cache.json is unoptimized - files are being overfetched and too large cache.json is unoptimized - files are being overfetched and are too large May 3, 2020
@thescientist13
Copy link
Member Author

thescientist13 commented May 6, 2020

So as far as the overfetching issue goes, I think the basic client logic is mostly right.

develop

In development mode, window.__APOLLO_STATE__ is correctly undefined, and so the app should use a "real" Apollo client (backupQuery) to speak to the real running GraphQL server.
develop-fetch-calls-queries

build (serve)

When running yarn serve, window.__APOLLO_STATE__ is correctly defined, however, of course it shouldn't have to make any fetch calls, or to the point you made on the call, it shouldn't need to fetch at all on first load.
serve-fetch-calls-logging-queries

BUT.... look. 👀 👇
serve-fetch-calls-logging

the <header> component is logging just as many render calls as we're seeing overfetching calls!?

Analysis

To compare, if we look to see what the render logging looks like in develop mode, we see what we would have expected...
develop-fetch-calls-renders

Which would be at most two renders, in theory:

  1. One with the initial value of navigation as an empty array []
  2. One when the data is ready from cache (in memory or otherwise) and renders the navigation query data

So to unpack everything

  • technically, the if / else is correct, so that's good (for me anyway, haha)
  • we should still cache requests in memory
  • something unrelated to client.js is actually causing all the over fetching. lit-html. Not sure if it's related to change detection or something? Maybe we need to try its experimental hydration support and lit-ssr? Either way, not sure why it would need to re-render so many times?

In theory, if hydration is working, then if the HTML is already serialized / in the DOM, in theory there would only ever need to be one render? 🤔

Next Steps

  1. Cache (in memory) requests to cache.json files, to help reduce repeated fetch calls (as part of this issue)
  2. Figure out how not to make that first call, to know it should use WINDOW.APOLLO_STATE instead (as part of this issue)
  3. Make an issue to track why so many renders are happening? I'm not sure if it's because we're serializing HTML and then also loading it through JavaScript?

@thescientist13 thescientist13 changed the title cache.json is unoptimized - files are being overfetched and are too large cache.json is large due to containing multiples copies of the same data May 7, 2020
@thescientist13
Copy link
Member Author

Made some follow up issues

  1. Components in serve mode are over-rending - components are over rendering #348
  2. Production build not rehydrating cache.json from WINDOW.APOLLO_STATE - data client is not rehydrating from existing WINDOW.APOLLO_STATE on initial load #349
  3. cache previously fetched _cache.json requests in memory - reduce duplicate fetches for cache.json by storing in client.js memory #347

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CLI Content as Data P0 Critical issue that should get addressed ASAP v0.5.1
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants