cache.json is large due to containing multiples copies of the same data #345

thescientist13 · 2020-05-03T16:38:56Z

Type of Change

Summary

Moving the discussion from #317 (comment): after the 0.5.0 release, there are some serious concerns with cache.json generation. The Docs page is transferring over 2.6MB of JSON data!

It's being over fetched (why!?)
It's huge! Docs page

This also means all index.html pages are huge too...

Details

Thoughts on quick wins here:

Fix overfetching by caching in the client. (but would also be good to know why this is happening)
Reduce duplication of query results within a single cache.json file (e.g. about/cache.json) because deepmerge just keeps doubling up the previous file contents, resulting in duplicates over and over again?

For something like docs/cache.json, this could bring the size down to 35k! We should also try and add a "budget" spec for this to make sure the size can't explode again without it failing at least.
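
As a rough idea of what the "budget" check could look like, here is a minimal sketch using only Node's built-in fs module. The helper name checkBudget, the file path, and the 50 KB limit in the usage comment are all assumptions for illustration, not anything that exists in the project today.

```javascript
const fs = require('fs');

// Hypothetical helper: compares a file's size on disk against a byte budget.
function checkBudget(filePath, maxBytes) {
  const { size } = fs.statSync(filePath);
  return { size, maxBytes, withinBudget: size <= maxBytes };
}

// Example usage inside a spec (path and limit are assumptions):
// const result = checkBudget('public/docs/cache.json', 50 * 1024);
// if (!result.withinBudget) {
//   throw new Error(`cache.json is ${result.size} bytes, budget is ${result.maxBytes}`);
// }
```

Returning the measured size along with the verdict lets the failing spec report how far over budget the file is.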

Later down the road, we should look to reduce duplication of query results across all cache.json files - see #317


thescientist13 · 2020-05-03T17:04:51Z

Did some more discovery on this today and I realize what the crux of the issue is for the file size concerns, anyway.

So for every page that gets serialized and makes a GraphQL call (e.g. per ${page}-template.js), cache.js will create a JSON file of that query data.

So even for something like /docs/*, which only needs 3 (?) GraphQL calls total, we are getting a ${hash}-cache.json file for each sub-route that is essentially all just duplicates.

So when we deepmerge all that into a single cache.json to load into the client / HTML when serializing... things will explode in size, as we are seeing. 💥
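
To make the blow-up concrete, here is a small self-contained sketch. It does not use the deepmerge package; mergeConcat is a simplified, hypothetical stand-in that concatenates arrays the way deepmerge does by default, and the navigation data is made up.

```javascript
// Simplified stand-in for a deep merge that concatenates arrays
// (hypothetical helper, not the deepmerge package itself).
function isPlainObject(value) {
  return value !== null && typeof value === 'object' && !Array.isArray(value);
}

function mergeConcat(target, source) {
  const result = { ...target };
  for (const [key, value] of Object.entries(source)) {
    const existing = result[key];
    if (Array.isArray(existing) && Array.isArray(value)) {
      result[key] = existing.concat(value); // duplicates pile up here
    } else if (isPlainObject(existing) && isPlainObject(value)) {
      result[key] = mergeConcat(existing, value);
    } else {
      result[key] = value;
    }
  }
  return result;
}

// Three sub-routes that each cached the same (made-up) navigation query result.
const perRouteCaches = [1, 2, 3].map(() => ({
  navigation: [{ label: 'Docs' }, { label: 'Guides' }]
}));

const merged = perRouteCaches.reduce(mergeConcat, {});
console.log(merged.navigation.length); // 6, although only 2 entries are unique
```

With one such file per sub-route, the merged array grows linearly with the number of routes even though the underlying data never changes.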

I recall now as part of #115 one of the reasons for this approach was that trying to read / write to "shared" .json files asynchronously was resulting in thread unsafe writing operations to the same file at the same time, leading to corrupted JSON that would fail to get read / parsed correctly in serialize.js.

So I did try again to generate cache.json files per query / per top level route, and now we see exactly what we would expect in count and size.

// hash against the query instead
const md5 = crypto.createHash('md5').update(query).digest('hex');
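
Spelled out a bit more, the effect of hashing the query is that every route issuing the same query maps to the same file name. A sketch, assuming a hypothetical helper cacheFileNameForQuery and a made-up query string:

```javascript
const crypto = require('crypto');

// Hypothetical helper: derives the cache file name from the query text alone,
// so the route that issued the query no longer affects the name.
function cacheFileNameForQuery(query) {
  const md5 = crypto.createHash('md5').update(query).digest('hex');
  return `${md5}-cache.json`;
}

const navigationQuery = '{ navigation { label link } }'; // made-up query
const fromGuides = cacheFileNameForQuery(navigationQuery);
const fromPlugins = cacheFileNameForQuery(navigationQuery);
console.log(fromGuides === fromPlugins); // true
```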

Of course, this returns us to the problem of overlapping writes from time to time, even if adding something like this

if (!fs.existsSync(targetFile)) {
  await fs.mkdirs(targetDir, { recursive: true });
  await fs.writeFile(path.join(targetFile), cache, 'utf8');
}

query from route /guides/netlify-cms
query hash 157d00b65a839ebbd1a72ae5a3884080
==================================
==================================
==================================
query from route /plugins/
query hash 157d00b65a839ebbd1a72ae5a3884080
==================================
==================================
query from route /guides/cloudflare-workers-deployment
query hash 157d00b65a839ebbd1a72ae5a3884080
==================================
==================================
query from route /plugins/index-hooks
query hash 157d00b65a839ebbd1a72ae5a3884080
==================================
SyntaxError: /Users/owenbuckley/Workspace/project-evergreen/repos/greenwood/public/docs/2bc8e256a25844b37c22af93673c67e3-cache.json: Unexpected token B in JSON at position 3343

I think, though, that since we won't really need deepmerge with this solution, it will be a fair tradeoff to add proper-lockfile instead, and then things should return to normal? (We can of course still continue to optimize data fetching even further, but we can do that in other issues.)
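
For comparison, one way to avoid overlapping writes without a lock library is a synchronous, exclusive-create write using Node's built-in fs. This is only a sketch of an alternative, not the proper-lockfile approach mentioned above; the helper name writeCacheOnce is an assumption.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: creates the file only if it does not exist yet.
// Returns true when this call wrote the file, false when it already existed.
function writeCacheOnce(targetFile, cache) {
  fs.mkdirSync(path.dirname(targetFile), { recursive: true });
  try {
    // 'wx' opens for writing but fails with EEXIST if the path already exists
    fs.writeFileSync(targetFile, cache, { encoding: 'utf8', flag: 'wx' });
    return true;
  } catch (error) {
    if (error.code === 'EEXIST') {
      return false;
    }
    throw error;
  }
}
```

Because the write is synchronous, nothing else in the same Node.js process can interleave with it, and the existence check and the file creation happen in one step instead of the separate existsSync / writeFile calls shown earlier.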

thescientist13 · 2020-05-06T00:39:03Z

So as far as the overfetching issue goes, I think the basic client logic is mostly right.

develop

In development mode, window.__APOLLO_STATE__ is correctly undefined, and so the app should use a "real" Apollo client (backupQuery) to speak to the real running GraphQL server.

build (serve)

When running yarn serve, window.__APOLLO_STATE__ is correctly defined. However, it shouldn't have to make any fetch calls, or, to the point you made on the call, it shouldn't need to fetch at all on first load.

BUT.... look. 👀 👇

the <header> component is logging just as many render calls as we're seeing overfetching calls!?

Analysis

To compare, if we look to see what the render logging looks like in develop mode, we see what we would have expected...

Which would be at most two renders, in theory:

One with the initial value of navigation as an empty array []
One when the data is ready from cache (in memory or otherwise) and renders the navigation query data

So to unpack everything

technically, the if / else is correct, so that's good (for me anyway, haha)
we should still cache requests in memory
something unrelated to client.js is actually causing all the overfetching: lit-html. Not sure if it's related to change detection or something? Maybe we need to try its experimental hydration support and lit-ssr? Either way, not sure why it would need to re-render so many times.

In theory, if hydration is working, then if the HTML is already serialized / in the DOM, in theory there would only ever need to be one render? 🤔

Next Steps

Cache (in memory) requests to cache.json files, to help reduce repeated fetch calls (as part of this issue)
Figure out how to skip that first call and use window.__APOLLO_STATE__ instead (as part of this issue)
Make an issue to track why so many renders are happening? I'm not sure if it's because we're serializing HTML and then also loading it through JavaScript?
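
For the in-memory caching step, a minimal sketch could look like the following. The helper name createCachedFetcher is an assumption, and the fetch-like function is passed in so the idea is not tied to how client.js actually loads cache.json.

```javascript
// Hypothetical helper: wraps a fetch-like function so that each URL is
// requested at most once; repeat callers share the same promise.
function createCachedFetcher(fetchJson) {
  const requests = new Map(); // url -> Promise of parsed JSON

  return function getCache(url) {
    if (!requests.has(url)) {
      const request = fetchJson(url).catch((error) => {
        requests.delete(url); // allow a retry after a failed request
        throw error;
      });
      requests.set(url, request);
    }
    return requests.get(url);
  };
}

// Example usage in the browser (URL is an assumption):
// const getCache = createCachedFetcher((url) => fetch(url).then((res) => res.json()));
// getCache('/docs/cache.json').then((data) => { /* render with data */ });
```

Storing the promise instead of the resolved data means that several components rendering at the same time share a single in-flight request.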

thescientist13 · 2020-05-07T00:43:04Z

Made some follow up issues

Components in serve mode are over-rendering - components are over rendering #348
Production build not rehydrating cache.json from window.__APOLLO_STATE__ - data client is not rehydrating from existing WINDOW.APOLLO_STATE on initial load #349
cache previously fetched _cache.json requests in memory - reduce duplicate fetches for cache.json by storing in client.js memory #347

thescientist13 added bug Something isn't working P0 Critical issue that should get addressed ASAP Content as Data labels May 3, 2020

thescientist13 added this to the MVP milestone May 3, 2020

thescientist13 assigned thescientist13 and hutchgrant May 3, 2020

thescientist13 mentioned this issue May 3, 2020

code splitting / chunking of cache.json files #317

Closed

5 tasks

thescientist13 added the CLI label May 3, 2020

thescientist13 changed the title ~~cache.json is unoptimized - files are being overfetched and too large~~ cache.json is unoptimized - files are being overfetched and are too large May 3, 2020

thescientist13 mentioned this issue May 7, 2020

make cache file writing sync and hashed by query #346

Merged

thescientist13 changed the title ~~cache.json is unoptimized - files are being overfetched and are too large~~ cache.json is large due to containing multiples copies of the same data May 7, 2020

This was referenced May 7, 2020

components are over rendering #348

Closed

data client is not rehydrating from existing WINDOW.APOLLO_STATE on initial load #349

Open

thescientist13 closed this as completed in #346 May 15, 2020

thescientist13 added the v0.5.1 label May 15, 2020
