Repo Size #13699

Closed
moonmeister opened this issue Apr 29, 2019 · 14 comments
Labels
type: maintenance An issue or pull request describing a change that isn't a bug, feature or documentation change

Comments

@moonmeister
Contributor

Summary

Cloning the Gatsby repository is becoming a little absurd due to its size. The git clone transfers 550 MB compressed, and uncompressed on disk the repo is 854 MB (this takes several minutes to clone even on a decent connection). There has been one attempt to fix this in the past (#6486), though I'm not sure whether they rewrote git history to purge the large files.

Getting this reduced would help everyone but would also encourage contributions from countries and areas where slower connections are the norm. At this point the repository size is not helping.

Relevant information

There are 92 files in the repo over 1 MB.
Compressed (by git) size: 550 MB
Size on Disk: 850+ MB

124K    ./examples/no-plugins
172K    ./examples/using-asciidoc
244K    ./examples/hn
316K    ./examples/simple-auth
136K    ./examples/no-trailing-slashes
164K    ./examples/using-contentful
100K    ./examples/using-glamor
1.7M    ./examples/image-processing
100K    ./examples/feed
84K     ./examples/using-styled-jsx
164K    ./examples/using-typescript
156K    ./examples/sitemap
144K    ./examples/using-faker
72K     ./examples/using-styletron
112K    ./examples/using-stylus
5.9M    ./examples/using-remark-copy-linked-files
176K    ./examples/using-emotion-prismjs
168K    ./examples/using-redirects
188K    ./examples/using-page-transitions
88K     ./examples/using-styled-components
120K    ./examples/using-path-prefix
780K    ./examples/using-gatsby-image
2.5M    ./examples/using-sqip
516K    ./examples/using-shopify
148K    ./examples/using-css-modules
176K    ./examples/using-mobx
104K    ./examples/using-sass
108K    ./examples/using-excel
272K    ./examples/styleguide
468K    ./examples/using-type-definitions
180K    ./examples/using-drupal
172K    ./examples/using-wordpress
1004K   ./examples/using-javascript-transforms
100K    ./examples/using-csv
128K    ./examples/client-only-paths
84K     ./examples/using-emotion
132K    ./examples/using-hjson
544K    ./examples/using-jest
164K    ./examples/using-redux
368K    ./examples/using-i18n
144K    ./examples/using-mongodb
88K     ./examples/using-cxs
88K     ./examples/using-medium
332K    ./examples/graphql-reference
156K    ./examples/using-multiple-providers
96K     ./examples/using-page-loading-indicator
14M     ./examples/gatsbygram
72K     ./examples/using-jss
300K    ./examples/using-gatsby-source-graphql
152K    ./examples/using-local-plugins
108K    ./examples/using-gatsby-without-graphql
140K    ./examples/using-prefetching-preloading-modules
128K    ./examples/using-js-search
13M     ./examples/using-remark
46M     ./examples
60K     ./infrastructure/functions
128K    ./infrastructure
4.0K    ./.git/branches
44K     ./.git/refs
529M    ./.git/objects
16K     ./.git/info
56K     ./.git/logs
140K    ./.git/hooks
531M    ./.git
44K     ./www/plugins
28K     ./www/__mocks__
1.7M    ./www/static
4.1M    ./www/src
6.0M    ./www
104K    ./integration-tests/long-term-caching
412K    ./integration-tests/gatsby-pipeline
544K    ./integration-tests
48K     ./scripts/site-up-checker
28K     ./scripts/add-npm-owner
28K     ./scripts/check-publish-access
28K     ./scripts/get-unowned-packages
60K     ./scripts/gatsby-plugin-checker
364K    ./scripts
112K    ./packages/gatsby-plugin-subfont
104K    ./packages/gatsby-transformer-pdf
192K    ./packages/gatsby-plugin-page-creator
112K    ./packages/gatsby-remark-custom-blocks
128K    ./packages/gatsby-plugin-styletron
148K    ./packages/gatsby-transformer-xml
976K    ./packages/gatsby-source-wordpress
124K    ./packages/gatsby-remark-copy-linked-files
204K    ./packages/gatsby-plugin-sitemap
264K    ./packages/gatsby-cli
244K    ./packages/gatsby-source-drupal
108K    ./packages/gatsby-source-medium
212K    ./packages/gatsby-plugin-postcss
192K    ./packages/gatsby-remark-images
132K    ./packages/gatsby-remark-responsive-iframe
180K    ./packages/gatsby-plugin-catch-links
152K    ./packages/gatsby-source-graphql
128K    ./packages/gatsby-plugin-flow
128K    ./packages/gatsby-plugin-typescript
108K    ./packages/gatsby-source-hacker-news
268K    ./packages/gatsby-plugin-offline
204K    ./packages/gatsby-plugin-feed
144K    ./packages/gatsby-plugin-fullstory
168K    ./packages/gatsby-plugin-netlify
120K    ./packages/babel-preset-gatsby-package
120K    ./packages/gatsby-remark-katex
196K    ./packages/gatsby-transformer-documentationjs
116K    ./packages/gatsby-cypress
136K    ./packages/gatsby-plugin-nprogress
216K    ./packages/gatsby-remark-code-repls
548K    ./packages/gatsby-remark-prismjs
132K    ./packages/gatsby-plugin-preact
164K    ./packages/gatsby-remark-autolink-headers
696K    ./packages/gatsby-plugin-sharp
260K    ./packages/gatsby-transformer-react-docgen
196K    ./packages/gatsby-source-mongodb
296K    ./packages/gatsby-transformer-remark
172K    ./packages/gatsby-plugin-netlify-cms
4.7M    ./packages/gatsby
160K    ./packages/gatsby-plugin-facebook-analytics
108K    ./packages/gatsby-transformer-javascript-frontmatter
280K    ./packages/gatsby-source-filesystem
148K    ./packages/gatsby-transformer-excel
144K    ./packages/graphql-skip-limit
180K    ./packages/gatsby-remark-graphviz
112K    ./packages/gatsby-source-npm-package-search
160K    ./packages/gatsby-plugin-layout
64K     ./packages/gatsby-plugin-no-sourcemaps
172K    ./packages/gatsby-plugin-sass
104K    ./packages/gatsby-transformer-asciidoc
132K    ./packages/gatsby-plugin-remove-trailing-slashes
1.3M    ./packages/gatsby-source-lever
120K    ./packages/gatsby-plugin-jss
120K    ./packages/gatsby-plugin-cxs
1.4M    ./packages/gatsby-transformer-sqip
240K    ./packages/gatsby-telemetry
140K    ./packages/gatsby-transformer-toml
124K    ./packages/gatsby-source-wikipedia
124K    ./packages/gatsby-remark-smartypants
148K    ./packages/gatsby-transformer-yaml
124K    ./packages/gatsby-plugin-emotion
152K    ./packages/gatsby-transformer-json
136K    ./packages/gatsby-plugin-guess-js
208K    ./packages/gatsby-image
136K    ./packages/gatsby-transformer-hjson
108K    ./packages/babel-preset-gatsby
144K    ./packages/gatsby-react-router-scroll
152K    ./packages/gatsby-plugin-canonical-urls
144K    ./packages/gatsby-transformer-csv
116K    ./packages/gatsby-plugin-styled-jsx
120K    ./packages/gatsby-plugin-styled-components
208K    ./packages/gatsby-transformer-sharp
708K    ./packages/gatsby-codemods
140K    ./packages/gatsby-plugin-react-helmet
160K    ./packages/gatsby-plugin-coffeescript
104K    ./packages/gatsby-plugin-lodash
188K    ./packages/gatsby-link
260K    ./packages/gatsby-dev-cli
45M     ./packages/gatsby-transformer-screenshot
104K    ./packages/gatsby-source-faker
108K    ./packages/gatsby-transformer-javascript-static-exports
140K    ./packages/gatsby-plugin-google-analytics
128K    ./packages/gatsby-plugin-glamor
140K    ./packages/gatsby-plugin-twitter
108K    ./packages/gatsby-plugin-google-tagmanager
148K    ./packages/gatsby-remark-images-contentful
604K    ./packages/gatsby-source-contentful
152K    ./packages/gatsby-plugin-google-gtag
108K    ./packages/gatsby-plugin-react-css-modules
188K    ./packages/gatsby-plugin-less
108K    ./packages/gatsby-plugin-create-client-paths
148K    ./packages/babel-plugin-remove-graphql-queries
176K    ./packages/gatsby-plugin-stylus
160K    ./packages/gatsby-plugin-typography
156K    ./packages/gatsby-source-shopify
144K    ./packages/gatsby-remark-embed-snippet
272K    ./packages/gatsby-plugin-manifest
69M     ./packages
52K     ./benchmarks/plugin-manifest
124K    ./benchmarks/markdown
100K    ./benchmarks/create-pages
88K     ./benchmarks/query
368K    ./benchmarks
1.7M    ./starters/blog
576K    ./starters/hello-world
1008K   ./starters/default
3.2M    ./starters
134M    ./docs/blog
12M     ./docs/tutorial
912K    ./docs/contributing
39M     ./docs/docs
28K     ./docs/features
4.8M    ./docs/creators
190M    ./docs
76K     ./plop-templates/package
64K     ./plop-templates/example
144K    ./plop-templates
344K    ./themes/gatsby-theme-blog-mdx
112K    ./themes/gatsby-theme-blog-core
1.1M    ./themes/gatsby-theme-blog
64K     ./themes/gatsby-starter-theme-blog-mdx
1.6M    ./themes
40K     ./.github/ISSUE_TEMPLATE
68K     ./.github
316K    ./e2e-tests/path-prefix
1.1M    ./e2e-tests/development-runtime
1.8M    ./e2e-tests/gatsby-image
720K    ./e2e-tests/production-runtime
3.8M    ./e2e-tests
1.7M    ./flow-typed/npm
1.7M    ./flow-typed
16K     ./.forestry/snippets
32K     ./.forestry/front_matter
64K     ./.forestry
20K     ./.circleci
854M    .
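
For anyone who wants to reproduce these numbers, here is a minimal sketch. It assumes it is run from inside a clone; the function name `large_tracked_files` and the 1 MiB default threshold are illustrative choices, not part of any Gatsby tooling.

```shell
# Sketch: list tracked files at or above a size threshold, largest first.
# Assumes a POSIX-style shell with `local`; paths containing newlines
# are not handled.
large_tracked_files() {
  local min_bytes="${1:-1048576}"   # hypothetical default: 1 MiB
  local path size
  git -c core.quotePath=false ls-files |
    while IFS= read -r path; do
      [ -f "$path" ] || continue              # skip submodules and deleted files
      size=$(wc -c < "$path" | tr -d ' ')     # tr strips the padding BSD wc adds
      if [ "$size" -ge "$min_bytes" ]; then
        printf '%s\t%s\n' "$size" "$path"
      fi
    done |
    sort -rn
}
```

Piping the result through `wc -l` gives the number of files over the threshold, and `git count-objects -vH` reports the packed size of the local object store. The per-directory listing above looks like `du -h` output limited to two directory levels, though the exact command was not stated.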

Optional Solutions

  • Clean up stale branches that are not needed
    My thought was to start by deleting branches that have merged or closed PRs. Anything that hasn't had a PR and is older than a certain age we could maybe delete (or give the author notice that it will be deleted). Anything with an open PR can be left.

  • Clean up detached commits and other git things that don't matter -
    I've run across this, but I don't entirely understand what it is doing or whether there are other things that could be done.

  • Compressing images and purge from history - http://blog.jessitron.com/2013/08/finding-and-removing-large-files-in-git.html
    The trouble here is that people might just keep adding large files. I'm not sure whether it's possible to write a script, triggered by git hooks, that compresses any images being added to the repo.

  • Fix gatsby-transformer-screenshot so it doesn't need to bundle Chrome - @Ankcorn
    Chrome bundle is 44MB.

  • Git-LFS - this has been brought up before, and we'd need to look into the effect on ease of contributions.

  • Move images out of the repo -
    If LFS isn't an option, maybe moving to a CMS like Contentful that could handle assets would be a better alternative. If that's not an option, website/images/blog could be moved to its own repository.
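
Before deciding what to purge from history, it helps to see which blobs are actually the largest. The sketch below uses only stock git plumbing; the function name `largest_blobs` and the default of 20 results are assumptions made for illustration.

```shell
# Sketch: print "<size-in-bytes> <path>" for the largest blobs reachable
# from any ref, including files that were later deleted.
largest_blobs() {
  local count="${1:-20}"   # hypothetical default: top 20
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    grep '^blob ' |
    cut -d' ' -f2- |
    sort -rn |
    head -n "$count"
}
```

A tool such as BFG (for example `bfg --strip-blobs-bigger-than 1M`) or a history rewrite would then be needed to actually remove those blobs. Locally, `git reflog expire --expire=now --all` followed by `git gc --prune=now` drops unreachable objects, but that only affects the local clone.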

Prompts

What methods are we okay to move ahead with?

If a method has been given the go ahead and you want to tackle a method let us know and submit a PR...

What other options are there for reducing repo size that we can consider?

@moonmeister moonmeister added type: question or discussion Issue discussing or asking a question about Gatsby 💡 Proposed Work labels Apr 29, 2019
@anantoghosh
Contributor

I would like to mention bfg https://rtyley.github.io/bfg-repo-cleaner/ which has worked very well for me.

@DSchau
Contributor

DSchau commented Apr 30, 2019

Agreed on a lot of this, and I think this is certainly worth considering, but your initial numbers are a little off.

We recommend using a shallow clone (e.g. --depth 1), which gets the repo size down to 246.98 MB (on my machine?)
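
A minimal sketch of the shallow clone mentioned here. The wrapper name `shallow_clone` is made up for illustration; the underlying `git clone --depth 1` is standard git.

```shell
# Sketch: clone only the most recent commit of the default branch,
# then report how much space the object store takes.
shallow_clone() {
  local url="$1" dest="$2"
  git clone --depth 1 "$url" "$dest" &&
    du -sh "$dest/.git"
}
```

For this repository the call would presumably be `shallow_clone https://github.com/gatsbyjs/gatsby.git gatsby`. Note that `--depth` implies `--single-branch`, so only the default branch is fetched.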

I think you raise some work that's worth doing, specifically investigation into whether we can avoid bundling Chrome.

As far as the others:

Cleaning up branches

(not as concerning, and in cloning you only get the master branch, right?)

Compressing images and purge from history

I get a little squeamish about re-writing history, but if this can be done cleanly (e.g. on a test repo/fork of Gatsby perhaps?) I'd be open to it.

Git-LFS

This is for truly huge files, right? I'm not sure we have many that would be worthy of this. Also - I want to keep ease of contribution in the forefront of our minds.

In general, cloning is a one-time, sunk cost, so I'd urge us to not make changes that improve the initial set-up cost, but degrade or complicate the experience down the line.

Move images out of the repo

We feel pretty strongly that whenever possible we want to keep content in the monorepo, so same idea here!

@moonmeister
Contributor Author

moonmeister commented May 1, 2019

@DSchau Yes, using --depth 1 is helpful, but how is the developer experience? That means you're only downloading the latest copy. I don't think you can even checkout other branches, and if you're trying to look at commit history you might really be lost. This might be useful for some, but I'm not sure it solves this problem in its entirety.

Cleaning up branches will help a full clone. Git makes copies of all changed files...any branches with commits will add to the size of the repository as a whole.
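
As a starting point for the branch cleanup, the sketch below lists remote branches whose tips are already contained in the base branch. The function name and the `origin/master` default are assumptions; a plain `git clone` without `--single-branch` or `--depth` does fetch every branch, so these do count toward a full clone.

```shell
# Sketch: list remote-tracking branches fully merged into a base branch.
# Branches merged via squash or rebase will NOT show up here, because
# their original commits are not ancestors of the base.
merged_remote_branches() {
  local base="${1:-origin/master}"   # hypothetical default base
  git branch -r --merged "$base" |
    sed 's/^[ *]*//' |
    grep -v -- '->' |
    grep -v -x -- "$base"
}
```

Each name printed could then be reviewed and, if appropriate, removed with `git push origin --delete <branch>`.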

I agree that we can't trade long-term usability for upfront costs. I won't worry about LFS, or additional repo solutions.

One thing your repo size from using depth tells me is that there is a lot in history that probably needs cleaning up. Second, it also tells me that those ~90 files that are over 1 MB represent a significantly larger portion of the repo size than I initially thought. Chrome is a fifth of the repo, it seems.

I think right now the things we should focus on are:

  • Getting chrome removed
  • compressing images and videos as best as possible (and make this run on git hooks)
  • removing old versions of these from history (@anantoghosh Yes I've seen bfg but never used it, do you have experience here, are you able to assist on this?)
  • cleaning up unused branches (I'll start by deleting closed/merged branches if that's okay?)
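
On the git hooks idea: the sketch below does not compress anything. It is a simpler guard that refuses a commit when a staged file exceeds a size limit, which is a different technique from the compression proposed above. The function name and the 1 MiB default are assumptions.

```shell
# Sketch: fail when any added or modified staged file is larger than
# max_bytes. Sizes are read from the index, not the working tree.
# Paths containing newlines are not handled.
check_staged_file_sizes() {
  local max_bytes="${1:-1048576}"   # hypothetical default: 1 MiB
  local offenders path size
  offenders=$(
    git -c core.quotePath=false diff --cached --name-only --diff-filter=AM |
      while IFS= read -r path; do
        size=$(git cat-file -s ":$path")
        if [ "$size" -gt "$max_bytes" ]; then
          printf '%s (%s bytes)\n' "$path" "$size"
        fi
      done
  )
  if [ -n "$offenders" ]; then
    printf 'Staged files larger than %s bytes:\n%s\n' "$max_bytes" "$offenders" >&2
    return 1
  fi
  return 0
}
```

Called from an executable `.git/hooks/pre-commit`, a non-zero return aborts the commit. Hooks in `.git/hooks` are local and are not distributed by cloning, so the project would still need some way of installing the hook for contributors.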

@thecodingaviator
Contributor

Netlify has recently released a new feature for serving large images; I'm not sure if it is for large files too. I see we're hosted on Netlify, so we can make use of that feature.

@Daniel15
Contributor

Daniel15 commented May 16, 2019

If you really do want large files in a Git repo (instead of using something like Git LFS), another option is to store the large assets in a separate repo and pull it in as a submodule. Then at least people that don't want/need the static files can just avoid updating the submodules when pulling.

That Netlify feature looks pretty useful though!

@gatsbot gatsbot bot added the stale? Issue that may be closed soon due to the original author not responding any more. label Jun 6, 2019
@gatsbot gatsbot bot closed this as completed Jun 17, 2019
@thecodingaviator thecodingaviator added not stale and removed stale? Issue that may be closed soon due to the original author not responding any more. labels Jun 17, 2019
@axe312ger
Collaborator

axe312ger commented Aug 13, 2019

Just came around here since I am on a new machine. Can definitely confirm:

I don't think you can even checkout other branches

Had to git fetch origin gatsby-transformer-video to check out a specific branch. This now takes the same time again that master took 😅

edit: Fetching and checking out the FETCH_HEAD did not help; I still could not rebase/merge.
Tried several things from Stack Overflow... in the end I took the time to check out the whole repo to be able to work with branches 😓
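
For anyone hitting the same wall, here is a sketch of one way to work with another branch from a `--depth 1` clone. The function names are hypothetical and the remote is assumed to be called `origin`. Rebasing or merging needs a common ancestor, which a depth-1 history does not contain, so the second function fetches the missing history.

```shell
# Sketch: make one extra branch available in a shallow, single-branch clone.
fetch_branch_into_shallow_clone() {
  local branch="$1" depth="${2:-1}"
  # Track the branch on future fetches as well.
  git remote set-branches --add origin "$branch" &&
    git fetch --depth "$depth" origin \
      "+refs/heads/$branch:refs/remotes/origin/$branch" &&
    git checkout "$branch"
}

# Sketch: turn the shallow clone into a full one (downloads all history
# for the tracked branches, so this costs about as much as a full clone).
unshallow_clone() {
  git fetch --unshallow origin
}
```

A middle ground is `git fetch --deepen <n> origin`, which extends the existing history by n commits without fetching everything.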

@wardpeet
Contributor

wardpeet commented Sep 4, 2019

@eyalroth brought this up as well in #16889

I just had a small fix I wanted to contribute to this repository, but the process was difficult, time consuming and daunting, and all because of the extremely cumbersome process of setting up the repository locally.

My setup process

  1. Upon cloning the repository I encountered a git problem I've never seen before: "The remote end hung up unexpectedly". Eventually I managed to work around the problem by increasing http.postBuffer and using the HTTPS clone link instead of the SSH one, but that obviously took a few attempts and quite some time to resolve.
  2. I had to wait for the entire repo to download, which is a bit more than 600MB. On a slower-than-normal connection (5-10 Mbps) this could take up to 20 minutes.
  3. After installing Yarn, which was pretty straightforward, I encountered a problem which seems to take a variety of different forms. Troubleshooting this problem took an immense amount of time, since I had to clean the Yarn cache and yarn install multiple times for each workaround attempt. Also there was the --network-concurrency 1 attempt which took forever.
    I eventually resorted to upgrading my Windows from 10.0.17134.799 to 10.0.18362.295 in a final desperate attempt to resolve these issues by upgrading WSL. Lucky for me, that worked, but took quite a long time (couple of hours).
  4. On to the next Yarn error: "There appears to be trouble with your network connection. Retrying..." (+#5259). This was luckily resolved faster than the previous error with the --network-timeout workaround, but still took more than just a few minutes to fix.
  5. Finally, yarn install finished without any error, but that alone took more than 10 minutes.
  6. yarn run bootstrap takes a few more minutes to complete.
  7. Finally, some tests (less than 20) fail. I didn't bother making them pass since they were not related to the package/plugin I was intending to fix.
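
The workarounds from steps 1, 3 and 4 can be sketched as below. The function names and the numeric values (a 500 MB buffer, a 600000 ms timeout) are illustrative assumptions, not values given in the report; the git config key and the two Yarn flags are the ones named above.

```shell
# Sketch: clone with a larger HTTP buffer for this one command only,
# without touching the global git config.
clone_with_large_buffer() {
  local url="$1" dest="$2" bytes="${3:-524288000}"   # assumed value
  git -c http.postBuffer="$bytes" clone "$url" "$dest"
}

# Sketch: install dependencies with a longer network timeout (milliseconds)
# and a limited number of concurrent requests (Yarn 1.x flags).
yarn_install_slow_network() {
  local timeout_ms="${1:-600000}" concurrency="${2:-1}"   # assumed values
  yarn install --network-timeout "$timeout_ms" --network-concurrency "$concurrency"
}
```

As the report notes, a concurrency of 1 can make the install very slow, so it is probably worth trying the timeout on its own first.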

Proposal

As I said in the beginning, this is one hell of a bad user experience for anyone who is just a casual contributor, and I assume these problems have led others to eventually give up on contributing to this repository.

I believe there are two major steps that can be taken to dramatically improve this experience.

1. Reduce the repository size

This was very recently discussed in #16722, but I believe the discussion there was missing the fact that the reason this repository is so big is not because it is a mono-repo, but because more than half of it includes the docs directory, and especially the docs/blog directory:

image

(.git and node_modules can obviously be ignored in this demonstration)

There are two possible ways of remedying this: (a) Move the documentation / blog out of the repository, or (b) Make use of Git LFS to basically remove all the documentation / blog images from local repositories; though, I have no previous experience with this tool so I cannot attest to how well it works.

2. Move away from Yarn

The vast majority of the time I spent on the setup process went into solving Yarn problems. I can't say how common these problems are, especially since I am quite new to Node and JavaScript, but from what little research I did I gather that some of these problems are indeed common in large Yarn projects.

The official documentation explains that the reason this repository is using Yarn instead of NPM is mainly due to the "workspaces" feature, but from what I gather that feature is possible to achieve in NPM via Lerna (and some say it is even better).

I understand this is probably quite a major step and requires a lot of work, but if Yarn is going to keep causing problems, perhaps it's better making that transition earlier so it will be easier.

@swyxio
Contributor

swyxio commented Oct 8, 2019

relatedly the initial build time is pretty long from what i recall. no interest in firing it up again to check 😅

@moonmeister
Contributor Author

@sw-yx yeah the site build is ridiculous.

@cpboyd
Contributor

cpboyd commented Nov 7, 2019

Just to add:
As a developer, I like to maintain a local copy of the Gatsby repo in order to track changes and see why my build might have broken after a yarn upgrade.

The --depth 1 as suggested by @DSchau is a one-time only solution. With the monorepo approach, I still get all starter/www and image changes.

There's no way (that I'm aware of) to condense an already cloned repo's history.
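
One approach that may work for an existing clone is to re-fetch with a depth limit and then prune what is no longer reachable. This is a sketch under assumptions: the function name is made up, the remote is assumed to be `origin`, and the default branch name is a guess.

```shell
# Sketch: shorten the history of an existing clone to one commit on the
# given branch, then drop objects that are no longer reachable.
# WARNING: this discards the reflog and any unreachable local commits.
# Other local branches, tags or stashes that still point at older
# commits will keep that history (and its objects) alive.
reshallow() {
  local branch="${1:-master}"   # assumed default branch name
  git fetch --depth 1 origin "$branch" &&
    git reflog expire --expire=now --all &&
    git gc --quiet --prune=now
}
```

How much space this frees depends on what else in the clone still references the old history, so the result is worth checking with `git count-objects -vH` afterwards.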

Sure, I could re-clone, but that's an involved process that adds to SSD wear-and-tear as well as network traffic for anyone on metered connections.

I think, as an open-source project, Gatsby should weigh the concerns of all developers including those that might have limited storage or network traffic.

I largely agree with the concept of a monorepo for related code like the official plugins, but I don't see a clear benefit to including the docs, starters, and www in the main repo.

Additionally, with those split out, it would be potentially easier to look through the commit history and see actual code changes.

Finally, GitHub suggests repositories be kept under 1GB, so this is an issue worth considering as this repo approaches that threshold:
https://help.github.com/en/github/managing-large-files/what-is-my-disk-quota#file-and-repository-size-limitations

@moonmeister
Contributor Author

I agree completely, @cpboyd. With the i18n projects being put in their own repos, it'd make sense to move at least docs and www into their own.

@LekoArts LekoArts added type: maintenance An issue or pull request describing a change that isn't a bug, feature or documentation change and removed type: question or discussion Issue discussing or asking a question about Gatsby 💡 Proposed Work labels Nov 13, 2019
@LekoArts
Contributor

I'll close this issue as mostly resolved, as most of the proposed solutions are done.

  • We removed Chrome from gatsby-transformer-screenshot in chore: Update the screenshot Lambda function #20427
  • We removed www from the repo with our site unification
  • We removed the docs/blog as it now lives in WordPress
  • We removed some stale branches
  • We cleaned up our docs a bit in the reorganization and were able to delete a bit (including images)


10 participants