Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

17 MB / 90 MB repo size!? #96

Closed
nanoant opened this issue Jun 22, 2015 · 29 comments · Fixed by #2081
Closed

17 MB / 90 MB repo size!? #96

nanoant opened this issue Jun 22, 2015 · 29 comments · Fixed by #2081
Assignees
Labels
kind: question solution: proposed fix a fix for the issue has been proposed and waits for confirmation

Comments

@nanoant
Copy link

nanoant commented Jun 22, 2015

Even 17 MB download is not really that big, it is really unexpected to download that much for single-header json library. Moreover once cloned it takes 90MB on the disk, even more surprising.

And the reason for that are these benchmark test files introduced by cb873a4. Is it really good to put such large files in the repo!? Unfortunately once you have put it then it is not easy anymore to reduce repo size. You will need to rewrite.

Anyway please consider rewriting the repo, and keeping it clean & lean, then likely everyone can use it as submodule.

@nlohmann
Copy link
Owner

Hi @nanoant,

any idea how to reorganize without fragmenting stuff around into other repos or to other sites? I understand your concern, but at least now everything is in one place. In particular, the benchmark helped to improve at least one pull request.

Cheers,
Niels

@nanoant
Copy link
Author

nanoant commented Jun 22, 2015

@nlohmann My concern is just these huge benchmark input files that blow the repo size, these may be just moved out from the main repo to some other repo, and then cloned on first benchmark run (in Makefile). Anything else is already pretty small.

Right now I am probably going for embedding json.hpp directly into our repo rather than using submodule for this reason.

@nlohmann
Copy link
Owner

Would a special development branch help? Then I could move most of the stuff from there and leave the master branch small. (Sorry, I have little experience with Git.)

@nanoant
Copy link
Author

nanoant commented Jun 22, 2015

@nlohmann I think the problem is that once the object (huge file) is there in the repo then there is no difference in which branch it resides, the clone will still fetch whole repo including huge file.

I was able to reduce repo size to 3.4 MB with recipe taken from: http://manuel.manuelles.nl/blog/2011/12/22/shrinking-your-git-repository/

But still this is lots of effort, so I probably you should skip that, and leave the repo as is.

@azadkuh
Copy link

azadkuh commented Sep 20, 2016

@nlohmann
what do u think about breaking the repo into:

  • library repo: only contains the single header file
  • utilities repo: contain all other stuff (docs, tests, ...)

the library repo will be easy to be integrated into other 3rdparty codes (by git submodules, subtree ...)

to prevent radical changes to the current workflow, you can create another repo (a single header) as release repo and let people use that repo as dependency simply by git or cmake.

@nlohmann
Copy link
Owner

This split would make the development a bit inconvenient on my side. As I am doing this alone in my spare time, I would like to keep it as is.

@nlohmann
Copy link
Owner

Why would you include a single-header project as submodule?

@gregmarr
Copy link
Contributor

That's how I use it. That way the link is explicit, and it tracks a particular version, rather than manually extracting the one header file from the repository with no indication that it came from another repo, and have to deal with the possibility that someone may modify the local copy. However, I have no problem with the size of this project as a submodule.

@nanoant
Copy link
Author

nanoant commented Feb 23, 2017

@nlohmann you can consider using git subtree split to move out the tests (can be optional submodule of main) and other big subfolders out of main repo, then you force push small master bash to main repo. This will break obviously other ppl forks, but I guess it will bring more good than harm overall.

@nlohmann
Copy link
Owner

@nanoant To me, this splitting would be inconvenient, and I like having everything in one repository.

@fnc12
Copy link

fnc12 commented Feb 23, 2017

Guys. I think the best solution is to have a package manager that will download whole repo, insert json.hpp into headers dir and remove repo

@tdegeus
Copy link

tdegeus commented Apr 15, 2017

To add to the discussion I will just add my argumentation here. I want to use "json" as a submodule to my project, because:

  1. I want to be transparent is ownership, to give credits where they are due.
  2. To facilitate updating from source.

However, the current repository is extremely large, due to the tests and benchmarks. This hinders again the distribution of my code. Therefore it would be great if a minimal repository can be maintained.

@therocode
Copy link

therocode commented Jul 5, 2017

I'll also chime in here - I also like using even single-header libs as a submodule to be able to track what version I'm using and easily be able to update by just pulling the submodule. And yes, for me it is also problematic that 90mb is downloaded. It takes ages, even on my fast connection and this is gonna hit everyone using my repo as well, including autobuilders. This memory/bandwidth cost of 50+ mb just coming from a single test file is ridiculous.

@azadkuh
Copy link

azadkuh commented Jul 6, 2017

@nlohmann

just tried to pack every released json.hpp into this repo:
azadkuh/nlohmann_json_release

# size in KB

$> curl https://api.github.com/repos/nlohmann/json | grep size
   "size": 136956,

$> curl https://api.github.com/repos/azadkuh/nlohmann_json_release | grep size
   "size": 100,

@njlr
Copy link
Contributor

njlr commented Sep 8, 2017

I would prefer not to maintain my own minified fork (or depend on anyone else's).

Wouldn't a submodule solution meet everyone's requirements?

  • Move tests to their own repo
  • Library has tests as a Git submodule
  • Minimal fetch is then git clone [email protected]:nlohmann/json.git
  • Maxmimal fetch is then git clone --recursive [email protected]:nlohmann/json.git

@theodelrieu
Copy link
Contributor

I believe this would make everybody happy indeed.

@dan-42
Copy link
Contributor

dan-42 commented Sep 19, 2017

I know this is already closed, but as the discussion is still going on, I want to share my thoughts

I wonder why this is needed? I mean @nlohmann is right it is nice to have everything in one place for development and IMHO this is the what this repo is for.
If you use this as a submodule then you want to use also the tests and so on as you are developing for this lib.
And after all it is a single header library so how can you get confused which version you have?

But to have the luxury of a git-submodule I would propose a different approach.
Create a sepeate repo only for released versions and only having the very view things that are needed (Readme, license, src, CMake, ...)
This repo can be maintained by the community and will not affect any development on this repo.

PRO

CON

  • separate repo from actual development
  • extra "small" work for maintaining

@astoeckel
Copy link

astoeckel commented Feb 4, 2018

Just to add to the discussion: I have created a repository at https://github.com/astoeckel/json which tracks the released json.hpp.

Note however that this repository is bare-bones and tailored towards my own needs. It just contains src/json.hpp and meson.build, as well as the licence and a minimal readme. Feel free to use the provided Python script update.py to create your own version of the tracking repository.

JloveU added a commit to JloveU/json_include that referenced this issue Mar 8, 2019
nlohmann/json commit:
21516f2bae552a49cc1ba1c11746be3730361d8d

Reference:
nlohmann/json#96
@nickaein
Copy link
Contributor

As of now, the size of repository has increased to whopping 435 MBs. The biggest offenders are test, benchmarks, and git history itself (.git):

isaac@host:~/temp/json# du -k -d 1
8       ./.circleci
724     ./single_include
44      ./.github
816     ./include
5908    ./doc
8       ./cmake
179368  ./.git
300     ./third_party
59388   ./benchmarks
197800  ./test
444664  .
root@host1:~/temp/json#

To reduce the size of test and benchmarks directory, we have to move their large files (mostly json/data files) into another source (e.g. a separate git repository). The questions is now how to populate and integrate them during testing/benchmarking, I can think of these two options:

  1. Download them during the build by CMake's ExternalProject command from the external source (e.g. Git repo, HTTP server, etc).

  2. Add them as a git submodule to the main repo.

I believe using ExternalProject is a better option since the resources will be downloaded automatically when building benchmarks/tests is triggered. The best part is that user does not have to do anything whether he/she wants to run the test/benchmarks or not.

We can later deal with the size of .git and try to rewrite the git history to remove the commits related to large files. Since these files has been mostly static, number of commit related to them is probably very low. Therefore, the other contributors shouldn't have much trouble (if any) in rebasing their clone with the revised upstream repo.

@nlohmann
Copy link
Owner

Thanks for investigating! I have little time in the moment, but this is definitely a way to move forward!

@theodelrieu
Copy link
Contributor

I had a PR closed by the stale bot which used ExternalData, I don't have the time to look into it right now, so if somebody wants to go back from where I left, don't hesitate.

#1251

@nickaein
Copy link
Contributor

@nlohmann: Sure! I've created a PR and the preliminary tests seems promising.

@theodelrieu: Thanks for pointing that out! I will look into it.

Since this change can affect the workflow of the users, there is the possibility that I've overlooked some use-cases. Therefore, besides passing the CI tests, hearing others' experience would important to make sure this change will not have negative effects.

@nlohmann
Copy link
Owner

@nickaein @theodelrieu @gregmarr

Hi all, I finally had time to think about this issue and thing we should move forward with it.

As a first working draft, I created a branch https://github.com/nlohmann/json/tree/tests_external which loads the tests from a separate repository via

    include(FetchContent)
    FetchContent_Declare(nlohmann_json_tests
        GIT_REPOSITORY https://github.com/nlohmann/json_tests_tmp.git
        GIT_SHALLOW TRUE
    )

    FetchContent_GetProperties(nlohmann_json_tests)
    if(NOT nlohmann_json_tests_POPULATED)
        message(STATUS "Download tests")
        FetchContent_Populate(nlohmann_json_tests)
        message(STATUS "Download tests - done")
        add_subdirectory(${nlohmann_json_tests_SOURCE_DIR}/test)
    endif()

As a result, the tests folder can be removed from the repository, leaving a much smaller footprint.

However, I am not yet sure whether this is the best approach. In particular, I am afraid that tests and code get out of sync as a PR in the original repo cannot simply add tests on its own, but would need a PR in the tests repo first. Alternatively, I could only move the test files to the separate repo, but leave the JSON files in the separate repo.

Any ideas on this?

@nlohmann nlohmann reopened this Apr 29, 2020
@t-b
Copy link
Contributor

t-b commented Apr 29, 2020

I think splitting code and tests into a separate repositories is a bad idea. Newer git versions have "sparse checkout" support, see https://www.git-scm.com/docs/git-sparse-checkout, so it would maybe possible to not checkout the tests folder.

For general users the various package managers should also help.

@azadkuh
Copy link

azadkuh commented Apr 29, 2020

@nlohmann
this repository already has a great but large history, IMHO this repository is just fine for contributors with all test stuff and ..., but the final users may only need a single header file, and any other thing means YAGNI.

it may be a better idea to have a separate repository for released versions of json.hpp, this approach is both easier for you and other contributors to work on and manage this repo and also it's easier for final users to fetch a light repository with a single header file.

BTW every solution has its own pros/cons

@nickaein
Copy link
Contributor

nickaein commented May 1, 2020

@nlohmann
That's great! I'm not 100% sure about this solution yet, but seems be a good compromise.

Library and tests can get out of sync

I am afraid that tests and code get out of sync as a PR in the original repo cannot simply add tests on its own, but would need a PR in the tests repo first.

I agree. There is also the problem that after a while a user may want to checkout an older version of the library while the latest json_tests repo might not be backward-compatible with that version (e.g. a test file has been deprecated and removed in some commit).

Possible solution: We might have to (internally) tie the library version to a json_tests commit/tag. In FetchContent_Declare this can be done by using GIT_TAG. This can mitigate the out of sync problems as the PR author can explicitly point to a specific json_tests commit/tag (almost always the latest tag as the time of creating PR).

I could only move the test files to the separate repo, but leave the JSON files in the separate repo.

I believe you meant to keep the source code for tests here and only move the large JSON files into the separate repo.

I totally agree with this approach. The changes in the library and test source code are related in meaningful way and there is semantic to see them in single commit (or a single PR).

On the other hand, since the large JSON data for tests doesn't have much semantics and they rarely change, we can have them as an external repo.

Which one? test or benchmark data

There is also ~58 MBs worth of data for benchmarks. While they have smaller impact (compared to ~194 MBs of test data), we can move them if you want to be conservative.

In future though, both of them can be moved to an external repo.

Why?

Putting aside the technicalities, I believe we should consider the use-cases which make the users raise this issue. In some cases the user would be happy if along the single-header release, README, LICENSE, and CMakeList.txt (probably a stripped-down version) is also provided. For instance: #1066 and #1184.

Currently, they find the contents of releases inadequate and fall back to cloning the whole repo. This can be avoided if we add such files to the zip releases. I'm not sure if there are other considerations though.

@nickaein
Copy link
Contributor

nickaein commented May 1, 2020

@t-b
Sparse checkout is an interesting solution! However, I wonder if it could be effective in the scenarios that people are complaining about repo size. The users can already use this solution, yet we still see complains. In addition, their hand might be tied in using this feature (e.g. in a automated environment like CI where they can only specify a URL to git repo).
And lastly, we still increase the size of .git directory in future whenever we add/modify test data.

Nevertheless, this solution can probably help some of the users. We can recommend this (as a temporary solution at least), until further notice.

@azadkuh

but the final users may only need a single header file, and any other thing means YAGNI.

There is already single header files available for such users at Releases page. Does providing them in a separate repo have any benefits?

@nlohmann
Copy link
Owner

nlohmann commented May 1, 2020

Thanks for the comments!

  • Right now, I think it would be best to only move the test and benchmark data to a different repository and leave the actual code in the main repository. They are updated rather infrequently, and maybe one can even mark a subset of tests that can be run without external data - these tests can then still be executed without any download.
  • I am not sure whether the contents of the releases need to be adjusted. I think most complaints about the size came from people who just added the library as submodule and never cared about releases in the first place.

@nlohmann nlohmann linked a pull request May 3, 2020 that will close this issue
@nlohmann
Copy link
Owner

nlohmann commented May 3, 2020

Hi all, please have a look at #2081 where I implemented the approach of having a separate test data repository.

@nlohmann nlohmann added the solution: proposed fix a fix for the issue has been proposed and waits for confirmation label May 3, 2020
@nlohmann nlohmann self-assigned this May 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind: question solution: proposed fix a fix for the issue has been proposed and waits for confirmation
Projects
None yet
14 participants