Strip old tarballs from git history #25

Closed
jimhester opened this issue Feb 20, 2016 · 16 comments

Comments

@jimhester
Contributor

The old tarballs have been removed from the working tree but are still present in the history. They make the repo size much larger than it needs to be, as the tarballs can be downloaded from boost directly.
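To see which objects inflate the history, one option is to list the largest blobs reachable from any ref. The following is a minimal sketch using plain git plumbing; the function name `largest_blobs` and its arguments are made up for illustration and are not part of git or BFG.

```shell
# List the N largest blobs reachable from any ref, biggest first.
# Output columns: size in bytes, blob SHA-1, path.
# (largest_blobs is a hypothetical helper name, not a git command.)
largest_blobs() {
  repo=$1
  n=${2:-10}
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file \
      --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    awk '$1 == "blob"' |      # drop commits and trees
    cut -d' ' -f2- |          # drop the object-type column
    sort -k1,1nr |            # largest size first
    head -n "$n"
}
```

Calling `largest_blobs path/to/clone 5` prints the five biggest blobs, including files that were deleted from the working tree but are still stored in older commits.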

Stripping them from the history takes the bare repo size (e.g. git clone --mirror) from ~350 MB to 16 MB on my machine. This will make cloning this repo much faster!

You can do so by downloading the BFG repo cleaner and running the following commands.

git clone --mirror [email protected]:eddelbuettel/bh.git
java -jar bfg.jar --strip-blobs-bigger-than 10MB bh.git
cd bh.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --force

@eddelbuettel
Owner

In favour! Ping me if I don't get to this in a few days.

@eddelbuettel
Owner

Done. One minor correction was that '10m' rather than '10MB' is the size designator.

But something didn't work, see the log -- or is this expected?

edd@max:~/git/bh-new(master)$ git push --force
Counting objects: 19169, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (6663/6663), done.
Writing objects: 100% (19169/19169), 14.67 MiB | 693.00 KiB/s, done.
Total 19169 (delta 12232), reused 19169 (delta 12232)
To [email protected]:eddelbuettel/bh.git
 + 4092d17...5b0588e master -> master (forced update)
 ! [remote rejected] refs/pull/1/head -> refs/pull/1/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/10/head -> refs/pull/10/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/14/head -> refs/pull/14/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/15/head -> refs/pull/15/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/2/head -> refs/pull/2/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/23/head -> refs/pull/23/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/24/head -> refs/pull/24/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/4/head -> refs/pull/4/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/6/head -> refs/pull/6/head (deny updating a hidden ref)
 ! [remote rejected] refs/pull/8/head -> refs/pull/8/head (deny updating a hidden ref)
error: failed to push some refs to '[email protected]:eddelbuettel/bh.git'
edd@max:~/git/bh-new(master)$ 

The remote has received changes, a new checkout is now at 142mb -- as opposed to 660mb.

@eddelbuettel
Owner

That plainly didn't work.

On another machine (the laptop) I updated and am now at 460mb instead of 445mb before, and got four warnings about files over 50mb. Which of course are no longer visible either. Fiddlesticks.

It is also messing with my history. What were three commits to bh on Sunday become six, and now nine.

And it messed with the git log. In 'graph mode' I now have an entire new 'line'.

@eddelbuettel
Owner

Oh for fsck's sake. And GH now shows 288 commits instead of 144. That. Was. Not. A. Good. Idea.

@eddelbuettel
Owner

I had left my main repo checkout untouched / unchanged. I just pushed a backup copy to GitLab just to be safe -- 144 commits, 305mb. As it should be.

Whereas this one is now busted at 288 commits.

Advice? If I nuke this at GH and re-create it, I lose issues and PRs.

@jimhester
Contributor Author

I think you need to re-clone the repo on other machines; you can't just git pull from the previous clone. Did you push from the second machine as well? I think that is where your duplicate commits are coming from.

@eddelbuettel
Owner

I think you need to re-clone the repo on other machines; you can't just git pull from the previous clone.

I don't understand that sentence. If the old one was (is) ~/git/bh and I clone freshly into ~/git/bh-check then the latter does not know about the former.

The latter has git ls | wc -l result in 335 commits whereas the pristine copy has 166. The modified one also has 166 but under different sha1 values --- and once pushed and merged we get 2 x 166 + 1 = 335.
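A rewritten commit gets a new SHA-1, so git treats the old and the rewritten history as two separate lines; merging them keeps both sets of commits plus the merge commit. The sketch below shows one way to check for that situation. The helper names `commit_count` and `share_history` are hypothetical, and it assumes the rewrite changed every commit back to the root.

```shell
# Number of commits reachable from a revision (default: HEAD).
# (commit_count and share_history are hypothetical helper names.)
commit_count() {
  git -C "$1" rev-list --count "${2:-HEAD}"
}

# Succeeds only if the two revisions have a common ancestor.
# git merge-base exits non-zero when it finds none.
share_history() {
  git -C "$1" merge-base "$2" "$3" >/dev/null 2>&1
}
```

If `share_history repo old-head rewritten-head` fails, the two heads have no commit in common, and a merge of the two would contain every commit twice.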

I would love to undo the 'wrong' 166 ones at the merge. Sadly THIS REPO now has all 335. How do I get rid of them without losing history, issues, ... and other metadata I'd lose by deleting the whole repo?

@jimhester
Contributor Author

Add a new commit to the head of the clean repo and git push --force as Gabor suggested.
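A rough sketch of that suggestion, run from the untouched clone, could look like this. The function name `force_push_clean`, the empty commit, and the default remote and branch names are assumptions made for illustration; a force-push overwrites the remote branch, so it is worth keeping a backup first.

```shell
# Add an empty commit on top of the clean clone and force-push it,
# so the remote branch points at the clean history again.
# Arguments: path to the clean clone, remote name, branch name.
# (force_push_clean is a hypothetical helper; the empty commit is an
# assumption, any new commit on the clean head would do.)
force_push_clean() {
  repo=$1
  remote=${2:-origin}
  branch=${3:-master}
  git -C "$repo" commit -q --allow-empty -m "Restore clean history" &&
    git -C "$repo" push -q --force "$remote" "HEAD:refs/heads/$branch"
}
```

Other clones that still hold the unwanted commits should be re-cloned afterwards rather than pulled, otherwise a later push can bring those commits back.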

@eddelbuettel
Owner

That seems to have done it! Thanks @jimhester and @gaborcsardi.

I'll close this issue now. I added a workaround to the README.md. Jim was actually the first person foolish^Hbrave enough to do a full PR -- everybody else who wanted a new Boost library just filed an issue. So I leave the monster size for now.

@gaborcsardi

I guess you can try to get rid of the large files in the history in another branch, so you don't mess with master. If it is successful this time, you can just rename the branches. This might mess up open PRs, but the rest should be OK I think.

@eddelbuettel
Owner

I am still open to doing this but having been burned twice by such git filtering approaches, I would need a simpler / better / more reliable "script" to follow.

The goal is to keep this repo with issue ticket history, but filter master. I do not know how to do that. I seem to be able to filter a repo and push it somewhere else with an altered history, but that is not the goal.

@gaborcsardi

I am not sure what you mean by 'history'. The commit hashes will change. There is simply no way of keeping them.

But this is fine, imo. You can keep the branch as oldmaster or something, so the hashes in issues and elsewhere still point to something meaningful.

Forks and pull requests need to rebase or maybe even re-fork.

I'll give it a try once I get to a better internet connection; I am on (slow) public wifi today.
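The local half of that branch swap might look like the sketch below. It assumes the filtered history already exists as a branch in the same clone; `swap_in_filtered` and the branch name `filtered` are hypothetical, and publishing the renamed branches on the hosting side is not covered here.

```shell
# Keep the old history under a backup name and make the filtered
# branch the new main branch, all inside one local clone.
# Arguments: repo path, filtered branch, main branch, backup name.
# (swap_in_filtered is a hypothetical helper name.)
swap_in_filtered() {
  repo=$1
  filtered=$2
  main=${3:-master}
  backup=${4:-oldmaster}
  git -C "$repo" branch -m "$main" "$backup" &&
    git -C "$repo" branch -m "$filtered" "$main"
}
```

After `swap_in_filtered path/to/clone filtered`, the old commits stay reachable through `oldmaster`, so hashes quoted in issues can still be looked up there.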

@eddelbuettel
Owner

'history' was a sloppy term. "Preserve as much as I can" from the existing repo -- as opposed to starting over with a fresh one with filtered code. That includes the history and sequence of commits, but then under different sha1 ids.

Maybe the branch switch is the element I was missing. But force pushing back into master I ended up with everything double.

@gaborcsardi

No force pushing. Put the filtered repo in a new branch and then rename branches. You'll also need to rename locally or just reclone.

@eddelbuettel
Owner

Issue #34 with this contributed script did the trick. Many thanks to @Enchufa2 for providing it.

@Enchufa2

Enchufa2 commented Dec 5, 2016

You're welcome. :-)
