LFS quota exceeded #323

Open
idavydov opened this issue Aug 31, 2023 · 27 comments

Comments

@idavydov
Collaborator

idavydov commented Aug 31, 2023

Hi all,

Besca contains references to some LFS files. As far as I understand, when someone git clones the repo, LFS files are also pulled.

We are currently over the bandwidth quota, which means that none of the bedapub projects can push LFS files. When I tried to push LFS files into a bedapub repo, I got this message:

batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.
error: failed to push some refs

As far as I understand, these files are only in history and are now stored elsewhere.

What I propose currently is to rewrite history and remove all the references to those files. The upside is that in one month LFS will hopefully be usable again. The files will still be in storage, but at least bedapub will have 700 MB for other projects to use.

Once history is rewritten, everyone should clone the repository again; otherwise there is a risk of accidentally pushing the large files again.

An alternative solution would be to get budget for LFS.

What do you think?

CC @hatjek @kohleman @swalpe

Here's the list of references:

$ git lfs ls-files --deleted --all
75f416ec45 - besca/datasets/data/pbmc_storage_processed_downsampled.h5ad
81fae1e6df - besca/datasets/data/pbmc_storage_raw_downsampled.h5ad
e325371e52 - besca/datasets/data/pbmc3k_filtered.h5ad
ba1c7d7ade - besca/datasets/data/pbmc3k_processed.h5ad
bd409cc9c1 - besca/datasets/data/pbmc3k_raw.h5ad
49381ba23b - besca/datasets/data/pbmc_storage_raw_downsampled.h5ad
quota usage image
@hatjek
Contributor

hatjek commented Sep 4, 2023

Hi @idavydov ,

Thank you for bringing this up! Yes, the files are not needed anymore. Please go ahead.

For everyone's reference, the ones we still need are now here: https://zenodo.org/record/4441679

Best
Klas

@idavydov
Collaborator Author

idavydov commented Sep 4, 2023

hi @hatjek ,

I tried rewriting the history using this command:

git lfs ls-files --deleted --all | awk '{print $3}' | xargs -n 1 basename | xargs -n 1 bfg --delete-files

and it seems to work fine. see here
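
The `basename` step in the command above is needed because `bfg --delete-files` matches on file names, not on paths inside the repository. A minimal sketch of just that name-extraction step, assuming the `<oid> <marker> <path>` layout shown in the `git lfs ls-files` listing earlier and paths without spaces; the function name `lfs_basenames` is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read `git lfs ls-files` style lines ("<oid> <marker> <path>")
# on stdin and print each distinct file name once.
lfs_basenames() {
    awk '{print $3}' |
    while IFS= read -r path; do
        basename "$path"
    done |
    LC_ALL=C sort -u
}

# Intended use (not run here; needs git-lfs and bfg installed):
#   git lfs ls-files --deleted --all | lfs_basenames | xargs -n 1 bfg --delete-files
```

Deduplicating the names first avoids running bfg twice for a file that appears under several object IDs, as `pbmc_storage_raw_downsampled.h5ad` does in the list.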

Regarding other large files, it's mainly .ipynb files, which I assume have large images inside them.

large files
$ git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  sed -n 's/^blob //p' |
  sort --numeric-sort --key=2 |
  cut -c 1-12,41- |
  $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest | tail -n 40
97e431a73c58  5.4MiB docs/source/tutorials/notebook3_batch_correction.ipynb
d638bf0f327a  5.7MiB docs/source/tutorials/notebook3_batch_correction.ipynb
bafc39047d18  5.7MiB docs/source/tutorials/notebook3_batch_correction.ipynb
0b97479b5881  6.1MiB _sources/tutorials/notebook3_batch_correction.ipynb.txt
51f43b650c61  6.3MiB docs/source/tutorials/scRNAseq_tutorial.ipynb
45689e94a985  7.1MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
1c01c36962c7  7.1MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
01a6bf88f042  7.1MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
9b8a465fb2ca  8.0MiB _sources/tutorials/auto_annot_tutorial.ipynb.txt
51be539e6595  8.4MiB workbooks/standard_workflow_besca2.ipynb
cd5eeb5b16ff  8.4MiB workbooks/standard_workflow_besca2.ipynb
c59ae5e69555  8.5MiB _sources/tutorials/notebook2_celltype_annotation.ipynb.txt
f7f3bc94f7e2  8.5MiB docs/source/tutorials/notebook2_celltype_annotation.ipynb
2f48e7e99c2c  8.7MiB workbooks/celltype_annotation_besca.ipynb
c5a41b3be673  8.7MiB workbooks/standard_workflow_besca2.ipynb
a3f12a4c94ce  9.7MiB docs/source/tutorials/scRNAseq_tutorial.ipynb
20d8fdec04b3  9.7MiB _sources/tutorials/scRNAseq_tutorial.ipynb.txt
b01c6f995942  9.7MiB docs/source/tutorials/scRNAseq_tutorial.ipynb
337d0653f135  9.7MiB docs/source/tutorials/scRNAseq_tutorial.ipynb
adc2ae24eada   13MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
0969babbf685   13MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
fecb616889d7   13MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
c428c5f9124b   13MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
a3a4a47e8126   14MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
895c62a874b6   14MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
5692c254135a   14MiB docs/source/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb
5c542712cc40   14MiB _sources/tutorials/notebook2_celltype_annotation_pbmc3k.ipynb.txt
6984f59c213c   15MiB docs/source/tutorials/auto_annot_tutorial.ipynb
e49417729826   15MiB docs/source/tutorials/auto_annot_tutorial.ipynb
53db68a5f50f   15MiB docs/source/tutorials/auto_annot_tutorial.ipynb
3d0794c7f180   17MiB workbooks/celltype_annotation_besca.ipynb
94e160d40a47   20MiB workbooks/celltype_annotation_besca.ipynb
f8fb905ca97c   20MiB workbooks/celltype_annotation_besca.ipynb
0e52ef41e5ac   28MiB workbooks/celltype_annotation_besca.ipynb
d62f63ea3811   28MiB workbooks/celltype_annotation_besca.ipynb
7b1e56d09835   28MiB workbooks/celltype_annotation_besca.ipynb
08094d2a98d6   28MiB workbooks/celltype_annotation_besca.ipynb
755fe978fb5c   30MiB workbooks/celltype_annotation_besca.ipynb
b1a5a6014c9e   30MiB workbooks/celltype_annotation_besca.ipynb
833f44074681   43MiB workbooks/02- Annotating Cellines.ipynb
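
To check the impression that notebooks dominate, the blob sizes can be totalled per file extension. This is only a sketch: it assumes input in the `blob <sha> <size> <path>` form produced by the `git cat-file --batch-check` format string used above, and `size_by_extension` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: sum blob sizes (bytes) per file extension.
# Input lines: "blob <sha> <size-in-bytes> <path>"; other object types are ignored.
size_by_extension() {
    awk '
        $1 == "blob" {
            path = $0
            sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)   # keep the path only (it may contain spaces)
            n = split(path, parts, "/")
            file = parts[n]
            ext = "(none)"
            if (match(file, /\.[^.]+$/)) ext = substr(file, RSTART)
            total[ext] += $3
        }
        END { for (e in total) printf "%.0f %s\n", total[e], e }
    ' | LC_ALL=C sort -k1,1nr -k2,2
}

# Possible use inside a clone (not run here):
#   git rev-list --objects --all |
#     git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
#     size_by_extension | head
```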

Some of these notebooks are small in the current master, but some are not.

I can do more cleanup while I'm at it, but it doesn't make much sense with these huge notebooks still present in the current version.

E.g., this one doesn't even render for me anymore.

image

@idavydov
Collaborator Author

idavydov commented Sep 4, 2023

I think the solution could be something like jupytext. In this case, only the source code of the notebook is stored in the repository, and the actual notebook is available via github pages.

But I have never done this in practice, and it might be quite some work to implement.

@hatjek
Contributor

hatjek commented Sep 5, 2023

I think we should clear the notebooks before submitting them to git. This might be a good solution:
https://github.com/srstevenson/nb-clean
What do you think?
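
Whichever cleaner ends up being used, a size check before committing can catch notebooks that still carry their outputs. The following is not nb-clean itself, only a complementary guard sketched under assumptions: the function name `oversized_files` and the 1 MiB limit in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical guard: print every given file larger than a byte limit
# and return non-zero if any was found.
# Usage: oversized_files LIMIT_BYTES FILE...
oversized_files() {
    limit=$1
    shift
    status=0
    for f in "$@"; do
        [ -f "$f" ] || continue
        size=$(($(wc -c < "$f")))    # arithmetic expansion strips padding some wc versions add
        if [ "$size" -gt "$limit" ]; then
            echo "$f: $size bytes (limit $limit)"
            status=1
        fi
    done
    return "$status"
}

# Possible use in .git/hooks/pre-commit (assumes notebook names without spaces):
#   oversized_files 1048576 $(git diff --cached --name-only --diff-filter=AM -- '*.ipynb') || exit 1
```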

@idavydov
Collaborator Author

idavydov commented Sep 5, 2023

Sure, but then you lose all the cell output. Is that OK for how you intend to use the Jupyter notebooks?

@hatjek
Contributor

hatjek commented Sep 5, 2023

Yes. I think they are compiled for the documentation using gh-pages, see https://github.com/bedapub/besca/tree/gh-pages/tutorials
@kohleman Do you know if we need the cell content of those tutorial notebooks in the main branch?

@kohleman
Collaborator

kohleman commented Sep 5, 2023

Hi, @hatjek is right. This is needed for the documentation.

@idavydov
Collaborator Author

Hi @hatjek ,

So as far as I understand, the contents of the cells are needed for converting the notebooks to HTML.

What would be your suggestion then? Should I go ahead with cleaning up only the LFS references?

Best,
Iakov

@kohleman
Collaborator

kohleman commented Sep 11, 2023 via email

@idavydov
Collaborator Author

Hi @hatjek ,

We had a brief chat with @kohleman.

A full cleanup is possible, but it would require changes in how the documentation is generated. Basically, all notebooks need to be rendered via GitHub CI. In theory it should be something like this.

That's definitely doable, but it requires a substantial amount of work from someone actively working with besca.
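
As a rough idea of what a CI rendering step could involve, here is a sketch that executes each notebook and writes HTML with `jupyter nbconvert`. It is not the besca documentation build; the function name, the directory arguments and the `DRY_RUN` switch are assumptions, and by default the commands are only printed.

```shell
#!/bin/sh
# Hypothetical CI step: execute each notebook and write HTML into an output directory.
# With DRY_RUN=1 (the default here) the commands are only printed.
render_notebooks() {
    src_dir=$1
    out_dir=$2
    for nb in "$src_dir"/*.ipynb; do
        [ -e "$nb" ] || continue     # the glob matched no notebooks
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "jupyter nbconvert --to html --execute --output-dir $out_dir $nb"
        else
            jupyter nbconvert --to html --execute --output-dir "$out_dir" "$nb"
        fi
    done
}

# Possible use in a CI job (not run here; the paths are placeholders):
#   DRY_RUN=0 render_notebooks docs/source/tutorials build/html/tutorials
```

Executing in CI is what allows the committed notebooks to be stored without output, at the cost of needing the data and dependencies available to the CI runner.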

There are two options now:

  1. I force push the modified branch to solve the LFS quota problem, and we create another ticket to perform a full cleanup later, when someone can dedicate time to it.
  2. Someone from the besca team actively gets involved, and we solve both problems with a single history rewrite. I can give some general tips on CI rendering, but my experience is limited to R Markdown, so this could be quite a bit of work. @kohleman unfortunately doesn't have time to work on this at the moment.

What would be your preferred option?

Best,
Iakov

@hatjek
Contributor

hatjek commented Sep 25, 2023

Dear @idavydov ,
Thank you for looking into this in more detail and for your help! I would suggest to continue with option 1.
Best
Klas

@idavydov
Collaborator Author

idavydov commented Nov 9, 2023

Hi @hatjek & @kohleman

Sorry for the delay, this was less pressing for me since I managed to find a way not to use LFS for my project.

I tested this again, seems to work fine (see here)

Shall I go ahead and force push?

  1. This implies that everyone should clone the repo one more time (and stop using the old clone).
  2. All the open pull requests would need to be rebased on top of the current master.

If someone forgets to do this, the files will reappear in history.
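
For a branch that was created from the old history, the rebase mentioned above could look roughly like this. It is a sketch under assumptions: `old_base` stands for the commit in the old history that the branch forked from, `new_base` for its counterpart in the rewritten history (for example the new tip of the default branch), and the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: replay only the commits in old_base..branch on top of new_base,
# so that nothing from the old history is carried along.
# Usage: rebase_onto_rewritten BRANCH OLD_BASE NEW_BASE
rebase_onto_rewritten() {
    branch=$1
    old_base=$2
    new_base=$3
    git rebase --onto "$new_base" "$old_base" "$branch"
}

# Possible use after fetching the rewritten history (the names are placeholders):
#   git fetch origin
#   rebase_onto_rewritten my-feature <old-fork-point> origin/master
```

A plain `git rebase origin/master` may not be enough here, because Git could then try to replay old commits that exist only in the pre-rewrite history.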

Best,
Iakov

@hatjek
Contributor

hatjek commented Nov 10, 2023

Hi @idavydov ,

Yes, I think we should go ahead. My understanding is that only pull requests from an old clone would disrupt the cleanup. Can we introduce stricter rules for pull requests (currently, 1 review is required, but I think it can easily be overruled)?

And would it help to create a new "main" branch (instead of "master") as default?

Best
Klas

@idavydov
Collaborator Author

Hi @hatjek ,

We can require more than one review. And I think changing the default branch is a great idea.

Let me try that. If we see that this didn't affect the quota (because whatever is pulling master is configured specifically for that branch), we might have to rename/remove the old master.

Best,
Iakov

@idavydov
Collaborator Author

Ok, this is done. The new default branch is now main.

@kohleman @swalpe @hatjek please rebase all the branches you plan to merge on top of main.

Let's check in a month whether the quota situation has improved.

@idavydov
Collaborator Author

complete

@idavydov
Collaborator Author

I now also converted the old master into a tag archive/master. This way it's easy to restore if needed.
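
Converting a branch into a tag can be sketched as follows. This is an assumption about the general technique, not a record of the exact commands that were run; the function name is hypothetical and the remote is assumed to be called `origin`.

```shell
#!/bin/sh
# Hypothetical helper: keep a branch tip reachable under a tag, then drop the local branch.
# The branch must not be the one currently checked out.
archive_branch_as_tag() {
    branch=$1
    git tag "archive/$branch" "refs/heads/$branch" &&
    git branch -D "$branch"
}

# Remote counterpart (not run here):
#   git push origin "archive/$branch"
#   git push origin --delete "$branch"
# Restoring later:
#   git branch "$branch" "archive/$branch"
```

Such a tag keeps the complete old history reachable, including any LFS pointers recorded in it.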

@hatjek
Contributor

hatjek commented Nov 10, 2023

Great, thanks a lot!!

@idavydov
Collaborator Author

Apparently it didn't work, and we still have problems with the quota.

@idavydov idavydov reopened this Mar 14, 2024
@idavydov
Collaborator Author

So I think here's what happens. We still have branches and tags that reference bad commits (commits with LFS files in their history).

An approach to fix it would be to:

Find all the tags and branches referencing bad commits:

git lfs ls-files --deleted --all | awk '{print $3}' | xargs -I {} git log --all --pretty=tformat:"%H" -- {} | sort -u > commits.txt
while read -r c; do git branch --all --contains "$c"; done < commits.txt | sort -u > branches.txt
while read -r c; do git tag --contains "$c"; done < commits.txt | sort -u > tags.txt
branches.txt
  remotes/origin/20230417-David-stdwfFormat
  remotes/origin/SigUpdata_PCS
  remotes/origin/certifi
  remotes/origin/crispr
  remotes/origin/docs_action
  remotes/origin/pcs_fix_hvg
  remotes/origin/s3
  remotes/origin/swalpe-patch-celltypes
  remotes/origin/workflow_experiments
tags.txt
2.2
2.2.1
2.2.2
2.2.3
2.2.4
2.2.5
2.3
2.4
2.4.5
2.4.6
2.5
2.5.1
2.5.2
2.5.3
archive/master

Save information about current tags

grep -v archive tags.txt | while read -r t; do git log -1 --pretty=format:"$t"'!%ci!%an <%ae>!%s%n' "$t"; done > full_tag_info.txt
full_tag_info.txt
2.2!2020-08-07 17:18:37 +0200!Alice Julien-Laferriere !Merge branch 'master' of https://github.com/bedapub/besca
2.2.1!2020-05-18 15:09:26 +0200!Jitao David Zhang !version 2.1.1, depending on Scanpy 2.8
2.2.2!2020-05-18 15:39:56 +0200!Jitao David Zhang !the previous version led to failed installation, now it is fixed
2.2.3!2020-05-19 11:31:02 +0200!Jitao David Zhang !rename the directory 'Import' to 'import' to be consistent with other directories
2.2.4!2020-05-19 13:37:53 +0200!Jitao David Zhang !Revert "rename the Import directory to import, which is consistent with other packages"
2.2.5!2022-03-03 10:53:06 +0100!Jitao David Zhang !clean workbook after running them successfully
2.3!2020-11-24 15:44:46 +0100!Alice JL <[email protected]>!Merge pull request #111 from bedapub/Update-Documentation
2.4!2021-02-19 13:53:19 +0100!Alice JL <[email protected]>!Merge pull request #139 from bedapub/Update-Documentation
2.4.5!2022-06-14 14:13:52 +0200!Jitao David Zhang !logging.info uses %d formatting strings
2.4.6!2022-06-17 10:41:27 +0200!Jitao David Zhang !remove dependency on Accio
2.5!2022-09-15 13:38:00 +0200!Manuel Kohler !minor: removed typo
2.5.1!2022-09-19 09:08:55 +0200!Manuel Kohler !Merge pull request #257 from bedapub/dependabot/pip/devtools/oauthlib-3.2.1
2.5.2!2022-10-20 14:43:33 +0200!Manuel Kohler !Merge pull request #261 from bedapub/260-homologs-folder-is-not-installed
2.5.3!2022-11-10 16:26:33 +0100!hatjek !Merge pull request #264 from bedapub/hatjek-patch-1

Delete all the tags and the listed remote-tracking branches (locally)

cat tags.txt | xargs -n 1 git tag -d
sed 's|remotes/||' branches.txt | xargs -n 1 git branch -d -r

Remap tags to the new history

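#!/bin/bash

# Re-create each tag on the commit in the rewritten history that matches
# the saved commit message, author, and commit date.
while IFS='!' read -r tag timestamp author commit_message; do
    echo "$tag"
    commit_hash=$(
        git log --all --fixed-strings --grep="$commit_message" --author="$author" --since="$timestamp" --until="$timestamp" --pretty=format:"%H"
    )
    # grep -c . counts non-empty lines, so an empty result is reported as 0 instead of 1
    num_commits=$(printf '%s\n' "$commit_hash" | grep -c .)
    [ "$num_commits" -eq 1 ] || { echo "Error: Found $num_commits commits for tag $tag, expected 1"; exit 1; }
    git tag -a "$tag" "$commit_hash" -m "$tag"
done < full_tag_info.txt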

Delete remote branches and overwrite tags

This has not been done yet, but let me know if it is OK with everyone.
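
For review before anything is pushed, the remaining step could be previewed as a dry run. The sketch below only prints commands; it assumes the remote is called `origin` and the `branches.txt` and `tags.txt` layouts shown above, and it leaves `archive/` tags out, since what should happen to them is not settled here. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical dry run: print (do not execute) the pushes that would delete the
# listed remote branches and overwrite the listed tags on "origin".
# Usage: print_cleanup_pushes BRANCHES_FILE TAGS_FILE
print_cleanup_pushes() {
    branches_file=$1
    tags_file=$2
    # branches.txt lines look like "  remotes/origin/<name>"; symbolic refs ("a -> b") are skipped
    sed -n '/ -> /!s|^[* ]*remotes/origin/||p' "$branches_file" |
    while IFS= read -r b; do
        echo "git push origin --delete $b"
    done
    grep -v '^archive/' "$tags_file" |
    while IFS= read -r t; do
        echo "git push --force origin refs/tags/$t"
    done
}

# Possible use:
#   print_cleanup_pushes branches.txt tags.txt
```

Printing first makes it easy to share the exact list with the branch owners before running any of it.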

@hatjek
Contributor

hatjek commented Mar 15, 2024

Hi @idavydov ,
Thank you for the suggestion and your effort. Will the branches be deleted? I don't think we need them anymore, but we should check back with the different developers. Otherwise, it sounds good to me to go ahead. What do you think @kohleman ?
Best
Klas

@kohleman
Collaborator

Hi @idavydov,
I have no problem with removing some branches. I would just like to do it together.

@idavydov
Collaborator Author

hi @hatjek ,

Yes, all branches in branches.txt (see the full list above) would be deleted. As far as I can see, all of them are classified as stale by GitHub.

@hatjek
Contributor

hatjek commented Mar 15, 2024

OK, sounds good to me. Just check back with the branch "owners" if they can be deleted.

@idavydov
Collaborator Author

idavydov commented Mar 15, 2024

hi @hatjek @kohleman @Accio @marlisese @swalpe

In an effort to circumvent LFS quota issues affecting the bedapub space, I would like to perform a cleanup.

To perform this clean-up I'd need to delete branches which still reference old commits which include LFS files.

Here's the list of branches:

20230417-David-stdwfFormat
SigUpdata_PCS
certifi
crispr
docs_action
pcs_fix_hvg
s3
swalpe-patch-celltypes
workflow_experiments

All of these branches were classified as stale and have either been merged or haven't been updated for a year or more.

Could you please:

  • either confirm that it's ok to delete
  • or try to rebase them on the current main (please reach out if you need help with that)

Thanks
Iakov

@hatjek
Contributor

hatjek commented Mar 15, 2024

s3 can be deleted. Thank you!

@swalpe
Collaborator

swalpe commented Mar 16, 2024 via email
