-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Teach Git to handle huge files in smudge/clean #3487
Teach Git to handle huge files in smudge/clean #3487
Conversation
I wrote this in a way to localize the changes to a small area. Personally I find the solution a little ugly, and wouldn't mind extending the changes a bit farther out. But you don't get very far down that path before you've got to turn Thoughts / comments welcomed. If I get a chance this weekend, I'll try introducing a commit which makes those larger changes, simply for discussion. (Oh -- and I couldn't verify this truly fixes the issue on Windows 😢 My VM is too small and runs out of both memory and swap space, so it dies to that. But I validated that the changes at least compile without warning.) |
Hi Matt, I'm not a user of Could the commit message include a little bit more background as to how the smudge/clean filters are used in LFS, and the separation of concerns as to where the 'failure' that is being tested resides. (e.g. areas of: Git, LFS, the filters, bash, .. [or maybe some call back from LFS into git?]) To me (in ignorance), we have a very small file that is being stored in the git repo which has no ulong=32bit limitations. We then have external filters that massage the file (external 'bash' commands?) from being a short file to being a very large file (breaking the ulong=32bit barrier). And a filter to reverse that. Somehow, those filters are corrupting the end to end loop-back testing, but it's unclear to me (i.e. there is some info I'm missing) where git itself is involved in those steps. Reminding myself: The “clean” filter is run when files are staged. The “smudge” filter is run on checkout (https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes fig 143/144 ;-) Ideally, the commit message (to inform future readers/helpers & me) should also indicate where, within Git, you/we think that the failing is located. At the moment, I couldn't tell. Philip |
A bit more searching, and serendipity, got me to git smudge clean interface suboptiomal which does look to include some cross-related info about git-lfs uses the interface and how both git-annex and git-lfs can have a struggle with large files. Perhaps some of their core insights could be extracted? |
@PhilipOakley thank you for the feedback. (Genuine "thank you", not the "thanks" you described at the Contributors Summit 😋) As this is my first contribution, it's hard to know exactly what's worth stating and what's well-known to reviewers. I'll flesh out my descriptions to make this more self-explanatory. |
Hopefully these longer commit messages better motivate the change. I also realized that the belt-and-suspenders guard (checking for very large objects being checked out from the repository) may not be strictly necessary, so I split it into its own final commit. |
Aside: the links in the GitHub notification email appear to be in error as they link to the git repo, rather than your fork. e.g. |
Maybe it was the force push.. (still OK) |
entry.c
Outdated
@@ -85,6 +85,10 @@ static int create_file(const char *path, unsigned int mode) | |||
void *read_blob_entry(const struct cache_entry *ce, size_t *size) | |||
{ | |||
enum object_type type; | |||
unsigned long trunc_size = (unsigned long)*size; | |||
if (trunc_size != *size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(sticking my neck out here):
I think that trunc_size
in the if
should be up cast back to size_t
so that we are explicit about what we intend, rather than assume the 'side-effects' of coercion between independent types. Others may disagree..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we're not relying on a side effect here by using unsigned long
. Instead, we are testing precisely what the problem is. The only problem I could see is that some static analysis running on a 64-bit Linux would complain that this check is "always false". We would have to work around such a warning, but we do not need to upcast the trunc_size
again (that would even risk overzealous optimization in buggy C implementations).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are relying on the fact that two different 'implementation defined' types will convert the way we hope. The implementation defined 'gotchas' are not quite as bad as undefined behaviours, but can catch us all out (which is why we are here).
Maybe this has to be thrown at the "naming is hard" side of what the temp variable should be called so as to "obviously", at a glance, inform the reader of it's purpose, maybe windows_long_size
, or even shortened_long_size
?
Or maybe just a short comment about the LLP64 issue it's catching (we do want it to go away in the long term).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is simple: when size_t
is wider than unsigned long
, the cast to the latter might cut off bits. The end result is that the cast value is different from the original value. And that is precisely what the code in question tested.
So I don't see any way to improve the clarity of this code.
Casting the value back to size_t
explicitly before comparing only would make things confusing again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the commit message be split into separate paragraphs rather than a wall of text?
Be explicit about having checked (or not) all the callers for the changed function.
Discussion point: can we create a fresh typedef (e.g. memlen
, still to ulong
, to indicate the places where future changes will be needed (e.g. those function casts). Let's use the name equivalence of typedefs to our benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the commit message be split into paragraphs rather than a wall of text? Including just after "buffer the entire smudged file into memory." which is (IIUC) the Git internals area.
@vtbassmatt congratulations:
It works! Now, I would like to caution that Git LFS uses the So maybe in the short run I could ask you to set up a manual test repository on GitHub that has a 4.5GB blob served via Git LFS? That way, we should be able to verify this fix "on the cheap". I've got a fixup to address https://github.com/git-for-windows/git/runs/3989339996?check_suite_focus=true#step:4:566, but am currently investigating https://github.com/git-for-windows/git/runs/3989354647?check_suite_focus=true#step:6:521 (and therefore will hold off from pushing my fixup just yet, in case I figure out another one to push on top). |
@PhilipOakley yes, my apologies on the wall of text. I struggled to capture what I wanted to say, but might be able to articulate it better this morning. @dscho thank you very much! I'll get a test repo set up so we can do a quick validation. I also woke up this morning realizing that I've proven and fixed the |
Okay, I was investigating the
If you look at Which makes sense, as we do not know the size of the object before reading it, it is therefore the responsibility of Now, it is quite the rabbit hole to follow the call chain of I am in the process of crafting a |
I can't figure out if I broke this or if it was already broken? Also, I'm rewording the commit messages to be clearer and less "wall of text"y. I'll try to make sure I don't destroy your |
I always find it better to reply with a concrete suggestion how the end result should look like. However, in this case I am the only non-native speaker of the three of us, so I will refrain from doing that in this instance.
Thank you! I saw from the docs that the free GitHub accounts do not allow that, but you probably have the means to get close to the (hard) 5GB limit.
Good point! This could be potentially much harder, if Git insists on reading everything into memory. But I could imagine that it might not be an issue because IIRC Git has some smarts to use a "streaming" interface when
I would actually go for something slightly different. In a new test case, create a
I would say in a new commit. My general rule: related code changes should be in the same patch series, but each commit should be as fine-grained as possible while still standing on its own. The This is a little bit muddy, I know, as it often comes down to a matter of taste. For example, Junio often likes the test cases to be committed together with the corresponding fix (at least if both are small-enough code changes). |
I find it easiest to simply add a newline/paragraph after every sentence (fullstop/period),
But simply, new paragraph after every sentence is 90% of the win ;-) |
I actually do not think so. Paragraphs should be separated semantically. It might often match your rule, but only by happenstance. |
@dscho, I should have realised that my suggestion was mid sentence, which human error theory says is the most likely that readers miss. I wasn't clear enough about what I meant by "split into separate paragraphs". I added an extended reply above. |
It doesn't go the other way though. And once separated, it is much easier for the speaker (native or not) to see where helpful joins occur. Semantically, they are all about the same patch ;-) |
Honestly, I feel that this thread has devolved into a lot of un-actionable talk at this point. @PhilipOakley I would find it more helpful if you rephrased the commit message and then offered the result as a suggestion. |
I am going into full action gear myself, to get this thread back on track. @vtbassmatt what do you think of this?
|
I like it! I'll fix when I get a chance.
…On Mon, Oct 25, 2021 at 7:35 AM Johannes Schindelin < ***@***.***> wrote:
I am going into full action gear myself, to get this thread back on track.
@vtbassmatt <https://github.com/vtbassmatt> what do you think of this?
read_blob_entry: use size_t instead of unsigned long
There is mixed use of `size_t` and `unsigned long` to deal with sizes in the
codebase.
Recall that Windows defines `unsigned long` as 32 bits even on 64-bit
platforms, meaning that converting `size_t` to `unsigned long` narrows the
range. This mostly doesn't cause a problem since Git rarely deals with files
larger than 2^32 bytes.
But adjunct systems such as Git LFS, which use smudge/clean filters to keep
huge files out of the repository, may have huge file contents passed through
some of the functions in `entry.c` and `convert.c`.
On Windows, this results in a truncated file being written to the work tree.
This is due to one specific use of `unsigned long` in `write_entry()` (and a
similar instance in `write_pc_item_to_fd()` for parallel checkout), which
appear to be calling `read_blob_entry()` with a pointer to `unsigned long` for
convenience.
By altering the signature of `read_blob_entry()` to expect a `size_t`,
`write_entry()` can be switched to use `size_t` internally (which all of its
callers and most of its callees already used).
Note: this only solves the problem when a `smudge` produces a file larger than
`unsigned long` can describe. This commit does not touch on the problem where
blob contents larger than `unsigned long` can handle (but `size_t` could) are
handled improperly by Git.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#3487 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAFOMK3GIHFCFVDMK2EXIFDUIU6HRANCNFSM5GQ7GGAA>
.
|
I also fixed the compile error, and the test failures. I pushed up the result to your branch, mostly as You might like the commit message I crafted for f9b622e. |
This would be dramatically out of scope for this PR. |
@vtbassmatt no, the intention of the original code is to let |
@vtbassmatt thank you for sharing access to a Git LFS-enabled test repository with me privately. I can confirm that the 5GB file fails to check out with the current Git for Windows version: me@work MINGW64 ~/repros/large-file-lfs-3487/lfs-testing (master)
$ git switch large-file
Switched to a new branch 'large-file'
Branch 'large-file' set up to track remote branch 'large-file' from 'origin'.
Encountered 1 file(s) that may not have been copied correctly on Windows:
large-file.bin
See: `git lfs help smudge` for more details.
me@work MINGW64 ~/repros/large-file-lfs-3487/lfs-testing (large-file)
$ ls -lah large-file.bin
-rw-r--r-- 1 me 4096 1.0G Oct 25 19:58 large-file.bin And when I clone this (in hindsight, I could have just created a simple local test repository...) with the Git built from your PR branch, it works!!! 🎉 me@work MINGW64 /usr/src/git/wip8 (huge-file-smudge-clean)
$ ./bin-wrappers/git -C ~/repros/large-file-lfs-3487/ clone lfs-testing test-with-fixed-git
Cloning into 'test-with-fixed-git'...
done.
Updating files: 100% (507/507), done.
Filtering content: 100% (505/505), 1.00 GiB | 13.17 MiB/s, done.
Encountered 1 file(s) that may not have been copied correctly on Windows:
large-file.bin
See: `git lfs help smudge` for more details.
me@work MINGW64 /usr/src/git/wip8 (huge-file-smudge-clean)
$ ls -lh ~/repros/large-file-lfs-3487/test-with-fixed-git/large-file.bin
-rw-r--r-- 1 me 4096 5.0G Oct 26 09:20 /c/Users/johasc/repros/large-file-lfs-3487/test-with-fixed-git/large-file.bin The only remaining issue is that Git LFS has no way of knowing that Git for Windows got the size right, and therefore still offers the warning that "1 file(s) [...] may not have been copied correctly" (emphasis mine). |
🎉 thank you so much!
I can't find the reference right now, but I believe the Git LFS project said they're willing to remove or alter that warning once Git for Windows no longer exhibits the issue. I'll put up a PR there once this one is on its way. |
Great!
One suggestion first: could you change that oneline to
So there is more work to do to make this test case pass. And when we know what this work is, we could put it into the same commit, keep the same oneline (i.e. the first line of the commit message), and then wordsmith the commit message a bit (I think we know what reviewers on the Git mailing list will say if a commit message refers to an earlier commit message for details... 😁). |
The good news is that the equivalent worked when I tried it with Git LFS, which means that the |
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
Teach Git to handle huge files in smudge/clean
On platforms like Windows, where
unsigned long
is 32 bits, Git can only deal with files up to 4GB. This is fine for normal objects, since no one should commit such large files to Git. However, even external solutions like Git LFS are subject to the limitation, since smudge/clean also make use ofunsigned long
.This patch series introduces a test case exposing the problem, then makes the minimum changes necessary for the test to pass on such platforms.