Early EOF errors when running git fetch over ssh #1322
Comments
bump? |
What is the latency between the client and server? I'm willing to take a look at this, but I'd want some sort of repo snapshot (accessible to me) I know can reproduce the issue before diving into setting up a test. |
Based on finding PowerShell/Win32-OpenSSH#1322
I think we're seeing this as well. Windows 2016 Server, desktop edition. ssh key is loaded into Windows ssh-agent service. Verbose log output (using public repositories):
|
@NoMoreFood I think in our case (based on the timestamps in my log above), it's around 1.1s between running the command in the client and starting to receive data. |
I get this fairly consistently (but not 100% reliably) when cloning Git repos using Win32 SSH. Most commonly when cloning https://github.com/aspnet/AspNetCore |
We get this around 50% of the time, intermittently. This also happens with OpenSSH 8.0.0.0p1-Beta |
@anurse @petemounce What OSes are you running? |
@NoMoreFood I am running Windows Server 2016 Desktop/GUI edition within Google Cloud. Image Family is If it's relevant, https://github.com/GoogleCloudPlatform/compute-image-windows/ is I think what goes into the image-creation process to prepare them for customer usage. |
This is the script I'm using to bootstrap sshd onto the instances in question - this is relevant to this issue I think because it's how ssh.exe becomes present, and how the default shell is set to powershell. (The verbatim script relies on GCE-GetMetadata to pull the base64 encoded sshd_config (also snippet'd here) and a base64 encoded public key from GCE instance-metadata. Those sections can be hacked out with hard-coded values if you don't run this within GCE to reproduce; I don't have the Azure/AWS equivalents supported, sorry. I included the full script + sshd_config because I thought you might be interested in adapting it into your docs around enabling pubkey auth - if there's an issue around that, please point me at it and I can replicate there if that's helpful)
|
Windows 10 (various builds, including 18932.1000). Just a standard interactive instance (no docker containers, etc.). Literally just tried and got it again:
|
It's not a 100% consistent repro for me, but it's like.. 80-90% consistent. I can usually get a successful clone by repeating a bunch of times. Alternately, if I reset |
I badly don't want to use the ssh bundled with git-for-windows because then I (I think?) would need to find a different arrangement for auth, differently from the other two platforms I support. I think my options would be:
|
@NoMoreFood friendly ping? Is there any more information I could provide that might help? |
@NoMoreFood Running into this on Windows Server 2019 as well (OpenSSH version 7.7.2.1 from Windows Update / pre-installed with Windows). Is there a timeline on addressing this? |
@petemounce @RNabel I've been working on other projects, but might be able to sneak this in. Can you verify it's still reproducible on 8.0? |
It's definitely still reproducible on 8. I'm less sure it's a data problem, since on retry it can succeed against the same sha being cloned. |
I see this all the time when attempting to clone dotnet/roslyn with 7.7.2.1. It's getting to the point that I'm going to have to disable the service. |
@NoMoreFood friendly ping? Any news? |
If you need a repo that commonly reproduces this, try cloning https://github.com/dotnet/roslyn. It's a pretty reliable failure for me. |
I also consistently get this on Windows 10 when running |
@NoMoreFood do you need any more information for repro, or want any other help? |
Any news on this? I'm seeing the same behaviour... |
This seemed to be much better once I moved to |
Does CreatePseudoConsole actually happen in conhost.exe (which is called by sshd), and is that what seems to be open-sourced here? But anyway, that's all on the sshd server side, whereas your fsync is on the ssh client, right? |
I don't think this goes through the |
Bah. Seeing the issue again with either |
So another theory of what's happening... git has this code that's doing the reading: It's clearly expecting reads from stdin to be blocking (bytes read == 0 is the error we're seeing). Digging through the Windows I ended up hitting the same "early EOF" issue even while moving comms between the two to a named pipe on windows; the git side is just using the microsoft CRT's implementation of posix file IO though, whereas openssh has its own win32 API layer. |
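The reader-side assumption described above can be sketched roughly as follows. This is a hypothetical Python illustration, not git's actual code: `read_exact`, `EarlyEOF`, and `FakeStream` are made-up names, and `FakeStream` only simulates a transport that hands the reader a zero-length chunk in the middle of the data.

```python
class EarlyEOF(Exception):
    """Raised when the stream ends before the expected number of bytes arrive."""


def read_exact(stream, size):
    """Read exactly `size` bytes, treating a zero-length read as end of file.

    This follows the POSIX convention for blocking reads: a read that
    returns 0 bytes means the writer has closed the stream.
    """
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if len(chunk) == 0:
            got = size - remaining
            raise EarlyEOF(f"early EOF: wanted {size} bytes, got {got}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class FakeStream:
    """Hypothetical stream that replays a fixed list of chunks, one per read()."""

    def __init__(self, chunks):
        self._chunks = list(chunks)

    def read(self, n):
        if not self._chunks:
            return b""
        chunk = self._chunks.pop(0)
        # Return at most n bytes and push any remainder back for the next read.
        head, tail = chunk[:n], chunk[n:]
        if tail:
            self._chunks.insert(0, tail)
        return head
```

With `FakeStream([b"ab", b"", b"cd"])`, the reader gives up after two bytes even though more data follows, which is the shape of failure the theory above would predict if a zero-length read ever reached git.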
TL;DR: I had the same issue on 8.1.p1, uninstalled it and installed Git-2.37.1-64, choosing its bundled ssh, which is OpenSSH_9.0p1, and that has resolved it for me. Full post: On Win10, with my .gitconfig containing sshCommand = "ssh -vvv", I'd been using OpenSSH v8.1.p1, and I'd get about a 30% failure rate. The command was simple, I just increment NN as 00, 01, 02 etc. Here's the relevant ending output from a failed transfer. Notice the 'fatal: early EOF' line.
Having found this thread, I tried a newer OpenSSH. Sadly, OpenSSH V8.9.1.0p1 gave the same results, roughly a 30% failure rate. Then, Retrying my test again exactly as above - perfect run, 20/20 clones now succeed. So, I'm unsure if 9.0p1 has a fix, or if it's simply built differently than the win32-openssh versions available here... Either way, I suggest people try out OpenSSH_9.0p1 themselves, and see if it resolves the issue. |
If you enable level4 debug output with debug4("write - reporting %d bytes written, io:%p", bytes_copied, pio); and debug4("read - io:%p read: %d remaining: %d", pio, bytes_copied, pio->read_details.remaining); in contrib/win32/win32compat/fileio.c, which might give clues whether the custom Win32 wrappers for read and write used here do anything inappropriate with zero-length reads and writes, as suggested by @vvuk. |
@vvuk Good idea! Looking at fileio_write_wrapper there does not appear to be any protection against a zero-length write happening, which Win32 (unlike POSIX) might pass on to the consumer, which a POSIX application might receive as a zero-length read and interpret as an EOF indication. What happens, if you simply add at the start of that function something like if (bytes_to_copy == 0) return 0; to reduce the likelihood of Or also add debug output in fileio_write to check if Similarly add debug checks in WriteThread in termio.c, to check that the functions |
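The guard suggested above can be sketched like this. It is a Python stand-in for the idea, not the real C code in fileio_write_wrapper: `guarded_write` and `MessagePipe` are hypothetical names, and `MessagePipe` merely models a transport that would pass an empty write through to the consumer as an empty read.

```python
class MessagePipe:
    """Hypothetical pipe that delivers each write as one message, even empty ones."""

    def __init__(self):
        self.messages = []

    def write(self, data):
        self.messages.append(bytes(data))
        return len(data)


def guarded_write(raw_write, data):
    """Forward `data` to `raw_write`, but never forward an empty buffer.

    Returning 0 early keeps a zero-length write from reaching the consumer,
    where a POSIX-style reader could mistake it for end of file.
    """
    if len(data) == 0:
        return 0
    return raw_write(data)
```

Calling `guarded_write(pipe.write, b"")` leaves `pipe.messages` empty, while non-empty buffers pass through unchanged.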
@manojampalam Are you confident that the write calls in fileio.h and termios.h will not accidentally cause zero-byte writes, which the receiving |
so I re-added the Optional Features OpenSSH (8.1.p1), and added this to my global .gitconfig
I can still reliably reproduce the issue, but the output doesn't contain any new lines with "write - reporting" or "read - io" in it :-( a recent output:
I'm happy to keep experimenting on my 'easy' test case here - lmk what else I could try. |
|
Hi I believe that I know what is causing this. Why was this issue closed? |
When https://github.com/PowerShell/openssh-portable/blob/latestw_all/channels.c#L2051 ultimately dispatches the copying writing to a pipe to a worker thread (this is because it appears that the pipe uses the However the close method introduces several race conditions. If the thread has not yet invoked I have found this issue is not reproducible in a debug build, but is reproducible in a release build. It is not clear to me why
Could be moved prior to the invocation of |
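The general hazard described above, a close racing with a worker thread that still has data to write, can be illustrated with a small sketch. This is not the Win32-OpenSSH implementation, and the details of the comment above are truncated, so everything here is an assumption: `AsyncPipeWriter` is a hypothetical class, and a Python list stands in for the pipe. The point is only that `close` has to wait for the worker to finish the queued writes, otherwise the tail of the data can be lost.

```python
import queue
import threading


class AsyncPipeWriter:
    """Hypothetical writer that hands each buffer to a worker thread."""

    _STOP = object()  # sentinel telling the worker to exit

    def __init__(self, sink):
        self._sink = sink              # list standing in for the real pipe
        self._pending = queue.Queue()  # buffers waiting to be written
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            item = self._pending.get()
            if item is self._STOP:
                return
            self._sink.append(item)

    def write(self, data):
        self._pending.put(bytes(data))
        return len(data)

    def close(self):
        # Queue the stop marker behind any pending data, then wait for the
        # worker to drain everything before the pipe is considered closed.
        self._pending.put(self._STOP)
        self._worker.join()
```

A close that tore down the sink without the `join()` would be free to run while buffers are still queued, and because that depends on thread timing it could plausibly show up in one build configuration and not another.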
This is definitely not fixed. |
I gave up waiting for Microsoft to fix the bug and ship it — if they’ve failed to fix it for three years and the issue is closed I’m not sure how much clearer they can make the fact they don’t care. I have switched to using PuTTY’s |
@LukeCarrier Thanks, we also found this solution, but in reality it's pretty sad to have to use another tool, roll it out, update it and so on. In an Enterprise context it's not so nice tbh. And it's pretty sad for MS to just ignore or be unable to simply ship a small bugfix... |
For those still following along on this issue, the fix was merged and is available in the current beta: #2012 (comment). |
@333fred can you give us any information whether this fix will be backported to Windows 10 given Microsoft's current stance on the Windows 10 update policy? I am asking because Windows 10 is still used in a lot of corporate environments (sometimes not even version 22H2) and it's likely it'll stay this way for a long while. |
I have absolutely no idea. I'm just a user who was hit by this bug here 🙂. |
Hi, Basically, I am using 9.5p1, which is newer than 9.2.2 (the version claimed to contain the fix), but I still see the following quite a lot on my Win11 workstation:
As a game engine programmer, I have spent the past few days cloning all of my company's game team repos one by one on both my MacBook Pro and my Win11 workstation (running 23H2 10.0.22631.3880). Both machines were on the same Wi-Fi, on the same desk, cloning at the same time, and both with git-lfs (included by default on Windows, installed via brew on the Mac). The MBP never failed a single time. On Win11, one game repo would clone successfully on the first try, while another would fail several times before it finally succeeded. I searched our internal Slack channels (not all of them) and found that many coders had hit the same issue, and only on Windows. Their workarounds all involved replacing the built-in OpenSSH (C:/Windows/System32/OpenSSH/ssh.exe) with some other SSH. Then I used And the result is acceptably good: one git clone of a repo with 4 submodules succeeded, with just one of the submodules failing once. I don't know why a submodule clone is retried automatically, but it's an acceptable result for me: one git clone, one successfully cloned repo. The error log was "Failed to clone 'our-internal-codename'. Retry scheduled"
|
Long story short:
|
@TroutZhang if you're running git on windows and you're using any of the built-in windows versions of OpenSSH (such as The default behavior of git on windows (https://git-scm.com/download/win) is to use the bundled version of If you want to confirm that you are in fact using the version of ssh that you think you are, don't rely on looking at the path environment variables or settings configurations. Look at what is being spawned by git either using Process Explorer or powershell (I'm guessing that since this is a long pull, you should have enough time while the pull is happening to find the proper process in Process Explorer, if not there's always ProcMon). If this is still an issue that you're interested in debugging (rather than just using a ssh program that doesn't have this issue ;) ), please confirm (using the above methodology) that |
@cwgreene Thanks for the info. I think you're right. I should've used the process explorer to check the child process to see which ssh.exe is being invoked in the first place. I'll get back with more info later.
|
I've just tried with both the following repo clones:
Using Using |
@cwgreene |
Spiffy, glad to hear that it's indeed fixed. :) |
If I configure the Git for Windows client to use the SSH version that ships with Windows, I occasionally see the following error during git fetch:
Here's a log that includes ssh -vvv output:
fetch.log
And here's the test script that I used, which replays the same fetch 100 times in a row:
fetchloop.sh.txt
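For readers without access to the attachment, a repeat-and-count harness along these lines might look as follows. This is a sketch, not the attached script: `run_git_fetch` and `count_early_eof` are hypothetical names, and the sketch does not restore the client repo state between runs, which a faithful replay of the same fetch would need to do.

```python
import subprocess


def run_git_fetch(repo_dir):
    """Run one `git fetch` in repo_dir and return (succeeded, stderr text)."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "fetch"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr


def count_early_eof(attempt, runs=100):
    """Call `attempt()` `runs` times and tally the failures.

    Returns (total failures, failures whose stderr mentions "early EOF").
    `attempt` must return a (succeeded, stderr) pair.
    """
    failures = 0
    early_eof = 0
    for _ in range(runs):
        ok, stderr = attempt()
        if not ok:
            failures += 1
            if "early EOF" in stderr:
                early_eof += 1
    return failures, early_eof
```

A call such as `count_early_eof(lambda: run_git_fetch("path/to/repo"))` would give a rough failure rate for a given ssh client, provided the caller resets the repo before each attempt.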
I believe that this is a bug in the SSH client that ships with Windows, because:
The bug also seems data-shape dependent, in that it reproduces more or less frequently depending on the exact state of the Git repos on the client and server. I have snapshots of the client and server Git repo state. I can't share them publicly, but can share them internally with Microsoft employees.
My client OS is Windows 10 1809 Build 17763.253
((Get-Item (Get-Command sshd).Source).VersionInfo.FileVersion) says 7.7.2.2
The Git servers that I have tested against are:
From messing around with the Azure DevOps SSH code, I found the following:
So this looks like a race condition where channel data from the server is lost when the client receives SSH_MSG_CHANNEL_CLOSE.
BTW, I tried searching for related issues and found #752 and a few others, but they all seemed to be issues with running the Windows version of sshd on the server, as opposed to the client.
@dscho this should probably block git-for-windows/git#1981