storage: "stream error: stream ID x; INTERNAL_ERROR" when upgrading from Go 1.19.5 to 1.20 or 1.20.1 #7440
Comments
Thanks for reporting. The fact that it changes with an upgrade to the Go version makes me think that there is an issue in net/http, which the storage client uses internally. Can you clarify which version of Go you were using previously (1.95 doesn't exist, I think...)? Also, are you using HTTP/1 or HTTP/2 (the default)? I can escalate this to the Go language team, or take a look to see if any similar issues are reported there. |
Sorry, I meant Go 1.19.5
I'm using the default HTTP/2.
|
Sounds good. I will look into it with the Go team. In the meantime, if you want to upgrade the language version, one option is to disable HTTP/2 (e.g. by setting `GODEBUG=http2client=0`, or by supplying your own HTTP/1.1-only client). |
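For concreteness, a minimal sketch of one way to disable HTTP/2 for the storage client (an assumption on my part, not the exact snippet from the comment above): pass an `http.Client` whose base transport has HTTP/2 turned off via an empty `TLSNextProto` map, wrapped with `htransport.NewTransport` so authentication is still applied.

```go
package main

import (
	"context"
	"crypto/tls"
	"log"
	"net/http"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
	htransport "google.golang.org/api/transport/http"
)

func main() {
	ctx := context.Background()

	// A non-nil, empty TLSNextProto map disables HTTP/2 on this transport,
	// so all requests fall back to HTTP/1.1.
	base := &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}

	// Wrap the base transport so credentials and scopes are still applied.
	trans, err := htransport.NewTransport(ctx, base,
		option.WithScopes(storage.ScopeFullControl))
	if err != nil {
		log.Fatal(err)
	}

	client, err := storage.NewClient(ctx,
		option.WithHTTPClient(&http.Client{Transport: trans}))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	// ... use client as usual ...
}
```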
Does the error contain the string "; received from peer", as in "stream error: stream ID ####; INTERNAL_ERROR; received from peer"? If so, this is an error produced by the GFE (the destination webserver), indicating an internal error on its part. Generally, GFEs generate this error when the backend (the GCS server behind the GFE) halts mid-response.

I've heard other reports of these errors showing up more with Go 1.20. I don't know what might have changed from 1.19 to affect this, but my current hypothesis is that we've always been receiving these errors but are surfacing them more frequently now for some reason. (For example, perhaps if we get an error and a clean stream termination at about the same time, maybe we're more likely to process the error now, although I haven't found any changes that seem like they'd cause that to happen.)

A reliable reproduction case would help a lot in narrowing down the root cause of the changed behavior on the Go end, at least. However, the actual error here seems to be on the remote end. (Assuming I'm right and this is a "received from peer" error.) |
@garan82 it would also be helpful if you have a rough estimate on what percentage of the time these errors occur, and how large the files you are reading are. I am going to try to reproduce. |
To clarify, we're not using Seek, but NewRangeReader. Our file format is somewhat analogous to SSTables, and we must read the index at the end of a file to efficiently find where the data we want is within the larger file. So most operations on our end call NewRangeReader multiple times.

I've tried to reproduce this outside of our production env on GKE, but to no avail; it's a sporadic issue. One theory I have is that it occurs more often when we're doing a large number of reads of objects from hundreds (or thousands) of simultaneous readers. We've hit the 50Gbps read limit from GCS multi-region buckets on numerous occasions. A hard-to-verify theory is that this is an artifact of that rate limiting, with the 50Gbps limit manifesting in the GFEs in some way. It's a remote chance because it was the same with Go 1.19, but I'm out of ideas. |
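To make that access pattern concrete, here is a rough, hypothetical sketch (the index encoding and `decodeIndex` are placeholders, not the reporter's actual code): read a fixed-size index from the tail of the object using a negative offset, then issue a second range read for the data it points to.

```go
package main

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

// decodeIndex is a stand-in for format-specific index parsing.
func decodeIndex(index []byte) (dataOffset, dataLength int64) {
	// ... parse the index and return the range to read ...
	return 0, 0
}

// readViaIndex reads the trailing index of an object, then range-reads the
// data region that the index points to.
func readViaIndex(ctx context.Context, obj *storage.ObjectHandle, indexSize int64) ([]byte, error) {
	// A negative offset asks for the last indexSize bytes of the object.
	ir, err := obj.NewRangeReader(ctx, -indexSize, -1)
	if err != nil {
		return nil, err
	}
	index, err := io.ReadAll(ir)
	ir.Close() // always close readers, even when ReadAll fails
	if err != nil {
		return nil, err
	}

	off, length := decodeIndex(index)
	dr, err := obj.NewRangeReader(ctx, off, length)
	if err != nil {
		return nil, err
	}
	defer dr.Close()
	return io.ReadAll(dr)
}
```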
@mikewiacek thanks for clarifying. I also haven't been able to repro so far but I will keep an eye out for these errors in our internal client library perf testing environment. I think that if the rate of errors is low, it also might be worthwhile to just add it to your retry predicate using the customization features in the library. Example code to set it for a single call:
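The example code from this comment did not survive in this thread, so here is a hedged reconstruction of what a per-call retry customization could look like; the error-matching heuristic and names such as `readWithRetry` are placeholders of my own, not the maintainer's exact example.

```go
package main

import (
	"context"
	"io"
	"strings"

	"cloud.google.com/go/storage"
)

// readWithRetry reads an object range, retrying on http2 stream
// INTERNAL_ERRORs in addition to the library's default retryable errors.
func readWithRetry(ctx context.Context, client *storage.Client, bucket, object string, off, length int64) ([]byte, error) {
	o := client.Bucket(bucket).Object(object).Retryer(
		storage.WithErrorFunc(func(err error) bool {
			if err == nil {
				return false
			}
			// Keep the library's default retry behavior...
			if storage.ShouldRetry(err) {
				return true
			}
			// ...and additionally retry http2 stream INTERNAL_ERRORs.
			return strings.Contains(err.Error(), "stream error") &&
				strings.Contains(err.Error(), "INTERNAL_ERROR")
		}),
		storage.WithPolicy(storage.RetryAlways),
	)

	r, err := o.NewRangeReader(ctx, off, length)
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}
```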
This can also be set for all calls on the client using Client.SetRetry. Also, you can add logging in the http2 error case of the custom error function to see how often these errors occur. |
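For the client-wide variant, a short sketch under the same assumptions (the `shouldRetry` predicate is the one from the per-call example above):

```go
package main

import "cloud.google.com/go/storage"

// configureClientRetries applies the same retry predicate to every call made
// through the client, instead of configuring it per object handle.
func configureClientRetries(client *storage.Client, shouldRetry func(error) bool) {
	client.SetRetry(
		storage.WithErrorFunc(shouldRetry),
		storage.WithPolicy(storage.RetryAlways),
	)
}
```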
@tritone if we were to bring up one of our jobs that hits really high throughput, and use Go 1.20 so we can make these errors occur, is there value in coordinating such a run with you? We can make them occur in that scenario pretty regularly, and maybe there is some real-time debugging you can do? I'm going to try the retry function on our end and add some metrics to it to see how often it fires. |
@mikewiacek thanks for the offer; it would definitely be helpful to coordinate, and at the very least we would be able to look at specific error occurrences to try to get a better handle on the failure mechanism. You can file a case with support and reference this issue to start a non-public thread. The support engineer should be able to look up my email/LDAP from my GitHub handle. |
This doesn't look to be related to Go 1.20 after all. We updated some Google Cloud Go dependencies while staying on Go 1.19 and have reproduced the issue there. I've diffed the go.mod that worked without issue against the go.mod with the updates, pruning off deps that are unlikely to be related. Here's the current diff; one of the + lines is probably responsible for the issues we're seeing.
|
Okay I am a little confused now -- the original message mentioned that "We have no problem with Go 1.19.5 & https://pkg.go.dev/cloud.google.com/go/storage 1.28 or 1.29". Your summary states that storage is being upgraded from v1.25.0 to v1.27.0. Could you clarify what versions of storage are affected or not? I think most of the other dependency diff should be irrelevant to this issue, unless there is something outside of the storage package that is causing more network connections to be dropped/severed for your application. |
I've finished a lengthy debugging session with someone else who has been seeing elevated INTERNAL_ERROR rates with more recent versions of the HTTP/2 client. I can't guarantee that the root cause is the same, but it's plausible.

HTTP/2 connections are flow-controlled. Flow control is on a per-connection and per-stream basis (where a stream corresponds to an HTTP request). The Go HTTP/2 client gives servers an initial 1GiB of connection-level quota and 4MiB of per-stream quota. As data is read from response bodies, the HTTP/2 client provides additional quota to the server so that it can send more data. This means that a server can send an initial 4MiB of data in response to a request. The client will send it an update permitting more data to be sent after the user reads some of the response body.

The HTTP/2 client's handling of connection-level flow control (which spans all requests) changed in v0.5.0 of golang.org/x/net. In v0.4.0, the client essentially disables connection-level flow control. Any time the user reads from any response body, if the server has less than 0.5GiB of connection-level quota, the client adds quota to bring it to 1GiB. (This behavior is strange enough that I think it must have been unintentional.) In v0.5.0, the client returns connection-level flow control quota in proportion to the amount of data read from response bodies. Read 1MiB, 1MiB of quota is returned. Quota is also returned for data discarded when a body is closed without being read.

The practical impact of this is that a server talking to a v0.4.0 or earlier client will never be blocked by connection-level flow control. A server talking to a v0.5.0 or later client may be blocked by connection-level flow control. This can happen if the client has enough unclosed streams open and containing unread data, for example if the user has forgotten to call Response.Body.Close on a discarded request.

This can result in a case where a server can no longer send response bodies. New requests to the server will return successfully, because response headers are not flow controlled, but reading from the response body will block. A read which blocks for a sufficiently long time may be closed by the GFE with an INTERNAL_ERROR status. I'm not sure if that reflects the backend serving the request restarting, or just the backend giving up on trying to write a response that it can make no progress on.

If this is what's happening here, the fix is to make sure to always close response bodies so that the resources associated with the request can be reclaimed. (You'd want to do this regardless of flow control issues, to avoid accumulating large amounts of unread, buffered data.)

Note that Go 1.20 does not have the change to flow control I'm describing here. However, cloud.google.com/go does depend on golang.org/x/net, so newer versions of that package will pick up this change. |
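As a concrete illustration of the fix described above (a general net/http sketch, not GCS-specific code): always close the response body, even on paths that discard the response, so the HTTP/2 client can release the stream and return its flow-control quota.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchAndCheck shows the pattern: every response body gets closed, even when
// the body itself is not needed. Closing releases the stream and returns its
// flow-control quota; draining first also lets the connection be reused.
func fetchAndCheck(client *http.Client, url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer func() {
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}()

	if resp.StatusCode != http.StatusOK {
		// Even on this early-return path, the deferred close above runs.
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}
```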
Belated thanks to @neild and @mikewiacek for the extensive investigation here. I'm working on a change to ensure that we are closing response bodies appropriately for all error cases in the Reader. Will update when I have a PR. |
Client
Storage
https://pkg.go.dev/cloud.google.com/go/storage
Environment
GKE
Go Environment
$ go version
go version go1.20 linux/amd64
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home//.cache/go-build"
GOENV="/home//.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home//go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home//go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/usr/local/home//go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build824194778=/tmp/go-build -gno-record-gcc-switches"
When reading objects from GCS, we're getting a lot of "stream error: stream ID x; INTERNAL_ERROR" errors. It comes and goes with no clear pattern; essentially it's very similar to #784.
Additional context
We have no problem with Go 1.19.5 & https://pkg.go.dev/cloud.google.com/go/storage 1.28 or 1.29. After upgrading to Go 1.20 or Go 1.20.1, we encounter many GCS errors.