
storage: "stream error: stream ID x; INTERNAL_ERROR" when upgrading from Go 1.19.5 to 1.20 or 1.20.1 #7440

Open
garan82 opened this issue Feb 17, 2023 · 14 comments
Labels
api: storage Issues related to the Cloud Storage API. priority: p3 Desirable enhancement or fix. May not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@garan82

garan82 commented Feb 17, 2023

Client
Storage
https://pkg.go.dev/cloud.google.com/go/storage

Environment
GKE

Go Environment
$ go version
go version go1.20 linux/amd64

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home//.cache/go-build"
GOENV="/home//.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home//go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home//go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/usr/local/home//go.mod"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build824194778=/tmp/go-build -gno-record-gcc-switches"

When reading objects from GCS, we're getting a lot of "stream error: stream ID x; INTERNAL_ERROR" errors. They come and go with no clear pattern; essentially it's very similar to #784

Additional context
We have no problem with Go 1.19.5 & https://pkg.go.dev/cloud.google.com/go/storage 1.28 or 1.29. After upgrading to Go 1.20 or Go 1.20.1, we encounter many GCS errors.

@garan82 garan82 added the triage me I really want to be triaged. label Feb 17, 2023
@codyoss codyoss changed the title [storage] "stream error: stream ID x; INTERNAL_ERROR" when upgrading from Go 1.95 to 1.20 or 1.20.1 storage: "stream error: stream ID x; INTERNAL_ERROR" when upgrading from Go 1.95 to 1.20 or 1.20.1 Feb 17, 2023
@codyoss codyoss added the api: storage Issues related to the Cloud Storage API. label Feb 17, 2023
@tritone
Contributor

tritone commented Feb 21, 2023

Thanks for reporting. The fact that it changes with an upgrade to the Go version makes me think that there is an issue in net/http, which the storage client uses internally.

Can you clarify which version of Go you were using previously (1.95 doesn't exist, I think)? Also, are you using HTTP/1 or HTTP/2 (the default)?

I can escalate this to the Go language team, or take a look to see if any similar issues are reported there.

@garan82
Author

garan82 commented Feb 21, 2023 via email

@garan82 garan82 changed the title storage: "stream error: stream ID x; INTERNAL_ERROR" when upgrading from Go 1.95 to 1.20 or 1.20.1 storage: "stream error: stream ID x; INTERNAL_ERROR" when upgrading from Go 1.19.5 to 1.20 or 1.20.1 Feb 22, 2023
@tritone
Contributor

tritone commented Feb 22, 2023

Sounds good. I will look into it with the Go team. In the meantime, if you want to upgrade the language version, one option is to disable HTTP/2 (by setting GODEBUG=http2client=0).
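
If setting the environment variable isn't convenient, roughly the same thing can be done in code by handing the client an HTTP/1.1-only transport. This is only a sketch (the function name, scope choice, and the wiring through google.golang.org/api/transport/http are my assumptions, not an official recipe):

import (
	"context"
	"crypto/tls"
	"net/http"

	"cloud.google.com/go/storage"
	"google.golang.org/api/option"
	htransport "google.golang.org/api/transport/http"
)

// newHTTP1StorageClient builds a storage client whose transport has HTTP/2
// disabled; per net/http, a non-nil empty TLSNextProto map turns HTTP/2 off.
func newHTTP1StorageClient(ctx context.Context) (*storage.Client, error) {
	base := &http.Transport{
		TLSNextProto: map[string]func(string, *tls.Conn) http.RoundTripper{},
	}
	// Wrap the base transport with the usual auth handling.
	trans, err := htransport.NewTransport(ctx, base, option.WithScopes(storage.ScopeFullControl))
	if err != nil {
		return nil, err
	}
	return storage.NewClient(ctx, option.WithHTTPClient(&http.Client{Transport: trans}))
}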

@tritone tritone added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. labels Feb 22, 2023
@neild

neild commented Feb 22, 2023

Does the error contain the string "; received from peer", as in "stream error: stream ID ####; INTERNAL_ERROR; received from peer"?

If so, this is an error produced by the GFE (the destination webserver), indicating an internal error on its part. Generally, GFEs generate this error when the backend (the GCS server behind the GFE) halts mid-response.

I've heard other reports of these errors showing up more with Go 1.20. I don't know what might have changed from 1.19 to affect this, but my current hypothesis is that we've always been receiving these errors but are surfacing them more frequently now for some reason. (For example, perhaps if we get an error and a clean stream termination at about the same time, maybe we're more likely to process the error now--although I haven't found any changes that seem like they'd cause that to happen.)

A reliable reproduction case would help a lot in narrowing down the root cause of the changed behavior on the Go end, at least. However, the actual error here seems to be on the remote end. (Assuming I'm right and this is a "received from peer" error.)

@tritone
Contributor

tritone commented Feb 22, 2023

@garan82 it would also be helpful if you have a rough estimate on what percentage of the time these errors occur, and how large the files you are reading are. I am going to try to reproduce.

@garan82
Author

garan82 commented Feb 22, 2023

@neild yes, the error contains "; received from peer".

@neild/@tritone: it doesn't happen consistently, which is why we first thought it was intermittent.

  • It happens with Seek, but not Read.
  • Our GCS files are ~100-200 MB.
  • I'm not sure about the percentage of time this happens; still digging into that.

@mikewiacek
Contributor

To clarify: not Seek, but NewRangeReader.

Our file format is somewhat analogous to SSTables, and we must read the index at the end of a file to efficiently find where the data we want sits within the larger file. So most operations on our end call NewRangeReader multiple times.
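
Roughly, that access pattern looks like the sketch below (readViaIndex, indexEntry, and parseIndex are simplified placeholders, not our real code):

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

// indexEntry and parseIndex stand in for our format's actual index decoding.
type indexEntry struct{ Offset, Length int64 }

func parseIndex(b []byte) []indexEntry { return nil /* format-specific */ }

// readViaIndex reads the index at the tail of the object, then issues
// targeted range reads for the blocks it points at.
func readViaIndex(ctx context.Context, obj *storage.ObjectHandle, indexSize int64, dst io.Writer) error {
	// A negative offset is a suffix read: the last indexSize bytes of the object.
	idxReader, err := obj.NewRangeReader(ctx, -indexSize, -1)
	if err != nil {
		return err
	}
	idxBytes, err := io.ReadAll(idxReader)
	idxReader.Close() // always close the reader, even on error paths
	if err != nil {
		return err
	}

	for _, e := range parseIndex(idxBytes) {
		r, err := obj.NewRangeReader(ctx, e.Offset, e.Length)
		if err != nil {
			return err
		}
		_, err = io.Copy(dst, r)
		r.Close()
		if err != nil {
			return err
		}
	}
	return nil
}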

I've tried to reproduce this outside of our production environment on GKE, but to no avail; it's a sporadic issue. One theory I have is that it occurs more often when we're doing a large number of object reads from hundreds (or thousands) of simultaneous readers. We've hit the 50 Gbps read limit for GCS multi-region buckets on numerous occasions. One hard-to-verify theory is that this is an artifact of that rate limiting, with the 50 Gbps limit manifesting in the GFEs in some way. It's a remote chance, since the limit was the same with Go 1.19, but I'm out of ideas.

@tritone
Contributor

tritone commented Feb 25, 2023

@mikewiacek thanks for clarifying. I also haven't been able to repro so far but I will keep an eye out for these errors in our internal client library perf testing environment.

I think that if the rate of errors is low, it might also be worthwhile to just add this error to your retry predicate using the customization features in the library. Example code to set it for a single call:

import (
	"errors"

	"cloud.google.com/go/storage"
	"golang.org/x/net/http2"
)

func myErrorFunc(err error) bool {
	// Keep the client's default retry behavior.
	if storage.ShouldRetry(err) {
		return true
	}
	// Additionally retry HTTP/2 stream errors with code INTERNAL_ERROR.
	// x/net/http2 surfaces these as http2.StreamError values.
	var streamErr http2.StreamError
	if errors.As(err, &streamErr) && streamErr.Code == http2.ErrCodeInternal {
		return true
	}
	return false
}

// To use for a single call:
o := client.Bucket(bucketName).Object(objectPath).Retryer(storage.WithErrorFunc(myErrorFunc))
reader, err := o.NewRangeReader(ctx, 0, -1)

This can also be set for all calls using client.SetRetry with the same option.

Also, you can add logging in the http2 error case of myErrorFunc if you want to track how frequently this case occurs going forward.
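
For example (the counter and log message are just illustrative; this needs "log" and "sync/atomic" in addition to the imports above):

var internalStreamErrCount atomic.Int64 // illustrative counter

func myErrorFuncWithLogging(err error) bool {
	if storage.ShouldRetry(err) {
		return true
	}
	var streamErr http2.StreamError
	if errors.As(err, &streamErr) && streamErr.Code == http2.ErrCodeInternal {
		n := internalStreamErrCount.Add(1)
		log.Printf("retrying HTTP/2 INTERNAL_ERROR stream error (count=%d): %v", n, err)
		return true
	}
	return false
}

// Apply it to every call made through the client:
// client.SetRetry(storage.WithErrorFunc(myErrorFuncWithLogging))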

@mikewiacek
Contributor

@tritone if we were to bring up one of our jobs that hits really high throughput, and use Go 1.20 so we can make these errors occur, is there value in coordinating such a run with you? We can make them occur pretty regularly in that scenario, and maybe there is some real-time debugging you can do? I'm going to try the retry function on our end and add some metrics to it to see how often it fires.

@tritone
Contributor

tritone commented Feb 27, 2023

@mikewiacek thanks for the offer; it would definitely be helpful to coordinate, and at the very least we would be able to look at specific error occurrences to try to get a better handle on the failure mechanism. You can file a case with support and reference this issue to start a non-public thread. The support engineer should be able to look up my email/ldap from my GitHub handle.

@mikewiacek
Contributor

This doesn't look to be related to Go 1.20 after all. We updated some Google Cloud Go dependencies while staying on Go 1.19 and have reproduced the issue there. I've diffed the go.mod that worked without issue against the go.mod with the updates, pruning deps that are unlikely to be related... Here's the current diff. One of the + lines is probably responsible for the issues we're seeing.

-	cloud.google.com/go/aiplatform v1.16.0
-	cloud.google.com/go/bigquery v1.37.0
+	cloud.google.com/go/aiplatform v1.24.0
+	cloud.google.com/go/bigquery v1.42.0
-	cloud.google.com/go/oslogin v1.3.0
-	cloud.google.com/go/pubsub v1.24.0
+	cloud.google.com/go/oslogin v1.5.0
+	cloud.google.com/go/pubsub v1.25.1
-	cloud.google.com/go/spanner v1.33.0
-	cloud.google.com/go/storage v1.25.0
-	contrib.go.opencensus.io/exporter/prometheus v0.2.0
-	contrib.go.opencensus.io/exporter/stackdriver v0.13.13
+	cloud.google.com/go/spanner v1.39.0
+	cloud.google.com/go/storage v1.27.0
+	contrib.go.opencensus.io/exporter/prometheus v0.4.2
+	contrib.go.opencensus.io/exporter/stackdriver v0.13.14
-	github.com/golang/protobuf v1.5.2
+	github.com/golang/protobuf v1.5.3
-	github.com/googleapis/gax-go/v2 v2.5.1
+	github.com/googleapis/gax-go/v2 v2.6.0
-	github.com/prometheus/client_golang v1.12.2
+	github.com/prometheus/client_golang v1.14.1-0.20221122130035-8b6e68085b10
-	go.opencensus.io v0.23.0
+	go.opencensus.io v0.24.0
-	golang.org/x/crypto v0.0.0-20220829220503-c86fa9a7ed90
-	golang.org/x/exp v0.0.0-20220722155223-a9213eeb770e
-	golang.org/x/net v0.0.0-20221014081412-f15817d10f9b
-	golang.org/x/oauth2 v0.0.0-20221014153046-6fdb5e3db783
-	golang.org/x/sync v0.0.0-20220929204114-8fcdb60fdcc0
-	golang.org/x/sys v0.5.0
-	golang.org/x/time v0.0.0-20220722155302-e5dcc9cfc0b9
-	golang.org/x/tools v0.1.12
+	golang.org/x/crypto v0.6.0
+	golang.org/x/exp v0.0.0-20230310171629-522b1b587ee0
+	golang.org/x/net v0.7.0
+	golang.org/x/oauth2 v0.5.0
+	golang.org/x/sync v0.1.0
+	golang.org/x/sys v0.6.0
+	golang.org/x/time v0.3.0
+	golang.org/x/tools v0.6.0
-	google.golang.org/genproto v0.0.0-20220802133213-ce4fa296bf78
-	google.golang.org/grpc v1.50.1
-	google.golang.org/protobuf v1.28.1
+	google.golang.org/genproto v0.0.0-20221010155953-15ba04fc1c0e
+	google.golang.org/grpc v1.52.0-dev
+	google.golang.org/protobuf v1.29.0
-	cloud.google.com/go v0.103.0 // indirect
-	cloud.google.com/go/iam v0.3.0 // indirect
+	cloud.google.com/go v0.104.0 // indirect
+	cloud.google.com/go/iam v0.5.0 // indirect
-	cloud.google.com/go/monitoring v1.5.0 // indirect
+	cloud.google.com/go/monitoring v1.6.0 // indirect
+	cloud.google.com/go/profiler v0.3.0 // indirect
-	github.com/census-instrumentation/opencensus-proto v0.4.0 // indirect
+	github.com/census-instrumentation/opencensus-proto v0.4.1 // indirect
-	github.com/opencontainers/image-spec v1.0.3-0.20220114050600-8b9d41f48198 // indirect
+	github.com/opencontainers/image-spec v1.1.0-rc2 // indirect
-	github.com/opentracing/basictracer-go v1.0.0 // indirect
+	github.com/opentracing/basictracer-go v1.1.0 // indirect
-	github.com/prometheus/client_model v0.2.0 // indirect
-	github.com/prometheus/common v0.37.0 // indirect
-	github.com/prometheus/procfs v0.7.3 // indirect
-	github.com/prometheus/prometheus v0.37.0 // indirect
-	github.com/prometheus/statsd_exporter v0.22.4 // indirect
+	github.com/prometheus/client_model v0.3.0 // indirect
+	github.com/prometheus/common v0.39.0 // indirect
+	github.com/prometheus/procfs v0.8.0 // indirect
+	github.com/prometheus/prometheus v0.39.2 // indirect
+	github.com/prometheus/statsd_exporter v0.22.7 // indirect
+	go.opentelemetry.io/otel v1.11.2 // indirect
+	go.opentelemetry.io/otel/trace v1.11.2 // indirect
-	golang.org/x/mod v0.6.0-dev.0.20220419223038-86c51ed26bb4 // indirect
-	golang.org/x/term v0.0.0-20220526004731-065cf7ba2467 // indirect
-	golang.org/x/text v0.3.8 // indirect
+	golang.org/x/mod v0.9.0 // indirect
+	golang.org/x/term v0.5.0 // indirect
+	golang.org/x/text v0.8.0 // indirect

@tritone
Contributor

tritone commented Mar 16, 2023

Okay, I am a little confused now: the original message mentioned that "We have no problem with Go 1.19.5 & https://pkg.go.dev/cloud.google.com/go/storage 1.28 or 1.29". Your summary states that storage is being upgraded from v1.25.0 to v1.27.0. Could you clarify which versions of storage are affected or not? I think most of the other dependency diff should be irrelevant to this issue, unless there is something outside of the storage package that is causing more network connections to be dropped/severed for your application.

@neild

neild commented Mar 30, 2023

I've finished a lengthy debugging session with someone else who has been seeing elevated INTERNAL_ERROR rates with more recent versions of the HTTP/2 client. I can't guarantee that the root cause is the same, but it's plausible.

HTTP/2 connections are flow-controlled. Flow control is on a per-connection and per-stream basis (where a stream corresponds to an HTTP request). The Go HTTP/2 client gives servers an initial 1GiB of connection-level quota and 4MiB of per-stream quota. As data is read from response bodies, the HTTP/2 client provides additional quota to the server so that it can send more data.

This means that a server can send an initial 4MiB of data in response to a request. The client will send it an update permitting more data to be sent after the user reads some of the response body.

The HTTP/2 client's handling of connection-level flow control (which spans all requests) changed in v0.5.0 of golang.org/x/net.

In v0.4.0, the client essentially disables connection-level flow control. Any time the user reads from any response body, if the server has less than 0.5GiB of connection-level quota, the client adds quota to bring it to 1GiB. (This behavior is strange enough that I think it must have been unintentional.)

In v0.5.0, the client returns connection-level flow control quota in proportion to the amount of data read from response bodies. Read 1MiB, 1MiB of quota is returned. Quota is also returned for data discarded when a body is closed without being read.

The practical impact of this is that a server talking to a v0.4.0 or earlier client will never be blocked by connection-level flow control. A server talking to a v0.5.0 or later client may be blocked by connection-level flow control. This can happen if the client has enough open, unclosed streams containing unread data, for example if the user has forgotten to call Response.Body.Close on a discarded request.

This can result in a case where a server can no longer send response bodies. New requests to the server will return successfully, because response headers are not flow controlled, but reading from the response body will block.

A read which blocks for a sufficiently long time may be closed by the GFE with an INTERNAL_ERROR status. I'm not sure if that reflects the backend serving the request restarting, or just the backend giving up on trying to write a response that it can make no progress on.

If this is what's happening here, the fix is to make sure to always close response bodies so that the resources associated with the request can be reclaimed. (You'd want to do this regardless of flow control issues, to avoid accumulating large amounts of unread, buffered data.)
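
For users of this client, the analogous advice is to always close the storage.Reader (it owns the underlying response body), even on early-exit paths. A minimal sketch, with illustrative names:

import (
	"context"
	"io"

	"cloud.google.com/go/storage"
)

// copyObject streams one object into dst. The deferred Close runs on every
// return path, so the HTTP/2 stream (and its flow-control quota) is released
// even if the copy fails partway through.
func copyObject(ctx context.Context, obj *storage.ObjectHandle, dst io.Writer) error {
	r, err := obj.NewRangeReader(ctx, 0, -1)
	if err != nil {
		return err
	}
	defer r.Close()

	_, err = io.Copy(dst, r)
	return err
}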

Note that Go 1.20 does not have the change to flow control I'm describing here. However, cloud.google.com/go does depend on golang.org/x/net, so newer versions of that package will pick up this change.

@tritone
Contributor

tritone commented Jul 13, 2023

Belated thanks to @neild and @mikewiacek for the extensive investigation here. I'm working on a change to ensure that we are closing response bodies appropriately for all error cases in the Reader. Will update when I have a PR.
