Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proxy.golang.org: single intermittent failure of vanity URL causes stale cache #49916

Open
myitcv opened this issue Dec 2, 2021 · 9 comments
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. proxy.golang.org
Milestone

Comments

@myitcv
Copy link
Member

myitcv commented Dec 2, 2021

What version of Go are you using (go version)?

$ go version
go version devel go1.18-d34051bf16 Thu Dec 2 07:04:05 2021 +0000 linux/arm64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/myitcv/.cache/go-build"
GOENV="/home/myitcv/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/myitcv/gostuff/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/myitcv/gostuff"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/myitcv/gos"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/myitcv/gos/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="devel go1.18-d34051bf16 Thu Dec 2 07:04:05 2021 +0000"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3900600216=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Related to #34370.

Around 2021/12/01 1754 UTC we cut a new release of CUE, v0.4.1-beta.4.

At the time that could not be resolved via sum.golang.org (https://sum.golang.org/lookup/cuelang.org/[email protected]) because of:

not found: cuelang.org/[email protected]: unrecognized import path "cuelang.org/go": https fetch: Get "https://cuelang.org/go?go-get=1": dial tcp 68.183.23.220:443: connect: connection refused

i.e. it appears a transient error resolving https://cuelang.org/go?go-get=1 caused sum.golang.org to cache an error.

However, sum.golang.org was not consistently returning this error (by IP):

# 172.217.169.49
not found: cuelang.org/[email protected]: unrecognized import path "cuelang.org/go": https fetch: Get "https://cuelang.org/go?go-get=1": dial tcp 68.183.23.220:443: connect: connection refused

# 142.250.187.241
8200543
cuelang.org/go v0.4.1-beta.4 h1:OQwpzifUIivXoxQK68EukVKnATP90GeelTSHnOW1qfI=
cuelang.org/go v0.4.1-beta.4/go.mod h1:P09/R4UfAEzLkV9DXxwlxQnIZbkaT4uIhiEgs6Vsz2Q=

go.sum database tree
8211598
6KTVEmHHgorOjo5YAT18Gj8CznzxCPxir+IF+SDAZUo=

— sum.golang.org Az3grmjNeN768KqG22Zo7F4Gp6T4entYI8PeYDSiWCjBsITq7p9CiDkiA62ZLw6aaG97j5rHOTesLqWZGVS836655QU=

What did you expect to see?

  • my sense from GOSUMDB failing "410 Gone" for publicly-available tag #34370 (comment) was that the sum.golang.org servers should now be more consistent in their responses. Is there some way in which the cache could be synchronised/invalidated in the case of error responses when it is known, as was the case here, that other instances of sum.golang.org were not in error? Hence I would either expect no cache discrepancies, or at least a much shorter "bad" cache time period
  • cmd/go (which I believe is what {proxy,sum}.golang.org use behind the scenes?) failing after a single call to ?go-get=1 also seems a little unfortunate. In this case the effect was exacerbated by the subsequent caching issue, but I wonder if there is some sort of retry logic that should be applied here in the case of connection refused? Not least because the SLA we can generally expect from vanity servers is unlikely to match that of, say, {proxy,sum}.golang.org. Hence, I would have expected us not to get in to this situation if a simply retry on connection refused was attempted.

What did you see instead?

A "long" bad cache on sum.golang.org which then prevented a successful release.

cc @bcmills for cmd/go and @katiehockman for {proxy,sum}.golang.org

Thanks to @mvdan and @seankhliao for helping to debug here.

@myitcv myitcv added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 2, 2021
@seankhliao
Copy link
Member

cc @heschi

@bcmills
Copy link
Contributor

bcmills commented Dec 2, 2021

I wonder if there is some sort of retry logic that should be applied here in the case of connection refused?

#28194 is probably relevant for that, but it's tricky: I would expect that an import-path server generally fails due to (say) temporary outages for machine or infrastructure updates more often than nondeterministic errors or short-term network drops. How do we balance retrying flakes vs. stalling on retries that won't succeed any time soon anyway (compare #40376)?

What's more, transparent retries can hide real problems — masking problems generally provides a short-term benefit at a long-term cost. (Compare #42699, and “Postel was wrong” is a good read too.)

@mvdan
Copy link
Member

mvdan commented Dec 2, 2021

#42809 might also be relevant. I can understand it if we don't want to keep retrying with short intervals.

That said, I think we can do better without wasting resources. For instance, like Paul said, one of the servers saw the version while another didn't. This resulted in inconsistent yet reproducible behavior among users, which we could reproduce for a solid ten minutes. I think that kind of inconsistency could lead to significant confusion in practice; imagine an user reports to upstream that go install always fails for them, but for upstream it always works.

At a conceptual level, I imagine that as soon as one of the servers sees a valid new version, it could invalidate cached errors from all other servers.

@heschi
Copy link
Contributor

heschi commented Dec 2, 2021

There is only one instance of the mirror, but it is a globally distributed system with multiple layers of caching. We're not trying to guarantee global consistency, so this is working as expected as long as the cache became consistent within half an hour. There is no way to synchronize the global cache.

@myitcv
Copy link
Member Author

myitcv commented Dec 3, 2021

@bcmills thanks, #28194 and #40376 are indeed relevant. But I think somewhat different. Because in this case the caller is not in control of the side effect of the (30 mins, albeit temporary) "bad" cache state that might result, yet they will likely suffer its (temporary) "long"-term cost, as will others. So for this specific error, one could argue that retrying and the stalling that results in is worth it.

@myitcv
Copy link
Member Author

myitcv commented Dec 3, 2021

@heschi thanks for the detail.

Is it always the case that the {proxy,sum} mirror reflects the first fetch of a module, or rather one of the fetches during the 30 minute window that follow the first fetch?

Regardless of whether cmd/go is changed in some way to retry, in the case of this specific error is there merit in having a shorter cache invalidation period (apologies if I'm not using the correct terminology)? Again the reasoning here being that the caller, all else being equal, is very likely not in control of this category of error.

@heschi
Copy link
Contributor

heschi commented Dec 3, 2021

The only guarantee we attempt to make is that users see data that's less than half an hour old. There is no concept of a "first" fetch in the system. So if it does multiple fetches within a 30 minute interval, users may see any of the results in the interim.

The go command doesn't provide any kind of structured output for fetch errors, and the mirror has so far avoided parsing error text. I don't think this is a critical enough case to start doing that.

@myitcv
Copy link
Member Author

myitcv commented Dec 6, 2021

There is no concept of a "first" fetch in the system. So if it does multiple fetches within a 30 minute interval, users may see any of the results in the interim.

Obviously not to question your explanation, but rather to share my understanding to this point which has been that the first fetch was the significant one. However it sounds like there are some scenarios where {proxy,sum}.golang.org can settle on a later module version that is different from the first fetch version, i.e. following a tag move?

The go command doesn't provide any kind of structured output for fetch errors, and the mirror has so far avoided parsing error text. I don't think this is a critical enough case to start doing that.

I'll defer on this point, because they only data points I am aware of are my own issue report.

@heschi
Copy link
Contributor

heschi commented Dec 6, 2021

Not all fetches are successful, as in this issue. And not all names stay the same forever, e.g. branches. But we are now way off topic for this issue.

@seankhliao seankhliao changed the title {proxy,sum}.golang.org: single intermittent failure of vanity URL causes stale cache proxy.golang.org: single intermittent failure of vanity URL causes stale cache Aug 20, 2022
@seankhliao seankhliao added this to the Unplanned milestone Aug 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. proxy.golang.org
Projects
None yet
Development

No branches or pull requests

6 participants