proxy.golang.org: single intermittent failure of vanity URL causes stale cache #49916
Comments
cc @heschi
#28194 is probably relevant for that, but it's tricky: I would expect that an import-path server generally fails due to (say) temporary outages for machine or infrastructure updates more often than nondeterministic errors or short-term network drops. How do we balance retrying flakes vs. stalling on retries that won't succeed any time soon anyway (compare #40376)? What's more, transparent retries can hide real problems — masking problems generally provides a short-term benefit at a long-term cost. (Compare #42699, and “Postel was wrong” is a good read too.)
#42809 might also be relevant. I can understand it if we don't want to keep retrying at short intervals. That said, I think we can do better without wasting resources. For instance, as Paul said, one of the servers saw the version while another didn't. This resulted in inconsistent yet reproducible behavior among users, which we could reproduce for a solid ten minutes. I think that kind of inconsistency could lead to significant confusion in practice; imagine a user reporting to upstream that …

At a conceptual level, I imagine that as soon as one of the servers sees a valid new version, it could invalidate cached errors from all other servers.
There is only one instance of the mirror, but it is a globally distributed system with multiple layers of caching. We're not trying to guarantee global consistency, so this is working as expected as long as the cache becomes consistent within half an hour. There is no way to synchronize the global cache.
@bcmills thanks, #28194 and #40376 are indeed relevant, but I think somewhat different: in this case the caller is not in control of the side effect of the (30-minute, albeit temporary) "bad" cache state that might result, yet they will likely suffer its (temporary) "long"-term cost, as will others. So for this specific error, one could argue that retrying, and the stalling that results, is worth it.
@heschi thanks for the detail. Is it always the case that the …

Regardless of whether …
The only guarantee we attempt to make is that users see data that's less than half an hour old. There is no concept of a "first" fetch in the system, so if the mirror does multiple fetches within a 30-minute interval, users may see any of those results in the interim.

The go command doesn't provide any kind of structured output for fetch errors, and the mirror has so far avoided parsing error text. I don't think this is a critical enough case to start doing that.
Obviously not to question your explanation, but rather to share my understanding up to this point, which has been that the first fetch was the significant one. However, it sounds like there are some scenarios where …
I'll defer on this point, because the only data point I am aware of is my own issue report.
Not all fetches are successful, as in this issue. And not all names stay the same forever, e.g. branches. But we are now way off topic for this issue. |
What version of Go are you using (`go version`)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (`go env`)?

What did you do?
Related to #34370.
Around 2021/12/01 1754 UTC we cut a new release of CUE, `v0.4.1-beta.4`. At the time that version could not be resolved via sum.golang.org (https://sum.golang.org/lookup/cuelang.org/[email protected]) because of:
i.e. it appears a transient error resolving https://cuelang.org/go?go-get=1 caused sum.golang.org to cache an error.
However, sum.golang.org was not consistently returning this error (by IP):
What did you expect to see?

`cmd/go` (which I believe is what `{proxy,sum}.golang.org` use behind the scenes?) failing after a single call to `?go-get=1` also seems a little unfortunate. In this case the effect was exacerbated by the subsequent caching issue, but I wonder if there is some sort of retry logic that should be applied here in the case of `connection refused`? Not least because the SLA we can generally expect from vanity servers is unlikely to match that of, say, `{proxy,sum}.golang.org`. Hence, I would have expected us not to get into this situation if a simple retry on `connection refused` was attempted.

What did you see instead?

A "long" bad cache on `sum.golang.org` which then prevented a successful release.

cc @bcmills for `cmd/go` and @katiehockman for `{proxy,sum}.golang.org`
Thanks to @mvdan and @seankhliao for helping to debug here.