x/build/cmd/coordinator: failures are sometimes missing error output #39349
Comments
The failures are still occurring, but they now have one extra line of output: a
2021-08-27T01:08:12-c927599/linux-386-longtest
This seems like a very high failure rate, and given that the recent failures are only on the 386-longtest builder, I suspect that this is due to some test running out of address space. Probably either the builder itself needs to be tuned or the
I wonder if this could also just be a timing issue. Maybe the coordinator isn't giving the
Some of these failures from the second comment could be because dist doesn't implement timeouts on
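For illustration only, here is a minimal sketch (not cmd/dist's actual code) of how a per-test timeout could be enforced around a subprocess in Go, so that a wedged test reports a clear timeout error instead of stalling silently. The 10-minute cap and the `go test ./...` invocation are assumptions chosen just for the example.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithTimeout runs a command and kills it if it exceeds the given timeout,
// so a wedged test produces a clear error instead of hanging forever.
// This is an illustrative sketch, not how cmd/dist is actually structured.
func runWithTimeout(timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, name, args...)
	out, err := cmd.CombinedOutput()
	if ctx.Err() == context.DeadlineExceeded {
		return fmt.Errorf("%s timed out after %v; partial output:\n%s", name, timeout, out)
	}
	return err
}

func main() {
	// Hypothetical invocation: run a (long) test package with a 10-minute cap.
	if err := runWithTimeout(10*time.Minute, "go", "test", "./..."); err != nil {
		fmt.Println("test run failed:", err)
	}
}
```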
The failures all indicate that cmd/dist failed due to a

It appears that Docker will send a

Is there any logging of timeouts done by the coordinator? I don't know this code myself. I'm not sure this is a release blocker because it looks to me like an issue with the builder system rather than something that will be fixed in the release.
My understanding from https://go.dev/wiki/PortingPolicy#first-class-ports is that first-class ports are expected to have passing tests at each release, and

The inflection point of the failure rate of this builder appears to have been around August 2020. Considering the state of the world in August 2020, I can understand why it wasn't addressed immediately at that time, but I do not know why it remained unaddressed at the time of the Go 1.17 release in 2021. I assume that it was overlooked because it lacked the
In addition to the interaction with the porting policy, to me this kind of issue is also a matter of equity. I watch the builders to check whether there have been regressions in the parts of the project for which I am responsible, and failures in the

So when I see failures on this builder, I check them. A significant rate of false-positive failures causes a significant amount of unproductive, avoidable triage work, and that in turn contributes to feelings of frustration and burnout. Since the Go project does not seem to have anyone else triaging new or existing builder failures with any regularity, I feel that the costs of this ongoing flakiness have been externalized on to me.

#33598 went through a trajectory very similar to this issue: we had a series of recurring failures on the builders for

#39665 was also similar: Dmitri reported

If we consider subrepo tests, there are many more examples. As I understand it from #11811 the policy of the project is that subrepo tests should be passing before a release, but for at least the past couple of years we have cut releases with frequently- or persistently-broken subrepo builders. (Some examples: #45700, #31567, #36163; the latter on

My takeaway from those cases is that persistent builder issues generally will not be addressed unless I address them myself (as in #33598, #31567, and #36163), or they actively interfere with running

Letting these kinds of persistent issues linger was understandable as a short-term situation in 2020, but it isn't tenable as a steady state for a large, staffed project. We all want to land releases, and our existing policy (at least as I understand it) is that to land a release we also need to keep the build healthy. Adhering to that policy provides some backpressure on the accumulation of unplanned technical debt, and helps to “internalize the externality” of flaky and broken builders.
I feel your pain, and many thanks for doing this unrewarding work. Still, this seems like a process problem. I see no reason that an issue like this should block a release. If the linux-386-longtest builder failed 50% of the time, then it would have to block the release because it might be hiding other problems. But that's not the case here; we are getting enough successful runs to have reason to believe that the build is basically OK.

You are pointing out that if these issues are not marked as release blockers, then they will never be addressed. That is the problem to fix. We shouldn't get ourselves into a position where we have to hold up a release because of a problem with the builders, where we have no reason to believe that the problem is, or is hiding, a problem with the release. So I sympathize with the approach that you are taking, but I think we need to develop a different approach.
I agree, but I think we already have a (documented but unenforced) solution to the process problem: namely, to align the incentives between shipping the release and maintaining a healthy build.
I agree, but we could also avoid getting ourselves into that position by identifying and addressing new failures when they are introduced. If the builders are consistently passing, then they will almost always remain so unless failures are introduced or exposed by specific code or configuration changes. If we respond to new failures immediately, we can more easily identify the likely causes, and either roll back or adapt to the changes that introduced them. As I understand it, the release freeze exists precisely to allow time for new problems to be addressed before the scheduled release.

For example, the spike in the

For another example, the failure rate for the NetBSD and OpenBSD builders spiked during the current cycle (#49209 and related issues) due to a latent platform bug exposed by a configuration change in mid-October. The compiler & runtime team began diagnosing it in earnest in early November, the builders were reconfigured to diagnose the problem, and a configuration that works around the problem was identified on Nov. 29 (#49209 (comment)) — all within the normal release freeze window (if barely). The process worked, but I suspect that it only even started when it did because I tagged that issue

If we follow the process and fix the problems as they are introduced, it doesn't need to hold up the release. The apparent tension between fixing the builds and cutting the release on time is, as far as I can tell, a consequence of putting off the build-fixes until the release is already looming.
If we want to call a mulligan on the process problem, I suggest that we treat this issue as a release-blocker, but for Go 1.19 rather than 1.18. I firmly believe that it should be a release-blocker for some release in the near future, because otherwise the incentives of the process are misaligned and we have no check on accumulating an unbounded amount of unplanned technical debt. However, I agree that it does not indicate a regression specific to the 1.18 release, which is by all accounts a busy release cycle and major milestone and already delayed enough as it is. Making it a release-blocker for Go 1.19 would allow adequate time to ship the 1.18 release, and provide an entire development window in which to investigate and mitigate this issue.
In our team meeting this week, we discussed this and agreed that the next step here is to move this to the 1.19 milestone, to give us a chance to prioritize investigation of this issue during the 1.19 dev cycle. To make progress here, we need to do development work on x/build/cmd/coordinator, and the 1.18 freeze isn't the right period for that work. For the 1.18 freeze, if there are

Updating this issue so that it's more up to date with the latest reality, and we can adjust it further as needed.
This issue is currently labeled as early-in-cycle for Go 1.19. |
It appears that the regexp I used in #39349 (comment) was even under-counting: the
2022-03-21T18:58:42-79103fa/windows-386-2008
One more today:
Coordinator folks, what builder logs would we expect if cmd/dist itself got stuck? Does the buildlet or coordinator impose a timeout that would be logged, or would we expect the coordinator to just kill off the VM and the log to stop?
@dmitshur says that until two weeks ago, if cmd/dist wedged, the builder would eventually get killed and the build retried. And as of two weeks ago, it will now get killed and clearly marked as timed-out. Which suggests this is not cmd/dist wedging.
@aclements The relevant issue for the recent change is #42699 (comment). (Maybe it was 1 week ago instead of 2.) @cagedmantis also pointed out that the occasional coordinator restarts may be contributing to this problem. Issue #51057 is related to that.
Do we know if this is still happening? I have no idea if we've done anything to fix it, but there also haven't been complaints for months.

This may have been triggered by some underlying hang in a test in the

Closing as "probably obsolete" :)
https://build.golang.org/log/dab786adef1a18622f61641285864ac9c63fb7e3 is marked with `fail` on the dashboard, but the word `FAIL` does not appear in the output file at all. Either the output is truncated, or the last test in the log (`misc/cgo/testshared`) exited with a nonzero status and no output.

CC @dmitshur @toothrot @cagedmantis
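As a rough, hypothetical way to check a log like this by hand (a triage sketch, not part of the coordinator or dashboard code), one can fetch the log linked above and see whether it contains a `FAIL` marker at all, or simply stops after the last test that ran:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// Hypothetical triage helper, not part of the builder tooling: fetch the
// dashboard log linked above and report whether it contains a FAIL line at
// all, or appears to simply stop after the last test that ran.
func main() {
	const logURL = "https://build.golang.org/log/dab786adef1a18622f61641285864ac9c63fb7e3"

	resp, err := http.Get(logURL)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	if bytes.Contains(body, []byte("FAIL")) {
		fmt.Println("log contains a FAIL marker")
		return
	}

	// No FAIL marker: print the tail of the log to see where the output stops.
	lines := strings.Split(strings.TrimRight(string(body), "\n"), "\n")
	if n := len(lines); n > 5 {
		lines = lines[n-5:]
	}
	fmt.Println("no FAIL marker; output ends with:")
	for _, line := range lines {
		fmt.Println(line)
	}
}
```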