misc cleanups & bump gazette #1873
Conversation
The former produces the complete error chain, while the latter doesn't.
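This excerpt doesn't show which two formatting paths are being compared, but as a purely illustrative Go sketch of the general distinction: printing a `%w`-wrapped error yields the complete chain, while printing only an unwrapped inner error drops the surrounding context.

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	root := errors.New("connection refused")
	wrapped := fmt.Errorf("fetching spec: %w",
		fmt.Errorf("dialing control plane: %w", root))

	// Complete error chain: "fetching spec: dialing control plane: connection refused".
	fmt.Println(wrapped)

	// Only the root cause; the intermediate context is lost.
	fmt.Println(errors.Unwrap(errors.Unwrap(wrapped)))
}
```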
This commit contains only mechanical renames.
Remove `legacyCheckpoint` and `legacyState` migration mechanism, as the migration has been completed. Refactor taskBase.heartbeatLoop() and call it earlier in the lifecycle of captures / derivations / materializations. We can query for the current container at the time of periodic stats generation, rather than waiting until after a first container is up to start the loop.
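A minimal sketch of the loop shape this describes. Only `taskBase` and `heartbeatLoop` come from the text above; the `container` type and `currentContainer` field are hypothetical stand-ins, since the actual shapes aren't shown in this excerpt.

```go
package task

import (
	"context"
	"log"
	"time"
)

// container is a hypothetical stand-in for the real container type.
type container struct{ ID string }

type taskBase struct {
	// currentContainer returns the currently-running container, or nil if
	// none has started yet. Hypothetical field, for illustration only.
	currentContainer func() *container
}

// heartbeatLoop emits periodic stats. It runs from early in the task
// lifecycle and queries the current container at each tick, rather than
// being started only after a first container is up.
func (t *taskBase) heartbeatLoop(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if c := t.currentContainer(); c != nil {
				log.Printf("stats tick: container %s", c.ID)
			} else {
				log.Print("stats tick: no container yet")
			}
		}
	}
}
```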
`controlPlane` encapsulates commonalities in calling control plane APIs on behalf of a data-plane task context.
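A hypothetical sketch of what such an encapsulation might look like: only the `controlPlane` name comes from the commit description; the fields and the `do` method are illustrative assumptions, not the actual API.

```go
package task

import (
	"context"
	"fmt"
	"net/http"
)

// controlPlane bundles the pieces every control-plane call needs (endpoint,
// shared HTTP client, and the identity of the data-plane task making the
// call) so individual call sites don't repeat them.
type controlPlane struct {
	endpoint string       // base URL of the control-plane API
	client   *http.Client // shared HTTP client
	taskName string       // data-plane task this context acts on behalf of

	// authFn mints a task-scoped token. Hypothetical hook, for illustration.
	authFn func(ctx context.Context) (string, error)
}

// do issues an authenticated request to a control-plane API path.
func (cp *controlPlane) do(ctx context.Context, method, path string) (*http.Response, error) {
	token, err := cp.authFn(ctx)
	if err != nil {
		return nil, fmt.Errorf("authorizing %s: %w", cp.taskName, err)
	}
	req, err := http.NewRequestWithContext(ctx, method, cp.endpoint+path, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return cp.client.Do(req)
}
```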
038baaf to 063a0cc (Compare)
Ping
LGTM
"assignment", shard.Assignment().Decoded, | ||
) | ||
|
||
// TODO(johnny): Notify control-plane of failure. |
What does this mean? AFAIK we currently find out about task errors because they log `shard failed`. Would this be a different mechanism?
This commit is lifted from a feature branch I wrote a couple of months ago, when the plan was to have reactors call out to the control plane here. Instead, we're going to do more with the `shard failed` logs we already produce. This should just be removed.
}

if sc := httpResp.StatusCode; sc >= 500 && sc < 600 {
	skim.RetryMillis = rand.Uint64N(4_750) + 250 // Random backoff in range [0.250s, 5s].
Are there any downstream consequences of never surfacing these? Like, is there any reason we should limit the number of times we retry a 5xx error before surfacing it?
So far I've only seen these pop up transiently when there's heavy load on agent-api, in which case a retry seems like the right solution.
They do get surfaced in the agent-api Cloud Run service, both as plots and as logs.
We ourselves never return 500s -- they're coming from Cloud Run, for its own reasons, uncorrelated with anything we're doing. If it has a longer outage, this retry is the best handling we can have that I'm aware of.
(Note also that if we logged them out here, it would be an explosive increase in our own log volume during a service-wide Cloud Run disruption.)
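For what it's worth, a quick sketch of the backoff arithmetic in the hunk above: `rand.Uint64N(4_750)` draws uniformly from [0, 4750), so adding 250 gives a delay uniformly distributed over [250ms, 5000ms).

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

func main() {
	// Same jitter as the hunk above: uniform over [0.250s, 5s).
	retryMillis := rand.Uint64N(4_750) + 250
	delay := time.Duration(retryMillis) * time.Millisecond
	fmt.Println("retrying after", delay)
}
```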
Description:
Various minor improvements and refactor cleanups which were rebased / extracted from an abandoned work branch.
No functional changes aside from improved `flowctl` errors.
Workflow steps:
(How does one use this feature, and how has it changed)
Documentation links affected:
(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)
Notes for reviewers:
(anything that might help someone review this PR)
This change is Reviewable