
misc cleanups & bump gazette #1873

Merged: 5 commits into master from johnny/shard-failure on Jan 24, 2025
Conversation

@jgraettinger (Member) commented on Jan 16, 2025

Description:

Various minor improvements and refactor cleanups which were rebased / extracted from an abandoned work branch.

No functional changes aside from improved flowctl errors.

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)



@jgraettinger requested a review from psFried on January 16, 2025 at 03:39
The former produces the complete error chain, while the latter doesn't.
Remove `legacyCheckpoint` and `legacyState` migration mechanism, as the
migration has been completed.

Refactor `taskBase.heartbeatLoop()` and call it earlier in the lifecycle
of captures / derivations / materializations. We can query for the
current container at the time of periodic stats generation, rather than
waiting until after a first container is up to start the loop.
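For illustration, a minimal sketch of that shape (the signature, the `Container` type, and the callbacks are hypothetical stand-ins, not the actual `taskBase` API): the loop starts as soon as the task initializes and resolves the current container, which may still be nil, at each periodic stats tick.

```go
package runtime

import (
	"context"
	"time"
)

// Container is an illustrative stand-in for the runtime's container handle.
type Container struct{ Name string }

// heartbeatLoop starts ticking when the task initializes, and looks up the
// current container (possibly nil) each time periodic stats are emitted,
// rather than waiting for a first container before starting the loop.
func heartbeatLoop(
	ctx context.Context,
	interval time.Duration,
	currentContainer func() *Container,
	emitStats func(*Container),
) {
	var ticker = time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// A nil container simply means there's no usage to attribute yet.
			emitStats(currentContainer())
		}
	}
}
```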
`controlPlane` encapsulates commonalities in calling control plane APIs
on behalf of a data-plane task context.
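For illustration, a minimal sketch of what such an encapsulation could look like (all names, fields, and the endpoint shape are hypothetical, not the actual type):

```go
package runtime

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// controlPlane bundles the context shared by every control-plane API call
// made on behalf of a data-plane task (illustrative sketch only).
type controlPlane struct {
	endpoint  *url.URL
	authToken string
	taskName  string
	client    *http.Client
}

// post marshals a request, issues an authorized call, and decodes the
// response, so each API helper only supplies its path and payload types.
func (cp *controlPlane) post(ctx context.Context, path string, request, response any) error {
	var body bytes.Buffer
	if err := json.NewEncoder(&body).Encode(request); err != nil {
		return fmt.Errorf("encoding request: %w", err)
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, cp.endpoint.JoinPath(path).String(), &body)
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+cp.authToken)
	req.Header.Set("Content-Type", "application/json")

	resp, err := cp.client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s returned %s", path, resp.Status)
	}
	return json.NewDecoder(resp.Body).Decode(response)
}
```

With something like this, individual API helpers only name their path and payload types instead of repeating auth and encoding plumbing.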
@jgraettinger changed the title from "misc cleanups" to "misc cleanups & bump gazette" on Jan 21, 2025
@jgraettinger requested review from jshearer and removed the request for psFried on January 22, 2025 at 21:57
@jgraettinger (Member, Author)

Ping

@jshearer (Contributor) left a comment:

LGTM

"assignment", shard.Assignment().Decoded,
)

// TODO(johnny): Notify control-plane of failure.
@jshearer (Contributor):
What does this mean? AFAIK we currently find out about task errors because they log shard failed. Would this be a different mechanism?

@jgraettinger (Member, Author) replied on Jan 24, 2025:

This commit is lifted from a feature branch I wrote a couple of months ago, when the plan was to have reactors call out to the control plane here. Instead, we're going to do more with the shard failed logs we already produce. This should just be removed.

}

if sc := httpResp.StatusCode; sc >= 500 && sc < 600 {
skim.RetryMillis = rand.Uint64N(4_750) + 250 // Random backoff in range [0.250s, 5s].
@jshearer (Contributor):
Are there any downstream consequences of never surfacing these? Like, is there any reason we should limit the number of times we retry a 5xx error before surfacing it?

So far I've only seen these pop up transiently when there's heavy load on agent-api, in which case a retry seems like the right solution.

@jgraettinger (Member, Author) replied on Jan 24, 2025:

They do get surfaced in the agent-api Cloud Run service, both as plots and as logs.

We ourselves never return 500s; they're coming from Cloud Run, for its own reasons, and are uncorrelated with anything we're doing. If it has a longer outage, this retry is the best handling we can have that I'm aware of.

(Note also: if we logged them here, a service-wide Cloud Run disruption would explosively increase our own log volume.)
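To make that concrete, a minimal sketch of the retry behavior (the function and its signature are hypothetical, not the actual agent-api client): 5xx responses are retried indefinitely with a randomized backoff in [0.250s, 5s), which spreads retries out during a Cloud Run disruption instead of logging every failure.

```go
package client

import (
	"context"
	"math/rand/v2"
	"net/http"
	"time"
)

// retryOn5xx retries the given request function whenever it returns a 5xx
// status, sleeping for a random 0.250s-5s backoff between attempts.
func retryOn5xx(ctx context.Context, do func() (*http.Response, error)) (*http.Response, error) {
	for {
		resp, err := do()
		if err != nil {
			return nil, err
		}
		if sc := resp.StatusCode; sc < 500 || sc >= 600 {
			return resp, nil // Success, or a non-retryable status.
		}
		resp.Body.Close()

		// Same backoff as the diff above: random in [0.250s, 5s).
		var delay = time.Duration(rand.Uint64N(4_750)+250) * time.Millisecond
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(delay):
		}
	}
}
```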

@jgraettinger merged commit beb66df into master on Jan 24, 2025
4 checks passed
@jgraettinger deleted the johnny/shard-failure branch on January 24, 2025 at 16:44