[WIP] use list api to query multiple Cloud Build statuses together #6005
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6005 +/- ##
=========================================
Coverage ? 70.48%
=========================================
Files ? 463
Lines ? 17867
Branches ? 0
=========================================
Hits ? 12594
Misses ? 4340
Partials ? 933
Continue to review full report at Codecov.
pkg/skaffold/build/gcb/status.go
Outdated
		delete(jobs, id)
	}
}
statuses, err := getStatuses(projectID, jobs)
Is there a limit to the number of jobs that we can query? We can skip anything excessive since we should trim jobs as they succeed.
We trim the jobs as they succeed, fail, or are cancelled. The query parameter is freeform; I'm not sure if there is a limit, but I don't anticipate us hitting it.
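To make the concern concrete, here is a rough sketch of how the trimmed jobs map might be turned into a single `list` query; the map's value type, the `getStatuses` signature, and the filter format are assumptions for illustration rather than the PR's actual code. The point is that the filter only names builds still in flight, because finished jobs are deleted from the map first.

```go
package gcb

import (
	"fmt"
	"strings"

	cloudbuild "google.golang.org/api/cloudbuild/v1"
)

// buildFilter ORs together the IDs of the builds still being tracked so a
// single `list` call can return all of their statuses.
func buildFilter(jobs map[string]chan string) string {
	terms := make([]string, 0, len(jobs))
	for id := range jobs {
		terms = append(terms, fmt.Sprintf("build_id=%q", id))
	}
	return strings.Join(terms, " OR ")
}

func getStatuses(svc *cloudbuild.Service, projectID string, jobs map[string]chan string) (map[string]string, error) {
	resp, err := svc.Projects.Builds.List(projectID).Filter(buildFilter(jobs)).Do()
	if err != nil {
		return nil, err
	}
	statuses := make(map[string]string, len(resp.Builds))
	for _, b := range resp.Builds {
		statuses[b.Id] = b.Status // e.g. WORKING, SUCCESS, FAILURE, CANCELLED
	}
	return statuses, nil
}
```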
This code is a bit tough to review, but after running through some scenarios in my head and trying it out, things look good.
pkg/skaffold/build/gcb/types.go
Outdated
// RetryTimeout is the max amount of time to retry getting the status of the build before erroring
RetryTimeout = 3 * time.Minute
// MaxRetryCount is the max number of times we retry a throttled GCB API request
MaxRetryCount = 20
Hard to do the math in my head, but does 20 retries roughly equate to the 3-minute timeout we had before?
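As a rough sanity check (not the PR's actual backoff code; the starting delay and multiplier below are assumed values), 20 retries with a small exponential backoff lands in the same ballpark as the old 3-minute `RetryTimeout`:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		maxRetryCount = 20
		initialDelay  = 1 * time.Second // assumed starting delay
		factor        = 1.2             // assumed backoff multiplier
	)
	total := time.Duration(0)
	delay := initialDelay
	for i := 0; i < maxRetryCount; i++ {
		total += delay
		delay = time.Duration(float64(delay) * factor)
	}
	// With these assumed values this prints roughly 3m7s, close to the old
	// 3-minute RetryTimeout; with different real values the answer changes.
	fmt.Println(maxRetryCount, "retries ≈", total.Round(time.Second))
}
```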
	})
case projectID := <-p.trans:
	go func() {
		p.C <- projectID
If we already blocked on the `projectID := <-p.trans` call in the `case` statement, does this need to be in a `go func()`?
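For what it's worth, a minimal sketch of the trade-off (the poller fields here are assumptions, not the PR's exact struct): the receive from `p.trans` has already completed by the time the case body runs, but the send to `p.C` can still block if nothing is reading from `C` yet, which would stall the whole select loop; the goroutine keeps the loop responsive at the cost of one goroutine per message.

```go
package main

type poller struct {
	C     chan string
	trans chan string
	done  chan struct{} // assumed shutdown channel, for illustration only
}

func (p *poller) loop() {
	for {
		select {
		case projectID := <-p.trans:
			// A direct `p.C <- projectID` here would block this loop until a
			// consumer reads from C; the async send never blocks the loop.
			go func(id string) { p.C <- id }(projectID)
		case <-p.done:
			return
		}
	}
}

func main() {}
```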
	p.remove <- projectID
}

func (r *statusManagerImpl) run() {
This function could use a two- to three-sentence high-level description of what it does; there's a lot going on here :)
Something like:
run() manages the status of all GCB builds started by Skaffold, by using the `list` API to retrieve all jobs from a build ID. for each project ID encountered, a `poller` is set up to repeatedly poll and report status of all jobs, with a backoff mechanism in place to handle throttling from the API server
.... etc.
I'd definitely like to see some tests!
	var buildComplete bool
	delay := time.NewTicker(PollDelay)
	defer delay.Stop()
	buildResult := b.reporter.getStatus(ctx, projectID, remoteID)
WDYT of s/getStatus/trackStatus/ and s/reporter/monitor/?
- `getStatus` implies that we wait until we've gotten the result
- this `reporter` is monitoring for build status, and doesn't itself report.
@@ -144,60 +142,45 @@ func (b *Builder) buildArtifactWithCloudBuild(ctx context.Context, out io.Writer

	var digest string
	offset := int64(0)
	var buildComplete bool
	delay := time.NewTicker(PollDelay)
Suggested change:
-	delay := time.NewTicker(PollDelay)
+	delay := time.NewTicker(PollDelay) // fixed schedule for log polling
func (r *statusManagerImpl) setResult(result result) {
	r.resultMutex.RLock()
	r.results[result.jobID] <- result
Should we delete the result from `r.results`? If so, add a test case to ensure we don't see `.results` grow without end. If not, add a comment.
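If deleting is the way to go, here's a minimal sketch of what that could look like (the struct fields are assumptions based on the snippet above, not the PR's exact layout); note the delete needs the write lock rather than `RLock`:

```go
package main

import "sync"

type jobID struct {
	projectID string
	buildID   string
}

type result struct {
	jobID  jobID
	status string
}

type statusManagerImpl struct {
	resultMutex sync.RWMutex
	results     map[jobID]chan result
}

func (r *statusManagerImpl) setResult(res result) {
	r.resultMutex.Lock()
	ch, ok := r.results[res.jobID]
	if ok {
		delete(r.results, res.jobID) // drop the entry so the map doesn't grow unbounded
	}
	r.resultMutex.Unlock()
	if ok {
		ch <- res // deliver outside the lock so a slow receiver doesn't block other writers
	}
}

func main() {}
```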
	var buildComplete bool
	delay := time.NewTicker(PollDelay)
	defer delay.Stop()
	buildResult := b.reporter.getStatus(ctx, projectID, remoteID)
watch:
Loop labels can be hard to reason about, especially when used with both `goto` and `break`, and this loop relies on the `buildResult` channel not being closed (which seems unusual?).
Can we instead track the number of logging bytes last received, and turn the `for {}` into `for !buildComplete || lastLog > 0 {}`?
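A rough sketch of that restructured loop; everything here (the `buildStatus` type, the `copyLogs` helper, the field names) is a stand-in invented for illustration, not the PR's actual API. The `watch:` label and `goto` disappear: the loop runs while the build is incomplete or the last log read returned bytes, so trailing output is still drained after the build finishes.

```go
package main

import (
	"context"
	"io"
	"time"
)

type buildStatus struct{ done bool }

// copyLogs is a hypothetical helper that copies newly available log bytes
// starting at offset and reports how many were written.
func copyLogs(ctx context.Context, out io.Writer, offset int64) (int64, error) {
	return 0, nil // placeholder
}

func watchBuild(ctx context.Context, out io.Writer, buildResult <-chan buildStatus, pollDelay time.Duration) error {
	delay := time.NewTicker(pollDelay)
	defer delay.Stop()

	var buildComplete bool
	var offset int64
	lastLog := int64(1) // non-zero so the first iteration attempts a read
	for !buildComplete || lastLog > 0 {
		select {
		case res := <-buildResult:
			buildComplete = res.done
		case <-delay.C:
			n, err := copyLogs(ctx, out, offset)
			if err != nil {
				return err
			}
			offset += n
			lastLog = n
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return nil
}

func main() {}
```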
for {
	select {
	case projectID := <-p.remove:
		if b, found := p.timers[projectID]; found {
`b` is a non-obvious name. `t`?
		delete(p.timers, projectID)
		delete(p.backoffs, projectID)
	case projectID := <-p.reset:
		if b, found := p.timers[projectID]; found {
s/b/t/
	return reporter
}

// poller sends the `projectID` on channel `C` with an exponentially increasing period for each project.
WDYT about breaking `poller` out to a new file? It should have a `stop()` method.
Though it feels like you could simplify this and avoid the goroutine and internal channels with a `sync.Map`.
(I'm not sure that such a "simplification" would actually reduce the complexity though.)
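For comparison, a very loose sketch of the `sync.Map` shape (the field names and the backoff representation are invented here, not taken from the PR): the per-project state becomes a shared map mutated directly instead of a goroutine owning a timers map fed by remove/reset channels. The caveat above still applies, since read-modify-write sequences on the map need their own care.

```go
package main

import (
	"sync"
	"time"
)

type poller struct {
	backoffs sync.Map // projectID -> time.Duration of the next poll delay
}

func (p *poller) reset(projectID string) {
	p.backoffs.Store(projectID, time.Second)
}

func (p *poller) remove(projectID string) {
	p.backoffs.Delete(projectID)
}

// next returns the current delay for a project and doubles it for the next
// poll. Note the Load/Store pair is not atomic, which is exactly the kind of
// subtlety that makes the "simplification" debatable.
func (p *poller) next(projectID string) time.Duration {
	v, _ := p.backoffs.LoadOrStore(projectID, time.Second)
	d := v.(time.Duration)
	p.backoffs.Store(projectID, 2*d)
	return d
}

func main() {}
```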
// jobID represents a single build job
type jobID struct {
	projectID string
	buildID   string
}
WDYT of creating types for `buildID` and `projectID` for use in the signatures, to get some type checking into place:

type projectID string
type buildID string

I think the poller would benefit.
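A tiny illustration of what the defined types buy (the `getStatus` signature here is made up for the example): with plain strings, swapping the project and build IDs compiles silently, while the defined types turn the mix-up into a compile error.

```go
package main

type projectID string
type buildID string

// getStatus is a hypothetical signature; any function taking both IDs works.
func getStatus(p projectID, b buildID) string { return "" }

func main() {
	p := projectID("my-project")
	b := buildID("1a2b3c")
	_ = getStatus(p, b)
	// _ = getStatus(b, p) // compile error: cannot use b (buildID) as projectID
}
```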
// maintain single instance of the GCB client per skaffold process
client     *cloudbuild.Service
clientOnce sync.Once
Singletons are bad: they're a global dependency and they break modularity. We should strive to avoid them (look at the kubeContext/kubeConfig mess). At the very least, have this instance be managed by the `statusManagerImpl`.
You should be able to write tests for the `statusManagerImpl` with different backoff values (nobody wants a test to wait a minute just to verify a timeout, for example).
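A rough sketch of the shape being suggested (not the PR's code; the constructor and field names here are assumptions): the manager owns its client and takes the retry knobs as fields, so a test can inject small values, or a fake service, instead of depending on a package-level singleton and package-level constants.

```go
package gcb

import (
	"context"
	"time"

	cloudbuild "google.golang.org/api/cloudbuild/v1"
)

type statusManagerImpl struct {
	client        *cloudbuild.Service
	retryTimeout  time.Duration
	maxRetryCount int
}

// newStatusManager wires the client and backoff configuration in explicitly
// instead of reaching for a process-wide singleton.
func newStatusManager(ctx context.Context, retryTimeout time.Duration, maxRetryCount int) (*statusManagerImpl, error) {
	svc, err := cloudbuild.NewService(ctx)
	if err != nil {
		return nil, err
	}
	return &statusManagerImpl{
		client:        svc,
		retryTimeout:  retryTimeout,
		maxRetryCount: maxRetryCount,
	}, nil
}
```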
Will make updates, add tests, and reopen.
Fixes: #5888
Description
TODO: Add description, add tests