
[WIP] use list api to query multiple Cloud Build statuses together #6005

Closed
wants to merge 4 commits

Conversation

@gsquared94 gsquared94 (Contributor) commented Jun 10, 2021

Fixes: #5888
Description
TODO: Add description, add tests

codecov bot commented Jun 10, 2021

Codecov Report

❗ No coverage uploaded for pull request base (master@840f9b2).
The diff coverage is 12.12%.

❗ Current head 830a56c differs from pull request most recent head 34053c5. Consider uploading reports for the commit 34053c5 to get more accurate results.

@@            Coverage Diff            @@
##             master    #6005   +/-   ##
=========================================
  Coverage          ?   70.48%           
=========================================
  Files             ?      463           
  Lines             ?    17867           
  Branches          ?        0           
=========================================
  Hits              ?    12594           
  Misses            ?     4340           
  Partials          ?      933           
Impacted Files                           Coverage Δ
pkg/skaffold/build/gcb/cloud_build.go     0.00% <0.00%>   (ø)
pkg/skaffold/build/gcb/status.go         13.75% <13.75%>  (ø)
pkg/skaffold/build/gcb/types.go          57.14% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

delete(jobs, id)
}
}
statuses, err := getStatuses(projectID, jobs)
Member:

Is there a limit to the number of jobs that we can query? We can skip anything excessive since we should trim jobs as they succeed.

gsquared94 (Author):

We trim jobs as they succeed, fail, or are cancelled. The query parameter is freeform; I'm not sure if there is a limit, but I don't anticipate us hitting it.
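
For reference, a rough sketch of what batching such a query could look like; the helper name, the OR-ed `build_id` filter syntax, and the batch size of 50 are illustrative assumptions, not taken from this PR.

```go
package gcb

import (
	"context"
	"fmt"
	"strings"

	cloudbuild "google.golang.org/api/cloudbuild/v1"
)

// listStatuses is a sketch (not the PR's code) of querying several builds in a
// single List call by OR-ing build_id terms into the filter string. The batch
// size of 50 is an illustrative guess, not a documented API limit.
func listStatuses(ctx context.Context, svc *cloudbuild.Service, projectID string, buildIDs []string) ([]*cloudbuild.Build, error) {
	var all []*cloudbuild.Build
	const batchSize = 50
	for start := 0; start < len(buildIDs); start += batchSize {
		end := start + batchSize
		if end > len(buildIDs) {
			end = len(buildIDs)
		}
		terms := make([]string, 0, end-start)
		for _, id := range buildIDs[start:end] {
			terms = append(terms, fmt.Sprintf("build_id=%q", id))
		}
		resp, err := svc.Projects.Builds.List(projectID).Filter(strings.Join(terms, " OR ")).Context(ctx).Do()
		if err != nil {
			return nil, err
		}
		all = append(all, resp.Builds...)
	}
	return all, nil
}
```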

@gsquared94 gsquared94 changed the title use list api to query multiple Cloud Build statuses together [WIP] use list api to query multiple Cloud Build statuses together Jun 16, 2021
@nkubala nkubala (Contributor) left a comment

This code is a bit tough to review, but based on running through some scenarios in my head and trying it out, things look good.

// RetryTimeout is the max amount of time to retry getting the status of the build before erroring
RetryTimeout = 3 * time.Minute
// MaxRetryCount is the max number of times we retry a throttled GCB API request
MaxRetryCount = 20
Contributor:

Hard to do the math in my head, but does 20 tries roughly equate to the 3-minute timeout we had before?
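
For a rough sanity check (the backoff schedule isn't shown in this hunk, so the 1s starting delay and 30s cap below are assumptions): 20 throttled retries under that schedule add up to roughly 8 minutes, noticeably more than the previous 3-minute RetryTimeout.

```go
package main

import (
	"fmt"
	"time"
)

// totalBackoff sums the waits for an exponential backoff with a cap. The
// 1s/30s values below are assumed for illustration, not taken from the PR.
func totalBackoff(initial, maxDelay time.Duration, retries int) time.Duration {
	var total time.Duration
	d := initial
	for i := 0; i < retries; i++ {
		total += d
		d *= 2
		if d > maxDelay {
			d = maxDelay
		}
	}
	return total
}

func main() {
	// 1+2+4+8+16 seconds, then 15 capped waits of 30s each.
	fmt.Println(totalBackoff(time.Second, 30*time.Second, 20)) // 8m1s
}
```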

})
case projectID := <-p.trans:
go func() {
p.C <- projectID
Contributor:

If we already blocked on the `projectID := <-p.trans` call in the case statement, does this need to be in a `go func()`?
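
For context, a minimal illustration (not the PR's code) of why the send is often still wrapped in a goroutine: the receive on `p.trans` has already happened, but the follow-up sends on `p.C` and `p.remove` are themselves blocking, so doing them inline would stall the select loop until a consumer reads them.

```go
package main

import "fmt"

// loop illustrates the pattern: the unbuffered send on out would block the
// select loop and starve the other cases, so it is handed off to a goroutine.
func loop(in, stop <-chan string, out chan<- string) {
	for {
		select {
		case id := <-in:
			go func(id string) {
				out <- id // may block until a consumer reads; the loop stays responsive
			}(id)
		case <-stop:
			return
		}
	}
}

func main() {
	in, stop, out := make(chan string), make(chan string), make(chan string)
	go loop(in, stop, out)
	in <- "my-project"
	fmt.Println(<-out) // my-project
	stop <- ""
}
```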

p.remove <- projectID
}

func (r *statusManagerImpl) run() {
Contributor:

This function could use a two-to-three sentence high-level description of what it does; there's a lot going on here :)

Something like:

run() manages the status of all GCB builds started by Skaffold, by using the `list` API to retrieve all jobs for a project ID. For each project ID encountered, a `poller` is set up to repeatedly poll and report the status of all jobs, with a backoff mechanism in place to handle throttling from the API server... etc.
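
Applied to the function, that description might read roughly like this (the receiver and function name come from the PR diff; the wording is lightly adapted from the suggestion above):

```go
// run manages the status of all GCB builds started by Skaffold. It uses the
// Cloud Build `list` API to retrieve the status of every tracked job in a
// project with a single call. For each project ID it encounters, a poller is
// set up to repeatedly poll and report the status of all jobs for that
// project, backing off exponentially when the API server throttles requests.
func (r *statusManagerImpl) run() {
	// event loop elided; see the PR diff
}
```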

@briandealwis briandealwis (Member) left a comment

I'd definitely like to see some tests!

var buildComplete bool
delay := time.NewTicker(PollDelay)
defer delay.Stop()
buildResult := b.reporter.getStatus(ctx, projectID, remoteID)
Member:

WDYT of s/getStatus/trackStatus/ and s/reporter/monitor/?

  • getStatus implies that we wait until we've gotten the result.
  • this reporter is monitoring for build status, and doesn't itself report.

@@ -144,60 +142,45 @@ func (b *Builder) buildArtifactWithCloudBuild(ctx context.Context, out io.Writer

var digest string
offset := int64(0)
var buildComplete bool
delay := time.NewTicker(PollDelay)
Member:

Suggested change
delay := time.NewTicker(PollDelay)
delay := time.NewTicker(PollDelay) // fixed schedule for log polling


func (r *statusManagerImpl) setResult(result result) {
r.resultMutex.RLock()
r.results[result.jobID] <- result
Member:

Should we delete the result from r.results? If so, add a test case to ensure we don't see .results grow without end. If not, add a comment.
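
If the entry should be deleted, one possible shape is sketched below (the field and type names follow the snippet above; the rest is an assumption). Note that deleting from the map mutates it, so it needs the write lock rather than RLock:

```go
// Sketch, not the PR's implementation: deliver the result, then drop the entry
// so r.results cannot grow without bound. delete mutates the map, so this takes
// the full write lock instead of RLock.
func (r *statusManagerImpl) setResult(res result) {
	r.resultMutex.Lock()
	ch, ok := r.results[res.jobID]
	if ok {
		delete(r.results, res.jobID)
	}
	r.resultMutex.Unlock()
	if ok {
		ch <- res // send outside the lock so a slow receiver can't hold it
	}
}
```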

var buildComplete bool
delay := time.NewTicker(PollDelay)
defer delay.Stop()
buildResult := b.reporter.getStatus(ctx, projectID, remoteID)
watch:
Member:

Loop labels can be hard to reason about, especially when used with both goto and break, and this loop relies on the buildResult channel not being closed (which seems unusual?).

Can we instead track the number of logging bytes last received, and turn the `for {}` into `for !buildComplete || lastLog > 0 {}`?
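
A rough sketch of that suggested shape, where `lastLog` and the `copyLogs` helper are illustrative names rather than code from the PR: the labeled loop goes away, and polling continues after completion only until a log poll returns no new bytes.

```go
// Sketch of the suggested control flow; copyLogs is a hypothetical helper that
// copies any new log bytes from the given offset and returns how many it read.
var buildComplete bool
lastLog := int64(1) // forces at least one log poll after the build completes
for !buildComplete || lastLog > 0 {
	select {
	case <-buildResult:
		buildComplete = true
	case <-delay.C:
		n := copyLogs(out, offset)
		offset += n
		lastLog = n
	}
}
```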

for {
select {
case projectID := <-p.remove:
if b, found := p.timers[projectID]; found {
Member:

`b` is a non-obvious name. `t`?

delete(p.timers, projectID)
delete(p.backoffs, projectID)
case projectID := <-p.reset:
if b, found := p.timers[projectID]; found {
Member:

s/b/t/

return reporter
}

// poller sends the `projectID` on channel `C` with an exponentially increasing period for each project.
Member:

WDYT about breaking `poller` out to a new file? It should have a `stop()` method.

Though it feels like you could simplify this and avoid the goroutine and internal channels with a `sync.Map`.

Member:

(I'm not sure that such a "simplification" would actually reduce the complexity though.)
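
For what it's worth, a sync.Map variant might look roughly like this (names, backoff values, and the ticker interval are illustrative, not the PR's code): add/remove/reset become direct map operations, and a single ticker loop reads the shared state instead of a managing goroutine fed by internal channels.

```go
package gcb

import (
	"context"
	"sync"
	"time"
)

// poller sketch using sync.Map: per-project backoff state is read and written
// directly instead of being serialized through trans/remove/reset channels.
type poller struct {
	C      chan string // emits a projectID when it is due for a status poll
	next   sync.Map    // projectID -> time.Time of the next poll
	delays sync.Map    // projectID -> time.Duration, the current backoff delay
}

func (p *poller) add(projectID string) {
	p.delays.Store(projectID, time.Second)
	p.next.Store(projectID, time.Now())
}

func (p *poller) remove(projectID string) {
	p.next.Delete(projectID)
	p.delays.Delete(projectID)
}

// reset returns a project to the shortest delay after a successful poll.
func (p *poller) reset(projectID string) { p.delays.Store(projectID, time.Second) }

// backoff doubles a throttled project's delay up to a 30s cap (illustrative values).
func (p *poller) backoff(projectID string) {
	if v, ok := p.delays.Load(projectID); ok {
		d := v.(time.Duration) * 2
		if d > 30*time.Second {
			d = 30 * time.Second
		}
		p.delays.Store(projectID, d)
	}
}

// run ticks on a fixed schedule and emits every project whose next poll is due.
func (p *poller) run(ctx context.Context) {
	t := time.NewTicker(time.Second)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case now := <-t.C:
			p.next.Range(func(k, v interface{}) bool {
				d, ok := p.delays.Load(k)
				if !ok || now.Before(v.(time.Time)) {
					return true // removed concurrently, or not due yet
				}
				p.next.Store(k, now.Add(d.(time.Duration)))
				p.C <- k.(string) // blocks if the consumer is slow; fine for a sketch
				return true
			})
		}
	}
}
```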

Comment on lines +58 to +62
// jobID represents a single build job
type jobID struct {
projectID string
buildID string
}
Member:

WDYT of creating types for buildID and projectID for use in the signatures, to get some type checking into place:

type projectID string
type buildID string

I think the poller would benefit.
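
A small illustration of how the defined types could flow through the signatures; only the two type definitions come from the suggestion above, and the `getStatuses` parameters shown are guesses:

```go
// With defined types, mixing up a project ID and a build ID in a call becomes a
// compile error instead of a runtime surprise.
type projectID string
type buildID string

// jobID represents a single build job.
type jobID struct {
	project projectID
	build   buildID
}

// Hypothetical signature: the parameter order can no longer be swapped silently.
func getStatuses(project projectID, jobs map[buildID]struct{}) error {
	// ...
	return nil
}
```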

Comment on lines +37 to +39
// maintain single instance of the GCB client per skaffold process
client *cloudbuild.Service
clientOnce sync.Once
Member:

Singletons are bad: they're a global dependency and they break modularity. We should strive to avoid them (look at the kubeContext/kubeConfig mess). At the very least, have this instance be managed by the statusManagerImpl.

You should be able to write tests for the statusManagerImpl and use different backoff values (nobody wants the test to wait a minute to verify a timeout, for example).
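
One possible shape for that, sketched below; the functional-option constructor and the default values are illustrative assumptions (the timeout and retry-count defaults mirror the constants shown earlier):

```go
package gcb

import (
	"time"

	cloudbuild "google.golang.org/api/cloudbuild/v1"
)

// statusManagerImpl sketch: the GCB client and the retry/backoff knobs are
// fields rather than package-level singletons, so tests can inject a fake
// client and millisecond-scale delays instead of waiting out real timeouts.
type statusManagerImpl struct {
	client       *cloudbuild.Service
	retryTimeout time.Duration
	maxRetries   int
	pollDelay    time.Duration
}

// newStatusManager applies functional options over the production defaults.
func newStatusManager(client *cloudbuild.Service, opts ...func(*statusManagerImpl)) *statusManagerImpl {
	m := &statusManagerImpl{
		client:       client,
		retryTimeout: 3 * time.Minute, // mirrors RetryTimeout
		maxRetries:   20,              // mirrors MaxRetryCount
		pollDelay:    time.Second,     // assumed default
	}
	for _, o := range opts {
		o(m)
	}
	return m
}

// In a test:
//   m := newStatusManager(fakeClient, func(m *statusManagerImpl) { m.retryTimeout = 50 * time.Millisecond })
```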

@gsquared94 gsquared94 (Author) commented:

Will make updates, add tests, and reopen.

@gsquared94 gsquared94 closed this Jul 27, 2021
Development

Successfully merging this pull request may close these issues.

GCB builder may exhaust get quota with large number of build artifacts
3 participants