Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(executor): Minimize the number of Kubernetes API requests made by executors #4954

Merged
merged 60 commits into from
Feb 12, 2021

Conversation

alexec
Copy link
Contributor

@alexec alexec commented Jan 27, 2021

This PR minimizes the number of Kubernetes API requests made by the executor.

Today, start-up, the executors make 2 Kubernetes API requests immediately, get pod and watch pod. Tomorrow, a maximum of 1 request is now only made under two circumstances:

  • Using the k8sapi executor. Obviously.
  • When PNS has been unable to determine the container IDs due to a short-running main container.

I've added the logging of K8S requests to help diagnose regression in the future.

  • Executor
    • Remove get pod for getting annotations on start-up (read from mounted annotation path)
    • Remove watch pod on start-up completely
  • docker use docker ps to determine container IDs instead of API request
  • k8sapi business as usual
  • kubelet use localhost/pods API request to determine container IDs
  • pns introduce ARGO_CONTAINER_NAME to identify the container name for processes, works on long-running containers.

Testing Notes

System Executor Outcome
GCP docker PASS - fixed bug in killing sidecars
GCP k8sapi PASS
GCP kubelet PASS - reduced status hard loop
GCP pns PASS

@alexec alexec changed the title feat(executor): Remove need for watch feat(executor): Reduce usage of watch when waiting for container start Jan 27, 2021
util/logs/workflow-logger.go Outdated Show resolved Hide resolved
workflow/executor/k8sapi/k8sapi.go Outdated Show resolved Hide resolved
workflow/executor/kubelet/kubelet.go Outdated Show resolved Hide resolved
workflow/executor/pns/pns.go Outdated Show resolved Hide resolved
test/e2e/smoke_test.go Outdated Show resolved Hide resolved
@alexec alexec changed the title feat(executor): Reduce usage of watch when waiting for container start feat(executor): Reduce the number of Kubernetes API requests make by executors Jan 31, 2021
@alexec alexec requested review from jessesuen and dtaniwaki January 31, 2021 17:37
@alexec alexec changed the title feat(executor): Reduce the number of Kubernetes API requests make by executors feat(executor): Minimize the number of Kubernetes API requests made by executors Jan 31, 2021
.github/workflows/ci-build.yaml Outdated Show resolved Hide resolved
.github/workflows/ci-build.yaml Outdated Show resolved Hide resolved
test/e2e/fixtures/e2e_suite.go Outdated Show resolved Hide resolved
test/e2e/functional_test.go Outdated Show resolved Hide resolved
workflow/controller/workflowpod.go Outdated Show resolved Hide resolved
workflow/executor/kubelet/client.go Outdated Show resolved Hide resolved
workflow/util/util.go Outdated Show resolved Hide resolved
workflow/executor/executor.go Outdated Show resolved Hide resolved
Signed-off-by: Alex Collins <[email protected]>
Signed-off-by: Alex Collins <[email protected]>
Signed-off-by: Alex Collins <[email protected]>
@alexec alexec linked an issue Feb 2, 2021 that may be closed by this pull request
@alexec
Copy link
Contributor Author

alexec commented Feb 2, 2021

Before:

hello-world-xlp8t wait time="2021-02-02T01:08:22.850Z" level=info msg="Watch pods 200"
hello-world-xlp8t wait time="2021-02-02T01:08:37.590Z" level=info msg="Get pods 200"
hello-world-xlp8t wait time="2021-02-02T01:08:37.598Z" level=info msg="Get pods 200"
hello-world-xlp8t wait time="2021-02-02T01:08:37.614Z" level=info msg="List log 200"

@alexec
Copy link
Contributor Author

alexec commented Feb 5, 2021

Good. The builds are failing as they should.

return *ctr
}

func (woc *wfOperationCtx) newWaitContainer(tmpl *wfv1.Template) (*apiv1.Container, error) {
ctr := woc.newExecContainer(common.WaitContainerName, tmpl)
ctr.Command = []string{"argoexec", "wait"}
ctr.Command = []string{"argoexec", "wait", "--loglevel", getExecutorLogLevel()}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

func (p *PNSExecutor) Wait(ctx context.Context, containerNames, sidecarNames []string) error {

allContainerNames := append(containerNames, sidecarNames...)
go p.pollRootProcesses(ctx, allContainerNames)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this happens sufficiently early along in the process to remove the need for WaitInit().

continue
log.Infof("Ignoring %d exit code of container '%s'", t.ExitCode, ctr.Name)
} else {
return wfv1.NodeFailed, msg
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is much cleaner now

Comment on lines +210 to +212
// https://github.com/kubernetes/kubernetes/blob/ca6bdba014f0a98efe0e0dd4e15f57d1c121d6c9/pkg/kubelet/dockertools/labels.go#L37
"--filter=label=io.kubernetes.pod.namespace="+d.namespace,
"--filter=label=io.kubernetes.pod.name="+d.podName,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

func (p *PNSExecutor) pollRootProcesses(timeout time.Duration) {
log.Warnf("Polling root processes (%v)", timeout)
deadline := time.Now().Add(timeout)
func (p *PNSExecutor) pollRootProcesses(ctx context.Context, containerNames []string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous implementation limited our hard tight-looped polling to 1 minute, because it was quite aggressive. How do we limit this in the new implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I’ve removed this by accident and need to add it back.

Copy link
Contributor Author

@alexec alexec Feb 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is fixed now, I find it is easier to do timeouts using context.WithTimeout because you usually have a context available

@alexec
Copy link
Contributor Author

alexec commented Feb 12, 2021

@jessesuen I've addressed your comment.

@alexec alexec merged commit 2ff4db1 into argoproj:master Feb 12, 2021
@alexec alexec deleted the now branch February 12, 2021 20:03
@simster7 simster7 mentioned this pull request Feb 16, 2021
33 tasks
@simster7 simster7 mentioned this pull request Feb 23, 2021
34 tasks
@simster7 simster7 mentioned this pull request Mar 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Experiencing issues with the PNS executor
3 participants