
Optimize step signalling in entrypoint #1570

Closed
skaegi opened this issue Nov 14, 2019 · 10 comments
Labels
- area/performance: Issues or PRs that are related to performance aspects.
- help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
- kind/feature: Categorizes issue or PR as related to a new feature.
- lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@skaegi
Contributor

skaegi commented Nov 14, 2019

The entrypoint signaling mechanism currently wakes up every second and checks for file changes written by the previous step. This is simple but, in our experience, slow. We have a synthetic test that runs a 20-step do-nothing task and takes something like 60s; a similar raw Pod without the entrypoint runs in 10s. We should see if we can get those times down.

We might reduce the sleep time to 500ms, but another option is using fsnotify to make our signaling immediate. Another option, described in #1569, is to use a sidecar as a signaling hub.
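
(For reference, here is a minimal sketch of that polling behaviour with the interval pulled out as a parameter; the marker-file path and function name are illustrative, not the actual entrypoint code.)

```go
package main

import (
	"log"
	"os"
	"time"
)

// waitForFile polls for the marker file written by the previous step,
// sleeping for the given interval between checks. This mirrors the
// wake-up-every-second behaviour described above; the path used below
// is illustrative, not the real entrypoint layout.
func waitForFile(path string, interval time.Duration) error {
	for {
		if _, err := os.Stat(path); err == nil {
			return nil // previous step has finished
		} else if !os.IsNotExist(err) {
			return err // unexpected error, bail out
		}
		time.Sleep(interval)
	}
}

func main() {
	// With a 1s interval each step waits ~0.5s on average (up to 1s) after
	// its predecessor finishes, so the overhead grows with the step count.
	if err := waitForFile("/tekton/tools/0", time.Second); err != nil {
		log.Fatal(err)
	}
}
```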

@vdemeester
Member

/kind feature
/priority important-longterm

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Nov 15, 2019
@vdemeester vdemeester added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Nov 15, 2019
@vdemeester vdemeester added this to the Pipelines 1.0/beta 🐱 milestone Nov 15, 2019
@skaegi
Contributor Author

skaegi commented Nov 20, 2019

/assign

@skaegi
Contributor Author

skaegi commented Nov 21, 2019

Ok... so first off, my initial numbers were totally incorrect. My imagePullPolicy was just not right, so that accounted for a good part of what I was seeing. Redoing my numbers in my cluster, I see 6s for the vanilla pod case and 17s for the TaskRun case with a 20-step task.

So I played with the entrypoint wait time:
raw pod -- 6s (lower limit...)
1ms -- 12s (burning laptop... power percentage going down in real time)
50ms -- 10-11s
100ms -- 11-12s
200ms -- 11-12s
250ms -- 12-15s (sudden jump here -- not sure why -- might be specific to my test)
300ms -- 14-15s
500ms -- 14-16s
750ms -- 15-16s
1000ms -- 15-17s


The point here is not to pick a magic number like 200ms. It is to show that the first big problem in optimizing the entrypoint is the time we spend waiting, which grows more or less linearly with the number of steps. fsnotify might bring that waiting overhead down to roughly zero, so I'll try that out next.
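
(Back-of-envelope, the linear growth lines up with the numbers above: with a 1s poll each step waits roughly 0.5s on average, so 20 steps add about 10s, close to the observed 17s vs 6s gap. Below is a rough sketch of what an fsnotify-based waiter might look like, using github.com/fsnotify/fsnotify; the path is illustrative and this is not the actual entrypoint code.)

```go
package main

import (
	"log"
	"os"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

// waitForFileNotify blocks until the marker file exists, using filesystem
// events instead of a sleep loop, so the wake-up is effectively immediate.
func waitForFileNotify(path string) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()

	// Watch the containing directory, since the file may not exist yet.
	if err := w.Add(filepath.Dir(path)); err != nil {
		return err
	}
	// Re-check after the watch is registered, otherwise a file written
	// before we started listening would never produce an event.
	if _, err := os.Stat(path); err == nil {
		return nil
	} else if !os.IsNotExist(err) {
		return err
	}

	for {
		select {
		case ev := <-w.Events:
			if ev.Name == path && ev.Op&fsnotify.Create != 0 {
				return nil
			}
		case err := <-w.Errors:
			return err
		}
	}
}

func main() {
	if err := waitForFileNotify("/tekton/tools/0"); err != nil {
		log.Fatal(err)
	}
}
```

Note the stat after registering the watch: without it, a marker file written before the watcher exists would never generate an event and the step would hang.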

Later I think it would be good to do a bit of analysis on the initial sync and maybe on what the init containers are doing...

pod gist
taskrun gist

@imjasonh
Member

Thanks for that data, Simon! This makes me think we should have a metric for "overhead" time: the time spent between step[n].finish and step[n+1].start. That would let us gather data across a bunch of runs before, during, and after tweaks to the poll interval, and while moving to something better.

This is also something an operator might want to monitor, in case they want to precache popular step images for instance.

Unfortunately, today we don't have a strong signal for when a step actually started executing, due to entrypoint rewriting. Tackling that first could help here and probably in other places too.
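
(As a very rough illustration of the proposed overhead metric, assuming we can read per-step start and finish timestamps, e.g. from the pod's container statuses; the types and names here are made up, not the Tekton API. It also inherits the caveat above: with entrypoint rewriting, the recorded start time is when the wrapper started, not when the real command began.)

```go
package main

import (
	"fmt"
	"time"
)

// stepTiming holds the two timestamps needed for overhead accounting.
// In practice these would come from the TaskRun's pod container statuses.
type stepTiming struct {
	Name       string
	StartedAt  time.Time
	FinishedAt time.Time
}

// overheads returns, for each adjacent pair of steps, the gap between
// step[n] finishing and step[n+1] starting.
func overheads(steps []stepTiming) []time.Duration {
	var out []time.Duration
	for i := 0; i+1 < len(steps); i++ {
		out = append(out, steps[i+1].StartedAt.Sub(steps[i].FinishedAt))
	}
	return out
}

func main() {
	t := time.Now()
	steps := []stepTiming{
		{Name: "clone", StartedAt: t, FinishedAt: t.Add(2 * time.Second)},
		{Name: "build", StartedAt: t.Add(3 * time.Second), FinishedAt: t.Add(9 * time.Second)},
		{Name: "push", StartedAt: t.Add(10 * time.Second), FinishedAt: t.Add(12 * time.Second)},
	}
	fmt.Println(overheads(steps)) // [1s 1s]
}
```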

@skaegi
Contributor Author

skaegi commented Dec 12, 2019

I've been working with Kata containers a fair bit lately and... inotify does not work there 😿 I guess that means our advanced sleep technology is a really good choice for now.

@dibyom dibyom added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 12, 2020
@dibyom
Member

dibyom commented Mar 12, 2020

(remove/re-add labels to check if project automation bot is working, plz ignore)

@dibyom dibyom added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 12, 2020
@bobcatfish bobcatfish modified the milestones: Pipelines 1.0/beta 🐱, Pipelines 1.1 / Post-beta 🐱 Mar 16, 2020
@afrittoli afrittoli removed this from the Pipelines Post-beta 🐱 milestone May 4, 2020
@afrittoli afrittoli added area/performance Issues or PRs that are related to performance aspects. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jun 15, 2020
@afrittoli
Member

@skaegi feel free to bring this back to the API WG for discussion if it needs priority attention

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 14, 2020