
Optimize step signalling in entrypoint #1570

Closed
skaegi opened this issue Nov 14, 2019 · 10 comments
Labels
- area/performance: Issues or PRs that are related to performance aspects.
- help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
- kind/feature: Categorizes issue or PR as related to a new feature.
- lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@skaegi
Contributor

skaegi commented Nov 14, 2019

The entrypoint signaling mechanism currently wakes up every second and checks for file changes written by the previous step. This is simple but, in our experience, slow. We have a synthetic test that runs a 20-step do-nothing task and takes something like 60s; a similar raw Pod without the entrypoint runs in 10s. We should see if we can get those times down.

We might reduce the sleep time to 500ms, but another option is using fsnotify to make our signaling immediate. Another option, described in #1569, is to use a sidecar as a signaling hub.
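
(For reference, here is a minimal sketch of that polling behaviour with the interval pulled out as a parameter; the marker-file path and function name are illustrative, not the actual entrypoint code.)

```go
package main

import (
	"log"
	"os"
	"time"
)

// waitForFile polls for the marker file written by the previous step,
// sleeping for the given interval between checks. This mirrors the
// wake-up-every-second behaviour described above; the path used below
// is illustrative, not the real entrypoint layout.
func waitForFile(path string, interval time.Duration) error {
	for {
		if _, err := os.Stat(path); err == nil {
			return nil // previous step has finished
		} else if !os.IsNotExist(err) {
			return err // unexpected error, bail out
		}
		time.Sleep(interval)
	}
}

func main() {
	// With a 1s interval each step waits ~0.5s on average (up to 1s) after
	// its predecessor finishes, so the overhead grows with the step count.
	if err := waitForFile("/tekton/tools/0", time.Second); err != nil {
		log.Fatal(err)
	}
}
```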

@vdemeester
Member

/kind feature
/priority important-longterm

@tekton-robot tekton-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Nov 15, 2019
@vdemeester vdemeester added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Nov 15, 2019
@vdemeester vdemeester added this to the Pipelines 1.0/beta 🐱 milestone Nov 15, 2019
@skaegi
Contributor Author

skaegi commented Nov 20, 2019

/assign

@skaegi
Contributor Author

skaegi commented Nov 21, 2019

Ok... so first off, my initial numbers were totally incorrect. My imagePullPolicy was just not right, so that accounted for a good part of what I was seeing. Redoing my numbers in my cluster, I see 6s for the vanilla pod case and 17s for the TaskRun case with a 20-step task.

So I played with the entrypoint wait time:
raw pod -- 6s (lower limit...)
1ms -- 12s (burning laptop... power percentage going down in real time)
50ms -- 10-11s
100ms -- 11-12s
200ms -- 11-12s
250ms -- 12-15s (sudden jump here -- not sure why -- might be specific to my test)
300ms -- 14-15s
500ms -- 14-16s
750ms -- 15-16s
1000ms -- 15-17s


The point here is not to pick a magic number like 200ms. It is to show that the first big problem in optimizing the entrypoint is the time we spend waiting, which grows more or less linearly with the number of steps. fsnotify might bring that waiting overhead down to roughly zero, so I'll try that out next.
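
(Back-of-envelope, the linear growth lines up with the numbers above: with a 1s poll each step waits roughly 0.5s on average, so 20 steps add about 10s, close to the observed 17s vs 6s gap. Below is a rough sketch of what an fsnotify-based waiter might look like, using github.com/fsnotify/fsnotify; the path is illustrative and this is not the actual entrypoint code.)

```go
package main

import (
	"log"
	"os"
	"path/filepath"

	"github.com/fsnotify/fsnotify"
)

// waitForFileNotify blocks until the marker file exists, using filesystem
// events instead of a sleep loop, so the wake-up is effectively immediate.
func waitForFileNotify(path string) error {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	defer w.Close()

	// Watch the containing directory, since the file may not exist yet.
	if err := w.Add(filepath.Dir(path)); err != nil {
		return err
	}
	// Re-check after the watch is registered, otherwise a file written
	// before we started listening would never produce an event.
	if _, err := os.Stat(path); err == nil {
		return nil
	} else if !os.IsNotExist(err) {
		return err
	}

	for {
		select {
		case ev := <-w.Events:
			if ev.Name == path && ev.Op&fsnotify.Create != 0 {
				return nil
			}
		case err := <-w.Errors:
			return err
		}
	}
}

func main() {
	if err := waitForFileNotify("/tekton/tools/0"); err != nil {
		log.Fatal(err)
	}
}
```

Note the stat after registering the watch: without it, a marker file written before the watcher exists would never generate an event and the step would hang.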

Later I think it would be good to do a bit of analysis on the initial sync and maybe on what the init containers are doing...

pod gist
taskrun gist

@imjasonh
Member

Thanks for that data, Simon! This makes me think we should have a metric for "overhead" time: the time spent between step[n].finish and step[n+1].start. That would let us gather data across a bunch of runs before, during, and after tweaks to the poll interval, and while moving to something better.

This is also something an operator might want to monitor, in case they want to precache popular step images for instance.

Unfortunately, today we don't have a strong signal for when a step actually started executing, due to entrypoint rewriting. Tackling that first could help here and probably in other places too.
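
(As a very rough illustration of the proposed overhead metric, assuming we can read per-step start and finish timestamps, e.g. from the pod's container statuses; the types and names here are made up, not the Tekton API. It also inherits the caveat above: with entrypoint rewriting, the recorded start time is when the wrapper started, not when the real command began.)

```go
package main

import (
	"fmt"
	"time"
)

// stepTiming holds the two timestamps needed for overhead accounting.
// In practice these would come from the TaskRun's pod container statuses.
type stepTiming struct {
	Name       string
	StartedAt  time.Time
	FinishedAt time.Time
}

// overheads returns, for each adjacent pair of steps, the gap between
// step[n] finishing and step[n+1] starting.
func overheads(steps []stepTiming) []time.Duration {
	var out []time.Duration
	for i := 0; i+1 < len(steps); i++ {
		out = append(out, steps[i+1].StartedAt.Sub(steps[i].FinishedAt))
	}
	return out
}

func main() {
	t := time.Now()
	steps := []stepTiming{
		{Name: "clone", StartedAt: t, FinishedAt: t.Add(2 * time.Second)},
		{Name: "build", StartedAt: t.Add(3 * time.Second), FinishedAt: t.Add(9 * time.Second)},
		{Name: "push", StartedAt: t.Add(10 * time.Second), FinishedAt: t.Add(12 * time.Second)},
	}
	fmt.Println(overheads(steps)) // [1s 1s]
}
```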

@skaegi
Contributor Author

skaegi commented Dec 12, 2019

I've been working with Kata containers a fair bit lately and... inotify does not work there 😿 I guess that means our advanced sleep technology is a really good choice for now.

@dibyom dibyom added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 12, 2020
@dibyom
Member

dibyom commented Mar 12, 2020

(remove/re-add labels to check if project automation bot is working, plz ignore)

@dibyom dibyom added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Mar 12, 2020
@bobcatfish bobcatfish modified the milestones: Pipelines 1.0/beta 🐱, Pipelines 1.1 / Post-beta 🐱 Mar 16, 2020
@afrittoli afrittoli removed this from the Pipelines Post-beta 🐱 milestone May 4, 2020
@afrittoli afrittoli added area/performance Issues or PRs that are related to performance aspects. and removed priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jun 15, 2020
@afrittoli
Member

@skaegi feel free to bring this back to the API WG for discussion if it needs priority attention

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

@tekton-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 14, 2020