Optimize step signalling in entrypoint #1570
Comments
/kind feature
/assign
Ok... so first off my initial numbers were totally incorrect. My imagePullPolicy was just not right, so that accounted for a good part of what I was seeing. Redoing my numbers in my cluster, I see 6s for a vanilla pod case and 17s for the TaskRun case for a 20-step task. So I played with the entrypoint wait time...: The point here is not to pick a magic number like 200ms but to point out that, in the process of optimizing the entrypoint, the first big problem is that we're currently spending a significant chunk of time waiting, and that time goes up more or less linearly with the number of steps. fsnotify might let us bring the overhead for the waiting bit down to roughly zero, so I'll try that out next. Later I think it would be good to do a bit of analysis on the initial sync and maybe on what the initcontainers are doing...
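The linear growth described above is easy to see with a back-of-the-envelope model: with a fixed poll interval, each step waits on average half an interval before noticing its predecessor's done-file, so the total added latency scales with the step count. A hypothetical sketch (not Tekton code; `expectedOverheadMs` is an illustrative name):

```go
package main

import "fmt"

// expectedOverheadMs estimates the cumulative signalling delay added by a
// fixed-interval polling wait: each of the N-1 inter-step handoffs waits on
// average half the poll interval before seeing the previous step's done-file.
// This is an illustrative model, not measured Tekton behaviour.
func expectedOverheadMs(steps int, pollIntervalMs float64) float64 {
	return float64(steps-1) * pollIntervalMs / 2.0
}

func main() {
	// Compare the poll intervals discussed in this thread for a 20-step task.
	for _, ms := range []float64{1000, 500, 200} {
		fmt.Printf("20 steps @ %4.0fms poll: ~%.1fs added\n",
			ms, expectedOverheadMs(20, ms)/1000)
	}
}
```

Under this model a 20-step task at the current 1s interval accumulates several seconds of pure waiting, which is consistent with the gap between the vanilla-pod and TaskRun timings reported above.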
Thanks for that data Simon! This makes me think we should have a metric for "overhead" time -- time spent between [...]. This is also something an operator might want to monitor, in case they want to precache popular step images, for instance. Unfortunately, today we don't have a strong signal about when a step actually started executing, due to entrypoint rewriting. Tackling that first could help here and probably in other places.
I've been working with Kata containers a fair bit lately and... inotify does not work there 😿 I guess that means our advanced sleep technology is a really good choice for now.
@skaegi feel free to bring this back to the API WG for discussion if it needs priority attention.
Stale issues rot after 30d of inactivity. /lifecycle rotten Send feedback to tektoncd/plumbing.
Rotten issues close after 30d of inactivity. /close Send feedback to tektoncd/plumbing.
@tekton-robot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The entrypoint signalling mechanism currently wakes up every second and checks for file changes written by the previous step. This is simple but, in our experience, slow. We have a synthetic test that runs a 20-step do-nothing task, and it takes something like 60s to run. A similar raw Pod without the entrypoint runs in 10s. We should see if we can get those times down.
We might reduce the sleep time to 500ms, but another option is using fsnotify to make our signalling immediate. Another option, described in #1569, is to use a sidecar as a signalling hub.