Pinning CPU with a pipeline with 50 tasks and use of params/results chain #3521
Watching this thread closely. We did an upgrade from v0.10 to v0.17 a couple of weeks ago (had to excise PipelineResources) and performance has been all over the place since then. I'm thinking about downgrading at this point. I've looked into the two HA solutions in this issue but the results have not been great: with the StatefulSet approach we sometimes see double work, or random task restarts when the task should be finishing, and with the documented approach hardly any performance increase. Our pipelines go from
If I can provide any data or notes in this area I am eager to help.
Uhm, pipeline
@skaegi I think you always provision the pipeline and task definitions along with the pipeline run, is that correct?
Fixed the task as follows:
The whole pipeline ran quite quickly:
With pipeline
We definitely get stuck in
I added a simple unit test that emulates
`pipeline/pkg/reconciler/pipelinerun/resources/pipelinerunstate.go`, lines 314 to 324 at 406a21e
The loop in `pipeline/pkg/reconciler/pipelinerun/resources/pipelinerunresolution.go` (lines 136 to 152 at cdc171b) reaches `parentTasksSkip`, which invokes `Skip` recursively.
On my local cluster I see this in each iteration of the loop. Even though recursion fits with the high CPU and long execution time that we see, it's still not clear to me why it would take so long; the algorithm can surely be improved upon, but there may also be some bug in there that I'm still missing that would really explain this. It looks like it's high time I resumed work on the refactor (#2821) that stores the "won't run" status in the state and avoids the continuous loop over the entire pipeline.
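The cost of re-running `Skip` recursively from scratch for every task can be illustrated with a toy sketch. This is hypothetical Go, not Tekton's actual code: `skipNaive`, `skipMemo`, the integer task IDs, and the "skipped if any ancestor failed" rule are all invented for illustration; the point is only that on a DAG with shared ancestors, an unmemoized recursive check revisits the same nodes combinatorially many times, while a memoized one visits each node once.

```go
package main

import "fmt"

var calls int // counts recursive invocations for comparison

// skipNaive treats a task as skipped when any ancestor failed,
// re-walking the DAG from scratch on every call. Shared ancestors
// are revisited, so the call count explodes with DAG depth.
func skipNaive(task int, parents map[int][]int, failed map[int]bool) bool {
	calls++
	for _, p := range parents[task] {
		if failed[p] || skipNaive(p, parents, failed) {
			return true
		}
	}
	return false
}

// skipMemo is the same check with memoization: each task is fully
// evaluated at most once, so the call count stays linear in the
// number of tasks and dependency links.
func skipMemo(task int, parents map[int][]int, failed map[int]bool, memo map[int]bool) bool {
	calls++
	if v, ok := memo[task]; ok {
		return v
	}
	res := false
	for _, p := range parents[task] {
		if failed[p] || skipMemo(p, parents, failed, memo) {
			res = true
			break
		}
	}
	memo[task] = res
	return res
}

func main() {
	// A chain where each task depends on the two tasks before it,
	// loosely like a params/results chain with fan-in.
	parents := map[int][]int{}
	for i := 2; i <= 25; i++ {
		parents[i] = []int{i - 1, i - 2}
	}
	failed := map[int]bool{}

	calls = 0
	skipNaive(25, parents, failed)
	naive := calls // grows Fibonacci-like with chain length

	calls = 0
	skipMemo(25, parents, failed, map[int]bool{})
	fmt.Println("naive calls:", naive, "memoized calls:", calls)
}
```

With only 25 tasks the naive variant already makes hundreds of thousands of calls while the memoized one stays under fifty, which matches the sudden degradation reported above as pipelines grow.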
The pipelinerun state is made of resolved pipelinerun tasks (rprt), which are built from the actual status of the associated taskruns. It is computationally easy to know whether a taskrun started, or completed successfully or unsuccessfully; however, determining whether a taskrun has been skipped, or will be skipped in the pipeline run execution, requires evaluating the entire pipeline run status and associated DAG. The Skip method used to apply to a single rprt: it evaluated the entire pipeline run status and DAG, returned whether the specific rprt was going to be skipped, and threw away the rest. We used to invoke the Skip method on every rprt in the pipeline state to calculate candidate tasks for execution. To make things worse, we also invoked "Skip" as part of "isDone", defined as a logical OR between "isSuccessful", "isFailed" and "Skip".

With this change we compute the list of tasks to be skipped once, and we store the result in a map in the pipelinerun facts, along with the pipelinerun state and associated DAGs. We introduce a new method on the pipelinerun facts called "IsTaskSkipped". This method performs the lazy computation of the skips map the first time it is invoked; any following invocation provides the skip status of a specific pipeline task by looking it up in the map. This solution manages to hide the details of the skip logic from the core reconciler logic. I believe further refactoring could help, but I wanted to keep this PR as small as possible. I will pursue this work further by reviving tektoncd#2821.

I converted the unit test for "Skip" into a unit test for "SkipMap", to ensure that the new logic gives the same results as before. The test should be moved to a different module, as "SkipMap" lives in a different module, but leaving it in place very much helps with the review; I will move it in a follow-up patch.

This change adds a unit test that reproduces the issue in tektoncd#3521, which used to fail (with a 30s timeout) and now succeeds for pipelines of roughly up to 120 tasks / 120 links. On my laptop, going beyond 120 tasks/links takes longer than 30s, so I left the unit test at 80 to avoid introducing a flaky test in CI. There is still work to do to improve this further; some profiling / tracing work might help. Breaking large pipelines into logical groups (branches, or pipelines in pipelines) would help reduce the complexity and computational cost for very large pipelines.

Fixes tektoncd#3521
Signed-off-by: Andrea Frittoli <[email protected]>
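The lazily computed skip map described in the commit message can be sketched as follows. Only the names `PipelineRunFacts` and `IsTaskSkipped` come from the commit message above; the fields, the simplified skip rule (skipped if any parent failed or was itself skipped), and the task names are illustrative stand-ins, not Tekton's actual implementation.

```go
package main

import "fmt"

// PipelineRunFacts sketches the "pipelinerun facts" described above:
// pipeline state plus a skip map computed once and then reused.
type PipelineRunFacts struct {
	parents   map[string][]string // task -> tasks it depends on (invented field)
	failed    map[string]bool     // tasks whose taskrun failed (invented field)
	skipCache map[string]bool     // built lazily on the first query
}

// IsTaskSkipped computes the skip status of every task on the first
// call and answers later calls with a map lookup, instead of
// re-walking the whole DAG per task on every reconcile.
func (f *PipelineRunFacts) IsTaskSkipped(name string) bool {
	if f.skipCache == nil {
		f.skipCache = map[string]bool{}
		var visit func(t string) bool
		visit = func(t string) bool {
			if v, ok := f.skipCache[t]; ok {
				return v
			}
			res := false
			for _, p := range f.parents[t] {
				// Skipped if any parent failed or was itself skipped.
				if f.failed[p] || visit(p) {
					res = true
				}
			}
			f.skipCache[t] = res
			return res
		}
		for t := range f.parents {
			visit(t)
		}
	}
	return f.skipCache[name]
}

func main() {
	f := &PipelineRunFacts{
		parents: map[string][]string{"b": {"a"}, "c": {"b"}, "d": {}},
		failed:  map[string]bool{"a": true},
	}
	// b is skipped (parent a failed), c is skipped (parent b is
	// skipped), d has no failed or skipped ancestors.
	fmt.Println(f.IsTaskSkipped("b"), f.IsTaskSkipped("c"), f.IsTaskSkipped("d"))
}
```

The design point is that the O(tasks × links) evaluation happens once per reconcile instead of once per task per reconcile, which is what turned the per-task `Skip` calls quadratic (or worse) on large pipelines.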
In addition to the fix (thanks Andrea!) we should definitely encode this pipeline into a test to guard against future similar regressions. It looks like this behavior is new since the last release, is that correct? So we'll only need a patch for the latest release?
@imjasonh we originally found the problem in 0.16.3, but FWIW we're in the process of moving to 0.18.x, so at least for us patching further back is not needed.
That is already done as a unit test that reproduces the issue (also in the PR). When I added it, the unit test would time out after 30s; with the PR it passes. On my laptop it manages a pipeline with 120 tasks in 30s, so there's still room for improvement.
Perhaps we could add the pipeline60 as an e2e test too.
(Revised commit, with the background as above.) With this change we compute the list of tasks to be skipped once, incrementally, by caching the results of each invocation of "Skip". We store the result in a map in the pipelinerun facts, along with the pipelinerun state and associated DAGs, and introduce new methods on the pipelinerun facts, "IsTaskSkipped" and "ResetSkippedCache". This solution manages to hide some of the details of the skip logic from the core reconciler logic, but it still requires the cache to be manually reset in a couple of places. I believe further refactoring could help, but I wanted to keep this PR as small as possible. I will pursue this work further by reviving tektoncd#2821.

This change adds a unit test that reproduces the issue in tektoncd#3521, which used to fail (with a 30s timeout) and now succeeds for pipelines of roughly up to 120 tasks / 120 links. On my laptop, going beyond 120 tasks/links takes longer than 30s, so I left the unit test at 80 to avoid introducing a flaky test in CI. There is still work to do to improve this further; some profiling / tracing work might help. Breaking large pipelines into logical groups (branches, or pipelines in pipelines) would help reduce the complexity and computational cost for very large pipelines.

Fixes tektoncd#3521
Co-authored-by: Scott <[email protected]>
Signed-off-by: Andrea Frittoli <[email protected]>
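The incremental caching with a manual reset can be sketched like this. Only the method names `Skip` and `ResetSkippedCache` come from the commit message; the `facts` struct, its fields, and the simplified skip rule are invented for illustration, not Tekton's actual code.

```go
package main

import "fmt"

// facts is an invented stand-in for the pipelinerun facts: state plus
// a skip cache filled incrementally as Skip is invoked.
type facts struct {
	parents   map[string][]string // task -> tasks it depends on
	failed    map[string]bool     // tasks whose taskrun failed
	skipCache map[string]bool     // per-task memoized Skip results
}

// Skip memoizes its own result per task, so repeated calls within one
// reconcile pass don't re-walk the DAG.
func (f *facts) Skip(t string) bool {
	if v, ok := f.skipCache[t]; ok {
		return v // cache hit: no DAG walk
	}
	res := false
	for _, p := range f.parents[t] {
		if f.failed[p] || f.Skip(p) {
			res = true
		}
	}
	if f.skipCache == nil {
		f.skipCache = map[string]bool{}
	}
	f.skipCache[t] = res
	return res
}

// ResetSkippedCache must be called whenever the underlying taskrun
// state changes; otherwise stale answers would be served on the next
// query. This is the manual-reset cost the commit message mentions.
func (f *facts) ResetSkippedCache() {
	f.skipCache = nil
}

func main() {
	f := &facts{parents: map[string][]string{"b": {"a"}}, failed: map[string]bool{}}
	fmt.Println(f.Skip("b")) // nothing has failed yet

	f.failed["a"] = true
	f.ResetSkippedCache() // without this, Skip("b") would return the stale cached value
	fmt.Println(f.Skip("b"))
}
```

The trade-off versus computing the whole map eagerly is that only the tasks actually queried pay the cost, at the price of having to remember to reset the cache at the right points in the reconciler.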
We are hitting a scalability problem that pins the controller CPU in our clusters when we run a complex pipeline with close to 50 tasks that makes heavy use of params and results. I'll attach a synthetic test case, but this is a real-world case that we are seeing in our clusters.
What we see is that the controller is CPU-pinned before creating the first Task pod. In my cluster I see a very sudden degradation in start time for the synthetic test case when I run a pipeline that contains:
... that feels like we have an algorithm that is out of whack.
The following synthetic tests are standalone and don't use a workspace or anything fancy and might be good in our tests later on...
pipe40p.yaml.txt
pipe50p.yaml.txt
pipe60p.yaml.txt