
Pinning CPU with a pipeline with 50 tasks and use of params/results chain #3521

Closed
skaegi opened this issue Nov 13, 2020 · 12 comments · Fixed by #3524
Labels
area/performance Issues or PRs that are related to performance aspects.
kind/bug Categorizes issue or PR as related to a bug.
priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@skaegi
Contributor

skaegi commented Nov 13, 2020

We are hitting a scalability problem that pins the controller in our clusters when we have a complex pipeline with close to 50 tasks that makes heavy use of params and results. I'll attach a synthetic test case, but this is a real-world case that we are seeing in our clusters.

What we see is that the controller is CPU-pinned before creating the first Task pod. In my cluster there is a very sudden degradation in start time for the synthetic test case when I run a pipeline that contains:

  1. 40 Tasks -- 10 seconds to start
  2. 50 Tasks -- 1 minute to start
  3. 60 Tasks -- 41 minutes to start
    ... that feels like we have an algorithm that is out of whack.

The following synthetic tests are standalone and don't use a workspace or anything fancy and might be good in our tests later on...

pipe40p.yaml.txt
pipe50p.yaml.txt
pipe60p.yaml.txt
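
For readers without the attachments, here is a rough sketch (mine, not the actual attached files; task and result names are illustrative) of the shape of such a pipeline, as a small Go generator that prints n tasks chained via params/results:

package main

import "fmt"

// Prints a pipeline of n tasks where each task consumes a result of the
// previous one through a param. "echo-results" is a stand-in for a task
// that writes results, like the one shown later in this thread.
func main() {
    n := 50
    fmt.Println("apiVersion: tekton.dev/v1beta1")
    fmt.Println("kind: Pipeline")
    fmt.Println("metadata:")
    fmt.Printf("  name: pipe%dp\n", n)
    fmt.Println("spec:")
    fmt.Println("  tasks:")
    for i := 1; i <= n; i++ {
        fmt.Printf("  - name: t%d\n", i)
        fmt.Println("    taskRef:")
        fmt.Println("      name: echo-results")
        if i > 1 {
            fmt.Println("    params:")
            fmt.Println("    - name: a")
            fmt.Printf("      value: \"$(tasks.t%d.results.a)\"\n", i-1)
        }
    }
}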

skaegi added the kind/bug label Nov 13, 2020
@dustym

dustym commented Nov 13, 2020

Watching this thread closely. We did an upgrade from v0.10 to v0.17 a couple of weeks ago (had to excise PipelineResources) and performance has been all over the place since then; I'm thinking about downgrading at this point. I've looked into the two HA solutions in this issue, but the results have not been great -- with the StatefulSet approach we sometimes see double work or random task restarts when the task should be finishing, and with the documented approach hardly any performance increase. Our pipelines go from 2 concurrent tasks -> 6 concurrent tasks -> 96 concurrent tasks, and at any given time there are 15 running on the cluster. Each pipeline gets its own namespace, if that makes any difference.

If I can provide any data or notes in this area I am eager to help.

@afrittoli
Member

Uhm, pipeline 40 started quickly on my kind cluster, but I found an issue in the pipeline definition:

status:
  completionTime: "2020-11-14T10:05:06Z"
  conditions:
  - lastTransitionTime: "2020-11-14T10:05:06Z"
    message: 'unable to find result referenced by param "a" in "t2": Could not find
      result with name a for task run t1'
    reason: InvalidTaskResultReference
    status: "False"
    type: Succeeded

@afrittoli
Member

@skaegi I think you always provision the pipeline and task definitions along with the pipeline run, is that correct?

@afrittoli
Member

Fixed the task as follows:

  steps:
  - name: echo
    image: alpine
    script: |
      echo "a-result" > $(results.a.path)
      echo "b-result" > $(results.b.path)
      echo "c-result" > $(results.c.path)
      echo "d-result" > $(results.d.path)
      echo "e-result" > $(results.e.path)
      echo "f-result" > $(results.f.path)
      echo "g-result" > $(results.g.path)
      echo "h-result" > $(results.h.path)
      echo "i-result" > $(results.i.path)
      echo "j-result" > $(results.j.path)

The whole pipeline ran quite quickly:

NAME                    STARTED         DURATION   STATUS
pipeline-40-run-92kf8   2 minutes ago   1 minute   Succeeded

@afrittoli
Member

With pipeline 50 I already see evident signs of degradation: about 30s until the first taskrun is created, then a couple of minutes until the rest.

afrittoli added the area/performance and priority/important-soon labels Nov 14, 2020
@afrittoli
Member

afrittoli commented Nov 14, 2020

We definitely get stuck in dag.GetSchedulable. I'll try to build a unit test that matches the pipe60 pipeline and see if I can reproduce the issue there. This is the last line in the controller logs before the waiting starts...

{"level":"info","ts":"2020-11-14T16:23:35.940Z","logger":"tekton.github.com-tektoncd-pipeline-pkg-reconciler-pipelinerun.Reconciler","caller":"resources/pipelinerunstate.go:197","msg":"#3521 About to invoke dag.GetSchedulable","commit":"cdc171b","knative.dev/traceid":"42f40bcb-907c-49cb-8aa4-b184cc97b7a7","knative.dev/key":"default/pipeline-60-run-mhgkq"}

@afrittoli
Member

I added a simple unit test that emulates pipe60, but it runs fine; overall it takes about 1s to run dag.GetSchedulable twice, once with an empty execution history and once after t1: https://gist.github.com/afrittoli/be4dae78d46cdeb8cfd4a749839b66bb

@afrittoli
Member

successfulOrSkippedDAGTasks is the function where all the time is spent:

func (facts *PipelineRunFacts) successfulOrSkippedDAGTasks() []string {
    tasks := []string{}
    for _, t := range facts.State {
        if facts.isDAGTask(t.PipelineTask.Name) {
            if t.IsSuccessful() || t.Skip(facts) {
                tasks = append(tasks, t.PipelineTask.Name)
            }
        }
    }
    return tasks
}

The loop in successfulOrSkippedDAGTasks calls t.Skip(facts) for every pipeline task. The Skip function

// Skip returns true if a PipelineTask will not be run because
// (1) its When Expressions evaluated to false
// (2) its Condition Checks failed
// (3) its parent task was skipped
// (4) Pipeline is in stopping state (one of the PipelineTasks failed)
// Note that this means Skip returns false if a conditionCheck is in progress
func (t *ResolvedPipelineRunTask) Skip(facts *PipelineRunFacts) bool {
    if facts.isFinalTask(t.PipelineTask.Name) || t.IsStarted() {
        return false
    }
    if t.conditionsSkip() || t.whenExpressionsSkip(facts) || t.parentTasksSkip(facts) || facts.IsStopping() {
        return true
    }
    return false
}
invokes parentTasksSkip, which invokes Skip recursively on each parent.

On my local cluster I see that each iteration of the loop in successfulOrSkippedDAGTasks can take up to 2 minutes.
Since we check t.IsSuccessful() || t.Skip(facts), once tasks start to run the number of times we have to invoke Skip decreases, unblocking the execution.

Even though recursion fits with the high CPU and long execution time that we see, it's still not clear to me why it would take so long; the algorithm can surely be improved upon, but there may be some bug in there that I'm still missing that would really explain this.
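
A toy model (illustrative Go, not the Tekton code) of why the unmemoized recursion gets so expensive: once a task has more than one parent, Skip re-walks every ancestor path, so the call count grows exponentially with pipeline depth, while a memoized variant stays linear.

package main

import "fmt"

var calls int

// skipNaive mimics parentTasksSkip: a task is skipped if any ancestor
// is skipped, recomputed from scratch on every call.
func skipNaive(task int, parents func(int) []int) bool {
    calls++
    for _, p := range parents(task) {
        if skipNaive(p, parents) {
            return true
        }
    }
    return false // nothing is actually skipped here; we only count the work
}

// skipMemo is the same walk with the result cached per task.
func skipMemo(task int, parents func(int) []int, memo map[int]bool) bool {
    if v, ok := memo[task]; ok {
        return v
    }
    calls++
    skipped := false
    for _, p := range parents(task) {
        if skipMemo(p, parents, memo) {
            skipped = true
            break
        }
    }
    memo[task] = skipped
    return skipped
}

func main() {
    // Each task depends on the two tasks before it -- a mild fan-in
    // compared to tasks consuming ten upstream results.
    parents := func(t int) []int {
        ps := []int{}
        for _, p := range []int{t - 1, t - 2} {
            if p >= 1 {
                ps = append(ps, p)
            }
        }
        return ps
    }
    for _, n := range []int{20, 25, 30} {
        calls = 0
        skipNaive(n, parents)
        naive := calls
        calls = 0
        skipMemo(n, parents, map[int]bool{})
        fmt.Printf("n=%d naive=%d memoized=%d\n", n, naive, calls)
    }
}

With this shape the naive call count grows by roughly a constant factor per extra task, which would fit the 10s / 1min / 41min cliff reported above.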

It looks like the time is ripe to resume work on the refactor (#2821) that stores the "won't run" status in the state and avoids continuously looping over the entire pipeline.

afrittoli added a commit to afrittoli/pipeline that referenced this issue Nov 15, 2020
The pipelinerun state is made of resolved pipelinerun tasks (rprt),
which are built from the actual status of the associated taskruns.

It is computationally easy to know if a taskrun started, or completed
successfully or unsuccessfully; however, determining whether a taskrun
has been skipped or will be skipped in the pipeline run execution
requires evaluating the entire pipeline run status and associated dag.

The Skip method used to apply to a single rprt, evaluate the entire
pipeline run status and dag, return whether the specific rprt was
going to be skipped, and throw away the rest.

We used to invoke the Skip method on every rprt in the pipeline state
to calculate candidate tasks for execution. To make things worse,
we also invoked the "Skip" method as part of "isDone", defined as
a logical OR between "isSuccessful", "isFailed" and "Skip".

With this change we compute the list of tasks to be skipped once, and
we store the result in a map in the pipelinerun facts, along with the
pipelinerun state and associated dags. We introduce a new method on the
pipelinerun facts called "IsTaskSkipped". This method performs the
lazy computation of the skips map the first time it is invoked.
Any following invocation is able to provide the skip status of a
specific pipeline task by looking it up in the map.

This solution manages to hide the details of the skip logic from the
core reconciler logic. I believe further refactoring could help, but
I wanted to keep this PR as small as possible. I will further pursue
this work by reviving tektoncd#2821.

I converted the unit test for "Skip" to a unit test for "SkipMap",
to ensure that the new logic gives the same result as we used to have.
The test should be moved to a different module as "SkipMap" lives
in a different module, but leaving it in place very much helps
with the review. I will move it in a follow-up patch.

This change adds a unit test that reproduces the issue in tektoncd#3521, which
used to fail (with timeout 30s) and now succeeds for pipelines of roughly
up to 120 tasks / 120 links. On my laptop, going beyond 120 tasks/links
takes longer than 30s, so I left the unit test at 80 to avoid
introducing a flaky test in CI. There is still work to do to improve
this further; some profiling / tracing work might help.

Breaking large pipelines into logical groups (branches or pipelines in
pipelines) would help reduce the complexity and computational cost for
very large pipelines.

Fixes tektoncd#3521

Signed-off-by: Andrea Frittoli <[email protected]>
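
A condensed sketch of the caching idea described in the commit message (field and method names here are simplified stand-ins; see PR #3524 for the real code): compute each task's skip status at most once per reconcile, cache it on the facts, and reset the cache when the underlying state changes.

package main

import "fmt"

type task struct {
    parents []string
    skipped bool // stand-in for when-expression / condition results
}

type facts struct {
    tasks     map[string]*task
    skipCache map[string]bool
}

// isTaskSkipped lazily builds the cache on first use, so each task's
// status is computed once instead of once per caller.
func (f *facts) isTaskSkipped(name string) bool {
    if f.skipCache == nil {
        f.skipCache = map[string]bool{}
    }
    if v, ok := f.skipCache[name]; ok {
        return v
    }
    t := f.tasks[name]
    skipped := t.skipped
    if !skipped {
        for _, p := range t.parents {
            if f.isTaskSkipped(p) { // a skipped parent skips the child
                skipped = true
                break
            }
        }
    }
    f.skipCache[name] = skipped
    return skipped
}

// resetSkipCache mirrors the "ResetSkippedCache" idea from the later
// revisions of the PR: invalidate the cache when taskrun state changes.
func (f *facts) resetSkipCache() {
    f.skipCache = nil
}

func main() {
    f := &facts{tasks: map[string]*task{
        "t1": {skipped: true},
        "t2": {parents: []string{"t1"}},
    }}
    fmt.Println(f.isTaskSkipped("t2")) // true: parent t1 is skipped
    f.resetSkipCache()
}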
@imjasonh
Member

In addition to the fix (thanks Andrea!) we should definitely encode this pipeline into a test to guard against future similar regressions.

It looks like this behavior is new since the last release, is that correct? So we'll only need a patch for the latest release?

afrittoli added a commit to afrittoli/pipeline that referenced this issue Nov 15, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this issue Nov 15, 2020
@skaegi
Contributor Author

skaegi commented Nov 15, 2020

@imjasonh we originally found the problem in 0.16.3, but FWIW we're in the process of moving to 0.18.x, so at least for us patching further back is not needed.

@afrittoli
Member

afrittoli commented Nov 15, 2020

> In addition to the fix (thanks Andrea!) we should definitely encode this pipeline into a test to guard against future similar regressions.

That is already done as a unit test that reproduces the issue (also in the PR). When I added it, the unit test would time out after 30s; with the PR it passes. On my laptop it manages a pipeline with 120 tasks in 30s, so there's still room for improvement.

> It looks like this behavior is new since the last release, is that correct? So we'll only need a patch for the latest release?

@afrittoli
Member

Perhaps we could add pipeline60 as an e2e test too.
I would not add it as a YAML test because it's easier to maintain when it's generated via code; a sketch follows below.
It takes a couple of minutes to run on my kind cluster, so it should be fine for E2E tests.
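
For illustration, a sketch of generating such a pipeline in code rather than YAML, building the same chained shape as the earlier generator but as API objects (a sketch against the v1beta1 API of the time; the helper name and the "echo-results" task reference are assumptions):

package main

import (
    "fmt"

    v1beta1 "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// nTaskPipeline builds an n-task pipeline where each task consumes a
// result of the previous one, mirroring the pipe60 shape.
func nTaskPipeline(n int) *v1beta1.Pipeline {
    p := &v1beta1.Pipeline{
        ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("pipeline-%d", n)},
    }
    for i := 1; i <= n; i++ {
        pt := v1beta1.PipelineTask{
            Name:    fmt.Sprintf("t%d", i),
            TaskRef: &v1beta1.TaskRef{Name: "echo-results"},
        }
        if i > 1 {
            pt.Params = []v1beta1.Param{{
                Name:  "a",
                Value: *v1beta1.NewArrayOrString(fmt.Sprintf("$(tasks.t%d.results.a)", i-1)),
            }}
        }
        p.Spec.Tasks = append(p.Spec.Tasks, pt)
    }
    return p
}

func main() {
    p := nTaskPipeline(60)
    fmt.Println(p.Name, "with", len(p.Spec.Tasks), "tasks")
}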

afrittoli added a commit to afrittoli/pipeline that referenced this issue Nov 16, 2020
@pritidesai pritidesai added this to the Pipelines v0.18 milestone Nov 16, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this issue Nov 16, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this issue Nov 16, 2020
afrittoli added a commit to afrittoli/pipeline that referenced this issue Nov 16, 2020
tekton-robot pushed a commit that referenced this issue Nov 17, 2020
The pipelinerun state is made of resolved pipelinerun tasks (rprt),
which are built from the actual status of the associated taskruns.

It is computationally easy to know if a taskrun started, or completed
successfully or unsuccessfully; however, determining whether a taskrun
has been skipped or will be skipped in the pipeline run execution
requires evaluating the entire pipeline run status and associated dag.

The Skip method used to apply to a single rprt, evaluate the entire
pipeline run status and dag, return whether the specific rprt was
going to be skipped, and throw away the rest.

We used to invoke the Skip method on every rprt in the pipeline state
to calculate candidate tasks for execution. To make things worse,
we also invoked the "Skip" method as part of "isDone", defined as
a logical OR between "isSuccessful", "isFailed" and "Skip".

With this change we compute the list of tasks to be skipped once,
incrementally, by caching the results of each invocation of "Skip".
We store the result in a map in the pipelinerun facts, along with the
pipelinerun state and associated dags. We introduce a new method on the
pipelinerun facts called "ResetSkippedCache".

This solution manages to hide some of the details of the skip logic from
the core reconciler logic, but it still requires the cache to be
manually reset in a couple of places. I believe further refactoring
could help, but I wanted to keep this PR as small as possible.
I will further pursue this work by reviving #2821.

This change adds a unit test that reproduces the issue in #3521, which
used to fail (with timeout 30s) and now succeeds for pipelines of roughly
up to 120 tasks / 120 links. On my laptop, going beyond 120 tasks/links
takes longer than 30s, so I left the unit test at 80 to avoid
introducing a flaky test in CI. There is still work to do to improve
this further; some profiling / tracing work might help.

Breaking large pipelines into logical groups (branches or pipelines in
pipelines) would help reduce the complexity and computational cost for
very large pipelines.

Fixes #3521

Co-authored-by: Scott <[email protected]>

Signed-off-by: Andrea Frittoli <[email protected]>
pritidesai pushed a commit to pritidesai/pipeline that referenced this issue Nov 17, 2020
(cherry picked from commit fda7a81)
tekton-robot pushed a commit that referenced this issue Nov 17, 2020
(cherry picked from commit fda7a81)