Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable Affinity Assistant by default #3161

Closed
wants to merge 1 commit into from

Conversation

imjasonh
Copy link
Member

@imjasonh imjasonh commented Sep 2, 2020

There's been some confusion among users about the benefits/limitations of AA, in particular that it limits TaskRuns from mounting two persistent workspaces.

See:

We've heard that some Tekton installation providers disable this feature by default to avoid this confusion, and this PR is intended to start a conversation about disabling it in the official Tekton release config until the limitations/confusion can be reduced or eliminated.

/cc @jlpettersson
/cc @bobcatfish
/cc @vdemeester

NB: If this is included in the next Tekton release, users who upgrade to that release with kubectl apply will see AA disabled, perhaps unexpectedly.

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • [n/a] Includes tests (if functionality changed/added)
  • [n] Includes docs (if user facing)
  • [y] Commit messages follow commit message best practices
  • [y] Release notes block has been filled in or deleted (only if no user facing changes)

See the contribution guide for more details.

Double check this list of stuff that's easy to miss:

Reviewer Notes

If API changes are included, additive changes must be approved by at least two OWNERS and backwards incompatible changes must be approved by more than 50% of the OWNERS, and they must first be added in a backwards compatible way.

Release Notes

Disable Affinity Assistant in the official Tekton release config.

@imjasonh imjasonh added this to the Pipelines v0.16 milestone Sep 2, 2020
@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 2, 2020
@imjasonh
Copy link
Member Author

imjasonh commented Sep 2, 2020

/kind misc

@tekton-robot tekton-robot added the kind/misc Categorizes issue or PR as a miscellaneuous one. label Sep 2, 2020
Copy link
Member

@vdemeester vdemeester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@imjasonh thanks for opening this for discussion. Indeed, we did disable this by default on openshift-pipelines because it gets more in the way to our current users that it helps them at least (given their current use of pipeline).

One thing we should try to keep in mind in the long run is, when we push a new feature that can have this impact, we may want to start by having an opt-in behavior at first (and later switch to opt-out). Doing the opposite (as this PR does), feels a bit backward 😛

/approve
/hold

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 8, 2020
@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 8, 2020
@imjasonh
Copy link
Member Author

imjasonh commented Sep 8, 2020

One thing we should try to keep in mind in the long run is, when we push a new feature that can have this impact, we may want to start by having an opt-in behavior at first (and later switch to opt-out). Doing the opposite (as this PR does), feels a bit backward 😛

Yeah I completely agree, it's not ideal. I was thinking it might be worth adding to the TEP template some prompt to discuss how we'd release the newly proposed feature, and use that to encourage people to think about opt-in vs opt-out (and just having feature flags for everything in general!)

@jlpettersson
Copy link
Member

My opinion is probably known :)

Without the AA:

With the AA:

  • A constraint is introduced - or more upfront (since the above problems exists): At most one PVC per Task. Other than that, the above problems are solved.
  • Pipeline executes 20-30 seconds faster per task that use the PV, except the first.

@bobcatfish
Copy link
Collaborator

At most one PVC per Task.

This seems like a pretty big constraint to me :S We designed Tasks to be able to handle more than one workspace intentionally.

@vdemeester
Copy link
Member

* A constraint is introduced - or more upfront (since the above problems exists): At most one PVC per Task. Other than that, the above problems are solved.

This is the core of the problem for me. I do see value on this but I think in an ideal world

  • this would have been opt-in initially to gather feedback on this without disallowing some behavior.
  • having a way to disable the behavior "on-demand" for a run (as a "do at your own risk" thing) would be more than nice. If a user need to go around the constraint for 2% of it's use case, but feel safer having the assistant in place for the 98% rest, he should be able to do it

Some pipelines in a regional cluster will deadlock

Not all cluster are regional. Depending on the users, for some, they don't have this problem, at all.

@pierretasci
Copy link

In my experience, not all k8s clusters I have encountered work well with the Affinity Assistant because of non-standard configurations. Not all installations allow access to the K8s nodes or they have non-standard labels on the nodes. Since this feature configures the core experience rather than extends it, I personally tend to prefer opt-in semantics. Said another way, the default should work with the majority of cases possible but that is just my 2 cents.

@imjasonh
Copy link
Member Author

imjasonh commented Sep 9, 2020

Yeah I think the central issue is that the tradeoff is not clearly laid out in documentation (that comment is a great distillation though!), and it introduces a serious limitation for many users. And it was enabled by default in the last release. Ideally we'd have made it opt-in then, and now that it's opt-out, rolling it back to opt-in is not necessarily a safe change. 😞

@jlpettersson
Copy link
Member

Since this feature configures the core experience rather than extends it

It was introduced to avoid the problems with deadlocks and parallelism-problems rather than add a feature.

Said another way, the default should work with the majority of cases possible but that is just my 2 cents.

+1. I had the same reasoning when this was added - to solve the assumed most common cases: #2630 (comment)

Using PVC with access mode ReadWriteOnce (the most common access mode) can only be mounted on one Node at a time. In addition, they most commonly lives in a single datacenter/Availability Zone. So if a pipeline starts e.g. with two parallel tasks - they may be scheduled to two different Availability Zones - this goes fine - until a Task try to mount both volumes - located in different AZ, and the pipeline is deadlocked - but this only happens when scheduled to different zones.

And it was enabled by default in the last release.

The Affinity Assistant was introduced in v0.13.0

Not all cluster are regional.

That is true. But it was also motivated to allow for parallel task execution. But not all clusters have more than one Node either. But it happened that "Node Affinity" solved both origin problems.

We designed Tasks to be able to handle more than one workspace intentionally.

The constraint is for PVCs, not strictly for workspaces since a workspace can be backed by other things than a PVC, and a Task can use multiple workspaces backed by the same PVC (using subPath).

A Custom Scheduler is a better solution than the Affinity Assistant for this problem - it should schedule the Pods based on the same "constraints" as the Affinity Assistant, but it does not mount the PVC itself, as the AA had to do - so it has a bit friendlier behavior.

@ghost
Copy link

ghost commented Oct 2, 2020

There's been some discussion around this in WGs now and I think the consensus was:

  1. we shouldn't disable this by default now because we'd be messing with folks' existing installations.
  2. we will exercise more caution about enabling new features by default in future.

@tektoncd/core-maintainers or anyone else on this thread: does anyone have anything further to add? Is there a purpose in keeping this PR open any longer?

@imjasonh
Copy link
Member Author

imjasonh commented Oct 2, 2020

There's been some discussion around this in WGs now and I think the consensus was:

  1. we shouldn't disable this by default now because we'd be messing with folks' existing installations.
  2. we will exercise more caution about enabling new features by default in future.

@tektoncd/core-maintainers or anyone else on this thread: does anyone have anything further to add? Is there a purpose in keeping this PR open any longer?

This is an excellent summary. I don't think there's anything else for this PR.

@imjasonh imjasonh closed this Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/misc Categorizes issue or PR as a miscellaneuous one. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants