Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature blog for KEP-3017: Unhealthy Pod Eviction Policy for PDBs #37627

Merged

Conversation

atiratree
Copy link
Member

@atiratree atiratree commented Oct 31, 2022

enhancement: kubernetes/enhancements#3017

API changes + feature gate: kubernetes/kubernetes#113375

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 31, 2022
@k8s-ci-robot k8s-ci-robot added area/blog Issues or PRs related to the Kubernetes Blog subproject language/en Issues or PRs related to English language sig/docs Categorizes an issue or PR as relevant to SIG Docs. labels Oct 31, 2022
@netlify
Copy link

netlify bot commented Oct 31, 2022

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit e27c608
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/63b5db71c7ea4b0009136c8f
😎 Deploy Preview https://deploy-preview-37627--kubernetes-io-main-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@atiratree atiratree changed the title [WIP] Feature blog for KEP-3017: Pod Healthy Policy for PDBs [WIP] Feature blog for KEP-3017: Unhealthy Pod Eviction Policy for PDBs Nov 10, 2022
@fsmunoz
Copy link
Contributor

fsmunoz commented Nov 22, 2022

Hello @atiratree , v1.26 Comms lead here. This feature blog is on a feature tracked for release, the deadline for submitting a draft is the 29th of November; this should be considered the hard limit since we will need to review/edit/discuss the draft, so if at all possible it's better to submit it earlier to avoid any problems.

Any doubts, we're here to help!

Copy link
Member

@Debanitrkl Debanitrkl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @atiratree, v1.26 release comms shadow here, as the deadline is already over and the feature blog PR is not up yet, should be consider it cancelled for this release.

@sftim
Copy link
Contributor

sftim commented Dec 7, 2022

You can still blog about the new feature, just not as part of the post-release blog series for v1.26.

Please do consider tweaking the article for publication early next year.

@atiratree atiratree force-pushed the blog-pod-healthy-policy-for-pdbs branch from 3d47da9 to 2903c7f Compare December 8, 2022 00:19
@atiratree atiratree changed the title [WIP] Feature blog for KEP-3017: Unhealthy Pod Eviction Policy for PDBs Feature blog for KEP-3017: Unhealthy Pod Eviction Policy for PDBs Dec 8, 2022
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 8, 2022
@atiratree
Copy link
Member Author

atiratree commented Dec 8, 2022

@Debanitrkl @fsmunoz @sftim sorry for the constant delays. I have pushed the blog proposal and will try to find people to review this.

Not sure if it can still make it to the post-release blog series, but I will be happy to see this published even on a later date.

@fsmunoz
Copy link
Contributor

fsmunoz commented Dec 8, 2022

Thank you @atiratree , we're looking at it!

@atiratree atiratree force-pushed the blog-pod-healthy-policy-for-pdbs branch from 2903c7f to 432fb0d Compare December 9, 2022 17:32
@fsmunoz
Copy link
Contributor

fsmunoz commented Dec 12, 2022

@atiratree publication date for this feature blog is January 6, 2023, you can update the date in the header. We have the series in good shape so I don't see a problem in adding this one, regardless of the missed deadlines.

The publication date is still relatively far away, but we should be able to review and merge it before, assuming that comments and suggestions are quickly reflected - your help in that is greatly appreciated.

@atiratree atiratree force-pushed the blog-pod-healthy-policy-for-pdbs branch from 432fb0d to 47888f6 Compare December 12, 2022 22:26
@atiratree
Copy link
Member Author

Thank you. I have updated the date

Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi. Here's some more feedback.

---

**Authors:** Filip Křepinský (Red Hat), Morten Torkildsen (Google), Ravi Gudimetla (Apple)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a few words before the first heading. We avoid starting blog articles with a 2nd level heading. You could omit “What problems does this solve?”; I think it'd be better to add 50ish words to summarize what you're saying in the article.###

Please also mention that this is describing a feature released last month, ie December 2022.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atiratree maybe something like:

Ensuring the disruptions to your applications do not affect its availability isn't a simple task. With
the introduction of Unhealthy Pod Eviction Policy for Pod Disruption Budgets this is going to change.
In this article, we will dive deeper into what modifications were introduced in PDBs to give application
owners greater flexibility in managing disruptions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Ensuring the disruptions to your applications do not affect its availability isn't a simple
task. With the introduction of _unhealthy pod eviction policy_ for PodDisruptionBudgets,
the way you can ensure this has changed again.
In this article, we will dive deeper into what modifications were introduced for PDBs to
give application owners greater flexibility in managing disruptions.

?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soltysh thanks for the suggestion. I have added it there, with the link to PDBs.

With the introduction of unhealthy pod eviction policy for PodDisruptionBudgets,
the way you can ensure this has changed again.

@sftim not sure what his is trying to say. Can we keep Maciej's version?

Please also mention that this is describing a feature released last month, ie December 2022.

Do we need to mention the time, when we specify the version in the title?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by time? We don't need to be precise to the hour.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I meant the date and not the time. I have rewritten the going to change part. Please let me know if you have something better in mind.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we also mention the k8s version in the What problems does this solve? section

you are choosing the best effort availability for your application.
This will make it easier to maintain and upgrade your clusters.

Eviction API will consider these policies when eviction of a pod that is guarded by a PDB is requested.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What could the outcome be in the default IfHealthyBudget case? I mean: it's not always a successful create for the subresource.

I think we're assuming that the reader already knows this, and that we should instead make it more clear.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if I follow.

we are describing the behaviour of the default IfHealthyBudget in the section above with.

Eviction of pods respects PodDisruptionBudgets (PDBs). This means that the eviction of a pod
should not disrupt a guarded application and `.status.currentHealthy` of a PDB should not fall 
below `.status.desiredHealthy`. Running pods that are [Unhealthy](/docs/tasks/run-application/configure-pdb/#healthiness-of-a-pod)
do not count towards the PDB status, but eviction of these is only possible in case the application
is not disrupted. This helps disrupted or not yet started application to achieve availability
as soon as possible without additional downtime that would be caused by evictions.

or are you suggesting we should describe the edge cases of Eviction API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If:

  • you have a workload with n Pods
  • you protect the workload with a PodDisruptionBudget
  • you choose, or default to, IfHealthyBudget
  • several of those Pods are in CrashLoopBackoff

and you then

  • (attempt to) drain a node where the CrashLoopBackoffs are happening

what actually happens? It'd be good to summarise this mainly in terms of kubectl drain, but readers might also want to know at API level. For example: you try to create an eviction, and that request fails with error 418, I'm a teapot (OK, probably not that specific error code).

My ask here is to explain the outcome that you don't see if you enable the feature gate and set a custom unhealthyPodEvictionPolicy, so that readers can appreciate the benefit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I have tried to put more details to describe the behavior.

Unfortunately, this poses a problem for cluster administrators that would like to drain nodes
without any manual interventions. Misbehaving applications with pods in `CrashLoopBackOff`
state (due to a bug or misconfiguration) or pods that are simply failing to become ready
make this task much harder. Any eviction request will fail due to violation of a PDB, 
when all pods of an application are unhealthy. Draining of a node cannot make any progress
in that case.

Do you think it make sense to describe the exact message and http code here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this point, you should describe something. I haven't tried the feature out so I don't know what you need to say. I don't think “Eviction API will consider these policies when eviction of a pod that is guarded by a PDB is requested.” feels like enough.

That said, if we had to we could merge this and live with the confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like:

Eviction is the process that triggers graceful pod termination. The process can be initiated either by a kubectl drain command, or due to specific conditions in the cluster (see node-pressure eviction or api-eviction). During this process every pod removal is consulted with appropriate PDBs, to ensure that a requested number of pods is always running in the cluster. The aforementioned policies allow PDB authors to have a greater control how the process counts not healthy pods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for the suggestion, I used part of it and reorganized the section: https://github.com/kubernetes/website/compare/4629fb1dcda1ab70deaed7a30f89228d0d1fbb6d..e27c6086386bcbdc67b52cf7c9909d3c5caf788d.

I think we should not mention the node-pressure eviction here, since it could distract from the main topic. The node-pressure eviction does not abide by the PDBs. I have tried to highlight that we are dealing with the API initiated eviction.

@@ -0,0 +1,76 @@
---
layout: blog
title: "Eviction policy for unhealthy pods guarded by PodDisruptionBudgets"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
title: "Eviction policy for unhealthy pods guarded by PodDisruptionBudgets"
title: "Kubernetes 1.26: Configure How PodDisruptionBudgets Affect Unhealthy Pods"

How about that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have put the Kubernetes 1.26 just for now. I like the rest Configure How PodDisruptionBudgets Affect Unhealthy Pods , but I am wondering if we could/should mention the eviction part as well?

---

**Authors:** Filip Křepinský (Red Hat), Morten Torkildsen (Google), Ravi Gudimetla (Apple)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atiratree maybe something like:

Ensuring the disruptions to your applications do not affect its availability isn't a simple task. With
the introduction of Unhealthy Pod Eviction Policy for Pod Disruption Budgets this is going to change.
In this article, we will dive deeper into what modifications were introduced in PDBs to give application
owners greater flexibility in managing disruptions.

@atiratree atiratree force-pushed the blog-pod-healthy-policy-for-pdbs branch from 47888f6 to 4fa924b Compare December 21, 2022 23:07
@atiratree atiratree force-pushed the blog-pod-healthy-policy-for-pdbs branch 2 times, most recently from feebd52 to 4629fb1 Compare December 21, 2022 23:24
@sftim
Copy link
Contributor

sftim commented Dec 29, 2022

We can merge this if needed; I'd prefer some more polish first.

@fsmunoz
Copy link
Contributor

fsmunoz commented Jan 2, 2023

@atiratree if you could take a look at the added review comments, it would be great. Publication date is 4 days ahead, so there's still time. Otherwise, we can default to merge as-is, but this would be a last option.

@atiratree
Copy link
Member Author

@sftim I am trying to address the missing pieces. can you please take a look again?

@tengqm
Copy link
Contributor

tengqm commented Jan 3, 2023

@atiratree Please go ahead and revise the text.

@atiratree
Copy link
Member Author

@tengqm do you have anything in particular in mind? I believe I have replied to all of the remarks so far.


## What problems does this solve?

Eviction of pods respects PodDisruptionBudgets (PDBs). This means that the eviction of a pod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit / optional change)

Suggested change
Eviction of pods respects PodDisruptionBudgets (PDBs). This means that the eviction of a pod
Eviction of pods respects PodDisruptionBudgets (PDBs). This means that a
[voluntary disruption](https://kubernetes.io/docs/concepts/scheduling-eviction/#pod-disruption)
to a Pod

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we could link that. What do you think about still having the eviction mentioned there?

This means that a requested [voluntary disruption](https://kubernetes.io/docs/concepts/scheduling-eviction/#pod-disruption)
via an eviction to a Pod, should not disrupt a guarded application and `.status.currentHealthy` of a PDB should not fall
below `.status.desiredHealthy`

@atiratree atiratree force-pushed the blog-pod-healthy-policy-for-pdbs branch from 4629fb1 to e27c608 Compare January 4, 2023 20:02
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 4, 2023
@atiratree
Copy link
Member Author

@sftim @soltysh thanks for the feedback. Can you please do another round?

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two more nits, but otherwise this looks great to me


API-initiated eviction is the process that triggers graceful pod termination.
The process can be initiated either by calling the API directly,
by using a kubectl drain command, or other actors in the cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
by using a kubectl drain command, or other actors in the cluster.
by using a `kubectl drain` command, or other actors in the cluster.


There are two policies `IfHealthyBudget` and `AlwaysAllow` to choose from.

The former, `IfHealthyBudget`, follows the existing behavior to achieve the best availability
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this does match the current behavior, I'd add a one sentence how it works. Something like:

Unhealthy pods can be disrupted only if the applications has a minimum available .status.desiredHealthy number of pods.

Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge it. @atiratree even after merge, you can open a PR to make minor edits - eg, see #37627 (review)

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 5, 2023
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 72d92fd88ac87e9803c0b39f157886feddf08158

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sftim, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 5, 2023
@k8s-ci-robot k8s-ci-robot merged commit 4fba599 into kubernetes:main Jan 5, 2023
@atiratree
Copy link
Member Author

@soltysh @sftim good idea, I am adressing the nits in #38784

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/blog Issues or PRs related to the Kubernetes Blog subproject cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/docs Categorizes an issue or PR as relevant to SIG Docs. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Status: Published
Development

Successfully merging this pull request may close these issues.

8 participants