Refactor node draining racing avoid condition #130

Merged: 1 commit, Jun 25, 2021
6 changes: 6 additions & 0 deletions deploy/role.yaml
@@ -88,6 +88,12 @@ rules:
  verbs:
  - get
  - update
+- apiGroups:
+  - 'coordination.k8s.io'
+  resources:
+  - 'leases'
+  verbs:
+  - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
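
The new rule grants the config daemon access to Lease objects in the coordination.k8s.io group, which back the drain lock added in this PR. As a hedged aside, the lock can be inspected at runtime (lease name taken from getDrainLock below; substitute the namespace the operator actually runs in):

    kubectl -n <operator-namespace> get lease config-daemon-draining-lock -o yaml
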
14 changes: 6 additions & 8 deletions go.mod
@@ -11,7 +11,6 @@ require (
	github.com/fsnotify/fsnotify v1.4.9
	github.com/go-logr/logr v0.2.1
	github.com/go-logr/zapr v0.2.0 // indirect
-	github.com/gogo/protobuf v1.3.2 // indirect
	github.com/golang/glog v0.0.0-20160126235308-23def4e6c14b
	github.com/google/go-cmp v0.5.0
	github.com/huandu/xstrings v1.3.2 // indirect
@@ -30,19 +29,18 @@ require (
	google.golang.org/genproto v0.0.0-20200610104632-a5b850bcf112 // indirect
	google.golang.org/protobuf v1.25.0 // indirect
	gopkg.in/yaml.v2 v2.3.0
-	k8s.io/api v0.19.0
+	k8s.io/api v0.19.10
	k8s.io/apiextensions-apiserver v0.19.0
-	k8s.io/apimachinery v0.19.0
-	k8s.io/client-go v0.19.0
-	k8s.io/code-generator v0.19.0
-	k8s.io/kubectl v0.19.0
+	k8s.io/apimachinery v0.19.10
+	k8s.io/client-go v0.19.10
+	k8s.io/code-generator v0.19.10
+	k8s.io/kubectl v0.19.10
	k8s.io/utils v0.0.0-20200729134348-d5654de09c73
	sigs.k8s.io/controller-runtime v0.6.2
)

replace (
	github.com/gogo/protobuf => github.com/gogo/protobuf v1.3.2
	github.com/openshift/api => github.com/openshift/api v0.0.0-20200827090112-c05698d102cf
-	k8s.io/kubectl => k8s.io/kubectl v0.19.0
-	k8s.io/kubelet => k8s.io/kubelet v0.19.0
+	k8s.io/kubelet => k8s.io/kubelet v0.19.10
)
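
For context, a bump like the one above is normally produced with the standard Go module tooling rather than by hand; a sketch of the likely workflow (module list inferred from the diff, not stated in the PR):

    go get k8s.io/api@v0.19.10 k8s.io/apimachinery@v0.19.10 k8s.io/client-go@v0.19.10 \
        k8s.io/code-generator@v0.19.10 k8s.io/kubectl@v0.19.10
    go mod tidy
    go mod vendor   # this repo vendors dependencies, hence the vendor/ diffs below

go mod tidy then rewrites go.sum, which is where most of this PR's line churn comes from.
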
35 changes: 23 additions & 12 deletions go.sum
@@ -840,8 +840,9 @@ golang.org/x/net v0.0.0-20200513185701-a91f0712d120/go.mod h1:qpuaurCH72eLCgpAm/
golang.org/x/net v0.0.0-20200520004742-59133d7f0dd7/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A=
golang.org/x/net v0.0.0-20200520182314-0ba52f642ac2/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A=
golang.org/x/net v0.0.0-20200707034311-ab3426394381/go.mod h1:/O7V0waA8r7cgGh81Ro3o1hOxt32SMVPicZroKQ2sZA=
golang.org/x/net v0.0.0-20201021035429-f5854403a974 h1:IX6qOQeG5uLjB/hjjwjedwfjND0hgjPMMyO1RoIXQNI=
golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU=
golang.org/x/net v0.0.0-20201110031124-69a78807bb2b h1:uwuIcX0g4Yl1NC5XAz37xsr2lTtcqevgzYNVt49waME=
golang.org/x/net v0.0.0-20201110031124-69a78807bb2b/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU=
golang.org/x/oauth2 v0.0.0-20180821212333-d2e6202438be/go.mod h1:N/0e6XlmueqKjAGxoOufVs8QHGRruUQn6yWY3a++T0U=
golang.org/x/oauth2 v0.0.0-20190226205417-e64efc72b421/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
golang.org/x/oauth2 v0.0.0-20190604053449-0f29369cfe45/go.mod h1:gOpvHmFTYa4IltrdGE7lF6nIHvwfUNPOp7c8zoXwtLw=
@@ -909,8 +910,9 @@ golang.org/x/sys v0.0.0-20200519105757-fe76b779f299/go.mod h1:h1NjWce9XRLGQEsW7w
golang.org/x/sys v0.0.0-20200523222454-059865788121/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200615200032-f1bc736245b1/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200622214017-ed371f2e16b4/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f h1:+Nyd8tzPX9R7BWHguqsrbFdRx3WQ/1ib8I44HXV5yTA=
golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20201112073958-5cba982894dd h1:5CtCZbICpIOFdgO940moixOPjc0178IU44m4EjOO5IY=
golang.org/x/sys v0.0.0-20201112073958-5cba982894dd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/text v0.0.0-20160726164857-2910a502d2bf/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.0.0-20170915032832-14c0d48ead0c/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/text v0.0.0-20170915090833-1cbadb444a80/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
@@ -1110,31 +1112,37 @@ howett.net/plist v0.0.0-20181124034731-591f970eefbb h1:jhnBjNi9UFpfpl8YZhA9CrOqp
howett.net/plist v0.0.0-20181124034731-591f970eefbb/go.mod h1:vMygbs4qMhSZSc4lCUl2OEE+rDiIIJAIdR4m7MiMcm0=
k8s.io/api v0.18.3/go.mod h1:UOaMwERbqJMfeeeHc8XJKawj4P9TgDRnViIqqBeH2QA=
k8s.io/api v0.18.6/go.mod h1:eeyxr+cwCjMdLAmr2W3RyDI0VvTawSg/3RFFBEnmZGI=
k8s.io/api v0.19.0 h1:XyrFIJqTYZJ2DU7FBE/bSPz7b1HvbVBuBf07oeo6eTc=
k8s.io/api v0.19.0/go.mod h1:I1K45XlvTrDjmj5LoM5LuP/KYrhWbjUKT/SoPG0qTjw=
k8s.io/api v0.19.10 h1:AGQnvGl693Kxib2Byxc7lTcNm1/KdHMht9xDb5GcOmY=
k8s.io/api v0.19.10/go.mod h1:FSHnCfzQ9/b9qSSF1A1+QWk5cdO1auExjLcc/RJ/zB0=
k8s.io/apiextensions-apiserver v0.18.6/go.mod h1:lv89S7fUysXjLZO7ke783xOwVTm6lKizADfvUM/SS/M=
k8s.io/apiextensions-apiserver v0.19.0 h1:jlY13lvZp+0p9fRX2khHFdiT9PYzT7zUrANz6R1NKtY=
k8s.io/apiextensions-apiserver v0.19.0/go.mod h1:znfQxNpjqz/ZehvbfMg5N6fvBJW5Lqu5HVLTJQdP4Fs=
k8s.io/apimachinery v0.18.0/go.mod h1:9SnR/e11v5IbyPCGbvJViimtJ0SwHG4nfZFjU77ftcA=
k8s.io/apimachinery v0.18.3/go.mod h1:OaXp26zu/5J7p0f92ASynJa1pZo06YlV9fG7BoWbCko=
k8s.io/apimachinery v0.18.6/go.mod h1:OaXp26zu/5J7p0f92ASynJa1pZo06YlV9fG7BoWbCko=
k8s.io/apimachinery v0.19.0 h1:gjKnAda/HZp5k4xQYjL0K/Yb66IvNqjthCb03QlKpaQ=
k8s.io/apimachinery v0.19.0/go.mod h1:DnPGDnARWFvYa3pMHgSxtbZb7gpzzAZ1pTfaUNDVlmA=
k8s.io/apimachinery v0.19.10 h1:icUAE7gRZ1lAfQcwejs5kJtWBNa7uOXYL53pAuUW19U=
k8s.io/apimachinery v0.19.10/go.mod h1:9eb44nUQSsz9QZiilFRuMj3ZbTmoWolU8S2gnXoRMjo=
k8s.io/apiserver v0.18.6/go.mod h1:Zt2XvTHuaZjBz6EFYzpp+X4hTmgWGy8AthNVnTdm3Wg=
k8s.io/apiserver v0.19.0/go.mod h1:XvzqavYj73931x7FLtyagh8WibHpePJ1QwWrSJs2CLk=
k8s.io/cli-runtime v0.19.0 h1:wLe+osHSqcItyS3MYQXVyGFa54fppORVA8Jn7DBGSWw=
k8s.io/cli-runtime v0.19.0/go.mod h1:tun9l0eUklT8IHIM0jors17KmUjcrAxn0myoBYwuNuo=
k8s.io/cli-runtime v0.19.10 h1:kjTRVlbyJ9f/WOt3menZbWY/uhZyN9PudI9ZMB6BTSE=
k8s.io/cli-runtime v0.19.10/go.mod h1:jwfyV9E4s69e3LRaAhGyEe4z+aJDeCPVvEEglOAXCmg=
k8s.io/client-go v0.18.3/go.mod h1:4a/dpQEvzAhT1BbuWW09qvIaGw6Gbu1gZYiQZIi1DMw=
k8s.io/client-go v0.18.6/go.mod h1:/fwtGLjYMS1MaM5oi+eXhKwG+1UHidUEXRh6cNsdO0Q=
k8s.io/client-go v0.19.0 h1:1+0E0zfWFIWeyRhQYWzimJOyAk2UT7TiARaLNwJCf7k=
k8s.io/client-go v0.19.0/go.mod h1:H9E/VT95blcFQnlyShFgnFT9ZnJOAceiUHM3MlRC+mU=
k8s.io/client-go v0.19.10 h1:imq8ZDEdeabcKZNxdR3MfhxVKYnBumNqZSEpCfyVN9w=
k8s.io/client-go v0.19.10/go.mod h1:z1zmFQISK3I2eWLcRjhhX2/DpQUuq5Jemy45W2a5cPQ=
k8s.io/code-generator v0.18.3/go.mod h1:TgNEVx9hCyPGpdtCWA34olQYLkh3ok9ar7XfSsr8b6c=
k8s.io/code-generator v0.18.6/go.mod h1:TgNEVx9hCyPGpdtCWA34olQYLkh3ok9ar7XfSsr8b6c=
k8s.io/code-generator v0.19.0 h1:r0BxYnttP/r8uyKd4+Njg0B57kKi8wLvwEzaaVy3iZ8=
k8s.io/code-generator v0.19.0/go.mod h1:moqLn7w0t9cMs4+5CQyxnfA/HV8MF6aAVENF+WZZhgk=
k8s.io/code-generator v0.19.10 h1:ZgZjMytfuOEtMyhsRuA9M5EgfNpf7B0rUX5UIPtJW0o=
k8s.io/code-generator v0.19.10/go.mod h1:ADrDvaUQWGn4a8lX0ONtzb7uFmDRQOMSYIMk1qWIAx8=
k8s.io/component-base v0.18.6/go.mod h1:knSVsibPR5K6EW2XOjEHik6sdU5nCvKMrzMt2D4In14=
k8s.io/component-base v0.19.0 h1:OueXf1q3RW7NlLlUCj2Dimwt7E1ys6ZqRnq53l2YuoE=
k8s.io/component-base v0.19.0/go.mod h1:dKsY8BxkA+9dZIAh2aWJLL/UdASFDNtGYTCItL4LM7Y=
k8s.io/component-base v0.19.10 h1:asX5d7XlF3j1eq0Zpx/n3lStDEDf7tS+86LEU1AEHGU=
k8s.io/component-base v0.19.10/go.mod h1:zI8HZGnByYplvaM9JevUscbVVG4bTGzJmrDR60jffvQ=
k8s.io/gengo v0.0.0-20190128074634-0689ccc1d7d6/go.mod h1:ezvh/TsK7cY6rbqRK0oQQ8IAqLxYwwyPxAX1Pzy0ii0=
k8s.io/gengo v0.0.0-20200114144118-36b2048a9120/go.mod h1:ezvh/TsK7cY6rbqRK0oQQ8IAqLxYwwyPxAX1Pzy0ii0=
k8s.io/gengo v0.0.0-20200413195148-3a45101e95ac/go.mod h1:ezvh/TsK7cY6rbqRK0oQQ8IAqLxYwwyPxAX1Pzy0ii0=
@@ -1153,11 +1161,13 @@ k8s.io/kube-openapi v0.0.0-20200121204235-bf4fb3bd569c/go.mod h1:GRQhZsXIAJ1xR0C
k8s.io/kube-openapi v0.0.0-20200410145947-61e04a5be9a6/go.mod h1:GRQhZsXIAJ1xR0C9bd8UpWHZ5plfAS9fzPjJuQ6JL3E=
k8s.io/kube-openapi v0.0.0-20200805222855-6aeccd4b50c6 h1:+WnxoVtG8TMiudHBSEtrVL1egv36TkkJm+bA8AxicmQ=
k8s.io/kube-openapi v0.0.0-20200805222855-6aeccd4b50c6/go.mod h1:UuqjUnNftUyPE5H64/qeyjQoUZhGpeFDVdxjTeEVN2o=
k8s.io/kubectl v0.19.0 h1:t9uxaZzGvqc2jY96mjnPSjFHtaKOxoUegeGZdaGT6aw=
k8s.io/kubectl v0.19.0/go.mod h1:gPCjjsmE6unJzgaUNXIFGZGafiUp5jh0If3F/x7/rRg=
k8s.io/kubelet v0.19.0 h1:1x+ZC2o7rRKy+bMen5u3PpBdterOck7i4EpZUM3zDfE=
k8s.io/kubelet v0.19.0/go.mod h1:cGds22piF/LnFzfAaIT+efvOYBHVYdunqka6NVuNw9g=
k8s.io/kubectl v0.19.10 h1:0rZNx1VER3QniJJQbaaoZspS5RyoTuPodghZk+2Yjfo=
k8s.io/kubectl v0.19.10/go.mod h1:G1yOYR8unVtDsBiXsXXkJjH0oy0OMfvhAFswhipiZ6A=
k8s.io/kubelet v0.19.10 h1:HH0MWJaEoVtRFLhtYuZkp1EvgvxKa+XHUNCH7w0h+TM=
k8s.io/kubelet v0.19.10/go.mod h1:Bqr/c9tCHbwovrS91SXMKaUeJgZ5P6KvyfKI2m95CXg=
k8s.io/metrics v0.19.0/go.mod h1:WykpW8B60OeAJx1imdwUgyOID2kDljr/Q+1zrPJ98Wo=
k8s.io/metrics v0.19.10/go.mod h1:smnHGFw6yGhVS/DvMzi19tMQSAopPLbl2fafOF0DsUA=
k8s.io/utils v0.0.0-20200324210504-a9aa75ae1b89/go.mod h1:sZAwmy6armz5eXlNoLmJcl4F1QuKu7sr+mFQ0byX7Ew=
k8s.io/utils v0.0.0-20200603063816-c1c6865ac451/go.mod h1:jPW/WVKK9YHAvNhRxK0md/EJ228hCsBRufyofKtW8HA=
k8s.io/utils v0.0.0-20200729134348-d5654de09c73 h1:uJmqzgNWG7XyClnU/mLPBWwfKKF1K8Hf8whTseBgJcg=
@@ -1176,8 +1186,9 @@ sigs.k8s.io/kustomize v2.0.3+incompatible h1:JUufWFNlI44MdtnjUqVnvh29rR37PQFzPbL
sigs.k8s.io/kustomize v2.0.3+incompatible/go.mod h1:MkjgH3RdOWrievjo6c9T245dYlB5QeXV4WCbnt/PEpU=
sigs.k8s.io/structured-merge-diff/v3 v3.0.0-20200116222232-67a7b8c61874/go.mod h1:PlARxl6Hbt/+BC80dRLi1qAmnMqwqDg62YvvVkZjemw=
sigs.k8s.io/structured-merge-diff/v3 v3.0.0/go.mod h1:PlARxl6Hbt/+BC80dRLi1qAmnMqwqDg62YvvVkZjemw=
sigs.k8s.io/structured-merge-diff/v4 v4.0.1 h1:YXTMot5Qz/X1iBRJhAt+vI+HVttY0WkSqqhKxQ0xVbA=
sigs.k8s.io/structured-merge-diff/v4 v4.0.1/go.mod h1:bJZC9H9iH24zzfZ/41RGcq60oK1F7G282QMXDPYydCw=
sigs.k8s.io/structured-merge-diff/v4 v4.0.3 h1:4oyYo8NREp49LBBhKxEqCulFjg26rawYKrnCmg+Sr6c=
sigs.k8s.io/structured-merge-diff/v4 v4.0.3/go.mod h1:bJZC9H9iH24zzfZ/41RGcq60oK1F7G282QMXDPYydCw=
sigs.k8s.io/yaml v1.1.0/go.mod h1:UJmg0vDUVViEyp3mgSv9WPwZCDxu4rQW1olrI1uml+o=
sigs.k8s.io/yaml v1.2.0 h1:kr/MCeFWJWTwyaHoR9c8EjH9OumOmoF9YGiZd7lFm/Q=
sigs.k8s.io/yaml v1.2.0/go.mod h1:yfXDCHCao9+ENCvLSE62v9VSji2MKu5jeNfTrofGhJc=
79 changes: 57 additions & 22 deletions pkg/daemon/daemon.go
@@ -18,6 +18,10 @@ import (
"time"

"github.com/golang/glog"
mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
daemonconsts "github.com/openshift/machine-config-operator/pkg/daemon/constants"
mcclientset "github.com/openshift/machine-config-operator/pkg/generated/clientset/versioned"
mcfginformers "github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions"
"golang.org/x/time/rate"
corev1 "k8s.io/api/core/v1"
v1 "k8s.io/api/core/v1"
@@ -32,18 +36,15 @@ import (
"k8s.io/client-go/kubernetes"
listerv1 "k8s.io/client-go/listers/core/v1"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/leaderelection"
"k8s.io/client-go/tools/leaderelection/resourcelock"
"k8s.io/client-go/util/workqueue"
"k8s.io/kubectl/pkg/drain"

// "k8s.io/client-go/kubernetes/scheme"
sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
snclientset "github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/client/clientset/versioned"
sninformer "github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/client/informers/externalversions"
"github.com/k8snetworkplumbingwg/sriov-network-operator/pkg/utils"
mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
daemonconsts "github.com/openshift/machine-config-operator/pkg/daemon/constants"
mcclientset "github.com/openshift/machine-config-operator/pkg/generated/clientset/versioned"
mcfginformers "github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions"
)

const (
@@ -124,7 +125,7 @@
var namespace = os.Getenv("NAMESPACE")
var pluginsPath = os.Getenv("PLUGINSPATH")

-// writer implements io.Writer interface as a pass-through for klog.
+// writer implements io.Writer interface as a pass-through for glog.
type writer struct {
logFunc func(args ...interface{})
}
@@ -791,27 +792,61 @@ func (dn *Daemon) getNodeMachinePool() error {
return fmt.Errorf("getNodeMachinePool(): Failed to find the MCP of the node")
}

+func (dn *Daemon) getDrainLock(ctx context.Context, done chan bool) {
+	var err error
+
+	lock := &resourcelock.LeaseLock{
+		LeaseMeta: metav1.ObjectMeta{
+			Name:      "config-daemon-draining-lock",
+			Namespace: namespace,
+		},
+		Client: dn.kubeClient.CoordinationV1(),
+		LockConfig: resourcelock.ResourceLockConfig{
+			Identity: dn.name,
+		},
+	}
+
+	// start the leader election
+	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{

[review] adrianchiris (Collaborator), May 9, 2021:
    Looking at the leaderelection docs: "Package leaderelection implements
    leader election of a set of endpoints. It uses an annotation in the
    endpoints object to store the record of the election state. This
    implementation does not guarantee that only one client is acting as a
    leader (a.k.a. fencing)." Is that not an issue? I think what we are doing
    here is considered fencing. How will the system behave when there is only
    one endpoint trying to take the lead on the LeaseLock?

[review] pliurh (Author):
    Instead of an Endpoints object, I use the Lease API for leader election
    here, so I don't think that statement applies. All the clients race for
    the same Lease object, so there cannot be more than one acting as leader.

[review] adrianchiris: OK, thanks for explaining.

+		Lock:            lock,
+		ReleaseOnCancel: true,
+		LeaseDuration:   5 * time.Second,
+		RenewDeadline:   3 * time.Second,
+		RetryPeriod:     1 * time.Second,
+		Callbacks: leaderelection.LeaderCallbacks{
+			OnStartedLeading: func(ctx context.Context) {
+				glog.V(2).Info("getDrainLock(): started leading")
+				for {
+					time.Sleep(3 * time.Second)
+					if dn.drainable {

[review] (Member):
    Outside of any concern for this PR, since this pattern was here before
    this PR: have you folks ever seen nodes getting stuck on this condition?
    It could happen if a node reboots, doesn't start up, and the daemonset is
    unable to update its draining status. Then line 847 is blocked
    indefinitely.

[review] pliurh (Author), May 14, 2021:
    That is intentional. We don't want a configuration mistake to break more
    nodes. If users encounter such a problem, they had better do some
    troubleshooting to find out why the node cannot come back.
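
    As a hedged aside on the stuck-node scenario above: the drain state is
    ordinary node metadata, so one way to check it while troubleshooting is
    to dump the node's annotations (the exact key is the daemon's
    annoDraining constant):

        kubectl get node <node-name> -o jsonpath='{.metadata.annotations}'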

glog.V(2).Info("getDrainLock(): no other node is draining")
err = dn.annotateNode(dn.name, annoDraining)

[review] adrianchiris (Collaborator):
    Using this mechanism, do we still need to annotate the node with
    annoDraining?

[review] pliurh (Author):
    That is the trick. The leader election mechanism requires the leader to
    keep updating the Lease object, but in our case the node may reboot
    itself and then lose leadership. So I use a two-layer lock here: the node
    can only start draining when 1) it becomes the leader, and 2) no other
    node is draining, as indicated by the annotation.

[review] adrianchiris: I see, thanks for clarifying.

[review] adrianchiris:
    @pliurh mind mentioning this in the commit message, so it is clear how
    this mechanism is used to control node draining?

+						if err != nil {
+							glog.Errorf("getDrainLock(): Failed to annotate node: %v", err)
+							continue
+						}
+						done <- true
+						return
+					}
+					glog.V(3).Info("getDrainLock(): other node is draining, wait...")
+				}
+			},
+			OnStoppedLeading: func() {
+				glog.V(2).Info("getDrainLock(): stopped leading")
+			},
+		},
+	})
+}

func (dn *Daemon) drainNode(name string) error {
	glog.Info("drainNode(): Update prepared")
	var err error

	ctx, cancel := context.WithCancel(context.TODO())
	defer cancel()
-	// wait a random time to avoid all the nodes drain at the same time
-	time.Sleep(wait.Jitter(3*time.Second, 3))
-	wait.JitterUntil(func() {
-		if !dn.drainable {
-			glog.V(2).Info("drainNode(): other node is draining")
-			return
-		}
-		glog.V(2).Info("drainNode(): no other node is draining")
-		err = dn.annotateNode(dn.name, annoDraining)
-		if err != nil {
-			glog.Errorf("drainNode(): Failed to annotate node: %v", err)
-			return
-		}
-		cancel()
-	}, 3*time.Second, 3, true, ctx.Done())

+	done := make(chan bool)
+	go dn.getDrainLock(ctx, done)

[review] (Collaborator):
    Can we run dn.getDrainLock in the foreground and not use the done chan?
    I assume leaderelection.RunOrDie is a loop without a timeout, is that
    true?

[review] pliurh (Author):
    leaderelection.RunOrDie runs forever unless the context is canceled. I
    didn't run it in the foreground because I want it to keep leading until
    the node finishes draining. That way, the remaining nodes can only start
    the election after the prior one finishes draining, if no reboot is
    required.

+	<-done

if utils.ClusterType == utils.ClusterTypeOpenshift {
mcpInformerFactory := mcfginformers.NewSharedInformerFactory(dn.mcClient,
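
To make the two-layer locking scheme from the review threads easier to follow outside the diff, here is a minimal, self-contained sketch of the same client-go Lease election pattern. The lease name and timings match this PR; the namespace, identity, and in-cluster config handling are simplified placeholders, so read it as an illustration rather than the operator's actual code:

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Layer 1: a Lease-backed lock; at most one daemon holds it at a time.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "config-daemon-draining-lock",
			Namespace: "sriov-network-operator", // placeholder; the daemon reads NAMESPACE
		},
		Client: client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: "node-a", // placeholder; the daemon uses its node name
		},
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // with ReleaseOnCancel, this hands the lease to the next candidate

	// Blocks until ctx is canceled, renewing the lease while leading.
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   5 * time.Second,
		RenewDeadline:   3 * time.Second,
		RetryPeriod:     1 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Layer 2 goes here in the PR: poll the "draining" node
				// annotation and proceed only when no other node holds it,
				// because a rebooting leader loses the lease while its
				// drain may still be in progress.
			},
			OnStoppedLeading: func() {
				// Lease lost or released.
			},
		},
	})
}

Because RunOrDie only returns once the context is canceled, the daemon cancels the context after draining completes; that, not the callback returning, is what lets the next node win the election.
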
15 changes: 14 additions & 1 deletion vendor/golang.org/x/net/html/parse.go
(generated file; diff not rendered by default)

21 changes: 18 additions & 3 deletions vendor/golang.org/x/net/http2/transport.go
(generated file; diff not rendered by default)