gRPC Health probe fails when running with -enable-nodefeature-api #1032

Closed
ArangoGutierrez opened this issue Jan 13, 2023 · 8 comments · Fixed by #1034
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ArangoGutierrez
Contributor

When running NFD with -enable-nodefeature-api, the gRPC health probe fails

Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
  Warning  Unhealthy  21m (x5110 over 2d23h)    kubelet  Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
  Warning  BackOff    6m9s (x16447 over 2d23h)  kubelet  Back-off restarting failed container
  Warning  Unhealthy  69s (x4053 over 2d23h)    kubelet  Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out

[eduardo@fedora aws-kube-ci]$ kubectl --kubeconfig kubeconfig get pods -n node-feature-discovery 
NAME                          READY   STATUS             RESTARTS           AGE
gpu-feature-discovery-pnt92   1/1     Running            0                  2d23h
nfd-master-668b74498d-kj5fm   0/1     CrashLoopBackOff   1350 (2m58s ago)   2d23h
nfd-master-cd7f6d78d-gg44s    0/1     CrashLoopBackOff   1355 (108s ago)    2d23h
nfd-worker-sgbgj              1/1     Running            0                  2d23h

The master pod is crashing due to the gRPC health probe failure.
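
For context, grpc_health_probe works by calling the standard grpc.health.v1.Health/Check RPC, so it only succeeds if a gRPC server with the health service registered is actually listening on the probed port. A minimal sketch of such a server in Go (illustrative only, not NFD's actual code; the :8080 port is taken from the probe command in the events above):

```go
package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Listen on the port the exec probe targets (-addr=:8080).
	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}

	srv := grpc.NewServer()

	// Register the standard health service that grpc_health_probe queries.
	hs := health.NewServer()
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(srv, hs)

	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}
```

If nothing is serving gRPC on that port when the NodeFeature API is used, the probe has nothing to connect to, which would explain the timeouts in the events above.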

ArangoGutierrez added the kind/bug label on Jan 13, 2023
@marquiz
Contributor

marquiz commented Jan 13, 2023

We need to replace the gRPC health probe with something else. Maybe we could use sigs.k8s.io/controller-runtime/pkg/healthz? Suggestions for better alternatives (I'm pretty uneducated in this area)? @ArangoGutierrez @zvonkok @fmuyassarov?
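
In case it helps the discussion, here is a rough sketch of what wiring that package up could look like. This is only an assumption for illustration, not what NFD actually uses: the /healthz path and port 8081 are made up.

```go
package main

import (
	"net/http"

	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// A named set of checks; healthz.Ping is a trivial checker that always passes.
	// Real checks could verify that the master's controllers are actually healthy.
	handler := &healthz.Handler{Checks: map[string]healthz.Checker{
		"ping": healthz.Ping,
	}}

	mux := http.NewServeMux()
	// Mount under /healthz; StripPrefix lets the handler serve both the
	// aggregated endpoint (/healthz) and individual checks (/healthz/ping).
	mux.Handle("/healthz", http.StripPrefix("/healthz", handler))
	mux.Handle("/healthz/", http.StripPrefix("/healthz", handler))

	if err := http.ListenAndServe(":8081", mux); err != nil {
		panic(err)
	}
}
```

The kubelet could then use plain httpGet probes against that endpoint instead of exec'ing grpc_health_probe.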

@marquiz
Contributor

marquiz commented Jan 13, 2023

Anyway, thanks for reporting this @ArangoGutierrez! We totally missed this aspect 🙄 We need to fix this and probably backport it to the v0.12 branch and cut a patch release.

@fmuyassarov
Member

Not sure if it is going to make any difference, but what if we try the Kubernetes built-in gRPC health probe instead of using the grpc_health_probe utility?

@ArangoGutierrez
Contributor Author

@fmuyassarov
Member

Not sure if it is going to make any difference, but what if we try the Kubernetes built-in gRPC health probe instead of using the grpc_health_probe utility?

Actually, this might not change anything.

@ArangoGutierrez
Contributor Author

Yeah, I will try a couple of ideas this weekend and report back.

@ArangoGutierrez
Contributor Author

/assign

@lorelei-rupp-imprivata

We are running version 0.14.2 of this and seeing the exact same thing: it's crash looping, constantly restarting with the same errors.
We have -enable-nodefeature-api enabled.

We are rolling this out with the GPU operator

  Normal   Pulling    24m   kubelet            Pulling image "registry.k8s.io/nfd/node-feature-discovery:v0.14.2"
  Normal   Pulled     24m   kubelet            Successfully pulled image "registry.k8s.io/nfd/node-feature-discovery:v0.14.2" in 4.419758738s (4.419771428s including waiting)
  Normal   Killing    23m   kubelet            Container master failed liveness probe, will be restarted
  Warning  Unhealthy  23m   kubelet            Readiness probe failed: 2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel created
2024/04/11 13:08:18 INFO: [core] [Channel #1] original dial target is: ":8080"
2024/04/11 13:08:18 INFO: [core] [Channel #1] dial target ":8080" parse failed: parse ":8080": missing protocol scheme
2024/04/11 13:08:18 INFO: [core] [Channel #1] fallback to scheme "passthrough"
2024/04/11 13:08:18 INFO: [core] [Channel #1] parsed dial target is: {Scheme:passthrough Authority: URL:{Scheme:passthrough Opaque: User: Host: Path:/:8080 RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel authority set to "localhost:8080"
2024/04/11 13:08:18 INFO: [core] [Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": ":8080",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Type": 0,
      "Metadata": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel switches to new LB policy "pick_first"
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel created
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address ":8080" to connect
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to READY
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel Connectivity change to READY
status: SERVING
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel Connectivity change to SHUTDOWN
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel deleted
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel deleted
  Normal   Created    23m (x2 over 24m)    kubelet  Created container master
  Normal   Started    23m (x2 over 24m)    kubelet  Started container master
  Normal   Pulled     23m                  kubelet  Container image "registry.k8s.io/nfd/node-feature-discovery:v0.14.2" already present on machine
  Warning  Unhealthy  22m (x5 over 23m)    kubelet  Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
  Warning  Unhealthy  9m7s (x32 over 23m)  kubelet  Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
  Warning  BackOff    4m4s (x67 over 19m)  kubelet  Back-off restarting failed container master in pod xxxx-node-feature-discovery-master-5c6dd6667698dfq_xxx(..)
