gRPC Health probe fails when running with -enable-nodefeature-api #1032

Closed
ArangoGutierrez opened this issue Jan 13, 2023 · 8 comments · Fixed by #1034
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@ArangoGutierrez
Contributor

When running NFD with -enable-nodefeature-api, the gRPC health probe fails

Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
  Warning  Unhealthy  21m (x5110 over 2d23h)    kubelet  Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
  Warning  BackOff    6m9s (x16447 over 2d23h)  kubelet  Back-off restarting failed container
  Warning  Unhealthy  69s (x4053 over 2d23h)    kubelet  Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out

[eduardo@fedora aws-kube-ci]$ kubectl --kubeconfig kubeconfig get pods -n node-feature-discovery 
NAME                          READY   STATUS             RESTARTS           AGE
gpu-feature-discovery-pnt92   1/1     Running            0                  2d23h
nfd-master-668b74498d-kj5fm   0/1     CrashLoopBackOff   1350 (2m58s ago)   2d23h
nfd-master-cd7f6d78d-gg44s    0/1     CrashLoopBackOff   1355 (108s ago)    2d23h
nfd-worker-sgbgj              1/1     Running            0                  2d23h

The master pod is crashing due to the gRPC health probe failure.
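
For context, grpc_health_probe works by calling the standard grpc.health.v1.Health/Check RPC, so it only succeeds if a gRPC server with the health service registered is actually listening on the probed port. A minimal sketch of such a server in Go (illustrative only, not NFD's actual code; the :8080 port is taken from the probe command in the events above):

```go
package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Listen on the port the exec probe targets (-addr=:8080).
	lis, err := net.Listen("tcp", ":8080")
	if err != nil {
		panic(err)
	}

	srv := grpc.NewServer()

	// Register the standard health service that grpc_health_probe queries.
	hs := health.NewServer()
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(srv, hs)

	if err := srv.Serve(lis); err != nil {
		panic(err)
	}
}
```

If nothing is serving gRPC on that port when the NodeFeature API is used, the probe has nothing to connect to, which would explain the timeouts in the events above.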

ArangoGutierrez added the kind/bug label on Jan 13, 2023
@marquiz
Contributor

marquiz commented Jan 13, 2023

We need to replace the gRPC health probe with something else. Maybe we could use sigs.k8s.io/controller-runtime/pkg/healthz? Suggestions for better alternatives (I'm pretty uneducated in this area)? @ArangoGutierrez @zvonkok @fmuyassarov?
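
In case it helps the discussion, here is a rough sketch of what wiring that package up could look like. This is only an assumption for illustration, not what NFD actually uses: the /healthz path and port 8081 are made up.

```go
package main

import (
	"net/http"

	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// A named set of checks; healthz.Ping is a trivial checker that always passes.
	// Real checks could verify that the master's controllers are actually healthy.
	handler := &healthz.Handler{Checks: map[string]healthz.Checker{
		"ping": healthz.Ping,
	}}

	mux := http.NewServeMux()
	// Mount under /healthz; StripPrefix lets the handler serve both the
	// aggregated endpoint (/healthz) and individual checks (/healthz/ping).
	mux.Handle("/healthz", http.StripPrefix("/healthz", handler))
	mux.Handle("/healthz/", http.StripPrefix("/healthz", handler))

	if err := http.ListenAndServe(":8081", mux); err != nil {
		panic(err)
	}
}
```

The kubelet could then use plain httpGet probes against that endpoint instead of exec'ing grpc_health_probe.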

@marquiz
Contributor

marquiz commented Jan 13, 2023

Anyway, thanks for reporting this @ArangoGutierrez! We totally missed this aspect 🙄 We need to fix this and probably backport it to the v0.12 branch and cut a patch release.

@fmuyassarov
Member

Not sure if it is going to make any difference, but what if we try the Kubernetes built-in gRPC health probe instead of using the grpc_health_probe utility?

@ArangoGutierrez
Contributor Author

@fmuyassarov
Member

Not sure if it is going to make any difference, but what if we try the Kubernetes built-in gRPC health probe instead of using the grpc_health_probe utility?

Actually, this might not change anything.

@ArangoGutierrez
Contributor Author

Yeah, I will try a couple of ideas this weekend and report back.

@ArangoGutierrez
Contributor Author

/assign

@lorelei-rupp-imprivata

We are running version 0.14.2 of this and seeing the exact same thing: it's crash looping, constantly restarting with the same errors.
We have -enable-nodefeature-api enabled.

We are rolling this out with the GPU operator

  Normal   Pulling    24m   kubelet            Pulling image "registry.k8s.io/nfd/node-feature-discovery:v0.14.2"
  Normal   Pulled     24m   kubelet            Successfully pulled image "registry.k8s.io/nfd/node-feature-discovery:v0.14.2" in 4.419758738s (4.419771428s including waiting)
  Normal   Killing    23m   kubelet            Container master failed liveness probe, will be restarted
  Warning  Unhealthy  23m   kubelet            Readiness probe failed: 2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel created
2024/04/11 13:08:18 INFO: [core] [Channel #1] original dial target is: ":8080"
2024/04/11 13:08:18 INFO: [core] [Channel #1] dial target ":8080" parse failed: parse ":8080": missing protocol scheme
2024/04/11 13:08:18 INFO: [core] [Channel #1] fallback to scheme "passthrough"
2024/04/11 13:08:18 INFO: [core] [Channel #1] parsed dial target is: {Scheme:passthrough Authority: URL:{Scheme:passthrough Opaque: User: Host: Path:/:8080 RawPath: OmitHost:false ForceQuery:false RawQuery: Fragment: RawFragment:}}
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel authority set to "localhost:8080"
2024/04/11 13:08:18 INFO: [core] [Channel #1] Resolver state updated: {
  "Addresses": [
    {
      "Addr": ":8080",
      "ServerName": "",
      "Attributes": null,
      "BalancerAttributes": null,
      "Type": 0,
      "Metadata": null
    }
  ],
  "ServiceConfig": null,
  "Attributes": null
} (resolver returned new addresses)
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel switches to new LB policy "pick_first"
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel created
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address ":8080" to connect
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to READY
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel Connectivity change to READY
status: SERVING
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel Connectivity change to SHUTDOWN
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN
2024/04/11 13:08:18 INFO: [core] [Channel #1 SubChannel #2] Subchannel deleted
2024/04/11 13:08:18 INFO: [core] [Channel #1] Channel deleted
  Normal   Created    23m (x2 over 24m)    kubelet  Created container master
  Normal   Started    23m (x2 over 24m)    kubelet  Started container master
  Normal   Pulled     23m                  kubelet  Container image "registry.k8s.io/nfd/node-feature-discovery:v0.14.2" already present on machine
  Warning  Unhealthy  22m (x5 over 23m)    kubelet  Liveness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
  Warning  Unhealthy  9m7s (x32 over 23m)  kubelet  Readiness probe failed: command "/usr/bin/grpc_health_probe -addr=:8080" timed out
  Warning  BackOff    4m4s (x67 over 19m)  kubelet  Back-off restarting failed container master in pod xxxx-node-feature-discovery-master-5c6dd6667698dfq_xxx(..)
