ICPetcdHighNumberOfFailedGRPCRequests triggers meaningless alert #8

haoqing0110 · 2019-08-13T02:39:54Z

Hi, all,

I'm from the IBM Cloud Private team. In our environment, the alert ICPetcdHighNumberOfFailedGRPCRequests is triggered frequently (every 5 minutes).

On investigation, I found that the alert is triggered by a normal operation.
Every time etcdctl lease keep-alive $lease interacts with the etcd cluster, etcd writes the log entry below, and the ICPetcdHighNumberOfFailedGRPCRequests alert then fires.

{"log":"2019-07-05 08:20:02.426997 D | etcdserver/api/v3rpc: failed to receive lease keepalive request from gRPC stream (\"rpc error: code = Unavailable desc = client disconnected\")\n","stream":"stderr","time":"2019-07-05T08:20:02.427136464Z"}

The alert rule does not seem meaningful when grpc_code="Unavailable" or grpc_method="LeaseKeepAlive", so we would like to change https://github.com/ibm-cloud-architecture/CSMO-ICP/blob/master/prometheus/alerts_icp_2.1.0.2-3.1.1/alert-rules-icp311.yaml#L34 to the content below.

     - alert: ICPetcdHighNumberOfFailedGRPCRequests
       annotations:
         message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{
           $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
       expr: |
         100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK", grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive"}[5m])) BY (job, instance, grpc_service, grpc_method)
           /
         sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
           > 1
       for: 10m
       labels:
         severity: warning

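To make the effect of the two extra matchers concrete, here is a minimal Python sketch that mimics the ratio computed by the expression above. It is only an illustration: the RATES values and the failure_percent helper are hypothetical and invented for this example, and nothing here queries Prometheus. It assumes that matchers inside one selector are combined with AND, so every Unavailable series and every LeaseKeepAlive series is dropped from the numerator independently.

```python
# Hypothetical per-series request rates (requests/second); values are invented.
# Each key is (grpc_method, grpc_code), mirroring labels on grpc_server_handled_total.
RATES = {
    ("LeaseKeepAlive", "OK"): 0.0,
    ("LeaseKeepAlive", "Unavailable"): 2.0,
    ("Range", "OK"): 95.0,
    ("Range", "Unavailable"): 5.0,
    ("Txn", "OK"): 49.0,
    ("Txn", "Internal"): 1.0,
}


def failure_percent(rates, exclude_codes=(), exclude_methods=()):
    """Return {method: failed %}, grouped by method like the alert expression.

    The numerator skips series whose code is OK or listed in exclude_codes,
    and series whose method is listed in exclude_methods. The denominator
    counts every series, as in the proposed rule.
    """
    failed, total = {}, {}
    for (method, code), rate in rates.items():
        total[method] = total.get(method, 0.0) + rate
        if code == "OK" or code in exclude_codes or method in exclude_methods:
            continue
        failed[method] = failed.get(method, 0.0) + rate
    # A group produces a result only if it appears on both sides of the division.
    return {m: 100.0 * failed[m] / total[m] for m in failed if total[m] > 0}


original = failure_percent(RATES)
proposed = failure_percent(
    RATES, exclude_codes=("Unavailable",), exclude_methods=("LeaseKeepAlive",)
)
print("original:", original)
print("proposed:", proposed)
```

With these made-up numbers, the unfiltered rule reports LeaseKeepAlive, Range and Txn above the 1% threshold, while the filtered rule reports only Txn. The LeaseKeepAlive noise is gone, but the Unavailable failures on Range are no longer counted either, which is a side effect to weigh before adopting the filter.
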
I am submitting this issue to ask for your opinion. We hope to make this change to avoid meaningless alerts.

Someone hit a similar issue in:
openshift/cluster-monitoring-operator#248

haoqing0110 · 2019-08-13T02:51:31Z

@RayStoner @RobertJBarron @rafal-szypulka can you help with this?

rafal-szypulka · 2019-08-13T13:56:20Z

@haoqing0110 it looks like the source of the problem is a still-unresolved etcd bug: etcd-io/etcd#10289. The problem exists not only in ICP but also in OpenShift: etcd-io/etcd#10629
In my opinion, this alert should be disabled until the bug is resolved in etcd. Another option is to filter with grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive" as you did, but I am not completely sure it will give us meaningful results. I would just disable this alert rule for now.

haoqing0110 · 2019-08-15T09:22:06Z

@rafal-szypulka Thanks! Disabling the alert works for our case.
