ICPetcdHighNumberOfFailedGRPCRequests triggers meaningless alert #8

Open
haoqing0110 opened this issue Aug 13, 2019 · 3 comments

@haoqing0110

haoqing0110 commented Aug 13, 2019

Hi, all,

I'm from the IBM Cloud Private team. In our environment, the alert ICPetcdHighNumberOfFailedGRPCRequests fires frequently (every 5 minutes).

While investigating, I found that the alert is triggered by a normal operation.
Every time etcdctl lease keep-alive $lease interacts with the etcd cluster, the log entry below appears and the ICPetcdHighNumberOfFailedGRPCRequests alert is then triggered.

{"log":"2019-07-05 08:20:02.426997 D | etcdserver/api/v3rpc: failed to receive lease keepalive request from gRPC stream (\"rpc error: code = Unavailable desc = client disconnected\")\n","stream":"stderr","time":"2019-07-05T08:20:02.427136464Z"}

The alert rule does not seem meaningful when grpc_code="Unavailable" or grpc_method="LeaseKeepAlive", so we would like to change https://github.com/ibm-cloud-architecture/CSMO-ICP/blob/master/prometheus/alerts_icp_2.1.0.2-3.1.1/alert-rules-icp311.yaml#L34 to the following:

     - alert: ICPetcdHighNumberOfFailedGRPCRequests
       annotations:
         message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{
           $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
       expr: |
         100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK", grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive"}[5m])) BY (job, instance, grpc_service, grpc_method)
           /
         sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
           > 1
       for: 10m
       labels:
         severity: warning

I'm submitting this issue to ask for your opinion. We hope to make this change to avoid meaningless alerts.

Someone ran into a similar issue in:
openshift/cluster-monitoring-operator#248

@haoqing0110
Author

@RayStoner @RobertJBarron @rafal-szypulka can you help with this?

@rafal-szypulka
Collaborator

@haoqing0110 it looks like the root cause is a still-unresolved etcd bug, etcd-io/etcd#10289, and the problem exists not only in ICP but also in OpenShift: etcd-io/etcd#10629
In my opinion, this alert should be disabled until the bug is resolved in etcd. Another option would be to filter with grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive" as you did, but I am not completely sure it would give us meaningful results. I would just disable this alert rule for now.
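
A possible reason for caution with the filter: label matchers inside a single selector are ANDed, so grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive" drops every Unavailable result on any method and every LeaseKeepAlive failure with any code. A narrower variant is sketched below. It is untested against a live cluster and assumes the goal is to ignore only Unavailable results on LeaseKeepAlive. It uses the PromQL unless operator to remove just that combination from the numerator:

```yaml
# Sketch only: same rule as proposed in this issue, with a narrower exclusion.
- alert: ICPetcdHighNumberOfFailedGRPCRequests
  annotations:
    message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{
      $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
  expr: |
    100 * sum(
      # all non-OK results ...
      rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])
      # ... minus only the series that are both Unavailable and LeaseKeepAlive
      unless
      rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code="Unavailable", grpc_method="LeaseKeepAlive"}[5m])
    ) BY (job, instance, grpc_service, grpc_method)
      /
    sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
      > 1
  for: 10m
  labels:
    severity: warning
```

With this form, Unavailable errors on other methods and other error codes on LeaseKeepAlive still count toward the failure ratio. Whether that is enough to make the alert useful while the etcd bug is open remains an open question.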

@haoqing0110
Author

@rafal-szypulka Thanks! Disabling the alert works for our case.
