I'm from the IBM Cloud Private team. In our environment, the alert ICPetcdHighNumberOfFailedGRPCRequests fires frequently (every 5 minutes).
In my investigation, I found that the alert is triggered by a normal operation.
Every time etcdctl lease keep-alive $lease interacts with the etcd cluster, etcd writes the log line below, and the ICPetcdHighNumberOfFailedGRPCRequests alert then fires.
{"log":"2019-07-05 08:20:02.426997 D | etcdserver/api/v3rpc: failed to receive lease keepalive request from gRPC stream (\"rpc error: code = Unavailable desc = client disconnected\")\n","stream":"stderr","time":"2019-07-05T08:20:02.427136464Z"}
@haoqing0110 it looks like the source of the problem is a still-unresolved etcd bug, etcd-io/etcd#10289. The problem exists not only in ICP but also in OpenShift: etcd-io/etcd#10629
In my opinion, this alert should be disabled until the bug is resolved in etcd. Another option would be to filter with grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive" as you did, but I am not completely sure it would give us meaningful results. I would just disable this alert rule for now.
Hi, all,
The alert rule does not seem meaningful when grpc_code="Unavailable" or grpc_method="LeaseKeepAlive", so we would like to change https://github.com/ibm-cloud-architecture/CSMO-ICP/blob/master/prometheus/alerts_icp_2.1.0.2-3.1.1/alert-rules-icp311.yaml#L34 to exclude those cases. We are submitting this issue to ask for your opinion, and we hope to make the change to avoid meaningless alerts.
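For illustration, a filtered rule could look roughly like the sketch below. This is not the actual content of alert-rules-icp311.yaml. It assumes the rule is built on the grpc_server_handled_total counter that etcd exposes. The group name, job matcher, 1% threshold, `for` duration, and severity label are all placeholders. Only the alert name, the grpc_code and grpc_method labels, and the two excluded values come from this thread.

```yaml
groups:
  - name: etcd-grpc            # hypothetical group name
    rules:
      - alert: ICPetcdHighNumberOfFailedGRPCRequests
        # Assumed expression shape: percentage of failed gRPC requests per method.
        # Failures with grpc_code="Unavailable" and all LeaseKeepAlive calls are
        # excluded, so a client dropping a keep-alive stream no longer counts.
        expr: |
          100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!~"OK|Unavailable", grpc_method!="LeaseKeepAlive"}[5m])) by (job, instance, grpc_service, grpc_method)
            /
          sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_method!="LeaseKeepAlive"}[5m])) by (job, instance, grpc_service, grpc_method)
            > 1
        for: 10m               # placeholder duration
        labels:
          severity: warning    # placeholder severity
```

Note that putting both matchers in one selector drops every series that has either label value, which matches the "Unavailable or LeaseKeepAlive" condition above. It also hides genuine Unavailable errors on other methods, which is a trade-off worth weighing before applying the change.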
Someone hit a similar issue in openshift/cluster-monitoring-operator#248.