
rafthttp: probe connection for Raft message transport #10022

Closed
wants to merge 5 commits into from

Conversation

gyuho
Contributor

@gyuho gyuho commented Aug 17, 2018

In our production cluster, we found one TCP connection had >8-sec
latencies to a remote peer, but the "etcd_network_peer_round_trip_time_seconds"
metric showed a <1-sec latency distribution, which means either we
weren't sampling enough or all the latency spikes happened
outside the snapshot pipeline connection. The latter is most likely
the case, since the cluster had leader elections from missed
heartbeats.

This PR adds another probing routine to monitor the connection
for Raft message transports.

We need to track which connection had high latency spikes.

```
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_RAFT_MESSAGE",To="729934363faa4a24",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_RAFT_MESSAGE",To="729934363faa4a24",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_SNAPSHOT",To="729934363faa4a24",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_SNAPSHOT",To="729934363faa4a24",le="0.0002"} 1
```

@jpbetz Would adding the ConnectionType="ROUND_TRIPPER_SNAPSHOT" label break anything in your monitoring systems? I want to backport this as well. Currently, probing doesn't tell us much, since it only tracks the snapshot sender connection...
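
For illustration, here is a minimal, self-contained sketch (not the PR's actual code) of the probing idea: a loop that periodically sends a request over the same round tripper used for Raft messages and records the observed round-trip time under a ConnectionType label. The probeRaftMessageConn name and the /raft/probing endpoint path are assumptions of this sketch.

```
// Minimal sketch (assumed names): probe the Raft-message round tripper and
// record round-trip times under a ConnectionType label.
package rafthttpprobe

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var rttSec = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "etcd",
		Subsystem: "network",
		Name:      "peer_round_trip_time_seconds",
		Help:      "Round-Trip-Time histogram between peers.",
		// highest bucket start of 0.0001 sec * 2^15 == 3.2768 sec
		Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
	},
	[]string{"ConnectionType", "To"},
)

func init() { prometheus.MustRegister(rttSec) }

// probeRaftMessageConn periodically issues a lightweight request over the
// same round tripper used for Raft messages and records the observed latency.
func probeRaftMessageConn(rt http.RoundTripper, peerID, peerURL string, interval time.Duration) {
	c := &http.Client{Transport: rt, Timeout: interval}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		start := time.Now()
		resp, err := c.Get(peerURL + "/raft/probing") // assumed probing endpoint
		if err != nil {
			continue // connection failures are surfaced elsewhere
		}
		resp.Body.Close()
		rttSec.WithLabelValues("ROUND_TRIPPER_RAFT_MESSAGE", peerID).
			Observe(time.Since(start).Seconds())
	}
}
```

The PR itself reuses etcd's existing prober machinery rather than a hand-rolled loop like this; the sketch only shows how both connection types can land in the same histogram, distinguished by the ConnectionType label.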

@gyuho gyuho requested review from jpbetz and xiang90 August 17, 2018 23:46
@gyuho gyuho added WIP labels Aug 17, 2018
gyuho added a commit to gyuho/etcd that referenced this pull request Aug 18, 2018
…econds" metric

Currently, only the v2 metrics ("stats.FollowerStats") track Raft message
send latencies. Add a Prometheus histogram to track Raft messages for
writes, since heartbeats are probed (see etcd-io#10022)
and snapshots are already tracked via etcd-io#9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Signed-off-by: Gyuho Lee <[email protected]>
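
As a rough illustration of the referenced commit's idea (assumed names, not the commit's actual code): time the send path and observe the duration in a histogram keyed by destination peer and message type.

```
// Minimal sketch (assumed names): record how long sending a Raft message took,
// labeled by destination peer and message type (e.g. "MsgApp", "MsgProp").
package rafthttpprobe

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var sendSec = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "etcd",
		Subsystem: "network",
		Name:      "raft_send_total_duration_seconds",
		Help:      "Total latency distributions of Raft message sends.",
		Buckets:   prometheus.ExponentialBuckets(0.0001, 2, 16),
	},
	[]string{"To", "Type"},
)

func init() { prometheus.MustRegister(sendSec) }

// timedSend wraps an arbitrary send function and observes its duration.
func timedSend(to, msgType string, send func()) {
	start := time.Now()
	send()
	sendSec.WithLabelValues(to, msgType).Observe(time.Since(start).Seconds())
}
```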
@gyuho
Contributor Author

gyuho commented Aug 18, 2018

Addresses #9438.

@codecov-io

codecov-io commented Aug 18, 2018

Codecov Report

Merging #10022 into master will decrease coverage by 0.03%.
The diff coverage is 94.44%.

Impacted file tree graph

```
@@            Coverage Diff             @@
##           master   #10022      +/-   ##
==========================================
- Coverage   71.59%   71.56%   -0.04%     
==========================================
  Files         390      390              
  Lines       36258    36266       +8     
==========================================
- Hits        25960    25952       -8     
- Misses       8488     8505      +17     
+ Partials     1810     1809       -1
```
Impacted Files Coverage Δ
etcdserver/api/rafthttp/metrics.go 100% <ø> (ø) ⬆️
etcdserver/api/rafthttp/transport.go 83.87% <100%> (+0.45%) ⬆️
etcdserver/api/rafthttp/probing_status.go 57.44% <83.33%> (-0.34%) ⬇️
clientv3/leasing/util.go 91.66% <0%> (-6.67%) ⬇️
pkg/netutil/netutil.go 63.11% <0%> (-6.56%) ⬇️
etcdctl/ctlv3/command/lease_command.go 65.34% <0%> (-5.95%) ⬇️
pkg/transport/listener.go 58.67% <0%> (-4.09%) ⬇️
etcdserver/v2_server.go 80.76% <0%> (-3.85%) ⬇️
proxy/grpcproxy/watch.go 89.44% <0%> (-3.11%) ⬇️
lease/leasehttp/http.go 63.97% <0%> (-2.95%) ⬇️
... and 21 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 34fcaba...37cf84c. Read the comment docs.

gyuho added a commit to gyuho/etcd that referenced this pull request Aug 29, 2018
…econds" metric

Currently, only the v2 metrics ("stats.FollowerStats") track Raft message
send latencies. Add a Prometheus histogram to track Raft messages for
writes, since heartbeats are probed (see etcd-io#10022)
and snapshots are already tracked via etcd-io#9997.

```
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0001"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgProp",le="0.0002"} 1
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0001"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="729934363faa4a24",Type="MsgApp",le="0.0002"} 9
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0001"} 8
etcd_network_raft_send_total_duration_seconds_bucket{To="7339c4e5e833c029",Type="MsgAppResp",le="0.0002"} 8
```

Signed-off-by: Gyuho Lee <[email protected]>
Preliminary work to add prober to "streamRt"

Signed-off-by: Gyuho Lee <[email protected]>
In our production cluster, we found one TCP connection had >8-sec
latencies to a remote peer, but the "etcd_network_peer_round_trip_time_seconds"
metric showed a <1-sec latency distribution, which means either we
weren't sampling enough or all the latency spikes happened
outside the snapshot pipeline connection. The latter is most likely
the case, since the cluster had leader elections from missed
heartbeats.

This PR adds another probing routine to monitor the connection
for Raft message transports.

Signed-off-by: Gyuho Lee <[email protected]>
We need to track which connection had high latency spikes.

```
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_RAFT_MESSAGE",To="729934363faa4a24",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_RAFT_MESSAGE",To="729934363faa4a24",le="0.0002"} 1
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_SNAPSHOT",To="729934363faa4a24",le="0.0001"} 0
etcd_network_peer_round_trip_time_seconds_bucket{ConnectionType="ROUND_TRIPPER_SNAPSHOT",To="729934363faa4a24",le="0.0002"} 1
```

Signed-off-by: Gyuho Lee <[email protected]>
Fix

```
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
```

Signed-off-by: Gyuho Lee <[email protected]>
@gyuho
Contributor Author

gyuho commented Aug 29, 2018

@wenjiaswe @jpbetz I meant to ask about this metric as well. Would adding the ConnectionType label break anything in GKE?

@gyuho
Contributor Author

gyuho commented Aug 29, 2018

AWS does not use this metric, so this should be safe for us.

@jpbetz
Contributor

jpbetz commented Aug 29, 2018

Let's hold on this until we can confirm how label additions are handled. We're primarily concerned with how https://github.com/GoogleCloudPlatform/k8s-stackdriver/tree/master/prometheus-to-sd handles this case. @wenjiaswe is checking now.

@gyuho
Contributor Author

gyuho commented Aug 29, 2018

@jpbetz No rush. Thanks for checking!

Contributor

@wenjiaswe wenjiaswe left a comment

Hello @gyuho, I checked the prometheus-to-sd translator; @jpbetz is correct, a label addition does break the metric push in GKE. Please refer to the code review for details. Everything else LGTM.

```
@@ -143,7 +143,7 @@ var (
 		// highest bucket start of 0.0001 sec * 2^15 == 3.2768 sec
 		Buckets: prometheus.ExponentialBuckets(0.0001, 2, 16),
 	},
-	[]string{"To"},
+	[]string{"ConnectionType", "To"},
```

@gyuho Unfortunately, a label addition will break GKE. I checked https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/master/prometheus-to-sd: for metrics with the "container.googleapis.com" prefix (e.g., etcd metrics), if the definition of a metric changes (and its labels are part of that definition), the metric is marked as broken and is no longer pushed: https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/a39287c62ca7c1ffaa3cac4887f6a92c1943abab/prometheus-to-sd/translator/metric_descriptor_cache.go#L72. prometheus-to-sd only calls UpdateMetricDescriptors for metrics with the "custom.googleapis.com" prefix: https://github.com/GoogleCloudPlatform/k8s-stackdriver/blob/1047589c380eed64c94484e7ab989db53d1b907b/prometheus-to-sd/main.go#L141.
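
For reference, the breaking change is only the label-set change shown in the diff above; once the extra label exists, every recording site has to supply a ConnectionType value as well. A minimal sketch of that call-site shape (assumed names, not the PR's code):

```
// Minimal sketch (assumed names): with the extra label, each probe-status
// reader passes the round-tripper name alongside the peer ID.
package rafthttpprobe

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeRTT records one round-trip sample for a peer connection.
// connType is e.g. "ROUND_TRIPPER_SNAPSHOT" or "ROUND_TRIPPER_RAFT_MESSAGE".
func observeRTT(h *prometheus.HistogramVec, connType, peerID string, rtt time.Duration) {
	// Before the change this would have been h.WithLabelValues(peerID).
	h.WithLabelValues(connType, peerID).Observe(rtt.Seconds())
}
```

Readers and dashboards that never reference the new label keep working unchanged; the breakage described here is specific to pipelines such as prometheus-to-sd that treat the full label set as part of a fixed metric descriptor.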

@gyuho
Contributor Author

gyuho commented Sep 5, 2018

@wenjiaswe @jpbetz Thanks for checking!

Then, let's close this.

I will add a separate metric.

@gyuho gyuho closed this Sep 5, 2018
@gyuho gyuho removed the Release-Note label Sep 5, 2018
@jpbetz
Contributor

jpbetz commented Sep 5, 2018

@wenjiaswe Would you also file an issue somewhere to track this limitation? Many label additions are conceptually backward compatible, since introducing them would not break existing readers that are unaware of the label, so making our metrics processing pipeline tolerant of label additions seems desirable and useful.
