backend: add prometheus metric for large snapshot duration. #7892

fanminshi · 2017-05-06T00:06:01Z

fanminshi · 2017-05-06T00:08:23Z

I manually created 2 snapshots, which took 10 seconds each.

I saw the following metrics:

etcd_disk_backend_snapshot_duration_seconds_bucket{le="1"} 0
etcd_disk_backend_snapshot_duration_seconds_bucket{le="2"} 0
etcd_disk_backend_snapshot_duration_seconds_bucket{le="4"} 0
etcd_disk_backend_snapshot_duration_seconds_bucket{le="8"} 0
etcd_disk_backend_snapshot_duration_seconds_bucket{le="16"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="32"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="64"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="128"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="256"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="512"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="+Inf"} 2
etcd_disk_backend_snapshot_duration_seconds_sum 20.001328967
etcd_disk_backend_snapshot_duration_seconds_count 2

I am unsure why the buckets from 32 to 512 also have a count of 2.

EDIT: le stands for "less than or equal to" (<=), so each bucket also counts everything in the smaller buckets, which makes sense for the metric output.

heyitsanthony · 2017-05-06T00:08:24Z

mvcc/backend/metrics.go

@@ -24,8 +24,18 @@ var (
 		Help:      "The latency distributions of commit called by backend.",
 		Buckets:   prometheus.ExponentialBuckets(0.001, 2, 14),
 	})
+
+	snapShotDurations = prometheus.NewHistogram(prometheus.HistogramOpts{


snapshotDurations

heyitsanthony · 2017-05-06T00:10:03Z

mvcc/backend/metrics.go

+		Name:      "backend_snapshot_duration_seconds",
+		Help:      "The latency distributions of Snapshot called by backend.",
+		// 1 second -> 1024 seconds
+		Buckets: prometheus.ExponentialBuckets(1, 2, 10),


1024 seconds is extreme, probably want to capture something like [10ms -- 30s]

heyitsanthony · 2017-05-06T00:10:33Z

mvcc/backend/metrics.go

+		Namespace: "etcd",
+		Subsystem: "disk",
+		Name:      "backend_snapshot_duration_seconds",
+		Help:      "The latency distributions of Snapshot called by backend.",


The latency distribution of backend snapshots.

heyitsanthony · 2017-05-06T00:12:18Z

mvcc/backend/metrics.go

+		Name:      "backend_snapshot_duration_seconds",
+		Help:      "The latency distributions of Snapshot called by backend.",
+		// 1 second -> 512 seconds
+		Buckets: prometheus.ExponentialBuckets(1, 2, 10),


Probably want something like 10ms -- 1 minute; 512 seconds is a lot.

@xiang90 suggested tracking large snapshot durations, from 1 second to around 10 minutes.

We can track the low numbers that cover the common cases, but when a cluster starts to suffer we probably want to track large numbers as well. If the snapshot is 8 GB and the network is slow, hundreds of seconds is possible.

okay, should we do 10ms to 10mins?

edit: 10ms

FIXES etcd-io#7878

fanminshi · 2017-05-06T00:28:21Z

@xiang90 @heyitsanthony.

How about 10ms to 10 mins?

wenjiaswe · 2018-07-20T17:50:32Z

@gyuho I didn't find this one in CHANGELOG-3.2.md. It was merged in 3.2, right? Just want to confirm; I can add it to the changelog and backport it to 3.1 if it's safe.

gyuho · 2018-07-20T17:52:46Z

@wenjiaswe Yeah, it's in 3.2, not in 3.1. Let's backport and update the changelog.

CHANGELOG-3.2: update from #7892

fanminshi added the WIP label May 6, 2017

fanminshi force-pushed the add_snashot_duration_metric branch from c01e297 to 94054a4 Compare May 6, 2017 00:06

fanminshi force-pushed the add_snashot_duration_metric branch from 94054a4 to 6b5272e Compare May 6, 2017 00:10

heyitsanthony reviewed May 6, 2017

fanminshi force-pushed the add_snashot_duration_metric branch from 6b5272e to 42d2cc8 Compare May 6, 2017 00:16

fanminshi added area/performance and removed WIP labels May 6, 2017

backend: add prometheus metric for large snapshot duration.

230106d

FIXES etcd-io#7878

fanminshi force-pushed the add_snashot_duration_metric branch from 42d2cc8 to 230106d Compare May 6, 2017 00:27

fanminshi merged commit 2655540 into etcd-io:master May 8, 2017

wenjiaswe added a commit to wenjiaswe/etcd that referenced this pull request Jul 20, 2018

CHANGELOG-3.2: update from etcd-io#7892

882a601

wenjiaswe added a commit to wenjiaswe/etcd that referenced this pull request Jul 20, 2018

CHANGELOG-3.2: update from etcd-io#7892

82b712a

wenjiaswe mentioned this pull request Jul 20, 2018

CHANGELOG-3.2: update from #7892 #9948

Merged

wenjiaswe added a commit to wenjiaswe/etcd that referenced this pull request Jul 24, 2018

CHANGELOG-3.2: update from etcd-io#7892

7b71022

gyuho added a commit that referenced this pull request Jul 24, 2018

Merge pull request #9948 from wenjiaswe/changelog47892

69ae028

CHANGELOG-3.2: update from #7892
