backend: add prometheus metric for large snapshot duration. #7892

Merged
merged 1 commit into from
May 8, 2017

Conversation

fanminshi
Member

FIXES #7878

@fanminshi fanminshi added the WIP label May 6, 2017
@fanminshi fanminshi force-pushed the add_snashot_duration_metric branch from c01e297 to 94054a4 Compare May 6, 2017 00:06
@fanminshi
Member Author

fanminshi commented May 6, 2017

I manually created 2 snapshots that took 10 seconds each.

I saw the following metrics:

etcd_disk_backend_snapshot_duration_seconds_bucket{le="1"} 0
etcd_disk_backend_snapshot_duration_seconds_bucket{le="2"} 0
etcd_disk_backend_snapshot_duration_seconds_bucket{le="4"} 0
etcd_disk_backend_snapshot_duration_seconds_bucket{le="8"} 0
etcd_disk_backend_snapshot_duration_seconds_bucket{le="16"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="32"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="64"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="128"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="256"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="512"} 2
etcd_disk_backend_snapshot_duration_seconds_bucket{le="+Inf"} 2
etcd_disk_backend_snapshot_duration_seconds_sum 20.001328967
etcd_disk_backend_snapshot_duration_seconds_count 2

I am unsure why buckets 32-512 also have a count of 2.

EDIT: le stands for <= (less than or equal), so each bucket counts every observation at or below its bound. That makes sense for the metric output.
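
To illustrate the cumulative `le` semantics, here is a minimal, standard-library-only sketch. The helpers `exponentialBuckets` and `cumulativeCounts` are hypothetical stand-ins written for this example (they are not part of the Prometheus client), and the two observations are rounded to exactly 10 seconds.

```go
package main

import (
	"fmt"
	"math"
)

// exponentialBuckets is a hypothetical stand-in that follows the documented
// behavior of prometheus.ExponentialBuckets: count upper bounds, starting at
// start, each one factor times the previous.
func exponentialBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// cumulativeCounts returns, for every upper bound plus a final +Inf bucket,
// how many observations are less than or equal to that bound.
func cumulativeCounts(bounds, observations []float64) []uint64 {
	all := append(append([]float64{}, bounds...), math.Inf(1))
	counts := make([]uint64, len(all))
	for _, v := range observations {
		for i, le := range all {
			if v <= le {
				counts[i]++ // an observation lands in every bucket at or above it
			}
		}
	}
	return counts
}

func main() {
	bounds := exponentialBuckets(1, 2, 10) // 1, 2, 4, ..., 512
	counts := cumulativeCounts(bounds, []float64{10, 10})
	for i, le := range bounds {
		fmt.Printf("le=%g %d\n", le, counts[i])
	}
	fmt.Printf("le=+Inf %d\n", counts[len(bounds)])
}
```

Both 10-second observations are counted in the 16 bucket and in every larger bucket, which is why 32 through 512 and +Inf all report 2.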

@fanminshi fanminshi force-pushed the add_snashot_duration_metric branch from 94054a4 to 6b5272e Compare May 6, 2017 00:10
@@ -24,8 +24,18 @@ var (
Help: "The latency distributions of commit called by backend.",
Buckets: prometheus.ExponentialBuckets(0.001, 2, 14),
})

snapShotDurations = prometheus.NewHistogram(prometheus.HistogramOpts{
Contributor

snapshotDurations

Name: "backend_snapshot_duration_seconds",
Help: "The latency distributions of Snapshot called by backend.",
// 1 second -> 1024 seconds
Buckets: prometheus.ExponentialBuckets(1, 2, 10),
Contributor

1024 seconds is extreme, probably want to capture something like [10ms -- 30s]

Namespace: "etcd",
Subsystem: "disk",
Name: "backend_snapshot_duration_seconds",
Help: "The latency distributions of Snapshot called by backend.",
Contributor

The latency distribution of backend snapshots.

Name: "backend_snapshot_duration_seconds",
Help: "The latency distributions of Snapshot called by backend.",
// 1 second -> 512 seconds
Buckets: prometheus.ExponentialBuckets(1, 2, 10),
Contributor

probably want something like 10ms -- 1 minute. 512 seconds is a lot

Member Author

@fanminshi fanminshi May 6, 2017

@xiang90 suggested tracking large snapshot durations, from 1 second to around 10 min.

Contributor

ok...

Contributor

We can track low numbers for the common cases. When the cluster starts to suffer, we probably want to track large numbers. If the snapshot is 8 GB on a slow network, hundreds of seconds is possible.

Member Author

@fanminshi fanminshi May 6, 2017

okay, should we do 10ms to 10mins?

edit: 10ms

@fanminshi fanminshi force-pushed the add_snashot_duration_metric branch from 6b5272e to 42d2cc8 Compare May 6, 2017 00:16
@fanminshi fanminshi force-pushed the add_snashot_duration_metric branch from 42d2cc8 to 230106d Compare May 6, 2017 00:27
@fanminshi
Member Author

@xiang90 @heyitsanthony.

How about 10ms to 10 mins?
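
As a rough check on what a 10ms to 10 min range would need, the sketch below computes how many factor-2 buckets are required when starting at 0.01 seconds. It uses only the standard library. `exponentialBuckets` and `bucketsToCover` are hypothetical helpers written for this example, and the factor of 2 and the 600-second target are assumptions taken from the discussion, not necessarily the values that were merged.

```go
package main

import "fmt"

// exponentialBuckets is a hypothetical stand-in that follows the documented
// behavior of prometheus.ExponentialBuckets(start, factor, count).
func exponentialBuckets(start, factor float64, count int) []float64 {
	bounds := make([]float64, count)
	for i := range bounds {
		bounds[i] = start
		start *= factor
	}
	return bounds
}

// bucketsToCover returns the smallest bucket count whose largest upper bound
// is at least max. It returns 0 for inputs that would never reach max.
func bucketsToCover(start, factor, max float64) int {
	if start <= 0 || factor <= 1 {
		return 0
	}
	n := 1
	for bound := start; bound < max; bound *= factor {
		n++
	}
	return n
}

func main() {
	// Layout from the diff under review: starts at 1 second, 10 buckets.
	old := exponentialBuckets(1, 2, 10)
	fmt.Printf("(1, 2, 10): %g .. %g seconds\n", old[0], old[len(old)-1])

	// Proposed range: 10ms up to at least 10 minutes (600 seconds).
	n := bucketsToCover(0.01, 2, 600)
	proposed := exponentialBuckets(0.01, 2, n)
	fmt.Printf("(0.01, 2, %d): %g .. %g seconds\n", n, proposed[0], proposed[n-1])
}
```

Under these assumptions, the range takes 17 buckets and the largest finite bound is about 655 seconds, since 16 buckets would stop near 328 seconds.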

@fanminshi fanminshi merged commit 2655540 into etcd-io:master May 8, 2017
@wenjiaswe
Contributor

@gyuho I didn't find this one in CHANGELOG-3.2.md. It was merged in 3.2, right? Just want to confirm; I can add it to the changelog and backport to 3.1 if it's safe.

@gyuho
Contributor

gyuho commented Jul 20, 2018

@wenjiaswe Yeah, it's in 3.2, not in 3.1. Let's backport and update the changelog.

wenjiaswe added a commit to wenjiaswe/etcd that referenced this pull request Jul 20, 2018
wenjiaswe added a commit to wenjiaswe/etcd that referenced this pull request Jul 20, 2018
wenjiaswe added a commit to wenjiaswe/etcd that referenced this pull request Jul 24, 2018
gyuho added a commit that referenced this pull request Jul 24, 2018
5 participants