Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics to reliable socket and reconnecting libs #3242

Merged
merged 3 commits into from
Oct 16, 2019
Merged

Add metrics to reliable socket and reconnecting libs #3242

merged 3 commits into from
Oct 16, 2019

Conversation

scrye
Copy link
Contributor

@scrye scrye commented Oct 11, 2019

Metrics:

# HELP lib_reliable_dials_total Total number of Dial calls.
# TYPE lib_reliable_dials_total counter
lib_reliable_dials_total{result="err_not_classified"} 1
lib_reliable_dials_total{result="ok_success"} 3
# HELP lib_reliable_registers_total Total number of Register calls.
# TYPE lib_reliable_registers_total counter
lib_reliable_registers_total{result="ok_success",svc="CS"} 1
lib_reliable_registers_total{result="ok_success",svc="UNKNOWN"} 3
# HELP lib_reliable_reads_total Total number of Read calls
# TYPE lib_reliable_reads_total histogram
lib_reliable_reads_total_bucket{result="err_not_classified",le="1"} 2
lib_reliable_reads_total_bucket{result="err_not_classified",le="+Inf"} 2
lib_reliable_reads_total_sum{result="err_not_classified"} 0
lib_reliable_reads_total_count{result="err_not_classified"} 2
lib_reliable_reads_total_bucket{result="ok_success",le="1"} 0
lib_reliable_reads_total_bucket{result="ok_success",le="+Inf"} 422
lib_reliable_reads_total_sum{result="ok_success"} 148709
lib_reliable_reads_total_count{result="ok_success"} 422
# HELP lib_reliable_writes_total Total number of Write calls
# TYPE lib_reliable_writes_total histogram
lib_reliable_writes_total_bucket{result="ok_success",le="1"} 0
lib_reliable_writes_total_bucket{result="ok_success",le="+Inf"} 403
lib_reliable_writes_total_sum{result="ok_success"} 85732
lib_reliable_writes_total_count{result="ok_success"} 403
# HELP lib_reconnect_retries_total Total number of reconnection attempt retries
# TYPE lib_reconnect_retries_total counter
lib_reconnect_retries_total 2
# HELP lib_reconnect_timeouts_total Total number of reconnection attempt timeouts
# TYPE lib_reconnect_timeouts_total counter
lib_reconnect_timeouts_total 0

Fixes #3183.


This change is Reviewable

@scrye scrye added SNET feature New feature or request c/observability Metrics, logging, tracing labels Oct 11, 2019
@scrye scrye requested a review from kormat October 11, 2019 12:07
@scrye scrye self-assigned this Oct 11, 2019
@scrye scrye requested review from lukedirtwalker and removed request for kormat October 16, 2019 06:21
Copy link
Collaborator

@lukedirtwalker lukedirtwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 15 of 15 files at r1.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @scrye)


go/lib/prom/prom.go, line 75 at r1 (raw file):

		1.28, 2.56, 5.12, 10.24}
	// DefaultSizeBuckets 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384
	DefaultSizeBuckets = []float64{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192,

I think on the lower bound we could start at 32 or something.


go/lib/sock/reliable/reliable.go, line 402 at r1 (raw file):

		return prom.Success
	}
	if opError, ok := err.(*net.OpError); ok && opError.Timeout() {

would serrors.IsTimeout(err) not work for this case?


go/lib/sock/reliable/internal/metrics/metrics.go, line 1 at r1 (raw file):

// Copyright 2019 ETH Zurich

Please add tests for metrics labels.
promtest.CheckLabelsStruct(t, DialLabels{}) etc.


go/lib/sock/reliable/reconnect/errors.go, line 21 at r1 (raw file):

var (
	ErrDispatcherDead            = serrors.New("dispatcher dead")
	ErrReconnecterTimeoutExpired = serrors.New("timeout expired")

I wonder if that should be a timeout error so that serrors.IsTimeout would return true.


go/lib/sock/reliable/reconnect/internal/metrics/metrics.go, line 62 at r1 (raw file):

// Timeouts returns a counter for timeout errors.
func (m metrics) Timeouts(_ Labels) prometheus.Counter {

I think we could just skip the Labels for this. just Timeouts().Inc() on the call site.

scrye added 2 commits October 16, 2019 10:16
Metrics:
```
# HELP lib_reliable_dials_total Total number of Dial calls.
# TYPE lib_reliable_dials_total counter
lib_reliable_dials_total{result="err_not_classified"} 1
lib_reliable_dials_total{result="ok_success"} 3
# HELP lib_reliable_registers_total Total number of Register calls.
# TYPE lib_reliable_registers_total counter
lib_reliable_registers_total{result="ok_success",svc="CS"} 1
lib_reliable_registers_total{result="ok_success",svc="UNKNOWN"} 3
# HELP lib_reliable_reads_total Total number of Read calls
# TYPE lib_reliable_reads_total histogram
lib_reliable_reads_total_bucket{result="err_not_classified",le="1"} 2
lib_reliable_reads_total_bucket{result="err_not_classified",le="+Inf"} 2
lib_reliable_reads_total_sum{result="err_not_classified"} 0
lib_reliable_reads_total_count{result="err_not_classified"} 2
lib_reliable_reads_total_bucket{result="ok_success",le="1"} 0
lib_reliable_reads_total_bucket{result="ok_success",le="+Inf"} 422
lib_reliable_reads_total_sum{result="ok_success"} 148709
lib_reliable_reads_total_count{result="ok_success"} 422
# HELP lib_reliable_writes_total Total number of Write calls
# TYPE lib_reliable_writes_total histogram
lib_reliable_writes_total_bucket{result="ok_success",le="1"} 0
lib_reliable_writes_total_bucket{result="ok_success",le="+Inf"} 403
lib_reliable_writes_total_sum{result="ok_success"} 85732
lib_reliable_writes_total_count{result="ok_success"} 403
# HELP lib_reconnect_retries_total Total number of reconnection attempt
retries
# TYPE lib_reconnect_retries_total counter
lib_reconnect_retries_total 2
# HELP lib_reconnect_timeouts_total Total number of reconnection attempt
timeouts
# TYPE lib_reconnect_timeouts_total counter
lib_reconnect_timeouts_total 0
```

Fixes #3183.
Copy link
Contributor Author

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 15 files reviewed, 5 unresolved discussions (waiting on @lukedirtwalker)


go/lib/prom/prom.go, line 75 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

I think on the lower bound we could start at 32 or something.

Done.


go/lib/sock/reliable/reliable.go, line 402 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

would serrors.IsTimeout(err) not work for this case?

Yes, it should work. Done.


go/lib/sock/reliable/internal/metrics/metrics.go, line 1 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

Please add tests for metrics labels.
promtest.CheckLabelsStruct(t, DialLabels{}) etc.

Done.


go/lib/sock/reliable/reconnect/errors.go, line 21 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

I wonder if that should be a timeout error so that serrors.IsTimeout would return true.

It probably should, but it's not something I would change in this PR because it can break stuff, since it interferes with execution paths.

Added a FIXME instead.


go/lib/sock/reliable/reconnect/internal/metrics/metrics.go, line 62 at r1 (raw file):

Previously, lukedirtwalker (Lukas Vogel) wrote…

I think we could just skip the Labels for this. just Timeouts().Inc() on the call site.

Done.

Copy link
Collaborator

@lukedirtwalker lukedirtwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 6 of 6 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @scrye)


go/lib/sock/reliable/internal/metrics/metrics.go, line 1 at r1 (raw file):

Previously, scrye (Sergiu Costea) wrote…

Done.

The file is missing.

Copy link
Collaborator

@lukedirtwalker lukedirtwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@scrye scrye merged commit 95c4cd5 into scionproto:master Oct 16, 2019
@scrye scrye deleted the pubpr-fix-3183 branch October 16, 2019 09:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/observability Metrics, logging, tracing feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lib/sock/reliable: expose metrics
2 participants