Add new metrics. #396
Conversation
why not put the
Thanks, didn't know about that. That would cover
Yes, I am still looking through the code to see if we can simplify that part as well.
1. `prometheus_tsdb_wal_truncate_fail` for failed WAL truncation.
2. `prometheus_tsdb_checkpoint_delete_fail` for failed old checkpoint delete.

Signed-off-by: Ganesh Vernekar <[email protected]>
Force-pushed 2a38e2a to 632dfb3.
head.go
Outdated
@@ -146,10 +148,18 @@ func newHeadMetrics(h *Head, r prometheus.Registerer) *headMetrics {
	Name: "prometheus_tsdb_wal_truncate_duration_seconds",
	Help: "Duration of WAL truncation.",
})
m.walTruncateFail = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_wal_truncate_fail",
Rather `prometheus_tsdb_wal_truncate_failed_total`.
head.go
Outdated
m.samplesAppended = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_head_samples_appended_total",
	Help: "Total number of appended samples.",
})
m.checkpointDeleteFail = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_checkpoint_delete_fail",
`prometheus_tsdb_checkpoint_deletions_failed_total`
head.go
Outdated
m.samplesAppended = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_head_samples_appended_total",
	Help: "Total number of appended samples.",
})
m.checkpointDeleteFail = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_checkpoint_delete_fail",
	Help: "Number of times deletion of old checkpoint failed.",
"Total number of checkpoint deletions that failed."
head.go
Outdated
@@ -146,10 +148,18 @@ func newHeadMetrics(h *Head, r prometheus.Registerer) *headMetrics {
	Name: "prometheus_tsdb_wal_truncate_duration_seconds",
	Help: "Duration of WAL truncation.",
})
m.walTruncateFail = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_wal_truncate_fail",
	Help: "Number of times WAL truncation failed.",
"Total number of WAL truncations that failed."
checkpoint.go
Outdated
@@ -277,12 +279,14 @@ func Checkpoint(logger log.Logger, w *wal.WAL, m, n int, keep func(id uint64) bo
// Leftover segments will just be ignored in the future if there's a checkpoint
// that supersedes them.
level.Error(logger).Log("msg", "truncating segments failed", "err", err)
walTruncationFail.Add(float64(1))
I would move the metric to the `wal.Truncate()` function and add another metric (eg `prometheus_tsdb_checkpoints_failed_total`) in `head.go` that would be increased when this function returns a non-nil error.
checkpoint.go
Outdated
}
if err := DeleteCheckpoints(w.Dir(), n); err != nil {
	// Leftover old checkpoints do not cause problems down the line beyond
	// occupying disk space.
	// They will just be ignored since a higher checkpoint exists.
	level.Error(logger).Log("msg", "delete old checkpoints", "err", err)
	checkpointDeleteFail.Add(float64(1))
You could use `checkpointDeleteFail.Inc()`.
My review clashed with @krasi-georgiev's, I'll do a second pass on the updated code...
checkpoint.go
Outdated
@@ -283,6 +284,7 @@ func Checkpoint(logger log.Logger, w *wal.WAL, m, n int, keep func(id uint64) bo
// occupying disk space.
// They will just be ignored since a higher checkpoint exists.
level.Error(logger).Log("msg", "delete old checkpoints", "err", err)
checkpointDeleteFail.Add(float64(1))
You could use `checkpointDeleteFail.Inc()`.
wal/wal.go
Outdated
@@ -530,13 +535,15 @@ func (w *WAL) Segments() (m, n int, err error) {
func (w *WAL) Truncate(i int) error {
	refs, err := listSegments(w.dir)
	if err != nil {
		w.truncateFail.Add(float64(1))
w.truncateFail.Inc()
wal/wal.go
Outdated
	return err
}
for _, r := range refs {
	if r.n >= i {
		break
	}
	if err := os.Remove(filepath.Join(w.dir, r.s)); err != nil {
		w.truncateFail.Add(float64(1))
Ditto.
I'd add another metric (eg
head.go
Outdated
})
m.checkpointsFail = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_checkpoints_failed_total",
	Help: "Total number of create checkpoint that failed.",
(nit) "Total number of checkpoints that failed."
Done.
Force-pushed 0678984 to ff0c907.
I think the
This way the
It also solves the problem with the metrics.
I don't see anything wrong in what you proposed. Should I wait for another ack or shall I go ahead with this?
It is an easy change, so I would say go ahead, and if anyone objects it is easy to revert.
Ping me when ready for another review.
Force-pushed ff0c907 to 231eb41.
ping @krasi-georgiev
wal/wal.go
Outdated
	return err
}
for _, r := range refs {
	if r.n >= i {
		break
	}
	if err := os.Remove(filepath.Join(w.dir, r.s)); err != nil {
		w.truncateFail.Inc()
Not a deal breaker, but to make it more bulletproof against further code changes, how about adding this somewhere in the beginning:

	defer func() {
		if err != nil {
			w.truncateFail.Inc()
		}
	}()

This way, even if we add more blocks that might return an error, we wouldn't have to change the logic for the metrics.
Thought about that :) I modified now.
Force-pushed 80d6431 to 4115dc5.
Sorry, for jumping in late for the review, this generally LGTM, but I think we need to add the total number of attempts for each failure too. Metrics are cheap, and shouldn't be a problem to add a couple of more if they are going to make debugging easy. See: https://prometheus.io/docs/practices/instrumentation/#failures
Further, we don't have a metric to see the number of failures for creating a
👍 to what @gouthamve says. As I've written before, it is not just about checkpoint deletions that fail. We have blind spots on
From what I understand, we are looking at adding these in addition to the current ones.
Am I correct?
@codesome yes. By convention, metric names are
Force-pushed 4115dc5 to 92a5a0f.
Updated. Now it contains metrics for
head.go
Outdated
	Help: "Total number of checkpoint deletions attempted.",
})
m.checkpointCreationFail = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_checkpoint_creation_failed_total",
(nit) prometheus_tsdb_checkpoint_creations_failed_total
head.go
Outdated
	Help: "Total number of checkpoint creations that failed.",
})
m.checkpointCreationTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_checkpoint_creation_total",
(nit) prometheus_tsdb_checkpoint_creations_total
2 small comments, otherwise 👍
LGTM |
Signed-off-by: Ganesh Vernekar <[email protected]>
Force-pushed 92a5a0f to 61b000e.
I was wondering if there is some bug in the Travis test. Error says
It is failing similarly in #398.
PS: Tests pass on my system with no build failures.
When I clone the PR using
I do see 1323, but it is as it should be: maybe try rebasing?
Signed-off-by: Ganesh Vernekar <[email protected]>
Yup, rebasing fixed it. Didn't know that tests are performed this way.
@codesome thanks for the work!
Fixes #373.

This is the best I could think of to prevent passing entire `head` or `headMetrics` into `Checkpoint(..)`.

Signed-off-by: Ganesh Vernekar [email protected]