This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Add new metrics. #396

Merged: 3 commits merged into prometheus-junkyard:master from the new-metrics branch on Oct 1, 2018

Conversation

codesome
Contributor

Fixes #373.

This is the best I could think of to prevent passing the entire head or headMetrics into Checkpoint(..).

Signed-off-by: Ganesh Vernekar [email protected]

@krasi-georgiev
Contributor

krasi-georgiev commented Sep 25, 2018

@codesome
Contributor Author

Thanks, I didn't know about that. That would cover prometheus_tsdb_wal_truncate_fail, but we might still want to pass prometheus_tsdb_checkpoint_delete_fail to Checkpoint(...).
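
(As an aside, a minimal sketch of that alternative, i.e. handing Checkpoint only the counter it needs rather than the whole head or headMetrics; the helper and parameter names here are hypothetical, not the code in this PR:)

```go
package main

import (
	"errors"

	"github.com/prometheus/client_golang/prometheus"
)

// checkpointWith is a hypothetical stand-in for Checkpoint(...): it receives
// only the counter it needs instead of the whole head or headMetrics struct.
func checkpointWith(deleteOld func() error, checkpointDeleteFail prometheus.Counter) error {
	// ... write the new checkpoint here ...
	if err := deleteOld(); err != nil {
		// Leftover old checkpoints only occupy disk space, so count the
		// failure and carry on instead of returning an error.
		checkpointDeleteFail.Inc()
	}
	return nil
}

func main() {
	deleteFail := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "prometheus_tsdb_checkpoint_delete_fail",
		Help: "Number of times deletion of old checkpoint failed.",
	})
	_ = checkpointWith(func() error { return errors.New("disk full") }, deleteFail)
}
```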

@krasi-georgiev
Contributor

Yes, I am still looking through the code to see if we can simplify that part as well.

1. 'prometheus_tsdb_wal_truncate_fail' for failed WAL truncation.
2. 'prometheus_tsdb_checkpoint_delete_fail' for failed deletion of old checkpoints.

Signed-off-by: Ganesh Vernekar <[email protected]>
head.go Outdated
@@ -146,10 +148,18 @@ func newHeadMetrics(h *Head, r prometheus.Registerer) *headMetrics {
Name: "prometheus_tsdb_wal_truncate_duration_seconds",
Help: "Duration of WAL truncation.",
})
m.walTruncateFail = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_wal_truncate_fail",
Contributor

prometheus_tsdb_wal_truncate_failed_total rather

head.go Outdated
m.samplesAppended = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_head_samples_appended_total",
Help: "Total number of appended samples.",
})
m.checkpointDeleteFail = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_checkpoint_delete_fail",
Contributor

prometheus_tsdb_checkpoint_deletions_failed_total

head.go Outdated
m.samplesAppended = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_head_samples_appended_total",
Help: "Total number of appended samples.",
})
m.checkpointDeleteFail = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_checkpoint_delete_fail",
Help: "Number of times deletion of old checkpoint failed.",
Contributor

"Total number of checkpoint deletions that failed."

head.go Outdated
@@ -146,10 +148,18 @@ func newHeadMetrics(h *Head, r prometheus.Registerer) *headMetrics {
Name: "prometheus_tsdb_wal_truncate_duration_seconds",
Help: "Duration of WAL truncation.",
})
m.walTruncateFail = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_wal_truncate_fail",
Help: "Number of times WAL truncation failed.",
Contributor

"Total number of WAL truncations that failed."

checkpoint.go Outdated
@@ -277,12 +279,14 @@ func Checkpoint(logger log.Logger, w *wal.WAL, m, n int, keep func(id uint64) bo
// Leftover segments will just be ignored in the future if there's a checkpoint
// that supersedes them.
level.Error(logger).Log("msg", "truncating segments failed", "err", err)
walTruncationFail.Add(float64(1))
Contributor

I would move the metric to the wal.Truncate() function and add another metric (eg prometheus_tsdb_checkpoints_failed_total) in head.go that would be increased when this function returns a non-nil error.
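
(Roughly, on the head.go side that suggestion amounts to the sketch below; the Checkpoint argument list is abbreviated, the mint argument and the metrics field name are assumed, and wal.Truncate would count its own failures internally.)

```go
// In head.Truncate (sketch): count a failed checkpoint at the caller instead
// of threading counters into Checkpoint itself.
if _, err := Checkpoint(h.logger, h.wal, m, n, keep, mint); err != nil {
	h.metrics.checkpointsFail.Inc() // e.g. prometheus_tsdb_checkpoints_failed_total
	return err
}
```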

checkpoint.go Outdated
}
if err := DeleteCheckpoints(w.Dir(), n); err != nil {
// Leftover old checkpoints do not cause problems down the line beyond
// occupying disk space.
// They will just be ignored since a higher checkpoint exists.
level.Error(logger).Log("msg", "delete old checkpoints", "err", err)
checkpointDeleteFail.Add(float64(1))
Contributor

you could use checkpointDeleteFail.Inc()

@simonpasquier
Contributor

My review clashed with @krasi-georgiev's; I'll do a second pass on the updated code...

checkpoint.go Outdated
@@ -283,6 +284,7 @@ func Checkpoint(logger log.Logger, w *wal.WAL, m, n int, keep func(id uint64) bo
// occupying disk space.
// They will just be ignored since a higher checkpoint exists.
level.Error(logger).Log("msg", "delete old checkpoints", "err", err)
checkpointDeleteFail.Add(float64(1))
Contributor

you could use checkpointDeleteFail.Inc()

wal/wal.go Outdated
@@ -530,13 +535,15 @@ func (w *WAL) Segments() (m, n int, err error) {
func (w *WAL) Truncate(i int) error {
refs, err := listSegments(w.dir)
if err != nil {
w.truncateFail.Add(float64(1))
Contributor

w.truncateFail.Inc()

wal/wal.go Outdated
return err
}
for _, r := range refs {
if r.n >= i {
break
}
if err := os.Remove(filepath.Join(w.dir, r.s)); err != nil {
w.truncateFail.Add(float64(1))
Contributor

ditto

@simonpasquier
Contributor

I'd add another metric (eg prometheus_tsdb_checkpoints_failed_total) in head.go that would be increased whenever Checkpoint() returns a non-nil error.

head.go Outdated
})
m.checkpointsFail = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_checkpoints_failed_total",
Help: "Total number of create checkpoint that failed.",
Contributor

(nit) "Total number of checkpoints that failed."

Contributor Author

Done.

@krasi-georgiev
Contributor

I think the w.Truncate(..) and DeleteCheckpoints(...) should be moved from Checkpoint to head.Truncate(...)

https://github.com/prometheus/tsdb/blob/a971f52ac8c41ecf8ca54325b312ddab97b1f3aa/checkpoint.go#L275-L286

This way the Checkpoint func's purpose is only to create the checkpoints, and deleting old checkpoints and truncating the WAL is left to the caller.

It also solves the problem with the metrics.
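
(Sketched out, the split could look roughly like this inside head.Truncate; the Checkpoint argument list is abbreviated from the hunks above, and the metrics field names follow the ones used elsewhere in this PR:)

```go
// Inside head.Truncate (sketch): Checkpoint only writes the new checkpoint.
if _, err := Checkpoint(h.logger, h.wal, m, n, keep, mint); err != nil {
	h.metrics.checkpointCreationFail.Inc()
	return err
}
// WAL truncation and old-checkpoint deletion now live with the caller.
if err := h.wal.Truncate(n + 1); err != nil {
	// Leftover segments are ignored once a newer checkpoint supersedes them.
	level.Error(h.logger).Log("msg", "truncating segments failed", "err", err)
}
if err := DeleteCheckpoints(h.wal.Dir(), n); err != nil {
	// Leftover old checkpoints only occupy disk space.
	h.metrics.checkpointDeleteFail.Inc()
	level.Error(h.logger).Log("msg", "delete old checkpoints", "err", err)
}
```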

@codesome
Contributor Author

I don't see anything wrong in what you proposed. Should I wait for another ack or shall I go ahead with this?

@krasi-georgiev
Contributor

It is an easy change, so I would say go ahead; if anyone objects, it is easy to revert.

@krasi-georgiev
Contributor

ping me when ready for another review.

@codesome
Contributor Author

ping @krasi-georgiev

wal/wal.go Outdated
return err
}
for _, r := range refs {
if r.n >= i {
break
}
if err := os.Remove(filepath.Join(w.dir, r.s)); err != nil {
w.truncateFail.Inc()
Contributor

Not a deal breaker, but to make it more bulletproof against further code changes, how about adding this somewhere near the beginning:

defer func() {
    if err != nil {
        w.truncateFail.Inc()
    }
}()

This way, even if we add more blocks that might return an error, we wouldn't have to change the logic for the metrics.
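
(For the deferred check to see the error, Truncate's error return needs to be named; a minimal sketch based on the hunks above, with the listSegments call and the removal loop unchanged:)

```go
// Sketch: naming the error return lets the deferred closure observe whatever
// error the function ultimately returns, from any of its return paths.
func (w *WAL) Truncate(i int) (err error) {
	defer func() {
		if err != nil {
			w.truncateFail.Inc()
		}
	}()
	refs, err := listSegments(w.dir)
	if err != nil {
		return err
	}
	for _, r := range refs {
		if r.n >= i {
			break
		}
		if err = os.Remove(filepath.Join(w.dir, r.s)); err != nil {
			return err
		}
	}
	return nil
}
```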

Contributor Author

Thought about that :) I have modified it now.

head.go: resolved
checkpoint.go: resolved
@codesome force-pushed the new-metrics branch 2 times, most recently from 80d6431 to 4115dc5 on September 26, 2018 at 13:39
@gouthamve
Collaborator

Sorry for jumping in late on the review; this generally LGTM, but I think we need to add the total number of attempts for each failure too. Metrics are cheap, and it shouldn't be a problem to add a couple more if they are going to make debugging easy.

See: https://prometheus.io/docs/practices/instrumentation/#failures

When reporting failures, you should generally have some other metric representing the total number of attempts. This makes the failure ratio easy to calculate.

Further, we don't have a metric to see the number of failures for creating a checkpoint, and I think that should be added too. I am not fully awake yet, so feel free to let me know if I got something wrong anywhere.
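
(Aside, to make the cited practice concrete: with a foo_total / foo_failed_total pair, the failure ratio can be computed in a query as, e.g., rate(foo_failed_total[5m]) / rate(foo_total[5m]).)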

@simonpasquier
Contributor

👍 to what @gouthamve says. As I've written before, it is not just about checkpoint deletions that fail. We have blind spots on head.Truncate(), and this PR needs to solve them all.

@codesome
Contributor Author

From what I understand, we are looking at adding these in addition to the current ones:

  • prometheus_tsdb_wal_truncate_attempt_total
  • prometheus_tsdb_head_truncate_attempt_total
  • prometheus_tsdb_checkpoint_creation_failed_total
  • prometheus_tsdb_checkpoint_creation_attempt_total

Am I correct?

@simonpasquier
Contributor

@codesome yes. By convention, metric names are foo_failed_total and foo_total (instead of foo_attempt_total).

  • prometheus_tsdb_wal_truncations_total
  • prometheus_tsdb_head_truncations_total
  • prometheus_tsdb_checkpoint_creations_failed_total
  • prometheus_tsdb_checkpoint_creations_total
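
(Applied to the checkpoint-creation pair, the registration in headMetrics would then look roughly like this; the Help string for the total counter is assumed, the rest mirrors the hunks below.)

```go
m.checkpointCreationTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_checkpoint_creations_total",
	Help: "Total number of checkpoint creations attempted.",
})
m.checkpointCreationFail = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "prometheus_tsdb_checkpoint_creations_failed_total",
	Help: "Total number of checkpoint creations that failed.",
})
```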

@codesome
Contributor Author

Updated. Now it contains metrics for

  • Head truncation fail and total
  • WAL truncation fail and total
  • Checkpoint creation fail and total
  • Checkpoint deletion fail and total

head.go Outdated
Help: "Total number of checkpoint deletions attempted.",
})
m.checkpointCreationFail = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_checkpoint_creation_failed_total",
Contributor

(nit) prometheus_tsdb_checkpoint_creations_failed_total

head.go Outdated
Help: "Total number of checkpoint creations that failed.",
})
m.checkpointCreationTotal = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_checkpoint_creation_total",
Contributor

(nit) prometheus_tsdb_checkpoint_creations_total

@simonpasquier
Contributor

2 small comments otherwise 👍

@krasi-georgiev
Contributor

LGTM
@gouthamve or @simonpasquier feel free to merge once your comments are addressed.

Signed-off-by: Ganesh Vernekar <[email protected]>
@codesome
Contributor Author

I was wondering if there is some bug in the Travis test. The error says ./db_test.go:1323:9: cannot assign 1 values to 2 variables, but I can see only 1304 lines in db_test.go.

It is failing similarly in #398

PS: The tests pass on my system with no build failures.

@krasi-georgiev
Contributor

When I clone the PR using
https://help.github.com/articles/checking-out-pull-requests-locally/

I do see line 1323, but it is as it should be: err := db.compact()

maybe try rebasing?

@codesome
Contributor Author

Yup, rebasing fixed it. I didn't know that tests are performed this way.

@simonpasquier merged commit 043e3bb into prometheus-junkyard:master on Oct 1, 2018
@simonpasquier
Contributor

@codesome thanks for the work!
