compactor: add downsample duration histogram #4551

Closed

Changes from 1 commit
cmd/thanos/downsample.go (20 changes: 18 additions & 2 deletions)
@@ -38,6 +38,7 @@ import (
 type DownsampleMetrics struct {
 	downsamples        *prometheus.CounterVec
 	downsampleFailures *prometheus.CounterVec
+	downsampleDuration prometheus.Histogram
 }

 func newDownsampleMetrics(reg *prometheus.Registry) *DownsampleMetrics {

@@ -51,6 +52,11 @@ func newDownsampleMetrics(reg *prometheus.Registry) *DownsampleMetrics {
Name: "thanos_compact_downsample_failures_total",
Help: "Total number of failed downsampling attempts.",
}, []string{"group"})
m.downsampleDuration = promauto.With(reg).NewHistogram(prometheus.HistogramOpts{
Name: "thanos_compact_downsample_duration_seconds",
Help: "Duration of downsample runs",
Buckets: []float64{1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192},
Contributor:
Maybe we can start from a larger number. Please note that the time includes the block download and index verification steps, so small buckets don't make sense here.
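For illustration, a minimal sketch of what larger starting buckets could look like, using client_golang's exponential-bucket helper; the starting point and bucket count here are assumptions, not values from this PR:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical bucket layout starting at one minute and doubling:
// 60s, 120s, 240s, ... up to ~8.5h (10 buckets in total).
var downsampleDurationOpts = prometheus.HistogramOpts{
	Name:    "thanos_compact_downsample_duration_seconds",
	Help:    "Duration of downsample runs",
	Buckets: prometheus.ExponentialBuckets(60, 2, 10),
}
```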

Contributor Author:
Hello - that turnaround was blazing fast!

I have no reservations about making the starting intervals larger - I was merely trying to replicate the intervals used in the prometheus_tsdb_compaction_duration_seconds_bucket metric currently exposed by the compactor.

If I'm not mistaken, the begin variable is updated to time.Now() once the block has been downloaded and the index has been verified: https://github.com/thanos-io/thanos/blob/main/cmd/thanos/downsample.go#L326

So what this should measure is the time taken to open the block and run the downsample.Downsample method.

Contributor:
Yes, you are right! Sorry, the GitHub code review just showed the begin := time.Now() at the beginning, so I misunderstood it.
But it would be better to double-check the buckets. Also, it might be good to have a label for the different resolutions; 5m blocks and 1h blocks are different.

Contributor Author:
Love it - I was thinking the same. I was originally conflicted because the compaction duration histogram has no labels (such as compaction level), so I thought it might be better to conform to that standard. But a resolution label would make the metric more meaningful.

I'm also thinking of adding histograms for overall block upload and download times (in another PR). Having a resolution/compaction-level label there may be beneficial as well.

+	})

 	return m
 }
@@ -237,7 +243,7 @@ func downsampleBucket(
 		resolution = downsample.ResLevel2
 		errMsg = "downsampling to 60 min"
 	}
-	if err := processDownsampling(ctx, logger, bkt, m, dir, resolution, hashFunc); err != nil {
+	if err := processDownsampling(ctx, logger, bkt, m, dir, resolution, hashFunc, metrics); err != nil {
 		metrics.downsampleFailures.WithLabelValues(compact.DefaultGroupKey(m.Thanos)).Inc()
 		return errors.Wrap(err, errMsg)
 	}
@@ -309,7 +315,16 @@ func downsampleBucket(
 	return nil
 }

-func processDownsampling(ctx context.Context, logger log.Logger, bkt objstore.Bucket, m *metadata.Meta, dir string, resolution int64, hashFunc metadata.HashFunc) error {
+func processDownsampling(
+	ctx context.Context,
+	logger log.Logger,
+	bkt objstore.Bucket,
+	m *metadata.Meta,
+	dir string,
+	resolution int64,
+	hashFunc metadata.HashFunc,
+	metrics *DownsampleMetrics,
+) error {
 	begin := time.Now()
 	bdir := filepath.Join(dir, m.ULID.String())

@@ -346,6 +361,7 @@ func processDownsampling(ctx context.Context, logger log.Logger, bkt objstore.Bu

 	level.Info(logger).Log("msg", "downsampled block",
Contributor:
Might be good to initialize a variable for passed_time above this line so that we don't have to call time.Since(begin) multiple times.

Contributor Author:
Agreed.
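A minimal sketch of that refactor as it might look inside processDownsampling (the variable name is an assumption, not code from this commit):

```go
// Compute the elapsed time once and reuse it for both the log line
// and the histogram observation.
downsampleDuration := time.Since(begin)
level.Info(logger).Log("msg", "downsampled block",
	"from", m.ULID, "to", id, "duration", downsampleDuration, "duration_ms", downsampleDuration.Milliseconds())
metrics.downsampleDuration.Observe(downsampleDuration.Seconds())
```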

"from", m.ULID, "to", id, "duration", time.Since(begin), "duration_ms", time.Since(begin).Milliseconds())
metrics.downsampleDuration.Observe(time.Since(begin).Seconds())

 	if err := block.VerifyIndex(logger, filepath.Join(resdir, block.IndexFilename), m.MinTime, m.MaxTime); err != nil {
 		return errors.Wrap(err, "output block index not valid")