Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics for remaining planned compactions #3

Open
wants to merge 16 commits into
base: shuffle-sharding-compactor
Choose a base branch
from

Conversation

ac1214
Copy link
Owner

@ac1214 ac1214 commented Jul 12, 2021

Signed-off-by: Albert [email protected]

What this PR does:

Adds cortex_compactor_remaining_planned_compactions to track the remaining number of compactions to be completed by the cortex compactor.

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

ac1214 and others added 16 commits July 5, 2021 15:44
…ortexproject#4262)

* add MaxRetries to WaitInstanceState

Signed-off-by: Albert <[email protected]>

* update CHANGELOG.md

Signed-off-by: Albert <[email protected]>

* Add timeout for waiting on compactor to become ACTIVE in the ring.

Signed-off-by: Albert <[email protected]>

* add MaxRetries variable back to WaitInstanceState

Signed-off-by: Albert <[email protected]>

* Fix linting issues

Signed-off-by: Albert <[email protected]>

* Remove duplicate entry from changelog

Signed-off-by: Albert <[email protected]>

* Address PR comments and set timeout to be configurable

Signed-off-by: Albert <[email protected]>

* Address PR comments and fix tests

Signed-off-by: Albert <[email protected]>

* Update unit tests

Signed-off-by: Albert <[email protected]>

* Update changelog and fix linting

Signed-off-by: Albert <[email protected]>

* Fixed CHANGELOG entry order

Signed-off-by: Marco Pracucci <[email protected]>

Co-authored-by: Albert <[email protected]>
Co-authored-by: Marco Pracucci <[email protected]>
* MergeIterator: allocate less memory at first

We were allocating 24x the number of streams of batches, where each
batch holds up to 12 samples.

By allowing `c.batches` to reallocate when needed, we avoid the need
to pre-allocate enough memory for all possible scenarios.

* chunk_test: fix innacurate end time on chunks

The `through` time is supposed to be the last time in the chunk, and
having it one step higher was throwing off other tests and benchmarks.

* MergeIterator benchmark: add more realistic sizes

At 15-second scrape intervals a chunk covers 30 minutes, so 1,000 chunks
is about three weeks, a highly un-representative test.

Instant queries, such as those done by the ruler, will only fetch one
chunk from each ingester.

Signed-off-by: Bryan Boreham <[email protected]>
* Expose default configuration values for memberlist.

Set the defaults for various memberlist configuration values based on
the "Default LAN" configuration. The only result of this change is that
the defaults are now visible and are in the documentation. This also
means that if the default values change, then the changes are visible
in the documentation, where as before they would have gone unnoticed.

To prevent this being a breaking change, the existing behaviour is
retained, in case anyone is explicitly setting the values to zero and
expecting the default to be used.

Signed-off-by: Steve Simpson <[email protected]>

* Remove use of zero value as default value indicator.

Signed-off-by: Steve Simpson <[email protected]>

* Review comments.

Signed-off-by: Steve Simpson <[email protected]>

* Review comments.

Signed-off-by: Steve Simpson <[email protected]>
cortexproject#4342)

* Allow setting ring heartbeat timeout to zero to disable timeout check.

This change allows the various ring heartbeat timeouts to be configured
with zero, as a means of disabling the timeout. This is expected to be
used with a separate enhancement to allow disabling heartbeats. When the
heartbeat timeout is disabled, instances will always appear as healthy
in the ring.

Signed-off-by: Steve Simpson <[email protected]>

* Review comments.

Signed-off-by: Steve Simpson <[email protected]>
…time. (cortexproject#4317)

* Add a new config and metric for reporting ruler query execution wall time.

Signed-off-by: Tyler Reid <[email protected]>

* Spacing and PR number fixup

Signed-off-by: Tyler Reid <[email protected]>

* Wrap the defer in a function to make it defer after the return rather than after the if block. Add a unit test to validate we're tracking time correctly.

Signed-off-by: Tyler Reid <[email protected]>

* Use seconds for our duration rather than nanoseconds

Signed-off-by: Tyler Reid <[email protected]>

* Review comment fixes

Signed-off-by: Tyler Reid <[email protected]>

* Update config flag in the config docs

Signed-off-by: Tyler Reid <[email protected]>

* Pass counter rather than counter vector for metrics query function

Signed-off-by: Tyler Reid <[email protected]>

* Fix comment in MetricsQueryFunction

Signed-off-by: Tyler Reid <[email protected]>

* Move query metric and log to separate function. Add log message for ruler query time.

Signed-off-by: Tyler Reid <[email protected]>

* Update config file and change log to show this a per user metric

Signed-off-by: Tyler Reid <[email protected]>

* code review fixes

Signed-off-by: Tyler Reid <[email protected]>

* update log message for ruler query metrics

Signed-off-by: Tyler Reid <[email protected]>

* Remove append and just use the array for key values in the log messag

Signed-off-by: Tyler Reid <[email protected]>

* Add query-frontend component to front end log message

Signed-off-by: Tyler Reid <[email protected]>
I thought it would be good to put a security page into the docs, so that
it shows up in a search.

Content is just pointing at other resources.

Signed-off-by: Bryan Boreham <[email protected]>
…xproject#4345)

* Optimise memberlist kv store access by storing data unencoded.

The following profile data was taken from running 50 idle ingesters with
memberlist, with almost everything at default values (5s heartbeats):

```
52.16% mergeBytesValueForKey
+- 52.16% mergeValueForKey
   +- 47.84% computeNewValue
      +- 27.24% codec Proto Decode
      +- 26.25% mergeWithTime
```

It is apparent from the this that a lot of time is spent on the memberlist
receive path, as might be expected, specifically, the merging of the update
into the current state. The cost however is not in decoding the incoming
states (occurs in `mergeBytesValueForKey` before `mergeValueForKey`), but
in fact decoding _current state_ of the value in the store (as it is stored
encoded). The ring state was measured at 123K (50 ingesters), so it makes
sense that decoding could be costly.

This can be avoided by storing the value in it's decoded `Mergeable` form.
When doing this, care has to be taken to deep copy the value when
accessed, as it is modified in place before being updated in the store,
and accessed outside the store mutex.

Note a side effect of this change is that is no longer straightforward
to expose the `memberlist_kv_store_value_bytes` metric, as this reported
the size of the encoded data, therefore it has been removed.

Signed-off-by: Steve Simpson <[email protected]>

* Typo.

Signed-off-by: Steve Simpson <[email protected]>

* Review comments.

Signed-off-by: Steve Simpson <[email protected]>
…o. (cortexproject#4344)

* Allow disabling of ring heartbeats by setting relevant options to zero.

Signed-off-by: Steve Simpson <[email protected]>

* Review comments.

Signed-off-by: Steve Simpson <[email protected]>
…#4346)

* Expose configuration of memberlist packet compression.

Allows manually specifying whether memberlist should compress packets
via a new configuration flag: `-memberlist.enable-compression`.

This typically has little benefit for Cortex, as the ring state messages
are already compressed with Snappy, the second layer of compression does
not achieve any additional saving. It's not clear cut whether there
might still be some benefit for internal memberlist messages; this needs
to be evaluated in a environment of some reasonable scale.

Signed-off-by: Steve Simpson <[email protected]>

* Review comments.

Signed-off-by: Steve Simpson <[email protected]>

* Review comments.

Signed-off-by: Steve Simpson <[email protected]>
…exproject#4348)

It was only waiting one second for the second sync to complete, which is
probably too harsh a deadline than necessary for overloaded systems.

Signed-off-by: Steve Simpson <[email protected]>
…xproject#4349)

The test is writing a single silence and  checking a metric which
indicates whether replicating the silence has been attempted yet.
This is so we can check later on that no replication activity
occurs. The assertions later on in the test are passing, but the
first one is not, indicating that the replication doesn't trigger
early enough. This makes sense because the replication is not
synchronous with the writing of the silence.

Signed-off-by: Steve Simpson <[email protected]>
)

* Add proposal document

Signed-off-by: Gofman <[email protected]>
Signed-off-by: ilangofman <[email protected]>

* Minor text modifications

Signed-off-by: ilangofman <[email protected]>

* Implement requested changes to the proposal

Signed-off-by: ilangofman <[email protected]>

* Fix mention of Compactor instead of purger in proposal

Signed-off-by: ilangofman <[email protected]>

* Fixed wording and spelling in proposal

Signed-off-by: ilangofman <[email protected]>

* Update the cache invalidation method

Signed-off-by: ilangofman <[email protected]>

* Fix wording on cache invalidation section

Signed-off-by: ilangofman <[email protected]>

* Minor wording additions

Signed-off-by: ilangofman <[email protected]>

* Remove white-noise from text

Signed-off-by: ilangofman <[email protected]>

* Remove the deleting state and change cache invalidation

Signed-off-by: ilangofman <[email protected]>

* Add deleted state and update cache invalidation

Signed-off-by: ilangofman <[email protected]>

* Add one word to clear things up

Signed-off-by: ilangofman <[email protected]>

* update api limits section

Signed-off-by: ilangofman <[email protected]>

* ran clean white noise

Signed-off-by: ilangofman <[email protected]>
Signed-off-by: Albert <[email protected]>
@alvinlin123 alvinlin123 force-pushed the shuffle-sharding-compactor branch from db0595b to 8e78a51 Compare January 14, 2022 22:03
roystchiang pushed a commit that referenced this pull request Apr 6, 2022
Added workflow for building prerelease images
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants