Create v2-1.md #1848

Merged
merged 12 commits into from
May 17, 2022
39 changes: 39 additions & 0 deletions docs/sources/release-notes/v2.1.md
@@ -0,0 +1,39 @@
---
title: "Grafana Mimir version 2.1 release notes"
menuTitle: "V2.1 release notes"
description: "Release notes for Grafana Mimir version 2.1"
weight: 200
---

# Grafana Mimir version 2.1 release notes

Grafana Labs is excited to announce version 2.1 of Grafana Mimir, the most scalable, most performant open source time series database in the world.

Below we highlight the top features, enhancements and bugfixes in this release, as well as relevant callouts for those upgrading from Grafana Mimir 2.0. The complete list of changes is recorded in the [Changelog](https://github.com/grafana/mimir/blob/main/CHANGELOG.md).

## Features and enhancements

- **Mimir on ARM**: We now publish Docker images for both `amd64` and `arm64`, making it easier for those on ARM-based machines to develop and run Mimir. Multiplatform images are available from the [Mimir Docker registry](https://hub.docker.com/r/grafana/mimir). Note that our existing integration test suite only uses the `amd64` images, which means we cannot make any functional or performance guarantees about the `arm64` images.

- **`Remote` ruler mode for improved rule evaluation performance**: We've added a `remote` mode for the Grafana Mimir ruler, in which the ruler delegates rule evaluation to the [query-frontend]({{< relref "../operators-guide/architecture/components/query-frontend/index.md" >}}) rather than evaluating rules directly within the ruler process itself. This allows recording and alerting rules to benefit from the query parallelization techniques implemented in the query-frontend (like query sharding). `Remote` mode is considered experimental and is off by default. To enable it, see [remote ruler]({{< relref "../operators-guide/architecture/components/ruler/#remote" >}}).

- **Per-tenant custom trackers for monitoring cardinality**: In Grafana Mimir 2.0, we introduced a [custom tracker feature]({{< relref "../operators-guide/configuring/configuring-custom-trackers.md" >}}) that allows you to track the count of active series over time that match a specific label matcher. In Grafana Mimir 2.1, we've made it possible to configure custom trackers via the [runtime configuration file]({{< relref "../operators-guide/configuring/about-runtime-configuration.md" >}}). This means you can now define different trackers for each tenant in your cluster and modify those trackers without an ingester restart.

- **Reduce cardinality of Grafana Mimir's `/metrics` endpoint**: While Grafana Mimir does a good job of exposing a relatively small number of series about its own state, this number can tick up when running Grafana Mimir clusters with high tenant counts or high active series counts. To reduce this number (and the accompanying cost of scraping and storing these time series), we've made [several optimizations](https://github.com/grafana/mimir/issues/1750). These improvements reduce series count on the `/metrics` endpoint by more than 10%.
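
To make the per-tenant custom trackers item above more concrete, here is a conceptual sketch of what a tracker computes: the number of active series matching a label matcher, with a different set of trackers per tenant. This is not Mimir's implementation or its configuration format. The tenant names, tracker names, and the simplified `(label, regex)` matcher representation are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical per-tenant tracker definitions: tracker name -> (label, regex).
# Assumption: each matcher is simplified to one fully anchored regex on a
# single label; real trackers are defined with label matchers in Mimir's
# runtime configuration file.
TRACKERS = {
    "tenant-a": {
        "dev": ("namespace", r"dev-.*"),
        "prod": ("namespace", r"prod-.*"),
    },
    "tenant-b": {
        "api": ("job", r"api"),
    },
}


def count_active_series(tenant, active_series, trackers=TRACKERS):
    """Return {tracker name: count of active series matching its matcher}."""
    tenant_trackers = trackers.get(tenant, {})
    counts = {name: 0 for name in tenant_trackers}
    for labels in active_series:
        for name, (label, pattern) in tenant_trackers.items():
            # A series without the label is treated as having an empty value.
            if re.fullmatch(pattern, labels.get(label, "")):
                counts[name] += 1
    return counts


# Three active series, each represented as a dict of labels.
series = [
    {"__name__": "up", "namespace": "dev-1"},
    {"__name__": "up", "namespace": "dev-2"},
    {"__name__": "up", "namespace": "prod-1"},
]

print(count_active_series("tenant-a", series))
```

Because the tracker definitions are looked up per tenant on every call, replacing the `TRACKERS` mapping changes the result without restarting anything, which loosely mirrors the benefit of moving tracker definitions into the runtime configuration.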

## Upgrade considerations

We've updated the default values for 2 parameters in Grafana Mimir to give users better out-of-the-box performance:

- We've changed the default for `-blocks-storage.tsdb.isolation-enabled` from `true` to `false`. We've marked this flag as deprecated and will remove it, setting the value permanently to `false`, in 2 releases. Our decision to do this came from our experience running our [1 billion series load test](https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/#prometheus-tsdb-enhancements), where we saw that disabling this setting reduced ingester 99th percentile latency by 90%.
Member

Perhaps we should point out that the TSDB isolation feature (within a single ingester) doesn't bring any benefit in our architecture, where a single push request is distributed to many ingesters. Mimir didn't provide isolation guarantees even with this option enabled.

Contributor Author

We sort of mention that in the linked blog post already:

> TSDB isolation is a feature that wasn’t used in Mimir, due to its distributed architecture, but was introducing a significant negative impact on write latency caused by a high lock contention on TSDB isolation lock.

For the purposes of keeping the release notes concise, I'd rather keep it as is and, if anything, just add a bit more detail to the blog (we can ask the content team to edit the blog content even though it's already been published). Sounds like you're saying that not only does TSDB isolation not provide any benefits due to our distributed architecture, but it actually doesn't really even do what it says it does?

Contributor Author

If you can give me another sentence you'd like to add to the existing blog post to clarify this, happy to get it in:
https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/#prometheus-tsdb-enhancements

Member

What about something like this:

Suggested change
- We've changed the default for `-blocks-storage.tsdb.isolation-enabled` from `true` to `false`. We've marked this flag as deprecated and will remove it, setting the value permanently to `false`, in 2 releases. Our decision to do this came from our experience running our [1 billion series load test](https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/#prometheus-tsdb-enhancements), where we saw that disabling this setting reduced ingester 99th percentile latency by 90%.
- We've changed the default for `-blocks-storage.tsdb.isolation-enabled` from `true` to `false`. We've marked this flag as deprecated and will remove it, setting the value permanently to `false`, in 2 releases. Our decision to do this came from our experience running our [1 billion series load test](https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/#prometheus-tsdb-enhancements), where we saw that disabling this setting reduced ingester 99th percentile latency by 90%. Note that due to Mimir's architecture, Mimir doesn't benefit from the TSDB isolation feature, so disabling it is a net win for Mimir.

Member

Oh, I wouldn't add anything to the blog post, just to release notes. But if you don't think it's necessary, that's fine. You're correct that it's explained in the blog post already.

Contributor Author

I tried again! lemme know what you think.


- The store-gateway attributes cache is now enabled by default (achieved by updating the default for `-blocks-storage.bucket-store.chunks-cache.attributes-in-memory-max-items` from `0` to `50000`). This in-memory cache makes it faster to look up object attributes for chunk data. We've been running this optional cache internally for a while and, after a recent configuration audit, realized it made sense to do the same for all users. The increase in store-gateway memory utilization from enabling this cache is negligible and easily justified given the performance gains.
pracucci marked this conversation as resolved.

## Bug fixes

### 2.1.0 bug fixes

- [PR 1704](https://github.com/grafana/mimir/pull/1704): Fixed a bug that previously caused Grafana Mimir to crash on startup when trying to run in monolithic mode with the results cache enabled due to duplicate metric names.
- [PR 1835](https://github.com/grafana/mimir/pull/1835): Fixed a bug that caused Grafana Mimir to crash when an invalid Alertmanager configuration was set even though the Alertmanager component was disabled. After this fix, the Alertmanager configuration is only validated if the Alertmanager component is loaded.
- [PR 1836](https://github.com/grafana/mimir/pull/1836): The ability to run Alertmanager with `local` storage broke in Grafana Mimir 2.0 when we removed the ability to run the Alertmanager without sharding. With this bugfix, we've made it possible to again run Alertmanager with `local` storage. However, for production use, we still recommend using an external store since this is needed to persist Alertmanager state (e.g. silences) between replicas.
- [PR 1715](https://github.com/grafana/mimir/pull/1715): Restored Grafana Mimir's ability to use CNAME DNS records to reach memcached servers. The bug was inherited from an upstream change to Thanos; we contributed a fix to Thanos and subsequently updated our Thanos version.