Introduce ingester instance limits to configuration, and add alerts. #296

pstibrany · 2021-04-22T08:01:18Z

What this PR does: This PR adds configuration for ingester instance limits. It also adds alerts that fire when an ingester gets close to its max series or max tenants limit (70% = warning, 80% = critical).
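To make the thresholds concrete, here is a minimal sketch of the classification the alerts express. It is not the alert rule added by this PR. The function name, the strict `>` comparison, and the treatment of a non-positive limit as "no limit configured" are assumptions made for illustration; only the 70% and 80% values come from the description.

```python
from typing import Optional

# Thresholds taken from the PR description.
WARNING_RATIO = 0.70
CRITICAL_RATIO = 0.80


def limit_severity(current: int, limit: int) -> Optional[str]:
    """Return "critical", "warning", or None for a usage/limit pair.

    Hypothetical helper: `current` is the number of in-memory series
    (or tenants) on one ingester, `limit` is the configured instance limit.
    """
    # Assumption: a non-positive limit means no limit is configured,
    # so there is nothing to alert on.
    if limit <= 0:
        return None
    ratio = current / limit
    # Assumption: thresholds are exclusive (strictly greater than).
    if ratio > CRITICAL_RATIO:
        return "critical"
    if ratio > WARNING_RATIO:
        return "warning"
    return None
```

For example, an ingester holding 750,000 series against a 1,000,000 limit falls in the warning band, and one holding 850,000 falls in the critical band.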

Checklist

CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

pracucci

Good job, thanks!

…rafana/cortex-jsonnet#296) * Introduce ingester instance limits to configuration, and add alerts. * CHANGELOG.md * Address (internal) review feedback.

@osg-grafana

* Increased CortexAllocatingTooMuchMemory alert threshold Signed-off-by: Marco Pracucci <[email protected]> * Add alert for etcd memory limits close Signed-off-by: Goutham Veeramachaneni <[email protected]> * the distributor now supports push via GRPC (grafana/cortex-jsonnet#266) Signed-off-by: Mauro Stettler <[email protected]> * Fixed CortexQuerierHighRefetchRate alert Signed-off-by: Marco Pracucci <[email protected]> * Fixed label matcher Signed-off-by: Marco Pracucci <[email protected]> * Sort legend descending in the CPU/memory panels Signed-off-by: Marco Pracucci <[email protected]> * Add slow queries dashboard Signed-off-by: Marco Pracucci <[email protected]> * Added tenant ID field to the table Signed-off-by: Marco Pracucci <[email protected]> * Add recording rules to calculate Cortex scaling - Update dashboard so it only shows under provisioned services and why - Add sizing rules based on limits. - Add some docs to the dashboard. Signed-off-by: Tom Wilkie <[email protected]> * Increased CortexRequestErrors alert severity Signed-off-by: Marco Pracucci <[email protected]> * Fixed "Disk Writes" and "Disk Reads" panels Signed-off-by: Marco Pracucci <[email protected]> * Pre-compute aggregations to optimize scaling recording rules Signed-off-by: Marco Pracucci <[email protected]> * Removed 5m step from subquery Signed-off-by: Marco Pracucci <[email protected]> * Add function to customize compactor statefulset Signed-off-by: Marco Pracucci <[email protected]> * Use the job name in compactor alerts too Signed-off-by: Marco Pracucci <[email protected]> * Fixed CortexCompactorRunFailed threshold Signed-off-by: Marco Pracucci <[email protected]> * Added Cortex Rollout progress dashboard Signed-off-by: Marco Pracucci <[email protected]> * Fix 'Unhealthy pods' in Cortex Rollout dashboard Signed-off-by: Marco Pracucci <[email protected]> * Simplify compactor alerts We should simply alert on things not having run since X. 
Signed-off-by: Goutham Veeramachaneni <[email protected]> * Use the right metric Signed-off-by: Goutham Veeramachaneni <[email protected]> * Apply suggestions from code review Co-authored-by: Marco Pracucci <[email protected]> Signed-off-by: Goutham Veeramachaneni <[email protected]> * Fix CortexCompactorHasNotSuccessfullyRunCompaction to avoid false positives Signed-off-by: Marco Pracucci <[email protected]> * Introduce ingester instance limits to configuration, and add alerts. (grafana/cortex-jsonnet#296) * Introduce ingester instance limits to configuration, and add alerts. * CHANGELOG.md * Address (internal) review feedback. * Improve CortexRulerFailedRingCheck alert Signed-off-by: Marco Pracucci <[email protected]> * Added example Loki query to CortexTenantHasPartialBlocks playbook Signed-off-by: Marco Pracucci <[email protected]> * Default dashboards to Cortex blocks storage only Signed-off-by: Marco Pracucci <[email protected]> * Add missing memberlist components to alerts This adds the admin-api, compactor and store-gateway components to the memberlist alert. Signed-off-by: Christian Simon <[email protected]> * mixin: Add gateway to valid job names (for GEM) * Only show namespaces from selected cluster. "All" works thanks to using regex matcher. (grafana/cortex-jsonnet#311) * Only show namespaces from selected cluster. "All" works thanks to using regex matcher. * CHANGELOG.md * Fixed CortexIngesterHasNotShippedBlocks alert false positive Signed-off-by: Marco Pracucci <[email protected]> * Fixed mixin linter Signed-off-by: Marco Pracucci <[email protected]> * Add placeholders to make the linter pass Signed-off-by: Marco Pracucci <[email protected]> * cortex-mixin: Use kube_pod_container_resource_{requests,limits} metrics This updates the recording rules to make them compatible with kube-state-metrics v2.0.0 which introduces some breaking changes in some metric names. 
With kube-state-metrics v2.0.0: - `kube_pod_container_resource_requests_cpu_cores` becomes `kube_pod_container_resource_requests{resource="cpu"}` - `kube_pod_container_resource_requests_memory_bytes` becomes `kube_pod_container_resource_requests{resource="memory"}` * cortex-mixin: Make the recording rules backwards compatible * refactor: functions to reduce code duplication - improve overrideability - making more use of `per_instance_label` from _config - added containerNetworkPanel functions for dashboards to use * fix: lint * refactor: config for job aggregation strings - to make it easier to override, define "cluster_namespace_job" in $._config as `job_aggregation_prefix`. - added some `job_aggregation_labels_*` as well The resulting output does not change (unless config is overridden). * lint * Update cortex-mixin/dashboards/writes.libsonnet simplify mapping by extending $._config Co-authored-by: Marco Pracucci <[email protected]> * fix: syntax * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * refactor: added a group_config defines group-related strings based off of array-based parameters in _config. deprecated _config.alert_aggregation_labels with a std.trace warning, while maintaining (temporary?) backward compatibility. * Lower CortexIngesterRestarts severity Signed-off-by: Marco Pracucci <[email protected]> * feature: add some text boxes and descriptions Focussing on the reads and writes dashboards, added some info panels and hover-over descriptions for some of the panels. 
Some common code used by the compactor also received additional text content. New functions: - addRows - addRowsIf ...to add a list of rows to a dashboard. The `thanosMemcachedCache` function has had some of its query text sprawled out for easier reading and comparison with similar dashboard queries. * fix: text replacements, repair addRows * Changing copy to add 'latency' as well. * Cut down on text from initial PR. Tucked existing text from the compactor dashboard under tooltips, rather than making them text boxes. * Getting rid of a few space/comma errors. * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/compactor.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * fix: formatting - limit to 4 panels per row * fmt * fix: remove accidental line * Update cortex-mixin/dashboards/dashboard-utils.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update 
cortex-mixin/dashboards/writes.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * Update cortex-mixin/dashboards/reads.libsonnet Co-authored-by: Ursula Kallio <[email protected]> * fix: Requests per second * fix: text * Apply suggestions from code review as per @osg-grafana Co-authored-by: Ursula Kallio <[email protected]> * fix: clarity * Apply suggestions from code review as per @osg-grafana Co-authored-by: Ursula Kallio <[email protected]> * Add a simple playbook for ingester series limit alert. Signed-off-by: Callum Styan <[email protected]> * Add cortex-gw-internal to watched gateway metrics (grafana/cortex-jsonnet#328) * Add cortex-gw-internal to watched gateway metrics * Update CHANGELOG.md Co-authored-by: Marco Pracucci <[email protected]> * fix: query formatting to aid in merge * fix: query formatting to aid in merge * fix: consistent labelling * fix: ensure panel titles are consistent - Most existing "per second" panel titles in `main` are written "/ sec", corrected recent commits to match. 
* Improved CortexIngesterReachingSeriesLimit playbook and added CortexIngesterReachingTenantsLimit playbook Signed-off-by: Marco Pracucci <[email protected]> * Better formatting for ingester_instance_limits+ example Signed-off-by: Marco Pracucci <[email protected]> * Clarify which alerts apply to chunks storage only Signed-off-by: Marco Pracucci <[email protected]> * Improve compactor alerts and playbooks Signed-off-by: Marco Pracucci <[email protected]> * Addressed review comments Signed-off-by: Marco Pracucci <[email protected]> * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> * Fixed and improved runtime config alerts and playbooks Signed-off-by: Marco Pracucci <[email protected]> * fix: resolve review feedback * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> * Update cortex-mixin/docs/playbooks.md Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> * MarkCortexTableSyncFailure and CortexOldChunkInMemory alerts as chunks storage only Signed-off-by: Marco Pracucci <[email protected]> * Fixed whitespace noise Signed-off-by: Marco Pracucci <[email protected]> * refactor: resources dashboard comtainer functions added: - containerDiskWritesPanel - containerDiskReadsPanel - containerDiskSpaceUtilization * revert: matching spacing format of main * lint: white noise * Add playbook for CortexRequestErrors and config option to exclude specific routes Signed-off-by: Marco Pracucci <[email protected]> * Change min-step to 15s to show better detail. $__rate_interval will be floored at 4x this quantity, so 15s lets us see faster transients than the previous value of 1m. 
Signed-off-by: Bryan Boreham <[email protected]> * Added playbook for CortexFrontendQueriesStuck and CortexSchedulerQueriesStuck Signed-off-by: Marco Pracucci <[email protected]> * Remove CortexQuerierCapacityFull alert Signed-off-by: Marco Pracucci <[email protected]> * Added playbook for CortexProvisioningTooManyWrites Signed-off-by: Marco Pracucci <[email protected]> * Added playbook for CortexAllocatingTooMuchMemory Signed-off-by: Marco Pracucci <[email protected]> * Address review feedback Signed-off-by: Marco Pracucci <[email protected]> * Replaced CortexCacheRequestErrors with CortexMemcachedRequestErrors Signed-off-by: Marco Pracucci <[email protected]> * Replace ruler alerts, and add playbooks. * Addressed review comments Signed-off-by: Marco Pracucci <[email protected]> * Fix white space. * Better alert messages. * Improve CortexIngesterReachingSeriesLimit playbook Signed-off-by: Marco Pracucci <[email protected]> * Add playbook for CortexProvisioningTooManyActiveSeries Signed-off-by: Marco Pracucci <[email protected]> * Improve messaging. * Fixed formatting Signed-off-by: Marco Pracucci <[email protected]> * Improved alert messages with Cortex cluster Signed-off-by: Marco Pracucci <[email protected]> * Improved CortexRequestLatency playbook Signed-off-by: Marco Pracucci <[email protected]> * Added 'Per route p99 latency' to ruler configuration API Signed-off-by: Marco Pracucci <[email protected]> * Addressed review comments Signed-off-by: Marco Pracucci <[email protected]> * Aded object storage metrics for Ruler and Alertmanager Signed-off-by: Marco Pracucci <[email protected]> * Add playbook entry for CortexGossipMembersMismatch. * Clarify data loss related to 'not healthy index found' issue Signed-off-by: Marco Pracucci <[email protected]> * Review comments. 
* Improve CortexIngesterReachingSeriesLimit playbook Signed-off-by: Marco Pracucci <[email protected]> * Increased CortexIngesterReachingSeriesLimit critical alert threshold from 80% to 85% Signed-off-by: Marco Pracucci <[email protected]> * Increase CortexIngesterReachingSeriesLimit warning `for` duration As it turns out, during normal shuffle-sharding operation, the 70% mark is often exceeded, but not by much. Rather than increasing the threshold to 75%, this commit increases the `for` duration to 3h, following the thought that we want this alert to fire if ingesters are constantly above the threshold even after stale series are flushed (which occurs every 2h, when the TSDB head is compacted). We flush series with a timestamp between [-3h, -1h] after the last compaction, so the worst case scenario is that it takes 3h to flush a stale series. Signed-off-by: beorn7 <[email protected]> * Fix scaling dashboard to work on multi-zone ingesters Signed-off-by: Marco Pracucci <[email protected]> * Simplified cluster_namespace_deployment:actual_replicas:count recording rule Signed-off-by: Marco Pracucci <[email protected]> * Added a comment to explain '.*?' Signed-off-by: Marco Pracucci <[email protected]> * Fix rollout dashboard to work with multi-zone deployments Signed-off-by: Marco Pracucci <[email protected]> * Fixed legends Signed-off-by: Marco Pracucci <[email protected]> * Extend Alertmanager dashboard with currently unused metrics. 
Metrics for general operation: - Added "Tenants" stat panel using: `cortex_alertmanager_tenants_discovered` - Added "Tenant Configuration Sync" row using: `cortex_alertmanager_sync_configs_failed_total` `cortex_alertmanager_sync_configs_total` `cortex_alertmanager_ring_check_errors_total` Metrics specific to sharding operation: - Added "Sharding Initial State Sync" row using: `cortex_alertmanager_state_initial_sync_completed_total` `cortex_alertmanager_state_initial_sync_completed_total` `cortex_alertmanager_state_initial_sync_duration_seconds` - Added "Sharding State Operations" row using: `cortex_alertmanager_state_fetch_replica_state_total` `cortex_alertmanager_state_fetch_replica_state_failed_total` `cortex_alertmanager_state_replication_total` `cortex_alertmanager_state_replication_failed_total` `cortex_alertmanager_partial_state_merges_total` `cortex_alertmanager_partial_state_merges_failed_total` `cortex_alertmanager_state_persist_total` `cortex_alertmanager_state_persist_failed_total` * Review comments + fix latency panel. * Review comments. 
* Clarify the gsutil mv command for moving corrupted blocks Signed-off-by: Tyler Reid <[email protected]> * Modify log message to fit example command Signed-off-by: Tyler Reid <[email protected]> * Update grafana-builder from Mar 2019 to Feb 2021 Brings in the following changes: - Use default as a picker value for datasource variable grafana/jsonnet-libs#204 - allow table link in new tab grafana/jsonnet-libs#238 - allow setting a default datasource grafana/jsonnet-libs#301 - Add textPanel grafana/jsonnet-libs#341 - make status code label name overrideable in qpsPanel grafana/jsonnet-libs#397 - use $__rate_interval over $__interval grafana/jsonnet-libs#401 - Set shared tooltip to false by default grafana/jsonnet-libs#458 - Use custom 'all' value to avoid massive regexes in queries. grafana/jsonnet-libs#469 https://github.com/grafana/jsonnet-libs/commits/master/grafana-builder/ * Match query-frontend/query-scheduler/querier custom deployments by default Signed-off-by: Marco Pracucci <[email protected]> * Create playbooks for sharded alertmanager * Add new alerts for alertmanager sharding mode of operation. * fix(rules): upstream recording rule switched to sum_irate ref: kubernetes-monitoring/kubernetes-mixin#619 * Fix CortexIngesterReachingSeriesLimit playbook Signed-off-by: Arve Knudsen <[email protected]> * feat: Allow configuration of ring members in gossip alerts Signed-off-by: Jack Baldry <[email protected]> * fix: Add store-gateway and compactor ring_members Also re-order names for readability. 
Signed-off-by: Jack Baldry <[email protected]> * fix: Match all ingester workloads and avoid matching the cortex-gateway Signed-off-by: Jack Baldry <[email protected]> * feat: Optionally allow use of array or string to configure ring members Signed-off-by: Jack Baldry <[email protected]> * address review feedback Signed-off-by: Jack Baldry <[email protected]> * fix: Correct ingester and querier regexps Signed-off-by: Jack Baldry <[email protected]> * Fixes to initial state sync panels on alertmanager dashboard. 1) Change minimal interval to 1m for sync duration and fetch state panels. This is in order to show infrequent events at smaller time windows. 2) Change syncs/sec panel to reflect absolute value of metric not rate. The initial sync only occurs once per-tenant so the counter value is essentially 0 or 1. Due to how per-tenant metrics are aggregated, the external facing metric really acts more like a gauge reflecting the number of tenants which achieved each outcome. Also, stack this panel as it becomes easier to visually see when the initial syncs have completed for all tenants (e.g. during a rollout). * Add rate back to Alertmanager dashboard initial syncs panel. The metric in fact does act like a counter due to soft deletion of the per-user registry when the user is unconfigured (e.g. moved to another instance or configuration deleted). * Make the overrides metric name configurable. We (Grafana Labs) are about to put in a new system to control and export data about limits and we'll need to use a different name. This shouldn't affect our OSS users. Signed-off-by: Goutham Veeramachaneni <[email protected]> * Improve Cortex / Queries dashboard Signed-off-by: Marco Pracucci <[email protected]> * Add recording rules for speeding up Alertmanager dashboard. With large numbers of tenants the queries for some panels on thos dashboard can become quite slow as the metrics exposed are per-tenant. * Fixes from testing. * Move rules to their own group. 
* Split `cortex_api` recording rule group into three groups. This is a workaround for large clusters where this group can become slow to evaluate. * Update gsutil installation playbook Signed-off-by: Marco Pracucci <[email protected]> * Use `$._config.job_names.gateway` in resources dashboards. This fixes panels where `cortex-gw` was hardcoded. * Fine tune CortexIngesterReachingSeriesLimit alert Signed-off-by: Marco Pracucci <[email protected]> * Add CortexRolloutStuck alert Signed-off-by: Marco Pracucci <[email protected]> * Fixed playbook Signed-off-by: Marco Pracucci <[email protected]> * Added CortexFailingToTalkToConsul alert Signed-off-by: Marco Pracucci <[email protected]> * Fixed alert message Signed-off-by: Marco Pracucci <[email protected]> * Update alert to be generic to KV stores Signed-off-by: Marco Pracucci <[email protected]> * Add README * Add mimir-mixin CI checks * Update build image * Move to operations folder * Add missing zip to build-image * Run prettifier on playbooks.md * Update build-image Co-authored-by: Marco Pracucci <[email protected]> Co-authored-by: Goutham Veeramachaneni <[email protected]> Co-authored-by: Mauro Stettler <[email protected]> Co-authored-by: Tom Wilkie <[email protected]> Co-authored-by: Tom Wilkie <[email protected]> Co-authored-by: Goutham Veeramachaneni <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> Co-authored-by: Alex Martin <[email protected]> Co-authored-by: Javier Palomo <[email protected]> Co-authored-by: Darren Janeczek <[email protected]> Co-authored-by: Darren Janeczek <[email protected]> Co-authored-by: Jennifer Villa <[email protected]> Co-authored-by: Ursula Kallio <[email protected]> Co-authored-by: Callum Styan <[email protected]> Co-authored-by: Johanna Ratliff <[email protected]> Co-authored-by: Bryan Boreham <[email protected]> Co-authored-by: Steve Simpson <[email protected]> Co-authored-by: beorn7 <[email protected]> Co-authored-by: Tyler Reid <[email protected]> 
Co-authored-by: George Robinson <[email protected]> Co-authored-by: Duologic <[email protected]> Co-authored-by: Arve Knudsen <[email protected]> Co-authored-by: Jack Baldry <[email protected]>

* Added mega_user class Signed-off-by: Marco Pracucci <[email protected]> * Fine-tune blocks storage config Signed-off-by: Marco Pracucci <[email protected]> * Disable tests by default to fix README instructions Ref grafana/cortex-jsonnet#95 * Run store-gateway without CPU limits Signed-off-by: Marco Pracucci <[email protected]> * Use v1 API for Deployment and StatefulSet resources * Version bump to v1.1.0 * Actually include the ruler * Update config option name * Added ruler_enabled and alertmanager_enabled flags. (grafana/cortex-jsonnet#116) * Added publish not ready addresses Signed-off-by: Joe Elliott <[email protected]> * Removed -experimental.tsdb.store-gateway-enabled flag Signed-off-by: Marco Pracucci <[email protected]> * Added a discovery svc and pointed the querier service at itself Signed-off-by: Joe Elliott <[email protected]> * lint Signed-off-by: Joe Elliott <[email protected]> * Added PodDisruptionBudget for store-gateway Signed-off-by: Marco Pracucci <[email protected]> * Allow to configure the blocks replication factor Signed-off-by: Marco Pracucci <[email protected]> * Switch store-gateway StatefulSets to Parallel Pod Management Signed-off-by: Marco Pracucci <[email protected]> * Ruler should use metadata cache as well, if configured. (grafana/cortex-jsonnet#128) Ruler instantiates querier internally, so it can use metadata cache. * Allow to customize ingester disk size and class Signed-off-by: Marco Pracucci <[email protected]> * Version bump to 1.2.0 * refactor: use jaeger-agent-mixin lib got moved: grafana/jsonnet-libs#291 used jb-0.4.0 which updates the jsonnetfile.json format * Switch blocks storage ingesters to Parallel pod management policy and 4d retention Signed-off-by: Marco Pracucci <[email protected]> * Fixed comment Signed-off-by: Marco Pracucci <[email protected]> * Chunks blocks migration (grafana/cortex-jsonnet#148) * Allow configuring querier with second store engine. 
* Introduced newIngesterStatefulSet and newIngesterPdb functions. * Rename parameters to be more clear. * refactor(cortex): use first class citizens for: * requiredDuringSchedulingIgnoredDuringExecutionType * portsType These are available from: https://github.com/jsonnet-libs/k8s-alpha * Update blocks storage CLI flags Signed-off-by: Marco Pracucci <[email protected]> * Do not apply blocks storage config to query-frontend, table-manager and purger Signed-off-by: Marco Pracucci <[email protected]> * Cleaned up blocks storage config Signed-off-by: Marco Pracucci <[email protected]> * Apply chunks-store config if primary or secondary store use chunks. (grafana/cortex-jsonnet#160) * Enable table manager when using chunks storage as secondary storage engine for querier. (grafana/cortex-jsonnet#161) * fix(ksonnet): backwards compatibility with ksonnet * add overrides config to tsdb store-gateway * Add jsonnet for ingester StatefulSet with WAL (grafana/cortex-jsonnet#72) * Add jsonnet for ingester StatefulSet with WAL Signed-off-by: Ganesh Vernekar <[email protected]> * Add CHANGELOG entry Signed-off-by: Ganesh Vernekar <[email protected]> * Fix lint Signed-off-by: Ganesh Vernekar <[email protected]> * Fix review comments Signed-off-by: Ganesh Vernekar <[email protected]> * Change max query length to 32 days To allow for comparision over months of 31d Signed-off-by: Goutham Veeramachaneni <[email protected]> * Fix ruler S3 config option (grafana/cortex-jsonnet#174) * Removed -experimental.tsdb.store-gateway-enabled flag Signed-off-by: Marco Pracucci <[email protected]> * Use correct config variable for s3 ruler config * restore dropped line Co-authored-by: Marco Pracucci <[email protected]> * Add support for local ruler_client_type (grafana/cortex-jsonnet#175) * Support Alertmanager HA With this, we can now support increasing the number of replicas for a Cortex AM thus enabling HA. Please note that Alerts themselves are not gossiped between Alertmanagers. 
Each Ruler needs to send the alert to every Alertmanager available thus the reason why a headless service gets created when the number of replicas is more than 1. * Setup the gossip port * s/isGossiping/isHa * Bump to 3 replicas by default * Bump the cortex image, the latest stable is 1.3 * Fix typo in Alertmanager configuration * Alertmanager configuration tweaks - Introduces the `fallback_config` option to allow an Alertmanager to have a fallback config. - Given the headless service a different name to allow seamless switching between 1 or multiple replicas. The cluster field in the service metadata is immutable which made it impossible to create the new service unless you delete the previous one. * Remove different name for a headless service Sadly, we can't have a different name for the headless service as the statefulset is configured to match its name. * Fix ruler s3 storage configuration * Block storage support for s3 * Added Azure support to blocks storage Signed-off-by: Marco Pracucci <[email protected]> * Fixed linter Signed-off-by: Marco Pracucci <[email protected]> * Removed the experimental prefix from blocks storage CLI flags Signed-off-by: Marco Pracucci <[email protected]> * Lower default ingestion limits and create a new overrides user * Address review feedback * Bump default series limit by 50% * Add flusher job for blocks. * Fixed Azure account name/key config Signed-off-by: Marco Pracucci <[email protected]> * Rename changed flags for 1.4 release. Signed-off-by: Goutham Veeramachaneni <[email protected]> * Make sure only a single ruler rolls out at a time Signed-off-by: Goutham Veeramachaneni <[email protected]> * Cut 1.4.0 Signed-off-by: Marco Pracucci <[email protected]> * Add overrides exporter Overrides exporter part of grafana/cortex-tools and exposes runtime overrides and related presets of Cortex as metrics. 
Signed-off-by: Christian Simon <[email protected]> * Refactor limits and overrides Ensure we expose 'extra_small_user' and reference it setting the "default" values. This will raise the limits of the 'small_user' preset to the defaults for `ingester.max-samples-per-query` and `ingester.max-series-per-query`. Signed-off-by: Christian Simon <[email protected]> * Removed support for ingester.statefulset_replicas Signed-off-by: Marco Pracucci <[email protected]> * Switch compactor statefulset to Parallel pod management policy Signed-off-by: Marco Pracucci <[email protected]> * Cut 1.5.0 release Signed-off-by: Marco Pracucci <[email protected]> * Add ruler limits Sets default presets for for all the 'users' when it comes to ruler limits. * Add for the last user * Enabled compactor sharding Signed-off-by: Marco Pracucci <[email protected]> * Rollback PR 213 Signed-off-by: Marco Pracucci <[email protected]> * Re-introduce ruler limits Signed-off-by: Marco Pracucci <[email protected]> * [fixup] ruler limits config key name Ruler limits have a prefix of `ruler_` on the config key name. This makes the key match and then uses them as the value for the flags. 
* Removed postings-compression-enabled Signed-off-by: Marco Pracucci <[email protected]> * Fine-tuned gRPC keepalive pings settings Signed-off-by: Marco Pracucci <[email protected]> * Fixed gRPC settings Signed-off-by: Marco Pracucci <[email protected]> * Release 1.6.0 Signed-off-by: Marco Pracucci <[email protected]> * Add option to configure unregister ingesters on shutdown Signed-off-by: Marco Pracucci <[email protected]> * Fixed config Signed-off-by: Marco Pracucci <[email protected]> * Improved comment Signed-off-by: Marco Pracucci <[email protected]> * Updated doc Signed-off-by: Marco Pracucci <[email protected]> * Removed ifs Signed-off-by: Marco Pracucci <[email protected]> * Updated comment Signed-off-by: Marco Pracucci <[email protected]> * Fixed syntax error Signed-off-by: Marco Pracucci <[email protected]> * Remove misleading comment (grafana/cortex-jsonnet#243) Signed-off-by: Marco Pracucci <[email protected]> * Add option to customise the configmap name Signed-off-by: Goutham Veeramachaneni <[email protected]> * Fix for real Signed-off-by: Marco Pracucci <[email protected]> * Added bucket index flag, and enable bucket index by default. (grafana/cortex-jsonnet#254) * Cleanup blocks storage config Signed-off-by: Marco Pracucci <[email protected]> * feat: allow for Alertmanager to configure multiple storage backends Signed-off-by: Jacob Lisi <[email protected]> * Update cortex/config.libsonnet Co-authored-by: gotjosh <[email protected]> * Update cortex/alertmanager.libsonnet Co-authored-by: gotjosh <[email protected]> * Release 1.7.0. (grafana/cortex-jsonnet#260) * Release 1.7.0. * cortex: config: Fix error message for alertmanager_client_type. * cortex: alertmanager: Remove space in dot notation. * Up metadata connection limits * Add flag to enable streaming of chunks. 
(grafana/cortex-jsonnet#276) Signed-off-by: Peter Štibraný <[email protected]> * Add recording rules to calculate Cortex scaling - Update dashboard so it only shows under provisioned services and why - Add sizing rules based on limits. - Add some docs to the dashboard. Signed-off-by: Tom Wilkie <[email protected]> * chore: update lib to use new API paths Signed-off-by: Jacob Lisi <[email protected]> * Create 1.8.0 release. (grafana/cortex-jsonnet#282) * Create 1.8.0 release. Signed-off-by: Peter Štibraný <[email protected]> * Update image tags. Signed-off-by: Peter Štibraný <[email protected]> * Do not use deprecated Alertmanager cluster flags Signed-off-by: Marco Pracucci <[email protected]> * fix: Update ksonnet-util vendor lock The previous version `c19a92e586a6752f11745b47f309b13f02ef7147` is incompatible with the library in its current form. For example in `tsdb.libsonnet` L81, we use `pvc.new('ingester-pvc')` but at the locked version, in `ksonnet-util/kausal.libsonnet` the `pvc.new` function takes no arguments. Signed-off-by: Jack Baldry <[email protected]> * Add function to customize compactor statefulset Signed-off-by: Marco Pracucci <[email protected]> * Add querier_service_ignored_labels (grafana/cortex-jsonnet#291) Co-authored-by: Victor Tsang Hi <[email protected]> * Introduce ingester instance limits to configuration, and add alerts. (grafana/cortex-jsonnet#296) * Introduce ingester instance limits to configuration, and add alerts. * CHANGELOG.md * Address (internal) review feedback. * Add `query-scheduler.libsonnet` (grafana/cortex-jsonnet#295) * Add query-scheduler.libsonnet. * CHANGELOG.md * Use flag to enable query-scheduler. * Fix image. 
* Replace use of querier.compress-http-responses removed in Cortex 1.9 Signed-off-by: Nick Pillitteri <[email protected]> * Enable index-header lazy loading in store-gateway Signed-off-by: Marco Pracucci <[email protected]> * Do not use deprecated/removed flag -limits.per-user-override-config Signed-off-by: Marco Pracucci <[email protected]> * Use new ruler storage config and enable API compression Signed-off-by: Marco Pracucci <[email protected]> * Changed alertmanager config to use the new storage config Signed-off-by: Marco Pracucci <[email protected]> * Cut release 1.9.0 Signed-off-by: Goutham Veeramachaneni <[email protected]> * Mount overrides configmap to alertmanager too Signed-off-by: Marco Pracucci <[email protected]> * Upgrade memcached Signed-off-by: Marco Pracucci <[email protected]> * Increase default store-gateway memory request and limit Signed-off-by: Marco Pracucci <[email protected]> * Fix Signed-off-by: Marco Pracucci <[email protected]> * Set -server.grpc-max-*-msg-size-bytes for ruler and ingester. (grafana/cortex-jsonnet#326) * Fixed --alertmanager.cluster.peers Signed-off-by: Marco Pracucci <[email protected]> * Set empty alertmanager listen address with 1 replica Alertmanager tries to start clustering unless the flag is explicitly set as an empty string https://github.com/prometheus/alertmanager#turn-off-high-availability * Add option to disable anti-affinity in newIngesterStatefulSet() Signed-off-by: Marco Pracucci <[email protected]> * Fix alertmanager config change introduced in grafana/cortex-jsonnet#344 Signed-off-by: Marco Pracucci <[email protected]> * Create another tier with 300K active series The other tiers have a 3x jump except when we go from 100K to 1Mil. I think we should have a 3x jump for the first tier too. 
Signed-off-by: Goutham Veeramachaneni <[email protected]> * Improve config settings based on recent learnings Signed-off-by: Marco Pracucci <[email protected]> * Added functions to create query-frontend and querier deployments Signed-off-by: Marco Pracucci <[email protected]> * Added function to create query-scheduler deployment Signed-off-by: Marco Pracucci <[email protected]> * chore: upgrade to latest etcd-operator Brings: grafana/jsonnet-libs#480 * Alertmanager: Allow storage configuration to support Azure The alertmanager configuration did not have support for Azure. Let's add it. * remove new line * Fix comment on medium_small_user config It says it should be 100k + 50%, but that's what extra_small_user is. Here we have 300k, which is 200k + 50%. Signed-off-by: Oleg Zaytsev <[email protected]> * Remove wrong comment Signed-off-by: Oleg Zaytsev <[email protected]> * Add overrides to compactor Signed-off-by: Goutham Veeramachaneni <[email protected]> * Split limits config into a variable we can reuse Signed-off-by: Goutham Veeramachaneni <[email protected]> * Review feedback Signed-off-by: Goutham Veeramachaneni <[email protected]> * Fix missing ruler limits Damn, missed this in grafana/cortex-jsonnet#391 Signed-off-by: Goutham Veeramachaneni <[email protected]> * Alertmanager: Add sharding configuration. * Fix `compactor_blocks_retention_period` type in `extra_small_user` (grafana/cortex-jsonnet#395) * Fix `compactor_blocks_retention_period` type in `extra_small_user` The actual type of `compactor_blocks_retention_period` is `model.Duration`. Which comes from prometheus `common` package. The problem is that `model.Duration` have custom JSON unmarshal which treat the incoming value as string. https://github.com/prometheus/common/blob/main/model/time.go#L276 So setting it as integer, won't work when unmarshalling with JSON. 
NOTE: This won't be an issue for YamlUnmarshal, as it always treating it as string (even though you put it as integer) https://github.com/prometheus/common/blob/main/model/time.go#L307 * update CHANGELOG * Update rule limits to be inline with customer expectations We built the initial rules on guesswork and now we're updating them based on what the customers are asking for. Further, the ruler can be horizontally scaled and we're happy letting our users have more rules! Signed-off-by: Goutham Veeramachaneni <[email protected]> * Remove max_samples_per_query limit. (grafana/cortex-jsonnet#397) * Remove max_samples_per_query limit. * Fixed CHANGELOG.md * Removed chunks storage query sharding config support Signed-off-by: Marco Pracucci <[email protected]> * Add queryEngineConfig Signed-off-by: Marco Pracucci <[email protected]> * tsdb: Add multi concurrency and max idle connections store gateway params Signed-off-by: Arve Knudsen <[email protected]> * Update cortex/tsdb.libsonnet Co-authored-by: Marco Pracucci <[email protected]> * Fix formatting Signed-off-by: Arve Knudsen <[email protected]> * tsdb: Use literal numbers instead of variables Signed-off-by: Arve Knudsen <[email protected]> * cortex: Make ruler object storage support generic Signed-off-by: Arve Knudsen <[email protected]> * Remove ruler-storage.gcs.bucket-name for Azure Signed-off-by: Arve Knudsen <[email protected]> * cortex: Define Azure ruler args Signed-off-by: Arve Knudsen <[email protected]> * Parameterize Signed-off-by: Arve Knudsen <[email protected]> * Further document ingester_stream_chunks_when_using_blocks parameter Signed-off-by: Arve Knudsen <[email protected]> * Add options to disable anti-affinity Signed-off-by: Marco Pracucci <[email protected]> * Upstream some config improvements Signed-off-by: Marco Pracucci <[email protected]> * Increased max connections for memcached chunks and index-queries too Signed-off-by: Marco Pracucci <[email protected]> * Ruler: Pass 
`-ruler-storage.s3.endpoint` to ruler when using S3. This argument is is required, without it, the following error appears: ``` no s3 endpoint in config file ``` * Allow to create custom store-gateway StatefulSets via newStoreGatewayStatefulSet() Signed-off-by: Marco Pracucci <[email protected]> * Fix newStoreGatewayStatefulSet() to use input container Signed-off-by: Marco Pracucci <[email protected]> * Add CI check for jsonnet manifests * Remove additional git diff in check-mixin * Imported cortex-jsonnet CHANGELOG entries from 1.9.0 Signed-off-by: Marco Pracucci <[email protected]> * Improved CHANGELOG header Signed-off-by: Marco Pracucci <[email protected]> Co-authored-by: Marco Pracucci <[email protected]> Co-authored-by: Austin McKinley <[email protected]> Co-authored-by: Tom Wilkie <[email protected]> Co-authored-by: Jacob Lisi <[email protected]> Co-authored-by: Austin McKinley <[email protected]> Co-authored-by: Goutham Veeramachaneni <[email protected]> Co-authored-by: Peter Štibraný <[email protected]> Co-authored-by: Joe Elliott <[email protected]> Co-authored-by: Joe Elliott <[email protected]> Co-authored-by: Duologic <[email protected]> Co-authored-by: Jeroen Op 't Eynde <[email protected]> Co-authored-by: Sandeep Sukhani <[email protected]> Co-authored-by: Ganesh Vernekar <[email protected]> Co-authored-by: Stan Kwong <[email protected]> Co-authored-by: gotjosh <[email protected]> Co-authored-by: forestsword <[email protected]> Co-authored-by: Jacob Lisi <[email protected]> Co-authored-by: Alex Martin <[email protected]> Co-authored-by: Tom Wilkie <[email protected]> Co-authored-by: Jack Baldry <[email protected]> Co-authored-by: Victor Tsang Hi <[email protected]> Co-authored-by: Victor Tsang Hi <[email protected]> Co-authored-by: Nick Pillitteri <[email protected]> Co-authored-by: Steve Simpson <[email protected]> Co-authored-by: Hamish <[email protected]> Co-authored-by: Javier Palomo <[email protected]> Co-authored-by: gotjosh <[email protected]> 
Co-authored-by: Oleg Zaytsev <[email protected]> Co-authored-by: Kaviraj <[email protected]> Co-authored-by: Arve Knudsen <[email protected]>

Introduce ingester instance limits to configuration, and add alerts.

40dbe6a

pstibrany requested a review from a team as a code owner April 22, 2021 08:01

pstibrany added 2 commits April 22, 2021 10:02

CHANGELOG.md

4fdcb81

Address (internal) review feedback.

c437a55

pracucci approved these changes Apr 22, 2021

pstibrany merged commit ce896a7 into main Apr 22, 2021

pstibrany deleted the ingester-instance-limits branch April 22, 2021 09:18

simonswine mentioned this pull request Nov 18, 2021

Import cortex-jsonnet into mimir repo grafana/mimir#506 (merged)
