Hero is relieved that they don’t have to choose between Prometheus and OpenTelemetry for standardization, whew. OpenTelemetry understands Prometheus metrics and Prometheus can use the OpenTelemetry SDKs to correlate Prometheus metrics with other system data.
But when it comes to metrics, is it enough to have just Prometheus? Or should Hero choose a tool like Thanos or Cortex that builds on Prometheus and adds features like high availability, horizontal scaling, and long-term storage? What differentiates Thanos and Cortex from one another? Let's dig in.
Prometheus is an open source tool that helps to generate, collect, process, store, and query/visualize data in the form of time series metrics. The process goes something like this:
Step 1 - Generation
Prometheus metrics are everywhere! They’re probably already being generated by your favorite third-party tools; if not, an exporter is almost certainly available. For your internal applications, use Prometheus client libraries to instrument code to generate Prometheus metrics. All of these generated metrics are discoverable by Prometheus.
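To make the generation step concrete, here is a minimal sketch of what an instrumented application exposes, written with only the Python standard library rather than a real Prometheus client library. The metric name `checkout_requests_total`, its labels, and the `render_counter` helper are hypothetical, and the sketch does not escape special characters in label values.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter's current value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter for an internal checkout service.
checkout_requests = {
    (("method", "GET"), ("status", "200")): 42,
    (("method", "POST"), ("status", "500")): 3,
}
exposition = render_counter(
    "checkout_requests_total", "Total HTTP requests handled.", checkout_requests
)
print(exposition)
```

In practice a client library would maintain these counters and serve this text over HTTP for you; the sketch only shows the shape of the data.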
Step 2 - Collection
Once the metrics are generated, Prometheus collects them using a pull-based method called scraping. If you prefer to push your metrics, Prometheus has a tool for that too.
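As a rough illustration of the pull model, the sketch below runs one scrape cycle: it asks a target for its current metrics and parses the response. The `fake_target` function stands in for an HTTP GET against a target's metrics endpoint, and the parser is deliberately simplified (no timestamps, no escaped characters or commas inside label values); all names here are hypothetical.

```python
def parse_samples(exposition_text):
    """Parse simple 'name{labels} value' lines, skipping comments and blanks."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, value = line.rsplit(" ", 1)
        name, _, rest = series.partition("{")
        labels = {}
        if rest:
            for pair in rest.rstrip("}").split(","):
                if not pair:
                    continue
                key, _, raw = pair.partition("=")
                labels[key] = raw.strip('"')
        samples.append((name, labels, float(value)))
    return samples


def scrape(fetch):
    """One scrape cycle: pull the target's current metrics and parse them."""
    return parse_samples(fetch())


def fake_target():
    """Stand-in for an HTTP GET against a target's metrics endpoint."""
    return (
        "# TYPE http_requests_total counter\n"
        'http_requests_total{method="GET"} 42\n'
        "process_open_fds 7\n"
    )


scraped = scrape(fake_target)
print(scraped)
```

The key point is the direction of the request: the collector decides when to ask, and the target only answers with its current values.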
Step 3 - Manipulation (optional)
If you like, you can use a relabel_config to manipulate your scrape data at this point.
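The following is a simplified Python model of how relabeling rules behave, not Prometheus's own implementation. It covers only the `keep`, `drop`, and `replace` actions, joins source labels with `;`, and anchors the regex to the whole joined value. The `apply_relabel` helper and the example labels are hypothetical, and the replacement uses Python's `\1` group syntax where a real Prometheus config would use `$1`.

```python
import re


def apply_relabel(labels, rule):
    """Apply one simplified relabel rule; return new labels, or None if dropped."""
    separator = rule.get("separator", ";")
    joined = separator.join(labels.get(name, "") for name in rule["source_labels"])
    match = re.fullmatch(rule.get("regex", "(.*)"), joined)
    action = rule.get("action", "replace")
    if action == "keep":
        return dict(labels) if match else None
    if action == "drop":
        return None if match else dict(labels)
    if action == "replace":
        if not match:
            return dict(labels)
        updated = dict(labels)
        updated[rule["target_label"]] = match.expand(rule.get("replacement", r"\1"))
        return updated
    raise ValueError(f"unsupported action: {action}")


target = {"__address__": "10.0.0.5:9100", "env": "staging"}
rules = [
    # Keep only targets from the prod or staging environments.
    {"source_labels": ["env"], "regex": "prod|staging", "action": "keep"},
    # Copy the host part of the address into a new "host" label.
    {"source_labels": ["__address__"], "regex": r"([^:]+):\d+",
     "target_label": "host", "replacement": r"\1", "action": "replace"},
]

result = target
for rule in rules:
    result = apply_relabel(result, rule)
    if result is None:  # the target was dropped; later rules do not run
        break
print(result)
```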
Step 4 - Ingestion/Storage
Ingestion is the act of putting data into storage. With Prometheus, the metrics are stored in the Prometheus instance itself.
Step 5 - Use the data!
Prometheus doesn’t have a visualization component, so you’ll likely use a third-party dashboard tool to see charts and graphs and pretty colors. You can also use Prometheus’s query language PromQL to query Prometheus, and you can use Prometheus Alertmanager to create alerts.
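As a small example of querying, this sketch builds a request URL for the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The server address `http://localhost:9090`, the metric `http_requests_total`, and the `instant_query_url` helper are assumptions for illustration; no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url, promql):
    """Build the URL for an instant PromQL query against a Prometheus server."""
    query_string = urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


# Per-second request rate over the last five minutes, summed by status code.
promql = "sum by (status) (rate(http_requests_total[5m]))"
url = instant_query_url("http://localhost:9090", promql)
print(url)

# Round-trip check: decoding the URL gives back the original expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == promql)
```

A dashboard tool issues requests like this on your behalf; the PromQL expression is the part you write.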
Prometheus was designed to run as one standalone instance. When you start to scale Prometheus, it can be hard to manage all of the Prometheus instances as a whole. Here are some problems that emerge:
Horizontal Scaling - Prometheus does not natively support horizontal scaling. You can add more instances, but the instances don’t know about each other.
Global Querying - Relatedly, since the instances of Prometheus don’t know about each other, you are not able to query across the entire system. This is sometimes solved using a federated Prometheus architecture where one central instance of Prometheus scrapes the data from all of the other instances, but this isn’t the most efficient way, and it doesn’t solve many of the other Prometheus problems.
High Availability - This can be accomplished in Prometheus by having more than one Prometheus instance scrape the same data, but this isn’t an elegant solution.
Long-Term Metrics Storage - By default, Prometheus keeps metrics for 15 days. You can configure Prometheus to keep metrics for longer, but if you go much longer you’ll start to have performance and cost problems. The longer you store metrics, the more disk space you’ll need, the slower queries will become, and the more data you’ll hold in memory, which leads to memory problems too.
Multi-Tenancy - Prometheus does not support multi-tenancy. This can cause problems where one team uses too many resources, or where teams can see each other’s data, which might violate security and/or privacy requirements.
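To put rough numbers on the long-term storage problem, the sketch below applies the capacity rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds, times ingested samples per second, times bytes per sample. The 2 bytes per sample and the workload of 100,000 samples per second are assumptions for illustration; real usage depends on how well your data compresses.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rough disk estimate: retention seconds * ingest rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Hypothetical workload: 100,000 samples ingested per second.
for days in (15, 90, 365):
    gib = estimate_disk_bytes(days, 100_000) / 2**30
    print(f"{days:>3} days of retention: about {gib:,.0f} GiB")
```

Because the estimate grows linearly with retention, going from the 15-day default to a full year multiplies local disk needs by roughly 24 on a single instance.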
Thanos is a toolkit of components that are packaged as a single binary and can be composed into a highly available and scalable metrics solution.
It works by adding a sidecar to each Prometheus instance that can read its data for queries and/or upload the data to long-term cloud object storage. It also adds a component that serves metrics from inside a cloud storage bucket. Thanos’s query engine can then serve both long-term object-store metrics and metrics stored in one or many Prometheus instances, all in one place.
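The idea of one query layer over several stores can be sketched as a fan-out and merge. This is a conceptual model only, not Thanos's actual API: the `make_source` and `query_all` helpers, the sample data, and the "first source wins on overlap" rule are all assumptions made for illustration.

```python
def make_source(samples):
    """Build a hypothetical store that answers time-range queries."""
    def source(start, end):
        return [(ts, value) for ts, value in samples if start <= ts <= end]
    return source


def query_all(sources, start, end):
    """Fan a time-range query out to every source and merge the results."""
    merged = {}
    for source in sources:
        for ts, value in source(start, end):
            merged.setdefault(ts, value)  # first source wins on overlapping timestamps
    return sorted(merged.items())


# Recent data still held by a Prometheus instance (reached through its sidecar).
sidecar = make_source([(300, 3.0), (400, 4.0)])
# Older data that has already been uploaded to object storage.
bucket = make_source([(100, 1.0), (200, 2.0), (300, 3.0)])

result = query_all([sidecar, bucket], start=150, end=400)
print(result)
```

The caller sees a single continuous series and does not need to know which samples came from a live instance and which from the bucket.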
Thanos has additional features too, like a compactor that compacts and downsamples cloud-bucket data, resulting in faster queries, even across huge amounts of data. Thanos can also enforce global alerting rules and receive data from Prometheus’s remote write write-ahead log.
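Downsampling is easier to picture with a small example. The sketch below collapses raw samples into fixed windows and keeps a few aggregates per window, so a query over a long range reads far fewer points. It illustrates the general idea only and is not Thanos's on-disk format; the 5-minute window, the chosen aggregates, and the synthetic data are assumptions.

```python
from collections import defaultdict


def downsample(samples, window_seconds=300):
    """Aggregate (timestamp, value) samples into fixed-size time windows."""
    buckets = defaultdict(list)
    for ts, value in samples:
        window_start = ts - ts % window_seconds
        buckets[window_start].append(value)
    return {
        start: {
            "count": len(values),
            "sum": sum(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in sorted(buckets.items())
    }


# Synthetic raw data: one sample every 15 seconds for 15 minutes (60 samples).
raw = [(t, float(t % 7)) for t in range(0, 900, 15)]
downsampled = downsample(raw)
print(len(raw), "raw samples ->", len(downsampled), "windows")
```

Keeping count, sum, min, and max per window means averages and extremes can still be computed from the smaller data set.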
Thanos is designed to handle a very large number of time series, often in the millions or even billions, depending on the configuration and underlying infrastructure. Because it can be added on top of existing Prometheus instances, it is a great choice for companies that already have a large Prometheus footprint.
Cortex is a multi-tenant, highly available, horizontally scalable, long-term storage solution for metrics that builds on Prometheus.
Similar to Thanos, Cortex receives metrics and writes them to long-term object storage. Also similar to Thanos, Cortex handles global query requests, providing caching and distributing the requests to either short- or long-term storage in a way that is invisible to the user. It also performs compaction for faster querying.
However, Cortex uses a more complex storage architecture that emphasizes multi-tenancy and high availability. Cortex deploys and scales independently from Prometheus, and it can also receive metrics from sources other than Prometheus, such as the OpenTelemetry Collector or an OTLP-instrumented application.
Cortex especially shines in a multi-tenant use case where it can implement query limits and/or ingestion limits on a per-tenant basis. It also can isolate data between teams.
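Per-tenant limits can be pictured with a toy model like the one below. It is a sketch of the concept, not Cortex's implementation or configuration: the `TenantLimiter` class, the tenant names, and the limit of two active series are hypothetical. In Cortex itself, the tenant is identified by the `X-Scope-OrgID` request header.

```python
class TenantLimiter:
    """Toy model of per-tenant isolation with a cap on active series."""

    def __init__(self, max_series_per_tenant):
        self.max_series = max_series_per_tenant
        self.series = {}  # tenant ID -> set of that tenant's series identifiers

    def ingest(self, tenant_id, series_id):
        """Accept a sample unless it would push the tenant over its series limit."""
        active = self.series.setdefault(tenant_id, set())
        if series_id not in active and len(active) >= self.max_series:
            return False  # rejected: this tenant is at its limit
        active.add(series_id)
        return True

    def visible_series(self, tenant_id):
        """A tenant can only see the series it ingested itself."""
        return sorted(self.series.get(tenant_id, set()))


limiter = TenantLimiter(max_series_per_tenant=2)
team_a = [limiter.ingest("team-a", s) for s in ("cpu", "memory", "disk")]
team_b = [limiter.ingest("team-b", s) for s in ("cpu",)]
print(team_a, team_b)
print(limiter.visible_series("team-b"))
```

One team hitting its limit has no effect on another team's ingestion, and neither team can read the other's series.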
- Prometheus
- Thanos has not yet been implemented. Please let us know (by opening an issue) if you would like to contribute the implementation.
- Cortex has not yet been implemented. Please let us know (by opening an issue) if you would like to contribute the implementation.
- Pixie