This program collects Prometheus data for the last n minutes (default: 5 minutes), takes the average of each metric, and writes the result to a text file.
Use it together with another Prometheus instance to store the downsampled data. In our case, we set the long-term Prometheus retention to 2 years (still testing; we hope it works as expected).
After testing for a while, memory usage on the long-term Prometheus keeps growing. The likely cause: metrics that haven't been updated recently but haven't reached retention yet keep their index in memory. We are now trying Thanos instead.
This program only writes a text file to a Kubernetes emptyDir volume; an nginx container in the same pod then exposes the output to the long-term Prometheus. You also need to set `honor_labels: true` in the long-term Prometheus scrape job, otherwise conflicting labels will be renamed.
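For reference, the scrape job on the long-term Prometheus could look like the sketch below; the job name, target, and interval are placeholders for your own setup:

```yaml
scrape_configs:
  - job_name: downsampled      # placeholder name
    honor_labels: true         # keep the labels from the source Prometheus
    scrape_interval: 5m        # match the downsample interval
    static_configs:
      - targets: ['downsampler-nginx:80']  # the nginx sidecar serving the output file
```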
There are 4 configurable parameters. You can set each one either with a command-line arg or an environment variable.
- Source Prometheus endpoint
  - Default: `http://127.0.0.1:9090`
  - Args: `-s`
  - Environment variable: `PDS_SOURCE`
- Output file path
  - Default: `/tmp/prometheus_downsample_output.txt`
  - Args: `-o`
  - Environment variable: `PDS_OUTPUT`
- Interval in minutes for collecting data from the source Prometheus
  - Default: `5m`
  - Args: `-i`
  - Environment variable: `PDS_INTERVAL`
- Max concurrent connections to the source Prometheus
  - Default: `50`
  - Args: `-c`
  - Environment variable: `PDS_CONCURRENT`
Example: your Prometheus endpoint is `http://192.168.1.20:9090` and you want to downsample data every 10 minutes:

```
go run prometheus-downsampler.go -s http://192.168.1.20:9090 -i 10m
```

or

```
./prometheus-downsampler -s http://192.168.1.20:9090 -i 10m
```
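Equivalently, the same run configured through the environment variables listed above:

```
PDS_SOURCE=http://192.168.1.20:9090 PDS_INTERVAL=10m ./prometheus-downsampler
```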
- Call the Querying label values API to get all metric names
- Call the Range Queries API to fetch every metric with a 1-minute step
- Take the average of each metric
- Write all metrics in exposition format to a temp file
- Rename the temp file to the output file name
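The sketch below illustrates these steps, assuming the standard Prometheus HTTP API; it is not the project's actual code. For brevity it prints to stdout instead of doing the temp-file write and rename, and it skips the concurrency limit:

```go
// A sketch of the collect-and-average flow, assuming the standard Prometheus
// HTTP API. Illustrative only, not the project's actual code.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"sort"
	"strconv"
	"strings"
	"time"
)

const source = "http://127.0.0.1:9090" // source Prometheus (-s)

// metricNames lists all metric names via the label values API.
func metricNames() ([]string, error) {
	resp, err := http.Get(source + "/api/v1/label/__name__/values")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var body struct {
		Data []string `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	return body.Data, nil
}

// series is one range-query result: a label set and its samples.
type series struct {
	Metric map[string]string    `json:"metric"`
	Values [][2]json.RawMessage `json:"values"` // [timestamp, "value"] pairs
}

// rangeQuery fetches one metric over [start, end] at a 1-minute step.
func rangeQuery(name string, start, end time.Time) ([]series, error) {
	resp, err := http.Get(fmt.Sprintf("%s/api/v1/query_range?query=%s&start=%d&end=%d&step=60",
		source, url.QueryEscape(name), start.Unix(), end.Unix()))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var body struct {
		Data struct {
			Result []series `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, err
	}
	return body.Data.Result, nil
}

// average returns the mean of the string-encoded sample values.
func average(values [][2]json.RawMessage) float64 {
	var sum float64
	for _, v := range values {
		var s string
		_ = json.Unmarshal(v[1], &s)
		f, _ := strconv.ParseFloat(s, 64)
		sum += f
	}
	return sum / float64(len(values))
}

func main() {
	end := time.Now()
	start := end.Add(-5 * time.Minute) // collection interval (-i)
	names, err := metricNames()
	if err != nil {
		panic(err)
	}
	for _, name := range names {
		result, err := rangeQuery(name, start, end)
		if err != nil {
			continue
		}
		for _, s := range result {
			if len(s.Values) == 0 {
				continue
			}
			// One exposition-format line: name{labels} <average> <timestamp-ms>
			var labels []string
			for k, v := range s.Metric {
				if k != "__name__" {
					labels = append(labels, fmt.Sprintf("%s=%q", k, v))
				}
			}
			sort.Strings(labels)
			fmt.Printf("%s{%s} %g %d\n", name, strings.Join(labels, ","), average(s.Values), end.UnixMilli())
		}
	}
}
```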
This program could collect data over a longer time range, group the samples into n-minute buckets, and average each bucket (a sketch of that bucketing follows the list below). But for the reasons below, it currently only processes a single time group.

- The exposition format documentation states: "Each line must have a unique combination of a metric name and labels. Otherwise, the ingestion behavior is undefined." It doesn't say whether repeating the same metric name and labels with different timestamps is safe.
- We also tested exporting 1 hour of data as 12 data points per metric for a while, and the long-term Prometheus lost some of the data points.
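A minimal sketch of that n-minute bucketing (the `sample` type is illustrative, not from the project):

```go
package main

import (
	"fmt"
	"time"
)

// sample is an illustrative (timestamp, value) pair.
type sample struct {
	Time  time.Time
	Value float64
}

// bucketAverage groups samples into n-sized buckets by truncating each
// timestamp, then averages every bucket.
func bucketAverage(samples []sample, n time.Duration) map[time.Time]float64 {
	sums := map[time.Time]float64{}
	counts := map[time.Time]int{}
	for _, s := range samples {
		b := s.Time.Truncate(n)
		sums[b] += s.Value
		counts[b]++
	}
	avgs := make(map[time.Time]float64, len(sums))
	for b, sum := range sums {
		avgs[b] = sum / float64(counts[b])
	}
	return avgs
}

func main() {
	t := time.Unix(1600000000, 0)
	samples := []sample{{t, 1}, {t.Add(time.Minute), 3}, {t.Add(11 * time.Minute), 5}}
	// Two 10-minute buckets: the first averages 1 and 3, the second holds 5.
	fmt.Println(bucketAverage(samples, 10*time.Minute))
}
```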
We scrape over 650K metrics every 10 seconds (with 2 Prometheus servers for HA). We tried `remote_write` to InfluxDB (a single server, not the enterprise edition), but it caused very high CPU usage on InfluxDB, which quickly became unresponsive and died from OOM; it also took down the operational Prometheus. So we tried a different approach (this project).
It also needs only minor modifications to the Grafana dashboards, with no need to rebuild them all. (We don't know why Grafana always hit proxy timeouts when reading from InfluxDB through Prometheus `remote_read`.)