prometheus logger: fix potential unlimited memory usage #529

Merged
merged 12 commits into from
Jan 3, 2024
38 changes: 38 additions & 0 deletions config.yml
@@ -274,6 +274,44 @@ multiplexer:
# chan-buffer-size: 65535
# # compute histogram for qnames length, latencies, queries and replies size repartition
# histogram-metrics-enabled: false
# # compute requesters metrics - total and top requesters
# requesters-metrics-enabled: true
# # compute domains metrics - total and top domains
# domains-metrics-enabled: true
# # compute NOERROR domains metrics - total and top domains
# noerror-metrics-enabled: true
# # compute SERVFAIL domains metrics - total and top domains
# servfail-metrics-enabled: true
# # compute NXDOMAIN domains metrics - total and top domains
# nonexistent-metrics-enabled: true
# # compute TIMEOUT domains metrics - total and top domains
# timeout-metrics-enabled: true
# # prometheus-labels: (list of strings) labels to add to metrics. Currently supported labels: stream_id, resolver, stream_global
# prometheus-labels: ["stream_id"]
# # LRU (least-recently-used) cache size for observed DNS clients
# requesters-cache-size: 250000
# # maximum time (in seconds) before eviction from the LRU cache
# requesters-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed domains
# domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# domains-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed NOERROR domains
# noerror-domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# noerror-domains-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed SERVFAIL domains
# servfail-domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# servfail-domains-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed NX domains
# nonexistent-domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# nonexistent-domains-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed other domains (suspicious, tlds, ...)
# default-domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# default-domains-cache-ttl: 3600

# # write captured dns traffic to text or binary files with rotation and compression support
# logfile:
1 change: 1 addition & 0 deletions dnsutils/constant.go
@@ -4,6 +4,7 @@ const (
ProtoDoT = "DOT"
ProtoDoH = "DOH"

DNSRcodeNoError = "NOERROR"
DNSRcodeNXDomain = "NXDOMAIN"
DNSRcodeServFail = "SERVFAIL"
DNSRcodeTimeout = "TIMEOUT"
54 changes: 47 additions & 7 deletions docs/loggers/logger_prometheus.md
@@ -19,7 +19,17 @@ Options:
- `top-n`: (string) default number of items on top
- `chan-buffer-size`: (integer) size of the channel buffer for incoming DNS messages; once this many messages are queued, further messages are dropped
- `histogram-metrics-enabled`: (boolean) compute histogram for qnames length, latencies, queries and replies size repartition
- `prometheus-labels`: (list of strings) labels to add to metrics. Currently supported labels: `stream_id`, `resolver`
- `prometheus-labels`: (list of strings) labels to add to metrics. Currently supported labels: `stream_id` (default), `stream_global`, `resolver`
- `requesters-cache-size`: (integer) LRU (least-recently-used) cache size for observed DNS clients per stream
- `requesters-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
- `domains-cache-size`: (integer) LRU (least-recently-used) cache size for observed domains per stream
- `domains-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
- `noerror-domains-cache-size`: (integer) LRU (least-recently-used) cache size for observed NOERROR domains per stream
- `noerror-domains-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
- `servfail-domains-cache-size`: (integer) LRU (least-recently-used) cache size for observed SERVFAIL domains per stream
- `servfail-domains-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
- `nonexistent-domains-cache-size`: (integer) LRU (least-recently-used) cache size for observed NX domains per stream
- `nonexistent-domains-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache

Default values:

@@ -39,7 +49,25 @@ prometheus:
top-n: 10
chan-buffer-size: 65535
histogram-metrics-enabled: false
requesters-metrics-enabled: true
domains-metrics-enabled: true
noerror-domains-metrics-enabled: true
servfail-domains-metrics-enabled: true
nonexistent-domains-metrics-enabled: true
timeout-domains-metrics-enabled: true
prometheus-labels: ["stream_id"]
requesters-cache-size: 250000
requesters-cache-ttl: 3600
domains-cache-size: 500000
domains-cache-ttl: 3600
noerror-domains-cache-size: 100000
noerror-domains-cache-ttl: 3600
servfail-domains-cache-size: 10000
servfail-domains-cache-ttl: 3600
nonexistent-domains-cache-size: 10000
nonexistent-domains-cache-ttl: 3600
default-domains-cache-size: 1000
default-domains-cache-ttl: 3600
```

Scrape metrics with curl:
@@ -55,9 +83,11 @@ The full metrics can be found [here](./../metrics.txt).
| Metric | Notes
|-------------------------------------------------|------------------------------------
| dnscollector_build_info | Build info
| dnscollector_requesters_total | The total number of requesters per stream identity
| dnscollector_nxdomains_total | The total number of NX domains per stream identity
| dnscollector_domains_total | The total number of domains per stream identity
| dnscollector_total_requesters_lru | Total number of DNS clients most recently observed per stream identity
| dnscollector_total_domains_lru | Total number of domains most recently observed per stream identity
| dnscollector_total_noerror_domains_lru | Total number of NOERROR domains most recently observed per stream identity
| dnscollector_total_servfail_domains_lru | Total number of SERVFAIL domains most recently observed per stream identity
| dnscollector_total_nonexistent_domains_lru | Total number of NX domains most recently observed per stream identity
| dnscollector_dnsmessage_total | Counter of total of DNS messages
| dnscollector_queries_total | Counter of total of queries
| dnscollector_replies_total | Counter of total of replies
@@ -77,15 +107,15 @@ The full metrics can be found [here](./../metrics.txt).
| dnscollector_reassembled_total | Total of reassembled DNS messages (TCP level)
| dnscollector_throughput_ops | Number of ops per second received, partitioned by stream
| dnscollector_throughput_ops_max | Max number of ops per second observed, partitioned by stream
| dnscollector_tlds_total | The total number of tld per stream identity
| dnscollector_total_tlds_lru | Total number of TLDs most recently observed per stream identity
| dnscollector_top_domains | Number of hit per domain topN, partitioned by stream and qname
| dnscollector_top_nxdomains | Number of hit per nx domain topN, partitioned by stream and qname
| dnscollector_top_sfdomains | Number of hit per servfail domain topN, partitioned by stream and qname
| dnscollector_top_requesters | Number of hit per requester topN, partitioned by client IP
| dnscollector_top_tlds | Number of hit per tld - topN
| dnscollector_top_unanswered | Number of hit per unanswered domain - topN
| dnscollector_unanswered_total | The total number of unanswered domains per stream identity
| dnscollector_suspicious_total | The total number of unanswered domains per stream identity
| dnscollector_total_unanswered_lru | Total number of unanswered domains most recently observed per stream identity
| dnscollector_total_suspicious_lru | Total number of suspicious domains most recently observed per stream identity
| dnscollector_qnames_size_bytes_bucket | Histogram of the size of the qname in bytes
| dnscollector_queries_size_bytes_bucket | Histogram of the size of the queries in bytes.
| dnscollector_replies_size_bytes_bucket | Histogram of the size of the replies in bytes.
@@ -97,3 +127,13 @@ The following [build-in](https://grafana.com/grafana/dashboards/16630) dashboard
<p align="center">
<img src="../_images/dashboard_prometheus.png" alt="dnscollector"/>
</p>

# Merge streams for metrics computation

Use the following setting to consolidate all streams into one for metric computations.

```yaml
prometheus:
  # ... (other options unchanged)
  prometheus-labels: ["stream_global"]
```
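
To illustrate what consolidating streams means for a per-stream metric such as a count of distinct domains, here is a small Go sketch. It is an assumption-laden illustration, not the logger's code: `streamKey`, `uniqueDomains`, the `msg` type, and the `"global"` key value are all hypothetical.

```go
package main

import "fmt"

// msg is a hypothetical, simplified DNS message: its stream and query name.
type msg struct {
	streamID string
	qname    string
}

// streamKey returns the key under which a message's metrics are aggregated.
// Assumption: when "stream_global" is configured, every message shares one
// key; otherwise metrics are kept per stream identity.
func streamKey(labels []string, streamID string) string {
	for _, l := range labels {
		if l == "stream_global" {
			return "global" // hypothetical key value
		}
	}
	return streamID
}

// uniqueDomains counts distinct query names per aggregation key.
func uniqueDomains(labels []string, msgs []msg) map[string]int {
	seen := make(map[string]map[string]struct{})
	for _, m := range msgs {
		key := streamKey(labels, m.streamID)
		if seen[key] == nil {
			seen[key] = make(map[string]struct{})
		}
		seen[key][m.qname] = struct{}{}
	}
	counts := make(map[string]int, len(seen))
	for key, names := range seen {
		counts[key] = len(names)
	}
	return counts
}

func main() {
	msgs := []msg{
		{"tap1", "a.example."},
		{"tap2", "a.example."},
		{"tap2", "b.example."},
	}
	fmt.Println(uniqueDomains([]string{"stream_id"}, msgs))     // map[tap1:1 tap2:2]
	fmt.Println(uniqueDomains([]string{"stream_global"}, msgs)) // map[global:2]
}
```

With per-stream keys, a domain seen on two streams is counted once in each; with a single global key it is counted once overall, and only one set of entries has to be kept.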