
prometheus logger: fix potential unlimited memory usage #529

Merged
merged 12 commits into main from prom_fix_potential_memory_leak
Jan 3, 2024

Conversation

dmachard
Owner

dmachard commented Dec 28, 2023

This PR tries to limit memory usage in the Prometheus logger.

  • integrate LRU cache
  • update docs
  • update prometheus_labels settings to ignore stream_id

The following are currently stored in memory without any limit:

  • number of queries made by each requestor
  • number of queries that ended in NOERROR (or any rcode other than NXDOMAIN and SERVFAIL)
  • number of queries that ended in NXDOMAIN
  • number of queries that ended in SERVFAIL
  • number of queries for each TLD
  • number of queries for each eTLD+1

The following metrics have been replaced (a rough sketch of the new gauge wiring follows the list below):

  • dnscollectors_domains_total (counter) -> dnscollectors_total_domains_lru (gauge)
  • dnscollectors_nxdomains_total (counter) -> dnscollectors_total_nxdomains_lru (gauge)
  • dnscollectors_sfdomains_total (counter) -> dnscollectors_total_sfdomains_lru (gauge)
  • dnscollectors_requesters_total (counter) -> dnscollectors_total_requesters_lru (gauge)
  • dnscollectors_tlds_total (counter) -> dnscollectors_total_tlds_lru (gauge)
  • dnscollectors_etldsplusone_total (counter) -> dnscollectors_total_etldsplusone_lru (gauge)
  • dnscollectors_suspicious_total (counter) -> dnscollectors_total_suspicious_lru (gauge)
  • dnscollectors_unanswered_total (counter) -> dnscollectors_total_unanswered_lru (gauge)
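
For illustration only (this is not the exact code in the PR), the sketch below shows why the replaced counters become gauges: each metric follows the length of an LRU cache, which can shrink when entries are evicted. It assumes github.com/hashicorp/golang-lru/v2/expirable and prometheus/client_golang; the recordDomain helper is a hypothetical name, not the real loggers code.

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
	"github.com/prometheus/client_golang/prometheus"
)

// Bounded set of observed domains: at most 500000 entries, each expiring
// after one hour. Eviction is what keeps memory usage under control.
var domainsLRU = expirable.NewLRU[string, int](500000, nil, time.Hour)

// A gauge rather than a counter, because evictions can make the value go down.
var totalDomainsLRU = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "dnscollector_total_domains_lru",
	Help: "Number of distinct domains currently held in the LRU cache",
})

func recordDomain(domain string) {
	if hits, ok := domainsLRU.Get(domain); ok {
		domainsLRU.Add(domain, hits+1) // existing entry: bump its query count
	} else {
		domainsLRU.Add(domain, 1) // new entry: may evict the oldest domain
	}
	// The gauge tracks the cache length, so it can decrease after evictions.
	totalDomainsLRU.Set(float64(domainsLRU.Len()))
}

func main() {
	prometheus.MustRegister(totalDomainsLRU)
	recordDomain("example.com")
	recordDomain("example.org")
	fmt.Println("domains tracked:", domainsLRU.Len()) // 2
}
```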

@johnhtodd

I see some typographical errors in loggers.go - perhaps change "requeters" to "requesters" in the yaml?

Also: what is the "size" referring to? Bytes, megabytes, number of items...? I assume number of items, but perhaps you have not yet updated the config file example with a description.

@dmachard
Owner Author

dmachard commented Dec 28, 2023

Thanks for catching the typo. The LRU approach works well to limit memory usage, but it costs CPU.

The size is the number of items in the cache.
It's not possible to set a maximum in MB with github.com/hashicorp/golang-lru (a sketch of how these settings map onto the cache follows the example below):

requesters-cache-size: 50000
requesters-cache-ttl: 3600
domains-cache-size: 50000
domains-cache-ttl: 3600
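
To picture how those keys could be applied, here is a minimal sketch assuming github.com/hashicorp/golang-lru/v2/expirable; the cacheSettings struct and newCaches helper are hypothetical names, not the actual loggers.go types. The key point is that "size" bounds the number of entries and "ttl" expires idle ones.

```go
package promcache

import (
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

// Hypothetical struct mirroring the YAML keys above; the field names and
// tags are illustrative, not the project's real configuration types.
type cacheSettings struct {
	RequestersCacheSize int `yaml:"requesters-cache-size"`
	RequestersCacheTTL  int `yaml:"requesters-cache-ttl"` // seconds
	DomainsCacheSize    int `yaml:"domains-cache-size"`
	DomainsCacheTTL     int `yaml:"domains-cache-ttl"` // seconds
}

// newCaches shows how "size" (a number of items, not bytes) and "ttl"
// (seconds) would translate into bounded, expiring LRU caches.
func newCaches(cfg cacheSettings) (requesters, domains *expirable.LRU[string, int]) {
	requesters = expirable.NewLRU[string, int](cfg.RequestersCacheSize, nil,
		time.Duration(cfg.RequestersCacheTTL)*time.Second)
	domains = expirable.NewLRU[string, int](cfg.DomainsCacheSize, nil,
		time.Duration(cfg.DomainsCacheTTL)*time.Second)
	return requesters, domains
}
```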

@dmachard changed the title from "prometheus logger: fix potential memory leak" to "prometheus logger: fix potential unlimited memory usage" on Dec 28, 2023
@johnhtodd

johnhtodd commented Dec 29, 2023

if this can get merged into the pipeline branch, I'll test ASAP and report comparative CPU usage.

@dmachard
Owner Author

This patch has been merged into the pipeline branch.
Any feedback will be appreciated.

@johnhtodd

Running now; looks good so far, but we will know shortly, once we start evicting domains from the LRU, what that does to CPU.

@johnhtodd

Testing questions: with the defaults, the number of domains stored should never be above 500000 - correct? (in your notes above, there is a typo of 50000) I am looking at dnscollector_total_domains_lru to measure this number. Currently, the value of that counter is 579000 so something is wrong. (branch: pipeline_mode, full clone half an hour ago, changes to all metrics with "_lru" are apparent so I know I'm running the right version.)

@dmachard
Owner Author

Correct, the default value is 500k: https://github.com/dmachard/go-dnscollector/blob/c7f54ac9bf32e3778b3af5ba437ab3e7f91892d6/pkgconfig/loggers.go#L337

Unless you overrode the default value in the config file?
I retested on my side and the gauge value is equal to the max value.

@johnhtodd

I did not override it in the config file, so this was using defaults. However, the good news is that the number has been dropping, but that may be due to timers and not to the maximum number (queries have been decreasing over the last few hours, so growth may be naturally diminishing). This graph is plotting sum(dnscollector_total_domains_lru) for my system. It peaks well over 600k names, which is far above the 500k maximum.

[Screenshot: graph of sum(dnscollector_total_domains_lru), Dec 29, 2023]

@johnhtodd

It may be worth noting that I have three feeds coming into this system from three different resolvers. Does this maximum value apply to the total number of names in memory, or is it per stream_id? If it is the latter, then perhaps this is expected behavior.

@dmachard
Owner Author

dmachard commented Dec 30, 2023

Does this maximum value apply to the total number of names in memory, or is it per stream_id ?

The maximum value is per stream_id

This graph is plotting sum(dnscollector_total_domains_lru) for my system.

Okay, you have one dimension in 'dnscollector_total_domains_lru,' which is the stream_id. The maximum value in your case (with the sum) should be 1.5 million.

Could you plot 'dnscollector_total_domains_lru' without the sum?

@johnhtodd

johnhtodd commented Dec 31, 2023

OK, this works. It's a bit confusing, since we don't know how many domains we have in total - we know for each stream, which may be nearly 100% overlapping, and there is no way to disambiguate those (yet) unless there is a custom metric for the true "total" of non-overlapping name space. This is not important; I don't much care, but it's interesting and someone else may care.
[Screenshot: graph of dnscollector_total_domains_lru per stream, Dec 30, 2023]

@dmachard
Owner Author

dmachard commented Dec 31, 2023

It was almost supported so I made a minor code adjustment to consolidate all streams into one for metric computations.

If you want to test, you can append the following key to your Prometheus settings:

prometheus-labels: ["stream_global"]

With this modification, you should hit the 500k limit of the LRU cache (stream_id label will be removed)

Regarding memory and CPU usage, is it OK?
Thanks a lot for your feedback.

P.S.: if you want to know how many domains we have in total, don't forget to also count NXDomains (dnscollector_total_nxdomains_lru) and SERVFAIL (dnscollector_sfdomains_lru)

@johnhtodd

CPU and memory numbers look fine - no significant changes from previous behaviors. I've had an instance running for two full days - no issues, and the memory usage is staying below the thresholds presented. I will re-start with a more aggressive threshold (lower) to see if that changes my CPU loading, but I think that is just an academic exercise at this point.

Do the NXDOMAIN and SERVFAIL data also fall into the "dnscollector_total_domains_lru" number?

@dmachard
Owner Author

dmachard commented Jan 1, 2024

Do the NXDOMAIN and SERVFAIL data also fall into the "dnscollector_total_domains_lru" number?

See my previous #529 (comment) " if you want to know how many domains we have in total, don't forget to also count NXDomains (dnscollector_total_nxdomains_lru) and SERVFAIL (dnscollector_sfdomains_lru)"

Thanks for the feedback, I will merge soon.

@johnhtodd

Thank you for the comments, but I'm still not quite clear on the terminology. The term "dnscollector_total_domains_lru" would imply that it is the total of all possible subsets, regardless of rcode status. If it were only the "noerror" domains, then it would be expected that the metric would be "dnscollector_total_noerrordomains_lru".

It's fine that there is no single metric that shows the counter of all noerror, nxdomain, and servfail domains across all streams. If there are three metrics (dnscollector_total_nxdomains_lru, dnscollector_sfdomains_lru, and dnscollector_total_noerrordomains_lru) that have to be added, that is fine as long as they are unique counters of non-duplicated domains in each of those categories that look at all of the possible stream sets.

In addition, having those counters (dnscollector_noerrordomains_lru, dnscollector_nxdomains_lru, dnscollector_sfdomains_lru) for each stream is useful. The sum of each of these rcode sets across streams will almost always be (confusingly) larger than the corresponding dnscollector_total_* values for each set, since I assume your code keeps each domain once, but tags it with which streams have seen the domain?

Also, the presence or absence of a "stream_id" tag would imply whether a metric is per-stream or not. If there were no "stream_id" tag, then I would assume it is a de-duplicated counter of all possible domains of a particular rcode, across all streams.

Sorry to be so particular about the naming here, but it makes a significant difference in how the numbers are interpreted, which in turn determines whether staff who are not specifically intimate with the details of the code and the subtle distinctions of metric naming can manage the operation of the package in a meaningful way. Keeping Prometheus values straight is an important task in any large-scale operational setting, and I want to make sure this doesn't need to be re-done later, after many people have already made assumptions about what the metrics mean.

@dmachard
Owner Author

dmachard commented Jan 1, 2024

I prefer to remove any ambiguity in the metrics; here is my proposal:

  1. dnscollector_total_domains_lru: the total of all possible domains, regardless of rcode status and without duplication
  2. dnscollector_total_nonexistent_domains_lru: the total of all NXDOMAIN domains, with possible duplication
  3. dnscollector_total_servfail_domains_lru: the total of all SERVFAIL domains, with possible duplication
  4. dnscollector_total_noerror_domains_lru: the total of all NOERROR domains

Regarding memory usage, an LRU cache is associated with each metric, so whether each one is computed must be individually configurable. Duplicate entries can exist between the LRU caches for metrics 2, 3, and 4 because, for example, a specific domain can return NOERROR at one point and SERVFAIL later.

Keep in mind that these LRU caches are also used, and are mandatory, to compute the "top domains/requesters" metrics in real time (a rough sketch of that computation follows this list):

  • dnscollector_top_domains
  • dnscollector_top_nonexistent_domains
  • dnscollector_top_servfail_domains
  • dnscollector_top_noerror_domains
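
As a rough sketch of that computation (a hypothetical topDomains helper, not the real implementation), a top-N list can be derived by walking the per-domain hit counts held in one of these LRU caches; this again assumes github.com/hashicorp/golang-lru/v2/expirable:

```go
package promcache

import (
	"sort"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

type domainHits struct {
	Domain string
	Hits   int
}

// topDomains walks an LRU cache (domain -> query count) and returns the n
// most queried domains. Anything already evicted simply drops out of the
// ranking, which is what bounds the memory cost of these "top" metrics.
func topDomains(cache *expirable.LRU[string, int], n int) []domainHits {
	all := make([]domainHits, 0, cache.Len())
	for _, domain := range cache.Keys() {
		if hits, ok := cache.Peek(domain); ok { // Peek avoids refreshing recency
			all = append(all, domainHits{Domain: domain, Hits: hits})
		}
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Hits > all[j].Hits })
	if len(all) > n {
		all = all[:n]
	}
	return all
}
```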

Regarding the stream_id label: all metrics are per-stream by default, but in this branch you can add the setting prometheus-labels: ["stream_global"] to make the metrics unique across all streams and de-duplicated.
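
As a rough illustration of what "stream_global" changes (a hypothetical cacheFor helper, not the project's actual code): all streams share one cache and one label value instead of one per stream_id, so the single LRU limit applies to the combined traffic.

```go
package promcache

import (
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

// One LRU per label value, keyed by stream. With "stream_global" every
// message lands in the same bucket, so the size limit (500k by default)
// covers all streams combined; without it, each stream_id keeps its own
// cache and its own limit.
var domainCaches = map[string]*expirable.LRU[string, int]{}

func cacheFor(streamID string, streamGlobal bool) *expirable.LRU[string, int] {
	key := streamID
	if streamGlobal {
		key = "global" // stream_id label dropped: all streams consolidated
	}
	c, ok := domainCaches[key]
	if !ok {
		c = expirable.NewLRU[string, int](500000, nil, time.Hour)
		domainCaches[key] = c
	}
	return c
}
```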

@dmachard merged commit 8cd4d0f into main on Jan 3, 2024
62 checks passed
@dmachard deleted the prom_fix_potential_memory_leak branch January 3, 2024
@secure-xxx

Looks like a memory leak. We have been using one of the first versions since 2022; this week we tried to update because the Kafka logger was added. We are running the collector in k8s and watching it restart due to OOM.
  1. using old version
  2. update to 0.40
  3. downgrade to 0.32
[Screenshot]

@secure-xxx

0.32: 90 minutes of uptime without failures, with a stable rate of 40 ops

[Screenshot]

@dmachard
Owner Author

Looks like a memory leak. We have been using one of the first versions since 2022; this week we tried to update because the Kafka logger was added. We are running the collector in k8s and watching it restart due to OOM. 1. using old version 2. update to 0.40 3. downgrade to 0.32

Thanks for sharing that, can you track this in a new issue?

@secure-xxx

yep
