
prometheus logger: fix potential unlimited memory usage #529

Merged
merged 12 commits into main from prom_fix_potential_memory_leak
Jan 3, 2024

Conversation

dmachard
Owner

dmachard commented Dec 28, 2023

This PR tries to limit memory usage in the Prometheus logger.

  • integrate LRU cache
  • update docs
  • update prometheus_labels settings to ignore stream_id

The following are currently stored in memory without any limit:

  • number of queries made by each requestor
  • number of queries that ended in NOERROR (or any rcode other than NXDOMAIN and SERVFAIL)
  • number of queries that ended in NXDOMAIN
  • number of queries that ended in SERVFAIL
  • number of queries for each TLD
  • number of queries for each eTLD+1

The following metrics have been replaced (a rough sketch of the new gauge wiring follows the list below):

  • dnscollectors_domains_total (counter) -> dnscollectors_total_domains_lru (gauge)
  • dnscollectors_nxdomains_total (counter) -> dnscollectors_total_nxdomains_lru (gauge)
  • dnscollectors_sfdomains_total (counter) -> dnscollectors_total_sfdomains_lru (gauge)
  • dnscollectors_requesters_total (counter) -> dnscollectors_total_requesters_lru (gauge)
  • dnscollectors_tlds_total (counter) -> dnscollectors_total_tlds_lru (gauge)
  • dnscollectors_etldsplusone_total (counter) -> dnscollectors_total_etldsplusone_lru (gauge)
  • dnscollectors_suspicious_total (counter) -> dnscollectors_total_suspicious_lru (gauge)
  • dnscollectors_unanswered_total (counter) -> dnscollectors_total_unanswered_lru (gauge)
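
For illustration only (this is not the exact code in the PR), the sketch below shows why the replaced counters become gauges: each metric follows the length of an LRU cache, which can shrink when entries are evicted. It assumes github.com/hashicorp/golang-lru/v2/expirable and prometheus/client_golang; the recordDomain helper is a hypothetical name, not the real loggers code.

```go
package main

import (
	"fmt"
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
	"github.com/prometheus/client_golang/prometheus"
)

// Bounded set of observed domains: at most 500000 entries, each expiring
// after one hour. Eviction is what keeps memory usage under control.
var domainsLRU = expirable.NewLRU[string, int](500000, nil, time.Hour)

// A gauge rather than a counter, because evictions can make the value go down.
var totalDomainsLRU = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "dnscollector_total_domains_lru",
	Help: "Number of distinct domains currently held in the LRU cache",
})

func recordDomain(domain string) {
	if hits, ok := domainsLRU.Get(domain); ok {
		domainsLRU.Add(domain, hits+1) // existing entry: bump its query count
	} else {
		domainsLRU.Add(domain, 1) // new entry: may evict the oldest domain
	}
	// The gauge tracks the cache length, so it can decrease after evictions.
	totalDomainsLRU.Set(float64(domainsLRU.Len()))
}

func main() {
	prometheus.MustRegister(totalDomainsLRU)
	recordDomain("example.com")
	recordDomain("example.org")
	fmt.Println("domains tracked:", domainsLRU.Len()) // 2
}
```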

@johnhtodd

I see some typographical errors in loggers.go - perhaps change "requeters" to "requesters" in the yaml?

Also: what is the "size" referring to? Bytes, megabytes, number of items...? I assume number of items, but perhaps you have not yet updated the config file example with a description.

@dmachard
Owner Author

dmachard commented Dec 28, 2023

Thanks for catching the typo. The LRU approach works well to limit memory usage, but it costs CPU.

The size is the number of items in the cache.
It's not possible to set a maximum in MB with github.com/hashicorp/golang-lru (a sketch of how these settings map onto the cache follows the example below):

requesters-cache-size: 50000
requesters-cache-ttl: 3600
domains-cache-size: 50000
domains-cache-ttl: 3600
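
To picture how those keys could be applied, here is a minimal sketch assuming github.com/hashicorp/golang-lru/v2/expirable; the cacheSettings struct and newCaches helper are hypothetical names, not the actual loggers.go types. The key point is that "size" bounds the number of entries and "ttl" expires idle ones.

```go
package promcache

import (
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

// Hypothetical struct mirroring the YAML keys above; the field names and
// tags are illustrative, not the project's real configuration types.
type cacheSettings struct {
	RequestersCacheSize int `yaml:"requesters-cache-size"`
	RequestersCacheTTL  int `yaml:"requesters-cache-ttl"` // seconds
	DomainsCacheSize    int `yaml:"domains-cache-size"`
	DomainsCacheTTL     int `yaml:"domains-cache-ttl"` // seconds
}

// newCaches shows how "size" (a number of items, not bytes) and "ttl"
// (seconds) would translate into bounded, expiring LRU caches.
func newCaches(cfg cacheSettings) (requesters, domains *expirable.LRU[string, int]) {
	requesters = expirable.NewLRU[string, int](cfg.RequestersCacheSize, nil,
		time.Duration(cfg.RequestersCacheTTL)*time.Second)
	domains = expirable.NewLRU[string, int](cfg.DomainsCacheSize, nil,
		time.Duration(cfg.DomainsCacheTTL)*time.Second)
	return requesters, domains
}
```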

@dmachard changed the title from "prometheus logger: fix potential memory leak" to "prometheus logger: fix potential unlimited memory usage" on Dec 28, 2023
@johnhtodd

johnhtodd commented Dec 29, 2023

if this can get merged into the pipeline branch, I'll test ASAP and report comparative CPU usage.

@dmachard
Owner Author

This patch has been merged into the pipeline branch.
Any feedback will be appreciated.

@johnhtodd

Running now; looks good so far, but we will know shortly, once we start evicting domains from the LRU, what that does to CPU.

@johnhtodd

Testing questions: with the defaults, the number of domains stored should never be above 500000 - correct? (in your notes above, there is a typo of 50000) I am looking at dnscollector_total_domains_lru to measure this number. Currently, the value of that counter is 579000 so something is wrong. (branch: pipeline_mode, full clone half an hour ago, changes to all metrics with "_lru" are apparent so I know I'm running the right version.)

@dmachard
Owner Author

Correct, the default value is 500k: https://github.com/dmachard/go-dnscollector/blob/c7f54ac9bf32e3778b3af5ba437ab3e7f91892d6/pkgconfig/loggers.go#L337

Unless you overrode the default value in the config file?
I retested on my side and the gauge value is equal to the max value.

@johnhtodd

I did not override it in the config file, so this was using defaults. However, the good news is that the number has been dropping, but that may be due to timers and not to the maximum number (queries have been decreasing over the last few hours, so growth may be naturally diminishing). This graph is plotting sum(dnscollector_total_domains_lru) for my system. It peaks well over 600k names, which is far above the 500k maximum.

[Screenshot: graph of sum(dnscollector_total_domains_lru), Dec 29, 2023]

@johnhtodd

It may be worth noting that I have three feeds coming into this system from three different resolvers. Does this maximum value apply to the total number of names in memory, or is it per stream_id? If it is the latter, then perhaps this is expected behavior.

@dmachard
Owner Author

dmachard commented Dec 30, 2023

Does this maximum value apply to the total number of names in memory, or is it per stream_id ?

The maximum value is per stream_id

This graph is plotting sum(dnscollector_total_domains_lru) for my system.

Okay, you have one dimension in 'dnscollector_total_domains_lru,' which is the stream_id. The maximum value in your case (with the sum) should be 1.5 million.

Could you plot 'dnscollector_total_domains_lru' without the sum?

@johnhtodd

johnhtodd commented Dec 31, 2023

OK, this works. It's a bit confusing, since we don't know how many domains we have in total - we know for each stream, which may be nearly 100% overlapping, and there is no way to disambiguate those (yet) unless there is a custom metric for the true "total" of non-overlapping name space. This is not important; I don't much care, but it's interesting and someone else may care.
[Screenshot: graph of dnscollector_total_domains_lru per stream, Dec 30, 2023]

@dmachard
Owner Author

dmachard commented Dec 31, 2023

It was almost supported so I made a minor code adjustment to consolidate all streams into one for metric computations.

If you want to test, you can append the following key to your Prometheus settings:

prometheus-labels: ["stream_global"]

With this modification, you should hit the 500k limit of the LRU cache (stream_id label will be removed)

Regarding memory and CPU usage, is it OK?
Thanks a lot for your feedback.

P.S.: if you want to know how many domains we have in total, don't forget to also count NXDomains (dnscollector_total_nxdomains_lru) and SERVFAIL (dnscollector_sfdomains_lru)

@johnhtodd

CPU and memory numbers look fine - no significant changes from previous behaviors. I've had an instance running for two full days - no issues, and the memory usage is staying below the thresholds presented. I will re-start with a more aggressive threshold (lower) to see if that changes my CPU loading, but I think that is just an academic exercise at this point.

Do the NXDOMAIN and SERVFAIL data also fall into the "dnscollector_total_domains_lru" number?

@dmachard
Owner Author

dmachard commented Jan 1, 2024

Do the NXDOMAIN and SERVFAIL data also fall into the "dnscollector_total_domains_lru" number?

See my previous #529 (comment) " if you want to know how many domains we have in total, don't forget to also count NXDomains (dnscollector_total_nxdomains_lru) and SERVFAIL (dnscollector_sfdomains_lru)"

Thanks for the feedback, I will merge soon.

@johnhtodd

Thank you for the comments, but I'm still not quite clear on the terminology. The term "dnscollector_total_domains_lru" would imply that it is the total of all possible subsets, regardless of rcode status. If it were only the "noerror" domains, then it would be expected that the metric would be "dnscollector_total_noerrordomains_lru".

It's fine that there is no single metric that shows the counter of all noerror, nxdomain, and servfail domains across all streams. If there are three metrics (dnscollector_total_nxdomains_lru, dnscollector_sfdomains_lru, and dnscollector_total_noerrordomains_lru) that have to be added, that is fine as long as they are unique counters of non-duplicated domains in each of those categories that look at all of the possible stream sets.

In addition, having those counters (dnscollector_noerrordomains_lru, dnscollector_nxdomains_lru, dnscollector_sfdomains_lru) for each stream is useful. The sum of each of these rcode sets across streams will almost always be (confusingly) larger than the corresponding dnscollector_total_* values for each set, since I assume your code keeps each domain once, but tags it with which streams have seen the domain?

Also, the presence or absence of a "stream_id" tag would imply whether a metric is per-stream or not. If there were no "stream_id" tag, then I would assume it is a de-duplicated counter of all possible domains of a particular rcode, across all streams.

Sorry to be so particular about the naming here, but it makes a significant difference in how the numbers are interpreted, which in turn determines whether staff who are not specifically intimate with the details of the code and the subtle distinctions of metric naming can manage the operation of the package in a meaningful way. Keeping Prometheus values straight is an important task in any large-scale operational setting, and I want to make sure this doesn't need to be re-done later, after many people have already made assumptions about what the metrics mean.

@dmachard
Owner Author

dmachard commented Jan 1, 2024

I prefer to remove any ambiguity in the metrics; here is my proposal:

  1. dnscollector_total_domains_lru: the total of all possible domains, regardless of rcode status and without duplication
  2. dnscollector_total_nonexistent_domains_lru: the total of all NXDOMAIN domains, with possible duplication
  3. dnscollector_total_servfail_domains_lru: the total of all SERVFAIL domains, with possible duplication
  4. dnscollector_total_noerror_domains_lru: the total of all NOERROR domains

Regarding memory usage, an LRU cache is associated with each metric, so whether each one is computed must be individually configurable. Duplicate entries can exist between the LRU caches for metrics 2, 3, and 4 because, for example, a specific domain can return NOERROR at one point and SERVFAIL later.

Keep in mind that these LRU caches are also used, and are mandatory, to compute the "top domains/requesters" metrics in real time (a rough sketch of that computation follows this list):

  • dnscollector_top_domains
  • dnscollector_top_nonexistent_domains
  • dnscollector_top_servfail_domains
  • dnscollector_top_noerror_domains
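
As a rough sketch of that computation (a hypothetical topDomains helper, not the real implementation), a top-N list can be derived by walking the per-domain hit counts held in one of these LRU caches; this again assumes github.com/hashicorp/golang-lru/v2/expirable:

```go
package promcache

import (
	"sort"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

type domainHits struct {
	Domain string
	Hits   int
}

// topDomains walks an LRU cache (domain -> query count) and returns the n
// most queried domains. Anything already evicted simply drops out of the
// ranking, which is what bounds the memory cost of these "top" metrics.
func topDomains(cache *expirable.LRU[string, int], n int) []domainHits {
	all := make([]domainHits, 0, cache.Len())
	for _, domain := range cache.Keys() {
		if hits, ok := cache.Peek(domain); ok { // Peek avoids refreshing recency
			all = append(all, domainHits{Domain: domain, Hits: hits})
		}
	}
	sort.Slice(all, func(i, j int) bool { return all[i].Hits > all[j].Hits })
	if len(all) > n {
		all = all[:n]
	}
	return all
}
```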

Regarding the stream_id label: all metrics are per-stream by default, but in this branch you can add the setting prometheus-labels: ["stream_global"] to make the metrics unique across all streams and de-duplicated.
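
As a rough illustration of what "stream_global" changes (a hypothetical cacheFor helper, not the project's actual code): all streams share one cache and one label value instead of one per stream_id, so the single LRU limit applies to the combined traffic.

```go
package promcache

import (
	"time"

	"github.com/hashicorp/golang-lru/v2/expirable"
)

// One LRU per label value, keyed by stream. With "stream_global" every
// message lands in the same bucket, so the size limit (500k by default)
// covers all streams combined; without it, each stream_id keeps its own
// cache and its own limit.
var domainCaches = map[string]*expirable.LRU[string, int]{}

func cacheFor(streamID string, streamGlobal bool) *expirable.LRU[string, int] {
	key := streamID
	if streamGlobal {
		key = "global" // stream_id label dropped: all streams consolidated
	}
	c, ok := domainCaches[key]
	if !ok {
		c = expirable.NewLRU[string, int](500000, nil, time.Hour)
		domainCaches[key] = c
	}
	return c
}
```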

@dmachard merged commit 8cd4d0f into main on Jan 3, 2024
62 checks passed
@dmachard deleted the prom_fix_potential_memory_leak branch January 3, 2024
@secure-xxx

Looks like a memory leak. We have been using one of the first versions since 2022; this week we tried to update because the Kafka logger was added. We are running the collector in k8s and watching it restart due to OOM.
  1. using old version
  2. update to 0.40
  3. downgrade to 0.32
[Screenshot]

@secure-xxx

0.32: 90 minutes of uptime without failures, with a stable rate of 40 ops

[Screenshot]

@dmachard
Owner Author

Looks like a memory leak. We have been using one of the first versions since 2022; this week we tried to update because the Kafka logger was added. We are running the collector in k8s and watching it restart due to OOM. 1. using old version 2. update to 0.40 3. downgrade to 0.32

Thanks for sharing that, can you track this in a new issue?

@secure-xxx

yep
