Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client: Return empty values when host stats fail #6349

Merged
merged 3 commits into from
Nov 20, 2019
Merged

Conversation

endocrimes
Copy link
Contributor

@endocrimes endocrimes commented Sep 18, 2019

Currently, there is an issue when running on Windows whereby under some
circumstances the Windows stats API's will begin to return errors (such
as internal timeouts) when a client is under high load, and potentially
other forms of resource contention / system states (and other unknown
cases).

When an error occurs during this collection, we then short circuit
further metrics emission from the client until the next interval.

This can be problematic if it happens for a sustained number of
intervals, as our metrics aggregator will begin to age out older
metrics, and we will eventually stop emitting various types of metrics
including nomad.client.unallocated.* metrics.

However, when metrics collection fails on Linux, gopsutil will in many cases
(e.g cpu.Times) silently return 0 values, rather than an error.

Here, we switch to returning empty metrics in these failures, and
logging the error at the source. This brings the behaviour into line
with Linux/Unix platforms, and although making aggregation a little
sadder on intermittent failures, will result in more desireable overall
behaviour of keeping metrics available for further investigation if
things look unusual.

Alternatives

There are a few alternative approaches, including only merging client_stats: Always emit client stats, but this has the downside of keeping the status quo of breaking many metrics for a single host collector failing. We could also try to do something a bit smarter about not emitting undetected metrics only rather than 0 values, but I'm not sure how valuable they would be.

Currently, there is an issue when running on Windows whereby under some
circumstances the Windows stats API's will begin to return errors (such
as internal timeouts) when a client is under high load, and potentially
other forms of resource contention / system states (and other unknown
cases).

When an error occurs during this collection, we then short circuit
further metrics emission from the client until the next interval.

This can be problematic if it happens for a sustained number of
intervals, as our metrics aggregator will begin to age out older
metrics, and we will eventually stop emitting various types of metrics
including `nomad.client.unallocated.*` metrics.

However, when metrics collection fails on Linux, gopsutil will in many cases
(e.g cpu.Times) silently return 0 values, rather than an error.

Here, we switch to returning empty metrics in these failures, and
logging the error at the source. This brings the behaviour into line
with Linux/Unix platforms, and although making aggregation a little
sadder on intermittent failures, will result in more desireable overall
behaviour of keeping metrics available for further investigation if
things look unusual.
Copy link
Contributor

@notnoop notnoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor code stylistic points. It does seem reasonable to me to publish metrics we are able to collect as best effort; it makes sense to prioritize incomplete metrics over no metrics at all if we cannot get complete metrics and don't know how to get them.

Would love to have another review by folks with more client and metrics knowledge, in case of any promises we make about invariants in metrics.

Also, failing test seems relevant.

@@ -1335,10 +1335,14 @@ func (tr *TaskRunner) emitStats(ru *cstructs.TaskResourceUsage) {

if ru.ResourceUsage.MemoryStats != nil {
tr.setGaugeForMemory(ru)
} else {
tr.logger.Debug("Skipping memory stats for allocation", "reason", "MemoryStats is nil")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would adjust the casing, we typically have lower case message. Also, MemoryStats is nil is low level implementation; would be nice to have a user facing message, maybe something like:

Suggested change
tr.logger.Debug("Skipping memory stats for allocation", "reason", "MemoryStats is nil")
tr.logger.Debug("memory stats for alloc is unpopulated; skipping publishing them")

@@ -117,39 +116,45 @@ func (h *HostStatsCollector) collectLocked() error {
// Determine up-time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this and Collect() still return an error? May make sense to make them void functions instead? Simplifies client.go implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - I'm planning on going back and refactoring this a little bit next week. Mostly wanted to get something together to help folks with debugging.

Copy link
Member

@nickethier nickethier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is intended to mimic gopsutil behavior, then maybe we have the collect*Stats funcs return the default values along with the error instead of needing to check and set them during error handling.

@endocrimes
Copy link
Contributor Author

@nickethier I think I'm ok with having the defaults setting here in the same place as handling the error, mostly bc it gives us a single place to look/change things under those conditions, but I'm not set on it.

@schmichael schmichael modified the milestones: 0.10.1, 0.10.2 Nov 5, 2019
@preetapan preetapan merged commit d4f801d into master Nov 20, 2019
@preetapan preetapan deleted the b-host-stats branch November 20, 2019 16:13
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 25, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants