client: Return empty values when host stats fail #6349

endocrimes · 2019-09-18T23:07:39Z

Currently, on Windows, the stats APIs can begin to return errors (such
as internal timeouts) under some circumstances: when a client is under
high load, and potentially under other forms of resource contention or
system states (and other unknown cases).

When an error occurs during this collection, we then short circuit
further metrics emission from the client until the next interval.

This can be problematic if it happens for a sustained number of
intervals, as our metrics aggregator will begin to age out older
metrics, and we will eventually stop emitting various types of metrics
including nomad.client.unallocated.* metrics.

However, when metrics collection fails on Linux, gopsutil will in many cases
(e.g. cpu.Times) silently return 0 values rather than an error.

Here, we switch to returning empty metrics on these failures and
logging the error at the source. This brings the behaviour into line
with Linux/Unix platforms. Although it makes aggregation a little
sadder on intermittent failures, it results in the more desirable overall
behaviour of keeping metrics available for further investigation if
things look unusual.
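
As a rough illustration of the behaviour described above, here is a minimal sketch of best-effort collection. It is not the code in this PR: `HostStats`, `collector`, its probe fields, and `errTimeout` are hypothetical stand-ins, and the standard `log` package is used in place of the client's logger. Each probe failure is logged where it happens and the corresponding field keeps its zero value, so the remaining stats are still returned.

```go
package main

import (
	"errors"
	"log"
)

// HostStats is a simplified, hypothetical stand-in for the client's host stats.
type HostStats struct {
	Uptime     uint64
	CPUPercent float64
	MemoryUsed uint64
}

// collector bundles hypothetical per-subsystem probes, each of which may fail.
type collector struct {
	uptime func() (uint64, error)
	cpu    func() (float64, error)
	memory func() (uint64, error)
}

// errTimeout simulates the kind of error a stats API might return under load.
var errTimeout = errors.New("internal timeout")

// collect gathers stats on a best-effort basis. A failing probe is logged at
// the source and its field is left at the zero value; collection continues.
func (c collector) collect() HostStats {
	var hs HostStats

	if v, err := c.uptime(); err != nil {
		log.Printf("failed to collect uptime stats: %v", err)
	} else {
		hs.Uptime = v
	}

	if v, err := c.cpu(); err != nil {
		log.Printf("failed to collect cpu stats: %v", err)
	} else {
		hs.CPUPercent = v
	}

	if v, err := c.memory(); err != nil {
		log.Printf("failed to collect memory stats: %v", err)
	} else {
		hs.MemoryUsed = v
	}

	return hs
}

func main() {
	c := collector{
		uptime: func() (uint64, error) { return 3600, nil },
		cpu:    func() (float64, error) { return 0, errTimeout },
		memory: func() (uint64, error) { return 1 << 30, nil },
	}
	// The CPU probe fails, but uptime and memory are still reported.
	log.Printf("collected: %+v", c.collect())
}
```

The trade-off is the one noted above: a consumer cannot distinguish a genuine 0 from a failed probe without consulting the logs.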

Alternatives

There are a few alternative approaches. One is to merge only the "client_stats: Always emit client stats" commit, but that keeps the status quo in which a single failing host collector breaks many metrics. We could also try to do something a bit smarter and skip only the metrics that could not be collected, rather than emitting 0 values for them, but I'm not sure how valuable that would be.

notnoop

Minor code stylistic points. It does seem reasonable to me to publish metrics we are able to collect as best effort; it makes sense to prioritize incomplete metrics over no metrics at all if we cannot get complete metrics and don't know how to get them.

Would love to have another review by folks with more client and metrics knowledge, in case of any promises we make about invariants in metrics.

Also, failing test seems relevant.

notnoop · 2019-09-19T01:07:58Z

client/allocrunner/taskrunner/task_runner.go

@@ -1335,10 +1335,14 @@ func (tr *TaskRunner) emitStats(ru *cstructs.TaskResourceUsage) {

 	if ru.ResourceUsage.MemoryStats != nil {
 		tr.setGaugeForMemory(ru)
+	} else {
+		tr.logger.Debug("Skipping memory stats for allocation", "reason", "MemoryStats is nil")


I would adjust the casing; we typically use lower-case messages. Also, "MemoryStats is nil" is a low-level implementation detail; it would be nice to have a user-facing message, maybe something like:

Suggested change

-		tr.logger.Debug("Skipping memory stats for allocation", "reason", "MemoryStats is nil")
+		tr.logger.Debug("memory stats for alloc is unpopulated; skipping publishing them")

notnoop · 2019-09-19T01:12:54Z

client/stats/host.go

@@ -117,39 +116,45 @@ func (h *HostStatsCollector) collectLocked() error {
 	// Determine up-time


Should this and Collect() still return an error? May make sense to make them void functions instead? Simplifies client.go implementation.

endocrimes · 2019-09-21

Yeah, I'm planning on going back and refactoring this a little bit next week. Mostly wanted to get something together to help folks with debugging.

nickethier

If it is intended to mimic gopsutil behavior, then maybe we have the collect*Stats funcs return the default values along with the error instead of needing to check and set them during error handling.

endocrimes · 2019-10-12T13:36:13Z

@nickethier I think I'm ok with setting the defaults here, in the same place as handling the error, mostly because it gives us a single place to look at and change things under those conditions, but I'm not set on it.
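
For readers comparing the two placements discussed in this thread, here is a minimal side-by-side sketch. All names (`MemoryStats`, `probeMemory`, `collectMemoryWithDefault`, `collectAll`, `errProbe`) are hypothetical and do not come from the PR. Option A has the collect function return a default value together with the error; option B keeps the error handling and the default in one place in the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// MemoryStats is a simplified, hypothetical stats type.
type MemoryStats struct {
	Total uint64
	Used  uint64
}

// errProbe simulates a failing stats API.
var errProbe = errors.New("stats API timeout")

// probeMemory is a hypothetical low-level probe that always fails here.
func probeMemory() (*MemoryStats, error) {
	return nil, errProbe
}

// Option A: the collect function returns a usable default along with the
// error, so callers never have to fill in the default themselves.
func collectMemoryWithDefault() (*MemoryStats, error) {
	ms, err := probeMemory()
	if err != nil {
		return &MemoryStats{}, err
	}
	return ms, nil
}

// Option B: the caller records the error and sets the default in the same
// place, giving a single location to inspect or change that behaviour.
func collectAll() (*MemoryStats, []error) {
	var errs []error

	ms, err := probeMemory()
	if err != nil {
		errs = append(errs, fmt.Errorf("memory: %v", err))
		ms = &MemoryStats{}
	}

	return ms, errs
}

func main() {
	a, errA := collectMemoryWithDefault()
	fmt.Printf("option A: stats=%+v err=%v\n", *a, errA)

	b, errs := collectAll()
	fmt.Printf("option B: stats=%+v errs=%v\n", *b, errs)
}
```

Both options yield the same empty stats on failure; they differ only in where the default lives.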

github-actions · 2023-01-25T02:16:21Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

endocrimes force-pushed the b-host-stats branch from ff316c8 to 9a7dbc8 Compare September 18, 2019 23:09

endocrimes added the stage/needs-discussion label Sep 18, 2019

endocrimes added 2 commits September 19, 2019 01:22

client_stats: Always emit client stats

c8ba938

endocrimes force-pushed the b-host-stats branch from 7ae5048 to c8ba938 Compare September 18, 2019 23:22

notnoop reviewed Sep 19, 2019

command: Improve metrics fail logging

4ba87cc

endocrimes mentioned this pull request Sep 21, 2019

Missing allocation resource use metrics #5928

Closed

endocrimes removed the stage/needs-discussion label Oct 11, 2019

endocrimes added this to the 0.10.1 milestone Oct 11, 2019

endocrimes requested a review from nickethier October 11, 2019 11:49

nickethier reviewed Oct 11, 2019

schmichael modified the milestones: 0.10.1, 0.10.2 Nov 5, 2019

tgross mentioned this pull request Nov 20, 2019

Endpoint metrics are not displayed #6712

Closed

preetapan merged commit d4f801d into master Nov 20, 2019

preetapan deleted the b-host-stats branch November 20, 2019 16:13

github-actions bot locked as resolved and limited conversation to collaborators Jan 25, 2023
