-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
client: emit optional telemetry from prerun and prestart hooks. #24556
Conversation
The Nomad client can now optionally emit telemetry data from the prerun and prestart hooks. This allows operators to monitor and alert on failures and time taken to complete. The new datapoints are: - nomad.client.alloc_hook.prerun.success (counter) - nomad.client.alloc_hook.prerun.failed (counter) - nomad.client.alloc_hook.prerun.elapsed (sample) - nomad.client.task_hook.prestart.success (counter) - nomad.client.task_hook.prestart.failed (counter) - nomad.client.task_hook.prestart.elapsed (sample) The hook execution time is useful to Nomad engineering and will help optimize code where possible and understand job specification impacts on hook performance. Currently only the PreRun and PreStart hooks have telemetry enabled, so we limit the number of new metrics being produced.
// If the operator has disabled hook metrics, then don't call the time | ||
// function to save 30ns per hook. | ||
var hookExecutionStart time.Time | ||
|
||
if !ar.clientConfig.DisableAllocationHookMetrics { | ||
hookExecutionStart = time.Now() | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought, but what if the HookStatsHandler
interface also required a Start()
implementation that set the start time, which would be kept as internal state of the handler (and would be a noop of the noop handler).
Might keep some of that implementation detail out of this alloc runner hook code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did think of that while writing the code.
I decided against it as all the hooks that emit telemetry use a single HookStatsHandler
within the task or alloc runner. It would need to assume each hook is called without any concurrency with a blind assumption the start time is meant for the hook when calling Emit. This felt wrong to me.
I also considered having a HookStatsHandler
implementation per hook call, but did not go down this route to avoid overhead in setting this up per hook and the memory overhead with additional label setup and such.
Let me know your thoughts, I am happy to change the approach if we feel the advantages are worth it and our assumptions are correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Description
The Nomad client can now optionally emit telemetry data from the prerun and prestart hooks. This allows operators to monitor and alert on failures and time taken to complete.
The new datapoints are:
nomad.client.alloc_hook.prerun.success (counter)
nomad.client.alloc_hook.prerun.failed (counter)
nomad.client.alloc_hook.prerun.elapsed (sample)
nomad.client.task_hook.prestart.success (counter)
nomad.client.task_hook.prestart.failed (counter)
nomad.client.task_hook.prestart.elapsed (sample)
The hook execution time is useful to Nomad engineering and will help optimize code where possible and understand job specification impacts on hook performance.
Currently only the PreRun and PreStart hooks have telemetry enabled, so we limit the number of new metrics being produced.
Testing & Reproduction steps
I tested on a Debian 12 arm64 using a single agent in server/client mode with this configuration.
while true; do; nomad job dispatch sleep; sleep 2; done
to generate metricspromana monitoring job spec:
sleep job spec:
Links
Jira: https://hashicorp.atlassian.net/browse/NET-11237
RFC: https://go.hashi.co/rfc/nmd-206
Contributor Checklist
changelog entry using the
make cl
command.ensure regressions will be caught.
and job configuration, please update the Nomad website documentation to reflect this. Refer to
the website README for docs guidelines. Please also consider whether the
change requires notes within the upgrade guide.
Reviewer Checklist
backporting document.
in the majority of situations. The main exceptions are long-lived feature branches or merges where
history should be preserved.
within the public repository.