Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logs: fix missing allocation logs after update to Nomad 1.5.4 #17087

Merged
merged 2 commits into from
May 4, 2023

Conversation

tgross
Copy link
Member

@tgross tgross commented May 4, 2023

When the server restarts for the upgrade, it loads the structs.Job from the Raft snapshot/logs. The jobspec has long since been parsed, so none of the guards around the default value are in play. The empty field value for Enabled is the zero value, which is false.

This doesn't impact any running allocation because we don't replace running allocations when either the client or server restart. But as soon as any allocation gets rescheduled (ex. you drain all your clients during upgrades), it'll be using the structs.Job that the server has, which has Enabled = false, and logs will not be collected.

This changeset fixes the bug by adding a new field Disabled which defaults to false (so that the zero value works), and deprecates the old field.

Fixes #17076

When the server restarts for the upgrade, it loads the `structs.Job` from the
Raft snapshot/logs. The jobspec has long since been parsed, so none of the
guards around the default value are in play. The empty field value for `Enabled`
is the zero value, which is false.

This doesn't impact any running allocation because we don't replace running
allocations when either the client or server restart. But as soon as any
allocation gets rescheduled (ex. you drain all your clients during upgrades),
it'll be using the `structs.Job` that the server has, which has `Enabled =
false`, and logs will not be collected.

This changeset fixes the bug by adding a new field `Disabled` which defaults to
false (so that the zero value works), and deprecates the old field.

Fixes #17076
api/tasks.go Outdated Show resolved Hide resolved
command/agent/job_endpoint.go Outdated Show resolved Hide resolved
jobspec/parse_task.go Outdated Show resolved Hide resolved
@tgross
Copy link
Member Author

tgross commented May 4, 2023

@schmichael I've addressed your comments and added some extra tests around those cases. I've also done bench testing across upgrades: both from 1.5.3 to 1.5.5-dev and from 1.5.3 to 1.5.4 to 1.5.5-dev. The nice thing I discovered with the fix is that allocations with missing logmon will automatically get a logmon created when upgrading from 1.5.4 to 1.5.5-dev without having to reschedule the allocation. The allocrunner prestart hook does the right thing for us.

@tgross tgross requested a review from schmichael May 4, 2023 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.5.x backport to 1.5.x release line theme/logging type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nomad v1.5.4: Missing allocation log files on upgrade
2 participants