
propolis logs lost on reboot #7012

Open
faithanalog opened this issue Nov 7, 2024 · 3 comments

Comments

@faithanalog
Contributor

faithanalog commented Nov 7, 2024

While investigating an apparent crucible bug, we ended up with a sled that had crashed into kmdb. We had retrieved some propolis logs from that sled, but not all the ones we wanted.

Upon rebooting the sled, all of the propolis datasets appeared to have been deleted, and the propolis logs we wanted along with them.

The same zpools were still present, so I do think this was some system cleaning up after VMs that no longer existed after the sled crashed, but I'm not sure exactly what did the cleanup.

@smklein
Collaborator

smklein commented Nov 7, 2024

See:

```rust
// Stores filesystems for zones
ExpectedDataset::new(ZONE_DATASET).wipe(),
```

```rust
// Identifies if the dataset should be deleted on boot
wipe: bool,
```

```rust
if dataset.wipe {
    match Zfs::get_oxide_value(name, "agent") {
        Ok(v) if &v == agent_local_value => {
            info!(log, "Skipping automatic wipe for dataset: {}", name);
        }
        Ok(_) | Err(_) => {
            info!(log, "Automatically destroying dataset: {}", name);
            Zfs::destroy_dataset(name).or_else(|err| {
                // If we can't find the dataset, that's fine -- it might
                // not have been formatted yet.
                if matches!(err.err, DestroyDatasetErrorVariant::NotFound) {
                    Ok(())
                } else {
                    Err(err)
                }
            })?;
        }
    }
}
```

Unless you're reporting this as a regression, I think this is currently expected behavior: at the moment, we destroy all transient zone filesystems when the sled reboots.
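For illustration, the escape hatch in the snippet above is the "agent" marker: a dataset whose agent value matches `agent_local_value` hits the skip arm instead of being destroyed. A minimal sketch of setting that marker, assuming `Zfs::get_oxide_value(name, "agent")` reads the ZFS user property `oxide:agent` (the `oxide:` prefix and the `mark_dataset_persistent` helper below are assumptions for this sketch, not code from the repo):

```rust
use std::process::Command;

/// Hypothetical helper: mark a dataset so the boot-time wipe is skipped.
/// Assumes the sled agent reads the marker from the ZFS user property
/// `oxide:agent`, so setting it to the agent's local value would make
/// the `Ok(v) if &v == agent_local_value` arm match.
fn mark_dataset_persistent(dataset: &str, agent_local_value: &str) -> std::io::Result<()> {
    let status = Command::new("zfs")
        .args(["set", &format!("oxide:agent={agent_local_value}"), dataset])
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "zfs set failed",
        ));
    }
    Ok(())
}
```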

There are related issues to make the set of datasets less "implicit" and more "managed by Nexus"; of those, a few are probably most relevant here.

In particular:

  • If we finish making Nexus aware of all U.2 dataset allocations, we can avoid this periodic "clear-on-reboot" behavior for garbage-collecting old instance filesystems.
  • ... then we can make more significant progress on re-constructing instance state, rather than destroying those filesystems on boot.

@faithanalog
Contributor Author

faithanalog commented Nov 8, 2024

This is all good background. I figured it was expected behavior, but I didn't know anything about the mechanism.

The thing that bothers me about this behavior is mainly the loss of diagnostic data in the log files. My hope is that we could archive the logs from the zone filesystem somewhere before destroying the dataset (though: how would we manage the lifecycle of those logs after we do this?)
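As a rough illustration of that idea (not an existing part of sled-agent), the destroy arm of the wipe loop above could copy the zone's SMF logs aside before calling `Zfs::destroy_dataset`. Both the `root/var/svc/log` location inside the zone dataset and the archive destination are assumptions here:

```rust
use std::fs;
use std::path::Path;

/// Hypothetical sketch: copy a zone dataset's SMF logs into an archive
/// directory before the dataset is destroyed. The `root/var/svc/log`
/// path and the archive destination are assumptions for illustration.
fn archive_zone_logs(zone_root: &Path, archive_dir: &Path) -> std::io::Result<()> {
    let log_dir = zone_root.join("root/var/svc/log");
    fs::create_dir_all(archive_dir)?;
    for entry in fs::read_dir(&log_dir)? {
        let entry = entry?;
        let path = entry.path();
        // Only grab plain log files; skip subdirectories and special files.
        if path.is_file() {
            let dest = archive_dir.join(entry.file_name());
            fs::copy(&path, &dest)?;
        }
    }
    Ok(())
}
```

The lifecycle question in the comment still stands: archived copies would need their own retention policy so the archive directory doesn't grow without bound.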

@smklein
Collaborator

smklein commented Nov 8, 2024

The Zone Bundler in sled-agent/src/zone_bundle.rs exists, and was created to take snapshots of unexpectedly dying zones. This may be a spot where we could re-use it.
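To make that concrete, here is a rough stand-in for what reusing a bundler in the wipe path could look like, assuming a bundle is just a tarball of the zone root. The real zone bundler in sled-agent/src/zone_bundle.rs has its own collection logic and format, so treat this as a sketch only:

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical stand-in for reusing the zone bundler before a dataset
/// is destroyed: tar up the zone root so its logs survive the wipe.
/// The real bundler collects logs and metadata with its own format.
fn bundle_zone(zone_root: &Path, out: &Path) -> std::io::Result<()> {
    let status = Command::new("tar")
        .arg("czf")
        .arg(out)
        .arg("-C")
        .arg(zone_root)
        .arg(".")
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "tar failed",
        ));
    }
    Ok(())
}
```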
