
propolis logs lost on reboot #7012

Open
faithanalog opened this issue Nov 7, 2024 · 3 comments

Comments

@faithanalog
Contributor

faithanalog commented Nov 7, 2024

While investigating an apparent crucible bug, we ended up with a sled that had crashed into kmdb. We had retrieved some propolis logs from that sled, but not all the ones we wanted.

Upon rebooting the sled, all of the propolis datasets appeared to have been deleted, and the propolis logs we wanted along with them.

The same zpools were still present, so I do think this was some system cleaning up after VMs that no longer existed after the sled crashed, but I'm not sure exactly what did the cleanup.

@smklein
Collaborator

smklein commented Nov 7, 2024

See:

```rust
// Stores filesystems for zones
ExpectedDataset::new(ZONE_DATASET).wipe(),
```

```rust
// Identifies if the dataset should be deleted on boot
wipe: bool,
```

```rust
if dataset.wipe {
    match Zfs::get_oxide_value(name, "agent") {
        Ok(v) if &v == agent_local_value => {
            info!(log, "Skipping automatic wipe for dataset: {}", name);
        }
        Ok(_) | Err(_) => {
            info!(log, "Automatically destroying dataset: {}", name);
            Zfs::destroy_dataset(name).or_else(|err| {
                // If we can't find the dataset, that's fine -- it might
                // not have been formatted yet.
                if matches!(err.err, DestroyDatasetErrorVariant::NotFound) {
                    Ok(())
                } else {
                    Err(err)
                }
            })?;
        }
    }
}
```

Unless you're reporting this as a regression, I think this is currently expected behavior: at the moment, we destroy all transient zone filesystems when the sled reboots.
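For illustration, the escape hatch in the snippet above is the "agent" marker: a dataset whose agent value matches `agent_local_value` hits the skip arm instead of being destroyed. A minimal sketch of setting that marker, assuming `Zfs::get_oxide_value(name, "agent")` reads the ZFS user property `oxide:agent` (the `oxide:` prefix and the `mark_dataset_persistent` helper below are assumptions for this sketch, not code from the repo):

```rust
use std::process::Command;

/// Hypothetical helper: mark a dataset so the boot-time wipe is skipped.
/// Assumes the sled agent reads the marker from the ZFS user property
/// `oxide:agent`, so setting it to the agent's local value would make
/// the `Ok(v) if &v == agent_local_value` arm match.
fn mark_dataset_persistent(dataset: &str, agent_local_value: &str) -> std::io::Result<()> {
    let status = Command::new("zfs")
        .args(["set", &format!("oxide:agent={agent_local_value}"), dataset])
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "zfs set failed",
        ));
    }
    Ok(())
}
```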

There are related issues to make the set of datasets less "implicit" and more "managed by Nexus"; of those, a few are probably most relevant here.

In particular:

  • If we finish making Nexus aware of all U.2 dataset allocations, we can avoid this periodic "clear-on-reboot" behavior for garbage-collecting old instance filesystems.
  • ... then we can make more significant progress on re-constructing instance state, rather than destroying those filesystems on boot.

@faithanalog
Contributor Author

faithanalog commented Nov 8, 2024

This is all good background. I figured it was expected behavior, but I didn't know anything about the mechanism.

The thing that bothers me about this behavior is mainly the loss of diagnostic data in the log files. My hope is that we could archive the logs from the zone filesystem somewhere before destroying the dataset (though: how would we manage the lifecycle of those logs after we do this?)
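As a rough illustration of that idea (not an existing part of sled-agent), the destroy arm of the wipe loop above could copy the zone's SMF logs aside before calling `Zfs::destroy_dataset`. Both the `root/var/svc/log` location inside the zone dataset and the archive destination are assumptions here:

```rust
use std::fs;
use std::path::Path;

/// Hypothetical sketch: copy a zone dataset's SMF logs into an archive
/// directory before the dataset is destroyed. The `root/var/svc/log`
/// path and the archive destination are assumptions for illustration.
fn archive_zone_logs(zone_root: &Path, archive_dir: &Path) -> std::io::Result<()> {
    let log_dir = zone_root.join("root/var/svc/log");
    fs::create_dir_all(archive_dir)?;
    for entry in fs::read_dir(&log_dir)? {
        let entry = entry?;
        let path = entry.path();
        // Only grab plain log files; skip subdirectories and special files.
        if path.is_file() {
            let dest = archive_dir.join(entry.file_name());
            fs::copy(&path, &dest)?;
        }
    }
    Ok(())
}
```

The lifecycle question in the comment still stands: archived copies would need their own retention policy so the archive directory doesn't grow without bound.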

@smklein
Collaborator

smklein commented Nov 8, 2024

The Zone Bundler in sled-agent/src/zone_bundle.rs exists, and was created to take snapshots of unexpectedly dying zones. This may be a spot where we could re-use it.
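To make that concrete, here is a rough stand-in for what reusing a bundler in the wipe path could look like, assuming a bundle is just a tarball of the zone root. The real zone bundler in sled-agent/src/zone_bundle.rs has its own collection logic and format, so treat this as a sketch only:

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical stand-in for reusing the zone bundler before a dataset
/// is destroyed: tar up the zone root so its logs survive the wipe.
/// The real bundler collects logs and metadata with its own format.
fn bundle_zone(zone_root: &Path, out: &Path) -> std::io::Result<()> {
    let status = Command::new("tar")
        .arg("czf")
        .arg(out)
        .arg("-C")
        .arg(zone_root)
        .arg(".")
        .status()?;
    if !status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            "tar failed",
        ));
    }
    Ok(())
}
```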
