xfs corruption, but no xfs_repair #8292

Closed
smira opened this issue Feb 9, 2024 · 8 comments

@smira
Member

smira commented Feb 9, 2024

Bug Report

XFS partition got corrupted, but Talos didn't run xfs-repair.

Description

Logs

09/02/2024 11:41:47 [talos] task mountEphemeralPartition (1/1): starting
09/02/2024 11:41:47 XFS (sda6): Mounting V5 Filesystem
09/02/2024 11:41:47 XFS (sda6): totally zeroed log
09/02/2024 11:41:48 XFS (sda6): Corruption warning: Metadata has LSN (90:781352) ahead of current LSN (1:0). Please unmount and run xfs_repair (>= v4.3) to resolve.
09/02/2024 11:41:48 XFS (sda6): log mount/recovery failed: error -22
09/02/2024 11:41:48 XFS (sda6): log mount failed
09/02/2024 11:41:48 [talos] task mountEphemeralPartition (1/1): failed: error mounting: 1 error(s) occurred:
09/02/2024 11:41:49  invalid argument
09/02/2024 11:41:50 [talos] phase ephemeral (8/17): failed
09/02/2024 11:41:50 [talos] boot sequence: failed

Environment

  • Talos version: [talosctl version --nodes <problematic nodes>]
  • Kubernetes version: [kubectl version --short]
  • Platform:
frezbo added a commit to frezbo/talos that referenced this issue Feb 13, 2024
Run `xfs_repair` for invalid argument error.

Fixes: siderolabs#8292

Signed-off-by: Noel Georgi <[email protected]>
@smira
Member Author

smira commented Feb 13, 2024

Filesystem Corruption Detection

  1. The mount syscall returns EUCLEAN or EINVAL (possibly other errors as well?).
  2. The META key 'needs_repair' is set:
     • Talos sets this key early in boot and removes it once the machine enters running & ready
     • a user can set this key manually and reboot

Scenario: Talos boots and mount() succeeds, but the filesystem is corrupted, so containerd fails to start. Because the machine never reaches running & ready, the META key needs_repair is not removed, and on the next reboot Talos runs xfs_repair.

Filesystem Repair

  1. Try mounting the filesystem temporarily to replay the XFS log, ignoring any errors.
  2. Run xfs_repair.
  3. If xfs_repair fails, go back to step 1, and this time run xfs_repair with -L.

@smira
Member Author

smira commented Feb 13, 2024

Two PRs:

  1. One adds EINVAL alongside EUCLEAN as a trigger for xfs_repair (backport to 1.6).
  2. One adds the needs_repair flag (1.7 only).

frezbo added a commit to frezbo/talos that referenced this issue Feb 13, 2024
Run `xfs_repair` for invalid argument error.

Part of: siderolabs#8292

Signed-off-by: Noel Georgi <[email protected]>
smira pushed a commit to smira/talos that referenced this issue Feb 21, 2024
Run `xfs_repair` for invalid argument error.

Part of: siderolabs#8292

Signed-off-by: Noel Georgi <[email protected]>
(cherry picked from commit 2f0421b)
dsseng pushed a commit to dsseng/talos that referenced this issue Mar 7, 2024
Run `xfs_repair` for invalid argument error.

Part of: siderolabs#8292

Signed-off-by: Noel Georgi <[email protected]>
@frezbo
Member

frezbo commented May 13, 2024

Add EIO (-5) also

@frezbo
Member

frezbo commented May 13, 2024

Add EIO (-5) also

Handled in #8733

@goproslowyo

I have unfortunately just run into this while doing a talosctl upgrade of a node from 1.6.1 to 1.6.7 :(.

I tried booting into a livecd and attempted xfs_repair but that didn't seem to work. I also attempted adding the -L flag afterwards and that also didn't seem to help.

I am not sure how to proceed... I guess I wipe the node and start over?

@smira
Member Author

smira commented May 31, 2024

First of all, it's better to submit the logs; without them, figuring out what kind of issue this is amounts to shooting in the dark.

But yes, on broken hardware xfs might be corrupted beyond repair, so wiping the filesystem is the only way out.

If, for example, only /var is corrupted and the node is a worker or part of an HA control plane, that single partition can be wiped while preserving the rest:

talosctl reset -n NODE --system-labels-to-wipe=EPHEMERAL --reboot

@goproslowyo

Yea, normally I would provide logs, but the machine was in a boot loop where xfs_repair failed and then the node rebooted. I couldn't grab them, but probably could have if I had been quicker.

I manually reset the node to maintenance mode via the GRUB menu and was able to rejoin the node to the cluster without issue. I am now updated to 1.7.4 so we'll see how it goes :)

Thanks @smira!

@smira
Member Author

smira commented Dec 26, 2024

See #9848

smira closed this as not planned on Dec 26, 2024