xfs corruption, but no xfs_repair #8292

smira · 2024-02-09T17:34:18Z

Bug Report

The XFS partition got corrupted, but Talos didn't run xfs_repair.

Description

Logs

09/02/2024 11:41:47 [talos] task mountEphemeralPartition (1/1): starting
09/02/2024 11:41:47 XFS (sda6): Mounting V5 Filesystem
09/02/2024 11:41:47 XFS (sda6): totally zeroed log
09/02/2024 11:41:48 XFS (sda6): Corruption warning: Metadata has LSN (90:781352) ahead of current LSN (1:0). Please unmount and run xfs_repair (>= v4.3) to resolve.
09/02/2024 11:41:48 XFS (sda6): log mount/recovery failed: error -22
09/02/2024 11:41:48 XFS (sda6): log mount failed
09/02/2024 11:41:48 [talos] task mountEphemeralPartition (1/1): failed: error mounting: 1 error(s) occurred:
09/02/2024 11:41:49  invalid argument
09/02/2024 11:41:50 [talos] phase ephemeral (8/17): failed
09/02/2024 11:41:50 [talos] boot sequence: failed

Environment

Talos version: [talosctl version --nodes <problematic nodes>]
Kubernetes version: [kubectl version --short]
Platform:

smira · 2024-02-13T15:45:45Z

Filesystem Corruption Detection

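The filesystem is treated as needing repair when either of the following holds:

- The mount syscall returns EUCLEAN or EINVAL (more errors?).
- The META key 'needs_repair' is set:
  - Talos sets this key early on boot and removes it once the machine enters running & ready.
  - The user can set this key manually and reboot.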

Scenario: Talos boots up, mount() finishes successfully, but the filesystem is corrupted, so containerd fails to start, so the META key needs_repair is not removed, and on next reboot Talos will run xfs_repair.

Filesystem Repair

1. Try mounting the filesystem temporarily to replay the XFS log (ignore errors).
2. Run xfs_repair.
3. If it fails, go to step 1, but next time add -L.

smira · 2024-02-13T15:49:04Z

Two PRs:

1. One that adds EINVAL alongside EUCLEAN (backport to 1.6).
2. One that adds the needs_repair flag (1.7 only).

frezbo · 2024-05-13T15:45:55Z

Add EIO (-5) also

frezbo · 2024-05-13T15:52:14Z

> Add EIO (-5) also

Handled in #8733

goproslowyo · 2024-05-30T20:30:10Z

I have unfortunately just run into this while doing a talosctl upgrade of a node from 1.6.1 to 1.6.7 :(.

I tried booting into a livecd and attempted xfs_repair but that didn't seem to work. I also attempted adding the -L flag afterwards and that also didn't seem to help.

I am not sure how to proceed... I guess I wipe the node and start over?

smira · 2024-05-31T11:21:37Z

First of all, it's better to submit the logs; otherwise we're shooting in the dark as to what kind of issue this is.

But yes, on broken hardware xfs might be corrupted beyond repair, so wiping the filesystem is the only way out.

If e.g. only /var is corrupted, and this is a worker, or HA controlplane, a single partition can be wiped while preserving the rest:

talosctl reset -n NODE --system-labels-to-wipe=EPHEMERAL --reboot

goproslowyo · 2024-05-31T17:45:03Z

Yea, normally I would provide logs but the machine was in a boot loop where xfs_repair failed and then the node would reboot -- I couldn't grab them but probably could have if I was quicker.

I manually reset the node to maintenance mode via the GRUB menu and was able to rejoin the node to the cluster without issue. I am now updated to 1.7.4 so we'll see how it goes :)

Thanks @smira!

smira · 2024-12-26T16:39:40Z

See #9848

smira assigned frezbo Feb 9, 2024

frezbo added a commit to frezbo/talos that referenced this issue Feb 13, 2024

fix: run xfs_repair on invalid argument error

d40ff04

Run `xfs_repair` for invalid argument error. Fixes: siderolabs#8292 Signed-off-by: Noel Georgi <[email protected]>

frezbo mentioned this issue Feb 13, 2024

fix: run xfs_repair on invalid argument error #8310

Merged

frezbo added a commit to frezbo/talos that referenced this issue Feb 13, 2024

fix: run xfs_repair on invalid argument error

2f0421b

Run `xfs_repair` for invalid argument error. Part of: siderolabs#8292 Signed-off-by: Noel Georgi <[email protected]>

smira pushed a commit to smira/talos that referenced this issue Feb 21, 2024

fix: run xfs_repair on invalid argument error

3683687

Run `xfs_repair` for invalid argument error. Part of: siderolabs#8292 Signed-off-by: Noel Georgi <[email protected]> (cherry picked from commit 2f0421b)

dsseng pushed a commit to dsseng/talos that referenced this issue Mar 7, 2024

fix: run xfs_repair on invalid argument error

3817d6c

Run `xfs_repair` for invalid argument error. Part of: siderolabs#8292 Signed-off-by: Noel Georgi <[email protected]>

smira closed this as not planned Dec 26, 2024
