Can't import NVME root pool after trim and scrub #14643

Open
Freewalkr opened this issue Mar 17, 2023 · 4 comments
Labels: Type: Defect (Incorrect behavior, e.g. crash, hang)


Freewalkr commented Mar 17, 2023

System information

Type                  Version/Name
Distribution Name     Arch Linux
Distribution Version  LTS
Kernel Version        6.1.19-1
Architecture          x86_64
OpenZFS Version       zfs-2.1.99-1, ahrens-raidz-expand branch

Describe the problem you're observing

I'm using the ZFS version from the zfs-dkms-raidz-expansion-git AUR package and have root on ZFS on an NVMe drive. The NVMe drive is not part of any RAID-Z; it's the only drive in the pool. I hadn't run a scrub on this pool for about half a year, and I upgraded to the raidz-expand branch about 3 months ago.

Yesterday I deleted some big files in the root pool, created and deleted a couple of snapshots, ran a trim manually (it completed without errors) and then ran a scrub on the pool. While the scrub was in progress, zpool status responded fine, but after the scrub finished, zpool status hung in the "uninterruptible sleep" state. I didn't look into the logs, waited for 12 hours, rebooted, and the system didn't come back.

I booted Manjaro 6.1.19-1, installed on another drive with the same ZFS version, and tried to import the root pool from the NVMe drive. The import went into uninterruptible sleep too; there's a verification panic. The dmesg output is attached below.
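
For reference, the sequence described above maps roughly onto the following commands (a sketch only; the pool name zroot1 is taken from the zpool status output later in this issue):

# Rough command-level version of the reported sequence; zroot1 is the pool
# name shown later in the thread.
zpool trim zroot1        # start a manual TRIM of the pool
zpool status -t zroot1   # -t also shows per-vdev TRIM progress
zpool scrub zroot1       # start the scrub once the TRIM has finished
zpool status zroot1      # this is the command that hung after the scrub completed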

Describe how to reproduce the problem

  1. Use the ahrens-raidz-expand branch (maybe this is specific to the zfs-dkms-raidz-expansion-git package?).
  2. Upgrade the pool on the NVMe drive with the new features.
  3. Wait for about 3 months.
  4. Delete some files, run a trim manually.
  5. Run a scrub.

Include any warning/errors/backtraces from the system logs

dmesg output when trying to import the pool:

[   82.223202] VERIFY3(0 == dmu_object_free(spa->spa_meta_objset, spa_err_obj, tx)) failed (0 == 2)
[   82.223205] PANIC at spa_errlog.c:1068:delete_errlog()
[   82.223207] Showing stack for process 8492
[   82.223208] CPU: 5 PID: 8492 Comm: txg_sync Tainted: P           OE      6.1.19-1-MANJARO #1 389fdd3d7a99644f7437b855eb50ad5703eff2d2
[   82.223210] Hardware name: Gigabyte Technology Co., Ltd. X570S AERO G/X570S AERO G, BIOS F4c 05/12/2022
[   82.223211] Call Trace:
[   82.223212]  <TASK>
[   82.223214]  dump_stack_lvl+0x48/0x60
[   82.223218]  spl_panic+0xf4/0x10c [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[   82.223227]  spa_errlog_sync+0x2b7/0x2d0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223299]  ? spa_change_guid_check+0xe0/0xe0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223355]  ? spa_change_guid_check+0xe0/0xe0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223405]  ? spa_sync+0x554/0xf70 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223455]  ? spa_txg_history_init_io+0x117/0x120 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223513]  ? txg_sync_thread+0x201/0x3a0 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223567]  ? txg_fini+0x260/0x260 [zfs 97fa2db2161ede37c6882f701ff1ae4988ad0056]
[   82.223618]  ? spl_taskq_fini+0x80/0x80 [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[   82.223624]  ? thread_generic_wrapper+0x5e/0x70 [spl d5e4e55912190c05565b79c74a4165e6160071f9]
[   82.223629]  ? kthread+0xde/0x110
[   82.223631]  ? kthread_complete_and_exit+0x20/0x20
[   82.223633]  ? ret_from_fork+0x22/0x30
[   82.223636]  </TASK>
Freewalkr added the Type: Defect (Incorrect behavior, e.g. crash, hang) label on Mar 17, 2023
Freewalkr (Author) commented:

I've imported the pool in read-only mode. zpool status output:

[fwkr@fwkr-x570saerog ~]$ zpool status zroot1
  pool: zroot1
 state: ONLINE
  scan: scrub in progress since Fri Mar 17 02:11:47 2023
	238G scanned at 0B/s, 238G issued at 0B/s, 238G total
	0B repaired, 99.99% done, no estimated completion time
config:

	NAME         STATE     READ WRITE CKSUM
	zroot1       ONLINE       0     0     0
	  nvme0n1p2  ONLINE       0     0     0

errors: No known data errors
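
A read-only import like the one above can be done from a rescue environment along these lines (a sketch; -N skips mounting the datasets and is optional):

# Import the pool read-only without mounting its datasets, then check it.
zpool import -o readonly=on -N zroot1
zpool status zroot1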

systemmonkey42 commented:

I made the mistake of running a ZFS trim. The trim never finished, and the system hung.

I can now only import the pool in read-only mode. Anything else hangs and never completes.

I can't cancel the trim, and I can't mount the pool read/write anymore. What are my options?

Freewalkr (Author) commented:

> I made the mistake of running a ZFS trim. The trim never finished, and the system hung.
>
> I can now only import the pool in read-only mode. Anything else hangs and never completes.
>
> I can't cancel the trim, and I can't mount the pool read/write anymore. What are my options?

I managed to import the pool in read-only mode while booting from another disk. Then I did a zfs send of zroot1 to a file (I had a snapshot from right before the trim), destroyed zroot1, created it again, and ran zfs receive from the file. You need to regenerate zpool.cache and the initramfs after that. Now everything works great.
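
A minimal sketch of that recovery, assuming the pre-trim snapshot is called zroot1@pre-trim, the backup file lives on another disk at /mnt/backup/zroot1.zfs, and the pool sits on /dev/nvme0n1p2 as shown in the status output above (the snapshot and file names are illustrative; on Arch-based systems the initramfs is typically rebuilt with mkinitcpio):

# From the rescue system: import read-only and stream the pre-trim snapshot
# to a file on another disk (snapshot/file names here are only examples).
zpool import -o readonly=on zroot1
zfs send -R zroot1@pre-trim > /mnt/backup/zroot1.zfs
zpool export zroot1

# Recreate the pool on the same partition (this destroys the old pool) and
# restore the saved stream into it.
zpool create -f zroot1 /dev/nvme0n1p2
zfs receive -F zroot1 < /mnt/backup/zroot1.zfs

# Regenerate the cachefile and, on Arch-based systems, the initramfs.
zpool set cachefile=/etc/zfs/zpool.cache zroot1
mkinitcpio -P

For a root pool you would normally also re-check the bootfs property and the bootloader configuration before rebooting; those steps aren't shown here.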


baslking commented Aug 9, 2024

I have a very similar situation.
Arm64 6.6.44 kernel with ZFS 2.2.4, which has been running for several years through thick and thin. I activated a monthly trim job, and once it launched, it hung and made no progress; now my import hangs too. A readonly=on import works fine.
Is there no solution for this?
It seems that, at the very least, some serious warnings about using trim should be given!
Is there no way to cancel the trim operations when doing an import?

[ 3021.820408] INFO: task zpool:4297 blocked for more than 845 seconds.
[ 3021.826930]       Tainted: P         C O       6.6.44-1-rpi #1
[ 3021.832838] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3021.840713] task:zpool           state:D stack:0     pid:4297  ppid:1259   flags:0x0000000c
[ 3021.849077] Call trace:
[ 3021.851517]  __switch_to+0xe0/0x178
[ 3021.855014]  __schedule+0x37c/0xaf0
[ 3021.858504]  schedule+0x64/0x108
[ 3021.861733]  spl_panic+0x110/0x120 [spl]
[ 3021.865696]  vdev_trim_calculate_progress+0x34c/0x370 [zfs]
[ 3021.871772]  vdev_trim_load+0x38/0x150 [zfs]
[ 3021.876482]  vdev_trim_restart+0x128/0x250 [zfs]
[ 3021.881498]  vdev_trim_restart+0x54/0x250 [zfs]
[ 3021.886437]  vdev_trim_restart+0x54/0x250 [zfs]
[ 3021.891369]  spa_load+0x1648/0x1770 [zfs]
[ 3021.895762]  spa_load_best+0x5c/0x2b0 [zfs]
[ 3021.900332]  spa_import+0x1ec/0x608 [zfs]
[ 3021.904724]  zfs_ioc_pool_import+0x14c/0x178 [zfs]
[ 3021.909907]  zfsdev_ioctl_common+0x808/0x890 [zfs]
[ 3021.915094]  zfsdev_ioctl+0x70/0x108 [zfs]
[ 3021.919599]  __arm64_sys_ioctl+0xb4/0x100
[ 3021.923632]  invoke_syscall+0x50/0x120
[ 3021.927390]  el0_svc_common.constprop.0+0x48/0xf0
[ 3021.932098]  do_el0_svc+0x24/0x38
[ 3021.935416]  el0_svc+0x40/0xe8
[ 3021.938475]  el0t_64_sync_handler+0x120/0x130
[ 3021.942836]  el0t_64_sync+0x190/0x198
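
For what it's worth, a running TRIM can normally be cancelled with zpool trim -c, but that requires the pool to be imported read-write, which is exactly the import that hangs here; the only workaround reported in this thread is the readonly=on import (sketch below, using a hypothetical pool name tank):

# Cancelling a TRIM needs a writable import, so it does not help when the
# import itself hangs; shown only for reference (pool name is hypothetical).
zpool trim -c tank

# The workaround reported above: import read-only to get at the data.
zpool import -o readonly=on tank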
