A reproducible testcase for ZFS hanging on IO error. #7469

Closed
zejn opened this issue Apr 22, 2018 · 6 comments
Labels: Status: Stale (no recent activity for issue)

Comments

zejn commented Apr 22, 2018

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  17.10.1, 18.04
Linux Kernel          4.13.0-21-generic (deb package linux-image-4.13.0-21-generic=4.13.0-21.24) on Ubuntu 17.10.1; 4.15.0-15-generic (deb package linux-image-4.15.0-15-generic=4.15.0-15.16) on Ubuntu 18.04
Architecture          amd64, x86_64
ZFS Version           0.6.5.11-1ubuntu3.3 on Ubuntu 17.10.1; 0.7.5-1ubuntu15 on Ubuntu 18.04
SPL Version           0.6.5.11-1ubuntu1 on Ubuntu 17.10.1; 0.7.5-1ubuntu1 on Ubuntu 18.04

Describe the problem you're observing

This is a pure software test case that reliably produces block device errors under ZFS by using the "error" mapping target from device mapper. It is intended to test how a file system reacts to a device error when issuing fsync, since different file systems behave differently.

The linked script creates a 200 MB file to back a loop device, sets up the loop device, then sets up device mapper to return an error on access to a 5 KB range somewhere in the block device. A zpool is created on the resulting device mapper block device, giving a file system which will reliably hit an error once enough data is written to it.
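
For illustration, a minimal sketch of that setup might look roughly like the following (the device names, the pool name, and the placement of the error range are assumptions for illustration, not values taken from the linked script):

# create a 200 MB backing file and attach it to a loop device
truncate -s 200M /tmp/zfs-backing.img
LOOP=$(losetup --find --show /tmp/zfs-backing.img)
SECTORS=$(blockdev --getsz "$LOOP")   # 512-byte sectors; 409600 for 200 MB

# map the device linearly, except for a 10-sector (5 KB) hole that returns EIO
dmsetup create errdev <<EOF
0 204800 linear $LOOP 0
204800 10 error
204810 $((SECTORS - 204810)) linear $LOOP 204810
EOF

# create a single-vdev pool on the faulty mapping
zpool create -f errpool /dev/mapper/errdev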

A tool then issues writes and fsyncs until the file system is full or hits another error.

The issue this tool tests is that on some file systems fsync returns success after a previous fsync error, even though the earlier data was not persisted, effectively losing data. More information about fsync losing data is available on Stack Overflow (http://stackoverflow.com/q/42434872/398670) and on the PostgreSQL mailing list (https://www.postgresql.org/message-id/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com).
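
As a rough shell approximation of what the tool does (the real tool lives in the linked repository and keeps a single file descriptor open across fsyncs, which this loop does not; the pool and file names are hypothetical):

# conv=fsync makes dd fsync the output and report failure via its exit status
i=0
while dd if=/dev/zero of=/errpool/datafile bs=8k count=1 \
         oflag=append conv=notrunc,fsync status=none
do
    i=$((i + 1))
done
echo "write or fsync failed after $i successful appends"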

ZFS does not lose data, but it hangs the reproducing tool.

On ZFS the tool hangs in the fsync call, which I assume is blocked on an error from the underlying device mapper block device, hit either on write or on read. After that, the z_wr_iss kernel threads use most of the CPU, load increases significantly, and the system becomes somewhat unresponsive.

I ran this in a VM with freshly installed Ubuntu (server release, both 17.10.1 and 18.04), so this is not a hardware issue; rather, it is a simulation of one. The question is what ZFS is expected to do when it hits a bad (unreadable and unwritable) sector, but making the system unresponsive is probably not desired.

Describe how to reproduce the problem

The tool that writes and fsyncs, together with a script that automates setting up the loopback device, the device mapper error target, and the zpool, is available here:
https://github.com/zejn/scrapcode/tree/master/testcases/fsync-error-clear

To run it, install Docker and run FSTYPE=zfs make in the fsync-error-clear folder.
Or, if you don't want Docker, copy fsync_test_setup.sh into standalone and run ./fsync_test_standalone.sh zfs "" "" "keepopen". You will probably want to do that in a VM, since the machine will become mostly unusable.

Include any warning/errors/backtraces from the system logs

There are txg_sync hangs in the logs.

bunder2015 (Contributor) commented:

Would it be possible to try this on 0.7.8? 0.6.5.11 is rather dated and a multitude of updates have occurred since it was released.

zejn (Author) commented Apr 22, 2018

I've also tested this on the upcoming Ubuntu 18.04 LTS, which has 0.7.5 (https://packages.ubuntu.com/bionic/zfs-dkms), and it shows the same behavior.

zejn (Author) commented Apr 22, 2018

I've updated the issue description with more information.

zejn (Author) commented Apr 23, 2018

Thanks for pointing that out.

To try to get around that, I've now set up a block device that is mapped into the VM via virtio and used it, instead of the loop device, as the underlying device for the device mapper error mapping. The result is the same, so I'm not sure whether ZFS also has a scheduling issue with device mapper.
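
Assuming the virtio disk appears inside the guest as /dev/vdb (a hypothetical name), the only change to the setup sketched earlier is the backing device handed to dmsetup:

DEV=/dev/vdb                        # virtio-backed disk instead of the loop device
SECTORS=$(blockdev --getsz "$DEV")
# ...then build the same linear/error/linear table on $DEV and create the pool on top of it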

These show up in the kernel log after a while:

[  484.318980] INFO: task zed:918 blocked for more than 120 seconds.
[  484.319367]       Tainted: P           O     4.15.0-15-generic #16-Ubuntu
[  484.319667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.320024] zed             D    0   918      1 0x00000000
[  484.320027] Call Trace:
[  484.320035]  __schedule+0x297/0x8b0
[  484.320038]  schedule+0x2c/0x80
[  484.320046]  cv_wait_common+0x11e/0x140 [spl]
[  484.320049]  ? wait_woken+0x80/0x80
[  484.320055]  __cv_wait+0x15/0x20 [spl]
[  484.320102]  txg_wait_synced+0xdd/0x130 [zfs]
[  484.320141]  spa_vdev_state_exit+0x8a/0x160 [zfs]
[  484.320178]  vdev_fault+0xb9/0x190 [zfs]
[  484.320215]  zfs_ioc_vdev_set_state+0x119/0x180 [zfs]
[  484.320251]  zfsdev_ioctl+0x1e0/0x610 [zfs]
[  484.320255]  do_vfs_ioctl+0xa8/0x630
[  484.320258]  ? __do_page_fault+0x270/0x4d0
[  484.320261]  ? __schedule+0x29f/0x8b0
[  484.320263]  SyS_ioctl+0x79/0x90
[  484.320266]  do_syscall_64+0x73/0x130
[  484.320270]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  484.320272] RIP: 0033:0x7f26ba36c5d7
[  484.320273] RSP: 002b:00007f26b88c2368 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  484.320276] RAX: ffffffffffffffda RBX: 00007f26b4012460 RCX: 00007f26ba36c5d7
[  484.320277] RDX: 00007f26b88c2370 RSI: 0000000000005a0d RDI: 0000000000000005
[  484.320278] RBP: 00007f26b88c2370 R08: 0000000000000000 R09: 0000000000000000
[  484.320279] R10: 0000000000000000 R11: 0000000000000246 R12: 9309f1d79595e689
[  484.320280] R13: 0000561440dfc690 R14: 00007f26b88c5920 R15: 000000000000000b
[  484.320294] INFO: task txg_sync:1964 blocked for more than 120 seconds.
[  484.320586]       Tainted: P           O     4.15.0-15-generic #16-Ubuntu
[  484.320903] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.321231] txg_sync        D    0  1964      2 0x80000000
[  484.321234] Call Trace:
[  484.321237]  __schedule+0x297/0x8b0
[  484.321240]  schedule+0x2c/0x80
[  484.321242]  io_schedule+0x16/0x40
[  484.321248]  cv_wait_common+0xb2/0x140 [spl]
[  484.321250]  ? wait_woken+0x80/0x80
[  484.321255]  __cv_wait_io+0x18/0x20 [spl]
[  484.321292]  zio_wait+0xf8/0x1b0 [zfs]
[  484.321324]  dsl_pool_sync+0x3e5/0x430 [zfs]
[  484.321361]  spa_sync+0x43e/0xd80 [zfs]
[  484.321398]  txg_sync_thread+0x2cd/0x4a0 [zfs]
[  484.321432]  ? txg_quiesce_thread+0x3d0/0x3d0 [zfs]
[  484.321437]  thread_generic_wrapper+0x74/0x90 [spl]
[  484.321441]  kthread+0x121/0x140
[  484.321445]  ? __thread_exit+0x20/0x20 [spl]
[  484.321448]  ? kthread_create_worker_on_cpu+0x70/0x70
[  484.321450]  ? do_syscall_64+0x73/0x130
[  484.321452]  ? SyS_exit_group+0x14/0x20
[  484.321455]  ret_from_fork+0x35/0x40
[  484.321458] INFO: task fsync-error-cle:2032 blocked for more than 120 seconds.
[  484.321799]       Tainted: P           O     4.15.0-15-generic #16-Ubuntu
[  484.322089] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.322411] fsync-error-cle D    0  2032   1858 0x00000004
[  484.322413] Call Trace:
[  484.322417]  __schedule+0x297/0x8b0
[  484.322419]  ? __wake_up_common+0x73/0x130
[  484.322421]  schedule+0x2c/0x80
[  484.322427]  cv_wait_common+0x11e/0x140 [spl]
[  484.322429]  ? wait_woken+0x80/0x80
[  484.322434]  __cv_wait+0x15/0x20 [spl]
[  484.322469]  txg_wait_open+0xb0/0x100 [zfs]
[  484.322499]  dmu_tx_wait+0x382/0x390 [zfs]
[  484.322501]  ? _cond_resched+0x19/0x40
[  484.322528]  dmu_tx_assign+0x83/0x470 [zfs]
[  484.322564]  zfs_write+0x3f6/0xd80 [zfs]
[  484.322567]  ? __enqueue_entity+0x5c/0x60
[  484.322571]  ? ttwu_do_wakeup+0x1e/0x150
[  484.322574]  ? try_to_wake_up+0x59/0x490
[  484.322609]  zpl_write_common_iovec+0x8c/0xe0 [zfs]
[  484.322643]  zpl_iter_write+0xae/0xe0 [zfs]
[  484.322646]  new_sync_write+0xe7/0x140
[  484.322649]  __vfs_write+0x29/0x40
[  484.322652]  vfs_write+0xb1/0x1a0
[  484.322655]  SyS_write+0x55/0xc0
[  484.322657]  do_syscall_64+0x73/0x130
[  484.322661]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  484.322662] RIP: 0033:0x7fc1df3f2154
[  484.322663] RSP: 002b:00007fff56310288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  484.322665] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc1df3f2154
[  484.322666] RDX: 000000000000a000 RSI: 0000555a37c42060 RDI: 0000000000000003
[  484.322668] RBP: 00007fff563102d0 R08: 00007fc1df6cf8c0 R09: 00007fc1df8f24c0
[  484.322669] R10: 000000000000002e R11: 0000000000000246 R12: 0000555a37a40b60
[  484.322670] R13: 00007fff563103b0 R14: 0000000000000000 R15: 0000000000000000

dweeezil (Contributor) commented:

The original author's test mainly shows that ZFS will almost never return an EIO to the userland process when faced with a pool that has no redundancy and underlying vdevs which themselves repeatedly return EIOs to the zio layer. I ran the author's test manually, and what happens is that when the bad section of the vdev is hit, self-healing comes into play and everything works just fine. Eventually, the pool fills to the point where that is no longer possible, and the pool will likely become DEGRADED.

If there's a bug here, it's that the EIOs should eventually be propagated back to the userland process. Finally, I'll note that this has nothing to do with fsync() and can be demonstrated simply by attempting to fill the pool with data in the author's test setup.
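
For instance, in the author's setup something as simple as the following (with a hypothetical pool name) should eventually hit the bad region without fsync() being involved at all:

dd if=/dev/zero of=/errpool/fill bs=1M    # no fsync; just fill the pool until writes reach the error range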

I suppose this would be similar to the conditions under which some file systems (ext4) automatically engage a read-only mode.

stale bot commented Aug 25, 2020

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

stale bot added the Status: Stale label on Aug 25, 2020
stale bot closed this as completed on Nov 24, 2020