ZVOL: sync=always is not being honored #12391

Closed
arun-kv opened this issue Jul 19, 2021 · 17 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@arun-kv
Contributor

arun-kv commented Jul 19, 2021

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  20.04.2 LTS
Kernel Version        5.8.0-55-generic
Architecture          x86_64
OpenZFS Version       zfs-2.1.99-327_g88a483303

Describe the problem you're observing

zvol (sync=always): write(O_SYNC) returns success when the iSCSI disk is not accessible.
The application gets a write success from a zvol created with sync=always even when the disk is not accessible.
Note: only the first write after the disk becomes inaccessible is wrongly acknowledged as successfully written; all subsequent writes from the application block.

Describe how to reproduce the problem

  1. Create a zpool using an iSCSI disk (one disk in the pool)
  2. Create a zvol with dedup=off, compression=off, sync=always, primarycache=metadata, volblocksize=4K
  3. Open the zvol device with the O_SYNC flag.
  4. Write some data to zvol (keep the application waiting for the user input)
  5. Pause the iSCSI target VM
  6. Continue the write. The application now waits for ~1 minute, until ZFS logs "WARNING: Pool 'zpool' has encountered an uncorrectable I/O failure and has been suspended." in the syslog, and then the write call returns success.
  7. Do another write; this one blocks until the iSCSI target VM is resumed and the zpool clear command is run.

Include any warning/errors/backtraces from the system logs

sd 6:0:0:1: rejecting I/O to offline device
[ 1494.822037] blk_update_request: I/O error, dev sde, sector 3156224 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[ 1494.822056] zio pool=zpool vdev=/dev/sde1 error=5 type=2 offset=1614938112 size=4096 flags=184880
[ 1494.822098] blk_update_request: I/O error, dev sde, sector 2107648 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[ 1494.822106] zio pool=zpool vdev=/dev/sde1 error=5 type=2 offset=1078067200 size=4096 flags=184880
[ 1494.822355] blk_update_request: I/O error, dev sde, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[ 1494.822373] zio pool=zpool vdev=/dev/sde1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[ 1494.822425] blk_update_request: I/O error, dev sde, sector 20952080 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[ 1494.822435] zio pool=zpool vdev=/dev/sde1 error=5 type=1 offset=10726416384 size=8192 flags=b08c1
[ 1494.822453] blk_update_request: I/O error, dev sde, sector 20952592 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[ 1494.822461] zio pool=zpool vdev=/dev/sde1 error=5 type=1 offset=10726678528 size=8192 flags=b08c1
[ 1494.823781] WARNING: Pool 'zpool' has encountered an uncorrectable I/O failure and has been suspended.

Example Program

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <fcntl.h>

#define SIZE	(4 * 1024)
#define DEVICE	"/dev/zd0"

int main(void)
{
	char *wbuf;
	ssize_t ret;
	int fd;
	off_t len = 0;

	wbuf = malloc(SIZE);
	if (wbuf == NULL) {
		printf("Failed to allocate write buffer\n");
		return (1);
	}

	/* O_SYNC: write() should not return until the data is stable. */
	fd = open(DEVICE, O_RDWR | O_SYNC);
	if (fd < 0) {
		printf("Failed to open device %s: %d\n", DEVICE, -errno);
		return (1);
	}

	while (1) {
		printf("Enter char to write: ");
		int c = getchar();
		getchar();	/* consume the trailing newline */
		if (c == EOF)
			break;
		if (c != '\n') {
			printf("%c\n", c);
			memset(wbuf, c, SIZE);
			ret = write(fd, wbuf, SIZE);
			if (ret < 0) {
				printf("Write failed error: %d\n", -errno);
				break;
			} else {
				printf("write success ret: %zd offset: %lld\n",
				    ret, (long long)len);
			}
		}
		len += SIZE;
		if (lseek(fd, len, SEEK_SET) < 0) {
			printf("lseek failed: %d\n", -errno);
			break;
		}
	}

	close(fd);
	free(wbuf);
	return (0);
}
@arun-kv arun-kv added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jul 19, 2021
@imtiazdc

Hi @ahrens,

Unless we are missing something, it looks like a severe issue. Given our limited understanding of the code base, we are unable to suggest a fix at this point. It will really help us if you could take a look at this and provide some insights. Happy to debug and provide you any further information.

@bghira

bghira commented Jul 19, 2021

Please verify that it happens without iSCSI; there are MANY layers here. Are you using open-iscsi? What is your iscsid.conf? What is the target config? What backend are you using?

@arun-kv
Contributor Author

arun-kv commented Jul 19, 2021

Hi @bghira, I have tested using a local disk on the same VM (an Oracle VirtualBox VM with a SATA controller):

  1. zpool create -o ashift=12 zpool /dev/sdc -f
  2. zfs create -o volblocksize=4k -o dedup=off -o compression=off -o sync=always -o primarycache=metadata -V 6G zpool/zvol
  3. Wrote some data (keeping the application waiting for user input)
  4. echo offline > /sys/block/sdc/device/state
  5. When I entered new data, I got a success message from write
  6. Rebooted the system, read back the same offset, and the data is not there

I got similar messages to the ones above when I offlined the device:

blk_update_request: I/O error, dev sdc, sector 5253216 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[ 541.138074] zio pool=zpool vdev=/dev/sdc1 error=5 type=2 offset=2688598016 size=8192 flags=180880
[ 541.138120] blk_update_request: I/O error, dev sdc, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 541.138128] zio pool=zpool vdev=/dev/sdc1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[ 541.138183] blk_update_request: I/O error, dev sdc, sector 20952080 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 541.138192] zio pool=zpool vdev=/dev/sdc1 error=5 type=1 offset=10726416384 size=8192 flags=b08c1
[ 541.138214] blk_update_request: I/O error, dev sdc, sector 20952592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 541.138222] zio pool=zpool vdev=/dev/sdc1 error=5 type=1 offset=10726678528 size=8192 flags=b08c1
[ 541.138693] zvol_write End
[ 541.253765] WARNING: Pool 'zpool' has encountered an uncorrectable I/O failure and has been suspended.

@bghira

bghira commented Jul 19, 2021

Sounds like you are running with a writeback cache?

@arun-kv
Contributor Author

arun-kv commented Jul 19, 2021

Hi @bghira, there is no writeback cache.
I just tested it on a physical server (with a local disk) and I see the same behavior: write returns success when ZFS gets an EIO error from the underlying device.

@jxdking
Contributor

jxdking commented Jul 19, 2021

There is a chance that zil_commit() returns without actually flushing the disk. See the comments in zil_lwb_write_done() and zil_lwb_flush_defer().
Thus, if you offline the ZIL device, you may lose data. This is where a mirrored SLOG may help.

@arun-kv
Contributor Author

arun-kv commented Jul 20, 2021

@jxdking Yes, this issue happens when all the disks in the pool are inaccessible. My expectation was that the application would wait until I fix the disk error and run the zpool clear command, because sync=always was set and the application was using the O_SYNC flag.

For example: if someone uses network-attached devices in the zpool (without a local SLOG device), can a network outage of more than a minute cause data loss, as in my first experiment with the iSCSI disk?

@IsaacVaughn

There is a chance that zil_commit() returns without actually flushing the disk. See the comments in zil_lwb_write_done() and zil_lwb_flush_defer().
Thus, if you offline the ZIL device, you may lose data. This is where a mirrored SLOG may help.

This seems somewhat dubious to me. Certainly, zil_lwb_write_done can fail, but does that result in the zil_commit waiter returning success?

Looking at zil_commit_impl, I see
* If there was an error writing out the ZIL blocks that
* this thread is waiting on, then we fallback to
* relying on spa_sync() to write out the data this
* thread is waiting on. Obviously this has performance
* implications, but the expectation is for this to be
* an exceptional case, and shouldn't occur often.

To me, this implies that an error when flushing the ZIL blocks is caught and handled by escalating to a general sync operation. Offlining a ZIL device can definitely result in the data already written to the ZIL being lost, but I don't think it will lose future writes.

Also, it sounds like he has offlined his entire pool, so probably everything should fail and error out.
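
To make that fallback concrete, here is a rough, abridged paraphrase of the tail of zil_commit_impl() (based on the 2.x source; exact code may differ). The txg_wait_synced() fallback is gated on the error recorded in the commit waiter:

	zil_commit_writer(zilog, zcw);
	zil_commit_waiter(zilog, zcw);

	if (zcw->zcw_zio_error != 0) {
		/*
		 * The waiter recorded a ZIL write/flush error, so fall
		 * back to waiting for spa_sync() to write out the data
		 * with its txg.
		 */
		txg_wait_synced(zilog->zl_dmu_pool, zcw->zcw_txg);
	}

	zil_free_commit_waiter(zcw);

So the fallback only helps if zcw_zio_error actually gets set on the waiter.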

@arun-kv
Contributor Author

arun-kv commented Jul 23, 2021

@IsaacVaughn I have done some experimenting by moving the code below from zil_lwb_flush_vdevs_done to zil_lwb_write_done:
	while ((zcw = list_head(&lwb->lwb_waiters)) != NULL) {
		mutex_enter(&zcw->zcw_lock);

		ASSERT(list_link_active(&zcw->zcw_node));
		list_remove(&lwb->lwb_waiters, zcw);

		ASSERT3P(zcw->zcw_lwb, ==, lwb);
		zcw->zcw_lwb = NULL;

		zcw->zcw_zio_error = zio->io_error;	/* <-- the key line */

		ASSERT3B(zcw->zcw_done, ==, B_FALSE);
		zcw->zcw_done = B_TRUE;
		cv_broadcast(&zcw->zcw_cv);

		mutex_exit(&zcw->zcw_lock);
	}

because the error was actually being reported in the zil_lwb_write_done zio, and there zcw->zcw_zio_error = zio->io_error; gets set to the actual error (EIO). After this change I could see zil_commit_impl wait for spa_sync to flush the data, and spa_sync waits for the disk to become accessible again (I had manually offlined the disk). Once I bring the disk back online (echo "running" > /sys/block/sdc/device/state), everything resumes from where it stopped.
So, can zil_commit_waiter wait until both zil_lwb_flush_vdevs_done and zil_lwb_write_done have completed successfully?

Below is the stack trace after my change:

[ 1088.105111] zvol D 0 13811 2 0x00004000
[ 1088.105112] Call Trace:
[ 1088.105117] __schedule+0x394/0xa60
[ 1088.105119] ? default_wake_function+0x1a/0x30
[ 1088.105120] ? autoremove_wake_function+0x12/0x40
[ 1088.105121] schedule+0x55/0xc0
[ 1088.105122] io_schedule+0x16/0x40
[ 1088.105127] cv_wait_common+0xae/0x130 [spl]
[ 1088.105128] ? wait_woken+0x80/0x80
[ 1088.105130] __cv_wait_io+0x18/0x20 [spl]
[ 1088.105176] txg_wait_synced_impl+0xd6/0x120 [zfs]
[ 1088.105210] txg_wait_synced+0x10/0x40 [zfs]
[ 1088.105244] zil_commit_impl+0x612/0xdb0 [zfs]
[ 1088.105279] zil_commit+0x40/0x60 [zfs]
[ 1088.105312] zvol_write.isra.0.cold+0x187/0x397 [zfs]
[ 1088.105313] ? finish_task_switch+0x72/0x230
[ 1088.105346] zvol_write_task+0x1a/0x30 [zfs]
[ 1088.105350] taskq_thread+0x2fb/0x510 [spl]
[ 1088.105351] ? wake_up_q+0xa0/0xa0
[ 1088.105384] ? zvol_write.isra.0+0x110/0x110 [zfs]
[ 1088.105385] kthread+0x114/0x150
[ 1088.105389] ? task_done+0xb0/0xb0 [spl]
[ 1088.105389] ? kthread_park+0x90/0x90
[ 1088.105391] ret_from_fork+0x22/0x30

[ 1088.105400] txg_sync D 0 13988 2 0x00004000
[ 1088.105401] Call Trace:
[ 1088.105402] __schedule+0x394/0xa60
[ 1088.105404] schedule+0x55/0xc0
[ 1088.105405] schedule_timeout+0x8d/0x160
[ 1088.105406] ? __next_timer_interrupt+0xe0/0xe0
[ 1088.105407] io_schedule_timeout+0x1e/0x50
[ 1088.105410] __cv_timedwait_common+0x134/0x170 [spl]
[ 1088.105411] ? wait_woken+0x80/0x80
[ 1088.105413] __cv_timedwait_io+0x19/0x20 [spl]
[ 1088.105447] zio_wait+0x137/0x280 [zfs]
[ 1088.105474] dsl_pool_sync+0xce/0x4f0 [zfs]
[ 1088.105505] spa_sync+0x571/0x1000 [zfs]
[ 1088.105506] ? mutex_lock+0x13/0x40
[ 1088.105538] ? spa_txg_history_init_io+0x106/0x110 [zfs]
[ 1088.105570] txg_sync_thread+0x2c6/0x460 [zfs]
[ 1088.105601] ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
[ 1088.105604] thread_generic_wrapper+0x79/0x90 [spl]
[ 1088.105605] kthread+0x114/0x150
[ 1088.105608] ? __thread_exit+0x20/0x20 [spl]
[ 1088.105609] ? kthread_park+0x90/0x90
[ 1088.105610] ret_from_fork+0x22/0x30

@problame
Contributor

@arun-kv Moving the code that propagates zio_t::io_error to zil_commit_waiter_t::zcw_zio_error only fixes the issue for your case, but then flush errors are no longer propagated.
I think the culprit in your case might be this early return in zil_lwb_write_done:

zfs/module/zfs/zil.c, lines 1245 to 1257 at 296a4a3:

	/*
	 * If there was an IO error, we're not going to call zio_flush()
	 * on these vdevs, so we simply empty the tree and free the
	 * nodes. We avoid calling zio_flush() since there isn't any
	 * good reason for doing so, after the lwb block failed to be
	 * written out.
	 */
	if (zio->io_error != 0) {
		while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
			kmem_free(zv, sizeof (*zv));
		return;
	}

There, we should propagate the error into the waiter but don't do it.

Could you revert your change and then either

  • add a print statement or something to that early return to verify that you are hitting it (a minimal sketch follows below), or
  • set zcw_zio_error from within the early return? (I'm not certain about the necessary synchronization here, so maybe just do the printf.)
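
For example, a minimal sketch of such a debug print inside that early return (hypothetical and for verification only; zfs_dbgmsg() output is typically visible via /proc/spl/kstat/zfs/dbgmsg when dbgmsg logging is enabled):

	if (zio->io_error != 0) {
		/* Debug aid only: confirm this early return is taken. */
		zfs_dbgmsg("zil_lwb_write_done: io_error=%d", zio->io_error);
		while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
			kmem_free(zv, sizeof (*zv));
		return;
	}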

@arun-kv
Contributor Author

arun-kv commented Jul 27, 2021

@problame Yes, that early return is being hit; that's why I moved the code there to propagate the error. Below is the debug message:
zil_lwb_write_done zio: 00000000497853bd zio_type: 2 error: 6 io_pipeline_trace: 24641781

And yes, as you said, my change does not solve the issue; it was only added to check the behavior.

I was thinking about a solution: cv_broadcast(&zcw->zcw_cv) the waiter only if there is an error in zil_lwb_flush_vdevs_done; if there is no error in zil_lwb_flush_vdevs_done, then do the cv_broadcast(&zcw->zcw_cv) in zil_lwb_write_done. (Just a thought; I don't understand the details.)

@problame
Contributor

cc @prakashsurya

@jxdking
Contributor

jxdking commented Jul 27, 2021

From my understanding:

  1. The zio passed to zil_lwb_flush_vdevs_done is the parent of the zio passed to zil_lwb_write_done. If there is an error in zil_lwb_write_done, that implies there will be an error in zil_lwb_flush_vdevs_done, so you don't need to handle the error in zil_lwb_write_done.
  2. zil_lwb_flush_vdevs_done will always be called, whether the flush is issued or deferred. It is the parent zio of the write operation and of the optional flush operation, so it is called even if the flush is skipped (see the sketch just below).
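
For reference, an abridged sketch of how these zios are wired up in zil_lwb_write_open() (paraphrased from the 2.x source; exact code may differ): zil_lwb_flush_vdevs_done() is the done callback of the lwb's root zio, and the lwb write zio (whose done callback is zil_lwb_write_done()) is created as a child of that root.

	lwb->lwb_root_zio = zio_root(zilog->zl_spa,
	    zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);

	lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
	    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
	    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
	    prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
	    ZIO_FLAG_FASTWRITE, &zb);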

You may try to remove or comment out the following code from zil_lwb_write_done to see if it fixes your issue.
By doing so, the flush is never skipped.

	if (list_head(&lwb->lwb_waiters) == NULL && nlwb != NULL) {
		zil_lwb_flush_defer(lwb, nlwb);
		ASSERT(avl_is_empty(&lwb->lwb_vdev_tree));
		return;
	}

@arun-kv
Contributor Author

arun-kv commented Jul 27, 2021

No, it did not fix the issue after commenting out that code. Below is the debug message I got; zil_lwb_flush_vdevs_done is not getting the error that zil_lwb_write_done is getting.

[  345.173734] zil_lwb_write_done zio: 000000001627a0c4 zio_type: 2 error: 6 io_pipeline_trace: 24641781
[  345.173739] zil_lwb_flush_vdevs_done zio: 0000000035d5870f zio_type: 0 error: 0 io_pipeline_trace: 17301505

zil_lwb_write_done has an error check before calling zil_lwb_flush_defer, and it returns from there without calling zil_lwb_flush_defer / zio_flush:

	if (zio->io_error != 0) {
		while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
			kmem_free(zv, sizeof (*zv));
		return;
	}

	if (list_head(&lwb->lwb_waiters) == NULL && nlwb != NULL) {
		zil_lwb_flush_defer(lwb, nlwb);
		ASSERT(avl_is_empty(&lwb->lwb_vdev_tree));
		return;
	}

@jxdking
Contributor

jxdking commented Jul 27, 2021

@arun-kv
Found the reason, in zil_lwb_write_open:

	lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
	    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
	    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
	    prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
	    ZIO_FLAG_FASTWRITE, &zb);

ZIO_FLAG_DONT_PROPAGATE prevents the error from bubbling up to the parent zio. However, I am not sure whether changing the flag here is the right solution; it may break other logic.

Another option: we can record zcw_zio_error in zil_lwb_write_done if there is any error, and in zil_lwb_flush_vdevs_done only overwrite zcw_zio_error when its io_error is not 0.
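
To make the first of these options concrete, here is a sketch of the zio_rewrite() call quoted above with ZIO_FLAG_DONT_PROPAGATE dropped, so that the write zio's io_error propagates up to lwb_root_zio and is visible to zil_lwb_flush_vdevs_done() (illustration only; whether this is safe is exactly the open question):

	lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
	    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
	    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
	    prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_FASTWRITE, &zb);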

@arun-kv
Contributor Author

arun-kv commented Jul 28, 2021

@jxdking

Thanks, both solutions worked.
Is the second solution less risky? Below is the diff of my change.

diff --git a/module/zfs/zil.c b/module/zfs/zil.c
index 78d0711cc..1827df0bd 100644
--- a/module/zfs/zil.c
+++ b/module/zfs/zil.c
@@ -1179,7 +1179,8 @@ zil_lwb_flush_vdevs_done(zio_t *zio)
 		ASSERT3P(zcw->zcw_lwb, ==, lwb);
 		zcw->zcw_lwb = NULL;
 
-		zcw->zcw_zio_error = zio->io_error;
+		if (zio->io_error != 0)
+			zcw->zcw_zio_error = zio->io_error;
 
 		ASSERT3B(zcw->zcw_done, ==, B_FALSE);
 		zcw->zcw_done = B_TRUE;
@@ -1253,6 +1254,16 @@ zil_lwb_write_done(zio_t *zio)
 	 * written out.
 	 */
 	if (zio->io_error != 0) {
+		zil_commit_waiter_t *zcw;
+		for (zcw = list_head(&lwb->lwb_waiters); zcw != NULL;
+			zcw = list_next(&lwb->lwb_waiters, zcw)) {
+			mutex_enter(&zcw->zcw_lock);
+			ASSERT(list_link_active(&zcw->zcw_node));
+			ASSERT3P(zcw->zcw_lwb, ==, lwb);
+			zcw->zcw_zio_error = zio->io_error;
+			mutex_exit(&zcw->zcw_lock);
+		}
+
 		while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
 			kmem_free(zv, sizeof (*zv));
 		return;

@jxdking
Contributor

jxdking commented Jul 28, 2021

Is the second solution less risky? Below is the diff of my change.

All the zios of the issued lwb are chained together. The original author used ZIO_FLAG_DONT_PROPAGATE explicitly; I guess there must be a reason.

Why don't you create a pull request? It looks like a bug that the original code does not catch io_error in zil_lwb_write_done.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 15, 2021
Errors in zil_lwb_write_done() are not propagated to
zil_lwb_flush_vdevs_done() which can result in zil_commit_impl()
not returning an error to applications even when zfs was not able
to write data to the disk.

Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to
allow errors to propagate and consolidate the error handling for
flush and write errors to a single location (rather than having
error handling split between the "write done" and "flush done"
handlers).

Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Signed-off-by: Arun KV <[email protected]>
Closes openzfs#12391
Closes openzfs#12443
rincebrain pushed a commit to rincebrain/zfs that referenced this issue Sep 22, 2021
Errors in zil_lwb_write_done() are not propagated to
zil_lwb_flush_vdevs_done() which can result in zil_commit_impl()
not returning an error to applications even when zfs was not able
to write data to the disk.

Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to
allow errors to propagate and consolidate the error handling for
flush and write errors to a single location (rather than having
error handling split between the "write done" and "flush done"
handlers).

Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Signed-off-by: Arun KV <[email protected]>
Closes openzfs#12391
Closes openzfs#12443