ZVOL: sync=always is not being honored #12391

Closed
arun-kv opened this issue Jul 19, 2021 · 17 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@arun-kv
Contributor

arun-kv commented Jul 19, 2021

System information

Type                  Version/Name
Distribution Name     Ubuntu
Distribution Version  20.04.2 LTS
Kernel Version        5.8.0-55-generic
Architecture          x86_64
OpenZFS Version       zfs-2.1.99-327_g88a483303

Describe the problem you're observing

zvol (sync=always): write(O_SYNC) returns success when the iSCSI disk is not accessible.
The application gets a write success from a zvol created with sync=always even when the disk is not accessible.
Note: only the first write after the disk becomes inaccessible is wrongly acknowledged as successfully written; all subsequent writes from the application block.

Describe how to reproduce the problem

  1. Create a zpool using an iSCSI disk (one disk in the pool)
  2. Create a zvol with dedup=off, compression=off, sync=always, primarycache=metadata, volblocksize=4K
  3. Open the zvol device with the O_SYNC flag.
  4. Write some data to zvol (keep the application waiting for the user input)
  5. Pause the iSCSI target VM
  6. Continue the write. The application now waits for ~1 minute, until ZFS logs "WARNING: Pool 'zpool' has encountered an uncorrectable I/O failure and has been suspended." in the syslog, and then the write call returns success.
  7. Do another write; this one blocks until the iSCSI target VM is resumed and the zpool clear command is run.

Include any warning/errors/backtraces from the system logs

sd 6:0:0:1: rejecting I/O to offline device
[ 1494.822037] blk_update_request: I/O error, dev sde, sector 3156224 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[ 1494.822056] zio pool=zpool vdev=/dev/sde1 error=5 type=2 offset=1614938112 size=4096 flags=184880
[ 1494.822098] blk_update_request: I/O error, dev sde, sector 2107648 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[ 1494.822106] zio pool=zpool vdev=/dev/sde1 error=5 type=2 offset=1078067200 size=4096 flags=184880
[ 1494.822355] blk_update_request: I/O error, dev sde, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[ 1494.822373] zio pool=zpool vdev=/dev/sde1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[ 1494.822425] blk_update_request: I/O error, dev sde, sector 20952080 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[ 1494.822435] zio pool=zpool vdev=/dev/sde1 error=5 type=1 offset=10726416384 size=8192 flags=b08c1
[ 1494.822453] blk_update_request: I/O error, dev sde, sector 20952592 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
[ 1494.822461] zio pool=zpool vdev=/dev/sde1 error=5 type=1 offset=10726678528 size=8192 flags=b08c1
[ 1494.823781] WARNING: Pool 'zpool' has encountered an uncorrectable I/O failure and has been suspended.

Example Program

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <fcntl.h>

#define SIZE	(4 * 1024)
#define DEVICE	"/dev/zd0"

int main(void)
{
	char *wbuf;
	ssize_t ret;
	int fd;
	off_t len = 0;

	wbuf = malloc(SIZE);
	if (wbuf == NULL) {
		printf("Failed to allocate write buffer\n");
		return (1);
	}

	/* O_SYNC: write() should not return until the data is stable. */
	fd = open(DEVICE, O_RDWR | O_SYNC);
	if (fd < 0) {
		printf("Failed to open device %s: %d\n", DEVICE, -errno);
		return (1);
	}

	while (1) {
		printf("Enter char to write: ");
		int c = getchar();
		getchar();	/* consume the trailing newline */
		if (c == EOF)
			break;
		if (c != '\n') {
			printf("%c\n", c);
			memset(wbuf, c, SIZE);
			ret = write(fd, wbuf, SIZE);
			if (ret < 0) {
				printf("Write failed error: %d\n", -errno);
				break;
			} else {
				printf("write success ret: %zd offset: %lld\n",
				    ret, (long long)len);
			}
		}
		len += SIZE;
		if (lseek(fd, len, SEEK_SET) < 0) {
			printf("lseek failed: %d\n", -errno);
			break;
		}
	}

	close(fd);
	free(wbuf);
	return (0);
}
@arun-kv arun-kv added the Type: Defect Incorrect behavior (e.g. crash, hang) label Jul 19, 2021
@imtiazdc

Hi @ahrens,

Unless we are missing something, it looks like a severe issue. Given our limited understanding of the code base, we are unable to suggest a fix at this point. It will really help us if you could take a look at this and provide some insights. Happy to debug and provide you any further information.

@bghira

bghira commented Jul 19, 2021

Please verify that it happens without iSCSI; there are MANY layers here. Are you using open-iscsi? What is your iscsid.conf? What is the target config? What backend are you using?

@arun-kv
Contributor Author

arun-kv commented Jul 19, 2021

Hi @bghira, I have tested using a local disk on the same VM (an Oracle VirtualBox VM with a SATA controller):

  1. zpool create -o ashift=12 zpool /dev/sdc -f
  2. zfs create -o volblocksize=4k -o dedup=off -o compression=off -o sync=always -o primarycache=metadata -V 6G zpool/zvol
  3. Wrote some data (keeping the application waiting for user input)
  4. echo offline > /sys/block/sdc/device/state
  5. When I entered new data, I got a success message from write
  6. Rebooted the system, read back the same offset, and the data is not there

I got similar messages to the ones above when I offlined the device:

blk_update_request: I/O error, dev sdc, sector 5253216 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
[ 541.138074] zio pool=zpool vdev=/dev/sdc1 error=5 type=2 offset=2688598016 size=8192 flags=180880
[ 541.138120] blk_update_request: I/O error, dev sdc, sector 2576 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 541.138128] zio pool=zpool vdev=/dev/sdc1 error=5 type=1 offset=270336 size=8192 flags=b08c1
[ 541.138183] blk_update_request: I/O error, dev sdc, sector 20952080 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 541.138192] zio pool=zpool vdev=/dev/sdc1 error=5 type=1 offset=10726416384 size=8192 flags=b08c1
[ 541.138214] blk_update_request: I/O error, dev sdc, sector 20952592 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 541.138222] zio pool=zpool vdev=/dev/sdc1 error=5 type=1 offset=10726678528 size=8192 flags=b08c1
[ 541.138693] zvol_write End
[ 541.253765] WARNING: Pool 'zpool' has encountered an uncorrectable I/O failure and has been suspended.

@bghira

bghira commented Jul 19, 2021

Sounds like you are running with a writeback cache?

@arun-kv
Contributor Author

arun-kv commented Jul 19, 2021

Hi @bghira, there is no writeback cache.
I just tested it on a physical server (with a local disk) and I see the same behavior: write returns success when ZFS gets an EIO error from the underlying device.

@jxdking
Contributor

jxdking commented Jul 19, 2021

There is a chance that zil_commit() returns without actually flushing the disk. See the comments in zil_lwb_write_done() and zil_lwb_flush_defer().
Thus, if you offline the ZIL device, you may lose data. This is where a mirrored SLOG may help.

@arun-kv
Contributor Author

arun-kv commented Jul 20, 2021

@jxdking Yes, this issue happens when all the disks in the pool are inaccessible. My expectation was that the application would wait until I fix the disk error and run the zpool clear command, because sync=always was set and the application was using the O_SYNC flag.

For example: if someone uses network-attached devices in the zpool (without a local SLOG device), can a network outage of more than a minute cause data loss, as in my first experiment with the iSCSI disk?

@IsaacVaughn

There is a chance that zil_commit() returns without actually flushing the disk. See the comments in zil_lwb_write_done() and zil_lwb_flush_defer().
Thus, if you offline the ZIL device, you may lose data. This is where a mirrored SLOG may help.

This seems somewhat dubious to me. Certainly, zil_lwb_write_done can fail, but does that result in the zil_commit waiter returning success?

Looking at zil_commit_impl, I see
* If there was an error writing out the ZIL blocks that
* this thread is waiting on, then we fallback to
* relying on spa_sync() to write out the data this
* thread is waiting on. Obviously this has performance
* implications, but the expectation is for this to be
* an exceptional case, and shouldn't occur often.

To me, this implies that an error when flushing the ZIL blocks is caught and handled by escalating to a general sync operation. Offlining a ZIL device can definitely result in the data already written to the ZIL being lost, but I don't think it will lose future writes.

Also, it sounds like he has offlined his entire pool, so probably everything should fail and error out.
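
To make that fallback concrete, here is a rough, abridged paraphrase of the tail of zil_commit_impl() (based on the 2.x source; exact code may differ). The txg_wait_synced() fallback is gated on the error recorded in the commit waiter:

	zil_commit_writer(zilog, zcw);
	zil_commit_waiter(zilog, zcw);

	if (zcw->zcw_zio_error != 0) {
		/*
		 * The waiter recorded a ZIL write/flush error, so fall
		 * back to waiting for spa_sync() to write out the data
		 * with its txg.
		 */
		txg_wait_synced(zilog->zl_dmu_pool, zcw->zcw_txg);
	}

	zil_free_commit_waiter(zcw);

So the fallback only helps if zcw_zio_error actually gets set on the waiter.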

@arun-kv
Contributor Author

arun-kv commented Jul 23, 2021

@IsaacVaughn I have done some experimenting by moving the code below from zil_lwb_flush_vdevs_done to zil_lwb_write_done:
	while ((zcw = list_head(&lwb->lwb_waiters)) != NULL) {
		mutex_enter(&zcw->zcw_lock);

		ASSERT(list_link_active(&zcw->zcw_node));
		list_remove(&lwb->lwb_waiters, zcw);

		ASSERT3P(zcw->zcw_lwb, ==, lwb);
		zcw->zcw_lwb = NULL;

		zcw->zcw_zio_error = zio->io_error;	/* <-- the key line */

		ASSERT3B(zcw->zcw_done, ==, B_FALSE);
		zcw->zcw_done = B_TRUE;
		cv_broadcast(&zcw->zcw_cv);

		mutex_exit(&zcw->zcw_lock);
	}

because the error was actually being reported in the zil_lwb_write_done zio, and there zcw->zcw_zio_error = zio->io_error; gets set to the actual error (EIO). After this change I could see zil_commit_impl wait for spa_sync to flush the data, and spa_sync waits for the disk to become accessible again (I had manually offlined the disk). Once I bring the disk back online (echo "running" > /sys/block/sdc/device/state), everything resumes from where it stopped.
So, can zil_commit_waiter wait until both zil_lwb_flush_vdevs_done and zil_lwb_write_done have completed successfully?

Below is the stack trace after my change:

[ 1088.105111] zvol D 0 13811 2 0x00004000
[ 1088.105112] Call Trace:
[ 1088.105117] __schedule+0x394/0xa60
[ 1088.105119] ? default_wake_function+0x1a/0x30
[ 1088.105120] ? autoremove_wake_function+0x12/0x40
[ 1088.105121] schedule+0x55/0xc0
[ 1088.105122] io_schedule+0x16/0x40
[ 1088.105127] cv_wait_common+0xae/0x130 [spl]
[ 1088.105128] ? wait_woken+0x80/0x80
[ 1088.105130] __cv_wait_io+0x18/0x20 [spl]
[ 1088.105176] txg_wait_synced_impl+0xd6/0x120 [zfs]
[ 1088.105210] txg_wait_synced+0x10/0x40 [zfs]
[ 1088.105244] zil_commit_impl+0x612/0xdb0 [zfs]
[ 1088.105279] zil_commit+0x40/0x60 [zfs]
[ 1088.105312] zvol_write.isra.0.cold+0x187/0x397 [zfs]
[ 1088.105313] ? finish_task_switch+0x72/0x230
[ 1088.105346] zvol_write_task+0x1a/0x30 [zfs]
[ 1088.105350] taskq_thread+0x2fb/0x510 [spl]
[ 1088.105351] ? wake_up_q+0xa0/0xa0
[ 1088.105384] ? zvol_write.isra.0+0x110/0x110 [zfs]
[ 1088.105385] kthread+0x114/0x150
[ 1088.105389] ? task_done+0xb0/0xb0 [spl]
[ 1088.105389] ? kthread_park+0x90/0x90
[ 1088.105391] ret_from_fork+0x22/0x30

[ 1088.105400] txg_sync D 0 13988 2 0x00004000
[ 1088.105401] Call Trace:
[ 1088.105402] __schedule+0x394/0xa60
[ 1088.105404] schedule+0x55/0xc0
[ 1088.105405] schedule_timeout+0x8d/0x160
[ 1088.105406] ? __next_timer_interrupt+0xe0/0xe0
[ 1088.105407] io_schedule_timeout+0x1e/0x50
[ 1088.105410] __cv_timedwait_common+0x134/0x170 [spl]
[ 1088.105411] ? wait_woken+0x80/0x80
[ 1088.105413] __cv_timedwait_io+0x19/0x20 [spl]
[ 1088.105447] zio_wait+0x137/0x280 [zfs]
[ 1088.105474] dsl_pool_sync+0xce/0x4f0 [zfs]
[ 1088.105505] spa_sync+0x571/0x1000 [zfs]
[ 1088.105506] ? mutex_lock+0x13/0x40
[ 1088.105538] ? spa_txg_history_init_io+0x106/0x110 [zfs]
[ 1088.105570] txg_sync_thread+0x2c6/0x460 [zfs]
[ 1088.105601] ? txg_thread_exit.isra.0+0x60/0x60 [zfs]
[ 1088.105604] thread_generic_wrapper+0x79/0x90 [spl]
[ 1088.105605] kthread+0x114/0x150
[ 1088.105608] ? __thread_exit+0x20/0x20 [spl]
[ 1088.105609] ? kthread_park+0x90/0x90
[ 1088.105610] ret_from_fork+0x22/0x30

@problame
Contributor

@arun-kv Moving the code that propagates zio_t::io_error to zil_commit_waiter_t::zcw_zio_error only fixes the issue for your case, but then flush errors are no longer propagated.
I think the culprit in your case might be this early return in zil_lwb_write_done:

zfs/module/zfs/zil.c, lines 1245 to 1257 at 296a4a3:

	/*
	 * If there was an IO error, we're not going to call zio_flush()
	 * on these vdevs, so we simply empty the tree and free the
	 * nodes. We avoid calling zio_flush() since there isn't any
	 * good reason for doing so, after the lwb block failed to be
	 * written out.
	 */
	if (zio->io_error != 0) {
		while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
			kmem_free(zv, sizeof (*zv));
		return;
	}

There, we should propagate the error into the waiter but don't do it.

Could you revert your change and then either

  • add a print statement or something to that early return to verify that you are hitting it (a minimal sketch follows below), or
  • set zcw_zio_error from within the early return? (I'm not certain about the necessary synchronization here, so maybe just do the printf.)
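
For example, a minimal sketch of such a debug print inside that early return (hypothetical and for verification only; zfs_dbgmsg() output is typically visible via /proc/spl/kstat/zfs/dbgmsg when dbgmsg logging is enabled):

	if (zio->io_error != 0) {
		/* Debug aid only: confirm this early return is taken. */
		zfs_dbgmsg("zil_lwb_write_done: io_error=%d", zio->io_error);
		while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
			kmem_free(zv, sizeof (*zv));
		return;
	}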

@arun-kv
Contributor Author

arun-kv commented Jul 27, 2021

@problame Yes, that early return is being hit; that's why I moved the code there to propagate the error. Below is the debug message:
zil_lwb_write_done zio: 00000000497853bd zio_type: 2 error: 6 io_pipeline_trace: 24641781

And yes, as you said, my change does not solve the issue; it was only added to check the behavior.

I was thinking about a solution: cv_broadcast(&zcw->zcw_cv) the waiter only if there is an error in zil_lwb_flush_vdevs_done; if there is no error in zil_lwb_flush_vdevs_done, then do the cv_broadcast(&zcw->zcw_cv) in zil_lwb_write_done. (Just a thought; I don't understand the details.)

@problame
Contributor

cc @prakashsurya

@jxdking
Contributor

jxdking commented Jul 27, 2021

From my understanding:

  1. The zio passed to zil_lwb_flush_vdevs_done is the parent of the zio passed to zil_lwb_write_done. If there is an error in zil_lwb_write_done, that implies there will be an error in zil_lwb_flush_vdevs_done, so you don't need to handle the error in zil_lwb_write_done.
  2. zil_lwb_flush_vdevs_done will always be called, whether the flush is issued or deferred. It is the parent zio of the write operation and of the optional flush operation, so it is called even if the flush is skipped (see the sketch just below).
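
For reference, an abridged sketch of how these zios are wired up in zil_lwb_write_open() (paraphrased from the 2.x source; exact code may differ): zil_lwb_flush_vdevs_done() is the done callback of the lwb's root zio, and the lwb write zio (whose done callback is zil_lwb_write_done()) is created as a child of that root.

	lwb->lwb_root_zio = zio_root(zilog->zl_spa,
	    zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);

	lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
	    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
	    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
	    prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
	    ZIO_FLAG_FASTWRITE, &zb);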

You may try to remove or comment out the following code from zil_lwb_write_done to see if it fixes your issue.
By doing so, the flush is never skipped.

	if (list_head(&lwb->lwb_waiters) == NULL && nlwb != NULL) {
		zil_lwb_flush_defer(lwb, nlwb);
		ASSERT(avl_is_empty(&lwb->lwb_vdev_tree));
		return;
	}

@arun-kv
Contributor Author

arun-kv commented Jul 27, 2021

No, it did not fix the issue after commenting out that code. Below is the debug message I got; zil_lwb_flush_vdevs_done is not getting the error that zil_lwb_write_done is getting.

[  345.173734] zil_lwb_write_done zio: 000000001627a0c4 zio_type: 2 error: 6 io_pipeline_trace: 24641781
[  345.173739] zil_lwb_flush_vdevs_done zio: 0000000035d5870f zio_type: 0 error: 0 io_pipeline_trace: 17301505

zil_lwb_write_done has an error check before calling zil_lwb_flush_defer, and it returns from there without calling zil_lwb_flush_defer / zio_flush:

	if (zio->io_error != 0) {
		while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
			kmem_free(zv, sizeof (*zv));
		return;
	}

	if (list_head(&lwb->lwb_waiters) == NULL && nlwb != NULL) {
		zil_lwb_flush_defer(lwb, nlwb);
		ASSERT(avl_is_empty(&lwb->lwb_vdev_tree));
		return;
	}

@jxdking
Contributor

jxdking commented Jul 27, 2021

@arun-kv
Found the reason, in zil_lwb_write_open:

	lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
	    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
	    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
	    prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
	    ZIO_FLAG_FASTWRITE, &zb);

ZIO_FLAG_DONT_PROPAGATE prevents the error from bubbling up to the parent zio. However, I am not sure whether changing the flag here is the right solution; it may break other logic.

Another option: we can record zcw_zio_error in zil_lwb_write_done if there is any error, and in zil_lwb_flush_vdevs_done only overwrite zcw_zio_error when its io_error is not 0.
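
To make the first of these options concrete, here is a sketch of the zio_rewrite() call quoted above with ZIO_FLAG_DONT_PROPAGATE dropped, so that the write zio's io_error propagates up to lwb_root_zio and is visible to zil_lwb_flush_vdevs_done() (illustration only; whether this is safe is exactly the open question):

	lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
	    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
	    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
	    prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_FASTWRITE, &zb);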

@arun-kv
Contributor Author

arun-kv commented Jul 28, 2021

@jxdking

Thanks, both solutions worked.
Is the second solution less risky? Below is the diff of my change.

diff --git a/module/zfs/zil.c b/module/zfs/zil.c
index 78d0711cc..1827df0bd 100644
--- a/module/zfs/zil.c
+++ b/module/zfs/zil.c
@@ -1179,7 +1179,8 @@ zil_lwb_flush_vdevs_done(zio_t *zio)
 		ASSERT3P(zcw->zcw_lwb, ==, lwb);
 		zcw->zcw_lwb = NULL;
 
-		zcw->zcw_zio_error = zio->io_error;
+		if (zio->io_error != 0)
+			zcw->zcw_zio_error = zio->io_error;
 
 		ASSERT3B(zcw->zcw_done, ==, B_FALSE);
 		zcw->zcw_done = B_TRUE;
@@ -1253,6 +1254,16 @@ zil_lwb_write_done(zio_t *zio)
 	 * written out.
 	 */
 	if (zio->io_error != 0) {
+		zil_commit_waiter_t *zcw;
+		for (zcw = list_head(&lwb->lwb_waiters); zcw != NULL;
+			zcw = list_next(&lwb->lwb_waiters, zcw)) {
+			mutex_enter(&zcw->zcw_lock);
+			ASSERT(list_link_active(&zcw->zcw_node));
+			ASSERT3P(zcw->zcw_lwb, ==, lwb);
+			zcw->zcw_zio_error = zio->io_error;
+			mutex_exit(&zcw->zcw_lock);
+		}
+
 		while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
 			kmem_free(zv, sizeof (*zv));
 		return;

@jxdking
Contributor

jxdking commented Jul 28, 2021

Is the second solution less risky? Below is the diff of my change.

All the zios of the issued lwb are chained together. The original author used ZIO_FLAG_DONT_PROPAGATE explicitly; I guess there must be a reason.

Why don't you create a pull request? It looks like a bug that the original code does not catch io_error in zil_lwb_write_done.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 15, 2021
Errors in zil_lwb_write_done() are not propagated to
zil_lwb_flush_vdevs_done() which can result in zil_commit_impl()
not returning an error to applications even when zfs was not able
to write data to the disk.

Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to
allow errors to propagate and consolidate the error handling for
flush and write errors to a single location (rather than having
error handling split between the "write done" and "flush done"
handlers).

Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Signed-off-by: Arun KV <[email protected]>
Closes openzfs#12391
Closes openzfs#12443
rincebrain pushed a commit to rincebrain/zfs that referenced this issue Sep 22, 2021
Errors in zil_lwb_write_done() are not propagated to
zil_lwb_flush_vdevs_done() which can result in zil_commit_impl()
not returning an error to applications even when zfs was not able
to write data to the disk.

Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to
allow errors to propagate and consolidate the error handling for
flush and write errors to a single location (rather than having
error handling split between the "write done" and "flush done"
handlers).

Reviewed-by: George Wilson <[email protected]>
Reviewed-by: Prakash Surya <[email protected]>
Signed-off-by: Arun KV <[email protected]>
Closes openzfs#12391
Closes openzfs#12443