SEEK_HOLE loop & forced syncing causes never-ending delay in grep #14512

StefanEnspired · 2023-02-19T15:27:23Z

System information

Type	Version/Name
Distribution Name	Fedora
Distribution Version	37
Kernel Version	6.1.11-200.fc37.x86_64
Architecture	x86_64
OpenZFS Version	2.1.8

Describe the problem you're observing

When I have a log-generating program and want to run grep on its output, it never finishes.

Describe how to reproduce the problem

A reproducer is this log generator in go, which can easily be used to trigger this condition. When I pipe its output into a file and run grep on it while it’s running, grep cannot run until the generator is stopped. I can reproduce this inside a plain Fedora 37 VM with 4 GB of memory, a completely empty pool on an empty virtual disk. Setting zfs_dmu_offset_next_sync=0 eliminates this problem. It was introduced in #12724

Mailing list link: https://zfsonlinux.topicbox.com/groups/zfs-discuss/T696e33a9741edf03/reading-buffered-data-causes-io-storm

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	seq := make([]uint64, 20)
	var number int
	for {
		for i := 0; i < 19; i++ {
			seq[i] = seq[i+1]
		}
		seq[19] = rand.Uint64()
		number++
		fmt.Printf("{\"level\":\"debug\",\"log\":\"Sequence %v\",\"number\":%d,\"my-time\":\"%v\"}\n", seq, number, time.Now())
		time.Sleep(time.Millisecond)
	}
}

The text was updated successfully, but these errors were encountered:

rincebrain · 2023-02-19T18:33:14Z

In theory, that PR is supposed to avoid that by the caller holding the rangelock, according to the comment, so it shouldn't get dirtied again and never return, assuming that's what transpires.

A trivial workaround, assuming that's what happens, would be to make it refuse to go through that codepath twice so it falls back to the tunable off behavior in that case, and just says no holes. An alternate behavior would be to assume it's good enough once synced once, but I guess that depends on the semantics anyone might expect from that.

@behlendorf any ideas why this might not be doing what the comment expects?

jknepher · 2023-02-23T15:22:25Z

In case it is helpful, I experience the same problem, and have for a while. I have trained myself to always use cat/tail | grep rather than directly using grep on any active log file, as that works fine.
x86_64
Linux 5.10.52-gentoo
zfs-kmod-2.1.99-943_g045aeabce

devZer0 · 2023-03-08T07:57:45Z

@rincebrain ,for me this looks very different from #14594

if there is a correlation here, it can be easily tested

disable atime or enable relatime and retry
check if it still happens with 2.1.4

behlendorf · 2023-03-08T21:22:53Z

The dn_struct_rwlock locking in dmu_offset_next() is supposed to protect against a concurrent dirtying update. But it seems pretty clear from the reproducer there's an update happening outside that lock. PR #13368 may help, but it'll depend on exactly what is being dirtied and where. This change also still needs some work.

devZer0 · 2023-03-08T21:33:57Z

@rincebrain ,for me this looks very different from #14594

if there is a correlation here, it can be easily tested
* disable atime or enable relatime and retry
* check if it still happens with 2.1.4

i was wrong with this, apparently Setting zfs_dmu_offset_next_sync=0 eliminates my problem, too

dbavatar · 2023-03-09T22:27:56Z

We his this too on kernel 5.15, after pulling latest zfs turned zfs_dmu_offset_next_sync on by default. I remember that zfs_holey() fn was very iffy. Would be nice if someone is smart enough to get that data from in-memory state without causing a txg_sync.

`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance. Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good. However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write. The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.) This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #13368 Issue #14594 Issue #14512 Issue #14009

`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance. Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good. However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write. The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.) This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes openzfs#13368 Issue openzfs#14594 Issue openzfs#14512 Issue openzfs#14009

`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance. Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good. However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write. The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.) This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes #13368 Issue #14594 Issue #14512 Issue #14009

Holding the zp->z_rangelock is insufficient to prevent the dnode from being re-dirtied in all cases. To avoid looping indefinately in dmu_offset_next() on files being actively written only wait once for a dirty dnode to be synced. If after waiting it's still dirty don't report the hole. This is always safe. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#14512

`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance. Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good. However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write. The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.) This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Matthew Ahrens <[email protected]> Closes openzfs#13368 Issue openzfs#14594 Issue openzfs#14512 Issue openzfs#14009

behlendorf · 2023-03-18T00:32:25Z

I've opened PR #14641 with a proposed fix for this issue.

Holding the zp->z_rangelock is insufficient to prevent the dnode from being re-dirtied in all cases. To avoid looping indefinately in dmu_offset_next() on files being actively written only wait once for a dirty dnode to be synced. If after waiting it's still dirty don't report the hole. This is always safe. Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#14512

Holding the zp->z_rangelock as a RL_READER over the range 0-UINT64_MAX is sufficient to prevent the dnode from being re-dirtied by concurrent writers. To avoid potentially looping multiple times for external caller which do not take the rangelock holes are not reported after the first sync. While not optimal this is always functionally correct. This change adds the missing rangelock calls on FreeBSD to zvol_cdev_ioctl(). Signed-off-by: Brian Behlendorf <[email protected]> Issue openzfs#14512

Holding the zp->z_rangelock as a RL_READER over the range 0-UINT64_MAX is sufficient to prevent the dnode from being re-dirtied by concurrent writers. To avoid potentially looping multiple times for external caller which do not take the rangelock holes are not reported after the first sync. While not optimal this is always functionally correct. This change adds the missing rangelock calls on FreeBSD to zvol_cdev_ioctl(). Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#14512 Closes openzfs#14641

Holding the zp->z_rangelock as a RL_READER over the range 0-UINT64_MAX is sufficient to prevent the dnode from being re-dirtied by concurrent writers. To avoid potentially looping multiple times for external caller which do not take the rangelock holes are not reported after the first sync. While not optimal this is always functionally correct. This change adds the missing rangelock calls on FreeBSD to zvol_cdev_ioctl(). Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes #14512 Closes #14641

Holding the zp->z_rangelock as a RL_READER over the range 0-UINT64_MAX is sufficient to prevent the dnode from being re-dirtied by concurrent writers. To avoid potentially looping multiple times for external caller which do not take the rangelock holes are not reported after the first sync. While not optimal this is always functionally correct. This change adds the missing rangelock calls on FreeBSD to zvol_cdev_ioctl(). Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#14512 Closes openzfs#14641

snajpa · 2023-09-02T18:08:58Z

Just to make sure our voice is heard too, our usecase for OpenZFS is container hosting, we can't afford any situation where a container can just keep forcing out txgs onto disk, it creates highly unfair situation to the other ones on the same host. So, in our case, the best course of action - due to lack of other options - is to make ZFS act like it doesn't support holes at all. We're about to push this patch into our production to do just that. If anyone is interested, I could do a PR with a knob to turn holes off in this way.

behlendorf · 2023-09-02T18:56:42Z

@snajpa I'm not sure I understand why setting zfs_dmu_offset_next_sync=0 isn't sufficient for you use case. This would prevent users from being able to force txg syncs as well as preserve hole reporting for file which aren't being actively modified.

snajpa · 2023-09-03T21:30:15Z

I'm sorry I posted before reading the actual code. It's as you say, sorted out.

Holding the zp->z_rangelock as a RL_READER over the range 0-UINT64_MAX is sufficient to prevent the dnode from being re-dirtied by concurrent writers. To avoid potentially looping multiple times for external caller which do not take the rangelock holes are not reported after the first sync. While not optimal this is always functionally correct. This change adds the missing rangelock calls on FreeBSD to zvol_cdev_ioctl(). Reviewed-by: Brian Atkinson <[email protected]> Signed-off-by: Brian Behlendorf <[email protected]> Closes openzfs#14512 Closes openzfs#14641

StefanEnspired added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 19, 2023

StefanEnspired changed the title ~~HOLE_SEEK loop & forced syncing causes never-ending delay in grep~~ SEEK_HOLE loop & forced syncing causes never-ending delay in grep Feb 20, 2023

rincebrain mentioned this issue Mar 7, 2023

severe performance regression on virtual disk migration for qcow2 on ZFS with ZFS 2.1.5 #14594

Open

thesamesam mentioned this issue Mar 11, 2023

ZFS big write performance hit upgrading from 2.1.4 to 2.1.5 or 2.1.6 #14009

Open

behlendorf mentioned this issue Mar 14, 2023

[2.1.10] ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced() #14627

Merged

7 tasks

behlendorf mentioned this issue Mar 16, 2023

Additional limits on hole reporting #14641

Merged

13 tasks

behlendorf closed this as completed in 64bfa6b Mar 28, 2023

behlendorf mentioned this issue Mar 28, 2023

[2.1.10] Additional limits on hole reporting #14683

Merged

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SEEK_HOLE loop & forced syncing causes never-ending delay in grep #14512

SEEK_HOLE loop & forced syncing causes never-ending delay in grep #14512

StefanEnspired commented Feb 19, 2023

rincebrain commented Feb 19, 2023

jknepher commented Feb 23, 2023

devZer0 commented Mar 8, 2023

behlendorf commented Mar 8, 2023

devZer0 commented Mar 8, 2023

dbavatar commented Mar 9, 2023

behlendorf commented Mar 18, 2023

snajpa commented Sep 2, 2023

behlendorf commented Sep 2, 2023

snajpa commented Sep 3, 2023

SEEK_HOLE loop & forced syncing causes never-ending delay in grep #14512

SEEK_HOLE loop & forced syncing causes never-ending delay in grep #14512

Comments

StefanEnspired commented Feb 19, 2023

System information

Describe the problem you're observing

Describe how to reproduce the problem

rincebrain commented Feb 19, 2023

jknepher commented Feb 23, 2023

devZer0 commented Mar 8, 2023

behlendorf commented Mar 8, 2023

devZer0 commented Mar 8, 2023

dbavatar commented Mar 9, 2023

behlendorf commented Mar 18, 2023

snajpa commented Sep 2, 2023

behlendorf commented Sep 2, 2023

snajpa commented Sep 3, 2023