Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEEK_HOLE loop & forced syncing causes never-ending delay in grep #14512

Closed
StefanEnspired opened this issue Feb 19, 2023 · 10 comments
Closed

SEEK_HOLE loop & forced syncing causes never-ending delay in grep #14512

StefanEnspired opened this issue Feb 19, 2023 · 10 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@StefanEnspired
Copy link

System information

Type Version/Name
Distribution Name Fedora
Distribution Version 37
Kernel Version 6.1.11-200.fc37.x86_64
Architecture x86_64
OpenZFS Version 2.1.8

Describe the problem you're observing

When I have a log-generating program and want to run grep on its output, it never finishes.

Describe how to reproduce the problem

A reproducer is this log generator in go, which can easily be used to trigger this condition. When I pipe its output into a file and run grep on it while it’s running, grep cannot run until the generator is stopped. I can reproduce this inside a plain Fedora 37 VM with 4 GB of memory, a completely empty pool on an empty virtual disk. Setting zfs_dmu_offset_next_sync=0 eliminates this problem. It was introduced in #12724

Mailing list link: https://zfsonlinux.topicbox.com/groups/zfs-discuss/T696e33a9741edf03/reading-buffered-data-causes-io-storm

package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	seq := make([]uint64, 20)
	var number int
	for {
		for i := 0; i < 19; i++ {
			seq[i] = seq[i+1]
		}
		seq[19] = rand.Uint64()
		number++
		fmt.Printf("{\"level\":\"debug\",\"log\":\"Sequence %v\",\"number\":%d,\"my-time\":\"%v\"}\n", seq, number, time.Now())
		time.Sleep(time.Millisecond)
	}
}
@StefanEnspired StefanEnspired added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 19, 2023
@rincebrain
Copy link
Contributor

In theory, that PR is supposed to avoid that by the caller holding the rangelock, according to the comment, so it shouldn't get dirtied again and never return, assuming that's what transpires.

A trivial workaround, assuming that's what happens, would be to make it refuse to go through that codepath twice so it falls back to the tunable off behavior in that case, and just says no holes. An alternate behavior would be to assume it's good enough once synced once, but I guess that depends on the semantics anyone might expect from that.

@behlendorf any ideas why this might not be doing what the comment expects?

@StefanEnspired StefanEnspired changed the title HOLE_SEEK loop & forced syncing causes never-ending delay in grep SEEK_HOLE loop & forced syncing causes never-ending delay in grep Feb 20, 2023
@jknepher
Copy link

In case it is helpful, I experience the same problem, and have for a while. I have trained myself to always use cat/tail | grep rather than directly using grep on any active log file, as that works fine.
x86_64
Linux 5.10.52-gentoo
zfs-kmod-2.1.99-943_g045aeabce

@devZer0
Copy link

devZer0 commented Mar 8, 2023

@rincebrain ,for me this looks very different from #14594

if there is a correlation here, it can be easily tested

  • disable atime or enable relatime and retry
  • check if it still happens with 2.1.4

@behlendorf
Copy link
Contributor

The dn_struct_rwlock locking in dmu_offset_next() is supposed to protect against a concurrent dirtying update. But it seems pretty clear from the reproducer there's an update happening outside that lock. PR #13368 may help, but it'll depend on exactly what is being dirtied and where. This change also still needs some work.

@devZer0
Copy link

devZer0 commented Mar 8, 2023

@rincebrain ,for me this looks very different from #14594

if there is a correlation here, it can be easily tested

* disable atime or enable relatime and retry
* check if it still happens with 2.1.4

i was wrong with this, apparently Setting zfs_dmu_offset_next_sync=0 eliminates my problem, too

@dbavatar
Copy link
Contributor

dbavatar commented Mar 9, 2023

We his this too on kernel 5.15, after pulling latest zfs turned zfs_dmu_offset_next_sync on by default. I remember that zfs_holey() fn was very iffy. Would be nice if someone is smart enough to get that data from in-memory state without causing a txg_sync.

behlendorf pushed a commit that referenced this issue Mar 14, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #13368 
Issue #14594 
Issue #14512 
Issue #14009
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Mar 14, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#13368 
Issue openzfs#14594 
Issue openzfs#14512 
Issue openzfs#14009
behlendorf pushed a commit that referenced this issue Mar 15, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes #13368 
Issue #14594 
Issue #14512 
Issue #14009
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 16, 2023
Holding the zp->z_rangelock is insufficient to prevent the
dnode from being re-dirtied in all cases.  To avoid looping
indefinately in dmu_offset_next() on files being actively
written only wait once for a dirty dnode to be synced.  If
after waiting it's still dirty don't report the hole.  This
is always safe.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#14512
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#13368 
Issue openzfs#14594 
Issue openzfs#14512 
Issue openzfs#14009
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#13368 
Issue openzfs#14594 
Issue openzfs#14512 
Issue openzfs#14009
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#13368 
Issue openzfs#14594 
Issue openzfs#14512 
Issue openzfs#14009
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by:  Alexander Motin <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#13368 
Issue openzfs#14594 
Issue openzfs#14512 
Issue openzfs#14009
@behlendorf
Copy link
Contributor

I've opened PR #14641 with a proposed fix for this issue.

andrewc12 pushed a commit to andrewc12/openzfs that referenced this issue Mar 18, 2023
Holding the zp->z_rangelock is insufficient to prevent the
dnode from being re-dirtied in all cases.  To avoid looping
indefinately in dmu_offset_next() on files being actively
written only wait once for a dirty dnode to be synced.  If
after waiting it's still dirty don't report the hole.  This
is always safe.

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#14512
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 27, 2023
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync.  While not optimal this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Signed-off-by: Brian Behlendorf <[email protected]>
Issue openzfs#14512
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 28, 2023
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync.  While not optimal this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#14512
Closes openzfs#14641
andrewc12 pushed a commit to andrewc12/openzfs that referenced this issue Mar 28, 2023
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync.  While not optimal this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#14512
Closes openzfs#14641
behlendorf added a commit that referenced this issue Mar 29, 2023
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync.  While not optimal this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes #14512
Closes #14641
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Mar 29, 2023
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync.  While not optimal this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#14512
Closes openzfs#14641
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Apr 5, 2023
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync.  While not optimal this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#14512
Closes openzfs#14641
@snajpa
Copy link
Contributor

snajpa commented Sep 2, 2023

Just to make sure our voice is heard too, our usecase for OpenZFS is container hosting, we can't afford any situation where a container can just keep forcing out txgs onto disk, it creates highly unfair situation to the other ones on the same host. So, in our case, the best course of action - due to lack of other options - is to make ZFS act like it doesn't support holes at all. We're about to push this patch into our production to do just that. If anyone is interested, I could do a PR with a knob to turn holes off in this way.

@behlendorf
Copy link
Contributor

@snajpa I'm not sure I understand why setting zfs_dmu_offset_next_sync=0 isn't sufficient for you use case. This would prevent users from being able to force txg syncs as well as preserve hole reporting for file which aren't being actively modified.

@snajpa
Copy link
Contributor

snajpa commented Sep 3, 2023

I'm sorry I posted before reading the actual code. It's as you say, sorted out.

pcd1193182 pushed a commit to pcd1193182/zfs that referenced this issue Sep 26, 2023
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external caller which do not
take the rangelock holes are not reported after the first
sync.  While not optimal this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Reviewed-by: Brian Atkinson <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#14512
Closes openzfs#14641
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests

7 participants