Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lseek(..., ..., SEEK_HOLE) extremely slow under heavy I/O load. #4306

Closed
bucko909 opened this issue Feb 4, 2016 · 16 comments
Closed

lseek(..., ..., SEEK_HOLE) extremely slow under heavy I/O load. #4306

bucko909 opened this issue Feb 4, 2016 · 16 comments
Labels
Type: Performance Performance improvement or performance problem
Milestone

Comments

@bucko909
Copy link

bucko909 commented Feb 4, 2016

I've been experiencing random hangs in syscalls (ie. cannot even ^C the process) when running grep and less on our log server which runs on (gzip compressed) ZFS. It appears to be correlated with high load averages possibly as a result of compression jobs, and I've mostly (only?) observed it on files which are actively being appended. I've been able to reproduce 7 second delays running these tools in testing during heavy load, and I've experienced delays in the minutes. I have no proof that the longer delays are due to the same, but it seems likely.

I can't guarantee that the long and short delays are the same problem, but I have observed via strace that the hangs are in lseek with a whence of SEEK_HOLE. I confirmed this with a simple C program which just opened my file and ran lseek(fd, 32768, SEEK_HOLE); -- which was the exact syscall grep hung on.

More evidence is that SEEK_HOLE isn't used in cat, which means that cat FILE | grep SEARCH is much faster than grep SEARCH FILE. When the server was busy enough to reproduce this yesterday, I recorded ~7s to grep an 80MB file, and 0.1s to cat | grep.

For now I've instigated a workaround which is an LD_PRELOAD library which rewrites lseek(fd, offset, SEEK_HOLE) to lseek(fd, 0, SEEK_END) -- I've not been able to reproduce these delays with the library loaded.

Looking at the source, I see that SEEK_HOLE and SEEK_DATA ultimately end up here:

https://github.com/zfsonlinux/zfs/blob/19d55079aecb5c022c1c09e0eace4f7da7381a62/module/zfs/dmu.c#L1812

Which does txg_wait_synced if the file has dirty data. I've observed sync take 15 minutes on this server when it's under heavy load, so that would probably explain the delay. Is there any way around this beyond my stupid LD_PRELOAD hack?

@behlendorf behlendorf added the Type: Performance Performance improvement or performance problem label Feb 4, 2016
@behlendorf
Copy link
Contributor

@bucko909 you've definitely put your finger on an issue. In order for SEEK_HOLE to authoritatively return the next on-disk hole it needs to make sure that all dirty blocks for that file are on-disk. Where those blocks end up getting written, or even if they get written, isn't determined until the txg they're part of is written. Which means this function has to block waiting for that txg to be sync'd when a process is appending to the file.

There's probably room for some additional optimization here but it'll take some work. One thing you might try in your test application, and I'm not sure this will work, is to fsync() the file before calling lseek().

@ryao
Copy link
Contributor

ryao commented Feb 5, 2016

@bucko909 A 15 minute transaction sync time is a problem. Ideally, it should never exceed 5 seconds. lseek blocking on the transaction sync has hit me in the past too (mostly in vim), although it never took more than 5 seconds. You could try reducing the value of /sys/module/zfs/parameters/zfs_dirty_data_max. That should speed up the transaction group commit by reducing the amount to commit.

@behlendorf I do not think that fsync() would help here because it would send data to ZIL, rather than sync the transaction group.

Waiting on the transaction group likely only matters in situations where zero detection by compression makes holes. It is okay to report data where there are holes, so ignoring hole creation by zero detection in compression should be fine. At the same time, it is not okay to report holes where they is data, so we need to think about whether there are any situations that this prevents. I cannot think of any at the present time.

@ryao
Copy link
Contributor

ryao commented Feb 5, 2016

For some additional information, here is what the stack should look like for zvol_discard/zfs_trunc:

dnode_next_offset
dnode_free_range
dmu_free_long_range_impl
dmu_free_long_range
zvol_discard/zfs_trunc

txg_wait_synced is protecting us from calling dnode_next_offset in dmu_offset_next, but there is no such protection for zvol_discard or zfs_trunc. If we report holes where there is data there, we would fail to discard/truncate data, so I suspect that reporting holes where there is data is not actually possible. If the only risk is reporting data where a hole would be formed by the transaction group commit, then we should be okay. I still need to read more code, but it looks like that bit of protection can go.

@ahrens What do you think?

@bucko909
Copy link
Author

bucko909 commented Feb 5, 2016

As I say, I can't prove that the longer delays are the same. I only recently remembered to strace before the hang -- straceing during it just hangs strace -- so I'd previously been blaming the general I/O subsystem. A friend has let me know about /proc/sysrq-trigger, so if I hit a long delay again, I'll get some kernel stack traces and make a new bug if they don't agree with the diagnosis here. My suspicion is that a completely full writeback buffer (the machine is often saturated with random writes -- it's got >10,000 files open and appends to >1000 of them every few seconds) and a file that's being continually appended combine to make the sync take a very long time sometimes, though.

In the spirit of enquiry, I confirmed @ryao's suspicion that fsync before lseek doesn't help -- the server isn't so busy right now, but on a bunch of tests I still saw the lseek delay a few times after fsync.

ryao added a commit to ryao/zfs that referenced this issue Feb 5, 2016
The original implementation of `SEEK_DATA` and `SEEK_HOLE` for `lseek()`
in OpenSolaris would block on the transaction group commit when called
on files that have dirty data. This was an attempt to ensure that any
holes that are made by zero detection are reported. However, there is
nothing stopping userland from redirtying the dnode with additional
writes while the transaction group commit finishes, such that the
reported information is not necessarily up to date. The consequence is
that we can report data where holes are (or would be) present.

Reporting data where there is holes is alright because userspace would
then copy zeroes instead on a read, but the inverse of reporting holes
where there is data is not alright. Other consumers of
`dnode_next_offset` such as zvol_discard would fail to perform their
intended function if holes were reported where there is data and do not
block on the transaction group commit like we do with lseek on
`SEEK_DATA` or `SEEK_HOLE`. Consequently, it should be the case that
this code will never report a hole where there is data and there is no
need to block on the transaction group commit.

Closes openzfs#4306

Signed-off-by: Richard Yao <[email protected]>
@ryao
Copy link
Contributor

ryao commented Feb 5, 2016

@bucko909 I have opened #4311 with a patch that deletes this logic for the buildbot to subject to testing. The only way this patch should cause a problem is if there is some way for us to report a hole where there is data, but I do not see how that is possible. If you are willing to risk me being wrong, you could try applying the patch to your system and seeing how things perform. Otherwise, I suggest holding off until it has had more review.

That said, I think that your userspace is doing multiple lseek() calls on a file continuously being dirtied. I do not see any other way that you could blocked for 15 minutes in this callpath without the system having bigger performance problems. Profiling the kernel thread during such a period should show that.

ryao added a commit to ryao/zfs that referenced this issue Feb 5, 2016
The original implementation of `SEEK_DATA` and `SEEK_HOLE` for `lseek()`
in OpenSolaris would block on the transaction group commit when called
on files that have dirty data. This was an attempt to ensure that any
holes that are made by zero detection are reported. However, there is
nothing stopping userland from redirtying the dnode with additional
writes while the transaction group commit finishes, such that the
reported information is not necessarily up to date. The consequence is
that we can report data where holes are (or would be) present.

Reporting data where there is holes is alright because userspace would
then copy zeroes instead on a read, but the inverse of reporting holes
where there is data is not alright. Other consumers of
`dnode_next_offset` such as zvol_discard would fail to perform their
intended function if holes were reported where there is data and do not
block on the transaction group commit like we do with lseek on
`SEEK_DATA` or `SEEK_HOLE`. Consequently, it should be the case that
this code will never report a hole where there is data and there is no
need to block on the transaction group commit.

Closes openzfs#4306

Signed-off-by: Richard Yao <[email protected]>
@bucko909
Copy link
Author

bucko909 commented Feb 5, 2016

Righto. I think we'll stick with the current driver and the workaround for now; testing this on our live logging endpoint sounds a bit risky (though, as I say, we probably never have any holes). I'll open a new bug if I catch the longer delays using my LD_PRELOAD hack so I think this issue can be closed.

@bucko909 bucko909 closed this as completed Feb 5, 2016
@bucko909
Copy link
Author

bucko909 commented Feb 5, 2016

(Or should it be left open until the patch is merged? I'll leave it open in that case; I initially read the merge as a new issue -- not used to github yet!)

@bucko909 bucko909 reopened this Feb 5, 2016
@ryao
Copy link
Contributor

ryao commented Feb 5, 2016

It should be kept open until the patch is reviewed and merged.

@kernelOfTruth
Copy link
Contributor

referencing #4311

@stephane-chazelas
Copy link

FYI, it also badly affects performance for synchronised disk I/O in VMs in recent versions of QEMU when the disk images are qcow2 images stored on ZFS:

https://www.mail-archive.com/[email protected]/msg427170.html

@dbavatar
Copy link
Contributor

I spent a bunch of time investigating this problem, without knowing about this issue on github. I believe most ZFS users experience this slowness in some way, without realizing it. It's extremely trivial to hit this condition - the simplest case is a log file which is obviously appended to occasionally. This actually causes a sparse section to appear at the end of the file. Now if you run a very very common command on a log, i.e. "grep foo bar.log", grep will attempt to seek past the hole at the end of the file, causing potential a multiple MINUTE delay while txg_wait_synced() completes. This is the ONLY vnop to require txg_wait_synced(). All the other calls are zfs/zpool specific actions.

There is a simple solution to this problem, which is to simply disable this implementation. The system will just fine without it, and do so quite well. I've disabled this site-wide at my organization, and I recommend ZFS take this as the default, until such time as someone can think of an implementation that works properly. I'd be happy to submit such a patch, if it would be accepted.

@behlendorf behlendorf added this to the 0.7.0 milestone Mar 23, 2017
@behlendorf
Copy link
Contributor

There is a simple solution to this problem, which is to simply disable this implementation.

That's not an unreasonable suggestion, but I think we can do somewhat better than this with a reasonable compromise. Instead of disabling this implementation entirely we could dynamically disable this functionality only when the dnode is dirty. This would resolve the issue observed with append+grep and allow tools like tar, cp to continue to benefit from the optimization for files which are not being actively updated.

Or in other words until all pending IO which could create new holes or remove existing ones is resolved we fallback to the simplest safe implementation which is nicely described in the man page.

    In the simplest implementation, a filesystem  can  support the  operations
    by making SEEK_HOLE always return the offset of the end of the file, and
    making SEEK_DATA always return offset (i.e.,  even  if the  location
    referred to by offset is a hole, it can be considered to consist of data that
    is a sequence of zeros).

@dbavatar if you're willing to put together such a patch and we don't uncover any unexpected problems I'd accept it.

@dbavatar
Copy link
Contributor

That's not an unreasonable suggestion, but I think we can do somewhat better than this with a reasonable compromise. Instead of disabling this implementation entirely we could dynamically disable this functionality only when the dnode is dirty. This would resolve the issue observed with append+grep and allow tools like tar, cp to continue to benefit from the optimization for files which are not being actively updated.

@behlendorf I don't believe this can work, because at least in my case the file is basically always dirty. There's always new log data coming in on an active system, not to mention the txg timeout is often less than the time to sync a txg. Thinking about this more, I believe as-is this vnop could lead to data corruption. For it to work right, I believe you need a zfs_range_lock(read) to wrap it. Otherwise, many writes could have occurred during the time we were waiting for txg_wait_synced() to complete. One thread could report back 0 data in range minutes after many writes were made. This is not true if doing plain seek/read/write. An actual solution to this probably looks like zfs_read(), except it doesn't bother allocating dbuf's, just stops at the first actual byte found.

@behlendorf
Copy link
Contributor

because at least in my case the file is basically always dirty.

So unfortunately I don't think even taking the zfs_range_lock()'s here would be enough to always report this correctly. The problem is that the locks are only held for the duration of the system calls over the regions being updated. However, it's not until much latter when the txg is being synced and the locks are all dropped that we determine the final on-disk layout.

For this to work correctly, and be fast, you'd need to take the zfs_range_lock() and then walk all the pending dbuf writes to determine exactly what updates to the file are going to happen. Then you'd need to merge that with the current on-disk state to find the next hole. That's do able, but not easy which is why the code currently waits for the txg to do exactly this and then checks only the on-disk version.

It's also worth pointing out that the moment we drop the range lock it's entirely possible for a new write to come in the the hole information to be flat out wrong. I suggested simply disabling SEEK_HOLE when the dnode is dirty because it seems like a reasonable way to handle the common case you've described. This should be functionally correct since it will appear to applications as if all holes have simply been removed from the file and you'll get good append performance since you'll never wait on a txg. Only once the file is no longer being actively updated will the holes reappear.

That said, I'm all for a better solutions if you can propose one. For example, I can see how we might be able to special case this append-only case you're hitting.

@dbavatar
Copy link
Contributor

I suggested simply disabling SEEK_HOLE when the dnode is dirty because it seems like a reasonable way to handle the common case you've described. This should be functionally correct since it will appear to applications as if all holes have simply been removed from the file and you'll get good append performance since you'll never wait on a txg. Only once the file is no longer being actively updated will the holes reappear.

I understand how this can work now, please review the patch here dbavatar@f71ff02 I've only done minimal testing on it.

@behlendorf
Copy link
Contributor

Thanks for proposing a patch. The approach taken looks reasonable but can you please add a comment above the dmu_offset_next() function describing this behavior which is otherwise a bit non-intuitive. Also right now zfs_holey_common() is the only consumer of this function so the impact of this change in behavior will be limited.

@dbavatar could you please open a new PR with the proposed patch. This way we can get some additional reviews and have it run through the automated regression testing which should include checks for correct SEEK_HOLE behavior. It would also be great if you could include any performance data you have in the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Performance Performance improvement or performance problem
Projects
None yet
Development

No branches or pull requests

6 participants