lseek(..., ..., SEEK_HOLE) extremely slow under heavy I/O load. #4306
Comments
@bucko909 you've definitely put your finger on an issue. In order for SEEK_HOLE to authoritatively return the next on-disk hole, it needs to make sure that all dirty blocks for that file are on disk. Where those blocks end up getting written, or even whether they get written at all, isn't determined until the txg they're part of is written. Which means this function has to block waiting for that txg to be synced when a process is appending to the file. There's probably room for some additional optimization here, but it'll take some work. One thing you might try in your test application, and I'm not sure this will work, is to
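For reference, the logic being described lives in `dmu_offset_next()` in `module/zfs/dmu.c` (linked later in this thread). A lightly paraphrased sketch of its shape at the time (not a verbatim copy of the source, so names and details may differ):

```c
int
dmu_offset_next(objset_t *os, uint64_t object, boolean_t hole, uint64_t *off)
{
	dnode_t *dn;
	int i, err;

	err = dnode_hold(os, object, FTAG, &dn);
	if (err)
		return (err);

	/* Is the dnode dirty in any open txg? */
	for (i = 0; i < TXG_SIZE; i++) {
		if (list_link_active(&dn->dn_dirty_link[i]))
			break;
	}

	/*
	 * If it is dirty, the on-disk block layout is not authoritative
	 * yet, so wait for the pending txg to sync before walking it.
	 * This is the txg_wait_synced() stall reported in this issue.
	 */
	if (i != TXG_SIZE) {
		dnode_rele(dn, FTAG);
		txg_wait_synced(dmu_objset_pool(os), 0);
		err = dnode_hold(os, object, FTAG, &dn);
		if (err)
			return (err);
	}

	/* Walk the (now stable) on-disk blocks for the next hole or data. */
	err = dnode_next_offset(dn, (hole ? DNODE_FIND_HOLE : 0), off, 1, 1, 0);
	dnode_rele(dn, FTAG);

	return (err);
}
```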
@bucko909 A 15 minute transaction sync time is a problem. Ideally, it should never exceed 5 seconds. @behlendorf I do not think that is necessary. Waiting on the transaction group likely only matters in situations where zero detection by compression makes holes. It is okay to report data where there are holes, so ignoring hole creation by zero detection in compression should be fine. At the same time, it is not okay to report holes where there is data, so we need to think about whether there are any situations that the wait actually prevents. I cannot think of any at the present time.
For some additional information, here is what the stack should look like for zvol_discard/zfs_trunc:
@ahrens What do you think?
As I say, I can't prove that the longer delays are the same; I only recently remembered to look. In the spirit of enquiry, I confirmed @ryao's suspicion that
The original implementation of `SEEK_DATA` and `SEEK_HOLE` for `lseek()` in OpenSolaris would block on the transaction group commit when called on files that have dirty data. This was an attempt to ensure that any holes that are made by zero detection are reported. However, there is nothing stopping userland from redirtying the dnode with additional writes while the transaction group commit finishes, such that the reported information is not necessarily up to date. The consequence is that we can report data where holes are (or would be) present.

Reporting data where there are holes is alright, because userspace would then copy zeroes instead on a read, but the inverse of reporting holes where there is data is not alright. Other consumers of `dnode_next_offset` such as zvol_discard would fail to perform their intended function if holes were reported where there is data, and they do not block on the transaction group commit like we do with lseek on `SEEK_DATA` or `SEEK_HOLE`. Consequently, it should be the case that this code will never report a hole where there is data, and there is no need to block on the transaction group commit.

Closes openzfs#4306
Signed-off-by: Richard Yao <[email protected]>
@bucko909 I have opened #4311 with a patch that deletes this logic, so that the buildbot can subject it to testing. The only way this patch should cause a problem is if there is some way for us to report a hole where there is data, but I do not see how that is possible. If you are willing to risk me being wrong, you could try applying the patch to your system and seeing how things perform. Otherwise, I suggest holding off until it has had more review. That said, I think that your userspace is doing multiple
Righto. I think we'll stick with the current driver and the workaround for now; testing this on our live logging endpoint sounds a bit risky (though, as I say, we probably never have any holes). I'll open a new bug if I catch the longer delays using my
(Or should it be left open until the patch is merged? I'll leave it open in that case; I initially read the merge as a new issue -- not used to GitHub yet!)
It should be kept open until the patch is reviewed and merged.
referencing #4311
FYI, it also badly affects performance for synchronised disk I/O in VMs in recent versions of QEMU when the disk images are qcow2 images stored on ZFS: https://www.mail-archive.com/[email protected]/msg427170.html
I spent a bunch of time investigating this problem, without knowing about this issue on GitHub. I believe most ZFS users experience this slowness in some way without realizing it. It's extremely easy to hit this condition: the simplest case is a log file, which is obviously appended to occasionally. This actually causes a sparse section to appear at the end of the file. Now if you run a very common command on a log, e.g. "grep foo bar.log", grep will attempt to seek past the hole at the end of the file, potentially causing a multi-minute delay while txg_wait_synced() completes. This is the ONLY vnop to require txg_wait_synced(); all the other callers are zfs/zpool-specific actions.

There is a simple solution to this problem, which is to simply disable this implementation. The system will work just fine without it. I've disabled this site-wide at my organization, and I recommend ZFS take this as the default until such time as someone can think of an implementation that works properly. I'd be happy to submit such a patch, if it would be accepted.
That's not an unreasonable suggestion, but I think we can do somewhat better than this with a reasonable compromise. Instead of disabling this implementation entirely, we could dynamically disable this functionality only when the dnode is dirty. This would resolve the issue observed with append+grep and still allow tools that rely on hole detection to benefit when files are not being actively written.

Or in other words: until all pending I/O which could create new holes or remove existing ones is resolved, we fall back to the simplest safe implementation, which is nicely described in the man page.
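To make the idea concrete, here is a rough sketch of that compromise (purely illustrative and hedged: it reuses the dirty-check from the earlier `dmu_offset_next()` sketch, and the caller-side handling shown is hypothetical, not an actual patch):

```c
	/*
	 * Compromise sketch: if the dnode is dirty in any open txg, do NOT
	 * wait for the txg to sync.  Return EBUSY so the caller can report
	 * the remainder of the file as data, which is always a safe answer.
	 */
	if (i != TXG_SIZE) {
		dnode_rele(dn, FTAG);
		return (SET_ERROR(EBUSY));
	}

	/* ...and in the lseek()/zfs_holey() caller (hypothetical handling): */
	error = dmu_offset_next(os, object, hole, &noff);
	if (error == EBUSY) {
		/* File is being written: pretend it has no holes. */
		noff = hole ? file_sz : offset;
		error = 0;
	}
```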
@dbavatar if you're willing to put together such a patch and we don't uncover any unexpected problems, I'd accept it.
@behlendorf I don't believe this can work, because at least in my case the file is basically always dirty. There's always new log data coming in on an active system, not to mention that the txg timeout is often less than the time to sync a txg.

Thinking about this more, I believe that as-is this vnop could lead to data corruption. For it to work right, I believe you need a zfs_range_lock(read) to wrap it. Otherwise, many writes could have occurred during the time we were waiting for txg_wait_synced() to complete, and one thread could report back zero data in the range minutes after many writes were made. This is not true when doing plain seek/read/write. An actual solution to this probably looks like zfs_read(), except that it doesn't bother allocating dbufs and just stops at the first actual byte found.
So unfortunately I don't think even taking the zfs_range_lock()'s here would be enough to always report this correctly. The problem is that the locks are only held for the duration of the system calls, over the regions being updated. However, it's not until much later, when the txg is being synced and the locks have all been dropped, that we determine the final on-disk layout.

For this to work correctly, and be fast, you'd need to take the zfs_range_lock() and then walk all the pending dbuf writes to determine exactly what updates to the file are going to happen. Then you'd need to merge that with the current on-disk state to find the next hole. That's doable, but not easy, which is why the code currently waits for the txg to do exactly this and then checks only the on-disk version. It's also worth pointing out that the moment we drop the range lock it's entirely possible for a new write to come in and the hole information to be flat-out wrong.

I suggested simply disabling SEEK_HOLE when the dnode is dirty because it seems like a reasonable way to handle the common case you've described. This should be functionally correct, since it will appear to applications as if all holes have simply been removed from the file, and you'll get good append performance since you'll never wait on a txg. Only once the file is no longer being actively updated will the holes reappear. That said, I'm all for a better solution if you can propose one. For example, I can see how we might be able to special-case this append-only case you're hitting.
I understand how this can work now. Please review the patch here: dbavatar@f71ff02. I've only done minimal testing on it.
Thanks for proposing a patch. The approach taken looks reasonable, but can you please add a comment above the

@dbavatar could you please open a new PR with the proposed patch? This way we can get some additional reviews and have it run through the automated regression testing, which should include checks for correct SEEK_HOLE behavior. It would also be great if you could include any performance data you have in the PR.
I've been experiencing random hangs in syscalls (i.e., I cannot even ^C the process) when running `grep` and `less` on our log server, which runs on (gzip-compressed) ZFS. It appears to be correlated with high load averages, possibly as a result of compression jobs, and I've mostly (only?) observed it on files which are actively being appended. I've been able to reproduce 7-second delays running these tools in testing during heavy load, and I've experienced delays in the minutes. I have no proof that the longer delays are due to the same cause, but it seems likely.

I can't guarantee that the long and short delays are the same problem, but I have observed via `strace` that the hangs are in `lseek` with a `whence` of `SEEK_HOLE`. I confirmed this with a simple C program which just opened my file and ran `lseek(fd, 32768, SEEK_HOLE);` -- which was the exact syscall `grep` hung on.
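A minimal standalone reproducer along those lines (a sketch; the offset is the same placeholder value used above) just times the call:

```c
/* seek_hole_repro.c - time lseek(SEEK_HOLE) on a file that is being
 * actively appended to by another process.
 * Build: cc -o seek_hole_repro seek_hole_repro.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	off_t off = lseek(fd, 32768, SEEK_HOLE);  /* the call grep was stuck in */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("lseek(SEEK_HOLE) returned %lld after %.3f s\n", (long long)off, secs);

	close(fd);
	return 0;
}
```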
More evidence is that `SEEK_HOLE` isn't used by cat, which means that `cat FILE | grep SEARCH` is much faster than `grep SEARCH FILE`. When the server was busy enough to reproduce this yesterday, I recorded ~7s to `grep` an 80MB file, and 0.1s to `cat | grep`.
For now I've instigated a workaround, which is an `LD_PRELOAD` library that rewrites `lseek(fd, offset, SEEK_HOLE)` to `lseek(fd, 0, SEEK_END)` -- I've not been able to reproduce these delays with the library loaded.
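Such a shim (a hedged sketch, not the exact library used here; error handling is minimal and 64-bit `lseek64` aliasing is ignored) can be built as a small shared object:

```c
/* seek_hole_shim.c - rewrite SEEK_HOLE into SEEK_END before calling the
 * real lseek().
 * Build: cc -shared -fPIC -o seek_hole_shim.so seek_hole_shim.c -ldl
 * Use:   LD_PRELOAD=./seek_hole_shim.so grep SEARCH FILE */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/types.h>
#include <unistd.h>

off_t lseek(int fd, off_t offset, int whence)
{
	static off_t (*real_lseek)(int, off_t, int);

	if (real_lseek == NULL)
		real_lseek = (off_t (*)(int, off_t, int))dlsym(RTLD_NEXT, "lseek");

	if (whence == SEEK_HOLE) {
		/* Pretend the only hole is at end-of-file. */
		offset = 0;
		whence = SEEK_END;
	}

	return real_lseek(fd, offset, whence);
}
```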
Looking at the source, I see that `SEEK_HOLE` and `SEEK_DATA` ultimately end up here:
https://github.com/zfsonlinux/zfs/blob/19d55079aecb5c022c1c09e0eace4f7da7381a62/module/zfs/dmu.c#L1812

That code does `txg_wait_synced` if the file has dirty data. I've observed `sync` take 15 minutes on this server when it's under heavy load, so that would probably explain the delay. Is there any way around this beyond my stupid `LD_PRELOAD` hack?