Deleting Files Doesn't Free Space, unless I unmount the filesystem #1548
Comments
One possible answer for this behavior is that the deleted file is still open. You can delete an open file which will remove it from the namespace but the space can't be reclaimed until the last user closes it. |
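(A minimal demonstration of the deleted-but-still-open case — not from this thread; the tank/test dataset and file names are made up:)

```sh
# Create a file, then hold it open while deleting it.
dd if=/dev/zero of=/tank/test/big.dat bs=1M count=1024
tail -f /tank/test/big.dat & TAILPID=$!

rm /tank/test/big.dat
sync
zfs list tank/test                      # USED still includes the ~1G
lsof -p "$TAILPID" | grep '(deleted)'   # the unlinked file is still open

# Space is reclaimed only after the last open descriptor goes away.
kill "$TAILPID"
sync
zfs list tank/test
```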
I have stopped all my processes for 2 hours; only the ZFS filesystem remains mounted. There are no other processes accessing the files and no descriptor marked as "deleted" in /proc//fd/. Still, the disk usage does not decrease. Besides, it reports 0 in the "freeing" property:
but when I unmount the filesystem:
I'm using zfs-0.6.1-1.el6.x86_64 on CentOS 6 machines. |
@fabiokorbes It might be interesting to see the output of zpool history -i. |
Here it is:
|
@fabiokorbes I think I've been able to reproduce this leak. I gather you're storing your logs directly in the root directory of the pool? Since your pool history doesn't show any file system creations, I'd imagine so. I've discovered that some objects remain in the root dataset after deleting all the files in it. In my test, I created a pool named junk, copied a bunch of files into it and then removed them. I also did the same with a filesystem within it named junk/a. Here's the output of zdb -d junk:
Notice the objects left in the root file system. The 21.1M also jibes with the output of zfs list. I suspect you might find the same, too. As a cross-check, I ran this test on FreeBSD and the problem does not occur there. It would also appear that your pool has never been exported, by virtue of the txg: 4 in the label. I've discovered that an export/import cycle will free this space. I'll hold off on looking into this further until I hear back from you as to whether this might be your issue. Hmm, after further investigation, I can't seem to duplicate this any more. A zdb -d output would still be interesting. |
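(For anyone wanting to repeat the check: a sketch of the kind of test described above; /dev/sdX is a placeholder device and the pool is scratch:)

```sh
zpool create junk /dev/sdX         # scratch pool; /dev/sdX is a placeholder
cp -a /etc /junk/                  # copy a pile of files into the pool root
rm -rf /junk/etc
sync
zdb -d junk                        # object counts and sizes left per dataset
zfs list -o name,used,refer junk   # compare with the space zfs still reports
```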
@fabiokorbes I think I may have been a little quick-on-the-draw coming to the conclusion that I did. I think I was just seeing the effects of deferred deletion. I guess the output of zdb -d would still be interesting and also it would be interesting to know if an export/import cycle makes any difference. |
Here is the zdb -d output, before...
and after an export/import:
BUT, I think it only releases the space because the export unmounts the filesystem. |
[NOTE: This comment is effectively superseded by my next one in which I show the reproducing steps] @fabiokorbes You're correct, the mount/umount is what ends up freeing the space. BTW, I presume that 267G is the correct amount of used space in your gfs dataset? I've been able to duplicate this problem again and figured out why I had difficulties previously. In my test, I'm rsyncing data from my /etc and /usr directories onto my test dataset. The problem only occurs when I use "-X", which preserves extended attributes, so I think this issue may only occur when some of the created and/or deleted files have extended attributes. Further debugging with zdb -dddd shows me that when the dataset is in this condition (after having copied lots of files and directories and then deleting them), there is unprocessed stuff in the ZFS delete queue and lots of "ZFS directory" objects lying around like this:
Even further examination shows me that these objects are still in the ZFS delete queue object (seemingly always object #3). Obviously, un-purged directory objects aren't going to be wasting a lot of space. I need to do further testing. My guess is that the same problem may exist for regular files that have extended attributes. I just did a bit more digging and I find b00131d to be possibly related to, at least, the leakage that I've discovered. Your leakage problem may very well be different. It would be interesting for you to examine the output of zdb -dddd gfs and see what type of objects are lying around when you have some leakage. They'll be pretty obvious because their path will show up as "???..." as you can see above. |
@fabiokorbes After a bit more experimenting, I discovered that, at least in non-xattr=sa mode in which extended attributes are stored in their own ZFS directory and file objects, removing a file does not free its space. The reproducing steps are rather simple:
And if we check the delete queue:
I'll do a little poking around today to see why the deletions aren't happening. As we've both seen, a mount/umount cycle will cause the space to be freed. I've not yet checked whether the freeing happens on the umount or on the mount. |
I don't know if this is related, but there may be a leak that has been present for years. I can always reproduce this (with the latest spl/zfs) by creating 100k files (with a total of 100 GB) in a zpool root dir, and deleting them all afterwards. No need for xattrs, and no cleanup with export/import. Each time after the delete, the amount of used space increases by around 200-400 KB. After some runs, it uses about 2 MB more than before. If there is no mechanism to clean this up, a zpool may fill up pretty fast; if it is by design, it shouldn't fill up the pool. zdb -d stor2 before; zdb -d stor2 after create/delete of 100 GB/100k files; next run; after an export/import; after the next create/delete cycle; after the next create/delete cycle |
@pyavdr I suppose a zdb -dd stor2 afterwards would be in order to see which object is increasing in size, then follow up with a zdb -dddd ... to see what's in that object. BTW, you can't do zdb -dddd stor2 object_id because it's the root dataset and it will display the object_id in the MOS. You need to do zdb -dddd stor2 and look at the whole thing (with more or less or whatever). |
@dweeezil I've run a zdb -dddd and I found 1407 objects with path ???. The sum of their dsize is 129 GB now. This is the same amount by which my FS occupies more space than I've estimated. Some examples:
I believe the freeing happens on the umount, because it takes several minutes to complete, and it takes more time in proportion to the disk occupation. |
@fabiokorbes That shows your problem isn't the xattr-related issue I've discovered. Those show every sign of being deleted while open but I know you have tried killing all your processes. If they're really not open, it seems there must be some weird sequence of system calls being performed on them that cause them to be left in this condition. Maybe it is related to the xattr issue I've discovered but this seems like something different. I don't think I can come up with a reproducer on my own. You'll need to find the sequence and timing of the system calls used by your program to try to help reproduce this. I'm also wondering if your whole file system becomes "stuck" at this point or whether newly-created files that are created outside your logging application are deleted properly. If you do something like echo blah > blah.txt and then remove it, is its space freed? This is a lot easier to test in a non-root file system. You might try to zfs create gfs/test and then create a sample file there. Then run an ls -i on it to find its object id which you can view with zdb -dddd gfs/test <object_id>. When testing, remember to run a sync between deleting the file and checking its object with zdb. |
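(The suggested test, spelled out; gfs/test's mountpoint and the object id 8 are assumptions:)

```sh
zfs create gfs/test
echo blah > /gfs/test/blah.txt
ls -i /gfs/test/blah.txt    # on ZFS the inode number is the object id, e.g. 8
rm /gfs/test/blah.txt
sync                        # let the txg commit before inspecting
zdb -dddd gfs/test 8        # if the object is gone, its space was freed
```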
Sorry to butt in, but if the files were truly "stuck open" at a system level, wouldn't umount complain vigorously? It certainly does on my system when I forget to kill a process (I'm looking at you, nfsd) before I unmount a filesystem. On Jun 27, 2013, at 16:46, Tim Chase wrote:
|
@ColdCanuck You'd certainly think that stuck-open files would prevent unmounting (except for forced unmounts). I do think there's a bug here and that it's not directly related to deleting open files. I think that reproducing this will require getting a handle on the exact file system operations being performed by the logging application. @fabiokorbes Have you gotten a handle on any of the following details of the problem:
|
@dweeezil, @fabiokorbes It would be useful to know if you're able to reproduce this issue when setting xattr=sa. My suspicion is you won't be able to, since it won't make extensive use of the delete queue, but it's worth verifying. More generally, ZoL has some significant differences in how it manages the delete queue compared to the other implementations, particularly in regard to xattrs. This was required to avoid some particularly nasty deadlocks which are possible with the Linux VFS. Basically, the other implementations largely handle the deletion of xattrs synchronously while ZoL does it asynchronously in the context of the iput taskq. b00131d Fix unlink/xattr deadlock However, if you're not making extensive use of xattrs, normal file deletion will also be deferred to zfs_unlinked_drain() time, which is done unconditionally during the mount. It sounds like this is what may be happening, although it would be useful to determine where it is blocking during the umount. It may be blocked waiting for the pending zfs_unlinked_drain() work items to be finished. Getting a stack trace of where the umount is blocked and what the iput thread is doing would be helpful. |
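(Generic ways to capture the stack traces asked for here — standard Linux debugging, nothing ZFS-specific; the /gfs mountpoint is assumed:)

```sh
# Start the unmount and, while it hangs, dump its kernel stack:
umount /gfs &
sleep 60
cat /proc/$(pgrep -x umount)/stack

# Or dump every blocked (D-state) task, including the iput taskq threads:
echo w > /proc/sysrq-trigger
dmesg | tail -n 200
```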
@behlendorf A quick test shows me that deleting files with xattrs when xattr=sa works just fine (the space is freed). The non-space-freeing problem with xattr=on is something I found by accident when trying to duplicate @fabiokorbes problem. I'm planning on working up a separate issue for that problem once I nail it down a bit better. |
I should have provided more details of our environment in the first place. We have 10 servers with ZFS filesystems. We combine them with Gluster and export the GlusterFS to another server where syslog-ng writes the logs. I thought the issue was not related to Gluster because it happens even when I delete a file manually on the ZFS side. I haven't tried to delete a manually created file. I did it now (with dd) and it works: it freed the space. Gluster uses extended attributes.
Yes, it happens right after the reboot. I'm not 100% sure if every file leaks space, but the growth rate says so.
No, they don't.
Do you mean a child of the dataset? I haven't tried it. I'll try to change the xattr property to sa next. |
@fabiokorbes Neither of the files shown in your zdb -dddd output above had extended attributes. If they did, there would be an xattr line right below the pflags line that would reference the ZFS directory object containing the extended attributes. I'm afraid I've never used Gluster (but I'm reading up on it right now). I have a feeling your problem is that some part of the Gluster server side software is holding the deleted file open. One of my virtual test environments uses CentOS 6.4 which looks like it's pretty easy to install Gluster on. I may give that a try. |
Oops! Sorry! I think my grep truncated the last line:
But I have deleted files without the xattr too: |
Here is a file without xattr:
|
@fabiokorbes Yes, your whole problem is the "deleting files with xattrs leaks space" problem that I found. Your object 6705 is clearly the "value file" of one of your "trusted.gfid" attributes (it was created with a block size of 512 which is what these internally-generated attribute files use). Switching to xattr=sa will fix your problem but it will make your pool incompatible with other ZFS implementations. I started tracking down the cause of this xattr problem this morning and have made good progress. I'm not sure now whether to file a more specific issue for this or to keep this thread going. Once I nail down the cause, I'll post some more information here. |
Yes, it fixed my problem. Thanks! 👍 The old files still leak space. Is this what you mean by incompatible with other ZFS implementations? |
@fabiokorbes My comment regarding incompatibility referred to the fact that only ZoL has the xattr=sa mechanism to store xattrs. If you import the pool under Illumos or FreeBSD, they won't see your xattrs (actually, I don't think FreeBSD supports xattrs at the moment). It doesn't sound like this is a big problem for you. Until the cause of this bug is fixed, removing any of your pre-xattr=sa files will leak space and you'll have to reclaim the space by a mount/umount cycle. Newly-created files will be just fine. Regarding my comment about storing files in the root of a pool, I'd still recommend setting up future pools with at least one child file system even if the pool is only single-purpose (storing log files, etc.). At the very least, you'd be able to use the zdb -ddd / <object_id> types of commands in the future to get information about an individual file. It also allows for more flexibility in moving things around in the future without re-creating the pool. For example, if you had an application storing stuff in tank/a and you wanted to start anew but keep the old files around, you could zfs rename tank/a tank/b and then zfs create tank/a... that kind of thing. Child file systems also allow you to use the property inheritance scheme which can be very useful. |
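(A sketch of the suggested layout; the pool name, devices and property are illustrative:)

```sh
zpool create tank mirror /dev/sda /dev/sdb   # placeholder devices
zfs create tank/a                            # keep data out of the pool root
# ... the application writes into /tank/a ...

# Later, start anew while keeping the old files around:
zfs rename tank/a tank/b
zfs create tank/a

# Properties set near the top are inherited by the children:
zfs set atime=off tank
zfs get -r atime tank
```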
@behlendorf I've tracked down the problem, sort of. This, of course, only applies to normal non-xattr=sa mode. Any time a file's xattr is accessed (or when it's created), extra inode references (i_count) are created that prevent the file's space from being freed when it's removed. In no case will the space consumed by the xattr pseudo-directories or the attribute pseudo-files be freed. Here are a couple of scenarios that should make it clear. Typical failure case:
Case in which file's space is freed:
And another failure case:
I did review all three of the xattr-related deadlock-handling commits you referenced above and think I've got a pretty good grasp of the situation, but given the way it's coded now, it's not clear to me when the space for the file was actually supposed to be freed short of an unmount. A couple of other notes and comments are in order: For most of this testing, I was using a system on which SELinux was disabled. Presumably in some (all?) cases with SELinux, the xattrs are accessed pretty much all the time which would make it impossible for any space to be reclaimed. I suspect the same is also true of Gluster which seems to use xattrs for its own purpose. I'm not sure where to go from here. There's either a simple bug in the code as it's currently written or the overall design of it precludes ever freeing the space. |
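(The three scenarios above were shown as pasted terminal sessions that didn't survive in this copy; below is a hypothetical reconstruction based on the i_count explanation — the tank/test dataset and the user.k attribute are made up:)

```sh
# Typical failure case: creating (or reading) the xattr takes an extra
# inode reference, so removing the file does not free its space.
touch /tank/test/f1
setfattr -n user.k -v v /tank/test/f1
rm /tank/test/f1; sync
zfs list tank/test                            # USED does not drop

# Case in which the space is freed: remount first (dropping the cached
# references), then remove the file without touching its xattr again.
touch /tank/test/f2; setfattr -n user.k -v v /tank/test/f2
zfs unmount tank/test && zfs mount tank/test
rm /tank/test/f2; sync
zfs list tank/test                            # this one is freed

# Another failure case: after the remount, merely reading the xattr
# re-takes the reference, and the removal leaks again.
touch /tank/test/f3; setfattr -n user.k -v v /tank/test/f3
zfs unmount tank/test && zfs mount tank/test
getfattr -n user.k /tank/test/f3
rm /tank/test/f3; sync                        # not freed until the next unmount
```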
I decided to fiddle around with this a bit more. First off, I think it's worth throwing in a reference to #457 here. The discussion in that issue makes it sound like the handling of xattr cleanup is still in flux. Reverting e89260a against master as of 20c17b9 does allow the space consumed by a file with xattrs to be freed. The xattr directory and the xattrs themselves, however, still lie around in the unlinked set until an unmount (mount?). Finally, reverting 7973e46 against my revert above, which should restore the old synchronous xattr cleanup behavior, doesn't change the freeing behavior at all; the xattr directory and the xattrs themselves aren't freed. |
@dweeezil Thanks for looking into this, let me try and explain how it should work today and a theory for why this is happening. When we create an xattr directory or xattr file we create a normal znode+inode for the object. However, because we don't want these files to show up in the namespace we don't create a dentry for them. The new inodes get hashed in to the inode cache and attached to the per-superblock LRU. The VFS is then responsible for releasing them as needed; a reference via iget() is taken on the objects to keep them around in memory. Note that dentries for these inodes are impossible, so they will never hold a reference. Destruction and freeing the space for these objects is handled when the last reference on the inode is dropped via iput(), see zpl_evict_inode->zfs_inactive->zfs_zinactive. The zp->z_unlinked flag is checked in zfs_zinactive, which means they should not just be released from the cache but the objects should be destroyed from the dataset. It sounds as if the VFS isn't dropping the reference it's holding on the xattr inodes until unmount time. This makes some sense because unless there's memory pressure the VFS won't kick an object out of the cache and drop its reference. However, at unmount time it has to drop this last reference on all the objects in the inode cache so it can unmount. Dropping this last reference will result in the unlink and likely explains why unmount takes so long. This explanation is consistent with what you've described above. Note that the first patch you reverted, e89260a, adds a hunk to zfs_znode_alloc which caused a file's xattr directory to hold a reference on its parent file. The result is that the file won't be unlinked until the VFS drops its reference on the xattr directory, which in turn drops its reference on the file. Which is exactly the behavior you observed. Two possible solutions worth exploring.
|
@behlendorf Thanks for the explanation. The missing link for me is that I didn't realize the VFS would ever release the reference under any condition. I did realize that it was going to take another iput to release it but I didn't understand where it was supposed to come from. I just (finally) fully read the comment block in zfs_xattr.c. I didn't realize that Solaris had a file-like API to deal with xattrs which makes them a lot more like "forks". The invisible file and directory scheme is a rather natural way to implement their API. It sounds like it's possible to create hierarchies of attributes in Solaris. I'm liking your #2 idea above. |
@dweeezil Yes exactly, they're much more like forks on Solaris. My preference would be for the second option as well, do you want to take a crack at it? |
@behlendorf I'll give it a whack, but there are a few more things I've got to fully understand. I'm still getting caught up on the 2.6-3.X inode cache API changes. Also, I'm taking a side-trip to see how this is handled in FreeBSD (which also doesn't have a Solaris-like xattr API). I was under the impression that FreeBSD ZFS didn't support xattrs but it turns out that it does. |
That's true, xattr was set to off in my pool, but first it was set to xattr=on; then I noticed the strange behaviour of my pool and disabled it on the fly (without a reboot). But that didn't change anything ... Now I'm running with xattr=sa, zfs 0.6.5.3-2, and I have done a reboot after setting xattr=sa. I'll keep an eye on it over the next weeks ... |
@discostur Remember that each file must be created with xattr=sa set, in order for it to work. If you have existing files, you'll need to copy them (e.g. to a new dataset). |
Ugh - is there any way to do that in-place? I have 500G free on a 10T filesystem ... |
@Nukien cp each filename to filename.new and then mv it back. This still won't guarantee you won't have any downtime. The files might be in use, after all. And the old files will still be taking up space until you drop caches or unmount. |
Doh. Must be early, should have thought of that. Scripting time - I would like to keep the filetime for each one ... Let's see, "find /zfs1/sys1 -type f -exec ... |
@Nukien From what I read, the only guarantee of having no issues in the future is to create the filesystem anew with xattr=sa - but that surely is no option in your case. @ringingliberty's solution appears to be the optimal one - not sure if it's enough concerning the filesystem data. CC: @dweeezil |
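(A hypothetical sketch of the copy-and-move-back approach discussed above; /zfs1/sys1 is taken from the earlier comment, cp -p keeps timestamps/ownership, and the usual caveats apply — files in use, possible *.new name collisions, old copies holding space until a remount:)

```sh
# Rewrite every regular file so it is re-created under xattr=sa.
find /zfs1/sys1 -type f -print0 | while IFS= read -r -d '' f; do
    cp -p -- "$f" "$f.new" && mv -- "$f.new" "$f"
done
```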
Started getting this problem after the last upgrade. SPL: Loaded module v0.6.5.3-1. [zfs get output header: NAME PROPERTY VALUE SOURCE] dd if=/dev/null of=syslog.tar.gz - I have zeroed out a few hundred gigs of files and removed them. [zfs list header: NAME USED AVAIL REFER MOUNTPOINT] [zpool list header: NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT] Running out of files to try removing. Edit: ok, changed xattr with zfs set xattr=sa ZFSRaidz2. I removed a few more gigs and space finally freed up. [zfs list header: NAME USED AVAIL REFER MOUNTPOINT] [zpool list header: NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT] |
same here: upgraded from zfs-0.6.2 to 0.6.5.3 -> 100% full
fixed on IRC #zfsonlinux after being pointed at
and make sure you don't go over 90% full next time :) |
Hmm, running 0.6.5.3-1 but space does not seem to be reclaimed after a delete. I am running with xattr=on, NOT xattr=sa. |
I just hit this issue on ZFS v0.6.5.4-1 with a 4.2.8-300.fc23.x86_64 kernel. I stored all files in the root filesystem instead of in datasets, but have since created datasets and been moving the files to them. After that I noticed the reported disk usage of the root filesystem only increasing, both in "zfs list" and "df", when moving files off of it. I don't have any other information, other than that I do rely on SELinux and xattr is set to the default, on. Rebooting the system freed up the space. |
Same bug here with SPL/ZFS v0.6.4.2-1 + kernel 3.10.76 ; |
Looks like the same bug with zfs-0.6.5.4 and kernel 3.10.0-327. |
I ran into this, but I was focusing on the inode counts rather than space usage as it wasn't a lot of space. Before unmounting, 151616 inodes were in use. Unmounting took a moment, and after remounting it was 6. Yes, I |
We are seeing the same problem on a large (3-digit TB) fully updated CentOS 7 machine. Setting drop_caches to 0 caused the machine to stop responding over NFS for 15-20 minutes. The mounts block locally, one at a time (I suspect it was blocking on whichever mount it was clearing files for at the time). After the interruption, about 5 TB of space was cleared, but running it again during a cron job is not having the same effect (it seems to have only worked once). The machine is in near-constant use, so we haven't had a chance to try unmounting, exporting or rebooting since then. |
Using a temporary filesystem I was able to clear some deleted files by un-mounting and re-mounting the specific filesystem. Obviously we would like to avoid this, but the work-around does work in a pinch. |
Slight correction to my earlier comment, we are setting drop_caches to 3 (0 is what it reads before the value is set). |
|
I'm aware that it performs the function during the "write" (and thus blocks). When I do the write, it does block for a few seconds to a few minutes (depending on the last time it was done), but cat is definitely showing it as having a value of 3. It's probably just reporting the last value that was passed to it. |
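(For reference, the drop_caches interface behaves like this — generic Linux, not ZFS-specific:)

```sh
sync                                  # flush dirty data first
echo 3 > /proc/sys/vm/drop_caches     # 1 = pagecache, 2 = dentries+inodes, 3 = both
cat /proc/sys/vm/drop_caches          # just echoes back the last value written
```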
Is anyone using quotas and having this issue? In that case db707ad OpenZFS 6940 - Cannot unlink directories when over quota might help. Edit: currently not sure where to look, but I'm pretty certain adding a
during file deletion operations (generally) could improve the situation; if that transaction is delayed we might end up back at square one - how and why (due to xattr=on) is another story |
Also seeing this issue. umount/mount will solve it. Changing to xattr=sa seems to work. |
Just a "Me Too" comment. I was moving 1TB of data between datastores, using a SSD datastore/pool in between for speed. (tank/DS1 -> SSD -> tank/DS2) Had to umount/mount the SSD datastore between each chunk of data for the SSD datastore to reflect the available space. 3.5 years on this bug. A fix would be swell. |
That's a really hard target to pinpoint! It used to work for me; now I just realized that upon wiping out a directory of 110 GB plus snapshots of a different repository (around 10 GB), the space isn't freed. The branch used is https://github.com/kernelOfTruth/zfs/commits/zfs_kOT_04.12.2016_2. No xattr is used, no SELinux is used. The kernel is 4.9.0 rt based. edit: does this also appear, btw, with an SLOG or ZIL device? Or is the behavior different? edit2: this gets interesting! It just freed 6 GB by destroying some other snapshots; the ~120 GB, however, are still missing from the free space shown by zpool list. As far as I know those folders are mostly configured the same, except for ditto blocks (copies=2 or copies=3) for certain subfolders. The folders where the space isn't freed are directly in the root of the pool; the folders where space was freed [the root folder is also in the root of the pool] have further subfolders, whereas the non-freeing folders don't ... edit3: there were actually still processes holding those directories open, interesting ... Now it worked |
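(The edit3 above is usually the first thing to rule out; standard Linux tools, paths illustrative:)

```sh
lsof +D /tank/mydir          # processes with anything open under the directory
fuser -vm /tank              # processes using the mount point
lsof | grep -w deleted       # unlinked-but-still-open files system-wide
```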
a "me too" comment; My situation was:
The USED adds up roughly. In this state I was in the middle of something like: Despite all the MOVE operations (~36 GB into SNIPF), there was still a whack of usage in SNIPF_TMP.
I also checked to see if the pool was busy freeing any bytes, nope
My "storage" dir has loads of important live/active services in sub-dirs ... it's impossible for me unmount it right now.
I have verified that the move operation IS deleting files off the filesystem by diffing the "_TMP" dir against the DEST directory too. Tried echoing a non-zero value into
EDIT: upon further investigation, my issue is not related ... the "mv" command seems to have an interesting behaviour in when/how it deletes files during the operation. It looks like it only deletes files per LEAF NODE in a directory tree (although I haven't seen what happens when there are more than 2 levels of depth, like in my original case with the mega move above); this is the pattern, though, in a smaller example:
CONFIRMED -- my case is a symptom of the in this case, using
|
We are experiencing the same problem reported in issue #1188. When we delete a file, its disk space is not freed.
We are testing the use of a ZFS volume to store logs. We delete old logs at the same pace we write new ones, so the disk usage should be roughly constant in the short term. But it isn't: it always increases (as reported by the df and zfs list commands), although the disk usage reported by the du command remains constant.
It takes a full month to reach 100% of the disk, so I don't think it is an asynchronous-delete or recently-freed issue. I have already stopped the writing process for several hours, so it is not a load issue either. And we aren't taking any snapshots.
It only releases the space of the deleted files when I unmount the filesystem. And it takes about 10 minutes to unmount.
Is this a bug? or is there something I could do to make the deletions work?
Also, is there a command to manually release the space without needing to unmount the filesystem?
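(A hedged sketch of the diagnostics and the remount workaround that emerge in the comments above; the dataset/pool name gfs is assumed:)

```sh
zpool get freeing gfs                # space queued for asynchronous freeing, if any
zfs unmount gfs && zfs mount gfs     # the unlinked set is drained across the
zfs list -o name,used,avail gfs      #   unmount/mount cycle, releasing the space

# Longer term, for newly created files (ZoL-specific property):
zfs set xattr=sa gfs
```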
Here is some info about my system: