Deleting Files Doesn't Free Space, unless I unmount the filesystem #1548

Closed
fabiokorbes opened this issue Jun 25, 2013 · 96 comments
Labels
Component: Memory Management kernel memory management

Comments

@fabiokorbes

We are experiencing the same problem reported in issue #1188. When we delete a file, its disk space is not freed.

We are testing the use of a ZFS volume to store logs. We delete old logs at the same pace we write new ones, so disk usage should stay roughly constant in the short term. But it doesn't: it always increases (as reported by df or zfs list), even though the usage reported by du remains constant.

It takes a full month to reach 100% of the disk, so I don't think it is an asynchronous-delete or recently-freed issue. I have already stopped the writing process for several hours, so it is not a load issue either. And we aren't taking any snapshots.

The space of the deleted files is only released when I unmount the filesystem, and the unmount takes about 10 minutes.

Is this a bug, or is there something I could do to make the deletions free space?

Also, is there a command to manually release the space without having to unmount the filesystem?

Here is some info about my system:

# zdb 
gfs:
    version: 5000
    name: 'gfs'
    state: 0
    txg: 4
    pool_guid: 12548299384193180756
    hostid: 3943305418
    hostname: 'margari.tpn.terra.com'
    vdev_children: 1
    vdev_tree:
        type: 'root'
        id: 0
        guid: 12548299384193180756
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 776557494125630754
            path: '/dev/sdb1'
            whole_disk: 0
            metaslab_array: 33
            metaslab_shift: 32
            ashift: 9
            asize: 730802946048
            is_log: 0
            create_txg: 4
    features_for_read:


# zpool status   
  pool: gfs
 state: ONLINE
  scan: scrub repaired 0 in 4h6m with 0 errors on Sat Jun  8 00:42:59 2013
config:

        NAME        STATE     READ WRITE CKSUM
        gfs         ONLINE       0     0     0
          sdb1      ONLINE       0     0     0

errors: No known data errors


# zfs list -t all
NAME   USED  AVAIL  REFER  MOUNTPOINT
gfs    364G   305G   364G  /gfs


# zfs get all
NAME  PROPERTY              VALUE                  SOURCE
gfs   type                  filesystem             -
gfs   creation              Tue Apr  9 13:13 2013  -
gfs   used                  364G                   -
gfs   available             305G                   -
gfs   referenced            364G                   -
gfs   compressratio         5.30x                  -
gfs   mounted               yes                    -
gfs   quota                 none                   default
gfs   reservation           none                   default
gfs   recordsize            64K                    local
gfs   mountpoint            /gfs                   default
gfs   sharenfs              off                    default
gfs   checksum              off                    local
gfs   compression           gzip-9                 local
gfs   atime                 off                    local
gfs   devices               on                     default
gfs   exec                  on                     default
gfs   setuid                on                     default
gfs   readonly              off                    default
gfs   zoned                 off                    default
gfs   snapdir               hidden                 default
gfs   aclinherit            restricted             default
gfs   canmount              on                     default
gfs   xattr                 on                     default
gfs   copies                1                      default
gfs   version               5                      -
gfs   utf8only              off                    -
gfs   normalization         none                   -
gfs   casesensitivity       sensitive              -
gfs   vscan                 off                    default
gfs   nbmand                off                    default
gfs   sharesmb              off                    default
gfs   refquota              none                   default
gfs   refreservation        none                   default
gfs   primarycache          all                    default
gfs   secondarycache        all                    default
gfs   usedbysnapshots       0                      -
gfs   usedbydataset         364G                   -
gfs   usedbychildren        88.2M                  -
gfs   usedbyrefreservation  0                      -
gfs   logbias               latency                default
gfs   dedup                 off                    default
gfs   mlslabel              none                   default
gfs   sync                  standard               default
gfs   refcompressratio      5.30x                  -
gfs   written               364G                   -
gfs   snapdev               hidden                 default
@behlendorf
Contributor

One possible answer for this behavior is that the deleted file is still open. You can delete an open file which will remove it from the namespace but the space can't be reclaimed until the last user closes it.
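A quick way to check for that (a sketch, assuming lsof is available and the pool is mounted at /gfs):

lsof +L1 /gfs                                    # files on /gfs that are open but have link count 0
ls -l /proc/*/fd 2>/dev/null | grep '(deleted)'  # or scan open fds for the "(deleted)" marker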

@fabiokorbes
Author

I have stopped all my processes for 2 hours; only the ZFS filesystem remains mounted. There are no other processes accessing the files and no descriptors marked as "deleted" in /proc/<pid>/fd/. And still the disk usage doesn't decrease.

Besides, it reports 0 in the "freeing" property:

[root@colchester ~]# zpool list -o name,size,allocated,free,freeing
NAME   SIZE  ALLOC   FREE  FREEING
gfs    680G   372G   308G        0
[root@colchester ~]# du -sh /gfs/
273G    /gfs/

but when I unmount the filesystem:

[root@colchester ~]# zfs unmount /gfs/
[root@colchester ~]# zfs mount gfs  
[root@colchester ~]# zpool list -o name,size,allocated,free,freeing
NAME   SIZE  ALLOC   FREE  FREEING
gfs    680G   272G   408G        0

I'm using zfs-0.6.1-1.el6.x86_64 on CentOS 6 machines.

@dweeezil
Contributor

@fabiokorbes It might be interesting to see the output of zpool history -i.

@fabiokorbes
Author

Here it is:

[root@colchester gfs]# zpool history -i 
History for 'gfs':
2013-04-16.20:07:17 zpool create -f gfs /dev/sdb1
2013-04-16.20:07:17 [internal pool create txg:5] pool spa 5000; zfs spa 5000; zpl 5; uts colchester.tpn.terra.com 2.6.32-279.22.1.el6.x86_64 #1 SMP Wed Feb 6 03:10:46 UTC 2013 x86_64
2013-04-16.20:07:44 [internal property set txg:11] recordsize=65536 dataset = 21
2013-04-16.20:07:44 zfs set recordsize=64K gfs
2013-04-16.20:07:44 [internal property set txg:12] compression=13 dataset = 21
2013-04-16.20:07:44 zfs set compression=gzip-9 gfs
2013-04-16.20:07:44 [internal property set txg:13] atime=0 dataset = 21
2013-04-16.20:07:44 zfs set atime=off gfs
2013-04-16.20:07:44 [internal property set txg:14] checksum=2 dataset = 21
2013-04-16.20:07:49 zfs set checksum=off gfs

@dweeezil
Contributor

@fabiokorbes I think I've been able to reproduce this leak. I gather you're storing your logs directly in the root directory of the pool? Since your pool history doesn't show any file system creations, I'd imagine so.

I've discovered that some objects remain in the root dataset after deleting all the files in it. In my test, I created a pool named junk, copied a bunch of files into it, and then removed them. I also did the same with a filesystem within it named junk/a. Here's the output of zdb -d junk:

zdb -d junk
Dataset mos [META], ID 0, cr_txg 4, 552K, 50 objects
Dataset junk/a [ZPL], ID 47, cr_txg 324, 136K, 6 objects
Dataset junk [ZPL], ID 21, cr_txg 1, 20.1M, 2977 objects

Notice the objects left in the root file system. The 20.1M also jibes with the output of zfs list. I suspect you might find the same, too.

As a cross-check, I ran this test on FreeBSD and the problem does not occur there.

It would also appear to me that your pool has never been exported by virtue of the txg: 4 in the label. I've discovered that an export/import cycle will free this space.

I'll hold off on looking into this further until I hear back from you as to whether this might be your issue.

HMM, after further investigation, I can't seem to duplicate this any more. A zdb -d output would still be interesting.

@dweeezil
Contributor

@fabiokorbes I think I may have been a little quick-on-the-draw coming to the conclusion that I did. I think I was just seeing the effects of deferred deletion. I guess the output of zdb -d would still be interesting and also it would be interesting to know if an export/import cycle makes any difference.
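Something like this (using your pool name) would show whether the accounting changes across an export/import cycle:

zdb -d gfs          # note the dataset sizes and object counts
zpool export gfs
zpool import gfs
zdb -d gfs          # compare; a drop here means the cycle freed the space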

@fabiokorbes
Author

Here is the zdb -d output, before...

[root@vineland ~]# zdb -d gfs  
Dataset mos [META], ID 0, cr_txg 4, 91.2M, 204 objects
Dataset gfs [ZPL], ID 21, cr_txg 1, 350G, 7411 objects

and after a export/import:

[root@vineland ~]# zdb -d gfs  
Dataset mos [META], ID 0, cr_txg 4, 103M, 204 objects
Dataset gfs [ZPL], ID 21, cr_txg 1, 267G, 6418 objects

But I think it only released the space because the export unmounts the filesystem.

@dweeezil
Contributor

[NOTE: This comment is effectively superseded by my next one in which I show the reproducing steps]

@fabiokorbes You're correct, the mount/umount is what ends up freeing the space. BTW, I presume that 267G is the correct amount of used space in your gfs dataset?

I've been able to duplicate this problem again and figured out why I had difficulties previously. In my test, I'm rsyncing data from my /etc and /usr directories onto my test dataset. The problem only occurs when I use "-X", which preserves extended attributes, so I think this issue may only occur when some of the created and/or deleted files have extended attributes.

Further debugging with zdb -dddd shows me that when the dataset is in this condition (after having copied lots of files and directories and then deleted them), there is unprocessed stuff in the ZFS delete queue and lots of "ZFS directory" objects lying around like this:

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
       186    1    16K     1K     1K     1K  100.00  ZFS directory
                                        176   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED 
        dnode maxblkid: 0
        path    ???<object#186>
        uid     0
        gid     0
        atime   Thu Jun 27 00:25:01 2013
        mtime   Thu Jun 27 00:25:02 2013
        ctime   Thu Jun 27 00:25:02 2013
        crtime  Thu Jun 27 00:24:58 2013
        gen     715
        mode    40755
        size    2
        parent  7
        links   0
        pflags  40800000144
        xattr   5056
        microzap: 1024 bytes, 0 entries

Even further examination shows me that these objects are still in the ZFS delete queue object (seemingly always object #3).

Obviously, un-purged directory objects aren't going to be wasting a lot of space. I need to do further testing. My guess is that the same problem may exist for regular files that have extended attributes.

I just did a bit more digging and I find b00131d to be possibly related to, at least, the leakage that I've discovered.

Your leakage problem may very well be different. It would be interesting for you to examine the output of zdb -dddd gfs and see what types of objects are lying around when you have some leakage. They'll be pretty obvious because their path shows up as "???..." as you can see above.
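A rough way to pull those out of the dump (a sketch; it assumes the object header layout shown above, with dsize in the fifth column):

zdb -dddd gfs | awk '
    $1 ~ /^[0-9]+$/ && NF >= 8 { obj = $1; dsize = $5 }      # remember the latest object header line
    $1 == "path" && index($0, "???") { print obj, dsize }    # objects whose path shows ??? are the leaked ones
'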

@dweeezil
Contributor

@fabiokorbes After a bit more experimenting, I discovered that, at least in non-xattr=sa mode in which extended attributes are stored in their own ZFS directory and file objects, removing a file does not free its space. The reproducing steps are rather simple:

# zpool create junk /dev/disk/by-partlabel/junk 
# zfs create junk/a
# dd bs=100k if=/dev/urandom of=/junk/a/junk count=200
200+0 records in
200+0 records out
20480000 bytes (20 MB) copied, 1.04426 s, 19.6 MB/s
# ls -li /junk/a/junk
7 -rw-r--r-- 1 root root 20480000 Jun 27 07:18 /junk/a/junk
# zdb -dd junk/a
Dataset junk/a [ZPL], ID 40, cr_txg 7, 19.7M, 7 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         0    7    16K    16K  15.0K    16K   21.88  DMU dnode
...
         7    3    16K   128K  19.6M  19.6M  100.00  ZFS plain file
# setfattr -n user.blah -v 'Hello world' /junk/a/junk
# !zdb
zdb -dd junk/a
Dataset junk/a [ZPL], ID 40, cr_txg 7, 19.7M, 9 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         0    7    16K    16K  15.0K    16K   28.12  DMU dnode
        -1    1    16K    512     1K    512  100.00  ZFS user/group used
        -2    1    16K    512     1K    512  100.00  ZFS user/group used
         1    1    16K    512     1K    512  100.00  ZFS master node
         2    1    16K    512     1K    512  100.00  SA master node
         3    1    16K    512     1K    512  100.00  ZFS delete queue
         4    1    16K    512     1K    512  100.00  ZFS directory
         5    1    16K  1.50K     1K  1.50K  100.00  SA attr registration
         6    1    16K    16K  7.00K    32K  100.00  SA attr layouts
         7    3    16K   128K  19.6M  19.6M  100.00  ZFS plain file
         8    1    16K    512     1K    512  100.00  ZFS directory
         9    1    16K    512    512    512  100.00  ZFS plain file

# rm /junk/a/junk
rm: remove regular file `/junk/a/junk'? y
# sync
# !zdb
zdb -dd junk/a
Dataset junk/a [ZPL], ID 40, cr_txg 7, 19.7M, 9 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
... [note that the xattr showed up as objects 8 and 9]
         7    3    16K   128K  19.6M  19.6M  100.00  ZFS plain file
         8    1    16K    512     1K    512  100.00  ZFS directory
         9    1    16K    512    512    512  100.00  ZFS plain file
# rm /junk/a/junk
rm: remove regular file `/junk/a/junk'? y
# sync
# !zdb
zdb -dd junk/a
Dataset junk/a [ZPL], ID 40, cr_txg 7, 19.7M, 9 objects

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         3    1    16K    512     1K    512  100.00  ZFS delete queue
... [oops, they're still there]
         7    3    16K   128K  19.6M  19.6M  100.00  ZFS plain file
         8    1    16K    512     1K    512  100.00  ZFS directory
         9    1    16K    512    512    512  100.00  ZFS plain file

# zfs list junk/a
NAME     USED  AVAIL  REFER  MOUNTPOINT
junk/a  19.7M  97.9G  19.7M  /junk/a

And if we check the delete queue:

# zdb -dddd junk/a 3
Dataset junk/a [ZPL], ID 40, cr_txg 7, 19.7M, 9 objects, rootbp DVA[0]=<0:13fc200:200> DVA[1]=<0:4c003c000:200> [L0 DMU objset] fletcher4 lzjb LE contiguous unique double size=800L/200P birth=27L/27P fill=9 cksum=13b093fe34:743c2a82058:163d71c39fe5f:2ee95f68b055dc

    Object  lvl   iblk   dblk  dsize  lsize   %full  type
         3    1    16K    512     1K    512  100.00  ZFS delete queue
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    microzap: 512 bytes, 1 entries

        7 = 7 

I'll do a little poking around today to see why the deletions aren't happening. As we've both seen, a mount/umount cycle will cause the space to be freed. I've not yet checked whether the freeing happens on the umount or on the mount.
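A simple way to tell those two apart (a sketch, using the same junk/a dataset):

zfs umount junk/a
zfs list junk/a     # if USED has already dropped, the freeing happened during the umount
zfs mount junk/a
zfs list junk/a     # if it only drops now, it happened during the (re)mount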

@pyavdr
Contributor

pyavdr commented Jun 27, 2013

I don't know if this is related, but there may be a leak which has been present for years. I can always reproduce it (with the latest SPL/ZFS) by creating 100k files (with a total of 100 GB) in a zpool root dir and deleting them all afterwards. No need for xattrs, and no cleanup with export/import. Each time after the delete, the amount of used space increases by around 200-400 KB. After some runs, it uses about 2 MB more than before. If there is no mechanism to clean this up, a zpool may fill up pretty fast. If it is by design, it shouldn't fill up the pool.

zdb -d stor2 before
Dataset mos [META], ID 0,cr_txg 4, 2.87M, 304 objects
Dataset stor2 [ZPL], ID 21, cr_txg 1 3.94M, 10 objects

zdb -d stor2 after create/delete of 100 GB/100k files
Dataset mos [META], ID 0,cr_txg 4, 2.90M, 305 objects
Dataset stor2 [ZPL], ID 21, cr_txg 1 4.12M, 10 objects

next run
Dataset mos [META], ID 0,cr_txg 4, 2.90M, 306 objects
Dataset stor2 [ZPL], ID 21, cr_txg 1 4.30M, 10 objects

after an export/import
Dataset mos [META], ID 0,cr_txg 4, 2.87M, 306 objects
Dataset stor2 [ZPL], ID 21, cr_txg 1 4.30M, 10 objects

after the next create/delete cycle
Dataset mos [META], ID 0,cr_txg 4, 3.13M, 306 objects
Dataset stor2 [ZPL], ID 21, cr_txg 1 4.42M, 10 objects

after the next create/delete cycle
Dataset mos [META], ID 0,cr_txg 4, 2.95M, 306 objects
Dataset stor2 [ZPL], ID 21, cr_txg 1 4.52M, 10 objects

@dweeezil
Contributor

@pyavdr I suppose a zdb -dd stor2 afterwards would be in order to see which object is increasing in size, followed up with a zdb -dddd ... to see what's in that object. BTW, you can't do zdb -dddd stor2 object_id because it's the root dataset and that will display the object_id from the MOS. You need to do zdb -dddd stor2 and look at the whole thing (with more or less or whatever).

@fabiokorbes
Author

@dweeezil I've run a zdb -dddd and found 1407 objects with path ???. The sum of their dsize is 129 GB now, which is the same amount by which my FS exceeds my estimate. Some examples:

Object  lvl   iblk   dblk  dsize  lsize   %full  type
  6703    4    16K    64K   714M  2.82G  100.00  ZFS plain file
                                    176   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 46255
    path    ???<object#6703>
    uid     0
    gid     0
    atime   Sat Jun  1 01:00:00 2013
    mtime   Sat Jun  1 02:00:00 2013
    ctime   Sat Jun 22 02:01:35 2013
    crtime  Sat Jun  1 01:00:00 2013
    gen     907574
    mode    100644
    size    3031405813
    parent  6706
    links   0
    pflags  40800000004


Object  lvl   iblk   dblk  dsize  lsize   %full  type
  6704    1    16K    512     1K    512  100.00  ZFS directory
                                    168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    ???<object#6704>
    uid     0
    gid     0
    atime   Sat Jun  1 01:00:00 2013
    mtime   Sat Jun  1 01:00:00 2013
    ctime   Sat Jun  1 01:00:00 2013
    crtime  Sat Jun  1 01:00:00 2013
    gen     907574
    mode    41777
    size    3
    parent  6703
    links   2
    pflags  40800000145

I believe the freeing happens on the umount, because it takes several minutes to complete, and it takes proportionally longer the higher the disk usage.

@dweeezil
Contributor

@fabiokorbes That shows your problem isn't the xattr-related issue I've discovered. Those show every sign of being deleted while open, but I know you have tried killing all your processes. If they're really not open, it seems there must be some weird sequence of system calls being performed on them that causes them to be left in this condition. Maybe it is related to the xattr issue I've discovered, but this seems like something different.

I don't think I can come up with a reproducer on my own. You'll need to find the sequence and timing of the system calls used by your program to try to help reproduce this.

I'm also wondering if your whole file system becomes "stuck" at this point or whether newly-created files that are created outside your logging application are deleted properly. If you do something like echo blah > blah.txt and then remove it, is its space freed? This is a lot easier to test in a non-root file system. You might try to zfs create gfs/test and then create a sample file there. Then run an ls -i on it to find its object id which you can view with zdb -dddd gfs/test <object_id>. When testing, remember to run a sync between deleting the file and checking its object with zdb.
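As a sketch of that test (names and paths are just examples):

zfs create gfs/test
echo blah > /gfs/test/blah.txt
ls -i /gfs/test/blah.txt              # the inode number is the object id
rm /gfs/test/blah.txt
sync
zdb -dddd gfs/test <object_id>        # if the object is gone, its space was freed
zfs list gfs/test                     # USED should drop accordingly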

@ColdCanuck
Contributor

Sorry to butt in, but if the files were truly "stuck open" at a system level, wouldn't umount complain vigorously ???

It certainly does on my system when I forget to kill a process (I'm looking at you nfsd) before I unmount a filesystem.


@dweeezil
Contributor

@ColdCanuck You'd certainly think that stuck-open files would prevent unmounting (except for forced unmounts). I do think there's a bug here and that it's not directly related to deleting open files.

I think that reproducing this will require getting a handle on the exact file system operations being performed by the logging application.

@fabiokorbes Have you gotten a handle on any of the following details of the problem:

  1. Does it start happening immediately after a reboot or does the system have to run for awhile?
  2. Once it starts happening, does every single file leak space from that point on?
  3. Do files created by simple utilities such as cp or dd leak space when deleted?
  4. Have you tried creating file in a non-root file system to see if that makes any difference?

@behlendorf
Contributor

@dweeezil, @fabiokorbes It would be useful to know if you're able to reproduce this issue when setting xattr=sa. My suspicion is you won't be able to, since it won't make extensive use of the delete queue, but it's worth verifying.

More generally, ZoL has some significant differences in how it manages the delete queue compared to the other implementations, particularly in regard to xattrs. This was required to avoid some particularly nasty deadlocks which are possible with the Linux VFS. Basically, the other implementations largely handle the deletion of xattrs synchronously, while ZoL does it asynchronously in the context of the iput taskq.

b00131d Fix unlink/xattr deadlock
53c7411 Revert "Fix unlink/xattr deadlock"
7973e46 Revert "Revert "Fix unlink/xattr deadlock""

However, even if you're not making extensive use of xattrs, normal file deletion will also be deferred to zfs_unlinked_drain() time, which is run unconditionally during the mount. It sounds like this may be what is happening, although it would be useful to determine where the umount is blocking. It may be blocked waiting for the pending zfs_unlinked_drain() work items to finish. Getting a stack trace of where the umount is blocked and what the iput thread is doing would be helpful.
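For reference, a sketch of one way to grab those traces while the umount is hung (assumes root, a kernel with /proc/<pid>/stack, and sysrq enabled):

zfs umount gfs &                  # kick off the umount in the background
sleep 30
cat /proc/$!/stack                # kernel stack of the blocked umount process
echo w > /proc/sysrq-trigger      # dump blocked tasks (including the iput taskq threads) to dmesg
dmesg | tail -n 200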

@dweeezil
Contributor

@behlendorf A quick test shows me that deleting files with xattrs when xattr=sa works just fine (the space is freed). The non-space-freeing problem with xattr=on is something I found by accident when trying to duplicate @fabiokorbes' problem. I'm planning on working up a separate issue for that problem once I nail it down a bit better.

@fabiokorbes
Author

I should have provided more details of our environment in the first place. We have 10 servers with ZFS filesystems. We combine them with Gluster and export the GlusterFS volume to another server, where syslog-ng writes the logs.

I thought the issue was not related to Gluster because it happens even when I delete a file manually on the ZFS filesystem. I hadn't tried deleting a manually created file, though. I did that now (with dd) and it works: the space was freed.

Gluster uses extended attributes.

@dweeezil,

Does it start happening immediately after a reboot or does the system have to run for awhile?
Once it starts happening, does every single file leak space from that point on?

Yes, it happens right after the reboot. I'm not 100% sure every file leaks space, but the growth rate suggests so.

Do files created by simple utilities such as cp or dd leak space when deleted?

No, they don't.

Have you tried creating file in a non-root file system to see if that makes any difference?

Do you mean a child of the dataset? I haven't tried it.

I'll try to change the xattr property to sa next.

@dweeezil
Contributor

@fabiokorbes Neither of the files shown in your zdb -dddd output above had extended attributes. If they did, there would be an xattr line right below the pflags line that would reference the ZFS directory object containing the extended attributes.

I'm afraid I've never used Gluster (but I'm reading up on it right now). I have a feeling your problem is that some part of the Gluster server side software is holding the deleted file open.

One of my virtual test environments uses CentOS 6.4 which looks like it's pretty easy to install Gluster on. I may give that a try.

@fabiokorbes
Author

Oops, sorry! I think my grep truncated the last line:

Object  lvl   iblk   dblk  dsize  lsize   %full  type
  6703    4    16K    64K   714M  2.82G  100.00  ZFS plain file
                                    176   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 46255
    path    ???<object#6703>
    uid     0
    gid     0
    atime   Sat Jun  1 01:00:00 2013
    mtime   Sat Jun  1 02:00:00 2013
    ctime   Sat Jun 22 02:01:35 2013
    crtime  Sat Jun  1 01:00:00 2013
    gen     907574
    mode    100644
    size    3031405813
    parent  6706
    links   0
    pflags  40800000004
    xattr   6704

Object  lvl   iblk   dblk  dsize  lsize   %full  type
  6704    1    16K    512     1K    512  100.00  ZFS directory
                                    168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    ???<object#6704>
    uid     0
    gid     0
    atime   Sat Jun  1 01:00:00 2013
    mtime   Sat Jun  1 01:00:00 2013
    ctime   Sat Jun  1 01:00:00 2013
    crtime  Sat Jun  1 01:00:00 2013
    gen     907574
    mode    41777
    size    3
    parent  6703
    links   2
    pflags  40800000145
    microzap: 512 bytes, 1 entries

            trusted.gfid = 6705 (type: Regular File)

But I have deleted files without xattrs too:

@fabiokorbes
Author

Here is a file without an xattr:

Object  lvl   iblk   dblk  dsize  lsize   %full  type
  6705    1    16K    512    512    512  100.00  ZFS plain file
                                    168   bonus  System attributes
    dnode flags: USED_BYTES USERUSED_ACCOUNTED 
    dnode maxblkid: 0
    path    ???<object#6705>
    uid     0
    gid     0
    atime   Sat Jun  1 01:00:00 2013
    mtime   Sat Jun  1 01:00:00 2013
    ctime   Sat Jun  1 01:00:00 2013
    crtime  Sat Jun  1 01:00:00 2013
    gen     907574
    mode    100644
    size    16
    parent  6704
    links   1
    pflags  40800000005

@dweeezil
Contributor

@fabiokorbes Yes, your whole problem is the "deleting files with xattrs leaks space" problem that I found. Your object 6705 is clearly the "value file" of one of your "trusted.gfid" attributes (it was created with a block size of 512 which is what these internally-generated attribute files use).

Switching to xattr=sa will fix your problem but it will make your pool incompatible with other ZFS implementations.
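For reference, the switch is a single property set, and it only takes effect for files created afterwards:

zfs set xattr=sa gfs
zfs get xattr gfs      # should now report sa with source local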

I started tracking down the cause of this xattr problem this morning and have made good progress. I'm not sure now whether to file a more specific issue for this or to keep this thread going. Once I nail down the cause, I'll post some more information here.

@fabiokorbes
Author

Yes, it fixed my problem. Thanks! 👍

The old files still leak space. Is this what you mean by incompatible with other ZFS implementations ?

@dweeezil
Contributor

@fabiokorbes My comment regarding incompatibility referred to the fact that only ZoL has the xattr=sa mechanism to store xattrs. If you import the pool under Illumos or FreeBSD, they won't see your xattrs (actually, I don't think FreeBSD supports xattrs at the moment). It doesn't sound like this is a big problem for you.

Until the cause of this bug is fixed, removing any of your pre-xattr=sa files will leak space and you'll have to reclaim the space by a mount/umount cycle. Newly-created files will be just fine.

Regarding my comment about storing files in the root of a pool, I'd still recommend setting up future pools with at least one child file system even if the pool is only single-purpose (storing log files, etc.). At the very least, you'd be able to use the zdb -ddd / <object_id> types of commands in the future to get information about an individual file. It also allows for more flexibility in moving things around in the future without re-creating the pool. For example, if you had an application storing stuff in tank/a and you wanted to start anew but keep the old files around, you could zfs rename tank/a tank/b and then zfs create tank/a... that kind of thing. Child file systems also allow you to use the property inheritance scheme which can be very useful.
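A quick sketch of that layout:

zpool create tank /dev/sdb1     # example device
zfs create tank/a               # keep the data one level below the pool root
# later, to start anew while keeping the old files:
zfs rename tank/a tank/b
zfs create tank/a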

@dweeezil
Contributor

@behlendorf I've tracked down the problem, sort of. This, of course, only applies to normal non-xattr=sa mode. Any time a file's xattr is accessed (or when the file is created), extra inode references (i_count) are taken that prevent the file's space from being freed when it's removed. In no case will the space consumed by the xattr pseudo-directories or the attribute pseudo-files be freed. Here are a couple of scenarios that should make it clear.

Typical failure case:

  1. Create new file (object X)
  2. Add an xattr to file (creates objects X1 and X2)
  3. Unlink file
  4. None of objects X, X1 nor X2 are freed

Case in which file's space is freed:

  1. Create new file (object X)
  2. Add an xattr to file (creates objects X1 and X2)
  3. Unmount file system
  4. Mount file system
  5. Unlink file
  6. Space for X is freed but space for X1 and X2 is not
  7. Unmount file system, space for X1 and X2 is freed

And another failure case:

  1. Create new file (object X)
  2. Add an xattr to file (creates objects X1 and X2)
  3. Unmount file system
  4. Mount file system
  5. Perform an lgetxattr() on the file
  6. Unlink file
  7. None of objects X, X1 nor X2 are freed

I did review all three of the xattr-related deadlock-handling commits you referenced above and think I've got a pretty good grasp of the situation, but given the way it's coded now, it's not clear to me when the space for the file was actually supposed to be freed short of an unmount.

A couple of other notes and comments are in order: For most of this testing, I was using a system on which SELinux was disabled. Presumably in some (all?) cases with SELinux, the xattrs are accessed pretty much all the time which would make it impossible for any space to be reclaimed. I suspect the same is also true of Gluster which seems to use xattrs for its own purpose.

I'm not sure where to go from here. There's either a simple bug in the code as it's currently written or the overall design of it precludes ever freeing the space.
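For reference, the typical failure case above boils down to something like this (a sketch against a scratch pool named junk, as in my earlier test):

zfs create junk/x
dd if=/dev/urandom of=/junk/x/file bs=1M count=20
setfattr -n user.test -v hello /junk/x/file    # creates the hidden xattr directory and xattr file objects
rm /junk/x/file
sync
zfs list junk/x       # USED stays high; the objects sit on the delete queue until an unmount
zdb -dd junk/x        # the plain file, xattr directory, and xattr file objects are still there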

@dweeezil
Contributor

dweeezil commented Jul 1, 2013

I decided to fiddle around with this a bit more. First off, I think it's worth throwing in a reference to #457 here. The discussion in that issue makes it sound like the handling of xattr cleanup is still in flux.

Reverting e89260a against master as of 20c17b9 does allow the space consumed by a file with xattrs to be freed. The xattr directory and xattrs themselves, however, still lie around on the unlinked set until an unmount (mount?).

Finally, reverting 7973e46 against my revert above, which should restore the old synchronous xattr cleanup behavior, doesn't change the freeing behavior at all; the xattr directory and the xattrs themselves aren't freed.

@behlendorf
Contributor

@dweeezil Thanks for looking in to this, let me try and explain how it should work today and a theory for why this is happening.

When we create an xattr directory or xattr file, we create a normal znode+inode for the object. However, because we don't want these files to show up in the namespace, we don't create a dentry for them. The new inodes get hashed into the inode cache and attached to the per-superblock LRU; a reference taken via iget() keeps the objects around in memory, and the VFS is then responsible for releasing them as needed. Note that dentries for these inodes are impossible, so they will never hold a reference.

Destruction and freeing of the space for these objects is handled when the last reference on the inode is dropped via iput(); see zpl_evict_inode->zfs_inactive->zfs_zinactive. The zp->z_unlinked flag is checked in zfs_zinactive, which means the objects should not just be released from the cache but should be destroyed in the dataset.

It sounds as if the VFS isn't dropping the reference it's holding on the xattr inodes until unmount time. This makes some sense because, unless there's memory pressure, the VFS won't kick an object out of the cache and drop its reference. However, at unmount time it has to drop the last reference on all the objects in the inode cache so it can unmount. Dropping this last reference triggers the unlink and likely explains why the unmount takes so long.
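If that's the case, forcing the VFS to shed its cached inodes should free the space without a full unmount; a minimal check, assuming root:

sync
echo 2 > /proc/sys/vm/drop_caches    # 2 = reclaim dentries and inodes
zfs list gfs                         # USED should drop if the cached xattr inodes were holding the space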

This explanation is consistent with what you've described above. Note that the first patch you reverted, e89260a, adds a hunk to zfs_znode_alloc which causes an xattr directory to hold a reference on its parent file. The result is that the file won't be unlinked until the VFS drops its reference on the xattr directory, which in turn drops its reference on the file; that is exactly the behavior you observed.

Two possible solutions worth exploring.

  1. The obvious one is to drop the references on the xattr directory and its children during zfs_rmnode(). That originally led to the deadlocks described in Async zfs_purgedir #457 and is why the iput was made asynchronous, but perhaps there is a safe way this could be accomplished. Or,

  2. Never add xattr directories or inodes to the superblock's LRU or the inode cache. Originally this was done because it allowed me to reuse a lot of existing code and keep things reasonably simple. It also provided a clean mechanism for the VFS to discard cached xattrs which hadn't been used in a long time and were just wasting memory. However, there's no requirement that these inodes be attached to the per-superblock LRU or that they exist in the inode hash. If they were just attached to their parent file, that may greatly simplify some of the possible deadlocks caused by the VFS. It may even become possible to destroy them synchronously in zfs_rmnode, which would be very desirable.

@dweeezil
Contributor

dweeezil commented Jul 1, 2013

@behlendorf Thanks for the explanation. The missing link for me is that I didn't realize the VFS would ever release the reference under any condition. I did realize that it was going to take another iput to release it but I didn't understand where it was supposed to come from.

I just (finally) fully read the comment block in zfs_xattr.c. I didn't realize that Solaris had a file-like API to deal with xattrs which makes them a lot more like "forks". The invisible file and directory scheme is a rather natural way to implement their API. It sounds like it's possible to create hierarchies of attributes in Solaris.

I'm liking your #2 idea above.

@behlendorf
Contributor

@dweeezil Yes exactly, they're much more like forks on Solaris. My preference would be for the second option as well, do you want to take a crack at it?

@dweeezil
Contributor

dweeezil commented Jul 2, 2013

@behlendorf I'll give it a whack, but there are a few more things I've got to fully understand.

I'm still getting caught up on the 2.6-3.X inode cache API changes.

Also, I'm taking a side-trip to see how this is handled in FreeBSD (which also doesn't have a Solaris-like xattr API). I was under the impression that FreeBSD ZFS didn't support xattrs but it turns out that it does.

@discostur

@kernelOfTruth

That's true, xattr was off in my pool, but it was first set to xattr=on; then I noticed the strange behaviour of my pool and disabled it on the fly (without a reboot). But that didn't change anything ...

Now I'm running with xattr=sa and zfs 0.6.5.3-2, and I have done a reboot after setting xattr=sa. I'll keep an eye on it over the next weeks ...

@ringingliberty

@discostur Remember that each file must be created with xattr=sa set, in order for it to work. If you have existing files, you'll need to copy them (e.g. to a new dataset).

@Nukien

Nukien commented Nov 25, 2015

Ugh - is there any way to do that in-place ? I have 500G free on a 10T filesystem ...

@ringingliberty

@Nukien cp each filename to filename.new and then mv it back. This still won't guarantee you won't have any downtime. The files might be in use, after all. And the old files will still be taking up space until you drop caches or unmount.

@Nukien

Nukien commented Nov 25, 2015

Doh. Must be early, should have thought of that. Scripting time - I would like to keep the filetime for each one ... Let's see, "find /zfs1/sys1 -type f -exec ...
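Something along these lines might do it (a sketch only; cp -a preserves times, modes, and ownership, and filename.new is just a temporary name):

find /zfs1/sys1 -type f -exec sh -c '
    cp -a "$1" "$1.new" && mv "$1.new" "$1"      # rewrite each file so it is recreated under xattr=sa
' _ {} \;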

@kernelOfTruth
Contributor

@Nukien From what I read, the only guarantee of having no issues in the future is to create the filesystem anew with xattr=sa - but that surely is no option in your case.

@ringingliberty's solution appears to be the optimal one - not sure if it's enough concerning the filesystem data.

CC: @dweeezil

@MasterCATZ

Started getting this problem after the last upgrade.

SPL: Loaded module v0.6.5.3-1trusty
ZFS: Loaded module v0.6.5.3-1
trusty, ZFS pool version 5000, ZFS filesystem version 5

NAME PROPERTY VALUE SOURCE
ZFSRaidz2 xattr on default
I am a little scared to try changing this, as I use a FreeBSD recovery disk for when Ubuntu breaks on me.

dd if=/dev/null of=syslog.tar.gz
rm syslog.tar.gz

I have zeroed out a few hundred gigs of files and removed them; free space is still not being released. I've unmounted/remounted, etc.

NAME USED AVAIL REFER MOUNTPOINT
ZFSRaidz2 10.5T 0 10.5T /ZFSRaidz2
(however, the USED space here changed to a lower number but AVAIL never increased)

NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
ZFSRaidz2 16.2T 15.8T 464G - 3% 97% 1.00x ONLINE -

I'm running out of files to try removing; the only thing not done is a scrub, as I don't have the needed 19 hrs until next week.

Edit: OK, I changed xattr:

zfs set xattr=sa ZFSRaidz2

I removed a few more gigs and the space finally freed up:

NAME USED AVAIL REFER MOUNTPOINT
ZFSRaidz2 10.4T 35.8G 10.4T /ZFSRaidz2

NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
ZFSRaidz2 16.2T 15.7T 574G - 3% 96% 1.00x ONLINE -

@truatpasteurdotfr

Same here: upgraded from zfs-0.6.2 to 0.6.5.3 -> 100% full.
Tried zpool upgrade -> no change.
zfs destroy a few filesystems to make room && zfs umount -a && zfs mount -a: situation unchanged.
zfs set xattr=sa && zfs destroy && zfs umount && zfs mount: no change.
zpool scrub in progress.

SPL: Loaded module v0.6.5.3-1
ZFS: Loaded module v0.6.5.3-1, ZFS pool version 5000, ZFS filesystem version 5
SPL: using hostid 0x00000000

df:
...
mpool/private           4125696   4125696          0 100% /mpool/private
mpool/pub/esxi          2052736   2052736          0 100% /mpool/pub/esxi
mpool/pub/livecd         429952    429952          0 100% /mpool/pub/livecd
mpool/pub/raspbian      1016192   1016192          0 100% /mpool/pub/raspbian
mpool/pub/videos       90928384  90928384          0 100% /mpool/pub/videos

fixed on IRC #zfsonlinux after being pointed at
3d45fdd

zfs umount -a
zpool export mpool
service zfs-zed stop
rmmod zfs
modprobe zfs spa_slop_shift=6 
zpool import mpool
zfs mount -a

and make sure you don't go over 90% full next time :)
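To make that module option persistent across reboots, something like this should work (a sketch; the exact path may vary by distribution):

echo "options zfs spa_slop_shift=6" >> /etc/modprobe.d/zfs.conf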

@RichardSharpe
Contributor

Hmm,

Running 0.6.5.3-1 but space does not seem to be reclaimed after a delete.

I am running with xattr=on NOT xattr=sa.

@yesbox

yesbox commented Jan 10, 2016

I just found this issue on ZFS v0.6.5.4-1 with a 4.2.8-300.fc23.x86_64 kernel. I stored all files in the root filesystem instead of in datasets, but have since created datasets and been moving the files to them. After that I noticed the reported disk usage of the root filesystem only increasing, both in "zfs list" and "df", when moving files from it. I don't have any other information, other than that I do rely on SELinux and xattr is set to the default (on). Rebooting the system freed up the space.

@odoucet

odoucet commented Jan 26, 2016

Same bug here with SPL/ZFS v0.6.4.2-1 + kernel 3.10.76 ;
data is stored on a child (not directly on root). xattr=on
zfs umount did the trick but took several minutes.

@jcphill

jcphill commented Feb 1, 2016

Looks like the same bug with zfs-0.6.5.4 and kernel 3.10.0-327.
Deleting files does not reclaim space even in child datasets.
Tried "echo 3 > /proc/sys/vm/drop_caches" which ran for several minutes but no change.
Export and import does reclaim the space.
Setting xattr=sa fixes the issue for newly created files.
Setting xattr=off has no effect, even on newly created files.
Update: Disabling SELinux fixes the issue completely, at least for new files with xattr=on

@DeHackEd
Contributor

I ran into this, but I was focusing on the inode counts rather than space usage as it wasn't a lot of space.

Before unmounting, 151616 inodes were in use. Unmounting took a moment, and after remounting it was 6. Yes, I rm -rf'd everything.

@DarwinSurvivor

We are seeing the same problem on a large (3-digit TB) fully updated CentOS 7 machine.
kernel version: 3.10.0-327.28.3.el7
zfs (and zfs-dkms) version: 0.6.5.7-1.el6

Setting drop_caches to 0 caused the machine to stop responding over NFS for 15-20 minutes. The mounts block locally, except on one mount at a time (I suspect it was blocking on whichever mount it was clearing files for at the time). After the interruption, about 5 TB of space was cleared, but running it again from a cron job is not having the same effect (it seems to have only worked once). The machine is in near-constant use, so we haven't had a chance to try unmounting, exporting, or rebooting since then.

@DarwinSurvivor

Using a temporary filesystem I was able to clear some deleted files by un-mounting and re-mounting the specific filesystem. Obviously we would like to avoid this, but the work-around does work in a pinch.

@DarwinSurvivor

Slight correction to my earlier comment, we are setting drop_caches to 3 (0 is what it reads before the value is set).

@DeHackEd
Contributor

drop_caches is not a stored value, it's a signal to perform an action. Your shell will block until the operation completes and it will always read as 0.

@DarwinSurvivor

I'm aware that it performs the function during the "write" (and thus blocks). When I do the write, it does block for a few seconds to a few minutes (depending on the last time it was done), but cat is definitely showing it as having a value of 3. It's probably just reporting the last value that was passed to it.

@kernelOfTruth
Contributor

kernelOfTruth commented Sep 24, 2016

Is anyone using quotas and seeing this issue?

in that case db707ad OpenZFS 6940 - Cannot unlink directories when over quota

might help

edit:

I'm currently not sure where to look, but I'm pretty certain that adding a

dmu_tx_mark_netfree(tx);

during file deletion operations (generally) could improve the situation; if that transaction is delayed we might end up at position 1 again. How and why (due to xattr=on) is another story.

@leeuwenrjj

Also seeing this issue.
CentOS 7.2, zfs-0.6.5.7-1.el7.
Default xattr, running with compression on.

umount/mount will solve it.
We saw high txg_sync and z_fr_iss activity when doing the umount/mount, if that helps ...

Changing to xattr=sa seems to work.

@mrjester888

mrjester888 commented Dec 19, 2016

Just a "Me Too" comment. I was moving 1TB of data between datastores, using a SSD datastore/pool in between for speed. (tank/DS1 -> SSD -> tank/DS2) Had to umount/mount the SSD datastore between each chunk of data for the SSD datastore to reflect the available space.

3.5 years on this bug. A fix would be swell.

@kernelOfTruth
Contributor

kernelOfTruth commented Jan 8, 2017

That's a really hard target to pinpoint!

It used to work for me,

now I just realized that, upon wiping out a directory of 110 GB plus snapshots of a different repository (around 10 GB), the space isn't freed:

branch used is https://github.com/kernelOfTruth/zfs/commits/zfs_kOT_04.12.2016_2

no xattr is used,

no SELinux is used

Kernel is 4.9.0 rt based

edit:

is this also appearing, btw, with an SLOG or ZIL device ?

or is the behavior different ?

edit2:

that gets interesting!

It just freed 6 GB by destroying some other snapshots; the ~120 GB, however, are still missing from the free space shown by zpool list.

As far as I know those folders are mostly configured the same, except for ditto blocks (copies=2 or copies=3) for certain subfolders.

The folders where the space isn't freed sit directly in the root of the pool. For the folders where the space was freed, the top folder is also in the root of the pool but has subfolders beneath it, whereas the non-freeing folders don't ...

edit3:

There were actually still processes holding those directories open, interesting ... now it worked.

@fermulator

fermulator commented Jul 6, 2018

a "me too" comment;

My situation was:

  • had a simple 1TB mirror pool with a handful of datasets
  • realized I wanted more granular control over a set of sub-dirs in one of the datasets
  • so went to start creating more child datasets
    • (roughly it was : move dir FOO into FOO_TMP, create dataset, clone ACLs, mv FOO_TMP/* FOO/)
  • one of the sub-dirs was huge (200GB+)
  • quickly observed the ALLOC of the dataset creeping up and FREE dropping!! Eventually I got here:
$ sudo zfs list -r zfsmain
NAME                                           USED  AVAIL  REFER  MOUNTPOINT
zfsmain                                        829G      0    96K  /zfsmain
zfsmain/SNIPD                              212G      0   212G  /zfsmain/SNIPD
zfsmain/storage                                352G      0   315G  /zfsmain/storage
zfsmain/storage/SNIPA                     454M      0   454M  /zfsmain/storage/SNIPA
zfsmain/storage/SNIPA@2018-06-05_LOCKED      0      -   454M  -
zfsmain/storage/SNIPT                   1.02G      0  1.02G  /zfsmain/storage/SNIPT
zfsmain/storage/SNIPF                        35.9G      0  35.9G  /zfsmain/storage/SNIPF
zfsmain/SNIPU                                265G      0   265G  /zfsmain/SNIPU

The USED adds up roughly. In this state I was in the middle of something like:
mv /zfsmain/storage/SNIPF_TMP/* /zfsmain/storage/SNIPF/

Despite all the MOVE operations (~36GB into SNIPF), there was still a WHACK of usage in SNIPF_TMP.

$ du -hsc SNIPF_TMP | grep total
246G    total

I also checked to see if the pool was busy freeing any bytes; it wasn't:

$ sudo zpool list -o name,size,allocated,free,freeing zfsmain
NAME      SIZE  ALLOC   FREE  FREEING
zfsmain   856G   800G  55.8G        0

My "storage" dir has loads of important live/active services in sub-dirs ... it's impossible for me unmount it right now.

$ sudo zfs get all zfsmain/storage | grep -i mount
zfsmain/storage  mounted               yes                    -
zfsmain/storage  mountpoint            /zfsmain/storage       default
zfsmain/storage  canmount              on                     default

I have verified that the move operation IS deleting files off the filesystem by diffing the "_TMP" dir against the DEST directory.

Tried echoing a non-zero value into /proc/sys/vm/drop_caches; it blocked for a few moments with insignificant zpool iostat activity ... no effect on AVAIL storage.


$ dmesg | egrep "ZFS|SPL"
[   32.910008] SPL: Loaded module v0.6.5.6-0ubuntu4
[   33.001094] ZFS: Loaded module v0.6.5.6-0ubuntu20, ZFS pool version 5000, ZFS filesystem version 5
$ uname -r
4.4.0-128-generic

EDIT: upon further investigation, my issue is not related ... the "mv" command has an interesting behaviour in when/how it deletes files during the operation. It looks like it only deletes files per LEAF NODE in a directory tree (although I haven't seen what happens when there are more than 2 depth levels, as in my original case with the mega move above); this is the pattern in a smaller example:

mv -v FOO_TMP/* somewhere/FOO/
SOMEFILE -> SOMEFILE
removed SOMEFILE
subdirBAR/file1 -> ...
subdirBAR/file2 -> ...
subdirBAR/file3 -> ...
...
removed file1
removed file2
removed file3
removed subdirBAR

CONFIRMED -- my case is a symptom of the mv command (sorry for the confusion). If there exists ANY sub-dir which consumes more GB than the AVAIL GB in the dataset, the filesystem will run out of disk space, because mv only deletes files after it has moved ALL files in the SRC leaf dir. :(


in this case, using rsync will help:

rsync -avh --progress --remove-source-files FOO_TMP/* FOO/
find . -depth -type d -empty -delete
