File system locks when many concurrent threads are opened #608

mikhmv · 2012-03-19T04:28:55Z

Hi,
I have an issue with file system responsiveness under heavy load.
I am using 5 disks in RAID-Z, with gzip-5 compression and dedup on. The host has 64 cores and 256GB RAM.
When I run processes that keep many files open, the system becomes completely unresponsive.

As an example, "ls" can take 1 minute.

Regards,
Max

ryao · 2012-03-19T14:22:44Z

Which version of ZFS and which distribution?

mikhmv · 2012-03-19T15:46:30Z

Here is the system info:

max@s0:~$ uname -a
Linux s0 3.2.0-19-generic #30-Ubuntu SMP Fri Mar 16 16:27:15 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

max@s0:$ dpkg -s zfs-dkms
Package: zfs-dkms
Status: install ok installed
Priority: extra
Section: kernel
Installed-Size: 9437
Maintainer: Darik Horn [email protected]
Architecture: amd64
Source: zfs-linux
Version: 0.6.0.54-0ubuntu1precise1

max@s0:$ dpkg -s zfsutils
Package: zfsutils
Status: install ok installed
Priority: extra
Section: admin
Installed-Size: 696
Maintainer: Darik Horn [email protected]
Architecture: amd64
Source: zfs-linux
Version: 0.6.0.54-0ubuntu1precise1
max@s0:$ dpkg -s libuutil1
Package: libuutil1
Status: install ok installed
Priority: extra
Section: libs
Installed-Size: 147
Maintainer: Darik Horn [email protected]
Architecture: amd64
Source: zfs-linux
Version: 0.6.0.54-0ubuntu1precise1

max@s0:$ dpkg -s libzfs1
Package: libzfs1
Status: install ok installed
Priority: extra
Section: libs
Installed-Size: 307
Maintainer: Darik Horn [email protected]
Architecture: amd64
Source: zfs-linux
Version: 0.6.0.54-0ubuntu1precise1

max@s0:$ dpkg -s libzpool1
Package: libzpool1
Status: install ok installed
Priority: extra
Section: libs
Installed-Size: 1122
Maintainer: Darik Horn [email protected]
Architecture: amd64
Source: zfs-linux
Version: 0.6.0.54-0ubuntu1precise1

max@s0:$ dpkg -s zfs-auto-snapshot
Package: zfs-auto-snapshot
Status: install ok installed
Priority: extra
Section: admin
Installed-Size: 67
Maintainer: Darik Horn [email protected]
Architecture: all
Version: 1.0.8-0ubuntu1precise1

max@s0:~$ sudo zpool list
[sudo] password for max:
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
tank  9.06T  6.37T  2.69T    70%  1.01x  ONLINE  -

max@s0:~$ sudo zpool status
pool: tank
state: ONLINE
scan: scrub canceled on Thu Mar 15 20:19:58 2012
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        d1      ONLINE       0     0     0
        d2      ONLINE       0     0     0
        d3      ONLINE       0     0     0
        d4      ONLINE       0     0     0
        d5      ONLINE       0     0     0

errors: No known data errors

max@s0:~$ sudo zfs list
NAME                 USED  AVAIL  REFER  MOUNTPOINT
tank                5.11T  2.04T   256K  /tank
tank/Irina           310G  2.04T   310G  /tank/Irina
tank/OpenNebula      866G  2.04T   863G  /tank/OpenNebula
tank/biouml-shared  3.94T  2.04T  3.79T  /tank/biouml-shared

max@s0:~$ zfs get all tank/biouml-shared
exportfs: could not open /var/lib/nfs/.etab.lock for locking: errno 13 (Permission denied)
NAME PROPERTY VALUE SOURCE
tank/biouml-shared type filesystem -
tank/biouml-shared creation Sun Feb 5 8:29 2012 -
tank/biouml-shared used 3.94T -
tank/biouml-shared available 2.04T -
tank/biouml-shared referenced 3.79T -
tank/biouml-shared compressratio 1.08x -
tank/biouml-shared mounted yes -
tank/biouml-shared quota none default
tank/biouml-shared reservation none default
tank/biouml-shared recordsize 128K default
tank/biouml-shared mountpoint /tank/biouml-shared default
tank/biouml-shared sharenfs off local
tank/biouml-shared checksum on default
tank/biouml-shared compression gzip local
tank/biouml-shared atime on default
tank/biouml-shared devices on default
tank/biouml-shared exec on default
tank/biouml-shared setuid on default
tank/biouml-shared readonly off default
tank/biouml-shared zoned off default
tank/biouml-shared snapdir hidden default
tank/biouml-shared aclinherit restricted default
tank/biouml-shared canmount on default
tank/biouml-shared xattr on default
tank/biouml-shared copies 1 default
tank/biouml-shared version 5 -
tank/biouml-shared utf8only off -
tank/biouml-shared normalization none -
tank/biouml-shared casesensitivity sensitive -
tank/biouml-shared vscan off default
tank/biouml-shared nbmand off default
tank/biouml-shared sharesmb off default
tank/biouml-shared refquota none default
tank/biouml-shared refreservation none default
tank/biouml-shared primarycache all default
tank/biouml-shared secondarycache all default
tank/biouml-shared usedbysnapshots 154G -
tank/biouml-shared usedbydataset 3.79T -
tank/biouml-shared usedbychildren 0 -
tank/biouml-shared usedbyrefreservation 0 -
tank/biouml-shared logbias latency default
tank/biouml-shared dedup on inherited from tank
tank/biouml-shared mlslabel none default
tank/biouml-shared sync standard default
tank/biouml-shared refcompressratio 1.04x -

Regular hard drive test:
max@s0:~$ time echo test zfs speed > test.txt

real 0m0.016s
user 0m0.000s
sys 0m0.000s

ZFS:
oneadmin@s0:/tank/biouml-shared/tmp-tools$ time echo test zfs speed > test.txt

real 0m2.446s
user 0m0.000s
sys 0m0.000s

oneadmin@s0:/tank/biouml-shared/tmp$ time ls -lahs > test.time.txt

real 0m8.420s
user 0m0.000s
sys 0m0.040s
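
The manual `time` comparisons above can be repeated with a small script, which makes it easier to collect the same numbers again under different load. This is only a sketch: the function names and the probe file name are hypothetical, and the write probe simply creates, writes, and closes one small file, as `echo ... > test.txt` does. Passing `sync=True` additionally calls `fsync`, which the shell test did not do.

```python
import os
import tempfile
import time


def time_small_write(directory, payload=b"test zfs speed\n", sync=False):
    """Return the seconds taken to create, write, and close one small file."""
    path = os.path.join(directory, "latency-probe.txt")  # hypothetical name
    start = time.perf_counter()
    with open(path, "wb") as handle:
        handle.write(payload)
        if sync:
            # Optional: force the data to stable storage before timing stops.
            handle.flush()
            os.fsync(handle.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)  # clean up outside the timed region
    return elapsed


def time_listing(directory):
    """Return the seconds taken to list a directory and stat every entry."""
    start = time.perf_counter()
    with os.scandir(directory) as entries:
        for entry in entries:
            entry.stat(follow_symlinks=False)  # roughly what `ls -l` needs
    return time.perf_counter() - start
```

Calling `time_small_write("/tank/biouml-shared/tmp")` and `time_listing("/tank/biouml-shared/tmp")` in a loop, next to the same calls on a non-ZFS directory, would give a rough latency series for both file systems.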

max@s0:~$ zpool iostat 5
exportfs: could not open /var/lib/nfs/.etab.lock for locking: errno 13 (Permission denied)
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        6.37T  2.69T    307    194  30.2M   783K
tank        6.37T  2.69T    353    121  31.8M   700K
tank        6.37T  2.69T    248    209  14.0M  1.22M
tank        6.37T  2.69T    219    226  13.9M  1.38M
tank        6.37T  2.69T    333     97  35.4M   544K
tank        6.37T  2.69T    245    334  11.4M  1.86M
tank        6.37T  2.69T    232    128  15.5M   662K
tank        6.37T  2.69T    317     50  38.2M   110K

I ran these tests while the system was somewhat responsive. It was worse before.

max@s0:~$ sudo lsof | grep tank| wc -l
124

I will run the tests again when the system is under heavy load.

mikhmv · 2012-03-20T18:37:12Z

My system is now heavily loaded. You can see the performance here:

time ls -lahs realigned/
total 2.4G
39K drwx------ 3 oneadmin cloud 11 Mar 20 14:30 .
14K drwx------ 7 oneadmin cloud 7 Mar 20 02:07 ..
512 -rw------- 1 oneadmin cloud 0 Mar 20 14:30 5173N_sorted_dedup_rg_dd2_kar.chr14.ra.bam
280M -rw------- 1 oneadmin cloud 282M Mar 20 14:30 5173N_sorted_dedup_rg_dd2_kar.chr15.ra.bam
65K -rw------- 1 oneadmin cloud 113K Mar 19 21:02 5173N_sorted_dedup_rg_dd2_kar.chr22.ra.bai
1.9G -rw------- 1 oneadmin cloud 5.3G Mar 19 21:02 5173N_sorted_dedup_rg_dd2_kar.chr22.ra.bam
7.0K -rw------- 1 oneadmin cloud 5 Mar 19 21:02 5173N_sorted_dedup_rg_dd2_kar.chr22.ra.bam.done
7.0K -rw------- 1 oneadmin cloud 2.4K Mar 20 02:50 5173N_sorted_dedup_rg_dd2_kar.chrM.ra.bai
259M -rw------- 1 oneadmin cloud 259M Mar 20 02:50 5173N_sorted_dedup_rg_dd2_kar.chrM.ra.bam
7.0K -rw------- 1 oneadmin cloud 5 Mar 20 03:38 5173N_sorted_dedup_rg_dd2_kar.chrM.ra.bam.done
39K drwx------ 2 oneadmin cloud 32 Mar 20 14:08 logs

real 4m36.037s
user 0m0.004s
sys 0m0.008s

zpool iostat 5
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        6.39T  2.67T    320    203  29.3M   956K
tank        6.39T  2.67T    396    108  44.0M   829K
tank        6.39T  2.67T    407    109  45.7M   829K
tank        6.39T  2.67T    464     98  52.2M   548K

max@s0:/var/lib/one/var$ sudo lsof | grep tank| wc -l
99

mikhmv · 2012-03-20T20:51:07Z

The system ignores writes when several concurrent reads are present.

In the next test I have several read streams and several write streams (cp command).

max@s0:/var/lib/one/var$ sudo zpool iostat 5
[sudo] password for max:
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        6.39T  2.67T    322    201  29.5M   950K
tank        6.39T  2.67T    473      0  54.9M      0
tank        6.39T  2.67T    506      0  58.7M      0
tank        6.39T  2.67T    390      0  42.4M      0
tank        6.39T  2.67T    357      0  38.7M      0
tank        6.39T  2.67T    195      0  15.8M      0
tank        6.39T  2.67T    297      0  30.2M      0
tank        6.39T  2.67T    409      0  45.4M      0
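
To put a number on the write stall in the output above, saved `zpool iostat` text can be parsed and scanned for consecutive samples with zero write operations. The following is a minimal sketch, not part of any ZFS tooling: `parse_size`, `parse_iostat`, and `longest_write_stall` are hypothetical helper names, the sample is copied from the run above, and the K/M/G/T suffixes are assumed to be powers of 1024.

```python
# Assumption: zpool prints sizes with 1024-based suffixes.
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}

SAMPLE = """\
pool alloc free read write read write
tank 6.39T 2.67T 322 201 29.5M 950K
tank 6.39T 2.67T 473 0 54.9M 0
tank 6.39T 2.67T 506 0 58.7M 0
tank 6.39T 2.67T 390 0 42.4M 0
tank 6.39T 2.67T 357 0 38.7M 0
tank 6.39T 2.67T 195 0 15.8M 0
tank 6.39T 2.67T 297 0 30.2M 0
tank 6.39T 2.67T 409 0 45.4M 0
"""


def parse_size(text):
    """Convert a value such as '54.9M', '950K' or '0' to a plain number."""
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)


def parse_iostat(output, pool="tank"):
    """Return one dict per data row for the given pool; other lines are skipped."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 7 or fields[0] != pool:
            continue  # header, separator, blank line, or another pool
        rows.append({
            "read_ops": parse_size(fields[3]),
            "write_ops": parse_size(fields[4]),
            "read_bw": parse_size(fields[5]),
            "write_bw": parse_size(fields[6]),
        })
    return rows


def longest_write_stall(rows):
    """Return the longest run of consecutive samples with zero write operations."""
    longest = current = 0
    for row in rows:
        current = current + 1 if row["write_ops"] == 0 else 0
        longest = max(longest, current)
    return longest
```

With the 5-second interval used above, multiplying the result of `longest_write_stall(parse_iostat(SAMPLE))` by 5 gives the approximate stall duration in seconds.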

mikhmv · 2012-03-20T20:52:35Z

longer log:

max@s0:/var/lib/one/var$ sudo zpool iostat 5
[sudo] password for max:
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        6.39T  2.67T    322    201  29.5M   950K
tank        6.39T  2.67T    473      0  54.9M      0
tank        6.39T  2.67T    506      0  58.7M      0
tank        6.39T  2.67T    390      0  42.4M      0
tank        6.39T  2.67T    357      0  38.7M      0
tank        6.39T  2.67T    195      0  15.8M      0
tank        6.39T  2.67T    297      0  30.2M      0
tank        6.39T  2.67T    409      0  45.4M      0
tank        6.39T  2.67T    458      0  52.5M      0
tank        6.39T  2.67T    391      0  43.9M      0
tank        6.39T  2.67T    214      0  16.5M      0
tank        6.39T  2.67T    400      0  42.5M      0
tank        6.39T  2.67T    237      0  20.0M      0
tank        6.39T  2.67T    335      0  34.9M      0
tank        6.39T  2.67T    316      0  31.7M      0
tank        6.39T  2.67T    345      0  36.0M      0
tank        6.39T  2.67T    173      0  10.5M      0
tank        6.39T  2.67T    227      0  19.3M      0
tank        6.39T  2.67T    371      0  39.4M      0
tank        6.39T  2.67T    277      0  25.8M      0
tank        6.39T  2.67T    314      0  29.9M      0
tank        6.39T  2.67T    299      0  30.2M      0
tank        6.39T  2.67T    232      0  19.0M      0
tank        6.39T  2.67T    277      0  27.5M      0
tank        6.39T  2.67T    243      0  21.8M      0
tank        6.39T  2.67T    306      0  30.9M      0
tank        6.39T  2.67T    245      0  22.6M      0
tank        6.39T  2.67T    265      0  22.6M      0
tank        6.39T  2.67T    377      0  39.5M      0
tank        6.39T  2.67T    165      0  9.42M      0
tank        6.39T  2.67T    230      0  17.9M      0
tank        6.39T  2.67T    185      0  12.0M      0
tank        6.39T  2.67T    318      0  31.3M      0
tank        6.39T  2.67T    414      0  43.7M      0
tank        6.39T  2.67T    319      0  30.0M      0
tank        6.39T  2.67T    284      0  26.1M      0
tank        6.39T  2.67T    204      0  13.9M      0
tank        6.39T  2.67T    390      0  42.2M      0
tank        6.39T  2.67T    413      0  44.2M      0

mikhmv · 2012-03-21T00:24:15Z

I think writes should have higher priority than reads, since you can theoretically read an unlimited amount of data but writes are usually limited.

I have a data analysis workflow which reads data, stores it in RAM and writes it back. The program (I did not develop it) has independent threads for reading and writing. What happens is that memory usage grows because the program cannot write anything. In addition, this behaviour completely blocks the machine.
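
The memory growth described here is what happens when a reader thread can run arbitrarily far ahead of a stalled writer. The program in question is not shown in this thread, so the following is only a hypothetical sketch of how a reader/writer pair can bound its buffering with a fixed-size queue. `run_pipeline`, `write_chunk`, and `max_pending` are made-up names. This does not address the ZFS behaviour itself; it only limits how much data sits in RAM while writes are blocked.

```python
import queue
import threading


def run_pipeline(read_chunks, write_chunk, max_pending=4):
    """Copy chunks from an iterable to write_chunk with bounded buffering.

    At most max_pending chunks are held in memory: when the writer stalls,
    the reader blocks on put() instead of reading further ahead.
    """
    pending = queue.Queue(maxsize=max_pending)
    done = object()  # sentinel marking the end of the input
    errors = []

    def reader():
        for chunk in read_chunks:
            pending.put(chunk)  # blocks while the queue is full
        pending.put(done)

    def writer():
        while True:
            chunk = pending.get()
            if chunk is done:
                break
            if errors:
                continue  # after a failure, keep draining so the reader can exit
            try:
                write_chunk(chunk)
            except Exception as exc:
                errors.append(exc)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    if errors:
        raise errors[0]
```

The trade-off is that reads slow down to the pace of the writes, so total throughput during a stall drops, but memory use stays at roughly `max_pending` chunks rather than growing without limit.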

ryao · 2012-04-19T04:39:48Z

@mikhmv Pull request #660 might solve your problem.

ryao · 2012-05-17T01:42:21Z

Pull request #660 was merged. Do you still have this problem against the latest code?

mikhmv · 2012-05-17T02:30:24Z

Hard to say. I am using the stable version now. It has a bug with removing big files, but I don't do that often. I stopped using the daily builds because they were unstable (2 weeks ago). I had to reboot the server 3 times per day.

behlendorf · 2013-04-11T20:44:30Z

Closing this issue as stale. If you're still observing similar issues with the latest code please go ahead and open a new issue.

There are changes to vfs_getattr() in torvalds/linux@a528d35. The new interface is:

int vfs_getattr(const struct path *path, struct kstat *stat, u32 request_mask, unsigned int query_flags)

The request_mask argument indicates which field(s) the caller intends to use. Fields the caller does not specify via request_mask may be set in the returned struct anyway, but their values may be approximate.

The query_flags argument indicates whether the filesystem must update the attributes from the backing store.

This patch uses the query_flags which result in vfs_getattr behaving the same as it did with the 2-argument version which the kernel provided before Linux 4.11.

Members blksize and blocks are now always the same size regardless of arch. They match the size of the equivalent members in vnode_t.

The configure checks are modified to ensure that the appropriate vfs_getattr() interface is used.

A more complete fix, removing the ZFS dependency on vfs_getattr() entirely, is deferred as it is a much larger project.

Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes openzfs#608

In Linux 4.11, torvalds/linux@2a1f062, signal handling related functions were moved from sched.h into sched/signal.h.

Add configure checks to detect this and include the new file where needed.

Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes openzfs#608

Before kernel 2.6.29 credentials were embedded in task_structs, and zfs had cases where one thread would need to refer to the credential of another thread, forcing it to take a hold on the foreign thread's task_struct to ensure it was not freed.

Since 2.6.29, the credential has been moved out of the task_struct into a cred_t.

In addition, the mainline kernel originally did not export __put_task_struct() but the RHEL5 kernel did, according to openzfs/spl@e811949a570. As of 2.6.39 the mainline kernel exports it.

There is no longer zfs code that takes or releases holds on a task_struct, and so there is no longer any reference to __put_task_struct().

This affects the linux 4.11 kernel because the prototype for __put_task_struct() is in a new include file (linux/sched/task.h) and so the config check failed to detect the exported symbol.

Removing the unnecessary stub and corresponding config check. This works on kernels since the oldest one currently supported, 2.6.32 as shipped with Centos/RHEL.

Reviewed-by: Chunwei Chen <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Olaf Faaland <[email protected]>
Closes openzfs#608

`cargo update` Update our direct dependencies to the latest versions (per `cargo outdated`), except for `azure*`, which will require changes to our code. Update allowed licenses to allow the Unicode license, which is actually FSF approved but not marked as such in the SPDX metadata. Note: requires rustc 1.61, the product uses 1.63. Run `rustup default 1.63` on your laptop to switch to it.

The sysinfo crate changed the meaning of `System::total_memory()`, from returning kilobytes to returning bytes. This makes the agent think that the system has 1024x the amount of RAM that it really does, and we try to use more memory than exists. The problem was introduced by PR openzfs#608 This commit changes our code to interpret the new meaning of the return value correctly.

behlendorf closed this as completed Apr 11, 2013
