atomic operations hurt performance on large-scale NUMA systems #3752

Closed
dweeezil opened this issue Sep 8, 2015 · 5 comments
Labels
Status: Inactive (Not being actively updated) · Type: Performance (Performance improvement or performance problem)

Comments

@dweeezil
Contributor

dweeezil commented Sep 8, 2015

While commit ca0bf58 was successful in reducing the lock contention on the various per-state ARC lists by converting the lists to "multilists" and removing arcs_mtx, it left arcs_lsize[] (the accounting of per-state evictable data) directly within arc_state_t, where it is always updated atomically. The atomic updates of arcs_lsize[] have become, in their own way, a contention point similar to that of the old arcs_mtx mutex.
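
For illustration only, here is a minimal userspace C sketch of the contention pattern described above (this is not ZFS code; the counter name and sizes are made-up stand-ins for the arcs_lsize[] updates): every thread performs an atomic add on a single shared 64-bit counter, so the cache line holding it bounces between cores and NUMA nodes on every update.

/* cc -O2 -pthread shared_counter.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 32
#define NITERS   1000000

static _Atomic uint64_t shared_lsize;   /* stand-in for arcs_lsize[type] */

static void *
worker(void *arg)
{
        (void) arg;
        for (int i = 0; i < NITERS; i++) {
                /* analogous to the atomic size adjustment in the hot path */
                atomic_fetch_add_explicit(&shared_lsize, 4096,
                    memory_order_relaxed);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t tids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tids[i], NULL);

        printf("total: %llu\n",
            (unsigned long long)atomic_load(&shared_lsize));
        return (0);
}

Even though no lock is held, the locked read-modify-write forces the cache line to migrate between sockets, which is the kind of overhead being measured below.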

I'm testing with a 4-node NUMA system having 40 cores, 80 threads and 512GiB of RAM. The test is a simple 32-process 4K random read of fully-cached files (reads are strictly from the ARC). With current master code, I see the following result:

dweeezil@zfs-dev:~/fio$ fio fio.test
test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=pvsync, iodepth=1
...
fio-2.2.9-20-g1520
Starting 32 processes
Jobs: 32 (f=32): [r(32)] [100.0% done] [6544MB/0KB/0KB /s] [1675K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=63965: Mon Sep  7 22:26:19 2015
  read : io=392092MB, bw=6534.7MB/s, iops=1672.9K, runt= 60002msec

As an experiment, I simply removed the arcs_lsize[] adjustments in add_reference() and remove_reference() (effectively removing a single locked assembly instruction in each place) and saw:

dweeezil@zfs-dev:~/fio$ !fio
fio fio.test
test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=pvsync, iodepth=1
...
fio-2.2.9-20-g1520
Starting 32 processes
Jobs: 32 (f=32): [r(32)] [100.0% done] [7482MB/0KB/0KB /s] [1915K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=73269: Mon Sep  7 22:31:19 2015
  read : io=447956MB, bw=7465.9MB/s, iops=1911.3K, runt= 60001msec

Over a 14% improvement in throughput.

It also appears that manipulating the arcstats with atomic operations can cause quite a performance hit (I'll be testing that shortly).

As a bit of background, this is the same benchmark I've been running for several months this summer (2015), which has helped track down other similar bottlenecks (in SPL and elsewhere). The performance I was getting immediately after porting ca0bf58 was < 4000MB/s, so quite a lot of improvement has been made so far. That said, XFS yields over 30000MB/s on the very same benchmark (on the same hardware), so there's clearly plenty of room for improvement (XFS seems to benefit from page cache integration and the ability to use generic_file_read_iter()). I'll also note that @prakashsurya's benchmarks on illumos (see https://reviews.csiden.org/r/151/) yielded better numbers, too, so there might also be other Linux-related NUMA issues at play here. For example, I can get rather different results by running a subset of the test (<= 20 threads) and pinning it to a single socket with numactl.

I'm posting this mainly as a place to record this kind of information and to put the issue on people's radar. I'll post more information here as I find it.

@behlendorf added the "Type: Performance" (Performance improvement or performance problem) label on Sep 8, 2015
@dweeezil changed the title from "atomic ARC operations hurt performance on large-scale NUMA systems" to "atomic operations hurt performance on large-scale NUMA systems" on Sep 14, 2015
@dweeezil
Contributor Author

The atomic operations performed by crhold() and crfree() (actually by the kernel support they rely on) were found to be taking over 5% of the CPU time on this benchmark. See dweeezil/zfs@db3d7c7 for a patch to remove them.

@behlendorf
Contributor

Ouch, nice find. I'll need to take a more careful look at the credential interface in Linux, but I suspect we may be able to get away with directly calling current_cred() and passing it in without the holds. Since we're in a system call and we're getting the credentials for the current process, they must remain valid for the life of the system call.
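
To make the lifetime argument concrete, here is a small userspace sketch (hypothetical names; this is neither the Linux cred API nor the actual patch): the hold/release pair costs an atomic operation on each side of the hot path, and when the caller already guarantees that the object outlives the call, as a system call does for the current task's credentials, the pair is pure overhead and can in principle be dropped.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct fake_cred {
        _Atomic uint32_t usage;          /* hypothetical reference count */
        uint32_t uid;
};

static void
cred_hold(struct fake_cred *cr)          /* analogue of crhold() */
{
        atomic_fetch_add(&cr->usage, 1); /* locked instruction in hot path */
}

static void
cred_rele(struct fake_cred *cr)          /* analogue of crfree() */
{
        atomic_fetch_sub(&cr->usage, 1);
}

static uint32_t
read_path_with_holds(struct fake_cred *cr)
{
        cred_hold(cr);
        uint32_t uid = cr->uid;
        cred_rele(cr);
        return (uid);
}

static uint32_t
read_path_borrowed(struct fake_cred *cr)
{
        /* lifetime guaranteed by the caller; no atomics needed */
        return (cr->uid);
}

int
main(void)
{
        struct fake_cred cr = { .usage = 1, .uid = 1000 };

        printf("%u %u\n", read_path_with_holds(&cr), read_path_borrowed(&cr));
        return (0);
}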

@ryao
Contributor

ryao commented Oct 2, 2015

@dweeezil I agree with @behlendorf. Nice find.

What kind of pool configuration did you use to test?

As for catching up to XFS, one idea would be implementing zero-copy writes by manipulating the page tables to do CoW for large I/O operations after the ABD work is merged. If sync() is being done at all, ZIO pipelining would help. Also, #3765 might help, and we likely need to investigate the behavior when a read is issued for a block for which a prefetch read has already been issued. Lastly, #3643 might help.

behlendorf pushed a commit that referenced this issue Jun 6, 2018
In pursuit of improving performance on multi-core systems, we should
implement fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Authored by: Paul Dagnelie <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Paul Dagnelie <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/8484
OpenZFS-commit: openzfs/openzfs@7028a8b92b7
Issue #3752
Closes #7462
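
As a rough illustration of the fanned-out counter approach named in that commit, here is a userspace sketch under assumed names (it is not the OpenZFS aggsum code): updates land in one of several cache-line-padded buckets chosen by a per-CPU or per-thread hint, so concurrent writers rarely share a cache line, and the full sum is only computed on the comparatively rare reads.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 16                      /* ideally ~number of CPUs */

struct fan_bucket {
        _Atomic int64_t value;
        char pad[64 - sizeof (int64_t)]; /* pad to a typical cache line */
};

static struct fan_bucket fan[NBUCKETS];

static void
fan_add(unsigned hint, int64_t delta)    /* hint: e.g. CPU or thread id */
{
        atomic_fetch_add_explicit(&fan[hint % NBUCKETS].value, delta,
            memory_order_relaxed);
}

static int64_t
fan_sum(void)                            /* slow path, read rarely */
{
        int64_t sum = 0;

        for (int i = 0; i < NBUCKETS; i++)
                sum += atomic_load_explicit(&fan[i].value,
                    memory_order_relaxed);
        return (sum);
}

int
main(void)
{
        fan_add(0, 4096);
        fan_add(7, 4096);
        fan_add(0, -4096);
        printf("sum: %lld\n", (long long)fan_sum());
        return (0);
}

The trade-off is that a read becomes a loop over all buckets, which is acceptable for statistics that are written constantly but read rarely.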
@dweeezil
Contributor Author

This issue has been mostly fixed by #7462. I think, however, that ZoL may still have some unique (to ZoL vs. other OpenZFS implementations) atomic operations in hot paths that could be converted to use the aggsum facility. That said, #7462 does address the specific case I was referring to, so I'll close this issue (which I actually thought was already closed).
