atomic operations hurt performance on large-scale NUMA systems #3752

Closed
dweeezil opened this issue Sep 8, 2015 · 5 comments
Labels
Status: Inactive (Not being actively updated) · Type: Performance (Performance improvement or performance problem)

Comments

@dweeezil
Contributor

dweeezil commented Sep 8, 2015

While commit ca0bf58 was successful in reducing the lock contention on the various per-state ARC lists by converting the lists to "multilists" and removing arcs_mtx, it left arcs_lsize[] (the accounting of per-state evictable data) directly within arc_state_t, where it is always updated atomically. The atomic updates of arcs_lsize[] have become, in their own way, a contention point similar to that of the old arcs_mtx mutex.
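
For illustration only, here is a minimal userspace C sketch of the contention pattern described above (this is not ZFS code; the counter name and sizes are made-up stand-ins for the arcs_lsize[] updates): every thread performs an atomic add on a single shared 64-bit counter, so the cache line holding it bounces between cores and NUMA nodes on every update.

/* cc -O2 -pthread shared_counter.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 32
#define NITERS   1000000

static _Atomic uint64_t shared_lsize;   /* stand-in for arcs_lsize[type] */

static void *
worker(void *arg)
{
        (void) arg;
        for (int i = 0; i < NITERS; i++) {
                /* analogous to the atomic size adjustment in the hot path */
                atomic_fetch_add_explicit(&shared_lsize, 4096,
                    memory_order_relaxed);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t tids[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&tids[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(tids[i], NULL);

        printf("total: %llu\n",
            (unsigned long long)atomic_load(&shared_lsize));
        return (0);
}

Even though no lock is held, the locked read-modify-write forces the cache line to migrate between sockets, which is the kind of overhead being measured below.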

I'm testing with a 4-node NUMA system having 40 cores, 80 threads and 512GiB of RAM. The test is a simple 32-process 4K random read of fully-cached files (reads are strictly from the ARC). With current master code, I see the following result:

dweeezil@zfs-dev:~/fio$ fio fio.test
test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=pvsync, iodepth=1
...
fio-2.2.9-20-g1520
Starting 32 processes
Jobs: 32 (f=32): [r(32)] [100.0% done] [6544MB/0KB/0KB /s] [1675K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=63965: Mon Sep  7 22:26:19 2015
  read : io=392092MB, bw=6534.7MB/s, iops=1672.9K, runt= 60002msec

As an experiment, I simply removed the arcs_lsize[] adjustments in add_reference() and remove_reference() (effectively removing a single locked assembly instruction in each place) and saw:

dweeezil@zfs-dev:~/fio$ !fio
fio fio.test
test: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K, ioengine=pvsync, iodepth=1
...
fio-2.2.9-20-g1520
Starting 32 processes
Jobs: 32 (f=32): [r(32)] [100.0% done] [7482MB/0KB/0KB /s] [1915K/0/0 iops] [eta 00m:00s]
test: (groupid=0, jobs=32): err= 0: pid=73269: Mon Sep  7 22:31:19 2015
  read : io=447956MB, bw=7465.9MB/s, iops=1911.3K, runt= 60001msec

Over a 14% improvement in throughput.

It also appears that manipulating the arcstats with atomic operations can cause quite a performance hit (I'll be testing that shortly).

As a bit of background, this is the same benchmark I've been running for several months this summer (2015), which has helped track down other similar bottlenecks (in SPL and elsewhere). The performance I was getting immediately after porting ca0bf58 was < 4000MB/s, so quite a lot of improvement has been made so far. That said, XFS yields over 30000MB/s on the very same benchmark (on the same hardware), so there's clearly plenty of room for improvement (XFS seems to benefit from page cache integration and the ability to use generic_file_read_iter()). I'll also note that @prakashsurya's benchmarks on illumos (see https://reviews.csiden.org/r/151/) yielded better numbers, too, so there might also be other Linux-related NUMA issues at play here. For example, I can get rather different results by running a subset of the test (<= 20 threads) and pinning it to a single socket with numactl.

I'm posting this mainly as a place to record this kind of information and to put the issue on people's radar. I'll post more information here as I find it.

@behlendorf added the "Type: Performance" (Performance improvement or performance problem) label on Sep 8, 2015
@dweeezil changed the title from "atomic ARC operations hurt performance on large-scale NUMA systems" to "atomic operations hurt performance on large-scale NUMA systems" on Sep 14, 2015
@dweeezil
Contributor Author

The atomic operations performed by crhold() and crfree() (actually by the kernel support they rely on) were found to be taking over 5% of the CPU time on this benchmark. See dweeezil/zfs@db3d7c7 for a patch to remove them.

@behlendorf
Contributor

Ouch, nice find. I'll need to take a more careful look at the credential interface in Linux, but I suspect we may be able to get away with directly calling current_cred() and passing it in without the holds. Since we're in a system call and we're getting the credentials for the current process, they must remain valid for the life of the system call.
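
To make the lifetime argument concrete, here is a small userspace sketch (hypothetical names; this is neither the Linux cred API nor the actual patch): the hold/release pair costs an atomic operation on each side of the hot path, and when the caller already guarantees that the object outlives the call, as a system call does for the current task's credentials, the pair is pure overhead and can in principle be dropped.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct fake_cred {
        _Atomic uint32_t usage;          /* hypothetical reference count */
        uint32_t uid;
};

static void
cred_hold(struct fake_cred *cr)          /* analogue of crhold() */
{
        atomic_fetch_add(&cr->usage, 1); /* locked instruction in hot path */
}

static void
cred_rele(struct fake_cred *cr)          /* analogue of crfree() */
{
        atomic_fetch_sub(&cr->usage, 1);
}

static uint32_t
read_path_with_holds(struct fake_cred *cr)
{
        cred_hold(cr);
        uint32_t uid = cr->uid;
        cred_rele(cr);
        return (uid);
}

static uint32_t
read_path_borrowed(struct fake_cred *cr)
{
        /* lifetime guaranteed by the caller; no atomics needed */
        return (cr->uid);
}

int
main(void)
{
        struct fake_cred cr = { .usage = 1, .uid = 1000 };

        printf("%u %u\n", read_path_with_holds(&cr), read_path_borrowed(&cr));
        return (0);
}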

@ryao
Contributor

ryao commented Oct 2, 2015

@dweeezil I agree with @behlendorf. Nice find.

What kind of pool configuration did you use to test?

As for catching up to XFS, one idea would be implementing zero-copy writes by manipulating the page tables to do CoW for large I/O operations after the ABD work is merged. If sync() is being done at all, ZIO pipelining would help. Also, #3765 might help, and we likely need to investigate the behavior when a read is issued for a block for which a prefetch read has already been issued. Lastly, #3643 might help.

behlendorf pushed a commit that referenced this issue Jun 6, 2018
In pursuit of improving performance on multi-core systems, we should
implement fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Authored by: Paul Dagnelie <[email protected]>
Reviewed by: Pavel Zakharov <[email protected]>
Reviewed by: Matthew Ahrens <[email protected]>
Approved by: Dan McDonald <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Ported-by: Paul Dagnelie <[email protected]>

OpenZFS-issue: https://www.illumos.org/issues/8484
OpenZFS-commit: openzfs/openzfs@7028a8b92b7
Issue #3752
Closes #7462
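
As a rough illustration of the fanned-out counter approach named in that commit, here is a userspace sketch under assumed names (it is not the OpenZFS aggsum code): updates land in one of several cache-line-padded buckets chosen by a per-CPU or per-thread hint, so concurrent writers rarely share a cache line, and the full sum is only computed on the comparatively rare reads.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 16                      /* ideally ~number of CPUs */

struct fan_bucket {
        _Atomic int64_t value;
        char pad[64 - sizeof (int64_t)]; /* pad to a typical cache line */
};

static struct fan_bucket fan[NBUCKETS];

static void
fan_add(unsigned hint, int64_t delta)    /* hint: e.g. CPU or thread id */
{
        atomic_fetch_add_explicit(&fan[hint % NBUCKETS].value, delta,
            memory_order_relaxed);
}

static int64_t
fan_sum(void)                            /* slow path, read rarely */
{
        int64_t sum = 0;

        for (int i = 0; i < NBUCKETS; i++)
                sum += atomic_load_explicit(&fan[i].value,
                    memory_order_relaxed);
        return (sum);
}

int
main(void)
{
        fan_add(0, 4096);
        fan_add(7, 4096);
        fan_add(0, -4096);
        printf("sum: %lld\n", (long long)fan_sum());
        return (0);
}

The trade-off is that a read becomes a loop over all buckets, which is acceptable for statistics that are written constantly but read rarely.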
@dweeezil
Contributor Author

This issue has been mostly fixed by #7462. I think, however, that ZoL may still have some unique (to ZoL vs. other OpenZFS implementations) atomic operations in hot paths that could be converted to use the aggsum facility. That said, #7462 does address the specific case I was referring to, so I'll close this issue (which I actually thought was already closed).
