dmu_object_alloc() is single-threaded #4703

ahrens · 2016-05-27T04:39:51Z

Using a benchmark in which 32 threads create 2 million files in the same directory, on a machine with 16 CPU cores and with workarounds for several other issues applied, I noticed that dmu_object_alloc() was using about 55% of all CPU, most of it spent waiting to acquire the os_obj_lock.

In order to increase parallelism of object allocation, we must solve two problems:

1. We need to decrease the hold time of os_obj_lock (or not grab it at all). The os_obj_lock protects os_obj_next, but we may not need to hold it during the rest of dmu_object_alloc(), especially during the call to dnode_hold_impl().
2. Once the above is solved, there will be several threads in dmu_object_alloc(), calling dnode_hold_impl() concurrently. Since we allocate adjacent objects with an i++-style allocator, the several threads will be holding adjacent dnodes, which all come from the same dbuf. As a result, the threads will contend on the dbuf’s locks.
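
To make the second problem concrete, here is a minimal sketch of how object numbers map to dnode blocks. It assumes 32 dnodes per block, which is the figure used in this issue; the names DNODES_PER_BLOCK, dnode_block_of() and blocks_touched() are hypothetical and are not taken from the ZFS source.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration: 32 dnodes per dnode block, as quoted in this issue. */
#define	DNODES_PER_BLOCK	32

/* Index of the dnode block (and so the dbuf) that holds a given object. */
static uint64_t
dnode_block_of(uint64_t object)
{
	return (object / DNODES_PER_BLOCK);
}

/* Number of distinct dnode blocks touched by `count` adjacent objects. */
static uint64_t
blocks_touched(uint64_t first_object, uint64_t count)
{
	assert(count > 0);
	return (dnode_block_of(first_object + count - 1) -
	    dnode_block_of(first_object) + 1);
}
```

With these assumptions, 16 threads holding objects 64 through 79 all land in block 2, so blocks_touched(64, 16) is 1 and every thread ends up working against the same dbuf.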

A relatively simple way to address this problem would be to have a “next object to allocate” for each CPU. Each of these “next object”s would be in a different block of the dnode object, so that concurrent allocation would be holding dnodes in different dbufs. When a thread’s “next object” reaches the end of the block, it will be reset to the per-objset os_obj_next, which will be increased by a block’s worth of objects (32). Only when manipulating the os_obj_next will we need to grab the os_obj_lock. This should decrease lock contention dramatically, because each thread only needs to grab the os_obj_lock briefly, once per 32 allocations.
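
The idea can be sketched in user space roughly as follows. This is not the prototype code: objset_sketch_t, NCPUS_SKETCH, objset_sketch_init() and alloc_object_sketch() are hypothetical names, a pthread mutex stands in for os_obj_lock, and the caller is assumed to pass a CPU index whose slot no other thread uses at the same time. It also skips checking whether a dnode is actually free and the call to dnode_hold_impl().

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define	DNODES_PER_BLOCK	32	/* objects per dnode block, per this issue */
#define	NCPUS_SKETCH		4	/* hypothetical CPU count */

typedef struct objset_sketch {
	pthread_mutex_t	os_obj_lock;	/* protects os_obj_next only */
	uint64_t	os_obj_next;	/* first object of the next unclaimed block */
	uint64_t	cpu_next[NCPUS_SKETCH];	/* per-CPU "next object" */
} objset_sketch_t;

static void
objset_sketch_init(objset_sketch_t *os)
{
	pthread_mutex_init(&os->os_obj_lock, NULL);
	/* Start at the second block so that 0 can mean "no block claimed yet". */
	os->os_obj_next = DNODES_PER_BLOCK;
	for (int i = 0; i < NCPUS_SKETCH; i++)
		os->cpu_next[i] = 0;
}

static uint64_t
alloc_object_sketch(objset_sketch_t *os, int cpu)
{
	uint64_t *next;

	assert(cpu >= 0 && cpu < NCPUS_SKETCH);
	next = &os->cpu_next[cpu];

	/*
	 * A block-aligned value means this CPU has used up its block (or
	 * has never claimed one), so take a fresh block from the shared
	 * counter. This is the only place the lock is held.
	 */
	if (*next % DNODES_PER_BLOCK == 0) {
		pthread_mutex_lock(&os->os_obj_lock);
		*next = os->os_obj_next;
		os->os_obj_next += DNODES_PER_BLOCK;
		pthread_mutex_unlock(&os->os_obj_lock);
	}

	/* Fast path: hand out the next object in this CPU's private block. */
	return ((*next)++);
}
```

In this sketch two CPUs allocating at the same time get objects from different blocks (for example 32, 33, ... on one and 64, 65, ... on the other), and each takes the lock once per 32 allocations.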

A prototype of the above showed that a ~20% performance improvement on the benchmark is possible.

Once other bugs are fixed (to the point that we see the large lock contention on os_obj_lock), the reward for fixing this issue is medium-high, and the cost is low-medium. The code changes are localized to dmu_object_alloc(). Because this will change which object IDs are allocated, there is potential for object numbers to become more scattered, hurting locality when reading them in. We will need to spend some time evaluating this impact.

ahrens · 2016-12-08T18:42:11Z

Code for the prototype mentioned above is available in this commit: ahrens@240b227
Note that this is not production-ready!

dmu_object_alloc() is single-threaded, so when multiple threads are creating files in a single filesystem, they spend a lot of time waiting for the os_obj_lock. To improve performance of multi-threaded file creation, we must make dmu_object_alloc() typically not grab any filesystem-wide locks.

The solution is to have a "next object to allocate" for each CPU. Each of these "next object"s is in a different block of the dnode object, so that concurrent allocation holds dnodes in different dbufs. When a thread's "next object" reaches the end of a chunk of objects (by default 4 blocks worth -- 128 dnodes), it will be reset to the per-objset os_obj_next, which will be increased by a chunk of objects (128). Only when manipulating the os_obj_next will we need to grab the os_obj_lock. This decreases lock contention dramatically, because each thread only needs to grab the os_obj_lock briefly, once per 128 allocations.

This results in a 70% performance improvement to multi-threaded object creation (where each thread is creating objects in its own directory), from 67,000/sec to 115,000/sec, with 8 CPUs.

Work sponsored by Intel Corp.

Reviewed-by: Ned Bass <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Matthew Ahrens <[email protected]>
Closes openzfs#4703
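
The "once per 128 allocations" figure is simple arithmetic, sketched below. The macro names and lock_acquisitions() are hypothetical; the values (32 dnodes per block, 4 blocks per chunk) are the ones stated in the commit message, and the count is for a single CPU's stream of allocations.

```c
#include <assert.h>
#include <stdint.h>

#define	DNODES_PER_BLOCK	32	/* per the issue text */
#define	BLOCKS_PER_CHUNK	4	/* default chunk size per the commit message */
#define	DNODES_PER_CHUNK	(DNODES_PER_BLOCK * BLOCKS_PER_CHUNK)

/*
 * How many times one CPU takes os_obj_lock while performing `nallocs`
 * allocations, if it claims `chunk` objects per lock acquisition.
 */
static uint64_t
lock_acquisitions(uint64_t nallocs, uint64_t chunk)
{
	assert(chunk > 0);
	return ((nallocs + chunk - 1) / chunk);	/* round up */
}
```

For 2,000,000 allocations, lock_acquisitions() gives 2,000,000 when the lock is taken on every allocation (chunk of 1), 62,500 with one block per claim, and 15,625 with the default 128-object chunk.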

behlendorf added the "Type: Performance" (performance improvement or performance problem) label May 27, 2016

This was referenced Jun 26, 2016

lock contention on dn_struct_rwlock during concurrent file creation #4804

Closed

lock contention on z_acl_lock during concurrent file creation #4805

Closed

lock contention on sa_lock during concurrent file creation #4806

Closed

behlendorf mentioned this issue Jan 18, 2017

cache the state for the last block used in object allocation: dbuf an… #5611

Closed

11 tasks

ahrens mentioned this issue May 10, 2017

OpenZFS 8199 - multi-threaded dmu_object_alloc() #6117

Closed

12 tasks

behlendorf closed this as completed in dbeb879 Jun 9, 2017
