
ThreadPool::install and allocation overhead #666

Open
ralfbiedert opened this issue Jun 28, 2019 · 9 comments

Comments

@ralfbiedert
Contributor

ralfbiedert commented Jun 28, 2019

Background

We are using rayon within a library that is called from game engines in the main loop, at about 90 FPS, depending on the device. From the game engine we receive a number of objects which we transform with some math-heavy calculations.

We have used rayon for some time now, and from a performance / usability perspective it hits a sweet spot where we can simplify our code a lot and see a good speed-up over a single-threaded approach.

However, as we are getting into more demanding environments we need to control allocations, and want to be more or less allocation-free after initialization.

Problem

I started analyzing our allocation pattern and noticed that ThreadPool::install does about 2 allocations per invocation, at about 10 - 20 bytes each (after the 1st round):

=== Allocation Statistics ===
"()" -- Events: 0, Bytes: 0
"Box::new(1)" -- Events: 1, Bytes: 4
"pool = ThreadPoolBuilder::new().num_threads(4).build()" -- Events: 65, Bytes: 4942
"pool.install(|| {}) -- 1st time" -- Events: 23, Bytes: 5608
"pool.install(|| {}) -- 2nd time" -- Events: 2, Bytes: 24
"pool.install(|| {}) -- next 1000 times" -- Events: 2032, Bytes: 48064
"drop(pool)" -- Events: 0, Bytes: 0

(Source) - rayon 1.1.0
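Numbers like these can be gathered with a small counting allocator wrapped around the system allocator. The sketch below is only illustrative and not the exact harness linked above; the names CountingAlloc / measure are made up here:

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Wraps the system allocator and counts every allocation it forwards.
struct CountingAlloc;

static EVENTS: AtomicUsize = AtomicUsize::new(0);
static BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        EVENTS.fetch_add(1, Ordering::Relaxed);
        BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

// Runs `f` and prints how many allocation events / bytes it triggered.
fn measure<F: FnOnce()>(label: &str, f: F) {
    let (e, b) = (EVENTS.load(Ordering::Relaxed), BYTES.load(Ordering::Relaxed));
    f();
    println!(
        "{:?} -- Events: {}, Bytes: {}",
        label,
        EVENTS.load(Ordering::Relaxed) - e,
        BYTES.load(Ordering::Relaxed) - b
    );
}

Each line in the statistics above then corresponds to a call like measure("pool.install(|| {}) -- 2nd time", || { pool.install(|| {}); }).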

Ideally, ThreadPool::install should eventually be allocation-free, and / or there should be an alternative API that allows for allocation-free processing. In the other thread #614 a heapless(?) StackJob was mentioned, but I couldn't find anything in the docs.

@cuviper
Member

cuviper commented Jun 28, 2019

StackJob isn't in the docs because it's an internal detail, but anyway this is already what install uses. So we're avoiding direct allocation, but there still may be something in the SegQueue which we use to send the job into the thread pool.

cc @stjepang -- is it expected that SegQueue would allocate every time? Is there a better alternative for this use case? e.g. Maybe the crossbeam_deque::Injector when we update to 0.7...
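For reference, crossbeam_deque::Injector (added in crossbeam-deque 0.7) exposes roughly this push/steal shape; this is only a sketch of the alternative being suggested, not rayon code:

use crossbeam_deque::{Injector, Steal};

fn main() {
    // Global injection queue shared by external callers and the workers.
    let injector: Injector<&str> = Injector::new();

    // An external thread pushes a job...
    injector.push("job");

    // ...and a worker loops on steal() until it obtains it or the queue
    // is confirmed empty.
    loop {
        match injector.steal() {
            Steal::Success(job) => {
                println!("running {}", job);
                break;
            }
            Steal::Retry => continue,
            Steal::Empty => break,
        }
    }
}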

@ralfbiedert
Contributor Author

I just did some digging, this is what I found out so far:

  • ThreadPool::install invokes Registry::in_worker
  • There the check worker_thread.is_null() is always true, so it invokes self.in_worker_cold(op).
  • That in turn does a LockLatch::new(), which causes the following allocations:
"Mutex::new" -- Events: 1, Bytes: 16
"Condvar::new" -- Events: 1, Bytes: 8
  • The worker_thread.is_null() check is always true because WorkerThread::current() always returns 0x0.

  • Which is a bit puzzling, because I can observe that WorkerThread::set_current(worker_thread); is invoked from main_loop (multiple times), and

println!("main_loop() - before set {:?}", WorkerThread::current());
WorkerThread::set_current(worker_thread);
println!("main_loop() - after set {:?}", WorkerThread::current());
  • each prints something like:
main_loop() - before set 0x0
main_loop() - after set 0x635feff240
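Coming back to the LockLatch::new() above: as far as I can tell it is essentially a boolean flag behind a Mutex plus a Condvar to wake the blocked caller, which appears to be where the two allocations come from. A simplified sketch of that structure, not rayon's exact code:

use std::sync::{Condvar, Mutex};

// Simplified sketch of a lock-based latch: a flag guarded by a mutex,
// plus a condvar to wake the blocked caller once the flag is set.
pub struct LockLatch {
    m: Mutex<bool>,
    v: Condvar,
}

impl LockLatch {
    pub fn new() -> LockLatch {
        // These two constructions are presumably the "Mutex::new" /
        // "Condvar::new" allocation events observed above.
        LockLatch { m: Mutex::new(false), v: Condvar::new() }
    }

    // Worker side: called once the injected job has finished.
    pub fn set(&self) {
        *self.m.lock().unwrap() = true;
        self.v.notify_all();
    }

    // Caller side: blocks until set() runs on some worker.
    pub fn wait(&self) {
        let mut guard = self.m.lock().unwrap();
        while !*guard {
            guard = self.v.wait(guard).unwrap();
        }
    }
}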

@ralfbiedert
Contributor Author

Btw, I used this application

use rayon::ThreadPoolBuilder;
use std::thread::sleep;
use std::time::Duration;

let pool = ThreadPoolBuilder::new().num_threads(4).build().unwrap();

pool.install(|| {});
pool.install(|| {});

println!("Waiting 1 second");
sleep(Duration::new(1, 0));

for _ in 0..10 {
    pool.install(|| {});
}

which printed

0x0
0x0
Waiting 1 second
0x0
0x0
0x0
0x0
0x0
0x0
0x0
0x0
0x0
0x0

for WorkerThread::current() in Registry::in_worker() when debugging the above.

@cuviper
Member

cuviper commented Jun 30, 2019

Oh, indeed -- install from outside the pool does block on a LockLatch, so we'll create a new Mutex and CondVar for that. The worker_thread is a thread-local, which is why it's seeing is_null() when called from outside the pool. Hmm, maybe we could create LockLatch using thread-locals too...
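A minimal sketch of that thread-local idea, building on the LockLatch sketch above (the names INSTALL_LATCH and block_until_job_done are hypothetical; the reset question this raises is what the following comments work through):

thread_local! {
    // One latch per external caller thread, constructed once and then
    // reused for every blocking install made from that thread.
    static INSTALL_LATCH: LockLatch = LockLatch::new();
}

// Hypothetical cold path: hand the thread-local latch to the injected
// job instead of building a fresh Mutex/Condvar pair per call.
fn block_until_job_done(/* registry, job, ... */) {
    INSTALL_LATCH.with(|latch| {
        // registry.inject(job_ref_that_sets(latch));
        latch.wait();
        // For reuse, the latch has to be re-armed here before the next
        // install from this thread.
    });
}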

@ralfbiedert
Contributor Author

ralfbiedert commented Jun 30, 2019

Some more digging:

First Allocation Source (LockLatch)

  • Putting the LockLatch on TLS technically works, but Condvar doesn't reset, so the variable can't be reused (I think I got something wrong here, will check tomorrow).
  • I quickly hacked a POC with an AtomicBool and a hot loop, which seems to work; that got rid of the allocation related to LockLatch (roughly sketched below).
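Roughly what such a POC looks like (FlagLatch is a hypothetical name, and this is not what eventually landed in rayon): an AtomicBool that the worker sets and the caller spins on, resetting it on the way out so the same instance can be reused without allocating.

use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical reusable flag latch: no Mutex/Condvar, hence no
// allocation per install; the price is a busy wait on the caller side.
pub struct FlagLatch {
    done: AtomicBool,
}

impl FlagLatch {
    pub const fn new() -> FlagLatch {
        FlagLatch { done: AtomicBool::new(false) }
    }

    // Worker side: mark the injected job as finished.
    pub fn set(&self) {
        self.done.store(true, Ordering::Release);
    }

    // Caller side: hot loop until set, then re-arm for the next install.
    pub fn wait_and_reset(&self) {
        while !self.done.swap(false, Ordering::Acquire) {
            std::hint::spin_loop();
        }
    }
}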

However, there is still another allocation source:

Second Allocation Source (SegQueue::push)

  • Every once in a while (apparently exactly every 32 invocations):
"self.in_worker_cold(op)" -- Events: 1, Bytes: 752
"self.in_worker_cold(op)" -- Events: 0, Bytes: 0
...
... 32 invocations later ...
... 
"self.in_worker_cold(op)" -- Events: 0, Bytes: 0
"self.in_worker_cold(op)" -- Events: 1, Bytes: 752
  • Seems to come from Registry::inject ... self.injected_jobs.push(job_ref), which is SegQueue::push (presumably a fresh storage block being allocated once the previous one fills up)

@cuviper
Member

cuviper commented Jul 1, 2019

  • Putting the LockLatch on TLS technically works, but Condvar doesn't reset, so the variable can't be reused (I think I got something wrong here, will check tomorrow).

The LockLatch as written today sets its Mutex<bool> to true forever, but that's something we can change if the mutex is to be reused. Condvar itself certainly allows reuse.

@ralfbiedert
Contributor Author

The LockLatch as written today sets its Mutex<bool> to true forever, but that's something we can change if the mutex is to be reused. Condvar itself certainly allows reuse.

Yes, got it to work now with the original LockLatch + reset().

Do you plan to work on this yourself, or do you want me to clean up and submit a PR for the LockLatch allocation?

@cuviper
Member

cuviper commented Jul 1, 2019

I was tinkering a little, but if you want to submit a proper PR, that would be welcome. Note that I think we don't want to affect LockLatch as used in ThreadInfo (primed and stopped), only the one used by install -- that's when we know we'll immediately do a blocking wait in the same thread, so TLS reuse isn't an issue. Maybe this should be a separate latch type, where its wait automatically resets on success, while the mutex is still held.
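A sketch of that suggestion, extending the LockLatch sketch from earlier in the thread (the method name wait_and_reset is hypothetical):

impl LockLatch {
    // Blocks until set() has run, then re-arms the latch in the same
    // critical section, so the caller needs no separate reset() and a
    // thread-local instance can be reused safely.
    pub fn wait_and_reset(&self) {
        let mut guard = self.m.lock().unwrap();
        while !*guard {
            guard = self.v.wait(guard).unwrap();
        }
        // Flip the flag back while the mutex is still held.
        *guard = false;
    }
}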

@ralfbiedert
Contributor Author

Ok, I now managed to remove the remaining allocation source. I first looked into crossbeam's SegQueue, but I have to admit the allocation / Block shift logic wasn't easy to follow.

I then realized that I could just combine an ArrayQueue, which gives predictable allocation up to a certain point, with a SegQueue for anything that goes beyond that. Due to its simplicity it loses the FIFO property past a certain point (if you insert more than cap elements, ordering might get mixed up at the transition zone if multiple threads push / pop at the same time).

However, if I understand it correctly, the injected_jobs inside the Registry don't really mind if that happens, because a job might get injected "randomly" anyway, e.g., from another task.
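Roughly the shape of that combination, sketched here against the current crossbeam-queue API rather than the crossbeam version rayon used at the time; HybridQueue is a hypothetical name and this is not the exact code in #670:

use crossbeam_queue::{ArrayQueue, SegQueue};

// Sketch of the described hybrid: a fixed-capacity ring for the common
// case (no allocation after construction) that spills into a SegQueue
// when full. Ordering across the two queues is only approximately FIFO,
// which is acceptable for injected jobs as argued above.
pub struct HybridQueue<T> {
    bounded: ArrayQueue<T>,
    overflow: SegQueue<T>,
}

impl<T> HybridQueue<T> {
    pub fn with_capacity(cap: usize) -> Self {
        HybridQueue {
            bounded: ArrayQueue::new(cap),
            overflow: SegQueue::new(),
        }
    }

    pub fn push(&self, value: T) {
        // Try the pre-allocated ring first; fall back to the unbounded queue.
        if let Err(value) = self.bounded.push(value) {
            self.overflow.push(value);
        }
    }

    pub fn pop(&self) -> Option<T> {
        self.bounded.pop().or_else(|| self.overflow.pop())
    }
}

Pops prefer the pre-allocated ring, so as long as the number of simultaneously injected jobs stays below cap, no allocation happens after construction.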

In any case, I created a PR #670 for you to review and discuss.

If I combine #670 and #668 I'm not seeing any allocation anymore in my test case.

bors bot added a commit that referenced this issue Jul 22, 2019
668: Reusing LockLatch to reduce allocation  r=cuviper a=ralfbiedert

This addresses the first half of #666.

I kept the `LockLatch` and added a `reset()` method because that had overall fewer lines of code. If you want I can still change it to a separate `ReusableLockLatch` type.


Co-authored-by: Ralf Biedert <[email protected]>
Co-authored-by: Josh Stone <[email protected]>