
ThreadPool::install and allocation overhead #666

Open
ralfbiedert opened this issue Jun 28, 2019 · 9 comments

Comments

@ralfbiedert
Contributor

ralfbiedert commented Jun 28, 2019

Background

We are using rayon within a library that is called from game engines in the main loop, at about 90 FPS, depending on the device. From the game engine we receive a number of objects which we transform with some math-heavy calculations.

We have used rayon for some time now, and from a performance / usability perspective it hits a sweet spot where we can simplify our code a lot and see a good speed-up over a single-threaded approach.

However, as we are getting into more demanding environments we need to control allocations, and want to be more or less allocation-free after initialization.

Problem

I started analyzing our allocation pattern and noticed that ThreadPool::install does about 2 allocations per invocation, at about 10 - 20 bytes each (after the 1st round):

=== Allocation Statistics ===
"()" -- Events: 0, Bytes: 0
"Box::new(1)" -- Events: 1, Bytes: 4
"pool = ThreadPoolBuilder::new().num_threads(4).build()" -- Events: 65, Bytes: 4942
"pool.install(|| {}) -- 1st time" -- Events: 23, Bytes: 5608
"pool.install(|| {}) -- 2nd time" -- Events: 2, Bytes: 24
"pool.install(|| {}) -- next 1000 times" -- Events: 2032, Bytes: 48064
"drop(pool)" -- Events: 0, Bytes: 0

(Source) - rayon 1.1.0
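Numbers like these can be gathered with a small counting allocator wrapped around the system allocator. The sketch below is only illustrative and not the exact harness linked above; the names CountingAlloc / measure are made up here:

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Wraps the system allocator and counts every allocation it forwards.
struct CountingAlloc;

static EVENTS: AtomicUsize = AtomicUsize::new(0);
static BYTES: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        EVENTS.fetch_add(1, Ordering::Relaxed);
        BYTES.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

// Runs `f` and prints how many allocation events / bytes it triggered.
fn measure<F: FnOnce()>(label: &str, f: F) {
    let (e, b) = (EVENTS.load(Ordering::Relaxed), BYTES.load(Ordering::Relaxed));
    f();
    println!(
        "{:?} -- Events: {}, Bytes: {}",
        label,
        EVENTS.load(Ordering::Relaxed) - e,
        BYTES.load(Ordering::Relaxed) - b
    );
}

Each line in the statistics above then corresponds to a call like measure("pool.install(|| {}) -- 2nd time", || { pool.install(|| {}); }).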

Ideally, ThreadPool::install should eventually be allocation-free, and / or there should be an alternative API that allows for allocation-free processing. In the other thread #614 a heapless(?) StackJob was mentioned, but I couldn't find anything in the docs.

@cuviper
Member

cuviper commented Jun 28, 2019

StackJob isn't in the docs because it's an internal detail, but anyway this is already what install uses. So we're avoiding direct allocation, but there still may be something in the SegQueue which we use to send the job into the thread pool.

cc @stjepang -- is it expected that SegQueue would allocate every time? Is there a better alternative for this use case? e.g. Maybe the crossbeam_deque::Injector when we update to 0.7...
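For reference, crossbeam_deque::Injector (added in crossbeam-deque 0.7) exposes roughly this push/steal shape; this is only a sketch of the alternative being suggested, not rayon code:

use crossbeam_deque::{Injector, Steal};

fn main() {
    // Global injection queue shared by external callers and the workers.
    let injector: Injector<&str> = Injector::new();

    // An external thread pushes a job...
    injector.push("job");

    // ...and a worker loops on steal() until it obtains it or the queue
    // is confirmed empty.
    loop {
        match injector.steal() {
            Steal::Success(job) => {
                println!("running {}", job);
                break;
            }
            Steal::Retry => continue,
            Steal::Empty => break,
        }
    }
}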

@ralfbiedert
Contributor Author

I just did some digging, this is what I found out so far:

  • ThreadPool::install invokes Registry::in_worker
  • There the check worker_thread.is_null() is always true, so it invokes self.in_worker_cold(op).
  • That in turn does a LockLatch::new(), which causes the following allocations:
"Mutex::new" -- Events: 1, Bytes: 16
"Condvar::new" -- Events: 1, Bytes: 8
  • The worker_thread.is_null() check is always true because WorkerThread::current() always returns 0x0.

  • Which is a bit puzzling, because I can observe that WorkerThread::set_current(worker_thread); is invoked from main_loop (multiple times), and

println!("main_loop() - before set {:?}", WorkerThread::current());
WorkerThread::set_current(worker_thread);
println!("main_loop() - after set {:?}", WorkerThread::current());
  • each prints something like:
main_loop() - before set 0x0
main_loop() - after set 0x635feff240
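Coming back to the LockLatch::new() above: as far as I can tell it is essentially a boolean flag behind a Mutex plus a Condvar to wake the blocked caller, which appears to be where the two allocations come from. A simplified sketch of that structure, not rayon's exact code:

use std::sync::{Condvar, Mutex};

// Simplified sketch of a lock-based latch: a flag guarded by a mutex,
// plus a condvar to wake the blocked caller once the flag is set.
pub struct LockLatch {
    m: Mutex<bool>,
    v: Condvar,
}

impl LockLatch {
    pub fn new() -> LockLatch {
        // These two constructions are presumably the "Mutex::new" /
        // "Condvar::new" allocation events observed above.
        LockLatch { m: Mutex::new(false), v: Condvar::new() }
    }

    // Worker side: called once the injected job has finished.
    pub fn set(&self) {
        *self.m.lock().unwrap() = true;
        self.v.notify_all();
    }

    // Caller side: blocks until set() runs on some worker.
    pub fn wait(&self) {
        let mut guard = self.m.lock().unwrap();
        while !*guard {
            guard = self.v.wait(guard).unwrap();
        }
    }
}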

@ralfbiedert
Contributor Author

Btw, I used this application

use rayon::ThreadPoolBuilder;
use std::thread::sleep;
use std::time::Duration;

let pool = ThreadPoolBuilder::new().num_threads(4).build().unwrap();

pool.install(|| {});
pool.install(|| {});

println!("Waiting 1 second");
sleep(Duration::new(1, 0));

for _ in 0..10 {
    pool.install(|| {});
}

which printed

0x0
0x0
Waiting 1 second
0x0
0x0
0x0
0x0
0x0
0x0
0x0
0x0
0x0
0x0

for WorkerThread::current() in Registry::in_worker() when debugging the above.

@cuviper
Member

cuviper commented Jun 30, 2019

Oh, indeed -- install from outside the pool does block on a LockLatch, so we'll create a new Mutex and CondVar for that. The worker_thread is a thread-local, which is why it's seeing is_null() when called from outside the pool. Hmm, maybe we could create LockLatch using thread-locals too...
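A minimal sketch of that thread-local idea, building on the LockLatch sketch above (the names INSTALL_LATCH and block_until_job_done are hypothetical; the reset question this raises is what the following comments work through):

thread_local! {
    // One latch per external caller thread, constructed once and then
    // reused for every blocking install made from that thread.
    static INSTALL_LATCH: LockLatch = LockLatch::new();
}

// Hypothetical cold path: hand the thread-local latch to the injected
// job instead of building a fresh Mutex/Condvar pair per call.
fn block_until_job_done(/* registry, job, ... */) {
    INSTALL_LATCH.with(|latch| {
        // registry.inject(job_ref_that_sets(latch));
        latch.wait();
        // For reuse, the latch has to be re-armed here before the next
        // install from this thread.
    });
}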

@ralfbiedert
Contributor Author

ralfbiedert commented Jun 30, 2019

Some more digging:

First Allocation Source (LockLatch)

  • Putting the LockLatch on TLS technically works, but Condvar doesn't reset, so the variable can't be reused (I think I got something wrong here, will check tomorrow).
  • I quickly hacked a POC with an AtomicBool and a hot loop, which seems to work; that got rid of the allocation related to LockLatch (roughly sketched below).
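Roughly what such a POC looks like (FlagLatch is a hypothetical name, and this is not what eventually landed in rayon): an AtomicBool that the worker sets and the caller spins on, resetting it on the way out so the same instance can be reused without allocating.

use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical reusable flag latch: no Mutex/Condvar, hence no
// allocation per install; the price is a busy wait on the caller side.
pub struct FlagLatch {
    done: AtomicBool,
}

impl FlagLatch {
    pub const fn new() -> FlagLatch {
        FlagLatch { done: AtomicBool::new(false) }
    }

    // Worker side: mark the injected job as finished.
    pub fn set(&self) {
        self.done.store(true, Ordering::Release);
    }

    // Caller side: hot loop until set, then re-arm for the next install.
    pub fn wait_and_reset(&self) {
        while !self.done.swap(false, Ordering::Acquire) {
            std::hint::spin_loop();
        }
    }
}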

However, there is still another allocation source:

Second Allocation Source (SegQueue::push)

  • Every once in a while (apparently exactly every 32 invocations):
"self.in_worker_cold(op)" -- Events: 1, Bytes: 752
"self.in_worker_cold(op)" -- Events: 0, Bytes: 0
...
... 32 invocations later ...
... 
"self.in_worker_cold(op)" -- Events: 0, Bytes: 0
"self.in_worker_cold(op)" -- Events: 1, Bytes: 752
  • Seems to come from Registry::inject ... self.injected_jobs.push(job_ref), which is SegQueue::push (presumably a fresh storage block being allocated once the previous one fills up)

@cuviper
Member

cuviper commented Jul 1, 2019

  • Putting the LockLatch on TLS technically works, but Condvar doesn't reset, so the variable can't be reused (I think I got something wrong here, will check tomorrow).

The LockLatch as written today sets its Mutex<bool> to true forever, but that's something we can change if the mutex is to be reused. Condvar itself certainly allows reuse.

@ralfbiedert
Contributor Author

The LockLatch as written today sets its Mutex<bool> to true forever, but that's something we can change if the mutex is to be reused. Condvar itself certainly allows reuse.

Yes, got it to work now with the original LockLatch + reset().

Do you plan to work on this yourself, or do you want me to clean up and submit a PR for the LockLatch allocation?

@cuviper
Member

cuviper commented Jul 1, 2019

I was tinkering a little, but if you want to submit a proper PR, that would be welcome. Note that I think we don't want to affect LockLatch as used in ThreadInfo (primed and stopped), only the one used by install -- that's when we know we'll immediately do a blocking wait in the same thread, so TLS reuse isn't an issue. Maybe this should be a separate latch type, where its wait automatically resets on success, while the mutex is still held.
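A sketch of that suggestion, extending the LockLatch sketch from earlier in the thread (the method name wait_and_reset is hypothetical):

impl LockLatch {
    // Blocks until set() has run, then re-arms the latch in the same
    // critical section, so the caller needs no separate reset() and a
    // thread-local instance can be reused safely.
    pub fn wait_and_reset(&self) {
        let mut guard = self.m.lock().unwrap();
        while !*guard {
            guard = self.v.wait(guard).unwrap();
        }
        // Flip the flag back while the mutex is still held.
        *guard = false;
    }
}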

@ralfbiedert
Contributor Author

Ok, I now managed to remove the remaining allocation source. I first looked into crossbeam's SegQueue, but I have to admit the allocation / Block shift logic wasn't easy to follow.

I then realized that I could just combine an ArrayQueue, which gives predictable allocation up to a certain point, with a SegQueue for anything that goes beyond that. Due to its simplicity it loses the FIFO property past a certain point (if you insert more than cap elements, ordering might get mixed up at the transition zone if multiple threads push / pop at the same time).

However, if I understand it correctly, the injected_jobs inside the Registry don't really mind if that happens, because a job might get injected "randomly" anyway, e.g., from another task.
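Roughly the shape of that combination, sketched here against the current crossbeam-queue API rather than the crossbeam version rayon used at the time; HybridQueue is a hypothetical name and this is not the exact code in #670:

use crossbeam_queue::{ArrayQueue, SegQueue};

// Sketch of the described hybrid: a fixed-capacity ring for the common
// case (no allocation after construction) that spills into a SegQueue
// when full. Ordering across the two queues is only approximately FIFO,
// which is acceptable for injected jobs as argued above.
pub struct HybridQueue<T> {
    bounded: ArrayQueue<T>,
    overflow: SegQueue<T>,
}

impl<T> HybridQueue<T> {
    pub fn with_capacity(cap: usize) -> Self {
        HybridQueue {
            bounded: ArrayQueue::new(cap),
            overflow: SegQueue::new(),
        }
    }

    pub fn push(&self, value: T) {
        // Try the pre-allocated ring first; fall back to the unbounded queue.
        if let Err(value) = self.bounded.push(value) {
            self.overflow.push(value);
        }
    }

    pub fn pop(&self) -> Option<T> {
        self.bounded.pop().or_else(|| self.overflow.pop())
    }
}

Pops prefer the pre-allocated ring, so as long as the number of simultaneously injected jobs stays below cap, no allocation happens after construction.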

In any case, I created a PR #670 for you to review and discuss.

If I combine #670 and #668 I'm not seeing any allocation anymore in my test case.

bors bot added a commit that referenced this issue Jul 22, 2019
668: Reusing LockLatch to reduce allocation  r=cuviper a=ralfbiedert

This addresses the first half of #666.

I kept the `LockLatch` and added a `reset()` method because that had overall fewer lines of code. If you want I can still change it to a separate `ReusableLockLatch` type.


Co-authored-by: Ralf Biedert <[email protected]>
Co-authored-by: Josh Stone <[email protected]>