
Improve concurrent optimization #70

Merged

Conversation


@minnerbe minnerbe commented Feb 10, 2024

I improved the implementation of the concurrent optimization loop in TileUtil. Instead of scheduling being handled by one thread and the actual work being done by nThreads worker threads (resulting in nThreads + 1 threads being used), scheduling is now offloaded to the nThreads worker threads (resulting in only nThreads threads being used) by means of concurrent collections.

Runtime

I ran a few small experiments: an NxN layout of 2D tiles with M exact matches to their spatial neighbors. Each tile was perturbed by a random shift (with a fixed seed), and a 2D translation was fitted. The optimization reduced the error to 1e-4 without a plateau. For three different (NxN, M) layouts, I compare the proposed algorithm (PR) and the original concurrent optimization (orig) for 1 to 32 threads on a machine with 40 physical cores, as well as the single-threaded performance (second row).
Since the number of iterations varies because of random shuffling, I report the runtime per iteration for better comparison. Also, keep in mind that the original code uses one more thread (the scheduling thread) than listed here.

       (100x100, 10)  (50x50, 100)  (10x10, 1000)
single:     52ms          142ms          113ms
          PR  orig      PR   orig      PR   orig
01:      55ms 66ms    148ms 152ms    110ms 117ms
02:      34ms 48ms     73ms 112ms     56ms  92ms
04:      21ms 40ms     52ms  95ms     35ms  49ms
08:      18ms 39ms     33ms  86ms     18ms  34ms
16:      13ms 40ms     25ms  87ms     13ms  34ms
32:      16ms 51ms     24ms  90ms     13ms  35ms

Details

The new algorithm is lock-free. It consists of four steps for every tile:

  1. Signal beginning of work by accessing shared collections.
  2. Neighborhood check.
  3. Process (i.e., fit and apply) the tile based on the outcome of the neighborhood check.
  4. Signal end of work by accessing shared collections.

Concurrent access to the collection of pending tiles (ConcurrentLinkedDeque) and to the set of currently executing tiles (ConcurrentHashMap.newKeySet()) is handled by the collections themselves. Checking whether a neighboring tile is currently being processed is handled by the individual threads, as is the subsequent action (processing or deferring the tile). In contrast to the prior implementation, two neighboring tiles can be checked at the same time (but they cannot be processed at the same time; see below).
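To make the four steps concrete, here is a minimal sketch of the per-thread loop. It is not the actual TileUtil code: the Tile interface, neighbors(), and processTile() are simplified stand-ins, and the termination/clean-up handling described under Correctness is omitted.

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

// Simplified sketch of the worker loop; termination and clean-up handling omitted.
class WorkerLoopSketch {

	interface Tile {
		Collection<Tile> neighbors();
	}

	final ConcurrentLinkedDeque<Tile> pending;                  // tiles still to be processed
	final Set<Tile> executing = ConcurrentHashMap.newKeySet();  // tiles currently in flight

	WorkerLoopSketch(final ConcurrentLinkedDeque<Tile> pending) {
		this.pending = pending;
	}

	void workerLoop() {
		Tile tile;
		while ((tile = pending.poll()) != null) {
			executing.add(tile);                                // 1. signal beginning of work
			final boolean neighborBusy =                        // 2. neighborhood check
					tile.neighbors().stream().anyMatch(executing::contains);
			if (neighborBusy) {
				executing.remove(tile);                         // remove before re-queueing...
				pending.offerLast(tile);                        // ...then defer the tile
			} else {
				processTile(tile);                              // 3. fit and apply
				executing.remove(tile);                         // 4. signal end of work
			}
		}
	}

	void processTile(final Tile tile) {
		// fit the tile's model to its matches and apply it (placeholder)
	}
}
```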

Apart from re-organizing the main fit-apply loop, I also added parallelized versions of TileCollection::updateErrors and TileCollection::apply. For the cases I've tested, this contributed significantly to reducing the overall runtime.
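For illustration, such an a-priori partition of the tile list could look roughly like the sketch below; the method name applyInParallel and the Consumer parameter are placeholders, not the actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Sketch of an a-priori partition across nThreads workers; applyToTile stands in
// for the per-tile work done by TileCollection::apply or TileCollection::updateErrors.
class PartitionSketch {

	static <T> void applyInParallel(final List<T> tiles, final int nThreads, final Consumer<T> applyToTile)
			throws InterruptedException, ExecutionException {
		final ExecutorService exec = Executors.newFixedThreadPool(nThreads);
		try {
			final int chunk = (tiles.size() + nThreads - 1) / nThreads;   // ceil(size / nThreads)
			final List<Future<?>> futures = new ArrayList<>();
			for (int start = 0; start < tiles.size(); start += chunk) {
				final List<T> part = tiles.subList(start, Math.min(start + chunk, tiles.size()));
				futures.add(exec.submit(() -> part.forEach(applyToTile)));
			}
			for (final Future<?> f : futures)
				f.get();                                                  // wait and propagate exceptions
		} finally {
			exec.shutdown();
		}
	}
}
```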

Also, I added a boolean flag to indicate whether logging is desired (the default being false). The logging output now prints TileCollection::getError instead of ErrorStatistics::mean, which I believe is the correct value here. This fixes #68.

Finally, I fixed some IDE warnings and deleted some commented-out code that seems to have been used for parallelization at some point. As a disclaimer, I did not adhere to the project's prevailing coding style, since my IDE actively deletes spaces when moving things around.

Correctness

The JMM gives a few guarantees about which actions happen-before others (see the small example after this list):

  • Each action in a thread happens-before every action in that thread that comes later in the program's order.
  • Actions in a thread prior to placing an object into any concurrent collection happen-before actions subsequent to the access or removal of that element from the collection in another thread.
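As a small illustration of the second guarantee (the Tile class and its cost field are hypothetical):

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// The write to tile.cost in thread A happens-before the read in thread B,
// because the tile is published through a concurrent collection.
class HappensBeforeSketch {

	static class Tile {
		double cost;                              // plain, non-volatile field
	}

	final ConcurrentLinkedDeque<Tile> deque = new ConcurrentLinkedDeque<>();

	void threadA(final Tile tile) {
		tile.cost = 42.0;                         // ordinary write before publishing...
		deque.offerLast(tile);                    // ...happens-before the removal below
	}

	void threadB() {
		final Tile tile = deque.poll();
		if (tile != null)
			assert tile.cost == 42.0;             // guaranteed to observe thread A's write
	}
}
```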

I cannot guarantee that this method is correct (i.e., always terminates and computes the right thing). However, I'm pretty confident for the following reasons, based on the above principles:

  • Each tile is handled by exactly one thread at a time. (Guaranteed by the concurrent collections.)
  • Each tile is only inserted into the executing set if it is not already present. (A tile is removed from the executing set before it is placed back in the deque.)
  • A tile cannot be processed while one of its neighbors is being processed. (The neighborhood check happens after insertion into the executing set, and "being in the neighborhood of" is symmetric. This does mean that two neighboring tiles can be checked at the same time, in which case neither of them is processed.)
  • The algorithm processes every tile. (A tile is placed back in the deque before the same thread polls the deque again. Therefore, at least one thread does not see an empty pending deque as long as there is work to do.)
  • The algorithm doesn't hang if all pending tiles neighbor each other. (In this case, there is a high chance that there is always one tile in the executing set that makes the neighborhood check fail, even if that tile is never processed. A clean-up thread prevents this: all threads except this one terminate after a certain number of failed attempts, after which the clean-up thread can complete all pending tiles; see the sketch below.)
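A rough sketch of that fallback; the worker/clean-up split and the retry counter are simplified, and tryProcess stands for steps 1-4 from the worker-loop sketch above.

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Simplified fallback: regular workers stop after maxAttempts consecutive failures
// (empty deque or deferred tile), while the single clean-up worker keeps polling
// until no pending tiles remain, so deferred tiles are eventually completed.
class CleanUpSketch<T> {

	final ConcurrentLinkedDeque<T> pending = new ConcurrentLinkedDeque<>();

	/** Attempts steps 1-4 for one tile; returns false if the tile had to be deferred. */
	boolean tryProcess(final T tile) {
		return true;                               // placeholder, see worker-loop sketch above
	}

	void run(final boolean isCleanUpWorker, final int maxAttempts) {
		int failures = 0;
		while (isCleanUpWorker ? !pending.isEmpty() : failures < maxAttempts) {
			final T tile = pending.poll();
			if (tile != null && tryProcess(tile))
				failures = 0;                      // made progress, reset the counter
			else
				failures++;                        // nothing polled, or tile was deferred
		}
	}
}
```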

Further steps

While we're at it, should we:

  • add a method for single-threaded optimization to TileUtil to unify the API?
  • deprecate the TileCollection::<optimize|concurrently|silently> methods?
  • use a specialized single-threaded implementation for nThreads == 1?
  • use a ConcurrentLinkedQueue-based parallelization for TileCollection::updateErrors and TileCollection::apply instead of an a-priori partition of the list of tiles? This potentially allows for better load balancing, but could incur more overhead (see the sketch after this list).
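For reference, the queue-based variant could look like the following sketch; forEachParallel and updateTile are placeholder names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Sketch of queue-based load balancing: instead of splitting the tile list up front,
// all workers poll from one shared queue, so faster threads naturally take more tiles.
class QueueBalancingSketch {

	static <T> void forEachParallel(final Collection<T> tiles, final int nThreads, final Consumer<T> updateTile)
			throws InterruptedException {
		final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>(tiles);
		final List<Thread> workers = new ArrayList<>();
		for (int i = 0; i < nThreads; ++i) {
			final Thread worker = new Thread(() -> {
				T tile;
				while ((tile = queue.poll()) != null)  // each poll is one unit of work
					updateTile.accept(tile);
			});
			workers.add(worker);
			worker.start();
		}
		for (final Thread worker : workers)
			worker.join();                             // wait until all tiles are processed
	}
}
```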

I'm happy for any feedback regarding correctness and efficiency in more practical situations, especially from @axtimwalde, @tpietzsch, @StephanPreibisch, and @bogovicj.

@minnerbe

A follow-up based on an offline discussion with @axtimwalde, who expressed interest in more validation in the form of real experiments.

I pointed some of our EM-alignment code at this PR and re-aligned a few stacks. One stack consists of 60k tiles with ~100 matches per tile pair. The models that are fitted are multi-layer interpolated models (i.e., an interpolation of affine, rigid, and translation).

The alignments were done with 8, 16, and 32 cores and compared against the original algorithm with 32 cores. For all runs, the alignment errors were basically identical, but runtime for the algorithm proposed in this PR was roughly 3, 5, and 7 times lower, respectively. In particular, the algorithm seems to consistently terminate and produce the same results as before the change.

Is this experiment sufficient @axtimwalde, or do you want to see further results?

@axtimwalde axtimwalde merged commit a53245d into axtimwalde:master Feb 13, 2024
@axtimwalde

Thanks!

@acardona

Only to add: THANKS!

On a 256-core computer, the optimizer went from spending over 8 hours to a mere 20 minutes, for a ~12,000-section series registered 6-adjacent with hundreds to a thousand SIFT features per pairwise section match. This is AMAZING.
