
Improve concurrent optimization #70

Merged

Conversation


@minnerbe minnerbe commented Feb 10, 2024

I improved the implementation of the concurrent optimization loop in TileUtil. Instead of scheduling being handled by one thread and the actual work being done by nThreads worker threads (resulting in nThreads + 1 threads being used), scheduling is now offloaded to the nThreads worker threads (resulting in only nThreads threads being used) by means of concurrent collections.

Runtime

I ran a few small experiments: an NxN layout of 2D tiles with M exact matches to their spatial neighbors. Each tile was perturbed by a random shift (with a fixed seed), and a 2D translation was fitted. The optimization reduced the error to 1e-4 without a plateau. For three different (NxN, M) layouts, I compare the proposed algorithm (PR) and the original concurrent optimization (orig) for 1 to 32 threads on a machine with 40 physical cores, as well as the single-threaded performance (second row).
Since the number of iterations varies because of random shuffling, I report the runtime per iteration for better comparison. Also, keep in mind that the original code uses one more thread (the scheduling thread) than listed here.

       (100x100, 10)  (50x50, 100)  (10x10, 1000)
single:     52ms          142ms          113ms
          PR  orig      PR   orig      PR   orig
01:      55ms 66ms    148ms 152ms    110ms 117ms
02:      34ms 48ms     73ms 112ms     56ms  92ms
04:      21ms 40ms     52ms  95ms     35ms  49ms
08:      18ms 39ms     33ms  86ms     18ms  34ms
16:      13ms 40ms     25ms  87ms     13ms  34ms
32:      16ms 51ms     24ms  90ms     13ms  35ms

Details

The new algorithm is lock-free. It consists of four steps for every tile:

  1. Signal beginning of work by accessing shared collections.
  2. Neighborhood check.
  3. Process (i.e., fit and apply) the tile based on the outcome of the neighborhood check.
  4. Signal end of work by accessing shared collections.

Concurrent access to the collection of pending tiles (ConcurrentLinkedDeque) and to the set of currently executing tiles (ConcurrentHashMap.newKeySet()) is handled by the collections themselves. Checking whether a neighboring tile is currently being processed is handled by the individual threads, as is the subsequent action (processing or deferring the tile). In contrast to the prior implementation, two neighboring tiles can be checked at the same time (but they cannot be processed at the same time; see below).
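To make the four steps concrete, here is a minimal sketch of the per-thread loop. It is not the actual TileUtil code: the Tile interface, neighbors(), and processTile() are simplified stand-ins, and the termination/clean-up handling described under Correctness is omitted.

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedDeque;

// Simplified sketch of the worker loop; termination and clean-up handling omitted.
class WorkerLoopSketch {

	interface Tile {
		Collection<Tile> neighbors();
	}

	final ConcurrentLinkedDeque<Tile> pending;                  // tiles still to be processed
	final Set<Tile> executing = ConcurrentHashMap.newKeySet();  // tiles currently in flight

	WorkerLoopSketch(final ConcurrentLinkedDeque<Tile> pending) {
		this.pending = pending;
	}

	void workerLoop() {
		Tile tile;
		while ((tile = pending.poll()) != null) {
			executing.add(tile);                                // 1. signal beginning of work
			final boolean neighborBusy =                        // 2. neighborhood check
					tile.neighbors().stream().anyMatch(executing::contains);
			if (neighborBusy) {
				executing.remove(tile);                         // remove before re-queueing...
				pending.offerLast(tile);                        // ...then defer the tile
			} else {
				processTile(tile);                              // 3. fit and apply
				executing.remove(tile);                         // 4. signal end of work
			}
		}
	}

	void processTile(final Tile tile) {
		// fit the tile's model to its matches and apply it (placeholder)
	}
}
```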

Apart from re-organizing the main fit-apply loop, I also added parallelized versions of TileCollection::updateErrors and TileCollection::apply. For the cases I've tested, this contributed significantly to reducing the overall runtime.
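For illustration, such an a-priori partition of the tile list could look roughly like the sketch below; the method name applyInParallel and the Consumer parameter are placeholders, not the actual API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Sketch of an a-priori partition across nThreads workers; applyToTile stands in
// for the per-tile work done by TileCollection::apply or TileCollection::updateErrors.
class PartitionSketch {

	static <T> void applyInParallel(final List<T> tiles, final int nThreads, final Consumer<T> applyToTile)
			throws InterruptedException, ExecutionException {
		final ExecutorService exec = Executors.newFixedThreadPool(nThreads);
		try {
			final int chunk = (tiles.size() + nThreads - 1) / nThreads;   // ceil(size / nThreads)
			final List<Future<?>> futures = new ArrayList<>();
			for (int start = 0; start < tiles.size(); start += chunk) {
				final List<T> part = tiles.subList(start, Math.min(start + chunk, tiles.size()));
				futures.add(exec.submit(() -> part.forEach(applyToTile)));
			}
			for (final Future<?> f : futures)
				f.get();                                                  // wait and propagate exceptions
		} finally {
			exec.shutdown();
		}
	}
}
```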

Also, I added a boolean flag to indicate whether logging is desired (the default being false). The logging output now prints TileCollection::getError instead of ErrorStatistics::mean, which I believe is the correct value here. This fixes #68.

Finally, I fixed some IDE warnings and deleted some commented-out code that seems to have been used for parallelization at some point. As a disclaimer, I did not adhere to the project's prevailing coding style, since my IDE actively deletes spaces when moving things around.

Correctness

The JMM gives a few guarantees about which actions happen-before others (see the small example after this list):

  • Each action in a thread happens-before every action in that thread that comes later in the program's order.
  • Actions in a thread prior to placing an object into any concurrent collection happen-before actions subsequent to the access or removal of that element from the collection in another thread.
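As a small illustration of the second guarantee (the Tile class and its cost field are hypothetical):

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// The write to tile.cost in thread A happens-before the read in thread B,
// because the tile is published through a concurrent collection.
class HappensBeforeSketch {

	static class Tile {
		double cost;                              // plain, non-volatile field
	}

	final ConcurrentLinkedDeque<Tile> deque = new ConcurrentLinkedDeque<>();

	void threadA(final Tile tile) {
		tile.cost = 42.0;                         // ordinary write before publishing...
		deque.offerLast(tile);                    // ...happens-before the removal below
	}

	void threadB() {
		final Tile tile = deque.poll();
		if (tile != null)
			assert tile.cost == 42.0;             // guaranteed to observe thread A's write
	}
}
```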

I cannot guarantee that this method is correct (i.e., always terminates and computes the right thing). However, I'm pretty confident for the following reasons, based on the above principles:

  • Each tile is handled by exactly one thread at a time. (Guaranteed by the concurrent collections.)
  • Each tile is only inserted into the executing set if it is not already present. (A tile is removed from the executing set before it is placed back in the deque.)
  • A tile cannot be processed while one of its neighbors is being processed. (The neighborhood check happens after insertion into the executing set, and "being in the neighborhood of" is symmetric. This does mean that two neighboring tiles can be checked at the same time, in which case neither of them is processed.)
  • The algorithm processes every tile. (A tile is placed back in the deque before the same thread polls the deque again. Therefore, at least one thread does not see an empty pending deque as long as there is work to do.)
  • The algorithm doesn't hang if all pending tiles neighbor each other. (In this case, there is a high chance that there is always one tile in the executing set that makes the neighborhood check fail, even if that tile is never processed. A clean-up thread prevents this: all threads except this one terminate after a certain number of failed attempts, after which the clean-up thread can complete all pending tiles; see the sketch below.)
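A rough sketch of that fallback; the worker/clean-up split and the retry counter are simplified, and tryProcess stands for steps 1-4 from the worker-loop sketch above.

```java
import java.util.concurrent.ConcurrentLinkedDeque;

// Simplified fallback: regular workers stop after maxAttempts consecutive failures
// (empty deque or deferred tile), while the single clean-up worker keeps polling
// until no pending tiles remain, so deferred tiles are eventually completed.
class CleanUpSketch<T> {

	final ConcurrentLinkedDeque<T> pending = new ConcurrentLinkedDeque<>();

	/** Attempts steps 1-4 for one tile; returns false if the tile had to be deferred. */
	boolean tryProcess(final T tile) {
		return true;                               // placeholder, see worker-loop sketch above
	}

	void run(final boolean isCleanUpWorker, final int maxAttempts) {
		int failures = 0;
		while (isCleanUpWorker ? !pending.isEmpty() : failures < maxAttempts) {
			final T tile = pending.poll();
			if (tile != null && tryProcess(tile))
				failures = 0;                      // made progress, reset the counter
			else
				failures++;                        // nothing polled, or tile was deferred
		}
	}
}
```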

Further steps

While we're at it, should we:

  • add a method for single-threaded optimization to TileUtil to unify the API?
  • deprecate the TileCollection::<optimize|concurrently|silently> methods?
  • use a specialized single-threaded implementation for nThreads == 1?
  • use a ConcurrentLinkedQueue-based parallelization for TileCollection::updateErrors and TileCollection::apply instead of an a-priori partition of the list of tiles? This potentially allows for better load balancing, but could incur more overhead (see the sketch after this list).
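For reference, the queue-based variant could look like the following sketch; forEachParallel and updateTile are placeholder names.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Sketch of queue-based load balancing: instead of splitting the tile list up front,
// all workers poll from one shared queue, so faster threads naturally take more tiles.
class QueueBalancingSketch {

	static <T> void forEachParallel(final Collection<T> tiles, final int nThreads, final Consumer<T> updateTile)
			throws InterruptedException {
		final ConcurrentLinkedQueue<T> queue = new ConcurrentLinkedQueue<>(tiles);
		final List<Thread> workers = new ArrayList<>();
		for (int i = 0; i < nThreads; ++i) {
			final Thread worker = new Thread(() -> {
				T tile;
				while ((tile = queue.poll()) != null)  // each poll is one unit of work
					updateTile.accept(tile);
			});
			workers.add(worker);
			worker.start();
		}
		for (final Thread worker : workers)
			worker.join();                             // wait until all tiles are processed
	}
}
```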

I'm happy for any feedback regarding correctness and efficiency in more practical situations, especially from @axtimwalde, @tpietzsch, @StephanPreibisch, and @bogovicj.

@minnerbe

A follow-up based on an offline discussion with @axtimwalde, who expressed interest in more validation in the form of real experiments.

I pointed some of our EM-alignment code at this PR and re-aligned a few stacks. One stack consists of 60k tiles with ~100 matches per tile pair. The models that are fitted are multi-layer interpolated models (i.e., an interpolation of affine, rigid, and translation).

The alignments were done with 8, 16, and 32 cores and compared against the original algorithm with 32 cores. For all runs, the alignment errors were basically identical, but runtime for the algorithm proposed in this PR was roughly 3, 5, and 7 times lower, respectively. In particular, the algorithm seems to consistently terminate and produce the same results as before the change.

Is this experiment sufficient @axtimwalde, or do you want to see further results?

@axtimwalde axtimwalde merged commit a53245d into axtimwalde:master Feb 13, 2024
@axtimwalde

Thanks!

@acardona

Only to add: THANKS!

On a 256-core computer, the optimizer went from spending over 8 hours to a mere 20 minutes, for a ~12,000-section series registered 6-adjacent with hundreds to a thousand SIFT features per pairwise section match. This is AMAZING.
