Parallelize ManyToMany plugin #4454

Merged (5 commits) on Sep 15, 2017

@oxidase oxidase commented Aug 29, 2017

Issue

Backward and forward searches in the manyToMany plugin are independent and can easily be parallelized.
Here are timing results in milliseconds on 4 cores (2 real + 2 HT) for random 25x25 table requests on a DE-sized extract:

[Image: par_ser]
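
As a rough illustration of why independent searches parallelize easily, here is a minimal sketch. It is not the PR's code: it uses plain `std::thread` instead of TBB, and `fillRow` and `buildTable` are hypothetical names, with `fillRow` standing in for one forward search that fills one table row. Because every worker writes only to its own rows, no locking is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for one forward search: fills one row of the table.
// The "duration" here is just a function of the two indices.
void fillRow(std::size_t source, std::vector<int> &row)
{
    for (std::size_t target = 0; target < row.size(); ++target)
        row[target] = static_cast<int>(source * 100 + target);
}

// Builds a sources x targets table using `threads` worker threads.
// Worker w handles rows w, w + threads, w + 2 * threads, ... so no two
// workers ever write to the same row.
std::vector<std::vector<int>>
buildTable(std::size_t sources, std::size_t targets, std::size_t threads)
{
    assert(threads > 0);
    std::vector<std::vector<int>> table(sources, std::vector<int>(targets));
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < threads; ++w)
    {
        workers.emplace_back([&table, w, threads, sources] {
            for (std::size_t s = w; s < sources; s += threads)
                fillRow(s, table[s]);
        });
    }
    // Wait for all rows to be filled before returning the table.
    for (auto &worker : workers)
        worker.join();
    return table;
}
```

Calling `buildTable(25, 25, 4)` and `buildTable(25, 25, 1)` should yield identical tables; only the wall-clock time is expected to differ.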

Tasklist

  • review
  • adjust for comments

Requirements / Relations

Link any requirements here. Other pull requests this PR is based on?

@@ -125,6 +127,7 @@ template <typename Algorithm> class Engine final : public EngineInterface
}
std::unique_ptr<DataFacadeProvider<Algorithm>> facade_provider;
mutable SearchEngineData<Algorithm> heaps;
tbb::task_scheduler_init task_scheduler;

Member

This needs to be constructed with the number of threads; otherwise the default constructor infers the number from the available CPUs, which will be wrong in, e.g., a Docker container:

https://www.threadingbuildingblocks.org/docs/help/hh_goto.htm?index.htm#reference/task_scheduler/task_scheduler_init_cls.html

Contributor Author

@daniel-j-h I added a use_threads_number parameter that can be set in the node bindings and in the command-line arguments of osrm-routed. osrm-routed now has two separate thread pools, the server pool and the internal TBB pool, so the number of threads is actually doubled.

Member

Is it? I thought the benefit of using the dynamically linked libtbb was to avoid exactly these issues?

@oxidase oxidase force-pushed the parallel/m2m branch 2 times, most recently from feada3e to 19c8fc1 August 30, 2017 12:18
Member

@TheMarex TheMarex left a comment

Looks great! However, this needs to be verified under load, and I would like to see the slowdown of using just one thread compared with the old version. I checked our node bindings and it seems we already link against libtbb for other stuff, so this is not breaking any dependencies.

This is somewhat of a paradigm shift in how we do parallelization (external vs. internal thread pool); if this goes well, we might consider parallelizing other algorithms as well.

@@ -90,6 +90,7 @@ struct EngineConfig final
int max_alternatives = 3; // set an arbitrary upper bound; can be adjusted by user
bool use_shared_memory = true;
Algorithm algorithm = Algorithm::CH;
int use_threads_number = 1;

Member

Nitpick: use_threads_number is kind of a weird phrasing. How about number_of_threads or just threads?

We should also document a value that would make TBB use the default number of threads (I think -1?).
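
To make the "default number of threads" idea concrete, here is a minimal sketch of how a configured value could be resolved. It uses only the standard library, not TBB; `kAutomaticThreads` and `resolveThreadCount` are hypothetical names, and the sentinel value -1 is simply the guess from the comment above, not a confirmed setting.

```cpp
#include <cassert>
#include <thread>

// Hypothetical sentinel meaning "let the runtime decide".
// The value -1 mirrors the guess in the review comment; it is an assumption.
constexpr int kAutomaticThreads = -1;

// Returns the number of worker threads to start for a configured value.
int resolveThreadCount(int configured)
{
    assert(configured == kAutomaticThreads || configured > 0);
    if (configured != kAutomaticThreads)
        return configured; // an explicit setting always wins

    // hardware_concurrency() may return 0 when the value is not computable,
    // so fall back to a single thread in that case.
    const unsigned detected = std::thread::hardware_concurrency();
    return detected == 0 ? 1 : static_cast<int>(detected);
}
```

With this shape, an explicit value such as `resolveThreadCount(4)` is passed through unchanged, which is what a deployment with restricted CPUs (e.g. the Docker case raised earlier) would rely on, while `resolveThreadCount(kAutomaticThreads)` defers to whatever the runtime detects.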

int &max_locations_map_matching,
int &max_results_nearest,
int &max_alternatives)
EngineConfig &config)

Member

Nice! 👍

const auto bucket_iterator = search_space_with_buckets.find(node);
// iterate bucket if there exists one
if (bucket_iterator != search_space_with_buckets.end())
const auto &bucket_list = std::equal_range(search_space_with_buckets.begin(),

Member

Did you check the performance impact of this compared with using an unordered_map? 2 log(N) might be fine though.
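
For readers unfamiliar with the trade-off, here is a minimal sketch of looking up all buckets of a node in a sorted vector with `std::equal_range`. The `Bucket` layout, the `ByNode` comparator, and `bucketsFor` are hypothetical and only illustrate the technique; the PR's actual types may differ. The lookup does two binary searches (hence roughly 2 log(N) comparisons), in exchange for contiguous storage.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical bucket entry: which target search reached `node`, and at what weight.
struct Bucket
{
    unsigned node;
    unsigned target;
    int weight;
};

// Orders buckets by node only. The mixed overloads let std::equal_range
// search with a bare node id instead of a full Bucket.
struct ByNode
{
    bool operator()(const Bucket &lhs, const Bucket &rhs) const { return lhs.node < rhs.node; }
    bool operator()(const Bucket &lhs, unsigned node) const { return lhs.node < node; }
    bool operator()(unsigned node, const Bucket &rhs) const { return node < rhs.node; }
};

using BucketIterator = std::vector<Bucket>::const_iterator;

// Returns the half-open range [first, last) of all buckets stored for `node`.
// Precondition: `buckets` is sorted by node.
std::pair<BucketIterator, BucketIterator>
bucketsFor(const std::vector<Bucket> &buckets, unsigned node)
{
    assert(std::is_sorted(buckets.begin(), buckets.end(), ByNode{}));
    return std::equal_range(buckets.begin(), buckets.end(), node, ByNode{});
}
```

An empty range (`first == last`) plays the role of the "no bucket exists" check that a `find(node) != end()` test performs on a map.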

@TheMarex TheMarex modified the milestones: 5.11.0, 5.12.0 Aug 30, 2017
@danpat danpat modified the milestones: 5.13.0, 5.12.0 Sep 5, 2017
@oxidase oxidase force-pushed the parallel/m2m branch 2 times, most recently from c74e038 to d132cf7 September 14, 2017 20:32
Contributor Author

oxidase commented Sep 15, 2017

@daniel-j-h let me show what I mean by "double the number of threads". In my local run, when I stop in manyToManySearch, osrm-routed has the following threads:

 Id   Target Id         Frame 
  1    Thread 0x7ffff7f96740 (LWP 11822) "osrm-routed" do_sigwait (sig=0x7fffffffc430, set=<optimized out>) at ../sysdeps/unix/sysv/linux/sigwait.c:64
  2    Thread 0x7ffff070e700 (LWP 12263) "osrm-routed" 0x00007ffff6ecd9dd in pthread_join (threadid=140737217296128, thread_return=0x0) at pthread_join.c:90
  3    Thread 0x7fffefd7f700 (LWP 12264) "osrm-routed" 0x00007ffff5f1a373 in epoll_wait () at ../sysdeps/unix/syscall-template.S:84
* 4    Thread 0x7fffef3f0700 (LWP 12265) "osrm-routed" osrm::engine::routing_algorithms::manyToManySearch<osrm::engine::routing_algorithms::ch::Algorithm> (engine_working_data=..., facade=..., phantom_nodes=std::vector of length 25, capacity 25 = {...}, source_indices=std::vector of length 0, capacity 0, target_indices=std::vector of length 0, capacity 0) at /home/miha/mapbox/osrm-backend/src/engine/routing_algorithms/many_to_many.cpp:322
  5    Thread 0x7fffeea61700 (LWP 12266) "osrm-routed" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  6    Thread 0x7fffee0d2700 (LWP 12267) "osrm-routed" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  7    Thread 0x7fffed743700 (LWP 12773) "osrm-routed" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  8    Thread 0x7fffecf41700 (LWP 12774) "osrm-routed" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  9    Thread 0x7fffed342700 (LWP 12775) "osrm-routed" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38

The first thread is the main one and waits for a signal at

sigwait(&wait_mask, &sig);

The second thread is a server thread that starts at

std::thread server_thread(std::move(server_task));
and waits at
thread->join();

Threads 3-6 are asio services started at

std::shared_ptr<std::thread> thread = std::make_shared<std::thread>(

Threads 7-9 are started by TBB during the first call of

Without the PR, osrm-routed uses 6 threads in my particular case; with the PR, it uses those 6 plus 3 threads in the TBB pool.

Also, as we checked yesterday, there are no static TBB distributions, so it should be safe to assume a unique TBB data singleton.

Also, changing the map of vectors to an ordered vector changes the average query time for germany latest from 568.7661 ms to 496.6084 ms.

@oxidase oxidase merged commit 966139c into master Sep 15, 2017
@oxidase oxidase deleted the parallel/m2m branch September 15, 2017 08:55