
Deadlock using osrm-datastore while requests are being served #1888

Closed · danpat opened this issue Jan 13, 2016 · 6 comments

@danpat (Member) commented Jan 13, 2016

Under some circumstances, it seems that osrm-datastore can deadlock if routing requests are being served at the same time.

To reproduce the problem:

  1. Run osrm-datastore whatever.osrm to load the initial dataset
  2. Run osrm-routed -s to load data from shared memory.
  3. Set up a request running in a rapid loop: while true ; do curl 'http://localhost:5000/viaroute?loc=39.22693426244916,-75.59280395507812&loc=39.06451486901886,-75.465087890625' ; done
  4. Now while that loop is running, do the same with osrm-datastore, like so: while true ; do ./osrm-datastore whatever.osrm ; done

In a perfect world, the locking would work properly and routing wouldn't be interrupted. However, very shortly after starting the osrm-datastore loop, I start seeing errors like this:

(screenshots of the error responses, taken 2016-01-12 at 3:58 PM and 4:03 PM)

Then, within a few seconds, everything stops completely, and osrm-routed appears to be stuck in a busy loop.

Killing the curl loop does not unlock things. osrm-routed responds to requests with {"status": 500,"status_message":"Internal Server Error"} and does not close the socket (leading to curl hanging).

Killing the osrm-datastore loop does not resurrect osrm-routed.

Killing the osrm-datastore loop and restarting osrm-routed does seem to bring things back to life, so it doesn't look like any routing data is getting corrupted, just locking state.

@danpat (Member, Author) commented Jan 13, 2016

Simpler reproduction recipe:

  1. Run osrm-datastore whatever.osrm
  2. Run osrm-routed -s
  3. Run osrm-datastore whatever.osrm
  4. Run osrm-datastore whatever.osrm
  5. Run curl 'http://localhost:5000/viaroute?loc=1,1&loc=2,2'

You can now kill osrm-routed, and restarting it (without running further osrm-datastore commands) will lead to errors like this:

danpat@rundle:~/mapbox/osrm-backend/build$ ./osrm-routed -s
[info] starting up engines, v4.9.0
[debug] Loading from shared memory
[debug] Threads:    8
[debug] IP address: 0.0.0.0
[debug] IP port:    5000
[debug] writeable memory allocated 12 bytes
[debug] deallocating prev memory
[debug] deallocating prev memory
[warn] caught exception: No such file or directory, code 7
[warn] [exception] No such file or directory
danpat@rundle:~/mapbox/osrm-backend/build$

Re-running osrm-datastore whatever.osrm will "clean up" the problem, and everything is back to normal.

@daniel-j-h (Member) commented:

Can reproduce, perfect.

I was able to track it down to the line where the SharedMemory object is constructed.

@daniel-j-h (Member) commented:

SharedMemoryFactory::Get returns a newed pointer to a SharedMemory object. No call site ever calls delete on it, so its destructor is never invoked, which makes e.g. shm_remove effectively dead code. Interesting.
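
To illustrate the ownership pattern being described, here is a minimal, self-contained sketch; the SharedMemory and SharedMemoryFactory stand-ins below are hypothetical simplifications, not the actual OSRM classes. A factory that hands back a raw new'ed pointer relies on every caller to delete it; if none does, the destructor's cleanup (such as shm_remove) never runs. Owning the result, for example with std::unique_ptr, restores that cleanup.

```cpp
// Minimal sketch of the ownership problem (stand-in types, not OSRM's).
#include <iostream>
#include <memory>

struct SharedMemory
{
    ~SharedMemory() { std::cout << "cleanup runs (shm_remove would go here)\n"; }
};

struct SharedMemoryFactory
{
    static SharedMemory *Get() { return new SharedMemory{}; }
};

int main()
{
    {
        SharedMemory *leaked = SharedMemoryFactory::Get();
        (void)leaked; // never deleted: the destructor, and its cleanup, never run
    }
    {
        // Owning the result makes the destructor run deterministically.
        std::unique_ptr<SharedMemory> owned{SharedMemoryFactory::Get()};
    } // "cleanup runs" printed here
    return 0;
}
```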

@danpat (Member, Author) commented Jan 14, 2016

The problem here is caused by toggling between LAYOUT_1 and LAYOUT_2. SharedDataFacade only updates its internal pointer when a request comes in.

If osrm-datastore is run such that we toggle from LAYOUT_1 -> LAYOUT_2 -> LAYOUT_1 without SharedDataFacade being updated (via an incoming request), then the CheckAndReloadFacade method will erroneously call Remove() on the very data that it's supposed to start using.
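
A simplified sketch of that failure mode (not the actual OSRM source; the types and method bodies below are illustrative assumptions) might look like this:

```cpp
// Sketch of the double-toggle bug: osrm-datastore alternates between two
// shared-memory regions, and the facade only reloads when a request arrives
// and it notices the dataset has changed.
enum class Layout
{
    LAYOUT_1,
    LAYOUT_2
};

struct SharedDataFacade
{
    Layout attached = Layout::LAYOUT_1; // region this facade currently maps

    void CheckAndReloadFacade(Layout current, bool dataset_changed)
    {
        if (!dataset_changed)
            return;

        // Assumes the fresh data always lives in "the other" region, so the
        // previously attached one is safe to remove.  After two datastore
        // runs (LAYOUT_1 -> LAYOUT_2 -> LAYOUT_1) with no request in between,
        // current == attached, and this Remove() deletes exactly the data the
        // facade is about to start using.
        Remove(attached);
        attached = current;
        Attach(attached);
    }

    void Remove(Layout) { /* shm_remove + unmap would happen here */ }
    void Attach(Layout) { /* map the region */ }
};
```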

@danpat (Member, Author) commented Jan 14, 2016

There appears to be a race condition where SharedDataFacade can perform its CheckAndReloadFacade while a query is still in progress on another thread. This can cause all kinds of undefined behaviour, including lockups and segfaults, depending on exactly how far the query thread has progressed when CheckAndReloadFacade resets the data structures it is reading.

Generally, single-threaded requestors against a single osrm-routed instance won't trigger this problem; it requires concurrent requests combined with a new dataset from osrm-datastore.
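
One conventional way to guard against this kind of race, sketched below purely for illustration (it is not the fix that was merged into OSRM, and the names are hypothetical), is a reader/writer lock: queries take a shared lock, and the reload takes an exclusive lock so it can only swap the shared-memory pointers once all in-flight queries have drained.

```cpp
// Illustration only: serialise data reloads against in-flight queries with a
// reader/writer lock (std::shared_mutex requires C++17).
#include <mutex>
#include <shared_mutex>

class FacadeGuard
{
    std::shared_mutex mutex_;

  public:
    // Queries run under a shared lock; any number may run concurrently.
    template <typename Query> auto RunQuery(Query &&query)
    {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return query();
    }

    // The reload takes the exclusive lock, so it waits for running queries
    // to finish and blocks new ones while the facade's pointers are swapped.
    template <typename Reload> void Reload(Reload &&reload)
    {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        reload();
    }
};
```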

@TheMarex (Member) commented:

This needs a bigger restructuring of osrm-datastore and of the data backend in general. Tracking here: #1907

Closing after the partial fix is merged.
