
Deadlock using osrm-datastore while requests are being served #1888

Closed · danpat opened this issue Jan 13, 2016 · 6 comments

@danpat (Member) commented Jan 13, 2016

Under some circumstances, it seems that osrm-datastore can deadlock if routing requests are being served at the same time.

To reproduce the problem:

  1. Run osrm-datastore whatever.osrm to load the initial dataset
  2. Run osrm-routed -s to load data from shared memory.
  3. Set up a request running in a rapid loop: while true ; do curl 'http://localhost:5000/viaroute?loc=39.22693426244916,-75.59280395507812&loc=39.06451486901886,-75.465087890625' ; done
  4. Now while that loop is running, do the same with osrm-datastore, like so: while true ; do ./osrm-datastore whatever.osrm ; done

In a perfect world, the locking would work properly and routing wouldn't be interrupted. However, very shortly after starting the osrm-datastore loop, I start seeing errors like this:

(screenshots of the error responses, taken 2016-01-12 at 3:58 PM and 4:03 PM)

Then, within a few seconds, everything stops completely, and osrm-routed appears to be stuck in a busy loop.

Killing the curl loop does not unlock things. osrm-routed responds to requests with {"status": 500,"status_message":"Internal Server Error"} and does not close the socket (leading to curl hanging).

Killing the osrm-datastore loop does not resurrect osrm-routed.

Killing the osrm-datastore loop and restarting osrm-routed does seem to bring things back to life, so it doesn't look like any routing data is getting corrupted, just locking state.

@danpat (Member, Author) commented Jan 13, 2016

Simpler reproduction recipe:

  1. Run osrm-datastore whatever.osrm
  2. Run osrm-routed -s
  3. Run osrm-datastore whatever.osrm
  4. Run osrm-datastore whatever.osrm
  5. Run curl 'http://localhost:5000/viaroute?loc=1,1&loc=2,2'

You can now kill osrm-routed, and restarting it (without running further osrm-datastore commands) will lead to errors like this:

danpat@rundle:~/mapbox/osrm-backend/build$ ./osrm-routed -s
[info] starting up engines, v4.9.0
[debug] Loading from shared memory
[debug] Threads:    8
[debug] IP address: 0.0.0.0
[debug] IP port:    5000
[debug] writeable memory allocated 12 bytes
[debug] deallocating prev memory
[debug] deallocating prev memory
[warn] caught exception: No such file or directory, code 7
[warn] [exception] No such file or directory
danpat@rundle:~/mapbox/osrm-backend/build$

Re-running osrm-datastore whatever.osrm will "clean up" the problem, and everything is back to normal.

@daniel-j-h (Member) commented:

Can reproduce, perfect.

I was able to track it down to the line where the SharedMemory object is constructed.

@daniel-j-h (Member) commented:

SharedMemoryFactory::Get returns a newed pointer to a SharedMemory object. No call site ever calls delete on it, so its destructor is never invoked, which makes e.g. shm_remove effectively dead code. Interesting.
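
To illustrate the ownership pattern being described, here is a minimal, self-contained sketch; the SharedMemory and SharedMemoryFactory stand-ins below are hypothetical simplifications, not the actual OSRM classes. A factory that hands back a raw new'ed pointer relies on every caller to delete it; if none does, the destructor's cleanup (such as shm_remove) never runs. Owning the result, for example with std::unique_ptr, restores that cleanup.

```cpp
// Minimal sketch of the ownership problem (stand-in types, not OSRM's).
#include <iostream>
#include <memory>

struct SharedMemory
{
    ~SharedMemory() { std::cout << "cleanup runs (shm_remove would go here)\n"; }
};

struct SharedMemoryFactory
{
    static SharedMemory *Get() { return new SharedMemory{}; }
};

int main()
{
    {
        SharedMemory *leaked = SharedMemoryFactory::Get();
        (void)leaked; // never deleted: the destructor, and its cleanup, never run
    }
    {
        // Owning the result makes the destructor run deterministically.
        std::unique_ptr<SharedMemory> owned{SharedMemoryFactory::Get()};
    } // "cleanup runs" printed here
    return 0;
}
```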

@danpat (Member, Author) commented Jan 14, 2016

The problem here is caused by toggling between LAYOUT_1 and LAYOUT_2. SharedDataFacade only updates its internal pointer when a request comes in.

If osrm-datastore is run such that we toggle from LAYOUT_1 -> LAYOUT_2 -> LAYOUT_1 without SharedDataFacade being updated (via an incoming request), then the CheckAndReloadFacade method will erroneously call Remove() on the very data that it's supposed to start using.
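
A simplified sketch of that failure mode (not the actual OSRM source; the types and method bodies below are illustrative assumptions) might look like this:

```cpp
// Sketch of the double-toggle bug: osrm-datastore alternates between two
// shared-memory regions, and the facade only reloads when a request arrives
// and it notices the dataset has changed.
enum class Layout
{
    LAYOUT_1,
    LAYOUT_2
};

struct SharedDataFacade
{
    Layout attached = Layout::LAYOUT_1; // region this facade currently maps

    void CheckAndReloadFacade(Layout current, bool dataset_changed)
    {
        if (!dataset_changed)
            return;

        // Assumes the fresh data always lives in "the other" region, so the
        // previously attached one is safe to remove.  After two datastore
        // runs (LAYOUT_1 -> LAYOUT_2 -> LAYOUT_1) with no request in between,
        // current == attached, and this Remove() deletes exactly the data the
        // facade is about to start using.
        Remove(attached);
        attached = current;
        Attach(attached);
    }

    void Remove(Layout) { /* shm_remove + unmap would happen here */ }
    void Attach(Layout) { /* map the region */ }
};
```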

@danpat (Member, Author) commented Jan 14, 2016

There appears to be a race condition where SharedDataFacade can perform its CheckAndReloadFacade while a query is still in progress on another thread. This can cause all kinds of undefined behaviour, including lockups and segfaults, depending on exactly how far the query thread has progressed when CheckAndReloadFacade resets the data structures it is reading.

Generally, single-threaded requestors against a single osrm-routed instance won't trigger this problem; it requires concurrent requests combined with a new dataset from osrm-datastore.
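
One conventional way to guard against this kind of race, sketched below purely for illustration (it is not the fix that was merged into OSRM, and the names are hypothetical), is a reader/writer lock: queries take a shared lock, and the reload takes an exclusive lock so it can only swap the shared-memory pointers once all in-flight queries have drained.

```cpp
// Illustration only: serialise data reloads against in-flight queries with a
// reader/writer lock (std::shared_mutex requires C++17).
#include <mutex>
#include <shared_mutex>

class FacadeGuard
{
    std::shared_mutex mutex_;

  public:
    // Queries run under a shared lock; any number may run concurrently.
    template <typename Query> auto RunQuery(Query &&query)
    {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        return query();
    }

    // The reload takes the exclusive lock, so it waits for running queries
    // to finish and blocks new ones while the facade's pointers are swapped.
    template <typename Reload> void Reload(Reload &&reload)
    {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        reload();
    }
};
```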

@TheMarex (Member) commented:

This needs a bigger restructuring of osrm-datastore and of the data backend in general. Tracking here: #1907

Closing after the partial fix is merged.
