Erratic behaviour with some DB populations #1558

Closed
fgalan opened this issue Nov 27, 2015 · 3 comments
@fgalan
Member

fgalan commented Nov 27, 2015

At the end of the 0.26.0 development cycle (i.e. the 0.26.0 release candidates are on the way and the version is almost closed) we have found some problems that make CB behave in a weird way with some DB populations. We will use this issue to summarize what we know, along with possible research lines and potential solutions.

Facts

When the DB is populated in a specific way (ask @fgalan for the orion-evil-dump.tgz dump), CB behaves in a weird way. The behaviour depends on whether -subCacheIval is set to 0 (i.e. only one refresh is done, at startup) or to a different value (i.e. refreshes are done periodically).

In the case of no periodic refresh, CB seems to be stable (we have seen it work without failures for days at orion.lab), although some weird things occur:

  • "Fatal Error" messages related to the DB may appear in the log
  • The item count at /cache/statistics doesn't match the one it should be at DB (which may mean that the cache content is not correct)

In the case of periodic refresh, CB crashes after a while (using -subCacheIval 5 it usually crashes in less than 20 minutes at orion.lab). The core backtrace information shows exceptions (most of the time of assertion type) related to MongoDB C++ driver methods.

Theories

We don't know the actual cause of the problem yet; we only have some theories that need to be validated.

One possible theory is related to incorrect usage of cursors due to thread-safety issues (more details here: http://stackoverflow.com/questions/33945987/thread-safeness-at-mongodb-c-driver-regarding-indirect-connection-usage-throug). If a cursor is corrupted, that would explain why assertions in MongoDB C++ methods (such as the get*Field family or more()) fail. It would also explain why the item count in the cache doesn't match the actual one at DB.
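To make the theory concrete, here is a minimal, self-contained sketch (not the actual driver or CB code; `SharedCursor` and `drain` are hypothetical stand-ins for a `mongo::DBClientCursor` and the code iterating it). A cursor-like object is advanced from several threads; guarding each advance with a mutex keeps its internal state consistent, whereas unsynchronized concurrent advances could corrupt it, which would surface as the kind of driver-internal assertions seen in the backtraces:

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a driver cursor shared across threads.
class SharedCursor {
public:
    explicit SharedCursor(std::size_t items) : remaining_(items) {}

    // Consumes one item; returns false when exhausted. The lock is the
    // protection whose absence the theory blames for the corruption.
    bool next() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (remaining_ == 0) return false;
        --remaining_;
        return true;
    }

private:
    std::mutex mutex_;
    std::size_t remaining_;
};

// Drains the cursor from nThreads threads; returns total items consumed.
// With the lock in next(), the total always equals the item count.
std::size_t drain(std::size_t items, unsigned nThreads) {
    SharedCursor cursor(items);
    std::mutex countMutex;
    std::size_t total = 0;
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < nThreads; ++i) {
        threads.emplace_back([&] {
            std::size_t mine = 0;
            while (cursor.next()) ++mine;
            std::lock_guard<std::mutex> lock(countMutex);
            total += mine;
        });
    }
    for (auto& t : threads) t.join();
    return total;
}
```

An unsynchronized `--remaining_` would make the drained total (and, by analogy, the cache item count) diverge from the real number of documents, matching the /cache/statistics mismatch described above.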

How to validate this theory:

  • Hack the CB code so the cache uses the request semaphore (the quickest way seems to be hacking the implementation of the cacheSemTake/cacheSemGive functions so they also wait on the request semaphore, apart from the semaphore properly protecting cache access in RAM)
  • Run CB with -reqMutexPolicy all. That will force all cursor usage into non-concurrent mode (as there are no cursor usages outside mongoBackend, and the "all" req mutex policy ensures that at most one thread at a time is executing code in mongoBackend).
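The first validation hack could be sketched roughly as follows (the semaphores here are `std::mutex` stand-ins and `refreshSubCache` is a placeholder; only the `cacheSemTake`/`cacheSemGive` names come from the CB code mentioned above). The idea is that taking the request semaphore inside the cache semaphore functions serializes cache refreshes against request processing, so no two threads can drive DB cursors at once:

```cpp
#include <mutex>

static std::mutex reqSem;    // stand-in for the request semaphore
static std::mutex cacheSem;  // semaphore protecting cache access in RAM

void cacheSemTake() {
    reqSem.lock();    // added by the hack: serialize against request threads
    cacheSem.lock();  // original protection of the cache structures
}

void cacheSemGive() {
    cacheSem.unlock();
    reqSem.unlock();  // release in reverse order of acquisition
}

// Placeholder for the periodic refresh: with the hack, the cursor-based
// DB work between take and give can never overlap request processing.
bool refreshSubCache() {
    cacheSemTake();
    bool ok = true;   // real code would iterate the csubs cursor here
    cacheSemGive();
    return ok;
}
```

If the crashes disappear with this hack in place (at the cost of throughput), that would support the cursor thread-safety theory; note that both semaphores must always be taken in the same order to avoid introducing a deadlock.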

More info to check: how IoTAgent C++ uses cursors.

Other things we could test:

  • Study the relationship between the ONTIMEINTERVAL initial load at startup and this problem, i.e. hack CB to disable the recoverOntimeIntervalThreads() calls and see what happens.
  • Upgrade to a newer version of the MongoDB C++ driver (right now we are using legacy-1.0.2, but newer ones may include relevant bugfixes).
  • Could some tool in the valgrind suite (apart from memcheck, which we normally use) help?
@fgalan
Member Author

fgalan commented Nov 30, 2015

The answer to http://stackoverflow.com/questions/33945987/thread-safeness-at-mongodb-c-driver-regarding-indirect-connection-usage-throug/33977828#33977828 confirms that cursors cannot be used in the way we currently do (at 0.25.0).

PR #1564 solves this, although it is not yet clear whether that suffices to solve this issue or there is something else wrong that needs to be fixed.

@fgalan
Member Author

fgalan commented Nov 30, 2015

The CB developed in PR #1570 seems to work correctly at orion.lab. It has been running for 5 hours without crashes, without relevant ERRORs in the log, with -subCacheIval 5, and with perfect cache/DB csubs sync regarding the number of elements. Previous versions didn't get this far.

<orion>
  <version>0.25.0-next</version>
  <uptime>0 d, 5 h, 6 m, 55 s</uptime>
  <git_hash>5b3153942d61dc960f168a0953e4a9e85953577a</git_hash>
  <compile_time>Mon Nov 30 12:15:01 CET 2015</compile_time>
  <compiled_by>fermin</compiled_by>
  <compiled_in>centollo</compiled_in>
</orion>

However, before definitively closing this issue, we will hold a "quarantine" for it. Migrated to the 0.27.0 milestone; it will be closed at the end of the 0.27.0 development cycle if no new problem related to this is found.

(Lowering priority to P5).

@fgalan fgalan modified the milestones: 0.27.0, 0.26.0 Nov 30, 2015
@fgalan fgalan added P5 and removed P8 labels Nov 30, 2015
@fgalan
Member Author

fgalan commented Jan 8, 2016

It seems pretty stable, after more than 24 days of uninterrupted operation at orion.lab.fiware.org:

{
  "orion" : {
    "version" : "0.26.1",
    "uptime" : "24 d, 10 h, 58 m, 18 s",
    "git_hash" : "93376583fbb6c8dd68b80b5bac436bf03f7d8358",
    "compile_time" : "Wed Dec 9 11:45:27 CET 2015",
    "compiled_by" : "fermin",
    "compiled_in" : "centollo"
  }
}

Thus, moving back to the 0.26.0 milestone and closing.

@fgalan fgalan modified the milestones: 0.26.0, 0.27.0 Jan 8, 2016
@fgalan fgalan closed this as completed Jan 8, 2016