Erratic behaviour with some DB populations #1558

Closed
fgalan opened this issue Nov 27, 2015 · 3 comments
@fgalan
Member

fgalan commented Nov 27, 2015

At the end of the 0.26.0 development cycle (i.e. the 0.26.0 release candidates are on the way and the version is almost closed) we have found some problems that make CB behave in a weird way with some DB populations. We will use this issue to summarize what we know, along with possible research lines and potential solutions.

Facts

When the DB is populated in a specific way (ask @fgalan for the orion-evil-dump.tgz dump), CB behaves in a weird way. The behaviour depends on whether -subCacheIval is set to 0 (i.e. only one refresh is done, at startup) or to a different value (i.e. refreshes are done periodically).

In the case of no periodic refresh, CB seems to be stable (we have seen it work without failures for days at orion.lab), although some weird things occur:

  • "Fatal Error" messages related to the DB may appear in the log
  • The item count at /cache/statistics doesn't match the one it should be at DB (which may mean that the cache content is not correct)

In the case of periodic refresh, CB crashes after a while (using -subCacheIval 5 it usually crashes in less than 20 minutes at orion.lab). The core backtrace information shows exceptions (most of the time of assertion type) related to MongoDB C++ driver methods.

Theories

We don't know the actual cause of the problem yet; we only have some theories that need to be validated.

One possible theory is related to incorrect usage of cursors due to thread-safety issues (more details here: http://stackoverflow.com/questions/33945987/thread-safeness-at-mongodb-c-driver-regarding-indirect-connection-usage-throug). If a cursor is corrupted, that would explain why assertions in MongoDB C++ methods (such as the get*Field family or more()) fail. It would also explain why the item count in the cache doesn't match the actual one at DB.
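To make the theory concrete, here is a minimal, self-contained sketch (not the actual driver or CB code; `SharedCursor` and `drain` are hypothetical stand-ins for a `mongo::DBClientCursor` and the code iterating it). A cursor-like object is advanced from several threads; guarding each advance with a mutex keeps its internal state consistent, whereas unsynchronized concurrent advances could corrupt it, which would surface as the kind of driver-internal assertions seen in the backtraces:

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a driver cursor shared across threads.
class SharedCursor {
public:
    explicit SharedCursor(std::size_t items) : remaining_(items) {}

    // Consumes one item; returns false when exhausted. The lock is the
    // protection whose absence the theory blames for the corruption.
    bool next() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (remaining_ == 0) return false;
        --remaining_;
        return true;
    }

private:
    std::mutex mutex_;
    std::size_t remaining_;
};

// Drains the cursor from nThreads threads; returns total items consumed.
// With the lock in next(), the total always equals the item count.
std::size_t drain(std::size_t items, unsigned nThreads) {
    SharedCursor cursor(items);
    std::mutex countMutex;
    std::size_t total = 0;
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < nThreads; ++i) {
        threads.emplace_back([&] {
            std::size_t mine = 0;
            while (cursor.next()) ++mine;
            std::lock_guard<std::mutex> lock(countMutex);
            total += mine;
        });
    }
    for (auto& t : threads) t.join();
    return total;
}
```

An unsynchronized `--remaining_` would make the drained total (and, by analogy, the cache item count) diverge from the real number of documents, matching the /cache/statistics mismatch described above.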

How to validate this theory:

  • Hack the CB code so the cache uses the request semaphore (the quickest way seems to be hacking the implementation of the cacheSemTake/cacheSemGive functions so they also wait on the request semaphore, apart from the semaphore properly protecting cache access in RAM)
  • Run CB with -reqMutexPolicy all. That will force all cursor usage into non-concurrent mode (as there are no cursor usages outside mongoBackend, and the "all" req mutex policy ensures that at most one thread at a time is executing code in mongoBackend).
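The first validation hack could be sketched roughly as follows (the semaphores here are `std::mutex` stand-ins and `refreshSubCache` is a placeholder; only the `cacheSemTake`/`cacheSemGive` names come from the CB code mentioned above). The idea is that taking the request semaphore inside the cache semaphore functions serializes cache refreshes against request processing, so no two threads can drive DB cursors at once:

```cpp
#include <mutex>

static std::mutex reqSem;    // stand-in for the request semaphore
static std::mutex cacheSem;  // semaphore protecting cache access in RAM

void cacheSemTake() {
    reqSem.lock();    // added by the hack: serialize against request threads
    cacheSem.lock();  // original protection of the cache structures
}

void cacheSemGive() {
    cacheSem.unlock();
    reqSem.unlock();  // release in reverse order of acquisition
}

// Placeholder for the periodic refresh: with the hack, the cursor-based
// DB work between take and give can never overlap request processing.
bool refreshSubCache() {
    cacheSemTake();
    bool ok = true;   // real code would iterate the csubs cursor here
    cacheSemGive();
    return ok;
}
```

If the crashes disappear with this hack in place (at the cost of throughput), that would support the cursor thread-safety theory; note that both semaphores must always be taken in the same order to avoid introducing a deadlock.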

More info to check: how IoTAgent C++ uses cursors.

Other things we could test:

  • Study the relationship between the ONTIMEINTERVAL initial load at startup and this problem, i.e. hack CB to disable the recoverOntimeIntervalThreads() calls and see what happens.
  • Upgrade to a newer version of the MongoDB C++ driver (right now we are using legacy-1.0.2, but newer ones may include relevant bugfixes).
  • Could some tool in the valgrind suite (apart from memcheck, which we normally use) help?
@fgalan
Member Author

fgalan commented Nov 30, 2015

The answer to http://stackoverflow.com/questions/33945987/thread-safeness-at-mongodb-c-driver-regarding-indirect-connection-usage-throug/33977828#33977828 confirms that cursors cannot be used in the way we currently do (at 0.25.0).

PR #1564 solves this, although it is not yet clear whether that suffices to solve this issue or there is something else wrong that needs to be fixed.

@fgalan
Member Author

fgalan commented Nov 30, 2015

The CB developed in PR #1570 seems to work correctly at orion.lab. It has been running for 5 hours without crashes, without relevant ERRORs in the log, with -subCacheIval 5, and with perfect cache/DB csubs sync regarding the number of elements. Previous versions didn't get this far.

<orion>
  <version>0.25.0-next</version>
  <uptime>0 d, 5 h, 6 m, 55 s</uptime>
  <git_hash>5b3153942d61dc960f168a0953e4a9e85953577a</git_hash>
  <compile_time>Mon Nov 30 12:15:01 CET 2015</compile_time>
  <compiled_by>fermin</compiled_by>
  <compiled_in>centollo</compiled_in>
</orion>

However, before definitively closing this issue, we will hold a "quarantine" for it. Migrated to the 0.27.0 milestone; it will be closed at the end of the 0.27.0 development cycle if no new problem related to this is found.

(Lowering priority to P5).

@fgalan fgalan modified the milestones: 0.27.0, 0.26.0 Nov 30, 2015
@fgalan fgalan added P5 and removed P8 labels Nov 30, 2015
@fgalan
Member Author

fgalan commented Jan 8, 2016

It seems pretty stable, after more than 24 days of uninterrupted operation at orion.lab.fiware.org:

{
  "orion" : {
    "version" : "0.26.1",
    "uptime" : "24 d, 10 h, 58 m, 18 s",
    "git_hash" : "93376583fbb6c8dd68b80b5bac436bf03f7d8358",
    "compile_time" : "Wed Dec 9 11:45:27 CET 2015",
    "compiled_by" : "fermin",
    "compiled_in" : "centollo"
  }
}

Thus, moving back to the 0.26.0 milestone and closing.

@fgalan fgalan modified the milestones: 0.26.0, 0.27.0 Jan 8, 2016
@fgalan fgalan closed this as completed Jan 8, 2016