Improve keepalive performance in mongo connection pool #4517

rg2011 · 2024-02-20T18:38:18Z

Is your feature request related to a problem / use case? Please describe.

It is related to a performance problem. We have noticed a high rate of slow queries in our mongo deployment, regarding the listDatabases command:

{
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "admin.$cmd",
    "command": {
      "listDatabases": 1,
      "$db": "admin",
  ... omitted for brevity ...
    "locks": {
      "ParallelBatchWriterMode": {
        "acquireCount": {
          "r": 140
        }
      },
      "FeatureCompatibilityVersion": {
        "acquireCount": {
          "r": 140
        }
      },
      "ReplicationStateTransition": {
        "acquireCount": {
          "w": 1
        }
      },
      "Global": {
        "acquireCount": {
          "r": 140
        }
      },
      "Mutex": {
        "acquireCount": {
          "r": 139
        }
      }
    },
   .... omitted for brevity ...
    "durationMillis": 354
  }
}

The log has been redacted for brevity, but I've left the part about the locks in there. It seems that the command acquires several locks 139 to 140 times each, which might be the reason why it is so slow.

The source IP addresses of these requests belong to Orion servers. We have several of them in our multi-tenant deployment. It seems that fiware-orion uses the listDatabases command as a keepalive:

fiware-orion/src/lib/mongoDriver/mongoConnectionPool.cpp

Lines 105 to 120 in a0d75b3

    
  // MongoDB has a ping command, but we are not using it, as it doesn't not
  // provides auth checking when user and pass are empty (it provides auth
  // when we have user and pass, but that is not enough).
  //
  // In addition, note that command and database depend on mtenant. If we
  // are in mtenant mode we
  // will need at some point to look for all orion* databases, so command to
  // ping will be listDatabases in the admin DB. But if we run in not mtenant
  // mode listCollections in the default database will suffice
  std::string  cmd;
  std::string  effectiveDb;
  if (mtenat)
  {
    cmd = "listDatabases";

Deployed at scale, we are seeing around 300 - 500 ms per listDatabases request, as shown in the log. We would like to propose switching to a lighter command for keepalive, instead of listDatabases.

Describe the solution you'd like

Stop using listDatabases for keepalive in the mongo pool. Replace with a less expensive command.

Describe alternatives you've considered

Really not much besides increasing the resources of the mongo servers or splitting the mongo databases across different replicasets, but both options seem much more costly than changing the keepalive method.

Describe why you need this feature

Slow queries have an overall impact on cluster performance and might be degrading some of the actual work the cluster has to do.

Currently the listDatabases queries are not the only slow queries we have, but they amount to roughly 40% - 50% of all the slow queries in the replicaset.

Additional information

Do you have the intention to implement the solution

I can help with choosing a new command to use as keepalive in the pool. For instance, getParameter might be a good candidate, e.g. db.adminCommand({ getParameter:1, logLevel:1}).

I can also help evaluate the impact on performance once the command is changed.


fgalan · 2024-02-22T09:12:08Z

The mentioned code corresponds to pingConnection(), which is used only at CB startup, so its impact is very limited.

listDatabases is used in another place in the code:

fiware-orion/src/lib/mongoBackend/MongoGlobal.cpp

Lines 228 to 239 in a0d75b3

    
bool getOrionDatabases(std::vector<std::string>* dbsP)
{
  orion::BSONObj  result;
  std::string     err;
  orion::BSONObjBuilder bob;
  bob.append("listDatabases", 1);
  if (!orion::runDatabaseCommand("admin", bob.obj(), &result, &err))
  {
    return false;
  }

The getOrionDatabases() function is invoked from subCacheRefresh(). This is done with a frequency of -subCacheIval seconds (60 by default).

@rg2011 to confirm this theory... could you check if the "slow query" log regarding listDatabases happens at a frequency that matches the configuration of -subCacheIval? Note that if you have several CBs working in parallel and they have been started at different moments, you could have several "sequences".
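One way to run this check is sketched below in C++. It assumes the timestamps of the listDatabases "Slow query" entries have already been extracted from the mongo log as seconds, sorted in ascending order. The function name countPeriodicEntries and the tolerance parameter are hypothetical, introduced only for illustration. For every entry, it looks for a later entry roughly one interval afterwards, so interleaved sequences from several CBs started at different moments are still detected.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Orion): counts how many log entries have
// a later entry about intervalSec seconds after them.
// timestamps: seconds, sorted ascending (assumed to be extracted beforehand).
// toleranceSec: allowed jitter around the interval (an assumption).
std::size_t countPeriodicEntries(const std::vector<double>& timestamps,
                                 double intervalSec,
                                 double toleranceSec)
{
  assert(intervalSec > 0 && toleranceSec >= 0);

  std::size_t matched = 0;
  for (std::size_t i = 0; i < timestamps.size(); ++i)
  {
    for (std::size_t j = i + 1; j < timestamps.size(); ++j)
    {
      const double gap = timestamps[j] - timestamps[i];
      if (gap > intervalSec + toleranceSec)
      {
        break;  // sorted input: every later entry is even further away
      }
      if (std::fabs(gap - intervalSec) <= toleranceSec)
      {
        ++matched;  // found a successor one interval later
        break;
      }
    }
  }
  return matched;
}
```

With k CBs each contributing n entries at a fixed interval, the count should be close to k * (n - 1), since only the last entry of each sequence lacks a successor. A count near zero would suggest the slow queries are not tied to -subCacheIval.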

fgalan · 2024-03-06T11:29:57Z

Ref https://www.mongodb.com/docs/manual/reference/command/listDatabases/

Use nameOnly to make the operation lighter.
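As a rough illustration of why name-only output should be enough: per the code comment quoted earlier in this thread, the listing is needed in order to look for all orion* databases, and that only requires database names, not the size information that listDatabases returns by default. The sketch below shows that filtering step on a plain list of names. It is stdlib-only C++, not Orion's actual code. The function name filterByPrefix and the sample database names are hypothetical.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not Orion's implementation): keeps only the database
// names that start with the given prefix, e.g. "orion".
std::vector<std::string> filterByPrefix(const std::vector<std::string>& names,
                                        const std::string& prefix)
{
  std::vector<std::string> selected;
  for (const std::string& name : names)
  {
    // compare() clamps the length, so names shorter than the prefix
    // simply fail to match
    if (name.compare(0, prefix.size(), prefix) == 0)
    {
      selected.push_back(name);
    }
  }
  return selected;
}
```

For example, given the names admin, orion, orion-tenant1 and local (made-up values), filtering with the prefix "orion" keeps orion and orion-tenant1. Nothing in this step uses per-database sizes.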

rg2011 · 2024-03-06T16:06:23Z

@rg2011 to confirm this theory... could you check if the "slow query" log regarding listDatabases happens at a frequency that matches the configuration of -subCacheIval? Note that if you have several CBs working in parallel and they have been started at different moments, you could have several "sequences".

Yes, it's every 60 seconds.

fgalan · 2024-03-07T13:26:38Z

PR #4530

fgalan · 2024-03-07T16:22:12Z

The PR has been merged, but let's keep this issue open until it can be tested in the same environment where @rg2011 detected the problem.

fgalan · 2024-06-06T10:10:10Z

This has been included in Orion 4.0.0.

Pending a test in that environment before closing the issue.

rg2011 · 2024-06-27T16:06:34Z

Deployed 4.0.0 in prod environment and confirmed decrease in slow queries. Thanks!

rg2011 added the backlog label Feb 20, 2024

fgalan mentioned this issue Mar 7, 2024

ADD nameOnly to listDatabase command on MongoDB #4530

Merged

fgalan added this to the 3.13.0 milestone Mar 7, 2024

rg2011 closed this as completed Jun 27, 2024
