Improve keepalive performance in mongo connection pool #4517

Closed
rg2011 opened this issue Feb 20, 2024 · 7 comments

rg2011 commented Feb 20, 2024

Is your feature request related to a problem / use case? Please describe.

It is related to a performance problem. We have noticed a high rate of slow queries in our MongoDB deployment involving the listDatabases command:

{
  "msg": "Slow query",
  "attr": {
    "type": "command",
    "ns": "admin.$cmd",
    "command": {
      "listDatabases": 1,
      "$db": "admin",
  ... omitted for brevity ...
    "locks": {
      "ParallelBatchWriterMode": {
        "acquireCount": {
          "r": 140
        }
      },
      "FeatureCompatibilityVersion": {
        "acquireCount": {
          "r": 140
        }
      },
      "ReplicationStateTransition": {
        "acquireCount": {
          "w": 1
        }
      },
      "Global": {
        "acquireCount": {
          "r": 140
        }
      },
      "Mutex": {
        "acquireCount": {
          "r": 139
        }
      }
    },
   .... omitted for brevity ...
    "durationMillis": 354
  }
}

The log has been trimmed for brevity, but I've left the locks section in. It seems that the command acquires 139 - 140 locks, which might be the reason why it is so slow.

The source IP addresses of these requests belong to Orion servers. We have several of them in our multi-tenant deployment. It seems that fiware-orion uses the listDatabases command as a keepalive:

// MongoDB has a ping command, but we are not using it, as it does not
// provide auth checking when user and pass are empty (it provides auth
// when we have user and pass, but that is not enough).
//
// In addition, note that command and database depend on mtenant. If we
// are in mtenant mode we will need at some point to look for all orion*
// databases, so the command to ping will be listDatabases in the admin DB.
// But if we run in non-mtenant mode, listCollections in the default
// database will suffice.
std::string cmd;
std::string effectiveDb;
if (mtenant)
{
  cmd = "listDatabases";

Deployed at scale, each listDatabases request takes around 300 - 500 ms, as shown in the log. We would like to propose switching to a lighter command for keepalive, instead of listDatabases.

Describe the solution you'd like

Stop using listDatabases for keepalive in the mongo pool. Replace with a less expensive command.

Describe alternatives you've considered

Not much, besides increasing the resources of the mongo servers or splitting the mongo databases across different replica sets. Both options seem much more costly than changing the keepalive method.

Describe why you need this feature

Slow queries affect overall cluster performance and might be degrading some of the actual work the cluster has to do.

Currently the listDatabases queries are not the only slow queries we have, but they amount to roughly 40% - 50% of all the slow queries in the replica set.

Additional information

Do you have the intention to implement the solution

I can help with choosing a new command to use as keepalive in the pool. For instance, getParameter might be a good candidate, e.g. db.adminCommand({ getParameter:1, logLevel:1}).

I can also help evaluate the impact on performance once the command is changed.

@rg2011 rg2011 added the backlog label Feb 20, 2024

fgalan commented Feb 22, 2024

The mentioned code corresponds to pingConnection(), which is used only at CB startup, so its impact is very limited.

listDatabases is used in another place in the code:

bool getOrionDatabases(std::vector<std::string>* dbsP)
{
  orion::BSONObj result;
  std::string err;
  orion::BSONObjBuilder bob;
  bob.append("listDatabases", 1);
  if (!orion::runDatabaseCommand("admin", bob.obj(), &result, &err))
  {
    return false;
  }

The getOrionDatabases() function is invoked from subCacheRefresh(). This is done with a frequency of -subCacheIval seconds (60 by default).

@rg2011 to confirm this theory... could you check whether the "slow query" log entries for listDatabases occur at a frequency that matches the configured -subCacheIval? Note that if you have several CBs working in parallel and they have been started at different moments, you could see several interleaved "sequences".
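
One way to run that check is sketched below. It is only an illustration under stated assumptions: the timestamps are assumed to have been extracted already from the slow-query log entries of a single Orion source IP, sorted in ascending order and expressed in seconds, and the helper name matchesInterval is hypothetical (it is not part of Orion or MongoDB). Gaps that are whole multiples of the interval are accepted, since a refresh that stays under the slow-query threshold would not be logged.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Orion or MongoDB.
// Returns true if every gap between consecutive timestamps (in seconds) is
// within toleranceSec of a whole, non-zero multiple of intervalSec.
// Assumes timestampsSec is sorted ascending and comes from a single CB.
bool matchesInterval(const std::vector<double>& timestampsSec,
                     double intervalSec,
                     double toleranceSec)
{
  // At least two entries are needed to measure a gap
  if (timestampsSec.size() < 2 || intervalSec <= 0)
  {
    return false;
  }

  for (std::size_t ix = 1; ix < timestampsSec.size(); ++ix)
  {
    double gap      = timestampsSec[ix] - timestampsSec[ix - 1];
    double multiple = std::round(gap / intervalSec);

    // Reject gaps shorter than one interval or too far from a multiple
    if (multiple < 1 || std::fabs(gap - multiple * intervalSec) > toleranceSec)
    {
      return false;
    }
  }

  return true;
}
```

With the default -subCacheIval of 60 seconds, a call such as matchesInterval(timestamps, 60.0, 1.0) would be run once per source IP, so that CBs started at different moments do not mix their sequences.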


fgalan commented Mar 6, 2024

Ref https://www.mongodb.com/docs/manual/reference/command/listDatabases/

Use nameOnly to make the operation lighter.
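
For reference, the lighter command has the shape { listDatabases: 1, nameOnly: true }, which asks MongoDB for database names only instead of names plus size information. The sketch below just renders that command document with the standard library so the two variants can be compared. It is not Orion's code: renderCommand and listDatabasesCommand are hypothetical helpers, and the real change would go through the BSON builder shown in getOrionDatabases().

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of Orion: renders an ordered list of
// key/value pairs as a shell-style command document. Values are passed
// already rendered ("1", "true", ...) to keep the sketch small.
std::string renderCommand(const std::vector<std::pair<std::string, std::string>>& fields)
{
  std::string out = "{ ";

  for (std::size_t ix = 0; ix < fields.size(); ++ix)
  {
    if (ix > 0)
    {
      out += ", ";
    }
    out += fields[ix].first + ": " + fields[ix].second;
  }

  out += " }";
  return out;
}

// Builds the listDatabases command, optionally adding the nameOnly flag
std::string listDatabasesCommand(bool nameOnly)
{
  std::vector<std::pair<std::string, std::string>> fields = {{"listDatabases", "1"}};

  if (nameOnly)
  {
    fields.push_back({"nameOnly", "true"});
  }

  return renderCommand(fields);
}
```

The same document can be tried by hand in the shell with db.adminCommand({ listDatabases: 1, nameOnly: true }) to compare its duration against the plain listDatabases call.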


rg2011 commented Mar 6, 2024

> @rg2011 to confirm this theory... could you check whether the "slow query" log entries for listDatabases occur at a frequency that matches the configured -subCacheIval? Note that if you have several CBs working in parallel and they have been started at different moments, you could see several interleaved "sequences".

Yes, it's every 60 seconds.


fgalan commented Mar 7, 2024

PR #4530


fgalan commented Mar 7, 2024

The PR has been merged, but let's keep this issue open until it can be tested in the same environment where @rg2011 detected the problem.


fgalan commented Jun 6, 2024

This has been included in Orion 4.0.0.

A test in that environment is still pending before the issue can be closed.


rg2011 commented Jun 27, 2024

Deployed 4.0.0 in prod environment and confirmed decrease in slow queries. Thanks!

@rg2011 rg2011 closed this as completed Jun 27, 2024