Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roscpp multithreaded spinners eat up CPU when callbacks take too long #1545

Closed
peci1 opened this issue Nov 24, 2018 · 8 comments
Closed

roscpp multithreaded spinners eat up CPU when callbacks take too long #1545

peci1 opened this issue Nov 24, 2018 · 8 comments
Labels

Comments

@peci1
Copy link
Contributor

peci1 commented Nov 24, 2018

When debugging a performance problem with my node, I created an MWE in C++ which shows interesting behavior with serious performance hit. In case you use MultiThreadedSpinner or AsyncSpinner with >1 threads and one of your callbacks takes too long (messages arrive quicker), the spinner eats up a lot of CPU.

I'll post the MWE on Monday, so for now just a textual description. I have a normal node with the MT/Async spinner, I subscribe to one 50 Hz PointCloud2 topic with a single subscriber. The callback contains just ros::Duration::sleep(time).

When running the code with a single thread (either with SingleThreadedSpinner, or even MT spinner with 1 thread), the node takes up about 5% of CPU and just throws away the messages it did not have time to process.

As soon as there are at least 2 threads, CPU consumption goes high (40-400% of CPU) and scales (lineraly?) with the duration I wait in the callback and the number of threads.

My research ended in spinner.cpp, where I see that in case of 1 thread, callAvailable() is used, whereas with more threads, callOne() is used. However, I can't see through all the multi-threading magic in there. I ran a profiler on the node, and most of the time is spent with mutexes and other sync primitives.

I haven't tried yet with SubscribeOptions.allow_concurrent_callbacks .

Possibly related:

@peci1
Copy link
Contributor Author

peci1 commented Nov 26, 2018

The MWE:

#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>

class TestClass {
    ros::NodeHandle pubNh;
    ros::Subscriber s;

public:
    TestClass();
    void callback(sensor_msgs::PointCloud2ConstPtr msg);
};

TestClass::TestClass() :
      s(pubNh.subscribe("/scan_point_cloud", 10, &TestClass::callback, this)) {

}

void TestClass::callback(sensor_msgs::PointCloud2ConstPtr msg) {
  ros::Duration(0.01).sleep();
}

int main(int argc, char **argv) {
  ros::init(argc, argv, "Transform_Pointcloud_and_Filter_It_Node");
  TestClass PclTransfWithFilt;

  ros::MultiThreadedSpinner spinner(2);

  spinner.spin();

  return 0;
}

Play with the duration of sleep in the callback to see effects of this bug.

@peci1
Copy link
Contributor Author

peci1 commented Nov 27, 2018

I found the busy-wait loop.

It's caused by returning TryAgain from SubscriptionQueue::call() without making sure the next trial will not happen right away.

SubscriptionQueue::call()

...
boost::recursive_mutex::scoped_try_lock lock(callback_mutex_, boost::defer_lock);
if (!allow_concurrent_callbacks_)
  {
    lock.try_lock();
    if (!lock.owns_lock())
    {
      return CallbackInterface::TryAgain;
    }
}
...

The loops in CallbackQueue::callOneCB() and CallbackQueue::callOne also do not insert any waiting after the TryAgain is returned and the callback is pushed back to the queue. That results in the busy-wait when one thread owns the lock and all other threads keep getting TryAgain and the callbacks are put back to the queue and immediately executed again.

I verified two workarounds:

  1. Use allow_concurrent_callbacks in the subscriber.
  2. Put a little (10ms) wall-time wait after or before returning the TryAgain (or use something like try_lock_for()). However, I see that this is not a very good solution, since somebody could get upset with a 10ms wait here which may screw a lot of things up.

I'm thinking (and I'll try to write) of a solution with something like a conditional variable that'd be used in case allow_concurrent_callbacks is false.

If somebody has a better idea, please, help :)

@fujitatomoya
Copy link
Contributor

@peci1
what version of ros_comm do you use?
is that kinetic-devel 1.12.13 or later?

@peci1
Copy link
Contributor Author

peci1 commented Dec 3, 2018

No, I'm on indigo. However, I looked through the changes since then till melodic and none seem to be affecting this behavior (at least from looking at the code and commit messages).

@peci1
Copy link
Contributor Author

peci1 commented Dec 3, 2018

@fujitatomoya I tested in melodic now and the behavior is the same.

@peci1
Copy link
Contributor Author

peci1 commented Jan 30, 2019

@fujitatomoya See PR #1602 , I have the tests and a solution for this issue.

@peci1
Copy link
Contributor Author

peci1 commented Jan 31, 2019

PR #1608 against melodic-devel.

@peci1
Copy link
Contributor Author

peci1 commented Feb 14, 2023

Fixed by #1684 .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants