roscpp multithreaded spinners eat up CPU when callbacks take too long #1545

peci1 · 2018-11-24T01:30:36Z

When debugging a performance problem with my node, I created an MWE in C++ which shows interesting behavior with serious performance hit. In case you use MultiThreadedSpinner or AsyncSpinner with >1 threads and one of your callbacks takes too long (messages arrive quicker), the spinner eats up a lot of CPU.

I'll post the MWE on Monday, so for now just a textual description. I have a normal node with the MT/Async spinner, I subscribe to one 50 Hz PointCloud2 topic with a single subscriber. The callback contains just ros::Duration::sleep(time).

When running the code with a single thread (either with SingleThreadedSpinner, or even MT spinner with 1 thread), the node takes up about 5% of CPU and just throws away the messages it did not have time to process.

As soon as there are at least 2 threads, CPU consumption goes high (40-400% of CPU) and scales (lineraly?) with the duration I wait in the callback and the number of threads.

My research ended in spinner.cpp, where I see that in case of 1 thread, callAvailable() is used, whereas with more threads, callOne() is used. However, I can't see through all the multi-threading magic in there. I ran a profiler on the node, and most of the time is spent with mutexes and other sync primitives.

I haven't tried yet with SubscribeOptions.allow_concurrent_callbacks .

Possibly related:

The text was updated successfully, but these errors were encountered:

peci1 · 2018-11-26T15:15:02Z

The MWE:

#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>

class TestClass {
    ros::NodeHandle pubNh;
    ros::Subscriber s;

public:
    TestClass();
    void callback(sensor_msgs::PointCloud2ConstPtr msg);
};

TestClass::TestClass() :
      s(pubNh.subscribe("/scan_point_cloud", 10, &TestClass::callback, this)) {

}

void TestClass::callback(sensor_msgs::PointCloud2ConstPtr msg) {
  ros::Duration(0.01).sleep();
}

int main(int argc, char **argv) {
  ros::init(argc, argv, "Transform_Pointcloud_and_Filter_It_Node");
  TestClass PclTransfWithFilt;

  ros::MultiThreadedSpinner spinner(2);

  spinner.spin();

  return 0;
}

Play with the duration of sleep in the callback to see effects of this bug.

peci1 · 2018-11-27T15:54:56Z

I found the busy-wait loop.

It's caused by returning TryAgain from SubscriptionQueue::call() without making sure the next trial will not happen right away.

SubscriptionQueue::call()

...
boost::recursive_mutex::scoped_try_lock lock(callback_mutex_, boost::defer_lock);
if (!allow_concurrent_callbacks_)
  {
    lock.try_lock();
    if (!lock.owns_lock())
    {
      return CallbackInterface::TryAgain;
    }
}
...

The loops in CallbackQueue::callOneCB() and CallbackQueue::callOne also do not insert any waiting after the TryAgain is returned and the callback is pushed back to the queue. That results in the busy-wait when one thread owns the lock and all other threads keep getting TryAgain and the callbacks are put back to the queue and immediately executed again.

I verified two workarounds:

Use allow_concurrent_callbacks in the subscriber.
Put a little (10ms) wall-time wait after or before returning the TryAgain (or use something like try_lock_for()). However, I see that this is not a very good solution, since somebody could get upset with a 10ms wait here which may screw a lot of things up.

I'm thinking (and I'll try to write) of a solution with something like a conditional variable that'd be used in case allow_concurrent_callbacks is false.

If somebody has a better idea, please, help :)

fujitatomoya · 2018-12-03T06:18:01Z

@peci1
what version of ros_comm do you use?
is that kinetic-devel 1.12.13 or later?

peci1 · 2018-12-03T09:10:12Z

No, I'm on indigo. However, I looked through the changes since then till melodic and none seem to be affecting this behavior (at least from looking at the code and commit messages).

peci1 · 2018-12-03T12:10:46Z

@fujitatomoya I tested in melodic now and the behavior is the same.

peci1 · 2019-01-30T19:55:12Z

@fujitatomoya See PR #1602 , I have the tests and a solution for this issue.

peci1 · 2019-01-31T12:59:19Z

PR #1608 against melodic-devel.

peci1 · 2023-02-14T19:28:17Z

Fixed by #1684 .

peci1 mentioned this issue Jan 30, 2019

[WIP] Fix subscription busy wait #1602

Closed

dirk-thomas added the bug label Jan 30, 2019

peci1 mentioned this issue Jan 31, 2019

[WIP] Fix subscription busy wait (melodic-devel) #1608

Closed

This was referenced Jun 18, 2019

Simplify implementation of CallbackQueue::callOne() cwecht/ros_comm#1

Closed

rtt_roscomm: set allow_concurrent_callbacks flag orocos/rtt_ros_integration#129

Closed

drjsmith mentioned this issue Sep 22, 2021

getMTNodeHandle doesn't result in more callbacks in parallel on a single topic ros/nodelet_core#77

Closed

peci1 closed this as completed Feb 14, 2023

Zob314 mentioned this issue Aug 10, 2023

roscpp multithreaded spinners eat up CPU when callbacks take too long #2341

Open

meyerj mentioned this issue Oct 7, 2024

Fix "roscpp multithreaded spinners eat up CPU when callbacks take too long" #2377

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roscpp multithreaded spinners eat up CPU when callbacks take too long #1545

roscpp multithreaded spinners eat up CPU when callbacks take too long #1545

peci1 commented Nov 24, 2018

peci1 commented Nov 26, 2018

peci1 commented Nov 27, 2018

fujitatomoya commented Dec 3, 2018

peci1 commented Dec 3, 2018

peci1 commented Dec 3, 2018

peci1 commented Jan 30, 2019

peci1 commented Jan 31, 2019

peci1 commented Feb 14, 2023

roscpp multithreaded spinners eat up CPU when callbacks take too long #1545

roscpp multithreaded spinners eat up CPU when callbacks take too long #1545

Comments

peci1 commented Nov 24, 2018

peci1 commented Nov 26, 2018

peci1 commented Nov 27, 2018

fujitatomoya commented Dec 3, 2018

peci1 commented Dec 3, 2018

peci1 commented Dec 3, 2018

peci1 commented Jan 30, 2019

peci1 commented Jan 31, 2019

peci1 commented Feb 14, 2023