Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timer that can be triggered a finite number of times. #2006

Open
wants to merge 17 commits into
base: rolling
Choose a base branch
from

Conversation

Voldivh
Copy link
Contributor

@Voldivh Voldivh commented Aug 26, 2022

This PR addresses the #1990. Basically it gives the timers API the possibility to select the number of times the timer will be triggered and be canceled after that using a variable. This variable (amount_of_callbacks) have the default value of 0 which indicates that the timer will be triggered infinite times until it is 'manually' cancelled, i.e., the same behavior the times have at the moment. For example, initializing the timer with amount_of_callbacks = 1, would create an one shot timer, which would be triggered once and then would be canceled.

@Voldivh Voldivh marked this pull request as ready for review August 26, 2022 15:18
@alsora
Copy link
Collaborator

alsora commented Aug 27, 2022

This implementation adds overhead to existing timers.
We should make sure that the added functionality does not affect timers that don't want to use it.

@Voldivh
Copy link
Contributor Author

Voldivh commented Aug 30, 2022

This implementation adds overhead to existing timers. We should make sure that the added functionality does not affect timers that don't want to use it.

I agree. I thought about doing the implementation in another class, however, that option would imply copying all the code from an existing class and just modify a couple of lines and would also reduce the discoverability of the feature. What would you recommend to do in this case so that we can make sure the feature doesn't affect the existing timers?

@tfoote
Copy link
Contributor

tfoote commented Aug 30, 2022

With adding generality there is always going to be overhead. There are different dimensions of overhead, there's pure runtime, there's memory footprint, there's development effort, and there's api complexity all of which are side effects of design decisions.

For core C++ libraries often they can avoid runtime overhead by doing template meta-programming and other advanced features. Another alternative would be if we were to change this API to be a factory pattern with generators that leverage different potential backend implementations. I don't think that's something that we want to get into.

But if I look at this for what the effect to the user will be. It's increasing the memory footprint by 2 unsigned integers. And in each timer cycle there's one if statement checking if an unsigned integer is non-zero. This is a very low overhead for having a generic API available. It keeps the API simple and compact. And does not increase the code complexity significantly. If we are to add another class with a parallel implmentation it will likely have a higher memory impact as both classes will be in parallel in the shared libraries and then potentially loaded into memory too, with a total extra space of much greater than 2 unsigned ints of space.

And looking a little bit closer the user's callback is invoked before the if statement so to the user the only potential slowdown is that the CPU has to do approximately one extra cycle prior to recomputing the timeout and enqueuing itself back into the waitset. As such the user will not have any effect on their timing either delaying their callback, nor delaying their next callback. With the exception that they overran their last callback time buffer and then it will be approximately 1 cpu clock cycle later if the compiler has not optimized out the check that this code path will clearly not take.

I don't think that there's a way to do this more efficiently that's both simple and easy for the user to leverage following our established API and design patterns as well as not adding significant extra lines of code for ongoing maintenance and testing. As such I would suggest going with this current quite simple solution.

@alsora
Copy link
Collaborator

alsora commented Aug 31, 2022

@tfoote IMO the ROS 2 design should be rigorous about how new features are added.

This "callback counter" is a convenience utility and, as we discussed in the issue, it's something that users could very easily do from the application code.
It looks like several people think that this is a very important feature and, as such, it should be available out of the box.
I'm ok with that, but this should not affect users that don't want to use it (this should apply to all features, but in particular to utilities).

This particular PR is adding an overhead that is probably negligible, but still doing an exception for this PR today, then for another PR tomorrow and so is not good.
Besides eventually having a noticeable impact on performance, it's also not a good design practice as there isn't a clear separation between the c++ core (i.e. a wrapper around rcl) and the all the stuff built on top of it.

Performance is one of the main problems of ROS 2, and there are a lot of on-going efforts to make ROS 2 more efficient, but this is possible only if performance is taken in serious consideration at all the layers.

My recommendation is to handle this via a new OneShotTimer class that simply has to store the additional counter and override the execute_callback method.

@alsora
Copy link
Collaborator

alsora commented Aug 31, 2022

@tfoote @Voldivh sorry for the short notice, but would you be available to have a discussion on this in the ROS 2 Client Library Working Group? It's today (August 31th) at 8AM Pacific Time: https://meet.google.com/pxe-ohqa-cjm

@tfoote
Copy link
Contributor

tfoote commented Aug 31, 2022

This particular PR is adding an overhead that is probably negligible, but still doing an exception for this PR today, then for another PR tomorrow and so is not good.

Saying that there might be another PR tomorrow that might slow down performance is a reason to not take this one, is not really a valid critique of this PR.

there are a lot of on-going efforts to make ROS 2 more efficient, but this is possible only if performance is taken in serious consideration at all the layers.

Indeed, have there ever been any metrics at issue or proposed which this change will effect adversely? What is a metric on the performance of this PR that is acceptable vs unacceptable? As this is in the non-zero case an unchanging variable I suspect that the compiler will likely even detect that as such and this if statement never get even evaluated.

Are there any known or perceived cases when this will ever become a bottleneck to computation that we will need to optimize? If not this seems like a case of premature optimization for runtime overhead at the cost of many other aspects of the system.

Forcing things into separate classes and implementations has notable other deleterious effects such as slowing down developer/user velocity and increasing our support burden with the number of lines of code, tests and documentation to maintain. These are all real costs which will be born out by both us and users in the community and trades off one form of overhead for another.

I'm ok with that, but this should not affect users that don't want to use it (this should apply to all features, but in particular to utilities).

This is a great design goal but it cannot be applied in the abstract on a single metric. All of the above costs I've outlined above also effect the entire community.

it's also not a good design practice as there isn't a clear separation between the c++ core (i.e. a wrapper around rcl) and the all the stuff built on top of it.

I think that this is one of the fundamental area where we're looking at this differently. I see the rclcpp API as a user facing API that should be designed to optimize the development of the end user. To the end user there is absolutely no difference whether the code is implemented under the hood using rcl or locally implemented in rclcpp. If under the hood in the implementation you want to make a distinction as to how it's implemented that's great. But it should not be something the the user sees or cares about. If we were to extend the functionality that is implemented in rcl and change the backend of that feature in rclcpp, an rclcpp API user should not notice or care. rclcpp's public API is an abstraction which should hide all of those underlying implementation details.

@tfoote
Copy link
Contributor

tfoote commented Sep 1, 2022

So I was thinking about this a bit last night. And I think that a better solution that can both keep the default API generic and fully featured and separately maintain a highly optimized path. More concretely we will have the default Timer with all the features and it will have slightly higher overhead, and separately we have the LowOverheadTimer(naming TBD, using this for now for clarity) which is highly optimized and has a less user centric API that developers who are worried about the runtime overhead can opt into the more optimized one. If we're worried about a single if statement I suspect that there's more optimizations that can be achieved by stripping out more of the layers that already exist.

This will then allow the majority of users who are not going to be effected by the overhead to have the familiar API with all the bells and whistles. And the small fraction of users who want/need to optimize things can choose to use the lower level interface which does not need to be as cleanly presented or fully featured which will allow even more runtime optimizations than we already offer.

@alsora
Copy link
Collaborator

alsora commented Sep 2, 2022

I like this proposal @tfoote!
This definitely goes into the direction that I'm pushing for the ROS 2 libraries: expose both user-friendly APIs as well as optimized bare-bone primitives.
The latter should also be used internally in the core library.

I really want to apologize if my comments before were sounding too strong, in hindsightI shouldn't have started this discussion on this very PR, but rather just started it separately.
However, I'm glad that we had some good discussions on these very important topics!

One of the features developed by iRobot is the Events Executor.
The main reason why this executor is more efficient than the default one is exactly because it skips all the layers, which currently are not zero-overhead wrappers.
There's a lot of possible optimizations in the ROS stack, to implement them it's really fundamental to have a clear design in mind that makes ROS usable both by people who don't care about performance and resource management (and are happy to delegate it to the ROS stack) and by people who on the other hand must have full control over the operations to run them on resource constrained hardware.

@fujitatomoya
Copy link
Collaborator

@tfoote @alsora thanks for sharing your thoughts.

IMO, this feature is basic control for timer, this is acceptable and reasonable.
I think priority of rclcpp is what to provide interface for user application, and then we can review how efficiently we can implement?

If we're worried about a single if statement I suspect that there's more optimizations that can be achieved by stripping out more of the layers that already exist.

This is so true. Is this something we really want to do as design driver? I am just confused that user client library driver...
If we are worried about single if statement, i think we have clear complexity threshold on specific platform as requirement, probably we would not even use rclcpp. (i think this is one of the driver of https://github.com/ros2/rclc.)

It will be really difficult to support new feature in rclcpp, if the performance comes 1st priority.
and what about the other client library like rclpy? the same driver should be applied as user client library?

Just a bit confused about where this is going to land.

thanks!

@fujitatomoya
Copy link
Collaborator

CC: @ros2/team

@clalancette
Copy link
Contributor

It will be really difficult to support new feature in rclcpp, if the performance comes 1st priority.

So given where we are in the lifecycle in ROS 2, my opinion here is that performance doesn't need to be 1st priority, but it definitely needs to be a priority. We've heard a lot of grumbling from the community about how the performance of ROS 2 is worse than ROS 1 in a lot of ways. While we have a lot of knobs that people can use to make the system more performant, the fact of the matter is that out-of-the-box, ROS 2 currently isn't great at this. So any new features that go in need to be weighed against this fact.

That said, for this particular feature, I agree that the cost of an extra branch is likely going to be lost in the noise here. There's an easy way to test; do some performance testing before and after this PR with timers. If we can't measure the overhead (as I suspect), then I would be for this PR going in more-or-less as-is.

@alsora
Copy link
Collaborator

alsora commented Sep 2, 2022

Our analysis showed that one of the major CPU bottlenecks of ROS 2 is caused by how its abstraction layers are implemented.
In particular, getting data from the user application to the DDS and viceversa is very expensive (these should be almost negligible, we should be just passing a pointer around).
The source of the overhead is not one big, expensive operation (i.e. something that would be easily detected if we had performance measurements in the PR process), but rather it comes from summing together a lot of if/else, checks and other stuff that on its own has a negligible impact on performance.

While developing the events executor, we did some tests where we invoked the DDS APIs as close as possible to the application layer.
This is obviously bad from an architecture point of view, but it improved the performance a lot.

The "fast-path" i.e. that sequence of operations that is invoked at very high frequency while an application is running has to be cleaned from all operations that aren't required.
We all love utilities and wrappers, but they should be implemented on top of efficient primitives, which are also accessible to users if they need them.

This PR is not a problem per se, merging it will have no impact on performance for 99.9% of the users (me included) and its overhead would be hard to measure, however it's a clear example of how the ROS 2 design does not follow the "you don't pay for what you don't use" principle in its fast-path
Users that want a timer that runs forever will have to check that if-condition every time, similarly, users with an executor with only a rclcpp::Waitable have to go through ~20 if-conditions every time the executor wakes up,
https://github.com/ros2/rclcpp/blob/rolling/rclcpp/src/rclcpp/executor.cpp#L826.
These are just examples and there are a lot of them unfortunately.

User-friendly, feature-rich APIs and efficient primitives are in my opinion both equally important requirements for ROS 2, but they should be built one on top of the other rather than be mixed together.

Signed-off-by: Voldivh <[email protected]>
@Voldivh
Copy link
Contributor Author

Voldivh commented Sep 7, 2022

After reading all the discussion and the common ground reached regarding the software design, I ran some simple test on my local host to check the overhead this PR would include. I basically created a timer with a callback and checked if the timer was being fired on time (according to the period) using std::chrono::high_resolution_clock as the watchdog of the timer. Also I kept looking at the general CPU and memory load of the process with the proposed feature and without it (i.e. rolling). Here are the results I got on my local host:

As for the timer being triggered on time, I only ran it for a few callbacks and all were triggered within the time period:

eloy@567080f97543:~/ros2_rolling_ws$ ros2 run testing_pkg timer 
Callbacks called: 100
Missed Deadlines: 0

As for the workload on the CPU and the memory usage, take a look at the yellow row from the next pictures:

Feature Branch

image

Rolling

image

As we expected, the CPU/memory consumption is basically the same for both cases. I know this was a simple test to run that does not show an actual benchmark test for the feature. However, I believe it shows that the overhead is so small that it doesn't affect the behavior of the timers.

rclcpp/include/rclcpp/create_timer.hpp Outdated Show resolved Hide resolved
rclcpp/include/rclcpp/create_timer.hpp Outdated Show resolved Hide resolved
rclcpp/include/rclcpp/timer.hpp Outdated Show resolved Hide resolved
rclcpp/include/rclcpp/timer.hpp Show resolved Hide resolved
rclcpp/include/rclcpp/timer.hpp Outdated Show resolved Hide resolved
rclcpp/test/rclcpp/test_create_timer.cpp Show resolved Hide resolved
*/
template<typename DurationRepT = int64_t, typename DurationT = std::milli, typename CallbackT>
typename rclcpp::WallTimer<CallbackT>::SharedPtr
create_wall_timer(
std::chrono::duration<DurationRepT, DurationT> period,
CallbackT callback,
rclcpp::CallbackGroup::SharedPtr group = nullptr);
rclcpp::CallbackGroup::SharedPtr group = nullptr,
unsigned int amount_of_callbacks = 0);
Copy link
Collaborator

@alsora alsora Sep 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this variable is of type uint32_t in most of the other instances, please use the same type also here and in Node::create_wall_timer( for consistency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. Done.

Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@fujitatomoya
Copy link
Collaborator

@Voldivh could you address the conflicts?

@Voldivh
Copy link
Contributor Author

Voldivh commented Jun 6, 2023

@Voldivh could you address the conflicts?

Sure thing, will do.

@fujitatomoya
Copy link
Collaborator

same here, build failed https://build.ros2.org/job/Rpr__rclcpp__ubuntu_jammy_amd64/906/console

10:51:52 /tmp/ws/src/rclcpp/rclcpp/include/rclcpp/subscription.hpp:208:5: error: there are no arguments to ‘TRACETOOLS_TRACEPOINT’ that depend on a template parameter, so a declaration of ‘TRACETOOLS_TRACEPOINT’ must be available [-fpermissive]
10:51:52   208 |     TRACETOOLS_TRACEPOINT(
10:51:52       |     ^~~~~~~~~~~~~~~~~~~~~
10:51:52 gmake[2]: *** [CMakeFiles/rclcpp.dir/build.make:181: CMakeFiles/rclcpp.dir/src/rclcpp/any_executable.cpp.o] Error 1
10:51:52 gmake[1]: *** [CMakeFiles/Makefile2:138: CMakeFiles/rclcpp.dir/all] Error 2

@Voldivh
Copy link
Contributor Author

Voldivh commented Jun 8, 2023

same here, build failed https://build.ros2.org/job/Rpr__rclcpp__ubuntu_jammy_amd64/906/console

10:51:52 /tmp/ws/src/rclcpp/rclcpp/include/rclcpp/subscription.hpp:208:5: error: there are no arguments to ‘TRACETOOLS_TRACEPOINT’ that depend on a template parameter, so a declaration of ‘TRACETOOLS_TRACEPOINT’ must be available [-fpermissive]
10:51:52   208 |     TRACETOOLS_TRACEPOINT(
10:51:52       |     ^~~~~~~~~~~~~~~~~~~~~
10:51:52 gmake[2]: *** [CMakeFiles/rclcpp.dir/build.make:181: CMakeFiles/rclcpp.dir/src/rclcpp/any_executable.cpp.o] Error 1
10:51:52 gmake[1]: *** [CMakeFiles/Makefile2:138: CMakeFiles/rclcpp.dir/all] Error 2

Same question here with this commit.

@fujitatomoya
Copy link
Collaborator

CI:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

(Rpr job is supposed to fail.)

@fujitatomoya
Copy link
Collaborator

@alsora could you review your comment?

@@ -91,7 +91,8 @@ create_timer(
rclcpp::Duration period,
CallbackT && callback,
rclcpp::CallbackGroup::SharedPtr group = nullptr,
bool autostart = true)
bool autostart = true,
uint32_t amount_of_callbacks = 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint32_t amount_of_callbacks = 0)
uint32_t number_of_callbacks = 0)

@@ -100,7 +101,8 @@ create_timer(
group,
node_base.get(),
node_timers.get(),
autostart);
autostart,
amount_of_callbacks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
amount_of_callbacks);
number_of_callbacks);

@@ -112,7 +114,8 @@ create_timer(
rclcpp::Duration period,
CallbackT && callback,
rclcpp::CallbackGroup::SharedPtr group = nullptr,
bool autostart = true)
bool autostart = true,
uint32_t amount_of_callbacks = 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint32_t amount_of_callbacks = 0)
uint32_t number_of_callbacks = 0)

@@ -121,7 +124,8 @@ create_timer(
group,
rclcpp::node_interfaces::get_node_base_interface(node).get(),
rclcpp::node_interfaces::get_node_timers_interface(node).get(),
autostart);
autostart,
amount_of_callbacks);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
amount_of_callbacks);
number_of_callbacks);

Comment on lines 144 to 145
* \param amount_of_callbacks Quantity of times the callback will be triggered. The default value is 0
* that means an infinite amount of callbacks i.e. the clock's default behavior.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* \param amount_of_callbacks Quantity of times the callback will be triggered. The default value is 0
* that means an infinite amount of callbacks i.e. the clock's default behavior.
* \param number_of_callbacks number of times the callback will be triggered. The default value is 0
* that means an infinite amount of callbacks i.e. the timer's default behavior.

@@ -317,6 +335,7 @@ class GenericTimer : public TimerBase
RCLCPP_DISABLE_COPY(GenericTimer)

FunctorT callback_;
uint32_t amount_of_callbacks_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint32_t amount_of_callbacks_;
uint32_t number_of_callbacks_;

@@ -337,14 +356,17 @@ class WallTimer : public GenericTimer<FunctorT>
* \param callback The callback function to execute every interval
* \param context node context
* \param autostart timer state on initialization
* \param amount_of_callbacks Quantity of times the callback will be triggered.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* \param amount_of_callbacks Quantity of times the callback will be triggered.
* \param number_of_callbacks Number of times the callback will be triggered.

*/
WallTimer(
std::chrono::nanoseconds period,
FunctorT && callback,
rclcpp::Context::SharedPtr context,
bool autostart = true)
bool autostart = true,
uint32_t amount_of_callbacks = 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uint32_t amount_of_callbacks = 0)
uint32_t number_of_callbacks = 0)

: GenericTimer<FunctorT>(
std::make_shared<Clock>(RCL_STEADY_TIME), period, std::move(callback), context, autostart)
std::make_shared<Clock>(RCL_STEADY_TIME), period, std::move(callback), context, autostart,
amount_of_callbacks)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
amount_of_callbacks)
number_of_callbacks)

Comment on lines 219 to 259

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary whitespace change.

@fujitatomoya
Copy link
Collaborator

as we talked before, this makes more sense if the canceled timer is removed from the wait_set for performance. AFAIK, the canceled timer is NOT removed from the wait_set, those timers are just skipped to execute the callback.

IMO, i think this PR is still useful for syntax sugar for application.

@Voldivh
Copy link
Contributor Author

Voldivh commented Jul 6, 2023

as we talked before, this makes more sense if the canceled timer is removed from the wait_set for performance. AFAIK, the canceled timer is NOT removed from the wait_set, those timers are just skipped to execute the callback.

IMO, i think this PR is still useful for syntax sugar for application.

I took a closer look at the implementation of the cancel method and you are right, a canceled timer is not removed from the wait_set it is skipped after checking it's state (canceled or not canceled).

Taking that into account, I do feel that the PR would make more sense if it removes the timer from the wait_set. So far, I've only thought of destroying the timer object after the callbacks are completed, but, would be open to other ideas.

@clalancette
Copy link
Contributor

Taking that into account, I do feel that the PR would make more sense if it removes the timer from the wait_set. So far, I've only thought of destroying the timer object after the callbacks are completed, but, would be open to other ideas.

I don't think we should destroy the timer object. In particular, I can imagine that the user may want to retrigger the same timer multiple times, which they should be able to do by calling the reset method. Destroying the object would prevent that.

In order to remove it from the waitset, I think we'd need to do something more clever where we track inside of the timer object whether we are currently in the waitset. This would be true by default. When we automatically expired via this mechanism, or cancel was called, we would remove ourselves from the waitset and set this as false. The next time that reset was called, we would add ourselves back to the waitset and set this to true.

However, it's even more complicated than that, since at the TimerBase level we have no access to the executor. So we'd somehow have to signal the executor to remove use from the waitset. I'm not quite sure how we would do that.

@Voldivh
Copy link
Contributor Author

Voldivh commented Jul 17, 2023

Taking that into account, I do feel that the PR would make more sense if it removes the timer from the wait_set. So far, I've only thought of destroying the timer object after the callbacks are completed, but, would be open to other ideas.

I don't think we should destroy the timer object. In particular, I can imagine that the user may want to retrigger the same timer multiple times, which they should be able to do by calling the reset method. Destroying the object would prevent that.

In order to remove it from the waitset, I think we'd need to do something more clever where we track inside of the timer object whether we are currently in the waitset. This would be true by default. When we automatically expired via this mechanism, or cancel was called, we would remove ourselves from the waitset and set this as false. The next time that reset was called, we would add ourselves back to the waitset and set this to true.

However, it's even more complicated than that, since at the TimerBase level we have no access to the executor. So we'd somehow have to signal the executor to remove use from the waitset. I'm not quite sure how we would do that.

Ok I see, as I feared, there is no easy way around it then. I'll take some time to look into it and get back to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants