-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix memory leak #1900
Fix memory leak #1900
Conversation
@SteveMacenski how do these changes look? When you run this branch and hit ctrl + c after providing the initial pose and activating all the system nodes, you'll notice |
Could this be a problem in the behavior tree navigator too with the shared ptr in the blackboard?
Sorry, I thought this branch was supposed to resolve that? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in concept. I don't really love this lock need before accessing node_
but I get it. Would this work with a raw pointer as well and not increment up the shared pointer counter? I know that's generally frowned upon, but would avoid that and as long as the plugins go down before the servers (which would always happen) then the nodes are going to be value in that scope.
If we do go this way, I think it would be good to have the lifecycle node wrapper have a logger object we use so that we dont lock the node every time we need to just do something simple like logging.
I made the fix for recovery server, just wanted to check with you to see if this makes sense.
What I'm seeing is that the navigator gets destroyed properly after the initial pose is given and all the nodes are activated. However, it fails during destruction when the action server has an ongoing goal. I'm guessing it's the same issue but what confuses me is the fact that it gets destroyed properly after the initial pose is provided. At this point, all the BT nodes should be instantiated and hold the
That should work. If we pass raw pointers to the plugin objects, the server destructor should still execute since nothing is holding the shared ptr. But I really don't think we should go that way. The plugin interface is that part where users are expected to write there own code and dealing with raw pointers there is just calling for several issues.
I agree with you here, I'm not a fan of this method. A user writing their own controller/planner plugins can still choose to hold a |
I agree
That's the major reason I would want to do raw pointers, stop bad behavior or people blaming the framework for a subtle issue. Even if we document that, its liable to get lost. |
BT Navigator doesn't have a cyclic dependency anywhere. I can't quite figure out why BT Navigator exits badly sometimes, but I've noticed that only happens when the system is killed while a goal is being handled. Once the current goal is reached and the system is killed, everything exits normally. Everything else seems to be fixed now. @SteveMacenski ready for review! I thought about it a lot and decided to go against using weak pointers. Every class that receives a shared_ptr to an rclcpp::Node doesn't hold a reference anymore. Instead, it just holds a pointer to the node interfaces that it needs to get things done. This seems to be the more natural ros2 way of doing things. |
I do wonder though what the best option is of the two: @mkhansenbot @crdelsey @cdelsey @wjwwood any input? |
Does it matter how a pointer to the parent From what I understood by following some core ros2 packages, a child object should never even need to hold a pointer to the parent |
What if we change the API for everything in plugin-land to use That would make it so that users literally don't have the option to store a shared pointer (unless they Between a raw pointer and a I think we need some external opinions on this. I don't have the answer. They all sound not-awesome in different ways. My engineering sense says WeakPtr, my open-source sense says raw pointer, and my 'lets just move on from this' sense says |
I'm not sure of all of the context here, but if you need to use the If you just need it at the beginning, then you can take a As for |
We can't give a couple of the interfaces unfortunately, the primary use-case is to give these nodes to the plugins that may need to themselves log, get parameters, create network interfaces, some have services, etc. I think the weak pointer is the best solution then. Per your link, that's the kind of not-best-practices I was also looking to avoid but also trying to balance the simplest solution. Something like @naiveHobo what do you think? I'd like for us to come to a consensus before we change anything else since this PR affects so many things |
I agree, we should go with the weak pointer. In the last few commits in this PR, I was able to remove the need to store a pointer to the parent |
For things we have in our control, yes. Though the internal of plugins from users we don't have that control and some plugins need the node for services / topics as well. I generally don't like the idea of pulling out interfaces from the Side question: would it work to have a weak pointer, lock it, and only unlock it in the destructor of the plugins? Would that have the intended outcome? |
I don't think that will work. The plugin objects are destroyed only when the parent |
Got it |
@SteveMacenski this should be ready for review, though CI is failing because of the master to main switch. |
@ruffsl This also fails on main branch too |
To reiterate, PR CI jobs are created using the config from branch that is to be pulled from, not necessarily the same as the version of the config files from the branch the is to be merged into. This this affords external contributions to test/customize config changes within the PR jobs themselves. It also means that if contributors wish to use the latest CI config, they'll need to rebase their PR on top of the branch they wish to merge into. This PR seems to still be using the CI config from before the migration. It would be nice to be able to test the merged result, but it seems like this is still a long open feature request for CircleCI: https://ideas.circleci.com/ideas/CCI-I-926 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a few themes:
- Going back over the github view and auditing the use of SharedPtr, in some cases you didn't change them and it's not immediately clear which should be weak or shared ptrs.
- initializing loggers with empty strings, lets not do that, that will hurt us alot
- In many cases the clock nor the logger were properly initialized to actually get the clock or logger from the node
- Removed destroy logging, though I now see you added it to the lifecycle node destructor, you can ignore those comments.
nav2_costmap_2d/include/nav2_costmap_2d/clear_costmap_service.hpp
Outdated
Show resolved
Hide resolved
nav2_costmap_2d/include/nav2_costmap_2d/costmap_2d_publisher.hpp
Outdated
Show resolved
Hide resolved
nav2_navfn_planner/include/nav2_navfn_planner/navfn_planner.hpp
Outdated
Show resolved
Hide resolved
How would an issue manifest if we didn't Some of these objects have activate / deactive functions. Could it make sense to put the lock and have a shared pointer in the activate and then unlock in deactivate so we don't have to deal with the semantics regularly (but still have ability to kill it before we enter any destructor items) |
If you end up using a weak_ptr without locking, it won't have any of the base class functions available. The only way you can access the object the weak_ptr points to is by locking it. So it should crash I think.
It's the same issue again. If the node is killed during any lifecycle state before deactivating the node, the child objects would be holding a shared_ptr to the parent |
Lets just verify that. The CPP docs on this aren't very specific. It would be good to know what it would look like if we made a mistake. This touches a bunch of code, so sooner or later we're going to have a problem from someone's PRs or missing something here. It would be good to know what the signals are. The docs just say
OK |
Codecov Report
@@ Coverage Diff @@
## main #1900 +/- ##
==========================================
- Coverage 76.25% 75.66% -0.60%
==========================================
Files 224 224
Lines 10956 11014 +58
==========================================
- Hits 8355 8334 -21
- Misses 2601 2680 +79
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
@SteveMacenski Understandably, it's a compile time error: |
Let me know when I should look over this again |
Should be good for review right now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to really write a good section in the Foxy -> Galactic migration guide about this change, why it was necessary, and how to properly use.
Obviously this is a massive PR, so I will likely need to read through it another 2-3 times just to see if I catch anything else. It has little to do with you and more to do with the massive scope.
I'm a little concerned about the logger / clock initialization. For the costmap layers, I agree that logger_ is set in the constructor to layer, so no need for them to appear in there. I would though like you to do a review yourself to make sure everywhere has them properly set. This will be especially an issue that only certain users will run into because it could be easily masked and cause subtle issues. This touches so much, I can't say definitively that I was able to find 100% of them. Logs missing nodes would be really annoying to find later. And the wrong clocks used would be really hard to find as well.
@mikeferguson if you have 15-30 minutes and want to help, this would be a great PR to review and sanity check us.
@@ -58,7 +58,7 @@ class SimpleGoalChecker : public nav2_core::GoalChecker | |||
SimpleGoalChecker(); | |||
// Standard GoalChecker Interface | |||
void initialize( | |||
const rclcpp_lifecycle::LifecycleNode::SharedPtr & nh, | |||
const rclcpp_lifecycle::LifecycleNode::WeakPtr & parent, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For all instances: I don't think its generally best practice to pass smart pointers by reference. Weak ptrs might be different, but I don't see any documentation about that. Why not have the weakptr be by value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From this comment: #1900 (comment) what you say makes sense to me, but the conventions around smart pointers are the opposite. We should discuss this and then resolve one way or another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing shared_ptrs by reference makes sense (to avoid the overhead of incrementing and then decrementing the reference count). For a weak_ptr though, it doesn't have the same overhead - although the multiple copy/initialization argument does seem to make since for this use case.
nav2_costmap_2d/include/nav2_costmap_2d/clear_costmap_service.hpp
Outdated
Show resolved
Hide resolved
nav2_costmap_2d/include/nav2_costmap_2d/costmap_2d_publisher.hpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good - there are a few things that appear to be common issues so I didn't call out every one of them:
- When you call weak_ptr.lock() outside of the function it was initially passed in, you really should be checking that the returned shared_ptr is valid before dereferencing it.
- there are a number of logger call lines that have been split into 2-3 lines, but are clearly not too long for the linter. seems that should be fixed?
- see especially the comment on count_subscribers - pretty sure remapping will break unless you use publisher->get_subscription_count().
@@ -58,7 +58,7 @@ class SimpleGoalChecker : public nav2_core::GoalChecker | |||
SimpleGoalChecker(); | |||
// Standard GoalChecker Interface | |||
void initialize( | |||
const rclcpp_lifecycle::LifecycleNode::SharedPtr & nh, | |||
const rclcpp_lifecycle::LifecycleNode::WeakPtr & parent, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing shared_ptrs by reference makes sense (to avoid the overhead of incrementing and then decrementing the reference count). For a weak_ptr though, it doesn't have the same overhead - although the multiple copy/initialization argument does seem to make since for this use case.
@@ -38,7 +38,8 @@ class RosTopicLogger : public BT::StatusChangeLogger | |||
void flush() override; | |||
|
|||
protected: | |||
rclcpp::Node::SharedPtr ros_node_; | |||
rclcpp::Clock::SharedPtr clock_; | |||
rclcpp::Logger logger_{rclcpp::get_logger("bt_navigator")}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be worth noting on these that the names won't actually be used (right? cause these get overwritten in the constructor once we have the SharedPtr?). I could see someone getting confused if they expect the logger to have this name but instead it has the name of whatever the node has been named at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea here was to default the logger object to a reasonable name in case the logger object isn't overridden by a user since a lot of the places that use the logger object are plugin interfaces that could be further implemented by an external user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree with @mikeferguson, if we were going to set them, then I think this is the way to do it, but I think we should ensure that they're always set properly from the node
. For the case of new plugins, they'll be given a new WeakPtr
for them to own and they'll need to store their own logger_
to use. For everything in the codebase itself, we should ensure that its always properly set so that it respects remaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a nice safety mechanism for the specific nodes where that mistake may be possible (costmap_2d comes to mind). I'm not sure that is possible in other places like this instance. We "own" BT navigator's node and logger object, there's no plugin coming along to not properly initialize it
[&]() {RCLCPP_INFO(node_->get_logger(), "%s running...", recovery_name_.c_str());}); | ||
{ | ||
auto node = node_.lock(); | ||
auto timer = node->create_wall_timer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should check node before dereferencing
nav2_navfn_planner/include/nav2_navfn_planner/navfn_planner.hpp
Outdated
Show resolved
Hide resolved
@SteveMacenski @mikeferguson ready for review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just need the OK from Fergs too
I'll review later tonight |
* Fix memory leak in nav2_recoveries * Fix recovery server memory leak (better interface) * Fix costmap2d memory leak * Fix nav2_navfn_planner memory leak * Fix planner server and navfn planner memory leak * Make all rclcpp::Node::SharedPtr argument passing const * Fix controller server and DWB plugins memory leak * Minor fixes * Fix formatting errors * Change all plugin interfaces to use weak_ptr intead of shared_ptr to parent rclcpp::Node * Convert all SharedPtr to WeakPtr * Check shared_ptr after lock and before dereferencing
* Fix memory leak in nav2_recoveries * Fix recovery server memory leak (better interface) * Fix costmap2d memory leak * Fix nav2_navfn_planner memory leak * Fix planner server and navfn planner memory leak * Make all rclcpp::Node::SharedPtr argument passing const * Fix controller server and DWB plugins memory leak * Minor fixes * Fix formatting errors * Change all plugin interfaces to use weak_ptr intead of shared_ptr to parent rclcpp::Node * Convert all SharedPtr to WeakPtr * Check shared_ptr after lock and before dereferencing
This reverts commit 681ccfa.
* Fix memory leak (#1900) * Fix memory leak in nav2_recoveries * Fix recovery server memory leak (better interface) * Fix costmap2d memory leak * Fix nav2_navfn_planner memory leak * Fix planner server and navfn planner memory leak * Make all rclcpp::Node::SharedPtr argument passing const * Fix controller server and DWB plugins memory leak * Minor fixes * Fix formatting errors * Change all plugin interfaces to use weak_ptr intead of shared_ptr to parent rclcpp::Node * Convert all SharedPtr to WeakPtr * Check shared_ptr after lock and before dereferencing * Smac/Hybrid-A* planner (#2021) * adding smac_planner to navigation2 metapackage * adding params to metapackage * update config files * adding navfn benchmark testing * updates to costmap_2d for flexility * update planner API for new changes * adding ompl to underlay because ros2 master doesn't contain the rosdep key * patching templated footprint collision checker * fix typo * updating readme config file * Analytic expansion (#43) * Use OMPL to generate heuristics The calculation is run at every planning cycle. It does not seem to slow down the planner - the calculation time seems to be quick enough that the improvement in graph expansion accounts for it. * Use OMPL to calculate analytic solution when near goal * Make angles multiples of the bin size to stop looping behaviour * Uncrustify * Use faster std::sqrt function * Fix analytic path so that the collision checker has coordinates to check! * Pre-allocate variables in analytic path expansion * Rename typedef to NodeGetter to more accurately describe function * Use distance rather than heuristic to determine when to perform analytic expansion Also force the analytic expansion to run on first iteration in case path is trivial. * Move the check for motion model into the main A* loop * Add copyright notices * Remove comment about relaxing node match tolerances The analytic expansion removes the need for this. * Correctly reset node coordinates when aborting from analytic expansion * Move analytic expansion logic to separate function * Uncrustify * Remove unneeded call to get goal coordinates * Fix the calculation of intervals in the analytic path Reserve the number of candidate nodes we are expecting. Base calculations on intervals rather than points - makes distances between nodes work properly. * Rescale heuristic so that analytic expansions are based on distance * Repeatedly split analytic path in half when checking for collision * Add parameter to control rate of analytic expansion attempts * Uncrustify * Fix incorrect type in templated function * Cpplint * Revert "Repeatedly split analytic path in half when checking for collision" This reverts commit 94d9ee0. There was a marginal speed gain (perhaps!) and the splitting approach made the code harder to understand and maintain. * Uncrustify * Add doxygen comments * Add parameter description for analytic expansion ratio * Set lower limit of 2 on number of iterations between analytic expansions * Reduce expected number of iterations because of analytic completion * Refactor analytic expansion ratio calcs to make logic easier to understand * add readme color * fix linting * ceil from floor (and speed up) * a few updates * fix smac tests * fixing smoother test * remove cost check - to be readded at another time * working last test from debug issues * Update README.md * Update README.md * adding getUseRadius API doxygen Co-authored-by: James Ward <[email protected]> * Adding additional SmacPlanner tests (#2036) * adding some more tests to smac_planner * addtl smac tests * remove unused functions * adding additional constants and smoothers tests (#2038) * Revert "Fix memory leak (#1900)" This reverts commit 681ccfa. * Changed WeakPtr to SharedPtr for compatibility with Foxy Co-authored-by: Sarthak Mittal <[email protected]> Co-authored-by: Steve Macenski <[email protected]> Co-authored-by: James Ward <[email protected]>
* Fix memory leak in nav2_recoveries * Fix recovery server memory leak (better interface) * Fix costmap2d memory leak * Fix nav2_navfn_planner memory leak * Fix planner server and navfn planner memory leak * Make all rclcpp::Node::SharedPtr argument passing const * Fix controller server and DWB plugins memory leak * Minor fixes * Fix formatting errors * Change all plugin interfaces to use weak_ptr intead of shared_ptr to parent rclcpp::Node * Convert all SharedPtr to WeakPtr * Check shared_ptr after lock and before dereferencing
* Fix memory leak in nav2_recoveries * Fix recovery server memory leak (better interface) * Fix costmap2d memory leak * Fix nav2_navfn_planner memory leak * Fix planner server and navfn planner memory leak * Make all rclcpp::Node::SharedPtr argument passing const * Fix controller server and DWB plugins memory leak * Minor fixes * Fix formatting errors * Change all plugin interfaces to use weak_ptr intead of shared_ptr to parent rclcpp::Node * Convert all SharedPtr to WeakPtr * Check shared_ptr after lock and before dereferencing
Basic Info
Description of contribution in a few bullet points
shared_from_this()
cyclic dependencies in major servers which causes a memory leak on killing the system when the node is in any lifecycle state other thanunconfigured
.ros2 run --prefix 'valgrind --leak-check=full'
configure
lifecycle transition usingros2 lifecycle set /server_name configure
planner_server:
recovery_server:
controller_server:
costmap_2d:
Future work that may be required in bullet points
If the changes with the recovery server are approved, I can make similar changes for all servers which use pluginlib.