Add ability to preempt execution and add test #684


Merged
merged 18 commits into moveit:ros2 from MarqRazz:preempt_execution
May 29, 2025

Conversation

MarqRazz

This PR is an update of #363 and adds a test to ensure it is working as expected.

Additional changes I made to make this possible:

  • Added the ability to skip starting RViz when running demo.launch.py (my test uses this launch file and needs RViz disabled)
  • Added a persistent node and action client to Task
    Extended testing of my motion system found that recreating the node on each execution request can cause a segfault after many hours of running. I am continuing to test this and will update with any results, but if you feel this is out of scope I can move this feature to a different PR. Here is a link to the upstream issue with cyclonedds and changes they have made.
[my_motion_node-9] my_motion_node: ./src/core/ddsc/src/dds_handles.c:430: dds_handle_drop_childref_and_pin: Assertion `(cf & HDL_REFCOUNT_MASK) > 0' failed.
[my_motion_node-9] Stack trace (most recent call last) in thread 686402:
[my_motion_node-9] #27   Object "/usr/lib/aarch64-linux-gnu/ld-linux-aarch64.so.1", at 0xffffffffffffffff, in 
[my_motion_node-9] #26   Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffffb3adba4b, in 
[my_motion_node-9] #25   Object "/usr/lib/aarch64-linux-gnu/libc.so.6", at 0xffffb3a7595b, in 
[my_motion_node-9] #24   Object "/usr/lib/aarch64-linux-gnu/libstdc++.so.6.0.33", at 0xffffb3d81adf, in 
...
[my_motion_node-9] #19   Object "/root/ws/install/my_motion_node", at 0xaaaad172b93b, in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release_last_use_cold()
[my_motion_node-9] #18   Object "/root/ws/install/my_motion_node", at 0xaaaad176fff7, in std::_Sp_counted_deleter<rclcpp_action::Client<moveit_task_constructor_msgs::action::ExecuteTaskSolution>*, rclcpp_action::create_client<moveit_task_constructor_msgs::action::ExecuteTaskSolution>(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeGraphInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeLoggingInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeWaitablesInterface>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<rclcpp::CallbackGroup>, rcl_action_client_options_s const&)::{lambda(rclcpp_action::Client<moveit_task_constructor_msgs::action::ExecuteTaskSolution>*)#1}, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
[my_motion_node-9] #17   Object "/root/ws/install/my_motion_node", at 0xaaaad176f58f, in rclcpp_action::create_client<moveit_task_constructor_msgs::action::ExecuteTaskSolution>(std::shared_ptr<rclcpp::node_interfaces::NodeBaseInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeGraphInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeLoggingInterface>, std::shared_ptr<rclcpp::node_interfaces::NodeWaitablesInterface>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::shared_ptr<rclcpp::CallbackGroup>, rcl_action_client_options_s const&)::{lambda(rclcpp_action::Client<moveit_task_constructor_msgs::action::ExecuteTaskSolution>*)#1}::operator()(rclcpp_action::Client<moveit_task_constructor_msgs::action::ExecuteTaskSolution>*) const

@MarqRazz MarqRazz requested review from v4hn and rhaschke May 23, 2025 17:59
Contributor

@rhaschke rhaschke left a comment


The original review comments (#363 (review)) still apply.
It would be much more reasonable to provide access to the ActionServer:

auto as = task.getExecutionActionServer();
auto goal_handle = as->async_send_goal(solution);  // convert the solution to a goal and send it to the action server
...

Additionally, I have a few other remarks:

  • Your approach to reuse demo.launch.py from moveit_task_constructor_demo doesn't work as this is a downstream package of moveit_task_constructor_capabilities
  • You don't use callback groups as in #363. Isn't that needed?
  • Parts of the code are overly complex. I will commit some simplifications.

@MarqRazz
Author

Thanks for the simplifications and updates Robert!

It would be much more reasonable to provide access to the ActionServer:

The problem with this approach is that the user would need access to the ActionClient and node to properly manage the request. I did not think it was a good approach to expose both of those objects to the user's program, but if you have other ideas on how they could properly be managed I'm open to suggestions. For example, here is how execute is currently waiting for the result:

while (result_future.wait_for(std::chrono::milliseconds(10)) != std::future_status::ready) {
	...
	rclcpp::spin_some(execute_solution_node_);
}

I also looked into having the user pass in their own node to execute but it means that they would be responsible for spinning things such that the client request could complete correctly.

The user can always create their own ActionClient and node to handle requests as they see best if they need fine control over execution. Given that execute() is an exposed feature of Task, I feel this is an easy and robust way to make it feature complete. In my use case MTC is wrapped in an ActionServer, and having a single call to preempt planning or execution simplifies my setup: I no longer need to know which state it is in, and things are guaranteed to be properly stopped.

Your approach to reuse demo.launch.py from moveit_task_constructor_demo doesn't work as this is a downstream package of moveit_task_constructor_capabilities

I can copy the launch file into the execution test or move it over to the demo package, do you have a preference on how I fix this?

You don't use callback groups as in #363. Isn't that needed?

I did not find that the ActionServer needs a Reentrant callback group in order to cancel the request.

@rhaschke
Contributor

It would be much more reasonable to provide access to the ActionServer:

The problem with this approach is that the user would need access to the ActionClient and node

Agreed. Given that situation, handling the cancellation internally in Task seems reasonable.
I have one other idea to avoid the busy loop: One could simply cancel the execution goal when calling Task::preempt().
This would require saving the latest goal_handle.

I can copy the launch file into the execution test or move it over to the demo package, do you have a preference on how I fix this?

Could you simply reuse core/test/test.launch.py?

@MarqRazz
Author

I have one other idea to avoid the busy loop: One could simply cancel the execution goal when calling Task::preempt().
This would require saving the latest goal_handle.

I'm still looking into this but wouldn't it still require a busy loop blocking execute from returning until motion or cancel is complete?

@rhaschke
Contributor

rhaschke commented May 27, 2025

No. I think you would simply use the existing code:
rclcpp::spin_until_future_complete(node, result_future)

@MarqRazz
Author

So sorry, I thought I hit quote and it edited your post 😊.

spin can only be called on a node once so we would need a busy loop in preempt to wait for the cancel request to complete. If you are okay with that I can get the change implemented.

@rhaschke
Contributor

spin can only be called on a node once

Really? Is that documented somewhere? What happens if one spins a node from several threads simultaneously?

@MarqRazz
Author

MarqRazz commented May 27, 2025

I was actually thinking of calling spin multiple times on an executor. For example, the single_threaded_executor throws an exception if you do this.

I tried saving the goal handle and moved the cancel call to preempt and my test segfaults if I wait for the cancel result.

void Task::preempt() {
	pimpl()->preempt_requested_ = true;
	if (execute_goal_handle_) {
		auto cancel_future = execute_ac_->async_cancel_goal(execute_goal_handle_);
		if (rclcpp::spin_until_future_complete(execute_solution_node_, cancel_future) !=
		    rclcpp::FutureReturnCode::SUCCESS) {
			RCLCPP_ERROR(execute_solution_node_->get_logger(), "Could not preempt execution");
		}
	}
}

To fix this I tried to make sure the thread was joinable but it still segfaults so I suspect something is going out of scope when cancel and rclcpp::spin_until_future_complete in execute attempt to return at the same time.

1: [test_execution-6]          55: 	// cancel the trajectory half way through the expected execution time
1: [test_execution-6]          56: 	rclcpp::sleep_for(exec_duration.to_chrono<std::chrono::milliseconds>() / 2);
1: [test_execution-6] #6  | Source "/usr/include/c++/13/bits/std_thread.h", line 173, in __terminate
1: [test_execution-6]     |   171:     {
1: [test_execution-6]     |   172:       if (joinable())
1: [test_execution-6]     | > 173: 	std::__terminate();
1: [test_execution-6]     |   174:     }
1: [test_execution-6]       Source "/usr/include/x86_64-linux-gnu/c++/13/bits/c++config.h", line 322, in ~thread [0x5821e663a802]
1: [test_execution-6]         319:   inline void __terminate() _GLIBCXX_USE_NOEXCEPT
1: [test_execution-6]         320:   {
1: [test_execution-6]         321:     void terminate() _GLIBCXX_USE_NOEXCEPT __attribute__ ((__noreturn__));
1: [test_execution-6]       > 322:     terminate();
1: [test_execution-6]         323:   }
1: [test_execution-6]         324: #pragma GCC visibility pop
1: [test_execution-6]         325: }
1: [test_execution-6] #5    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7d6ba97cda54, in std::terminate()
1: [test_execution-6] #4    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7d6ba97e30d9, in 
1: [test_execution-6] #3    Object "/usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.33", at 0x7d6ba97cdff4, in 
1: [test_execution-6] #2    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7d6ba953e8fe, in abort
1: [test_execution-6] #1    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7d6ba955b27d, in raise
1: [test_execution-6] #0    Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7d6ba95b4b2c, in pthread_kill
1: [test_execution-6] Aborted (Signal sent by tkill() 58817 0)

Removing waiting for the cancel result in preempt works as expected. Does this sound like a better approach than manually spinning and calling cancel from execute?

@rhaschke
Contributor

Why does move_group crash on shutdown? This shouldn't happen.

@rhaschke
Contributor

Why does move_group crash on shutdown? This shouldn't happen.

This crash is flaky. With the asan build move_group seems to crash even before running the test binary.

@MarqRazz
Author

Why does move_group crash on shutdown? This shouldn't happen.

move_group has been flaky at shutting down for me for quite a while. Here is the crash I normally see.

[move_group-5] #10   Source "/root/my_ws/src/moveit2/moveit_ros/planning/planning_scene_monitor/src/planning_scene_monitor.cpp", line 549, in scenePublishingThread [0x7d8db9c1afaf]
[move_group-5]         546:       planning_scene_publisher_->publish(msg);
[move_group-5]         547:       if (is_full)
[move_group-5]         548:         RCLCPP_DEBUG(logger_, "Published full planning scene: '%s'", msg.name.c_str());
[move_group-5]       > 549:       rate.sleep();
[move_group-5]         550:     }
[move_group-5] #9    Object "/opt/ros/jazzy/lib/librclcpp.so", at 0x7d8db914c480, in rclcpp::Rate::sleep()
[move_group-5] #8    Object "/opt/ros/jazzy/lib/librclcpp.so", at 0x7d8db90a0238, in rclcpp::Clock::sleep_for(rclcpp::Duration, std::shared_ptr<rclcpp::Context>)

@MarqRazz
Author

This crash is flaky. With the asan build move_group seems to crash even before running the test binary.

I pulled moveit/moveit2:rolling-source and built this branch, but move_group always runs successfully. I'm still working on re-creating the ASan environment so that I can debug what is going on. It looks like it might be an issue loading the ExecuteTaskSolutionCapability, which didn't change here 🤔

[move_group-3] #21   Object "/root/ws_moveit/install/moveit_ros_move_group/lib/moveit_ros_move_group/move_group", at 0x5556cb5dd3ed, in move_group::MoveGroupExe::configureCapabilities()
[move_group-3] #20   Object "/root/ws_moveit/install/moveit_ros_move_group/lib/moveit_ros_move_group/move_group", at 0x5556cb5cc020, in pluginlib::ClassLoader<move_group::MoveGroupCapability>::createUniqueInstance(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
[move_group-3] #19   Object "/root/ws_moveit/install/moveit_ros_move_group/lib/moveit_ros_move_group/move_group", at 0x5556cb5bf4ff, in pluginlib::ClassLoader<move_group::MoveGroupCapability>::loadLibraryForClass(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
...
[move_group-3] #13   Object "/opt/ros/rolling/lib/librcutils.so", at 0x7fca7070d95c, in rcutils_load_shared_library
[move_group-3] #12   Object "/usr/lib/x86_64-linux-gnu/libc.so.6", at 0x7fca702db25e, in dlopen

rhaschke added 2 commits May 28, 2025 18:04
If planning fails, we cannot access the solution...

codecov bot commented May 28, 2025

Codecov Report

Attention: Patch coverage is 73.91304% with 6 lines in your changes missing coverage. Please review.

Project coverage is 49.69%. Comparing base (4de8d23) to head (3fc04e3).
Report is 142 commits behind head on ros2.

Files with missing lines Patch % Lines
core/src/task.cpp 73.92% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             ros2     #684      +/-   ##
==========================================
+ Coverage   42.47%   49.69%   +7.22%     
==========================================
  Files          83       99      +16     
  Lines        8086     8967     +881     
==========================================
+ Hits         3434     4455    +1021     
+ Misses       4652     4512     -140     


@MarqRazz
Author

@rhaschke I was able to reproduce the issue locally by pulling moveit/moveit2:rolling-source, making a separate workspace for MTC, and building only MTC with --cmake-args -DCMAKE_CXX_FLAGS=-fsanitize=address.

The plugin failing to load when built with -fsanitize=address strongly suggests an ABI incompatibility or a runtime linking issue between the plugin compiled with AddressSanitizer and other parts of the ROS 2 system (like class_loader, move_group itself, or underlying libraries) that were not compiled with it.

From my research, if you link or dynamically load modules where some are instrumented and some are not, you can run into crashes, especially around memory allocation/deallocation or when the sanitizer's runtime tries to manage state across these boundaries.

Do we have access to a docker container with MoveIt also built with --cmake-args -DCMAKE_CXX_FLAGS=-fsanitize=address?
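For reference, one way to avoid the mixed-instrumentation boundary is to build both MoveIt and MTC in a single workspace with ASan enabled. This is a hedged build-configuration sketch, not a command from the PR; exact flags, mixin availability, and package names depend on the workspace:

```shell
# Hypothetical: instrument the whole workspace so no ASan/non-ASan plugin
# boundary exists. If colcon mixins are installed, the asan-gcc mixin does
# this in one flag:
colcon build --mixin asan-gcc

# Otherwise pass the sanitizer flags explicitly (compile AND link):
colcon build --cmake-args -DCMAKE_CXX_FLAGS=-fsanitize=address \
             -DCMAKE_EXE_LINKER_FLAGS=-fsanitize=address

# Leak checking can be silenced at test time if startup noise drowns out
# the report being investigated:
ASAN_OPTIONS=detect_leaks=0 colcon test
```

Loading an ASan-instrumented plugin into an uninstrumented host via dlopen is a known-problematic configuration, which fits the pluginlib/dlopen frames in the backtrace above.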

@rhaschke rhaschke merged commit 584197a into moveit:ros2 May 29, 2025
5 checks passed
MarqRazz added a commit to MarqRazz/moveit_task_constructor that referenced this pull request May 29, 2025
@MarqRazz MarqRazz deleted the preempt_execution branch May 29, 2025 20:19