Construct TF listeners passing nodes, spinning on separate thread #5406

Open
wants to merge 3 commits into
base: main
from

Conversation

roncapat
Contributor

See #1182 - this is a re-assessment after 6 years.
Advantages: easier auditing of TF2 subscriptions across the ROS graph. For example, the nav2 node names will appear when issuing ros2 topic info -v /tf.


Basic Info

| Info | Please fill out this column |
| --- | --- |
| Ticket(s) this addresses | #1182 |
| Primary OS tested on | Ubuntu |
| Robotic platform tested on | proprietary simulation & HW |
| Does this PR contain AI generated software? | No |
| Was this PR description generated by AI software? | No |

Description of contribution in a few bullet points

Construct tf2_ros::TransformListener instances by passing a node pointer, so that no additional nodes with randomized names are spawned on the ROS graph (less pollution, better auditing). The spin_thread flag stays enabled, so that TF subscription callbacks are not interleaved with other nav2-related callbacks in the same executor.

For Maintainers:

  • Check that any new parameters added are updated in docs.nav2.org
  • Check that any significant change is added to the migration guide
  • Check that any new features OR changes to existing behaviors are reflected in the tuning guide
  • Check that any new functions have Doxygen added
  • Check that any new features have test coverage
  • Check that any new plugins are added to the plugins page
  • If BT Node, Additionally: add to BT's XML index of nodes for groot, BT package's readme table, and BT library lists
  • Should this be backported to current distributions? If so, tag with backport-*.

Contributor

mergify bot commented Jul 30, 2025

@roncapat, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).

@SteveMacenski
Member

SteveMacenski commented Jul 31, 2025

Pull in / rebase on main once #5409 is merged to get CI to turn over. Sorry about that, we hit the wall with circleci limits and needing to load balance the builds.

Also make sure to sign off with DCO.

Signed-off-by: Patrick Roncagliolo <ronca.pat@gmail.com>
@SteveMacenski
Member

... huh. A lot of tests failed. I'm rerunning them, but if they fail again, I think this introduces a regression.

@roncapat
Contributor Author

roncapat commented Aug 1, 2025

I also just noticed... I will be able to investigate next week at the earliest, sorry. It will be interesting to see what the cause is, since in my real testing scenario it works impressively well. Hope to learn something and have a fix!

@SteveMacenski
Member

SteveMacenski commented Aug 1, 2025

impressively well

How so? big perf boost? I wouldn't have expected that

@roncapat
Contributor Author

roncapat commented Aug 1, 2025

Nah, I mean more like "without surprises" - but I am 99% sure that, by using the node, it will benefit from IPC being enabled on the /tf subscribers.

About two years ago I pushed some patches for IPC in the TransformListener; I need to check again under which conditions it gets enabled.

@SteveMacenski
Member

SteveMacenski commented Aug 1, 2025

Yeah, this is still failing completely - I think there's something awry here. I sampled 2 of the 16 tests, and the lifecycle transition never completes while it's waiting for a transform to be available (which seems awfully related, so I don't think it's a CI fluke).

Signed-off-by: Patrick Roncagliolo <ronca.pat@gmail.com>
@roncapat
Contributor Author

roncapat commented Aug 3, 2025

I began to study tf2_ros::TransformListener more deeply.

What I found, basically, is that the current nav2 code uses the constructor:
TransformListener(tf2::BufferCore & buffer, bool spin_thread = true, bool static_only = false)
Notice the default spin_thread = true.

That is to say, the "only" difference introduced in this PR is the node used by the TransformListener implementation, not the spinning logic, which is kept exactly the same.

Of course, here we are passing a LifecycleNode, from which the TransformListener constructor will extract a set of NodeInterfaces. In the upcoming days I will probably focus on possible subtle implications of this - for example, whether the current state of the LifecycleNode could affect the correct operation of the TransformListener.

Moreover, it seems the only place where the tests have problems is costmap_2d_ros. Reverting the modifications for that node alone makes all the tests pass. This may be a hint that the specific way the TransformListener is used there causes the issue, unlike the other use cases, which narrows the "search area".

Will update you if I discover something more in the upcoming days.

@roncapat
Contributor Author

roncapat commented Aug 3, 2025

Ok, I may have understood the issue.
costmap_2d_ros expects TF to be received during the on_activate call.

Per https://design.ros2.org/articles/node_lifecycle.html, in the inactive state "...the node will not receive any execution time to read topics, perform processing of data, respond to functional service requests, etc."

Two options:

  • explicitly create a non-lifecycle node and pass it to the TransformListener -> the node name can be customized***
  • defer the canTransform call to the "active" state of costmap_2d_ros

*** https://github.com/ros2/geometry2/blob/2b1742c80a4e91a411e5798eec78573928391a7c/tf2_ros/src/transform_listener.cpp#L46-L56


My personal take (a bit philosophical, take this with a grain of salt):

I fully understand why canTransform is called there, but I think it reveals a (minor) flaw in adopting a LifecycleNode-based architecture: the node expects to receive something during the on_activate transition. This currently works because that responsibility is delegated to a "classic" node hidden inside TransformListener.
This also reveals that, when the costmap node is inactive, the listener is still receiving /tf messages - while being 100% strict, in principle, it should be "disabled" (not receiving) too.

@SteveMacenski
Member

SteveMacenski commented Aug 4, 2025

Mhm, I don't think we can activate until we have all the inputs required to actually be able to process something. Unlike other things like having services available from other nodes that we can make sure are available intrinsically by the ordering of lifecycle transitions, the setting of the robot's initial pose is a user-application defined task (or SLAM if running SLAM) that we cannot know is completed without checking.

defer the canTransform call to the "active" state of costmap_2d_ros

We could move it to the already-activated state, but then requests could be submitted without actually being processable. At the moment, I think we should leave this as-is, but it can be reopened design-wise. I suppose we could use a timer, or possibly have the update map thread check for this transform, with a similar delay after activation. That would complicate the implementation a bit, but nothing terrible. My biggest concern there is that we have a timeout feature for waiting for that transform. If we cannot return a failure on a state transition when that timeout is exceeded, then the server ends up in an unrecoverable state. If we have some ideas around that, I wouldn't object to a redesign of this handling.
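To make the timeout concern concrete, here is a minimal sketch in plain Python (not Nav2 or rclcpp code) of a transition that polls for a transform and reports failure once a deadline passes. The names `wait_for_transform` and `on_activate`, the `can_transform` callable, and the "SUCCESS"/"FAILURE" strings are all hypothetical stand-ins chosen for illustration.

```python
import time
from typing import Callable


def wait_for_transform(
    can_transform: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 0.01,
) -> bool:
    """Poll a (hypothetical) transform check until it succeeds or times out."""
    deadline = time.monotonic() + timeout_s
    while True:
        if can_transform():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def on_activate(can_transform: Callable[[], bool], timeout_s: float) -> str:
    """Toy transition callback: the result reflects whether the wait succeeded."""
    if wait_for_transform(can_transform, timeout_s):
        return "SUCCESS"
    # Reporting failure here is what lets a caller notice the timeout and retry.
    return "FAILURE"
```

If the check instead ran after activation (for example from a timer), there would be no transition result left to carry the failure, which is the unrecoverable-state concern raised above.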

This also reveals that, when the costmap node is inactive, the listener is still receiving /tf messages - while being 100% strict, in principle, it should be "disabled" (not receiving) too.

TF is not lifecycle enabled, so that's no surprise. This isn't doing 'work' or given 'execution time' on the application though so I think that's fine. The lifecycle transition quote you gave from the design document I think is talking about the work in the transition function to block the completion of transition. While perhaps TF could technically do some work given a message, the transition isn't dependent on it, so that's fine. How we use TF to block for the available transform however does break that principle. But your point is understood. If we wanted to be aggressively pure on Lifecycle Nodes, there are many ROS libraries that would need to have activate/deactivate functions enabled.

Anyway, why does the change to using the spinning thread and the node not work? I'm a little unclear on that, since there is no lifecycle subscription that would stop the subscription within TF from processing. The spin thread should be creating its own executor spun in its own thread as well, so that should all be working independently, at first glance.
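The expectation described here can be illustrated with a toy model in plain Python. It is only a sketch of the threading idea, not of tf2_ros or rclcpp internals: `ToyListener`, `blocking_activate`, the queue standing in for the /tf topic, and the frame names are all assumptions made for the example.

```python
import queue
import threading
import time


class ToyListener:
    """Toy stand-in for a listener that drains messages on its own thread."""

    def __init__(self, incoming):
        self._incoming = incoming
        self._frames = set()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._spin, daemon=True)
        self._thread.start()

    def _spin(self):
        # Dedicated loop: runs regardless of what the main thread is doing.
        while not self._stop.is_set():
            try:
                pair = self._incoming.get(timeout=0.01)
            except queue.Empty:
                continue
            with self._lock:
                self._frames.add(pair)

    def can_transform(self, target, source):
        with self._lock:
            return (target, source) in self._frames

    def shutdown(self):
        self._stop.set()
        self._thread.join()


def blocking_activate(listener, timeout_s=2.0):
    """Main thread blocks here, as a transition callback would."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if listener.can_transform("map", "base_link"):
            return True
        time.sleep(0.005)
    return False


tf_topic = queue.Queue()            # stands in for the /tf topic
listener = ToyListener(tf_topic)
tf_topic.put(("map", "base_link"))  # a publisher sends one transform
activated = blocking_activate(listener)
listener.shutdown()
```

In this model the blocked main thread still sees the transform because delivery happens on the listener's own thread, which is why the failure in the tests is surprising at first glance.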

@roncapat
Contributor Author

roncapat commented Aug 5, 2025

Anyway, why does the change to using the spinning thread and the node not work? I'm a little unclear on that, since there is no lifecycle subscription that would stop the subscription within TF from processing. The spin thread should be creating its own executor spun in its own thread as well, so that should all be working independently, at first glance.

I may have misunderstood the design document (at least the part I quoted), but it seems to me that, since we are using the LifecycleNode interfaces to create the subscription inside TransformListener, that subscription will also not receive any execution time to read topics. I don't understand how yet; I will study the rclcpp_lifecycle code further.

@SteveMacenski
Member

SteveMacenski commented Aug 5, 2025

I think this has more to do with how the TF code spins w.r.t. the main node. Maybe some print statements would help clarify. I think we should understand the 'why' before we merge, but once we do, I'm happy to merge, assuming we don't find it's just hiding something buggy (or we find that this change is actually buggy and costmap2D is the only place showing the problem to us immediately).

@ros-navigation ros-navigation deleted a comment from claude bot Aug 5, 2025
@ros-navigation ros-navigation deleted a comment from claude bot Aug 5, 2025
@roncapat
Contributor Author

roncapat commented Aug 5, 2025

I agree!
I think I have found the issue. Took this screenshot while running some failing tests:
[screenshot]
/tf and /tf_static get namespaced!

Will check a simple way to avoid this.


EDIT 1:
I tried to add remappings like

    // Attempted remappings on the costmap node's options (fragment).
    rclcpp::NodeOptions().arguments({
      "--ros-args", "-r", std::string("__ns:=") + nav2_util::add_namespaces(parent_namespace, local_namespace),
      "--ros-args", "-r", nav2_util::add_namespaces(parent_namespace, local_namespace) + "/tf:=/tf",
      "--ros-args", "-r", nav2_util::add_namespaces(parent_namespace, local_namespace) + "/tf_static:=/tf_static",
      "--ros-args", "-r", "tf:=/tf",
      "--ros-args", "-r", "tf_static:=/tf_static",
      "--ros-args", "-p", "use_sim_time:=" + std::string(use_sim_time ? "true" : "false"),
    })

in costmap_2d_ros.cpp, but they don't work.
Apparently the issue lies in the nav2_system_test launch file test_error_codes_launch.py, where:

    remappings = [('/tf', 'tf'), ('/tf_static', 'tf_static')]

is found, as in many nav2_bringup files.
Emptying that list does the trick but, of course, it is not the right solution.
It seems that remappings passed from the CLI forcibly override any hardcoded remapping.
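A simplified model in plain Python shows how a remapping such as `('/tf', 'tf')` can produce a namespaced topic. It assumes that both sides of a rule are expanded to fully qualified names relative to the node's namespace before matching, and it ignores `~`, substitutions, and node-specific rules. The helper names `expand` and `resolve` and the `/local_costmap` namespace are hypothetical, used only for illustration.

```python
def expand(name: str, namespace: str) -> str:
    """Expand a topic name to a fully qualified name (simplified model)."""
    if name.startswith("/"):
        return name  # already absolute
    return f"{namespace.rstrip('/')}/{name}"


def resolve(topic: str, namespace: str, remappings) -> str:
    """Apply the first matching (match, replacement) rule, if any."""
    fqn = expand(topic, namespace)
    for match, replacement in remappings:
        if expand(match, namespace) == fqn:
            # The relative replacement picks up the node's namespace here.
            return expand(replacement, namespace)
    return fqn


remappings = [("/tf", "tf"), ("/tf_static", "tf_static")]

# A listener attached to a node in the root namespace keeps /tf.
root_listener = resolve("/tf", "/", remappings)

# A listener attached to a node in a sub-namespace ends up namespaced.
costmap_listener = resolve("/tf", "/local_costmap", remappings)
```

Under these assumptions, `root_listener` is `/tf` while `costmap_listener` is `/local_costmap/tf`, which matches the namespaced topics seen in the screenshot; with an empty rule list both stay on `/tf`.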

Contributor

mergify bot commented Aug 5, 2025

@roncapat, your PR has failed to build. Please check CI outputs and resolve issues.
You may need to rebase or pull in main due to API changes (or your contribution genuinely fails).
