| name | robot-bringup |
| description | Patterns and best practices for bringing up a complete ROS2-based robotics system on a robot's onboard computer, including systemd services, launch file composition, ordered startup, and production monitoring. Use this skill when configuring a robot to start ROS2 nodes on boot, writing systemd unit files for ROS2 launch, composing layered launch files for full robot stacks, setting up watchdog monitoring, configuring udev rules for deterministic device naming, or debugging boot-time race conditions. Trigger whenever the user mentions robot bringup, robot startup, systemd for ROS2, ROS2 on boot, launch file composition, robot boot sequence, udev rules for cameras or serial ports, watchdog for robot systems, automatic restart for ROS2 nodes, network configuration for multi-machine ROS2, log rotation for robots, graceful shutdown of robot stacks, or SSH-based remote debugging of robots. Also trigger for environment setup in systemd (sourcing workspaces), ordered startup with health checks, deterministic device naming, or any discussion of running ROS2 systems as long-running production services. Covers systemd on Ubuntu 22.04/24.04 with ROS2 Humble, Iron, and Jazzy.
|
Robot Bringup Skill
When to Use This Skill
- Configuring a robot to automatically start its full ROS2 stack on boot via systemd
- Writing systemd unit files that correctly source ROS2 workspaces and set DDS environment
- Composing layered launch files (hardware, drivers, perception, application) into a single bringup
- Setting up ordered startup with health checks to avoid race conditions between dependent nodes
- Writing udev rules for deterministic device naming of cameras, LiDARs, and serial devices
- Configuring CycloneDDS or FastDDS for multi-machine ROS2 discovery across robot and base station
- Implementing watchdog and heartbeat monitoring for production robot systems
- Setting up log rotation and structured logging for long-running robot deployments
- Writing graceful shutdown handlers that bring actuators to a safe state before exit
- Debugging boot-time failures, service ordering issues, or device enumeration races
The Robot Bringup Stack
A production robot bringup follows a layered startup sequence from hardware initialization through application-level nodes. Each layer depends on the one below it.
┌─────────────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
│ Navigation, manipulation, mission planning, HRI │
├─────────────────────────────────────────────────────────────────────┤
│ PERCEPTION LAYER │
│ Object detection, SLAM, point cloud filtering, sensor fusion │
├─────────────────────────────────────────────────────────────────────┤
│ DRIVER LAYER │
│ Camera drivers, LiDAR drivers, motor controllers, IMU │
├─────────────────────────────────────────────────────────────────────┤
│ HARDWARE LAYER │
│ udev rules, device enumeration, USB reset, firmware check │
├─────────────────────────────────────────────────────────────────────┤
│ ROS2 ENVIRONMENT │
│ Source workspace, set RMW, ROS_DOMAIN_ID, DDS config │
├─────────────────────────────────────────────────────────────────────┤
│ SYSTEMD TARGETS & SERVICES │
│ network-online.target → robot-hw.target → robot-bringup.target │
├─────────────────────────────────────────────────────────────────────┤
│ LINUX BOOT (systemd) │
│ BIOS/UEFI → GRUB → kernel → systemd init │
├─────────────────────────────────────────────────────────────────────┤
│ HARDWARE BOOT │
│ Power supply, onboard computer, peripherals │
└─────────────────────────────────────────────────────────────────────┘
systemd Service Units for ROS2
Basic ROS2 Service Unit
Place service files in /etc/systemd/system/. This template starts a ROS2 launch file as a long-running service with watchdog support.
[Unit]
Description=Robot ROS2 Bringup Stack
Documentation=https://github.com/my-org/my-robot
After=network-online.target robot-hw.target
Wants=network-online.target
Requires=robot-hw.target
[Service]
Type=notify
User=robot
Group=robot
WorkingDirectory=/home/robot
EnvironmentFile=/etc/robot/ros2.env
ExecStartPre=/usr/local/bin/robot-device-check.sh
ExecStart=/bin/bash -c '\
source /opt/ros/${ROS_DISTRO}/setup.bash && \
source /home/robot/ros2_ws/install/setup.bash && \
exec ros2 launch my_robot_bringup bringup.launch.py'
ExecStop=/bin/kill -INT $MAINPID
TimeoutStopSec=30
Restart=on-failure
RestartSec=5
WatchdogSec=30
KillMode=mixed
KillSignal=SIGINT
FinalKillSignal=SIGKILL
TimeoutStartSec=60
StandardOutput=journal
StandardError=journal
SyslogIdentifier=robot-bringup
[Install]
WantedBy=multi-user.target
Environment Setup in systemd
Store environment variables in a dedicated file rather than sourcing .bashrc (which is not loaded by systemd).
ROS_DISTRO=humble
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ROS_DOMAIN_ID=42
CYCLONEDDS_URI=file:///etc/robot/cyclonedds.xml
ROS_LOCALHOST_ONLY=0
ROS_LOG_DIR=/var/log/ros2
RCUTILS_LOGGING_USE_STDOUT=0
RCUTILS_COLORIZED_OUTPUT=0
ROBOT_NAME=my_robot_01
ROBOT_CONFIG_DIR=/etc/robot/config
Dependencies Between Services
Split the robot stack into multiple systemd services with explicit ordering. This allows independent restart of layers and clearer failure isolation.
[Unit]
Description=Robot Hardware Drivers (cameras, LiDAR, IMU, motors)
After=network-online.target robot-hw.target
Wants=network-online.target
Requires=robot-hw.target
[Service]
Type=notify
User=robot
EnvironmentFile=/etc/robot/ros2.env
ExecStart=/bin/bash -c '\
source /opt/ros/${ROS_DISTRO}/setup.bash && \
source /home/robot/ros2_ws/install/setup.bash && \
exec ros2 launch my_robot_bringup drivers.launch.py'
Restart=on-failure
RestartSec=5
WatchdogSec=30
KillMode=mixed
KillSignal=SIGINT
TimeoutStopSec=20
StandardOutput=journal
SyslogIdentifier=robot-drivers
[Install]
WantedBy=robot-bringup.target
[Unit]
Description=Robot Perception Stack (SLAM, detection, sensor fusion)
After=robot-drivers.service
Requires=robot-drivers.service
PartOf=robot-drivers.service
[Service]
Type=notify
User=robot
EnvironmentFile=/etc/robot/ros2.env
ExecStart=/bin/bash -c '\
source /opt/ros/${ROS_DISTRO}/setup.bash && \
source /home/robot/ros2_ws/install/setup.bash && \
exec ros2 launch my_robot_bringup perception.launch.py'
Restart=on-failure
RestartSec=5
WatchdogSec=30
KillMode=mixed
KillSignal=SIGINT
TimeoutStopSec=20
StandardOutput=journal
SyslogIdentifier=robot-perception
[Install]
WantedBy=robot-bringup.target
[Unit]
Description=Robot Application Layer (navigation, planning, HRI)
After=robot-perception.service
Requires=robot-perception.service
PartOf=robot-perception.service
[Service]
Type=notify
User=robot
EnvironmentFile=/etc/robot/ros2.env
ExecStart=/bin/bash -c '\
source /opt/ros/${ROS_DISTRO}/setup.bash && \
source /home/robot/ros2_ws/install/setup.bash && \
exec ros2 launch my_robot_bringup application.launch.py'
Restart=on-failure
RestartSec=10
WatchdogSec=30
KillMode=mixed
KillSignal=SIGINT
TimeoutStopSec=20
StandardOutput=journal
SyslogIdentifier=robot-application
[Install]
WantedBy=robot-bringup.target
Restart Policies and Failure Recovery
Configure rate limiting to prevent restart loops when a service is fundamentally broken (e.g., missing device, configuration error).
Restart=on-failure
RestartSec=5
StartLimitIntervalSec=120
StartLimitBurst=5
OnFailure=robot-alert@%n.service
Resource Limits and cgroups
Constrain resource usage to prevent a runaway node from starving the rest of the system.
MemoryMax=2G
MemoryHigh=1800M
CPUQuota=300%
Nice=-5
IOSchedulingClass=realtime
IOSchedulingPriority=0
ProtectHome=read-only
ProtectSystem=strict
ReadWritePaths=/var/log/ros2 /tmp
PrivateTmp=true
Launch File Composition and Layering
Launch Layer Architecture
Organize launch files into layers that mirror the systemd service architecture. Each layer is an independent launch file that can be tested in isolation.
bringup.launch.py (top-level: composes all layers)
├── hardware.launch.py (udev checks, device readiness)
├── drivers.launch.py (camera, LiDAR, IMU, motor drivers)
│ ├── camera.launch.py
│ ├── lidar.launch.py
│ └── motors.launch.py
├── perception.launch.py (SLAM, detection, fusion)
│ ├── slam.launch.py
│ └── detection.launch.py
└── application.launch.py (navigation, planning, HRI)
├── navigation.launch.py
└── mission.launch.py
Hardware Layer Launch
from launch import LaunchDescription
from launch.actions import LogInfo, ExecuteProcess, TimerAction
from launch.conditions import IfCondition
from launch.substitutions import LaunchConfiguration, EnvironmentVariable
def generate_launch_description():
robot_name = LaunchConfiguration('robot_name',
default=EnvironmentVariable('ROBOT_NAME', default_value='default_robot'))
check_camera = ExecuteProcess(
cmd=['test', '-e', '/dev/robot/camera_front'],
name='check_camera_front',
output='screen',
)
check_lidar = ExecuteProcess(
cmd=['test', '-e', '/dev/robot/lidar'],
name='check_lidar',
output='screen',
)
check_imu = ExecuteProcess(
cmd=['test', '-e', '/dev/robot/imu'],
name='check_imu',
output='screen',
)
log_ready = TimerAction(
period=2.0,
actions=[LogInfo(msg='Hardware checks passed, devices ready')],
)
return LaunchDescription([
check_camera,
check_lidar,
check_imu,
log_ready,
])
Driver Layer Launch
from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument, IncludeLaunchDescription, GroupAction
from launch.launch_description_sources import PythonLaunchDescriptionSource
from launch.substitutions import LaunchConfiguration, PathJoinSubstitution
from launch_ros.actions import Node, SetRemap
from launch_ros.substitutions import FindPackageShare
def generate_launch_description():
use_sim = LaunchConfiguration('use_sim', default='false')
camera_config = LaunchConfiguration('camera_config', default='default')
camera_node = Node(
package='usb_cam',
executable='usb_cam_node_exe',
name='camera_front',
parameters=[PathJoinSubstitution([
FindPackageShare('my_robot_bringup'), 'config', 'camera_front.yaml'
])],
remappings=[('/image_raw', '/camera/front/image_raw')],
)
lidar_node = Node(
package='sllidar_ros2',
executable='sllidar_node',
name='lidar',
parameters=[{
'serial_port': '/dev/robot/lidar',
'serial_baudrate': 460800,
'frame_id': 'lidar_link',
'angle_compensate': True,
}],
)
imu_node = Node(
package='imu_driver',
executable='imu_node',
name='imu',
parameters=[{
'port': '/dev/robot/imu',
'frame_id': 'imu_link',
'publish_rate': 100.0,
}],
)
motor_node = Node(
package='motor_driver',
executable='motor_controller_node',
name='motor_controller',
parameters=[PathJoinSubstitution([
FindPackageShare('my_robot_bringup'), 'config', 'motors.yaml'
])],
)
return LaunchDescription([
DeclareLaunchArgument('use_sim', default_value='false'),
DeclareLaunchArgument('camera_config', default_value='default'),
camera_node,
lidar_node,
imu_node,
motor_node,
])
Perception Layer Launch
from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument, GroupAction
from launch.conditions import IfCondition
from launch.substitutions import LaunchConfiguration, PathJoinSubstitution
from launch_ros.actions import Node, ComposableNodeContainer, LoadComposableNode
from launch_ros.descriptions import ComposableNode
from launch_ros.substitutions import FindPackageShare
def generate_launch_description():
enable_slam = LaunchConfiguration('enable_slam', default='true')
enable_detection = LaunchConfiguration('enable_detection', default='true')
perception_container = ComposableNodeContainer(
name='perception_container',
namespace='',
package='rclcpp_components',
executable='component_container_mt',
composable_node_descriptions=[
ComposableNode(
package='image_proc',
plugin='image_proc::RectifyNode',
name='rectify',
remappings=[('image', '/camera/front/image_raw')],
),
ComposableNode(
package='my_detection',
plugin='my_detection::DetectorNode',
name='detector',
parameters=[PathJoinSubstitution([
FindPackageShare('my_robot_bringup'), 'config', 'detector.yaml'
])],
),
],
condition=IfCondition(enable_detection),
)
slam_node = Node(
package='slam_toolbox',
executable='async_slam_toolbox_node',
name='slam',
parameters=[PathJoinSubstitution([
FindPackageShare('my_robot_bringup'), 'config', 'slam.yaml'
])],
condition=IfCondition(enable_slam),
)
return LaunchDescription([
DeclareLaunchArgument('enable_slam', default_value='true'),
DeclareLaunchArgument('enable_detection', default_value='true'),
perception_container,
slam_node,
])
Application Layer Launch
from launch import LaunchDescription
from launch.actions import DeclareLaunchArgument, IncludeLaunchDescription
from launch.launch_description_sources import PythonLaunchDescriptionSource
from launch.substitutions import LaunchConfiguration, PathJoinSubstitution
from launch_ros.actions import Node
from launch_ros.substitutions import FindPackageShare
def generate_launch_description():
nav_params = LaunchConfiguration('nav_params', default=PathJoinSubstitution([
FindPackageShare('my_robot_bringup'), 'config', 'nav2_params.yaml'
]))
nav2_bringup = IncludeLaunchDescription(
PythonLaunchDescriptionSource(PathJoinSubstitution([
FindPackageShare('nav2_bringup'), 'launch', 'bringup_launch.py'
])),
launch_arguments={
'params_file': nav_params,
'use_sim_time': LaunchConfiguration('use_sim', default='false'),
}.items(),
)
mission_node = Node(
package='my_mission',
executable='mission_planner',
name='mission_planner',
parameters=[PathJoinSubstitution([
FindPackageShare('my_robot_bringup'), 'config', 'mission.yaml'
])],
)
return LaunchDescription([
DeclareLaunchArgument('nav_params', default_value=''),
DeclareLaunchArgument('use_sim', default_value='false'),
nav2_bringup,
mission_node,
])
Top-Level Bringup Launch
This launch file composes all layers into a single entry point, with conditional arguments for simulation vs. real hardware and robot variant selection.
from launch import LaunchDescription
from launch.actions import (
DeclareLaunchArgument, IncludeLaunchDescription,
GroupAction, LogInfo, TimerAction,
)
from launch.conditions import IfCondition, UnlessCondition
from launch.launch_description_sources import PythonLaunchDescriptionSource
from launch.substitutions import (
LaunchConfiguration, PathJoinSubstitution, PythonExpression,
)
from launch_ros.actions import PushRosNamespace, SetParameter
from launch_ros.substitutions import FindPackageShare
def generate_launch_description():
pkg_share = FindPackageShare('my_robot_bringup')
use_sim = LaunchConfiguration('use_sim')
robot_variant = LaunchConfiguration('robot_variant')
enable_perception = LaunchConfiguration('enable_perception')
enable_navigation = LaunchConfiguration('enable_navigation')
hardware_launch = GroupAction(
actions=[
LogInfo(msg='Starting hardware layer...'),
IncludeLaunchDescription(
PythonLaunchDescriptionSource(
PathJoinSubstitution([pkg_share, 'launch', 'hardware.launch.py'])
),
),
],
condition=UnlessCondition(use_sim),
)
drivers_launch = GroupAction(
actions=[
LogInfo(msg='Starting driver layer...'),
IncludeLaunchDescription(
PythonLaunchDescriptionSource(
PathJoinSubstitution([pkg_share, 'launch', 'drivers.launch.py'])
),
launch_arguments={'use_sim': use_sim}.items(),
),
],
)
perception_launch = GroupAction(
actions=[
LogInfo(msg='Starting perception layer...'),
IncludeLaunchDescription(
PythonLaunchDescriptionSource(
PathJoinSubstitution([pkg_share, 'launch', 'perception.launch.py'])
),
),
],
condition=IfCondition(enable_perception),
)
application_launch = GroupAction(
actions=[
LogInfo(msg='Starting application layer...'),
IncludeLaunchDescription(
PythonLaunchDescriptionSource(
PathJoinSubstitution([pkg_share, 'launch', 'application.launch.py'])
),
launch_arguments={'use_sim': use_sim}.items(),
),
],
condition=IfCondition(enable_navigation),
)
return LaunchDescription([
DeclareLaunchArgument('use_sim', default_value='false',
description='Use simulation instead of real hardware'),
DeclareLaunchArgument('robot_variant', default_value='standard',
description='Robot variant: standard, heavy_payload, outdoor'),
DeclareLaunchArgument('enable_perception', default_value='true',
description='Enable perception stack'),
DeclareLaunchArgument('enable_navigation', default_value='true',
description='Enable navigation and application stack'),
SetParameter(name='use_sim_time', value=use_sim),
LogInfo(msg=['Bringing up robot variant: ', robot_variant]),
hardware_launch,
drivers_launch,
perception_launch,
application_launch,
])
Conditional Loading (Sim vs Real, Robot Variants)
Use IfCondition, UnlessCondition, and PythonExpression to swap configurations based on runtime arguments.
from launch.conditions import IfCondition, UnlessCondition
from launch.substitutions import PythonExpression, LaunchConfiguration
robot_variant = LaunchConfiguration('robot_variant')
variant_config = PathJoinSubstitution([
FindPackageShare('my_robot_bringup'), 'config', 'variants',
PythonExpression(["'", robot_variant, "' + '.yaml'"]),
])
arm_driver = Node(
package='arm_driver',
executable='arm_controller_node',
name='arm_controller',
condition=IfCondition(
PythonExpression(["'", robot_variant, "' == 'heavy_payload'"])
),
)
Ordered Startup with Health Checks
Startup Dependency Graph
Nodes must start in a specific order to avoid subscribing to topics that do not yet exist or calling services before they are available.
┌──────────────┐
│ motors.srv │
└──────┬───────┘
│
┌────────────┼────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ camera │ │ lidar │ │ imu │
└────┬─────┘ └────┬─────┘ └────┬─────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────┐
│ perception / SLAM │
└──────────────┬───────────────────┘
│
▼
┌──────────────────────────────────┐
│ navigation / planning │
└──────────────────────────────────┘
Health Check Scripts
Use health check scripts in ExecStartPre to block service startup until dependencies are ready.
#!/bin/bash
set -euo pipefail
REQUIRED_DEVICES=(
"/dev/robot/camera_front"
"/dev/robot/lidar"
"/dev/robot/imu"
"/dev/robot/motor_controller"
)
TIMEOUT=30
POLL_INTERVAL=1
elapsed=0
for device in "${REQUIRED_DEVICES[@]}"; do
elapsed=0
while [ ! -e "$device" ]; do
if [ "$elapsed" -ge "$TIMEOUT" ]; then
echo "ERROR: Device $device not found after ${TIMEOUT}s" >&2
exit 1
fi
echo "Waiting for $device... (${elapsed}s/${TIMEOUT}s)"
sleep "$POLL_INTERVAL"
elapsed=$((elapsed + POLL_INTERVAL))
done
echo "Found device: $device"
done
echo "All required devices are present."
exit 0
#!/bin/bash
set -euo pipefail
source /opt/ros/${ROS_DISTRO}/setup.bash
source /home/robot/ros2_ws/install/setup.bash
TIMEOUT=60
POLL_INTERVAL=2
for node_name in "$@"; do
elapsed=0
while ! ros2 node list 2>/dev/null | grep -q "$node_name"; do
if [ "$elapsed" -ge "$TIMEOUT" ]; then
echo "ERROR: Node $node_name not found after ${TIMEOUT}s" >&2
exit 1
fi
echo "Waiting for node $node_name... (${elapsed}s/${TIMEOUT}s)"
sleep "$POLL_INTERVAL"
elapsed=$((elapsed + POLL_INTERVAL))
done
echo "Node active: $node_name"
done
echo "All required nodes are active."
exit 0
Wait-for-Topic Pattern
A reusable Python utility to block until a topic is being published, useful for ordered startup in launch files.
import argparse
import sys
import time
import importlib
import rclpy
from rclpy.node import Node
from rclpy.qos import qos_profile_sensor_data
class TopicWaiter(Node):
def __init__(self, topic_name, msg_type_str, timeout):
super().__init__('topic_waiter')
self.received = False
self.timeout = timeout
self.start_time = time.time()
module_name, class_name = msg_type_str.rsplit('/', 1)
module_name = module_name.replace('/', '.')
module = importlib.import_module(module_name)
msg_type = getattr(module, class_name)
self.sub = self.create_subscription(
msg_type, topic_name, self._callback, qos_profile_sensor_data)
self.timer = self.create_timer(1.0, self._check_timeout)
self.get_logger().info(f'Waiting for topic {topic_name}...')
def _callback(self, msg):
self.get_logger().info('Topic is active, message received.')
self.received = True
def _check_timeout(self):
if self.received:
raise SystemExit(0)
elapsed = time.time() - self.start_time
if elapsed > self.timeout:
self.get_logger().error(f'Timeout after {self.timeout}s')
raise SystemExit(1)
def main():
parser = argparse.ArgumentParser()
parser.add_argument('topic', help='Topic name to wait for')
parser.add_argument('msg_type', help='Message type (e.g., sensor_msgs/msg/LaserScan)')
parser.add_argument('--timeout', type=float, default=30.0)
args = parser.parse_args()
rclpy.init()
node = TopicWaiter(args.topic, args.msg_type, args.timeout)
rclpy.spin(node)
if __name__ == '__main__':
main()
Lifecycle Node Orchestration for Ordered Startup
Use lifecycle (managed) nodes to enforce startup ordering. A lifecycle manager configures and activates nodes in sequence, ensuring each node completes its configuration before the next one starts.
from launch import LaunchDescription
from launch_ros.actions import Node
def generate_launch_description():
lifecycle_manager = Node(
package='nav2_lifecycle_manager',
executable='lifecycle_manager',
name='lifecycle_manager',
output='screen',
parameters=[{
'node_names': [
'motor_controller',
'camera_driver',
'lidar_driver',
'slam',
'navigation',
],
'autostart': True,
'bond_timeout': 10.0,
'bond_respawn_max_duration': 2.0,
}],
)
return LaunchDescription([
lifecycle_manager,
])
udev Rules for Deterministic Device Naming
Writing udev Rules for Cameras
USB cameras can enumerate in any order on boot, causing /dev/video0 to be unpredictable. Use udev rules to assign stable symlinks based on device attributes.
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1234", ATTRS{idProduct}=="5678", \
KERNELS=="1-1.2:1.0", ATTR{index}=="0", \
SYMLINK+="robot/camera_front", MODE="0666", GROUP="video"
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="1234", ATTRS{idProduct}=="5678", \
KERNELS=="1-1.3:1.0", ATTR{index}=="0", \
SYMLINK+="robot/camera_rear", MODE="0666", GROUP="video"
SUBSYSTEM=="video4linux", ATTRS{idVendor}=="8086", ATTRS{idProduct}=="0b3a", \
ATTRS{serial}=="123456789", ATTR{index}=="0", \
SYMLINK+="robot/camera_depth", MODE="0666", GROUP="video"
Writing udev Rules for Serial Devices
Serial devices (IMU, motor controller, GPS) also need stable names since /dev/ttyUSB* numbering is non-deterministic.
SUBSYSTEM=="tty", ATTRS{idVendor}=="0403", ATTRS{idProduct}=="6001", \
ATTRS{serial}=="AB0CDEFG", \
SYMLINK+="robot/imu", MODE="0666", GROUP="dialout"
SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", \
KERNELS=="1-1.4:1.0", \
SYMLINK+="robot/motor_controller", MODE="0666", GROUP="dialout"
SUBSYSTEM=="tty", ATTRS{idVendor}=="1546", ATTRS{idProduct}=="01a7", \
SYMLINK+="robot/gps", MODE="0666", GROUP="dialout"
SUBSYSTEM=="tty", ATTRS{idVendor}=="10c4", ATTRS{idProduct}=="ea60", \
ATTRS{serial}=="0001", \
SYMLINK+="robot/lidar", MODE="0666", GROUP="dialout"
Reloading and Testing
sudo udevadm control --reload-rules
sudo udevadm trigger
sudo udevadm test $(udevadm info --query=path --name=/dev/ttyUSB0)
udevadm info --name=/dev/ttyUSB0 --attribute-walk
udevadm monitor --subsystem-match=tty --property
Network Configuration for Multi-Machine ROS2
Static IP Configuration
Assign a static IP to the robot's wired interface using netplan (Ubuntu 22.04+).
network:
version: 2
ethernets:
eth0:
addresses:
- 10.0.0.10/24
routes:
- to: default
via: 10.0.0.1
nameservers:
addresses:
- 8.8.8.8
- 8.8.4.4
wifis:
wlan0:
dhcp4: true
access-points:
"RobotNetwork":
password: "securepassword"
sudo netplan apply
DDS Discovery Across Machines
CycloneDDS requires explicit peer configuration for multi-machine setups since multicast may not work across network segments.
<CycloneDDS>
<Domain>
<General>
<Interfaces>
<NetworkInterface name="eth0" priority="default" multicast="false"/>
</Interfaces>
<AllowMulticast>false</AllowMulticast>
</General>
<Discovery>
<ParticipantIndex>auto</ParticipantIndex>
<Peers>
<Peer address="10.0.0.10"/>
<Peer address="10.0.0.20"/>
<Peer address="10.0.0.11"/>
</Peers>
<MaxAutoParticipantIndex>30</MaxAutoParticipantIndex>
</Discovery>
<Internal>
<SocketReceiveBufferSize min="10MB"/>
</Internal>
</Domain>
</CycloneDDS>
Firewall Rules
DDS uses a range of UDP ports for discovery and data exchange. Open these ports on both the robot and the base station.
#!/bin/bash
DOMAIN_ID=42
BASE_PORT=$((7400 + 250 * DOMAIN_ID))
sudo ufw allow proto udp from 10.0.0.0/24 to any port $BASE_PORT:$((BASE_PORT + 100))
DATA_PORT=$((BASE_PORT + 1))
sudo ufw allow proto udp from 10.0.0.0/24 to any port $DATA_PORT:$((DATA_PORT + 200))
sudo ufw reload
echo "Firewall configured for ROS2 DDS on domain $DOMAIN_ID"
ROS_DOMAIN_ID and ROS_LOCALHOST_ONLY
export ROS_DOMAIN_ID=42
export ROS_LOCALHOST_ONLY=1
export ROS_LOCALHOST_ONLY=0
ros2 daemon stop && ros2 daemon start
ros2 topic list
Watchdog and Heartbeat Monitoring
systemd Watchdog Integration
When WatchdogSec is set in the service unit, the process must periodically notify systemd that it is alive. If the notification is missed, systemd restarts the service.
import os
import socket
import time
import rclpy
from rclpy.node import Node
from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus
class WatchdogNode(Node):
"""Notifies systemd that the ROS2 process is alive."""
def __init__(self):
super().__init__('watchdog_node')
watchdog_usec = os.environ.get('WATCHDOG_USEC')
if watchdog_usec:
interval_sec = int(watchdog_usec) / 1_000_000 / 2.0
else:
interval_sec = 10.0
self.get_logger().warn('WATCHDOG_USEC not set, using 10s interval')
self.notify_socket = os.environ.get('NOTIFY_SOCKET')
self._sd_notify('READY=1')
self.get_logger().info(
f'Watchdog node started, notify interval: {interval_sec:.1f}s')
self.create_timer(interval_sec, self._watchdog_tick)
self.diag_sub = self.create_subscription(
DiagnosticArray, '/diagnostics', self._diag_callback, 10)
self.system_healthy = True
def _watchdog_tick(self):
"""Send watchdog keepalive to systemd if system is healthy."""
if self.system_healthy:
self._sd_notify('WATCHDOG=1')
else:
self.get_logger().error(
'System unhealthy, withholding watchdog notification')
def _diag_callback(self, msg):
"""Monitor diagnostics for critical errors."""
for status in msg.status:
if status.level == DiagnosticStatus.ERROR:
self.get_logger().error(f'Critical error: {status.name}: {status.message}')
self.system_healthy = False
return
self.system_healthy = True
def _sd_notify(self, state):
"""Send notification to systemd."""
if not self.notify_socket:
return
addr = self.notify_socket
if addr[0] == '@':
addr = '\0' + addr[1:]
sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
try:
sock.connect(addr)
sock.sendall(state.encode())
finally:
sock.close()
def main():
rclpy.init()
node = WatchdogNode()
rclpy.spin(node)
node.destroy_node()
rclpy.shutdown()
if __name__ == '__main__':
main()
ROS2-Level Heartbeat Monitor Node
A monitor node that subscribes to heartbeat topics from critical subsystems and publishes overall system health. If a heartbeat is missed, it triggers a safe stop.
import time
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile, ReliabilityPolicy, DurabilityPolicy
from std_msgs.msg import Bool, String
from geometry_msgs.msg import Twist
from diagnostic_msgs.msg import DiagnosticArray, DiagnosticStatus, KeyValue
class HeartbeatMonitor(Node):
def __init__(self):
super().__init__('heartbeat_monitor')
self.declare_parameter('monitored_nodes', [
'motor_controller', 'camera_driver', 'lidar_driver', 'slam'
])
self.declare_parameter('heartbeat_timeout_sec', 5.0)
self.declare_parameter('check_period_sec', 1.0)
self.monitored_nodes = self.get_parameter('monitored_nodes').value
self.timeout = self.get_parameter('heartbeat_timeout_sec').value
check_period = self.get_parameter('check_period_sec').value
self.last_heartbeat = {name: time.time() for name in self.monitored_nodes}
reliable_qos = QoSProfile(
reliability=ReliabilityPolicy.RELIABLE,
durability=DurabilityPolicy.VOLATILE,
depth=1,
)
for node_name in self.monitored_nodes:
self.create_subscription(
Bool, f'/{node_name}/heartbeat',
lambda msg, n=node_name: self._heartbeat_callback(n, msg),
reliable_qos,
)
self.health_pub = self.create_publisher(
DiagnosticArray, '/system_health', 10)
self.estop_pub = self.create_publisher(
Bool, '/emergency_stop', reliable_qos)
self.cmd_vel_pub = self.create_publisher(
Twist, '/cmd_vel', 10)
self.create_timer(check_period, self._check_health)
self.get_logger().info(
f'Monitoring heartbeats for: {self.monitored_nodes}')
def _heartbeat_callback(self, node_name, msg):
"""Record heartbeat reception time."""
self.last_heartbeat[node_name] = time.time()
def _check_health(self):
"""Check all heartbeats and publish diagnostics."""
now = time.time()
diag_array = DiagnosticArray()
diag_array.header.stamp = self.get_clock().now().to_msg()
all_healthy = True
for node_name in self.monitored_nodes:
elapsed = now - self.last_heartbeat[node_name]
status = DiagnosticStatus()
status.name = f'heartbeat/{node_name}'
if elapsed < self.timeout:
status.level = DiagnosticStatus.OK
status.message = f'Alive ({elapsed:.1f}s ago)'
else:
status.level = DiagnosticStatus.ERROR
status.message = f'TIMEOUT ({elapsed:.1f}s since last heartbeat)'
all_healthy = False
self.get_logger().error(
f'Heartbeat timeout for {node_name}: {elapsed:.1f}s')
status.values = [
KeyValue(key='elapsed_sec', value=f'{elapsed:.2f}'),
KeyValue(key='timeout_sec', value=f'{self.timeout:.2f}'),
]
diag_array.status.append(status)
self.health_pub.publish(diag_array)
if not all_healthy:
self._trigger_safe_stop()
def _trigger_safe_stop(self):
"""Send zero velocity and emergency stop signal."""
self.get_logger().warn('Triggering safe stop due to heartbeat failure')
self.cmd_vel_pub.publish(Twist())
estop_msg = Bool()
estop_msg.data = True
self.estop_pub.publish(estop_msg)
def main():
rclpy.init()
node = HeartbeatMonitor()
rclpy.spin(node)
node.destroy_node()
rclpy.shutdown()
if __name__ == '__main__':
main()
Hardware Watchdog Integration
Many robot onboard computers have a hardware watchdog timer (e.g., Intel TCO, iTCO_wdt). If the software fails to pet the watchdog, the hardware performs a hard reboot.
sudo cat /sys/class/watchdog/watchdog0/state
sudo cat /sys/class/watchdog/watchdog0/timeout
Logging and Log Rotation
ROS2 Log Configuration
export RCUTILS_LOGGING_USE_STDOUT=0
export RCUTILS_COLORIZED_OUTPUT=0
export RCUTILS_CONSOLE_OUTPUT_FORMAT="[{severity}] [{time}] [{name}]: {message}"
ros2 run my_pkg my_node --ros-args --log-level debug
ros2 run my_pkg my_node --ros-args --log-level my_node:=debug
ros2 param set /my_node use_sim_time false
ros2 service call /my_node/set_logger_level rcl_interfaces/srv/SetLoggerLevel \
"{logger_name: 'my_node', level: 10}"
journald Configuration for ROS2 Services
[Journal]
Storage=persistent
SystemMaxUse=1G
SystemMaxFileSize=100M
MaxRetentionSec=30day
RateLimitIntervalSec=10s
RateLimitBurst=10000
ForwardToSyslog=yes
journalctl -u robot-drivers.service -f
journalctl -u robot-bringup.service -b
journalctl -u robot-bringup.service -p err
journalctl -u robot-bringup.service --since "2024-01-01" --output=json > logs.json
logrotate for ROS2 Log Files
ROS2 writes log files to ~/.ros/log/ by default, or to $ROS_LOG_DIR. These grow unbounded without rotation.
/var/log/ros2/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
create 0644 robot robot
maxsize 100M
dateext
dateformat -%Y%m%d
postrotate
systemctl kill --signal=HUP robot-bringup.service 2>/dev/null || true
endscript
}
/home/robot/.ros/log/**/*.log {
daily
rotate 3
compress
missingok
notifempty
maxsize 50M
}
Structured Logging for Production
import json
import logging
from rclpy.node import Node
class StructuredLogger:
"""Wraps ROS2 logger with structured JSON output for production monitoring."""
def __init__(self, node: Node):
self.node = node
self.logger = node.get_logger()
def log_event(self, event_type: str, level: str = 'info', **kwargs):
"""Log a structured event with key-value metadata."""
entry = {
'event': event_type,
'node': self.node.get_name(),
'namespace': self.node.get_namespace(),
'stamp': self.node.get_clock().now().nanoseconds,
**kwargs,
}
message = json.dumps(entry)
getattr(self.logger, level)(message)
Graceful Shutdown Sequences
Signal Handling in ROS2 Nodes
ROS2 nodes should handle SIGINT and SIGTERM to bring actuators to a safe state before exiting.
import signal
import sys
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist
from std_msgs.msg import Bool
class SafeShutdownNode(Node):
def __init__(self):
super().__init__('safe_shutdown_node')
self.cmd_vel_pub = self.create_publisher(Twist, '/cmd_vel', 10)
self.brake_pub = self.create_publisher(Bool, '/brakes/engage', 10)
signal.signal(signal.SIGTERM, self._shutdown_handler)
signal.signal(signal.SIGINT, self._shutdown_handler)
self.get_logger().info('Node started with graceful shutdown handler')
def _shutdown_handler(self, signum, frame):
"""Handle shutdown signals by commanding safe state."""
sig_name = signal.Signals(signum).name
self.get_logger().warn(f'Received {sig_name}, initiating safe shutdown...')
zero_twist = Twist()
for _ in range(5):
self.cmd_vel_pub.publish(zero_twist)
brake_msg = Bool()
brake_msg.data = True
self.brake_pub.publish(brake_msg)
self.get_logger().info('Safe state commanded, shutting down...')
self.destroy_node()
rclpy.shutdown()
sys.exit(0)
def main():
rclpy.init()
node = SafeShutdownNode()
rclpy.spin(node)
node.destroy_node()
rclpy.shutdown()
if __name__ == '__main__':
main()
Ordered Shutdown via systemd Dependencies
The PartOf and Before directives ensure that application-level services are stopped before drivers, preventing the situation where a navigation node sends velocity commands after the motor driver has exited.
[Unit]
PartOf=robot-perception.service
Before=robot-perception.service
[Service]
KillSignal=SIGINT
TimeoutStopSec=15
FinalKillSignal=SIGTERM
SendSIGKILL=yes
Safe State on Shutdown
import rclpy
from rclpy.node import Node
class ActuatorNode(Node):
def __init__(self):
super().__init__('actuator_node')
self.cmd_pub = self.create_publisher(Twist, '/cmd_vel', 10)
context = self.context
context.on_shutdown(self._on_shutdown)
def _on_shutdown(self):
"""Called automatically during rclpy.shutdown()."""
self.get_logger().info('Shutdown callback: commanding zero velocity')
self.cmd_pub.publish(Twist())
Remote Monitoring and Debugging
SSH Tunneling for ROS2 Topics
Forward DDS traffic over SSH when direct network connectivity is not available (e.g., robot is on a cellular connection).
#!/bin/bash
set -euo pipefail
ROBOT_HOST="${1:?Usage: $0 robot@host}"
DOMAIN_ID="${ROS_DOMAIN_ID:-0}"
BASE_PORT=$((7400 + 250 * DOMAIN_ID))
echo "Setting up SSH tunnel for ROS2 domain $DOMAIN_ID (ports $BASE_PORT-$((BASE_PORT + 200)))"
ssh -N \
-L ${BASE_PORT}:localhost:${BASE_PORT} \
-L $((BASE_PORT + 1)):localhost:$((BASE_PORT + 1)) \
-L $((BASE_PORT + 10)):localhost:$((BASE_PORT + 10)) \
-L $((BASE_PORT + 11)):localhost:$((BASE_PORT + 11)) \
"$ROBOT_HOST" &
SSH_PID=$!
echo "SSH tunnel PID: $SSH_PID"
export ROS_LOCALHOST_ONLY=1
echo "Run: export ROS_LOCALHOST_ONLY=1"
echo "Then use ros2 topic list, ros2 topic echo, etc."
echo "Press Ctrl+C to close tunnel."
wait $SSH_PID
Remote journalctl and Service Management
ssh robot@10.0.0.10 'journalctl -u robot-bringup.service -f'
ssh robot@10.0.0.10 'systemctl status robot-drivers.service robot-perception.service'
ssh robot@10.0.0.10 'sudo systemctl restart robot-perception.service'
ssh robot@10.0.0.10 'systemd-analyze blame | head -20'
ssh robot@10.0.0.10 'systemctl --failed'
ssh robot@10.0.0.10 'journalctl -u robot-bringup.service -o json --follow'
Deploying Updates via SSH
#!/bin/bash
set -euo pipefail
ROBOT_HOST="${1:?Usage: $0 robot@host}"
WORKSPACE="/home/robot/ros2_ws"
echo "=== Building workspace locally ==="
colcon build --cmake-args -DCMAKE_BUILD_TYPE=Release --packages-select my_robot_bringup
echo "=== Syncing to robot ==="
rsync -avz --delete \
--exclude='build/' --exclude='log/' \
src/ "${ROBOT_HOST}:${WORKSPACE}/src/"
echo "=== Building on robot ==="
ssh "$ROBOT_HOST" "cd ${WORKSPACE} && \
source /opt/ros/\${ROS_DISTRO}/setup.bash && \
colcon build --cmake-args -DCMAKE_BUILD_TYPE=Release"
echo "=== Restarting robot services ==="
ssh "$ROBOT_HOST" 'sudo systemctl restart robot-bringup.target'
echo "=== Checking service status ==="
ssh "$ROBOT_HOST" 'sleep 3 && systemctl status robot-bringup.target --no-pager'
echo "Deploy complete."
Robot Bringup Anti-Patterns
1. Sourcing setup.bash in .bashrc for systemd
Problem: systemd services do not load ~/.bashrc or ~/.profile. Environment variables set there are invisible to the service, causing ROS2 commands to fail with "command not found" or missing package errors.
source /opt/ros/humble/setup.bash
ROS_DISTRO=humble
RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
2. No Startup Ordering
Problem: Starting all ROS2 nodes simultaneously causes race conditions. A navigation node may attempt to call a service that has not yet been advertised by the driver, leading to intermittent startup failures.
Fix: Use After= and Requires= in systemd units, or use a lifecycle manager to enforce ordered transitions:
[Unit]
Description=Robot Navigation
[Unit]
Description=Robot Navigation
After=robot-perception.service
Requires=robot-perception.service
3. Using Restart=always Without Rate Limiting
Problem: A service that crashes on startup (e.g., missing config file, hardware disconnected) will restart in a tight loop, consuming CPU and flooding the journal.
Fix: Use StartLimitIntervalSec and StartLimitBurst to cap restart attempts:
[Service]
Restart=always
RestartSec=1
[Service]
Restart=on-failure
RestartSec=5
StartLimitIntervalSec=120
StartLimitBurst=5
4. Relying on network.target Instead of network-online.target
Problem: network.target is reached as soon as the network configuration starts, not when connectivity is actually established. DDS discovery fails because the network interface does not have an IP address yet.
Fix: Use network-online.target and ensure systemd-networkd-wait-online.service or NetworkManager-wait-online.service is enabled:
[Unit]
After=network.target
[Unit]
After=network-online.target
Wants=network-online.target
5. No Log Rotation
Problem: ROS2 log files in ~/.ros/log/ and journal entries grow without limit, eventually filling the disk on an embedded system with limited storage.
Fix: Configure logrotate for ROS2 log files and set journald size limits:
6. Hardcoded Device Paths (/dev/ttyUSB0)
Problem: /dev/ttyUSB0 can be assigned to any USB serial device depending on enumeration order. After a reboot, the IMU might become /dev/ttyUSB1 and the motor controller /dev/ttyUSB0, reversing the mapping.
Fix: Use udev rules to create stable symlinks:
serial_port: "/dev/ttyUSB0"
serial_port: "/dev/robot/imu"
7. Running the Entire Stack as Root
Problem: Running ROS2 as root is a security risk and can cause permission issues with rosbag2, log files, and parameter persistence. A bug in a ROS2 node could damage the operating system.
Fix: Create a dedicated robot user and grant only the necessary device permissions via udev GROUP and MODE rules:
sudo useradd -r -m -s /bin/bash robot
sudo usermod -aG dialout,video,plugdev robot
8. No Graceful Shutdown Handler
Problem: When systemd sends SIGTERM or SIGINT to stop a ROS2 node, the node exits immediately without commanding zero velocity or engaging brakes. The robot may coast or continue moving with the last commanded velocity.
Fix: Register signal handlers or use rclpy's shutdown callback to command a safe state:
def main():
rclpy.init()
node = MotorControlNode()
rclpy.spin(node)
def main():
rclpy.init()
node = MotorControlNode()
try:
rclpy.spin(node)
except KeyboardInterrupt:
pass
finally:
node.command_zero_velocity()
node.engage_brakes()
node.destroy_node()
rclpy.shutdown()
Robot Bringup Checklist
- udev rules written and tested for all USB devices (cameras, LiDARs, serial adapters) with stable symlinks under
/dev/robot/
- systemd service units created for each layer (drivers, perception, application) with correct
After=/Requires= ordering
- ROS2 environment file (
/etc/robot/ros2.env) configured with ROS_DISTRO, RMW_IMPLEMENTATION, ROS_DOMAIN_ID, and CYCLONEDDS_URI
- CycloneDDS or FastDDS XML configured with explicit peer list for multi-machine discovery
- Launch files layered and composable with conditional arguments for sim/real and robot variants
- Health check scripts written for
ExecStartPre to verify device presence before starting drivers
- Watchdog integration configured:
WatchdogSec in service units and sd_notify(WATCHDOG=1) in the ROS2 process
- Heartbeat monitor node deployed to detect node failures and trigger safe stop
- Graceful shutdown handlers registered in all actuator nodes (zero velocity, engage brakes on
SIGINT/SIGTERM)
- Log rotation configured via logrotate for
$ROS_LOG_DIR and journald SystemMaxUse limits
- Restart policies rate-limited with
StartLimitIntervalSec and StartLimitBurst to prevent restart loops
- Resource limits set via
MemoryMax, CPUQuota to prevent runaway nodes from starving the system
- Network and firewall configured with static IPs, DDS port rules, and
ROS_LOCALHOST_ONLY set correctly
- Full boot test performed from power-off to autonomous operation, verifying service ordering and recovery from simulated failures