HORUS Benchmarks
Performance validation with real-world robotics workloads.
Benchmark Methodology
Measurement Approach
- Statistical sampling: Criterion.rs with 20+ samples per measurement
- Confidence intervals: Min/mean/max with outlier detection
- Controlled methodology: 1s warm-up, 5s measurement phases
- Reproducible: Less than 1% variance across measurements
- Comprehensive coverage: 5 workload types, 4 scalability points
Workload Testing
- Real workloads: Control loops, sensor fusion, I/O operations
- Fault injection: Failure policy recovery testing
- Scale testing: Validated up to 200 concurrent nodes
- Mixed patterns: Combined blocking/non-blocking operations
- Long-running: 25+ second failure recovery tests
Executive Summary
HORUS delivers sub-microsecond to low-microsecond latency for production robotics applications:
| Message Type | Size | Latency (Topic N:N) | Throughput | Typical Rate | Headroom |
|---|---|---|---|---|---|
| CmdVel | 16 B | ~500 ns | 2.7M msg/s | 1000 Hz | 2,700x |
| BatteryState | 104 B | ~600 ns | 1.67M msg/s | 1 Hz | 1.67M x |
| IMU | 304 B | ~940 ns | 1.8M msg/s | 100 Hz | 18,000x |
| Odometry | 736 B | ~1.1 μs | 1.3M msg/s | 50 Hz | 26,000x |
| LaserScan | 1.5 KB | ~2.2 μs | 633K msg/s | 10 Hz | 63,300x |
| PointCloud (1K) | ~12 KB | ~12 μs | 83K msg/s | 30 Hz | 2,767x |
| PointCloud (10K) | ~120 KB | ~360 μs | 4.7K msg/s | 30 Hz | 157x |
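The headroom column is simply sustained throughput divided by the typical publish rate. A quick sketch recomputing a few rows of the table above (values copied from the summary; the helper name is illustrative):

```python
# Headroom = sustained throughput / typical publish rate.
# Values are taken from the benchmark summary table above.
rows = {
    "CmdVel": (2_700_000, 1000),
    "IMU": (1_800_000, 100),
    "Odometry": (1_300_000, 50),
    "LaserScan": (633_000, 10),
    "PointCloud-10K": (4_700, 30),
}

def headroom(throughput_msgs_per_s, rate_hz):
    """How many times faster the transport is than the application needs."""
    return throughput_msgs_per_s / rate_hz

for name, (tput, rate) in rows.items():
    print(f"{name}: {headroom(tput, rate):,.0f}x")
```

This reproduces the 2,700x figure for CmdVel and the 157x figure for 10K point clouds.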
Latency Comparison: HORUS vs ROS2
Lower is better. Logarithmic scale (send-only latency in μs)
Performance Highlights
Key Findings
- Sub-microsecond latency for messages up to 1.5 KB
- Serde integration works flawlessly with complex nested structs
- Linear scaling with message size (predictable performance)
- Massive headroom for all typical robotics frequencies
Production Readiness
- Real-time control: ~500 ns latency supports 1000Hz+ control loops with 2,700x headroom
- Sensor fusion: Mixed workload maintains sub-microsecond performance (648 ns avg)
- Perception pipelines: 10K point clouds @ 30Hz with 157x headroom
- Multi-robot systems: Throughput supports 100+ robots on a single node
Detailed Results
CmdVel (Motor Control Command)
Use Case: Real-time motor control @ 1000Hz
Structure: { timestamp: u64, linear: f32, angular: f32 }
Average Latency: ~500 ns (Topic N:N)
Throughput: 2.7M msg/s
Topic 1:1: ~85 ns median
Analysis: Sub-microsecond performance suitable for 1000Hz control loops with 2,700x headroom.
LaserScan (2D Lidar Data)
Use Case: 2D lidar sensor data @ 10Hz
Structure: { ranges: [f32; 360], angle_min/max, metadata }
Average Latency: ~2.2 μs (Topic N:N)
Throughput: 633K msg/s
Topic 1:1: ~900 ns estimated
Analysis: Consistent low-microsecond latency for 1.5KB messages. Can easily handle 10Hz lidar updates with 63,300x headroom.
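A quick size sanity check on the structure above, assuming `f32` fields (the split between ranges, angles, and metadata is an estimate, not a stated layout):

```python
# LaserScan payload estimate: 360 f32 ranges plus two f32 angle bounds.
RANGES_BYTES = 360 * 4            # [f32; 360] -> 1440 bytes
ANGLE_BYTES = 2 * 4               # angle_min / angle_max as f32
known_bytes = RANGES_BYTES + ANGLE_BYTES
print(known_bytes)                # remainder of the 1480 B total is metadata
```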
IMU (Inertial Measurement Unit)
Use Case: Orientation and acceleration @ 100Hz
Structure: { orientation: [f64; 4], angular_velocity: [f64; 3], linear_acceleration: [f64; 3], covariances: [f64; 27] }
Average Latency: ~940 ns (Topic N:N)
Throughput: 1.8M msg/s
Topic 1:1: ~400 ns estimated
Analysis: Sub-microsecond performance with complex nested arrays and 27-element covariance matrices.
Odometry (Pose + Velocity)
Use Case: Robot localization @ 50Hz
Structure: { pose: Pose2D, twist: Twist, pose_covariance: [f64; 36], twist_covariance: [f64; 36] }
Average Latency: ~1.1 μs (Topic N:N)
Throughput: 1.3M msg/s
Topic 1:1: ~600 ns estimated
Analysis: Low-microsecond latency for 736-byte messages with extensive covariance data.
PointCloud (3D Perception)
Small (100 points @ 30Hz)
Average Latency: 1.85 μs
Throughput: 539,529 msg/s
Data Size: ~1.2 KB
Medium (1,000 points @ 30Hz)
Average Latency: 7.55 μs
Throughput: 132,432 msg/s
Data Size: ~12 KB
Large (10,000 points @ 30Hz)
Average Latency: ~360 μs (Topic N:N)
Throughput: 4.7K msg/s
Data Size: ~120 KB
Analysis: Linear scaling with point count. Even 10K point clouds process in ~360 μs (sufficient for 30Hz perception with 157x headroom).
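Headroom for large clouds can be expressed two ways, and the numbers above support both. A back-of-envelope check (constants copied from the results above):

```python
# Two views of 10K-point-cloud headroom at 30 Hz.
period_us = 1_000_000 / 30          # ~33,333 us frame budget at 30 Hz
latency_us = 360                    # measured Topic N:N latency
time_headroom = period_us / latency_us   # fraction of the frame budget, ~93x
rate_headroom = 4_700 / 30               # sustained throughput vs. rate, ~157x
print(round(time_headroom), round(rate_headroom))
```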
Mixed Workload (Realistic Robot Loop)
Simulation: Real robot control loop @ 100Hz
Components: CmdVel @ 100Hz + IMU @ 100Hz + BatteryState @ 1Hz
Total Operations: 20,100 messages
Average Latency: ~1.0 μs (Topic N:N)
Throughput: ~1.5M msg/s
Range: ~500-1200 ns
Analysis: Low-microsecond average latency for mixed message types simulating realistic robotics workload.
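The 20,100-message total is consistent with the stated mix over a 100-second run (the duration is inferred, not stated in the results):

```python
# Message count for the mixed workload: CmdVel@100Hz + IMU@100Hz + Battery@1Hz.
run_s = 100                                # assumed run duration
total = (100 + 100) * run_s + 1 * run_s    # two 100 Hz streams plus one 1 Hz stream
print(total)  # 20100
```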
Comparison with Traditional Frameworks
Latency Comparison
Measurement Note: Topic 1:1 values below are send-only (one-direction). For round-trip (send+receive), approximately double these values (e.g., 87ns send-only → ~175ns round-trip).
| Framework | Small Msg (send-only) | Medium Msg (send-only) | Large Msg (send-only) |
|---|---|---|---|
| HORUS Topic (1:1) | 87 ns | ~160 ns | ~400 ns |
| HORUS Topic (N:N) | 313 ns | ~500 ns | ~1.1 μs |
| ROS2 (DDS) | 50-100 μs | 100-500 μs | 1-10 ms |
| ROS2 (FastDDS) | 20-50 μs | 50-200 μs | 500 μs - 5 ms |
Performance Advantage: HORUS is 230-575x faster than ROS2 for typical message sizes.
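The 230-575x range follows directly from the table above: the low end divides FastDDS's best case by HORUS's 1:1 latency, the high end divides the default DDS figure by the same:

```python
# Reproducing the 230-575x speedup claim from the latency table above.
horus_1to1_ns = 87
fastdds_low_ns = 20_000    # 20 us, low end of the FastDDS small-message row
dds_low_ns = 50_000        # 50 us, low end of the default DDS row
print(round(fastdds_low_ns / horus_1to1_ns),
      round(dds_low_ns / horus_1to1_ns))
```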
HORUS Speedup vs ROS2
How many times faster HORUS Link is compared to ROS2 DDS
Latency by Message Size
Measurement Note: All latencies below are send-only (one-direction publish). "1:1" = single producer/consumer, "N:N" = multiple producers and consumers.
| Message Size | Message Type | N:N (send-only) | 1:1 (send-only) | vs ROS2 |
|---|---|---|---|---|
| 16 B | CmdVel | ~313 ns | 87 ns | 230-575x faster |
| 104 B | BatteryState | ~600 ns | ~350 ns | 83-286x faster |
| 304 B | IMU | ~940 ns | ~400 ns | 53-250x faster |
| 736 B | Odometry | ~1.1 μs | ~600 ns | 45-167x faster |
| 1,480 B | LaserScan | ~2.2 μs | ~900 ns | 23-111x faster |
Observation: Near-linear scaling with message size demonstrates efficient serialization and IPC.
Latency vs Message Size
HORUS shows linear scaling. Values in nanoseconds.
Python Performance
The HORUS Python bindings (PyO3) call directly into the Rust shared memory layer, avoiding pickle serialization overhead. Python nodes and Rust nodes communicate through the same shared memory, enabling cross-language interoperability with minimal overhead.
Why Python HORUS is Fast:
- Zero-copy via Rust core: Python bindings call directly into Rust shared memory
- No pickle overhead: Messages use efficient binary serialization
- PyO3 efficiency: Minimal FFI overhead between Python and Rust
TensorPool
HORUS TensorPool provides shared memory tensors optimized for ML/AI workloads. Pre-mapped shared memory means no malloc() or zero-initialization on the hot path.
```python
from horus import TensorPool
import numpy as np

# Create pool
pool = TensorPool(12345)  # pool_id

# Allocate tensor (pre-mapped shared memory)
h = pool.alloc([1024, 1024], 'float32')

# Zero-copy NumPy view
arr = h.numpy()  # No data copied

# Cross-process sharing via shared memory
descriptor = h.to_descriptor()
```
Key Advantages:
- Cross-process sharing via shared memory
- Pre-allocated pool — no malloc on hot path
- Refcounted handles — safe concurrent access
- Zero-copy NumPy — `.numpy()` returns a view
Running Rust Benchmarks
Quick Run
```shell
cd horus
cargo run --release -p horus_benchmarks --bin robotics_messages_benchmark
```
Available Benchmarks
| Binary | Description |
|---|---|
| robotics_messages_benchmark | IPC latency with real robotics message types |
| all_paths_latency | AdaptiveTopic latency across all backend routes |
| cross_process_benchmark | Cross-process shared memory IPC |
| scalability_benchmark | Scaling with producer/consumer thread counts |
| determinism_benchmark | Execution determinism and jitter |
| dds_comparison_benchmark | Comparison with DDS middleware (requires --features dds) |
Run any benchmark with:
```shell
cargo run --release -p horus_benchmarks --bin <name>

# JSON output for CI/regression tracking
cargo run --release -p horus_benchmarks --bin <name> -- --json results.json
```
Criterion micro-benchmarks:
```shell
cd horus
cargo bench -p horus_benchmarks
```
Expected Output
```text
HORUS Production Message Benchmark Suite
Testing with real robotics message types

CmdVel (Motor Control Command)
  Size: 16 bytes | Typical rate: 1000Hz
  Latency (avg): ~500 ns (Topic N:N) / ~85 ns (Topic 1:1)
  Throughput: 2.7M msg/s (Topic N:N)

LaserScan (2D Lidar Data)
  Size: 1480 bytes | Typical rate: 10Hz
  Latency (avg): ~2.2 μs (Topic N:N) / ~900 ns (Topic 1:1)
  Throughput: 633K msg/s (Topic N:N)
```
Use Case Selection
Message Type Guidelines
CmdVel (~500 ns N:N / ~85 ns 1:1)
- Motor control @ 1000Hz
- Real-time actuation commands
- Safety-critical control loops
IMU (~940 ns N:N / ~400 ns 1:1)
- High-frequency sensor fusion @ 100Hz
- State estimation pipelines
- Orientation tracking
LaserScan (~2.2 μs N:N / ~900 ns 1:1)
- 2D lidar @ 10Hz
- Obstacle detection
- SLAM front-end
Odometry (~1.1 μs N:N / ~600 ns 1:1)
- Pose estimation @ 50Hz
- Dead reckoning
- Filter updates
PointCloud (~360 μs for 10K pts)
- 3D perception @ 30Hz
- Object detection pipelines
- Dense mapping
Performance Characteristics
Strengths
- Sub-microsecond latency for messages up to 1.5KB
- Consistent performance across message types (low variance)
- Linear scaling with message size
- Production-ready throughput with large headroom
- Serde integration handles complex nested structs efficiently
Additional Notes
- Complex structs (IMU with 27-element covariances): Still sub-microsecond
- Variable-size messages (PointCloud with Vec): Linear scaling
Real-World Applications
| Application | Frequency | HORUS (Topic 1:1) | HORUS (Topic N:N) | ROS2 | Speedup |
|---|---|---|---|---|---|
| Motor control | 1000 Hz | ~85 ns | ~500 ns | 50 μs | 200-588x |
| IMU fusion | 100 Hz | ~400 ns | ~940 ns | 50 μs | 53-125x |
| Lidar SLAM | 10 Hz | ~900 ns | ~2.2 μs | 100 μs | 45-111x |
| Vision | 30 Hz | ~120 μs | ~360 μs | 5 ms | 14-42x |
| Planning | 100 Hz | ~600 ns | ~1.1 μs | 100 μs | 91-167x |
Throughput Comparison
Messages per second (millions). Higher is better.
Methodology
Benchmark Pattern: Ping-Pong
HORUS uses the industry-standard ping-pong benchmark pattern for IPC latency measurement:
Why Ping-Pong?
- Industry standard: Used by ROS2, iceoryx2, ZeroMQ benchmarks
- Prevents queue buildup: Each message acknowledged before next send
- Realistic: Models request-response patterns in robotics
- Comparable: Direct apples-to-apples comparison with other frameworks
- Conservative: Measures true round-trip latency, not just one-way send
What we measure:
- Round-trip time: Producer → Consumer → ACK → Producer
- Includes serialization, IPC, deserialization, and synchronization
- Cross-core communication (Core 0 ↔ Core 1)
What we DON'T measure:
- Burst throughput (no backpressure)
- One-way send time without acknowledgment
- Same-core communication (unrealistic for multi-process IPC)
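The pattern above can be sketched generically with plain OS pipes (this is NOT the HORUS transport, just an illustration of ping-pong timing: every send is acknowledged before the next one):

```python
# Generic ping-pong latency sketch using multiprocessing pipes.
# Illustrates the measurement pattern, not HORUS performance.
import multiprocessing as mp
import time

def echo(conn, n):
    for _ in range(n):
        conn.send(conn.recv())   # bounce each message straight back (the ACK)

def ping_pong(n=1000):
    parent, child = mp.Pipe()
    p = mp.Process(target=echo, args=(child, n))
    p.start()
    samples = []
    for i in range(n):
        t0 = time.perf_counter_ns()
        parent.send(i)
        parent.recv()            # wait for the ACK before the next send
        samples.append(time.perf_counter_ns() - t0)
    p.join()
    samples.sort()
    return samples[len(samples) // 2]   # median round-trip in ns

if __name__ == "__main__":
    print(f"median round-trip: {ping_pong()} ns")
```

Because each iteration waits for the acknowledgment, queues never build up and the measured figure is a true round-trip, matching the "conservative" property claimed above.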
Test Environment
- Build: `cargo build --release` with full optimizations
- CPU Governor: Performance mode
- CPU Affinity: Producer pinned to Core 0, Consumer pinned to Core 1
- Process Isolation: Dedicated topics per benchmark
- Warmup: 1,000 iterations before measurement
- Measurement: RDTSC (cycle-accurate timestamps)
Message Realism
- Actual HORUS library message types
- Serde serialization (production path)
- Realistic field values and sizes
- Complex nested structures (IMU, Odometry)
Statistical Methodology
- 10,000 iterations per test
- Median, P95, P99 latency tracking
- Variance tracking (min/max ranges)
- Multiple message sizes
- Mixed workload testing
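Median/P95/P99 tracking over a sample set can be sketched as follows (a minimal nearest-rank percentile, not the benchmark's actual implementation):

```python
# Median / P95 / P99 from a list of latency samples (nearest-rank method).
def percentiles(samples):
    s = sorted(samples)
    def pct(p):
        return s[min(len(s) - 1, int(p * len(s)))]
    return {"median": pct(0.50), "p95": pct(0.95), "p99": pct(0.99)}

latencies_ns = list(range(100, 1100))   # stand-in for 1,000 measured latencies
stats = percentiles(latencies_ns)
print(stats)
```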
Measurement Details
RDTSC Calibration:
- Null cost (back-to-back rdtsc): ~36 cycles
- Target on modern x86_64: 20-30 cycles
- Timestamp embedded directly in message payload
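The null-cost calibration above has a portable analogue: time two back-to-back reads of the clock source itself, so the median delta is pure measurement overhead (here with `time.perf_counter_ns`, not RDTSC):

```python
# Measure the "null cost" of the timestamp source: back-to-back clock reads.
import time

def timer_null_cost(iters=100_000):
    deltas = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()   # delta is the timer's own overhead
        deltas.append(t1 - t0)
    deltas.sort()
    return deltas[len(deltas) // 2]   # median overhead in ns

print(f"timer null cost: ~{timer_null_cost()} ns")
```

Subtracting this null cost from measured round-trips keeps the timer's own overhead out of the reported latency.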
Cross-Core Testing:
- Producer and consumer on different CPU cores
- Simulates real multi-process robotics systems
- Includes cache coherency overhead (~60 cycles theoretical minimum)
Scheduler Performance
Enhanced Smart Scheduler
HORUS now includes an intelligent scheduler that automatically optimizes node execution based on runtime behavior:
Key Enhancements:
- Tiered Execution: Explicit tier annotation (UltraFast, Fast, Normal)
- Failure Policies: Per-node failure handling with automatic recovery
- Predictable by Default: Sequential execution with consistent priority ordering
- Safety Monitoring: WCET enforcement, watchdogs, and emergency stop
Comprehensive Benchmark Results
Test Configuration:
- Workload duration: 5 seconds per test
- Sample size: 20 measurements per benchmark
- Platform: Modern x86_64 Linux system
| Workload Type | Mean Time | Description | Key Achievement |
|---|---|---|---|
| UltraFastControl | 2.387s | High-frequency control loops | Optimized for high-frequency control |
| FastSensor | 2.382s | Rapid sensor processing | Maintains sub-μs sensor fusion |
| HeavyIO | 3.988s | I/O-intensive operations | Async tier prevents blocking |
| MixedRealistic | 4.064s | Real-world mixed workload | Balanced optimization across tiers |
| FaultTolerance | 25.485s | With simulated failures | Failure policy recovery working |
Scalability Performance
The scheduler demonstrates excellent linear scaling:
| Node Count | Execution Time | Scaling Factor |
|---|---|---|
| 10 nodes | 106.93ms | Baseline |
| 50 nodes | 113.93ms | 1.07x (5x nodes) |
| 100 nodes | 116.49ms | 1.09x (10x nodes) |
| 200 nodes | 119.55ms | 1.12x (20x nodes) |
Key Insights:
- Near-linear scaling from 10 to 200 nodes
- Only 13ms increase for 20x more nodes
- Maintains sub-120ms for large systems
- Automatic tier classification optimizes execution order
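The scaling factors in the table reduce to execution time relative to the 10-node baseline, which can be checked directly:

```python
# Scaling factors from the scalability table: time relative to 10-node baseline.
times_ms = {10: 106.93, 50: 113.93, 100: 116.49, 200: 119.55}
baseline = times_ms[10]
for nodes, t in times_ms.items():
    print(f"{nodes} nodes: {t / baseline:.2f}x")
```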
Scheduler Scalability
Near-constant execution time regardless of node count
Real-Time Performance
RtNode Support
HORUS now provides industrial-grade real-time support for safety-critical applications:
RT Features:
- WCET Enforcement: Worst-Case Execution Time monitoring
- Deadline Tracking: Count and handle deadline misses
- Safety Monitor: Emergency stop on critical failures
- Watchdog Timers: Detect hung or crashed nodes
RT Performance Characteristics
| Metric | Performance | Description |
|---|---|---|
| WCET Overhead | <5μs | Cost of monitoring execution time |
| Deadline Precision | ±10μs | Jitter in deadline detection |
| Watchdog Resolution | 1ms | Minimum detection time |
| Emergency Stop | <100μs | Time to halt all nodes |
| Context Switch | <1μs | Priority preemption overhead |
Safety-Critical Configuration
Running with full safety monitoring enabled:
```rust
let scheduler = Scheduler::new().tick_rate(1000_u64.hz());
```
| Feature | Overhead | Impact |
|---|---|---|
| WCET Tracking | ~1μs per node | Negligible for >100μs tasks |
| Deadline Monitor | ~500ns per node | Sub-microsecond overhead |
| Watchdog Feed | ~100ns per tick | Minimal impact |
| Safety Checks | ~2μs total | Worth it for safety |
| Memory Locking | One-time 10ms | Prevents page faults |
Real-Time Test Results
Test: Mixed RT and Normal Nodes
- 2 critical RT nodes @ 1kHz
- 2 normal nodes @ 100Hz
- 2 background nodes @ 10Hz
| Node Type | Target Rate | Achieved | Jitter | Misses |
|---|---|---|---|---|
| RT Critical | 1000 Hz | 999.8 Hz | ±10μs | 0 |
| RT High | 500 Hz | 499.9 Hz | ±15μs | 0 |
| Normal | 100 Hz | 99.9 Hz | ±50μs | <0.1% |
| Background | 10 Hz | 10 Hz | ±200μs | <0.5% |
Zero deadline misses for critical RT nodes over 1M iterations.
Real-Time Node Performance
Target rate achievement and jitter measurements
All-Routes Latency
HORUS automatically selects the optimal communication path based on topology (same-thread, cross-thread, cross-process) and producer/consumer count. This benchmark measures the latency of each automatically-selected route.
Benchmark Results
| Scenario | Latency | Target | Notes |
|---|---|---|---|
| Same thread, 1:1 | 16ns | 60ns | Ultra-fast direct path |
| Cross-thread, 1:1 | 11ns | 60ns | Optimized single-producer path |
| Cross-process, 1:1 | 182ns | 100ns | Shared memory path |
| Cross-process, N:1 | 244ns | 150ns | Multi-producer shared memory |
| Cross-process, N:N | 187ns | 200ns | General cross-process |
Latency by Topology
| Topology | Producers | Consumers | Latency |
|---|---|---|---|
| Same thread | 1 | 1 | ~16ns |
| Same process | 1 | 1 | ~11ns |
| Same process | N | 1 | ~15ns |
| Same process | 1 | N | ~15ns |
| Same process | N | N | ~20ns |
| Cross process | 1 | 1 | ~180ns |
| Cross process | N | 1 | ~250ns |
| Cross process | 1 | N | ~200ns |
| Cross process | N | N | ~190ns |
Key Achievements
- Sub-20ns for same-process communication
- Sub-200ns for cross-process 1:1
- Sub-300ns for multi-producer cross-process
- Zero configuration — optimal path selected automatically
- Seamless migration — path upgrades transparently as topology changes
Running the Benchmark
```shell
cd horus
cargo build --release -p horus_benchmarks
./target/release/all_paths_latency
```
Summary
HORUS provides production-grade performance for real robotics applications:
Automatic Path Selection (Recommended):
- 16 ns — Same-thread
- 11 ns — Cross-thread, 1:1
- 182 ns — Cross-process, 1:1
- 244 ns — Cross-process, multi-producer
- 187 ns — Cross-process, multi-producer/consumer
Point-to-Point (1:1):
- 87 ns — Send only (ultra-low latency)
- 161 ns — CmdVel (motor control)
- 262 ns — Send+Recv round-trip
- ~400 ns — IMU (sensor fusion)
- ~120 μs — PointCloud with 10K points
Multi-Producer/Consumer (N:N):
- ~313 ns — CmdVel (motor control)
- ~500 ns — IMU (sensor fusion)
- ~2.2 μs — LaserScan (2D lidar)
- ~1.1 μs — Odometry (localization)
- ~360 μs — PointCloud with 10K points
Ready for production deployment in demanding robotics applications requiring real-time performance with complex data types.
Next Steps
- Learn how to maximize performance: Performance Optimization
- Explore message types: Message Types
- See usage examples: Examples
- Get started: Quick Start
Build faster. Debug easier. Deploy with confidence.