HORUS Benchmarks

Performance validation with real-world robotics workloads.

Benchmark Methodology

Measurement Approach

  • Statistical sampling: Criterion.rs with 20+ samples per measurement
  • Confidence intervals: Min/mean/max with outlier detection
  • Controlled methodology: 1s warm-up, 5s measurement phases
  • Reproducible: Less than 1% variance across measurements
  • Comprehensive coverage: 5 workload types, 4 scalability points
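The outlier handling described above can be sketched as follows — a simplified stand-in for Criterion.rs's Tukey-fence analysis, not its actual implementation:

```python
# Sketch of the sampling methodology: summarize a batch of latency samples
# and flag outliers with Tukey's IQR fences (the approach Criterion.rs uses).
import statistics

def summarize(samples):
    """Return (min, mean, max, outliers) for a list of latency samples."""
    s = sorted(samples)
    n = len(s)
    q1, q3 = s[n // 4], s[(3 * n) // 4]
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in s if x < lo or x > hi]
    return min(s), statistics.mean(s), max(s), outliers

# 20 synthetic latency samples (ns) with one injected spike
samples = [500, 502, 498, 505, 499] * 4
samples[7] = 900  # simulated scheduler hiccup
mn, mean, mx, outliers = summarize(samples)
assert outliers == [900]  # the spike is flagged, not folded into the estimate
```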

Workload Testing

  • Real workloads: Control loops, sensor fusion, I/O operations
  • Fault injection: Failure policy recovery testing
  • Scale testing: Validated up to 200 concurrent nodes
  • Mixed patterns: Combined blocking/non-blocking operations
  • Long-running: 25+ second failure recovery tests

Executive Summary

HORUS delivers sub-microsecond to low-microsecond latency for production robotics applications:

| Message Type | Size | Latency (Topic N:N) | Throughput | Typical Rate | Headroom |
|---|---|---|---|---|---|
| CmdVel | 16 B | ~500 ns | 2.7M msg/s | 1000 Hz | 2,700x |
| BatteryState | 104 B | ~600 ns | 1.67M msg/s | 1 Hz | 1.67M x |
| IMU | 304 B | ~940 ns | 1.8M msg/s | 100 Hz | 18,000x |
| Odometry | 736 B | ~1.1 μs | 1.3M msg/s | 50 Hz | 26,000x |
| LaserScan | 1.5 KB | ~2.2 μs | 633K msg/s | 10 Hz | 63,300x |
| PointCloud (1K) | ~12 KB | ~12 μs | 83K msg/s | 30 Hz | 2,767x |
| PointCloud (10K) | ~120 KB | ~360 μs | 4.7K msg/s | 30 Hz | 157x |
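The Headroom column is derived directly from the other two: measured throughput divided by the typical publish rate. For example:

```python
# Headroom = how many times faster the transport is than the application needs.
def headroom(throughput_msgs_per_s, typical_rate_hz):
    return throughput_msgs_per_s / typical_rate_hz

assert headroom(2_700_000, 1000) == 2700    # CmdVel row: 2,700x
assert headroom(633_000, 10) == 63300       # LaserScan row: 63,300x
```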

Latency Comparison: HORUS vs ROS2

(Chart: send-only latency in μs, logarithmic scale; lower is better. Series: HORUS Link (SPSC, wait-free), HORUS Hub (MPMC, lock-free), ROS2 DDS (typical).)

Performance Highlights

Key Findings

  • Sub-microsecond latency for messages up to 1.5 KB
  • Serde integration works flawlessly with complex nested structs
  • Linear scaling with message size (predictable performance)
  • Massive headroom for all typical robotics frequencies

Production Readiness

  • Real-time control: ~500 ns latency supports 1000Hz+ control loops with 2,700x headroom
  • Sensor fusion: Mixed workload maintains sub-microsecond performance (648 ns avg)
  • Perception pipelines: 10K point clouds @ 30Hz with 157x headroom
  • Multi-robot systems: Throughput supports 100+ robots on single node

Detailed Results

CmdVel (Motor Control Command)

Use Case: Real-time motor control @ 1000Hz
Structure: { timestamp: u64, linear: f32, angular: f32 }

Average Latency: ~500 ns (Topic N:N)
Throughput:      2.7M msg/s
Topic 1:1:       ~85 ns median

Analysis: Sub-microsecond performance suitable for 1000Hz control loops with 2,700x headroom.


LaserScan (2D Lidar Data)

Use Case: 2D lidar sensor data @ 10Hz
Structure: { ranges: [f32; 360], angle_min/max, metadata }

Average Latency: ~2.2 μs (Topic N:N)
Throughput:      633K msg/s
Topic 1:1:       ~900 ns estimated

Analysis: Consistent low-microsecond latency for 1.5KB messages. Can easily handle 10Hz lidar updates with 63,300x headroom.


IMU (Inertial Measurement Unit)

Use Case: Orientation and acceleration @ 100Hz
Structure: { orientation: [f64; 4], angular_velocity: [f64; 3], linear_acceleration: [f64; 3], covariances: [f64; 27] }

Average Latency: ~940 ns (Topic N:N)
Throughput:      1.8M msg/s
Topic 1:1:       ~400 ns estimated

Analysis: Sub-microsecond performance with complex nested arrays and 27-element covariance matrices.


Odometry (Pose + Velocity)

Use Case: Robot localization @ 50Hz
Structure: { pose: Pose2D, twist: Twist, pose_covariance: [f64; 36], twist_covariance: [f64; 36] }

Average Latency: ~1.1 μs (Topic N:N)
Throughput:      1.3M msg/s
Topic 1:1:       ~600 ns estimated

Analysis: Low-microsecond latency for 736-byte messages with extensive covariance data.


PointCloud (3D Perception)

Small (100 points @ 30Hz)

Average Latency: 1.85 μs
Throughput:      539,529 msg/s
Data Size:       ~1.2 KB

Medium (1,000 points @ 30Hz)

Average Latency: 7.55 μs
Throughput:      132,432 msg/s
Data Size:       ~12 KB

Large (10,000 points @ 30Hz)

Average Latency: ~360 μs (Topic N:N)
Throughput:      4.7K msg/s
Data Size:       ~120 KB

Analysis: Linear scaling with point count. Even 10K point clouds process in ~360 μs (sufficient for 30Hz perception with 157x headroom).
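The 30 Hz headroom figures quoted above follow directly from the measured throughput (using the executive-summary numbers):

```python
# Headroom at a 30 Hz perception rate = throughput / 30.
headroom_10k = round(4_700 / 30)     # 10K-point clouds
headroom_1k = round(83_000 / 30)     # 1K-point clouds
assert (headroom_10k, headroom_1k) == (157, 2767)
```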


Mixed Workload (Realistic Robot Loop)

Simulation: Real robot control loop @ 100Hz
Components: CmdVel @ 100Hz + IMU @ 100Hz + BatteryState @ 1Hz

Total Operations: 20,100 messages
Average Latency:  ~1.0 μs (Topic N:N)
Throughput:       ~1.5M msg/s
Range:            ~500-1200 ns

Analysis: Low-microsecond average latency for mixed message types simulating realistic robotics workload.


Comparison with traditional frameworks

Latency Comparison

Measurement Note: Topic 1:1 values below are send-only (one-direction). For round-trip (send+receive), approximately double these values (e.g., 87ns send-only → ~175ns round-trip).

| Framework | Small Msg (send-only) | Medium Msg (send-only) | Large Msg (send-only) |
|---|---|---|---|
| HORUS Topic (1:1) | 87 ns | ~160 ns | ~400 ns |
| HORUS Topic (N:N) | 313 ns | ~500 ns | ~1.1 μs |
| ROS2 (DDS) | 50-100 μs | 100-500 μs | 1-10 ms |
| ROS2 (FastDDS) | 20-50 μs | 50-200 μs | 500 μs - 5 ms |

Performance Advantage: HORUS is 230-575x faster than ROS2 for typical message sizes.

HORUS Speedup vs ROS2

(Chart: HORUS Link speedup over ROS2 DDS by message size, grouped as >500x, 100-500x, and <100x faster.)

Latency by Message Size

Measurement Note: All latencies below are send-only (one-direction publish). "1:1" = single producer/consumer, "N:N" = multiple producers and consumers.

| Message Size | Message Type | N:N (send-only) | 1:1 (send-only) | vs ROS2 |
|---|---|---|---|---|
| 16 B | CmdVel | ~313 ns | 87 ns | 230-575x faster |
| 104 B | BatteryState | ~600 ns | ~350 ns | 83-286x faster |
| 304 B | IMU | ~940 ns | ~400 ns | 53-250x faster |
| 736 B | Odometry | ~1.1 μs | ~600 ns | 45-167x faster |
| 1,480 B | LaserScan | ~2.2 μs | ~900 ns | 23-111x faster |

Observation: Near-linear scaling with message size demonstrates efficient serialization and IPC.
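That near-linear relationship can be checked with a quick least-squares fit over the N:N column (values from the table above):

```python
# Least-squares fit of latency vs. message size for the N:N measurements.
sizes = [16, 104, 304, 736, 1480]       # bytes
lat_ns = [313, 600, 940, 1100, 2200]    # N:N send-only latency, ns

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(lat_ns) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, lat_ns)) \
      / sum((x - mean_x) ** 2 for x in sizes)
intercept = mean_y - slope * mean_x
# Fit comes out to roughly ~1.2 ns per byte on top of a fixed ~400 ns base cost.
```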

Latency vs Message Size

(Chart: HORUS latency in nanoseconds vs message size, showing linear scaling.)


Python Performance

The HORUS Python bindings (PyO3) call directly into the Rust shared memory layer, avoiding pickle serialization overhead. Python nodes and Rust nodes communicate through the same shared memory, enabling cross-language interoperability with minimal overhead.

Why Python HORUS is Fast:

  1. Zero-copy via Rust core: Python bindings call directly into Rust shared memory
  2. No pickle overhead: Messages use efficient binary serialization
  3. PyO3 efficiency: Minimal FFI overhead between Python and Rust

TensorPool

HORUS TensorPool provides shared memory tensors optimized for ML/AI workloads. Pre-mapped shared memory means no malloc() or zero-initialization on the hot path.

from horus import TensorPool
import numpy as np

# Create pool
pool = TensorPool(12345)  # pool_id

# Allocate tensor (pre-mapped shared memory)
h = pool.alloc([1024, 1024], 'float32')

# Zero-copy NumPy view
arr = h.numpy()  # No data copied

# Cross-process sharing via shared memory
descriptor = h.to_descriptor()

Key Advantages:

  • Cross-process sharing via shared memory
  • Pre-allocated pool — no malloc on hot path
  • Refcounted handles — safe concurrent access
  • Zero-copy NumPy — h.numpy() returns a view, not a copy

Running Rust Benchmarks

Quick Run

cd horus
cargo run --release -p horus_benchmarks --bin robotics_messages_benchmark

Available Benchmarks

| Binary | Description |
|---|---|
| robotics_messages_benchmark | IPC latency with real robotics message types |
| all_paths_latency | AdaptiveTopic latency across all backend routes |
| cross_process_benchmark | Cross-process shared memory IPC |
| scalability_benchmark | Scaling with producer/consumer thread counts |
| determinism_benchmark | Execution determinism and jitter |
| dds_comparison_benchmark | Comparison with DDS middleware (requires --features dds) |

Run any benchmark with:

cargo run --release -p horus_benchmarks --bin <name>

# JSON output for CI/regression tracking
cargo run --release -p horus_benchmarks --bin <name> -- --json results.json

Criterion micro-benchmarks:

cd horus
cargo bench -p horus_benchmarks

Expected Output


  HORUS Production Message Benchmark Suite
  Testing with real robotics message types


  CmdVel (Motor Control Command)
    Size: 16 bytes | Typical rate: 1000Hz
    Latency (avg): ~500 ns (Topic N:N) / ~85 ns (Topic 1:1)
    Throughput: 2.7M msg/s (Topic N:N)


  LaserScan (2D Lidar Data)
    Size: 1480 bytes | Typical rate: 10Hz
    Latency (avg): ~2.2 μs (Topic N:N) / ~900 ns (Topic 1:1)
    Throughput: 633K msg/s (Topic N:N)


Use Case Selection

Message Type Guidelines

CmdVel (~500 ns N:N / ~85 ns 1:1)

  • Motor control @ 1000Hz
  • Real-time actuation commands
  • Safety-critical control loops

IMU (~940 ns N:N / ~400 ns 1:1)

  • High-frequency sensor fusion @ 100Hz
  • State estimation pipelines
  • Orientation tracking

LaserScan (~2.2 μs N:N / ~900 ns 1:1)

  • 2D lidar @ 10Hz
  • Obstacle detection
  • SLAM front-end

Odometry (~1.1 μs N:N / ~600 ns 1:1)

  • Pose estimation @ 50Hz
  • Dead reckoning
  • Filter updates

PointCloud (~360 μs for 10K pts)

  • 3D perception @ 30Hz
  • Object detection pipelines
  • Dense mapping

Performance Characteristics

Strengths

  1. Sub-microsecond latency for messages up to 1.5KB
  2. Consistent performance across message types (low variance)
  3. Linear scaling with message size
  4. Production-ready throughput with large headroom
  5. Serde integration handles complex nested structs efficiently

Additional Notes

  • Complex structs (IMU with 27-element covariances): Still sub-microsecond
  • Variable-size messages (PointCloud with Vec): Linear scaling

Real-World Applications

| Application | Frequency | HORUS (Topic 1:1) | HORUS (Topic N:N) | ROS2 | Speedup |
|---|---|---|---|---|---|
| Motor control | 1000 Hz | ~85 ns | ~500 ns | 50 μs | 200-588x |
| IMU fusion | 100 Hz | ~400 ns | ~940 ns | 50 μs | 53-125x |
| Lidar SLAM | 10 Hz | ~900 ns | ~2.2 μs | 100 μs | 45-111x |
| Vision | 30 Hz | ~120 μs | ~360 μs | 5 ms | 14-42x |
| Planning | 100 Hz | ~600 ns | ~1.1 μs | 100 μs | 91-167x |

Throughput Comparison

(Chart: messages per second in millions; higher is better.)


Methodology

Benchmark Pattern: Ping-Pong

HORUS uses the industry-standard ping-pong benchmark pattern for IPC latency measurement:

(Diagram: Ping-Pong Benchmark Pattern)

Why Ping-Pong?

  • Industry standard: Used by ROS2, iceoryx2, ZeroMQ benchmarks
  • Prevents queue buildup: Each message acknowledged before next send
  • Realistic: Models request-response patterns in robotics
  • Comparable: Direct apples-to-apples comparison with other frameworks
  • Conservative: Measures true round-trip latency, not just one-way send

What we measure:

  • Round-trip time: Producer → Consumer → ACK → Producer
  • Includes serialization, IPC, deserialization, and synchronization
  • Cross-core communication (Core 0 ↔ Core 1)

What we DON'T measure:

  • Burst throughput (no backpressure)
  • One-way send time without acknowledgment
  • Same-core communication (unrealistic for multi-process IPC)
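The pattern itself is framework-agnostic. A minimal sketch using ordinary Python queues (threads stand in for processes here — this is not the HORUS transport) shows the essential structure: each send blocks on its acknowledgment, so queue buildup is impossible and every sample is a full round trip.

```python
# Ping-pong latency benchmark: producer sends, consumer echoes, producer
# times the full round trip before sending the next message.
import threading
import queue
import time

def consumer(ping, pong, n):
    for _ in range(n):
        pong.put(ping.get())   # echo each ping straight back (the ACK)

def measure(n=1000):
    ping, pong = queue.Queue(), queue.Queue()
    t = threading.Thread(target=consumer, args=(ping, pong, n))
    t.start()
    rtts = []
    for i in range(n):
        t0 = time.perf_counter_ns()
        ping.put(i)
        pong.get()             # block until the ACK: one full round trip
        rtts.append(time.perf_counter_ns() - t0)
    t.join()
    return sorted(rtts)[len(rtts) // 2]   # median round-trip latency, ns

if __name__ == "__main__":
    print("median round-trip:", measure(), "ns")
```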

Test Environment

  • Build: cargo build --release with full optimizations
  • CPU Governor: Performance mode
  • CPU Affinity: Producer pinned to Core 0, Consumer pinned to Core 1
  • Process Isolation: Dedicated topics per benchmark
  • Warmup: 1,000 iterations before measurement
  • Measurement: RDTSC (cycle-accurate timestamps)

Message Realism

  • Actual HORUS library message types
  • Serde serialization (production path)
  • Realistic field values and sizes
  • Complex nested structures (IMU, Odometry)

Statistical Methodology

  • 10,000 iterations per test
  • Median, P95, P99 latency tracking
  • Variance tracking (min/max ranges)
  • Multiple message sizes
  • Mixed workload testing
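The percentile tracking above can be sketched with the nearest-rank method — a simplified stand-in for the benchmark harness's internals:

```python
# Nearest-rank percentile over a set of latency samples.
def percentile(samples, p):
    s = sorted(samples)
    k = max(0, int(round(p / 100 * len(s))) - 1)
    return s[k]

lat = list(range(1, 10_001))           # 10,000 synthetic samples: 1..10000 ns
median, p95, p99 = (percentile(lat, p) for p in (50, 95, 99))
assert (median, p95, p99) == (5000, 9500, 9900)
```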

Measurement Details

RDTSC Calibration:

  • Null cost (back-to-back rdtsc): ~36 cycles
  • Target on modern x86_64: 20-30 cycles
  • Timestamp embedded directly in message payload

Cross-Core Testing:

  • Producer and consumer on different CPU cores
  • Simulates real multi-process robotics systems
  • Includes cache coherency overhead (~60 cycles theoretical minimum)
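RDTSC is an x86-specific instruction, but the null-cost calibration idea applies to any clock: time back-to-back reads and treat that floor as pure measurement overhead. A Python analogue:

```python
# Calibrate the timer's own cost: the median of back-to-back clock reads is
# the floor below which measured latencies are indistinguishable from overhead.
import time

def timer_null_cost(iters=100_000):
    deltas = []
    for _ in range(iters):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        deltas.append(t1 - t0)
    return sorted(deltas)[len(deltas) // 2]   # median back-to-back cost, ns

floor = timer_null_cost()
# Subtract `floor` from every measurement, as the Rust harness does with RDTSC.
```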

Scheduler Performance

Enhanced Smart Scheduler

HORUS now includes an intelligent scheduler that automatically optimizes node execution based on runtime behavior:

Key Enhancements:

  • Tiered Execution: Explicit tier annotation (UltraFast, Fast, Normal)
  • Failure Policies: Per-node failure handling with automatic recovery
  • Predictable by Default: Sequential execution with consistent priority ordering
  • Safety Monitoring: WCET enforcement, watchdogs, and emergency stop

Comprehensive Benchmark Results

Test Configuration:

  • Workload duration: 5 seconds per test
  • Sample size: 20 measurements per benchmark
  • Platform: Modern x86_64 Linux system

| Workload Type | Mean Time | Description | Key Achievement |
|---|---|---|---|
| UltraFastControl | 2.387 s | High-frequency control loops | Optimized for high-frequency control |
| FastSensor | 2.382 s | Rapid sensor processing | Maintains sub-μs sensor fusion |
| HeavyIO | 3.988 s | I/O-intensive operations | Async tier prevents blocking |
| MixedRealistic | 4.064 s | Real-world mixed workload | Balanced optimization across tiers |
| FaultTolerance | 25.485 s | With simulated failures | Failure policy recovery working |

Scalability Performance

The scheduler scales efficiently, with execution time growing only marginally as node count increases:

| Node Count | Execution Time | Scaling Factor |
|---|---|---|
| 10 nodes | 106.93 ms | Baseline |
| 50 nodes | 113.93 ms | 1.07x (5x nodes) |
| 100 nodes | 116.49 ms | 1.09x (10x nodes) |
| 200 nodes | 119.55 ms | 1.12x (20x nodes) |

Key Insights:

  • Near-constant execution time from 10 to 200 nodes
  • Only 13ms increase for 20x more nodes
  • Maintains sub-120ms for large systems
  • Automatic tier classification optimizes execution order
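The scaling factors in the table are simply execution time relative to the 10-node baseline:

```python
# Scaling factor = execution time / baseline execution time.
times_ms = {10: 106.93, 50: 113.93, 100: 116.49, 200: 119.55}
baseline = times_ms[10]
factors = {n: round(t / baseline, 2) for n, t in times_ms.items()}
assert factors == {10: 1.0, 50: 1.07, 100: 1.09, 200: 1.12}
```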

Scheduler Scalability

(Chart: near-constant execution time as node count grows — 106.9 ms at 10 nodes (baseline), 113.9 ms at 50 nodes (+6.5%), 116.5 ms at 100 nodes (+8.9%), 119.5 ms at 200 nodes (+11.8%).)

Real-Time Performance

RtNode Support

HORUS now provides industrial-grade real-time support for safety-critical applications:

RT Features:

  • WCET Enforcement: Worst-Case Execution Time monitoring
  • Deadline Tracking: Count and handle deadline misses
  • Safety Monitor: Emergency stop on critical failures
  • Watchdog Timers: Detect hung or crashed nodes

RT Performance Characteristics

| Metric | Performance | Description |
|---|---|---|
| WCET Overhead | <5 μs | Cost of monitoring execution time |
| Deadline Precision | ±10 μs | Jitter in deadline detection |
| Watchdog Resolution | 1 ms | Minimum detection time |
| Emergency Stop | <100 μs | Time to halt all nodes |
| Context Switch | <1 μs | Priority preemption overhead |
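The bookkeeping behind WCET enforcement and deadline-miss counting can be sketched as follows (hypothetical names, not the actual RtNode API):

```python
# Sketch of WCET/deadline bookkeeping: time each run, count budget overruns.
# DeadlineMonitor is a hypothetical illustration, not a HORUS type.
import time

class DeadlineMonitor:
    def __init__(self, wcet_ns):
        self.wcet_ns = wcet_ns
        self.misses = 0
        self.runs = 0

    def run(self, task):
        t0 = time.perf_counter_ns()
        task()
        elapsed = time.perf_counter_ns() - t0
        self.runs += 1
        if elapsed > self.wcet_ns:
            self.misses += 1   # a real scheduler would apply the failure policy
        return elapsed

mon = DeadlineMonitor(wcet_ns=5_000_000)   # 5 ms budget
mon.run(lambda: None)                      # trivially within budget
mon.run(lambda: time.sleep(0.01))          # 10 ms: a deliberate miss
assert (mon.runs, mon.misses) == (2, 1)
```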

Safety-Critical Configuration

Running with full safety monitoring enabled:

let scheduler = Scheduler::new().tick_rate(1000_u64.hz());
FeatureOverheadImpact
WCET Tracking~1μs per nodeNegligible for >100μs tasks
Deadline Monitor~500ns per nodeSub-microsecond overhead
Watchdog Feed~100ns per tickMinimal impact
Safety Checks~2μs totalWorth it for safety
Memory LockingOne-time 10msPrevents page faults

Real-Time Test Results

Test: Mixed RT and Normal Nodes

  • 2 critical RT nodes @ 1kHz
  • 2 normal nodes @ 100Hz
  • 2 background nodes @ 10Hz

| Node Type | Target Rate | Achieved | Jitter | Misses |
|---|---|---|---|---|
| RT Critical | 1000 Hz | 999.8 Hz | ±10 μs | 0 |
| RT High | 500 Hz | 499.9 Hz | ±15 μs | 0 |
| Normal | 100 Hz | 99.9 Hz | ±50 μs | <0.1% |
| Background | 10 Hz | 10 Hz | ±200 μs | <0.5% |

Zero deadline misses for critical RT nodes over 1M iterations.

Real-Time Node Performance

(Chart: target-rate achievement and jitter — RT Critical 1000 Hz: 100.0% rate, ±10 μs jitter, 0% deadline misses; RT High 500 Hz: 100.0%, ±15 μs, 0%; Normal 100 Hz: 99.9%, ±50 μs, 0.1%; Background 10 Hz: 100.0%, ±200 μs, 0.5%.)

All-Routes Latency

HORUS automatically selects the optimal communication path based on topology (same-thread, cross-thread, cross-process) and producer/consumer count. This benchmark measures the latency of each automatically-selected route.

Benchmark Results

| Scenario | Latency | Target | Notes |
|---|---|---|---|
| Same thread, 1:1 | 16 ns | 60 ns | Ultra-fast direct path |
| Cross-thread, 1:1 | 11 ns | 60 ns | Optimized single-producer path |
| Cross-process, 1:1 | 182 ns | 100 ns | Shared memory path |
| Cross-process, N:1 | 244 ns | 150 ns | Multi-producer shared memory |
| Cross-process, N:N | 187 ns | 200 ns | General cross-process |

Latency by Topology

| Topology | Producers | Consumers | Latency |
|---|---|---|---|
| Same thread | 1 | 1 | ~16 ns |
| Same process | 1 | 1 | ~11 ns |
| Same process | N | 1 | ~15 ns |
| Same process | 1 | N | ~15 ns |
| Same process | N | N | ~20 ns |
| Cross process | 1 | 1 | ~180 ns |
| Cross process | N | 1 | ~250 ns |
| Cross process | 1 | N | ~200 ns |
| Cross process | N | N | ~190 ns |

Key Achievements

  • Sub-20ns for same-process communication
  • Sub-200ns for cross-process 1:1
  • Sub-300ns for multi-producer cross-process
  • Zero configuration — optimal path selected automatically
  • Seamless migration — path upgrades transparently as topology changes
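A hypothetical sketch of how topology-based route selection might look (illustrative only — HORUS's actual selection logic is internal):

```python
# Hypothetical route selection: pick the cheapest path the topology allows.
# Path names and thresholds are illustrative, not HORUS internals.
def select_path(same_process, same_thread, producers, consumers):
    if same_thread and producers == 1 and consumers == 1:
        return "direct"                  # same-thread 1:1 (~16 ns class)
    if same_process:
        # cross-thread: wait-free SPSC if 1:1, lock-free MPMC otherwise
        return "spsc" if producers == 1 and consumers == 1 else "mpmc"
    if producers == 1 and consumers == 1:
        return "shm-spsc"                # cross-process 1:1 (~182 ns class)
    return "shm-mpmc"                    # general cross-process

assert select_path(True, True, 1, 1) == "direct"
assert select_path(True, False, 2, 3) == "mpmc"
assert select_path(False, False, 1, 1) == "shm-spsc"
assert select_path(False, False, 4, 4) == "shm-mpmc"
```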

Running the Benchmark

cd horus
cargo build --release -p horus_benchmarks
./target/release/all_paths_latency

Summary

HORUS provides production-grade performance for real robotics applications:

Automatic Path Selection (Recommended):

  • 16 ns — Same-thread
  • 11 ns — Cross-thread, 1:1
  • 182 ns — Cross-process, 1:1
  • 244 ns — Cross-process, multi-producer
  • 187 ns — Cross-process, multi-producer/consumer

Point-to-Point (1:1):

  • 87 ns — Send only (ultra-low latency)
  • 161 ns — CmdVel (motor control)
  • 262 ns — Send+Recv round-trip
  • ~400 ns — IMU (sensor fusion)
  • ~120 μs — PointCloud with 10K points

Multi-Producer/Consumer (N:N):

  • ~313 ns — CmdVel (motor control)
  • ~940 ns — IMU (sensor fusion)
  • ~2.2 μs — LaserScan (2D lidar)
  • ~1.1 μs — Odometry (localization)
  • ~360 μs — PointCloud with 10K points

Ready for production deployment in demanding robotics applications requiring real-time performance with complex data types.


Next Steps

Build faster. Debug easier. Deploy with confidence.