Benchmarks

All numbers on this page are measured values from the HORUS benchmark suite, not estimates. They were collected on an Intel i9-14900K (24 cores / 32 threads) under WSL2, in release mode, with cycle-accurate RDTSC timing, on 2026-03-22.

Reproduce on your hardware:

cargo run --release -p horus_benchmarks --bin all_paths_latency

Quick Reference

| Transport | p50 | Throughput | Use Case |
|---|---|---|---|
| Same-thread | 12ns | 100M+ msg/s | In-process pipeline |
| Cross-thread 1:1 | 91ns | 13M+ msg/s | Multi-threaded nodes |
| Cross-thread N:N | 150ns | 8M+ msg/s | Multi-producer/consumer |
| Cross-process 1:1 | 171ns | 5M+ msg/s | Multi-process systems |
| Cross-process MPMC | 91ns | 10M+ msg/s | Multi-process multi-participant |
| Cross-process broadcast | 152ns | 5M+ msg/s | Latest-value broadcast |
| CmdVel (16B) | 89ns | 11.1M msg/s | Motor control at 1kHz+ |
| Imu (304B) | 119ns | 7.8M msg/s | Sensor fusion at 500Hz+ |
| LaserScan (1.5KB) | 151ns | 6.3M msg/s | Lidar at 10-40Hz |
| Python typed msg | 1.7μs | 2.7M msg/s | ML inference nodes |
| HORUS vs iceoryx2 | 1.4–6.3x faster | | Beats on every IPC path |

IPC Latency — All Backend Paths

Measured with all_paths_latency: 100,000 iterations per scenario, RDTSC timing with 6ns overhead subtracted, Tukey IQR outlier removal, bootstrap 95% confidence intervals.

All values below are latency percentiles in nanoseconds; lower is better. Intra-process paths never touch the kernel; cross-process paths go through shared memory.

Intra-Process

| Scenario | Backend | p50 | p95 | p99 | p99.9 | max | CV |
|---|---|---|---|---|---|---|---|
| Same thread | DirectChannel | 12ns | 12ns | 13ns | 13ns | 13ns | 0.047 |
| Cross-thread 1:1 | SpscIntra | 91ns | 105ns | 107ns | 125ns | 125ns | 0.077 |
| Cross-thread 1:N | SpmcIntra | 80ns | 88ns | 92ns | 94ns | 94ns | 0.053 |
| Cross-thread N:1 | MpscIntra | 187ns | 312ns | 372ns | 458ns | 464ns | 0.313 |
| Cross-thread N:N | FanoutIntra | 150ns | 270ns | 307ns | 322ns | 322ns | 0.354 |

Cross-Process

| Scenario | Backend | p50 | p95 | p99 | p99.9 | max |
|---|---|---|---|---|---|---|
| 1 pub, 1 sub | SpscShm | 171ns | 186ns | 192ns | 195ns | 195ns |
| 2 pub, 1 sub | MpscShm | 158ns | 182ns | 190ns | 200ns | 200ns |
| 2 pub, 2 sub (MPMC) | FanoutShm | 91ns | 230ns | | | |
| Broadcast (POD) | PodShm | 152ns | 203ns | 227ns | 254ns | 254ns |

Hardware Floor

| Operation | p50 | What it measures |
|---|---|---|
| Raw SHM atomic (cross-process) | 63ns | Kernel/hardware minimum for cross-process |
| Raw memcpy 8B | 11ns | Cache-to-cache copy |
| Raw memcpy 1KB | 17ns | L1 cache bandwidth |
| Raw memcpy 8KB | 49ns | L2 cache bandwidth |
| Raw memcpy 64KB | 811ns | RAM bandwidth |
| Atomic store+load | 11ns | Single atomic round-trip |

Framework Overhead

| Path | Total | Hardware floor | HORUS overhead |
|---|---|---|---|
| Cross-process 1:1 (SpscShm) | 171ns | 57ns | 114ns |
| Cross-process MPMC (FanoutShm) | 91ns | 57ns | 34ns |
| Cross-process broadcast (PodShm) | 152ns | 57ns | 95ns |

HORUS adds 34–114ns over the hardware minimum for cross-process IPC. FanoutShm achieves the lowest overhead because its contention-free SPSC channel matrix eliminates CAS operations on the hot path.
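
A minimal, in-process sketch of the channel-matrix idea (hypothetical types; the real FanoutShm lives in shared memory and is more involved): every producer/consumer pair gets its own single-producer, single-consumer ring, so the hot path is a plain store rather than a CAS loop.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const CAP: usize = 1024;

struct Spsc<T> {
    head: AtomicUsize,                    // advanced only by the single producer
    tail: AtomicUsize,                    // advanced only by the single consumer
    slots: Box<[UnsafeCell<Option<T>>]>,
}

// Sound only under the SPSC discipline: exactly one pusher, exactly one popper.
unsafe impl<T: Send> Sync for Spsc<T> {}

impl<T> Spsc<T> {
    fn new() -> Self {
        Self {
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
            slots: (0..CAP).map(|_| UnsafeCell::new(None)).collect(),
        }
    }

    // Hot path is a plain Release store -- no compare-and-swap.
    fn push(&self, v: T) -> bool {
        let head = self.head.load(Ordering::Relaxed);
        if head - self.tail.load(Ordering::Acquire) == CAP {
            return false; // ring full
        }
        unsafe { *self.slots[head % CAP].get() = Some(v) };
        self.head.store(head + 1, Ordering::Release);
        true
    }

    fn pop(&self) -> Option<T> {
        let tail = self.tail.load(Ordering::Relaxed);
        if tail == self.head.load(Ordering::Acquire) {
            return None; // ring empty
        }
        let v = unsafe { (*self.slots[tail % CAP].get()).take() };
        self.tail.store(tail + 1, Ordering::Release);
        v
    }
}

// MPMC "fanout" as an N x M matrix of SPSC rings: producer p talks to consumer c
// only through rings[p][c], so no ring ever has two writers or two readers.
struct FanoutMatrix<T> {
    rings: Vec<Vec<Spsc<T>>>,
}
```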


Robotics Message Types

Measured with robotics_messages_benchmark: 50,000 iterations, cross-thread producer/consumer on separate cores.

Robotics Message Latency

Measured median and p99 latency for standard robotics messages

| Message | Size | Median | p99 | Throughput | Typical Rate | Headroom |
|---|---|---|---|---|---|---|
| CmdVel | 16B | 89ns | 91ns | 11.1M msg/s | 1,000 Hz | 11,100x |
| Imu | 304B | 119ns | 150ns | 7.8M msg/s | 500 Hz | 15,600x |
| JointCommand | 928B | 128ns | 157ns | 8.1M msg/s | 500 Hz | 16,200x |
| LaserScan | 1,480B | 151ns | 184ns | 6.3M msg/s | 40 Hz | 157,500x |

Real-Time Suitability

| Control Rate | Budget | Worst-Case (p99) | Result |
|---|---|---|---|
| 1 kHz (motor control) | 1ms | CmdVel 91ns | PASS (11,000x headroom) |
| 10 kHz (servo control) | 100μs | CmdVel 91ns | PASS (1,100x headroom) |
| 500 Hz (sensor fusion) | 2ms | Imu 150ns | PASS (13,300x headroom) |
| 40 Hz (lidar) | 25ms | LaserScan 184ns | PASS (135,000x headroom) |

All message types pass real-time suitability at their typical robotics frequencies.
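
The headroom figures are simply the control-period budget divided by the worst-case (p99) message latency; a quick sketch of that arithmetic using values from the table above:

```rust
// Headroom = control budget / worst-case (p99) message latency.
fn headroom(budget_ns: f64, p99_ns: f64) -> f64 {
    budget_ns / p99_ns
}

fn main() {
    // 1 kHz motor control: 1 ms budget vs CmdVel p99 of 91 ns -> ~11,000x.
    println!("{:.0}x", headroom(1_000_000.0, 91.0));
    // 40 Hz lidar: 25 ms budget vs LaserScan p99 of 184 ns -> ~135,000x.
    println!("{:.0}x", headroom(25_000_000.0, 184.0));
}
```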


HORUS vs Competition

Measured with competitor_comparison: 5 seconds sustained per transport, same machine.

HORUS vs Raw UDP

p50 latency in nanoseconds — HORUS eliminates kernel network stack entirely

| Transport | Size | p50 | p95 | p99 | Throughput |
|---|---|---|---|---|---|
| HORUS SHM | 8B | 23ns | 25ns | 29ns | 100M+ msg/s |
| Raw UDP | 8B | 1,235ns | 1,328ns | 1,558ns | 3.9M msg/s |
| HORUS SHM | 32B | 23ns | 25ns | 29ns | 101M+ msg/s |
| Raw UDP | 32B | 1,122ns | 1,246ns | 2,129ns | 4.1M msg/s |

Speedup: 54x (8B), 49x (32B) over raw UDP on the same machine.

HORUS eliminates the kernel network stack entirely. UDP requires sendto() + recvfrom() system calls (~1,100ns of kernel overhead). HORUS uses direct shared memory access (~23ns total).
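
You can see the syscall cost for yourself, independent of HORUS, by timing a blocking send/recv pair over loopback; a minimal sketch using only the standard library (absolute numbers vary by machine):

```rust
use std::net::UdpSocket;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    // Two UDP sockets on loopback: every message pays a send() + recv() syscall pair.
    let rx = UdpSocket::bind("127.0.0.1:0")?;
    let tx = UdpSocket::bind("127.0.0.1:0")?;
    tx.connect(rx.local_addr()?)?;

    let payload = [0u8; 8];
    let mut buf = [0u8; 8];
    let iters = 100_000u32;

    let start = Instant::now();
    for _ in 0..iters {
        tx.send(&payload)?;
        rx.recv(&mut buf)?;
    }
    let per_msg = start.elapsed().as_nanos() / iters as u128;
    println!("UDP loopback send+recv: ~{per_msg} ns/msg");
    Ok(())
}
```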

HORUS vs iceoryx2

p50 latency in nanoseconds — HORUS beats iceoryx2 on every IPC path


iceoryx2 is Eclipse's lock-free zero-copy IPC middleware. Measured with iceoryx2_comparison and fanout_shm_bench: same machine, same message types, release mode.

| Scenario | HORUS | iceoryx2 | Speedup |
|---|---|---|---|
| Same-thread | 11 ns | 69 ns | 6.3x |
| Cross-thread 1:1 | 95 ns | 182 ns | 1.9x |
| Cross-process 1:1 | 170 ns | 361 ns | 2.1x |
| Cross-process MPMC 2P/2S | 96 ns | 135 ns | 1.4x |
| Throughput (u64) | 95 M/s | 22 M/s | 4.3x |

HORUS beats iceoryx2 on every IPC path. The cross-process MPMC advantage comes from FanoutShm — a contention-free SPSC channel matrix that eliminates all CAS operations on the hot path.

# Reproduce (requires iceoryx2 feature)
cargo run --release -p horus_benchmarks --bin iceoryx2_comparison --features iceoryx2

# HORUS-only cross-process MPMC
cargo run --release -p horus_benchmarks --bin fanout_shm_bench

Scalability

Measured with scalability_benchmark: sustained throughput with varying producer/consumer thread counts.

Thread Scaling

Throughput (M msg/s) with varying thread counts. Higher is better.


| Producers | Consumers | Throughput | Per-Thread | Scaling Efficiency |
|---|---|---|---|---|
| 1 | 1 | 2.4M msg/s | 1.20M | baseline |
| 2 | 1 | 7.2M msg/s | 2.40M | 300% |
| 4 | 1 | 11.8M msg/s | 2.35M | 489% |
| 1 | 2 | 3.5M msg/s | 1.17M | 146% |
| 1 | 4 | 2.9M msg/s | 0.58M | 122% |
| 2 | 2 | 6.8M msg/s | 1.70M | 141% |
| 4 | 4 | 11.2M msg/s | 1.41M | 117% |
| 8 | 8 | 8.4M msg/s | 0.52M | 44% |

Producer Scaling (1 Consumer)

1 producer:   3.0 M/s  ████████████
2 producers:  8.7 M/s  ██████████████████████████████████
3 producers: 11.5 M/s  ████████████████████████████████████████████
4 producers: 11.9 M/s  ████████████████████████████████████████████
6 producers: 13.5 M/s  █████████████████████████████████████████████████  ← peak
8 producers: 11.5 M/s  ████████████████████████████████████████████

Peak throughput at 6 producers (13.5M msg/s). Beyond 6, contention on the atomic head pointer causes slight degradation.

Consumer Scaling (1 Producer)

1 consumer:  2.8 M/s  ███████████
2 consumers: 4.8 M/s  ██████████████████
4 consumers: 4.8 M/s  ██████████████████
8 consumers: 3.7 M/s  ██████████████

Consumer scaling plateaus at 2 — the ring buffer uses broadcast semantics (all consumers read the same data), so adding consumers doesn't increase total throughput.
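
Why extra consumers don't add throughput can be seen from a stripped-down broadcast ring (illustrative only; it ignores the overwrite/wrap handling a real ring needs): each consumer keeps its own cursor over the same slots, so every message is read by all consumers rather than split among them.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const SLOTS: usize = 1024;

/// Broadcast ring: one writer advances `head`; each consumer keeps a private
/// cursor and reads every slot, so total throughput is bounded by the single
/// producer, not by the number of consumers.
struct BroadcastRing {
    head: AtomicU64,
    slots: [AtomicU64; SLOTS],
}

struct Reader {
    cursor: u64, // per-consumer position; readers never contend with each other
}

impl BroadcastRing {
    fn new() -> Self {
        Self {
            head: AtomicU64::new(0),
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    fn publish(&self, value: u64) {
        let head = self.head.load(Ordering::Relaxed);
        self.slots[(head as usize) % SLOTS].store(value, Ordering::Release);
        self.head.store(head + 1, Ordering::Release);
    }

    /// Every reader sees the same sequence of values; adding readers
    /// replicates the work instead of sharing it.
    fn read_next(&self, r: &mut Reader) -> Option<u64> {
        if r.cursor >= self.head.load(Ordering::Acquire) {
            return None; // nothing new yet
        }
        let v = self.slots[(r.cursor as usize) % SLOTS].load(Ordering::Acquire);
        r.cursor += 1;
        Some(v)
    }
}
```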


Real-Time Determinism

Measured with determinism_benchmark: 10 runs of 100,000 iterations each, CPU-pinned to cores 0 and 1.

Latency Distribution

Percentile latencies in nanoseconds — tight clustering indicates deterministic behavior

| Metric | Value |
|---|---|
| Mean latency | 87.0ns |
| Median latency | 86.0ns |
| Std dev | 7.9ns |
| Min | 61ns |
| Max | 112ns |
| p95 | 102ns |
| p99 | 109ns |
| p99.9 | 112ns |
| p99.99 | 112ns |
| Run-to-run CV | 0.060 |
| Deadline misses at 1μs | 212 / 1,000,000 (0.02%) |

Interpretation: The 7.9ns standard deviation and 0.06 run-to-run coefficient of variation indicate highly deterministic behavior. The 0.02% miss rate against the very aggressive 1μs deadline comes from OS scheduling jitter in WSL2 — on bare-metal Linux with a PREEMPT_RT kernel and isolcpus, expect few if any misses.
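
Both determinism metrics are simple functions of the raw samples; a minimal sketch:

```rust
/// Coefficient of variation (std dev / mean) and deadline miss rate
/// for a set of latency samples in nanoseconds.
fn determinism_metrics(samples_ns: &[f64], deadline_ns: f64) -> (f64, f64) {
    let n = samples_ns.len() as f64;
    let mean = samples_ns.iter().sum::<f64>() / n;
    let var = samples_ns.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let cv = var.sqrt() / mean;
    let misses = samples_ns.iter().filter(|&&x| x > deadline_ns).count() as f64;
    (cv, misses / n)
}

fn main() {
    let samples = vec![86.0, 87.0, 95.0, 61.0, 112.0, 84.0]; // toy data
    let (cv, miss_rate) = determinism_metrics(&samples, 1_000.0); // 1 μs deadline
    println!("CV = {cv:.3}, miss rate = {:.4}%", miss_rate * 100.0);
}
```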


Python Binding Performance

Measured with research_bench_python.py: 5 seconds sustained per test, Python 3.12, PyO3 bindings v0.1.9.

Typed Message IPC (Zero-Copy Pod Path)

| Message | p50 | p95 | p99 | p99.9 | Throughput |
|---|---|---|---|---|---|
| CmdVel send+recv | 1.7μs | 1.8μs | 2.4μs | 15.2μs | 2.7M msg/s |
| Pose2D send+recv | 1.7μs | 1.9μs | 3.0μs | 18.2μs | 2.7M msg/s |
| Imu send+recv | 1.9μs | 2.0μs | 4.2μs | 21.1μs | 2.4M msg/s |

Generic Message IPC (Serialization Path)

| Payload | p50 | p95 | p99 | Throughput |
|---|---|---|---|---|
| dict {v: 1.0} | 6.2μs | 7.9μs | 19.9μs | 714K msg/s |
| dict {x,y,z,w} | 12.4μs | 15.3μs | 34.2μs | 382K msg/s |
| dict 50 keys (~1KB) | 111μs | 143μs | 196μs | 42K msg/s |

Typed messages are 4-65x faster than dicts because they bypass serialization and use direct Pod memcpy through the Rust layer.
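
On the Rust side, the typed path boils down to copying the raw bytes of a fixed-layout struct; a rough sketch (illustrative field layout, not the actual HORUS CmdVel definition):

```rust
/// Illustrative plain-old-data message: fixed #[repr(C)] layout, 16 bytes.
#[repr(C)]
#[derive(Clone, Copy)]
struct CmdVel {
    linear: [f32; 2],
    angular: f32,
    _pad: f32, // keep size and alignment explicit
}

/// The entire "serialization" step for a Pod type is one memcpy of its bytes,
/// versus walking and encoding every key of a Python dict on the generic path.
fn write_pod(dst: &mut [u8], msg: &CmdVel) {
    // Valid because CmdVel is Copy, #[repr(C)], and contains only f32 fields.
    let bytes = unsafe {
        std::slice::from_raw_parts(
            msg as *const CmdVel as *const u8,
            std::mem::size_of::<CmdVel>(),
        )
    };
    dst[..bytes.len()].copy_from_slice(bytes);
}
```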

Image Zero-Copy

640×480 RGB image transfer latency (μs). DLPack is 13x faster than copying.

| Operation | p50 | Throughput | Notes |
|---|---|---|---|
| Image.to_numpy (640x480) | 3.0μs | 1.5M/s | Returns view into SHM pool |
| np.from_dlpack (640x480) | 1.1μs | 3.5M/s | DLPack protocol, true zero-copy |
| np.copy (640x480) baseline | 14.0μs | 334K/s | For comparison (actual copy) |

np.from_dlpack() is 13x faster than np.copy() — it returns a numpy array backed by the shared memory pool with no data movement.

FFI Overhead Attribution

Latency in nanoseconds — constant ~1.7μs overhead from PyO3 + GIL + allocation

| Operation | Rust (ns) | Python (ns) | Overhead | Factor |
|---|---|---|---|---|
| CmdVel | 14 | 1,712 | 1,698ns | 122x |
| Pose2D | 14 | 1,682 | 1,668ns | 120x |
| Imu | 14 | 1,884 | 1,870ns | 135x |
| dict (small) | 14 | 6,246 | 6,232ns | 446x |

The ~1.7μs Python overhead comes from: PyO3 boundary crossing (~500ns), GIL acquisition (~500ns), and Python object allocation (~700ns). This overhead is constant regardless of message size.

Scheduler Tick Overhead

| Metric | Value |
|---|---|
| Target rate | 10,000 Hz |
| Achieved rate | 5,932 Hz |
| Per-tick overhead | ~11μs (Rust→Python→Rust) |
| GC dip (worst second) | 96 fewer ticks |

The GIL is the bottleneck for Python tick rate. For control loops above ~5kHz, use Rust nodes.

When to Use Python vs Rust

| Use Case | Recommended | Why |
|---|---|---|
| ML inference (PyTorch, YOLO) | Python | 1.7μs overhead negligible vs 10-200ms inference |
| Data science, prototyping | Python | Developer velocity matters more than latency |
| Motor control at 1kHz+ | Rust | 89ns vs 1,700ns — 19x difference |
| Safety monitors | Rust | Deterministic timing, no GIL |
| Sensor fusion at 500Hz+ | Rust | Predictable p99 latency |

C++ Binding Performance

Measured with cpp_benchmark: release mode, g++ -O2, 10,000 iterations per test, high_resolution_clock timing with warmup.

FFI Boundary Cost

| Operation | Min | Median | P99 | P99.9 | Max | Stddev |
|---|---|---|---|---|---|---|
| FFI call (abi_version) | 15ns | 16ns | 17ns | 17ns | 5.0μs | 49ns |
| Atomic read (is_running) | 16ns | 17ns | 18ns | 18ns | 33ns | 0.6ns |

The raw cost of crossing the Rust-C++ boundary is 15-17ns — comparable to a C++ virtual function call.
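
For reference, the exporting side of such a call is just an extern "C" function; a sketch of what that looks like in Rust (the abi_version signature here is assumed for illustration, not taken from horus_cpp):

```rust
// Hypothetical sketch of an exported FFI entry point on the Rust side.
// The actual horus_cpp export may differ.
#[no_mangle]
pub extern "C" fn abi_version() -> u32 {
    // Nothing but the cross-language call itself: no allocation, no locks,
    // which is why the round trip lands near a C++ virtual call (~15-17ns).
    1
}
```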

Scheduler Tick from C++

| Scenario | Min | Median | P99 | P99.9 | Max |
|---|---|---|---|---|---|
| Empty scheduler | 35ns | 37ns | 45ns | 49ns | 1.6μs |
| 1 node + callback | 243ns | 250ns | 10.5μs | 11.3μs | 13.8μs |
| 10 nodes | 2.1μs | 2.2μs | 102μs | 109μs | 127μs |
| 50 nodes | 10.7μs | 11.0μs | 515μs | 549μs | 597μs |

Per-node overhead is ~220ns, which includes catch_unwind safety wrapper + closure dispatch through the FFI boundary.
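
A minimal sketch of that wrapper pattern (hypothetical signatures, not the actual horus_cpp code): nothing may unwind across the extern "C" boundary, so every callback dispatch is guarded by catch_unwind.

```rust
use std::ffi::c_void;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// C-style node callback plus opaque user data, as it crosses the FFI boundary.
/// (Hypothetical signature for illustration.)
type NodeCallback = extern "C" fn(user_data: *mut c_void);

/// Dispatch one node's callback. The catch_unwind guard plus the indirect call
/// is what the ~220ns per-node cost quoted above pays for.
fn dispatch_node(cb: NodeCallback, user_data: *mut c_void) -> bool {
    catch_unwind(AssertUnwindSafe(|| cb(user_data))).is_ok()
}
```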

Throughput

| Metric | Value |
|---|---|
| Ticks/second (1 node) | 2,844,911 |
| Time per tick | 0.35μs |
| CPU overhead at 1kHz | 0.035% of one core |

Scalability

| Nodes | Median Tick | Per-Node Cost |
|---|---|---|
| 1 | 250ns | 250ns |
| 10 | 2.2μs | 220ns |
| 50 | 11.0μs | 220ns |

Linear scaling — per-node cost is constant at ~220ns regardless of node count.

C++ vs Rust vs Python Overhead

| Language | 1-Node Tick | Overhead vs Rust | Notes |
|---|---|---|---|
| Rust (native) | ~89ns | baseline | Direct scheduler call |
| C++ (FFI) | ~250ns | +161ns (2.8x) | extern "C" + catch_unwind |
| Python (PyO3) | ~1,700ns | +1,611ns (19x) | GIL + PyO3 + object alloc |

C++ adds 161ns over native Rust — the cost of the extern "C" boundary and panic safety wrapper. For perspective, this is 0.16 microseconds — invisible at any practical control rate.

Memory Safety

Validated with AddressSanitizer (g++ -fsanitize=address):

| Test | Iterations | ASAN Errors |
|---|---|---|
| Scheduler create/destroy | 1,000 | 0 |
| Sustained ticks | 5,000 | 0 |
| 50 concurrent nodes | 5,000 ticks | 0 |
| Null pointer calls | 10,000 | 0 |

Zero memory safety violations across the entire FFI surface.

Running C++ Benchmarks

# Build release
cargo build --release --no-default-features -p horus_cpp

# Compile benchmark
g++ -std=c++17 -O2 -o cpp_benchmark \
    horus_cpp/tests/cpp_benchmark.cpp \
    -L target/release -lhorus_cpp -lpthread -ldl -lm

# Run
LD_LIBRARY_PATH=target/release ./cpp_benchmark

# With ASAN
g++ -std=c++17 -O2 -fsanitize=address -fno-omit-frame-pointer \
    -o cpp_stress_asan horus_cpp/tests/cpp_stress_test.cpp \
    -L target/release -lhorus_cpp -lpthread -ldl -lm
LD_LIBRARY_PATH=target/release ./cpp_stress_asan

Running Benchmarks

Rust Benchmarks

# Main benchmark: all 10 backend paths (~2 min)
cargo run --release -p horus_benchmarks --bin all_paths_latency

# Robotics message types: CmdVel, Imu, LaserScan, JointCommand
cargo run --release -p horus_benchmarks --bin robotics_messages_benchmark

# HORUS vs UDP comparison
cargo run --release -p horus_benchmarks --bin competitor_comparison

# Scalability: thread count sweep
cargo run --release -p horus_benchmarks --bin scalability_benchmark

# RT determinism: jitter analysis
cargo run --release -p horus_benchmarks --bin determinism_benchmark

# Hardware floor: raw memcpy, atomic, mmap
cargo run --release -p horus_benchmarks --bin raw_baselines

# Cross-process: true inter-process IPC
cargo run --release -p horus_benchmarks --bin cross_process_benchmark

# Full research suite (~30 min)
./benchmarks/research/run_all.sh

# Quick validation (~3 min)
./benchmarks/research/run_all.sh --quick

Python Benchmarks

cd horus_py

# Quick validation (2s per test)
PYTHONPATH=. python3 benchmarks/research_bench_python.py --duration 2

# Full research run (10s per test)
PYTHONPATH=. python3 benchmarks/research_bench_python.py --duration 10 --csv results.csv

# JSON summary for CI
PYTHONPATH=. python3 benchmarks/research_bench_python.py --json summary.json

Criterion Micro-Benchmarks

# All criterion benches (HTML reports in target/criterion/)
cargo bench -p horus_benchmarks

# Filter by name
cargo bench -p horus_benchmarks -- topic_latency

Methodology

Timing

  • RDTSC (x86_64) with serializing fences (lfence + mfence), calibrated per-run (~3.37 GHz on test machine)
  • Overhead: ~6ns per measurement, subtracted from all samples
  • Fallback: Instant::now() (~11ns) on non-x86
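
A minimal sketch of the fenced TSC read described above (x86_64 only; the actual harness additionally converts ticks to nanoseconds with the calibrated frequency and subtracts the ~6ns measurement cost from every sample):

```rust
#[cfg(target_arch = "x86_64")]
fn rdtsc_fenced() -> u64 {
    use std::arch::x86_64::{_mm_lfence, _mm_mfence, _rdtsc};
    unsafe {
        _mm_mfence(); // drain earlier stores and loads
        _mm_lfence(); // stop the TSC read from executing early
        let t = _rdtsc();
        _mm_lfence(); // stop later work from being hoisted above the read
        t
    }
}

// Usage sketch:
//   elapsed_ns = (t_end - t_start) as f64 / tsc_ghz - measurement_overhead_ns;
```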

Statistical Analysis

  • Percentiles: p1, p5, p25, p50, p75, p95, p99, p99.9, p99.99
  • Confidence intervals: Bootstrap with 10,000 resamples, 95% level
  • Outlier filtering: Tukey IQR (1.5x fence)
  • Determinism metrics: Coefficient of variation, run-to-run variance
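
A sketch of the Tukey fence used for outlier filtering (the percentile helper here is a simple nearest-rank approximation; the bootstrap confidence intervals are not shown):

```rust
/// Nearest-rank percentile of an already-sorted slice (p in 0.0..=100.0).
fn percentile(sorted: &[f64], p: f64) -> f64 {
    let idx = ((p / 100.0) * (sorted.len() - 1) as f64).round() as usize;
    sorted[idx]
}

/// Tukey IQR filter: keep samples within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
fn tukey_filter(samples: &[f64]) -> Vec<f64> {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let (q1, q3) = (percentile(&sorted, 25.0), percentile(&sorted, 75.0));
    let iqr = q3 - q1;
    let (lo, hi) = (q1 - 1.5 * iqr, q3 + 1.5 * iqr);
    samples.iter().copied().filter(|&x| x >= lo && x <= hi).collect()
}
```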

Environment Control

  • CPU governor: Performance mode recommended (numbers above measured with powersave in WSL2)
  • CPU affinity: Producer and consumer pinned to separate physical cores
  • Warmup: 5,000-10,000 iterations discarded before measurement
  • Measurement: 50,000-100,000 iterations per test
  • Turbo boost: Disabled recommended for reproducibility
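
A sketch of pinning producer and consumer to separate cores, using the core_affinity crate (an assumption for illustration; sched_setaffinity or taskset achieve the same thing):

```rust
fn main() {
    // Enumerate available cores and pick two for the benchmark threads.
    let cores = core_affinity::get_core_ids().expect("could not enumerate cores");
    let producer_core = cores[0];
    let consumer_core = cores[1];

    let producer = std::thread::spawn(move || {
        core_affinity::set_for_current(producer_core);
        // ... produce messages ...
    });
    let consumer = std::thread::spawn(move || {
        core_affinity::set_for_current(consumer_core);
        // ... consume messages ...
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```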

Reproducing These Numbers

Your numbers will differ based on CPU, OS, governor, and VM/bare-metal:

| Factor | Impact |
|---|---|
| performance governor vs powersave | 2-5x faster latencies |
| Bare-metal Linux vs WSL2 | 10-30% faster, fewer outliers |
| PREEMPT_RT kernel | Near-zero deadline misses |
| isolcpus for benchmark cores | Lower jitter, tighter p99 |
| Older CPU (i5 vs i9) | 1.5-3x slower |
| ARM (Raspberry Pi, Jetson) | 3-10x slower, still sub-microsecond |

See Also