Fault Tolerance

You need to prevent a single crashing node from killing your entire robot. HORUS failure policies let you define per-node behavior on failure: stop the system, restart with backoff, skip temporarily, or ignore entirely.

When To Use This

  • Any robot with more than one node (which is every production robot)
  • Sensor drivers that may disconnect (USB, serial, network)
  • Systems with non-critical nodes (logging, telemetry) that should not bring down the system
  • Safety-critical deployments where you need explicit failure contracts per node

Use Safety Monitor instead if you need watchdog timers, deadline enforcement, or graduated degradation. Failure policies and safety monitoring are complementary -- use both in production.

Prerequisites

  • Familiarity with Nodes and the Scheduler Configuration node builder API
  • Understanding of which nodes are critical vs non-critical in your system

The Problem

Without failure policies, one crashing node kills the entire system:

Tick 100: sensor_driver panics (USB disconnected)
Tick 100: scheduler stops
Tick 100: motor_controller stops receiving commands
Result: robot stops moving in the middle of a task

With failure policies, the system adapts:

Tick 100: sensor_driver panics (USB disconnected)
Tick 100: FailurePolicy::Restart → re-init sensor_driver (10ms backoff)
Tick 101: sensor_driver panics again → restart (20ms backoff)
Tick 102: USB reconnects → sensor_driver.init() succeeds → normal operation
Result: robot paused briefly, then resumed automatically

The Four Policies

Fatal — Stop Everything

// simplified
scheduler.add(motor_controller)
    .order(0)
    .rate(1000_u64.hz())
    .failure_policy(FailurePolicy::Fatal)
    .build()?;

First failure stops the scheduler immediately. Use for nodes where continued operation after failure is unsafe:

  • Motor controllers (stale commands = uncontrolled motion)
  • Safety monitors (can't monitor safety if the monitor is broken)
  • Emergency stop handlers

When it triggers: node.tick() raises an exception (Python) or panics (Rust). The scheduler calls stop() and shuts down all nodes cleanly.

Restart — Re-Initialize with Backoff

// simplified
scheduler.add(lidar_driver)
    .order(1)
    .rate(100_u64.hz())
    .failure_policy(FailurePolicy::restart(3, 50_u64.ms()))
    .build()?;

Re-initializes the node with exponential backoff. Once max_restarts is exhausted, the failure escalates to a fatal stop.

failure 1 → restart, wait 50ms
failure 2 → restart, wait 100ms (2x backoff)
failure 3 → restart, wait 200ms (2x backoff)
failure 4 → max_restarts exceeded → fatal stop

After a successful tick, the backoff timer clears. Use for nodes that can recover from transient failures:

  • Sensor drivers (hardware reconnection)
  • Network clients (server temporarily unavailable)
  • Camera nodes (USB reset)

Skip — Tolerate with Cooldown

// simplified
scheduler.add(telemetry_uploader)
    .order(200)
    .async_io()
    .failure_policy(FailurePolicy::skip(5, 1_u64.secs()))
    .build()?;

After max_failures consecutive failures, the node is suppressed for the cooldown period. After cooldown, the node is allowed again and the failure counter resets.

failure 1 → continue
failure 2 → continue
failure 3 → continue
failure 4 → continue
failure 5 → node suppressed for 1 second
... 1 second passes ...
node allowed again, failure count = 0

Use for nodes whose absence doesn't affect core robot operation:

  • Logging and telemetry upload
  • Diagnostics reporting
  • Cloud sync
  • Non-critical monitoring
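
To picture the suppress-and-reset bookkeeping described above, here is a minimal sketch. The struct and function names are illustrative only, not the actual HORUS internals:

// simplified (illustrative bookkeeping, not the HORUS implementation)
use std::time::{Duration, Instant};

struct SkipState {
    consecutive_failures: u32,
    suppressed_until: Option<Instant>,
}

// Checked before each tick: returns false while the node is suppressed.
fn allowed_to_tick(state: &mut SkipState, now: Instant) -> bool {
    if let Some(until) = state.suppressed_until {
        if now < until {
            return false;                  // still in cooldown
        }
        state.suppressed_until = None;     // cooldown over: allow the node again
        state.consecutive_failures = 0;    // and reset the failure counter
    }
    true
}

// Called when a tick fails: suppress after max_failures consecutive failures.
fn on_failure(state: &mut SkipState, max_failures: u32, cooldown: Duration, now: Instant) {
    state.consecutive_failures += 1;
    if state.consecutive_failures >= max_failures {
        state.suppressed_until = Some(now + cooldown);
    }
}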

Ignore — Swallow Failures

// simplified
scheduler.add(stats_collector)
    .order(100)
    .failure_policy(FailurePolicy::Ignore)
    .build()?;

Failures are completely ignored. The node keeps ticking every cycle regardless of errors. Use only when partial results are acceptable:

  • Statistics collectors (missing one sample is fine)
  • Best-effort visualization
  • Debug output nodes

Severity-Aware Handling

HORUS errors carry severity levels that can override the configured policy:

  • Fatal severity (e.g., shared memory corruption): always stops the scheduler, even with an Ignore policy
  • Transient severity (e.g., topic full, network timeout): de-escalates a Fatal policy to Restart (transient errors are recoverable)
  • Permanent severity (e.g., invalid configuration): follows the configured policy

This means a safety-critical node with Fatal policy won't kill the system on a transient network glitch — it'll restart instead. But a shared-memory corruption always stops, even on an Ignore node.
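
As a concrete picture of that precedence, the sketch below mirrors the rules listed above. The enum and function names are illustrative, not the actual HORUS types:

// simplified (illustrative only; these names are not the HORUS API)
enum Severity { Fatal, Transient, Permanent }
enum Policy { Fatal, Restart, Skip, Ignore }
enum Action { StopScheduler, RestartNode, SkipNode, ContinueTicking }

fn resolve(policy: Policy, severity: Severity) -> Action {
    match (severity, policy) {
        // Fatal-severity errors always stop, regardless of the configured policy.
        (Severity::Fatal, _) => Action::StopScheduler,
        // Transient errors de-escalate a Fatal policy to a restart.
        (Severity::Transient, Policy::Fatal) => Action::RestartNode,
        // Otherwise the configured policy applies as-is.
        (_, Policy::Fatal) => Action::StopScheduler,
        (_, Policy::Restart) => Action::RestartNode,
        (_, Policy::Skip) => Action::SkipNode,
        (_, Policy::Ignore) => Action::ContinueTicking,
    }
}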

Complete Robot Example

// simplified
use horus::prelude::*;

fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(500_u64.hz())
        .prefer_rt()
        .watchdog(500_u64.ms());

    // CRITICAL: Motor controller — stop if it fails
    scheduler.add(MotorController::new())
        .order(0)
        .rate(500_u64.hz())
        .on_miss(Miss::SafeMode)
        .failure_policy(FailurePolicy::Fatal)
        .build()?;

    // RECOVERABLE: Lidar driver — restart on USB disconnect
    scheduler.add(LidarDriver::new())
        .order(1)
        .rate(100_u64.hz())
        .failure_policy(FailurePolicy::restart(5, 100_u64.ms()))
        .build()?;

    // RECOVERABLE: Camera — restart up to 3 times
    scheduler.add(CameraNode::new())
        .order(2)
        .rate(30_u64.hz())
        .failure_policy(FailurePolicy::restart(3, 200_u64.ms()))
        .build()?;

    // NON-CRITICAL: Path planner — skip if it fails repeatedly
    scheduler.add(PathPlanner::new())
        .order(5)
        .compute()
        .failure_policy(FailurePolicy::skip(3, 2_u64.secs()))
        .build()?;

    // BEST-EFFORT: Telemetry — ignore failures
    scheduler.add(TelemetryUploader::new())
        .order(200)
        .async_io()
        .rate(1_u64.hz())
        .failure_policy(FailurePolicy::Ignore)
        .build()?;

    scheduler.run()
}

Choosing the Right Policy

  • Motor control, safety → Fatal. Unsafe to continue without these.
  • Sensor drivers → Restart(3-5, 50-200ms). Hardware reconnects are common.
  • Perception pipelines → Restart(3, 100ms) or Skip(5, 2s). Can recover or degrade gracefully.
  • Logging, telemetry → Skip(5, 1s) or Ignore. Non-critical, absence is tolerable.
  • Debug/visualization → Ignore. Partial results are fine.

Python Error Handlers

Use the on_error callback to handle failures in Python nodes:

import horus

def my_error_handler(node, exception):
    node.log_error(f"Node failed: {exception}")
    # Optionally take corrective action
    if "USB" in str(exception):
        node.log_warning("USB disconnected — will retry on restart")

def sensor_tick(node):
    data = read_hardware()  # May raise OSError
    node.send("sensor", data)

sensor = horus.Node(
    name="sensor",
    tick=sensor_tick,
    on_error=my_error_handler,
    failure_policy="restart",
    rate=100,
    pubs=["sensor"],
)
horus.run(sensor)

The on_error callback runs after the failure policy processes the error. It receives the node and the exception object. Use it for logging, alerting, or state cleanup before the next restart.


Monitoring Failures

Failure events are recorded in the BlackBox flight recorder:

# View failure events from the blackbox
horus blackbox show --filter errors

# Monitor live
horus log -f --level error

In code:

// simplified
// Inspect anomalies via CLI: horus blackbox --anomalies
if let Some(bb) = scheduler.get_blackbox() {
    for record in bb.lock().expect("blackbox lock").anomalies() {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}

Design Decisions

Why per-node policies instead of a global failure mode?

Different nodes have fundamentally different failure characteristics. A motor controller failure is safety-critical -- you must stop. A telemetry uploader failure is harmless -- you can ignore it. Per-node policies match the failure contract to the node's role in the system.

Why severity-aware handling?

A transient network timeout on a Fatal node should not kill the system -- it should restart. But shared memory corruption on an Ignore node must always stop -- it indicates fundamental system failure. Severity-aware handling prevents both over-reaction (killing the system on a glitch) and under-reaction (ignoring corruption).

Why exponential backoff on Restart?

A sensor driver that fails once probably has a transient issue (USB reset). A sensor driver that fails 5 times in rapid succession has a permanent problem. Exponential backoff gives transient issues time to resolve while quickly escalating persistent failures.
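
As a rough sketch of that doubling schedule (the helper below is illustrative, not the HORUS implementation):

// simplified (illustrative helper, not the HORUS implementation)
fn restart_delay_ms(base_ms: u64, consecutive_failures: u32) -> u64 {
    // Failure 1 waits base_ms, failure 2 waits 2x base, failure 3 waits 4x base, ...
    // The cap on doublings only keeps this sketch from overflowing.
    let doublings = consecutive_failures.saturating_sub(1).min(16);
    base_ms.saturating_mul(1u64 << doublings)
}

// With FailurePolicy::restart(3, 50ms): 50 ms, 100 ms, 200 ms, then the 4th failure escalates to a fatal stop.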

Trade-offs

  • Gain: one crashing node does not kill the system. Cost: you must think about failure contracts per node.
  • Gain: Restart with backoff handles transient hardware issues. Cost: backoff adds latency during recovery.
  • Gain: Skip prevents non-critical failures from cascading. Cost: skipped nodes produce no output during cooldown.
  • Gain: severity overrides prevent both over- and under-reaction. Cost: behavior depends on error severity, not just the configured policy.

Common Errors

  • Symptom: system stops on a transient network error
    Cause: Fatal policy on a node with transient failures
    Fix: use Restart for nodes that can recover, or rely on severity-aware de-escalation

  • Symptom: node restarts endlessly
    Cause: max_restarts set too high, or the failure is permanent
    Fix: lower max_restarts and check the logs for the root cause

  • Symptom: non-critical node drags down performance
    Cause: Skip cooldown too short, so the node fails again immediately after cooldown
    Fix: increase the cooldown duration or investigate the root cause

  • Symptom: Fatal node did not stop the system
    Cause: the error was classified as Transient severity and de-escalated to Restart
    Fix: expected behavior; truly fatal errors (e.g., SHM corruption) always stop regardless

  • Symptom: cannot find failure events after a crash
    Cause: no BlackBox configured
    Fix: add .blackbox(64) to the scheduler for crash forensics
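
For the last fix, enabling the flight recorder is a scheduler builder call; a minimal sketch reusing the builder from the example above:

// simplified
let mut scheduler = Scheduler::new()
    .tick_rate(500_u64.hz())
    .blackbox(64);   // keep recent failure events for post-crash forensics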

See Also