Fault Tolerance

You need to prevent a single crashing node from killing your entire robot. HORUS failure policies let you define per-node behavior on failure: stop the system, restart with backoff, skip temporarily, or ignore entirely.

When To Use This

  • Any robot with more than one node (which is every production robot)
  • Sensor drivers that may disconnect (USB, serial, network)
  • Systems with non-critical nodes (logging, telemetry) that should not bring down the system
  • Safety-critical deployments where you need explicit failure contracts per node

Use Safety Monitor instead if you need watchdog timers, deadline enforcement, or graduated degradation. Failure policies and safety monitoring are complementary -- use both in production.

Prerequisites

  • Familiarity with Nodes and the Scheduler Configuration node builder API
  • Understanding of which nodes are critical vs non-critical in your system

The Problem

Without failure policies, one crashing node kills the entire system:

Tick 100: sensor_driver panics (USB disconnected)
Tick 100: scheduler stops
Tick 100: motor_controller stops receiving commands
Result: robot stops moving in the middle of a task

With failure policies, the system adapts:

Tick 100: sensor_driver panics (USB disconnected)
Tick 100: FailurePolicy::Restart → re-init sensor_driver (10ms backoff)
Tick 101: sensor_driver panics again → restart (20ms backoff)
Tick 102: USB reconnects → sensor_driver.init() succeeds → normal operation
Result: robot paused briefly, then resumed automatically

The Four Policies

Fatal — Stop Everything

// simplified
scheduler.add(motor_controller)
    .order(0)
    .rate(1000_u64.hz())
    .failure_policy(FailurePolicy::Fatal)
    .build()?;

First failure stops the scheduler immediately. Use for nodes where continued operation after failure is unsafe:

  • Motor controllers (stale commands = uncontrolled motion)
  • Safety monitors (can't monitor safety if the monitor is broken)
  • Emergency stop handlers

When it triggers: node.tick() raises an exception (Python) or panics (Rust). The scheduler calls stop() and shuts down all nodes cleanly.

Restart — Re-Initialize with Backoff

// simplified
scheduler.add(lidar_driver)
    .order(1)
    .rate(100_u64.hz())
    .failure_policy(FailurePolicy::restart(3, 50_u64.ms()))
    .build()?;

Re-initializes the node with exponential backoff. Once max_restarts is exhausted, the failure escalates to a fatal stop.

failure 1 → restart, wait 50ms
failure 2 → restart, wait 100ms (2x backoff)
failure 3 → restart, wait 200ms (2x backoff)
failure 4 → max_restarts exceeded → fatal stop

After a successful tick, the backoff timer clears. Use for nodes that can recover from transient failures:

  • Sensor drivers (hardware reconnection)
  • Network clients (server temporarily unavailable)
  • Camera nodes (USB reset)

Skip — Tolerate with Cooldown

// simplified
scheduler.add(telemetry_uploader)
    .order(200)
    .async_io()
    .failure_policy(FailurePolicy::skip(5, 1_u64.secs()))
    .build()?;

After max_failures consecutive failures, the node is suppressed for the cooldown period. After cooldown, the node is allowed again and the failure counter resets.

failure 1 → continue
failure 2 → continue
failure 3 → continue
failure 4 → continue
failure 5 → node suppressed for 1 second
... 1 second passes ...
node allowed again, failure count = 0

Use for nodes whose absence doesn't affect core robot operation:

  • Logging and telemetry upload
  • Diagnostics reporting
  • Cloud sync
  • Non-critical monitoring
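
To picture the suppress-and-reset bookkeeping described above, here is a minimal sketch. The struct and function names are illustrative only, not the actual HORUS internals:

// simplified (illustrative bookkeeping, not the HORUS implementation)
use std::time::{Duration, Instant};

struct SkipState {
    consecutive_failures: u32,
    suppressed_until: Option<Instant>,
}

// Checked before each tick: returns false while the node is suppressed.
fn allowed_to_tick(state: &mut SkipState, now: Instant) -> bool {
    if let Some(until) = state.suppressed_until {
        if now < until {
            return false;                  // still in cooldown
        }
        state.suppressed_until = None;     // cooldown over: allow the node again
        state.consecutive_failures = 0;    // and reset the failure counter
    }
    true
}

// Called when a tick fails: suppress after max_failures consecutive failures.
fn on_failure(state: &mut SkipState, max_failures: u32, cooldown: Duration, now: Instant) {
    state.consecutive_failures += 1;
    if state.consecutive_failures >= max_failures {
        state.suppressed_until = Some(now + cooldown);
    }
}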

Ignore — Swallow Failures

// simplified
scheduler.add(stats_collector)
    .order(100)
    .failure_policy(FailurePolicy::Ignore)
    .build()?;

Failures are completely ignored. The node keeps ticking every cycle regardless of errors. Use only when partial results are acceptable:

  • Statistics collectors (missing one sample is fine)
  • Best-effort visualization
  • Debug output nodes

Severity-Aware Handling

HORUS errors carry severity levels that can override the configured policy:

  • Fatal severity (e.g., shared memory corruption): always stops the scheduler, even with an Ignore policy
  • Transient severity (e.g., topic full, network timeout): de-escalates a Fatal policy to Restart (transient errors are recoverable)
  • Permanent severity (e.g., invalid configuration): follows the configured policy

This means a safety-critical node with Fatal policy won't kill the system on a transient network glitch — it'll restart instead. But a shared-memory corruption always stops, even on an Ignore node.
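
As a concrete picture of that precedence, the sketch below mirrors the rules listed above. The enum and function names are illustrative, not the actual HORUS types:

// simplified (illustrative only; these names are not the HORUS API)
enum Severity { Fatal, Transient, Permanent }
enum Policy { Fatal, Restart, Skip, Ignore }
enum Action { StopScheduler, RestartNode, SkipNode, ContinueTicking }

fn resolve(policy: Policy, severity: Severity) -> Action {
    match (severity, policy) {
        // Fatal-severity errors always stop, regardless of the configured policy.
        (Severity::Fatal, _) => Action::StopScheduler,
        // Transient errors de-escalate a Fatal policy to a restart.
        (Severity::Transient, Policy::Fatal) => Action::RestartNode,
        // Otherwise the configured policy applies as-is.
        (_, Policy::Fatal) => Action::StopScheduler,
        (_, Policy::Restart) => Action::RestartNode,
        (_, Policy::Skip) => Action::SkipNode,
        (_, Policy::Ignore) => Action::ContinueTicking,
    }
}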

Complete Robot Example

// simplified
use horus::prelude::*;

fn main() -> Result<()> {
    let mut scheduler = Scheduler::new()
        .tick_rate(500_u64.hz())
        .prefer_rt()
        .watchdog(500_u64.ms());

    // CRITICAL: Motor controller — stop if it fails
    scheduler.add(MotorController::new())
        .order(0)
        .rate(500_u64.hz())
        .on_miss(Miss::SafeMode)
        .failure_policy(FailurePolicy::Fatal)
        .build()?;

    // RECOVERABLE: Lidar driver — restart on USB disconnect
    scheduler.add(LidarDriver::new())
        .order(1)
        .rate(100_u64.hz())
        .failure_policy(FailurePolicy::restart(5, 100_u64.ms()))
        .build()?;

    // RECOVERABLE: Camera — restart up to 3 times
    scheduler.add(CameraNode::new())
        .order(2)
        .rate(30_u64.hz())
        .failure_policy(FailurePolicy::restart(3, 200_u64.ms()))
        .build()?;

    // NON-CRITICAL: Path planner — skip if it fails repeatedly
    scheduler.add(PathPlanner::new())
        .order(5)
        .compute()
        .failure_policy(FailurePolicy::skip(3, 2_u64.secs()))
        .build()?;

    // BEST-EFFORT: Telemetry — ignore failures
    scheduler.add(TelemetryUploader::new())
        .order(200)
        .async_io()
        .rate(1_u64.hz())
        .failure_policy(FailurePolicy::Ignore)
        .build()?;

    scheduler.run()
}

Choosing the Right Policy

  • Motor control, safety → Fatal. Unsafe to continue without these.
  • Sensor drivers → Restart(3-5, 50-200ms). Hardware reconnects are common.
  • Perception pipelines → Restart(3, 100ms) or Skip(5, 2s). Can recover or degrade gracefully.
  • Logging, telemetry → Skip(5, 1s) or Ignore. Non-critical, absence is tolerable.
  • Debug/visualization → Ignore. Partial results are fine.

Python Error Handlers

Use the on_error callback to handle failures in Python nodes:

import horus

def my_error_handler(node, exception):
    node.log_error(f"Node failed: {exception}")
    # Optionally take corrective action
    if "USB" in str(exception):
        node.log_warning("USB disconnected — will retry on restart")

def sensor_tick(node):
    data = read_hardware()  # May raise OSError
    node.send("sensor", data)

sensor = horus.Node(
    name="sensor",
    tick=sensor_tick,
    on_error=my_error_handler,
    failure_policy="restart",
    rate=100,
    pubs=["sensor"],
)
horus.run(sensor)

The on_error callback runs after the failure policy processes the error. It receives the node and the exception object. Use it for logging, alerting, or state cleanup before the next restart.


Monitoring Failures

Failure events are recorded in the BlackBox flight recorder:

# View failure events from the blackbox
horus blackbox show --filter errors

# Monitor live
horus log -f --level error

In code:

// simplified
// Inspect anomalies via CLI: horus blackbox --anomalies
if let Some(bb) = scheduler.get_blackbox() {
    for record in bb.lock().expect("blackbox lock").anomalies() {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}

Design Decisions

Why per-node policies instead of a global failure mode?

Different nodes have fundamentally different failure characteristics. A motor controller failure is safety-critical -- you must stop. A telemetry uploader failure is harmless -- you can ignore it. Per-node policies match the failure contract to the node's role in the system.

Why severity-aware handling?

A transient network timeout on a Fatal node should not kill the system -- it should restart. But shared memory corruption on an Ignore node must always stop -- it indicates fundamental system failure. Severity-aware handling prevents both over-reaction (killing the system on a glitch) and under-reaction (ignoring corruption).

Why exponential backoff on Restart?

A sensor driver that fails once probably has a transient issue (USB reset). A sensor driver that fails 5 times in rapid succession has a permanent problem. Exponential backoff gives transient issues time to resolve while quickly escalating persistent failures.
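
As a rough sketch of that doubling schedule (the helper below is illustrative, not the HORUS implementation):

// simplified (illustrative helper, not the HORUS implementation)
fn restart_delay_ms(base_ms: u64, consecutive_failures: u32) -> u64 {
    // Failure 1 waits base_ms, failure 2 waits 2x base, failure 3 waits 4x base, ...
    // The cap on doublings only keeps this sketch from overflowing.
    let doublings = consecutive_failures.saturating_sub(1).min(16);
    base_ms.saturating_mul(1u64 << doublings)
}

// With FailurePolicy::restart(3, 50ms): 50 ms, 100 ms, 200 ms, then the 4th failure escalates to a fatal stop.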

Trade-offs

  • Gain: one crashing node does not kill the system. Cost: you must think about failure contracts per node.
  • Gain: Restart with backoff handles transient hardware issues. Cost: backoff adds latency during recovery.
  • Gain: Skip prevents non-critical failures from cascading. Cost: skipped nodes produce no output during cooldown.
  • Gain: severity overrides prevent both over- and under-reaction. Cost: behavior depends on error severity, not just the configured policy.

Common Errors

  • Symptom: system stops on a transient network error
    Cause: Fatal policy on a node with transient failures
    Fix: use Restart for nodes that can recover, or rely on severity-aware de-escalation

  • Symptom: node restarts endlessly
    Cause: max_restarts set too high, or the failure is permanent
    Fix: lower max_restarts and check the logs for the root cause

  • Symptom: non-critical node drags down performance
    Cause: Skip cooldown too short, so the node fails again immediately after cooldown
    Fix: increase the cooldown duration or investigate the root cause

  • Symptom: Fatal node did not stop the system
    Cause: the error was classified as Transient severity and de-escalated to Restart
    Fix: expected behavior; truly fatal errors (e.g., SHM corruption) always stop regardless

  • Symptom: cannot find failure events after a crash
    Cause: no BlackBox configured
    Fix: add .blackbox(64) to the scheduler for crash forensics
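
For the last fix, enabling the flight recorder is a scheduler builder call; a minimal sketch reusing the builder from the example above:

// simplified
let mut scheduler = Scheduler::new()
    .tick_rate(500_u64.hz())
    .blackbox(64);   // keep recent failure events for post-crash forensics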

See Also