Fault Tolerance

HORUS provides tier-aware failure policies that prevent cascading failures by controlling how the scheduler responds when nodes fail. Each policy defines a different recovery strategy.

Overview

Failure policies protect against:

  • Cascading failures from one failing node
  • Resource exhaustion from repeated retry attempts
  • System-wide slowdowns from blocked calls

The Scheduler manages failure policies automatically. Configure per-node policies using .failure_policy() in the node builder, and the scheduler handles failure tracking, cooldowns, and recovery.

Failure Policies

Policy                                           Behavior                                    Default For
FailurePolicy::Fatal                             Stop the scheduler immediately              Critical real-time nodes
FailurePolicy::restart(max, backoff_ms)          Re-initialize with exponential backoff      Standard nodes
FailurePolicy::skip(max_failures, cooldown_ms)   Suppress the node after repeated failures   Background, async I/O nodes
FailurePolicy::Ignore                            Log the failure, continue running           Best-effort nodes

How Policies Work

Fatal

First failure stops the scheduler. Use for motor controllers, safety systems — anything where continued operation after failure is unsafe.

Restart

Tracks failures and restarts the node with exponential backoff:

failure 1 → restart (10ms backoff)
failure 2 → restart (20ms backoff)
failure 3 → restart (40ms backoff)
failure 4 → exceeds max_restarts → fatal stop

After a successful tick, the backoff timer is cleared.
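The doubling schedule above can be sketched as a small helper. This is an illustrative sketch only; the function name and signature are hypothetical, not the HORUS scheduler's internals:

```rust
// Illustrative sketch of the doubling backoff schedule described above.
// `base_backoff_ms` mirrors the backoff_ms argument of
// FailurePolicy::restart(max, backoff_ms); the name is hypothetical.
fn backoff_ms(base_backoff_ms: u64, failure_count: u32) -> u64 {
    // failure 1 -> base, failure 2 -> 2*base, failure 3 -> 4*base, ...
    // saturating_mul avoids overflow for very large failure counts.
    base_backoff_ms.saturating_mul(1u64 << (failure_count - 1))
}

fn main() {
    for failure in 1..=3 {
        println!(
            "failure {} -> restart ({}ms backoff)",
            failure,
            backoff_ms(10, failure)
        );
    }
}
```

With a 10ms base this reproduces the 10ms / 20ms / 40ms sequence shown above; once `max_restarts` is exceeded, the scheduler stops fatally instead of restarting again.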

Skip

Tracks consecutive failures and suppresses the node after the threshold:

failure 1 → continue (below threshold)
failure 2 → continue (below threshold)
failure 3 → suppress node for cooldown_ms
         ... cooldown expires ...
         → node allowed again, failure count reset

Use for logging, telemetry, diagnostics — nodes where temporary absence is acceptable.
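The suppression bookkeeping above can be sketched as a small state machine. The scheduler tracks this internally; the type and method names here are hypothetical, for illustration only:

```rust
// Hypothetical sketch of skip-policy bookkeeping: count consecutive
// failures, suppress at the threshold, reset when the cooldown expires.
struct SkipState {
    max_failures: u32,
    consecutive_failures: u32,
    suppressed: bool,
}

impl SkipState {
    fn new(max_failures: u32) -> Self {
        Self { max_failures, consecutive_failures: 0, suppressed: false }
    }

    // Record one tick result; returns true if the node is now suppressed.
    fn on_tick(&mut self, failed: bool) -> bool {
        if failed {
            self.consecutive_failures += 1;
            if self.consecutive_failures >= self.max_failures {
                self.suppressed = true; // suppress for cooldown_ms
            }
        } else {
            self.consecutive_failures = 0; // success clears the streak
        }
        self.suppressed
    }

    // Called once cooldown_ms elapses: allow the node again, reset count.
    fn on_cooldown_expired(&mut self) {
        self.suppressed = false;
        self.consecutive_failures = 0;
    }
}
```

Note that only consecutive failures count: a single successful tick resets the streak, so an intermittently flaky node is never suppressed unless it fails `max_failures` times in a row.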

Ignore

Failures are swallowed entirely. The node keeps ticking every cycle regardless of errors.

Configuring Failure Policies

Set per-node failure policies using .failure_policy() in the node builder:

use horus::prelude::*;

let mut scheduler = Scheduler::new()
    .watchdog(500_u64.ms());

// Critical node: stop immediately on failure
scheduler.add(motor_controller)
    .order(0)
    .failure_policy(FailurePolicy::Fatal)
    .build()?;

// Sensor node: restart up to 5 times with 100ms delay
scheduler.add(sensor_reader)
    .order(1)
    .failure_policy(FailurePolicy::restart(5, 100))
    .build()?;

// Logging node: skip after 3 failures, retry after 5 seconds
scheduler.add(data_logger)
    .order(5)
    .failure_policy(FailurePolicy::skip(3, 5000))
    .build()?;

// Telemetry: ignore failures entirely
scheduler.add(metrics_node)
    .order(10)
    .failure_policy(FailurePolicy::Ignore)
    .build()?;

Nodes use their tier's default failure policy unless overridden with .failure_policy().

Monitoring

Failure events are automatically recorded in the BlackBox flight recorder. After a failure, inspect the blackbox to see what happened:

if let Some(bb) = scheduler.blackbox() {
    for record in &bb.anomalies() {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}

Best Practices

  1. Set failure policies per node: Critical nodes should use FailurePolicy::Fatal, non-critical nodes can use skip() or Ignore.

  2. Use skip() for non-critical nodes: Logging, telemetry, and diagnostics nodes should tolerate failures gracefully.

  3. Test failure scenarios: Verify your system behaves correctly when nodes are temporarily suppressed by skip() policies.

See Also