# Fault Tolerance
HORUS provides tier-aware failure policies that prevent cascading failures by controlling how the scheduler responds when nodes fail. Each policy defines a different recovery strategy.
## Overview
Failure policies protect against:
- Cascading failures from one failing node
- Resource exhaustion from repeated retry attempts
- System-wide slowdowns from blocked calls
The Scheduler manages failure policies automatically. Configure per-node policies using .failure_policy() in the node builder, and the scheduler handles failure tracking, cooldowns, and recovery.
## Failure Policies
| Policy | Behavior | Default For |
|---|---|---|
| `FailurePolicy::Fatal` | Stop the scheduler immediately | Critical real-time nodes |
| `FailurePolicy::restart(max_restarts, backoff_ms)` | Re-initialize with exponential backoff | Standard nodes |
| `FailurePolicy::skip(max_failures, cooldown_ms)` | Suppress the node after repeated failures | Background, async I/O nodes |
| `FailurePolicy::Ignore` | Log the failure and continue running | Best-effort nodes |
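The four policies can be pictured as variants of a single enum. The sketch below is illustrative only: the constructor names `restart` and `skip` follow the table, but the enum shape and field names are assumptions, not the actual HORUS definition.

```rust
// Illustrative shape of the policy type; the real HORUS definition may differ.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum FailurePolicy {
    /// Stop the scheduler on the first failure.
    Fatal,
    /// Re-initialize with exponential backoff, up to a restart limit.
    Restart { max_restarts: u32, backoff_ms: u64 },
    /// Suppress the node for a cooldown after repeated failures.
    Skip { max_failures: u32, cooldown_ms: u64 },
    /// Log and continue; failures never affect scheduling.
    Ignore,
}

impl FailurePolicy {
    fn restart(max_restarts: u32, backoff_ms: u64) -> Self {
        Self::Restart { max_restarts, backoff_ms }
    }
    fn skip(max_failures: u32, cooldown_ms: u64) -> Self {
        Self::Skip { max_failures, cooldown_ms }
    }
}

fn main() {
    // The lowercase constructors simply wrap the struct variants.
    assert_eq!(
        FailurePolicy::restart(5, 100),
        FailurePolicy::Restart { max_restarts: 5, backoff_ms: 100 }
    );
    assert_eq!(
        FailurePolicy::skip(3, 5000),
        FailurePolicy::Skip { max_failures: 3, cooldown_ms: 5000 }
    );
    println!("policy constructors ok");
}
```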
## How Policies Work
### Fatal
First failure stops the scheduler. Use for motor controllers, safety systems — anything where continued operation after failure is unsafe.
### Restart
Tracks failures and restarts the node with exponential backoff:
```
failure 1 → restart (10ms backoff)
failure 2 → restart (20ms backoff)
failure 3 → restart (40ms backoff)
failure 4 → exceeds max_restarts → fatal stop
```
After a successful tick, the backoff timer is cleared.
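The doubling sequence above can be sketched as a small helper. The function name and the exact rule (base delay times 2 raised to `failures - 1`) are assumptions for illustration, not the scheduler's internal code:

```rust
/// Hypothetical backoff rule: double the base delay after each
/// consecutive failure. Illustrative only, not the HORUS internals.
fn restart_backoff_ms(base_ms: u64, failure_count: u32) -> u64 {
    base_ms * 2u64.pow(failure_count.saturating_sub(1))
}

fn main() {
    // Mirrors the sequence above for a 10ms base backoff.
    assert_eq!(restart_backoff_ms(10, 1), 10);
    assert_eq!(restart_backoff_ms(10, 2), 20);
    assert_eq!(restart_backoff_ms(10, 3), 40);
    println!("backoff sequence ok");
}
```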
### Skip
Tracks consecutive failures and suppresses the node after the threshold:
```
failure 1 → continue (below threshold)
failure 2 → continue (below threshold)
failure 3 → suppress node for cooldown_ms
... cooldown expires ...
→ node allowed again, failure count reset
```
Use for logging, telemetry, diagnostics — nodes where temporary absence is acceptable.
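The counting-and-cooldown behavior can be sketched as a small state machine. `SkipTracker` and its method names are hypothetical, but the logic mirrors the sequence above: count consecutive failures, suppress for `cooldown_ms` once the threshold is hit, then reset.

```rust
/// Illustrative skip-policy tracker (not the real HORUS type).
struct SkipTracker {
    max_failures: u32,
    cooldown_ms: u64,
    consecutive: u32,
    suppressed_until_ms: Option<u64>,
}

impl SkipTracker {
    fn new(max_failures: u32, cooldown_ms: u64) -> Self {
        Self { max_failures, cooldown_ms, consecutive: 0, suppressed_until_ms: None }
    }

    /// Returns true if the node may tick at `now_ms`.
    fn allowed(&mut self, now_ms: u64) -> bool {
        match self.suppressed_until_ms {
            Some(until) if now_ms < until => false,
            Some(_) => {
                // Cooldown expired: node allowed again, failure count reset.
                self.suppressed_until_ms = None;
                self.consecutive = 0;
                true
            }
            None => true,
        }
    }

    fn record_failure(&mut self, now_ms: u64) {
        self.consecutive += 1;
        if self.consecutive >= self.max_failures {
            self.suppressed_until_ms = Some(now_ms + self.cooldown_ms);
        }
    }
}

fn main() {
    // Matches FailurePolicy::skip(3, 5000): 3 failures, 5 s cooldown.
    let mut t = SkipTracker::new(3, 5000);
    t.record_failure(0);
    t.record_failure(1);
    assert!(t.allowed(2));    // below threshold, still runs
    t.record_failure(2);      // third failure hits the threshold
    assert!(!t.allowed(100)); // suppressed during cooldown
    assert!(t.allowed(6000)); // cooldown over, count reset
    println!("skip tracker ok");
}
```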
### Ignore
Failures are swallowed entirely. The node keeps ticking every cycle regardless of errors.
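Conceptually, Ignore means the tick result is discarded and the node always stays in the schedule. The helper below (`tick_with_ignore` is a hypothetical name, not a HORUS API) sketches that semantics:

```rust
// Conceptual sketch of Ignore semantics: the tick result is dropped
// and the node is always kept for the next cycle.
fn tick_with_ignore(tick: impl Fn() -> Result<(), String>) -> bool {
    // Failures are swallowed; a real scheduler would log them first.
    let _ = tick();
    true // node keeps ticking next cycle
}

fn main() {
    // The node is retained whether the tick fails or succeeds.
    assert!(tick_with_ignore(|| Err("sensor glitch".into())));
    assert!(tick_with_ignore(|| Ok(())));
    println!("ignore semantics ok");
}
```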
## Configuring Failure Policies
Set per-node failure policies using `.failure_policy()` in the node builder:
```rust
use horus::prelude::*;

let mut scheduler = Scheduler::new()
    .watchdog(500_u64.ms());

// Critical node: stop immediately on failure
scheduler.add(motor_controller)
    .order(0)
    .failure_policy(FailurePolicy::Fatal)
    .build()?;

// Sensor node: restart up to 5 times with a 100ms initial backoff
scheduler.add(sensor_reader)
    .order(1)
    .failure_policy(FailurePolicy::restart(5, 100))
    .build()?;

// Logging node: skip after 3 failures, retry after 5 seconds
scheduler.add(data_logger)
    .order(5)
    .failure_policy(FailurePolicy::skip(3, 5000))
    .build()?;

// Telemetry: ignore failures entirely
scheduler.add(metrics_node)
    .order(10)
    .failure_policy(FailurePolicy::Ignore)
    .build()?;
```
Nodes use their tier's default failure policy unless overridden with `.failure_policy()`.
## Monitoring
Failure events are automatically recorded in the BlackBox flight recorder. After a failure, inspect the blackbox to see what happened:
```rust
if let Some(bb) = scheduler.blackbox() {
    for record in &bb.anomalies() {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}
```
## Best Practices
- **Set failure policies per node:** Critical nodes should use `FailurePolicy::Fatal`; non-critical nodes can use `skip()` or `Ignore`.
- **Use `skip()` for non-critical nodes:** Logging, telemetry, and diagnostics nodes should tolerate failures gracefully.
- **Test failure scenarios:** Verify your system behaves correctly when nodes are temporarily suppressed by `skip()` policies.
## See Also
- BlackBox Flight Recorder - Event logging for post-mortem analysis
- Safety Monitor - Real-time safety monitoring
- Scheduler Configuration - SchedulerConfig and node configuration