Safety Monitor
The Safety Monitor provides real-time monitoring for safety-critical robotics applications. It enforces timing constraints, monitors node health, and applies the configured deadline miss policy when a violation occurs.
Overview
The Safety Monitor includes:
- Watchdogs: Monitor node liveness and trigger action if a critical node hangs
- Budget Enforcement: Per-node tick budgets that act if a node takes too long (implicit when nodes have .rate() set)
- Deadline Tracking: Count deadline misses and apply the configured Miss policy
- Miss Policies: Warn, Skip, SafeMode, or Stop, giving per-node control over what happens on a deadline miss
The Scheduler manages the safety monitor internally — you configure it with composable builder methods and the scheduler automatically feeds watchdogs, checks budgets, and applies miss policies.
Enabling Safety Monitoring
Use composable builder methods to enable safety monitoring. Each method adds a specific safety feature:
use horus::prelude::*;
// Production: watchdog for frozen node detection
// Budget enforcement is implicit when nodes have .rate() set
let mut scheduler = Scheduler::new()
.watchdog(500_u64.ms())
.tick_rate(1000_u64.hz());
// Safety-critical: require RT + blackbox + strict deadline limit
let mut scheduler = Scheduler::new()
.require_rt()
.watchdog(500_u64.ms())
.blackbox(64)
.tick_rate(1000_u64.hz())
.max_deadline_misses(3);
Composable Builder Comparison
| Builder | Watchdog | Budget Enforcement | Memory Locking | Blackbox |
|---|---|---|---|---|
| new() | No | Implicit (when nodes have .rate()) | No | No |
| .watchdog(500_u64.ms()) | Yes (500ms) | Implicit | No | No |
| .require_rt() | No | Implicit | Yes | No |
| .watchdog(500_u64.ms()).require_rt() | Yes (500ms) | Implicit | Yes | No |
| .watchdog(500_u64.ms()).blackbox(64) | Yes (500ms) | Implicit | No | Yes (64MB) |
Configuring Nodes with Rates
After configuring the scheduler, add nodes with timing constraints using the node builder. Setting .rate() automatically marks the node as RT and derives budget (80% of period) and deadline (95% of period):
use horus::prelude::*;
let mut scheduler = Scheduler::new()
.watchdog(500_u64.ms())
.tick_rate(1000_u64.hz());
// RT node — rate auto-derives budget and deadline
scheduler.add(motor_controller)
.order(0)
.rate(1000_u64.hz()) // budget=800us, deadline=950us
.on_miss(Miss::SafeMode) // Enter safe state on miss
.build()?;
scheduler.add(sensor_fusion)
.order(1)
.rate(200_u64.hz()) // budget=4ms, deadline=4.75ms
.on_miss(Miss::Skip) // Skip tick on miss
.build()?;
scheduler.run()?;
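The budget and deadline figures in the comments above follow directly from the period. The sketch below reproduces that arithmetic using only the standard library. The helper names period_from_hz, derive_budget and derive_deadline are hypothetical and are not part of the horus API; the 80% and 95% factors are the ones stated above.

```rust
use std::time::Duration;

/// Period of one tick for a rate given in hertz (hypothetical helper).
fn period_from_hz(hz: u64) -> Duration {
    assert!(hz > 0, "rate must be non-zero");
    Duration::from_nanos(1_000_000_000 / hz)
}

/// Budget as described above: 80% of the period.
fn derive_budget(period: Duration) -> Duration {
    period * 80 / 100
}

/// Deadline as described above: 95% of the period.
fn derive_deadline(period: Duration) -> Duration {
    period * 95 / 100
}
```

For a 200 Hz node this gives a 5ms period, a 4ms budget and a 4.75ms deadline, matching the sensor_fusion comment above.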
Watchdogs
Watchdogs monitor node liveness. The scheduler automatically feeds watchdogs on successful node ticks. If a critical node fails to execute within the watchdog timeout, the safety monitor triggers graduated degradation.
Normal operation:
Node tick → success → watchdog fed → timer reset
Failure scenario:
Node hangs → watchdog timeout expires → graduated degradation → EMERGENCY STOP
Timeout Guidelines
Watchdog timeout should be:
- Longer than expected execution time
- Shorter than safety-critical response time
Example:
Expected tick period: 10ms
Safety deadline: 100ms
Watchdog timeout: 50ms (5× period)
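The two guidelines can be checked mechanically before a timeout is passed to .watchdog(). The sketch below is standalone and uses hypothetical helper names (timeout_from_period, watchdog_timeout_is_sane) that do not exist in horus. It assumes the tick period is a conservative upper bound on expected execution time.

```rust
use std::time::Duration;

/// Timeout chosen as a multiple of the tick period, e.g. 5x as in the example above.
fn timeout_from_period(period: Duration, multiple: u32) -> Duration {
    period * multiple
}

/// True when the timeout is longer than the expected execution time
/// and shorter than the safety-critical response time.
fn watchdog_timeout_is_sane(
    timeout: Duration,
    expected_execution: Duration,
    safety_deadline: Duration,
) -> bool {
    expected_execution < timeout && timeout < safety_deadline
}
```

With a 10ms period and a 100ms safety deadline, a 5x multiple yields 50ms, which passes the check. A 150ms timeout would fail because it would detect a hang only after the safety deadline has passed.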
Budget Enforcement
Budget enforcement is implicit when nodes have .rate() set. The auto-derived budget (80% of the period from .rate()) ensures nodes complete within time limits:
scheduler.add(motor_controller)
.order(0)
.rate(1000_u64.hz()) // auto-derives budget=800us, deadline=950us
.on_miss(Miss::SafeMode) // Enter safe state on budget/deadline miss
.build()?;
When a node exceeds its budget, the configured Miss policy is applied. Violations are also recorded in the BlackBox when using .blackbox(n).
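Conceptually, enforcement means timing each tick and comparing the elapsed time against the derived limits. The sketch below illustrates that comparison with the standard library only. TickOutcome, classify_tick and timed are hypothetical names for illustration, and how the scheduler actually measures ticks internally is not shown in this document.

```rust
use std::time::{Duration, Instant};

/// Result of comparing one tick's duration against its limits (illustrative type).
#[derive(Debug, PartialEq, Eq)]
enum TickOutcome {
    WithinBudget,
    BudgetExceeded,
    DeadlineMissed,
}

/// Classifies an elapsed tick time; the deadline is checked first because it is the larger limit.
fn classify_tick(elapsed: Duration, budget: Duration, deadline: Duration) -> TickOutcome {
    if elapsed > deadline {
        TickOutcome::DeadlineMissed
    } else if elapsed > budget {
        TickOutcome::BudgetExceeded
    } else {
        TickOutcome::WithinBudget
    }
}

/// Runs a closure and returns how long it took on the monotonic clock.
fn timed<F: FnOnce()>(work: F) -> Duration {
    let start = Instant::now();
    work();
    start.elapsed()
}
```

For a 1kHz node (budget 800us, deadline 950us), a 900us tick would be classified as BudgetExceeded and a 1ms tick as DeadlineMissed.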
Node Health States
Every node has a health state tracked internally by the scheduler. The four states form a graduated degradation ladder:
| State | Meaning |
|---|---|
| Healthy | Normal operation: node ticks every cycle |
| Warning | Watchdog at 1x timeout: node still ticks, but a warning is logged |
| Unhealthy | Watchdog at 2x timeout: node is skipped in the tick loop |
| Isolated | Watchdog at 3x timeout: enter_safe_state() is called, node is skipped |
Graduated Degradation Transitions
The scheduler evaluates watchdog severity every tick and transitions nodes through health states automatically:
Healthy ──(1x timeout)──► Warning ──(2x timeout)──► Unhealthy ──(3x timeout)──► Isolated
   ▲                         │                                                      │
   │    (successful tick)    │      (continued successful ticks restore rate)       │
   └─────────────────────────┘                                                      │
   ▲                                                                                │
   └──────────────────── (recovery via RestoreRate) ────────────────────────────────┘
Escalation happens when a node's watchdog is not fed (the node is slow or hung):
- Healthy to Warning — 1x watchdog timeout elapsed. The node still runs, but the scheduler logs a warning.
- Warning to Unhealthy — 2x timeout. The node is skipped entirely in the tick loop to prevent cascading delays.
- Unhealthy to Isolated: 3x timeout. The scheduler calls enter_safe_state() on the node and continues to skip it. For critical nodes, this also triggers an emergency stop.
Recovery happens on successful ticks:
- A Warning node that ticks successfully transitions back to Healthy immediately, and its watchdog is re-fed.
- An Isolated or rate-reduced node can recover through the graduated degradation system: after enough consecutive successful ticks at a reduced rate, the scheduler restores the original rate and transitions back to Healthy.
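The escalation ladder and the immediate Warning recovery can be modelled as a small pure function over the time since the watchdog was last fed. This is a standalone sketch, not the scheduler's internal code. The Health enum here is a local stand-in, the function names are hypothetical, and treating each threshold as inclusive is an assumption. Recovery from Isolated depends on a count of consecutive successful ticks that this document does not specify, so it is left out.

```rust
use std::time::Duration;

/// Local model of the four health states described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Health {
    Healthy,
    Warning,
    Unhealthy,
    Isolated,
}

/// Maps time since the last watchdog feed onto the ladder:
/// 1x timeout -> Warning, 2x -> Unhealthy, 3x -> Isolated.
fn severity(since_last_feed: Duration, timeout: Duration) -> Health {
    if since_last_feed >= timeout * 3 {
        Health::Isolated
    } else if since_last_feed >= timeout * 2 {
        Health::Unhealthy
    } else if since_last_feed >= timeout {
        Health::Warning
    } else {
        Health::Healthy
    }
}

/// Healthy and Warning nodes still tick; Unhealthy and Isolated nodes are skipped.
fn is_ticked(health: Health) -> bool {
    matches!(health, Health::Healthy | Health::Warning)
}

/// A successful tick returns a Warning node to Healthy; other states are unchanged here.
fn after_successful_tick(health: Health) -> Health {
    match health {
        Health::Warning => Health::Healthy,
        other => other,
    }
}
```

With a 500ms timeout, a node that has not fed its watchdog for 1.2s would be Unhealthy and skipped, and at 1.5s it would become Isolated.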
Relationship to Miss Policies
Node health states and Miss policies are complementary:
- Miss policies act on individual deadline/budget violations (skip one tick, enter safe mode, stop the scheduler).
- Health states track sustained behavior over time via the watchdog. A node can be in Warning even if its Miss policy is Warn; repeated warnings escalate to Unhealthy and eventually Isolated.
Both systems work together: the Miss policy handles immediate responses, while health states provide graduated, automatic degradation for persistently failing nodes.
Shutdown Report
When the scheduler shuts down with .watchdog() enabled, the timing report includes a health summary:
Node Health:
[OK] All 4 nodes healthy
Or, if any nodes degraded during the run:
Node Health:
3 healthy, 1 warning, 0 unhealthy, 0 isolated, 0 stopped
- sensor_fusion: WARNING
Miss — Deadline Miss Policy
The Miss enum controls what happens when a node exceeds its deadline:
| Policy | Behavior |
|---|---|
| Miss::Warn | Log a warning and continue (default) |
| Miss::Skip | Skip the node for this tick |
| Miss::SafeMode | Call enter_safe_state() on the node |
| Miss::Stop | Stop the entire scheduler |
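The table can be read as a dispatch from policy to scheduler action. The sketch below models that dispatch with local stand-in types; MissPolicy, Action and action_for are hypothetical names used only for illustration and are not the horus Miss enum or its implementation.

```rust
/// Local mirror of the four policies; Warn is the default, as in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum MissPolicy {
    #[default]
    Warn,
    Skip,
    SafeMode,
    Stop,
}

/// What the scheduler does in response to a miss (illustrative type).
#[derive(Debug, PartialEq, Eq)]
enum Action {
    LogWarning,
    SkipTick,
    EnterSafeState,
    StopScheduler,
}

/// Maps each policy to the behavior listed in the table.
fn action_for(policy: MissPolicy) -> Action {
    match policy {
        MissPolicy::Warn => Action::LogWarning,
        MissPolicy::Skip => Action::SkipTick,
        MissPolicy::SafeMode => Action::EnterSafeState,
        MissPolicy::Stop => Action::StopScheduler,
    }
}
```

Because the match is exhaustive, adding a fifth policy to the enum would fail to compile until its behavior is defined.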
SafeMode in Detail
When Miss::SafeMode triggers:
- The scheduler calls enter_safe_state() on the offending node
- Each subsequent tick, the scheduler checks is_safe_state()
- When the node reports safe, normal operation resumes
Implement these on your Node:
impl Node for MotorController {
fn enter_safe_state(&mut self) {
self.velocity = 0.0;
self.disable_motor();
}
fn is_safe_state(&self) -> bool {
self.velocity == 0.0
}
fn tick(&mut self) { /* ... */ }
}
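To make the enter/check/resume sequence concrete, the sketch below models the scheduler side with a small guard. It is a standalone illustration. The SafeState trait is a local stand-in for the two hooks rather than the horus Node trait, SafeModeGuard is a hypothetical type, and the Motor here brakes gradually (unlike the example above) so that the polling step is visible.

```rust
/// Local stand-in for the two safe-state hooks; not the horus Node trait.
trait SafeState {
    fn enter_safe_state(&mut self);
    fn is_safe_state(&self) -> bool;
}

/// Hypothetical motor that needs several steps to come to rest.
struct Motor {
    velocity: f64,
    braking: bool,
}

impl Motor {
    /// Reduces velocity by one unit per call while braking, never below zero.
    fn coast(&mut self) {
        if self.braking {
            self.velocity = (self.velocity - 1.0).max(0.0);
        }
    }
}

impl SafeState for Motor {
    fn enter_safe_state(&mut self) {
        self.braking = true;
    }
    fn is_safe_state(&self) -> bool {
        self.velocity == 0.0
    }
}

/// Tracks whether a node is currently held in safe mode.
struct SafeModeGuard {
    active: bool,
}

impl SafeModeGuard {
    fn new() -> Self {
        Self { active: false }
    }

    /// On a deadline miss: put the node into its safe state and hold it there.
    fn on_miss<N: SafeState>(&mut self, node: &mut N) {
        node.enter_safe_state();
        self.active = true;
    }

    /// Each subsequent tick: release the hold once the node reports safe.
    /// Returns true when the node may tick normally.
    fn may_tick<N: SafeState>(&mut self, node: &N) -> bool {
        if self.active && node.is_safe_state() {
            self.active = false;
        }
        !self.active
    }
}
```

A motor starting at velocity 2.0 stays held for two coast() steps after the miss and is released on the first check after it reaches zero.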
Emergency Stop
Emergency stop is triggered automatically by:
- Watchdog expiration (node hangs)
- Miss::Stop policy on deadline miss
- Exceeding the max_deadline_misses threshold
When emergency stop triggers:
- All node execution is halted
- An emergency stop event is recorded in the BlackBox
- The scheduler transitions to emergency state
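The max_deadline_misses trigger can be pictured as a counter that latches an emergency state once the threshold is passed. The sketch below is a standalone model with a hypothetical DeadlineMissCounter type. It assumes that "exceeding" means strictly more misses than the configured value and that misses are counted in total rather than consecutively; neither detail is confirmed by this document.

```rust
/// Counts deadline misses and latches an emergency state (illustrative type).
struct DeadlineMissCounter {
    misses: u32,
    max_misses: u32,
    emergency: bool,
}

impl DeadlineMissCounter {
    fn new(max_misses: u32) -> Self {
        Self { misses: 0, max_misses, emergency: false }
    }

    /// Records one miss and returns true if the emergency state is now active.
    /// Assumption: the stop triggers when the count is strictly greater than the limit.
    fn record_miss(&mut self) -> bool {
        self.misses = self.misses.saturating_add(1);
        if self.misses > self.max_misses {
            self.emergency = true;
        }
        self.emergency
    }

    /// Once the emergency state is latched, no node execution is allowed.
    fn may_run_nodes(&self) -> bool {
        !self.emergency
    }
}
```

Under these assumptions, a limit of 3 tolerates three misses and latches the emergency state on the fourth.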
Inspecting After Emergency Stop
use horus::prelude::*;
let mut scheduler = Scheduler::new()
.watchdog(500_u64.ms())
.blackbox(64)
.tick_rate(1000_u64.hz());
// ... application runs and hits emergency stop ...
// Inspect what happened via BlackBox
if let Some(bb) = scheduler.get_blackbox() {
let anomalies = bb.lock().unwrap().anomalies();
println!("=== SAFETY EVENTS ({}) ===", anomalies.len());
for record in &anomalies {
println!("[tick {}] {:?}", record.tick, record.event);
}
}
Best Practices
1. Start with Conservative Rates
Set rates generously initially, then tighten after profiling:
// Start: use rate() — auto-derives budget at 80% of period
scheduler.add(motor_controller)
.order(0)
.rate(500_u64.hz()) // period=2ms, budget=1.6ms
.on_miss(Miss::Warn) // Log only while tuning
.build()?;
// After profiling: tighten to 1kHz
scheduler.add(motor_controller)
.order(0)
.rate(1000_u64.hz()) // period=1ms, budget=800us
.on_miss(Miss::SafeMode) // Enforce in production
.build()?;
2. Layer Safety Checks
Use composable builders (watchdog + blackbox) with per-node miss policies:
// .watchdog() gives you frozen node detection
// Budget enforcement is implicit from .rate()
let mut scheduler = Scheduler::new()
.watchdog(500_u64.ms())
.blackbox(64)
.tick_rate(1000_u64.hz());
// Then set per-node policies for fine-grained control
scheduler.add(motor_controller)
.order(0)
.rate(1000_u64.hz())
.on_miss(Miss::SafeMode) // Critical — enter safe state
.build()?;
scheduler.add(telemetry)
.order(10)
.rate(10_u64.hz())
.on_miss(Miss::Skip) // Non-critical — just skip
.build()?;
3. Choose the Right Configuration
| Use Case | Configuration |
|---|---|
| Medical / surgical robots | .require_rt().watchdog(500_u64.ms()).blackbox(64) |
| Industrial control | .require_rt().watchdog(500_u64.ms()) |
| CNC / aerospace | .require_rt().watchdog(500_u64.ms()).blackbox(64).max_deadline_misses(3) |
| General production | .watchdog(500_u64.ms()).blackbox(64) |
4. Test Safety Setup
At a minimum, verify that your safety configuration builds without errors, then extend the test to exercise deadline misses:
#[test]
fn test_safety_critical_setup() {
let mut scheduler = Scheduler::new()
.watchdog(500_u64.ms())
.tick_rate(1000_u64.hz());
scheduler.add(test_node)
.order(0)
.rate(1000_u64.hz())
.on_miss(Miss::SafeMode)
.build()
.expect("should build node");
}
See Also
- Scheduling - Scheduler overview and node ordering
- BlackBox Flight Recorder - Event recording for post-mortem analysis
- Fault Tolerance - Failure policies and recovery
- Scheduler Configuration - Builder methods and node configuration