BlackBox Flight Recorder
Your robot crashed overnight and the logs are gone. You need a crash-safe, always-on event recorder that captures the sequence of events leading up to any failure. The BlackBox is that recorder.
When To Use This
| Situation | Use BlackBox? |
|---|---|
| Robot runs unattended (production, field tests) | Yes -- you need crash forensics |
| Safety-critical system (motors, arms, drones) | Yes -- every deadline miss is recorded |
| Development with debugger attached | Optional -- you can inspect directly |
| Short test runs (under 5 minutes) | Optional -- logs are usually sufficient |
| Overnight regression testing | Yes -- find intermittent failures |
Use Record & Replay instead if you need full node state (inputs/outputs) for deterministic replay. The BlackBox captures lightweight events (what happened), not full data.
Prerequisites
- Familiarity with Scheduler Configuration -- especially the .blackbox() builder method
- Understanding of Safety Monitor for interpreting deadline misses and watchdog events
Understanding the BlackBox
BlackBox vs Logging: Logs are free-form text that grows without bound and must be parsed. The BlackBox stores structured events in a fixed-size buffer (it never fills your disk) and can be queried by type, for example to show only anomalies.
BlackBox vs Record/Replay: Record/Replay captures full node state (inputs/outputs) for deterministic replay, which is great for debugging but storage-heavy. The BlackBox captures lightweight events (what happened, not the full data), so it can stay always-on with negligible overhead while remaining crash-safe.
How It Works
The BlackBox is a circular buffer — it keeps the last N events and discards the oldest when full. This means:
- Fixed memory — never grows beyond the configured size
- Always-on — no performance impact (events are tiny structs)
- Crash-safe — data persists even if the process is killed
- No manual instrumentation — the Scheduler records events automatically
Enabling BlackBox
Use the .blackbox(size_mb) builder method to enable the BlackBox:
// simplified
use horus::prelude::*;

// 16MB black box for general production
let mut scheduler = Scheduler::new()
    .blackbox(16);

// 1GB black box for safety-critical systems with watchdog
let mut scheduler = Scheduler::new()
    .watchdog(500_u64.ms())
    .blackbox(1024);

// 100MB black box for hard real-time systems
let mut scheduler = Scheduler::new()
    .blackbox(100);
What Gets Recorded
The BlackBox automatically captures events during scheduler execution:
| Event | Description |
|---|---|
| Scheduler start/stop | When the scheduler begins and ends |
| Node execution | Each node tick with duration and success/failure |
| Node errors | Failed node executions |
| Deadline misses | Nodes that missed their timing deadline |
| Budget violations | Nodes that exceeded their execution time budget |
| Failure policy events | Failure policy state transitions |
| Emergency stops | Safety system activations |
| Custom events | User-defined markers |
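To make the table concrete, the event kinds could be modelled as a Rust enum with a helper that picks out anomalies. This is only a sketch: the `Event` type, its field names, and the `is_anomaly` and `anomalies` helpers are hypothetical names chosen to mirror the table, and the anomaly rule follows the CLI's description (errors, deadline misses, e-stops). They are not the HORUS definitions.

```rust
use std::time::Duration;

/// Hypothetical event type mirroring the table above (not the HORUS type).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    SchedulerStart,
    SchedulerStop,
    NodeTick { node: String, duration: Duration, success: bool },
    NodeError { node: String, error: String },
    DeadlineMiss { node: String, deadline: Duration, actual: Duration },
    BudgetViolation { node: String, budget: Duration, actual: Duration },
    FailurePolicy { node: String, transition: String },
    EmergencyStop { reason: String },
    Custom { label: String },
}

impl Event {
    /// Assumed anomaly rule: node errors, deadline misses, and emergency stops.
    fn is_anomaly(&self) -> bool {
        matches!(
            self,
            Event::NodeError { .. } | Event::DeadlineMiss { .. } | Event::EmergencyStop { .. }
        )
    }
}

/// Returns references to the anomalous events, preserving their order.
fn anomalies(events: &[Event]) -> Vec<&Event> {
    events.iter().filter(|e| e.is_anomaly()).collect()
}
```

Because each event is a typed value rather than a formatted string, filtering by kind is a pattern match instead of a text search.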
Post-Mortem Debugging
After a failure, the BlackBox contains the sequence of events leading up to it. Inspect via the CLI (works for both Rust and Python projects) or programmatically in Rust:
# CLI — works for any HORUS project (Rust or Python)
horus blackbox --anomalies
horus blackbox --json
horus blackbox show --filter errors
horus blackbox show --last 100
Programmatic access
// simplified
use horus::prelude::*;

let mut scheduler = Scheduler::new()
    .blackbox(16);

// ... application runs ...

// Programmatic access after scheduler.run() returns:
if let Some(bb) = scheduler.get_blackbox() {
    let anomalies = bb.lock().expect("blackbox lock").anomalies();
    println!("=== ANOMALIES ({}) ===", anomalies.len());
    for record in &anomalies {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}
Circular Buffer Behavior
The BlackBox uses a fixed-size circular buffer. When full, the oldest events are discarded:
Buffer capacity: 50,000 records (10MB)
Event 1 → [1, _, _, _, _] New events fill the buffer
Event 2 → [1, 2, _, _, _]
...
Event N → [1, 2, ..., N-1, N] Buffer full
Event N+1 → [2, 3, ..., N, N+1] Oldest dropped
This ensures bounded memory usage while keeping the most recent events for debugging.
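The drop-oldest behavior in the diagram can be sketched with a small generic ring buffer. This is an illustration of the technique, not the BlackBox implementation: `RingBuffer` and its methods are hypothetical names, and it is built on the standard library's `VecDeque` for brevity.

```rust
use std::collections::VecDeque;

/// Minimal fixed-capacity ring buffer: pushing into a full buffer
/// evicts the oldest item. Illustrative only.
struct RingBuffer<T> {
    capacity: usize,
    items: VecDeque<T>,
}

impl<T> RingBuffer<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { capacity, items: VecDeque::with_capacity(capacity) }
    }

    /// Appends `item` and returns the evicted oldest item if the buffer was full.
    fn push(&mut self, item: T) -> Option<T> {
        let evicted = if self.items.len() == self.capacity {
            self.items.pop_front()
        } else {
            None
        };
        self.items.push_back(item);
        evicted
    }

    /// Iterates from the oldest retained item to the newest.
    fn iter(&self) -> std::collections::vec_deque::Iter<'_, T> {
        self.items.iter()
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}
```

With a capacity of 5, pushing events 1 through 5 fills the buffer, and pushing event 6 evicts event 1, leaving 2 through 6.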
Recommended Buffer Sizes
| Use Case | Configuration | Buffer Size |
|---|---|---|
| Development | .blackbox(16) | 16 MB |
| Long-running production | .blackbox(100) | 100 MB |
| Safety-critical | .blackbox(1024) | 1 GB |
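When choosing a size, it can help to estimate how much history a buffer retains. The sketch below rests on two assumptions that are not confirmed by HORUS: an average record size of 200 bytes, derived from the "50,000 records (10MB)" figure above using decimal megabytes, and a steady event rate that you supply. The function names are hypothetical.

```rust
/// Assumed average record size: 10 MB / 50,000 records (decimal megabytes).
const ASSUMED_BYTES_PER_RECORD: u64 = 200;

/// Approximate number of records that fit in a buffer of `size_mb` megabytes.
fn estimated_records(size_mb: u64) -> u64 {
    size_mb * 1_000_000 / ASSUMED_BYTES_PER_RECORD
}

/// Approximate seconds of history retained at a steady `events_per_sec`.
fn retention_secs(size_mb: u64, events_per_sec: u64) -> f64 {
    estimated_records(size_mb) as f64 / events_per_sec as f64
}
```

For example, if 4 nodes each tick at 500 Hz and every tick produces one record, that is about 2,000 events per second. Under these assumptions `retention_secs(16, 2000)` gives 40 seconds of history and `retention_secs(1024, 2000)` gives 2,560 seconds. Measure your actual event rate before relying on such numbers.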
CLI Usage
Inspect the BlackBox from the command line:
# View all events
horus blackbox
# View anomalies only (errors, deadline misses, e-stops)
horus blackbox --anomalies
# Follow in real-time (like tail -f)
horus blackbox --follow
# Filter by node
horus blackbox --node motor_ctrl
# Filter by event type
horus blackbox --event DeadlineMiss
# JSON output for scripts/dashboards
horus blackbox --json
Debugging Walkthrough: "My Robot Crashed Overnight"
Scenario: Your mobile robot stopped moving during an overnight warehouse test. The process restarted, and the logs from the crashed run are gone.
Step 1: Check the BlackBox
horus blackbox --anomalies
Step 2: Read the timeline
[03:17:01.001] SchedulerStart { nodes: 4, rate: 500Hz }
[03:17:01.500] NodeTick { name: "planner", duration_us: 2100, success: true }
[03:17:01.502] DeadlineMiss { name: "collision_checker", deadline_us: 1900, actual_us: 4200 }
[03:17:01.503] DeadlineMiss { name: "collision_checker", deadline_us: 1900, actual_us: 5100 }
[03:17:01.504] NodeError { name: "arm_controller", error: "joint limit exceeded" }
[03:17:01.504] EmergencyStop { reason: "deadline miss threshold exceeded" }
Step 3: Diagnose
The collision checker started missing its 1.9ms deadline (taking 4-5ms instead). During that time, the planner sent a trajectory that would have been rejected, but the check arrived too late. The arm exceeded its joint limits.
Step 4: Fix
- Tighten the collision checker's budget: .budget(1500_u64.us())
- Or add a safety interlock: hold trajectory execution until collision check completes
- Or move collision checking to the same RT thread as the arm controller
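The safety-interlock option can be sketched as a small gate that holds a trajectory until a collision check for that same trajectory reports clear. Everything here is hypothetical and independent of HORUS: `TrajectoryInterlock`, `CheckResult`, and the use of a numeric trajectory id are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckResult {
    Clear,
    Collision,
}

/// Holds one trajectory until a collision check for that same trajectory arrives.
struct TrajectoryInterlock {
    /// Id of the trajectory currently waiting for its check, if any.
    pending: Option<u64>,
}

impl TrajectoryInterlock {
    fn new() -> Self {
        Self { pending: None }
    }

    /// The planner submits a trajectory; it is held, not executed.
    fn submit(&mut self, trajectory_id: u64) {
        self.pending = Some(trajectory_id);
    }

    /// Called when a collision check finishes. Returns the trajectory id to
    /// execute only if the check matches the held trajectory and is clear.
    fn on_check(&mut self, trajectory_id: u64, result: CheckResult) -> Option<u64> {
        if self.pending != Some(trajectory_id) {
            return None; // stale or unknown check: keep holding
        }
        self.pending = None;
        match result {
            CheckResult::Clear => Some(trajectory_id),
            CheckResult::Collision => None, // rejected: nothing is released
        }
    }
}
```

With this gate, a late collision check delays motion instead of letting an unchecked trajectory reach the arm, which trades latency for safety.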
BlackBox vs Other Debugging Tools
| Tool | What it captures | Storage | When to use |
|---|---|---|---|
| BlackBox | Scheduler events (lightweight) | Fixed ring buffer (16-1024 MB) | Always-on crash forensics |
| Record/Replay | Full node state (inputs/outputs) | Grows with time | Reproduce specific bugs |
| horus log | Text log messages | Grows with time | Verbose debugging |
| horus monitor | Live system state | None (real-time only) | Active debugging |
Design Decisions
Why a ring buffer instead of a log file?
A log file grows without bound and eventually fills the disk. A ring buffer has fixed, predictable memory usage. For a robot running 24/7 in a warehouse, you cannot afford to run out of disk space. The ring buffer keeps the most recent events and silently discards the oldest.
Why structured events instead of text logs?
Structured events can be filtered by type (--event DeadlineMiss), queried by node (--node motor_ctrl), and exported to JSON for dashboards. Text logs require regex parsing and are fragile. Structured events also have lower overhead -- no string formatting during the hot path.
Why automatic recording instead of manual instrumentation?
The scheduler knows when every node ticks, when deadlines are missed, and when failures occur. Requiring developers to manually add recording calls would lead to incomplete data. The BlackBox captures everything the scheduler sees, with zero code changes.
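A minimal sketch of why no node-side code is needed: the loop that calls a node can time the call and record the outcome itself. The `Node` trait, `Recorded` enum, and `run_tick` function below are hypothetical stand-ins, not HORUS APIs, and a plain `Vec` stands in for the recorder.

```rust
use std::time::{Duration, Instant};

/// Hypothetical record of what the scheduler observed.
#[derive(Debug, PartialEq)]
enum Recorded {
    Tick { node: &'static str, success: bool },
    DeadlineMiss { node: &'static str },
}

/// Hypothetical node interface; node code contains no recording calls.
trait Node {
    fn name(&self) -> &'static str;
    fn tick(&mut self) -> Result<(), String>;
}

/// Runs one tick and appends the events the scheduler can observe on its own:
/// the tick outcome, plus a deadline miss if the tick took too long.
fn run_tick(node: &mut dyn Node, deadline: Duration, log: &mut Vec<Recorded>) {
    let start = Instant::now();
    let result = node.tick();
    let elapsed = start.elapsed();

    log.push(Recorded::Tick { node: node.name(), success: result.is_ok() });
    if elapsed > deadline {
        log.push(Recorded::DeadlineMiss { node: node.name() });
    }
}
```

Any type implementing `Node` is recorded the same way, so coverage does not depend on each developer remembering to add instrumentation.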
Trade-offs
| Gain | Cost |
|---|---|
| Fixed memory -- never fills disk | Oldest events are lost when buffer is full |
| Always-on with negligible overhead | Only captures scheduler events, not application-level data |
| Crash-safe (survives process kill) | Requires post-mortem inspection (not real-time alerting) |
| No code changes required | Cannot record application data payloads, only lightweight event markers (use Record/Replay for full data) |
Common Errors
| Symptom | Cause | Fix |
|---|---|---|
| horus blackbox shows no events | BlackBox not enabled | Add .blackbox(16) to the scheduler builder |
| BlackBox missing events from crash | Buffer too small, events were overwritten | Increase buffer size: .blackbox(100) or .blackbox(1024) |
| Cannot read BlackBox after process restart | BlackBox data is in-memory, not persisted to disk by default | Use horus blackbox --follow during the run, or configure filesystem persistence |
| horus blackbox --anomalies shows nothing | No anomalies occurred (all nodes ran within budget) | This is normal. Use horus blackbox without filters to see all events |
| High memory usage | Buffer size too large for the system | Reduce .blackbox(size_mb) to match available memory |
See Also
- Safety Monitor — Deadline enforcement and watchdog events recorded by the BlackBox
- Fault Tolerance — Failure policies whose state transitions are recorded
- Record & Replay — Full node state recording for deterministic replay
- Scheduler Configuration -- .blackbox() builder method