BlackBox Flight Recorder

Your robot crashed overnight and the logs are gone. You need a crash-safe, always-on event recorder that captures the sequence of events leading up to any failure. The BlackBox is that recorder.

When To Use This

Situation	Use BlackBox?
Robot runs unattended (production, field tests)	Yes -- you need crash forensics
Safety-critical system (motors, arms, drones)	Yes -- every deadline miss is recorded
Development with debugger attached	Optional -- you can inspect directly
Short test runs (under 5 minutes)	Optional -- logs are usually sufficient
Overnight regression testing	Yes -- find intermittent failures

Use Record & Replay instead if you need full node state (inputs/outputs) for deterministic replay. The BlackBox captures lightweight events (what happened), not full data.

Prerequisites

Familiarity with Scheduler Configuration -- especially the .blackbox() builder method
Understanding of Safety Monitor for interpreting deadline misses and watchdog events

Understanding the BlackBox

BlackBox vs Logging: Logs are free-form text, grow without bound, and require parsing. The BlackBox stores structured events in a fixed-size buffer (it never fills your disk) and can be queried by type (for example, show only anomalies).

BlackBox vs Record/Replay: Record/Replay captures full node state (inputs/outputs) for deterministic replay, which is great for debugging but storage-heavy. The BlackBox captures lightweight events (what happened, not the full data), so it can stay always-on with low overhead and bounded storage, and it is crash-safe.

How It Works

The BlackBox is a circular buffer — it keeps the last N events and discards the oldest when full. This means:

Fixed memory — never grows beyond the configured size
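Always-on -- recording cost is small (events are tiny structs), so it can stay enabled in production
Crash-safe -- events are meant to remain readable after the process is killed (persistence across restarts depends on configuration; see Common Errors)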
No manual instrumentation — the Scheduler records events automatically
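
The drop-oldest behavior described above can be sketched in a few lines of Rust. This is a minimal illustration, not the HORUS implementation: `Record`, `RingBuffer`, and their fields are hypothetical names, and the sketch covers only the fixed-capacity eviction (not crash safety or automatic recording).

```rust
use std::collections::VecDeque;

/// Hypothetical fixed-size event record; the real BlackBox layout is not
/// shown in this document.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Record {
    pub tick: u64,
    pub code: u32,
}

/// Fixed-capacity ring buffer that keeps only the newest `capacity` records.
pub struct RingBuffer {
    capacity: usize,
    records: VecDeque<Record>,
}

impl RingBuffer {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self {
            capacity,
            records: VecDeque::with_capacity(capacity),
        }
    }

    /// Append a record, evicting the oldest one first if the buffer is full.
    pub fn push(&mut self, record: Record) {
        if self.records.len() == self.capacity {
            self.records.pop_front();
        }
        self.records.push_back(record);
    }

    /// Iterate over the retained records from oldest to newest.
    pub fn iter(&self) -> impl Iterator<Item = &Record> {
        self.records.iter()
    }

    /// Number of records currently retained (never more than `capacity`).
    pub fn len(&self) -> usize {
        self.records.len()
    }
}
```

Pushing five records into a buffer of capacity three leaves only the last three, which is the property that keeps memory bounded.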

Enabling BlackBox

Use the .blackbox(size_mb) builder method to enable the BlackBox:

// simplified
use horus::prelude::*;

// Pick one configuration; sizes follow the Recommended Buffer Sizes table.

// 16 MB black box for development
let mut scheduler = Scheduler::new()
    .blackbox(16);

// 100 MB black box for long-running production
let mut scheduler = Scheduler::new()
    .blackbox(100);

// 1 GB black box for safety-critical systems, with a 500 ms watchdog
let mut scheduler = Scheduler::new()
    .watchdog(500_u64.ms())
    .blackbox(1024);

What Gets Recorded

The BlackBox automatically captures events during scheduler execution:

Event	Description
Scheduler start/stop	When the scheduler begins and ends
Node execution	Each node tick with duration and success/failure
Node errors	Failed node executions
Deadline misses	Nodes that missed their timing deadline
Budget violations	Nodes that exceeded their execution time budget
Failure policy events	Failure policy state transitions
Emergency stops	Safety system activations
Custom events	User-defined markers
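
To make the table concrete, the sketch below models a subset of these events as a Rust enum and selects only the anomalies. The `Event` type, its variants, and their fields are assumptions modeled on the event names used in this document, not the actual HORUS definitions. The anomaly rule (errors, deadline misses, emergency stops) follows the description of `--anomalies` in the CLI section.

```rust
/// Illustrative event type; a hypothetical stand-in for the real HORUS events.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    SchedulerStart { nodes: u32 },
    SchedulerStop,
    NodeTick { name: &'static str, duration_us: u64, success: bool },
    NodeError { name: &'static str, error: &'static str },
    DeadlineMiss { name: &'static str, deadline_us: u64, actual_us: u64 },
    EmergencyStop { reason: &'static str },
    Custom { label: &'static str },
}

impl Event {
    /// Assumed anomaly rule: node errors, deadline misses, and emergency stops.
    pub fn is_anomaly(&self) -> bool {
        matches!(
            self,
            Event::NodeError { .. } | Event::DeadlineMiss { .. } | Event::EmergencyStop { .. }
        )
    }
}

/// Return references to the anomalous events, preserving their order.
pub fn anomalies(events: &[Event]) -> Vec<&Event> {
    events.iter().filter(|e| e.is_anomaly()).collect()
}
```

Because each event is a typed value rather than a line of text, filtering is a pattern match instead of a regex.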

Post-Mortem Debugging

After a failure, the BlackBox contains the sequence of events leading up to it. Inspect via the CLI (works for both Rust and Python projects) or programmatically in Rust:

# CLI — works for any HORUS project (Rust or Python)
horus blackbox --anomalies
horus blackbox --json
horus blackbox show --filter errors
horus blackbox show --last 100

Programmatic access

// simplified
use horus::prelude::*;

let mut scheduler = Scheduler::new()
    .blackbox(16);

// ... application runs ...

// Programmatic access after scheduler.run() returns:
if let Some(bb) = scheduler.get_blackbox() {
    let anomalies = bb.lock().expect("blackbox lock").anomalies();
    println!("=== ANOMALIES ({}) ===", anomalies.len());
    for record in &anomalies {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}

Circular Buffer Behavior

The BlackBox uses a fixed-size circular buffer. When full, the oldest events are discarded:

Buffer capacity: 50,000 records (10MB)

Event 1 → [1, _, _, _, _]     New events fill the buffer
Event 2 → [1, 2, _, _, _]
...
Event N → [1, 2, ..., N-1, N]  Buffer full
Event N+1 → [2, 3, ..., N, N+1]  Oldest dropped

This ensures bounded memory usage while keeping the most recent events for debugging.
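
A rough way to choose a size is to estimate how much history a buffer retains. The sketch below assumes about 200 bytes per record (derived from the "50,000 records (10MB)" figure above, treating 1 MB as 1,000,000 bytes) and one record per node tick. Both are assumptions for illustration; the real record size and event rate may differ.

```rust
/// Assumed record size, derived from "50,000 records (10MB)": 200 bytes.
pub const ASSUMED_BYTES_PER_RECORD: u64 = 10_000_000 / 50_000;

/// Approximate number of records that fit in a buffer of `size_mb` megabytes.
pub fn record_capacity(size_mb: u64) -> u64 {
    size_mb * 1_000_000 / ASSUMED_BYTES_PER_RECORD
}

/// Seconds of history retained, assuming one record per node tick.
/// Returns `None` when the event rate is zero.
pub fn retention_secs(size_mb: u64, nodes: u64, rate_hz: u64) -> Option<u64> {
    let events_per_sec = nodes * rate_hz;
    record_capacity(size_mb).checked_div(events_per_sec)
}
```

Under these assumptions, a 16 MB buffer on a 4-node, 500 Hz system holds about 40 seconds of history, and a 1 GB buffer about 2,560 seconds (roughly 43 minutes). If the events you need might be older than that, size the buffer accordingly.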

Recommended Buffer Sizes

Use Case	Configuration	Buffer Size
Development	`.blackbox(16)`	16 MB
Long-running production	`.blackbox(100)`	100 MB
Safety-critical	`.blackbox(1024)`	1 GB

CLI Usage

Inspect the BlackBox from the command line:

# View all events
horus blackbox

# View anomalies only (errors, deadline misses, e-stops)
horus blackbox --anomalies

# Follow in real-time (like tail -f)
horus blackbox --follow

# Filter by node
horus blackbox --node motor_ctrl

# Filter by event type
horus blackbox --event DeadlineMiss

# JSON output for scripts/dashboards
horus blackbox --json

Debugging Walkthrough: "My Robot Crashed Overnight"

Scenario: Your mobile robot stopped moving during an overnight warehouse test. The process restarted, and the original log output is gone.

Step 1: Check the BlackBox

horus blackbox --anomalies

Step 2: Read the timeline

[03:17:01.001] SchedulerStart { nodes: 4, rate: 500Hz }
[03:17:01.500] NodeTick { name: "planner", duration_us: 2100, success: true }
[03:17:01.502] DeadlineMiss { name: "collision_checker", deadline_us: 1900, actual_us: 4200 }
[03:17:01.503] DeadlineMiss { name: "collision_checker", deadline_us: 1900, actual_us: 5100 }
[03:17:01.504] NodeError { name: "arm_controller", error: "joint limit exceeded" }
[03:17:01.504] EmergencyStop { reason: "deadline miss threshold exceeded" }

Step 3: Diagnose

The collision checker started missing its 1.9 ms deadline (taking 4-5 ms instead). During that time, the planner sent a trajectory that would have been rejected, but the check arrived too late. The arm exceeded its joint limits.

Step 4: Fix

Tighten the collision checker's budget so overruns are flagged before the deadline is missed: .budget(1500_u64.us())
Or add a safety interlock: hold trajectory execution until collision check completes
Or move collision checking to the same RT thread as the arm controller
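
The interlock option can be sketched as a small gate between the planner and the arm controller. This is an illustrative design under stated assumptions, not a HORUS API: `TrajectoryGate`, `Verdict`, and the idea of identifying trajectories by a numeric id are all hypothetical.

```rust
/// Collision-check status for the trajectory currently being held.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Verdict {
    Pending,
    Safe,
    Rejected,
}

/// Hypothetical interlock: holds a trajectory until it is verified safe.
pub struct TrajectoryGate {
    pending_id: Option<u64>,
    verdict: Verdict,
}

impl TrajectoryGate {
    pub fn new() -> Self {
        Self { pending_id: None, verdict: Verdict::Pending }
    }

    /// Planner submits a trajectory; it replaces any held one and awaits a check.
    pub fn submit(&mut self, id: u64) {
        self.pending_id = Some(id);
        self.verdict = Verdict::Pending;
    }

    /// Collision checker reports a result; results for other ids are ignored.
    pub fn report(&mut self, id: u64, safe: bool) {
        if self.pending_id == Some(id) {
            self.verdict = if safe { Verdict::Safe } else { Verdict::Rejected };
        }
    }

    /// Arm controller polls each tick; an id is released at most once,
    /// and only after it has been reported safe.
    pub fn release(&mut self) -> Option<u64> {
        if self.verdict == Verdict::Safe {
            self.verdict = Verdict::Pending;
            self.pending_id.take()
        } else {
            None
        }
    }
}
```

With a gate like this, a slow collision check delays motion instead of letting an unchecked trajectory reach the arm.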

BlackBox vs Other Debugging Tools

Tool	What it captures	Storage	When to use
BlackBox	Scheduler events (lightweight)	Fixed ring buffer (16-1024 MB)	Always-on crash forensics
Record/Replay	Full node state (inputs/outputs)	Grows with time	Reproduce specific bugs
horus log	Text log messages	Grows with time	Verbose debugging
horus monitor	Live system state	None (real-time only)	Active debugging

Design Decisions

Why a ring buffer instead of a log file?

A log file grows without bound and eventually fills the disk. A ring buffer has fixed, predictable memory usage. For a robot running 24/7 in a warehouse, you cannot afford to run out of disk space. The ring buffer keeps the most recent events and silently discards the oldest.

Why structured events instead of text logs?

Structured events can be filtered by type (--event DeadlineMiss), queried by node (--node motor_ctrl), and exported to JSON for dashboards. Text logs require regex parsing, which is fragile. Structured events also have lower overhead because no string formatting happens on the hot path.

Why automatic recording instead of manual instrumentation?

The scheduler knows when every node ticks, when deadlines are missed, and when failures occur. Requiring developers to manually add recording calls would lead to incomplete data. The BlackBox captures everything the scheduler sees, with no instrumentation code beyond enabling it in the builder.

Trade-offs

Gain	Cost
Fixed memory -- never fills disk	Oldest events are lost when buffer is full
Always-on with low overhead	Only captures scheduler events, not application-level data
Crash-safe (survives process kill)	Requires post-mortem inspection (not real-time alerting)
No code changes required	Cannot record application payloads, only lightweight event markers (use Record/Replay for full data)

Common Errors

Symptom	Cause	Fix
`horus blackbox` shows no events	BlackBox not enabled	Add `.blackbox(16)` to the scheduler builder
BlackBox missing events from crash	Buffer too small, events were overwritten	Increase buffer size: `.blackbox(100)` or `.blackbox(1024)`
Cannot read BlackBox after process restart	BlackBox data is in-memory, not persisted to disk by default	Use `horus blackbox --follow` during the run, or configure filesystem persistence
`horus blackbox --anomalies` shows nothing	No anomalies occurred (all nodes ran within budget)	This is normal. Use `horus blackbox` without filters to see all events
High memory usage	Buffer size too large for the system	Reduce `.blackbox(size_mb)` to match available memory