BlackBox Flight Recorder

Your robot crashed overnight and the logs are gone. You need a crash-safe, always-on event recorder that captures the sequence of events leading up to any failure. The BlackBox is that recorder.

When To Use This

| Situation | Use BlackBox? |
| --- | --- |
| Robot runs unattended (production, field tests) | Yes -- you need crash forensics |
| Safety-critical system (motors, arms, drones) | Yes -- every deadline miss is recorded |
| Development with debugger attached | Optional -- you can inspect directly |
| Short test runs (under 5 minutes) | Optional -- logs are usually sufficient |
| Overnight regression testing | Yes -- find intermittent failures |

Use Record & Replay instead if you need full node state (inputs/outputs) for deterministic replay. The BlackBox captures lightweight events (what happened), not full data.

Prerequisites

Understanding the BlackBox

BlackBox vs Logging: Logs are text, grow forever, and require parsing. The BlackBox is structured events, fixed-size (never fills your disk), and queryable by type (show only anomalies).

BlackBox vs Record/Replay: Record/Replay captures full node state (inputs/outputs) for deterministic replay -- great for debugging but storage-heavy. The BlackBox captures lightweight events (what happened, not the full data) -- always-on, negligible overhead, crash-safe.

How It Works

The BlackBox is a circular buffer: it keeps the last N events and discards the oldest when full. This means:

  • Fixed memory: never grows beyond the configured size
  • Always-on: negligible performance impact (events are tiny structs)
  • Crash-safe: data persists even if the process is killed
  • No manual instrumentation: the Scheduler records events automatically

Enabling BlackBox

Use the .blackbox(size_mb) builder method to enable the BlackBox:

// simplified
use horus::prelude::*;

// 16MB black box for general production
let mut scheduler = Scheduler::new()
    .blackbox(16);

// 1GB black box for safety-critical systems with watchdog
let mut scheduler = Scheduler::new()
    .watchdog(500_u64.ms())
    .blackbox(1024);

// 100MB black box for hard real-time systems
let mut scheduler = Scheduler::new()
    .blackbox(100);

What Gets Recorded

The BlackBox automatically captures events during scheduler execution:

| Event | Description |
| --- | --- |
| Scheduler start/stop | When the scheduler begins and ends |
| Node execution | Each node tick with duration and success/failure |
| Node errors | Failed node executions |
| Deadline misses | Nodes that missed their timing deadline |
| Budget violations | Nodes that exceeded their execution time budget |
| Failure policy events | Failure policy state transitions |
| Emergency stops | Safety system activations |
| Custom events | User-defined markers |

Post-Mortem Debugging

After a failure, the BlackBox contains the sequence of events leading up to it. Inspect via the CLI (works for both Rust and Python projects) or programmatically in Rust:

# CLI — works for any HORUS project (Rust or Python)
horus blackbox --anomalies
horus blackbox --json
horus blackbox show --filter errors
horus blackbox show --last 100

Programmatic access

// simplified
use horus::prelude::*;

let mut scheduler = Scheduler::new()
    .blackbox(16);

// ... application runs ...

// Programmatic access after scheduler.run() returns:
if let Some(bb) = scheduler.get_blackbox() {
    let anomalies = bb.lock().expect("blackbox lock").anomalies();
    println!("=== ANOMALIES ({}) ===", anomalies.len());
    for record in &anomalies {
        println!("[tick {}] {:?}", record.tick, record.event);
    }
}
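
Once you have the anomaly list, the usual next question is what happened just before the first one. The sketch below shows one way to slice out that lead-up window. It uses a stand-in `Record` struct with a hypothetical `label` and `anomaly` flag rather than the real HORUS record type, so treat the field names (other than `tick`) as assumptions.

```rust
/// Stand-in for a BlackBox record. Only `tick` mirrors the snippet above;
/// `label` and `anomaly` are hypothetical fields for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    tick: u64,
    label: &'static str,
    anomaly: bool,
}

/// Return the first anomaly together with up to `context` records before it.
/// Returns `None` when the slice contains no anomaly.
fn lead_up(records: &[Record], context: usize) -> Option<&[Record]> {
    let first = records.iter().position(|r| r.anomaly)?;
    let start = first.saturating_sub(context);
    Some(&records[start..=first])
}

fn main() {
    let records = vec![
        Record { tick: 250, label: "NodeTick planner", anomaly: false },
        Record { tick: 250, label: "NodeTick arm_controller", anomaly: false },
        Record { tick: 251, label: "DeadlineMiss collision_checker", anomaly: true },
        Record { tick: 252, label: "NodeError arm_controller", anomaly: true },
    ];

    match lead_up(&records, 2) {
        Some(window) => {
            for r in window {
                println!("[tick {}] {}", r.tick, r.label);
            }
        }
        None => println!("no anomalies recorded"),
    }
}
```

Looking at the first anomaly rather than the last is deliberate: later errors are often consequences of the first one.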

Circular Buffer Behavior

The BlackBox uses a fixed-size circular buffer. When full, the oldest events are discarded:

Buffer capacity: 50,000 records (10MB)

Event 1 → [1, _, _, _, _]     New events fill the buffer
Event 2 → [1, 2, _, _, _]
...
Event N → [1, 2, ..., N-1, N]  Buffer full
Event N+1 → [2, 3, ..., N, N+1]  Oldest dropped

This ensures bounded memory usage while keeping the most recent events for debugging.
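
To make the drop-oldest behavior concrete, here is a minimal ring buffer sketch built on the standard library's `VecDeque`. It illustrates the technique only; the `RingBuffer` type and its methods are hypothetical and are not the HORUS implementation.

```rust
use std::collections::VecDeque;

/// A fixed-capacity buffer that drops the oldest item when full.
/// Illustrative only: this is not the HORUS BlackBox implementation.
struct RingBuffer<T> {
    items: VecDeque<T>,
    capacity: usize,
}

impl<T: Clone> RingBuffer<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        Self { items: VecDeque::with_capacity(capacity), capacity }
    }

    /// Append an item. If the buffer is already full, the oldest item
    /// is removed first and returned.
    fn push(&mut self, item: T) -> Option<T> {
        let evicted = if self.items.len() == self.capacity {
            self.items.pop_front()
        } else {
            None
        };
        self.items.push_back(item);
        evicted
    }

    /// Copy out the contents, oldest first.
    fn snapshot(&self) -> Vec<T> {
        self.items.iter().cloned().collect()
    }
}

fn main() {
    let mut buf = RingBuffer::new(5);
    for event_id in 1..=7 {
        if let Some(old) = buf.push(event_id) {
            println!("event {event_id} pushed, event {old} dropped");
        }
    }
    // Events 1 and 2 were dropped; 3 through 7 remain.
    println!("buffer now holds {:?}", buf.snapshot());
}
```

The number of stored items never exceeds `capacity`, which is the property that keeps memory bounded.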

| Use Case | Configuration | Buffer Size |
| --- | --- | --- |
| Development | .blackbox(16) | 16 MB |
| Long-running production | .blackbox(100) | 100 MB |
| Safety-critical | .blackbox(1024) | 1 GB |
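
Which size you need depends on how far back you want to see. The back-of-the-envelope sketch below estimates the retention window for each size. It rests on two assumptions that are not documented HORUS constants: roughly 200 bytes per record (derived from the "50,000 records (10MB)" figure above, which may only be illustrative) and one recorded event per node tick. The `retention_seconds` helper is hypothetical.

```rust
/// Estimate how many seconds of history a buffer holds before the
/// oldest events start being overwritten.
///
/// `bytes_per_record` is an assumption, not a documented HORUS constant.
fn retention_seconds(size_mb: u64, bytes_per_record: u64, events_per_sec: u64) -> f64 {
    let capacity_records = size_mb * 1_000_000 / bytes_per_record;
    capacity_records as f64 / events_per_sec as f64
}

fn main() {
    // Assumed: about 200 bytes per record (10 MB / 50,000 records).
    let bytes_per_record = 200;
    // Assumed workload: 4 nodes ticking at 500 Hz, one event per tick.
    let events_per_sec = 4 * 500;

    for size_mb in [16, 100, 1024] {
        let secs = retention_seconds(size_mb, bytes_per_record, events_per_sec);
        println!("{size_mb:>5} MB -> about {secs:.0} s of history");
    }
}
```

Under these assumptions, 16 MB holds about 40 seconds of history, 100 MB about 250 seconds, and 1 GB about 2,560 seconds (roughly 43 minutes). If the process keeps running after a fault, evidence older than that window is overwritten, so measure your real event rate before settling on a size.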

CLI Usage

Inspect the BlackBox from the command line:

# View all events
horus blackbox

# View anomalies only (errors, deadline misses, e-stops)
horus blackbox --anomalies

# Follow in real-time (like tail -f)
horus blackbox --follow

# Filter by node
horus blackbox --node motor_ctrl

# Filter by event type
horus blackbox --event DeadlineMiss

# JSON output for scripts/dashboards
horus blackbox --json

Debugging Walkthrough: "My Robot Crashed Overnight"

Scenario: Your mobile robot stopped moving during an overnight warehouse test. The process restarted, and the logs from the original crash are gone.

Step 1: Check the BlackBox

horus blackbox --anomalies

Step 2: Read the timeline

[03:17:01.001] SchedulerStart { nodes: 4, rate: 500Hz }
[03:17:01.500] NodeTick { name: "planner", duration_us: 2100, success: true }
[03:17:01.502] DeadlineMiss { name: "collision_checker", deadline_us: 1900, actual_us: 4200 }
[03:17:01.503] DeadlineMiss { name: "collision_checker", deadline_us: 1900, actual_us: 5100 }
[03:17:01.504] NodeError { name: "arm_controller", error: "joint limit exceeded" }
[03:17:01.504] EmergencyStop { reason: "deadline miss threshold exceeded" }

Step 3: Diagnose

The collision checker started missing its 1.9ms deadline (taking 4-5ms instead). During that time, the planner sent a trajectory that would have been rejected, but the check arrived too late. The arm exceeded its joint limits.

Step 4: Fix

  • Tighten the collision checker's budget: .budget(1500_u64.us())
  • Or add a safety interlock: hold trajectory execution until collision check completes
  • Or move collision checking to the same RT thread as the arm controller
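
The safety interlock option can be sketched as a small gate between the collision checker and the arm controller. Everything here (`Interlock`, `Verdict`, `Action`, and the trajectory IDs) is hypothetical and independent of the HORUS API. It only shows the idea that a trajectory without a completed check is held instead of executed.

```rust
use std::collections::HashMap;

/// Outcome of a collision check for one trajectory (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict {
    Clear,
    Collision,
}

/// What the arm controller should do with a trajectory on this tick.
#[derive(Debug, PartialEq)]
enum Action {
    Execute,
    Hold,
    Reject,
}

/// Gate that releases a trajectory only after its check reported `Clear`.
#[derive(Default)]
struct Interlock {
    verdicts: HashMap<u64, Verdict>,
}

impl Interlock {
    /// Called by the collision checker when it finishes, however late.
    fn report(&mut self, trajectory_id: u64, verdict: Verdict) {
        self.verdicts.insert(trajectory_id, verdict);
    }

    /// Called by the arm controller before it moves.
    fn decide(&self, trajectory_id: u64) -> Action {
        match self.verdicts.get(&trajectory_id) {
            Some(Verdict::Clear) => Action::Execute,
            Some(Verdict::Collision) => Action::Reject,
            None => Action::Hold, // check still running: do not move
        }
    }
}

fn main() {
    let mut interlock = Interlock::default();

    // Trajectory 42 is published, but the checker has not finished yet.
    println!("trajectory 42: {:?}", interlock.decide(42));

    // The late check completes and finds a collision.
    interlock.report(42, Verdict::Collision);
    println!("trajectory 42: {:?}", interlock.decide(42));

    // Trajectory 43 passes its check and is released.
    interlock.report(43, Verdict::Clear);
    println!("trajectory 43: {:?}", interlock.decide(43));
}
```

With a gate like this, a slow collision check delays motion instead of letting an unchecked trajectory through. The deadline misses would still show up in the BlackBox, but without the joint-limit error.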

BlackBox vs Other Debugging Tools

| Tool | What it captures | Storage | When to use |
| --- | --- | --- | --- |
| BlackBox | Scheduler events (lightweight) | Fixed ring buffer (16-1024 MB) | Always-on crash forensics |
| Record/Replay | Full node state (inputs/outputs) | Grows with time | Reproduce specific bugs |
| horus log | Text log messages | Grows with time | Verbose debugging |
| horus monitor | Live system state | None (real-time only) | Active debugging |

Design Decisions

Why a ring buffer instead of a log file?

A log file grows without bound and eventually fills the disk. A ring buffer has fixed, predictable memory usage. For a robot running 24/7 in a warehouse, you cannot afford to run out of disk space. The ring buffer keeps the most recent events and silently discards the oldest.

Why structured events instead of text logs?

Structured events can be filtered by type (--event DeadlineMiss), queried by node (--node motor_ctrl), and exported to JSON for dashboards. Text logs require regex parsing and are fragile. Structured events also have lower overhead -- no string formatting during the hot path.
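
As a rough illustration of why structured events are easier to query, the sketch below models a few events as a Rust enum and filters them with pattern matching. The `Event` enum is hypothetical. Its variant and field names are borrowed from the example timeline on this page, and the real HORUS types may differ.

```rust
/// Hypothetical event type modelled on the event names used on this page.
/// The real HORUS types may differ.
#[allow(dead_code)] // some fields are only shown via Debug in this sketch
#[derive(Debug, Clone, PartialEq)]
enum Event {
    NodeTick { name: String, duration_us: u64, success: bool },
    DeadlineMiss { name: String, deadline_us: u64, actual_us: u64 },
    NodeError { name: String, error: String },
    EmergencyStop { reason: String },
}

impl Event {
    /// Anomalies are errors, deadline misses, and emergency stops.
    fn is_anomaly(&self) -> bool {
        matches!(
            self,
            Event::DeadlineMiss { .. } | Event::NodeError { .. } | Event::EmergencyStop { .. }
        )
    }

    /// The node an event belongs to, if it is tied to one.
    fn node(&self) -> Option<&str> {
        match self {
            Event::NodeTick { name, .. }
            | Event::DeadlineMiss { name, .. }
            | Event::NodeError { name, .. } => Some(name.as_str()),
            Event::EmergencyStop { .. } => None,
        }
    }
}

/// A few events taken from the walkthrough timeline.
fn sample_events() -> Vec<Event> {
    vec![
        Event::NodeTick { name: "planner".into(), duration_us: 2100, success: true },
        Event::DeadlineMiss { name: "collision_checker".into(), deadline_us: 1900, actual_us: 4200 },
        Event::NodeError { name: "arm_controller".into(), error: "joint limit exceeded".into() },
        Event::EmergencyStop { reason: "deadline miss threshold exceeded".into() },
    ]
}

fn main() {
    let events = sample_events();

    // Filter by type: a pattern match, with no regex over log text.
    let misses = events.iter().filter(|e| matches!(e, Event::DeadlineMiss { .. })).count();
    // Filter by node.
    let arm: Vec<&Event> = events.iter().filter(|e| e.node() == Some("arm_controller")).collect();

    println!("deadline misses: {misses}");
    println!("arm_controller events: {arm:?}");
    println!("anomalies: {}", events.iter().filter(|e| e.is_anomaly()).count());
}
```

Each filter here is a comparison on typed fields, so a change in how an event is printed cannot break a query.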

Why automatic recording instead of manual instrumentation?

The scheduler knows when every node ticks, when deadlines are missed, and when failures occur. Requiring developers to manually add recording calls would lead to incomplete data. The BlackBox captures everything the scheduler sees, with zero code changes.

Trade-offs

| Gain | Cost |
| --- | --- |
| Fixed memory -- never fills disk | Oldest events are lost when buffer is full |
| Always-on with negligible overhead | Only captures scheduler events, not application-level data |
| Crash-safe (survives process kill) | Requires post-mortem inspection (not real-time alerting) |
| No code changes required | Custom events are lightweight markers only; full application data is not recorded (use Record/Replay for that) |

Common Errors

| Symptom | Cause | Fix |
| --- | --- | --- |
| horus blackbox shows no events | BlackBox not enabled | Add .blackbox(16) to the scheduler builder |
| BlackBox missing events from crash | Buffer too small, events were overwritten | Increase buffer size: .blackbox(100) or .blackbox(1024) |
| Cannot read BlackBox after process restart | BlackBox data is in-memory, not persisted to disk by default | Use horus blackbox --follow during the run, or configure filesystem persistence |
| horus blackbox --anomalies shows nothing | No anomalies occurred (all nodes ran within budget) | This is normal. Use horus blackbox without filters to see all events |
| High memory usage | Buffer size too large for the system | Reduce .blackbox(size_mb) to match available memory |

See Also