Debugging Workflows

Three concrete workflows for the most common issues: deadline misses, panics, and performance problems.

Workflow 1: "My Motor Stutters"

Stuttering usually means deadline misses — the control loop is not completing within its budget.

Step 1: Check Scheduler Output

Enable monitoring and look for deadline miss warnings in stderr:

let mut scheduler = Scheduler::new()
    .verbose(true)
    .tick_rate(1000_u64.hz());

The scheduler prints a timing report on shutdown. Look for lines like:

[WARN] motor_ctrl: 12 deadline misses (worst: 2.3ms, budget: 1.0ms)

Step 2: Profile Tick Timing

Use profile() to get percentile statistics:

let report = scheduler.profile(5000)?;
println!("{report}");

// Check per-node budget utilization
for node in &report.nodes {
    if let Some(used) = node.budget_used {
        if used > 0.8 {
            println!("WARNING: {} using {:.0}% of budget", node.name, used * 100.0);
        }
    }
}

If p99 exceeds the budget, the node has latency spikes. If p99 is much higher than the median, the node's execution time is inconsistent; look for work that only runs on some ticks, such as conditional branches or contended locks.
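The percentile math behind these numbers is simple. A minimal sketch, independent of the library (which already reports median and p99 via profile()):

```rust
use std::time::Duration;

// Nearest-rank percentile over a sorted slice of recorded tick durations.
// Illustrative only; profile() presumably computes something similar.
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    let idx = ((sorted.len() as f64 - 1.0) * p).round() as usize;
    sorted[idx]
}
```

Record each tick's duration, sort the list, then compare percentile(&ticks, 0.99) against the node's budget.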

Step 3: Use Blackbox to Find the Exact Tick

Enable the blackbox to record the last N ticks per node:

let mut scheduler = Scheduler::new()
    .verbose(true)
    .with_blackbox(64)
    .tick_rate(1000_u64.hz());

After a miss, inspect the blackbox to find what happened on the tick that exceeded the budget. The blackbox records tick duration, input values, and events.
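A blackbox of this kind is essentially a bounded ring buffer. A sketch of the idea in plain Rust; the TickRecord fields and Blackbox type here are illustrative stand-ins, not the library's actual internals:

```rust
use std::collections::VecDeque;
use std::time::Duration;

#[derive(Debug, Clone)]
struct TickRecord {
    tick: u64,
    duration: Duration,
}

struct Blackbox {
    capacity: usize,
    records: VecDeque<TickRecord>,
}

impl Blackbox {
    fn new(capacity: usize) -> Self {
        Self { capacity, records: VecDeque::with_capacity(capacity) }
    }

    // Keep only the last `capacity` ticks: drop the oldest on overflow.
    fn record(&mut self, rec: TickRecord) {
        if self.records.len() == self.capacity {
            self.records.pop_front();
        }
        self.records.push_back(rec);
    }

    // Find the retained ticks that exceeded the budget.
    fn over_budget(&self, budget: Duration) -> Vec<&TickRecord> {
        self.records.iter().filter(|r| r.duration > budget).collect()
    }
}
```

With with_blackbox(64), only the most recent 64 ticks per node are retained, so inspect the blackbox soon after the miss.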

Step 4: Fix Common Causes

Cause | Symptom | Fix
Allocation in tick() | Sporadic spikes | Pre-allocate buffers in init()
Blocking I/O | Consistent high latency | Move to an .async_io() node
Lock contention | Spikes correlated with other nodes | Use try_lock() or lock-free channels
Large computation | Always near budget | Move to .compute() with a longer budget

The allocation case is the most common; compare:
// Bad: allocating every tick
fn tick(&mut self) {
    let data: Vec<f64> = self.sensor.read_all(); // allocates
    self.process(&data);
}

// Good: pre-allocate, reuse buffer
fn init(&mut self) {
    self.buffer = vec![0.0; 128]; // allocate once
}
fn tick(&mut self) {
    self.sensor.read_into(&mut self.buffer); // reuse
    self.process(&self.buffer);
}

Workflow 2: "My Node Panicked"

A node panic is caught by the scheduler. The node is marked Unhealthy and on_error() is called.
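The catch-and-mark behavior can be sketched in plain Rust. Health and NodeSlot below are hypothetical stand-ins for the scheduler's internals, shown only to illustrate the mechanism:

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

#[derive(Debug, PartialEq)]
enum Health { Healthy, Unhealthy }

struct NodeSlot {
    health: Health,
}

impl NodeSlot {
    // Run one tick; if the node panics, contain the unwind and
    // mark the node unhealthy instead of crashing the scheduler.
    fn run_tick(&mut self, tick: impl FnOnce()) {
        if catch_unwind(AssertUnwindSafe(tick)).is_err() {
            self.health = Health::Unhealthy;
            // a real scheduler would invoke on_error() here
        }
    }
}
```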

Step 1: Check on_error() Output

Implement on_error() on your node to log the error:

impl Node for MotorCtrl {
    fn on_error(&mut self, error: &str) {
        eprintln!("MotorCtrl error: {error}");
        // Optionally: enter safe state, publish error topic
    }
}

Step 2: Get a Full Backtrace

Run the binary with backtraces enabled (RUST_BACKTRACE=full prints the unabridged trace):

RUST_BACKTRACE=1 ./target/release/my_robot

Step 3: Reproduce with Deterministic Mode

Use deterministic mode and tick_once() to replay the exact scenario:

let mut scheduler = Scheduler::new()
    .deterministic(true)
    .tick_rate(100_u64.hz());

scheduler.add(MotorCtrl::new()).build()?;

// Step through ticks one at a time
for i in 0..1000 {
    println!("tick {i}");
    scheduler.tick_once(); // panics are reproducible
}

Use tick(&["motor_ctrl"]) to isolate a single node.

Step 4: Fix the Panic

Fix the bug directly if it is in your code. For panics in third-party code, wrap the call with catch_unwind. Note that this cannot intercept panics if the binary is compiled with panic = "abort":

fn tick(&mut self) {
    let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
        self.flaky_library.update();
    }));
    if let Err(e) = result {
        eprintln!("Library panicked: {e:?}");
    }
}

Workflow 3: "My System Is Slow"

The system runs but cannot keep up with its tick rate.

Step 1: Find the Slowest Node

let report = scheduler.profile(1000)?;
println!("{report}");

// Nodes are listed with median and p99 timing
// The node with the highest p99 is your bottleneck
for node in &report.nodes {
    println!("{}: median={:?} p99={:?}", node.name, node.median, node.p99);
}

Sort by p99 to find the node causing the most delay.
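Sorting the report by p99 is a one-liner. NodeStat here is an illustrative stand-in for whatever type report.nodes actually yields:

```rust
use std::time::Duration;

#[derive(Debug)]
struct NodeStat {
    name: &'static str,
    median: Duration,
    p99: Duration,
}

// Highest p99 first: that node bounds the worst-case loop time.
fn slowest_first(mut nodes: Vec<NodeStat>) -> Vec<NodeStat> {
    nodes.sort_by(|a, b| b.p99.cmp(&a.p99));
    nodes
}
```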

Step 2: Check Execution Classes

A common mistake is running heavy work as BestEffort (the default), which blocks the main thread:

// Bad: heavy computation blocks the main loop
scheduler.add(PathPlanner::new()).build()?;

// Good: run on a compute thread pool
scheduler.add(PathPlanner::new())
    .compute()
    .build()?;

// Good: run blocking I/O on an async thread
scheduler.add(CameraDriver::new())
    .async_io()
    .build()?;

Step 3: Check CPU and Profile

# Per-core CPU usage — if one core is 100% while others idle, use .compute()
mpstat -P ALL 1 5

# Profile with perf to find hot functions
perf record -g ./target/release/my_robot && perf report

Symptom | Likely Cause | Fix
One core at 100% | Work not distributed | Use .compute(), .cores(&[...])
Periodic spikes ~1s | Allocator pressure | Use jemalloc, pre-allocate
Latency grows over time | Memory leak | Monitor RSS, fix leaking buffers
Random multi-ms stalls | Page faults | .require_rt() calls mlockall
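For the allocator-pressure row, switching the global allocator to jemalloc is a small change. One way to do it is via the tikv-jemallocator crate (add tikv-jemallocator to Cargo.toml; crate choice and version are up to you):

```rust
// Route all heap allocations through jemalloc instead of the system
// allocator; jemalloc's arenas tend to reduce tail-latency spikes
// under allocation churn.
#[global_allocator]
static GLOBAL: tikv_jemallocator::Jemalloc = tikv_jemallocator::Jemalloc;
```

Pre-allocating in init() (as in Workflow 1) remains the more reliable fix, since it removes the allocations from the hot path entirely.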

See Also