# Debugging Workflows

Three concrete workflows for the most common issues: deadline misses, panics, and performance problems.
## Workflow 1: "My Motor Stutters"

Stuttering usually means deadline misses — the control loop is not completing within its budget.
### Step 1: Check Scheduler Output

Enable monitoring and look for deadline-miss warnings in stderr:

```rust
let mut scheduler = Scheduler::new()
    .verbose(true)
    .tick_rate(1000_u64.hz());
```
The scheduler prints a timing report on shutdown. Look for lines like:

```text
[WARN] motor_ctrl: 12 deadline misses (worst: 2.3ms, budget: 1.0ms)
```
### Step 2: Profile Tick Timing

Use `profile()` to get percentile statistics:

```rust
let report = scheduler.profile(5000)?;
println!("{report}");

// Check per-node budget utilization
for node in &report.nodes {
    if let Some(used) = node.budget_used {
        if used > 0.8 {
            println!("WARNING: {} using {:.0}% of budget", node.name, used * 100.0);
        }
    }
}
```
If p99 exceeds the budget, at least 1% of ticks are missing their deadline. If p99 is much higher than the median, the node's execution time is inconsistent; look for sporadic work such as allocation, logging, or lock contention.
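The median/p99 gap is easy to reason about with the nearest-rank definition of a percentile. A minimal, self-contained sketch (independent of the scheduler's actual profiler) showing how a few 2.5 ms spikes among otherwise 0.8 ms ticks leave the median untouched but drag p99 up:

```rust
use std::time::Duration;

/// Nearest-rank percentile over a sorted sample — a sketch of how a
/// profiler might derive median/p99 from raw tick durations.
fn percentile(sorted: &[Duration], p: f64) -> Duration {
    assert!(!sorted.is_empty());
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    // 100 simulated tick durations: mostly 0.8 ms, with four 2.5 ms spikes.
    let mut ticks: Vec<Duration> = (0..100)
        .map(|i| {
            if i % 25 == 24 {
                Duration::from_micros(2500)
            } else {
                Duration::from_micros(800)
            }
        })
        .collect();
    ticks.sort();

    // Median stays at 0.8 ms; p99 lands on a spike.
    println!("median={:?}", percentile(&ticks, 50.0));
    println!("p99={:?}", percentile(&ticks, 99.0));
}
```

A p99 far above the median, as here, points at sporadic spikes rather than a uniformly slow node.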
### Step 3: Use the Blackbox to Find the Exact Tick

Enable the blackbox to record the last N ticks per node:

```rust
let mut scheduler = Scheduler::new()
    .verbose(true)
    .with_blackbox(64)
    .tick_rate(1000_u64.hz());
```
After a miss, inspect the blackbox to find what happened on the tick that exceeded the budget. The blackbox records tick duration, input values, and events.
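Conceptually, a fixed-capacity flight recorder is a ring buffer: each new record overwrites the oldest once the buffer fills. A minimal sketch of that idea (the type and method names here are illustrative, not the library's actual API):

```rust
/// Illustrative ring buffer: keeps only the last `capacity` records.
struct Blackbox<T> {
    buf: Vec<Option<T>>,
    next: usize,
}

impl<T: Clone> Blackbox<T> {
    fn new(capacity: usize) -> Self {
        Self { buf: vec![None; capacity], next: 0 }
    }

    /// O(1) per tick: overwrite the oldest slot.
    fn record(&mut self, item: T) {
        let cap = self.buf.len();
        self.buf[self.next % cap] = Some(item);
        self.next += 1;
    }

    /// Surviving records, oldest first.
    fn dump(&self) -> Vec<T> {
        let cap = self.buf.len();
        (self.next..self.next + cap)
            .filter_map(|i| self.buf[i % cap].clone())
            .collect()
    }
}

fn main() {
    let mut bb = Blackbox::new(3);
    for tick in 1..=5 {
        bb.record(tick);
    }
    // Only the last 3 ticks survive, oldest first: [3, 4, 5]
    println!("{:?}", bb.dump());
}
```

The O(1) overwrite is what keeps recording cheap enough to leave enabled in a hot control loop.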
### Step 4: Fix Common Causes

| Cause | Symptom | Fix |
|---|---|---|
| Allocation in `tick()` | Sporadic spikes | Pre-allocate buffers in `init()` |
| Blocking I/O | Consistent high latency | Move to an `.async_io()` node |
| Lock contention | Spikes correlated with other nodes | Use `try_lock()` or lock-free channels |
| Large computation | Always near budget | Move to `.compute()` with a longer budget |
```rust
// Bad: allocating every tick
fn tick(&mut self) {
    let data: Vec<f64> = self.sensor.read_all(); // allocates
    self.process(&data);
}

// Good: pre-allocate in init(), reuse the buffer in tick()
fn init(&mut self) {
    self.buffer = vec![0.0; 128]; // allocate once
}

fn tick(&mut self) {
    self.sensor.read_into(&mut self.buffer); // reuse
    self.process(&self.buffer);
}
```
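For the lock-contention row above, the fix is to never block the tick on a mutex: try the lock, and fall back to the last known value when it is contended. A self-contained sketch with `std::sync::Mutex` (the `Shared` type and `cached` value are illustrative):

```rust
use std::sync::{Arc, Mutex, TryLockError};

// Illustrative shared state written by another thread.
struct Shared {
    latest_target: f64,
}

// A tick that must never block: refresh the cache if the lock is free,
// otherwise keep the stale value and move on.
fn tick(shared: &Arc<Mutex<Shared>>, cached: &mut f64) {
    match shared.try_lock() {
        Ok(guard) => *cached = guard.latest_target, // got the lock: refresh
        Err(TryLockError::WouldBlock) => {}         // contended: keep stale value
        Err(TryLockError::Poisoned(p)) => *cached = p.into_inner().latest_target,
    }
}

fn main() {
    let shared = Arc::new(Mutex::new(Shared { latest_target: 1.5 }));
    let mut cached = 0.0;

    tick(&shared, &mut cached); // lock free: cache refreshed to 1.5

    // While the lock is held elsewhere, tick() keeps the cached value
    // instead of blocking.
    let _held = shared.lock().unwrap();
    tick(&shared, &mut cached);
    println!("cached = {cached}");
}
```

The trade-off is one tick of staleness under contention, which is usually acceptable for control loops and always better than an unbounded wait.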
## Workflow 2: "My Node Panicked"

A node panic is caught by the scheduler. The node is marked `Unhealthy` and `on_error()` is called.
### Step 1: Check on_error() Output

Implement `on_error()` on your node to log the error:

```rust
impl Node for MotorCtrl {
    fn on_error(&mut self, error: &str) {
        eprintln!("MotorCtrl error: {error}");
        // Optionally: enter safe state, publish error topic
    }
}
```
### Step 2: Get a Full Backtrace

Run the binary with backtraces enabled to see where the panic originated:

```bash
RUST_BACKTRACE=1 ./target/release/my_robot
```

Release builds may inline away frames; use `RUST_BACKTRACE=full` or a debug build if the trace is too sparse.
### Step 3: Reproduce with Deterministic Mode

Use deterministic mode and `tick_once()` to replay the exact scenario:

```rust
let mut scheduler = Scheduler::new()
    .deterministic(true)
    .tick_rate(100_u64.hz());

scheduler.add(MotorCtrl::new()).build()?;

// Step through ticks one at a time
for i in 0..1000 {
    println!("tick {i}");
    scheduler.tick_once(); // panics are reproducible
}
```

Use `tick(&["motor_ctrl"])` to isolate a single node.
### Step 4: Fix the Panic

Fix the bug directly if it is in your code. For panics in third-party code, wrap the call with `catch_unwind` (this only works with the default `panic = "unwind"` strategy, not `panic = "abort"`):

```rust
fn tick(&mut self) {
    let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
        self.flaky_library.update();
    }));
    if let Err(e) = result {
        eprintln!("Library panicked: {e:?}");
    }
}
```
## Workflow 3: "My System Is Slow"

The system runs but cannot keep up with its tick rate.
### Step 1: Find the Slowest Node

```rust
let report = scheduler.profile(1000)?;
println!("{report}");

// Nodes are listed with median and p99 timing;
// the node with the highest p99 is your bottleneck.
for node in &report.nodes {
    println!("{}: median={:?} p99={:?}", node.name, node.median, node.p99);
}
```
Sort by p99 to find the node causing the most delay.
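If the report does not already order its nodes, ranking them is one line. A sketch over hypothetical `(name, p99)` pairs:

```rust
use std::time::Duration;

fn main() {
    // Hypothetical per-node p99 values pulled from a profile report.
    let mut nodes = vec![
        ("imu_filter", Duration::from_micros(120)),
        ("path_planner", Duration::from_millis(4)),
        ("motor_ctrl", Duration::from_micros(600)),
    ];

    // Sort descending by p99: the worst offender comes first.
    nodes.sort_by(|a, b| b.1.cmp(&a.1));
    println!("bottleneck: {} ({:?})", nodes[0].0, nodes[0].1);
}
```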
### Step 2: Check Execution Classes

A common mistake is running heavy work as `BestEffort` (the default), which blocks the main thread:

```rust
// Bad: heavy computation blocks the main loop
scheduler.add(PathPlanner::new()).build()?;

// Good: run on a compute thread pool
scheduler.add(PathPlanner::new())
    .compute()
    .build()?;

// Good: run blocking I/O on an async thread
scheduler.add(CameraDriver::new())
    .async_io()
    .build()?;
```
### Step 3: Check CPU and Profile

```bash
# Per-core CPU usage — if one core is 100% while others idle, use .compute()
mpstat -P ALL 1 5

# Profile with perf to find hot functions
perf record -g ./target/release/my_robot && perf report
```
| Symptom | Likely Cause | Fix |
|---|---|---|
| One core at 100% | Work not distributed | Use `.compute()` or `.cores(&[...])` |
| Periodic spikes ~1s | Allocator pressure | Use jemalloc, pre-allocate |
| Latency grows over time | Memory leak | Monitor RSS, fix leaking buffers |
| Random multi-ms stalls | Page faults | `.require_rt()` calls `mlockall` |
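For the memory-leak row, the cheapest monitor on Linux is the `VmRSS` line in `/proc/self/status`. A sketch of the parsing, run here against a captured snapshot string so it stays platform-independent (in production you would read the real file with `std::fs::read_to_string`):

```rust
/// Extract resident-set size in KiB from /proc/self/status-style text.
fn rss_kib(status: &str) -> Option<u64> {
    status
        .lines()
        .find(|l| l.starts_with("VmRSS:"))? // e.g. "VmRSS:   52340 kB"
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()
}

fn main() {
    // Captured snapshot standing in for the live file.
    let sample = "Name:\tmy_robot\nVmRSS:\t   52340 kB\nThreads:\t8\n";
    println!("RSS = {} KiB", rss_kib(sample).unwrap());
}
```

Log this periodically from a low-rate node; a value that only ever grows is the leak signature.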
## See Also

- Logging - Structured logging with `hlog!`
- BlackBox Flight Recorder - Post-mortem tick analysis
- Monitor - Real-time visual debugging
- Troubleshooting - Common issues and solutions