Scheduler — Full Reference

A warehouse robot runs 15 nodes: LiDAR, cameras, motor controllers, safety monitors, path planners, and loggers. The motor controller must run every millisecond — miss a deadline and the robot overshoots into a shelf. The path planner needs 50 ms of CPU time — if it runs on the same thread as the motor controller, it blocks 50 ticks. The safety monitor must run before the motor controller every cycle — if it runs after, a collision is detected one cycle too late. And when someone presses the emergency stop, all 15 nodes must shut down in the right order: motors stop before sensors disconnect.

Writing this coordination by hand is one of the largest sources of bugs in robotics software. Thread pools, priority queues, signal handlers, deadline monitors, watchdog timers — each one is a weekend of debugging, and they all interact in non-obvious ways.

HORUS's scheduler handles all of this. You add nodes, configure their order and timing, and call run(). The scheduler manages execution order, tick rates, deadline enforcement, safety monitoring, graceful shutdown, and recording — across five different execution classes, each optimized for a different workload type.

For a gentler introduction, start with Scheduler: Running Your Nodes. For multi-process orchestration (mixed languages, process isolation), see Launch System.

The Execution Model

How a Tick Works

Every tick cycle, the scheduler rate-checks each node, times its tick(), enforces its budget, and sleeps to maintain the configured rate.

Key details:

  • BestEffort nodes auto-parallelize: the scheduler builds a dependency graph from topic send()/recv() metadata. Independent nodes (no shared topics) run in parallel via the ready-dispatch executor. Dependent nodes (one subscribes to another's topic) execute in causal order — the publisher always finishes before the subscriber starts. RT, Compute, Event, and AsyncIo nodes run on their own dedicated threads.
  • Rate limiting is per-node: a 10 Hz node inside a 1 kHz scheduler only has tick() called every 100th cycle.
  • Budget enforcement happens after tick() returns — the scheduler doesn't preempt mid-tick. It records the overrun and applies the Miss policy.
  • .order() is optional — when nodes have topic dependencies, the graph determines execution order automatically. .order() serves as a tiebreaker for independent nodes with no topic relationship.
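
A minimal sketch of per-node rate limiting, using only builder calls covered later in this reference (node names are hypothetical):

// sketch: hypothetical nodes
use horus::prelude::*;

let mut scheduler = Scheduler::new()
    .tick_rate(1000_u64.hz());                  // global loop runs at 1 kHz

// tick() called every cycle (1 kHz)
scheduler.add(motor_ctrl).order(5).rate(1000_u64.hz()).build()?;

// tick() called every 100th cycle (10 Hz) inside the same 1 kHz loop
scheduler.add(telemetry).order(100).rate(10_u64.hz()).build()?;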

Initialization

When you call scheduler.run(), initialization happens lazily — not when nodes are added:

  1. All pending node configurations are finalized (execution class inference, budget/deadline auto-derivation)
  2. init() is called on every node — topic send()/recv() calls during init() register with the TopicNodeRegistry
  3. The dependency graph is built from the registry: edges from publishers to subscribers, cycle detection, topological sort
  4. If a node's init() returns Err, that node enters Error state and is excluded from ticking. Other nodes continue.
  5. The main tick loop begins
  6. After the first tick, the graph is rebuilt once if new topic registrations were detected (for topics created lazily during tick())

This lazy initialization means you can configure nodes in any order and the scheduler resolves dependencies at startup.
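
A minimal sketch of that add-order independence (hypothetical nodes; the subscriber is added before its publisher, but the dependency graph built at startup still runs the publisher first):

// sketch: add order doesn't matter; dependencies are resolved in run()
scheduler.add(fusion).order(10).build()?;        // recv()s the "imu" topic during init()
scheduler.add(imu_driver).order(0).build()?;     // send()s the "imu" topic during init()

scheduler.run()?;  // finalize configs, init() all nodes, build the graph, start ticking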

Shutdown

When the scheduler receives a stop signal (Ctrl+C, SIGINT, SIGTERM, or .stop()):

  1. The main loop exits
  2. shutdown() is called on every node in reverse order — last-added first
  3. RT threads are given 3 seconds to exit cleanly; stalled threads are detached
  4. Errors during shutdown are logged but don't prevent other nodes from cleaning up
  5. Shared memory is cleaned up

Reverse-order shutdown ensures dependent nodes stop before their dependencies. The motor controller (order 1, added after the sensor that feeds it) shuts down before that sensor (order 0).
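
A minimal sketch of the resulting shutdown sequence (hypothetical nodes, added in dependency order):

// sketch: nodes added in dependency order
scheduler.add(lidar_driver).order(0).build()?;                      // added first
scheduler.add(motor_ctrl).order(1).rate(1000_u64.hz()).build()?;    // added second
scheduler.add(logger).order(200).build()?;                          // added last

scheduler.run()?;
// On Ctrl+C / SIGTERM / .stop(), shutdown() runs last-added first:
//   logger -> motor_ctrl -> lidar_driver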

Configuring the Scheduler

Builder Methods

Scheduler::new() creates a minimal scheduler with capability detection (~30–100 µs) but no OS-level features applied. Use builder methods to opt in:

// simplified
use horus::prelude::*;

let mut scheduler = Scheduler::new()
    .tick_rate(1000_u64.hz())        // 1 kHz global tick rate
    .watchdog(500_u64.ms())          // Detect frozen nodes
    .blackbox(64)                    // 64 MB flight recorder
    .max_deadline_misses(50)         // Emergency stop after 50 misses
    .prefer_rt();                    // Request RT features (degrade gracefully)

| Builder | What it enables | When to use |
|---|---|---|
| .tick_rate(freq) | Global tick rate (default: 100 Hz) | Always — match your fastest node |
| .watchdog(duration) | Frozen node detection with graduated degradation | Production robots |
| .blackbox(size_mb) | Flight recorder for post-mortem debugging | Production, testing |
| .max_deadline_misses(n) | Emergency stop threshold (default: 100) | Safety-critical systems |
| .prefer_rt() | Request RT features, degrade gracefully if unavailable | Most production deployments |
| .require_rt() | Require RT features — panics if unavailable | Hard real-time systems |
| .deterministic(true) | Dependency-driven execution order for reproducible results | Simulation, testing |
| .verbose(false) | Suppress non-emergency logging | Quieter production |
| .with_recording() | Enable session record/replay | Data collection |
| .cores(&[0, 1]) | Pin scheduler to specific CPU cores | Dedicated-core deployments |

Adding Nodes

The fluent builder API configures each node:

// simplified
// Minimal — just order
scheduler.add(logger).order(200).build()?;

// With per-node rate — auto-derives budget (80%) and deadline (95%)
scheduler.add(motor_ctrl)
    .order(5)
    .rate(1000_u64.hz())           // 1 kHz → budget=800µs, deadline=950µs
    .on_miss(Miss::SafeMode)       // Enter safe state on deadline miss
    .build()?;

// Explicit budget and deadline (override auto-derivation)
scheduler.add(safety_monitor)
    .order(0)
    .budget(100_u64.us())          // Must finish in 100 µs
    .deadline(200_u64.us())        // Hard wall at 200 µs
    .on_miss(Miss::Stop)           // Stop everything if safety monitor misses
    .build()?;

// CPU-heavy work on thread pool
scheduler.add(path_planner).order(50).compute().build()?;

// Event-driven — wakes when topic receives data
scheduler.add(estop_handler).order(0).on("emergency.stop").build()?;

// Async I/O — network, HTTP, database
scheduler.add(cloud_uploader).order(200).async_io().build()?;

Per-Node Configuration Reference

| Method | Description |
|---|---|
| .order(n) | Execution order — lower runs first. 0–9 critical, 10–49 high, 50–99 normal, 100–199 low, 200+ background |
| .rate(freq) | Per-node tick rate. Auto-derives budget (80% of period) and deadline (95%). Auto-marks as RT |
| .budget(duration) | Max expected tick duration. Auto-marks as RT |
| .deadline(duration) | Hard deadline — Miss policy fires when exceeded. Auto-marks as RT |
| .on_miss(miss) | Deadline miss policy: Warn, Skip, SafeMode, Stop (default: Warn) |
| .compute() | Runs on thread pool — for CPU-heavy work |
| .on("topic") | Event-driven — ticks only when named topic receives a message |
| .async_io() | Runs on tokio runtime — for network/file I/O |
| .priority(n) | OS thread priority (SCHED_FIFO 1–99) — requires RT capabilities |
| .core(cpu_id) | Pin to specific CPU core |
| .watchdog(duration) | Per-node watchdog timeout (overrides global) |
| .failure_policy(policy) | What to do when tick() panics: Fatal, Restart, Skip, Ignore |
| .build() | Finalize and register — returns Result |
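
The OS-level and failure-handling options compose with the earlier examples. A sketch (hypothetical node; the failure-policy enum name FailurePolicy is an assumption, since the reference only lists the variants):

// sketch: OS-level options from the table above
scheduler.add(motor_ctrl)
    .order(5)
    .rate(1000_u64.hz())                      // RT class; budget/deadline auto-derived
    .priority(80)                             // SCHED_FIFO priority, requires RT capabilities
    .core(2)                                  // pin to CPU core 2
    .watchdog(10_u64.ms())                    // per-node watchdog, overrides the global one
    .failure_policy(FailurePolicy::Restart)   // enum name assumed; restart the node if tick() panics
    .on_miss(Miss::SafeMode)
    .build()?;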

Execution Classes

The scheduler assigns each node an execution class based on its configuration. Each class uses a different executor optimized for its workload:

Five execution classes — auto-detected from node configuration
| Class | Thread model | Timing | Best for |
|---|---|---|---|
| BestEffort | Sequential in main loop | No guarantees | Logging, telemetry, display |
| Rt | Dedicated thread per node | Budget + deadline enforced | Motor control, safety, sensor fusion |
| Compute | Shared thread pool | None — CPU time only | Path planning, SLAM, image processing |
| Event | Sleeps until topic message | None — latency from publish to tick | Emergency stop, command handlers |
| AsyncIo | Tokio runtime | None — I/O bound | Cloud upload, HTTP, database |

RT is auto-detected. There is no .rt() method. Setting .rate(), .budget(), or .deadline() on a node automatically assigns the Rt execution class. This maps developer intent ("this node needs to run at 1 kHz") to the right executor without requiring framework knowledge.

Deferred Finalization

Execution class inference happens at .build() time, not when individual methods are called. This means .rate(100_u64.hz()).compute() and .compute().rate(100_u64.hz()) behave identically: class precedence, not call order, decides the result. Specifically:

  • .rate() / .budget() / .deadline() → Rt
  • .compute() → Compute (overrides Rt)
  • .on(topic) → Event (overrides Rt and Compute)
  • .async_io() → AsyncIo (overrides everything)

If none are set, the default is BestEffort.
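
For example, both of these finalize to the Compute class at .build(), regardless of call order (hypothetical nodes):

// sketch: call order doesn't matter; the class is inferred at .build()
scheduler.add(planner_a).order(50).rate(100_u64.hz()).compute().build()?;   // Compute
scheduler.add(planner_b).order(51).compute().rate(100_u64.hz()).build()?;   // Compute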

Real-Time Enforcement

Budget and Deadline Auto-Derivation

When you set .rate(freq), the scheduler auto-derives:

  • Budget = 80% of the period (e.g., 1000 Hz → 800 µs budget)
  • Deadline = 95% of the period (e.g., 1000 Hz → 950 µs deadline)

You can override either with explicit .budget() or .deadline().
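
For example, at 500 Hz the period is 2 ms, so the derived budget is 1.6 ms and the deadline 1.9 ms; an explicit setter replaces only the value it names (hypothetical nodes):

// sketch: auto-derivation at 500 Hz (period = 2 ms)
scheduler.add(imu_filter)
    .order(10)
    .rate(500_u64.hz())            // auto: budget = 1.6 ms, deadline = 1.9 ms
    .build()?;

// keep the derived budget, tighten only the deadline
scheduler.add(imu_filter_strict)
    .order(11)
    .rate(500_u64.hz())            // auto budget = 1.6 ms
    .deadline(1_u64.ms())          // explicit hard wall at 1 ms
    .build()?;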

The Miss Enum

When a node exceeds its deadline, the Miss policy fires:

| Policy | What happens | Use when |
|---|---|---|
| Miss::Warn | Log warning, continue normally | Non-critical nodes (default) |
| Miss::Skip | Skip this node's next tick to recover | High-frequency nodes that can afford one skipped cycle |
| Miss::SafeMode | Call enter_safe_state() on the node | Motor controllers, actuators |
| Miss::Stop | Stop the entire scheduler | Safety monitor, last resort |

// simplified
scheduler.add(motor_ctrl)
    .order(5)
    .rate(1000_u64.hz())
    .on_miss(Miss::SafeMode)  // Enter safe state on deadline miss
    .build()?;

Graduated Watchdog

When .watchdog(duration) is set, the scheduler monitors every node with a graduated response:

| Timeout | Health state | Response |
|---|---|---|
| 1× watchdog | Warning | Log warning |
| 2× watchdog | Unhealthy | Skip tick, log error |
| 3× watchdog on critical node | Isolated | Remove from tick loop, call enter_safe_state() |

This prevents a single frozen node from stalling the entire system.
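
A sketch of global plus per-node watchdog configuration (hypothetical node; the escalation steps are the ones in the table above):

// sketch: graduated watchdog
let mut scheduler = Scheduler::new()
    .tick_rate(1000_u64.hz())
    .watchdog(500_u64.ms());      // global: warn at 500 ms, unhealthy at 1 s, isolate (critical nodes) at 1.5 s

scheduler.add(motor_ctrl)
    .order(5)
    .rate(1000_u64.hz())
    .watchdog(50_u64.ms())        // per-node override: warn 50 ms / unhealthy 100 ms / isolate 150 ms
    .build()?;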

.require_rt() vs .prefer_rt()

| Method | RT scheduling | Memory locking | CPU affinity | On failure |
|---|---|---|---|---|
| .prefer_rt() | Tries SCHED_FIFO | Tries mlockall | Tries isolated CPUs | Logs degradation, continues |
| .require_rt() | Requires SCHED_FIFO | Requires mlockall | Requires isolated CPUs | Panics |

Use .prefer_rt() for most deployments — it applies what it can and logs what it can't. Use .require_rt() only for hard real-time systems where degraded mode is unacceptable.
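
Both are single builder calls:

// sketch: most deployments, apply what's available and log the rest
let mut scheduler = Scheduler::new().tick_rate(1000_u64.hz()).prefer_rt();

// sketch: hard real-time only, panic if SCHED_FIFO, mlockall, or isolated CPUs are unavailable
let mut scheduler = Scheduler::new().tick_rate(1000_u64.hz()).require_rt();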

Running the Scheduler

Continuous Mode

// simplified
scheduler.run()?;  // Run until Ctrl+C

Single-Tick Mode (Testing/Simulation)

Execute exactly one tick cycle, then return:

// simplified
scheduler.tick_once()?;  // Init lazily on first call, tick all nodes once

This is invaluable for testing — create a scheduler, add nodes, call tick_once(), and assert on topic output. No threads, no timing, fully deterministic.

// simplified
// In a test
let mut sched = Scheduler::new();
sched.add(my_node).order(0).build()?;

sched.tick_once()?;  // Init + one tick
assert_eq!(output_topic.recv(), Some(expected_value));

sched.tick_once()?;  // Second tick
assert_eq!(output_topic.recv(), Some(next_expected_value));

Deterministic Mode

Enable dependency-driven execution order for reproducible results:

// simplified
let mut scheduler = Scheduler::new()
    .deterministic(true);

In deterministic mode, the scheduler uses a dependency graph (inferred from topic connections) to determine execution order instead of relying on OS thread scheduling. This guarantees identical results across runs — essential for simulation and regression testing.

Duration-Limited

// simplified
scheduler.run_for(std::time::Duration::from_secs(30))?;  // Run for 30 seconds

DurationExt and Frequency

HORUS provides ergonomic extension methods for creating Duration and Frequency values:

// simplified
use horus::prelude::*;

// Duration helpers
200_u64.us()     // 200 microseconds
5_u64.ms()       // 5 milliseconds
1_u64.secs()     // 1 second
500_u64.ns()     // 500 nanoseconds

// Frequency
100_u64.hz()     // 100 Hz
1000_u64.hz()    // 1 kHz

Frequency validates at construction — 0_u64.hz(), f64::NAN.hz(), f64::INFINITY.hz(), and negative values all panic immediately rather than producing silent bugs later.

A Frequency auto-derives timing parameters:

  • .period() → Duration (1/freq)
  • .budget_default() → 80% of period
  • .deadline_default() → 95% of period
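
A quick sketch at 1 kHz (assuming these helpers return std::time::Duration values, as the examples above suggest):

// sketch: derived timing at 1 kHz
let f = 1000_u64.hz();
assert_eq!(f.period(), 1_u64.ms());              // 1 ms period
assert_eq!(f.budget_default(), 800_u64.us());    // 80% of the period
assert_eq!(f.deadline_default(), 950_u64.us());  // 95% of the period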

Recording and Replay

Enable recording through the builder API:

// simplified
let mut scheduler = Scheduler::new()
    .blackbox(16)          // 16 MB flight recorder
    .with_recording();     // Enable session recording

scheduler.run()?;          // All ticks recorded

Replay a recorded session:

// simplified
let mut replay = Scheduler::replay_from("~/.horus/recordings/session".into())?;
replay.run()?;  // Replays exact tick sequence

Performance Monitoring

Programmatic

// simplified
let metrics = scheduler.metrics();
for m in &metrics {
    println!("{}: avg={:.2}ms, max={:.2}ms, misses={}",
        m.name(), m.avg_tick_duration_ms(), m.max_tick_duration_ms(), m.failed_ticks());
}

if let Some(safety) = scheduler.safety_stats() {
    println!("Budget overruns: {}, Watchdog expirations: {}",
        safety.budget_overruns, safety.watchdog_expirations);
}

CLI Tools

horus monitor --tui         # Interactive dashboard with per-node timing
horus node list             # List all registered nodes
horus node info MotorCtrl   # Detailed stats for a specific node

Design Decisions

Why a single Scheduler instead of per-node threads? Per-node threads give each node independence, but they lose deterministic ordering. If the sensor and motor controller run on separate threads, you can't guarantee the sensor publishes before the motor controller reads. The scheduler gives you the best of both: sequential ordering for BestEffort nodes (deterministic), dedicated threads for RT nodes (timing guarantees), and thread pools for Compute nodes (parallelism) — all coordinated.

Why auto-detect RT instead of requiring explicit configuration? Developers think "this node needs to run at 1 kHz," not "this node needs SCHED_FIFO priority 80 on an isolated CPU core." Auto-detection from .rate(), .budget(), and .deadline() maps intent to implementation. The developer describes what the node needs; the scheduler figures out how to deliver it.

Why graduated watchdog instead of instant kill? A single late tick might be a transient spike (GC in a Python node, page fault). Killing the node immediately is too aggressive. Graduated response (warn → skip → isolate) gives transient problems time to resolve while still catching truly frozen nodes. Only after 3× the watchdog timeout on a critical node does the scheduler isolate it.

Why reverse-order shutdown? Nodes are typically added in dependency order: sensors → controllers → loggers. Reverse-order shutdown means controllers stop motors before sensors disconnect, and loggers record the shutdown sequence before they themselves stop. This prevents the dangerous scenario where a sensor disconnects while a motor controller is still running with the last received command.

Why lazy initialization? init() runs when scheduler.run() is called, not when scheduler.add() is called. This means you can configure all nodes, set global scheduler settings, and defer hardware initialization until the system is truly ready to start. It also means the scheduler's clock, RT configuration, and recording state are all finalized before any node initializes.

Trade-offs

| Gain | Cost |
|---|---|
| Deterministic ordering — nodes always execute in the configured sequence | Independent nodes still need .order() as a tiebreaker |
| Five execution classes — each workload gets the right executor | More complex scheduler internals (transparent to users) |
| Automatic RT detection — .rate() implies dedicated thread + timing | Less explicit control (override with .compute() or .on()) |
| Budget/deadline enforcement — catches timing regressions automatically | ~µs measurement overhead per monitored node |
| Graduated watchdog — frozen nodes are isolated, not killed instantly | 3× timeout before isolation on critical nodes |
| Graceful shutdown — guaranteed shutdown() call on all nodes | Nodes must implement shutdown() for hardware cleanup |
| tick_once() — fully deterministic single-step execution for tests | Not suitable for production (no rate control) |

See Also