GIL & Performance

The GIL (Global Interpreter Lock) is the single most important performance factor for Python HORUS nodes. Understanding how it works lets you design systems that maximize Python's strengths while avoiding its limitations.


How the GIL Works in HORUS

The scheduler's tick loop runs in Rust. The GIL is only acquired when calling your Python callbacks:

Rust scheduler tick loop (no GIL held)
  │
  ├── Acquire GIL (~500ns)
  ├── Call Python tick(node)
  ├── Release GIL
  │
  ├── Acquire GIL (~500ns)
  ├── Call Python tick(node) for next node
  ├── Release GIL
  │
  └── ... (Rust handles timing, SHM, RT)

Key insight: The scheduler, shared memory transport, ring buffers, and RT scheduling are all pure Rust — they run without the GIL. Only your Python tick(), init(), and shutdown() callbacks acquire the GIL.


Tick Rate Ceiling

The GIL acquisition + Python callback overhead is ~11μs per tick. This puts a hard ceiling on Python tick rates:

Target Rate   Budget per Tick   Achievable?   Headroom
100 Hz        10ms              Yes           900x
1,000 Hz      1ms               Yes           90x
5,000 Hz      200μs             Marginal      ~18x
10,000 Hz     100μs             No            Measured: ~5,932 Hz max

Practical ceiling: ~5-6 kHz for trivial tick functions. With real work (NumPy, I/O, computation), expect lower.
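The headroom column is just the tick budget divided by the ~11μs per-tick overhead. A quick sketch of that arithmetic (the 11μs figure is the measured overhead quoted above; your machine will differ):

```python
# Per-tick Python overhead in microseconds (~11 us, from the measurement
# above; treat it as an order-of-magnitude figure, not a constant).
OVERHEAD_US = 11.0

def headroom(rate_hz: float) -> float:
    """How many times the per-tick overhead fits into the tick budget."""
    budget_us = 1_000_000 / rate_hz
    return budget_us / OVERHEAD_US

for rate in (100, 1_000, 5_000, 10_000):
    print(f"{rate:>6} Hz -> budget {1_000_000 / rate:>8.1f} us, "
          f"headroom {headroom(rate):>6.1f}x")
```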

What Costs What

Operation                  Time             Source
GIL acquire + release      ~500ns + 500ns   PyO3 boundary
Python object allocation   ~700ns           Per-tick overhead
node.send(CmdVel)          ~1.7μs           Typed message (total)
node.send(dict)            ~6-50μs          GenericMessage serialization
node.recv()                ~1.5μs           Typed message
NumPy array creation       ~1-5μs           Depends on size
img.to_numpy()             ~3μs             SHM view
np.from_dlpack(img)        ~1.1μs           True zero-copy
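The figures above come from the framework's benchmarks; to get comparable numbers for your own environment, a generic timing harness like this works (a sketch using only the stdlib; results are machine-dependent):

```python
import time

def cost_ns(fn, iters: int = 100_000) -> float:
    """Average wall-clock cost of fn() in nanoseconds over `iters` calls."""
    start = time.perf_counter_ns()
    for _ in range(iters):
        fn()
    return (time.perf_counter_ns() - start) / iters

# Example: per-call cost of allocating a small dict (cf. the
# "Python object allocation" row above).
print(f"dict alloc: ~{cost_ns(lambda: {'linear': 1.0, 'angular': 0.5}):.0f} ns")
```

The same harness can wrap `node.send(...)` or `node.recv()` inside a running node to verify the table's messaging costs on your hardware.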

When to Use Python vs Rust

Use Case                                    Language   Why
ML inference (PyTorch, YOLO, TensorFlow)    Python     1.7μs overhead negligible vs 10-200ms inference
Data science, prototyping                   Python     Developer velocity matters more than latency
HTTP APIs, database queries                 Python     Use async nodes; GIL released during I/O
Visualization, dashboards                   Python     matplotlib, plotly, etc.
Motor control at 1kHz+                      Rust       89ns vs 1,700ns (19x difference)
Safety monitors                             Rust       Deterministic timing, no GIL
Sensor fusion at 500Hz+                     Rust       Predictable p99 latency
High-frequency sensor drivers               Rust       Direct hardware access, no Python overhead

Rule of thumb: If your tick function takes >1ms (ML inference, complex planning, I/O), Python is fine — the GIL overhead is negligible. If it takes <100μs (control loops, sensor processing), use Rust.
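The rule of thumb can be made quantitative: compare the fixed Python-side overhead to the tick's actual work. A sketch (overhead figures taken from the tables above; order-of-magnitude only):

```python
# Fixed per-tick Python overhead: ~11 us callback cost plus ~1.7 us for a
# typed send (figures from the cost table above).
OVERHEAD_US = 11.0 + 1.7

def overhead_fraction(work_us: float) -> float:
    """Fraction of each tick spent on fixed Python overhead."""
    return OVERHEAD_US / (work_us + OVERHEAD_US)

print(f"50 ms ML inference tick: {overhead_fraction(50_000):.4%} overhead")
print(f"100 us control tick:     {overhead_fraction(100):.1%} overhead")
```

For the ML tick the overhead is a rounding error; for the 100μs control loop it eats more than a tenth of the budget, which is why the table routes fast control loops to Rust.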


compute=True for CPU-Bound Nodes

For CPU-heavy Python nodes (ML inference, path planning), use compute=True to run on a thread pool:

detector = horus.Node(
    name="yolo",
    tick=detect_tick,
    rate=30,
    compute=True,       # Runs on worker thread, not main tick loop
    on_miss="skip",
)

What this does: The node runs on a separate thread. The GIL is still acquired for tick(), but it doesn't block the main scheduler loop — other nodes tick on time.

What this doesn't do: It doesn't bypass the GIL. Two compute=True Python nodes still serialize through the GIL. For true parallelism in Python, use multi-process with horus launch.
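To see why separate processes restore parallelism, here is a generic stdlib sketch: each worker process has its own interpreter and its own GIL, so CPU-bound work runs concurrently. (This uses multiprocessing purely for illustration; horus launch is the framework's own multi-process mechanism.)

```python
from multiprocessing import Pool

def heavy(n: int) -> int:
    # CPU-bound work. In a Pool, each call runs in a separate process
    # with its own interpreter and GIL, so two calls truly overlap.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        results = pool.map(heavy, [100_000, 200_000])
    print(results)
```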


GC Pauses

Python's garbage collector can cause tick jitter:

GC Impact           Typical Duration
Gen 0 collection    ~50-200μs
Gen 1 collection    ~500μs-2ms
Gen 2 collection    ~5-50ms

Mitigation

import gc

def init(node):
    # Disable automatic GC — run manually between ticks
    gc.disable()

def tick(node):
    do_work()

    # Run GC only when budget allows
    if horus.budget_remaining() > 5 * horus.ms:
        gc.collect(0)  # Gen 0 only (~100μs)

For hard-RT Python nodes, disable GC entirely and manage memory manually (pre-allocate buffers, reuse objects).
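One concrete pattern for this uses CPython's gc module directly: collect once after init-time allocations, freeze the survivors so no future collection ever traverses them, then disable automatic GC. (A sketch; gc.freeze() requires Python 3.7+.)

```python
import gc

def init(node):
    # ... all one-time allocations happen first (buffers, models, ...) ...
    gc.collect()   # clean up init-time garbage in one go
    gc.freeze()    # move survivors to the permanent generation (3.7+)
    gc.disable()   # no automatic collections while ticking
```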


Optimization Patterns

Pre-allocate Outside tick()

import numpy as np

# BAD: allocate every tick
def tick(node):
    buffer = np.zeros((640, 480, 3))  # ~1ms allocation
    process(buffer)

# GOOD: allocate once in init
buffer = None

def init(node):
    global buffer
    buffer = np.zeros((640, 480, 3))  # One-time cost

def tick(node):
    process(buffer)  # Reuse buffer
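Pre-allocation pairs naturally with NumPy's out= parameter, which writes results into the existing buffer instead of allocating a fresh array each tick (standard NumPy ufunc behavior; a sketch):

```python
import numpy as np

# Allocated once, reused every tick.
buffer = np.zeros((640, 480, 3))
frame = np.ones((640, 480, 3))

def tick(node):
    # Writes the result into `buffer` in place: zero per-tick allocation.
    np.multiply(frame, 0.5, out=buffer)
```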

Use Typed Messages, Not Dicts

# SLOW: ~6-50μs (MessagePack serialization)
node.send("cmd_vel", {"linear": 1.0, "angular": 0.5})

# FAST: ~1.7μs (zero-copy POD)
node.send("cmd_vel", horus.CmdVel(linear=1.0, angular=0.5))

Use DLPack for Images

import numpy as np

# SLOW: ~14μs (data copy)
frame = np.array(img)

# FAST: ~1.1μs (zero-copy view)
frame = np.from_dlpack(img)

Measuring Tick Performance

import horus
import time

tick_times = []

def profiled_tick(node):
    start = time.perf_counter_ns()

    # Your actual work here
    do_work()

    elapsed_us = (time.perf_counter_ns() - start) / 1000
    tick_times.append(elapsed_us)

    if len(tick_times) % 1000 == 0:
        avg = sum(tick_times[-1000:]) / 1000
        p99 = sorted(tick_times[-1000:])[990]
        node.log_info(f"Tick avg: {avg:.1f}μs, p99: {p99:.1f}μs")

Or use the scheduler's built-in stats:

sched = horus.Scheduler(tick_rate=100)
sched.add(my_node)
sched.run(duration=10.0)

stats = sched.get_node_stats("my_node")
print(f"Avg tick: {stats.get('avg_tick_duration_ms', 0):.2f}ms")

See Also