Production Deployment (Python)

Your Python HORUS nodes work on your laptop. Now they need to run 24/7 on a robot with no keyboard, no monitor, and nobody watching. This page covers virtual environments, dependency pinning, systemd services, logging, monitoring, garbage collection tuning, memory profiling, and the decision of what stays in Python versus what gets rewritten.


Virtual Environment Setup

Always isolate HORUS Python nodes in a virtual environment. System Python packages drift between OS updates and break silently.

# Create a dedicated venv for your project
python3 -m venv /opt/myrobot/venv

# Activate and install horus
source /opt/myrobot/venv/bin/activate
pip install maturin
cd /path/to/horus/horus_py
maturin develop --release

# Install your project dependencies
pip install -r requirements.txt

venv in horus.toml Projects

If you are using horus.toml for project management, HORUS generates a .horus/pyproject.toml from your manifest. The venv still works -- install the generated project after activation:

source /opt/myrobot/venv/bin/activate
cd /path/to/your/project
horus build  # generates .horus/pyproject.toml from horus.toml
pip install -e .horus/

Dependency Pinning

Pin every dependency version. An unpinned numpy upgrade at 3 AM will crash your robot at 3:01 AM.

requirements.txt

numpy==1.26.4
opencv-python-headless==4.9.0.80
torch==2.2.1+cpu
onnxruntime==1.17.1
scipy==1.12.0

Generate from your working environment:

pip freeze > requirements.txt

horus.toml

For HORUS-managed projects, pin in the manifest:

[dependencies]
numpy = { version = "1.26.4", source = "pypi" }
opencv-python-headless = { version = "4.9.0.80", source = "pypi" }
torch = { version = "2.2.1+cpu", source = "pypi" }

HORUS generates horus.lock (lockfile v3) with exact resolved versions for reproducible installs across machines.

CPU-Only PyTorch

Production robots rarely have datacenter GPUs. Use the CPU-only torch build to save 2 GB of disk and avoid CUDA driver version mismatches:

pip install torch==2.2.1+cpu --index-url https://download.pytorch.org/whl/cpu

systemd Service Files

Run HORUS Python nodes as systemd services for automatic restart, logging, and boot-time startup.

Basic Service

# /etc/systemd/system/horus-myrobot.service
[Unit]
Description=HORUS MyRobot Nodes
After=network.target

[Service]
Type=simple
User=robot
Group=robot
WorkingDirectory=/opt/myrobot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u main.py
Restart=on-failure
RestartSec=3
StandardOutput=journal
StandardError=journal

# Shared memory access
SupplementaryGroups=

# Real-time scheduling (optional)
LimitMEMLOCK=infinity
LimitRTPRIO=99

[Install]
WantedBy=multi-user.target

Key Settings

  • Type=simple (required) -- HORUS blocks on horus.run()
  • User=robot (dedicated user) -- never run as root in production
  • -u flag on Python (required) -- unbuffered output, so journald gets logs immediately
  • Restart=on-failure (auto-restart) -- systemd restarts the process if it exits non-zero
  • RestartSec=3 (3-second delay) -- prevents restart loops from burning CPU
  • LimitMEMLOCK=infinity (RT nodes only) -- allows memory locking to prevent page faults
  • LimitRTPRIO=99 (RT nodes only) -- allows real-time scheduling priority

Enable and Start

sudo systemctl daemon-reload
sudo systemctl enable horus-myrobot.service
sudo systemctl start horus-myrobot.service

# Check status
sudo systemctl status horus-myrobot.service

# View logs
journalctl -u horus-myrobot.service -f

Multi-Node Service with Separate Processes

For process isolation, run each node as its own service:

# /etc/systemd/system/horus-camera.service
[Unit]
Description=HORUS Camera Node
After=network.target

[Service]
Type=simple
User=robot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u nodes/camera_node.py
Restart=on-failure
RestartSec=2

[Install]
WantedBy=horus-myrobot.target

# /etc/systemd/system/horus-planner.service
[Unit]
Description=HORUS Planner Node
After=horus-camera.service

[Service]
Type=simple
User=robot
Environment=PATH=/opt/myrobot/venv/bin:/usr/local/bin:/usr/bin
ExecStart=/opt/myrobot/venv/bin/python -u nodes/planner_node.py
Restart=on-failure
RestartSec=2

[Install]
WantedBy=horus-myrobot.target

Use After= to express startup order between nodes. Use a shared .target to start/stop the entire robot stack as one unit.
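A target is itself a small unit file. A minimal sketch for the stack above (the file name matches the horus-myrobot naming used in these examples):

```ini
# /etc/systemd/system/horus-myrobot.target
[Unit]
Description=HORUS MyRobot Stack

[Install]
WantedBy=multi-user.target
```

Enabling each service links it into the target via the WantedBy=horus-myrobot.target line in its [Install] section. If you also want systemctl stop horus-myrobot.target to take the services down, add PartOf=horus-myrobot.target to each service's [Unit] section -- PartOf= propagates stop and restart from the target to its members.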


Log Collection

HORUS nodes produce two streams of logs: structured logs from node.log_*() calls and standard output from print().

horus logs

The horus logs CLI command reads the structured log stream:

# Follow logs from all running nodes
horus logs -f

# Filter by node name
horus logs -f --node camera

# Filter by level
horus logs -f --level warning

node.log_* Output

Inside your tick function, use the structured logging methods:

def tick(node):
    node.log_info("Frame processed")
    node.log_warning("Latency spike: 12ms")
    node.log_error("Motor timeout")
    node.log_debug("Raw encoder: 4821")

These go through the scheduler's logging pipeline, tagged with the node name and timestamp. They appear in horus logs and, when running under systemd, in the journal.

journald Integration

When running as a systemd service, all output (structured logs and print statements) goes to the journal:

# Live follow
journalctl -u horus-myrobot.service -f

# Last 100 lines
journalctl -u horus-myrobot.service -n 100

# Since last boot
journalctl -u horus-myrobot.service -b

# Export for analysis
journalctl -u horus-myrobot.service --output=json > logs.json
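The --output=json export is one JSON object per line, with fields such as MESSAGE and PRIORITY (syslog levels as numeric strings; lower numbers are more severe, 4 = warning). A sketch that filters warnings and errors out of such an export; the sample entries are made up for illustration:

```python
import json

# Three fabricated journal-export lines, one per JSON object
SAMPLE = "\n".join([
    '{"PRIORITY": "6", "MESSAGE": "Frame processed"}',
    '{"PRIORITY": "4", "MESSAGE": "Latency spike: 12ms"}',
    '{"PRIORITY": "3", "MESSAGE": "Motor timeout"}',
])

def filter_messages(lines, max_priority=4):
    """Keep entries at warning severity or worse (PRIORITY <= 4)."""
    out = []
    for line in lines:
        entry = json.loads(line)
        if int(entry.get("PRIORITY", 6)) <= max_priority:
            out.append(entry["MESSAGE"])
    return out

print(filter_messages(SAMPLE.splitlines()))
# → ['Latency spike: 12ms', 'Motor timeout']
```

The same function works on a real export: pass open("logs.json") as the lines iterable.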

Log Rotation

journald handles rotation automatically. For long-running deployments, configure retention:

# /etc/systemd/journald.conf.d/horus.conf
[Journal]
SystemMaxUse=500M
MaxRetentionSec=7d

Performance Tuning

What Stays in Python

Python is the right choice for nodes that are I/O-bound, compute-heavy-but-batchable, or change frequently:

  • ML inference -- PyTorch/ONNX ecosystem, GPU offload -- 10-30 Hz
  • Data logging -- I/O-bound (disk, database, network) -- 1-10 Hz
  • Path planning -- scipy/numpy, compute=True offloads to the thread pool -- 1-10 Hz
  • Visualization -- matplotlib, OpenCV display -- 1-30 Hz
  • HTTP/API integration -- aiohttp, async nodes handle I/O naturally -- 0.1-10 Hz
  • Prototyping -- fast iteration, no compile step -- any rate

When to Rewrite a Python Node in Another Language

Rewrite a node when Python becomes the bottleneck, not before. Profile first:

  • tick() consistently exceeds its budget -- CPU-bound work is too slow -- profile and optimize, then rewrite the hot path
  • Deadline misses under load -- GIL contention or GC pauses -- try gc.disable(); rewrite if misses persist
  • Memory growing unbounded -- Python object overhead -- profile with tracemalloc; rewrite if unfixable
  • Latency jitter >1 ms at >100 Hz -- Python overhead is inherent -- rewrite: Python cannot do sub-millisecond deterministic ticks

The practical threshold: If your node needs deterministic ticks above 100 Hz, or sub-millisecond jitter, rewrite it. Below that, Python is fine.
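Before deciding, measure the jitter you actually have. A standalone sketch using a plain time.sleep loop -- not the HORUS scheduler -- so it shows the floor that Python and the OS give you on your target hardware:

```python
import statistics
import time

def measure_jitter(rate_hz=100, ticks=200):
    """Run a fixed-rate sleep loop and report how late each wake-up is."""
    period_ns = 1_000_000_000 // rate_hz
    deadline = time.perf_counter_ns()
    errors_us = []
    for _ in range(ticks):
        deadline += period_ns
        now = time.perf_counter_ns()
        if deadline > now:
            time.sleep((deadline - now) / 1e9)
        # How far past the deadline did we actually wake up?
        errors_us.append((time.perf_counter_ns() - deadline) / 1000)
    return statistics.mean(errors_us), max(errors_us)

mean_us, worst_us = measure_jitter()
print(f"mean wake-up error: {mean_us:.1f} us, worst: {worst_us:.1f} us")
```

If the worst-case error already exceeds your budget at the target rate, no amount of tuning inside the node will save you.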


Monitoring Python Nodes

horus monitor

The horus monitor command shows a live dashboard of all running nodes:

horus monitor

This shows per-node tick rate, budget usage, deadline misses, error counts, and health state.

Programmatic Monitoring

Use the Scheduler API to query stats from within your code:

import horus

sched = horus.Scheduler(tick_rate=1000, rt=True)
sched.add(sensor_node)
sched.add(planner_node)

# Start in background or check after running
stats = sched.get_node_stats("sensor")
print(f"Total ticks: {stats['total_ticks']}")
print(f"Errors: {stats['errors_count']}")

safety_stats()

For safety-critical deployments, query the safety monitor:

safety = sched.safety_stats()
if safety:
    print(f"Watchdog: {safety}")
    # Returns dict with watchdog stats, deadline misses, health states

Health Checks

Build a health-check endpoint for external monitoring (Prometheus, Grafana, fleet manager):

import horus
import json
from http.server import HTTPServer, BaseHTTPRequestHandler

sched = None

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        stats = {}
        for name in ["camera", "planner", "motor"]:
            stats[name] = sched.get_node_stats(name)

        healthy = all(s.get("errors_count", 0) == 0 for s in stats.values())

        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(stats).encode())

    def log_message(self, format, *args):
        pass  # Suppress access logs

def start_health_server():
    server = HTTPServer(("0.0.0.0", 8080), HealthHandler)
    server.serve_forever()

Run the health server in a background thread or as a separate async node.
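A sketch of the background-thread option. Here get_stats() is a stand-in for real sched.get_node_stats() calls, and binding to port 0 asks the OS for any free port (use a fixed port such as 8080 in production):

```python
import json
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

def get_stats():
    # Stand-in for querying the scheduler; replace with sched.get_node_stats()
    return {"camera": {"errors_count": 0}, "planner": {"errors_count": 0}}

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        stats = get_stats()
        healthy = all(s.get("errors_count", 0) == 0 for s in stats.values())
        body = json.dumps(stats).encode()
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep journald output clean

server = HTTPServer(("127.0.0.1", 0), HealthHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"health endpoint on http://127.0.0.1:{port}/")
```

The daemon thread dies with the process, so systemd's restart logic still controls the node's lifecycle.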


Garbage Collection Tuning

Python's garbage collector introduces non-deterministic pauses. For nodes with timing constraints, tune or disable it.

Disable GC for Real-Time Nodes

If your node has a tight budget (sub-10ms) and allocates few objects per tick, disable GC entirely:

import gc
import horus

def init(node):
    gc.disable()
    node.log_info("GC disabled for RT node")

def tick(node):
    # Pre-allocated buffers only -- no new objects per tick
    cmd = node.recv("cmd_vel")
    if cmd:
        apply_command(cmd)

motor = horus.Node(
    name="motor",
    tick=tick,
    init=init,
    rate=1000,
    subs=["cmd_vel"],
    failure_policy="fatal",
)

Requirement: When GC is disabled, you must not create circular references. Use pre-allocated buffers, avoid closures that capture self, and avoid building data structures in tick(). If you leak memory with GC disabled, it is never reclaimed.
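Why cycles matter: reference counting frees acyclic garbage immediately, but a reference cycle is only ever reclaimed by the collector you just disabled. A minimal demonstration (the Sensor class is just an illustration):

```python
import gc
import weakref

gc.disable()

class Sensor:
    pass

a, b = Sensor(), Sensor()
a.peer = b  # a -> b
b.peer = a  # b -> a: a reference cycle
alive = weakref.ref(a)

del a, b
# Reference counting alone cannot free the cycle; with GC disabled,
# the pair stays resident forever.
assert alive() is not None

gc.enable()
gc.collect()  # only the cycle collector can reclaim it
assert alive() is None
print("cycle survived until gc.collect()")
```

With GC disabled, every cycle you create is a permanent leak; the pre-allocated-buffer discipline above exists precisely to keep cycles from forming.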

Tune GC Thresholds for Other Nodes

For nodes that allocate objects (ML inference, data processing), tune the thresholds instead of disabling:

import gc

def init(node):
    # Default: (700, 10, 10)
    # Raise gen0 threshold to reduce collection frequency
    gc.set_threshold(1500, 15, 15)
    node.log_info(f"GC thresholds: {gc.get_threshold()}")

Higher thresholds mean fewer GC pauses but higher peak memory usage. Measure both latency and memory for your workload.
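The cost of a single collection scales with the number of young objects created since the previous one, which is exactly what a higher gen0 threshold trades against pause length. A standalone sketch of that relationship (auto-GC is switched off so the allocation count is controlled):

```python
import gc
import time

def pause_after_allocating(n):
    """Time a gen-0 collection after n fresh allocations (auto-GC off)."""
    gc.disable()
    gc.collect()  # start from a clean slate
    junk = [{"i": i} for i in range(n)]
    start = time.perf_counter_ns()
    gc.collect(0)
    pause_us = (time.perf_counter_ns() - start) / 1000
    gc.enable()
    del junk
    return pause_us

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} young objects -> gen-0 pause {pause_after_allocating(n):.0f} us")
```

Run this on your target hardware to pick a threshold whose worst-case pause fits inside your tick budget.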

Manual GC Between Ticks

For the best control, disable automatic GC and trigger collection manually during idle periods:

import gc
import horus

gc.disable()

def tick(node):
    if not node.has_msg("camera.rgb"):
        # No frame to process -- good time to collect
        gc.collect(generation=0)  # Only gen0, fast (~100us)
        return

    frame = node.recv("camera.rgb")
    detect(frame)

Memory Profiling

tracemalloc for Leak Detection

Python nodes running for days can leak memory through accumulating references. Use tracemalloc to find the source:

import tracemalloc
import horus

tracemalloc.start(10)  # Keep 10 frames of traceback

tick_count = 0
baseline = None

def tick(node):
    global tick_count, baseline
    tick_count += 1

    # Normal work
    process_data(node)

    # Snapshot every 10000 ticks
    if tick_count % 10000 == 0:
        snapshot = tracemalloc.take_snapshot()
        if baseline is None:
            baseline = snapshot
        else:
            stats = snapshot.compare_to(baseline, "lineno")
            for stat in stats[:5]:
                node.log_warning(f"Memory growth: {stat}")

What to Look For

  • Steady growth in one file/line -- a list or dict accumulating entries -- cap its size or use collections.deque(maxlen=N)
  • Growth in node.recv() calls -- holding references to old messages -- process and discard; do not store
  • Growth in json.loads() -- string interning or dict caching -- use msgpack or typed messages instead
  • Growth in a third-party library -- library-internal caching -- check the library docs for cache control
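The first fix in miniature: a deque with maxlen evicts the oldest entry on every append, so the buffer's footprint stays bounded no matter how long the node runs:

```python
from collections import deque

# Bounded history: old entries are dropped automatically instead of
# accumulating forever (the "steady growth in one list" pattern above).
history = deque(maxlen=100)

for i in range(10_000):
    history.append({"tick": i})

print(len(history))        # → 100
print(history[0]["tick"])  # → 9900 (oldest surviving entry)
```

An unbounded list here would hold 10,000 dicts and keep growing; the deque never holds more than 100.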

Resource Monitoring

Monitor system resources from within a node:

import os
import resource
import horus

def monitor_tick(node):
    # Peak RSS (resident set size) in MB
    usage = resource.getrusage(resource.RUSAGE_SELF)
    rss_mb = usage.ru_maxrss / 1024  # Linux reports ru_maxrss in KB; macOS reports bytes

    node.send("diagnostics.memory", {
        "rss_mb": rss_mb,
        "pid": os.getpid(),
    })

    if rss_mb > 500:
        node.log_warning(f"High memory: {rss_mb:.0f} MB")

monitor = horus.Node(
    name="resource_monitor",
    tick=monitor_tick,
    rate=1,
    pubs=["diagnostics.memory"],
)

Mixed Deployments

The most effective production architectures combine Python and other HORUS-supported languages. Each language handles what it does best, communicating through zero-copy shared memory topics.

Typical Architecture

Camera Driver (high-freq, safety) ──→ camera.rgb topic
ML Inference (Python, PyTorch)    ←── camera.rgb topic
                                  ──→ detections topic
Path Planner (Python, scipy)      ←── detections topic
                                  ──→ path topic
Motor Controller (high-freq, RT)  ←── path topic
                                  ──→ motor.status topic
Safety Monitor (high-freq, RT)    ←── motor.status topic

Safety-critical nodes (camera driver, motor controller, safety monitor) benefit from compiled languages. Python handles ML inference and path planning where ecosystem libraries matter more than tick latency.

Running Together

Each process runs independently. They communicate through HORUS topics over shared memory:

# Terminal 1: compiled safety-critical nodes
horus run safety_stack

# Terminal 2: Python ML nodes
source /opt/myrobot/venv/bin/activate
python ml_nodes.py

# Terminal 3: Python planner
source /opt/myrobot/venv/bin/activate
python planner.py

Or use systemd to manage all processes:

# /etc/systemd/system/horus-safety.service
[Service]
ExecStart=/usr/local/bin/horus run safety_stack

# /etc/systemd/system/horus-ml.service
[Service]
ExecStart=/opt/myrobot/venv/bin/python -u ml_nodes.py

# /etc/systemd/system/horus-planner.service
[Service]
ExecStart=/opt/myrobot/venv/bin/python -u planner.py

The Handoff Pattern

When a Python prototype node gets promoted to production, the topic interface stays the same. Only the implementation changes:

# Python prototype (runs at 30 Hz, good enough for testing)
def planner_tick(node):
    scan = node.recv("lidar.scan")
    if scan:
        path = compute_path(scan)  # scipy A*
        node.send("path", path)

The compiled replacement subscribes to the same topics and publishes the same messages. No other node needs to change. This is the key benefit of topic-based IPC: language boundaries are invisible to the rest of the system.


Pre-Deployment Checklist

Before shipping Python nodes to production:

  • All dependencies pinned in requirements.txt or horus.toml
  • Virtual environment created and tested on target hardware
  • systemd service file with Restart=on-failure
  • failure_policy set on every node (not relying on defaults)
  • node.log_*() used instead of print() for operational messages
  • GC tuned or disabled for nodes with timing constraints
  • Memory profiled under sustained load (run for hours, check RSS)
  • horus monitor shows all nodes healthy under load
  • Health-check endpoint accessible for external monitoring
  • Shared memory cleaned before first deploy (horus clean --shm)

Design Decisions

Why venv instead of containers? Containers add overhead (cgroup management, overlay filesystem, network namespacing) that hurts real-time performance. Shared memory IPC between containers requires explicit --ipc=host flags that defeat isolation. A virtual environment gives dependency isolation without the performance or IPC penalty. Use containers for CI/CD and development, not for production robots.

Why systemd instead of a HORUS-native process manager? systemd is battle-tested, ships with every Linux distribution, integrates with journald for logging, and supports cgroup resource limits. Building a custom process manager would duplicate all of this poorly. The HORUS scheduler manages node execution within a process; systemd manages processes within the system. Each tool does what it does best.

Why not auto-detect which nodes need GC tuning? Garbage collection impact depends on allocation patterns, object lifetimes, and timing requirements -- all application-specific. A node publishing pre-allocated IMU structs at 1000 Hz needs GC disabled. A node building detection lists at 10 Hz needs GC enabled. There is no heuristic that works for both. Explicit tuning by the developer is the only reliable approach.

Trade-offs

Python for ML vs compiled inference: Python gives you the full PyTorch/ONNX/HuggingFace ecosystem. Compiled inference (ONNX Runtime C++, TensorRT) gives lower latency and no GIL. For most robotics workloads, Python inference at 10-30 Hz is fast enough. Rewrite when profiling shows that Python overhead (not model inference) is the bottleneck.

Single process vs multi-process: Running all Python nodes in one process (one horus.run() call) shares the GIL. Running each node as a separate process (separate systemd services) avoids GIL contention but uses more memory and loses in-process topic shortcuts. Single process is simpler to deploy. Multi-process scales better when you have CPU-bound Python nodes competing for the GIL.

gc.disable() vs gc.set_threshold(): Disabling GC eliminates pauses completely but risks memory leaks if you create circular references. Tuning thresholds reduces pause frequency without eliminating leaks. For nodes with pre-allocated buffers and no circular references, disable. For nodes that build temporary data structures, tune thresholds. When in doubt, tune rather than disable -- a slow leak is easier to debug than a mysterious OOM after 48 hours.

Pinned versions vs version ranges: Pinned versions (numpy==1.26.4) guarantee reproducibility but require manual updates for security patches. Version ranges (numpy>=1.26,<1.27) allow patch updates but risk behavior changes. For production robots, pin everything. Run pip install --upgrade in CI, run your test suite, and pin the new versions explicitly.


See Also