Hot-Reload & Self-Healing

Live module swap with state migration, escalating recovery, and self-healing manager.

One of Helix's most distinctive features is the ability to replace kernel modules at runtime without rebooting, combined with an automated self-healing system that detects failures and attempts recovery. Together, these features let the kernel repair itself: a crashed scheduler is restarted and a buggy driver is hot-swapped for a fresh instance, all while the rest of the system keeps running.

HotReloadableModule Trait

Any module that supports live replacement must implement this trait. The key methods are export_state and import_state: export_state snapshots the old module's runtime state, and import_state restores that snapshot in the replacement.

core/src/hotreload/mod.rs
rust
pub trait HotReloadableModule: Send + Sync {
    fn name(&self) -> &'static str;
    fn version(&self) -> ModuleVersion;
    fn category(&self) -> ModuleCategory;

    // Lifecycle
    fn init(&mut self) -> Result<(), HotReloadError>;
    fn prepare_unload(&mut self) -> Result<(), HotReloadError>;

    // Hot-reload protocol: these enable live replacement
    fn export_state(&self) -> Option<Box<dyn ModuleState>>; // Snapshot state
    fn import_state(&mut self, state: &dyn ModuleState) // Restore state
        -> Result<(), HotReloadError>;

    fn can_unload(&self) -> bool; // Safe to remove right now?
    fn as_any(&self) -> &dyn Any; // Downcast support
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

#[repr(u8)]
pub enum ModuleCategory {
    Scheduler = 0, MemoryAllocator = 1, Filesystem = 2,
    Driver = 3, Network = 4, Security = 5,
    Ipc = 6, Custom = 255,
}
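
To make the state-transfer half of the trait concrete, here is a minimal sketch of a module that snapshots a counter and restores it in its replacement. It is not Helix code: StateTransfer is a cut-down stand-in for HotReloadableModule, and the shape of ModuleState, the StateMismatch error variant, CounterState and CounterModule are all assumptions made for illustration.

```rust
use std::any::Any;

// Hypothetical stand-ins for the real Helix types, trimmed for illustration.
pub trait ModuleState: Any {
    fn as_any(&self) -> &dyn Any;
}

#[derive(Debug, PartialEq)]
pub enum HotReloadError {
    StateMismatch, // assumed variant: snapshot has an unexpected type
}

// Reduced version of HotReloadableModule: only the state-transfer methods.
pub trait StateTransfer {
    fn export_state(&self) -> Option<Box<dyn ModuleState>>;
    fn import_state(&mut self, state: &dyn ModuleState) -> Result<(), HotReloadError>;
}

// The snapshot a counter module hands to its replacement.
#[derive(Debug, Clone, PartialEq)]
pub struct CounterState {
    pub ticks: u64,
}

impl ModuleState for CounterState {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

#[derive(Default)]
pub struct CounterModule {
    pub ticks: u64,
}

impl StateTransfer for CounterModule {
    fn export_state(&self) -> Option<Box<dyn ModuleState>> {
        // Copy the runtime state into an owned snapshot.
        Some(Box::new(CounterState { ticks: self.ticks }))
    }

    fn import_state(&mut self, state: &dyn ModuleState) -> Result<(), HotReloadError> {
        // Downcast to the concrete snapshot type; reject anything else.
        let snapshot = state
            .as_any()
            .downcast_ref::<CounterState>()
            .ok_or(HotReloadError::StateMismatch)?;
        self.ticks = snapshot.ticks;
        Ok(())
    }
}
```

Exporting from an old CounterModule and importing into a fresh one carries the tick count across. Passing the snapshot as a trait object keeps the registry independent of any concrete state type, at the cost of a runtime downcast in import_state.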

Hot-Reload Registry

The registry manages slots: positions, each identified by a SlotId and tagged with a ModuleCategory, where a module can be loaded. Each slot holds one active module at a time. The hot_swap method is the core operation. It atomically replaces the old module with a new one, transferring state in the process.

core/src/hotreload/mod.rs
rust
pub struct HotReloadRegistry { /* ... */ }

impl HotReloadRegistry {
    pub fn create_slot(&self, category: ModuleCategory) -> SlotId;
    pub fn load_module(&self, slot: SlotId,
        module: Box<dyn HotReloadableModule>) -> Result<(), HotReloadError>;
    pub fn unload_module(&self, slot: SlotId) -> Result<(), HotReloadError>;
    pub fn slot_status(&self, slot: SlotId) -> Option<SlotStatus>;

    // Core operation: atomic live swap with state transfer
    pub fn hot_swap(&self, slot: SlotId,
        new: Box<dyn HotReloadableModule>) -> Result<(), HotReloadError>;

    // Access the active module safely (typed via downcast)
    pub fn with_module<T: 'static, F, R>(&self, slot: SlotId, f: F) -> Option<R>
        where F: FnOnce(&T) -> R;
    pub fn with_module_mut<T: 'static, F, R>(&self, slot: SlotId, f: F) -> Option<R>
        where F: FnOnce(&mut T) -> R;

    pub fn list_slots(&self)
        -> Vec<(SlotId, ModuleCategory, SlotStatus, Option<&'static str>)>;
}

#[repr(u8)]
pub enum SlotStatus {
    Empty = 0, Loading = 1, Active = 2,
    Unloading = 3, Swapping = 4, Failed = 5,
}
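
The typed accessors rely on the as_any and as_any_mut hooks from the module trait. The sketch below shows one way such an accessor could work for a single slot. It is an illustration under stated assumptions, not the registry's actual implementation: the Module trait, the Slot type and its Mutex-based storage, and the RoundRobin and Priority structs are all hypothetical.

```rust
use std::any::Any;
use std::sync::Mutex;

// Hypothetical, trimmed module trait: only the downcast hooks.
pub trait Module {
    fn as_any(&self) -> &dyn Any;
    fn as_any_mut(&mut self) -> &mut dyn Any;
}

// Hypothetical single slot; the real registry manages many of these.
pub struct Slot {
    active: Mutex<Option<Box<dyn Module>>>,
}

impl Slot {
    pub fn new() -> Self {
        Slot { active: Mutex::new(None) }
    }

    pub fn load(&self, module: Box<dyn Module>) {
        *self.active.lock().unwrap() = Some(module);
    }

    // Runs `f` on the active module if the slot is occupied and the module
    // really is a `T`; returns None otherwise.
    pub fn with_module<T: 'static, F, R>(&self, f: F) -> Option<R>
    where
        F: FnOnce(&T) -> R,
    {
        let guard = self.active.lock().unwrap();
        let module = guard.as_deref()?;
        module.as_any().downcast_ref::<T>().map(f)
    }

    // Same as with_module, but hands out mutable access.
    pub fn with_module_mut<T: 'static, F, R>(&self, f: F) -> Option<R>
    where
        F: FnOnce(&mut T) -> R,
    {
        let mut guard = self.active.lock().unwrap();
        let module = guard.as_deref_mut()?;
        module.as_any_mut().downcast_mut::<T>().map(f)
    }
}

// Two toy modules to show that a wrong type yields None.
pub struct RoundRobin {
    pub quantum_ms: u32,
}

pub struct Priority {
    pub levels: u8,
}

impl Module for RoundRobin {
    fn as_any(&self) -> &dyn Any { self }
    fn as_any_mut(&mut self) -> &mut dyn Any { self }
}

impl Module for Priority {
    fn as_any(&self) -> &dyn Any { self }
    fn as_any_mut(&mut self) -> &mut dyn Any { self }
}
```

Because the closure runs while the lock is held, the caller never keeps a reference to the module past the call. That property is what lets a swap replace the slot's contents without leaving a dangling reference behind.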

Hot-Swap Protocol

The hot_swap method follows a five-step protocol. If new.init() fails in step 3, the system rolls back to the old module. If state migration fails in step 4, the swap still completes and the new module starts with fresh state.

[Figure: Hot-Swap Protocol. Flow diagram of the swap steps, with a rollback path on failure.]
| Step | Action | On Failure |
|------|--------|------------|
| 1 | Call old.export_state() to snapshot runtime state | Continue without state |
| 2 | Call old.prepare_unload() to drain pending work | Continue (committed to swap) |
| 3 | Call new.init() to initialize the new module | Rollback: restore old module |
| 4 | Call new.import_state(state) to migrate state | Continue (new module starts fresh) |
| 5 | Activate new module; slot becomes Active | |
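
The five steps and their failure handling can be sketched as a single function. This is a simplified model rather than the registry's code: the Module trait is a trimmed stand-in, state is reduced to a plain u64, and ModuleError, SwapError and Counter are hypothetical names. Re-running init() on the old module during rollback is also an assumption about how "restore old module" works.

```rust
// Hypothetical, trimmed stand-ins for the Helix types (not the real API).
#[derive(Debug, PartialEq)]
pub struct ModuleError;

#[derive(Debug, PartialEq)]
pub enum SwapError {
    InitFailed, // step 3 failed; the old module was kept
}

pub trait Module {
    fn init(&mut self) -> Result<(), ModuleError>;
    fn prepare_unload(&mut self) -> Result<(), ModuleError>;
    fn export_state(&self) -> Option<u64>; // state reduced to a plain u64
    fn import_state(&mut self, state: u64) -> Result<(), ModuleError>;
}

/// Swaps `new` into `slot`. On `Err`, the old module is still in the slot.
pub fn hot_swap(slot: &mut Box<dyn Module>, mut new: Box<dyn Module>) -> Result<(), SwapError> {
    // Step 1: snapshot the old module's state; None means "continue without state".
    let state = slot.export_state();

    // Step 2: drain pending work; a failure here does not abort the swap.
    let _ = slot.prepare_unload();

    // Step 3: initialize the replacement; on failure keep the old module.
    if new.init().is_err() {
        // Assumption: re-running init() is how the old module resumes.
        let _ = slot.init();
        return Err(SwapError::InitFailed);
    }

    // Step 4: migrate state; on failure the new module simply starts fresh.
    if let Some(snapshot) = state {
        let _ = new.import_state(snapshot);
    }

    // Step 5: activate the new module by replacing the slot contents.
    *slot = new;
    Ok(())
}

// Toy module used to exercise the sketch.
pub struct Counter {
    pub value: u64,
    pub fail_init: bool,
}

impl Module for Counter {
    fn init(&mut self) -> Result<(), ModuleError> {
        if self.fail_init { Err(ModuleError) } else { Ok(()) }
    }
    fn prepare_unload(&mut self) -> Result<(), ModuleError> {
        Ok(())
    }
    fn export_state(&self) -> Option<u64> {
        Some(self.value)
    }
    fn import_state(&mut self, state: u64) -> Result<(), ModuleError> {
        self.value = state;
        Ok(())
    }
}
```

Only step 3 can abort the swap. Failures in steps 1, 2 and 4 are deliberately swallowed, matching the "Continue" entries in the table, so the function reports an error only when the old module is still the active one.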

Self-Healing Manager

The self-healing manager monitors all registered components and automatically attempts recovery when failures are detected. It uses an escalating recovery strategy: an in-place restart first, then failover to a fresh instance built by the registered factory, and finally escalation to a kernel panic.

core/src/selfheal.rs
rust
pub struct SelfHealingManager { /* ... */ }

impl SelfHealingManager {
    pub const fn new() -> Self; // No config needed
    pub fn register(&self, slot_id: SlotId, // Register by slot
        name: &str, factory: Option<ModuleFactory>);
    pub fn report_failure(&self, slot_id: SlotId);
    pub fn tick(&self); // Called on timer
    pub fn stats(&self) -> RecoveryStats;
    pub fn events(&self) -> Vec<RecoveryEvent>;
}

#[repr(u8)]
pub enum HealthStatus {
    Healthy = 0, // Responding normally
    Degraded = 1, // Functional but impaired
    Unresponsive = 2, // Potential hang
    Crashed = 3, // Module crashed
    Recovering = 4, // Recovery in progress
    Unknown = 255, // Not monitored
}

pub enum RecoveryAction {
    None, // No action needed
    Restart, // Re-init the module in-place
    Failover, // Replace via factory function
    Panic, // Unrecoverable: escalate to kernel panic
}

pub struct RecoveryEvent {
    pub tick: u64,
    pub slot_id: SlotId,
    pub module_name: String,
    pub event_type: RecoveryEventType,
    pub success: bool,
}

Escalating Recovery Strategy

The manager tracks how many times each component has crashed and escalates the recovery strategy automatically.

[Figure: Escalating Recovery. Flow diagram of the recovery action taken at each successive crash.]
| Crash Count | Action | What Happens |
|-------------|--------|--------------|
| 1st | Restart | Re-init the module in place |
| 2nd | Restart | Second attempt |
| 3rd | Failover | Replace via factory function (fresh instance) |
| 4th+ | Panic | Unrecoverable: escalate to kernel panic handler |
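
The table reduces to a small decision function. The sketch below mirrors the RecoveryAction variants shown earlier, but the function name action_for and its has_factory parameter are hypothetical. How the manager behaves when no factory was registered is not documented here, so escalating straight to Panic in that case is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryAction {
    None,     // No action needed
    Restart,  // Re-init the module in place
    Failover, // Replace via factory function
    Panic,    // Unrecoverable
}

/// Picks the action for the `crash_count`-th crash of one component
/// (1 = first crash). `has_factory` says whether a factory was registered.
pub fn action_for(crash_count: u32, has_factory: bool) -> RecoveryAction {
    match crash_count {
        0 => RecoveryAction::None,
        // First and second crash: restart in place.
        1 | 2 => RecoveryAction::Restart,
        // Third crash: build a fresh instance, if a factory exists.
        3 if has_factory => RecoveryAction::Failover,
        // Assumption: with no factory there is nothing to fail over to,
        // so the third crash escalates like the fourth and later ones.
        _ => RecoveryAction::Panic,
    }
}
```

For a component registered with a factory, crashes one through four map to Restart, Restart, Failover and Panic. Keeping the policy in a pure function like this makes the escalation ladder easy to test separately from the timer-driven tick loop.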

Integration Example

Here's how the self-healing system integrates with the rest of the kernel in practice.

core/src/selfheal.rs
rust
// During kernel initialization
let heal = SelfHealingManager::new(); // const fn, no config needed

// Register modules by slot ID + optional factory for replacement
heal.register(sched_slot, "scheduler", Some(|| Box::new(RoundRobin::new())));
heal.register(fs_slot, "filesystem", None); // No auto-replacement

// In the timer interrupt handler:
// heal.tick() runs automatically and:
//   1. Checks health of each registered slot
//   2. If failed → attempt recovery (Restart / Failover / Panic)
//   3. Logs RecoveryEvents and updates stats

// Query recovery stats at any time
let stats = heal.stats();
log::info!("Health: {}%, recoveries: {}/{}",
    stats.system_health,
    stats.successful_recoveries,
    stats.failures_detected,
);

Live Scheduler Swap Example

The most common hot-reload use case is swapping schedulers at runtime. This lets you change scheduling strategy without rebooting — switch from round-robin to priority-based when the workload changes.

core/src/hotreload/schedulers.rs
rust
// Built-in scheduler implementations
pub struct RoundRobinScheduler { /* ... */ }
pub struct PriorityScheduler { /* ... */ }

// Both implement HotReloadableModule + Scheduler

// Create a slot for the scheduler (category only)
let slot = registry.create_slot(ModuleCategory::Scheduler);

// Load the initial scheduler
registry.load_module(slot, Box::new(RoundRobinScheduler::new()))?;

// Later: swap to priority scheduling at runtime
// State is exported via ModuleState, migrated to the new module
registry.hot_swap(slot, Box::new(PriorityScheduler::new()))?;

The combination of hot-reload and self-healing means Helix can survive failures that would crash a traditional kernel. A buggy driver is detected, a fresh copy is loaded, state is restored where migration succeeds, and the interruption stays brief (on the order of 100 ms). This is the foundation for Helix's reliability story.