2026-05-06

A misleading kernel error and a 134,665-restart loop

Lux PC hard-rebooted. The kernel logged a hardware error. The hardware was fine.

#incident #comfyui #systemd #lux #postmortem

What happened

Tuesday May 6, 1:09 PM. Lux — my Linux PC — hard-rebooted with no clean shutdown. No kernel panic captured. pstore empty. Last journal entry was 1:08 PM mid-ComfyUI startup.

Yesterday (May 5, 6:34 PM) the kernel had logged "mce: [Hardware Error]: Machine check events logged", timestamped at the same moment as a ComfyUI exit. Easy to misread as failing hardware.

It wasn't.

Root cause

comfyui.service was set to Restart=on-failure with no restart rate limiting configured in the unit. ComfyUI couldn't start because a recent update added an import of comfy_aimdo.host_buffer, but the venv still had comfy-aimdo==0.1.8, which lacks that submodule.
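A missing submodule like this can be detected before the service ever starts. The following is a minimal sketch, not part of ComfyUI; module_available is a hypothetical helper name, and the only module name taken from the incident is comfy_aimdo.host_buffer.

```python
import importlib.util


def module_available(dotted_name: str) -> bool:
    """Return True if `dotted_name` can be located in the current environment.

    find_spec imports parent packages along the way, so a missing parent
    raises ModuleNotFoundError; that is treated the same as "not available".
    """
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except ModuleNotFoundError:
        return False


# Harmless demo against the standard library.
print(module_available("json.decoder"))      # a submodule that exists
print(module_available("json.no_such_sub"))  # parent exists, submodule does not
```

Run inside the ComfyUI venv, module_available("comfy_aimdo.host_buffer") would have returned False with comfy-aimdo==0.1.8 installed, which is a cheap thing to check after an update and before re-enabling the unit.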

Every start failed with ModuleNotFoundError: No module named 'comfy_aimdo.host_buffer'. systemd restarted it forever. The journal showed a restart counter of 134,665 across the previous boot. Within 36 minutes of the post-crash boot, the counter was already back up to 601.
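The numbers above also hint at why nothing stopped the loop on its own. This is back-of-envelope arithmetic, assuming the box runs systemd's stock start limit of 5 starts per 10 seconds (DefaultStartLimitBurst and DefaultStartLimitIntervalSec, unless overridden in system.conf):

```python
# Figures from the journal after the post-crash boot.
restarts = 601
elapsed_seconds = 36 * 60

# Average time for one fail-and-restart cycle.
seconds_per_cycle = elapsed_seconds / restarts

# Assumed stock systemd start limit: 5 starts per 10 s window.
DEFAULT_BURST = 5
DEFAULT_INTERVAL_S = 10

# How many starts fit into one default window at the observed rate.
starts_per_window = DEFAULT_INTERVAL_S / seconds_per_cycle

print(f"{seconds_per_cycle:.1f} s per cycle")
print(f"{starts_per_window:.1f} starts per {DEFAULT_INTERVAL_S} s window (limit {DEFAULT_BURST})")
```

At roughly 3.6 seconds per cycle, fewer than 3 starts land in any 10-second window, so a limit of 5 would never trip. Each attempt was slow enough to slip under the default, and fast enough to rack up six figures over a day.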

Each restart reloaded NVIDIA driver state. Yesterday's MCE was the GPU/CPU complaining mid-thrash. Eventually one restart wedged the kernel hard enough to need a power cycle.

The misleading parts

The hardware error in dmesg was the loudest signal. Naturally I thought "hardware is failing." Wrong axis. The MCE was a downstream symptom of the thrash, not the cause.

pstore was empty: no panic. That was actually a clue I missed. A real panic normally leaves a dump in pstore (assuming a pstore backend is enabled). A hard hang doesn't. Empty pstore + sudden reboot = something locked the kernel up, not crashed it.
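Checking pstore takes seconds and is worth doing first after any unexplained reboot. A minimal sketch, assuming the usual /sys/fs/pstore mount point; pstore_records is a hypothetical helper name, and reading the directory may need root:

```python
from pathlib import Path


def pstore_records(pstore_dir: str = "/sys/fs/pstore") -> list[str]:
    """List crash-record file names found in pstore.

    Returns [] when the directory is empty or does not exist
    (for example, when pstore is not mounted).
    """
    root = Path(pstore_dir)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir())
```

An empty list here after a sudden reboot points toward a hang rather than a panic. One caveat: on systems running systemd-pstore, records may have been moved to /var/lib/systemd/pstore at boot, so that directory is worth passing in as well.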

The fix

Three changes to comfyui.service:

[Unit]
# The StartLimit* settings belong in [Unit], not [Service]
StartLimitBurst=3
StartLimitIntervalSec=600

[Service]
RestartSec=30

Once the service has been started 3 times within 10 minutes, systemd refuses to start it again and leaves the unit failed. Logs the fact. Waits for me to look at it.
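To see how those three settings interact, here is a toy model of the start limit. It is a simplified fixed-window approximation written for illustration, not systemd's actual implementation, and the 4-second failure time is a hypothetical figure in line with the restart rate seen in the journal.

```python
def simulate_start_limit(start_times, burst=3, interval_s=600):
    """Return the start times that would be allowed under a fixed-window limit.

    Simplified model: a window opens at the first start, up to `burst`
    starts are allowed inside it, and a start arriving after the window
    has expired opens a fresh window. The first refused start ends the
    simulation, since the unit would then stay failed until someone
    intervenes.
    """
    allowed = []
    window_begin = None
    count = 0
    for t in start_times:
        if window_begin is None or t - window_begin > interval_s:
            window_begin = t
            count = 0
        if count >= burst:
            break
        count += 1
        allowed.append(t)
    return allowed


# Hypothetical timeline: each failed start takes ~4 s, then RestartSec=30.
cycle_s = 4 + 30
attempts = [i * cycle_s for i in range(10)]
print(simulate_start_limit(attempts))  # → [0, 34, 68]
```

Three attempts in the first 68 seconds, then nothing. Compare that with several hundred attempts in half an hour.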

Plus the actual dep upgrade: pip install comfy-aimdo==0.3.0 and a sync of requirements.txt against the latest ComfyUI.

What I'd do differently

Every systemd unit I write from now on gets RestartSec and StartLimitBurst defaults. No more thrash loops. The cost of rate-limiting restarts is "the service might stay down until I look at it when one more retry could have brought it back." The cost of NOT rate-limiting is "the kernel wedges and you reboot the box."

Easy choice.

What you can steal

If you have any service with Restart=on-failure:

  1. Add RestartSec=30 (or whatever's appropriate) under [Service]
  2. Add StartLimitBurst=3 and StartLimitIntervalSec=600 under [Unit]
  3. Log the failure. Don't silently retry forever.

This is one of those rules where the cost of having it is zero and the cost of not having it can be hours of debugging plus a rebooted machine.
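The checklist is easy to automate. Below is a rough sketch that scans the text of a unit file for the three directives; missing_restart_guards and REQUIRED are hypothetical names, and the check only looks at key names, not at which section they sit in.

```python
REQUIRED = ("RestartSec", "StartLimitBurst", "StartLimitIntervalSec")


def missing_restart_guards(unit_text: str) -> list[str]:
    """Return the rate-limit directives a unit lacks, or [] if none are needed.

    Only units that set Restart= to something other than "no" are checked.
    If a key appears more than once, the last value wins.
    """
    settings = {}
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, section headers, and anything without "=".
        if not line or line.startswith(("#", ";", "[")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    if settings.get("Restart", "no") == "no":
        return []
    return [name for name in REQUIRED if name not in settings]
```

Feeding it the output of systemctl cat for each unit would flag every service that can still retry forever.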