A misleading kernel error and a 134,665-restart loop
Lux PC hard-rebooted. The kernel logged a hardware error. The hardware was fine.
What happened
Tuesday May 6, 1:09 PM. Lux — my Linux PC — hard-rebooted with no clean shutdown. No kernel panic captured. pstore empty. Last journal entry was 1:08 PM mid-ComfyUI startup.
Yesterday (May 5, 6:34 PM) the kernel had logged "mce: [Hardware Error]: Machine check events logged" at the same time as a ComfyUI exit. Easy to misread as failing hardware.
It wasn't.
Root cause
comfyui.service was set to Restart=on-failure with no rate limiting of its own. ComfyUI couldn't start because a recent update added comfy_aimdo.host_buffer as an import requirement, but the venv had comfy-aimdo==0.1.8 (which lacks that submodule).
Every start failed with ModuleNotFoundError: No module named 'comfy_aimdo.host_buffer'. systemd restarted it forever. The journal showed a restart counter of 134,665 across the previous boot. Within 36 minutes of the post-crash boot, the counter was already back up to 601.
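A quick back-of-the-envelope check on those counters, as a sketch. It assumes the restart rate stayed roughly constant across both boots, which the journal figures alone don't prove. The "5 starts per 10 seconds" figure is systemd's stock start limit (DefaultStartLimitBurst and DefaultStartLimitIntervalSec); a distro or local config may override it.

```python
# Back-of-the-envelope arithmetic on the restart counters quoted above.
# Assumption: the restart rate was roughly constant across both boots.

restarts_after_crash = 601        # counter 36 minutes into the new boot
window_seconds = 36 * 60

seconds_per_restart = window_seconds / restarts_after_crash

restarts_previous_boot = 134_665
est_days = restarts_previous_boot * seconds_per_restart / 86_400

# systemd's stock start limit is 5 starts per 10 seconds.
starts_per_10s = 10 / seconds_per_restart

print(f"{seconds_per_restart:.1f} s per restart")
print(f"~{est_days:.1f} days of looping at that rate")
print(f"~{starts_per_10s:.1f} starts per 10 s")
```

At roughly one attempt every 3.6 seconds, the loop stays under 5 starts per 10 seconds, which may be why the built-in limit never tripped. At that pace, 134,665 restarts works out to several days of thrash.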
Each restart reloaded NVIDIA driver state. Yesterday's MCE was the GPU/CPU complaining mid-thrash. Eventually one restart wedged the kernel hard enough to need a power cycle.
The misleading parts
The hardware error in dmesg was the loudest signal. Naturally I thought "hardware is failing." Wrong axis. The MCE was a downstream symptom of the thrash, not the cause.
pstore was empty, so no panic had been recorded. That was actually a clue I missed. A real panic normally dumps to pstore (as long as a pstore backend is set up). A hard hang doesn't. Empty pstore + sudden reboot = something locked the kernel up, not crashed it.
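Checking for pstore dumps is quick to script. A minimal sketch, assuming the usual locations: /sys/fs/pstore is the kernel's pstore mount, and /var/lib/systemd/pstore is where systemd-pstore.service archives entries on systems that enable it. The function name is my own, and reading these directories generally needs root, so an empty result as a normal user proves nothing.

```python
# List crash dumps left behind by pstore, if any.
from pathlib import Path

# Assumed locations: the kernel's pstore mount, plus the archive directory
# used by systemd-pstore.service on systems that enable it.
DEFAULT_DIRS = ("/sys/fs/pstore", "/var/lib/systemd/pstore")

def pstore_entries(dirs=DEFAULT_DIRS):
    """Return sorted paths of all files found under the given directories."""
    found = []
    for d in dirs:
        root = Path(d)
        if root.is_dir():
            found.extend(str(p) for p in root.rglob("*") if p.is_file())
    return sorted(found)

if __name__ == "__main__":
    entries = pstore_entries()
    print("\n".join(entries) if entries else "no pstore entries found")
```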
The fix
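Three changes to comfyui.service. The two StartLimit settings go in the [Unit] section; RestartSec goes in [Service]:
[Unit]
StartLimitBurst=3
StartLimitIntervalSec=600
[Service]
RestartSec=30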
If the service fails 3 times in 10 minutes, systemd stops trying. Logs the fact. Waits for me to look at it.
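To see how the burst and interval interact with RestartSec, here is a simplified fixed-window model of a start limit. It is a sketch for intuition only: allowed_starts is a hypothetical helper, not systemd's implementation, and systemd's real bookkeeping may differ in detail.

```python
# Simplified fixed-window model of a start rate limit.
# Hypothetical helper; not systemd's actual implementation.

def allowed_starts(attempt_times, burst, interval):
    """Return the attempt times (in seconds) that the limiter lets through."""
    allowed = []
    window_start = None
    count = 0
    for t in attempt_times:
        # Open a fresh window on the first attempt or once the old one expires.
        if window_start is None or t - window_start > interval:
            window_start, count = t, 0
        if count < burst:
            count += 1
            allowed.append(t)
        else:
            # Limit hit: model the unit as failed, with no further restarts.
            break
    return allowed

# A service that fails instantly, retried every RestartSec=30 seconds.
attempts = [i * 30 for i in range(10)]
print(allowed_starts(attempts, burst=3, interval=600))  # [0, 30, 60]
```

In this model the service gets three tries in the first minute and then stays down. Run the same function with burst=5, interval=10 and one attempt every 3.6 seconds (36 minutes divided by 601 restarts) and nothing is ever blocked, because the window keeps expiring before the fifth start.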
Plus the actual dep upgrade: pip install comfy-aimdo==0.3.0 and a sync of requirements.txt against the latest ComfyUI.
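A cheap way to confirm the upgrade took is to ask the venv's interpreter whether the submodule can be found at all. A minimal sketch using the standard library's importlib.util.find_spec; has_module is a name I made up, and it has to be run with the same Python that ComfyUI uses.

```python
# Preflight check: can a dotted module name be located in this interpreter?
import importlib.util

def has_module(dotted_name):
    """True if dotted_name can be located; parent packages get imported."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # ImportError: a parent package is missing or is not a package.
        # ValueError: the module is loaded but has no usable spec.
        return False

if __name__ == "__main__":
    name = "comfy_aimdo.host_buffer"
    print(name, "found" if has_module(name) else "MISSING")
```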
What I'd do differently
Every systemd unit I write from now on gets RestartSec and StartLimitBurst defaults. No more thrash loops. The cost of rate-limiting restarts is "the service might stay down until I look at it when it could have come back in 30 seconds." The cost of NOT rate-limiting is "the kernel wedges and you reboot the box."
Easy choice.
What you can steal
If you have any service with Restart=on-failure:
- Add RestartSec=30 (or whatever's appropriate)
- Add StartLimitBurst=3 and StartLimitIntervalSec=600
- Log the failure. Don't silently retry forever.
This is one of those rules where the cost of having it is close to zero and the cost of not having it can be hours of debugging plus a rebooted machine.
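To find units that need this treatment, a small audit helper can flag any unit text that sets Restart= without the three guards. This is a line-based sketch, not a full unit-file parser: parse_unit and missing_restart_guards are hypothetical names, and it ignores line continuations and the legacy StartLimitInterval= spelling. It looks for the StartLimit settings in [Unit] and RestartSec in [Service]. Feeding it the output of systemctl cat should work in principle, since comment lines are skipped and later values overwrite earlier ones.

```python
# Flag unit files that auto-restart without explicit restart guards.
# Hypothetical helpers; a line-based sketch, not a full unit-file parser.

def parse_unit(text):
    """Map (section, key) -> last value seen in a unit file's text."""
    settings = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line and section is not None:
            key, value = line.split("=", 1)
            settings[(section, key.strip())] = value.strip()
    return settings

def missing_restart_guards(text):
    """Return the guard settings a restarting unit does not set explicitly."""
    s = parse_unit(text)
    if s.get(("Service", "Restart"), "no") == "no":
        return []  # unit never auto-restarts, nothing to guard
    wanted = [("Service", "RestartSec"),
              ("Unit", "StartLimitBurst"),
              ("Unit", "StartLimitIntervalSec")]
    return [f"[{sec}] {key}" for sec, key in wanted if (sec, key) not in s]

unit_text = """\
[Unit]
Description=Example service

[Service]
ExecStart=/usr/bin/example
Restart=on-failure
"""
print(missing_restart_guards(unit_text))
```

For the example unit this lists all three settings as missing; a unit without Restart=, or with all three guards set, comes back with an empty list.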