Husky crash recovery and startup sequence

This guide explains what the daemon does at startup, how it handles stale runtime artifacts, and how catchup and orphan reconciliation work.

Startup sequence overview

When the husky daemon run command starts, the daemon performs these steps in a fixed order:

load and validate husky.yaml
load and validate huskyd.yaml
build and validate the DAG
ensure the data directory exists
acquire the PID file, overwriting it if it is stale
open the SQLite store
configure auth and logging
construct the executor and scheduler
reconcile orphaned runs
reconcile missed schedules for catchup-enabled jobs
start the IPC server
start the HTTP API and embedded dashboard
begin periodic scheduler ticks and retention vacuuming

This ordering matters:

invalid config prevents the daemon from starting
stale PID detection happens before normal serving begins
orphan reconciliation and catchup happen before regular steady-state scheduling
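
The fail-fast ordering described above can be sketched as a list of named steps run in sequence. This is only an illustration: the step type, runStartup, and failing are hypothetical names, not Husky's actual internals, and the step bodies are stubs.

```go
package main

import (
	"errors"
	"fmt"
)

// step is a hypothetical named startup phase; the names mirror the list above.
type step struct {
	name string
	run  func() error
}

// failing returns a step body that always fails with msg (demo helper).
func failing(msg string) func() error {
	return func() error { return errors.New(msg) }
}

// runStartup executes steps in order and stops at the first failure,
// returning the names of the steps that completed before it.
func runStartup(steps []step) ([]string, error) {
	var done []string
	for _, s := range steps {
		if err := s.run(); err != nil {
			return done, fmt.Errorf("startup aborted at %q: %w", s.name, err)
		}
		done = append(done, s.name)
	}
	return done, nil
}

func main() {
	ok := func() error { return nil }
	steps := []step{
		{"load husky.yaml", ok},
		{"build DAG", failing("cycle detected")},
		{"reconcile orphans", ok},
		{"start IPC server", ok},
	}
	done, err := runStartup(steps)
	fmt.Println("completed:", done)
	fmt.Println("error:", err)
}
```

Because the loop returns on the first error, later steps such as orphan reconciliation or serving never run when an earlier validation step fails.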

PID file handling

The daemon stores its PID in:

<data>/husky.pid

Behavior:

if the PID file exists and the PID is alive, startup fails with a "daemon already running" error
if the PID file exists but the PID is dead, Husky logs a stale PID warning and overwrites it
the PID file is removed on normal shutdown

IPC socket and API address files

The daemon also manages:

husky.sock — local IPC socket for CLI control
api.addr — actual HTTP bind address written after the API listener starts

These runtime files are regenerated as part of normal startup.

Orphan reconciliation

What counts as an orphan

An orphan is a run left in RUNNING state because the daemon exited or crashed before it could finalize that run.

At startup, Husky queries all RUNNING runs and treats them as orphaned work from the previous daemon invocation.

What Husky does

For each orphaned run:

mark the run FAILED
log the orphan reconciliation event
inspect the job's retry policy
if retries remain, schedule a retry after the configured backoff delay

Important behavior:

the daemon does not assume the old process is still healthy
the orphaned run is finalized conservatively as failed work
retries continue using the job's configured retry policy

Catchup behavior

Catchup is controlled per job with catchup: true in husky.yaml.

At startup, for each catchup-enabled job, Husky checks job_state.next_run.

If next_run is in the past:

if the missed time is within scheduler.catchup_window, or no window is configured, Husky triggers the missed run immediately
if the missed time is outside the catchup window, Husky skips the stale execution and logs why

Why catchup_window exists

Without a window, a long outage could cause very old missed schedules to execute after restart. catchup_window prevents that backlog flood.

Example:

scheduler:
  catchup_window: "24h"

Hot reload and crash recovery

husky reload and SIGHUP do not interrupt running jobs.

Reload sequence:

load new husky.yaml
validate schema and semantics
rebuild the DAG
atomically swap config and graph if successful
keep the old config if validation fails

Bearer tokens from auth.bearer.token_file are also reloaded on SIGHUP.
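
One way to get the "validate first, then swap atomically" behavior of the reload sequence is an atomic pointer to an immutable config value. This sketch assumes Go 1.19 or later for atomic.Pointer; the config and daemon types and the validateJobs rule are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// config is a hypothetical stand-in for the parsed husky.yaml plus its DAG.
type config struct {
	Version int
	Jobs    []string
}

type daemon struct {
	current atomic.Pointer[config]
}

// reload validates the candidate first and publishes it with one atomic
// store, so readers see either the old config or the new one, never a mix.
func (d *daemon) reload(candidate *config, validate func(*config) error) error {
	if err := validate(candidate); err != nil {
		return fmt.Errorf("reload rejected, keeping current config: %w", err)
	}
	d.current.Store(candidate)
	return nil
}

// validateJobs is a placeholder for schema, semantic, and DAG validation.
func validateJobs(c *config) error {
	if len(c.Jobs) == 0 {
		return errors.New("no jobs defined")
	}
	return nil
}

func main() {
	d := &daemon{}
	d.current.Store(&config{Version: 1, Jobs: []string{"ingest"}})

	err := d.reload(&config{Version: 2}, validateJobs)
	fmt.Println("err:", err, "| active version:", d.current.Load().Version)

	err = d.reload(&config{Version: 3, Jobs: []string{"ingest", "report"}}, validateJobs)
	fmt.Println("err:", err, "| active version:", d.current.Load().Version)
}
```

Running jobs that already hold a pointer to the old config keep using it, which is one reason a swap like this does not interrupt them.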

Data durability

Husky uses SQLite in WAL mode and serializes writes through a single writer goroutine. This improves resilience under normal operation: readers are not blocked by the writer, and write-lock contention is avoided because only one goroutine ever writes.
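
The single-writer pattern can be sketched with a channel of write requests drained by one goroutine. This example deliberately leaves out SQLite and mutates an in-memory map instead; the store type and its methods are hypothetical names, not Husky's API.

```go
package main

import (
	"fmt"
	"sync"
)

// writeReq is one unit of work for the writer; done receives its result.
type writeReq struct {
	apply func() error
	done  chan error
}

// store funnels every write through one goroutine. In a real daemon apply
// would execute SQL; here it only touches in-memory state.
type store struct {
	reqs chan writeReq
	wg   sync.WaitGroup
}

func newStore() *store {
	s := &store{reqs: make(chan writeReq)}
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		for req := range s.reqs {
			req.done <- req.apply() // writes run strictly one at a time
		}
	}()
	return s
}

// write blocks until the writer goroutine has applied fn.
func (s *store) write(fn func() error) error {
	req := writeReq{apply: fn, done: make(chan error, 1)}
	s.reqs <- req
	return <-req.done
}

// close stops the writer after all queued writes have finished.
func (s *store) close() {
	close(s.reqs)
	s.wg.Wait()
}

func main() {
	s := newStore()
	statuses := map[int]string{} // only mutated inside write callbacks
	var wg sync.WaitGroup
	for id := 1; id <= 5; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			_ = s.write(func() error { statuses[id] = "FAILED"; return nil })
		}(id)
	}
	wg.Wait()
	s.close()
	fmt.Println(len(statuses), "rows written")
}
```

Callers may run concurrently, but the map needs no mutex because only the writer goroutine ever executes the callbacks.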

Recovery-relevant persisted data includes:

run status and attempts
next scheduled run timestamps
log lines
output variables by cycle_id
alert delivery history

Failure scenarios and expected outcomes

Daemon crashes while a job is running

Expected outcome on next startup:

run is marked failed
retry may be scheduled if configured
downstream jobs do not treat the orphaned run as success

Machine is down during a scheduled time

Expected outcome on next startup:

jobs with catchup: true may run immediately
jobs with catchup: false simply advance to the next future schedule
missed runs that fall outside the configured catchup window are skipped

Config reload introduces a cycle

Expected outcome:

reload is rejected
current running config remains active
running jobs continue uninterrupted
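
As an illustration of the check that rejects such a reload, here is a cycle detector based on Kahn's algorithm: repeatedly remove jobs with no unmet dependencies, and if any jobs remain, they form a cycle. The shape of the dependency map is an assumption, and Husky may use a different algorithm.

```go
package main

import "fmt"

// hasCycle reports whether the dependency graph contains a cycle.
// deps maps each job to the jobs it depends on (hypothetical shape).
func hasCycle(deps map[string][]string) bool {
	indegree := map[string]int{}        // unmet dependencies per job
	dependents := map[string][]string{} // reverse edges
	for job, needs := range deps {
		if _, ok := indegree[job]; !ok {
			indegree[job] = 0
		}
		for _, dep := range needs {
			if _, ok := indegree[dep]; !ok {
				indegree[dep] = 0
			}
			indegree[job]++
			dependents[dep] = append(dependents[dep], job)
		}
	}

	var ready []string
	for job, n := range indegree {
		if n == 0 {
			ready = append(ready, job)
		}
	}

	visited := 0
	for len(ready) > 0 {
		job := ready[len(ready)-1]
		ready = ready[:len(ready)-1]
		visited++
		for _, next := range dependents[job] {
			indegree[next]--
			if indegree[next] == 0 {
				ready = append(ready, next)
			}
		}
	}
	// Jobs on a cycle never reach indegree 0, so they are never visited.
	return visited != len(indegree)
}

func main() {
	current := map[string][]string{"report": {"ingest"}, "ingest": nil}
	proposed := map[string][]string{"report": {"ingest"}, "ingest": {"report"}}
	fmt.Println("current has cycle:", hasCycle(current))
	fmt.Println("proposed has cycle:", hasCycle(proposed))
}
```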

Stale PID file remains after an unclean exit

Expected outcome:

Husky probes the PID
dead PID is treated as stale
file is overwritten and startup continues

Operational recommendations

use a stable dedicated data directory in service environments
pair catchup: true with a sensible scheduler.catchup_window
monitor daemon logs after restarts to understand catchup and orphan events
use integration tests under cmd/huskyd/ and tests/ when changing recovery behavior

Related testing coverage

Relevant integration coverage includes:

orphan reconciliation on restart
catchup true / false behavior
reload rejection when a cycle is introduced
reload while jobs are still running

See testing.md for commands.
