Husky — Task Tracker

Local-First Job Scheduler with Dependency Graphs


Phase 0 — Project Bootstrap

Weeks 0–1 · Set up the repository, toolchain, and all dependencies before any feature work begins.

Repository & Tooling

  • Initialise Git repository and add .gitignore (Go, binaries, .env)
  • Create top-level directory structure (cmd/, internal/, web/, docs/, scripts/, tests/)
  • Initialise Go module (go mod init github.com/husky-scheduler/husky)
  • Pin Go toolchain version in go.mod / .tool-versions (go 1.24.0)
  • Add Makefile with targets: build, test, lint, clean, run
  • Configure golangci-lint (.golangci.yml)
  • Add pre-commit hooks (lint + go vet on staged files)
  • Set up GitHub Actions CI skeleton (lint + test on push)

Dependencies

  • Add YAML parser — gopkg.in/yaml.v3
  • Add SQLite driver — modernc.org/sqlite (pure-Go, CGO-free)
  • Add JSON Schema validator for husky.yaml validation — santhosh-tekuri/jsonschema/v5
  • Add structured logger — log/slog (stdlib, Go 1.21+)
  • Add CLI framework — spf13/cobra
  • Add test assertion library — testify/assert + testify/require
  • Run go mod tidy and commit go.sum

Skeleton

  • Create embedded daemon runtime entry point (initially cmd/huskyd/main.go, later invoked via husky daemon run)
  • Create cmd/husky/main.go entry point (CLI)
  • Add placeholder VERSION constant and --version flag (internal/version package)
  • Create internal/config/ package stub
  • Create internal/store/ package stub
  • Verify make build produces the husky binary without errors

Phase 1 — Core Engine

Weeks 1–3 · YAML parser, schedule evaluator, SQLite schema, basic executor, and core CLI commands.

Step 1: YAML Parser & Config

  • Define Config and Job Go structs matching the husky.yaml schema (§2.1, §2.4)
  • Implement YAML unmarshalling into Config struct
  • Implement JSON Schema validation of husky.yaml at load time
  • Validate frequency enum values (§2.2.1)
  • Validate time field — 4-char military format, valid hour/minute range (§2.2.2)
  • Validate on_failure enum (alert | skip | stop | ignore)
  • Validate retry_delay format (exponential or fixed:<duration>)
  • Validate concurrency enum (allow | forbid | replace)
  • Interpolate ${env:HOST_VAR} references in env map at runtime (§7.3)
  • Apply defaults block to all jobs that do not override a field (§2.1)
  • Return structured parse errors with file location context
  • Add timezone field to Job struct — IANA timezone identifier string (Feature 7)
  • Add timezone field to Defaults struct as global fallback (Feature 7)
  • Validate timezone identifier at parse time using Go's embedded time/tzdata package; reject unknown identifiers (Feature 7)
  • Add sla field to Job struct — optional duration string (Feature 1)
  • Validate at parse time that sla < timeout when both are set; return structured error if not (Feature 1)
  • Add tags field to Job struct — list of strings (Feature 4)
  • Validate tags: lowercase alphanumeric with hyphens only, max 10 tags per job, max 32 characters per tag (Feature 4)
  • Expand Notify struct to full rich notification schema — on_failure, on_success, on_sla_breach, on_retry sub-objects each with channel, message, attach_logs, and only_after_failure fields (Feature 3)
  • Maintain backward compatibility: shorthand string form (notify.on_failure: slack:#channel) continues to parse correctly (Feature 3)
  • Add healthcheck block to Job struct: command string, timeout duration string, on_fail enum (mark_failed | warn_only) (Feature 6)
  • Add output block to Job struct: map of variable name → capture mode string (last_line, first_line, json_field:<key>, regex:<pattern>, exit_code) (Feature 2)
  • Add Integration struct with provider-specific credential fields (webhook_url, routing_key, host, port, username, password, from) (Feature 3)
  • Add Integrations map[string]*Integration top-level field to Config struct (Feature 3)
  • Infer provider from map key when key matches a known provider name (slack, pagerduty, discord, smtp, webhook); require explicit provider field for custom key names (Feature 3)
  • Validate each integration: required credential fields per provider (webhook_url for slack/discord/webhook, routing_key for pagerduty, host+from for smtp), SMTP port range 1โ€“65535 (Feature 3)
  • Extend ${env:VAR} interpolation to cover all integration credential fields (webhook_url, routing_key, username, password) (Feature 3)
  • Implement LoadDotEnv(dir) — source .env file from same directory as husky.yaml before env interpolation; non-destructive (process env vars take precedence) (Feature 3)
  • Update JSON Schema to allow integrations: top-level block with integration $def (Feature 3)
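
The struct shapes and tag rules above can be sketched as follows. This is a minimal sketch: field names beyond those listed in this step, and the exact tag regex (which also forbids leading/trailing hyphens), are assumptions rather than the full §2.1/§2.4 schema.

```go
package main

import (
	"fmt"
	"regexp"
)

// Job holds a subset of the husky.yaml job fields; the real struct
// carries the full §2.4 field set.
type Job struct {
	Name      string   `yaml:"name"`
	Command   string   `yaml:"command"`
	Frequency string   `yaml:"frequency"`
	Time      string   `yaml:"time,omitempty"`
	Timezone  string   `yaml:"timezone,omitempty"`
	SLA       string   `yaml:"sla,omitempty"`
	Tags      []string `yaml:"tags,omitempty"`
}

// Config is the top-level husky.yaml shape (integrations etc. elided).
type Config struct {
	Defaults Job            `yaml:"defaults"`
	Jobs     map[string]Job `yaml:"jobs"`
}

// Assumed tag shape: hyphen-separated lowercase alphanumeric segments.
var tagRe = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)*$`)

// ValidateTags enforces the Feature 4 rules: max 10 tags per job,
// max 32 characters per tag, lowercase alphanumeric with hyphens only.
func ValidateTags(tags []string) error {
	if len(tags) > 10 {
		return fmt.Errorf("too many tags: %d (max 10)", len(tags))
	}
	for _, t := range tags {
		if len(t) > 32 {
			return fmt.Errorf("tag %q exceeds 32 characters", t)
		}
		if !tagRe.MatchString(t) {
			return fmt.Errorf("tag %q: lowercase alphanumeric with hyphens only", t)
		}
	}
	return nil
}

func main() {
	fmt.Println(ValidateTags([]string{"etl", "nightly-batch"})) // <nil>
	fmt.Println(ValidateTags([]string{"Bad_Tag"}) != nil)       // true
}
```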

Step 2: Schedule Evaluator

  • Implement NextRunTime(job Job, now time.Time) time.Time for all frequency values
    • hourly → next :00
    • daily, weekly, monthly, weekdays, weekends → next matching wall-clock tick using time field
    • manual → never schedules automatically
    • after:<job> → dependency-driven (no time calculation)
  • Log resolved UTC equivalent of every scheduled time at daemon startup
  • Implement 1-second ticker loop that evaluates all jobs each tick
  • Resolve time field relative to per-job timezone; fall back to defaults.timezone, then system timezone (Feature 7)
  • Use Go's embedded time/tzdata for timezone resolution โ€” no system tzdata dependency (Feature 7)
  • Handle DST gap: when the scheduled wall-clock time does not exist, run at the next valid time after the gap and log a warning (Feature 7)
  • Handle DST overlap: when the scheduled time occurs twice, run once on the first occurrence and log a warning (Feature 7)
  • Include timezone abbreviation (e.g. EST, JST) alongside the UTC equivalent in daemon startup log (Feature 7)

Step 3: SQLite State Store

  • Implement store package with WAL-mode SQLite connection
  • Create schema migrations for job_runs, job_state, run_logs, alerts tables (§4.1)
  • Add reason TEXT and triggered_by TEXT DEFAULT "scheduler" columns to job_runs (Feature 5)
  • Add sla_breached INTEGER DEFAULT 0 column to job_runs (Feature 1)
  • Add hc_status TEXT column to job_runs — values: null | pass | fail | warn (Feature 6)
  • Create run_outputs table: (id, run_id, job_name, var_name, value, cycle_id, created_at) with index on cycle_id (Feature 2)
  • Implement serialised writer goroutine channel for all writes
  • Implement RecordRunStart, RecordRunEnd, RecordLog, GetJobState operations
  • Implement UpdateJobState (last_success, last_failure, next_run, lock_pid)

Step 4: Executor

  • Implement bounded goroutine pool (default: 8 concurrent jobs) (§3.2.3)
  • Launch each job as a subprocess via /bin/sh -c (or direct exec for array form)
  • Set Setpgid=true on all subprocesses (§7.4)
  • Capture stdout and stderr line-by-line and write to run_logs in real time
  • Implement timeout: send SIGTERM, then SIGKILL after grace period
  • Kill entire process group on timeout/cancel

Step 5: Core CLI (husky)

  • husky start — launch the embedded Husky daemon in background, write daemon.pid, detach
  • Hidden husky daemon run entry point for direct foreground daemon execution by service managers and development workflows
  • husky stop — graceful shutdown (drain running jobs)
  • husky stop --force — immediate SIGKILL to all running jobs
  • husky status — tabular display of all job states from job_state
  • husky run <job> — trigger job immediately, bypassing schedule
  • Unix socket IPC between husky (CLI) and the embedded daemon runtime

Step 6: Testing

  • Unit tests for YAML parser (valid configs, missing required fields, bad enum values)
  • Unit tests for time field validation (all rows from Table §2.2.2)
  • Unit tests for schedule evaluator (NextRunTime for each frequency value)
  • Unit tests for SQLite store (CRUD, WAL concurrency)
  • Integration test: parse full example husky.yaml (§2.3) without errors

Phase 2 — DAG + Reliability

Weeks 4–6 · Dependency resolution, retry state machine, crash recovery, hot-reload, catchup.

DAG Resolver

  • Implement topological sort using Kahn's algorithm on depends_on declarations (§3.2.2)
  • Detect cycles at startup and emit full cycle path in error message
  • Reject daemon start if any cycle is found
  • Resolve after:<job> frequency as an implicit depends_on edge
  • At runtime, check all upstream job states are SUCCESS before dispatching a dependent job
  • husky dag — print ASCII DAG of all jobs and their dependencies
  • husky dag --json — emit machine-readable DAG structure

Retry FSM

  • Implement job run state machine: PENDING → RUNNING → SUCCESS | FAILED → RETRYING (§3.2.4)
  • Implement exponential backoff with ±25% jitter (doubles each attempt: 30s, 60s, 120s…)
  • Implement fixed:<duration> retry delay
  • On max retries exceeded, execute on_failure action:
    • alert — fire configured notify channels
    • skip — mark run SKIPPED, continue other jobs
    • stop — halt the entire pipeline (cancel all downstream dependents)
    • ignore — log failure silently, take no action
  • Implement concurrency: forbid — skip run if previous is still running (§2.4)
  • Implement concurrency: replace — kill running instance and start fresh
  • husky retry <job> — retry last failed run from scratch
  • husky cancel <job> — send SIGTERM to running job
  • husky skip <job> — mark pending run as SKIPPED

Healthchecks

  • After main command exits with code 0, run healthcheck.command as a separate process (Feature 6)
  • Apply healthcheck.timeout (default: 30s); send SIGTERM + SIGKILL to healthcheck if exceeded (Feature 6)
  • on_fail: mark_failed (default): mark run FAILED, append healthcheck stderr to run_logs, trigger retry policy (Feature 6)
  • on_fail: warn_only: mark run SUCCESS with hc_status = warn, fire on_sla_breach notify event (Feature 6)
  • Skip healthcheck entirely when main command exits non-zero (Feature 6)
  • Capture healthcheck stdout/stderr into run_logs tagged with stream = "healthcheck" (Feature 6)
  • Expose husky logs <job> --include-healthcheck flag to include healthcheck log lines (Feature 6)

Output Passing

  • After job completion, evaluate the job's output block and write each captured value to run_outputs (Feature 2)
  • Implement all capture modes: last_line, first_line, json_field:<key>, regex:<pattern>, exit_code (Feature 2)
  • Generate a cycle_id UUID at the root of each trigger chain (schedule tick or manual run); propagate to all downstream dependent jobs (Feature 2)
  • At job dispatch time, render {{ outputs.<job_name>.<var_name> }} template expressions in command and env values (Feature 2)
  • Return a descriptive error at dispatch time if a referenced output variable has no recorded value for the current cycle_id (Feature 2)
  • Scope all output values to a single cycle; never bleed values across independent trigger chains (Feature 2)
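
The five capture modes can be sketched as one dispatch function. Two details here are assumptions not fixed by the spec: json_field parses the last stdout line, and regex: returns the first capture group.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// capture evaluates one output capture mode against a job's stdout and
// exit code: last_line, first_line, json_field:<key>, regex:<pattern>,
// exit_code.
func capture(mode, stdout string, exitCode int) (string, error) {
	lines := strings.Split(strings.TrimRight(stdout, "\n"), "\n")
	switch {
	case mode == "last_line":
		return lines[len(lines)-1], nil
	case mode == "first_line":
		return lines[0], nil
	case mode == "exit_code":
		return strconv.Itoa(exitCode), nil
	case strings.HasPrefix(mode, "json_field:"):
		key := strings.TrimPrefix(mode, "json_field:")
		var m map[string]any // assumption: JSON document is the last line
		if err := json.Unmarshal([]byte(lines[len(lines)-1]), &m); err != nil {
			return "", err
		}
		return fmt.Sprint(m[key]), nil
	case strings.HasPrefix(mode, "regex:"):
		re, err := regexp.Compile(strings.TrimPrefix(mode, "regex:"))
		if err != nil {
			return "", err
		}
		if m := re.FindStringSubmatch(stdout); len(m) > 1 {
			return m[1], nil // assumption: first capture group is the value
		}
		return "", fmt.Errorf("pattern %q matched nothing", mode)
	}
	return "", fmt.Errorf("unknown capture mode %q", mode)
}

func main() {
	out := "starting\n{\"file_path\": \"/tmp/data.csv\"}\n"
	v, _ := capture("json_field:file_path", out, 0)
	fmt.Println(v) // /tmp/data.csv
}
```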

Crash Recovery

  • On startup, acquire daemon.pid lock (§6 step 2)
  • Detect stale lock (PID dead) vs. live daemon (exit with "daemon already running")
  • Reconcile orphaned runs: query job_state WHERE lock_pid IS NOT NULL, probe each PID with kill -0 (§6 step 3)
  • For dead PIDs: mark run FAILED, increment attempt, schedule retry
  • Reconcile missed schedules based on catchup flag (§6 step 4)
    • catchup: true → trigger missed run immediately
    • catchup: false → log skip, advance to next scheduled tick

Config Hot-Reload (SIGHUP)

  • Handle SIGHUP — parse full new config before applying
  • Atomic swap of running config under mutex
  • Ensure running jobs are never interrupted during reload
  • Re-run cycle detection on new config; reject reload (keep old config) if invalid
  • husky reload — send SIGHUP to daemon via Unix socket

Testing

  • Unit tests for Kahn's algorithm (linear chain, fan-in, fan-out, cycle detection)
  • Unit tests for Retry FSM state transitions
  • Unit tests for exponential backoff timing + jitter bounds
  • Unit tests for each on_failure handler
  • Integration test: crash daemon mid-run, restart, verify orphan reconciliation
  • Integration test: introduce a cycle into config, verify daemon rejects it
  • Unit tests for healthcheck execution flow: main success + healthcheck pass, main success + healthcheck fail (mark_failed), main success + healthcheck fail (warn_only), main fail skips healthcheck (Feature 6)
  • Unit tests for healthcheck timeout enforcement (Feature 6)
  • Unit tests for each output capture mode: last_line, first_line, json_field, regex, exit_code (Feature 2)
  • Unit tests for {{ outputs.<job>.<var> }} template rendering in command and env (Feature 2)
  • Unit tests for cycle_id scoping โ€” values from cycle A are not visible in cycle B (Feature 2)
  • Unit tests for timezone scheduling: correct wall-clock resolution for a sample of IANA zones, DST gap and overlap behaviour (Feature 7)
  • Unit tests for sla < timeout validation at parse time (Feature 1)

Phase 3 — Observability

Weeks 7–9 · REST API, WebSocket log streaming, embedded web dashboard, webhook alerting.

REST API

  • Bind API server to 127.0.0.1:8420 by default (§7.1)
  • GET /api/jobs — list all jobs with current state
  • GET /api/jobs?tag=<tag> — filter job list by tag (Feature 4)
  • GET /api/jobs/:name — single job detail (config + last N runs)
  • POST /api/jobs/:name/run — trigger manual run; accept optional reason body field (Feature 5)
  • POST /api/jobs/:name/cancel — cancel running job
  • GET /api/runs/:id — run detail with exit code, duration, sla_breached, hc_status fields (Feature 1, 6)
  • GET /api/runs/:id/logs — paginated log lines for a run
  • GET /api/runs/:id/outputs — output variables captured for a run (Feature 2)
  • GET /api/audit — searchable run history; query params: job, trigger, status, since, reason, tag (Feature 5)
  • GET /api/tags — list all defined tags (Feature 4)
  • GET /api/dag — DAG structure as JSON
  • GET /api/status — daemon health + uptime

WebSocket Log Streaming

  • GET /ws/logs/:run_id — stream live run_logs lines over WebSocket as they are written
  • Backfill existing log lines on connection open, then stream new lines
  • Close WebSocket cleanly when run reaches terminal state (SUCCESS | FAILED | SKIPPED)
  • husky logs <job> --tail — consume WebSocket stream in terminal

CLI Observability Commands

  • husky logs <job> — print last run logs
  • husky logs <job> --run=<id> — logs for a specific run ID
  • husky logs <job> --include-healthcheck — include healthcheck stream in output (Feature 6)
  • husky logs --tag <tag> --tail — stream live logs for all jobs matching a tag (Feature 4)
  • husky history <job> — table of last N runs (status, duration, trigger, reason, vs-SLA columns) (Feature 1, 5)
  • husky history <job> --last=<n> — configurable run count
  • husky validate — lint husky.yaml (schema, cycles, field types, timezone identifiers, sla < timeout)
  • husky validate --strict — also warn on missing descriptions and notify configs
  • husky config show — print effective config with defaults applied
  • husky export --format=json — export full state snapshot
  • husky status --tag <tag> — filter job status table by tag (Feature 4)
  • husky pause --tag <tag> — pause all jobs matching a tag (Feature 4)
  • husky resume --tag <tag> — resume all paused jobs matching a tag (Feature 4)
  • husky run --tag <tag> — trigger all jobs matching a tag immediately (Feature 4)
  • husky tags list — print all defined tags with the count of jobs carrying each one (Feature 4)
  • husky run <job> --reason "<text>" — attach a free-text annotation (max 500 chars) to the manual run record (Feature 5)
  • husky audit — searchable run history across all jobs (Feature 5)
  • husky audit --job <name> — filter by job (Feature 5)
  • husky audit --trigger manual|schedule|dependency — filter by trigger type (Feature 5)
  • husky audit --status failed|success|skipped — filter by run status (Feature 5)
  • husky audit --since "<date>" — filter runs after a given date (Feature 5)
  • husky audit --reason "<text>" — full-text search on the reason field (Feature 5)
  • husky audit --tag <tag> — filter by job tag (Feature 5)
  • husky audit --export csv — export results as CSV (Feature 5)

Web Dashboard

  • Scaffold Preact app in web/ directory
  • DAG view — dependency edges and topological execution order, renders from GET /api/dag JSON
  • Job list view — all jobs, last status, last run time, next scheduled run; tag filter dropdown; run/cancel actions
  • Run history view — recent runs per job in expandable row detail; run pills colour-coded by status
  • Extend run history with vs SLA column showing ✓ under / ⚠ +Δ (Feature 1)
  • Add third run-state colour: yellow for "running but past SLA" in addition to green/red (Feature 1)
  • Tag filter — tag dropdown on jobs view filters to matching jobs; tag dropdown on audit view (Feature 4)
  • Tag filter sidebar with aggregate health counts per tag (Feature 4)
  • Audit log view — searchable, filterable table (job, status, trigger, tag, since) with click-to-expand log viewer (Feature 5)
  • Live log viewer — streams WebSocket log output; stderr tinted red; auto-scroll; historic fallback for completed runs (Feature 6)
  • Job detail view — show captured output variables per run under a collapsible "Outputs" section (Feature 2)
  • Healthcheck output — show healthcheck log lines in a collapsible section below main output in run detail (Feature 6)
  • Embed compiled web assets into binary via go:embed (§3.3)
  • Serve dashboard from GET / on API server
  • Add make web target — npm ci && vite build writes Tailwind + Preact bundle to internal/api/dashboard/; make build depends on web
  • Style dashboard with Tailwind CSS (purged bundle ≈ 15 KB gzip)
  • Add timezone column to job list view (Feature 7)
  • Redact env var values in job config views (§7.3)

Rich Notifications

  • Implement notification dispatch for all four events: on_success, on_failure, on_sla_breach, on_retry (Feature 3)
  • Render message templates using {{ job.* }} and {{ run.* }} variables at dispatch time (Feature 3)
  • Attach last N lines of run_logs to notification when attach_logs: last_N_lines is set; send full log when all (Feature 3)
  • Implement only_after_failure: true on on_success — suppress notification unless previous completed run was FAILED (Feature 3)
  • Support slack:#channel and slack:@user — post formatted message via Slack Incoming Webhook URL (Feature 3)
  • Support pagerduty:<severity> — trigger PagerDuty Events API v2; severity values: p1, p2, p3, p4 (Feature 3)
  • Support discord:#channel — post plain text message via Discord Incoming Webhook URL (Feature 3)
  • Support webhook:<url> — HTTP POST with JSON payload to custom URL (Feature 3)
  • Support email:<address> — send SMTP email; requires smtp config block in husky.yaml defaults (Feature 3)
  • Implement SLA breach timer: when a job stays RUNNING past its sla duration, fire on_sla_breach (falls back to on_failure if not set); set sla_breached = 1 on the run record (Feature 1)
  • Write each dispatched alert to the alerts table (audit log) (§4.1)
  • Retry alert delivery up to 3 times with backoff on HTTP error
  • Backward compatibility: shorthand string form (notify.on_failure: slack:#channel) continues to dispatch correctly (Feature 3)
  • Resolve integration credentials at notification dispatch time: look up the integration name embedded in the channel string prefix (e.g. slack_ops:#alerts → cfg.Integrations["slack_ops"]) (Feature 3)
  • Fall back to the provider-named key when channel string uses bare provider prefix (e.g. slack:#channel → cfg.Integrations["slack"]) (Feature 3)
  • Support husky integrations list — display all configured integrations as a table: name, provider, and credential status (set / missing) (Feature 3)
  • Support husky integrations test <name> — send a live test message/event to the named integration and report success or failure (Feature 3)
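
Credential resolution from a channel string can be sketched as: split on the first colon, then look the prefix up as an integrations key. The bare-provider fallback comes for free, because slack:#channel resolves through the same map lookup against the provider-named key. Struct fields and error wording here are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// Integration holds a subset of the provider credential fields.
type Integration struct {
	Provider   string
	WebhookURL string
}

// Known provider names that may appear as bare channel prefixes.
var known = map[string]bool{
	"slack": true, "pagerduty": true, "discord": true, "smtp": true, "webhook": true,
}

// resolveIntegration splits "slack_ops:#alerts" into (integration,
// target): the prefix is tried as a named key, e.g.
// cfg.Integrations["slack_ops"]; a bare provider prefix like "slack"
// matches the provider-named key the same way.
func resolveIntegration(channel string, integrations map[string]*Integration) (*Integration, string, error) {
	name, target, ok := strings.Cut(channel, ":")
	if !ok {
		return nil, "", fmt.Errorf("malformed channel %q", channel)
	}
	if in, ok := integrations[name]; ok {
		return in, target, nil
	}
	if known[name] {
		return nil, "", fmt.Errorf("no integration configured for provider %q", name)
	}
	return nil, "", fmt.Errorf("unknown integration %q", name)
}

func main() {
	ints := map[string]*Integration{
		"slack":     {Provider: "slack", WebhookURL: "https://hooks.example/a"},
		"slack_ops": {Provider: "slack", WebhookURL: "https://hooks.example/b"},
	}
	in, target, _ := resolveIntegration("slack_ops:#alerts", ints)
	fmt.Println(in.WebhookURL, target) // https://hooks.example/b #alerts
}
```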

Testing

  • Unit tests for each REST endpoint (table-driven, httptest)
  • Integration test: trigger run via REST, stream logs via WebSocket, verify terminal state
  • Unit tests for notification template rendering — all {{ job.* }} and {{ run.* }} variables (Feature 3)
  • Unit tests for Slack, PagerDuty, Discord, and custom webhook payload construction (Feature 3)
  • Unit tests for integration credential resolution at dispatch time — named key, bare provider fallback, missing integration (Feature 3)
  • Integration tests for husky integrations test slack and husky integrations test pagerduty (Feature 3)
  • Unit tests for only_after_failure suppression logic (Feature 3)
  • Unit tests for SLA breach timer: fires on_sla_breach at correct elapsed time, sets sla_breached flag (Feature 1)
  • Unit tests for tag filtering on GET /api/jobs and husky status (Feature 4)
  • Unit tests for husky audit filter combinations (Feature 5)
  • Unit tests for husky run --reason stores reason in job_runs.reason (Feature 5)
  • End-to-end test: full pipeline run (ยง2.3 example), verify all alert rows written
  • End-to-end test: output-passing pipeline — ingest captures file_path, transform receives it via template (Feature 2)

Phase 3.4 — Dashboard Enhancements

Complete · All backend endpoints, Data / Alerts / Config, Run Detail page, Enhanced Jobs tab, DAG SVG, Health tab, Integrations panel, and UI quality-of-life items implemented.

Backend: New API Endpoints

  • GET /api/daemon/info — return PID, config file path, started_at timestamp, SQLite DB path, active job count, and paused job count (§1.3)
  • GET /api/db/job_runs — paginated, sortable, filterable table of job_runs; query params: job, status, trigger, since, until, sla_breached, page, page_size (§2.1)
  • GET /api/db/job_state — return the full job_state table (§2.2)
  • POST /api/db/job_state/:job/clear_lock — set lock_pid = NULL for a job; return 409 if the job is currently RUNNING (§2.2)
  • GET /api/db/run_logs — full-text searchable log explorer; query params: run_id, job, stream (stdout | stderr | healthcheck), q, limit, offset (§2.3)
  • GET /api/db/alerts — paginated alerts history; query params: job, status, limit, offset (§2.5)
  • POST /api/db/alerts/:id/retry — re-dispatch a failed alert without re-running the job; return 409 if alert is already pending (§2.5)
  • POST /api/jobs/:name/pause — pause a single job by name (§5.1)
  • POST /api/jobs/:name/resume — resume a single paused job by name (§5.1)
  • POST /api/jobs/:name/retry — retry the last failed run for this job (§4.2)
  • POST /api/jobs/:name/skip — mark a PENDING run as SKIPPED (§4.2)
  • GET /api/integrations — list all configured integrations with name, provider, credential status, last test result and timestamp (§8.1)
  • POST /api/integrations/:name/test — fire a test delivery to the named integration; return HTTP status and response payload (§8.2)

Dashboard: Daemon Info Panel (§1.3)

  • Extend the top bar with a clickable version badge that toggles an info dropdown panel displaying: PID, uptime, config file path, started_at full timestamp, SQLite DB path, total / active / paused job counts
  • Fetch daemon info from GET /api/daemon/info on mount and on each 30-second poll cycle

Dashboard: Data Tab (§2)

  • Add a Data tab to the navigation alongside Jobs | Audit | Outputs | Alerts | DAG | Config
  • job_runs sub-tab (§2.1): paginated table with columns id | job_name | status | attempt | trigger | triggered_by | reason | started_at | finished_at | exit_code | sla_breached | hc_status; pagination controls (25 / 50 / 100 rows); filter bar (job, status, trigger, date range, sla_breached flag)
  • job_state sub-tab (§2.2): table with columns job_name | last_success | last_failure | next_run | lock_pid; highlight rows with a non-null lock_pid using a warning badge; "Clear lock" action button per such row → POST /api/db/job_state/:job/clear_lock
  • run_logs sub-tab (§2.3): searchable log explorer across all runs; filters: run ID, job name, stream type, full-text keyword; rows colour-coded by stream (stderr = red tint, healthcheck = amber tint)

Dashboard: Alerts Tab Completion (§2.5)

  • Add status column with colour-coded badges: delivered = green, failed = red, pending = yellow
  • Add attempts, last_attempt_at, and error columns to the Alerts table
  • Add a "Retry" action button on rows where status = failed → POST /api/db/alerts/:id/retry; disable the button while the request is in flight
  • Add pagination / load-more to the Alerts table (currently loads a fixed number)

Dashboard: Config Editor Improvements (§3)

  • Inline error markers (§3.2): when "Validate" returns errors that include line numbers, scroll to and highlight the offending line(s) in the textarea; show the error description below the highlighted line
  • Config diff view (§3.4): after a successful "Save & Reload", show a unified diff between the previous config text and the newly saved config; highlight added/removed job blocks so the operator can confirm what changed

Dashboard: Run Detail Page (§4)

  • Implement SPA routing using the hash (#/runs/:id); the root App component parses window.location.hash and renders a RunDetailView instead of the tab panel when a run route is matched
  • RunDetailView (§4.1): full-page view showing — job name (links back to Jobs tab filtered to that job); status badge, attempt, trigger, triggered_by, reason; started_at / finished_at / duration; exit code; SLA progress bar (green if within budget, amber → red if over); hc_status badge; cycle_id (hyperlinked to Data → run_outputs filtered by cycle); full log viewer with stream colouring and healthcheck toggle; output variables table; previous / next run links for the same job
  • Run actions (§4.2): Retry button (shown when status is FAILED) → POST /api/jobs/:name/retry; Skip button (shown when status is PENDING) → POST /api/jobs/:name/skip; Cancel button (shown when status is RUNNING) → POST /api/jobs/:name/cancel
  • Deep links (§4.3): update run pills in the Jobs tab to navigate to #/runs/:id instead of expanding inline; update Audit table rows to link to #/runs/:id; make run_id values in the Data tab clickable links to #/runs/:id

Dashboard: Enhanced Jobs Tab (§5)

  • Per-job pause/resume (§5.1): add Pause / Resume button to each job row action group (next to Run / Cancel); show a PAUSED badge in the Status column for paused jobs; wire to POST /api/jobs/:name/pause and POST /api/jobs/:name/resume
  • Full job config details (§5.2): extend the expanded row config panel to include frequency, on_success, on_retry, on_sla_breach notification settings, env var keys (values redacted as ***), output capture rules, and healthcheck command + timeout
  • Run history depth control (§5.3): add a "Show more" button below the run history pills that fetches the next page; support incremental loading (?runs=40 then ?runs=80)
  • Bulk trigger by tag (§5.4): add a "Run all" button to the tag health strip for each tag; show a confirmation dialog listing the jobs that will be triggered; fire one POST /api/jobs/:name/run per matching job on confirm

Dashboard: DAG Enhancements (§6)

  • Visual graph (§6.1): replace the current text edge list with a rendered directed graph using a lightweight library (Dagre-D3, D3-dag, or ELK.js); nodes show job name, current status badge, and last run time; directed edges with arrowheads; node borders colour-coded (green = SUCCESS, red = FAILED, blue = RUNNING, grey = PENDING); support scroll and zoom for large graphs
  • Node actions (§6.2): clicking a node navigates to #/runs/:id for the most recent run of that job; right-click context menu with Trigger / Cancel / View logs options
  • Cycle trace overlay (§6.3): from the Run Detail page or Data tab, a "Show in DAG" button that highlights all nodes and edges that participated in a specific cycle_id, with each node annotated with the outputs it captured in that cycle

Dashboard: Health & SLA Tab (§7)

  • Add a Health tab to the navigation
  • SLA compliance table (§7.1): for all jobs with an SLA configured, render job | SLA budget | last duration | Δ | breach rate (30 d); Δ shown in red when positive (breach) and green when negative (headroom); default sort by breach rate descending to surface worst offenders
  • Timeline swimlane (§7.2): chronological chart with one row per job showing the last N runs as coloured bars (bar length = duration, bar colour = status, amber border if SLA breached); hovering a bar shows a full run detail tooltip

Dashboard: Integrations Panel (§8)

  • Add an Integrations section as a sub-tab within Config or as a dedicated Settings tab
  • Integration health list (§8.1): display all configured integrations with name, provider, credential status (configured / missing required fields / env var unset), last test result and timestamp; fetched from GET /api/integrations
  • Test delivery button (§8.2): "Test" button per integration row → POST /api/integrations/:name/test; display the test payload sent and the HTTP response code / body returned

Dashboard: UI Quality-of-Life (§9)

  • Dark mode (§9.1): (→ Backlog)
  • Toast notifications (§9.2): ensure every user-initiated action (trigger, cancel, pause/resume, save config, retry alert, clear lock) shows a short-lived success or error toast with descriptive text
  • Keyboard shortcuts (§9.3): r refreshes the current view, / focuses the active filter/search input, Esc closes expanded rows or modals; register via a useEffect on document in the root App component; show a shortcut hint in the UI
  • Column and filter persistence (§9.4): (→ Backlog)
  • Copy to clipboard (§9.5): one-click copy button on run IDs, cycle IDs, and inside the log viewer (copies visible log text) via navigator.clipboard.writeText
  • Auto-polling control (§9.6): toggle button to pause/resume auto-refresh of the jobs table; show "last refreshed N seconds ago" label; when paused, no polling until the user manually refreshes or re-enables

Testing

  • Unit tests: GET /api/daemon/info — returns correct PID, config path, start time, and job counts (§1.3)
  • Unit tests: GET /api/db/job_runs — pagination boundaries, sort by each column, all filter param combinations (§2.1)
  • Unit tests: POST /api/db/job_state/:job/clear_lock — clears lock when job is idle, returns 409 when job is RUNNING (§2.2)
  • Unit tests: GET /api/db/run_logs — full-text q search, stream filter, pagination (§2.3)
  • Unit tests: POST /api/db/alerts/:id/retry — re-queues a failed alert; returns 409 if alert is already pending; returns 404 for unknown ID (§2.5)
  • Unit tests: POST /api/jobs/:name/pause and /resume — correct state transitions, 404 on unknown job (§5.1)
  • Unit tests: POST /api/jobs/:name/retry and /skip — correct preconditions enforced (status must be FAILED / PENDING respectively) (§4.2)
  • Unit tests: GET /api/integrations — returns all configured integrations with correct status field (§8.1)
  • Unit tests: POST /api/integrations/:name/test — fires test delivery, returns response details, 404 on unknown integration (§8.2)
  • E2E: hash routing — navigate directly to #/runs/:id, verify RunDetailView renders with correct data and back-navigation returns to the Jobs tab (§4.3)
  • E2E: bulk trigger by tag — confirmation dialog lists correct jobs, all matching jobs reach RUNNING state after confirm (§5.4)
  • E2E: dark mode toggle — dark class applied to <html>, preference survives page reload (§9.1) (→ Backlog)

Phase 3.5 — Daemon Runtime Configuration (huskyd.yaml)

Complete · All huskyd.yaml parsing, API binding, authentication (bearer, basic auth, RBAC), TLS, CORS, structured logging with rotation, storage retention vacuum, executor shell + global env, scheduler jitter, and full unit + integration test coverage implemented. OIDC, metrics, tracing, secrets backends, and advanced executor resource limits deferred to Phase 5 backlog.

Dashboard

  • huskyd.yaml subtab: if the daemon exposes a daemon-config path via GET /api/daemon/info, add a second subtab in the Config view displaying its path and raw YAML content (read-only initially)

Discovery & Parsing

  • Auto-assign API port with 127.0.0.1:0 (OS picks free port); write bound address to <data>/api.addr on startup; remove file on clean shutdown
  • husky dash — read <data>/api.addr and open dashboard URL in default system browser
  • Define DaemonConfig Go struct in a new internal/daemoncfg package mirroring the full huskyd.yaml shape
  • Load huskyd.yaml from the same directory as husky.yaml when the file exists; silently apply all defaults when it does not
  • Support --daemon-config <path> flag on huskyd to override the default discovery path
  • Validate huskyd.yaml with a JSON Schema at load time; emit structured errors with field path context
  • All fields optional — absence of any section falls back to the documented default; no breaking change for v1.0 users
  • husky validate extends to also validate huskyd.yaml when present
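Taken together, the discovery rules above describe a file in which every section is optional. A minimal sketch of the shape this phase implements (field names follow this document; the values shown are illustrative examples, not the defaults):

```yaml
# huskyd.yaml — every section optional; omit any to take the documented default
api:
  addr: 127.0.0.1:8420      # fixed port instead of the 127.0.0.1:0 auto-assign
  tls:
    enabled: false
auth:
  type: none                 # none | bearer | basic | oidc (oidc is a stub in 3.5)
log:
  level: info
  format: json
storage:
  retention:
    max_age: 720h            # prune job_runs older than 30 days
    max_runs_per_job: 1000
scheduler:
  schedule_jitter: 5s
executor:
  shell: /bin/bash
  global_env:
    TZ: UTC
```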

API Server

  • api.addr — bind the HTTP server to the specified address when set; override the 127.0.0.1:0 default (enables fixed-port deployments and binding to 0.0.0.0 for remote access)
  • Emit a prominent startup WARN log when api.addr binds to 0.0.0.0 unless both TLS and auth are enabled
  • api.base_path — mount the dashboard and all API routes under a URL prefix (e.g. /husky); strip prefix transparently so internal handlers see clean paths
  • api.tls.enabled — call tls.Listen instead of net.Listen when true; refuse to start if cert/key files are missing or unreadable
  • api.tls.cert / api.tls.key — absolute paths to PEM-encoded certificate and private key
  • api.tls.min_version — enforce minimum TLS version; accept "1.2" (default) or "1.3"
  • api.tls.client_ca — when set, require mTLS; reject connections whose client certificate is not signed by the given CA
  • api.cors.allowed_origins — inject CORS headers for listed origins; wildcard "*" accepted but emits a startup warning
  • api.cors.allow_credentials — set Access-Control-Allow-Credentials: true when needed for browser-based OAuth flows
  • api.timeouts.read_header, .read, .write, .idle — pass through to http.Server fields; defaults match current hardcoded values

Authentication

  • auth.type: none (default) — no authentication; all endpoints open
  • auth.type: bearer — require Authorization: Bearer <token> on all REST + WebSocket requests
    • auth.bearer.token_file — read one token per line; all listed tokens are valid; hot-reload on SIGHUP
    • auth.bearer.token — inline token string for simple setups (warn in logs that file-based is preferred)
  • auth.type: basic — HTTP Basic Auth
    • auth.basic.users — list of {username, password_hash} entries; password_hash is a bcrypt hash; plaintext passwords rejected at load time
  • auth.type: oidc — startup stub: returns a clear error; full OIDC implementation deferred to Phase 5
  • auth.rbac — define per-role allowed HTTP method + path patterns; built-in roles viewer, operator, admin with sensible defaults; custom roles accepted
  • Auth middleware applied uniformly to all routes served by the API server
  • Auth disabled entirely when auth.type: none; no performance overhead in the default case
  • husky CLI reads HUSKY_TOKEN env var and injects Authorization: Bearer header for all HTTP-based commands (pause, resume, dash) when auth is enabled
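A sketch of the bearer check described above, including the zero-overhead pass-through for auth.type: none. The names (checkBearer, authMiddleware) are illustrative, not Husky's internals; the constant-time comparison is a standard precaution, assumed rather than specified by this document:

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"strings"
)

// checkBearer validates an Authorization header against the set of tokens
// loaded from auth.bearer.token_file, using a constant-time comparison.
func checkBearer(header string, tokens []string) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	presented := header[len(prefix):]
	for _, t := range tokens {
		if subtle.ConstantTimeCompare([]byte(presented), []byte(t)) == 1 {
			return true
		}
	}
	return false
}

// authMiddleware wraps every route; with no tokens configured (auth.type:
// none) it returns the handler unchanged, so the default case pays nothing.
func authMiddleware(next http.Handler, tokens []string) http.Handler {
	if len(tokens) == 0 {
		return next
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !checkBearer(r.Header.Get("Authorization"), tokens) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(checkBearer("Bearer s3cret", []string{"s3cret"}))
}
```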

Logging

  • log.level — debug | info | warn | error; default info; hot-reloadable via SIGHUP
  • log.format: json — swap the slog handler from slog.NewTextHandler to slog.NewJSONHandler; produces structured log lines consumable by Datadog, CloudWatch, Loki, Splunk
  • log.output: stdout | stderr | file — default stdout; file requires log.file.path
  • log.file.path — write logs to this file; create parent directories if absent
  • log.file.max_size_mb, .max_backups, .max_age_days, .compress — log rotation via lumberjack (or equivalent); compress rotated files when compress: true
  • log.audit_log.enabled — write a separate newline-delimited JSON audit log to log.audit_log.path; each line is one job state transition event with full context (job, run ID, status, trigger, duration, reason)
  • log.audit_log.max_size_mb, .max_backups — independent rotation settings for the audit log file

Storage

  • storage.sqlite.path — override default <data>/husky.db location
  • storage.sqlite.wal_autocheckpoint — set SQLite wal_autocheckpoint pragma; default 1000 pages
  • storage.sqlite.busy_timeout — set SQLite busy-handler timeout; default 5s
  • storage.retention.max_age — background vacuum goroutine deletes job_runs (and cascade: run_logs, run_outputs) older than the specified duration; runs every 24 h
  • storage.retention.max_runs_per_job — after pruning by age, also enforce a per-job cap; keep only the N most recent completed runs; PENDING and RUNNING runs are never pruned
  • Background vacuum goroutine: run once at startup (after a short delay to not contend with catchup) then on a 24 h ticker; log number of rows pruned at INFO level
  • storage.engine: postgres stub — parse the config field and emit a clear "not yet supported" error at startup; prevents a confusing failure mode when users copy a future config forward
  • husky export --format=json honours the configured storage path when reading state

Scheduler

  • scheduler.max_concurrent_jobs — global ceiling on concurrently executing jobs across all job pools; default 32; per-job concurrency setting still applies at the job level
  • scheduler.catchup_window — maximum look-back window when reconciling missed schedules on restart; overrides per-job catchup: true for runs missed longer ago than this; default 24h
  • scheduler.shutdown_timeout — grace period given to in-flight jobs when a graceful stop is requested before SIGKILL is sent to remaining jobs; default 60s
  • scheduler.schedule_jitter — random jitter (up to this duration) added to each scheduled trigger to avoid thundering-herd bursts when many jobs share the same tick; default 0s (no jitter)

Executor

  • executor.pool_size — size of the bounded goroutine worker pool; default 8; surfaced in huskyd.yaml so production deployments can tune without recompile
  • executor.shell — shell used to run job commands; default /bin/sh; common override /bin/bash
  • executor.global_env — key-value pairs injected into every job's environment; lower priority than per-job env in husky.yaml; higher priority than host environment
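The precedence rule for executor.global_env (host < global_env < per-job env) is just layered map merging; mergeEnv is an illustrative helper under that assumption:

```go
package main

import "fmt"

// mergeEnv layers the three environment sources; later layers win on
// key conflicts, giving host < executor.global_env < per-job env.
func mergeEnv(host, global, perJob map[string]string) map[string]string {
	out := make(map[string]string, len(host)+len(global)+len(perJob))
	for _, layer := range []map[string]string{host, global, perJob} {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	env := mergeEnv(
		map[string]string{"PATH": "/usr/bin", "TZ": "UTC"}, // host
		map[string]string{"TZ": "America/New_York"},        // executor.global_env
		map[string]string{"TZ": "Europe/London"},           // per-job env wins
	)
	fmt.Println(env["TZ"]) // Europe/London
}
```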

Testing

  • Unit tests: bearer token middleware — valid token accepted, invalid token → 401, missing header → 401, auth disabled → all requests pass
  • Unit tests: basic auth middleware — valid credentials, wrong password, unknown user
  • Unit tests: RBAC — viewer can GET, viewer cannot POST trigger, operator can trigger, admin can do everything
  • Unit tests: storage retention vacuum — rows older than max_age deleted, rows within window kept, RUNNING rows never deleted
  • Unit tests: max_runs_per_job cap — correct rows pruned, most recent N retained
  • Unit tests: executor.global_env — values present in subprocess env, overridden by per-job env
  • Integration test: bearer auth enabled — husky status (HTTP commands) fail without HUSKY_TOKEN, succeed with correct token
  • Integration test: huskyd.yaml absent — daemon starts with all defaults; no error or warning beyond "no huskyd.yaml found, using defaults"

Phase 4 — Hardening & Distribution

Weeks 10–12 · Cross-compilation, packaging, auth, TLS, integration test suite.

Cross-Compilation

  • Build matrix in Makefile: linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64
  • Verify modernc.org/sqlite CGO-free build works on all five targets
  • Produce reproducible builds (embed VERSION, COMMIT, BUILD_DATE via ldflags)
  • Add make dist target that produces a dist/ directory with all platform archives (.tar.gz) — each archive contains the single husky binary
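The VERSION / COMMIT / BUILD_DATE embedding above is conventionally done with -ldflags -X against package variables. A sketch (the variable names inside internal/version are assumptions; the module path comes from Phase 0; archiving into .tar.gz is omitted):

```make
# Sketch only — internal/version variable names are assumptions
VERSION    ?= $(shell git describe --tags --always)
COMMIT     ?= $(shell git rev-parse --short HEAD)
BUILD_DATE ?= $(shell date -u +%Y-%m-%dT%H:%M:%SZ)
LDFLAGS     = -s -w \
  -X github.com/husky-scheduler/husky/internal/version.Version=$(VERSION) \
  -X github.com/husky-scheduler/husky/internal/version.Commit=$(COMMIT) \
  -X github.com/husky-scheduler/husky/internal/version.BuildDate=$(BUILD_DATE)

PLATFORMS = linux/amd64 linux/arm64 darwin/amd64 darwin/arm64 windows/amd64

dist:
	for p in $(PLATFORMS); do \
	  GOOS=$${p%/*} GOARCH=$${p#*/} CGO_ENABLED=0 \
	  go build -ldflags '$(LDFLAGS)' -o dist/husky-$${p%/*}-$${p#*/} ./cmd/husky; \
	done
```

CGO_ENABLED=0 is what makes the modernc.org/sqlite pure-Go build portable across all five targets.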

Packaging & Distribution

  • Write Homebrew formula husky.rb (binary download + checksum)
  • Build .deb package for Debian/Ubuntu (using goreleaser or hand-rolled dpkg-deb)
  • Build .rpm package for RHEL/Fedora
  • Create install.sh one-liner installer (detects platform, downloads correct binary)
  • Write systemd unit file example (huskyd.service) — ExecStart points to husky daemon run
  • Write launchd plist example for macOS autostart (invokes husky daemon run)
  • Publish GitHub Release via GoReleaser CI step with checksums and signatures
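A sketch of the huskyd.service example named above; the install path, service user, and WorkingDirectory are assumptions, and only ExecStart = husky daemon run comes from this document:

```ini
# huskyd.service — sketch; paths and user are placeholders
[Unit]
Description=Husky job scheduler daemon
After=network.target

[Service]
Type=simple
User=husky
WorkingDirectory=/etc/husky
ExecStart=/usr/local/bin/husky daemon run
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```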

Token Authentication

Superseded by Phase 3.5 (auth.type: bearer + auth.type: basic + RBAC in huskyd.yaml).

  • Implement token auth middleware for all REST + WebSocket endpoints (§7.2)
  • Read token from auth.token in husky.yaml or HUSKY_TOKEN env var
  • Store bcrypt hash of token in job_state database; never store plaintext
  • Require Authorization: Bearer <token> header when auth is enabled
  • Token auth disabled by default; emit startup log note when disabled
  • husky reads token from HUSKY_TOKEN env var for socket/HTTP calls

TLS

Superseded by Phase 3.5 (api.tls.* block in huskyd.yaml; --daemon-config flag provides override path).

  • Support --tls-cert and --tls-key flags on daemon start (§7.1)
  • Verify cert/key pair at startup; refuse to start on invalid files
  • Serve HTTPS when TLS flags are provided; HTTP otherwise
  • Document self-signed cert generation steps in README

External Binding

Superseded by Phase 3.5 (api.addr in huskyd.yaml).

  • Support --bind flag to override default 127.0.0.1:8420 (§7.1)
  • Emit prominent startup warning when bound to 0.0.0.0 without TLS + auth enabled

Integration Test Suite

  • Test: parse, validate, and execute the full §2.3 example pipeline end-to-end
  • Test: DAG execution order is correct (ingest → transform → report)
  • Test: on_failure: stop halts downstream jobs
  • Test: retry with exponential backoff reaches max retries and fires alert
  • Test: concurrency: forbid skips overlapping runs
  • Test: daemon crash mid-run → restart → orphan reconciled → retry fires
  • Test: catchup: true triggers missed run after restart
  • Test: catchup: false skips missed run after restart
  • Test: SIGHUP hot-reload swaps config without interrupting running job
  • Test: cycle in config is rejected at startup and reload
  • Test: token auth rejects requests without valid bearer token
  • Test: husky validate catches all invalid field combinations
  • Test: job with sla breaches threshold, on_sla_breach fires, run completes with sla_breached = 1 and SUCCESS status (Feature 1)
  • Test: job with healthcheck, main succeeds, healthcheck fails → run marked FAILED, retries triggered (Feature 6)
  • Test: job with healthcheck on_fail: warn_only, healthcheck fails → run marked SUCCESS with hc_status = warn (Feature 6)
  • Test: output-passing pipeline, downstream job receives correct {{ outputs.* }} value (Feature 2)
  • Test: output values from cycle A are not visible in independently triggered cycle B (Feature 2)
  • Test: husky run <job> --reason persists reason; husky audit returns it; {{ run.reason }} renders in notification template (Feature 3, 5)
  • Test: per-job timezone โ€” job scheduled at wall-clock time in America/New_York fires at correct UTC moment before and after a DST transition (Feature 7)
  • Test: husky pause --tag <tag> pauses all matching jobs; husky resume --tag <tag> resumes them (Feature 4)
  • Benchmark: scheduler tick latency under 100 concurrent jobs < 10 ms

Documentation

  • Write README.md — quickstart, install, husky.yaml, huskyd.yaml example, CLI reference
  • Write docs/configuration.md — full field reference including all v1.0 fields: sla, tags, timezone, healthcheck, output, expanded notify schema (Features 1–7)
  • Write docs/security.md — auth, TLS, process isolation guidance
  • Write docs/crash-recovery.md — startup sequence walkthrough
  • Write docs/output-passing.md — guide to capture modes, cycle_id scoping, and template syntax (Feature 2)
  • Write docs/notifications.md — all channel formats, template variables, only_after_failure behaviour (Feature 3)
  • Add CHANGELOG.md for v1.0.0
  • Use Docusaurus to create a documentation website

Phase 5: Backlog

Deferred from Phase 3.5

Items below were descoped from the alpha release. They are fully specified and ready to implement post-launch.

OIDC Authentication

  • auth.type: oidc — full implementation: fetch and cache JWKS from auth.oidc.jwks_uri; verify JWT signatures; refresh on key rotation (background goroutine)
  • auth.oidc.issuer, .client_id, .audience, .jwks_uri — standard OIDC discovery fields
  • auth.oidc.role_claim — JWT claim name mapped to Husky roles (viewer, operator, admin)
  • auth.oidc.default_role — role assigned when claim is absent

Executor Advanced Config

  • executor.working_dir: config_dir | <absolute_path> — working directory for all child processes; config_dir (default) means the directory containing husky.yaml
  • executor.resource_limits.max_memory_mb — Linux: set RLIMIT_AS; macOS: best-effort via setrlimit; 0 = unlimited
  • executor.resource_limits.max_open_files — set RLIMIT_NOFILE on child process; default 1024
  • executor.resource_limits.max_pids — Linux: set RLIMIT_NPROC; limits forking within a job

Metrics

  • metrics.enabled — expose a Prometheus-compatible scrape endpoint; default false
  • metrics.addr — bind address for the /metrics HTTP server; default 127.0.0.1:9091 (separate from dashboard port)
  • metrics.path — URL path for the scrape endpoint; default /metrics
  • metrics.auth — when true, protect /metrics with the same bearer token as the main API
  • Instrument: husky_job_runs_total{job,status,trigger}, husky_job_duration_seconds{job}, husky_job_sla_breaches_total{job}, husky_job_retries_total{job}, husky_scheduler_tick_duration_seconds, husky_running_jobs, husky_daemon_uptime_seconds
  • Use prometheus/client_golang — add to go.mod
  • Metrics server lifecycle tied to daemon context (shut down cleanly on SIGTERM)

Tracing

  • tracing.enabled — emit OpenTelemetry spans; default false
  • tracing.exporter — otlp | jaeger | zipkin | stdout; default otlp
  • tracing.endpoint, .service_name, .sample_rate
  • Instrument spans for: job dispatch, executor subprocess wait, retry backoff sleep, IPC request handling, notification delivery
  • Use go.opentelemetry.io/otel — add to go.mod
  • Tracer provider shut down gracefully on daemon context cancellation

Secrets Backends

  • secrets.provider: vault — resolve ${secret:NAME} placeholders via HashiCorp Vault KV v2 (addr, token_file, mount, path_prefix)
  • secrets.provider: aws_ssm — AWS Systems Manager Parameter Store (ambient IAM)
  • secrets.provider: aws_secrets_manager — AWS Secrets Manager (ambient IAM)
  • secrets.provider: gcp_secret_manager — GCP Secret Manager (Application Default Credentials)
  • Secret resolution at startup + SIGHUP; never written to disk or logs
  • Add ${secret:NAME} interpolation syntax alongside ${env:VAR} in internal/config/env.go
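The ${secret:NAME} / ${env:VAR} interpolation above can be sketched with a single regex and a pluggable resolver. The function name and resolver signature are illustrative, not what internal/config/env.go actually contains:

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

var placeholderRe = regexp.MustCompile(`\$\{(env|secret):([A-Za-z0-9_]+)\}`)

// expandPlaceholders replaces ${env:VAR} from the process environment and
// ${secret:NAME} via the configured secrets provider; resolved values are
// returned to the caller only, never logged.
func expandPlaceholders(s string, secrets func(name string) (string, error)) (string, error) {
	var firstErr error
	out := placeholderRe.ReplaceAllStringFunc(s, func(m string) string {
		parts := placeholderRe.FindStringSubmatch(m)
		kind, name := parts[1], parts[2]
		if kind == "env" {
			return os.Getenv(name)
		}
		v, err := secrets(name)
		if err != nil && firstErr == nil {
			firstErr = err
		}
		return v
	})
	return out, firstErr
}

func main() {
	mock := func(name string) (string, error) { return "hunter2", nil }
	v, _ := expandPlaceholders("token=${secret:API_TOKEN}", mock)
	fmt.Println(v) // token=hunter2
}
```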

Daemon-Level Alerts

  • alerts.on_daemon_start — fire notification(s) when huskyd starts successfully
  • alerts.on_sla_breach — daemon-level fallback for SLA breach notifications when the job does not define notify.on_sla_breach
  • alerts.on_forced_kill — fire notification when a job is killed by shutdown_timeout SIGKILL

Dashboard Customisation

  • dashboard.enabled: false — disable dashboard while keeping REST + WebSocket API; respond to / with 404
  • dashboard.title — custom title injected into <title> and header; served via GET /api/config
  • dashboard.accent_color — CSS custom property for primary accent colour
  • dashboard.log_backfill_lines — historical lines sent on WebSocket connect; default 200
  • dashboard.poll_interval — frontend poll interval; served via GET /api/config; default 5s
  • GET /api/config — new endpoint returning dashboard runtime config (no auth required)

HTTP Client (Outbound)

  • http_client.timeout — global timeout for all outbound HTTP calls; default 15s
  • http_client.max_retries / .retry_backoff — notification delivery retries
  • http_client.proxy — optional HTTP proxy URL for all outbound calls
  • http_client.ca_bundle — PEM CA bundle appended to system cert pool

Process / System

  • process.user / process.group — drop privileges after port bind (Linux setuid/setgid; warning on macOS/Windows)
  • process.pid_file — override PID file location
  • process.ulimit_nofile — setrlimit(RLIMIT_NOFILE) on daemon process at startup
  • process.watchdog_interval — sd_notify WATCHDOG=1 pings for systemd WatchdogSec=

Phase 3.5 Tests (deferred)

  • Unit tests: DaemonConfig parsing — all fields, defaults, unknown fields rejected
  • Unit tests: JSON Schema validation of huskyd.yaml — valid, invalid, missing optional sections
  • Unit tests: api.base_path prefix stripping — all routes reachable under prefix
  • Unit tests: log format switch — structured JSON output when log.format: json
  • Unit tests: scheduler.schedule_jitter — dispatched times fall within [tick, tick + jitter] bounds
  • Unit tests: Prometheus metrics — counters increment on run completion, gauge reflects running count
  • Unit tests: ${secret:NAME} interpolation — resolved from mock provider, not leaked to logs
  • Unit tests: dashboard.poll_interval + dashboard.title served by GET /api/config
  • Integration test: api.addr: 127.0.0.1:19999 — verify daemon binds to configured port
  • Integration test: TLS cert + key — https:// succeeds, http:// rejected
  • Integration test: SIGHUP with updated log.level: debug — debug lines appear without restart
  • Integration test: storage vacuum fires; pruned row count matches max_age config

General Backlog

  • Distributed execution across multiple machines
  • External queue integrations (Kafka, Redis, RabbitMQ)
  • Dynamic job creation via API at runtime
  • Plugin SDK / extensible executor types
  • Per-job log retention config (default: 30 days / 1,000 runs) with background vacuum
  • Optional log shipping to external sinks
  • DST ambiguous window warnings (§9 Risk Register)

Multi-Project Daemon Mode

Target: after v1.0 ships and real-world usage patterns are established.

Goal: A single huskyd process manages multiple projects, each with its own husky.yaml, SQLite store, and job namespace. One binary, one process, one dashboard — regardless of how many projects are registered.

Rationale: v1.0 ships as a per-project daemon (one daemon per husky.yaml). This is the right model for the simple case and for developers using Husky locally. For production deployments — ops teams, servers, organisations running many pipelines — a per-project model multiplies resource overhead and precludes a unified operational view. V2 closes this gap without breaking v1.0 users.

Project Registry

  • Define a top-level huskyd.yaml server config file (distinct from per-project husky.yaml) — specifies bind address, TLS, auth, log retention, and the list of registered projects
  • Add projects: block to huskyd.yaml: list of entries with name (unique identifier) and config (path to project's husky.yaml)
  • Implement husky register <path> command — adds a project entry to huskyd.yaml and hot-reloads the daemon
  • Implement husky deregister <name> command — removes project entry; waits for running jobs to complete before tearing down project context
  • Implement husky projects list — tabular output: project name, config path, job count, status (active / paused / error)
  • Validate that project names are unique, lowercase alphanumeric with hyphens, max 64 characters
  • Detect duplicate husky.yaml paths at registration time; reject with a clear error
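The name rule above (lowercase alphanumeric with hyphens, max 64 characters) can be sketched as a single regex check; validProjectName is an illustrative helper, and rejecting a leading hyphen is an added assumption beyond the stated rule:

```go
package main

import (
	"fmt"
	"regexp"
)

// Lowercase alphanumeric plus hyphens, starting with a letter or digit.
var projectNameRe = regexp.MustCompile(`^[a-z0-9][a-z0-9-]*$`)

// validProjectName enforces the registration-time naming rule.
func validProjectName(name string) bool {
	return len(name) <= 64 && projectNameRe.MatchString(name)
}

func main() {
	fmt.Println(validProjectName("billing-prod")) // true
	fmt.Println(validProjectName("Billing"))      // false: uppercase
}
```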

Project Isolation

  • Each registered project runs in an isolated context: independent scheduler goroutine, independent executor pool, independent SQLite database under <data-dir>/<project-name>/husky.db
  • Per-project goroutine pool size — inherits global default from huskyd.yaml but overridable per project
  • A panic or fatal error in one project context must not crash or stall other projects — recover and mark project as error state, emit alert
  • Per-project SIGHUP-style config reload — daemon watches each husky.yaml for changes and reloads only the affected project context
  • Global file-watcher (using fsnotify) replaces per-project polling; each husky.yaml change triggers isolated reload pipeline

Job Namespacing

  • All job identifiers are namespaced as <project-name>::<job-name> internally
  • CLI commands accept --project <name> flag to scope operations: husky run --project billing daily-report
  • Without --project, CLI auto-detects project by walking up from CWD until a husky.yaml is found, then resolves to the registered project name — preserves v1.0 UX in terminal
  • after: dependencies within a project remain unqualified (after: ingest) — no breaking change to v1.0 config syntax
  • Cross-project after: dependencies expressed as after: "billing::daily-report" — optional v2 feature, gated behind a config flag allow_cross_project_deps: true
  • husky status without --project shows only the auto-detected project; husky status --all shows all projects
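The reference resolution above (qualified "project::job" versus an unqualified name falling back to the auto-detected project) reduces to one split; splitJobRef is an illustrative helper:

```go
package main

import (
	"fmt"
	"strings"
)

// splitJobRef returns the project and job parts of a reference such as
// "billing::daily-report"; an unqualified name like "ingest" resolves
// against defaultProject (the CWD-auto-detected project).
func splitJobRef(ref, defaultProject string) (project, job string) {
	if p, j, ok := strings.Cut(ref, "::"); ok {
		return p, j
	}
	return defaultProject, ref
}

func main() {
	p, j := splitJobRef("billing::daily-report", "local")
	fmt.Println(p, j) // billing daily-report
	p, j = splitJobRef("ingest", "local")
	fmt.Println(p, j) // local ingest
}
```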

Unified Web Dashboard

  • Single web dashboard serves all projects — project selector dropdown in top navigation
  • Default landing page shows cross-project summary: total jobs, running count, recent failures across all projects
  • Per-project view mirrors the v1.0 dashboard (job table, log stream, run history)
  • WebSocket log streams are scoped per project; subscribing to a project stream does not receive logs from other projects
  • GET /api/projects — list all registered projects with summary stats
  • GET /api/projects/<name>/jobs — scoped equivalent of v1.0 GET /api/jobs
  • GET /api/projects/<name>/jobs/<job>/runs — scoped run history
  • Retain v1.0 unscoped endpoints (GET /api/jobs) as aliases that resolve via CWD auto-detection or X-Husky-Project header

Unified Integration Credentials

  • Move Slack, PagerDuty, Discord, SMTP integration credentials to huskyd.yaml as global integrations — available to all projects without duplication
  • Per-project husky.yaml may still define local integrations that supplement the global ones; a project-level integration with the same name overrides the global entry
  • husky integrations list shows global integrations from huskyd.yaml; husky integrations list --project <name> shows merged view for that project

Single Unix Socket

  • Replace per-project Unix sockets with a single socket at a fixed path (e.g. /var/run/husky/huskyd.sock or ~/.husky/huskyd.sock)
  • All IPC messages include a project field; daemon routes to the correct project context
  • v1.0 CLI compatibility: if project field is absent in IPC message, daemon resolves project from the cwd field included in the message

Migration from V1

  • Document upgrade path in docs/upgrading-to-v2.md: existing per-project huskyd users move to running a single huskyd with a huskyd.yaml that registers their project
  • husky migrate command — scans CWD for husky.yaml, generates a minimal huskyd.yaml with the project registered, and prints next steps
  • v1.0 invocation (huskyd --config ./husky.yaml) continues to work unchanged as a single-project shorthand — no breaking change for existing users

Testing

  • Unit tests: project registry — register, deregister, duplicate detection, name validation
  • Unit tests: job namespacing — auto-detect project from CWD, --project flag override, cross-project after: resolution
  • Integration test: two projects registered, run jobs in both simultaneously, verify isolation (no log cross-contamination, independent retry state)
  • Integration test: one project panics — verify other project continues running unaffected
  • Integration test: reload one project's husky.yaml via file-watcher — verify other project is not restarted
  • Integration test: husky status --all returns correct aggregated view across projects
  • Integration test: cross-project after: dependency — downstream job in project B fires after upstream in project A completes
  • Integration test: husky migrate generates valid huskyd.yaml from existing husky.yaml
  • Benchmark: scheduler tick latency under 10 projects × 100 jobs (1,000 total) < 20 ms