Husky — Task Tracker

Local-First Job Scheduler with Dependency Graphs


Phase 0 — Project Bootstrap

Weeks 0–1 · Set up the repository, toolchain, and all dependencies before any feature work begins.

Repository & Tooling

  • Initialise Git repository and add .gitignore (Go, binaries, .env)
  • Create top-level directory structure (cmd/, internal/, web/, docs/, scripts/, tests/)
  • Initialise Go module (go mod init github.com/husky-scheduler/husky)
  • Pin Go toolchain version in go.mod / .tool-versions (go 1.24.0)
  • Add Makefile with targets: build, test, lint, clean, run
  • Configure golangci-lint (.golangci.yml)
  • Add pre-commit hooks (lint + go vet on staged files)
  • Set up GitHub Actions CI skeleton (lint + test on push)

Dependencies

  • Add YAML parser — gopkg.in/yaml.v3
  • Add SQLite driver — modernc.org/sqlite (pure-Go, CGO-free)
  • Add JSON Schema validator for husky.yaml validation — santhosh-tekuri/jsonschema/v5
  • Add structured logger — log/slog (stdlib, Go 1.21+)
  • Add CLI framework — spf13/cobra
  • Add test assertion library — testify/assert + testify/require
  • Run go mod tidy and commit go.sum

Skeleton

  • Create embedded daemon runtime entry point (initially cmd/huskyd/main.go, later invoked via husky daemon run)
  • Create cmd/husky/main.go entry point (CLI)
  • Add placeholder VERSION constant and --version flag (internal/version package)
  • Create internal/config/ package stub
  • Create internal/store/ package stub
  • Verify make build produces the husky binary without errors

Phase 1 — Core Engine

Weeks 1–3 · YAML parser, schedule evaluator, SQLite schema, basic executor, and core CLI commands.

Step 1: YAML Parser & Config

  • Define Config and Job Go structs matching the husky.yaml schema (§2.1, §2.4)
  • Implement YAML unmarshalling into Config struct
  • Implement JSON Schema validation of husky.yaml at load time
  • Validate frequency enum values (§2.2.1)
  • Validate time field — 4-char military format, valid hour/minute range (§2.2.2)
  • Validate on_failure enum (alert | skip | stop | ignore)
  • Validate retry_delay format (exponential or fixed:<duration>)
  • Validate concurrency enum (allow | forbid | replace)
  • Interpolate ${env:HOST_VAR} references in env map at runtime (§7.3)
  • Apply defaults block to all jobs that do not override a field (§2.1)
  • Return structured parse errors with file location context
  • Add timezone field to Job struct — IANA timezone identifier string (Feature 7)
  • Add timezone field to Defaults struct as global fallback (Feature 7)
  • Validate timezone identifier at parse time using Go's embedded time/tzdata package; reject unknown identifiers (Feature 7)
  • Add sla field to Job struct — optional duration string (Feature 1)
  • Validate at parse time that sla < timeout when both are set; return structured error if not (Feature 1)
  • Add tags field to Job struct — list of strings (Feature 4)
  • Validate tags: lowercase alphanumeric with hyphens only, max 10 tags per job, max 32 characters per tag (Feature 4)
  • Expand Notify struct to full rich notification schema — on_failure, on_success, on_sla_breach, on_retry sub-objects each with channel, message, attach_logs, and only_after_failure fields (Feature 3)
  • Maintain backward compatibility: shorthand string form (notify.on_failure: slack:#channel) continues to parse correctly (Feature 3)
  • Add healthcheck block to Job struct: command string, timeout duration string, on_fail enum (mark_failed | warn_only) (Feature 6)
  • Add output block to Job struct: map of variable name → capture mode string (last_line, first_line, json_field:<key>, regex:<pattern>, exit_code) (Feature 2)
  • Add Integration struct with provider-specific credential fields (webhook_url, routing_key, host, port, username, password, from) (Feature 3)
  • Add Integrations map[string]*Integration top-level field to Config struct (Feature 3)
  • Infer provider from map key when key matches a known provider name (slack, pagerduty, discord, smtp, webhook); require explicit provider field for custom key names (Feature 3)
  • Validate each integration: required credential fields per provider (webhook_url for slack/discord/webhook, routing_key for pagerduty, host+from for smtp), SMTP port range 1โ€“65535 (Feature 3)
  • Extend ${env:VAR} interpolation to cover all integration credential fields (webhook_url, routing_key, username, password) (Feature 3)
  • Implement LoadDotEnv(dir) — source .env file from same directory as husky.yaml before env interpolation; non-destructive (process env vars take precedence) (Feature 3)
  • Update JSON Schema to allow integrations: top-level block with integration $def (Feature 3)
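
The struct shapes and tag rules above can be sketched as follows. This is a minimal sketch: field names beyond those listed in this step, and the exact tag regex (which also forbids leading/trailing hyphens), are assumptions rather than the full §2.1/§2.4 schema.

```go
package main

import (
	"fmt"
	"regexp"
)

// Job holds a subset of the husky.yaml job fields; the real struct
// carries the full §2.4 field set.
type Job struct {
	Name      string   `yaml:"name"`
	Command   string   `yaml:"command"`
	Frequency string   `yaml:"frequency"`
	Time      string   `yaml:"time,omitempty"`
	Timezone  string   `yaml:"timezone,omitempty"`
	SLA       string   `yaml:"sla,omitempty"`
	Tags      []string `yaml:"tags,omitempty"`
}

// Config is the top-level husky.yaml shape (integrations etc. elided).
type Config struct {
	Defaults Job            `yaml:"defaults"`
	Jobs     map[string]Job `yaml:"jobs"`
}

// Assumed tag shape: hyphen-separated lowercase alphanumeric segments.
var tagRe = regexp.MustCompile(`^[a-z0-9]+(-[a-z0-9]+)*$`)

// ValidateTags enforces the Feature 4 rules: max 10 tags per job,
// max 32 characters per tag, lowercase alphanumeric with hyphens only.
func ValidateTags(tags []string) error {
	if len(tags) > 10 {
		return fmt.Errorf("too many tags: %d (max 10)", len(tags))
	}
	for _, t := range tags {
		if len(t) > 32 {
			return fmt.Errorf("tag %q exceeds 32 characters", t)
		}
		if !tagRe.MatchString(t) {
			return fmt.Errorf("tag %q: lowercase alphanumeric with hyphens only", t)
		}
	}
	return nil
}

func main() {
	fmt.Println(ValidateTags([]string{"etl", "nightly-batch"})) // <nil>
	fmt.Println(ValidateTags([]string{"Bad_Tag"}) != nil)       // true
}
```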

Step 2: Schedule Evaluator

  • Implement NextRunTime(job Job, now time.Time) time.Time for all frequency values
    • hourly → next :00
    • daily, weekly, monthly, weekdays, weekends → next matching wall-clock tick using time field
    • manual → never schedules automatically
    • after:<job> → dependency-driven (no time calculation)
  • Log resolved UTC equivalent of every scheduled time at daemon startup
  • Implement 1-second ticker loop that evaluates all jobs each tick
  • Resolve time field relative to per-job timezone; fall back to defaults.timezone, then system timezone (Feature 7)
  • Use Go's embedded time/tzdata for timezone resolution โ€” no system tzdata dependency (Feature 7)
  • Handle DST gap: when the scheduled wall-clock time does not exist, run at the next valid time after the gap and log a warning (Feature 7)
  • Handle DST overlap: when the scheduled time occurs twice, run once on the first occurrence and log a warning (Feature 7)
  • Include timezone abbreviation (e.g. EST, JST) alongside the UTC equivalent in daemon startup log (Feature 7)

Step 3: SQLite State Store

  • Implement store package with WAL-mode SQLite connection
  • Create schema migrations for job_runs, job_state, run_logs, alerts tables (§4.1)
  • Add reason TEXT and triggered_by TEXT DEFAULT "scheduler" columns to job_runs (Feature 5)
  • Add sla_breached INTEGER DEFAULT 0 column to job_runs (Feature 1)
  • Add hc_status TEXT column to job_runs — values: null | pass | fail | warn (Feature 6)
  • Create run_outputs table: (id, run_id, job_name, var_name, value, cycle_id, created_at) with index on cycle_id (Feature 2)
  • Implement serialised writer goroutine channel for all writes
  • Implement RecordRunStart, RecordRunEnd, RecordLog, GetJobState operations
  • Implement UpdateJobState (last_success, last_failure, next_run, lock_pid)

Step 4: Executor

  • Implement bounded goroutine pool (default: 8 concurrent jobs) (§3.2.3)
  • Launch each job as a subprocess via /bin/sh -c (or direct exec for array form)
  • Set Setpgid=true on all subprocesses (§7.4)
  • Capture stdout and stderr line-by-line and write to run_logs in real time
  • Implement timeout: send SIGTERM, then SIGKILL after grace period
  • Kill entire process group on timeout/cancel

Step 5: Core CLI (husky)

  • husky start — launch the embedded Husky daemon in background, write daemon.pid, detach
  • Hidden husky daemon run entry point for direct foreground daemon execution by service managers and development workflows
  • husky stop — graceful shutdown (drain running jobs)
  • husky stop --force — immediate SIGKILL to all running jobs
  • husky status — tabular display of all job states from job_state
  • husky run <job> — trigger job immediately, bypassing schedule
  • Unix socket IPC between husky (CLI) and the embedded daemon runtime

Step 6: Testing

  • Unit tests for YAML parser (valid configs, missing required fields, bad enum values)
  • Unit tests for time field validation (all rows from Table §2.2.2)
  • Unit tests for schedule evaluator (NextRunTime for each frequency value)
  • Unit tests for SQLite store (CRUD, WAL concurrency)
  • Integration test: parse full example husky.yaml (§2.3) without errors

Phase 2 — DAG + Reliability

Weeks 4–6 · Dependency resolution, retry state machine, crash recovery, hot-reload, catchup.

DAG Resolver

  • Implement topological sort using Kahn's algorithm on depends_on declarations (§3.2.2)
  • Detect cycles at startup and emit full cycle path in error message
  • Reject daemon start if any cycle is found
  • Resolve after:<job> frequency as an implicit depends_on edge
  • At runtime, check all upstream job states are SUCCESS before dispatching a dependent job
  • husky dag — print ASCII DAG of all jobs and their dependencies
  • husky dag --json — emit machine-readable DAG structure

Retry FSM

  • Implement job run state machine: PENDING → RUNNING → SUCCESS | FAILED → RETRYING (§3.2.4)
  • Implement exponential backoff with ±25% jitter (doubles each attempt: 30s, 60s, 120s…)
  • Implement fixed:<duration> retry delay
  • On max retries exceeded, execute on_failure action:
    • alert — fire configured notify channels
    • skip — mark run SKIPPED, continue other jobs
    • stop — halt the entire pipeline (cancel all downstream dependents)
    • ignore — log failure silently, take no action
  • Implement concurrency: forbid — skip run if previous is still running (§2.4)
  • Implement concurrency: replace — kill running instance and start fresh
  • husky retry <job> — retry last failed run from scratch
  • husky cancel <job> — send SIGTERM to running job
  • husky skip <job> — mark pending run as SKIPPED

Healthchecks

  • After main command exits with code 0, run healthcheck.command as a separate process (Feature 6)
  • Apply healthcheck.timeout (default: 30s); send SIGTERM + SIGKILL to healthcheck if exceeded (Feature 6)
  • on_fail: mark_failed (default): mark run FAILED, append healthcheck stderr to run_logs, trigger retry policy (Feature 6)
  • on_fail: warn_only: mark run SUCCESS with hc_status = warn, fire on_sla_breach notify event (Feature 6)
  • Skip healthcheck entirely when main command exits non-zero (Feature 6)
  • Capture healthcheck stdout/stderr into run_logs tagged with stream = "healthcheck" (Feature 6)
  • Expose husky logs <job> --include-healthcheck flag to include healthcheck log lines (Feature 6)

Output Passing

  • After job completion, evaluate the job's output block and write each captured value to run_outputs (Feature 2)
  • Implement all capture modes: last_line, first_line, json_field:<key>, regex:<pattern>, exit_code (Feature 2)
  • Generate a cycle_id UUID at the root of each trigger chain (schedule tick or manual run); propagate to all downstream dependent jobs (Feature 2)
  • At job dispatch time, render {{ outputs.<job_name>.<var_name> }} template expressions in command and env values (Feature 2)
  • Return a descriptive error at dispatch time if a referenced output variable has no recorded value for the current cycle_id (Feature 2)
  • Scope all output values to a single cycle; never bleed values across independent trigger chains (Feature 2)
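
The five capture modes can be sketched as one dispatch function. Two details here are assumptions not fixed by the spec: json_field parses the last stdout line, and regex: returns the first capture group.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// capture evaluates one output capture mode against a job's stdout and
// exit code: last_line, first_line, json_field:<key>, regex:<pattern>,
// exit_code.
func capture(mode, stdout string, exitCode int) (string, error) {
	lines := strings.Split(strings.TrimRight(stdout, "\n"), "\n")
	switch {
	case mode == "last_line":
		return lines[len(lines)-1], nil
	case mode == "first_line":
		return lines[0], nil
	case mode == "exit_code":
		return strconv.Itoa(exitCode), nil
	case strings.HasPrefix(mode, "json_field:"):
		key := strings.TrimPrefix(mode, "json_field:")
		var m map[string]any // assumption: JSON document is the last line
		if err := json.Unmarshal([]byte(lines[len(lines)-1]), &m); err != nil {
			return "", err
		}
		return fmt.Sprint(m[key]), nil
	case strings.HasPrefix(mode, "regex:"):
		re, err := regexp.Compile(strings.TrimPrefix(mode, "regex:"))
		if err != nil {
			return "", err
		}
		if m := re.FindStringSubmatch(stdout); len(m) > 1 {
			return m[1], nil // assumption: first capture group is the value
		}
		return "", fmt.Errorf("pattern %q matched nothing", mode)
	}
	return "", fmt.Errorf("unknown capture mode %q", mode)
}

func main() {
	out := "starting\n{\"file_path\": \"/tmp/data.csv\"}\n"
	v, _ := capture("json_field:file_path", out, 0)
	fmt.Println(v) // /tmp/data.csv
}
```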

Crash Recovery

  • On startup, acquire daemon.pid lock (§6 step 2)
  • Detect stale lock (PID dead) vs. live daemon (exit with "daemon already running")
  • Reconcile orphaned runs: query job_state WHERE lock_pid IS NOT NULL, probe each PID with kill -0 (§6 step 3)
  • For dead PIDs: mark run FAILED, increment attempt, schedule retry
  • Reconcile missed schedules based on catchup flag (§6 step 4)
    • catchup: true → trigger missed run immediately
    • catchup: false → log skip, advance to next scheduled tick

Config Hot-Reload (SIGHUP)

  • Handle SIGHUP — parse full new config before applying
  • Atomic swap of running config under mutex
  • Ensure running jobs are never interrupted during reload
  • Re-run cycle detection on new config; reject reload (keep old config) if invalid
  • husky reload — send SIGHUP to daemon via Unix socket

Testing

  • Unit tests for Kahn's algorithm (linear chain, fan-in, fan-out, cycle detection)
  • Unit tests for Retry FSM state transitions
  • Unit tests for exponential backoff timing + jitter bounds
  • Unit tests for each on_failure handler
  • Integration test: crash daemon mid-run, restart, verify orphan reconciliation
  • Integration test: introduce a cycle into config, verify daemon rejects it
  • Unit tests for healthcheck execution flow: main success + healthcheck pass, main success + healthcheck fail (mark_failed), main success + healthcheck fail (warn_only), main fail skips healthcheck (Feature 6)
  • Unit tests for healthcheck timeout enforcement (Feature 6)
  • Unit tests for each output capture mode: last_line, first_line, json_field, regex, exit_code (Feature 2)
  • Unit tests for {{ outputs.<job>.<var> }} template rendering in command and env (Feature 2)
  • Unit tests for cycle_id scoping โ€” values from cycle A are not visible in cycle B (Feature 2)
  • Unit tests for timezone scheduling: correct wall-clock resolution for a sample of IANA zones, DST gap and overlap behaviour (Feature 7)
  • Unit tests for sla < timeout validation at parse time (Feature 1)

Phase 3 — Observability

Weeks 7–9 · REST API, WebSocket log streaming, embedded web dashboard, webhook alerting.

REST API

  • Bind API server to 127.0.0.1:8420 by default (§7.1)
  • GET /api/jobs — list all jobs with current state
  • GET /api/jobs?tag=<tag> — filter job list by tag (Feature 4)
  • GET /api/jobs/:name — single job detail (config + last N runs)
  • POST /api/jobs/:name/run — trigger manual run; accept optional reason body field (Feature 5)
  • POST /api/jobs/:name/cancel — cancel running job
  • GET /api/runs/:id — run detail with exit code, duration, sla_breached, hc_status fields (Feature 1, 6)
  • GET /api/runs/:id/logs — paginated log lines for a run
  • GET /api/runs/:id/outputs — output variables captured for a run (Feature 2)
  • GET /api/audit — searchable run history; query params: job, trigger, status, since, reason, tag (Feature 5)
  • GET /api/tags — list all defined tags (Feature 4)
  • GET /api/dag — DAG structure as JSON
  • GET /api/status — daemon health + uptime

WebSocket Log Streaming

  • GET /ws/logs/:run_id — stream live run_logs lines over WebSocket as they are written
  • Backfill existing log lines on connection open, then stream new lines
  • Close WebSocket cleanly when run reaches terminal state (SUCCESS | FAILED | SKIPPED)
  • husky logs <job> --tail — consume WebSocket stream in terminal

CLI Observability Commands

  • husky logs <job> — print last run logs
  • husky logs <job> --run=<id> — logs for a specific run ID
  • husky logs <job> --include-healthcheck — include healthcheck stream in output (Feature 6)
  • husky logs --tag <tag> --tail — stream live logs for all jobs matching a tag (Feature 4)
  • husky history <job> — table of last N runs (status, duration, trigger, reason, vs-SLA columns) (Feature 1, 5)
  • husky history <job> --last=<n> — configurable run count
  • husky validate — lint husky.yaml (schema, cycles, field types, timezone identifiers, sla < timeout)
  • husky validate --strict — also warn on missing descriptions and notify configs
  • husky config show — print effective config with defaults applied
  • husky export --format=json — export full state snapshot
  • husky status --tag <tag> — filter job status table by tag (Feature 4)
  • husky pause --tag <tag> — pause all jobs matching a tag (Feature 4)
  • husky resume --tag <tag> — resume all paused jobs matching a tag (Feature 4)
  • husky run --tag <tag> — trigger all jobs matching a tag immediately (Feature 4)
  • husky tags list — print all defined tags with the count of jobs carrying each one (Feature 4)
  • husky run <job> --reason "<text>" — attach a free-text annotation (max 500 chars) to the manual run record (Feature 5)
  • husky audit — searchable run history across all jobs (Feature 5)
  • husky audit --job <name> — filter by job (Feature 5)
  • husky audit --trigger manual|schedule|dependency — filter by trigger type (Feature 5)
  • husky audit --status failed|success|skipped — filter by run status (Feature 5)
  • husky audit --since "<date>" — filter runs after a given date (Feature 5)
  • husky audit --reason "<text>" — full-text search on the reason field (Feature 5)
  • husky audit --tag <tag> — filter by job tag (Feature 5)
  • husky audit --export csv — export results as CSV (Feature 5)

Web Dashboard

  • Scaffold Preact app in web/ directory
  • DAG view — dependency edges and topological execution order, renders from GET /api/dag JSON
  • Job list view — all jobs, last status, last run time, next scheduled run; tag filter dropdown; run/cancel actions
  • Run history view — recent runs per job in expandable row detail; run pills colour-coded by status
  • Extend run history with vs SLA column showing ✓ under / ⚠ +Δ (Feature 1)
  • Add third run-state colour: yellow for "running but past SLA" in addition to green/red (Feature 1)
  • Tag filter — tag dropdown on jobs view filters to matching jobs; tag dropdown on audit view (Feature 4)
  • Tag filter sidebar with aggregate health counts per tag (Feature 4)
  • Audit log view — searchable, filterable table (job, status, trigger, tag, since) with click-to-expand log viewer (Feature 5)
  • Live log viewer — streams WebSocket log output; stderr tinted red; auto-scroll; historic fallback for completed runs (Feature 6)
  • Job detail view — show captured output variables per run under a collapsible "Outputs" section (Feature 2)
  • Healthcheck output — show healthcheck log lines in a collapsible section below main output in run detail (Feature 6)
  • Embed compiled web assets into binary via go:embed (§3.3)
  • Serve dashboard from GET / on API server
  • Add make web target — npm ci && vite build writes Tailwind + Preact bundle to internal/api/dashboard/; make build depends on web
  • Style dashboard with Tailwind CSS (purged bundle ≈ 15 KB gzip)
  • Add timezone column to job list view (Feature 7)
  • Redact env var values in job config views (§7.3)

Rich Notifications

  • Implement notification dispatch for all four events: on_success, on_failure, on_sla_breach, on_retry (Feature 3)
  • Render message templates using {{ job.* }} and {{ run.* }} variables at dispatch time (Feature 3)
  • Attach last N lines of run_logs to notification when attach_logs: last_N_lines is set; send full log when all (Feature 3)
  • Implement only_after_failure: true on on_success — suppress notification unless previous completed run was FAILED (Feature 3)
  • Support slack:#channel and slack:@user — post formatted message via Slack Incoming Webhook URL (Feature 3)
  • Support pagerduty:<severity> — trigger PagerDuty Events API v2; severity values: p1, p2, p3, p4 (Feature 3)
  • Support discord:#channel — post plain text message via Discord Incoming Webhook URL (Feature 3)
  • Support webhook:<url> — HTTP POST with JSON payload to custom URL (Feature 3)
  • Support email:<address> — send SMTP email; requires smtp config block in husky.yaml defaults (Feature 3)
  • Implement SLA breach timer: when a job stays RUNNING past its sla duration, fire on_sla_breach (falls back to on_failure if not set); set sla_breached = 1 on the run record (Feature 1)
  • Write each dispatched alert to the alerts table (audit log) (§4.1)
  • Retry alert delivery up to 3 times with backoff on HTTP error
  • Backward compatibility: shorthand string form (notify.on_failure: slack:#channel) continues to dispatch correctly (Feature 3)
  • Resolve integration credentials at notification dispatch time: look up the integration name embedded in the channel string prefix (e.g. slack_ops:#alerts → cfg.Integrations["slack_ops"]) (Feature 3)
  • Fall back to the provider-named key when channel string uses bare provider prefix (e.g. slack:#channel → cfg.Integrations["slack"]) (Feature 3)
  • Support husky integrations list — display all configured integrations as a table: name, provider, and credential status (set / missing) (Feature 3)
  • Support husky integrations test <name> — send a live test message/event to the named integration and report success or failure (Feature 3)
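
Credential resolution from a channel string can be sketched as: split on the first colon, then look the prefix up as an integrations key. The bare-provider fallback comes for free, because slack:#channel resolves through the same map lookup against the provider-named key. Struct fields and error wording here are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// Integration holds a subset of the provider credential fields.
type Integration struct {
	Provider   string
	WebhookURL string
}

// Known provider names that may appear as bare channel prefixes.
var known = map[string]bool{
	"slack": true, "pagerduty": true, "discord": true, "smtp": true, "webhook": true,
}

// resolveIntegration splits "slack_ops:#alerts" into (integration,
// target): the prefix is tried as a named key, e.g.
// cfg.Integrations["slack_ops"]; a bare provider prefix like "slack"
// matches the provider-named key the same way.
func resolveIntegration(channel string, integrations map[string]*Integration) (*Integration, string, error) {
	name, target, ok := strings.Cut(channel, ":")
	if !ok {
		return nil, "", fmt.Errorf("malformed channel %q", channel)
	}
	if in, ok := integrations[name]; ok {
		return in, target, nil
	}
	if known[name] {
		return nil, "", fmt.Errorf("no integration configured for provider %q", name)
	}
	return nil, "", fmt.Errorf("unknown integration %q", name)
}

func main() {
	ints := map[string]*Integration{
		"slack":     {Provider: "slack", WebhookURL: "https://hooks.example/a"},
		"slack_ops": {Provider: "slack", WebhookURL: "https://hooks.example/b"},
	}
	in, target, _ := resolveIntegration("slack_ops:#alerts", ints)
	fmt.Println(in.WebhookURL, target) // https://hooks.example/b #alerts
}
```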

Testing

  • Unit tests for each REST endpoint (table-driven, httptest)
  • Integration test: trigger run via REST, stream logs via WebSocket, verify terminal state
  • Unit tests for notification template rendering — all {{ job.* }} and {{ run.* }} variables (Feature 3)
  • Unit tests for Slack, PagerDuty, Discord, and custom webhook payload construction (Feature 3)
  • Unit tests for integration credential resolution at dispatch time — named key, bare provider fallback, missing integration (Feature 3)
  • Integration tests for husky integrations test slack and husky integrations test pagerduty (Feature 3)
  • Unit tests for only_after_failure suppression logic (Feature 3)
  • Unit tests for SLA breach timer: fires on_sla_breach at correct elapsed time, sets sla_breached flag (Feature 1)
  • Unit tests for tag filtering on GET /api/jobs and husky status (Feature 4)
  • Unit tests for husky audit filter combinations (Feature 5)
  • Unit tests for husky run --reason stores reason in job_runs.reason (Feature 5)
  • End-to-end test: full pipeline run (ยง2.3 example), verify all alert rows written
  • End-to-end test: output-passing pipeline — ingest captures file_path, transform receives it via template (Feature 2)

Phase 3.4 — Dashboard Enhancements

Complete · All backend endpoints, Data / Alerts / Config, Run Detail page, Enhanced Jobs tab, DAG SVG, Health tab, Integrations panel, and UI quality-of-life items implemented.

Backend: New API Endpoints

  • GET /api/daemon/info — return PID, config file path, started_at timestamp, SQLite DB path, active job count, and paused job count (§1.3)
  • GET /api/db/job_runs — paginated, sortable, filterable table of job_runs; query params: job, status, trigger, since, until, sla_breached, page, page_size (§2.1)
  • GET /api/db/job_state — return the full job_state table (§2.2)
  • POST /api/db/job_state/:job/clear_lock — set lock_pid = NULL for a job; return 409 if the job is currently RUNNING (§2.2)
  • GET /api/db/run_logs — full-text searchable log explorer; query params: run_id, job, stream (stdout | stderr | healthcheck), q, limit, offset (§2.3)
  • GET /api/db/alerts — paginated alerts history; query params: job, status, limit, offset (§2.5)
  • POST /api/db/alerts/:id/retry — re-dispatch a failed alert without re-running the job; return 409 if alert is already pending (§2.5)
  • POST /api/jobs/:name/pause — pause a single job by name (§5.1)
  • POST /api/jobs/:name/resume — resume a single paused job by name (§5.1)
  • POST /api/jobs/:name/retry — retry the last failed run for this job (§4.2)
  • POST /api/jobs/:name/skip — mark a PENDING run as SKIPPED (§4.2)
  • GET /api/integrations — list all configured integrations with name, provider, credential status, last test result and timestamp (§8.1)
  • POST /api/integrations/:name/test — fire a test delivery to the named integration; return HTTP status and response payload (§8.2)

Dashboard: Daemon Info Panel (§1.3)

  • Extend the top bar with a clickable version badge that toggles an info dropdown panel displaying: PID, uptime, config file path, started_at full timestamp, SQLite DB path, total / active / paused job counts
  • Fetch daemon info from GET /api/daemon/info on mount and on each 30-second poll cycle

Dashboard: Data Tab (§2)

  • Add a Data tab to the navigation alongside Jobs | Audit | Outputs | Alerts | DAG | Config
  • job_runs sub-tab (§2.1): paginated table with columns id | job_name | status | attempt | trigger | triggered_by | reason | started_at | finished_at | exit_code | sla_breached | hc_status; pagination controls (25 / 50 / 100 rows); filter bar (job, status, trigger, date range, sla_breached flag)
  • job_state sub-tab (§2.2): table with columns job_name | last_success | last_failure | next_run | lock_pid; highlight rows with a non-null lock_pid using a warning badge; "Clear lock" action button per such row → POST /api/db/job_state/:job/clear_lock
  • run_logs sub-tab (§2.3): searchable log explorer across all runs; filters: run ID, job name, stream type, full-text keyword; rows colour-coded by stream (stderr = red tint, healthcheck = amber tint)

Dashboard: Alerts Tab Completion (§2.5)

  • Add status column with colour-coded badges: delivered = green, failed = red, pending = yellow
  • Add attempts, last_attempt_at, and error columns to the Alerts table
  • Add a "Retry" action button on rows where status = failed → POST /api/db/alerts/:id/retry; disable the button while the request is in flight
  • Add pagination / load-more to the Alerts table (currently loads a fixed number)

Dashboard: Config Editor Improvements (§3)

  • Inline error markers (§3.2): when "Validate" returns errors that include line numbers, scroll to and highlight the offending line(s) in the textarea; show the error description below the highlighted line
  • Config diff view (§3.4): after a successful "Save & Reload", show a unified diff between the previous config text and the newly saved config; highlight added/removed job blocks so the operator can confirm what changed

Dashboard: Run Detail Page (§4)

  • Implement SPA routing using the hash (#/runs/:id); the root App component parses window.location.hash and renders a RunDetailView instead of the tab panel when a run route is matched
  • RunDetailView (§4.1): full-page view showing — job name (links back to Jobs tab filtered to that job); status badge, attempt, trigger, triggered_by, reason; started_at / finished_at / duration; exit code; SLA progress bar (green if within budget, amber → red if over); hc_status badge; cycle_id (hyperlinked to Data → run_outputs filtered by cycle); full log viewer with stream colouring and healthcheck toggle; output variables table; previous / next run links for the same job
  • Run actions (§4.2): Retry button (shown when status is FAILED) → POST /api/jobs/:name/retry; Skip button (shown when status is PENDING) → POST /api/jobs/:name/skip; Cancel button (shown when status is RUNNING) → POST /api/jobs/:name/cancel
  • Deep links (§4.3): update run pills in the Jobs tab to navigate to #/runs/:id instead of expanding inline; update Audit table rows to link to #/runs/:id; make run_id values in the Data tab clickable links to #/runs/:id

Dashboard: Enhanced Jobs Tab (§5)

  • Per-job pause/resume (§5.1): add Pause / Resume button to each job row action group (next to Run / Cancel); show a PAUSED badge in the Status column for paused jobs; wire to POST /api/jobs/:name/pause and POST /api/jobs/:name/resume
  • Full job config details (§5.2): extend the expanded row config panel to include frequency, on_success, on_retry, on_sla_breach notification settings, env var keys (values redacted as ***), output capture rules, and healthcheck command + timeout
  • Run history depth control (§5.3): add a "Show more" button below the run history pills that fetches the next page; support incremental loading (?runs=40 then ?runs=80)
  • Bulk trigger by tag (§5.4): add a "Run all" button to the tag health strip for each tag; show a confirmation dialog listing the jobs that will be triggered; fire one POST /api/jobs/:name/run per matching job on confirm

Dashboard: DAG Enhancements (§6)

  • Visual graph (§6.1): replace the current text edge list with a rendered directed graph using a lightweight library (Dagre-D3, D3-dag, or ELK.js); nodes show job name, current status badge, and last run time; directed edges with arrowheads; node borders colour-coded (green = SUCCESS, red = FAILED, blue = RUNNING, grey = PENDING); support scroll and zoom for large graphs
  • Node actions (§6.2): clicking a node navigates to #/runs/:id for the most recent run of that job; right-click context menu with Trigger / Cancel / View logs options
  • Cycle trace overlay (§6.3): from the Run Detail page or Data tab, a "Show in DAG" button that highlights all nodes and edges that participated in a specific cycle_id, with each node annotated with the outputs it captured in that cycle

Dashboard: Health & SLA Tab (§7)

  • Add a Health tab to the navigation
  • SLA compliance table (§7.1): for all jobs with an SLA configured, render job | SLA budget | last duration | Δ | breach rate (30 d); Δ shown in red when positive (breach) and green when negative (headroom); default sort by breach rate descending to surface worst offenders
  • Timeline swimlane (§7.2): chronological chart with one row per job showing the last N runs as coloured bars (bar length = duration, bar colour = status, amber border if SLA breached); hovering a bar shows a full run detail tooltip

Dashboard: Integrations Panel (§8)

  • Add an Integrations section as a sub-tab within Config or as a dedicated Settings tab
  • Integration health list (§8.1): display all configured integrations with name, provider, credential status (configured / missing required fields / env var unset), last test result and timestamp; fetched from GET /api/integrations
  • Test delivery button (§8.2): "Test" button per integration row → POST /api/integrations/:name/test; display the test payload sent and the HTTP response code / body returned

Dashboard: UI Quality-of-Life (§9)

  • Dark mode (§9.1): (→ Backlog)
  • Toast notifications (§9.2): ensure every user-initiated action (trigger, cancel, pause/resume, save config, retry alert, clear lock) shows a short-lived success or error toast with descriptive text
  • Keyboard shortcuts (§9.3): r refreshes the current view, / focuses the active filter/search input, Esc closes expanded rows or modals; register via a useEffect on document in the root App component; show a shortcut hint in the UI
  • Column and filter persistence (§9.4): (→ Backlog)
  • Copy to clipboard (§9.5): one-click copy button on run IDs, cycle IDs, and inside the log viewer (copies visible log text) via navigator.clipboard.writeText
  • Auto-polling control (§9.6): toggle button to pause/resume auto-refresh of the jobs table; show "last refreshed N seconds ago" label; when paused, no polling until the user manually refreshes or re-enables

Testing

  • Unit tests: GET /api/daemon/info — returns correct PID, config path, start time, and job counts (§1.3)
  • Unit tests: GET /api/db/job_runs — pagination boundaries, sort by each column, all filter param combinations (§2.1)
  • Unit tests: POST /api/db/job_state/:job/clear_lock — clears lock when job is idle, returns 409 when job is RUNNING (§2.2)
  • Unit tests: GET /api/db/run_logs — full-text q search, stream filter, pagination (§2.3)
  • Unit tests: POST /api/db/alerts/:id/retry — re-queues a failed alert; returns 409 if alert is already pending; returns 404 for unknown ID (§2.5)
  • Unit tests: POST /api/jobs/:name/pause and /resume — correct state transitions, 404 on unknown job (§5.1)
  • Unit tests: POST /api/jobs/:name/retry and /skip — correct preconditions enforced (status must be FAILED / PENDING respectively) (§4.2)
  • Unit tests: GET /api/integrations — returns all configured integrations with correct status field (§8.1)
  • Unit tests: POST /api/integrations/:name/test — fires test delivery, returns response details, 404 on unknown integration (§8.2)
  • E2E: hash routing — navigate directly to #/runs/:id, verify RunDetailView renders with correct data and back-navigation returns to the Jobs tab (§4.3)
  • E2E: bulk trigger by tag — confirmation dialog lists correct jobs, all matching jobs reach RUNNING state after confirm (§5.4)
  • E2E: dark mode toggle — dark class applied to <html>, preference survives page reload (§9.1) (→ Backlog)

Phase 3.5 — Daemon Runtime Configuration (huskyd.yaml)

Complete · All huskyd.yaml parsing, API binding, authentication (bearer, basic auth, RBAC), TLS, CORS, structured logging with rotation, storage retention vacuum, executor shell + global env, scheduler jitter, and full unit + integration test coverage implemented. OIDC, metrics, tracing, secrets backends, and advanced executor resource limits deferred to Phase 5 backlog.

Dashboard

  • huskyd.yaml subtab: if the daemon exposes a daemon-config path via GET /api/daemon/info, add a second subtab in the Config view displaying its path and raw YAML content (read-only initially)

Discovery & Parsing

  • Auto-assign API port with 127.0.0.1:0 (OS picks free port); write bound address to <data>/api.addr on startup; remove file on clean shutdown
  • husky dash — read <data>/api.addr and open dashboard URL in default system browser
  • Define DaemonConfig Go struct in a new internal/daemoncfg package mirroring the full huskyd.yaml shape
  • Load huskyd.yaml from the same directory as husky.yaml when the file exists; silently apply all defaults when it does not
  • Support --daemon-config <path> flag on huskyd to override the default discovery path
  • Validate huskyd.yaml with a JSON Schema at load time; emit structured errors with field path context
  • All fields optional — absence of any section falls back to the documented default; no breaking change for v1.0 users
  • husky validate extends to also validate huskyd.yaml when present
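Taken together, the discovery rules above describe a file in which every section is optional. A minimal sketch of the shape this phase implements (field names follow this document; the values shown are illustrative examples, not the defaults):

```yaml
# huskyd.yaml — every section optional; omit any to take the documented default
api:
  addr: 127.0.0.1:8420      # fixed port instead of the 127.0.0.1:0 auto-assign
  tls:
    enabled: false
auth:
  type: none                 # none | bearer | basic | oidc (oidc is a stub in 3.5)
log:
  level: info
  format: json
storage:
  retention:
    max_age: 720h            # prune job_runs older than 30 days
    max_runs_per_job: 1000
scheduler:
  schedule_jitter: 5s
executor:
  shell: /bin/bash
  global_env:
    TZ: UTC
```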

API Server

  • api.addr — bind the HTTP server to the specified address when set; override the 127.0.0.1:0 default (enables fixed-port deployments and binding to 0.0.0.0 for remote access)
  • Emit a prominent startup WARN log when api.addr binds to 0.0.0.0 unless both TLS and auth are enabled
  • api.base_path — mount the dashboard and all API routes under a URL prefix (e.g. /husky); strip prefix transparently so internal handlers see clean paths
  • api.tls.enabled — call tls.Listen instead of net.Listen when true; refuse to start if cert/key files are missing or unreadable
  • api.tls.cert / api.tls.key — absolute paths to PEM-encoded certificate and private key
  • api.tls.min_version — enforce minimum TLS version; accept "1.2" (default) or "1.3"
  • api.tls.client_ca — when set, require mTLS; reject connections whose client certificate is not signed by the given CA
  • api.cors.allowed_origins — inject CORS headers for listed origins; wildcard "*" accepted but emits a startup warning
  • api.cors.allow_credentials — set Access-Control-Allow-Credentials: true when needed for browser-based OAuth flows
  • api.timeouts.read_header, .read, .write, .idle — pass through to http.Server fields; defaults match current hardcoded values

Authentication

  • auth.type: none (default) — no authentication; all endpoints open
  • auth.type: bearer — require Authorization: Bearer <token> on all REST + WebSocket requests
    • auth.bearer.token_file — read one token per line; all listed tokens are valid; hot-reload on SIGHUP
    • auth.bearer.token — inline token string for simple setups (warn in logs that file-based is preferred)
  • auth.type: basic — HTTP Basic Auth
    • auth.basic.users — list of {username, password_hash} entries; password_hash is a bcrypt hash; plaintext passwords rejected at load time
  • auth.type: oidc — startup stub: returns a clear error; full OIDC implementation deferred to Phase 5
  • auth.rbac — define per-role allowed HTTP method + path patterns; built-in roles viewer, operator, admin with sensible defaults; custom roles accepted
  • Auth middleware applied uniformly to all routes served by the API server
  • Auth disabled entirely when auth.type: none; no performance overhead in the default case
  • husky CLI reads HUSKY_TOKEN env var and injects Authorization: Bearer header for all HTTP-based commands (pause, resume, dash) when auth is enabled
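A sketch of the bearer check described above, including the zero-overhead pass-through for auth.type: none. The names (checkBearer, authMiddleware) are illustrative, not Husky's internals; the constant-time comparison is a standard precaution, assumed rather than specified by this document:

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"strings"
)

// checkBearer validates an Authorization header against the set of tokens
// loaded from auth.bearer.token_file, using a constant-time comparison.
func checkBearer(header string, tokens []string) bool {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return false
	}
	presented := header[len(prefix):]
	for _, t := range tokens {
		if subtle.ConstantTimeCompare([]byte(presented), []byte(t)) == 1 {
			return true
		}
	}
	return false
}

// authMiddleware wraps every route; with no tokens configured (auth.type:
// none) it returns the handler unchanged, so the default case pays nothing.
func authMiddleware(next http.Handler, tokens []string) http.Handler {
	if len(tokens) == 0 {
		return next
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !checkBearer(r.Header.Get("Authorization"), tokens) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(checkBearer("Bearer s3cret", []string{"s3cret"}))
}
```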

Logging

  • log.level — debug | info | warn | error; default info; hot-reloadable via SIGHUP
  • log.format: json — swap the slog handler from slog.NewTextHandler to slog.NewJSONHandler; produces structured log lines consumable by Datadog, CloudWatch, Loki, Splunk
  • log.output: stdout | stderr | file — default stdout; file requires log.file.path
  • log.file.path — write logs to this file; create parent directories if absent
  • log.file.max_size_mb, .max_backups, .max_age_days, .compress — log rotation via lumberjack (or equivalent); compress rotated files when compress: true
  • log.audit_log.enabled — write a separate newline-delimited JSON audit log to log.audit_log.path; each line is one job state transition event with full context (job, run ID, status, trigger, duration, reason)
  • log.audit_log.max_size_mb, .max_backups — independent rotation settings for the audit log file

Storage

  • storage.sqlite.path — override default <data>/husky.db location
  • storage.sqlite.wal_autocheckpoint — set SQLite wal_autocheckpoint pragma; default 1000 pages
  • storage.sqlite.busy_timeout — set SQLite busy-handler timeout; default 5s
  • storage.retention.max_age — background vacuum goroutine deletes job_runs (and cascade: run_logs, run_outputs) older than the specified duration; runs every 24 h
  • storage.retention.max_runs_per_job — after pruning by age, also enforce a per-job cap; keep only the N most recent completed runs; PENDING and RUNNING runs are never pruned
  • Background vacuum goroutine: run once at startup (after a short delay to not contend with catchup) then on a 24 h ticker; log number of rows pruned at INFO level
  • storage.engine: postgres stub — parse the config field and emit a clear "not yet supported" error at startup; prevents a confusing failure mode when users copy a future config forward
  • husky export --format=json honours the configured storage path when reading state

Scheduler

  • scheduler.max_concurrent_jobs — global ceiling on concurrently executing jobs across all job pools; default 32; per-job concurrency setting still applies at the job level
  • scheduler.catchup_window — maximum look-back window when reconciling missed schedules on restart; overrides per-job catchup: true for runs missed longer ago than this; default 24h
  • scheduler.shutdown_timeout — grace period given to in-flight jobs when a graceful stop is requested before SIGKILL is sent to remaining jobs; default 60s
  • scheduler.schedule_jitter — random jitter (up to this duration) added to each scheduled trigger to avoid thundering-herd bursts when many jobs share the same tick; default 0s (no jitter)

Executor

  • executor.pool_size — size of the bounded goroutine worker pool; default 8; surfaced in huskyd.yaml so production deployments can tune without recompile
  • executor.shell — shell used to run job commands; default /bin/sh; common override /bin/bash
  • executor.global_env — key-value pairs injected into every job's environment; lower priority than per-job env in husky.yaml; higher priority than host environment
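The precedence rule for executor.global_env (host < global_env < per-job env) is just layered map merging; mergeEnv is an illustrative helper under that assumption:

```go
package main

import "fmt"

// mergeEnv layers the three environment sources; later layers win on
// key conflicts, giving host < executor.global_env < per-job env.
func mergeEnv(host, global, perJob map[string]string) map[string]string {
	out := make(map[string]string, len(host)+len(global)+len(perJob))
	for _, layer := range []map[string]string{host, global, perJob} {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	env := mergeEnv(
		map[string]string{"PATH": "/usr/bin", "TZ": "UTC"}, // host
		map[string]string{"TZ": "America/New_York"},        // executor.global_env
		map[string]string{"TZ": "Europe/London"},           // per-job env wins
	)
	fmt.Println(env["TZ"]) // Europe/London
}
```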

Testing

  • Unit tests: bearer token middleware — valid token accepted, invalid token → 401, missing header → 401, auth disabled → all requests pass
  • Unit tests: basic auth middleware — valid credentials, wrong password, unknown user
  • Unit tests: RBAC — viewer can GET, viewer cannot POST trigger, operator can trigger, admin can do everything
  • Unit tests: storage retention vacuum — rows older than max_age deleted, rows within window kept, RUNNING rows never deleted
  • Unit tests: max_runs_per_job cap — correct rows pruned, most recent N retained
  • Unit tests: executor.global_env — values present in subprocess env, overridden by per-job env
  • Integration test: bearer auth enabled — husky status (HTTP commands) fail without HUSKY_TOKEN, succeed with correct token
  • Integration test: huskyd.yaml absent — daemon starts with all defaults; no error or warning beyond "no huskyd.yaml found, using defaults"

Phase 4 — Hardening & Distribution

Weeks 10–12 · Cross-compilation, packaging, auth, TLS, integration test suite.

Cross-Compilation

  • Build matrix in Makefile: linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64
  • Verify modernc.org/sqlite CGO-free build works on all five targets
  • Produce reproducible builds (embed VERSION, COMMIT, BUILD_DATE via ldflags)
  • Add make dist target that produces a dist/ directory with all platform archives (.tar.gz) — each archive contains the single husky binary
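The VERSION / COMMIT / BUILD_DATE embedding above is conventionally done with -ldflags -X against package variables. A sketch (the variable names inside internal/version are assumptions; the module path comes from Phase 0; archiving into .tar.gz is omitted):

```make
# Sketch only — internal/version variable names are assumptions
VERSION    ?= $(shell git describe --tags --always)
COMMIT     ?= $(shell git rev-parse --short HEAD)
BUILD_DATE ?= $(shell date -u +%Y-%m-%dT%H:%M:%SZ)
LDFLAGS     = -s -w \
  -X github.com/husky-scheduler/husky/internal/version.Version=$(VERSION) \
  -X github.com/husky-scheduler/husky/internal/version.Commit=$(COMMIT) \
  -X github.com/husky-scheduler/husky/internal/version.BuildDate=$(BUILD_DATE)

PLATFORMS = linux/amd64 linux/arm64 darwin/amd64 darwin/arm64 windows/amd64

dist:
	for p in $(PLATFORMS); do \
	  GOOS=$${p%/*} GOARCH=$${p#*/} CGO_ENABLED=0 \
	  go build -ldflags '$(LDFLAGS)' -o dist/husky-$${p%/*}-$${p#*/} ./cmd/husky; \
	done
```

CGO_ENABLED=0 is what makes the modernc.org/sqlite pure-Go build portable across all five targets.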

Packaging & Distribution

  • Write Homebrew formula husky.rb (binary download + checksum)
  • Build .deb package for Debian/Ubuntu (using goreleaser or hand-rolled dpkg-deb)
  • Build .rpm package for RHEL/Fedora
  • Create install.sh one-liner installer (detects platform, downloads correct binary)
  • Write systemd unit file example (huskyd.service) — ExecStart points to husky daemon run
  • Write launchd plist example for macOS autostart (invokes husky daemon run)
  • Publish GitHub Release via GoReleaser CI step with checksums and signatures
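A sketch of the huskyd.service example named above; the install path, service user, and WorkingDirectory are assumptions, and only ExecStart = husky daemon run comes from this document:

```ini
# huskyd.service — sketch; paths and user are placeholders
[Unit]
Description=Husky job scheduler daemon
After=network.target

[Service]
Type=simple
User=husky
WorkingDirectory=/etc/husky
ExecStart=/usr/local/bin/husky daemon run
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```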

Token Authentication

Superseded by Phase 3.5 (auth.type: bearer + auth.type: basic + RBAC in huskyd.yaml).

  • Implement token auth middleware for all REST + WebSocket endpoints (§7.2)
  • Read token from auth.token in husky.yaml or HUSKY_TOKEN env var
  • Store bcrypt hash of token in job_state database; never store plaintext
  • Require Authorization: Bearer <token> header when auth is enabled
  • Token auth disabled by default; emit startup log note when disabled
  • husky reads token from HUSKY_TOKEN env var for socket/HTTP calls

TLS

Superseded by Phase 3.5 (api.tls.* block in huskyd.yaml; --daemon-config flag provides override path).

  • Support --tls-cert and --tls-key flags on daemon start (§7.1)
  • Verify cert/key pair at startup; refuse to start on invalid files
  • Serve HTTPS when TLS flags are provided; HTTP otherwise
  • Document self-signed cert generation steps in README

External Binding

Superseded by Phase 3.5 (api.addr in huskyd.yaml).

  • Support --bind flag to override default 127.0.0.1:8420 (§7.1)
  • Emit prominent startup warning when bound to 0.0.0.0 without TLS + auth enabled

Integration Test Suite

  • Test: parse, validate, and execute the full §2.3 example pipeline end-to-end
  • Test: DAG execution order is correct (ingest → transform → report)
  • Test: on_failure: stop halts downstream jobs
  • Test: retry with exponential backoff reaches max retries and fires alert
  • Test: concurrency: forbid skips overlapping runs
  • Test: daemon crash mid-run → restart → orphan reconciled → retry fires
  • Test: catchup: true triggers missed run after restart
  • Test: catchup: false skips missed run after restart
  • Test: SIGHUP hot-reload swaps config without interrupting running job
  • Test: cycle in config is rejected at startup and reload
  • Test: token auth rejects requests without valid bearer token
  • Test: husky validate catches all invalid field combinations
  • Test: job with sla breaches threshold, on_sla_breach fires, run completes with sla_breached = 1 and SUCCESS status (Feature 1)
  • Test: job with healthcheck, main succeeds, healthcheck fails → run marked FAILED, retries triggered (Feature 6)
  • Test: job with healthcheck on_fail: warn_only, healthcheck fails → run marked SUCCESS with hc_status = warn (Feature 6)
  • Test: output-passing pipeline, downstream job receives correct {{ outputs.* }} value (Feature 2)
  • Test: output values from cycle A are not visible in independently triggered cycle B (Feature 2)
  • Test: husky run <job> --reason persists reason; husky audit returns it; {{ run.reason }} renders in notification template (Feature 3, 5)
  • Test: per-job timezone โ€” job scheduled at wall-clock time in America/New_York fires at correct UTC moment before and after a DST transition (Feature 7)
  • Test: husky pause --tag <tag> pauses all matching jobs; husky resume --tag <tag> resumes them (Feature 4)
  • Benchmark: scheduler tick latency under 100 concurrent jobs < 10 ms

Documentation

  • Write README.md — quickstart, install, husky.yaml, huskyd.yaml example, CLI reference
  • Write docs/configuration.md — full field reference including all v1.0 fields: sla, tags, timezone, healthcheck, output, expanded notify schema (Features 1–7)
  • Write docs/security.md — auth, TLS, process isolation guidance
  • Write docs/crash-recovery.md — startup sequence walkthrough
  • Write docs/output-passing.md — guide to capture modes, cycle_id scoping, and template syntax (Feature 2)
  • Write docs/notifications.md — all channel formats, template variables, only_after_failure behaviour (Feature 3)
  • Add CHANGELOG.md for v1.0.0
  • Use Docusaurus to create a documentation website

Phase 5: Backlog

Deferred from Phase 3.5

Items below were descoped from the alpha release. They are fully specified and ready to implement post-launch.

OIDC Authentication

  • auth.type: oidc — full implementation: fetch and cache JWKS from auth.oidc.jwks_uri; verify JWT signatures; refresh on key rotation (background goroutine)
  • auth.oidc.issuer, .client_id, .audience, .jwks_uri — standard OIDC discovery fields
  • auth.oidc.role_claim — JWT claim name mapped to Husky roles (viewer, operator, admin)
  • auth.oidc.default_role — role assigned when claim is absent

Executor Advanced Config

  • executor.working_dir: config_dir | <absolute_path> — working directory for all child processes; config_dir (default) means the directory containing husky.yaml
  • executor.resource_limits.max_memory_mb — Linux: set RLIMIT_AS; macOS: best-effort via setrlimit; 0 = unlimited
  • executor.resource_limits.max_open_files — set RLIMIT_NOFILE on child process; default 1024
  • executor.resource_limits.max_pids — Linux: set RLIMIT_NPROC; limits forking within a job

Metrics

  • metrics.enabled — expose a Prometheus-compatible scrape endpoint; default false
  • metrics.addr — bind address for the /metrics HTTP server; default 127.0.0.1:9091 (separate from dashboard port)
  • metrics.path — URL path for the scrape endpoint; default /metrics
  • metrics.auth — when true, protect /metrics with the same bearer token as the main API
  • Instrument: husky_job_runs_total{job,status,trigger}, husky_job_duration_seconds{job}, husky_job_sla_breaches_total{job}, husky_job_retries_total{job}, husky_scheduler_tick_duration_seconds, husky_running_jobs, husky_daemon_uptime_seconds
  • Use prometheus/client_golang — add to go.mod
  • Metrics server lifecycle tied to daemon context (shut down cleanly on SIGTERM)

Tracing

  • tracing.enabled — emit OpenTelemetry spans; default false
  • tracing.exporter — otlp | jaeger | zipkin | stdout; default otlp
  • tracing.endpoint, .service_name, .sample_rate
  • Instrument spans for: job dispatch, executor subprocess wait, retry backoff sleep, IPC request handling, notification delivery
  • Use go.opentelemetry.io/otel — add to go.mod
  • Tracer provider shut down gracefully on daemon context cancellation

Secrets Backends

  • secrets.provider: vault — resolve ${secret:NAME} placeholders via HashiCorp Vault KV v2 (addr, token_file, mount, path_prefix)
  • secrets.provider: aws_ssm — AWS Systems Manager Parameter Store (ambient IAM)
  • secrets.provider: aws_secrets_manager — AWS Secrets Manager (ambient IAM)
  • secrets.provider: gcp_secret_manager — GCP Secret Manager (Application Default Credentials)
  • Secret resolution at startup + SIGHUP; never written to disk or logs
  • Add ${secret:NAME} interpolation syntax alongside ${env:VAR} in internal/config/env.go
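The ${secret:NAME} / ${env:VAR} interpolation above can be sketched with a single regex and a pluggable resolver. The function name and resolver signature are illustrative, not what internal/config/env.go actually contains:

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

var placeholderRe = regexp.MustCompile(`\$\{(env|secret):([A-Za-z0-9_]+)\}`)

// expandPlaceholders replaces ${env:VAR} from the process environment and
// ${secret:NAME} via the configured secrets provider; resolved values are
// returned to the caller only, never logged.
func expandPlaceholders(s string, secrets func(name string) (string, error)) (string, error) {
	var firstErr error
	out := placeholderRe.ReplaceAllStringFunc(s, func(m string) string {
		parts := placeholderRe.FindStringSubmatch(m)
		kind, name := parts[1], parts[2]
		if kind == "env" {
			return os.Getenv(name)
		}
		v, err := secrets(name)
		if err != nil && firstErr == nil {
			firstErr = err
		}
		return v
	})
	return out, firstErr
}

func main() {
	mock := func(name string) (string, error) { return "hunter2", nil }
	v, _ := expandPlaceholders("token=${secret:API_TOKEN}", mock)
	fmt.Println(v) // token=hunter2
}
```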

Daemon-Level Alerts

  • alerts.on_daemon_start — fire notification(s) when huskyd starts successfully
  • alerts.on_sla_breach — daemon-level fallback for SLA breach notifications when the job does not define notify.on_sla_breach
  • alerts.on_forced_kill — fire notification when a job is killed by shutdown_timeout SIGKILL

Dashboard Customisation

  • dashboard.enabled: false — disable dashboard while keeping REST + WebSocket API; respond to / with 404
  • dashboard.title — custom title injected into <title> and header; served via GET /api/config
  • dashboard.accent_color — CSS custom property for primary accent colour
  • dashboard.log_backfill_lines — historical lines sent on WebSocket connect; default 200
  • dashboard.poll_interval — frontend poll interval; served via GET /api/config; default 5s
  • GET /api/config — new endpoint returning dashboard runtime config (no auth required)

HTTP Client (Outbound)

  • http_client.timeout — global timeout for all outbound HTTP calls; default 15s
  • http_client.max_retries / .retry_backoff — notification delivery retries
  • http_client.proxy — optional HTTP proxy URL for all outbound calls
  • http_client.ca_bundle — PEM CA bundle appended to system cert pool

Process / System

  • process.user / process.group — drop privileges after port bind (Linux setuid/setgid; warning on macOS/Windows)
  • process.pid_file — override PID file location
  • process.ulimit_nofile — setrlimit(RLIMIT_NOFILE) on daemon process at startup
  • process.watchdog_interval — sd_notify WATCHDOG=1 pings for systemd WatchdogSec=

Phase 3.5 Tests (deferred)

  • Unit tests: DaemonConfig parsing — all fields, defaults, unknown fields rejected
  • Unit tests: JSON Schema validation of huskyd.yaml — valid, invalid, missing optional sections
  • Unit tests: api.base_path prefix stripping — all routes reachable under prefix
  • Unit tests: log format switch — structured JSON output when log.format: json
  • Unit tests: scheduler.schedule_jitter — dispatched times fall within [tick, tick + jitter] bounds
  • Unit tests: Prometheus metrics — counters increment on run completion, gauge reflects running count
  • Unit tests: ${secret:NAME} interpolation — resolved from mock provider, not leaked to logs
  • Unit tests: dashboard.poll_interval + dashboard.title served by GET /api/config
  • Integration test: api.addr: 127.0.0.1:19999 — verify daemon binds to configured port
  • Integration test: TLS cert + key — https:// succeeds, http:// rejected
  • Integration test: SIGHUP with updated log.level: debug — debug lines appear without restart
  • Integration test: storage vacuum fires; pruned row count matches max_age config

General Backlog

  • Distributed execution across multiple machines
  • External queue integrations (Kafka, Redis, RabbitMQ)
  • Dynamic job creation via API at runtime
  • Plugin SDK / extensible executor types
  • Per-job log retention config (default: 30 days / 1,000 runs) with background vacuum
  • Optional log shipping to external sinks
  • DST ambiguous window warnings (§9 Risk Register)

Multi-Project Daemon Mode

Target: after v1.0 ships and real-world usage patterns are established.

Goal: A single huskyd process manages multiple projects, each with its own husky.yaml, SQLite store, and job namespace. One binary, one process, one dashboard — regardless of how many projects are registered.

Rationale: v1.0 ships as a per-project daemon (one daemon per husky.yaml). This is the right model for the simple case and for developers using Husky locally. For production deployments — ops teams, servers, organisations running many pipelines — a per-project model multiplies resource overhead and precludes a unified operational view. V2 closes this gap without breaking v1.0 users.

Project Registry

  • Define a top-level huskyd.yaml server config file (distinct from per-project husky.yaml) — specifies bind address, TLS, auth, log retention, and the list of registered projects
  • Add projects: block to huskyd.yaml: list of entries with name (unique identifier) and config (path to project's husky.yaml)
  • Implement husky register <path> command — adds a project entry to huskyd.yaml and hot-reloads the daemon
  • Implement husky deregister <name> command — removes project entry; waits for running jobs to complete before tearing down project context
  • Implement husky projects list — tabular output: project name, config path, job count, status (active / paused / error)
  • Validate that project names are unique, lowercase alphanumeric with hyphens, max 64 characters
  • Detect duplicate husky.yaml paths at registration time; reject with a clear error
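The name rule above (lowercase alphanumeric with hyphens, max 64 characters) can be sketched as a single regex check; validProjectName is an illustrative helper, and rejecting a leading hyphen is an added assumption beyond the stated rule:

```go
package main

import (
	"fmt"
	"regexp"
)

// Lowercase alphanumeric plus hyphens, starting with a letter or digit.
var projectNameRe = regexp.MustCompile(`^[a-z0-9][a-z0-9-]*$`)

// validProjectName enforces the registration-time naming rule.
func validProjectName(name string) bool {
	return len(name) <= 64 && projectNameRe.MatchString(name)
}

func main() {
	fmt.Println(validProjectName("billing-prod")) // true
	fmt.Println(validProjectName("Billing"))      // false: uppercase
}
```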

Project Isolation

  • Each registered project runs in an isolated context: independent scheduler goroutine, independent executor pool, independent SQLite database under <data-dir>/<project-name>/husky.db
  • Per-project goroutine pool size — inherits global default from huskyd.yaml but overridable per project
  • A panic or fatal error in one project context must not crash or stall other projects — recover and mark project as error state, emit alert
  • Per-project SIGHUP-style config reload — daemon watches each husky.yaml for changes and reloads only the affected project context
  • Global file-watcher (using fsnotify) replaces per-project polling; each husky.yaml change triggers isolated reload pipeline

Job Namespacing

  • All job identifiers are namespaced as <project-name>::<job-name> internally
  • CLI commands accept --project <name> flag to scope operations: husky run --project billing daily-report
  • Without --project, CLI auto-detects project by walking up from CWD until a husky.yaml is found, then resolves to the registered project name — preserves v1.0 UX in terminal
  • after: dependencies within a project remain unqualified (after: ingest) — no breaking change to v1.0 config syntax
  • Cross-project after: dependencies expressed as after: "billing::daily-report" — optional v2 feature, gated behind a config flag allow_cross_project_deps: true
  • husky status without --project shows only the auto-detected project; husky status --all shows all projects
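The reference resolution above (qualified "project::job" versus an unqualified name falling back to the auto-detected project) reduces to one split; splitJobRef is an illustrative helper:

```go
package main

import (
	"fmt"
	"strings"
)

// splitJobRef returns the project and job parts of a reference such as
// "billing::daily-report"; an unqualified name like "ingest" resolves
// against defaultProject (the CWD-auto-detected project).
func splitJobRef(ref, defaultProject string) (project, job string) {
	if p, j, ok := strings.Cut(ref, "::"); ok {
		return p, j
	}
	return defaultProject, ref
}

func main() {
	p, j := splitJobRef("billing::daily-report", "local")
	fmt.Println(p, j) // billing daily-report
	p, j = splitJobRef("ingest", "local")
	fmt.Println(p, j) // local ingest
}
```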

Unified Web Dashboard

  • Single web dashboard serves all projects — project selector dropdown in top navigation
  • Default landing page shows cross-project summary: total jobs, running count, recent failures across all projects
  • Per-project view mirrors the v1.0 dashboard (job table, log stream, run history)
  • WebSocket log streams are scoped per project; subscribing to a project stream does not receive logs from other projects
  • GET /api/projects — list all registered projects with summary stats
  • GET /api/projects/<name>/jobs — scoped equivalent of v1.0 GET /api/jobs
  • GET /api/projects/<name>/jobs/<job>/runs — scoped run history
  • Retain v1.0 unscoped endpoints (GET /api/jobs) as aliases that resolve via CWD auto-detection or X-Husky-Project header

Unified Integration Credentials

  • Move Slack, PagerDuty, Discord, SMTP integration credentials to huskyd.yaml as global integrations — available to all projects without duplication
  • Per-project husky.yaml may still define local integrations that supplement the global ones; a project-level integration with the same name overrides the global entry
  • husky integrations list shows global integrations from huskyd.yaml; husky integrations list --project <name> shows merged view for that project

Single Unix Socket

  • Replace per-project Unix sockets with a single socket at a fixed path (e.g. /var/run/husky/huskyd.sock or ~/.husky/huskyd.sock)
  • All IPC messages include a project field; daemon routes to the correct project context
  • v1.0 CLI compatibility: if project field is absent in IPC message, daemon resolves project from the cwd field included in the message

Migration from V1

  • Document upgrade path in docs/upgrading-to-v2.md: existing per-project huskyd users move to running a single huskyd with a huskyd.yaml that registers their project
  • husky migrate command — scans CWD for husky.yaml, generates a minimal huskyd.yaml with the project registered, and prints next steps
  • v1.0 invocation (huskyd --config ./husky.yaml) continues to work unchanged as a single-project shorthand — no breaking change for existing users

Testing

  • Unit tests: project registry — register, deregister, duplicate detection, name validation
  • Unit tests: job namespacing — auto-detect project from CWD, --project flag override, cross-project after: resolution
  • Integration test: two projects registered, run jobs in both simultaneously, verify isolation (no log cross-contamination, independent retry state)
  • Integration test: one project panics — verify other project continues running unaffected
  • Integration test: reload one project's husky.yaml via file-watcher — verify other project is not restarted
  • Integration test: husky status --all returns correct aggregated view across projects
  • Integration test: cross-project after: dependency — downstream job in project B fires after upstream in project A completes
  • Integration test: husky migrate generates valid huskyd.yaml from existing husky.yaml
  • Benchmark: scheduler tick latency under 10 projects × 100 jobs (1,000 total) < 20 ms