How I run agents today · hooks, specs, subagents

Harness Engineering — the code that surrounds the agent

An LLM is non-deterministic, so the critical rules have to live outside the agent, in code the agent cannot rewrite.

You leave Claude Code running for an hour while you're in another meeting. You come back and scan the commits. How much do you trust what you see?

The harness is everything around the agent that makes the answer easy. Hooks vet every tool call. Specs frame the work before the agent starts. Sensors catch the agent the moment it drifts. Birgitta Böckeler calls all of that harness engineering.

Here's how I build mine, one piece at a time.

What this is for

A few years ago, a huge chunk of the day went to micro-tasks we took for granted and barely counted: pulling the repo, creating a branch, spinning up the environment, writing boilerplate, moving files, retyping imports. On top of that, the bulk of the time went to implementation — typing code line by line. Today that work is shifting toward the extremes: planning up front and verification at the end. The typing belongs to the agent.

Send the agent off for an hour and come back to twenty changed files, and the rules of the game change. What matters then is supervision: someone watching what it did while you looked elsewhere. The harness is that supervision.

Three lanes from left to right: PLAN (wide, human-led, agent as thinking partner — frame the problem, decide architecture, write specs, explore adversarially), IMPLEMENT (narrow, delegated to agents — codegen, edits, refactor, scaffolding, supervised by the harness), and VERIFY (wide, human-led, agents help diagnose — read the diff, accept against the spec, check long-term consequences, capture lessons). A dashed loop arrow goes from VERIFY back to PLAN when the diff misses.

Look at the diagram. Plan and verify are where the time goes now. Plan is where you frame the problem, decide the approach, write the spec. Verify is where you read the diff the agent produced and see whether it matches what you asked for. When it doesn't, the lesson goes back into the harness — a new hook, a sharper rule, a stricter sensor — for next time.

You hold the big context. Architecture decisions, business constraints, six-months-from-now consequences. None of that fits comfortably in the agent's window, and even if it did, it isn't where the agent earns its keep.
The agent holds the local execution. Codegen, edits, refactors, scaffolding. Bounded tasks, supervised by the harness.
The loop closes through verify → plan. When the diff misses, you go back up to plan, sharpen it, and run another lap. The next diff arrives better framed from the start.
An hour on a sharper spec or a tighter hook keeps paying off run after run. I now spend more time upstream than feels intuitive.

Start with a hook

A hook is a short shell script; each rule in it is four lines, five if you're being thorough. Claude Code runs it every time the agent wants to call a tool: edit a file, run bash, search the codebase. The script inspects the call, decides whether to let it through, and returns an exit code.

If your session has several agents with different roles (a planner, an implementer, a reviewer), the hook is where you decide who can do what. Reviewer wants to edit a file? The hook reads the JSON payload on stdin, sees who's calling, and either lets the call through or exits with code 2 to block it. Whatever the script wrote to stderr goes back to the LLM as the reason for the rejection, and it replans.

The JSON payload has everything you need: tool_name, tool_input, agent_type. You write the policy in bash and jq.
Reviewer trying to edit a protected path? exit 2. End of story.
PostToolUse hooks are the symmetric version: they fire after the tool runs. Use them to log, mark caches dirty, kick off an indexer, append to a session journal.
Hook scripts live in the repo at .claude/hooks/ and are registered in the Claude Code settings. The agent has no way around them, and a new rule is two more lines in the script.

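#!/usr/bin/env bash
# .claude/hooks/pre-tool-use.sh
# Reads the tool-call payload on stdin; exit 2 blocks the call, exit 0 allows it.
payload=$(cat)
agent=$(jq -r '.agent_type // empty'           <<<"$payload")
tool=$(jq  -r '.tool_name // empty'            <<<"$payload")
path=$(jq  -r '.tool_input.file_path // empty' <<<"$payload")

# reviewer subagents are read-only: block every file-writing tool, not just Edit
if [[ "$agent" == "reviewer" && "$tool" =~ ^(Edit|MultiEdit|Write|NotebookEdit)$ ]]; then
  echo "reviewer cannot edit ${path:-files}" >&2
  exit 2
fi

exit 0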

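The simplest PreToolUse you can write: read the agent_type from the payload, apply one rule, exit with 2 when the call doesn't fit. With this in place, the reviewer's calls to the listed editing tools are rejected no matter what the model decided. Anything not listed, Bash included, still goes through until you add a rule for it.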
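The single rule above generalizes to a who-can-do-what table. What follows is a minimal sketch under stated assumptions, not the hook I actually run: the per-role tool lists are illustrative, `allowed` and `pre_tool_use` are hypothetical names, and a call that arrives without an `agent_type` is assumed to come from the main session.

```shell
#!/usr/bin/env bash
# Sketch: a who-can-do-what table for the planner / implementer / reviewer roles.
# The tool lists per role are illustrative assumptions, not a recommended policy.

allowed() {
  local agent="$1" tool="$2"
  case "$agent:$tool" in
    main:*|implementer:*)                            return 0 ;;
    planner:Read|planner:Grep|planner:Glob)          return 0 ;;
    reviewer:Read|reviewer:Grep|reviewer:Glob)       return 0 ;;
    *)                                               return 1 ;;  # default deny
  esac
}

# Hook entry point: payload on stdin, return 2 to block the call.
pre_tool_use() {
  local payload agent tool
  payload=$(cat)
  # Assumption: no agent_type in the payload means the main session is calling.
  agent=$(jq -r '.agent_type // "main"' <<<"$payload")
  tool=$(jq -r '.tool_name // "unknown"' <<<"$payload")
  if allowed "$agent" "$tool"; then
    return 0
  fi
  echo "$agent is not allowed to call $tool" >&2
  return 2
}

# Try the table with hand-written payloads, no Claude Code involved.
for agent in planner reviewer implementer; do
  payload=$(jq -n --arg a "$agent" \
    '{agent_type: $a, tool_name: "Edit", tool_input: {file_path: "src/app.ts"}}')
  if pre_tool_use <<<"$payload" 2>/dev/null; then
    echo "$agent Edit: allowed"
  else
    echo "$agent Edit: blocked"
  fi
done
```

In a hook file the script would end with a bare `pre_tool_use` call, so the function's return code becomes the script's exit code. The default-deny branch is the design choice that matters: a role or tool nobody listed is rejected instead of slipping through.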

Vertical flow: the LLM decides to call a tool, the PreToolUse hook receives a JSON payload with tool, agent_type and path, applies the policy, and either rejects with exit 2 (which sends a message to the LLM) or allows the tool to execute (and then PostToolUse fires).
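The PostToolUse side can be sketched the same way. This is an illustration under assumptions: the journal location, the `post_tool_use` name, and the tab-separated line format are mine, and the marker file is just a stamp on disk that a later step could read.

```shell
#!/usr/bin/env bash
# Sketch of a PostToolUse handler: journal every call, stamp a marker on Swift edits.
# JOURNAL and MARKER locations are assumptions made for this example.

JOURNAL="${JOURNAL:-.claude/journal.log}"
MARKER="${MARKER:-.codegraph-dirty}"

post_tool_use() {
  local payload tool path
  payload=$(cat)
  tool=$(jq -r '.tool_name // "unknown"' <<<"$payload")
  path=$(jq -r '.tool_input.file_path // empty' <<<"$payload")

  # One line per call: UTC timestamp, tool, path (empty for non-file tools).
  mkdir -p "$(dirname "$JOURNAL")"
  printf '%s\t%s\t%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$tool" "$path" >>"$JOURNAL"

  # Passive signal: leave a stamp, never block.
  if [[ "$path" == *.swift ]]; then
    touch "$MARKER"
  fi
  return 0
}

# Exercise it in a scratch directory with a hand-written payload.
demo_dir=$(mktemp -d)
JOURNAL="$demo_dir/journal.log"
MARKER="$demo_dir/.codegraph-dirty"
post_tool_use <<<'{"tool_name":"Edit","tool_input":{"file_path":"Sources/App.swift"}}'
ls -A "$demo_dir"
```

The handler always returns 0 on purpose: the tool has already run, so there is nothing left to block and the only job is to record what happened.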

Before and after each step

You've seen a hook. What comes next is the full loop that the hook is one piece of, a framing Birgitta Böckeler borrows from control theory.

Two kinds of signal sit around the agent. Guides (feedforward) reach the agent before it acts — they tell it what's expected. Sensors (feedback) fire after each step — they measure what the agent did and feed the result back so it can correct course. Together they close the loop.

Horizontal control loop. On the left, Guides feed the agent before it acts (feedforward). In the middle, the Agent reasons, edits and runs tools. On the right, Sensors observe the result and feed signals back (feedback). A dashed loop arrow returns from Sensors to Guides through the bottom, meaning the sensor signal becomes observation on the next turn. The diagram emphasises that both sides of the loop are obligatory.

It's the thermostat pattern. The guide says "hold 21°C". The sensor measures the temperature every minute. Strip one of the two and you're left with a dumb gadget.

Guides only — a huge CLAUDE.md with 200 lines of rules and no test to check any of them hold. The agent reads them once, forgets them on the next pass, nobody catches the slip.
Sensors only — strict tests and CI but no living spec. The agent crashes into the same rule ten times because nobody told it why.
Sensors built for the LLM. A test that fails and dumps an 800-line stack trace becomes garbage in the model's next turn — and the model gets stuck. Design the failure message the way you design the assert: two lines, one for the problem, one for an actionable next step.
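One way to get that two-line failure message is to wrap the test command. This is a sketch only: `run_sensor`, the `PROBLEM:` and `NEXT:` prefixes, and the grep heuristic for picking the first error line are assumptions, and a real project would tune the pattern to its own test runner.

```shell
#!/usr/bin/env bash
# Sketch: wrap a test command so a failure yields two lines instead of a full trace.
# Usage: run_sensor LOGFILE COMMAND [ARGS...]

run_sensor() {
  local log="$1"; shift
  if "$@" >"$log" 2>&1; then
    return 0   # green: stay silent
  fi
  # Line 1: the problem, taken from the first line that looks like an error.
  local first
  first=$(grep -m1 -iE 'error|fail' "$log" || head -n1 "$log")
  echo "PROBLEM: ${first:-command failed with no output}"
  # Line 2: an actionable next step; the full output stays on disk.
  echo "NEXT: read $log, fix the first failure only, then rerun: $*"
  return 1
}

# A stand-in for a test runner that dumps a long trace before the real error.
noisy_test() {
  local i
  for ((i = 1; i <= 800; i++)); do echo "  at frame $i"; done
  echo "Error: expected 3, got 4"
  return 1
}

log=$(mktemp)
run_sensor "$log" noisy_test || true
```

The 800 frames stay in the log file for a human or a follow-up read; the model's next turn only sees the problem and the next step.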

The big picture: the matrix

We have the when — guides before the step, sensors after. We're missing the how. Two very different ways to apply a control.

Semantic: text. CLAUDE.md, agent frontmatter, ADRs. The LLM reads them and chooses whether to follow. Useful, but reliable only when the model cooperates.

Deterministic: code that fails on its own. A test, a hook, a schema validation. It doesn't care what the model decided — if the rule trips, the tool call dies.

When a rule really matters, it belongs in the deterministic column. When it's preferable but not critical, the semantic side is enough. Birgitta Böckeler lays out these axes as a 2×2 matrix: feedforward or feedback on one axis, computational or inferential on the other, in her original terminology. Here, deterministic corresponds to computational and semantic to inferential.

The four cells of the harness — a 2×2 matrix. Rows: Feedforward (before the step) and Feedback (after the step). Columns: Computational (the code decides) and Inferential (the model decides). Each cell is labelled in bold with its combined name and closes with the motto a control of that kind would say to the agent: guides · deterministic "this is always done this way", guides · semantic "think about it this way", sensors · deterministic "this is broken — look here", sensors · semantic "this smells off — are you sure?". Each cell also lists concrete instances.

Two axes. The columns: how the control acts — code that fails on its own (deterministic) or text the LLM reads (semantic). The rows: when it acts — before the step (guide) or after the step (sensor). Every harness piece lands in one of the four cells.

Guide · Deterministic — "this is always done this way." Allowed-tools per agent, PreToolUse hooks, permission allowlists, schema checks. You take the key away from the agent.
Guide · Semantic — "think about it this way." Project rules, ADRs, skills, agent frontmatter, specs. The agent reads them and decides whether to follow.
Sensor · Deterministic — "this is broken — look here." Build, tests, type-checks, PostToolUse hooks, spec validate. Green or red, no negotiation.
Sensor · Semantic — "this smells off — are you sure?" Adversarial review, structured JSON diagnostics, reviewer subagents, lesson capture. Soft critique the model can act on next time around.
Rule of thumb: every time a rule fails on you in production, promote it one column, from semantic to deterministic. An exit 2 cuts the call before the model gets a chance to decide.
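A promotion can be as small as turning one sentence of project rules into a path check. The sketch below assumes a rule like "do not edit applied migrations or the harness itself"; the protected paths and the names `is_protected` and `guard_edit` are hypothetical.

```shell
#!/usr/bin/env bash
# Before (semantic): a line in CLAUDE.md asking the agent not to touch these paths.
# After (deterministic): the same rule as a check a PreToolUse hook can call.
# The path patterns are examples, not a recommended list.

is_protected() {
  case "$1" in
    db/migrations/*|.claude/hooks/*|.claude/settings.json) return 0 ;;
    *) return 1 ;;
  esac
}

# Returns 2 with a short, actionable message when the path is protected.
guard_edit() {
  local path="$1"
  if is_protected "$path"; then
    echo "blocked: $path is protected; propose the change in the spec instead" >&2
    return 2
  fi
  return 0
}

for p in src/main.ts db/migrations/001_init.sql .claude/hooks/pre-tool-use.sh; do
  if guard_edit "$p" 2>/dev/null; then echo "ok      $p"; else echo "blocked $p"; fi
done
```

Listing the hook directory among the protected paths is deliberate: a rule only counts as deterministic if the agent cannot edit the code that enforces it.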

What flows through the harness

The matrix organizes the controls. What's missing is the connective tissue: how agents talk to each other, what gets written down at the end of a session, what signals persist so the next turn starts knowing where it left off. Without this layer, the loop opens every time you close the Claude Code window.

Four horizontal lanes, each a pattern of state and observability that connects the controls. Lane 1 (blue): structured outputs — a tester subagent emits a 7-field JSON to the orchestrator, enabling mechanical back-pressure. Lane 2 (yellow): session snapshots — a Stop hook writes a last.md (git status, active OpenSpec change, dirty markers, drift verdict) that the next session reads on open. Lane 3 (green): passive markers — editing a Swift file fires a PostToolUse hook that touches .codegraph-dirty, a stamp the next turn reads. Lane 4 (pink): meta-sensors — harness state on disk and operational docs both feed a drift check whose verdict either passes silently or blocks an archive.

Four small pieces do the work:

Structured outputs as inter-agent messages. When a tester subagent reports a failure, it emits a 7-field JSON with fields such as which test broke, module, severity, root-cause hypothesis, minimal repro, and suggested fix location. The orchestrator reads that structure and decides on its own whether to loop back to implementation or move forward to verification. That's mechanical back-pressure: the loop closes without going through the human.
Snapshots at session end. A Stop hook writes a last.md file with git status, the active OpenSpec change, which files got marked dirty, and the latest coherence verdict. The LLM's memory evaporates when you close the window — the snapshot rebuilds it on the next prompt so you don't have to.
Passive markers for non-blocking signals. A .codegraph-dirty file touched on every Swift edit is exactly that: a stamp on disk the next turn reads and reacts to. It persists state between operations without blocking anything. It's the dirty-bit pattern from caches and page tables, applied to your own repo.
Meta-sensors that watch the harness itself. A weekly script checks that the hooks declared in settings match the scripts actually on disk. Another detects when operational docs (CLAUDE.md, ROADMAP) and the real state of OpenSpec have drifted. This is second-order Böckeler: a sensor over the sensors. Without it, you only find out about drift when an agent starts behaving strangely.
This is what separates a complete harness from one that just has controls sitting around. The matrix tells you where each piece belongs; these four patterns tell you how the pieces talk, persist, and stay aligned when you're not looking.
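Two of these patterns fit in a few lines each. The sketch below is illustrative only: the `severity` key and its values, the routing thresholds, and the function names are assumptions, and the snapshot covers just git status and the dirty marker, leaving out the OpenSpec change and the coherence verdict.

```shell
#!/usr/bin/env bash
# Sketch: (1) route on a tester report, (2) write a session snapshot.
# The report key "severity" and its values are hypothetical.

# Print the orchestrator's next step for a JSON report read from stdin.
next_step() {
  local report severity
  report=$(cat)
  severity=$(jq -r '.severity // "none"' <<<"$report")
  case "$severity" in
    none|low) echo "verify" ;;      # nothing serious: move forward
    *)        echo "implement" ;;   # anything else loops back
  esac
}

# Write the file a Stop hook could leave for the next session to read.
write_snapshot() {
  local out="${1:-.claude/last.md}"
  mkdir -p "$(dirname "$out")"
  {
    echo "# Last session"
    echo
    echo "## git status"
    git status --short 2>/dev/null || echo "(not a git repository)"
    echo
    echo "## dirty markers"
    if [[ -e .codegraph-dirty ]]; then echo ".codegraph-dirty"; else echo "(none)"; fi
  } >"$out"
}

snapshot=$(mktemp)
write_snapshot "$snapshot"
cat "$snapshot"
```

With jq installed, `next_step <<<'{"severity":"high"}'` prints `implement`. Keeping the routing in a `case` statement means the decision to loop back is made by code reading a field, not by a model rereading a paragraph of test output.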

How I tune it per project

Hooks and permissions cover the reactive half of the harness: what happens while the agent is working. The other half — the proactive one — is the specs: what you hand the agent before it starts. For that half, OpenSpec is the first place I look.

OpenSpec gives you a serious template — proposal, design, tasks, acceptance — and pushes you to generate it, refine it with an adversarial reviewer, and execute against it. A strong starting point for the feedforward · inferential cell.

Now the ugly part: no single harness fits every project. A native Swift desktop app needs a different kind of supervision than a full-stack TypeScript service: its build is slower, its platform APIs are stricter, its tests are heavier. A team shipping infrastructure carries a very different blast radius than a team shipping UI. Every repo ends up drawing its own wiring.

Reuse the matrix: (feedforward, feedback) × (computational, inferential). That breakdown travels with you to any project.
Tune the concrete pieces in each repo — which hooks, which agents, which checks, which specs. That depends on the language, the stack, and the blast radius.
Promote rules upward when they hurt. Every recurring incident is a candidate for a new hook or a new sensor. The harness should learn alongside you.
Watch out for drift — hooks in settings, scripts on disk, rules in CLAUDE.md. They can fall out of sync silently. A periodic coherence check is a sensor on the sensors.
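A coherence check of that kind can start very small. The sketch below assumes hooks are declared in `.claude/settings.json` with commands that contain a `.claude/hooks/<script>` path; it finds those paths with a plain grep instead of parsing the JSON, which is crude but has no dependencies. `check_hook_drift` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Sketch: compare hook scripts referenced in settings with scripts on disk.
# Assumes settings live at ROOT/.claude/settings.json and scripts in ROOT/.claude/hooks/.

check_hook_drift() {
  local root="${1:-.}" settings drift=0 rel f
  settings="$root/.claude/settings.json"
  if [[ ! -f "$settings" ]]; then
    echo "DRIFT: $settings not found"
    return 1
  fi
  # Direction 1: declared in settings but missing on disk.
  for rel in $(grep -oE '\.claude/hooks/[A-Za-z0-9._-]+' "$settings" | sort -u); do
    if [[ ! -f "$root/$rel" ]]; then
      echo "DRIFT: declared but missing: $rel"
      drift=1
    fi
  done
  # Direction 2: present on disk but never declared.
  for f in "$root"/.claude/hooks/*; do
    [[ -f "$f" ]] || continue
    rel=".claude/hooks/$(basename "$f")"
    if ! grep -qF "$rel" "$settings"; then
      echo "DRIFT: on disk but not declared: $rel"
      drift=1
    fi
  done
  return "$drift"
}

# Scratch repo with one declared-but-missing script and one undeclared script.
demo=$(mktemp -d)
mkdir -p "$demo/.claude/hooks"
echo '{"hooks":{"PreToolUse":[{"hooks":[{"type":"command","command":".claude/hooks/pre.sh"}]}]}}' \
  >"$demo/.claude/settings.json"
touch "$demo/.claude/hooks/old.sh"
check_hook_drift "$demo" || echo "drift detected"
```

Run weekly or from CI, the non-zero return code is what turns this from a report into a sensor: it can block an archive step the same way a failing test would.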

I've been running this on my end-to-end projects for a while: spec-driven flows, an orchestrator delegating to typed subagents, pre/post hooks that enforce who-can-do-what, and a small library of skills that carries between sessions. The harness ends up sitting in the repo right next to the source code. Once you build with agents, you carry and maintain that harness alongside the product. If that sounds familiar and your hooks are giving you trouble, that's a conversation I'd like to have.

Stack

Claude Code · Hooks (PreToolUse / PostToolUse) · Subagents · OpenSpec · Skills · ADRs · Spec-Driven Development

Links

Birgitta Böckeler, Harness Engineering for AI Coding · OpenSpec