How I run agents today · hooks, specs, subagents
Harness Engineering — the code that surrounds the agent
An LLM is non-deterministic, so the critical rules have to live outside the agent, in code the agent cannot rewrite.
You leave Claude Code running for an hour while you're in another meeting. You come back and scan the commits. How much do you trust what you see?
The harness is everything around the agent that makes the answer easy. Hooks vet every tool call. Specs frame the work before the agent starts. Sensors catch the agent the moment it drifts. Birgitta Böckeler calls all of that harness engineering.
Here's how I build mine, one piece at a time.
What this is for
A few years ago, a huge chunk of the day went to micro-tasks we took for granted and barely counted: pulling the repo, creating a branch, spinning up the environment, writing boilerplate, moving files, retyping imports. On top of that, the bulk of the time went to implementation — typing code line by line. Today that work is shifting toward the extremes: planning up front and verification at the end. The typing belongs to the agent.
Send the agent off for an hour and come back to twenty changed files, and the rules of the game change. What matters then is supervision: someone watching what it did while you looked elsewhere. The harness is that supervision.
Look at the diagram. Plan and verify are where the time goes now. Plan is where you frame the problem, decide the approach, write the spec. Verify is where you read the diff the agent produced and see whether it matches what you asked for. When it doesn't, the lesson goes back into the harness — a new hook, a sharper rule, a stricter sensor — for next time.
- You hold the big context. Architecture decisions, business constraints, six-months-from-now consequences. None of that fits comfortably in the agent's window, and even if it did, it isn't where the agent earns its keep.
- The agent holds the local execution. Codegen, edits, refactors, scaffolding. Bounded tasks, supervised by the harness.
- The loop closes through verify → plan. When the diff misses, you go back up to plan, sharpen it, and run another lap. The next diff arrives better framed from the start.
- An hour on a sharper spec or a tighter hook keeps paying off run after run. I now spend more time upstream than feels intuitive.
Start with a hook
A hook is a shell script. Four lines — five if you're being thorough. Claude Code runs it every time the agent wants to call a tool: edit a file, run bash, search the codebase. The script inspects the call, decides whether to let it through, and returns an exit code.
If your session has several agents with different roles (a planner, an implementer, a reviewer), the hook is where you decide who can do what. Reviewer wants to edit a file? The hook reads the JSON payload, sees who's calling, and either lets the call through or returns exit 2 to kill it. The LLM learns about the rejection from the message the hook wrote to stderr and replans.
- The JSON payload has everything you need: tool_name, tool_input, agent_type. You write the policy in bash and jq.
- Reviewer trying to edit a protected path? exit 2. End of story.
- PostToolUse hooks are the symmetric version: they fire after the tool runs. Use them to log, mark caches dirty, kick off an indexer, append to a session journal.
- Hooks live in the repo at .claude/hooks/. The agent has no way around them, and a new rule is two more lines in the script.
#!/usr/bin/env bash
# .claude/hooks/pre-tool-use.sh
# Reads the tool-call payload (JSON) from stdin; exit 2 blocks the call.
payload=$(cat)
agent=$(jq -r '.agent_type // empty' <<<"$payload")
tool=$(jq -r '.tool_name // empty' <<<"$payload")
path=$(jq -r '.tool_input.file_path // empty' <<<"$payload")

# reviewer subagents are read-only: block every file-writing tool, not only Edit
if [[ "$agent" == "reviewer" && "$tool" =~ ^(Edit|MultiEdit|Write|NotebookEdit)$ ]]; then
  echo "reviewer cannot edit ${path:-this file}" >&2
  exit 2
fi
exit 0

Before and after each step
You've seen a hook. What comes next is the full loop that hook is one piece of — borrowed by Birgitta Böckeler from control theory.
Two kinds of signal sit around the agent. Guides (feedforward) reach the agent before it acts — they tell it what's expected. Sensors (feedback) fire after each step — they measure what the agent did and feed the result back so it can correct course. Together they close the loop.
It's the thermostat pattern. The guide says "hold 21°C". The sensor measures the temperature every minute. Strip one of the two and you're left with a dumb gadget.
- Guides only: a huge CLAUDE.md with 200 lines of rules and no test to check that any of them hold. The agent reads them once, forgets them on the next pass, nobody catches the slip.
- Sensors only: strict tests and CI but no living spec. The agent crashes into the same rule ten times because nobody told it why.
- Sensors built for the LLM. A test that fails and dumps an 800-line stack trace becomes garbage in the model's next turn — and the model gets stuck. Design the failure message the way you design the assert: two lines, one for the problem, one for an actionable next step.
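One way to get that two-line failure message is a small filter between the test runner and the agent. This is a minimal sketch, assuming failing lines contain "error" or "FAIL"; the function name summarize_failure and the PROBLEM/NEXT labels are mine, not from any tool.

```shell
#!/usr/bin/env bash
# Hypothetical sensor helper: condense a noisy test log into two lines for the model.
# Assumption: the first line containing "error" or "FAIL" is the root failure.

# summarize_failure <hint>: reads a log on stdin, prints a problem line and a next-step line.
summarize_failure() {
  local hint=$1 first
  # Keep only the first matching line; later matches tend to be cascade noise.
  first=$(grep -E 'error|FAIL' | head -n 1 || true)
  if [ -z "$first" ]; then
    first="no recognizable failure line in the log"
  fi
  printf 'PROBLEM: %s\n' "$first"
  printf 'NEXT: %s\n' "$hint"
}
```

Piped after a test run, for example swift test 2>&1 | summarize_failure "fix the first failing assertion, then rerun the tests", the model sees two lines instead of the full trace. The pattern is a guess at what your runner prints; adjust it per project.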
The big picture: the matrix
We have the when — guides before the step, sensors after. We're missing the how. Two very different ways to apply a control.
Semantic: text. CLAUDE.md, agent frontmatter, ADRs. The LLM reads them and chooses whether to follow. Useful, but reliable only when the model cooperates.
Deterministic: code that fails on its own. A test, a hook, a schema validation. It doesn't care what the model decided — if the rule trips, the tool call dies.
When a rule really matters, it belongs on the deterministic side. When it's preferable but not critical, the semantic side is enough. Birgitta Böckeler lays out these axes as a 2×2 matrix: feedforward vs. feedback on one axis, computational vs. inferential on the other, in her original terminology.
Two axes. The columns: how the control acts — code that fails on its own (deterministic) or text the LLM reads (semantic). The rows: when it acts — before the step (guide) or after the step (sensor). Every harness piece lands in one of the four cells.
- Guide · Deterministic — "this is always done this way." Allowed-tools per agent, PreToolUse hooks, permission allowlists, schema checks. You take the key away from the agent.
- Guide · Semantic — "think about it this way." Project rules, ADRs, skills, agent frontmatter, specs. The agent reads them and decides whether to follow.
- Sensor · Deterministic — "this is broken — look here." Build, tests, type-checks, PostToolUse hooks, spec validate. Green or red, no negotiation.
- Sensor · Semantic — "this smells off — are you sure?" Adversarial review, structured JSON diagnostics, reviewer subagents, lesson capture. Soft critique the model can act on next time around.
- Rule of thumb: every time a rule fails on you in production, promote it one quadrant: from semantic to deterministic. An exit 2 cuts the call before the model gets a chance to decide.
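As an illustration of that promotion, take a hypothetical CLAUDE.md rule, "never force-push", that the agent ignored once. The sketch below moves it into a PreToolUse check. It assumes the Bash tool's payload carries the command under tool_input.command; the function name and the matching pattern are illustrative and deliberately coarse.

```shell
#!/usr/bin/env bash
# Sketch: a semantic rule ("never force-push") promoted to a deterministic check.
# The rule, the function name, and the pattern are hypothetical.

# deny_force_push <command>: returns 2 (block) if the command force-pushes, 0 otherwise.
# The pattern is coarse on purpose: it also catches --force-with-lease.
deny_force_push() {
  case "$1" in
    *"git push"*"--force"*|*"git push"*" -f"*)
      echo "force-push is blocked; push normally or ask the human" >&2
      return 2
      ;;
  esac
  return 0
}

# As a hook entry point, reading the payload from stdin (assumed field name):
#   cmd=$(jq -r '.tool_input.command // empty')
#   deny_force_push "$cmd" || exit 2
```

The text rule can stay in CLAUDE.md as the explanation of why; the check is what actually stops the call.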
What flows through the harness
The matrix organizes the controls. What's missing is the connective tissue: how agents talk to each other, what gets written down at the end of a session, what signals persist so the next turn starts knowing where it left off. Without this layer, the loop opens every time you close the Claude Code window.
Four small pieces do the work:
- Structured outputs as inter-agent messages. When a tester subagent reports a failure, it emits a 7-field JSON report with fields such as which test broke, module, severity, root-cause hypothesis, minimal repro, and suggested fix location. The orchestrator reads that structure and decides on its own whether to loop back to implementation or move forward to verification. That's mechanical back-pressure: the loop closes without going through the human.
- Snapshots at session end. A Stop hook writes a last.md file with git status, the active OpenSpec change, which files got marked dirty, and the latest coherence verdict. The LLM's memory evaporates when you close the window; the snapshot rebuilds it on the next prompt so you don't have to.
- Passive markers for non-blocking signals. A .codegraph-dirty file touched on every Swift edit is exactly that: a stamp on disk the next turn reads and reacts to. It persists state between operations without blocking anything. It's the dirty bit pattern from caches and page tables, applied to your own repo.
- Meta-sensors that watch the harness itself. A weekly script checks that the hooks declared in settings match the scripts actually on disk. Another detects when operational docs (CLAUDE.md, ROADMAP) and the real state of OpenSpec have drifted. This is second-order Böckeler: a sensor over the sensors. Without it, you only find out about drift when an agent starts behaving strangely.
- This is what separates a complete harness from one that just has controls sitting around. The matrix tells you where each piece belongs; these four patterns tell you how the pieces talk, persist, and stay aligned when you're not looking.
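The first three pieces are small enough to sketch as shell functions. This is a rough outline, not my exact scripts: the severity field and its values, the routing policy, and writing last.md at the repo root are all assumptions, and the snapshot here leaves out the OpenSpec change and the coherence verdict.

```shell
#!/usr/bin/env bash
# Sketch of three harness pieces: report routing, a dirty marker, a session snapshot.
# Field names, severity values, and file locations are assumptions.

# route_report: reads a tester report (JSON) on stdin, prints the next phase.
# Anything not explicitly low-severity loops back to implementation.
route_report() {
  local severity
  severity=$(jq -r '.severity // "unknown"')
  case "$severity" in
    minor|info) echo "verify" ;;
    *)          echo "implement" ;;
  esac
}

# mark_dirty <edited_path> <repo_root>: touches the marker when a Swift file was edited.
mark_dirty() {
  case "$1" in
    *.swift) touch "$2/.codegraph-dirty" ;;
  esac
}

# write_snapshot <repo_root>: writes last.md with git status and the marker state.
write_snapshot() {
  local root=$1
  {
    echo "# Session snapshot"
    echo "## git status"
    git -C "$root" status --short 2>/dev/null || echo "(not a git repository)"
    echo "## markers"
    if [ -e "$root/.codegraph-dirty" ]; then
      echo "codegraph: dirty"
    else
      echo "codegraph: clean"
    fi
  } > "$root/last.md"
}
```

In this arrangement mark_dirty would run from a PostToolUse hook and write_snapshot from the Stop hook, while route_report sits in whatever script drives the orchestrator.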
How I tune it per project
Hooks and permissions cover the reactive half of the harness: what happens while the agent is working. The other half — the proactive one — is the specs: what you hand the agent before it starts. For that half, OpenSpec is the first place I look.
OpenSpec gives you a serious template (proposal, design, tasks, acceptance) and pushes you to generate it, refine it with an adversarial reviewer, and execute against it. A strong starting point for the feedforward · inferential cell, which is Guide · Semantic in the matrix above.
Now the ugly part: no single harness fits every project. A native Swift desktop app needs a different kind of supervision than a full-stack TypeScript service. The build is slower, the platform APIs are stricter, the tests are heavier. A team shipping infrastructure carries a very different blast radius than a team shipping UI. Every repo ends up drawing its own wiring.
- Reuse the matrix: feedforward and feedback on one axis, computational and inferential on the other. That breakdown travels with you to any project.
- Tune the concrete pieces in each repo — which hooks, which agents, which checks, which specs. That depends on the language, the stack, and the blast radius.
- Promote rules upward when they hurt. Every recurring incident is a candidate for a new hook or a new sensor. The harness should learn alongside you.
- Watch out for drift: hooks in settings, scripts on disk, rules in CLAUDE.md. They can fall out of sync silently. A periodic coherence check is a sensor on the sensors.
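A coherence check of that kind can be very small. The sketch below takes the hook script paths declared in settings, one per line, and reports the ones that are missing or not executable on disk. The function name is hypothetical, and the jq path in the trailing comment assumes hooks are declared in .claude/settings.json as relative script paths.

```shell
#!/usr/bin/env bash
# Sketch of a sensor on the sensors: do declared hook scripts exist on disk?

# missing_hooks <repo_root>: reads hook script paths (one per line) on stdin and
# prints each one that is missing or not executable under <repo_root>.
# Returns 1 if anything was printed, 0 if settings and disk agree.
missing_hooks() {
  local root=$1 path status=0
  while IFS= read -r path || [ -n "$path" ]; do
    [ -n "$path" ] || continue
    if [ ! -x "$root/$path" ]; then
      echo "declared but missing or not executable: $path"
      status=1
    fi
  done
  return "$status"
}

# Feeding it from settings (assumed layout: .hooks.<event>[].hooks[].command):
#   jq -r '.hooks[][].hooks[].command' .claude/settings.json | missing_hooks .
```

Run weekly from cron or CI, a non-zero exit is the signal that settings and disk have drifted apart.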
I've been running this on my end-to-end projects for a while: spec-driven flows, an orchestrator delegating to typed subagents, pre/post hooks that enforce who-can-do-what, and a small library of skills that carries between sessions. The harness ends up sitting in the repo right next to the source code. Once you build with agents, you carry and maintain that harness alongside the product. If that sounds familiar and your hooks are giving you trouble, that's a conversation I'd like to have.