Tech essay · The why, not the what · June 2026

Code comments in the AI agent era

Code is now mass-produced, and the reader spending the most time on it is an AI agent. I built an entire app with one, in a language I haven't mastered, and the result forced me to rethink the old rule that code shouldn't be commented.

We're at a point where code is mass-produced and its biggest reader is an AI agent. I think that forces us to rethink things we considered settled, and one of them is how we use comments.

For years I wrote as few comments as possible: maintaining them was expensive and they ended up lying. But the story has changed: code is generated by an agent, the main reader is another agent, and hooks can check that a comment is still true. Writing a comment costs tokens, sure. But how much does it cost to infer that same information every time it isn't there?

I built an entire app with an agent, in a language I haven't mastered, and the result forced me to take the question seriously.

The piece that makes this sustainable is the harness in the diagram: pre-commit and pre-push hooks that launch adversarial review agents over every diff, hunting for drift between the code and its comments (and, as we'll see, the rest of the project's knowledge). It isn't deterministic, but it converges: every adversarial pass nudges code and comments further into sync, and the more independent reviews you stack, the easier it gets to tell real drift from hallucination.

An agent writes code and comments as a single artifact; a harness with pre-commit and pre-push hooks runs adversarial reviews over the diff. Not deterministic — but each pass converges code and comments and filters out hallucinated drift.
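The claim that stacked reviews separate real drift from hallucination can be sketched as a simple vote. This is an illustration under assumptions, not the harness itself: `DriftFinding`, `confirmedDrift`, the `minVotes` threshold and the file names are all hypothetical, and it assumes each adversarial pass returns its findings keyed by file and line.

```swift
import Foundation

/// Hypothetical shape of one reviewer finding: a place where comment and code disagree.
struct DriftFinding: Hashable {
    let file: String
    let line: Int
}

/// Keeps only the findings reported by at least `minVotes` independent review passes.
/// Assumption: a hallucinated finding rarely repeats across independent passes.
func confirmedDrift(reviews: [[DriftFinding]], minVotes: Int) -> [DriftFinding] {
    var votes: [DriftFinding: Int] = [:]
    for review in reviews {
        // A Set, so one reviewer repeating itself still counts as a single vote.
        for finding in Set(review) {
            votes[finding, default: 0] += 1
        }
    }
    return votes
        .filter { $0.value >= minVotes }
        .map { $0.key }
        .sorted { ($0.file, $0.line) < ($1.file, $1.line) }
}

// Three passes: all of them flag ImportView.swift, only one flags ShiftCache.swift.
let a = DriftFinding(file: "ImportView.swift", line: 42)
let b = DriftFinding(file: "ShiftCache.swift", line: 7)
let confirmed = confirmedDrift(reviews: [[a, b], [a], [a]], minVotes: 2)
```

With `minVotes: 2`, only the finding that two or more passes agree on survives; the single-pass finding is treated as a candidate hallucination and dropped or sent back for another pass.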

The context

#s1

Not all projects are alike: budget, time and risk change the approach. The reflection applies to products of any size, but it was born in a very specific context — the first cell of the diagram.

The project

iOS app: SwiftUI, SwiftData, CloudKit.
Built solo with Claude Code.
I come from TypeScript: I haven't mastered Swift or the platform.
No board, no product docs: just the code, the agent and me.

The result

Extremely high development speed.
I understood the project in depth without mastering the language.
Far more comments than my team tradition would tolerate.
I'd never written this many whys: is it optimal, or am I normalizing an excess?

Each project size calls for a different approach depending on budget, time and risk; this post was born in the first cell (one person, small scope, low risk), but the reflection applies to all of them.

The old rule: the why, not the what

#s2

I didn't invent this: it's been documented for decades, and Robert C. Martin covers it in Clean Code (the good comment is the one that explains intent). The short version of the consensus:

Comments, yes or no? Yes, just the necessary ones: each is text you have to maintain, and text can lie.
The what? Never. The reader, human or agent, infers the what from the code faster and more reliably than from any written explanation. With AI it's also pure token cost.
The why? Whenever the code cannot show it: the framework bug being dodged, the internal contract, the behavior a user reported. Here the information isn't in the code at any price.
Where does it get tricky? When the why sounds like a requirement: it's not obvious whether it belongs in separate documentation or next to the code. That's the real debate, and I get to it below.

The same stop sign, now with the why underneath: "high accident zone". The sign no longer narrates the obvious — it tells you why it's there.

WHAT comments are pure cost; WHY comments hold information that isn't in the code at any price.

Two comments from my real code

#s3

The two blocks below were written by the agent in my real code. The question for each one is the same: what information do they hold that the code doesn't?

The first one handles user identity when importing a planilla (the app's name for a shift schedule). The code shows a state property and an @AppStorage; it doesn't show why identity doesn't always resolve on its own, why a "Who are you?" picker exists, or why the name anchors to the spelling of the received planilla. And it leaves a key clue: ADR-0005 §4, the local comment pointing at the cross-cutting decision. The second one is cryptic on purpose: the comment documents the contract — why the model's identity isn't enough and you have to compare by value.

Example 1 · the user identity

/// Received planilla whose receiver
/// couldn't be resolved on its own:
/// presents the "Who are you?" picker
/// (ADR-0005 §4).
@State private var identityPick: IncomingPlanilla?

/// The user's name (their isSelf row).
/// Auto-resolves identity when importing
/// a file and anchors to the spelling
/// in the received planilla.
@AppStorage("import.userName") private var userName: String = ""

Without the comments, the picker looks removable — Chesterton's fence in its purest form. And the ADR reference turns the comment into a link to the full decision, without duplicating it. (Comments in both examples are translated from the Spanish originals.)

Example 2 · the cache contract

/// Content signature of the custom
/// shifts. `onChange` watches it to
/// refresh the cache when any field
/// changes, not just inserts or deletes:
/// editing name, color or classification
/// doesn't change the model's identity,
/// so we compare by value.
private var customShiftSignature: [String] {
    customShiftModels.map {
        "\($0.code)\u{1}\($0.label)\u{1}\($0.colorHex)\u{1}\($0.isWork)"
    }
}

Exactly the kind of fragile code that, without a written contract, breaks silently when someone adds a field to the model.
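The contract in that comment can be shown in isolation. The sketch below uses a plain class, `CustomShift`, as a hypothetical stand-in for the real SwiftData model (the field names are taken from the signature above; everything else is an assumption), to show that an edit leaves identity untouched while the value signature changes.

```swift
import Foundation

/// Hypothetical stand-in for the SwiftData model: a reference type, so editing
/// a field leaves the object's identity untouched.
final class CustomShift {
    var code: String
    var label: String
    var colorHex: String
    var isWork: Bool

    init(code: String, label: String, colorHex: String, isWork: Bool) {
        self.code = code
        self.label = label
        self.colorHex = colorHex
        self.isWork = isWork
    }
}

/// Same idea as `customShiftSignature`: one string per shift, built from every
/// field the cache depends on. A field missing here is an edit the cache never sees.
func signature(of shifts: [CustomShift]) -> [String] {
    shifts.map { "\($0.code)\u{1}\($0.label)\u{1}\($0.colorHex)\u{1}\($0.isWork)" }
}

let night = CustomShift(code: "N", label: "Night", colorHex: "#223344", isWork: true)
let shifts = [night]

let before = signature(of: shifts)
night.label = "Night shift"            // an edit, not an insert or a delete
let after = signature(of: shifts)

let sameObject = shifts[0] === night   // identity did not change
let changed = before != after          // the value signature did
```

The fragility is visible in `signature(of:)`: a new model field that affects the cache has to be added to the interpolated string by hand, which is exactly what the written contract reminds the next editor to do.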

An audit as a test bench

#s4

Recently I ran a full security audit over the project: several agents in parallel, powered by Fable, the latest frontier model in Claude Code, reading the entire source code. The goal was to hunt for security leaks, not to evaluate how well the project documented itself. And yet the audit didn't just thank me for the commented code: it proved the point. I repeated the exact same analysis after stripping every comment, and the comparison put the two paths side by side:

With comments: the audit cited the density of whys as what saved it from re-deriving every decision. It was cheaper and faster, and its verdict was more reliable.
Without comments: over the same code without a single comment, it burned far more tokens, took longer and hallucinated more. Every session pays the re-derivation again and every wrong edit adds a spike.

Two charts on the same scale: without why-comments, cumulative effort climbs with every session and adds spikes from wrong edits; with them, you pay once to write and a small toll per read.

Beyond the numbers, here's what the audit left me with:

The verdict surprised me: without the comments, the documented SwiftUI hacks would have looked like noise to clean up, and the audit would have pushed me into regressions.
Persisted reasoning across sessions: the agent wrote each comment with the full context (the bug reproduced, the alternative discarded); every future session, the AI's or a human's, inherits it for the price of reading it.
What about context overload? It degrades when context is irrelevant or contradictory; a co-located comment is the opposite: it only enters context when that file is touched.
The real risk is staleness: an incorrect comment is a bug, and it gets reviewed like a broken test. In an agentic flow, a pre-commit hook can police that drift (a topic for a post of its own). When the expiry comes from outside the repo rather than from a diff, see #s5.

The economics are asymmetric: the comment is paid in tokens on every read; its absence, in expensive re-derivation or very expensive wrong edits.

// the lesson
A review is only as good as the context it can read: without the written why, a hack dodging a bug looks like noise to clean up.

When the comment expires for external reasons

#s5

The framework-bug example leaves a question hanging: how do you detect that the next version of the library fixes the bug? Go back to the SwiftUI hack from the audit: today the workaround dodges a real bug, and its comment is what protects the code from a well-meaning cleanup. The day Apple fixes that bug in a new SDK, unannounced, the situation flips: the workaround becomes dead complexity and the comment — true until yesterday — starts lying. The harness polices drift between code and comment, but this drift isn't born in the repo: it arrives from outside.

The first instinct is to feed the hook the dependencies' changelogs. It doesn't hold up:

There's no structured source: it might be a CHANGELOG.md, GitHub Releases, a blog post, or nothing at all.
They're incomplete by nature: fixes lumped under "misc", or fixed by accident and never documented.
The comment↔changelog match is semantic, not lexical, and its cost is asymmetric: a false positive removes a workaround that's still needed and gifts you a regression; a false negative leaves dead complexity forever.
In my real case there's no changelog at all: for SwiftUI bugs, Apple's release notes are incomplete and Feedback Assistant is private.

The way out isn't a better changelog parser: it's turning the comment into something a machine can verify. The why is still written the same way; what gets standardized are the anchors that travel with it. Three, from strongest to weakest:

Anchor	How it warns you	Its limit
Canary test	A test that reproduces the bug and pins the broken behavior: `XCTExpectFailure` passes while the bug reproduces and fails the day the framework fixes it (equivalents: `withKnownIssue`, `test.failing`, `xfail`). A purely computational signal — and the only sensor that also catches a new SDK breaking differently.	Not every bug is testable: visual, timing, device-only.
Public issue	The issue's state is structured and queryable by API (`gh api … --jq .state`): a closed issue is the project itself asserting the fix, with the PR and the milestone. If it doesn't exist and the library is open source, filing it is the best cost/benefit step on the ladder.	Needs a public tracker; Apple's is private.
Version bound	`dep=swiftui sdk<=15.4`: a lockfile or SDK bump flags the comment for re-evaluation, not removal. It's the pattern behind `expiring-todo-comments` (ESLint) and `todo_or_die` (Ruby).	The weakest: it says "look at this", not "it's fixed".

The anchor ladder, strongest to weakest: the higher you climb, the more the signal looks like a red test and the less it looks like an opinion.
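A minimal sketch of the weakest rung, the version bound. It assumes the `dep=… sdk<=…` tag format from the table and a current SDK version supplied by whatever hook runs the check; `VersionBound`, `parseBound` and `needsReevaluation` are hypothetical names, not an existing tool.

```swift
import Foundation

/// Hypothetical parsed form of a bound such as `dep=swiftui sdk<=15.4`.
struct VersionBound {
    let dependency: String
    let maxVersion: [Int]   // "15.4" becomes [15, 4]
}

/// Parses "15.4" into [15, 4]; returns nil on any non-numeric component.
func parseVersion(_ text: String) -> [Int]? {
    let parts = text.split(separator: ".").map { Int(String($0)) }
    guard !parts.isEmpty, !parts.contains(where: { $0 == nil }) else { return nil }
    return parts.compactMap { $0 }
}

/// Finds the `dep=` and `sdk<=` tokens in one comment line; other tokens are ignored.
func parseBound(_ line: String) -> VersionBound? {
    var dependency: String?
    var upper: [Int]?
    for token in line.split(separator: " ") {
        if token.hasPrefix("dep=") {
            dependency = String(token.dropFirst("dep=".count))
        } else if token.hasPrefix("sdk<=") {
            upper = parseVersion(String(token.dropFirst("sdk<=".count)))
        }
    }
    guard let dep = dependency, !dep.isEmpty, let limit = upper else { return nil }
    return VersionBound(dependency: dep, maxVersion: limit)
}

/// True when `current` is newer than the bound. It only means "look at this
/// comment again"; it never means "remove the workaround".
func needsReevaluation(_ bound: VersionBound, current: [Int]) -> Bool {
    let count = max(bound.maxVersion.count, current.count)
    for i in 0..<count {
        let limit = i < bound.maxVersion.count ? bound.maxVersion[i] : 0
        let now = i < current.count ? current[i] : 0
        if now != limit { return now > limit }
    }
    return false
}

let bound = parseBound("/// WORKAROUND(FrameworkBug): dep=swiftui sdk<=15.4")!
let stillInside = needsReevaluation(bound, current: [15, 4])   // false
let afterBump = needsReevaluation(bound, current: [16, 0])     // true
```

The comparison is component-wise with missing components read as zero, so `15.4.1` counts as newer than `15.4`. That choice is an assumption of the sketch; a real check would follow the dependency's own versioning rules.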

The three don't compete — they stack, each one where its cost pays off:

The version bound, on every single workaround: it costs nothing and a linter can demand it — a WORKAROUND without dep= plus a version doesn't pass.
The issue, on every workaround with a tracker, filing it when it doesn't exist yet.
The canary, only where it hurts: workarounds whose regression is expensive (data loss) or whose bug is silent.
All three on a critical workaround isn't redundancy: it's defense in depth.
And the changelog? It stays as a last resort: an agent with the changelog and the diff between versions as context, whose output is always a proposal with cited evidence, never an automatic removal. The final confirmation is behavioral: a green canary or manual verification.

// the detector
A dependency bump lists the affected tags, and each one resolves with the strongest detector available: canary, issue, re-evaluation; only the residual case burns an agent reading changelogs. The exact format of the standardized comment is pinned down in the rules (#s10); the full pipeline is harness engineering proper — and it deserves its own post.
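One possible reading of that dispatch, sketched under assumptions: the `WorkaroundAnchors` model, the `Detector` cases and the `issueTrackerIsPublic` flag are hypothetical, and the fall-through to a changelog-reading agent is how this sketch interprets "the residual case".

```swift
import Foundation

/// Anchors a workaround comment may carry (hypothetical model of the tags in this post).
struct WorkaroundAnchors {
    var canaryTestPath: String?        // repro=
    var issueID: String?               // issue=
    var versionBound: String?          // sdk<=
    var issueTrackerIsPublic = false   // a private tracker cannot be queried
}

/// Detectors, ordered from the strongest signal to the weakest.
enum Detector: Equatable {
    case canary(path: String)
    case issueState(id: String)
    case reevaluate(bound: String)
    case changelogAgent
}

/// Picks the strongest detector the comment's anchors allow.
func strongestDetector(for anchors: WorkaroundAnchors) -> Detector {
    if let path = anchors.canaryTestPath { return .canary(path: path) }
    if let id = anchors.issueID, anchors.issueTrackerIsPublic { return .issueState(id: id) }
    if let bound = anchors.versionBound { return .reevaluate(bound: bound) }
    return .changelogAgent
}

// A private-tracker issue plus a version bound resolves to re-evaluation.
let appleBug = WorkaroundAnchors(issueID: "FB13241001", versionBound: "sdk<=15.4")
let chosen = strongestDetector(for: appleBug)
```

An issue on a private tracker is skipped rather than queried, which matches the limit noted in the table: the ID still documents the bug for humans, but it cannot act as a machine-readable signal.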

Where should each piece of knowledge live?

#s6

What starts as an observation about comments in the source code evolves into a broader question: where do we store the project's knowledge? It's the crux of the matter, and there's a fair objection worth raising to yourself: what you're writing in comments are requirements, functional or non-functional, and they should live in a separate document. It's true that it's easy to mix comments with documentation. But many of these whys aren't product requirements at any level: they're framework bugs, internal contracts, discarded alternatives. The test I propose isn't "is it functional or non-functional?" but "what does this knowledge change with?"

Knowledge lives where it changes: a comment if it changes with the code, a spec (OpenSpec) if with the product, an ADR if with the architecture, a test if it's observable behavior.

Where it lives	When to use it	Example
Spec (OpenSpec)	Product behavior: it changes when the product changes.	"The free plan has a daily limit of planillas."
ADR	Cross-cutting decision: it crosses modules and survives code rewrites.	ADR-0005: how user identity is resolved across the app.
Why-comment	Local constraint: it changes with the code and is fully visible from one function. If it hangs off a cross-cutting decision, link the ADR.	The cache contract; the hack dodging a SwiftUI bug (ADR-0005 §4).
Test	Observable behavior: it must stay true even if the code is rewritten.	"Importing the same planilla twice doesn't duplicate shifts."

Four homes for knowledge, each with its own rate of change. When in doubt, the question is always the same: what does this knowledge change with?

The table classifies; what it doesn't say is where to start, or how its rows relate to each other. In short:

The first option is always the code itself: good names, types, invariants and, with DDD, the domain model itself. Only what the code cannot show needs a row.
The rows link to each other: a why that changes with the code can live in a separate document if the function keeps a comment pointing at it. It works, but that's two pieces to keep in sync and it takes a harness watching the drift — work that disappears when the why sits next to the code.
And in the opposite direction: product behavior lives in its spec, but the code should point at it.
With that criterion, the objection dismantles itself: "if identity doesn't resolve on its own, ask the user" is a micro-requirement; the spelling anchor is pure implementation; and the cross-cutting part is neither copied nor lost: it's linked (ADR-0005).
Aren't these just ADRs in disguise? The why-comment is, deep down, an inline micro-ADR: too small to deserve its own file, too non-obvious to omit. In my project both coexist without friction: ADRs for cross-cutting decisions, comments for local constraints.
Tests deserve a mention of their own: they're the only documentation that can't lie for long (an outdated test is a red test), and they serve four purposes depending on the moment: specify before the code exists, verify while writing it, protect as a regression net, document as executable examples.
Test and comment are complements, not substitutes: the test freezes the observable what; the comment keeps the why the test cannot assert. Without the test, an agent can break the behavior without noticing; without the comment, it can "fix" correct code.

// the heuristic
If the decision crosses modules or survives rewrites of the code, it calls for an ADR. If it's a local constraint that lives and dies with the surrounding code, a comment is usually enough.

Specs and ADRs in the repository

#s7

For managing specifications, OpenSpec is a good fit: it keeps living specs of what's already built, plus change proposals that update those specs once archived. The ADR directory belongs in the repository too.

Keeping it all repo-resident makes agent integration trivial: agents can read the specs and the ADRs, cite them (like ADR-0005 in example 1) and police their sync with the code. Below, the OpenSpec cycle and the minimal tree I'd expect to find.

The OpenSpec cycle: a proposal in changes/ gets built in the code and, once archived, updates the living specs; the next change starts from updated truth, and an agent can read it, cite it and police its sync.

repo/
├── openspec/
│   ├── project.md      # conventions and context
│   ├── specs/          # what is ALREADY built
│   └── changes/        # in-flight proposals
│       └── archive/    # once done, they update specs/
├── docs/
│   └── adr/            # cross-cutting decisions
│       └── ADR-0005-user-identity.md
├── src/                # code + co-located whys
└── tests/              # the observable what, frozen

The minimal tree: specifications (OpenSpec), decisions (ADRs), code and tests in the same repository, readable by humans and agents.

Structure grows with the project

#s8

That tree is a starting point, not one-size-fits-all. And documentation doesn't climb levels: it's composed from groups of pieces. There's the code with its whys and the tests for the critical path (the common floor of any repo); the decisions (ADRs, conventions); the product behavior (living specs, contracts); and the shared model: DDD, where the code shares the domain model with product (the ubiquitous language), becomes largely self-documenting, and its domain unit tests are the closest thing there is to an executable spec — the comments left there are the technical ones: frameworks, contracts, performance.

Each project combines the groups its context calls for, and the mix isn't deterministic: the more people and the more risk, the richer the combination tends to get, but it's a tendency, not a rule. This project is single-person and has ADRs; I have another, also single-person, with OpenSpec; there are projects without DDD and projects that don't test the same way. And something changed with agents: in a greenfield project, wiring the documentation from day one is cheaper than ever, because the agent generates and maintains it with you.

Four groups of documentation pieces — code, decisions, product behavior, shared model — and three real projects combining them differently; size and risk push toward richer combinations as a tendency, not a rule.

Can you fit all the documentation a hundred-person project demands into a one-person project? Sure, it can be done. The question is what you get back: you lose speed today and the benefit doesn't grow at the same rate — diminishing returns. On the other side of the scale, designing the structure early makes adopting it later cheaper. So where do you stop? There's no universal answer: it depends on the nature of the project, and each scenario settles the balance differently:

An MVP or an experiment: minimal structure. Many projects die along the way; if this one dies early, you won't have buried time in documentation nobody will read — you'll have spent it more efficiently.
A small product that intends to grow: projects tend to grow, and the hard part is managing that growth. The dedication you invest at the start is what pays off at the end: every why and every decision written today is context nobody has to rebuild tomorrow.
Something designed to scale from day one: wiring exhaustive documentation up front isn't wrong — it's an investment, and with agents it costs less than ever.
My case: on the project in this post, starting small and growing the documentation alongside the project is what worked. It's the answer for this nature of project, not a universal truth.

Benefit versus cost in speed as documentation structure grows: benefit flattens out (diminishing returns), cost compounds, and the sweet spot moves right as the project grows.

// the criterion
There is no universal optimum for documentation: the nature of the project sets it, and it moves as the project does.

When locality isn't enough: the aggregate view

#s9

Everything so far optimizes one thing: locality. Each why sits next to its use, perfect for editing and for an agent touching a single file. But at the top of the scale from the previous section a fair objection survives, the one a QA lead, an auditor or a compliance reviewer raises: with the knowledge distributed, how do I answer "what are all the functional requirements?", "which test covers which one?", "what NFR does this ADR justify?" or "what changed with this feature?" Locality scatters exactly the global view that an audit needs. This is the classic tension between locality of reference and traceability, and it only shows up at the regulated, multi-team end of the ladder: a solo MVP never asks these questions; a product under audit asks them constantly.

Four distributed sources (spec, ADR, comment, test) on the left; a generated FR/NFR inventory and traceability matrix on the right. The matrix is generated in CI as a read-only projection; maintaining it by hand is crossed out, because it recreates a second source of truth that drifts.

ID	Type	Source	ADR	Tests	Status
FR-014	Functional	import-planilla.md	—	ImportPlanillaTests	covered
NFR-003	Privacy	privacy.md	ADR-0007	PrivacyTests	covered
NFR-004	Performance	perf.md	ADR-0008	—	gap: no test

Illustrative — this solo app doesn't need it. The point is the shape: every row is generated, and the empty Tests cell on NFR-004 is a red flag the view surfaces on its own, the way coverage surfaces an untested line.

The naive fix is a requirements-index.md and a traceability-matrix.md kept by hand. It doesn't hold up: a hand-maintained matrix is a second source of truth that diverges from specs, tests and ADRs, and it would need its own harness to police the drift. You'd be solving the problem by recreating the very thing this whole post fights. The fix comes down to one word, derived: the aggregate view is computed from the sources, not authored. An agent or a script crawls what already exists and computes the inventory: functional requirements from the specs, NFR justifications from the ADRs, coverage from the test names, status from the spec↔code delta that living specs make computable. It's a harness output, like a coverage report: regenerated in CI, never edited.

Two orthogonal axes, not one replacing the other: what does it change with? decides where the source lives (the heuristic from before); functional or non-functional? is the lens the aggregate view sorts by. A requirement is authored once in its spec and appears as a row — it isn't stored twice.
NFRs are where this earns its keep: "delete the PDF after parsing" scatters across a spec (the rule), an ADR (the how) and a test (the proof). An FR usually lives in one spec; an NFR almost never does. The matrix is what stitches the scattered NFR back into a single auditable row.
The view reports and audits at once: an FR with no test, or an NFR with no ADR justifying it, shows up as a gap — the same red flag as an uncovered line. The matrix doesn't just answer questions, it surfaces what's missing.
"What changed with this feature?" becomes a diff: regenerate the view at two commits and compare, instead of an archaeology session across specs, ADRs and tests.

// the projection
The aggregate view is a projection of the repo, not a parallel copy of it: authored once in specs, ADRs and tests; read many ways. The day it's hand-maintained, it has already started to lie.
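A minimal sketch of that projection, using the illustrative rows from the table above. It assumes a crawler has already extracted requirements from the specs and matched ADRs and tests to requirement IDs; `Requirement`, `MatrixRow`, `deriveMatrix` and `gaps` are hypothetical names.

```swift
import Foundation

/// Hypothetical record a crawler could extract from a spec file.
struct Requirement {
    let id: String        // e.g. "NFR-004"
    let type: String      // e.g. "Performance"
    let source: String    // spec file that authors the requirement
}

/// One generated row of the traceability matrix. Nothing here is edited by hand.
struct MatrixRow: Equatable {
    let id: String
    let type: String
    let source: String
    let adr: String?
    let tests: [String]
    var status: String { tests.isEmpty ? "gap: no test" : "covered" }
}

/// Derives the matrix from what already exists. `adrByRequirement` and
/// `testsByRequirement` stand in for whatever the crawler finds, for example
/// requirement IDs cited in ADR text or in test names.
func deriveMatrix(requirements: [Requirement],
                  adrByRequirement: [String: String],
                  testsByRequirement: [String: [String]]) -> [MatrixRow] {
    requirements
        .sorted { $0.id < $1.id }
        .map { req in
            MatrixRow(id: req.id,
                      type: req.type,
                      source: req.source,
                      adr: adrByRequirement[req.id],
                      tests: testsByRequirement[req.id] ?? [])
        }
}

/// IDs that should raise a flag in CI, the way an uncovered line does.
func gaps(in matrix: [MatrixRow]) -> [String] {
    matrix.filter { $0.tests.isEmpty }.map { $0.id }
}

let matrix = deriveMatrix(
    requirements: [
        Requirement(id: "FR-014", type: "Functional", source: "import-planilla.md"),
        Requirement(id: "NFR-003", type: "Privacy", source: "privacy.md"),
        Requirement(id: "NFR-004", type: "Performance", source: "perf.md"),
    ],
    adrByRequirement: ["NFR-003": "ADR-0007", "NFR-004": "ADR-0008"],
    testsByRequirement: ["FR-014": ["ImportPlanillaTests"], "NFR-003": ["PrivacyTests"]]
)
let missing = gaps(in: matrix)   // ["NFR-004"]
```

Because `MatrixRow` is `Equatable`, "what changed with this feature?" reduces to comparing the arrays derived at two commits.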

The rules, in short

#s10

For my current context, solo development with AI, the approach has worked very well, and the underlying idea travels: documentation can live in the repository. When specs, ADRs and whys share the repo with the code, the classic "docs go one way, code goes another" split disappears, and the system's growth benefits: whoever develops, human or agent, uses the documentation without leaving the repo instead of hunting for it outside. When product and development aren't the same team, the risk is product behavior trapped in the code, where a tester or a PM will never find it; there you can build tooling that syncs the documentation and projects it outward (frontends that generate the product view from the repository) instead of maintaining two diverging sources. On a team I would still negotiate the density: eight lines of doc-comment have a human reading cost the agent doesn't pay.

Never the what, always the why the code cannot show: the classic "don't comment" rule was never wrong — it was incomplete. It was always a rule against WHAT comments; domains with a lot of non-obvious constraint per line need more WHY comments.
An incorrect comment is a bug: it gets reviewed and maintained like the code. The observable what is documented in a test, the only documentation that verifies itself.
Knowledge lives where it changes — and all of it repo-resident: whys with the code, decisions in ADRs, product behavior in specs, the observable what in tests. One repository, readable by humans and by agents.
A harness polices the sync (harness engineering): pre-commit and pre-push hooks launching adversarial agents over every diff, hunting for drift between the repo's source code and the repo's documentation.
Comments also expire for external reasons (#s5): the framework fixes the bug and the workaround turns into dead complexity. Every workaround carries at least one machine-verifiable anchor (canary test > public issue > version bound); a workaround with no anchor doesn't pass.
Deltas are identified, not guessed: with living specs — OpenSpec is a great lever here — the distance between what the spec promises and what the code does becomes computable, and an agent can read it, cite it and reconcile it.
At the audit rung, the aggregate view is generated, not authored (#s9): the FR/NFR inventory and the traceability matrix are a projection computed from specs, ADRs and tests — never a hand-kept second source.
The sync isn't deterministic, but it converges: every adversarial pass pulls code and documentation closer, and the more independent reviews you stack, the easier it gets to tell real drift from hallucination.

Three roads, the same rule: drive at 30. The first adds a redundant sign that narrates the obvious — noise that costs attention and tokens without adding information. The other two add context: the camera that verifies the limit, the children that tell you why it exists — and you drive more alert. Code works the same way: the behavior doesn't change; the attention of whoever edits it does.

/// WORKAROUND(FrameworkBug): dep=swiftui sdk<=15.4
/// issue=FB13241001
/// repro=Tests/Canaries/FB13241001_Test.swift
/// <the why, written as always>

The standard for workaround comments, from #s5: every WORKAROUND carries at least one machine-verifiable anchor, in order of preference repro > issue > version bound. A workaround with no anchor doesn't pass the linter.
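A minimal sketch of that linter rule, assuming the tag layout shown above (`WORKAROUND(...)`, `dep=`, `sdk<=`, `issue=`, `repro=`). `WorkaroundLint` and `lintWorkaround` are hypothetical names, and the check is purely lexical: it confirms that an anchor is present, not that it is valid.

```swift
import Foundation

/// Result of linting one WORKAROUND doc-comment block.
struct WorkaroundLint: Equatable {
    var hasVersionBound = false   // dep= together with sdk<=
    var hasIssue = false          // issue=
    var hasRepro = false          // repro=
    /// The rule from this section: at least one machine-verifiable anchor.
    var passes: Bool { hasVersionBound || hasIssue || hasRepro }
}

/// Scans the lines of a doc comment. Returns nil when the block is not tagged
/// WORKAROUND, so ordinary why-comments are left alone.
func lintWorkaround(_ commentLines: [String]) -> WorkaroundLint? {
    guard commentLines.contains(where: { $0.contains("WORKAROUND(") }) else { return nil }
    var hasDep = false
    var hasBound = false
    var lint = WorkaroundLint()
    for line in commentLines {
        for token in line.split(separator: " ") {
            // A tag only counts when something follows the "=" or "<=".
            if token.hasPrefix("dep=") && token.count > "dep=".count { hasDep = true }
            if token.hasPrefix("sdk<=") && token.count > "sdk<=".count { hasBound = true }
            if token.hasPrefix("issue=") && token.count > "issue=".count { lint.hasIssue = true }
            if token.hasPrefix("repro=") && token.count > "repro=".count { lint.hasRepro = true }
        }
    }
    lint.hasVersionBound = hasDep && hasBound
    return lint
}

let tagged = lintWorkaround([
    "/// WORKAROUND(FrameworkBug): dep=swiftui sdk<=15.4",
    "/// issue=FB13241001",
    "/// repro=Tests/Canaries/FB13241001_Test.swift",
])
let bare = lintWorkaround(["/// WORKAROUND(FrameworkBug): looks odd, do not touch"])
// tagged?.passes is true; bare?.passes is false
```

A stricter variant, matching the rule stated in #s5, would require `hasVersionBound` on every block and treat the issue and the canary as additions on top.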

// what changed
The rule is the same as ever; what's new is that now it can be measured: a comment's cost is counted in tokens, and its benefit shows up in the quality of the edits. By that measure, the why-comment is one of the most profitable investments in the repository.