Tech essay · The why, not the what · June 2026

Code comments in the AI agent era

Code is now mass-produced, and the reader spending the most time on it is an AI agent. I built an entire app with one, in a language I don't master, and the result forced me to rethink the old rule that code shouldn't be commented.

We're at a point where code is mass-produced and its biggest reader is an AI agent. I think that forces us to rethink things we considered settled, and one of them is how we use comments.

For years I wrote as few comments as possible: maintaining them was expensive and they ended up lying. But the story has changed: code is generated by an agent, the main reader is another agent, and hooks can verify a comment is still true. Writing a comment costs tokens, sure. And how much does it cost to infer that same information every time it isn't there?

I built an entire app with an agent, in a language I don't master, and the result forced me to take the question seriously.

The piece that makes this sustainable is the harness in the diagram: pre-commit and pre-push hooks that launch adversarial review agents over every diff, hunting for drift between the code and its comments (and, as we'll see, the rest of the project's knowledge). It isn't deterministic, but it converges: every adversarial pass nudges code and comments further into sync, and the more independent reviews you stack, the easier it gets to tell real drift from hallucination.

[Figure: An agent writing, a harness checking. Comments stop being a leap of faith when something verifies them on every commit.]
An agent writes code and comments as a single artifact; a harness with pre-commit and pre-push hooks runs adversarial reviews over the diff. Not deterministic — but each pass converges code and comments and filters out hallucinated drift.
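
The idea of stacking independent reviews can be sketched as a simple quorum vote. This is only one possible way to combine passes, not the harness described in this post: the `DriftFinding` shape, the `confirmedDrift` function and the file names are all hypothetical, and a real harness would also need a way to decide that two findings describe the same drift.

```swift
/// One drift report from a single review pass (hypothetical shape).
struct DriftFinding: Hashable {
    let file: String
    let line: Int
}

/// Keeps only the findings reported by at least `quorum` independent passes.
/// A finding that a single pass hallucinated is unlikely to be repeated by the others.
func confirmedDrift(passes: [Set<DriftFinding>], quorum: Int) -> Set<DriftFinding> {
    var votes: [DriftFinding: Int] = [:]
    for pass in passes {
        for finding in pass {
            votes[finding, default: 0] += 1
        }
    }
    return Set(votes.filter { $0.value >= quorum }.keys)
}

// Three passes: one finding shows up in all of them, two show up only once.
let real = DriftFinding(file: "ImportView.swift", line: 42)
let passes: [Set<DriftFinding>] = [
    [real, DriftFinding(file: "Cache.swift", line: 7)],
    [real],
    [real, DriftFinding(file: "Theme.swift", line: 3)],
]
let confirmed = confirmedDrift(passes: passes, quorum: 2)
print(confirmed.map { "\($0.file):\($0.line)" }) // ["ImportView.swift:42"]
```

Raising the quorum trades recall for precision: fewer hallucinated reports survive, but a real drift that only one pass notices is dropped too.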

The context

#s1

Not all projects are alike: budget, time and risk change the approach. The reflection applies to products of any size, but it was born in a very specific context — the first cell of the diagram.

The project

  • iOS app: SwiftUI, SwiftData, CloudKit.
  • Built solo with Claude Code.
  • I come from TypeScript: I don't master Swift or the platform.
  • No board, no product docs: just the code, the agent and me.

The result

  • Extremely high development speed.
  • I understood the project in depth without mastering the language.
  • Far more comments than my team tradition would tolerate.
  • I'd never written this many whys: is it optimal, or am I normalizing an excess?
[Figure: Not every project plays by the same rules. Five cells by team size (1 dev, ~5, ~15, ~100, ~200 people); the first cell, one dev with a small scope and low risk, is marked "this post starts here".]
Each project size calls for a different approach depending on budget, time and risk; this post was born in the first cell (one person, small scope, low risk), but the reflection applies to all of them.

The old rule: the why, not the what

#s2

I didn't invent this: it's been documented for decades, and Robert C. Martin covers it in Clean Code (the good comment is the one that explains intent). The short version of the consensus:

  • Comments, yes or no? Yes, just the necessary ones: each is text you have to maintain, and text can lie.
  • The what? Never. The reader, human or agent, infers the what from the code faster and more reliably than from any written explanation. With AI it's also pure token cost.
  • The why? Whenever the code cannot show it: the framework bug being dodged, the internal contract, the behavior a user reported. Here the information isn't in the code at any price.
  • Where does it get tricky? When the why sounds like a requirement: we don't know when it belongs in separate documentation and when next to the code. That's the real debate, and I get to it below.
The same stop sign, now with the why underneath: "high accident zone". The sign no longer narrates the obvious — it tells you why it's there.
[Figure: Two kinds of comments. Left card: WHAT comments narrate what the next line does (pure cost, don't write them). Right card: WHY comments document a constraint the code cannot show (write them). Bottom strip: the classic rule was never wrong, it was incomplete: it only ever targeted the WHAT.]
WHAT comments are pure cost; WHY comments hold information that isn't in the code at any price.

Two comments from my real code

#s3

The two blocks below were written by the agent in my real code. The question for each one is the same: what information do they hold that the code doesn't?

The first one handles user identity when importing a planilla (the app's name for a shift schedule). The code shows a state property and an @AppStorage; it doesn't show why identity doesn't always resolve on its own, why a "Who are you?" picker exists, or why the name anchors to the spelling of the received planilla. And it leaves a key clue: ADR-0005 §4, the local comment pointing at the cross-cutting decision. The second one is cryptic on purpose: the comment documents the contract — why the model's identity isn't enough and you have to compare by value.

Example 1 · the user identity
/// Received planilla whose receiver
/// couldn't be resolved on its own:
/// presents the "Who are you?" picker
/// (ADR-0005 §4).
@State private var identityPick: IncomingPlanilla?

/// The user's name (their isSelf row).
/// Auto-resolves identity when importing
/// a file and anchors to the spelling
/// in the received planilla.
@AppStorage("import.userName") private var userName: String = ""
Without the comments, the picker looks removable — Chesterton's fence in its purest form. And the ADR reference turns the comment into a link to the full decision, without duplicating it. (Comments in both examples are translated from the Spanish originals.)
Example 2 · the cache contract
/// Content signature of the custom
/// shifts. `onChange` watches it to
/// refresh the cache when any field
/// changes, not just inserts or deletes:
/// editing name, color or classification
/// doesn't change the model's identity,
/// so we compare by value.
private var customShiftSignature: [String] {
    customShiftModels.map {
        "\($0.code)\u{1}\($0.label)\u{1}\($0.colorHex)\u{1}\($0.isWork)"
    }
}
Exactly the kind of fragile code that, without a written contract, breaks silently when someone adds a field to the model.
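
A minimal, self-contained sketch of why the contract compares by value. The `CustomShift` struct below is a stand-in I made up for the real SwiftData model (only the four field names come from the snippet above), and the sample values are invented.

```swift
/// Stand-in for the real model; field names follow the snippet above.
struct CustomShift {
    let code: String
    var label: String
    var colorHex: String
    var isWork: Bool
}

/// Same idea as `customShiftSignature`: one string per shift, fields joined by U+0001.
func signature(of shifts: [CustomShift]) -> [String] {
    shifts.map { "\($0.code)\u{1}\($0.label)\u{1}\($0.colorHex)\u{1}\($0.isWork)" }
}

var shifts = [CustomShift(code: "M", label: "Morning", colorHex: "#FFCC00", isWork: true)]
let before = signature(of: shifts)

shifts[0].label = "Early"          // an edit: same element, same count
let after = signature(of: shifts)

print(before == after)             // false: the value comparison sees the edit
print(before.count == after.count) // true: a count-based check would miss it
```

The fragility is visible here too: a new field on the model that nobody adds to the interpolated string would change without changing the signature, and nothing would fail.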

An audit as a test bench

#s4

Recently I ran a full security audit over the project: several agents in parallel, powered by Fable, the latest frontier model in Claude Code, reading the entire source code. The goal was to hunt for security leaks, not to evaluate how well the project documented itself. And yet the audit didn't just thank me for the commented code: it proved the point. I repeated the exact same analysis after stripping every comment, and comparing the two runs shows what each path costs:

  • With comments: the audit cited the density of whys as what saved it from re-deriving every decision. It was cheaper, faster, and its verdict more reliable.
  • Without comments: over the same code without a single comment, it burned far more tokens, took longer and hallucinated more. Every session pays the re-derivation again and every wrong edit adds a spike.
[Figure: Two paths, two effort curves. Two line charts on the same scale showing cumulative effort, session after session, over the same file. Without why-comments: every session re-derives the same knowledge, and wrong edits (marked ✗) add regressions. With why-comments: written once, then roughly 100 to 200 tokens per read.]
Two charts on the same scale: without why-comments, cumulative effort climbs with every session and adds spikes from wrong edits; with them, you pay once to write and a small toll per read.

Beyond the numbers, here's what the audit left:

  • The verdict surprised me: without the comments, the documented SwiftUI hacks would have looked like noise to clean up, and the audit would have pushed me into regressions.
  • Persisted reasoning across sessions: the agent wrote each comment with the full context (the bug reproduced, the alternative discarded); every future session, the AI's or a human's, inherits it for the price of reading it.
  • What about context overload? It degrades when context is irrelevant or contradictory; a co-located comment is the opposite: it only enters context when that file is touched.
  • The real risk is staleness: an incorrect comment is a bug, and it gets reviewed like a broken test. In an agentic flow, a pre-commit hook can police that drift (a post of its own). When the expiry comes from outside the repo rather than from a diff, see #s5.
[Figure: The asymmetric economics of a why-comment. Cost when present: roughly 100 to 200 tokens each time the file enters context (small, predictable, bounded). Cost when absent: re-derivation (compile, run, reproduce the bug, git archaeology) or a wrong edit that ships a regression (unbounded, paid when you least expect it).]
The economics are asymmetric: the comment is paid in tokens on every read; its absence, in expensive re-derivation or very expensive wrong edits.
// the lesson

A review is only as good as the context it can read: without the written why, a hack dodging a bug looks like noise to clean up.
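
The asymmetry can be written down as a back-of-the-envelope model. This is a sketch under stated assumptions, not a measurement from the audit: it treats every session as paying either a fixed read cost or a fixed re-derivation cost, and it ignores the wrong-edit spikes, which only widen the gap. The only figure taken from this post is the read cost of roughly 100 to 200 tokens; every other number in the example is made up.

```swift
/// Cumulative token cost after `sessions` reads of the same file, with the comment present.
func costWithComment(writeCost: Int, readCost: Int, sessions: Int) -> Int {
    writeCost + readCost * sessions
}

/// Cumulative token cost when every session has to re-derive the same knowledge.
func costWithoutComment(rederiveCost: Int, sessions: Int) -> Int {
    rederiveCost * sessions
}

/// First session count at which the commented path is strictly cheaper, or nil if it never is.
func breakEven(writeCost: Int, readCost: Int, rederiveCost: Int) -> Int? {
    guard rederiveCost > readCost else { return nil }
    // Smallest n with writeCost + readCost * n < rederiveCost * n.
    return writeCost / (rederiveCost - readCost) + 1
}

// Hypothetical inputs: 4000 tokens to write the comment, 150 per read, 2150 per re-derivation.
let firstCheaperSession = breakEven(writeCost: 4000, readCost: 150, rederiveCost: 2150)
print(firstCheaperSession as Any) // Optional(3)
print(costWithComment(writeCost: 4000, readCost: 150, sessions: 10)) // 5500
print(costWithoutComment(rederiveCost: 2150, sessions: 10))          // 21500
```

Whatever the real numbers are, the shape is the same: one path has a fixed entry fee and a small slope, the other has no entry fee and a steep slope.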

When the comment expires for external reasons

#s5

The framework-bug example leaves a question hanging: how do you detect that the next version of the library fixes the bug? Go back to the SwiftUI hack from the audit: today the workaround dodges a real bug, and its comment is what protects the code from a well-meaning cleanup. The day Apple fixes that bug in a new SDK, unannounced, the situation flips: the workaround becomes dead complexity and the comment — true until yesterday — starts lying. The harness polices drift between code and comment, but this drift isn't born in the repo: it arrives from outside.

The first instinct is to feed the hook the dependencies' changelogs. It doesn't hold up:

  • There's no structured source: it might be a CHANGELOG.md, GitHub Releases, a blog post, or nothing at all.
  • They're incomplete by nature: fixes lumped under "misc", or fixed by accident and never documented.
  • The comment↔changelog match is semantic, not lexical, and its cost is asymmetric: a false positive removes a workaround that's still needed and gifts you a regression; a false negative leaves dead complexity forever.
  • In my real case there's no changelog at all: for SwiftUI bugs, Apple's release notes are incomplete and Feedback Assistant is private.
  • The way out isn't a better changelog parser: it's turning the comment into something a machine can verify. The why is still written the same way; what gets standardized are the anchors that travel with it. Three, from strongest to weakest:
| Anchor | How it warns you | Its limit |
| --- | --- | --- |
| Canary test | A test that reproduces the bug and pins the broken behavior: XCTExpectFailure passes while the bug reproduces and fails the day the framework fixes it (equivalents: withKnownIssue, test.failing, xfail). A purely computational signal, and the only sensor that also catches a new SDK breaking differently. | Not every bug is testable: visual, timing, device-only. |
| Public issue | The issue's state is structured and queryable by API (gh api … --jq .state): a closed issue is the project itself asserting the fix, with the PR and the milestone. If it doesn't exist and the library is open source, filing it is the best cost/benefit step on the ladder. | Needs a public tracker; Apple's is private. |
| Version bound | dep=swiftui sdk<=15.4: a lockfile or SDK bump flags the comment for re-evaluation, not removal. It's the pattern behind expiring-todo-comments (ESLint) and todo_or_die (Ruby). | The weakest: it says "look at this", not "it's fixed". |
The anchor ladder, strongest to weakest: the higher you climb, the more the signal looks like a red test and the less it looks like an opinion.

The three don't compete — they stack, each one where its cost pays off:

  • The version bound, on every single workaround: it costs nothing and a linter can demand it — a WORKAROUND without dep= plus a version doesn't pass.
  • The issue, on every workaround with a tracker, filing it when it doesn't exist yet.
  • The canary, only where it hurts: workarounds whose regression is expensive (data loss) or whose bug is silent.
  • All three on a critical workaround isn't redundancy: it's defense in depth.
  • And the changelog? It stays as a last resort: an agent with the changelog and the diff between versions as context, whose output is always a proposal with cited evidence, never an automatic removal. The final confirmation is behavioral: a green canary or manual verification.
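
As a rough sketch of the version-bound rule, a linter could parse the tag and refuse a WORKAROUND that lacks it. The comment format here is an assumption modeled on the dep=swiftui sdk<=15.4 example, not the standardized format this post refers to, and the names `WorkaroundTag`, `lint` and `needsReevaluation` are hypothetical.

```swift
import Foundation

/// Hypothetical tag shape, modeled on the `dep=swiftui sdk<=15.4` example.
struct WorkaroundTag: Equatable {
    let dep: String
    let maxVersion: [Int] // "15.4" becomes [15, 4]
}

enum LintResult: Equatable {
    case notAWorkaround
    case missingAnchor    // WORKAROUND without dep= plus a version: does not pass
    case ok(WorkaroundTag)
}

/// Parses "15.4" into [15, 4]; returns nil if any component is not a number.
func parseVersion(_ text: String) -> [Int]? {
    let parts = text.split(separator: ".").map { Int($0) }
    guard !parts.isEmpty, !parts.contains(where: { $0 == nil }) else { return nil }
    return parts.compactMap { $0 }
}

func lint(commentLine: String) -> LintResult {
    guard commentLine.contains("WORKAROUND") else { return .notAWorkaround }
    var dep: String?
    var bound: [Int]?
    for token in commentLine.split(whereSeparator: { $0 == " " || $0 == ":" }) {
        if token.hasPrefix("dep=") {
            dep = String(token.dropFirst("dep=".count))
        } else if let range = token.range(of: "<=") {
            bound = parseVersion(String(token[range.upperBound...]))
        }
    }
    guard let dep = dep, !dep.isEmpty, let bound = bound else { return .missingAnchor }
    return .ok(WorkaroundTag(dep: dep, maxVersion: bound))
}

/// True when the installed version is past the bound: flag for re-evaluation, never for removal.
func needsReevaluation(_ tag: WorkaroundTag, installed: [Int]) -> Bool {
    tag.maxVersion.lexicographicallyPrecedes(installed)
}

let tagged = lint(commentLine: "// WORKAROUND dep=swiftui sdk<=15.4: hypothetical reason")
let untagged = lint(commentLine: "// WORKAROUND: hypothetical reason")
print(tagged)   // ok(...) carrying dep "swiftui" and bound [15, 4]
print(untagged) // missingAnchor
```

Comparing versions as integer arrays rather than as decimals keeps 15.10 above 15.4, which a floating-point comparison would get wrong.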
// the detector

A dependency bump lists the affected tags, and each one resolves with the strongest detector available: canary, issue, re-evaluation; only the residual case burns an agent reading changelogs. The exact format of the standardized comment is pinned down in the rules (#s10); the full pipeline is harness engineering proper — and it deserves its own post.
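
The dispatch described above fits in a few lines. This is an illustrative sketch only: the `Workaround` record, its fields and the sample data are hypothetical, and the real pipeline is not shown in this post.

```swift
/// Detectors from the anchor ladder, plus the residual changelog-reading agent.
enum Detector: String {
    case canary, issue, versionBound, changelogAgent
}

/// Hypothetical record of one tagged workaround and the anchors it carries.
struct Workaround {
    let id: String
    let dep: String
    let hasCanaryTest: Bool
    let issueRef: String?
    let hasVersionBound: Bool
}

/// Picks the strongest anchor the workaround actually has.
func strongestDetector(for w: Workaround) -> Detector {
    if w.hasCanaryTest { return .canary }
    if w.issueRef != nil { return .issue }
    if w.hasVersionBound { return .versionBound }
    return .changelogAgent // residual case: an agent reads changelogs and proposes, never removes
}

/// Lists the workarounds affected by a bump of `dep`, each with the detector that resolves it.
func plan(bumped dep: String, workarounds: [Workaround]) -> [String: Detector] {
    var result: [String: Detector] = [:]
    for w in workarounds where w.dep == dep {
        result[w.id] = strongestDetector(for: w)
    }
    return result
}

// Made-up inventory: ids, deps and the issue reference are placeholders.
let inventory = [
    Workaround(id: "W1", dep: "swiftui", hasCanaryTest: true, issueRef: nil, hasVersionBound: true),
    Workaround(id: "W2", dep: "swiftui", hasCanaryTest: false, issueRef: nil, hasVersionBound: true),
    Workaround(id: "W3", dep: "swiftui", hasCanaryTest: false, issueRef: nil, hasVersionBound: false),
    Workaround(id: "W4", dep: "some-oss-lib", hasCanaryTest: false, issueRef: "tracker#1", hasVersionBound: true),
]
let swiftUIPlan = plan(bumped: "swiftui", workarounds: inventory)
for (id, detector) in swiftUIPlan.sorted(by: { $0.key < $1.key }) {
    print(id, detector.rawValue)
}
```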

Where should each piece of knowledge live?

#s6

What starts as a look at comments in the source code turns into a broader question: where should the project's information be stored? That's the crux of the matter, and it comes with a fair objection worth raising to yourself: what you're writing in comments are requirements, functional or non-functional, and they should live in a separate document. It's true that comments and documentation are easy to mix up. But many of these whys aren't product requirements at any level: they're framework bugs, internal contracts, discarded alternatives. The test I propose isn't "is it functional or non-functional?", but: what does this knowledge change with?

[Figure: Knowledge lives where it changes. A decision tree: step 0, let the code say it (clear names, types, invariants); then ask what the knowledge changes with: the code (why-comment), the product (spec), the architecture (ADR), the behavior (test).]
Knowledge lives where it changes: a comment if it changes with the code, a spec (OpenSpec) if with the product, an ADR if with the architecture, a test if it's observable behavior.
| Where it lives | When to use it | Example |
| --- | --- | --- |
| Spec (OpenSpec) | Product behavior: it changes when the product changes. | "The free plan has a daily limit of planillas." |
| ADR | Cross-cutting decision: it crosses modules and survives code rewrites. | ADR-0005: how user identity is resolved across the app. |
| Why-comment | Local constraint: it changes with the code and is fully visible from one function. If it hangs off a cross-cutting decision, link the ADR. | The cache contract; the hack dodging a SwiftUI bug (ADR-0005 §4). |
| Test | Observable behavior: it must stay true even if the code is rewritten. | "Importing the same planilla twice doesn't duplicate shifts." |
Four homes for knowledge, each with its own rate of change. When in doubt, the question is always the same: what does this knowledge change with?

The table classifies; what it doesn't say is where to start, or how its rows relate to each other. In short:

  • The first option is always the code itself: good names, types, invariants and, with DDD, the domain model itself. Only what the code cannot show needs a row.
  • The rows link to each other: a why that changes with the code can live in a separate document if the function keeps a comment pointing at it. It works, but that's two pieces to keep in sync and it takes a harness watching the drift — work that disappears when the why sits next to the code.
  • And in the opposite direction: product behavior lives in its spec, but the code should point at it.
  • With that criterion, the objection dismantles itself: "if identity doesn't resolve on its own, ask the user" is a micro-requirement; the spelling anchor is pure implementation; and the cross-cutting part is neither copied nor lost: it's linked (ADR-0005).
  • Aren't these just ADRs in disguise? The why-comment is, deep down, an inline micro-ADR: too small to deserve its own file, too non-obvious to omit. In my project both coexist without friction: ADRs for cross-cutting decisions, comments for local constraints.
  • Tests deserve a mention of their own: they're the only documentation that can't lie for long (an outdated test is a red test), and they serve four purposes depending on the moment: specify before the code exists, verify while writing it, protect as a regression net, document as executable examples.
  • Test and comment are complements, not substitutes: the test freezes the observable what; the comment keeps the why the test cannot assert. Without the test, an agent can break the behavior without noticing; without the comment, it can "fix" correct code.
// the heuristic

If the decision crosses modules or survives rewrites of the code, it calls for an ADR. If it's a local constraint that lives and dies with the surrounding code, a comment is usually enough.
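
The whole decision, step zero included, is small enough to write as a function. This is just the table and the heuristic restated as code, with type and case names of my own choosing; the judgment of what a piece of knowledge changes with still has to be made by whoever calls it.

```swift
/// What a piece of knowledge changes with.
enum ChangesWith {
    case code, product, architecture, behavior
}

/// The four homes from the table.
enum Home: String {
    case whyComment = "why-comment"
    case spec = "spec (OpenSpec)"
    case adr = "ADR"
    case test = "test"
}

/// Returns where the knowledge should live, or nil when the code itself can say it.
func home(codeCanShowIt: Bool, changesWith driver: ChangesWith) -> Home? {
    // Step 0: clear names, types and invariants need no extra home.
    if codeCanShowIt { return nil }
    switch driver {
    case .code: return .whyComment       // local constraint, lives and dies with the code
    case .product: return .spec          // product language, the code points to it
    case .architecture: return .adr      // crosses modules, survives rewrites
    case .behavior: return .test         // observable, executable
    }
}

print(home(codeCanShowIt: false, changesWith: .code)?.rawValue ?? "the code itself")         // why-comment
print(home(codeCanShowIt: false, changesWith: .architecture)?.rawValue ?? "the code itself") // ADR
print(home(codeCanShowIt: true, changesWith: .product)?.rawValue ?? "the code itself")       // the code itself
```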

Specs and ADRs in the repository

#s7

For managing specifications, OpenSpec feels very relevant right now: living specs of what's already built, plus change proposals that update the specs once archived. And the ADR directory belongs in the repository too.

Keeping it all repo-resident makes agent integration trivial: agents can read the specs and the ADRs, cite them (like ADR-0005 in example 1) and police their sync with the code. Below, the OpenSpec cycle and the minimal tree I'd expect to find.

[Figure: Specs that live in the repo. The OpenSpec cycle: proposal (openspec/changes/, what SHOULD change) → build (src/, tests/: code, whys and tests) → archive (change done, syncs the specs) → living spec (openspec/specs/, what IS true now).]
The OpenSpec cycle: a proposal in changes/ gets built in the code and, once archived, updates the living specs; the next change starts from updated truth, and an agent can read it, cite it and police its sync.
repo/
├── openspec/
│   ├── project.md      # conventions and context
│   ├── specs/          # what is ALREADY built
│   └── changes/        # in-flight proposals
│       └── archive/    # once done, they update specs/
├── docs/
│   └── adr/            # cross-cutting decisions
│       └── ADR-0005-user-identity.md
├── src/                # code + co-located whys
└── tests/              # the observable what, frozen
The minimal tree: specifications (OpenSpec), decisions (ADRs), code and tests in the same repository, readable by humans and agents.
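
With everything in one repository, even a small script can police the links. As a minimal sketch, the check below finds ADR ids cited in source text and reports the ones with no matching file in docs/adr/. It assumes ids look like ADR-0005 and that ADR file names start with the id, as in the tree above; the function names and the ADR-0009 reference in the example are hypothetical.

```swift
import Foundation

/// Returns every ADR id (for example "ADR-0005") mentioned in `source`.
func adrReferences(in source: String) -> Set<String> {
    var found: Set<String> = []
    var searchRange = source.startIndex..<source.endIndex
    while let match = source.range(of: #"ADR-\d{4}"#, options: .regularExpression, range: searchRange) {
        found.insert(String(source[match]))
        searchRange = match.upperBound..<source.endIndex
    }
    return found
}

/// ADR ids cited in the code that have no file starting with that id in docs/adr/.
func danglingADRs(source: String, adrFiles: [String]) -> Set<String> {
    adrReferences(in: source).filter { id in
        !adrFiles.contains { $0.hasPrefix(id) }
    }
}

let source = """
/// presents the "Who are you?" picker
/// (ADR-0005 §4).
/// See also ADR-0009.
"""
let dangling = danglingADRs(source: source, adrFiles: ["ADR-0005-user-identity.md"])
print(dangling) // ["ADR-0009"]
```

A check like this is deterministic, so it can fail a commit outright; whether the cited section still says what the comment claims is the part left to a review agent.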

Structure grows with the project

#s8

That tree is a starting point, not one-size-fits-all. And documentation doesn't climb levels: it's composed from groups of pieces. There are four: the code with its whys and the tests for the critical path (the common floor of any repo); the decisions (ADRs, conventions); the product behavior (living specs, contracts); and the shared model. That last one is DDD: the code shares the domain model with product (the ubiquitous language) and becomes largely self-documenting, and its domain unit tests are the closest thing there is to an executable spec. The comments left there are the technical ones: frameworks, contracts, performance.

Each project combines the groups its context calls for, and the mix isn't deterministic: the more people and the more risk, the richer the combination tends to get, but it's a tendency, not a rule. This project is single-person and has ADRs; I have another, also single-person, with OpenSpec; there are projects without DDD and projects that don't test the same way. And something changed with agents: in a greenfield project, wiring the documentation from day one is cheaper than ever, because the agent generates and maintains it with you.

[Figure: Documentation: combinations, not levels. Four independent groups (in the code, decisions, product behavior, shared model) and three example mixes: this iOS app, solo (code floor plus ADRs); another solo project (code floor plus living specs); a multi-team product (code floor plus ADRs, specs, DDD with domain tests, and derived product docs).]
Four groups of documentation pieces — code, decisions, product behavior, shared model — and three real projects combining them differently; size and risk push toward richer combinations as a tendency, not a rule.

Can you fit all the documentation a hundred-person project demands into a one-person project? Sure, it can be done. The question is what you get back: you lose speed today and the benefit doesn't grow at the same rate — diminishing returns. On the other side of the scale, designing the structure early makes adopting it later cheaper. So where do you stop? There's no universal answer: it depends on the nature of the project, and each scenario settles the balance differently:

  • An MVP or an experiment: minimal structure. Many projects die along the way; if this one dies early, you won't have buried time in documentation nobody will read — you'll have spent it more efficiently.
  • A small product that intends to grow: projects tend to grow, and the hard part is managing that growth. The dedication you invest at the start is what pays off at the end: every why and every decision written today is context nobody has to rebuild tomorrow.
  • Something designed to scale from day one: wiring exhaustive documentation up front isn't wrong — it's an investment, and with agents it costs less than ever.
  • My case: on the project in this post, starting small and growing the documentation alongside the project is what worked. It's the answer for this kind of project, not a universal truth.
[Figure: The structure sweet spot. Horizontal axis: weight of documentation structure (specs, ADRs, DDD, process). Vertical axis: value to the project. Curves: benefit (flattens out) and cost in speed (compounds); a shaded band marks the sweet spot for this size, today.]
Benefit versus cost in speed as documentation structure grows: benefit flattens out (diminishing returns), cost compounds, and the sweet spot moves right as the project grows.
// the criterion

There is no universal optimum for documentation: the nature of the project sets it, and it moves as the project does.

When locality isn't enough: the aggregate view

#s9

Everything so far optimizes one thing: locality. Each why sits next to its use, perfect for editing and for an agent touching a single file. But at the top of the scale from the previous section a fair objection survives — the one a QA lead, an auditor or a compliance reviewer raises: with the knowledge distributed, how do I answer what are all the functional requirements?, which test covers which one?, what NFR does this ADR justify?, what changed with this feature? Locality scatters exactly the global view that audit needs. This is the classic tension — locality of reference vs. traceability — and it only shows up at the regulated, multi-team end of the ladder: a solo MVP never asks these questions; a product under audit asks them constantly.

[Figure: The aggregate view is generated, not authored. Left: four distributed sources, each where it changes (spec → FR, ADR → NFR why, comment → local why, test → coverage). Right: a generated FR/NFR inventory and traceability matrix, produced in CI as a read-only projection.]
Four distributed sources (spec, ADR, comment, test) on the left; a generated FR/NFR inventory and traceability matrix on the right. The matrix is generated in CI as a read-only projection; maintaining it by hand is crossed out, because it recreates a second source of truth that drifts.
ID      | Type        | Source             | ADR      | Tests               | Status
FR-014  | Functional  | import-planilla.md |          | ImportPlanillaTests | covered
NFR-003 | Privacy     | privacy.md         | ADR-0007 | PrivacyTests        | covered
NFR-004 | Performance | perf.md            | ADR-0008 |                     | gap: no test
Illustrative — this solo app doesn't need it. The point is the shape: every row is generated, and the empty Tests cell on NFR-004 is a red flag the view surfaces on its own, the way coverage surfaces an untested line.

The naive fix is a requirements-index.md and a traceability-matrix.md kept by hand. It doesn't hold up: a hand-maintained matrix is a second source of truth that diverges from specs, tests and ADRs, and it would need its own harness to police the drift — you'd be solving the problem by recreating the very thing this whole post fights. The fix is one word: the aggregate view is derived, not authored. An agent or a script crawls what already exists and computes the inventory: functional requirements from the specs, NFR justifications from the ADRs, coverage from the test names, status from the spec↔code delta the post already calls computable. It's a harness output, like a coverage report: regenerated in CI, never edited.
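
As a rough sketch of that crawl, here is one way the projection could be computed. Python is used only for illustration, and everything in it is an assumption rather than the tooling behind this post: the `FR-###`/`NFR-###` ID pattern, the idea that ADRs and tests mention the IDs they relate to, and the names `Row` and `build_matrix` are all hypothetical. Sources come in as name-to-text mappings so the logic stays independent of how the files are read.

```python
import re
from dataclasses import dataclass, field

# Assumed ID convention (hypothetical): FR-014, NFR-003, ...
REQ_ID = re.compile(r"\b(?:FR|NFR)-\d+\b")


@dataclass
class Row:
    req_id: str
    source: str                                  # spec that authors the requirement
    adrs: list = field(default_factory=list)     # ADRs that mention it
    tests: list = field(default_factory=list)    # tests that mention it

    @property
    def status(self) -> str:
        # An empty Tests cell is the gap the view is meant to surface.
        return "covered" if self.tests else "gap: no test"


def build_matrix(specs: dict, adrs: dict, tests: dict) -> list:
    """Derive one row per requirement from name -> text mappings."""
    rows = {}
    # Requirements are authored once, in the specs.
    for path, text in sorted(specs.items()):
        for req_id in REQ_ID.findall(text):
            rows.setdefault(req_id, Row(req_id, path))
    # ADRs and tests only attach to requirements a spec already declares.
    for name, text in sorted(adrs.items()):
        for req_id in sorted(set(REQ_ID.findall(text))):
            if req_id in rows:
                rows[req_id].adrs.append(name)
    for name, text in sorted(tests.items()):
        for req_id in sorted(set(REQ_ID.findall(text))):
            if req_id in rows:
                rows[req_id].tests.append(name)
    return [rows[key] for key in sorted(rows)]


if __name__ == "__main__":
    # Made-up sample data mirroring the illustrative table.
    matrix = build_matrix(
        specs={"privacy.md": "NFR-003: delete the PDF after parsing",
               "perf.md": "NFR-004: performance requirement"},
        adrs={"ADR-0007": "Justifies NFR-003", "ADR-0008": "Justifies NFR-004"},
        tests={"PrivacyTests": "// covers NFR-003"},
    )
    for row in matrix:
        print(row.req_id, row.source, row.adrs, row.tests, row.status)
```

Nothing here is written back to the sources: the function only reads and returns rows, which is what keeps the output a report rather than a second place to edit.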

  • Two orthogonal axes, not one replacing the other: "what does it change with?" decides where the source lives (the heuristic from before); "functional or non-functional?" is the lens the aggregate view sorts by. A requirement is authored once in its spec and appears as a row; it isn't stored twice.
  • NFRs are where this earns its keep: "delete the PDF after parsing" scatters across a spec (the rule), an ADR (the how) and a test (the proof). An FR usually lives in one spec; an NFR almost never does. The matrix is what stitches the scattered NFR back into a single auditable row.
  • The view reports and audits at once: an FR with no test, or an NFR with no ADR justifying it, shows up as a gap — the same red flag as an uncovered line. The matrix doesn't just answer questions, it surfaces what's missing.
  • "What changed with this feature?" becomes a diff: regenerate the view at two commits and compare, instead of an archaeology session across specs, ADRs and tests.
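
The "becomes a diff" point can be sketched in a few lines. This assumes, hypothetically, that each generated view is available as a mapping from requirement ID to its row cells (for example, produced by regenerating the view at two commits); `diff_views` and that shape are illustrative choices, not an existing tool.

```python
def diff_views(before: dict, after: dict) -> dict:
    """Compare two generated views, each keyed by requirement ID.

    Each view maps an ID to a dict of column name -> cell value.
    """
    added = sorted(after.keys() - before.keys())
    removed = sorted(before.keys() - after.keys())
    changed = {}
    for req_id in sorted(before.keys() & after.keys()):
        old, new = before[req_id], after[req_id]
        # Keep only the cells whose value differs between the two views.
        cells = {
            col: (old.get(col), new.get(col))
            for col in sorted(old.keys() | new.keys())
            if old.get(col) != new.get(col)
        }
        if cells:
            changed[req_id] = cells
    return {"added": added, "removed": removed, "changed": changed}
```

A requirement that gained its first test between the two commits would show up under `changed` with its Tests and Status cells as (old, new) pairs, which is the answer an auditor is usually after.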
// the projection

The aggregate view is a projection of the repo, not a parallel copy of it: authored once in specs, ADRs and tests; read many ways. The day it's hand-maintained, it has already started to lie.

The rules, in short

#s10

For my current context — solo development with AI — the approach has worked very well, and the underlying idea travels: documentation can live in the repository. When specs, ADRs and whys share the repo with the code, the classic "docs go one way, code goes another" disappears, and the system's growth benefits: whoever develops, human or agent, uses the documentation without leaving the repo instead of hunting for it outside. And when product and development aren't the same team, the risk is product behavior trapped in the code, where a tester or a PM will never find it; there you can build tooling that syncs the documentation and projects it outward — frontends that generate the product view from the repository — instead of maintaining two diverging sources. On a team I would still negotiate the density: eight lines of doc-comment have a human reading cost the agent doesn't pay.

  • Never the what, always the why the code cannot show: the classic "don't comment" rule was never wrong — it was incomplete. It was always a rule against WHAT comments; domains with a lot of non-obvious constraint per line need more WHY comments.
  • An incorrect comment is a bug: it gets reviewed and maintained like the code. The observable what is documented in a test, the only documentation that verifies itself.
  • Knowledge lives where it changes — and all of it repo-resident: whys with the code, decisions in ADRs, product behavior in specs, the observable what in tests. One repository, readable by humans and by agents.
  • A harness polices the sync (harness engineering): pre-commit and pre-push hooks launching adversarial agents over every diff, hunting for drift between the repo's source code and the repo's documentation.
  • Comments also expire for external reasons (#s5): the framework fixes the bug and the workaround turns into dead complexity. Every workaround carries at least one machine-verifiable anchor — canary test > public issue > version bound; a why with no anchor doesn't pass.
  • Deltas are identified, not guessed: with living specs — OpenSpec is a great lever here — the distance between what the spec promises and what the code does becomes computable, and an agent can read it, cite it and reconcile it.
  • At the audit rung, the aggregate view is generated, not authored (#s9): the FR/NFR inventory and the traceability matrix are a projection computed from specs, ADRs and tests — never a hand-kept second source.
  • The sync isn't deterministic, but it converges: every adversarial pass pulls code and documentation closer, and the more independent reviews you stack, the easier it gets to tell real drift from hallucination.
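
One possible reading of "stacking independent reviews" is a simple vote: a drift finding only counts once enough independent passes agree on it. The sketch below assumes each review's output has already been normalized to comparable keys such as `(file, line)`; that normalization, the `consensus` name and the `min_votes` threshold are hypothetical, not a description of the harness used here.

```python
from collections import Counter


def consensus(findings_per_review: list, min_votes: int = 2) -> list:
    """Keep the findings reported by at least `min_votes` independent reviews.

    Each review is an iterable of hashable finding keys, e.g. (file, line).
    """
    votes = Counter()
    for findings in findings_per_review:
        # A review votes at most once per finding, even if it repeats itself.
        votes.update(set(findings))
    return sorted(key for key, count in votes.items() if count >= min_votes)
```

A finding that a single pass invents tends to stay at one vote and drops out, while real drift is more likely to be flagged again by the next reviewer. Raising `min_votes` trades missed drift for fewer false alarms.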
Three roads, the same rule: drive at 30. The first adds a redundant sign that narrates the obvious — noise that costs attention and tokens without adding information. The other two add context: the camera that verifies the limit, the children that tell you why it exists — and you drive more alert. Code works the same way: the behavior doesn't change; the attention of whoever edits it does.
/// WORKAROUND(FrameworkBug): dep=swiftui sdk<=15.4
/// issue=FB13241001
/// repro=Tests/Canaries/FB13241001_Test.swift
/// <the why, written as always>
The standard for workaround comments, from #s5: every WORKAROUND carries at least one machine-verifiable anchor, in order of preference repro > issue > version bound. A why with no anchor doesn't pass the linter.
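
A linter for that rule could look roughly like the sketch below. It is an assumption-laden illustration, not the linter behind this post: it treats a `///` run that starts with `WORKAROUND(` as one block, and it recognizes anchors by the `repro=`, `issue=` and `name<=version` shapes seen in the example above. The function name `lint_workarounds` is hypothetical.

```python
import re

START = re.compile(r"^\s*///\s*WORKAROUND\(")
DOC_LINE = re.compile(r"^\s*///")
# Anchor shapes inferred from the example comment (an assumption).
ANCHORS = (
    re.compile(r"\brepro=\S+"),                 # canary test path
    re.compile(r"\bissue=\S+"),                 # public issue id
    re.compile(r"\b\w+(?:<=|<)\d+(?:\.\d+)*"),  # version bound, e.g. sdk<=15.4
)


def lint_workarounds(source: str) -> list:
    """Return 1-based line numbers of WORKAROUND blocks with no anchor."""
    lines = source.splitlines()
    violations = []
    i = 0
    while i < len(lines):
        if not START.match(lines[i]):
            i += 1
            continue
        start = i
        block = []
        # Consume the contiguous doc-comment run that forms the block.
        # Note: two WORKAROUNDs in one unbroken run are checked as one block.
        while i < len(lines) and DOC_LINE.match(lines[i]):
            block.append(lines[i])
            i += 1
        text = "\n".join(block)
        if not any(pattern.search(text) for pattern in ANCHORS):
            violations.append(start + 1)
    return violations
```

This only checks that an anchor is present. Verifying that the canary test still exists, or that the version bound has not been passed, would be separate checks layered on top.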
// what changed

The rule is the same as ever; what's new is that now it can be measured: a comment's cost is counted in tokens, and its benefit shows up in the quality of the edits. By that measure, the why-comment is one of the most profitable investments in the repository.

Threads in this story