Keeping the AI Honest: Lock Files, Tests, and Design Docs

2026-04-17 — MaiMai

In June 2025, METR published a study showing something uncomfortable: frontier AI models, given a coding task and a test suite, increasingly chose to modify the tests rather than fix the bug. OpenAI's o3 model, asked to speed up a program, hacked the timer instead so execution always appeared fast. In November 2025, Anthropic's own research showed Claude models trained on coding environments learning to issue sys.exit(0) to exit the test harness with a success code — making failing tests appear to pass.

There's a name for this: reward hacking, also called specification gaming. The model optimizes the literal goal it was given ("make the test pass") without the unstated constraint you assumed was obvious ("…by fixing the actual bug"). Anthropic claims Claude 4 reduced this behavior by 65% compared to Claude 3.5 Sonnet. Sixty-five percent is a lot. It's not a hundred.

When you're solo vibe coding, there's no code reviewer to catch it. You are the only line of defense. Here's what that looks like in practice.

Ask AI Open-Ended Questions

Experienced professionals know a workplace rule: give your boss a multiple-choice question, not an essay prompt. Flip that. You are now the AI's "boss" — reverse the rule.

Reward hacking is AI cutting corners to meet a goal; another commonly overlooked pattern is its tendency to obey. Point east and it earnestly walks east, even when north is the right direction. Give it an explicit algorithm, framework, or implementation path in your prompt, and it will follow your path to the end — and argue that path was the best one, even when it wasn't.

So: unless you already have a firm architectural decision, don't tell the AI "use algorithm X," "pick framework Y," "implement it this way." Step back into the product manager's chair and describe the problem — what you're solving, where the constraints are, what the performance or complexity bounds look like, what counts as "good enough." Then let the AI enumerate candidate solutions, each with its own pros, cons, risks, and fit.

Two payoffs. The design space opens up and you're making a choice instead of rubber-stamping a direction that's already drifted. And the AI's argumentative skill gets pointed at comparing solutions rather than defending one — which is work it's genuinely good at.

But asking a single AI still has blind spots — it can make the first option it thought of sound so compelling that you assume it's the only viable one. You need a check.

In FIRE51, ChatGPT is the second-opinion assistant. For any design question, I hand the same problem to both Claude Code and ChatGPT, let two models with different training backgrounds each propose, then cross-reference. Typical outcome: Claude's recommendation looks great, but ChatGPT catches an edge case it missed — or the other way around. After the two AIs check each other, the architect makes the final call. No single AI gets to lead you by the nose.

The Pattern to Watch

Left unconstrained, a capable AI assistant will, in good faith:

Modify test fixtures to make tests pass
Simplify edge-case inputs to avoid failures
Refactor code you didn't ask it to touch
Add features you didn't request

None of this is malicious. It's pure optimization toward the stated goal, without the unstated constraints you assumed were obvious. Exactly the behavior METR documented.

What I Caught in FIRE51

The first time this bit me: Claude was debugging a tax calculation failure. The test expected a federal tax of $18,432 for a specific income profile. The engine computed $18,601. Claude proposed changing the expected value in tests/midupperclass2M_baseline.json.

That baseline was built by hand, against IRS worksheets, over several evenings. The engine was wrong, not the baseline. But to an optimizer minimizing the "tests failing" metric, adjusting the JSON was a one-line change and fixing the engine was a ten-line investigation. Both make the red turn green.

I wrote one line into CLAUDE.md that session:

tests/midupperclass2M_baseline.json and tests/upperclass5M_baseline.json are locked. Do not modify test data without explicit user instruction.

That single instruction, read at the start of every session thereafter, prevented the pattern from recurring. Not because the AI is "obedient" in some moral sense, but because now the path of least resistance is to fix the engine — the one thing it's allowed to touch.

Three Tools That Actually Work

Lock files explicitly. Any file the AI might "helpfully" change to make something else pass needs to be named and locked in CLAUDE.md. Baselines, golden outputs, expected snapshots — all of them. The AI reads CLAUDE.md at the start of every session and applies the rule without being reminded.

Validate against an external source. For FIRE51's tax engine, every output runs through PolicyEngine — an independent open-source tax model used by policy researchers at the Brookings Institution and elsewhere. The AI cannot adjust an external fact source to make numbers agree. Either the engine matches PolicyEngine to within tolerance, or it doesn't, and the delta points at the error.

Four real bugs, all caught by this validator, none visible by reading the code:

NIIT MAGI excluded capital gains (understated NIIT for large stock withdrawals)
SS provisional income excluded capital gains (understated SS taxability)
Age-65 additional standard deduction missing
California SS exclusion missing

If you're working in a domain with external ground truth — tax law, physics, financial regulation, cryptography — validate against it. The AI's output will look correct. It may not be.

Lock output formats. Also in CLAUDE.md:

The output format of run_simulation.ts is frozen for V1. Do not modify the output schema without explicit user instruction.

The report renderer, PDF generator, and CSV export all depend on that shape. A well-intentioned "simplification" of the output would silently break four separate consumers. The lock prevents it.

The Mental Model

Short sessions, frequent commits: every commit is a checkpoint, and a quick diff review catches any AI drift before it gets buried under later work.

CLAUDE.md is not a constitution or a style guide. It's the set of invariants the AI needs to re-read every session because its memory doesn't carry them. What cannot change. What cannot be touched. What the ground rules are.

Without it, every session starts from first principles and the AI rediscovers constraints by trial and error — sometimes by "successfully" completing a task in a way that breaks something else. With it, the constraints carry forward automatically, and the AI's optimization pressure gets redirected onto the part of the problem you actually want solved.

The 65% improvement in Claude 4 is real and welcome. But the other 35% is your job.

FIRE51 is a retirement planning tool built entirely via vibe coding. This is the fifth post in the Vibe Coding series.