When you ask Claude to execute a plan, how faithfully does the implementation track it?

Question

Kalle Lamminpää · Accepted Answer

A captured claude --print session executed the 196-line plan from the prior plan-mode walkthrough, anchored by a prompt that asked for honest deviations as comments rather than silent rewrites. The result was 16 tool calls, 42/42 tests green, and exactly one PLAN-FIX comment in the new test file documenting a real plan defect. The setup The plan from scenario #33 has been sitting at ~/.claude/plans/add-a-retry-mechanism-compiled-crayon.md. The demo's notifications service has not been touched since (no retry logic, no test file). Tests were 19/19 green from prior scenarios. The plan-then-execute workflow is two claude --print sessions back to back: Plan session (scenario #33): --permission-mode plan, prompt asking for retry logic + DLQ. Output: a 196-line plan file at ~/.claude/plans/-.md. Project unchanged. Execute session (this article): default permission mode, prompt referencing the plan path. Output: actual code changes that satisfy the plan. The two-session model decouples design review from implementation review. You read the plan, push back on it, then run execute. If you disagree with the plan, you do not pay for the implementation tokens. The prompt Two clauses are doing work: "Execute it exactly as written" anchors Claude to the plan as the authority. Without this, Claude might have re-derived the design from scratch. "If the plan is wrong about something, fix the plan in a comment" invites honest deviation. Without this, Claude might silently change course and you would never know which parts of the plan held up. What Claude actually did, in 16 tool calls The tool-call census from the events file: 9 Read calls (the plan, the existing types/service/templates, the shared time/logger modules, adjacent test files for pattern matching) 1 Edit (small change to types.ts to add the two optional fields) 2 Write calls (full rewrite of service.ts; new service.test.ts) 4 Bash calls (npm test, npm run typecheck, two diagnostic re-runs) Sixteen tool calls is roughly a third of the plan-mode session's 49. The cost of executing a plan is much lower than the cost of producing one because the design exploration (surveying the codebase, asking clarifying questions, debating tradeoffs) has already been paid. What the plan got, exactly Type changes matched the plan one-for-one: The class internals match the plan's pseudocode line for line. Constructor options: The due() accessor implementation is the plan's pseudocode verbatim. The DLQ is a separate private dlq: Notification[] = [], copied (return [...this.dlq]) for external safety. The backoff formula is this.initialRetryDelayMs 2 (attempts - 1). Error messages match the plan's strings exactly. The 23 test cases all landed. Test names track the plan's enumerated list. The full suite is 42/42 green: 19 existing + 23 new, exactly as the plan promised. The one deviation, surfaced as a PLAN-FIX comment At the top of the new service.test.ts, after the imports: (Punctuation lightly normalized; wording, structure, and the PLAN-FIX label are verbatim.) This is the article's hero artifact. The plan's test-infrastructure section recommended now.mockReturnValueOnce(new Date('...')) to advance the clock per step. Claude executed against the plan, hit a real test that needed two now() reads in sequence (because pre-DLQ markFailed writes nextRetryAt AND the logger writes a timestamp on the warning), and found that a one-shot mock would silently fall back for the second call. The deviation is not a "Claude knew better and rewrote the plan" move. It is "Claude tried the plan, the plan did not work, Claude documented the failure and a small fix that preserves the plan's intent (advance the clock) without breaking on the two-call pattern." The deviation is six lines of comment in code review, not a hidden change. Why this is the workflow you want for high-blast-radius changes The two-session model gives you four review surfaces: The plan itself (post scenario #33). Read 196 lines; push back on design choices; rerun plan mode if you disagreed strongly. The plan's "Decisions" table. Five rows of "we picked X because Y". You can audit each row against your own constraints. The execution diff. Standard code review on the actual change; but you also have the plan to compare against. Drift is visible. The PLAN-FIX comments. Any honest disagreement Claude found while executing. These are the most valuable artifact in the diff because they are the parts where reality pushed back on design. A single execution-only session would have given you only surface #3. You would have to infer the design from the diff. That is harder, and the parts where the design was wrong would not be marked. Footguns The "do not deviate, document deviations" framing has to be in the prompt. Without it, Claude has implicit license to improve the plan as it goes. Most of the time that is fine, but on the cases where Claude's improvement is wrong (or where the plan was wrong but you wanted to know about it), silent deviation costs you the review surface. Why this matters: the prompt above is six sentences and four of them are plan-anchoring instructions. Treat plan-anchoring as load-bearing prose; do not abbreviate it. The plan is a snapshot of design intent at a moment. If commits land between the plan and the execute, the execute session has to decide whether to follow the (now slightly stale) plan or adapt to the new state. In this run the gap was a few minutes; in a real workflow it could be hours. Why this matters: when you commit a plan to source control or paste it into a PR, include the commit hash or branch state at plan time. The execute session can then verify it is operating on the same tree. Plan-then-execute is two bills, not one. The plan-mode session cost 49 tool calls; the execute session cost 16. Total 65 calls vs maybe 35-40 for a single execute-only session. The savings are not in tokens; they are in review fidelity (you can reject a plan in 10 minutes; rejecting a 600-line PR takes 60 minutes). Why this matters: do not adopt this workflow because it is "more efficient". Adopt it because it makes design review possible. The plan-mode artifact has a whimsical filename. ~/.claude/plans/add-a-retry-mechanism-compiled-crayon.md. The execute session can find it because we pasted the path into the prompt; without the path, Claude would have to list ~/.claude/plans/ and guess. Why this matters: copy the plan path the moment plan mode finishes. The whimsical-name lookup later is a real annoyance. The execute session re-reads the codebase from scratch. Claude does not have the plan-mode session's accumulated context; it sees the plan file plus whatever it Reads itself. This is by design (each --print invocation is its own context), but it means the execute session has to reconstruct things the plan-mode session already learned. Why this matters: a plan that names file paths and line numbers is more valuable than one that hand-waves. The plan from #33 named src/index.ts:11, which let the execute session verify backwards compatibility without re-deriving the location. A plan can be self-consistent and still wrong about external reality. The mockReturnValueOnce recommendation was internally consistent (it advances the clock), but it failed to anticipate that markFailed's logger call would consume a second clock read. The plan-mode session's 30 reads did not catch this because the test infrastructure pattern was inferred from logger.test.ts, where the call shape was different. Why this matters: plans are predictions, not specifications. The execute session is where the prediction gets tested. Welcome the PLAN-FIX comments. When to use plan-then-execute Wide-blast-radius changes. A refactor that touches multiple files and changes invariants. The two-pass review (design first, implementation second) catches different classes of bugs. Cross-session handoffs. Plan today, execute tomorrow. The plan file is the durable artifact. High-stakes irreversible decisions. New service shapes, schema migrations, public API changes. The plan is the place to commit; the execution should not surprise. Cross-team review. A reviewer from another team can review the plan in 10 minutes and catch architectural mismatches without learning your codebase. Compliance contexts. Some teams need a written design before code lands. Plan mode produces one as a side effect of the same workflow. When NOT to use plan-then-execute Trivial changes. A typo fix, a renamed variable, a one-line guard. The plan is a tax. Exploratory work. When you are not sure what the right design is, plan mode produces a plan you do not trust. Run execute mode and treat the diff as the proposal. You are not going to review the plan. If the plan goes straight to execute without anyone reading it, you have spent tokens on a artifact nobody used. Run execute directly. The work is bounded by tests, not by design. "Make this failing test pass" is a tighter scope than "design a retry mechanism". Plan mode adds noise on a constrained task. Plan and execute will run in immediate succession with no review break.* If you are not pausing to read the plan, the workflow's value is gone. Run execute mode with a longer prompt instead.

When you ask Claude to execute a plan, how faithfully does the implementation track it?

The setup

The prompt

What Claude actually did, in 16 tool calls

What the plan got, exactly

The one deviation, surfaced as a `PLAN-FIX` comment

Why this is the workflow you want for high-blast-radius changes

Footguns

When to use plan-then-execute

When NOT to use plan-then-execute

Sources

Read more