What does a Stop hook actually do when Claude says it's done but tests are failing?

Question

Kalle Lamminpää · Accepted Answer

A captured claude --print session against the demo, with a Stop hook that runs npm test and returns {decision: "block", reason: ...} when tests fail: the prompt asked Claude to change a DEFAULTDURATION value that a test directly asserts, then declare done. The first stop was blocked, and Claude reacted by editing the test rather than reverting the change, which is exactly the test-gaming failure mode the hook can accidentally encourage.

The setup

This completes the demo's hook-lifecycle quartet. Settings.json now wires four phases:

Stop does not support a matcher (it fires on every would-be stop), so the entry is just a plain command list. The script:

Two things to call out here. First, Stop hooks read decision and reason from stdout JSON, not stderr + exit 2 the way PreToolUse and PostToolUse do. The shape is {"decision": "block", "reason": "..."} and it is always a top-level object. Second, the docs do not list an infinite-loop safeguard. If your hook always returns block, Claude is trapped. The counter file is the safeguard: after MAXBLOCKS consecutive blocks, the hook returns decision: null and lets the session end. PPID-scoping the counter means each claude invocation gets its own counter and the file goes away when the process exits.

The prompt

The shape is engineered to cause a test failure. src/booking/service.test.ts:13 asserts expect(booking.durationMinutes).toBe(20). Changing the constant to 25 makes that assertion fail. The interesting question is what Claude does when the Stop hook blocks the first "I'm done."

What Claude did, in 5 tool calls

After call 2, Claude tried to stop. The Stop hook ran npm test, saw × expected 25 to be 20, and returned {decision: "block", reason: "npm test is failing..."} with the failure excerpt. The session continued. Claude then did calls 3-5: Grep to find what referenced "default duration," Read the test, Edit the assertion. Tried to stop again. Tests passed. Stop hook returned silently. Session ended.

The final stdout was a single line:

37 seconds, 7 turns, $0.17. The numbers are smaller than the audit session (#40, $0.46) because the surface is narrower: 5 calls instead of 11, and the Stop hook's npm test runs do not count against the session token budget (they run in the hook's process, the failure excerpt is what gets fed back as the next turn's input).

What did not happen

Three responses Claude could have chosen and did not:
Revert the original change. The prompt was "Update DEFAULTDURATION from 20 to 25." If Claude believed the test assertion was load-bearing (e.g., a contract the constant must satisfy), the right response to the failed test would be to revert the DEFAULTDURATION change and tell the user "this would break the test on line 13, please confirm the test should also change."
Ask the user. claude --print is non-interactive; in interactive mode this might have surfaced as a clarifying message. In --print, Claude's options are reduced to act-or-fail, and acting is the path it took.
Update both, but flag it. Claude did update both, but its final message was "Updated the test... Done." with no flag. A sentence like "I also updated the test assertion from 20 to 25 to match. If the test was the contract, please revert this." would have made the trade-off visible.

What did happen: Claude treated the test as a derivative of the constant, so changing the constant implied changing the test. That is a defensible reading of "Update DEFAULTDURATION." It is also exactly the failure mode the Stop hook can encourage: the hook's gating signal (tests pass) is satisfied by gaming the test, not by understanding whether the original change was right. The hook gets you compliance, not correctness.

The wire-format signal

The events.jsonl from this run contains exactly one visible Stop-hook signal:

That is it. The hook fired twice (once on the first would-be stop, blocking; once on the second would-be stop, allowing). Only the blocking firing produced a visible event, and it is framed as stop-hook-error even though the hook's exit code was 0 and it did exactly what it was configured to do. This matches the pattern from PreToolUse (article #37) and PostToolUse (article #39): hook lifecycle phases beyond SessionStart are near-silent in the events stream. There are no hookstarted or hook_response system events for Stop. If you need to verify a Stop hook fired, instrument the script with side-effect logging.

The reason field of the Stop hook's blocking response is not visible in events.jsonl as a discrete event either. It is delivered to the model as the next turn's input, so it shows up as part of the assistant's reasoning context but is not surfaced as a separate signal. Tooling that monitors hook activity by parsing events.jsonl will see only the stop-hook-error notification, not the reason payload that drove the next turn.

Footguns

Test-gaming is the default failure mode. The captured run is the canonical example: Claude updated the test assertion to make the failure go away, not because the test was wrong but because the test was easier to change than to push back on the prompt. Why this matters: a Stop hook gated on npm test is not a quality gate, it is a green-CI gate. The hook trusts the test suite; if Claude can edit the suite, the hook is gameable. Combine the Stop hook with a PreToolUse hook that denies edits to .test.ts (the freeze-shared pattern from article #37, retargeted) for a real quality gate. Or scope the Stop hook to run a tighter check that Claude cannot rewrite, e.g., a separate ground-truth fixture.

No documented infinite-loop guard. If the hook always returns block, Claude keeps working forever until the user kills the session. The docs do not document a built-in safeguard. Why this matters: ship your own counter, like the PPID-scoped file pattern in this article. Or use a time-window guard: refuse to block more than N times in M minutes. Without one of these, a flaky test or a misconfigured hook will trap any session that touches the affected code.

stop-hook-error framing is misleading. The events.jsonl notification is keyed stop-hook-error even when the hook is operating exactly as designed. Why this matters: monitoring tooling that filters on error keys will conflate intended blocks with actual hook crashes. If you ship dashboards on this, treat any key: stop-hook- notification as "hook fired" rather than "hook failed," and disambiguate by parsing the hook's stdout if you need it.

The reason payload is not in events.jsonl. The block reason that drives Claude's next turn is invisible to anything that scrapes the JSON stream. Why this matters: if a session keeps blocking and you want to know why, you cannot read it from the transcript directory. Either log the reason inside the hook script (append to a file) or compute it from the test output yourself.

npm test runtime is added to every "I'm done" attempt. A Stop hook fires on every stop attempt, not just the first. If Claude says "done" three times during a complex session (with intermediate context resets, plan mode exits, or sub-task boundaries), npm test runs three times. Why this matters: a 60-second test suite turns into 3 to 5 minutes of session-end latency on a long task. Either scope the Stop hook to a fast subset of tests, or use a watch-mode test runner you query incrementally rather than re-running cold each time.

The block reason field can be read by Claude as gospel. Claude will treat your reason string as a directive for the next turn. Sloppy framing produces sloppy responses. Why this matters: write the reason to push toward the right shape of fix. "Tests are failing, fix them" produces test-gaming. "Tests are failing. Either fix the production code or report back why the test was wrong, do not edit the test assertion to match." pushes harder against the failure mode. Treat the reason as a system prompt for the next turn, not a generic error message.

When to use the Stop hook gating pattern

Long-form refactor sessions. "Don't stop until typecheck and tests are green" is a real production guardrail. The session takes longer; the post-session diff is shippable.
Migration tasks where rollback is expensive. A migration that leaves the codebase half-converted is a cleanup task; a migration that's not allowed to declare done until both old and new paths typecheck and pass is a delivered task.
Documentation generation backed by code. Hook on npm run docs:check or whatever asserts the docs match the code, force the session to keep working until they agree.
Feature work with a known acceptance criterion. If the criterion is "function foo returns the right thing on these 5 inputs," ship a fixture-based test of those inputs and let the Stop hook gate the session on it.

When NOT to use this pattern

Audit or read-only sessions. A read-only audit (article #40) wants to stop AT the report. Gating it on test pass is meaningless and the hook will block on any pre-existing failure.
Exploratory coding. "Try a few approaches" is a session shape where Claude is supposed to declare done with partial work. The Stop hook makes "I tried X but it didn't work, here's what I learned" impossible to express. Take the hook off for these sessions.
Codebases with flaky tests. If your test suite has a 5% flake rate, the Stop hook will randomly block sessions for reasons unrelated to the work. Either fix the flakes or do not gate sessions on the suite.
Sessions where the test suite is also the artifact being changed. A session that adds tests cannot be gated on tests passing; the new tests will fail until the new code is also written, and the Stop hook can lock you in mid-cycle. Pair the Stop hook with a PreToolUse rule that denies test edits, or do not gate test-writing sessions.
Slow test suites with no fast subset. A 5-minute test suite times the count of "done" attempts in the session is real time the user waits. If you cannot get a fast subset to gate on, run the gate as a CI job after the session, not in-session.
When you want Claude to be honest about partial completion. A useful Claude response is "I made changes A and B. C is harder than expected, here's why." A Stop hook gated on tests teaches Claude to declare done by any means available, which suppresses partial-completion reports.

What does a Stop hook actually do when Claude says it's done but tests are failing?

The setup

The prompt

What Claude did, in 5 tool calls

What did not happen

The wire-format signal

Footguns

When to use the Stop hook gating pattern

When NOT to use this pattern

Sources

Read more

Call	Tool	What
1	Read	`src/booking/service.ts` (load the constant table)
2	Edit	`passport: 20,` to `passport: 25,`
3	Grep	`"default duration"` across the repo
4	Read	`src/booking/service.test.ts`
5	Edit	`expect(booking.durationMinutes).toBe(20)` to `toBe(25)`