What does Claude Code look like when you ask it to audit a codebase without editing anything?

Question

Kalle Lamminpää · Accepted Answer

This is a captured claude --print session against the demo, prompted to audit five production-readiness concerns and produce a numbered report without modifying any code. Claude used 11 tool calls (1 Glob, 4 Grep, 6 Read), made zero edits, finished in 54 seconds, and produced a structured report with exact file:line references. This article compares the agent head-to-head against a shell-script alternative for each of those five concerns and shows where the agent's $0.46 actually earned its keep.

The prompt

The "DO NOT modify" constraint is in the prompt twice. It is not enforced by a hook; the demo's settings.json still allows Edit() and Write(). The constraint is the prompt's job. The captured permission_denials array on the result event is empty, and the working tree was clean on exit, so Claude obeyed.

What Claude did, in 11 tool calls

The tool-call census (the Calls / Tool / Used for table further below) is the entire story.

The four greps did the bulk of the work. The reads were used to disambiguate grep matches: which new Date() calls are "current time" (bad) versus pure date arithmetic on a parsed instant (fine), which throw new Error is the canonical project pattern versus genuinely untyped, which Map field is held privately on a service vs a transient closure.

The audit report, verbatim

This is the captured stdout, with em-dashes lightly normalized to commas:

Concern-by-concern: agent vs shell script

For each of the five concerns, here is what a shell-script alternative would have produced and how the agent did against it.

1. Files without a co-located test. The shell version is one pipe:

That gives 6 results. The agent gave 3, with types.ts files correctly excluded as "pure type definitions need no runtime test." That exclusion is judgment a script does not have. It is also wrong-shaped for a codebase where types files contain runtime guards or zod schemas. Worth knowing: the agent's category-collapse is an opinion, not a fact.
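The co-location check itself is simple enough to sketch. The version below is not the pipe from the session; it is a minimal TypeScript illustration that assumes a test for `foo.ts` lives next to it as `foo.test.ts` (the naming convention is an assumption), and the file list is hypothetical. The `excludeTypeFiles` switch makes the agent's judgment call an explicit, reviewable option.

```typescript
// Sketch only: assumes the test for src/x/foo.ts lives at src/x/foo.test.ts.
// That naming convention is an assumption, not something the article states.

interface Options {
  // When true, skip files ending in types.ts (the agent's judgment call).
  excludeTypeFiles: boolean;
}

function filesWithoutColocatedTest(paths: string[], opts: Options): string[] {
  const all = new Set(paths);
  return paths
    // Keep source files only; test files are not candidates themselves.
    .filter((p) => p.endsWith(".ts") && !p.endsWith(".test.ts"))
    // Optionally apply the "pure type definitions" exclusion.
    .filter((p) => !(opts.excludeTypeFiles && p.endsWith("types.ts")))
    // A file passes the check when its sibling test file exists.
    .filter((p) => !all.has(p.replace(/\.ts$/, ".test.ts")))
    .sort();
}

// Hypothetical file list, for illustration only.
const sample = [
  "src/booking/service.ts",
  "src/booking/service.test.ts",
  "src/booking/store.ts",
  "src/booking/types.ts",
];

console.log(filesWithoutColocatedTest(sample, { excludeTypeFiles: false }));
console.log(filesWithoutColocatedTest(sample, { excludeTypeFiles: true }));
```

Making the exclusion a flag is the point: the script has no opinion about types.ts until someone writes one down.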

2. TODO scans. The shell version is grep -rn 'TODO\|FIXME\|XXX\|HACK' src/. It returns nothing. The agent ran the same regex and reported "None found." The shell version costs nothing and finishes in 30 ms; the agent costs $0.46 and a Grep round-trip. This concern alone is pure overpaying, but it is a small fraction of the total prompt, so the per-concern accounting matters less than the bundle accounting. Still, if all you wanted was concern #2, do not call Claude.

3. new Date() bypassing the now() pattern. This is the concern that flips the value calculation. The shell version is grep -rn 'new Date(\|Date.now(' src/, which returns 7 hits across 4 files. Four of those seven are not what you care about: addDays(parseInstant(iso), 30) constructs a new Date to do arithmetic, but the time it represents is a parameter, not "wall clock now." The agent correctly classified 3 hits as bypasses (with the now() injection used elsewhere as the corroborating evidence, citing the line numbers where now() IS used) and 4 as benign with explanations. A regex script cannot do that without an AST visitor that knows what parseInstant returns. This is the work that earns the cost.
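To make the bypass-versus-benign distinction concrete, here is a minimal sketch. The names now, addDays and parseInstant come from the article; their signatures and bodies, along with isExpired and isExpiredBypass, are assumptions written for illustration and are not the demo's code. Note that every function below contains `new Date(`, so a regex sees them all as the same kind of hit.

```typescript
// Sketch only: signatures and bodies are assumed, not taken from the demo.

type Now = () => Date;

function parseInstant(iso: string): Date {
  // Benign: the instant is supplied by the caller, not read from the clock.
  const d = new Date(iso);
  if (Number.isNaN(d.getTime())) throw new Error(`invalid instant: ${iso}`);
  return d;
}

function addDays(d: Date, days: number): Date {
  // Benign: constructs a Date purely to do arithmetic on an existing instant.
  return new Date(d.getTime() + days * 24 * 60 * 60 * 1000);
}

// Bypass: reads the wall clock directly, so a test cannot control "now".
function isExpiredBypass(expiresIso: string): boolean {
  return new Date().getTime() > parseInstant(expiresIso).getTime();
}

// Follows the pattern: "current time" arrives through an injected now().
function isExpired(expiresIso: string, now: Now): boolean {
  return now().getTime() > parseInstant(expiresIso).getTime();
}

const fixedNow: Now = () => new Date("2024-01-31T00:00:00Z");
const expiry = addDays(parseInstant("2024-01-01T00:00:00Z"), 30).toISOString();

console.log(isExpired(expiry, fixedNow)); // deterministic under a fixed clock
console.log(isExpiredBypass(expiry)); // depends on when you run it
```

Telling isExpiredBypass apart from addDays requires knowing where the constructor's argument comes from, which is the classification the agent did by reading the surrounding code.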

4. In-memory state on services. The shell version is something like rg 'private (rows|queue|dlq|store)\s[:=]' src/, but you are basically guessing the variable names. The agent read four files and reported three findings, with the dlq flagged as "operational risk" because losing the dead-letter queue on restart is qualitatively worse than losing the in-flight queue. That qualitative call is the article. A grep would have surfaced the lines without ranking them; the ranking is what an on-call engineer actually wants.

5. Untyped error throws. The shell version is rg 'throw new Error\(' src/, which gives 10 hits. The agent gave the same 10, but added the kicker line: "Callers cannot distinguish 'not found' from 'in DLQ' from validation errors without string matching on error.message." That sentence is the audit. The grep gives you the count; the agent gives you the consequence.
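What "callers cannot distinguish" means in practice can be sketched as follows. DomainError, ErrorKind and toStatus are hypothetical names invented for this illustration, and the status-code mapping is arbitrary; the demo's actual error design is not shown in the capture.

```typescript
// Sketch only: every name here is hypothetical, not the demo's.

type ErrorKind = "not_found" | "in_dlq" | "validation";

class DomainError extends Error {
  readonly kind: ErrorKind;

  constructor(kind: ErrorKind, message: string) {
    super(message);
    this.name = "DomainError";
    this.kind = kind;
  }
}

// With a typed error the caller branches on a field, not on message text.
function toStatus(err: unknown): number {
  if (err instanceof DomainError) {
    switch (err.kind) {
      case "not_found":
        return 404;
      case "in_dlq":
        return 409;
      case "validation":
        return 400;
    }
  }
  // A plain `throw new Error(...)` lands here: the only way to tell the
  // cases apart would be string matching on err.message.
  return 500;
}

console.log(toStatus(new DomainError("in_dlq", "job is in the DLQ")));
console.log(toStatus(new Error("job is in the DLQ")));
```

Both thrown values carry the same message, and only the typed one gives the caller something stable to branch on.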

The bundle math: concerns 1, 2, 3 (file co-location, TODO, new Date) are 70% findable by find plus a couple of greps. Concerns 4 and 5 surface as grep matches, but the value comes from the agent ranking and explaining them: DLQ flagged as "operational risk," untyped errors framed as "callers cannot distinguish." A shell script lists; the agent ranks and explains. If you only need a list, do not pay; if you need the second sentence, pay.

Numbers from the events.jsonl

The result event carries this:

54 seconds, 12 turns, 46 cents. The 360k cache-read tokens are the demo's small surface (16 source files, ~440 lines across the read set) prefixed with the standard system prompt and tool definitions, hot-cached from the SessionStart hook's git context injection that ran on session boot. On a first-touch session against a 500-file codebase, expect 5 to 10x more cache creation, a multi-minute duration, and several dollars in cost. The Glob would still be one call, but the Read budget grows roughly with the file count it judges worth disambiguating. Plan accordingly.
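Pulling those headline numbers out of an events.jsonl capture can be sketched like this. The field names duration_ms, num_turns and total_cost_usd are assumptions about the result event's shape (only permission_denials is named in this article), so check them against a real capture. The sample input is illustrative, built from the rounded figures in the prose rather than from the actual event.

```typescript
// Sketch only: field names other than permission_denials are assumptions.

interface ResultEvent {
  type: "result";
  duration_ms: number;
  num_turns: number;
  total_cost_usd: number;
  permission_denials: unknown[];
}

// Scan a JSONL string and return the last event whose type is "result".
function findResultEvent(jsonl: string): ResultEvent | undefined {
  let found: ResultEvent | undefined;
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue; // tolerate blank lines
    const parsed = JSON.parse(line) as { type?: string };
    if (parsed.type === "result") found = parsed as ResultEvent;
  }
  return found;
}

function summarize(r: ResultEvent): string {
  const seconds = Math.round(r.duration_ms / 1000);
  const cents = Math.round(r.total_cost_usd * 100);
  const denials = r.permission_denials.length;
  return `${seconds} seconds, ${r.num_turns} turns, ${cents} cents, ${denials} permission denials`;
}

// Illustrative input only: rounded from the article's prose, not the capture.
const sampleJsonl = [
  JSON.stringify({ type: "system", subtype: "init" }),
  JSON.stringify({
    type: "result",
    duration_ms: 54000,
    num_turns: 12,
    total_cost_usd: 0.46,
    permission_denials: [],
  }),
].join("\n");

const resultEvent = findResultEvent(sampleJsonl);
if (resultEvent) console.log(summarize(resultEvent));
```

Saving this summary line alongside each audit report gives the periodic-baseline use case a cost trail as well as a findings trail.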

Footguns

The now() heuristic is implicit. The agent treated notifications/service.ts:24 and shared/logger.ts:35 as evidence that the project has a now() injection pattern. It does, but if you wrote a codebase where now() is named differently (clock(), getCurrentTime(), Time.now()), the agent might still anchor on the wrong landmark and either miss bypasses or over-report them. Why this matters: encode the convention in CLAUDE.md or a skill (see the booking-conventions session) so the agent is not inferring it from grep evidence on the fly.

The audit is not a fix. Five concerns surfaced. Six in-memory data structures, three time bypasses, ten untyped errors. The agent did exactly what the prompt asked: report, do not modify. If the goal is to fix, the report is the start of the next session, not the deliverable. Why this matters: a separate session (with the report pasted in or referenced as a file) is the right shape for the fix. Do not ask the audit session to also fix things; the audit prompt's "DO NOT modify" constraint and the fix prompt's "go fix this" instruction belong in different sessions.

Cost scales with read disambiguation, not with grep count. The four greps cost almost nothing in cache-creation. The six Read calls did. If the codebase has 50 files where grep-flagged lines need to be disambiguated, the audit is 8× longer and 8× the cost on the read budget alone. Why this matters: scope the prompt's concerns. "Scan everything everywhere" turns into a Read fan-out that no longer earns its cost. "Audit src/booking/ for X, Y, Z" gives the agent a small enough surface that the per-concern judgment stays cheap.

The agent silently merged "Date.now()" findings with "new Date()" findings. The prompt asked about both. The grep matched both. The report only listed new Date() calls; there were no Date.now() calls in the codebase, but the agent did not say so. A script would have made the absence visible. Why this matters: if the audit is for a compliance trail (someone needs to see "we checked for X and found zero"), explicitly require the agent to list each concern with a "found N" line including zeros. Otherwise the absence-of-finding is invisible.
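The "found N, including zeros" shape is easy to show in code. This sketch counts matching lines per pattern and always emits a line for every pattern, so an absence is reported as an explicit zero. The function names and sample lines are hypothetical; only the two patterns come from the article.

```typescript
// Sketch only: function names and sample lines are illustrative.

// Count the lines matching each labelled pattern. Every label gets an
// entry, so a pattern with no matches is reported as 0, not omitted.
function countPerPattern(
  lines: string[],
  patterns: Record<string, RegExp>,
): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const [label, re] of Object.entries(patterns)) {
    counts[label] = lines.filter((line) => re.test(line)).length;
  }
  return counts;
}

function formatReport(counts: Record<string, number>): string[] {
  return Object.entries(counts).map(([label, n]) => `${label}: found ${n}`);
}

// Use non-global regexes: a /g regex keeps state between test() calls.
const patterns = {
  "new Date(": /new Date\(/,
  "Date.now(": /Date\.now\(/,
};

const sampleLines = [
  "const createdAt = new Date();",
  "return new Date(d.getTime() + ms);",
  "const id = nextId();",
];

for (const line of formatReport(countPerPattern(sampleLines, patterns))) {
  console.log(line);
}
```

The same requirement can be put to the agent in the prompt; the sketch only shows what the resulting report should make visible.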

The "no test file" check excluded type-only files. That exclusion is reasonable for the demo. It is not always reasonable. Some teams mandate a co-located test even on types.ts files (for type-level tests with expectTypeOf, or to lock barrel exports). Why this matters: if your team has a test-everything rule, override the agent's reasonable-by-default exclusion in the prompt: "Include *types.ts files in the no-test-file list."

When the read-only audit shape is worth it

Onboarding to an unfamiliar codebase. The shape doubles as a reading guide: the agent's reads tell you which files matter, the report tells you the shape of the trouble.
Pre-PR review of a large refactor. Run the audit on the branch, paste the report into the PR description as a "known unaudited surface" section.
Periodic baseline audits. Quarterly: same prompt, same five concerns, save the output. Diffing audits over time surfaces drift earlier than a CI lint rule would have caught it.
Concerns that need ranking, not listing. "Which of these in-memory state leaks is operational risk vs cosmetic?" is the question grep cannot answer.

When NOT to use this shape

A single concern that is regex-shaped. "Find all console.log calls" or "find all any types" is rg. Save the 46 cents.
Concerns that need a real type-aware tool. "Find all unused exports" wants ts-prune or knip, not Claude. The agent will produce something that looks right and might be wrong; a tool with a TypeScript program will be authoritative.
CI gates. Audits are advisory by design (the prompt's own constraint is "do not modify"). Wiring an LLM call into CI as a pass/fail is the wrong shape; the cost compounds, the determinism does not, and a flaky audit fails builds for no reason. Use the audit at edit time, not at merge time.
Codebases where you have not encoded conventions. As above with now(): the agent's value comes from anchoring on conventions it can read. If the codebase has none, the audit reads as generic best-practices advice that any engineer could have written without reading your code.
Sensitive code paths under read restrictions. A read-only audit still reads. If part of src/ contains secrets, license-restricted code, or regulated data, audit the sensitive surface separately under a session that has a Read deny rule scoped tight to that subtree.

| Calls | Tool | Used for |
| --- | --- | --- |
| 1 | Glob | `src/**/*.ts` to enumerate the source set |
| 1 | Grep | `TODO\|FIXME\|XXX\|HACK` over `src` |
| 1 | Grep | `new Date\(\|Date\.now\(` over `src` |
| 1 | Grep | `throw new Error\(` over `src` |
| 1 | Grep | `now\s*[:=]\|now\(\)\|now\?:` over `src/shared/time.ts` |
| 6 | Read | `booking/store.ts`, `booking/service.ts`, `reporting/service.ts`, `notifications/service.ts` (×2 with offset/limit), `shared/time.ts` |
| 0 | Edit/Write | (none, prompt forbade it) |
