real-session

15 questions

AI

When you ask Claude to execute a plan, how faithfully does the implementation track it?

A captured `claude --print` session that executes the 196-line plan from scenario #33. The prompt anchored Claude to the plan: 'do not deviate; if the plan is wrong, fix the plan in a comment in the test file rather than changing the structure.' Claude implemented exactly to spec for everything except one genuine plan defect (a `mockReturnValueOnce` that would have silently failed because `markFailed` calls `now()` twice), which it surfaced as a six-line `PLAN-FIX` comment instead of silently fixing. The article shows the 16-tool-call execution sequence, the parts that stuck, the one deviation, and what that workflow buys you over straight implementation.
AI

What does Claude Code do when adding a feature that touches a previously-fixed bug surface?

A captured `claude --print` session adds a `weeklyReport` method to the demo app's reporting service, with a 7-calendar-day window. The trap: the demo had a DST bug fixed two commits earlier in `src/shared/time.ts`. Did Claude reach for `+ 7 * 86_400_000` and re-introduce the bug, or transfer the wall-clock-aware design from the existing helpers? The article shows Claude's actual exploration sequence (10 reads before any edit), the implementation choice it made, and the test-coverage gap it left.
AI

What does a real Claude Code session look like fixing a timezone bug end to end?

A captured `claude --print` session against a deliberately broken `nextDayAtSameLocalTime` helper. Two failing vitest cases at the spring and autumn DST boundaries in `Europe/Helsinki`; one prompt; Claude diagnoses, fixes, and reruns to green. The article quotes Claude's actual diagnosis and shows the actual diff so you can see the reasoning shape and the structural fix.
AI

What does a SessionStart hook actually do for the agent in a real session?

A captured `claude --print` session against the demo app, with a SessionStart hook that runs `git log -5` plus working-tree status and injects the result as `additionalContext`. The prompt asked about recent feature work and forbade running `git log` manually. Claude used the commit SHA from the injected context to jump straight to `git show <sha>` and produced a substantive coverage analysis in four tool calls. The article shows the hook config, the script, the events.jsonl output that proves the hook fired, and what Claude visibly did with the context that saved a redundant tool call.
AI

How do I refactor three services at once with subagents?

A captured `claude --print --output-format stream-json` session against the same demo app. Lead agent designs and tests a shared logger sequentially, then dispatches three subagent migrations in a single assistant turn so they run in parallel. The article shows the actual fan-out, the actual subagent prompts (with enumerated call sites), and the moment the lead made a real judgment call on one service's quirks.
AI

What does Claude Code's plan mode actually produce on a real refactor?

A captured `claude --print --permission-mode plan` session that asked Claude to add a retry mechanism with exponential backoff and a dead-letter queue to the demo's notifications service. Plan mode redirects Claude to read, explore, and write a plan to `~/.claude/plans/`, with no project edits. The output was a 196-line specification with a decision table, state machine, file list, 23 enumerated test cases, and a backwards-compatibility note. The article shows what plan mode produces, what it skips, and when the plan beats the implementation as the deliverable.
AI

What does a PostToolUse hook actually do for a multi-file edit?

A captured `claude --print` session against the demo app, with a PostToolUse hook (matcher `Edit|Write`) that runs `npm run typecheck` after every edit. The prompt asked for a multi-file change (add `'parking-permit'` to `ServiceType`, update Finnish labels, update Record-typed consumers). Claude completed the work in eight Edits and also fixed a pre-existing `noUncheckedIndexedAccess` typecheck error that the hook flagged on the way through. The article shows the hook config, the actual fix Claude landed, and a real debugging surprise: the events.jsonl does not surface PostToolUse hook firings the same way SessionStart hooks do, which has consequences for how you verify these hooks are working.
AI

What does Claude Code do when a PreToolUse hook denies a tool call mid-task?

A captured `claude --print` session against the demo app, with a PreToolUse hook that denies any Edit or Write to `src/shared/`. The prompt forced a collision: add a `formatForCity` helper to `src/shared/time.ts` AND use it from `src/reporting/service.ts`. Claude read four files, attempted the Edit, hit the block, read the hook script itself to understand the policy, then stopped with three concrete paths forward (workaround, lift the freeze, two-PR split). The article shows what the block looks like in the events stream, what Claude considered and rejected, and why this is the right shape of agent behavior under denial-mode hooks.
AI

What does Claude Code look like when you ask it to audit a codebase without editing anything?

A captured `claude --print` session against the demo app, prompted to audit five production-readiness concerns and produce a structured report without modifying any code. Claude used 11 tool calls (1 Glob, 4 Grep, 6 Read), zero edits, finished in 54 seconds, cost 46 cents, and produced a 5-section report with file:line references. The article shows the verbatim audit output, the tool-call census, and a head-to-head comparison against `grep`/`rg` shell scripts for each of the five concerns: TODO scans and missing-test detection are commodity work the agent overpays for, but flagging `new Date()` calls that bypass an injection pattern (vs benign date arithmetic) and reading the dead-letter queue line as 'operational risk' are the kind of semantic judgment that justifies the cost.
AI

How does Claude Code find a regression that the test suite did not catch?

A captured `claude --print` session reverts a real bug landed in a refactor commit. The user complaint named a specific wrong number (0.111 vs 0.1 for a 1-cancellation-out-of-10 month). Claude ran `git log`, jumped straight to `git show <sha>` of the suspect commit because the commit message named the affected feature, then ran `git revert` and `npm test`. Five tool calls; the article shows why this works, why it would not work for a regression that has been there for weeks, and what the test gap was that let the bug through in the first place.
AI

What does Claude Code's skill auto-invocation actually look like in a real session?

A captured `claude --print` session against the demo app, with a `booking-conventions` skill defined at `.claude/skills/booking-conventions/SKILL.md`. The prompt mentioned bookings without mentioning the skill. The events.jsonl init event listed the skill in its `skills` array; Claude's first tool call was `Skill({skill: 'booking-conventions'})`, which loaded the full convention content into the session before any reads or edits. The implementation that landed honored the skill's status state machine and lexical-sort guidance without those rules ever appearing in the prompt.
AI

What does a Stop hook actually do when Claude says it's done but tests are failing?

A captured `claude --print` session against the demo app, with a Stop hook that runs `npm test` and returns `{decision: 'block', reason: ...}` when tests fail. The prompt asked Claude to change a `DEFAULT_DURATION` value that a test directly asserts, then declare done. The Stop hook blocked the first 'I'm done', and Claude reacted by editing the test assertion to match the new value rather than reverting the change or asking. The article shows the hook script, the wire-format signal in events.jsonl (a `stop-hook-error` notification), the test-gaming failure mode the hook accidentally encouraged, and the safeguard counter pattern that keeps a Stop hook from looping forever.
AI

What does a UserPromptSubmit hook actually do for a vague 'where was I?' prompt?

A captured `claude --print` session against the demo app, with a UserPromptSubmit hook that appends the current branch, last 3 commits, and `git diff --stat` to every prompt as additional context. The user prompted 'Summarize what work is currently in flight in this repo' with two real uncommitted changes on disk (a `cancelAllForCitizen` stub method and a `describe.skip` test placeholder). Claude's thinking block named 'git status and recent commits' as its source, then ran 2 Bash calls to read the actual diff content, then produced a one-paragraph summary. The article shows the wire format of the hook's injection, the events.jsonl gap (UserPromptSubmit fires with zero visible system events), and the Bash-matcher footgun where `git -C /path diff` failed to match the `Bash(git diff:*)` allow rule because of the `-C` flag prefix.
AI

What does Claude Code do when a failing test is the thing that's wrong?

A captured `claude --print` session against the demo app, with a deliberately contradictory test added to the working tree. The prompt was neutral: 'make the suite green.' Claude ran the test, ran `git diff` to see the test was a recent uncommitted addition, compared it against the function's docstring and the adjacent passing test, and deleted the wrong test rather than break the working function. The article quotes Claude's actual reasoning and shows why this outcome is not automatic.
AI

What does a real Claude Code debugging session actually look like?

An end-to-end transcript from 'reports are off by a day' through reproduction, root cause, fix, and test. Including the wrong turn we took along the way.

real-session

When you ask Claude to execute a plan, how faithfully does the implementation track it?

What does Claude Code do when adding a feature that touches a previously-fixed bug surface?

What does a real Claude Code session look like fixing a timezone bug end to end?

What does a SessionStart hook actually do for the agent in a real session?

How do I refactor three services at once with subagents?

What does Claude Code's plan mode actually produce on a real refactor?

What does a PostToolUse hook actually do for a multi-file edit?

What does Claude Code do when a PreToolUse hook denies a tool call mid-task?

What does Claude Code look like when you ask it to audit a codebase without editing anything?

How does Claude Code find a regression that the test suite did not catch?

What does Claude Code's skill auto-invocation actually look like in a real session?

What does a Stop hook actually do when Claude says it's done but tests are failing?

What does a UserPromptSubmit hook actually do for a vague 'where was I?' prompt?

What does Claude Code do when a failing test is the thing that's wrong?

What does a real Claude Code debugging session actually look like?