Claude Code’s Quality Drop Was a Product-Layer Failure, Not a Model Story

    Max Västhav
    anthropic · claude-code · opus-4-7 · ai-agents · evaluation · agent-infrastructure

    TL;DR: Anthropic’s April 23 Claude Code postmortem is a useful reminder for anyone building agents: the model can be fine while the product around it quietly changes the result.

    Claude Opus 4.7 arrived with stronger coding benchmarks, better vision, and a new xhigh effort setting. Then developers started describing something that did not fit the launch story. Claude Code felt less careful. It forgot context. It made odd tool choices. It produced shorter answers at moments when depth mattered.

    Anthropic’s postmortem is interesting because the company did not blame the model weights. It traced the problem to three product-layer changes around Claude Code, the Claude Agent SDK, and Claude Cowork. The API was not affected, and all three issues were resolved by April 20 in version 2.1.116.

    That distinction matters. For production agents, “which model did we choose?” is only one part of reliability. The harness decides how much reasoning the model is allowed to spend, what context survives between turns, which tools it can use, and how much it is encouraged to explain before acting.

    Change those controls, and the same model can feel like a different operator.

    What happened with Claude Code quality?

    Anthropic’s April 23 engineering post says three separate changes combined into what users experienced as a broad Claude Code quality drop.

    First, Claude Code’s default reasoning effort was lowered from high to medium on March 4 for Sonnet 4.6 and Opus 4.6. The goal was lower latency and fewer long waits. Users preferred the deeper reasoning, and Anthropic reverted the change on April 7. Opus 4.7 now defaults to xhigh, while the other models default to high.

    Second, a caching optimization shipped on March 26 was meant to clear old thinking once after a session had been idle for more than an hour. A bug caused it to keep clearing older reasoning on every subsequent turn in that session. Claude could continue using tools, but with less of its own prior reasoning visible. That showed up as forgetfulness, repetition, and strange decisions. Anthropic fixed this on April 10 in v2.1.101. (A sketch of this failure pattern appears after the third change below.)

    Third, on April 16, Anthropic added a system prompt instruction intended to reduce verbosity. The exact line told Claude to keep text between tool calls to 25 words or fewer and final responses to 100 words or fewer unless the task required more detail. Broader testing later showed a 3% drop in coding quality evaluations for Opus 4.6 and Opus 4.7. The line was reverted on April 20.
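    The second of those changes is worth internalizing as a pattern: a cleanup that is supposed to fire once per idle period instead fires on every turn, and the agent keeps losing reasoning it still needs. Below is a minimal sketch of the difference, built around a hypothetical session object; none of this is Anthropic's actual implementation.

```python
# Hypothetical sketch of the compaction-bug pattern, not Anthropic's code.
# Intended behavior: clear old thinking blocks ONCE after a long idle gap.
# Buggy behavior: the "already compacted" flag is never set, so every later
# turn strips prior reasoning again and the agent gradually loses its memory.
import time

IDLE_THRESHOLD_SECONDS = 3600  # "idle for more than an hour"


class Session:
    def __init__(self):
        self.history = []              # alternating user/assistant turns
        self.last_active = time.time()
        self.compacted_after_idle = False

    def maybe_compact(self):
        idle = time.time() - self.last_active
        if idle > IDLE_THRESHOLD_SECONDS and not self.compacted_after_idle:
            self._drop_old_thinking()
            self.compacted_after_idle = True   # the fix: remember we already compacted
        # The buggy version amounted to omitting the flag check and set above,
        # so _drop_old_thinking() kept running on every turn after the first
        # idle gap. (Once-per-session is a simplification for brevity.)

    def _drop_old_thinking(self):
        self.history = [m for m in self.history if m.get("type") != "thinking"]

    def on_turn(self, message):
        self.maybe_compact()
        self.history.append(message)
        self.last_active = time.time()
```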


    Why this matters beyond Claude Opus 4.7

    The useful lesson is not that Claude Code had a rough few weeks. Tools break. Defaults move. Product teams make tradeoffs.

    The lesson is that agent reliability lives in the layer between the model and the work.

    If you run coding agents, support agents, research agents, or internal workflow agents, the harness is part of the agent. It is not plumbing. It shapes behavior.

    A reasoning-effort default can make an agent plan less carefully. A context-compaction bug can make it repeat a mistake it already solved. A prompt line meant to save tokens can cut away the explanation that kept the agent grounded. None of those require a model regression.

    This is why benchmark announcements can be true and still miss the failure mode your team feels in production.

    The practical agent-building takeaway

    If your agent only works when the model, prompt, and runtime all stay perfectly still, it is not ready for real work.

    A better evaluation loop checks the full system:

    • Pin the runtime where possible: Track model version, effort level, tool permissions, system prompt version, memory policy, and context compaction behavior.
    • Measure behavior, not vibes: Count how often the agent reads before editing, repeats a corrected mistake, skips a required tool, or needs human repair (a small counting sketch follows this list).
    • Keep regression tasks from real work: Save failed runs as replayable tests. Synthetic benchmarks help, but production failures are better test cases.
    • Watch token-saving changes carefully: Shorter output is not always cheaper if it creates more retries, review time, and cleanup.
    • Separate model evaluation from harness evaluation: Test model capability through the API, then test the product wrapper that will actually run the work.
    • Roll out prompt and memory changes slowly: Treat them like code changes. Review them, test them, and keep a rollback path.
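    To make "measure behavior, not vibes" concrete, here is a minimal sketch of one such counter computed from replayed tool-call logs: how often the agent edits a file it never read earlier in the same run. The log schema, tool names, and field names are assumptions for illustration, not any particular SDK's format.

```python
# Hypothetical tool-call log schema for illustration; adapt to whatever your
# harness actually records. Each run is an ordered list of tool calls.
from typing import Iterable


def edits_without_read(run: Iterable[dict]) -> int:
    """Count edits to files the agent never read earlier in the same run."""
    read_paths: set[str] = set()
    blind_edits = 0
    for call in run:
        tool = call["tool"]          # e.g. "read_file", "edit_file", "run_tests"
        path = call.get("path")
        if tool == "read_file" and path:
            read_paths.add(path)
        elif tool == "edit_file" and path and path not in read_paths:
            blind_edits += 1
    return blind_edits


def regression_report(runs: dict[str, list[dict]]) -> dict[str, int]:
    """Aggregate the counter over replayed runs, keyed by task id.

    Saved production failures make good keys: replay them after every
    prompt, effort, or memory-policy change and diff this report.
    """
    return {task_id: edits_without_read(run) for task_id, run in runs.items()}


# Example: one replayed run where the agent edited config.py without reading it.
example_runs = {
    "ticket-4821-fix-parser": [
        {"tool": "read_file", "path": "parser.py"},
        {"tool": "edit_file", "path": "parser.py"},
        {"tool": "edit_file", "path": "config.py"},   # never read -> counted
    ],
}
print(regression_report(example_runs))   # {'ticket-4821-fix-parser': 1}
```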

    The main question is not “is Opus 4.7 good?” It probably is, especially for hard coding tasks. The better question is “under which runtime conditions does it stay good?”
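    One way to answer that, and to separate model evaluation from harness evaluation as the list above suggests, is to run the same tasks against the raw API with every runtime knob pinned, so a change in product defaults cannot silently alter the comparison. Below is a minimal sketch using the Anthropic Python SDK and its extended-thinking parameter; the model name, token budgets, prompt, and task are placeholder values, and this is not a substitute for evaluating the full product wrapper.

```python
# Sketch: evaluate the bare model with every knob pinned, so a later change in
# product defaults cannot silently alter the comparison.
# Model name and budgets are placeholders; set ANTHROPIC_API_KEY in the env.
import anthropic

client = anthropic.Anthropic()

PINNED = {
    "model": "claude-opus-4-20250514",                       # explicit model version
    "max_tokens": 8000,                                      # must exceed the thinking budget
    "thinking": {"type": "enabled", "budget_tokens": 4000},  # explicit reasoning budget
}


def run_task(system_prompt: str, task: str) -> str:
    response = client.messages.create(
        system=system_prompt,        # versioned prompt, passed explicitly
        messages=[{"role": "user", "content": task}],
        **PINNED,
    )
    # Only text blocks go into the scored answer; thinking blocks can be logged
    # separately to track how much reasoning was actually spent.
    return "".join(b.text for b in response.content if b.type == "text")


answer = run_task(
    system_prompt="You are a careful coding assistant. (prompt v3, 2025-04-20)",
    task="Explain what a race condition is in two sentences.",
)
print(answer)
```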

    If you’re building agents

    This postmortem is a useful operating manual in disguise.

    Agents do not fail only because the model is weak. They fail because the surrounding system changes how the model sees the task, remembers its decisions, spends reasoning, and chooses tools.

    That means your architecture needs boring controls (a small configuration sketch follows this list):

    • Versioned prompts
    • Stable effort settings
    • Explicit memory rules
    • Tool-call logs
    • Replayable evals
    • Small rollouts for behavior changes
    • Human review at the points where damage is expensive
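    As a sketch of what those boring controls can look like in code, here is a hypothetical harness configuration object that is versioned and hashed, so every run log can record exactly which prompt version, effort setting, and memory policy it ran under. The field names and values are illustrative assumptions, not a real SDK's options.

```python
# Hypothetical harness configuration: every behavior-shaping knob lives in one
# versioned, hashable object, so logs can record exactly which contract a run
# followed. Field names and values are illustrative, not a real SDK's schema.
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HarnessConfig:
    model: str = "claude-opus-4-20250514"       # placeholder model id
    system_prompt_version: str = "prompt-v3"    # prompt text lives in version control
    reasoning_effort: str = "high"              # pinned, never inherited from a default
    context_compaction: str = "once-after-1h-idle"
    memory_policy: str = "keep-thinking-current-session"
    allowed_tools: tuple[str, ...] = ("read_file", "edit_file", "run_tests")

    def fingerprint(self) -> str:
        """Stable hash to stamp onto every run log and eval result."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]


config = HarnessConfig()
print(config.fingerprint())   # record this next to each run's tool-call log
```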

    Those controls do not look exciting in a demo. They matter when the agent has been running for six hours and you need to know whether it is still following the same contract it started with.

    Claude Code’s April incident is not just a Claude story. It is a warning about every agent product that wraps a strong model in a fast-moving harness.

    The model is the brain. The harness decides whether the brain gets enough context, time, and guardrails to do the job.
