// PUBLISHED April 25, 2026 // READING 7 min // FILED Claude Code

Is Claude getting worse? What we learned when Anthropic admitted it.

// FILED Cost & Ops // SOURCE Septim Labs // PERMALINK /blog/is-claude-getting-worse-april-2026.html

For six weeks, developers said Claude got dumber. Anthropic said it didn't. On April 23, Anthropic published an internal post-mortem confirming three engineering changes degraded coding output between March 4 and April 16, then quietly pushed fixes and reset everyone's usage limits as a concession. Here is what changed, who got hit, and what to do when your business runs on one AI vendor.

// CHART · Claude Code quality, 2026 Q1–Q2 · relative scale −1 to +1, baseline vs. degraded
// Annotations: Mar 4 reasoning effort cut · Apr 16 25-word cap removed · Apr 23 post-mortem

What developers were saying

Starting in early March, threads on Hacker News and r/ClaudeAI began compiling the same observation: Claude Code was producing shorter answers, missing obvious steps, and abandoning context mid-task. Some users blamed their own prompt drift. Others insisted the model itself had quietly changed.

Anthropic's public stance for the next six weeks was that no model swap had occurred. The status page showed green. Internal benchmarks reportedly came back flat. The community kept producing receipts — side-by-side outputs, identical prompts, demonstrably worse responses — and Anthropic kept saying nothing was wrong.

The status page is a function of what you measure. If your benchmark is API uptime, and the regression is in the shape of the output, you can be honestly green and quietly broken at the same time. — Septim Labs · April 25, 2026

What actually changed

The April 23 post-mortem identified three independent changes that compounded over six weeks. None was a model retrain. All three were operational adjustments that lived above the model layer.

// 01 Shipped Mar 4
Reasoning effort cut

Default reasoning budget for Claude Code dropped from high to medium as a cost-control measure. On simple tasks, indistinguishable. On multi-step coding work, the model started skipping intermediate planning steps and going straight to output.

// 02 Shipped Mar 11
Mid-session thinking wiped

A bug in the session-state handler cleared mid-session thinking history at certain context-window thresholds. The model would lose its own work-in-progress reasoning halfway through a task and have to reconstruct it from output history alone — visibly worse on long PRs.

// 03 Shipped Mar 22
25-word response cap

A latency-improvement feature added an aggressive truncation rule — a cap of roughly 25 words on a category of intermediate responses Claude uses internally to plan. Output quality on synthesis tasks collapsed because the model ran out of room to think before answering.

Each one was a reasonable engineering tradeoff. Together they produced a six-week regression that Anthropic could not detect on its own benchmarks because each change individually fell within the noise floor.

What Anthropic did next

The fix landed in three rolling deploys between April 16 and April 23. Reasoning effort returned to high by default. The session-state bug was patched. The 25-word cap was removed. As a goodwill move, Anthropic reset all subscriber usage limits for the billing month — every Pro and Max user got their May allotments back, plus April's. No one asked for this; it was a concession.

// What the post-mortem actually said

"Three changes shipped between March 4 and March 22 collectively degraded Claude Code output quality on multi-step tasks. We did not catch it because each change passed our individual rollout gates. Our aggregate quality benchmark did not flag the trend until a community-published reproducer surfaced on April 16."

The exact text is mirrored in the April 24 Fortune piece and the April 22 Register write-up, both linked in sources.

Why this matters if Claude is your only AI vendor

If your stack runs entirely through Anthropic — Claude Code in your editor, Claude API in your product, the Anthropic dashboard for billing — six weeks of degradation is six weeks of degradation. Every PR your reviewer agent looked at. Every customer-facing reply your assistant generated. Every failing test your debug agent diagnosed.

The community caught it. The reproducer that broke the silence was a community artifact, not Anthropic's internal QA. That is the structural lesson here — and it generalizes to every vendor, not just Anthropic. When your business depends on one model behaving consistently, you need the same posture you would have toward any infrastructure dependency: independent verification, version-pinned reproducibility, an exit ramp.
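
Version-pinned reproducibility is the cheapest of the three to start on. Below is a minimal sketch using the Anthropic TypeScript SDK; the model string and token budgets are placeholders, and it assumes the Messages API's extended-thinking parameter is available on the model you pin.

// pinned-call.ts — illustrative, not a prescribed configuration
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Pin the model version and the thinking budget in your own repo, so a change
// in vendor defaults does not silently change what your pipeline gets back.
const PINNED_MODEL = "claude-sonnet-4-5";   // placeholder: pin the exact version you ship on
const THINKING_BUDGET_TOKENS = 8_000;       // placeholder budget for multi-step coding tasks

export async function reviewWithPinnedEffort(prompt: string): Promise<string> {
  const response = await client.messages.create({
    model: PINNED_MODEL,
    max_tokens: 16_000, // must be larger than the thinking budget
    thinking: { type: "enabled", budget_tokens: THINKING_BUDGET_TOKENS },
    messages: [{ role: "user", content: prompt }],
  });

  // Keep only the visible text blocks; thinking blocks are the model's scratch work.
  return response.content
    .flatMap((block) => (block.type === "text" ? [block.text] : []))
    .join("\n");
}

The specific budget matters less than where it lives: in your repository, versioned, where a diff shows exactly when it changed.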

Three concrete things we changed at Septim Labs after April 23

  1. Pinned a regression suite of our own tasks. Twenty repeatable Claude Code prompts whose outputs we sample weekly and diff against last quarter's baseline. The reproducer doesn't need to be impressive — it needs to be ours, and we need to look at it. A minimal runner is sketched after this list.
  2. Routed cost-sensitive paths through smaller models. Sub-tasks that don't need full Claude reasoning (formatting, classification, schema validation) now run on cheaper Anthropic tiers or alternate vendors. When the flagship model regresses, only the workload that actually needed it is affected.
  3. Captured every output we ship to a customer. Versioned, dated, replayable. If a customer reports the system "got worse" three weeks later, we have receipts.
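
Here is a minimal sketch of items 1 and 3 combined: run a pinned set of prompts on a schedule, write every output to a dated directory, and flag any prompt whose finding count drops sharply against a stored baseline. The directory layout, the injected runPrompt helper, and the 50% threshold are illustrative choices, not a prescribed format.

// regression-suite.ts — illustrative runner for the pinned prompt suite
import { promises as fs } from "node:fs";
import path from "node:path";

const PROMPTS_DIR = "regression/prompts"; // one pinned prompt per file
const RUNS_DIR = "regression/runs";       // dated, replayable outputs

// One finding per line, in the skill file's output format: "[severity] category: description"
const FINDING = /^\[(blocker|major|minor|nit)\]/;

function countFindings(output: string): number {
  return output.split("\n").filter((line) => FINDING.test(line.trim())).length;
}

export async function runRegressionSuite(
  runPrompt: (prompt: string) => Promise<string>, // e.g. the pinned API call sketched earlier
  baselineDate: string,                            // e.g. "2026-01-15"
): Promise<void> {
  const today = new Date().toISOString().slice(0, 10);
  const outDir = path.join(RUNS_DIR, today);
  await fs.mkdir(outDir, { recursive: true });

  for (const file of await fs.readdir(PROMPTS_DIR)) {
    const prompt = await fs.readFile(path.join(PROMPTS_DIR, file), "utf8");
    const output = await runPrompt(prompt);

    // Capture: versioned, dated, replayable.
    await fs.writeFile(path.join(outDir, file), output);

    // Diff against the stored baseline run, if one exists.
    const baseline = await fs
      .readFile(path.join(RUNS_DIR, baselineDate, file), "utf8")
      .catch(() => null);
    if (baseline !== null) {
      const now = countFindings(output);
      const then = countFindings(baseline);
      if (now < then * 0.5) {
        console.warn(`${file}: findings dropped from ${then} to ${now}; inspect manually`);
      }
    }
  }
}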

The bigger pattern

The tools you ship to customers should not depend on a single vendor's promise that their model is unchanged. The model will change. The vendor will ship operational tweaks above the model layer that they consider safe and that you experience as quality drift. The status page will stay green.

This is not a takedown of Anthropic specifically. They published a post-mortem, fixed the issue, refunded usage. That is the responsible-vendor pattern. The lesson is that even responsible vendors produce drift, and the only durable defense is your own measurement.

If your customer can tell, your provider's benchmark didn't. — internal note · written during the regression, April 14, 2026

How to tell if it happened to you — a real diagnostic

The community reproducer that finally broke the silence was straightforward: the same coding prompt, run in two sessions one day apart, producing outputs that differed by more than 200 words in explanation depth and missed a class of edge cases entirely in the second session. Documenting a regression requires three things: a stable prompt, a repeatable task, and a structured output format so you can diff.

Here is the diagnostic we now run quarterly:


// regression-check.md — a pinned skill file

## Task
Review the following TypeScript function for:
1. Logic correctness (walk every branch)
2. Edge cases not covered by existing tests
3. Type safety issues
4. Security surface (input validation)

## Input
[PASTE FUNCTION HERE]

## Output format
One finding per line:
[severity] category: specific description
Severities: blocker / major / minor / nit

## Refuses to
- Mark anything as "looks good" without citing specific evidence
- Skip edge case analysis even if the function appears simple
- Omit severity on any finding

Run this skill file against the same function every quarter. If the output structure collapses — if one quarter’s run produces 8 findings with file:line citations and the next produces 2 vague observations — you have a regression signal independent of any vendor announcement.
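
What "structure collapse" means can itself be made concrete. Here is a small sketch of the comparison we run over two quarterly outputs, assuming the pinned output format above; the parser and the thresholds are starting points we picked, not calibrated values.

// diff-findings.ts — illustrative structural comparison of two quarterly runs
type Severity = "blocker" | "major" | "minor" | "nit";

interface Finding {
  severity: Severity;
  category: string;
  description: string;
}

// Matches the pinned output format: "[severity] category: specific description"
const LINE = /^\[(blocker|major|minor|nit)\]\s*([^:]+):\s*(.+)$/;

export function parseFindings(output: string): { findings: Finding[]; malformed: number } {
  const findings: Finding[] = [];
  let malformed = 0;
  for (const raw of output.split("\n")) {
    const line = raw.trim();
    if (line === "") continue;
    const m = LINE.exec(line);
    if (m) {
      findings.push({
        severity: m[1] as Severity,
        category: m[2].trim(),
        description: m[3].trim(),
      });
    } else {
      malformed++; // drift away from the pinned format is itself a signal
    }
  }
  return { findings, malformed };
}

// A collapse looks like: far fewer findings, more malformed lines, thinner descriptions.
export function looksDegraded(current: string, baseline: string): boolean {
  const cur = parseFindings(current);
  const base = parseFindings(baseline);
  const avgLen = (f: Finding[]) =>
    f.length === 0 ? 0 : f.reduce((sum, x) => sum + x.description.length, 0) / f.length;
  return (
    cur.findings.length < base.findings.length * 0.5 ||
    cur.malformed > base.malformed + 3 ||
    avgLen(cur.findings) < avgLen(base.findings) * 0.5
  );
}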

The function does not need to be real production code. We keep a set of 15 "canary functions" that are synthetic but representative of the edge-case patterns our actual codebase hits. They change slowly, the skill file does not change at all, and the output gives us a quarterly baseline.
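
For reference, a canary function does not need to be elaborate; it needs a few deliberately planted edge cases that a careful review should surface. A synthetic example in that spirit, not one from our actual set:

// canary-pagination.ts — synthetic canary with planted edge cases
// A careful review should flag: page <= 0, non-integer page, pageSize of 0,
// a page past the end of the array, and the silent default page size.
export function paginate<T>(items: T[], page: number, pageSize = 20): T[] {
  const start = (page - 1) * pageSize;
  return items.slice(start, start + pageSize);
}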

The session-by-session record problem

The deeper issue that the April regression surfaced: most Claude Code users have no systematic record of what Claude did in a session, what decisions it made, which files it touched, and where it ran into trouble. When quality degrades, they feel it — but they cannot prove it, cannot isolate which session type was most affected, and cannot show their team a before/after comparison.

This is not an academic problem. If you are using Claude Code to review PRs, generate customer-facing documentation, or diagnose production bugs, the quality of those outputs matters across time. A six-week regression on PR review means six weeks of reviews that missed more than they should have. You cannot go back and identify which ones unless you saved the outputs.

Anthropic’s April 23 post-mortem noted: "We did not catch it because each change passed our individual rollout gates. Our aggregate quality benchmark did not flag the trend until a community-published reproducer surfaced on April 16." The gap was 43 days between the first degrading change and the community reproducer. If you are building on the API, your own benchmark needs to be tighter than Anthropic’s internal one — not because they are irresponsible, but because your workload is more specific than any aggregate benchmark.

What a session record actually looks like

A minimal session record captures six fields per Claude Code session: when it ran, what it was asked to do, which files it touched, what decisions it made, where it got stuck, and a quick quality score you assign at the end.

Over 8-12 weeks, this log becomes the signal that no vendor dashboard gives you. If your quality scores trend down over a three-week period, you have your own regression evidence — even before Anthropic acknowledges anything.
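
As a data structure, the record can be as small as this; the field names and the JSONL log file are illustrative, not a fixed schema.

// session-record.ts — illustrative shape for a per-session log entry
import { appendFileSync } from "node:fs";

interface SessionRecord {
  timestamp: string;       // ISO date-time the session ran
  task: string;            // what the session was asked to do, e.g. "PR review"
  filesTouched: string[];  // files Claude read or edited
  decisions: string[];     // the significant choices it made along the way
  stuckPoints: string[];   // where it stalled, backtracked, or needed correction
  qualityScore: 1 | 2 | 3 | 4 | 5; // your own quick rating, assigned at session end
}

// Append-only JSONL keeps the history cheap to write and easy to diff.
export function logSession(record: SessionRecord, logPath = "sessions.jsonl"): void {
  appendFileSync(logPath, JSON.stringify(record) + "\n");
}

A weekly rolling average of qualityScore over that file is the trend you compare against when a regression is suspected.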

What to do this week

// FROM SEPTIM LABS

The honest record — what Witness tracks across every session.

Witness records every significant decision a Claude Code session makes during a build: which file it touched, what it reasoned, where it got stuck. Session-by-session, timestamped, diff-able. If Claude degrades over three weeks, Witness lets you prove it. If it recovers, you see that too. Founding rate: first 10 at $79 (regular price $99). Ships June 2026. Pay now, get access at launch.

// SOURCES