GPT-5.5 Reasoning Budget Exposed: Nearly Half of Responses Hit a 516-Token Ceiling

A community analysis of 390,000 Codex responses reveals GPT-5.5 disproportionately terminates reasoning at exactly 516 tokens — and complex-task performance is suffering.

A meticulous data analysis posted to OpenAI's Codex issue tracker this week has surfaced troubling evidence that GPT-5.5 — the model powering Codex for complex coding tasks — is systematically capping its reasoning at fixed token boundaries.

The investigation, covering 390,195 responses across 865 sessions from February through June 2026, found that GPT-5.5 responses land on exactly 516 reasoning output tokens 44% of the time. For every other model tested, that figure is below 2%. Two additional spikes at 1,034 and 1,552 tokens, each exactly 518 tokens above the previous one, suggest deliberate tiered caps rather than a naturally varying distribution.
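For readers who want to run the same kind of check on their own logs, here is a minimal Python sketch of the per-model "exact hit" share and the spacing between the reported spikes. The record layout (model label, reasoning token count) and the toy values are assumptions made up for illustration. They are not the issue's data or its actual method.

```python
from collections import Counter

SUSPECTED_CAP = 516  # value reported in the analysis


def exact_share_by_model(records, value=SUSPECTED_CAP):
    """Return {model: fraction of responses whose reasoning tokens == value}."""
    totals = Counter()
    hits = Counter()
    for model, reasoning_tokens in records:
        totals[model] += 1
        if reasoning_tokens == value:
            hits[model] += 1
    return {model: hits[model] / totals[model] for model in totals}


def spike_spacing(spikes):
    """Differences between consecutive spike positions, in ascending order."""
    ordered = sorted(spikes)
    return [b - a for a, b in zip(ordered, ordered[1:])]


# Toy records, invented for illustration only (hypothetical model labels).
records = [
    ("gpt-5.5", 516), ("gpt-5.5", 516), ("gpt-5.5", 203), ("gpt-5.5", 1034),
    ("other-model", 140), ("other-model", 516),
    ("other-model", 88), ("other-model", 391),
]

shares = exact_share_by_model(records)
spacing = spike_spacing([516, 1034, 1552])  # the three spikes named in the report
print(shares)
print(spacing)
```

A distribution that varies naturally would rarely pile up on a single integer, so a large share on one exact value is the signal worth looking for.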

The pattern accelerated sharply. In February, exact-516 clustering was negligible at 0.11%. By May it had surged to 53.3%, before settling at 35.8% in June. Meanwhile, average reasoning token depth collapsed — from 268 tokens in February to just 107 in May. The 90th percentile fell from 772 to 344 over the same window.
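The monthly figures above (mean, 90th percentile, exact-516 share) can be recomputed from raw logs with a few lines of Python. The sketch below assumes records of the form (month, reasoning token count) and uses a nearest-rank percentile. Both are illustrative choices, the issue may have used a different percentile definition, and the toy numbers are invented.

```python
import math
from collections import defaultdict
from statistics import mean


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def monthly_summary(records, cap=516):
    """Group token counts by month and report mean, p90, and exact-cap share."""
    by_month = defaultdict(list)
    for month, tokens in records:
        by_month[month].append(tokens)
    return {
        month: {
            "mean": mean(tokens),
            "p90": nearest_rank_percentile(tokens, 90),
            "exact_share": sum(t == cap for t in tokens) / len(tokens),
        }
        for month, tokens in sorted(by_month.items())
    }


# Toy data, invented for illustration only.
records = [
    ("2026-02", 120), ("2026-02", 300), ("2026-02", 800), ("2026-02", 60),
    ("2026-05", 516), ("2026-05", 516), ("2026-05", 40), ("2026-05", 90),
]

summary = monthly_summary(records)
for month, stats in summary.items():
    print(month, stats)
```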

GPT-5.5 accounts for just 19.3% of all Codex responses but 82% of all exact-516 events. Its exact-516 ratio is reported as 33.6 times that of all other models combined.

The community is connecting these findings to an earlier report from April, in which GPT-5.5 responses that terminated at exactly 516 reasoning tokens returned wrong answers on complex tasks. That report was closed without resolution. Users in the new thread are piling on with confirmations: "same issue" comments span multiple time zones and use cases, suggesting this is not an edge case but a systemic degradation affecting paid users on high-stakes coding work.

OpenAI has not yet responded to the investigation, which includes specific internal validation queries the community is asking the Codex team to run: check token-count distributions by model and day, replay matched complex tasks across GPT-5.2 and GPT-5.5 with quality evals, and separate exact-516 responses from longer-reasoning responses in A/B tests.
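The third request, separating exact-516 responses from longer-reasoning ones, amounts to bucketing evaluation results by reasoning token count and comparing pass rates. A rough Python sketch follows, assuming each result is a (reasoning token count, passed) pair. The bucket names and toy results are hypothetical and say nothing about what such a comparison would actually show.

```python
def pass_rate_by_bucket(results, cap=516):
    """Split eval results into below/exact/above the suspected cap; return pass rates."""
    buckets = {"below_cap": [], "exact_cap": [], "above_cap": []}
    for reasoning_tokens, passed in results:
        if reasoning_tokens == cap:
            key = "exact_cap"
        elif reasoning_tokens < cap:
            key = "below_cap"
        else:
            key = "above_cap"
        buckets[key].append(passed)
    # None marks an empty bucket so it is not mistaken for a 0% pass rate.
    return {
        name: (sum(flags) / len(flags) if flags else None)
        for name, flags in buckets.items()
    }


# Toy eval results, invented for illustration only.
results = [
    (516, False), (516, True), (516, False), (516, False),
    (900, True), (1200, True), (700, False),
    (200, True),
]

rates = pass_rate_by_bucket(results)
print(rates)
```

A fair comparison would also need tasks matched for difficulty, as the thread's replay request implies. Otherwise harder tasks that simply hit the cap more often would skew the exact-cap bucket.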

Until those answers arrive, the practical takeaway for developers is sobering: if your GPT-5.5 Codex session feels shallow or starts returning suspiciously wrong answers, check the reasoning token count. If it hits 516 exactly, you may be running into a hard budget cap — and your results may reflect truncated thinking rather than genuine model failure.
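As a quick way to act on that advice, the sketch below flags responses in a session whose reasoning token count equals one of the three values named in the report. It assumes you have already extracted the counts into a plain list. How your client or logs expose that number is not covered here, and the session values are made up.

```python
SUSPECTED_CAPS = (516, 1034, 1552)  # spike values named in the report


def hits_suspected_cap(reasoning_tokens, caps=SUSPECTED_CAPS):
    """True if the count lands exactly on one of the suspected cap values."""
    return reasoning_tokens in caps


def flag_session(token_counts, caps=SUSPECTED_CAPS):
    """Return indices of responses whose reasoning token count equals a suspected cap."""
    return [i for i, n in enumerate(token_counts) if n in caps]


# Hypothetical per-response reasoning token counts from one session.
session = [212, 516, 97, 1034, 516]

flagged = flag_session(session)
print(flagged)
```

A single exact match proves little on its own. Repeated matches within one session are the pattern the analysis describes.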

Sources: GitHub — openai/codex issue #30364

Content on Anagnorisis is summarized, paraphrased, and editorialized from publicly available sources for length and clarity. Original sources are linked where available. All trademarks belong to their respective owners.
