Hourly · 2026-07-05 14:00 UTC

GPT-5.5 Reasoning Budget Exposed: Nearly Half of Responses Hit a 516-Token Ceiling

A community analysis of 390,000 Codex responses reveals GPT-5.5 disproportionately terminates reasoning at exactly 516 tokens — and complex-task performance is suffering.

A meticulous data analysis posted to OpenAI's Codex issue tracker this week has surfaced troubling evidence that GPT-5.5 — the model powering Codex for complex coding tasks — is systematically capping its reasoning at fixed token boundaries.

The investigation, covering 390,195 responses across 865 sessions from February through June 2026, found that GPT-5.5 responses land on exactly 516 reasoning output tokens 44% of the time. For every other model tested, that figure is below 2%. Two additional spike points at 1,034 and 1,552 tokens suggest deliberate tiered caps rather than a naturally varying distribution.
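For readers who want to look for the same kind of clustering in their own logs, here is a minimal sketch of a spike check. It assumes you already have a flat list of per-response reasoning token counts; the sample data and the 5% threshold are invented for illustration and are not taken from the issue.

```python
from collections import Counter

def find_spikes(token_counts, min_share=0.05):
    """Return (value, share) pairs for exact token counts that account for
    at least min_share of all responses, sorted by token value."""
    if not token_counts:
        return []
    total = len(token_counts)
    counts = Counter(token_counts)
    return sorted(
        (value, n / total)
        for value, n in counts.items()
        if n / total >= min_share
    )

# Hypothetical sample: three heavy spikes plus 34 spread-out values.
sample = [516] * 44 + [1034] * 12 + [1552] * 10 + list(range(100, 134))

spikes = find_spikes(sample)
# Distance between neighbouring spikes; equal gaps hint at tiered limits.
gaps = [b[0] - a[0] for a, b in zip(spikes, spikes[1:])]
```

The three values reported in the issue are evenly spaced (1,034 - 516 = 518 and 1,552 - 1,034 = 518), which is the kind of regularity a gap check like this would surface.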

The pattern accelerated sharply. In February, exact-516 clustering was negligible at 0.11%. By May it had surged to 53.3%, before settling at 35.8% in June. Meanwhile, average reasoning token depth collapsed — from 268 tokens in February to just 107 in May. The 90th percentile fell from 772 to 344 over the same window.

GPT-5.5 accounts for just 19.3% of all Codex responses but 82% of all exact-516 events. Its exact-516 rate is 33.6 times that of all other models combined.
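One way to relate a share of responses to a share of events is sketched below. It assumes both shares are measured over the same set of responses, which the issue summary does not confirm; the function name is hypothetical.

```python
def implied_rate_ratio(response_share, event_share):
    """Ratio of a group's per-response event rate to the rate of everyone
    else, given the group's share of responses and its share of events.
    Assumes both shares refer to the same population of responses."""
    group_rate = event_share / response_share
    rest_rate = (1 - event_share) / (1 - response_share)
    return group_rate / rest_rate

# Shares quoted in the article: 19.3% of responses, 82% of exact-516 events.
ratio = implied_rate_ratio(0.193, 0.82)
```

With the two shares quoted above, this simple calculation gives a ratio of roughly 19, so the reported 33.6 presumably rests on a different time window or filter than the one the shares were taken from.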

The community is connecting these dots to an earlier report from April in which GPT-5.5 responses terminating at exactly 516 reasoning tokens returned wrong answers on complex tasks. That report was closed without resolution. Users in the new thread are piling on with confirmations: "same issue" comments span multiple time zones and use cases, suggesting this is not an edge case but a systemic degradation affecting paid users on high-stakes coding work.

OpenAI has not yet responded to the investigation, which includes specific internal validation queries the community is asking the Codex team to run: check token-count distributions by model and day, replay matched complex tasks across GPT-5.2 and GPT-5.5 with quality evals, and separate exact-516 responses from longer-reasoning responses in A/B tests.
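The first of those checks, token-count distributions by model and day, can be approximated on local data. The sketch below assumes each record is a `(model, day, reasoning_tokens)` tuple; that layout, the model labels, and the sample rows are hypothetical, not the format used in the issue.

```python
from collections import defaultdict
from datetime import date

def exact_share_by_model_and_day(records, target=516):
    """Map (model, day) to the fraction of that group's responses whose
    reasoning token count equals target exactly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, day, tokens in records:
        key = (model, day)
        totals[key] += 1
        if tokens == target:
            hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Hypothetical records for a single day.
day = date(2026, 5, 1)
records = [
    ("gpt-5.5", day, 516),
    ("gpt-5.5", day, 516),
    ("gpt-5.5", day, 230),
    ("gpt-5.5", day, 1034),
    ("gpt-5.2", day, 300),
    ("gpt-5.2", day, 512),
]

shares = exact_share_by_model_and_day(records)
```

Tracking these shares over time per model is what would show whether a jump like the one reported between February and May appears in your own usage.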

Until those answers arrive, the practical takeaway for developers is sobering: if your GPT-5.5 Codex session feels shallow or starts returning suspiciously wrong answers, check the reasoning token count. If it hits 516 exactly, you may be running into a hard budget cap — and your results may reflect truncated thinking rather than genuine model failure.
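A minimal sketch of that check follows. It assumes the usage data for a response is available as a dictionary with a `reasoning_output_tokens` key; that key name and the helper names are assumptions, so adjust them to whatever your client or session log actually exposes.

```python
# Spike values reported in the issue.
SUSPECT_COUNTS = (516, 1034, 1552)

def reasoning_tokens(usage):
    """Return the reasoning token count from a usage dict, or None if the
    (assumed) 'reasoning_output_tokens' key is missing."""
    return usage.get("reasoning_output_tokens")

def looks_capped(usage):
    """True when the reasoning token count lands exactly on a reported spike."""
    return reasoning_tokens(usage) in SUSPECT_COUNTS

# Hypothetical usage payloads.
flagged = looks_capped({"reasoning_output_tokens": 516})
normal = looks_capped({"reasoning_output_tokens": 517})
```

A single match is only a hint, since a response can land on one of these values by chance; repeated matches within a session are the more telling signal.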

Sources: GitHub — openai/codex issue #30364
