AMD MI355X Delivers 2,626 tok/s on GLM 5.2 — 80% of Blackwell Speed at Half the Cost

Wafer served Z.AI's GLM 5.2 on AMD MI355X GPUs at 2,626 tok/s per node — 80% of NVIDIA B200 throughput but at over 2× lower cost per token. With Blackwell supply tightening and inference demand exploding, AMD's Instinct MI350 series is closing the gap faster than expected.

Inference supply can't keep up with the torrent of frontier model releases, and NVIDIA Blackwell GPUs are getting scarce and expensive. Wafer just showed there's another path — and it runs on AMD.

Serving Z.AI's GLM 5.2 on AMD MI355X accelerators, Wafer hit 2,626 tok/s per node on a 20k-input / 1k-output workload with a 60% cache hit rate. That is roughly 80% of what a B200 delivers, yet the MI355X costs roughly 2.75× less per GPU than a B300. Bottom line: performance per dollar is improving fast.
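The performance-per-dollar claim follows from simple arithmetic on the two ratios quoted above. The sketch below is illustrative only: it assumes the per-GPU price ratio carries over to per-node serving cost and that both ratios are measured against the same NVIDIA baseline, neither of which the article confirms (it cites a B200 for throughput and a B300 for price). The function name is hypothetical.

```python
def perf_per_dollar_ratio(relative_throughput: float, relative_cost: float) -> float:
    """Tokens per dollar relative to a baseline.

    relative_throughput: throughput as a fraction of the baseline (0.80 = 80%).
    relative_cost: cost as a fraction of the baseline (1 / 2.75 = 2.75x cheaper).
    """
    return relative_throughput / relative_cost


# Figures quoted in the article; treating them as directly comparable
# is an assumption made for this example.
MI355X_RELATIVE_THROUGHPUT = 0.80
MI355X_RELATIVE_COST = 1 / 2.75

ratio = perf_per_dollar_ratio(MI355X_RELATIVE_THROUGHPUT, MI355X_RELATIVE_COST)
print(f"{ratio:.2f}x tokens per dollar vs. baseline")  # prints "2.20x tokens per dollar vs. baseline"
```

Under these assumptions the result is about 2.2×, which lines up with the "over 2× lower cost per token" figure.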

Single-stream speed tells the same story: 213 tok/s on 10k input / 1.5k output, measured to Artificial Analysis standards. It won't top the raw-speed leaderboard, but it leads on cost-adjusted throughput, the metric that matters when you're serving billions of tokens.

Under the hood, Wafer quantized the bf16 model to MXFP4 using AMD Quark. The conversion was effectively lossless on evals: GSM8K, GPQA-Diamond, and tau2 scores stayed within noise. They ran inference on sglang after hitting roadblocks with vLLM (no working MXFP4 + GlmMoeDsa path) and ATOM (output degradation at long context). Getting speculative decode working required two custom fixes for the ROCm image, underscoring that AMD's software gap is real, though it is narrowing as AI-assisted optimization accelerates.
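For readers unfamiliar with MXFP4, here is a minimal pure-Python sketch of the idea, based on the OCP Microscaling format: weights are grouped into blocks of 32, each block shares one power-of-two scale, and each element is stored as a 4-bit E2M1 float. This is not AMD Quark's implementation and does not use its API. The function name is hypothetical, the tie-breaking rule is a simplification, and real toolchains add calibration and packing steps that are omitted here.

```python
import math

# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # MX formats share one scale per block of 32 elements
E2M1_EMAX = 2    # exponent of the largest E2M1 value (6 = 1.5 * 2**2)


def quantize_block(block):
    """Quantize one block and return (shared_exponent, dequantized_values).

    Simplified: ties round toward the smaller magnitude, and values that
    exceed the representable range after scaling are clamped.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale derived from the largest magnitude in the block.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # clamp to the max FP4 value
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return shared_exp, out


# A short block for illustration (a real block would hold BLOCK_SIZE values).
exp, values = quantize_block([7.0, 0.7, -3.0, 0.0])
print(exp, values)
```

In this toy block, 7.0 clamps to 6.0 and 0.7 rounds to 0.5, so individual weights do change. That is why accuracy is checked on benchmarks such as GSM8K after conversion rather than assumed.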

The takeaway: NVIDIA's monopoly on frontier inference is no longer a given. For the first time, a competitive AMD stack is delivering usable, cost-effective performance on a top-tier model — not in a lab, but in production. The inference cost curve just bent downward.

Sources: Wafer — Performance per dollar is getting faster and cheaper


Content on Anagnorisis is summarized, paraphrased, and editorialized from publicly available sources for length and clarity. Original sources are linked where available. All trademarks belong to their respective owners.
