
Mistral's Leanstral 1.5 Saturates Math Benchmarks, Finds Bugs in Real Code

Mistral AI's new Apache-2.0 proof-engineering model saturates miniF2F, solves 587 of 672 PutnamBench problems at roughly $4 per problem, and uncovered five previously unreported bugs across 57 open-source repositories.


On July 2, Mistral AI released Leanstral 1.5, a free, Apache-2.0-licensed model purpose-built for formal verification and proof engineering in Lean 4. It has 119 billion total parameters, of which only 6 billion are active thanks to a mixture-of-experts architecture, and it delivers results that rival systems costing roughly 75 times more per problem to run.

The model saturates the miniF2F benchmark, scoring 100% on both the validation and test sets. On PutnamBench, a set of 672 problems drawn from the notoriously difficult Putnam Mathematical Competition, Leanstral 1.5 solves 587 problems at pass@8 (about 87%), edging out Seed-Prover 1.5 by seven problems while costing roughly $4 per problem. Seed-Prover's high setting runs an estimated $300 or more per problem, consuming 10 H20-days of compute for each attempt.
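To make the headline numbers concrete, here is a minimal sketch of how a pass@k tally and the quoted ratios can be computed. It assumes the simple reading of pass@8, where a problem counts as solved if any of up to eight attempts verifies. The function names and toy attempt data are hypothetical; only the figures 587, 672, $300 and $4 come from the article.

```rust
/// A problem counts as solved at pass@k if any of its first k attempts verifies.
fn solved_at_k(attempts: &[Vec<bool>], k: usize) -> usize {
    attempts
        .iter()
        .filter(|per_problem| per_problem.iter().take(k).any(|&ok| ok))
        .count()
}

/// Share of problems solved, as a percentage.
fn solve_rate_percent(solved: usize, total: usize) -> f64 {
    100.0 * solved as f64 / total as f64
}

/// How many times more expensive `other_usd` is than `base_usd`.
fn cost_ratio(other_usd: f64, base_usd: f64) -> f64 {
    other_usd / base_usd
}

fn main() {
    // Toy data (hypothetical): three problems with a varying number of attempts.
    let attempts = vec![
        vec![false, false, true], // verified on the third attempt
        vec![false; 8],           // never verified
        vec![true],               // verified immediately
    ];
    println!("toy problems solved at pass@8: {}", solved_at_k(&attempts, 8));

    // Figures quoted in the article.
    println!("PutnamBench solve rate: {:.1}%", solve_rate_percent(587, 672));
    println!("cost ratio: {}x", cost_ratio(300.0, 4.0));
}
```

The $300 figure is described as a lower-bound estimate, so the 75x ratio is best read as approximate.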

On graduate and PhD-level abstract algebra benchmarks FATE-H and FATE-X, Leanstral 1.5 sets a new state of the art at 87% and 34% respectively.

Beyond benchmark scores, the model was tested on real-world code verification across 57 open-source repositories. An automated pipeline translated Rust code to Lean, had Leanstral infer correctness properties, and attempted to prove them. The process flagged 47 violated properties, of which 11 pointed to genuine bugs; five of those had not previously been reported on GitHub. One bug, found in the zigzag decoding function of the datrs/varinteger library, caused an integer overflow on input Std.U64.MAX (the largest 64-bit unsigned value) that would panic in debug builds and silently corrupt data in release builds.
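The article does not show the affected code, so the following is only an illustrative sketch of how a zigzag decoder can overflow on the maximum input. The naive decoder here is hypothetical and is not claimed to be the datrs/varinteger implementation. It uses `checked_add` and `wrapping_add` so that both the debug-style and release-style outcomes are visible regardless of build mode.

```rust
/// Standard zigzag decoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
/// Uses only shifts and XOR, so no intermediate value can overflow.
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

/// Hypothetical naive decoder: odd inputs map to -((n + 1) / 2).
/// Returns None where `n + 1` overflows, which is where a plain `+`
/// would panic in a debug build.
fn naive_decode_checked(n: u64) -> Option<i64> {
    if n & 1 == 1 {
        n.checked_add(1).map(|v| -((v / 2) as i64))
    } else {
        Some((n / 2) as i64)
    }
}

/// The same naive decoder with wrapping arithmetic, mimicking the default
/// release-build behaviour of a plain `+`.
fn naive_decode_wrapping(n: u64) -> i64 {
    if n & 1 == 1 {
        -((n.wrapping_add(1) / 2) as i64)
    } else {
        (n / 2) as i64
    }
}

fn main() {
    let n = u64::MAX;
    println!("standard: {}", zigzag_decode(n));
    println!("naive, checked: {:?}", naive_decode_checked(n));
    println!("naive, wrapping: {}", naive_decode_wrapping(n));
}
```

For `u64::MAX` the standard decoder yields `i64::MIN`, while the naive variant either has no valid result (checked) or quietly returns 0 (wrapping). That kind of divergence on a single edge-case input is what a machine-checked round-trip property is meant to surface.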

Leanstral 1.5 was trained in three stages: mid-training, supervised fine-tuning, and reinforcement learning with a method called CISPO. The reinforcement learning stage operated across two environments: a multi-turn theorem-proving loop with Lean compiler feedback, and a code-agent environment where the model edits files, runs bash commands, and uses the Lean language server to inspect goals and errors in real time.
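The multi-turn loop with compiler feedback can be pictured roughly as follows. This is a generic propose-check-retry sketch, not Mistral's training code: the `Prover` and `Checker` traits, the toy implementations, and the error string are all hypothetical stand-ins for the model and the Lean compiler.

```rust
/// Stand-in for the Lean compiler: accepts a candidate or returns an error message.
trait Checker {
    fn check(&self, candidate: &str) -> Result<(), String>;
}

/// Stand-in for the model: proposes a candidate from the goal and the last error, if any.
trait Prover {
    fn propose(&mut self, goal: &str, feedback: Option<&str>) -> String;
}

/// Runs up to `max_turns` propose/check rounds.
/// Returns the accepted candidate and the turn on which it was accepted.
fn prove_with_feedback<P: Prover, C: Checker>(
    prover: &mut P,
    checker: &C,
    goal: &str,
    max_turns: usize,
) -> Option<(String, usize)> {
    let mut feedback: Option<String> = None;
    for turn in 1..=max_turns {
        let candidate = prover.propose(goal, feedback.as_deref());
        match checker.check(&candidate) {
            Ok(()) => return Some((candidate, turn)),
            // Feed the error message back into the next proposal.
            Err(msg) => feedback = Some(msg),
        }
    }
    None
}

/// Toy checker: rejects any candidate that still contains `sorry`.
struct ToyChecker;
impl Checker for ToyChecker {
    fn check(&self, candidate: &str) -> Result<(), String> {
        if candidate.contains("sorry") {
            Err("candidate still contains a placeholder".to_string())
        } else {
            Ok(())
        }
    }
}

/// Toy prover: submits a placeholder first, then a different tactic once it sees feedback.
struct ToyProver;
impl Prover for ToyProver {
    fn propose(&mut self, _goal: &str, feedback: Option<&str>) -> String {
        match feedback {
            None => "by sorry".to_string(),
            Some(_) => "by rfl".to_string(),
        }
    }
}

fn main() {
    let result = prove_with_feedback(&mut ToyProver, &ToyChecker, "1 + 1 = 2", 4);
    println!("{:?}", result);
}
```

In a reinforcement learning setting, the checker's verdict would typically also serve as the reward signal, though the article does not describe the reward design.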

The model also proved the O(log n) time complexity guarantees for a real AVL tree implementation, a task requiring structural induction, monadic time tracking, and exhaustive case analysis — consuming 2.7 million tokens across 22 context compactions.
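For context, the logarithmic bound rests on the standard textbook argument that an AVL tree's height grows logarithmically in its node count. The derivation below is that classical argument, not a transcription of Leanstral's Lean proof. The notation is an assumption for illustration: $N(h)$ is the minimum number of nodes in an AVL tree of height $h$ (the empty tree has height 0), $F_k$ is the $k$-th Fibonacci number with $F_1 = F_2 = 1$, and $\varphi$ is the golden ratio.

```latex
\begin{align*}
N(0) &= 0, \qquad N(1) = 1, \qquad N(h) = 1 + N(h-1) + N(h-2) \quad (h \ge 2) \\
N(h) &= F_{h+2} - 1 \;\ge\; \varphi^{h} - 1, \qquad \varphi = \tfrac{1+\sqrt{5}}{2} \\
n \ge N(h) \;&\Longrightarrow\; h \;\le\; \log_{\varphi}(n+1) \;\approx\; 1.44\,\log_2(n+1)
\end{align*}
```

Since search, insertion, and deletion each do a bounded amount of work per level, a height of $O(\log n)$ gives $O(\log n)$ time per operation. The formal proof must additionally track those steps explicitly, which is where the monadic time tracking comes in.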

The weights are available on Hugging Face under the name mistralai/Leanstral-1.5-119B-A6B, with a free API endpoint and integration into Mistral Vibe for interactive proof engineering.

Sources: Mistral AI, Hugging Face


