We asked large language models (LLMs) to estimate the fraction of a proposed math solution that was correct.
It turns out that while they can reason through complex problems, they still struggle to produce precise numerical outputs.
We call tasks like this Reasoning-Intensive Regression (RiR) problems. Below, I'll talk about why RiR represents a growing class of problems, what tools we currently have for solving them, and where I think the field is heading.
Everyday decisions hinge on numerical predictions: mortgage rates, risk ratings, recommendation feedback. As language models become more capable, there's a growing need for finer-grained judgment scores, particularly for applications like world modeling, rubric-based LLM evaluation, and advanced retrieval.
But not all regression problems are created equal. I think of them in three levels:
Let me give a quick example to build intuition. Consider this flawed proof:
Claim: There are infinitely many prime numbers.
Proof:
1. Suppose there are only finitely many primes: p1, p2, ..., pn.
2. Let N = p1 × p2 × ... × pn + 1.
3. N is not divisible by any pi, since dividing N by each pi leaves remainder 1.
4. Since N is not divisible by any known prime, N is prime.
5. But N is larger than every pi, so we have found a new prime, contradicting the assumption.
The error is in step 4: not being divisible by any known prime doesn't make N prime. The correct reasoning is that N has at least one prime factor (by the fundamental theorem of arithmetic), and that factor can't be any pi, so it must be a new prime. The proof is correct for roughly the first 55% of its length. Now imagine asking a model to output that 0.55 score. That's RiR.
RiR is Level 3. Applications include scoring customer calls, rubric-based LLM generation, instruction-based query-document relevance, and forecasting. Unfortunately, RiR applications typically offer only very small training sets and are limited to lightweight computations.
Given this, we ask: Are there effective methods that are data- and compute-efficient for tackling ad-hoc RiR problems?
Your first question might be: why "ad-hoc"? During our search for appropriate benchmarks, we found the literature quite lacking. Most regression benchmarks were not reasoning-intensive enough, and for those that were, the papers were often published without the accompanying dataset. Consequently, our first task was to create simple RiR datasets ourselves.
We take four tasks from the literature, including ProcessBench and rubric-based pairwise judges from RAG-QA, and cast them as RiR benchmarks.
Here's what we discovered when applying standard methods:
Failure Mode 1: Fine-tuned encoders (NeoBERT) collapse to the mean. The model finds a degenerate solution with near-zero concordance. This isn't a reasoning failure; it's an optimization failure.
Failure Mode 2: Frozen LLMs (GPT-5) quantize their outputs. Good at ranking, but coarse and imprecise on calibration.
Why does quantization happen? LLMs predict tokens, not numbers. The training objective is cross-entropy over the vocabulary, where each digit is an independent classification decision. The loss has no notion of numeric distance. Predicting "2.0" for a target of "7.5" incurs the same penalty as predicting "7.0".
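To make this concrete, here's a minimal toy sketch (a one-token digit vocabulary with made-up probabilities) showing that cross-entropy can't tell a near miss from a wild one:

```python
import math

# Toy one-token "vocabulary" of digits 0-9; the model must emit one digit.
# Cross-entropy is -log p(target), so it ignores where the remaining
# probability mass sits: guessing near 7 and far from 7 cost the same.

def cross_entropy(probs, target):
    return -math.log(probs[target])

target = 7

probs_near = {d: 0.01 for d in range(10)}
probs_near[7] = 0.10   # probability on the correct digit
probs_near[6] = 0.82   # leftover mass on the numerically adjacent digit

probs_far = dict(probs_near)
probs_far[6] = 0.01
probs_far[0] = 0.82    # same leftover mass, but numerically far away

loss_near = cross_entropy(probs_near, target)
loss_far = cross_entropy(probs_far, target)
assert loss_near == loss_far  # identical penalty despite very different numeric error
```

Both models pay exactly -log(0.10); the loss never sees that 6 is closer to 7 than 0 is.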
This reveals the core tension: LLMs reason well but output coarsely. Small regressors calibrate well but can't reason.
Quick note on metrics: NMSE (Normalized Mean Square Error) penalizes point distance, but collapsing to the mean achieves NMSE ≈ 1.0 with zero understanding. CCC (Concordance Correlation Coefficient) captures rank + calibration + bias, rewarding predictions that rank correctly AND maintain the right spread and mean. For RiR, you need both.

To resolve the issues described above, we developed MENTAT. The core insight: split reasoning from calibration.
Phase 1 (Iterative Prompt Evolution): Start with a basic prompt. In 3 iterations: run rollouts, identify the worst-performing examples, ask the LLM to diagnose its own errors, and refine the prompt based on what it learns. It's a batch reflective process. The model sees where it went wrong, explains why, and updates its instructions accordingly. Select the best prompt by CCC on a validation split.
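Roughly, Phase 1 looks like this. A hedged sketch: `llm`, `predict`, and `score_fn` are placeholders for a real model call, a scoring rollout, and a validation metric (e.g. CCC), not the actual MENTAT API; the stubs just make the loop runnable.

```python
def evolve_prompt(llm, predict, score_fn, prompt, train, val, n_iters=3, k_worst=5):
    best_prompt, best_score = prompt, score_fn(predict, prompt, val)
    for _ in range(n_iters):
        # Rollouts on the training split, ranked by absolute error.
        preds = [(x, y, predict(prompt, x)) for x, y in train]
        worst = sorted(preds, key=lambda t: abs(t[1] - t[2]), reverse=True)[:k_worst]
        # The model diagnoses its own failures, then rewrites its instructions.
        critique = llm(f"These inputs were mis-scored: {worst!r}. Diagnose why.")
        prompt = llm(f"Instructions: {prompt}\nCritique: {critique}\n"
                     "Rewrite the instructions to address the critique.")
        score = score_fn(predict, prompt, val)
        if score >= best_score:  # keep the best prompt by the validation metric
            best_prompt, best_score = prompt, score
    return best_prompt

# Trivial stubs so the skeleton runs end to end.
llm = lambda s: "refined instructions"
predict = lambda prompt, x: 0.5
score_fn = lambda predict, prompt, val: -sum(abs(y - predict(prompt, x)) for x, y in val)

best = evolve_prompt(llm, predict, score_fn, "Estimate the fraction correct.",
                     [("proof A", 0.2), ("proof B", 0.9)], [("proof C", 0.5)])
assert isinstance(best, str)
```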
Phase 2 (Multi-Rollout Neural Aggregation): Generate 3 rollouts per example with the evolved prompt. Train a tiny MLP (8 hidden units) on sorted rollouts + stats (mean, std, min, max). Optimize a combined CCC + NMSE loss.
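Phase 2's feature construction, and one plausible form of the combined objective (the exact weighting here is my assumption), can be sketched as:

```python
import numpy as np

def rollout_features(rollouts):
    # Sorted rollouts plus summary statistics: the input to the tiny MLP.
    r = np.sort(np.asarray(rollouts, dtype=float))
    return np.concatenate([r, [r.mean(), r.std(), r.min(), r.max()]])

def combined_loss(y_true, y_pred, alpha=0.5):
    # alpha * (1 - CCC) + (1 - alpha) * NMSE; the weighting is a guess.
    mt, mp = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mt) * (y_pred - mp))
    ccc = 2 * cov / (y_true.var() + y_pred.var() + (mt - mp) ** 2)
    nmse = np.mean((y_true - y_pred) ** 2) / y_true.var()
    return alpha * (1 - ccc) + (1 - alpha) * nmse

feats = rollout_features([0.7, 0.4, 0.6])  # 3 sorted rollouts + 4 stats = 7 features
assert feats.shape == (7,)
```

A perfect predictor drives both terms to zero, while mean-collapse leaves the CCC term at its maximum, so the MLP can't cheat the way the fine-tuned encoders did.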
We tested MENTAT on four tasks: mathematical error detection, instruction following, pairwise RAG comparison, and essay grading. It showed consistent improvements, but it also revealed surprising trade-offs. On pairwise RAG comparison, GPT-4.1 outperformed GPT-5. GPT-5 "overthinks": its predictions cluster near the center of the scale, under-dispersed relative to ground truth, and more than half of examples yield identical rollouts across three samples. GPT-4.1 stays decisive, producing short judgments with better spread. Sometimes sophisticated reasoning is counterproductive.
We also tried RL fine-tuning (GRPO) on the instruction following task. It improved CCC over baseline prompting but degraded NMSE. The model learns relative discrimination among its rollouts but never sees a signal about absolute scale, so population-level mean and variance drift freely. It's like a judge who correctly orders contestants but scores everyone between 6.0-6.5 on a 1-10 scale.
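You can see why the absolute scale drifts by looking at GRPO's group-relative advantage: the learning signal is invariant to shifting or rescaling a group's rewards. A small sketch:

```python
import numpy as np

def group_advantages(rewards):
    # GRPO-style group-relative advantage: rewards are normalized within a
    # group of rollouts for the same prompt.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# The signal is (nearly) invariant to shifting and rescaling the rewards:
# a judge scoring everyone in 6.0-6.5 yields the same advantages as one
# using the full 1-10 scale, so nothing anchors the absolute scale.
r = np.array([0.2, 0.7, 0.9])
assert np.allclose(group_advantages(r), group_advantages(5 * r + 3))
```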
To recap: We define RiR as tasks requiring precise predictions, proper ranking, and deep per-instance reasoning. Standard methods struggle to balance all three. MENTAT helps by splitting reasoning (LLM) from calibration (MLP), but much headroom remains.
I've been thinking about where this field could go. The current MENTAT pipeline is: LLM reasons → MLP calibrates. Two separate systems, 3x inference cost, and the MLP only sees scalar rollouts (it throws away the reasoning trace).
What if we could unify them? Distill LLM reasoning into a small model, then fine-tune with regression-aware loss on the final numerical output. The small model learns why the score is what it is, and the loss function actually respects numeric distance. Single model, single pass, reasoning and calibration unified.
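As a toy illustration of what "respects numeric distance" could mean (an illustrative objective, not necessarily RAFT's exact loss):

```python
# An illustrative regression-aware objective: the expected squared error
# under the model's distribution over candidate numeric outputs.
# Cross-entropy would treat these two models identically (same mass on the
# target), but this loss prefers mass placed numerically close to it.

def expected_sq_error(probs, target):
    # probs: candidate numeric value -> probability
    return sum(p * (v - target) ** 2 for v, p in probs.items())

near = {6.0: 0.8, 7.0: 0.2}  # leftover mass near the target 7.0
far = {0.0: 0.8, 7.0: 0.2}   # same mass on the target, leftover far away
assert expected_sq_error(near, 7.0) < expected_sq_error(far, 7.0)
```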
Reasoning distillation is well-established (DeepSeek-R1 does essentially this at scale), and RAFT shows that a regression-aware loss works cleanly on small decoders with weight access. The interesting thing is that RAFT's gains over cross-entropy fine-tuning are real but modest on tasks like sentiment analysis, where a fine-tuned encoder already does reasonably well and reasoning depth isn't the limiting factor. I would be interested in seeing RAFT plus reasoning distillation on genuine RiR tasks, where the bottleneck is both calibration and reasoning depth.
I think there's rich room here. A clean question is: does regression-aware fine-tuning help when the task requires genuine reasoning, and does reasoning distillation make it help more? Unfortunately, exploring this question is infeasible until RiR benchmarks large enough to support it exist.
For more details on these four tasks and how we created them, developed MENTAT, and analyzed the method (including various ablations), please read the full paper on arXiv. If you have any questions or just want to chat about this stuff, feel free to email me!