We asked large language models (LLMs) to estimate the fraction of a proposed math solution that was correct.
It turns out that while they can reason through complex problems, they still struggle to produce precise numerical outputs.
We call tasks like this Reasoning-Intensive Regression (RiR) problems. Below, I'll talk about why RiR represents a growing class of problems, what tools we currently have for solving them, and where I think the field is heading.
Everyday decisions hinge on numerical predictions: mortgage rates, risk ratings, recommendation feedback. As language models become more capable, there's a growing need for finer-grained judgment scores, particularly for applications like world modeling, rubric-based LLM evaluation, and advanced retrieval.
But not all regression problems are created equal. I think of them in three levels:
Let me give a quick example to build intuition. Consider this flawed proof:
Claim: There are infinitely many prime numbers.
Proof:
1. Suppose there are only finitely many primes: p1, p2, ..., pn.
2. Let N = p1 × p2 × ... × pn + 1.
3. N is not divisible by any pi, since dividing N by each pi leaves remainder 1.
4. Since N is not divisible by any known prime, N is prime.
5. But N is larger than every pi, so we have found a new prime, contradicting the assumption.
The error is in step 4: not being divisible by any known prime doesn't make N prime. The correct reasoning is that N has at least one prime factor (by the fundamental theorem of arithmetic), and that factor can't be any pi, so it must be a new prime. The proof is correct for roughly the first 55% of its length. Now imagine asking a model to output that 0.55 score. That's RiR.
RiR is Level 3. Applications include scoring customer calls, rubric-based LLM generation, instruction-based query-document relevance, and forecasting. Unfortunately, RiR applications typically offer only very small training sets and are limited to lightweight computations.
Given this, we ask: Are there effective methods that are data- and compute-efficient for tackling ad-hoc RiR problems?
Your first question might be: why "ad-hoc"? During our search for appropriate benchmarks, we found the literature quite lacking. Most regression benchmarks were not reasoning-intensive enough, and for those that were, the papers were often published without the accompanying dataset. Consequently, our first task was to create simple RiR datasets ourselves.
We take four tasks from the literature, including ProcessBench and rubric-based pairwise judges from RAG-QA, and cast them as RiR benchmarks.
Here's what we discovered when applying standard methods:
Failure Mode 1: Fine-tuned encoders (NeoBERT) collapse to the mean. The model finds a degenerate solution with near-zero concordance. This isn't a reasoning failure; it's an optimization failure.
Failure Mode 2: Frozen LLMs (GPT-5) quantize their outputs. Good at ranking, but coarse and imprecise on calibration.
Why does quantization happen? LLMs predict tokens, not numbers. The training objective is cross-entropy over the vocabulary, where each digit is an independent classification decision. The loss has no notion of numeric distance. Predicting "2.0" for a target of "7.5" incurs the same penalty as predicting "7.0".
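To make this concrete, here's a minimal toy sketch (a one-token digit vocabulary with made-up probabilities) showing that cross-entropy can't tell a near miss from a wild one:

```python
import math

# Toy one-token "vocabulary" of digits 0-9; the model must emit one digit.
# Cross-entropy is -log p(target), so it ignores where the remaining
# probability mass sits: guessing near 7 and far from 7 cost the same.

def cross_entropy(probs, target):
    return -math.log(probs[target])

target = 7

probs_near = {d: 0.01 for d in range(10)}
probs_near[7] = 0.10   # probability on the correct digit
probs_near[6] = 0.82   # leftover mass on the numerically adjacent digit

probs_far = dict(probs_near)
probs_far[6] = 0.01
probs_far[0] = 0.82    # same leftover mass, but numerically far away

loss_near = cross_entropy(probs_near, target)
loss_far = cross_entropy(probs_far, target)
assert loss_near == loss_far  # identical penalty despite very different numeric error
```

Both models pay exactly -log(0.10); the loss never sees that 6 is closer to 7 than 0 is.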
This reveals the core tension: LLMs reason well but output coarsely. Small regressors calibrate well but can't reason.
Quick note on metrics: NMSE (Normalized Mean Square Error) penalizes point distance, but collapsing to the mean achieves NMSE ≈ 1.0 with zero understanding. CCC (Concordance Correlation Coefficient) captures rank + calibration + bias, rewarding predictions that rank correctly AND maintain the right spread and mean. For RiR, you need both.

To resolve the issues described above, we developed MENTAT. The core insight: split reasoning from calibration.
Phase 1 (Iterative Prompt Evolution): Start with a basic prompt. In 3 iterations: run rollouts, identify the worst-performing examples, ask the LLM to diagnose its own errors, and refine the prompt based on what it learns. It's a batch reflective process. The model sees where it went wrong, explains why, and updates its instructions accordingly. Select the best prompt by CCC on a validation split.
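Roughly, Phase 1 looks like this. A hedged sketch: `llm`, `predict`, and `score_fn` are placeholders for a real model call, a scoring rollout, and a validation metric (e.g. CCC), not the actual MENTAT API; the stubs just make the loop runnable.

```python
def evolve_prompt(llm, predict, score_fn, prompt, train, val, n_iters=3, k_worst=5):
    best_prompt, best_score = prompt, score_fn(predict, prompt, val)
    for _ in range(n_iters):
        # Rollouts on the training split, ranked by absolute error.
        preds = [(x, y, predict(prompt, x)) for x, y in train]
        worst = sorted(preds, key=lambda t: abs(t[1] - t[2]), reverse=True)[:k_worst]
        # The model diagnoses its own failures, then rewrites its instructions.
        critique = llm(f"These inputs were mis-scored: {worst!r}. Diagnose why.")
        prompt = llm(f"Instructions: {prompt}\nCritique: {critique}\n"
                     "Rewrite the instructions to address the critique.")
        score = score_fn(predict, prompt, val)
        if score >= best_score:  # keep the best prompt by the validation metric
            best_prompt, best_score = prompt, score
    return best_prompt

# Trivial stubs so the skeleton runs end to end.
llm = lambda s: "refined instructions"
predict = lambda prompt, x: 0.5
score_fn = lambda predict, prompt, val: -sum(abs(y - predict(prompt, x)) for x, y in val)

best = evolve_prompt(llm, predict, score_fn, "Estimate the fraction correct.",
                     [("proof A", 0.2), ("proof B", 0.9)], [("proof C", 0.5)])
assert isinstance(best, str)
```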
Phase 2 (Multi-Rollout Neural Aggregation): Generate 3 rollouts per example with the evolved prompt. Train a tiny MLP (8 hidden units) on sorted rollouts + stats (mean, std, min, max). Optimize a combined CCC + NMSE loss.
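Phase 2's feature construction, and one plausible form of the combined objective (the exact weighting here is my assumption), can be sketched as:

```python
import numpy as np

def rollout_features(rollouts):
    # Sorted rollouts plus summary statistics: the input to the tiny MLP.
    r = np.sort(np.asarray(rollouts, dtype=float))
    return np.concatenate([r, [r.mean(), r.std(), r.min(), r.max()]])

def combined_loss(y_true, y_pred, alpha=0.5):
    # alpha * (1 - CCC) + (1 - alpha) * NMSE; the weighting is a guess.
    mt, mp = y_true.mean(), y_pred.mean()
    cov = np.mean((y_true - mt) * (y_pred - mp))
    ccc = 2 * cov / (y_true.var() + y_pred.var() + (mt - mp) ** 2)
    nmse = np.mean((y_true - y_pred) ** 2) / y_true.var()
    return alpha * (1 - ccc) + (1 - alpha) * nmse

feats = rollout_features([0.7, 0.4, 0.6])  # 3 sorted rollouts + 4 stats = 7 features
assert feats.shape == (7,)
```

A perfect predictor drives both terms to zero, while mean-collapse leaves the CCC term at its maximum, so the MLP can't cheat the way the fine-tuned encoders did.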
We tested MENTAT on four tasks: mathematical error detection, instruction following, pairwise RAG comparison, and essay grading. It showed consistent improvements, but it also revealed surprising trade-offs. On pairwise RAG comparison, GPT-4.1 outperformed GPT-5. GPT-5 "overthinks": its predictions cluster near the center of the scale, under-dispersed relative to ground truth, and more than half of examples yield identical rollouts across three samples. GPT-4.1 stays decisive, producing short judgments with better spread. Sometimes sophisticated reasoning is counterproductive.
We also tried RL fine-tuning (GRPO) on the instruction following task. It improved CCC over baseline prompting but degraded NMSE. The model learns relative discrimination among its rollouts but never sees a signal about absolute scale, so population-level mean and variance drift freely. It's like a judge who correctly orders contestants but scores everyone between 6.0-6.5 on a 1-10 scale.
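You can see why the absolute scale drifts by looking at GRPO's group-relative advantage: the learning signal is invariant to shifting or rescaling a group's rewards. A small sketch:

```python
import numpy as np

def group_advantages(rewards):
    # GRPO-style group-relative advantage: rewards are normalized within a
    # group of rollouts for the same prompt.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# The signal is (nearly) invariant to shifting and rescaling the rewards:
# a judge scoring everyone in 6.0-6.5 yields the same advantages as one
# using the full 1-10 scale, so nothing anchors the absolute scale.
r = np.array([0.2, 0.7, 0.9])
assert np.allclose(group_advantages(r), group_advantages(5 * r + 3))
```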
To recap: We define RiR as tasks requiring precise predictions, proper ranking, and deep per-instance reasoning. Standard methods struggle to balance all three. MENTAT helps by splitting reasoning (LLM) from calibration (MLP), but much headroom remains.
I've been thinking about where this field could go. The current MENTAT pipeline is: LLM reasons → MLP calibrates. Two separate systems, 3x inference cost, and the MLP only sees scalar rollouts (it throws away the reasoning trace).
What if we could unify them? Distill LLM reasoning into a small model, then fine-tune with regression-aware loss on the final numerical output. The small model learns why the score is what it is, and the loss function actually respects numeric distance. Single model, single pass, reasoning and calibration unified.
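As a toy illustration of what "respects numeric distance" could mean (an illustrative objective, not necessarily RAFT's exact loss):

```python
# An illustrative regression-aware objective: the expected squared error
# under the model's distribution over candidate numeric outputs.
# Cross-entropy would treat these two models identically (same mass on the
# target), but this loss prefers mass placed numerically close to it.

def expected_sq_error(probs, target):
    # probs: candidate numeric value -> probability
    return sum(p * (v - target) ** 2 for v, p in probs.items())

near = {6.0: 0.8, 7.0: 0.2}  # leftover mass near the target 7.0
far = {0.0: 0.8, 7.0: 0.2}   # same mass on the target, leftover far away
assert expected_sq_error(near, 7.0) < expected_sq_error(far, 7.0)
```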
Reasoning distillation is well-established (DeepSeek-R1 does essentially this at scale), and RAFT shows that a regression-aware loss works cleanly on small decoders with weight access. The interesting thing is that RAFT's gains over cross-entropy fine-tuning are real but modest on tasks like sentiment analysis, where a fine-tuned encoder already does reasonably well and reasoning depth isn't the limiting factor. I would be interested in seeing RAFT plus reasoning distillation on genuine RiR tasks, where the bottleneck is both calibration and reasoning depth.
I think there's rich room here. A clean question is: does regression-aware fine-tuning help when the task requires genuine reasoning, and does reasoning distillation make it help more? Unfortunately, exploring this question is infeasible until RiR benchmarks large enough to support it exist.
For more details on these four tasks and how we created them, developed MENTAT, and analyzed the method (including various ablations), please read the full paper on arXiv. If you have any questions or just want to chat about this stuff, feel free to email me!