AI evals are getting expensive fast — here’s why it matters

I’ve been watching the cost of AI evaluation creep up for a while, but a few recent numbers made me stop and actually pay attention.

The Holistic Agent Leaderboard (HAL) just burned through about $40,000 to run 21,730 agent rollouts across 9 models and 9 benchmarks. That’s not a typo. A single GAIA run on a frontier model can set you back more than $2,800 before caching even kicks in. Exgentic’s sweep across agent configurations hit $22,000 and found a 33× cost spread on identical tasks, which means scaffold choice alone can make or break your budget.

UK-AISI took it even further, scaling agentic steps into the millions just to study inference-time compute. And if you’re in scientific ML, The Well costs about 960 H100-hours to evaluate one new architecture, and nearly 4,000 H100-hours for a full four-baseline sweep.

These costs are higher than I expected, and they are changing who can actually do serious evaluation work.

The static benchmark days are over

Back in 2022, Stanford’s HELM paper showed API costs ranging from $85 for a small OpenAI model to $10,926 for AI21’s J1-Jumbo. Open models needed 540 to 4,200 GPU-hours, with BLOOM and OPT at the top end. Across HELM’s 30 models and 42 scenarios, the total came to roughly $100,000. That seemed like a lot at the time.

Then Perlitz et al. looked at EleutherAI’s Pythia checkpoints and found something wild: developers pay for evaluation repeatedly during model development. Pythia released 154 checkpoints for each of 16 models across 8 sizes, or 154 × 16 = 2,464 checkpoints in total. Running the LM Evaluation Harness across all of them turned eval into a multiplier on training. For small models, evaluation became the dominant compute line item across the whole development cycle. Perlitz et al. noted that evaluation costs “may even surpass those of pretraining when evaluating checkpoints.”

When you scale inference-time compute, you scale evaluation costs. It’s a compounding problem.

Perlitz et al. then asked how much of HELM actually carried the rankings. The answer was striking: a 100× to 200× reduction in compute preserved nearly the same ordering. Flash-HELM turned that into a coarse-to-fine procedure: run cheap evaluations first, then spend high-resolution compute only on the top candidates. Much of HELM’s compute was confirming rankings the field could have inferred much more cheaply.
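The coarse-to-fine idea fits in a few lines. This is a minimal sketch of the general pattern, not Flash-HELM’s actual implementation: the function names, the `evaluate` callback, and the default sample sizes are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List


def coarse_to_fine(
    models: List[str],
    evaluate: Callable[[str, int], float],
    cheap_n: int = 50,
    full_n: int = 1000,
    keep: int = 3,
) -> Dict[str, float]:
    """Score every model on a small sample, then re-score only the leaders.

    `evaluate(model, n)` is a hypothetical callback that returns a score
    from running `model` on `n` benchmark examples.
    """
    # Pass 1: cheap, low-resolution scores for every model.
    scores = {m: evaluate(m, cheap_n) for m in models}
    # Keep only the top `keep` candidates from the cheap pass.
    leaders = sorted(models, key=lambda m: scores[m], reverse=True)[:keep]
    # Pass 2: spend the expensive budget only on those candidates.
    for m in leaders:
        scores[m] = evaluate(m, full_n)
    return scores


def examples_used(num_models: int, cheap_n: int, full_n: int, keep: int) -> int:
    """Total examples evaluated by the two-pass procedure."""
    return num_models * cheap_n + min(keep, num_models) * full_n
```

With these made-up defaults, 30 models cost `examples_used(30, 50, 1000, 3)` = 4,500 example evaluations instead of the 30,000 a full pass would need. The real savings depend on how noisy the cheap pass is, which this sketch does not model.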

Other work reached the same conclusion from different angles. tinyBenchmarks compressed MMLU from 14,000 items to 100 anchor items at about 2% error using Item Response Theory. The Open LLM Leaderboard collapsed from 29,000 examples to 180. Anchor Points showed that as few as 1 to 30 examples could rank-order 87 language-model/prompt pairs on GLUE. Static benchmarks had a weakness you could exploit: model differences often concentrate in a small subset of items, so ranking can survive aggressive subsampling.
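To get a feel for why rankings can survive subsampling, here is a toy check on synthetic data. It uses plain random subsampling rather than the Item Response Theory anchor selection that tinyBenchmarks uses, and the model skill levels and item counts are invented for the example, so treat it as an intuition pump and not a reproduction of any of those results.

```python
import random
from itertools import combinations
from typing import List, Sequence


def kendall_tau(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall rank correlation (tau-a); tied pairs count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


def accuracy(results: List[List[int]], items: Sequence[int]) -> List[float]:
    """Per-model accuracy restricted to the chosen item indices."""
    return [sum(row[i] for i in items) / len(items) for row in results]


# Synthetic data: 10 models with made-up skill levels on 2,000 items.
rng = random.Random(0)
skills = [0.30 + 0.05 * k for k in range(10)]
results = [[int(rng.random() < s) for _ in range(2000)] for s in skills]

full = accuracy(results, range(2000))    # ranking signal from every item
subset = rng.sample(range(2000), 100)    # a random 5% of the items
small = accuracy(results, subset)        # ranking signal from the subset
tau = kendall_tau(full, small)
print(f"Kendall tau, 100 of 2,000 items: {tau:.2f}")
```

A tau near 1 means the 100-item ranking mostly agrees with the full one. Shrinking the skill gaps between the synthetic models is a quick way to see where the agreement starts to break down.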

That trick weakened sharply once benchmarks moved from static predictions to agents.

Agent evals are a different beast

HAL’s public accounting is the best I’ve seen. They run standardized agent harnesses across nine benchmarks covering coding, web navigation, science tasks, and customer service, with shared scaffolds and centralized cost tracking. The headline cost: $40,000 for 21,730 rollouts across 9 models and 9 benchmarks. By April 2026, the rollout count had grown to 26,597. Ndzomga’s independent reproduction lands in the same range: $46,000 across 242 agent runs.

Behind that aggregate, the cost of a single benchmark run varies by four orders of magnitude across HAL tasks, and by three orders within some individual benchmarks.

Part of the spread is a blunt pricing fact. Claude Opus 4.1 charges $15 per million input tokens and $75 per million output tokens. Gemini 2.0 Flash charges $0.10 and $0.40, a 150× gap on input alone. The rest comes from how the tokens get spent: agent benchmarks rarely benchmark “the model” in isolation. They benchmark a model × scaffold × token-budget product, and small scaffold choices can multiply costs 10×.
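The pricing arithmetic is easy to sketch. The per-million-token prices below are the ones quoted above; the `Pricing` class, the `rollout_cost` helper, and the token counts are my own hypothetical choices, and the calculation ignores caching and batch discounts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in US dollars per million tokens."""
    input_per_m: float
    output_per_m: float


# Prices quoted in the text above.
OPUS_4_1 = Pricing(input_per_m=15.00, output_per_m=75.00)
GEMINI_2_0_FLASH = Pricing(input_per_m=0.10, output_per_m=0.40)


def rollout_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one rollout, ignoring caching and batch discounts."""
    return (input_tokens * p.input_per_m + output_tokens * p.output_per_m) / 1_000_000


# Hypothetical rollout: 2M input tokens and 100k output tokens.
# These token counts are illustrative, not measured from any benchmark.
for name, p in [("Claude Opus 4.1", OPUS_4_1), ("Gemini 2.0 Flash", GEMINI_2_0_FLASH)]:
    print(f"{name}: ${rollout_cost(p, 2_000_000, 100_000):.2f}")
```

For that made-up rollout the same token budget comes to $37.50 on one model and $0.24 on the other. A scaffold that doubles the tokens per task doubles both figures, which is how the model × scaffold × token-budget product compounds.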

Worse, higher spend does not reliably buy better results. On Online Mind2Web, Browser-Use with Claude Sonnet 4 cost $1,577 for 40% accuracy. SeeAct with GPT-5 Medium hit 42% for $171. The HAL paper notes “a 9× difference in cost despite just a two-percentage-point difference in accuracy.” On GAIA, a HAL Generalist with o3 Medium cost $2,828 for 28.5% accuracy, while a different agent hit 57.6% for $1,686. CLEAR finds across 6 SOTA agents on 300 enterprise tasks that “accuracy-optimal configurations cost 4.4 to 10.8× more than Pareto-efficient alternatives” with comparable real-world performance.
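“Pareto-efficient” has a concrete meaning here: a run is on the frontier if no other run is at least as cheap and at least as accurate, and strictly better on one of the two. A minimal sketch using only the figures quoted above follows; the `Run` type and `pareto_frontier` function are my own names, not anything from HAL or CLEAR.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    agent: str
    cost_usd: float
    accuracy: float


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return runs that no other run beats on both cost and accuracy."""
    frontier = []
    for r in runs:
        dominated = any(
            o.cost_usd <= r.cost_usd
            and o.accuracy >= r.accuracy
            and (o.cost_usd < r.cost_usd or o.accuracy > r.accuracy)
            for o in runs
        )
        if not dominated:
            frontier.append(r)
    return frontier


# Figures quoted above for Online Mind2Web.
mind2web = [
    Run("Browser-Use + Claude Sonnet 4", 1577, 40.0),
    Run("SeeAct + GPT-5 Medium", 171, 42.0),
]
# Figures quoted above for GAIA; the second agent is not named in the text.
gaia = [
    Run("HAL Generalist + o3 Medium", 2828, 28.5),
    Run("other agent (unnamed)", 1686, 57.6),
]

for name, runs in [("Online Mind2Web", mind2web), ("GAIA", gaia)]:
    print(name, [r.agent for r in pareto_frontier(runs)])
```

On these four data points the frontier keeps only the cheaper run in each benchmark, because in both cases the cheaper run is also the more accurate one. A real leaderboard has many more points, and the frontier usually contains several of them.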

This is the kind of inefficiency that makes you wonder how many interesting experiments just never happen because the eval budget ran dry.

What this means for the field

Evaluation cost has crossed a threshold. Running serious evals at this scale increasingly requires institutional budgets or corporate sponsorship, and individual researchers and small labs are largely priced out of full agent sweeps. The irony is that the field is moving toward more complex, interactive benchmarks precisely when the cost of running them is exploding.

Compression techniques worked for static benchmarks because model differences concentrated in a small subset of items. Agent benchmarks are noisy, scaffold-sensitive, and only partly compressible. Training-in-the-loop benchmarks are expensive by construction, and when you try to add reliability to these evals, repeated runs further multiply the cost.
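The reliability multiplier is worth spelling out. If runs are independent, the standard error of a mean score shrinks with the square root of the number of runs, so halving the error bar takes four times the spend. Here is a rough sketch under that independence assumption; the run-to-run standard deviation is a hypothetical input, not a measured value from any benchmark.

```python
import math


def runs_needed(std_dev: float, target_se: float) -> int:
    """Independent runs needed so the standard error of the mean <= target_se.

    Uses SE = std_dev / sqrt(k), solved for k and rounded up.
    """
    return math.ceil((std_dev / target_se) ** 2)


def total_cost(cost_per_run: float, runs: int) -> float:
    """Total spend for repeating the same benchmark run `runs` times."""
    return cost_per_run * runs


# Hypothetical: scores vary by 4 percentage points (std dev) between runs.
# $171 is the cheapest Online Mind2Web run quoted earlier in this post.
for target in (2.0, 1.0):
    k = runs_needed(std_dev=4.0, target_se=target)
    print(f"SE <= {target} points: {k} runs, ${total_cost(171, k):,.0f}")
```

Under those assumptions, a 2-point standard error needs 4 runs ($684) and a 1-point standard error needs 16 runs ($2,736), so even the cheap configuration stops being cheap once you ask for tight error bars.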

I don’t have a neat solution here. But I think the community needs to acknowledge that evaluation cost is now a first-class constraint, not just an operational detail. If we don’t find ways to make agent evals cheaper — or at least more transparent about where the money goes — we’ll end up with a field where only the biggest players can afford to know if their models actually work.
