VAKRA: A Brutally Honest Look at Where AI Agents Actually Fail


IBM Research just dropped VAKRA, a benchmark that doesn’t mess around. It’s designed to test how AI agents handle multi-step workflows using real APIs and documents, and the results are honestly pretty humbling.

VAKRA stands for something I’m sure is very acronym-y, but what matters is what it does: it gives agents access to over 8,000 locally hosted APIs across 62 domains, backed by real databases and document collections. Tasks require 3-7 step reasoning chains. And models perform poorly. Not “could use improvement” poorly — genuinely, consistently bad.

Let’s talk about what VAKRA actually tests, because the task design is smarter than most benchmarks out there.

Capability 1: API Chaining with Business Intelligence APIs

This is the bread and butter of enterprise automation — chaining multiple tool calls together to answer a question. VAKRA includes 2,077 test instances across 54 domains for this capability alone.

The setup is clever. Each task starts with a get_data call that returns a lightweight preview of the data and configures the server to expose the right tools. So the agent has to figure out:

  • Which tool universe to query
  • How to chain filters correctly
  • When to call specialized getters vs. generic ones

The example in the paper asks which football team has specific build-up play stats. The answer is FC Barcelona. But getting there requires calling get_data, then three sequential select_data_equal_to calls, then get_team_name. That’s five tool calls for what looks like a simple question.
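To make that five-call chain concrete, here is a minimal sketch using in-memory stand-ins for the tools. The tool names come from the example above, but the signatures, the field names (speed, dribbling, passing), and every row of data are made up for illustration. The real VAKRA tools run against a server and will look different.

```python
# Hypothetical stand-ins for the tools named in the example.
# Signatures, field names, and rows are assumptions, not VAKRA's real API or data.
TEAMS = [
    {"team_name": "FC Barcelona", "speed": "Balanced", "dribbling": "Little", "passing": "Short"},
    {"team_name": "Example United", "speed": "Fast", "dribbling": "Little", "passing": "Short"},
    {"team_name": "Sample City", "speed": "Balanced", "dribbling": "Lots", "passing": "Long"},
]

# Three invented filter conditions, one per select_data_equal_to call.
FILTERS = [("speed", "Balanced"), ("dribbling", "Little"), ("passing", "Short")]


def get_data():
    """Entry point of the task: hand back the table to work on."""
    return list(TEAMS)


def select_data_equal_to(rows, key, value):
    """Keep only the rows whose `key` field equals `value`."""
    return [row for row in rows if row[key] == value]


def get_team_name(rows):
    """Specialized getter: pull the team names out of the remaining rows."""
    return [row["team_name"] for row in rows]


def run_chain():
    """Run the chain and record each tool call so the count is visible."""
    trace = ["get_data"]
    rows = get_data()
    for key, value in FILTERS:
        rows = select_data_equal_to(rows, key, value)
        trace.append("select_data_equal_to")
    names = get_team_name(rows)
    trace.append("get_team_name")
    return names, trace


names, trace = run_chain()
print(names, len(trace))
```

Each filter consumes the output of the previous one, so a wrong key or value in any single call changes everything downstream.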

What I like about this setup is the SLOT-BIRD vs. SEL-BIRD distinction. SLOT-BIRD gives you 7 generic tools for data manipulation. SEL-BIRD flattens categorical arguments into separate functions — so instead of sort_data(ascending=True), you get sort_data_ascending and sort_data_descending. This seems minor, but it fundamentally changes how models need to reason about tool selection.
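Here is a rough sketch of what that flattening could look like. The sort_data function and the flatten_categorical helper are my own illustration of the idea, assuming a simple list-of-dicts table. This is not how VAKRA generates its SEL-BIRD tools.

```python
from functools import partial


def sort_data(rows, key, ascending=True):
    """SLOT-style tool: one function, with the direction passed as an argument."""
    return sorted(rows, key=lambda row: row[key], reverse=not ascending)


def flatten_categorical(tool, arg_name, options):
    """SEL-style view: build one named tool per allowed value of `arg_name`.

    `options` maps a name suffix to the argument value it pins.
    """
    return {
        f"{tool.__name__}_{suffix}": partial(tool, **{arg_name: value})
        for suffix, value in options.items()
    }


SEL_TOOLS = flatten_categorical(
    sort_data, "ascending", {"ascending": True, "descending": False}
)

rows = [{"x": 2}, {"x": 3}, {"x": 1}]
print(sorted(SEL_TOOLS))
print(SEL_TOOLS["sort_data_descending"](rows, "x"))
```

The behavior is identical either way. What changes is where the decision lives: in the SLOT style the model picks an argument value, in the SEL style it picks a tool name from a longer list.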

Capability 2: Tool Selection with Dashboard APIs

This capability covers 1,597 instances across 17 domains, with tool sets ranging from 6 to 328 tools per domain and an average of 116. That’s a lot of options.

The APIs here are endpoint-style — highly specific, query-aligned endpoints that encapsulate most of the computation. Think REST APIs wrapped in MCP servers. The challenge isn’t chaining; it’s picking the right tool from a massive set.

There’s a practical constraint that I appreciate them calling out: the OpenAI API spec limits tool list input to 128 tools. So if your domain has 328 tools, you can’t even send them all at once. This is exactly the kind of real-world friction that benchmarks usually ignore.
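One naive way around the cap is to split the tool list into batches that each fit under it. The sketch below assumes the 128-tool limit quoted above and uses placeholder tool names. It is only one possible workaround (shortlisting tools with a retriever is another), and I am not claiming VAKRA or any particular agent framework does it this way.

```python
MAX_TOOLS_PER_REQUEST = 128  # the limit cited above; treated as an assumption here


def batch_tools(tools, limit=MAX_TOOLS_PER_REQUEST):
    """Split a tool list into consecutive batches of at most `limit` tools."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [tools[i:i + limit] for i in range(0, len(tools), limit)]


# Placeholder names standing in for the largest domain's 328 tools.
tools = [f"tool_{i}" for i in range(328)]
batches = batch_tools(tools)
print([len(batch) for batch in batches])
```

Batching keeps each request valid, but the model never sees the full tool set at once, which makes picking the right tool harder still.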

Capability 3: Document Retrieval with Business Documents

This one tests whether agents can actually find information in unstructured documents. 1,916 instances across 32 domains.

The twist here is that documents are “domain-aligned” — they’re related to the same domains as the APIs. So an agent might need to look up a policy document to understand which API parameters to use. This cross-referencing between structured and unstructured data is where a lot of real enterprise work happens, and it’s notoriously hard for current models.

Capability 4: Multi-step Agentic Workflows

1,510 instances across 48 domains. This is the hardest one — tasks that require combining API calls with document retrieval in a single workflow.

The paper doesn’t go into full detail on this capability, but the pattern is clear: agents need to plan, execute, retrieve, and adapt. One wrong step early in the chain cascades into failure later.

Where Agents Actually Fail

The failure analysis is the most useful part of this whole thing. Based on what they’ve observed:

Tool hallucination is the biggest problem. Models invent tools that don’t exist, or misname existing ones. This happens more with large tool sets — the model loses track of what’s available and starts guessing.
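A cheap guard against invented or misnamed tools is to check every requested name against the tools that actually exist before executing anything. The sketch below uses Python's standard difflib to suggest near matches. The resolve_tool_name helper and the 0.8 cutoff are my own assumptions and not part of VAKRA.

```python
import difflib


def resolve_tool_name(requested, available, cutoff=0.8):
    """Return (name, suggestions).

    `name` is the requested tool if it exists, otherwise None.
    `suggestions` lists close matches that can be fed back to the model.
    """
    if requested in available:
        return requested, []
    suggestions = difflib.get_close_matches(requested, available, n=3, cutoff=cutoff)
    return None, suggestions


available = ["get_data", "select_data_equal_to", "get_team_name"]

print(resolve_tool_name("get_team_name", available))   # exact match
print(resolve_tool_name("get_team_names", available))  # misnamed tool
print(resolve_tool_name("fetch_weather", available))   # invented tool
```

Rejecting the call and returning the suggestions gives the model a chance to correct itself instead of failing silently.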

Chain breakage is the second. Even when the first few calls are correct, models lose context and make wrong subsequent calls. The multi-step nature means errors compound.

Data misinterpretation is subtle but deadly. The get_data preview shows only the first 3 values per field. Models assume these are representative and make decisions based on incomplete information.
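A toy example shows how a three-value preview can mislead. The column-oriented table, the preview helper, and the values are all invented for illustration. The real get_data output format is not shown in the post.

```python
def preview(table, n=3):
    """Return only the first `n` values of each field, like a lightweight preview."""
    return {field: values[:n] for field, values in table.items()}


# Invented data: the first three rows happen to share the same value.
table = {"passing": ["Short", "Short", "Short", "Mixed", "Long"]}

seen = set(preview(table)["passing"])
actual = set(table["passing"])
missing = actual - seen

print(preview(table))
print(sorted(missing))
```

An agent that reads only the preview would conclude that every team plays a short passing game and might never filter on the other two values.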

Parameter binding failures are common too. Models pass wrong argument types, forget required parameters, or use values that don’t match the schema.
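All three of those mistakes can be caught before a call ever reaches the server. Here is a minimal validator sketch. The schema layout (Python types plus required and enum keys) and the schema for select_data_equal_to are assumptions of mine, not VAKRA's actual tool definitions.

```python
def check_arguments(schema, args):
    """Return a list of problems with `args`; an empty list means the call is valid."""
    problems = []
    for name, spec in schema.items():
        if name not in args:
            if spec.get("required", False):
                problems.append(f"missing required parameter: {name}")
            continue
        value = args[name]
        if not isinstance(value, spec["type"]):
            problems.append(
                f"wrong type for {name}: expected {spec['type'].__name__}, "
                f"got {type(value).__name__}"
            )
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"value not allowed for {name}: {value!r}")
    for name in args:
        if name not in schema:
            problems.append(f"unknown parameter: {name}")
    return problems


# Hypothetical schema for a filter tool; field names are illustrative only.
SCHEMA = {
    "key": {"type": str, "required": True, "enum": ["speed", "dribbling", "passing"]},
    "value": {"type": str, "required": True},
}

print(check_arguments(SCHEMA, {"key": "passing", "value": "Short"}))
print(check_arguments(SCHEMA, {"key": 3}))
print(check_arguments(SCHEMA, {"key": "shooting", "value": "Short"}))
```

Returning the full list of problems, instead of stopping at the first one, gives the model everything it needs to repair the call in a single retry.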

What’s refreshing about VAKRA is that it doesn’t let models cheat. The environment is executable — every trace is checked against ground truth. No partial credit for “close enough” answers.

The Takeaway

VAKRA confirms what anyone who’s tried to build a real agent knows: we’re not there yet. The gap between demo-quality single-turn chat and reliable multi-step execution is enormous.

The benchmark is available on GitHub and there’s a leaderboard. I’d encourage anyone working on agent frameworks to test their models against it. The failure modes VAKRA exposes are the same ones that will bite you in production.

Just don’t expect your model to score well. Mine didn’t.
