You Can’t Evaluate an Agent with a Static Test: Benchmarking with Veris AI
February 2026 | Veris AI
Standard LLM benchmarks test models at a point in time, treating them as functions. Real AI agents need to be tested on real actions: calling APIs, maintaining state, recovering from errors, all in the right order. Veris AI is a simulation platform for AI agent development. We ran our customer service agent, backed in turn by 7 different LLMs, through 100 card-replacement scenarios; the simulation exposed failure modes and behavioral differences between LLM providers that static evaluations could not detect.
Introduction
Conventional LLM benchmarks evaluate models in isolation: give a prompt (or a sequence of scripted prompts), then check the answer. But AI agents don’t work in isolation; they call tools, mutate backend state, depend on the results of previous turns, and must recover when steps fail. A benchmark that doesn’t reproduce these dynamics leaves large gaps in any reliability claim.
We built a customer support AI agent for a credit card replacement use case. At Veris, we benchmark agents on the Veris AI platform: a full-fidelity simulation with backend environments, live APIs, persistent state, and injectable failures. We connected each of the 7 LLMs listed below to this agent in turn and had it interact with 100 customer personas, grading every session against 41 checks across 8 categories. To measure consistency, each model ran the full scenario set three times, and each session was graded three times.
We found that the simulations caught failure modes that would be invisible to static evaluations: models that claim to freeze cards without calling the backend API, models that excel at information gathering but fail catastrophically at error recovery, and tool-call errors visible only in the model’s internal trace.
The Problem with Static Evaluations
AI agents operate differently from chatbots. They call APIs, read and write to databases, make decisions that depend on the results of previous actions, and must handle errors from external systems. A model that correctly explains the card-freeze procedure might still fail to call the freeze endpoint, or call it on the wrong card, or call it at the wrong time. These are execution failures, not knowledge failures, and static evaluations cannot detect them without extensive setup.
What Simulation Adds
Simulation-based evaluation gives agents a realistic environment to operate in. If the agent freezes the wrong card because it skipped the verification step, the persistent state managed by the Veris simulation exposes the error at grading time. If a tool call returns an error and the agent ignores it, the grader, which has access to the agent’s internal trace, flags the failure. Neither check is possible with static question-and-answer evaluation.
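To make the first check concrete, here is a minimal sketch of a state-based grader check. The `backend_state` and `transcript` objects and their fields are illustrative stand-ins, not the Veris API:

```python
# Hypothetical state-based grader check; names are illustrative.
def check_freeze_executed(backend_state, transcript) -> bool:
    """Pass only if the card the customer reported is actually frozen in
    the simulated backend -- claiming the freeze in conversation is not enough."""
    card = backend_state.cards[transcript.reported_card_id]
    return card.status == "frozen"
```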
Task
Card replacement is a good use case because it’s a real workflow with sequential dependencies (verify identity → find card → freeze → replace → confirm delivery), compliance constraints, and routine error conditions. It’s narrow enough to evaluate rigorously but complex enough that all models currently fail in diverse and interesting ways.
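Those sequential dependencies can be expressed as an ordered-steps check. A sketch, with assumed step names rather than actual Veris identifiers:

```python
# Illustrative ordered-dependency check; step names are assumptions.
REQUIRED_STEPS = ["verify_identity", "lookup_card", "freeze_card",
                  "order_replacement", "confirm_delivery"]

def steps_in_order(tool_calls: list[str]) -> bool:
    """True if every required step was called, in workflow order."""
    positions = [tool_calls.index(s) for s in REQUIRED_STEPS if s in tool_calls]
    return len(positions) == len(REQUIRED_STEPS) and positions == sorted(positions)
```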
We evaluated 7 LLMs spanning different sizes, providers, and design philosophies: DeepSeek V3.2, GPT-4o, SmolLM3-3B (3B parameters), Kimi-K2, GPT-5 (Azure), GPT-OSS-120b (open-weight), and Grok-3-fast. Their overall pass rates are within 5 points of each other (88–94%). But at the category level, the spread is 66 points. That gap is entirely invisible to static evaluation — and it determines whether a deployment succeeds or fails.
Methods
Each benchmark session is a full simulated support interaction: a customer contacts the agent about a card issue, and the agent must navigate the conversation from start to resolution using real backend systems. The Veris simulator provides account records, card databases, freeze/unfreeze endpoints, and delivery scheduling APIs. The agent must use these tools to accomplish tasks — describing an action in conversation does not count as performing it.
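A minimal sketch of what one such session loop might look like, assuming a hypothetical simulator client; the method names are illustrative, not the platform’s actual API:

```python
# Sketch of one benchmark session against a simulated backend.
MAX_TURNS = 40  # assumed cap, to bound runaway conversations

def run_session(agent, simulator, scenario):
    sim = simulator.start(scenario)          # fresh backend state per session
    message = scenario.opening_message
    for _ in range(MAX_TURNS):
        action = agent.step(message, tools=sim.tool_schemas)
        if action.is_tool_call:
            # Tool calls hit the simulated backend and mutate its state.
            message = sim.call_tool(action.name, action.args)
        else:
            # Plain replies go to the simulated customer persona.
            message = sim.customer_reply(action.text)
        if sim.resolved:
            break
    return sim.state, sim.transcript
```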
| Metric | Value |
| --- | --- |
| Models Tested | 7 |
| Runs / Model | 3 |
| Sessions / Model | ~300 |
| Checks / Session | 41 |
| Evals / Run | 3 |
Scenarios: 100 customer scenarios cover a range of situations — lost cards, stolen wallets, damaged chips, expired cards, wrong addresses, accounts with multiple cards, and edge cases like cards already in transit. Each scenario defines the customer’s profile, the state of their account, and their opening message.
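A scenario definition might look like the following; the field names are assumptions based on the description above, not the actual Veris schema:

```python
# Hypothetical scenario record.
from dataclasses import dataclass

@dataclass
class Scenario:
    customer_profile: dict   # name, contact details, verification answers
    account_state: dict      # cards on file, statuses, delivery address
    opening_message: str     # the customer's first message

stolen_wallet = Scenario(
    customer_profile={"name": "J. Doe", "verified": False},
    account_state={"cards": [{"id": "c1", "status": "active"}]},
    opening_message="My wallet was stolen. I need to block my card.",
)
```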
Stateful simulation: The simulator maintains a persistent backend state throughout each conversation. If the agent freezes a card in step 2, that card stays frozen for the rest of the session. This catches models that claim to have taken an action without actually calling the API — a common failure mode in production.
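The mechanism is simple: tool endpoints mutate backend state that later checks read back. A sketch, with an illustrative class rather than the real simulator:

```python
# Why persistence matters: the freeze endpoint writes state that
# survives to grading time. Class and fields are illustrative.
class SimulatedCardBackend:
    def __init__(self, cards: list[dict]):
        self.cards = {c["id"]: c for c in cards}

    def freeze_card(self, card_id: str) -> dict:
        self.cards[card_id]["status"] = "frozen"   # persists for the session
        return {"ok": True, "card_id": card_id}

# If the agent only *says* it froze the card, cards[card_id]["status"]
# is still "active" when the grader inspects the backend, and the check fails.
```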
Multiple runs: Each model ran the full 100-scenario set 3 independent times, producing ~300 sessions per model. Running multiple times lets us measure whether a model’s performance is stable or varies significantly between attempts.
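One simple way to quantify that stability (our own illustrative formulation, not necessarily the platform’s):

```python
# Illustrative stability metric: spread of per-run pass rates.
def run_spread(per_run_pass_rates: list[float]) -> float:
    """Max-minus-min pass rate across independent runs; smaller = more stable."""
    return max(per_run_pass_rates) - min(per_run_pass_rates)

run_spread([91.0, 89.5, 92.0])  # -> 2.5 points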
Grading: Each session is scored on 41 behavioral checks grouped into 8 categories: data accuracy, delivery confirmation, error handling, freeze handling, information gathering, replacement execution, scope management, and status updates. Every check is binary — pass, fail, or not applicable (for checks that don’t apply to a given scenario). Each session is graded 3 times by a GPT-5 judge.
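A sketch of how a graded session might be represented, with assumed names:

```python
# Hypothetical grading record: each of the 41 checks gets one of three
# outcomes from each of the 3 judge passes.
from enum import Enum

class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NA = "n/a"      # the check doesn't apply to this scenario

# One session: check_id -> the 3 outcomes from independent judge passes.
SessionGrades = dict[str, tuple[Outcome, Outcome, Outcome]]
```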
Consistency filtering: For each check, we measure grader consistency across all applicable sessions (~300 per model). Checks where the 3 evaluations agree less than 80% of the time are discarded from all charts and tables — the grader signal is too noisy to trust. Checks above the 80% threshold are included as-is; some individual sessions may have inconsistent grades, but the check-level signal is reliable enough to report. Discarded checks are noted in the data so they can be revisited with improved grader prompts.
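A minimal sketch of the filter, assuming grades arrive as one list of three judge outcomes per applicable session:

```python
# Consistency filter sketch: a check is kept only if the 3 judge passes
# agree on at least 80% of applicable sessions. Data layout is assumed.
def check_consistency(session_grades: list[list[str]]) -> float:
    """Fraction of applicable sessions where all 3 judge passes agree."""
    applicable = [g for g in session_grades if "n/a" not in g]
    if not applicable:
        return 1.0
    unanimous = sum(len(set(g)) == 1 for g in applicable)
    return unanimous / len(applicable)

def keep_check(session_grades: list[list[str]], threshold: float = 0.80) -> bool:
    """Discard the check from all charts if agreement falls below threshold."""
    return check_consistency(session_grades) >= threshold
```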
Results
Most of our scenarios could not be executed by a static eval at all: their complexity and temporal dependencies require a simulation-based methodology. After each scenario runs, a structured grading pass scores the session and surfaces the issues the simulation exposed.
In aggregate, the seven models look roughly comparable: overall pass rates range from 88.2% to 93.5%, a 5-point spread. The category breakdown exposes what that aggregate hides. Error handling alone shows a 66-point spread between the best and worst model. These are the per-model differences that determine whether a deployment works.
Model Performance Matrix
| Model | Pass Rate | N/A Rate | Key Strength |
| --- | --- | --- | --- |
| deepseek-ai/DeepSeek-V3.2 | 93.5% | 55.9% | All-round leader |
| openai/gpt-4o | 91.1% | 62.2% | Delivery confirmation |
| HuggingFaceTB/SmolLM3-3B | 90.6% | 60.5% | Low-cost contender |
| moonshotai/Kimi-K2 | 90.3% | 61.9% | Delivery & scope |
| gpt-5 (azure) | 90.2% | 63.1% | Reliable mid-tier |
| openai/gpt-oss-120b | 89.3% | 59.5% | Open-weight option |
| grok-3-fast | 88.2% | 55.9% | Status & replacement |
Category Breakdown
Green (≥90%) = strong; Yellow (75–90%) = adequate; Orange (50–75%) = weak; Red (<50%) = failure mode.

| Category | DeepSeek | GPT-4o | SmolLM3 | Kimi | GPT-5 | GPT-OSS | Grok |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Data Accuracy | 96.5% | 96.1% | 95.9% | 95.4% | 95.2% | 95.0% | 92.7% |
| Delivery Confirm. | 82.9% | 91.7% | 87.9% | 89.7% | 88.7% | 75.7% | 61.7% |
| Error Handling | 80.6% | 73.0% | 76.8% | 67.2% | 68.4% | 66.7% | 14.7% |
| Freeze Handling | 90.8% | 90.2% | 89.2% | 88.4% | 88.2% | 88.5% | 95.5% |
| Info Gathering | 98.3% | 94.9% | 94.7% | 94.9% | 94.8% | 92.5% | 94.4% |
| Replacement Exec. | 94.5% | 74.4% | 75.5% | 75.0% | 74.2% | 87.7% | 95.3% |
| Scope Mgmt | 96.0% | 96.9% | 97.3% | 96.2% | 96.6% | 92.2% | 97.8% |
| Status Updates | 87.4% | 82.5% | 84.8% | 83.3% | 83.7% | 82.4% | 92.2% |
Model Differences
1. Error recovery requires a live environment to test
Error handling is the hardest category for every model, and the one most invisible to static evaluation. Static benchmarks don’t have APIs that fail, so they can’t test how an agent recovers when a tool call returns an error. In the simulation, DeepSeek leads at 80.6%, which would be weak in any other category. Grok-3-fast scores 14.7%, failing 85% of error-handling checks. This 66-point spread within a single category dwarfs the 5-point spread in overall pass rate, and it’s entirely undetectable without a live backend that can produce real failures.
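This is what failure injection looks like in practice. An illustrative sketch of a transient tool failure, the kind of condition a static benchmark has no way to produce; names and error codes are assumptions:

```python
# Illustrative transient-failure injection for the freeze endpoint.
def flaky_freeze_endpoint(card_id: str, state: dict) -> dict:
    """Fail the first call; the agent should surface or retry the error,
    not claim success."""
    if not state.get("tripped"):
        state["tripped"] = True
        return {"ok": False, "error": "SERVICE_UNAVAILABLE"}
    return {"ok": True, "card_id": card_id, "status": "frozen"}
```

The grader then checks whether the agent acknowledged the error and retried or escalated, rather than telling the customer the card was frozen.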
2. Conversational ability ≠ tool-use ability
GPT-4o posts the best delivery confirmation score of any model (91.7%). It’s excellent at the conversational aspects of support: confirming details, gathering information, explaining timelines. But its replacement execution (74.4%) is the weakest in the field. It talks about replacements better than it executes them. A static eval that tests “does the model know the replacement procedure?” would rate GPT-4o highly. The simulation reveals that knowing the procedure and executing it are different skills.
3. Overall scores are misleading without category breakdown
SmolLM3-3B (3 billion parameters) matches GPT-5 on overall pass rate: 90.6% vs. 90.2%. A conventional leaderboard would rank them as equivalent. The simulation shows they’re not: SmolLM3 is stronger on data accuracy (95.9% vs. 95.2%) and replacement execution (75.5% vs. 74.2%), but weaker on delivery confirmation (87.9% vs. 88.7%). The “right” model depends on which categories matter for your deployment, a distinction that aggregate scores erase.
4. State persistence catches hallucinated actions
DeepSeek V3.2 leads overall at 93.5%, 2.4 points ahead of GPT-4o. One reason: it leads on both information gathering (98.3%) and replacement execution (94.5%). Because the simulator maintains a persistent state, the grader can verify that DeepSeek actually called the replacement API, not just that it told the customer a replacement was submitted. Models that “hallucinate” confirmations without executing tool calls are penalized. Static benchmarks, which have no backend to verify against, cannot make this distinction.
5. A model can be best and worst simultaneously
Grok-3-fast is the most dramatic illustration of why simulation matters. It leads on freeze handling (95.5%), replacement execution (95.3%), scope management (97.8%), and status updates (92.2%). Remove error handling from the calculation and Grok ranks first overall. But its error handling score is 14.7%, meaning it fails 85% of the time when an API returns an error. A static benchmark would never surface this split personality because it has no mechanism to inject tool failures and observe the agent’s response.
Performance Profiles
The radar chart (in the HTML version) is the kind of output that simulation makes possible. Each polygon represents a model’s pass rate across all 8 behavioral categories. The shapes are strikingly different — something no single aggregate score or static benchmark can capture.
Pass rate = true / (true + false) x 100, excluding N/A checks. Checks with <80% grader consistency are discarded. Data: 3 runs x 100 scenarios x 3 evals per model, GPT-5 judge.
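The same formula in code form, with assumed outcome labels:

```python
# Pass rate over applicable checks, matching the footnote's formula.
def pass_rate(outcomes: list[str]) -> float:
    applicable = [o for o in outcomes if o != "n/a"]
    return 100 * sum(o == "pass" for o in applicable) / len(applicable)
```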