October 28, 2025

Dynamic Benchmarking: Evaluate AI Agents through Environments, not Datasets

Andi Partovi

Introduction

In our previous post, Building an Agent? You Need an Environment, we laid out why simulation environments are an essential part of the stack for agent development. We argued that agents don’t live in neat, fully observable boxes, but rather in messy systems with unpredictable users, interconnected tools, and cascading consequences. To build high-value, action-based agents, you need an environment where your agent can practice, fail safely, and improve before going live.

But once you have an environment, how should you evaluate your agent inside it? The static evaluations of classic software engineering (golden datasets, assertions, and fixed input/output benchmarks) and of classic ML (held-out evaluation and test sets) are no longer enough for dynamic systems like AI agents. Agents are nondeterministic systems living in nondeterministic environments. They act over time, adapt to context, branch based on tool behavior, and respond to unexpected user inputs. Evaluating them requires a shift from static benchmarks to dynamic, environment-driven evaluation.

This post explores that shift. We’ll examine static vs. dynamic evaluation, explain why static benchmarks break down for agents, and show how evaluation environments, not evaluation datasets, are the future of measuring agent reliability.

Static Evaluations: The Old Standard

Static evaluations rely on fixed inputs and predetermined outputs. They assume a predictable, deterministic system where the “correct” answer is known ahead of time.

In software engineering, this looks like a unit test or assertion:

  # Static check: the output must exactly match a pre-specified golden label.
  assert process_invoice(invoice) == "payment_submitted"

The code either passes or fails based on whether the output matches the expected string.

In machine learning, it’s a benchmark dataset: given an image, the model must return the correct class (e.g., ImageNet), or given a question, the model must extract the correct text span (e.g., SQuAD).
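Before looking at why this breaks down, here is a minimal sketch of what static benchmark evaluation looks like; model and benchmark are hypothetical placeholders, not any specific framework:

  # Minimal sketch of static benchmark evaluation: fixed inputs, fixed gold
  # labels, one aggregate score. `model` and `benchmark` are placeholders.
  def evaluate_static(model, benchmark: list[tuple[str, str]]) -> float:
      correct = 0
      for example_input, gold_label in benchmark:
          prediction = model(example_input)          # one-shot prediction
          correct += int(prediction == gold_label)   # exact-match check
      return correct / len(benchmark)                # accuracy over the fixed dataset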

These approaches work when the system is static and predictable. But in agentic systems, where an AI might reason, call APIs, ask users for clarification, or adapt to changing state, the “correct” answer cannot be pre-specified. It often depends on how the interaction unfolds.

Why Do Static Tests Break Down?

Imagine an enterprise AI assistant tasked with paying an invoice.

A test scenario might assume the following path:

  1. Check the invoice from an email.
  2. Verify the vendor ID in the ERP.
  3. Submit the payment.

But real-world conditions often break this linear flow. For example:

  • The ERP flags the vendor as inactive, so the agent escalates the invoice to the finance team instead of paying it.
  • The invoice is missing a field, so the agent emails the vendor for more information.
  • All information is complete, but the vendor updates the invoice mid-process via email, requiring the agent to restart verification.

All of these outcomes are reasonable and even desirable. Yet a static test, defined as invoice + email → payment_submitted, would mark every one of them as failures, because the output deviates from the golden label. This reveals the core flaw: static evaluation penalizes correct behavior when the correct behavior depends on context.
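To see the problem in code, run the same assertion against one of these contextually correct outcomes; process_invoice and its return value are illustrative placeholders:

  # Hypothetical run: the ERP flagged the vendor as inactive, so the agent
  # correctly escalated instead of paying. The static test still fails.
  result = process_invoice(invoice)       # returns "escalated_to_finance"
  assert result == "payment_submitted"    # AssertionError on correct behavior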

Building Dynamic Evaluations

Agents operate inside environments. They make sequential decisions, branch into alternative paths, and react to evolving conditions. Evaluating them requires us to judge not just the final output but the trajectory of the interaction: did the agent make good decisions at each step, given the circumstances?

We argue that this kind of evaluation is impossible without an evaluation environment. Dynamic evaluation environments replace static datasets for agents.

Returning to our example, a proper evaluation environment would include:

  • An ERP system that can mark the vendor as active or inactive.
  • Simulated users who might send additional emails or change requirements mid-flow.
  • Payment systems with realistic failure modes and timing issues.

In such an environment, evaluation is no longer a simple equality check. Instead, the system compares the trace of the agent’s behavior against the ground-truth state of the environment. If the ERP marks a vendor inactive and the agent escalates to a human, the evaluator can verify that decision against the environment state and mark it as a success, even though the output diverged from the expected string in a static benchmark.
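As a rough sketch, assuming a simple dictionary representation of the environment state and a list-of-steps trace (neither of which is a real Veris API), such a check might look like:

  # Verify the agent's decision against the environment's ground-truth state
  # rather than against a fixed expected output.
  def verify_invoice_handling(trace: list[dict], env_state: dict) -> bool:
      vendor_active = env_state["erp"]["vendor_status"] == "active"
      final_action = trace[-1]["action"]          # last step the agent took
      if vendor_active:
          return final_action == "submit_payment"
      # Inactive vendor: escalating to a human is the correct behavior.
      return final_action == "escalate_to_finance"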

Components of a Dynamic Evaluation Environment

Building an evaluation environment is significantly more complex than building an evaluation dataset. It requires simulating a world with consistent internal state, agency, and uncertainty, and it needs to track the behavior of every tool and user. As we discussed in our previous post, a simulation environment provides all of these features and is the strong foundation a dynamic evaluation environment needs (a rough code sketch follows the list below):

  • Internal simulation state tracking: A central simulation engine tracks the evolving state of all entities (tools, data stores, users, and external systems) throughout the agent’s operation.
  • Simulated users interacting with the agent: The environment must model user behavior (requests, interruptions, delays, misunderstandings, changes of mind, etc.).
  • Simulated tools interacting with the agent: The environment must model tool behavior (state changes, incomplete answers, latencies, errors).
  • Rich traces of agent behavior captured through the simulation: Every action, decision, tool call, response, and branch taken by the agent is logged.
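Sketched as a data structure, and assuming nothing about any particular framework, these pieces might look roughly like this:

  from dataclasses import dataclass, field
  from typing import Any, Callable

  # Hypothetical sketch of a simulation environment's core pieces.
  @dataclass
  class SimulationEnvironment:
      state: dict[str, Any]                            # ground-truth state of every entity
      simulated_users: list[Callable[[dict], dict]]    # scripted user behaviors
      simulated_tools: dict[str, Callable[..., Any]]   # tool behaviors, incl. errors and latency
      trace: list[dict] = field(default_factory=list)  # every agent action, tool call, response

      def log(self, event: dict) -> None:
          """Record one step of agent behavior for later verification."""
          self.trace.append(event)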

So the simulation environment can provide both the ground truth (the internal simulation engine) and the actual flow (the traces). However, two elements are still missing to make the evaluation environment complete:

  • Realistic scenarios with uncertainty and branching outcomes.
  • Verifiers checking the trace against the ground truth.

Scenarios are needed to create diversity in the simulation, uncovering edge cases, complex situations, and adversarial conditions in both tool and user behavior. They guide the agent’s self-play in the environment toward real-world cases, ensuring coverage of the behaviors that matter in production.
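Purely as an illustration, a scenario could be expressed as a small configuration that seeds the simulation with a specific situation; all keys and values here are hypothetical:

  # Hypothetical scenario definition: it seeds the environment state and
  # scripts the user and tool behavior the agent must handle.
  scenario = {
      "name": "inactive_vendor_escalation",
      "initial_state": {"erp": {"vendor_status": "inactive"}},
      "user_script": ["send_invoice_email", "update_invoice_mid_process"],
      "tool_faults": {"payment_api": "reject_inactive_vendor"},
      "expected_behavior": "escalate_to_finance",   # what the verifier checks for
  }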

Verifiers judge the correctness of the agent’s actions by comparing the trace to the ground truth. They come in three main types:

  • Human verifiers: Domain experts reviewing logs. Most accurate but costly and unscalable.
  • LLM verifiers: Large language models judging behavior at scale. Scalable but potentially less precise.
  • Objective verifiers: Automated checks that validate outcomes based on task-specific criteria. These are ideal but currently limited to domains like coding, math, or well-specified workflows.

The best dynamic environments provide objective verification tailored to the task, something that has traditionally been hard to achieve outside of domains like math and coding.
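One way to picture this, as a sketch rather than a real API, is a shared verifier contract that human, LLM, and objective verifiers can all implement (reusing the hypothetical invoice check from earlier):

  from typing import Protocol

  # Sketch only: every verifier type answers the same question, so they can
  # be mixed and matched per task.
  class Verifier(Protocol):
      def verify(self, trace: list[dict], ground_truth: dict) -> bool: ...

  class ObjectiveVerifier:
      """Automated, task-specific rule (e.g. the invoice check sketched above)."""
      def verify(self, trace: list[dict], ground_truth: dict) -> bool:
          return verify_invoice_handling(trace, ground_truth)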

Together, a simulation environment, realistic scenarios, and verifiers form the dynamic evaluation environment needed to test and verify AI agents robustly and thoroughly.

Closing Thoughts

Static evals were enough when AI systems were static. But as we move into the age of agents, evaluation must evolve too. Dynamic evaluation, powered by environments, ensures agents aren’t just “right” in theory but robust in practice. Evaluating agents requires a shift from checking answers to assessing decisions, from datasets to environments. For teams building production-grade agents, this shift is not optional; it’s inevitable.

At Veris AI, we provide such environments so teams can continuously test and harden agents, not against a static dataset, but against life-like, evolving situations. This enables validating not just “did it produce the right answer?” but “did it do the right thing given what actually happened?” Customers use Veris for A/B testing agents during development, for regression testing agents as part of their CI/CD process, and for improving them with reinforcement fine-tuning.

For technical documentation, deployment guides, or to schedule a platform demonstration, contact hello@veris.ai or veris.ai/demo.