If your company has an LLM-powered agent talking to customers, you need to know how it's doing. Is it giving accurate information? Is it using its tools correctly? Is it following your business rules?
To answer these questions, you need to look at what actually happened: the full conversation between the agent and the user, plus every tool call the agent made along the way (database lookups, API calls, status changes). This complete record is called a trace.
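For concreteness, here is a sketch of what a trace for a support agent might look like; the field names, IDs, and message shapes are illustrative, not a fixed schema.

```python
# Hypothetical trace: the full conversation plus every tool call, in order.
# Field names and IDs are illustrative only.
trace = [
    {"role": "user", "content": "I lost my card, can you freeze it?"},
    {"role": "assistant", "tool_call": {"name": "set_card_status",
                                        "args": {"card_id": "card_993", "status": "frozen"}}},
    {"role": "tool", "name": "set_card_status", "content": {"ok": True}},
    {"role": "assistant", "content": "Done. Your card ending in 0993 is frozen. "
                                     "Want me to order a replacement?"},
]
```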
The problem is that reading through traces manually is brutal. A single conversation might involve dozens of messages and tool calls, many of which contain technical details like anonymized IDs, status codes, and timestamps. An LLM can parse this faster and more reliably than a human, especially at scale.
So the standard approach is to send each trace to another LLM, a "judge," and ask it to grade the agent's performance. Rate it 1–5 on accuracy. Rate it 1–5 on helpfulness. This is called LLM-as-a-judge, and it's how most teams evaluate their agents today.
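A minimal sketch of that pattern, assuming the OpenAI Python client; the prompt wording, the model name, and the JSON shape are illustrative, not a prescribed setup.

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a customer-support agent.\n"
    "Read the trace below and rate the agent 1-5 on accuracy and 1-5 on helpfulness.\n"
    'Reply as JSON: {"accuracy": <1-5>, "helpfulness": <1-5>}\n\n'
    "Trace:\n"
)

def judge(trace) -> dict:
    # One judge call per trace. The judge is itself an LLM, so repeated calls can disagree.
    response = client.chat.completions.create(
        model="gpt-4o",  # example model; any capable judge model works
        messages=[{"role": "user", "content": JUDGE_PROMPT + json.dumps(trace, indent=2)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```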
But can you trust those scores? A customer asked us exactly that. LLM-as-a-judge is built on an LLM, and LLMs are probabilistic; there's inherent randomness in every call. So we ran a quick test: the same grader, the same traces, multiple runs. Some graders were consistent; others were basically a coin flip.
That quick test raised more questions than it answered. Why were some graders consistent and others not? What made the difference? To find out, we decided to get rigorous: a controlled experiment with real statistics, measuring exactly how much variance different approaches produced.
TL;DR: we learned how to create consistent LLM-as-a-judge graders, and we open-sourced a new Veris Skill to help you. Hand the skill off to your coding agent and start building useful graders today.
Get the skill on GitHub here
Hypothesis: The inconsistent graders seemed to have one thing in common: their criteria were open to interpretation. Questions like "did the agent handle this well?" require the judge to decide what "well" means. We suspected that narrowing the questions to specific, atomic failure checks would leave less room for interpretation and produce more consistent results.
To test this, we needed an agent with enough complexity to exercise a range of grading criteria. We built a credit card support agent, the kind that helps customers check balances, freeze lost cards, and request replacements. It has access to tools for looking up account information, changing card status, and initiating replacements.
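As an illustration, the agent's tools might be declared with function-calling schemas along these lines; the names and parameters here are our own sketch, not the exact definitions used in the experiment.

```python
# Illustrative tool schemas for the credit card support agent; names and fields are assumptions.
TOOLS = [
    {"type": "function", "function": {
        "name": "lookup_account",
        "description": "Fetch account and card details for a customer.",
        "parameters": {"type": "object",
                       "properties": {"customer_id": {"type": "string"}},
                       "required": ["customer_id"]}}},
    {"type": "function", "function": {
        "name": "set_card_status",
        "description": "Change a card's status, e.g. freeze a lost card.",
        "parameters": {"type": "object",
                       "properties": {"card_id": {"type": "string"},
                                      "status": {"type": "string", "enum": ["active", "frozen"]}},
                       "required": ["card_id", "status"]}}},
    {"type": "function", "function": {
        "name": "initiate_replacement",
        "description": "Order a replacement card shipped to the address on file.",
        "parameters": {"type": "object",
                       "properties": {"card_id": {"type": "string"}},
                       "required": ["card_id"]}}},
]
```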
We then built two graders for this agent. Both evaluate the same eight categories, covering areas like information gathering, error handling, status updates, and replacement execution.
The difference is how the checks are framed: the success-oriented grader asks whether the agent did things well, while the failure-mode grader asks whether specific, observable failures occurred.
The success-oriented grader is not a strawman. It reflects how a competent practitioner would naturally write checks when thinking "did the agent do this well?": one holistic check per category that lists the relevant best practices and asks whether they were followed.
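To make the contrast concrete, a success-oriented check for one category might look roughly like this; the wording is ours, reconstructed from the description above.

```python
# Hypothetical success-oriented check: one holistic question per category,
# bundling several best practices into a single judgment call.
error_handling_check = {
    "category": "error_handling",
    "question": (
        "Were errors handled gracefully? The agent should acknowledge any tool "
        "failure, explain it to the customer in plain language, and attempt a "
        "reasonable recovery. Answer PASS or FAIL."
    ),
}
```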
Methodology: We ran each grader 10 times on the same 30 conversation traces. We measured consistency as the percentage of traces that received identical results across all 10 runs.
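In code, that consistency metric is simple to compute; a sketch, assuming a `grade(trace)` function that returns a hashable result such as a tuple of per-check pass/fail values.

```python
def consistency(traces, grade, runs: int = 10) -> float:
    """Share of traces whose grading result is identical across every run."""
    identical = 0
    for trace in traces:
        results = {grade(trace) for _ in range(runs)}  # distinct outcomes observed
        if len(results) == 1:  # all runs agreed
            identical += 1
    return identical / len(traces)
```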
The failure-mode grader achieves 94% overall consistency. The success-oriented grader achieves 66%. That is a 28 percentage point improvement from changing how the questions are framed.
The gap is largest in categories requiring nuanced judgment. For Error Handling, consistency jumped from 43% to 98%. Status Update Handling: 53% to 95%. Replacement Execution: 60% to 100%.
Why such a large difference? Holistic checks bundle multiple criteria into one judgment call. "Were errors handled gracefully?" (43% consistent) requires the model to weigh acknowledgment, explanation, and recovery all at once. "Did the agent ignore a tool error?" (98% consistent) asks one question with one answer. Narrower questions have narrower interpretation spaces.
The pattern holds across all categories. Even the worst failure-mode category (Information Gathering at 86.7%) has higher consistency than seven of eight success-oriented categories.
So what made the failure-mode grader more consistent? It wasn't the number of checks (42 vs 8). It was how the questions were framed.
Frame every check as detecting a specific failure, not the presence of success.
A few principles make this work: each check targets one specific failure, asks a single question with a yes-or-no answer, and fires only when that failure is actually observable in the trace.
A side effect: when checks are this specific, debugging becomes trivial. A check called "agent_hallucinated_customer_phone_number" tells you exactly what went wrong.
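Put together, a single failure-mode check might look like this; a sketch following the pattern above, with a schema of our own rather than a required format.

```python
# Hypothetical atomic failure-mode check: one named failure, one yes/no question.
check = {
    "name": "agent_hallucinated_customer_phone_number",
    "category": "information_accuracy",
    "question": (
        "Did the agent state a phone number for the customer that does not appear "
        "anywhere in the trace (account data or tool results)? Answer YES or NO."
    ),
    "fails_if": "YES",
}
```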
Consistency also unlocks optimization. Reinforcement learning, prompt tuning, and regression testing all require stable reward signals. A grader that flips between pass and fail on the same trace is useless for training or comparison. Consistent graders let you trust that score changes reflect real improvements.
Writing good failure-mode checks is tedious. We built a Claude Code skill to help. Give it your agent's system prompt and tools, and it will generate a failure-mode grader, complete with categories, atomic checks, and prompts following the patterns from this research. The output is compatible with both the Veris platform and the OpenAI Evals API.
Try the Veris Skill here