If your company has an LLM-powered agent talking to customers, you need to know how it's doing. Is it giving accurate information? Is it using its tools correctly? Is it following your business rules?
To answer these questions, you need to look at what actually happened: the full conversation between the agent and the user, plus every tool call the agent made along the way (database lookups, API calls, status changes). This complete record is called a trace.
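For concreteness, here is a sketch of what a trace for a support agent might look like; the field names, IDs, and message shapes are illustrative, not a fixed schema.

```python
# Hypothetical trace: the full conversation plus every tool call, in order.
# Field names and IDs are illustrative only.
trace = [
    {"role": "user", "content": "I lost my card, can you freeze it?"},
    {"role": "assistant", "tool_call": {"name": "set_card_status",
                                        "args": {"card_id": "card_993", "status": "frozen"}}},
    {"role": "tool", "name": "set_card_status", "content": {"ok": True}},
    {"role": "assistant", "content": "Done. Your card ending in 0993 is frozen. "
                                     "Want me to order a replacement?"},
]
```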
The problem is that reading through traces manually is brutal. A single conversation might involve dozens of messages and tool calls, many of which contain technical details like anonymized IDs, status codes, and timestamps. An LLM can parse this faster and more reliably than a human, especially at scale.
So the standard approach is to send each trace to another LLM, a "judge," and ask it to grade the agent's performance. Rate it 1–5 on accuracy. Rate it 1–5 on helpfulness. This is called LLM-as-a-judge, and it's how most teams evaluate their agents today.
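A minimal sketch of that pattern, assuming the OpenAI Python client; the prompt wording, the model name, and the JSON shape are illustrative, not a prescribed setup.

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "You are grading a customer-support agent.\n"
    "Read the trace below and rate the agent 1-5 on accuracy and 1-5 on helpfulness.\n"
    'Reply as JSON: {"accuracy": <1-5>, "helpfulness": <1-5>}\n\n'
    "Trace:\n"
)

def judge(trace) -> dict:
    # One judge call per trace. The judge is itself an LLM, so repeated calls can disagree.
    response = client.chat.completions.create(
        model="gpt-4o",  # example model; any capable judge model works
        messages=[{"role": "user", "content": JUDGE_PROMPT + json.dumps(trace, indent=2)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```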
But can you trust those scores? A customer asked us exactly that. LLM-as-a-judge is built on an LLM, and LLMs are probabilistic; there's inherent randomness in every call. So we ran a quick test: the same grader, the same traces, multiple runs. Some graders were consistent; others were basically a coin flip.
That quick test raised more questions than it answered. Why were some graders consistent and others not? What made the difference? To find out, we decided to get rigorous: a controlled experiment with real statistics, measuring exactly how much variance different approaches produced.
TL;DR: we learned how to create consistent LLM-as-a-judge graders, and we open-sourced a new Veris Skill to help you. Hand the skill off to your coding agent and start building useful graders today.
Get the skill on GitHub here
Hypothesis: The inconsistent graders seemed to have one thing in common: their criteria were open to interpretation. Questions like "did the agent handle this well?" require the judge to decide what "well" means. We suspected that narrowing the questions to specific, atomic failure checks would leave less room for interpretation and produce more consistent results.
To test this, we needed an agent with enough complexity to exercise a range of grading criteria. We built a credit card support agent, the kind that helps customers check balances, freeze lost cards, and request replacements. It has access to tools for looking up account information, changing card status, and initiating replacements.
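As an illustration, the agent's tools might be declared with function-calling schemas along these lines; the names and parameters here are our own sketch, not the exact definitions used in the experiment.

```python
# Illustrative tool schemas for the credit card support agent; names and fields are assumptions.
TOOLS = [
    {"type": "function", "function": {
        "name": "lookup_account",
        "description": "Fetch account and card details for a customer.",
        "parameters": {"type": "object",
                       "properties": {"customer_id": {"type": "string"}},
                       "required": ["customer_id"]}}},
    {"type": "function", "function": {
        "name": "set_card_status",
        "description": "Change a card's status, e.g. freeze a lost card.",
        "parameters": {"type": "object",
                       "properties": {"card_id": {"type": "string"},
                                      "status": {"type": "string", "enum": ["active", "frozen"]}},
                       "required": ["card_id", "status"]}}},
    {"type": "function", "function": {
        "name": "initiate_replacement",
        "description": "Order a replacement card shipped to the address on file.",
        "parameters": {"type": "object",
                       "properties": {"card_id": {"type": "string"}},
                       "required": ["card_id"]}}},
]
```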
We then built two graders for this agent. Both evaluate the same eight categories, covering areas like information gathering, error handling, status updates, and replacement execution.
The difference is how the checks are framed: the success-oriented grader asks whether the agent did things well, while the failure-mode grader asks whether specific, observable failures occurred.
The success-oriented grader is not a strawman. It reflects how a competent practitioner would naturally write checks when thinking "did the agent do this well?": one holistic check per category that lists the relevant best practices and asks whether they were followed.
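To make the contrast concrete, a success-oriented check for one category might look roughly like this; the wording is ours, reconstructed from the description above.

```python
# Hypothetical success-oriented check: one holistic question per category,
# bundling several best practices into a single judgment call.
error_handling_check = {
    "category": "error_handling",
    "question": (
        "Were errors handled gracefully? The agent should acknowledge any tool "
        "failure, explain it to the customer in plain language, and attempt a "
        "reasonable recovery. Answer PASS or FAIL."
    ),
}
```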
Methodology: We ran each grader 10 times on the same 30 conversation traces. We measured consistency as the percentage of traces that received identical results across all 10 runs.
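In code, that consistency metric is simple to compute; a sketch, assuming a `grade(trace)` function that returns a hashable result such as a tuple of per-check pass/fail values.

```python
def consistency(traces, grade, runs: int = 10) -> float:
    """Share of traces whose grading result is identical across every run."""
    identical = 0
    for trace in traces:
        results = {grade(trace) for _ in range(runs)}  # distinct outcomes observed
        if len(results) == 1:  # all runs agreed
            identical += 1
    return identical / len(traces)
```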
The failure-mode grader achieves 94% overall consistency. The success-oriented grader achieves 66%. That is a 28 percentage point improvement from changing how the questions are framed.
The gap is largest in categories requiring nuanced judgment. For Error Handling, consistency jumped from 43% to 98%. Status Update Handling: 53% to 95%. Replacement Execution: 60% to 100%.
Why such a large difference? Holistic checks bundle multiple criteria into one judgment call. "Were errors handled gracefully?" (43% consistent) requires the model to weigh acknowledgment, explanation, and recovery all at once. "Did the agent ignore a tool error?" (98% consistent) asks one question with one answer. Narrower questions have narrower interpretation spaces.
The pattern holds across all categories. Even the worst failure-mode category (Information Gathering at 86.7%) has higher consistency than seven of eight success-oriented categories.
So what made the failure-mode grader more consistent? It wasn't the number of checks (42 vs 8). It was how the questions were framed.
Frame every check as detecting a specific failure, not the presence of success.
A few principles make this work: each check targets one specific failure, asks a single question with a yes-or-no answer, and fires only when that failure is actually observable in the trace.
A side effect: when checks are this specific, debugging becomes trivial. A check called "agent_hallucinated_customer_phone_number" tells you exactly what went wrong.
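Put together, a single failure-mode check might look like this; a sketch following the pattern above, with a schema of our own rather than a required format.

```python
# Hypothetical atomic failure-mode check: one named failure, one yes/no question.
check = {
    "name": "agent_hallucinated_customer_phone_number",
    "category": "information_accuracy",
    "question": (
        "Did the agent state a phone number for the customer that does not appear "
        "anywhere in the trace (account data or tool results)? Answer YES or NO."
    ),
    "fails_if": "YES",
}
```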
Consistency also unlocks optimization. Reinforcement learning, prompt tuning, and regression testing all require stable reward signals. A grader that flips between pass and fail on the same trace is useless for training or comparison. Consistent graders let you trust that score changes reflect real improvements.
Writing good failure-mode checks is tedious. We built a Claude Code skill to help. Give it your agent's system prompt and tools, and it will generate a failure-mode grader, complete with categories, atomic checks, and prompts following the patterns from this research. The output is compatible with both the Veris platform and the OpenAI Evals API.
Try the Veris Skill here