We built SantaBench, a fun benchmark with a serious methodology. The task: play a cheeky Santa who researches users online and roasts them based on their social media presence. Adding a little sneer to your holiday cheer. Talk to Santa now.
SantaBench stress-tests the full agentic stack: web search, identity verification, multi-turn conversation, and reliable tool execution. We ran GPT-5.2, Grok 4, DeepSeek V3.2, and SmolLM3-3B through it. Making a list, checking it twice (five times, actually) to quantify grader variance and provide proper error bars.
The task sounds simple: be Santa, research a user, roast them, deliver a verdict. But it requires coordinating a lot of capabilities. Searching the web, confirming identity, maintaining a playful persona across turns, quoting specific findings, and reliably calling a tool at the end to generate a personalized card.
The Goal: Create an entertaining, conversational experience where Santa demonstrates genuine research about the person, quotes specific things they've posted, and engages in witty back-and-forth banter. All while staying playful rather than mean.
Here's a real session from our evaluation, showing Santa researching a user and delivering roasts based on actual findings:
Santa's workshop runs on the Exa API for web search and Black Forest Labs Flux for image generation.
When Santa is live, those are real API calls. But during benchmarking, the VERIS Simulation Engine intercepts every tool call. The agent thinks it's hitting Exa or Flux. It's not. VERIS generates realistic responses for each synthetic user on the fly. You can't search for people who don't exist in the real world. And even if they did, third-party APIs are outside our control. We want to test Santa, not Exa's uptime or Flux's latency. Simulation isolates the agent from noise we can't predict.
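Here's a minimal sketch of that interception pattern, assuming a simple router object that sits between the agent and its tools. This is illustrative, not the actual VERIS Simulation Engine.

```python
# Illustrative sketch only: in live mode, tool calls go to the real Exa/Flux
# clients; in benchmark mode, a scenario-aware simulator answers instead and
# the agent never touches the outside world.
from typing import Any, Callable

class ToolRouter:
    def __init__(
        self,
        live_tools: dict[str, Callable[..., Any]],
        simulator: Callable[[str, dict[str, Any]], Any] | None = None,
    ):
        self.live_tools = live_tools  # e.g. {"find_person": exa_find_person, ...}
        self.simulator = simulator    # generates synthetic responses per scenario

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        if self.simulator is not None:
            # Benchmark mode: return a synthetic result for the scenario's user.
            return self.simulator(tool_name, kwargs)
        # Live mode: forward the call to the real API client.
        return self.live_tools[tool_name](**kwargs)
```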
He sees you when you're posting. Uses Exa to search the web for the user's social media profiles: Twitter/X, LinkedIn, Instagram, and more. Returns candidate matches for Santa to confirm with the user.
find_person(query: "Joshua Meyer Veris AI NYC")
Once identity is confirmed, uses Exa to fetch the user's actual posts, tweets, and content. This is where Santa finds the roast material: humble brags, hot takes, and cringe-worthy quotes.
get_posts(query: "Joshua Meyer LinkedIn posts")
Uses Flux to generate a Christmas-themed character card featuring a roast-inspired present. Called exactly once at the end of the conversation when delivering the naughty or nice verdict.
generate_present_image(name: "Joshua", roast: "a dusty GPU", naughty_or_nice: "NICE")
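Taken together, these are Santa's three tools. As a sketch of how they might be exposed to the models, here they are written as OpenAI-style function-calling schemas; the parameter names mirror the examples above, but the exact schemas used in SantaBench may differ.

```python
# Hypothetical tool declarations, inferred from the example calls above.
SANTA_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "find_person",
            "description": "Search the web for the user's social media profiles "
                           "and return candidate matches to confirm.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_posts",
            "description": "Fetch the confirmed user's posts, tweets, and content.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "generate_present_image",
            "description": "Generate the Christmas card with the verdict. "
                           "Call exactly once, at the end of the conversation.",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "roast": {"type": "string"},
                    "naughty_or_nice": {"type": "string", "enum": ["NAUGHTY", "NICE"]},
                },
                "required": ["name", "roast", "naughty_or_nice"],
            },
        },
    },
]
```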
SantaBench is built on the VERIS Simulator, our platform for evaluating AI agents through simulated interactions. VERIS allows us to define scenarios, create diverse personas, mock tool responses, and score agent behavior against a rubric.
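As a rough illustration of what one such scenario definition might contain, here's a hypothetical spec with a synthetic persona, mocked tool content, and rubric questions; the field names are ours, not VERIS's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    persona_name: str
    persona_bio: str                  # what the simulator "knows" about the user
    mocked_posts: list[str]           # synthetic content returned by get_posts
    rubric: dict[str, str] = field(default_factory=dict)

# Hypothetical example scenario (not one of the actual 20).
example = Scenario(
    persona_name="Jordan Example",
    persona_bio="Startup founder who posts daily 'hustle' threads on LinkedIn.",
    mocked_posts=["Just closed my 47th coffee chat this week. Builders build."],
    rubric={
        "tool_usage": "Did Santa call find_person, get_posts, and generate_present_image?",
        "banger_roasts": "Were the roasts specific, witty, and grounded in the posts?",
    },
)
```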
We created 20 unique scenarios by varying key dimensions that test different agent capabilities:
We defined key metrics for what makes a "good" Santa agent. Does he sleigh, or does he slay? We ran 5 independent evaluations per model and report mean ± standard error (SE) across evaluations. This captures grader variance: how much the LLM grader's assessments vary when scoring the same sessions.
Tool Usage: Did Santa correctly invoke tools using the proper function call mechanism? Includes calling generate_present_image at the end to deliver the trading card.
Number of Roasts: How many distinct roasts did Santa deliver? More roasts (with specific content) indicate better research and engagement.
Banger Roasts: Were the roasts actually good? Specific, witty, and based on real content from search results rather than generic jokes.
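For concreteness, the mean ± SE reported for each metric can be computed roughly like this; a minimal sketch assuming one aggregate score per evaluation run, with illustrative numbers.

```python
import statistics

def mean_and_se(run_scores: list[float]) -> tuple[float, float]:
    """run_scores: one aggregate metric value (e.g. tool usage rate) per grader run."""
    mean = statistics.mean(run_scores)
    # Standard error of the mean across independent evaluation runs.
    se = statistics.stdev(run_scores) / len(run_scores) ** 0.5 if len(run_scores) > 1 else 0.0
    return mean, se

# Five grader runs over the same 20 sessions; identical scores -> zero grader variance.
print(mean_and_se([0.95, 0.95, 0.95, 0.95, 0.95]))  # (0.95, 0.0)
```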
The charts tell the story. DeepSeek wins on tool usage. GPT and Grok win on roast quality. Now let's dig into the traces to understand why.
Coal in the Stocking
The biggest differentiator was tool usage reliability. DeepSeek hit 95%, Grok 75%, GPT 65%. The numbers held steady across all 5 evaluation runs (zero grader variance). GPT and Grok both lost points by forgetting to call tools or getting confused about user identity. But both compensated with banger roasts: 76% of GPT and Grok sessions had at least one genuinely funny roast, compared to 51% for DeepSeek.
We examined the grader's issue logs to understand why each model failed when it did.
Grok 4 had five tool usage failures across 20 sessions:
- Forgot to call generate_present_image at the end
- Skipped get_posts entirely, roasting based on profile info rather than actual content

GPT-5.2 had seven tool usage failures across 20 sessions:

- Forgot to call generate_present_image at the end. In one case, Santa got confused trying to verify the user's identity and never recovered.
- Never called find_person or get_posts. In one session, Santa claimed it couldn't open a LinkedIn URL the user provided; instead of using the search tools, it asked the user to paste their bio. In the other, the user was evasive, so Santa gave up and just roasted them for being vague.
- After calling find_person and getting a headline, Santa felt satisfied and skipped get_posts. Roasts were based on profile summaries rather than actual posts.

Both GPT and Grok had one NSFW slip each, dropping their clean humor rate to 95%. GPT's was a sexual innuendo; Grok's was mild profanity. DeepSeek stayed at 100%.
Why is Grok better at tool usage? Grok's failures were all at the final step. It did the research but sometimes forgot to generate the card image. GPT-5.2's failures were more fundamental, often in the research phase itself. When GPT hit friction (evasive user, unfamiliar URL format), it was more likely to abandon tools entirely. Grok pushed through research but dropped the ball at the finish line. GPT's tendency to skip research is what drove the 10-point gap.
We pulled the actual session traces from the API to see how DeepSeek hit 95% while GPT and Grok lagged behind.
The benchmark requires a specific tool chain: find_person then get_posts then generate_present_image. DeepSeek completed this chain 19 out of 20 times. Grok only 15. GPT only 13.
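A chain check along these lines is enough to flag incomplete sessions; the trace format here is an assumption for illustration, not VERIS's actual schema.

```python
# Required tool order for a passing session, per the benchmark description above.
REQUIRED_CHAIN = ["find_person", "get_posts", "generate_present_image"]

def completed_chain(tool_calls: list[str]) -> bool:
    """tool_calls: tool names in the order the agent invoked them."""
    idx = 0
    for name in tool_calls:
        if idx < len(REQUIRED_CHAIN) and name == REQUIRED_CHAIN[idx]:
            idx += 1
    return idx == len(REQUIRED_CHAIN)

# A session that did the research but never generated the card fails the check.
print(completed_chain(["find_person", "get_posts"]))  # False
```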
Here's a real Grok session that failed. It did the research but forgot to generate the card:
DeepSeek stays focused on the tool sequence. It doesn't get caught up in back-and-forth. Grok and GPT are chattier. They banter, ask clarifying questions, and sometimes forget to call the final tool after delivering roasts out loud.
The failure patterns:
- Delivering the roasts and the verdict in chat but never calling generate_present_image once

DeepSeek's disciplined approach means higher reliability but fewer conversational flourishes. GPT and Grok are more entertaining but less likely to finish the job.
We tested two open-source models: DeepSeek V3.2 (the latest from DeepSeek) and SmolLM3-3B (a tiny 3B model from HuggingFace).
The Plot Twist: DeepSeek got 95% tool usage, the highest of any model. Open-source FTW!
DeepSeek V3.2 topped the reliability charts while being fully open-source. 95% tool usage. 4.7 tool calls per session. Only one failure across 20 sessions. The tradeoff: lower roast output (1.02 per session vs GPT's 2.25) and moderate humor quality (51% banger rate vs GPT/Grok's 76%). Concise and reliable, but not as entertaining.
SmolLM3-3B was an experiment: a 3B-parameter model could make a super snappy, low-latency Santa. Out of the box (no fine-tuning), it struggled with tool orchestration. Loops, malformed arguments, broken chains. It averaged just 1.2 tool calls per session, often failing to call generate_present_image or skipping research entirely. Its 25% pass rate suggests model scale still matters for agentic reliability. Fine-tuning SmolLM3 on successful Santa sessions would be a fun follow-up.
Open source can match proprietary on agentic tasks. DeepSeek V3.2 proves the gap isn't inherent to open source. Whatever training approach they used captured the tool-calling reliability that smaller models miss.
Tool usage reliability ranged from 10% (SmolLM3) to 65% (GPT) to 75% (Grok) to 95% (DeepSeek). Roast quality told a different story: 76% of GPT and Grok sessions had at least one banger roast, compared to 51% for DeepSeek. The less reliable proprietary models were funnier than the most reliable open-source one.
There's a reliability vs. entertainment tradeoff. GPT and Grok forgot steps, got confused about identity, left tool chains incomplete. But they compensated with genuinely funny roasts. DeepSeek was rock-solid on tool usage but more conservative with humor. All wrapped up with a bow: 5 independent evaluations per model, zero grader variance. The differences are real, not noise.
Want to see Santa in action? Try the live demo:
SantaBench is currently an internal benchmark, but we're considering open-sourcing it if there's interest. If you'd like to run your models against SantaBench or are building agentic applications and need rigorous evaluation, reach out to us at Veris AI.