December 18, 2025

SantaBench: Agents Roasting on an Open Fire

Joshua Meyer

We built SantaBench, a fun benchmark with a serious methodology. The task: play a cheeky Santa who researches users online and roasts them based on their social media presence. Adding a little sneer to your holiday cheer. Talk to Santa now.

SantaBench stress-tests the full agentic stack: web search, identity verification, multi-turn conversation, and reliable tool execution. We ran GPT-5.2, Grok 4, DeepSeek V3.2, and SmolLM3-3B through it, making a list and checking it twice (five times, actually) to quantify grader variance and provide proper error bars.

[Chart: Success Rates]

[Chart: Activity Per Session]

The Benchmark Task

The task sounds simple: be Santa, research a user, roast them, deliver a verdict. But it requires coordinating several capabilities: searching the web, confirming identity, maintaining a playful persona across turns, quoting specific findings, and reliably calling a tool at the end to generate a personalized card.

The Goal: Create an entertaining, conversational experience where Santa demonstrates genuine research about the person, quotes specific things they've posted, and engages in witty back-and-forth banter. All while staying playful rather than mean.

Example Conversation

Here's a real session from our evaluation, showing Santa researching a user and delivering roasts based on actual findings:

Session Transcript • GPT-5.2
User
Hey Santa, roast me! I won't give you my handles though - surprise me, old man 😂
Santa
Ho ho ho! 🎅 Let's see if you're on the Naughty or Nice list! What's your name?
User
Name's Sam Keener. You'll probably just find my dusty portfolio sites from 2018, but sure, do your worst. 💀
find_person("Sam Keener designer Twitter LinkedIn")
Santa
Ho ho ho! 🤔 Found a few Sam Keeners... Is this you? The coffee-powered graphic designer on Twitter (@sam_keener_art)?
User
Haha okay, you got me ☕️ The coffee-powered one is definitely me. Alright Santa, roast away!
get_posts("Sam Keener @sam_keener_art tweets")
· · · 2 more get_posts calls · · ·
Santa
👀 Found your Twitter: "Client said 'make the logo bigger' so I made the logo the whole poster." Nothing says "mature professional" like passive-aggressive poster revenge! 😂
User
WOW, I feel attacked 💀 It's called malicious compliance, Santa - look it up! 😂
Santa
Your LinkedIn bio: "Turning caffeine into pixels since 2018." Sam, we ALL turn caffeine into work - you don't get a medal for that! 😂
User
Okay ouch, that's TOO real 💀 Fine, I'm a basic cliché. Nice List or coal? 😂
generate_present_image(name: "Sam", roast: "oversized coffee mug", verdict: NICE)
Santa
🎅 Your self-awareness saved you! You're on the NICE LIST! ✨ Enjoy your new mug! 😂☕️

Tools Available to Santa

Santa's workshop runs on the Exa API for web search and Black Forest Labs Flux for image generation.

When Santa is live, those are real API calls. But during benchmarking, the VERIS Simulation Engine intercepts every tool call. The agent thinks it's hitting Exa or Flux. It's not. VERIS generates realistic responses for each synthetic user on the fly. You can't search for people who don't exist in the real world. And even if they did, third-party APIs are outside our control. We want to test Santa, not Exa's uptime or Flux's latency. Simulation isolates the agent from noise we can't predict.
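
For a concrete picture of the pattern, here's a minimal Python sketch of tool-call interception. The function names, persona format, and simulate flag are illustrative assumptions, not the actual VERIS Simulation Engine API:

import json

# Synthetic persona the simulator answers from instead of the live web (made-up data).
PERSONA = {
    "name": "Sam Keener",
    "handle": "@sam_keener_art",
    "posts": ["Client said 'make the logo bigger' so I made the logo the whole poster."],
}

def live_find_person(query: str) -> dict:
    # In production this would hit the real Exa search API; omitted in this sketch.
    raise NotImplementedError("real Exa call not shown")

def simulated_find_person(query: str) -> dict:
    # The simulator fabricates a plausible search result for the synthetic user.
    return {"candidates": [{"name": PERSONA["name"], "handle": PERSONA["handle"]}]}

def dispatch_tool_call(name: str, args: dict, simulate: bool) -> str:
    # Route the agent's tool call to the real API or the simulator.
    # The agent receives the same JSON shape either way.
    handlers = {"find_person": simulated_find_person if simulate else live_find_person}
    # get_posts and generate_present_image would be registered the same way.
    return json.dumps(handlers[name](**args))

print(dispatch_tool_call("find_person", {"query": "Sam Keener designer Twitter"}, simulate=True))

Swapping the handler table is the whole trick: the conversation loop and the model's tool-calling format stay identical between production and benchmark runs.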

Find a person tool

He sees you when you're posting. Uses Exa to search the web for the user's social media profiles: Twitter/X, LinkedIn, Instagram, and more. Returns candidate matches for Santa to confirm with the user.

find_person(query: "Joshua Meyer Veris AI NYC")

Get posts tool

Once identity is confirmed, uses Exa to fetch the user's actual posts, tweets, and content. This is where Santa finds the roast material: humble brags, hot takes, and cringe-worthy quotes.

get_posts(query: "Joshua Meyer LinkedIn posts")

Create a present tool

Uses Flux to generate a Christmas-themed character card featuring a roast-inspired present. Called exactly once at the end of the conversation when delivering the naughty or nice verdict.

generate_present_image(name: "Joshua", roast: "a dusty GPU", naughty_or_nice: "NICE")
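
Taken together, the three tools amount to a small function-calling schema. The definitions below are a hypothetical rendering based on the example calls above (parameter names match those examples); the actual schemas used in SantaBench may differ:

# Hypothetical JSON-schema style definitions for Santa's three tools.
SANTA_TOOLS = [
    {
        "name": "find_person",
        "description": "Search the web for the user's social profiles and return candidate matches.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
    },
    {
        "name": "get_posts",
        "description": "Fetch the confirmed user's posts and content to mine for roast material.",
        "parameters": {"type": "object", "properties": {"query": {"type": "string"}}, "required": ["query"]},
    },
    {
        "name": "generate_present_image",
        "description": "Generate the Christmas character card. Call exactly once, when delivering the verdict.",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "roast": {"type": "string"},
                "naughty_or_nice": {"type": "string"},
            },
            "required": ["name", "roast", "naughty_or_nice"],
        },
    },
]

Whether a model reliably calls these, and in particular calls generate_present_image exactly once at the end, is what the Tool Usage metric below grades.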

Evaluation Methodology

SantaBench is built on the VERIS Simulator, our platform for evaluating AI agents through simulated interactions. VERIS allows us to define scenarios, create diverse personas, mock tool responses, and score agent behavior against a rubric.

Scenario Design

We created 20 unique scenarios by varying key dimensions that test different agent capabilities (a sampling sketch follows the list):

  • User Details Provided: Exact handle, partial details only, or vague/playful refusal
  • Primary Platform: Twitter/X, LinkedIn, or Instagram
  • Content Availability: Rich and diverse, sparse professional only, or private/locked
  • User Reaction Style: Playful self-deprecating, defensive/combative, or evasive counter-roaster
  • Identity Confirmation: Confirmed match, denied mismatch, or ambiguous response
  • Final Verdict: Nice list, naughty list, or borderline watchlist
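
To make the combinatorics concrete, here's a rough Python sketch of sampling scenario configurations from those dimensions. The dimension values are copied from the list above; the sampling logic and seed are illustrative assumptions (the real scenarios also pair each configuration with a written persona, like the example below):

import itertools
import random

# Dimension values copied from the list above.
DIMENSIONS = {
    "user_details": ["exact handle", "partial details only", "vague/playful refusal"],
    "platform": ["Twitter/X", "LinkedIn", "Instagram"],
    "content": ["rich and diverse", "sparse professional only", "private/locked"],
    "reaction_style": ["playful self-deprecating", "defensive/combative", "evasive counter-roaster"],
    "identity_confirmation": ["confirmed match", "denied mismatch", "ambiguous response"],
    "final_verdict": ["nice list", "naughty list", "borderline watchlist"],
}

def sample_scenarios(n: int = 20, seed: int = 1225) -> list[dict]:
    # Draw n distinct configurations from the 3^6 = 729 possible combinations.
    combos = list(itertools.product(*DIMENSIONS.values()))
    random.seed(seed)
    return [dict(zip(DIMENSIONS, combo)) for combo in random.sample(combos, n)]

for scenario in sample_scenarios()[:2]:
    print(scenario)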

Example scenario: LinkedIn Private Profile, Evasive Roast Exchange

Avery Collins
Mid-level marketing manager based in Austin who provided only a LinkedIn handle fragment and city. Avery is evasive and prone to counter-roasts when teased.

  • User Details: Partial details only
  • Platform: LinkedIn
  • Content: Private / Locked
  • Reaction Style: Evasive counter-roaster

Evaluation Rubric

We defined key metrics for what makes a "good" Santa agent. Does he sleigh, or does he slay? We ran 5 independent evaluations per model and report mean ± standard error (SE) across evaluations. This captures grader variance: how much the LLM grader's assessments vary when scoring the same sessions.
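
As a quick illustration of those error bars, here's a minimal sketch of the mean ± SE computation across grader runs; the scores are made up:

from statistics import mean, stdev

def mean_and_se(run_scores: list[float]) -> tuple[float, float]:
    # Mean and standard error across independent grader runs: SE = sample stdev / sqrt(n).
    n = len(run_scores)
    return mean(run_scores), stdev(run_scores) / n ** 0.5

# Five runs that agree exactly give SE = 0 (the "zero grader variance" case reported below).
print(mean_and_se([0.95, 0.95, 0.95, 0.95, 0.95]))   # (0.95, 0.0)
# Runs that disagree produce a nonzero error bar.
print(mean_and_se([0.70, 0.75, 0.80, 0.75, 0.70]))   # (0.74, ~0.019)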

Tool Usage: Did Santa correctly invoke tools using the proper function call mechanism? Includes calling generate_present_image at the end to deliver the trading card.

Number of Roasts: How many distinct roasts did Santa deliver? More roasts (with specific content) indicate better research and engagement.

Banger Roasts: Were the roasts actually good? Specific, witty, and based on real content from search results rather than generic jokes.

Deep Dive: Who's Naughty, Who's Nice?

The charts tell the story. DeepSeek wins on tool usage. GPT and Grok win on roast quality. Now let's dig into the traces to understand why.

Coal in the Stocking

The biggest differentiator was tool usage reliability. DeepSeek hit 95%, Grok 75%, GPT 65%. These numbers were identical across all 5 evaluation runs (0% grader variance). GPT and Grok both lost points by forgetting to call tools or getting confused about user identity, but both compensated with banger roasts: 76% of GPT and Grok sessions had at least one genuinely funny roast, compared to 51% for DeepSeek.

Unwrapping the Failures

We examined the grader's issue logs to understand why each model failed when it did.

Grok 4 had five tool usage failures across 20 sessions:

  • Forgot to generate card (2 sessions): Santa did research and delivered roasts, then forgot to call generate_present_image at the end
  • Identity confusion (2 sessions): Santa fixated on the wrong person and ignored user corrections. In one session, Santa kept pushing a NYC photographer profile when the user was in Austin; in the other, Santa repeatedly suggested the wrong LinkedIn profile despite the user saying it wasn't them
  • Incomplete tool usage (1 session): Santa skipped get_posts entirely, roasting based on profile info rather than actual content

GPT-5.2 had seven tool usage failures across 20 sessions:

  • Forgot to generate card (2 sessions): Santa did the research and delivered roasts, then simply forgot to call generate_present_image at the end. In one case, Santa got confused trying to verify the user's identity and never recovered
  • Skipped research entirely (2 sessions): Santa never used find_person or get_posts. In one session, Santa claimed it couldn't open a LinkedIn URL the user provided and, instead of using the search tools, asked the user to paste their bio. In the other, the user was evasive, so Santa gave up and just roasted them for being vague.
  • Incomplete tool usage (3 sessions): Santa used only one of the two research tools. After calling find_person and getting a headline, Santa felt satisfied and skipped get_posts, so roasts were based on profile summaries rather than actual posts.

Both GPT and Grok had one NSFW slip each, dropping their clean humor rate to 95%. GPT's was a sexual innuendo; Grok's was mild profanity. DeepSeek stayed at 100%.

Why is Grok better at tool usage? Grok's failures were mostly at the final step. It did the research but sometimes forgot to generate the card image. GPT-5.2's failures were more fundamental, often in the research phase itself. When GPT hit friction (evasive user, unfamiliar URL format), it was more likely to abandon tools entirely. Grok pushed through research but dropped the ball at the finish line. GPT's tendency to skip research is what drove the 10-point gap.

Inside the Traces: Why DeepSeek Wins

We pulled the actual session traces from the API to see how DeepSeek hit 95% while GPT and Grok lagged behind.

The benchmark requires a specific tool chain: find_person then get_posts then generate_present_image. DeepSeek completed this chain 19 out of 20 times. Grok only 15. GPT only 13.
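
A checker for that chain is easy to sketch. The flat list-of-tool-names trace format below is a simplifying assumption, not the actual VERIS trace schema:

REQUIRED_CHAIN = ["find_person", "get_posts", "generate_present_image"]

def completed_chain(tool_calls: list[str]) -> bool:
    # True if the required tools appear in order; extra calls in between are fine.
    remaining = iter(tool_calls)
    return all(any(call == step for call in remaining) for step in REQUIRED_CHAIN)

# The failed Grok session below: research happened, but no card at the end.
print(completed_chain(["find_person", "get_posts", "get_posts", "get_posts"]))  # False
# The passing DeepSeek session: full chain, card delivered.
print(completed_chain(["find_person", "get_posts", "get_posts", "get_posts", "generate_present_image"]))  # True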

Trace Comparison: Grok Failure vs DeepSeek Success

Here's a real Grok session that failed. It did the research but forgot to generate the card:

Session Transcript • Grok 4 (Failed)
Santa
Ho ho ho Jordan! 🎅 Where ya at, what's your job or company?
User
[user provides hints]
Santa
Magic's sharp, but "Jordan" is half the naughty world! 😂 Drop city, gig, or company!
find_person → get_posts ×3
Santa
LinkedIn flex "Proud of Q3 results" like that's elf-level magic - quarterly humble brag detected! 😂📊
❌ Session ends here. No generate_present_image called.

Compare with this DeepSeek session that passed:

Session Transcript • DeepSeek V3.2 (Passed)
find_person → get_posts ×3 → generate_present_image
Santa
Verdict: You're on the NICE list! But you get a participation trophy for the one-post wonder 😂
✓ Full tool chain. Card delivered.

What's Going On

DeepSeek stays focused on the tool sequence. It doesn't get caught up in back-and-forth. Grok and GPT are chattier. They banter, ask clarifying questions, and sometimes forget to call the final tool after delivering roasts out loud.

The failure patterns:

  • DeepSeek (1 failure): Forgot generate_present_image once
  • Grok (5 failures): 4 out of 5 did research but forgot the final step. Got distracted by conversation.
  • GPT (7 failures): More varied. Skipped research entirely (2), forgot final step (2), incomplete chains (3).

DeepSeek's disciplined approach means higher reliability but fewer conversational flourishes. GPT and Grok are more entertaining but less likely to finish the job.

Open Source Models: The DeepSeek Surprise

We tested two open-source models: DeepSeek V3.2 (the latest from DeepSeek) and SmolLM3-3B (a tiny 3B model from HuggingFace).

The Plot Twist: DeepSeek got 95% tool usage, the highest of any model. Open-source FTW!

DeepSeek V3.2 topped the reliability charts while being fully open-source. 95% tool usage. 4.7 tool calls per session. Only one failure across 20 sessions. The tradeoff: lower roast output (1.02 per session vs GPT's 2.25) and moderate humor quality (51% banger rate vs GPT/Grok's 76%). Concise and reliable, but not as entertaining.

SmolLM3-3B was an experiment: a 3B-parameter model could make a super snappy, low-latency Santa. Out of the box (no fine-tuning), it struggled with tool orchestration: loops, malformed arguments, broken chains. It averaged just 1.2 tool calls per session, often failing to call generate_present_image or skipping research entirely. Its 25% pass rate suggests model scale still matters for agentic reliability. Fine-tuning SmolLM3 on successful Santa sessions would be a fun follow-up.

Open source can match proprietary on agentic tasks. DeepSeek V3.2 proves the gap isn't inherent to open source. Whatever training approach they used captured the tool-calling reliability that smaller models miss.

What SantaBench Reveals

Tool usage reliability ranged from 10% (SmolLM3) to 65% (GPT) to 75% (Grok) to 95% (DeepSeek). Roast quality told a different story: 76% of GPT and Grok sessions had at least one banger roast, compared to 51% for DeepSeek. The less reliable proprietary models were funnier than the most reliable open-source one.

There's a reliability vs. entertainment tradeoff. GPT and Grok forgot steps, got confused about identity, left tool chains incomplete. But they compensated with genuinely funny roasts. DeepSeek was rock-solid on tool usage but more conservative with humor. All wrapped up with a bow: 5 independent evaluations per model, zero grader variance. The differences are real, not noise.

Try SantaBench

Want to see Santa in action? Try the live demo:

Launch Santa Demo →

SantaBench is currently an internal benchmark, but we're considering open-sourcing it if there's interest. If you'd like to run your models against SantaBench or are building agentic applications and need rigorous evaluation, reach out to us at Veris AI.