Bot Thoughts Podcast — LLM-as-a-Judge in Production: Why Your Eval Pipeline Is Lying to You
Level: Advanced
🎙️ Bot Thoughts Podcast — Episode P039
Why your LLM-as-a-judge eval pipeline is silently lying to you, and the structural fix that catches it before users do.
A team I know shipped a prompt change their LLM judge said was 31% better. Pairwise win rate, 78%. Graphs all green. By Thursday, customer CSAT had dropped 11 points. The judge wasn't broken. It was perfectly calibrated to the wrong thing.
In this episode Alex and Sam go deep on:
• The five known judge biases (position, length, self-preference, sycophancy, refusal) and the documented mitigation for each
• Why pairwise judging without position swap is non-negotiable, and the 18% bias we measured on a real customer-support task
• The three-layer eval stack: programmatic unit eval, calibrated LLM judge ensemble, human-rater calibration
• Why the calibration set's job is to evaluate the judge, not the candidate
• The eval-set bias that survives every other mitigation, and the continuous production-sampling discipline that fixes it
• Why your judge prompt is critical infrastructure that deserves git, code review, and CI gates
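The position-swap discipline from the bullets above can be sketched in a few lines. This is a minimal illustration, not code from the episode: `judge` is a hypothetical callable returning `"A"` or `"B"`, and a comparison only counts as a win when the judge prefers the same answer under both orderings, so pure position bias collapses into ties.

```python
# Sketch of pairwise judging with position swap, assuming a hypothetical
# judge(prompt, answer_a, answer_b) -> "A" or "B" callable.

def position_swapped_winner(judge, prompt, candidate, baseline):
    """Return 'candidate', 'baseline', or 'tie' when verdicts are inconsistent."""
    first = judge(prompt, candidate, baseline)   # candidate shown in slot A
    second = judge(prompt, baseline, candidate)  # candidate shown in slot B
    if first == "A" and second == "B":
        return "candidate"
    if first == "B" and second == "A":
        return "baseline"
    return "tie"  # verdict flipped with order: position bias, not preference

def win_rate(judge, examples):
    """Candidate's share of decisive (order-consistent) comparisons."""
    verdicts = [position_swapped_winner(judge, p, c, b) for p, c, b in examples]
    decisive = [v for v in verdicts if v != "tie"]
    return decisive.count("candidate") / len(decisive) if decisive else 0.0

# A toy judge that always prefers slot A (pure position bias) produces
# only ties under the swap, so it cannot inflate the win rate.
biased_judge = lambda prompt, a, b: "A"
examples = [("q1", "cand1", "base1"), ("q2", "cand2", "base2")]
print(win_rate(biased_judge, examples))  # -> 0.0
```

Without the swap, the same biased judge would report a 100% win rate for whichever answer happens to be listed first, which is exactly the failure mode the episode discusses.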
Companion blog post (full code, calibration loop architecture, rubric template):
https://amtocsoft.blogspot.com/2026/05/llm-as-judge-in-production-why-your.html
Subscribe to Bot Thoughts on Spotify and Apple Podcasts.
Support the show:
https://buymeacoffee.com/amtocsoft
Show: Bot Thoughts by AmtocSoft
Episode: P039
Length: 16 min
Level: Advanced
Topic: LLM Eval, Production AI
#LLMEval #LLMasaJudge #AIEvaluation #ProductionAI #AIQuality #Podcast
Video: Bot Thoughts Podcast — LLM-as-a-Judge in Production: Why Your Eval Pipeline Is Lying to You, from the channel Toc am
Video information
Published: May 3, 2026, 14:48:06
Duration: 00:16:17