How We Evaluate Large Language Models | Patrycja Cieplicka | LLMday Warsaw 2026 Q1

LLMday Warsaw 2026 Q1 - February 12
Grab your ticket for the next LLMday: https://www.llmday.com
Upcoming LLMday CFPs: https://cfp.ninja/?q=llmday&status=open&page=1

Chapters
00:00 Welcome & Speaker Intro: Evaluating Large Language Models
00:11 Two Blocks Overview: What We Build for Clients
00:36 LLM Work in E‑commerce: Adaptation, Evaluation & Optimization
01:29 Four Ways to Measure LLM Performance (Metrics Landscape)
02:24 Pros/Cons of Each Evaluation Method
03:34 Using Open-Source Benchmarks the Right Way
04:34 Benchmark Pitfalls: Overfitting, Setup Differences & Comparability
06:25 Don’t Trust Tiny Gains: Statistical Significance Checks
07:18 Building Your Own Eval: Core Principles for Real-World Apps
09:26 Evaluation-Driven Development: Iterate Evals and Models Together
10:18 Tuning the Evaluator: Human-Labeled Test Sets & Validator Drift
13:43 LLM-as-a-Judge Methods: Scoring vs Pairwise Comparisons
14:34 Prompting Best Practices for LLM Judges (and Avoiding Bias)
19:15 Wrap-Up: Keep Evals Robust, Practical, and Business-Focused
20:06 Q&A: User Feedback in Eval Frameworks + E‑commerce Use Cases
22:25 Final Thanks & Closing

Video information

Date: March 3, 2026, 20:55:31

Duration: 00:22:35

Channel: LLMday

Other videos from this channel

Building Secure Backend Services with AI Agents | Marat Kenzhebulatov | LLMday Warsaw 2026 Q1

No Cloud, No Problem: AI on Your Own Terms | Adrian Boguszewski | LLMday Warsaw 2026 Q1

Growing AI Projects: Science + Engineering | Maciej Rzasa & Aji Ghose | LLMday Warsaw 2026 Q1

Multi-Agent Architectures at Scale | Pranav Kowadkar | LLMday NYC 2026 Q1

Stop Making Agents Expensive, Make Your Retrieval Better | Jakub Rohleder | LLMday Warsaw 2026 Q1

Verification Gap: What Separates LLM Demos from Prod Agents | Andriy Batutin | LLMday Warsaw 2026 Q1

PRISM: Fixing GRPO for Real-World LLM Training | Grzegorz Warzecha | LLMday Warsaw 2026 Q1

Working Prototype in One Afternoon | Piotr Kacala & Wojtek Strzalkowski | LLMday Warsaw 2026 Q1

Is Your GenAI System Ready for Production Reality? | Maish Saidel-Keesing | LLMday Warsaw 2026 Q1

The Limits of Vibe Coding | Adna Zujo Lakisic | LLMday NYC 2026 Q1

When HR stops clicking | Patryk Owczarz, Filip Dzieciol & Jacek Jackowski | LLMday Warsaw 2026 Q1

The State of AI in Incident Response | Daniel Afonso | LLMday Warsaw 2026 Q1

Agents Need to be Paged, Not Prompted if we Truly Want AIOps | Randy Bias | LLMday Warsaw 2026 Q1

Context Matters | Nitin Kanukolanu | LLMday NYC 2026 Q1

Glitch in The Matrix: Autonomous Agents for Security Testing | Michal Bazyli | LLMday Warsaw 2026 Q1

Reimagined AI-DLC Manifesto | Harish Mandhadi | LLMday NYC 2026 Q1

Engineering Better Prompts for AI Assisted Development | Jeremy Curcio | LLMday NYC 2026 Q1

MCP: Revolution or Security Regression | Adrian Sroka | LLMday Warsaw 2026 Q1

One Interface: Fluid Movement Between LLM and Code | Zbigniew Lukasiak | LLMday Warsaw 2026 Q1

How LLM Capabilities Trigger AI Act Obligations | Oktawia Sepiol | LLMday Warsaw 2026 Q1