AI News: EsoLang-Bench: Evaluating LLMs via Esoteric Programming Languages — Explained in 60s

Current benchmarks for large language model (LLM) code generation primarily evaluate mainstream languages like Python, where models benefit from massive pretraining corpora.
This leads to inflated accuracy scores that may reflect data memorization rather than genuine reasoning ability.
We introduce EsoLang-Bench, a benchmark of 80 programming problems across five esoteric languages (Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare) for which training data is 5,000x to 100,000x scarcer than for Python.
All models score 0% on problems above the Easy tier, Whitespace remains completely unsolved (0% across all configurations), and self-reflection provides essentially zero benefit.
These results reveal a dramatic gap between benchmark performance on mainstream languages and genuine programming ability, suggesting that current LLM code generation capabilities are far narrower than headline metrics imply.
Read more: https://esolang-bench.vercel.app/
#AI #ArtificialIntelligence #MachineLearning #TechAI #AITools #AIBreakthrough

Video from the Code Rush channel.