Q-RAG: How Reinforcement Learning Trains the Retriever, Not the LLM
Q-RAG is an ICLR 2026 oral paper that reframes multi-step retrieval-augmented generation by applying reinforcement learning directly to the retriever embedder — while leaving the answer-generating LLM completely frozen. This video breaks down the paper's core mechanism, its benchmark results, and the critical nuances the hype cycle tends to skip.
The paper formalizes retrieval as a finite-horizon Markov Decision Process over text chunks. At each step, the retriever scores candidate chunks via dot-product Q-values in embedding space, selects the highest-value chunk, and appends it to the growing context. A sparse binary reward — 1 only if all gold support facts are recovered — trains the embedder to assemble complete evidence sets without requiring any changes to the downstream reader model.
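The retrieval loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `state_embedding` is a hypothetical stand-in that averages vectors (the paper trains a learned embedder), the chunk vectors are assumed to be precomputed, and `budget` is an assumed fixed number of retrieval steps.

```python
import numpy as np

def state_embedding(query_vec, chosen_vecs):
    # Hypothetical stand-in for a learned state encoder:
    # average the query vector with the chunks selected so far.
    return np.mean([query_vec, *chosen_vecs], axis=0)

def retrieve(query_vec, chunk_vecs, budget):
    """Greedy multi-step retrieval: pick the highest-Q chunk at each step."""
    chosen = []
    taken = np.zeros(len(chunk_vecs), dtype=bool)
    for _ in range(min(budget, len(chunk_vecs))):
        state = state_embedding(query_vec, [chunk_vecs[i] for i in chosen])
        # Q(state, chunk) is a dot product between state and chunk embeddings.
        q_values = chunk_vecs @ state
        # Chunks already in the context cannot be selected again.
        q_values = np.where(taken, -np.inf, q_values)
        best = int(np.argmax(q_values))
        chosen.append(best)
        taken[best] = True
    return chosen

query = np.array([1.0, 0.0, 0.0])
chunks = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
])
selected = retrieve(query, chunks, budget=2)
print(selected)
```

Because each step is a single matrix-vector product over the chunk embeddings, no transformer pass over every candidate is needed per step; only the state has to be re-embedded as the context grows.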
Key topics covered in order:
- **The core architectural shift**: Why training the embedder instead of the generator changes deployment economics for adaptive retrieval, and how Q-RAG compares to prior methods like Self-RAG, FLARE, and IRCoT that embed retrieval logic inside the language model.
- **The MDP formulation and Q-function**: How states grow incrementally from the query, how candidate chunks are scored as dot products between state and action embeddings, and why this avoids expensive transformer re-ranking at each step.
- **Sparse terminal reward design**: The system optimizes for evidence coverage — not answer quality, citation precision, or latency — using a binary reward at episode end.
- **Long-context generalization results**: Trained on 4K-token documents, Q-RAG maintains near-perfect needle-in-a-haystack scores and strong multi-hop QA performance at 1M tokens on the RULER benchmark, outperforming supervised baselines that degrade steeply with context length.
- **The stopping problem and over-retrieval**: The paper's main experiments use fixed retrieval budgets rather than a fully learned stopping policy. Appendix data quantifies both over-retrieval (noise accumulation) and under-retrieval (premature stopping) failure modes — the clearest current limitation of the approach.
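The sparse terminal reward and the two stopping failure modes from the list above can be illustrated with a short sketch. The function names and the chunk-ID representation are assumptions made for illustration; the paper's exact reward code and appendix metrics may differ.

```python
def terminal_reward(retrieved_ids, gold_ids):
    """Binary episode-end reward: 1.0 only if every gold support fact was retrieved."""
    return 1.0 if set(gold_ids) <= set(retrieved_ids) else 0.0

def retrieval_errors(retrieved_ids, gold_ids):
    """Count (extra, missing) chunks: extra ~ over-retrieval, missing ~ under-retrieval."""
    retrieved, gold = set(retrieved_ids), set(gold_ids)
    return len(retrieved - gold), len(gold - retrieved)

gold = [3, 9]
full = [3, 7, 9]   # all gold facts found, plus one noise chunk
short = [3, 7]     # stopped before finding chunk 9

print(terminal_reward(full, gold), retrieval_errors(full, gold))
print(terminal_reward(short, gold), retrieval_errors(short, gold))
```

Note that the reward alone does not penalize the extra chunk in `full`, which is one way to see why a fixed budget can lead to over-retrieval.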
Paper: "Q-RAG: Long Context Multi-Step Retrieval via Value-Based Embedder Training" (Sorokin, Buzun, Burtsev et al., ICLR 2026)
#qrag #retrieval #rag #reinforcementlearning #llm #embeddings #longcontext #multihop #iclr2026
📑 Chapters:
0:00 250x context generalization with a frozen LLM
0:30 What Q-RAG changes about RAG architecture
1:05 Retrieval as a Markov Decision Process
1:50 Q-function scoring via dot products in embedding space
2:25 Sparse binary reward: optimizing evidence coverage
3:05 RULER benchmark: scaling from 4K to 1M tokens
3:45 In-domain results and where Beam-Retriever still wins
4:10 Over-retrieval and the stopping problem
4:55 Analogy to AlphaGo's search budget allocation
5:25 Deployment implications and what remains unsolved
#q-rag #retrieval augmented generation #reinforcement learning retriever #multi-step retrieval #embedder training #long context #iclr 2026 #markov decision process #value-based rl #adaptive retrieval #needle in a haystack #multi-hop qa #self-rag #frozen llm retrieval
Video "Q-RAG: How Reinforcement Learning Trains the Retriever, Not the LLM" from the channel The Bearded AI Guy
Video information
May 9, 2026, 13:22:00
00:06:13