Ep12 · Fix Your RAG Retrieval — Chunk Overlap + MMR Reranking From Scratch (Python)
Your RAG retrieves the right passages — but are they the BEST passages? This episode makes retrieval itself smarter, with two cheap upgrades that need zero new gateway code. First, CHUNK OVERLAP: each chunk carries over the tail of the previous one, so a fact that lands on a chunk boundary still lives whole inside at least one chunk instead of being sliced in half. Second — the big one — RERANKING with MMR (Maximal Marginal Relevance) to fix a flaw nobody warns you about: plain top-k by cosine similarity is often REDUNDANT. Ask "what are all the reasons my requests fail?" and cosine cheerfully returns the same rate-limit passage three times, so the model never sees the other reasons and the answer comes out narrow.
MMR fixes that by reranking the wide result set for diversity: each pick scores high for relevance to the query MINUS how similar it is to what you've already chosen, so the results are relevant, but different. A lambda value sets the balance between relevance and diversity (≈0.6 is a good default). It's pure vector math on embeddings you already have: a few extra cosine comparisons among the candidates, no extra model. In the demo, plain cosine returns rate-limits.md three times and the answer only mentions rate limits; MMR returns rate-limits + refunds + billing, and the SAME model on the SAME question now gives a complete answer listing every reason. The lesson: better retrieval beats a bigger prompt. (A cross-encoder / LLM reranker is the other flavour: that one buys precision, while MMR buys diversity.)
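A minimal sketch of the MMR step described above, in plain Python so it runs without extra packages. It is not the code from the episode's repo: the function names (`cosine`, `plain_top_k`, `mmr`), the toy 3-dimensional vectors, and the chunk labels are all made up for illustration. It assumes the common formulation score = lam × relevance − (1 − lam) × (max similarity to anything already picked), where a lower `lam` pushes harder for diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def plain_top_k(query_vec, cand_vecs, k=3):
    """Baseline: indices of the k candidates most similar to the query."""
    relevance = [cosine(query_vec, c) for c in cand_vecs]
    order = sorted(range(len(cand_vecs)), key=lambda i: relevance[i], reverse=True)
    return order[:k]


def mmr(query_vec, cand_vecs, k=3, lam=0.6):
    """Greedy MMR: repeatedly pick the candidate with the best
    lam * relevance - (1 - lam) * redundancy, where redundancy is the
    highest similarity to any candidate picked so far."""
    relevance = [cosine(query_vec, c) for c in cand_vecs]
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:

        def score(i):
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,  # nothing picked yet, so no redundancy penalty
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy embeddings: axis 0 ~ rate limits, axis 1 ~ refunds, axis 2 ~ billing.
labels = ["rate-limits (a)", "rate-limits (b)", "rate-limits (c)", "refunds", "billing"]
candidates = [
    [1.0, 0.0, 0.0],
    [1.0, 0.05, 0.0],
    [1.0, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [0.8, 0.45, 0.4]  # mostly about rate limits, partly about the other topics

print("plain top-3:", [labels[i] for i in plain_top_k(query, candidates)])
print("MMR top-3:  ", [labels[i] for i in mmr(query, candidates)])
```

With these toy vectors, plain top-3 returns the three near-duplicate rate-limit chunks, while MMR keeps the best rate-limit chunk and then picks the refunds and billing chunks. In a real pipeline the candidates would be the wide recall set (say the top 20 by cosine), and MMR would cut that down to the final top-k.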
⭐ Code (clone & follow along):
https://github.com/vahid8/ai-engineering-series
🔑 Free key:
Gemini → https://aistudio.google.com/apikey
📺 Go deeper on MMR, reranking & RAG evals:
https://www.youtube.com/watch?v=HLywMSIQaDw
What you'll learn:
• Why plain top-k retrieval by cosine is often redundant (near-duplicate chunks)
• Chunk overlap — carrying each chunk's tail forward so no fact is sliced in half
• Two-stage retrieval — wide cheap recall, then rerank down to a better top-k
• MMR (Maximal Marginal Relevance): relevance minus redundancy, and the lambda knob
• MMR vs cross-encoder/LLM rerankers — diversity vs precision
• How better retrieval gives a complete answer without a bigger prompt
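The chunk-overlap idea from the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the repo's implementation: it splits by characters (a real pipeline may split by tokens or sentences), and the function name `chunk_with_overlap` and its `size` / `overlap` parameters are hypothetical.

```python
def chunk_with_overlap(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `size` characters, where each chunk
    after the first begins with the last `overlap` characters of the previous one."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


# Tiny demo: a 10-character string, windows of 4 with an overlap of 2.
print(chunk_with_overlap("abcdefghij", size=4, overlap=2))
```

Because consecutive windows share `overlap` characters, a fact shorter than the overlap that straddles one chunk's end still appears whole in the next chunk. The trade-off is a larger index: more chunks to embed and store for the same text.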
⏱️ Chapters:
0:00 Your RAG retrieves — but are they the best passages?
0:23 The problem: top-k by cosine is redundant
0:52 Two upgrades: chunk overlap + MMR
1:40 The gateway (still unchanged)
1:50 Upgrade 1 — chunk overlap
2:34 Upgrade 2 — rerank for diversity (MMR)
3:35 Set up the run: wide recall, plain top-k, then MMR
4:22 The payoff: same passage ×3 vs three real reasons
5:04 Narrow answer vs complete answer
5:29 Recap + what's next (evaluation)
🔧 Stack: Python · uv · FastAPI · LiteLLM · OpenAI SDK · Gemini (free tier)
▶️ Next episode: RAG Part 4 — evaluation. How do you actually MEASURE whether your RAG is any good?
Subscribe so you don't miss it.
#AIEngineering #Python #RAG #Reranking #MMR #Embeddings #VectorSearch #LLMOps
Video "Ep12 · Fix Your RAG Retrieval — Chunk Overlap + MMR Reranking From Scratch (Python)" from the Vision channel
Video information
Published: June 18, 2026, 19:16:05
Duration: 00:06:05