Ep12 · Fix Your RAG Retrieval — Chunk Overlap + MMR Reranking From Scratch (Python)

Your RAG retrieves the right passages — but are they the BEST passages? This episode makes retrieval itself smarter, with two cheap upgrades that need zero new gateway code. First, CHUNK OVERLAP: each chunk carries over the tail of the previous one, so a fact that lands on a chunk boundary still lives whole inside at least one chunk instead of being sliced in half. Second — the big one — RERANKING with MMR (Maximal Marginal Relevance) to fix a flaw nobody warns you about: plain top-k by cosine similarity is often REDUNDANT. Ask "what are all the reasons my requests fail?" and cosine cheerfully returns the same rate-limit passage three times, so the model never sees the other reasons and the answer comes out narrow.
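To make the chunk-overlap idea concrete, here is a minimal sketch. It is not the repo's code: the function name chunk_with_overlap, the character-based windows, and the size/overlap values are assumptions chosen for illustration (a real splitter would more likely count tokens or sentences). The toy text places the phrase "rate limit" right on a chunk boundary.

```python
def chunk_with_overlap(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters.

    Each window starts `overlap` characters before the previous one ended,
    so the tail of one chunk is repeated at the head of the next.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy document: "rate limit" straddles the 20-character boundary.
text = "x" * 18 + "rate limit" + "y" * 18

no_overlap = chunk_with_overlap(text, size=20, overlap=0)
with_overlap = chunk_with_overlap(text, size=20, overlap=10)

print(any("rate limit" in c for c in no_overlap))    # False: the phrase is cut in half
print(any("rate limit" in c for c in with_overlap))  # True: one chunk holds it whole
```

The trade-off is a larger index: with an overlap of half the chunk size, roughly twice as many chunks get embedded and stored.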

MMR fixes that by reranking the wide result set for diversity: each pick is scored by its relevance to the query MINUS its similarity to what you've already chosen, so it is relevant but different. A lambda value dials how hard you push for diversity (≈0.6 is a good default). It's pure vector math on embeddings you already have: a few extra similarity comparisons among the candidates, no extra model. In the demo, plain cosine returns rate-limits.md three times and the answer only mentions rate limits; MMR returns rate-limits + refunds + billing, and the SAME model on the SAME question now gives a complete answer listing every reason. The lesson: better retrieval beats a bigger prompt. (A cross-encoder / LLM reranker is the other flavour: that one buys precision; MMR buys diversity.)
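The "relevance minus redundancy" scoring can be sketched in a few lines of plain Python. This is an illustration, not the code from the repo. It assumes the common MMR convention where lam weights relevance and (1 - lam) weights redundancy, so a lower lam pushes harder for diversity; the episode's code may define the knob differently. The chunk ids and the 3-dimensional vectors are hand-made toy values, not real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, candidates, k=3, lam=0.6):
    """Pick k chunk ids from candidates, a list of (chunk_id, vector) pairs.

    score = lam * relevance - (1 - lam) * redundancy, where redundancy is
    the highest similarity to any chunk that has already been picked.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:
        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _ in selected]


# Toy data (hypothetical): three near-duplicate rate-limit chunks plus two others.
query = [1.0, 0.5, 0.4]
chunks = [
    ("rate-limits.md#1", [1.0, 0.0, 0.0]),
    ("rate-limits.md#2", [1.0, 0.05, 0.0]),
    ("rate-limits.md#3", [1.0, 0.0, 0.03]),
    ("refunds.md", [0.3, 1.0, 0.0]),
    ("billing.md", [0.3, 0.0, 1.0]),
]

# Plain top-k: sort by cosine to the query and keep the first three.
ranked = sorted(chunks, key=lambda c: cosine(query, c[1]), reverse=True)
plain = [chunk_id for chunk_id, _ in ranked[:3]]
diverse = mmr(query, chunks, k=3, lam=0.6)

print(plain)    # ['rate-limits.md#2', 'rate-limits.md#3', 'rate-limits.md#1']
print(diverse)  # ['rate-limits.md#2', 'refunds.md', 'billing.md']
```

With lam=1.0 the redundancy term vanishes and the result is the plain cosine top-k again, which makes a handy sanity check when tuning the knob.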

⭐ Code (clone & follow along):
https://github.com/vahid8/ai-engineering-series

🔑 Free key:
Gemini → https://aistudio.google.com/apikey

📺 Go deeper on MMR, reranking & RAG evals:
https://www.youtube.com/watch?v=HLywMSIQaDw

What you'll learn:
• Why plain top-k retrieval by cosine is often redundant (near-duplicate chunks)
• Chunk overlap — carrying each chunk's tail forward so no fact is sliced in half
• Two-stage retrieval — wide cheap recall, then rerank down to a better top-k
• MMR (Maximal Marginal Relevance): relevance minus redundancy, and the lambda knob
• MMR vs cross-encoder/LLM rerankers — diversity vs precision
• How better retrieval gives a complete answer without a bigger prompt

⏱️ Chapters:
0:00 Your RAG retrieves — but are they the best passages?
0:23 The problem: top-k by cosine is redundant
0:52 Two upgrades: chunk overlap + MMR
1:40 The gateway (still unchanged)
1:50 Upgrade 1 — chunk overlap
2:34 Upgrade 2 — rerank for diversity (MMR)
3:35 Set up the run: wide recall, plain top-k, then MMR
4:22 The payoff: same passage ×3 vs three real reasons
5:04 Narrow answer vs complete answer
5:29 Recap + what's next (evaluation)

🔧 Stack: Python · uv · FastAPI · LiteLLM · OpenAI SDK · Gemini (free tier)

▶️ Next episode: RAG Part 4 — evaluation. How do you actually MEASURE whether your RAG is any good?
Subscribe so you don't miss it.

#AIEngineering #Python #RAG #Reranking #MMR #Embeddings #VectorSearch #LLMOps

Video: Ep12 · Fix Your RAG Retrieval — Chunk Overlap + MMR Reranking From Scratch (Python), from the channel Vision

Video information

June 18, 2026, 19:16:05

00:06:05

Vision
