Podcast RAG

Production RAG for 500+ Hours of Audio

Search and chat with 500+ hours of podcast content. Whisper transcription, hybrid retrieval, reranking, and streaming responses with citations.

Overview

A production RAG system that goes beyond tutorials. Hybrid search (vector + keyword), reranking pipeline, proper chunking with overlap, and timestamped citations back to source audio.

Challenge

Tutorial RAG systems tend to fail in production: naive chunking loses context, pure vector search misses exact keywords, and without reranking the results are noisy. I wanted to build a RAG system that holds up on real content.

Approach

Built a data pipeline: YouTube download with yt-dlp, Whisper transcription (faster-whisper) on an A100 GPU, and chunking with 25% overlap, with chunk breaks placed at transcript segment boundaries.
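The chunking step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Segment` and `Chunk` types, the character budget `max_chars`, and the function name `chunk_segments` are assumptions made for the example. It groups whole transcript segments into chunks and carries roughly 25% of each chunk into the next one, so no chunk starts or ends mid-segment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcript segment with start/end times in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


@dataclass
class Chunk:
    """A group of consecutive segments, keeping the time span for citations."""
    start: float
    end: float
    text: str


def chunk_segments(segments: List[Segment], max_chars: int = 1200,
                   overlap_ratio: float = 0.25) -> List[Chunk]:
    """Pack whole segments into chunks of about max_chars characters.

    Each new chunk starts a few segments before the previous chunk ended,
    so that about overlap_ratio of the previous chunk is repeated.
    """
    chunks: List[Chunk] = []
    i, n = 0, len(segments)
    while i < n:
        # Grow the chunk one segment at a time until the budget is reached.
        j, size = i, 0
        while j < n and (j == i or size + len(segments[j].text) <= max_chars):
            size += len(segments[j].text)
            j += 1
        window = segments[i:j]
        chunks.append(Chunk(window[0].start, window[-1].end,
                            " ".join(s.text.strip() for s in window)))
        if j >= n:
            break
        # Step back over whole segments until the overlap budget is used up.
        # k never drops to i, so every iteration makes progress.
        budget, carried, k = overlap_ratio * size, 0, j
        while k > i + 1 and carried + len(segments[k - 1].text) <= budget:
            carried += len(segments[k - 1].text)
            k -= 1
        i = k
    return chunks
```

Because each chunk keeps the start and end time of its first and last segment, the same structure can later feed the timestamped citations.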

Used OpenAI text-embedding-3-large (3072 dimensions) for high-quality semantic search. Pinecone for vector storage at scale.
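As a rough sketch of the indexing step, the helper below turns chunks into records shaped for a Pinecone upsert (`id`, `values`, `metadata`). The chunk dictionary keys (`id`, `episode`, `start`, `text`), the batch size, and the choice to store the chunk text in metadata are assumptions for illustration. The embedding call is injected as `embed_batch` so the example runs without network access; the comments show how it might be wired to the OpenAI and Pinecone clients.

```python
from typing import Callable, Dict, Iterator, List

EMBED_MODEL = "text-embedding-3-large"
EMBED_DIM = 3072  # output size of text-embedding-3-large


def batched(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def build_records(chunks: List[dict],
                  embed_batch: Callable[[List[str]], List[List[float]]],
                  batch_size: int = 64) -> Iterator[Dict]:
    """Embed chunks in batches and yield upsert-ready records."""
    for batch in batched(chunks, batch_size):
        vectors = embed_batch([c["text"] for c in batch])
        for chunk, vec in zip(batch, vectors):
            if len(vec) != EMBED_DIM:
                raise ValueError(f"expected {EMBED_DIM} dims, got {len(vec)}")
            yield {
                "id": chunk["id"],
                "values": vec,
                # Metadata is what a citation needs later (assumed layout).
                "metadata": {"episode": chunk["episode"],
                             "start": chunk["start"],
                             "text": chunk["text"]},
            }


# Possible wiring (not executed here; requires API keys and an existing index):
#   client = openai.OpenAI()
#   def embed_batch(texts):
#       resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
#       return [d.embedding for d in resp.data]
#   index.upsert(vectors=list(build_records(chunks, embed_batch)))
```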

Implemented hybrid retrieval: vector search (top-20) and keyword search (top-20) results are merged, then reranked down to the top-5 passages.
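One common way to merge two ranked lists is reciprocal rank fusion (RRF), sketched below. The write-up does not say which merge method or reranker the project uses, so RRF, the constant `k=60`, and the optional `rerank_score` callable are stand-ins chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence


def rrf_merge(vector_ids: Sequence[str], keyword_ids: Sequence[str],
              k: int = 60) -> List[str]:
    """Fuse two ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document; documents found
    by both retrievers accumulate both contributions.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_top_k(vector_ids: Sequence[str], keyword_ids: Sequence[str],
                 top_k: int = 5,
                 rerank_score: Optional[Callable[[str], float]] = None) -> List[str]:
    """Merge both result lists, optionally rerank, and keep the best top_k."""
    candidates = rrf_merge(vector_ids, keyword_ids)
    if rerank_score is not None:
        # rerank_score is a placeholder for a real reranker's relevance score.
        candidates = sorted(candidates, key=rerank_score, reverse=True)
    return candidates[:top_k]
```

In a setup like the one described, `vector_ids` and `keyword_ids` would each hold up to 20 chunk IDs, and `rerank_score` would score each candidate against the query.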

Added timestamped citations. Every answer links back to the specific moment in the podcast.
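A citation of this kind might be built as in the sketch below, assuming each retrieved chunk carries a YouTube video ID and a start time in seconds. The function names and the citation layout are illustrative; the `t=<seconds>s` query parameter is the standard way to deep-link to a moment in a YouTube video.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS once past an hour."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def citation_link(video_id: str, start_seconds: float, title: str) -> str:
    """Build a human-readable citation that deep-links to the source moment."""
    t = int(start_seconds)  # YouTube expects whole seconds
    url = f"https://www.youtube.com/watch?v={video_id}&t={t}s"
    return f"{title} @ {format_timestamp(start_seconds)} ({url})"
```

For example, `citation_link("abc123", 75.9, "Episode 12")` points at second 75 of a video with the placeholder ID `abc123`.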

Outcome

The system handles 500+ hours of content with sub-second retrieval. Hybrid search catches both semantic matches and exact keywords. Reranking filters out noisy results. Timestamped citations let users verify each answer against the source audio.

Python, OpenAI, Pinecone, faster-whisper, yt-dlp, Jupyter
