RAG Retrieval over the SciPy Codebase
Built the retrieval core of a Retrieval-Augmented Generation (RAG) system for a large real-world codebase (SciPy). The system ingests documentation and source files, chunks them with overlap, embeds chunks using domain-specific sentence embedding models, and indexes everything in FAISS for fast top-k similarity search. I added query tooling to inspect retrieved chunks and trace results back to the original file paths—enabling qualitative analysis and setting up the foundation for quantitative evaluation (Precision@k/Recall@k) and grounded Q&A.
- GitHub: TODO (paste repo link)
- Demo: TODO (optional)
Notes
What I built
- Code/doc ingestion: Recursive file loader to crawl the SciPy repository and collect .py (code) and .rst (docs) text while safely skipping unreadable files.
- Chunking with overlap: Word-based chunking with overlap to preserve context across boundaries; chunk sizes tuned separately for code vs. documentation.
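The ingestion and chunking steps can be sketched roughly as below. This is a minimal illustration, not the project's actual code; the function names, chunk size, and overlap values are placeholders.

```python
from pathlib import Path

def load_files(repo_root, exts=(".py", ".rst")):
    """Recursively collect readable text files with the given extensions."""
    docs = {}
    for path in Path(repo_root).rglob("*"):
        if path.is_file() and path.suffix in exts:
            try:
                docs[str(path)] = path.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # safely skip unreadable files
    return docs

def chunk_words(text, chunk_size=200, overlap=40):
    """Word-based chunking: consecutive chunks share `overlap` words of context."""
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

Overlap means a sentence split at a chunk boundary still appears whole in at least one chunk, which tends to matter more for prose docs than for code.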
- Embedding pipeline (GPU): Batch embedding generation with SentenceTransformers, producing normalized vectors for similarity-based retrieval.
- Used separate embedding models for code and docs to improve relevance by domain.
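A sketch of the embedding step, assuming the SentenceTransformers API; the model name shown is a common general-purpose default, not necessarily the domain-specific models the project used:

```python
import numpy as np

def normalize(vectors):
    """L2-normalize row vectors so L2/cosine rankings agree at retrieval time."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def embed_chunks(chunks, model_name="all-MiniLM-L6-v2", batch_size=64):
    """Batch-encode text chunks; encode() runs on GPU automatically if available."""
    from sentence_transformers import SentenceTransformer  # assumes the library is installed
    model = SentenceTransformer(model_name)
    vectors = model.encode(chunks, batch_size=batch_size, show_progress_bar=True)
    return normalize(np.asarray(vectors, dtype="float32"))
```

Keeping code and doc chunks in separate collections, each embedded with its own model, is what lets the two domains be indexed and queried with domain-appropriate vectors.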
- Vector database (FAISS): Built and persisted a FAISS IndexFlatL2 index plus JSONL-style metadata so retrieval can run without recomputing embeddings.
- Retrieval + inspection: Query function that embeds the query, retrieves top-k chunk IDs, and prints matched chunks with file paths for debugging and evaluation.
Key takeaways
- Retrieval quality depends heavily on chunking strategy (size + overlap) and domain-appropriate embeddings (code vs. docs).
- Persisting the index + metadata makes the system practical: embedding is the expensive step; retrieval becomes fast and repeatable.
- Stress-testing with diverse query styles (API names, conceptual questions, misspellings) is essential before moving to quantitative metrics and LLM-grounded answer generation.
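As a bridge to the planned quantitative evaluation, Precision@k and Recall@k for a single query reduce to simple set arithmetic over ranked results (a minimal sketch; the function name is hypothetical):

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Precision@k and Recall@k for one query.

    retrieved_ids: ranked chunk IDs returned by the index.
    relevant_ids: set of ground-truth relevant chunk IDs for the query.
    """
    hits = sum(1 for doc in retrieved_ids[:k] if doc in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```

Averaging these over a labeled query set would give the system-level numbers mentioned in the overview.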
