Scalable Retrieval with Project Gutenberg
Modern AI systems rely on retrieval to fetch the right context before generation. In this project, I built a retrieval-indexing prototype over the Project Gutenberg corpus: dataset validation → text chunking → dense embeddings → hierarchical ANN graph construction (HNSW-style), with profiling to understand runtime and memory bottlenecks.
- GitHub: TODO (paste repo link)
- Demo: TODO (optional)
Notes
What I built
- Corpus validation + inspection
  - Loaded and sampled `gutenberg_over_70000_metadata.csv` to verify dataset integrity.
  - Traversed the corpus directory and opened serialized `.pkl` book files to preview contents and confirm expected structure.
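The inspection step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `sample_metadata` and `preview_book` are hypothetical, and the CSV's column layout is assumed unknown (only the filename comes from the project).

```python
import csv
import pickle

def sample_metadata(csv_path, n=5):
    """Read the metadata CSV (e.g. gutenberg_over_70000_metadata.csv)
    and return its header plus the first n rows for a quick sanity check."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for _, row in zip(range(n), reader)]
    return header, rows

def preview_book(pkl_path, n_chars=500):
    """Deserialize one .pkl book file and return the start of its
    contents to confirm the expected structure."""
    with open(pkl_path, "rb") as f:
        book = pickle.load(f)
    return str(book)[:n_chars]
```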
- Text → embeddings pipeline
  - Implemented overlap-based sliding-window chunking (`chunk_text`) to fit within embedding-model input limits.
  - Generated dense embeddings with `sentence-transformers/all-MiniLM-L6-v2` (`embed_book`) to enable nearest-neighbor retrieval in vector space.
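A sketch of what this pipeline might look like. The names `chunk_text` and `embed_book` come from the project, but their signatures and parameter defaults here are assumptions; the embedding call relies on the standard `SentenceTransformer.encode` API.

```python
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into overlapping word windows so each chunk fits
    within the embedding model's input limit.
    Assumes overlap < chunk_size; parameter values are illustrative."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks

def embed_book(text, model):
    """Embed each chunk with a sentence-transformers model
    (e.g. all-MiniLM-L6-v2); returns an (n_chunks, dim) array.

    Usage (assumed):
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
        vectors = embed_book(book_text, model)
    """
    return model.encode(chunk_text(text))
```

The overlap ensures that a sentence falling on a chunk boundary still appears intact in at least one chunk, at the cost of embedding some text twice.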
- Simplified HNSW-style ANN index construction
  - Implemented exponential level assignment to create a multi-layer hierarchy with sparse upper layers (hub-like nodes).
  - Built per-layer nearest-neighbor connectivity using vectorized squared-distance computation and NetworkX graphs.
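The two ideas above can be combined in a short sketch. This is a minimal illustration under stated assumptions, not the project's implementation: function names, the promotion probability `p`, and the brute-force per-layer k-NN step are all my choices.

```python
import numpy as np
import networkx as nx

def assign_level(rng, p=0.5, max_level=4):
    """Exponentially distributed level assignment: a node is promoted
    to the next layer with probability p, so the expected population
    shrinks geometrically and upper layers hold sparse hub-like nodes."""
    level = 0
    while level < max_level and rng.random() < p:
        level += 1
    return level

def build_layers(vectors, k=3, p=0.5, max_level=4, seed=0):
    """Build one k-NN NetworkX graph per layer; a node appears in every
    layer up to its assigned level (HNSW-style hierarchy)."""
    rng = np.random.default_rng(seed)
    levels = [assign_level(rng, p, max_level) for _ in range(len(vectors))]
    layers = []
    for lvl in range(max_level + 1):
        members = [i for i, l in enumerate(levels) if l >= lvl]
        g = nx.Graph()
        g.add_nodes_from(members)
        if len(members) > 1:
            sub = vectors[members]
            # Vectorized pairwise squared Euclidean distances (O(n^2 * d)).
            d2 = ((sub[:, None, :] - sub[None, :, :]) ** 2).sum(-1)
            np.fill_diagonal(d2, np.inf)  # exclude self-matches
            for row, i in enumerate(members):
                for col in np.argsort(d2[row])[:min(k, len(members) - 1)]:
                    g.add_edge(i, members[col])
        layers.append(g)
    return levels, layers
```

A search would start at the sparse top layer, greedily descend toward the query, and switch to denser layers as it gets closer; only construction is sketched here.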
- Scalability instrumentation
  - Added a profiling utility to measure wall-clock time and peak memory usage with `tracemalloc`, enabling data-driven discussion of bottlenecks.
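A profiling utility of this kind can be sketched in a few lines with the standard library; the wrapper name `profile` and its return shape are my assumptions.

```python
import time
import tracemalloc

def profile(fn, *args, **kwargs):
    """Run fn and report wall-clock time and peak traced memory.
    Returns (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak
```

Note that `tracemalloc` only tracks allocations made through Python's allocator, so native-library buffers (e.g. some NumPy internals) may be undercounted; it is still a useful relative signal when comparing indexing strategies.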
Key takeaways
- Retrieval quality and speed depend heavily on preprocessing choices (chunk size, overlap, embedding strategy).
- Pairwise distance computation is the dominant bottleneck at scale, motivating hierarchical/approximate methods like HNSW.
- Exponential layer assignment concentrates connectivity in a small number of high-level “hub” nodes, enabling fast navigation during search.
- Measuring both time and peak memory is essential for understanding whether an indexing approach will scale.
