# Assignment: Advanced Indexing

## Assignment Metadata
| Field | Description |
|---|---|
| Assignment Name | Advanced Indexing for RAG Systems |
| Course | RAG and Optimization |
| Project Name | |
| Estimated Time | 120 minutes |
| Framework | Python 3.10+, LangChain, ChromaDB/Qdrant, Sentence-Transformers |
## Learning Objectives

By completing this assignment, you will be able to:

- Implement Semantic Chunking to split text based on meaning rather than fixed character counts
- Configure HNSW index parameters (`M`, `ef_construction`, `ef_search`) for optimal performance
- Compare chunking strategies (Recursive vs. Semantic) and measure their impact on retrieval quality
- Analyze trade-offs between retrieval speed and accuracy when tuning HNSW parameters
- Validate the effectiveness of your indexing strategy through retrieval experiments
## Problem Description

You are building a RAG system for a technical documentation platform. The current system uses fixed-size chunking (500 characters) and brute-force vector search, which causes:

- Semantic fragmentation: Important concepts are split across multiple chunks
- High latency: Search becomes slow as the document count grows

Your task is to implement Semantic Chunking and configure HNSW indexing to solve these problems.
## Technical Requirements

### Environment Setup

- Python 3.10 or higher
- Required packages:
  - `langchain >= 0.1.0`
  - `sentence-transformers >= 2.2.0`
  - `chromadb >= 0.4.0` OR `qdrant-client >= 1.7.0`
  - `numpy >= 1.24.0`

### Dataset

Use the provided sample documents or create your own dataset with:

- At least 10 documents
- Each document containing multiple distinct topics/sections
- Total text length of at least 50,000 characters
## Tasks

### Task 1: Implement Semantic Chunking (40 points)

Implement a Semantic Chunker that:

- Splits documents into sentences using proper sentence boundary detection
- Calculates cosine similarity between consecutive sentences
- Creates chunk boundaries when similarity drops below a threshold
- Handles edge cases (very short/long sentences, code blocks, lists)

Configuration parameters:

- Similarity threshold (recommended: 0.7-0.85)
- Minimum chunk size (in sentences)
- Maximum chunk size (in characters)
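The boundary rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: `embed` is assumed to be any callable mapping a list of sentences to vectors (for example, a sentence-transformers model's `encode`), and the default threshold and size limits are illustrative.

```python
import numpy as np

def semantic_chunks(sentences, embed, threshold=0.75,
                    min_sentences=2, max_chars=1200):
    """Group consecutive sentences, starting a new chunk when the
    cosine similarity between neighbouring sentences drops below
    `threshold`, or when the chunk would exceed `max_chars`."""
    if not sentences:
        return []
    vecs = np.asarray(embed(sentences), dtype=float)
    # Normalize rows so a dot product equals cosine similarity.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = float(vecs[i - 1] @ vecs[i])
        too_big = sum(len(s) for s in current) + len(sentences[i]) > max_chars
        boundary = sim < threshold and len(current) >= min_sentences
        if boundary or too_big:
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

For the edge cases the task mentions (code blocks, lists), you would typically detect those spans during sentence splitting and keep them intact rather than feeding them through the similarity check.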
Comparison experiment:

- Process the same documents using both Recursive Chunking and Semantic Chunking
- Record the number of chunks, average chunk size, and chunking time
- Analyze at least 3 examples where Semantic Chunking preserves context better
### Task 2: Configure HNSW Index (30 points)

Set up a vector database (ChromaDB or Qdrant) with HNSW indexing.

Experiment with HNSW parameters:

- Test at least 3 different values for `M` (e.g., 16, 32, 64)
- Test at least 3 different values for `ef_construction` (e.g., 100, 200, 400)
- Test at least 3 different values for `ef_search` (e.g., 50, 100, 200)
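As a sketch of one way to set these knobs in ChromaDB, assuming the `hnsw:*` collection-metadata keys of the 0.4.x line (collection name and values here are illustrative; Qdrant exposes the same parameters via `HnswConfigDiff` at collection creation and `SearchParams` at query time):

```python
import chromadb

client = chromadb.Client()
# ChromaDB passes HNSW parameters through collection metadata.
collection = client.create_collection(
    name="docs_M32",
    metadata={
        "hnsw:space": "cosine",       # distance metric
        "hnsw:M": 32,                 # graph connectivity per node
        "hnsw:construction_ef": 200,  # build-time candidate list size
        "hnsw:search_ef": 100,        # query-time candidate list size
    },
)
```

Creating one collection per configuration makes it straightforward to benchmark build time and query latency side by side.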
Document the results in a table showing:

- Parameter configuration
- Index build time
- Average query latency
- Memory usage (if measurable)
- Recall@10 (compared to brute-force search)
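Recall@10 against brute-force search can be computed by comparing ID sets. A minimal sketch, assuming document IDs are row indices into the embedding matrix and `approx_ids` is whatever ID list your HNSW query returned:

```python
import numpy as np

def recall_at_k(approx_ids, vectors, query, k=10):
    """Fraction of the exact top-k neighbours (by cosine similarity)
    that the approximate index also returned."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    exact = np.argsort(-(v @ q))[:k]  # brute-force ground truth
    return len(set(approx_ids[:k]) & set(exact.tolist())) / k
```

Averaging this over a set of queries gives the Recall@10 column for each parameter configuration.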
### Task 3: End-to-End RAG Pipeline (30 points)

Build a complete RAG pipeline that uses:

- Your Semantic Chunker for document processing
- An HNSW-indexed vector database for retrieval
- An LLM for answer generation (can use the OpenAI API or a local model)

Create a test set of at least 10 questions that require:

- Single-topic answers (should retrieve one complete chunk)
- Multi-topic answers (should retrieve multiple related chunks)

Evaluate and compare retrieval quality between:

- Baseline: Recursive Chunking + brute-force search
- Optimized: Semantic Chunking + HNSW index
## Submission Requirements

### Required Deliverables

- Source code in a Jupyter notebook or Python scripts
- `README.md` with setup instructions and usage examples
- Results table comparing chunking strategies
- Results table comparing HNSW parameter configurations
- Screenshots or logs showing the retrieval quality comparison

### Submission Checklist

- All code runs without errors
- Semantic Chunker correctly preserves topic boundaries
- HNSW index is properly configured and benchmarked
- End-to-end pipeline produces coherent answers
- Documentation is complete with clear explanations
## Evaluation Criteria

| Criteria | Points |
|---|---|
| Semantic Chunking implementation | 25 |
| Chunking comparison analysis | 15 |
| HNSW parameter experimentation | 20 |
| Performance benchmarking | 10 |
| End-to-end RAG pipeline | 20 |
| Code quality and documentation | 10 |
| Total | 100 |
## Hints

- Start with a small dataset to test your Semantic Chunker before scaling up
- Use `sentence-transformers` models like `all-MiniLM-L6-v2` for efficient similarity calculation
- When tuning HNSW, prioritize `ef_search` for query-time optimization
- Consider using the companion notebook `01-advanced-indexing.ipynb` as a reference