Assignment: Advanced Indexing#

Assignment Metadata#

| Field | Description |
| --- | --- |
| Assignment Name | Advanced Indexing for RAG Systems |
| Course | RAG and Optimization |
| Project Name | advanced-indexing-rag |
| Estimated Time | 120 minutes |
| Framework | Python 3.10+, LangChain, ChromaDB/Qdrant, Sentence-Transformers |


Learning Objectives#

By completing this assignment, you will be able to:

  • Implement Semantic Chunking to split text based on meaning rather than fixed character counts

  • Configure HNSW index parameters (M, ef_construction, ef_search) for optimal performance

  • Compare chunking strategies (Recursive vs. Semantic) and measure their impact on retrieval quality

  • Analyze trade-offs between retrieval speed and accuracy when tuning HNSW parameters

  • Validate the effectiveness of your indexing strategy through retrieval experiments


Problem Description#

You are building a RAG system for a technical documentation platform. The current system uses fixed-size chunking (500 characters) and brute-force vector search, which causes:

  1. Semantic fragmentation: Important concepts are split across multiple chunks

  2. High latency: Search becomes slow as the document count grows

Your task is to implement Semantic Chunking and configure HNSW indexing to solve these problems.


Technical Requirements#

Environment Setup#

  • Python 3.10 or higher

  • Required packages:

    • langchain >= 0.1.0

    • sentence-transformers >= 2.2.0

    • chromadb >= 0.4.0 OR qdrant-client >= 1.7.0

    • numpy >= 1.24.0
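For reference, the list above maps to a single install command (a sketch; adjust the pins to your environment, and install only one of the two vector-database clients):

```shell
pip install "langchain>=0.1.0" "sentence-transformers>=2.2.0" \
            "chromadb>=0.4.0" "numpy>=1.24.0"
# if you prefer Qdrant over ChromaDB:
# pip install "qdrant-client>=1.7.0"
```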

Dataset#

Use the provided sample documents or create your own dataset with:

  • At least 10 documents

  • Each document containing multiple distinct topics/sections

  • Total text length of at least 50,000 characters


Tasks#

Task 1: Implement Semantic Chunking (40 points)#

  1. Implement a Semantic Chunker that:

    • Splits documents into sentences using proper sentence boundary detection

    • Calculates cosine similarity between consecutive sentences

    • Creates chunk boundaries when similarity drops below a threshold

    • Handles edge cases (very short/long sentences, code blocks, lists)

  2. Configuration Parameters:

    • Similarity threshold (recommended: 0.7-0.85)

    • Minimum chunk size (in sentences)

    • Maximum chunk size (in characters)

  3. Comparison Experiment:

    • Process the same documents using both Recursive Chunking and Semantic Chunking

    • Record the number of chunks, average chunk size, and chunking time

    • Analyze at least 3 examples where Semantic Chunking preserves context better
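The chunking steps above can be sketched as one small function. Everything here is illustrative, not the required implementation: `semantic_chunks` and `toy_embed` are made-up names, the regex splitter is a naive stand-in for proper sentence boundary detection, and the word-count embedding merely makes the demo self-contained.

```python
import re
import numpy as np

def semantic_chunks(text, embed, threshold=0.8, min_sents=2, max_chars=1000):
    """Split `text` where consecutive-sentence similarity drops below
    `threshold`. `embed` maps a list of sentences to a 2-D vector array."""
    # naive sentence splitting; use NLTK/spaCy for real boundary detection
    sents = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sents) <= 1:
        return sents
    vecs = embed(sents)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = (vecs[:-1] * vecs[1:]).sum(axis=1)  # cosine sim of neighbours
    chunks, cur = [], [sents[0]]
    for sent, sim in zip(sents[1:], sims):
        over_max = sum(len(s) for s in cur) + len(sent) > max_chars
        boundary = sim < threshold and len(cur) >= min_sents
        if boundary or over_max:
            chunks.append(" ".join(cur))
            cur = [sent]
        else:
            cur.append(sent)
    chunks.append(" ".join(cur))
    return chunks

# toy word-count embedding, just to keep the demo self-contained
def toy_embed(sents):
    vocab = sorted({w for s in sents for w in s.lower().split()})
    return np.array([[s.lower().split().count(w) for w in vocab]
                     for s in sents], float)

text = ("Dogs bark loudly. Dogs bark at night. "
        "Quantum physics is strange. Quantum states collapse.")
print(semantic_chunks(text, toy_embed, threshold=0.25))
# → two chunks, split at the dog→quantum topic shift
```

For the assignment itself, replace `toy_embed` with `model.encode(sents)` from a sentence-transformers model, and extend the boundary logic to handle code blocks and lists as required.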

Task 2: Configure HNSW Index (30 points)#

  1. Set up a vector database (ChromaDB or Qdrant) with HNSW indexing

  2. Experiment with HNSW parameters:

    • Test at least 3 different values for M (e.g., 16, 32, 64)

    • Test at least 3 different values for ef_construction (e.g., 100, 200, 400)

    • Test at least 3 different values for ef_search (e.g., 50, 100, 200)

  3. Document the results in a table showing:

    • Parameter configuration

    • Index build time

    • Average query latency

    • Memory usage (if measurable)

    • Recall@10 (compared to brute-force search)
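Recall@10 for each HNSW configuration can be measured with a small helper like the one below. The function names are illustrative; the commented ChromaDB metadata keys and Qdrant parameters reflect the 0.4.x / 1.7.x APIs but should be verified against your client version.

```python
import numpy as np

# ChromaDB (0.4.x) exposes HNSW knobs through collection metadata, roughly:
#   client.create_collection("docs", metadata={"hnsw:M": 32,
#       "hnsw:construction_ef": 200, "hnsw:search_ef": 100})
# Qdrant configures them via HnswConfigDiff(m=..., ef_construct=...) plus a
# per-query hnsw_ef search parameter.

def brute_force_topk(query, vectors, k=10):
    """Exact cosine top-k: the ground truth for recall measurement."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.argsort(-(v @ q))[:k].tolist()

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 64))
exact = brute_force_topk(rng.normal(size=64), vectors)
print(recall_at_k(exact, exact))  # → 1.0 when the ANN result matches exactly
```

In your benchmark, `approx_ids` would come from the HNSW-indexed query and `exact_ids` from the brute-force pass over the same vectors.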

Task 3: End-to-End RAG Pipeline (30 points)#

  1. Build a complete RAG pipeline that uses:

    • Your Semantic Chunker for document processing

    • HNSW-indexed vector database for retrieval

    • An LLM for answer generation (you may use the OpenAI API or a local model)

  2. Create a test set of at least 10 questions that require:

    • Single-topic answers (should retrieve one complete chunk)

    • Multi-topic answers (should retrieve multiple related chunks)

  3. Evaluate and compare retrieval quality between:

    • Baseline: Recursive Chunking + Brute-force search

    • Optimized: Semantic Chunking + HNSW index
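A minimal skeleton for the pipeline above might look like this. `MiniRAG`, `vocab_embed`, and the echo "LLM" are illustrative stand-ins: in your submission, the embedder would be a sentence-transformers model, the retrieval step would query your HNSW-indexed database, and `llm` would call a real model.

```python
import numpy as np

class MiniRAG:
    """Minimal RAG skeleton with pluggable embedder and LLM.

    `embed` maps list[str] -> 2-D array; `llm` maps a prompt string to an
    answer string.
    """
    def __init__(self, embed, llm):
        self.embed, self.llm = embed, llm
        self.chunks, self.vecs = [], None

    def index(self, chunks):
        self.chunks = chunks
        v = self.embed(chunks).astype(float)
        self.vecs = v / np.linalg.norm(v, axis=1, keepdims=True)

    def retrieve(self, question, k=2):
        q = self.embed([question])[0].astype(float)
        q = q / np.linalg.norm(q)
        order = np.argsort(-(self.vecs @ q))[:k]
        return [self.chunks[i] for i in order]

    def answer(self, question, k=2):
        context = "\n".join(self.retrieve(question, k))
        return self.llm(f"Context:\n{context}\n\nQuestion: {question}")

# toy fixed-vocabulary word-count embedding, for a runnable demo only
VOCAB = ["semantic", "chunking", "hnsw", "recall", "speed", "topic"]
def vocab_embed(texts):
    return np.array([[t.lower().count(w) for w in VOCAB] for t in texts], float)

rag = MiniRAG(vocab_embed, llm=lambda prompt: prompt)  # echo "LLM" stub
rag.index([
    "HNSW trades recall for speed.",
    "Semantic chunking splits documents at topic shifts.",
])
print(rag.retrieve("What does semantic chunking do?", k=1))
# → the chunk about semantic chunking
```

The same `retrieve`/`answer` interface lets you swap the baseline (recursive chunks, brute-force search) and the optimized variant (semantic chunks, HNSW) for a fair comparison.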


Submission Requirements#

Required Deliverables#

  • Source code in a Jupyter notebook or Python scripts

  • README.md with setup instructions and usage examples

  • Results table comparing chunking strategies

  • Results table comparing HNSW parameter configurations

  • Screenshots or logs showing retrieval quality comparison

Submission Checklist#

  • All code runs without errors

  • Semantic Chunker correctly preserves topic boundaries

  • HNSW index is properly configured and benchmarked

  • End-to-end pipeline produces coherent answers

  • Documentation is complete with clear explanations


Evaluation Criteria#

| Criteria | Points |
| --- | --- |
| Semantic Chunking implementation | 25 |
| Chunking comparison analysis | 15 |
| HNSW parameter experimentation | 20 |
| Performance benchmarking | 10 |
| End-to-end RAG pipeline | 20 |
| Code quality and documentation | 10 |
| **Total** | **100** |


Hints#

  • Start with a small dataset to test your Semantic Chunker before scaling up

  • Use sentence-transformers models like `all-MiniLM-L6-v2` for efficient similarity calculation

  • When tuning HNSW, prioritize ef_search for query-time optimization

  • Consider using the companion notebook `01-advanced-indexing.ipynb` as a reference