Assignment: RAG Architecture Experiment Comparison#

Assignment Metadata#

| Field | Description |
| --- | --- |
| Assignment Name | RAG Architecture Experiment Comparison |
| Course | LLMOps and Evaluation |
| Project Name | rag-experiment-comparison |
| Estimated Time | 180 minutes |
| Framework | Python 3.10+, LangChain, ChromaDB/Qdrant, Neo4j (optional), RAGAS, OpenAI |


Learning Objectives#

By completing this assignment, you will be able to:

  • Design a rigorous experimental framework for evaluating RAG systems

  • Implement at least two RAG architectures (Naive and Advanced)

  • Evaluate systems using RAGAS metrics and custom benchmarks

  • Analyze trade-offs between quality, cost, and latency

  • Derive actionable recommendations for architecture selection


Problem Description#

Your team needs to choose the right RAG architecture for a new product. You must conduct a scientific comparison to answer:

  1. Quality: Which architecture produces the most accurate and faithful answers?

  2. Cost: What is the cost per query for each architecture?

  3. Latency: How does response time vary across architectures?

  4. Use Cases: Which architecture fits which query types?


Technical Requirements#

Environment Setup#

  • Python 3.10 or higher

  • Required packages:

    • langchain >= 0.1.0

    • chromadb >= 0.4.0 OR qdrant-client >= 1.7.0

    • ragas >= 0.1.0

    • openai >= 1.0.0

    • sentence-transformers >= 2.2.0

    • (Optional) neo4j >= 5.0 for GraphRAG

Dataset Requirements#

  • At least 10 documents (minimum 50,000 characters total)

  • At least 30 test questions categorized by type:

    • Factual (40%): Simple fact lookup

    • Relational (30%): Relationship between concepts

    • Multi-hop (20%): Requires connecting multiple facts

    • Analytical (10%): Summarization or trend analysis
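To make the category split concrete, here is how the percentages above map onto the minimum 30-question set (an illustration only; the exact counts are up to you as long as the proportions roughly hold):

```python
# Illustrative: mapping the required 40/30/20/10 split onto 30 questions.
TOTAL_QUESTIONS = 30
CATEGORY_SPLIT = {"factual": 0.40, "relational": 0.30, "multi_hop": 0.20, "analytical": 0.10}

counts = {name: round(TOTAL_QUESTIONS * share) for name, share in CATEGORY_SPLIT.items()}
# 12 factual, 9 relational, 6 multi-hop, 3 analytical
assert sum(counts.values()) == TOTAL_QUESTIONS
```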


Tasks#

Task 1: Implement RAG Architectures (40 points)#

  1. Naive RAG (Required):

    • Fixed-size chunking (500-1000 characters)

    • Standard embedding model (e.g., text-embedding-3-small)

    • Top-K retrieval (k=5)

    • Direct LLM generation

  2. Advanced RAG (Required):

    • Semantic or recursive chunking

    • Hybrid search (Vector + BM25)

    • Query transformation (HyDE or similar)

    • Re-ranking with Cross-Encoder

  3. GraphRAG (Bonus - 10 extra points):

    • Entity and relationship extraction

    • Knowledge graph construction

    • Graph-based retrieval

  4. Document your implementations with architecture diagrams
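The Naive RAG pipeline can be sketched end to end. To keep this sketch self-contained and offline, it substitutes a toy bag-of-words "embedding" and skips the LLM call; in your implementation, swap in `text-embedding-3-small` (or your chosen model) and a real generation step:

```python
import math
import re
from collections import Counter

def chunk_fixed(text: str, size: int = 500) -> list[str]:
    """Step 1: fixed-size chunking at character boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Step 3: rank all chunks by similarity to the query, keep the top K."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

# Step 4 (direct generation) stuffs the retrieved chunks into a prompt:
doc = "RAGAS measures faithfulness. " * 40 + "ChromaDB stores embeddings. " * 40
chunks = chunk_fixed(doc, size=120)
context = retrieve_top_k("Where are embeddings stored?", chunks, k=2)
prompt = "Answer using only this context:\n" + "\n".join(context)
```

The Advanced architecture replaces each of these steps (semantic chunking, hybrid retrieval, re-ranking) while keeping the same interface, which makes side-by-side evaluation straightforward.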

Task 2: Create Evaluation Dataset (15 points)#

  1. Prepare documents:

    • Select or create documents with clear topics

    • Ensure coverage of different complexity levels

  2. Generate test questions:

    • Create questions for each category (Factual, Relational, Multi-hop, Analytical)

    • Provide ground truth answers

    • Tag questions with expected difficulty

  3. Validate dataset:

    • Ensure questions are unambiguous

    • Verify ground truth accuracy

    • Document any assumptions
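A JSONL file is one convenient storage format for the test set. The field names below are an assumption, not a RAGAS requirement; adapt them to whatever your evaluation harness expects:

```python
import json

# Hypothetical schema for one test-set entry (field names are illustrative).
question = {
    "id": "q-001",
    "category": "multi_hop",       # factual | relational | multi_hop | analytical
    "difficulty": "hard",          # expected-difficulty tag (Task 2, step 2)
    "question": "Which retrieval stage does the re-ranker depend on?",
    "ground_truth": "The hybrid search stage that produces candidate chunks.",
    "source_docs": ["doc_03.md"],  # where a grader can verify the answer
}

with open("test_set.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(question) + "\n")
```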

Task 3: Run Experiments (25 points)#

  1. Execute evaluation for each architecture:

    • Run all test questions through each system

    • Capture responses, contexts, and metadata

    • Record latency for each query

  2. Calculate metrics using RAGAS:

    • Faithfulness

    • Answer Relevancy

    • Context Precision

    • Context Recall

  3. Track costs:

    • Embedding API calls

    • LLM API calls

    • Total cost per query

  4. Compile results table:

| System | Faithfulness | Answer Rel. | Context Prec. | Context Rec. | Latency (s) | Cost ($) |
| --- | --- | --- | --- | --- | --- | --- |
| Naive | | | | | | |
| Advanced | | | | | | |
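Latency and cost tracking for steps 1 and 3 can be as simple as the sketch below. The per-1K-token prices are placeholders, not official OpenAI rates; look up your provider's current price sheet:

```python
import time

# Placeholder per-1K-token prices (assumptions -- verify against your
# provider's current pricing before reporting costs).
PRICE_PER_1K = {"embedding": 0.00002, "llm_input": 0.00015, "llm_output": 0.0006}

def query_cost(embed_tokens: int, prompt_tokens: int, completion_tokens: int) -> float:
    """Total API cost in USD for one query (embedding + LLM calls)."""
    return (embed_tokens / 1000 * PRICE_PER_1K["embedding"]
            + prompt_tokens / 1000 * PRICE_PER_1K["llm_input"]
            + completion_tokens / 1000 * PRICE_PER_1K["llm_output"])

def timed(fn, *args):
    """Wrap one query call and record its wall-clock latency."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Example: 20 embedding tokens, 1,500 prompt tokens, 200 completion tokens.
cost = query_cost(20, 1500, 200)
```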

Task 4: Analysis and Recommendations (15 points)#

  1. Performance by query type:

    • Break down metrics by question category

    • Identify strengths and weaknesses of each architecture

  2. Trade-off analysis:

    • Quality vs. Cost

    • Quality vs. Latency

    • Create visualizations (charts/graphs)

  3. Write recommendations (300-500 words):

    • When to use each architecture

    • Optimization priorities for each

    • Your recommended default choice with justification
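One simple way to quantify the quality-vs-cost trade-off is a quality-per-dollar ranking. The numbers below are hypothetical placeholders, not real measurements; substitute your Task 3 results:

```python
# Hypothetical results (placeholders only) illustrating one trade-off metric.
results = {
    "naive":    {"faithfulness": 0.78, "cost_usd": 0.004, "latency_s": 1.2},
    "advanced": {"faithfulness": 0.91, "cost_usd": 0.012, "latency_s": 3.1},
}

def quality_per_dollar(r: dict) -> float:
    return r["faithfulness"] / r["cost_usd"]

# Rank architectures by how much faithfulness each dollar buys.
ranking = sorted(results, key=lambda name: quality_per_dollar(results[name]),
                 reverse=True)
```

A single scalar like this is a starting point, not a verdict; pair it with the per-category breakdown, since an architecture that loses on average may still win on multi-hop questions.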


Submission Requirements#

Required Deliverables#

  • Source code for all implemented architectures

  • README.md with setup and execution instructions

  • Test dataset with questions and ground truth

  • Results table with all metrics

  • Analysis report with visualizations

  • Architecture decision recommendation

Submission Checklist#

  • At least 2 RAG architectures are implemented

  • Test dataset has 30+ categorized questions

  • All RAGAS metrics are calculated

  • Cost and latency are tracked

  • Analysis includes actionable insights

  • Code is well-documented


Evaluation Criteria#

| Criteria | Points |
| --- | --- |
| RAG architecture implementations | 40 |
| Evaluation dataset quality | 15 |
| Experiment execution & metrics | 25 |
| Analysis and recommendations | 15 |
| Code quality and documentation | 5 |
| **Total** | **100** |
| Bonus: GraphRAG implementation | +10 |


Hints#

  • Start with Naive RAG to establish a baseline, then add complexity

  • Use the same embedding model across architectures for fair comparison

  • For Advanced RAG, consider LangChain’s built-in query transformers

  • When analyzing results, look for patterns in failure cases

  • Consider statistical significance when comparing small differences

  • The companion notebooks can help with RAGAS evaluation setup
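One lightweight way to act on the statistical-significance hint is a paired bootstrap over per-question scores. The sketch below uses hypothetical faithfulness scores; feed in your own per-question results:

```python
import random

def bootstrap_win_rate(scores_a: list[float], scores_b: list[float],
                       n_resamples: int = 2000, seed: int = 0) -> float:
    """Paired bootstrap: fraction of resamples where system A's mean beats B's.

    Both lists must hold per-question scores in the same question order.
    Values near 1.0 suggest A's advantage is robust; values near 0.5 suggest
    the observed difference could easily be noise.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]
        if sum(sample) / len(sample) > 0:
            wins += 1
    return wins / n_resamples

# Hypothetical per-question faithfulness scores for 10 questions:
advanced = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9, 0.8, 0.75, 0.9, 0.85]
naive    = [0.7, 0.8, 0.60, 0.7, 0.65, 0.8, 0.7, 0.70, 0.6, 0.75]
p_a_better = bootstrap_win_rate(advanced, naive)
```

With only 30 test questions, per-category subsets are small (as few as 3 questions), so hedge any per-category conclusions accordingly.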