# Final Exam: Production-Ready RAG Evaluation System

## Overview

| Field | Value |
|---|---|
| Course | LLMOps and Evaluation |
| Duration | 240 minutes (4 hours) |
| Passing Score | 70% |
| Total Points | 100 |
## Description

You have been hired as an MLOps Engineer at AI Solutions Corp., a company that builds enterprise AI assistants. Your task is to build a Production-Ready RAG Evaluation System that combines automated quality assessment, comprehensive observability, and rigorous architecture comparison.

The current system lacks:

- Automated evaluation metrics to measure answer quality
- Observability into LLM execution, costs, and latency
- Data-driven architecture selection based on experiments
You must apply knowledge from RAGAS Evaluation Metrics, LLM Observability (LangFuse/LangSmith), and RAG Architecture Comparison to build a comprehensive evaluation and monitoring platform.
## Objectives

By completing this exam, you will demonstrate mastery of:

- Implementing RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall)
- Integrating LangFuse and/or LangSmith for comprehensive LLM tracing and cost tracking
- Designing and executing RAG architecture experiments with scientific rigor
- Building an end-to-end evaluation pipeline that combines all three components
- Making data-driven architecture recommendations based on experimental results
## Problem Description

Build a Production-Ready RAG Evaluation System named `rag-evaluation-platform` that:

- Evaluates RAG quality using RAGAS metrics on generated responses
- Traces all LLM operations with full observability (tokens, costs, latency)
- Compares multiple RAG architectures systematically
- Produces actionable reports for architecture selection
The system should serve as a complete toolkit for evaluating, monitoring, and optimizing RAG systems in production.
## Assumptions

- You have completed the assignments on RAGAS, Observability, and Experiment Comparison
- An OpenAI API key or a compatible LLM endpoint is available
- A LangFuse Cloud account OR a local Docker setup for self-hosted LangFuse is available
- A LangSmith account (free tier) is available
- A Python 3.10+ environment with the necessary packages is installed
- Sample documents and test questions are provided or created
## Technical Requirements

### Environment Setup

- Python 3.10 or higher
- Required packages:
  - `ragas >= 0.1.0`
  - `langfuse >= 2.0.0`
  - `langchain >= 0.1.0`
  - `langchain-openai >= 0.0.5`
  - `openai >= 1.0.0`
  - `chromadb >= 0.4.0` OR `qdrant-client >= 1.7.0`
  - `sentence-transformers >= 2.2.0`
  - `pandas >= 2.0.0`
  - `matplotlib >= 3.7.0`
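If convenient, the same pins can live in a `requirements.txt` (a sketch; keep only the vector-store client you actually use):

```text
ragas>=0.1.0
langfuse>=2.0.0
langchain>=0.1.0
langchain-openai>=0.0.5
openai>=1.0.0
chromadb>=0.4.0  # or: qdrant-client>=1.7.0
sentence-transformers>=2.2.0
pandas>=2.0.0
matplotlib>=3.7.0
```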
### Infrastructure

- Vector Database: ChromaDB or Qdrant
- Observability: LangFuse (required) + LangSmith (optional)
- Embedding Model: `text-embedding-3-small` or equivalent
- LLM: GPT-4 or equivalent
## Tasks

### Task 1: RAGAS Evaluation Pipeline (25 points)

**Time Allocation:** 60 minutes

Build a comprehensive evaluation pipeline using all four RAGAS metrics.
#### Requirements

**Implement RAGAS Evaluation Module**

- Create functions to calculate Faithfulness, Answer Relevancy, Context Precision, and Context Recall
- Support batch evaluation on datasets
- Handle edge cases (empty contexts, very short answers)

**Create Evaluation Dataset**

- Prepare at least 30 test questions with ground truth answers
- Categorize questions: Factual (40%), Relational (30%), Multi-hop (20%), Analytical (10%)
- Include retrieved contexts for each question

**Run Evaluation**

- Execute evaluation on the complete dataset
- Calculate aggregate statistics (mean, std, min, max)
- Identify failure cases (scores < 0.5)
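A minimal sketch of the batch evaluation step for the requirements above, assuming `ragas` 0.1.x (with its `datasets` dependency) and an `OPENAI_API_KEY` in the environment; the single sample row is illustrative only:

```python
# Minimal RAGAS batch run; the sample row is illustrative.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

data = {
    "question": ["What is the refund window?"],
    "answer": ["Refunds are available within 30 days of purchase."],
    "contexts": [["Customers may request a refund within 30 days of purchase."]],
    "ground_truth": ["Refunds can be requested up to 30 days after purchase."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)                        # aggregate score per metric
df = result.to_pandas()              # per-row scores for failure analysis
print(df[df["faithfulness"] < 0.5])  # flag failure cases (scores < 0.5)
```

The per-row dataframe is what feeds the aggregate statistics and the failure-case analysis.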
#### Deliverables

- `evaluation/ragas_evaluator.py` - Core evaluation logic
- `evaluation/dataset.py` - Dataset loading and preparation
- `data/test_questions.json` - Test dataset with ground truth
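One possible shape for an entry in `data/test_questions.json`; the field names here are an assumption, not a prescribed schema:

```json
{
  "question": "What is the refund window?",
  "category": "factual",
  "ground_truth": "Refunds can be requested up to 30 days after purchase.",
  "contexts": [
    "Customers may request a refund within 30 days of purchase."
  ]
}
```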
### Task 2: LLM Observability Integration (25 points)

**Time Allocation:** 60 minutes

Implement comprehensive tracing and monitoring for all LLM operations.
#### Requirements

**LangFuse Integration**

- Configure the LangFuse SDK with proper authentication
- Implement a `CallbackHandler` for all LangChain operations
- Capture: input/output, token counts, latency, costs
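A wiring sketch, assuming the langfuse 2.x LangChain callback and `LANGFUSE_PUBLIC_KEY` / `LANGFUSE_SECRET_KEY` / `LANGFUSE_HOST` set in the environment; the model name is a placeholder:

```python
# Route a LangChain call through LangFuse so input/output, token counts,
# latency, and cost land in the trace.
from langfuse.callback import CallbackHandler
from langchain_openai import ChatOpenAI

handler = CallbackHandler()  # reads the LANGFUSE_* credentials from the env

llm = ChatOpenAI(model="gpt-4o-mini")  # placeholder model
response = llm.invoke(
    "Answer from the given context only: what is the refund window?",
    config={"callbacks": [handler]},
)
print(response.content)  # the corresponding trace appears in the LangFuse UI
```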
**Cost Tracking Dashboard**

- Track token usage per query
- Calculate costs based on model pricing
- Generate cost breakdown reports
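A back-of-the-envelope cost helper; the per-million-token prices below are assumed placeholders and must be replaced with your provider's current pricing:

```python
# Estimate the cost of one query from its token counts. Prices are placeholder
# values in USD per 1M tokens (input, output); verify against current pricing.
PRICING = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

print(f"${query_cost('gpt-4o-mini', 1_200, 300):.6f}")  # -> $0.000360
```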
**Production Best Practices**

- Implement configurable sampling (100% in dev, 5% in prod)
- Add PII masking for sensitive data
- Create correlation IDs for request tracking
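Sampling and masking can be handled in application code without relying on any SDK-specific feature; `TRACE_SAMPLE_RATE` and the email regex below are assumptions:

```python
# Application-level sampling and PII masking. TRACE_SAMPLE_RATE is an assumed
# env var: set 1.0 in dev (trace everything) and 0.05 in prod (5% sample).
import os
import random
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_pii(text: str) -> str:
    """Redact email addresses before text is sent to the tracing backend."""
    return EMAIL_RE.sub("[EMAIL]", text)

def sampled_callbacks(handler) -> list:
    """Attach the tracing handler to only a sampled fraction of requests."""
    rate = float(os.environ.get("TRACE_SAMPLE_RATE", "1.0"))
    return [handler] if random.random() < rate else []
```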
**(Bonus) LangSmith Integration**

- Configure auto-tracing via environment variables
- Demonstrate Playground debugging for a failed trace
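For the bonus, LangSmith auto-tracing is enabled purely through environment variables; the key and project name below are placeholders:

```python
# With these set before any LangChain call, runs are traced to LangSmith
# without further code changes.
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "rag-evaluation-platform"
```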
#### Deliverables

- `observability/langfuse_handler.py` - LangFuse integration
- `observability/cost_tracker.py` - Cost calculation logic
- `observability/pii_masker.py` - PII handling
- Screenshots of the LangFuse dashboard with traces
### Task 3: RAG Architecture Comparison (25 points)

**Time Allocation:** 60 minutes

Design and execute a rigorous experiment comparing multiple RAG architectures.
#### Requirements

**Implement Two RAG Architectures**

- Naive RAG: fixed chunking, top-k retrieval, direct generation
- Advanced RAG: semantic chunking, hybrid search, re-ranking
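A sketch of the naive architecture's retrieval step with ChromaDB, assuming the chunked documents were already added to a `docs` collection:

```python
# Naive RAG retrieval: fixed top-k nearest chunks from a ChromaDB collection.
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.get_or_create_collection("docs")

results = collection.query(
    query_texts=["What is the refund window?"],
    n_results=5,  # the fixed top-k of the naive architecture
)
contexts = results["documents"][0]  # top-k chunks for the single query
```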
**Run Comparative Experiments**

- Execute both architectures on the same test dataset
- Capture all RAGAS metrics for each architecture
- Track latency and cost per query
**Performance Analysis**

- Break down performance by question category
- Calculate statistical significance of differences
- Create visualizations (bar charts, tables)
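One significance check that stays within the listed dependencies (no scipy needed) is a standard-library bootstrap confidence interval on per-question score differences; the scores below are placeholders:

```python
# Bootstrap 95% CI for the mean per-question score difference, paired by
# question since both architectures answer the same test set.
import random
import statistics

naive = [0.62, 0.71, 0.55, 0.80, 0.68]     # placeholder per-question scores
advanced = [0.78, 0.74, 0.69, 0.85, 0.72]  # placeholder per-question scores
diffs = [a - n for a, n in zip(advanced, naive)]

boot_means = sorted(
    statistics.mean(random.choices(diffs, k=len(diffs))) for _ in range(10_000)
)
n = len(boot_means)
lo, hi = boot_means[int(0.025 * n)], boot_means[int(0.975 * n)]
print(f"mean diff = {statistics.mean(diffs):+.3f}, 95% CI = [{lo:+.3f}, {hi:+.3f}]")
# If the interval excludes 0, the difference is unlikely to be noise.
```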
#### Deliverables

- `architectures/naive_rag.py` - Naive RAG implementation
- `architectures/advanced_rag.py` - Advanced RAG implementation
- `experiments/runner.py` - Experiment execution
- `results/comparison_table.md` - Results summary
### Task 4: Integrated Evaluation Platform (25 points)

**Time Allocation:** 60 minutes

Combine all components into a unified evaluation platform.
#### Requirements

**End-to-End Pipeline**

- Single entry point to run the complete evaluation
- Automatic tracing of all operations
- Configurable architecture selection

**Comprehensive Reporting**

- Generate an evaluation report with all metrics
- Include observability insights (cost, latency distribution)
- Architecture comparison summary
- Actionable recommendations

**CLI Interface**

```bash
python evaluate.py --architecture naive --dataset data/test.json --output results/
python evaluate.py --architecture advanced --dataset data/test.json --output results/
python compare.py --results-dir results/ --output comparison_report.md
```
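A minimal `argparse` skeleton for `evaluate.py`; only the flag names come from the commands above, the rest is an assumed scaffold:

```python
# Skeleton for evaluate.py; flags mirror the CLI examples in this task.
import argparse

def main() -> None:
    parser = argparse.ArgumentParser(description="Run a traced RAG evaluation.")
    parser.add_argument("--architecture", choices=["naive", "advanced"], required=True)
    parser.add_argument("--dataset", required=True, help="path to the test dataset JSON")
    parser.add_argument("--output", default="results/", help="directory for result files")
    args = parser.parse_args()

    # ... load the dataset, build the selected architecture, run the RAGAS
    # evaluation with tracing enabled, and write results into args.output
    print(f"Evaluating {args.architecture} RAG on {args.dataset} -> {args.output}")

if __name__ == "__main__":
    main()
```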
**Answer Key Questions**

- Which architecture should be used for production, and why?
- What is the cost-quality trade-off between the architectures?
- What are the top 3 failure patterns, and how would you address them?
#### Deliverables

- `evaluate.py` - Main evaluation script
- `compare.py` - Architecture comparison script
- `reports/evaluation_report.md` - Complete evaluation report
- `ANSWERS.md` - Written responses to the key questions
## Questions to Answer

Include written responses to these questions in `ANSWERS.md`:

1. **RAGAS Interpretation:** Analyze your Faithfulness and Answer Relevancy scores. What do low scores indicate about your RAG system, and how would you improve them?
2. **Observability Value:** How did LangFuse/LangSmith tracing help you identify issues in your RAG pipeline? Provide a specific example.
3. **Architecture Decision:** Based on your experiments, which RAG architecture would you recommend for a customer support chatbot vs. a legal document Q&A system? Justify with data.
4. **Cost Optimization:** If you had to reduce costs by 50% while maintaining 90% of quality, what strategies would you employ? Reference your experimental results.
5. **Production Readiness:** What additional monitoring, alerting, or evaluation would you add before deploying this system to production?
## Submission Requirements

### Required Deliverables

- Complete source code organized in the specified directory structure
- `README.md` with:
  - Setup instructions (dependencies, API keys, observability setup)
  - Usage examples for CLI commands
  - An architecture diagram of the evaluation platform
- `ANSWERS.md` with written responses to the 5 questions
- Test dataset with at least 30 categorized questions
- Results tables and visualizations
- Screenshots of observability dashboards
### Submission Checklist

- All code runs without errors
- RAGAS evaluation produces valid scores for all metrics
- LangFuse traces are captured and visible in the dashboard
- Both RAG architectures are implemented and evaluated
- The comparison report includes statistical analysis
- All questions are answered with data-backed reasoning
## Evaluation Criteria

| Criteria | Weight | Excellent (90-100%) | Good (70-89%) | Needs Improvement (50-69%) | Unsatisfactory (<50%) |
|---|---|---|---|---|---|
| RAGAS Evaluation | 25% | All 4 metrics implemented correctly; comprehensive dataset; insightful failure analysis | Metrics implemented; adequate dataset; basic analysis | Partial metrics; small dataset; minimal analysis | Missing metrics; no dataset |
| Observability | 25% | Full LangFuse integration; cost tracking; PII handling; production best practices | LangFuse working; basic cost tracking; some best practices | Partial tracing; no cost tracking | No observability integration |
| Architecture Comparison | 25% | Both architectures implemented; rigorous experiments; statistical analysis; visualizations | Both architectures; experiments run; basic comparison | One architecture; limited experiments | No architecture comparison |
| Integration & Reporting | 15% | Seamless pipeline; comprehensive reports; CLI interface; actionable insights | Components integrated; adequate reports | Partial integration; basic reports | Components not connected |
| Code Quality & Documentation | 10% | Clean code; comprehensive docs; clear README; well-organized | Readable code; adequate docs | Messy code; minimal docs | Poor quality; no docs |
## Estimated Time

| Task | Time Allocation |
|---|---|
| Task 1: RAGAS Evaluation Pipeline | 60 minutes |
| Task 2: LLM Observability Integration | 60 minutes |
| Task 3: RAG Architecture Comparison | 60 minutes |
| Task 4: Integrated Evaluation Platform | 60 minutes |
| **Total** | **240 minutes (4 hours)** |
## Hints

**Task 1 - RAGAS:**

- Use the companion notebook `10_RAG_Evaluation_with_Ragas.ipynb` as a reference
- Start with a small dataset (10 questions) to verify your pipeline before scaling up
- For claim decomposition in Faithfulness, consider using GPT-4 for accuracy

**Task 2 - Observability:**

- Set up LangFuse first, since it requires explicit callback handlers (good for understanding the mechanics)
- Use environment variables to switch between dev (100% tracing) and prod (5% sampling) modes
- Test PII masking with fake data before using real sensitive information

**Task 3 - Experiments:**

- Use the same embedding model for both architectures to ensure a fair comparison
- Run each query multiple times when measuring latency to account for variance
- Calculate confidence intervals when comparing metric differences

**Task 4 - Integration:**

- Use Python's `argparse` or `click` library for the CLI implementation
- Generate markdown reports that can be easily shared with stakeholders
- Include both quantitative metrics and qualitative insights in recommendations