Assignment: RAGAS Evaluation Metrics#
Assignment Metadata#
| Field | Description |
|---|---|
| Assignment Name | RAGAS Evaluation Metrics for RAG Systems |
| Course | LLMOps and Evaluation |
| Project Name | |
| Estimated Time | 90 minutes |
| Framework | Python 3.10+, RAGAS, LangChain, OpenAI API |
Learning Objectives#
By completing this assignment, you will be able to:
Implement RAGAS evaluation metrics for RAG systems
Calculate Faithfulness scores by decomposing answers into verifiable statements
Measure Answer Relevancy using reverse-engineered questions and embedding similarity
Evaluate Context Precision and Context Recall for retrieval quality assessment
Analyze the relationship between different metrics and overall RAG performance
Problem Description#
You are tasked with building an evaluation pipeline for a Q&A RAG system. The system retrieves documents and generates answers, but you need to measure its quality across multiple dimensions:
Faithfulness: Are generated answers grounded in the retrieved context?
Answer Relevancy: Do answers actually address the user's questions?
Context Precision: Are relevant documents ranked higher in retrieval?
Context Recall: Does retrieval capture all necessary information?
Technical Requirements#
Environment Setup#
Python 3.10 or higher
Required packages:
ragas >= 0.1.0
langchain >= 0.1.0
openai >= 1.0.0
datasets (HuggingFace)
Dataset#
Create or use a Q&A dataset with:
At least 20 question-answer pairs
Each item containing: question, ground_truth answer, retrieved contexts, generated answer
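To make the expected shape concrete, here is a hypothetical single evaluation item. The column names (`question`, `answer`, `contexts`, `ground_truth`) follow the convention used by RAGAS 0.1.x; verify them against the version you install.

```python
# One hypothetical evaluation item with the four required fields.
sample = {
    "question": "What does the Faithfulness metric measure?",
    "ground_truth": "Faithfulness measures whether the generated answer "
                    "is grounded in the retrieved context.",
    "contexts": [  # retrieved chunks, ordered by retrieval rank
        "Faithfulness checks that every claim in the generated answer "
        "can be inferred from the retrieved context.",
        "Context Recall measures how much of the reference answer is "
        "covered by the retrieved contexts.",
    ],
    "answer": "Faithfulness measures how well the answer's claims are "
              "supported by the retrieved context.",
}

# A list of such dicts can be converted to the datasets.Dataset that
# RAGAS expects, e.g. with datasets.Dataset.from_list(items).
```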
Tasks#
Task 1: Faithfulness Evaluation (25 points)#
Implement Faithfulness scoring that:
Decomposes generated answers into individual claims/statements
Verifies each claim against the retrieved context
Calculates the ratio of supported claims
Create test cases demonstrating:
High faithfulness (score > 0.9): All claims supported by context
Medium faithfulness (0.5-0.9): Partial support
Low faithfulness (< 0.5): Hallucinated content
Document at least 3 examples with detailed analysis of claim decomposition
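The arithmetic behind the three steps above is a simple ratio. In RAGAS the claim decomposition and per-claim verification are done by an LLM judge; the sketch below (a hypothetical helper, not the library API) takes the verdicts as given to show how the final score falls out.

```python
def faithfulness_score(claims: list[str], supported: list[bool]) -> float:
    """Ratio of answer claims supported by the retrieved context.

    In RAGAS, decomposing the answer into claims and verifying each one
    is delegated to an LLM; here the per-claim verdicts are supplied
    directly so the scoring arithmetic is visible.
    """
    if not claims:
        return 0.0
    return sum(supported) / len(claims)

# Example: 3 claims, of which 2 are supported by the context.
claims = [
    "The Eiffel Tower is in Paris.",         # supported
    "It was completed in 1889.",             # supported
    "It is the tallest building in Europe.", # hallucinated
]
print(faithfulness_score(claims, [True, True, False]))  # 2/3, i.e. "medium" faithfulness
```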
Task 2: Answer Relevancy Evaluation (25 points)#
Implement Answer Relevancy scoring that:
Generates N hypothetical questions from the answer
Computes embedding similarity with the original question
Returns average cosine similarity score
Test with examples showing:
Complete answers (high relevancy)
Partial answers (medium relevancy)
Off-topic answers (low relevancy)
Analyze how answer completeness affects the relevancy score
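The scoring step of Task 2 can be sketched without any model calls: given an embedding of the original question and embeddings of the N questions reverse-engineered from the answer, the score is the mean cosine similarity. The toy 3-d vectors below stand in for real embedding-model outputs; the function names are illustrative, not part of RAGAS.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_relevancy(q_emb: list[float],
                     generated_q_embs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question's embedding
    and the embeddings of questions generated from the answer."""
    sims = [cosine(q_emb, g) for g in generated_q_embs]
    return sum(sims) / len(sims)

# Toy embeddings: the generated questions are close to the original,
# so the relevancy score is high.
original = [1.0, 0.0, 0.0]
generated = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
print(round(answer_relevancy(original, generated), 3))
```

An off-topic answer would yield generated questions whose embeddings point elsewhere, dragging the average similarity down.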
Task 3: Context Precision & Recall (25 points)#
Implement Context Precision that:
Evaluates relevance of each retrieved chunk
Calculates Precision@k at each position
Computes weighted average for final score
Implement Context Recall that:
Decomposes reference answer into claims
Checks attribution to retrieved contexts
Calculates coverage ratio
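Both computations above reduce to short formulas. The sketch below assumes the RAGAS-style convention of averaging precision@k over the relevant positions for Context Precision, and a simple attributed-claims ratio for Context Recall; the relevance judgments and claim attributions (normally produced by an LLM) are passed in as booleans.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks: precision@k is
    accumulated at each relevant position, then averaged over the
    number of relevant chunks."""
    if not any(relevance):
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / hits

def context_recall(claims_attributed: list[bool]) -> float:
    """Fraction of reference-answer claims attributable to the
    retrieved contexts."""
    if not claims_attributed:
        return 0.0
    return sum(claims_attributed) / len(claims_attributed)

# Retrieval ranked [relevant, irrelevant, relevant]:
print(context_precision([True, False, True]))   # (1/1 + 2/3) / 2
# Reference answer with 4 claims, 3 found in the contexts:
print(context_recall([True, True, False, True]))  # 0.75
```

Note how Context Precision rewards ranking relevant chunks early: moving the irrelevant chunk to the end would raise the score.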
Create a retrieval analysis with:
At least 5 queries with varying retrieval quality
Precision/Recall scores for each
Recommendations for improvement
Task 4: End-to-End Evaluation Pipeline (25 points)#
Build a complete evaluation pipeline using RAGAS:
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# Your evaluation code here
```
Evaluate your RAG system on the full dataset
Generate a report with:
Summary statistics (mean, std, min, max for each metric)
Correlation analysis between metrics
Identified failure cases and root causes
Submission Requirements#
Required Deliverables#
Source code (Jupyter notebook or Python scripts)
README.md with setup and usage instructions
Evaluation results table with all four metrics
Analysis report with examples and insights
Screenshots of RAGAS evaluation outputs
Submission Checklist#
All code runs without errors
All four RAGAS metrics are implemented correctly
Test cases cover edge cases and failure modes
Analysis includes actionable recommendations
Documentation is complete and clear
Evaluation Criteria#
| Criteria | Points |
|---|---|
| Faithfulness implementation & analysis | 25 |
| Answer Relevancy implementation | 25 |
| Context Precision & Recall | 25 |
| End-to-end pipeline & reporting | 15 |
| Code quality and documentation | 10 |
| Total | 100 |
Hints#
Use the companion notebook 10_RAG_Evaluation_with_Ragas.ipynb as a reference
Start with small examples to understand each metric before scaling up
For Faithfulness, consider using GPT-4 for more accurate claim verification
When testing Answer Relevancy, vary the completeness of answers systematically
Compare your manual calculations with RAGAS automated scores to validate understanding