Evaluation Toolkit - Ragas
To evaluate a RAG system, we need a specialized evaluation toolkit. One of the chosen candidates is Ragas, introduced in the research paper 'RAGAS: Automated Evaluation of Retrieval Augmented Generation' (2024).
Ragas is an automated evaluation framework designed specifically for RAG systems, measuring the quality of both main components: retrieval and generation. Unlike traditional evaluation methods, which require ground-truth annotations created by humans, Ragas uses large language models such as GPT-4 to automate the evaluation process, reducing cost and time.
The framework operates on the principle of multi-dimensional evaluation, where each aspect of the RAG system is measured through separate metrics. The four main metrics used in this document include faithfulness, answer relevancy, context precision, and context recall.
1. Faithfulness - Measuring Factual Consistency
The Faithfulness metric evaluates how factually consistent the answer is with the retrieved context, guarding against hallucination. An answer is considered faithful if every statement in it can be supported by the retrieved context:
\[\text{Faithfulness} = \frac{|\text{statements supported by the context}|}{|\text{total statements in the answer}|}\]
Calculation Process:
Decomposition: Use LLM to split the answer into individual statements (claims).
Verification: Check each statement to see if it can be inferred from the context.
Scoring: Apply the formula to calculate the ratio of correct statements.
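The decomposition and verification steps require LLM calls, but the final scoring step is a simple ratio. A minimal sketch in Python, assuming the per-claim LLM verdicts (supported or not) are already available as booleans:

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of decomposed claims that the LLM judged
    to be supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Two claims, one supported and one contradicted by the context.
score = faithfulness_score([True, False])
print(score)  # 0.5
```

In a real pipeline, `claim_verdicts` would come from an LLM prompt that checks each claim against the context; here they are hard-coded for illustration.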
Illustrative Example - Faithfulness
```mermaid
graph TD
Q["Question: 'Where and when was Einstein born?'"]
C["Context: '...born 14 March 1879...\nGerman-born physicist...'"]
A["Answer: 'Einstein was born in Germany\non 20 March 1879.'"]
A -->|"LLM decompose"| S1["Statement 1:\n'born in Germany' ✓\n(supported by context)"]
A -->|"LLM decompose"| S2["Statement 2:\n'born on 20 March 1879' ✗\n(context says 14 March)"]
S1 & S2 --> F["Faithfulness = 1/2 = 0.5"]
```
Question:
Where and when was Einstein born?
Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time.
Answer:
Einstein was born in Germany on 20 March 1879.
Analysis: LLM splits the answer into two statements:
Statement 1: ‘Einstein was born in Germany.’: Correct, can be inferred from context (‘German-born’).
Statement 2: ‘Einstein was born on 20 March 1879.’: Incorrect, context says 14 March 1879 not 20 March 1879.
Result: Faithfulness = 1/2 = 0.5 because only one of the two statements can be verified from the context.
2. Answer Relevancy - Measuring Relevance
The Answer Relevancy metric evaluates how relevant the answer is to the original question, confirming whether the answer actually addresses what was asked. This metric does not assess factual correctness; it focuses on completeness and the absence of redundant information.
\[\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos\!\left(E_{g_i}, E_o\right)\]
where \(E_{g_i}\) is the embedding of the \(i\)-th question generated from the answer, \(E_o\) is the embedding of the original question, and \(N\) is the number of generated questions.
Calculation Process:
Reverse-engineer: Ask LLM to generate \(N\) different questions from the given answer.
Embedding: Convert the original question and generated questions into embedding vectors.
Similarity Calculation: Calculate the average cosine similarity between the original question and the generated questions.
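The steps above can be sketched with plain cosine similarity. The 2-d embedding vectors below are hypothetical toy values standing in for a real embedding model:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def answer_relevancy(original_emb: list[float],
                     generated_embs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question's
    embedding and the embeddings of N LLM-generated questions."""
    return sum(cosine(original_emb, g) for g in generated_embs) / len(generated_embs)

# Toy embeddings: one generated question aligns with the original,
# the other is orthogonal (irrelevant), so the mean drops to 0.5.
score = answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(round(score, 2))  # 0.5
```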
Illustrative Example - Answer Relevancy
Question:
Where is France and what is its capital?
Low relevance answer:
France is in western Europe.
High relevance answer:
France is in western Europe and Paris is its capital.
Low relevance answer analysis: LLM might generate questions like 'Where is France located?' or 'In which part of Europe is France situated?'. These questions only partially match the original question because the answer omits any information about the capital.
High relevance answer analysis: LLM might generate the question ‘Where is France and what is its capital?’ matching the original question, leading to higher cosine similarity.
Result: The complete answer has an Answer Relevancy score near 1, while the incomplete answer has a significantly lower score.
3. Context Precision - Measuring Retrieval Accuracy
The Context Precision metric measures the accuracy of the retrieval process by assessing the ranking of contexts. It checks whether relevant chunks are ranked near the top of the retrieved-context list.
\[\text{Context Precision@}K = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{total number of relevant chunks in the top } K}\]
where \(K\) is the total number of chunks in retrieved contexts and \(v_k \in \{ 0, 1 \}\) is the relevance indicator at position \(k\).
Calculation Process:
Determine Relevance: Use LLM to evaluate if each context is relevant to the question.
Calculate Precision@k: For each position \(k\), calculate the ratio of relevant contexts in the top \(k\).
Weighted Average: Calculate the weighted average of Precision@k, counting only for positions with relevant contexts.
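A sketch of the weighted average, assuming the per-chunk LLM relevance verdicts \(v_k\) are given as 0/1 flags in ranked order:

```python
def context_precision(relevance: list[int]) -> float:
    """Average of Precision@k over positions k where the chunk
    is relevant (v_k = 1); returns 0.0 if nothing is relevant."""
    total, hits = 0.0, 0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        if v_k:
            total += hits / k  # Precision@k, counted only at relevant positions
    n_relevant = sum(relevance)
    return total / n_relevant if n_relevant else 0.0

# Green-tea example: relevant chunks at ranks 1, 3, and 5.
score = context_precision([1, 0, 1, 0, 1])
print(round(score, 2))  # 0.76
```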
Illustrative Example - Context Precision
Question:
What are the health benefits of green tea?
Retrieved contexts in order:
Green tea contains antioxidants that may reduce cancer risk. - Relevant
Tea plantations are common in Asia, especially China and India. - Irrelevant
Green tea can boost metabolism and aid weight loss. - Relevant
The history of tea dates back thousands of years. - Irrelevant
Green tea improves brain function and mental alertness. - Relevant
Calculation:
Precision@1 = 1/1 = 1.0, \(v_1 = 1\)
Precision@2 = 1/2 = 0.5, \(v_2 = 0\)
Precision@3 = 2/3 ≈ 0.67, \(v_3 = 1\)
Precision@4 = 2/4 = 0.5, \(v_4 = 0\)
Precision@5 = 3/5 = 0.6, \(v_5 = 1\)
Result: Context Precision = (1.0 × 1 + 0.67 × 1 + 0.6 × 1) / 3 = 2.27 / 3 ≈ 0.76. The score reflects that there are irrelevant contexts interspersed between useful contexts.
4. Context Recall - Measuring Retrieval Coverage
The Context Recall metric evaluates the coverage of the retrieval process, measuring how much of the necessary information from the reference answer was found in the retrieved contexts. Formula:
\[\text{Context Recall} = \frac{|\text{reference claims supported by the retrieved contexts}|}{|\text{total claims in the reference answer}|}\]
Calculation Process:
Decomposition: Split the reference answer into individual sentences/claims.
Attribution: Use LLM to check if each claim can be inferred from retrieved contexts.
Ratio Calculation: Calculate the ratio of claims supported by contexts over total claims.
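As with faithfulness, the attribution step is an LLM call and the final score is a ratio. A minimal sketch, assuming the per-claim attribution verdicts are already available:

```python
def context_recall(claim_supported: list[bool]) -> float:
    """Fraction of reference-answer claims that the LLM could
    attribute to the retrieved contexts."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

# Single reference claim, not covered by the retrieved context.
print(context_recall([False]))  # 0.0
```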
Illustrative Example - Context Recall
Question:
Where is the Eiffel Tower located?
Reference answer:
The Eiffel Tower is located in Paris.
Retrieved contexts:
Paris is the capital of France.
Analysis: The reference answer contains one main claim: 'The Eiffel Tower is located in Paris.' However, the retrieved context only states that 'Paris is the capital of France', without mentioning the location of the Eiffel Tower. Therefore, the LLM cannot attribute the reference claim to the retrieved context.
Result: Context Recall = 0/1 = 0, indicating the retriever failed to find context containing necessary information to answer the question.
```mermaid
graph LR
Q[Question] --> RF[Ragas Evaluation Framework]
GA[Generated Answer] --> RF
RC[Retrieved Contexts] --> RF
REF[Reference Answer] --> RF
subgraph "Generation Metrics"
RF --> F["Faithfulness\n(Score 0–1)"]
RF --> AR["Answer Relevancy\n(Score 0–1)"]
end
subgraph "Retrieval Metrics"
RF --> CP["Context Precision\n(Score 0–1)"]
RF --> CR["Context Recall\n(Score 0–1)"]
end
```
Figure 6: Illustration of the four metrics in the Ragas evaluation framework.
Each metric gives a value from 0 to 1, with higher values indicating better quality. These four metrics complement each other: faithfulness and answer relevancy evaluate generation quality, while context precision and context recall evaluate retrieval performance.