Quiz

RAGAS Evaluation Metrics

Question 1: Scenario: The LLM gives a brilliant, factually correct answer based on its pre-trained knowledge, but the retrieved context from your database contained entirely unrelated text. What is the Faithfulness score?

  • A. 1.0, because the answer is factually true.

  • B. 0.8, because it ignored the prompt.

  • C. 0.5, because the context was ignored.

  • D. 0.0, because none of the statements can be inferred from the retrieved context.

Answer: D

Question 2: If a user asks ‘What is the capital of Japan?’ and the LLM responds ‘Tokyo is a city in Japan with a large population, famous for cherry blossoms, and it serves as the capital.’, which metric might flag this answer as suboptimal?

  • A. Faithfulness (due to hallucination).

  • B. Context Recall (due to missing info).

  • C. Answer Relevancy (due to redundant/extra information not directly addressing only the prompt).

  • D. Context Precision (due to bad ranking).

Answer: C

Question 3: How does Context Precision handle irrelevant chunks that appear high up in the retrieved results (e.g., Position 1)?

  • A. It significantly penalizes the score because it calculates the ratio of relevant contexts at each top-k position.

  • B. It ignores them as long as a relevant chunk is at Position 5.

  • C. It boosts the score to encourage diversity.

  • D. It forces the LLM to rewrite the context.

Answer: A

Question 4: Scenario: You have an expert reference answer containing 4 key claims. Your retrieval system pulls contexts that only support 1 of those claims. What is the Context Recall score?

  • A. 1.0

  • B. 0.25 (1/4)

  • C. 0.5

  • D. 0.0

Answer: B
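The arithmetic behind this answer can be sketched in a few lines. This is an illustrative toy, not the Ragas API: in practice an LLM judge decides whether each reference claim is supported by the retrieved context; here the verdicts are supplied directly as booleans.

```python
# Context Recall: fraction of reference-answer claims that the retrieved
# context supports. Each verdict marks one claim as supported (True) or not.
def context_recall(claim_verdicts: list[bool]) -> float:
    return sum(claim_verdicts) / len(claim_verdicts)

# 4 key claims in the expert reference answer, only 1 supported:
print(context_recall([True, False, False, False]))  # 0.25
```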

Question 5: Why does Ragas use an LLM (like GPT-4) as a ‘Judge’ for its metrics?

  • A. Because humans are incapable of reading RAG outputs.

  • B. To automate the evaluation process, minimizing the high costs and time associated with human ground-truth annotation.

  • C. To generate hypothetical vectors.

  • D. Because it is required by the Neo4j database.

Answer: B

Question 6: In the calculation of Answer Relevancy, why are ‘reverse-engineered’ questions generated?

  • A. To compare their embedding similarity against the original user question; high similarity means the answer directly addressed the prompt.

  • B. To train a new embedding model.

  • C. To populate the Graph database.

  • D. To ask the user for clarification.

Answer: A
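The mechanics of this comparison can be sketched as follows. This is a simplified illustration, not the Ragas implementation: the embeddings here are toy 2-dimensional vectors, whereas Ragas would embed the original question and the LLM-generated (reverse-engineered) questions with a real embedding model and average the cosine similarities.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def answer_relevancy(question_emb, generated_question_embs) -> float:
    """Mean similarity between the original question and each
    question reverse-engineered from the generated answer."""
    sims = [cosine(question_emb, g) for g in generated_question_embs]
    return sum(sims) / len(sims)

original = [1.0, 0.0]
# One regenerated question matches the original exactly; one drifts off-topic.
regenerated = [[1.0, 0.0], [0.8, 0.6]]
print(round(answer_relevancy(original, regenerated), 2))  # 0.9
```

A focused answer yields regenerated questions that all point back at the original prompt (similarity near 1.0); redundant or off-topic content, as in Question 2, drags the average down.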

Question 7: Which two Ragas metrics are specifically focused on evaluating the ‘Retrieval’ performance of a RAG system?

  • A. Faithfulness and Answer Relevancy

  • B. Answer Correctness and Faithfulness

  • C. Context Precision and Context Recall

  • D. Latency and Cost

Answer: C

Question 8: Which two Ragas metrics are specifically focused on evaluating the ‘Generation’ performance of a RAG system?

  • A. Faithfulness and Answer Relevancy

  • B. Context Precision and Context Recall

  • C. Retrieval Latency and Token Cost

  • D. Context Recall and Faithfulness

Answer: A

Question 9: If a RAG system has High Context Recall but Low Context Precision, what does this indicate about the retrieved chunks?

  • A. It found no useful information.

  • B. It found all the necessary information, but buried it among a lot of irrelevant noise (poor ranking).

  • C. It hallucinated the answer.

  • D. It ranked the exact right answer at position 1, but missed everything else.

Answer: B

Question 10: What is the first step in the calculation process for Context Recall?

  • A. Reverse-engineering questions.

  • B. Calculating cosine similarity.

  • C. Splitting the ‘reference answer’ (ground truth) into individual sentences/claims.

  • D. Generating an answer using GPT-4.

Answer: C

Question 11: In the ‘Green Tea’ Context Precision example, why does an irrelevant context at position 2 lower the final score?

  • A. Because Precision@2 drops to 0.5, pulling down the weighted average for subsequent relevant chunks.

  • B. Because the LLM deletes the irrelevant chunk.

  • C. Because it triggers a Faithfulness penalty.

  • D. Because it changes the user’s original question.

Answer: A
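The score mechanics can be sketched as below. A simplified illustration, not the Ragas API: Context Precision averages precision@k over the positions that hold a relevant chunk, so an irrelevant chunk at position 2 inflates the denominator for every relevant chunk that follows it.

```python
def context_precision(relevance: list[bool]) -> float:
    """Mean of precision@k, evaluated only at positions holding a relevant chunk.
    relevance[k-1] is True if the chunk at rank k is relevant."""
    precisions = []
    hits = 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # precision@k at this relevant position
    return sum(precisions) / len(precisions)

# Mirrors the example: relevant at position 1, irrelevant at position 2,
# relevant at position 3. The irrelevant chunk drags precision@3 to 2/3.
print(round(context_precision([True, False, True]), 2))  # 0.83
```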

Question 12: What does it mean if Faithfulness evaluates to exactly 1.0?

  • A. The answer is 100% factually accurate to the real world.

  • B. The answer contains exactly 100 words.

  • C. Every single statement made in the generated answer can be directly supported by the retrieved context.

  • D. The retrieval process took exactly 1 second.

Answer: C

Question 13: Why is Answer Relevancy NOT considered a measure of ‘factuality’?

  • A. Because it uses BM25 instead of vectors.

  • B. Because it only checks if the answer conceptually aligns with what was asked, not whether the facts stated are true.

  • C. Because GPT-4 cannot evaluate facts.

  • D. Because it only measures the speed of the response.

Answer: B

Question 14: If your RAG system suffers from ‘hallucinations’, which metric will most directly drop?

  • A. Context Precision

  • B. Context Recall

  • C. Answer Relevancy

  • D. Faithfulness

Answer: D

Question 15: In the calculation process for Faithfulness, what happens after the answer is decomposed into claims?

  • A. The claims are translated.

  • B. The LLM verifies each statement to see if it can be inferred from the context.

  • C. The claims are stored in Neo4j.

  • D. The context is deleted.

Answer: B
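The final scoring step can be sketched as below. As with the Context Recall sketch above, this is illustrative only: in Ragas the per-claim verdicts come from an LLM judge checking each decomposed claim against the retrieved context; here they are passed in as booleans.

```python
# Faithfulness: fraction of claims in the generated answer that can be
# inferred from the retrieved context.
def faithfulness(claim_verdicts: list[bool]) -> float:
    return sum(claim_verdicts) / len(claim_verdicts)

# Every claim supported -> 1.0 (Question 12).
print(faithfulness([True, True, True]))   # 1.0
# No claim supported by the (unrelated) context -> 0.0 (Question 1).
print(faithfulness([False, False]))       # 0.0
```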

Question 16: If a generated answer lacks necessary details requested in the prompt (e.g., asking for location and capital, but only giving location), what happens to Answer Relevancy?

  • A. It increases because the answer is shorter.

  • B. It stays the same.

  • C. It decreases because the reverse-engineered questions will not match the full scope of the original prompt.

  • D. It forces a re-retrieval.

Answer: C