LLMOps and Evaluation Question Bank#

No.

Training Unit

Lecture

Training content

Question

Level

Mark

Answer

Answer Option A

Answer Option B

Answer Option C

Answer Option D

Explanation

1

Unit 1: LLMOps

Lec2

RAGAS Metrics

What does the Faithfulness metric measure in RAGAS?

Easy

1

A

The truthfulness of the generated answer compared to the retrieved context

The relevance of the answer to the original question

The accuracy of the ranking of contexts

The coverage of the retrieval process

Faithfulness checks if all statements in the answer can be supported by the retrieved context, avoiding hallucinations.

2

Unit 1: LLMOps

Lec2

RAGAS Metrics

Which type of LLM system is RAGAS designed to evaluate?

Easy

1

B

Agents

RAG systems

Fine-tuned models

Traditional Search Engines

Ragas is an automated evaluation framework designed specifically for RAG systems.

3

Unit 1: LLMOps

Lec2

RAGAS Metrics

How much manual data annotation do you need when using RAGAS?

Easy

1

C

Large scale human annotations

Only expert domain knowledge

Nothing, it uses LLMs like GPT-4 to automate evaluation

Both standard Q&A pairs and ranking queries

Unlike traditional methods, Ragas uses LLMs to automate the evaluation process without needing heavy human annotations.

4

Unit 1: LLMOps

Lec2

RAGAS Metrics

Which dimension is measured by Context Precision?

Easy

1

C

Quality of generation

Semantic similarity to the user query

Accuracy of the retrieval process

Coverage of expected facts

Context Precision measures the accuracy of the retrieval process by assessing the ranking of contexts.

5

Unit 1: LLMOps

Lec2

RAGAS Metrics

What is the main purpose of Answer Relevancy?

Easy

1

D

Fact-checking the answer

Verifying truthfulness

Guaranteeing context coverage

Measuring relevance between answer and original question

It evaluates the relevance between the answer and question to confirm it addresses the problem asked.

6

Unit 1: LLMOps

Lec2

RAGAS Metrics

What value range do Ragas metrics return?

Easy

1

B

0 to 100

0 to 1

-1 to 1

1 to 5

Each metric gives a value from 0 to 1, with higher values indicating better quality.

7

Unit 1: LLMOps

Lec2

RAGAS Metrics

Which metric evaluates if relevant chunks are ranked high in retrieved contexts?

Easy

1

C

Faithfulness

Context Recall

Context Precision

Answer Relevancy

Context Precision checks if relevant chunks are ranked high in the list of retrieved contexts.

8

Unit 1: LLMOps

Lec2

RAGAS Metrics

How many main metrics are covered in the RAGAS documentation?

Easy

1

A

4

5

3

6

The four main metrics are faithfulness, answer relevancy, context precision, and context recall.

9

Unit 1: LLMOps

Lec2

RAGAS Metrics

If Context Recall is 0, what does that indicate?

Easy

1

A

Retriever failed to find necessary context

Rank 1 is an irrelevant context

LLM generated hallucination

The answer is irrelevant to the query

It indicates the retriever failed to find context containing necessary information to answer the question.

10

Unit 1: LLMOps

Lec2

RAGAS Metrics

Which two metrics evaluate the “retrieval” performance?

Easy

1

B

Faithfulness & Answer Relevancy

Context Precision & Context Recall

Answer Relevancy & Context Recall

Context Precision & Faithfulness

Context precision and context recall evaluate retrieval performance.

11

Unit 1: LLMOps

Lec2

RAGAS Metrics

Describe the calculation process for Faithfulness in Ragas.

Medium

2

A

Decompose answer to statements, verify against context, calculate ratio

Generate questions, embed them, calculate cosine similarity

Determine context relevance, calculate Precision@k, aggregate

Decompose reference answer, verify if inferences exist in retrieved context

The process is: Decomposition (claims), Verification (checked against context), and Scoring (ratio).
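
The scoring step above reduces to simple arithmetic. An illustrative sketch (in RAGAS itself an LLM performs the decomposition and verification; here the per-statement verdicts are supplied as booleans):

```python
# Illustrative sketch of the Faithfulness scoring step. RAGAS uses an LLM
# to decompose the answer into statements and verify each against the
# retrieved context; this function only computes the final ratio.
def faithfulness_score(verdicts):
    """Ratio of answer statements supported by the retrieved context."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# 2 of 3 decomposed statements verified against the context -> ~0.67
print(round(faithfulness_score([True, True, False]), 2))
```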

12

Unit 1: LLMOps

Lec2

RAGAS Metrics

How does Answer Relevancy determine its score technically?

Medium

2

C

By classifying the answer using a trained classifier

By matching keywords between answer and question

By reverse-engineering questions from answer and calculating embedding cosine similarity

By comparing the character count of answer vs question

LLM generates N questions from the given answer, converts them to embeddings, and compares cosine similarity with the original question.
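
A toy sketch of that scoring step (the embeddings here are hand-made 2-D vectors purely for illustration; in RAGAS they come from an embedding model and the N questions are reverse-engineered from the answer by an LLM):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy(original_question_emb, generated_question_embs):
    """Mean cosine similarity between the original question's embedding
    and the embeddings of N questions generated from the answer."""
    sims = [cosine(original_question_emb, g) for g in generated_question_embs]
    return sum(sims) / len(sims)

q = np.array([1.0, 0.0])
# one generated question identical to the original, one orthogonal to it
gens = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(answer_relevancy(q, gens))  # -> 0.5
```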

13

Unit 1: LLMOps

Lec2

RAGAS Metrics

A low Context Recall score means what in terms of information availability?

Medium

2

D

The information is hallucinated

The answer has redundant information

The retrieved information is scattered

The necessary facts from the reference answer are missing in the retrieved contexts

It means the necessary information from the reference answer was not found in the retrieved contexts.

14

Unit 1: LLMOps

Lec2

RAGAS Metrics

In Context Precision calculation, what is \(v_k\)?

Medium

2

C

Velocity of retrieval

Volume of chunks

Relevance indicator at position k

Value of cosine similarity

\(v_k \in \{0, 1\}\) is the relevance indicator at position k.
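
The indicator feeds the overall score \(\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{total relevant items in top } K}\). A small sketch of that computation (relevance verdicts are given as 0/1 here; in RAGAS an LLM produces them):

```python
def context_precision_at_k(relevance):
    """relevance[i] is v_k for rank k = i + 1 (1 if the chunk at that
    rank is relevant, else 0). Returns sum(Precision@k * v_k) divided
    by the number of relevant items in the top K."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        score += (hits / k) * v_k
    return score / total_relevant

# relevant chunks at ranks 1 and 3: (1.0 + 2/3) / 2
print(round(context_precision_at_k([1, 0, 1]), 2))  # -> 0.83
```

Note how the metric rewards relevant chunks appearing early: `[1, 0, 1]` scores higher than `[0, 1, 1]` even though both retrieve two relevant chunks.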

15

Unit 1: LLMOps

Lec2

RAGAS Metrics

Why might an answer score high in Faithfulness but low in Answer Relevancy?

Medium

2

B

The answer is hallucinated but relevant

The answer is entirely true based on context but fails to address the user’s specific question

The retriever brought back poor context

The context precision is very low

It can be completely faithful to retrieved context, but that context (and answer) might not be what the user asked for.

16

Unit 1: LLMOps

Lec2

RAGAS Metrics

Why is Faithfulness strictly compared to retrieved context and not world knowledge?

Medium

2

A

To prevent LLM hallucinations from being counted as correct if the retriever failed

Ragas has no access to world knowledge

The LLM doesn’t know facts

World knowledge costs more tokens

RAG’s core value is grounding generation in the specific private/provided context, so Faithfulness measures adherence to that context alone; otherwise hallucinations masked by the LLM’s world knowledge would be counted as correct even when the retriever failed.

17

Unit 1: LLMOps

Lec2

RAGAS Metrics

If LLM splits an answer into 3 statements, and only 2 are verified in context, Faithfulness is?

Medium

2

B

0.5

0.67

0.33

1.0

Faithfulness relies on the ratio of correct statements: 2 out of 3 makes it ~0.67.

18

Unit 1: LLMOps

Lec2

RAGAS Metrics

Given a scenario where a user asks about Einstein’s death, but the context only contains his birth, and the LLM answers “Einstein died in 1955” using its internal knowledge. What are the RAGAS metric implications?

Hard

3

B

High Faithfulness, Low Answer Relevancy

Low Faithfulness, High Answer Relevancy

Low Faithfulness, Low Context Recall

High Context Precision, High Context Recall

It answers the user (High Relevancy), but the claim isn’t in context, making Faithfulness low.

19

Unit 1: LLMOps

Lec2

RAGAS Metrics

To improve Context Precision in a RAG pipeline, what architecture modification would you introduce?

Hard

3

C

Increase LLM temperature

Swap FAISS for ChromaDB

Add a Cross-encoder reranking step

Generate multiple answers and average them

Reranking specifically improves the order/ranking of retrieved chunks, heavily impacting Context Precision metrics.

20

Unit 1: LLMOps

Lec2

RAGAS Metrics

Detail the mathematical rationale behind using N reverse-engineered questions for calculating Answer Relevancy.

Hard

3

A

Averages out the stochastic nature of LLMs generating questions to provide a stable semantic similarity

It is required to satisfy vector dimensions

One question uses up too few tokens

N acts as a padding token for embeddings

Generating N questions and averaging their cosine similarities mitigates the variance inherent in LLM generation, ensuring a robust relevancy score.

21

Unit 2: Observability

Lec6

Observability Concepts

What is Observability in the context of LLM applications?

Easy

1

A

The ability to track flows, errors and costs of LLM apps acting as black boxes

A library for generating UI code

A vector database

The algorithm used for chunking texts

It tracks probabilistic components acting as black boxes, aiding in tracing flows, monitoring costs, and debugging.

22

Unit 2: Observability

Lec6

LangFuse Basics

Which of these tools is known for being Open Source?

Easy

1

B

LangChain

LangFuse

LangSmith

OpenAI

LangFuse is a popular open-source tool focusing on engineering observability.

23

Unit 2: Observability

Lec6

Observability Challenges

What makes LLM applications harder to debug than traditional software?

Easy

1

C

They use more memory

They require internet connections

They involve probabilistic, non-deterministic components

They use Python

Unlike deterministic software, where the same input reliably gives the same output, LLMs act as probabilistic black boxes.

24

Unit 2: Observability

Lec6

LangSmith Basics

Who built LangSmith?

Easy

1

B

Google

The LangChain Team

OpenAI

Meta

LangSmith is built by the LangChain team for native integration.

25

Unit 2: Observability

Lec6

LangFuse Integration

In LangFuse, what is used to automatically instrument LangChain chains?

Easy

1

C

System.out.println

VectorEmbeddings

CallbackHandler

FAISS

LangFuse provides a CallbackHandler that automatically instruments chains.

26

Unit 2: Observability

Lec6

Prompt Management

Why should you manage prompts in a tool like LangFuse instead of hardcoding in Git?

Easy

1

A

To allow non-engineers to tweak them

Because Git is too slow

Because Git charges per token

To hide prompts from developers

It acts as a CMS for prompts so non-engineers can comfortably inspect and tweak them.

27

Unit 2: Observability

Lec6

Setup

How can you enable LangSmith auto-tracing in a LangChain project usually?

Easy

1

D

Rewrite all code to use LangSmith classes

Contact support to enable it

Import enable_smith module

Just set environment variables

LangSmith integrates natively with LangChain; typically no code changes are needed, only environment variables.

28

Unit 2: Observability

Lec6

Production Best Practices

What is the recommended tracing sampling rate for Production environments?

Easy

1

C

100%

50%

1-5% of traffic

None

In production, tracing every request is noisy and expensive, so sampling 1-5% of traffic (plus high-importance traces) is recommended.

29

Unit 2: Observability

Lec6

Privacy

How do you handle PII data privacy before logging to a cloud observability tool?

Easy

1

B

Do nothing

Run PII Masking/Redaction functions

Encrypt with simple base64

Delete all logs

Never log sensitive data; run PII Masking or use enterprise redacting features.

30

Unit 2: Observability

Lec6

Alerts

What is an example of a good alert to set up in observability?

Easy

1

A

Error Rate Spike > 10% in 5 min

“Hello World” printed

CPU temperature

Single user logged out

You should alert on things like Error Rate > 10%, Latency Spikes, or Cost Anomalies.

31

Unit 2: Observability

Lec6

LangFuse vs LangSmith

If self-hosting data privacy is an absolute requirement and budget is zero, which tool is recommended?

Medium

2

C

Weights & Biases

LangSmith

LangFuse

CloudWatch

LangFuse is Open Source (MIT) and offers easy self-hosting (Docker Compose) for free.

32

Unit 2: Observability

Lec6

LangSmith Playground

What is the “Playground: Edit and Re-run” feature in LangSmith useful for?

Medium

2

A

You can take a failed production trace, change the prompt, and test a fix immediately

Training new models

Deploying code to AWS

Chatting with other developers

It allows you to take failed real-world traces and edit prompts/parameters to instantly see if the issue resolves.

33

Unit 2: Observability

Lec6

Latency Debugging

If a RAG request takes 10 seconds, how does tracing help?

Medium

2

B

It makes the query faster

It breaks down the latency per component (e.g., Vector DB vs API completion)

It charges the user for the wait time

It cancels requests longer than 5 seconds

Tracing visualizes the execution flow, pinpointing exactly which step (Vector Search vs Generate) is the bottleneck.

34

Unit 2: Observability

Lec6

Cost Tracking

Why is Cost Tracking a critical feature in LLM Observability compared to traditional app monitoring?

Medium

2

D

Because AWS charges are cheap

Because you don’t need servers

Because LLMs don’t cost real money

Because LLM API calls are charged per-token and single runaway loops can cost hundreds of dollars quickly

API calls are expensive, requiring real-time tracking to prevent unmanaged financial overruns.

35

Unit 2: Observability

Lec6

Langchain Integration

What environment variable activates LangSmith tracing?

Medium

2

B

LANGCHAIN_DEBUG=1

LANGCHAIN_TRACING_V2=true

LANGCHAIN_LOG=all

LANGSMITH_ACTIVE=1

export LANGCHAIN_TRACING_V2=true activates LangSmith native tracing.

36

Unit 2: Observability

Lec6

Prompt CMS

How do you fetch a production prompt dynamically using LangFuse SDK?

Medium

2

A

Using langfuse.get_prompt(name, version)

Reading from a local .json file

Executing a GraphQL query to Github

Using prompt = os.getenv('PROMPT')

Langfuse acts as a CMS and lets you retrieve prompts with get_prompt(name), optionally pinned to a specific version or deployment label such as "production".

37

Unit 2: Observability

Lec6

Alerts & Best Practices

Why shouldn’t you just “stare at dashboards” for production LLM apps?

Medium

2

A

You need automated alerts (error spikes, costs) to respond fast to anomalies

Dashboards are always broken

It slows down the computer

Observability doesn’t provide dashboards

Dashboards are passive. Automated alerts are needed to actively manage sudden cost, latency, or error anomalies.

38

Unit 2: Observability

Lec6

Advanced LangChain Integration

You have a complex application utilizing standard Python code, LangChain agent loops, and custom API calls. Should you prefer LangSmith or LangFuse, and why?

Hard

3

B

LangSmith, because it supports Python natively better

LangFuse, because it is platform-agnostic and instruments cleanly across non-LangChain code too.

LangSmith, because LangChain is mandatory.

LangFuse, because it has an “Edit and Re-run” playground.

LangFuse is platform-agnostic for non-LangChain code, making it better for mixed-stack integrations, while LangSmith is highly specific and native to LangChain execution loops.

39

Unit 2: Observability

Lec6

Debugging Scenarios

In production, users report the chatbot occasionally ignores their negative feedback instructions. How would you leverage LangSmith to resolve this?

Hard

3

C

By deleting the user history and trying again

Check the VectorDB logs

Locate the failed traces in LangSmith, transition them to the Playground, adjust the system prompt, and replay to verify compliance

Re-index the FAISS database

LangSmith’s Playground lets you pull failed traces directly, manipulate the prompt, and replay the exact trace environment to find the fix.

40

Unit 2: Observability

Lec6

Data Security Architecture

Explain a robust architectural design for handling HIPAA/PII compliance while using a SaaS LLM Observability platform like LangSmith Enterprise.

Hard

3

A

Run an edge/middleware service that performs localized PII Entity masking/redaction before transmitting traces to the LangSmith API

Avoid observability tools completely

Share passwords directly via the agent

Mask PII inside the LangSmith GUI

PII must not leave the secure perimeter; redaction must happen at the application layer or middleware before data is shipped via logs/traces.
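
A toy sketch of such a redaction layer (illustrative regexes only; real deployments typically use NER-based PII detection, e.g. Microsoft Presidio, applied in middleware before traces leave the secure perimeter):

```python
import re

# Mask common PII patterns before a trace is shipped to the SaaS platform.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace each detected PII entity with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane@example.com, SSN 123-45-6789"))
# -> Contact [EMAIL], SSN [SSN]
```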