# LLMOps and Evaluation Question Bank
| No. | Training Unit | Lecture | Training content | Question | Level | Mark | Answer | Answer Option A | Answer Option B | Answer Option C | Answer Option D | Explanation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | What does the Faithfulness metric measure in RAGAS? | Easy | 1 | A | The truthfulness of the generated answer compared to the retrieved context | The relevance of the answer to the original question | The accuracy of the ranking of contexts | The coverage of the retrieval process | Faithfulness checks if all statements in the answer can be supported by the retrieved context, avoiding hallucinations. |
| 2 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | Which LLM framework is RAGAS designed to evaluate? | Easy | 1 | B | Agents | RAG systems | Fine-tuned models | Traditional search engines | Ragas is an automated evaluation framework designed specifically for RAG systems. |
| 3 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | How much manual data annotation do you need when using RAGAS? | Easy | 1 | C | Large-scale human annotations | Only expert domain knowledge | None; it uses LLMs like GPT-4 to automate evaluation | Both standard Q&A pairs and ranking queries | Unlike traditional methods, Ragas uses LLMs to automate the evaluation process without needing heavy human annotation. |
| 4 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | Which dimension is measured by Context Precision? | Easy | 1 | C | Quality of generation | Semantic similarity to the user query | Accuracy of the retrieval process | Coverage of expected facts | Context Precision measures the accuracy of the retrieval process by assessing the ranking of contexts. |
| 5 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | What is the main purpose of Answer Relevancy? | Easy | 1 | D | Fact-checking the answer | Verifying truthfulness | Guaranteeing context coverage | Measuring relevance between answer and original question | It evaluates the relevance between the answer and the question to confirm that the answer addresses the problem asked. |
| 6 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | What value range do Ragas metrics return? | Easy | 1 | B | 0 to 100 | 0 to 1 | -1 to 1 | 1 to 5 | Each metric gives a value from 0 to 1, with higher values indicating better quality. |
| 7 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | Which metric evaluates if relevant chunks are ranked high in retrieved contexts? | Easy | 1 | C | Faithfulness | Context Recall | Context Precision | Answer Relevancy | Context Precision checks if relevant chunks are ranked high in the list of retrieved contexts. |
| 8 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | How many main metrics are covered in the RAGAS documentation? | Easy | 1 | A | 4 | 5 | 3 | 6 | The four main metrics are faithfulness, answer relevancy, context precision, and context recall. |
| 9 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | If Context Recall is 0, what does that indicate? | Easy | 1 | A | Retriever failed to find necessary context | Rank 1 is an irrelevant context | LLM generated a hallucination | The answer is irrelevant to the query | It indicates the retriever failed to find context containing the information necessary to answer the question. |
| 10 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | Which two metrics evaluate the "retrieval" performance? | Easy | 1 | B | Faithfulness & Answer Relevancy | Context Precision & Context Recall | Answer Relevancy & Context Recall | Context Precision & Faithfulness | Context Precision and Context Recall evaluate retrieval performance. |
| 11 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | Describe the calculation process for Faithfulness in Ragas. | Medium | 2 | A | Decompose answer into statements, verify against context, calculate ratio | Generate questions, embed them, calculate cosine similarity | Determine context relevance, calculate Precision@k, aggregate | Decompose reference answer, verify if inferences exist in retrieved context | The process is: decomposition (claims), verification (checked against context), and scoring (ratio). |
| 12 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | How does Answer Relevancy determine its score technically? | Medium | 2 | C | By classifying the answer using a trained classifier | By matching keywords between answer and question | By reverse-engineering questions from the answer and calculating embedding cosine similarity | By comparing the character count of answer vs question | The LLM generates N questions from the given answer, converts them to embeddings, and compares their cosine similarity with the original question. |
| 13 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | A low Context Recall score means what in terms of information availability? | Medium | 2 | D | The information is hallucinated | The answer has redundant information | The retrieved information is scattered | The necessary facts from the reference answer are missing in the retrieved contexts | It means the necessary information from the reference answer was not found in the retrieved contexts. |
| 14 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | In Context Precision calculation, what is \(v_k\)? | Medium | 2 | C | Velocity of retrieval | Volume of chunks | Relevance indicator at position k | Value of cosine similarity | \(v_k \in \{0, 1\}\) is the relevance indicator at position k. |
| 15 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | Why might an answer score high in Faithfulness but low in Answer Relevancy? | Medium | 2 | B | The answer is hallucinated but relevant | The answer is entirely true based on context but fails to address the user's specific question | The retriever brought back poor context | The context precision is very low | An answer can be completely faithful to the retrieved context, but that context (and answer) might not be what the user asked for. |
| 16 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | Why is Faithfulness strictly compared to retrieved context and not world knowledge? | Medium | 2 | A | To prevent LLM hallucinations from being counted as correct if the retriever failed | Ragas has no access to world knowledge | The LLM doesn't know facts | World knowledge costs more tokens | RAG's core value is grounding generation on specific private/provided context, so Faithfulness measures adherence to that context only, preventing hallucinations from going unaccounted for. |
| 17 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | If an LLM splits an answer into 3 statements and only 2 are verified in the context, what is Faithfulness? | Medium | 2 | B | 0.5 | 0.67 | 0.33 | 1.0 | Faithfulness is the ratio of supported statements: 2 out of 3 gives ~0.67. |
| 18 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | A user asks about Einstein's death, but the context only contains his birth, and the LLM answers "Einstein died in 1955" using its internal knowledge. What are the RAGAS metric implications? | Hard | 3 | B | High Faithfulness, Low Answer Relevancy | Low Faithfulness, High Answer Relevancy | Low Faithfulness, Low Context Recall | High Context Precision, High Context Recall | The answer addresses the user's question (high relevancy), but the claim is not in the context, making Faithfulness low. |
| 19 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | To improve Context Precision in a RAG pipeline, what architecture modification would you introduce? | Hard | 3 | C | Increase LLM temperature | Swap FAISS for ChromaDB | Add a cross-encoder reranking step | Generate multiple answers and average them | Reranking specifically improves the order/ranking of retrieved chunks, heavily impacting Context Precision. |
| 20 | Unit 1: LLMOps | Lec2 | RAGAS Metrics | Detail the mathematical rationale behind using N reverse-engineered questions for calculating Answer Relevancy. | Hard | 3 | A | Averages out the stochastic nature of LLMs generating questions to provide a stable semantic similarity | It is required to satisfy vector dimensions | One question uses up too few tokens | N acts as a padding token for embeddings | Generating N questions and averaging their cosine similarities mitigates the variance inherent in LLM generation, ensuring a robust relevancy score. |
| 21 | Unit 2: Observability | Lec6 | Observability Concepts | What is Observability in the context of LLM applications? | Easy | 1 | A | The ability to track flows, errors, and costs of LLM apps acting as black boxes | A library for generating UI code | A vector database | The algorithm used for chunking texts | It tracks probabilistic components acting as black boxes, aiding in tracing, cost tracking, and debugging. |
| 22 | Unit 2: Observability | Lec6 | LangFuse Basics | Which of these tools is known for being open source? | Easy | 1 | B | LangChain | LangFuse | LangSmith | OpenAI | LangFuse is a popular open-source tool focused on engineering observability. |
| 23 | Unit 2: Observability | Lec6 | Observability Challenges | What makes LLM applications harder to debug than traditional software? | Easy | 1 | C | They use more memory | They require internet connections | They involve probabilistic, non-deterministic components | They use Python | Traditional software is deterministic: you give input, you get a predictable output. LLM components act as probabilistic black boxes. |
| 24 | Unit 2: Observability | Lec6 | LangSmith Basics | Who built LangSmith? | Easy | 1 | A | The LangChain team | OpenAI | Meta | | LangSmith is built by the LangChain team for native integration. |
| 25 | Unit 2: Observability | Lec6 | LangFuse Integration | In LangFuse, what is used to automatically instrument LangChain chains? | Easy | 1 | C | System.out.println | VectorEmbeddings | CallbackHandler | FAISS | LangFuse provides a CallbackHandler that automatically instruments chains. |
| 26 | Unit 2: Observability | Lec6 | Prompt Management | Why should you manage prompts in a tool like LangFuse instead of hardcoding them in Git? | Easy | 1 | A | To allow non-engineers to tweak them | Because Git is too slow | Because Git charges per token | To hide prompts from developers | It acts as a CMS for prompts so non-engineers can comfortably inspect and tweak them. |
| 27 | Unit 2: Observability | Lec6 | Setup | How do you usually enable LangSmith auto-tracing in a LangChain project? | Easy | 1 | D | Rewrite all code to use LangSmith classes | Contact support to enable it | Import | Just set environment variables | LangSmith integration typically requires no code changes; you just set environment variables. |
| 28 | Unit 2: Observability | Lec6 | Production Best Practices | What is the recommended tracing sampling rate for production environments? | Easy | 1 | C | 100% | 50% | 1-5% of traffic | None | In production, tracing every request is noisy and expensive, so sampling 1-5% of traffic (plus high-importance traces) is recommended. |
| 29 | Unit 2: Observability | Lec6 | Privacy | How should you handle PII data privacy before logging to a cloud observability tool? | Easy | 1 | B | Do nothing | Run PII masking/redaction functions | Encrypt with simple Base64 | Delete all logs | Never log sensitive data; run PII masking or use enterprise redaction features. |
| 30 | Unit 2: Observability | Lec6 | Alerts | What is an example of a good alert to set up in observability? | Easy | 1 | A | Error rate spike > 10% in 5 min | "Hello World" printed | CPU temperature | Single user logged out | You should alert on things like error rate > 10%, latency spikes, or cost anomalies. |
| 31 | Unit 2: Observability | Lec6 | LangFuse vs LangSmith | If self-hosting for data privacy is an absolute requirement and the budget is zero, which tool is recommended? | Medium | 2 | C | Weights & Biases | LangSmith | LangFuse | CloudWatch | LangFuse is open source (MIT) and offers easy self-hosting (Docker Compose) for free. |
| 32 | Unit 2: Observability | Lec6 | LangSmith Playground | What is the "Playground: Edit and Re-run" feature in LangSmith useful for? | Medium | 2 | A | You can take a failed production trace, change the prompt, and test a fix immediately | Training new models | Deploying code to AWS | Chatting with other developers | It allows you to take failed real-world traces and edit prompts/parameters to instantly see if the issue resolves. |
| 33 | Unit 2: Observability | Lec6 | Latency Debugging | If a RAG request takes 10 seconds, how does tracing help? | Medium | 2 | B | It makes the query faster | It breaks down the latency per component (e.g., Vector DB vs API completion) | It charges the user for the wait time | It cancels requests longer than 5 seconds | Tracing visualizes the execution flow, pinpointing exactly which step (vector search vs generation) is the bottleneck. |
| 34 | Unit 2: Observability | Lec6 | Cost Tracking | Why is Cost Tracking a critical feature in LLM Observability compared to traditional app monitoring? | Medium | 2 | D | Because AWS charges are cheap | Because you don't need servers | Because LLMs don't cost real money | Because LLM API calls are charged per token and a single runaway loop can cost hundreds of dollars quickly | API calls are expensive, requiring real-time tracking to prevent unmanaged financial overruns. |
| 35 | Unit 2: Observability | Lec6 | LangChain Integration | What environment variable activates LangSmith tracing? | Medium | 2 | B | LANGCHAIN_DEBUG=1 | LANGCHAIN_TRACING_V2=true | LANGCHAIN_LOG=all | LANGSMITH_ACTIVE=1 | Setting LANGCHAIN_TRACING_V2=true enables LangSmith tracing for a LangChain project. |
| 36 | Unit 2: Observability | Lec6 | Prompt CMS | How do you fetch a production prompt dynamically using the LangFuse SDK? | Medium | 2 | A | Using `langfuse.get_prompt()` | Reading from a local .json file | Executing a GraphQL query to GitHub | Using | Langfuse acts as a CMS and lets you retrieve prompts with `langfuse.get_prompt()`. |
| 37 | Unit 2: Observability | Lec6 | Alerts & Best Practices | Why shouldn't you just "stare at dashboards" for production LLM apps? | Medium | 2 | A | You need automated alerts (error spikes, costs) to respond fast to anomalies | Dashboards are always broken | It slows down the computer | Observability doesn't provide dashboards | Dashboards are passive. Automated alerts are needed to actively manage sudden cost, latency, or error anomalies. |
| 38 | Unit 2: Observability | Lec6 | Advanced LangChain Integration | You have a complex application utilizing standard Python code, LangChain agent loops, and custom API calls. Should you prefer LangSmith or LangFuse, and why? | Hard | 3 | B | LangSmith, because it supports Python natively better | LangFuse, because it is platform-agnostic and instruments cleanly across non-LangChain code too | LangSmith, because LangChain is mandatory | LangFuse, because it has an "Edit and Re-run" playground | LangFuse is platform-agnostic for non-LangChain code, making it better for mixed-stack integrations, while LangSmith is highly specific and native to LangChain execution loops. |
| 39 | Unit 2: Observability | Lec6 | Debugging Scenarios | In production, users report the chatbot occasionally ignores their negative feedback instructions. How would you leverage LangSmith to resolve this? | Hard | 3 | C | By deleting the user history and trying again | Check the VectorDB logs | Locate the failed traces in LangSmith, transition them to the Playground, adjust the system prompt, and replay to verify compliance | Re-index the FAISS database | LangSmith's Playground allows you to take failed traces directly, manipulate the prompt, and replay the exact trace environment to find the fix. |
| 40 | Unit 2: Observability | Lec6 | Data Security Architecture | Explain a robust architectural design for handling HIPAA/PII compliance while using a SaaS LLM observability platform like LangSmith Enterprise. | Hard | 3 | A | Run an edge/middleware service that performs localized PII entity masking/redaction before transmitting traces to the LangSmith API | Avoid observability tools completely | Share passwords directly via the agent | Mask PII inside the LangSmith GUI | PII must not leave the secure perimeter; redaction must happen at the application layer or middleware before data is shipped via logs/traces. |
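
The Unit 1 questions describe each RAGAS metric's formula in words; the toy sketch below reconstructs them in plain Python so the ratios can be checked by hand. This is illustrative only: the function names are hypothetical, and the inputs (statement verdicts, embeddings, relevance flags) stand in for values the real Ragas library derives with an LLM judge; this is not the Ragas API.

```python
from math import sqrt

def faithfulness(verdicts):
    """Ratio of answer statements supported by the retrieved context.
    verdicts: 1 if a statement is supported by the context, 0 if not."""
    return sum(verdicts) / len(verdicts)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

def answer_relevancy(question_emb, generated_question_embs):
    """Mean cosine similarity between the original question and the N
    questions reverse-engineered from the answer; averaging over N
    smooths out the stochasticity of LLM question generation."""
    sims = [cosine(question_emb, g) for g in generated_question_embs]
    return sum(sims) / len(sims)

def context_precision(v):
    """Mean Precision@k over relevant ranks, where v[k-1] is the
    relevance indicator v_k in {0, 1} at position k."""
    if sum(v) == 0:
        return 0.0  # retriever found no relevant chunk at all
    score, hits = 0.0, 0
    for k, rel in enumerate(v, start=1):
        hits += rel
        score += (hits / k) * rel  # Precision@k counted only at relevant ranks
    return score / sum(v)

def context_recall(claims_found, claims_total):
    """Fraction of reference-answer claims present in the retrieved context."""
    return claims_found / claims_total

# Question 17's scenario: 2 of 3 answer statements verified in the context.
print(round(faithfulness([1, 1, 0]), 2))  # 0.67
```

All four functions return values in the 0-to-1 range the metrics are stated to use; for example, a retrieved-chunk relevance pattern of `[1, 0, 1]` gives a Context Precision of (1/1 + 2/3) / 2 ≈ 0.83.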