LLMOps and Evaluation Question Bank#

No.

Training Unit

Lecture

Training content

Question

Level

Mark

Answer

Answer Option A

Answer Option B

Answer Option C

Answer Option D

Explanation

1

Unit 1: LLMOps

Lec2

RAGAS Metrics

What does the Faithfulness metric measure in RAGAS?

Easy

1

A

The truthfulness of the generated answer compared to the retrieved context

The relevance of the answer to the original question

The accuracy of the ranking of contexts

The coverage of the retrieval process

Faithfulness checks if all statements in the answer can be supported by the retrieved context, avoiding hallucinations.

2

Unit 1: LLMOps

Lec2

RAGAS Metrics

Which type of LLM system is RAGAS designed to evaluate?

Easy

1

B

Agents

RAG systems

Fine-tuned models

Traditional Search Engines

Ragas is an automated evaluation framework designed specifically for RAG systems.

3

Unit 1: LLMOps

Lec2

RAGAS Metrics

How much manual data annotation do you need when using RAGAS?

Easy

1

C

Large scale human annotations

Only expert domain knowledge

Nothing, it uses LLMs like GPT-4 to automate evaluation

Both standard Q&A pairs and ranking queries

Unlike traditional methods, Ragas uses LLMs to automate the evaluation process without needing heavy human annotations.

4

Unit 1: LLMOps

Lec2

RAGAS Metrics

Which dimension is measured by Context Precision?

Easy

1

C

Quality of generation

Semantic similarity to the user query

Accuracy of the retrieval process

Coverage of expected facts

Context Precision measures the accuracy of the retrieval process by assessing the ranking of contexts.

5

Unit 1: LLMOps

Lec2

RAGAS Metrics

What is the main purpose of Answer Relevancy?

Easy

1

D

Fact-checking the answer

Verifying truthfulness

Guaranteeing context coverage

Measuring relevance between answer and original question

It evaluates the relevance between the answer and question to confirm it addresses the problem asked.

6

Unit 1: LLMOps

Lec2

RAGAS Metrics

What value range do Ragas metrics return?

Easy

1

B

0 to 100

0 to 1

-1 to 1

1 to 5

Each metric gives a value from 0 to 1, with higher values indicating better quality.

7

Unit 1: LLMOps

Lec2

RAGAS Metrics

Which metric evaluates if relevant chunks are ranked high in retrieved contexts?

Easy

1

C

Faithfulness

Context Recall

Context Precision

Answer Relevancy

Context Precision checks if relevant chunks are ranked high in the list of retrieved contexts.

8

Unit 1: LLMOps

Lec2

RAGAS Metrics

How many main metrics are covered in the RAGAS documentation?

Easy

1

A

4

5

3

6

The four main metrics are faithfulness, answer relevancy, context precision, and context recall.

9

Unit 1: LLMOps

Lec2

RAGAS Metrics

If Context Recall is 0, what does that indicate?

Easy

1

A

Retriever failed to find necessary context

Rank 1 is an irrelevant context

LLM generated hallucination

The answer is irrelevant to the query

It indicates the retriever failed to find context containing necessary information to answer the question.

10

Unit 1: LLMOps

Lec2

RAGAS Metrics

Which two metrics evaluate the “retrieval” performance?

Easy

1

B

Faithfulness & Answer Relevancy

Context Precision & Context Recall

Answer Relevancy & Context Recall

Context Precision & Faithfulness

Context precision and context recall evaluate retrieval performance.

11

Unit 1: LLMOps

Lec2

RAGAS Metrics

Describe the calculation process for Faithfulness in Ragas.

Medium

2

A

Decompose answer to statements, verify against context, calculate ratio

Generate questions, embed them, calculate cosine similarity

Determine context relevance, calculate Precision@k, aggregate

Decompose reference answer, verify if inferences exist in retrieved context

The process is: Decomposition (claims), Verification (checked against context), and Scoring (ratio).
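
The scoring step above reduces to simple arithmetic. An illustrative sketch (in RAGAS itself an LLM performs the decomposition and verification; here the per-statement verdicts are supplied as booleans):

```python
# Illustrative sketch of the Faithfulness scoring step. RAGAS uses an LLM
# to decompose the answer into statements and verify each against the
# retrieved context; this function only computes the final ratio.
def faithfulness_score(verdicts):
    """Ratio of answer statements supported by the retrieved context."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# 2 of 3 decomposed statements verified against the context -> ~0.67
print(round(faithfulness_score([True, True, False]), 2))
```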

12

Unit 1: LLMOps

Lec2

RAGAS Metrics

How does Answer Relevancy determine its score technically?

Medium

2

C

By classifying the answer using a trained classifier

By matching keywords between answer and question

By reverse-engineering questions from answer and calculating embedding cosine similarity

By comparing the character count of answer vs question

LLM generates N questions from the given answer, converts them to embeddings, and compares cosine similarity with the original question.
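
A toy sketch of that scoring step (the embeddings here are hand-made 2-D vectors purely for illustration; in RAGAS they come from an embedding model and the N questions are reverse-engineered from the answer by an LLM):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevancy(original_question_emb, generated_question_embs):
    """Mean cosine similarity between the original question's embedding
    and the embeddings of N questions generated from the answer."""
    sims = [cosine(original_question_emb, g) for g in generated_question_embs]
    return sum(sims) / len(sims)

q = np.array([1.0, 0.0])
# one generated question identical to the original, one orthogonal to it
gens = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(answer_relevancy(q, gens))  # -> 0.5
```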

13

Unit 1: LLMOps

Lec2

RAGAS Metrics

A low Context Recall score means what in terms of information availability?

Medium

2

D

The information is hallucinated

The answer has redundant information

The retrieved information is scattered

The necessary facts from the reference answer are missing in the retrieved contexts

It means the necessary information from the reference answer was not found in the retrieved contexts.

14

Unit 1: LLMOps

Lec2

RAGAS Metrics

In Context Precision calculation, what is \(v_k\)?

Medium

2

C

Velocity of retrieval

Volume of chunks

Relevance indicator at position k

Value of cosine similarity

\(v_k \in \{0, 1\}\) is the relevance indicator at position k.
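
The indicator feeds the overall score \(\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{total relevant items in top } K}\). A small sketch of that computation (relevance verdicts are given as 0/1 here; in RAGAS an LLM produces them):

```python
def context_precision_at_k(relevance):
    """relevance[i] is v_k for rank k = i + 1 (1 if the chunk at that
    rank is relevant, else 0). Returns sum(Precision@k * v_k) divided
    by the number of relevant items in the top K."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        score += (hits / k) * v_k
    return score / total_relevant

# relevant chunks at ranks 1 and 3: (1.0 + 2/3) / 2
print(round(context_precision_at_k([1, 0, 1]), 2))  # -> 0.83
```

Note how the metric rewards relevant chunks appearing early: `[1, 0, 1]` scores higher than `[0, 1, 1]` even though both retrieve two relevant chunks.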

15

Unit 1: LLMOps

Lec2

RAGAS Metrics

Why might an answer score high in Faithfulness but low in Answer Relevancy?

Medium

2

B

The answer is hallucinated but relevant

The answer is entirely true based on context but fails to address the user’s specific question

The retriever brought back poor context

The context precision is very low

It can be completely faithful to retrieved context, but that context (and answer) might not be what the user asked for.

16

Unit 1: LLMOps

Lec2

RAGAS Metrics

Why is Faithfulness strictly compared to retrieved context and not world knowledge?

Medium

2

A

To prevent LLM hallucinations from being counted as correct if the retriever failed

Ragas has no access to world knowledge

The LLM doesn’t know facts

World knowledge costs more tokens

RAG’s core value is grounding generation in the specific private/provided context, so Faithfulness measures adherence to that context alone; otherwise hallucinations masked by the LLM’s world knowledge would be counted as correct even when the retriever failed.

17

Unit 1: LLMOps

Lec2

RAGAS Metrics

If LLM splits an answer into 3 statements, and only 2 are verified in context, Faithfulness is?

Medium

2

B

0.5

0.67

0.33

1.0

Faithfulness relies on the ratio of correct statements: 2 out of 3 makes it ~0.67.

18

Unit 1: LLMOps

Lec2

RAGAS Metrics

Given a scenario where a user asks about Einstein’s death, but the context only contains his birth, and the LLM answers “Einstein died in 1955” using its internal knowledge. What are the RAGAS metric implications?

Hard

3

B

High Faithfulness, Low Answer Relevancy

Low Faithfulness, High Answer Relevancy

Low Faithfulness, Low Context Recall

High Context Precision, High Context Recall

It answers the user (High Relevancy), but the claim isn’t in context, making Faithfulness low.

19

Unit 1: LLMOps

Lec2

RAGAS Metrics

To improve Context Precision in a RAG pipeline, what architecture modification would you introduce?

Hard

3

C

Increase LLM temperature

Swap FAISS for ChromaDB

Add a Cross-encoder reranking step

Generate multiple answers and average them

Reranking specifically improves the order/ranking of retrieved chunks, heavily impacting Context Precision metrics.

20

Unit 1: LLMOps

Lec2

RAGAS Metrics

Detail the mathematical rationale behind using N reverse-engineered questions for calculating Answer Relevancy.

Hard

3

A

Averages out the stochastic nature of LLMs generating questions to provide a stable semantic similarity

It is required to satisfy vector dimensions

One question uses up too few tokens

N acts as a padding token for embeddings

Generating N questions and averaging their cosine similarities mitigates the variance inherent in LLM generation, ensuring a robust relevancy score.

21

Unit 2: Observability

Lec6

Observability Concepts

What is Observability in the context of LLM applications?

Easy

1

A

The ability to track flows, errors and costs of LLM apps acting as black boxes

A library for generating UI code

A vector database

The algorithm used for chunking texts

It tracks probabilistic components acting as black boxes, aiding in tracing flows, monitoring costs, and debugging.

22

Unit 2: Observability

Lec6

LangFuse Basics

Which of these tools is known for being Open Source?

Easy

1

B

LangChain

LangFuse

LangSmith

OpenAI

LangFuse is a popular open-source tool focusing on engineering observability.

23

Unit 2: Observability

Lec6

Observability Challenges

What makes LLM applications harder to debug than traditional software?

Easy

1

C

They use more memory

They require internet connections

They involve probabilistic, non-deterministic components

They use Python

Unlike deterministic software, where the same input reliably gives the same output, LLMs act as probabilistic black boxes.

24

Unit 2: Observability

Lec6

LangSmith Basics

Who built LangSmith?

Easy

1

B

Google

The LangChain Team

OpenAI

Meta

LangSmith is built by the LangChain team for native integration.

25

Unit 2: Observability

Lec6

LangFuse Integration

In LangFuse, what is used to automatically instrument LangChain chains?

Easy

1

C

System.out.println

VectorEmbeddings

CallbackHandler

FAISS

LangFuse provides a CallbackHandler that automatically instruments chains.

26

Unit 2: Observability

Lec6

Prompt Management

Why should you manage prompts in a tool like LangFuse instead of hardcoding in Git?

Easy

1

A

To allow non-engineers to tweak them

Because Git is too slow

Because Git charges per token

To hide prompts from developers

It acts as a CMS for prompts so non-engineers can comfortably inspect and tweak them.

27

Unit 2: Observability

Lec6

Setup

How can you enable LangSmith auto-tracing in a LangChain project usually?

Easy

1

D

Rewrite all code to use LangSmith classes

Contact support to enable it

Import enable_smith module

Just set environment variables

LangSmith integrates natively with LangChain; typically no code changes are needed, only environment variables.

28

Unit 2: Observability

Lec6

Production Best Practices

What is the recommended tracing sampling rate for Production environments?

Easy

1

C

100%

50%

1-5% of traffic

None

In production, tracing every request is noisy and expensive, so sampling 1-5% of traffic (plus high-importance traces) is recommended.

29

Unit 2: Observability

Lec6

Privacy

How do you handle PII data privacy before logging to a cloud observability tool?

Easy

1

B

Do nothing

Run PII Masking/Redaction functions

Encrypt with simple base64

Delete all logs

Never log sensitive data; run PII Masking or use enterprise redacting features.

30

Unit 2: Observability

Lec6

Alerts

What is an example of a good alert to set up in observability?

Easy

1

A

Error Rate Spike > 10% in 5 min

“Hello World” printed

CPU temperature

Single user logged out

You should alert on things like Error Rate > 10%, Latency Spikes, or Cost Anomalies.

31

Unit 2: Observability

Lec6

LangFuse vs LangSmith

If self-hosting data privacy is an absolute requirement and budget is zero, which tool is recommended?

Medium

2

C

Weights & Biases

LangSmith

LangFuse

CloudWatch

LangFuse is Open Source (MIT) and offers easy self-hosting (Docker Compose) for free.

32

Unit 2: Observability

Lec6

LangSmith Playground

What is the “Playground: Edit and Re-run” feature in LangSmith useful for?

Medium

2

A

You can take a failed production trace, change the prompt, and test a fix immediately

Training new models

Deploying code to AWS

Chatting with other developers

It allows you to take failed real-world traces and edit prompts/parameters to instantly see if the issue resolves.

33

Unit 2: Observability

Lec6

Latency Debugging

If a RAG request takes 10 seconds, how does tracing help?

Medium

2

B

It makes the query faster

It breaks down the latency per component (e.g., Vector DB vs API completion)

It charges the user for the wait time

It cancels requests longer than 5 seconds

Tracing visualizes the execution flow, pinpointing exactly which step (Vector Search vs Generate) is the bottleneck.

34

Unit 2: Observability

Lec6

Cost Tracking

Why is Cost Tracking a critical feature in LLM Observability compared to traditional app monitoring?

Medium

2

D

Because AWS charges are cheap

Because you don’t need servers

Because LLMs don’t cost real money

Because LLM API calls are charged per-token and single runaway loops can cost hundreds of dollars quickly

API calls are expensive, requiring real-time tracking to prevent unmanaged financial overruns.

35

Unit 2: Observability

Lec6

Langchain Integration

What environment variable activates LangSmith tracing?

Medium

2

B

LANGCHAIN_DEBUG=1

LANGCHAIN_TRACING_V2=true

LANGCHAIN_LOG=all

LANGSMITH_ACTIVE=1

export LANGCHAIN_TRACING_V2=true activates LangSmith native tracing.

36

Unit 2: Observability

Lec6

Prompt CMS

How do you fetch a production prompt dynamically using LangFuse SDK?

Medium

2

A

Using langfuse.get_prompt(name, version)

Reading from a local .json file

Executing a GraphQL query to Github

Using prompt = os.getenv('PROMPT')

Langfuse acts as a CMS and lets you retrieve prompts with get_prompt(name), optionally pinned to a specific version or deployment label such as "production".

37

Unit 2: Observability

Lec6

Alerts & Best Practices

Why shouldn’t you just “stare at dashboards” for production LLM apps?

Medium

2

A

You need automated alerts (error spikes, costs) to respond fast to anomalies

Dashboards are always broken

It slows down the computer

Observability doesn’t provide dashboards

Dashboards are passive. Automated alerts are needed to actively manage sudden cost, latency, or error anomalies.

38

Unit 2: Observability

Lec6

Advanced LangChain Integration

You have a complex application utilizing standard Python code, LangChain agent loops, and custom API calls. Should you prefer LangSmith or LangFuse, and why?

Hard

3

B

LangSmith, because it supports Python natively better

LangFuse, because it is platform-agnostic and instruments cleanly across non-LangChain code too.

LangSmith, because LangChain is mandatory.

LangFuse, because it has an “Edit and Re-run” playground.

LangFuse is platform-agnostic for non-LangChain code, making it better for mixed-stack integrations, while LangSmith is highly specific and native to LangChain execution loops.

39

Unit 2: Observability

Lec6

Debugging Scenarios

In production, users report the chatbot occasionally ignores their negative feedback instructions. How would you leverage LangSmith to resolve this?

Hard

3

C

By deleting the user history and trying again

Check the VectorDB logs

Locate the failed traces in LangSmith, transition them to the Playground, adjust the system prompt, and replay to verify compliance

Re-index the FAISS database

LangSmith’s Playground lets you pull failed traces directly, manipulate the prompt, and replay the exact trace environment to find the fix.

40

Unit 2: Observability

Lec6

Data Security Architecture

Explain a robust architectural design for handling HIPAA/PII compliance while using a SaaS LLM Observability platform like LangSmith Enterprise.

Hard

3

A

Run an edge/middleware service that performs localized PII Entity masking/redaction before transmitting traces to the LangSmith API

Avoid observability tools completely

Share passwords directly via the agent

Mask PII inside the LangSmith GUI

PII must not leave the secure perimeter; redaction must happen at the application layer or middleware before data is shipped via logs/traces.
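
A toy sketch of such a redaction layer (illustrative regexes only; real deployments typically use NER-based PII detection, e.g. Microsoft Presidio, applied in middleware before traces leave the secure perimeter):

```python
import re

# Mask common PII patterns before a trace is shipped to the SaaS platform.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text):
    """Replace each detected PII entity with a bracketed type label."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact jane@example.com, SSN 123-45-6789"))
# -> Contact [EMAIL], SSN [SSN]
```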