AI Project Exams#
This page consolidates project exam descriptions from all advanced AI training modules.
RAG Optimization Project Exam#
Final Exam: Enterprise RAG System#
Overview#
| Field | Value |
|---|---|
| Course | RAG and Optimization |
| Duration | 240 minutes (4 hours) |
| Passing Score | 70% |
| Total Points | 100 |
Description#
You have been hired as an AI Engineer at TechDocs Inc., a company that provides enterprise documentation solutions. Your task is to build a production-ready Enterprise RAG System that can answer complex questions about technical documentation, company policies, and product specifications.
The current basic RAG system has several limitations:
Poor retrieval quality due to fixed-size chunking
Slow search performance with growing document collections
Inability to handle keyword-specific queries (error codes, product IDs)
Redundant and irrelevant results in retrieved documents
Missing relationship information between entities (policies, stakeholders, regulations)
You must apply all five optimization techniques learned in this module to build a comprehensive, production-grade RAG system.
Objectives#
By completing this exam, you will demonstrate mastery of:
Implementing Semantic Chunking for intelligent document segmentation
Configuring HNSW Index for high-performance vector search
Building Hybrid Search combining BM25 and Vector Search with RRF fusion
Applying Query Transformation techniques (HyDE and Query Decomposition)
Implementing Post-Retrieval Processing with Cross-Encoder and MMR
Designing a GraphRAG architecture for relationship-aware retrieval
Problem Description#
Build an Enterprise RAG System named enterprise-rag-system that processes a collection of technical documents and provides accurate, contextual answers to user queries. The system must handle:
Technical documentation with code snippets, error codes, and specifications
Policy documents with stakeholder relationships and regulatory references
Product catalogs with model numbers, features, and comparisons
The system should intelligently route queries to the appropriate retrieval strategy and provide high-quality, diverse, and accurate results.
Assumptions#
You have access to sample documents (technical docs, policies, product specs) or will use provided sample data
OpenAI API key or compatible LLM endpoint is available
Neo4j database is available (local Docker or cloud instance)
Python 3.10+ environment with necessary packages installed
Basic understanding of all five RAG optimization techniques
Technical Requirements#
Environment Setup#
Python 3.10 or higher
Required packages:
- `langchain >= 0.1.0`
- `langchain-neo4j >= 0.1.0`
- `openai >= 1.0.0`
- `sentence-transformers >= 2.2.0`
- `chromadb >= 0.4.0` OR `qdrant-client >= 1.7.0`
- `rank-bm25 >= 0.2.2`
- `pydantic >= 2.0.0`
- `neo4j >= 5.0.0`
Infrastructure#
Vector Database: ChromaDB or Qdrant with HNSW indexing
Graph Database: Neo4j (Docker recommended)
Embedding Model: `text-embedding-3-small` or `all-MiniLM-L6-v2`
Cross-Encoder: `cross-encoder/ms-marco-MiniLM-L-6-v2`
LLM: GPT-4 or equivalent
Tasks#
Task 1: Advanced Indexing Pipeline (20 points)#
Time Allocation: 45 minutes
Implement an intelligent document indexing pipeline that preserves semantic coherence.
Requirements:#
Semantic Chunking Implementation
Build a chunker that splits documents based on semantic similarity between sentences
Configure similarity threshold (0.7-0.85) and chunk size limits
Handle edge cases: code blocks, tables, lists, short documents
HNSW Index Configuration
Set up vector database with HNSW indexing
Configure optimal parameters: `M=32`, `ef_construction=200`, `ef_search=100`
Document the trade-offs for your chosen configuration
Indexing Pipeline
Process at least 20 documents through the pipeline
Store metadata (source, chunk_id, document_type) with each vector
Implement batch processing for efficiency
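The sentence-splitting logic above can be sketched in a few lines. This is a toy sketch: `jaccard_similarity` (word overlap) stands in for cosine similarity over `sentence-transformers` embeddings, so the threshold here is far lower than the 0.7-0.85 range you would use with real embeddings.

```python
from typing import Callable, List

def jaccard_similarity(a: str, b: str) -> float:
    """Toy similarity: word overlap between two sentences (embedding stand-in)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def semantic_chunk(
    sentences: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.2,
    max_sentences: int = 10,
) -> List[List[str]]:
    """Start a new chunk when adjacent sentences drop below the similarity
    threshold, or when the current chunk hits its size limit."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for sent in sentences:
        if current and (
            similarity(current[-1], sent) < threshold
            or len(current) >= max_sentences
        ):
            chunks.append(current)
            current = []
        current.append(sent)
    if current:
        chunks.append(current)
    return chunks
```

Swapping in an embedding-based similarity function (and shielding code blocks and tables from splitting) turns this into the required chunker.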
Deliverables:#
`indexing/semantic_chunker.py`
`indexing/vector_store.py`
Indexed document collection with metadata
Task 2: Hybrid Search Implementation (20 points)#
Time Allocation: 45 minutes
Build a hybrid retrieval system that combines keyword and semantic search.
Requirements:#
BM25 Retriever
Implement BM25 indexing for all document chunks
Proper tokenization with case normalization and punctuation handling
Return top-K results with BM25 scores
Hybrid Search with RRF
Execute both BM25 and Vector Search in parallel
Implement RRF fusion: `RRF(d) = Σ 1/(60 + rank(d))`
Handle documents appearing in only one result list
Query Router
Analyze query to determine optimal search strategy
Route keyword-heavy queries to prioritize BM25
Route semantic queries to prioritize Vector Search
Use Hybrid Search as default
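The RRF formula above takes only a few lines to implement. This sketch fuses plain ranked lists of document IDs; in the exam, those lists would come from the BM25 retriever and the vector store.

```python
from collections import defaultdict
from typing import Dict, List

def rrf_fuse(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    A document missing from one list simply contributes nothing for that
    list, which naturally handles results appearing in only one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a document ranked #2 by BM25 and #1 by vector search outscores one ranked #1 by BM25 alone.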
Deliverables:#
`retrieval/bm25_retriever.py`
`retrieval/hybrid_search.py`
`retrieval/query_router.py`
Task 3: Query Transformation Layer (15 points)#
Time Allocation: 35 minutes
Implement query transformation to handle vague and complex queries.
Requirements:#
HyDE Implementation
Generate hypothetical answer paragraphs using LLM
Use hypothetical answer embedding for retrieval
Design domain-appropriate generation prompts
Query Decomposition
Detect multi-part questions requiring information from multiple sources
Generate independent sub-queries for parallel retrieval
Aggregate results from all sub-queries
Transformation Router
Classify queries: simple, vague (use HyDE), complex (use Decomposition)
Apply appropriate transformation before retrieval
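A rule-based version of the transformation router might look like the sketch below. The specific signals (question-mark count, conjunctions, query length) are illustrative assumptions; a production router could use an LLM classifier instead.

```python
def classify_query(query: str) -> str:
    """Heuristic classifier for the transformation router.

    - "complex": multi-part questions -> Query Decomposition
    - "vague": short, underspecified queries -> HyDE
    - "simple": everything else -> direct retrieval
    """
    q = query.lower()
    multi_part_markers = (" and ", " also ", " as well as ", " compare ")
    if q.count("?") > 1 or any(m in q for m in multi_part_markers):
        return "complex"
    if len(q.split()) <= 4:
        return "vague"
    return "simple"
```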
Deliverables:#
`transformation/hyde.py`
`transformation/query_decomposition.py`
`transformation/transformation_router.py`
Task 4: Post-Retrieval Processing (15 points)#
Time Allocation: 35 minutes
Implement re-ranking and diversity optimization for retrieved results.
Requirements:#
Cross-Encoder Re-ranking
Retrieve top-50 candidates with Bi-Encoder
Re-rank using Cross-Encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`)
Return top-10 re-ranked results
MMR for Diversity
Implement MMR algorithm with configurable λ parameter
Default λ=0.5 for balanced relevance/diversity
Ensure diverse information coverage in final results
Configurable Pipeline
Support both orders: Cross-Encoder → MMR and MMR → Cross-Encoder
Allow configuration of k values at each stage
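The MMR algorithm itself is framework-independent and worth understanding in isolation. The sketch below works on raw vectors with plain-Python cosine similarity; in the exam pipeline the vectors would be your stored embeddings.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 10,
    lam: float = 0.5,
) -> List[int]:
    """Maximal Marginal Relevance: greedily pick documents balancing relevance
    to the query (weight lam) against similarity to documents already selected
    (weight 1 - lam). Returns the indices of the selected documents."""
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        best, best_score = candidates[0], -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ=1.0 this reduces to plain relevance ranking; lowering λ pushes near-duplicate documents down the list.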
Deliverables:#
`post_retrieval/cross_encoder_reranker.py`
`post_retrieval/mmr.py`
`post_retrieval/post_retrieval_pipeline.py`
Task 5: GraphRAG Integration (20 points)#
Time Allocation: 50 minutes
Build a knowledge graph for relationship-aware retrieval.
Requirements:#
Entity Extraction
Define Pydantic models for domain entities (Policy, Stakeholder, Product, Regulation, etc.)
Extract entities and relationships using LLM with structured output
Validate extracted data against schema
Knowledge Graph Construction
Populate Neo4j with extracted entities and relationships
Use MERGE to prevent duplicates
Create appropriate indexes for query performance
Graph-Aware Retrieval
Implement natural language to Cypher translation
Support relationship traversal queries
Combine graph results with vector search results
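For the MERGE requirement, it helps to build parameterized Cypher rather than interpolating values into the query string. The helper below is an illustrative sketch (execution through the `neo4j` driver is omitted): it upserts an entity by a key property so that re-running extraction never creates duplicates.

```python
def merge_entity_query(label: str, key: str, props: dict) -> tuple:
    """Build a parameterized Cypher MERGE statement for an entity upsert."""
    # Guard against Cypher injection via label/key, since those cannot be
    # passed as query parameters.
    if not label.isidentifier() or not key.isidentifier():
        raise ValueError("label and key must be valid identifiers")
    set_clause = ", ".join(f"e.{k} = ${k}" for k in props if k != key)
    query = f"MERGE (e:{label} {{{key}: ${key}}})"
    if set_clause:
        query += f" SET {set_clause}"
    return query, props
```

The returned query and parameter dict can be tested first in the Neo4j Browser, as the hints suggest, before wiring them into the driver.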
Deliverables:#
`graph/entity_models.py`
`graph/entity_extractor.py`
`graph/knowledge_graph.py`
`graph/graph_retriever.py`
Task 6: Integration and Orchestration (10 points)#
Time Allocation: 30 minutes
Integrate all components into a unified RAG system.
Requirements:#
Unified Query Pipeline
Accept user query as input
Apply query classification and routing
Execute appropriate retrieval strategy
Apply post-retrieval processing
Generate final answer using LLM
Configuration Management
Externalize all configurable parameters
Support different modes: fast (less accurate), accurate (slower), balanced
Error Handling and Logging
Graceful degradation if a component fails
Structured logging for debugging and monitoring
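One simple way to satisfy the mode requirement is a preset table that the pipeline reads at startup. The parameter names below (`top_k`, `use_reranker`, `use_hyde`) are hypothetical placeholders for whatever knobs your pipeline exposes.

```python
# Hypothetical presets: trade retrieval depth and re-ranking for latency.
MODES = {
    "fast":     {"top_k": 5,  "use_reranker": False, "use_hyde": False},
    "balanced": {"top_k": 20, "use_reranker": True,  "use_hyde": False},
    "accurate": {"top_k": 50, "use_reranker": True,  "use_hyde": True},
}

def load_mode(name: str = "balanced") -> dict:
    """Return the preset for a mode, falling back to 'balanced' for unknown
    names (graceful degradation rather than a hard failure)."""
    return MODES.get(name, MODES["balanced"])
```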
Deliverables:#
`main.py` or `enterprise_rag.py`
`config.py` or `config.yaml`
`README.md` with setup and usage instructions
Questions to Answer#
Include written answers to these questions in your README.md or a separate ANSWERS.md file:
Architecture Decision: Explain why you chose your specific HNSW parameters and how they balance speed vs. accuracy for this use case.
Hybrid Search Trade-offs: Describe a scenario where Hybrid Search significantly outperforms pure Vector Search, and explain why.
Query Transformation Selection: How does your system decide when to use HyDE vs. Query Decomposition? What signals does it look for?
Re-ranking Strategy: Why did you choose your specific order of Cross-Encoder and MMR? What would change if the use case prioritized diversity over precision?
GraphRAG Value: Provide an example query that your GraphRAG component can answer that would be impossible or very difficult with vector search alone.
Submission Rules#
Required Deliverables#
Complete source code organized in the specified directory structure
`README.md` with:
Setup instructions (dependencies, environment variables, database setup)
Usage examples for different query types
Architecture diagram (can be text-based)
`ANSWERS.md` with written responses to the 5 questions
`docker-compose.yml` for Neo4j and any other services
Sample queries demonstrating each component's functionality
Screenshots or logs showing successful execution
Submission Checklist#
All code runs without errors
Semantic Chunking preserves document semantics
HNSW index is properly configured and benchmarked
Hybrid Search correctly combines BM25 and Vector results
Query Transformation handles vague and complex queries
Cross-Encoder improves ranking precision
MMR ensures result diversity
GraphRAG answers relationship queries
All components are integrated in unified pipeline
Documentation is complete and clear
Grading Rubrics#
| Criterion | Weight | Excellent (90-100%) | Good (70-89%) | Satisfactory (50-69%) | Needs Improvement (<50%) |
|---|---|---|---|---|---|
| Advanced Indexing | 20% | Semantic chunking preserves context perfectly; HNSW optimally configured with benchmarks | Chunking works with minor issues; HNSW configured but not optimized | Basic chunking implemented; HNSW uses default parameters | Chunking breaks context; HNSW not implemented |
| Hybrid Search | 20% | BM25 and RRF perfectly implemented; Query router makes intelligent decisions | Hybrid search works; Router has some misclassifications | Basic hybrid search; No query routing | Hybrid search not functional |
| Query Transformation | 15% | HyDE and Decomposition both work excellently; Smart routing between them | Both techniques work; Routing is rule-based | One technique works; No routing | Neither technique functional |
| Post-Retrieval | 15% | Cross-Encoder significantly improves precision; MMR provides diverse results | Both components work; Measurable improvement | One component works | Neither component functional |
| GraphRAG | 20% | Complete entity extraction; Rich graph; Answers complex relationship queries | Graph populated; Basic queries work | Partial graph; Limited queries | Graph not functional |
| Integration | 10% | Seamless pipeline; Excellent error handling; Clean configuration | Components integrated; Some rough edges | Partial integration | Components not connected |
Estimated Time#
| Task | Time Allocation |
|---|---|
| Task 1: Advanced Indexing | 45 minutes |
| Task 2: Hybrid Search | 45 minutes |
| Task 3: Query Transformation | 35 minutes |
| Task 4: Post-Retrieval | 35 minutes |
| Task 5: GraphRAG | 50 minutes |
| Task 6: Integration | 30 minutes |
| Total | 240 minutes (4 hours) |
Hints#
General Tips:
Start by setting up the infrastructure (Neo4j, Vector DB) before writing code
Test each component independently before integration
Use the companion notebooks from assignments as references
Cache LLM responses during development to save API costs
Component-Specific Tips:
For Semantic Chunking: Use `sentence-transformers` for efficient similarity calculation
For HNSW: Prioritize `ef_search` tuning for query-time optimization
For BM25: Use `nltk.word_tokenize()` for consistent tokenization
For HyDE: The hypothetical answer doesn't need to be factually correct
For Cross-Encoder: Batch processing significantly improves throughput
For GraphRAG: Test Cypher queries in Neo4j Browser before implementing in code
Notes#
You can use your implementation from the previous assignment lab as a starting point.
LangGraph and Agentic AI Project Exam#
Final Project Exam: FPT Customer Chatbot - Multi-Agent AI System#
Overview#
| Field | Value |
|---|---|
| Course | LangGraph and Agentic AI |
| Project Name | |
| Duration | 360 minutes (6 hours) |
| Passing Score | 70% |
| Total Points | 100 |
| Framework | Python 3.10+, LangGraph, LangChain, Tavily API, FAISS, OpenAI |
Description#
You have been hired as an AI Engineer at FPT Software, tasked with building a Multi-Agent Customer Service Chatbot AI Core that demonstrates mastery of all concepts covered in the LangGraph and Agentic AI module.
This final project consolidates all five assignments into a single comprehensive multi-agent system:
Assignment 01: LangGraph Foundations & State Management
Assignment 02: Multi-Expert ReAct Research Agent
Assignment 03: Tool Calling & Tavily Search Integration
Assignment 04: FPT Customer Chatbot - Multi-Agent System
Assignment 05: Human-in-the-Loop & Persistence
You will build the AI Core for an FPT Customer Chatbot with hierarchical multi-agent architecture, real-time web search, human approval workflows, response caching, and persistent state management.
This exam focuses purely on the AI/LangGraph logic. For the Engineering layer (FastAPI, database, REST APIs), please refer to the Building Monolith API with FastAPI module's final exam.
Objectives#
By completing this exam, you will demonstrate mastery of:
State Management: Implementing messages-centric patterns with TypedDict and add_messages reducer
ReAct Pattern: Building reasoning + acting loops with iteration control
Tool Calling: Integrating external APIs (Tavily) with parallel execution
Multi-Agent Architecture: Designing hierarchical systems with specialized agents
Human-in-the-Loop: Implementing interrupt patterns for user confirmation
Persistence: Configuring checkpointers for long-running conversations
Caching: Building vector store-based response caching with FAISS
Problem Description#
Build the AI Core for an FPT Customer Service Chatbot named fpt-customer-chatbot-ai that includes:
| Agent | Responsibilities |
|---|---|
| Primary Assistant | Routes user queries to appropriate specialized agents |
| FAQ Agent | Answers FPT policy questions using RAG with cached responses |
| Ticket Agent | Handles ticket-related conversations with HITL approval (mock tools) |
| Booking Agent | Handles booking conversations with HITL confirmation (mock tools) |
| IT Support Agent | Troubleshoots technical issues using Tavily search + caching |
The system must:
Maintain conversation context across multiple turns
Require human confirmation before sensitive operations
Cache responses for similar queries
Persist state across process restarts
Handle agent transitions gracefully with dialog stack
The Ticket and Booking agents will use mock tools that simulate database operations. The actual database integration is covered in the FastAPI module exam.
Prerequisites#
Completed all 5 module assignments (recommended)
OpenAI API key (`OPENAI_API_KEY`)
Tavily API key (`TAVILY_API_KEY`)
Python 3.10+ with virtual environment
Familiarity with Pydantic for schema validation
Technical Requirements#
Environment Setup#
Python 3.10 or higher
Required packages:
- `langgraph >= 0.2.0`
- `langchain >= 0.1.0`
- `langchain-openai >= 0.1.0`
- `langchain-community >= 0.1.0`
- `tavily-python >= 0.3.0`
- `faiss-cpu >= 1.7.0`
- `sentence-transformers >= 2.2.0`
- `pydantic >= 2.0.0`
Mock Data Models#
For testing purposes, define the following Pydantic models (actual database integration is in FastAPI module):
Ticket Model:
| Field | Type | Constraints |
|---|---|---|
| ticket_id | str | Auto-generated UUID |
| content | str | Required |
| description | str \| None | Optional |
| customer_name | str | Required |
| customer_phone | str | Required |
| | str \| None | Optional |
| status | TicketStatus | Pending/InProgress/Resolved/Canceled |
| created_at | datetime | Auto-set |
Booking Model:
| Field | Type | Constraints |
|---|---|---|
| booking_id | str | Auto-generated UUID |
| reason | str | Required |
| time | datetime | Required, must be future |
| customer_name | str | Required |
| customer_phone | str | Required |
| | str \| None | Optional |
| note | str \| None | Optional |
| status | BookingStatus | Scheduled/Finished/Canceled |
Tasks#
Task 1: State Management Foundation (15 points)#
Time Allocation: 60 minutes
Build the core state management infrastructure for the multi-agent system.
Requirements:#
Define AgenticState using TypedDict with:
`messages`: Uses the `Annotated[List[AnyMessage], add_messages]` pattern
`dialog_state`: Stack for tracking agent hierarchy
`user_id`, `email` (optional): Context injection fields
`conversation_id`: Session tracking
Implement dialog stack functions:
`update_dialog_stack(left, right)`: Push/pop agent transitions
`pop_dialog_state(state)`: Return to Primary Assistant
Create context injection that auto-populates user info into tool calls
Configure MemorySaver checkpointer for initial development
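One common way to implement the stack reducer (mirroring LangGraph's customer-support examples) is a pure function that pushes a new agent name, pops on the sentinel `"pop"`, and leaves the stack alone on `None`:

```python
def update_dialog_stack(left: list, right) -> list:
    """Reducer for dialog_state: push an agent name, pop on "pop",
    or keep the stack unchanged when right is None."""
    if right is None:
        return left
    if right == "pop":
        return left[:-1]
    return left + [right]
```

In the state definition this would be attached via `Annotated[list[str], update_dialog_stack]`, so any node can signal a push or pop by returning a `dialog_state` value.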
Deliverables:#
`state/agent_state.py` - State definition with all fields
`state/dialog_stack.py` - Stack management functions
`state/context_injection.py` - User context injection logic
Task 2: Specialized Agents Implementation (25 points)#
Time Allocation: 120 minutes
Implement all four specialized agents with their tools and schemas.
Requirements:#
Ticket Support Agent (8 points):
Define Pydantic schemas: `CreateTicket`, `TrackTicket`, `UpdateTicket`, `CancelTicket`
Implement mock tools that simulate CRUD operations (return success messages, store in memory dict)
Status transitions: Pending → InProgress → Resolved (or Canceled)
Add `CompleteOrEscalate` tool for returning to Primary Assistant
Tools should accept and validate all required fields
Booking Agent (7 points):
Define Pydantic schemas with time validation (must be future)
Implement mock tools: `BookRoom`, `TrackBooking`, `UpdateBooking`, `CancelBooking`
Status transitions: Scheduled → Finished (or Canceled)
Include `CompleteOrEscalate` tool
IT Support Agent (5 points):
Integrate Tavily Search with `max_results=5`, `search_depth="advanced"`
Return practical troubleshooting guides from reliable sources
Include `CompleteOrEscalate` tool
FAQ Agent (5 points):
Implement simple RAG for FPT policy questions
Return answers with source references
Include `CompleteOrEscalate` tool
Mock tools should use an in-memory dictionary to store data for testing. This allows the AI system to function independently without database dependencies. The actual database integration will be handled in the FastAPI module exam.
Example mock implementation pattern:
```python
import uuid

from langchain_core.tools import tool

# In-memory storage for testing
_ticket_store: dict[str, dict] = {}

@tool
def create_ticket(content: str, customer_name: str, customer_phone: str, ...) -> str:
    """Create a new support ticket."""
    ticket_id = str(uuid.uuid4())
    _ticket_store[ticket_id] = {...}
    return f"Ticket created successfully with ID: {ticket_id}"
```
Deliverables:#
`agents/ticket_agent.py` - Ticket Support Agent with mock tools
`agents/booking_agent.py` - Booking Agent with mock tools
`agents/it_support_agent.py` - IT Support Agent with Tavily
`agents/faq_agent.py` - FAQ Agent with RAG
`schemas/` directory with all Pydantic models
Task 3: Primary Assistant & Graph Construction (20 points)#
Time Allocation: 90 minutes
Build the Primary Assistant and construct the complete multi-agent graph.
Requirements:#
Define routing tools for Primary Assistant:
`ToTicketAssistant`: Route ticket-related queries
`ToBookingAssistant`: Route booking-related queries
`ToITAssistant`: Route technical issues
`ToFAQAssistant`: Route policy questions
Include user context injection in all routing tools
Implement entry nodes for agent transitions:
Create `create_entry_node(assistant_name)` factory function
Entry nodes push new agent to `dialog_state` stack
Generate appropriate welcome message
Build StateGraph with:
Primary Assistant as entry point
All specialized agent nodes
ToolNode for each agentβs tools
Conditional routing based on intent
Edge handling for `CompleteOrEscalate`
Create `tool_node_with_fallback` for graceful error handling
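The conditional entry routing can be driven entirely by the dialog stack. A minimal sketch, assuming node names match the agents in this project:

```python
def route_to_workflow(state: dict) -> str:
    """Resume with whichever agent is on top of the dialog stack; fall back
    to the Primary Assistant for a fresh conversation."""
    dialog_state = state.get("dialog_state") or []
    return dialog_state[-1] if dialog_state else "primary_assistant"
```

Passed to `add_conditional_edges`, this ensures a resumed thread lands back in the specialized agent the user was talking to.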
Deliverables:#
`agents/primary_assistant.py` - Primary Assistant with routing
`graph/entry_nodes.py` - Entry node factory function
`graph/builder.py` - Complete graph construction
`graph/routing.py` - Conditional routing logic
Graph visualization PNG using `get_graph().draw_mermaid_png()`
Task 4: Human-in-the-Loop Confirmation (20 points)#
Time Allocation: 90 minutes
Implement interrupt patterns for sensitive operations.
Requirements:#
Configure `interrupt_before` for sensitive tools:
All ticket creation/update/cancel operations
All booking creation/update/cancel operations
NOT for read operations (track) or search operations
Implement confirmation flow:
Detect pending tool state via `graph.get_state(config)`
Generate human-readable confirmation message
Parse user response: "y" to continue, anything else to cancel
Create confirmation message generator:
Extract tool name and arguments from pending state
Format readable summary for user review
Include clear instructions for approval/rejection
Handle user responses:
"y" or "yes": Resume execution with `app.invoke(None, config)`
Other: Update state to cancel operation and return message
Log all confirmation decisions
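The message generator and response parsing are plain string handling; a minimal sketch:

```python
def format_confirmation(tool_name: str, tool_args: dict) -> str:
    """Render a pending tool call as a human-readable approval prompt."""
    lines = [f"The assistant wants to run: {tool_name}"]
    for key, value in tool_args.items():
        lines.append(f"  - {key}: {value}")
    lines.append('Type "y" to approve, or anything else to cancel.')
    return "\n".join(lines)

def is_approved(user_reply: str) -> bool:
    """Treat "y"/"yes" (any casing, surrounding whitespace) as approval."""
    return user_reply.strip().lower() in {"y", "yes"}
```

The tool name and arguments would be read from the pending tool call in `graph.get_state(config)`.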
Deliverables:#
`hitl/interrupt_config.py` - List of sensitive tools
`hitl/confirmation.py` - Confirmation flow logic
`hitl/message_generator.py` - Human-readable message formatting
Task 5: Response Caching with FAISS (10 points)#
Time Allocation: 60 minutes
Implement vector store-based caching for RAG and IT Support responses.
Requirements:#
Create cache_tool that:
Stores all RAG and IT Support responses in FAISS vectorstore
Indexes by query embedding using `sentence-transformers`
Stores metadata: timestamp, query_type, source_agent
Implement cache lookup in orchestrator:
Before calling RAG/IT tools, check cache for similar queries
Use similarity threshold (0.85) to determine cache hit
Return cached response if found, otherwise proceed to tool
Add cache management:
TTL-based invalidation (24 hours)
Manual cache clear capability
Cache statistics logging (hits, misses, hit rate)
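The cache logic can be prototyped without FAISS at all. In the sketch below, a bag-of-words set stands in for a real embedding and a linear scan stands in for a FAISS index; swapping in `sentence-transformers` vectors and a FAISS similarity search yields the required implementation.

```python
import time

def bow_embed(text: str) -> set:
    """Toy 'embedding': a bag of words (stand-in for a sentence vector)."""
    return set(text.lower().split())

def bow_similarity(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

class SimilarityCache:
    """Query-similarity cache with TTL expiry and hit/miss statistics."""

    def __init__(self, threshold: float = 0.85, ttl_seconds: float = 24 * 3600):
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.entries = []  # (embedding, response, stored_at)
        self.hits = 0
        self.misses = 0

    def get(self, query: str, now: float = None):
        now = time.time() if now is None else now
        emb = bow_embed(query)
        for stored_emb, response, stored_at in self.entries:
            fresh = now - stored_at <= self.ttl
            if fresh and bow_similarity(emb, stored_emb) >= self.threshold:
                self.hits += 1
                return response
        self.misses += 1
        return None

    def put(self, query: str, response: str, now: float = None) -> None:
        now = time.time() if now is None else now
        self.entries.append((bow_embed(query), response, now))
```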
Deliverables:#
`cache/faiss_cache.py` - FAISS caching implementation
`cache/cache_manager.py` - Cache management and TTL logic
`cache/cache_stats.py` - Statistics tracking
Task 6: Persistence & Production Readiness (10 points)#
Time Allocation: 60 minutes
Configure persistent state and production-ready error handling.
Requirements:#
Replace MemorySaver with SQLiteSaver:
Configure persistent storage in `checkpoints.db`
Test conversation resumption after process restart
Document the migration path to PostgresSaver
Implement thread management:
List active threads
View checkpoint history for a thread
Delete old threads (cleanup)
Add error handling and logging:
Structured logging with conversation context
Graceful error recovery for tool failures
User-friendly error messages
Deliverables:#
`persistence/checkpointer.py` - SQLiteSaver configuration
`persistence/thread_manager.py` - Thread management utilities
`utils/logging.py` - Structured logging setup
`utils/error_handler.py` - Error handling utilities
Test Scenarios#
Complete these test scenarios to demonstrate system functionality:
Scenario 1: Multi-Agent Conversation Flow#
User: "Hi, I need help with a few things"
→ Primary Assistant welcomes user
User: "My laptop won't connect to WiFi"
→ Routes to IT Support Agent
→ Tavily search for troubleshooting
→ Cache response
→ Return to Primary Assistant
User: "I need to book a meeting room for tomorrow 2pm"
→ Routes to Booking Agent
→ Shows confirmation prompt (HITL)
→ User confirms "y"
→ Booking created
→ Return to Primary Assistant
Scenario 2: HITL Rejection Flow#
User: "Create a support ticket for broken monitor"
→ Routes to Ticket Agent
→ Shows confirmation prompt
→ User rejects with "no, wait"
→ Operation cancelled
→ Agent asks for clarification
Scenario 3: Cache Hit Flow#
User: "How do I reset my password?" (first time)
→ FAQ Agent answers from RAG
→ Response cached
User: "Password reset instructions?" (similar query)
→ Cache hit detected (similarity > 0.85)
→ Return cached response
Scenario 4: Persistence Test#
1. Start conversation, create a ticket
2. Stop the process
3. Restart with same thread_id
4. Verify conversation history retained
5. Track the created ticket
Questions to Answer#
Include written responses to these questions in ANSWERS.md:
State Management: Explain why the `add_messages` reducer is essential for multi-turn conversations. What problems would occur without it?
Multi-Agent Architecture: Compare the dialog stack approach vs. flat routing. When would you choose one over the other?
Human-in-the-Loop Trade-offs: What are the UX implications of requiring confirmation for every sensitive action? How would you balance security vs. user experience?
Caching Strategy: How would you handle cache invalidation when the underlying FAQ documents are updated? Propose a solution.
Production Considerations: What additional features would you add before deploying this system to production? Consider: monitoring, scaling, security.
Submission Requirements#
Directory Structure#
```
fpt-customer-chatbot-ai/
├── agents/
│   ├── primary_assistant.py
│   ├── ticket_agent.py
│   ├── booking_agent.py
│   ├── it_support_agent.py
│   └── faq_agent.py
├── schemas/
│   ├── ticket_schemas.py
│   └── booking_schemas.py
├── state/
│   ├── agent_state.py
│   ├── dialog_stack.py
│   └── context_injection.py
├── tools/
│   ├── ticket_tools.py    # Mock tools for ticket operations
│   ├── booking_tools.py   # Mock tools for booking operations
│   └── mock_store.py      # In-memory storage for testing
├── graph/
│   ├── builder.py
│   ├── entry_nodes.py
│   └── routing.py
├── hitl/
│   ├── interrupt_config.py
│   ├── confirmation.py
│   └── message_generator.py
├── cache/
│   ├── faiss_cache.py
│   ├── cache_manager.py
│   └── cache_stats.py
├── persistence/
│   ├── checkpointer.py
│   └── thread_manager.py
├── utils/
│   ├── logging.py
│   └── error_handler.py
├── data/
│   └── fpt_policies.txt (or .json)
├── main.py
├── requirements.txt
├── README.md
├── ANSWERS.md
└── graph_visualization.png
```
This AI core is designed to be integrated with the FastAPI backend from the Building Monolith API with FastAPI module. The mock tools in tools/ directory can be replaced with actual database operations when integrating.
Required Deliverables#
Complete source code following directory structure
`README.md` with:
Setup instructions (environment, API keys, dependencies)
Usage examples and CLI commands
Architecture diagram or explanation
Notes on how to integrate with FastAPI backend
`ANSWERS.md` with written responses to all 5 questions
`requirements.txt` with all dependencies
`graph_visualization.png` - Multi-agent graph visualization
Demo video or screenshots showing:
All four agent flows working
HITL confirmation workflow
Cache hit scenario
Persistence across restart
Submission Checklist#
All code runs without errors
All four specialized agents functional with mock tools
Primary Assistant routes correctly
HITL confirmation works for sensitive operations
Cache stores and retrieves responses
SQLiteSaver enables conversation persistence
Dialog stack tracks agent hierarchy
Context injection auto-populates user info
All test scenarios pass
Documentation is complete
Evaluation Criteria#
| Criteria | Points | Excellent (100%) | Good (75%) | Needs Improvement (50%) |
|---|---|---|---|---|
| State Management (Task 1) | 15 | Perfect messages pattern, dialog stack, injection | Working but minor issues in context handling | Basic state only, missing stack or injection |
| Specialized Agents (Task 2) | 25 | All agents with complete tools and validation | Most agents working, some validation missing | Only 1-2 agents functional |
| Graph Construction (Task 3) | 20 | Complete graph with all routing and fallbacks | Graph works but missing error handling | Basic graph without proper routing |
| Human-in-the-Loop (Task 4) | 20 | Smooth confirmation UX with proper state handling | HITL works but UX needs improvement | Basic interrupt without proper messaging |
| Response Caching (Task 5) | 10 | Full caching with TTL and statistics | Caching works but missing TTL or stats | Basic storage without similarity search |
| Persistence & Production (Task 6) | 10 | SQLite with thread management and error handling | Persistence works but limited management | MemorySaver only, no persistence |
| Total | 100 | | | |
Hints#
Use `state["messages"][-1]` to access the most recent message
The `add_messages` reducer handles message deduplication automatically
Store `dialog_state` as a list for stack operations (append/pop)
Use `ToolNode(tools).with_fallbacks([...])` for graceful error handling
The `CompleteOrEscalate` tool should return a flag that routing can detect
Entry nodes should push to stack, exit nodes should pop
Access pending state with `app.get_state(config).next` to see which node is pending
Use `app.update_state(config, values)` to modify state before resuming
Consider timeout handling for user confirmation
Use `sentence-transformers/all-MiniLM-L6-v2` for consistent embeddings
Store original query and response as metadata, not just embedding
Implement cache warmup for common queries
SQLiteSaver requires a context manager: `with SqliteSaver.from_conn_string(...) as saver:`
Thread IDs should be user-meaningful (e.g., `user123-session1`)
Consider implementing session timeout (24h default)
LLMOps and Evaluation Project Exam#
Final Exam: Production-Ready RAG Evaluation System#
Overview#
| Field | Value |
|---|---|
| Course | LLMOps and Evaluation |
| Duration | 240 minutes (4 hours) |
| Passing Score | 70% |
| Total Points | 100 |
Description#
You have been hired as an MLOps Engineer at AI Solutions Corp., a company that builds enterprise AI assistants. Your task is to build a Production-Ready RAG Evaluation System that combines automated quality assessment, comprehensive observability, and rigorous architecture comparison.
The current system lacks:
Automated evaluation metrics to measure answer quality
Observability into LLM execution, costs, and latency
Data-driven architecture selection based on experiments
You must apply knowledge from RAGAS Evaluation Metrics, LLM Observability (LangFuse/LangSmith), and RAG Architecture Comparison to build a comprehensive evaluation and monitoring platform.
Objectives#
By completing this exam, you will demonstrate mastery of:
Implementing RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall)
Integrating LangFuse and/or LangSmith for comprehensive LLM tracing and cost tracking
Designing and executing RAG architecture experiments with scientific rigor
Building an end-to-end evaluation pipeline that combines all three components
Making data-driven architecture recommendations based on experimental results
Problem Description#
Build a Production-Ready RAG Evaluation System named rag-evaluation-platform that:
Evaluates RAG quality using RAGAS metrics on generated responses
Traces all LLM operations with full observability (tokens, costs, latency)
Compares multiple RAG architectures systematically
Produces actionable reports for architecture selection
The system should serve as a complete toolkit for evaluating, monitoring, and optimizing RAG systems in production.
Assumptions#
You have completed the assignments on RAGAS, Observability, and Experiment Comparison
OpenAI API key or compatible LLM endpoint is available
LangFuse Cloud account OR local Docker setup for self-hosted LangFuse
LangSmith account (free tier)
Python 3.10+ environment with necessary packages installed
Sample documents and test questions are provided or created
Technical Requirements#
Environment Setup#
Python 3.10 or higher
Required packages:
- `ragas >= 0.1.0`
- `langfuse >= 2.0.0`
- `langchain >= 0.1.0`
- `langchain-openai >= 0.0.5`
- `openai >= 1.0.0`
- `chromadb >= 0.4.0` OR `qdrant-client >= 1.7.0`
- `sentence-transformers >= 2.2.0`
- `pandas >= 2.0.0`
- `matplotlib >= 3.7.0`
Infrastructure#
Vector Database: ChromaDB or Qdrant
Observability: LangFuse (required) + LangSmith (optional)
Embedding Model:
text-embedding-3-smallor equivalentLLM: GPT-4 or equivalent
Tasks#
Task 1: RAGAS Evaluation Pipeline (25 points)#
Time Allocation: 60 minutes
Build a comprehensive evaluation pipeline using all four RAGAS metrics.
Requirements:#
Implement RAGAS Evaluation Module
Create functions to calculate Faithfulness, Answer Relevancy, Context Precision, and Context Recall
Support batch evaluation on datasets
Handle edge cases (empty contexts, very short answers)
Create Evaluation Dataset
Prepare at least 30 test questions with ground truth answers
Categorize questions: Factual (40%), Relational (30%), Multi-hop (20%), Analytical (10%)
Include retrieved contexts for each question
Run Evaluation
Execute evaluation on the complete dataset
Calculate aggregate statistics (mean, std, min, max)
Identify failure cases (scores < 0.5)
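The aggregate-statistics step can be sketched without any RAGAS dependency: given the per-question scores for one metric, compute the summary and flag failures below the 0.5 threshold. The function name and return shape here are illustrative, not part of the RAGAS API:

```python
import statistics

def summarize_metric(name, scores, failure_threshold=0.5):
    """Aggregate one RAGAS metric across the dataset and flag failure cases."""
    return {
        "metric": name,
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        # Indices of questions whose score falls below the failure threshold
        "failures": [i for i, s in enumerate(scores) if s < failure_threshold],
    }
```

Running this once per metric gives you the mean/std/min/max table and the failure list required above in one pass.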
Deliverables:#
evaluation/ragas_evaluator.py - Core evaluation logic
evaluation/dataset.py - Dataset loading and preparation
data/test_questions.json - Test dataset with ground truth
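One possible schema for an entry in data/test_questions.json, shown here built and serialized in Python. The field names and the sample values are an assumption for illustration, not a prescribed format:

```python
import json

# Hypothetical schema for one test question; the concrete values are placeholders.
entry = {
    "question": "What does error code E-4012 indicate?",
    "ground_truth": "E-4012 indicates an expired API token.",
    "category": "factual",  # factual | relational | multi_hop | analytical
    "contexts": ["...retrieved chunks for this question go here..."],
}
print(json.dumps(entry, indent=2))
```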
Task 2: LLM Observability Integration (25 points)#
Time Allocation: 60 minutes
Implement comprehensive tracing and monitoring for all LLM operations.
Requirements:#
LangFuse Integration
Configure LangFuse SDK with proper authentication
Implement CallbackHandler for all LangChain operations
Capture: input/output, token counts, latency, costs
Cost Tracking Dashboard
Track token usage per query
Calculate costs based on model pricing
Generate cost breakdown reports
Production Best Practices
Implement configurable sampling (100% dev, 5% prod)
Add PII masking for sensitive data
Create correlation IDs for request tracking
(Bonus) LangSmith Integration
Configure auto-tracing via environment variables
Demonstrate Playground debugging for a failed trace
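The cost tracking and PII masking pieces can each be sketched in a few lines. The pricing table below uses hypothetical placeholder numbers (not current provider pricing), and the regex covers only email addresses; a real masker would handle more PII classes:

```python
import re

# Hypothetical per-1K-token prices in USD; replace with your provider's pricing.
PRICING = {"gpt-4o-mini": {"input": 0.00015, "output": 0.0006}}

def estimate_cost(model, input_tokens, output_tokens, pricing=PRICING):
    """Estimate the USD cost of one LLM call from its token counts."""
    p = pricing[model]
    return input_tokens / 1000 * p["input"] + output_tokens / 1000 * p["output"]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text):
    """Replace email addresses with a placeholder before text is sent to tracing."""
    return EMAIL_RE.sub("<EMAIL>", text)
```

Summing estimate_cost over all traced calls gives the per-query cost breakdown; mask_pii should run before any input/output is attached to a trace.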
Deliverables:#
observability/langfuse_handler.py - LangFuse integration
observability/cost_tracker.py - Cost calculation logic
observability/pii_masker.py - PII handling
Screenshots of the LangFuse dashboard with traces
Task 3: RAG Architecture Comparison (25 points)#
Time Allocation: 60 minutes
Design and execute a rigorous experiment comparing multiple RAG architectures.
Requirements:#
Implement Two RAG Architectures
Naive RAG: Fixed chunking, Top-K retrieval, direct generation
Advanced RAG: Semantic chunking, hybrid search, re-ranking
Run Comparative Experiments
Execute both architectures on the same test dataset
Capture all RAGAS metrics for each architecture
Track latency and cost per query
Performance Analysis
Break down performance by question category
Calculate statistical significance of differences
Create visualizations (bar charts, tables)
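For the statistical-significance requirement, one stdlib-only approach is a paired bootstrap over per-question metric differences between the two architectures (the function name and defaults are illustrative):

```python
import random
import statistics

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=42):
    """Bootstrap a confidence interval for the mean per-question difference
    between two architectures evaluated on the same test dataset."""
    assert len(scores_a) == len(scores_b), "paired comparison needs equal lengths"
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement
        sample = [rng.choice(diffs) for _ in diffs]
        boot_means.append(statistics.mean(sample))
    boot_means.sort()
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.mean(diffs), (lo, hi)
```

If the interval excludes zero, the advanced architecture's improvement on that metric is unlikely to be noise.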
Deliverables:#
architectures/naive_rag.py - Naive RAG implementation
architectures/advanced_rag.py - Advanced RAG implementation
experiments/runner.py - Experiment execution
results/comparison_table.md - Results summary
Task 4: Integrated Evaluation Platform (25 points)#
Time Allocation: 60 minutes
Combine all components into a unified evaluation platform.
Requirements:#
End-to-End Pipeline
Single entry point to run complete evaluation
Automatic tracing of all operations
Configurable architecture selection
Comprehensive Reporting
Generate evaluation report with all metrics
Include observability insights (cost, latency distribution)
Architecture comparison summary
Actionable recommendations
CLI Interface
python evaluate.py --architecture naive --dataset data/test.json --output results/
python evaluate.py --architecture advanced --dataset data/test.json --output results/
python compare.py --results-dir results/ --output comparison_report.md
Answer Key Questions
Which architecture should be used for production and why?
What is the cost-quality trade-off between architectures?
What are the top 3 failure patterns, and how would you address them?
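A minimal argparse skeleton matching the evaluate.py interface above might look like this (the flag names come from the commands shown; everything else is illustrative):

```python
import argparse

def build_parser():
    """CLI skeleton for the evaluate.py entry point described above."""
    parser = argparse.ArgumentParser(description="Run a RAG evaluation pipeline")
    parser.add_argument("--architecture", choices=["naive", "advanced"],
                        required=True, help="Which RAG architecture to evaluate")
    parser.add_argument("--dataset", required=True,
                        help="Path to the test dataset JSON")
    parser.add_argument("--output", default="results/",
                        help="Directory for reports and raw results")
    return parser

# Example: parse a known argument list instead of sys.argv
args = build_parser().parse_args(
    ["--architecture", "naive", "--dataset", "data/test.json"]
)
```

compare.py can reuse the same pattern with --results-dir and --output flags.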
Deliverables:#
evaluate.py - Main evaluation script
compare.py - Architecture comparison script
reports/evaluation_report.md - Complete evaluation report
ANSWERS.md - Written responses to key questions
Questions to Answer#
Include written responses to these questions in ANSWERS.md:
RAGAS Interpretation: Analyze your Faithfulness and Answer Relevancy scores. What do low scores indicate about your RAG system, and how would you improve them?
Observability Value: How did LangFuse/LangSmith tracing help you identify issues in your RAG pipeline? Provide a specific example.
Architecture Decision: Based on your experiments, which RAG architecture would you recommend for a customer support chatbot vs. a legal document Q&A system? Justify with data.
Cost Optimization: If you had to reduce costs by 50% while maintaining 90% of quality, what strategies would you employ? Reference your experimental results.
Production Readiness: What additional monitoring, alerting, or evaluation would you add before deploying this system to production?
Submission Requirements#
Required Deliverables#
Complete source code organized in the specified directory structure
README.md with:
Setup instructions (dependencies, API keys, observability setup)
Usage examples for CLI commands
Architecture diagram of the evaluation platform
ANSWERS.md with written responses to the 5 questions
Test dataset with at least 30 categorized questions
Results tables and visualizations
Screenshots of observability dashboards
Submission Checklist#
All code runs without errors
RAGAS evaluation produces valid scores for all metrics
LangFuse traces are captured and visible in dashboard
Both RAG architectures are implemented and evaluated
Comparison report includes statistical analysis
All questions answered with data-backed reasoning
Evaluation Criteria#
| Criteria | Weight | Excellent (90-100%) | Good (70-89%) | Needs Improvement (50-69%) | Unsatisfactory (<50%) |
|---|---|---|---|---|---|
| RAGAS Evaluation | 25% | All 4 metrics implemented correctly; comprehensive dataset; insightful failure analysis | Metrics implemented; adequate dataset; basic analysis | Partial metrics; small dataset; minimal analysis | Missing metrics; no dataset |
| Observability | 25% | Full LangFuse integration; cost tracking; PII handling; production best practices | LangFuse working; basic cost tracking; some best practices | Partial tracing; no cost tracking | No observability integration |
| Architecture Comparison | 25% | Both architectures implemented; rigorous experiments; statistical analysis; visualizations | Both architectures; experiments run; basic comparison | One architecture; limited experiments | No architecture comparison |
| Integration & Reporting | 15% | Seamless pipeline; comprehensive reports; CLI interface; actionable insights | Components integrated; adequate reports | Partial integration; basic reports | Components not connected |
| Code Quality & Documentation | 10% | Clean code; comprehensive docs; clear README; well-organized | Readable code; adequate docs | Messy code; minimal docs | Poor quality; no docs |
Estimated Time#
| Task | Time Allocation |
|---|---|
| Task 1: RAGAS Evaluation Pipeline | 60 minutes |
| Task 2: LLM Observability Integration | 60 minutes |
| Task 3: RAG Architecture Comparison | 60 minutes |
| Task 4: Integrated Evaluation Platform | 60 minutes |
| Total | 240 minutes (4 hours) |
Hints#
Task 1 - RAGAS:
Use the companion notebook 10_RAG_Evaluation_with_Ragas.ipynb as a reference
Start with a small dataset (10 questions) to verify your pipeline before scaling up
For claim decomposition in Faithfulness, consider using GPT-4 for accuracy
Task 2 - Observability:
Set up LangFuse first; its explicit callback handlers make the tracing flow easier to understand
Use environment variables to switch between dev (100% tracing) and prod (5% sampling) modes
Test PII masking with fake data before using real sensitive information
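The dev/prod sampling hint can be implemented with a single environment variable. A minimal sketch (the variable name TRACE_SAMPLE_RATE is an assumption for this project, not a LangFuse setting):

```python
import os
import random

def should_trace(default_rate=1.0):
    """Decide whether to trace this request.

    Set TRACE_SAMPLE_RATE=1.0 in dev (trace everything) and e.g. 0.05 in prod.
    """
    rate = float(os.environ.get("TRACE_SAMPLE_RATE", default_rate))
    return random.random() < rate
```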
Task 3 - Experiments:
Use the same embedding model for both architectures to ensure fair comparison
Run each query multiple times if measuring latency to account for variance
Calculate confidence intervals when comparing metric differences
Task 4 - Integration:
Use Python's argparse or click library for CLI implementation
Generate markdown reports that can be easily shared with stakeholders
Include both quantitative metrics and qualitative insights in recommendations