AI Project Exams#
This page consolidates project exam descriptions from all advanced AI training modules.
RAG Optimization Project Exam#
Final Exam: Enterprise RAG System#
Overview#
| Field | Value |
|---|---|
| Course | RAG and Optimization |
| Duration | 240 minutes (4 hours) |
| Passing Score | 70% |
| Total Points | 100 |
Description#
You have been hired as an AI Engineer at TechDocs Inc., a company that provides enterprise documentation solutions. Your task is to build a production-ready Enterprise RAG System that can answer complex questions about technical documentation, company policies, and product specifications.
The current basic RAG system has several limitations:
Poor retrieval quality due to fixed-size chunking
Slow search performance with growing document collections
Inability to handle keyword-specific queries (error codes, product IDs)
Redundant and irrelevant results in retrieved documents
Missing relationship information between entities (policies, stakeholders, regulations)
You must apply all five optimization techniques learned in this module to build a comprehensive, production-grade RAG system.
Objectives#
By completing this exam, you will demonstrate mastery of:
Implementing Semantic Chunking for intelligent document segmentation
Configuring HNSW Index for high-performance vector search
Building Hybrid Search combining BM25 and Vector Search with RRF fusion
Applying Query Transformation techniques (HyDE and Query Decomposition)
Implementing Post-Retrieval Processing with Cross-Encoder and MMR
Designing a GraphRAG architecture for relationship-aware retrieval
Problem Description#
Build an Enterprise RAG System named enterprise-rag-system that processes a collection of technical documents and provides accurate, contextual answers to user queries. The system must handle:
Technical documentation with code snippets, error codes, and specifications
Policy documents with stakeholder relationships and regulatory references
Product catalogs with model numbers, features, and comparisons
The system should intelligently route queries to the appropriate retrieval strategy and provide high-quality, diverse, and accurate results.
Assumptions#
You have access to sample documents (technical docs, policies, product specs) or will use provided sample data
OpenAI API key or compatible LLM endpoint is available
Neo4j database is available (local Docker or cloud instance)
Python 3.10+ environment with necessary packages installed
Basic understanding of all five RAG optimization techniques
Technical Requirements#
Environment Setup#
Python 3.10 or higher
Required packages:
- `langchain >= 0.1.0`
- `langchain-neo4j >= 0.1.0`
- `openai >= 1.0.0`
- `sentence-transformers >= 2.2.0`
- `chromadb >= 0.4.0` OR `qdrant-client >= 1.7.0`
- `rank-bm25 >= 0.2.2`
- `pydantic >= 2.0.0`
- `neo4j >= 5.0.0`
Infrastructure#
Vector Database: ChromaDB or Qdrant with HNSW indexing
Graph Database: Neo4j (Docker recommended)
Embedding Model: `text-embedding-3-small` or `all-MiniLM-L6-v2`
Cross-Encoder: `cross-encoder/ms-marco-MiniLM-L-6-v2`
LLM: GPT-4 or equivalent
Tasks#
Task 1: Advanced Indexing Pipeline (20 points)#
Time Allocation: 45 minutes
Implement an intelligent document indexing pipeline that preserves semantic coherence.
Requirements:#
Semantic Chunking Implementation
Build a chunker that splits documents based on semantic similarity between sentences
Configure similarity threshold (0.7-0.85) and chunk size limits
Handle edge cases: code blocks, tables, lists, short documents
HNSW Index Configuration
Set up vector database with HNSW indexing
Configure optimal parameters: `M=32`, `ef_construction=200`, `ef_search=100`
Document the trade-offs for your chosen configuration
Indexing Pipeline
Process at least 20 documents through the pipeline
Store metadata (source, chunk_id, document_type) with each vector
Implement batch processing for efficiency
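The sentence-splitting logic above can be sketched in a few lines. This is a toy sketch: `jaccard_similarity` (word overlap) stands in for cosine similarity over `sentence-transformers` embeddings, so the threshold here is far lower than the 0.7-0.85 range you would use with real embeddings.

```python
from typing import Callable, List

def jaccard_similarity(a: str, b: str) -> float:
    """Toy similarity: word overlap between two sentences (embedding stand-in)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def semantic_chunk(
    sentences: List[str],
    similarity: Callable[[str, str], float] = jaccard_similarity,
    threshold: float = 0.2,
    max_sentences: int = 10,
) -> List[List[str]]:
    """Start a new chunk when adjacent sentences drop below the similarity
    threshold, or when the current chunk hits its size limit."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for sent in sentences:
        if current and (
            similarity(current[-1], sent) < threshold
            or len(current) >= max_sentences
        ):
            chunks.append(current)
            current = []
        current.append(sent)
    if current:
        chunks.append(current)
    return chunks
```

Swapping in an embedding-based similarity function (and shielding code blocks and tables from splitting) turns this into the required chunker.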
Deliverables:#
`indexing/semantic_chunker.py`
`indexing/vector_store.py`
Indexed document collection with metadata
Task 2: Hybrid Search Implementation (20 points)#
Time Allocation: 45 minutes
Build a hybrid retrieval system that combines keyword and semantic search.
Requirements:#
BM25 Retriever
Implement BM25 indexing for all document chunks
Proper tokenization with case normalization and punctuation handling
Return top-K results with BM25 scores
Hybrid Search with RRF
Execute both BM25 and Vector Search in parallel
Implement RRF fusion: `RRF(d) = Σ 1/(60 + rank(d))`
Handle documents appearing in only one result list
Query Router
Analyze query to determine optimal search strategy
Route keyword-heavy queries to prioritize BM25
Route semantic queries to prioritize Vector Search
Use Hybrid Search as default
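The RRF formula above takes only a few lines to implement. This sketch fuses plain ranked lists of document IDs; in the exam, those lists would come from the BM25 retriever and the vector store.

```python
from collections import defaultdict
from typing import Dict, List

def rrf_fuse(result_lists: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    A document missing from one list simply contributes nothing for that
    list, which naturally handles results appearing in only one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, a document ranked #2 by BM25 and #1 by vector search outscores one ranked #1 by BM25 alone.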
Deliverables:#
`retrieval/bm25_retriever.py`
`retrieval/hybrid_search.py`
`retrieval/query_router.py`
Task 3: Query Transformation Layer (15 points)#
Time Allocation: 35 minutes
Implement query transformation to handle vague and complex queries.
Requirements:#
HyDE Implementation
Generate hypothetical answer paragraphs using LLM
Use hypothetical answer embedding for retrieval
Design domain-appropriate generation prompts
Query Decomposition
Detect multi-part questions requiring information from multiple sources
Generate independent sub-queries for parallel retrieval
Aggregate results from all sub-queries
Transformation Router
Classify queries: simple, vague (use HyDE), complex (use Decomposition)
Apply appropriate transformation before retrieval
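A rule-based version of the transformation router might look like the sketch below. The specific signals (question-mark count, conjunctions, query length) are illustrative assumptions; a production router could use an LLM classifier instead.

```python
def classify_query(query: str) -> str:
    """Heuristic classifier for the transformation router.

    - "complex": multi-part questions -> Query Decomposition
    - "vague": short, underspecified queries -> HyDE
    - "simple": everything else -> direct retrieval
    """
    q = query.lower()
    multi_part_markers = (" and ", " also ", " as well as ", " compare ")
    if q.count("?") > 1 or any(m in q for m in multi_part_markers):
        return "complex"
    if len(q.split()) <= 4:
        return "vague"
    return "simple"
```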
Deliverables:#
`transformation/hyde.py`
`transformation/query_decomposition.py`
`transformation/transformation_router.py`
Task 4: Post-Retrieval Processing (15 points)#
Time Allocation: 35 minutes
Implement re-ranking and diversity optimization for retrieved results.
Requirements:#
Cross-Encoder Re-ranking
Retrieve top-50 candidates with Bi-Encoder
Re-rank using Cross-Encoder (`cross-encoder/ms-marco-MiniLM-L-6-v2`)
Return top-10 re-ranked results
MMR for Diversity
Implement MMR algorithm with configurable λ parameter
Default λ=0.5 for balanced relevance/diversity
Ensure diverse information coverage in final results
Configurable Pipeline
Support both orders: Cross-Encoder → MMR and MMR → Cross-Encoder
Allow configuration of k values at each stage
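The MMR algorithm itself is framework-independent and worth understanding in isolation. The sketch below works on raw vectors with plain-Python cosine similarity; in the exam pipeline the vectors would be your stored embeddings.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 10,
    lam: float = 0.5,
) -> List[int]:
    """Maximal Marginal Relevance: greedily pick documents balancing relevance
    to the query (weight lam) against similarity to documents already selected
    (weight 1 - lam). Returns the indices of the selected documents."""
    candidates = list(range(len(doc_vecs)))
    selected: List[int] = []
    while candidates and len(selected) < k:
        best, best_score = candidates[0], -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected
```

With λ=1.0 this reduces to plain relevance ranking; lowering λ pushes near-duplicate documents down the list.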
Deliverables:#
`post_retrieval/cross_encoder_reranker.py`
`post_retrieval/mmr.py`
`post_retrieval/post_retrieval_pipeline.py`
Task 5: GraphRAG Integration (20 points)#
Time Allocation: 50 minutes
Build a knowledge graph for relationship-aware retrieval.
Requirements:#
Entity Extraction
Define Pydantic models for domain entities (Policy, Stakeholder, Product, Regulation, etc.)
Extract entities and relationships using LLM with structured output
Validate extracted data against schema
Knowledge Graph Construction
Populate Neo4j with extracted entities and relationships
Use MERGE to prevent duplicates
Create appropriate indexes for query performance
Graph-Aware Retrieval
Implement natural language to Cypher translation
Support relationship traversal queries
Combine graph results with vector search results
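For the MERGE requirement, it helps to build parameterized Cypher rather than interpolating values into the query string. The helper below is an illustrative sketch (execution through the `neo4j` driver is omitted): it upserts an entity by a key property so that re-running extraction never creates duplicates.

```python
def merge_entity_query(label: str, key: str, props: dict) -> tuple:
    """Build a parameterized Cypher MERGE statement for an entity upsert."""
    # Guard against Cypher injection via label/key, since those cannot be
    # passed as query parameters.
    if not label.isidentifier() or not key.isidentifier():
        raise ValueError("label and key must be valid identifiers")
    set_clause = ", ".join(f"e.{k} = ${k}" for k in props if k != key)
    query = f"MERGE (e:{label} {{{key}: ${key}}})"
    if set_clause:
        query += f" SET {set_clause}"
    return query, props
```

The returned query and parameter dict can be tested first in the Neo4j Browser, as the hints suggest, before wiring them into the driver.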
Deliverables:#
`graph/entity_models.py`
`graph/entity_extractor.py`
`graph/knowledge_graph.py`
`graph/graph_retriever.py`
Task 6: Integration and Orchestration (10 points)#
Time Allocation: 30 minutes
Integrate all components into a unified RAG system.
Requirements:#
Unified Query Pipeline
Accept user query as input
Apply query classification and routing
Execute appropriate retrieval strategy
Apply post-retrieval processing
Generate final answer using LLM
Configuration Management
Externalize all configurable parameters
Support different modes: fast (less accurate), accurate (slower), balanced
Error Handling and Logging
Graceful degradation if a component fails
Structured logging for debugging and monitoring
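One simple way to satisfy the mode requirement is a preset table that the pipeline reads at startup. The parameter names below (`top_k`, `use_reranker`, `use_hyde`) are hypothetical placeholders for whatever knobs your pipeline exposes.

```python
# Hypothetical presets: trade retrieval depth and re-ranking for latency.
MODES = {
    "fast":     {"top_k": 5,  "use_reranker": False, "use_hyde": False},
    "balanced": {"top_k": 20, "use_reranker": True,  "use_hyde": False},
    "accurate": {"top_k": 50, "use_reranker": True,  "use_hyde": True},
}

def load_mode(name: str = "balanced") -> dict:
    """Return the preset for a mode, falling back to 'balanced' for unknown
    names (graceful degradation rather than a hard failure)."""
    return MODES.get(name, MODES["balanced"])
```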
Deliverables:#
`main.py` or `enterprise_rag.py`
`config.py` or `config.yaml`
`README.md` with setup and usage instructions
Questions to Answer#
Include written answers to these questions in your README.md or a separate ANSWERS.md file:
Architecture Decision: Explain why you chose your specific HNSW parameters and how they balance speed vs. accuracy for this use case.
Hybrid Search Trade-offs: Describe a scenario where Hybrid Search significantly outperforms pure Vector Search, and explain why.
Query Transformation Selection: How does your system decide when to use HyDE vs. Query Decomposition? What signals does it look for?
Re-ranking Strategy: Why did you choose your specific order of Cross-Encoder and MMR? What would change if the use case prioritized diversity over precision?
GraphRAG Value: Provide an example query that your GraphRAG component can answer that would be impossible or very difficult with vector search alone.
Submission Rules#
Required Deliverables#
Complete source code organized in the specified directory structure
`README.md` with:
Setup instructions (dependencies, environment variables, database setup)
Usage examples for different query types
Architecture diagram (can be text-based)
`ANSWERS.md` with written responses to the 5 questions
`docker-compose.yml` for Neo4j and any other services
Sample queries demonstrating each component's functionality
Screenshots or logs showing successful execution
Submission Checklist#
All code runs without errors
Semantic Chunking preserves document semantics
HNSW index is properly configured and benchmarked
Hybrid Search correctly combines BM25 and Vector results
Query Transformation handles vague and complex queries
Cross-Encoder improves ranking precision
MMR ensures result diversity
GraphRAG answers relationship queries
All components are integrated in unified pipeline
Documentation is complete and clear
Grading Rubrics#
| Criterion | Weight | Excellent (90-100%) | Good (70-89%) | Satisfactory (50-69%) | Needs Improvement (<50%) |
|---|---|---|---|---|---|
| Advanced Indexing | 20% | Semantic chunking preserves context perfectly; HNSW optimally configured with benchmarks | Chunking works with minor issues; HNSW configured but not optimized | Basic chunking implemented; HNSW uses default parameters | Chunking breaks context; HNSW not implemented |
| Hybrid Search | 20% | BM25 and RRF perfectly implemented; Query router makes intelligent decisions | Hybrid search works; Router has some misclassifications | Basic hybrid search; No query routing | Hybrid search not functional |
| Query Transformation | 15% | HyDE and Decomposition both work excellently; Smart routing between them | Both techniques work; Routing is rule-based | One technique works; No routing | Neither technique functional |
| Post-Retrieval | 15% | Cross-Encoder significantly improves precision; MMR provides diverse results | Both components work; Measurable improvement | One component works | Neither component functional |
| GraphRAG | 20% | Complete entity extraction; Rich graph; Answers complex relationship queries | Graph populated; Basic queries work | Partial graph; Limited queries | Graph not functional |
| Integration | 10% | Seamless pipeline; Excellent error handling; Clean configuration | Components integrated; Some rough edges | Partial integration | Components not connected |
Estimated Time#
| Task | Time Allocation |
|---|---|
| Task 1: Advanced Indexing | 45 minutes |
| Task 2: Hybrid Search | 45 minutes |
| Task 3: Query Transformation | 35 minutes |
| Task 4: Post-Retrieval | 35 minutes |
| Task 5: GraphRAG | 50 minutes |
| Task 6: Integration | 30 minutes |
| Total | 240 minutes (4 hours) |
Hints#
General Tips:
Start by setting up the infrastructure (Neo4j, Vector DB) before writing code
Test each component independently before integration
Use the companion notebooks from assignments as references
Cache LLM responses during development to save API costs
Component-Specific Tips:
For Semantic Chunking: Use `sentence-transformers` for efficient similarity calculation
For HNSW: Prioritize `ef_search` tuning for query-time optimization
For BM25: Use `nltk.word_tokenize()` for consistent tokenization
For HyDE: The hypothetical answer doesn't need to be factually correct
For Cross-Encoder: Batch processing significantly improves throughput
For GraphRAG: Test Cypher queries in Neo4j Browser before implementing in code
Notes#
You can use your implementation from the previous assignment lab as a starting point.
LangGraph and Agentic AI Project Exam#
Final Project Exam: FPT Customer Chatbot - Multi-Agent AI System#
Overview#
| Field | Value |
|---|---|
| Course | LangGraph and Agentic AI |
| Project Name | |
| Duration | 360 minutes (6 hours) |
| Passing Score | 70% |
| Total Points | 100 |
| Framework | Python 3.10+, LangGraph, LangChain, Tavily API, FAISS, OpenAI |
Description#
You have been hired as an AI Engineer at FPT Software, tasked with building a Multi-Agent Customer Service Chatbot AI Core that demonstrates mastery of all concepts covered in the LangGraph and Agentic AI module.
This final project consolidates all five assignments into a single comprehensive multi-agent system:
Assignment 01: LangGraph Foundations & State Management
Assignment 02: Multi-Expert ReAct Research Agent
Assignment 03: Tool Calling & Tavily Search Integration
Assignment 04: FPT Customer Chatbot - Multi-Agent System
Assignment 05: Human-in-the-Loop & Persistence
You will build the AI Core for an FPT Customer Chatbot with hierarchical multi-agent architecture, real-time web search, human approval workflows, response caching, and persistent state management.
This exam focuses purely on the AI/LangGraph logic. For the Engineering layer (FastAPI, database, REST APIs), please refer to the Building Monolith API with FastAPI module's final exam.
Objectives#
By completing this exam, you will demonstrate mastery of:
State Management: Implementing messages-centric patterns with TypedDict and add_messages reducer
ReAct Pattern: Building reasoning + acting loops with iteration control
Tool Calling: Integrating external APIs (Tavily) with parallel execution
Multi-Agent Architecture: Designing hierarchical systems with specialized agents
Human-in-the-Loop: Implementing interrupt patterns for user confirmation
Persistence: Configuring checkpointers for long-running conversations
Caching: Building vector store-based response caching with FAISS
Problem Description#
Build the AI Core for an FPT Customer Service Chatbot named fpt-customer-chatbot-ai that includes:
| Agent | Responsibilities |
|---|---|
| Primary Assistant | Routes user queries to appropriate specialized agents |
| FAQ Agent | Answers FPT policy questions using RAG with cached responses |
| Ticket Agent | Handles ticket-related conversations with HITL approval (mock tools) |
| Booking Agent | Handles booking conversations with HITL confirmation (mock tools) |
| IT Support Agent | Troubleshoots technical issues using Tavily search + caching |
The system must:
Maintain conversation context across multiple turns
Require human confirmation before sensitive operations
Cache responses for similar queries
Persist state across process restarts
Handle agent transitions gracefully with dialog stack
The Ticket and Booking agents will use mock tools that simulate database operations. The actual database integration is covered in the FastAPI module exam.
Prerequisites#
Completed all 5 module assignments (recommended)
OpenAI API key (`OPENAI_API_KEY`)
Tavily API key (`TAVILY_API_KEY`)
Python 3.10+ with virtual environment
Familiarity with Pydantic for schema validation
Technical Requirements#
Environment Setup#
Python 3.10 or higher
Required packages:
- `langgraph >= 0.2.0`
- `langchain >= 0.1.0`
- `langchain-openai >= 0.1.0`
- `langchain-community >= 0.1.0`
- `tavily-python >= 0.3.0`
- `faiss-cpu >= 1.7.0`
- `sentence-transformers >= 2.2.0`
- `pydantic >= 2.0.0`
Mock Data Models#
For testing purposes, define the following Pydantic models (actual database integration is in FastAPI module):
Ticket Model:
| Field | Type | Constraints |
|---|---|---|
| ticket_id | str | Auto-generated UUID |
| content | str | Required |
| description | str \| None | Optional |
| customer_name | str | Required |
| customer_phone | str | Required |
| | str \| None | Optional |
| status | TicketStatus | Pending/InProgress/Resolved/Canceled |
| created_at | datetime | Auto-set |
Booking Model:
| Field | Type | Constraints |
|---|---|---|
| booking_id | str | Auto-generated UUID |
| reason | str | Required |
| time | datetime | Required, must be future |
| customer_name | str | Required |
| customer_phone | str | Required |
| | str \| None | Optional |
| note | str \| None | Optional |
| status | BookingStatus | Scheduled/Finished/Canceled |
Tasks#
Task 1: State Management Foundation (15 points)#
Time Allocation: 60 minutes
Build the core state management infrastructure for the multi-agent system.
Requirements:#
Define AgenticState using TypedDict with:
`messages`: Uses the `Annotated[List[AnyMessage], add_messages]` pattern
`dialog_state`: Stack for tracking agent hierarchy
`user_id`, `email` (optional): Context injection fields
`conversation_id`: Session tracking
Implement dialog stack functions:
`update_dialog_stack(left, right)`: Push/pop agent transitions
`pop_dialog_state(state)`: Return to Primary Assistant
Create context injection that auto-populates user info into tool calls
Configure MemorySaver checkpointer for initial development
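One common way to implement the stack reducer (mirroring LangGraph's customer-support examples) is a pure function that pushes a new agent name, pops on the sentinel `"pop"`, and leaves the stack alone on `None`:

```python
def update_dialog_stack(left: list, right) -> list:
    """Reducer for dialog_state: push an agent name, pop on "pop",
    or keep the stack unchanged when right is None."""
    if right is None:
        return left
    if right == "pop":
        return left[:-1]
    return left + [right]
```

In the state definition this would be attached via `Annotated[list[str], update_dialog_stack]`, so any node can signal a push or pop by returning a `dialog_state` value.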
Deliverables:#
`state/agent_state.py` - State definition with all fields
`state/dialog_stack.py` - Stack management functions
`state/context_injection.py` - User context injection logic
Task 2: Specialized Agents Implementation (25 points)#
Time Allocation: 120 minutes
Implement all four specialized agents with their tools and schemas.
Requirements:#
Ticket Support Agent (8 points):
Define Pydantic schemas: `CreateTicket`, `TrackTicket`, `UpdateTicket`, `CancelTicket`
Implement mock tools that simulate CRUD operations (return success messages, store in memory dict)
Status transitions: Pending → InProgress → Resolved (or Canceled)
Add `CompleteOrEscalate` tool for returning to Primary Assistant
Tools should accept and validate all required fields
Booking Agent (7 points):
Define Pydantic schemas with time validation (must be future)
Implement mock tools: `BookRoom`, `TrackBooking`, `UpdateBooking`, `CancelBooking`
Status transitions: Scheduled → Finished (or Canceled)
Include `CompleteOrEscalate` tool
IT Support Agent (5 points):
Integrate Tavily Search with `max_results=5`, `search_depth="advanced"`
Return practical troubleshooting guides from reliable sources
Include `CompleteOrEscalate` tool
FAQ Agent (5 points):
Implement simple RAG for FPT policy questions
Return answers with source references
Include `CompleteOrEscalate` tool
Mock tools should use an in-memory dictionary to store data for testing. This allows the AI system to function independently without database dependencies. The actual database integration will be handled in the FastAPI module exam.
Example mock implementation pattern:
```python
import uuid

from langchain_core.tools import tool

# In-memory storage for testing
_ticket_store: dict[str, dict] = {}

@tool
def create_ticket(content: str, customer_name: str, customer_phone: str, ...) -> str:
    """Create a new support ticket."""
    ticket_id = str(uuid.uuid4())
    _ticket_store[ticket_id] = {...}
    return f"Ticket created successfully with ID: {ticket_id}"
```
Deliverables:#
`agents/ticket_agent.py` - Ticket Support Agent with mock tools
`agents/booking_agent.py` - Booking Agent with mock tools
`agents/it_support_agent.py` - IT Support Agent with Tavily
`agents/faq_agent.py` - FAQ Agent with RAG
`schemas/` directory with all Pydantic models
Task 3: Primary Assistant & Graph Construction (20 points)#
Time Allocation: 90 minutes
Build the Primary Assistant and construct the complete multi-agent graph.
Requirements:#
Define routing tools for Primary Assistant:
`ToTicketAssistant`: Route ticket-related queries
`ToBookingAssistant`: Route booking-related queries
`ToITAssistant`: Route technical issues
`ToFAQAssistant`: Route policy questions
Include user context injection in all routing tools
Implement entry nodes for agent transitions:
Create `create_entry_node(assistant_name)` factory function
Entry nodes push new agent to `dialog_state` stack
Generate appropriate welcome message
Build StateGraph with:
Primary Assistant as entry point
All specialized agent nodes
ToolNode for each agentβs tools
Conditional routing based on intent
Edge handling for `CompleteOrEscalate`
Create `tool_node_with_fallback` for graceful error handling
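The conditional entry routing can be driven entirely by the dialog stack. A minimal sketch, assuming node names match the agents in this project:

```python
def route_to_workflow(state: dict) -> str:
    """Resume with whichever agent is on top of the dialog stack; fall back
    to the Primary Assistant for a fresh conversation."""
    dialog_state = state.get("dialog_state") or []
    return dialog_state[-1] if dialog_state else "primary_assistant"
```

Passed to `add_conditional_edges`, this ensures a resumed thread lands back in the specialized agent the user was talking to.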
Deliverables:#
`agents/primary_assistant.py` - Primary Assistant with routing
`graph/entry_nodes.py` - Entry node factory function
`graph/builder.py` - Complete graph construction
`graph/routing.py` - Conditional routing logic
Graph visualization PNG using `get_graph().draw_mermaid_png()`
Task 4: Human-in-the-Loop Confirmation (20 points)#
Time Allocation: 90 minutes
Implement interrupt patterns for sensitive operations.
Requirements:#
Configure `interrupt_before` for sensitive tools:
All ticket creation/update/cancel operations
All booking creation/update/cancel operations
NOT for read operations (track) or search operations
Implement confirmation flow:
Detect pending tool state via `graph.get_state(config)`
Generate human-readable confirmation message
Parse user response: "y" to continue, anything else to cancel
Create confirmation message generator:
Extract tool name and arguments from pending state
Format readable summary for user review
Include clear instructions for approval/rejection
Handle user responses:
"y" or "yes": Resume execution with `app.invoke(None, config)`
Other: Update state to cancel operation and return message
Log all confirmation decisions
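The message generator and response parsing are plain string handling; a minimal sketch:

```python
def format_confirmation(tool_name: str, tool_args: dict) -> str:
    """Render a pending tool call as a human-readable approval prompt."""
    lines = [f"The assistant wants to run: {tool_name}"]
    for key, value in tool_args.items():
        lines.append(f"  - {key}: {value}")
    lines.append('Type "y" to approve, or anything else to cancel.')
    return "\n".join(lines)

def is_approved(user_reply: str) -> bool:
    """Treat "y"/"yes" (any casing, surrounding whitespace) as approval."""
    return user_reply.strip().lower() in {"y", "yes"}
```

The tool name and arguments would be read from the pending tool call in `graph.get_state(config)`.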
Deliverables:#
`hitl/interrupt_config.py` - List of sensitive tools
`hitl/confirmation.py` - Confirmation flow logic
`hitl/message_generator.py` - Human-readable message formatting
Task 5: Response Caching with FAISS (10 points)#
Time Allocation: 60 minutes
Implement vector store-based caching for RAG and IT Support responses.
Requirements:#
Create cache_tool that:
Stores all RAG and IT Support responses in FAISS vectorstore
Indexes by query embedding using `sentence-transformers`
Stores metadata: timestamp, query_type, source_agent
Implement cache lookup in orchestrator:
Before calling RAG/IT tools, check cache for similar queries
Use similarity threshold (0.85) to determine cache hit
Return cached response if found, otherwise proceed to tool
Add cache management:
TTL-based invalidation (24 hours)
Manual cache clear capability
Cache statistics logging (hits, misses, hit rate)
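The cache logic can be prototyped without FAISS at all. In the sketch below, a bag-of-words set stands in for a real embedding and a linear scan stands in for a FAISS index; swapping in `sentence-transformers` vectors and a FAISS similarity search yields the required implementation.

```python
import time

def bow_embed(text: str) -> set:
    """Toy 'embedding': a bag of words (stand-in for a sentence vector)."""
    return set(text.lower().split())

def bow_similarity(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

class SimilarityCache:
    """Query-similarity cache with TTL expiry and hit/miss statistics."""

    def __init__(self, threshold: float = 0.85, ttl_seconds: float = 24 * 3600):
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.entries = []  # (embedding, response, stored_at)
        self.hits = 0
        self.misses = 0

    def get(self, query: str, now: float = None):
        now = time.time() if now is None else now
        emb = bow_embed(query)
        for stored_emb, response, stored_at in self.entries:
            fresh = now - stored_at <= self.ttl
            if fresh and bow_similarity(emb, stored_emb) >= self.threshold:
                self.hits += 1
                return response
        self.misses += 1
        return None

    def put(self, query: str, response: str, now: float = None) -> None:
        now = time.time() if now is None else now
        self.entries.append((bow_embed(query), response, now))
```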
Deliverables:#
`cache/faiss_cache.py` - FAISS caching implementation
`cache/cache_manager.py` - Cache management and TTL logic
`cache/cache_stats.py` - Statistics tracking
Task 6: Persistence & Production Readiness (10 points)#
Time Allocation: 60 minutes
Configure persistent state and production-ready error handling.
Requirements:#
Replace MemorySaver with SQLiteSaver:
Configure persistent storage in `checkpoints.db`
Test conversation resumption after process restart
Document the migration path to PostgresSaver
Implement thread management:
List active threads
View checkpoint history for a thread
Delete old threads (cleanup)
Add error handling and logging:
Structured logging with conversation context
Graceful error recovery for tool failures
User-friendly error messages
Deliverables:#
`persistence/checkpointer.py` - SQLiteSaver configuration
`persistence/thread_manager.py` - Thread management utilities
`utils/logging.py` - Structured logging setup
`utils/error_handler.py` - Error handling utilities
Test Scenarios#
Complete these test scenarios to demonstrate system functionality:
Scenario 1: Multi-Agent Conversation Flow#
User: "Hi, I need help with a few things"
→ Primary Assistant welcomes user
User: "My laptop won't connect to WiFi"
→ Routes to IT Support Agent
→ Tavily search for troubleshooting
→ Cache response
→ Return to Primary Assistant
User: "I need to book a meeting room for tomorrow 2pm"
→ Routes to Booking Agent
→ Shows confirmation prompt (HITL)
→ User confirms "y"
→ Booking created
→ Return to Primary Assistant
Scenario 2: HITL Rejection Flow#
User: "Create a support ticket for broken monitor"
→ Routes to Ticket Agent
→ Shows confirmation prompt
→ User rejects with "no, wait"
→ Operation cancelled
→ Agent asks for clarification
Scenario 3: Cache Hit Flow#
User: "How do I reset my password?" (first time)
→ FAQ Agent answers from RAG
→ Response cached
User: "Password reset instructions?" (similar query)
→ Cache hit detected (similarity > 0.85)
→ Return cached response
Scenario 4: Persistence Test#
1. Start conversation, create a ticket
2. Stop the process
3. Restart with same thread_id
4. Verify conversation history retained
5. Track the created ticket
Questions to Answer#
Include written responses to these questions in ANSWERS.md:
State Management: Explain why the `add_messages` reducer is essential for multi-turn conversations. What problems would occur without it?
Multi-Agent Architecture: Compare the dialog stack approach vs. flat routing. When would you choose one over the other?
Human-in-the-Loop Trade-offs: What are the UX implications of requiring confirmation for every sensitive action? How would you balance security vs. user experience?
Caching Strategy: How would you handle cache invalidation when the underlying FAQ documents are updated? Propose a solution.
Production Considerations: What additional features would you add before deploying this system to production? Consider: monitoring, scaling, security.
Submission Requirements#
Directory Structure#
```
fpt-customer-chatbot-ai/
├── agents/
│   ├── primary_assistant.py
│   ├── ticket_agent.py
│   ├── booking_agent.py
│   ├── it_support_agent.py
│   └── faq_agent.py
├── schemas/
│   ├── ticket_schemas.py
│   └── booking_schemas.py
├── state/
│   ├── agent_state.py
│   ├── dialog_stack.py
│   └── context_injection.py
├── tools/
│   ├── ticket_tools.py    # Mock tools for ticket operations
│   ├── booking_tools.py   # Mock tools for booking operations
│   └── mock_store.py      # In-memory storage for testing
├── graph/
│   ├── builder.py
│   ├── entry_nodes.py
│   └── routing.py
├── hitl/
│   ├── interrupt_config.py
│   ├── confirmation.py
│   └── message_generator.py
├── cache/
│   ├── faiss_cache.py
│   ├── cache_manager.py
│   └── cache_stats.py
├── persistence/
│   ├── checkpointer.py
│   └── thread_manager.py
├── utils/
│   ├── logging.py
│   └── error_handler.py
├── data/
│   └── fpt_policies.txt (or .json)
├── main.py
├── requirements.txt
├── README.md
├── ANSWERS.md
└── graph_visualization.png
```
This AI core is designed to be integrated with the FastAPI backend from the Building Monolith API with FastAPI module. The mock tools in tools/ directory can be replaced with actual database operations when integrating.
Required Deliverables#
Complete source code following directory structure
`README.md` with:
Setup instructions (environment, API keys, dependencies)
Usage examples and CLI commands
Architecture diagram or explanation
Notes on how to integrate with FastAPI backend
`ANSWERS.md` with written responses to all 5 questions
`requirements.txt` with all dependencies
`graph_visualization.png` - Multi-agent graph visualization
Demo video or screenshots showing:
All four agent flows working
HITL confirmation workflow
Cache hit scenario
Persistence across restart
Submission Checklist#
All code runs without errors
All four specialized agents functional with mock tools
Primary Assistant routes correctly
HITL confirmation works for sensitive operations
Cache stores and retrieves responses
SQLiteSaver enables conversation persistence
Dialog stack tracks agent hierarchy
Context injection auto-populates user info
All test scenarios pass
Documentation is complete
Evaluation Criteria#
| Criteria | Points | Excellent (100%) | Good (75%) | Needs Improvement (50%) |
|---|---|---|---|---|
| State Management (Task 1) | 15 | Perfect messages pattern, dialog stack, injection | Working but minor issues in context handling | Basic state only, missing stack or injection |
| Specialized Agents (Task 2) | 25 | All agents with complete tools and validation | Most agents working, some validation missing | Only 1-2 agents functional |
| Graph Construction (Task 3) | 20 | Complete graph with all routing and fallbacks | Graph works but missing error handling | Basic graph without proper routing |
| Human-in-the-Loop (Task 4) | 20 | Smooth confirmation UX with proper state handling | HITL works but UX needs improvement | Basic interrupt without proper messaging |
| Response Caching (Task 5) | 10 | Full caching with TTL and statistics | Caching works but missing TTL or stats | Basic storage without similarity search |
| Persistence & Production (Task 6) | 10 | SQLite with thread management and error handling | Persistence works but limited management | MemorySaver only, no persistence |
| Total | 100 | | | |
Hints#
Use `state["messages"][-1]` to access the most recent message
The `add_messages` reducer handles message deduplication automatically
Store `dialog_state` as a list for stack operations (append/pop)
Use `ToolNode(tools).with_fallbacks([...])` for graceful error handling
The `CompleteOrEscalate` tool should return a flag that routing can detect
Entry nodes should push to stack, exit nodes should pop
Access pending state with `app.get_state(config).next` to see which node is pending
Use `app.update_state(config, values)` to modify state before resuming
Consider timeout handling for user confirmation
Use `sentence-transformers/all-MiniLM-L6-v2` for consistent embeddings
Store original query and response as metadata, not just embedding
Implement cache warmup for common queries
SQLiteSaver requires a context manager: `with SqliteSaver.from_conn_string(...) as saver:`
Thread IDs should be user-meaningful (e.g., `user123-session1`)
Consider implementing session timeout (24h default)
LLMOps and Evaluation Project Exam#
Final Exam: Production-Ready RAG Evaluation System#
Overview#
| Field | Value |
|---|---|
| Course | LLMOps and Evaluation |
| Duration | 240 minutes (4 hours) |
| Passing Score | 70% |
| Total Points | 100 |
Description#
You have been hired as an MLOps Engineer at AI Solutions Corp., a company that builds enterprise AI assistants. Your task is to build a Production-Ready RAG Evaluation System that combines automated quality assessment, comprehensive observability, and rigorous architecture comparison.
The current system lacks:
Automated evaluation metrics to measure answer quality
Observability into LLM execution, costs, and latency
Data-driven architecture selection based on experiments
You must apply knowledge from RAGAS Evaluation Metrics, LLM Observability (LangFuse/LangSmith), and RAG Architecture Comparison to build a comprehensive evaluation and monitoring platform.
Objectives#
By completing this exam, you will demonstrate mastery of:
Implementing RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision, Context Recall)
Integrating LangFuse and/or LangSmith for comprehensive LLM tracing and cost tracking
Designing and executing RAG architecture experiments with scientific rigor
Building an end-to-end evaluation pipeline that combines all three components
Making data-driven architecture recommendations based on experimental results
Problem Description#
Build a Production-Ready RAG Evaluation System named rag-evaluation-platform that:
Evaluates RAG quality using RAGAS metrics on generated responses
Traces all LLM operations with full observability (tokens, costs, latency)
Compares multiple RAG architectures systematically
Produces actionable reports for architecture selection
The system should serve as a complete toolkit for evaluating, monitoring, and optimizing RAG systems in production.
Assumptions#
You have completed the assignments on RAGAS, Observability, and Experiment Comparison
OpenAI API key or compatible LLM endpoint is available
LangFuse Cloud account OR local Docker setup for self-hosted LangFuse
LangSmith account (free tier)
Python 3.10+ environment with necessary packages installed
Sample documents and test questions are provided or created
Technical Requirements#
Environment Setup#
Python 3.10 or higher
Required packages:
- `ragas >= 0.1.0`
- `langfuse >= 2.0.0`
- `langchain >= 0.1.0`
- `langchain-openai >= 0.0.5`
- `openai >= 1.0.0`
- `chromadb >= 0.4.0` OR `qdrant-client >= 1.7.0`
- `sentence-transformers >= 2.2.0`
- `pandas >= 2.0.0`
- `matplotlib >= 3.7.0`
Infrastructure#
Vector Database: ChromaDB or Qdrant
Observability: LangFuse (required) + LangSmith (optional)
Embedding Model:
text-embedding-3-smallor equivalentLLM: GPT-4 or equivalent
Tasks#
Task 1: RAGAS Evaluation Pipeline (25 points)#
Time Allocation: 60 minutes
Build a comprehensive evaluation pipeline using all four RAGAS metrics.
Requirements:#
Implement RAGAS Evaluation Module
Create functions to calculate Faithfulness, Answer Relevancy, Context Precision, and Context Recall
Support batch evaluation on datasets
Handle edge cases (empty contexts, very short answers)
Create Evaluation Dataset
Prepare at least 30 test questions with ground truth answers
Categorize questions: Factual (40%), Relational (30%), Multi-hop (20%), Analytical (10%)
Include retrieved contexts for each question
Run Evaluation
Execute evaluation on the complete dataset
Calculate aggregate statistics (mean, std, min, max)
Identify failure cases (scores < 0.5)
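The aggregate-statistics step can be sketched without any RAGAS dependency: given the per-question scores for one metric, compute the summary and flag failures below the 0.5 threshold. The function name and return shape here are illustrative, not part of the RAGAS API:

```python
import statistics

def summarize_metric(name, scores, failure_threshold=0.5):
    """Aggregate one RAGAS metric across the dataset and flag failure cases."""
    return {
        "metric": name,
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
        # Indices of questions whose score falls below the failure threshold
        "failures": [i for i, s in enumerate(scores) if s < failure_threshold],
    }
```

Running this once per metric gives you the mean/std/min/max table and the failure list required above in one pass.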
Deliverables:#
evaluation/ragas_evaluator.py - Core evaluation logic
evaluation/dataset.py - Dataset loading and preparation
data/test_questions.json - Test dataset with ground truth
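One possible schema for an entry in data/test_questions.json, shown here built and serialized in Python. The field names and the sample values are an assumption for illustration, not a prescribed format:

```python
import json

# Hypothetical schema for one test question; the concrete values are placeholders.
entry = {
    "question": "What does error code E-4012 indicate?",
    "ground_truth": "E-4012 indicates an expired API token.",
    "category": "factual",  # factual | relational | multi_hop | analytical
    "contexts": ["...retrieved chunks for this question go here..."],
}
print(json.dumps(entry, indent=2))
```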
Task 2: LLM Observability Integration (25 points)#
Time Allocation: 60 minutes
Implement comprehensive tracing and monitoring for all LLM operations.
Requirements:#
LangFuse Integration
Configure LangFuse SDK with proper authentication
Implement CallbackHandler for all LangChain operations
Capture: input/output, token counts, latency, costs
Cost Tracking Dashboard
Track token usage per query
Calculate costs based on model pricing
Generate cost breakdown reports
Production Best Practices
Implement configurable sampling (100% dev, 5% prod)
Add PII masking for sensitive data
Create correlation IDs for request tracking
(Bonus) LangSmith Integration
Configure auto-tracing via environment variables
Demonstrate Playground debugging for a failed trace
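The cost tracking and PII masking pieces can each be sketched in a few lines. The pricing table below uses hypothetical placeholder numbers (not current provider pricing), and the regex covers only email addresses; a real masker would handle more PII classes:

```python
import re

# Hypothetical per-1K-token prices in USD; replace with your provider's pricing.
PRICING = {"gpt-4o-mini": {"input": 0.00015, "output": 0.0006}}

def estimate_cost(model, input_tokens, output_tokens, pricing=PRICING):
    """Estimate the USD cost of one LLM call from its token counts."""
    p = pricing[model]
    return input_tokens / 1000 * p["input"] + output_tokens / 1000 * p["output"]

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text):
    """Replace email addresses with a placeholder before text is sent to tracing."""
    return EMAIL_RE.sub("<EMAIL>", text)
```

Summing estimate_cost over all traced calls gives the per-query cost breakdown; mask_pii should run before any input/output is attached to a trace.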
Deliverables:#
observability/langfuse_handler.py - LangFuse integration
observability/cost_tracker.py - Cost calculation logic
observability/pii_masker.py - PII handling
Screenshots of the LangFuse dashboard with traces
Task 3: RAG Architecture Comparison (25 points)#
Time Allocation: 60 minutes
Design and execute a rigorous experiment comparing multiple RAG architectures.
Requirements:#
Implement Two RAG Architectures
Naive RAG: Fixed chunking, Top-K retrieval, direct generation
Advanced RAG: Semantic chunking, hybrid search, re-ranking
Run Comparative Experiments
Execute both architectures on the same test dataset
Capture all RAGAS metrics for each architecture
Track latency and cost per query
Performance Analysis
Break down performance by question category
Calculate statistical significance of differences
Create visualizations (bar charts, tables)
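For the statistical-significance requirement, one stdlib-only approach is a paired bootstrap over per-question metric differences between the two architectures (the function name and defaults are illustrative):

```python
import random
import statistics

def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=42):
    """Bootstrap a confidence interval for the mean per-question difference
    between two architectures evaluated on the same test dataset."""
    assert len(scores_a) == len(scores_b), "paired comparison needs equal lengths"
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement
        sample = [rng.choice(diffs) for _ in diffs]
        boot_means.append(statistics.mean(sample))
    boot_means.sort()
    lo = boot_means[int(alpha / 2 * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.mean(diffs), (lo, hi)
```

If the interval excludes zero, the advanced architecture's improvement on that metric is unlikely to be noise.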
Deliverables:#
architectures/naive_rag.py - Naive RAG implementation
architectures/advanced_rag.py - Advanced RAG implementation
experiments/runner.py - Experiment execution
results/comparison_table.md - Results summary
Task 4: Integrated Evaluation Platform (25 points)#
Time Allocation: 60 minutes
Combine all components into a unified evaluation platform.
Requirements:#
End-to-End Pipeline
Single entry point to run complete evaluation
Automatic tracing of all operations
Configurable architecture selection
Comprehensive Reporting
Generate evaluation report with all metrics
Include observability insights (cost, latency distribution)
Architecture comparison summary
Actionable recommendations
CLI Interface
python evaluate.py --architecture naive --dataset data/test.json --output results/
python evaluate.py --architecture advanced --dataset data/test.json --output results/
python compare.py --results-dir results/ --output comparison_report.md
Answer Key Questions
Which architecture should be used for production and why?
What is the cost-quality trade-off between architectures?
What are the top 3 failure patterns, and how would you address them?
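A minimal argparse skeleton matching the evaluate.py interface above might look like this (the flag names come from the commands shown; everything else is illustrative):

```python
import argparse

def build_parser():
    """CLI skeleton for the evaluate.py entry point described above."""
    parser = argparse.ArgumentParser(description="Run a RAG evaluation pipeline")
    parser.add_argument("--architecture", choices=["naive", "advanced"],
                        required=True, help="Which RAG architecture to evaluate")
    parser.add_argument("--dataset", required=True,
                        help="Path to the test dataset JSON")
    parser.add_argument("--output", default="results/",
                        help="Directory for reports and raw results")
    return parser

# Example: parse a known argument list instead of sys.argv
args = build_parser().parse_args(
    ["--architecture", "naive", "--dataset", "data/test.json"]
)
```

compare.py can reuse the same pattern with --results-dir and --output flags.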
Deliverables:#
evaluate.py - Main evaluation script
compare.py - Architecture comparison script
reports/evaluation_report.md - Complete evaluation report
ANSWERS.md - Written responses to key questions
Questions to Answer#
Include written responses to these questions in ANSWERS.md:
RAGAS Interpretation: Analyze your Faithfulness and Answer Relevancy scores. What do low scores indicate about your RAG system, and how would you improve them?
Observability Value: How did LangFuse/LangSmith tracing help you identify issues in your RAG pipeline? Provide a specific example.
Architecture Decision: Based on your experiments, which RAG architecture would you recommend for a customer support chatbot vs. a legal document Q&A system? Justify with data.
Cost Optimization: If you had to reduce costs by 50% while maintaining 90% of quality, what strategies would you employ? Reference your experimental results.
Production Readiness: What additional monitoring, alerting, or evaluation would you add before deploying this system to production?
Submission Requirements#
Required Deliverables#
Complete source code organized in the specified directory structure
README.md with:
Setup instructions (dependencies, API keys, observability setup)
Usage examples for CLI commands
Architecture diagram of the evaluation platform
ANSWERS.md with written responses to the 5 questions
Test dataset with at least 30 categorized questions
Results tables and visualizations
Screenshots of observability dashboards
Submission Checklist#
All code runs without errors
RAGAS evaluation produces valid scores for all metrics
LangFuse traces are captured and visible in dashboard
Both RAG architectures are implemented and evaluated
Comparison report includes statistical analysis
All questions answered with data-backed reasoning
Evaluation Criteria#
| Criteria | Weight | Excellent (90-100%) | Good (70-89%) | Needs Improvement (50-69%) | Unsatisfactory (<50%) |
|---|---|---|---|---|---|
| RAGAS Evaluation | 25% | All 4 metrics implemented correctly; comprehensive dataset; insightful failure analysis | Metrics implemented; adequate dataset; basic analysis | Partial metrics; small dataset; minimal analysis | Missing metrics; no dataset |
| Observability | 25% | Full LangFuse integration; cost tracking; PII handling; production best practices | LangFuse working; basic cost tracking; some best practices | Partial tracing; no cost tracking | No observability integration |
| Architecture Comparison | 25% | Both architectures implemented; rigorous experiments; statistical analysis; visualizations | Both architectures; experiments run; basic comparison | One architecture; limited experiments | No architecture comparison |
| Integration & Reporting | 15% | Seamless pipeline; comprehensive reports; CLI interface; actionable insights | Components integrated; adequate reports | Partial integration; basic reports | Components not connected |
| Code Quality & Documentation | 10% | Clean code; comprehensive docs; clear README; well-organized | Readable code; adequate docs | Messy code; minimal docs | Poor quality; no docs |
Estimated Time#
| Task | Time Allocation |
|---|---|
| Task 1: RAGAS Evaluation Pipeline | 60 minutes |
| Task 2: LLM Observability Integration | 60 minutes |
| Task 3: RAG Architecture Comparison | 60 minutes |
| Task 4: Integrated Evaluation Platform | 60 minutes |
| Total | 240 minutes (4 hours) |
Hints#
Task 1 - RAGAS:
Use the companion notebook 10_RAG_Evaluation_with_Ragas.ipynb as a reference
Start with a small dataset (10 questions) to verify your pipeline before scaling up
For claim decomposition in Faithfulness, consider using GPT-4 for accuracy
Task 2 - Observability:
Set up LangFuse first; its explicit callback handlers make the tracing flow easier to understand
Use environment variables to switch between dev (100% tracing) and prod (5% sampling) modes
Test PII masking with fake data before using real sensitive information
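The dev/prod sampling hint can be implemented with a single environment variable. A minimal sketch (the variable name TRACE_SAMPLE_RATE is an assumption for this project, not a LangFuse setting):

```python
import os
import random

def should_trace(default_rate=1.0):
    """Decide whether to trace this request.

    Set TRACE_SAMPLE_RATE=1.0 in dev (trace everything) and e.g. 0.05 in prod.
    """
    rate = float(os.environ.get("TRACE_SAMPLE_RATE", default_rate))
    return random.random() < rate
```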
Task 3 - Experiments:
Use the same embedding model for both architectures to ensure fair comparison
Run each query multiple times if measuring latency to account for variance
Calculate confidence intervals when comparing metric differences
Task 4 - Integration:
Use Python's argparse or click library for CLI implementation
Generate markdown reports that can be easily shared with stakeholders
Include both quantitative metrics and qualitative insights in recommendations