Module 1 · AI
📖 4 min read · By HungHM15

Assignment: Hybrid Search#

Assignment Metadata#

| Field | Description |
|-------|-------------|
| Assignment Name | Hybrid Search with BM25 and Reciprocal Rank Fusion |
| Course | RAG and Optimization |
| Project Name | hybrid-search-rag |
| Estimated Time | 90 minutes |
| Framework | Python 3.10+, LangChain, rank-bm25, Sentence-Transformers, ChromaDB |


Learning Objectives#

By completing this assignment, you will be able to:

  • Implement BM25 keyword search alongside vector-based semantic search

  • Apply Reciprocal Rank Fusion (RRF) to merge results from multiple retrievers

  • Compare the effectiveness of Vector Search, BM25, and Hybrid Search

  • Configure the fusion parameters to optimize retrieval quality

  • Analyze scenarios where Hybrid Search outperforms single-method approaches


Problem Description#

Your RAG system currently relies solely on Vector Search for retrieval. While this works well for semantic queries, users report poor results when searching for:

  • Specific error codes (e.g., “Error 503 Service Unavailable”)

  • Product SKUs and model numbers

  • Technical terms and acronyms

  • Proper names and exact phrases

Your task is to implement a Hybrid Search system that combines BM25 keyword matching with Vector Search, using RRF to merge the results.


Technical Requirements#

Environment Setup#

  • Python 3.10 or higher

  • Required packages:

    • langchain >= 0.1.0

    • rank-bm25 >= 0.2.2

    • sentence-transformers >= 2.2.0

    • chromadb >= 0.4.0

    • nltk >= 3.8.0 (for tokenization)
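With those minimum versions, a working environment can be set up in one step (a sketch; pin exact versions in your own requirements file in practice):

```shell
pip install "langchain>=0.1.0" "rank-bm25>=0.2.2" \
  "sentence-transformers>=2.2.0" "chromadb>=0.4.0" "nltk>=3.8.0"
# nltk.word_tokenize needs the 'punkt' tokenizer data
# (newer NLTK releases may also require 'punkt_tab'):
python -c "import nltk; nltk.download('punkt')"
```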

Dataset#

Prepare a dataset that includes documents with:

  • Technical specifications with codes/numbers

  • Natural language descriptions

  • Mixed content (code snippets, prose, tables)

  • At least 100 documents for meaningful comparison


Tasks#

Task 1: Implement BM25 Retriever (25 points)#

  1. Build a BM25 retriever that:

    • Tokenizes documents properly (handle punctuation, case normalization)

    • Indexes all documents in your corpus

    • Returns top-K documents with BM25 scores

  2. Test with keyword-heavy queries:

    • Create at least 5 queries containing specific codes, numbers, or technical terms

    • Verify that BM25 correctly retrieves documents with exact keyword matches

Task 2: Implement Hybrid Search with RRF (35 points)#

  1. Create a Hybrid Retriever that:

    • Executes both BM25 and Vector Search in parallel

    • Implements the RRF score RRF(d) = Σ 1/(k + rankᵢ(d)), summed over every retriever i whose ranked list contains document d (ranks are 1-based)

    • Uses configurable k constant (default: 60)

    • Returns merged and re-ranked results

  2. Handle edge cases:

    • Documents appearing in only one result list

    • Ties in RRF scores

    • Empty results from one retriever
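The fusion step itself needs no external libraries. A sketch (the retriever outputs and document IDs are illustrative) that also handles the edge cases listed above: a document in only one list contributes a single term, score ties are broken deterministically, and an empty list simply adds nothing:

```python
# Reciprocal Rank Fusion: RRF(d) = sum over result lists of 1/(k + rank(d)),
# with 1-based ranks. Documents missing from a list contribute nothing for it.
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:                      # an empty list is a no-op
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties deterministically by document ID.
    return sorted(scores.items(), key=lambda p: (-p[1], p[0]))

vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_d"]        # doc_d appears in only one list
print(rrf_fuse([vector_hits, bm25_hits]))  # doc_c (in both lists) ranks first
```

Note that doc_c wins despite being ranked third by the vector retriever, because it accumulates a term from both lists; this is exactly the behavior that makes RRF useful for merging keyword and semantic results.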

Task 3: Comparative Evaluation (40 points)#

  1. Create a test set with 20 queries categorized as:

    • Keyword queries (5): Exact matches, codes, identifiers

    • Semantic queries (5): Conceptual questions, synonyms

    • Hybrid queries (10): Mix of keywords and semantic intent

  2. Evaluate each retrieval method (Vector, BM25, Hybrid):

    • Precision@5: Proportion of relevant documents in top 5

    • Recall@10: Proportion of all relevant documents retrieved in top 10

    • Mean Reciprocal Rank (MRR): Average of 1/rank of first relevant result

  3. Create a comparison table showing:

| Query Type | Method | Precision@5 | Recall@10 | MRR |
|------------|--------|-------------|-----------|-----|
| Keyword  | Vector | | | |
| Keyword  | BM25   | | | |
| Keyword  | Hybrid | | | |
| Semantic | Vector | | | |
| Semantic | BM25   | | | |
| Semantic | Hybrid | | | |
| Hybrid   | Vector | | | |
| Hybrid   | BM25   | | | |
| Hybrid   | Hybrid | | | |
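The three metrics can be computed in a few lines of plain Python, assuming each query has a hand-labelled set of relevant document IDs (the IDs below are illustrative); MRR for a query category is then the mean of `reciprocal_rank` over its queries:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents that appear in the top k.
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    # 1/rank of the first relevant document, or 0.0 if none was retrieved.
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d3", "d1", "d7", "d2", "d9"]   # illustrative ranking
relevant = {"d1", "d2", "d5"}                # illustrative ground truth
print(precision_at_k(retrieved, relevant, 5))  # 2 relevant in top 5 -> 0.4
print(reciprocal_rank(retrieved, relevant))    # first hit at rank 2 -> 0.5
```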


Submission Requirements#

Required Deliverables#

  • Source code (Jupyter notebook or Python scripts)

  • README.md with setup and usage instructions

  • Evaluation results table (as shown above)

  • Analysis document explaining when each method excels

  • Screenshots showing example queries and retrieved documents

Submission Checklist#

  • BM25 retriever correctly matches keywords

  • RRF fusion produces valid merged rankings

  • Evaluation covers all three query types

  • Code is well-documented with comments

  • Analysis includes specific examples


Evaluation Criteria#

| Criteria | Points |
|----------|--------|
| BM25 implementation correctness | 15 |
| Tokenization and preprocessing | 10 |
| RRF implementation accuracy | 25 |
| Hybrid retriever edge case handling | 10 |
| Evaluation methodology | 15 |
| Comparative analysis quality | 15 |
| Code quality and documentation | 10 |
| **Total** | **100** |


Hints#

  • The rank-bm25 library provides a ready-made BM25 implementation

  • Use nltk.word_tokenize() for consistent tokenization

  • Test RRF with small examples first to verify your formula

  • Consider using the companion notebook 02-hybrid-search-rag.ipynb as reference

  • For the evaluation, manually label at least the top 10 results per query as relevant/not relevant



Copyright © 2025-2026 FSOFT.FHN.NGT AI Vanguard team.