Advanced Retrieval Strategies#

In the previous section, we explored ways to optimize data storage. However, no matter how well data is organized, if the search method is inappropriate, the returned results may still be inaccurate.

In the basic RAG architecture, we rely entirely on Vector Search. Although powerful in capturing semantics, this method reveals weaknesses when encountering queries requiring absolute accuracy in wording, such as searching for proper names, product codes, or rare technical terms.

To overcome this, the modern strategy is to combine the power of semantic search with traditional keyword search, creating a Hybrid Search mechanism.

Limitations of Vector Search and the Role of BM25#

Let’s consider a simple example: A user wants to find information about Error 503 Service Unavailable.

  • Vector Search: Tends to find paragraphs with equivalent meanings like “server overloaded”, “network connection failed”. Sometimes, it misses the exact number “503” because in vector space, these numbers may not carry much specific semantic meaning.

  • Keyword Search: Will find exactly documents containing the character strings “Error” and “503”. This is exactly when we need keyword frequency-based search algorithms, the most representative of which is BM25.

Best Matching 25 (BM25) Algorithm#

BM25 is considered the gold standard in the field of classical information retrieval. Although Vector Search models are currently dominant, BM25 still plays an irreplaceable role in Hybrid Search systems thanks to its ability to match precise keywords.

This is a refined upgrade of TF-IDF, designed to overcome the disadvantages regarding keyword spam and document length.

Mechanism of BM25

BM25 scores document relevance based on 3 core statistical principles:

  1. TF Saturation (Term Frequency Saturation): In TF-IDF, if a word appears more often, the score increases linearly infinitely, e.g., appearing 100 times scores 100 times higher.

    Comparison of score graphs where TF-IDF increases forever, while BM25 asymptotes to a limit Figure 4: Comparison of score graphs where TF-IDF increases forever, while BM25 asymptotes to a limit.

    BM25 Improvement: Applies saturation mechanism. If the keyword appears a 2nd, 3rd time, the score increases significantly. But if it has already appeared 100 times, appearing the 101st time hardly adds more score, as in Figure 4. This helps prevent keyword spamming.

  2. IDF (Inverse Document Frequency): Heavily penalizes common words and greatly rewards rare words. Example: In the query “symptoms of meningitis”, the word “meningitis” is many times more important than the word “of”.

  3. Length Normalization: This is where BM25 is much smarter. If a keyword appears once in a short paragraph, it is rated higher than appearing once in a long novel because information is diluted.

Illustrative Example: Why need ‘Saturation’?

Suppose a user searches for the keyword: “Galaxy”.

We have two documents:

  • Doc A: “Galaxy Galaxy Galaxy Galaxy…” (Repeated 1000 times to spam to get to the top).

  • Doc B: “Samsung just launched the new Galaxy phone…” (The word Galaxy appears 5 times in a reasonable context).

Scoring:

  • TF-IDF: Score(A) >> Score(B). Doc A wins overwhelmingly because it has 1000 keywords. → Skewed result.

  • BM25: Due to the saturation mechanism, after about 10-20 occurrences, Doc A’s score does not increase further. Meanwhile, Doc B gets bonus points thanks to reasonable document length.

→ BM25 gives a fairer assessment, eliminating the advantage of keyword spamming.

Reciprocal Rank Fusion (RRF)#

The problem when performing Hybrid Search is that the scoring scales of the two algorithms are completely different:

  • Vector Search usually uses cosine similarity, scores are in the range [0, 1].

  • BM25 uses a statistical formula, scores can be any positive number.

We cannot directly add these scores together. To solve this, we use the Reciprocal Rank Fusion (RRF) algorithm.

Principle

Instead of caring about scores, RRF only cares about rank. It assumes that if a document appears at a high rank in both lists, it is certainly an important document. The RRF score formula for a document \(d\):

\[ RRF(d) = \sum_{r \in R} \frac{1}{k + rank_r(d)} \]

Where:

  • \(rank_i(d)\): Is the rank of document \(d\) in list \(i\), starting from 1.

  • \(k\): Is a smoothing constant, usually chosen as \(k = 60\). This constant helps reduce score disparity between very high ranks, e.g., Top 1 vs Top 2, ensuring fairness.

Illustrative Example: Power of RRF

Suppose constant \(k = 60\). We consider two documents with different ranks in 2 search engines:

  • Doc A: Rank 1 in Vector because very semantically correct but Rank 10 in BM25 because of missing exact keywords.

  • Doc B: Rank 2 in Vector and Rank 3 in BM25, quite good on both sides.

1. Calculate Score for Doc A:

\[ Score(A) = \frac{1}{60 + 1} + \frac{1}{60 + 10} = \frac{1}{61} + \frac{1}{70} \approx 0.0164 + 0.0143 = 0.0307 \]

2. Calculate Score for Doc B:

\[ Score(B) = \frac{1}{60 + 2} + \frac{1}{60 + 3} = \frac{1}{62} + \frac{1}{63} \approx 0.0161 + 0.0159 = 0.0320 \]

Result: Doc B ranks above Doc A (0.0320 > 0.0307).

Although Doc A is Top 1 in Vector Search, RRF prioritizes Doc B because it has high consensus from both algorithms. RRF helps the system choose the most harmonious document instead of just trusting one side.

Method

Pros

Cons

Vector Search

Understands deep semantics, finds synonyms, multi-lingual.

Poor at matching exact keywords, hard to explain results.

BM25

Matches exact keywords, effective with specialized domains, fast speed.

Does not understand context, fails if user uses different synonyms than the text.

Hybrid Search

Balanced: Ensures important keywords are not missed while still understanding context.

More complex to deploy, consumes more resources due to running 2 parallel streams.