Search Algorithms - Hybrid Search and BM25 Overview
This page introduces background concepts such as tokenization, sparse vectors, and dense vectors, and explains how Best Matching 25 (BM25) and hybrid search work in practice.
Search systems are at the core of many modern applications, from document retrieval to recommendation engines and chat-based assistants. Building effective search means representing text in a way that computers can compare and rank results quickly.
Two important algorithms in this space are Best Matching 25 (BM25), a classic keyword-based ranking function, and Hybrid Search, a modern strategy that combines BM25 with dense vector similarity.
Background concepts: Tokenization and vector representations
Before diving into retrieval systems, let's understand how raw text is processed by a computer.
Tokenization
Tokenization is the process of splitting text into smaller units (tokens). Each token can then be mapped to a numerical representation, such as an integer id or a learnable high-dimensional vector.
For BM25, this usually means words or word stems. For example, "hybrid search combines approaches" can become ["hybrid", "search", "combin", "approach"].
For embeddings, tokenization often uses subword units or byte-pair encoding so that even rare or unseen words can be represented consistently.
In modern large language models, tokenization is the first step in converting text into a numerical form. A tokenizer maps a sentence into a sequence of integer tokens, each representing a unit of text. For example, the sentence might be encoded as:
```python
from collections import Counter
from transformers import BertTokenizerFast

# Load BERT's pretrained WordPiece tokenizer
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Count how many times each token id appears in the encoded sentence
Counter(tokenizer("The cow jumped over the moon")['input_ids'])
```
The result is a sparse vector of token-id counts, in which the special tokens 101 ([CLS]) and 102 ([SEP]) delimit the start and end of the text:
```
{1996: 2, 101: 1, 11190: 1, 5598: 1, 2058: 1, 4231: 1, 102: 1}
```
Types of tokenization include the following, illustrated in the sketch after the list:
- Word tokenization, which breaks the text into individual words
- Character tokenization, which splits the text into individual characters
- Subword tokenization, which segments the text into units smaller than words but larger than characters
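To make the distinction concrete, here is a minimal sketch comparing the three approaches on a single word. The WordPiece split shown in the last comment is indicative; the exact pieces depend on the tokenizer's vocabulary.
```python
from transformers import BertTokenizerFast

text = "tokenization"

# Word tokenization: split on whitespace, one token per word
print(text.split())    # ['tokenization']

# Character tokenization: one token per character
print(list(text))      # ['t', 'o', 'k', ...]

# Subword tokenization: BERT's WordPiece splits rarer words into pieces
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
print(tokenizer.tokenize(text))    # e.g. ['token', '##ization']
```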
Several libraries and algorithms are commonly used for tokenization in Natural Language Processing (NLP), each suited to different use cases:
- NLTK: A classic Python toolkit offering simple word and sentence tokenization, great for learning and prototyping.
- spaCy: A fast, production-ready library with support for multiple languages, often used in large-scale applications.
- BERT tokenizer: Designed for transformer models, this tokenizer handles context and language nuances, making it useful for advanced NLP tasks.
- Byte-Pair Encoding (BPE): Breaks text into frequent subword units, effective for handling complex or morphologically rich languages.
- SentencePiece: An unsupervised tokenizer that works across languages, producing subword units suitable for neural models.
Sparse vectors
Traditional methods like BM25 rely on sparse vectors. Each dimension corresponds to a term in the vocabulary, and a document vector records how many times each term occurs. Since any given document uses only a small fraction of all possible words, most entries are zero; hence "sparse". Sparse vectors are excellent for exact keyword matching, but they don’t capture relationships between different words.
A basic example of a sparse vector is a bag-of-words representation of a sentence. Let's consider the sentence:
the cow jumped over the moon
Its bag of words is given by the count of each word:
{ "the":2 , "cow":1, "jumped":1, "over":1, "moon":1 }
Dense vectors
Embedding models generate dense vectors: fixed-length arrays of floating-point numbers in which every dimension contains information. These vectors are trained so that semantically similar texts — like "car" and "automobile" — are close in vector space, even if they share no tokens. Dense vectors allow semantic matching, capturing meaning beyond surface words.
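As an illustration, here is a minimal sketch using the sentence-transformers library; the model name all-MiniLM-L6-v2 is an assumption, and any sentence-embedding model would behave similarly:
```python
from sentence_transformers import SentenceTransformer

# Model choice is an assumption; any sentence-embedding model works here
model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(["car", "automobile", "banana"], normalize_embeddings=True)

# With unit-length vectors, cosine similarity is just a dot product:
# "car" vs "automobile" scores far higher than "car" vs "banana",
# even though the two words share no tokens.
print(vecs @ vecs.T)
```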
Together, sparse and dense representations highlight a key trade-off: sparse vectors capture precision in term overlap, while dense vectors capture meaning. Hybrid search leverages both.
Best matching 25 (BM25) algorithm
BM25 is one of the most widely used algorithms for ranking documents in keyword-based search. It extends TF-IDF (term frequency–inverse document frequency) with refinements that make results more robust and effective.
BM25 scores documents based on:
- Term frequency (TF): documents with more occurrences of a query term score higher, but with diminishing returns.
- Inverse document frequency (IDF): rare terms are given more weight, since they help distinguish documents.
- Document length normalization: longer documents are penalized slightly, to prevent them from being unfairly boosted just because they contain more words.
The result is a ranking that balances relevance, rarity, and document length. BM25 excels when queries and documents share vocabulary directly. However, it cannot recognize synonyms or semantic similarity: "physician" and "doctor" are invisible to BM25 unless both tokens occur explicitly.
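For illustration, here is a simplified sketch of BM25 scoring, not a production implementation; k1 and b are the usual free parameters, with typical defaults near 1.5 and 0.75:
```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with the classic BM25 formula."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                              # term frequency
        if tf == 0:
            continue
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # rare terms weigh more
        # Saturating TF, normalized by document length relative to the average
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [["hybrid", "search", "combines", "approaches"],
          ["keyword", "search"],
          ["dense", "vectors"]]
print(bm25_score(["hybrid", "search"], corpus[0], corpus))
```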
Refer to the bm25 parameters for details on the available APIs.
Hybrid search
Hybrid search addresses BM25’s limitations by combining sparse (keyword-based) and dense (embedding-based) representations. Instead of choosing one or the other, the system indexes documents in both forms:
- Lexical index (sparse): for BM25 keyword scoring.
- Vector index (dense): for semantic similarity using embeddings.
When a query arrives, the system runs both searches in parallel. Results are then merged using a fusion strategy, such as weighting normalized scores or blending ranked lists. This way, a document containing the exact query words and a document with a strong semantic match can both appear in the results.
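One widely used fusion strategy is reciprocal rank fusion (RRF), which needs only the rank of each document in each result list. A minimal sketch follows; the constant k = 60 is a conventional default:
```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of doc ids (best first) into one fused ranking."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc3", "doc1", "doc7"]      # lexical (sparse) ranking
vector_results = ["doc1", "doc5", "doc3"]    # semantic (dense) ranking
print(reciprocal_rank_fusion([bm25_results, vector_results]))
# doc1 and doc3 rise to the top because both searches rank them highly
```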
Hybrid search is highly flexible. For precise keyword queries, BM25 carries most of the weight; for vague or semantic queries, vector search contributes more. Fusion ensures neither signal is lost, giving a better user experience across a wide variety of query styles.
Refer to the hybrid search parameters for details on the available APIs.
Summary
- Tokenization transforms text into units that can be represented numerically.
- Sparse vectors (used by BM25) capture keyword overlap and precision.
- Dense vectors (from embeddings) capture semantic similarity and meaning.
- BM25 is a robust keyword ranking function, excellent for exact matches.
- Hybrid search merges BM25 with dense vector search to cover both precision and semantics.
By combining sparse and dense representations, hybrid search systems can serve a wider range of queries more effectively than either method alone.
Next steps
- Learn more about the available APIs for AI Libraries
- Follow the AI Libraries tutorials to get more familiar with these search algorithms.