Vector Index Introduction
This page introduces how vector indexes make finding similar data fast and scalable.
When you're building systems that use vector embeddings (for example, NLP embeddings or image features), efficiently finding "nearest neighbors" matters. As datasets grow, naive comparisons become costly. That's where vector indexes come in.
Vector indexes were developed to solve this problem by organizing data in ways that allow faster, often approximate, retrieval.
This page provides an overview of four common index types: brute force (flat), hierarchical navigable small worlds (HNSW), inverted file (IVF), and inverted file with product quantization (IVFPQ).
Brute force (flat) index overview
The flat index is the simplest and most intuitive approach: every query is compared directly with every vector in the dataset. There is no clustering, graph structure, or compression—just straightforward distance computations across all vectors.
The big advantage of this method is that results are exact: the query returns the true nearest neighbors, not an approximation.
How it works
Given a query vector q, loop through all vectors vᵢ, compute a similarity or distance metric (for example, cosine similarity or Euclidean distance), and pick the top k closest.
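To make this concrete, here is a minimal NumPy sketch of exhaustive search. The dataset size, dimensionality, and the choice of Euclidean distance are arbitrary assumptions for the example:

```python
import numpy as np

def brute_force_search(query, vectors, k):
    """Exhaustive k-nearest-neighbor search by Euclidean distance."""
    # Compare the query against every stored vector (no index structure).
    distances = np.linalg.norm(vectors - query, axis=1)
    # The k smallest distances are the exact nearest neighbors.
    return np.argsort(distances)[:k]

vectors = np.random.rand(10_000, 64).astype("float32")  # toy dataset
query = np.random.rand(64).astype("float32")
print(brute_force_search(query, vectors, k=5))
```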
When to use
Flat indexing is especially effective in certain scenarios, such as:
- Low-dimensional vectors, typically under a few dozen dimensions, where it delivers fast and accurate results without the need for complex structures.
- Smaller databases, where the number of vectors is limited and exhaustive search remains computationally feasible.
- Simple queries that don't require advanced filtering or ranking, where flat indexing offers a clean and efficient solution.
- Environments where new data is ingested frequently or in real time, since vectors can be added without the overhead of rebuilding index structures.
- Light query volume, where exhaustive search can easily keep up without introducing latency or requiring additional optimization.
Refer to the flat index parameters for details on the available APIs.
Hierarchical navigable small worlds (HNSW) overview
HNSW introduces a more sophisticated structure by building a graph of vectors. The main strength of HNSW lies in its balance between speed and accuracy. It can achieve recall levels close to brute force but with faster query times. The downside is that the graph requires extra memory.
HNSW is often the best choice when datasets are large, queries must be answered in real time, and a small margin of approximation is acceptable.
How it works
At a high level, this is how HNSW works:
- When you insert vectors, they are placed in multiple layers; each vector has connections to other vectors based on proximity.
- At query time, you start at the top layer (sparser, long-range links) and navigate greedily to the area close to the query, then drop to lower layers for more refined search.
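As an illustration, here is a minimal sketch using the open-source FAISS library, which implements HNSW. The parameter names (M, efConstruction, efSearch) and values are FAISS-specific assumptions; the APIs referenced on this page may differ:

```python
import numpy as np
import faiss  # open-source library that implements HNSW

d = 128
data = np.random.rand(10_000, d).astype("float32")

# M = 32: each node links to up to 32 neighbors per layer (graph connectivity).
index = faiss.IndexHNSWFlat(d, 32)
index.hnsw.efConstruction = 200  # search breadth while building the graph
index.add(data)

index.hnsw.efSearch = 64  # search breadth at query time (recall/speed knob)
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)  # approximate top-5 neighbors
```

Raising efSearch improves recall at the cost of latency, which is the central trade-off this structure lets you tune.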
When to use
HNSW is well-suited for:
- Large-scale datasets where brute force becomes too slow to be practical.
- High recall requirements where near-exact results are preferred, even if some approximation is acceptable.
- Real-time or low-latency applications that demand fast query responses.
- Systems with sufficient memory, as the graph structure used by HNSW can be memory-intensive.
Refer to the HNSW index parameters for details on the available APIs.
Inverted file (IVF) overview
The inverted file (IVF) index takes a different approach, inspired by clustering. Instead of storing vectors in a single list, IVF partitions the vector space into clusters (often using k-means). Each cluster corresponds to an "inverted list", and vectors are assigned to the cluster centroid closest to them. At query time, the algorithm finds the cluster(s) most relevant to the query and searches only within those lists.
This selective search dramatically reduces the number of vectors to compare against, leading to faster results. The trade-off is that if a true nearest neighbor lies outside the searched clusters, it may be missed. Both the quality of the clustering and the number of clusters examined per query determine the balance between speed and accuracy.
IVF works well for very large, mostly static datasets where approximation is acceptable and preprocessing (like clustering) can be done ahead of time.
How it works
At a high level, this is how IVF works:
- Preprocess: cluster the vectors into, say, n centroids ("inverted lists"). Each vector is mapped to its nearest centroid; the inverted list for each centroid holds those vectors.
- Query time: map the query to its nearest centroid(s), and search only within those inverted lists (not the whole dataset).
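For example, here is a minimal FAISS sketch of this two-step flow; the nlist and nprobe values are arbitrary assumptions for the example:

```python
import numpy as np
import faiss  # open-source library that implements IVF

d = 128
data = np.random.rand(100_000, d).astype("float32")

nlist = 1024  # number of clusters / inverted lists
quantizer = faiss.IndexFlatL2(d)  # assigns vectors to their nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)

index.train(data)  # k-means clustering to learn the centroids
index.add(data)    # each vector goes into its nearest inverted list

index.nprobe = 8   # number of inverted lists scanned per query
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)  # approximate top-5 neighbors
```

Increasing nprobe scans more lists per query, trading speed for a lower chance of missing a true nearest neighbor.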
When to use
IVF indexing is a good fit for:
- Extremely large datasets, typically tens of millions of vectors or more.
- Applications that tolerate approximation, allowing faster search with slightly reduced accuracy.
- Stable data environments, where clustering can be precomputed and doesn't need frequent updates.
- Limited memory or compute resources, where scanning the entire dataset would be too costly.
Refer to the IVF index parameters for details on the available APIs.
Inverted file product quantization (IVFPQ) overview
IVFPQ builds on IVF by adding product quantization (PQ), which compresses vectors into smaller codes. After clustering, each vector is split into subvectors, and each subvector is quantized against a small codebook. Instead of storing full vectors, you store these compact codes, which drastically reduces storage requirements. At query time, comparisons can be done approximately on the compressed codes, making search both faster and more memory-efficient.
The result is an index that scales to hundreds of millions or even billions of vectors without requiring enormous amounts of memory.
The trade-off is accuracy: quantization introduces error, and results are not as precise as those from IVF or HNSW. Building an IVFPQ index is also more complex, requiring choices about the number of clusters, subquantizers, and codebook sizes. It is particularly well-suited to very large datasets where storage and throughput are more critical than exact accuracy.
How it works
At a high level, this is how IVFPQ works:
- First: cluster vectors into inverted lists, as in IVF.
- Then: split each vector into subvectors (for example, by dividing its dimensions into contiguous blocks) and quantize each subvector by mapping it to the nearest centroid in a small codebook. The codebook entries give a compact representation of the original vector.
- At query time: map the query to the relevant clusters/inverted lists, and compare using the compressed PQ representations, which reduces the cost of distance computations.
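As a sketch, the FAISS version of this index adds two PQ parameters on top of IVF; all values below are arbitrary assumptions for the example:

```python
import numpy as np
import faiss  # open-source library that implements IVFPQ

d = 128
data = np.random.rand(100_000, d).astype("float32")

nlist = 1024  # inverted lists, as in IVF
m = 16        # subvectors per vector (must divide d evenly)
nbits = 8     # bits per subvector code: 2**8 = 256 centroids per codebook

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index.train(data)  # learns both the IVF centroids and the PQ codebooks
index.add(data)    # stores 16-byte codes instead of 512-byte raw vectors

index.nprobe = 8
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)
```

In this configuration each 128-dimensional float32 vector (512 bytes) is stored as a 16-byte code, a 32x reduction, which is what lets the index scale to billions of vectors.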
When to use
IVFPQ is a good option in the following scenarios:
- Massive-scale datasets, ranging from hundreds of millions to billions of vectors.
- Storage-constrained environments, where storing full-precision vectors isn't practical.
- Use cases that allow some loss in accuracy, where finding "good enough" nearest neighbors is acceptable.
- High query volume with limited compute, where fast responses are needed without heavy processing.
Refer to the IVFPQ index parameters for details on the available APIs.
Choosing the right index
Each of these indexing methods exists because no single approach fits all use cases.
- Use brute force (flat) when datasets are small or exact accuracy is mandatory.
- Choose HNSW for latency-sensitive applications that need fast queries over large datasets.
- Use IVF when you need to balance speed and accuracy for massive but mostly static collections.
- Use IVFPQ for extreme scale, where compression and throughput outweigh recall.
In practice, your choice depends on the nature of your dataset, your tolerance for approximation, and your performance requirements. Developers often experiment with multiple approaches, tuning hyperparameters such as the number of clusters in IVF or the graph connectivity in HNSW, to find the best configuration.
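As an illustration of such an experiment, the sketch below measures how the recall of a FAISS IVF index changes with nprobe, using exact brute-force results as ground truth. Dataset sizes and parameter values are arbitrary assumptions:

```python
import numpy as np
import faiss

d, k = 128, 10
data = np.random.rand(100_000, d).astype("float32")
queries = np.random.rand(100, d).astype("float32")

# Ground truth from an exact (flat) index.
exact = faiss.IndexFlatL2(d)
exact.add(data)
_, true_ids = exact.search(queries, k)

coarse = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(coarse, d, 1024)
index.train(data)
index.add(data)

# Sweep nprobe: scanning more lists raises recall but slows queries.
for nprobe in (1, 4, 16, 64):
    index.nprobe = nprobe
    _, ids = index.search(queries, k)
    recall = np.mean([len(set(a) & set(b)) / k
                      for a, b in zip(ids, true_ids)])
    print(f"nprobe={nprobe}: recall@{k}={recall:.2f}")
```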
Next steps
- Learn more about the available APIs: review the AI Libraries reference section.
- Follow the AI Libraries tutorials to get more familiar with these vector indexes.