What Vector Similarity Means
An embedding model (e.g. text-embedding-3-small, all-MiniLM-L6-v2) maps text into a dense float vector. Semantically similar text maps to nearby vectors in that space. "The cat sat on the mat" and "A feline rested on a rug" will be much closer to each other than to "Quarterly earnings exceed expectations."
"Nearby" can be defined multiple ways. The choice matters:
| Metric | Formula | Range | Use When |
|---|---|---|---|
| Cosine similarity | cos(θ) = (A·B) / (|A||B|) | [-1, 1] | Text embeddings — measures angle, ignores magnitude |
| Euclidean (L2) | √Σ(aᵢ−bᵢ)² | [0, ∞) | Unit-normalized vectors (cosine then equals L2) |
| Dot product | Σaᵢbᵢ | (-∞, ∞) | When magnitude encodes relevance score (retrieval models) |
| Inner product | Same as dot product | (-∞, ∞) | Recommendation systems with learned embedding magnitudes |
For most embedding models, normalize vectors to unit length first, then use dot product — it's equivalent to cosine similarity but implemented as a fast matrix multiply.
Brute-Force KNN in NumPy
For fewer than ~100,000 vectors, brute force is fast enough and exact:
import numpy as np
class BruteForceIndex:
def __init__(self):
self.vectors = [] # list of np.ndarray
self.metadata = [] # list of any
def add(self, vec: np.ndarray, meta):
self.vectors.append(vec / np.linalg.norm(vec)) # unit-normalize
self.metadata.append(meta)
def search(self, query: np.ndarray, k: int = 5):
q = query / np.linalg.norm(query)
matrix = np.stack(self.vectors) # [N, d]
scores = matrix @ q # [N] — dot product = cosine for unit vecs
top_k = np.argsort(scores)[::-1][:k]
return [(self.metadata[i], scores[i]) for i in top_k]Complexity: O(N·d) per query, where N is the number of vectors and d is the dimension. At 1M vectors of 1536 dimensions (float32), each query requires 6GB of data to be read — about 40ms on a modern NVMe. At 10M vectors, it's 400ms. At 100M vectors, brute force is unusable in an interactive application.
IVF: Inverted File Index
The first practical acceleration: cluster the vector space, then at query time only search nearby clusters.
- Offline: Run k-means on all vectors with
k = sqrt(N)clusters (e.g., 1,000 clusters for 1M vectors). Store each vector in its nearest cluster's inverted list. - Query time: Find the
nprobenearest cluster centroids to the query vector. Search only those clusters' inverted lists.
from sklearn.cluster import MiniBatchKMeans
import numpy as np
# Build: cluster assignment
kmeans = MiniBatchKMeans(n_clusters=1000, random_state=42)
kmeans.fit(all_vectors) # fit on corpus
assignments = kmeans.predict(all_vectors) # which cluster each vec belongs to
# Query: find nearest centroids, then search those lists only
nprobe = 10 # search 10 of 1000 clusters (1% of data)
centroid_scores = kmeans.cluster_centers_ @ query_vec
nearest_clusters = np.argsort(centroid_scores)[::-1][:nprobe]
candidates = [v for c in nearest_clusters for v in inverted_lists[c]]With nprobe=10 and 1,000 clusters, you search ~1% of the data. The recall tradeoff: if the nearest vectors span multiple cluster boundaries, some will be missed. Increase nprobe to improve recall at the cost of speed. FAISS's IndexIVFFlat is the production version of this.
HNSW: The Algorithm Behind Modern Vector Databases
Hierarchical Navigable Small World (HNSW) achieves ~99% recall at 10–100× the speed of brute force. It's based on skip-list data structures applied to vector space.
Structure: A multi-layer graph. Layer 0 contains all N vectors, fully connected with short-range edges. Layer 1 contains a random subset (~N/M), with longer-range edges. Layer 2 contains a further subset. The top layer has only a handful of nodes with very long-range connections.
Search:
- Enter at the single entry point node in the top layer.
- Greedily move to whichever neighbor is closest to the query.
- When local minimum found in this layer, descend to next layer and repeat.
- At layer 0, run a beam search with width
efSearch, collecting all visited nodes. - Return top-k from visited set.
Key parameters:
M— maximum edges per node (typically 16–64). Higher M → better recall, more memory, slower inserts.efConstruction— beam width during index building (typically 100–200). Higher → better graph quality, slower builds.efSearch— beam width at query time. Trade recall vs latency at runtime without rebuilding the index.
import hnswlib
index = hnswlib.Index(space="cosine", dim=384)
index.init_index(max_elements=100_000, ef_construction=200, M=16)
index.add_items(corpus_vectors, ids=range(len(corpus_vectors)))
index.set_ef(50) # efSearch — tune per query SLA
labels, distances = index.knn_query(query_vector, k=10)
# labels: [[4, 17, 2, ...]] — top-10 IDs
# distances: [[0.12, 0.15, 0.18, ...]] — cosine distancesThe Metadata Filtering Problem
Real applications need filtered vector search: "find the 10 most similar embeddings where category='legal' AND year>2023." This is harder than it looks:
- Post-filter: Retrieve top-100 by vector similarity, then filter. If only 0.1% of vectors match the filter, you might get 0 results from 100 candidates even though 50 matching vectors exist.
- Pre-filter: Build a subset index on matching documents, then search only that. Requires pre-building per-filter indexes or dynamic index construction — expensive.
- Filtered HNSW: Qdrant and Weaviate implement HNSW variants that skip non-matching nodes during graph traversal. This is the state of the art but requires the filter selectivity to be known at query time.
Choosing a Vector Database
| Database | Deployment | Index | Metadata Filtering | Free Tier |
|---|---|---|---|---|
| Pinecone | Managed cloud | HNSW + IVF hybrid | Yes, any field | 1 index, 2M vectors |
| Weaviate | Self-hosted or cloud | HNSW | Yes, GraphQL | Sandbox (14 days) |
| Qdrant | Self-hosted or cloud | HNSW | Yes, filtered HNSW | 1GB cloud free |
| Chroma | In-process or server | HNSW (hnswlib) | Yes, basic | Open source, free |
| pgvector | PostgreSQL extension | IVFFlat or HNSW | Full SQL | With any Postgres |
For prototypes and small datasets (<100k vectors), Chroma or pgvector running locally removes infrastructure complexity entirely. For production at scale, Qdrant's filtered HNSW and Pinecone's managed sharding are worth the cost.
Float32 vectors: N × d × 4 bytes. 1M vectors × 1536 dims = 6GB RAM just for vectors, before the HNSW graph structure (add ~2–4× overhead). To fit 100M vectors in memory, use product quantization to compress each vector from 6KB to ~64 bytes — a 96× reduction at ~5% recall cost.