What Vector Similarity Means

An embedding model (e.g. text-embedding-3-small, all-MiniLM-L6-v2) maps text into a dense float vector. Semantically similar text maps to nearby vectors in that space. "The cat sat on the mat" and "A feline rested on a rug" will be much closer to each other than to "Quarterly earnings exceed expectations."

"Nearby" can be defined multiple ways. The choice matters:

MetricFormulaRangeUse When
Cosine similaritycos(θ) = (A·B) / (|A||B|)[-1, 1]Text embeddings — measures angle, ignores magnitude
Euclidean (L2)√Σ(aᵢ−bᵢ)²[0, ∞)Unit-normalized vectors (cosine then equals L2)
Dot productΣaᵢbᵢ(-∞, ∞)When magnitude encodes relevance score (retrieval models)
Inner productSame as dot product(-∞, ∞)Recommendation systems with learned embedding magnitudes

For most embedding models, normalize vectors to unit length first, then use dot product — it's equivalent to cosine similarity but implemented as a fast matrix multiply.

Brute-Force KNN in NumPy

For fewer than ~100,000 vectors, brute force is fast enough and exact:

import numpy as np

class BruteForceIndex:
    def __init__(self):
        self.vectors = []      # list of np.ndarray
        self.metadata = []     # list of any

    def add(self, vec: np.ndarray, meta):
        self.vectors.append(vec / np.linalg.norm(vec))  # unit-normalize
        self.metadata.append(meta)

    def search(self, query: np.ndarray, k: int = 5):
        q = query / np.linalg.norm(query)
        matrix = np.stack(self.vectors)          # [N, d]
        scores = matrix @ q                      # [N] — dot product = cosine for unit vecs
        top_k = np.argsort(scores)[::-1][:k]
        return [(self.metadata[i], scores[i]) for i in top_k]

Complexity: O(N·d) per query, where N is the number of vectors and d is the dimension. At 1M vectors of 1536 dimensions (float32), each query requires 6GB of data to be read — about 40ms on a modern NVMe. At 10M vectors, it's 400ms. At 100M vectors, brute force is unusable in an interactive application.

IVF: Inverted File Index

The first practical acceleration: cluster the vector space, then at query time only search nearby clusters.

  1. Offline: Run k-means on all vectors with k = sqrt(N) clusters (e.g., 1,000 clusters for 1M vectors). Store each vector in its nearest cluster's inverted list.
  2. Query time: Find the nprobe nearest cluster centroids to the query vector. Search only those clusters' inverted lists.
from sklearn.cluster import MiniBatchKMeans
import numpy as np

# Build: cluster assignment
kmeans = MiniBatchKMeans(n_clusters=1000, random_state=42)
kmeans.fit(all_vectors)                    # fit on corpus
assignments = kmeans.predict(all_vectors)  # which cluster each vec belongs to

# Query: find nearest centroids, then search those lists only
nprobe = 10  # search 10 of 1000 clusters (1% of data)
centroid_scores = kmeans.cluster_centers_ @ query_vec
nearest_clusters = np.argsort(centroid_scores)[::-1][:nprobe]
candidates = [v for c in nearest_clusters for v in inverted_lists[c]]

With nprobe=10 and 1,000 clusters, you search ~1% of the data. The recall tradeoff: if the nearest vectors span multiple cluster boundaries, some will be missed. Increase nprobe to improve recall at the cost of speed. FAISS's IndexIVFFlat is the production version of this.

HNSW: The Algorithm Behind Modern Vector Databases

Hierarchical Navigable Small World (HNSW) achieves ~99% recall at 10–100× the speed of brute force. It's based on skip-list data structures applied to vector space.

Structure: A multi-layer graph. Layer 0 contains all N vectors, fully connected with short-range edges. Layer 1 contains a random subset (~N/M), with longer-range edges. Layer 2 contains a further subset. The top layer has only a handful of nodes with very long-range connections.

Search:

  1. Enter at the single entry point node in the top layer.
  2. Greedily move to whichever neighbor is closest to the query.
  3. When local minimum found in this layer, descend to next layer and repeat.
  4. At layer 0, run a beam search with width efSearch, collecting all visited nodes.
  5. Return top-k from visited set.

Key parameters:

  • M — maximum edges per node (typically 16–64). Higher M → better recall, more memory, slower inserts.
  • efConstruction — beam width during index building (typically 100–200). Higher → better graph quality, slower builds.
  • efSearch — beam width at query time. Trade recall vs latency at runtime without rebuilding the index.
import hnswlib

index = hnswlib.Index(space="cosine", dim=384)
index.init_index(max_elements=100_000, ef_construction=200, M=16)
index.add_items(corpus_vectors, ids=range(len(corpus_vectors)))
index.set_ef(50)  # efSearch — tune per query SLA

labels, distances = index.knn_query(query_vector, k=10)
# labels: [[4, 17, 2, ...]] — top-10 IDs
# distances: [[0.12, 0.15, 0.18, ...]] — cosine distances

The Metadata Filtering Problem

Real applications need filtered vector search: "find the 10 most similar embeddings where category='legal' AND year>2023." This is harder than it looks:

  • Post-filter: Retrieve top-100 by vector similarity, then filter. If only 0.1% of vectors match the filter, you might get 0 results from 100 candidates even though 50 matching vectors exist.
  • Pre-filter: Build a subset index on matching documents, then search only that. Requires pre-building per-filter indexes or dynamic index construction — expensive.
  • Filtered HNSW: Qdrant and Weaviate implement HNSW variants that skip non-matching nodes during graph traversal. This is the state of the art but requires the filter selectivity to be known at query time.

Choosing a Vector Database

DatabaseDeploymentIndexMetadata FilteringFree Tier
PineconeManaged cloudHNSW + IVF hybridYes, any field1 index, 2M vectors
WeaviateSelf-hosted or cloudHNSWYes, GraphQLSandbox (14 days)
QdrantSelf-hosted or cloudHNSWYes, filtered HNSW1GB cloud free
ChromaIn-process or serverHNSW (hnswlib)Yes, basicOpen source, free
pgvectorPostgreSQL extensionIVFFlat or HNSWFull SQLWith any Postgres

For prototypes and small datasets (<100k vectors), Chroma or pgvector running locally removes infrastructure complexity entirely. For production at scale, Qdrant's filtered HNSW and Pinecone's managed sharding are worth the cost.

✅ Memory Estimate

Float32 vectors: N × d × 4 bytes. 1M vectors × 1536 dims = 6GB RAM just for vectors, before the HNSW graph structure (add ~2–4× overhead). To fit 100M vectors in memory, use product quantization to compress each vector from 6KB to ~64 bytes — a 96× reduction at ~5% recall cost.