From Words to Numbers
Neural networks operate on numbers, not text. Before any word reaches a model, it must be converted into a numeric vector. The encoding strategy determines whether the model can understand meaning or is stuck with arbitrary IDs.
Example vocabulary: ["the", "cat", "sat", "on", "a", "mat", "dog", "bird"] — size 8. One-hot encoding assigns each word a vector of 8 zeros with a single 1 at its index: "cat" becomes [0, 1, 0, 0, 0, 0, 0, 0].
Problem: "cat" and "dog" are exactly as distant from each other as from any other word — the encoding has no concept of similarity. With a real vocabulary of 50,000+ words, each vector would have 50,000 dimensions with a single 1.
Analogy — GPS Coordinates for Words: One-hot encoding is like giving every city a unique ID with no geographic information — New York is #4827, Tokyo is #12041. You can't tell they're far apart. Dense embeddings are like GPS coordinates — New York (40.7, -74.0) and Philadelphia (39.9, -75.2) are nearby because they're geographically close. Similarly, “cat” and “kitten” are nearby in embedding space because they're semantically close.
One-hot encoding creates vectors as large as the vocabulary (50,000+ dimensions), where every word is equally distant from every other. Dense embeddings compress meaning into 256–4096 dimensions where similar words cluster together — "cat" and "dog" have similar vectors because they share context.
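A minimal sketch of the contrast, using a toy 3-dimensional dense embedding with invented values (real models learn 256–4096 dimensions from data):

```python
import math

def one_hot(word, vocab):
    """One-hot: a vector as long as the vocabulary, with a single 1."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(word)] = 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

vocab = ["the", "cat", "sat", "on", "a", "mat", "dog", "bird"]

# Any two distinct one-hot vectors are orthogonal: similarity is always 0.
print(cosine(one_hot("cat", vocab), one_hot("dog", vocab)))  # 0.0

# Toy dense embeddings (values invented for illustration):
# related words get similar vectors, unrelated words do not.
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "mat": [0.1, 0.0, 0.9],
}
print(round(cosine(dense["cat"], dense["dog"]), 2))  # 0.99
print(round(cosine(dense["cat"], dense["mat"]), 2))  # 0.16
```

The one-hot encoding can only answer "same word or not"; the dense vectors carry graded similarity.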
Word2Vec
Word2Vec (2013) proved that word relationships could be captured as vector arithmetic. The famous "king − man + woman = queen" showed that directions in embedding space encode semantic relationships.
Royalty × Gender
The vector difference king − man captures 'royalty without gender.' Adding woman produces a vector closest to queen.
Word2Vec learns by a simple task: predict nearby words. Through millions of such predictions, the model discovers that words appearing in similar contexts should have similar vectors — and that relationships between words emerge as consistent directions in the space.
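The analogy arithmetic can be sketched with toy 3-dimensional vectors (invented so that one dimension tracks royalty and another tracks gender; real Word2Vec dimensions are not interpretable like this):

```python
# Toy vectors, invented for illustration: dim0 ≈ royalty, dim1 ≈ gender.
vecs = {
    "king":  [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.1],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.1],
    "apple": [0.0, 0.0, 0.9],
}

def add(a, b):  return [x + y for x, y in zip(a, b)]
def sub(a, b):  return [x - y for x, y in zip(a, b)]
def dist(a, b): return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# king − man + woman: subtracting "man" removes gender, adding "woman" puts it back.
target = add(sub(vecs["king"], vecs["man"]), vecs["woman"])

# The nearest remaining word is "queen".
nearest = min((w for w in vecs if w not in {"king", "man", "woman"}),
              key=lambda w: dist(vecs[w], target))
print(nearest)  # queen
```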
Embedding Dimensions
No single dimension encodes a single concept. Meaning is distributed across all dimensions — a word's "animal-ness," "size," or "abstractness" emerges from the combination of hundreds of learned values.
With 64 dimensions, each value carries a large share of the word's meaning — the representation is coarse. At 4096 dimensions, meaning is finely distributed across thousands of values, enabling the model to capture subtle distinctions. GPT-3 uses 12,288 dimensions per token.
Similarity & Distance
How do we measure whether two words are "close" in embedding space? Cosine similarity measures the angle between vectors, Euclidean distance measures the straight-line gap, and dot product combines both magnitude and direction.
Cosine similarity: 1 = identical direction, 0 = orthogonal, −1 = opposite
Euclidean distance: 0 = same point, larger = farther apart
Dot product: combines magnitude and alignment
Analogy — Neighborhood Map: Embedding space is like a city map where similar businesses cluster into neighborhoods. All the restaurants end up on one street, tech companies cluster downtown, and parks line the river. You don't design it that way — it emerges from how businesses relate to each other. In embedding space, “pizza,” “pasta,” and “sushi” cluster in the food neighborhood, while “Python,” “Java,” and “Rust” cluster in the programming one.
Cosine similarity is the standard metric for embeddings because it ignores vector magnitude and focuses purely on direction. Two sentences about "dogs" will point in a similar direction regardless of length — cosine captures that. Euclidean distance is sensitive to magnitude, which can be useful when vector norms carry meaning.
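All three metrics in a short sketch. The two example vectors point in the same direction but differ in magnitude, which is exactly the case where cosine and Euclidean disagree:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Angle only: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    """Straight-line gap: 0 = same point."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print(cosine_similarity(a, b))        # 1.0  (magnitude ignored)
print(round(euclidean(a, b), 3))      # 2.236 (magnitude matters)
print(dot(a, b))                      # 10.0 (direction and magnitude)
```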
Contextual Embeddings
Static embeddings (Word2Vec, GloVe) assign one vector per word — "bank" always gets the same representation regardless of context. Contextual embeddings (BERT, GPT) generate different vectors for the same word based on surrounding text.
Example: in "I sat by the river bank watching the water flow," the contextual embedding of "bank" shifts toward its nature sense — the surrounding words pull the vector away from the financial meaning.
Static (Word2Vec)
One fixed vector per word. 'bank' always = [0.45, 0.12, ...] regardless of context. Fast, but ambiguous words are poorly represented.
Contextual (BERT/GPT)
Vector is computed from the entire input sequence. 'bank' near 'river' produces a completely different vector than 'bank' near 'deposit'.
Analogy — Translator: Static embeddings (Word2Vec) give every word a single passport photo — it looks the same everywhere. Contextual embeddings are like a translator who changes their tone, style, and word choice depending on who they're talking to. The word “bank” gets a financial ID badge when surrounded by “money” and “loan,” but a geography ID badge next to “river” and “shore.” Same word, completely different representation.
Contextual embeddings are the reason modern LLMs can disambiguate words. Each transformer layer refines the embedding by incorporating information from surrounding tokens. By the final layer, "bank" in a financial context and "bank" in a river context have vectors pointing in entirely different directions.
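A toy illustration of context-dependence (not how BERT actually computes it — a real transformer uses attention over many layers). Here the "contextual" vector is just the static vector blended with the average of its neighbours', enough to show the same word landing in different places:

```python
# Toy 2-d static vectors, invented for illustration.
static = {
    "bank":    [0.5, 0.5],
    "river":   [0.0, 1.0],
    "deposit": [1.0, 0.0],
    "the":     [0.5, 0.5],
}

def contextual(word, sentence, alpha=0.5):
    """Blend a word's static vector with the mean of its neighbours' vectors."""
    neighbours = [static[w] for w in sentence if w != word]
    ctx = [sum(col) / len(neighbours) for col in zip(*neighbours)]
    return [(1 - alpha) * s + alpha * c for s, c in zip(static[word], ctx)]

v_river   = contextual("bank", ["the", "river", "bank"])
v_finance = contextual("bank", ["the", "bank", "deposit"])
print(v_river)    # [0.375, 0.625] — pulled toward "river"
print(v_finance)  # [0.625, 0.375] — pulled toward "deposit"
```

Same word, two different vectors, determined entirely by the surrounding tokens.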
Positional Encodings
Transformers process all tokens in parallel — they have no inherent sense of order. Positional encodings inject sequence position into each embedding so the model knows that "The cat sat" is different from "sat cat The."
Figure: heatmap of sinusoidal positional-encoding values, positions 0–15 × 16 dimensions (positive and negative values shaded); one column shows the PE values for position 4.
Sinusoidal
Fixed sin/cos waves with increasing wavelength per dimension. Used in the original Transformer (2017). No learned parameters — purely mathematical.
✓ Generalizes to unseen sequence lengths
RoPE
Rotary Position Embedding encodes position as rotation in 2D subspaces of the embedding. Used in LLaMA, Mistral, and most modern LLMs.
✓ Naturally decaying attention with distance
Learned
Trainable parameter vectors — one per position. Used in BERT and GPT-2. Simple but limited to maximum training length.
✓ Can learn task-specific positional patterns
The key insight of sinusoidal encodings: each dimension oscillates at a different frequency. Low dimensions change quickly (encoding fine position), while high dimensions change slowly (encoding coarse position). This lets the model learn to attend to both nearby and distant tokens.
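The sinusoidal scheme from the original Transformer paper, sketched directly: even dimensions get sin, odd dimensions get cos, and the wavelength grows geometrically with the dimension index.

```python
import math

def sinusoidal_pe(position, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...).
    Assumes an even d_model."""
    pe = []
    for i in range(0, d_model, 2):
        freq = 1.0 / (10000 ** (i / d_model))  # lower dims -> higher frequency
        pe.append(math.sin(position * freq))
        pe.append(math.cos(position * freq))
    return pe

pe0 = sinusoidal_pe(0, 16)
pe4 = sinusoidal_pe(4, 16)
print(pe0[:4])  # position 0 is always [0.0, 1.0, 0.0, 1.0, ...]
print(round(pe4[0], 3))  # sin(4) in the fastest-oscillating dimension
```

No parameters are learned, so the same function extends to any sequence length.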
Embedding Tables
An embedding table is a simple lookup matrix: vocabulary size × embedding dimension. Every token ID indexes one row. The number of parameters scales linearly with both dimensions — for large vocabularies and high-dimensional models, this table alone can contain billions of parameters.
Example: GPT-2's embedding table is 50,257 tokens × 768 dimensions ≈ 38.6M parameters.
The embedding table is often the single largest component by parameter count in smaller models. LLaMA 3's 128K vocabulary × 4096 dimensions = 524M parameters just for the embedding lookup — before a single transformer layer runs. This is why vocabulary size is a critical design decision.
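The lookup itself is trivial, which is the point: embedding a token is indexing, not computation. A minimal sketch with a tiny random table (real tables are learned during training):

```python
import random

vocab_size, d_model = 8, 4
random.seed(0)

# One learned row per token ID; here filled with small random values.
table = [[random.gauss(0, 0.02) for _ in range(d_model)]
         for _ in range(vocab_size)]

def embed(token_ids):
    """Embedding lookup: each token ID selects one row of the table."""
    return [table[t] for t in token_ids]

tokens = [1, 6, 2]            # e.g. "cat dog sat" as IDs
vectors = embed(tokens)
print(len(vectors), len(vectors[0]))   # 3 4
print(vocab_size * d_model)            # 32 parameters: scales as vocab × dim
```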
Sentence & Document Embeddings
A single word embedding isn't enough — we often need to represent entire sentences, paragraphs, or documents as a single vector. Pooling strategies collapse a sequence of token embeddings into one fixed-size vector.
CLS pooling: only the [CLS] token's embedding becomes the sentence vector. Mean pooling instead averages all token embeddings.
text-embedding-3-large
OpenAI · 3072d
State-of-the-art commercial embedding model with Matryoshka support for variable dimensions.
all-MiniLM-L6-v2
Sentence Transformers · 384d
Lightweight, fast, and free. Great for production semantic search with limited resources.
e5-large-v2
Microsoft · 1024d
High-quality multilingual embeddings trained with contrastive learning on diverse tasks.
Sentence embeddings power semantic search, clustering, classification, and deduplication. The choice of pooling strategy and embedding model determines the quality of downstream tasks — mean pooling with a fine-tuned sentence-transformer typically outperforms CLS-based approaches for retrieval.
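Mean pooling is simple enough to sketch in full. The attention mask matters: padding tokens must not contribute to the average (this mirrors the pooling step in sentence-transformer models, shown here with invented 2-d vectors):

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    d = len(token_embeddings[0])
    summed = [0.0] * d
    count = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            summed = [s + v for s, v in zip(summed, vec)]
            count += 1
    return [s / count for s in summed]

# 3 real tokens + 1 padding token, 2-d vectors for brevity
toks = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
print(mean_pool(toks, mask))  # [1.0, 1.0] — the padding row is ignored
```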
Vector Search & RAG
Retrieval-Augmented Generation (RAG) grounds LLM responses in real data. The pipeline embeds a user query, searches a vector database for similar documents, and injects the retrieved context into the prompt — reducing hallucination and enabling answers over private data.
Step 1: User asks a natural-language question.
Step 2: The question is embedded into a query vector.
Step 3: A vector database returns the most similar document chunks.
Step 4: Retrieved chunks are injected into the prompt as context for generation.
LSH (Locality-Sensitive Hashing) · O(1) lookup
Hash similar vectors into the same bucket. Fast and memory-efficient, but lower recall for high-dimensional data.
HNSW (Hierarchical Navigable Small World) · O(log n) search
Layered graph where each node connects to nearby vectors. State-of-the-art recall/speed tradeoff. Used by Qdrant, pgvector, Pinecone.
IVF (Inverted File Index) · O(√n) search
Partition space into Voronoi cells, search only nearby partitions. Pairs well with product quantization for compression.
Top-3 Results
Word Embeddings
Dense vector representations that map words into continuous space where semantic similarity is captured by geometric proximity.
Tokenization Methods
BPE and WordPiece split text into subword units, balancing vocabulary size with sequence length for efficient language processing.
Attention Mechanism
Computes weighted combinations of values based on query-key similarity scores, allowing models to focus on relevant parts of the input.
RAG is the dominant pattern for building LLM applications over private data. The key insight: instead of fine-tuning the model (expensive, slow), you store your documents as embeddings and retrieve relevant ones at query time. HNSW is the most widely used ANN algorithm — it powers Qdrant, Pinecone, Weaviate, and pgvector.
Bi-encoder retrieval is fast but coarse. A cross-encoder reranker takes each (query, document) pair and scores relevance jointly — far more accurate but too slow for the full corpus. The two-stage pattern combines both: retrieve top-50 cheaply, then rerank to top-5 precisely.
Bi-encoder (e.g. BGE, E5) → Top 50
Cross-encoder (e.g. Cohere Rerank, BGE-reranker) → Top 5
Analogy — Job Interview: Retrieval is like screening resumes by keyword — fast but misses nuance. Reranking is the actual interview — slow but you understand who truly fits. You screen 500 applicants, then interview only the top 10.
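The two-stage pattern can be sketched with stand-in scoring functions. Both scorers here are toy assumptions: a dot product over precomputed vectors for the bi-encoder stage, and word overlap standing in for a real cross-encoder reranker.

```python
def bi_encoder_score(query_vec, doc_vec):
    """Stage 1 (cheap): compare precomputed vectors with a dot product."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

def cross_encoder_score(query, doc):
    """Stage 2 (expensive but precise): score the (query, document) pair
    jointly. Toy word-overlap stand-in for a real reranker model."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def two_stage_search(query, query_vec, corpus, top_k=50, final_k=5):
    # Stage 1: retrieve top_k candidates cheaply with the bi-encoder.
    candidates = sorted(corpus, reverse=True,
                        key=lambda doc: bi_encoder_score(query_vec, doc["vec"]))[:top_k]
    # Stage 2: rerank only those candidates with the cross-encoder.
    reranked = sorted(candidates, reverse=True,
                      key=lambda doc: cross_encoder_score(query, doc["text"]))
    return reranked[:final_k]

corpus = [
    {"text": "how to change a flat tire", "vec": [0.9, 0.1]},
    {"text": "tire pricing guide",        "vec": [0.8, 0.2]},
    {"text": "python tutorial",           "vec": [0.1, 0.9]},
]
result = two_stage_search("change a flat tire", [1.0, 0.0], corpus,
                          top_k=2, final_k=1)
print(result[0]["text"])  # how to change a flat tire
```

In production the two scorers would be real models (e.g. a BGE bi-encoder and a cross-encoder reranker), but the control flow is exactly this: wide and cheap, then narrow and precise.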
Semantic search misses exact keyword matches; keyword search misses meaning. Hybrid search combines both using Reciprocal Rank Fusion (RRF) or weighted scoring to get the best of both worlds.
Keyword search: + Exact keyword matches, acronyms, IDs, code · − No semantic understanding
Semantic search: + Meaning, paraphrases, synonyms · − Misses exact terms, entity names
Hybrid search: + Best recall across both types · − Slightly more complex indexing
RRF formula: score(d) = Σ_i 1/(k + rank_i(d)) — merges rankings without needing comparable scores.
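The formula translates directly to code. With the conventional k = 60, a document ranked high in both lists outscores one ranked first in only one list:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1/(k + rank),
    where rank starts at 1 in each ranked list the document appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results  = ["doc_a", "doc_c", "doc_d"]
semantic_results = ["doc_b", "doc_a", "doc_c"]
fused = rrf([keyword_results, semantic_results])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

doc_a wins because it appears near the top of both rankings, even though it is first in only one of them; no score normalization between the two systems is needed.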
How you split documents before embedding determines retrieval quality. Too large and chunks contain irrelevant noise; too small and they lose context.
Fixed-size: Split every N tokens (e.g. 512). Simple but cuts mid-sentence. Add overlap (e.g. 50 tokens) to preserve context at boundaries.
Recursive: Split by paragraph, then sentence, then token count. Preserves natural boundaries. Used by LangChain's RecursiveCharacterTextSplitter.
Parent-document (small-to-big): Embed small chunks for precision, but retrieve the larger parent chunk for context. Combines fine-grained matching with full context.
Contextual retrieval: Use an LLM to prepend a short context sentence to each chunk before embedding — e.g. 'This chunk discusses the pricing of Widget X.' Dramatically improves retrieval.
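The simplest of these, fixed-size chunking with overlap, fits in a few lines (token IDs stand in for a real tokenizer's output):

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Fixed-size chunking: windows of `size` tokens, each sharing
    `overlap` tokens with the previous window."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = list(range(1200))   # stand-in for a tokenized document
chunks = chunk_tokens(tokens)
print(len(chunks))           # 3
print(chunks[1][0])          # 462 — chunk 2 starts 50 tokens before chunk 1 ends
```

The overlap means a sentence cut at a chunk boundary still appears whole in the neighbouring chunk.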
Real-world RAG systems go far beyond simple retrieve-and-generate. These patterns solve common failure modes.
Query rewriting: An LLM rewrites the user question for better retrieval — expanding abbreviations, resolving pronouns, adding context from conversation history.
HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer first, then embed that instead of the question. The hypothetical answer is closer in embedding space to real documents than a short question is.
Multi-query: Generate 3–5 paraphrased queries, retrieve for each, merge results. Captures different facets of the user's intent.
Self-correcting retrieval: After retrieval, evaluate whether the retrieved docs actually answer the question. If confidence is low, fall back to web search or ask for clarification.
Agentic RAG: An agent decides when to retrieve, what to retrieve, and whether to re-retrieve based on the quality of initial results. Combines tool use with RAG.
Semantic search: Search by meaning, not keywords. 'How to fix a flat tire' finds results about 'changing punctured tires' even without shared words.
Clustering: Group documents by semantic similarity using k-means or HDBSCAN on embeddings. Discover themes without predefined categories.
Deduplication: Find near-duplicate content by cosine-similarity threshold. Catch paraphrased plagiarism, merge duplicate support tickets.
Anomaly detection: Flag content that's far from all clusters — outlier detection in embedding space catches unusual patterns, fraud, or drift.
Recommendations: Embed users and items in the same space, then recommend items closest to the user's embedding — Netflix, Spotify, and Amazon all use this.
Zero-shot classification: Embed class labels and inputs, then classify by nearest label embedding — no training data needed. 'Is this email spam?' becomes proximity to 'spam' vs 'legitimate' embeddings.
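Zero-shot classification by label proximity, sketched end to end. The `embed` function here is a toy keyword-count stand-in, invented so the example runs without a model; in practice it would be any sentence-embedding model.

```python
def embed(text):
    """Toy stand-in for a real embedding model: counts of a few keywords."""
    keywords = ["win", "free", "prize", "meeting", "report", "agenda"]
    return [text.lower().count(k) for k in keywords]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# Embed the class labels once; classify each input by nearest label.
labels = {
    "spam":       embed("win free prize"),
    "legitimate": embed("meeting report agenda"),
}

email = "You win a free prize today!"
prediction = max(labels, key=lambda l: cosine(labels[l], embed(email)))
print(prediction)  # spam
```

No training step: adding a new class is just embedding one more label description.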
The full production RAG stack is: chunk → embed → index → query rewrite → hybrid search (sparse + dense) → rerank → filter → augment prompt → generate → cite sources. Each stage has its own failure modes and tuning knobs. Getting RAG right is more about the pipeline engineering than the LLM itself.
The Embedding Space
When we project high-dimensional embeddings down to 2D (using t-SNE or UMAP), clusters of related words emerge. Animals group together, colors cluster nearby, countries form their own region. Analogies appear as parallelograms — consistent directional offsets.
The embedding space is not just a bag of points — it has rich geometric structure. Analogies form parallelograms because the direction from "man" to "king" (adding royalty) is the same as from "woman" to "queen." These regularities emerge purely from learning to predict surrounding words — no one teaches the model about royalty or gender.