UNDER THE HOOD

Embeddings & Vector Spaces

The foundation of modern AI — converting words, sentences, and documents into numeric vectors that capture meaning. From one-hot encoding to Word2Vec, contextual embeddings to RAG pipelines, this is the interactive guide to how machines represent language.

25 min read · By Hammad Abbasi · Interactive
01

From Words to Numbers

Neural networks operate on numbers, not text. Before any word reaches a model, it must be converted into a numeric vector. The encoding strategy determines whether the model can understand meaning or is stuck with arbitrary IDs.

one-hot encoding

Vocabulary: ["the", "cat", "sat", "on", "a", "mat", "dog", "bird"] — size 8

cat = [0, 1, 0, 0, 0, 0, 0, 0] — 8 dims
dog = [0, 0, 0, 0, 0, 0, 1, 0] — 8 dims
mat = [0, 0, 0, 0, 0, 1, 0, 0] — 8 dims

Problem: cat and dog are equally distant — the encoding has no concept of similarity. With a real vocabulary of 50,000+ words, each vector would have 50,000 dimensions with a single 1.

Analogy — GPS Coordinates for Words: One-hot encoding is like giving every city a unique ID with no geographic information — New York is #4827, Tokyo is #12041. You can't tell they're far apart. Dense embeddings are like GPS coordinates — New York (40.7, -74.0) and Philadelphia (39.9, -75.2) are nearby because they're geographically close. Similarly, “cat” and “kitten” are nearby in embedding space because they're semantically close.

One-hot encoding creates vectors as large as the vocabulary (50,000+ dimensions), where every word is equally distant from every other. Dense embeddings compress meaning into 256–4096 dimensions where similar words cluster together — "cat" and "dog" have similar vectors because they share context.
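The contrast can be sketched in a few lines of NumPy. The dense vectors below are hand-picked toy values, not trained embeddings — they only illustrate the geometry:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "a", "mat", "dog", "bird"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Every pair of distinct one-hot vectors is equally far apart:
d1 = np.linalg.norm(one_hot("cat") - one_hot("dog"))
d2 = np.linalg.norm(one_hot("cat") - one_hot("mat"))
print(d1 == d2)  # True — the encoding carries no notion of similarity

# Illustrative (made-up) dense vectors: similar words get similar values.
dense = {
    "cat": np.array([0.8, 0.6, 0.1]),
    "dog": np.array([0.7, 0.7, 0.2]),
    "mat": np.array([0.1, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# In the dense space, cat is closer to dog than to mat:
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["mat"]))  # True
```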

02

Word2Vec

Word2Vec (2013) proved that word relationships could be captured as vector arithmetic. The famous "king − man + woman = queen" showed that directions in embedding space encode semantic relationships.

vector arithmetic
king − man + woman ≈ queen

Royalty × Gender

The vector difference king − man captures 'royalty without gender.' Adding woman produces a vector closest to queen.

Word2Vec learns by a simple task: predict nearby words. Through millions of such predictions, the model discovers that words appearing in similar contexts should have similar vectors — and that relationships between words emerge as consistent directions in the space.
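A toy sketch of the analogy, using hand-designed 2D vectors (one axis for gender, one for royalty) rather than trained Word2Vec weights — real embeddings are learned and have hundreds of dimensions:

```python
import numpy as np

# Toy vectors: axis 0 ≈ gender (−1 male, +1 female),
# axis 1 ≈ royalty (0 common, 1 royal).
vecs = {
    "man":    np.array([-1.0, 0.0]),
    "woman":  np.array([ 1.0, 0.0]),
    "king":   np.array([-1.0, 1.0]),
    "queen":  np.array([ 1.0, 1.0]),
    "prince": np.array([-1.0, 0.8]),
}

target = vecs["king"] - vecs["man"] + vecs["woman"]

# Nearest word (excluding the query terms) is "queen":
nearest = min(
    (w for w in vecs if w not in {"king", "man", "woman"}),
    key=lambda w: np.linalg.norm(vecs[w] - target),
)
print(nearest)  # queen
```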

03

Embedding Dimensions

No single dimension encodes a single concept. Meaning is distributed across all dimensions — a word's "animal-ness," "size," or "abstractness" emerges from the combination of hundreds of learned values.

"knowledge" — first 8 of 768 dimensions:

d0 +0.240 · d1 −0.185 · d2 +0.151 · d3 −0.121 · d4 +0.206 · d5 −0.114 · d6 +0.164 · d7 −0.083

Values shrink as dimensions increase — meaning spreads out.

With 64 dimensions, each value carries a large share of the word's meaning — the representation is coarse. At 4096 dimensions, meaning is finely distributed across thousands of values, enabling the model to capture subtle distinctions. GPT-3 uses 12,288 dimensions per token.

04

Similarity & Distance

How do we measure whether two words are "close" in embedding space? Cosine similarity measures the angle between vectors, Euclidean distance measures the straight-line gap, and dot product combines both magnitude and direction.

Select two words to compare — e.g. Word A = king, Word B = queen:

Cosine Similarity: 0.9649 — 1 = identical direction, 0 = orthogonal, −1 = opposite

Euclidean Distance: 0.3483 — 0 = same point, larger = farther apart

Dot Product: 1.6535 — combines magnitude and alignment

angle visualization: king vs. queen — 15.2°

Analogy — Neighborhood Map: Embedding space is like a city map where similar businesses cluster into neighborhoods. All the restaurants end up on one street, tech companies cluster downtown, and parks line the river. You don't design it that way — it emerges from how businesses relate to each other. In embedding space, “pizza,” “pasta,” and “sushi” cluster in the food neighborhood, while “Python,” “Java,” and “Rust” cluster in the programming one.

Cosine similarity is the standard metric for embeddings because it ignores vector magnitude and focuses purely on direction. Two sentences about "dogs" will point in a similar direction regardless of length — cosine captures that. Euclidean distance is sensitive to magnitude, which can be useful when vector norms carry meaning.
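All three metrics are one-liners in NumPy. The king/queen vectors below are illustrative 3-dimensional stand-ins, not real embeddings, so the exact numbers differ from the demo above:

```python
import numpy as np

def cosine_similarity(a, b):
    # Angle-only: magnitude cancels out of the ratio.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Straight-line gap; sensitive to vector magnitude.
    return np.linalg.norm(a - b)

def dot_product(a, b):
    # Combines magnitude and alignment.
    return a @ b

king  = np.array([0.9, 0.8, 0.7])
queen = np.array([0.8, 0.9, 0.6])

print(round(cosine_similarity(king, queen), 4))   # close to 1 — small angle
print(round(euclidean_distance(king, queen), 4))  # small — nearby points
print(round(dot_product(king, queen), 4))
```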

05

Contextual Embeddings

Static embeddings (Word2Vec, GloVe) assign one vector per word — "bank" always gets the same representation regardless of context. Contextual embeddings (BERT, GPT) generate different vectors for the same word based on surrounding text.

"bank" in nature context — "I sat by the river bank watching the water flow."

Contextual embedding (6 of 768 dims): [+0.12, −0.67, +0.85, +0.33, −0.41, +0.55]

Static (Word2Vec)

One fixed vector per word. 'bank' always = [0.45, 0.12, ...] regardless of context. Fast, but ambiguous words are poorly represented.

Contextual (BERT/GPT)

Vector is computed from the entire input sequence. 'bank' near 'river' produces a completely different vector than 'bank' near 'deposit'.

Analogy — Translator: Static embeddings (Word2Vec) give every word a single passport photo — it looks the same everywhere. Contextual embeddings are like a translator who changes their tone, style, and word choice depending on who they're talking to. The word “bank” gets a financial ID badge when surrounded by “money” and “loan,” but a geography ID badge next to “river” and “shore.” Same word, completely different representation.

Contextual embeddings are the reason modern LLMs can disambiguate words. Each transformer layer refines the embedding by incorporating information from surrounding tokens. By the final layer, "bank" in a financial context and "bank" in a river context have vectors pointing in entirely different directions.
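A crude illustration of the difference: the `contextual` function below just mixes a word's static vector with the mean of its neighbours — a toy stand-in for what attention layers do with learned weights. The vectors are invented values:

```python
import numpy as np

# Toy static vocabulary vectors (illustrative, not trained weights).
static = {
    "bank":    np.array([0.45, 0.12, 0.80]),
    "river":   np.array([0.90, -0.30, 0.10]),
    "deposit": np.array([-0.20, 0.95, 0.40]),
    "the":     np.array([0.05, 0.05, 0.05]),
}

def contextual(word, sentence):
    """Crude stand-in for a transformer layer: blend the word's static
    vector with the mean of its neighbours' vectors."""
    neighbours = [static[w] for w in sentence if w != word and w in static]
    context = np.mean(neighbours, axis=0)
    return 0.5 * static[word] + 0.5 * context

v_river = contextual("bank", ["the", "river", "bank"])
v_money = contextual("bank", ["the", "bank", "deposit"])

# Same word, different vector depending on context:
print(np.allclose(v_river, v_money))  # False
```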

06

Positional Encodings

Transformers process all tokens in parallel — they have no inherent sense of order. Positional encodings inject sequence position into each embedding so the model knows that "The cat sat" is different from "sat cat The."

sinusoidal positional encoding — position 4, d_model = 512

PE values for position 4 across dimensions d0–d15, and a heatmap of positions 0–15 × 16 dimensions (positive vs. negative values).

Sinusoidal

Fixed sin/cos waves with increasing wavelength per dimension. Used in the original Transformer (2017). No learned parameters — purely mathematical.

Generalizes to unseen sequence lengths

RoPE

Rotary Position Embedding encodes position as rotation in 2D subspaces of the embedding. Used in LLaMA, Mistral, and most modern LLMs.

Naturally decaying attention with distance

Learned

Trainable parameter vectors — one per position. Used in BERT and GPT-2. Simple but limited to maximum training length.

Can learn task-specific positional patterns

The key insight of sinusoidal encodings: each dimension oscillates at a different frequency. Low dimensions change quickly (encoding fine position), while high dimensions change slowly (encoding coarse position). This lets the model learn to attend to both nearby and distant tokens.
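The sinusoidal encoding from the original Transformer paper is a few lines of NumPy:

```python
import numpy as np

def sinusoidal_pe(num_positions, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Each dimension pair oscillates at its own frequency."""
    pos = np.arange(num_positions)[:, None]        # (P, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, D/2)
    angles = pos / (10000 ** (2 * i / d_model))    # (P, D/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(16, 512)
print(pe[4, :4])   # low dimensions oscillate fast (fine position)
print(pe[:, -1])   # the last dimension barely moves over positions 0–15
```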

07

Embedding Tables

An embedding table is a simple lookup matrix: vocabulary size × embedding dimension. Every token ID indexes one row. The number of parameters scales linearly with both dimensions — for large vocabularies and high-dimensional models, this table alone can contain billions of parameters.

parameter calculator — Vocabulary Size × Embedding Dimension

Total Parameters: 50,257 × 768 = 38.6M

FP32: 154.4 MB · FP16: 77.2 MB · INT8: 38.6 MB

The embedding table is often the single largest component by parameter count in smaller models. LLaMA 3's 128K vocabulary × 4096 dimensions = 524M parameters just for the embedding lookup — before a single transformer layer runs. This is why vocabulary size is a critical design decision.
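The arithmetic behind the calculator above fits in a small helper — the figures reproduce the 50,257 × 768 table and the LLaMA 3 example from the text:

```python
def embedding_table_size(vocab_size, d_model):
    """Parameter count and memory footprint of an embedding lookup table."""
    params = vocab_size * d_model
    return params, {
        "FP32": params * 4 / 1e6,  # MB at 4 bytes/param
        "FP16": params * 2 / 1e6,  # MB at 2 bytes/param
        "INT8": params * 1 / 1e6,  # MB at 1 byte/param
    }

# 50,257 tokens × 768 dims (GPT-2's table size)
params, mem = embedding_table_size(50_257, 768)
print(f"{params/1e6:.1f}M parameters")   # 38.6M parameters
print(f"{mem['FP32']:.1f} MB in FP32")   # 154.4 MB in FP32

# LLaMA 3: 128K vocabulary × 4096 dims
params, _ = embedding_table_size(128_000, 4096)
print(f"{params/1e6:.0f}M parameters")   # 524M parameters
```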

08

Sentence & Document Embeddings

A single word embedding isn't enough — we often need to represent entire sentences, paragraphs, or documents as a single vector. Pooling strategies collapse a sequence of token embeddings into one fixed-size vector.

cls token pooling — [CLS] The cat sat → embed

Only the [CLS] token embedding becomes the sentence vector.

text-embedding-3-large

OpenAI · 3072d

State-of-the-art commercial embedding model with Matryoshka support for variable dimensions.

all-MiniLM-L6-v2

Sentence Transformers · 384d

Lightweight, fast, and free. Great for production semantic search with limited resources.

e5-large-v2

Microsoft · 1024d

High-quality multilingual embeddings trained with contrastive learning on diverse tasks.

Sentence embeddings power semantic search, clustering, classification, and deduplication. The choice of pooling strategy and embedding model determines the quality of downstream tasks — mean pooling with a fine-tuned sentence-transformer typically outperforms CLS-based approaches for retrieval.
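Mean pooling — the strategy mentioned above — is simple to sketch. The token vectors here are made-up values; the attention mask marks which positions are real tokens versus padding:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions — the pooling
    used by most sentence-transformers models."""
    mask = attention_mask[:, None].astype(float)   # (T, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    counts = mask.sum()
    return summed / counts

# 4 tokens × 5 dims of illustrative values; last token is padding.
tokens = np.array([
    [0.2, 0.1, 0.4, 0.0, 0.3],
    [0.6, 0.5, 0.1, 0.2, 0.0],
    [0.1, 0.3, 0.3, 0.4, 0.1],
    [9.9, 9.9, 9.9, 9.9, 9.9],   # pad — must not affect the result
])
mask = np.array([1, 1, 1, 0])

sentence_vec = mean_pool(tokens, mask)
print(sentence_vec)  # mean of the first three rows only
```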

09

Vector Search & RAG

Retrieval-Augmented Generation (RAG) grounds LLM responses in real data. The pipeline embeds a user query, searches a vector database for similar documents, and injects the retrieved context into the prompt — reducing hallucination and enabling answers over private data.

RAG pipeline

Step 1: User asks a natural language question

LSH — Locality-Sensitive Hashing (O(1) lookup)

Hash similar vectors into the same bucket. Fast, memory-efficient, but lower recall for high-dimensional data.

HNSW — Hierarchical Navigable Small World (O(log n) search)

Layered graph where each node connects to nearby vectors. State-of-the-art recall/speed tradeoff. Used by Qdrant, pgvector, Pinecone.

IVF — Inverted File Index (O(√n) search)

Partition space into Voronoi cells, search only nearby partitions. Pairs well with product quantization for compression.

search demo — Top-3 Results

#1 Word Embeddings — Dense vector representations that map words into continuous space where semantic similarity is captured by geometric proximity.

#2 Tokenization Methods — BPE and WordPiece split text into subword units, balancing vocabulary size with sequence length for efficient language processing.

#3 Attention Mechanism — Computes weighted combinations of values based on query-key similarity scores, allowing models to focus on relevant parts of the input.

RAG is the dominant pattern for building LLM applications over private data. The key insight: instead of fine-tuning the model (expensive, slow), you store your documents as embeddings and retrieve relevant ones at query time. HNSW is the most widely used ANN algorithm — it powers Qdrant, Pinecone, Weaviate, and pgvector.
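Before reaching for an ANN index, it helps to see what it approximates: exact top-k search by cosine similarity. A minimal NumPy sketch over random stand-in vectors:

```python
import numpy as np

def top_k(query, doc_vectors, k=3):
    """Exact nearest-neighbour search by cosine similarity.
    ANN indexes (HNSW, IVF, LSH) approximate this at scale."""
    q = query / np.linalg.norm(query)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    scores = d @ q                          # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]    # indices of the k best scores
    return order, scores[order]

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))              # stand-in document embeddings
query = docs[42] + 0.01 * rng.normal(size=64)   # near-duplicate of doc 42

idx, scores = top_k(query, docs)
print(idx[0])  # 42 — the closest document
```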

Animated — RAG Retrieval Pipeline

1. Query — user question
2. Embed — convert to a 768-dim vector
3. Search — ANN lookup (HNSW)
4. Retrieve — top-k documents
5. Augment — inject into prompt
6. Generate — LLM produces answer
Reranking — Two-Stage Retrieval

Bi-encoder retrieval is fast but coarse. A cross-encoder reranker takes each (query, document) pair and scores relevance jointly — far more accurate but too slow for the full corpus. The two-stage pattern combines both: retrieve top-50 cheaply, then rerank to top-5 precisely.

Stage 1 — Retrieval (~1ms per 1M docs): bi-encoder (e.g. BGE, E5) → Top 50

Stage 2 — Reranking (~200ms for 50 pairs): cross-encoder (e.g. Cohere Rerank, BGE-reranker) → Top 5

Analogy — Job Interview: Retrieval is like screening resumes by keyword — fast but misses nuance. Reranking is the actual interview — slow but you understand who truly fits. You screen 500 applicants, then interview only the top 10.

Hybrid Search — Best of Both Worlds

Semantic search misses exact keyword matches; keyword search misses meaning. Hybrid search combines both using Reciprocal Rank Fusion (RRF) or weighted scoring to get the best of both worlds.

Sparse (BM25 / TF-IDF)

+ Exact keyword matches, acronyms, IDs, code
− No semantic understanding

Dense (Embedding)

+ Meaning, paraphrases, synonyms
− Misses exact terms, entity names

Hybrid (Fused)

+ Best recall across both types
− Slightly more complex indexing

RRF formula: score(d) = Σᵢ 1/(k + rankᵢ(d)) — merges rankings without needing comparable scores.
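RRF takes only a few lines. The document names and the two input rankings below are invented for illustration; k = 60 is the constant from the original RRF paper:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank).
    Works on raw rank positions, so the input scores never need to be
    comparable across retrieval systems."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25  = ["doc_acronym", "doc_pricing", "doc_api"]   # keyword ranking
dense = ["doc_pricing", "doc_refunds", "doc_api"]   # semantic ranking

print(rrf_fuse([bm25, dense]))
# doc_pricing wins: ranked in both lists (1st in one, 2nd in the other)
```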

Chunking Strategies

How you split documents before embedding determines retrieval quality. Too large and chunks contain irrelevant noise; too small and they lose context.

Fixed-size — Baseline

Split every N tokens (e.g. 512). Simple but cuts mid-sentence. Add overlap (e.g. 50 tokens) to preserve context at boundaries.

Recursive / Semantic — Better

Split by paragraph, then sentence, then token count. Preserves natural boundaries. Used by LangChain's RecursiveCharacterTextSplitter.

Parent-Child — Best for RAG

Embed small chunks for precision, but retrieve the larger parent chunk for context. Combines fine-grained matching with full context.

Contextual (Anthropic) — State of the art

Use an LLM to prepend a short context sentence to each chunk before embedding — e.g. 'This chunk discusses the pricing of Widget X.' Dramatically improves retrieval.
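The baseline strategy can be sketched in one function — 512-token chunks with a 50-token overlap, with token IDs faked as integers for the demo:

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Fixed-size chunking with overlap: each chunk repeats the last
    `overlap` tokens of the previous one, so context at chunk
    boundaries survives the split."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = list(range(1200))          # stand-in for tokenized document text
chunks = fixed_size_chunks(tokens)
print(len(chunks))                  # 3 chunks
print(chunks[1][0], chunks[0][-1])  # chunk 2 starts inside chunk 1's tail
```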

Production RAG Patterns

Real-world RAG systems go far beyond simple retrieve-and-generate. These patterns solve common failure modes.

Query Rewriting

LLM rewrites the user question for better retrieval — expanding abbreviations, resolving pronouns, adding context from conversation history.

HyDE (Hypothetical Document)

Generate a hypothetical answer first, embed that instead of the question. The hypothetical answer is closer in embedding space to real documents than a short question.

Multi-Query

Generate 3–5 paraphrased queries, retrieve for each, merge results. Captures different facets of the user's intent.

Corrective RAG (CRAG)

After retrieval, evaluate whether retrieved docs actually answer the question. If confidence is low, fall back to web search or ask for clarification.

Agentic RAG

An agent decides when to retrieve, what to retrieve, and whether to re-retrieve based on the quality of initial results. Combines tool use with RAG.

Embedding Use Cases Beyond RAG
🔍Semantic Search

Search by meaning, not keywords. 'How to fix a flat tire' finds results about 'changing punctured tires' even without shared words.

🎯Clustering & Topic Modeling

Group documents by semantic similarity using k-means or HDBSCAN on embeddings. Discover themes without predefined categories.

Deduplication

Find near-duplicate content by cosine similarity threshold. Catch paraphrased plagiarism, merge duplicate support tickets.

Anomaly Detection

Flag content that's far from all clusters — outlier detection in embedding space catches unusual patterns, fraud, or drift.

Recommendation Systems

Embed users and items in the same space. Recommend items closest to the user's embedding — Netflix, Spotify, and Amazon all use this.

🏷Classification (Zero-Shot)

Embed class labels and inputs. Classify by nearest label embedding — no training data needed. 'Is this email spam?' by proximity to 'spam' vs 'legitimate' embeddings.

The full production RAG stack is: chunk → embed → index → query rewrite → hybrid search (sparse + dense) → rerank → filter → augment prompt → generate → cite sources. Each stage has its own failure modes and tuning knobs. Getting RAG right is more about the pipeline engineering than the LLM itself.

10

The Embedding Space

When we project high-dimensional embeddings down to 2D (using t-SNE or UMAP), clusters of related words emerge. Animals group together, colors cluster nearby, countries form their own region. Analogies appear as parallelograms — consistent directional offsets.

2D projection (t-SNE style) — clusters: Animals (cat, dog, bird, fish, horse, rabbit), Colors (red, blue, green, yellow, purple), Countries (france, japan, brazil, germany, india), and an Analogy parallelogram (man, woman, king, queen) spanned by gender and royalty directions.

The embedding space is not just a bag of points — it has rich geometric structure. Analogies form parallelograms because the direction from "man" to "king" (adding royalty) is the same as from "woman" to "queen." These regularities emerge purely from learning to predict surrounding words — no one teaches the model about royalty or gender.

Building with LLMs?

I architect enterprise AI systems — RAG pipelines, multi-agent orchestration, custom copilots, and the infrastructure to run them reliably at scale.

Get In Touch