UNDER THE HOOD

Neural Networks

The building blocks of every AI system. From a single artificial neuron to the architectures powering GPT-4 and Claude — an interactive, visual journey through how neural networks learn.

01

What Is a Neural Network?

A neural network is a function approximator. Given enough neurons and data, it can learn to approximate any continuous mapping from inputs to outputs to arbitrary accuracy. This result, the Universal Approximation Theorem, is why neural networks power everything from image recognition to language models.

Biological Neuron
Dendrites: receive signals from other neurons
Cell Body: processes incoming signals
Axon: transmits the signal to the next neuron
Synapse: connection point between neurons
Artificial Neuron

Every artificial neuron mirrors a biological one: inputs arrive (dendrites), are weighted and summed (cell body), pass through a nonlinear activation (axon hillock), and produce an output (axon terminal). Stack millions of these and you get a network that can learn nearly anything.

02

The Single Neuron

The perceptron — invented by Frank Rosenblatt in 1958 — is the simplest neural network: a single neuron that takes weighted inputs, adds a bias, and fires if the result exceeds a threshold. It can learn any linearly separable function.

Interactive Perceptron
x₁ = 1.0, w₁ = 0.7 → contribution 0.700
x₂ = 0.5, w₂ = -0.3 → contribution -0.150
x₃ = 0.8, w₃ = 0.5 → contribution 0.400
Bias = -0.2
Output
Weighted Sum + Bias: 0.750
Step Function: 1 (Fires!)
Sigmoid: 0.679 (Probability: 67.9%)

The step function gives a hard 0/1 decision (original perceptron). Sigmoid gives a smooth probability — needed for gradient-based learning.
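Both output rules take only a few lines of Python. A minimal sketch reproducing the interactive example above (same inputs, weights, and bias):

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum plus bias: the core computation of every artificial neuron."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def step(z):
    """Original perceptron rule: hard 0/1 decision."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """Smooth squashing into (0, 1), which enables gradient-based learning."""
    return 1 / (1 + math.exp(-z))

# Values from the interactive example above
z = neuron([1.0, 0.5, 0.8], [0.7, -0.3, 0.5], bias=-0.2)
print(round(z, 3))           # 0.75
print(step(z))               # 1 (fires)
print(round(sigmoid(z), 3))  # 0.679
```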

The XOR Problem — Why Single Neurons Aren't Enough

AND

(0, 0) → 0
(0, 1) → 0
(1, 0) → 0
(1, 1) → 1
A single neuron can learn this

OR

(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 1
A single neuron can learn this

XOR

(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0
Needs a hidden layer!

Minsky & Papert proved in 1969 that a single perceptron cannot learn XOR — this caused the first "AI winter." The solution? Add a hidden layer. Multi-layer networks can learn any function, not just linearly separable ones. This insight led to the modern deep learning era.
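The fix can be seen concretely with hand-picked weights: a two-neuron hidden layer computes OR and NAND, and an output neuron ANDs them together. A sketch (the weights below are illustrative, not learned):

```python
def step(z):
    return 1 if z > 0 else 0

def xor(x1, x2):
    """XOR via one hidden layer:
    h1 = OR(x1, x2), h2 = NAND(x1, x2), output = AND(h1, h2)."""
    h1 = step(x1 + x2 - 0.5)    # fires if at least one input is 1 (OR)
    h2 = step(-x1 - x2 + 1.5)   # fires unless both inputs are 1 (NAND)
    return step(h1 + h2 - 1.5)  # fires only if both hidden units fire (AND)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))  # 0, 1, 1, 0
```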

03

Activation Functions

Activation functions introduce nonlinearity — without them, stacking layers would just be matrix multiplication, and the network could only learn linear relationships. The choice of activation deeply affects training dynamics.

Activation Explorer
Input: 0.5 → Output: 0.500
relu(x) = max(0, x)
Details

How It Works

Zero for negative inputs, linear for positive ones. Simple, fast, and the default for most networks. Can cause 'dead neurons' if a unit's inputs are always negative.

Used In

CNNs, most feedforward networks

Why Gradients Matter

sigmoid: gradient maxes out at 0.25, vanishes in deep networks
relu: gradient is 1 or 0, fast but risks 'dead neurons'
gelu: smooth gradient everywhere, standard in transformers

The gradient of the activation function determines how well signals flow backwards during training. Sigmoid and tanh squish gradients toward zero in deep networks (vanishing gradient problem). ReLU solved this — but GELU refined it further and is now standard in every major transformer, including GPT-4 and Claude.
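These gradient properties are easy to verify numerically. A sketch of the three activations and their derivatives (the GELU here uses the common tanh approximation found in GPT-style models):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)  # peaks at 0.25 when x = 0; deep nets multiply these

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # 1 or 0: no shrinkage, but can go dead

def gelu(x):
    """Tanh approximation of GELU, as used in GPT-2/BERT-style models."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

print(sigmoid_grad(0.0))                 # 0.25, the maximum possible
print(relu_grad(0.5), relu_grad(-0.5))   # 1.0 0.0
print(round(gelu(1.0), 4))               # smooth, nonzero gradient everywhere
```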

04

Layers & Architecture

Depth is what makes a neural network "deep." Each layer extracts increasingly abstract features — edges in layer 1, shapes in layer 2, objects in layer 3. More layers means more abstraction, but also more parameters and harder training.

Interactive Network Builder
Input: 3 neurons | Hidden 1: 4 | Hidden 2: 4 | Output: 2
Layers: 4 | Total Neurons: 13 | Parameters: 46

GPT-4 is estimated at ~1.8 trillion parameters. Your custom network above has 46. The difference? Scale — plus clever architecture choices (attention, normalization, residual connections) that make training at that scale possible.
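The parameter count follows a simple rule: each fully connected layer has (inputs × outputs) weights plus one bias per output neuron. A quick check against the builder above:

```python
def count_params(layer_sizes):
    """Each layer pair contributes n_in * n_out weights plus n_out biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_params([3, 4, 4, 2]))  # 46, matching the network builder above
```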

05

Forward Propagation

Forward propagation is the process of passing data through the network from input to output. Each layer transforms its input: multiply by weights, add bias, apply activation. The final output is the network's prediction.

Step 1 — Input Layer

Raw data enters the network as a numeric vector. An image might be 784 pixel values; text might be a 768-dim embedding.

x₁ = 0.8
x₂ = 0.3
x₃ = 0.6
Input Layer
Multiply by Weights
Sum + Bias
Activation
Output
Animated — Data Flows Through the Network
Input (raw data) → Weights (× w + b) → Σ Sum (weighted sum) → ƒ Activation (ReLU/GELU) → Output (prediction)

Analogy — Factory Assembly Line: Forward propagation works like a factory floor. Raw materials (input data) enter station 1, where they're weighed and measured (multiply by weights). At station 2, components are assembled (sum + bias). Station 3 does quality control — rejecting bad signals, passing good ones (activation). The finished product (prediction) rolls off the end. Each station transforms the product one step further — and no station can see what happened earlier or later.

Forward propagation is just a sequence of matrix multiplications and nonlinear activations. For a transformer with 70B parameters, one forward pass over a 4K-token prompt involves hundreds of trillions of floating-point operations, typically taking on the order of hundreds of milliseconds across several GPUs.
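A forward pass can be sketched in plain Python as repeated multiply-add-activate steps. The network shape and weights below are made up for illustration:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(v, W, b):
    """One layer: multiply by weights, add bias.
    W is a list of rows, one row of input weights per output neuron."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, v)) + bias
            for row, bias in zip(W, b)]

def forward(x, layers):
    """Pass the input through every (W, b) pair, with ReLU between layers."""
    for i, (W, b) in enumerate(layers):
        x = linear(x, W, b)
        if i < len(layers) - 1:  # no activation on the final output layer
            x = relu(x)
    return x

# A tiny 3 -> 2 -> 1 network with illustrative weights
layers = [
    ([[0.2, -0.5, 0.1], [0.4, 0.3, -0.2]], [0.1, -0.1]),  # 3 -> 2
    ([[0.6, -0.4]], [0.05]),                               # 2 -> 1
]
print(forward([0.8, 0.3, 0.6], layers))  # a single-number prediction
```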

06

Loss Functions

The loss function measures how wrong the network is. It takes the network's prediction and the true answer, and outputs a single number — the error. The entire goal of training is to minimize this number.

Loss Explorer
Prediction (ŷ) vs. Loss
Prediction: 0.70 | True: 1 | Loss: 0.3567
-[y·log(ŷ) + (1-y)·log(1-ŷ)]
How It Works

Standard loss for binary classification. Heavily penalizes confident wrong predictions. Used in the final layer of networks predicting yes/no outcomes.

Intuition

Prediction = 0.70   True = 1

The model is uncertain but right

Loss = 0.3567 (moderate, room to improve)

Loss in LLMs

LLMs use cross-entropy loss over the entire vocabulary (e.g., 100K tokens). Each position in the sequence gets a loss based on how well the model predicted the actual next token. Total loss is averaged across all positions — this is the single number that training minimizes.
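Per-token loss is computed by turning the model's raw scores (logits) into probabilities with softmax, then taking the negative log of the correct token's probability. A minimal sketch over a toy four-token vocabulary:

```python
import math

def token_cross_entropy(logits, target_index):
    """Cross-entropy for one position: -log(softmax(logits)[target])."""
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    log_prob = (logits[target_index] - m) - math.log(sum(exps))
    return -log_prob

# Toy vocabulary of 4 tokens; the model's logits at one sequence position.
# The target is token 0, which also has the highest score, so the loss is low.
loss = token_cross_entropy([2.0, 0.5, 0.1, -1.0], target_index=0)
print(round(loss, 3))
```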

Drag the prediction slider and watch the loss change. Notice how cross-entropy produces an extremely steep gradient when the prediction is very wrong — this is by design. It forces the network to learn quickly from its worst mistakes.
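The steepness is visible directly in the numbers. A sketch of the formula from the explorer above:

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """-[y*log(p) + (1-y)*log(1-p)], the formula from the loss explorer."""
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

print(round(binary_cross_entropy(1, 0.70), 4))  # 0.3567, uncertain but right
print(round(binary_cross_entropy(1, 0.99), 4))  # 0.0101, confident and right
print(round(binary_cross_entropy(1, 0.01), 4))  # 4.6052, confident and wrong
```

Note how the loss for a confident wrong answer is over ten times the loss for an uncertain right one: that asymmetry is the steep gradient at work.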

07

Backpropagation

Backpropagation is how neural networks learn. After each prediction, error gradients flow backwards through the network, telling each weight how to adjust. It's just the calculus chain rule — applied millions of times.

Forward Pass

Data flows through the network, layer by layer, producing a prediction at the output. Each intermediate value is cached — these will be needed during the backward pass.

Input
Hidden 1
Hidden 2
Output
Vanishing Gradient Problem

As gradients flow backwards, they get multiplied at each layer. If activation derivatives are small (<1), gradients shrink exponentially — early layers barely learn. This is why sigmoid networks couldn't go deep.

Gradient magnitude by layer: Output 1.00 → Hidden 2 0.35 → Hidden 1 0.08 → Input 0.01

Solutions: ReLU activations, residual connections (skip connections), layer normalization, careful initialization.
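The exponential decay is simple arithmetic: in the best case each sigmoid layer multiplies the gradient by at most 0.25, while a residual connection adds an identity path whose derivative is 1. A back-of-the-envelope sketch:

```python
# Best-case gradient surviving n sigmoid layers: each multiplies by <= 0.25
for n in (2, 4, 8):
    print(n, 0.25 ** n)  # shrinks exponentially with depth

# A residual (skip) connection computes x + f(x), and d/dx [x + f(x)] = 1 + f'(x),
# so the backward signal always keeps a term of magnitude ~1 regardless of depth.
per_layer_factor = 1 + 0.25       # one layer's factor with the skip path included
print(per_layer_factor ** 8)      # grows rather than vanishes
```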

Animated — Gradient Flowing Backwards
Gradient magnitude: Output 1.00 → Hidden 3 0.60 → Hidden 2 0.25 → Hidden 1 0.08 → Input 0.02
Strong gradient → weakening → vanishing

Analogy — GPS Recalculating: Backpropagation is like a GPS recalculating your route. You took a wrong turn (the network made a bad prediction), and the GPS figures out exactly which turn was wrong and how far off course you are (gradient). It then adjusts your route step by step, starting from where you are now and working backwards to the original wrong turn. The further back the error, the smaller the correction — just like vanishing gradients in deep networks.

Backpropagation was popularized by Rumelhart, Hinton & Williams in 1986. The key insight: you don't need to compute each weight's gradient separately — the chain rule lets you reuse intermediate results, making it efficient enough to train networks with billions of parameters.
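The chain rule with cached intermediates can be shown on the smallest possible network: one hidden neuron, one output, squared-error loss. All names below are illustrative:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward_backward(x, y, w1, w2):
    """a = sigmoid(w1*x), pred = w2*a, loss = (pred - y)^2.
    Forward values (a, pred) are cached and reused in the backward pass."""
    # Forward pass: cache every intermediate
    a = sigmoid(w1 * x)
    pred = w2 * a
    loss = (pred - y) ** 2

    # Backward pass: chain rule, reusing the cached values
    dloss_dpred = 2 * (pred - y)
    grad_w2 = dloss_dpred * a                   # dpred/dw2 = a (cached)
    da_dz = a * (1 - a)                         # sigmoid' from cached activation
    grad_w1 = dloss_dpred * w2 * da_dz * x      # chain through pred -> a -> z
    return loss, grad_w1, grad_w2

loss, g1, g2 = forward_backward(x=1.0, y=1.0, w1=0.5, w2=0.5)
print(loss, g1, g2)  # both gradients are negative: increase the weights
```

A finite-difference check (nudge a weight, re-run the forward pass, compare slopes) confirms the analytic gradients, which is exactly how backprop implementations are tested in practice.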

08

The Training Loop

Training is just this loop repeated millions of times: forward pass, compute loss, backward pass, update weights. An epoch is one full pass through the dataset. GPT-3 trained for ~300B tokens — equivalent to reading the internet multiple times.

Forward Pass

Feed a batch of training examples through the network. For each example, the network produces a prediction. With a batch of 32 examples, this means 32 forward passes in parallel (GPUs excel at this).

Forward Pass
Compute Loss
Backward Pass
Update Weights
Loss Curve — Live
Learning Rate: 0.010 (good range)
Epoch: 0 | Loss: 2.800

Analogy — Apprentice Chef: Training a neural network is like an apprentice learning to cook. Each attempt (epoch), you taste the dish (forward pass), compare it to the master chef's version (compute loss), figure out what went wrong — too much salt, undercooked (backprop) — and adjust the recipe (update weights). The learning rate is how boldly you change things: too aggressive and the dish becomes inedible; too timid and you never improve. After thousands of attempts, the apprentice matches the master.

Try dragging the learning rate. Too high and the loss oscillates or explodes — the network overshoots the minimum. Too low and it barely moves. Modern optimizers like AdamW adapt the rate per-parameter, but choosing the right base rate is still critical. LLaMA 2 used a cosine schedule starting at 3×10⁻⁴.
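The overshoot effect can be reproduced on a one-parameter toy loss, where each gradient-descent step scales the error by (1 - 2·lr). The loop below is illustrative, not a real network:

```python
def train(lr, epochs=50, w=0.0, target=3.0):
    """Minimal training loop on loss = (w - target)^2:
    forward (loss), backward (gradient), update (gradient descent)."""
    for _ in range(epochs):
        loss = (w - target) ** 2   # forward pass + loss
        grad = 2 * (w - target)    # backward pass (analytic gradient)
        w -= lr * grad             # weight update
    return w, (w - target) ** 2

print(train(lr=0.1))    # converges: error shrinks by 0.8x per step
print(train(lr=1.05))   # diverges: error grows by 1.1x per step, loss explodes
```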

09

Architecture Zoo

Different problems need different architectures. Each one makes different tradeoffs between memory, parallelism, and the types of patterns it can learn. The transformer won because it parallelizes beautifully and scales to trillions of parameters.

Feedforward (MLP)

The simplest neural network — data flows in one direction, from input to output, through fully connected layers. Each neuron connects to every neuron in the next layer. Good for tabular data and as building blocks inside larger architectures.

Strengths

  • Simple to understand and implement
  • Fast training
  • Universal approximator

Weaknesses

  • No memory of past inputs
  • Can't handle sequences
  • Doesn't understand spatial structure

Scale: ~millions of parameters
Feedforward (MLP) — Architecture

Used In

Classification, regression, FFN layers inside transformers

The history of deep learning is a history of architecture innovation: feedforward → CNNs → RNNs → attention → transformers. Each solved a fundamental limitation of the previous. The transformer's ability to parallelize and scale is why we have GPT-4 and Claude today.

10

From NNs to LLMs

Every LLM is a neural network — specifically, a transformer. Everything you've learned on this page is the foundation. Here's the 68-year journey from a single artificial neuron to models that write code, pass bar exams, and reason about the world.

Timeline — Neural Networks to LLMs
1958

Perceptron

Frank Rosenblatt builds the first artificial neuron — can learn simple linear patterns.

1969

XOR Problem

Minsky & Papert show perceptrons can't learn XOR. First AI winter begins.

1986

Backpropagation

Rumelhart, Hinton & Williams popularize backprop. Multi-layer networks become trainable.

1998

LeNet / CNNs

Yann LeCun's CNN reads handwritten digits. Convolutional networks prove their power.

2012

AlexNet

Deep CNN wins the ImageNet competition, beating the runner-up's error rate by roughly 10 percentage points. GPU training changes everything. Deep learning era begins.

2014

Attention

Bahdanau et al. introduce attention for machine translation — the seed of the transformer.

2017

Transformer

"Attention Is All You Need" (Vaswani et al.). Replaces RNNs with pure attention. Parallelizable, scalable.

2018

BERT & GPT

Pre-training on massive text, then fine-tuning. Transfer learning comes to NLP.

2020

GPT-3

175B parameters. Few-shot learning emerges — the model can do tasks it wasn't trained for.

2022

ChatGPT

RLHF + instruction tuning. Language models become conversational assistants.

2024–26

Frontier Models

GPT-4, Claude, Gemini. Multi-modal, tool-using, agent-capable. Trillions of parameters.


Continue: How LLMs Work

Now that you understand neural networks, explore the full LLM pipeline — transformers, tokenization, training, inference, and streaming.

The jump from GPT-2 (1.5B) to GPT-3 (175B) revealed emergent abilities — capabilities that appeared suddenly at scale, not gradually. This is why the field moved so fast: scaling neural networks didn't just make them better, it made them qualitatively different.