What Is a Neural Network?
A neural network is a function approximator. The Universal Approximation Theorem says that with enough neurons, a network can approximate any continuous function to arbitrary accuracy. The theorem only guarantees such a network exists; training is how we find it. This result is why neural networks power everything from image recognition to language models.
Every artificial neuron mirrors a biological one: inputs arrive (dendrites), are weighted and summed (cell body), pass through a nonlinear activation (axon hillock), and produce an output (axon terminal). Stack millions of these and you get a network that can learn nearly anything.
The Single Neuron
The perceptron — invented by Frank Rosenblatt in 1958 — is the simplest neural network: a single neuron that takes weighted inputs, adds a bias, and fires if the result exceeds a threshold. It can learn any linearly separable function.
The step function gives a hard 0/1 decision (original perceptron). Sigmoid gives a smooth probability — needed for gradient-based learning.
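A perceptron is short enough to write out in full. The sketch below (NumPy, with hand-picked rather than learned weights) implements a step-activation neuron computing logical AND, one of the linearly separable functions a perceptron can represent:

```python
import numpy as np

def step(z):
    """Hard 0/1 threshold: the original perceptron activation."""
    return (z > 0).astype(float)

def perceptron(x, w, b):
    # Weighted sum of inputs plus bias, then threshold.
    return step(np.dot(x, w) + b)

# Hand-picked weights implementing logical AND: the weighted sum
# exceeds 0 only when both inputs are 1 (1 + 1 - 1.5 = 0.5 > 0).
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", perceptron(np.array(x, dtype=float), w, b))
```

Swapping `step` for a sigmoid turns the hard decision into a smooth probability, which is what makes gradient-based learning possible.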
Example target functions: AND, OR, XOR.
Minsky & Papert proved in 1969 that a single perceptron cannot learn XOR, a result that helped trigger the first "AI winter." The solution? Add a hidden layer. Multi-layer networks can approximate any continuous function, not just linearly separable ones. This insight led to the modern deep learning era.
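To see why one hidden layer fixes XOR, here is a minimal sketch with hand-set (not learned) weights: one hidden unit computes OR, the other computes NAND, and the output unit ANDs them together, since XOR = OR AND NAND:

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)

# Hidden layer: unit 1 is OR, unit 2 is NAND.
W1 = np.array([[ 1.0,  1.0],    # OR:   x1 + x2 - 0.5 > 0
               [-1.0, -1.0]])   # NAND: -x1 - x2 + 1.5 > 0
b1 = np.array([-0.5, 1.5])

# Output layer: AND of the two hidden units.
W2 = np.array([1.0, 1.0])
b2 = -1.5

def xor_net(x):
    h = step(W1 @ x + b1)     # hidden layer activations
    return step(W2 @ h + b2)  # output neuron

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", xor_net(np.array(x, dtype=float)))
```

The hidden layer re-represents the inputs so that the final neuron faces a linearly separable problem, which is exactly what a single perceptron could never do on its own.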
Activation Functions
Activation functions introduce nonlinearity — without them, stacking layers would just be matrix multiplication, and the network could only learn linear relationships. The choice of activation deeply affects training dynamics.
How It Works
ReLU (Rectified Linear Unit): zero for negatives, linear for positives. Simple, fast, and the default for most networks. Can cause "dead neurons" if a unit's inputs are always negative, because its gradient is then zero and it stops learning.
Used In
CNNs, most feedforward networks
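For concreteness, here is a NumPy sketch of the common activations. The GELU shown is the tanh approximation popularized by GPT-style models; its values differ very slightly from the exact GELU:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x**3)))

xs = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("relu", relu), ("sigmoid", sigmoid),
                ("tanh", np.tanh), ("gelu", gelu)]:
    print(f"{name:8s}", np.round(f(xs), 3))
```

Note how ReLU zeroes out all negative inputs while GELU lets slightly negative values through with a small weight; that smoothness is the refinement discussed below.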
Why Gradients Matter
The gradient of the activation function determines how well error signals flow backwards during training. Sigmoid and tanh squish gradients toward zero in deep networks (the vanishing gradient problem). ReLU largely solved this, and GELU refined it further; GELU is now the standard choice in most modern transformers, including BERT and the GPT family.
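The shrinkage is easy to quantify. Sigmoid's derivative never exceeds 0.25, so even in the best case (every pre-activation exactly zero) a 10-layer sigmoid net scales the gradient reaching layer 1 by 0.25^10:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # peaks at 0.25 when z = 0

# Backprop multiplies one activation derivative per layer.
depth = 10
grad = 1.0
for _ in range(depth):
    grad *= dsigmoid(0.0)  # best case for sigmoid
print(f"sigmoid, {depth} layers: gradient scaled by {grad:.2e}")
```

ReLU's derivative is exactly 1 for active units, so the same product stays at 1.0 and gradients survive arbitrary depth, as long as the units don't die.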
Layers & Architecture
Depth is what makes a neural network "deep." Each layer extracts increasingly abstract features — edges in layer 1, shapes in layer 2, objects in layer 3. More layers means more abstraction, but also more parameters and harder training.
GPT-4 is estimated at ~1.8 trillion parameters. Your custom network above has 46. The difference? Scale — plus clever architecture choices (attention, normalization, residual connections) that make training at that scale possible.
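Counting the parameters of a fully connected network is simple arithmetic: each layer contributes in×out weights plus out biases. The layer sizes below are illustrative; 3-4-4-2 is just one layout that happens to total 46, and the interactive network above may be wired differently:

```python
def count_params(sizes):
    """Fully connected net: each layer has n_in*n_out weights + n_out biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes, sizes[1:]))

print(count_params([3, 4, 4, 2]))    # 46
print(count_params([784, 128, 10]))  # 101770: a small MNIST classifier
```

The same formula, applied layer by layer to attention and feedforward blocks, is how parameter counts in the billions are tallied for large models.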
Forward Propagation
Forward propagation is the process of passing data through the network from input to output. Each layer transforms its input: multiply by weights, add bias, apply activation. The final output is the network's prediction.
Raw data enters the network as a numeric vector. An image might be 784 pixel values; text might be a 768-dim embedding.
Analogy — Factory Assembly Line: Forward propagation works like a factory floor. Raw materials (input data) enter station 1, where they're weighed and measured (multiply by weights). At station 2, components are assembled (sum + bias). Station 3 does quality control — rejecting bad signals, passing good ones (activation). The finished product (prediction) rolls off the end. Each station transforms the product one step further — and no station can see what happened earlier or later.
Forward propagation is just a sequence of matrix multiplications and nonlinear activations. For a transformer with 70B parameters, one forward pass over a 4K-token prompt involves trillions of floating-point operations — taking ~500ms on 8 GPUs.
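That sequence of operations fits in a few lines. Here is a minimal NumPy sketch of a forward pass with toy weights, a ReLU hidden layer, and a linear output:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, layers):
    """layers: list of (W, b) pairs. ReLU on hidden layers, raw output on the last."""
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b                              # weights, then bias
        a = relu(z) if i < len(layers) - 1 else z  # nonlinearity
    return a

# Toy 2 -> 2 -> 1 network with hand-set weights.
W1 = np.array([[1.0, -1.0], [0.5, 0.5]]); b1 = np.array([0.0, -0.5])
W2 = np.array([[1.0, 1.0]]);              b2 = np.array([0.1])

x = np.array([1.0, 2.0])
print(forward(x, [(W1, b1), (W2, b2)]))  # [1.1]
```

A 70B-parameter transformer does exactly this, only the weight matrices are enormous and the loop runs over dozens of attention and feedforward blocks.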
Loss Functions
The loss function measures how wrong the network is. It takes the network's prediction and the true answer, and outputs a single number — the error. The entire goal of training is to minimize this number.
Binary cross-entropy: the standard loss for binary classification. It heavily penalizes confident wrong predictions, and is used in the final layer of networks predicting yes/no outcomes.
Intuition
Prediction = 0.70, True = 1
The model is uncertain but right
Loss = 0.3567 — moderate, room to improve
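The numbers in this example are easy to reproduce. Binary cross-entropy for a single prediction p with true label y is -(y·log p + (1-y)·log(1-p)):

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(round(bce(0.70, 1), 4))  # 0.3567: uncertain but right, moderate loss
print(round(bce(0.99, 1), 4))  # 0.0101: confident and right, tiny loss
print(round(bce(0.01, 1), 4))  # 4.6052: confident and WRONG, huge loss
```

The asymmetry is the point: being confidently wrong costs hundreds of times more than being confidently right saves.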
Loss in LLMs
LLMs use cross-entropy loss over the entire vocabulary (e.g., 100K tokens). Each position in the sequence gets a loss based on how well the model predicted the actual next token. Total loss is averaged across all positions — this is the single number that training minimizes.
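A sketch of the same idea at the token level (toy logits; real models compute this over the full vocabulary at every position). One useful sanity check: with uniform logits over a 100K-token vocabulary, the loss is ln(100,000) ≈ 11.51, which is roughly where an LLM's training curve starts at random initialization:

```python
import numpy as np

def next_token_loss(logits, target):
    """Cross-entropy of the true next token under softmax(logits)."""
    logits = logits - logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum()) # log softmax
    return -log_probs[target]

# A model that knows nothing: uniform logits over a 100K vocabulary.
vocab = 100_000
print(next_token_loss(np.zeros(vocab), target=42))    # ~11.51

# Sequence loss is the average over positions (toy 5-token vocab).
logits_per_pos = np.array([[2.0, 0.1, 0.0, -1.0, 0.5],
                           [0.0, 3.0, 0.2, 0.1, -0.5]])
targets = [0, 1]
losses = [next_token_loss(l, t) for l, t in zip(logits_per_pos, targets)]
print(np.mean(losses))
```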
Drag the prediction slider and watch the loss change. Notice how cross-entropy produces an extremely steep gradient when the prediction is very wrong — this is by design. It forces the network to learn quickly from its worst mistakes.
Backpropagation
Backpropagation is how neural networks learn. After each prediction, error gradients flow backwards through the network, telling each weight how to adjust. It's just the calculus chain rule — applied millions of times.
Data flows through the network, layer by layer, producing a prediction at the output. Each intermediate value is cached — these will be needed during the backward pass.
As gradients flow backwards, they get multiplied at each layer. If activation derivatives are small (<1), gradients shrink exponentially — early layers barely learn. This is why sigmoid networks couldn't go deep.
Solutions: ReLU activations, residual connections (skip connections), layer normalization, careful initialization.
Analogy — GPS Recalculating: Backpropagation is like a GPS recalculating your route. You took a wrong turn (the network made a bad prediction), and the GPS figures out exactly which turn was wrong and how far off course you are (gradient). It then adjusts your route step by step, starting from where you are now and working backwards to the original wrong turn. The further back the error, the smaller the correction — just like vanishing gradients in deep networks.
Backpropagation was popularized by Rumelhart, Hinton & Williams in 1986. The key insight: you don't need to compute each weight's gradient separately — the chain rule lets you reuse intermediate results, making it efficient enough to train networks with billions of parameters.
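The chain-rule bookkeeping can be checked directly. The sketch below backpropagates through a tiny 2-2-1 sigmoid network with squared-error loss, then verifies one gradient against a finite-difference estimate; the weights and data are toy values for illustration only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)
x, y = np.array([0.5, -1.0]), 1.0

def loss(W1):
    a1 = sigmoid(W1 @ x + b1)
    y_hat = sigmoid(W2 @ a1 + b2)[0]
    return (y_hat - y) ** 2

# --- forward pass, caching intermediate values ---
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y_hat = sigmoid(z2)[0]

# --- backward pass: chain rule, layer by layer ---
dy_hat = 2 * (y_hat - y)            # dL/dy_hat
dz2 = dy_hat * y_hat * (1 - y_hat)  # through the output sigmoid
dW2 = dz2 * a1                      # gradient for output weights
da1 = dz2 * W2[0]                   # error flowing into hidden layer
dz1 = da1 * a1 * (1 - a1)           # through the hidden sigmoid
dW1 = np.outer(dz1, x)              # gradient for hidden weights

# --- numerical check: backprop matches finite differences ---
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
numeric = (loss(Wp) - loss(Wm)) / (2 * eps)
print(dW1[0, 0], numeric)  # the two estimates agree
```

The finite-difference check is the standard way to debug a hand-written backward pass: if the two numbers disagree, the chain rule was applied incorrectly somewhere.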
The Training Loop
Training is just this loop repeated millions of times: forward pass, compute loss, backward pass, update weights. An epoch is one full pass through the dataset. GPT-3 trained on ~300B tokens, far more text than a person could read in many lifetimes.
Feed a batch of training examples through the network. For each example, the network produces a prediction. With a batch of 32 examples, this means 32 forward passes in parallel (GPUs excel at this).
Analogy — Apprentice Chef: Training a neural network is like an apprentice learning to cook. Each attempt (epoch), you taste the dish (forward pass), compare it to the master chef's version (compute loss), figure out what went wrong — too much salt, undercooked (backprop) — and adjust the recipe (update weights). The learning rate is how boldly you change things: too aggressive and the dish becomes inedible; too timid and you never improve. After thousands of attempts, the apprentice matches the master.
Try dragging the learning rate. Too high and the loss oscillates or explodes — the network overshoots the minimum. Too low and it barely moves. Modern optimizers like AdamW adapt the rate per-parameter, but choosing the right base rate is still critical. LLaMA 2 used a cosine schedule starting at 3×10⁻⁴.
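The whole loop fits in a few lines. Below is a sketch using logistic regression (a single sigmoid neuron) on toy 1-D data, so the effect of the learning rate is directly visible; the data and rates are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D binary classification: points above 1.5 are class 1.
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = np.array([0.0, 0.0, 1.0, 1.0])

def train(lr, epochs=200):
    w, b = 0.0, 0.0
    losses = []
    for _ in range(epochs):
        p = sigmoid(w * X + b)                                  # forward pass
        loss = -np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))  # compute loss
        dw = np.mean((p - Y) * X)                               # backward pass
        db = np.mean(p - Y)                                     # (BCE + sigmoid)
        w -= lr * dw                                            # update weights
        b -= lr * db
        losses.append(loss)
    return losses

for lr in [0.01, 0.5]:
    losses = train(lr)
    print(f"lr={lr}: loss {losses[0]:.3f} -> {losses[-1]:.3f}")
```

With lr=0.01 the loss barely moves in 200 epochs; with lr=0.5 it drops quickly. Crank the rate high enough and the updates overshoot and the loss climbs instead, which is exactly what the slider above demonstrates.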
Architecture Zoo
Different problems need different architectures. Each one makes different tradeoffs between memory, parallelism, and the types of patterns it can learn. The transformer won because it parallelizes beautifully and scales to trillions of parameters.
The simplest neural network — data flows in one direction, from input to output, through fully connected layers. Each neuron connects to every neuron in the next layer. Good for tabular data and as building blocks inside larger architectures.
Strengths
- Simple to understand and implement
- Fast training
- Universal approximator
Weaknesses
- No memory of past inputs
- Can't handle sequences
- Doesn't understand spatial structure
Used In
Classification, regression, FFN layers inside transformers
The history of deep learning is a history of architecture innovation: feedforward → CNNs → RNNs → attention → transformers. Each solved a fundamental limitation of the previous. The transformer's ability to parallelize and scale is why we have GPT-4 and Claude today.
From NNs to LLMs
Every LLM is a neural network — specifically, a transformer. Everything you've learned on this page is the foundation. Here's the 68-year journey from a single artificial neuron to models that write code, pass bar exams, and reason about the world.
Perceptron
Frank Rosenblatt builds the first artificial neuron — can learn simple linear patterns.
XOR Problem
Minsky & Papert show perceptrons can't learn XOR. First AI winter begins.
Backpropagation
Rumelhart, Hinton & Williams popularize backprop. Multi-layer networks become trainable.
LeNet / CNNs
Yann LeCun's CNN reads handwritten digits. Convolutional networks prove their power.
AlexNet
Deep CNN wins the ImageNet competition, cutting the error rate by roughly 10 percentage points over the runner-up. GPU training changes everything. Deep learning era begins.
Attention
Bahdanau et al. introduce attention for machine translation — the seed of the transformer.
Transformer
"Attention Is All You Need" (Vaswani et al.). Replaces RNNs with pure attention. Parallelizable, scalable.
BERT & GPT
Pre-training on massive text, then fine-tuning. Transfer learning comes to NLP.
GPT-3
175B parameters. Few-shot learning emerges — the model can do tasks it wasn't trained for.
ChatGPT
RLHF + instruction tuning. Language models become conversational assistants.
Frontier Models
GPT-4, Claude, Gemini. Multi-modal, tool-using, agent-capable. Trillions of parameters.
Continue: How LLMs Work
Now that you understand neural networks, explore the full LLM pipeline — transformers, tokenization, training, inference, and streaming.
The jump from GPT-2 (1.5B) to GPT-3 (175B) revealed emergent abilities — capabilities that appeared suddenly at scale, not gradually. This is why the field moved so fast: scaling neural networks didn't just make them better, it made them qualitatively different.