The Big Picture
Neural networks are function approximators. Given input-output pairs, they learn a function that maps inputs to outputs, and with enough capacity and data they can fit very complex mappings.
They’re inspired by biological neurons but are really just compositions of simple mathematical operations: linear transformations followed by nonlinear activations, stacked into layers.
The Perceptron
The simplest neural network is a single perceptron, a linear model followed by a step function:
$$\hat{y} = \mathrm{step}(\mathbf{w}^\top \mathbf{x} + b)$$
The step function outputs 1 if the sum is positive, 0 otherwise. This can model any linearly separable problem (like AND, OR) but famously cannot model XOR.
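As a rough illustration of the claim above, the sketch below trains a perceptron with the classic perceptron learning rule on AND, OR, and XOR. The helper names (`train_perceptron`, `accuracy`), the learning rate, and the epoch count are illustrative assumptions, not part of the original text.

```python
def step(z):
    """Heaviside step: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def perceptron(x, w, b):
    """Single perceptron: step(w . x + b)."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(data, epochs=20, lr=1.0):
    """Perceptron learning rule on a list of (inputs, target) pairs."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, t in data:
            err = t - perceptron(x, w, b)          # -1, 0, or +1
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def accuracy(data, w, b):
    """Fraction of examples the perceptron classifies correctly."""
    return sum(perceptron(x, w, b) == t for x, t in data) / len(data)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [(x, int(x[0] and x[1])) for x in INPUTS]
OR = [(x, int(x[0] or x[1])) for x in INPUTS]
XOR = [(x, x[0] ^ x[1]) for x in INPUTS]

for name, data in [("AND", AND), ("OR", OR), ("XOR", XOR)]:
    w, b = train_perceptron(data)
    print(name, accuracy(data, w, b))
```

AND and OR reach perfect accuracy because a single line separates their classes. XOR never does, however long training runs, since no such line exists.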
Historical note: Frank Rosenblatt introduced the perceptron in 1958. Minsky and Papert’s 1969 demonstration that a single-layer perceptron cannot learn XOR contributed to the first “AI winter.” The solution, multiple layers, did not gain traction until backpropagation was popularized in the mid-1980s.
From Perceptrons to Neurons
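Modern neural networks replace the step function with an activation function $\sigma$ that is differentiable (at least almost everywhere), so that gradients can flow through it:
$$a = \sigma(\mathbf{w}^\top \mathbf{x} + b)$$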
Common Activation Functions
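| Function | Formula | Range | Pros |
|---|---|---|---|
| Sigmoid | $\sigma(z) = \frac{1}{1 + e^{-z}}$ | $(0, 1)$ | Outputs probabilities |
| Tanh | $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$ | $(-1, 1)$ | Zero-centered |
| ReLU | $\max(0, z)$ | $[0, \infty)$ | Fast, mitigates vanishing gradients |
| Leaky ReLU | $\max(\alpha z, z)$ with $0 < \alpha < 1$ | $(-\infty, \infty)$ | Avoids “dead neurons” |
| GELU | $z \, \Phi(z)$, with $\Phi$ the standard normal CDF | $\approx [-0.17, \infty)$ | Smooth ReLU, used in transformers |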
ReLU is the default choice for hidden layers. It’s simple, fast, and works remarkably well.
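The activations named above can be sketched in a few lines of standard-library Python. This is a minimal scalar version for illustration. The leaky slope `alpha=0.01` is an assumed, commonly used value, and `sigmoid` is not hardened against overflow for very negative inputs.

```python
import math

def sigmoid(z):
    """Squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Squashes z into (-1, 1), centered at zero."""
    return math.tanh(z)

def relu(z):
    """Passes positive values through, zeroes out the rest."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope alpha for negative inputs."""
    return z if z > 0 else alpha * z

def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z), gelu(z))
```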
Multi-Layer Networks
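The real power comes from stacking layers. Each layer $l$ applies a linear transformation to the previous layer’s output, followed by a nonlinear activation:
$$\mathbf{h}^{(l)} = \sigma\!\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad \mathbf{h}^{(0)} = \mathbf{x}$$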
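To make the layer stacking concrete, here is a small sketch of a two-layer network that computes XOR, the function a single perceptron cannot represent. The weights are hand-picked for illustration rather than learned, and the `dense` helper is a hypothetical name.

```python
def relu(z):
    return max(0.0, z)

def dense(x, W, b, activation=None):
    """One layer: activation(W x + b), with W given as a list of rows."""
    out = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [activation(o) for o in out] if activation else out

# Hand-picked (not learned) weights, chosen so the network computes XOR.
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[1.0, -2.0]], [0.0]

def xor_net(x):
    h = dense(x, W1, b1, relu)   # hidden layer: linear map, then ReLU
    return dense(h, W2, b2)[0]   # linear output layer

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(x))
```

The hidden layer bends the input space so that the output layer, which is still linear, can separate the two classes.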
Why Depth Matters
A network with even a single hidden layer can, given enough hidden units, approximate any continuous function on a compact domain to arbitrary accuracy (the Universal Approximation Theorem). But in practice:
- Deeper networks learn hierarchical features (edges -> textures -> objects)
- Wider networks add capacity to memorize, which does not by itself improve generalization
- For the same parameter count, deeper networks often outperform wider ones
How Networks Learn: Backpropagation
Training a neural network means finding weights that minimize a loss function. The algorithm has two phases:
Forward Pass
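Compute the output by passing the input through each layer in turn, where $L$ is the number of layers:
$$\mathbf{h}^{(0)} = \mathbf{x}, \qquad \mathbf{h}^{(l)} = \sigma\!\left(W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\right) \ \text{for } l = 1, \dots, L, \qquad \hat{\mathbf{y}} = \mathbf{h}^{(L)}$$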
Backward Pass (Backpropagation)
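Compute the gradient of the loss $\mathcal{L}$ with respect to every weight using the chain rule, where $\mathbf{h}^{(l)}$ is the output of layer $l$ and $L$ is the last layer:
$$\frac{\partial \mathcal{L}}{\partial W^{(l)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(L)}} \, \frac{\partial \mathbf{h}^{(L)}}{\partial \mathbf{h}^{(L-1)}} \cdots \frac{\partial \mathbf{h}^{(l+1)}}{\partial \mathbf{h}^{(l)}} \, \frac{\partial \mathbf{h}^{(l)}}{\partial W^{(l)}}$$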
The gradients flow backward through the network, telling each weight how to change to reduce the loss.
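The forward and backward passes can be traced by hand on a deliberately tiny network. The sketch below assumes one scalar input, one tanh hidden unit, a linear output, and a squared-error loss. All of these choices and the parameter values are illustrative. It then compares the chain-rule gradients against finite differences as a sanity check.

```python
import math

def forward(params, x, y):
    """Return the loss and the intermediate values needed for backprop."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2
    return loss, (z, h, y_hat)

def backward(params, x, y):
    """Chain rule, applied from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    _, (z, h, y_hat) = forward(params, x, y)
    d_yhat = 2.0 * (y_hat - y)       # dL/dy_hat
    d_w2 = d_yhat * h                # dL/dw2
    d_b2 = d_yhat                    # dL/db2
    d_h = d_yhat * w2                # dL/dh
    d_z = d_h * (1.0 - h ** 2)       # tanh'(z) = 1 - tanh(z)^2
    d_w1 = d_z * x                   # dL/dw1
    d_b1 = d_z                       # dL/db1
    return [d_w1, d_b1, d_w2, d_b2]

def numerical_grad(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads

params = [0.5, -0.3, 0.8, 0.1]       # arbitrary starting values
analytic = backward(params, x=1.5, y=2.0)
numeric = numerical_grad(params, x=1.5, y=2.0)
print(analytic)
print(numeric)
```

The two gradient lists should agree to several decimal places. A mismatch in a check like this usually points to a bug in the backward pass.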
Weight Update
$$W \leftarrow W - \eta \, \frac{\partial \mathcal{L}}{\partial W}$$
Here $\eta$ is the learning rate. This is gradient descent applied to neural networks. In practice, we use mini-batch stochastic gradient descent, often with optimizers like Adam. The same optimization principles apply here as in Linear Regression, but scaled to millions of parameters.
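A minimal sketch of mini-batch stochastic gradient descent, using a two-parameter linear model so the gradients can be written out by hand. The synthetic data (noiseless points on y = 2x + 1), the learning rate, and the batch size are assumptions made for illustration.

```python
import random

random.seed(0)
xs = [i / 10 for i in range(-20, 21)]
data = [(x, 2.0 * x + 1.0) for x in xs]    # noiseless targets on y = 2x + 1

w, b = 0.0, 0.0
lr, batch_size = 0.1, 8

for epoch in range(200):
    random.shuffle(data)                    # new batch composition each epoch
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Gradients of the mean squared error over this mini-batch.
        gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        # Weight update: step against the gradient.
        w -= lr * gw
        b -= lr * gb

print(w, b)
```

Each update uses only a handful of examples, so the gradient is a noisy estimate of the full-data gradient. On this toy problem the parameters still settle close to the true slope and intercept.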
The Training Loop
    for epoch in range(num_epochs):
        for batch in data_loader:
            # 1. Forward pass
            predictions = model(batch.inputs)
            loss = loss_function(predictions, batch.targets)
            # 2. Backward pass
            loss.backward()
            # 3. Update weights
            optimizer.step()
            # Clear gradients so they do not accumulate into the next batch
            optimizer.zero_grad()
This loop runs for tens to hundreds of epochs until the loss converges.
Loss Functions
The loss function defines what the network optimizes:
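| Task | Loss Function | Formula |
|---|---|---|
| Regression | MSE | $\frac{1}{n} \sum_{i} (y_i - \hat{y}_i)^2$ |
| Binary Classification | Binary Cross-Entropy | $-\frac{1}{n} \sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right]$ |
| Multi-class Classification | Cross-Entropy | $-\frac{1}{n} \sum_{i} \sum_{c} y_{i,c} \log \hat{y}_{i,c}$ |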
Minimizing cross-entropy loss is equivalent to maximizing the likelihood of the observed labels under the model’s predicted class probabilities.
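The three losses above can be sketched directly from their standard definitions. This is a plain-Python illustration. The `eps` clipping is an added assumption to avoid taking the logarithm of zero, and the function names are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """y_true holds 0/1 labels; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)     # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def cross_entropy(y_true_onehot, p_pred, eps=1e-12):
    """y_true_onehot holds one-hot rows; p_pred holds rows of class probabilities."""
    total = 0.0
    for t_row, p_row in zip(y_true_onehot, p_pred):
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true_onehot)

print(mse([1.0, 2.0], [1.0, 4.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.2]))
print(cross_entropy([[0, 1, 0]], [[0.2, 0.5, 0.3]]))
```

With one-hot targets, the cross-entropy for an example is just the negative log of the probability assigned to the correct class, which is why minimizing it maximizes the likelihood of the labels.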
Optimizers
Plain gradient descent is slow. Modern optimizers speed it up with momentum, per-parameter adaptive learning rates, or both:
SGD with Momentum
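$$\mathbf{v}_t = \beta \, \mathbf{v}_{t-1} + \nabla_\theta \mathcal{L}(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - \eta \, \mathbf{v}_t$$
Here $\beta$ is the momentum coefficient and $\eta$ the learning rate. Momentum accelerates convergence by accumulating past gradients, like a ball rolling downhill.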
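A small experiment can illustrate the rolling-ball intuition. The sketch below minimizes an ill-conditioned quadratic bowl with and without momentum and counts the steps needed to get close to the minimum. The objective, learning rate, momentum coefficient of 0.9, and tolerance are all assumptions picked for the demonstration.

```python
def grad(theta):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2), an ill-conditioned bowl."""
    x, y = theta
    return [x, 100.0 * y]

def steps_to_converge(lr, beta, tol=1e-3, max_steps=10_000):
    """Run gradient descent with momentum; return steps until both coordinates are within tol of 0."""
    theta, v = [1.0, 1.0], [0.0, 0.0]
    for step in range(1, max_steps + 1):
        g = grad(theta)
        v = [beta * vi + gi for vi, gi in zip(v, g)]          # accumulate past gradients
        theta = [ti - lr * vi for ti, vi in zip(theta, v)]    # step along the velocity
        if max(abs(t) for t in theta) < tol:
            return step
    return max_steps

plain_steps = steps_to_converge(lr=0.01, beta=0.0)     # beta = 0 reduces to plain gradient descent
momentum_steps = steps_to_converge(lr=0.01, beta=0.9)
print(plain_steps, momentum_steps)
```

The steep direction forces a small learning rate, which makes plain gradient descent crawl along the shallow direction. Momentum builds up speed there and needs far fewer steps on this problem.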
Adam (Adaptive Moment Estimation)
Combines momentum with per-parameter adaptive learning rates. It’s the default optimizer for most deep learning tasks:
- Adapts the learning rate using running estimates of the first moment (mean) and second moment (uncentered variance) of the gradients
- Works well out of the box with a learning rate around $10^{-3}$, the default suggested in the Adam paper
- Handles sparse gradients and noisy objectives
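The update behind these bullets can be sketched as a single function. The default hyperparameters below follow the values suggested by Kingma & Ba (2014). The list-based parameter layout and the name `adam_step` are illustrative assumptions, not a framework API.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update over lists of parameters; t is the 1-based step count."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * g          # first moment (running mean)
        vi = beta2 * vi + (1 - beta2) * g * g      # second moment (uncentered variance)
        m_hat = mi / (1 - beta1 ** t)              # bias correction for the zero init
        v_hat = vi / (1 - beta2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

theta, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
grad = [0.001, 1000.0]        # two gradients of very different scale
theta, m, v = adam_step(theta, grad, m, v, t=1)
print(theta)
```

Although the two gradients differ by six orders of magnitude, both parameters move by roughly the learning rate on the first step. That normalization is the per-parameter adaptation described above.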
Regularization Techniques
Neural networks are powerful but prone to overfitting. Key techniques to prevent it:
Dropout
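Randomly set a fraction $p$ of neurons to zero during training:
$$\tilde{\mathbf{h}} = \mathbf{m} \odot \mathbf{h}, \qquad m_i \sim \mathrm{Bernoulli}(1 - p)$$
This forces the network to learn redundant representations, since no single neuron can be relied upon. At test time, all neurons are used, with weights scaled by $1 - p$ so that expected activations match those seen during training.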
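A minimal sketch of dropout on a list of activations. Note that it uses the "inverted" variant, which rescales the surviving units by 1/(1 - p) during training so that nothing needs to be scaled at test time. This is a swapped-in but widely used formulation, not the test-time weight scaling described above, and the function signature is an assumption.

```python
import random

def dropout(activations, p, training, rng):
    """Inverted dropout: zero each unit with probability p, rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)          # test time: every unit is used unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                    # seeded for repeatability
h = [1.0] * 10_000
train_out = dropout(h, p=0.5, training=True, rng=rng)
test_out = dropout(h, p=0.5, training=False, rng=rng)

print(sum(1 for a in train_out if a == 0.0) / len(train_out))   # fraction dropped
print(sum(train_out) / len(train_out))                          # mean stays near 1
```

Either scaling convention keeps the expected activation the same between training and inference.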
Weight Decay (L2 Regularization)
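Add a penalty on the weight magnitudes, with $\lambda$ controlling its strength:
$$\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \lVert \mathbf{w} \rVert_2^2$$
This is equivalent to MAP estimation with a zero-mean Gaussian prior on the weights.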
Batch Normalization
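Normalize each layer’s inputs to have zero mean and unit variance over the mini-batch, then apply a learned scale $\gamma$ and shift $\beta$:
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$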
Benefits: faster training, higher learning rates, mild regularization.
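The normalization step can be sketched for a single feature across one mini-batch. This covers only the training-time computation. At inference, frameworks typically substitute running averages of the mean and variance. The default `gamma`, `beta`, and `eps` values are assumptions.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)   # biased (population) variance
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
print(out, out_mean, out_var)
```

The small `eps` term keeps the division stable when a batch has near-zero variance, which is why the output variance lands just below 1.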
Early Stopping
Monitor validation loss during training. Stop when it starts increasing — the point where the model transitions from learning to memorizing.
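In practice validation loss is noisy, so a common refinement is to wait a few epochs before stopping. The sketch below assumes a hypothetical `patience` parameter and a made-up validation curve. It reports which epoch had the best loss and when training would halt.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses (0-based epochs)."""
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch       # stop; restore weights from best_epoch
    return best_epoch, len(val_losses) - 1     # never triggered: ran every epoch

# Made-up validation curve: improves, then starts to overfit.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
best_epoch, stop_epoch = early_stopping_epoch(val_losses)
print(best_epoch, stop_epoch)
```

A real training loop would also save a checkpoint whenever the best loss improves, so the weights from `best_epoch` can be restored after stopping.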
Network Architecture Patterns
Fully Connected (Dense)
Every neuron connects to every neuron in the next layer. Good for tabular data.
Convolutional (CNN)
Neurons connect only to local spatial regions. Essential for images, using filters that slide across the input detecting patterns.
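The sliding-filter idea can be sketched without any libraries. The function below is a minimal, unoptimized version with no padding and stride 1. Like most deep learning frameworks, it computes a cross-correlation (the kernel is not flipped). The edge-detecting kernel is hand-written for illustration, whereas a CNN would learn its filter weights.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 cross-correlation of a 2D list with a 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# 4x4 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1]] * 4
# Hand-written vertical-edge filter (in a CNN these weights would be learned).
kernel = [[-1, 1],
          [-1, 1]]
feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

Each output value depends only on a 2x2 patch of the input, and the same four weights are reused at every position. The feature map responds only where the patch straddles the edge.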
Recurrent (RNN/LSTM)
Neurons have connections that loop back in time. Designed for sequences (text, time series).
Transformer
Uses self-attention to process all positions in parallel. The architecture behind GPT, BERT, and modern LLMs.
Practical Tips
- Start simple: Begin with a small network and gradually add complexity
- Normalize inputs: Scale features to zero mean and unit variance
- Use batch normalization: Stabilizes training significantly
- Learning rate matters most: Try $10^{-3}$ with Adam (the default suggested in the Adam paper) as a starting point
- Monitor both losses: Training and validation loss tell different stories
- Use dropout: 0.1-0.5 rate, especially in fully connected layers
- Data augmentation: Often more effective than a bigger model
The Deep Learning Revolution
What changed in the 2010s wasn’t the math: backpropagation had already been popularized in the 1980s. What changed was:
- Data: The internet generated massive labeled datasets (ImageNet)
- Compute: GPUs made the matrix multiplications at the core of training orders of magnitude faster
- Architecture: ReLU activations mitigated the vanishing gradient problem
- Software: Frameworks like PyTorch made experimentation easy
These four ingredients turned a dormant theory into the most transformative technology of the decade.
Summary
- Neural networks are composed of layers of linear transformations + nonlinear activations
- Backpropagation uses the chain rule to compute gradients through every layer
- ReLU is the default activation; Adam is the default optimizer
- Regularization (dropout, weight decay, batch norm, early stopping) prevents overfitting
- Deep networks learn hierarchical features — from simple to complex
- The universal approximation theorem guarantees expressive power, though not that training will find a good solution
- CNNs for images, RNNs for sequences, Transformers for everything
References
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
- Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386-408.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
- Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980
- Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15, 1929-1958.
- Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167