Introduction to Neural Networks

The Big Picture

Neural networks are function approximators. Given input-output pairs, they learn a function that maps inputs to outputs. With enough capacity and data, they can fit very complex mappings.

They’re inspired by biological neurons but are really just compositions of simple mathematical operations: linear transformations followed by nonlinear activations, stacked into layers.

The Perceptron

The simplest neural network is a single perceptron — a linear model with a step function:

\text{output} = \text{step}(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)

The step function outputs 1 if the sum is positive, 0 otherwise. This can model any linearly separable problem (like AND, OR) but famously cannot model XOR.
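As a minimal sketch of the formula above, the snippet below implements a two-input perceptron in plain Python. The weights and biases for AND and OR are hand-picked for illustration rather than learned, and the small grid search over candidate parameters is only an illustration (not a proof) that no setting reproduces XOR.

```python
def step(z):
    """Step activation: 1 if z is positive, 0 otherwise."""
    return 1 if z > 0 else 0


def perceptron(x, w, b):
    """Single perceptron: step(w . x + b)."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hand-picked (not learned) parameters for two linearly separable gates
and_gate = [perceptron(x, w=(1, 1), b=-1.5) for x in inputs]
or_gate = [perceptron(x, w=(1, 1), b=-0.5) for x in inputs]
print(and_gate)  # [0, 0, 0, 1]
print(or_gate)   # [0, 1, 1, 1]

# Try every (w1, w2, b) on a coarse grid from -2 to 2 in steps of 0.5
grid = [i / 2 for i in range(-4, 5)]
xor_target = [0, 1, 1, 0]
xor_solutions = [
    (w1, w2, b)
    for w1 in grid for w2 in grid for b in grid
    if [perceptron(x, (w1, w2), b) for x in inputs] == xor_target
]
print(xor_solutions)  # [] because XOR is not linearly separable
```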

Historical note: The perceptron was invented in 1958 by Frank Rosenblatt. Minsky and Papert’s 1969 analysis, which showed that a single-layer perceptron cannot represent XOR, contributed to the first “AI winter.” The solution, multiple layers trained with backpropagation, did not gain traction until the mid-1980s.

From Perceptrons to Neurons

Modern neural networks replace the step function with an activation function $\sigma$ that is differentiable (at least almost everywhere), so that gradients can flow through it:

\text{output} = \sigma(\mathbf{w}^T \mathbf{x} + b)

Common Activation Functions

Function	Formula	Range	Pros
Sigmoid	$\frac{1}{1 + e^{-x}}$	$(0, 1)$	Outputs probabilities
Tanh	$\frac{e^x - e^{-x}}{e^x + e^{-x}}$	$(-1, 1)$	Zero-centered
ReLU	$\max(0, x)$	$[0, \infty)$	Fast, mitigates vanishing gradients
Leaky ReLU	$\begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{otherwise} \end{cases}$	$(-\infty, \infty)$	Avoids “dead neurons”
GELU	$x \cdot \Phi(x)$, with $\Phi$ the standard normal CDF	$\approx [-0.17, \infty)$	Smooth ReLU, used in transformers

ReLU is the default choice for hidden layers. It’s simple, fast, and works remarkably well.
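The formulas in the table translate almost directly into code. The following is a plain-Python sketch for scalar inputs. The function names are illustrative, the Leaky ReLU slope of 0.01 is taken from the table, and the sigmoid is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def gelu(x):
    # Phi is the standard normal CDF, written here via the error function
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x), gelu(x))
```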

Multi-Layer Networks

The real power comes from stacking layers:

\text{Input} \rightarrow \text{Hidden Layer 1} \rightarrow \text{Hidden Layer 2} \rightarrow \cdots \rightarrow \text{Output}

Each layer applies a linear transformation followed by a nonlinear activation:

\mathbf{h}_1 = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)

\mathbf{h}_2 = \sigma(\mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2)

\text{output} = \mathbf{W}_3 \mathbf{h}_2 + \mathbf{b}_3
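To make the layer equations concrete, here is a small sketch of a forward pass in plain Python, with ReLU assumed as the activation $\sigma$. The helper names (`affine`, `forward`) are illustrative, and the weights are hand-picked rather than learned: with one hidden layer of two units they compute XOR, the function a single perceptron cannot represent. The same `forward` function accepts any number of hidden layers.

```python
def relu(v):
    return [max(0.0, vi) for vi in v]


def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def forward(x, hidden_layers, output_layer):
    """Apply each hidden layer as relu(W h + b), then a linear output layer."""
    h = x
    for W, b in hidden_layers:
        h = relu(affine(W, h, b))
    W_out, b_out = output_layer
    return affine(W_out, h, b_out)


# Hand-picked weights (an illustrative assumption, not learned) that compute XOR
hidden = [([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0])]
output = ([[1.0, -2.0]], [0.0])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, forward(list(x), hidden, output))
```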

Why Depth Matters

A network with even a single hidden layer can theoretically approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden units (the Universal Approximation Theorem). But in practice:

Deeper networks learn hierarchical features (edges -> textures -> objects)
Very wide, shallow networks can memorize the training data without learning reusable intermediate features
Deeper is usually better than wider for the same parameter count

How Networks Learn: Backpropagation

Training a neural network means finding weights that minimize a loss function. Each training step has three parts: a forward pass, a backward pass, and a weight update.

Forward Pass

Compute the output by passing input through each layer:

\mathbf{x} \rightarrow \mathbf{h}_1 \rightarrow \mathbf{h}_2 \rightarrow \cdots \rightarrow \hat{y} \rightarrow \mathcal{L}

Backward Pass (Backpropagation)

Compute the gradient of the loss with respect to every weight using the chain rule:

\frac{\partial \mathcal{L}}{\partial \mathbf{W}_2} = \frac{\partial \mathcal{L}}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial \mathbf{h}_2} \cdot \frac{\partial \mathbf{h}_2}{\partial \mathbf{W}_2}

The gradients flow backward through the network, telling each weight how to change to reduce the loss.
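The chain-rule product can be checked numerically on a tiny example. The sketch below assumes a scalar network with tanh activations and a squared-error loss (neither is specified in the text), computes $\partial \mathcal{L} / \partial W_2$ as the product of the three factors, and compares it with a central finite-difference estimate. The parameter values are arbitrary.

```python
import math


def loss(params, x, y):
    """Forward pass of a scalar two-hidden-layer network with squared error."""
    w1, b1, w2, b2, w3, b3 = params
    h1 = math.tanh(w1 * x + b1)
    h2 = math.tanh(w2 * h1 + b2)
    out = w3 * h2 + b3
    return 0.5 * (out - y) ** 2, (h1, h2, out)


def grad_w2(params, x, y):
    """dL/dW2 as the product of the three chain-rule factors."""
    w3 = params[4]
    _, (h1, h2, out) = loss(params, x, y)
    dL_dout = out - y                 # dL/d(output)
    dout_dh2 = w3                     # d(output)/dh2
    dh2_dw2 = (1.0 - h2 ** 2) * h1    # dh2/dW2, using tanh'(z) = 1 - tanh(z)^2
    return dL_dout * dout_dh2 * dh2_dw2


params = [0.5, -0.1, 0.8, 0.2, -1.3, 0.05]  # arbitrary illustrative values
x, y = 1.5, 0.7

analytic = grad_w2(params, x, y)

# Central finite difference on w2 (index 2) as an independent check
eps = 1e-6
plus = params[:]
plus[2] += eps
minus = params[:]
minus[2] -= eps
numeric = (loss(plus, x, y)[0] - loss(minus, x, y)[0]) / (2 * eps)

print(analytic, numeric)  # the two values should agree closely
```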

Weight Update

\mathbf{W} := \mathbf{W} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}}

This is gradient descent applied to neural networks, with $\alpha$ as the learning rate. In practice, we use mini-batch stochastic gradient descent with optimizers like Adam. The same optimization principles apply here as in Linear Regression, but scaled to millions of parameters.
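As a minimal illustration of the update rule, the sketch below applies it to a one-parameter toy loss $\mathcal{L}(w) = (w - 3)^2$, chosen here only because its minimum is known. The function name and the values of the learning rate and step count are assumptions for the example.

```python
def gradient_descent(grad, w0, alpha, steps):
    """Repeat W := W - alpha * dL/dW for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w_final = gradient_descent(toy_grad, w0=0.0, alpha=0.1, steps=100)
print(w_final)  # very close to 3
```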

The Training Loop

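# Assumes model, data_loader, loss_function, optimizer, and num_epochs are
# already defined, e.g. a PyTorch module, a loader that yields
# (inputs, targets) pairs, and a torch.optim optimizer.
for epoch in range(num_epochs):
    for inputs, targets in data_loader:
        # Clear gradients left over from the previous step
        optimizer.zero_grad()

        # 1. Forward pass: compute predictions and the loss
        predictions = model(inputs)
        loss = loss_function(predictions, targets)

        # 2. Backward pass: compute gradients of the loss
        loss.backward()

        # 3. Update weights using the computed gradients
        optimizer.step()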

This loop runs for tens to hundreds of epochs until the loss converges.

Loss Functions

The loss function defines what the network optimizes:

Task	Loss Function	Formula
Regression	MSE	$\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
Binary Classification	Binary Cross-Entropy	$-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$
Multi-class Classification	Cross-Entropy	$-\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic} \log(p_{ic})$

Minimizing cross-entropy loss is equivalent to maximum likelihood estimation for a classifier that outputs class probabilities.
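The three formulas in the table can be written as short plain-Python functions. This is a sketch with illustrative names: it assumes predicted probabilities lie strictly between 0 and 1 for the binary case and that targets for the multi-class case are one-hot lists. The last lines illustrate the likelihood connection: exponentiating the negative summed cross-entropy recovers the product of the probabilities assigned to the true classes.

```python
import math


def mse(y_true, y_pred):
    n = len(y_true)
    return sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n


def binary_cross_entropy(y_true, p):
    n = len(y_true)
    return -sum(y * math.log(pi) + (1 - y) * math.log(1 - pi)
                for y, pi in zip(y_true, p)) / n


def cross_entropy(y_onehot, p):
    """y_onehot and p are lists of per-example lists over C classes."""
    n = len(y_onehot)
    # Terms with y = 0 contribute nothing, so skip them (this also avoids log(0))
    return -sum(yc * math.log(pc)
                for y_row, p_row in zip(y_onehot, p)
                for yc, pc in zip(y_row, p_row) if yc > 0) / n


targets = [[1, 0], [0, 1]]
probs = [[0.8, 0.2], [0.4, 0.6]]
ce = cross_entropy(targets, probs)
likelihood = math.exp(-len(targets) * ce)
print(ce, likelihood)  # likelihood is 0.8 * 0.6 up to rounding
```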

Optimizers

Plain gradient descent is slow. Modern optimizers speed it up by accumulating momentum, by adapting the learning rate per parameter, or both:

SGD with Momentum

\mathbf{v} := \beta \mathbf{v} + \nabla_{\mathbf{W}} \mathcal{L}

\mathbf{W} := \mathbf{W} - \alpha \mathbf{v}

Momentum accelerates convergence by accumulating past gradients, like a ball rolling downhill.
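A small sketch of the two update equations, again on the toy loss $(w - 3)^2$ chosen for illustration. The function name and hyperparameter values are assumptions. Setting $\beta = 0$ recovers plain gradient descent, which makes the comparison easy.

```python
def sgd_momentum(grad, w0, alpha, beta, steps):
    """Apply v := beta * v + grad(w), then w := w - alpha * v."""
    w, v = w0, 0.0
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - alpha * v
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


plain = sgd_momentum(toy_grad, w0=0.0, alpha=0.01, beta=0.0, steps=50)
with_momentum = sgd_momentum(toy_grad, w0=0.0, alpha=0.01, beta=0.9, steps=50)

# With the same learning rate and step budget, momentum ends closer to 3
print(plain, with_momentum)
```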

Adam (Adaptive Moment Estimation)

Combines momentum with per-parameter adaptive learning rates. It’s the default optimizer for most deep learning tasks:

Adapts the step size per parameter using running estimates of the gradient’s first moment (mean) and second moment (uncentered variance)
Works well out of the box with learning rate around $\alpha = 0.001$
Handles sparse gradients and noisy objectives
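The points above can be sketched as a scalar version of the update rule from Kingma & Ba (2014), using the paper's default hyperparameters. The function name and the toy gradient are illustrative assumptions. One property worth noticing: the first step has magnitude close to $\alpha$ regardless of how large the gradient is, which is the sense in which the step size adapts.

```python
import math


def adam(grad, w0, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Scalar Adam: momentum plus a per-parameter adaptive step size."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w


def toy_grad(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


first_step = adam(toy_grad, w0=0.0, steps=1)
first_step_scaled = adam(lambda w: 1000.0 * toy_grad(w), w0=0.0, steps=1)
print(first_step, first_step_scaled)  # both very close to alpha = 0.001

# A larger learning rate is used here only so the toy problem finishes quickly
print(adam(toy_grad, w0=0.0, alpha=0.05, steps=2000))
```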

Regularization Techniques

Neural networks are powerful but prone to overfitting. Key techniques to prevent it:

Dropout

Randomly set a fraction of neurons to zero during training:

\mathbf{h} = \text{dropout}\!\left(\sigma(\mathbf{W}\mathbf{x} + \mathbf{b}),\; p = 0.5\right)

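This forces the network to learn redundant representations, since no single neuron can be relied upon. At test time, all neurons are used, with activations scaled by the keep probability $1 - p$ so their expected magnitude matches training. Most modern implementations instead use “inverted” dropout, which applies the scaling during training and leaves test-time inference unchanged.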
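A minimal sketch of dropout on a list of activations follows. It uses the inverted variant, which divides the surviving activations by $1 - p$ during training so nothing needs rescaling at test time. That choice, the function signature, and the fixed random seed are assumptions made for the example.

```python
import random


def dropout(h, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p during training
    and scale survivors by 1 / (1 - p); return h unchanged at test time.
    Requires 0 <= p < 1."""
    if not training or p == 0.0:
        return list(h)
    keep = 1.0 - p
    return [hi / keep if rng.random() < keep else 0.0 for hi in h]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
h = [1.0] * 10000
dropped = dropout(h, p=0.5, training=True, rng=rng)

print(sum(dropped) / len(dropped))     # close to 1.0: the expected value is preserved
print(dropout(h[:3], training=False))  # unchanged at test time
```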

Weight Decay (L2 Regularization)

Add a penalty on the weight magnitudes:

\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \sum_{i} W_i^2

This is equivalent to MAP estimation with a zero-mean Gaussian prior on the weights, where $\lambda$ controls how strongly large weights are penalized.
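The effect of the penalty is easiest to see in the gradient: differentiating $\lambda \sum_i W_i^2$ adds $2 \lambda W_i$ to each weight's gradient, so every step pulls the weights toward zero. The sketch below assumes plain gradient descent and uses illustrative function names and values.

```python
def l2_penalty(weights, lam):
    """lambda * sum_i W_i^2."""
    return lam * sum(w * w for w in weights)


def total_loss(task_loss, weights, lam):
    return task_loss + l2_penalty(weights, lam)


def sgd_step_with_l2(weights, task_grads, alpha, lam):
    """One gradient step on the total loss. The penalty contributes
    2 * lam * w to each gradient, shrinking every weight toward zero."""
    return [w - alpha * (g + 2.0 * lam * w) for w, g in zip(weights, task_grads)]


weights = [1.0, -2.0]
print(total_loss(1.0, weights, lam=0.5))  # 1.0 + 0.5 * (1 + 4) = 3.5

# With a zero task gradient, each weight shrinks by the factor 1 - 2 * alpha * lam
print(sgd_step_with_l2(weights, [0.0, 0.0], alpha=0.1, lam=0.5))
```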

Batch Normalization

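Normalize each layer’s inputs over the current mini-batch to have zero mean and unit variance, then rescale with learned parameters $\gamma$ and $\beta$ (here $\mu$ and $\sigma^2$ are the mini-batch mean and variance, not the activation function):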

\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}

y = \gamma \hat{x} + \beta

Benefits: faster training, higher learning rates, mild regularization.
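The two equations can be sketched for a single feature across a mini-batch as follows. This covers only the training-time computation. Tracking running statistics for inference is omitted, and the function name and default values are assumptions for the example.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n   # biased (population) variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]


def mean(v):
    return sum(v) / len(v)


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
print(normalized)
print(mean(normalized))  # approximately 0

# gamma and beta let the network choose a different scale and offset
print(batch_norm(batch, gamma=2.0, beta=3.0))
```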

Early Stopping

Monitor validation loss during training. Stop when it has not improved for several consecutive epochs, and keep the weights from the best epoch. A sustained rise marks the point where the model shifts from learning to memorizing.
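One common way to make the stopping rule robust to noise is a patience counter: stop only after the validation loss has failed to improve for a set number of epochs. The sketch below applies that rule to a recorded list of losses. The function name, the patience value, and the loss history are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: they improve, then drift upward
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
print(early_stopping_epoch(history, patience=3))  # (3, 6)
```

In a real training loop the same check would run once per epoch, and the weights saved at `best_epoch` would be the ones kept.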

Network Architecture Patterns

Fully Connected (Dense)

Every neuron connects to every neuron in the next layer. Good for tabular data.

Convolutional (CNN)

Neurons connect only to local spatial regions and share weights: small filters slide across the input and detect the same pattern wherever it appears. A standard choice for images.
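The sliding-filter idea can be sketched in a few lines of plain Python. The function below assumes no padding and a stride of 1, and the image and filter values are hand-picked to show a filter responding to a vertical edge.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the
    map of dot products. Like most deep learning libraries, this computes
    cross-correlation, i.e. the kernel is not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 4x4 image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked 2x2 filter that responds to left-to-right increases
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The same four filter weights are reused at every position, which is why the edge is detected in every row with no extra parameters.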

Recurrent (RNN/LSTM)

Neurons have recurrent connections that carry a hidden state from one time step to the next. Designed for sequences (text, time series).

Transformer

Uses self-attention to process all positions in parallel. The architecture behind GPT, BERT, and modern LLMs.
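As a rough sketch of the self-attention step, the function below implements scaled dot-product attention in plain Python. To stay short it uses the inputs directly as queries, keys, and values. Real transformers apply learned projection matrices first, so this is a simplifying assumption, as are the function names. Each output row depends only on the input `X`, not on other output rows, which is what allows all positions to be computed in parallel.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def self_attention(X):
    """Scaled dot-product self-attention with identity projections, so
    queries, keys, and values are all X itself."""
    d = len(X[0])
    out = []
    for q in X:
        # Attention weights of this position over every position
        weights = softmax([dot(q, k) / math.sqrt(d) for k in X])
        # Output is the attention-weighted average of the value vectors
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(X):
    print(row)
```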

Practical Tips

Start simple: Begin with a small network and gradually add complexity
Normalize inputs: Scale features to zero mean and unit variance
Use batch normalization: Stabilizes training significantly
Learning rate matters most: Try $\alpha = 0.001$ with Adam as a starting point
Monitor both losses: Training and validation loss tell different stories
Use dropout: 0.1-0.5 rate, especially in fully connected layers
Data augmentation: Often more effective than a bigger model

The Deep Learning Revolution

What changed in the 2010s wasn’t the math: backpropagation was popularized in the 1980s. What changed was:

Data: The internet generated massive labeled datasets (ImageNet)
Compute: GPUs made matrix multiplications dramatically faster, often by an order of magnitude or more
Architecture: ReLU activations greatly reduced the vanishing gradient problem
Software: Frameworks like PyTorch made experimentation easy

These four ingredients turned a dormant theory into the most transformative technology of the decade.

Summary

Neural networks are composed of layers of linear transformations + nonlinear activations
Backpropagation uses the chain rule to compute gradients through every layer
ReLU is the default activation; Adam is the default optimizer
Regularization (dropout, weight decay, batch norm, early stopping) prevents overfitting
Deep networks learn hierarchical features — from simple to complex
The universal approximation theorem guarantees expressive power, though not that training will find a good solution
CNNs for images, RNNs for sequences, Transformers for everything

References

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386-408.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15, 1929-1958.
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167