Introduction to Neural Networks

From single neurons to deep networks: how neural networks learn to approximate any function.

Fundamentals February 20, 2026 7 min read

The Big Picture

Neural networks are function approximators. Given input-output pairs, they learn a function that maps inputs to outputs, even when that mapping is highly complex.

They’re inspired by biological neurons but are really just compositions of simple mathematical operations: linear transformations followed by nonlinear activations, stacked into layers.

The Perceptron

The simplest neural network is a single perceptron — a linear model with a step function:

$$\text{output} = \text{step}(w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b)$$

The step function outputs 1 if the sum is positive, 0 otherwise. This can model any linearly separable problem (like AND, OR) but famously cannot model XOR.
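
As a rough illustration of the formula above, the sketch below wires up a perceptron with hand-picked (not learned) weights for AND and OR, then searches a coarse grid of weights for XOR. The function names, the specific weights, and the grid are assumptions made for this example.

```python
def step(z):
    """Step activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0


def perceptron(inputs, weights, bias):
    """Single perceptron: step(w . x + b)."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return step(total)


# Hand-picked parameters (illustrative, not learned) for two binary inputs.
AND_PARAMS = ([1.0, 1.0], -1.5)
OR_PARAMS = ([1.0, 1.0], -0.5)

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(x, *AND_PARAMS) for x in INPUTS]
or_outputs = [perceptron(x, *OR_PARAMS) for x in INPUTS]
print(and_outputs)  # [0, 0, 0, 1]
print(or_outputs)   # [0, 1, 1, 1]

# A coarse grid search finds no single perceptron that reproduces XOR,
# which is consistent with XOR not being linearly separable.
XOR = [0, 1, 1, 0]
grid = [i / 2 for i in range(-6, 7)]
xor_solutions = [
    (w1, w2, b)
    for w1 in grid for w2 in grid for b in grid
    if [perceptron(x, [w1, w2], b) for x in INPUTS] == XOR
]
print(xor_solutions)  # []
```

The empty search result is not a proof, only a quick sanity check; the real argument is geometric: no single line separates the XOR points.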

Historical note: Frank Rosenblatt introduced the perceptron in 1958. Minsky and Papert’s 1969 analysis of its limits, including its inability to learn XOR, contributed to the first “AI winter.” The solution, multiple layers trained with backpropagation, did not gain traction until the mid-1980s.

From Perceptrons to Neurons

Modern neural networks replace the step function with smooth, differentiable activation functions:

$$\text{output} = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

Common Activation Functions

| Function | Formula | Range | Pros |
|---|---|---|---|
| Sigmoid | $\frac{1}{1 + e^{-x}}$ | $(0, 1)$ | Output can be read as a probability |
| Tanh | $\frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ | Zero-centered |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | Fast, mitigates vanishing gradients |
| Leaky ReLU | $\begin{cases} x & \text{if } x > 0 \\ 0.01x & \text{otherwise} \end{cases}$ | $(-\infty, \infty)$ | Avoids “dead neurons” |
| GELU | $x \cdot \Phi(x)$ | $\approx [-0.17, \infty)$ | Smooth ReLU, used in transformers |

ReLU is the default choice for hidden layers. It’s simple, fast, and works remarkably well.
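
The activation functions listed above are short enough to write out directly. The following is a minimal scalar sketch using only the standard library; the function names and the 0.01 default slope are taken from the table, and GELU is written with the exact normal CDF via `math.erf` (libraries often use a faster approximation instead).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return math.tanh(x)


def relu(x):
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x


def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF expressed via erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


ACTIVATIONS = {
    "sigmoid": sigmoid,
    "tanh": tanh,
    "relu": relu,
    "leaky_relu": leaky_relu,
    "gelu": gelu,
}

# Evaluate each function at a few sample points to compare their shapes.
for name, fn in ACTIVATIONS.items():
    print(name, [round(fn(x), 4) for x in (-2.0, 0.0, 2.0)])
```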

Multi-Layer Networks

The real power comes from stacking layers:

$$\text{Input} \rightarrow \text{Hidden Layer 1} \rightarrow \text{Hidden Layer 2} \rightarrow \cdots \rightarrow \text{Output}$$

Each layer applies a linear transformation followed by a nonlinear activation:

$$\mathbf{h}_1 = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$$

$$\mathbf{h}_2 = \sigma(\mathbf{W}_2 \mathbf{h}_1 + \mathbf{b}_2)$$

$$\text{output} = \mathbf{W}_3 \mathbf{h}_2 + \mathbf{b}_3$$
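
To make the layer equations concrete, here is a small sketch of the same two-hidden-layer forward pass in plain Python. It assumes ReLU as the activation, and the weights are arbitrary numbers chosen so the arithmetic is easy to follow by hand; `affine` and `forward` are names invented for this example.

```python
def relu(v):
    """Apply ReLU elementwise to a vector."""
    return [max(0.0, a) for a in v]


def affine(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and a vector x."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]


def forward(x, params):
    """Two hidden layers with ReLU, followed by a linear output layer."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(affine(W1, x, b1))
    h2 = relu(affine(W2, h1, b2))
    return affine(W3, h2, b3)


# Arbitrary illustrative parameters.
params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # W1 (2x2), b1
    ([[1.0, 1.0], [-1.0, 1.0]], [0.0, 0.5]),  # W2 (2x2), b2
    ([[2.0, -1.0]], [0.1]),                   # W3 (1x2), b3
]

output = forward([1.0, 2.0], params)
print(output)  # h1 = [0, 1.5], h2 = [1.5, 2.0], so the output is about 1.1
```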

Why Depth Matters

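A network with a single hidden layer can, given enough hidden units, approximate any continuous function on a bounded domain to arbitrary accuracy (the Universal Approximation Theorem). The theorem says nothing about how many units are needed or whether training will find them. In practice: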

  • Deeper networks learn hierarchical features (edges -> textures -> objects)
  • Wider networks have more capacity to memorize, which can hurt generalization
  • Deeper is usually better than wider for the same parameter count
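
Comparing depth and width "for the same parameter count" requires counting parameters first. The helper below is a small sketch for fully connected networks; the two layer configurations are made-up examples chosen to land at roughly the same budget. It shows only the bookkeeping, not which network performs better.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with these layer sizes."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


wide = [64, 256, 10]       # one wide hidden layer (hypothetical sizes)
deep = [64, 106, 106, 10]  # two narrower hidden layers (hypothetical sizes)

wide_params = count_parameters(wide)
deep_params = count_parameters(deep)
print(wide_params)  # 19210
print(deep_params)  # 19302
```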

How Networks Learn: Backpropagation

Training a neural network means finding weights that minimize a loss function. Each training step has three parts: a forward pass, a backward pass, and a weight update.

Forward Pass

Compute the output by passing input through each layer:

$$\mathbf{x} \rightarrow \mathbf{h}_1 \rightarrow \mathbf{h}_2 \rightarrow \cdots \rightarrow \hat{y} \rightarrow \mathcal{L}$$

Backward Pass (Backpropagation)

Compute the gradient of the loss with respect to every weight using the chain rule:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}_2} = \frac{\partial \mathcal{L}}{\partial \text{output}} \cdot \frac{\partial \text{output}}{\partial \mathbf{h}_2} \cdot \frac{\partial \mathbf{h}_2}{\partial \mathbf{W}_2}$$

The gradients flow backward through the network, telling each weight how to change to reduce the loss.
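
The chain rule is easiest to see on a network small enough to differentiate by hand. The sketch below assumes a scalar network with one tanh hidden unit and a squared-error loss (both choices are for illustration only), computes the gradients step by step going backward, and checks them against finite differences.

```python
import math


def forward(params, x, y):
    """h = tanh(w1*x + b1), y_hat = w2*h + b2, loss = (y_hat - y)^2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y_hat = w2 * h + b2
    loss = (y_hat - y) ** 2
    return h, y_hat, loss


def backward(params, x, y):
    """Apply the chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    h, y_hat, _ = forward(params, x, y)
    dL_dyhat = 2.0 * (y_hat - y)
    dL_dw2 = dL_dyhat * h            # y_hat depends on w2 through w2*h
    dL_db2 = dL_dyhat
    dL_dh = dL_dyhat * w2            # pass the gradient back to the hidden unit
    dL_dz = dL_dh * (1.0 - h ** 2)   # derivative of tanh
    dL_dw1 = dL_dz * x
    dL_db1 = dL_dz
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]


def numerical_gradient(params, x, y, eps=1e-6):
    """Central finite differences, used only to check the analytic result."""
    grads = []
    for i in range(len(params)):
        plus, minus = list(params), list(params)
        plus[i] += eps
        minus[i] -= eps
        grads.append(
            (forward(plus, x, y)[2] - forward(minus, x, y)[2]) / (2 * eps)
        )
    return grads


params = [0.5, -0.1, 1.5, 0.2]  # arbitrary illustrative values
analytic = backward(params, x=2.0, y=1.0)
numeric = numerical_gradient(params, x=2.0, y=1.0)
print(analytic)
print(numeric)
```

Frameworks automate exactly this bookkeeping; a finite-difference check like `numerical_gradient` is a common way to test a hand-written backward pass.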

Weight Update

$$\mathbf{W} := \mathbf{W} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$$

This is gradient descent applied to neural networks. In practice, we use mini-batch stochastic gradient descent with optimizers like Adam. The same optimization principles apply here as in Linear Regression, but scaled to millions of parameters.
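
The update rule can be tried on a one-parameter toy problem. This sketch assumes the loss $(w - 3)^2$, whose gradient is $2(w - 3)$; the loss, starting point, and learning rate are illustrative choices, not values from a real network.

```python
def gradient_descent(grad_fn, w, lr=0.1, steps=100):
    """Repeatedly apply w := w - lr * dL/dw."""
    for _ in range(steps):
        w = w - lr * grad_fn(w)
    return w


# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3); its minimum is at w = 3.
w_final = gradient_descent(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # very close to 3.0
```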

The Training Loop

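import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy data, model, loss, and optimizer (illustrative) so the loop runs as written
dataset = TensorDataset(torch.randn(256, 4), torch.randn(256, 1))
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
loss_function = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 10

for epoch in range(num_epochs):
    for inputs, targets in data_loader:
        # 1. Forward pass
        predictions = model(inputs)
        loss = loss_function(predictions, targets)

        # 2. Backward pass (clear stale gradients first)
        optimizer.zero_grad()
        loss.backward()

        # 3. Update weights
        optimizer.step()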

This loop runs for tens to hundreds of epochs until the loss converges.

Loss Functions

The loss function defines what the network optimizes:

| Task | Loss Function | Formula |
|---|---|---|
| Regression | MSE | $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ |
| Binary Classification | Binary Cross-Entropy | $-\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$ |
| Multi-class Classification | Cross-Entropy | $-\frac{1}{n}\sum_{i=1}^{n}\sum_{c=1}^{C} y_{ic} \log(p_{ic})$ |

Minimizing cross-entropy loss is equivalent to maximum likelihood estimation for classification models.
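
The three losses above translate almost line for line into code. This is a plain-Python sketch that assumes predicted probabilities lie strictly between 0 and 1 (real implementations clip or work with logits for numerical stability); the function names are chosen for this example.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error over n examples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred):
    """y_true holds 0/1 labels, p_pred holds predicted P(y = 1)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    ) / len(y_true)


def cross_entropy(y_onehot, p_pred):
    """y_onehot holds one-hot rows, p_pred holds rows of class probabilities."""
    return -sum(
        y * math.log(p)
        for ys, ps in zip(y_onehot, p_pred)
        for y, p in zip(ys, ps)
        if y > 0  # terms with y = 0 contribute nothing
    ) / len(y_onehot)


print(mse([1.0, 2.0], [1.0, 4.0]))                      # 2.0
print(binary_cross_entropy([1, 0], [0.5, 0.5]))         # log(2), about 0.6931
print(cross_entropy([[1, 0, 0]], [[0.5, 0.25, 0.25]]))  # log(2), about 0.6931
```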

Optimizers

Plain gradient descent is slow. Modern optimizers speed it up by accumulating momentum, adapting the learning rate per parameter, or both:

SGD with Momentum

$$\mathbf{v} := \beta \mathbf{v} + \nabla_{\mathbf{W}} \mathcal{L}$$

$$\mathbf{W} := \mathbf{W} - \alpha \mathbf{v}$$

Momentum accelerates convergence by accumulating past gradients, like a ball rolling downhill.
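
The two momentum equations can be sketched for a single scalar parameter as follows. The toy loss $(w - 3)^2$, the hyperparameter values, and the function name are assumptions for illustration.

```python
def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=300):
    """v := beta * v + grad;  w := w - lr * v."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad_fn(w)  # accumulate past gradients
        w = w - lr * v             # step along the accumulated direction
    return w


# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # close to 3.0 after some overshoot and oscillation
```

On this toy problem the velocity makes the iterate overshoot the minimum and spiral in, which is the "rolling ball" behavior in miniature.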

Adam (Adaptive Moment Estimation)

Combines momentum with per-parameter adaptive learning rates. It’s the default optimizer for most deep learning tasks:

  • Adapts the learning rate using running estimates of the first moment (mean) and second moment (uncentered variance) of the gradients
  • Works well out of the box with a learning rate around $\alpha = 0.001$
  • Handles sparse gradients and noisy objectives
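
For a single scalar parameter, the Adam update described in Kingma & Ba (2014) can be sketched as below. The default values for `beta1`, `beta2`, and `eps` follow the paper; the toy loss and the function name are assumptions for this example.

```python
import math


def adam_minimize(grad_fn, w, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Scalar Adam: momentum on the gradient plus a per-parameter step scale."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g      # first moment (running mean)
        v = beta2 * v + (1 - beta2) * g * g  # second moment (uncentered)
        m_hat = m / (1 - beta1 ** t)         # bias correction for zero init
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy loss L(w) = (w - 3)^2 with gradient 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)


w_after_one = adam_minimize(grad, w=0.0, steps=1)
w_after_many = adam_minimize(grad, w=0.0, steps=1000)
print(w_after_one)   # about 0.001: the first step has size lr regardless of gradient scale
print(w_after_many)  # has moved toward 3.0, at most about lr per step
```

Because the step is normalized by the gradient's running magnitude, the first update is roughly `lr` in size no matter how large the raw gradient is, which is part of why one default learning rate works across many problems.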

Regularization Techniques

Neural networks are powerful but prone to overfitting. Key techniques to prevent it:

Dropout

Randomly set a fraction of neurons to zero during training:

$$\mathbf{h} = \text{dropout}\!\left(\sigma(\mathbf{W}\mathbf{x} + \mathbf{b}),\; p = 0.5\right)$$

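This forces the network to learn redundant representations, since no single neuron can be relied upon. At test time, all neurons are used, with weights scaled by the keep probability $1 - p$ so that expected activations match training. Most frameworks instead scale the surviving activations by $1/(1 - p)$ during training (inverted dropout) and leave test-time weights unchanged.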
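
A minimal sketch of dropout on a list of activations is shown below. It uses the "inverted" variant, which rescales surviving activations during training so that nothing needs to change at test time; this is an assumption about implementation style, and the `training` flag and `rng` argument are names chosen for the example.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p during
    training and scale survivors by 1 / (1 - p); identity at test time."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded generator so the run is repeatable
h = [1.0] * 10000
dropped = dropout(h, p=0.5, rng=rng)

mean_activation = sum(dropped) / len(dropped)
print(mean_activation)  # near 1.0: the expected activation is preserved
print(dropout(h[:3], p=0.5, training=False))  # [1.0, 1.0, 1.0]
```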

Weight Decay (L2 Regularization)

Add a penalty on the weight magnitudes:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \sum_{i} W_i^2$$

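This is equivalent to MAP estimation with a zero-mean Gaussian prior on the weights. With plain SGD the penalty acts as weight decay, shrinking every weight toward zero at each step; with adaptive optimizers such as Adam the two are no longer identical.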
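
In code, the penalty adds $2\lambda W_i$ to each weight's gradient. The sketch below assumes plain SGD and a flat list of weights; the function names are made up for this example.

```python
def l2_penalty(weights, lam):
    """lambda * sum of squared weights, added to the task loss."""
    return lam * sum(w * w for w in weights)


def sgd_step_with_l2(weights, task_grads, lr, lam):
    """One SGD step on task loss + L2 penalty (penalty gradient is 2*lam*w)."""
    return [
        w - lr * (g + 2.0 * lam * w)
        for w, g in zip(weights, task_grads)
    ]


weights = [1.0, -2.0, 3.0]
penalty = l2_penalty(weights, lam=0.1)

# With a zero task gradient, each weight simply shrinks by a constant factor
# (1 - 2 * lr * lam), which is where the name "weight decay" comes from.
decayed = sgd_step_with_l2(weights, [0.0, 0.0, 0.0], lr=0.1, lam=0.5)
print(penalty)  # about 1.4
print(decayed)  # about [0.9, -1.8, 2.7]
```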

Batch Normalization

Normalize each layer’s inputs across the mini-batch to have zero mean and unit variance, then apply a learned scale $\gamma$ and shift $\beta$:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

$$y = \gamma \hat{x} + \beta$$

Benefits: faster training, higher learning rates, mild regularization.
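
The two batch normalization equations can be sketched for one feature across a mini-batch as follows. This covers only the training-time computation; the running statistics that frameworks keep for inference are left out, and `gamma`, `beta`, and `eps` are given fixed illustrative values rather than being learned.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(batch)
    mu = sum(batch) / n
    var = sum((x - mu) ** 2 for x in batch) / n  # biased (population) variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in batch]


batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_var = sum((y - out_mean) ** 2 for y in normalized) / len(normalized)
print(normalized)
print(out_mean, out_var)  # approximately 0 and 1
```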

Early Stopping

Monitor validation loss during training. Stop when it starts increasing — the point where the model transitions from learning to memorizing.
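
Because validation loss is noisy, a common refinement is to wait a few epochs before stopping. The sketch below assumes a `patience` parameter (not part of the description above) and a made-up list of validation losses; it returns the epoch whose weights would be kept.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # stop training; later epochs are never run
    return best_epoch


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.40]
best = early_stopping_epoch(val_losses, patience=2)
print(best)  # 3: training stops after epoch 5, so the 0.40 at epoch 6 is never seen
```

The example also shows the trade-off: a small patience saves compute but can stop before a later improvement.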

Network Architecture Patterns

Fully Connected (Dense)

Every neuron connects to every neuron in the next layer. Good for tabular data.

Convolutional (CNN)

Neurons connect only to local spatial regions. Essential for images, using filters that slide across the input detecting patterns.
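
The idea of a filter sliding across the input is easiest to see in one dimension. The sketch below assumes a hand-written two-element edge filter (in a CNN the filter values are learned) and computes the sliding dot product that deep learning libraries call convolution.

```python
def conv1d(signal, kernel):
    """Valid-mode sliding dot product (cross-correlation, as in most
    deep learning libraries)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


signal = [0, 0, 0, 1, 1, 1, 0, 0]
edge_filter = [-1, 1]  # responds where the signal changes value

response = conv1d(signal, edge_filter)
print(response)  # [0, 0, 1, 0, 0, -1, 0]
```

The same two weights are reused at every position, so the filter detects the rising and falling edges wherever they occur; a 2D image convolution applies the same principle over rows and columns.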

Recurrent (RNN/LSTM)

Neurons have connections that loop back in time. Designed for sequences (text, time series).

Transformer

Uses self-attention to process all positions in parallel. The architecture behind GPT, BERT, and modern LLMs.

Practical Tips

  1. Start simple: Begin with a small network and gradually add complexity
  2. Normalize inputs: Scale features to zero mean and unit variance
  3. Use batch normalization: Stabilizes training significantly
  4. Learning rate matters most: Try $\alpha = 0.001$ with Adam as a starting point
  5. Monitor both losses: Training and validation loss tell different stories
  6. Use dropout: 0.1-0.5 rate, especially in fully connected layers
  7. Data augmentation: Often more effective than a bigger model

The Deep Learning Revolution

What changed in the 2010s wasn’t the math: backpropagation was popularized in the 1980s. What changed was:

  • Data: The internet made massive labeled datasets possible (e.g., ImageNet)
  • Compute: GPUs made large matrix multiplications orders of magnitude faster
  • Architecture: ReLU activations greatly reduced the vanishing gradient problem
  • Software: Frameworks like PyTorch made experimentation easy

These four ingredients turned a dormant theory into the most transformative technology of the decade.

Summary

  • Neural networks are composed of layers of linear transformations + nonlinear activations
  • Backpropagation uses the chain rule to compute gradients through every layer
  • ReLU is the default activation; Adam is the default optimizer
  • Regularization (dropout, weight decay, batch norm, early stopping) prevents overfitting
  • Deep networks learn hierarchical features — from simple to complex
  • The universal approximation theorem guarantees expressive power, not that training will find a good solution
  • CNNs for images, RNNs for sequences, Transformers for everything

References

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. deeplearningbook.org
  • Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, 65(6), 386-408.
  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
  • Kingma, D. P., & Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv:1412.6980
  • Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359-366.
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 15, 1929-1958.
  • Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167
