Why Limits Matter
Every central idea in calculus — derivatives, integrals, series — is defined through a limit. The derivative is a limit of difference quotients. The integral is a limit of Riemann sums. Without limits, none of these concepts have rigorous meaning.
For machine learning, limits are not just abstract formalism. Gradient descent relies on derivatives, which rely on limits. Convergence guarantees for optimization algorithms are statements about limits of sequences. Even the universal approximation theorem is a limit-based result. Understanding limits gives you the language to reason about why training works, not just how.
Intuitive Idea of a Limit
A limit describes the value a function approaches as its input approaches some target — even if the function never actually reaches that value.
Consider the function:
$$f(x) = \frac{x^2 - 1}{x - 1}$$
At $x = 1$, this function is undefined (division by zero). But what happens as $x$ gets close to 1?
| $x$ | $f(x)$ |
| --- | --- |
| 0.9 | 1.9 |
| 0.99 | 1.99 |
| 0.999 | 1.999 |
| 1.001 | 2.001 |
| 1.01 | 2.01 |
| 1.1 | 2.1 |
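The values approach 2 from both sides. We write:
$$\lim_{x \to 1} \frac{x^2 - 1}{x - 1} = 2$$
Algebraically, this is clear: $x^2 - 1 = (x - 1)(x + 1)$, so for $x \neq 1$, $f(x) = x + 1$, which equals 2 at $x = 1$.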
Key insight: A limit describes the tendency of a function near a point, not the function’s value at that point. The function does not even need to be defined at the target point for the limit to exist.
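The table above can be reproduced numerically. The following is a minimal sketch, assuming the example function is $(x^2 - 1)/(x - 1)$ as the algebra in this section indicates; the names `f`, `offsets`, `left_values`, and `right_values` are illustrative only.

```python
def f(x: float) -> float:
    """Example function (x^2 - 1) / (x - 1); it is undefined at x = 1."""
    return (x**2 - 1) / (x - 1)

# Shrinking distances from the target point x = 1.
offsets = [10.0**-k for k in range(1, 7)]

# Evaluate just to the left and just to the right of 1, never at 1 itself.
left_values = [f(1 - h) for h in offsets]
right_values = [f(1 + h) for h in offsets]

for h, left, right in zip(offsets, left_values, right_values):
    print(f"h={h:.0e}  f(1-h)={left:.6f}  f(1+h)={right:.6f}")
```

Both columns drift toward 2 as `h` shrinks, while calling `f(1.0)` directly raises `ZeroDivisionError`. Sampling like this suggests a limit but does not prove one.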
The Formal Definition
The intuitive idea of “approaching” needs to be made precise. The epsilon-delta definition, formalized by Karl Weierstrass in the 19th century, does exactly that.
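We say $\lim_{x \to a} f(x) = L$ if:
$$\forall\, \varepsilon > 0,\ \exists\, \delta > 0 \ \text{such that}\ 0 < |x - a| < \delta \implies |f(x) - L| < \varepsilon$$
In plain language: no matter how small a tolerance $\varepsilon$ you demand around $L$, I can find a neighborhood of radius $\delta$ around $a$ such that every $x$ in that neighborhood (except $a$ itself) maps to within $\varepsilon$ of $L$.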
Worked Example
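Let us prove that $\lim_{x \to 3} (2x + 1) = 7$.
We need: given any $\varepsilon > 0$, find $\delta > 0$ such that $0 < |x - 3| < \delta \implies |(2x + 1) - 7| < \varepsilon$.
We want $|(2x + 1) - 7| = 2|x - 3| < \varepsilon$, which means $|x - 3| < \varepsilon / 2$. So choosing $\delta = \varepsilon / 2$ works. For any $\varepsilon > 0$, whenever $0 < |x - 3| < \delta$, we get $|(2x + 1) - 7| = 2|x - 3| < 2\delta = \varepsilon$.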
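An epsilon-delta argument can be sanity-checked numerically. The sketch below assumes the illustrative linear function $g(x) = 2x + 1$ near $a = 3$ with limit 7, for which $\delta = \varepsilon/2$ is a valid choice. The helper name `delta_works` and its parameters are hypothetical. Random sampling can expose a bad $\delta$ but cannot replace the proof.

```python
import random

def g(x: float) -> float:
    """Illustrative linear function with limit 7 as x approaches 3."""
    return 2 * x + 1

def delta_works(epsilon: float, delta: float, a: float = 3.0,
                limit: float = 7.0, trials: int = 10_000, seed: int = 0) -> bool:
    """Sample x with 0 < |x - a| < delta and test |g(x) - limit| < epsilon."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = a + rng.uniform(-delta, delta)
        if x == a or abs(x - a) >= delta:
            continue  # outside the punctured neighborhood, so not a valid test point
        if abs(g(x) - limit) >= epsilon:
            return False  # found a counterexample to this choice of delta
    return True

for eps in (1.0, 0.1, 1e-3, 1e-6):
    print(eps, delta_works(eps, delta=eps / 2), delta_works(eps, delta=eps))
```

With `delta = eps / 2` every sample passes. With the larger `delta = eps`, samples farther than `eps / 2` from 3 violate the tolerance, so the check fails.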
One-Sided Limits
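Sometimes a function approaches different values from the left and right. The left-hand limit $\lim_{x \to a^-} f(x)$ considers only $x < a$, while the right-hand limit $\lim_{x \to a^+} f(x)$ considers only $x > a$.
The two-sided limit exists if and only if both one-sided limits exist and are equal:
$$\lim_{x \to a} f(x) = L \iff \lim_{x \to a^-} f(x) = \lim_{x \to a^+} f(x) = L$$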
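One-sided behavior can be probed by sampling on each side of the target separately. This is a rough sketch, not a proof technique: the helper `one_sided_samples` and the test function `step` are hypothetical names chosen for illustration.

```python
from typing import Callable, List, Tuple

def one_sided_samples(f: Callable[[float], float], a: float,
                      steps: int = 8) -> Tuple[List[float], List[float]]:
    """Return f(a - h) and f(a + h) for h = 1e-1, 1e-2, ..., 10**-steps."""
    offsets = [10.0**-k for k in range(1, steps + 1)]
    left = [f(a - h) for h in offsets]
    right = [f(a + h) for h in offsets]
    return left, right

def step(x: float) -> float:
    """Unit step: 0 for negative inputs, 1 otherwise."""
    return 1.0 if x >= 0 else 0.0

step_left, step_right = one_sided_samples(step, 0.0)
abs_left, abs_right = one_sided_samples(abs, 0.0)

print("step:", step_left[-1], step_right[-1])  # the two sides disagree
print("abs: ", abs_left[-1], abs_right[-1])    # the two sides agree
```

For `step` the left samples stay at 0 and the right samples stay at 1, so no two-sided limit exists at 0. For `abs` both sides shrink toward 0 together.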
Computing Limits
Direct Substitution
If $f$ is a “nice” function (polynomial, rational with nonzero denominator, exponential, etc.), simply plug in:
$$\lim_{x \to a} f(x) = f(a)$$
This works whenever $f$ is continuous at $a$, a concept we formalize below.
Algebraic Manipulation
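When direct substitution yields $0/0$ (an indeterminate form), simplify first:
$$\lim_{x \to 1} \frac{x^2 - 1}{x - 1} = \lim_{x \to 1} \frac{(x - 1)(x + 1)}{x - 1} = \lim_{x \to 1} (x + 1) = 2$$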
Other algebraic techniques include rationalizing (multiplying by the conjugate), factoring, and common denominators.
Squeeze Theorem
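If $g(x) \le f(x) \le h(x)$ near $a$, and $\lim_{x \to a} g(x) = \lim_{x \to a} h(x) = L$, then $\lim_{x \to a} f(x) = L$.
Example: $\lim_{x \to 0} x^2 \sin(1/x) = 0$ because $-x^2 \le x^2 \sin(1/x) \le x^2$ and both bounds go to 0.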
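The squeeze can be observed numerically. The sketch below takes $x^2 \sin(1/x)$ as an illustrative oscillating function (an assumption made for this example) and checks that each sampled value sits between the bounds $-x^2$ and $x^2$; the names `squeezed` and `rows` are hypothetical.

```python
import math

def squeezed(x: float) -> float:
    """x^2 * sin(1/x): oscillates rapidly near 0 but is trapped by +/- x^2."""
    return x**2 * math.sin(1.0 / x)

# Sample on both sides of 0, never at 0 itself where 1/x is undefined.
points = [sign * 10.0**-k for k in range(1, 8) for sign in (-1.0, 1.0)]

# Each row holds (x, lower bound, function value, upper bound).
rows = [(x, -x**2, squeezed(x), x**2) for x in points]

for x, lower, value, upper in rows:
    print(f"x={x:+.0e}  {lower:+.3e} <= {value:+.3e} <= {upper:+.3e}")
```

Because $|\sin| \le 1$, the middle column can never escape the outer two, and both outer columns collapse to 0 as $x$ shrinks.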
Limits at Infinity
The behavior of $f(x)$ as $x \to \infty$ tells us about long-term trends. The key principle is dominant term analysis: the fastest-growing term determines the limit.
Growth rate hierarchy (slowest to fastest):
$$\log n \ll n \ll n \log n \ll n^2 \ll 2^n \ll n!$$
This hierarchy matters in ML when analyzing algorithm complexity: an $O(n \log n)$ algorithm is fundamentally faster than an $O(n^2)$ one.
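One way to see a growth hierarchy is to watch the ratio of a slower rate to a faster one shrink as the input grows. The sketch below uses a few representative rates chosen for illustration (an assumption, not necessarily the full list intended above); `rates` and `ratio_table` are hypothetical names.

```python
import math

# Representative growth rates, ordered from slowest to fastest.
rates = [
    ("log n", lambda n: math.log(n)),
    ("n", lambda n: float(n)),
    ("n log n", lambda n: n * math.log(n)),
    ("n^2", lambda n: float(n) ** 2),
    ("2^n", lambda n: 2.0 ** n),
]

def ratio_table(ns=(10, 100, 1000)):
    """For each adjacent pair, compute slower(n) / faster(n) at every n."""
    table = {}
    for (slow_name, slow), (fast_name, fast) in zip(rates, rates[1:]):
        table[f"{slow_name} / {fast_name}"] = [slow(n) / fast(n) for n in ns]
    return table

ratios = ratio_table()
for name, values in ratios.items():
    print(f"{name:>16}: " + "  ".join(f"{v:.3e}" for v in values))
```

Every row decreases from left to right, which is the finite-sample signature of the slower term being dominated in the limit.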
L’Hopital’s Rule (Preview)
When substitution gives $0/0$ or $\infty/\infty$, L’Hopital’s rule states:
$$\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}$$
provided the right-hand limit exists. This requires derivatives, which we cover next.
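As a preview, the rule can be checked numerically on two standard $0/0$ cases at $a = 0$: $\sin(x)/x$ and $(e^x - 1)/x$. This is a small sketch with derivatives written by hand; the dictionary `examples` and the helper `compare` are hypothetical names.

```python
import math

# Each entry holds: numerator f, denominator g, and their derivatives f', g'.
examples = {
    "sin(x)/x": (math.sin, lambda x: x, math.cos, lambda x: 1.0),
    "(e^x - 1)/x": (math.expm1, lambda x: x, math.exp, lambda x: 1.0),
}

def compare(name: str, x: float):
    """Return (f(x)/g(x), f'(x)/g'(x)) for the named example at a point x != 0."""
    f, g, df, dg = examples[name]
    return f(x) / g(x), df(x) / dg(x)

for name in examples:
    for x in (1e-1, 1e-3, 1e-6):
        original, derivative_ratio = compare(name, x)
        print(f"{name:>12}  x={x:.0e}  f/g={original:.8f}  f'/g'={derivative_ratio:.8f}")
```

In both cases the two ratios approach the same value, 1, as $x$ shrinks toward 0.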
Continuity
A function is continuous at a point if the limit equals the function value — there are no jumps, holes, or breaks.
Formally, $f$ is continuous at $a$ if three conditions hold:
- $f(a)$ is defined
- $\lim_{x \to a} f(x)$ exists
- $\lim_{x \to a} f(x) = f(a)$
If $f$ is continuous at every point in an interval, we say $f$ is continuous on that interval.
Types of Discontinuities
When continuity fails, it fails in one of three ways:
- Removable discontinuity: The limit exists but $f(a)$ is either undefined or doesn’t match. We can “fill in the hole.” Example: $f(x) = \frac{x^2 - 1}{x - 1}$ at $x = 1$.
- Jump discontinuity: Left and right limits both exist but differ. Example: the step function $H(x) = \mathbb{1}[x \ge 0]$, which jumps from 0 to 1 at $x = 0$.
- Essential (infinite) discontinuity: The limit does not exist (function blows up or oscillates). Example: $f(x) = 1/x$ at $x = 0$.
Properties of Continuous Functions
Continuous functions on closed intervals have powerful guarantees:
Intermediate Value Theorem (IVT): If $f$ is continuous on $[a, b]$ and $y$ is between $f(a)$ and $f(b)$, then there exists some $c \in [a, b]$ with $f(c) = y$.
Intuition: A continuous function cannot jump over a value. If it starts below $y$ and ends above $y$, it must cross $y$ somewhere. This is the mathematical reason why bisection search works for root-finding.
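The link between the IVT and bisection can be made concrete. The following is a minimal sketch of bisection search, assuming a continuous function whose values at the two endpoints have opposite signs; the function name `bisect_root` and its tolerance parameters are illustrative choices.

```python
from typing import Callable

def bisect_root(f: Callable[[float], float], lo: float, hi: float,
                tol: float = 1e-10, max_iter: int = 200) -> float:
    """Find x in [lo, hi] with f(x) close to 0, given a sign change on the bracket."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo == 0:
        return lo
    if f_hi == 0:
        return hi
    if f_lo * f_hi > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_mid == 0 or (hi - lo) / 2 < tol:
            return mid
        # Keep the half-interval that still has a sign change; by the IVT
        # a root must lie inside it.
        if f_lo * f_mid < 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2

root = bisect_root(lambda x: x**2 - 2, 0.0, 2.0)
print(root)  # an approximation of sqrt(2)
```

Each iteration halves the bracket, so the error shrinks by a factor of two per step. Continuity is what justifies discarding the half without a sign change. On a discontinuous function the same loop can converge to a jump rather than a root.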
Extreme Value Theorem (EVT): If $f$ is continuous on a closed interval $[a, b]$, then $f$ attains both a maximum and a minimum on that interval.
This theorem guarantees that optimization problems with continuous objectives over compact (closed and bounded) domains always have solutions.
Continuity in Higher Dimensions
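For a function $f: \mathbb{R}^n \to \mathbb{R}$, continuity at a point $\mathbf{a}$ means:
$$\lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = f(\mathbf{a})$$
where $\mathbf{x} \to \mathbf{a}$ means $\|\mathbf{x} - \mathbf{a}\| \to 0$. The function must approach the same value regardless of the direction from which $\mathbf{x}$ approaches $\mathbf{a}$.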
This is more subtle than the one-dimensional case — a function can be continuous along every line through a point and still be discontinuous at that point. The standard notion requires convergence along all paths.
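A classic illustration of this subtlety is $f(x, y) = \frac{x^2 y}{x^4 + y^2}$ with $f(0, 0) = 0$. This function is not taken from the text above; it is a standard textbook example used here as an assumption. The sketch compares its values along a straight line and along a parabola through the origin.

```python
def f(x: float, y: float) -> float:
    """x^2 * y / (x^4 + y^2), with the value 0 assigned at the origin."""
    if x == 0 and y == 0:
        return 0.0
    return x**2 * y / (x**4 + y**2)

ts = [10.0**-k for k in range(1, 6)]

# Along the line y = 2x the values equal 2t / (t^2 + 4), which tends to 0.
along_line = [f(t, 2 * t) for t in ts]

# Along the parabola y = x^2 the values equal t^4 / (2 t^4) = 1/2 for every t.
along_parabola = [f(t, t**2) for t in ts]

print("line:    ", along_line)
print("parabola:", along_parabola)
```

Along the line the values tend to $f(0, 0) = 0$, and the same holds for every other line through the origin. Along the parabola they stay at $1/2$, so the function is discontinuous at the origin.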
Lipschitz Continuity
A function $f$ is Lipschitz continuous with constant $L$ if:
$$|f(x) - f(y)| \le L\,|x - y|$$
for all $x, y$ in the domain. This is a stronger condition than ordinary continuity: it bounds how fast the function can change.
Key insight: Lipschitz continuity is the smoothness condition that optimization theory relies on most heavily. When we say “the gradient is $L$-Lipschitz,” we are bounding how quickly the gradient can change, which directly determines the maximum safe learning rate for gradient descent.
Most loss functions used in ML (MSE, cross-entropy with bounded inputs, Huber loss) are Lipschitz continuous or have Lipschitz gradients, which is what makes gradient-based optimization work reliably.
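A Lipschitz constant can be estimated empirically by taking the largest difference quotient over sampled pairs. The sketch below does this for the sigmoid on a uniform grid. The sigmoid's derivative peaks at $1/4$, so the estimate should approach that value from below. The helper `lipschitz_estimate` is a hypothetical name, and a sampled estimate is only a lower bound on the true constant.

```python
import math
from typing import Callable, Sequence

def sigmoid(x: float) -> float:
    """Logistic sigmoid; the plain formula is adequate on the moderate range used here."""
    return 1.0 / (1.0 + math.exp(-x))

def lipschitz_estimate(f: Callable[[float], float], xs: Sequence[float]) -> float:
    """Largest |f(b) - f(a)| / |b - a| over consecutive grid points."""
    return max(abs(f(b) - f(a)) / abs(b - a) for a, b in zip(xs, xs[1:]))

# Uniform grid on [-5, 5] with spacing 0.01.
grid = [-5.0 + 0.01 * i for i in range(1001)]
estimate = lipschitz_estimate(sigmoid, grid)
print(estimate)  # slightly below 0.25
```

For a differentiable one-dimensional function, difference quotients over small neighboring intervals track the local slope, so a fine grid is a reasonable probe. In higher dimensions or for non-smooth functions a sampled estimate can miss the worst case.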
Why This Matters for ML
Limits and continuity are not just abstract prerequisites — they underpin the core mechanisms of machine learning:
- Loss function continuity enables gradient-based optimization. If the loss were discontinuous, small parameter changes could cause unpredictable jumps in the loss value, making optimization impossible.
- Activation functions illustrate the continuity spectrum. The sigmoid is infinitely differentiable (smooth). ReLU is continuous but not differentiable at $x = 0$. The step function is discontinuous, which is why perceptrons could not be trained with gradient descent, motivating the switch to smooth activations.
- Convergence of training is a statement about limits. When we say “SGD converges,” we mean the sequence of parameter vectors has a limit that is a (local) minimum of the loss.
- Lipschitz conditions on the gradient determine the maximum learning rate. If the gradient is $L$-Lipschitz, gradient descent converges when the learning rate $\eta < 2/L$.
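The role of the Lipschitz constant in setting the learning rate can be seen on the simplest possible case. The sketch below assumes the one-dimensional quadratic $f(x) = \frac{L}{2} x^2$, whose gradient $Lx$ is exactly $L$-Lipschitz. Each gradient step multiplies $x$ by $(1 - \eta L)$, so the iterates shrink precisely when $0 < \eta < 2/L$. The function `run_gd` and its defaults are illustrative.

```python
def run_gd(eta: float, L: float = 4.0, x0: float = 1.0, steps: int = 100) -> float:
    """Run gradient descent on f(x) = (L / 2) * x^2 and return the final |x|."""
    x = x0
    for _ in range(steps):
        grad = L * x       # gradient of (L / 2) * x^2
        x -= eta * grad    # equivalent to x <- (1 - eta * L) * x
    return abs(x)

L = 4.0  # stability threshold for this quadratic is 2 / L = 0.5
for eta in (0.25, 0.45, 0.55):
    print(f"eta={eta:.2f}  final |x| = {run_gd(eta, L):.3e}")
```

With `eta = 0.25` (that is, $1/L$) the minimum is reached in one step, `eta = 0.45` still converges with oscillation, and `eta = 0.55` diverges. Real losses are not quadratics, so this only illustrates why the bound scales like $1/L$.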
These ideas are formalized through derivatives, which we turn to next.
Summary
- A limit describes the value a function approaches as input approaches a target
- The epsilon-delta definition makes “approaching” rigorous: for any tolerance $\varepsilon$, a neighborhood of radius $\delta$ exists
- Direct substitution works for continuous functions; algebraic manipulation handles indeterminate forms
- A function is continuous if the limit equals the function value — no holes, jumps, or blowups
- The IVT guarantees intermediate values are hit; the EVT guarantees extrema exist on closed intervals
- Lipschitz continuity bounds the rate of change and is critical for optimization convergence
- These foundations enable everything in the series, starting with derivatives and differentiation