Limits and Continuity: The Foundation of Calculus

Calculus & Optimization Series 1 / 18

Why Limits Matter

Every central idea in calculus — derivatives, integrals, series — is defined through a limit. The derivative is a limit of difference quotients. The integral is a limit of Riemann sums. Without limits, none of these concepts have rigorous meaning.

For machine learning, limits are not just abstract formalism. Gradient descent relies on derivatives, which rely on limits. Convergence guarantees for optimization algorithms are statements about limits of sequences. Even the universal approximation theorem is a limit-based result. Understanding limits gives you the language to reason about why training works, not just how.

Intuitive Idea of a Limit

A limit describes the value a function approaches as its input approaches some target — even if the function never actually reaches that value.

Consider the function:

f(x) = \frac{x^2 - 1}{x - 1}

At $x = 1$ , this function is undefined (division by zero). But what happens as $x$ gets close to 1?

$x$	$f(x)$
0.9	1.9
0.99	1.99
0.999	1.999
1.001	2.001
1.01	2.01
1.1	2.1

The values approach 2 from both sides. We write:

\lim_{x \to 1} \frac{x^2 - 1}{x - 1} = 2

Algebraically, this is clear: $x^2 - 1 = (x-1)(x+1)$, so for $x \neq 1$, $f(x) = x + 1$, which approaches 2 as $x \to 1$.

Key insight: A limit describes the tendency of a function near a point, not the function’s value at that point. The function does not even need to be defined at the target point for the limit to exist.
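
The same behavior can be checked numerically. The following is a minimal sketch in Python (the variable names are illustrative, not from the text): it evaluates $f$ at shrinking offsets on both sides of 1 and confirms that evaluating at 1 itself fails.

```python
def f(x):
    # Undefined at x = 1: both numerator and denominator vanish there.
    return (x**2 - 1) / (x - 1)

# Offsets h = 0.1, 0.01, ..., 1e-6 on either side of the target point.
offsets = [10.0**-k for k in range(1, 7)]
left = [f(1 - h) for h in offsets]
right = [f(1 + h) for h in offsets]

for h, lo, hi in zip(offsets, left, right):
    print(f"h={h:.0e}  f(1-h)={lo:.6f}  f(1+h)={hi:.6f}")

# Evaluating at the target itself fails, even though the limit exists.
try:
    f(1)
except ZeroDivisionError:
    print("f(1) is undefined")
```

Numerical sampling only suggests the limit; the algebraic argument is what establishes it.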

The Formal Definition

The intuitive idea of “approaching” needs to be made precise. The epsilon-delta definition, formalized by Karl Weierstrass in the 19th century, does exactly that.

We say $\lim_{x \to a} f(x) = L$ if:

\forall \, \epsilon > 0, \; \exists \, \delta > 0 \text{ such that } 0 < |x - a| < \delta \implies |f(x) - L| < \epsilon

In plain language: no matter how small a tolerance $\epsilon$ you demand around $L$ , I can find a neighborhood of radius $\delta$ around $a$ such that every $x$ in that neighborhood (except $a$ itself) maps to within $\epsilon$ of $L$ .

Worked Example

Let us prove that $\lim_{x \to 3} (2x + 1) = 7$ .

We need: given any $\epsilon > 0$ , find $\delta > 0$ such that $0 < |x - 3| < \delta \implies |f(x) - 7| < \epsilon$ .

\begin{aligned} |f(x) - 7| &= |(2x + 1) - 7| \\[6pt] &= |2x - 6| \\[6pt] &= 2|x - 3| \end{aligned}

We want $2|x - 3| < \epsilon$ , which means $|x - 3| < \epsilon / 2$ . So choosing $\delta = \epsilon / 2$ works. For any $\epsilon > 0$ , whenever $0 < |x - 3| < \delta = \epsilon/2$ , we get $|f(x) - 7| = 2|x - 3| < 2 \cdot \epsilon/2 = \epsilon$ .
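
As a sanity check on the proof, the choice $\delta = \epsilon/2$ can be tested on random samples. This is a sketch, not a proof: the helper name `holds_on_samples`, the sample count, and the seed are assumptions made for illustration.

```python
import random

def f(x):
    return 2 * x + 1

def holds_on_samples(eps, delta, trials=10_000, seed=0):
    """Return True if |f(x) - 7| < eps for every sampled x with 0 < |x - 3| < delta."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Scale by 0.999 so the offset stays strictly inside (-delta, delta).
        offset = 0.999 * rng.uniform(-delta, delta)
        if offset == 0:
            continue  # the definition excludes x = a itself
        if abs(f(3 + offset) - 7) >= eps:
            return False
    return True

for eps in (1.0, 1e-3, 1e-6):
    # delta = eps / 2 is the proof's choice; delta = eps is deliberately too large.
    print(eps, holds_on_samples(eps, delta=eps / 2), holds_on_samples(eps, delta=eps))
```

The second column should report success for every tolerance, while the too-generous choice $\delta = \epsilon$ fails because points with $|x - 3| \geq \epsilon/2$ are allowed in.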

One-Sided Limits

Sometimes a function approaches different values from the left and right. The left-hand limit $\lim_{x \to a^-} f(x)$ considers only $x < a$ , while the right-hand limit $\lim_{x \to a^+} f(x)$ considers only $x > a$ .

The two-sided limit exists if and only if both one-sided limits exist and are equal:

\lim_{x \to a} f(x) = L \iff \lim_{x \to a^-} f(x) = L \text{ and } \lim_{x \to a^+} f(x) = L
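
A rough numerical illustration of this criterion, assuming a step function that jumps from 0 to 1 at the origin. The helpers `one_sided_estimate` and `two_sided_estimate` are hypothetical names, and evaluating at a single nearby point is a crude heuristic rather than a real limit computation.

```python
def step(x):
    # 0 for x < 0, 1 for x >= 0: the two sides disagree at the origin.
    return 0.0 if x < 0 else 1.0

def one_sided_estimate(f, a, side, k_max=12):
    """Crude estimate of a one-sided limit: f evaluated very close to a on one side."""
    sign = -1.0 if side == "left" else 1.0
    return f(a + sign * 10.0**-k_max)

def two_sided_estimate(f, a, tol=1e-6):
    """Return the left estimate if both one-sided estimates agree within tol, else None."""
    left = one_sided_estimate(f, a, "left")
    right = one_sided_estimate(f, a, "right")
    return left if abs(left - right) < tol else None

print(two_sided_estimate(step, 0.0))                 # None: left and right disagree
print(two_sided_estimate(lambda x: 2 * x + 1, 3.0))  # approximately 7
```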

Computing Limits

Direct Substitution

If $f$ is a “nice” function (polynomial, rational with nonzero denominator, exponential, etc.), simply plug in:

\lim_{x \to a} f(x) = f(a)

This works whenever $f$ is continuous at $a$ — a concept we formalize below.

Algebraic Manipulation

When direct substitution yields $0/0$ (an indeterminate form), simplify first:

\lim_{x \to 4} \frac{x^2 - 16}{x - 4} = \lim_{x \to 4} \frac{(x-4)(x+4)}{x-4} = \lim_{x \to 4} (x + 4) = 8

Other algebraic techniques include rationalizing (multiplying by the conjugate), factoring, and common denominators.
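
To illustrate rationalizing, consider $\lim_{x \to 0} (\sqrt{x + 1} - 1)/x$, an example chosen here for illustration. Multiplying numerator and denominator by the conjugate $\sqrt{x + 1} + 1$ gives $1/(\sqrt{x + 1} + 1)$, which can be evaluated directly at $x = 0$. A short sketch comparing the two forms:

```python
import math

def naive(x):
    # Direct form: gives 0/0 at x = 0.
    return (math.sqrt(x + 1) - 1) / x

def rationalized(x):
    # After multiplying numerator and denominator by sqrt(x + 1) + 1.
    return 1 / (math.sqrt(x + 1) + 1)

for x in (1e-1, 1e-3, 1e-6):
    print(x, naive(x), rationalized(x))

print(rationalized(0.0))  # 0.5, the value of the limit
```

The two forms agree away from 0, but only the rationalized one is defined at 0.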

Squeeze Theorem

If $g(x) \leq f(x) \leq h(x)$ for all $x$ near $a$ (except possibly at $a$ itself), and $\lim_{x \to a} g(x) = \lim_{x \to a} h(x) = L$, then $\lim_{x \to a} f(x) = L$.

Example: $\lim_{x \to 0} x^2 \sin(1/x) = 0$ because $-x^2 \leq x^2 \sin(1/x) \leq x^2$ and both bounds go to 0.
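
The squeeze in this example can be observed numerically. The sketch below samples points on both sides of 0 (the sample grid is an arbitrary choice) and checks that $x^2 \sin(1/x)$ stays between $-x^2$ and $x^2$.

```python
import math

def f(x):
    return x**2 * math.sin(1 / x)

# Sample points approaching 0 from both sides.
xs = [s * 10.0**-k for k in range(1, 9) for s in (-1.0, 1.0)]

for x in xs:
    print(f"x={x:+.0e}  lower={-x**2:.1e}  f(x)={f(x):+.1e}  upper={x**2:.1e}")

# True if every sampled value lies between the two bounds.
squeezed = all(-x**2 <= f(x) <= x**2 for x in xs)
print(squeezed)
```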

Limits at Infinity

The behavior of $f(x)$ as $x \to \infty$ tells us about long-term trends. The key principle is dominant term analysis: the fastest-growing term determines the limit.

\lim_{x \to \infty} \frac{3x^2 + 5x}{x^2 + 1} = \lim_{x \to \infty} \frac{3 + 5/x}{1 + 1/x^2} = 3

Growth rate hierarchy (slowest to fastest):

\ln x \ll x^a \ll a^x \ll x! \ll x^x \quad \text{for } a > 1

This hierarchy matters in ML when analyzing algorithm complexity: an $O(n \log n)$ algorithm is asymptotically faster than an $O(n^2)$ one, and the gap widens as $n$ grows.
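
One way to see the hierarchy is to look at the ratio of each term to the next faster-growing one: if the ordering holds, every ratio should shrink toward 0. A minimal sketch, assuming $a = 2$ and a handful of integer sample points:

```python
import math

a = 2  # any a > 1 works; 2 is an arbitrary choice

def ratios(x):
    """Ratios of consecutive terms in the hierarchy at integer x."""
    return (
        math.log(x) / x**a,          # ln x vs x^a
        x**a / a**x,                 # x^a  vs a^x
        a**x / math.factorial(x),    # a^x  vs x!
        math.factorial(x) / x**x,    # x!   vs x^x
    )

for x in (10, 20, 40, 80):
    print(x, ["%.2e" % r for r in ratios(x)])
```

Each column of the printed output should decrease as $x$ grows, with the later columns collapsing much faster than the first.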

L’Hopital’s Rule (Preview)

When substitution gives $0/0$ or $\infty/\infty$, and $f$ and $g$ are differentiable near $a$ with $g'(x) \neq 0$ there, L’Hopital’s rule states:

\lim_{x \to a} \frac{f(x)}{g(x)} = \lim_{x \to a} \frac{f'(x)}{g'(x)}

provided the limit on the right-hand side exists (or is $\pm\infty$). This requires derivatives, which we cover next.
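
As a small preview, take $f(x) = \sin x$ and $g(x) = x$ at $a = 0$, where substitution gives $0/0$. The sketch below uses hand-written derivatives ($\cos x$ and 1, stated here without proof) and compares the original quotient with the quotient of derivatives near 0.

```python
import math

# f(x) = sin(x), g(x) = x: substitution at 0 gives 0/0.
f, g = math.sin, lambda x: x
# Derivatives written out by hand: f'(x) = cos(x), g'(x) = 1.
df, dg = math.cos, lambda x: 1.0

for x in (1e-1, 1e-2, 1e-4):
    print(x, f(x) / g(x), df(x) / dg(x))

# The quotient of derivatives can be evaluated directly at 0.
limit_via_derivatives = df(0.0) / dg(0.0)
print(limit_via_derivatives)  # 1.0
```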

Continuity

A function is continuous at a point $a$ if the limit equals the function value — there are no jumps, holes, or breaks.

Formally, $f$ is continuous at $a$ if three conditions hold:

$f(a)$ is defined
$\lim_{x \to a} f(x)$ exists
$\lim_{x \to a} f(x) = f(a)$

If $f$ is continuous at every point in an interval, we say $f$ is continuous on that interval.

Types of Discontinuities

When continuity fails, it fails in one of three ways:

Removable discontinuity: The limit exists but $f(a)$ is either undefined or doesn’t match. We can “fill in the hole.” Example: $f(x) = (x^2 - 1)/(x - 1)$ at $x = 1$ .
Jump discontinuity: Left and right limits both exist but differ. Example: the step function $f(x) = \begin{cases} 0 & x < 0 \\ 1 & x \geq 0 \end{cases}$ .
Essential (infinite) discontinuity: At least one of the one-sided limits fails to exist as a finite number, because the function blows up or oscillates. Example: $f(x) = 1/x$ at $x = 0$.

Properties of Continuous Functions

Continuous functions on closed intervals have powerful guarantees:

Intermediate Value Theorem (IVT): If $f$ is continuous on $[a, b]$ and $c$ is strictly between $f(a)$ and $f(b)$, then there exists some $x^* \in (a, b)$ with $f(x^*) = c$.

Intuition: A continuous function cannot jump over a value. If it starts below $c$ and ends above $c$ , it must cross $c$ somewhere. This is the mathematical reason why bisection search works for root-finding.
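
A minimal bisection sketch makes the connection concrete. It assumes $f$ is continuous on $[a, b]$ with a sign change between the endpoints; the function name `bisection`, the tolerance, and the iteration cap are illustrative choices. To solve $f(x) = c$ for a general $c$, apply it to $f(x) - c$.

```python
def bisection(f, a, b, tol=1e-10, max_iter=200):
    """Find a root of a continuous f on [a, b], given f(a) and f(b) of opposite sign."""
    fa, fb = f(a), f(b)
    if fa == 0:
        return a
    if fb == 0:
        return b
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        mid = (a + b) / 2
        fm = f(mid)
        # Stop on an exact root or once the bracket is narrower than 2 * tol.
        if fm == 0 or (b - a) / 2 < tol:
            return mid
        # Keep the half-interval on which the sign change (hence a root) persists.
        if fa * fm < 0:
            b, fb = mid, fm
        else:
            a, fa = mid, fm
    return (a + b) / 2

root = bisection(lambda x: x * x - 2, 1.0, 2.0)
print(root)  # close to the square root of 2
```

The IVT is what justifies the key step: whichever half keeps the sign change must still contain a root.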

Extreme Value Theorem (EVT): If $f$ is continuous on a closed interval $[a, b]$ , then $f$ attains both a maximum and a minimum on that interval.

This theorem, together with its generalization to continuous functions on compact (closed and bounded) subsets of $\mathbb{R}^n$, guarantees that optimization problems with continuous objectives over compact domains always have solutions.

Continuity in Higher Dimensions

For a function $f: \mathbb{R}^n \to \mathbb{R}$ , continuity at a point $\mathbf{a}$ means:

\lim_{\mathbf{x} \to \mathbf{a}} f(\mathbf{x}) = f(\mathbf{a})

where $\mathbf{x} \to \mathbf{a}$ means $\|\mathbf{x} - \mathbf{a}\| \to 0$. The function must approach the same value regardless of the path along which $\mathbf{x}$ approaches $\mathbf{a}$.

This is more subtle than the one-dimensional case: a function can be continuous along every line through a point and still be discontinuous at that point. The standard notion requires convergence along all paths, not just straight lines.
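
A standard counterexample, not taken from the text above, is $f(x, y) = x^2 y / (x^4 + y^2)$ with $f(0, 0) = 0$. Along any line $y = mx$ the values tend to 0, but along the parabola $y = x^2$ the value is always $1/2$. The sketch below evaluates $f$ close to the origin along a few such paths; the slopes and the distance `t` are arbitrary choices.

```python
def f(x, y):
    # f(0, 0) is set to 0 by definition.
    if x == 0 and y == 0:
        return 0.0
    return x**2 * y / (x**4 + y**2)

t = 1e-6  # small parameter, so the sampled points are close to the origin

# Along straight lines y = m * x the values are close to f(0, 0) = 0.
along_lines = {m: f(t, m * t) for m in (-2.0, -1.0, 0.5, 1.0, 3.0)}
# The same holds along the vertical line x = 0.
along_vertical = f(0.0, t)
# Along the parabola y = x^2 the value stays at 1/2.
along_parabola = f(t, t**2)

print(along_lines)
print(along_vertical, along_parabola)
```

Since one path gives values near 0 and another gives $1/2$, the limit at the origin does not exist, and $f$ is discontinuous there.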

Lipschitz Continuity

A function $f$ is Lipschitz continuous with constant $L$ if:

|f(x) - f(y)| \leq L \|x - y\|

for all $x, y$ in the domain. This is a stronger condition than ordinary continuity: it bounds how fast the function can change.
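
For a one-dimensional function, a rough feel for the constant can be obtained by measuring slopes between neighboring grid points. This is a sketch under stated assumptions: the helper `lipschitz_lower_bound` is a hypothetical name, and the result is only a lower bound on the best constant over the sampled interval, not a certificate.

```python
import math

def lipschitz_lower_bound(f, a, b, n=1000):
    """Largest slope |f(y) - f(x)| / |y - x| over neighboring grid points in [a, b].

    Any valid Lipschitz constant on [a, b] must be at least this large.
    """
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    return max(abs(f(y) - f(x)) / (y - x) for x, y in zip(xs, xs[1:]))

print(lipschitz_lower_bound(math.sin, -5.0, 5.0))           # close to 1
print(lipschitz_lower_bound(lambda x: x * x, -3.0, 3.0))    # close to 6
print(lipschitz_lower_bound(lambda x: x * x, -30.0, 30.0))  # close to 60
```

The sine estimate stays near 1 on any interval, while the estimate for $x^2$ grows with the interval: $x^2$ is Lipschitz on bounded sets but not on all of $\mathbb{R}$.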

Key insight: Lipschitz continuity is the smoothness condition that optimization theory relies on most heavily. When we say “the gradient is $L$ -Lipschitz,” we are bounding how quickly the gradient can change, which directly determines the maximum safe learning rate for gradient descent.

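Many loss functions used in ML fit this picture: Huber loss is Lipschitz continuous, MSE has a Lipschitz gradient (though it is not itself globally Lipschitz), and cross-entropy is well behaved when its inputs are bounded. These properties are a large part of why gradient-based optimization works reliably.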

Why This Matters for ML

Limits and continuity are not just abstract prerequisites — they underpin the core mechanisms of machine learning:

Loss function continuity enables gradient-based optimization. If the loss were discontinuous, small parameter changes could cause unpredictable jumps in the loss value, making gradient-based optimization unreliable.
Activation functions illustrate the continuity spectrum. The sigmoid $\sigma(x) = 1/(1 + e^{-x})$ is infinitely differentiable (smooth). ReLU $f(x) = \max(0, x)$ is continuous but not differentiable at $x = 0$. The step function is discontinuous at 0 and has zero derivative everywhere else, so it passes no gradient signal; this is why step-activated perceptrons could not be trained with gradient descent, motivating the switch to differentiable activations.
Convergence of training is a statement about limits. When we say “SGD converges,” we mean that the sequence of parameter vectors (or of loss values or gradient norms) approaches a limit, ideally a (local) minimum or at least a stationary point of the loss.
Lipschitz conditions on the gradient determine the maximum learning rate. If the gradient is $L$-Lipschitz, each gradient descent step with a fixed learning rate $0 < \alpha < 2/L$ is guaranteed not to increase the loss, which is the basis of standard convergence proofs.
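
The learning-rate threshold can be seen on the simplest possible example, the quadratic $f(x) = (L/2)x^2$, whose gradient $Lx$ is $L$-Lipschitz. Each step multiplies $x$ by $(1 - \alpha L)$, so the iterates shrink exactly when $0 < \alpha < 2/L$. A minimal sketch, with $L = 4$ and the step count chosen arbitrarily:

```python
L = 4.0  # assumed curvature; the gradient of f below is L-Lipschitz

def grad(x):
    # f(x) = (L / 2) * x**2, so f'(x) = L * x.
    return L * x

def gradient_descent(alpha, x0=1.0, steps=200):
    """Run fixed-step gradient descent and return the final iterate."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

print(gradient_descent(alpha=1.0 / L))  # reaches the minimizer x = 0 after one step
print(gradient_descent(alpha=1.9 / L))  # below 2/L: shrinks toward 0
print(gradient_descent(alpha=2.1 / L))  # above 2/L: grows without bound
```

Real losses are not quadratics, but the same threshold appears in the general analysis because $L$ bounds the curvature the step has to cope with.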

These ideas are formalized through derivatives, which we turn to next.

Summary

A limit describes the value a function approaches as input approaches a target
The epsilon-delta definition makes “approaching” rigorous: for any tolerance $\epsilon$, there is a radius $\delta$ that keeps $f(x)$ within $\epsilon$ of $L$
Direct substitution works for continuous functions; algebraic manipulation handles indeterminate forms
A function is continuous if the limit equals the function value — no holes, jumps, or blowups
The IVT guarantees intermediate values are hit; the EVT guarantees extrema exist on closed intervals
Lipschitz continuity bounds the rate of change and is critical for optimization convergence
These foundations enable everything in the series, starting with derivatives and differentiation
