In 1989, Cybenko published a paper titled Approximation by Superpositions of a Sigmoidal Function, in which he proved that a feedforward neural network with a single hidden layer can approximate any continuous function defined on a compact set. This result later became known as the Universal Approximation Theorem. In this article, I will attempt to outline the entire proof, clarify points of confusion I encountered while reading, and fill in mathematical details that Cybenko left unexplored.
Problem Description

A feedforward neural network with a single hidden layer typically has the following form:

$$G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^T x + \theta_j).$$

In this structure, $y_j \in \mathbb{R}^n$ represents the weights applied to the input $x \in \mathbb{R}^n$, $\alpha_j$ represents the weight on the output of hidden unit $j$, and $\theta_j$ represents the bias for neuron $j$. We call $y_j^T x + \theta_j$ an affine transformation of the neural network input.
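To fix ideas, here is a minimal NumPy sketch of this forward pass (my own illustration, not code from the paper; all names are hypothetical), using the logistic function as a stand-in for $\sigma$:

```python
import numpy as np

def logistic(t):
    """A continuous sigmoidal function: tends to 1 as t -> +inf and 0 as t -> -inf."""
    return 1.0 / (1.0 + np.exp(-t))

def G(x, alpha, Y, theta, sigma=logistic):
    """Single-hidden-layer network G(x) = sum_j alpha_j * sigma(y_j^T x + theta_j).

    x:     input vector, shape (n,)
    alpha: output weights alpha_j, shape (N,)
    Y:     input weights, one row y_j per hidden unit, shape (N, n)
    theta: biases theta_j, shape (N,)
    """
    return alpha @ sigma(Y @ x + theta)

# Example: a network with N = 3 hidden units on inputs from the unit square.
rng = np.random.default_rng(0)
alpha, Y, theta = rng.normal(size=3), rng.normal(size=(3, 2)), rng.normal(size=3)
print(G(np.array([0.5, 0.5]), alpha, Y, theta))
```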
Let $I_n$ represent the $n$-dimensional unit cube $[0,1]^n$, let $C(I_n)$ denote the space of continuous functions on $I_n$, and let $M(I_n)$ denote the space of finite, signed, regular Borel measures on $I_n$. We use $\|f\|$ to denote the supremum norm of $f \in C(I_n)$; since $I_n$ is compact, this is simply the maximum value of $|f|$ over its domain. The main objective of the proof is to explore the conditions under which functions of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^T x + \theta_j)$$

are dense in $C(I_n)$ with respect to the supremum norm.
Definition 1 Let $\mu \in M(I_n)$. If, for all $y \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$,

$$\int_{I_n} \sigma(y^T x + \theta) \, d\mu(x) = 0$$

implies $\mu = 0$, then $\sigma$ is discriminatory.
I found this definition somewhat confusing while reading. Another way to express it is: for any nonzero measure $\mu \in M(I_n)$, there exist $y \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$ such that

$$\int_{I_n} \sigma(y^T x + \theta) \, d\mu(x) \neq 0.$$

If the above integral instead equals zero for every choice of $y$ and $\theta$, then $\mu$ must be a zero measure. More intuitively, a discriminatory $\sigma$ is non-destructive: composing $\sigma$ with the weighted and shifted input $y^T x + \theta$ loses no information about the measure, so this family of test functions is rich enough to distinguish any nonzero measure from zero. Next, let’s look at the definition of a sigmoidal function.
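To make the definition concrete, take $\sigma$ to be the Heaviside step function $H(t) = 1$ for $t \ge 0$ and $H(t) = 0$ otherwise (an example of mine, not the paper's). Then

$$\int_{I_n} H(y^T x + \theta) \, d\mu(x) = \mu\big(\{x \in I_n : y^T x + \theta \ge 0\}\big),$$

so for this activation, being discriminatory says precisely that a measure assigning zero mass to every closed half-space must be the zero measure; this is exactly the property established in the proof of Lemma 1 below.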
Definition 2 A sigmoidal activation function $\sigma$ satisfies

$$\sigma(t) \to \begin{cases} 1 & \text{as } t \to +\infty, \\ 0 & \text{as } t \to -\infty. \end{cases}$$

There are many functions that meet this definition (such as the logistic function or the Heaviside step function), but we will discuss the general case here.
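As a quick check of Definition 2 (a standard computation, not from the paper), the logistic function is sigmoidal:

$$\sigma(t) = \frac{1}{1 + e^{-t}}, \qquad \lim_{t \to +\infty} \sigma(t) = 1, \qquad \lim_{t \to -\infty} \sigma(t) = 0.$$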
Main Conclusion
Theorem 1 Let $\sigma$ be any continuous discriminatory function. Then finite sums of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^T x + \theta_j) \tag{1}$$

are dense in $C(I_n)$. In other words, for any $f \in C(I_n)$ and $\varepsilon > 0$, there exists a $G$ of the above form such that

$$|G(x) - f(x)| < \varepsilon \quad \text{for all } x \in I_n.$$

Proof Let $S$ be the set of functions constructed by (1). Clearly, $S$ is a linear subspace of $C(I_n)$. To prove that $S$ is dense in $C(I_n)$, we need to show that the closure of $S$ is $C(I_n)$ itself. We denote the closure of $S$ by $R$. Assuming $R \neq C(I_n)$, then $R$ is a proper closed subspace of $C(I_n)$.
Hahn-Banach Theorem Let $X$ be a real vector space, and let $p$ be a sublinear functional on $X$. Let $M$ be a linear subspace of $X$, and let $f$ be a real linear functional on $M$. If $f(x) \le p(x)$ for all $x \in M$, then there exists a real linear functional $F$ on $X$ such that $F(x) = f(x)$ for all $x \in M$ and $F(x) \le p(x)$ for all $x \in X$.
The Hahn-Banach theorem provides a way to extend a bounded linear functional defined on a subspace to a bounded linear functional defined on the entire space. According to the Hahn-Banach theorem, since $R$ is a proper closed subspace of $C(I_n)$, there exists a bounded linear functional on $C(I_n)$, denoted by $L$, such that

$$L \neq 0 \quad \text{but} \quad L(R) = L(S) = 0.$$
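For completeness, here is the standard construction behind this step, which the paper does not spell out. Pick $h_0 \in C(I_n) \setminus R$ and let $d = \operatorname{dist}(h_0, R) > 0$ (positive because $R$ is closed). On the subspace $M = R \oplus \operatorname{span}\{h_0\}$, define

$$L_0(r + \lambda h_0) = \lambda d, \qquad r \in R, \ \lambda \in \mathbb{R},$$

which satisfies $|L_0(r + \lambda h_0)| = |\lambda| d \le \|r + \lambda h_0\|$, since $\|r + \lambda h_0\| \ge |\lambda| \operatorname{dist}(h_0, R)$ for $\lambda \neq 0$. Extending $L_0$ to all of $C(I_n)$ via the Hahn-Banach theorem with $p(h) = \|h\|$ yields the desired $L$: it is bounded, vanishes on $R$, and $L(h_0) = d \neq 0$.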
Riesz Representation Theorem Let $X$ be a locally compact Hausdorff space, and let $C_c(X)$ be the set of continuous functions with compact support on $X$. For every positive linear functional $\Lambda$ on $C_c(X)$, there exists a unique positive and regular Borel measure $\mu$ on $X$ such that for all $f \in C_c(X)$,

$$\Lambda(f) = \int_X f \, d\mu.$$

The measure $\mu$ is regular and satisfies the following properties:
For any open set $U \subseteq X$, $\mu(U) = \sup\{\mu(K) : K \subseteq U, \ K \text{ compact}\}$.
For any compact set $K \subseteq X$, $\mu(K) < \infty$.
By the Riesz Representation Theorem (in its extension to bounded linear functionals, which are represented by finite signed regular Borel measures), there is some $\mu \in M(I_n)$ such that for any $h \in C(I_n)$, the functional $L$ can be expressed as follows:

$$L(h) = \int_{I_n} h(x) \, d\mu(x).$$

For any $y \in \mathbb{R}^n$ and $\theta \in \mathbb{R}$, functions of the form $\sigma(y^T x + \theta)$ are in $R$, and $L$ is zero on $R$. Therefore,

$$\int_{I_n} \sigma(y^T x + \theta) \, d\mu(x) = 0 \quad \text{for all } y \in \mathbb{R}^n, \ \theta \in \mathbb{R}.$$

Since $\sigma$ is a discriminatory function, we must have $\mu = 0$. This contradicts the conclusion $L \neq 0$, because $\mu = 0$ forces

$$L(h) = \int_{I_n} h(x) \, d\mu(x) = 0 \quad \text{for all } h \in C(I_n).$$

Thus, $S$ is dense in $C(I_n)$. Q.E.D.
Lemma 1 Any continuous sigmoidal function is discriminatory.
To revisit the definition of a discriminatory function, we need to check that

$$\int_{I_n} \sigma(y^T x + \theta) \, d\mu(x) = 0 \quad \text{for all } y \in \mathbb{R}^n, \ \theta \in \mathbb{R}$$

implies $\mu = 0$. This is an integral over a measure space, and we can analyze it by constructing a sequence of functions that is uniformly bounded and converges pointwise to a simple, indicator-like function.
Note that for any $x \in I_n$, $y \in \mathbb{R}^n$, and $\theta, \varphi \in \mathbb{R}$, we have

$$\sigma_\lambda(x) := \sigma\big(\lambda (y^T x + \theta) + \varphi\big) \to \begin{cases} 1 & \text{if } y^T x + \theta > 0, \\ 0 & \text{if } y^T x + \theta < 0, \\ \sigma(\varphi) & \text{if } y^T x + \theta = 0, \end{cases} \quad \text{as } \lambda \to +\infty.$$

Thus, as $\lambda \to \infty$, the function sequence $\{\sigma_\lambda\}$ is bounded and converges pointwise to

$$\gamma(x) = \begin{cases} 1 & \text{if } y^T x + \theta > 0, \\ \sigma(\varphi) & \text{if } y^T x + \theta = 0, \\ 0 & \text{if } y^T x + \theta < 0. \end{cases}$$
Define the hyperplane $\Pi_{y,\theta} = \{x : y^T x + \theta = 0\}$ and the open half-space $H_{y,\theta} = \{x : y^T x + \theta > 0\}$. Note that for any $\lambda$ and $\varphi$, we have $\int_{I_n} \sigma_\lambda(x) \, d\mu(x) = 0$, since $\sigma_\lambda$ is itself of the form $\sigma(\tilde{y}^T x + \tilde{\theta})$ with $\tilde{y} = \lambda y$ and $\tilde{\theta} = \lambda \theta + \varphi$. Therefore, by the Lebesgue Dominated Convergence Theorem, we can obtain

$$0 = \lim_{\lambda \to \infty} \int_{I_n} \sigma_\lambda(x) \, d\mu(x) = \int_{I_n} \gamma(x) \, d\mu(x) = \sigma(\varphi) \, \mu(\Pi_{y,\theta}) + \mu(H_{y,\theta}),$$

where $\varphi$ is arbitrary. Letting $\varphi \to -\infty$ gives $\mu(H_{y,\theta}) = 0$, and letting $\varphi \to +\infty$ then gives $\mu(\Pi_{y,\theta}) = 0$, so the measure of every hyperplane and every half-space is zero. Next, we prove that in this case, $\mu$ must be a zero measure.
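To see this sharpening behavior numerically, here is a small sketch (my own illustration, not from the paper): as $\lambda$ grows, the logistic $\sigma(\lambda t + \varphi)$ approaches the indicator-like limit $\gamma$.

```python
import numpy as np

def logistic(t):
    # Clip to avoid overflow in exp for large |t|; results agree to float precision.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -500, 500)))

t = np.array([-0.5, -0.01, 0.0, 0.01, 0.5])  # sample values of y^T x + theta
phi = -3.0
for lam in [1, 10, 100, 10_000]:
    print(lam, np.round(logistic(lam * t + phi), 4))
# As lam grows, entries with t < 0 tend to 0, entries with t > 0 tend to 1,
# and the t == 0 entry stays at logistic(phi): exactly the pointwise limit gamma.
```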
Let $y \in \mathbb{R}^n$ be a fixed value. For a bounded measurable function $h : \mathbb{R} \to \mathbb{R}$, define the linear functional $F$ as follows:

$$F(h) = \int_{I_n} h(y^T x) \, d\mu(x).$$

Since $|F(h)| \le \|h\|_\infty \, |\mu|(I_n)$ and $\mu$ is a finite measure, we conclude that $F$ is a bounded (continuous) linear functional on $L^\infty(\mathbb{R})$.
Define $h$ as the indicator function of the half-line $[\theta, \infty)$, that is,

$$h(t) = \begin{cases} 1 & \text{if } t \ge \theta, \\ 0 & \text{if } t < \theta. \end{cases}$$

Based on the assumption that the measure of any half-space and hyperplane is zero, we have

$$F(h) = \int_{I_n} h(y^T x) \, d\mu(x) = \mu(\Pi_{y,-\theta}) + \mu(H_{y,-\theta}) = 0.$$

Similarly, if $h$ is the indicator function of the open half-line $(\theta, \infty)$, then $F(h) = 0$. If we let $\mathbb{1}_{[a,b)}$ denote the indicator function of the interval $[a,b)$, then we can write $\mathbb{1}_{[a,b)} = \mathbb{1}_{[a,\infty)} - \mathbb{1}_{[b,\infty)}$. Since $F$ is a linear functional, from the above expression we conclude that $F(\mathbb{1}_J) = 0$ holds for any interval $J$, open or closed. If $h$ is a step function, $h = \sum_i c_i \mathbb{1}_{J_i}$, then

$$F(h) = \sum_i c_i F(\mathbb{1}_{J_i}) = 0.$$
Since step functions are dense in $L^\infty(\mathbb{R})$, for any $h \in L^\infty(\mathbb{R})$ there exists a sequence of step functions $\{h_k\}$ such that $\|h_k - h\|_\infty \to 0$. By the continuity of $F$, we can deduce

$$F(h) = \lim_{k \to \infty} F(h_k) = 0,$$

which implies $F \equiv 0$ on $L^\infty(\mathbb{R})$.
In the above discussion, we assumed a constant value of $y$; since $y$ was arbitrary, the conclusion holds for every $y \in \mathbb{R}^n$. In particular, taking $h = \cos$ and $h = \sin$, consider the Fourier transform of $\mu$ at an arbitrary $y$:

$$\hat{\mu}(y) = \int_{I_n} e^{i \, y^T x} \, d\mu(x) = \int_{I_n} \cos(y^T x) \, d\mu(x) + i \int_{I_n} \sin(y^T x) \, d\mu(x) = F(\cos) + i F(\sin) = 0.$$

If $\hat{\mu}(y) = 0$ holds for all $y$, then the measure $\mu$ must be zero. Therefore, any continuous sigmoidal function is discriminatory. Q.E.D.
Continuing with the main conclusions and implications for artificial neural networks:
Main Conclusion and Artificial Neural Networks
Theorem 2 Let $\sigma$ be any continuous sigmoidal function. Finite sums of the following form are dense in $C(I_n)$:

$$G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^T x + \theta_j).$$

In other words, for any $f \in C(I_n)$ and $\varepsilon > 0$, there exists a $G$ of the above form such that

$$|G(x) - f(x)| < \varepsilon \quad \text{for all } x \in I_n.$$
Proof By Lemma 1, any continuous sigmoidal function is discriminatory, so it satisfies the conditions of Theorem 1, and the conclusion follows. Q.E.D.
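The theorem is purely existential: it guarantees that a good $G$ exists but gives no recipe for finding its parameters. As a rough numerical illustration (my own sketch, not part of the paper), one can freeze random hidden-layer parameters $y_j, \theta_j$ and solve a least-squares problem for the output weights $\alpha_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
logistic = lambda t: 1.0 / (1.0 + np.exp(-t))

# Target: a continuous function on the unit interval I_1 = [0, 1].
f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x

N = 50                                 # number of hidden units
x = np.linspace(0.0, 1.0, 400)         # dense grid on I_1
y = rng.normal(scale=10.0, size=N)     # input weights y_j (random, then frozen)
theta = rng.uniform(-10.0, 10.0, N)    # biases theta_j

# Hidden-layer features sigma(y_j * x + theta_j), one column per unit.
Phi = logistic(np.outer(x, y) + theta)

# Fit output weights alpha_j by least squares.
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)

G = Phi @ alpha
print("max |G - f| on the grid:", np.max(np.abs(G - f(x))))
```

Increasing $N$ typically drives the grid error down, which is consistent with (though of course no substitute for) Theorem 2.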
Thus, we have proven that a feedforward neural network with a single hidden layer can approximate any continuous function defined on a compact set. Function approximation plays a fundamental role in machine learning, especially in tasks like classification. Let $m$ be the Lebesgue measure on $I_n$, and let $P_1, P_2, \ldots, P_k$ be disjoint measurable subsets of $I_n$ whose union is $I_n$. We can define a decision function $f$ as

$$f(x) = j \quad \text{if and only if} \quad x \in P_j.$$

If $f(x) = j$, then it must be that $x \in P_j$, thus completing the classification. The following theorem will demonstrate that such decision functions can be approximated by a single hidden-layer feedforward neural network.
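Before stating the theorem, here is a toy example of mine of such a decision function, with $k = 2$: partition the unit square $I_2$ along the diagonal,

$$P_1 = \{x \in I_2 : x_1 + x_2 \le 1\}, \qquad P_2 = I_2 \setminus P_1, \qquad f(x) = \begin{cases} 1 & \text{if } x \in P_1, \\ 2 & \text{if } x \in P_2. \end{cases}$$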
Theorem 3 Let $\sigma$ be a continuous sigmoidal function. Suppose $f$ is the decision function of any finite measurable partition of $I_n$. For any $\varepsilon > 0$, there exists a function of the form

$$G(x) = \sum_{j=1}^{N} \alpha_j \, \sigma(y_j^T x + \theta_j)$$

and a subset $D \subseteq I_n$, such that $m(D) \ge 1 - \varepsilon$, and for all $x \in D$, we have

$$|G(x) - f(x)| < \varepsilon.$$
Proof According to Lusin’s Theorem:
Lusin’s Theorem If $E$ is a measurable subset of $\mathbb{R}^n$ and $f$ is a measurable function defined on $E$ and finite almost everywhere, then for any $\varepsilon > 0$, there exists a closed set $F \subseteq E$ such that $m(E \setminus F) < \varepsilon$ and $f$ is continuous everywhere on $F$.
Thus, there exists a continuous function $h$ and a closed subset $D \subseteq I_n$ with $m(D) \ge 1 - \varepsilon$, such that $h(x) = f(x)$ for all $x \in D$. (Lusin’s Theorem gives the closed set $D$ on which $f$ is continuous; by the Tietze extension theorem, $f|_D$ then extends to a continuous function $h$ on all of $I_n$.) Since $h$ is continuous, by Theorem 2, we can find a function $G$ of the stated form such that for all $x \in I_n$,

$$|G(x) - h(x)| < \varepsilon.$$

Thus, for all $x \in D$, we have

$$|G(x) - f(x)| = |G(x) - h(x)| < \varepsilon.$$

Q.E.D.
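To close the loop, here is a sketch (again my own, with hypothetical parameter choices) that fits the two-class decision function from the example above using the same frozen-random-features approach, and checks that $G$ is close to $f$ on most of $I_2$, as Theorem 3 predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
logistic = lambda t: 1.0 / (1.0 + np.exp(-t))

# Decision function for the two-class partition of I_2 used above:
# f(x) = 1 if x1 + x2 <= 1, else 2.
f = lambda X: np.where(X[:, 0] + X[:, 1] <= 1.0, 1.0, 2.0)

X = rng.uniform(size=(2000, 2))            # sample points from I_2
N = 200
Y = rng.normal(scale=10.0, size=(N, 2))    # frozen hidden weights y_j
theta = rng.uniform(-10.0, 10.0, N)        # frozen biases theta_j

Phi = logistic(X @ Y.T + theta)            # hidden features, shape (2000, N)
alpha, *_ = np.linalg.lstsq(Phi, f(X), rcond=None)
G = Phi @ alpha

# Theorem 3 only promises closeness off a small bad set near the boundary;
# report the fraction of points where G pins down the correct class.
print("fraction well-approximated:", np.mean(np.abs(G - f(X)) < 0.25))
```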
Conclusion
After Cybenko’s paper, Hornik showed in 1991 that the ability of neural networks to serve as universal approximators does not depend on the specific form of the activation function, but rather on the multilayer feedforward architecture itself. Subsequent research has tested and validated the Universal Approximation Theorem for different activation functions and architectures. This theorem establishes that neural networks can approximate any continuous function to arbitrary precision, though it does not prescribe a method for achieving this limit. Only by combining theory with practice can we approach the true potential of universal approximation.
References
[1] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314. https://doi.org/10.1007/BF02551274
[2] Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257.