Varol Cagdas Tok

Personal notes and articles.

Function Approximation and the Curse of Dimensionality

Machine learning offers many models: linear classifiers, kernel methods, deep neural networks. Choosing between them requires both empirical methods, like cross-validation, and theoretical analysis. Function approximation frames machine learning as the problem of approximating an unknown target function, and it leads directly to the "Curse of Dimensionality."

The Framework: Target and Model Function Classes

Supervised machine learning is, at its core, a problem of function approximation: we assume an unknown target function maps inputs to outputs, and this is the function we want to learn.

Measuring the Gap: The Distance Between Functions

To measure how well a model approximates the target, we need a notion of "distance" between two functions. We define the distance between a target function \(f\) and a model function \(g\) over an input domain \(\mathcal{B}\) (such as a unit ball) as:

\[\|f - g\|^2_\mathcal{B} = \frac{1}{V_\mathcal{B}} \int_\mathcal{B} (f(\mathbf{x}) - g(\mathbf{x}))^2 d\mathbf{x}\]

Here, \(V_\mathcal{B}\) is the volume of the domain, so the formula calculates the average squared difference between the two functions across all inputs in that domain. The goal is to find a model \(f_w\) from the model class \(\mathcal{M}\) that minimizes this distance for the hardest function in the target class \(F\).

In practice, we use a weighted distance that considers the probability distribution of the input data, \(P(\mathbf{x})\). This focuses measurement on regions where data appears.
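As a sketch, both distances can be estimated by Monte Carlo sampling. The target \(f\), model \(g\), and density below are illustrative assumptions, not taken from any specific problem:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Illustrative target: sum of sines over the input coordinates."""
    return np.sin(x).sum(axis=-1)

def g(x):
    """Illustrative model: the linear approximation sin(x) ~ x."""
    return x.sum(axis=-1)

def distance_sq(f, g, dim, n_samples=100_000, weights=None):
    """Monte Carlo estimate of the mean squared distance between f and g
    over the cube [-1, 1]^dim (standing in for the domain B)."""
    x = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
    sq_diff = (f(x) - g(x)) ** 2
    if weights is None:          # unweighted: plain average (the 1/V_B factor)
        return sq_diff.mean()
    w = weights(x)               # weighted by the input density P(x)
    return (w * sq_diff).sum() / w.sum()

# A density concentrated near the origin, where sin(x) ~ x holds well,
# makes the weighted distance smaller than the unweighted one.
def p(x):
    return np.exp(-8.0 * (x ** 2).sum(axis=-1))

uniform = distance_sq(f, g, dim=3)
weighted = distance_sq(f, g, dim=3, weights=p)
print(uniform, weighted)
```

Weighting by \(P(\mathbf{x})\) changes which disagreements between \(f\) and \(g\) count: it rewards the model for being accurate exactly where the data concentrates.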

The Challenge: The Curse of Dimensionality

Approximation theory relates the number of parameters a model needs (e.g., the number of basis functions, \(M_\phi\)) to three factors: accuracy, smoothness of the target function, and input dimensionality.

Define the smoothness of a function class by a parameter \(m\), where larger \(m\) means a smoother function (continuous partial derivatives up to order \(m\)). The number of required basis functions \(M_\phi\) then scales as:

\[M_\phi \propto \left(\frac{1}{\epsilon}\right)^{M/m}\]

Here, \(\epsilon\) is the desired error, so \(1/\epsilon\) is the accuracy; \(M\) is the number of input dimensions; and the roughness is the inverse of smoothness, \(1/m\).

The number of required basis functions grows exponentially with input dimension \(M\) and roughness. Richard Bellman called this the "Curse of Dimensionality." For high-dimensional problems with non-smooth functions, the number of basis functions becomes computationally infeasible.
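The growth is easy to see with quick arithmetic; the error and smoothness values below are assumptions chosen for illustration:

```python
# Required basis functions scale roughly as (1/eps)^(M/m), where eps is the
# desired error, M the input dimension, and m the smoothness (assumed values).
eps, m = 0.1, 2
counts = {dim: (1 / eps) ** (dim / m) for dim in (1, 2, 10, 100)}
for dim, n_basis in counts.items():
    print(f"M = {dim:3d} -> ~{n_basis:.3g} basis functions")
```

With these numbers, going from 10 to 100 input dimensions inflates the count from about \(10^5\) to about \(10^{50}\) basis functions, which is why fixed-basis approaches break down in high dimensions.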

When and Why Models Work

Data and problems have structures that mitigate the curse:

  1. The Blessing of Dimensionality: A problem with a low-dimensional input (\(M\) is small) but complex structure (large roughness) can be solved by transforming the data into a high-dimensional feature space. This explains kernel methods and fixed basis functions: in the new space, the problem may become simpler or even linearly separable. Here, high dimensionality helps.
  2. Smooth Functions in High Dimensions: If the target function is smooth (small roughness), the curse is less severe. For a linear target, the number of required parameters is \(M + 1\), which grows only linearly with input dimension. Linear models work on high-dimensional data when the relationship is linear.
  3. Structured Target Functions: The curse is strong when we assume the target function could be any complex function. Real functions are not arbitrary.
     - Sparsity: The function's complexity may be localized. A function may have high-frequency components only in a small region of the input space, so out of a large set of basis functions, only a few are needed. Neural networks can learn a sparse set of basis functions (the hidden units) during training.
     - Data on a Manifold: High-dimensional data does not fill the entire input space. It lies on or near a lower-dimensional subspace called a manifold. Images of faces, while represented by thousands of pixels, can be described by far fewer parameters (pose, lighting, expression). Models like neural networks and autoencoders can discover and exploit this manifold structure.
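The "blessing of dimensionality" can be sketched with a fixed quadratic feature map; the dataset and the map below are illustrative assumptions. Points labeled by a circle are not linearly separable in 2-D, but become separable after lifting to five features:

```python
import numpy as np

rng = np.random.default_rng(1)

# 2-D points labeled by whether they fall outside a circle of radius 0.7:
# no straight line in the original plane separates the two classes.
x = rng.uniform(-1.0, 1.0, size=(500, 2))
y = (x ** 2).sum(axis=1) > 0.7 ** 2

# Fixed quadratic feature map: lift (x1, x2) to (x1, x2, x1^2, x2^2, x1*x2).
def phi(x):
    return np.column_stack([x, x ** 2, x.prod(axis=1)])

z = phi(x)

# In the lifted 5-D space a single linear threshold separates perfectly:
# score = x1^2 + x2^2 is a *linear* function of the features phi(x).
score = z[:, 2] + z[:, 3]
accuracy = ((score > 0.7 ** 2) == y).mean()
print(accuracy)  # 1.0 by construction
```

The low-dimensional problem was "rough" in the original coordinates, but a handful of fixed basis functions made it linear.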

Manifolds and Adversarial Examples

Learning on a manifold has a risk: a model may perform well on data that lies on the manifold but fail on inputs perturbed off it, because its approximation is only good where it has seen data. This explains adversarial examples, inputs that are slightly modified yet cause incorrect predictions.
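A minimal sketch of this failure mode, with an assumed 1-D manifold (the parabola \(x_2 = x_1^2\)) embedded in 2-D:

```python
import numpy as np

rng = np.random.default_rng(2)

# Training inputs lie on a 1-D manifold in 2-D: the parabola x2 = x1^2.
x1 = rng.uniform(-1.0, 1.0, size=200)
X = np.column_stack([x1, x1 ** 2])
y = X[:, 1]                          # target: f(x) = x2

# A least-squares model that only looks at x1 (quadratic features of x1)
# fits the training data perfectly, because x2 is determined by x1 here.
A = np.column_stack([np.ones_like(x1), x1, x1 ** 2])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def g(x):
    return w[0] + w[1] * x[:, 0] + w[2] * x[:, 0] ** 2

on_manifold_err = np.abs(g(X) - y).max()   # ~0: perfect where data lives
x_off = np.array([[0.0, 1.0]])             # a point perturbed off the manifold
off_manifold_err = abs(g(x_off)[0] - 1.0)  # ~1: the model never saw this region
print(on_manifold_err, off_manifold_err)
```

On the manifold the model is indistinguishable from the target, yet a point moved off the manifold exposes it; this is the geometry behind adversarial perturbations.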

Conclusion

Function approximation and the curse of dimensionality explain when each model class succeeds.

• Linear models work when the target function is smooth and relationships are additive.
• Kernel methods and fixed basis functions work when a complex, low-dimensional problem can be simplified by a high-dimensional transformation (the "blessing of dimensionality").
• Neural networks work when the solution is sparse or the data lies on a low-dimensional manifold, as they adapt their internal representations to the data's structure.
• Deep neural networks work for compositional functions, a structure found in many problems.

This perspective does not replace empirical testing, but it guides reasoning about which models will succeed or fail.