Varol Cagdas Tok

Personal notes and articles.

Kernel Methods and the Kernel Trick

One approach to modeling non-linear relationships is to transform input data into a higher-dimensional feature space using basis functions. In this space, a linear model may solve the problem. But the number of basis functions required can grow exponentially with input dimensionality—the curse of dimensionality. Kernel methods solve this problem, letting us work implicitly in feature spaces of high or infinite dimensions without computing the transformations explicitly.

From Basis Functions to Smoothness

Instead of starting with a predefined set of basis functions, kernel methods assume smoothness: inputs that are close to each other should have similar outputs. A kernel function, \(k(\mathbf{x}, \mathbf{x}')\), measures the similarity between two input points, \(\mathbf{x}\) and \(\mathbf{x}'\). The prediction for a new point is a weighted combination of its similarities to the training data points.

Kernels and basis functions are connected. By Mercer's theorem, any valid (symmetric, positive semi-definite) kernel function can be expressed as a dot product of feature vectors in some high-dimensional space:

\[k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T \phi(\mathbf{x}')\]

Here, \(\phi(\mathbf{x})\) maps input \(\mathbf{x}\) into the high-dimensional feature space. A kernel function implicitly defines a feature space. A polynomial kernel like \(k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^T\mathbf{x}' + 1)^2\) corresponds to a feature space containing all second-order polynomial terms. The Gaussian kernel corresponds to an infinite-dimensional feature space.
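This correspondence can be checked numerically. For 2-D inputs, the polynomial kernel \((\mathbf{x}^T\mathbf{x}' + 1)^2\) matches the dot product of explicit second-order feature vectors (a minimal NumPy sketch; the \(\sqrt{2}\) scaling in `phi` below is one standard choice of feature map, not the only one):

```python
import numpy as np

def poly_kernel(x, xp):
    """Polynomial kernel k(x, x') = (x^T x' + 1)^2, computed in input space."""
    return (x @ xp + 1.0) ** 2

def phi(x):
    """Explicit feature map for the same kernel (2-D inputs).
    Contains all terms up to second order; the sqrt(2) factors make
    the dot product of feature vectors equal the kernel exactly."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
xp = np.array([3.0, -1.0])

# Both routes give the same similarity value.
print(poly_kernel(x, xp))   # kernel trick: O(d) work in input space
print(phi(x) @ phi(xp))     # explicit map: O(d^2) features
```

Both prints agree, but the kernel evaluation never builds the six-dimensional feature vectors.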

This is the "kernel trick."

Bypassing High Dimensions

Many machine learning algorithms, from linear regression to Support Vector Machines, can be formulated to only require dot products of input vectors. In regularized least squares regression, the optimal weight vector \(\mathbf{w}\) can be expressed as a linear combination of the feature vectors of the training data:

\[\mathbf{w} = \Phi^T \mathbf{v}\]

where \(\Phi\) is the design matrix (rows are feature vectors \(\phi(\mathbf{x}_i)^T\)) and \(\mathbf{v}\) is a vector of coefficients. When we make a prediction for a new point \(\mathbf{x}'\), we compute:

\[f(\mathbf{x}') = \phi(\mathbf{x}')^T \mathbf{w} = \phi(\mathbf{x}')^T \Phi^T \mathbf{v}\]

This expression involves dot products between the feature vector of the new point, \(\phi(\mathbf{x}')\), and the feature vectors of the training points, \(\phi(\mathbf{x}_i)\). By substituting the kernel function for these dot products, we get:

\[f(\mathbf{x}') = \sum_{i} v_i k(\mathbf{x}_i, \mathbf{x}')\]

The prediction is a weighted sum of kernel functions, centered at each of the \(N\) training data points. We bypass the need to explicitly define or compute the high-dimensional feature vectors \(\phi(\mathbf{x})\). The entire algorithm—both training and prediction—can be "kernelized" by replacing all dot products with the kernel function. This lets us work with feature spaces of high dimensionality without the computational cost. The complexity scales with the number of data points, \(N\), rather than the number of features, \(M_\phi\).
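The two steps above—solving for the coefficients \(\mathbf{v}\) and predicting via the weighted sum of kernels—can be sketched in a few lines. This is a minimal kernel ridge regression example using NumPy and a Gaussian kernel; the regularization strength `lam` and kernel width `gamma` are illustrative choices:

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def fit_kernel_ridge(X, y, lam=1e-3, gamma=1.0):
    """Solve (K + lam * I) v = y for the coefficient vector v.
    K is the N x N Gram matrix, so the cost scales with N rather than
    with the (here infinite) dimension of the feature space."""
    N = len(X)
    K = np.array([[gaussian_kernel(X[i], X[j], gamma) for j in range(N)]
                  for i in range(N)])
    return np.linalg.solve(K + lam * np.eye(N), y)

def predict(X, v, x_new, gamma=1.0):
    """f(x') = sum_i v_i k(x_i, x'): a weighted sum of kernels
    centered at the training points."""
    return sum(v[i] * gaussian_kernel(X[i], x_new, gamma) for i in range(len(X)))

# Toy 1-D example: fit y = sin(x) on a handful of points.
X = np.linspace(0, 3, 20).reshape(-1, 1)
y = np.sin(X).ravel()
v = fit_kernel_ridge(X, y)
print(predict(X, v, np.array([1.5])))  # close to sin(1.5)
```

Note that \(\phi\) never appears: the Gaussian kernel's feature space is infinite-dimensional, yet training only requires an \(N \times N\) linear solve.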

The Representer Theorem

The kernel trick is not limited to one algorithm. The Representer Theorem shows that for a range of learning problems involving a regularized loss function, the optimal solution can be expressed as a linear combination of kernel functions centered on the training data points. The theorem holds whenever the loss depends on the training data only through the model's predictions and the regularizer is a non-decreasing function of the solution's norm—conditions satisfied by most cost functions used in machine learning.

Strengths and Weaknesses of Kernel Methods

Kernel methods have several strengths:

  1. Handling Non-linearity: They let us apply linear models to non-linear problems by implicitly mapping data to a high-dimensional space where the problem may become linear.
  2. Avoiding the Curse of Dimensionality: They can operate in feature spaces that would be too large to work with directly.
  3. Flexibility in Data Types: It is easier to design a kernel function (a similarity measure) for complex data types like graphs, strings, or images than to design an explicit feature vector. This has made kernels useful in bioinformatics and computational chemistry.
They also have limitations:

  1. Computational Complexity: The cost of training a kernel model scales with the number of training samples: typically \(\mathcal{O}(N^3)\) for an exact solution (solving the \(N \times N\) linear system) and \(\mathcal{O}(N^2)\) just to store the Gram matrix. This makes them slow for large datasets compared to neural networks, whose training cost scales with the number of parameters and can be optimized with stochastic methods.
  2. The Manifold Dilemma: Kernel methods, particularly those with local kernels like the Gaussian kernel, perform best when data lies on a low-dimensional manifold within the high-dimensional input space. They can model the function's complexity on this manifold. But their predictions decay to zero for points far from the training data manifold. This can be useful for one-class classification or anomaly detection, but it means they may perform poorly on test data from a different distribution (covariate shift).
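The decay behavior of local kernels is easy to see directly: with a Gaussian kernel, every term \(v_i k(\mathbf{x}_i, \mathbf{x}')\) vanishes once \(\mathbf{x}'\) is far from all training points. A minimal NumPy sketch, with illustrative hand-set weights standing in for a fitted coefficient vector:

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

# Training data confined to [0, 1]; unit weights stand in for fitted v.
X = np.linspace(0, 1, 10).reshape(-1, 1)
v = np.ones(len(X))

def f(x_new):
    """Prediction f(x') = sum_i v_i k(x_i, x')."""
    return sum(v[i] * gaussian_kernel(X[i], x_new) for i in range(len(X)))

print(f(np.array([0.5])))   # inside the data: a substantial value
print(f(np.array([10.0])))  # far from all training points: essentially zero
```

Far from the data, every kernel term is exponentially small, so the model reverts to predicting zero regardless of the true function's behavior there.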
Conclusion

By using the kernel trick, kernel methods let us implicitly work in high-dimensional feature spaces, turning non-linear problems into linear ones. Their computational complexity can limit them for large datasets, but they can model complex relationships and handle diverse data types. They provide a different perspective on model flexibility compared to neural networks: instead of learning a set of adaptive basis functions, kernel methods work with a potentially infinite set of fixed basis functions, with the model's complexity determined by the number of training examples.