Introduction to Linear Classifiers
Classification is the task of assigning a class label to an object based on its features. A linear classifier is one of the simplest and most widely used tools for this task. Its core principle is to separate classes using a linear decision boundary—a line in two dimensions, a plane in three, and a hyperplane in higher dimensions.
Linear classifiers cannot solve non-linearly separable problems such as XOR on their own, but they perform well in practice, especially in high-dimensional spaces. They also form the building block for more complex models: when combined with non-linear transformations such as basis functions, kernels, or the hidden layers of a neural network, they can produce flexible, non-linear decision boundaries.
We explore three approaches to building linear classifiers for binary classification problems.
1. The Generative Approach: Modeling the Data-Generating Process
A generative model for classification works by building a full probabilistic model of data generation. Instead of learning the posterior \(P(y|x)\) directly, it learns the class priors \(P(y)\) and the class-conditional distributions \(P(x|y)\). From these, the desired posterior probability \(P(y|x)\) can be calculated using Bayes' theorem.
An example is Linear Discriminant Analysis (LDA), which assumes that the data for each class is generated from a Gaussian distribution. A simplifying assumption in LDA is that while each class has its own mean (center), they all share the same covariance matrix.
The process is as follows:
- Estimate Class Priors: The prior probability of each class, \(P(y=k)\), is estimated from the proportion of each class in the training data.
- Estimate Class-Conditional Gaussians: The mean vector for each class, \(\mu_k\), and the shared covariance matrix, \(\Sigma\), are estimated from the training data.
- Apply Bayes' Theorem: For a new data point \(x\), use Bayes' theorem to calculate the posterior probability of it belonging to each class.
A result of these assumptions (Gaussian distributions with a shared covariance matrix) is that the decision boundary is linear. The posterior probability \(P(y=1|x)\) takes the form of the logistic (sigmoid) function applied to a linear function of \(x\): \(\sigma(w^Tx + w_0)\). The weight vector \(w\) is determined by the difference in the class means and the inverse of the covariance matrix.
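Concretely, writing out the log posterior odds for two shared-covariance Gaussians and simplifying yields the boundary parameters just described:

\[
w = \Sigma^{-1}(\mu_1 - \mu_0), \qquad
w_0 = -\frac{1}{2}\mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2}\mu_0^T \Sigma^{-1} \mu_0 + \log\frac{P(y=1)}{P(y=0)}
\]

The quadratic terms \(x^T\Sigma^{-1}x\) cancel precisely because \(\Sigma\) is shared between the classes, which is what makes the boundary linear.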
This generative approach provides not just a hard classification but a full posterior probability, which can be useful for understanding the model's confidence. A simplified version of this is the Naive Bayes classifier, which assumes that the features are conditionally independent given the class (i.e., the covariance matrix is diagonal). Despite this "naive" assumption, it performs well in high-dimensional text classification tasks.
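The estimation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a binary dataset `X` (one row per point) with labels `y` in {0, 1}; the function names are ours, not from any library:

```python
import numpy as np

def fit_lda(X, y):
    """Estimate LDA parameters: class prior, per-class means, shared covariance."""
    X0, X1 = X[y == 0], X[y == 1]
    prior1 = len(X1) / len(X)                      # P(y=1) from class proportions
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)    # per-class mean vectors
    # Pooled (shared) covariance: weighted average of the two class covariances
    Sigma = (np.cov(X0.T, bias=True) * len(X0)
             + np.cov(X1.T, bias=True) * len(X1)) / len(X)
    Sigma_inv = np.linalg.inv(Sigma)
    # Linear boundary implied by the shared-covariance Gaussian assumption
    w = Sigma_inv @ (mu1 - mu0)
    w0 = (-0.5 * (mu1 @ Sigma_inv @ mu1 - mu0 @ Sigma_inv @ mu0)
          + np.log(prior1 / (1 - prior1)))
    return w, w0

def lda_posterior(X, w, w0):
    """Posterior P(y=1|x): sigmoid of the linear score."""
    return 1.0 / (1.0 + np.exp(-(X @ w + w0)))
```

Classifying with `lda_posterior(X, w, w0) > 0.5` then gives the hard decision, while the probability itself expresses the model's confidence.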
2. The Discriminative Approach: Logistic Regression
Instead of modeling the full data-generating process, a discriminative model learns the posterior \(P(y|x)\) directly. Logistic Regression is the quintessential discriminative linear classifier.
It directly models the posterior probability of class 1 as the logistic function of a linear combination of the inputs:

\[
P(y=1 \mid x) = \sigma(w^T x + w_0) = \frac{1}{1 + e^{-(w^T x + w_0)}}
\]
The parameters \(w\) and \(w_0\) are not found by estimating means and covariances, but by directly maximizing the conditional likelihood of the observed data. This is equivalent to minimizing the cross-entropy loss function, which penalizes the model when it makes confident but incorrect predictions.
Unlike the generative approach, logistic regression makes fewer assumptions about the data distribution. It does not assume the data is Gaussian. This makes it more robust when the generative assumptions do not hold. The parameters are found using iterative optimization methods like gradient descent or Newton-Raphson.
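A minimal gradient-descent sketch of this fitting procedure, under the same assumptions as before (binary labels in {0, 1}; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Fit (w, w0) by gradient descent on the mean cross-entropy loss."""
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + w0)        # current predictions P(y=1|x)
        grad_w = X.T @ (p - y) / n     # gradient of cross-entropy w.r.t. w
        grad_w0 = (p - y).mean()       # gradient w.r.t. the bias
        w -= lr * grad_w
        w0 -= lr * grad_w0
    return w, w0
```

Note the gradient's simple form, `(p - y)`: the prediction error directly drives the update, which is a consequence of pairing the sigmoid with the cross-entropy loss. In practice one would use a second-order method (Newton-Raphson) or a library optimizer, but the fixed-step loop shows the idea.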
3. Classification via Regression: A Simple Heuristic
A third, more heuristic approach is to treat the classification problem as a regression problem. The setup is simple:
- Assign numerical target values to the classes (e.g., \(y=1\) for the positive class and \(y=0\) or \(y=-1\) for the negative class).
- Train a standard linear regression model to predict these numerical values from the input features \(x\), typically by minimizing the sum of squared errors.
- To classify a new point, compute the regression output \(f(x) = w^Tx + w_0\). If the output is above a threshold, classify it as positive; otherwise, negative.
This approach seems simplistic and lacks the probabilistic grounding of the other two methods, but it can perform well in practice. Its advantage is computational efficiency, as it can be solved using the closed-form solution for least-squares regression. But it can be sensitive to outliers and its output cannot be interpreted as a probability.
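A minimal sketch of this regression-as-classification heuristic, using the closed-form least-squares solution mentioned above (targets chosen as \(\pm 1\) with a threshold at zero; names are illustrative):

```python
import numpy as np

def fit_ls_classifier(X, y):
    """Least-squares regression onto +/-1 targets; closed-form solution."""
    t = np.where(y == 1, 1.0, -1.0)            # numeric targets for the classes
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    # Solve the least-squares problem directly (more stable than the
    # normal-equations inverse)
    coef, *_ = np.linalg.lstsq(Xb, t, rcond=None)
    return coef[:-1], coef[-1]                 # (w, w0)

def ls_predict(X, w, w0):
    """Threshold the regression output at zero."""
    return (X @ w + w0 > 0).astype(int)
```

Because the fit is a single linear solve, training is essentially instantaneous; the price is that outliers far from the boundary still contribute large squared errors and can drag the boundary toward them.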
Conclusion: Choosing the Right Tool
Linear classifiers are a foundational tool in machine learning, and the three approaches—generative, discriminative, and regression-based—offer different perspectives on how to find the optimal linear boundary.
- Generative models like LDA work when their distributional assumptions are met and can be useful when the class priors are important.
- Logistic Regression is the preferred discriminative approach, offering a direct and probabilistic model of the decision boundary.
- Classification via Regression provides a fast and simple alternative.
Understanding these models is important, as they solve many problems directly and serve as the final output layer for many deep learning architectures.