Varol Cagdas Tok

Personal notes and articles.

The Perceptron Algorithm

The Perceptron, invented by Frank Rosenblatt in 1957, is a learning machine. Inspired by the biological neuron, it was one of the first algorithms to learn classification from data. The Perceptron introduced concepts of supervised learning: a model, a cost function, and a learning rule.

The Model: A Formalized Neuron

The Perceptron is a linear classifier for binary classification tasks. It takes a set of inputs, \(x\), and produces a binary output, \(\hat{y}\), which is typically either +1 or -1. The model operation mimics a simplified neuron:

  1. Compute the Net Input (Activation): The model calculates a weighted sum of its inputs, a linear function \(h(x)\):
  2. \[h(x) = w_0 + w_1x_1 + w_2x_2 + \dots + w_M x_M = w^T x\]

    The weights \(w\) are learned parameters. The term \(w_0\) is a bias, shifting the decision boundary. (Note: \(x\) is often augmented to include a 1 for the bias term).

    1. Apply an Activation Function: The net input is passed through a step function, the \(\text{sign}\) function:
    2. \[\hat{y} = \text{sign}(h(x))\]

      If \(h(x)\) is positive, the output is +1; if negative, -1.

      Geometrically, the equation \(h(x) = 0\) defines a hyperplane in the input space. This hyperplane is the decision boundary of the Perceptron. All points on one side are classified as +1, and all on the other as -1. The goal is to find weights \(w\) that position this hyperplane to correctly separate the data points.

      The Learning Process: The Perceptron Learning Rule

      The Perceptron learns its weights from a set of labeled training data. The process is driven by a simple and intuitive idea: if the model misclassifies a data point, adjust the weights to make the correct classification more likely in the future. This is formalized by the Perceptron learning rule, which is derived by minimizing a specific cost function.

      The Perceptron cost function is defined as:

      \[J(w) = - \sum_{i \in \mathcal{M}} y_i h(x_i)\]

      where \(\mathcal{M}\) is the set of misclassified points.

      This cost is positive only for misclassified points (where the sign of the true label \(y_i\) and the net input \(h(x_i)\) differ) and zero for correctly classified points. To minimize this cost, we use stochastic gradient descent (SGD):

      For a randomly chosen misclassified data point \((x_t, y_t)\):

      \[w \leftarrow w + \eta y_t x_t\]

      Here, \(\eta\) is the learning rate, a small positive constant. Intuition:

      • If a positive-class point (\(y_t = +1\)) is misclassified, \(h(x_t)\) was negative. The update adds a fraction of \(x_t\) to \(w\), pushing \(w\) to be more aligned with \(x_t\), making \(h(x_t) = w^T x_t\) more positive.
      • If a negative-class point (\(y_t = -1\)) is misclassified, \(h(x_t)\) was positive. The update subtracts a fraction of \(x_t\) from \(w\), making \(w\) less aligned with \(x_t\), pushing \(h(x_t)\) to be more negative.

      This process is repeated, iterating through the misclassified points, until all points are correctly classified.

      The Power and Limitations of the Perceptron

      The Perceptron was a groundbreaking invention, and it comes with a powerful theoretical guarantee. The Perceptron convergence theorem proves that if the training data is linearly separable (meaning a hyperplane exists that can perfectly separate the two classes), the Perceptron learning algorithm is guaranteed to find such a separating hyperplane in a finite number of steps.

      However, the Perceptron also has significant limitations that were famously highlighted by Marvin Minsky and Seymour Papert in their 1969 book, Perceptrons:

      1. Linear Separability: The most critical limitation is that the Perceptron can only solve problems that are linearly separable. It is fundamentally incapable of learning non-linear decision boundaries, such as the one required for the classic XOR problem.
        1. Non-Convergence for Non-Separable Data: If the data is not linearly separable, the learning algorithm will never converge. The weights will continue to be updated indefinitely as the algorithm cycles through the misclassified points.
          1. Non-Unique Solution: Even for linearly separable data, the algorithm does not guarantee finding the "best" separating hyperplane. The solution it finds depends on the initial weights and the order in which the data points are presented.
          2. The Legacy of the Perceptron

            Despite its limitations, the Perceptron was a step in artificial intelligence history. It introduced the paradigm of supervised learning: define a model, specify a cost function, and iterate to optimize parameters. The challenge of non-linearity motivated the development of the Multilayer Perceptron (MLP). By stacking Perceptron-like units with non-linear activation functions, these networks can solve complex, non-linear problems.