flowchart TD
A["A concrete decision scenario<br>(the movie example)"]
B["The perceptron model<br>(weights, bias, threshold)"]
C["Geometric picture<br>(the decision boundary)"]
D["Perceptron vs. MP Neuron<br>(what changed and why it matters)"]
E["The learning algorithm<br>(finding weights from data)"]
F["Why the updates work<br>(geometry of weight updates)"]
G["What if data is not<br>linearly separable? (XOR)"]
H["What comes next<br>(networks of perceptrons)"]
A --> B --> C --> D --> E --> F --> G --> H
style A fill:#2c3e50,color:#ecf0f1
style H fill:#2c3e50,color:#ecf0f1
Perceptron
A Decision You Already Know How to Make
Imagine you are trying to decide whether you will enjoy a movie tonight. You pull up the listing and notice a few things: Matt Damon is in it, the genre is thriller, and Christopher Nolan directed it. You weigh these up in your head, apply your own personal threshold for “yes, I’ll watch this,” and arrive at a binary answer: watch or skip.
That mental process of weighing features, summing them up, and applying a threshold is, in essence, what a perceptron does. The perceptron is one of the oldest and most foundational ideas in machine learning. It is a simple binary classifier: given an input, it outputs either \(1\) (positive class) or \(0\) (negative class). Understanding it deeply is the first step toward understanding the neural networks that power modern AI.
This post is largely based on the treatment in Minsky & Papert (1969) and the lecture series by IIT Madras - B.S. Degree Programme (2023) and NPTEL-NOC IITM (2019).
The Journey Ahead
Before we begin, here is the road we will travel:
We will build the perceptron piece by piece from first principles, develop a geometric intuition for what it is actually doing, derive the learning algorithm from that intuition, and then show exactly where and why it fails. At the end, you will know not just how the perceptron works but why it works, and why that matters for everything that comes after it.
The Perceptron Model
Starting from the Movie Example
Let us stay with the movie example and think about it concretely before writing any equations. Suppose we are trying to predict whether a particular person will like a movie. We have three yes/no features:
\[ \begin{align*} x_1 \in \{0, 1\} &= \texttt{isActorDamon}\\ x_2 \in \{0, 1\} &= \texttt{isGenreThriller}\\ x_3 \in \{0, 1\} &= \texttt{isDirectorNolan} \end{align*} \]
And the label we want to predict is:
\[ y \in \{0, 1\} = \texttt{didPersonLikeMovie}. \]
The simplest possible rule: add up the features and check whether the total crosses some threshold \(\theta\). If \(x_1 + x_2 + x_3 \geq \theta\), predict like; otherwise, predict skip. This is actually the McCulloch-Pitts (MP) neuron (McCulloch & Pitts, 1943), and on the surface it seems reasonable.
But let us try it on a concrete case and see where it breaks. Suppose the person is a Nolan superfan: they will watch any film Nolan directed, regardless of genre or cast. Consider these two movies:
- Movie A: directed by Nolan, no Damon, not a thriller. Features: \((0, 0, 1)\), so the sum is \(1\).
- Movie B: starring Damon, a thriller, not directed by Nolan. Features: \((1, 1, 0)\), so the sum is \(2\).
With threshold \(\theta = 2\), Movie A (which this person loves) would be predicted as a skip, and Movie B (which they may or may not like) would be predicted as a watch. The rule gets it backwards for a Nolan fan, because it assigns the same weight to every feature. For this person, the fact that Nolan directed the film is worth far more than the other two features combined.
This is the gap that the perceptron fills.
Introducing Weights
The perceptron, originally proposed by Rosenblatt (1958) and carefully analyzed by Minsky & Papert (1969), fixes the equal-weight problem by introducing a weight \(w_j\) for each input \(x_j\). The weight encodes how important that feature is. Instead of adding up the raw features, we add up a weighted sum:
\[ y = f\left(x_1, \dots, x_d\right) = \begin{cases} 1, & \text{if $w_1 x_1 + \cdots + w_d x_d \geq \theta$},\\ 0, & \text{if $w_1 x_1 + \cdots + w_d x_d < \theta$}. \end{cases} \tag{1}\]
Here \(\theta\) is the threshold the weighted sum must clear for the output to be \(1\), and \(d\) is the number of input features.
Returning to our Nolan fan: if we set \(w_3\) large enough that \(w_3 \times 1\) alone clears \(\theta\) even when \(x_1 = 0\) and \(x_2 = 0\), the perceptron will correctly predict that any Nolan film is a watch, regardless of the other features. That is the power of weights: they let the model reflect the actual relative importance of features.
Note also that the inputs are no longer restricted to boolean values. A feature like \(\texttt{IMDBRating}\) can be a real number, and the perceptron handles that just as naturally.
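To make this concrete, here is a minimal sketch of the weighted-sum rule. The weight values are hand-picked for illustration, not learned from data:

```python
# Illustrative only: these weights are hand-picked, not learned.
def perceptron_predict(x, w, theta):
    """Predict 1 (watch) if the weighted sum of features clears theta."""
    weighted_sum = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if weighted_sum >= theta else 0

# Features: (isActorDamon, isGenreThriller, isDirectorNolan)
movie_a = (0, 0, 1)  # Nolan film, no Damon, not a thriller
movie_b = (1, 1, 0)  # Damon thriller, not directed by Nolan

# Equal weights (the MP-neuron behaviour) with theta = 2 gets Movie A wrong.
print(perceptron_predict(movie_a, (1, 1, 1), theta=2))  # 0: predicted skip

# Weighting Nolan heavily, so w3 alone clears theta, fixes it.
print(perceptron_predict(movie_a, (1, 1, 3), theta=2))  # 1: predicted watch
```

With \(w_3 = 3\) and \(\theta = 2\), the Nolan feature alone is decisive, exactly as the Nolan superfan requires.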
Absorbing the Threshold: The Bias
There is a small but important notational trick that will simplify everything that follows. We can rewrite Equation 1 by moving \(\theta\) to the left-hand side:
\[ y = f\left(x_1, \dots, x_d\right) = \begin{cases} 1, & \text{if $w_1 x_1 + \cdots + w_d x_d - \theta \geq 0$},\\ 0, & \text{if $w_1 x_1 + \cdots + w_d x_d - \theta < 0$}. \end{cases} \tag{2}\]
Now \(-\theta\) is just a constant term added to the weighted sum. We can absorb it into the summation itself by imagining an extra input \(x_0 = 1\) (always turned on, regardless of the movie) with a corresponding weight \(w_0 = -\theta\). This weight \(w_0\) is called the bias. Figure 1 shows the resulting model.
With this convention, the perceptron equation becomes elegantly uniform: every input, including the bias input, participates in one single sum:
\[ y = f\left(x_0, x_1, \dots, x_d\right) = \begin{cases} 1, & \text{if $\sum_{j=0}^{d} w_j x_j \geq 0$},\\ 0, & \text{if $\sum_{j=0}^{d} w_j x_j < 0$}. \end{cases} \tag{3}\]
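The bias trick is easy to express in code: prepend a constant \(1\) to every input vector and fold \(-\theta\) into the weight vector. A quick sketch (the weight and threshold values are illustrative):

```python
import numpy as np

theta = 2.0
w_features = np.array([1.0, 1.0, 3.0])      # weights for x1..x3 (illustrative)
w = np.concatenate(([-theta], w_features))  # w0 = -theta absorbs the threshold

x_features = np.array([0.0, 0.0, 1.0])      # the Nolan-only movie
x = np.concatenate(([1.0], x_features))     # x0 = 1, always on

# The two formulations agree exactly:
old_rule = int(w_features @ x_features >= theta)  # Equation 1
new_rule = int(w @ x >= 0)                        # Equation 3
print(old_rule, new_rule)  # 1 1
```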
Significance of the Bias
The bias \(w_0 = -\theta\) encodes a prior, or a prejudice, about the output. A high threshold \(\theta\) (equivalently, a very negative bias \(w_0\)) means the weighted sum of the actual features must be large before the perceptron outputs \(1\). The bar is high.
In the movie example, a selective viewer who only likes films that simultaneously star Damon, are thrillers, and are directed by Nolan will have a high threshold \(\theta\). The perceptron trained on their data will learn a large value of \(\theta\) (very negative \(w_0\)). Conversely, someone who enjoys almost everything will have a very low threshold, perhaps even \(\theta = 0\), meaning any non-negative weighted sum is enough for \(y = 1\).
The bias is not an afterthought: it lets the perceptron shift its decision boundary to reflect how much cumulative evidence is needed before committing to a positive prediction.
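A small sketch of this idea, with made-up weights: the same movie gets opposite predictions from a selective viewer and an easygoing one purely because of the bias term:

```python
import numpy as np

# Bias input x0 = 1; then (isActorDamon, isGenreThriller, isDirectorNolan).
x = np.array([1.0, 1.0, 1.0, 0.0])  # a Damon thriller, not directed by Nolan

# Same feature weights, different biases (w0 = -theta). Values illustrative.
selective = np.array([-3.0, 1.0, 1.0, 1.0])  # theta = 3: needs all three features
easygoing = np.array([0.0, 1.0, 1.0, 1.0])   # theta = 0: almost anything passes

print(int(selective @ x >= 0))  # 0: the bar is too high for this movie
print(int(easygoing @ x >= 0))  # 1: low bar, watch it
```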
Vector Notation
With \(d\) inputs plus the bias input \(x_0 = 1\), writing out the sum \(w_0 x_0 + w_1 x_1 + \cdots + w_d x_d\) repeatedly is cumbersome. We collect all inputs into a column vector \(\vb{x}\) and all weights into a column vector \(\vb{w}\):
\[ \vb{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_d \end{bmatrix}, \qquad \vb{x} = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_d \end{bmatrix}. \]
The sum \(\sum_{j=0}^{d} w_j x_j\) is then just the dot product \(\vb{w}^{\intercal} \vb{x}\):
\[ \sum_{j=0}^{d} w_j x_j = \begin{bmatrix} w_0 & w_1 & \cdots & w_d \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_d \end{bmatrix} = \vb{w}^{\intercal} \vb{x}. \]
So the perceptron equation in its most compact form is:
\[ y = f\left(\vb{x}\right) = \begin{cases} 1, & \text{if $\vb{w}^{\intercal} \vb{x} \geq 0$},\\ 0, & \text{if $\vb{w}^{\intercal} \vb{x} < 0$}. \end{cases} \tag{4}\]
Equation 1, Equation 2, Equation 3, and Equation 4 are all equivalent. They are all the perceptron equation, just written at different levels of explicitness. We will use \(\vb{w}^{\intercal} \vb{x} \geq 0\) from here on.
This notation is not just a shorthand. The dot product \(\vb{w}^{\intercal} \vb{x}\) also has a geometric meaning: it is related to the angle between the two vectors \(\vb{w}\) and \(\vb{x}\). We will use this fact extensively when we explain why the learning algorithm works.
The Geometric Picture: A Line That Divides the World
Now that we have the equation, let us ask: what does it look like in space?
The condition \(\vb{w}^{\intercal} \vb{x} = 0\) defines the set of all input vectors \(\vb{x}\) that sit exactly on the boundary between outputting \(1\) and outputting \(0\). In two dimensions (\(d = 2\), so ignoring the bias for a moment), this is a line through the origin. In \(d\) dimensions, it is a hyperplane.
All inputs with \(\vb{w}^{\intercal} \vb{x} \geq 0\) lie on the positive side of the boundary and get \(y = 1\). All inputs with \(\vb{w}^{\intercal} \vb{x} < 0\) lie on the negative side and get \(y = 0\). The perceptron is therefore a linear classifier: it partitions its input space into exactly two half-spaces with a single straight boundary.
The figure below shows what this looks like for a simple 2D example. The weight vector \(\vb{w}\) is drawn as an arrow; it is always perpendicular to the boundary, pointing into the positive half-space. Points on the same side as \(\vb{w}\) get \(y = 1\); points on the opposite side get \(y = 0\).
We will derive why \(\vb{w}\) is always perpendicular to the boundary shortly. For now, hold on to the picture: the perceptron draws a straight line (or hyperplane) through the input space, and everything on one side is a positive prediction. This immediately implies something important: a perceptron can only solve problems where the two classes can be separated by a straight line. That is not always possible, and we will come back to this at the end of the post.
The boundary equation is:
\[ \vb{w}^{\intercal} \vb{x} = 0, \tag{5}\]
or equivalently,
\[ w_0 x_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d = 0. \tag{6}\]
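In code, the decision rule is just the sign of the dot product. A sketch with an arbitrary weight vector:

```python
import numpy as np

w = np.array([-1.0, 1.0, 1.0])  # arbitrary weights: w0, w1, w2

def classify(x1, x2):
    """Which half-space is (x1, x2) in? The bias input x0 = 1 is implicit."""
    return 1 if w @ np.array([1.0, x1, x2]) >= 0 else 0

print(classify(1.0, 1.0))  # 1: positive side of the line -1 + x1 + x2 = 0
print(classify(0.0, 0.0))  # 0: negative side
```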
Perceptron vs. McCulloch-Pitts Neuron
At this point it is worth pausing to compare the perceptron to the McCulloch-Pitts (MP) neuron (McCulloch & Pitts, 1943), which preceded it.
flowchart LR
subgraph MP ["McCulloch-Pitts Neuron"]
direction TB
MP1["$$\text{Inputs: binary only }(x_j \in \{0,1\})$$"]
MP2["All weights = 1<br>(no learnable weights)"]
MP3["$$\text{Threshold }\theta \text{ fixed by hand}$$"]
MP4["Boundary: $$\sum_{j=1}^d x_j = \theta$$"]
MP1 --> MP2 --> MP3 --> MP4
end
subgraph P ["Perceptron"]
direction TB
P1["$$\text{Inputs: real-valued }(x_j \in \mathbb{R})$$"]
P2["$$\text{Learnable weights }w_j\text{ (encode feature importance)}$$"]
P3["$$\text{Learnable bias }w_0 = -\theta\text{ (encodes prior)}$$"]
P4["$$\text{Boundary: }\sum_{j=0}^d w_j x_j = 0$$"]
P1 --> P2 --> P3 --> P4
end
MP -->|"Generalized by"| P
The MP neuron accepts only binary inputs and assigns every feature exactly the same weight of \(1\). Its threshold \(\theta\) must be set by hand. The perceptron removes all three of these restrictions: inputs can be real-valued, each input gets its own learnable weight, and the threshold is absorbed into a learnable bias term. This means the decision boundary can be shifted and tilted to wherever the data requires, rather than being fixed in advance.
Both models share the same fundamental geometry: a linear boundary divides the input space into two halves. The perceptron simply makes that boundary far more flexible and, crucially, learnable from data. The concrete failure we saw earlier, where equal weights gave the wrong answer for the Nolan fan, is exactly the kind of problem that learnable weights solve.
The OR Function: A Worked Example
Before we introduce the learning algorithm, let us work through a concrete example by hand: the OR function. This will let us see exactly what it means to find a perceptron boundary, and it will make the need for an automatic learning procedure visceral.
Table 1 shows the OR truth table alongside what the perceptron equation requires for each input.
| \(x_1\) | \(x_2\) | OR | Required condition |
|---|---|---|---|
| \(0\) | \(0\) | \(0\) | \(w_0 + \left(w_1 \times 0\right) + \left(w_2 \times 0\right) < 0\) |
| \(1\) | \(0\) | \(1\) | \(w_0 + \left(w_1 \times 1\right) + \left(w_2 \times 0\right) \geq 0\) |
| \(0\) | \(1\) | \(1\) | \(w_0 + \left(w_1 \times 0\right) + \left(w_2 \times 1\right) \geq 0\) |
| \(1\) | \(1\) | \(1\) | \(w_0 + \left(w_1 \times 1\right) + \left(w_2 \times 1\right) \geq 0\) |
Substituting, the four conditions become:
\[ \begin{align*} w_0 &< 0,\\ w_0 + w_2 &\geq 0,\\ w_0 + w_1 &\geq 0,\\ w_0 + w_1 + w_2 &\geq 0. \end{align*} \]
This system has infinitely many solutions. One particular solution is \(w_0 = -1\), \(w_1 = 1.1\), \(w_2 = 1.1\), giving the boundary:
\[ -1 + 1.1 x_1 + 1.1 x_2 = 0. \tag{7}\]
Good. So a valid boundary does exist for OR, and we can find it by solving the inequalities. But notice what just happened: we had to manually write down four conditions, derive a system of inequalities, and then pick a solution. For two inputs and three weights, that was manageable. Now imagine doing this for a movie recommendation system with fifty features and thousands of training examples. We would need to solve a system of thousands of inequalities simultaneously. This is clearly impractical.
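As a sanity check, the hand-picked solution does satisfy all four inequalities:

```python
w0, w1, w2 = -1.0, 1.1, 1.1

# The four conditions derived from the OR truth table.
assert w0 < 0              # (0,0) -> 0
assert w0 + w1 >= 0        # (1,0) -> 1
assert w0 + w2 >= 0        # (0,1) -> 1
assert w0 + w1 + w2 >= 0   # (1,1) -> 1
print("All four OR conditions hold.")
```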
We need a procedure that finds the weights automatically, by looking at the data and correcting its mistakes one at a time. That is the perceptron learning algorithm.
The Perceptron Learning Algorithm
Setting Up the Problem
Suppose we have \(N\) labelled training examples. Each example is a movie with \(d\) features, and the label tells us whether the person liked it:
\[ \left\{ \left(\vb{x}_i, y_i\right) \right\}_{i=1}^{N}, \quad y_i \in \{0, 1\}. \]
Here \(\vb{x}_i\) is the input vector for the \(i\)-th movie (including the always-on bias input \(x_0 = 1\)), and \(y_i\) is the true label. We assume the data is linearly separable, meaning there exists some weight vector \(\vb{w}\) such that the perceptron correctly classifies every training example.
We define two sets:
\[ \mathcal{P} = \left\{ \vb{x}_i \mid y_i = 1 \right\}, \qquad \mathcal{N} = \left\{ \vb{x}_i \mid y_i = 0 \right\}. \]
\(\mathcal{P}\) is the set of positive examples (liked movies) and \(\mathcal{N}\) is the set of negative examples (disliked movies). Our goal is to find \(\vb{w}\) such that every \(\vb{x}_i \in \mathcal{P}\) satisfies \(\vb{w}^{\intercal} \vb{x}_i \geq 0\) and every \(\vb{x}_i \in \mathcal{N}\) satisfies \(\vb{w}^{\intercal} \vb{x}_i < 0\).
The Algorithm
The core idea is disarmingly simple: start with a random weight vector, pick a training example at random, and if it is misclassified, nudge the weight vector to correct the mistake. Repeat until no mistakes remain.
Algorithm 1
\begin{algorithm}
\caption{Perceptron Learning Algorithm}
\begin{algorithmic}
\INPUT Training data $\left\{\left(\mathbf{x}_i, y_i\right)\right\}_{i=1}^N$ where $y_i \in \{0,1\}$
\OUTPUT A weight vector $\mathbf{w}$ that separates both classes (if data is linearly separable)
\STATE Initialize weights $\mathbf{w} \sim \text{random}$
\STATE Define positive set $\mathcal{P} \gets \left\{ \mathbf{x}_i \mid y_i = 1 \right\}$
\STATE Define negative set $\mathcal{N} \gets \left\{ \mathbf{x}_i \mid y_i = 0 \right\}$
\REPEAT
\STATE Pick $\mathbf{x}_i \in \mathcal{P} \cup \mathcal{N}$ at random
\IF{$\mathbf{x}_i \in \mathcal{P}$ \AND $\mathbf{w}^\intercal \mathbf{x}_i < 0$}
\STATE $\mathbf{w} \gets \mathbf{w} + \mathbf{x}_i$
\ENDIF
\IF{$\mathbf{x}_i \in \mathcal{N}$ \AND $\mathbf{w}^\intercal \mathbf{x}_i \geq 0$}
\STATE $\mathbf{w} \gets \mathbf{w} - \mathbf{x}_i$
\ENDIF
\UNTIL{convergence}
\end{algorithmic}
\end{algorithm}
“Convergence” here means that all \(N\) training examples are correctly classified by the current \(\vb{w}\): no misclassified point remains.
The conditions at lines 6 and 9 both detect misclassifications.
- Line 6 catches a positive example placed on the wrong side of the boundary: \(\vb{x}_i \in \mathcal{P}\) but \(\vb{w}^{\intercal} \vb{x}_i < 0\), when we need \(\vb{w}^{\intercal} \vb{x}_i \geq 0\).
- Line 9 catches the opposite: a negative example on the wrong side, \(\vb{x}_i \in \mathcal{N}\) but \(\vb{w}^{\intercal} \vb{x}_i \geq 0\), when we need \(\vb{w}^{\intercal} \vb{x}_i < 0\).
When a misclassification is detected, the weight vector is updated by adding or subtracting the misclassified point. But why does this fix the error? The answer turns out to be a direct consequence of what the dot product measures: the angle between two vectors. We are about to see that the entire algorithm reduces to a beautifully simple geometric picture, and once you see it, the update rules will feel completely inevitable.
Verifying the Algorithm on OR
Before diving into the geometry, let us confirm the algorithm works on the OR example we just analyzed by hand. The following code runs the perceptron learning algorithm on the OR dataset and plots the learned boundary.
Code
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
plt.style.use("dark_background")
# OR dataset: rows are [x0=1, x1, x2], labels are y
X = np.array([
[1, 0, 0],
[1, 1, 0],
[1, 0, 1],
[1, 1, 1],
])
y = np.array([0, 1, 1, 1])
# Perceptron learning
rng = np.random.default_rng(42)
w = rng.uniform(-0.5, 0.5, size=3)
def predict(w, x):
return 1 if w @ x >= 0 else 0
max_iter = 1000
for _ in range(max_iter):
errors = 0
for xi, yi in zip(X, y):
pred = predict(w, xi)
if yi == 1 and pred == 0:
w = w + xi
errors += 1
elif yi == 0 and pred == 1:
w = w - xi
errors += 1
if errors == 0:
break
print(f"Learned weights: w0={w[0]:.3f}, w1={w[1]:.3f}, w2={w[2]:.3f}")
# Plot
fig, ax = plt.subplots(figsize=(5, 5), facecolor="#222222")
ax.set_facecolor("#222222")
# Decision boundary: w0 + w1*x1 + w2*x2 = 0 => x2 = -(w0 + w1*x1) / w2
x1_vals = np.linspace(-0.3, 1.5, 300)
if abs(w[2]) > 1e-9:
x2_vals = -(w[0] + w[1] * x1_vals) / w[2]
ax.plot(x1_vals, x2_vals, color="#3498db", linewidth=2, label="Decision boundary")
# Shade positive and negative half-spaces
x1g, x2g = np.meshgrid(np.linspace(-0.3, 1.5, 300), np.linspace(-0.3, 1.5, 300))
Z = w[0] + w[1] * x1g + w[2] * x2g
ax.contourf(x1g, x2g, Z, levels=[-1e6, 0, 1e6],
colors=["#c0392b", "#27ae60"], alpha=0.18)
# Plot data points
colors = {0: "#e74c3c", 1: "#2ecc71"}
for xi, yi in zip(X, y):
ax.scatter(xi[1], xi[2], color=colors[yi], s=120, zorder=5,
edgecolors="white", linewidths=0.8)
ax.set_xlim(-0.3, 1.5)
ax.set_ylim(-0.3, 1.5)
ax.set_xlabel("$x_1$", fontsize=13, color="white")
ax.set_ylabel("$x_2$", fontsize=13, color="white")
ax.set_title("OR function: perceptron decision boundary", fontsize=13,
color="white", pad=12)
ax.tick_params(colors="white")
for spine in ax.spines.values():
spine.set_edgecolor("#555555")
pos_patch = mpatches.Patch(color="#2ecc71", alpha=0.5, label="$y=1$ (positive)")
neg_patch = mpatches.Patch(color="#e74c3c", alpha=0.5, label="$y=0$ (negative)")
bound_line = plt.Line2D([0], [0], color="#3498db", linewidth=2,
label="Decision boundary")
ax.legend(handles=[pos_patch, neg_patch, bound_line], fontsize=10,
facecolor="#333333", edgecolor="#555555", labelcolor="white")
plt.tight_layout()
plt.show()
Learned weights: w0=-0.726, w1=0.939, w2=1.359
The algorithm finds one of the infinitely many valid boundaries, separating the single negative point \((0, 0)\) from the three positive points. The boundary it lands on depends on the random initialization and the order in which it processes the training examples, but every valid solution correctly classifies all four points.
Now the natural question is: why do the update rules actually work? Answering that requires us to look at the geometry more carefully.
Why the Weight Updates Work: The Geometry
We are about to see that the perceptron updates have a beautiful geometric interpretation. The payoff is worth the short detour through dot product geometry.
The Angle Interpretation of the Dot Product
Recall that we introduced \(\vb{w}^{\intercal} \vb{x}\) as a compact way to write the weighted sum. It turns out the dot product also encodes the angle between the two vectors. Specifically, for any two vectors \(\vb{a}\) and \(\vb{b}\):
\[ \vb{a}^{\intercal} \vb{b} = \norm{\vb{a}} \norm{\vb{b}} \cos\left(\alpha\right), \]
where \(\alpha \in \left[0^{\circ}, 180^{\circ}\right]\) is the angle between them. Rearranging:
\[ \cos\left(\alpha\right) = \frac{\vb{a}^{\intercal} \vb{b}}{\norm{\vb{a}} \norm{\vb{b}}}. \tag{8}\]
The sign of the dot product directly tells us the angle: if \(\vb{a}^{\intercal} \vb{b} > 0\) then \(\alpha < 90^{\circ}\) (acute angle, the vectors point roughly in the same direction); if it equals zero then \(\alpha = 90^{\circ}\) (perpendicular); and if it is negative then \(\alpha > 90^{\circ}\) (obtuse angle, the vectors point roughly in opposite directions).
This is the key fact we need.
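Equation 8 is easy to check numerically. A quick sketch with arbitrary vectors, showing how the sign of the dot product tracks the angle:

```python
import numpy as np

def angle_deg(a, b):
    """Angle between two vectors, via Equation 8."""
    cos_alpha = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos_alpha, -1.0, 1.0)))

a = np.array([1.0, 0.0])
print(angle_deg(a, np.array([1.0, 1.0])))   # 45.0  <- positive dot product
print(angle_deg(a, np.array([0.0, 1.0])))   # 90.0  <- zero dot product
print(angle_deg(a, np.array([-1.0, 1.0])))  # 135.0 <- negative dot product
```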
The Weight Vector is Perpendicular to the Boundary
Recall the decision boundary: \(\vb{w}^{\intercal} \vb{x} = 0\). Any point \(\vb{x}\) lying exactly on the boundary has a dot product of zero with \(\vb{w}\). By Equation 8, this means \(\alpha = 90^{\circ}\): the weight vector is perpendicular to every point on the boundary, and hence perpendicular to the boundary itself. This is why \(\vb{w}\) appears as a perpendicular arrow in Figure 2.
This is not just a curiosity. Since \(\vb{w}\) is perpendicular to the boundary and points into the positive half-space, the dot product \(\vb{w}^{\intercal} \vb{x}\) is simply measuring whether \(\vb{x}\) is on the same side as \(\vb{w}\) (positive dot product, acute angle, \(y = 1\)) or the opposite side (negative dot product, obtuse angle, \(y = 0\)). The perceptron’s decision rule is, geometrically, just a question of angle.
For a correctly classified positive point \(\vb{x}_i \in \mathcal{P}\): \(\vb{w}^{\intercal} \vb{x}_i \geq 0\), so the angle between \(\vb{w}\) and \(\vb{x}_i\) is at most \(90^{\circ}\), as required. For a correctly classified negative point \(\vb{x}_i \in \mathcal{N}\): \(\vb{w}^{\intercal} \vb{x}_i < 0\), so the angle is greater than \(90^{\circ}\), as required.
Fixing a Misclassified Positive Point
Now suppose \(\vb{x}_i \in \mathcal{P}\) is misclassified: the current \(\vb{w}\) gives \(\vb{w}^{\intercal} \vb{x}_i < 0\). By Equation 8, the angle between \(\vb{w}\) and \(\vb{x}_i\) is currently obtuse. But for a positive point, we need this angle to be acute. We need to rotate \(\vb{w}\) towards \(\vb{x}_i\).
The update \(\vb{w} \gets \vb{w} + \vb{x}_i\) does exactly this. The new weight vector \(\vb{w}' = \vb{w} + \vb{x}_i\) is the vector sum, and by the parallelogram law, it lies strictly between \(\vb{w}\) and \(\vb{x}_i\), thus making a smaller angle with \(\vb{x}_i\) than \(\vb{w}\) did. Adding the misclassified positive point to \(\vb{w}\) pulls \(\vb{w}\) towards that point. Figure 4 shows this geometry.
Fixing a Misclassified Negative Point
Conversely, suppose \(\vb{x}_i \in \mathcal{N}\) is misclassified: \(\vb{w}^{\intercal} \vb{x}_i \geq 0\), so the angle is currently acute. For a negative point, we need it to be obtuse. We need to rotate \(\vb{w}\) away from \(\vb{x}_i\).
The update \(\vb{w} \gets \vb{w} - \vb{x}_i\) does this. The new weight vector \(\vb{w}' = \vb{w} - \vb{x}_i\) makes a larger angle with \(\vb{x}_i\), pushing \(\vb{x}_i\) toward the negative side. Figure 5 shows the geometry.
Each update is a small rotation of the decision boundary. The algorithm converges because, for linearly separable data, these rotations always make net progress: in aggregate, the angle between \(\vb{w}\) and the positive points can only decrease, and the angle to the negative points can only increase. This is the content of the Perceptron Convergence Theorem, which guarantees that if the data is linearly separable, the algorithm finds a separating \(\vb{w}\) in a finite number of updates (Minsky & Papert, 1969, Ch. 11).
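The rotation argument can be verified numerically. A sketch with made-up vectors: adding a misclassified positive point shrinks the angle to it, and subtracting a misclassified negative point grows it:

```python
import numpy as np

def angle_deg(a, b):
    """Angle between two vectors via the dot product."""
    cos_alpha = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cos_alpha, -1.0, 1.0)))

w = np.array([1.0, 0.0])

# Misclassified positive point: w.T x < 0, so the angle is obtuse.
x_pos = np.array([-1.0, 2.0])
assert w @ x_pos < 0
w_new = w + x_pos  # the update pulls w towards x_pos
print(angle_deg(w, x_pos), "->", angle_deg(w_new, x_pos))  # angle shrinks

# Misclassified negative point: w.T x >= 0, so the angle is acute.
x_neg = np.array([1.0, 1.0])
assert w @ x_neg >= 0
w_new2 = w - x_neg  # the update pushes w away from x_neg
print(angle_deg(w, x_neg), "->", angle_deg(w_new2, x_neg))  # angle grows
```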
The following animation shows the perceptron learning algorithm in action on the OR dataset, illustrating how the decision boundary rotates with each update.
Recap: Where We Are
flowchart TD
A["A concrete decision scenario<br>(the movie example)"]
B["The perceptron model<br>(weights, bias, threshold)"]
C["Geometric picture<br>(the decision boundary)"]
D["Perceptron vs. MP Neuron<br>(what changed and why it matters)"]
E["The learning algorithm<br>(finding weights from data)"]
F["Why the updates work<br>(geometry of weight updates)"]
G["What if data is not<br>linearly separable? (XOR)"]
H["What comes next<br>(networks of perceptrons)"]
A --> B --> C --> D --> E --> F --> G --> H
style F fill:#27ae60,color:#fff
style G fill:#2c3e50,color:#ecf0f1
style H fill:#2c3e50,color:#ecf0f1
We now have a complete picture of how the perceptron works and why the learning algorithm is correct. All that remains is to understand where it breaks down.
What if the Data is Not Linearly Separable?
Everything we have built so far assumes the data is linearly separable. But what happens when it is not?
Consider the XOR function, where the output is \(1\) exactly when the two inputs differ. Table 2 shows the truth table alongside what the perceptron equation requires for each input.
| \(x_1\) | \(x_2\) | XOR | Required condition |
|---|---|---|---|
| \(0\) | \(0\) | \(0\) | \(w_0 + \left(w_1 \times 0\right) + \left(w_2 \times 0\right) < 0\) |
| \(1\) | \(0\) | \(1\) | \(w_0 + \left(w_1 \times 1\right) + \left(w_2 \times 0\right) \geq 0\) |
| \(0\) | \(1\) | \(1\) | \(w_0 + \left(w_1 \times 0\right) + \left(w_2 \times 1\right) \geq 0\) |
| \(1\) | \(1\) | \(0\) | \(w_0 + \left(w_1 \times 1\right) + \left(w_2 \times 1\right) < 0\) |
Substituting, the conditions are:
\[ \begin{align*} w_0 &< 0,\\ w_0 + w_2 &\geq 0,\\ w_0 + w_1 &\geq 0,\\ w_0 + w_1 + w_2 &< 0. \end{align*} \]
The second and third inequalities give \(w_1 \geq -w_0 > 0\) and \(w_2 \geq -w_0 > 0\). Adding these: \(w_1 + w_2 \geq -2w_0 > -w_0\). But the fourth inequality requires \(w_1 + w_2 < -w_0\). This is a direct contradiction. No set of weights can satisfy all four conditions simultaneously.
The following plot shows why geometrically: the positive points \((1, 0)\) and \((0, 1)\) sit on one diagonal, and the negative points \((0, 0)\) and \((1, 1)\) sit on the other. No straight line can put both positive points on one side and both negative points on the other.
Code
import numpy as np
import matplotlib.pyplot as plt
plt.style.use("dark_background")
fig, ax = plt.subplots(figsize=(5, 5), facecolor="#222222")
ax.set_facecolor("#222222")
# XOR truth table
points = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
colors = {0: "#e74c3c", 1: "#2ecc71"}
labels_added = {0: False, 1: False}
for x1, x2, label in points:
lbl = ("$y=1$ (positive)" if label == 1 else "$y=0$ (negative)") \
if not labels_added[label] else "_nolegend_"
ax.scatter(x1, x2, color=colors[label], s=200, zorder=5,
edgecolors="white", linewidths=1.2, label=lbl)
labels_added[label] = True
ax.set_xlim(-0.5, 1.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel("$x_1$", fontsize=14, color="white")
ax.set_ylabel("$x_2$", fontsize=14, color="white")
ax.set_title("XOR: no linear boundary can separate the classes",
fontsize=12, color="white", pad=12)
ax.tick_params(colors="white")
for spine in ax.spines.values():
spine.set_edgecolor("#555555")
for x1, x2, label in points:
ax.annotate(f"({x1},{x2})", xy=(x1, x2),
xytext=(x1 + 0.08, x2 + 0.08),
fontsize=11, color="white")
ax.legend(fontsize=11, facecolor="#333333", edgecolor="#555555",
labelcolor="white")
plt.tight_layout()
plt.show()
If you try to draw a line separating the green points from the red points, you will quickly see it is impossible: any line that separates \((1, 0)\) and \((0, 1)\) from the origin will also capture \((1, 1)\) on the positive side (or vice versa). The positive and negative classes are interleaved in a way that no straight line can disentangle.
The perceptron learning algorithm, faced with non-linearly separable data, will never converge. It will loop forever, oscillating between weight vectors that misclassify different subsets of points, never finding one that works for all of them. This is a fundamental limitation, and it is precisely what Minsky & Papert (1969) documented in their careful analysis. Their work showed that many practically interesting functions, including XOR, are non-linearly separable and therefore beyond the reach of a single perceptron.
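A quick experiment makes the non-convergence visible. This is a sketch (the epoch cap of 1000 is arbitrary): running the same learning loop we used for OR on the XOR data never reaches an error-free pass:

```python
import numpy as np

# XOR dataset: rows are [x0=1, x1, x2]
X = np.array([[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]])
y = np.array([0, 1, 1, 0])  # XOR labels

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=3)

converged = False
for epoch in range(1000):  # a generous cap; no cap would ever be enough
    errors = 0
    for xi, yi in zip(X, y):
        pred = 1 if w @ xi >= 0 else 0
        if yi == 1 and pred == 0:
            w = w + xi
            errors += 1
        elif yi == 0 and pred == 1:
            w = w - xi
            errors += 1
    if errors == 0:
        converged = True
        break

print(f"Converged: {converged}")  # Converged: False
```

Since no separating weight vector exists, the error count never reaches zero and the loop only stops because of the epoch cap.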
The natural question is: can we do better by combining multiple perceptrons? The answer is yes, and that is the subject of the next post.
Conclusion: What We Have Built and Where It Leads
Let us return to where we started: the movie decision. You have a set of features about a film, you weight them by importance, sum them up, and apply a threshold. The perceptron formalizes exactly this intuition, and the concrete failure of equal-weight summing for our Nolan fan was the precise gap that weights fill.
Along the way, we established several things that will reappear throughout the rest of this series. The perceptron generalizes the McCulloch-Pitts neuron (McCulloch & Pitts, 1943) by introducing learnable weights and a bias, allowing it to handle real-valued inputs and to shift its decision boundary freely. The boundary itself is always a hyperplane, and the weight vector \(\vb{w}\) always points perpendicularly into the positive half-space. This geometric fact is not an accident: it is the reason the learning algorithm works. Every weight update is a rotation of the boundary, pulling it toward misclassified positive points and pushing it away from misclassified negative points. The Perceptron Convergence Theorem (Minsky & Papert, 1969) guarantees this process terminates whenever the data is linearly separable.
The hard limit is linear separability. The XOR example shows that a single perceptron cannot represent functions where the positive and negative classes are geometrically interleaved. But the failure of a single perceptron is actually the beginning of the deeper story: a network of perceptrons can implement any boolean function, and with continuous activations, can approximate any function. That story starts in the next post.
Acknowledgment
I have referred to the YouTube playlists (IIT Madras - B.S. Degree Programme, 2023; NPTEL-NOC IITM, 2019) while writing this blog post.