Binary Classification

What kind of problems do we try to solve in machine learning? My attempt to answer this splits it into two types of problems: Classification problems and Regression problems. Roughly speaking, classification means the task of assigning a given data point to one of a (small) number of classes of things. For example, make an algorithm that decides, from an image of a hand-written letter (the data point), which letter of the alphabet is in the image. By regression, I mean the task of finding a meaningful relationship (a correlation, perhaps) between features of the data. Does lifespan increase as lifetime average salary does, for example?

In this post we’ll spend time on classification problems (AKA decision problems), and on the simplest type — binary classification. Often, binary classification problems are phrased so that the goal is a yes or no prediction (yes, this data point does fit in the class, or no it does not).

A common example is classifying emails as spam or not spam. To build a spam detector, you must decide how the emails will be represented to the algorithm. It’s likely that you want to use some coordinates that describe the contents of the email. For example, you might count how often words are misspelled, how often a sentence ends with an exclamation point, and how often words are fully capitalized (“ACT NOW!”). These frequencies/coordinates would then represent the email as a vector.
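To make that concrete, here is a rough sketch in Python of how one might compute those three coordinates from an email’s text. The function name, the tiny word list standing in for a dictionary, and the particular features are illustrative choices, not a recommended design.

```python
# A rough sketch of turning an email's text into the three coordinates above.
# The tiny "dictionary" of known words is a stand-in for a real word list.
def email_to_vector(email, dictionary):
    words = email.split()
    if not words:
        return [0.0, 0.0, 0.0]
    # frequency of (apparently) misspelled words
    misspelled = sum(w.strip(".,!?").lower() not in dictionary for w in words) / len(words)
    # fraction of sentence-ending punctuation that is an exclamation point
    enders = [c for c in email if c in ".!?"]
    exclaim = enders.count("!") / len(enders) if enders else 0.0
    # frequency of fully capitalized words
    all_caps = sum(w.isupper() and len(w) > 1 for w in words) / len(words)
    return [misspelled, exclaim, all_caps]

print(email_to_vector("ACT NOW! Free moneyy awaits you.", {"act", "now", "free", "awaits", "you"}))
```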

After this, the “learning” part can occur. You are given a training dataset — emails that have already been labeled yes (spam) or no (not spam). You also pick a model to use. The model is determined by some parameters, which you choose initially in some way. The model can then guess the labels of the training data, check how wrong those guesses are overall (using a loss or error function), and update the parameters in some way that reduces the loss — attempting to minimize the loss function.
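As a toy illustration of this loop (not the algorithm discussed later in the post), here is a sketch where the “model” is a single threshold on one made-up coordinate, the loss is the fraction of training emails it gets wrong, and “learning” is simply trying candidate thresholds and keeping the best one.

```python
# A toy version of "pick a model, measure the loss, adjust the parameters".
# The model: predict spam (+1) when a single score exceeds a threshold t.
def zero_one_loss(t, scores, labels):
    guesses = [1 if s > t else -1 for s in scores]
    return sum(g != y for g, y in zip(guesses, labels)) / len(labels)

# made-up training data: one feature per email, labels +1 = spam, -1 = not spam
scores = [0.1, 0.4, 2.5, 3.0]
labels = [-1, -1, 1, 1]

# "learning" here is just trying candidate values of t and keeping the best one
candidates = [i / 10 for i in range(41)]
best_t = min(candidates, key=lambda t: zero_one_loss(t, scores, labels))
print(best_t, zero_one_loss(best_t, scores, labels))   # prints 0.4 0.0
```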

🐉 More generally, there is an instance space \( \mathcal X\) of input vectors that you want to make decisions about, 1 and a label space \( \mathcal Y\), the possible labels for an instance from \(\mathcal X\). Use a function to describe this relationship: if \( y\) is the label for instance \(\bf x\), then write \( h({\bf x}) = y\) to mean \( y\) is the label of \(\bf x\). This is a function \( h: \mathcal X \to \mathcal Y\). Our goal is to find such a function so that \(h({\bf x})\) is correct as much as possible. We must do so using only information from the training dataset (called \( \mathcal X^{train}\); this is a subset of \(\mathcal X\)) and the corresponding labels \( \mathcal Y^{train}\), a list of values from \(\mathcal Y\), each value corresponding to a point in \(\mathcal X^{train}\). In binary classification, we often let \( \mathcal Y = \{-1, 1\}\).
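To see the notation in a concrete (made-up) setting: the instances below live in \( \mathbb R^2\), the labels in \( \{-1,1\}\), and \( h\) is just some function from instances to labels (an arbitrary one here, chosen only to show the shape of the objects involved).

```python
# A tiny, made-up illustration of the notation: X^train, Y^train, and a hypothesis h.
X_train = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.6]]   # points of X^train (so n = 2)
Y_train = [-1, 1, 1]                              # corresponding labels from Y = {-1, 1}

def h(x):
    # an arbitrary hypothesis X -> Y; the goal of learning is to find a good one
    return 1 if x[0] > 0.5 else -1

# how often h agrees with the known labels on the training set
accuracy = sum(h(x) == y for x, y in zip(X_train, Y_train)) / len(Y_train)
print(accuracy)   # prints 1.0 for this particular h and data
```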

To get a better feel for how this can work, let’s discuss the most basic example, a half-space model. Remember that the data has been represented with coordinates, that is, a list of numbers. Call the number of coordinates \( n\) (each \( {\bf x} \in \mathcal X\) is in \( \mathbb R^n\)). You have to find a way to decide whether the answer is yes or no (whether \( h({\bf x}) = 1\) or \( h({\bf x})=-1\)) based on those coordinates.

[Figure: a hyperplane with its coefficient vector \( {\bf w}\), shown for \( n=2\).]

Lines are good at cutting a plane in two parts, right? Their generalization to more than two coordinates, a hyperplane in \( \mathbb R^n\), is equally nice. A hyperplane, call it \( H\), splits \( \mathbb R^n\) into two half-spaces that meet at \( H\). As a bonus, it is easy to tell in which half-space you are.
🐉 To determine \( H\) you need \( n\) coefficients, call them \( w_1,w_2,\ldots,w_n\), and one more number \( b\). The points of \( H\) are exactly those points \( (x_1,x_2,\ldots,x_n)\) where

\( w_{1}x_{1} + w_{2}x_{2} + \ldots + w_{n}x_{n} + b = 0\)

is a true equation. Notice that if you group the coefficients into a vector \( {\bf w} = (w_1,w_2,\ldots,w_n)\) and call \( {\bf x} = (x_1,x_2,\ldots,x_n)\), then the hyperplane is where \( {\bf w}\cdot{\bf x} + b = 0\). Think of \( {\bf w}\) as determining the direction that the hyperplane faces, and \( b\) is a shift away from the point \( (0,0,\ldots,0)\).

🐉 In fact, for an arbitrary point \( {\bf x}\) in \( \mathbb R^n\), whether \( {\bf w}\cdot{\bf x} + b\) is positive or negative determines on which side of the hyperplane \( {\bf x}\) lies. So the half-space model sets \( h({\bf x}) = 1\) if \( {\bf w}\cdot{\bf x} + b > 0\) and \( h({\bf x})=-1\) if \( {\bf w}\cdot{\bf x} + b < 0\).2
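In code, the half-space model is just a sign check on \( {\bf w}\cdot{\bf x} + b\); the numbers used for \( {\bf w}\) and \( b\) below are made up for illustration.

```python
# The half-space model: h(x) = 1 if w·x + b > 0, and h(x) = -1 otherwise.
def h(x, w, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1   # points with score exactly 0 get -1 here (see footnote 2)

w, b = [1.0, -2.0], 0.5             # made-up parameters, n = 2
print(h([3.0, 1.0], w, b))          # 3 - 2 + 0.5 = 1.5 > 0, so prints 1
print(h([0.0, 2.0], w, b))          # 0 - 4 + 0.5 = -3.5 < 0, so prints -1
```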

Thus the half-space model is determined by a vector \( {\bf w}\) and a number \( b\). With luck, you can find a \( {\bf w}\) and \( b\) where the model output matches the known labels on the training data reasonably well. If you are really lucky, then there is some \( {\bf w}\) and \( b\) so that every possible data point gets labeled correctly by the output of the model. When this happens, the data is called linearly separable. 3 If you are luckier still, then you have an algorithm that will find that \( {\bf w}\) and \( b\) for you.


🐉🐉 Perceptron algorithm. If you are handed a training dataset, what algorithm can you use to find \( {\bf w}\) and \( b\)? The Perceptron algorithm4 is one possibility. Some initial discussion is needed before giving the algorithm.

The expression \( {\bf w}\cdot{\bf x} + b\) can be rewritten as a dot product of two vectors. Let \(\bf W\) be the \((n+1)\)-component vector that agrees with \(\bf w\) in the first \(n\) components, and has \(b\) as its last component. Also, define \(\bf X\) as the \((n+1)\)-vector that agrees with \(\bf x\) in the first \(n\) components and has 1 as last component. Then \( {\bf w\cdot x} + b = {\bf W\cdot X}\).
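A quick numeric check of this bookkeeping trick, with made-up numbers for \( {\bf w}\), \( b\), and \( {\bf x}\):

```python
# Appending b to w and a 1 to x turns w·x + b into a single dot product W·X.
w, b = [2.0, -1.0], 0.5
x = [3.0, 4.0]

W = w + [b]        # (n+1)-vector (w_1, ..., w_n, b)
X = x + [1.0]      # (n+1)-vector (x_1, ..., x_n, 1)

dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
print(dot(w, x) + b, dot(W, X))   # both equal 2.5
```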

Also, if \( {\bf X_1}, {\bf X_2},\ldots, {\bf X_m} \in \mathcal X^{train}\) are the training data (\((n+1)\)-vectors, as was just explained), then for each \(\bf X_i\) set its label \( y_i\) equal to \( +1\) or \( -1\) according to the correct label that is given. The goal, assuming the data is linearly separable, is to find a \( {\bf W}\) so that \( {\bf W}\cdot{\bf X_i} > 0\) if \( y_i > 0\) and \( {\bf W}\cdot{\bf X_i} < 0\) if \( y_i < 0\). In other words, to make \( y_i\ {\bf W}\cdot{\bf X_i} > 0\) for all \(i\).

With this in mind, the Perceptron algorithm finds \( {\bf W}\) iteratively. It says to do the following with the input \( \mathcal X^{train} = [{\bf X_1}, {\bf X_2},\ldots, {\bf X_m}]^{\top}, \quad\mathcal Y^{train} = [y_1,y_2,\ldots,y_m]\) (a code sketch follows the steps):

⁍ Initialize \( {\bf W}^{(1)}\) to be \( (0,0,\ldots,0)\).

⁍ On step \( t\), if there is an \( {\bf X_i}\) where \( y_i\ {\bf W}^{(t)}\cdot{\bf X_i}\) is not positive (that is, not strictly greater than zero), then set \( {\bf W}^{(t+1)} = {\bf W}^{(t)} + y_i\ {\bf X_i}\), for that \( i\). Otherwise, output \( {\bf W = W^{(t)}}\).
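Here is a sketch of the algorithm in Python. It cycles through the training points, continuing from the point after the most recent update (one reasonable checking order, and the one used in the worked example below), and it assumes the data is linearly separable, since otherwise the loop never stops.

```python
# A sketch of the Perceptron algorithm as described above.  Points are given in
# their original coordinates; the 1 in the bias slot is appended inside.
def perceptron(points, labels):
    X = [list(p) + [1.0] for p in points]            # augmented (n+1)-vectors X_i
    W = [0.0] * len(X[0])                            # W^(1) = (0, ..., 0)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    i, clean_checks = 0, 0
    while clean_checks < len(X):                     # stop once every X_i passes in a row
        if labels[i] * dot(W, X[i]) <= 0:            # y_i W·X_i not positive: update
            W = [Wj + labels[i] * Xj for Wj, Xj in zip(W, X[i])]
            clean_checks = 0
        else:
            clean_checks += 1
        i = (i + 1) % len(X)                         # continue from the next point, cycling
    return W                                         # the last entry of W plays the role of b

# On the toy dataset from the example later in the post, this prints [4.0, -0.5, 1.0].
print(perceptron([(-1, 3), (-1, -1), (3, -1), (0, 1.5)], [-1, -1, 1, 1]))
```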

If the data is linearly separable, then the Perceptron algorithm will stop after, at most, a number of iterations that depends on the training set \( {\bf X_1},{\bf X_2}, \ldots, {\bf X_m}\) with its labels \( y_1,y_2, \ldots, y_m\):

Theorem. Define \( R = \max_i |{\bf X_i}|\) and \( B =\min_{\bf V} \{|{\bf V}| : \forall \ i,\ y_i{\bf V}\cdot{\bf X_i} \ge 1\}\). Then the Perceptron algorithm stops after at most \( (RB)^2\) iterations, and when it stops with output \( {\bf W}\), then \( y_i {\bf W}\cdot{\bf X_i} > 0\) for all \( i\).

Idea of the proof: Write \( {\bf W^*}\) for a vector that realizes the minimum \( B\). In other words, \( |{\bf W^*}| = B\) and \( y_i {\bf W^*}\cdot{\bf X_i} \ge 1\) for all \( i\).

Using how \( {\bf W}^{(t+1)}\) is obtained from \( {\bf W}^{(t)}\), you can show that after \( T+1\) iterations, \( {\bf W^*}\cdot{\bf W}^{(T+1)} \ge T\). Also, you can show that \( |{\bf W}^{(T+1)}| \le \sqrt{T} R\) (for this, you use the condition on \({\bf W}^{(T)}\) that necessitates an update). This, with the Cauchy-Schwarz inequality, means that \( \frac{\sqrt{T}}{BR} \le 1\), which can be rearranged to \( T \le (BR)^2.\quad\quad \square\)
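To spell out how the Cauchy-Schwarz inequality enters: the two bounds above combine to give

\( T \le {\bf W^*}\cdot{\bf W}^{(T+1)} \le |{\bf W^*}|\,|{\bf W}^{(T+1)}| \le B\sqrt{T}R, \)

where the middle step is Cauchy-Schwarz; dividing through by \( \sqrt{T}BR\) gives \( \frac{\sqrt{T}}{BR} \le 1\).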


Example of a linearly separable dataset. Let’s look at a toy example in \( \mathbb R^2\). Let \( \mathcal X = [(-1,3), (-1,-1), (3,-1), (0,1.5)]^{\top}\) and \( \mathcal Y = [-1,-1,1,1]\).

Before beginning, note that you must make a choice of the order in which you check the dot products on the instance set. Different choices affect the number of iterations the algorithm goes through. It can also affect the particular output \( {\bf W}^{(T)}\). In this example, I will start with \(\bf X_1\) and proceed in order through the points in \(\mathcal X\). After I make an update to \({\bf W}^{(t)}\), I will not restart at the beginning of \(\mathcal X\). Instead, I will continue from the next point in the set, but I will make sure that all points in \(\mathcal X\) have been checked on the final \({\bf W}\).

On step 1 of the Perceptron algorithm (since \( {\bf W}^{(1)} = \vec{\bf 0}\)) you get \( {\bf W}^{(2)} = \vec{\bf 0} + y_1{\bf X_1} = (1,-3,-1)\) if you use \( {\bf X_1}\) first.

On the next step, checking the vector \( {\bf X_2}\), you get \( y_2{\bf W}^{(2)}\cdot {\bf X_2} = -1\). Since this is negative, we set \( {\bf W}^{(3)} = {\bf W}^{(2)} + y_2{\bf X_2} = {\bf W}^{(2)} - (-1,-1,1) = (2,-2,-2)\). If you continue this way, checking the dot products in the order mentioned, then on the 10th step you find \( {\bf W}^{(10)} = (4, -0.5, 1)\), which separates the points.
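A quick sanity check in code that \( {\bf W}^{(10)}\) really does separate the four augmented points (every \( y_i\,{\bf W}\cdot{\bf X_i}\) should come out positive):

```python
# Check that W^(10) = (4, -0.5, 1) gives y_i W·X_i > 0 for every training point.
W = [4.0, -0.5, 1.0]
X = [[-1, 3, 1], [-1, -1, 1], [3, -1, 1], [0, 1.5, 1]]   # augmented X_1, ..., X_4
y = [-1, -1, 1, 1]

for Xi, yi in zip(X, y):
    print(yi * sum(Wj * Xj for Wj, Xj in zip(W, Xi)))    # prints 4.5, 2.5, 13.5, 0.25
```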

 

The line in \( \mathbb R^2\) determined by \( {\bf W} = {\bf W}^{(10)}\) is \( 4x_1 - 0.5x_2 + 1 = 0\). This line is shown on the left.

In the next post, we will look at a famous dataset that has a part that is linearly separable from the rest — the Iris dataset.

 


  1. In the email example, think of \(\mathcal X\) as the space of vectors that are made from an email. For example, if 23 coordinates were made (from frequencies and related things), then the instance space is \(\mathbb R^{23}\).
  2. A choice must be made about whether to assign a positive or a negative label to an \(\bf x\) that satisfies \({\bf w}\cdot{\bf x}+b = 0\). This choice should not affect how well the model performs.
  3. Whether it happens or not can depend on the way you chose to represent the data in coordinates. Side observation: if data is linearly separable (by a hyperplane \( H\)) then the line segment between two points on the positive side of \( H\) and the line segment between two points on the negative side of \( H\) cannot possibly intersect. This provides a criterion to decide that your data is not linearly separable.
  4. Rosenblatt, F. (1958), “The perceptron: A probabilistic model for information storage and organization in the brain,” Psychological Review 65, 386-407. (Reprinted in Neurocomputing, MIT Press, 1988.)