Weak Learnability and Decision Stumps

The last post discussed the tradeoff that arises from the decomposition of Mean Squared Error into a Bias and a Complexity (and a Noise) term. When a measure other than MSE is used for the loss of a model function (in classification problems, for example), a similar decomposition appears.

This and the next post will address an idea for how to take a hypothesis class that has strong Bias (the set of model functions is not rich enough for any one of them to be a good classifier on the data) and combine the functions, with weights, to get a good classifier. In essence, taking weighted sums of functions from a hypothesis class with strong Bias adds the needed Complexity, enough to reduce the generalization error.


Notions of learnability. Returning to the setting of binary classification, the post Risk Minimization with Bias contains a theorem about getting small error (not more than a given \( \varepsilon\)) with high probability (at least \( 1-\delta\), for a given \( \delta\)) when the training set is sufficiently large and the hypothesis class has only finitely many functions. When a similar statement can be made about any given hypothesis class \( \mathcal H\) (not necessarily finite), then \( \mathcal H\) is called PAC learnable.

A class of functions \( \mathcal H\) which satisfies the Realizability assumption is called PAC learnable if for any \( \delta, \varepsilon \in (0,1)\) there is a positive integer \( m\) so that if \( S\) is sampled i.i.d. from \( \mathcal D\) and \( |S| \ge m\) then

\( L_{\mathcal D, h^\star}(h_S) \le \varepsilon\)

with probability at least \( 1-\delta\).
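For comparison, it is worth recording the standard sample-size bound for finite classes (this is the usual result in this setting, recorded here for context rather than proved): under the Realizability assumption, taking

\( m \ge \dfrac{\log(|\mathcal H|/\delta)}{\varepsilon} \)

suffices when \( \mathcal H\) is finite, so a finite class is PAC learnable with \( m\) growing only logarithmically in \( |\mathcal H|\).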

To reflect the terminology of \( \mathcal H\) being “learnable,” I will use the term “learner” or “learning algorithm” for the rule or algorithm that finds a model \( h_S\in\mathcal H\) with small error, given the training set \( S\).

It may be that fully implementing the ERM rule on \( \mathcal H\) is computationally expensive, since it requires finding a function that achieves a global minimum of the empirical risk over all of \( \mathcal H\). In such situations it can be beneficial to consider a notion of learnability that is weaker than PAC learnability.

A learner for a class of functions \( \mathcal H\) is called a \( \gamma\)-weak learner if for any \( \delta \in (0,1)\), and any distribution \( \mathcal D\) on \( \mathcal X\), there is a positive integer \( m\) so that if \( S\) is sampled i.i.d. from \( \mathcal D\) and if \( \mathcal H\) is realizable1, then whenever \( |S| \ge m\) the learner will determine \( g_S\) so that

\( L_{\mathcal D, h^\star}(g_S) \le 1/2 - \gamma\)

with probability at least \( 1-\delta\). The hypothesis class \( \mathcal H\) is called \( \gamma\)-weak learnable if there is a \( \gamma\)-weak learner for \( \mathcal H\).

How can \( \gamma\)-weak learners be understood? First, note that the learned function \( g_S\) need not come from \( \mathcal H\). So there is another hypothesis class (possibly not containing the true label function) from which \( g_S\) is drawn.

Secondly, there is an easily described model (strictly speaking, it is not a function) that will have loss equal to 1/2: predict the label for \( {\bf x}\in\mathcal X\) to be +1 or -1 at random, each with equal likelihood (in other words, flip a coin to guess the label). Since each point in \( \mathcal X\) has a correct label in \( \{\pm1\}\), the process has a 50% chance of being incorrect on \( {\bf x}\).

Remember, the loss of a model is the probability that it will be wrong. So a \( \gamma\)-weak learner will produce a model that is a bit better than the flip of a coin2. That doesn’t sound all that great. However, a procedure to find such a thing might be much less expensive computationally. Furthermore, there is a way to combine multiple models coming from a weak learning algorithm in order to improve the accuracy.
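To make the coin-flip baseline concrete, here is a minimal simulation sketch in Python (the distribution and the stand-in labeling rule are arbitrary choices made only for this illustration):

import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x = rng.random(m)                      # sample points; any distribution works
y_true = np.where(x > 0.3, 1, -1)      # an arbitrary stand-in label function
y_guess = rng.choice([-1, 1], size=m)  # coin-flip predictions
print(np.mean(y_guess != y_true))      # empirical loss; close to 0.5

A \( \gamma\)-weak learner is required to beat this 0.5 by the fixed margin \( \gamma\), with high probability.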

1. Recall that the Realizability assumption on \( \mathcal H\) means there is an \( h^\star \in \mathcal H\) that has zero true loss; that is, it correctly labels all data.
2. The number \( \gamma\) is positive, so \( 1/2 - \gamma\) means a lower than 50% chance of being wrong.


Getting a weak learner from a more basic hypothesis class. For the rest of this post, and the next post, I will work with a particular class of functions as a basic hypothesis class. Following the ERM rule in this basic class will be an example of a weak learner for a different hypothesis class. Given data vectors \( {\bf x} \), where \(\mathcal X \subseteq \mathbb R^d\) for some dimension \( d\), the class of functions consists of all functions \(f:\mathbb R^d \to \{1, -1\}\) where, for some \(b\in\mathbb R\), some \(\theta\in\{1,-1\}\), and some \(j\in\{1,2,\ldots,d\}\), you have

\( f({\bf x})=\text{sign}(\theta(b - x_j)).\)

A decision stump (for d=2), with non-zero loss. Blue points have label 1, red points have label -1. The function values are 1 to the left of b and -1 to the right of b.

Call this class of functions \(\mathcal B^d\). So, a function from \( \mathcal B^d\) is determined by \( b,\theta,\text{ and }j\), and it acts by taking a \( d\)-dimensional vector to the sign of \( \theta(b-x_j)\), with \( x_j\) being the \( j^{th}\) coordinate of the vector. Just to determine a convention for what to do if \(b = x_j\), let’s say that \(\text{sign}(0) = 1\).
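As code, a decision stump is just a few lines. Here is a minimal sketch (the function and argument names are mine, not from any library), using the \( \text{sign}(0)=1\) convention:

import numpy as np

def stump_predict(x, b, theta, j):
  # x: array of shape (m, d); returns labels in {1, -1}
  # computes sign(theta*(b - x_j)) with the convention sign(0) = 1
  vals = theta * (b - x[:, j])
  return np.where(vals >= 0, 1, -1)

For instance, with \( \theta=1\) and \( j\) fixed, the points whose \( j^{th}\) coordinate is at most \( b\) get the label +1.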

The functions in \( \mathcal B^d\) are called decision stumps. Considering the post Binary Classification, decision stumps are special types of half-space models3. Their special type makes them very quick to compute from a training set.

3. The normal vector is \( {\bf w} = (0,\ldots,0,-\theta,0,\ldots,0)\) (where \( -\theta\) appears in position \( j\)) and the shift/bias of the hyperplane is \( \theta\cdot b\).


A \( \gamma\)-weak learnable hypothesis class using decision stumps. In this section, consider the class of functions \( \mathcal H^d\) whose members take the value 1 on a bounded interval of one coordinate and -1 elsewhere (or vice-versa). That is, for numbers \( a,b\) with \( a\le b\), one coordinate, say the \( j^{th}\), of vectors in \( \mathbb R^d\), and \( \theta=\pm1\), define

\( w_{a,b,\theta,j}({\bf x}) = \begin{cases}\theta, & \text{if }a\le x_j\le b \\ -\theta, & \text{otherwise}\end{cases}\)

Then,

\( \mathcal H^d = \{w_{a,b,\theta,j}\ |\ a\le b,\ \theta\in\{1,-1\},\ j\in\{1,2,\ldots,d\}\}.\)
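A member of \( \mathcal H^d\) written as code might look like the following sketch (again, the names are hypothetical):

import numpy as np

def interval_predict(x, a, b, theta, j):
  # x: array of shape (m, d); value theta when a <= x_j <= b, else -theta
  inside = (x[:, j] >= a) & (x[:, j] <= b)
  return np.where(inside, theta, -theta)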

Theorem. The ERM rule on \( \mathcal B^d\) is a \( \gamma\)-weak learner for \( \mathcal H^d\), for any \( 0<\gamma<\frac16\).

Idea of Proof. Let \( \mathcal X\) be a data set with correct label function \( h^\star\in \mathcal H^d\) and with probability distribution \( \mathcal D\). Let \( h^\star = w_{a^\star, b^\star, \theta^\star, j^\star}\). Then, regardless of \( \mathcal D\), at least one of the regions in \( \mathbb R^d\) listed below does not contain more than 1/3 of \( \mathcal X\) (according to the measure determined by \( \mathcal D\)):

\( \{x_j < a^\star\} \qquad \{a^\star \le x_j \le b^\star\} \qquad \{b^\star < x_j\}\)

For ease of discussion, say it is the third region, \( \{b^\star < x_j\}\). Note that a decision stump can match \( h^\star\) off of that region: taking \( j=j^\star\), \( \theta=-\theta^\star\), and a threshold just below \( a^\star\) gives a function in \( \mathcal B^d\) that agrees with \( h^\star\) everywhere except on \( \{b^\star < x_j\}\). So, for any \( \varepsilon>0\), a sufficiently large sample \( S\) will determine a decision stump \( h\) from \( \mathcal B^d\) that has probability less than \( \varepsilon\) of being wrong on the other two regions4. Thus, \( L_{\mathcal D,h^\star}(h) \le 1/3 + \varepsilon\).
Choosing \( \varepsilon < 1/6\) makes \( \gamma = 1/6 - \varepsilon\) be such that \( L_{\mathcal D,h^\star}(h) \le 1/3 + \varepsilon = 1/2 - \gamma\).      \( \square\)

4. There are some details missing here. In particular, this guarantee comes from \( \mathcal B^d\) being PAC learnable and we have yet to explain why that is true. It is a consequence of the VC dimension (which equals 2 in this case), a subject of a future post.


Implementation of ERM on stumps. Given a training set, or sample set of points \( S \subset \mathbb R^d\), how do you carry out ERM on \( \mathcal B^d\)? And can you efficiently find the \( h_S\in\mathcal B^d\) with minimal empirical loss? In this section, I describe a procedure to achieve an even more general goal, in anticipation of the next post where AdaBoost is described.

Example: horizontal lines at some of the averages in \( B_2\).

As an example, consider the figure to the right. The red and blue points are the sample points in \(S\), and the blue points have (correct) label \(1\), while red points have label \(-1\). If I use the second coordinate for decision stumps (meaning \(j=2\)), then decision stumps would give one label in \(\{1,-1\}\) to the points above the horizontal line \( y=b\), and the other label to points below it (which label depends on \( \theta\)). For this example, there will be errors on some points, no matter where the horizontal line is (no matter what \(b\) is; some possible horizontal lines have been drawn). If vertical lines are used (\(j=1\)), then there will also be some error, no matter where the vertical line is. The ERM rule is to find, among all vertical and horizontal lines (since \(d=2\)), the one with minimal empirical risk (that is wrong the least often)7.

I would like to express the empirical risk in a way that will be useful later, for the purpose of combining different decision stumps into one model function. To do so I will use the notation \(\mathbf{1}_{[{\bf x}_{i,j}>b]}\) as shorthand for6

\( \mathbf{1}_{[{\bf x}_{i,j}>b]} = \begin{cases}1, & {\bf x}_{i,j}>b\text{ is true;} \\ 0, & {\bf x}_{i,j}>b\text{ is false.}\end{cases}\)

As an example of using this notation, \(\displaystyle\sum_{{\bf x}_i\in S}\mathbf{1}_{[{\bf x}_{i,j}>b]}\) would be equal to the number of points \({\bf x}_i\) in \(S\) for which the \(j^{th}\) coordinate, \({\bf x}_{i,j}\), is bigger than \(b\).
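In numpy, such indicator sums are one-liners, and that is how the code at the end of this post computes losses. A small sketch (the arrays are placeholders for this example):

import numpy as np

x = np.array([[0.2, 0.9], [0.7, 0.1], [0.4, 0.5]]) # three points in R^2
j, b = 0, 0.3
count = np.sum(x[:, j] > b) # number of points with x_{i,j} > b; here 2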

If \(\theta=1\), and \(S = \{{\bf x}_1, \ldots, {\bf x}_m\}\), then the decision stump determined by \(\theta, j\), and \(b\) will evaluate to \(-1\) on \({\bf x}_i\) if and only if \({\bf x}_{i,j} > b\). (Recall how the functions in \(\mathcal B^d\) are defined above.) Therefore, the empirical risk of this decision stump is equal to

\(\displaystyle\sum_{i : y_i=1}\frac{1}{m}\mathbf{1}_{[{\bf x}_{i,j}>b]} + \sum_{i : y_i=-1}\frac{1}{m}\mathbf{1}_{[{\bf x}_{i,j}\leq b]}\).8

If \(\theta=-1\), then the values of the decision stump switch: it gives \(1\) on \({\bf x}_i\) if and only if \({\bf x}_{i,j} \geq b\). And so, when \(\theta=-1\), the empirical risk is

\(\displaystyle\sum_{i : y_i=1}\frac{1}{m}\mathbf{1}_{[{\bf x}_{i,j}<b]} + \sum_{i : y_i=-1}\frac{1}{m}\mathbf{1}_{[{\bf x}_{i,j}\geq b]}\).

The example training set at the beginning of the section has 26 points, and a decision stump with minimum empirical risk is given by \(\theta=-1, j=2,\) and \(b \approx 0.45\). There are four -1 labeled points (red) that are above this horizontal line. This means that \(\sum_{i : y_i=-1}\frac{1}{m}\mathbf{1}_{[{\bf x}_{i,j}\geq b]} = 4/26\). There is just one +1 labeled point (blue) that is below the line, so \(\sum_{i : y_i=1}\frac{1}{m}\mathbf{1}_{[{\bf x}_{i,j}<b]} = 1/26\). And so, the empirical risk of this decision stump is \(5/26\).

Now I will introduce a method to put different weights on the training points. The goal is to make it so that getting an error on a point that has a larger weight will contribute more to the empirical risk. Finding a decision stump that minimizes the weighted empirical risk will then be more likely to give a correct label to the points that have larger weight.

Let \( {\bf p} = (p_1,\ldots,p_m)\) be a probability vector9. A vector entry \(p_i\) will be used as the weight of the contribution made to the loss when the model is incorrect at point \({\bf x}_i\). If \( \theta=1\), and \(j\) is given \((1\le j\le d)\), then define

\( F_j(b) \overset{\text{def}}{=} \displaystyle\sum_{i:y_i=1} p_i\,\mathbf{1}_{[{\bf x}_{i,j}>b]} + \sum_{i:y_i=-1} p_i\,\mathbf{1}_{[{\bf x}_{i,j}\le b]}\).

If the weights are equal, with each \( p_i = 1/m\), then \( F_j(b)\) is the regular empirical risk from before, \(L_S(h_{\theta=1,j,b})\), where \( h_{\theta=1,j,b}\) is the decision stump. A simple adjustment can be made for the case when \( \theta=-1\) by switching the inequalities that compare to \( b\), analogous to what I did with the regular empirical risk above. So to follow the ERM rule it is necessary to find \( \text{argmin}_{\theta,j,b}(F_j(b))\). (That is, find the \( \theta, j,\) and \( b\) that minimize \( F_j(b)\).)
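Translated directly into code, the weighted loss might be the following sketch (the name F and its signature are mine; the \( \theta=-1\) case switches the inequalities, as described above):

import numpy as np

def F(x, y, p, j, b, theta=1):
  # weighted loss of the stump given by (theta, j, b) on points x,
  # labels y, and probability vector p
  if theta == 1:
    wrong = ((y == 1) & (x[:, j] > b)) | ((y == -1) & (x[:, j] <= b))
  else:
    wrong = ((y == 1) & (x[:, j] < b)) | ((y == -1) & (x[:, j] >= b))
  return p[wrong].sum()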

While there are infinitely many possibilities for \(b\), they do not all need to be considered to find \( \text{argmin}_{\theta,j,b}(F_j(b))\). First, sort the \( {\bf x}_i\) in ascending order according to their \( j^{th}\) coordinate; since it does not hurt to change which point in \(S\) is called \({\bf x}_1\), which one is \({\bf x}_2\), and so on, adjust the indices so that \({\bf x}_1, \ldots, {\bf x}_m\) satisfy \({\bf x}_{i,j} < {\bf x}_{i+1,j}\) for every \(1\le i\le m-1\) (assume, for simplicity, that no two points share their \( j^{th}\) coordinate). Now, define

\( B_j = (\text{set of averages of consecutive }{\bf x}_{i,j}) \cup \{{\bf x}_{1,j} - 1\} \cup \{{\bf x}_{m,j} + 1\}\)

In other words, \(B_j\) contains the numbers \(\frac{{\bf x}_{i,j} + {\bf x}_{i+1,j}}{2}\), together with \({\bf x}_{1,j}-1\) and \({\bf x}_{m,j}+1\). Then every value \( F_j(b)\), over all real \( b\), is already achieved at some \( b \in B_j\). There are \( m+1\) numbers in \( B_j\) so, for each \(\theta\) and each \(j\), only \( m+1\) values of \( F_j\) need to be checked.

For each \( b\), there are \( m\) terms to be added, and you do this for each \( j = 1,\ldots,d\), and for \( \theta=1,-1\). Then you take the \( \theta,j,\) and \( b\) that give the smallest \( F_j(b)\). The runtime for this description of ERM on decision stumps is \( \mathcal O(dm^2)\).
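Putting the pieces together, a direct \( \mathcal O(dm^2)\) search could be sketched as below (it reuses the hypothetical F helper from above, and assumes the \( j^{th}\) coordinates are distinct, as in the text):

import numpy as np

def erm_naive(x, y, p):
  # exhaustive search over theta, j, and the candidate thresholds B_j
  m, d = x.shape
  best = (None, None, None, np.inf) # (theta, j, b, loss)
  for theta in (-1, 1):
    for j in range(d):
      c = np.sort(x[:, j])
      B = np.concatenate(([c[0] - 1], (c[:-1] + c[1:]) / 2, [c[-1] + 1]))
      for b in B:
        loss = F(x, y, p, j, b, theta) # the F sketch defined earlier
        if loss < best[3]:
          best = (theta, j, b, loss)
  return best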

More efficient implementation. Suppose that \( b\in B_j\), and \( b'\) is the next larger element in \( B_j\). Say \( {\bf x}_k\) is the one element of \( S\) between \( b\) and \( b'\) (so \( b < {\bf x}_{k,j} < b'\)). Then, if \( y_k=1\) (and \( \theta=1\)), the weighted loss using \( b\) is \( p_k\) greater than when using \( b'\). This is because \( {\bf x}_k\) is given label \(-1\), by the decision stump, when using \( b\), and label \(+1\) when using \( b'\). The roles reverse if \( y_k=-1\). So, when \(\theta=1\), we get

\( F_j(b') = F_j(b) - p_k\mathbf{1}_{[y_k=1]} + p_k\mathbf{1}_{[y_k=-1]} = F_j(b) - y_kp_k\).

If \(\theta=-1\) instead, then \(F_j(b') = F_j(b) + y_kp_k\). So, in general, we have

\(F_j(b') = F_j(b) - \theta y_kp_k\)

when \(b\) and \(b'\) are consecutive numbers in \(B_j\). Once \( F_j(b)\) is calculated for the smallest \(b\), the other values \( F_j(b')\) for that \( j\) follow recursively, each in constant time, by using the equation. Implemented this way, the process of minimizing the values \( F_j(b)\) runs in linear time. Good sorting algorithms run in at most \( \mathcal O(m\log m)\) time10, so implementing ERM on decision stumps can be done in \( \mathcal O(dm\log m)\) time.

The results of ERM over decision stumps with two different probability vectors.

I have placed below some code for this implementation of decision stumps in Python. The figure to the right shows the same example dataset as above. When uniform weights are used (\( {\bf p} = (1/26,\ldots,1/26)\)), the decision stump with minimal loss has \( \theta=-1, j = 2\), and \( b \approx 0.447\) (the dashed line). When the two blue points farthest to the right are weighted very heavily, the minimal loss decision stump has \( \theta=-1, j=1\), and \( b\approx 0.788\) (the solid line).

import numpy as np

def erm_ds(S, P):
  # S: two lists, one of X vectors and one of Y labels
  # P: probability vector, one weight per training point
  x = np.array(S[0])
  y = np.array(S[1])
  d = len(x[0]) # dimension
  min_params = [] # best (theta, j+1, b, loss) for each theta and j
  for t in range(2):
    theta = 2*t - 1 # theta=-1 when t==0, theta=1 when t==1
    for j in range(d):
      Flist = [] # list of losses for stumps
      # next 3 lines: sort data in coord j; make labels, P match
      sorted_i = x[:, j].argsort()
      xsort, ysort = x[sorted_i], y[sorted_i]
      Psort = np.array(P)[sorted_i]
      # stump with lowest b mislabels exactly the theta-labeled pts.
      F = Psort[ysort == theta].sum()
      Flist.append(F)
      for i in range(len(x)):
        # efficient implementation: compute this F from the last F
        Flist.append(Flist[i] - theta*ysort[i]*Psort[i])
      min_i = int(np.argmin(Flist))
      if min_i == 0:
        b = xsort[0, j] - 1
      elif min_i == len(x):
        b = xsort[-1, j] + 1
      else:
        b = 0.5*(xsort[min_i - 1, j] + xsort[min_i, j])
      # store a 4-tuple for each theta and each j
      min_params.append((theta, j+1, b, Flist[min_i]))
  # minimum according to the stored loss values
  lowest_one = np.argmin(np.array(min_params)[:, 3])
  # return (theta, j+1, b, loss) for the min loss stump
  # j+1 is used since the first coordinate is usually called x_1 (not x_0)
  return min_params[lowest_one]
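As a quick usage check, on a made-up toy set (not the 26-point example from the figures):

X = [[0.1, 0.8], [0.3, 0.7], [0.9, 0.2], [0.8, 0.1]]
Y = [1, 1, -1, -1]
P = [0.25, 0.25, 0.25, 0.25]
theta, j, b, loss = erm_ds((X, Y), P)
print(theta, j, b, loss) # here: a stump with loss 0.0 separating the labels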

6. The notation \( {\bf x}_{i,j}\) means the \( j^{th}\) coordinate of the training vector \( {\bf x}_i\). This function is sometimes called the characteristic function of the set where the statement is true.
7. If \( {\bf p}\) consists only of fractions \( 1/m\), then least wrong simply means mislabeling the fewest points. For other probability vectors, the decision stump is penalized more for mislabeling certain points, the ones with larger probability.
8. The notation \(i : y_i = 1\) below the summation sign means that you are summing only over those \({\bf x}_i\) which have \(y_i=1\) as their label. It is similar for \(i : y_i = -1\) on the second summation.
9. A vector with non-negative entries that sum to 1.
10. In some cases, sorting can be done in linear time. This could be such a case (I haven’t thought carefully about it). If so, then ERM on decision stumps can be done in linear time.