Boosting, AdaBoost, Image recognition

Decision stumps are very basic: one decision stump can only separate data into +1 and -1 labels based on a single coordinate being larger or smaller than one number.1 The general idea of boosting is to take some basic models, each given by a weighted version of the ERM rule in their class, and combine them into a new model via a linear combination: a sum of the basic models, each multiplied by a coefficient chosen in a helpful way. The linear combination (the new model) is often a better model for the data. AdaBoost2 is a particular implementation of this idea.

1. Recall from the last post how a decision stump determines \( \pm 1\) labels, always splitting the instance (data) points by a hyperplane that is perpendicular to the \( x_j\)-axis. We used the notation \( \mathcal B^d\) for decision stumps that work on data points with \( d\) coordinates.
2. (Freund and Schapire 1995). Short for “Adaptive Boosting,” the algorithm adapts to how well it is performing (learns) as it goes along.


AdaBoost. As input the AdaBoost algorithm takes a number of rounds \( T\) and a training set \( S\) (consisting of data points \( {\bf x_i}\) and labels \( y_i = \pm1\), for \( i=1,\ldots,m\)). When the algorithm ends, the number of basic models used in the linear combination will be the number of rounds. For \( t\) between 1 and \( T\), if \( h_t\) is the basic model for that round, it will have a coefficient \( w_t\) and the new model will be given by

\( h({\bf x}) = \text{sign}\left(w_1h_1({\bf x}) + w_2h_2({\bf x}) + \cdots + w_Th_T({\bf x})\right).\)

The algorithm. I will explain the algorithm using decision stumps for the basic models. However, any hypothesis class could be used in its place. In round 1 the algorithm has uniform weights in \( {\bf p}^{(1)}\): so the vector’s coordinates are \( p^{(1)}_i = 1/m\), for all \( i\).

Round t of the AdaBoost algorithm:

⁍ Obtain basic model \( h_t\) from the ERM rule on \( \mathcal B^d\), using weights \( {\bf p}^{(t)}\).
⁍ Compute the weighted empirical risk3: \( \varepsilon_t = L_S(h_t, {\bf p}^{(t)})\)
⁍ Set \( w_t = \ln\left(\frac{1}{\varepsilon_t} - 1\right)\).4
⁍ Update \( {\bf p}^{(t)}\) as follows: \( {\displaystyle p^{(t+1)}_i = \frac{p^{(t)}_i \text{exp}({-w_ty_ih_t({\bf x_i})})}{\sum_{k=1}^m p^{(t)}_k \text{exp}({-w_ty_kh_t({\bf x_k})})} }.\)
These are the coordinates of \( {\bf p}^{(t+1)}\) for round \( t+1\).

Return: the list of weights and models, \( (w_t, h_t),\ t = 1,\ldots, T\).

A few things to note: the denominator in the update to \( {\bf p}\) equals the sum of the numerators across all \( i\), which makes \( {\bf p}^{(t+1)}\) a probability vector. Also, if \( h_t({\bf x_i}) = y_i\) then (since \( w_t > 0\)) the power of \( e\) is negative, making \( p_i\) smaller for that \( i\). If \( h_t({\bf x_i}) = -y_i\) then the opposite occurs and \( p_i\) becomes larger. So the new probability vector has its weight shifted towards the data points that were previously misclassified.

3. For us, \( h_t\) is the decision stump, with some \( \theta, j,\) and \( b\), that minimizes the expression \( F_j(b)\) from the last post, using the entries of \( {\bf p}^{(t)}\) for the weights. The risk \( \varepsilon_t\) is the value of that minimal \( F_j(b)\).
4. Note that \( w_t\) is larger when \( \varepsilon_t\) is smaller. Also, there is an assumption that \( 0 < \varepsilon_t < 1/2\). If the error \( \varepsilon_t = 0\), then simply use the stump \( h_t\) for the model. The assumption \( \varepsilon_t < 1/2\) makes it so that \( w_t > 0\), and comes from assuming that a weak learner is providing the basic model functions.


AdaBoost in Python. Below we provide a block of Python code that will carry out the AdaBoost algorithm.

Some auxiliary functions are needed.
First, erm_ds is the efficient implementation of the ERM rule in \( \mathcal B^d\), with given weights \( {\bf p}\), that was discussed last post. The function stump evaluates a decision stump at a point; a minimal sketch, assuming the stump’s parameters are packed as a triple f = (theta, j, b) (orientation \( \theta = \pm1\), coordinate \( j\), threshold \( b\)), is the following:
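```python
def stump(f, x):
    # evaluate the decision stump with parameters f = (theta, j, b) at the point x
    # (the convention that the side above b gets label theta is an assumption here)
    theta, j, b = f
    return theta * sign(x[j] - b)
```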
I wrote the sign function so it returns -1 if its input is negative, and returns 1 otherwise, including when the input is 0. Finally, a function L_S is defined to compute the weighted loss of a decision stump. A sketch of both:
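```python
def sign(a):
    # -1 for negative input; +1 otherwise (including when a is 0)
    return -1 if a < 0 else 1

def L_S(f, S, P):
    # weighted empirical risk of the stump with parameters f:
    # add P[i] whenever the stump mislabels the i-th training point
    loss = 0
    for i in range(len(S[0])):
        if stump(f, S[0][i]) != S[1][i]:
            loss += P[i]
    return loss
```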
There are three inputs to L_S: a triple of parameters f that determines a decision stump; a list S which has two entries, S[0] is the list of data points for training and S[1] is the corresponding list of labels; and a list P that is the probability vector. Think about how the labels are \( \pm1\), and the stump gives the model’s answer at a point. This means the for loop goes through the data points and adds 0 to loss if the stump is right, and adds P[i] to loss otherwise.

With these preliminaries, the following sketch implements AdaBoost. It assumes erm_ds is called as erm_ds(S, P) and returns the parameter triple of the loss-minimizing stump; the helper boosted_model, which evaluates the returned linear combination, is added for convenience:
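```python
import math

def adaboost(S, T):
    # S = [points, labels]; T = number of rounds
    m = len(S[0])
    P = [1.0 / m] * m                            # round 1: uniform weights
    models = []
    for t in range(T):
        f = erm_ds(S, P)                         # basic model h_t from the weighted ERM rule
        eps = L_S(f, S, P)                       # weighted empirical risk epsilon_t
        w = math.log(1.0 / eps - 1.0)            # coefficient w_t (assumes 0 < eps < 1/2)
        # shift weight toward the points that h_t misclassified
        P = [P[i] * math.exp(-w * S[1][i] * stump(f, S[0][i])) for i in range(m)]
        total = sum(P)
        P = [p / total for p in P]
        models.append((w, f))
    return models

def boosted_model(models, x):
    # h(x) = sign(w_1 h_1(x) + ... + w_T h_T(x))
    return sign(sum(w * stump(f, x) for (w, f) in models))
```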


A generated example for trying out AdaBoost on decision stumps.

Example. Consider the artificially generated training set of points in the plane, displayed on the right5. This data set cannot be separated with one decision stump. The one with minimal loss, using uniform weights, has \( j=2, b\approx 0.78\); it mislabels 8 blue points that have 2nd coordinate less than 0.78. In this section, boosting stumps will give a model for the data set.

To begin, if a person had been given this data and were asked to draw splitting hyperplanes for a set of stumps that would separate the data (since we are in \( \mathbb R^2\), that means vertical or horizontal lines), they might draw something like the following.

A boundary that might be drawn, if trying to make it out of pieces of vertical and horizontal lines.

At first glance there appear to be 4 decision stumps involved (for the four linear parts of the green boundary), but the boosted model doesn’t work out quite so simply6. The first 4 decision stumps give a decision boundary that has only three linear parts. After 6 rounds we get five parts to the decision boundary. A depiction of how that boosted model with 6 stumps labels points is shown in the next figure.


How the boosted model, with 6 rounds in AdaBoost, labels points. The drawn boundary comes from coordinates and biases (j and b) used in the stumps.

To be clear, I estimated the gray and green lines myself (I actually drew them on my iPad), with no reference to the algorithm’s output. The coloring of the points on the right comes from the model that AdaBoost produces after 6 rounds. The training error of that model is approximately 0.097 (meaning it is right on more than 90% of the training data; note there are just 3 blue points on the top-left being mislabeled).

If we wanted to try to capture those extra points, we could do an extra round in AdaBoost. The result is shown in the next figure.

Labeling of points with 7 rounds in AdaBoost. The number of linear parts to the boundary is less than with 6 rounds (an effect of the particular coefficients \( w_t\)).

While the result after 7 rounds looks qualitatively more like our initial guess, a more careful look shows that it is mislabeling more training points than the previous one; the training error here is approximately 0.129. Recall that the training set was fairly small, so AdaBoost doesn’t have much to work with at this point. However, one lesson we can take is that it is possible for training error to increase with more rounds of AdaBoost.

5. Disclaimer: The number of data points for this example is admittedly low for the complexity of the boosting procedure.
6. In fact, assuming different weights, you need at least 3 decision stumps in the linear combination just to get a result that is different from one decision stump alone. Such is the reality of the sign function.


Second example — finding a high contrast edge.  This example will use AdaBoost on decision stumps to complete a task on images. The images in the data set are grayscale images – each pixel has a value from 0 to 255 (inclusive), the pixel being brighter when closer to 255.

Image with a skyline curve, at top of the dog’s head.

Given an image, the task will be to determine if (somewhere in the image) there is a roughly horizontal edge with very bright pixels above it and dark pixels below it. Call this a skyline curve. To the right is an image with a very strong skyline curve appearing at the top of the dog’s head. So the goal is to decide if there is a skyline curve that stands out.7

The image files come from an old Dogs vs. Cats challenge at the website kaggle.com. The original competition was to make an algorithm that identifies whether each picture is of a cat or dog.

Example image from Dogs vs. Cats challenge.

The original training set has 25,000 images, each labeled ‘cat’ or ‘dog.’ Since our task is different from the original challenge, we will have to determine our own labels. Due to time constraints, this will require the use of a smaller training set.8

Making vectors out of the images. There are many ways to make a vector from a grayscale image. A common way is to treat each pixel value as a coordinate and order these in some way. For example, if the image is \( m\) pixels by \( n\) pixels, then you could make the first \( n\) coordinates be the values from the first row of pixels, then go to the \( n\) pixels in the next row, etc. In the end you would have a vector in \( \mathbb R^{mn}\).
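For instance, with numpy (assuming img is an \( m\times n\) array of pixel values) the flattening is one line:

```python
import numpy as np

# img: an m x n numpy array of grayscale pixel values (assumed given)
v = img.reshape(-1)   # row-major: the first row's n pixels, then the next row's, etc.
```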

The above process for getting a vector is not well-suited to this task, for a few reasons.9, 10, 11 Instead, I’ll adapt an idea for image recognition tasks that uses rectangles, and certain sums of the pixel values in those rectangles, to create coordinates for the vector.12

More precisely, divide the height (in pixels) of an image by 15 and call the result \( \alpha\). Divide the width (in pixels) of an image by 75 and call the result \( \beta\).13 For an \( \alpha\times \beta\) rectangle, let’s call it r, calculate the sum of the pixel values of the top-half of r and then subtract the sum of the pixel values of the bottom-half of r. This gives a number, call it tb(r), the top-bottom score of r.

For example, if r is the \( 6\times 3\) rectangle with pixel values shown to the right, then tb(r) = 1739. Remember, the task is to find a skyline curve. If a skyline curve passes through the middle of r, then tb(r) should be very large relative to the scores of other rectangles (high values on top, minus close-to-zero values on the bottom).
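In code, the grid of top-bottom scores might be computed as follows. This is a sketch, assuming img is a 2-D numpy array of pixel values; how the middle row is handled when \( \alpha\) is odd is a choice (here it is skipped):

```python
import numpy as np

def tb_scores(img):
    # tile the image with a 15 x 75 grid of alpha x beta rectangles (see footnote 13)
    alpha = img.shape[0] // 15
    beta = img.shape[1] // 75
    half = alpha // 2
    scores = np.empty((15, 75))
    for u in range(15):
        for v in range(75):
            r = img[u*alpha:(u+1)*alpha, v*beta:(v+1)*beta].astype(int)  # avoid uint8 overflow
            top = r[:half].sum()          # sum of the top-half pixel values
            bottom = r[-half:].sum()      # sum of the bottom-half pixel values
            scores[u, v] = top - bottom   # tb(r)
    return scores
```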

The idea is to focus on when tb(r) is large; however, the particular tb(r) scores don’t matter a great deal. Rather, what matters is where the highest-scoring rectangles are in relation to each other. If a “roughly horizontal” edge has decent length, there would be adjacent rectangles with high tb(r) along that edge.

Grayscale image with 10 highest top-bottom scored rectangles in red.

The vector: Take the 10 rectangles14 that have the largest value of tb(r), call them r1, r2, …, r10. For each ri, measure how far it is from the others (among those 10) in some way; call that measure m(ri).15 Sort these measures in ascending order; the vector is that sorted list, so each vector has coordinates that do not decrease to the right, each coordinate being some m(ri).16 This produces a vector in \( \mathbb R^{10}\) for each image. To the right is one of the images in the data set, the 10 rectangles with highest score being drawn in red. Its vector: approximately  (1.7, 1.9, 1.9, 2.4, 2.9, 3.1, 3.1, 3.8, 7.2, 33.5).
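A sketch of the vector construction, using (per footnote 15) the average taxicab distance to the four closest of the other nine rectangles, with vertical distance penalized a bit more. The penalty factor 1.5 is a guess, and the between-image normalizing constant from footnote 15 is omitted here:

```python
def image_vector(img):
    scores = tb_scores(img)
    # grid positions (row, column) of the 10 largest tb scores
    idx = np.argsort(scores, axis=None)[-10:]
    rects = list(zip(*np.unravel_index(idx, scores.shape)))
    measures = []
    for (u, v) in rects:
        # taxicab distances on the grid of rectangles, vertical steps weighted more
        dists = sorted(1.5 * abs(u - u2) + abs(v - v2)
                       for (u2, v2) in rects if (u2, v2) != (u, v))
        measures.append(sum(dists[:4]) / 4.0)   # average of the four closest
    return np.array(sorted(measures))           # ascending list: a vector in R^10
```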

With a labeled subset of the images (a person having decided if there is a skyline curve), use part of the labeled data for training, and the other part to test the resulting model. The training is done on the images’ vectors in \( \mathbb R^{10}\), by applying AdaBoost with decision stumps in \( \mathcal B^{10}\) as the base models.
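Concretely, assuming the functions above and lists train_images, train_labels, test_images, and test_labels (hypothetical names), the pipeline is a sketch like:

```python
X = [image_vector(img) for img in train_images]   # vectors in R^10
S = [X, train_labels]
models = adaboost(S, 17)                          # e.g., 17 rounds

# test error: the fraction of test images the boosted model mislabels
wrong = sum(boosted_model(models, image_vector(img)) != y
            for img, y in zip(test_images, test_labels))
print(wrong / len(test_images))
```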

I generated training and test sets for the Dogs vs. Cats images by using numpy.random.choice to randomly select a file, storing which one it was. Then I displayed it using matplotlib.pyplot and visually decided whether there was a strong skyline curve, accordingly adding a 1 or -1 to a list of labels. Having to do this semi-manually necessitated a rather small training set (80 images) and test set (20 images).
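That labeling loop might look like the following sketch, where filenames (the list of image paths) and the y/n prompt are my own stand-ins:

```python
import numpy as np
import matplotlib.pyplot as plt

# filenames: list of paths to the 25,000 images (hypothetical name)
chosen = np.random.choice(filenames, size=100, replace=False)
labels = []
for fname in chosen:
    img = plt.imread(fname)
    plt.imshow(img, cmap='gray')
    plt.show()
    ans = input('strong skyline curve? (y/n) ')   # decide by eye
    labels.append(1 if ans == 'y' else -1)
```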

The training error for 25 different models coming from AdaBoost. The horizontal axis is the number of rounds.

The results were okay, considering the training data! I iterated through 25 rounds, recording the training error of the AdaBoost output for each (honest empirical risk, not weighted by \( {\bf p}\)). To the right is the graph of the training errors, the horizontal axis being the number of stumps (rounds) in AdaBoost.

That the error is larger when the number of rounds is even is probably an artifact of the method; but note that the error for small odd numbers of rounds is nearly decreasing. Something dramatic happens between 17 and 21 rounds. Due to the small amount of training data, I would guess that the model is overfitting at 21 rounds and that the best model is the one with 17 stumps (rounds), with training error around 0.15.

Using the test data, the model with 17 stumps had a test error of about 0.25. The model with 21 stumps had a test error of about 0.35. To visualize the results, I selected 3 images at random from the 25,000 and have displayed them and the rectangles for their vectors below. The captions give the 17-stump model’s output.

7. Disclaimers #1 & #2: The task of finding a particular type of curve or shape in an image, while quite applicable, might be called an ill-defined problem, since it does not have an objectively correct answer. The question of spam email was also ill-defined in this sense. Such questions are an important part of data science.
Also, there are rather good algorithms for finding curves in images (though particular kinds of curves can be more of a challenge). These algorithms use different techniques. We might discuss one of them when we get to cluster analysis. 

8. Disclaimer #3: With a somewhat small training set there is potential for bad results. This is especially true considering that I am personally labeling the training and test data.
9. The images in the data have differing numbers of pixels. However, to use decision stumps, it is necessary that the vectors all have the same length.
10. The method of shuffling the pixels into coordinates simply doesn’t make sense for how decision stumps work, nor for the end goal of the task.
11. The images in the data set are hundreds of pixels by hundreds of pixels. So \( mn\) is in the tens of thousands: vectors obtained this way would have tens of thousands of coordinates, as many as (or more than) the number of data points. Overfitting would be almost inevitable.
12. An idea like this was proposed by Viola and Jones for the task of determining if a photo was of a human face.
13. The numbers 15 and 75 were a choice. For most images this choice should make it so that an \( \alpha\times\beta\) rectangle is at least twice as tall as it is wide. If dividing does not give an integer, then use the floor function (simply using int will work in Python) to get the closest integer below it.
14. The number 10 was a choice; a different choice could well improve results.
15. A measure derived from the taxicab metric on the grid of rectangles was used. The measure penalized vertical taxicab distance a bit more than horizontal; rather than finding the closest ri, it found the four closest and averaged their taxicab distances to the given r; and it was multiplied by a constant that served to normalize the values between images.
16. We are more concerned with the set of rectangles, rather than their order according to the tb score.