Decision stumps are very basic: a single decision stump can only separate data into +1 and -1 labels based on whether a single coordinate is larger or smaller than one number.1 The general idea of boosting is to take some basic models, each given by a weighted version of the ERM rule in their class, and create a new model from them with a linear combination: a sum of the basic models, each multiplied by a coefficient chosen in a certain helpful way. The linear combination (the new model) is often a better model for the data. AdaBoost2 is a particular implementation of this idea.
1. Recall from the last post how a decision stump determines ±1 labels, always splitting the instance (data) points by a hyperplane that is perpendicular to the x_j-axis. We used the notation B_d for decision stumps that worked on data points with d coordinates.
2. (Freund and Schapire 1995). Short for “Adaptive Boosting,” the algorithm adapts to how well it is performing (learns) as it goes along.
AdaBoost. As input the AdaBoost algorithm takes a number of rounds T and a training set S (consisting of data points x_i and labels y_i = ±1, for i = 1,…,m). When the algorithm ends, the number of basic models used in the linear combination will equal the number of rounds. For t between 1 and T, if h_t is the basic model for that round, it will have a coefficient w_t and the new model will be given by
h(x) = sign(w_1 h_1(x) + w_2 h_2(x) + ⋯ + w_T h_T(x)).
The algorithm. I will explain the algorithm using decision stumps for the basic models; however, any hypothesis class could be used in its place. In round 1 the algorithm uses uniform weights in p(1): the vector’s coordinates are p(1)_i = 1/m, for all i.
Round t of the AdaBoost algorithm:
⁍ Obtain basic model h_t from the ERM rule on B_d, using weights p(t).
⁍ Compute the weighted empirical risk3: ε_t = L_S(h_t, p(t)).
⁍ Set w_t = ln(1/ε_t − 1).4
⁍ Update p(t) as follows: p(t+1)_i = p(t)_i exp(−w_t y_i h_t(x_i)) / ∑_{k=1}^m p(t)_k exp(−w_t y_k h_t(x_k)). These are the coordinates of p(t+1) for round t+1.
Return: the list of coefficients and models, (w_t, h_t), t = 1,…,T.
To note a few things: the denominator in the update to p equals the sum of the numerators over all i, which makes p(t+1) a probability vector. Also, if h_t(x_i) = y_i then (since w_t > 0) the exponent is negative, making p_i smaller for that i. If h_t(x_i) = −y_i then the opposite occurs and p_i becomes larger. So the new probability vector has its weight shifted toward the data points that were previously misclassified.
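The update can be checked numerically. Here is a tiny sketch with three points and uniform starting weights, where the stump is right on the first two points and wrong on the third (the numbers are illustrative):

```python
import math

# Tiny numerical check of the reweighting step: three points with uniform
# weights, where the stump is right on the first two and wrong on the third.
p = [1/3, 1/3, 1/3]
w = math.log(2)                 # if eps_t = 1/3, then w_t = ln(1/eps_t - 1) = ln 2
agree = [1, 1, -1]              # the values y_i * h_t(x_i)
numerators = [p_i * math.exp(-w * a) for p_i, a in zip(p, agree)]
Z = sum(numerators)             # the normalizing denominator
p_new = [n / Z for n in numerators]
# p_new is [1/6, 1/6, 2/3]: it still sums to 1, with weight shifted
# toward the misclassified third point.
```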
3. For us, h_t is the decision stump, with some θ, j, and b, that minimizes the expression F_j(b) from the last post, using the entries of p(t) for the weights. The risk ε_t is the value of that minimal F_j(b).
4. Note that w_t is larger when ε_t is smaller. Also, there is an assumption that 0 < ε_t < 1/2. If the error were ε_t = 0, then simply use the stump h_t as the model. The assumption ε_t < 1/2 makes it so that w_t > 0, and comes from assuming that a weak learner is providing the basic model functions.
AdaBoost in Python. Below we provide a block of Python code that will carry out the AdaBoost algorithm.
Some auxiliary functions are needed.
First, erm_ds is the efficient implementation of the ERM rule in B_d, with given weights p, that was discussed last post. The function stump is the following:
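A minimal version of stump, together with the sign helper it uses, might look like this; the (θ, j, b) parameter convention is an assumption based on the last post’s description:

```python
# Sketch of the stump function; the (theta, j, b) convention is an
# assumption carried over from the last post.

def sign(v):
    # Returns -1 for negative inputs and 1 otherwise (including at 0).
    return -1 if v < 0 else 1

def stump(f, x):
    # f = (theta, j, b): predict theta when x[j] > b, and -theta otherwise.
    theta, j, b = f
    return theta * sign(x[j] - b)
```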
I wrote the sign function so it returns -1 if its input is negative, and returns 1 otherwise, including when the input is 0. Finally, a function L_S is defined to compute the weighted loss of a decision stump:
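One possible reconstruction, consistent with the description of its inputs (the stump convention being an assumption):

```python
# Sketch of L_S; the (theta, j, b) stump convention is an assumption.

def sign(v):
    return -1 if v < 0 else 1

def stump(f, x):
    theta, j, b = f
    return theta * sign(x[j] - b)

def L_S(f, S, P):
    # f: triple of stump parameters; S = [points, labels]; P: probability vector.
    loss = 0.0
    for i in range(len(S[0])):
        # add P[i] exactly when the stump mislabels data point i
        if stump(f, S[0][i]) != S[1][i]:
            loss += P[i]
    return loss
```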
There are three inputs to L_S: a triple of parameters f that determines a decision stump; a list S with two entries, S[0] being the list of data points for training and S[1] the corresponding list of labels; and a list P that is the probability vector. Think about how the labels are ±1, and the stump gives the model’s answer at a point. This means the for loop goes through the data points and adds 0 to loss if the stump is right, and adds P[i] to loss otherwise.
With these preliminaries, the following code implements AdaBoost:
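Here is a self-contained sketch of that implementation. The brute-force erm_ds below is only a simple stand-in for the efficient version from the last post, and the helper conventions are assumptions:

```python
import math

def sign(v):
    # -1 for negative inputs, 1 otherwise (including 0)
    return -1 if v < 0 else 1

def stump(f, x):
    # f = (theta, j, b): assumed decision-stump convention
    theta, j, b = f
    return theta * sign(x[j] - b)

def L_S(f, S, P):
    # weighted empirical risk of the stump f on the sample S
    return sum(P[i] for i in range(len(S[0]))
               if stump(f, S[0][i]) != S[1][i])

def erm_ds(S, P):
    # Brute-force ERM over B_d: try every coordinate and every threshold
    # halfway between consecutive data values (plus one below all of them).
    d = len(S[0][0])
    best, best_loss = None, float("inf")
    for j in range(d):
        vals = sorted(x[j] for x in S[0])
        thresholds = [vals[0] - 1.0] + [(a + c) / 2 for a, c in zip(vals, vals[1:])]
        for b in thresholds:
            for theta in (1, -1):
                loss = L_S((theta, j, b), S, P)
                if loss < best_loss:
                    best, best_loss = (theta, j, b), loss
    return best

def adaboost(S, T):
    m = len(S[0])
    P = [1.0 / m] * m                        # round 1: uniform weights
    model = []
    for _ in range(T):
        f = erm_ds(S, P)                     # basic model for this round
        eps = L_S(f, S, P)                   # weighted empirical risk
        if eps == 0:                         # a perfect stump: just use it
            return [(1.0, f)]
        w = math.log(1 / eps - 1)            # w_t = ln(1/eps_t - 1)
        # shift probability toward the misclassified points, then normalize
        new_P = [P[i] * math.exp(-w * S[1][i] * stump(f, S[0][i]))
                 for i in range(m)]
        Z = sum(new_P)
        P = [p / Z for p in new_P]
        model.append((w, f))
    return model

def predict(model, x):
    # sign of the weighted vote of the stumps
    return sign(sum(w * stump(f, x) for w, f in model))
```

On a training set that one stump separates, adaboost returns just that stump; otherwise it returns T weighted stumps to be combined by predict.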
Example. Consider the artificially generated training set of points in the plane, displayed on the right5. This data set cannot be separated with one decision stump. The one with minimal loss, using uniform weights, has j = 2, b ≈ 0.78; it mislabels 8 blue points whose 2nd coordinate is less than 0.78. In this section, boosting stumps will give a model for the data set.
To begin, if a person had been given this data and were asked to draw splitting hyperplanes for a set of stumps that would separate the data (since we are in R^2, that means vertical or horizontal lines), they might draw something like the following.
At first glance there appear to be 4 decision stumps involved (for the four linear parts of the green boundary), but the boosted model doesn’t work out quite so simply6. The first 4 decision stumps give a decision boundary that has only three linear parts. After 6 rounds we get five parts to the decision boundary. A depiction of how that boosted model with 6 stumps labels points is shown in the next figure.
To be clear, I estimated the gray and green lines myself (I actually drew it on my iPad), with no reference to the algorithm’s output. The coloring of the points on the right comes from the model that AdaBoost produces after 6 rounds. The training error of that model is approximately 0.097 (meaning it is right on more than 90% of the training data – note there are just 3 blue points on the top-left being mislabeled).
If we wanted to try to capture those extra points, we could do an extra round in AdaBoost. The result is shown in the next figure.
While the result after 7 rounds looks qualitatively more like our initial guess, a more careful look shows that it is mislabeling more training points than the previous one; the training error here is approximately 0.129. Recall, the training set was fairly small and so AdaBoost doesn’t have much to work with at this point. However, one lesson we can take is that it is possible for training error to increase with more rounds of AdaBoost.
5. Disclaimer: The number of data points for this example is admittedly low for the complexity of the boosting procedure.
6. In fact, assuming different weights, you need at least 3 decision stumps in the linear combination just to get a result that is different than one decision stump alone. Such is the reality of the sign function.
Second example — finding a high contrast edge. This example will use AdaBoost on decision stumps to complete a task on images. The images in the data set are grayscale images – each pixel has a value from 0 to 255 (inclusive), the pixel being brighter when closer to 255.
Given an image, the task will be to determine if (somewhere in the image) there is a roughly horizontal edge with very bright pixels above it and dark pixels below it. Call this a skyline curve. To the right is an image with a very strong skyline curve appearing at the top of the dog’s head. So the goal is to decide if there is a skyline curve that stands out.7
The image files come from an old Dogs vs. Cats challenge at the website kaggle.com. The original competition was to make an algorithm that identifies whether each picture is of a cat or dog.
The original training set has 25,000 images, each labeled ‘cat’ or ‘dog.’ Since our task is different than the original challenge, we will have to determine our own labels. Due to time constraints, this will require the use of a smaller training set.8
Making vectors out of the images. There are many ways to make a vector from a grayscale image. A common way is to treat each pixel value as a coordinate and order these in some way. For example, if the image is m pixels by n pixels, then you could make the first n coordinates be the values from the first row of pixels, then go to the n pixels in the next row, etc. In the end you would have a vector in Rmn.
The above process for getting a vector is not well-suited to this task, for a few reasons.9, 10, 11 Instead, I’ll adapt an idea for image recognition tasks that uses rectangles, and certain sums of the pixel values in those rectangles, to create coordinates for the vector.12
More precisely, divide the height (in pixels) of an image by 15 and call the result α. Divide the width (in pixels) of an image by 75 and call the result β.13 For an α×β rectangle, let’s call it r, calculate the sum of the pixel values of the top-half of r and then subtract the sum of the pixel values of the bottom-half of r. This gives a number, call it tb(r), the top-bottom score of r.
For example, if r is the 6×3 rectangle with pixel values shown to the right, then tb(r) = 1739. Remember, the task is to find a skyline curve. If a skyline curve passes through the middle of r, then tb(r) should be very large relative to the scores of other rectangles (high values on top, minus close-to-zero values on the bottom).
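A minimal sketch of the top-bottom score, with the rectangle represented as a plain list of pixel-value rows (the function name and the assumption of an even number of rows are my choices):

```python
def tb(r):
    # r is a rectangle of pixel values, given as a list of rows.
    # Top-bottom score: sum of the top half minus sum of the bottom half
    # (assuming an even number of rows, as with the alpha-by-beta rectangles).
    half = len(r) // 2
    top = sum(sum(row) for row in r[:half])
    bottom = sum(sum(row) for row in r[half:])
    return top - bottom
```

A bright-on-top, dark-on-bottom rectangle gets a large positive score, while a uniform rectangle scores 0.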
The idea is to focus on when tb(r) is large; however, the particular tb(r) scores don’t matter a great deal. Rather it matters where the rectangles with highest score are in relation to each other. If a “roughly horizontal” edge has decent length, there would be adjacent rectangles with high tb(r) along that edge.
The vector: Take the 10 rectangles14 that have the largest value of tb(r); call them r_1, r_2, …, r_10. For each r_i, measure how far it is from the others (among those 10) in some way.15 Call that measure m(r_i). Order these measures in ascending order, and the vector will be the ascending list of those measures. So each vector will have coordinates that do not decrease to the right, each coordinate being some m(r_i).16 This produces a vector in R^10 for each image. To the right is one of the images in the data set, with the 10 highest-scoring rectangles drawn in red. Its vector: approximately (1.7, 1.9, 1.9, 2.4, 2.9, 3.1, 3.1, 3.8, 7.2, 33.5).
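A rough sketch of how such a vector might be computed from the rectangles’ grid positions follows. The averaging over the four closest neighbors and the extra vertical penalty follow the description in footnote 15, but the exact penalty factor is an assumption and the per-image normalizing constant is omitted:

```python
def m(ri, others, v_penalty=1.5):
    # ri and the entries of others are (row, col) grid positions of rectangles.
    # Taxicab distance, with the vertical component penalized a bit more
    # (the 1.5 factor is a guess); average over the four closest neighbors.
    dists = sorted(v_penalty * abs(ri[0] - s[0]) + abs(ri[1] - s[1])
                   for s in others)
    return sum(dists[:4]) / 4

def image_vector(rects):
    # rects: the grid positions of the 10 highest-scoring rectangles.
    measures = [m(r, [s for s in rects if s != r]) for r in rects]
    return sorted(measures)   # coordinates do not decrease to the right
```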
With a labeled subset of the images (a person having decided if there is a skyline curve), use part of the labeled data for training and the other part to test the resulting model. The training is done on the images’ vectors in R^10, by applying AdaBoost with decision stumps in B_10 as the base models.
I generated training and test sets for the Dogs vs. Cats images by using numpy.random.choice to randomly select a file. I would store which one it was. Then I displayed it using matplotlib.pyplot and would visually decide about a strong skyline curve and, accordingly, add a 1 or -1 to a list of labels. Having to do this semi-manually necessitated a rather small training set (80 images) and test set (20 images).
The results were okay, considering the training data! I iterated through 25 rounds, recording the training error of the AdaBoost output for each (honest empirical risk, not weighted by p). To the right is the graph of the training errors, the horizontal axis being the number of stumps (rounds) in AdaBoost.
That the error is larger when the number of rounds is even is probably an artifact of the method; note, though, that the error for small odd numbers of rounds is nearly decreasing. Something dramatic happens between 17 and 21 rounds. Given the small amount of training data, my guess is that the model is overfitting at 21 rounds and that the best model is the one with 17 stumps (rounds), with training error around 0.15.
Using the test data, the model with 17 stumps had a test error of about 0.25. The model with 21 stumps had a test error of about 0.35. To visualize the results, I selected 3 images at random from the 25,000 and have displayed them and the rectangles for their vectors below. The captions give the 17-stump model’s output.
7. Disclaimers #1 & #2: The task of finding a particular type of curve or shape in an image, while quite applicable, might be called an ill-defined problem, since it does not have an objectively correct answer. The earlier question of identifying spam email was also ill-defined. Such questions are an important part of data science.
Also, there are rather good algorithms for finding curves in images (though particular kinds of curves can be more of a challenge). These algorithms use different techniques. We might discuss one of them when we get to cluster analysis.
8. Disclaimer #3: With a somewhat small training set there is potential for bad results. This is especially true considering that I am personally labeling the training and test data.
9. The images in the data have differing numbers of pixels. However, to use decision stumps, it is necessary that the vectors all have the same length.
10. The method of shuffling the pixels into coordinates simply doesn’t make sense for how decision stumps work, not for what the end goal of the task is.
11. The images in the data set are hundreds of pixels by hundreds of pixels. So mn is in the tens of thousands; vectors obtained this way would have tens of thousands of coordinates, as many as (or more than) the number of data points. Overfitting would be almost inevitable.
12. An idea like this was proposed by Viola and Jones for the task of determining if a photo was of a human face.
13. The numbers 15 and 75 were a choice. For most images this choice should make it so that an α×β rectangle is at least twice as tall as it is wide. If dividing does not give an integer, then use the floor function (simply using int will work in Python) to get the closest integer below it.
14. The number 10 was a choice; a different choice could well improve results.
15. A measure derived from the taxicab metric on the grid of rectangles was used. The measure penalized vertical taxicab distance a bit more than horizontal; rather than finding the closest r_i, it found the four closest and averaged their taxicab distances to the given r; and it was multiplied by a constant that served to normalize the values between images.
16. We are more concerned with the set of rectangles, rather than their order according to the tb score.