Decision trees

Decision trees are another example of a powerful model function for machine learning. They are often used for binary classification, or for multi-class classification (a classification problem with more than two categorical labels). Similar to how neural networks are capable of approximating any function, if the “size” of a decision tree is unrestrained1 then it can produce an arbitrarily close approximation to any desired classification of points in the input space.

1. And computation time considerations are ignored


Construction of a decision tree. Decision trees are strongly related to decision stumps. In fact, a decision stump is a decision tree with depth 1 (the depth of a tree is defined below). In some circumstances, it is possible to think of a decision tree as a collection of nested decision stumps, applied to smaller and smaller subsets of the data.

Often when decision trees are discussed, the data is converted to points with binary coordinates (either 0 or 1); however, this is not necessary. Here I will discuss data points with \( d\) numerical coordinates, in \( \mathbb R\), as I have before. If you wish, you can translate to the binary setting by using 0 or 1 to represent whether or not the given data point is larger than some threshold in each coordinate.

You begin by splitting the data along some coordinate. In other words, you find a hyperplane \( x_j = b\), where \( 1\le j\le d\) and \(b\) is some number chosen so that no data point lies on the hyperplane: for each \({\bf x}_i \in S\), either \( x_{i,j} < b\) or \( x_{i,j} > b\). This partitions the data into two subsets.
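To make the splitting step concrete, here is a minimal sketch in Python/NumPy; the array `X`, the labels `y`, and the particular choice of `j` and `b` are made up for illustration.

```python
import numpy as np

# Made-up data for illustration: rows of X are points with d = 2 numerical
# coordinates; y holds binary labels (+1 for blue, -1 for red).
X = np.array([[0.5, -2.3],
              [1.7,  0.4],
              [-3.0, -1.2],
              [2.2, -2.9]])
y = np.array([-1, +1, +1, -1])

j, b = 1, -1.8          # split along the hyperplane x_j = b (index 1 is the
                        # second coordinate, x_2, since Python counts from 0)

pos = X[:, j] > b       # points with x_{i,j} > b  (the "+" side)
neg = ~pos              # points with x_{i,j} < b  (the "-" side)

X_pos, y_pos = X[pos], y[pos]
X_neg, y_neg = X[neg], y[neg]
```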

The goal of the process is to recursively partition each side so that, ultimately, points in the same partition have the same label.2 At each split some criterion is used to try to make that happen. Once the split is chosen, you restrict to the two subsets of data points on either side and repeat the process on each subset. Unless other initial choices are made, you continue until a subset (in some region of the original space) has only one label. That region then corresponds to a leaf of the tree, labeled with the one label shared by its points.3
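Here is a sketch of the recursion itself, assuming a hypothetical helper `choose_split(X, y)` that returns a pair \((j, b)\) by whatever criterion you prefer (two candidate criteria are discussed further down). The dictionary representation of tree nodes is just one convenient choice, not anything forced by the construction.

```python
import numpy as np

def build_tree(X, y, choose_split):
    """Recursively partition (X, y) until each region has a single label.

    `choose_split(X, y)` is a hypothetical helper returning a pair (j, b);
    the criterion it uses (empirical risk, information gain, ...) is
    discussed below.  Nodes are plain dicts: a leaf stores its label, an
    internal node stores the split and its two subtrees."""
    labels = np.unique(y)
    if len(labels) == 1:                 # pure region: stop and make a leaf
        return {"label": labels[0]}
    j, b = choose_split(X, y)
    pos = X[:, j] > b
    return {"j": j, "b": b,
            "pos": build_tree(X[pos], y[pos], choose_split),
            "neg": build_tree(X[~pos], y[~pos], choose_split)}
```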

Below is a potential set of data for binary classification, with only 4 negatively-labeled points (the red points). If the first split is made along the horizontal line \( x_2 = -1.8\) (so \( j = 2\) and \(b = -1.8\)), then all but one of the points above the line are blue and 3 of the 12 points below the line are red.

Call the region above the line the positive region of the split; below the line is the negative region. Continuing with nested splits on each side, one could imagine using two more splits in the positive region to isolate the one red point. A way to isolate the three red points on the negative side is with three additional nested splits on that side (see the figure below).

I have drawn the corresponding decision tree to the right, where the dashed lines in the figure correspond to branchings of the tree. The branches of the tree are labeled with + or -, corresponding to the side of the line one is following (in each split of this tree, the negative side is the side that contains red points).

Each leaf of the tree is a node with no further splittings beyond it, labeled by the labels of the points in its region. The depth of a decision tree is defined to be the maximal number of branches that can be followed from the top node to a leaf. So the depth of the example decision tree is 4.
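With the node representation from the sketch above, the depth can be computed by following branches recursively; the tree drawn in the figure would give a depth of 4.

```python
def depth(tree):
    """Maximal number of branches from this node down to a leaf."""
    if "label" in tree:                  # a leaf contributes no further branches
        return 0
    return 1 + max(depth(tree["pos"]), depth(tree["neg"]))
```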

The resulting model function. Start by determining a decision tree on training data, as above. Then, given test data (not yet “seen” by the model), the decision tree model checks in which partition the test point resides and labels it with the label of the corresponding leaf. Often, in an effort to avoid overfitting, you decide beforehand on the depth of the tree. Since this will result in some of the partitions (possibly all) having more than one label, the label given to a test point at a leaf is some function of the training labels in the corresponding partition: if the goal is classification, with categorical labels, the mode could be used; if the goal is regression, with numerical labels, the mean or the median could be used.
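A sketch of the resulting model function under those conventions follows: a hypothetical `leaf_label` helper chooses a leaf's label (mode or mean), and `predict` follows the splits down to a leaf. The names and the tree representation match the earlier sketches and are my own choices, not fixed by the text.

```python
import numpy as np
from collections import Counter

def leaf_label(y_leaf, task="classification"):
    """Hypothetical helper: label a leaf from the training labels (a NumPy
    array) it holds.  Mode for classification; mean for regression (the
    median would also be reasonable)."""
    if task == "classification":
        return Counter(y_leaf.tolist()).most_common(1)[0][0]
    return float(np.mean(y_leaf))

def predict(tree, x):
    """Follow the splits from the top node down to a leaf; return its label."""
    while "label" not in tree:
        tree = tree["pos"] if x[tree["j"]] > tree["b"] else tree["neg"]
    return tree["label"]
```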

2. In the unrestricted case, at least.
3. This describes the basic approach in a classification problem. Notice that it is not at all necessary to have only two labels (with more labels, the leaves simply have more possible labels). One can also use decision trees for modeling/regression. However, for that to be sensible, you should stop splitting at some point (probably before each partition contains just one label) and use, say, the average of the labels in a partition to give the leaf its label.


How to decide about the splits. Deciding which split (decision stump) to use at each step in building the tree can be done in a number of ways.

One method, in the setting of binary classification, uses ideas that we have already built up: apply the ERM procedure to find the decision stump with least empirical risk at each step (on the subset of training points in that region). The example set of points above is a bad one for this approach, since the proportion of red points is so small compared to blue points that a stump labeling every point blue already has very low empirical risk.
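A sketch of this stump search, assuming \(\pm 1\) labels and taking candidate thresholds halfway between consecutive coordinate values (a common convention, not the only possible one):

```python
import numpy as np

def erm_stump(X, y):
    """Return (j, b) for the decision stump with least empirical risk.
    Labels are assumed to be -1/+1; either side of a stump may be the +1 side."""
    m, d = X.shape
    best, best_errors = None, m + 1
    for j in range(d):
        values = np.unique(X[:, j])
        # candidate thresholds halfway between consecutive coordinate values
        for b in (values[:-1] + values[1:]) / 2:
            pos = X[:, j] > b
            errors_pos_is_blue = np.sum(y[pos] == -1) + np.sum(y[~pos] == +1)
            errors_neg_is_blue = np.sum(y[pos] == +1) + np.sum(y[~pos] == -1)
            errors = min(errors_pos_is_blue, errors_neg_is_blue)
            if errors < best_errors:
                best_errors, best = errors, (j, float(b))
    return best
```

The returned pair \((j, b)\) could be passed to the `build_tree` sketch earlier as its `choose_split` argument.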

Another method is to calculate a quantity called the Information Gain of each possible split (the quantity \( IG\) defined below). You then choose the split that has the maximal \( IG\). Information Gain is calculated as follows.

First, define a function \( e\), called the entropy function, where

\( e(r) = -r\log(r) - (1-r)\log(1-r)\)

for any number \( 0 < r < 1\). When \( r = 0\) or \( r = 1\) then define \( e(r) = 0\).4 5
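As a sketch, the entropy function transcribes directly, with the \( r = 0\) and \( r = 1\) cases handled by the definition rather than by the formula (natural logarithm here; see footnote 5 about the base):

```python
from math import log

def entropy(r):
    """e(r) = -r log(r) - (1 - r) log(1 - r), with e(0) = e(1) = 0 by definition."""
    if r == 0 or r == 1:
        return 0.0
    return -r * log(r) - (1 - r) * log(1 - r)
```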

Now, to define \( IG\) of a split, first let \( r = {\bf P}(y=1)\), the probability that the label is 1 (within the particular subset you have restricted to before the split). Let \( m\) be the number of points under consideration. In our example above, after the first split along \( x_2 = -1.8\) has been made, say we are on the positive side, where there is 1 red point (negatively labeled), and are trying to decide where to make the second split. Then \( m = 38\) (there are \( 38\) points above the horizontal line) and \(r = 37/38\).

Next, for a potential split, let \(r_+ = {\bf P}(y=1\ |\ x\text{ on the + side})\) be the fraction of blue (+1 labeled) points on the positive side of that split. Similarly, define \(r_- = {\bf P}(y=1\ |\ x\text{ on the - side})\). Also, let \( m_{\pm}\) be the number of points on each side. In the continuing example, say the potential split is along the line \( x_1 = -3.9\) (one of the second-level splits depicted in the figure above). Then \(r_+ = 35/35 = 1\) and \(r_- = 2/3\).

Now, with \(r, r_+,\) and \(r_-\) defined, the information gain of a split \( s\) is \( IG(s) = e(r) - \left(\frac{m_+}{m}e(r_+) + \frac{m_-}{m}e(r_-)\right)\). Note that it actually doesn’t matter which side of the line is the positive or negative side, just where the splitting line is. For the continuing example, the information gain is

\( e(37/38) - \left(\frac{35}{38}e(1) + \frac{3}{38}e(2/3)\right) \approx 0.07,\) using the natural logarithm (with base 2 the same computation gives roughly 0.10).
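Plugging the numbers of the running example into the `entropy` helper defined above reproduces this value:

```python
# Reusing entropy() from the sketch above (natural logarithm).
r, m = 37/38, 38            # positive region before the split: 37 of 38 points are blue
r_pos, m_pos = 35/35, 35    # + side of the candidate split x_1 = -3.9
r_neg, m_neg = 2/3, 3       # - side of the candidate split

ig = entropy(r) - (m_pos/m * entropy(r_pos) + m_neg/m * entropy(r_neg))
print(ig)                   # roughly 0.0714 (about 0.10 with log base 2)
```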

If you were to use \( IG\) to build a decision tree, then at each step you must find the split, on the subset of the data under consideration, that has maximal information gain.
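If you wanted to automate that search, the following sketch scans all coordinates and candidate thresholds (again midpoints between consecutive values, one common convention) and returns the split with maximal \( IG\); the entropy function is restated so the block stands on its own.

```python
import numpy as np
from math import log

def entropy(r):
    """Same entropy as above, restated so this sketch is self-contained."""
    return 0.0 if r in (0.0, 1.0) else -r * log(r) - (1 - r) * log(1 - r)

def best_split_by_ig(X, y):
    """Return the (j, b) with maximal information gain; y has labels -1/+1.
    Candidate thresholds are midpoints between consecutive coordinate values."""
    m = len(y)
    r = np.mean(y == 1)
    best, best_ig = None, -np.inf
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for b in (values[:-1] + values[1:]) / 2:
            pos = X[:, j] > b
            m_pos, m_neg = pos.sum(), (~pos).sum()
            r_pos, r_neg = np.mean(y[pos] == 1), np.mean(y[~pos] == 1)
            ig = entropy(r) - (m_pos / m * entropy(r_pos)
                               + m_neg / m * entropy(r_neg))
            if ig > best_ig:
                best_ig, best = ig, (j, float(b))
    return best
```

Like `erm_stump`, this can serve as the `choose_split` argument in the `build_tree` sketch earlier.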

4. It is important to consider this a definition, not a consequence of the logarithmic expression given above. While the limit of that expression as \( r\to 0^+\) is \( 0\), the expression itself is not defined at \( r = 0\). Pretending otherwise might not lead to trouble if you only think this way for the purpose of calculating a single value of the entropy function. However, if you attempt to do any arithmetic and use properties of logarithms, thinking of \( 0\log 0\) as being equal to \( 0\) will easily lead to statements like \( 1 = 2\).

5. The logarithm function can have any base. Most of the time in discussions of Information Gain the base is 2, since the idea comes from information theory.