Decision Trees
Introduction
In the K-means Algorithm, we made the inherent assumption that data points of various classes are not randomly sprinkled across the space, but instead form clusters that are more or less homogeneous in their labels.
Decision trees exploit this very assumption to build a tree structure that recursively divides the feature space into regions with similar labels.
For this purpose, we use the concept of purity. We want to split the data at each node such that the resulting subsets are as pure as possible.
Two common impurity measures are Gini impurity and entropy.
Let $S = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$ be a set of training examples where $y^{(i)} \in \{1, \ldots, c\}$ for $c$ classes. If $S_k \subseteq S$ denotes the subset of examples with label $k$, we can define the probability of any class $k$ as:

$$p_k = \frac{|S_k|}{|S|}$$

The Gini impurity of a leaf $S$ is given by:
$$G(S) = \sum_{k=1}^{c} p_k (1 - p_k)$$

And for a split $S \to S_L, S_R$, the Gini impurity is given by:

$$G(S \to S_L, S_R) = \frac{|S_L|}{|S|} \cdot G(S_L) + \frac{|S_R|}{|S|} \cdot G(S_R)$$

Similarly, the entropy of a leaf $S$ is given by:
$$H(S) = -\sum_{k=1}^{c} p_k \log p_k$$

And for a split $S \to S_L, S_R$, the entropy is given by:

$$H(S \to S_L, S_R) = \frac{|S_L|}{|S|} \cdot H(S_L) + \frac{|S_R|}{|S|} \cdot H(S_R)$$

Also, note that the worst case for our probabilities is if they follow the uniform distribution $p_k = \frac{1}{c}$ for each $k$, because this would mean that each class is equally likely and our predictions are as good as random guessing.
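To make these definitions concrete, here is a minimal sketch in Python with NumPy (my own illustrative code, not from the original text; the function names are hypothetical and the logarithm is the natural log). It computes $p_k$, the Gini impurity, and the entropy of a label set, plus the weighted impurity of a split, and checks the worst-case values for a uniform label distribution.

```python
import numpy as np

def class_probabilities(labels):
    """Return p_k = |S_k| / |S| for each class present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini impurity G(S) = sum_k p_k (1 - p_k)."""
    p = class_probabilities(labels)
    return np.sum(p * (1.0 - p))

def entropy(labels):
    """Entropy H(S) = -sum_k p_k log p_k (natural log)."""
    p = class_probabilities(labels)
    return -np.sum(p * np.log(p))

def split_impurity(left_labels, right_labels, impurity=gini):
    """Weighted impurity of a split S -> S_L, S_R."""
    n_l, n_r = len(left_labels), len(right_labels)
    n = n_l + n_r
    return (n_l / n) * impurity(left_labels) + (n_r / n) * impurity(right_labels)

# A uniform label distribution over c classes is the worst case:
# Gini = 1 - 1/c and entropy = log(c).
uniform = np.array([0, 1, 2, 0, 1, 2])   # c = 3, so p_k = 1/3 for each class
print(gini(uniform))      # 0.666... = 1 - 1/3
print(entropy(uniform))   # 1.0986... = log(3)
```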
If we let this uniform distribution be $q$, then we see that minimizing the entropy is equivalent to maximizing the KL-divergence $D_{KL}(p\,\|\,q)$. Since $q_k = \frac{1}{c}$ is constant, we have

$$D_{KL}(p\,\|\,q) = \sum_k p_k \log \frac{p_k}{q_k} = \sum_k p_k \log p_k + \log c,$$

and therefore

$$\arg\max_p D_{KL}(p\,\|\,q) = \arg\min_p \left( -\sum_k p_k \log p_k \right) = \arg\min_p H(p).$$

However, building a maximally compact tree with only pure leaves for any data set is an NP-hard problem. Therefore, we use a greedy approach instead.
One example of this is the ID3 algorithm. In this algorithm, we start at the root node and at each step select the feature $x_j$ that best separates the data, i.e., that minimizes the impurity of the resulting split. We then recurse on the left and right subsets defined by this split until we can no longer split the data.
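The greedy step can be sketched as follows (my own illustrative Python, not from the original text; helper names such as `best_split` and `grow_tree` are hypothetical). It searches over every feature and candidate threshold for the split with the lowest weighted Gini impurity and recurses until no single split improves on the current node.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1.0 - p))

def best_split(X, y):
    """Greedily pick the (feature j, threshold t) with the lowest weighted Gini impurity."""
    n, d = X.shape
    best = (None, None, gini(y))          # start from the unsplit node's impurity
    for j in range(d):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if not left.any() or not right.any():
                continue
            score = (left.sum() / n) * gini(y[left]) + (right.sum() / n) * gini(y[right])
            if score < best[2]:
                best = (j, t, score)
    return best

def grow_tree(X, y):
    """Recursively split until no single split reduces the impurity; returns a nested dict."""
    j, t, _ = best_split(X, y)
    if j is None:                          # no improving split -> leaf with the majority label
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    left = X[:, j] <= t
    return {"feature": j, "threshold": t,
            "left": grow_tree(X[left], y[left]),
            "right": grow_tree(X[~left], y[~left])}
```

Note that `grow_tree` stops as soon as no single split lowers the impurity, which is exactly the limitation discussed next.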
A problem with this approach is that the ID3 algorithm stops splitting if a single split doesn't reduce impurity, even though a combination of splits might.
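As a concrete illustration (my own toy example, reusing the `gini` and `best_split` sketches above), XOR-like data triggers this failure mode: no single axis-aligned split reduces the Gini impurity, so the greedy criterion stops at the root, even though two nested splits would separate the classes perfectly.

```python
# XOR-like data: the label is 1 exactly when the two features differ.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

print(gini(y))           # 0.5 at the root
print(best_split(X, y))  # (None, None, 0.5): no single split improves the impurity
```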
Another example of the greedy approach is the CART (Classification and Regression Trees) algorithm. CART can handle both classification and regression tasks. For regression tasks, the cost function is the squared error and for classification tasks it is the Gini impurity or entropy.
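In practice one rarely implements this by hand; scikit-learn's tree estimators, for instance, are CART-based. Here is a small usage sketch, assuming a recent scikit-learn version in which the criterion names `"gini"`, `"entropy"`, and `"squared_error"` are available; the toy regression target is my own choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification: the split criterion can be "gini" (the default) or "entropy".
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(clf.score(X, y))               # training accuracy

# Regression: the cost function is the squared error.
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3)
reg.fit(X[:, :3], X[:, 3])           # toy regression: predict petal width from the other features
print(reg.score(X[:, :3], X[:, 3]))  # R^2 on the training data
```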