In multivariate decision tree construction, each split is a linear combination of the features, and the main task is to find the coefficients wi of this combination. Using all of the features, however, may be costly and may lead to overfitting on a small training set. Feature selection removes such unnecessary features, effectively setting their coefficients wi to 0.
The feature selection algorithm proceeds as follows. First, the coefficients of all attributes are determined. Then each attribute xi is dropped in turn by setting wi = 0, and the resulting increase in impurity is measured. If dropping an attribute increases the impurity significantly, that attribute is important. Applying this rule, we find the most and the least important attributes. We then check whether the impurity increase for the least important attribute is smaller than the impurity increase for the most important attribute by at least a constant (normally 0.1 or 0.2). If it is, we conclude that the least important attribute is unnecessary and drop it. We repeat this process until the condition is not satisfied for any remaining attribute.
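The elimination loop above can be sketched as follows. This is a hedged illustration, not the original implementation: select_features and impurity_with_dropped are hypothetical names, and impurity_with_dropped is assumed to refit the linear split with wi = 0 for feature i and return the resulting impurity.

```python
# Hedged sketch of the backward feature-elimination loop described above.
# impurity_with_dropped(features, i) is a hypothetical helper assumed to refit
# the linear split with w_i forced to 0 and return the resulting impurity.

def select_features(features, impurity_with_dropped, epsilon=0.1):
    features = list(features)
    while len(features) > 1:
        # Impurity increase caused by dropping each remaining feature.
        increases = {i: impurity_with_dropped(features, i) for i in features}
        most = max(increases, key=increases.get)   # most important feature
        least = min(increases, key=increases.get)  # least important feature
        # Drop the least important feature only if its impurity increase is
        # smaller than that of the most important one by at least epsilon.
        if increases[most] - increases[least] > epsilon:
            features.remove(least)
        else:
            break
    return features
```

The stopping constant epsilon plays the role of the 0.1 or 0.2 threshold mentioned above.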
3.4. Classification and Regression Trees (CART)
The CART algorithm (Breiman et al., 1984) finds the coefficients of the available features by a stepwise procedure: in each step, it cycles through the features x1, x2, …, xf, searching for an improved linear combination split. If there are symbolic features, they are first converted to numeric features. Each instance is then normalized by centering every feature value at that feature's median and dividing by its interquartile range. In its overall structure, the CART algorithm is very similar to ID3.
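The normalization step can be sketched as follows; the function names and the simple half-sample quartile estimate are illustrative assumptions, not the text's prescribed method.

```python
# Sketch of the normalization described above: each feature value is centered
# at the feature's median and scaled by its interquartile range (IQR).
# The quartile estimate used here (medians of the half-samples) is one common
# convention; the original text does not specify which convention it uses.

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

def iqr(values):
    s = sorted(values)
    n = len(s)
    lower, upper = s[: n // 2], s[(n + 1) // 2 :]
    return median(upper) - median(lower)

def normalize(instances):
    """instances: list of feature vectors (lists of floats)."""
    f = len(instances[0])
    meds = [median([x[i] for x in instances]) for i in range(f)]
    # Guard against a zero IQR (constant feature) to avoid division by zero.
    iqrs = [iqr([x[i] for x in instances]) or 1.0 for i in range(f)]
    return [[(x[i] - meds[i]) / iqrs[i] for i in range(f)] for x in instances]
```

At test time the same medians and IQRs, computed from the training set, would be reused.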
Finding the best split with the CART algorithm is done as follows:

While the disorder decreases
    For each feature xi
        Let the current split be v ≤ c, where v = Σm αm xm. The aim is to find the best split of the form v − δ(xi + γ) ≤ c, trying γ = −0.25, 0 and 0.25.
        Divide the instances into two groups: those with xi,n + γ ≥ 0 and those with xi,n + γ < 0.
        For each instance n:
            Calculate un = (vn − c) / (xi,n + γ).
        For all un such that xi,n + γ ≥ 0, the algorithm finds the best split of the form u ≤ δ1. For all un such that xi,n + γ < 0, it finds the best split of the form u > δ2. Take the better of these two splits and let δ be the corresponding threshold.
    The three best splits (one for each value of γ) are compared, and the δ, γ values corresponding to the best of the three are used to update v as follows:
        v′ = Σm α′m xm, where α′i = αi − δ, α′m = αm for m ≠ i, and c′ = c + δγ.
    At this point the updated split is of the form v′ ≤ c′. The cycle is completed by finding the best split of the form v′ ≤ c″ as c″ ranges over all possible values.
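A minimal runnable sketch of the per-feature split-improvement search described above, under stated assumptions: rather than the closed-form un transformation, it scores a small fixed grid of candidate δ values directly, it uses Gini impurity on binary 0/1 labels, and all names are illustrative.

```python
# Simplified sketch of one delta/gamma search step. Instead of transforming
# instances to u_n and solving for the best threshold, this version evaluates
# a fixed grid of candidate deltas directly; Gini impurity and all names here
# are illustrative assumptions, not the original implementation.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n                      # labels are 0/1
    return 2.0 * p * (1.0 - p)

def split_impurity(scores, labels, threshold):
    """Weighted Gini of the split score <= threshold vs. score > threshold."""
    left = [l for s, l in zip(scores, labels) if s <= threshold]
    right = [l for s, l in zip(scores, labels) if s > threshold]
    n = len(labels)
    return (len(left) * gini(left) + len(right) * gini(right)) / n

def improve_feature(X, y, alpha, c, i,
                    gammas=(-0.25, 0.0, 0.25),
                    deltas=(-1.0, -0.5, -0.25, 0.25, 0.5, 1.0)):
    """Find (delta, gamma) improving the split v - delta*(x_i + gamma) <= c,
    where v = sum_m alpha_m * x_m; returns (0, 0) if nothing improves it."""
    def v(x):
        return sum(a * xm for a, xm in zip(alpha, x))
    best_imp = split_impurity([v(x) - c for x in X], y, 0.0)
    best = (0.0, 0.0)
    for gamma in gammas:
        for delta in deltas:
            scores = [v(x) - delta * (x[i] + gamma) - c for x in X]
            imp = split_impurity(scores, y, 0.0)
            if imp < best_imp:
                best_imp, best = imp, (delta, gamma)
    return best
```

The winning δ, γ pair would then update the coefficients as α′i = αi − δ and c′ = c + δγ, as in the update rule above.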
FIGURE 3.4.1 A step in the CART algorithm
Figure 3.4.1 shows the first step of the CART algorithm for an example data set. The initial line is given as x1 + x2 < 0. The lines labeled −0.25, 0 and 0.25 are the best splits found for γ = −0.25, 0 and 0.25; here only the coefficient of attribute x1 is changed. The line with γ = 0 is selected for the next iteration.
Once the best split is found, we create two children, v = Σm αm xm ≤ c and v > c, and divide the set of instances into two. We then continue recursively until a subset is either pure enough or small enough.
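The recursive partitioning described above can be sketched as follows. This is a hedged skeleton under assumptions: find_split is a hypothetical callable standing in for the full CART coefficient search, nodes are plain dicts, and the purity and size thresholds are illustrative.

```python
# Hedged sketch of the recursive tree-growing step described above.
# find_split(X, y) is a hypothetical callable returning (alpha, c) for a
# node's instances; the stopping thresholds are illustrative defaults.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n                      # labels are 0/1
    return 2.0 * p * (1.0 - p)

def grow(X, y, find_split, min_size=5, max_impurity=0.05):
    # Stop when the subset is pure enough or small enough; the leaf is
    # labeled with the majority class.
    if len(y) <= min_size or gini(y) <= max_impurity:
        return {"label": max(set(y), key=y.count)}
    alpha, c = find_split(X, y)
    left = [j for j, x in enumerate(X)
            if sum(a * xm for a, xm in zip(alpha, x)) <= c]
    right = [j for j in range(len(y)) if j not in set(left)]
    if not left or not right:                # degenerate split: make a leaf
        return {"label": max(set(y), key=y.count)}
    return {"alpha": alpha, "c": c,
            "left": grow([X[j] for j in left], [y[j] for j in left],
                         find_split, min_size, max_impurity),
            "right": grow([X[j] for j in right], [y[j] for j in right],
                          find_split, min_size, max_impurity)}
```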
This algorithm has a complexity of O(6 · k · f · n), where k is the number of iterations performed while the disorder decreases, f is the number of features and n is the number of instances. Loop (1) is iterated k times, loop (2) is iterated three times (once per γ value), loop (3) is iterated f times and loop (4) is iterated 2n times. The algorithm is run once at each node, so the training cost per node is O(6 · k · f · n).
The testing phase of the algorithm is as follows. Each instance is first normalized using the medians and interquartile ranges computed from the training set. The tree is then traversed by taking, at each node, the linear combination of the input attributes with the coefficients found by the CART algorithm for that node, until a leaf node is reached. If the instance's class code is the same as the code of the leaf node, it is correctly classified; otherwise it is misclassified.
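The traversal step can be sketched as follows; the dict-based node layout is an assumption carried over for illustration, and normalization of the instance is omitted for brevity.

```python
# Sketch of the test-phase traversal described above. Each internal node is
# assumed to hold coefficients "alpha", a threshold "c" and two children;
# each leaf holds a class "label". This layout is illustrative, not original.

def classify(node, x):
    while "label" not in node:
        # Linear combination of the input attributes with this node's
        # coefficients decides which child to descend into.
        v = sum(a * xm for a, xm in zip(node["alpha"], x))
        node = node["left"] if v <= node["c"] else node["right"]
    return node["label"]
```

Comparing the returned label with the instance's true class then gives the correct/misclassified count.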
Although CART is a good multivariate algorithm, it has some basic limitations. First, like ID3, it is fully deterministic: there is no built-in mechanism for escaping local minima, although such minima may be very common in some domains. Second, it sometimes makes adjustments that increase the impurity of the split. This behavior was probably included to overcome the local minima problem, but it also has a drawback.
There is no upper bound on the time spent at any node of the decision tree. Loop (1) halts when no perturbation changes the impurity by more than ε, but because the impurity may both increase and decrease, the algorithm can spend an arbitrarily long, or even infinite, time at a node. To overcome this problem, we have also introduced an iteration limit: if the algorithm cycles more than this constant number of times, it stops and returns the current split.