Multivariate decision trees for machine learning



3.3. Feature Selection


In multivariate decision tree construction, each split is a linear combination of the features, and our main purpose is to find the coefficients w_i of that combination. Using all of the features may be costly and may lead to overfitting on a small training set, so we use feature selection to get rid of unnecessary features, effectively setting their w_i to 0.

The feature selection algorithm proceeds as follows. First, the coefficients of all attributes are determined. Then each attribute x_i is dropped in turn by setting w_i = 0, and the resulting increase in impurity is measured. If dropping an attribute increases the impurity significantly, it is an important attribute; applying this rule, we can find the most and the least important attributes. We then check whether the impurity increase for the least important attribute is smaller than the impurity increase for the most important attribute multiplied by a constant (normally 0.1 or 0.2). If it is smaller, we conclude that the attribute is unnecessary and drop it. This process is repeated until the condition is no longer satisfied for any attribute.
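
The loop below is a minimal Python sketch of this elimination procedure. The helpers find_coefficients and impurity_of_split are hypothetical stand-ins for the coefficient search of Section 3.4 and for the impurity measure; the ratio of 0.2 corresponds to the constant mentioned above.

import numpy as np

# A minimal sketch of the backward elimination described above. The callables
# find_coefficients(X, y) -> (w, c) and impurity_of_split(X, y, w, c) -> float
# are assumed helpers, not taken from the original text.
def select_features(X, y, find_coefficients, impurity_of_split, ratio=0.2):
    active = list(range(X.shape[1]))              # indices of the kept features
    while len(active) > 1:
        w, c = find_coefficients(X[:, active], y)
        base = impurity_of_split(X[:, active], y, w, c)
        increases = []                            # impurity increase per dropped feature
        for j in range(len(active)):
            w_dropped = np.array(w, dtype=float)
            w_dropped[j] = 0.0                    # drop feature j by zeroing its weight
            increases.append(impurity_of_split(X[:, active], y, w_dropped, c) - base)
        least, most = min(increases), max(increases)
        if least < ratio * most:                  # the least important feature is unnecessary
            active.pop(increases.index(least))
        else:                                     # every remaining feature matters
            break
    return active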


3.4. Classification and Regression Trees (CART)


The CART algorithm (Breiman et al., 1984) for finding the coefficients of the available features is a step-wise procedure: in each step, one cycles through the features x_1, x_2, …, x_f searching for an improved linear combination split. If there are symbolic features, they are first converted to numeric features. Each instance is normalized by centering each value of each feature at its median and then dividing by its interquartile range. In its overall structure, the CART algorithm is very similar to ID3.
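
As a concrete illustration of the normalization step, the following short Python sketch (function names are ours, not from the original) computes the median and interquartile range on the training set and applies them to any set of instances.

import numpy as np

# Center every feature at its median and scale it by its interquartile range,
# using statistics computed on the training set only.
def fit_normalizer(X_train):
    median = np.median(X_train, axis=0)
    q75 = np.percentile(X_train, 75, axis=0)
    q25 = np.percentile(X_train, 25, axis=0)
    iqr = np.where(q75 - q25 > 0, q75 - q25, 1.0)   # guard against constant features
    return median, iqr

def normalize(X, median, iqr):
    return (X - median) / iqr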

Finding the best split with the CART algorithm is done as follows:



  1. While the disorder (impurity) decreases:

  2.   For each feature x_i:

  3.     For δ ∈ {-0.25, 0, 0.25}:

  • Let the current split be v ≤ c, where v = Σ_m w_m x_m; the aim is to find the best split of the form v - γ(x_i + δ) ≤ c.

  • Divide the instances into two groups: those with x_i + δ ≥ 0 and those with x_i + δ < 0.

  4.       For each instance n:

  • Calculate u_n = (v_n - c) / (x_{i,n} + δ). For all u_n such that x_{i,n} + δ ≥ 0, the algorithm finds the best split of the form u ≤ γ_1; for all u_n such that x_{i,n} + δ < 0, it finds the best split of the form u ≥ γ_2. The better of these two splits is taken and γ is set to the corresponding threshold.

  5.     The three best splits (one per δ) are compared, and the γ, δ values corresponding to the best of the three are used to update v as follows:

  v' = Σ_m w'_m x_m, where w'_i = w_i - γ, w'_m = w_m for m ≠ i, and c' = c + γδ.

  6.   At this point the updated split is of the form v' ≤ c'. The cycle is completed by finding the best split of the form v' ≤ c' as c' ranges over all possible values.

FIGURE 3.4.1 A step in the CART algorithm

Figure 3.4.1 shows the first step of the CART algorithm for an example data set. The initial line is given as x_1 + x_2 < 0. The lines labeled -0.25, 0 and 0.25 are the best splits found for δ = -0.25, 0 and 0.25; here only the coefficient of attribute x_1 is changed. The line with δ = 0 will be selected for further iterations.
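
To make the update concrete, the following simplified Python sketch performs a single perturbation step for one feature. The helper impurity, which scores a candidate partition, is assumed; and unlike the two-group shortcut in the listing, every candidate γ is scored directly on the full split v - γ(x_i + δ) ≤ c, which is equivalent but less efficient.

import numpy as np

# One perturbation step on feature i (cf. Figure 3.4.1). `impurity` is a
# hypothetical helper scoring the partition induced by a boolean
# "goes to the left child" mask, e.g. weighted entropy or Gini.
def perturb_feature(X, y, w, c, i, impurity):
    v = X @ w                                     # current linear combination v_n
    best_score, best_gamma, best_delta = impurity(v <= c, y), 0.0, 0.0
    for delta in (-0.25, 0.0, 0.25):
        denom = X[:, i] + delta
        valid = denom != 0                        # u_n is undefined when x_{i,n} + delta = 0
        u = (v[valid] - c) / denom[valid]         # u_n = (v_n - c) / (x_{i,n} + delta)
        for gamma in np.unique(u):                # candidate thresholds
            left = v - gamma * (X[:, i] + delta) <= c
            score = impurity(left, y)
            if score < best_score:
                best_score, best_gamma, best_delta = score, gamma, delta
    w_new = np.array(w, dtype=float)
    w_new[i] -= best_gamma                        # w'_i = w_i - gamma
    c_new = c + best_gamma * best_delta           # c'  = c + gamma * delta
    return w_new, c_new

Cycling this step over all features, re-searching the threshold c, and repeating while the impurity keeps decreasing (subject to the iteration limit discussed at the end of this section) gives the full per-node search.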

Once the best split is found, we create two children, sending the instances with v = Σ_m w_m x_m ≤ c to one child and those with v > c to the other, and thus divide the set of instances into two. We then continue recursively until a subset is either pure enough or small enough.
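
The recursion around this split search can be sketched as follows; find_best_split stands for the whole coefficient search described above, and the size and purity thresholds are illustrative assumptions, not values given in the text.

from collections import Counter
import numpy as np

def majority_class(y):
    return Counter(y).most_common(1)[0][0]

def purity(y):
    return Counter(y).most_common(1)[0][1] / len(y)

# Grow a multivariate tree: split while the subset is large and impure enough.
# X and y are NumPy arrays; find_best_split(X, y) -> (w, c) is assumed.
def grow_tree(X, y, find_best_split, min_size=5, min_purity=0.95):
    if len(y) <= min_size or purity(y) >= min_purity:
        return {"leaf": True, "label": majority_class(y)}
    w, c = find_best_split(X, y)
    left = X @ w <= c                             # v = sum_m w_m x_m <= c goes left
    if left.all() or not left.any():              # degenerate split: stop here
        return {"leaf": True, "label": majority_class(y)}
    return {"leaf": False, "w": w, "c": c,
            "left": grow_tree(X[left], y[left], find_best_split, min_size, min_purity),
            "right": grow_tree(X[~left], y[~left], find_best_split, min_size, min_purity)}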

This algorithm has a complexity of O(6*k*f*n), where k is the number of iterations done while the disorder decreases, f is the number of features and n is the number of instances: loop (1) is iterated k times, loop (2) is iterated f times, loop (3) is iterated three times and loop (4) is iterated 2n times. The algorithm is called once at each node, so the complexity of the training phase is O(6*k*f*n) per node.

The testing phase of the algorithm is as follows. Each instance is first normalized according to the median and interquartile range computed from the training set. The tree is then traversed by taking, at each node, the linear combination of the input attributes with the coefficients found by the CART algorithm for that node, until a leaf node is reached. If the class code of the instance is the same as the code of the leaf node, it is correctly classified; otherwise it is misclassified.
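
Combined with the earlier sketches, the testing phase could then look roughly like this (again assuming the hypothetical structures defined above and NumPy-array inputs).

# Classify one normalized instance by walking down the tree built by grow_tree,
# then measure the misclassification rate on a test set normalized with the
# training-set median and interquartile range.
def classify(x, node):
    while not node["leaf"]:
        node = node["left"] if x @ node["w"] <= node["c"] else node["right"]
    return node["label"]

def error_rate(X_test, y_test, tree, median, iqr):
    X_norm = (X_test - median) / iqr              # normalize with training statistics
    predictions = [classify(x, tree) for x in X_norm]
    return sum(p != t for p, t in zip(predictions, y_test)) / len(y_test)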

Although CART is a good multivariate algorithm, it has some basic limitations. First of all, it is fully deterministic, like ID3: there is no built-in mechanism for escaping local minima, although such minima may be very common in some domains. Secondly, it sometimes makes adjustments that increase the impurity of the split. This behavior was probably included to overcome the local-minima problem, but it also has a drawback.

There is no upper bound on the time spent at any node of the decision tree. Loop (1) halts when no perturbation changes the impurity by more than a small threshold, but because the impurity may both increase and decrease, the algorithm can spend an arbitrarily long, and in some cases infinite, time at a node. To overcome this problem, we have also included an iteration constant: if the algorithm cycles more than this number of times, it stops and returns the current split.


