Multivariate decision trees for machine learning

4. NEURAL NETWORK MODELS FOR TREE CONSTRUCTION


As we have seen in Chapter 3, the aim in multivariate decision tree construction is to find a way to determine the coefficients of the attributes. In this chapter, we discuss how to determine this set of coefficients using methods based on neural networks.

When neural networks are used in decision tree construction, each node of the tree contains a neural network trained on the data reaching that node. Once the weights of these networks are found, they can be used to classify a test example.

FIGURE 4.1 Linear perceptron model for multivariate decision trees

A model for a linear perceptron is shown in Figure 4.1 (Bishop, 1996). x0, x1, x2, …, xf form the input layer of the neural network, whereas the node on top, denoted by y, is the output node. The decision made with the output of the neural network is binary. The wi are the weights of the input connections. The weight of the bias input x0 = +1.0 is w0, which is used as a threshold unit (corresponding to c of CART). y is computed as follows:

y = \mathrm{sigmoid}\left( \sum_{i=0}^{f} w_i x_i \right) = \frac{1}{1 + \exp\left( -\sum_{i=0}^{f} w_i x_i \right)} \qquad (4.1)
The sigmoid function gives a value between 0 and 1.
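For concreteness, here is a minimal sketch of this computation in Python; the function name, the example values, and the 0.5 decision threshold in the final comment are our illustration, not notation from the thesis:

```python
import numpy as np

def linear_node_output(w, x):
    """Eq. 4.1: sigmoid of the weighted sum; x[0] is the +1 bias input."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# Hypothetical example with f = 2 features plus the bias input
w = np.array([0.5, 1.2, -0.7])   # w[0] is the threshold weight w0
x = np.array([1.0, 0.3, 0.9])    # x0 = +1, followed by the feature values
y = linear_node_output(w, x)     # a value in (0, 1); e.g. branch left if y > 0.5
```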

In order to train the linear perceptron, gradient descent is used. The algorithm trains a single-output linear perceptron to differentiate between two disjoint groups of classes: the classes in the left branch, CL, and the classes in the right branch, CR. The desired output for an instance is 1 when its class belongs to CL and 0 when its class belongs to CR. If there are only two classes present at that node, one class belongs to CL and the other to CR. Otherwise, we must select an appropriate partition of the available classes.

Finding the best split with neural network algorithms is formulated as a nested optimization problem. In the inner optimization problem, the gradient-descent algorithm minimizes the mean-square error to find a good split, defined by the wi, for a given partition of the classes into two distinct groups CL and CR. In the outer optimization problem, we search for the best partition of the classes into the two groups CL and CR.
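A sketch of this nested structure is given below. Enumerating all 2^(k-1) - 1 two-way class partitions is one simple (exponential) strategy for the outer problem and is our assumption here; `train_perceptron` and `split_error` stand in for the inner gradient-descent step and the error measure, and are hypothetical helpers of ours:

```python
from itertools import combinations

def candidate_partitions(classes):
    """Enumerate all splits of a class set into two non-empty groups
    (C_L, C_R). For k classes there are 2**(k-1) - 1 such partitions."""
    classes = list(classes)
    first, rest = classes[0], classes[1:]
    for r in range(len(rest) + 1):
        for subset in combinations(rest, r):
            c_left = {first, *subset}          # fix one class in C_L to
            c_right = set(classes) - c_left    # avoid mirrored duplicates
            if c_right:                        # both groups must be non-empty
                yield c_left, c_right

def best_split(classes, train_perceptron, split_error):
    """Outer optimization: pick the class partition whose trained
    perceptron (the inner optimization) gives the lowest error."""
    best = None
    for c_left, c_right in candidate_partitions(classes):
        w = train_perceptron(c_left, c_right)  # inner: gradient descent
        err = split_error(w, c_left, c_right)
        if best is None or err < best[0]:
            best = (err, w, c_left, c_right)
    return best
```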

4.1. Training Neural Networks


The inner optimization problem is solved by using neural networks at each node of the decision tree. For this purpose, we have used three different kinds of neural network models: linear perceptrons, multilayer perceptrons, and a hybrid combination of the two.

4.1.1. Linear Perceptron Model


A linear perceptron neural network model is shown in Figure 4.1. The training of this network is done by the gradient-descent algorithm. Let (x^t, d^t) denote the training data, where x^t is instance t and d^t is the desired output for that instance, which is 1 for the left child and 0 for the right child. y^t is the actual output, found by the formula:

y^t = \mathrm{sigmoid}\left( \sum_{i=0}^{f} w_i x_i^t \right) \qquad (4.2)

A stochastic gradient algorithm for minimizing the mean-square error criterion is used:

E = \frac{1}{2} \sum_t \left( d^t - y^t \right)^2 \qquad (4.3)

At each epoch of training, all instances are passed one at a time in random order. As each instance is passed, the coefficients of the input layer, wi, and the threshold weight, w0, are updated by the formulas given below. In these formulae, \eta stands for the learning rate, which starts at 0.3 / f and is decreased to 10^{-5} by multiplying it with a constant at each epoch. \alpha stands for the momentum rate (0.7 in our application); it adds a fraction of the previous update value to the current update:

\Delta w_i^t = \eta\,(d^t - y^t)\,y^t(1 - y^t)\,x_i^t + \alpha\,\Delta w_i^{t-1}, \quad i = 0, \ldots, f, \quad x_0 = +1 \qquad (4.4)

w_i^{t+1} = w_i^t + \Delta w_i^t \qquad (4.5)
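A minimal sketch of this training loop follows. The small random weight initialization, the epoch budget, and the geometric decay constant are our assumptions; the text fixes only \eta_0 = 0.3/f, \alpha = 0.7, and the 10^{-5} floor:

```python
import numpy as np

def train_linear_perceptron(X, d, epochs=100, decay=0.99, momentum=0.7):
    """Sketch of Eqs. 4.2-4.5: stochastic gradient descent with momentum.
    X: n x (f+1) matrix whose first column is the +1 bias input.
    d: desired outputs (1 for classes in C_L, 0 for classes in C_R)."""
    n, f_plus_1 = X.shape
    f = f_plus_1 - 1
    w = np.random.uniform(-0.01, 0.01, f_plus_1)  # small random init (assumed)
    delta_w = np.zeros(f_plus_1)                  # previous update, for momentum
    eta = 0.3 / f                                 # initial learning rate
    for _ in range(epochs):
        for t in np.random.permutation(n):        # random order each epoch
            y = 1.0 / (1.0 + np.exp(-w @ X[t]))   # Eq. 4.2
            # Eq. 4.4: gradient of (1/2)(d - y)^2 through the sigmoid
            delta_w = eta * (d[t] - y) * y * (1.0 - y) * X[t] \
                      + momentum * delta_w
            w = w + delta_w                       # Eq. 4.5
        eta = max(eta * decay, 1e-5)              # decay toward the 10^-5 floor
    return w
```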

The training of the linear perceptron takes O(e * n * (f+1)) time, where e is the number of epochs used to train the network, n is the number of instances, and f+1 is the number of inputs (including the threshold unit), which is the number of weights to update.


4.1.2. Multilayer Perceptron Model


A multilayer perceptron model for decision making at a node is shown in Figure 4.1.2.1.

FIGURE 4.1.2.1 Multilayer perceptron model with one hidden layer

There is a hidden layer between the input and output layers, which makes y a nonlinear function of x. x0, x1, x2, …, xf form the input layer; H0, H1, …, Hm form the hidden layer; and y denotes the output of the neural network. Among the weights connecting the layers, whi connects input neuron i to hidden neuron h, and Th connects hidden neuron h to the output layer. x0 and H0 denote the bias units for the input and hidden layers, respectively. The number of units in the hidden layer is taken as half the number of features, for a suitable dimensionality reduction from f to 1.

For instance t, the actual output y^t and the hidden unit outputs H_h^t are found as:

y^t = \mathrm{sigmoid}\left( \sum_{h=0}^{m} T_h H_h^t \right) \qquad (4.6)

H_h^t = \mathrm{sigmoid}\left( \sum_{i=0}^{f} w_{hi} x_i^t \right), \quad h = 1, \ldots, m \qquad (4.7)

The update rules are different from those of the single-layer perceptron. Since there are two layers, there are two update rules: one for updating the whi and the other for updating the Th. The learning rate starts at 0.3 / f and is decreased as explained for the linear perceptron model. The update rules are:

\Delta T_h^t = \eta\,(d^t - y^t)\,y^t(1 - y^t)\,H_h^t + \alpha\,\Delta T_h^{t-1}, \quad h = 0, \ldots, m, \quad H_0 = +1 \qquad (4.8)

T_h^{t+1} = T_h^t + \Delta T_h^t \qquad (4.9)

\Delta w_{hi}^t = \eta\,(d^t - y^t)\,y^t(1 - y^t)\,T_h\,H_h^t(1 - H_h^t)\,x_i^t + \alpha\,\Delta w_{hi}^{t-1} \qquad (4.10)

where h = 1, \ldots, m and i = 0, \ldots, f, with x_0 = +1.

w_{hi}^{t+1} = w_{hi}^t + \Delta w_{hi}^t \qquad (4.11)
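The corresponding sketch for the multilayer perceptron, under the same assumptions about initialization, the epoch budget, and the decay schedule as in the linear case:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, d, epochs=100, decay=0.99, momentum=0.7):
    """Sketch of Eqs. 4.6-4.11: one hidden layer with m = f/2 units,
    trained by stochastic backpropagation with momentum.
    X: n x (f+1) matrix whose first column is the +1 bias input."""
    n, f_plus_1 = X.shape
    f = f_plus_1 - 1
    m = max(f // 2, 1)                                  # hidden units, m = f/2
    W = np.random.uniform(-0.01, 0.01, (m, f_plus_1))   # w_hi, h = 1..m
    T = np.random.uniform(-0.01, 0.01, m + 1)           # T_h, h = 0..m
    dW, dT = np.zeros_like(W), np.zeros_like(T)
    eta = 0.3 / f
    for _ in range(epochs):
        for t in np.random.permutation(n):
            x = X[t]                                      # x[0] = +1
            H = np.concatenate(([1.0], sigmoid(W @ x)))   # Eq. 4.7, H[0] = +1
            y = sigmoid(T @ H)                            # Eq. 4.6
            delta = (d[t] - y) * y * (1.0 - y)
            dT = eta * delta * H + momentum * dT          # Eq. 4.8
            # Eq. 4.10: hidden-layer weights (the bias unit H_0 has no inputs)
            dW = (eta * delta * T[1:] * H[1:] * (1.0 - H[1:]))[:, None] * x \
                 + momentum * dW
            T = T + dT                                    # Eq. 4.9
            W = W + dW                                    # Eq. 4.11
        eta = max(eta * decay, 1e-5)
    return W, T
```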

The training of the multilayer perceptron takes O(e * n * f^2), where e is the number of epochs used to train the network, n is the number of instances, and f is the number of features. The factor is f^2 because there are m * f weight updates in the first layer and m updates in the second layer. Since m equals f / 2 as explained above, the total number of updates per instance is f^2 / 2 + f / 2, which is O(f^2).

Multilayer perceptron models differ from the other multivariate models in their nonlinear nature. With this model, one can have a nonlinear split at a decision node. In Figure 4.1.2.2, an example of a nonlinear split is shown for the Choosing Car problem.

FIGURE 4.1.2.2 A nonlinear split for the Choosing Car problem


4.1.3. The Hybrid Model


If we compare the complexity of the decision at a node, we see that univariate methods are the simplest, as they test only one feature. Linear methods, which take a linear combination of the features, are more complex, and nonlinear methods such as multilayer perceptrons are the most complex. But the aim of decision trees is to find a representation of the decision that is as simple and as accurate as possible. So we should not always use nonlinear methods, which have the additional disadvantage of not being easily interpretable.

Therefore, it seems better to find a way of combining linear and nonlinear methods in a hybrid model. In this model, at each node we train both a linear and a multilayer perceptron and use a statistical test to check whether there is a significant performance difference between the two. If the performance of the multilayer perceptron is better than that of the linear perceptron at a 95% confidence level, the multilayer perceptron is chosen, giving a nonlinear decision node; otherwise the linear perceptron is chosen, giving a linear decision node.

The tests we have used to compare the linear and nonlinear models are the combined 5x2 cv F test and the 30-fold cross-validation paired t test. The details of the two tests are given in Appendix B.
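As an illustration of the selection logic, here is a sketch using the 30-fold cross-validation paired t test; the combined 5x2 cv F test would replace the statistic and critical value (see Appendix B for the exact procedures). The function and variable names are ours:

```python
import numpy as np
from scipy import stats

def choose_node_model(err_linear, err_mlp, alpha=0.05):
    """Hybrid choice via a one-sided paired t test on per-fold error
    rates of the two models over the same 30 cross-validation folds.
    Returns "mlp" only if the multilayer perceptron is significantly
    better at the 1 - alpha (here 95%) confidence level."""
    diff = np.asarray(err_linear) - np.asarray(err_mlp)  # > 0 when MLP better
    k = len(diff)                                        # k = 30 folds
    t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(k))
    t_crit = stats.t.ppf(1.0 - alpha, df=k - 1)          # one-sided critical value
    return "mlp" if t_stat > t_crit else "linear"
```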
