As we have seen in Chapter 3, the aim in multivariate decision tree construction is to find a way to determine the coefficients of the attributes. In this chapter, we discuss how to determine this set of coefficients using methods based on neural networks.
When neural networks are used in decision tree construction, a separate neural network is trained at each node of the tree with the data reaching that node. Once the weights of these networks are found, they can be used to classify a test example.
FIGURE 4.1 Linear perceptron model for multivariate decision trees
A model for a linear perceptron is shown in Figure 4.1 (Bishop, 1996). x_0, x_1, x_2, …, x_f form the input layer of the neural network, and the node on top, denoted by y, is its output node. The output of the neural network is binary. The w_i are the weights of the input connections. The weight of x_0 = 1.0 is w_0, which serves as a threshold unit (corresponding to c of CART). y is computed as follows:

y = sigmoid( Σ_{i=0}^{f} w_i x_i ) = 1 / ( 1 + e^( −Σ_{i=0}^{f} w_i x_i ) )   (4.1)
The sigmoid function gives a value between 0 and 1.
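As a small illustration, this forward computation can be sketched as follows (a minimal sketch; the function name `perceptron_output` and the toy weights are ours, not from the text):

```python
import math

def perceptron_output(w, x):
    """Output of the linear perceptron: the sigmoid of the weighted sum
    of the inputs, where w[0] is the threshold weight and the input
    vector is augmented with x0 = +1."""
    weighted_sum = sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))
    return 1.0 / (1.0 + math.exp(-weighted_sum))  # always in (0, 1)

# With all-zero weights the weighted sum is 0, so the output is 0.5.
print(perceptron_output([0.0, 0.0, 0.0], [2.3, -1.7]))  # → 0.5
```

An output above 0.5 can then be read as sending the instance to one child and an output below 0.5 to the other.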
In order to train the linear perceptron, gradient descent is used. This algorithm trains a single-output linear perceptron to differentiate between two disjoint groups of classes: the classes in the left branch C_L and the classes in the right branch C_R. The desired output for an instance is 1 when its class belongs to the group C_L and 0 when its class belongs to the group C_R. If there are only two classes present at that node, one class belongs to C_L and the other to C_R. Otherwise, we must select an appropriate partition of the available classes.
Finding the best split with neural network algorithms is done as a nested optimization problem. In the inner optimization problem, the gradient-descent algorithm is used to minimize the mean-square error and so find a good split as defined by wi for the given two distinct groups of classes CL and CR. In the outer optimization problem, we find the best split of classes into the two groups, CL and CR.
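The outer optimization problem ranges over the possible two-group partitions of the classes at a node; for K classes there are 2^(K−1) − 1 of them. A sketch like the following could enumerate the candidates (the helper `class_partitions` is illustrative, not from the text):

```python
from itertools import combinations

def class_partitions(classes):
    """Enumerate every split of a set of classes into two non-empty
    groups (C_L, C_R). Fixing the first class in C_L avoids counting
    mirror-image partitions twice, giving 2**(K-1) - 1 candidates."""
    first, rest = classes[0], classes[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(classes) - left
            if right:  # skip the trivial split with an empty C_R
                yield left, right

# Three classes give 2**(3-1) - 1 = 3 candidate partitions.
for c_left, c_right in class_partitions(["a", "b", "c"]):
    print(sorted(c_left), sorted(c_right))
```

For each candidate (C_L, C_R), the inner optimization trains a perceptron and the partition with the lowest resulting error would be kept.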
4.1. Training Neural Networks
The inner optimization problem will be solved by using neural networks at each node of the decision tree. For this purpose we have used three different kinds of neural network models. These are linear perceptrons, multilayer perceptrons and a hybrid combination.
4.1.1. Linear Perceptron Model
A linear perceptron neural network model is shown in Figure 4.1. The training of this network is done by the gradient-descent algorithm. Let (x^t, d^t) denote the training data, where x^t is instance t and d^t is the desired output for that instance, which is 1 for the left child and 0 for the right child. y^t is the real output found by the formula:

y^t = sigmoid( Σ_{i=0}^{f} w_i x_i^t )   (4.2)
A stochastic gradient algorithm is used to minimize the mean-square error criterion:

E = (1/2) Σ_t ( d^t − y^t )²   (4.3)
At each epoch of training, all instances are presented one at a time in random order. As each instance is passed, the coefficients of the input layer w_i and the threshold unit w_0 are updated by the formulas below. In these formulae, η stands for the learning rate. The learning rate is started at 0.3 / f and decreased to 10^−5 by multiplying it with a constant at each epoch. α stands for the momentum rate (0.7 in our application); it is used in updating w_i by multiplying the previous update value as:
Δw_i^t = η ( d^t − y^t ) y^t ( 1 − y^t ) x_i^t + α Δw_i^{t−1} ,  i = 0, …, f ,  x_0 = +1   (4.4)

w_i^{t+1} = w_i^t + Δw_i^t   (4.5)
The training of the linear perceptron takes O(e · n · (f + 1)) time, where e is the number of epochs used to train the network, n is the number of instances and f + 1 is the number of inputs (including the threshold), which is the number of weights to update.
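Putting the pieces together, the training procedure of this section might be sketched as below. The geometric decay constant is derived here so that the learning rate reaches 10^−5 in the last epoch, a detail the text leaves implicit; the function name, initialization range and toy data are ours.

```python
import math
import random

def train_linear_perceptron(X, d, epochs=300, seed=0):
    """Stochastic gradient descent for the linear perceptron:
    sigmoid output, mean-square error, momentum rate 0.7, and a
    learning rate started at 0.3 / f and decayed geometrically
    toward 1e-5 over the epochs."""
    rng = random.Random(seed)
    f = len(X[0])
    w = [rng.uniform(-0.01, 0.01) for _ in range(f + 1)]  # w[0]: threshold
    dw = [0.0] * (f + 1)                                  # previous updates
    eta, alpha = 0.3 / f, 0.7
    decay = (1e-5 / eta) ** (1.0 / epochs)  # reach 1e-5 in the last epoch
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)                   # instances in random order
        for t in order:
            x = [1.0] + list(X[t])           # augment with x0 = +1
            y = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            g = (d[t] - y) * y * (1.0 - y)   # error term of Eq. (4.4)
            for i in range(f + 1):
                dw[i] = eta * g * x[i] + alpha * dw[i]
                w[i] += dw[i]                # Eq. (4.5)
        eta *= decay
    return w

# Toy separable split: instances of C_L (d = 1) have x1 > x2.
X = [(1.0, 0.0), (0.8, 0.1), (0.0, 1.0), (0.1, 0.9)]
d = [1, 1, 0, 0]
w = train_linear_perceptron(X, d)
```

On this separable toy problem the learned weights drive the sigmoid output above 0.5 for the C_L instances and below 0.5 for the C_R instances.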
4.1.2. Multilayer Perceptron Model
A multilayer perceptron model for decision making at a node is shown in Figure 4.2.
FIGURE 4.2 Multilayer perceptron model with one hidden layer
There is a hidden layer between the input and output layers, which makes y a nonlinear function of x. x_0, x_1, x_2, …, x_f form the input layer, H_0, H_1, …, H_m form the hidden layer and y denotes the output of the neural network. For the weights connecting the layers, w_hi connects input neuron i to hidden neuron h and T_h connects hidden neuron h to the output layer. x_0 and H_0 denote the bias units of the input and hidden layers respectively. The number of units in the hidden layer is taken as half the number of features, for a suitable dimensionality reduction from f inputs to the single output.
At instance t, the hidden layer outputs H_h^t and the real output y^t are found as:

H_h^t = sigmoid( Σ_{i=0}^{f} w_hi x_i^t ) ,  h = 1, …, m   (4.6)

y^t = sigmoid( Σ_{h=0}^{m} T_h H_h^t )   (4.7)
The update rules differ from those of the single-layer perceptron. Since there are two layers, there are two update rules: one for updating the w_hi and the other for updating the T_h. The learning rate is started at 0.3 / f and decreased as explained for the linear perceptron model. The update rules are:
ΔT_h^t = η ( d^t − y^t ) y^t ( 1 − y^t ) H_h^t + α ΔT_h^{t−1} ,  h = 0, …, m ,  H_0 = +1   (4.8)

T_h^{t+1} = T_h^t + ΔT_h^t   (4.9)

Δw_hi^t = η ( d^t − y^t ) y^t ( 1 − y^t ) T_h H_h^t ( 1 − H_h^t ) x_i^t + α Δw_hi^{t−1}   (4.10)

where h = 1, …, m and i = 0, …, f, with x_0 = +1.
The training of the multilayer perceptron takes O(e · n · f²) time, where e is the number of epochs used to train the network, n is the number of instances and f is the number of features. It is f² because there are m · f updates for the weights in the first layer and m updates in the second layer. m equals f / 2 as explained before, so the total number of updates is f² / 2 + f / 2, which is O(f²).
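The forward pass and the update rules (4.8)-(4.10) might be sketched as follows; the helper names, the weight initialization range and the toy instance are our assumptions, not from the text.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def mlp_forward(w, T, x):
    """Forward pass: w[h][i] connects input i to hidden unit h + 1,
    T[h] connects hidden unit h to the output; x is augmented with
    x0 = +1 and the hidden layer with H0 = +1."""
    x = [1.0] + list(x)
    H = [1.0] + [sigmoid(sum(w[h][i] * x[i] for i in range(len(x))))
                 for h in range(len(w))]
    y = sigmoid(sum(T[h] * H[h] for h in range(len(H))))
    return y, H, x

def mlp_update(w, T, x_in, d, dT, dw, eta=0.1, alpha=0.7):
    """One stochastic update in the spirit of Eqs. (4.8)-(4.10): the
    output error is propagated back to both layers, with momentum
    applied through the previous update values dT and dw."""
    y, H, x = mlp_forward(w, T, x_in)
    e = (d - y) * y * (1.0 - y)
    for h in range(len(T)):              # second-layer updates, Eq. (4.8)
        dT[h] = eta * e * H[h] + alpha * dT[h]
    for h in range(len(w)):              # first-layer updates, Eq. (4.10)
        back = e * T[h + 1] * H[h + 1] * (1.0 - H[h + 1])
        for i in range(len(x)):
            dw[h][i] = eta * back * x[i] + alpha * dw[h][i]
            w[h][i] += dw[h][i]
    for h in range(len(T)):
        T[h] += dT[h]                    # Eq. (4.9)
    return y

# m = f / 2 hidden units, as in the text; small random initial weights.
rng = random.Random(1)
f, m = 4, 2
w = [[rng.uniform(-0.1, 0.1) for _ in range(f + 1)] for _ in range(m)]
T = [rng.uniform(-0.1, 0.1) for _ in range(m + 1)]
dw = [[0.0] * (f + 1) for _ in range(m)]
dT = [0.0] * (m + 1)
x, d = [0.5, -0.2, 0.8, 0.1], 1.0
y0 = mlp_forward(w, T, x)[0]
for _ in range(50):
    mlp_update(w, T, x, d, dT, dw)      # output moves toward d = 1
```

Repeated updates on the same instance drive the output from its initial value near 0.5 toward the desired output.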
Multilayer perceptron models differ from the other multivariate models in their nonlinear nature. With this model, one can have a nonlinear split at a decision node. In Figure 4.3, an example nonlinear split is shown for the Choosing Car problem.
FIGURE 4.3 A nonlinear split for the Choosing Car problem
4.1.3. The Hybrid Model
If we compare the complexity of the decision at a node, we see that the univariate methods are the simplest, as they test only one feature. Linear methods, which take a linear combination of the features, are more complex, and nonlinear methods such as multilayer perceptrons are the most complex. But the aim of decision trees is to find a representation of the decision that is as simple and as accurate as possible. So we should not always use nonlinear methods, which have the additional disadvantage that they are not easily interpretable.
Therefore, it seems better to find a way of combining linear and nonlinear methods in a hybrid model. In this model, at each node we train both a linear and a multilayer perceptron and use a statistical test to check whether there is a significant performance difference between the two. If the performance of the multilayer perceptron is better than that of the linear perceptron at a 95% confidence level, the multilayer perceptron is chosen, giving a nonlinear decision node; otherwise the linear perceptron is chosen, giving a linear decision node.
The tests we have used to compare the linear and nonlinear models are the combined 5×2 cv F test and the 30-fold cross-validation paired t test. The details of the two tests are given in Appendix B.
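As an illustration of how such a comparison could be computed, here is a sketch of the combined 5×2 cv F statistic (Alpaydın, 1999); the fold-difference values and the function name are hypothetical, and the critical value quoted in the comment should be checked against an F table.

```python
def combined_5x2cv_f(p):
    """Combined 5x2 cv F statistic (Alpaydin, 1999). p[i][j] is the
    difference in error rates of the two classifiers on fold j of
    replication i, over 5 replications of 2-fold cross-validation.
    Under the null hypothesis of equal error, the statistic is
    approximately F-distributed with (10, 5) degrees of freedom."""
    numerator = sum(pij ** 2 for pi in p for pij in pi)
    variances = [(pi[0] - pi[1]) ** 2 / 2.0 for pi in p]  # s_i ** 2
    return numerator / (2.0 * sum(variances))

# Hypothetical fold differences; reject equality at the 95% level if
# the statistic exceeds the F(10, 5) critical value (about 4.74).
p = [[0.02, 0.03], [0.01, 0.04], [0.03, 0.02], [0.02, 0.05], [0.01, 0.02]]
print(combined_5x2cv_f(p))  # ≈ 3.67, so no significant difference here
```

In the hybrid model, such a rejection in favor of the multilayer perceptron would make the node nonlinear.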