As we have seen in Chapter 3, the aim in multivariate decision tree construction is to determine the coefficients of the attributes at each node. In this chapter, we discuss how to determine this set of coefficients using methods based on neural networks.
When neural networks are used in decision trees, each node of the decision tree contains a neural network trained with the data reaching that node. Once the weights of these neural networks are found, they can be used to classify a test example.
FIGURE 4.1 Linear perceptron model for multivariate decision trees
A model for a linear perceptron is shown in Figure 4.1 (Bishop, 1996). x_{0}, x_{1}, x_{2}, …, x_{f} form the input layer of the neural network, whereas the node on top is the output node of the neural network, denoted by y. The output of the neural network is binary. The w_{i}'s are the weights of the input neurons. The weight of x_{0} = 1.0 is w_{0}, which serves as a threshold unit (corresponding to c of CART). y is computed as follows:
y = sigmoid( Σ_{i=0}^{f} w_{i} x_{i} ) = 1 / ( 1 + exp( –Σ_{i=0}^{f} w_{i} x_{i} ) )   (4.1)
The sigmoid function gives a value between 0 and 1.
In order to train the linear perceptron, gradient descent is used. This algorithm trains a single-output linear perceptron to differentiate between two disjoint groups of classes: the classes in the left branch, C_{L}, and the classes in the right branch, C_{R}. The desired output for an instance is 1 when its class belongs to the group C_{L} and 0 when it belongs to the group C_{R}. If only two classes are present at that node, one class belongs to C_{L} and the other to C_{R}; otherwise we must select an appropriate partition of the available classes.
Finding the best split with neural network algorithms is done as a nested optimization problem. In the inner optimization problem, the gradient descent algorithm is used to minimize the mean-square error and so find a good split, as defined by the w_{i}, for the given two disjoint groups of classes C_{L} and C_{R}. In the outer optimization problem, we find the best partition of the classes into the two groups C_{L} and C_{R}.
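The outer optimization can be sketched as an exhaustive search over class partitions, with the inner perceptron training abstracted behind a callback. This is a minimal illustrative sketch, not the thesis code; the function names and the `inner_error` interface are our own assumptions:

```python
from itertools import combinations

def class_partitions(classes):
    """Enumerate all non-trivial two-group partitions (C_L, C_R) of a class set.
    Each unordered split is generated once: the first class is fixed in C_L."""
    classes = sorted(classes)
    first, rest = classes[0], classes[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            c_left = {first, *extra}
            c_right = set(classes) - c_left
            if c_right:                      # skip the empty right group
                yield c_left, c_right

def best_split(classes, inner_error):
    """Outer optimization: try every partition, solve the inner problem
    (perceptron training, abstracted here as `inner_error`), keep the best."""
    return min(class_partitions(classes), key=lambda lr: inner_error(*lr))
```

For k classes this enumerates 2^{k-1} – 1 candidate partitions, which is why heuristic selection becomes necessary when k is large.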
4.1. Training Neural Networks
The inner optimization problem is solved by using neural networks at each node of the decision tree. For this purpose we have used three different kinds of neural network models: linear perceptrons, multilayer perceptrons, and a hybrid combination of the two.
4.1.1. Linear Perceptron Model
A linear perceptron neural network model is shown in Figure 4.1. The training of this network is done by the gradient descent algorithm. Let (x^{t}, d^{t}) denote the training data, where x^{t} is instance t and d^{t} is the desired output for that instance, which is 1 for the left child and 0 for the right child. y^{t} is the real output, found by the formula:
y^{t} = sigmoid( Σ_{i=0}^{f} w_{i} x_{i}^{t} )   (4.2)
A stochastic gradient algorithm is used for minimizing the mean-square error criterion:
E = (1/2) Σ_{t} ( d^{t} – y^{t} )^{2}   (4.3)
At each epoch of training, all instances are presented one at a time in random order. As each instance is passed, the coefficients of the input layer w_{i} and the threshold unit w_{0} are updated by the formulas below. In these formulas η stands for the learning rate, which starts at 0.3 / f and is decreased to 10^{–5} by multiplying with a constant at each epoch. α stands for the momentum rate (0.7 in our application); it is used in updating w_{i} by adding a fraction of the previous update value:
Δw_{i}^{t} = η ( d^{t} – y^{t} ) y^{t} ( 1 – y^{t} ) x_{i}^{t} + α Δw_{i}^{t–1} , i = 0, …, f, x_{0} = +1   (4.4)
w_{i}^{t+1} = w_{i}^{t} + Δw_{i}^{t}   (4.5)
The training of the linear perceptron takes O(e * n * (f+1)), where e is the number of epochs used to train the network, n is the number of instances, and f+1 is the number of inputs (including the threshold), which is the number of weights to update.
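The training loop described above can be sketched in plain Python as follows. This is a minimal sketch, not the original implementation; the weight initialization range, default epoch count, and function names are our own assumptions:

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_linear_perceptron(X, d, epochs=500, eta0=None, eta_min=1e-5, alpha=0.7):
    """Stochastic gradient descent with momentum for a single sigmoid unit
    (Eqs. 4.2, 4.4, 4.5). X: list of feature vectors; d: desired outputs (0/1)."""
    f = len(X[0])
    w = [random.uniform(-0.01, 0.01) for _ in range(f + 1)]   # w[0] is the threshold w_0
    dw_prev = [0.0] * (f + 1)
    eta = eta0 if eta0 is not None else 0.3 / f               # learning rate, 0.3 / f
    decay = (eta_min / eta) ** (1.0 / epochs)                 # multiplicative decay per epoch
    for _ in range(epochs):
        order = list(range(len(X)))
        random.shuffle(order)                                 # random presentation order
        for t in order:
            x = [1.0] + list(X[t])                            # x_0 = +1 (bias input)
            y = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # Eq. 4.2
            for i in range(f + 1):
                # Eq. 4.4: gradient term plus momentum on the previous update
                dw = eta * (d[t] - y) * y * (1 - y) * x[i] + alpha * dw_prev[i]
                w[i] += dw                                    # Eq. 4.5
                dw_prev[i] = dw
        eta *= decay
    return w

def predict(w, x):
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
```

An output above 0.5 sends the instance to the left branch, below 0.5 to the right.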
4.1.2. Multilayer Perceptron Model
A multilayer perceptron model for decision making at a node is shown in Figure 4.1.2.1.
FIGURE 4.1.2.1 Multilayer perceptron model with one hidden layer
There is a hidden layer between the input and output layers, which makes y a nonlinear function of x. x_{0}, x_{1}, x_{2}, …, x_{f} form the input layer; H_{0}, H_{1}, …, H_{m} form the hidden layer; and y denotes the output of the neural network. For the weights connecting the layers, w_{hi} connects input neuron i to hidden neuron h, and T_{h} connects hidden neuron h to the output layer. x_{0} and H_{0} denote the bias units of the input and hidden layers respectively. The number of units in the hidden layer is taken as half the number of features, giving a suitable dimensionality reduction from f to 1.
At instance t, the real output y^{t} and the hidden unit outputs H_{h}^{t} are found as:
y^{t} = sigmoid( Σ_{h=0}^{m} T_{h} H_{h}^{t} )   (4.6)
H_{h}^{t} = sigmoid( Σ_{i=0}^{f} w_{hi} x_{i}^{t} ) , h = 1, …, m   (4.7)
The update rules differ from those of the single-layer perceptron. Since there are two layers, there are two update rules: one for updating the w_{hi} and the other for updating the T_{h}. The learning rate η again starts at 0.3 / f and is decreased as explained for the linear perceptron model. The update rules are:
ΔT_{h}^{t} = η ( d^{t} – y^{t} ) y^{t} ( 1 – y^{t} ) H_{h}^{t} + α ΔT_{h}^{t–1} , h = 0, …, m, H_{0} = +1   (4.8)
T_{h}^{t+1} = T_{h}^{t} + ΔT_{h}^{t}   (4.9)
Δw_{hi}^{t} = η ( d^{t} – y^{t} ) y^{t} ( 1 – y^{t} ) T_{h}^{t} H_{h}^{t} ( 1 – H_{h}^{t} ) x_{i}^{t} + α Δw_{hi}^{t–1}   (4.10)
where h = 1, …, m, i = 0, …, f, and x_{0} = +1
w_{hi}^{t+1} = w_{hi}^{t} + Δw_{hi}^{t}   (4.11)
The training of the multilayer perceptron takes O(e * n * f^{2}), where e is the number of epochs used to train the network, n is the number of instances, and f is the number of features. The factor is f^{2} because there are m * f updates for the weights in the first layer and m updates in the second layer. Since m equals f / 2 as explained before, the total number of updates is f^{2} / 2 + f / 2, which is O(f^{2}).
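The forward pass (Eqs. 4.6, 4.7) and the two update rules (Eqs. 4.8–4.11) can be sketched together as follows. Again this is an illustrative sketch under our own assumptions about initialization and epoch count, not the original implementation; note that Eq. 4.10 must use the old T_{h}^{t}, so the first-layer weights are updated before T_{h} at each step:

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_mlp(X, d, epochs=800, alpha=0.7):
    """One-hidden-layer MLP trained online (Eqs. 4.6-4.11), with m = f / 2
    hidden units as in the text (at least one)."""
    f = len(X[0])
    m = max(1, f // 2)
    # w[h][i]: input i -> hidden h (rows 1..m used; H_0 is the bias unit, no inputs)
    w = [[random.uniform(-0.3, 0.3) for _ in range(f + 1)] for _ in range(m + 1)]
    T = [random.uniform(-0.3, 0.3) for _ in range(m + 1)]     # hidden h -> output
    dw_prev = [[0.0] * (f + 1) for _ in range(m + 1)]
    dT_prev = [0.0] * (m + 1)
    eta = 0.3 / f
    decay = (1e-5 / eta) ** (1.0 / epochs)
    for _ in range(epochs):
        for t in random.sample(range(len(X)), len(X)):        # random order
            x = [1.0] + list(X[t])                            # x_0 = +1
            H = [1.0] + [sigmoid(sum(w[h][i] * x[i] for i in range(f + 1)))
                         for h in range(1, m + 1)]            # Eq. 4.7, H_0 = +1
            y = sigmoid(sum(T[h] * H[h] for h in range(m + 1)))   # Eq. 4.6
            delta = (d[t] - y) * y * (1 - y)
            for h in range(m + 1):
                if h > 0:                                     # Eq. 4.10, old T[h]
                    for i in range(f + 1):
                        dw = (eta * delta * T[h] * H[h] * (1 - H[h]) * x[i]
                              + alpha * dw_prev[h][i])
                        w[h][i] += dw                         # Eq. 4.11
                        dw_prev[h][i] = dw
                dT = eta * delta * H[h] + alpha * dT_prev[h]  # Eq. 4.8
                T[h] += dT                                    # Eq. 4.9
                dT_prev[h] = dT
        eta *= decay
    return w, T, m

def mlp_predict(w, T, m, x):
    f = len(x)
    xb = [1.0] + list(x)
    H = [1.0] + [sigmoid(sum(w[h][i] * xb[i] for i in range(f + 1)))
                 for h in range(1, m + 1)]
    return sigmoid(sum(T[h] * H[h] for h in range(m + 1)))
```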
Multilayer perceptron models differ from the other multivariate models in their nonlinear nature. With this model one can have a nonlinear split at a decision node. In Figure 4.1.2.2, an example nonlinear split is shown for the Choosing Car problem.
FIGURE 4.1.2.2 A nonlinear split to Choosing Car problem
4.1.3. The Hybrid Model
If we compare the complexity of the decision at a node, we see that the univariate methods are the simplest, as they test only one feature. Linear methods, which take a linear combination of the features, are more complex, and nonlinear methods such as multilayer perceptrons are the most complex. But the aim of decision trees is to find a representation of the decision that is as simple and as accurate as possible. So we should not always use nonlinear methods, which have the additional disadvantage that they are not easily interpretable.
Therefore, it seems better to combine linear and nonlinear methods in a hybrid model. In this model, at each node we train both a linear and a multilayer perceptron and use a statistical test to check whether there is a significant performance difference between the two. If the performance of the multilayer perceptron is better than that of the linear perceptron at a 95% confidence level, the multilayer perceptron is chosen and the node becomes a nonlinear decision node; otherwise the linear perceptron is chosen and the node becomes a linear decision node.
The tests we have used to compare the linear and nonlinear models are the combined 5x2 cv F test and the 30-fold cross-validation paired t test. The details of the two tests are given in Appendix B.
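As a rough illustration of the cross-validation paired t test branch of this decision, the comparison can be sketched as below. This is only a sketch of the general paired-t pattern, not the exact procedure of Appendix B; the function name and the hard-coded one-sided critical value for 29 degrees of freedom are our own assumptions:

```python
import math

T_CRIT_29_DF_95 = 1.699   # one-sided t critical value, 29 degrees of freedom, 95%

def prefer_nonlinear(linear_errors, mlp_errors, t_crit=T_CRIT_29_DF_95):
    """Paired t test over per-fold error rates (e.g. 30-fold cross-validation).
    Returns True only if the MLP is significantly more accurate; otherwise the
    simpler linear node is kept."""
    diffs = [l - n for l, n in zip(linear_errors, mlp_errors)]   # > 0 favors the MLP
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((di - mean) ** 2 for di in diffs) / (k - 1)        # sample variance
    if var == 0.0:
        return mean > 0.0          # degenerate case: identical differences on all folds
    t = mean / math.sqrt(var / k)
    return t > t_crit
```

The asymmetry is deliberate: the nonlinear node must prove itself significantly better, so ties default to the more interpretable linear split.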
