Multivariate decision trees for machine learning

4.2. Class Separation by Selection Method

Class separation is a process that must be carried out when there are three or more classes at the decision node. If there are c classes, then there are 2^(c-1) - 1 distinct two-way partitions possible (for example, c = 4 classes admit 7 distinct partitions). Because we cannot test them all, we use heuristics to obtain a reasonable partition in a reasonable amount of time. The first method we use in class separation is the selection method, which is a depth-first search with no backtracking. Let t be a decision node and C = {C1, …, Cc} be the set of c classes at node t.

  1. Select an initial partition CL, CR, where CL contains only the examples of one arbitrarily chosen class Ci and CR only those of another class Cj.

  2. Train the network at node t with the given partition. Do not yet consider the examples of the other classes.

  3. Among the remaining classes in the class list, search for the class Ck that is best placed into one of the two parts, i.e., the class whose assignment to a child yields the minimum impurity.

  4. Put Ck into its best part and continue adding classes one by one, repeating steps 2 to 4, until no classes are left.

This algorithm is sensitive to the initial class partition because of its depth-first nature, so a heuristic technique is used to select the initial partition instead of selecting it randomly: the Euclidean distance between the means of every pair of classes is calculated, and the two furthest classes are taken as the initial two classes.

The algorithm repeats steps 2 to 4 c - 2 times, so its complexity is O(c), where c is the number of classes available at node t.
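The selection method above can be sketched as follows. This is only a sketch: training the network and measuring impurity (steps 2 and 3) are replaced by a caller-supplied impurity function, and the toy_impurity used in the demonstration, which measures the within-group spread of the class means, is an assumption standing in for the trained split.

```python
from itertools import combinations

# Toy class means (an assumed example; classes 0,1 and 2,3 form two clusters).
means = {0: (0.0, 0.0), 1: (0.1, 0.0), 2: (5.0, 5.0), 3: (5.1, 5.0)}

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def toy_impurity(CL, CR):
    # Stand-in for "train the network, measure impurity": total squared
    # distance of each group's class means to the group centroid.
    total = 0.0
    for group in (CL, CR):
        pts = [means[c] for c in group]
        centroid = [sum(v) / len(pts) for v in zip(*pts)]
        total += sum(dist2(p, centroid) for p in pts)
    return total

def selection_partition(classes, impurity):
    # Step 1 heuristic: start from the two classes whose means are furthest apart.
    ci, cj = max(combinations(classes, 2),
                 key=lambda p: dist2(means[p[0]], means[p[1]]))
    CL, CR = [ci], [cj]
    remaining = [c for c in classes if c not in (ci, cj)]
    # Steps 2-4: greedily place the class whose best placement gives minimum impurity.
    while remaining:
        candidates = [(impurity(CL + [c], CR), c, "L") for c in remaining]
        candidates += [(impurity(CL, CR + [c]), c, "R") for c in remaining]
        _, c, side = min(candidates)
        (CL if side == "L" else CR).append(c)
        remaining.remove(c)
    return CL, CR

CL, CR = selection_partition([0, 1, 2, 3], toy_impurity)
print(CL, CR)   # the nearby class pairs end up together: [0, 1] [3, 2]
```

Because there is no backtracking, the result depends entirely on the initial pair, which is why the furthest-means heuristic matters.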

4.3. Class Separation by Exchange Method

The second method in class separation is the exchange method (Guo and Gelfand, 1992). This is a local search with backtracking. Let t be a decision node and C = {C1, …, Cc} be the set of c classes at node t.

  1. Select an initial partition of all c classes into two subsets, CL and CR.

  2. Train the network to separate CL and CR. Compute the entropy E0 with the selected entropy formula.

  3. For each class Ck ∈ {C1, …, Cc}, form the partitions CL(k) and CR(k) by changing the assignment of the class Ck in the partitions CL and CR. Train the neural network with the partitions CL(k) and CR(k). Compute the entropy Ek and the decrease in entropy ΔEk = E0 - Ek.

  4. Let ΔE* be the maximum of the impurity decreases over all classes. If this impurity decrease is less than 0 then exit; else set CL = CL(k) and CR = CR(k) for the maximizing class Ck, and do steps 2 to 4 again.

We use a heuristic technique to select the initial partition, instead of starting randomly.

  1. The two classes Ci and Cj whose means are furthest apart are found and placed into CL and CR respectively.

  2. Among the remaining classes Ck ∈ {C1, …, Cc}, find the one with the minimum distance to CL or CR and put it into that group. Repeat this step until no classes are left.
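The exchange method can be sketched in the same style. Again, the network training and entropy computation of steps 2 and 3 are replaced by a caller-supplied impurity function (an assumption of this sketch), with a toy impurity on class means used for the demonstration:

```python
# Toy class means (an assumed example; classes 0,1 and 2,3 form two clusters).
means = {0: (0.0, 0.0), 1: (0.1, 0.0), 2: (5.0, 5.0), 3: (5.1, 5.0)}

def toy_impurity(CL, CR):
    # Stand-in for "train the network, compute the entropy".
    total = 0.0
    for group in (CL, CR):
        pts = [means[c] for c in group]
        centroid = [sum(v) / len(pts) for v in zip(*pts)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in pts)
    return total

def exchange_partition(classes, impurity, CL, CR):
    E0 = impurity(CL, CR)                     # step 2
    while True:
        best_gain, best = 0.0, None
        for c in classes:
            # Step 3: form CL(k), CR(k) by flipping class c to the other side.
            newL = [x for x in CL if x != c] + ([c] if c in CR else [])
            newR = [x for x in CR if x != c] + ([c] if c in CL else [])
            if not newL or not newR:
                continue                      # never empty a side
            gain = E0 - impurity(newL, newR)  # decrease in entropy
            if gain > best_gain:
                best_gain, best = gain, (newL, newR)
        if best is None:                      # step 4: no flip decreases entropy
            return CL, CR
        CL, CR = best
        E0 = impurity(CL, CR)

CL, CR = exchange_partition([0, 1, 2, 3], toy_impurity, [0, 2], [1, 3])
print(CL, CR)   # starting from a bad partition, flips recover [0, 1] [3, 2]
```

Unlike the selection method, a class placed early can be moved later, which is what makes this a local search with backtracking.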


In this section we discuss how a statistical approach can be used to determine the set of coefficients in a linear decision node: when we want to find a new split at a node of the tree, we use a statistical criterion to determine the coefficients of the features.

The statistical approach we use is Fisher's Linear Discriminant Analysis (LDA) (Duda and Hart, 1973). This approach aims to find a linear combination of the variables that separates the two classes as much as possible, which is what we want to do. The criterion proposed by Fisher is the ratio of between-class to within-class variance, which we want to maximize. Formally, we must find a direction w that maximizes

JF(w) = |w^T mL - w^T mR|^2 / (w^T Sw w), (5.1)
where mL and mR are the means of the left and right class groups,

mL = (1/nL) Σ_{x ∈ CL} x,   mR = (1/nR) Σ_{x ∈ CR} x, (5.2)

and Sw is the within-class covariance matrix, which is bias-corrected as

Sw = ((nL - 1) ΣL + (nR - 1) ΣR) / (nL + nR - 2),
where ΣL and ΣR are the covariance matrices of the class groups CL and CR respectively,

ΣL = (1/(nL - 1)) Σ_{x ∈ CL} (x - mL)(x - mL)^T,   ΣR = (1/(nR - 1)) Σ_{x ∈ CR} (x - mR)(x - mR)^T.

There are nL samples in the left class group and nR samples in the right class group. The solution for w that maximizes JF can be obtained by differentiating JF with respect to w and equating the derivative to zero. This defines w, the set of coefficients of the linear combination, as

w = Sw^-1 (mL - mR). (5.3)
But the coefficients of the features give only the direction; we must also specify a threshold w0, which is defined as

w0 = -(1/2) (mL + mR)^T Sw^-1 (mL - mR) + log(nL / nR). (5.4)
Note however that though the direction given in Equation 5.3 has been derived without any assumptions of normality, normal assumptions are used in Equation 5.4 to set the threshold for discrimination.

So in our case, we first divide the classes into two groups CL and CR by an appropriate class separation procedure (as defined in Section 4.2 or 4.3). But this time the inner optimization is carried out by LDA, as we find the linear split with Equations 5.3 and 5.4. There is no training, only simple matrix calculations, so we expect LDA to take less time than training a neural network.
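The LDA split amounts to a few matrix operations. The sketch below follows Equations 5.3 and 5.4; the log(nL/nR) prior term in the threshold is the usual normal-assumption form and should be treated as an assumption of this sketch:

```python
import numpy as np

def lda_split(XL, XR):
    """Fisher split between the instances of CL (XL) and CR (XR).

    XL, XR: arrays of shape (n_L, f) and (n_R, f).
    Returns (w, w0); an instance x goes to the left subtree when w @ x > -w0.
    """
    nL, nR = len(XL), len(XR)
    mL, mR = XL.mean(axis=0), XR.mean(axis=0)
    SL = np.cov(XL, rowvar=False)                 # unbiased group covariances
    SR = np.cov(XR, rowvar=False)
    Sw = ((nL - 1) * SL + (nR - 1) * SR) / (nL + nR - 2)
    w = np.linalg.solve(Sw, mL - mR)              # direction, Equation 5.3
    w0 = -0.5 * (mL + mR) @ w + np.log(nL / nR)   # threshold, Equation 5.4
    return w, w0

# Two well-separated toy groups (assumed demonstration data).
rng = np.random.default_rng(0)
XL = rng.normal([0.0, 0.0], 0.3, size=(50, 2))    # group CL
XR = rng.normal([4.0, 4.0], 0.3, size=(50, 2))    # group CR
w, w0 = lda_split(XL, XR)
# Every training instance is routed to its own side of the split.
print(np.all(XL @ w + w0 > 0), np.all(XR @ w + w0 < 0))
```

Note that np.linalg.solve assumes Sw is invertible; the singular case is handled by the PCA preprocessing discussed next in the text.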

In implementing this idea, we see that if there is a linear dependency between two or more features, then some rows or columns of the covariance matrix Sw are equal to, or multiples of, others and the matrix becomes singular. In this case the determinant is zero and we cannot compute Sw^-1.

A possible answer to this problem is principal component analysis (PCA) (Rencher, 1995). In this analysis we first find the eigenvectors and eigenvalues of the matrix Sw. As the product of the eigenvalues of a matrix is equal to its determinant, we sort the eigenvalues of the matrix and discard the eigenvectors with small eigenvalues. Let the eigenvalues of Sw be λ1, λ2, …, λf, sorted in decreasing order. We find the new number of features k such that

(λ1 + λ2 + … + λk) / (λ1 + λ2 + … + λf) ≥ θ,

where θ is the proportion of variance explained and k is the number of eigenvectors that are linearly independent. After finding k, we find the corresponding k eigenvectors and store them. Each instance with f features is multiplied by the k eigenvectors to obtain a new instance with k features, so the instance space is mapped from f dimensions into k dimensions. We then perform LDA with the mapped instances, find a new Sw and new means, and calculate the split. The mapped Sw now has an inverse, so there is no problem.
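The PCA preprocessing can be sketched as below. The threshold theta is the proportion of variance to be explained (its value here is an assumed parameter), and the demonstration duplicates a feature so that Sw is singular before the mapping and invertible after it:

```python
import numpy as np

def pca_reduce(Sw, X, theta=0.99):
    vals, vecs = np.linalg.eigh(Sw)        # Sw is symmetric
    order = np.argsort(vals)[::-1]         # eigenvalues in decreasing order
    vals, vecs = vals[order], vecs[:, order]
    explained = np.cumsum(vals) / vals.sum()
    k = int(np.searchsorted(explained, theta) + 1)  # smallest k reaching theta
    W = vecs[:, :k]                        # the k kept eigenvectors
    return X @ W, W                        # instances mapped to k dimensions

# Assumed demonstration data: the third feature duplicates the first.
rng = np.random.default_rng(1)
a, b = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([a, b, a])
Sw = np.cov(X, rowvar=False)               # singular: det(Sw) is (near) zero
Z, W = pca_reduce(Sw, X)
print(Z.shape)                             # f = 3 mapped down to k = 2
```

After this mapping, LDA proceeds on Z, whose covariance matrix has full rank and can be inverted.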

In order to test the tree, we first convert each test instance into the k-dimensional space and then multiply the mapped instance by the linear split found with LDA. If the result is greater than -w0, the instance is assigned to the left subtree of the node; otherwise it goes to the right subtree.
