Multivariate decision trees for machine learning

4.2. Class Separation by Selection Method


Class separation is a process that must be carried out when there are three or more classes available at the decision node. If there are c classes, then there are 2^(c-1) - 1 distinct two-way partitions possible; for example, with c = 4 classes there are 2^3 - 1 = 7 distinct partitions. Because we cannot test them all, we use heuristics to obtain a reasonable partition in a reasonable amount of time. The first method we have used for class separation is the selection method, which is a depth-first search with no backtracking. Let t be a decision node and C = {C1, …, Cc} be the set of c classes at node t.

  1. Select an initial partition CL and CR, where each part consists of the examples of only one arbitrarily chosen class, Ci and Cj respectively.

  2. Train the network at node t with the given partition. Do not consider the elements of other classes yet.

  3. Among the remaining classes in the class list, search for the class Ck that is best placed into one of the two parts. The best placement is the one that assigns Ck to the child with the minimum impurity.

  4. Put Ck into its best partition and continue adding classes one by one using steps 2 to 4 until no more classes are left.

This algorithm is sensitive to the initial class partition due to its depth-first nature, so a heuristic technique is used to select the initial partition instead of choosing it randomly. The Euclidean distance between the means of every pair of classes is calculated, and the two classes that are furthest apart are taken as the initial two classes.

The algorithm repeats steps 2 to 4 c - 2 times, so its complexity is O(c), where c is the number of classes available at node t.
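As an illustration, a minimal Python sketch of the selection method is given below. The network training at node t is abstracted into a train_and_impurity callback that trains on a candidate partition and returns its impurity; this callback and all other names are illustrative, not taken from the thesis.

```python
import numpy as np

def selection_partition(class_means, train_and_impurity):
    """Greedy (depth-first, no backtracking) class separation by selection.

    class_means:        dict mapping each class at node t to its mean vector
    train_and_impurity: callback (CL, CR) -> impurity of the split obtained
                        after training the node model on that partition
                        (a stand-in for the network training in the text)
    """
    classes = list(class_means)

    # Step 1: start from the two classes whose means are furthest apart.
    ci, cj = max(((a, b) for i, a in enumerate(classes) for b in classes[i + 1:]),
                 key=lambda p: np.linalg.norm(class_means[p[0]] - class_means[p[1]]))
    CL, CR = [ci], [cj]
    remaining = [c for c in classes if c not in (ci, cj)]

    # Steps 2-4: c-2 passes; in each pass place the class whose best
    # (left or right) assignment gives the lowest impurity.
    while remaining:
        best_ck, best_side, best_imp = None, None, float("inf")
        for ck in remaining:
            for side, (left, right) in (("L", (CL + [ck], CR)),
                                        ("R", (CL, CR + [ck]))):
                imp = train_and_impurity(left, right)
                if imp < best_imp:
                    best_ck, best_side, best_imp = ck, side, imp
        (CL if best_side == "L" else CR).append(best_ck)
        remaining.remove(best_ck)
    return CL, CR
```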


4.3. Class Separation by Exchange Method


The second method we have used for class separation is the exchange method (Guo and Gelfand, 1992), a local search with backtracking. Let t be a decision node and C = {C1, …, Cc} be the set of c classes at node t.

  1. Select an initial partition of the classes into two subsets, CL and CR.

  2. Train the network to separate CL and CR. Compute the entropy E0 with the selected entropy formula.

  3. For each of the classes Ck ∈ {C1, …, Cc}, form the partitions CL(k) and CR(k) by changing the assignment of the class Ck in the partitions CL and CR. Train the neural network with the partitions CL(k) and CR(k). Compute the entropy Ek and the decrease in entropy ΔEk = E0 - Ek.

  4. Let ΔE* be the maximum of the entropy decreases over all classes, attained by some class Ck*. If ΔE* is less than 0, exit; otherwise set CL = CL(k*) and CR = CR(k*) and repeat steps 2 to 4.

We use a heuristic technique to start, instead of starting randomly.

  1. The two classes Ci and Cj whose means are furthest apart are found and placed into CL and CR respectively.

  2. Among the remaining classes Ck ∈ {C1, …, Cc}, find the one with the minimum distance to CL or CR and put it into that group. Repeat this step until no classes are left.
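A sketch in the same spirit for the exchange method, combining the heuristic initialisation with the exchange loop. Network training is again abstracted into a train_and_entropy callback, and the distance of a class to a group is taken here as the distance to the nearest class mean in that group, which is an assumption the text does not spell out; all names are illustrative.

```python
import numpy as np

def exchange_partition(class_means, train_and_entropy):
    """Class separation by the exchange method: local search over class
    assignments, moving the single class whose switch of sides decreases
    the entropy the most, until no switch improves the split."""
    classes = list(class_means)

    # Heuristic start: the two most distant classes seed CL and CR.
    ci, cj = max(((a, b) for i, a in enumerate(classes) for b in classes[i + 1:]),
                 key=lambda p: np.linalg.norm(class_means[p[0]] - class_means[p[1]]))
    CL, CR = [ci], [cj]

    def dist_to(group, c):
        # Distance of class c to a group: distance to the nearest class mean
        # in that group (an assumption, not specified in the text).
        return min(np.linalg.norm(class_means[c] - class_means[g]) for g in group)

    remaining = [c for c in classes if c not in (ci, cj)]
    while remaining:
        ck = min(remaining, key=lambda c: min(dist_to(CL, c), dist_to(CR, c)))
        (CL if dist_to(CL, ck) <= dist_to(CR, ck) else CR).append(ck)
        remaining.remove(ck)

    # Exchange loop.
    E0 = train_and_entropy(CL, CR)
    while True:
        best_k, best_E = None, E0
        for ck in classes:
            source = CL if ck in CL else CR
            if len(source) == 1:               # never empty one of the children
                continue
            newL = [c for c in CL if c != ck] + ([ck] if ck in CR else [])
            newR = [c for c in CR if c != ck] + ([ck] if ck in CL else [])
            Ek = train_and_entropy(newL, newR)
            if Ek < best_E:                    # entropy decrease E0 - Ek > 0
                best_k, best_E = ck, Ek
        if best_k is None:                     # no exchange lowers the entropy
            return CL, CR
        if best_k in CL:
            CL.remove(best_k)
            CR.append(best_k)
        else:
            CR.remove(best_k)
            CL.append(best_k)
        E0 = best_E
```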

5. THE STATISTICAL MODEL FOR TREE CONSTRUCTION


In this section we will discuss how to use a statistical approach to determine the set of coefficients in a linear decision node. So when we want to find a new split in a node of the tree, we use a statistical criterion to determine the coefficients of the features.

The statistical approach we use is Fisher's Linear Discriminant Analysis (LDA) (Duda and Hart, 1973). It aims to find a linear combination of the variables that separates the two classes as much as possible, which is exactly what we need. The criterion proposed by Fisher is the ratio of the between-class variance to the within-class variance, which we want to maximize. Formally, we must find a direction w that maximizes



$$J_F(w) = \frac{(w^T m_L - w^T m_R)^2}{w^T S_W w} \qquad (5.1)$$

where m_L and m_R are the means of the left and right class groups,



$$m_L = \frac{1}{n_L} \sum_{x \in C_L} x, \qquad m_R = \frac{1}{n_R} \sum_{x \in C_R} x, \qquad (5.2)$$

and S_W is the within-class covariance matrix, which is bias-corrected as



$$S_W = \frac{(n_L - 1)\Sigma_L + (n_R - 1)\Sigma_R}{n_L + n_R - 2} \qquad (5.3)$$

where Σ_L and Σ_R are the covariance matrices of the class groups C_L and C_R respectively,



$$\Sigma_L = \frac{1}{n_L - 1} \sum_{x \in C_L} (x - m_L)(x - m_L)^T, \qquad \Sigma_R = \frac{1}{n_R - 1} \sum_{x \in C_R} (x - m_R)(x - m_R)^T. \qquad (5.4)$$

There are n_L samples in the left class group and n_R samples in the right class group. The solution for w that maximizes J_F can be obtained by differentiating J_F with respect to w and equating it to zero. This gives w, the set of coefficients of the linear combination, as:



$$w = S_W^{-1}(m_L - m_R) \qquad (5.5)$$

But the coefficients of the features only determine the direction of the split; we must also specify a threshold w_0, which is defined as:



$$w_0 = -\frac{1}{2}(m_L + m_R)^T S_W^{-1}(m_L - m_R) + \log\frac{n_L}{n_R} \qquad (5.6)$$

Note, however, that although the direction given in Equation 5.5 has been derived without any assumptions of normality, normality assumptions are used in Equation 5.6 to set the threshold for discrimination.

So in our case, we first divide the classes into two groups CL and CR by an appropriate class selection procedure (as described in Section 4.2 or 4.3). But this time the inner optimization procedure is carried out by LDA, as we find the linear split with Equations 5.5 and 5.6. There is no training, only simple matrix calculations, so we expect LDA to take less time than training a neural network.
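For illustration, here is a minimal sketch of the LDA computation at a node, following Equations 5.2 to 5.6 as given above; lda_split and its arguments are illustrative names, and the sketch assumes S_W is invertible (the PCA step discussed below handles the singular case).

```python
import numpy as np

def lda_split(XL, XR):
    """Fisher/LDA linear split for a node (Eqs. 5.2-5.6).

    XL, XR: (n_L x f) and (n_R x f) arrays holding the instances of the left
            and right class groups. Returns (w, w0) so that an instance x is
            sent to the left subtree when w @ x > -w0."""
    nL, nR = len(XL), len(XR)
    mL, mR = XL.mean(axis=0), XR.mean(axis=0)                     # Eq. 5.2

    SL = np.cov(XL, rowvar=False)                                 # Eq. 5.4
    SR = np.cov(XR, rowvar=False)
    SW = ((nL - 1) * SL + (nR - 1) * SR) / (nL + nR - 2)          # Eq. 5.3

    SW_inv = np.linalg.inv(SW)
    w = SW_inv @ (mL - mR)                                        # Eq. 5.5
    w0 = -0.5 * (mL + mR) @ SW_inv @ (mL - mR) + np.log(nL / nR)  # Eq. 5.6
    return w, w0
```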

In implementing this idea, we see that if there is a linear dependency between two or more features, then some rows or columns of the covariance matrix S_W are identical or multiples of one another, and the matrix becomes singular. In this case the determinant is zero and S_W^{-1} cannot be computed.

A possible answer to this problem is principal component analysis (PCA) (Rencher, 1995). In this analysis we first find the eigenvectors and eigenvalues of the matrix S_W. Since the product of the eigenvalues of a matrix equals its determinant, we sort the eigenvalues and discard the eigenvectors with small eigenvalues. Let the eigenvalues of S_W be λ_1 ≥ λ_2 ≥ … ≥ λ_f, sorted in decreasing order. We find the new number of features k such that

$$\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{f} \lambda_i} \geq \theta \qquad (5.7)$$

where θ is the proportion of variance explained and k is the number of linearly independent eigenvectors we retain. After finding k, we find the corresponding k eigenvectors and store them. Each instance with f features is multiplied by the k eigenvectors to obtain a new instance with k features, so the instance space is mapped from f dimensions into k dimensions. We then apply LDA to the mapped instances, computing a new S_W and new means and calculating the split. The mapped S_W now has an inverse, so there is no problem.
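A sketch of this PCA step is shown below; the variance-explained threshold of 0.99 is an illustrative choice, not a value taken from the thesis.

```python
import numpy as np

def pca_projection(SW, variance_explained=0.99):
    """Eigen-decomposition of S_W used to drop directions with tiny eigenvalues.

    Returns the (f x k) matrix of the k retained eigenvectors; instances are
    mapped into the reduced space with X @ V before LDA is run again."""
    eigvals, eigvecs = np.linalg.eigh(SW)        # SW is symmetric
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues decreasingly
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    ratio = np.cumsum(eigvals) / eigvals.sum()   # left-hand side of Eq. 5.7
    k = min(int(np.searchsorted(ratio, variance_explained)) + 1, len(eigvals))
    return eigvecs[:, :k]                        # f x k projection matrix
```

In use, both class groups are first mapped with the returned eigenvector matrix (for example XL @ V and XR @ V) before lda_split from the previous sketch is applied to the mapped instances.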

In order to test the tree, we first map each test instance into the k-dimensional space and then multiply the mapped instance by the linear split found with LDA. If the result is greater than -w_0, the instance is sent to the left subtree of the node; otherwise it is sent to the right subtree.
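Putting the pieces together at test time, a hypothetical helper (using the V, w and w0 produced by the sketches above) might look like this:

```python
def route_test_instance(x, V, w, w0):
    """Map a test instance into the reduced k-dimensional space and apply the
    LDA split: left subtree when w . x' > -w0, right subtree otherwise."""
    x_reduced = x @ V                  # f-dimensional instance -> k-dimensional
    return "left" if w @ x_reduced > -w0 else "right"
```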
