Class separation must be performed when there are three or more classes at a decision node. If there are c classes, then there are 2^{c−1} − 1 distinct two-way partitions possible. Because we cannot test them all, we use heuristics to find a reasonable partition in a reasonable amount of time. The first method we have used in class separation is the selection method, which is a depth-first search method with no backtracking. Let t be a decision node and C = {C_{1}, …, C_{c}} be the set of c classes at node t.

1. Select an initial partition, C_{L} and C_{R}, where each part consists only of the examples of one of two arbitrarily chosen classes, C_{i} and C_{j} respectively.

2. Train the network at node t with the given partition. Do not consider the examples of the other classes yet.

3. Among the remaining classes, search for the class C_{k} that is best placed into one of the partitions; the best placement assigns C_{k} to the child where the impurity is minimum.

4. Put C_{k} into its best partition and continue adding classes one by one, repeating steps 2 to 4 until no more classes are left.
This algorithm is sensitive to the initial class partition because of its depth-first nature, so instead of selecting the initial partition randomly, we use a heuristic: the Euclidean distance between the means of every pair of classes is computed, and the two furthest classes are taken as the initial two classes.
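The selection method above can be sketched as follows. This is a minimal illustration, not the original implementation: the network training at each step is abstracted into a caller-supplied `impurity_of(left, right)` function, and the names `furthest_pair` and `select_partition` are our own.

```python
import numpy as np

def furthest_pair(class_data):
    """Return the two class labels whose means are furthest apart (Euclidean)."""
    means = {k: x.mean(axis=0) for k, x in class_data.items()}
    labels = list(means)
    best, best_d = None, -1.0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            d = np.linalg.norm(means[labels[i]] - means[labels[j]])
            if d > best_d:
                best, best_d = (labels[i], labels[j]), d
    return best

def select_partition(class_data, impurity_of):
    """Greedy (depth-first, no backtracking) class separation.

    class_data: dict mapping class label -> (n_k, f) array of instances.
    impurity_of(left, right): stands in for training the node's network on
    the partition and returning the resulting impurity.
    """
    ci, cj = furthest_pair(class_data)       # heuristic initial partition
    left, right = {ci}, {cj}
    for k in class_data:                     # place remaining classes one by one
        if k in left or k in right:
            continue
        # assign C_k to the child where the impurity is minimum
        if impurity_of(left | {k}, right) <= impurity_of(left, right | {k}):
            left.add(k)
        else:
            right.add(k)
    return left, right
```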
The algorithm traces steps 2 to 4 c − 2 times, so its complexity is O(c), where c is the number of classes available at node t.
The second method in class separation is the exchange method (Guo and Gelfand, 1992). This is a local search with backtracking. Let t be a decision node and C = {C_{1}, …, C_{c}} be the set of c classes at node t.

1. Select an initial partition of all c classes into two subsets, C_{L} and C_{R}.

2. Train the network to separate C_{L} and C_{R}. Compute the entropy E_{0} with the selected entropy formula.

3. For each class C_{k} ∈ {C_{1}, …, C_{c}}, form the partitions C_{L}(k) and C_{R}(k) by changing the assignment of C_{k} in the partitions C_{L} and C_{R}. Train the neural network with the partitions C_{L}(k) and C_{R}(k), compute the entropy E_{k}, and compute the decrease in entropy ΔE_{k} = E_{0} − E_{k}.

4. Let ΔE_{k*} be the maximum of the entropy decreases over all classes. If this decrease is less than 0, exit; otherwise set C_{L} = C_{L}(k*) and C_{R} = C_{R}(k*), and repeat steps 2 to 4.
We use a heuristic technique to choose the starting partition, instead of starting randomly:

1. Find the two classes C_{i} and C_{j} whose means are furthest apart and place them into C_{L} and C_{R} respectively.

2. Among the remaining classes C_{k} ∈ {C_{1}, …, C_{c}}, find the one with the minimum distance to C_{L} or C_{R} and put it into that group. Repeat this step until no more classes are left.
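The exchange loop above can be sketched as follows. As with the previous sketch, the training-plus-entropy computation is abstracted into a caller-supplied `entropy_of(left, right)` function, and the function name is our own; the loop also stops on a zero decrease to guarantee termination.

```python
def exchange_method(classes, entropy_of, init_left, init_right):
    """Local search with backtracking over class-to-child assignments.

    entropy_of(left, right) is assumed to train the node's discriminant on
    the given partition and return the resulting entropy (a stand-in for
    the neural-network training of the original method).
    """
    left, right = set(init_left), set(init_right)
    while True:
        e0 = entropy_of(left, right)
        best_k, best_drop = None, 0.0
        for k in classes:
            # move C_k to the other child and measure the entropy decrease
            new_left = (left - {k}) if k in left else (left | {k})
            new_right = (right | {k}) if k in left else (right - {k})
            if not new_left or not new_right:
                continue  # skip degenerate one-sided partitions
            drop = e0 - entropy_of(new_left, new_right)
            if drop > best_drop:
                best_k, best_drop = k, drop
        if best_k is None:  # no exchange decreases the entropy: stop
            return left, right
        if best_k in left:
            left.remove(best_k); right.add(best_k)
        else:
            right.remove(best_k); left.add(best_k)
```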
5. THE STATISTICAL MODEL FOR TREE CONSTRUCTION
In this section we discuss how a statistical approach can be used to determine the set of coefficients in a linear decision node: when we search for a new split at a node of the tree, a statistical criterion determines the coefficients of the features.
The statistical approach is Fisher's Linear Discriminant Analysis (LDA) (Duda and Hart, 1973). It aims to find a linear combination of the variables that separates the two classes as much as possible, which is exactly what we want. The criterion proposed by Fisher is the ratio of between-class to within-class variance, which we want to maximize. Formally, we must find a direction w to maximize
J_F(w) = \frac{(w^T (m_L - m_R))^2}{w^T S_W w}  (5.1)
where m_{L} and m_{R} are the means of the left and right class groups,

m_L = \frac{1}{n_L} \sum_{x \in C_L} x, \quad m_R = \frac{1}{n_R} \sum_{x \in C_R} x  (5.2)
and S_{W} is the within-class covariance matrix, bias-corrected as

S_W = \frac{(n_L - 1)\,\Sigma_L + (n_R - 1)\,\Sigma_R}{n_L + n_R - 2}  (5.3)
where Σ_{L} and Σ_{R} are the covariance matrices of the class groups C_{L} and C_{R} respectively:

\Sigma_L = \frac{1}{n_L - 1} \sum_{x \in C_L} (x - m_L)(x - m_L)^T, \quad \Sigma_R = \frac{1}{n_R - 1} \sum_{x \in C_R} (x - m_R)(x - m_R)^T  (5.4)
There are n_{L} samples in the left class group and n_{R} samples in the right class group. The w that maximizes J_F is obtained by differentiating J_F with respect to w and equating the derivative to zero. This gives w, the set of coefficients of the linear combination:

w = S_W^{-1} (m_L - m_R)  (5.5)
But the coefficients of the features give only the direction; we must also specify a threshold w_{0}, which is defined as:

w_0 = \frac{1}{2} (m_L + m_R)^T S_W^{-1} (m_L - m_R) - \log \frac{n_L}{n_R}  (5.6)
Note, however, that although the direction given in Equation 5.5 is derived without any assumption of normality, normal assumptions are used in Equation 5.6 to set the threshold for discrimination.
So in our case, we first divide the classes into two groups C_{L} and C_{R} by an appropriate class selection procedure (as defined in Sections 4.2 and 4.3), but this time the inner optimization is carried out by LDA: the linear split is found with Equations 5.5 and 5.6. There is no training, only simple matrix calculations, so we expect LDA to take much less time than a neural network.
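The matrix calculations of Equations 5.3, 5.5 and 5.6 can be sketched in a few lines of NumPy; the function name `lda_split` is our own, and the data layout (one row per instance) is an assumption.

```python
import numpy as np

def lda_split(XL, XR):
    """Fisher's linear discriminant split between two class groups.

    XL, XR: (n_L, f) and (n_R, f) arrays of instances routed left and right.
    Returns (w, w0); an instance x is sent left when w @ x > w0.
    """
    nL, nR = len(XL), len(XR)
    mL, mR = XL.mean(axis=0), XR.mean(axis=0)
    # bias-corrected pooled within-class covariance (Equation 5.3)
    SL = np.cov(XL, rowvar=False)  # np.cov divides by n_L - 1
    SR = np.cov(XR, rowvar=False)
    SW = ((nL - 1) * SL + (nR - 1) * SR) / (nL + nR - 2)
    w = np.linalg.solve(SW, mL - mR)             # direction (Equation 5.5)
    w0 = 0.5 * (mL + mR) @ w - np.log(nL / nR)   # threshold (Equation 5.6)
    return w, w0
```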
In implementing this idea, we see that if there is a linear dependency between two or more features, some rows or columns of the covariance matrix S_{W} are equal to, or multiples of, others, and the matrix becomes singular. Its determinant is then zero and S_{W}^{-1} cannot be computed.
A possible answer to this problem is principal component analysis (PCA) (Rencher, 1995). We first find the eigenvectors and eigenvalues of the matrix S_{W}. Since the product of the eigenvalues of a matrix equals its determinant, we sort the eigenvalues and discard the eigenvectors with small eigenvalues. Let the eigenvalues of S_{W} be λ_{1} ≥ λ_{2} ≥ … ≥ λ_{f}, sorted in decreasing order. We find the new number of features k such that
\frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{f} \lambda_i} \ge \epsilon  (5.7)

where ε is the proportion of variance explained; k is then the number of eigenvectors that are effectively linearly independent. After finding k, we find the corresponding k eigenvectors and store them. Each instance with f features is multiplied by the k eigenvectors to obtain a new instance with k features, so the instance space is mapped from f dimensions to k dimensions. We then perform LDA on the mapped instances, find a new S_{W} and new means, and calculate the split. The new S_{W} is invertible, so there is no problem.
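The eigenvalue screening of Equation 5.7 can be sketched as follows; `pca_reduce` and the `explained` parameter (the ε of Equation 5.7) are illustrative names of ours.

```python
import numpy as np

def pca_reduce(SW, X, explained=0.99):
    """Drop the small-eigenvalue directions of S_W so it becomes invertible.

    SW: (f, f) within-class covariance matrix; X: (n, f) instances.
    Returns the mapped instances X @ V_k and the projection matrix V_k (f x k).
    """
    eigvals, eigvecs = np.linalg.eigh(SW)    # S_W is symmetric
    order = np.argsort(eigvals)[::-1]        # sort eigenvalues, decreasing
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratios, explained) + 1)  # smallest k with Eq. 5.7
    Vk = eigvecs[:, :k]
    return X @ Vk, Vk
```

After this mapping, the covariance of the reduced instances has no near-zero eigenvalues, so LDA can proceed.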
In order to test the tree, we first map the test instance into the k-dimensional space and then multiply the mapped instance by the linear split found with LDA. If the result is greater than w_{0}, the instance is assigned to the left subtree of the node; otherwise it is assigned to the right subtree.
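The test-time routing just described amounts to two dot products; `route` is an illustrative name, and `Vk`, `w`, `w0` are assumed to come from the PCA and LDA steps above.

```python
import numpy as np

def route(x, Vk, w, w0):
    """Route a test instance: project from f to k dimensions, then split."""
    z = Vk.T @ x                  # map the instance into the k-dim space
    return "left" if w @ z > w0 else "right"
```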
