In the univariate decision tree algorithm ID3, a symbolic attribute with m possible values produces an m-way split, one branch for each value, whereas a numeric attribute produces a two-way split of the form xi ≤ a and xi > a. Most partition-merit criteria give better results as the number of splits increases, so symbolic attributes have an advantage over numeric features when the split attribute is selected. To test whether this is indeed an advantage, we have also added a multiple-split option for numeric attributes to our decision tree induction algorithms (basically for ID3).
If there are s split points a1, a2, …, as, the instances are divided into s+1 parts as xi < a1, a1 ≤ xi < a2, …, as-1 ≤ xi < as, and xi ≥ as. If the feature xi has k distinct values, an s-way split can be chosen in k^s different ways. During splitting, we consider each possible partition of the instances into s+1 parts, calculate the partition-merit criterion for each partition pj, and choose the best one.
These multiple splits apply only to numeric attributes; data sets containing only symbolic attributes show no change in their results. In case of ties, this method prefers the smaller number of splits, so that the tree has fewer nodes.
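The exhaustive search over split points described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes entropy as the partition-merit criterion (to be minimized) and takes the candidate split points to be midpoints between consecutive distinct values of the attribute.

```python
from itertools import combinations
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_s_way_split(values, labels, s):
    """Try every choice of s split points among the candidate thresholds of a
    numeric attribute and return the choice minimizing the weighted entropy
    of the resulting s+1 partitions."""
    distinct = sorted(set(values))
    # Candidate split points: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = (float("inf"), None)
    for points in combinations(candidates, s):
        # Partition the instances into s+1 parts by the chosen split points.
        parts = [[] for _ in range(s + 1)]
        for x, y in zip(values, labels):
            idx = sum(x >= p for p in points)  # interval that x falls into
            parts[idx].append(y)
        merit = sum(len(p) / len(labels) * entropy(p) for p in parts if p)
        if merit < best[0]:
            best = (merit, points)
    return best
```

For a fixed s the loop enumerates all combinations of candidate thresholds, which is what makes the search space, and hence the running time, grow so quickly with s.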
FIGURE 2.4.1 A decision tree with multiple splits
Figure 2.4.1 shows a decision tree with multiple splits. For the first decision node, three split points (5, 7 and 10) are selected, giving the branches a < 5, 5 ≤ a < 7, 7 ≤ a < 10 and a ≥ 10, whereas there is one split point for the second decision node and there are two split points for the third. As can be seen, the attribute A is used twice while constructing the tree.
This change also affects the running time of the algorithm. Because of the search in a space of size k^s, selecting a split has time complexity O(n^s) instead of O(n), so the total complexity becomes O(f * n^s) instead of O(f * n), which can be prohibitive for a large number of samples. We have therefore limited s to at most three.
2.5. Filling in Missing Values
In some cases, some of the values of features may not be available. In such cases we must fill in the values of missing cases. There are a number of ways (Quinlan, 1989):
Ignoring any instance with a missing attribute value. This reduces the number of available instances.
Filling in the missing value with the most likely value of that attribute.
Combining the results of classification over each possible value, weighted by the probability of that value.
In our implementation, if the missing value is for an ordered feature, we fill it with the sample mean value of that attribute as it appears in the training set. When the missing feature is unordered, we fill it with the most probable value of that attribute in the data set. These statistics are stored so that the same filling can also be done for the test set. This method is called mean imputation.
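The mean-imputation scheme described above can be sketched as follows. The data representation (a list of dicts with None marking a missing value) and the function names are illustrative assumptions, not from the original text; what the sketch preserves is the rule itself: sample mean for ordered attributes, most frequent value for unordered ones, with the statistics learned on the training set and reused for the test set.

```python
def fit_imputer(train, numeric_cols):
    """Learn fill-in statistics from the training set: the sample mean for
    ordered (numeric) attributes, the most frequent value for unordered
    ones. `train` is a list of dicts; None marks a missing value."""
    stats = {}
    for col in train[0].keys():
        present = [row[col] for row in train if row[col] is not None]
        if col in numeric_cols:
            stats[col] = sum(present) / len(present)
        else:
            stats[col] = max(set(present), key=present.count)
    return stats

def impute(rows, stats):
    """Fill missing values with the stored statistics, so the same filling
    can also be applied to the test set."""
    return [{c: (stats[c] if v is None else v) for c, v in row.items()}
            for row in rows]
```

Storing the statistics separately from the filling step is what lets the training-set means and modes be applied unchanged to test instances.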
Growing the tree deep enough to classify all of the training examples perfectly does not always give good results. This occurs mainly for two reasons. First, the data set may contain noise, and fitting every example means fitting the noise as well, which reduces performance on the test set. Second, the training set may not be a representative (large enough) sample of the underlying data. In either case, both univariate and multivariate algorithms can produce trees that overfit the training examples.
FIGURE 2.6.1. Overfitting in Learning
As the ID3 algorithm adds new nodes to the tree, the training set accuracy increases. However, the test set accuracy first increases and then decreases, as can be seen in Figure 2.6.1.
One possibility is to prune unnecessary nodes or subtrees after the construction of the tree to avoid overfitting. In our implementation, we have used two types of pruning methods namely pre-pruning and post-pruning.
Pre-pruning methods simplify decision trees by preventing the tree from growing to completion. A simple form of pre-pruning that stops tree expansion at depth two performs surprisingly well (Holte, 1993). More commonly, however, the tree is not expanded further when no gain, or not enough gain, is expected.
Pre-pruning methods are more efficient than post-pruning methods because they terminate tree generation earlier whereas post-pruning methods require a post-processing step, where the tree is pruned back to give a smaller tree.
In our implementation of pre-pruning, we stop splitting further when the instance ratio (the number of instances at that node divided by the number of instances in the whole data set) falls below a threshold (e.g., five percent). We then create a leaf node and label it with the class Ci having the most instances.
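The stopping rule above amounts to a two-line check, sketched here with illustrative function names and the five-percent threshold from the text as a default:

```python
def should_pre_prune(n_node, n_total, threshold=0.05):
    """Stop splitting and create a leaf when the node's instance ratio
    (instances at the node / instances in the whole data set) falls
    below the threshold (five percent in the text)."""
    return n_node / n_total < threshold

def majority_class(labels):
    """Label for the new leaf: the class with the most instances."""
    return max(set(labels), key=labels.count)
```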
Pre-pruning methods give inconsistent performance because of the horizon effect (Breiman et al., 1984), which arises from stopping tree expansion prematurely. After this inconsistent behavior was noticed, research in this area was abandoned in favor of post-processing methods. In large-scale data sets, however, where efficiency may be more important than accuracy, pre-pruning methods may be considered again.
Post-pruning is the most frequently used tree simplification approach, producing a pruned tree T′ from an unpruned tree T. Pruning replaces a subtree with a leaf node if doing so improves the accuracy on a pruning set distinct from the training set.
If a decision tree is expanded using only the homogeneity stopping criterion, it makes no resubstitution errors on the training set; thus post-pruning can only increase resubstitution error. However, an expanded tree may overfit the sample space by learning noise, so post-pruning can decrease the error rate on unseen test cases.
FIGURE 2.6.2 A pruned subtree
In the post-pruning algorithm, we check, at each internal node, how the classification accuracy on the pruning set changes when the subtree rooted at that node is pruned. If the classification accuracy does not decrease, we prune that subtree into a leaf node.
Figure 2.6.2 shows a decision tree where the nodes numbered 8 and 9 are pruned. After pruning, the subtree below decision node 7 is converted into a leaf node.
In post-pruning we use a set distinct from the training and test sets, called the pruning set. In our implementation, we divide the whole data set equally into a training set and a test set, and then take 80 percent of the training set as the growing set and the remaining 20 percent as the pruning set. One observed disadvantage of this division is that it reduces the number of training cases available for tree induction, which is undesirable for small data sets.
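The division described above can be sketched as follows; the shuffle and its seed are illustrative assumptions (the text does not specify how instances are assigned to the halves), while the 50/50 and 80/20 proportions are from the text:

```python
import random

def split_dataset(data, seed=7):
    """Split the full data set equally into training and test halves, then
    carve the training half into an 80% growing set and a 20% pruning set.
    The shuffle seed is illustrative, not from the original text."""
    data = data[:]
    random.Random(seed).shuffle(data)
    half = len(data) // 2
    train, test = data[:half], data[half:]
    cut = int(0.8 * len(train))
    return train[:cut], train[cut:], test  # growing, pruning, test sets
```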