2.2.Identification Trees (ID3)
Identification trees classify instances by sorting them down the tree from the root to some leaf node. Each internal node in the tree specifies a test of one attribute of the instance, and each branch descending from that node corresponds to one of the possible values or intervals for this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the corresponding tree branch. This process is recursively repeated for the subtree of the new node until a leaf node is reached. The leaf node stores the class code.
In the ID3 algorithm (Quinlan, 1989) the best split is found as follows:
For each attribute xi do the following:
If the feature is symbolic with m possible values, the instances are divided into m groups, where in each group the instances have the same value for the attribute xi. Calculate the partition-merit criteria (will be explained in Section 2.3) as pi.
If the feature is numeric, the instances could be divided into two in k different ways where k is the number of different values of the attribute xi. For each of these k ways, the partition-merit criterion is computed and the best is selected as pi.
Find the attribute j such that pj = mini pi as the split node attribute.
If xi is symbolic with m possible values, partition the set of instances into m subsets where at each partition xj = ak, k = 1, …, m.
If xi is numeric, partition the set of instances into two; xj a and xj > a, where a is the split threshold that optimizes the partition-merit criterion.
This algorithm has a complexity of O (f * n) where f is the number of attributes and n is the number of instances.
To classify the instances in the test set, we start from the root node and trace the tree node by node. At each internal node, we take the subtree for which the instance has the correct attribute value or on the correct side of the split threshold. At each leaf node, if the instance has the same class as the class of the leaf node the instance is correctly classified, else it is misclassified.
The central choice in tree algorithms is finding the best split. Each split partitions the sample into two or more parts and each subset of the partition has one or more classes in it. If there is only one class in a subset, then it is pure, else it is impure. The purer the partition is, the better it is. This measure of impurity is specified as the partition-merit criteria. An impurity measure has the characteristic of being minimum when there are only instances from one class and maximum when all classes have the same number of instances.
There are different types of partition-merit criteria. A comparative study (Mingers, 1989), found no significant differences in accuracy whereas there could be an interaction between partition-merit criteria and data sets (Brodley and Utgoff, 1995). We have used three types of impurity measures. They are:
Weak Theory Learning Measure
The Gini Index
IGURE 2.3.1 Graphs of Impurity Measures
Impurity measures are shown graphically in Figure 2.3.1. For simplicity, this graph is plotted for two classes. p denotes the probability of occurrence of the first class and I (p) denotes the calculated impurity measure. We see that the Weak Theory Learning Measure has the highest concavity whereas the Gini Index has the lowest concavity. So the Weak Theory Learning Measure is the most discriminating function.
2.3.1.Weak Theory Learning Measure
Theoretical motivation for an impurity measure is defined as
G (p) = (2.1)
where p is the probability of positive instances (Dietterich et. al., 1996). This model gives a basis for our implementation. So given a node t with c different classes where each class Ci has the probability of occurrence pi, the impurity of node t is
Let S be a split at a node t with m-way split. Assume there are c classes and there are nk instances having the attribute value ak, where nkl of them belong to class l. n is the total number of instances at that node satisfying , . Then the impurity of split S is
Impurity (S) = (2.3)
This measure of information gained from a particular split is popularized in (Quinlan, 1986). Quinlan takes the famous entropy formula of Information Theory, which is the minimum number of bits to encode the classification of an arbitrary member of a collection S. So if we transform this idea into our problem, the entropy of node t is
Entropy (t) = (2.4)
whereas the entropy of a split S is
Entropy (S) = (2.5)
The Gini Criterion (or Index) was first proposed in (Breiman et. al., 1984). The Gini Index is originally defined as the probability of misclassification of a set of instances, rather than the impurity of the split. Gini index of a node t is
Gini (t) = (2.6)
From that index we can refer the Gini index of a split as