2.2. Identification Trees (ID3)
Identification trees classify instances by sorting them down the tree from the root to some leaf node. Each internal node in the tree specifies a test of one attribute of the instance, and each branch descending from that node corresponds to one of the possible values or intervals for this attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the corresponding tree branch. This process is recursively repeated for the subtree of the new node until a leaf node is reached. The leaf node stores the class code.
In the ID3 algorithm (Quinlan, 1986) the best split is found as follows:
- For each attribute xi do the following:
  - If the attribute is symbolic with m possible values, divide the instances into m groups, where in each group the instances have the same value for the attribute xi. Calculate the partition-merit criterion (explained in Section 2.3) as pi.
  - If the attribute is numeric, the instances can be divided into two in k different ways, where k is the number of distinct values of the attribute xi. For each of these k ways, compute the partition-merit criterion and select the best as pi.
- Find the attribute j such that pj = min_i pi and take it as the split attribute.
- If xj is symbolic with m possible values, partition the set of instances into m subsets, where in subset k, xj = ak, k = 1, …, m.
- If xj is numeric, partition the set of instances into two: xj ≤ a and xj > a, where a is the split threshold that optimizes the partition-merit criterion.
This algorithm has a complexity of O(f · n), where f is the number of attributes and n is the number of instances.
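The split search above can be sketched as follows. This is a minimal illustration, not the thesis implementation: function and variable names are hypothetical, and the Gini index (one of the three criteria listed in Section 2.3) stands in for the partition-merit criterion.

```python
def gini(counts):
    """Gini index of a node, given per-class instance counts (placeholder criterion)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def split_impurity(groups, n_classes):
    """Weighted impurity of a candidate partition (a list of instance groups)."""
    n = sum(len(g) for g in groups)
    total = 0.0
    for g in groups:
        if not g:
            continue
        counts = [sum(1 for _, y in g if y == c) for c in range(n_classes)]
        total += len(g) / n * gini(counts)
    return total

def best_split(instances, symbolic, n_classes):
    """ID3-style split search. instances are (feature_vector, class) pairs;
    symbolic[i] is True when attribute i is symbolic."""
    best = (None, None, float('inf'))          # (attribute, threshold, impurity)
    n_attrs = len(instances[0][0])
    for i in range(n_attrs):
        if symbolic[i]:
            # m-way partition: one group per distinct symbolic value
            values = sorted({x[i] for x, _ in instances})
            groups = [[(x, y) for x, y in instances if x[i] == v] for v in values]
            p = split_impurity(groups, n_classes)
            if p < best[2]:
                best = (i, None, p)
        else:
            # binary partition: try a threshold between consecutive distinct values
            values = sorted({x[i] for x, _ in instances})
            for lo, hi in zip(values, values[1:]):
                a = (lo + hi) / 2.0
                left = [(x, y) for x, y in instances if x[i] <= a]
                right = [(x, y) for x, y in instances if x[i] > a]
                p = split_impurity([left, right], n_classes)
                if p < best[2]:
                    best = (i, a, p)
    return best
```

For a numeric attribute this tries all k − 1 midpoints between consecutive distinct values, matching the k-way threshold search described above.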
To classify the instances in the test set, we start from the root node and trace the tree node by node. At each internal node, we follow the subtree for which the instance has the matching attribute value or lies on the correct side of the split threshold. When a leaf node is reached, if the instance has the same class as the class of the leaf node, it is correctly classified; otherwise it is misclassified.
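The root-to-leaf tracing can be sketched with a small node structure; the `Node` class and field names here are hypothetical, chosen only to illustrate the traversal.

```python
class Node:
    """Hypothetical tree node: leaves carry a class code, internal nodes a test."""
    def __init__(self, klass=None, attr=None, threshold=None, children=None):
        self.klass = klass            # class code at a leaf, else None
        self.attr = attr              # index of the tested attribute
        self.threshold = threshold    # None for a symbolic attribute
        self.children = children      # dict value->Node, or (left, right) pair

def classify(node, x):
    """Trace the tree from the root to a leaf; return the leaf's class code."""
    while node.klass is None:
        if node.threshold is None:                # symbolic: follow value branch
            node = node.children[x[node.attr]]
        elif x[node.attr] <= node.threshold:      # numeric: left of threshold
            node = node.children[0]
        else:                                     # numeric: right of threshold
            node = node.children[1]
    return node.klass
```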
2.3. Partition-Merit Criteria
The central choice in tree algorithms is finding the best split. Each split partitions the sample into two or more parts, and each subset of the partition contains one or more classes. If there is only one class in a subset, it is pure; otherwise it is impure. The purer the partition, the better it is. This measure of impurity is called the partition-merit criterion. An impurity measure is minimum when there are instances from only one class and maximum when all classes have the same number of instances.
There are different types of partition-merit criteria. A comparative study (Mingers, 1989) found no significant differences in accuracy, although there may be an interaction between the partition-merit criterion and the data set (Brodley and Utgoff, 1995). We have used three types of impurity measures:
- Weak Theory Learning Measure
- Information Gain
- The Gini Index
Figure 2.3.1: Graphs of Impurity Measures
The impurity measures are shown graphically in Figure 2.3.1. For simplicity, the graph is plotted for two classes: p denotes the probability of occurrence of the first class and I(p) denotes the calculated impurity. We see that the Weak Theory Learning Measure has the highest concavity, whereas the Gini Index has the lowest; the Weak Theory Learning Measure is therefore the most discriminating function.
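The two-class curves of Figure 2.3.1 can be checked numerically, assuming the standard two-class forms of the three measures (2√(p(1−p)), binary entropy, and 2p(1−p)); function names are illustrative.

```python
import math

def weak_theory(p):
    """Two-class Weak Theory Learning Measure: 2*sqrt(p*(1-p))."""
    return 2.0 * math.sqrt(p * (1.0 - p))

def entropy(p):
    """Two-class entropy: -p*log2(p) - (1-p)*log2(1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def gini(p):
    """Two-class Gini index: 1 - p^2 - (1-p)^2 = 2*p*(1-p)."""
    return 2.0 * p * (1.0 - p)
```

Evaluating the three functions over (0, 1) confirms the ordering described above: the weak theory curve lies above the entropy curve, which lies above the Gini curve.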
2.3.1. Weak Theory Learning Measure
The theoretical motivation for this impurity measure is the two-class splitting criterion

G(p) = 2\sqrt{p(1-p)} \qquad (2.1)

where p is the probability of positive instances (Dietterich et al., 1996). This model gives a basis for our implementation. So, given a node t with c different classes where each class Ci has probability of occurrence pi, the impurity of node t is

Impurity(t) = \sum_{i=1}^{c} 2\sqrt{p_i (1 - p_i)} \qquad (2.2)
Let S be an m-way split at node t. Assume there are c classes and nk instances having the attribute value ak, where nkl of them belong to class l; n is the total number of instances at the node, so that \sum_{k=1}^{m} n_k = n and \sum_{l=1}^{c} n_{kl} = n_k. Then the impurity of split S is

Impurity(S) = \sum_{k=1}^{m} \frac{n_k}{n} \sum_{l=1}^{c} 2\sqrt{\frac{n_{kl}}{n_k}\left(1 - \frac{n_{kl}}{n_k}\right)} \qquad (2.3)
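A sketch of this criterion, assuming the multi-class generalization that sums 2√(p_i(1−p_i)) over classes and weights each branch by its fraction of instances; function names are hypothetical.

```python
import math

def weak_theory_node(probs):
    """Node impurity: sum over classes of 2*sqrt(p_i*(1-p_i))."""
    return sum(2.0 * math.sqrt(p * (1.0 - p)) for p in probs)

def weak_theory_split(counts):
    """Impurity of an m-way split; counts[k][l] = n_kl, the number of
    instances in branch k that belong to class l."""
    n = sum(sum(row) for row in counts)
    total = 0.0
    for row in counts:
        nk = sum(row)
        if nk:
            total += nk / n * weak_theory_node([nkl / nk for nkl in row])
    return total
```

A split that separates the classes perfectly scores 0, the minimum; a split whose branches are evenly mixed scores the maximum.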
2.3.2. Information Gain
This measure of the information gained from a particular split was popularized in (Quinlan, 1986). Quinlan takes the entropy formula of information theory, which gives the minimum number of bits needed to encode the classification of an arbitrary member of a collection S. Transforming this idea to our problem, the entropy of node t is

Entropy(t) = -\sum_{i=1}^{c} p_i \log_2 p_i \qquad (2.4)
whereas the entropy of a split S is

Entropy(S) = -\sum_{k=1}^{m} \frac{n_k}{n} \sum_{l=1}^{c} \frac{n_{kl}}{n_k} \log_2 \frac{n_{kl}}{n_k} \qquad (2.5)
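A minimal sketch of the entropy criterion, with the same branch-weighting scheme as the other split measures; names are illustrative.

```python
import math

def entropy_node(probs):
    """Node entropy: -sum of p_i*log2(p_i), with 0*log(0) taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

def entropy_split(counts):
    """Entropy of an m-way split; counts[k][l] = n_kl."""
    n = sum(sum(row) for row in counts)
    return sum(sum(row) / n * entropy_node([c / sum(row) for c in row])
               for row in counts if sum(row))
```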
2.3.3. Gini Index
The Gini Criterion (or Index) was first proposed in (Breiman et al., 1984). The Gini Index was originally defined as the probability of misclassification of a set of instances, rather than as the impurity of a split. The Gini index of a node t is

Gini(t) = 1 - \sum_{i=1}^{c} p_i^2 \qquad (2.6)
From that node index, the Gini index of a split follows as

Gini(S) = \sum_{k=1}^{m} \frac{n_k}{n}\left(1 - \sum_{l=1}^{c}\left(\frac{n_{kl}}{n_k}\right)^2\right) \qquad (2.7)
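A minimal sketch of the Gini criterion in the same style as the previous measures; function names are illustrative.

```python
def gini_node(probs):
    """Node Gini index: 1 - sum of p_i squared."""
    return 1.0 - sum(p * p for p in probs)

def gini_split(counts):
    """Gini index of an m-way split; counts[k][l] = n_kl."""
    n = sum(sum(row) for row in counts)
    return sum(sum(row) / n * gini_node([c / sum(row) for c in row])
               for row in counts if sum(row))
```

For two classes a pure node scores 0 and an evenly mixed node scores 0.5, the lowest maximum of the three measures, matching Figure 2.3.1.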