In this study, we have detailed and compared univariate, linear and nonlinear decision tree methods using a set of simulations on twenty standard data sets. For univariate decision tree methods, we have used the ID3 algorithm and for multivariate decision tree methods, we have used the CART algorithm. For linear and nonlinear methods, we have used neural networks at each decision node. We also propose to use the Linear Discriminant Analysis (LDA) algorithm in constructing linear multivariate decision trees.
The comparison results of these four decision tree methods can be seen in Figures 7.1, 7.2, 7.3, 7.4, 7.5, 7.6 and 7.7. The first four figures compare the four methods in terms of accuracy, tree size and learning time. Last three figures compare the four methods in terms of two of these criteria.
Results for ID3 and CART can be grouped as follows:
There is not a significant difference between different impurity measures in terms of accuracy, tree size and learning time.
Post-pruning is better than pre-pruning in terms of tree size; they are same in terms of accuracy and learning time. Also post-pruning sometimes finds significantly better trees than pre-pruning, where pre-pruning stops tree creation process earlier.
Feature selection does not always improve performance of CART, but increases learning time significantly.
Results for neural trees can be grouped as follows:
There is not a significant difference between the F-test and the t-tests in terms of accuracy and tree size. But t-test has high learning time because of 30-fold crossvalidation.
Among the class separation heuristics, the exchange method is better than the selection method in terms of accuracy and tree size but is worse in terms of learning time.
There is low difference in terms of tree size and accuracy between neural trees (Linear, Nonlinear and Hybrid).
Learning time results of neural trees is in the following order: Hybrid > Nonlinear > Linear.
Results for LDA trees can be grouped as follows:
Using PCA with low percentage of explained variance decreases performance in accuracy and tree size.
Using PCA when it is not required decreases performance in accuracy, tree size and learning time.
Results for univariate and multivariate methods can be ordered as follows:
Accuracy: ID-LP = ID-LDA > CART ?= ID3.
Tree Size: ID3 > CART > ID-LDA > ID-LP.
Learning Time: CART > ID-LP > ID-LDA > ID3.
We can conclude by the following statements:
If the features are not correlated, we should use univariate decision trees (ID3)
If the features are correlated, we should use multivariate decision trees.
If pre-pruning is to be applied, we should not use nonlinear multivariate decision trees.
If a multivariate method is to be used, do not use CART, instead:
If time is important, use ID-LDA.
If space is important, use ID-LP.
ID-LDA has the same accuracy as ID-LP.
Brief description of data sets used in thesis is given below: (See http://www.ics.uci.edu/~mlearn/MLRepository.html)
Breast: This data set was created by Dr. William H. Wolberg (physician) University of Wisconsin Hospitals USA. The aim is to detect the type of breast cancer (malignant or benign) from 9 different attributes.
Bupa: This data set was created by BUPA Medical Research Ltd. Five Attributes of the data set are blood tests results, by which users want to find out liver disorders induced by alcohol consumption.
Car: This data set was created by Marko Bohanec. Car Evaluation Database was derived from a simple hierarchical decision model originally developed for the demonstration of DEX (M. Bohanec, V. Rajkovic: Expert system for decision making. Sistemica 1(1), pp. 145-157, 1990.)
Cylinder: This data set was created by Bob Evans RR Donnelley & Sons Co. 40 attributes are used to determine the existence of cylinder bands.
Dermatology: This data set was created by Nilsen İlter from Gazi University and H Altan Güvenir from Bilkent University. The aim is to determine the type of Eryhemato-Squamous Disease from clinical and histopathological attributes.
Ecoli: This data set was created by Kenta Nakio from Institue of Molecular and Cellular Biology in Osaka University. The aim is to localize the site of the protein.
Flare: This data set was created by Gary Bradshaw. The database contains 3 potential classes, one for the number of times a certain type of solar flare occured in a 24 hour period. Each instance represents captured features for 1 active region on the sun.
Glass: This data set was created by B. German from Central Research Establishment. The aim is to find out type of the glass.
Hepatitis: The task for this domain is to predict from test results whether a patient will live or die from hepatitis.
Horse: This data set was created by Mary McLeish and Matt Cecile. The aim of this data set is to determine whether the lesion is surgical.
Iris: This data set was created by R. A. Fisher. This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. The aim is to decide the class of the iris plant.
Ironosphere: This data set was created by Vince Sigillito. The aim is to find out the type of the radar returns. "Good" radar returns are those showing evidence of some type of structure in the ionosphere. "Bad" returns are those that do not; their signals pass through the ionosphere.
Monks: This data set was created by Sebastian Thrun from Carnegie Mellon University. The MONK's problem were the basis of a first international comparison of learning algorithms. The result of this comparison is summarised in "The MONK's Problems - A Performance Comparison of Different Learning algorithms".
Mushroom: Mushroom records were drawn from The Audubon Society Field Guide to North American Mushrooms. This data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as edible or poisonous.
Ocrdigits: This data set was created by E. Alpaydın and C. Kaynak from Boğaziçi University. The aim is to recognize optically of ten different digits.
Pendigits: This data set was created by E. Alpaydın and F. Alimoğlu from Boğaziçi University. The aim is to recognize pen written digits.
Segment: This data set was created by Vision Groups from University of Massachusetts. For this data set the task is to learn to segment an image into seven classes: sky, cement, window, brick, grass, foliage and path. The data set was formed from seven images of buildings from the University of Massachusetts campus that were segmented by hand to create the class labels.
Vote: This data set was drawn from Congressional Quarterly Almanac. This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The task is to determine the party of the senators voted.
Wine: This data set was formed by Forina, M. et al, PARVUS of Institute of Pharmaceutical and Food Analysis and Technologies. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
Zoo: This data set was created by Richard Forsyth. The animal in zoo are divided into seven groups and the task is to find from 17 different attributes of the animals the type of the animal.
Statistical tests we have used in this thesis are the F test and the t test.
Steps for the 5x2 F tests are as follows :
Split the original data randomly into two equal-sized parts. Call the first one training set and the other one, the test set.
Run the two algorithms on the training set and test on the test set.
For each algorithm divide the number of correct classifications to the size of the test set.
Record also other measures such as the number of nodes in the tree and the average time spent in learning.
Exchange train and test sets, do steps 2, 3 and 4 again.
In this test, pi(j) is the difference between the error rates of the two methods on fold j = 1, 2 of replication i = 1, …, 5. The average on replication i is pi = (pi(1) + pi(2)) / 2 and the variance is si2 = (pi(1)- pi)2 + (pi(2)- pi)2. The following statistic is approximately F distributed with ten and five degrees of freedom:
According to the value of f, the hypothesis that they have the same error rate is rejected or accepted according to a specified confidence level.
Steps for the 30 fold cross-validation t test are :
Partition the available data into 30 disjoint subsets T1, T2, … , T30 of equal size.
For each subset Ti use it for test set and the remaining data for training set.
Train both algorithms with the training set and test them on the test set. Call the difference between error rates of the two methods on the test set at iteration i, i. Let denote the average of i.
Find the estimate of the standard deviation. The following statistic is approximately t-distributed with 29 degrees of freedom.
According to the value of t, the hypothesis that they have the same error rate is accepted or rejected according to a specified confidence level.
Alpaydın, E., “Combined 5x2 cv F Test for Comparing Supervised Classification Learning Algorithms”, Neural Computation, Vol. 11, pp.1975-1982, 1999.
Bishop, C., Neural Networks for Pattern Recognition, Oxford University Press, 1996.
Breiman, L., J. H. Friedman, R. A. Olshen and C. J. Stone, Classification and Regression Trees, Wadsworth: Belmont, CA, 1984.
Breslow, L. A. and D. W. Aha, Simplifying Decision Trees: A Survey, NCARAI Technical Report No. AIC-96-014, 1997.
Brodley, C. E. and P. E. Utgoff, “Multivariate Decision Trees”, Machine Learning Vol. 19, pp. 45-77, 1995.
Dietterich, T., M. Kearns and Y. Mansour, “Applying the weak learning framework to understand and improve C4.5”, Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, 1996.
Duda, R. O. and P. E. Hart, Pattern Classification and Scene Analysis, Wiley-Interscience Publication, 1973.
Esposito, F., D. Malerba and G. Semeraro, “Decision tree pruning as a search in the state space”, Proceedings of the European Conference on Machine Learning, Vienna, Austria: Springer-Verlag, pp.165-184, 1993.
Fayyad U. M. and K. B. Irani, “The attribute selection problem in decision tree generation”, In Proceedings of AAAI-92, pp. 104-110, 1992.
Guo, H. and S. B. Gelfond, “Classification Trees with Neural Network Feature Extraction,” IEEE Transactions on Neural Networks, Vol. 3, pp. 923-933, 1992.
Hampson, S. E. and D. J. Volper, “Linear Function neurons: Structure and Training”, Biological Cybernetics, Vol. 53, pp. 203-210, 1986.
Holte, R. C., “Very simple classification rules perform well on most commonly used data sets”, Machine Learning, Vol. 11, pp. 63-91, 1993.
Mathues, C. J. and L. A. Rendell, “Constructive induction on decision trees”, In IJCAI-89, pp. 645-650, 1989.
Merz, C. J. and P. M. Murphy, UCI Repository of Machine Learning Databases, 1998, http://www.ics.uci.edu/~mlearn/MLRepository.html.
Mingers, J., “An empirical comparison of selection measures for decision tree induction”, Machine Learning, Vol. 3, pp. 319-342, 1989.
Mitchell, T., Machine Learning, McGraw-Hill, 1996.
Pagallo G. and D. Haussler, “Boolean feature discovery in empirical learning”, Machine Learning, pp. 71-99, 1990.
Rencher, A. C., Methods of multivariate analysis, Wiley Series, 1995.
Quinlan, J. R., “Induction of decision trees”, Machine Learning, Vol. 1, pp. 81-106, 1986.
Quinlan, J. R., “Unknown attribute values in induction”, Proceedings of the Sixth International Workshop on Machine Learning pp. 164-168, 1989.
Quinlan, J. R., “C4.5: Programs for machine learning”, San Mateo, CA: Morgan Kaufmann, 1993.
Utgoff, P. E. and C. E. Brodley. “Linear Machine decision trees”, (COINS Technical Report 91-10), Amberst, MA: University of Massachusetts, Department of Computer and Information Science, 1991.
REFERENCES NOT CITED
Bennett, K. P. and O. L. Mangasarian, “Robust linear programming discrimination of two linearly inseparable sets”, Optimization Methods and Software, Vol. 1, pp. 23-24, 1992.
Murthy, K. S., S. Kasif and S. Salzberg, “A system for induction of oblique decision trees”, Journal of Artificial Intelligence Research, Vol. 2, pp. 1-32, 1994.
Oliver, J. J., “Decision Graphs – An Extension of Decision Trees”, Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics, pp. 343-350, 1993.