
6.1. Results for Identification Trees


Throughout the results that follow, the definitions in Table 6.1.1 apply:

TABLE 6.1.1 Definition of methods

Name of the Method | Uni/Multi | Impurity Measure | Pruning | Multiple Splits
ID3 | Uni | Information Gain | Pre-pruning | No
ID3Gini | Uni | Gini Index | Pre-pruning | No
ID3Root | Uni | Weak Theory Learning Measure | Pre-pruning | No
ID3P | Uni | Information Gain | Post-pruning | No
ID3-2 | Uni | Information Gain | Pre-pruning | Yes (degree 2)
ID3-3 | Uni | Information Gain | Post-pruning | Yes (degree 3)

6.1.1. Comparison of Different Kinds of Learning Measures


In this part, the three impurity measures are compared: Information Gain, the Gini Index, and the Weak Theory Learning Measure. Pre-pruning is applied throughout. This section compares the three measures in terms of accuracy, node count, and learning time. Accuracy results are shown in Table 6.1.1.1 and Figure 6.1.1.1, node results in Table 6.1.1.2 and Figure 6.1.1.2, and learning time results in Table 6.1.1.3 and Figure 6.1.1.3.
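As a concrete illustration, the following minimal Python sketch computes the impurity decrease of a candidate split under each of the three measures. All function names here are illustrative; in particular, the exact form of the Weak Theory Learning Measure used by ID3Root is not specified in this section, so the sqrt(p(1-p)) form below (a Kearns-Mansour-style square-root criterion, suggested by the name "ID3Root") is an assumption.

import math
from collections import Counter

def class_probs(labels):
    # Class probability distribution of a list of labels.
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def entropy(labels):
    # Entropy impurity, the basis of Information Gain (ID3).
    return -sum(p * math.log2(p) for p in class_probs(labels) if p > 0)

def gini(labels):
    # Gini index impurity (ID3Gini).
    return 1.0 - sum(p * p for p in class_probs(labels))

def weak_theory(labels):
    # Assumed form of the Weak Theory Learning Measure (ID3Root):
    # sum over classes of sqrt(p * (1 - p)); the thesis may define it
    # differently.
    return sum(math.sqrt(p * (1.0 - p)) for p in class_probs(labels))

def impurity_drop(parent, children, measure):
    # Weighted impurity decrease achieved by a candidate split.
    n = len(parent)
    after = sum(len(ch) / n * measure(ch) for ch in children)
    return measure(parent) - after

# Toy example: a binary split of 10 labelled instances.
parent = ["+"] * 6 + ["-"] * 4
children = [["+"] * 5 + ["-"], ["+"] + ["-"] * 3]
for m in (entropy, gini, weak_theory):
    print(m.__name__, round(impurity_drop(parent, children, m), 4))

All three measures are concave in the class probabilities, so they tend to rank candidate splits similarly, which is consistent with the mostly insignificant accuracy differences reported below.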

For the three impurity measures there is no significant difference in accuracy, except in one data set (Segment).

For the larger data sets, those with more than 1000 samples, ID3Root produces significantly smaller trees than ID3, and ID3 significantly smaller trees than ID3Gini, in three of five cases. In the other cases no significant increase or decrease is found.

For mixed data sets, where continuous and discrete attributes occur together, the discrete attribute with the largest arity tends to be selected first as the split attribute. This is due to the fragmentation problem: a high-arity split partitions the data into many small subsets, each of which looks accidentally purer, inflating the apparent impurity decrease.
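A small synthetic demonstration of this arity bias (hypothetical data; it reuses entropy() and impurity_drop() from the sketch above): two attributes that are both pure noise, where the higher-arity one nevertheless receives the larger apparent gain.

import random

random.seed(0)
labels = [random.choice("AB") for _ in range(200)]

def gain_of_attribute(values, labels):
    # Group labels by attribute value, then measure the entropy drop.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return impurity_drop(labels, list(groups.values()), entropy)

noise_binary = [random.randrange(2) for _ in labels]   # arity 2, pure noise
noise_wide = [random.randrange(20) for _ in labels]    # arity 20, pure noise
print("binary noise  :", round(gain_of_attribute(noise_binary, labels), 4))
print("arity-20 noise:", round(gain_of_attribute(noise_wide, labels), 4))

The arity-20 attribute fragments the 200 instances into many small, accidentally purer subsets, so its spurious gain is typically an order of magnitude larger; once such an attribute is chosen, each branch is left with little data for the splits below it.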

As the node count increases, learning time increases accordingly, and this effect becomes significant as the data set size grows.

In terms of learning time, ID3 is significantly faster than ID3Gini in seven of the 20 data sets.



TABLE 6.1.1.1 Accuracy results for three different types of impurity measures (%, mean ± standard deviation)

Data set name | ID3 | ID3Gini | ID3Root | Significance
Breast | 94.11±1.24 | 94.13±1.57 | 94.34±1.53 |
Bupa | 62.26±5.33 | 60.46±3.85 | 61.39±4.38 |
Car | 80.97±1.26 | 80.49±1.32 | 80.90±1.24 |
Cylinder | 68.50±2.22 | 67.39±2.91 | 70.06±3.66 |
Dermatology | 92.84±2.37 | 93.33±2.36 | 92.51±2.32 |
Ecoli | 78.10±3.57 | 78.21±2.50 | 77.92±4.18 |
Flare | 85.26±2.03 | 84.89±2.00 | 85.07±2.14 |
Glass | 60.65±5.97 | 59.35±6.14 | 63.36±6.27 |
Hepatitis | 78.44±3.71 | 74.33±10.36 | 75.47±5.53 |
Horse | 87.55±1.98 | 87.50±1.93 | 87.83±1.94 |
Iris | 93.87±2.75 | 93.87±2.75 | 93.47±2.47 |
Ionosphere | 87.63±3.15 | 84.96±2.83 | 87.00±2.37 |
Monks | 92.27±10.15 | 92.22±10.20 | 92.22±10.20 |
Mushroom | 99.70±0.06 | 99.68±0.08 | 99.62±0.15 |
Ocrdigits | 78.40±1.47 | 77.33±1.74 | 76.74±1.32 |
Pendigits | 85.73±1.01 | 86.59±0.85 | 85.37±1.16 |
Segment | 91.08±1.16 | 90.49±2.02 | 89.08±1.06 | 1>>3
Vote | 94.94±1.06 | 95.63±1.83 | 94.94±0.94 |
Wine | 88.65±3.72 | 89.55±3.97 | 90.11±3.78 |
Zoo | 92.06±4.80 | 92.26±4.75 | 92.45±4.79 |

TABLE 6.1.1.2 Node results for three different types of impurity measures (mean ± standard deviation)

Data set name | ID3 | ID3Gini | ID3Root | Significance
Breast | 17.00±2.11 | 18.80±2.90 | 18.60±4.30 |
Bupa | 53.40±5.48 | 54.20±6.20 | 58.00±6.75 |
Car | 25.40±0.70 | 25.10±0.74 | 25.60±0.52 |
Cylinder | 54.10±5.90 | 59.40±7.75 | 56.60±6.70 | 2>>1
Dermatology | 20.40±2.67 | 19.80±2.15 | 19.20±2.39 |
Ecoli | 33.80±2.70 | 35.00±2.67 | 34.60±6.52 |
Flare | 37.90±4.51 | 39.30±4.30 | 37.50±3.87 |
Glass | 38.20±5.90 | 38.80±4.16 | 40.20±5.27 |
Hepatitis | 19.60±3.78 | 20.60±2.95 | 21.20±2.90 |
Horse | 55.80±5.92 | 56.60±6.64 | 55.80±5.92 |
Iris | 8.40±1.35 | 8.40±1.35 | 8.40±1.35 |
Ionosphere | 19.20±3.05 | 21.60±4.53 | 19.60±3.41 |
Monks | 25.40±13.53 | 25.20±13.21 | 25.20±13.21 |
Mushroom | 23.00±0.00 | 24.40±1.71 | 22.40±1.26 |
Ocrdigits | 74.40±4.01 | 97.80±7.50 | 61.40±3.63 | 2>>1>>>3
Pendigits | 81.80±5.51 | 99.20±7.27 | 67.80±3.79 | 2>>>1>>>3
Segment | 41.80±3.79 | 47.80±5.43 | 34.40±3.66 | 2>>3
Vote | 18.20±3.16 | 18.00±3.02 | 19.40±3.86 |
Wine | 10.40±1.35 | 10.20±2.15 | 9.20±1.75 |
Zoo | 15.00±1.89 | 14.60±1.58 | 14.60±1.58 |

TABLE 6.1.1.3 Learning time results for different types of impurity measures (in seconds, mean ± standard deviation)

Data set name | ID3 | ID3Gini | ID3Root | Significance
Breast | 2±0 | 3±1 | 2±0 | 2>>1
Bupa | 3±1 | 4±0 | 5±1 | 2>>1
Car | 5±0 | 5±0 | 5±0 |
Cylinder | 10±2 | 11±1 | 10±1 | 2>1
Dermatology | 3±0 | 3±0 | 3±0 |
Ecoli | 3±0 | 4±1 | 4±1 | 3>1
Flare | 2±0 | 2±0 | 2±1 |
Glass | 3±0 | 4±0 | 4±1 | 2>>1
Hepatitis | 1±0 | 2±0 | 1±0 |
Horse | 4±0 | 4±1 | 4±0 |
Iris | 0±0 | 0±0 | 0±0 |
Ionosphere | 39±7 | 48±7 | 39±7 |
Monks | 2±1 | 2±1 | 2±1 |
Mushroom | 113±33 | 84±33 | 92±66 | 3>2,1>>3
Ocrdigits | 207±9 | 254±40 | 170±24 | 2>>1>>>3
Pendigits | 476±22 | 516±107 | 415±90 | 3>>1>>>2
Segment | 345±10 | 493±33 | 342±56 | 2>>>1>>3
Vote | 1±0 | 1±0 | 1±1 |
Wine | 1±0 | 2±0 | 2±0 |
Zoo | 1±0 | 1±0 | 1±0 |

6.1.2. Comparison of Pruning Techniques


As mentioned, two different pruning techniques have been used: pre-pruning and post-pruning. For simplicity, Information Gain is used as the impurity measure. In this section we investigate which pruning technique performs better. Accuracy results for the two techniques are given in Table 6.1.2.1 and Figure 6.1.2.1, node results in Table 6.1.2.2 and Figure 6.1.2.2, and learning time results in Table 6.1.2.3 and Figure 6.1.2.3.
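The following minimal sketch shows where the two techniques intervene during learning. The min_gain threshold and the reduced-error pruning rule are illustrative assumptions rather than the exact procedures used in these experiments; the sketch reuses entropy() and impurity_drop() from Section 6.1.1.

from dataclasses import dataclass, field
from collections import Counter
from typing import Optional

@dataclass
class Node:
    label: str                      # majority class at this node
    attr: Optional[int] = None      # split attribute index (None = leaf)
    children: dict = field(default_factory=dict)

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def grow(X, y, min_gain=0.01):
    # Pre-pruning: stop growing when the best split's impurity decrease
    # falls below a threshold (min_gain is an assumed knob, not the
    # thesis value).
    node = Node(label=majority(y))
    best_attr, best_gain = None, 0.0
    for a in range(len(X[0])):
        parts = {}
        for row, lab in zip(X, y):
            parts.setdefault(row[a], []).append(lab)
        g = impurity_drop(y, list(parts.values()), entropy)
        if g > best_gain:
            best_attr, best_gain = a, g
    if best_attr is not None and best_gain >= min_gain:   # pre-pruning decision
        node.attr = best_attr
        for v in set(row[best_attr] for row in X):
            rows = [(r, l) for r, l in zip(X, y) if r[best_attr] == v]
            node.children[v] = grow([r for r, _ in rows],
                                    [l for _, l in rows], min_gain)
    return node

def classify(node, row):
    if node.attr is None or row[node.attr] not in node.children:
        return node.label
    return classify(node.children[row[node.attr]], row)

def errors(node, X, y):
    return sum(classify(node, r) != l for r, l in zip(X, y))

def post_prune(node, X, y):
    # Post-pruning (reduced-error style): grow the full tree first, then
    # replace a subtree by a leaf whenever the leaf does no worse on a
    # held-out pruning set (X, y).
    if node.attr is None:
        return node
    for v in list(node.children):
        rows = [(r, l) for r, l in zip(X, y) if r[node.attr] == v]
        node.children[v] = post_prune(node.children[v],
                                      [r for r, _ in rows], [l for _, l in rows])
    leaf = Node(label=node.label)
    if errors(leaf, X, y) <= errors(node, X, y):   # prune if not worse
        return leaf
    return node

Pre-pruning decides while growing, which is cheap but can stop too early; post-pruning grows the full tree first and then cuts back, which is the source of the horizon-effect and running-time differences discussed below.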

TABLE 6.1.2.1 Accuracy results for pre-pruning and post-pruning techniques (%, mean ± standard deviation)

Data set name | ID3 | ID3P | Significance
Breast | 94.11±1.24 | 94.68±1.84 |
Bupa | 62.26±5.33 | 62.84±3.39 |
Car | 80.97±1.26 | 79.93±7.90 |
Cylinder | 68.50±2.22 | 67.62±5.11 |
Dermatology | 92.84±2.37 | 92.51±2.42 |
Ecoli | 78.10±3.57 | 78.27±4.00 |
Flare | 85.26±2.03 | 88.35±2.55 |
Glass | 60.65±5.97 | 60.19±5.35 |
Hepatitis | 78.44±3.71 | 78.95±4.48 |
Horse | 87.55±1.98 | 88.80±3.02 |
Iris | 93.87±2.75 | 92.93±3.33 |
Ionosphere | 87.63±3.15 | 86.15±3.72 |
Monks | 92.27±10.15 | 89.81±7.82 |
Mushroom | 99.70±0.06 | 99.87±0.11 |
Ocrdigits | 78.40±1.47 | 84.34±1.48 | 2>>1
Pendigits | 85.73±1.01 | 92.54±0.61 | 2>>>1
Segment | 91.08±1.16 | 91.99±0.95 |
Vote | 94.94±1.06 | 95.63±0.66 |
Wine | 88.65±3.72 | 86.63±1.94 |
Zoo | 92.06±4.80 | 82.97±7.36 |

TABLE 6.1.2.2 Node results for pre-pruning and post-pruning techniques (mean ± standard deviation)

Data set name | ID3 | ID3P | Significance
Breast | 17.00±2.11 | 13.00±4.99 |
Bupa | 53.40±5.48 | 17.40±12.54 | 1>>>2
Car | 25.40±0.70 | 60.78±45.00 |
Cylinder | 54.10±5.90 | 20.40±8.47 | 1>>>2
Dermatology | 20.40±2.67 | 12.40±1.35 | 1>>2
Ecoli | 33.80±2.70 | 14.20±4.64 | 1>>>2
Flare | 37.90±4.51 | 6.10±6.62 | 1>>2
Glass | 38.20±5.90 | 14.40±4.01 | 1>>>2
Hepatitis | 19.60±3.78 | 2.80±2.39 | 1>>2
Horse | 55.80±5.92 | 45.60±3.92 | 1>>2
Iris | 8.40±1.35 | 5.40±0.84 |
Ionosphere | 19.20±3.05 | 7.60±2.67 | 1>>2
Monks | 25.40±13.53 | 25.40±9.28 |
Mushroom | 23.00±0.00 | 26.80±1.99 | 2>1
Ocrdigits | 74.40±4.01 | 104.40±12.44 | 2>1
Pendigits | 81.80±5.51 | 134.80±13.48 | 2>>1
Segment | 41.80±3.79 | 43.00±6.93 |
Vote | 18.20±3.16 | 4.00±2.16 | 2>>>1
Wine | 10.40±1.35 | 6.80±2.57 |
Zoo | 15.00±1.89 | 9.20±2.39 | 1>>>2

There is no significant difference in accuracy between the pre-pruning and post-pruning techniques, but due to the horizon effect two data sets (Ocrdigits and Pendigits) show a significant accuracy improvement with post-pruning.

Post-pruning leads to fewer nodes than pre-pruning (in 11 of the 20 data sets).

When the horizon effect applies, the node count also increases, so in those two data sets the trees are significantly larger than with pre-pruning.

In discrete data sets where the arity is greater than five, such as Car and Mushroom, the post-pruning technique cannot prune the tree well, so it produces a large number of nodes.

In some data sets, where the number of instances of one class is very high compared to the other classes, post-pruning collapses the tree: the number of nodes goes to one, so the whole tree is pruned back to a single leaf.

Post-pruning takes a significantly larger amount of time to learn, because it prunes the tree only after the tree has been fully constructed. In some cases, however, pre-pruning takes more time: the pruning set is carved out of the training set, so post-pruning grows its tree from fewer training instances.
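A minimal sketch of that split, assuming a hypothetical one-third pruning fraction (the actual proportion is not stated in this section):

import random

def split_for_pruning(X, y, prune_frac=1/3, seed=7):
    # Carve a pruning set out of the training data; the fraction and
    # the shuffling are illustrative assumptions.
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * prune_frac)
    take = lambda ids: ([X[i] for i in ids], [y[i] for i in ids])
    return take(idx[cut:]), take(idx[:cut])

# Usage with the sketch in Section 6.1.2:
# (grow_X, grow_y), (prune_X, prune_y) = split_for_pruning(X, y)
# tree = post_prune(grow(grow_X, grow_y), prune_X, prune_y)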



TABLE 6.1.2.3 Learning time results for pre-pruning and post-pruning techniques (in seconds, mean ± standard deviation)

Data set name | ID3 | ID3P | Significance
Breast | 2±0 | 3±1 |
Bupa | 3±1 | 4±0 |
Car | 5±0 | 40±3 | 2>>1
Cylinder | 10±2 | 10±1 | 1>>2
Dermatology | 3±0 | 3±0 |
Ecoli | 3±0 | 3±0 |
Flare | 2±0 | 2±1 |
Glass | 3±0 | 2±0 |
Hepatitis | 1±0 | 1±0 |
Horse | 4±0 | 4±0 |
Iris | 0±0 | 0±0 |
Ionosphere | 39±7 | 21±4 | 1>>>2
Monks | 2±1 | 3±1 |
Mushroom | 113±33 | 84±29 |
Ocrdigits | 207±9 | 497±109 | 2>>>1
Pendigits | 476±22 | 931±255 | 2>>>1
Segment | 345±10 | 262±8 | 1>>>2
Vote | 1±0 | 2±0 | 2>>1
Wine | 1±0 | 1±0 |
Zoo | 1±0 | 0±0 |

6.1.3. Comparison of Multiple Splits


In this section we want to find out whether it is better to use multiple splits instead of binary splits. To check this, we ran experiments with three-way and four-way splits on the continuous attributes and compared them with two-way splits. Accuracy results are shown in Table 6.1.3.1 and Figure 6.1.3.1, node results in Table 6.1.3.2 and Figure 6.1.3.2, and learning time results in Table 6.1.3.3, Figure 6.1.3.3 and Figure 6.1.3.4.
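One way to realize a k-way split on a continuous attribute is to search over combinations of k-1 thresholds. The brute-force sketch below is an assumption about the search strategy, not the thesis's exact procedure, and it reuses entropy() and impurity_drop() from Section 6.1.1.

from itertools import combinations

def kway_split_gain(values, labels, k):
    # Exhaustively score every tuple of k-1 thresholds, taken from the
    # midpoints between adjacent distinct attribute values.
    order = sorted(set(values))
    cuts = [(a + b) / 2 for a, b in zip(order, order[1:])]
    best_gain, best_cuts = 0.0, None
    for thresholds in combinations(cuts, k - 1):
        bins = [[] for _ in range(k)]
        for v, y in zip(values, labels):
            bins[sum(v > t for t in thresholds)].append(y)  # bin index of v
        g = impurity_drop(labels, [b for b in bins if b], entropy)
        if g > best_gain:
            best_gain, best_cuts = g, thresholds
    return best_gain, best_cuts

# Toy example: the best three-way split recovers the two class boundaries.
vals = [1, 2, 3, 4, 5, 6, 7, 8]
labs = ["A", "A", "B", "B", "B", "C", "C", "C"]
print(kway_split_gain(vals, labs, k=3))

With n distinct values there are C(n-1, k-1) threshold tuples to score, which grows rapidly with k and is consistent with the steep learning-time increases reported in Table 6.1.3.3.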

TABLE 6.1.3.1 Accuracy results for splits with degrees two, three and four (%, mean ± standard deviation)

Data set name | ID3 | ID3-2 | ID3-3 | Significance
Breast | 94.11±1.24 | 94.08±1.38 | 93.65±0.87 | 1>3
Bupa | 62.26±5.33 | 59.41±4.61 | 59.70±2.78 | 1>>>2
Cylinder | 68.50±2.22 | 63.62±4.08 | 65.44±5.66 |
Dermatology | 92.84±2.37 | 92.46±1.80 | 91.37±2.51 |
Ecoli | 78.10±3.57 | 76.61±3.97 | 75.24±4.61 |
Flare | 85.26±2.03 | 85.26±2.03 | 85.26±2.03 |
Glass | 60.65±5.97 | 56.92±5.82 | 54.21±4.89 |
Hepatitis | 78.44±3.71 | 73.38±8.65 | 71.07±8.41 |
Horse | 87.55±1.98 | 87.12±2.22 | 86.90±2.37 |
Iris | 93.87±2.75 | 92.67±3.28 | 92.93±2.27 |
Ionosphere | 87.63±3.15 | 87.63±1.39 | N/A |
Monks | 92.27±10.15 | 91.53±7.29 | 80.28±8.26 | 1>>2,1>>3
Ocrdigits | 78.40±1.47 | 67.25±2.24 | 63.41±1.72 | 1>>>2>>>3
Pendigits | 85.73±1.01 | 82.19±1.47 | N/A | 1>>2
Segment | 91.08±1.16 | N/A | N/A |
Wine | 88.65±3.72 | 86.63±5.28 | 83.03±4.90 |
Zoo | 92.06±4.80 | 87.10±4.96 | 88.69±5.35 | 1>>2

TABLE 6.1.3.2 Node results for splits with degrees two, three and four (mean ± standard deviation)

Data set name | ID3 | ID3-2 | ID3-3 | Significance
Breast | 17.00±2.11 | 17.90±3.90 | 18.80±3.36 |
Bupa | 53.40±5.48 | 50.70±3.53 | 54.40±5.58 | 3>>2
Cylinder | 54.10±5.90 | 52.40±5.21 | 54.80±4.85 |
Dermatology | 20.40±2.67 | 20.30±3.56 | 27.00±3.89 | 3>>1
Ecoli | 33.80±2.70 | 34.40±3.17 | 36.90±3.87 |
Flare | 37.90±4.51 | 37.20±3.99 | 37.80±3.55 |
Glass | 38.20±5.90 | 38.70±4.60 | 37.30±5.93 |
Hepatitis | 19.60±3.78 | 20.60±3.81 | 21.40±3.95 |
Horse | 55.80±5.92 | 57.50±6.59 | 58.20±6.92 |
Iris | 8.40±1.35 | 8.00±2.26 | 8.10±1.97 |
Ionosphere | 19.20±3.05 | 20.60±3.24 | N/A |
Monks | 25.40±13.53 | 33.90±6.76 | 38.50±5.04 | 2>>1,3>>>1
Ocrdigits | 74.40±4.01 | 63.50±4.03 | 67.90±5.61 |
Pendigits | 81.80±5.51 | 73.60±5.40 | N/A |
Segment | 41.80±3.79 | N/A | N/A |
Wine | 10.40±1.35 | 12.90±1.91 | 15.00±2.36 | 3>1
Zoo | 15.00±1.89 | 16.20±1.99 | 16.50±2.17 | 2>1

For multiple splits, the accuracy decreases as the degree of the split is increased from two to four; this difference is significant in six out of 20 data sets. This may be due to the fragmentation problem.

The number of nodes also increases as the degree of the split increases. Only in some small data sets is there a drop in node count when going from degree two to three.

The learning time of higher-degree splits is significantly greater than that of lower-degree splits.

As can be expected, the accuracy, number of nodes, and learning time do not change in data sets where all attributes are discrete.



TABLE 6.1.3.3 Learning time results for splits with degrees two, three and four (in seconds, mean ± standard deviation)

Data set name | ID3 | ID3-2 | ID3-3 | Significance
Breast | 2±0 | 4±0 | 11±1 | 3>>>2>>>1
Bupa | 3±1 | 15±2 | 168±63 | 3>>>2>>>1
Cylinder | 10±2 | 70±10 | 1013±246 | 3>>2>>>1
Dermatology | 3±0 | 7±1 | 36±21 | 2>>1
Ecoli | 3±0 | 23±1 | 419±53 | 3>>>2>>>1
Flare | 2±0 | 2±0 | 2±0 |
Glass | 3±0 | 29±4 | 556±105 | 3>>>2>>>1
Hepatitis | 1±0 | 5±1 | 41±8 | 3>>>2>>>1
Horse | 4±0 | 14±2 | 158±80 | 3>2>>>1
Iris | 0±0 | 1±0 | 13±2 | 3>>>2>>1
Ionosphere | 39±7 | 604±80 | N/A | 2>>>1
Monks | 2±1 | 2±0 | 3±1 | 2>>1,3>>>1
Ocrdigits | 207±9 | 596±98 | 2384±352 | 3>>>2>>>1
Pendigits | 476±22 | 9272±2356 | N/A | 2>>>1
Segment | 345±10 | N/A | N/A |
Wine | 1±0 | 18±3 | 266±60 | 3>>>2>>>1
Zoo | 1±0 | 1±0 | 1±0 |
