3.1. Univariate vs. Multivariate Splits
Multivariate decision trees differ from univariate decision trees mainly in the way they test attributes. A univariate decision tree tests a single attribute at each node, whereas a multivariate decision tree tests more than one attribute (generally a linear combination of attributes) at a node. Restricting a test to a single attribute limits the concepts a tree can express. This limitation shows itself in three forms: splits can only be orthogonal to the axes, subtrees may be replicated in the decision tree, and the data may be fragmented.
For the first limitation, consider the two-dimensional instance space shown in Figure 3.1.1. To approximate the class boundary, the corresponding univariate decision tree must use a series of orthogonal splits, whereas a multivariate test needs only one linear split.
FIGURE 3.1.1. Comparison of univariate and multivariate splits on the plane
This example illustrates the well-known problem that a univariate test on feature $x_i$ can only split the space with a boundary orthogonal to the $x_i$ axis. This results in larger trees and poor generalization.
FIGURE 3.1.2. An example decision tree for the replication problem
The second problem is the replication of subtrees. The decision tree shown in Figure 3.1.2 gives an inefficient representation of the proposition $(A \wedge B) \vee (C \wedge D)$. While the term $(A \wedge B)$ is represented efficiently, the term $(C \wedge D)$ requires two identical subtrees. In general, conjunctions can be described efficiently by decision trees, while disjunctions require a large tree to describe (Pagallo and Haussler, 1990; Matheus and Rendell, 1989). One solution to the replication problem is to allow decision nodes that test more than one feature, combining them with an appropriate function, as in the sketch below.
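To make the replication concrete, the following minimal Python sketch (the feature names and tree shapes are illustrative, not taken from Figure 3.1.2) contrasts a purely univariate tree for $(A \wedge B) \vee (C \wedge D)$, which must duplicate the $(C \wedge D)$ subtree, with a tree whose root may test a combination of features:

def univariate_tree(a, b, c, d):
    # Univariate nodes test one Boolean feature at a time, so the
    # (C and D) subtree must appear twice: under A=1, B=0 and under A=0.
    if a:
        if b:
            return True
        return c and d   # first copy of the (C and D) subtree
    return c and d       # second, identical copy

def combined_node_tree(a, b, c, d):
    # Allowing a node to test a function of several features (here the
    # conjunction of A and B) removes the replicated subtree.
    if a and b:
        return True
    return c and d

Both functions compute $(A \wedge B) \vee (C \wedge D)$ on all 16 Boolean inputs; only the second does so without duplicated structure.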
FIGURE 3.1.3. An example decision tree for the fragmentation problem
Figure 3.1.3 shows another kind of replication problem that can occur when the data contains attributes of high arity, i.e., attributes with a large number of possible values. A test on a high-arity attribute (say, arity 5) quickly fragments the data at that node into small partitions. To avoid this problem, we can construct subsets of the attribute values (Fayyad and Irani, 1992), e.g., seasons instead of months, as in the sketch below.
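As a concrete illustration of value subsetting, the following sketch (the attribute name and the particular grouping are assumptions for illustration) maps a 12-valued Month attribute to a 4-valued Season attribute before splitting:

SEASON = {
    "Dec": "winter", "Jan": "winter", "Feb": "winter",
    "Mar": "spring", "Apr": "spring", "May": "spring",
    "Jun": "summer", "Jul": "summer", "Aug": "summer",
    "Sep": "autumn", "Oct": "autumn", "Nov": "autumn",
}

def reduce_arity(instances):
    # Map each instance's 12-valued 'month' value to one of 4 seasons,
    # so a split on this attribute creates 4 partitions instead of 12.
    return [{**x, "month": SEASON[x["month"]]} for x in instances]

print(reduce_arity([{"month": "Jan", "class": "+"}]))
# [{'month': 'winter', 'class': '+'}]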
In a multivariate decision tree, by contrast, each test can be based on more than one feature. In Figure 3.1.1, we can separate the examples of the two classes with a single line. The test node is the multivariate test $w_1 x_1 + w_2 x_2 \le c$. Instances for which $w_1 x_1 + w_2 x_2$ is less than or equal to $c$ are classified as one class; the remaining instances are classified as the other class.
A multivariate decision tree construction algorithm selects not the best single attribute but the best linear combination of the attributes: $\sum_{i=1}^{f} w_i x_i \le w_0$, where the $w_i$ are the weights associated with each feature $x_i$ and $w_0$ is the threshold to be determined from the data. So there are basically two main operations in multivariate algorithms: selecting which features to use, and finding the weights $w_i$ of those features and the threshold $w_0$.
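As a minimal sketch of such a test, the following Python function (the use of NumPy and the example weights are illustrative assumptions, not values produced by any particular algorithm) evaluates a linear combination split once the weights and threshold have been determined:

import numpy as np

def multivariate_split(x, w, w0):
    # Evaluate the linear combination test sum_i w_i * x_i <= w_0 and
    # return True if instance x goes down the left branch.
    return float(np.dot(w, x)) <= w0

# Illustrative usage: the two-feature split w1*x1 + w2*x2 <= c of
# Figure 3.1.1, with made-up weights w = (1.0, 2.0) and threshold 5.0.
x = np.array([1.0, 1.5])
print(multivariate_split(x, np.array([1.0, 2.0]), 5.0))   # True: 4.0 <= 5.0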
FIGURE 3.1.4. Instances of the problem Choosing Car with a multivariate split
Figure 3.1.4 illustrates a multivariate split for the same problem as in Chapter 2. As can be seen, the two classes (Ferraris and Mustangs) can now be separated using only one multivariate split. The corresponding multivariate decision tree is shown in Figure 3.1.5. The split for the first node is
$w_1 \,\text{Year} + w_2 \,\text{Price} \le w_0$,
which is a linear combination of features. The size of the tree is thus reduced from seven nodes to three.
FIGURE 3.1.5. Multivariate decision tree for the problem Choosing Car
Multivariate decision trees differ from univariate trees in two other respects. First, because the features in the linear combination are multiplied by coefficients, symbolic features must be converted into numeric features. Second, because the weighted sum is a single numeric value, all splits are binary.
3.2. Symbolic and Numeric Features
A decision tree algorithm must handle both ordered and unordered features. In a univariate decision tree algorithm, an ordered feature can be tested as $x_i \le a$, where $a$ is a value in the observed range of feature $i$, and an unordered feature as $x_i = a_j$, where the $a_j$ are the possible values of the unordered feature $x_i$ in the training set. A minimal sketch of these two tests follows.
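The sketch below assumes instances stored as plain Python sequences indexed by feature; the names are illustrative:

def test_ordered(x, i, a):
    # Ordered feature: branch on x_i <= a, with a taken from the
    # observed range of feature i.
    return x[i] <= a

def test_unordered(x, i, a_j):
    # Unordered feature: branch on x_i == a_j for a possible value
    # a_j seen in the training set.
    return x[i] == a_j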
In a multivariate decision tree algorithm, each unordered feature $i$ must be converted into a numeric representation so that a linear combination of the features can be taken. In our implementation, an unordered feature $i$ with possible values $a_1, \dots, a_m$ is converted into a set of $m$ binary features $i_1, \dots, i_m$, where $i_j = 1$ if $x_i$ takes the value $a_j$ and $i_j = 0$ otherwise. At the end of this preprocessing, the original features $x_1, \dots, x_f$ (some ordered, some unordered) become an all-ordered feature set $x_1, x_{21}, x_{22}, \dots, x_{2m}, x_{31}, x_{32}, \dots, x_{3p}, x_4, \dots, x_f$, where $x_1, x_4, \dots$ are the ordered features and $x_{2j}, x_{3j}, \dots$ are the binary features derived from the unordered ones.
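A minimal sketch of this 1-of-m conversion, assuming the possible values of the unordered feature are known from the training set (the Color feature in the usage line is illustrative):

def one_of_m(value, possible_values):
    # Encode one unordered feature value as m indicator features:
    # the j-th output is 1 when value == a_j and 0 otherwise.
    return [1 if value == v else 0 for v in possible_values]

print(one_of_m("red", ["red", "green", "blue"]))   # [1, 0, 0]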
This encoding avoids imposing any order on the unordered values of the feature (Hampson and Volper, 1986; Utgoff and Brodley, 1991). Note, however, that the dimensionality, and hence the complexity, increases with the number of attributes and, for unordered attributes, with the number of their possible values.
