In machine learning the knowledge is extracted from a training sample for future prediction. Most machine learning methods make accurate predictions but are not interpretable. In this study, we concentrate on decision trees, which are simple and easily comprehensible. They are robust to noisy data and can learn disjunctive expressions (Mitchell, 1996).
A decision procedure inferring from an example space can be formalized as follows: The given data represents a set of objects or “instances”. Each object is described in terms of a collection of discrete or continuous valued independent variables or “attributes” x_{i}. Each instance has a dependent variable or a “class” associated with it. The data consists of vectors that give values for the attributes and the class of each object (x_{1}, x_{2}_{, … }x_{f} x).
The object of supervised learning is to find the function of the attributes that best predicts the class of an object (a function F: X_{1}_{ } X_{2}_{ } … X_{f} C). The given data is considered as the training set and the data that will be predicted is called the test set.
FIGURE 2.1. Instances of the problem Choosing Car (This is an imaginary data set).
An example data (training set) for the problem Choosing Car is shown in Figure 1.1. Year of the car and price are the two continuous attributes of this example. Both of the attributes are continuous. There are 16 instances in this data set. Seven of them are for the class Mustang (shown as diamonds) and nine of them are for the class Ferrari (shown as squares). Data is splitted by axisaligned lines, which define the decision tree.
FIGURE 2.2. Decision tree for the problem Choosing Car
The constructed decision tree is shown in Figure 2.2. Decision trees consist of internal nodes (drawn as rectangles in Figure 2.2) having one or more attributes to test and leaves (drawn as Mustang or Ferrari in Figure 2.1) to show the decision made. The test for the internal nodes is shown as lines below the rectangles. For example the test Year = 1980 line divides the data space in Figure 2.1 into two parts as:

Year 1980 : Diamonds (Mustang), squares (Ferrari) and

Year >1980: Diamonds (Mustang), squares (Ferrari).
The test for the data below the Price = 50000 line (shown as left subtree in Figure 2.2) further splits that subspace into two parts as:

Price 50000: Diamonds (Mustang)

Price > 50000: Squares (Ferrari)
After the second test we have samples of only one class in each node. Hence we stop further splitting. The leaf Mustang below Price 50000 is a leaf node, where the decision is made that if the year of the car is less than 1980 and the price of the car is less than 50000, we can say that this car is a Mustang. This is a rule, which we have derived from the decision tree in Figure 2.2.
The rules of decision tree consist of disjunctions of conjunctions, which are paths from the root node to the leaf nodes. For this example, the rules are:
If (Year 1980 Price 50000) (Year > 1980 Price 20000), then this car is a Mustang.
If (Year 1980 Price > 50000) (Year > 1980 Price > 20000), then this car is a Ferrari.
After the tree is constructed, any instance can be classified given the values of Year and Price attributes. For example the instance
< Year = 1970, Price = 60000 >
is classified as Ferrari whereas
< Year = 1990, Price = 10000 >
would be classified as Mustang. Classification of one instance is done by tracing the corresponding path for that instance, from the root node until a leaf node. For example, for the first instance, we compare its year with 1980, while its year is smaller than 1980, we go to left subtree and compare its price with 50000, while its price is bigger than 50000 we say that, this car is a Ferrari.
Readability can also be improved by changing the rules of decision tree into ifthen rules. For example the rules above could be changed into ifthen rules as follows:
If the year of the car is smaller than 1980 and its price is smaller than 50000, then this car is a Mustang.
If the year of the car is bigger than 1980 and its price smaller than 20000, then this car is a Mustang.
If the year of the car is smaller than 1980 and its price bigger than 50000, then this car is a Ferrari.
If the year of the car is bigger than 1980 and its price bigger than 20000, then this car is a Ferrari.
Decision tree learning can also be generalized to the case where the target function is continuous, i.e., regression. This is beyond the scope of this thesis.
2.1.Algorithm for Tree Construction
The basic tree construction algorithm is a greedy search on the space of possible decision trees. The search starts by creating a root node and continues by processing the training set according to the following steps given below.

If all of the instances belong to the same class C_{i} stop and return the leaf node with class C_{i}.

Find the split that best classifies the instances (as will be explained later).

Each split divides the data into subsets. (For example the split Year = 1980 divides the data into two subsets with seven and nine instances in each)

For each subset of the data, repeat steps 1, 2, 3 and 4 to construct the decision tree.
So this algorithm is recursive. For each node of the tree it is called once. Finding the best split takes most of the time. Time complexity of the algorithm is O (l * Timecomplexity (Best Split Function)), where l is the number of nodes in the decision tree.
