Let temp_Xt denote the temperature taken at hour t at location X; then the data instance will take the form

temp_At-2, temp_At-1, temp_At, temp_Bt-2, temp_Bt-1, temp_Bt, temp_Ct-2, temp_Ct-1, temp_Ct

with the last attribute, temp_Ct, being the output attribute. We will refer to this example as 'our weather example' in the following sections in this chapter.
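To make the construction of these instances concrete, the following is a minimal sketch of how such fixed-width instances could be assembled from hourly temperature series; the readings and series lengths are illustrative and are not taken from the thesis data.

```python
# Sketch: build instances of the form
# (temp_A[t-2], temp_A[t-1], temp_A[t], temp_B[t-2], temp_B[t-1], temp_B[t],
#  temp_C[t-2], temp_C[t-1]) -> temp_C[t]
# from hourly temperature readings at sites A, B and C (illustrative values).

temp_A = [30, 31, 33, 35, 34, 32]   # hourly readings at site A
temp_B = [40, 41, 42, 44, 43, 41]   # hourly readings at site B
temp_C = [36, 37, 39, 41, 40, 38]   # hourly readings at site C

instances = []
for t in range(2, len(temp_C)):
    inputs = (temp_A[t-2], temp_A[t-1], temp_A[t],
              temp_B[t-2], temp_B[t-1], temp_B[t],
              temp_C[t-2], temp_C[t-1])
    output = temp_C[t]               # the attribute to be predicted
    instances.append((inputs, output))

for inputs, output in instances:
    print(inputs, "->", output)
```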


Figure 2.3: Using data from nearby sites to predict temperature for the location C.

ML algorithms can be broadly classified into two groups, classification and regression algorithms. We describe these two groups and some of the ML algorithms from each of them.
2.2.1 Classification Algorithms
Algorithms that classify a given instance into a set of discrete categories are called classification algorithms. These algorithms work on a training set to come up with a model or a set of rules that classifies a given input into one of a set of discrete output values. Most classification algorithms can take inputs in any form, discrete or continuous, although some classification algorithms require all of the inputs to be discrete as well. The output is always a discrete value. Decision trees and Bayes nets are examples of classification algorithms.
To be able to apply classification algorithms to our weather example we need to convert the output attribute into classes. This is generally done by discretization, which is the process of dividing a continuous variable into classes. Discretization can be done in many ways: a simple approach is to divide the temperature into ranges of 5 degrees and give each range a name, or entropy-based algorithms can be used [Fayyad & Irani, 1993; Dougherty et al., 1995]. Input attributes can be left continuous if the algorithm can handle them, or they can be converted into discrete values, depending on the algorithm.
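As a concrete illustration of the simple fixed-width approach, the sketch below maps a continuous temperature to a named 5-degree range; the function name and bin width are illustrative assumptions.

```python
def discretize_temperature(value, width=5):
    """Map a continuous temperature to a named 5-degree range, e.g. 37.2 -> '35-40'."""
    low = int(value // width) * width
    return f"{low}-{low + width}"

print(discretize_temperature(37.2))   # '35-40'
print(discretize_temperature(40.0))   # '40-45'
```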
We describe in detail the classification algorithms that have been used in this thesis in the sub-sections below.


2.2.1.1 The J48 Decision Tree Algorithm

J48 is a decision tree learner based on C4.5 [Quinlan, 1993]. C4.5 is an update of the ID3 algorithm [Quinlan, 1986]. We describe here the ID3 algorithm.

Figure 2.4: A Decision Tree to predict the current temperature at site C based on temperature readings taken from a set of nearby sites.
A decision tree classifies a given instance by passing it through the tree, starting at the top and moving down until a leaf node is reached. The value at that leaf node gives the predicted output for the instance. At each node an attribute is tested, and the branches from the node correspond to the values that attribute can take. When the instance reaches a node, the branch taken depends on the value it has for the attribute being tested at that node. A decision tree that can be used to predict the present hour temperature for site C in our weather example is given in Figure 2.4. To classify an instance in our weather example with this tree, we start at the root node, which tests the attribute temp_At-2, and based on the value this attribute takes in the instance we follow the left or right branch. At each node reached, the attribute associated with it is tested and the corresponding branch taken, until we reach a leaf node, which gives the value assigned to the output attribute temp_Ct.
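A minimal sketch of this traversal, with the tree represented as nested dictionaries, is given below. Only the leftmost path of Figure 2.4 (temp_At-2 greater than 32, then temp_At greater than 30, then temp_Bt greater than 40, giving the class 40-45) is taken from the text; the remaining branches and leaf values are made-up placeholders.

```python
# Illustrative decision tree; only the leftmost path matches Figure 2.4.
tree = {
    "attribute": "temp_At-2", "threshold": 32,
    "gt": {
        "attribute": "temp_At", "threshold": 30,
        "gt": {
            "attribute": "temp_Bt", "threshold": 40,
            "gt": {"leaf": "40-45"},
            "le": {"leaf": "35-40"},        # made-up leaf
        },
        "le": {"leaf": "30-35"},            # made-up leaf
    },
    "le": {"leaf": "25-30"},                # made-up leaf
}

def classify(node, instance):
    """Walk the tree until a leaf is reached and return its class value."""
    while "leaf" not in node:
        branch = "gt" if instance[node["attribute"]] > node["threshold"] else "le"
        node = node[branch]
    return node["leaf"]

instance = {"temp_At-2": 34, "temp_At": 33, "temp_Bt": 45}
print(classify(tree, instance))   # '40-45'
```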
The ID3 algorithm builds a decision tree based on the set of training instances given to it. It takes a greedy top-down approach for the construction of the tree, starting with the creation of the root node. At each node the attribute that best classifies all the training instances that have reached that node is selected as the test attribute. At a node only those attributes are considered which were not used for classification at other nodes above it in the tree. To select the best attribute at a node, the information gain for each attribute is calculated and the attribute with the highest information gain is selected. Information gain for an attribute is defined as the reduction in entropy caused by splitting the instances based on values taken by the attribute. The information gain for an attribute A at a node is calculated using

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v),$$

where S is the set of instances at that node and |S| is its cardinality, $S_v$ is the subset of S for which attribute A has value v, and the entropy of the set S is calculated as

$$Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i,$$

where $p_i$ is the proportion of instances in S that have the ith class value as the output attribute and c is the number of class values.
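For illustration, the entropy and information gain calculations can be sketched as follows; the instances, attribute names and class values are made up and are not from the thesis data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum p_i * log2(p_i)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(instances, attribute, label_key="class"):
    """Reduction in entropy from splitting `instances` on the values of `attribute`."""
    labels = [inst[label_key] for inst in instances]
    gain = entropy(labels)
    for value in set(inst[attribute] for inst in instances):
        subset = [inst[label_key] for inst in instances if inst[attribute] == value]
        gain -= (len(subset) / len(instances)) * entropy(subset)
    return gain

# Illustrative discretized instances: the attribute with the highest gain
# would be selected as the test at the current node.
data = [
    {"temp_At": "30-35", "temp_Bt": "40-45", "class": "35-40"},
    {"temp_At": "30-35", "temp_Bt": "35-40", "class": "35-40"},
    {"temp_At": "25-30", "temp_Bt": "40-45", "class": "30-35"},
    {"temp_At": "25-30", "temp_Bt": "35-40", "class": "30-35"},
]
print(information_gain(data, "temp_At"))   # 1.0
print(information_gain(data, "temp_Bt"))   # 0.0
```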


A new branch is added below the node for each value taken by the test attribute. The training instances whose test attribute value matches the branch are passed down that branch, and this subset of training instances is used for the creation of further nodes. If this subset of training instances all share the same output class value, a leaf is generated at the branch end and the output attribute is assigned that class value. If no instances are passed down a branch, a leaf node is added at the branch end that assigns the output attribute the most common class value in the training instances. This process of generating nodes continues until all the instances are correctly classified, all the attributes have been used, or it is no longer possible to divide the examples.
Extensions were added to the basic ID3 algorithm to (1) deal with continuous-valued attributes, (2) deal with instances that have missing attribute values, and (3) prevent overfitting the data (explained below).

When a discrete-valued attribute is selected at a node, the number of branches formed is equal to the number of possible values taken by the attribute. In the case of a continuous-valued attribute, two branches are formed based on a threshold value that best splits the instances into two. For example, in Figure 2.4 the attribute at the root node, temp_At-2, has a threshold value of 32. The threshold is selected as the value of the attribute that maximizes the information gain over the given training instances. Fayyad & Irani [1993] extended this approach to split a continuous-valued attribute into more than two intervals.
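A minimal sketch of this threshold selection, which evaluates the information gain of the two-way split at each candidate midpoint between adjacent sorted values, is shown below; the readings and class labels are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return the split threshold (and its gain) that maximizes information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = -1.0, None
    for i in range(1, len(pairs)):
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

temps = [28, 30, 31, 33, 35, 36]                       # made-up temp_At-2 readings
classes = ["30-35", "30-35", "30-35", "40-45", "40-45", "40-45"]
print(best_threshold(temps, classes))                  # (32.0, 1.0)
```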


There may arise cases where an instance has no value for an attribute, i.e., a missing value. The missing value can be replaced by the most common value for that attribute among the training instances that reach the node where the attribute is tested. In C4.5, the probability of each possible value taken by the attribute with the missing value is calculated, based on the number of times it is seen in the training instances at a node, and these probability values are then used in the calculation of information gain at the node.
In the ID3 algorithm, when the training set used is too small, the tree that is built correctly classifies the training instances but fails when applied to the entire distribution of data, because it focuses on spurious correlations in the data; this is known as overfitting. To avoid overfitting, C4.5 uses a technique called rule post-pruning. In rule post-pruning, after the tree is built it is converted into a set of rules. For example, the rule generated for the leftmost path of the tree in Figure 2.4 is

IF (temp_At-2 > 32 AND temp_At > 30 AND temp_Bt > 40)

THEN temp_Ct = 40-45.

From each rule generated for the tree, those antecedents whose removal does not reduce the accuracy of the model are pruned (i.e., removed). Accuracy is measured on the instances present in the validation set, which is a subset of the training set not used for building the model.




2.2.1.2 Naive Bayes
Naive Bayes [Good, 1965; Langley et al., 1992] is a simple probabilistic classifier based on Bayes' rule. The naive Bayes algorithm builds a probabilistic model by learning the conditional probability of each input attribute given each possible value taken by the output attribute. This model is then used to predict an output value when we are given a set of inputs, by applying Bayes' rule to obtain the conditional probability of each possible output value given the attribute values seen together in the instance. Before describing the algorithm we first define Bayes' rule.

Bayes’ rule states that



$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},$$

where P(A|B) is defined as the probability of observing A given that B occurs. P(A|B) is called the posterior probability, P(B|A) is the likelihood of B given A, and P(A) and P(B) are the prior probabilities of A and B. Bayes' theorem thus gives a relationship between the posterior probability and the prior probabilities: it allows one to find the probability of observing A given B when the individual probabilities of A and B and the probability of observing B given A are known.


The naive Bayes algorithm uses a set of training examples to classify a new instance given to it using the Bayesian approach. For an instance, Bayes' rule is applied to find the probability of observing each output class given the input attributes, and the class that has the highest probability is assigned to the instance. The probability values used are obtained from the counts of attribute values seen in the training set.
In our weather example, for a given instance with two input attributes temp_At and temp_Bt, with values a and b respectively, the value vMAP assigned by the naive Bayes algorithm to the output attribute temp_Ct is the one that has the highest probability across all possible values taken by the output attribute; this is known as the maximum-a-posteriori (MAP) rule. The probability of the output attribute taking a value vj when the given input attribute values are seen together is given by

$$P(v_j \mid a, b).$$

This probability value as such is difficult to calculate directly. By applying Bayes' theorem to this equation we get



$$P(v_j \mid a, b) = \frac{P(a, b \mid v_j)\, P(v_j)}{P(a, b)},$$

where P(vj) is the probability of observing vj as the output value, P(a,b|vj) is the probability of observing the input attribute values a and b together when the output value is vj, and P(a,b) is the probability of observing the input values a and b together; since P(a,b) is the same for every output value, it does not affect which class has the highest probability. But if the number of input attributes (a, b, c, d, ....) is large, then we likely will not have enough data to estimate the probability P(a, b, c, d, .... | vj).


The naive Bayes algorithm solves this problem by assuming conditional independence of all the input attributes given the value of the output. This means it assumes that, for any given output value, the value taken by an attribute does not depend on the values of the other attributes in the instance. By applying the conditional independence assumption, the probability of observing the inputs given an output value can be obtained by multiplying the probabilities of the individual inputs given that output value. The probability value P(a, b | vj) can then be simplified as

$$P(a, b \mid v_j) = P(a \mid v_j)\, P(b \mid v_j),$$

where P(a | vj) is the probability of observing the value a for the attribute temp_At when the output value is vj. Thus the probability that the output value vj is assigned for the given input attributes is

$$P(v_j \mid a, b) = \frac{P(v_j)\, P(a \mid v_j)\, P(b \mid v_j)}{P(a, b)}.$$


Learning in the naive Bayes algorithm involves finding the probabilities P(vj) and P(ai|vj) for all possible values taken by the input and output attributes, based on the training set provided. P(vj) is obtained as the ratio of the number of times the value vj is seen for the output attribute to the total number of instances in the training set. For an attribute at position i with value ai, the probability P(ai|vj) is obtained from the number of times ai is seen in the training set when the output value is vj.
The naive Bayes algorithm requires all attributes in the instance to be discrete. Continuous-valued attributes have to be discretized before they can be used. Missing values for an attribute are not allowed, as they lead to difficulties while calculating the probability values for that attribute. A common approach to dealing with missing values is to replace them with a default value for that attribute.
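A minimal sketch of this counting-based learning and MAP prediction, on made-up discretized instances, is given below; the attribute names and class labels are illustrative only.

```python
from collections import Counter, defaultdict

# Illustrative discretized training instances: ({input attributes}, output class).
training = [
    ({"temp_At": "30-35", "temp_Bt": "40-45"}, "35-40"),
    ({"temp_At": "30-35", "temp_Bt": "35-40"}, "35-40"),
    ({"temp_At": "25-30", "temp_Bt": "35-40"}, "30-35"),
    ({"temp_At": "25-30", "temp_Bt": "40-45"}, "30-35"),
]

# Learning: count class frequencies and per-class attribute-value frequencies.
class_counts = Counter(label for _, label in training)
value_counts = defaultdict(Counter)           # (attribute, class) -> Counter of values
for attrs, label in training:
    for attr, value in attrs.items():
        value_counts[(attr, label)][value] += 1

def predict(attrs):
    """Return the class value v_j maximizing P(v_j) * prod_i P(a_i | v_j)."""
    best_class, best_score = None, -1.0
    total = sum(class_counts.values())
    for label, count in class_counts.items():
        score = count / total                 # P(v_j)
        for attr, value in attrs.items():     # multiply by each P(a_i | v_j)
            score *= value_counts[(attr, label)][value] / count
        if score > best_score:
            best_class, best_score = label, score
    return best_class

print(predict({"temp_At": "30-35", "temp_Bt": "40-45"}))   # '35-40'
```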
2.2.1.3 Bayesian Belief Networks (Bayes Nets)
The naive Bayes algorithm uses the assumption that the values of all the input attributes are conditionally independent given the value of the output attribute. But there may be cases where assuming conditional independence of all the given inputs does not lead to appropriate predictions. Bayesian belief networks, or Bayes nets, introduce the idea of applying conditional independence to only some of the inputs rather than to all of them. This avoids the global assumption of conditional independence while maintaining some amount of conditional independence among the inputs.
A Bayesian belief network [Friedman et al., 1997; Pearl, 1988] is a directed acyclic graphical network model that gives the joint probability distribution for a set of attributes. Each attribute in the instance is represented in the network by a node. A directed connection from node X to node Y is made when X is a parent of Y, which means that Y depends on X, or in other words that X has an influence on Y. An attribute at a node is thus conditionally independent of its non-descendants in the network given the state of its parent nodes. These influences are represented by conditional probabilities, which give the probability of a value at a node conditional on the values of its parents. These probability values for a node are arranged in a tabular form called a Conditional Probability Table (CPT). For nodes with no parents, the CPT gives the distribution of the attribute at that node.

Figure 2.5: A Bayesian network to predict temperature temp_Ct at a site. The arrows represent a direct relation between nodes. Each node is associated with a CPT.
When a node is connected to a set of parent nodes one step above it in the hierarchy, those parents influence its behavior, and given their values the node is conditionally independent of the other non-descendant nodes in the network. Nodes more than one step above in the hierarchy, that is, parents of parents of a node, are not considered to influence the node directly; they affect the node's parents and thus influence it only indirectly. Hence only the direct parents of a node are needed to specify its conditional probabilities. Using this conditional independence between nodes, the joint probability for a set of attribute values y1, y2, ..., yn represented by the nodes Y1, Y2, ..., Yn is given by

$$P(y_1, y_2, \ldots, y_n) = \prod_{i=1}^{n} P(y_i \mid Parents(Y_i)),$$

where Parents(Yi) are the immediate parents of node Yi. The probability values can be obtained directly from the CPTs associated with the node.
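A minimal sketch of this factored joint probability computed from CPTs is given below. The tiny two-node network and its probability values are invented for illustration; they are not the network of Figure 2.5.

```python
# Illustrative network: temp_Ct has no parents, temp_At has parent temp_Ct.
parents = {
    "temp_Ct": [],
    "temp_At": ["temp_Ct"],
}

# CPT entries map (node value, parent values...) to a probability (made-up numbers).
cpt = {
    "temp_Ct": {("35-40",): 0.6, ("30-35",): 0.4},
    "temp_At": {
        ("30-35", "35-40"): 0.7, ("25-30", "35-40"): 0.3,
        ("30-35", "30-35"): 0.2, ("25-30", "30-35"): 0.8,
    },
}

def joint_probability(assignment):
    """P(y1,...,yn) = product over nodes of P(y_i | Parents(Y_i))."""
    prob = 1.0
    for node, pars in parents.items():
        key = (assignment[node],) + tuple(assignment[p] for p in pars)
        prob *= cpt[node][key]
    return prob

print(joint_probability({"temp_Ct": "35-40", "temp_At": "30-35"}))   # 0.6 * 0.7 = 0.42
```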



A Bayesian network requires that both input and output attributes be discrete. A simple Bayesian network for predicting temperature at a site in our weather example, using only a few of the input attributes, is shown in Figure 2.5. Each node in the network is associated with a CPT. For example, the CPT for the node temp_At-2 contains the probability of each value taken by it for every combination of values of temp_At-1 and temp_Ct (i.e., its parents). For a given instance, the Bayesian network can be used to determine the probability distribution of the target class by multiplying the individual probabilities of the values taken by the individual nodes; the class value that has the highest probability is selected. The probability of a class value taken by the output attribute temp_Ct for the given input attributes, using the parent information from the Bayesian network in Figure 2.5, is
P(temp_Ct | temp_At-1, temp_At-2, temp_Bt, temp_Bt-2, temp_Ct-2) ∝

P(temp_Ct) * P(temp_At-1 | temp_Ct) * P(temp_At-2 | temp_At-1, temp_Ct) * P(temp_Bt | temp_Ct) *

P(temp_Bt-2 | temp_At-1, temp_Ct) * P(temp_Ct-2 | temp_At-2, temp_Ct),

where the product on the right is the joint probability of all the attributes; since the probability of the observed inputs is the same for every class value, the class with the highest product is selected.
Learning in Bayes nets from a given training set involves finding the best-performing network structure and calculating the CPTs. To build the network structure, we start by assigning each attribute a node. Learning the network connections involves searching through the set of possible connections and finding the accuracy of the resulting network on the given training set. The accuracy of a network can be determined by using a scoring criterion such as Akaike's Information Criterion [Akaike, 1974], the Minimum Description Length criterion [Rissanen, 1978] or the cross-validation criterion. Allen & Greiner [2000] present a brief description of these scoring criteria along with their empirical comparisons. For a given network, the CPTs are calculated at each node based on the information obtained from the training set.
The K2 algorithm [Cooper & Herskovits, 1992] can be used to learn the Bayesian network structure. K2 puts the given nodes in an order and then processes one node at a time. It adds an edge to the current node from one of the previously processed nodes only when the network accuracy increases after the addition. When no further connection that increases the accuracy can be added to the current node, the algorithm moves to the next node. This process continues until all nodes have been processed.
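A skeleton of this greedy search might look as follows; the score callback is an assumed stand-in for whichever scoring criterion (e.g. AIC, MDL or cross-validation) is used, and is not defined here.

```python
def k2_structure(nodes, score):
    """Greedy K2-style structure search (sketch).

    nodes: list of node names in the chosen ordering.
    score: assumed callback score(node, parents) returning a numeric score for the
           node with the given parent set on the training data (hypothetical).
    """
    structure = {}
    for i, node in enumerate(nodes):
        parents = []
        candidates = list(nodes[:i])          # edges may only come from earlier nodes
        while candidates:
            # Pick the candidate parent whose addition gives the best score.
            best = max(candidates, key=lambda c: score(node, parents + [c]))
            if score(node, parents + [best]) <= score(node, parents):
                break                          # no addition improves the score
            parents.append(best)
            candidates.remove(best)
        structure[node] = parents              # parents chosen for this node
    return structure
```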
When all variables present in the network are seen in the training data, the probability values in the CPTs can be filled in by counting the required terms. In the case of training data with missing variables, the gradient ascent training method [Russel et al., 1995] can be used to learn values for the CPTs.
2.2.2 Regression Algorithms
Algorithms that develop a model based on equations or mathematical operations on the values taken by the input attributes, producing a continuous value as the output, are called regression algorithms. The input to these algorithms can take both continuous and discrete values, depending on the algorithm, whereas the output is a continuous value. We describe in detail the regression algorithms that have been used in this thesis below.
2.2.2.1 Linear Regression
The Linear Regression algorithm of WEKA [Witten & Frank, 2005] performs standard least-squares regression to identify linear relations in the training data. This algorithm gives the best results when there is some linear dependency among the data. It requires the input attributes and the target class to be numeric and it does not allow missing attribute values. The algorithm calculates a regression equation to predict the output x for a set of input attributes a1, a2, ..., ak. The output is expressed as a linear combination of the input attributes, with each attribute ai associated with its respective weight wi; the attribute a0 is always taken to be the constant 1, so that its weight w0 acts as an intercept. The equation takes the form

$$x = w_0 a_0 + w_1 a_1 + \cdots + w_k a_k = w_0 + \sum_{i=1}^{k} w_i a_i.$$

For our weather example the equation learned would take the form


temp_Ct = w0 + wAt-2 temp_At-2 + wAt-1 temp_At-1 + wAt temp_At + wBt-2 temp_Bt-2 +

wBt-1 temp_Bt-1 + wBt temp_Bt + wCt-2 temp_Ct-2 + wCt-1 temp_Ct-1 ,

where temp_Ct is the value assigned to the output attribute, and each term on the right-hand side is the product of the value of an input attribute and the weight associated with that input.


The accuracy of the output predicted by this algorithm can be measured as the absolute difference between the actual output observed and the predicted output obtained from the regression equation; this difference is the error. The weights must be chosen in such a way that they minimize the error. To get better accuracy, higher weights must be assigned to those attributes that influence the result the most.
A set of training instances is used to update the weights. At the start, the weights can be assigned random values or all set to a constant (such as 0). For the first instance in the training data the predicted output is obtained as

$$w_0 a_0^{(1)} + w_1 a_1^{(1)} + \cdots + w_k a_k^{(1)} = \sum_{j=0}^{k} w_j a_j^{(1)},$$

where the superscript for attributes gives the instance position in the training data. After the predicted outputs for all instances are obtained, the weights are reassigned so as to minimize the sum of squared differences between the actual and predicted outcome. Thus the aim of the weight update process is to minimize



$$\sum_{i=1}^{n} \left( x^{(i)} - \sum_{j=0}^{k} w_j a_j^{(i)} \right)^2,$$

which is the sum of the squared differences between the observed output $x^{(i)}$ for the ith training instance and the predicted outcome for that training instance obtained from the linear regression equation, with n the number of training instances.
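A minimal sketch of such a least-squares fit, using NumPy's standard solver on made-up readings (the attribute values and outputs are illustrative, not the thesis data), is shown below.

```python
import numpy as np

# Sketch: least-squares fit of the weights in x = w0 + w1*a1 + ... + wk*ak.
A = np.array([                 # one row of input attribute values per instance
    [30.0, 40.0],
    [31.0, 41.0],
    [33.0, 42.0],
    [35.0, 44.0],
])
x = np.array([36.0, 37.0, 39.0, 41.0])   # observed outputs x^(i)

# Prepend the constant attribute a0 = 1 so that w0 acts as the intercept.
A1 = np.hstack([np.ones((A.shape[0], 1)), A])

# Weights minimizing the sum of squared differences between observed and
# predicted outputs (standard least squares).
w, *_ = np.linalg.lstsq(A1, x, rcond=None)

print(w)                        # fitted weights w0, w1, w2
print(A1 @ w)                   # predictions, close to the observed outputs
```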
