
2.2.2.2 LeastMedSquare
The WEKA LeastMedSquare or Least Median Squares of Regression algorithm [Rousseeuw, 1984] is a linear regression method that minimizes the median of the squares of the differences from the regression line. The algorithm requires input and output attributes to be continuous, and it does not allow missing attribute values. Standard linear regression is applied to the input attributes to predict the output. The predicted output x is obtained as

$x = w_0 + w_1 a_1 + w_2 a_2 + \cdots + w_k a_k$,

where the $a_i$ are the input attributes and the $w_i$ are the weights associated with them.

In the LeastMedSquare algorithm, the weights are updated using the training data in such a way that they minimize the median of the squares of the differences between the actual outputs and the outcomes predicted by the regression equation. The weights can initially be set to random values or assigned a scalar value. The aim of the weight update process is to determine new weights that minimize

$\underset{i}{\operatorname{median}} \left( x(i) - \hat{x}(i) \right)^2$,

where i ranges from 1 to the number of instances in the training data being used, x(i) is the actual output for training instance i, and $\hat{x}(i)$ is the outcome predicted for that training instance by the regression equation.
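To make the search concrete, the following sketch approximates Least Median of Squares by fitting ordinary least squares to many small random subsamples and keeping the weights whose median squared residual over the full training set is smallest. This is one common way to approach the LMS optimization, not necessarily WEKA's exact procedure; the function name and toy data are ours.

import numpy as np

def least_median_squares(X, y, n_trials=500, seed=0):
    """Approximate Least Median of Squares regression [Rousseeuw, 1984].

    Repeatedly fits ordinary least squares to minimal random subsamples and
    keeps the candidate weights whose median squared residual over the full
    data set is smallest.  A sketch of the idea, not WEKA's implementation.
    """
    rng = np.random.default_rng(seed)
    Xb = np.column_stack([np.ones(len(X)), X])       # prepend bias term w0
    p = Xb.shape[1]
    best_w, best_med = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(len(X), size=p, replace=False)  # minimal subsample
        try:
            w = np.linalg.lstsq(Xb[idx], y[idx], rcond=None)[0]
        except np.linalg.LinAlgError:
            continue
        med = np.median((y - Xb @ w) ** 2)           # median of squared residuals
        if med < best_med:
            best_med, best_w = med, w
    return best_w

# Toy data with one gross outlier; the LMS fit is robust to it.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
y[7] = 100.0                                         # outlier
print(least_median_squares(X, y))                    # weights close to [1, 2]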


2.2.2.3 M5P
The M5P or M5Prime algorithm [Wang & Witten, 1997] is a regression-based decision tree algorithm, based on the M5 algorithm by Quinlan [1992]. M5P extends M5 with several additions. We will first describe the M5 algorithm and then the features added to it in M5P.



Figure 2.6: An M5 model tree for predicting temperature at a site. The decision taken at a node is based on a test of the attribute mentioned at that node. Each model at a leaf takes the form $w_0 + w_1 a_1 + \cdots + w_k a_k$, where k is the number of input attributes.
M5 builds a tree to predict numeric values for a given instance. The algorithm requires the output attribute to be numeric while the input attributes can be either discrete or continuous. For a given instance the tree is traversed from top to bottom until a leaf node is reached. At each node in the tree a decision is made to follow a particular branch based on a test condition on the attribute associated with that node. Each leaf has a linear regression model associated with it of the form

$w_0 + w_1 a_1 + w_2 a_2 + \cdots + w_k a_k$,

based on some of the input attributes $a_1, a_2, \ldots, a_k$ in the instance, whose respective weights $w_0, w_1, \ldots, w_k$ are calculated using standard regression (Section 2.2.2.1). As the leaf nodes contain linear regression models used to obtain the predicted output, the tree is called a model tree. When the M5 algorithm is applied to our weather example, the model tree generated takes a form like that shown in Figure 2.6.
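As an illustration of how a model tree produces a prediction, the following sketch traverses a small hypothetical tree of the kind shown in Figure 2.6; the class structure, attribute indices, thresholds and weights are all made up for illustration.

import numpy as np

# A hypothetical M5-style model tree: internal nodes test one attribute
# against a threshold, leaves hold linear models.
class Node:
    def __init__(self, attr=None, thresh=None, left=None, right=None, weights=None):
        self.attr, self.thresh = attr, thresh        # test: x[attr] <= thresh?
        self.left, self.right = left, right
        self.weights = weights                       # leaf model [w0, w1, ..., wk]

    def predict(self, x):
        if self.weights is not None:                 # leaf: w0 + w1*a1 + ... + wk*ak
            return self.weights[0] + np.dot(self.weights[1:], x)
        child = self.left if x[self.attr] <= self.thresh else self.right
        return child.predict(x)

# A two-leaf tree over two input attributes (values invented for illustration).
tree = Node(attr=0, thresh=30.0,
            left=Node(weights=np.array([1.0, 0.5, 0.2])),
            right=Node(weights=np.array([-3.0, 0.9, 0.1])))
print(tree.predict(np.array([25.0, 40.0])))          # follows the left branch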


To build a model tree using the M5 algorithm, we start with a set of training instances. The tree is built using a divide-and-conquer method. At a node, starting with the root node, the instance set that reaches it is either associated with a leaf, or a test condition is chosen that splits the instances into subsets based on the test outcome. A test is based on an attribute's value, which is used to decide which branch to follow. There are many potential tests that can be used at a node; in M5 the test that maximizes the expected error reduction is used. For a test, the expected error reduction is found using

$\Delta error = \operatorname{stdev}(S) - \sum_i \frac{|S_i|}{|S|} \times \operatorname{stdev}(S_i)$,

where S is the set of instances passed to the node, stdev(S) is its standard deviation, and $S_i$ is the subset of S resulting from splitting at the node on the ith outcome of the test. This process of creating new nodes is repeated until there are too few instances to proceed further or the variation in the output values of the instances that reach the node is small.
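The split selection can be sketched directly from the formula above; the helper names and toy data below are ours, and only binary threshold splits on a numeric attribute are tried.

import numpy as np

def sdr(y, splits):
    """Expected error reduction for a candidate test in M5.

    y       output values of the instances S reaching the node
    splits  list of index arrays, one per test outcome, partitioning S
    Implements stdev(S) - sum_i |S_i|/|S| * stdev(S_i).
    """
    y = np.asarray(y, dtype=float)
    return y.std() - sum(len(idx) / len(y) * y[idx].std() for idx in splits)

def best_numeric_split(x, y):
    """Try every threshold on attribute x and return the one maximizing the
    expected error reduction (a sketch of M5's split selection)."""
    best_t, best_gain = None, -np.inf
    for t in np.unique(x)[:-1]:                  # exclude max so both sides are non-empty
        left, right = np.where(x <= t)[0], np.where(x > t)[0]
        gain = sdr(y, [left, right])
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

x = np.array([10., 12., 30., 31., 35.])
y = np.array([1.0, 1.1, 5.0, 5.2, 5.1])
print(best_numeric_split(x, y))                  # splits between 12 and 30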


Once the tree has been built, a linear model is constructed at each node. The linear model is a regression equation. The attributes used in the equation are those that are tested or used in linear models in the sub-trees below this node. The attributes tested above this node are not used in the equation, as their effect on predicting the output has already been captured by the tests at those nodes. The linear model is then simplified by eliminating attributes: those whose removal from the model leads to a reduction in the error are dropped. Here the error is defined as the absolute difference between the output value predicted by the model and the actual output value seen for a given instance.

The tree built this way can take a complex form, so it is pruned to make it simpler without losing its basic functionality. Starting from the bottom of the tree, the error is calculated for the linear model at each node. If the error for the linear model at a node is smaller than that of the model sub-tree below it, the sub-tree is pruned away. In the case of missing values in training instances, M5P changes the expected error reduction equation to



$\Delta error = \frac{m}{|S|} \times \beta(i) \times \left[ \operatorname{stdev}(S) - \sum_{j \in \{L,R\}} \frac{|S_j|}{|S|} \times \operatorname{stdev}(S_j) \right]$,

where m is the number of instances without missing values for that attribute, S is the set of instances at the node, β(i) is a correction factor multiplied in for discrete attributes, and j takes the values L and R, with $S_L$ and $S_R$ being the sets obtained from splitting on that attribute.


2.2.2.4 MultiLayer Perceptron
A MultiLayer Perceptron (MLP) [Bishop, 1995] is a neural network that is trained using backpropagation. MLPs consist of multiple layers of computational units connected in a feed-forward manner, forming directed connections from units in a lower layer to units in the subsequent layer. The basic structure of an MLP consists of an input layer, one or more hidden layers and one output layer. Units in the hidden layers are termed hidden because their output is used only within the network and is not seen outside it. An MLP with two hidden layers that can be used to predict temperature in our weather example is shown in Figure 2.7. The output from a unit is used as input to units in the subsequent layer, and each connection between units has an associated weight.



Figure 2.7: A multilayer perceptron with two hidden layers to predict temperature at a site. Each connection is associated with a weight. Hidden and output units are sigmoid units.
Figure 2.8: A sigmoid unit that takes inputs $x_i$ with associated weights $w_i$ and produces sigmoid(x) as the resulting output from the unit.

The hidden and output units are based on sigmoid units. A sigmoid unit calculates a linear combination of its inputs and then applies the sigmoid function to the result. The sigmoid function, for net input x, is

$\operatorname{sigmoid}(x) = \dfrac{1}{1 + e^{-x}}$.
The output of a sigmoid unit, sigmoid(x), is a continuous function of its input x and lies in the range 0 to 1. A sigmoid unit is shown in Figure 2.8. In addition to the inputs supplied to it, the sigmoid unit also takes in a constant input of 1.
An MLP learns its weights using the backpropagation algorithm [Rumelhart et al., 1986]. The backpropagation algorithm takes a set of training instances for the learning process. For the given feed-forward network, the weights are initialized to small random numbers. Each training instance is passed through the network and the output from each unit is computed. The target output is compared with the output computed by the network to calculate the error, and this error value is fed back through the network. To adjust the weights, backpropagation uses gradient descent to minimize the squared error between the target output and the computed output. At each unit in the network, starting from the output unit and moving down through the hidden units, the unit's error value is used to adjust the weights of its connections so as to reduce the error. The weights are updated using

$\Delta w_{ji} = \eta\, \delta_j\, x_{ij}$,

where $w_{ji}$ is the weight from unit i to unit j, $x_{ij}$ is the input from unit i to unit j, $\eta$ is the learning rate, and $\delta_j$ is the error obtained at unit j. This process of adjusting the weights using training instances is iterated a fixed number of times or is continued until the error becomes small or cannot be reduced further.


To improve the performance of the backpropagation algorithm, the weight update made at the nth iteration of backpropagation is made partially dependent on the weight change made in the (n-1)st iteration. The amount by which the (n-1)st iteration contributes is determined by a constant called the momentum (α). The new rule used for the weight update at the nth iteration is

$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ij} + \alpha\, \Delta w_{ji}(n-1)$.

This momentum term is added to achieve faster convergence to a minimum in some cases.
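A minimal sketch of backpropagation with momentum on a one-hidden-layer network of sigmoid units follows. The network size, learning rate, momentum value and synthetic data are assumptions made for illustration, and the gradients are averaged over the training instances; this is not WEKA's MultilayerPerceptron code.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 3))              # 50 instances, 3 inputs
y = sigmoid(0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.2)  # synthetic target in (0, 1)

n_hidden, eta, alpha = 4, 0.5, 0.8                # learning rate, momentum
W1 = rng.normal(0, 0.5, size=(3, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.5, size=(n_hidden, 1)); b2 = np.zeros(1)
vW1 = np.zeros_like(W1); vb1 = np.zeros_like(b1)  # previous weight changes
vW2 = np.zeros_like(W2); vb2 = np.zeros_like(b2)
N = len(X)

for epoch in range(3000):
    h = sigmoid(X @ W1 + b1)                      # hidden unit activations
    out = sigmoid(h @ W2 + b2).ravel()            # network output
    # Error terms (delta) at the output and hidden units.
    d_out = ((out - y) * out * (1 - out)).reshape(-1, 1)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    # Momentum update: delta_w(n) = -eta * gradient + alpha * delta_w(n-1).
    vW2 = -eta * (h.T @ d_out) / N + alpha * vW2; W2 += vW2
    vb2 = -eta * d_out.mean(0) + alpha * vb2;     b2 += vb2
    vW1 = -eta * (X.T @ d_hid) / N + alpha * vW1; W1 += vW1
    vb1 = -eta * d_hid.mean(0) + alpha * vb1;     b1 += vb1

print(np.mean((out - y) ** 2))                    # squared error becomes small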


2.2.2.5 RBF Network
An RBF or Radial Basis Function Network [Buhmann & Albovitz, 2003; Orr, 1996] is another type of feed-forward neural network. It has three layers: the input, hidden and output layers. It differs from an MLP in the way the hidden layer units perform their calculations. An RBF Network can build both regression and classification models; we will describe the regression model.
In an RBF Network, inputs from the input layer are mapped to each of the hidden units. The hidden units use radial functions for activation, such as the Gaussian, multiquadric, inverse-multiquadric and Cauchy functions [Orr, 1996]. The RBF Network in WEKA [Witten & Wang, 2005] uses the bell-shaped Gaussian function. The activation h(x) of the Gaussian function for a given input x decreases monotonically as the distance between the center c of the Gaussian and x increases. The Gaussian function is useful for computing the activation at a hidden unit because the activation depends on the closeness of the inputs to the center of that hidden unit, and it can therefore be used as an effective way to distinguish between inputs. The Gaussian function is of the form

$h(x) = \exp\!\left( -\dfrac{(x - c)^2}{r^2} \right)$, where r is the width of the Gaussian.
The output layer takes a linear combination of the outputs from the hidden units and is similar to a regression model. An RBF Network for our weather example will be of the form shown in Figure 2.9.
An RBF Network treats the inputs and the hidden units as points in space. The activation of a hidden unit depends on the distance between the point in space representing the input values and the point for that hidden unit. The distance is converted into a similarity measure by the Gaussian function.


Figure 2.9: An RBF Network with n hidden units to predict temperature at a site. X is the input vector given to the network, and the output temp_Ct is the sum of the products of the activations from the hidden units and the weights associated with them. Each hidden unit has its own learned center.

The point in space for a hidden unit is obtained from the center of the Gaussian for that hidden unit. The width of the Gaussian is a learned parameter as well.


An RBF Network is trained to learn the centers and widths of the Gaussian functions for the hidden units, and then to adjust the weights in the regression model used at the output unit. To learn the centers of the Gaussian functions, the k-means clustering algorithm can be used, clustering the training instances to obtain k Gaussian functions for each attribute in the instance. After the parameters of the Gaussian functions at the hidden units have been found, the weights from these units to the output unit are adjusted using linear regression. The process can be repeated so that the network learns in an EM manner.
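The following sketch follows that recipe: k-means chooses the centers, each width is taken from the spread of the corresponding cluster, and the output weights are fitted by linear regression. It is an illustration of the training scheme described above, not WEKA's RBFNetwork implementation; the cluster count, width heuristic and data are assumptions.

import numpy as np

def fit_rbf(X, y, k=3, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):                          # plain k-means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    # One width per center: mean distance of its members (with a small floor).
    widths = np.array([
        max(np.linalg.norm(X[labels == j] - centers[j], axis=1).mean(), 1e-3)
        if np.any(labels == j) else 1.0
        for j in range(k)])
    # Gaussian activations h(x) = exp(-||x - c||^2 / r^2).
    H = np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
               / widths ** 2)
    H = np.column_stack([np.ones(len(X)), H])        # bias for the linear model
    w = np.linalg.lstsq(H, y, rcond=None)[0]         # regression at the output unit
    return centers, widths, w

def predict_rbf(X, centers, widths, w):
    H = np.exp(-np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) ** 2
               / widths ** 2)
    return np.column_stack([np.ones(len(X)), H]) @ w

X = np.random.default_rng(1).uniform(0, 10, size=(60, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1]
c, r, w = fit_rbf(X, y, k=8)
print(np.mean((predict_rbf(X, c, r, w) - y) ** 2))   # small training error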
2.2.2.6 The Conjunctive Rule Algorithm
The Conjunctive Rule algorithm in WEKA learns a single rule that can predict an output value. It can predict both discrete and numeric classes. An example of a conjunctive rule that can be developed from our weather example is

IF temp_At-2 > 30 AND temp_Ct-1 < 90 AND temp_Bt-1 > 40 THEN temp_Ct = 30.
A conjunctive rule consists of a set of antecedents (e.g., temp_At-2 > 30, temp_Ct-1 < 90) ANDed together to give the consequent (e.g., temp_Ct). Antecedents consist of relations between relevant attributes and the consequent indicates an output value.
The learning process in the Conjunctive Rule algorithm attempts to come up with a rule over all relevant attributes based on the training data. The algorithm learns by calculating the variance reduction for all possible antecedents and then selecting the one that reduces the variance the most. When the learned rule becomes too complex, it is pruned using reduced-error pruning similar to that in the J48 system.
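A sketch of how a single antecedent can be chosen by variance reduction is given below; the attribute names and data are hypothetical, and the search is restricted to simple threshold tests on numeric attributes.

import numpy as np

def variance_reduction(y, covered):
    """Drop in variance when the instances are split into those covered by
    the candidate antecedent and those that are not."""
    y = np.asarray(y, dtype=float)
    frac = covered.mean()
    return y.var() - (frac * y[covered].var() + (1 - frac) * y[~covered].var())

def best_antecedent(X, y, attr_names):
    """Score every 'attr > t' and 'attr <= t' antecedent and return the one
    with the largest variance reduction (a sketch, not the WEKA code)."""
    best = None
    for a, name in enumerate(attr_names):
        for t in np.unique(X[:, a])[:-1]:        # both sides stay non-empty
            for op, mask in ((">", X[:, a] > t), ("<=", X[:, a] <= t)):
                vr = variance_reduction(y, mask)
                if best is None or vr > best[0]:
                    best = (vr, f"{name} {op} {t}")
    return best

X = np.array([[25., 80.], [35., 60.], [40., 50.], [20., 95.], [38., 55.]])
y = np.array([20., 32., 35., 18., 33.])
print(best_antecedent(X, y, ["temp_At-2", "temp_Ct-1"]))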
In cases when a test instance is seen that is not covered by the rule, the attribute whose value is not included in the rule is assigned the default value for that attribute.
2.3 Predicting Time Sequence Data - Hidden Markov Models
A discrete process taking place in the real world can be viewed as generating a symbol at each step. For example, a coin toss leads to a series of heads and tails. Another example is the temperature at a certain location, which is the result of a number of weather conditions together with the location itself; these combine to form a real-world time series. As time goes by and the process keeps running, we get a sequence of symbols generated by the process. Based on the outcome of the process, these symbols can be discrete (e.g., precipitation type) or continuous (e.g., temperature). These sequences of observed symbols can be used to generate a statistical model that describes the workings of the process and the generation of symbols over time. This model can then be used to identify or classify other sequences of symbols. One such model is the Hidden Markov Model (HMM).
When information gathered from a system is generated in a sequential manner, state transition networks can be used to represent knowledge about this system. This network consists of a set of states and transitions between states. Transitions between states are triggered by events in the domain. HMM state transition networks have an initial state and an accepting or end state. The network recognizes a sequence of events if these events start at the initial state and end in the accepting state when all the events have occurred.
A Markov model is a probabilistic model over a finite set of states, where the probability of being in a state at a time t depends only on the previous state visited at time t-1. An HMM (see Figure 2.10 for an example) is a model where the system being modeled is assumed to be a Markovian process with unknown parameters.

Figure 2.10: A Hidden Markov Model. A new state is visited when a transition occurs after a certain amount of time. Each state emits a symbol when reached.
In an HMM the states are referred to as hidden because the system we wish to model may have underlying causes that cannot be observed. For example, when the only information available is the position of an object, the factors that determine its position and velocity are not observable. The hidden, non-observable process taking place in the system can be inferred from the process that generates the sequence of observable symbols.
HMMs are used in many fields such as natural language processing, bioinformatics, genomics, optical character recognition and speech recognition. In speech recognition [Rabiner, 1989] the speech input is broken down into smaller segments which can be classified into a set of predefined symbols. The long sequences of symbols thus obtained are used to build a model using HMM learning methods. The generated model is used for processing sequences in order to identify or recognize the source of speech input.
An HMM can be used in our example of predicting weather conditions such as temperature. In order to predict temperature at a location, the temperature has to be converted into classes and each class represented by a unique symbol. For example, if we take hourly readings then we get a sequence of 24 symbols for a day. The sequences obtained for a set of days can be used to build an HMM model that predicts the temperature class value for each hour at that location.
The components that constitute an HMM are a finite set of states, a set of transitions between these states with associated probability values, and a set of output symbols emitted by the states. After each time interval a new state is entered, depending on the transition probability from the previous state. This follows a simple Markov model, in which the probability of entering a state depends only on the previous state. The state sequence obtained over time is called the path. The transition probability $a_{km}$ of moving from a state k to a state m is given by the probability of being in state m at time t when at time t-1 we were in state k:

$a_{km} = P(\mathrm{path}_t = m \mid \mathrm{path}_{t-1} = k)$.
After a transition is made to a new state, a symbol is emitted from the new state. Thus as the state path is followed we get a sequence of symbols $(x_1, x_2, x_3, \ldots)$. The symbol observed at a state is based on a probability distribution that depends on the state. The probability that a symbol b is emitted when state k is reached, called the emission probability, is determined by

$e_k(b) = P(x_t = b \mid \mathrm{path}_t = k)$.
An initial state distribution gives the probability of being in a particular state at the start of the path. The choice of all these parameters, that is, the transition probabilities, the emission probabilities and the initial state probabilities, is important in building an effective HMM.
In our weather example, let the temperature be broken down into three classes, namely p, q, and r. Also assume that the HMM built has four states, namely 1, 2, 3 and 4. In this case our state transition probability matrix will be of size 4 by 4 and the emission probability matrix will be of size 3 by 4. The model built by an HMM learner will have these four states and at each state we can observe any of the three symbols p, q and r based on their respective emission probabilities at that respective state. The transition from one state to another is associated with the transition probability between the respective states.
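The example model can be written down directly as matrices. The probability values below are made up for illustration; the shapes follow the description above, with a 4-by-4 transition matrix and one emission distribution over p, q and r per state (stored here as a matrix with one row per state).

import numpy as np

states = ["1", "2", "3", "4"]
symbols = ["p", "q", "r"]

A = np.array([[0.6, 0.2, 0.1, 0.1],      # a_km: transition from state k to m
              [0.1, 0.6, 0.2, 0.1],
              [0.1, 0.2, 0.5, 0.2],
              [0.2, 0.1, 0.2, 0.5]])
E = np.array([[0.7, 0.2, 0.1],           # e_k(b): emission of p, q, r in state k
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4],
              [0.2, 0.3, 0.5]])
pi = np.array([0.25, 0.25, 0.25, 0.25])  # initial state distribution

# Each row of A and E is a probability distribution.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(E.sum(axis=1), 1)

# Sample one 24-symbol day from the model: a symbol is emitted each time
# a state is reached, then a transition is made.
rng = np.random.default_rng(0)
state = rng.choice(4, p=pi)
day = []
for _ in range(24):
    day.append(symbols[rng.choice(3, p=E[state])])
    state = rng.choice(4, p=A[state])
print("".join(day))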



Figure 2.11: An HMM with states 1, 2, 3 and 4 that emit a symbol when reached. Transitions start at the start state and end at the end state. The arrows represent all possible state transitions.
The transition probability of moving from state 1 to state 2 is represented by $a_{12}$. As the sequence has a length of 24, we will see 24 transitions, starting with a transition from the start state and always ending at the end state after the 24th symbol is seen. Figure 2.11 shows an HMM with the four states 1, 2, 3 and 4 together with the start state, the end state and all possible transitions between these states. The HMM in the figure allows all possible state transitions, and the transition probability associated with each transition is given along with it. In the figure, the symbols p, q and r that can be emitted at each of the states 1, 2, 3 and 4 are given in rectangular blocks. The emission of each symbol at a state is determined by its emission probability at that state. Transitions always start at the start state and end at the end state.
According to Rabiner & Juang [1986], for any HMM there are three basic questions that must be answered for the model to be used effectively:

1. What is the probability that an observed sequence is produced by the given model?

A model can generate many sequences, based on the transitions taking place and the symbols emitted each time a state is visited. The probability of the observed sequence being produced by the model gives an estimate of how good the model is. The Forward algorithm or the Backward algorithm [Baum & Petrie, 1966] can be used to find the probability of observing a sequence in a given model.

2. How do we find an optimal state sequence (path) for a given observation sequence?

As there exist many paths that yield a given sequence of symbols, to select the optimal state path we need to find the path with the highest probability of occurrence. The Viterbi algorithm [Viterbi, 1967; Forney, 1973] is a procedure that finds the single best path in the model for a given sequence of symbols.

3. How do we adjust the model parameters, that is, the transition probabilities, emission probabilities and the initial state probabilities, to generate a model that best represents the training set?

Model parameters need to be adjusted so as to maximize the probability of the observed sequences. The Baum-Welch algorithm [Rabiner & Juang, 1986] is a method that uses an iterative approach to solve this problem. It starts with preassigned probabilities and adjusts them based on the observed sequences in the training set.

The Forward-Backward, the Baum-Welch and the Viterbi algorithms are described in detail in the following subsections.


2.3.1 The Forward Algorithm and The Backward Algorithm
A sequence of observations can be generated by taking different paths in the model. To find the probability of this sequence we need to add up the probabilities obtained for each possible path the sequence can take. The Forward algorithm and the Backward algorithm use dynamic programming to calculate this probability instead of having to enumerate the probabilities observed across all the possible paths.

Table 2.2: The Forward Algorithm

Initialization (i = 0) :   $f_0(0) = 1$; $f_k(0) = 0$ for all states $k > 0$

Recursion (i = 1 .... L) :   $f_m(i) = e_m(x_i) \sum_k f_k(i-1)\, a_{km}$

Termination :   $P(x) = \sum_k f_k(L)\, a_{k,\mathrm{end}}$

$f_m(i)$ - probability of seeing the symbols up to position i, with the symbol at position i seen in state m

$e_m(x_i)$ - probability of emitting the symbol $x_i$ in state m

$a_{km}$ - probability of a transition from state k to state m ($a_{k,\mathrm{end}}$ is the transition to the end state)

$P(x)$ - probability of observing the entire sequence x

L - length of the sequence


As HMMs follow the Markovian process, the probability of being in a particular state depends only on the state observed before it. The Forward algorithm uses this property to find the probability of a symbol sequence. After having observed i - 1 symbols in a sequence x of length L, suppose the symbol at position i in the sequence is seen in state m. The probability of seeing this symbol at position i in state m is found by multiplying the emission probability of this symbol at state m by the sum, over every state k in the model, of the product of the probability of being in state k when the (i-1)st symbol was seen and the transition probability of moving from state k to the present state m. This is called the forward probability and is given by

.

In case a transition is not possible, the transition probability is taken to be 0. Before starting the sequence, we initialize the probability of being in the start state with no symbols observed to 1, that is, $f_0(0) = 1$. We calculate $f_m(i)$ for each position i as we move through the sequence. The probability of observing the entire sequence, P(x), is the value obtained once all the symbols in the sequence have been seen and the end state has been reached. The pseudocode for the Forward algorithm [Durbin et al., 1989] is given in Table 2.2.


The Backward algorithm can also be used to calculate the probability of the observed sequence. As the name suggests, it works analogously to the Forward algorithm. Instead of calculating the probability values from the start state, as is done by the Forward algorithm, the Backward algorithm starts at the end state and moves towards the start state. In each backward step it calculates the probability of observing the rest of the sequence given the state at that position. We define the backward probability $b_k(i)$ as

$b_k(i) = P(x_{i+1} \ldots x_L \mid \mathrm{path}_i = k)$,

which is the probability of observing the rest of the sequence when we are in state k after observing i symbols.
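Both recurrences can be implemented in a few lines. The sketch below uses a two-state toy model with made-up probabilities and folds the start and end states into an initial distribution and a final sum, so the forward and backward passes return the same P(x).

import numpy as np

def forward(obs, A, E, pi):
    L, k = len(obs), A.shape[0]
    f = np.zeros((L, k))
    f[0] = pi * E[:, obs[0]]                     # initialization
    for i in range(1, L):                        # f_m(i) = e_m(x_i) sum_k f_k(i-1) a_km
        f[i] = E[:, obs[i]] * (f[i - 1] @ A)
    return f, f[-1].sum()                        # termination: P(x)

def backward(obs, A, E, pi):
    L, k = len(obs), A.shape[0]
    b = np.zeros((L, k))
    b[-1] = 1.0                                  # nothing left to observe after L
    for i in range(L - 2, -1, -1):               # b_k(i) = sum_m a_km e_m(x_{i+1}) b_m(i+1)
        b[i] = A @ (E[:, obs[i + 1]] * b[i + 1])
    return b, (pi * E[:, obs[0]] * b[0]).sum()   # same P(x) as the forward pass

A = np.array([[0.7, 0.3], [0.4, 0.6]])           # two-state toy model
E = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]                               # observed symbol indices

_, p_fwd = forward(obs, A, E, pi)
_, p_bwd = backward(obs, A, E, pi)
print(p_fwd, p_bwd)                              # both give P(x)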



