3.3 Feature Vectors
To build a predictive model, using ML algorithms, that can predict the values reported by an RWIS sensor, we require a training set that contains data related to the weather conditions. A test set is used to evaluate the performance of the model built. ML algorithms require the feature vector for the dataset to consist of a set of input attributes and an output attribute, the output attribute being the predicted variable. ML algorithms use the information from the feature vectors in the training set to build the model. The input attributes of a feature vector in the test set are applied to the model built from the training set to predict the output attribute's value, and this predicted value is compared with the actual value to estimate the model's performance. The feature vector for ML algorithms takes the form

input1, ..., inputn, output

where inputi is an input attribute and output is the output attribute.


For example, to predict temperature at time t for RWIS site 67, denoted temp67t, from Figure 3.1 we use the temperature values at times t-1 and t for all the RWIS sites in its set, namely 19, 27, and 67. The feature vector for this example will be

temp19t-1, temp27t-1, temp67t-1, temp19t, temp27t, temp67t

where temp67t is the output attribute or dependent variable (and is therefore only available during the process of building the model, as this is what we are predicting) and the rest form the input attributes or independent variables.
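The feature vector above can be assembled mechanically. A minimal sketch, assuming hourly readings are stored in a dict keyed by (site, hour); the site numbers match the example, but the temperature values and the `readings` layout are illustrative:

```python
# Sketch: build the feature vector [temp19(t-1), temp27(t-1), temp67(t-1),
#         temp19(t), temp27(t), temp67(t)] for predicting temp67 at hour t.
# The `readings` dict and its values are invented for illustration.

def build_feature_vector(readings, sites, target_site, t):
    """Return (inputs, output): the t-1 readings for every site in the set,
    then the t readings for the other sites, with the target site's
    reading at t as the output attribute."""
    inputs = [readings[(site, t - 1)] for site in sites]
    inputs += [readings[(site, t)] for site in sites if site != target_site]
    output = readings[(target_site, t)]
    return inputs, output

readings = {(19, 9): 41.0, (27, 9): 40.5, (67, 9): 42.2,
            (19, 10): 43.1, (27, 10): 42.0, (67, 10): 44.0}
inputs, output = build_feature_vector(readings, [19, 27, 67], 67, 10)
```

During training both inputs and output come from historical data; at prediction time only the inputs are supplied and the model's output is compared with the sensor's reported value.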


Predicting a value reported by an RWIS sensor corresponds to predicting the weather condition reported by that sensor. As variations in a weather condition, like air temperature, are not completely independent but depend on other conditions such as wind, precipitation, and air pressure, all these variables can be used to predict that condition at a location. The previous hour's readings (or readings further back) from a sensor can be used to predict present conditions, as changes in weather conditions follow a pattern and are not totally random. The RWIS-AWOS sets include sites from locations that do not belong to distinct micro-climatic regions, and thus the climatic conditions at one location have some correlation with the conditions at another location in the set. Using this information, we include data obtained from the nearby sites, both AWOS and RWIS, to predict variables at an RWIS site. This is done by including all the sites in the RWIS-AWOS set to which a particular RWIS site belongs.
In brief, to predict a value for an RWIS sensor we use other weather variables, the previous hour's readings (or readings further back) of these variables, and the respective readings from nearby RWIS and AWOS sites (see Figure 3.1).
3.4 Feature Symbols for HMMs
The instances in a dataset used by an HMM for predicting the class value of a variable consist of a string of symbols that forms the feature symbol. The symbols that constitute the feature symbol are unique and are generated from the training set.
For predicting a weather variable's value at an RWIS site, taking the variable's hourly readings for a day into consideration, we get a string of length 24. This string forms the feature symbol used in the dataset. When information from a single site is used for predictions, the class values taken by the variable form the symbol set.

When the variable's information from two or more sites is used together, a string of class values is obtained by appending the variable's value from each site. For example, to predict the temperature class of site 19, we include temperature data from sites 27 and 67, which belong to the first set of RWIS sites. The combination of class values from these three sites seen at a particular hour forms a class string. For example, if at sites 19, 27, and 67 the temperature class values seen are 3, 5, and 4 respectively, then the class string formed by appending the class values in the order the sites were listed is 354.


The sequence of hourly class strings for a day forms an instance in the dataset. All unique class strings seen in the training set are arranged in ascending order, where possible, and each class string is assigned a symbol. For hourly readings taken during a day we arrive at a string 24 symbols long, which forms the feature symbol used by the HMM.
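The construction of class strings and their symbol table can be sketched as follows; the site order matches the example above, but the hourly class values are illustrative and integer indices stand in for the assigned symbols:

```python
# Sketch: turn per-site class values into class strings, then map each
# unique class string seen in training to a symbol (here, an integer
# index assigned in ascending order of the class strings).

def make_class_strings(per_site_classes):
    """per_site_classes: one tuple per hour, one class value per site.
    (3, 5, 4) -> '354', matching the appending rule in the text."""
    return ["".join(str(c) for c in hour) for hour in per_site_classes]

def build_symbol_table(training_class_strings):
    """Assign a symbol to each unique class string, in ascending order."""
    return {s: i for i, s in enumerate(sorted(set(training_class_strings)))}

hours = [(3, 5, 4), (3, 5, 4), (2, 5, 4), (3, 4, 4)]   # sites 19, 27, 67
strings = make_class_strings(hours)
table = build_symbol_table(strings)
encoded = [table[s] for s in strings]
```

A full day of 24 such encoded hours forms one feature symbol string, i.e. one HMM instance.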
3.5 Methods Used for Weather Data Modeling
To be able to detect RWIS sensor malfunctions, we need to check for significant variations and/or deviations between the values reported by a sensor and the actual weather conditions present at the location of its site. To determine the actual weather condition at a location, we predict a value for that weather variable. Based on the difference between the predicted value and the value reported by the RWIS sensor, we can detect sensor malfunctions. To predict a variable's value, a function is derived that can explain the system of weather variations. We use ML methods to build such a function, in other words a predictive model, using weather data collected in the past. This model is used to predict present values of a weather variable. Both classification and regression algorithms and HMMs are used for modeling the weather data.
Malfunctions in a sensor are easily detected if the predicted values are highly accurate. This makes the accuracy of the prediction made by ML methods a key factor in detecting sensor malfunctions. The performance of an algorithm on the data provided is evaluated using the cross-validation technique.
We will first describe cross-validation and then present the general approach taken by the ML methods for prediction of weather variables at the RWIS sites.
3.5.1 Cross-Validation
Cross-validation is a technique used to evaluate the performance of a model built by an algorithm. In cross-validation, a subset of the data provided is kept aside and the remaining data, the training set, is used by the algorithm to build a model. The part of the dataset not used in training, the test set, is then used to evaluate the performance of the model by measuring the accuracy with which the model classifies the test set instances.
In n-fold cross-validation, the dataset is divided into n subsets of equal size. One subset is used as a test set and the remaining n-1 subsets are used for training. The cross-validation is performed n times, with each of the subsets being used as a test set exactly once. The performance of the model on each subset used as a test set is averaged to calculate the overall performance of the algorithm. The advantage of using n-fold cross-validation is that each instance of the dataset gets to be in a test set once, and it can be used to evaluate the performance of the model within a single dataset. Kohavi [1995] suggests using 10- or 20-fold cross-validation for better estimates.
Multiple n-fold cross-validations can be performed on a single dataset by randomizing the data before splitting it into n subsets. This method puts different data into the n subsets each time. The data can be randomized using a random number generator, and a different random order can be obtained each time by changing the seed value given to the generator.
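The repeated, seeded splitting described above can be sketched as follows; the dataset (plain integers) and the fold and seed choices are illustrative:

```python
import random

# Sketch of multiple n-fold cross-validation with a seeded shuffle: each
# repetition reshuffles with a different seed, so the n subsets receive
# different data each time.

def n_fold_splits(data, n, seed):
    """Shuffle a copy of the data with the given seed and yield n
    (train, test) pairs; each instance lands in a test set exactly once."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n] for i in range(n)]
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(20))
for seed in (1, 2, 3):              # three repetitions, different orders
    tested = []
    for train, test in n_fold_splits(data, 4, seed):
        tested.extend(test)
    assert sorted(tested) == data   # every instance tested exactly once
```

In practice the per-fold performance figures would be averaged within each repetition and then across repetitions, as described in the text.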


3.5.2 General Classification Approach
Classification algorithms are used to classify a given instance into one of a set of discrete categories. Discrete variables like precipitation type and discretized temperature values are predicted using this approach. The classification algorithms described in Section 2.2.1, namely J48 decision trees, Naive Bayes, and Bayesian networks, were used to predict these variables. As all these classification algorithms allow continuous input attributes, the feature vector used included the current and a couple of previous hours' temperature readings for various RWIS and AWOS sites, along with the variable we are trying to predict.
Each dataset was built according to a feature vector format derived from the weather data available for the RWIS and AWOS sites. The dataset is then split into a training set and a test set using the cross-validation method. The classification algorithms use the training set to build a model, and the test set is used to evaluate the performance of the model. Multiple n-fold cross-validations are performed to obtain a better estimate of the model performance.
Classification algorithms predict the class value taken by the output attribute, in our case precipitation type or temperature class value, for a given instance in the test set. The prediction results are represented in the form of a confusion matrix, with rows corresponding to actual values and columns corresponding to predicted values of the output attribute. Each cell in the confusion matrix gives the number of times the actual class given by the row is predicted as the class given by the column. The numbers in the diagonal cells give the number of times the predicted class value was equal to the actual class value. Thus, the sum of the entries along the diagonal divided by the total number of instances in the test set gives the percentage of correctly classified instances. In the case of multiple n-fold cross-validations, the confusion matrices obtained for each test set are averaged to obtain a confusion matrix with the mean values.
For example, for the two confusion matrices Matrix 1 and Matrix 2 below, their average matrix can be obtained by averaging the values in the corresponding cells. For instance, the value at row 1, column 1 of the Average Matrix is obtained by averaging the row 1, column 1 values of Matrix 1 and Matrix 2; the average of 10 and 15 is 12.5.

Matrix 1

                      Predicted
  Actual     Class A   Class B   Total
  Class A    10        20        30
  Class B    30        40        70
  Total      40        60        100

Matrix 2

                      Predicted
  Actual     Class A   Class B   Total
  Class A    15        15        30
  Class B    30        40        70
  Total      45        55        100

Average Matrix

                      Predicted
  Actual     Class A   Class B   Total
  Class A    12.5      17.5      30
  Class B    30        40        70
  Total      42.5      57.5      100
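The averaging and the accuracy computation can be sketched as follows, using the Matrix 1 and Matrix 2 values from the text (rows are actual classes, columns predicted, totals omitted):

```python
# Sketch: average confusion matrices from repeated cross-validation and
# read off the percentage of correctly classified instances (diagonal
# sum divided by the total instance count).

def average_matrices(matrices):
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / len(matrices)
             for c in range(cols)] for r in range(rows)]

def percent_correct(matrix):
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * diagonal / total

m1 = [[10, 20], [30, 40]]   # Matrix 1: Class A row, Class B row
m2 = [[15, 15], [30, 40]]   # Matrix 2
avg = average_matrices([m1, m2])   # [[12.5, 17.5], [30.0, 40.0]]
accuracy = percent_correct(avg)    # (12.5 + 40.0) / 100 -> 52.5
```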

The error in the results obtained from predicting the class value of the discretized temperature is determined using the absolute distance between the actual and predicted class values. The distance between two adjacent classes is taken as 1. When the actual and predicted class values are the same, the distance is 0, which represents a correctly classified instance. The greater the distance, the poorer the prediction.
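This distance measure is trivially small in code; the class-value pairs below are illustrative:

```python
# Sketch: prediction error for discretized temperature as the absolute
# distance between actual and predicted class values (adjacent classes
# are 1 apart; distance 0 means a correct classification).

def class_distance(actual, predicted):
    return abs(actual - predicted)

errors = [class_distance(a, p) for a, p in [(3, 3), (3, 4), (2, 5)]]
```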


We can use the percentage of instances classified correctly when no precipitation was present and when precipitation was present to determine the accuracy of the model built. This method is particularly important for determining the accuracy for precipitation because, for most of the time, no precipitation is reported, and thus in cases when precipitation is reported the algorithm may tend to classify it as no precipitation, treating the data as noise.
Given the classification model built for an RWIS site using historical data, we can provide the current weather information for the attributes present in the feature vector, and the model will predict a class value for the output at that RWIS site. For the discretized temperature, the absolute distance between the predicted and actual values is used as a measure to detect sensor malfunctions; for precipitation type, we compare the performance ratios for detecting precipitation present and no precipitation present in order to identify a sensor malfunction from misclassifications.
3.5.3 General Regression Approach
Regression algorithms are used to determine the value taken by the output attribute for a given instance, based on an equation or mathematical operations. Continuous variables like temperature and visibility can be predicted using this approach. The regression algorithms described in Section 2.2.2, namely Linear Regression, Least Median Square, M5P, MultiLayer Perceptron, RBF Nets, and Conjunctive Rule, are used to predict these variables. The feature vector for these regression algorithms includes the current and a couple of previous hours' temperature readings for various RWIS and AWOS sites along with other information, with one site's temperature being the output attribute.
Each dataset is built according to the feature vector produced from the weather data available for the RWIS and AWOS sites. The dataset is split into a training set and a test set. The regression algorithms use the training set to build a model, and the test set is used to evaluate the performance of the model. Multiple repetitions of n-fold cross-validation are used to obtain a better estimate of the model performance.
For a given set of input attributes the model will predict a value for the output. The performance of regression algorithms can be determined by the difference between the actual value and predicted value, which gives the amount of error in the prediction made.

For example, the absolute error when the actual temperature value is 32ºF and the predicted value is 35ºF is 3ºF. The mean of the absolute errors across all instances in the test set gives the performance of the algorithm on the test set. In the case of multiple n-fold cross-validations, the error value is averaged across all the test sets seen.
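The mean absolute error computation can be sketched as follows; the temperature readings are illustrative, with the first pair matching the 32ºF/35ºF example:

```python
# Sketch: per-instance absolute error and its mean over a test set.

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [32.0, 40.0, 28.0]       # e.g. temperatures in degrees F
predicted = [35.0, 39.0, 30.0]
mae = mean_absolute_error(actual, predicted)   # (3 + 1 + 2) / 3 = 2.0
```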


To the regression model built for an RWIS site using historical data, we can provide the current weather information for the attributes present in the feature vector, and the model will predict a value for the concerned output at the RWIS site. The closeness of the predicted value to the actual value depends on the quality of the model. For a model whose prediction accuracy is high, a slight difference may indicate a sensor malfunction. Even if the model does not predict accurately, its consistency can be used to determine variations in the values reported.
3.5.4 General HMM Approach
HMMs can be used to predict the class value of the weather variables seen at an RWIS site in a set. As HMMs require the variables to be discrete, precipitation type and discretized temperature can be predicted using HMMs. Each instance in the training set and the test set used in HMM consists of a string of symbols. For a given hour, the symbol string generated consists of a variable's class value seen at the RWIS site we are trying to predict. The class values seen at other RWIS sites that belong to a set can also be included. When a variable's information from two or more sites is used, the variable value from each site is appended to form a class string and all the unique class strings seen in training data are assigned a symbol. Thus for each day we get a symbol string of length 24 in the dataset.
To build the model from a given number of states and symbols emitted at a state, the Baum-Welch algorithm is applied on the training set to determine the initial state, transmission and emission probabilities. Each symbol present in the symbol set that was generated using the training set is emitted by a state. Instances from the test set are then passed to the Viterbi algorithm which gives the most probable state path.
To predict a variable's class value, we need the symbol that would be observed across the most probable state path found by the Viterbi algorithm. We modified the Viterbi algorithm (described in Section 2.3.3), calling the result the modified Viterbi algorithm, to obtain the most probable symbol observed at each state visited along the most probable state path. Table 3.2 shows the modified Viterbi algorithm.
The symbol with the highest probability at a state is the one that has the highest emission probability at that state. As we are trying to predict the class value of a variable at one site, which forms one part of the class string that was converted into a symbol, and the predicted value can be any of the class values taken by the variable, the predicted variable's class value in the class string is replaced with all possible values taken by the variable to get the set of possible symbols for the given symbol. Of this set of possible symbols, we consider only those symbols that are seen in the training set. The symbol with the highest emission probability at the concerned state is predicted at that state. For example, suppose a variable takes class values 1, 2, and 3 and the class string is of the form '1AB', with A and B being class values of other variables. Then all possible class strings for this variable would be 1AB, 2AB, and 3AB.
We calculate the error in prediction as the absolute distance between the actual class value reported and the predicted class value. For calculating the error, we find the distance with respect to the class value of the site being predicted; the distances between the class values of the other sites added to the class string are not taken into account. For example, the class string for predicting the temperature value at site 19 using the temperature class information of sites 27 and 67 has the class values arranged as the value at 19 followed by the values at 27 and 67. If the predicted class string for a given time is 345 and the actual class string seen for that time is 435, we get an error distance of 1, which is the difference between the class values in the first position of the class strings.
Table 3.2 The Modified Viterbi Algorithm

Modified Viterbi Algorithm

Initialization (i = 0):        v0(0) = 1, vk(0) = 0 for k > 0

Recursion (i = 1 ... L):       vm(i) = max{y ∈ symbolsets(xi)} em(y) · maxk(vk(i-1) akm)
                               ptrm(i) = argmaxk(vk(i-1) akm)
                               bestsymboli(m) = argmax{y ∈ symbolsets(xi)} em(y)

Termination:                   P(x, path*) = maxk(vk(L) ak0)
                               path*L = argmaxk(vk(L) ak0)

Traceback (i = L ... 1):       path*i-1 = ptri(path*i)

Symbol Observed (i = 1 ... L): symbol(i) = bestsymboli(path*i)

vm(i) – probability of the most probable path obtained after observing the first i
symbols of the sequence and ending at state m

ptrm(i) – pointer that stores the state that leads to state m after observing i symbols

path*i – state visited at position i in the sequence

bestsymboli(m) – the most probable symbol seen at state m at position i in the sequence

symbolsets(xi) – the set of all possible symbols that can be emitted from a state when
symbol xi is actually seen (the members of allpossiblesymbols(xi) seen in training)

allpossiblesymbols(xi) – function that generates all possible symbols for the given
symbol xi

symbol(i) – symbol observed at the ith position in the string

The modified Viterbi algorithm works similarly to the Viterbi algorithm. The Viterbi algorithm calculates the probability of the most probable path obtained after observing the first i characters of the given sequence and ending at state m using

vm(i) = em(xi) · maxk(vk(i-1) akm),

where em(xi) is the emission probability of symbol xi at state m and akm is the transition probability of moving from state k to state m. In the modified Viterbi algorithm, in place of em(xi) we use the maximum emission probability over all the possible symbols that can be seen at state m when symbol xi is present in the actual symbol sequence. We calculate vm(i) in the modified Viterbi algorithm as

vm(i) = max{y ∈ symbolsets(xi)} em(y) · maxk(vk(i-1) akm).
The actual symbol seen in the symbol string at the respective time is taken and all its possible symbols are found. Of all the possible symbols or class strings, only those seen in the training instances are considered in the set of possible symbols. The symbol from this set with the highest emission probability is selected as the symbol observed for the given state. To remember the symbol observed at a state, the modified Viterbi algorithm uses the pointer bestsymboli(m), which stores the symbol selected from the set of possible symbols seen at state m for position i in the sequence. It is given by

bestsymboli(m) = argmax{y ∈ symbolsets(xi)} em(y).
The most probable path generated by the algorithm is visited from the start, and by recalling, via bestsymboli(m), the symbol observed at each state for a particular time, we can find the symbol with the highest emission probability at each state in the most probable state path. This new symbol sequence gives the observed sequence for the given input sequence; in other words, it is the predicted sequence for the given actual sequence.
To predict the class values of a variable at an RWIS site for a given instance, we pass the instance through the modified Viterbi algorithm, which gives the symbols with the highest emission probabilities along the most probable path. Each symbol found is then converted back into a class string, and the class value for the concerned RWIS site is regarded as the predicted value.
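The recursion, traceback, and symbol bookkeeping above can be sketched as follows. This is a minimal illustration with an invented two-state model and dict-based parameters (all probabilities assumed nonzero); the `possible_symbols` callback stands in for symbolsets(xi), and none of the numbers come from the thesis's trained HMMs:

```python
import math

# Sketch of the modified Viterbi recursion, using log probabilities.
# In place of em(xi), each step uses the best emission probability among
# the symbols that xi could stand for, and remembers that symbol.

def modified_viterbi(obs, states, start_p, trans_p, emit_p, possible_symbols):
    """Return (most probable state path, symbols observed along it)."""
    v = [{}]          # v[i][s]: log-prob of the best path ending at s
    ptr = [{}]        # back-pointers to the previous state
    best = [{}]       # best[i][s]: most probable symbol at s, position i
    for s in states:
        sym = max(possible_symbols(obs[0]), key=lambda y: emit_p[s][y])
        v[0][s] = math.log(start_p[s]) + math.log(emit_p[s][sym])
        ptr[0][s] = None
        best[0][s] = sym
    for i in range(1, len(obs)):
        v.append({}); ptr.append({}); best.append({})
        for s in states:
            sym = max(possible_symbols(obs[i]), key=lambda y: emit_p[s][y])
            prev = max(states, key=lambda k: v[i - 1][k] + math.log(trans_p[k][s]))
            v[i][s] = (v[i - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][sym]))
            ptr[i][s] = prev
            best[i][s] = sym
    last = max(states, key=lambda s: v[-1][s])   # termination
    path = [last]
    for i in range(len(obs) - 1, 0, -1):         # traceback
        path.append(ptr[i][path[-1]])
    path.reverse()
    symbols = [best[i][path[i]] for i in range(len(obs))]
    return path, symbols

# Toy model: class strings '154', '254', '354' share the suffix '54';
# only the first (predicted) class value varies between them.
states = ["S0", "S1"]
start_p = {"S0": 0.6, "S1": 0.4}
trans_p = {"S0": {"S0": 0.7, "S1": 0.3}, "S1": {"S0": 0.4, "S1": 0.6}}
emit_p = {"S0": {"154": 0.5, "254": 0.3, "354": 0.2},
          "S1": {"154": 0.1, "254": 0.2, "354": 0.7}}

def variants(x):
    return ["154", "254", "354"]

path, symbols = modified_viterbi(["154", "354"], states, start_p, trans_p,
                                 emit_p, variants)
```

Taking the first character of each emitted symbol then yields the predicted class values for the target site, as described in the text.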
For example, to predict temperature class values at site 19, we use class strings in which the temperature class values from sites 27 and 67 are appended to the value observed at site 19. Let a given class string sequence for a day be

343, 323, 243, 544, ..., 324 (a total of 24, one for each hour in a day)

This sequence is passed to the modified Viterbi algorithm, which gives the observed sequence seen along the most probable state path. Let the output sequence for the given sequence be

433, 323, 343, 334, ..., 324

As we are interested in predicting the class values at site 19, taking only the first class value from each class string we get, for the actual values,

3, 3, 2, 5, ..., 3

the observed (or predicted) values for the day are

4,3,3,3,...,3



The distance between these class values then gives the accuracy of the predictions made by the model.
In cases when a class string obtained from the test set is missing from the set of class strings seen in the training data, we need to replace it, as it has no corresponding symbol associated with it and thus is not recognized by the HMM. For variables like discretized temperature, where class values are related, the missing class string is replaced with the class string seen in the training data that is closest in distance to it. The distance measure used is the Manhattan distance. The difference between two classes is taken as their distance; for example, classes 3 and 5 have a distance of 2 between them. The distance between two class strings is the sum of the distances between the individual class values; for example, class strings 355 and 533 have a distance of 6 between them. When finding the closest class string, we start with a distance of 1 and change the class value for the site to be predicted first and then the other sites, according to their distance from the site used for prediction. If no class string is found at a distance of 1, we try incrementing the distance. In the case of precipitation type, where each class value is unique, the class value is changed to indicate no precipitation present, as this is the case seen for most of the instances. Replacement is likewise done with one change at a time, changing the values at the current site and then at the nearby sites.
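The core of this replacement rule can be sketched as follows, simplified to "pick the seen class string at the smallest Manhattan distance" (the text's full rule additionally prefers changing the predicted site's value first); the training strings below are illustrative:

```python
# Sketch: replace a class string unseen in training with the closest
# training class string under Manhattan distance over its digits.

def manhattan(a, b):
    """Sum of per-position class distances, e.g. '355' vs '533' -> 6."""
    return sum(abs(int(x) - int(y)) for x, y in zip(a, b))

def closest_seen(missing, seen):
    """Training class string closest to the missing one (ties broken
    by string order, for determinism)."""
    return min(sorted(seen), key=lambda s: manhattan(missing, s))

seen = ["354", "344", "254"]
replacement = closest_seen("364", seen)   # '354' lies at distance 1
```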
The performance of the HMM on the given test set, on precipitation type and discretized temperature, can be evaluated using the same methods that were used in the general classification approach.
As the HMM needs a full-length symbol string at a time to make predictions, we can give the past 23 hours of observed data and the present hour's data from the concerned RWIS site to predict the value for the present hour. The HMM will predict the class value of the variable used in the data. To compare actual and predicted values for detecting sensor malfunctions, we can use the method described in the general classification approach.