SITE: Airport code where the AWOS unit is located
LON: Longitude
LAT: Latitude
Appendix B
Using WEKA
WEKA is written in Java and is organized into packages arranged in a hierarchical manner. Details of the packages and the hierarchy are given by Witten & Frank [2005]. WEKA can be run through its graphical user interface or by entering textual commands at the command prompt. The general structure of a WEKA textual command that performs multiple 10-fold cross-validations on a dataset using an algorithm (classifier) is
java -mx1024M -cp classpath callClassifier classifier_path classifier_options -t trainset.arff -x 10 -s seed_value -c attribute_index
where -cp specifies the path (i.e., the class path) where WEKA is located; callClassifier is a Java class used to output the complete class probabilities, without which WEKA outputs only an evaluative summary of the algorithm; classifier_path is the location of the algorithm in the WEKA package hierarchy; classifier_options specifies the options taken by the algorithm; -t specifies the training file; -x specifies the number of folds for cross-validation; -s indicates the seed value when multiple n-fold cross-validations need to be performed; and -c specifies the position of the output attribute in the dataset provided. The -T option is used when a separate test file is used for evaluating the model; when it is not given, cross-validation is performed on the training set provided.
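As a sketch of how such runs can be scripted, the command line above can be assembled programmatically, varying only the seed across repetitions. The class path (`weka.jar`), training file name, and classifier choice below are placeholders, not the actual paths used in this thesis.

```python
# Sketch: assemble the WEKA command line described above for multiple
# 10-fold cross-validations, changing only the seed on each run.
# "weka.jar" and "trainset.arff" are placeholder names.

def weka_command(classifier_path, options, trainset, seed,
                 folds=10, class_index="last"):
    """Build the argument list for one cross-validation run."""
    return (["java", "-mx1024M", "-cp", "weka.jar", "callClassifier",
             classifier_path] + options +
            ["-t", trainset, "-x", str(folds),
             "-s", str(seed), "-c", str(class_index)])

# Ten 10-fold cross-validations, one per seed value 1..10:
commands = [weka_command("weka.classifiers.trees.M5P", ["-M", "4.0"],
                         "trainset.arff", seed)
            for seed in range(1, 11)]
```

Each resulting list can be passed to a process launcher; averaging the ten runs' errors gives the values reported in Appendix C.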
WEKA requires the data in the train/test file to be in ARFF format. The general format of an ARFF file is given in Table B.1. The string @relation is used to name the dataset, @attribute is used to define an attribute's name and type, and @data is used to indicate the start of the data, which is in comma-separated form. Lines beginning with % are comments.

Table B.1: Format of an ARFF file.

@relation Predict_Temp_Year2002
@attribute temperature_site1 real
@attribute temperature_site2 real
@attribute precipitation {yes, no}
% used for comments
@data
23,22,yes
12,23,no
23,32,no
.............
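The ARFF structure in Table B.1 is simple enough to read with a few lines of code. The following sketch parses a minimal file like the one above (header mirroring the table); it is an illustration, not WEKA's own loader, and it ignores details such as quoted strings and sparse data.

```python
# Sketch: a very small ARFF reader for files shaped like Table B.1.
# Collects attribute names and the comma-separated data rows.

ARFF = """\
@relation Predict_Temp_Year2002
@attribute temperature_site1 real
@attribute temperature_site2 real
@attribute precipitation {yes, no}
% used for comments
@data
23,22,yes
12,23,no
"""

def parse_arff(text):
    """Return (attribute names, data rows) from ARFF text."""
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            attributes.append(line.split()[1])
        elif low.startswith("@data"):
            in_data = True
        elif in_data:
            rows.append(line.split(","))
    return attributes, rows

attrs, rows = parse_arff(ARFF)
```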
The following are the classifier_path values for the machine learning algorithms used in this thesis, along with their default options (classifier_options).
Linear Regression
weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
where -S specifies the attribute selection methods with 0 representing the M5 method, and -R specifies the value of the ridge parameter.
Least Median Square
weka.classifiers.functions.LeastMedSq -S 4 -G 0
where -S specifies the size of random samples used to generate the least squared regression function, and -G specifies the seed value used to select subsets of the training data.
M5Prime
weka.classifiers.trees.M5P -M 4.0
where -M specifies the minimum number of instances per leaf.
Multilayer Perceptron
weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a
where -L specifies the learning rate, -M specifies the momentum, -N specifies the number of training epochs, -V specifies the validation set size, -S specifies the seed for the random number generator (random values are used to initialize the weights), -E specifies the validation threshold, and -H specifies the hidden layers, with the value 'a' representing a single hidden layer containing (num_attributes + num_classes)/2 nodes.
RBF Network
weka.classifiers.functions.RBFNetwork -B 2 -S 1 -R 1.0E-8 -M -1 -W 0.1
where -B specifies the number of clusters generated by K-means, -S specifies the value of the seed passed on to the K-means, -R specifies the value of the ridge parameter, -M specifies the number of iterations to be performed by logistic regression, and -W specifies the minimum standard deviation for the clusters.
Conjunctive Rule
weka.classifiers.rules.ConjunctiveRule -N 3 -M 2.0 -P -1 -S 1
where -N specifies the number of folds of data used for pruning (one fold serves as the pruning set), -M specifies the minimum total weight of the instances in a rule, -P specifies the number of antecedents allowed in a rule when pre-pruning is used, and -S specifies the seed value used.
J48
weka.classifiers.trees.J48 -C 0.25 -M 2
where -C specifies the confidence factor, and -M specifies the minimum number of instances per leaf.
Naive Bayes
weka.classifiers.bayes.NaiveBayes
Bayes Net
weka.classifiers.bayes.BayesNet -D -Q weka.classifiers.bayes.net.search.local.K2 -- -P 1 -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5
where -D specifies that the ADTree data structure should not be used (using it can cause memory problems on large datasets), -Q specifies the search algorithm, and -E specifies the estimator used for finding the CPTs (conditional probability tables). The K2 search algorithm is given by weka.classifiers.bayes.net.search.local.K2, with its option -P specifying the maximum number of parents a node in the Bayesian network may take. The estimator used for filling the CPTs is weka.classifiers.bayes.net.estimate.SimpleEstimator, with its option -A specifying the alpha value of the estimator.
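The role of the alpha value can be illustrated with a small sketch, assuming the usual Laplace-style smoothing rule: each count effectively starts from alpha rather than zero, so values unseen in the training data never receive probability zero in the CPT. The function name and example counts below are illustrative, not WEKA code.

```python
# Sketch of an alpha-smoothed probability estimate for a CPT entry,
# assuming the Laplace-style rule P(v) = (count(v) + alpha) / (N + alpha*K),
# where N is the total count and K the number of possible values.

def smoothed_probs(counts, alpha=0.5):
    """counts: dict mapping each value to its observed count."""
    total = sum(counts.values()) + alpha * len(counts)
    return {v: (c + alpha) / total for v, c in counts.items()}

probs = smoothed_probs({"yes": 3, "no": 1}, alpha=0.5)
# With alpha = 0.5: P(yes) = 3.5/5 = 0.7, P(no) = 1.5/5 = 0.3
```

With alpha = 0 the estimate reduces to the raw relative frequency; larger alpha pulls the distribution toward uniform.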
Appendix C
Detailed Results
Table C.1: Results obtained from using regression algorithms to predict temperature at an RWIS site (Experiment 1). The feature vector consists of temperature information from the RWIS-AWOS sites in a set along with the temperature offset for the RWIS sites. The table reports mean absolute error values averaged over ten 10-fold cross-validations.
| Set | RWIS Site | LMS | LR | M5P | RBF | CR | MLP |
|---|---|---|---|---|---|---|---|
| 1 | 19 | 0.908 | 0.960 | 0.936 | 9.521 | 11.150 | 1.059 |
| 1 | 27 | 1.217 | 0.896 | 0.873 | 9.465 | 10.199 | 1.058 |
| 1 | 67 | 1.069 | 0.918 | 0.885 | 10.062 | 11.596 | 1.108 |
| 2 | 14 | 0.659 | 0.795 | 0.751 | 8.478 | 10.605 | 0.789 |
| 2 | 20 | 0.743 | 0.820 | 0.776 | 9.417 | 10.821 | 1.001 |
| 2 | 35 | 0.553 | 0.977 | 0.864 | 9.523 | 10.817 | 1.051 |
| 2 | 49 | 0.916 | 0.913 | 0.898 | 9.579 | 11.017 | 1.074 |
| 2 | 62 | 0.800 | 0.779 | 0.769 | 9.383 | 11.040 | 0.892 |
| 3 | 25 | 0.984 | 1.062 | 0.889 | 10.386 | 11.957 | 1.097 |
| 3 | 56 | 0.925 | 0.913 | 0.807 | 10.510 | 11.512 | 1.133 |
| 3 | 60 | 0.889 | 0.867 | 0.833 | 9.675 | 11.078 | 1.002 |
| 3 | 68 | 0.958 | 1.015 | 0.901 | 9.017 | 10.439 | 1.235 |
| 3 | 78 | 0.929 | 0.875 | 0.809 | 8.945 | 10.449 | 1.012 |
| | Mean of Abs. Errors (ºF) | 0.888 | 0.907 | 0.845 | 9.535 | 10.975 | 1.039 |
| | StdDev of Abs. Errors | 0.171 | 0.083 | 0.058 | 0.559 | 0.503 | 0.110 |

StdDev refers to standard deviation.
Figure C.1: Mean absolute errors for different RWIS sites obtained from predicting temperature using regression algorithms.
Table C.2: Results obtained from using regression algorithms to predict temperature at an RWIS site (Experiment 2). The feature vector consists of temperature information from the RWIS-AWOS sites in a set along with the precipitation type for the RWIS sites. The table reports mean absolute error values averaged over ten 10-fold cross-validations.
| Set | RWIS Site | LMS | LR | M5P |
|---|---|---|---|---|
| 1 | 19 | 1.023 | 1.115 | 1.001 |
| 1 | 27 | 0.935 | 0.959 | 0.931 |
| 1 | 67 | 1.001 | 1.006 | 0.984 |
| 2 | 14 | 0.726 | 0.788 | 0.771 |
| 2 | 20 | 0.862 | 0.872 | 0.827 |
| 2 | 35 | 1.052 | 1.062 | 0.938 |
| 2 | 49 | 1.051 | 1.014 | 1.022 |
| 2 | 62 | 0.848 | 0.827 | 0.815 |
| 3 | 25 | 1.217 | 1.222 | 1.004 |
| 3 | 56 | 1.084 | 1.077 | 0.992 |
| 3 | 60 | 1.007 | 0.981 | 0.924 |
| 3 | 68 | 1.046 | 1.154 | 0.973 |
| 3 | 78 | 0.876 | 0.891 | 0.856 |
| | Mean of Abs. Errors (ºF) | 0.979 | 0.997 | 0.926 |
| | StdDev of Abs. Errors | 0.127 | 0.130 | 0.083 |

StdDev refers to standard deviation.
Figure C.2: Mean absolute errors for different RWIS sites obtained from predicting temperature using regression algorithms, with precipitation type information added to the feature vector.
Table C.3: Results obtained from using classification algorithms to predict precipitation type at an RWIS site (Experiment 4). The feature vector consists of temperature information from the RWIS-AWOS sites in a set along with the precipitation type for the RWIS sites. The table reports classification error values, as reported by WEKA, averaged over ten 10-fold cross-validations.
| Set | RWIS Site | J48 | NB | Bayes Net |
|---|---|---|---|---|
| 1 | 19 | 0.064 | 0.325 | 0.346 |
| 1 | 27 | 0.256 | 0.420 | 0.363 |
| 1 | 67 | 0.356 | 0.472 | 0.414 |
| 2 | 14 | 0.213 | 0.384 | 0.330 |
| 2 | 20 | 0.265 | 0.450 | 0.379 |
| 2 | 35 | 0.077 | 0.312 | 0.337 |
| 2 | 49 | 0.062 | 0.328 | 0.328 |
| 2 | 62 | 0.061 | 0.312 | 0.341 |
| 3 | 25 | 0.068 | 0.342 | 0.342 |
| 3 | 56 | 0.072 | 0.345 | 0.333 |
| 3 | 60 | 0.095 | 0.317 | 0.323 |
| 3 | 68 | 0.061 | 0.253 | 0.315 |
| 3 | 78 | 0.065 | 0.318 | 0.231 |
| | Mean of Classification Errors | 0.132 | 0.352 | 0.337 |
| | StdDev of Classification Errors | 0.102 | 0.062 | 0.041 |