A Machine Learning Based Model For Software Defect Prediction

Onur Kutlubay, Mehmet Balman, Doğu Gül, Ayşe B. Bener

Boğaziçi University, Computer Engineering Department

kutlubay@cmpe.boun.edu.tr; mbalman@ku.edu.tr; dogugul@yahoo.com; bener@boun.edu.tr

Abstract

Identifying and locating defects in software projects is difficult work. As project sizes grow, this task becomes especially expensive and calls for sophisticated testing and evaluation mechanisms. On the other hand, measuring software in a continuous and disciplined manner brings many advantages, such as accurate estimation of project costs and schedules and improved product and process quality. Detailed analysis of software metric data also gives significant clues about the locations of possible defects in program code.

The aim of this research is to establish a method for identifying software defects using machine learning methods. In this work we used NASA’s Metrics Data Program (MDP) as software metrics data. The repository at NASA IV & V Facility MDP contains software metric data and error data at the function/method level.

We used machine learning methods to construct a two-step model that predicts potentially defected modules within a given set of software modules with respect to their metric data. Artificial Neural Networks and Decision Tree methods are utilized throughout the learning experiments. The data set used in the experiments is organized in two forms for learning and predicting purposes: the training set and the testing set. The experiments show that the two-step model enhances defect prediction performance.

1. Introduction
According to a survey carried out by the Standish Group, an average software project exceeded its budget by 90 percent and its schedule by 222 percent (Chaos Chronicles, 1995). This survey took place in the mid-1990s and contained data from about 8,000 projects. These statistics show the importance of measuring software early in its life cycle and taking the necessary precautions before such results materialize. In industrial software projects, an extensive metrics program is usually seen as unnecessary, and practitioners start to emphasize a metrics program only when things go badly or when there is a need to satisfy some external assessment body.

On the academic side, little attention has been devoted to the decision support power of software measurement. The results of these measurements are usually evaluated with naive methods such as regression and correlation between values. However, models for assessing software risk by predicting defects in a specific module or function have also been proposed in previous research (Fenton and Neil, 1999). Some recent models also utilize machine learning techniques for defect prediction (Neumann, 2002). The main drawback of using machine learning in software defect prediction is the scarcity of data. Most companies do not share their software metric data with other organizations, so a useful database with a large amount of data cannot be formed. However, there are publicly available, well-established tools for extracting metrics such as size, McCabe’s cyclomatic complexity, and Halstead’s program vocabulary. These tools help automate the data collection process in software projects.

A well-established metrics program yields better estimates of cost and schedule. In addition, analyses of the measured metrics are good indicators of possible defects in the software being developed. Testing is the most popular method for defect detection in most software projects. However, when projects grow in terms of both lines of code and effort spent, testing becomes more difficult and computationally expensive and requires sophisticated testing and evaluation procedures. Nevertheless, defects that are identified in previous segments of programs can be clustered according to their various properties and, most importantly, according to their severity. If the relationship between the software metrics measured at a certain state and the properties of the defects can be formulated, it becomes possible to predict similar defects in other parts of the code.

The software metric data gives us the values of specific variables that measure a specific module/function or the whole software. When combined with the weighted error/defect data, this data set becomes the input for a machine learning system. A learning system is defined as a system that is said to learn from experience with respect to some class of tasks and performance measure, such that its performance at these tasks improves with experience (Mitchell, 1997). To design a learning system, the data set in this work is divided into two parts: the training data set and the testing data set. Some predictor functions are defined and trained with the Multi-Layer Perceptron and Decision Tree algorithms, and the results are evaluated with the testing data set.

The second section gives a brief literature survey of previous research and the third describes the data set used in our research. The fourth section states the problem and the fifth explains the details of our proposed model for defect prediction, together with the tools and methods that are utilized throughout the experiments. The sixth section lists the results of the experiments and gives a detailed evaluation of the machine learning algorithms. The last section concludes our work and summarizes the future research that could be done in this area.

2. Related Work
2.1. Metrics and Software Risk Assessment
Software metrics are mostly used for product quality and process efficiency analysis and for risk assessment in software projects. One of their most significant benefits is that they provide information for defect prediction, and metric analysis allows project managers to assess software risks. Currently there are numerous metrics for assessing software risks. Early research on software metrics focused mostly on McCabe, Halstead and lines of code (LOC) metrics, and these three categories contain the most widely used metrics. In this work, too, we decided to use an evaluation mechanism mainly based on these metrics.

Metrics usually have definitions in terms of polynomial equations when they are not directly measured but derived from other metrics. Researchers have used a neural network approach to generate new metrics instead of using metrics that are based on certain polynomial equations (Boetticher et al., 1993). This was introduced as an alternative that avoids the challenge of deriving a polynomial with the desired characteristics. Bayesian belief networks have also been used for risk assessment in previous research (Fenton and Neil, 1999), with basic metrics such as LOC, Halstead and McCabe metrics used in the learning process. The authors argue that some metrics do not predict the software’s operational stage correctly. For instance, the relation between cyclomatic complexity and the number of faults is not the same for the pre- and post-release versions of the software. To overcome this problem, a Bayesian belief network is used for defect modeling.

In another study, metrics are categorized with respect to the models developed. The approach is based on the observation that “software metrics alone are difficult to evaluate”. The authors apply metrics to three models, namely the “Complexity”, “Risk” and “Test Targeting” models. Different results are obtained with respect to these models and each is evaluated separately (Hudepohl et al., 1996).

It has been shown that some metrics capture common features of software risk. Instead of using all the metrics adopted, a basic one that represents a cluster can be used (Neumann, 2002). Principal component analysis, one of the most popular approaches, is applied to determine the clusters that contain similar metrics.

2.2. Defect Prediction and Applications of Machine Learning
Defect prediction models can be classified according to the metrics used and the process step in the software life cycle. Most defect models use basic metrics such as the complexity and size of the software (Henry and Kafura, 1984). Testing metrics that are produced in the test phase are also used to estimate the sequence of defects (Cusumano, 1991). Another approach investigates the quality of the design and implementation processes, arguing that the quality of the design process is the best predictor of product quality (Bertolino and Strigini, 1996; Diaz and Sligo, 1997).

The main idea behind the prediction models is to estimate the reliability of the system and to investigate the effect of the design and testing processes on the number of defects. Previous studies show that metrics from all steps of the life cycle of a software project, such as design, implementation and testing, should be utilized and connected with specific dependencies. Concentrating on only a specific metric or process level is not enough for a satisfactory prediction model (Fenton and Neil, 1999).

Machine learning algorithms have proven to be practical for poorly understood problem domains whose conditions change with respect to many values and regularities. Since software problems can be formulated as learning processes and classified according to the characteristics of the defect, regular machine learning algorithms are applicable for preparing a probability distribution and analyzing errors (Fenton and Neil, 1999; Zhang, 2000). Decision trees, artificial neural networks, Bayesian belief networks and clustering techniques such as k-nearest neighbor are among the most commonly used techniques for software defect prediction problems (Mitchell, 1997; Zhang, 2000; Jensen, 1996).

Machine learning algorithms can be applied to program executions to detect the number of faulty runs, which helps to find the underlying defects. In this approach, executions are clustered according to their procedural and functional properties (Dickinson et al., 2001). Machine learning is also used to generate models of program properties that are known to cause errors. Support vector and decision tree learning tools are implemented to classify and investigate the most relevant subsets of program properties (Brun and Ernst, 2004). The underlying intuition is that most of the properties leading to faulty conditions can be classified within a few groups. The technique consists of two steps: training and classification. Fault-relevant properties are utilized to generate a model, and this precomputed function selects the properties that are most likely to cause errors and defects in the software.

Clustering over function call profiles is used to determine which features enable a model to distinguish failures from non-failures (Podgurski et al., 2003). Dynamic invariant detection is used to detect likely invariants from a test suite and to investigate violations, which usually indicate an erroneous state. This method is also used to determine counterexamples and to find properties which lead to correct results for all conditions (Groce and Visser, 2003).

3. Metric Data Used
The data set used in this research is provided by the NASA IV&V Metrics Data Program – Metric Data Repository¹. The data repository contains software metrics and associated error data at the function/method level. The data repository stores and organizes the data which has been collected and validated by the Metrics Data Program.

The association between the error data and the metrics data in the repository provides the opportunity to investigate the relationship of metrics or combinations of metrics to the software. The data that is made available to general users has been sanitized and authorized for publication through the MDP website by officials representing the projects from which the data has originated. The database uses unique numeric identifiers to describe the individual error records and product entries. The level of abstraction allows data associations to be made without having to reveal specific information about the originating data.

The repository contains detailed metric data in terms of product metrics, object oriented class metrics, requirement metrics and defect/product association metrics. We specifically concentrate on product metrics and related defect metrics. The data portion that feeds the experiments in this research contains the mentioned metric data for the JM1 project.

Some of the product metrics that are included in the data set are McCabe metrics (Cyclomatic Complexity and Design Complexity); Halstead metrics (Halstead Content, Halstead Difficulty, Halstead Effort, Halstead Error Estimate, Halstead Length, Halstead Level, Halstead Programming Time and Halstead Volume); LOC metrics (Lines of Total Code, LOC Blank, Branch Count, LOC Comments, Number of Operands, Number of Unique Operands and Number of Unique Operators); and lastly defect metrics (Error Count, Error Density, and Number of Defects, with severity and priority information).

After constructing our data repository, we cleaned the data set of extreme values, which may lead our experiments to faulty results. For each feature in the database, the records whose value for that feature lies more than ten standard deviations away from the mean value are deleted from the database.
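The cleaning rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `drop_outliers`, the row-of-tuples layout and the use of the population standard deviation are our choices here, not details taken from the experiments.

```python
from statistics import mean, pstdev


def drop_outliers(rows, k=10.0):
    """Keep only rows whose every feature lies within k standard
    deviations of that feature's mean (assumption: population std)."""
    if not rows:
        return []
    columns = list(zip(*rows))               # one tuple per feature
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    kept = []
    for row in rows:
        # A constant feature has std 0 and deviation 0, so it never rejects a row.
        if all(abs(v - m) <= k * s for v, m, s in zip(row, means, stds)):
            kept.append(row)
    return kept
```

Note that the means and standard deviations are computed once, over the uncleaned data; whether the filter should be repeated after each deletion is left open here.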

Our analysis depends on machine learning techniques, so we divided the data set into two groups: the training set and the testing set. For each experiment, these two groups are extracted randomly from the overall data set by using a simple shuffle algorithm. This method provided us with randomly generated data sets, which are believed to contain evenly distributed numbers of defect data.
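A minimal sketch of such a shuffle-based split is given below. The helper name `shuffle_split` and the optional `seed` parameter are assumptions made for illustration; the text only states that a simple shuffle algorithm is used.

```python
import random


def shuffle_split(records, n_train, seed=None):
    """Shuffle a copy of the records and cut it into a training part
    (the first n_train items) and a testing part (the rest)."""
    rng = random.Random(seed)     # seed is a hypothetical convenience for repeatability
    shuffled = list(records)      # leave the caller's data untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]
```

Calling the function once per experiment run with a different seed gives a fresh random partition each time, which matches the idea of drawing both sets anew for every experiment.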

4. Problem Statement
Two types of defect prediction can be studied on code based metrics. The first is predicting whether a given code segment is defected or not. The second is predicting the magnitude of the possible defect, if any, with respect to various viewpoints such as density, severity or priority. Estimating the defect causing potential of a given software project is critical for the reliability of the project. Our work in this research is primarily focused on the second type of prediction, but it also includes some major experiments involving the first type.

Given a training data set, a learning system can be set up. This system produces a score that indicates how defected a test code segment is. After predicting this score, the results can be evaluated with respect to popular performance functions. The two most common options here are the Mean Absolute Error (mae) and the Mean Squared Error (mse). The mae is generally used for classification, while the mse is most commonly seen in function approximation.
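For reference, the two performance functions can be written out as follows. This is a plain Python sketch using the standard definitions; the function names are ours.

```python
def mean_absolute_error(targets, predictions):
    """mae: average of |target - prediction| over all samples."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def mean_squared_error(targets, predictions):
    """mse: average of (target - prediction)^2 over all samples."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)
```

Because the error is squared, the mse penalizes a few large misses much more heavily than the mae does, which matters for data in which a small number of modules have very high defect densities.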

In this research we used mse as the performance function for the regression experiments, since they aim at the second type of prediction. Although mae could be a good measure for the classification experiments, our output values in that case are zeros and ones, so we chose to use some custom error measures instead. We explain them in detail in the results section.

5. Proposed Model and Methodology
The data set used in this research contains defect density data, which corresponds to the total number of defects per 1,000 lines of code. In this research we have used the software metric data set with this defect density data to predict the defect density value for a given project or module. Artificial neural networks and decision tree approaches are used to predict the defect density values for a testing data set.
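The defect density definition above amounts to a one-line computation. The sketch below is illustrative only; the function and argument names are hypothetical, and the repository already supplies this value, so nothing here describes how it was actually produced.

```python
def defect_density(defect_count, lines_of_code):
    """Defects per 1,000 lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return 1000.0 * defect_count / lines_of_code
```

For example, a module with 3 recorded defects in 1,500 lines of code has a density of 2 defects per 1,000 lines, and a module without defects has a density of zero regardless of its size.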

Multi-layer perceptron method is used in ANN experiments. Multilayer perceptrons are feedforward neural networks trained with the standard backpropagation algorithm. Feedforward neural networks provide a general framework for representing non-linear functional mappings between a set of input variables and a set of output variables. This is achieved by representing the nonlinear function of many variables in terms of compositions of nonlinear functions of a single variable, which are called activation functions (Bishop, 1995).
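The composition of single-variable activation functions can be made concrete with the forward pass of a network with one hidden layer. The sketch below assumes tanh hidden units and a linear output unit; the weight layout and the function name are illustrative choices, and training by backpropagation is not shown.

```python
from math import tanh


def mlp_forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One-hidden-layer perceptron with a single real-valued output.

    hidden_weights holds one weight list per hidden unit, each as long as x.
    """
    # Each hidden unit applies the activation to a weighted sum of the inputs.
    hidden = [
        tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Linear output unit: a weighted sum of the hidden activations, no squashing.
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
```

A linear output keeps the prediction unbounded, which suits a regression target such as defect density.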

Decision trees are one of the most popular approaches for both classification and regression type predictions. A decision tree is a classifier in a tree structure that is generated from specific rules. Each decision node is based on an attribute and branches for each possible outcome of that attribute; each leaf node holds the outcome obtained, which is computed with respect to the existing attributes. A decision tree can be thought of as a sequence of questions that leads to a final outcome. Each question depends on the answer to the previous one, which is what produces the branching in the tree. While generating the decision tree, the main goal is to minimize the average number of questions asked in each case, which increases prediction performance (Mitchell, 1997). One approach to creating a decision tree is to use entropy, a fundamental quantity in information theory. The entropy value determines the level of uncertainty, and the degree of uncertainty is related to the success rate of predicting the result. To overcome the over-fitting problem we used pruning: we minimize the output variable variance in the validation data by selecting a simpler tree than the one obtained when the tree building algorithm stopped, but one that is equally accurate for predicting or classifying "new" observations. In the regression type prediction experiments we used regression trees, which may be considered a variant of decision trees designed to approximate real-valued functions instead of being used for classification tasks.
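To illustrate how entropy can guide the choice of a question at a decision node, the sketch below computes the entropy of a set of class labels and the information gain of a candidate split. It is a textbook-style illustration with names of our own choosing, and it makes no claim about the splitting criterion actually used by the tree functions in the experiments.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, groups):
    """Entropy reduction obtained by partitioning `labels` into `groups`."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder
```

A split that separates defected from non defected modules perfectly removes all uncertainty, so its gain equals the entropy of the unsplit set; a split that leaves the class proportions unchanged in every branch has a gain of zero.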

In the experiments we first applied the two methods to perform a regression based prediction over the whole data set and calculated the corresponding mse values from the results. The mse measures the spread of the predictions around the target values. To evaluate the performance of each algorithm, we compared its mse with the variance of the testing data set (equivalently, the square root of the mse with the standard deviation). The variance of the data set is in fact the mse obtained when all predictions are equal to the mean value of the data set. For the performance of a specific experiment to be declared acceptable, its mse value should be considerably less than the variance of the data set. Otherwise there is no need to apply such sophisticated learning methods; one can obtain a similar level of success by just predicting every value to be the mean value of the data set.
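The comparison against the mean predictor can be sketched as follows. The helper names are assumptions made for illustration, and how much smaller than the variance the mse must be to count as acceptable is a judgment call that the code deliberately leaves to the reader.

```python
from statistics import mean


def mse(targets, predictions):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def mean_baseline_mse(targets):
    """mse of always predicting the mean; equals the population variance."""
    m = mean(targets)
    return mse(targets, [m] * len(targets))


def relative_mse(targets, predictions):
    """Ratio of the model's mse to the mean baseline (targets must not be constant).

    Values near 1 mean the model is no better than predicting the mean;
    values near 0 mean it explains most of the spread.
    """
    return mse(targets, predictions) / mean_baseline_mse(targets)
```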

The first experiments, which use the whole data set, show that the performance of both algorithms is not in an acceptable range; these outcomes are detailed in the results section. The data set includes mostly non-defected modules, so there is a bias towards underestimating the defect possibility in the prediction process. Any other input data set can be expected to have the same characteristic, since real life software projects are likely to contain many more non-defected modules than defected ones.

As a second type of experiment we repeated the experiments with metric data that contains only defected items. With such a data set, the influence of the dense non-defected items disappears, as shown in the results section. These experiments give successful results, and since we are trying to estimate the density of the possible defects, using the new data set is an improvement with respect to our primary goal.

Although the second type of experiments is successful in terms of defect prediction, it is practically impossible to start from this lucky position. In other words, without knowing which modules are defected, it does not make much sense to estimate the magnitude of the possible defects among the defected modules. So as a third type of experiment we used ANN and decision tree methods to classify the whole data set in terms of being defected or not. The classification process assigns each item of the testing data set to one of two clusters. In these experiments the classification is done with respect to a threshold value, which is close to zero but is calculated internally by the experiments. This threshold is the value at which the performance of the classification algorithm is maximized. One of the two resulting clusters consists of the values less than the threshold, which indicates that there is no defect, and the other consists of the values greater than the threshold, which indicates that there is a defect. The threshold value may vary with the input data set used, and it can be calculated throughout the experiments for any data set. The performance of this classification process is measured by the total number of correct predictions it makes compared to the incorrect ones. The results section includes the outcomes of these experiments in detail.
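One simple way to realize such a threshold search is to try a list of candidate values and keep the one with the highest share of correct predictions. The sketch below is an assumption-laden illustration: the function names, the explicit candidate list and the treatment of a score exactly equal to the threshold (counted as non defected) are our choices.

```python
def accuracy_at(threshold, scores, is_defected):
    """Share of items classified correctly when score > threshold means 'defected'."""
    correct = sum((s > threshold) == d for s, d in zip(scores, is_defected))
    return correct / len(scores)


def best_threshold(scores, is_defected, candidates):
    """Candidate with the highest accuracy; ties go to the earliest candidate."""
    return max(candidates, key=lambda t: accuracy_at(t, scores, is_defected))
```

Here `scores` are the real-valued outputs of the learner and `is_defected` holds the true labels as booleans. Selecting the threshold on the same data that is later used for evaluation would give an optimistic estimate, so in practice it is safer to choose it on data kept apart from the final test set.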

The three types of experiments explained above guided us in proposing the novel model for defect prediction in software projects. According to the results of these experiments, better results are obtained when a classification is carried out first and a regression type prediction is then done over the data that is expected to be defected. So the model has two steps. The first step classifies the input data set with respect to being defected or not. After this classification, a new data set is generated from the items that are predicted as defected, and a regression is done to predict the defect density values within this new data set.
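The two steps can be sketched as a small pipeline that accepts any classifier and any regressor as plain functions. This is an illustration rather than the implementation used in the experiments: the name `two_step_predict` is hypothetical, and reporting a density of 0.0 for modules classified as non defected is an assumption of this sketch.

```python
def two_step_predict(modules, classify, regress):
    """Step 1: classify each module's metric vector as defected or not.
    Step 2: estimate defect density only for the modules flagged as defected.

    `classify` returns a boolean and `regress` returns a density estimate;
    both stand in for trained models.
    """
    predictions = []
    for metrics in modules:
        if classify(metrics):
            predictions.append(regress(metrics))
        else:
            predictions.append(0.0)   # assumption: non defected means zero density
    return predictions
```

With toy stand-ins such as `classify = lambda m: m[0] > 10` and `regress = lambda m: 0.5 * m[0]`, the call `two_step_predict([(4,), (20,)], classify, regress)` yields `[0.0, 10.0]`. In this arrangement the regressor would be trained only on defected modules, which is the setting in which it is reported to perform well.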

The novel model predicts the possibly defected modules in a given data set and, in addition, gives an estimate of the defect density in each module that is predicted as defected. The model therefore helps to concentrate effort on specific suspected parts of the code, so that a significant amount of time and resources can be saved in the software quality process.

6. Results
In this research, the training and testing are made using MATLAB’s MLP and decision tree algorithms based on a model for classification and regression. The data set used in the experiments contains 6,000 training data and 2,000 testing data. The resulting values are the mean values of 30 separately run experiments.

In designing the experiment set of the MLP algorithm, a neural network is generated using a linear function as the output unit activation function. 32 hidden units are used in network generation and the alpha value is set to 0.01, while the experiments are done with 200 training cycles. In the experiment set of the decision tree algorithms, the Treefit and Treeprune functions are used consecutively. The method of the Treefit function is switched between classification and regression as required.

6.1. Regression over the whole data set
In the first type of experiments neither the ANN method nor decision trees produced successful results. The average variance of the data sets, which are generated randomly by a shuffling algorithm, is 1,402.21 and the mean mse value for the ANN experiments is 1,295.96. This value is far from acceptable, since the method fails to approximate the defect density values. Figure 1 depicts the scatter graph of the predicted values and the real values. According to this graph, the method clearly makes faulty predictions for the non defected values: the points lying on the y-axis show an unacceptable number of faulty predictions for non defected items. Apart from failing to predict the non defected ones, the method is also biased towards underestimating the defected items, because the vast majority of predictions lie under the line that depicts the correct predictions.

Figure 1. The predicted values and the real values in ANN experiments

The decision tree method similarly produces unsuccessful results when the input is the complete data set, which contains both defected and non defected items and in which the non defected ones are much denser. The average variance of the data sets is 1,353.27 and the mean mse value for the decision tree experiments is 1,316.42. This result is slightly worse than that of the ANN experiments. Figure 2 shows the predictions made by the decision tree method and the real values. Like the ANN method, the decision tree method also fails to predict non defected values. Moreover, the decision tree method makes many more non defected predictions where the real values show that the corresponding items are defected. Also, the effect of the input data set, explained above as a bias towards the zero value, is not as high as in the ANN case.

Figure 2. The predicted values and the real values in decision tree experiments

6.2. Regression over the data set containing only defected items
The second type of experiments are done with input data sets which contain only defected items. The results for both ANN and decision tree methods are more successful than in the first type of experiments.

The average variance of the data sets used in the ANN experiments is 1,637.41 and the mean mse value is 262.61. According to these results the MLP algorithm approximates the error density values well when only defected items reside in the input data set. It also shows that the dense non defected data affects the prediction capability of the algorithm negatively. Figure 3 shows the predicted values and the real values after an ANN experiment run. As seen from the graph, the algorithm estimates the defect density better for smaller values; for higher values of defect density the scatter deviates more from the line that depicts the correct predictions.

Figure 3. The predicted values and the real values in ANN experiments where the input data set contains only defected items

The average variance of the data sets in the decision tree experiments is 1,656.23 and the mean mse value is 237.68. Like the ANN experiments, the decision tree method is also successful in predicting the defect density values when only defected items are included in the input data set. According to Figure 4, which depicts the experiment results, the decision tree algorithm gives more accurate results than the ANN method for almost half of the samples. However, the spread of the erroneous predictions shows that their deviations are larger than those of the ANN method. Like the ANN method, the decision tree method also shows increasing deviations from the real values as the defect density values increase.

Figure 4. The predicted values and the real values in decision tree experiments where the input data set contains only defected items

6.3. Classification with respect to defectedness
In the third type of experiments the problem is reduced to predicting only whether a module is defected or not. For this purpose both algorithms are used to classify the testing data set into two clusters. The value that divides the output data set into two clusters is calculated dynamically: it is selected among various candidate values according to their performance in clustering the data set correctly. After several experiment runs, the performance of the clustering algorithm is measured with respect to these values and the best one is selected as the point which generates the two clusters; smaller values are non defected and the others are defected.

For both methods, the value that separates the defected and non defected clusters is selected as 5, after trials with values ranging from 0 to 10. The performance drops significantly beyond that value; the best results are achieved when 5 is selected as the cluster separation point for both the ANN and decision tree methods.

In the ANN experiments the clustering algorithm is partly successful in predicting the defected items. The mean percentage of correct predictions is 88.35% for the ANN experiments. The mean percentage of correct defected predictions is 54.44%, whereas the mean percentage of correct non defected predictions is 97.28%. These results show that the method is very successful in recognizing the non defected items, but it finds only slightly more than one out of every two really defected items.

The decision tree method is more successful than the ANN method in this type of experiment. The mean percentage of correct predictions is 91.75% for the decision tree experiments. The mean percentage of correct defected predictions is 79.38% and the mean percentage of correct non defected predictions is 95.06%. The main difference between the two methods arises in predicting the defected and non defected items separately. The decision tree method is better in the former whereas the ANN method is more successful in the latter. According to these results it can be concluded that the classification experiments are much more successful than the experiments aiming at regression. Since the regression methods perform better on a data set containing only the defected items, passing on only the items predicted as defected by this classification process will improve the overall performance of defect density prediction.
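The three percentages reported for each method can be computed from the true and predicted labels as sketched below. We assume here that "correct defected predictions" means the share of truly defected items that are predicted as defected, and likewise for non defected items; the function name and the dictionary keys are illustrative.

```python
def classification_rates(actual, predicted):
    """Overall and per-class shares of correct predictions.

    `actual` and `predicted` are equal-length lists of booleans
    (True = defected); both classes are assumed to be present.
    """
    pairs = list(zip(actual, predicted))
    on_defected = [p for a, p in pairs if a]          # predictions for truly defected items
    on_clean = [p for a, p in pairs if not a]         # predictions for truly non defected items
    return {
        "overall": sum(a == p for a, p in pairs) / len(pairs),
        "defected": sum(on_defected) / len(on_defected),
        "non_defected": sum(not p for p in on_clean) / len(on_clean),
    }
```

Because non defected items dominate the data, the overall rate is pulled towards the non defected rate, so the per-class figures are the more informative ones when comparing the two methods.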

As a result, we divide the defect prediction problem into two parts. The first part consists of predicting whether a given module is defected or not. The second part is predicting the magnitude of the possible defect if the module is labeled as defected by the first part. Predicting the defect density value within a data set containing only defected items gives much better results than using the whole data set, where an intrinsic bias towards lessening the magnitude of the defect arises. Also, by dividing the problem into two separate problems, and knowing that the second part is successful enough in predicting the defect density, it is possible to improve the overall performance of the learning system by improving the performance of the classification part.

7. Conclusion
In this research, we proposed a new defect prediction model based on machine learning methods. MLP and decision tree methods make many more wrong defect predictions when applied to the entire data set containing both defected and non defected items. Since most modules in the input data have zero defects (80% of the whole data), the applied machine learning methods fail to predict scores within the expected performance. Because the data set is already 80% non-defected, an algorithm that labels every test item as non-defected without trying to learn at all is guaranteed 80% success. Therefore the logic behind the learning methodology fails, and a different methodology that can manage such a data set for software metrics is required.
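The argument about the guaranteed success rate can be checked with a trivial predictor that never learns anything. The sketch below is purely illustrative and its function name is hypothetical.

```python
def always_non_defected_accuracy(is_defected):
    """Accuracy of labeling every module as non defected.

    `is_defected` is a list of booleans holding the true labels.
    """
    return sum(not d for d in is_defected) / len(is_defected)
```

On a label list in which 80 out of 100 modules are non defected the function returns 0.8, although such a predictor finds none of the defected modules. This is why overall accuracy alone is a weak measure on data of this kind.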

Instead of predicting the defect density value of a given module directly, first trying to find whether a module is defected and then estimating the magnitude of the defect seems to be an enhanced technique for such data sets. Metric values for modules that have a defect count of zero or not are very similar, so it is much easier to learn the defectedness probability. Moreover, it is also much easier to learn the magnitude of the defects while training within the modules that are known to be defected.

In the training set of software metrics, most modules have zero or very small defect densities, so the defect density values can be classified into two clusters, the defected and non-defected sets. This partitioning enhances the performance of the learning process and enables regression to work only on training data consisting of modules that are predicted as defected in the first step.

Clustering modules as defected and non-defected based on a threshold value enhances learning and estimation in the classification process. This threshold value is set automatically within the learning process, so that it is an equilibrium point where the learning performance is at its maximum.
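One way such a threshold could be chosen is sketched below. The text does not specify the search procedure or the performance measure, so everything here is an assumption: candidate thresholds are the observed density values, the classifier is a one-feature stump, and performance is measured by balanced accuracy on the training data so that a threshold producing an almost empty class is not rewarded. The function names and data are hypothetical.

```python
def balanced_accuracy(labels, preds):
    """Mean of the recall on defected and on non-defected items.
    Assumes both classes are present in `labels`."""
    pos = [p for y, p in zip(labels, preds) if y]
    neg = [p for y, p in zip(labels, preds) if not y]
    tpr = sum(pos) / len(pos)
    tnr = sum(not p for p in neg) / len(neg)
    return (tpr + tnr) / 2

def best_stump_score(xs, labels):
    """Best balanced accuracy over all rules 'defected if x >= cut'."""
    return max(balanced_accuracy(labels, [x >= cut for x in xs])
               for cut in set(xs))

def choose_threshold(xs, densities):
    """Return (threshold, score) for the density threshold whose induced
    labels are learned best; None if all densities are equal."""
    best = None
    # Skip the largest density so the defected class is never empty.
    for t in sorted(set(densities))[:-1]:
        labels = [d > t for d in densities]
        score = best_stump_score(xs, labels)
        if best is None or score > best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: one metric value and one defect density per module.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
densities = [0, 0, 0.1, 0, 0.2, 2.0, 3.0, 2.5]
threshold, score = choose_threshold(xs, densities)
```

In this toy data a threshold of zero forces the module with a very small density to count as defected although its metric value looks like a clean module; a slightly higher threshold yields labels the classifier can separate perfectly.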

In our specific experiment dataset we observed that the decision tree algorithm performs better than the MLP algorithm, both in classifying the items in the dataset as defected or not and in estimating the defect density of the items predicted to be defected. The decision tree algorithm also generates rules in the classification process. These rules determine which branches to follow towards the leaf nodes of the tree. The effects of all features in the dataset can be observed by examining these rules.
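To illustrate how such rules can be read off a tree, the sketch below walks a small hand-written tree and emits one rule per leaf. The tree, its metric names (`cyclomatic_complexity`, `lines_of_code`) and its cut values are invented for illustration and are not results from the experiments.

```python
def extract_rules(node, path=()):
    """Return one 'if ... then ...' rule for every leaf below `node`."""
    if "label" in node:  # leaf: the path so far is the rule's condition
        condition = " and ".join(path) or "always"
        return [f"if {condition} then {node['label']}"]
    feat, cut = node["feature"], node["cut"]
    return (extract_rules(node["low"], path + (f"{feat} < {cut}",))
            + extract_rules(node["high"], path + (f"{feat} >= {cut}",)))

# Hypothetical tree: internal nodes test one metric against a cut value.
tree = {
    "feature": "cyclomatic_complexity", "cut": 10,
    "low": {"label": "non-defected"},
    "high": {
        "feature": "lines_of_code", "cut": 200,
        "low": {"label": "non-defected"},
        "high": {"label": "defected"},
    },
}

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

Each printed rule lists the tests along one root-to-leaf path, so the features that appear, and how close to the root they appear, indicate their influence on the classification.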

With our two step approach, the model not only predicts which modules are defected but also generates estimates of the defect magnitudes. Software practitioners may use these estimates when deciding how much resource and effort to allocate to software quality processes such as testing. Our model constitutes a sound risk assessment technique for software projects, based on the code metrics data of the project.

As future work, different machine learning algorithms, or improved versions of the algorithms used here, may be included in the experiments. The algorithms used in our evaluation experiments are the simplest forms of some widely used methods. The model can also be applied to other risk assessment procedures whose data can be supplied as input to the system. These risk issues must, of course, have quantitative representations to be considered as input for our system.

Notes

For information on NASA/WVU IV&V Facility Metrics Data Program see http://mdp.ivv.nasa.gov.

Bibliography
Bertolino, A., and Strigini, L., 1996. On the Use of Testability Measures for Dependability Assessment, IEEE Trans. Software Engineering, vol. 22, no. 2, pp. 97-108.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition, Oxford University Press.

Boetticher, G.D., Srinivas, K., Eichmann, D., 1993. A Neural Net-Based Approach to Software Metrics, Proceedings of the Fifth International Conference on Software Engineering and Knowledge Engineering, San Francisco, pp. 271-274.

CHAOS Chronicles, The Standish Group - Standish Group Internal Report, 1995.

Cusumano, M.A., 1991. Japan’s Software Factories, Oxford University Press.

Diaz, M., and Sligo, J., 1997. How Software Process Improvement Helped Motorola, IEEE Software, vol. 14, no. 5, pp. 75-81.

Dickinson, W., Leon, D., Podgurski, A., 2001. Finding failures by cluster analysis of execution profiles. In ICSE, pages 339–348.

Fenton, N., and Neil, M., 1999. A critique of software defect prediction models, IEEE Transactions on Software Engineering, Vol. 25, No. 5, pp. 675-689.

Groce, A., and Visser, W., 2003. What went wrong: Explaining counterexamples, In SPIN 2003, pages 121–135.

Henry, S., and Kafura, D., 1984. The Evaluation of Software System’s Structure Using Quantitative Software Metrics, Software Practice and Experience, vol. 14, no. 6, pp. 561-573.

Jensen, F.V., 1996. An Introduction to Bayesian Networks, Springer.

Hudepohl, P., Khoshgoftaar, M., Mayrand, J., 1996. Integrating Metrics and Models for Software Risk Assessment, The Seventh International Symposium on Software Reliability Engineering (ISSRE '96).

Mitchell, T.M., 1997. Machine Learning, McGraw-Hill.

Neumann, D.E., 2002. An Enhanced Neural Network Technique for Software Risk Analysis, IEEE Transactions on Software Engineering, pp. 904-912.

Podgurski, A., Leon, D., Francis, P., Masri, W., Minch, M., Sun, J., and Wang, B., 2003. Automated support for classifying software failure reports. In ICSE, pages 465–475.

Brun, Y., and Ernst, M.D., 2004. Finding latent code errors via machine learning over program executions. Proceedings of the 26th International Conference on Software Engineering, Edinburgh, Scotland.

Zhang, D., 2000. Applying Machine Learning Algorithms in Software Development, The Proceedings of 2000 Monterey Workshop on Modeling Software System Structures, Santa Margherita Ligure, Italy, pp. 275-285.