Jiang Y, Li M, Zhou ZH. Software defect detection with Rocus. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 26(2): 328-342 Mar. 2011. DOI 10.1007/s11390-011-1135-6

Software Defect Detection with ROCUS

Yuan Jiang (姜远), Member, CCF, Ming Li (黎铭), Member, CCF, ACM, IEEE, and Zhi-Hua Zhou (周志华), Senior Member, CCF, IEEE, Member, ACM

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
E-mail: {jiangyuan, lim, ...}

Received May 15, 2009; revised October 26, 2010.

Regular Paper. This work was supported by the National Natural Science Foundation of China under Grant Nos. 60975043, 60903103, and 60721002. Corresponding Author. ©2011 Springer Science+Business Media, LLC & Science Press, China.

Abstract   Software defect detection aims to automatically identify defective software modules for efficient software test in order to improve the quality of a software system. Although many machine learning methods have been successfully applied to the task, most of them fail to consider two practical yet important issues in software defect detection. First, it is rather difficult to collect a large amount of labeled training data for learning a well-performing model; second, in a software system there are usually much fewer defective modules than defect-free modules, so learning would have to be conducted over an imbalanced data set. In this paper, we address these two practical issues simultaneously by proposing a novel semi-supervised learning approach named Rocus. This method exploits the abundant unlabeled examples to improve the detection accuracy, as well as employs under-sampling to tackle the class-imbalance problem in the learning process. Experimental results on real-world software defect detection tasks show that Rocus is effective for software defect detection. Its performance is better than a semi-supervised learning method that ignores the class-imbalance nature of the task and a class-imbalance learning method that does not make effective use of unlabeled data.

Keywords   machine learning, data mining, semi-supervised learning, class-imbalance, software defect detection

1 Introduction

Enabled by technological advances in computer hardware, software systems have become increasingly powerful and versatile. However, the attendant increase in software complexity has made the timely development of reliable software systems extremely challenging. To make software systems reliable, it is very important to identify as many defects as possible before releasing the software. However, due to the complexity of software systems and tight project schedules, it is almost impossible to extensively test every path of the software under all possible runtime environments. Thus, accurately predicting whether a software module contains defects can help to allocate the limited test resources effectively and, hence, improve the quality of software systems. Such a process is usually referred to as software defect detection, which has already drawn much attention in the software engineering community.

Machine learning techniques have been successfully applied to building predictive models for software defect detection [1-7]. The static and dynamic code attributes, or software metrics, are extracted from each software module to form an example, which is then labeled as "defective" or "defect-free". Predictive models that learn from a large number of examples are expected to accurately predict whether a given module is defective. However, most of these studies have not considered two practical yet important issues in software defect detection.
First, although it is relatively easy to automatically generate examples from software modules using some standard tools, determining whether a module contains defects through extensive testing usually consumes too much time and resources, since the number of program states grows exponentially as the complexity of the software increases. With limited time and test resources, one can only obtain the labels for a small portion of modules. However, the predictive models that learn from such a small labeled training set may not perform well. Second, the data in software defect detection are essentially imbalanced: the number of defective modules is usually much smaller than that of the defect-free modules. Ignoring the imbalanced nature of the problem, a learner that minimizes the prediction error can often produce a useless predictive model that uniformly predicts all the modules as defect-free. Without taking these two issues into consideration, the effectiveness of software defect detection in many real-world tasks would be greatly reduced.

Some researchers have noticed the importance of these two issues in software defect detection and tried to tackle some of them based on machine learning techniques. For instance, Seliya and Khoshgoftaar [8] employed semi-supervised learning to improve the performance achieved on a small amount of labeled data by exploiting the abundant unlabeled data; contrarily, Pelayo and Dick [9] applied a resampling strategy to balance the skewed class distribution of the data set before learning the predictive model for software defect detection. Although attempting to tackle one issue may gain some performance improvement, both methods suffer from the influence of the other issue that they have not considered. If conventional semi-supervised learning is used, even assuming that the learner can accurately assign labels to the unlabeled data, the learner may easily be biased by the overwhelming number of newly-labeled defect-free modules, and hence the refined model would be less sensitive to the defective modules. The sensitivity drops fast as the iterative semi-supervised learning proceeds. On the other hand, resampling methods become less effective when provided with only a few labeled examples, where overfitting is inevitable no matter whether the small number of defective examples is replicated or the number of overwhelming defect-free examples is reduced. Therefore, to achieve effective software defect detection, we need to consider these two important issues simultaneously. To the best of our knowledge, there is no previous work that has considered these two issues simultaneously in software defect detection.

In this paper, we address the aforementioned two issues by proposing a novel semi-supervised learning method named Rocus (RandOm Committee with Under-Sampling). This method incorporates recent advances in disagreement-based semi-supervised learning [10] with an under-sampling strategy [11] for imbalanced data. The key idea is to keep the individual learners focused on the minority class during the exploitation of the unlabeled data. Experiments on eight real-world software defect detection tasks show that Rocus is effective for software defect detection.
Its performance is better than both the semi-supervised learning method that ignores the class-imbalance nature of the tasks and the class-imbalance learning method that does not exploit unlabeled data.

The rest of the paper is organized as follows. Section 2 briefly reviews some related work. Section 3 presents the Rocus method. Section 4 reports the experiments on the software defect detection tasks. Finally, Section 5 concludes the paper.

2 Related Work

2.1 Software Defect Detection

Software defect detection, which aims to automatically identify the software modules that contain certain defects, is essential to software quality assurance. Most software defect detection methods fall roughly into two categories. Methods in the first category leverage execution information to identify suspicious program behaviors for defect detection [12-14], while methods in the second category extract static code properties, usually represented by a set of software metrics, for each module in the software system [7, 15-16]. Since it is easier to measure static code properties than to measure dynamic program behaviors, metric-based software defect detection has drawn much attention. Widely-used software metrics include LOC counts describing the module in terms of size, Halstead attributes measuring the number of operators and operands in the module as its reading complexity [17], and McCabe complexity measures derived from the flow graph of the module [18].
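To make the flavor of such metrics concrete, the following sketch computes two of the simpler module-level features for a piece of source code: a raw LOC count and a rough McCabe-style complexity estimate obtained by counting branching keywords. This is only a minimal Python illustration; the keyword list, the crude comment handling, and the function name module_metrics are assumptions made for this sketch, not the exact Halstead or McCabe definitions used in defect-detection benchmark data (where such metrics usually come pre-computed for every module).

    import re

    # Branching constructs counted for a rough McCabe-style estimate; this keyword
    # set is an illustrative assumption, not the exact definition used in the
    # benchmark data sets.
    BRANCH_KEYWORDS = re.compile(r"\b(if|for|while|case|catch)\b")

    def module_metrics(source: str) -> dict:
        """Return a small feature dictionary for one software module."""
        lines = [ln for ln in source.splitlines() if ln.strip()]
        # Crude comment filter: drop lines that start with a comment marker.
        code_lines = [ln for ln in lines
                      if not ln.lstrip().startswith(("//", "#", "*", "/*"))]
        loc = len(code_lines)
        # Cyclomatic complexity is roughly (number of decision points) + 1.
        decisions = sum(len(BRANCH_KEYWORDS.findall(ln)) for ln in code_lines)
        return {"loc": loc, "mccabe_estimate": decisions + 1}

    demo = """
    int abs_val(int x) {
        if (x < 0) {
            return -x;
        }
        return x;
    }
    """
    print(module_metrics(demo))  # {'loc': 6, 'mccabe_estimate': 2}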
In the past decade, machine learning has been widely applied to constructing predictive models based on the extracted software metrics to detect defects in software modules. Typical methods include linear or logistic regression [7, 15], classification and regression trees [3, 19], artificial neural networks [1, 16], memory-based methods [20-21], and Bayesian methods [22-23]. In order to further increase the robustness to outliers in the training data and improve the prediction performance, Guo et al. [2] applied ensemble learning to software defect detection and achieved better performance than other commonly-used methods such as logistic regression and decision trees. Recently, Lessmann et al. [4] conducted an intensive empirical study in which they compared the predictive performance of 22 machine learning methods over the benchmark data sets of software defect detection.

Note that few previous studies have considered the characteristics of software defect detection (e.g., that defective modules are difficult to collect). It has been shown that more accurate detection can be achieved even if only one of these characteristics is considered during learning [8-9]. Since these characteristics are usually intertwined with each other, better performance can be expected if they are carefully considered together during learning, which is what we do in this paper.

2.2 Semi-Supervised Learning

In many practical applications, abundant unlabeled data can be easily collected, while only a few labeled data can be obtained, since much human effort and expertise are required for labeling. Semi-supervised learning [24-25] is a machine learning technique where the learner automatically exploits the large amount of unlabeled data, in addition to the few labeled data, to help improve the learning performance.

Generally, semi-supervised learning methods fall into four major categories, i.e., generative-model-based methods [26-28], low-density-separation-based methods [29-31], graph-based methods [32-34], and disagreement-based methods [35-39].

Disagreement-based methods use multiple learners and exploit the disagreements among the learners during the learning process. If the majority of learners are much more confident about a disagreed unlabeled example than the minority learner(s), then the majority will teach the minority on this example. Disagreement-based semi-supervised learning originates from the work of Blum and Mitchell [35], where classifiers learned from two sufficient and redundant views teach each other using some confidently predicted unlabeled examples. Later, Goldman and Zhou [36] proposed an algorithm which does not require two views but requires two different learning algorithms. Zhou and Li [38] proposed using three classifiers to exploit unlabeled data, where an unlabeled example is labeled and used to teach one classifier if the other two classifiers agree on its labeling. Later, Li and Zhou [37] further extended the idea in [38] by collaborating more classifiers in the training process. Besides classification, Zhou and Li [39] also adapted the disagreement-based paradigm to semi-supervised regression. The disagreement-based semi-supervised learning paradigm has been widely applied to natural language processing (e.g., [40]), information retrieval (e.g., [41-42]), computer-aided diagnosis (e.g., [37]), etc.

Few studies have applied semi-supervised learning to software defect detection, where the labeled training examples are limited while the unlabeled examples are abundant. Recently, Seliya and Khoshgoftaar [8] applied a generative-model-based semi-supervised learning method to software defect detection and achieved performance improvement. Note that [8] adopts a generative approach for exploiting unlabeled data while the proposed method adopts a discriminative approach. Thus, we did not include it in our empirical study, for the purpose of fair comparison.

2.3 Learning from Imbalanced Data

In many real-world applications such as software defect detection, the class distribution of the data is imbalanced, that is, the examples from the minority class are (much) fewer than those from the other class. Since it is easy to achieve good overall performance by keeping the majority-class examples classified correctly, the sensitivity of classifiers to the minority class may be very low if they learn directly from the imbalanced data. To achieve better sensitivity to the minority class, the class-imbalance problem should be explicitly tackled. Popular class-imbalance learning techniques include sampling [11, 43-44] and cost-sensitive learning [45-46]. Since the sampling technique is used in this paper, we introduce sampling in more detail.

Sampling attempts to achieve a balanced class distribution by altering the data set. Under-sampling reduces the number of majority-class examples while over-sampling increases the number of minority-class examples [11]; both have been shown to be effective for class-imbalance problems.
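Before turning to the more sophisticated variants discussed next, a minimal sketch of plain random under-sampling may help fix ideas: it keeps every minority-class example and discards majority-class examples at random until the two classes have equal size. The NumPy-based implementation and the function name random_undersample are assumptions made for illustration only.

    import numpy as np

    def random_undersample(X, y, minority_label=1, seed=None):
        """Drop majority-class rows at random until both classes have equal size.

        X is an (n, d) feature matrix and y an (n,) label vector in {-1, +1};
        the minority class (+1) is assumed to be the smaller one.
        """
        rng = np.random.default_rng(seed)
        minority_idx = np.flatnonzero(y == minority_label)
        majority_idx = np.flatnonzero(y != minority_label)
        # Keep every minority example; sample an equal number of majority examples.
        kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
        keep = np.concatenate([minority_idx, kept_majority])
        rng.shuffle(keep)
        return X[keep], y[keep]

Over-sampling is the mirror image: instead of dropping majority-class rows, it replicates minority-class rows (or, as in Smote [43], generates synthetic ones by interpolating between neighboring minority-class examples).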
More sophisticated methods can also be employed to balance the class distribution, such as adding synthetic minority-class examples generated by interpolating between neighboring minority-class examples [43], discarding non-representative majority-class examples [44], combining different sampling methods for further improvement [47], and using the ensemble technique for exploratory under-sampling to avoid removing useful majority-class examples [48].

Class-imbalance learning methods are seldom used in software defect detection. Recently, Pelayo and Dick [9] studied the effectiveness of Smote [43] on software defect detection, and found that balancing the skewed class distribution is beneficial to software defect detection.

3 Proposed Approach

Let L = {(x_1, y_1), (x_2, y_2), ..., (x_{m_0}, y_{m_0})} denote the set of labeled examples and let U = {x_{m_0+1}, x_{m_0+2}, ..., x_N} denote the set of unlabeled examples, where x_i is a d-dimensional feature vector and y_i ∈ {−1, +1} is the class label. Conventionally, +1 denotes the minority class (e.g., "defective" in software defect detection). Hereinafter, we refer to class +1 as the minority class and −1 as the majority class. Both L and U are independently drawn from the same unknown distribution D, whose marginal distributions satisfy P_D(y_i = +1) ≪ P_D(y_i = −1); hence, L and U are imbalanced data sets in essence.
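The small setup sketch below mirrors this notation on synthetic data: a pool of N modules described by d metrics each, of which only m_0 are labeled, with the +1 (defective) class kept rare in both L and U. The concrete sizes and the 10% defect rate are arbitrary illustrative assumptions, not figures taken from the paper or its benchmark data.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic stand-in for a software-metrics data set: N modules, d metrics each,
    # with roughly 10% defective (+1) and 90% defect-free (-1) modules.
    N, d, m0 = 1000, 20, 100
    X = rng.normal(size=(N, d))
    y = np.where(rng.random(N) < 0.10, 1, -1)

    # Only m0 randomly chosen modules have been tested, i.e., labeled; the rest form U.
    perm = rng.permutation(N)
    L_X, L_y = X[perm[:m0]], y[perm[:m0]]   # small, imbalanced labeled set L
    U_X = X[perm[m0:]]                      # large unlabeled set U
    print(f"|L| = {len(L_y)}, defective in L: {(L_y == 1).sum()}, |U| = {len(U_X)}")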
As mentioned in Section 1, directly applying semi-supervised learning to imbalanced data would be risky. Since L is imbalanced and usually small, very few examples of the minority class would be used to initiate the semi-supervised learning process. The resulting model may have poor sensitivity to the minority class and hence can hardly identify minority-class examples in the unlabeled set. In this case, the learner would have to use little information from the minority class and overwhelming information from the majority class for model refinement, which leads to even poorer sensitivity to the minority class. As the iterative semi-supervised learning proceeds, the learned model would become biased towards predicting every example as the majority class.

In order to successfully conduct iterative semi-supervised learning on imbalanced data, the learner should have the following two properties. First, the learner should have strong generalization ability, such that even when provided with a small labeled training set with an imbalanced class distribution, it does not have zero sensitivity to the minority-class examples during the automatic labeling process; second, the influence of the overwhelming number of newly labeled majority-class examples should be further reduced, in order to improve the sensitivity of the learner to the minority-class examples after its refinement in each learning iteration. Based on these two considerations, we propose the Rocus method to exploit the imbalanced unlabeled examples.

To meet the first requirement, we train multiple classifiers and then combine them for prediction. The reason behind this specific choice of the ensemble learning paradigm is that an ensemble of classifiers can usually achieve better generalization performance than a single classifier. Such superiority is more obvious when the training set is small [37] and the class distribution is imbalanced [48]. Thus, by exploiting this generalization power, the ensemble trained from L is able to identify some minority-class examples in U effectively.

Since multiple classifiers are used, we employ the disagreement-based semi-supervised learning paradigm [10] to exploit the unlabeled examples in U. In detail, after the initial ensemble of classifiers {h_1, h_2, ..., h_C} is constructed, individual classifiers select some examples in U to label according to a disagreement level, and then teach the other classifiers with the newly labeled examples. Here, similar to [37], we adopt a simple case where the classifiers H_i = {h_1, ..., h_{i−1}, h_{i+1}, ..., h_C} are responsible for selecting confidently labeled examples in U for an individual classifier h_i.
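As a rough illustration of the two ingredients described so far — a committee of classifiers in which the remaining members label unlabeled examples for each individual classifier, combined with under-sampling of the newly labeled defect-free examples — the sketch below runs a few teaching rounds on the L_X, L_y, U_X arrays from the setup sketch above. It is a simplified sketch under stated assumptions (scikit-learn decision trees as base learners, a fixed confidence threshold, a fixed number of rounds, the helper name committee_teach_round) and should not be read as the exact Rocus procedure developed in the remainder of this section.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def committee_teach_round(clfs, L_X, L_y, U_X, conf=0.8, seed=None):
        """One teaching round: for each h_i, the remaining classifiers label U.

        Unlabeled examples on which the committee is confident are added to h_i's
        training set; the newly labeled defect-free (-1) examples are under-sampled
        to match the number of newly labeled defective (+1) ones. Assumes L
        contains at least one example of each class.
        """
        rng = np.random.default_rng(seed)
        refined = []
        for i in range(len(clfs)):
            committee = [h for j, h in enumerate(clfs) if j != i]
            # Average committee probability of the minority (+1) class.
            p_pos = np.mean([h.predict_proba(U_X)[:, list(h.classes_).index(1)]
                             for h in committee], axis=0)
            pos_idx = np.flatnonzero(p_pos >= conf)        # confidently "defective"
            neg_idx = np.flatnonzero(p_pos <= 1.0 - conf)  # confidently "defect-free"
            if len(neg_idx) > len(pos_idx):                # under-sample the majority
                neg_idx = rng.choice(neg_idx, size=len(pos_idx), replace=False)
            new_X = np.vstack([L_X, U_X[pos_idx], U_X[neg_idx]])
            new_y = np.concatenate([L_y, np.ones(len(pos_idx)), -np.ones(len(neg_idx))])
            refined.append(DecisionTreeClassifier(random_state=i).fit(new_X, new_y))
        return refined

    # Initial committee: diverse trees trained on the small labeled set L.
    clfs = [DecisionTreeClassifier(max_features="sqrt", random_state=s).fit(L_X, L_y)
            for s in range(3)]
    for _ in range(5):
        clfs = committee_teach_round(clfs, L_X, L_y, U_X, conf=0.8, seed=0)

    # Final prediction: majority vote, breaking ties towards the defective class.
    votes = np.mean([h.predict(U_X) for h in clfs], axis=0)
    pred = np.where(votes >= 0, 1, -1)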
