The Impact of Concurrent Coverage Metrics on Testing Effectiveness
2013 IEEE Sixth International Conference on Software Testing, Verification and Validation

Abstract
When testing multithreaded programs, the number of possible thread interactions is so large that exercising all of them is infeasible in practice. By analogy with branch and statement coverage for sequential programs, concurrent coverage metrics have been proposed for multithreaded programs. Unlike sequential coverage metrics, however, the effectiveness of concurrent coverage metrics is largely unexamined. This paper studies the relationship between concurrent coverage and fault detection effectiveness by testing nine concurrent programs against eight concurrent coverage metrics. The results show that existing concurrent coverage metrics are fairly strong predictors of concurrent testing effectiveness and are generally reasonable targets for test suite generation. However, how well a metric serves for prediction and test suite generation depends on the program, so additional work is required.

I. INTRODUCTION
Existing approaches (dynamic data tracking, static detection, pattern-driven techniques, and so on) are not very precise, so systematic testing approaches for concurrent programs have been proposed, for example concurrent coverage metrics alongside structural coverage metrics.

"We are aware of no study rigorously examining the impact of proposed concurrent coverage metrics. We expect that increasing concurrent coverage will improve testing effectiveness, but we also expect that it will increase test suite size."

Question 1: Does improving concurrent coverage directly lead to a more effective testing process, or is it merely a byproduct of increasing test suite size?

Question 2: If improving coverage does lead to improvements, what practical gains in testing effectiveness can we expect?

"For each program and metric pairing, we used a randomized test case generation process to generate 100,000 test suites with varying levels of size and coverage, and measured the relationships between the percentage of coverage requirements satisfied, the number of test executions, and the fault detection ability of test suites via correlation and linear regression. Additionally, we compared test suites generated to achieve high coverage against random test suites of equal size. We measured fault detection ability using both mutation analysis (systematically seeding concurrency faults) and real-world faults."

II. BACKGROUND AND RELATED WORK
Unlike coverage metrics for sequential code, satisfying a concurrent coverage requirement takes more than executing particular code elements: constraints on how threads interact must also hold. Take the blocked metric as an example: every synchronized block must be blocked at least once during testing.

III. STUDY DESIGN
"The purpose of this study is to rigorously investigate existing concurrent coverage metrics, and to either provide evidence of each metric's usefulness or demonstrate that the metric is of little value."

"The usefulness of a coverage metric, concurrent or otherwise, invariably relates to many factors, such as the testing budget available, the characteristics of the program under test, and the goals of the testing process. Nevertheless, to show that any coverage metric can be considered useful, we should at minimum demonstrate two things:
1) increased levels of coverage correspond to increased fault detection effectiveness;
2) these increases are due in part to increasing coverage levels, not merely larger test suite sizes."

Note: this is like spearing fish in a river. Catching more fish should be attributable to a sharper spear, not merely to stabbing more times.

"Furthermore, to aid practitioners in selecting a coverage metric for use, we should attempt to quantify the relationship between coverage, size, and fault detection effectiveness. In particular, we are interested in the predictive value of each metric and the cost of achieving high levels of coverage."

The paper addresses two research questions:

Research Question 1 (RQ1): For each concurrent coverage metric studied, does the coverage achieved positively impact the effectiveness of the testing process for reasons other than increases in test suite size?
In other words, we would like to provide evidence that given two test suites of equal size, the test suite with higher coverage will generally be more effective.

Research Question 2 (RQ2): For each concurrent coverage metric studied, how does the fault detection effectiveness of test suites achieving maximum coverage compare to that of random test suites of equal size? While coverage levels may relate to effectiveness, the practical impact of achieving high coverage for some metric over random test suites may be insignificant.

"The objects for this study have been drawn from existing work on testing concurrent software and include objects without faults, and objects with faults detected in previous studies. We list the objects with the lines of code, numbers of threads, and mutants used in Table I."

Objects named in Table I: ArrayList (array list), BoundedBuffer (bounded buffer), Vector (vector), Alarmclock (alarm clock), Clean (cleanup), Piper, Producerconsumer (producer and consumer), Stringbuffer (string buffer), Twostage.

Part A Variables and Measures

A. Independent Variables
"In this study, we manipulate two independent variables: the concurrent coverage metric, and the method of test suite construction."

Concurrent coverage metric: "Numerous concurrent coverage metrics have been proposed, each based on some unique intuition about how to capture different aspects of concurrent executions."
"We view these metrics as having two key properties:
(1) the number of code elements the test requirements consider (either a single element or a pair of elements);
(2) the elements the metric is defined over (either synchronization elements or shared data access operations)."

Note: for example, the blocking and blocked coverage metrics define requirements over individual synchronized blocks in Java programs, so they are singular metrics. The blocked-pair metric defines requirements over pairs of blocks, so it is a pairwise metric. All three are defined over synchronized blocks, so all three are synchronization metrics.

"We selected eight coverage metrics for use in our study, focusing on well-known coverage metrics while also ensuring that we considered every possible combination of our two key properties. We list the metrics selected in Table II."

"We concentrated on metrics that generate modest numbers of test requirements, as this makes achieving high levels of coverage feasible in a reasonable time. Thus, coverage metrics that produce very large numbers of test requirements are not included in this study."

Test suite construction: "We used two methods of test suite construction: random selection and greedy test suite reduction. In random selection, test suites are constructed by randomly selecting test executions to construct test suites of specified sizes. In greedy selection, test suites are constructed to achieve maximum achievable coverage using a small number of test executions."

B. Dependent Variables
"We measure three dependent variables computed over generated test suites: coverage achieved, test suite size, and fault detection effectiveness."

Coverage achieved: for a given program and metric M, the coverage of a test suite S is
(number of test requirements of M covered by S) / (number of test requirements of M).

"We construct test executions while holding random test generation parameters constant (see Section III-B), because different parameters can result in covering different requirements. However, for the purpose of greedy test suite construction, we define maximum achievable coverage as the number of requirements that can be covered for a specific set of test generation parameters."

Test suite size: "Test suite size is the number of test executions in the test suite, and estimates testing cost."

Fault detection effectiveness: "When computing the fault detection effectiveness of the testing process, we use concurrent mutation operators with correct objects (see Section III-B1)."
"We then compute the fault detection effectiveness of a testing approach as the number of mutants killed/detected. When computing fault detection effectiveness for objects that contain known faults, detection of the fault is treated as success, and failure to detect the fault is treated as failure."

Part B Experiment Setup
Conducting our experiment requires us to:
(1) generate mutants for programs without faults;
(2) conduct a large number of random test executions;
(3) for each execution, record the requirements covered for all metrics and whether a fault is detected;
(4) perform resampling over executions to construct test suites (by random selection and by greedy selection);
(5) measure the resulting coverage and fault detection effectiveness of each test suite.

A. Mutant Generation
"We wished to study fault detection in the presence of many diverse fault types, which is not possible when using single-fault programs."

Note: the study therefore used nine mutation operators to diversify the seeded faults. Mutants were discarded if they:
(1) did not fail for any generated test execution;
(2) were killed by every test execution;
(3) were malformed, for example resulting in code that could not be executed.

"We list the final number of mutants used in Table III."

B. Test Generation and Execution
"We used a randomized test case generation approach to avoid bias that might result from using a directed test case generation approach."

"Our approach selects an arbitrary test input and generates a large number of test executions by executing a target program on the test input with varying random delays inserted at shared resource accesses and synchronization operations."

"We control two parameters of this approach: the probability that a delay will be inserted at each shared resource access or synchronization operation (0.1, 0.2, 0.3, and 0.4), and the maximum length of the delay to be inserted (5 msec, 10 msec.). We used these controls because previous work indicates that they can impact the effectiveness of the testing process. Larger or finer grained delays and probabilities did not yield significantly different results. In addition to the twelve random scheduling techniques, we ran test executions without inserting any delay noise."

"We began by estimating the number of test executions E required to achieve maximum coverage for all eight coverage metrics used. This was done by executing the original object for several hours and recording the rate of coverage increase for each metric. For each object, we required either 1000 or 2000 test executions. Following this, for each parameter setting we conducted E executions for each mutant (for objects with mutants) or each object program (for objects without mutants). During each execution, we recorded
(1) the test requirements covered for each coverage metric studied;
(2) whether a fault was detected."
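The delay-noise approach quoted above can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the class NoiseInjector, the functions run_execution and run_all_settings, and the toy shared-counter program are all hypothetical names introduced here. The parameter grid uses only the probability and delay values listed in the notes, plus one run with no noise.

```python
import random
import threading
import time
from itertools import product

# Values quoted in the notes; combining them into a grid is an assumption.
DELAY_PROBABILITIES = (0.1, 0.2, 0.3, 0.4)
MAX_DELAYS_MS = (5, 10)


class NoiseInjector:
    """Hypothetical helper: sleeps for a random time, with a given
    probability, at each instrumented access point."""

    def __init__(self, probability, max_delay_ms, seed=None):
        self.probability = probability
        self.max_delay_ms = max_delay_ms
        self._rng = random.Random(seed)
        self._lock = threading.Lock()  # guards the shared random generator

    def at_access(self):
        """Call before a shared access or synchronization operation.
        Returns the injected delay in milliseconds (0.0 if none)."""
        with self._lock:
            inject = self._rng.random() < self.probability
            delay_ms = self._rng.uniform(0, self.max_delay_ms) if inject else 0.0
        if inject:
            time.sleep(delay_ms / 1000.0)
        return delay_ms


def run_execution(injector, increments=5, workers=2):
    """One test execution of a toy program with an unprotected counter."""
    state = {"counter": 0}

    def worker():
        for _ in range(increments):
            injector.at_access()            # noise before the shared read
            observed = state["counter"]
            injector.at_access()            # noise before the shared write
            state["counter"] = observed + 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    expected = increments * workers
    return {
        "probability": injector.probability,
        "max_delay_ms": injector.max_delay_ms,
        "final_count": state["counter"],
        # Program-specific assertion: a lost update counts as a detected fault.
        "fault_detected": state["counter"] != expected,
    }


def run_all_settings(seed=0):
    """Runs one execution per noise setting, plus one run without noise."""
    settings = list(product(DELAY_PROBABILITIES, MAX_DELAYS_MS)) + [(0.0, 0)]
    return [run_execution(NoiseInjector(p, d, seed)) for p, d in settings]
```

In the study, each setting is run E times per object or mutant and the covered requirements are recorded as well; the sketch records only the fault outcome, because instrumenting coverage requirements depends on the metric.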
"We recorded an execution as detecting a fault if
(1) an uncaught exception is thrown by the program;
(2) the program deadlocks, determined by checking whether execution time is exceptionally long;
(3) a program-specific assertion is violated."

C. Data Collection
After the steps above we know, for each execution, (1) which test requirements are covered for each coverage metric and (2) whether the program failed.

"Using this information we can, via random resampling, construct test suites of varying sizes and levels of coverage. Ideally, we would like to construct test suites encompassing all possible combinations of size and coverage. Unfortunately, as coverage and size tend to be highly correlated this is impossible; small test suites with high coverage are extremely rare in practice. We instead generated, for each combination of object and coverage metric, 100,000 test suites ranging in size from 1 to the maximum size via random sampling of executions. This results in a set of test suites with increasing size and, within each level of size, varying coverage. These test suites are used to address RQ1."

Note: to address RQ2, 100 test suites achieving maximum coverage were selected for each metric, together with ...

"Selecting a test suite for a single-fault program is straightforward: we have one set of executions over the program, and we resample from this set to construct test suites. Each test suite becomes a data point for analysis, having an associated level of coverage, size, and fault detection result (killed/not killed)."

"The construction of test suites for objects using mutation generation is more complex."
"Each mutant differs in the synchronization primitives present, and thus we cannot replicate a sequence of interleavings (i.e., run the same test execution) across all mutants. Therefore, when constructing test suites for objects with mutants, we began by generating 100,000 separate test suites for each mutant. To compute the fault detection effectiveness of combinations of coverage levels and sizes across mutants, we randomly selected a mutant and a test suite associated with that mutant. Following this, for all remaining mutants we randomly selected test suites with the same (or as similar as possible) level of coverage and size, and computed the average coverage and size (which may vary slightly across mutants) and the number of mutants detected. These aggregated values become a data point for analysis. We repeated this cross-mutant selection 100,000 times."

Part C Threats to Validity
"We conducted our study using only Java programs with standard synchronization operations. These programs are relatively small but we believe that our results are at least generalizable to the class of programs concurrent testing research focuses on."

"For concurrent coverage metrics, it is difficult to accurately determine satisfiable requirements."

"The random testing technique we use is implemented in-house, but we have attempted to match the behavior of other random testing techniques by constructing a general technique and varying the parameters of probability and delay."

"We used mutation analysis to measure testing effectiveness for some objects."
"Our seeded faults are designed to mimic actual concurrent faults, and of course are indeed faults, but the relationship between faults generated by concurrent mutation operators and real concurrency faults has not been thoroughly investigated. Nevertheless, the results for mutation-based objects and objects with real faults are similar."

For each object, the study constructed from 1 to 88 faults and 100,000 test suites per coverage metric. More mutants, faults, or test suites could in theory alter the conclusions, although larger numbers are not expected to change the results.

IV. RESULTS AND ANALYSIS
"Our analyses are designed to study how each coverage metric impacts fault detection effectiveness. Towards RQ1, we visualized the pairwise relationship between variables; measured the correlation between coverage, size, and fault detection effectiveness; and performed linear regression to better understand how both coverage and size contribute to fault detection effectiveness."
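The RQ1 analysis described above (correlation, then linear regression with both coverage and size as predictors) can be sketched as follows. This is only an illustration under stated assumptions: the notes do not say which correlation coefficient the study used, so Pearson's is assumed here; the function names pearson and fit_two_predictors and the sample data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fit_two_predictors(coverage, size, detected):
    """Ordinary least squares for
    detected = intercept + b_cov * coverage + b_size * size,
    solved in closed form from the centered normal equations."""
    n = len(detected)
    mc, ms, md = sum(coverage) / n, sum(size) / n, sum(detected) / n
    scc = sum((c - mc) ** 2 for c in coverage)
    sss = sum((s - ms) ** 2 for s in size)
    scs = sum((c - mc) * (s - ms) for c, s in zip(coverage, size))
    scd = sum((c - mc) * (d - md) for c, d in zip(coverage, detected))
    ssd = sum((s - ms) * (d - md) for s, d in zip(size, detected))

    det = scc * sss - scs ** 2
    if det == 0:
        raise ValueError("coverage and size are perfectly collinear")

    b_cov = (sss * scd - scs * ssd) / det
    b_size = (scc * ssd - scs * scd) / det
    intercept = md - b_cov * mc - b_size * ms
    return intercept, b_cov, b_size


if __name__ == "__main__":
    # Hypothetical data points: one entry per test suite.
    coverage = [0.20, 0.35, 0.50, 0.55, 0.80, 0.90]
    size = [1, 3, 4, 8, 10, 15]
    mutants_killed = [2, 5, 7, 8, 12, 14]

    print("corr(coverage, killed):", pearson(coverage, mutants_killed))
    print("corr(size, killed):", pearson(size, mutants_killed))
    print("regression:", fit_two_predictors(coverage, size, mutants_killed))
```

Because coverage and size tend to be highly correlated, the two coefficients should be read together: a clearly positive coverage coefficient with size held in the model is the kind of evidence RQ1 asks for.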
