Benchmarking Archives

August 10, 2006

Introduction to Benchmarking

As part of my Ph.D., I developed a theory of benchmarking. I have a paper that gives a good short description. In a nutshell, a benchmark is an indicator of the level of maturity of the paradigm in a scientific community. (Yes, I do mean a Kuhnian scientific paradigm.) Moreover, creating a benchmark (or any other kind of standard) can increase the maturity of the community.

Ongoing work in this area involves studying past and present benchmarking efforts, as well as developing new benchmarks in areas such as requirements engineering, testing, and static analysis.

Indicators of group cohesiveness and success

One of the claims of the theory of benchmarking is that the process of creating and deploying a benchmark can increase the level of cohesiveness of the community. It stands to reason that we should be able to measure this cohesiveness and track it over time, in particular, as the rate of benchmarking use increases.

I think it would be worthwhile to examine the sociology and social psychology literature for these measures of group cohesiveness.

Here are some places to start looking. (These suggestions come from Daniela Damian.)

McGrath and Hollingshead (1994): Groups Interacting with Technology: Ideas, Evidence, Issues and An Agenda.

McGrath (1984): Groups: Interaction and Performance.

Short, Williams and Christie (1976): The Social Psychology of Telecommunications, Wiley.
This book has an instrument for characterising interpersonal relationships. Daniela used it in a study of computer-mediated requirements negotiations.

Another place to look is the Empirical Software Engineering journal and community... but it depends again on what you are after.

August 28, 2006

Refining the Theory

The theory of benchmarking was constructed by examining a number of computer benchmarks in the literature. These include TREC Ad Hoc Task, TPC-A™, SPEC CPU2000, Calgary Corpus and Canterbury Corpus, Penn Treebank, the xfig benchmark for program comprehension tools, and the C++ Extractor Test Suite (CppETS). The last two were developed by me. The others are described in differing amounts of detail in the literature. The next step in refining the theory would be to compare it against a benchmark that was not considered in the formulation of the theory.

NIST has been involved in developing a number of benchmarks. So have a number of companies. Research on this problem would involve interviewing people involved with the benchmark development and deployment, as well as examining documents such as emails and meeting minutes. For benchmarks currently under development or in use, it would also involve attending technical meetings.

WebETS 1.0

Web site evolution is an emerging area of research. It is primarily concerned with developing software tools to support maintenance and evolution of web sites, which usually involves taking standard analyses, metrics, visualizations, etc. and applying them to web applications. In the same manner as with other types of source code, the first step is to extract facts to be used later for analysis and presentation.
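To make "extracting facts" concrete, here is a minimal sketch in Python. It is only an illustration, not any particular extractor: the class name, the relation names ("links_to", "embeds"), and the triple format are all assumptions of mine. It pulls link and image facts out of a single HTML page using the standard library parser.

```python
from html.parser import HTMLParser


class LinkFactExtractor(HTMLParser):
    """Collects (source_page, relation, target) triples from one HTML page.

    The relation names are hypothetical; a real extractor would follow
    whatever schema its downstream tools expect.
    """

    def __init__(self, page_name):
        super().__init__()
        self.page_name = page_name
        self.facts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.facts.append((self.page_name, "links_to", attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.facts.append((self.page_name, "embeds", attrs["src"]))


def extract_facts(page_name, html_text):
    """Parse one page and return its facts in document order."""
    parser = LinkFactExtractor(page_name)
    parser.feed(html_text)
    parser.close()
    return parser.facts


sample = '<p><a href="about.html">About</a> <img src="logo.png"></p>'
facts = extract_facts("index.html", sample)
```

A real web application adds server-side code, templates, and scripts on top of this, which is exactly where extractors are likely to differ.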

WebETS (Web Extractor Test Suite) would follow the same pattern that was set by CppETS and JETS. The goal is to create a series of programs or code snippets to test and compare the capabilities of different fact extractors. A paper previously written by Holger Kienle would serve as a starting point for this research.

Holger Kienle and Susan Elliott Sim. Towards a Benchmark for Web Site Extractors: Call for Participation, Seventh European Conference on Software Maintenance and Reengineering, Benevento, Italy, pp. 82-90, 26-28 March, 2003.
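One way to compare extractors on a test case is to check the facts each one reports against a hand-built list of expected facts. The sketch below uses plain precision and recall over sets of fact triples. That scoring scheme is my assumption for illustration, not the method used by CppETS or proposed in the paper above, and the function and field names are hypothetical.

```python
def score_extractor(expected, reported):
    """Compare reported facts against the expected facts for one test case.

    Both arguments are iterables of hashable fact tuples.
    """
    expected, reported = set(expected), set(reported)
    correct = expected & reported
    return {
        # Share of reported facts that are actually right.
        "precision": len(correct) / len(reported) if reported else 0.0,
        # Share of expected facts that the extractor found.
        "recall": len(correct) / len(expected) if expected else 0.0,
        "missed": sorted(expected - reported),
        "spurious": sorted(reported - expected),
    }


expected = [
    ("index.html", "links_to", "about.html"),
    ("index.html", "embeds", "logo.png"),
]
reported = [
    ("index.html", "links_to", "about.html"),
    ("index.html", "links_to", "missing.html"),
]
result = score_extractor(expected, reported)
```

Keeping the missed and spurious facts alongside the two numbers matters, because the interesting part of a benchmark result is usually which cases an extractor got wrong.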

Testing Benchmark

Testing is all about testable results, testable properties of programs, right? So it makes sense that we should be able to compare the performance of different testing tools on the same program.

Of course, it isn't that simple. Testing tools are quirky with respect to what kind of software artifacts, source languages, and descriptions they accept. Furthermore, they are designed to tackle different aspects of the testing problem: not just running and tabulating test cases, but also generating test cases, ordering test cases, measuring coverage, etc.

Lihua Xu has started working on a testing benchmark, but we've only scratched the surface. Here's a technical report that we wrote together.

Lihua Xu and Susan Elliott Sim, "Towards a Benchmark for Test Generation Techniques," Institute for Software Research, University of California, Irvine, Irvine, CA, USA, Technical Report #UCI-ISR-06-9, June 2006.

Applying Testing Techniques to CppETS

CppETS (C++ Extractor Test Suite) was built from programs written to stress test the capabilities of fact extractors. It would be interesting to take an approach that was more principled, more informed by testing theory, to evaluate and refine CppETS.

CppETS 2.0

CppETS 1.0 and 1.1 have been developed and successfully deployed.

CASCON 2001
IWPC 2002

Susan Elliott Sim, Richard C. Holt, Steve Easterbrook. "On Using a Benchmark to Evaluate C++ Extractors." Proceedings of the Tenth International Workshop on Program Comprehension, Paris, France, pp. 114-123, 26-29 June, 2002.

Naturally, the next step is to develop CppETS 2.0.

Here are some ideas for improvement that came from our experience working with CppETS.

  • Interpretation Framework for Results: The results are hard to interpret, and therefore hard for someone to use to select a fact extractor. A user would need to understand his or her requirements very well and to look very closely at how each fact extractor performed on particular test cases. It would be nice if we could have a search form where a user could say what they wanted to use the extractor for and what features were important, and the framework could return a short list.
  • Schema Zoo: It would be nice to collect the schemas for various fact extractors (and downstream analysis tools) and permit evaluation and selection of a fact extractor analytically. I have a collection of schemas and I think other people do too (e.g. Jean-Marie Favre and his MDE site).
  • Data Requirements for Downstream Analysis: An examination of these requirements could start with the aforementioned schema zoo. Other factors such as accuracy would also need to be considered. For example, is an extractor that is 97% accurate good enough? Are some errors more serious than others?
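The question of whether some errors are more serious than others can be made concrete by weighting each kind of fact by how much a downstream analysis depends on it. This is only a sketch: the fact kinds and the weights below are made up for illustration, and choosing real weights is the hard part.

```python
def weighted_accuracy(outcomes, severity):
    """Accuracy where each fact counts in proportion to its severity weight.

    outcomes: list of (fact_kind, is_correct) pairs.
    severity: dict mapping fact_kind to a weight; unknown kinds count as 1.0.
    """
    total = sum(severity.get(kind, 1.0) for kind, _ in outcomes)
    if total == 0:
        return 0.0
    earned = sum(severity.get(kind, 1.0) for kind, ok in outcomes if ok)
    return earned / total


# Hypothetical run: one function call fact is wrong, all comment facts are right.
outcomes = [("call", True)] * 3 + [("call", False)] + [("comment", True)] * 4

# Hypothetical weights: a call graph analysis cares far more about calls.
severity = {"call": 3.0, "comment": 1.0}

plain = weighted_accuracy(outcomes, {})           # every fact weighted 1.0
weighted = weighted_accuracy(outcomes, severity)  # call errors cost more
```

With these made-up numbers the plain accuracy is 0.875 but the weighted accuracy drops to 0.8125, so the same extractor can look acceptable or not depending on what the downstream tool needs.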

About Benchmarking

This page contains an archive of all entries posted to Susan's Idea Jar in the Benchmarking category. They are listed from oldest to newest.

