Program performance is always a concern, even in this era of high-performance hardware. This article, the second in a two-part series, covers the statistics of benchmarking and offers a framework you can use to benchmark Java™ code ranging from self-contained microbenchmarks to code that calls a full application.
Part 1 of this two-part article guides you around the many pitfalls associated with benchmarking Java code. This second installment covers two distinct areas. First, it describes some elementary statistics that are useful for coping with the measurement variations inevitably seen in benchmarking. Second, it introduces a software framework for doing benchmarking and uses it in a series of worked examples to illustrate important points.
Statistics
It would be convenient if you only needed to perform one execution-time measurement and could use a single measurement to compare the performance of different code. Sadly, the popularity of this approach does not trump its invalidity — there are just too many sources of variation to let you trust a single measurement. In Part 1, I mentioned clock resolution, complicated JVM behavior, and automatic resource reclamation as noise sources; these are just a few of the factors that can randomly or systematically bias benchmarks. Steps can be taken to try to mitigate some of them; if you understood them well enough, you might even be able to perform deconvolution (see Resources). But these remedies are never perfect, so ultimately you must cope with the variations. The only way to do this is to take many measurements and use statistics to make reliable conclusions. Ignore this section at your own peril, for "He who refuses to do arithmetic is doomed to talk nonsense" (see Resources).
|
I'll present just enough statistics to address these common performance questions:
- Does task A execute faster than task B?
- Is this conclusion reliable, or was it a measurement fluke?
If you make several execution-time measurements, the first statistic that you are likely to compute is a single number that summarizes its typical value (see Resources for links to Wikipedia's definitions of the statistical concepts in this article). The most common such measure is the arithmetic mean, commonly known as the mean or the average. It is the sum of all the measurements divided by the number of measurements:
meanx = Summationi=1,n(xi) / n
A supplement to this article available on a companion Web site (see Resources) includes a discussion of other measures besides the mean; see its Alternatives to the mean section.
Using the mean of several measurements to quantify performance is certainly more accurate than using a single measurement, but it can be insufficient for determining which task executes faster. For example, suppose that the mean of task A's execution time is 1 millisecond and task B's mean is 1.1 milliseconds. Should you automatically conclude that task A is faster than task B? You might not if you also knew that task A's measurements ranged from 0.9 to 1.5 milliseconds, while task B's measurements ranged from 1.09 to 1.11 milliseconds. So, you also need to address the measurement spread.
The most common statistic for describing measurement spread (or scatter) is the standard deviation:
sdx = sqrt{ Sumi=1,n( [xi - meanx]2 ) / n }
How does the standard deviation quantify measurement scatter? Well, it depends on what you know about the measurements' probability density function (PDF). The stronger the assumptions you make, the finer the conclusions you can draw. The article supplement's Relating standard deviation to measurement scatter section explores this in more detail and concludes that in the benchmarking context, a reasonable rule of thumb is that at least 95% of the measurements should lie within three standard deviations of the mean.
So, how can the mean and standard deviation be used to determine which of two tasks is faster? Invoking the above rule of thumb, the easy case is if the means of the two tasks are separated by more than three standard deviations (choose the larger standard deviation of the two). In that case, the task with the smaller mean is clearly faster almost all of the time, as shown in Figure 1:
Figure 1. Means greater than three standard deviations apart imply clearly distinguishable performance
Unfortunately, determinations become trickier the more two tasks overlap (for example, if the means are one standard deviation apart), as Figure 2 illustrates:
Figure 2. Means less than three standard deviations apart imply performance overlaps
Maybe all that you can do is rank two tasks by their means but note how much they overlap and point out the corresponding degree of weakness in your conclusions.
Reliability
The other question to address is how reliable these mean and standard deviation statistics themselves are. Clearly, they are computed from the measurements, so a new set of measurements would likely yield different values. Assume for now that the measurement process is valid. (Note: it may be impossible to measure the "true" standard deviation. See the Standard deviation measurement issues section in the article supplement, which explores this issue.) Then how different would the mean and standard deviation be if the procedure were repeated? Would a new series of measurements yield significantly different results?
The most intuitive way to answer these questions is to construct confidence intervals for the statistics. In contrast to a single computed value (a point estimate) for the statistic, a confidence interval is a range of estimates. A probability p called the confidence level is associated with this range. Most commonly, p is chosen to be 95% and is kept constant during confidence interval comparisons. Confidence intervals are intuitive because their size indicates reliability: short intervals indicate that the statistic is precisely known, while wide intervals indicate uncertainty. For example, if the confidence interval for the mean execution of task A is [1, 1.1] milliseconds, and that for task B is [0.998, 0.999] milliseconds, then B's mean is known with more certainty than A's and is also distinguishably smaller (at the confidence level). See the Confidence intervals section in the article supplement for more discussion.
Historically, confidence intervals could only be easily calculated for a few common PDFs (such as the Gaussian) and simple statistics (such as the mean). In the late 1970s, however, a technique called bootstrapping was developed. It is the best general technique for producing confidence intervals. It works for any statistic, not just simple ones like the mean. Furthermore, the nonparametric form of bootstrapping makes no assumptions about the underlying PDF. It never yields unphysical results (for example, negative execution times for a confidence interval lower bound), and it can yield more narrow and accurate confidence intervals than if you make wrong assumptions about the PDF (such as that it is Gaussian). Details on bootstrapping are beyond this article's scope, but the framework I discuss below includes a class named Bootstrap that performs the calculations; see its Javadocs and source code for more information (see Resources).
In summary, you must:
- Perform many benchmark measurements.
- From the measurements, compute the mean and standard deviation.
- Use these statistics to decide that two tasks are either clearly distinguished in speed (means greater than three standard deviations apart), or else that they overlap.
- Compute confidence intervals for the mean and standard deviation to indicate how reliable they are.
- Rely on bootstrapping as the best way to compute confidence intervals.
discuss this topic to forum
