Writing a Java Micro-benchmark
Originally published: 2008-11-16
Last updated: 2014-07-10
A recent issue from the real world: a home-grown templating engine that used String.replaceAll() to generate output. Was it causing excessive memory and CPU load, or was it a harmless example of over-engineering? Testing in situ, in the running application server, would not be easy: how would we differentiate the effect of this code from everything else involved in building a page? The answer: use a micro-benchmark to quantify the difference between the two implementation options.
Caveat: This article was originally written circa 2008, using JDK 1.5 and a single-core IBM T-42 as a test platform. While the general points remain valid, there have been many improvements in both machines and JDK in the intervening years; all numbers should be taken with a large grain of salt.
What is a Micro-Benchmark?
A benchmark is any program that attempts to measure computing performance. Traditional “macro” benchmarks exist as a means to compare different platforms in specific areas: for example, the Whetstone benchmark focused on floating-point computation, so was used (and often abused) to compare platforms for scientific number-crunching applications. By comparison, a micro-benchmark attempts to measure the performance of a small piece of code — often a single function call — on a single platform. Its results are typically used to choose between competing implementations that provide the same or similar functionality.
For example, here's a simple micro-benchmark that measures the time taken by String.replaceAll():
public static void main(String[] argv) throws Exception
{
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s = "this XXX a test".replaceAll("XXX", " is ");
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("elapsed time = " + elapsed + "ms");
    System.out.println((elapsed * 1000.0) / 1000000 + " microseconds per execution");
}
That's all there is: a loop, an operation, and some timing code. I run this on my laptop, and it says that 1,000,000 iterations takes somewhere around 2500 milliseconds — or about 2.5 microseconds per iteration.
So what does this tell you? Well, one thing that it can tell you is that calling replaceAll() isn't that expensive in the grand scheme. Certainly far less expensive than a database call, or writing data to a socket, or really, anything else that is likely to happen when executing a JSP. However, perhaps we can do better with concatenation — after all, why pass up a chance for optimization?
public static void main(String[] argv) throws Exception
{
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s = "this " + "is" + " a test";
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("elapsed time = " + elapsed + "ms");
    System.out.println((elapsed * 1000.0) / 1000000 + " microseconds per execution");
}
Running this on my laptop, I got 10ms total execution time! But do you see the bug?
Pitfall #1: Make sure you're measuring what you think you are
The bug, of course, is that we're concatenating three string literals, and the compiler does this for us. So what we're really testing is the speed of assignment. Changing the code so that it actually does a concatenation on each loop, as below, we get a more reasonable time of 350-375 milliseconds per run.
public static void main(String[] argv) throws Exception
{
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s1 = "this ";
        String s2 = "is";
        String s3 = " a test";
        String s = s1 + s2 + s3;
    }
    long elapsed = System.currentTimeMillis() - start;
    System.out.println("elapsed time = " + elapsed + "ms");
    System.out.println((elapsed * 1000.0) / 1000000 + " microseconds per execution");
}
At this point, you can look at the numbers and make a snap decision: is an extra two seconds per million iterations worth rewriting your code? Sometimes it is: a program that is CPU-bound and performing an enormous number of string manipulations might indeed benefit. But to back up that decision, you need to run your actual application under a profiler, to see exactly what percentage of time is spent doing replaces.
Pitfall #2: Is this a reasonable test?
One of the chief limitations of a micro-benchmark is that it's running in an environment that almost certainly doesn't match production. My laptop, for example, has a single-core processor and runs the Sun 1.5 JDK. There's no way that it should be used to test thread contention.
But even with the replaceAll() test, there is a subtle issue: my laptop should never experience garbage collection delays. The Sun JVM uses a generational garbage collector, and none of the objects in this test are large enough to leave Eden. The real-world example, however, was running on a JRockit JVM and processing strings that were 100-200 characters long.
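As a rough sketch of what that might look like (the class name, padding text, and sizes here are my own invention, not taken from the original application), the first benchmark can be rerun with a template in that size range:

// hypothetical variation on the first benchmark, with a padded template
public class Micro1Padded
{
    public static void main(String[] argv) throws Exception
    {
        // build a template of roughly 150 characters surrounding the token
        StringBuilder sb = new StringBuilder(100);
        for (int ii = 0 ; ii < 3 ; ii++)
            sb.append("some representative text ");
        String template = sb + "XXX" + sb;

        long start = System.currentTimeMillis();
        for (int ii = 0 ; ii < 1000000 ; ii++)
        {
            String s = template.replaceAll("XXX", " is ");
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("elapsed time = " + elapsed + "ms");
        System.out.println((elapsed * 1000.0) / 1000000 + " microseconds per execution");
    }
}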
Changing the length of the strings is simple, and the -verbose:gc invocation argument will tell us if we're switching from minor to major collections. Still, there's nothing like running in the actual environment.
And dropping the benchmark code into a JSP is pretty easy:
<%@page contentType="text/html"%>
<%@page pageEncoding="UTF-8"%>
<html>
<head>
    <title>Micro-Benchmark</title>
</head>
<body>
    <%
    String testString = // something long
    long start = System.currentTimeMillis();
    for (int ii = 0 ; ii < 1000000 ; ii++)
    {
        String s = testString.replaceAll("XXX", " is ");
    }
    long elapsed = System.currentTimeMillis() - start;
    %>
    <p> replaceAll: elapsed time = <%=elapsed%> ms,
        <%=(elapsed * 1000.0) / 1000000%> microseconds per execution
    </p>
</body>
</html>
Pitfall #3: Elapsed time versus Execution time
Running a micro-benchmark on an application server increases the potential for mis-measurement, because the benchmark code isn't running in isolation. Load elsewhere in the system means that your measured execution time will be greater than the actual CPU time consumed by the test.
In some cases, you may be able to measure actual CPU time: JDK 1.5 introduced the java.lang.management package and ThreadMXBean; this class will give you both total and user-mode CPU time consumed by a thread. However, this feature is not supported by all JVMs, as it relies on the underlying operating system to make the information available.
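Here's a minimal sketch of that approach (the class name and structure are my own; the Micro1withThreadTiming example listed at the end of this article does something similar):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class Micro1withCpuTime
{
    public static void main(String[] argv) throws Exception
    {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        if (!threadBean.isCurrentThreadCpuTimeSupported())
        {
            System.out.println("this JVM can't report per-thread CPU time");
            return;
        }

        // both values are reported in nanoseconds
        long startCpu = threadBean.getCurrentThreadCpuTime();
        long startUser = threadBean.getCurrentThreadUserTime();

        for (int ii = 0 ; ii < 1000000 ; ii++)
        {
            String s = "this XXX a test".replaceAll("XXX", " is ");
        }

        long cpuMillis = (threadBean.getCurrentThreadCpuTime() - startCpu) / 1000000;
        long userMillis = (threadBean.getCurrentThreadUserTime() - startUser) / 1000000;
        System.out.println("total CPU time = " + cpuMillis + "ms");
        System.out.println("user-mode time = " + userMillis + "ms");
    }
}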
It's also overkill: the goal is to quantify the effect of a particular piece of code, not develop guarantees about its execution time. Instead, use a simple measurement strategy, execute the code multiple times, and keep track of its average execution time. Also, note that I said “increases” the potential for mis-measurement, rather than “introduces”: any computer has multiple competing processes (a fact that becomes painfully obvious when your anti-virus kicks in during a gaming session).
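The sketch below shows that averaging strategy in its simplest form (the class name, pass count, and iteration count are my own arbitrary choices): time several passes of the same loop and report the average, which also smooths out interference from whatever else the machine is doing.

public class AveragedTiming
{
    private static final int PASSES = 5;
    private static final int REPS = 1000000;

    public static void main(String[] argv) throws Exception
    {
        long total = 0;
        for (int pass = 0 ; pass < PASSES ; pass++)
        {
            long start = System.currentTimeMillis();
            for (int ii = 0 ; ii < REPS ; ii++)
            {
                String s = "this XXX a test".replaceAll("XXX", " is ");
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("pass " + pass + ": " + elapsed + "ms");
            total += elapsed;
        }
        // report the mean over all passes
        System.out.println("average: " + (total / PASSES) + "ms per pass");
    }
}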
Pitfall #4: Cleverness
Looking at the examples so far, you may be having visions of refactoring: after all, they differ by a single line of code. Perhaps you are thinking of a micro-benchmark framework, using a functor object to encapsulate the code being tested. The risk of acting on these thoughts is that you might end up measuring the boilerplate rather than the code to be benchmarked.
Which is not to say that you should write separate programs for each test. In fact, it may be better to combine your tests in a single program, as it will allow you to “warm up” the JVM (a topic that will be discussed in more detail below). As long as you keep the benchmarked code and main loop together in a function, you're not going to skew the results. Keeping the timing markers close to the loop matters less: the overhead of a function call or switch statement will disappear next to a million iterations of your test code. For example:
public class Micro2
{
    private static final int REPS = 1000000;

    private static final String PART1 = "for a real test, make this big";
    private static final String PART2 = "so that we can see GC effects";
    private static final String REGEX = "XXX";
    private static final String REPL = " -- ";
    private static final String TEMPLATE = PART1 + REGEX + PART2;

    private enum CodeType { concat, replace };

    public static void main(String[] argv) throws Exception
    {
        for (CodeType run : CodeType.values())
        {
            long start = System.currentTimeMillis();
            switch (run)
            {
                case concat :
                    executeConcat(PART1, PART2, REPL);
                    break;
                case replace :
                    executeReplace(TEMPLATE, REGEX, REPL);
                    break;
                default :
                    // do nothing
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(run + ": elapsed time = " + elapsed + "ms ("
                               + (elapsed * 1000.0) / REPS + "microseconds per execution)");
        }
    }

    private static void executeReplace(String template, String regex, String repl)
    {
        for (int ii = 0 ; ii < REPS ; ii++)
        {
            String s = template.replaceAll(regex, repl);
        }
    }

    private static void executeConcat(String part1, String part2, String repl)
    {
        for (int ii = 0 ; ii < REPS ; ii++)
        {
            String s = part1 + repl + part2;
        }
    }
}
Pitfall #5: Hotspot
Here's an interesting experiment: take the first benchmark program above, and change the loop count:
Loop Count | Execution Time
-----------|---------------
1,000      | 50 ms
10,000     | 70 ms
100,000    | 310 ms
1,000,000  | 2,500 ms
Intuitively, you'd expect a linear increase in execution time, but you don't see that (at least until the jump from 100,000 to 1,000,000 loops). There are a number of reasons for this discrepancy.
First is that Java loads classes dynamically: the first iteration pays the penalty for loading the java.util.regex classes. Second, and more important, is that the JVM starts by interpreting the Java bytecodes for the program, but after a certain point the Hotspot compiler identifies code that's frequently called and compiles it into native machine code. You can watch this in action by specifying -XX:+PrintCompilation on the Java command line:
bin, 504> java -XX:+PrintCompilation com.kdgregory.example.benchmark.Micro2
  1   b   java.lang.String::charAt (33 bytes)
  2   b   java.lang.Math::max (11 bytes)
  3   b   java.lang.String::hashCode (60 bytes)
  4   b   java.lang.String::indexOf (151 bytes)
  5*  b   java.lang.System::arraycopy (0 bytes)
  6   b   java.lang.String::getChars (66 bytes)
  7   b   java.lang.AbstractStringBuilder::append (60 bytes)
  8   b   java.lang.StringBuilder::append (8 bytes)
  9   b   java.lang.String::<init> (80 bytes)
 10   b   java.lang.AbstractStringBuilder::<init> (12 bytes)
 11   b   java.lang.AbstractStringBuilder::expandCapacity (51 bytes)
 12   b   java.lang.StringBuilder::toString (17 bytes)
 13   b   java.lang.String::valueOf (14 bytes)
 14   b   java.lang.String::toString (2 bytes)
 15   b   java.lang.StringBuilder::<init> (18 bytes)
  1%  b   com.kdgregory.example.benchmark.Micro2::executeConcat @ 5 (39 bytes)
concat: elapsed time = 661ms (0.661microseconds per execution)
This process, of identifying hot code and compiling it, will take time. And until all hotspots have been identified and compiled, your timings will misrepresent reality. In most cases, simply using a large number of iterations will bring the timings in line: the time consumed by interpreted code early in the run will be “noise” compared to the time consumed by compiled code. Alternatively, you can “warm up” the JVM by running some number of iterations before your actual test (my reading of the various Sun white-papers indicates that 10,000 iterations is sufficient).
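As a sketch of the warm-up approach (this is my own stripped-down version in the spirit of the Micro3 example listed below; the 10,000-iteration figure is just the rule of thumb mentioned above):

public class WarmupExample
{
    private static final int WARMUP_REPS = 10000;
    private static final int REPS = 1000000;

    public static void main(String[] argv) throws Exception
    {
        // warm-up pass: give Hotspot a chance to find and compile the hot method
        // before we start the clock
        executeReplace(WARMUP_REPS);

        long start = System.currentTimeMillis();
        executeReplace(REPS);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("elapsed time = " + elapsed + "ms ("
                           + (elapsed * 1000.0) / REPS + " microseconds per execution)");
    }

    private static void executeReplace(int reps)
    {
        for (int ii = 0 ; ii < reps ; ii++)
        {
            String s = "this XXX a test".replaceAll("XXX", " is ");
        }
    }
}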
One other feature of Hotspot — and one that is particularly damaging to micro-benchmarks — is dead code removal. If the JVM determines that a particular piece of code is not actually doing anything useful, it will simply stop executing that code. Using the -server JVM argument increases the aggressiveness of dead-code removal, as shown by the following example (which attempts to benchmark an empty loop):
bin, 527> java -XX:+PrintCompilation com.kdgregory.example.benchmark.Micro4
  1   b   java.lang.String::charAt (33 bytes)
  2   b   java.lang.Math::max (11 bytes)
  3   b   java.lang.String::hashCode (60 bytes)
  1%  b   com.kdgregory.example.benchmark.Micro4::main @ 21 (94 bytes)
pass 0 elapsed time = 70ms
pass 1 elapsed time = 80ms
pass 2 elapsed time = 70ms
pass 3 elapsed time = 70ms
pass 4 elapsed time = 70ms

bin, 528> java -server -XX:+PrintCompilation com.kdgregory.example.benchmark.Micro4
  1%      com.kdgregory.example.benchmark.Micro4::main @ 21 (94 bytes)
pass 0 elapsed time = 10ms
  2%      com.kdgregory.example.benchmark.Micro4::main @ 21 (94 bytes)
pass 1 elapsed time = 20ms
pass 2 elapsed time = 0ms
pass 3 elapsed time = 0ms
pass 4 elapsed time = 0ms
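One common defense (my own sketch, not one of the article's numbered examples) is to make the loop produce a value that is actually used afterwards. This doesn't stop Hotspot from simplifying the arithmetic, but it does prevent the loop body from being discarded as completely useless:

public class Micro4Guarded
{
    public static void main(String[] argv) throws Exception
    {
        long sink = 0;
        for (int pass = 0 ; pass < 5 ; pass++)
        {
            long start = System.currentTimeMillis();
            for (int ii = 0 ; ii < 100000000 ; ii++)
            {
                sink += ii;     // the loop now computes something...
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("pass " + pass + " elapsed time = " + elapsed + "ms");
        }
        // ...and the result is printed, so the computation can't simply be thrown away
        System.out.println("sink = " + sink);
    }
}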
Conclusions
In a perfect world, micro-benchmarks are the second step of optimization: once you discover where your application is spending its time, you write a micro-benchmark to explore alternatives. However, even without that first step, and despite their many limitations, micro-benchmarks are a useful tool to help quantify code behavior.
Remember, however, that “help” is the operative word; their results must always be evaluated in the context of the environment. Writing a micro-benchmark so that you can say “string concatenation is four times faster than replaceAll()” is meaningless. Not only might the numbers change on a different machine or in a different program, but that speedup will make no difference if your program spends most of its time waiting for a database query.
For More Information
There are a number of examples for this article:
- Micro1a is a naive test of String.replaceAll().
- Micro1withThreadTiming is the same benchmark, measuring CPU time via ThreadMXBean.
- Micro2 is a parameterized micro-benchmark that compares concatenation with replaceAll(), but does not consider Hotspot optimization.
- Micro3 adds a “warmup” to the previous test.
- Micro4 demonstrates Hotspot's dead-code elimination.
Oracle's white-paper gives a high-level description of the Hotspot JVM.
An article by Brian Goetz details the ways that the Hotspot JVM can derail a micro-benchmark.
Google maintains Caliper, a tool for writing micro-benchmarks. Its chief benefit is that it will decide how many repetitions are sufficient for an accurate benchmark. But you still have to create benchmarkable code.
Copyright © Keith D Gregory, all rights reserved