Lies, Damned Lies, and Benchmarks

I have often seen computer benchmarks used outside of any reasonable context, or taken to be meaningful in the wrong context. As if out-of-context use were not bad enough, there seem to be no rules at all to the ways some people cheat at benchmarking.

I think that there should be three simple rules for using benchmarks. The first should be that everything counts: all time should count as time and all code size should count as code size. The second rule should be that one must always give units, both units of time and repetition counts. The final rule should be that benchmarks designed to measure one thing should not be offered as evidence of something else.

Those seem like common sense, but they are not common practice. The problem with benchmark comparisons that do not show units is that Windows systems expect worst-case interrupt or task-switch latencies in seconds or milliseconds, Unix systems expect to see them in milliseconds, real-time Unix variants and other C-based real-time operating systems expect to see them in microseconds or milliseconds, and our Forth hardware/software based systems expect to see them in nanoseconds as we push for picoseconds. With a range covering ten orders of magnitude, units need to be included with the published numbers.

I regularly see people publishing benchmarks in neat columns or rows where part of the time has been subtracted from one row or column but not from the others, or where some row or column is listed in different units than the rest but the units are not shown. I even see people compare benchmarks after changing the number of iterations of the calculation in one row or column without documenting it.
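As a sketch of what honest normalization looks like, the following Python converts each published row to a common unit per iteration before comparing anything. The rows are hypothetical, invented for illustration:

```python
# Hypothetical benchmark rows as published: (label, reported time, unit, iterations).
# Normalizing everything to nanoseconds per iteration makes the comparison honest.

UNIT_NS = {"s": 1e9, "ms": 1e6, "us": 1e3, "ns": 1.0}

def ns_per_iteration(time_value, unit, iterations):
    """Convert a reported time to nanoseconds per single iteration."""
    return time_value * UNIT_NS[unit] / iterations

rows = [
    ("system A", 2.0, "s",  1_000_000),  # 2 s for a million iterations
    ("system B", 2.0, "ms", 1_000),      # same per-iteration time, different units
    ("system C", 0.2, "s",  1_000_000),  # genuinely ten times faster
]

for label, t, unit, n in rows:
    print(f"{label}: {ns_per_iteration(t, unit, n):.0f} ns/iteration")
```

Systems A and B are identical once units and iteration counts are accounted for; published side by side without units, B would look a thousand times faster.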

Forty years ago I read a book entitled "How to Lie with Statistics." It seems to have become the handbook for benchmarking computers. Computer benchmarks primarily serve the purpose of giving people unrealistic ideas for marketing purposes. Beyond the practices that can best be described as cheating on actual benchmarks, many people simply equate megahertz with MIPS (Millions of Instructions Per Second), as this is common marketing practice.

The general public have seen the architecture of processors change radically over the years and have seen the clock rates of the processors increase steadily as well. As processors have become more modern they have used architectural techniques to perform more calculations per clock cycle. As fabrication technologies have become more refined the size of transistors has steadily decreased and their speed has increased, so processor clocks have risen.

Intel's popular 80x86 family of processors provides a good example. The original IBM PC used an 8088 processor, a version of the 8086 restricted to an 8-bit memory bus to reduce cost. This required two memory cycles to perform each 16-bit memory operation. The processor used microcode that required multiple clock cycles to step through for each instruction. The IBM PC's processor was clocked at 4.77 MHz, and with many different instructions requiring different numbers of cycles it was difficult to get an average instruction cycle count except on a specific program. But with the average being greater than 5, the first PC achieved less than one MIPS.
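The arithmetic behind these MIPS figures is simple enough to sketch: a rough MIPS estimate is just the clock rate in MHz divided by the average cycles per instruction (CPI).

```python
def mips(clock_mhz, avg_cycles_per_instruction):
    """Rough MIPS estimate: millions of instructions per second."""
    return clock_mhz / avg_cycles_per_instruction

# Original IBM PC: 4.77 MHz 8088, average well over 5 cycles per instruction.
print(mips(4.77, 5))  # just under 1 MIPS even with a generous CPI of 5
```

The same formula reproduces the later figures in this article: about 3 cycles at 16-33 MHz gives roughly 5-10 MIPS for the 80386.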

The IBM AT machine used the Intel 80286 processor, which had an external 16-bit bus, ran at 6.0 MHz or 8.0 MHz, and used more transistors to overlap instructions in a pipeline. It added new instructions and reduced the average instruction cycle count to about 5, so that it achieved over 1 MIPS. The 80386 saw more transistors, new 32-bit instructions, a 32-bit bus, and deeper pipelines, so that its average instruction cycle count was reduced to about 3 and clock rates ranged from 16 MHz to 33 MHz. This produced machines with 5 to 10 MIPS performance.

The 80486 saw the introduction of some fast on-chip cache memory so that the processor would not be slowed down to the speed of external memory all the time. The average instruction cycle count dropped to about 2 and clock rates increased to 66 MHz, giving about 33 MIPS performance. The 80486 also brought on-chip the floating point hardware that had previously required a separate 80387 chip, which made floating point faster.

The Pentium series processors then saw more and smaller transistors, deeper pipelines, new instructions, and more architectural implementation techniques borrowed from the new RISC processors. The average instruction cycle count dropped to closer to one and the clock rates soared. For the first time MHz could roughly equal MIPS, when programs kept their pipelines filled and stayed in their fast cache memory. From this time on consumers often just equated new MHz clock ratings with MIPS of performance, and this made it easy to market machines with higher clock speeds for more money than those with lower clock speeds.

Some consumers however noticed that their old programs, and even some of the new ones optimized to use the new Pentium instructions, did not speed up at the ratios their clocks seemed to suggest. The reason was that real programs cannot keep the pipelines filled or stay within the fast cache memories. As the pipelines got deeper and deeper, the penalties for certain instruction sequences creating pipeline stalls became greater and greater, so the occasional stall would require a great number of extra cycles here and there. Also, as processor clocks increased faster than memory clocks, the penalty for loading cache memory from slower external memory, or writing cached memory back to external memory, became greater. And as programs became larger and larger, cache misses became more frequent.
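The effect of stalls and misses on the average can be sketched with the standard effective-CPI formula. The stall and miss rates below are made-up illustrative numbers, not measurements of any particular processor:

```python
def effective_cpi(base_cpi, stall_rate, stall_penalty, miss_rate, miss_penalty):
    """Average cycles per instruction once stalls and cache misses are counted."""
    return base_cpi + stall_rate * stall_penalty + miss_rate * miss_penalty

# Near-ideal pipeline: about 1 cycle per instruction.
ideal = effective_cpi(1.0, 0.0, 0, 0.0, 0)

# A hypothetical real program: 5% of instructions stall for 10 cycles,
# and 2% miss the cache and wait 50 cycles for slow external memory.
real = effective_cpi(1.0, 0.05, 10, 0.02, 50)

print(ideal, real)  # 1.0 versus 2.5 -- the clock rate alone tells you little
```

Even modest stall and miss rates more than double the cycles per instruction, which is why doubling the clock does not double real performance.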

The processors and memory also increased in speed at a higher rate than did mass storage disk media, so systems responded by caching disk files in memory to reduce this bottleneck. These three factors combined to reduce real MIPS performance on real programs to a lower rate of increase than the MHz clock numbers would seem to suggest. Many consumers complained that their new machine had a clock twice or even ten times faster but that they saw a much smaller increase in speed.

Equating MHz to MIPS performance, which is good for marketing, breaks down when one looks at real performance, and this creates a need for better indicators of performance, so benchmarks are used. I have read vendors' benchmarks on artificial problems designed to stay in cache and keep the pipelines filled that showed four instructions executing on every clock, and have read users' complaints that on their real programs the same processor actually required four clocks per instruction.

In one presentation that I attended on an 'inexpensive' parallel processing machine (only $500,000 per processor node, using the PA-RISC architecture) and its parallelizing compiler, the vendor showed a benchmark reaching 99% of the machine's 200 MFLOPS (Million Floating Point Operations Per Second) rate per node. I asked what the integer performance of the machine was. The vendor paused and then responded that he didn't know, and that they didn't care. He went on to describe how the compiler took two days to compile the program that he was describing. Since that compiler was using integer instructions exclusively, I was very tempted to ask him what percentage of the compiler code executed floating point instructions.

Benchmarks posted to newsgroups are notorious for violating the three basic rules that I proposed above. Be careful. When pressed, people will admit that they didn't treat all the rows or columns in the same way: this one actually ran ten times more iterations of the test, that one is simply listed in different units, or this one subtracted something that was not subtracted from that one. There is an old saying that there are lies, damned lies, and statistics. This is also true of benchmarks, so view benchmarks with a critical mind.

Benchmarks can be very useful when used in context and when intentionally deceptive practices are avoided. A compiler writer trying to optimize the code generator for a given environment will want to test a particular short sequence of processor instructions outside of any general context, free of interference from the other factors that complicate measuring how well the compiler converts a short source specification into a discrete sequence of processor instructions. For this code generator benchmark they want the best case timing, removed from all complicating factors. By comparing the timing of these short sequences of instructions they can compare one code generation technique to another for specific instruction sequences.

Compiler writers' best case code generation benchmarks are often posted in newsgroups and often appear in vendors' advertising. One should always remember what they are, how they work, and most importantly what they measure and what they don't. They are meant to measure the best case timing of a short sequence of processor code generated from some sequence of source. To do this they must remove all the complications listed above that account for the large discrepancies between instruction timing under ideal conditions and real world behavior.

So the way they are done is to wait for the system to get going, wait for the program to load from disk, wait for the program to load from slow external memory into fast cache memory, wait for the pipelines to fill, then start timing a short sequence of instructions, and turn timing off before complications happen.

The compiler writer wants to avoid complications like delays caused by system interrupts, pipeline stalls, cache misses causing access to slow memory, delays caused by disk access, and all the other types of complications that real programs encounter. To do this some people subtract the overhead that they calculate will always be present and that they don't want to see, or that they don't want other people to see.

Or they first time an empty loop, then time the instruction sequence inside the loop, then subtract the time of the empty loop. This lets them test only the best case instruction sequence with minimal interference from realities like pipeline stalls in the loop, although it isn't very accurate. Or they turn timing on inside the loop and turn it back off again before the loop ends, trying to be careful not to time the timing routine itself.
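A minimal sketch of the empty-loop subtraction technique, in Python using `time.perf_counter`; the exact numbers vary from run to run, which is part of why the method isn't very accurate:

```python
import time

def time_loop(body, iterations):
    """Time a loop calling body() and return total elapsed seconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        body()
    return time.perf_counter() - start

iterations = 100_000
empty = time_loop(lambda: None, iterations)            # loop overhead only
work = time_loop(lambda: sum((1, 2, 3)), iterations)   # loop plus the code under test

# The "best case" number: per-call time with loop overhead subtracted.
per_call = (work - empty) / iterations
print(f"{per_call * 1e9:.0f} ns per call (loop overhead subtracted)")
```

Note everything this leaves out: the interpreter startup, the loads, the interrupts that happened to land elsewhere. That is exactly the point of the technique, and exactly its limitation as evidence of real performance.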

Those isolated timings are useful to a compiler writer trying to improve their code generator for the different instruction sequences that might be generated by a short sequence of source code. But most people are not compiler writers and are more concerned with real performance which is a more complex thing.

Sometimes people apply these same isolation techniques in benchmarking to provide deceptive evidence to those who are not looking closely. I have seen people arguing in a newsgroup that the file system they use provides really fast access. They even gave file access benchmarks as evidence.

On one occasion the numbers looked too good to be believable to me, so I asked if they had subtracted anything from the benchmark. The answer was yes: they had subtracted the time needed to get the file from disk into a file cache. They were giving file-access-from-memory-cache numbers to prove that file access from disk was fast, without mentioning that they had simply subtracted the time of the actual disk access. Beware.

I have also seen similarly deceptive benchmarking practices when people claimed in a newsgroup that the Operating System they used did not induce any significant overhead when used by their programs. They even provided benchmarks to indicate this. However, the benchmarks were compiler writers' floating point code generation benchmarks, showing only that in the complete absence of any use of the host Operating System a floating point instruction on the processor ran a few percent slower or faster under several different operating systems.

Some programmers are equally reluctant to describe the actual size of their programs. Sometimes they complain that it is just too hard to do: they simply can't figure out exactly which routines they use in the host system or its dynamically linked libraries. They will say that this is what keeps their programs small, that they share code with other applications that has to be there anyway, so they don't count it. Often, if pressed about how large their programs actually are, even not counting the supporting code that they require, they use very vague terms like 'small.' All the supporting code that a program requires should count as part of the size of that program.

Some programmers will tell you that they can write a 'hello world' program in twenty-six bytes or something like that in assembler. What they are trying to sweep under the carpet is that their code relies on and calls functions in the system BIOS, the operating system, or the file system. They simply don't count the code that is required by their program. But it is code and it is part of their application, so their twenty-six byte 'hello world' program may actually use megabytes of code. Even if they don't use an operating system or file system, their code may rely on BIOS code in ROM or FLASH memory or on support code on their video card. It seems like common sense that all code used or required should count as code, but it is not common practice.
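The accounting argument can be sketched with made-up numbers; every component size below is hypothetical, and the point is only what gets counted:

```python
# Hypothetical component sizes, in bytes, for a "twenty-six byte" hello world.
# The numbers are invented for illustration; the rule is that all code the
# program relies on counts as part of the program.
components = {
    "hello program itself":         26,
    "BIOS console routines used":   8_192,
    "OS/file system code invoked":  32_768,
    "video support code relied on": 16_384,
}

total = sum(components.values())
print(f"advertised size: 26 bytes, honest size: {total} bytes")
```

The advertised number measures only the top of the stack of code actually executed on the program's behalf.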

Of course when you go to buy a high performance car the salesman may tell you that the car has a top speed of 160 miles per hour. He is not likely to tell you that if your car is new, your tires are new, the weather is right, you go to a desert or out on the autobahn or a racetrack, put in premium fuel, wait for the thing to warm up, give it one final tune-up, and then hold the gas pedal to the floor on a long enough, flat enough, smooth enough straight, it might just make it to 160, or it might also self-destruct and kill you. He is not likely to tell you that you will probably never see that speed in real driving once you leave the sales lot. He is especially not going to remind you that you will get a ticket if you exceed 65 mph on real roads, that you will be driving at an average speed of 20 mph on your commute every day, or that driving this car at a 20 mph average will cause it to require very expensive service. In advertising and marketing his job is to tell you that the car has a top speed of 160 mph so that you can imagine it, until you write the check and face reality.

So what are real benchmarks, ones that reflect what real things do and what real uses deal with? I am of the opinion that there are such things. For cars they are things like real life gas consumption in real life driving on real roads, real life maintenance and upkeep cost and frequency, expected lifetime, depreciation, accident frequency and extent statistics, insurance costs, the likelihood of the vehicle being stolen, or how likely it is to burst into flames or roll over. Fortunately cars are not sold like computer hardware and software, with "We take no responsibility" type waivers.

For embedded computing systems, or any real-time applications, the benchmark that matters is the worst case scenario, not the compiler writer's best case scenario. With particularly non-deterministic processors this is difficult to measure exactly. As Dr. Philip Koopman wrote in an excellent article for Embedded Computing Journal, the real problem in real-time applications is not meeting the timing requirements on average, but beating them all the time. It is the odd occurrence of events, where an Interrupt Service Routine hits in a particular instruction sequence, that results in significant performance hits because of pipeline stalls and cache memory misses. Since the occasional ISR may be one hundred times slower than the average, as Dr. Koopman demonstrated, there is only one solution when using these non-deterministic processors: clock them one hundred times faster than the average case requires.
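A small sketch of why the worst case, not the average, sets the required clock rate. The latency distribution below is invented for illustration, with the rare outliers standing in for ISRs that land on pipeline stalls and cache misses:

```python
# Hypothetical interrupt latencies in microseconds: mostly fast, with rare
# outliers where an ISR hits a pipeline stall plus a burst of cache misses.
latencies_us = [1.0] * 98 + [50.0, 100.0]

average = sum(latencies_us) / len(latencies_us)
worst = max(latencies_us)

print(f"average {average:.2f} us, worst case {worst:.0f} us")

# A hard real-time deadline must be met on every event, so the processor has
# to be provisioned against the worst case, not the average:
print(f"required over-provisioning: {worst / average:.1f}x the average")
```

Two outliers in a hundred events barely move the average, yet they dictate the deadline; that is the gap between a marketing benchmark and a real-time guarantee.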

For computers, what is common to all systems and to all users? How long does the Operating System take to boot up or shut down? How long does it take a program to load and launch? How much time does the program spend waiting for mass storage to deliver library modules or data? How long does the system take to open and close windows, or to perform any of the other functions that are common to all users?

Like real-time worst case benchmarks, these things are on the opposite end of the spectrum from compiler writers' code generator benchmarks. Those are designed to avoid timing the things that actually slow down computers. They are designed not to show how real programs wait for the OS to load or perform OS functions or background housekeeping, wait for the disk, wait for the disk cache, wait for main memory, wait for cache memory, wait for pipeline stalls, etc. Users beware.

All time should count as time. The size of all programs in memory, or worse yet paged in and out of memory, counts as memory usage.

If you are a programmer then you are a computer user also. The time it takes to load tools counts. The time that the OS requires and your compiler needs to load is a factor in your productivity. The time your computer takes to compile your programs counts. The size of these tools will have an effect on the time they require to run. All time should count as time, all program size should count as program size.

However the salesman would rather talk about the top speed under ideal conditions than the issues that will confront the user day in and day out. Buyer beware.

Things like virus control software running in the background, disk fragmentation, excessive loading and unloading of programs, or just the default time-outs in networking code may account for more user time spent waiting than anything else. Look at real things, and pay attention.

Remember the fine print flashed on the screen when you were a kid, saying "model does not actually fly," when you see those ads where people buy new personal computers that let them fly. Remember what you actually have to sit and wait for when working with a computer when you see benchmarks with everything that really counts subtracted from them. Keep in mind what benchmarks are for.

If you use benchmarking yourself then use it carefully. Follow the rules if you use benchmarks as evidence for other people. If you see benchmarks used as evidence, examine them carefully.

One thing is clear: Forth chips were not designed to run C code, and it is ridiculous to suggest that C or Unix benchmarks have any meaning in that context. Forth vendors tend to show only compiler writers' best case snippet benchmarks. If asked for real numbers on real programs they respond with phrases like "pretty fast." If you try to discuss things like compilation speed they will tell you it is too fast to care about, or that it is virtually instantaneous. They brag about giant programs but won't say how long they take to compile. If you buy the products and test them and report that they were an order of magnitude slower than a public domain Forth compiling the same source code, they will tell you it is because your program was bad. If you say maybe it has something to do with the underlying Operating System performance, they will show you compiler writers' best case code snippet times, or say that a floating point opcode won't be slowed down much, but they will not discuss the reality of things like hosting an environment in Windows: needing megabytes of interface code and tens of thousands of Forth words, and being subject to the tens of thousands of bugs in the code that their system relies on.

Showing that it is easy to get several more orders of magnitude of performance, even on hardware designed to run C, will usually only get the response that it isn't a fair comparison. And it isn't, but not for the reason they think: the big solutions add so many extra problems, and limit what one person can do so severely, that they then have to add overhead to allow ten people to do the work of one, overhead that was never added to the problem when one person produced the demo.

The first time I saw OKAD demonstrated I was quite shocked at how it ran about ten thousand times faster than the conventional CAD program demonstrated on the same day on a different PC that was twice as fast. I was more shocked when I saw the half million dollar software require a room full of expensive workstations to move faster than a crawl, and still it could not simulate the chips designed in OKAD or get anywhere near the performance numbers of the OKAD simulations.

On my code I saw only between half of the performance and about equal performance when comparing functions running on my chip to the same functions coded by experts for a Pentium running at the same clock frequency. Of course one of the ideas behind my chip was that it should be one thousand times lower in power and production cost than a Pentium chip, and that it should include things that you have to add to a Pentium design: motherboard chip sets, video cards, serial cards, parallel port cards, timer chip sets, sound cards, network cards, and the like. The network interface being built in, and optimized not for local area networking but for the high speed connections needed for parallel processing, was meant to make it possible to connect a large number of them together with no change to the software. Being a thousand times cheaper than a Pentium to produce, the idea was that if you want to pay Pentium prices you should get a thousand times more processing power, optimized for parallel real-time applications rather than designed to process intentionally bloated code.

So I think it is more fair to compare a chip like F21 to a chip that is the same size, has the same number of gates, and cannot be distinguished from an F21 if you cover up the labels. These are the high volume chips that are made in many times greater numbers than Pentium. And when you benchmark an F21 against these chips you see a three to five order of magnitude increase in speed for the same production cost and power consumption.

If you compare one of my chips running my code to a Pentium chip running optimized code at the same clock frequency, the Pentium is likely to win by some small margin. However, that is absolutely nothing like what the one thousand to one cost or power consumption ratios might suggest. Chuck Moore has also written benchmarks for very similar designs. He tells people that if you compare well written code on one of his chips to a Pentium running at the same clock frequency, his chip will be about twice as fast. He has provided numerous examples, but people always dismiss them as too small to be reflective of their bloated programs.

Of course beating a Pentium running massive amounts of popular bloatware is so easy that it is a bad joke. But the benchmarks offered by Pentium sales people are designed to eliminate all the real effects that their real programs see like excessive disk thrashing, excessive interrupt service, giant programs and data sets not fitting in the fast memory caches, the pipeline stalls, or even the tremendous overhead created by programs being absurdly larger than they need to be to perform a given function.

Beating Pentium PCs running Windows on real world things by orders of magnitude is quite easy. Common sense should tell you that programs smaller than 1K should execute much faster than versions that need tens of megabytes. The most extreme cases are things like Windows GUI functions where beating Windows by a few orders of magnitude is not very hard. The most extreme case I have seen is multitasking functions.

Forth has always had an advantage over more complex software methods for creating the most nimble multitasking routines, and this is much more true for modern Forth on Forth hardware than it was in the old days. Architectures like the Pentium are particularly difficult to multitask quickly because they are so very non-deterministic. The more registers one has to save and restore, the deeper the pipelines one has to flush and refill, and the bigger the programs spilling out of fast cache memory, the more difficult efficient multitasking becomes. Most Unix systems are not designed for real-time processing and do not have impressive task switch times.

Forth has traditionally used what is called a co-operative multitasking mechanism. Forth tends to be a more holistic environment where programs are not fighting one another but are co-operating, so the multitasking mechanisms can be very fast. Although Windows has a co-operative multitasking mechanism, it is absurdly slow. Windows can insert system housekeeping code in between user task switches. This produces not just slow task switch times but long pauses where the computer appears to have crashed, until eventually it comes out of the coma and gets to the next task in the user's program. At iTV we had a highly optimized multitasker. Beating Windows PCs by many orders of magnitude was obvious when we counted our worst case task switch times in nanoseconds and Windows counted them in seconds.
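A toy sketch of co-operative multitasking, using Python generators rather than any actual Forth or iTV mechanism: each task runs until it yields, and a "task switch" is nothing more than resuming the next generator, which is why co-operative switching can be so cheap.

```python
# Each task is a generator; yield is the co-operative PAUSE point where the
# task voluntarily hands control back to the scheduler.

def task(name, steps):
    for i in range(steps):
        yield f"{name} step {i}"

def run(tasks):
    """Round-robin scheduler: cycle through the tasks until all have finished."""
    log = []
    while tasks:
        current = tasks.pop(0)
        try:
            log.append(next(current))  # resume the task until its next yield
            tasks.append(current)      # still alive: back of the queue
        except StopIteration:
            pass                       # task finished: drop it
    return log

print(run([task("A", 2), task("B", 2)]))
# -> ['A step 0', 'B step 0', 'A step 1', 'B step 1']
```

There is nothing to save or restore beyond the generator's own frame: no register files, no pipeline flush, no cache spill. That, in miniature, is the structural advantage the text describes.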

Of course it is very difficult to compare processors when one is designed to pull a trainload of bloated code and the other is designed to be as quick and nimble as possible with a minimum of extra weight. It is hard to compare them on the same code because the big designs are very hard to use without lots of complex code, their hardware complexity demanding software complexity, while the tiny processors are designed to run lean code with tiny opcodes and simple coding techniques.

The questions that I think are appropriate are things like: does it boot in milliseconds or minutes? Does a program take seconds or microseconds to load and launch? Does it take minutes or microseconds to compile programs? Does it switch tasks in nanoseconds or seconds? Is the GUI so fast that you can't outrun it, or so slow that you forget what you were doing before it responds to an event? As I said above, these are the benchmarks that I think are important. I will publish a list of actual times on a number of these types of real world things for anyone who does not want to ignore the real world times of computers and does not want to be taken in by compiler writers' best case code snippet benchmarks.

Jeff Fox
3/31/03 homepage