Abstract
The Fortran Whetstone programs were the first general purpose benchmarks that set
industry standards of computer system performance. Whetstone programs also addressed the
question of the efficiency of different programming languages, an important issue not
covered by more contemporary standard benchmarks. Results are provided for computers
produced from the 1960's to the present day, including results obtained via different languages.
The benchmark, a UK product, was based on work by Brian Wichmann ** of the National
Physical Laboratory. It was developed by Harold Curnow ** of HM Treasury Technical
Support Unit (TSU - later part of Central Computer and Telecommunications Agency
or CCTA). This document was produced by Roy Longbottom (TSU/CCTA 1960 to 1993),
who carried out further development.
** Download Whetstone.pdf, a copy of their original research paper - kindly supplied by Brian.
In The Beginning
Before the introduction of high level languages, general computer
performance comparisons were usually based on instruction execution times.
These were combined to produce an overall rating using a mix of instructions,
the most well known one being the Gibson Mix for scientific applications,
devised by J Gibson of IBM.
In 1957, the UK Government formed the Technical Support Unit to evaluate and
advise on computers, employing engineers from the telecommunications
service. This unit eventually became part of the central procurement body
later known as the Central Computer and Telecommunications Agency (CCTA).
TSU engineers produced numerous calculations between 1966 and 1973, using an
ADP Mix, the Gibson Mix and a Process Control Mix.
Whetting The Stone
During the late 1960's, the UK National Physical Laboratory had an English
Electric (ICL) KDF9 scientific computer with one of the first
implementations of Algol 60, the Whetstone translator-interpreter. Brian
Wichmann modified the interpreter to record statistics on the intermediate
Whetstone instructions and produced a suite of simple statements which could
be used to evaluate the efficiency of compilers and overall performance of a
processor (see ICL KDF9 benchmark results in the table - the first is for
the Whetstone Interpreter).
In 1971 Roy Wickens, one of the founding members of TSU, abandoned producing
a portable benchmark using real programs as it was becoming too expensive.
He asked Harold Curnow to produce modular synthetic benchmark suites. Harold
produced the COPRXX suite for COBOL and a scientific program based on Brian
Wichmann's work. The first Whetstone benchmark, known as HJC11 (later
ALPR12), was written in Algol 60 and completed in November 1972. The Fortran
codes (HJC12 and HJC12D) were published in April 1973 as FOPR12 and FOPR13.
The first results published were for IBM and ICL mainframes in 1973.
The speed rating was calculated in terms of Kilo Whetstone Instructions
Per Second or KWIPS. Later, Millions or MWIPS was used.
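As a rough illustration of the units, the conversion from a count of Whetstone instructions and a measured run time to KWIPS and MWIPS can be sketched as below. The function names and the sample figures are assumptions for illustration and are not taken from the original programs.

```python
def kwips(whetstone_instructions: float, seconds: float) -> float:
    """Thousands of Whetstone instructions executed per second."""
    if seconds <= 0:
        raise ValueError("run time must be positive")
    return whetstone_instructions / seconds / 1e3


def mwips(whetstone_instructions: float, seconds: float) -> float:
    """Millions of Whetstone instructions executed per second."""
    return kwips(whetstone_instructions, seconds) / 1e3


# Hypothetical run: 10 million Whetstone instructions in 4 seconds.
print(kwips(10_000_000, 4.0))  # 2500.0
print(mwips(10_000_000, 4.0))  # 2.5
```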
Rolling The Stone
During the 1970's, I was head of the CCTA Scientific Systems Branch with
responsibilities for evaluating new systems, advising on procurements and
supervising acceptance trials at both Government Departments and
Universities. This provided the means for obtaining numerous results on
minicomputers and mainframes. At this time, versions were available in
various programming languages (see results).
Taking personal responsibility for state of the art systems including
supercomputers, in 1978 I produced a fully vectorisable version FOVP12
(using arrays instead of simple variables). This provides MWIPS ratings at
different vector lengths (array dimensions). At the time, results of the
Livermore Kernels benchmark were available for top of the range scientific
systems but it was considered useful to have rough performance comparisons
with less glamorous systems. Results are given
later in the tables at vector length 256. Also during 1978, the standard
versions were modified to calculate MWIPS, using CPU timers.
[Vector version reference - R Longbottom, "Performance of Multi-user
Supercomputing Facilities", 4th International Conference on Supercomputing,
April 1989]
It appears that my vectorisable version was used long after I departed from
the supercomputer scene.
Later Results From Here (mainly for workstations).
In 1980, I added facilities to time each of the eight loops to produce speed
ratings in Millions of Integer Instructions and Floating point Operations Per Second
(MIPS and MFLOPS). MIPS represent a relative measurement where DEC VAX 11/780 = 1.
The loop timings were added to identify the tricks that some compilers were
getting up to and to provide more meaningful measures for supercomputers.
The last alterations to the benchmark were in 1987, in conjunction with
Bangor University, who made slight changes intended to avoid over
optimisation whilst still executing identical functions. The benchmarks were
also converted to Fortran 77 standards. At a later stage, I produced compatible
versions using Fortran, Basic, C and Java programming languages for use on PCs
(see PC results). These included further changes to repeat the tests via outer
loops to prevent speed calculation inaccuracy due to timer resolution.
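The idea of repeating a test via an outer loop until the run is long enough for the timer to resolve can be sketched as follows. This is a minimal illustration, not the code used in the PC versions: the doubling strategy, the 0.05 second threshold and the stand-in workload are all assumptions.

```python
import time


def timed_run(section, min_seconds=0.1):
    """Repeat `section` in an outer loop until the elapsed time is long
    enough for the timer to measure reliably.

    Returns (repeats, elapsed_seconds) for the final, accepted run.
    """
    repeats = 1
    while True:
        start = time.perf_counter()
        for _ in range(repeats):
            section()
        elapsed = time.perf_counter() - start
        if elapsed >= min_seconds:
            return repeats, elapsed
        repeats *= 2  # too short to trust: double the outer loop count


def sample_section():
    """Stand-in workload (hypothetical, not a Whetstone loop)."""
    x = 1.0
    for _ in range(1000):
        x = (x + 1.0) * 0.5
    return x


repeats, elapsed = timed_run(sample_section, min_seconds=0.05)
print(f"{repeats} repeats in {elapsed:.3f} s, "
      f"{repeats / elapsed:.0f} passes per second")
```

The speed is then calculated from the accepted repeat count and elapsed time, so a coarse timer contributes only a small relative error.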
2005 - The Whetstone Benchmark has been compiled to run as a 64 bit program
via Windows XP Pro x64 and modified to demonstrate performance of Dual Core CPUs.
Also available are 32 bit versions that use SSE floating point instructions via
the latest Microsoft compiler.
See Win64.htm and DualCore.htm.
Throwing The Stone
The benchmark results were published within CCTA as "Commercial in
Confidence" and supplied to customers when required for a particular
procurement. By 1979, results were available for about 200 systems from 30
suppliers. Although the main emphasis was on comparing speeds via Fortran,
limited results were also available via Algol, PL/I, APL, Pascal, Basic,
Simula and Coral, as well as with varying optimisation options. Along with results
in single and double precision (and extended precision where appropriate),
more than 500 measurements were available.
By this time, the Whetstone benchmark speed rating had become the default
definition of minicomputer MIPS (Millions of Instructions Per Second), its
significance being exaggerated when a minicomputer supplier somehow acquired
the table of Whetstone benchmark results and published some of them in the
computer press with the heading "Now who has the fastest minicomputer".
Whetstone performance ratings are known to have been a serious consideration
in the design of the Digital VAX systems and other minicomputers of the same
vintage, where some were reluctant to publish double precision results which
did not match VAX speeds. DEC benchmarking publications show that Whetstone
results were given serious consideration until 1986. The benchmark was still
being run by DEC in 1996 with results of Alpha-based systems available on
www.digital.com.
The Intel microprocessors were designed at the height of popularity of the
Whetstone benchmark. Examining the instruction set of the math coprocessor,
with instructions for sin, cos, atan, sqrt and log, possibly indicates a
complete hardware implementation (the one and only?) to match the benchmark.
The design also includes 80 bit registers which ensure fast double precision
operation. Although rightly not used as one of the main performance
measurement tools, the Whetstone benchmark was still run by Intel in 1996,
with results of 486 systems, DX4 and Pentium overdrive processors being available
on www.intel.com. The benchmark also formed a small part (2%) of the Intel
iComp benchmark.
As can be seen in PC results, the Intel P4 processor obtains poor results
relative to CPU MHz. This might be due to the length of the P4’s execution
pipelines and the relatively few instructions in the benchmark’s timing loops.
Compiler Optimisation
The benchmark is very simple, comprising some 150 statements with eight
active loops, three of which execute via procedure calls. Three loops carry
out floating point calculations, two evaluate functions, one performs
assignments, one fixed point arithmetic and one branching statements. The
dominant loop, usually
accounting for 30% to 50% of the time, carries out floating point
calculations via procedure calls.
The tests only reference a small amount of data which will fit in the L1
cache of any CPU. Hence, L2 cache and memory speed should have no influence
on performance ratings. Speeds are invariably proportional to CPU MHz on a
given type of processor.
The code was designed to be non-optimisable and optimising compilers did not
have a significant impact until the introduction of in-lining of subroutine
instructions. Although this produces code outside the definition of
Whetstone instructions, which include a specific proportion of procedure
calls, it is a valid technique to obtain the best performance out of modern
systems and may well be the compiler default optimisation level. As
reflected in the PC results, a good compiler can halve the execution time by
in-lining, careful choice of instructions and sequence, and omission of
intermediate stores/loads.
With in-lining and global optimisation, a small number of compilers
identified that the dominant loop did not have to be executed, which
immediately led to MWIPS speeds apparently more than doubling. This was
identified by the 1980 enhancements and fixed in 1987, essentially by
changing the name of one variable. Unlike some other standard benchmarks,
Whetstone results were generally verified as part of the CCTA system
appraisal, in project related benchmarking sessions or during acceptance
trials. It was also standard practice to run the tests with different levels
of optimisation and obviously over-optimised results were not published.
Besides the global optimisation problem, two other areas of complication
have been observed. The first is where loop variables can be too large for
index registers yet the program still runs with a truncated count. This is
catered for by having a double loop to control the running time. The second
complication becomes apparent as systems become faster and underflow can
slow down the execution rate of the first two loops. This can be fixed by
changing the values of variables t and t1 to be closer to 0.5.
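The first complication, a loop count silently truncated to the width of an index register, and the double loop remedy can be modelled roughly as below. The 16 bit register width, the inner loop limit of 10,000 and the function names are assumptions chosen for illustration and do not come from the benchmark source.

```python
REGISTER_BITS = 16                       # assumed index register width
REGISTER_MAX = (1 << REGISTER_BITS) - 1  # largest count that fits: 65535


def truncated(count: int) -> int:
    """Model a count silently truncated to the register width."""
    return count & REGISTER_MAX


def split_count(total: int, inner_limit: int = 10_000):
    """Split `total` passes into outer * inner + remainder, keeping the
    inner loop count small enough to fit the register."""
    outer, remainder = divmod(total, inner_limit)
    return outer, inner_limit, remainder


def run_double_loop(total: int) -> int:
    """Count the passes actually executed when using the double loop."""
    outer, inner, remainder = split_count(total)
    executed = 0
    for _ in range(outer):
        for _ in range(inner):
            executed += 1
    for _ in range(remainder):
        executed += 1
    return executed


wanted = 100_000
print(truncated(wanted))        # 34464: a single loop would run short
print(run_double_loop(wanted))  # 100000: the double loop runs the full count
```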
The latest Intel compiler appears to over optimise the loop with
integer arithmetic. Here, a series of variables are calculated which produce
array indices of constant values and therefore only need to be calculated
once. It would seem that the only way the problem could arise is if the
compiler carried out the indexing calculations itself, maybe to determine that
array accesses are not going to be out of bounds.
Table Headings and Explanation
Supplier or System - The supplier's full name or earlier/later names may be
shown. Hardware options may be included with the system name.
CPU and Precision - This includes the CPU chip type and an indication of the
precision as shown in the original CCTA results. This gives Base: Precision.
For example, 2:23 indicates 23 binary digits and 16:6 six hexadecimal
digits. These were important when considering accuracy where hexadecimal
single precision was not that good. Precision numbers are as follows
(single and double):
       SP      DP
P1     16:6    16:14
P2     10:8    10:16
P3     8:13    8:26
P4     2:48    2:96
P5     2:23    2:30
P6     2:27    2:62
P7     2:23    2:55
P8     2:24    2:56
P9     2:23    2:46
P10    2:23    2:38
P11    2:23    2:39
P12    2:28    2:64
P13    2:22    2:38
P14    2:24    2:53
P15    2:39    2:78
P16    2:39    2:74
P17    2:23    2:47
P18    2:27    2:60
P19    2:31
P20    2:32
P1 also had Extended Precision of 16:28 and P10 2:69.
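To compare the Base:Precision codes above on a common scale, the mantissa width can be converted to equivalent bits or decimal digits. The sketch below is only an illustration with assumed function names. It gives the nominal width. A normalised hexadecimal mantissa can have up to three leading zero bits in its first digit, so 16:6 may carry as few as 21 significant bits, which is one reason hexadecimal single precision compares poorly.

```python
import math


def equivalent_bits(base: int, digits: int) -> float:
    """Nominal mantissa width in bits for `digits` digits of `base`."""
    return digits * math.log2(base)


def equivalent_decimal_digits(base: int, digits: int) -> float:
    """Approximate number of decimal digits carried by the mantissa."""
    return digits * math.log10(base)


for label, base, digits in [("2:23", 2, 23), ("16:6", 16, 6), ("10:8", 10, 8)]:
    print(f"{label:>5}  {equivalent_bits(base, digits):5.1f} bits  "
          f"{equivalent_decimal_digits(base, digits):4.1f} decimal digits")
```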
For vector processors, an indication is given to show whether the result
represents scalar or vector performance, the latter being for vector length
256. In comparing supercomputer performance both scalar and vector results
should be taken into account. In this case a weighted harmonic mean is used
based on 90% of the code being vectorisable. The weighted average is
calculated as 1/(0.1/S+0.9/V), where S is the scalar speed and V the vector
speed.
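The weighted harmonic mean 1/(0.1/S+0.9/V) can be written as a small function. The function name and the sample speeds are hypothetical. Only the formula and the 90% vectorisable fraction come from the text.

```python
def weighted_vector_rating(scalar: float, vector: float,
                           vector_fraction: float = 0.9) -> float:
    """Weighted harmonic mean of scalar and vector speeds.

    `vector_fraction` is the share of the code assumed vectorisable.
    """
    if scalar <= 0 or vector <= 0:
        raise ValueError("speeds must be positive")
    scalar_fraction = 1.0 - vector_fraction
    return 1.0 / (scalar_fraction / scalar + vector_fraction / vector)


# Hypothetical system: 10 MWIPS scalar, 90 MWIPS vector.
print(f"{weighted_vector_rating(10.0, 90.0):.1f}")  # 50.0
```

Even with 90% of the code running nine times faster in this example, the combined rating is only five times the scalar speed, which is why both figures need to be considered.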
Clock MHz - This may be derived from the clock period for older systems.
MWIPS - Generally single precision Whetstone rating in Millions of
Whetstone Instructions Per Second. Differences between Single Precision (MWIPS SP)
and Double Precision (MWIPS DP) should be noted.
MFLOPS - The geometric mean of the three floating point results in
Millions of Floating Point Operations Per Second.
VAX MIPS - The geometric mean of Millions of Operations Per Second for
the sections covering fixed point arithmetic, if then else and assignments,
multiplied by five. Such a calculation for the DEC VAX 11/780, accepted as
running at 1 Million Instructions Per Second, produces approximately 1.0 MIPS.
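The MFLOPS and VAX MIPS summaries described above both rest on a geometric mean of three section results. A minimal sketch follows, assuming the per-section speeds have already been measured. The function names and the sample values are hypothetical.

```python
import math


def geometric_mean(values):
    """Geometric mean, computed via logarithms."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be positive and non-empty")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def mflops_rating(fp_results):
    """Geometric mean of the three floating point loop results."""
    return geometric_mean(fp_results)


def vax_mips_rating(fixed_point, if_then_else, assignments):
    """Geometric mean of the three MOPS figures, multiplied by five."""
    return 5.0 * geometric_mean([fixed_point, if_then_else, assignments])


# Hypothetical section results, not measured values.
print(f"{mflops_rating([1.0, 2.0, 4.0]):.2f}")   # 2.00
print(f"{vax_mips_rating(0.2, 0.2, 0.2):.2f}")   # 1.00
```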
Lang - Two to four character code to indicate the programming language:
For    Fortran
Alg    Algol
PL1    PL/I
Cor    Coral
Pas    Pascal
Apl    APL
Bas    Basic
BasI   Basic Interpreter
VB     Visual Basic
C++    C or C++
Sim    Simula
Opt - An indication of relative optimisation levels within a given range of
systems. IL might be shown to indicate in-lining of procedures or subroutines.
xxx indicates an unknown optimisation level for versions run by suppliers, where
part of the benchmark may not have been executed due to over-optimisation.
Cost $K - Most of the prices shown were obtained from US sources.
Some of the costs are, of course, not particularly accurate. Mainframe and
larger minicomputer prices are generally only for a processor with minimum
memory capacity.
Intr Date - Approximate year of first delivery.