Linpack Benchmark Results On PCs

Roy Longbottom at Linkedin


Description

This benchmark was produced by Jack Dongarra from the "LINPACK" package of linear algebra routines. From the mid-1980s it became the primary benchmark for scientific applications, with a slant towards supercomputer performance.

The original version was written in Fortran, but a "C" version appeared later. The standard "C" version operates on 100x100 matrices in double precision, with rolled/unrolled and single/double precision options. The pre-compiled versions are double precision and rolled, in optimised and non-optimised forms. These can be found in BenchNT.zip, which also contains the source code with further explanatory comments. DOS versions are available in DosTests.zip and versions to run under OS/2 in OS2Tests.zip. See My Main Page for other PC benchmarks and results.
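The rolled/unrolled option concerns the benchmark's inner vector loops, the best known being daxpy (y = y + a*x). The following is a minimal sketch of the two forms, not a copy of the distributed source; the function names and signatures are illustrative assumptions.

```c
#include <stddef.h>

/* Rolled form: one multiply-add per loop iteration. */
void daxpy_rolled(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Unrolled form: four multiply-adds per iteration, preceded by a
   clean-up loop that handles the n % 4 leftover elements. */
void daxpy_unrolled(size_t n, double a, const double *x, double *y)
{
    size_t m = n % 4;

    for (size_t i = 0; i < m; i++)
        y[i] += a * x[i];

    for (size_t i = m; i < n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
}
```

Both functions produce the same values; the unrolled form simply reduces loop overhead, which mattered more on older compilers and CPUs than it does with modern optimisers.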

The benchmark has also been compiled with Microsoft 32 bit and 64 bit compilers that generate SSE and SSE2 instructions for floating point. The original 2006 64 bit version indicated poor performance on Core 2 Duo CPUs, but this was corrected using a later compiler in 2009 - see Vista64.htm. Compiled programs (2006 and 2009 versions) are in Win64.zip, with source code in NewSource.zip. See also Win64.htm.

Performance rating is in terms of Millions of Floating Point Operations Per Second (MFLOPS).
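As a rough illustration of how an MFLOPS rating can be derived, the sketch below uses the nominal Linpack operation count of 2/3 n^3 + 2 n^2 for an n x n system. The function names are illustrative assumptions, and the timing is assumed to be measured elsewhere.

```c
/* Nominal floating point operation count for factoring and solving
   an n x n system: 2/3 n^3 + 2 n^2. */
double linpack_ops(int n)
{
    double dn = (double)n;
    return 2.0 * dn * dn * dn / 3.0 + 2.0 * dn * dn;
}

/* Millions of floating point operations per second, given the
   measured solution time in seconds. */
double linpack_mflops(int n, double seconds)
{
    return linpack_ops(n) / (seconds * 1.0e6);
}
```

For the 100x100 case the count is about 686,667 operations, so a run taking one millisecond would rate at roughly 687 MFLOPS.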

Linpack Reference - Jack Dongarra, Performance of Various Computers Using Standard Linear Algebra Software in a Fortran Environment. This is a PDF file that includes numerous results for minicomputers, workstations, mainframes and supercomputers.


Results

The following is a sample of results. Performance tends to be proportional to CPU MHz for a given type of processor but is also affected by cache size and speed. There can also be variations, probably depending on where the data happens to be stored in cache. Details of cache sizes and the range of CPU MHz can be found in CPUSpeed.htm. Results include those from DOS and Windows compilations, which produce very similar speed measurements. Some SSE2 and OS/2 results are given in separate tables below.

Later results are for the same code ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software). See linux benchmarks.htm, and download the benchmark execution files, source code, and compile and run instructions in classic_benchmarks.tar.gz. Under Windows, the file downloaded wrongly as classic_benchmarks.tar.tar but was fine when renamed to classic_benchmarks.tar.gz. Results are shown separately below. A 2013 version of GCC has the option to compile for AVX1, a more recent addition to the Intel instruction set, which can potentially double maximum SSE/SSE2 speeds. Results are included below and show much improved performance. The AVX version of the benchmark is included in AVX_benchmarks.tar.gz. Further details are in Linux AVX benchmarks.htm.


Windows PC Normal Results

Double Precision 100x100 compiled at 32 bits

                            Opt    No opt
CPU               MHz    MFLOPS    MFLOPS

AMD 80386          40      0.53      0.36
80486 DX2          66      2.63      1.74
AMD 5X86          100      3.34      2.24
Pentium            75      7.56      4.04
Cyrix P150        120     10.08      8.75
Cyrix PP166       133     11.53      8.33
Pentium           100     12.07      5.40
IBM 6x86          150     12.87      8.29
Pentium           133     17.05      5.60
Pentium           166     19.89      6.86
Cyrix PR233       188     19.98     11.88
Pentium           200     22.80      8.10
AMD K6            200     22.84     11.39
Pentium MMX       200     23.53      8.75
AMD K62           500     45.79     26.44
Pentium II        300     47.74     18.25
Pentium Pro       200     48.50     10.72
Pentium III       450     61.52     26.51
Pentium II        450     61.56     26.47
Apple G3          700     63.30     28.58
AMD K63           450     65.20     28.55
Celeron A         300     79.65     19.24
Pentium III       600     84.18     35.81
Celeron A         450    119.59     28.84
Athlon            500    180.79     39.70
Atom             1600    183.01     89.19
Pentium IIIE      600    185.22     59.43
Duron             600    225.06     34.81
Pentium III      1000    316.67     55.52
Athlon Tbird     1000    372.69     81.11
Duron            1000    374.05     57.88
PIII Tualatin    1200    380.08    128.79
Pentium 4        1700    382.00    131.59
Pentium 4        1900    533.93    107.17
Celeron M        1295    539.76    123.59
Athlon 4         1600    585.74    103.42
P4 Xeon          2200    599.24    123.69
Pentium 4E       3000    630.30    165.01
Ath4 Barton      1800    659.57    117.29
Turion 64 M      1900    697.32    123.69
Opteron          1991    753.08    131.89
Athlon XP        2080    764.03    136.05
Pentium M        1862    834.29    181.05
Pentium 4        3066    840.27    174.64
Athlon XP        2338    859.43    153.21
Athlon 64        2150    811.86    142.80
Athlon 64        2211    838.22    145.60
Core 2 Duo M     1830    997.68    111.41
Pentium 4        3678   1017.01    209.01
Core i5 2467M    @@@@   1064.70    315.46
Celeron C2 M     2000   1092.56    121.25
Core 2 Duo 1 CP  2400   1315.42    195.13
Phenom II        3000   1412.83    244.43
Core i7 930      ****   1764.75    428.00
Core i7 860      ####   2004.31    381.97
Core i7 3930K    &&&&   2529.73    746.01
Core i7 4820K    $$$1   2671.15    892.04
Core i7 4820K    $$$2   2684.05    895.54
Core i7 3930K      OC   3112.94    926.92

      #### Rated as 2800 MHz but running at up to 3460 MHz using Turbo Boost
      **** Rated as 2800 MHz but running at up to 3066 MHz using Turbo Boost
      @@@@ Rated as 1600 MHz running at up to 2300 MHz using Turbo Boost
      &&&& Rated as 3200 MHz but running at up to 3800 MHz
      OC   OverClocked ~4720 MHz
      $$$1 Rated as 3700 MHz but running at up to 3900 MHz using Turbo Boost
      $$$2 Performance not Balanced Power Setting for 3900 MHz
      M  = Mobile CPU

Windows 32/64 Bit SSE2 Results

Double Precision 100x100 compiled at 32 and 64 bits

                              Opt
CPU                 MHz    MFLOPS

Celeron M     32b  1295    499.90
Pentium 4     32b  1900    677.67
Turion 64M    32b  1900    835.82
Pentium 4E    32b  3000    912.78
Athlon 64     32b  2211   1013.68
Athlon 64     64b  2211   1043.56
Athlon 64     64b  2211   1090.52  ##
Core2 DuoM    32b  1830   1119.38
CeleronC2M    32b  2000   1221.49
Core 2 Duo    32b  2400   1479.78
Core 2 Duo    64b  2400    823.10
Core 2 Duo    64b  2400   1602.35  ##
Phenom II     64b  3000    850.45
Phenom II     32b  3000   1713.22
Phenom II     64b  3000   1905.19  ##
Core i7 4820K 32b  $$$1   3388.59  ##
Core i7 4820K 32b  $$$2   3405.54  ##
Core i7 4820K 64b  $$$1   3526.84  ##
Core i7 4820K 64b  $$$2   3556.70  ##
Core i7 3930K 64b  &&&&   3927.74  ##

      ##   2009 compilation
      &&&& i7-3930K Overclocked see above
      $$$$ i7-4820K see above

Linux 32/64 Bit Results

Double Precision 100x100 compiled at 32 and 64 bits

                                 Opt  No opt
CPU                      MHz  MFLOPS  MFLOPS

Atom N455     32b Ub    1666     196      94
Atom N455     64b Ub    1666     226      89
Core 2 Mob    32b Ub    1830     983     307
Athlon 64     32b Ub    2211     936     231
Athlon 64     64b Ub    2211    1118     221
Core 2 Duo    32b Ub    2400    1288     404
Core 2 Duo    64b Ub    2400    1577     378
Phenom II     32b Ub    3000    1464     411
Phenom II     64b Ub    3000    1887     411
Phenom II     64b Fe    3000    1872     407
Core i7 930   64b Ub    ****    2265     511
Core i7 4820K 32b Ub    $$$1    2534     988
Core i7 4820K 64b Ub    $$$1    3672     900
Core i7 4820K AVX Ub   $$$12    5413     935

      Ub = Ubuntu Linux, Fe = Fedora Linux
      ****  Rated as 2800 MHz but running at up to 3066 MHz using Turbo Boost
      $$$1  Rated as 3700 MHz but running at up to 3900 MHz using Turbo Boost
      $$$12 As $$$1, but compiled with GCC 4.8.2, which produces AVX SIMD instructions

OS/2 Results

                            Opt    No opt
CPU               MHz    MFLOPS    MFLOPS

IBM 80486BL       100      0.56      0.50
80486 DX2          66      2.65      2.00
80486              75      2.84      1.83
Cyrix P150        120     10.08      8.75
Pentium Pro       150     39.33     14.30
Pentium Pro       166     43.96     15.93
Pentium Pro       200     46.69     18.71

Android and Raspberry Pi Versions

Later conversions were versions to run on Android tablets and phones with ARM CPUs. Most use a Java front end for starting and displaying results, with compiled C code for the calculations. Download:

www.roylongbottom.org.uk/Linpackv5.apk and www.roylongbottom.org.uk/Linpackv7.apk
www.roylongbottom.org.uk/LinpackSP.apk and www.roylongbottom.org.uk/LinpackJava.apk.

and install in the usual way for such devices. The v7 program and the single precision (SP) variety are compiled for newer hardware than the v5 version. There is also a benchmark written entirely in Java. See android benchmarks.htm. The latter includes LinpackJava.apk results from PCs via Android x86, details are here. The latest versions are modified to use NEON SIMD functions that carry out four arithmetic operations simultaneously. See android neon benchmarks.htm. Download:

www.roylongbottom.org.uk/NEON-Linpack.apk

The latest benchmarks were compiled and run on a Raspberry Pi, which uses an ARM CPU and Linux. See Raspberry Pi Benchmarks.htm and download from Raspberry_Pi_Benchmarks.zip.

 
Double Precision and Single Precision (SP) 100x100

                               v7/v5       v5 
CPU          MHz   Android    MFLOPS    MFLOPS

ARM 926EJ    800       2.2       5.7       5.6
ARM v7-A8    800     2.3.5      80.2          
ARM v7-A9    800     2.3.4     101.4      10.6
ARM v7-A9   1300a    4.1.2     151.1      17.1
ARM v7-A9   1500     4.0.3     171.4          
ARM v7-A9   1500a    4.0.3     155.5      16.9
ARM v7-A9   1400     4.0.4     184.4      19.9
ARM v7-A9   1600     4.0.3     196.5          
ARM v7-A15  2000b    4.2.2     459.2      28.8

                               v7 SP     Java 
CPU          MHz   Android    MFLOPS    MFLOPS

ARM 926EJ    800       2.2       9.6       2.3
ARM v7-A9    800     2.3.4     129.1      33.3
ARM v7-A9   1300a    4.1.2     201.3      56.4
ARM v7-A9   1500a    4.0.3     204.6      56.9
ARM v7-A9   1400     4.0.4     235.5      57.0
ARM v7-A15  2000b    4.2.2     803.0     143.1


Atom   Ax86 1666     2.2.1                15.7
Core 2 Ax86 2400     2.2.1                53.3

Raspberry Pi                    DP        SP  
CPU          MHz     Linux    MFLOPS    MFLOPS

ARM  1176    700     3.6.11     42        58  
ARM  1176   1000     3.6.11     68        88  

                              NEON SP         
CPU          MHz   Android    MFLOPS          

ARM v7-A9    800     2.3.4     255.8          
ARM v7-A9   1300a    4.1.2     376.0          
ARM v7-A9   1500a    4.0.3     382.5          
ARM v7-A9   1400     4.0.4     454.2          
ARM v7-A15  2000b    4.2.2    1334.9          

      v7 = fast FPU used when available
      a = CPU running at 1200 MHz, b = 1700 MHz
      Ax86 = Android x86 - Slow JIT Compiler?


MultiThreading Versions

This version uses mainly the same C programming code as the single precision floating point NEON compilation above. It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. Virtually the same programming code was used to produce execution files for ARM devices/Android, PCs/Linux and PCs/Windows. The latter two were compiled to use SSE instructions, which can carry out up to four operations simultaneously. Again see android neon benchmarks.htm. Download:
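To illustrate what "four operations simultaneously" means for single precision SSE, here is a minimal sketch of the y = y + a*x inner loop written with SSE intrinsics, with a scalar fallback. It is an assumption-laden illustration only: the function name is hypothetical, and whether the benchmark itself relies on intrinsics or on compiler-generated SSE code is not stated here.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Single precision y = y + a*x. With SSE available, each pass of the
   first loop handles four floats in one 128 bit register. */
void saxpy4(size_t n, float a, const float *x, float *y)
{
    size_t i = 0;

#if defined(__SSE__)
    __m128 va = _mm_set1_ps(a);              /* a copied to all 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);     /* unaligned loads */
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(va, vx));
        _mm_storeu_ps(y + i, vy);
    }
#endif

    /* Scalar loop: leftover elements, or everything without SSE. */
    for (; i < n; i++)
        y[i] += a * x[i];
}
```

NEON code on ARM follows the same pattern with 128 bit registers holding four single precision values.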

www.roylongbottom.org.uk/NEON-Linpack-MP.apk

Slight changes were made to the original version to allow a higher level of parallelism. The initial 100x100 Linpack benchmark was only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for this is the same as 100x100, except users are allowed to use their own linear equation solver.

This program can run much slower using multiple threads due to the overhead of creating and closing threads too frequently. Using data size N x N, approximately 0.67 x N x N x N floating point calculations are carried out but, with this program, there are N breaks for handling threads (can someone do better?). Results show that overheads from these are generally constant executing up to four threads. Results below include overheads of using a single thread at 100x100 (microseconds difference from no threads / 100). In most cases, overheads (per N) increase with larger arrays, as performance becomes more dependent on higher level caches or RAM data transfers.
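The overhead figure described above can be reproduced from the MFLOPS values in the tables. The sketch below assumes the full nominal operation count of 2/3 N^3 + 2 N^2 (the text quotes the approximation 0.67 N^3); the function names are illustrative.

```c
/* Nominal operation count for an n x n system: 2/3 n^3 + 2 n^2. */
static double nominal_ops(int n)
{
    double dn = (double)n;
    return 2.0 * dn * dn * dn / 3.0 + 2.0 * dn * dn;
}

/* Thread handling overhead in microseconds per elimination step.
   Operations divided by MFLOPS gives run time in microseconds, and
   the extra time with threads is spread over the n thread breaks. */
double thread_overhead_us(int n, double mflops_none, double mflops_threaded)
{
    double us_none     = nominal_ops(n) / mflops_none;
    double us_threaded = nominal_ops(n) / mflops_threaded;
    return (us_threaded - us_none) / n;
}
```

For example, the first Android device listed in the results gives 1399.82 MFLOPS with no threads and 54.86 MFLOPS with one thread at N = 100. Calling thread_overhead_us(100, 1399.82, 54.86) yields about 120 microseconds, which agrees with the overhead quoted for that device.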

Results below show no performance gains due to multiprocessing on the ARM based devices. Constant performance of the Netbook, with no threads, suggests a CPU speed limitation, then gains with two threads due to Hyperthreading. The Core 2 Duo shows the best improvement using two threads. At least the Phenom shows some gain with four threads, using Linux but not via Windows.

Particularly with multithreading, it is important to verify that calculations produce the same numeric results, irrespective of the number of threads used. This benchmark checks for that and reports if there are errors. These results are shown below and are the same on Android, Linux and Windows. For completeness, numeric results and a sample of performance are provided for a double precision compilation.
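One simple way to make such a check is to record the five reported values for each run and require them to be identical across thread counts. The sketch below is a hypothetical illustration, not the benchmark's own code; the struct and function names are assumptions.

```c
/* The five values reported per run (hypothetical container). */
typedef struct {
    float norm_resid;   /* NR */
    float resid;        /* RE */
    float machep;       /* MA */
    float x0_minus_1;   /* X0 = x[0] - 1   */
    float xn_minus_1;   /* XN = x[n-1] - 1 */
} linpack_check;

/* Returns 1 if every run matches the first run exactly, else 0.
   Exact equality is intended: the same arithmetic in the same order
   should give the same results however many threads are used. */
int results_match(const linpack_check *runs, int count)
{
    for (int i = 1; i < count; i++) {
        if (runs[i].norm_resid != runs[0].norm_resid ||
            runs[i].resid      != runs[0].resid      ||
            runs[i].machep     != runs[0].machep     ||
            runs[i].x0_minus_1 != runs[0].x0_minus_1 ||
            runs[i].xn_minus_1 != runs[0].xn_minus_1)
            return 0;
    }
    return 1;
}
```

A caller would fill one entry per thread count (none, 1, 2 and 4) for a given matrix size and report an error when results_match returns 0.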

At some point, a Java version of the Linpack benchmark rearranged the order of the initial data (function matgen), and this changed the numeric results. These are shown below.

   

Single Precision MFLOPS

100x100, 500x500, 1000x1000, 0, 1, 2, 4 Threads

T11 Samsung EXYNOS 5250 2.0 GHz Cortex-A15, Android 4.2.2

Threads      None        1        2        4

N  100    1399.82    54.86    55.31    54.66
N  500    1154.21   434.16   434.06   436.97
N 1000     571.26   482.57   487.25   485.80

N 100 1 Thread overheads 120 microseconds
MHz measured at 1700

P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4

Threads      None        1        2        4

N  100     455.90    42.37    41.76    37.32
N  500     395.16   326.43   321.82   309.55
N 1000     355.77   322.98   323.71   322.24

N 100 1 Thread overheads 147 microseconds

T7 Nexus 7 Quad 1300 MHz Cortex-A9, Android 4.1.2

Threads      None        1        2        4

N  100     413.47    45.95    48.22    48.34
N  500     253.08   187.51   189.69   189.94
N 1000     148.76   135.49   136.08   136.17

N 100 1 Thread overheads 133 microseconds
MHz measured at 1200

Netbook 1.6 GHz Atom, 64 Bit Linux

Threads      None        1        2        4

N  100     263.60    48.45    55.02    35.99
N  500     258.06   228.34   293.27   248.43
N 1000     252.72   259.02   362.24   213.40

N 100 1 Thread overheads 116 microseconds

Desktop 2.4 GHz Core 2 Duo, 64 Bit Linux

Threads      None        1        2        4

N  100    1666.02   287.94   200.82   134.17
N  500    1908.89  1422.59  1902.42  1507.04
N 1000    1921.33  1624.31  2606.09  2306.14

N 100 1 Thread overheads 20 microseconds

Desktop 3.0 GHz Quad Core Phenom II, 64 Bit Linux

Threads      None        1        2        4

N  100    1924.69   279.90   206.19   141.13
N  500    2059.73  1333.07  1510.81  1247.76
N 1000    2074.59  1682.34  2314.57  2478.78

N 100 1 Thread overheads 21 microseconds

Desktop 3.0 GHz Quad Core Phenom II, 64 Bit Windows 7

Threads      None        1        2        4

N  100    1961.90   103.26    56.87    32.03
N  500    1596.83   894.22   767.35   537.39
N 1000    1646.96  1261.64  1588.28  1337.33

N 100 1 Thread overheads 63 microseconds

Single Precision Numeric Results

NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

N                 100              500             1000

NR               1.60             3.96            11.32
RE     3.80277634e-05   4.72068787e-04   2.70068645e-03
MA     1.19209290e-07   1.19209290e-07   1.19209290e-07
X0    -1.38282776e-05   5.26905060e-05   1.62243843e-04
XN    -7.51018524e-06   3.26633453e-05  -6.65783882e-05

Double Precision MFLOPS

Desktop 2.4 GHz Core 2 Duo, 64 Bit Linux

Threads      None        1        2        4

N  100    1509.25   271.92   195.22   130.55
N  500    1697.29  1295.56  1753.95  1374.19
N 1000     980.78   892.38  1036.67   945.15

N 100 1 Thread overheads 21 microseconds

Double Precision Numeric Results

N                 100              500             1000

NR               1.67             5.76             9.50
RE     7.41628980e-14   1.27986510e-12   4.22017976e-12
MA     2.22044605e-16   2.22044605e-16   2.22044605e-16
X0    -1.49880108e-14   5.59552404e-14   1.09912079e-13
XN    -1.89848137e-14   3.39728246e-14   5.08926234e-13

Revised Matgen Results Single Precision

N                 100              500             1000

NR               1.76             5.32            11.48
RE     4.18424606e-05   6.34193420e-04   2.73901224e-03
MA     1.19209290e-07   1.19209290e-07   1.19209290e-07
X0    -1.37090683e-06   3.00407410e-05  -4.81605530e-05
XN    -1.13248825e-06  -2.25305557e-05   8.94069672e-06

Revised Matgen Results Double Precision

N                 100              500             1000

NR               1.43             5.17            10.10
RE     6.33937347e-14   1.14730447e-12   4.48374671e-12
MA     2.22044605e-16   2.22044605e-16   2.22044605e-16
X0    -2.55351296e-15  -3.64153152e-14  -1.59761093e-13
XN     6.21724894e-15   6.72795153e-14  -4.12780921e-13



Roy Longbottom October 2014

The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection