Cray 1 Supercomputer Performance Comparisons With Home Computers Phones and Tablets
Roy Longbottom
Detailed Results and Comparisons
                             LLLOOPS MFLOPS  MFLOPS  MWIPS MFLOPS   CPU Device
CPU               MHz     Max  Gmean    Min Linpack  Whets  Whets  Year   Year
Main Columns        V              V              V             V

Cray
Cray 1             80    82.1   11.9    1.2      27   16.2  est 6  1978
XMP1              118   162.2   17.3    2.1     121   30.3     11  1985
Cray 1 Whets MFLOPS estimated based on XMP results

Raspberry Pi 32 bit
Pi   CPU
1    1176JZF      700     148     55     17      42    271     94  2001   2012
2    A7           900     248    115     42     120    525    244  2011   2014
3    A53         1200     436    184     56     176    725    324  2012   2016
4    A72         1500    1861    679    180     764   1883    415  2015   2019
400  A72         1800    2262    819    217    1147   2258    498  2015   2020

Raspberry Pi 64 bit
400  A72         1800    3353    938    242    1337   2505    573  2015   2020

Rpi 1/Cray 1      8.8     1.8    4.6   13.8     1.6   16.7   15.7
64 bit/32 bit     1.0     1.5    1.1    1.1     1.2    1.1    1.1
64 bit/Cray 1    22.5    40.8   78.8  201.7    49.5  154.6   95.5
Main Columns        #              #              #             #
Comparison - The first results were for tablets that lacked the hardware or software needed for fast floating point calculations. The earliest with appropriate facilities, from 2012, used ARM Cortex-A9 processors, starting with 800 MHz versions. The first of these is shown with the three MFLOPS speeds of 20, 11 and 22, at 10 times the Cray 1 CPU MHz, giving Cray 1 gains of 1.7, 0.4 and 3.7 times.
A later 800 MHz V7-A9 obtained 115, 101 and 155 MFLOPS, or Cray 1 gains of 9.7, 3.7 and 25.8 times.
The fastest results provided are for a 2021 mid-priced phone with a Kryo 570 CPU, said to be based on the ARM Cortex-A77. At 2000 MHz, this obtained an average LLL speed of 1468 MFLOPS, with Linpack at 1986 and Whetstone at 905 MFLOPS, giving Cray 1 performance gains of 123, 74 and 151 times, at 25 times the CPU MHz.
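The Cray 1 gains quoted here are simple ratios of each result to the corresponding Cray 1 figure. A minimal sketch of that arithmetic follows, using the Kryo 570 numbers from the text; the dictionary keys and the function name are illustrative only and do not come from the benchmark programs.

```python
# Cray 1 reference figures used throughout the comparison tables.
CRAY_1 = {"MHz": 80, "LLL Gmean": 11.9, "Linpack": 27, "Whetstone": 6.0}

# Kryo 570 phone results quoted in the text.
KRYO_570 = {"MHz": 2000, "LLL Gmean": 1468, "Linpack": 1986, "Whetstone": 905}


def cray_gains(system, reference=CRAY_1):
    """Return system/reference ratios, rounded to the nearest whole number."""
    return {name: round(system[name] / reference[name]) for name in reference}


gains = cray_gains(KRYO_570)
for name, gain in gains.items():
    print(f"{name:10s} {gain:4d} x Cray 1")
```

The same function can be applied to any other row of the tables by swapping in its MHz and MFLOPS values.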
The latest versions of the benchmarks can be downloaded and installed from Android 9 Benchmarks and Stress Tests On 32 Bit and 64 Bit CPUs (see security warning). Then Android 10 and 11 Benchmarks and ARM big.LITTLE Architecture Issues might be of interest, with Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM and Intel providing more information, results and access to older (out of date) apps.
                             LLLOOPS MFLOPS  MFLOPS  MWIPS MFLOPS   CPU Device
CPU               MHz     Max  Gmean    Min Linpack  Whets  Whets  Year   Year
Main Columns        V              V              V             V

Cray 1             80    82.1   11.9    1.2      27   16.2    6.0  1978

Android 32 bits
V7-A9 a           800      36     20     11      11    171     22  2012   2012
V7-A9 a later     800     253    115     47     101    687    155  2012   2012
v7-A9            1200     208    176     27     159    731    259  2012   2012
v8-A53           1300     397    164     28     348    868    332  2012   2015
v7-A15           1700     471    342     34     826    907    329  2012   2013
QU-800           2150     447    356    112     630   1974    610  2013   2013
V8-A72           1800     674    584    136    1023   2053    465  2015   2015

Android 64 bits
v8-A53           1300     805    238    101     338   1494    319  2012   2015
Exynos 8890      2300     188    158     27     999   3342    760  2016   2017
v8-A57           2000     724    641    245    1163   1988    390  2013   2015
v8-A73           2000     877    786    269    1122   2927    497  2016   2019
Kryo 570         2000    1620   1468    514    1986   4650    905  2020   2021

A53 64/32bit      1.0     2.0    1.5    3.6     1.0    1.7    1.0
V7-A9 a/Cray 1   10.0     0.4    1.7    9.2     0.4   10.6    3.7
v7-A9 later      10.0     3.1    9.7   39.2     3.7   42.4   25.8
32b A72/Cray 1   22.5     8.2   49.0  113.5    37.9  126.7   77.5
64b 570/Cray 1   25.0    19.7  123.3  428.0    73.6  287.1  150.8
Main Columns        #              #              #             #
Comparison - Below are samples of results where details for the three benchmarks were available. The first PC to reach the average Cray 1 Livermore Loops score is indicated as a 1994 100 MHz Pentium, shown as 12 MFLOPS, with Linpack and Whetstone at 12 and 16. This gives approximate Cray 1 comparisons, for MHz and the three MFLOPS measurements, of 1.3, 1.0, 0.44 and 2.6 times.
PCs with faster Pentium processors continued to produce performance proportional to CPU MHz, with improvements appearing with the 1995 Pentium Pro. At 200 MHz the three MFLOPS measurements were 34, 49 and 41 and four comparisons 2.5, 2.9, 1.8 and 6.8 times.
Next came various Pentium II and III models, with improvements to these benchmarks mainly proportional to CPU MHz. Then the 2002 Pentium 4 is shown to achieve 187, 382 and 146 MFLOPS, but at 1700 MHz, producing the four Cray 1 comparisons of 21, 16, 14 and 24 times, with decreases in MFLOPS per MHz compared with earlier Pentiums.
With alternative CPU technology, the per MHz ratio improved with a single core of a 1820 MHz 2007 Core 2 processor obtaining 413, 998 and 374 MFLOPS or Cray 1 improvements of 23, 35, 37 and 62 times.
The 2010 Core i7 range produced an improvement in MFLOPS per MHz, with the 3900 MHz 2013 model obtaining 1108, 2684 and 716 MFLOPS and comparisons 49, 93, 99 and 119 times.
The 2021 laptop with a Core i5 1135G7 CPU provided further gains with a higher MFLOPS per MHz rating for Livermore Loops and Linpack but not much with Whetstone. MFLOPS identified were 1387, 3541 and 802, and Cray 1 comparisons of 117, 131 and 134 times.
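The MFLOPS per MHz ratings discussed above are the Livermore Loops geometric mean divided by the CPU clock speed. The sketch below reproduces a few of the values from the table of Windows PC results; the list layout and function name are my own illustration, not part of the benchmark code.

```python
# (CPU, MHz, Livermore Loops geometric mean MFLOPS), copied from the Windows PC results.
RESULTS = [
    ("Cray 1", 80, 11.9),
    ("Pentium", 100, 12),
    ("Pentium Pro", 200, 34),
    ("Pentium 4", 1700, 187),
    ("Core i7 4820K", 3900, 1108),
    ("Core i5 1135G7", 4150, 1387),
]


def mflops_per_mhz(mflops, mhz):
    """MFLOPS per MHz, rounded to two decimal places as in the table."""
    return round(mflops / mhz, 2)


per_mhz = {cpu: mflops_per_mhz(gmean, mhz) for cpu, mhz, gmean in RESULTS}
for cpu, ratio in per_mhz.items():
    print(f"{cpu:15s} {ratio:5.2f}")
```

The drop from the Pentium Pro to the Pentium 4, and the recovery with the Core range, show up directly in these ratios.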
These results are from running optimised versions of the original Windows Classic Benchmarks livecont.exe, linpcont.exe and whetcont.exe, available in the downloadable benchnt.zip.
                                                                         LLLoops
                                                                           Gmean
                             LLLOOPS MFLOPS  MFLOPS  MWIPS MFLOPS Device  MFLOPS
CPU               MHz     Max  Gmean    Min Linpack  Whets  Whets   Year per MHz
Main Columns        V              V              V             V

Cray 1             80    82.1   11.9    1.2      27   16.2    6.0   1978    0.15

Windows PCs
AMD 80386          40     1.2    0.6    0.2     0.5    5.7    0.8   1991    0.02
80486 DX2          66     4.9    2.7    0.7     2.6     15    3.3   1992    0.04
Pentium            75      24    7.7    1.3     7.6     48     11   1994    0.10
Pentium           100      34     12    2.1      12     66     16   1994    0.12
Pentium           200      66     22    3.8             132     31   1996    0.11
AMD K6            200      68     22    2.7      23    124     26   1997    0.11
Pentium Pro       200     121     34    3.6      49    161     41   1995    0.17
Pentium II        300     177     51    5.5      48    245     61   1997    0.17
AMD K62           500     172     55    6.0      46    309     67   1999    0.11
Pentium III       450     267     77    8.3      62    368     92   1999    0.17
Pentium 4        1700    1043    187     19     382    603    146   2002    0.11
Athlon Tbird     1000    1124    201     23     373    769    161   2000    0.20
Core 2           1830    1650    413     40     998   1557    374   2007    0.23
Core i5          2300    2326    438     35    1065   1813    428   2009    0.19
Athlon 64        2150    2484    447     48     812   1720    355   2005    0.21
Phenom II        3000    3894    644     64    1413   2145    424   2009    0.21
Core i7 930      3066    2751    732     68    1765   2496    576   2010    0.24
Core i7 4820K    3900    5508   1108     88    2680   3114    716   2013    0.28
Core i5 1135G7   4150    7505   1387     92    3541   3293    802   2021    0.33

Pentium/Cray 1    1.3     0.4    1.0    1.8     0.4    4.1    2.6
i5/Cray 1          52      91    117     77     131    203    134
i5/i7             1.1     1.4    1.3    1.1     1.3    1.1    1.1
Main Columns        #              #              #             #
Windows benchmarks, used in this area, were lloops64.exe, linpack64.exe and whetsSSE.exe. These and source code files are included in Windows-Benchmarks.zip. Compared with the earlier results, performance increased to achieve Cray 1 MFLOPS gains of 238, 190 and 182 times. For this area, double precision Whetstone results are also shown, running at the same speed as the single precision version.
I have had difficulties in using the latest C compilers for Windows, but a new bootable flash drive for Ubuntu 20.04 provided the compiler, enabling more advanced options to be used under Linux. The new benchmarks were initially compiled on an older PC as it did not seem possible to boot the latest flash drive on my new Core i5 based laptop. For the latter, I installed WSL (Windows Subsystem for Linux) in order to compile and run the programs.
Linux - The first compilations under Linux were slightly faster than those from Windows. Those used here were compiled on the i5 laptop using the latest gcc 9.3.0 compiler, under Ubuntu. Disassembly code was examined to show that SSE, AVX and AVX-512 instructions were being used, as appropriate. This cannot be guaranteed by relying on compile options. These benchmarks can be downloaded in Linux-Benchmarks.tar.xz. The first Linux results, using the AVX SIMD instructions, increased the three i5 Cray 1 gains to 300, 259 and 179 times. AVX-512 hardware was only available on the Core i5 CPU, providing the three MFLOPS gains of 359, 337 and 226 times.
The table provides MFLOPS per MHz calculations for Livermore Loops average and maximum results. A major surprise is that the latter for SSE and AVX, of 3.56 and 4.77 were higher than recognised maximum double precision ratios, without FMA, of 2.0 and 4.0. This also applied for SSE for the Core i7 at 3.05. The AVX-512 FMA 47692 MFLOPS ratio of 11.49 suggests significant FMA was being used. See also Faster Than Expected below.
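To make the surprise concrete, the following sketch recomputes the maximum MFLOPS per MHz ratios from the Linux results and flags those above the double precision limits quoted in the paragraph above (2.0 for SSE and 4.0 for AVX, without FMA). The data layout and function name are illustrative assumptions.

```python
# Limits quoted in the text for double precision without FMA.
NO_FMA_DP_LIMIT = {"SSE": 2.0, "AVX": 4.0}

# (CPU, instruction set, MHz, Livermore Loops maximum MFLOPS) from the Linux results.
MAX_RESULTS = [
    ("Core i7 4820K", "SSE", 3900, 11881),
    ("Core i5 1135G7", "SSE", 4150, 14786),
    ("Core i7 4820K", "AVX", 3900, 12878),
    ("Core i5 1135G7", "AVX", 4150, 19794),
]


def over_limit(rows, limits):
    """Return (cpu, simd, MFLOPS per MHz, exceeds quoted limit) for each row."""
    report = []
    for cpu, simd, mhz, max_mflops in rows:
        ratio = round(max_mflops / mhz, 2)
        report.append((cpu, simd, ratio, ratio > limits[simd]))
    return report


report = over_limit(MAX_RESULTS, NO_FMA_DP_LIMIT)
for cpu, simd, ratio, exceeded in report:
    flag = "above limit" if exceeded else "within limit"
    print(f"{cpu:15s} {simd:4s} {ratio:5.2f}  {flag}")
```

Only the Core i7 AVX result stays within its quoted limit; the other three exceed it, as noted in the text.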
                                                                            DP LLLoops
                                                                           Gmean     Max
                             LLLOOPS MFLOPS  MFLOPS  MWIPS MFLOPS Device  MFLOPS  MFLOPS
CPU               MHz     Max  Gmean    Min Linpack  Whets  Whets   Year per MHz per MHz
Main Columns        V              V              V             V

Cray 1             80    82.1   11.9    1.2      27   16.2    6.0   1978    0.15    1.03

Windows PCs Earlier SSE Compiler
Core i7 4820K    3900    6145   2037    327    3601   6385   1081   2013    0.52    1.58
Core i5 1135G7   4150    8313   2828    386    5132   7466   1094   2021    0.68    2.00
Core i5 DP       4150                                  7256   1098
i5/Cray 1          52     101    238    321     190    461    182
i5/i7             1.1     1.4    1.4    1.2     1.4    1.2    1.0

Linux PCs SSE New Compiler
Core i7 4820K    3900   11881   2578    569    5306   6007   1182   2013    0.66    3.05
Core i5 1135G7   4150   14786   3364    575    7322   6586   1052   2021    0.81    3.56
i5/Cray 1          52     180    283    479     271    407    175
i5/i7             1.1     1.2    1.3    1.0     1.4    1.1    0.9

Linux PCs AVX New Compiler
Core i7 4820K    3900   12878   2615    597    5098   5887   1174   2013    0.67    3.30
Core i5 1135G7   4150   19794   3568    943    6998   6477   1077   2021    0.86    4.77
Core i5 DP                                             6861   1076
i5/Cray 1          52     241    300    786     259    400    179
i5/i7             1.1     1.5    1.4    1.6     1.4    1.1    0.9
SP/DP                                                   0.9    1.0
i7 AVX/SSE                1.1    1.0    1.0     1.0    1.0    1.0
i5 AVX/SSE                1.3    1.1    1.6     1.0    1.0    1.0

Linux AVX 512 FMA New Compiler
Core i5 1135G7   4150   47692   4273    965    9088   8193   1353   2021    1.03   11.49
i5/Cray 1          52     581    359    805     337    506    226
Main Columns        #              #              #             #
The i5 CPU MHz is 52 times that of the Cray 1, compared with gains of over 300 times for the Livermore Loops and Linpack benchmarks using AVX-512 functions, and more than 200 times for Whetstone. Later are multithreading results for the latter, and for a vector version, to highlight the benefits of using more advanced facilities.
          ------- MFLOPS -------    ----- i5/Cray 1 ------
          LLOOPS  Linpack  Whets    LLOOPS  Linpack  Whets
No SSE      1387     3541    802       117      131    134
SSE         3364     7322   1052       283      271    175
AVX         3568     6998   1077       300      259    179
AVX512      4273     9088   1353       359      337    226
Best results, from the next table, for Core i5 and Raspberry Pi 400 are provided, to demonstrate their superiority over 1991 supercomputers. On top of this, the former have multiple cores, with the potential of four times higher throughput or raw performance. See MP Whetstone results and those for MP MFLOPS.
                          ---- Scalar ----  ---- Vector ----  Vector/
                                                               Scalar
                      MHz    MWIPS  MFLOPS    MWIPS  MFLOPS    MFLOPS  DATE

Cray 1                 80     16.2     5.9       98      47       8.0  1978
CDC Cyber 205          50     11.9     4.9      161      57      11.7  1981
Cray XMP1             118     30.3    11.0      313     151      13.7  1982
Cray 2/1              244     25.8     N/A      425     N/A            1984
Amdahl VP 500 #       143     21.7     7.5      250     103      13.8  1984
Amdahl VP 1100 #      143     21.7     7.5      374     146      19.5  1984
Amdahl VP 1200 #      143     21.7     7.5      581     264      35.3  1984
IBM 3090-150 VP        54     12.1     4.9       60      17       3.6  1986
(CDC) ETA 10E          95     15.7     6.5      335     124      19.2  1987
Cray YMP1             154     31.0    12.0      449     195      16.3  1987
Fujitsu VP-2400/4     312     71.7    25.4     1828     794      31.3  1991
NEC SX-3/11           345     42.9    17.0     1106     441      25.9  1991
NEC SX-3/12           345     42.9    17.0     1667     753      44.3  1991
# Fujitsu Systems

Core i5 AVX512 SP    4150     7780    1353    21039   28303      20.9  2021
Core i5 AVX512 DP    4150     8193    1353    21464   20346      15.0  2021
Pi 400 SP            1800     2505     573     3755    2131       3.7  2020
Pi 400 DP            1800     2684     575     3407    1184       2.1  2020
The fastest Whetstone floating point code is not suitable to benefit much from fused multiply and add operation, with one multiply associated with four additions or subtractions. The maximum Core i5 speed of 75.1 GFLOPS is quite impressive.
Average i5 Cray 1 MFLOPS gains were 602 and 433 times, for single then double precision calculations.
Note that some SP SSE MFLOPS per MHz were again greater than 4.0 and AVX above 8.0 and half these with DP. The Raspberry Pi 400 vector performance was not that good but, as shown above, somewhat faster than the scalar speed.
                                                                   Average Maximum  Average
                                                          Average   MFLOPS  MFLOPS   MFLOPS
                 Mode      MHz  MWIPS MFLOPS MFLOPS MFLOPS  MFLOPS Per MHz Per MHz  xCray 1

Windows SSE
Phenom II        64b SP   3000   4869   4429   3067    751    1593     0.5     1.5       34
Phenom II        64b DP   3000   4897   2418   1722    751    1290     0.4     0.8       27
Phenom II        32b SP   3000   4624   1798   1584    701    1148     0.4     0.6       24
Core i7 4820K    64b SP   3900   7256  14233  12655    958    2513     0.6     3.6       53
Core i7 4820K    64b DP   3900   7299   7416   7019    953    2261     0.6     1.9       48
Core i7 4820K    32b SP   3900  10494  10362   9748   9468    9846     2.5     2.7      209
Core i5 1135G7   64b SP   4150   8435  23709  21246   1043    2862     0.7     5.7       61
Core i5 1135G7   64b DP   4150   8621  12375  11475   1041    2659     0.6     3.0       57
Core i5 1135G7   32b SP   4150  13387  18221  17254  13739   16162     3.9     4.4      344

Linux
Core i7 4820K    Op3 SP   3900  12012  12896   6248  17131   10136     2.6     4.4      216
Core i7 4820K    AVX SP   3900  11924  20394   7124  23551   12938     3.3     6.0      275
Core i7 4820K    Op3 DP   3900  11383   6259   4601   8711    6099     1.6     2.2      130
Core i7 4820K    AVX DP   3900  11526  10509   5789  11950    8533     2.2     3.1      182
Core i5 1135G7   Op3 SP   4150  20870  21024  10721  28800   17088     4.1     6.9      364
Core i5 1135G7   AVX SP   4150  20294  37170  12353  39126   22487     5.4     9.4      478
Core i5 1135G7   A512 SP  4150  21039  62592  13037  75094   28303     6.8    18.1      602
Core i5 1135G7   Op3 DP   4150  20045  10884   8035  14575   10528     2.5     3.5      224
Core i5 1135G7   AVX DP   4150  20526  19270  10311  20360   15152     3.7     4.9      322
Core i5 1135G7   A512 DP  4150  21464  33188  11504  32907   20346     4.9     8.0      433

Raspberry Pi 400 SP       1800   3755   2413   1683   2506    2131     1.2     1.4       45
Raspberry Pi 400 DP       1800   3407   1216   1151   1186    1184     0.7     0.7       25
Phenom, Windows 7 - This demonstrates almost perfect speed gains using 1 to 2 and 2 to 4 cores, with no further increase using 8 threads.
Core i7 Desktop - This can use 4 cores or 8 independent threads at the same time. This application appeared to demonstrate near best case performance gains using 8 threads.
Core i5 Laptop - Performance Monitor indicated that this ran at around 4150 MHz using 1 and 2 threads, but reduced to about 3800 MHz for 4 and 8 threads.
Windows vs Linux - Average MFLOPS performance was quite similar, on both the i7 and i5 PCs, at the lower level of optimisation shown here.
Single vs Double Precision - Results indicated similar performance, as expected from scalar operation.
PC Performance Gains - some of the Core i7 speeds were faster than on the i5. For the latter, eight thread Cray 1 MFLOPS gains were 1521 times.
Android Phone - The Kryo 570 CPU has out-of-order execution, maybe responsible for the highest MFLOPS per MHz ratio of 0.42. But the big.LITTLE CPU arrangement, of 2 fast and 6 slow cores, led to 8 core performance being only 5 times faster than for 1 core. Still, the Cray 1 gain was 757 times.
Raspberry Pi 400 - As might be expected, this quad core system produced the same elapsed time using 1, 2 and 4 threads, with a little extra performance using 8 threads. Maximum Cray 1 gain was 400 times.
                                                       Average      --- Average MFLOPS ---
System             Threads  MWIPS MFLOPS MFLOPS MFLOPS  MFLOPS  Secs xCray 1  Gain Per MHz

Desktop Win 7            1   4086    817    817    752     794   5.0     132   1.0    0.26
Phenom II                2   8149   1635   1616   1501    1582   5.0     264   2.0
4 core                   4  16199   3261   3234   2968    3149   5.1     525   4.0
3000 MHz                 8  16602   3428   3461   3056    3304  10.1     551   4.2

Desktop Win 10           1   6169   1236   1236    856    1077   4.5     179   1.0    0.28
Core i7 4820K            2  13106   2601   2604   1910    2322   4.2     387   2.2
4 Core 8 Thread          4  25343   5181   5197   3723    4587   4.5     764   4.3
3900 MHz                 8  46579  10310  10263   7403    9104   5.0    1517   8.5

Laptop Win 10            1   7555   1195   1216   1046    1147   4.9     191   1.0    0.28
Core i5 1135G7           2  15048   2385   2424   2083    2287   5.0     381   2.0
4 Core 8 Thread          4  27290   4339   4407   3787    4158   5.6     693   3.6
4150 MHz or less         8  53037   8619   8773   7538    8272   5.9    1379   7.2

Linux
Desktop SP               1   6157   1189   1146    931    1076   4.7     179   1.0    0.28
Core i7 4820K            2  12641   2529   2608   1931    2314   4.6     386   2.1
4 Core 8 Thread          4  25490   5204   5213   3900    4685   4.6     781   4.4
3900 MHz                 8  43907  10217  10440   7714    9279   5.7    1547   8.6

Desktop DP               1   6500   1235   1252    972    1138   3.9     190   1.0    0.29
Core i7 4820K            2  13098   2542   2636   1938    2328   3.9     388   2.0
4 Core 8 Thread          4  26298   5105   5273   3906    4676   3.9     779   4.1
3900 MHz                 8  44758  10268  10435   7755    9312   5.2    1552   8.2

Laptop SP                1   7640   1140   1199   1015    1113   5.0     185   1.0    0.27
Core i5 1135G7           2  14662   2347   2262   1997    2192   5.4     365   2.0
4 Core 8 Thread          4  26754   4320   4387   3752    4133   6.1     689   3.7
4150 MHz or less         8  46016   7885   8264   6701    7556   7.5    1259   6.8

Laptop SP AVX512         1   8432   1281   1280   1248    1269   5.0     212   1.0    0.31
Core i5 1135G7           2  16728   2542   2548   2471    2520   5.0     420   2.0
4 Core 8 Thread          4  29816   4625   4617   4523    4588   6.0     765   3.6
4150 MHz or less         8  54985   9203   9188   8994    9127   6.6    1521   7.2

Laptop DP AVX512         1   8748   1278   1278   1248    1268   4.9     211   1.0    0.31
Core i5 1135G7           2  17372   2542   2542   2481    2521   5.0     420   2.0
4 Core 8 Thread          4  31459   4622   4622   4514    4585   5.5     764   3.6
4150 MHz or less         8  57024   9187   9210   8985    9126   6.0    1521   7.2

Android Phone            1   4327   1010    984    782     913   4.6     152   1.0    0.42
Kryo 570                 2   8782   1850   2126   1604    1836   4.5     306   2.0
2 x 2200 MHz +           4  13969   3189   3373   2641    3034   6.9     506   3.3
6 x 1800 MHz             8  21039   4535   4985   4171    4540   7.9     757   5.0

Raspberry Pi 400         1   2266    644    645    376     520   5.0      87   1.0    0.29
4 x Cortex A72           2   4533   1285   1284    751    1038   5.0     173   2.0
1800 MHz                 4   9065   2562   2498   1505    2062   5.0     344   4.0
                         8   9611   3284   3375   1543    2402  10.1     400   4.6
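The Gain column is the average MFLOPS at each thread count divided by the single thread figure. The sketch below recomputes it for four of the systems in the table; the dictionary layout and names are illustrative only.

```python
THREADS = (1, 2, 4, 8)

# Average MFLOPS for 1, 2, 4 and 8 threads, copied from the MP Whetstone table.
AVERAGE_MFLOPS = {
    "Phenom II": (794, 1582, 3149, 3304),
    "Core i5 1135G7 SP AVX512": (1269, 2520, 4588, 9127),
    "Kryo 570": (913, 1836, 3034, 4540),
    "Raspberry Pi 400": (520, 1038, 2062, 2402),
}


def mp_gains(mflops):
    """Speed relative to the single thread result, to one decimal place."""
    return [round(value / mflops[0], 1) for value in mflops]


gains = {system: mp_gains(values) for system, values in AVERAGE_MFLOPS.items()}
for system, values in gains.items():
    line = "  ".join(f"{t}T={g:.1f}" for t, g in zip(THREADS, values))
    print(f"{system:26s} {line}")
```

The four core Phenom II and Raspberry Pi 400 level off beyond 4 threads, while the hyperthreaded Core i5 continues to gain at 8 threads.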
Calculations are carried out of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Each case is run accessing 102400, 1024000 and 10240000 data words, covering caches and RAM. Up to 64 threads can be used, each using a dedicated segment of the data, the default being 8 threads. Data is checked for consistent values at the end.
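As a rough illustration of the structure described above, here is a minimal Python sketch of the 8 operations per word case, with each thread updating its own segment of a shared array and a consistency check at the end. The constants, initial value and function names are hypothetical, chosen only to keep values near 1.0; they are not taken from the benchmark source. Python threads will not show the speed gains of the compiled benchmark, so this only demonstrates the data partitioning and the operation count.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical constants, not from the benchmark source.
A, B, C, D, E, F = 0.000020, 0.999980, 0.000011, 1.000011, 0.000012, 0.999992
OPS_PER_WORD = 8  # 3 additions inside brackets, 3 multiplications, 1 subtract, 1 add


def update_segment(x, start, stop):
    """Apply the 8 operation calculation to words start..stop-1 in place."""
    for i in range(start, stop):
        v = x[i]
        x[i] = (v + A) * B - (v + C) * D + (v + E) * F


def run(words=102400, threads=8, passes=1):
    x = [0.999999] * words
    # Dedicated, non-overlapping segment boundaries for each thread.
    bounds = [words * t // threads for t in range(threads + 1)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for _ in range(passes):
            jobs = [pool.submit(update_segment, x, bounds[t], bounds[t + 1])
                    for t in range(threads)]
            for job in jobs:
                job.result()  # wait for this pass and surface any exception
    consistent = len(set(x)) == 1  # every word should hold the same value
    return x, consistent, OPS_PER_WORD * words * passes


x, consistent, flops = run(words=10240, threads=8, passes=2)
print(f"consistent={consistent}, floating point operations={flops}")
```

Dividing the operation count by the elapsed time in microseconds would give MFLOPS in the same sense as the tables.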
Below are measured MFLOPS using 1, 2, 4 and 8 threads for the Core i7 and i5 computers, executing SSE and AVX instructions, plus AVX-512 on the i5. As for MP Whetstones, performance improvements, from doubling the number of threads (MP Gains), are shown to be non-linear for the Core i5 laptop.
Single core MFLOPS per MHz ratios are also shown. Maximum single precision expectations, without FMA instructions, are 4 for SSE and 8 for AVX and 16 for AVX-512, then 32 for the latter, where FMA is used. Then double precision operation expectations are half these values.
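The expectations quoted above follow from the vector register width divided by the element size, doubled where a fused multiply-add counts as two operations. The sketch below assumes one vector instruction completes per clock cycle, which is what the quoted figures imply; the names are illustrative.

```python
VECTOR_BITS = {"SSE": 128, "AVX": 256, "AVX-512": 512}
ELEMENT_BITS = {"SP": 32, "DP": 64}


def max_flops_per_cycle(simd, precision, fma=False):
    """Expected maximum MFLOPS per MHz, assuming one vector instruction per cycle."""
    lanes = VECTOR_BITS[simd] // ELEMENT_BITS[precision]
    return lanes * (2 if fma else 1)


for simd in VECTOR_BITS:
    sp = max_flops_per_cycle(simd, "SP")
    dp = max_flops_per_cycle(simd, "DP")
    print(f"{simd:8s} SP {sp:2d}  DP {dp:2d}")
print("AVX-512 with FMA, SP:", max_flops_per_cycle("AVX-512", "SP", fma=True))
```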
It can be seen that, for both i7 and i5, SSE and AVX MFLOPS/MHz ratios were higher than these. I have been unable to identify the reason for these levels of performance, without FMA type instructions being used. For further information see Faster Than Expected below.
AVX-512 MFLOPS per MHz was less than 32, one reason being that all instructions were not of the FMA variety, as shown in code disassemblies, shown below. These indicate that the highest expected speed achievable by the FMA code is just over 76% of maximum with complete FMA instructions, or 24.4 MFLOPS/MHz, close to that obtained.
The performance adjustment is also shown to produce a likely reduction in Cray 1 maximum speed to 122 MFLOPS, executing these functions. The maximum Core i5 single precision speed of 325915 MFLOPS indicates a Cray 1 gain of 2671 times. Maximum double precision result, from the next page, was 160641 MFLOPS with a gain of 1317 times.
Also single precision results on the next page indicate maximum 8 thread speed on Raspberry Pi of 30150 MFLOPS or Cray 1 gain of 247 times and Android phone at 35686 MFLOPS or gain of 293 times, both at Intel SSE SIMD level.
Threads         1      2      4      8      1      2      4      8      1      2      4      8
  KWDs Ops    SSE    SSE    SSE    SSE    AVX    AVX    AVX    AVX AVX512 AVX512 AVX512 AVX512

Core i7 3900 MHz MFLOPS
   102   2  10106  22704  47224  54668  11379  27114  56982  63095    N/A    N/A    N/A    N/A
  1024   2   9801  19227  36849  42389  10542  20127  39567  45256
 10240   2   5856   9342  10120   9951   6004   9400  10165   9936
   102   8  24258  48818  91871  97077  36354  82307 169881 184765
  1024   8  24356  49258  91911  96902  34820  67057 130960 161412
 10240   8  19421  34454  39855  39777  22340  36088  40372  39578
   102  32  23355  46711  88383  93448  45374  88045 171961 177649
  1024  32  23284  46883  88776  93381  45459  91277 172443 178895
 10240  32  23107  46102  85346  92767  43834  86697 152019 157381

Maximum     24356  49258  91911  97077  45459  91277 172443 184765
MP Gains      1.0    2.0    3.8    4.0    1.0    2.0    3.8    4.1
AVX/SSE                                   1.9    1.9    1.9    1.9
Max/MHz       6.2                        11.7

Core i5 MFLOPS
   102   2  24612  48845  46738  80544  29021  30791  86020  93812  37656  74288  72164 121973
  1024   2  21362  42345  43579  79180  21656  44753  44415  93920  23333  46844  58968 122122
 10240   2   7495  12295  13298  14067   7620  11160  13454  14020   9274  13455  13337  13995
   102   8  33271  65364  71105 119460  64946 128515 153955 210177  71895 142743 142554 241880
  1024   8  32614  65504  63763 118933  62120 127095 121959 210157  66304 134081 144756 239841
 10240   8  22467  38871  50079  56166  24963  42384  53438  56122  30345  49693  54170  56226
   102  32  33273  58673  69365 119426  64941 124972 133637 225265  94417 170909 324843 325915
  1024  32  32997  39974  86194 119313  64304 125772 125365 224014  91558 185785 324870 324936
 10240  32  32777  64727  82112 116115  61061 114491 127026 200120  77458 140903 182219 222231

Maximum     33273  65504  86194 119460  64946 128515 153955 225265  94417 185785 324870 325915
MP Gains      1.0    2.0    2.6    3.6    1.0    2.0    2.4    3.5    1.0    2.0    3.4    3.5
AVX/SSE                                   2.0    2.0    1.8    1.9
a512/AVX                                                              1.5    1.4    2.1    1.4
i5/i7         1.4    1.3    0.9    1.2    1.4    1.4    0.9    1.2
MHz          4150   4150   3600   3600   4150   4150   3600   3600   4150   4150   3600   3600
Max/MHz       8.0                        15.6                        22.8
            --------- Core i7 --------- ---------------- Core i5 ---------------- --- Phone --- ---- RPi ----
Threads         1      8      1      8      1      8      1      8      1      8      1      8      1      8
    Ops/word  SSE    SSE    AVX    AVX    SSE    SSE    AVX    AVX AVX512 AVX512     SP     SP     SP     SP

   102   2   4921  28537   5290  32337  12437  38391  14606  43320  18872  60955   6977  15998   4015  10169
  1024   2   4820  21214   4772  19551   4978  29821   6351  32157   8120  35674   8034  14536   3865   9622
 10240   2   2949   4923   2946   4950   3604   6562   3683   6728   4442   6514   2984   2442    447    585
   102   8  12233  48924  17683  95178  16500  59285  32504 104046  35958 120212
  1024   8  12074  48679  16145  78149  12762  54904  19300  92706  22226 105465
 10240   8   9929  19774  10969  19845  10941  26897  12157  27045  14806  26544
   102  32  11742  46894  22880  89459  16602  58258  32420 111461  47200 160641  12178  34803   7902  28978
  1024  32  11697  46848  22667  88958  16314  59325  31215 107323  42515 151251  12139  35686   7860  30150
 10240  32  11615  46395  21983  78687  16315  57399  30488  99303  38532 105812  12137  34050   7326   8537

Maximum     12233  48924  22880  95178  16602  59325  32504 111461  47200 160641  12178  35686   7902  30150
MP Gain       1.0    4.0    1.0    4.2    1.0    3.6    1.0    3.4    1.0    3.4    1.0    2.9    1.0    3.8
MF/MHZ       3.14          5.87           4.0           7.8          11.4           6.1           4.4

Double/Single Precision
   102   2   0.49   0.52   0.46   0.51   0.51   0.48   0.50   0.46   0.50   0.50
  1024   2   0.49   0.50   0.45   0.43   0.23   0.38   0.29   0.34   0.35   0.29
 10240   2   0.50   0.49   0.49   0.50   0.48   0.47   0.48   0.48   0.48   0.47
   102   8   0.50   0.50   0.49   0.52   0.50   0.50   0.50   0.50   0.50   0.50
  1024   8   0.50   0.50   0.46   0.48   0.39   0.46   0.31   0.44   0.34   0.44
 10240   8   0.51   0.50   0.49   0.50   0.49   0.48   0.49   0.48   0.49   0.47
   102  32   0.50   0.50   0.50   0.50   0.50   0.49   0.50   0.49   0.50   0.49
  1024  32   0.50   0.50   0.50   0.50   0.49   0.50   0.49   0.48   0.46   0.47
 10240  32   0.50   0.50   0.50   0.50   0.50   0.49   0.50   0.50   0.50   0.48
        ---------- AVX-512 DP MFLOPS ----------
Thread  Maximum  Average  Geomean  Harmean  Minimum
   1    33413.3   5809.5   3430.8   2293.0    493.7
   2    35648.5   5576.5   3275.7   2223.1    552.1
   3    35422.7   5953.9   3449.2   2300.6    505.1
   4    36895.5   5746.0   3344.4   2190.7    459.4
  4 Byte  Ops/ Repeat     MFLOPS Using Number Of Threads
   Words  Word Passes       4       8      16      32      64

  102400     2  75000   72164  112210  155132  158133  153968
 1024000     2   7500   58968  108429  119118  117709  122011
10240000     2    750   13337   13824   17251   60342  116964
  102400     8  75000  142554  210116  253359  270576  275220
 1024000     8   7500  144756  212406  233939  236110  242271
10240000     8    750   54170   54988   64520  174245  235583
  102400    32  75000  324843  312508  316881  318233  327762
 1024000    32   7500  324870  308995  310405  325996  327897
10240000    32    750  182219  204563  243408  301543  322605
The MP benchmark results can be used to represent multiple users running the same program or a single program executing multiple threads, each handling a dedicated segment of shared data. Again for the Core i5, MP Whetstone MFLOPS were similar for double and single precision versions, with little opportunity for vectorisation. The simpler Whetstone calculations demonstrate the benefit of hyperthreading with the 4 core, 8 thread throughput being nearly seven times faster than the standalone run. On the other hand, MP MFLOPS suffered from the i5 running at a lower MHz when four cores were being used, leading to 8 thread performance being less than four times faster than via 1 thread. This benchmark identified the highest Cray 1 performance gains of over 2600 times for single precision calculations, but half of this at double precision.
On cost/performance grounds, the Raspberry Pi 400 was better than the Core i5 laptop, in some of the early cases, but worse on others, then fell far behind on benchmarks that could benefit from compilation using Intel Advanced Vector instructions. Compared with the Cray 1, MP performance gains of up to 400 times were recorded.
Just considering performance of the Android phone, the more advanced ARM CPU used provided some significant gains over the Raspberry Pi, but lost the advantage, due to the big.LITTLE architecture, on running the MP MFLOPS 8 thread test. Still, the best Cray 1 performance gain was 757 times, through using multiple cores.
                                Core i5 AVX-512    Android Phone Raspberry Pi 400
                        Cray 1           X Cray           X Cray           X Cray

CPU MHz 1 Thread            80     4150      52     2000      25     1800      23
CPU MHz 8 Thread           N/A     3600            <1800             1800

1. Livermore Loops
MFLOPS Max                82.1    47692     581     1620      20     3353      41
MFLOPS Average            11.9     4273     359     1468     123      938      79

2. Linpack MFLOPS           27     9088     337     1986      74     1337      50

3. Whetstone MFLOPS          6     1353     226      905     151      573      96

4. Vector Whetstone
MFLOPS DP Average           47    20346     433                      1184      25
MFLOPS DP Maximum                 32907                              1216
MFLOPS SP Average                 28303     602                      2131      45
MFLOPS SP Maximum                 75094                              2506

5. MP Whetstone
MFLOPS DP Average            6     9126    1521
MFLOPS DP Maximum                  9210
MFLOPS SP Average                  9127    1521     4540     757     2402     400
MFLOPS SP Maximum                  9203             4985             3284

6. MP MFLOPS
MFLOPS DP 1 Thread         122    47200     387
MFLOPS DP 8 Thread               160641    1317
MFLOPS SP 1 Thread                94417     774    12178     100     7902      65
MFLOPS SP 8 Thread               325915    2671    35686     293    30150     247
CPU MHz - In a given processing architecture, performance is usually proportional to CPU MHz. This was clear in earlier times, when Pentium, Celeron and Xeon processors had the same core processor. The above benchmarks were run on a Core i5 with maximum turbo speed of 4150 MHz and an ARM CPU at 2000 MHz. The latest 2022 processors appear to be rated at up to 5500 MHz for PCs and 3000 MHz for ARM based phones. These would affect the single core benchmarks but not excessively.
Multiple Cores - At least for the laptop and phone used here, full benefits of multiple cores were not apparent. The laptop switched to a lower MHz and the maximum performance of the phone's 8 core big.LITTLE processor became not much better than the 4 core Raspberry Pi. Performance appears to be becoming even more unpredictable. The latest (that I have seen) is an Intel CPU with 24 threads over 16 cores, 8 at up to 5.1 GHz and 8 at up to 3.8 GHz. Then there is an ARM CPU that has 1 core at 3200 MHz, 3 at 2420 MHz and 4 at 1800 MHz.
More Advanced CPU Options - Some CPUs in the Core range have two 512-bit fused-multiply add (FMA) units that can, potentially, double SIMD performance of the right sort of application. Judging by the improvement in adopting a higher level of SIMD here and consideration of heating effects, I would not bet on it.
 AVX-512                                   AVX

L22:                                      L60:
 vmovupd      (%rax), %zmm0                vmovups      (%rax), %xmm1
 addq         $64, %rax                    vinsertf128  $0x1, 16(%rax), %ymm1, %ymm1
 vaddpd       %zmm0, %zmm28, %zmm31        addq         $32, %rax
 vaddpd       %zmm0, %zmm30, %zmm1         vaddps       -24(%rsp), %ymm1, %ymm0
 vmulpd       %zmm27, %zmm31, %zmm31       vmulps       8(%rsp), %ymm0, %ymm15
 vfmsub132pd  %zmm29, %zmm31, %zmm         vaddps       40(%rsp), %ymm1, %ymm0
 vaddpd       %zmm0, %zmm26, %zmm31        vmulps       72(%rsp), %ymm0, %ymm0
 vfmadd231pd  %zmm31, %zmm25, %zmm         vsubps       %ymm0, %ymm15, %ymm0
 vaddpd       %zmm24, %zmm0, %zmm31        vaddps       104(%rsp), %ymm1, %ymm15
 vfnmadd132pd %zmm23, %zmm1, %zmm3         vmulps       136(%rsp), %ymm15, %ymm15
 vaddpd       %zmm22, %zmm0, %zmm1         vaddps       %ymm15, %ymm0, %ymm0
 vfmadd231pd  %zmm21, %zmm1, %zmm3         vaddps       168(%rsp), %ymm1, %ymm15
 vaddpd       %zmm20, %zmm0, %zmm1         vmulps       -56(%rsp), %ymm15, %ymm15
 vfnmadd132pd %zmm19, %zmm31, %zmm         vsubps       %ymm15, %ymm0, %ymm0
 vaddpd       %zmm18, %zmm0, %zmm31        vaddps       %ymm14, %ymm1, %ymm15
 vfmadd231pd  %zmm17, %zmm31, %zmm         vmulps       -88(%rsp), %ymm15, %ymm15
 vaddpd       %zmm16, %zmm0, %zmm31        vaddps       %ymm15, %ymm0, %ymm0
 vfnmadd132pd %zmm15, %zmm1, %zmm3         vaddps       %ymm13, %ymm1, %ymm15
 vaddpd       %zmm14, %zmm0, %zmm1         vmulps       %ymm12, %ymm15, %ymm15
 vfmadd231pd  %zmm13, %zmm1, %zmm3         vsubps       %ymm15, %ymm0, %ymm0
 vaddpd       %zmm12, %zmm0, %zmm1         vaddps       %ymm11, %ymm1, %ymm15
 vaddpd       %zmm10, %zmm0, %zmm0         vmulps       %ymm10, %ymm15, %ymm15
 vfnmadd132pd %zmm11, %zmm31, %zmm         vaddps       %ymm15, %ymm0, %ymm0
 vfmadd132pd  %zmm9, %zmm1, %zmm0          vaddps       %ymm9, %ymm1, %ymm15
 vmovupd      %zmm0, -64(%rax)             vmulps       %ymm8, %ymm15, %ymm15
 cmpq         %rax, %rcx                   vsubps       %ymm15, %ymm0, %ymm0
 jne          .L22                         vaddps       %ymm7, %ymm1, %ymm15
                                           vmulps       %ymm6, %ymm15, %ymm15
                                           vaddps       %ymm15, %ymm0, %ymm0
                                           vaddps       %ymm5, %ymm1, %ymm15
                                           vaddps       %ymm3, %ymm1, %ymm1
                                           vmulps       %ymm4, %ymm15, %ymm15
                                           vsubps       %ymm15, %ymm0, %ymm0
                                           vmulps       %ymm2, %ymm1, %ymm15
                                           vaddps       %ymm15, %ymm0, %ymm0
                                           vmovups      %xmm0, -32(%rax)
                                           vextractf128 $0x1, %ymm0, -16(%rax)
                                           cmpq         %rdx, %rax
                                           jne          .L60
As indicated, there were differences in numeric results from the Core i5 laptop, with accuracy reducing from 15 or 16 decimal places to 12 or 13, using the AVX512 compile option. Apparently, there is only one rounding for fused operations, as opposed to one for each separate instruction.
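The effect of a single rounding can be illustrated without FMA hardware. The sketch below emulates a fused multiply-add by computing the exact result with Python's `fractions.Fraction` and rounding once at the end, then compares it with ordinary arithmetic that rounds after the multiply and again after the add. The input values are contrived for the demonstration and have nothing to do with the benchmark data.

```python
from fractions import Fraction


def separate_multiply_add(a, b, c):
    """Ordinary float arithmetic: the product is rounded, then the sum is rounded."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulated FMA: exact rational arithmetic, with one rounding on conversion."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Contrived inputs: the exact product is 1 - 2**-60, which rounds to 1.0 as a double.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

separate = separate_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print("separate roundings:", separate)
print("single rounding   :", fused)
```

Neither answer is wrong in itself; they differ because the rounding happens at different points, which is enough to change low order digits in accumulated checksums.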
Kernel       Floating Pt ops
No  Passes E No    Total      Secs.   MFLOPS Span     Checksums          OK

Earlier Compilation
 6   3 x 658  2  1.566566e+09  0.89  1751.62   64  4.375116344729986e+03 16
 7   4 x 529 16  6.737344e+09  0.89  7529.02  995  6.104251075174761e+04 16
18   2 x 703 44  6.124536e+09  0.89  6867.09  100  1.015727037502299e+05 15
Log Program report - Numeric results were as expected

AVX Compilation
 6   3 x 814  2  1.937971e+09  1.00  1929.85   64  4.375116344729986e+03 16
 7   4 x 616 16  7.845376e+09  1.00  7835.67  995  6.104251075174761e+04 16
18   2 x1711 44  1.490623e+10  1.00 14869.06  100  1.015727037502299e+05 15
Log Program report - Numeric results were as expected

AVX512 Compilation
 6   3 x 757  2  1.802266e+09  1.00  1802.82   64  4.375116344743195e+03 12
 7   4 x3738 16  4.760717e+10  1.00 47691.47  995  6.104251075174966e+04 13
18   2 x2393 44  2.084782e+10  1.00 20781.51  100  1.015727037502806e+05 12
Log Program report - Examples of different numeric results
Test  6 result was  4.375116344743195e+03 expected  4.375116344729986e+03
Test  7 result was  6.104251075174966e+04 expected  6.104251075174761e+04
Test 18 result was  1.015727037502806e+05 expected  1.015727037502299e+05
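The OK column gives a count of digits in agreement with the expected checksum. The exact rule used by the benchmark is not given here, so the sketch below is an assumption: it estimates the count as the rounded negative base 10 logarithm of the relative difference. That rule happens to reproduce the 12, 13 and 12 shown for the three AVX512 results, but it should not be read as the program's actual method.

```python
import math
from decimal import Decimal

# (kernel, AVX512 result, expected checksum), copied from the log as text.
AVX512_CHECKS = [
    (6, "4.375116344743195e+03", "4.375116344729986e+03"),
    (7, "6.104251075174966e+04", "6.104251075174761e+04"),
    (18, "1.015727037502806e+05", "1.015727037502299e+05"),
]


def digits_in_agreement(result, expected, full_precision=16):
    """Assumed metric: round(-log10(relative difference)); not the benchmark's own code."""
    r, e = Decimal(result), Decimal(expected)
    if r == e:
        return full_precision
    relative = abs((r - e) / e)
    return round(-math.log10(float(relative)))


digits = {kernel: digits_in_agreement(r, e) for kernel, r, e in AVX512_CHECKS}
for kernel, count in digits.items():
    print(f"Kernel {kernel:2d}: about {count} digits agree")
```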
Similar sumcheck variations were recorded on running the Linpack and Whetstone benchmarks on the Core i5 based laptop. In both cases, as for the Livermore Loops example, the errors were not reported running on older hardware or from alternative compilations.
Linpack AVX-512

Linpack Double Precision Unrolled Benchmark n @ 100
Optimisation AVX512 64 Bit, Tue Dec  7 11:38:24 2021

Speed    5151.83 MFLOPS

Variable norm. resid Non-standard result was 1.9 instead of 1.7
Variable resid       Non-standard result was 8.46778499e-14 instead of 7.41628980e-14
Variable x[0]-1      Non-standard result was -1.11799459e-13 instead of -1.49880108e-14
Variable x[n-1]-1    Non-standard result was -9.60342916e-14 instead of -1.89848137e-14

Whetstone SSE

Whetstone Double Precision SSE2 Benchmark Tue Jan 11 19:34:50 2022

Test 5 Non-standard result was 0.49902937281518372 instead of 0.49902937281518167

Log file result
Loop content          Result                 MFLOPS      MOPS   Seconds
N5 sin,cos etc.       0.49902937281518372             281.276     2.089
L.L.N.L. 'C' KERNELS: MFLOPS P.C. VERSION 4.0

Calculating outer loop overhead
     1000 times 0.00 seconds
    10000 times 0.00 seconds
   100000 times 0.00 seconds
  1000000 times 0.01 seconds
 10000000 times 0.04 seconds
 20000000 times 0.08 seconds
 40000000 times 0.16 seconds
 80000000 times 0.31 seconds
Overhead for each loop 3.9288e-09 seconds

Calibrating part 1 of 3

Loop count   4 0.00 seconds
Loop count  16 0.00 seconds
Loop count  64 0.00 seconds
Loop count 256 0.01 seconds

Loops 200 x 1 x Passes

Kernel          Floating Pt ops
No    Passes E No     Total   Secs.   MFLOPS Span      Checksums        OK
------------ -- ------------- ----- -------- ---- --------------------- --
 1   7 x1566  5  1.097296e+10  0.98 11171.25 1001 5.114652693224671e+04 16
 2  67 x 595  4  3.093524e+09  0.98  3164.40  101 1.539721811668385e+03 15
 3   9 x 657  2  2.367565e+09  0.95  2494.82 1001 1.000742883066363e+01 15
 4  14 x 728  2  2.446080e+09  0.96  2555.68 1001 5.999250595473891e-01 16
 5  10 x 234  2  9.360000e+08  0.95   980.20 1001 4.548871642387267e+03 16
 6   3 x 904  2  2.152243e+09  0.95  2276.20   64 4.375116344729986e+03 16
 7   4 x 975 16  1.241760e+10  1.03 12101.10  995 6.104251075174761e+04 16
 8  10 x 385 36  5.488560e+09  0.95  5788.45  100 1.501268005625795e+05 15
 9  36 x 536 17  6.626246e+09  0.96  6926.99  101 1.189443609974981e+05 16
10  34 x 456  9  2.818627e+09  0.95  2973.11  101 7.310369784325296e+04 16
11  11 x 565  1  1.243000e+09  0.95  1309.65 1001 3.342910972650109e+07 16
12  12 x1201  1  2.882400e+09  0.95  3030.87 1000 2.907141294167248e-05 16
13  36 x 177  7  5.709312e+08  0.95   600.71   64 1.202533961842805e+11 15
14   2 x 290 11  1.277276e+09  0.95  1347.14 1001 3.165553044000335e+09 15
15   1 x 660 33  2.178000e+09  0.96  2268.96  101 3.943816690352044e+04 15
16  25 x 768 10  2.035200e+09  0.94  2153.77   75 5.650760000000000e+05 16
17  35 x 368  9  2.341584e+09  0.96  2447.92  101 1.114641772902486e+03 16
18   2 x 733 44  6.385896e+09  0.97  6567.18  100 1.015727037502299e+05 15
19  39 x 215  6  1.016262e+09  0.95  1070.62  101 5.421816960147207e+02 16
20   1 x 187 26  9.724000e+08  0.95  1021.36 1000 3.040644339351239e+07 16
21   1 x 302  2  7.625500e+09  0.95  8021.31  101 1.597308280710199e+08 15
22  11 x 356 17  1.344754e+09  0.95  1416.60  101 2.938604376566697e+02 16
23   8 x 223 11  1.942776e+09  0.95  2045.20  100 3.549900501563623e+04 16
24   5 x1553  1  1.553000e+09  0.95  1637.44 1001 5.000000000000000e+02 16

Maximum Rate   12101.10
Average Rate    3557.12
Geometric Mean  2580.73
Harmonic Mean   1966.74
Minimum Rate     600.71
Do Span              471

Calibrating part 2 of 3

Loop count   8 0.00 seconds
Loop count  32 0.00 seconds
Loop count 128 0.00 seconds

Loops 200 x 2 x Passes

Kernel          Floating Pt ops
No    Passes E No     Total   Secs.   MFLOPS Span      Checksums        OK
------------ -- ------------- ----- -------- ---- --------------------- --
 1  40 x1061  5  8.572880e+09  0.98  8769.29  101 5.253344778937972e+02 16
 2  40 x 495  4  3.072960e+09  1.01  3046.39  101 1.539721811668385e+03 15
 3  53 x 595  2  2.548028e+09  1.00  2536.39  101 1.009741436578952e+00 16
 4  70 x 949  2  3.188640e+09  1.00  3194.69  101 5.999250595473891e-01 16
 5  55 x 247  2  1.086800e+09  1.00  1082.99  101 4.589031939600982e+01 16
 6   7 x 760  2  2.042880e+09  0.98  2081.44   32 8.631675645333210e+01 16
 7  22 x 858 16  1.220145e+10  0.99 12378.97  101 6.345586315784055e+02 16
|
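For several of the simpler kernels in the log above, the "Floating Pt ops Total" column appears to equal the loop multiplier from the "Loops 200 x N x Passes" line, times the two pass counts, times the operations per pass (E No), times the span. This is an inference from the printed numbers, not taken from the benchmark source, and it does not hold for every kernel (kernels 2, 4 and 21, for example, evidently use an effective loop length different from the printed span). The function name `kernel_total_ops` and its parameters are hypothetical. A minimal sketch of the arithmetic:

```c
#include <assert.h>

/* Hypothetical helper: total floating point operations for one kernel row,
   assumed to be loops x outer passes x inner passes x ops per pass x span.
   This matches rows such as kernels 1, 3 and 7, but not all kernels. */
double kernel_total_ops(int loops, int outer_passes, int inner_passes,
                        int ops_per_pass, int span)
{
    assert(loops > 0 && outer_passes > 0 && inner_passes > 0);
    assert(ops_per_pass > 0 && span > 0);
    /* Convert to double first so the product cannot overflow an int. */
    return (double)loops * outer_passes * inner_passes * ops_per_pass * span;
}
```

For kernel 1 in part 1, `kernel_total_ops(200, 7, 1566, 5, 1001)` gives 10972962000, which prints as the 1.097296e+10 shown. For kernel 7 in part 2, where the loop multiplier is 200 x 2, `kernel_total_ops(400, 22, 858, 16, 101)` gives 12201446400, matching 1.220145e+10. Dividing these totals by the printed seconds only approximates the MFLOPS column, because the seconds are shown rounded to two decimal places.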
Below, the Livermore Loops example shows the full displayed output for the kernel producing maximum MFLOPS, along with its source code, which has 16 floating point operations per pass, and the compile commands used. The SSE example indicates 3.56 MFLOPS per MHz, thought to be impossible without FMA. The AVX results provide 4.86 MFLOPS per MHz, 21.5% higher than the expected maximum.
The same range of results, source code and compile options is provided for the MP MFLOPS benchmark, running via a single thread. Looking at the details for the first word size, least likely to involve RAM data transfers, SSE at 12437 to 16602 MFLOPS equates to 3.00 to 4.00 per MHz and AVX at 14606 to 32420 MFLOPS to 3.53 to 7.81 per MHz. These ranges include the Livermore Loops ratios, but the larger ones are higher than might be expected using FMA, with the particular combination of instructions shown.
Unexpectedly high levels of performance were also produced on running the benchmarks on the much older Core i7 PC. The Livermore Loops SSE maximum was 3.05 MFLOPS per MHz, and the single thread DP MP-MFLOPS benchmark obtained 3.14 MFLOPS per MHz with SSE and 5.87 with AVX.
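The MFLOPS per MHz ratios quoted above are measured MFLOPS divided by the CPU clock speed, which is the number of floating point operations completed per clock cycle. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and the peak of 4.0 operations per cycle used in the usage note is an assumption inferred from the "21.5% higher" figure (4.86 / 4.0), not a value stated in the text.

```c
#include <assert.h>
#include <stdio.h>

/* MFLOPS divided by clock MHz: floating point operations per clock cycle. */
double mflops_per_mhz(double mflops, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return mflops / cpu_mhz;
}

/* Percentage by which a ratio exceeds an assumed peak ratio.
   The peak is supplied by the caller; it is an assumption, not a measurement. */
double percent_over_peak(double ratio, double assumed_peak)
{
    assert(assumed_peak > 0.0);
    return (ratio / assumed_peak - 1.0) * 100.0;
}

/* Print one comparison line for a named benchmark result. */
void report_ratio(const char *label, double mflops, double cpu_mhz,
                  double assumed_peak)
{
    double ratio = mflops_per_mhz(mflops, cpu_mhz);
    printf("%-14s %9.2f MFLOPS %5.2f per MHz %+6.1f%% vs assumed peak\n",
           label, mflops, ratio, percent_over_peak(ratio, assumed_peak));
}
```

Calling `report_ratio("Kernel 7 SSE2", 14782.74, 4150.0, 4.0)` and `report_ratio("Kernel 7 AVX", 20184.30, 4150.0, 4.0)` reproduces the 3.56 and 4.86 ratios for the 4150 MHz Core i5. With the unrounded AVX ratio the excess over the assumed peak comes out slightly above the 21.5% obtained from the rounded 4.86.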
4150 MHz Core i5 Livermore Loops Benchmark

Kernel          Floating Pt ops
No    Passes E No     Total   Secs.   MFLOPS Span      Checksums        OK
------------ -- ------------- ----- -------- ---- --------------------- --
SSE2
 7   4 x1037 16  1.320723e+10  0.89 14782.74  995 6.104251075174761e+04 16
AVX
 7   4 x1423 16  1.812333e+10  0.90 20184.30  995 6.104251075174761e+04 16

Kernel 7 C Code

  for ( k=0 ; k < n ; k++ )
  {
      x[k] = u[k] + r*( z[k] + r*y[k] ) +
             t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
             t*( u[k+6] + q*( u[k+5] + q*u[k+4] ) ) );
  }

Compiled With
  gcc lloops.c -O3 -msse2 -m64 -lrt -lc -lm -o lloopssse2
and
  gcc lloops.c -O3 -mavx -m64 -lrt -lc -lm -o lloopsavx

#####################################################

4150 MHz Core i5 MP DP MFLOPS Benchmark 1 Thread

Test             4 Byte  Ops/  Repeat    Seconds  MFLOPS     First  All
                  Words  Word  Passes                      Results Same
SSE2
Data in & out    102400     2   75000   1.234995   12437  0.414016  Yes
Data in & out   1024000     2    7500   3.085865    4978  0.812316  Yes
Data in & out  10240000     2     750   4.262126    3604  0.977908  Yes
Data in & out    102400     8   75000   3.723678   16500  0.563491  Yes
Data in & out   1024000     8    7500   4.814260   12762  0.883058  Yes
Data in & out  10240000     8     750   5.615416   10941  0.986707  Yes
Data in & out    102400    32   75000  14.803324   16602  0.353716  Yes
Data in & out   1024000    32    7500  15.063927   16314  0.723569  Yes
Data in & out  10240000    32     750  15.063069   16315  0.964957  Yes
AVX
Data in & out    102400     2   75000   1.051636   14606  0.414016  Yes
Data in & out   1024000     2    7500   2.418388    6351  0.812316  Yes
Data in & out  10240000     2     750   4.170949    3683  0.977908  Yes
Data in & out    102400     8   75000   1.890234   32504  0.563491  Yes
Data in & out   1024000     8    7500   3.183412   19300  0.883058  Yes
Data in & out  10240000     8     750   5.054079   12157  0.986707  Yes
Data in & out    102400    32   75000   7.580423   32420  0.353716  Yes
Data in & out   1024000    32    7500   7.873082   31215  0.723569  Yes
Data in & out  10240000    32     750   8.061002   30488  0.964957  Yes

C Function Code 8 Operations per Word

  for(i=0; i < n; i++) x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;

Compiled With
  gcc mpmflops2dp.c -lpthread -msse2 -lrt -lc -lm -O3 -o MPmflops64SSE2DP
and
  gcc mpmflops2dp.c -lpthread -mavx -lrt -lc -lm -O3 -o MPmflops64AVXDP
|
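The MFLOPS column of the MP MFLOPS results above can be recomputed from the other columns as words times operations per word times repeat passes, divided by the run time. The sketch below shows that calculation. The function name `mp_mflops` is hypothetical, and the formula is inferred from the printed figures rather than taken from the benchmark source.

```c
#include <assert.h>

/* Hypothetical helper: MFLOPS for one MP MFLOPS result row, assumed to be
   (words x operations per word x repeat passes) / seconds / 1 million. */
double mp_mflops(double words, double ops_per_word, double repeat_passes,
                 double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * repeat_passes / seconds / 1.0e6;
}
```

For the first SSE2 row, `mp_mflops(102400, 2, 75000, 1.234995)` gives approximately 12437, and for the AVX row with 8 operations per word, `mp_mflops(102400, 8, 75000, 1.890234)` gives approximately 32504, both agreeing with the log.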