Roy Longbottom's Raspberry Pi MultiThreading
General
Roy Longbottom's PC Benchmark Collection comprises numerous FREE benchmarks and reliability testing programs, for processors, caches, memory, buses, disks, flash drives, graphics, local area networks and Internet. Original ones run under DOS and later ones under all varieties of Windows. Most have also been converted to run under Linux on PCs, and many for ARM CPUs via Android and Raspbian. For the latter, details, benchmark execution files, source code download links and results are provided in Raspberry Pi Benchmarks.htm, with downloads for the multithreading program codes in Raspberry_Pi_MP_Benchmarks.zip.
The ARM benchmarks use the same multithreading programming code as my Linux MultiThreading Benchmarks. Each of these programs obtains the same configuration details as the Android versions but, unlike with Android, results are saved in a text log file, besides being displayed. An example of the Raspberry Pi details obtained is shown below. When more than one CPU core is provided, separate details are normally shown for each one, labelled Processor 0, Processor 1 etc. At the end of each benchmark, any appropriate additional information can be entered from the keyboard.
The C program codes used for the RPi were also compiled on a Linux based PC, the only change being for the version name (to Linux/Intel from Linux/ARM). This Intel version is included in the zip file. Results below include those, from this version, on an Intel Atom CPU and a quad core AMD Phenom processor.
Results are now included for Raspberry Pi 2, which has a quad core ARM V7 processor. The original benchmarks were run along with revised versions (MP-xxxxPiA7), compiled with gcc 4.8 to use advanced hardware features identified in cpuinfo details. The new benchmarks are included in the zip file. An example of the compile command that uses the new features is shown below. This also includes -funsafe-math-optimizations, which can produce incorrect results. For these benchmarks, it leads to acceptable minor rounding differences.
2016 - The latest benchmarks were run on Raspberry Pi 3 Model B, which includes a quad core Broadcom BCM2837 system-on-chip running at 1200 MHz, each core having a 32 KB L1 cache. There is a shared 512 KB L2 cache and 1 GB RAM. The CPU is an ARM Cortex-A53, capable of 64 bit working, but only 32 bit operation is presently supported.
A different graphics driver had to be installed to run a new OpenGL GLUT benchmark. In certain cases, benchmarks were rerun with this driver disabled, as it could lead to degraded CPU performance.
CPU, Cache and RAM MFLOPS Benchmarks
This benchmark executes the same functions as my CUDA and OpenMP performance tests. Details and results of these can be found in Linux CUDA MFLOPS.htm and OpenMP Speeds.htm.
The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The program can be run using between 1 and 64 threads and checks for consistent numeric results, primarily to show that all calculations are carried out. Each thread uses the same calculations but accesses a different segment of the data.
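The calculation and the per-thread data split described above can be sketched in C as follows. This is a minimal illustration, not the benchmark source: the names Coeffs, ops2, ops8, segment and mflops are hypothetical, the coefficient values are left to the caller, and the 2 operation form (one add, one multiply) is an assumption based on the formula quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the constants a..f used in the formula. */
typedef struct { float a, b, c, d, e, f; } Coeffs;

/* Assumed 2 operations per word: one add and one multiply. */
static void ops2(float *x, size_t n, const Coeffs *k)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + k->a) * k->b;
}

/* 8 operations per word, in the form quoted in the text
   (3 adds, 3 multiplies, 1 subtract, 1 add). */
static void ops8(float *x, size_t n, const Coeffs *k)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + k->a) * k->b - (x[i] + k->c) * k->d
             + (x[i] + k->e) * k->f;
}

/* Start index and length of the segment handled by thread t of nthreads.
   The last thread also takes any remainder words. */
static void segment(size_t words, unsigned nthreads, unsigned t,
                    size_t *start, size_t *len)
{
    assert(nthreads > 0 && t < nthreads);
    size_t base = words / nthreads;
    *start = (size_t)t * base;
    *len = (t == nthreads - 1) ? words - *start : base;
}

/* MFLOPS from total floating point operations and elapsed seconds. */
static double mflops(size_t words, unsigned ops_per_word,
                     unsigned passes, double seconds)
{
    assert(seconds > 0.0);
    return (double)words * ops_per_word * passes / seconds / 1.0e6;
}
```

In a threaded run, each thread would call segment() with its own index and then apply ops2() or ops8() to x + start for len words, so no two threads write the same element.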
  MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 17:41:13 2013

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

                 2 Ops/Word                32 Ops/Word
  KB        12.8     128   12800      12.8     128   12800
  MFLOPS
  1T          43      33      31       191     170     161
  2T          44      42      31       192     174     160
  4T          44      43      31       192     176     159
  8T          43      51      31       192     184     160
  Results x 100000
  1T       86735   98519   99984     79897   97638   99975
  2T       86735   98519   99984     79897   97638   99975
  4T       86735   98519   99984     79897   97638   99975
  8T       86735   98519   99984     79897   97638   99975

           End of test Sat Jul 27 17:42:00 2013

  Later    76406   97075   99969     66015   95363   99951
  Neon     76406   97075   99969     66008   95367   99951
  DP       76384   97072   99969     66065   95370   99951
Although these Raspberry Pi MFLOPS speeds are quite impressive, they are nowhere near the claimed maximum capabilities. For example, RPi 3 single precision is claimed at 38.4 GFLOPS (32 MFLOPS/MHz), with double precision at a quarter of these speeds, whereas the first maximum RPi 3 NEON-VFP results were 6.03 GFLOPS SP and 2.3 GFLOPS DP, the former at 5.0 MFLOPS/MHz.
On the other hand, the same source code, compiled for Intel CPUs with GCC, obtained 23 out of 32 MFLOPS/MHz with SSE instructions and 45.6 out of 64 MFLOPS/MHz with AVX 1 options. This was on a Core i7 CPU. See Linux MP-MFLOPS Benchmarks.
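The MFLOPS/MHz figures quoted above are simple ratios of measured or claimed speed to CPU clock. The following C sketch only restates that arithmetic; the function names are hypothetical and the 1200 MHz clock is the RPi 3 figure given earlier.

```c
#include <assert.h>

/* Measured (or claimed) MFLOPS divided by CPU clock in MHz. */
static double mflops_per_mhz(double mflops, double mhz)
{
    assert(mhz > 0.0);
    return mflops / mhz;
}

/* Peak MFLOPS implied by a claimed MFLOPS/MHz rate. */
static double peak_mflops(double mhz, double claimed_per_mhz)
{
    return mhz * claimed_per_mhz;
}

/* Fraction of the claimed rate actually achieved. */
static double efficiency(double achieved_per_mhz, double claimed_per_mhz)
{
    assert(claimed_per_mhz > 0.0);
    return achieved_per_mhz / claimed_per_mhz;
}
```

For example, peak_mflops(1200, 32) gives the 38400 MFLOPS (38.4 GFLOPS) claim, mflops_per_mhz(6030, 1200) gives about 5.0, and efficiency(23, 32) gives about 0.72 for the SSE result on the Core i7.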
Some of the instructions generated by the compiler, for the Raspberry Pi, are shown below, with some explanation.
Below are performance results for the RPi, at normal MHz settings and with maximum overclocking. Speed improvements, due to the latter, are approximately proportional to differences in CPU and SDRAM MHz. Results from the Android version, running on a four core CPU, are also provided. This shows speed gains of up to four times that for a single core but, in this case, needs eight threads to do it.
The Atom has Hyperthreading that allows more than one thread to run at the same time on a single core CPU. Results indicate that performance is mainly dependent on CPU speed, whereas there is some degradation due to cache and RAM speeds on the other systems. The quad core AMD CPU speed using L2 cache can be faster than with data from L1 cache, probably due to some conflict on storing results of calculations. Note that Intel numeric results are slightly different to those from ARM CPUs.
Comparing MP-MFLOPS speeds between the old RPi and Raspberry Pi 2 (1 core vs 4 cores), all at 1000 MHz, shows performance increases of 8 to 10 times at 2 operations per word and 6 to 7 times with the higher instruction count. This benchmark only uses single precision floating point arithmetic. As for other benchmarks, the new V7A compilation produced essentially the same MFLOPS speeds as the original MP-MFLOPS program. Running time on the RPi 2 was rather short, so the run pass parameter was increased for a longer run. As expected, this led to slightly different numeric results (see below).
The program was recompiled including the -funsafe-math-optimizations parameter, to force the use of NEON instructions, as MP-NeonMFLOPS. The V7A2 NEON entry below shows results at 1000 MHz, achieving a peak performance of around 3 GFLOPS. With the lower processing per word tests, average speed gains of 24 times were demonstrated, for cache based data, compared with the original RPi. The revised compilation included fused multiply accumulate instructions, where slightly different answers can be produced (see @@@@@ below).
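The reason fused multiply accumulate can change the answers is that the product is not rounded before the addition. The following C sketch illustrates this under stated assumptions: the function names are hypothetical, and the fused result is emulated with a double precision intermediate rather than produced by a NEON instruction, so it shows the rounding effect only, not the benchmark's generated code.

```c
#include <assert.h>
#include <float.h>

/* Multiply then add, rounding to single precision after each step. */
static float mul_add_separate(float a, float b, float c)
{
    volatile float product = a * b;  /* volatile forces the intermediate rounding */
    return product + c;
}

/* Emulated fused multiply-add: the product of two floats is exact in
   double, so the multiplication adds no rounding before the final
   conversion to float. */
static float mul_add_fused(float a, float b, float c)
{
    /* Exactness of the product relies on double having at least twice
       the mantissa bits of float. */
    assert(DBL_MANT_DIG >= 2 * FLT_MANT_DIG);
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the separate version returns 0 because the product is rounded to 1 + 2^-11 before the add, while the fused version keeps the low order bit and returns 2^-24.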
An earlier Android benchmark executes the same functions as MP-MFLOPS, but using NEON intrinsic functions. This was also converted for the RPi 2 - see MP-NeonMFLOPS, where results are almost the same as the NEON compiled version.
Raspberry Pi 3 - With a CPU MHz 1.33 times that of the Raspberry Pi 2, some MP-MFLOPS benchmark speeds did not improve in proportion. That was the case with MP-MFLOPSPiA7 at 2 operations per word, but it averaged 55% faster at 32 operations per word. MP-MFLOPSPiNeon was much better, with average performance ratios of 1.34 and 2.30 for the two sets of tests. It was then much faster than MP-MFLOPSPiA7, more than twice as fast with 32 calculations per word and up to 4.66 times faster from cached data at 2 per word (see assembly code below).
MP-MFLOPSDP, the double precision compilation of MP-MFLOPSPiA7, used the same instructions as the single precision version, but operating on 64 bit registers. Speeds were effectively the same, except that some tests were slower with RAM based data.
pi@raspberrypi ~/benchmarks/mpmflops $ ./MP-MFLOPS V7A ./MP-MFLOPSPiA7 ***************************************************** Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 17:41:13 2013 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 43 33 31 191 170 161 2T 44 42 31 192 174 160 4T 44 43 31 192 176 159 8T 43 51 31 192 184 160 Results x 100000 1T 86735 98519 99984 79897 97638 99975 End of test Sat Jul 27 17:42:00 2013 ############################ RPi OC ############################## Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 18:45:14 2013 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 49 58 45 278 255 237 2T 72 60 46 278 262 240 4T 72 62 46 279 265 239 8T 72 70 46 279 225 234 Results x 100000 1T 86735 98519 99984 79897 97638 99975 End of test Sat Jul 27 18:45:46 2013 ########################### RPi 2 ############################### Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-MFLOPS Linux/ARM v1.0 Mon Mar 2 17:14:57 2015 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 149 147 130 410 395 387 2T 298 291 254 820 803 782 4T 526 409 393 1519 1622 1456 8T 494 486 335 1581 1518 1436 Results x 100000 1T 86735 98519 99984 79897 97638 99975 End of test Mon Mar 2 17:15:07 2015 ######################### RPi 2 OC ############################ Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-MFLOPS Linux/ARM v1.0 Wed Mar 4 10:24:36 2015 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 136 161 144 452 451 433 2T 328 322 282 904 900 862 4T 593 546 449 1739 1790 1711 8T 543 537 437 1588 1679 1578 Results x 100000 1T 86735 98519 99984 79897 97638 99975 End of test Wed Mar 4 10:24:45 2015 
######################### RPi 2 V7A ############################ Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-MFLOPS Linux/ARM V7A v1.0 Sun Mar 15 12:50:06 2015 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 113 158 138 438 437 417 2T 321 315 266 884 873 831 4T 424 611 343 1706 1731 1629 8T 560 512 332 1579 1622 1520 Results x 100000 1T 86735 98519 99984 79897 97639 99975 End of test Sun Mar 15 12:50:15 2015 ##################### V7A2 Increased Passes #################### ######################### RPi 2 V7A2 ########################### Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-MFLOPS Linux/ARM V7A v1.0 Fri Mar 20 16:59:26 2015 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 158 158 136 424 435 414 2T 322 314 264 875 868 824 4T 528 533 394 1731 1744 1612 8T 549 505 392 1639 1629 1518 Results x 100000 1T 76406 97075 99969 66015 95363 99951 End of test Fri Mar 20 16:59:44 2015 ################## RPi 2 V7A2 Compiled NEON #################### Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-MFLOPS Compiled NEON v1.0 Tue Aug 16 11:18:32 2016 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 357 451 337 690 688 657 2T 885 769 426 1355 1354 1315 4T 1320 1747 382 2700 2721 2552 8T 1391 1405 381 2548 2653 2446 Results x 100000 1T 76406 97075 99969 66008 95367 99951 End of test Tue Aug 16 11:18:43 2016 ######################## RPi 2 V7A2 OC ######################### Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-MFLOPS Linux/ARM V7A v1.0 Fri Mar 20 17:17:44 2015 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 138 177 158 488 487 454 2T 359 352 308 976 968 922 4T 552 627 465 1939 1906 1760 8T 554 586 453 1763 1830 1779 Results x 100000 1T 76406 97075 
99969 66015 95363 99951 End of test Fri Mar 20 17:18:00 2015 ################ RPi 2 V7A2 Compiled NEON OC ################### Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-MFLOPS Compiled NEON v1.0 Fri Mar 20 17:18:25 2015 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 369 504 386 769 766 736 2T 1052 947 486 1521 1513 1440 4T 2052 2023 470 3040 2917 2854 8T 1764 1920 459 2860 2883 2597 Results x 100000 1T 76406 97075 99969 66008 95367 99951 @@@@@ @@@@@ End of test Fri Mar 20 17:18:35 2015 ######################### RPi 3 V7A2 ########################### Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-MFLOPS Linux/ARM V7A v1.0 Mon Aug 15 19:07:03 2016 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 168 182 171 691 693 684 2T 364 358 329 1382 1381 1358 4T 408 484 401 2451 2561 2436 8T 609 554 420 2531 2425 2315 Results x 100000 1T 76406 97075 99969 66015 95363 99951 End of test Mon Aug 15 19:07:15 2016 ########## RPi 3 V7A2 New OpenGL GLUT Driver Disabled ########## Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-MFLOPS Linux/ARM V7A v1.0 Tue Aug 30 14:16:59 2016 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 159 181 178 690 692 685 2T 342 364 353 1384 1386 1368 4T 466 501 456 2451 2473 2633 8T 581 643 479 2618 2502 2550 Results x 100000 1T 76406 97075 99969 66015 95363 99951 End of test Tue Aug 30 14:17:11 2016 ################# RPi 3 V7A2 Double Precision ################## Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-MFLOPS Double Precision v1.0 Wed Sep 7 17:07:12 2016 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 143 182 171 678 680 674 2T 343 361 240 1360 1360 1335 4T 441 712 240 2232 2208 2185 8T 406 593 241 2345 2315 2272 Results x 100000 1T 76384 97072 99969 66065 95370 
99951 End of test Wed Sep 7 17:07:18 2016 ################## RPi 3 V7A2 Compiled NEON #################### Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 419 782 437 1672 1660 1637 2T 1324 1529 442 3331 3308 3212 4T 1903 1574 439 5040 6073 5738 8T 1613 2204 433 5543 5780 5445 Results x 100000 1T 76406 97075 99969 66008 95367 99951 End of test Mon Aug 15 19:09:52 2016 ####### RPi 3 V7A 2 NEON New OpenGL GLUT Driver Disabled ####### MP-MFLOPS Compiled NEON v1.0 Tue Aug 30 14:18:13 2016 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 488 774 485 1674 1652 1644 2T 1438 1503 488 3341 3299 3262 4T 1984 1703 472 5045 5125 5256 8T 1567 2098 470 5527 5400 5021 Results x 100000 1T 76406 97075 99969 66008 95367 99951 End of test Tue Aug 30 14:18:18 2016 ########################## Other ############################### P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4 Android MP-MFLOPS v7 Benchmark V1.0 23-Dec-2012 14.12 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 208 188 172 588 675 643 2T 392 375 302 1323 1342 1311 4T 472 439 321 1824 1758 1645 8T 619 608 381 2666 2537 2645 Total Elapsed Time 6.7 seconds ***************************************************** Intel Atom 1.66 GHz, Linux Ubuntu 10.10 MP-MFLOPS Linux/Intel v1.0 Sat Jul 27 18:18:15 2013 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 207 206 201 406 404 404 2T 303 363 354 799 793 783 4T 330 367 357 798 795 788 8T 321 366 354 796 793 788 Results x 100000 1T 86723 98518 99984 79927 97642 99975 End of test Sat Jul 27 18:18:26 2013 ***************************************************** Quad Core AMD Phenom 3.0 GHz, 
Linux Ubuntu 12.04 MP-MFLOPS Linux/Intel v1.0 Tue Jul 30 15:12:02 2013 FPU Add & Multiply using 1, 2, 4 and 8 Threads 2 Ops/Word 32 Ops/Word KB 12.8 128 12800 12.8 128 12800 MFLOPS 1T 1987 1977 1585 3827 3826 3732 2T 3835 3975 2527 7631 7654 7442 4T 6723 7873 2932 10822 14463 13728 8T 5890 7659 5497 10300 14452 14006 Results x 100000 1T 86723 98518 99984 79927 97642 99975 End of test Tue Jul 30 15:12:03 2013 |
The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Details and performance of early to modern systems can be found in Whetstone Benchmark History And Results and Results On PCs. The overall performance rating is in Millions of Whetstone Instructions Per Second (MWIPS). Later, it was found necessary to measure the speed of the eight different test functions used, to demonstrate that compilers were not over optimising and to allow code tweaks to avoid this situation. The additional measurements are in terms of Millions of Operations Per Second (MOPS) or MFLOPS for straight floating point calculations. As the design authority, nominated by the original author, I have to say that versions that do not provide these separate measurements cannot be taken as valid.
This multithreading benchmark runs using 1, 2, 4 and 8 threads, executing multiple copies of the same program. An initial calibration, using a single thread, determines the number of passes needed for an overall execution time of 5 seconds. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread but using dedicated variables. The numeric results of calculations are noted for the first thread with others checked for the same values and an error message displayed if they are inconsistent.
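The calibration and consistency check described above can be sketched in C as follows. This is an illustration under assumptions, not the benchmark source: the function names are hypothetical, and the scaling is a simple proportion from one timed single thread trial.

```c
#include <assert.h>

/* Scale a timed single thread trial up to the pass count expected to
   run for target_seconds (5 seconds in the text). */
static unsigned long passes_for_target(unsigned long trial_passes,
                                       double trial_seconds,
                                       double target_seconds)
{
    assert(trial_passes > 0 && trial_seconds > 0.0 && target_seconds > 0.0);
    double scaled = (double)trial_passes * target_seconds / trial_seconds;
    return scaled < 1.0 ? 1UL : (unsigned long)scaled;
}

/* Compare each thread's numeric result with that of the first thread.
   Returns the index of the first inconsistent thread, or -1 if all match. */
static int first_inconsistent(const double *results, int nthreads)
{
    assert(nthreads > 0);
    for (int t = 1; t < nthreads; t++)
        if (results[t] != results[0])
            return t;
    return -1;
}
```

Because every thread then uses the same pass count, a run with more threads than cores takes proportionally longer, as the Overall Seconds figures in the logs show.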
Displayed speeds are in the order that tests are run but are sorted for logged results, as shown below.
Relative performance due to overclocking is similar to MP-MFLOPS, an exception being fixed point calculations, where the particular compiler might have optimised the code too much, and many more passes could be needed to produce consistent speeds. The Galaxy SIII and AMD Phenom are also more inclined to achieve a four times performance gain with quad cores. The Atom hyperthreading shows improved throughput with multiple threads on all tests.
Running the original benchmark on the Raspberry Pi 2 shows a performance increase between 1.3 and 2.1 times, on the different tests, on a single core at 1000 MHz. With multithreading, this leads to a 10.2 times increase on MFLOPS, 5.8 times on functions and 7.5 times on integer MOPS. The revised PiA7 compilation generates slower code on some tests but the three quoted ratios increase to 12.8, 6.1 and 8.2 times.
Raspberry Pi 3 - Overall MWIPS ratings are 1.37 times RPi 2 speeds, with ratios for other tests in the range 1.19 to 1.79, except for the last copy test, where the average is 2.73.
pi@raspberrypi ~/benchmarks/mpwhetss $ ./MP-WHETS V7A ./MP-WHETSPiA7 ***************************************************** Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz MP-Whetstone Benchmark Linux/ARM v1.0 Sat Jul 27 17:44:25 2013 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 243.6 91.9 90.9 84.4 4.9 2.7 2143.6 471.5 114.9 2T 256.8 61.2 90.0 82.3 5.6 2.7 2201.7 496.9 120.4 4T 258.5 74.5 96.0 84.2 5.6 2.7 2272.7 501.5 118.7 8T 258.5 84.2 95.2 85.1 5.6 2.7 2774.6 522.9 106.0 Overall Seconds 3.26 1T, 6.34 2T, 12.57 4T, 25.80 8T ############################ RPi OC ############################## Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts MP-Whetstone Benchmark Linux/ARM v1.0 Sat Jul 27 18:41:42 2013 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 373.6 136.0 135.3 122.5 8.2 3.8 3271.4 705.5 180.1 2T 375.2 122.5 133.4 118.4 8.2 3.9 3247.6 733.4 180.0 4T 377.0 132.2 139.3 122.3 8.2 3.9 6267.6 737.9 172.3 8T 377.0 135.1 140.6 123.2 8.2 3.9 4585.5 749.7 162.5 Overall Seconds 3.52 1T, 7.07 2T, 14.23 4T, 28.83 8T ########################### RPi 2 ############################### Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-Whetstone Benchmark Linux/ARM v1.0 Tue Mar 3 16:37:24 2015MP-WHETSPiA7 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 515.4 250.2 250.0 223.0 10.0 5.1 4421.6 891.1 334.5 2T 1035.2 500.9 501.9 447.7 20.0 10.2 8878.8 1789.0 671.3 4T 2063.4 960.6 996.0 893.6 39.9 20.5 17560.2 3559.3 1334.9 8T 2140.9 1192.4 1325.4 992.3 40.3 21.2 24312.0 3968.1 1379.2 Overall Seconds 4.98 1T, 4.98 2T, 5.06 4T, 10.11 8T ######################### RPi 2 OC ############################# Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-Whetstone Benchmark Linux/ARM v1.0 Wed Mar 4 10:34:01 2015 Using 1, 2, 4 and 8 Threads 
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 577.6 280.2 280.1 249.7 11.2 5.7 4961.5 998.2 374.7 2T 1155.5 560.1 560.1 499.4 22.4 11.4 9915.1 1995.8 749.3 4T 2290.3 1080.3 1110.8 994.2 44.4 22.8 13471.3 3642.0 1491.6 8T 2405.6 1506.5 1490.2 1103.8 45.9 23.5 28234.7 5151.7 1552.5 Overall Seconds 4.74 1T, 4.74 2T, 4.84 4T, 9.82 8T ######################### RPi 2 V7A ############################# Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-Whetstone Benchmark Linux/ARM V7A v1.0 Tue Mar 3 16:44:08 2015 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 527.8 361.8 362.9 184.2 10.0 5.6 3316.1 889.2 445.2 2T 1056.9 724.9 729.2 368.6 20.0 11.2 6638.7 1779.1 891.6 4T 2119.0 1381.1 1454.5 739.2 40.1 22.5 13301.0 3571.3 1788.4 8T 2195.2 1912.9 1849.8 805.7 40.8 23.1 17643.5 4808.5 1893.6 Overall Seconds 4.70 1T, 4.70 2T, 4.75 4T, 9.56 8T ######################## RPi 2 V7A OC ########################### Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-Whetstone Benchmark Linux/ARM V7A v1.0 Tue Mar 3 17:54:38 2015 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 593.5 409.5 409.3 206.7 11.3 6.3 3729.2 998.2 499.6 2T 1181.3 814.4 801.0 411.7 22.4 12.5 7423.4 1988.4 994.7 4T 2351.2 1486.6 1527.9 813.0 44.7 25.0 14800.0 3825.0 1989.5 8T 2452.9 2199.5 2099.1 890.8 45.3 26.1 21104.2 5439.9 2084.2 Overall Seconds 4.98 1T, 5.03 2T, 5.10 4T, 10.26 8T ######################### RPi 3 V7A ############################# Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-Whetstone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:34:21 2016 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 723.1 517.2 517.0 254.9 12.1 8.8 5853.9 1181.8 1189.8 2T 1464.7 960.5 1025.1 511.3 24.1 18.5 11899.0 2381.2 2385.7 4T 2902.3 1696.4 1867.3 1013.4 47.8 
36.8 19754.6 4541.3 4687.1 8T 3004.0 2747.8 2569.0 1066.4 48.6 38.0 25502.9 6075.2 5610.8 Overall Seconds 4.77 1T, 4.74 2T, 4.88 4T, 9.76 8T ########################## Other ################################# P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4 Android MP-Whetstone Benchmark V1.0 23-Dec-2012 14.36 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 1206.4 266.3 269.4 310.1 30.1 17.6 522.8 551.8 597.9 2T 2411.7 520.5 530.0 619.1 60.0 35.1 1026.4 1359.2 1195.9 4T 4719.0 874.2 881.7 1231.1 119.1 69.6 2072.8 2779.4 2369.0 8T 4676.4 1227.1 1105.1 1182.4 120.0 63.2 2254.4 2821.8 2299.5 Overall Seconds 4.84 1T, 4.82 2T, 5.14 4T, 10.25 8T ***************************************************** Intel Atom 1.66 GHz, Linux Ubuntu 10.10 MP-Whetstone Benchmark Linux/Intel v1.0 Sat Jul 27 18:08:28 2013 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 704.4 329.0 328.9 280.8 17.1 7.6 763.7 1248.6 117.7 2T 1203.2 562.1 613.8 484.0 30.4 13.4 997.5 1688.6 176.8 4T 1203.8 605.2 618.4 477.3 30.4 13.4 993.0 1688.2 178.0 8T 1206.9 608.1 619.8 486.1 30.3 13.4 1008.3 1702.2 177.9 Overall Seconds 4.99 1T, 6.28 2T, 12.48 4T, 24.93 8T ***************************************************** Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04 MP-Whetstone Benchmark Linux/Intel v1.0 Tue Jul 30 15:11:23 2013 Using 1, 2, 4 and 8 Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal 1 2 3 MOPS MOPS MOPS MOPS MOPS 1T 2662.8 924.4 926.2 695.0 64.3 34.7 3088.3 2258.4 620.7 2T 5304.1 1850.8 1850.9 1387.8 128.4 69.1 6210.0 4507.3 1200.1 4T 10582.7 3551.7 3668.0 2771.6 256.7 138.0 12173.2 8966.3 2399.4 8T 10637.9 3758.1 3754.6 2772.7 257.2 138.5 12389.8 9104.7 2441.1 Overall Seconds 4.90 1T, 4.98 2T, 5.04 4T, 9.94 8T |
The Dhrystone "C" benchmark provides a measure of integer performance (no floating point instructions). It became the key standard benchmark from 1984. Speed was originally measured in Dhrystones per second. This was later changed to VAX MIPS by dividing Dhrystones per second by 1757, the DEC VAX 11/780 result, the latter being regarded as the first 1 MIPS minicomputer. Details and results from Windows and Linux based PCs can be found in Dhrystone Results.htm.
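The conversion to VAX MIPS is a single division, sketched below in C. The function names are hypothetical; only the divisor 1757 comes from the text.

```c
#include <assert.h>

/* Dhrystones per second on the DEC VAX 11/780, taken as 1 MIPS. */
#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0

/* Dhrystones per second from a pass count and elapsed time. */
static double dhrystones_per_second(unsigned long passes, double seconds)
{
    assert(seconds > 0.0);
    return (double)passes / seconds;
}

/* VAX MIPS rating from Dhrystones per second. */
static double vax_mips(double dhrys_per_sec)
{
    return dhrys_per_sec / VAX_11_780_DHRYSTONES_PER_SEC;
}
```

For the single thread RPi result of 1650351 Dhrystones per second, vax_mips() gives about 939.3, matching the logged rating of 939.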
This multithreading benchmark runs using 1, 2, 4 and 8 threads, executing multiple copies of the same program. An initial calibration, using a single thread, determines the number of passes needed for an overall execution time of 1 second. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread.
Some variables can be used by all threads and it might be foreseen that this could cause the program to crash. Data arrays have been moved so that different RAM will be allocated for each thread. One of the locations is used to count the number of passes used by each thread and these are checked for consistency.
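One way to picture the per-thread data and the pass count check is the C sketch below. It is an assumption about layout, not the benchmark's actual structures: ThreadData, ARRAY_WORDS, alloc_thread_data and pass_counts_correct are hypothetical names, and the array size is arbitrary.

```c
#include <assert.h>
#include <stdlib.h>

#define ARRAY_WORDS 50  /* illustrative size, not taken from the benchmark */

/* Working data owned by one thread, so threads do not share arrays. */
typedef struct {
    int data[ARRAY_WORDS];     /* private working array */
    unsigned long pass_count;  /* passes completed by this thread */
} ThreadData;

/* Allocate zeroed, separate data for each thread. Caller frees. */
static ThreadData *alloc_thread_data(int nthreads)
{
    assert(nthreads > 0);
    return calloc((size_t)nthreads, sizeof(ThreadData));
}

/* Returns 1 if every thread reports the expected number of passes. */
static int pass_counts_correct(const ThreadData *td, int nthreads,
                               unsigned long expected)
{
    for (int t = 0; t < nthreads; t++)
        if (td[t].pass_count != expected)
            return 0;
    return 1;
}
```

A successful check corresponds to the "Internal pass count correct all threads" line in the logs.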
  MP-Dhrystone Benchmark Linux/ARM v1.0 Fri Jul 26 12:25:24 2013

                    Using 1, 2, 4 and 8 Threads

  Threads     Dhrys/sec   VAX MIPS
        1       1650351        939
        2       1547631        881
        4       1594706        908
        8       1619087        922

          Internal pass count correct all threads

           End of test Fri Jul 26 12:25:40 2013
RasPi performance improvement due to overclocking is again proportional to CPU MHz. The Android quad core phone shows limited performance gains of up to 2.63 times, possibly a shared data effect. Again, the Atom shows performance gains due to Hyperthreading.
Worst comparisons are on the Phenom PC, where performance using two threads is a lot slower than one thread, probably due to a conflict in updating results.
These Raspberry Pi 2 results also show multithreading performance degradations, through handling the shared data, with wide variations in measured speed. A 1000 MHz single core produces a 40% improvement in performance, compared with RPi 1 at the same frequency, with no gain via the PiA7 recompilation.
Raspberry Pi 3 - Performance using a single thread is 1.43 times that of the model 2, little more than the CPU MHz ratio of 1.33. It then appears to perform much better using more threads, at 3.49 times faster.
pi@raspberrypi ~/benchmarks/mphry $ ./MP-DHRY V7A ./MP-DHRYPiA7 ***************************************************** Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz MP-Dhrystone Benchmark Linux/ARM v1.0 Sat Jul 27 17:38:10 2013 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.97 2.07 4.01 7.91 Dhrystones per Second 1650351 1547631 1594706 1619087 VAX MIPS rating 939 881 908 922 Internal pass count correct all threads End of test Sat Jul 27 17:38:26 2013 ########################### RPi OC ############################## Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts MP-Dhrystone Benchmark Linux/ARM v1.0 Sat Jul 27 18:48:23 2013 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.67 1.38 2.72 5.41 Dhrystones per Second 2388323 2324087 2354828 2364828 VAX MIPS rating 1359 1323 1340 1346 Internal pass count correct all threads End of test Sat Jul 27 18:48:34 2013 ########################### RPi 2 ############################### Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-Dhrystone Benchmark Linux/ARM v1.0 Mon Mar 2 17:12:58 2015 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.67 3.45 7.33 14.56 Dhrystones per Second 2985075 1159420 1091405 1098901 VAX MIPS rating 1699 660 621 625 Internal pass count correct all threads End of test Mon Mar 2 17:13:06 2015 ######################### RPi 2 OC ############################# Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-Dhrystone Benchmark Linux/ARM v1.0 Wed Mar 4 12:04:06 2015 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.96 2.22 9.92 19.82 Dhrystones per Second 3333333 2882883 1290323 1291625 VAX MIPS rating 1897 1641 734 735 Internal pass count correct all threads End of test Wed Mar 4 12:04:17 2013 ######################### RPi 2 V7A ############################# Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Tue Mar 3 15:53:27 2015 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 
0.54 0.68 2.21 2.95 Dhrystones per Second 2956666 4706235 2895209 4339729 VAX MIPS rating 1683 2679 1648 2470 Internal pass count correct all threads End of test Tue Mar 3 15:53:34 2015 ####################### RPi 2 V7A OC ########################### Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Tue Mar 3 17:41:09 2015 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.97 1.18 2.51 5.02 Dhrystones per Second 3286275 5439640 5096932 5094843 VAX MIPS rating 1870 3096 2901 2900 Internal pass count correct all threads End of test Tue Mar 3 17:41:20 2015 ######################### RPi 3 V7A ############################# Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:47:57 2016 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.95 1.12 1.59 3.04 Dhrystones per Second 4229473 7124952 10091677 10523432 VAX MIPS rating 2407 4055 5744 5989 Internal pass count correct all threads End of test Mon Aug 15 19:48:04 2016 ########################## Other ################################# P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4 Android MP-Dhrystone 2 Benchmark V1.0 23-Dec-2012 14.47 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.50 0.57 1.07 1.53 Dhrystones per Second 3187471 5583906 5972050 8389079 VAX MIPS rating 1814 3178 3399 4775 Internal pass count correct all threads Total Elapsed Time 4.2 seconds ***************************************************** Intel Atom 1.66 GHz, Linux Ubuntu 10.10 MP-Dhrystone Benchmark Linux/Intel v1.0 Sat Jul 27 17:59:01 2013 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.87 1.37 2.78 5.47 Dhrystones per Second 4624003 5836935 5756209 5845862 VAX MIPS rating 2632 3322 3276 3327 Internal pass count correct all threads End of test Sat Jul 27 17:59:12 2013 ***************************************************** Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04 MP-Dhrystone Benchmark Linux/Intel v1.0 
Tue Jul 30 15:10:47 2013 Using 1, 2, 4 and 8 Threads Threads 1 2 4 8 Seconds 0.58 2.63 6.91 13.86 Dhrystones per Second 13854293 6080597 4628248 4618862 VAX MIPS rating 7885 3461 2634 2629 Internal pass count correct all threads End of test Tue Jul 30 15:11:11 2013 |
This uses the same calculations as my original BusSpeed2K Benchmark, the link providing data and results, including Windows and Linux MP varieties.
Data is read using AND instructions at a range of data sizes covering caches and RAM. The program starts by reading words with 32 word address increments, then reduces the increment to eventually read all words sequentially. Speed reductions of around 50% at each higher increment suggest reading in bursts over the bus. This is normal for reading from RAM and is sometimes found reading cached data. In this case, only 12.3 KB, 123 KB and 12.3 MB memory sizes are used via 1, 2, 4 and 8 threads.
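The strided AND read described above can be sketched in C as follows. This is a minimal illustration with hypothetical names (and_read, words_touched), not the benchmark's unrolled code; how the benchmark credits bytes when calculating MB/second is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AND together every inc-th 32 bit word. inc = 32 matches the Inc32
   column and inc = 1 matches RdAll. */
static uint32_t and_read(const uint32_t *data, size_t words, size_t inc)
{
    assert(inc > 0);
    uint32_t acc = 0xFFFFFFFFu;
    for (size_t i = 0; i < words; i += inc)
        acc &= data[i];
    return acc;
}

/* Number of words actually read for a given increment. */
static size_t words_touched(size_t words, size_t inc)
{
    assert(inc > 0);
    return (words + inc - 1) / inc;
}
```

Calling and_read() with increments of 32, 16, 8, 4, 2 and 1 reproduces the column order of the results. Returning the accumulated value gives the compiler a reason to keep the reads.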
Speeds using L1 cache and large address increments can be unpredictable and not show performance gains using multiple cores. Some of this might be due to high overheads compared with actual execution time. Note that RasPi L2 cache speeds are relatively slow, compared with those from L1.
Ignoring the wildly variable burst reading comparisons, with CPUs at 1000 MHz, single core Raspberry Pi 2 performance via L1 cache showed no improvement over the RPi 1. There were significant gains in the 122.9 KB, L2 cache, test. The PiA7 compilation produced some performance increases at 122.9 KB and double speed via RAM. Bottom line four thread comparison gains, against the single core RPi 1, were 4.1 times from L1 cache, 17.5 times via L2 cache and 11.7 times with RAM based data, at 3.35 GB/second.
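The bottom line ratios quoted above appear to follow from the RdAll column of the four thread (4T) rows in the RPi 2 V7A OC and RPi OC logs below. This pairing is inferred from the published figures and is not stated in the text:

```latex
\frac{7621}{1840} \approx 4.1 \ \text{(12.3 KB, L1 cache)}, \qquad
\frac{7071}{403} \approx 17.5 \ \text{(122.9 KB, L2 cache)}, \qquad
\frac{3356}{288} \approx 11.7 \ \text{(12288 KB, RAM)}
```

The 3356 MB/second RAM figure is the 3.35 GB/second mentioned.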
The exaggerated performance is valid, where all threads read the same data but, as the 512 KB L2 cache is shared between all cores, measured speed does not reflect RAM speed. In order to demonstrate more realistic memory speeds, a second version, MP-BusSpd2PiA7, was produced, where each thread starts reading from different addresses (RPi 2 V7A 2 OC) below. This produced fairly constant RAM speeds using multiple threads.
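One way the staggered starting addresses could be arranged is sketched below. It is an assumption for illustration, not the MP-BusSpd2PiA7 source: `start_word` and `and_read_from` are hypothetical names, and the even spacing of start points is a guess.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: first word index for thread t of nthreads,
   spacing the starting points evenly through the array */
size_t start_word(size_t words, unsigned t, unsigned nthreads)
{
    return (words / nthreads) * t;
}

/* Read every word once, beginning at 'start' (must be < words) and
   wrapping round at the end of the array */
uint32_t and_read_from(const uint32_t *data, size_t words, size_t start)
{
    uint32_t result = 0xFFFFFFFFu;
    size_t i = start;
    for (size_t n = 0; n < words; n++) {
        result &= data[i];
        if (++i == words) {
            i = 0;
        }
    }
    return result;
}
```

Each thread still reads the whole array, so results can be checked for equality, but at any moment the threads are working in different regions, which should reduce the chance of one thread being served from shared L2 cache data just fetched by another.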
Raspberry Pi 3 - Results for the latest MP-BusSpd are shown below. Compared with default RPi 2 performance, the best RAM speed improvements matched the difference in memory bus speed. Cache speed improvements were around 1.9 times, compared with a CPU MHz ratio of 1.33.
The benchmark was rerun with the new graphics driver disabled, as this tended to degrade memory performance.
pi@raspberrypi ~/benchmarks/mpbusspd $ ./MP-BusSpd V7A ./MP-BusSpdPiA7 ***************************************************** Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz MP-BusSpd Linux/ARM v1.0 Sat Jul 27 17:32:12 2013 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 550 1200 1194 1258 1273 1179 2T 469 1207 1206 1261 1266 1217 4T 585 1207 1225 1252 1243 1249 8T 1046 1184 1208 1237 1236 1245 122.9 1T 22 46 45 55 107 218 2T 22 46 43 55 105 224 4T 21 46 43 55 106 224 8T 22 45 44 54 92 217 12288 1T 32 32 33 42 85 182 2T 15 33 30 41 80 165 4T 14 18 31 42 82 175 8T 15 28 32 43 81 178 End of test Sat Jul 27 17:32:29 2013 ########################### RPi OC ############################## Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts MP-BusSpd Linux/ARM v1.0 Sat Jul 27 18:50:22 2013 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 1246 1751 1794 1768 1847 1859 2T 1591 1756 1741 1830 1837 1760 4T 1162 1709 1784 1830 1802 1840 8T 1574 1732 1739 1774 1817 1820 122.9 1T 90 90 84 106 198 415 2T 65 88 86 106 204 418 4T 90 88 86 103 196 403 8T 89 88 82 103 192 407 12288 1T 49 49 50 71 138 293 2T 37 49 50 71 138 295 4T 45 50 49 69 129 288 8T 30 48 50 70 135 291 End of test Sat Jul 27 18:50:35 2013 ########################### RPi 2 ############################### Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-BusSpd Linux/ARM v1.0 Mon Mar 2 17:09:03 2015 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 1050 1681 1685 1729 1732 1734 2T 2955 3232 3324 3375 3425 3370 4T 5045 6417 6600 6753 6795 6868 8T 5053 5285 6087 5814 5845 6346 122.9 1T 383 391 695 1173 1493 1324 2T 712 738 1382 2324 2960 2652 4T 728 787 1593 3053 5693 4697 8T 774 771 1575 3192 4622 4704 12288 1T 71 76 151 295 635 349 2T 134 152 300 583 1242 691 4T 146 164 272 755 1415 1366 8T 137 77 240 421 930 1129 End of test Mon Mar 2 17:09:16 2015 
######################### RPi 2 OC ############################# Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-BusSpd Linux/ARM v1.0 Wed Mar 4 12:40:34 2015 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 1032 1869 1884 1919 1923 1927 2T 3302 3606 3704 3794 3819 3821 4T 5833 7058 7356 7517 7348 7618 8T 5534 5699 6209 6517 6572 6674 122.9 1T 425 431 768 1285 1650 1469 2T 809 815 1540 2583 3306 2944 4T 824 875 1768 3651 6262 5809 8T 858 822 1709 3464 5574 4615 12288 1T 96 110 218 424 914 447 2T 193 219 436 785 1702 877 4T 165 246 457 754 2236 1738 8T 111 131 283 623 1348 1474 End of test Wed Mar 4 12:40:47 2015 ######################### RPi 2 V7A ############################# Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-BusSpd Linux/ARM V7A v1.0 Tue Mar 3 16:08:07 2015 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 1571 1662 1670 1174 1698 1725 2T 3072 3266 3362 3379 3416 3443 4T 5077 6562 6582 6719 6771 6847 8T 5318 5731 6009 5939 5820 5535 122.9 1T 376 396 702 1192 1558 1624 2T 710 738 1388 2359 3111 3228 4T 708 779 1618 3238 5729 6383 8T 692 761 1612 2970 5056 5648 12288 1T 69 82 163 292 629 1251 2T 138 160 329 579 1247 2380 4T 217 175 364 485 1135 2582 8T 106 100 210 585 871 1817 End of test Tue Mar 3 16:08:21 2015 ####################### RPi 2 V7A OC ########################### Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-BusSpd Linux/ARM V7A v1.0 Tue Mar 3 17:14:10 2015 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 1029 1851 1875 1903 1912 1909 2T 3413 3616 3736 3799 3813 3818 4T 6821 7300 5957 7493 7523 7621 8T 5668 5894 6455 6372 6508 7495 122.9 1T 433 442 782 1305 1738 1789 2T 810 813 1542 2588 3429 3574 4T 818 887 1780 3584 6552 7071 8T 839 854 1629 3284 5229 6202 12288 1T 92 116 228 407 854 1286 2T 184 230 450 699 1619 2531 4T 236 253 564 1492 2178 3356 8T 
156 164 258 699 1065 3018 End of test Tue Mar 3 17:14:23 2015 ######################## RPi 2 V7A 2 ############################ Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-BusSpd ARM V7A v2 Fri Mar 6 17:29:14 2015 MB/Second Reading Data, 1, 2, 4 and 8 Threads Staggered starting addresses to avoid caching KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 961 1602 1638 1668 1733 2227 2T 2644 3012 3154 3240 3393 4138 4T 4024 5503 6027 6389 6705 8153 8T 2780 3979 4777 5031 6028 6376 122.9 1T 356 389 688 1185 1541 2050 2T 706 731 1373 2343 3070 4065 4T 743 800 1595 3198 5894 7872 8T 750 775 1566 2928 5406 7139 12288 1T 66 71 159 281 628 1147 2T 87 87 177 311 697 1256 4T 84 98 191 297 700 1186 8T 103 93 177 294 742 1147 End of test Fri Mar 6 17:29:26 2015 ###################### RPi 2 V7A 2 OC ########################## Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-BusSpd ARM V7A v2 Fri Mar 6 17:35:56 2015 MB/Second Reading Data, 1, 2, 4 and 8 Threads Staggered starting addresses to avoid caching KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 952 1772 1817 1862 1921 2466 2T 2958 3387 3552 3654 3826 4832 4T 4448 6110 6708 7037 7344 9078 8T 3358 4852 5570 5684 6631 7153 122.9 1T 435 436 787 1318 1718 2285 2T 813 816 1534 2610 3426 4527 4T 821 864 1780 3536 6523 8823 8T 813 812 1607 3307 5750 8159 12288 1T 94 104 229 406 904 1648 2T 141 141 289 454 1165 1785 4T 143 148 256 407 1000 1584 8T 148 133 250 485 1062 1531 End of test Fri Mar 6 17:36:08 2015 ######################## RPi 3 V7A 2 ############################ Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-BusSpd ARM V7A v2 Sun Jul 24 09:26:21 2016 MB/Second Reading Data, 1, 2, 4 and 8 Threads Staggered starting addresses to avoid caching KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 3011 3715 3792 4080 4400 4149 2T 5391 6873 7125 7827 8466 8124 4T 8622 11926 13488 15276 16419 13422 8T 4922 7930 9659 11732 13307 11995 122.9 1T 565 563 1070 1792 2830 3865 2T 886 901 1762 3225 5402 7584 4T 
901 921 1863 3727 7185 13816 8T 874 919 1762 3712 6269 9242 12288 1T 120 125 244 420 968 1926 2T 126 128 246 537 1000 2184 4T 110 118 231 443 990 1824 8T 120 137 262 517 1043 2124 End of test Sun Jul 24 09:26:33 2016 ########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ########## Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-BusSpd ARM V7A v2 Tue Aug 30 13:45:43 2016 MB/Second Reading Data, 1, 2, 4 and 8 Threads Staggered starting addresses to avoid caching KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 1565 3749 3718 4078 4385 4160 2T 5041 6829 7066 7813 8584 7839 4T 5480 11958 13330 15256 16863 15614 8T 6006 8477 8873 7777 8918 8315 122.9 1T 566 566 1062 1822 2831 3907 2T 899 906 1742 2395 5433 7638 4T 907 935 1876 3757 7241 13871 8T 863 919 1789 3491 6411 9403 12288 1T 130 136 263 513 1047 2080 2T 185 138 276 554 1108 2149 4T 131 137 269 536 1169 2383 8T 125 133 224 513 1038 2142 End of test Tue Aug 30 13:45:55 2016 ########################## Other ################################# P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4 Android MP-BusSpd v7 Benchmark V1.0 23-Dec-2012 14.42 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 3452 3697 4088 4122 3860 4183 2T 6616 7251 8016 8179 8307 8191 4T 8108 7430 10052 8511 8305 8404 8T 8729 10701 11687 12938 15297 15116 122.9 1T 747 762 746 966 992 1401 2T 1132 1161 1155 1554 1873 2668 4T 1127 1133 1137 2193 2987 4614 8T 1134 1145 1133 2210 3153 4231 12288 1T 82 89 200 376 739 1184 2T 204 179 407 797 1449 2205 4T 399 359 334 1227 1183 4038 8T 134 123 226 502 1378 3718 Total Elapsed Time 13.4 seconds ***************************************************** Intel Atom 1.66 GHz, Linux Ubuntu 10.10 MP-BusSpd Linux/Intel v1.0 Sat Jul 27 18:03:37 2013 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 5512 6061 6219 6362 6359 6388 2T 5866 6412 6556 6638 6595 6659 4T 6157 6445 6551 6607 6605 6639 8T 6139 6424 6510 6611 6303 6070 122.9 1T 513 417 
787 1476 2518 3945 2T 586 696 1316 2347 3655 4741 4T 625 686 1334 2270 3614 4736 8T 615 720 1255 2273 3635 4777 12288 1T 135 261 522 1034 1966 3280 2T 128 261 567 1146 2250 4535 4T 118 277 562 1183 2300 4454 8T 122 250 549 1122 2225 4413 End of test Sat Jul 27 18:03:49 2013 ***************************************************** Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04 MP-BusSpd Linux/Intel v1.0 Tue Jul 30 15:13:37 2013 MB/Second Reading Data, 1, 2, 4 and 8 Threads KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll 12.3 1T 10273 13905 14184 13640 13542 13651 2T 7599 14053 19451 22479 24301 25801 4T 7743 15110 29846 44953 48783 51672 8T 7613 15116 29805 44501 48082 51027 122.9 1T 1494 1496 2987 6001 11037 12857 2T 2980 2987 5952 11900 21852 25515 4T 5344 5967 11735 23781 43699 50429 8T 5947 5903 11661 23333 42953 50569 12288 1T 459 466 922 1878 3117 5236 2T 741 773 1452 2648 4731 8370 4T 839 887 1814 4006 7923 14909 8T 903 933 1921 4244 7997 14597 End of test Tue Jul 30 15:13:49 2013 |
RandMem benchmark carries out four tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests using 32 bit integers. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is not used, and by the size of preceding caches. Details and results for Windows and Linux versions can be found in RandMem Results.htm. This benchmark uses data from the same array for all threads, but starting at different points. Results of the Serial Reading tests are checked for the same result on all threads.
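The idea of using the same program structure for serial and random access can be sketched with an index array, as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, the sum used as the read operation and the small linear congruential generator used for shuffling are stand-ins, and the real RandMem code may select addresses differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Serial order: index[i] = i */
void fill_serial(uint32_t *index, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        index[i] = (uint32_t)i;
    }
}

/* Random order: Fisher-Yates shuffle driven by a simple LCG
   (assumed generator, chosen only to keep the sketch self-contained) */
void shuffle(uint32_t *index, size_t n, uint32_t seed)
{
    for (size_t i = n; i > 1; i--) {
        seed = seed * 1664525u + 1013904223u;
        size_t j = (size_t)((seed >> 8) % i);
        uint32_t tmp = index[i - 1];
        index[i - 1] = index[j];
        index[j] = tmp;
    }
}

/* Read test: identical loop for both orders, only index[] differs */
uint32_t sum_indexed(const uint32_t *data, const uint32_t *index, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[index[i]];
    }
    return sum;
}
```

Because the shuffled index is still a permutation of the same words, the two orders return the same total, which allows a consistency check of the kind mentioned above, while the memory access pattern is very different.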
The original Windows version produces extremely slow speeds with read/write tests, particularly with random access. Later Linux varieties included Mutex, or mutual exclusion, functions that avoid the updating conflict by only allowing one thread at a time to access common data. This can still lead to multiple threads being slower than one but, with random access, there can be a significant improvement compared with untethered multiple thread speeds, except when accessing RAM (see linux%20multithreading%20benchmarks.htm). This and the Android benchmarks also use Mutex and some speeds continue to be unpredictable.
The revised PiA7 compilation made little difference to Raspberry Pi 2 MP-RandMem results. Multithreading did not provide any performance improvement with read/write tests (Mutex effect) but RPi 2 gains were around 1.6 times, using L1 cache, 4 to 6 times from L2 cache, with RAM at 6 times for serial access and 1.6 times with random access. RPi 2 single thread read only L1 cache tests showed no performance increase, with more than three times gain from L2 cache and a 70% improvement from RAM. Serial Read/Random Read, quad core versus RPi 1 single core performance ratios were about 4/4 times for L1 cache, 15/9 times with L2 cache and 5.7/4.3 times from RAM.
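The Mutex arrangement might look something like the following thread body, written in the form expected by `pthread_create`. This is a sketch only: the `shared_t` structure and `readwrite_pass` name are hypothetical, and holding the lock for a complete pass is an assumption about granularity, not a description of the benchmark source.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical shared state: one array updated by every thread */
typedef struct {
    uint32_t *data;
    size_t words;
    pthread_mutex_t lock;
} shared_t;

/* Thread body: only one thread at a time may run the read/write
   pass over the common data, the others wait on the mutex */
void *readwrite_pass(void *arg)
{
    shared_t *s = (shared_t *)arg;

    pthread_mutex_lock(&s->lock);
    for (size_t i = 0; i < s->words; i++) {
        s->data[i] += 1;          /* read, modify and write back */
    }
    pthread_mutex_unlock(&s->lock);
    return NULL;
}
```

With the lock held for the whole pass, the threads effectively run one after another, which would be consistent with read/write speeds staying about the same at 1, 2, 4 and 8 threads.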
The results show that there is no gain in using multiple threads on systems with multiple cores, at least for cache based data, due to Mutex effects (but this is better than being much slower - see above). As could be anticipated, random access is slow, compared with serial reading and writing, when burst transfers are involved. Note the similarities with BusSpeed above.
Raspberry Pi 3 results are provided with and without the new graphics driver, as memory performance was degraded by the latter. RPi 2 comparisons, without the driver, are included below. Some were no better than the 1.33 CPU MHz increase, but RAM speed improvements were significant.
pi@raspberrypi ~/benchmarks/mprandmem$ ./MP-RandMem V7A ./MP-RandMemPiA7 ***************************************************** Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz MP-RandMem Linux/ARM v1.0 Mon Jul 29 16:23:20 2013 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 1564 1347 1605 1343 2T 1576 1338 1584 1217 4T 1550 1297 1544 1324 8T 1500 1303 1489 1183 122.9 1T 236 202 112 99 2T 234 201 111 110 4T 232 201 110 96 8T 226 200 109 99 12288 1T 170 135 23 26 2T 170 134 22 26 4T 169 132 23 25 8T 123 105 23 26 No Errors Found End of test Mon Jul 29 16:24:19 2013 ########################### RPi OC ############################## Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts MP-RandMem Linux/ARM v1.0 Mon Jul 29 17:05:25 2013 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 2306 1938 2312 1937 2T 2281 1900 2288 1933 4T 2238 1918 2240 1918 8T 2171 1890 2178 1880 122.9 1T 460 369 205 194 2T 448 371 202 195 4T 441 371 204 194 8T 427 367 202 193 12288 1T 270 198 36 42 2T 270 198 36 42 4T 270 198 36 42 8T 270 198 36 42 No Errors Found End of test Mon Jul 29 17:06:16 2013 ########################### RPi 2 ############################### Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-RandMem Linux/ARM v1.0 Tue Mar 3 16:13:52 2015 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 2256 2857 2257 2858 2T 4480 2847 4480 2849 4T 8738 2795 8759 2808 8T 8032 2772 8439 2794 122.9 1T 1624 1483 628 682 2T 3208 1467 1183 683 4T 6203 1457 1673 681 8T 5793 1385 1670 689 12288 1T 359 940 55 57 2T 670 941 105 57 4T 1180 936 126 57 8T 1161 938 127 56 No Errors Found End of test Tue Mar 3 16:14:38 2015 ######################### RPi 2 OC ############################# Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-RandMem Linux/ARM v1.0 Wed Mar 4 12:53:39 2015 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 2493 3157 
2509 3176 2T 4979 3161 4979 3153 4T 9357 3144 9699 3125 8T 8706 3104 8633 3085 122.9 1T 1796 2152 701 761 2T 3577 2142 1331 762 4T 6916 2151 1870 766 8T 6421 2135 1823 765 12288 1T 461 1233 68 70 2T 862 1218 129 69 4T 1561 1210 159 69 8T 1514 1202 162 69 No Errors Found End of test Wed Mar 4 12:54:25 2015 ######################### RPi 2 V7A ############################# Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz MP-RandMem Linux/ARM V7A v1.0 Tue Mar 3 16:28:30 2015 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 1967 2784 1968 2793 2T 3711 2787 3740 2788 4T 7019 2730 7313 2762 8T 6783 2410 6881 2704 122.9 1T 1413 1489 532 681 2T 2788 1470 1013 679 4T 5393 1485 1593 681 8T 5207 1448 1587 686 12288 1T 357 950 45 57 2T 697 946 89 57 4T 1212 930 126 56 8T 1157 938 123 57 No Errors Found End of test Tue Mar 3 16:29:18 2015 ####################### RPi 2 V7A OC ########################### Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 MP-RandMem Linux/ARM V7A v1.0 Tue Mar 3 17:51:16 2015 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 2209 3129 2209 3130 2T 4177 3101 4156 3098 4T 8289 3056 8292 3059 8T 7641 3009 7574 2990 122.9 1T 1584 2121 592 754 2T 3109 2105 1127 758 4T 5983 2114 1715 759 8T 5669 2118 1682 764 12288 1T 453 1219 55 69 2T 841 1217 109 69 4T 1535 1209 157 69 8T 1523 1189 154 68 No Errors Found End of test Tue Mar 3 17:52:01 2015 ######################### RPi 3 V7A ############################# Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-RandMem Linux/ARM V7A v1.0 Mon Aug 15 19:37:27 2016 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 2907 3773 2917 3790 2T 5480 3768 5187 3775 4T 11198 3679 10960 3712 8T 10094 3697 10038 3685 122.9 1T 2673 3340 686 892 2T 5031 3386 1251 888 4T 9398 3378 2002 890 8T 9291 3370 1916 886 12288 1T 1896 899 50 64 2T 2535 900 98 65 4T 2878 896 137 64 8T 2631 897 130 65 No Errors Found End of test Mon 
Aug 15 19:38:14 2016 ########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ########## Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz MP-RandMem Linux/ARM V7A v1.0 Tue Aug 30 14:13:08 2016 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 2930 3791 2918 3791 2T 5571 3766 5194 3776 4T 11196 3722 11205 3722 8T 10063 3685 10051 3702 122.9 1T 2675 3398 681 893 2T 5124 3387 1256 886 4T 10041 3387 1916 891 8T 9593 3367 1952 890 12288 1T 2120 979 54 71 2T 3255 980 107 71 4T 3346 979 138 70 8T 2226 979 143 71 No Errors Found End of test Tue Aug 30 14:13:54 2016 RPi3/RPi2 Average L1 cache 1.53 1.36 1.47 1.35 L2 cache 1.86 2.29 1.24 1.31 RAM 4.46 1.04 1.17 1.25 ########################## Other ################################# P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4 Android MP-RndMem v7 Benchmark V1.0 23-Dec-2012 14.40 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.29 1T 2043 2028 2066 2027 2T 6788 3058 6835 3346 4T 6251 3104 6478 3376 8T 6635 3244 5408 3242 122.9 1T 1365 1392 1150 1151 2T 2415 1386 1927 1159 4T 2495 1374 1870 1117 8T 2470 1352 1772 1013 12288 1T 581 351 71 77 2T 1674 934 143 96 4T 1675 882 143 95 8T 1838 939 142 96 Total Elapsed Time 5.5 seconds ***************************************************** Intel Atom 1.66 GHz, Linux Ubuntu 10.10 MP-RandMem Linux/Intel v1.0 Mon Jul 29 17:14:26 2013 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD RndRDWR 12.3 1T 4207 5242 4206 5244 2T 6219 5159 5770 5118 4T 6155 5158 6206 5149 8T 5765 5019 6088 4956 122.9 1T 3084 3455 789 1077 2T 4692 3451 1230 1078 4T 4753 3408 1246 1076 8T 4689 3400 1243 1045 12288 1T 1291 1339 57 88 2T 3008 1323 108 88 4T 3043 1336 105 88 8T 3092 1329 108 87 No Errors Found End of test Mon Jul 29 17:15:12 2013 ***************************************************** Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04 MP-RandMem Linux/Intel v1.0 Tue Jul 30 15:15:12 2013 MB/Second Using 1, 2, 4 and 8 Threads KB SerRD SerRDWR RndRD 
RndRDWR 12.3 1T 14913 11834 14229 11681 2T 24219 11686 23129 11552 4T 35885 11566 33095 11443 8T 29820 11596 29206 11518 122.9 1T 10936 10580 5543 4835 2T 20167 10563 9942 4814 4T 38266 10522 18061 4845 8T 37272 10437 17753 4813 12288 1T 3858 3864 655 559 2T 6280 3866 1137 558 4T 10752 3859 1920 558 8T 11107 3827 1924 558 No Errors Found End of test Tue Jul 30 15:15:53 2013 |
These benchmarks use the same program calculations as the original MP_MFLOPS benchmark for Linux; MP-MFLOPS, above, uses a cut down version that was implemented for Android devices. The OpenMP-MFLOPS benchmark uses the simplest OpenMP directive, #pragma omp parallel for, before the for loops where parallelisation might be expected, together with the -fopenmp compile parameter. notOpenMP-MFLOPS is the same code, compiled without that parameter.
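As an illustration of the directive, a loop with two floating point operations per word might be written as below. The function name and the particular calculation are assumptions made for this sketch; the benchmark's own test functions are not reproduced here.

```c
#include <stddef.h>

/* Hypothetical test function: one add and one multiply per word.
   Compiled with -fopenmp, the iterations are shared between threads.
   Compiled without it, the pragma is ignored and the loop runs on a
   single core, as with the notOpenMP variety. */
void two_ops_per_word(float *x, size_t words, float a, float b)
{
    long n = (long)words;

    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        x[i] = (x[i] + a) * b;
    }
}
```

The same source file therefore serves for both builds, the only difference being the compile parameter.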
The default memory sizes used, starting at 400 KB, are much larger than MP-MFLOPS, as is the number of repeat passes. However, these benchmarks have run time parameters, shown below, that can change these. In fact, test runs show that performance is mainly dependent on the number of operations per word, and notOpenMP-MFLOPS speed is almost the same as 1 Thread MP-MFLOPS results.
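The MFLOPS figures in the logs below are consistent with words times operations per word times repeat passes, divided by seconds. The function below is a sketch of that arithmetic, inferred from the published numbers, and `mflops` is a hypothetical name:

```c
/* Millions of floating point operations per second for one test line */
double mflops(double words, double ops_per_word, double passes,
              double seconds)
{
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For example, the first RPi 2 OpenMP line (100000 words, 2 operations per word, 2500 passes in 0.676731 seconds) works out at about 739 MFLOPS, matching the log.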
Besides notOpenMP-MFLOPS, results below include OpenMP-MFLOPS set to run on a single core, where speeds at 32 operations per word were nearly twice as fast on the former. Examination of the assembly code generated for this particular test function shows that the latter has 67 instructions and the former 346, clearly with more options to suit data size. Both use vfma fused floating-point multiply accumulate instructions, where there are rounding complications. Note the different results at 32 operations per word. At least the default OpenMP-MFLOPS benchmark shows speed gains of up to 3.9 times those using a single core.
Raspberry Pi 3 results, using the same parameters as MP-MFLOPS, have been included. Those for notOpenMP-MFLOPS at 2 and 32 Ops/Word were almost identical to the MP-MFLOPSPiNeon equivalents but, unlike the latter, little gain was produced with multithreading. MP performance appeared to be improved somewhat by increasing the pass count. Compared with Raspberry Pi 2, notOpenMP-MFLOPS was around twice as fast at 8 and 32 operations per word, but less so with 2 operations. Then, it sometimes appeared to be slower with OpenMP-MFLOPS. Poor performance could well be associated with running time and overheads.
######################## Run Time Parameters ############################# For same as latest MP-MFLOPS ./OpenMP-MFLOPS Words 3200, Repeats 10000 ############################## RPi 2 V7A ################################# Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz OpenMP MFLOPS Benchmark 1 Sat Mar 7 15:44:05 2015 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 100000 2 2500 0.676731 739 0.929538 Yes Data in & out 1000000 2 250 1.365332 366 0.992550 Yes Data in & out 10000000 2 25 1.308170 382 0.999250 Yes Data in & out 100000 8 2500 1.076658 1858 0.957126 Yes Data in & out 1000000 8 250 1.390932 1438 0.995524 Yes Data in & out 10000000 8 25 1.356837 1474 0.999550 Yes Data in & out 100000 32 2500 5.561007 1439 0.890232 Yes Data in & out 1000000 32 250 5.843752 1369 0.988068 Yes Data in & out 10000000 32 25 5.791580 1381 0.998785 Yes End of test Sat Mar 7 15:44:30 2015 ***************** taskset 0x00000001 ./OpenMP-MFLOPS 1 Core ***************** OpenMP MFLOPS Benchmark 1 Mon Mar 9 11:56:40 2015 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 100000 2 2500 1.592769 314 0.929538 Yes Data in & out 1000000 2 250 1.928113 259 0.992550 Yes Data in & out 10000000 2 25 1.917868 261 0.999250 Yes Data in & out 100000 8 2500 4.049420 494 0.957126 Yes Data in & out 1000000 8 250 4.766354 420 0.995524 Yes Data in & out 10000000 8 25 4.757556 420 0.999550 Yes Data in & out 100000 32 2500 21.886468 366 0.890232 Yes Data in & out 1000000 32 250 22.745527 352 0.988068 Yes Data in & out 10000000 32 25 22.726837 352 0.998785 Yes End of test Mon Mar 9 11:58:07 2015 ----------------------------------------------------------------------------- Not OpenMP MFLOPS Benchmark 1 Sat Mar 7 15:41:17 2015 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 100000 2 2500 1.256587 398 0.929538 Yes Data in & out 1000000 2 250 1.470944 340 0.992550 Yes Data 
in & out 10000000 2 25 1.467244 341 0.999250 Yes Data in & out 100000 8 2500 2.574641 777 0.957126 Yes Data in & out 1000000 8 250 3.241242 617 0.995524 Yes Data in & out 10000000 8 25 3.226519 620 0.999550 Yes Data in & out 100000 32 2500 11.566683 692 0.890268 Yes Data in & out 1000000 32 250 12.312695 650 0.988078 Yes Data in & out 10000000 32 25 12.309223 650 0.998806 Yes End of test Sat Mar 7 15:42:07 2015 ####################### RPi 2 V7A OC ########################### Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 OpenMP MFLOPS Benchmark 1 Sat Mar 7 19:21:01 2015 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 100000 2 2500 0.502595 995 0.929538 Yes Data in & out 1000000 2 250 1.061047 471 0.992550 Yes Data in & out 10000000 2 25 1.027811 486 0.999250 Yes Data in & out 100000 8 2500 0.962144 2079 0.957126 Yes Data in & out 1000000 8 250 1.202937 1663 0.995524 Yes Data in & out 10000000 8 25 1.158232 1727 0.999550 Yes Data in & out 100000 32 2500 4.947005 1617 0.890232 Yes Data in & out 1000000 32 250 5.147261 1554 0.988068 Yes Data in & out 10000000 32 25 5.111022 1565 0.998785 Yes End of test Sat Mar 7 19:21:23 2015 Not OpenMP MFLOPS Benchmark 1 Sat Mar 7 19:19:54 2015 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 100000 2 2500 1.085229 461 0.929538 Yes Data in & out 1000000 2 250 1.314159 380 0.992550 Yes Data in & out 10000000 2 25 1.307451 382 0.999250 Yes Data in & out 100000 8 2500 2.323887 861 0.957126 Yes Data in & out 1000000 8 250 2.859657 699 0.995524 Yes Data in & out 10000000 8 25 2.851960 701 0.999550 Yes Data in & out 100000 32 2500 10.461870 765 0.890268 Yes Data in & out 1000000 32 250 11.074036 722 0.988078 Yes Data in & out 10000000 32 25 11.070011 723 0.998806 Yes End of test Sat Mar 7 19:20:39 2015 ######################### RPi 3 V7A ############################# Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz OpenMP MFLOPS 
Benchmark 1 Sat Jul 30 13:01:12 2016 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 100000 2 2500 0.363631 1375 0.929538 Yes Data in & out 1000000 2 250 1.133716 441 0.992550 Yes Data in & out 10000000 2 25 1.150107 435 0.999250 Yes Data in & out 100000 8 2500 0.432833 4621 0.957126 Yes Data in & out 1000000 8 250 1.177219 1699 0.995524 Yes Data in & out 10000000 8 25 1.151536 1737 0.999550 Yes Data in & out 100000 32 2500 3.845114 2081 0.890232 Yes Data in & out 1000000 32 250 3.754590 2131 0.988068 Yes Data in & out 10000000 32 25 3.737356 2141 0.998785 Yes End of test Sat Jul 30 13:01:29 2016 ----------------------------------------------------------------------------- Not OpenMP MFLOPS Benchmark 1 Mon Aug 15 19:23:03 2016 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 100000 2 2500 0.697952 716 0.929538 Yes Data in & out 1000000 2 250 1.160158 431 0.992550 Yes Data in & out 10000000 2 25 1.140070 439 0.999250 Yes Data in & out 100000 8 2500 1.178477 1697 0.957126 Yes Data in & out 1000000 8 250 1.442497 1386 0.995524 Yes Data in & out 10000000 8 25 1.428921 1400 0.999550 Yes Data in & out 100000 32 2500 5.060230 1581 0.890268 Yes Data in & out 1000000 32 250 5.203246 1538 0.988078 Yes Data in & out 10000000 32 25 5.203889 1537 0.998806 Yes End of test Mon Aug 15 19:23:26 2016 ######################### RPi 3 V7A ############################# Run with parameters ./OpenMP-MFLOPS Words 3200, Repeats 10000 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz OpenMP MFLOPS Benchmark 1 Thu Sep 15 18:13:47 2016 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 3200 2 10000 0.138179 463 0.764063 Yes Data in & out 32000 2 1000 0.091516 699 0.970753 Yes Data in & out 320000 2 100 0.193833 330 0.997008 Yes Data in & out 3200 8 10000 0.148140 1728 0.850919 Yes Data in & out 32000 8 1000 0.120691 2121 0.982347 Yes Data in & out 320000 8 100 0.429023 
597 0.998205 Yes Data in & out 3200 32 10000 0.514128 1992 0.660291 Yes Data in & out 32000 32 1000 0.703450 1456 0.953632 Yes Data in & out 320000 32 100 1.067654 959 0.995180 Yes End of test Thu Sep 15 18:13:50 2016 ----------------------------------------------------------------------------- Not OpenMP MFLOPS Benchmark 1 Thu Sep 15 18:14:47 2016 Test 4 Byte Ops/ Repeat Seconds MFLOPS First All Words Word Passes Results Same Data in & out 3200 2 10000 0.152466 420 0.764063 Yes Data in & out 32000 2 1000 0.081762 783 0.970753 Yes Data in & out 320000 2 100 0.134984 474 0.997008 Yes Data in & out 3200 8 10000 0.147960 1730 0.850919 Yes Data in & out 32000 8 1000 0.148731 1721 0.982347 Yes Data in & out 320000 8 100 0.168795 1517 0.998205 Yes Data in & out 3200 32 10000 0.644568 1589 0.660158 Yes Data in & out 32000 32 1000 0.649362 1577 0.953663 Yes Data in & out 320000 32 100 0.663790 1543 0.995240 Yes End of test Thu Sep 15 18:14:50 2016 |
MemSpeed benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of cache and RAM data. Calculations are as shown in the results’ headings. As with OpenMP-MFLOPS benchmark, OpenMP-MemSpeed uses the simplest OpenMP directive (#pragma omp parallel) before the test loops. Full results are below for the RPi 2 running at 900 and 1000 MHz. The compile command is also shown.
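The three calculations named in the results' headings can be written as simple loops. The versions below are a sketch for double precision only, with hypothetical function names; the benchmark also runs single precision and 32 bit integer forms.

```c
#include <stddef.h>

/* x[m] = x[m] + s * y[m] */
void scaled_add(double *x, const double *y, double s, size_t n)
{
    for (size_t m = 0; m < n; m++) {
        x[m] = x[m] + s * y[m];
    }
}

/* x[m] = x[m] + y[m] */
void plain_add(double *x, const double *y, size_t n)
{
    for (size_t m = 0; m < n; m++) {
        x[m] = x[m] + y[m];
    }
}

/* x[m] = y[m] */
void copy_words(double *x, const double *y, size_t n)
{
    for (size_t m = 0; m < n; m++) {
        x[m] = y[m];
    }
}
```

Each loop reads two arrays, or one for the copy, and writes one, so the ratio of arithmetic to memory traffic falls from the first calculation to the last.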
With OpenMP-MemSpeed Version 1, the declaration to use OpenMP was before an inner loop, leading to possible performance degradation due to overheads. For Version 2, or OpenMP-MemSpeed2, the directive was moved to an outer loop. For the following results, this was run, along with a test to use one CPU core, via the command taskset 0x00000001 ./OpenMP-MemSpeed2. Then, another compilation (NotOpenMP-MemSpeed2), was produced without the -fopenmp compile option, to use a single core without OMP overheads. All three versions are in Raspberry_Pi_MP_Benchmarks.zip.
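The difference between the two directive placements can be sketched as follows. Both functions are hypothetical and the Version 2 layout, with an outer loop over independent blocks of the arrays, is an assumption made so that the example is free of data races; the actual loop structure of OpenMP-MemSpeed2 may differ.

```c
#include <stddef.h>

/* Version 1 style: directive before the inner loop, so threads are
   set to work and synchronised once for every pass */
void passes_inner(float *x, const float *y, size_t n, int passes)
{
    long count = (long)n;

    for (int p = 0; p < passes; p++) {
        #pragma omp parallel for
        for (long m = 0; m < count; m++) {
            x[m] = x[m] + y[m];
        }
    }
}

/* Version 2 style (assumed layout): directive before an outer loop,
   each thread running all passes over its own block of the arrays.
   n is assumed to be an exact multiple of blocks. */
void passes_outer(float *x, const float *y, size_t n, int passes,
                  int blocks)
{
    long per_block = (long)n / blocks;

    #pragma omp parallel for
    for (int b = 0; b < blocks; b++) {
        float *xb = x + (long)b * per_block;
        const float *yb = y + (long)b * per_block;
        for (int p = 0; p < passes; p++) {
            for (long m = 0; m < per_block; m++) {
                xb[m] = xb[m] + yb[m];
            }
        }
    }
}
```

With the directive on the outer loop, the threading overhead is paid once rather than once per pass, which matters most for the small, cache sized tests where each pass is very short.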
Raspberry Pi 3 - Results are below, along with RPi3/RPi2 average performance ratios, plus those for Raspberry Pi 3 OpenMP/NotOpenMP and NotOpenMP/1 core OpenMP. Some RPi3/RPi2 comparisons were close to the 1.33 CPU MHz ratio, but most were higher, particularly on RAM speed, at up to 4.68 times, and on all integer arithmetic tests, with all MP ratios between 4.08 and 5.62. RPi 3 multiprocessing gains were disappointing on integer operations but mainly over 3.5 times for cache based floating point and over 3.0 from RAM. Using one thread, the benchmark produced wide variations compared with the unthreaded code, mainly worse, as expected, but some were better. It could be assumed that different instructions were generated.
######################### RPi 2 V7A #############################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 gcc memSpeedOMP.c cpuidc.c -lrt -lc -lm -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 -funsafe-math-optimizations -fopenmp -o OpenMP-MemSpeed

 Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom

 Start of test Sat Mar 7 19:12:39 2015

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    589    759    843    925    871    896    517    491    490
       8   1487   1056   1367   1707   1161   1472    971    876    876
      16   2357   1595   1941   2852   2186   2348   1737   1422   1355
      32   2565   2045   2669   4329   2876   3161   2945   2246   2235
      64   3964   2294   3080   5634   3497   3962   4224   2936   2934
     128   2420   2317   3096   5661   3478   3928   1831   3416   3425
     256   2884   2150   2838   4411   3179   3578   1184   1392   1357
     512   1837   1731   2155   3061   2327   2557   1064   1217   1218
    1024    650    990   1106   1254   1134   1162   1050   1055    937
    2048    793    833    907   1010    935    889    851    676    825
    4096    792    705    864   1004    871    953    767    771    748
    8192    760    829    881   1009    935    961    761    736    766
   16384    839    810    873   1004    934    961    765    772    762
   32768    850    829    906   1005    725    953    770    776    777
   65536    951    838    894   1022    928    963    772    779    779
  131072    949    835    867   1010    937    950    774    786    788

 End of test Sat Mar 7 19:13:10 2015

####################### RPi 2 V7A OC ###########################

 Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2

 Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom

 Start of test Sat Mar 7 19:17:09 2015

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    669    855    948   1081    980   1004    578    548    496
       8   1595   1339   1549   1948   1643   1399   1099    915    994
      16   2467   1859   2292   3238   2464   2686   1966   1538   1660
      32   3429   2302   3012   4932   3330   3720   3324   2527   2332
      64   4190   2585   3520   6499   3969   4520   4881   3357   3366
     128   4327   2670   3656   6991   4134   4789   4914   3094   3827
     256   4185   2392   3524   6035   3994   4607   1710   2719   2713
     512   2757   2119   2329   4008   2944   3250   1587   1726   1717
    1024   1393   1161   1303   1493   1350   1408   1488   1476   1465
    2048    903    996   1086   1207   1113    921   1083   1093   1086
    4096    632    911   1058   1177   1094   1122    969    995    998
    8192   1141    988   1074   1198   1113   1112    985    994    998
   16384    825    980   1070   1184    950   1131   1014   1022   1015
   32768   1111    994   1083   1209   1117   1155    994    957   1007
   65536   1161    982   1084   1223   1112   1163    996    999    997
  131072   1134    986   1083   1212   1098   1135   1005   1017   1022

 End of test Sat Mar 7 19:17:40 2015

#################################################################
                            Version 2
########################## RPi 2 ################################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

 Start of test Mon Sep 5 21:29:03 2016

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   3259   2499    271   3261   2383    286   1333   2099    432
       8   2854   2507    256   3160   2594    305   1235   2116    445
      16   3329   2507    256   3331   3098    270   1173   1547    446
      32   3210   2509    264   3328   3026    267   1155   1452    433
      64   3461   1889    249   5869   3399    250   1128   2024    317
     128   3215   2229    257   5719   3672    262   1117   1123    293
     256   3896   2387    250   5677   3647    257   1119   1132    301
     512   2521   1527    217   2718   2258    230   1115   1112    282
    1024    931    871    185   1408   1254    182   1092   1094    258
    2048    863    777    212   1217   1203    198   1095   1088    275
    4096    846    724    159    962    885    168   1092   1078    251
    8192    824    779    234   1151   1191    200   1090   1070    266
   16384    791    701    362    961   1223    335   1078   1057    334
   32768    845    641    398    930    973    391    913   1066    300
   65536    312    256    331    359    306    338    956   1069    301
  131072    312    255    332    360    306    338    994    812    356

 End of test Mon Sep 5 21:29:41 2016

####################### RPi 2 1 Core ############################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

 Start of test Mon Sep 5 21:30:34 2016

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    857    674    676   2199   1096    658   2297   1172    489
       8   1234    677    676   2269   1131    690   2351   1179    490
      16   1236    677    677   2273   1138    694   2362   1183    490
      32   1225    673    674   2258   1132    691   2354   1181    489
      64   1056    616    623   1732    950    638   1428   1093    471
     128    968    605    614   1660    947    626   1242   1127    476
     256    910    602    611   1635    947    626   1191   1131    475
     512    705    499    529   1242    743    515   1119    954    438
    1024    347    282    350    434    359    357    803    785    339
    2048    309    256    326    359    305    333    814    744    299
    4096    304    251    324    353    302    331    856    785    313
    8192    304    252    322    352    300    331    879    839    331
   16384    305    251    324    352    300    331    891    864    342
   32768    308    251    325    354    301    332    859    773    313
   65536    309    253    325    355    293    331    836    737    302
  131072    309    253    326    355    302    332    838    713    295

 End of test Mon Sep 5 21:31:10 2016

####################### RPi 2 Not OMP ###########################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom

 Start of test Mon Sep 5 21:31:37 2016

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    823   1666   2560   2001   2172   2564   1985   1564   1565
       8   1262   1671   2571   2024   2181   2570   2018   1573   1571
      16   1261   1670   2576   2023   2179   2570   2023   1575   1575
      32   1064   1275   1705   1552   1569   1724   1468   1325   1272
      64    971   1292   1714   1493   1577   1721   1616   1296   1297
     128    995   1319   1767   1539   1626   1718   1464   1317   1318
     256    912   1294   1722   1494   1580   1714   1209   1376   1374
     512    655    885   1091    977   1025   1078   1108    932    948
    1024    364    408    451    422    439    450    863    510    511
    2048    309    334    356    343    352    360    914    562    557
    4096    305    304    350    338    345    350    930    607    604
    8192    306    331    356    340    349    356    922    608    609
   16384    311    332    358    343    349    358    917    609    609
   32768    313    333    355    344    349    359    925    609    607
   65536    313    333    359    343    351    358    926    611    612
  131072    314    331    359    345    351    359    927    609    612

 End of test Mon Sep 5 21:32:12 2016

######################### RPi 3 V7A #############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom

 Start of test Mon Aug 15 19:29:18 2016

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    749   1064   1302   1407   1236   1266    775    745    745
       8    379   1597   2200   2473   1968   2433   1481   1364   1378
      16   3180   2126   3319   3928   2901   3866   2718   2364   2372
      32   4244   2546   4492   5565   3733   5495   4685   3700   3713
      64   4930   2772   5252   7185   3845   6693   6947   4959   4960
     128   3699   2924   5785   8169   4592   7349   5553   6047   6009
     256   5553   2970   5939   8340   4585   7720   9048   6657   6653
     512   5167   2854   5537   7555   4116   7009   6125   5464   5288
    1024    903   1436   1456   1329   1461   1456   1585   1609   1600
    2048    950   1164   1155   1186   1171    914   1043   1036   1024
    4096    974   1148   1039   1174   1164   1162    919    923    928
    8192    920   1131   1158   1168   1163   1093    938    936    945
   16384    919    838    948   1169   1165    990    931    940    946
   32768   1166   1159   1168   1171   1167   1168    923    926    916
   65536   1156   1146   1167   1170   1163   1147    928    939    931
  131072   1163   1151   1148   1171   1075   1092    934    915    957

 End of test Mon Aug 15 19:29:47 2016

########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ##########

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom

 Start of test Tue Aug 30 14:03:24 2016

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    565   1059   1293   1403   1228   1381    773    740    743
       8    433   1590   2185   2480   1989    886   1458   1349   1367
      16    274   2118   3212   3987   2882   3846   2678   2335   2334
      32   4234   2547   4489   5786   3723   5476   4645   3685   3690
      64   3613   2791   5328   7263   3959   6777   7146   5065   5075
     128   1349   2889   5624   6927   4090   7274   9530   5908   5923
     256   3597   2960   5877   8177   4676   7637   8693   6697   6725
     512   4140   2985   3621   8556   4784   7931   7867   6723   6768
    1024   1534   1547   1631   1646   1629   1634   1872   1848   1852
    2048   1274   1270   1274   1267   1106   1274   1106   1108   1090
    4096    675   1263   1270   1277   1265   1266   1025   1031   1028
    8192   1271   1256   1281   1280   1263   1265    996    994    959
   16384   1281   1277   1289   1288   1102   1278    986    971    976
   32768   1285   1283   1287   1301   1286   1291    977    966    986
   65536   1291   1285   1291   1292   1291   1289    988    982    986
  131072   1293   1283   1293   1298   1295   1287    970    979    998

 End of test Tue Aug 30 14:03:53 2016

#################################################################
                            Version 2
########################## RPi 3 ################################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

 Start of test Mon Sep 5 14:27:38 2016

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   5518   2990   1309   8808   4732   1455  15426   7656   1244
       8   5414   3115   1322  10150   5068   1470  14323   8301   1254
      16   5503   3143   1270  10255   5154   1378  16743   8043   1221
      32   5507   3145   1344  10142   5089   1458  16572   7732   1206
      64   5033   2999   1257   9230   4867   1419  16012   7869   1228
     128   5255   3041   1258   9372   5014   1365   9452   8192   1252
     256   5266   3093   1282   9401   5006   1372   8418   7864   1313
     512   4494   2765   1358   7248   4482   1332   5748   5460   1410
    1024   3810   2683   1078   4425   3668   1155   1753   1732   1265
    2048   2008   1425   1098   2274   2214    980   1086   1094   1333
    4096   3972   2413   1075   4628   3672    945   1058   1057    839
    8192   1597   2435    920   3671   3649   1199   1059   1067   1043
   16384   3838   1624   1867   4440   1550   1108   1065   1076   1166
   32768   1658   2273   1695   4227   1876   1054   1066   1039    921
   65536   3657   1247   1286   4839   3801   1308   1053   1046   1133
  131072    990    655    810   1260    932    826   1129   1083    619

 End of test Mon Sep 5 14:28:08 2016

####################### RPi 3 1 Core ############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

 Start of test Mon Sep 5 14:30:31 2016

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    775    789    994   2578   1309   1027   4087   2337    654
       8   1551    793   1003   2620   1313   1029   4176   2361    656
      16   1553    793   1003   2626   1314   1029   4209   2372    657
      32   1512    782    982   2501   1282   1009   4146   2338    647
      64   1464    770    961   2379   1242    982   3976   2183    636
     128   1476    773    963   2406   1253    990   3837   2160    639
     256   1478    773    964   2389   1256    982   3867   2208    639
     512   1401    748    926   2204   1202    958   3342   2119    636
    1024   1082    663    798   1347    979    814   1759   1634    616
    2048    968    651    776   1193    923    791   1272   1215    604
    4096    962    645    779   1171    909    812   1253   1247    615
    8192    977    654    807   1233    925    820   1240   1245    619
   16384   1016    653    794   1226    920    818   1223   1231    617
   32768   1018    656    815   1263    930    806   1175   1176    615
   65536   1026    658    816   1270    935    829    971    988    614
  131072   1030    660    818   1269    938    830    866    870    608

 End of test Mon Sep 5 14:30:57 2016

####################### RPi 3 Not OMP ###########################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom

 Start of test Mon Sep 5 14:28:22 2016

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    785   2536   3789   2360   3448   3787   2670   2693   2692
       8   1594   2547   3812   2389   3465   3812   2715   2716   2716
      16   1595   2551   3824   2392   3477   3823   2727   2728   2728
      32   1556   2435   3564   2300   3272   3565   2730   2722   2723
      64   1513   2314   3330   2189   3091   3327   2599   2435   2435
     128   1516   2312   3357   2188   3118   3353   2635   2569   2569
     256   1521   2316   3381   2187   3130   3384   2676   2618   2617
     512   1419   2034   2765   1977   2674   2835   2593   2481   2524
    1024   1113   1379   1544   1348   1521   1543   1691   1583   1586
    2048    995   1203   1282   1193   1277   1257   1263   1231   1232
    4096    992   1196   1248   1178   1252   1259   1203   1176   1166
    8192   1041   1237   1290   1213   1298   1291    927    943    954
   16384   1052   1262   1311   1229   1252   1303    874    866    867
   32768   1053   1271   1317   1239   1325   1303    995    987    991
   65536   1057   1281   1323   1245   1343   1316    920    920    918
  131072   1057   1283   1323   1184   1350   1327    856    849    840

 End of test Mon Sep 5 14:28:50 2016

#################################################################
                           Comparisons
#################################################################
########################### RPi3/RPi2 ###########################

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
    Ares   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
           MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 RPi3/RPi2 OpenMP-MemSpeed2
      L1   1.75   1.25   5.07   3.11   1.76   5.11  13.37   4.71   2.78
      L2   1.70   1.64   5.38   1.85   1.62   5.62   7.43   4.80   4.46
     RAM   3.73   2.94   4.68   4.32   2.90   4.05   1.03   0.99   3.73

 RPi3/RPi2 NotOpenMP-MemSpeed2
      L1   1.32   1.63   1.63   1.26   1.72   1.63   1.48   1.83   1.85
      L2   1.82   1.99   2.13   1.67   2.17   2.16   1.95   2.15   2.15
     RAM   3.33   3.79   3.64   3.56   3.70   3.61   1.12   1.70   1.70

###################### RPi3 OpenMP/NotOpenMP ####################

      L1   3.46   1.25   0.35   4.31   1.50   0.38   5.83   2.95   0.45
      L2   3.37   1.41   0.43   4.01   1.70   0.46   3.39   2.66   0.55
     RAM   2.70   1.53   1.02   3.30   2.16   0.85   1.03   1.04   1.05

################## RPi3 NotOpenMP/1 core OpenMP #################

      L1   1.03   3.18   3.75   0.91   2.61   3.65   0.65   1.15   4.17
      L2   1.03   2.78   3.12   0.92   2.28   3.06   0.73   1.13   3.71
     RAM   1.04   1.90   1.62   0.99   1.40   1.59   0.87   0.86   1.66
|
This executes the same functions as MP-MFLOPS, in two versions. One uses NEON intrinsic functions and the other is compiled with directives to use NEON. The two benchmarks obtain similar performance, as reflected in the results below, the first being for MP-MFLOPS with compiled NEON instructions. The numeric results of the two versions show rounding differences, identified by @@@@@.
Raspberry Pi 3 average performance gains over RPi 2 were 1.34 and 2.30 at the two sets of tests, for the compiled version, and effectively the same for the program with intrinsic functions (see above).
As produced, the 32 Operations Per Word arithmetic statements were in a loop with one load and one store, but compiled with numerous additional load instructions, with code similar to MP-MFLOPSPiNeon (see assembly code below). It could probably have been anticipated that there were insufficient registers for all the variables.
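As a rough illustration of where the register pressure comes from, the sketch below shows the shape of such a loop in plain C, reduced to 8 operations per word. The names `triad8` and `constants_needed`, and the constants `a` to `f`, are hypothetical and not taken from the benchmark source; the expression pattern is only inferred from the disassembly shown later in this document. Each word is loaded once and stored once, but every add and multiply needs its own constant held in a register.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 8-operations-per-word kernel: 3 adds, 3 multiplies and
   2 add/subtract combines, with one load and one store per element.
   All six constants must stay in registers to avoid extra loads. */
void triad8(float *x, size_t n,
            float a, float b, float c, float d, float e, float f)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        float v = x[i];                                  /* one load  */
        x[i] = (v + a) * b - (v + c) * d + (v + e) * f;  /* 8 ops     */
    }                                                    /* one store */
}

/* Constants needed by a kernel of this pattern (an assumption about the
   pattern, not the benchmark source): each (v + k1) * k2 term costs
   2 operations, plus 1 operation to combine it with the previous terms,
   so ops = 3 * terms - 1 and constants = 2 * terms. */
int constants_needed(int ops_per_word)
{
    int terms = (ops_per_word + 1) / 3;
    return 2 * terms;
}
```

Under these assumptions, `constants_needed(32)` gives 22, which is more than the 16 quad word NEON registers available on ARM V7, so some constants have to be reloaded from the stack inside the loop.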
################## RPi 2 V7A2 Compiled NEON ####################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-MFLOPS Compiled NEON v1.0 Fri Mar 20 17:01:47 2015

 FPU Add & Multiply using 1, 2, 4 and 8 Threads

               2 Ops/Word              32 Ops/Word
 KB       12.8     128   12800    12.8     128   12800
 MFLOPS
 1T        361     446     329     692     678     647
 2T        887     841     430    1371    1358    1300
 4T       1596    1141     381    2719    2725    2482
 8T       1542    1502     384    2604    2701    2460
 Results x 100000
 1T      76406   97075   99969   66008   95367   99951

 End of test Fri Mar 20 17:01:58 2015

################## RPi 2 NEON Intrinsics #######################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-MFLOPS NEON Intrinsics v1.0 Fri Mar 20 17:07:09 2015

 FPU Add & Multiply using 1, 2, 4 and 8 Threads

               2 Ops/Word              32 Ops/Word
 KB       12.8     128   12800    12.8     128   12800
 MFLOPS
 1T        249     347     268     709     706     679
 2T        635     667     411    1403    1386    1323
 4T        919    1342     377    2783    2798    2623
 8T       1076    1341     380    2589    2476    2409
 Results x 100000
 1T      76406   97075   99969   66014   95363   99951
                                 @@@@@   @@@@@

 End of test Fri Mar 20 17:07:20 2015

################ RPi 2 NEON Intrinsics OC #######################

 Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2

 MP-MFLOPS NEON Intrinsics v1.0 Fri Mar 20 17:19:01 2015

 FPU Add & Multiply using 1, 2, 4 and 8 Threads

               2 Ops/Word              32 Ops/Word
 KB       12.8     128   12800    12.8     128   12800
 MFLOPS
 1T        309     386     308     788     785     758
 2T        778     745     500    1554    1546    1483
 4T       1048    1461     468    3097    3072    2931
 8T       1377    1253     465    2780    2781    2689
 Results x 100000
 1T      76406   97075   99969   66014   95363   99951

 End of test Fri Mar 20 17:19:11 2015

################## RPi 3 V7A2 Compiled NEON ####################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016

 FPU Add & Multiply using 1, 2, 4 and 8 Threads

               2 Ops/Word              32 Ops/Word
 KB       12.8     128   12800    12.8     128   12800
 MFLOPS
 1T        419     782     437    1672    1660    1637
 2T       1324    1529     442    3331    3308    3212
 4T       1903    1574     439    5040    6073    5738
 8T       1613    2204     433    5543    5780    5445
 Results x 100000
 1T      76406   97075   99969   66008   95367   99951

 End of test Mon Aug 15 19:09:52 2016

################## RPi 3 NEON Intrinsics #######################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS NEON Intrinsics v1.0 Mon Aug 15 19:41:37 2016

 FPU Add & Multiply using 1, 2, 4 and 8 Threads

               2 Ops/Word              32 Ops/Word
 KB       12.8     128   12800    12.8     128   12800
 MFLOPS
 1T        347     583     427    1706    1703    1657
 2T       1080    1157     438    3397    3398    3226
 4T        979    1430     437    6265    6128    5464
 8T       1218    1351     436    5507    5766    5426
 Results x 100000
 1T      76406   97075   99969   66014   95363   99951

 End of test Mon Aug 15 19:41:42 2016
|
As indicated in Raspberry Pi Benchmarks.htm, the original Linpack benchmark operates on double precision floating point 100x100 matrices (N = 100). This version uses mainly the same C programming code as the single precision floating point NEON compilation. It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. The 0 thread procedures are identical to those in the single core 100x100 NEON compilation, using intrinsic functions.
The code differences were slight changes to allow a higher level of parallelism. The initial 100x100 Linpack benchmark is only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for this is the same as 100x100, except users are allowed to use their own linear equation solver.
Unlike the NEON MP MFLOPS benchmark, that carries out the same multiply/add calculations, this program can run much slower using multiple threads. This is due to the overhead of creating and closing threads too frequently. At 100x100, around 0.67 million floating point calculations are executed in daxpy, the critical function. With the present equations, threads have to be created 99 times (unless someone can do better and change more things). At 100x100, data size is 40 KB, L2 cache based. With larger matrices, performance becomes more dependent on RAM, but multi-threading overheads have less influence.
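The figures quoted above can be approximated with a short calculation. The sketch below assumes the classic LINPACK factorisation structure (one daxpy per remaining column at each elimination step); the function names are hypothetical and the benchmark's own code may count or organise the work differently.

```c
#include <assert.h>
#include <stddef.h>

/* Floating point operations executed by daxpy while factorising an
   n x n matrix, assuming the classic LINPACK loop structure: at
   elimination step k, a daxpy of length (n - k - 1) is applied to each
   of the (n - k - 1) remaining columns, at 2 operations per element. */
long daxpy_flops(int n)
{
    assert(n > 0);
    long flops = 0;
    for (int k = 0; k < n - 1; k++) {
        long len = n - k - 1;
        flops += 2 * len * len;
    }
    return flops;
}

/* Elimination steps; if the daxpy work is shared out at every step,
   this is also the number of times threads must be started. */
int elimination_steps(int n)
{
    return n - 1;
}

/* Bytes occupied by a single precision n x n matrix. */
size_t matrix_bytes(int n)
{
    return (size_t)n * (size_t)n * sizeof(float);
}
```

For N = 100 this gives 656,700 operations in the factorisation (the solve stage adds a few thousand more), 99 steps and 40,000 bytes. That averages under 7,000 floating point operations per step, which is very little work to set against the cost of starting and stopping threads.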
Without threading at N=100, as shown below, speed is a little faster than single core Linpack NEON MFLOPS, due to improved coding, but not as fast as MP-NeonMFLOPS, which has less variety in accessing data. Performance is worse at N=500 and 1000, where data is mainly from RAM.
The benchmark checks that the numeric results produced using threads are identical to those without threading. As expected, these are not the same using different matrix sizes, and the N=100 results are the same as those from linpackPiNEONi, the single core version.
Raspberry Pi 3 - At N=100, average speed was 1.73 times that of the RPi 2, with 1.52 to 1.59 times using the larger matrices. These can be compared with a CPU MHz ratio of 1.33.
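One way to arrive at averages of this kind is to take the arithmetic mean of the per-test RPi 3 / RPi 2 ratios. The sketch below does this for the N=100 row of the results in this section; `mean_ratio` is a hypothetical helper, and the averaging method is an assumption that happens to agree with the quoted 1.73.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of the element-wise ratios newer[i] / older[i]. */
double mean_ratio(const double *newer, const double *older, size_t count)
{
    assert(count > 0);
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum += newer[i] / older[i];
    }
    return sum / (double)count;
}

/* MFLOPS at N = 100 for None, 1, 2 and 4 threads, copied from the
   RPi 3 NEON and RPi 2 NEON (900 MHz) results in this section. */
static const double rpi3_n100[4] = {538.46, 116.24, 113.61, 113.47};
static const double rpi2_n100[4] = {323.06, 66.59, 64.76, 64.64};
```

Calling `mean_ratio(rpi3_n100, rpi2_n100, 4)` returns a value close to 1.73. The same helper applied to the N=500 and N=1000 rows gives roughly 1.59 and 1.52.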
######################### RPi 2 NEON ############################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 linpackPiNEONi MFLOPS 300
 MP-NeonMFLOPS  MFLOPS 347 at 128 KB

######################### RPi 2 NEON ############################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Sun Mar 22 15:37:56 2015

 MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     323.06    66.59    64.76    64.64
 N  500     276.52   216.62   215.69   216.28
 N 1000     235.25   221.69   222.63   223.98

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

 Thread 0 - 4 Same Results Same Results Same Results

####################### RPi 2 NEON OC ###########################

 Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2

 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Sun Mar 22 15:47:04 2015

 MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     362.42    74.74    74.63    75.16
 N  500     326.00   259.13   257.42   258.82
 N 1000     280.61   262.30   262.31   262.38

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

 Thread 0 - 4 Same Results Same Results Same Results

######################### RPi 3 NEON ############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Mon Aug 15 19:44:30 2016

 MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     538.46   116.24   113.61   113.47
 N  500     467.73   335.53   338.61   338.97
 N 1000     363.87   336.10   336.72   336.22

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

 Thread 0 - 4 Same Results Same Results Same Results
|
Below are examples of disassembled code for MP-MFLOPS plus the NEON variety. The first uses 32 bit single precision floating point registers. At least, with 32 arithmetic calculations per word, use is made of the advanced instructions VFMA or VFMS (Vector Fused Multiply Accumulate or Subtract). Ten of these execute 20 of the 32 floating point operations, the other twelve being from conventional add and multiply instructions.
The NEON compilation uses the same VFMA and VFMS instructions, but on 128 bit quad words, for SIMD operation, in an unrolled loop where 10 VFMAs (or VFMSs) execute 80 floating point operations. Four word vectors are also used for adds and multiplies. This produces up to 1.7 GFLOPS per core on a Raspberry Pi 3, which is not very good out of a maximum of 9.6 GFLOPS. Part of the reason is the excessive number of load instructions, probably due to an insufficient number of registers. With compiler generated unrolling, disassembled code can show many more sets of calculations, to cover situations where the data size is too small for the whole unrolled loop.
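The operation counts above follow from each fused multiply-add performing one multiply and one add per lane. The small sketch below, with hypothetical helper names, reproduces the 20 and 80 operation figures and expresses the measured speed as a share of the quoted 9.6 GFLOPS maximum, which is taken from the text and not derived here.

```c
#include <assert.h>

/* Floating point operations performed by a group of fused multiply-add
   (or multiply-subtract) instructions: one multiply and one add per
   lane for each instruction. */
int fma_flops(int instructions, int lanes)
{
    assert(instructions >= 0 && lanes > 0);
    return instructions * lanes * 2;
}

/* Measured speed as a percentage of a quoted peak. */
double percent_of_peak(double measured_gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return 100.0 * measured_gflops / peak_gflops;
}
```

Here `fma_flops(10, 1)` is 20 for the scalar code and `fma_flops(10, 4)` is 80 for four word vectors, while `percent_of_peak(1.7, 9.6)` comes to just under 18 percent.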
The notOpenMP-MFLOPS benchmark (OpenMP-MFLOPS but not using OMP threads) has the extra test with 8 operations per word. As shown below, the inner loop is unrolled by the compiler to produce 32 calculations via quad word vectors, but at not much more than 1.7 GFLOPS. Manually unrolling the loop to 16 x 8 calculations did not lead to further unrolling by the compiler. With four times more calculations in the loop, a maximum of just over 3 GFLOPS could be demonstrated, still a long way from 9.6.
 MP-MFLOPSPiA7,                    MP-MFLOPSPiNeon

 2 Operations Per Word             2 Operations Per Word

 .L27:                             .L83:
 flds s15, [r1]                    vld1.64 {d16-d17}, [lr:64]
 fadds s15, s0, s15                add r4, r4, #1
 fmuls s15, s15, s1                add lr, lr, #16
 fstmias r1!, {s15}                cmp r2, r4
 cmp r1, r0                        add r3, r3, #16
 bne .L27                          vadd.f32 q8, q8, q10
                                   vmul.f32 q8, q8, q9
                                   vstr d16, [r3, #-16]
                                   vstr d17, [r3, #-8]
                                   bhi .L83

 32 Operations Per Word            32 Operations Per Word

 .L21:                             .L61:
 flds s23, [r1]                    vld1.64 {d18-d19}, [lr:64]
 fadds s16, s23, s2                vldr d16, [sp, #64]
 fadds s24, s23, s0                vldr d17, [sp, #72]
 fadds s31, s23, s4                vldr d14, [sp, #80]
 fadds s30, s23, s6                vldr d15, [sp, #88]
 fnmuls s16, s3, s16               vadd.f32 q8, q9, q8
 fadds s29, s23, s8                vld1.64 {d20-d21}, [sp:64]
 fadds s28, s23, s10               vmul.f32 q8, q8, q7
 fadds s27, s23, s12               vadd.f32 q10, q9, q10
 vfma.f32 s16, s24, s1             vldr d14, [sp, #16]
 fadds s26, s23, s14               vldr d15, [sp, #24]
 fadds s25, s23, s17               vldr d22, [sp, #144]
 fadds s24, s23, s19               vldr d23, [sp, #152]
 fadds s23, s23, s21               vfma.f32 q8, q10, q7
 vfma.f32 s16, s31, s5             vldr d20, [sp, #128]
 vfms.f32 s16, s30, s7             vldr d21, [sp, #136]
 vfma.f32 s16, s29, s9             vldr d14, [sp, #192]
 vfms.f32 s16, s28, s11            vldr d15, [sp, #200]
 vfma.f32 s16, s27, s13            vadd.f32 q10, q9, q10
 vfms.f32 s16, s26, s15            vadd.f32 q7, q9, q7
 vfma.f32 s16, s25, s18            vfma.f32 q8, q10, q11
 vfms.f32 s16, s24, s20            vldr d22, [sp, #208]
 vfma.f32 s16, s23, s22            vldr d23, [sp, #216]
 fstmias r1!, {s16}                vadd.f32 q10, q9, q15
 cmp r1, r0                        add r4, r4, #1
 bne .L21                          add lr, lr, #16
                                   cmp r2, r4
                                   add r3, r3, #16
 NotOpenMP                         vfma.f32 q8, q7, q11
                                   vldr d22, [sp, #256]
 8 Operations Per Word             vldr d23, [sp, #264]
                                   vadd.f32 q7, q9, q11
 .L31:                             vldr d22, [sp, #240]
 vld1.64 {d18-d19}, [lr:64]        vldr d23, [sp, #248]
 add r4, r4, #1                    vfma.f32 q8, q10, q14
 add lr, lr, #16                   vldr d20, [sp, #32]
 cmp r2, r4                        vldr d21, [sp, #40]
 add r3, r3, #16                   vadd.f32 q10, q9, q10
 vadd.f32 q8, q9, q12              vfma.f32 q8, q7, q11
 vadd.f32 q10, q9, q3              vldr d22, [sp, #96]
 vmul.f32 q8, q8, q11              vldr d23, [sp, #104]
 vadd.f32 q9, q9, q14              vadd.f32 q7, q9, q11
 vfma.f32 q8, q10, q15             vldr d22, [sp, #48]
 vfms.f32 q8, q9, q13              vldr d23, [sp, #56]
 vstr d16, [r3, #-16]              vfms.f32 q8, q10, q11
 vstr d17, [r3, #-8]               vldr d22, [sp, #112]
 bhi .L31                          vldr d23, [sp, #120]
                                   vldr d20, [sp, #160]
                                   vldr d21, [sp, #168]
                                   vadd.f32 q10, q9, q10
                                   vfms.f32 q8, q7, q11
                                   vldr d22, [sp, #224]
                                   vldr d23, [sp, #232]
                                   vadd.f32 q7, q9, q11
                                   vldr d22, [sp, #176]
                                   vldr d23, [sp, #184]
                                   vadd.f32 q9, q9, q13
                                   vfms.f32 q8, q10, q11
                                   vfms.f32 q8, q7, q6
                                   vfms.f32 q8, q9, q12
                                   vstr d16, [r3, #-16]
                                   vstr d17, [r3, #-8]
                                   bhi .L61
|