Roy Longbottom's Raspberry Pi MultiThreading Benchmarks Including Raspberry Pi 2 and 3


General CPU, Cache and RAM MFLOPS MP-MFLOPS Results
MP Whetstone Benchmark MP-WHETS Results MP Dhrystone Benchmark
MP-Dhry Results MP BusSpeed Benchmark MP-BusSpd Results
MP-RandMem Benchmark MP-RandMem Results OpenMP MFLOPS Results
OpenMP MemSpeed Results MP-NeonMFLOPS MP NEON Linpack
Some Assembly Code


General

Roy Longbottom’s PC Benchmark Collection comprises numerous FREE benchmarks and reliability testing programs for processors, caches, memory, buses, disks, flash drives, graphics, local area networks and the Internet. The original ones run under DOS and later ones under all varieties of Windows. Most have also been converted to run under Linux on PCs, and many for ARM CPUs via Android and Raspbian. For the latter, details, download links for benchmark execution files and source code, and results are provided in Raspberry Pi Benchmarks.htm, with downloads for the multithreading program codes in Raspberry_Pi_MP_Benchmarks.zip.

The ARM benchmarks use the same multithreading programming code as my Linux MultiThreading Benchmarks. Each of these programs obtains the same configuration details as the Android versions but, unlike with Android, results are saved in a text log file, besides being displayed. An example of the Raspberry Pi details obtained is shown below. When more than one CPU core is provided, separate details are normally shown for each one, labelled Processor 0, Processor 1 etc. At the end of each benchmark, any appropriate additional information can be entered from the keyboard.

The C program codes used for the RPi were also compiled on a Linux based PC, the only change being for the version name (to Linux/Intel from Linux/ARM). This Intel version is included in the zip file. Results below include those, from this version, on an Intel Atom CPU and a quad core AMD Phenom processor.

Results are now included for the Raspberry Pi 2, which has a quad core ARM V7 processor. The original benchmarks were run along with revised versions (MP-xxxxPiA7), compiled with gcc 4.8 to use the advanced hardware features identified in the cpuinfo details. The new benchmarks are included in the zip file. An example of the compile command that uses the new features is shown below. This also includes -funsafe-math-optimizations, which can produce incorrect results. For these benchmarks, it leads to acceptable minor rounding differences.

2016 - The latest benchmarks were run on the Raspberry Pi 3 Model B, which includes a quad core Broadcom BCM2837 system-on-chip running at 1200 MHz, each core having a 32 KB L1 cache. There is a shared 512 KB L2 cache and 1 GB RAM. The CPU is an ARM Cortex-A53, capable of 64 bit working, but presently only 32 bit operation is supported. A different graphics driver had to be installed to run a new OpenGL GLUT benchmark. In certain cases, benchmarks were rerun with this driver disabled, as it could lead to degraded CPU performance.


 SYSTEM INFORMATION

 From File /proc/cpuinfo
 Processor	: ARMv6-compatible processor rev 7 (v6l)
 BogoMIPS	: 464.48
 Features	: swp half thumb fastmult vfp edsp java tls 
 CPU implementer	: 0x41
 CPU architecture: 7
 CPU variant	: 0x0
 CPU part	: 0xb76
 CPU revision	: 7

 Hardware	: BCM2708
 Revision	: 000d
 Serial		: 00000000db690cb4
 

 From File /proc/version
 Linux version 3.6.11+ (dc4@dc4-arm-01) (gcc version 4.7.2 20120731 
  (prerelease) (crosstool-NG linaro-1.13.1+bzr2458 - Linaro GCC 2012.08) )
  #371 PREEMPT Thu Feb 7 16:31:35 GMT 2013
 
 ####################################################

 Raspberry Pi 2

 processor       : 0, 1, 2 and 3
 model name      : ARMv7 Processor rev 5 (v7l)
 BogoMIPS        : 38.40
 Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva
                  idivt vfpd32 lpae evtstrm 
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant     : 0x0
 CPU part        : 0xc07
 CPU revision    : 5

Linux version 3.18.5-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.8.3 20140303 (prerelease)
      (crosstool-NG linaro-1.13.1+bzr2650 - Linaro GCC 2014.03) ) #225 SMP PREEMPT 
      Fri Jan 30 18:53:55 GMT 2015

 ####################################################

 Raspberry Pi 3 - 32 bit mode

 processor       : 0, 1, 2 and 3
 model name      : ARMv7 Processor rev 4 (v7l)
 BogoMIPS        : 38.40
 Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva 
                   idivt vfpd32 lpae evtstrm crc32 
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant     : 0x0
 CPU part        : 0xd03
 CPU revision    : 4

 Linux version 4.1.19-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3 (crosstool-NG 
 crosstool-ng-1.22.0-88-g8460611) ) #858 SMP Tue Mar 15 15:56:00 GMT 2016

 ####################################################

 gcc 4.8 compile statement

 gcc MP-xxxxx.c  -lrt -lc -lm -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 
       -funsafe-math-optimizations -lpthread -o MP-xxxxxPiA7
   




CPU, Cache and RAM MFLOPS Benchmarks
MP-MFLOPS, MP-MFLOPSPiA7, MP-MFLOPSPiNeon, MP-MFLOPSDP

This benchmark executes the same functions as my CUDA and OpenMP performance tests. Details and results of these can be found in Linux CUDA MFLOPS.htm and OpenMP Speeds.htm. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The program checks for consistent numeric results, primarily to show that all calculations are carried out, and can be run using between 1 and 64 threads. Each thread uses the same calculations but accesses different segments of the data.
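
As a minimal sketch of the function form quoted above (not the benchmark's actual source), the 2 and 8 operations per word cases might be written as follows. The function names, the assumed 2 operation form (one add and one multiply) and the way a thread is given its own segment through a start index and word count are all assumptions for illustration.

```c
#include <stddef.h>

/* 2 operations per word: one add and one multiply (assumed form). */
void triad2(float *x, size_t start, size_t words, float a, float b)
{
    for (size_t i = start; i < start + words; i++)
        x[i] = (x[i] + a) * b;
}

/* 8 operations per word: the expression quoted in the text
   (3 adds, 3 multiplies, 1 subtract and 1 further add). */
void triad8(float *x, size_t start, size_t words,
            float a, float b, float c, float d, float e, float f)
{
    for (size_t i = start; i < start + words; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

With two threads, one might call triad8(x, 0, words / 2, ...) in the first and triad8(x, words / 2, words - words / 2, ...) in the second, so that each thread works on its own segment of the array.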

The RPi and Android versions use the functions with 2 and 32 operations per word, with 1, 2, 4 and 8 threads. Only three data sizes are used, to exercise L1 cache, L2 cache and RAM, at 12.8, 128 and 12800 KB (3200, 32000 and 3200000 single precision floating point words). These are accessed 5000, 500 and 5 times respectively (for a constant 32 million arithmetic operations at 2 operations per word and 512 million at 32). This means that numeric results at a given data size should be the same using 1, 2, 4 and 8 threads. The repeat parameters were later increased to 10000, 1000 and 10, for longer running time, doubling the arithmetic calculations and producing different numeric results.
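
The operation counts quoted above follow from words x passes x operations per word. The sketch below reproduces that arithmetic and the conversion to MFLOPS. The function names are hypothetical and are not taken from the benchmark source.

```c
#include <stdint.h>

/* Total floating point operations for one test:
   words in the array x repeat passes x operations per word. */
uint64_t total_ops(uint64_t words, uint64_t passes, uint64_t ops_per_word)
{
    return words * passes * ops_per_word;
}

/* MFLOPS from an operation count and a measured elapsed time in seconds. */
double mflops(uint64_t ops, double seconds)
{
    return (double)ops / seconds / 1.0e6;
}
```

For example, total_ops(3200, 5000, 2), total_ops(32000, 500, 2) and total_ops(3200000, 5, 2) all give 32,000,000, and the same sizes at 32 operations per word give 512,000,000.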

Below is an example of performance and numeric results, showing that the latter are consistent across all thread counts for a particular set of calculations. All result words are checked for identical values (proving that all threads carried out every calculation required) and, if there are any discrepancies, the sumcheck is set to zero.
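
A check of this kind could be sketched as below. This is a hypothetical stand-in for the benchmark's own checking code, assuming that every word starts with the same value so that all results should be identical.

```c
#include <stddef.h>

/* Returns the common result if every word in x[0..words-1] is identical,
   otherwise 0.0f to flag a discrepancy (mirroring the zero sumcheck
   described in the text). */
float check_results(const float *x, size_t words)
{
    if (words == 0)
        return 0.0f;
    for (size_t i = 1; i < words; i++)
        if (x[i] != x[0])
            return 0.0f;
    return x[0];
}
```

Under this assumption, a thread that skipped part of its segment would leave words at a different value, and the reported result would drop to zero.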

MP-MFLOPSPiA7 was compiled using the Cortex-A7 directive, with changes to use NEON instructions for MP-MFLOPSPiNeon, then all floating point variables were changed to Double Precision for MP-MFLOPSDP. These all produced some different numeric results, as shown below.


 MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 17:41:13 2013

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T       43      33      31     191     170     161
 2T       44      42      31     192     174     160
 4T       44      43      31     192     176     159
 8T       43      51      31     192     184     160
 Results x 100000 
 1T    86735   98519   99984   79897   97638   99975
 2T    86735   98519   99984   79897   97638   99975
 4T    86735   98519   99984   79897   97638   99975
 8T    86735   98519   99984   79897   97638   99975

         End of test Sat Jul 27 17:42:00 2013

 Later 76406   97075   99969   66015   95363   99951
 Neon  76406   97075   99969   66008   95367   99951
 DP    76384   97072   99969   66065   95370   99951
  




MP-MFLOPS Results

Although these Raspberry Pi MFLOPS speeds are quite impressive, they are nowhere near the claimed maximum capabilities. For example, the claimed RPi 3 single precision maximum is 38.4 GFLOPS, at 32 MFLOPS/MHz, with double precision at a quarter of that speed. The first maximum RPi 3 NEON-VFP speeds measured were 6.03 GFLOPS SP and 2.3 GFLOPS DP, the former at 5.0 MFLOPS/MHz.
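
The MFLOPS/MHz figures above are simple ratios against the 1200 MHz clock. A small sketch of the arithmetic, using hypothetical helper names, is:

```c
/* MFLOPS achieved per MHz of CPU clock. */
double mflops_per_mhz(double mflops, double cpu_mhz)
{
    return mflops / cpu_mhz;
}

/* Peak GFLOPS implied by a per-MHz rating at a given clock speed. */
double peak_gflops(double rating_mflops_per_mhz, double cpu_mhz)
{
    return rating_mflops_per_mhz * cpu_mhz / 1000.0;
}
```

Here peak_gflops(32.0, 1200.0) gives the 38.4 GFLOPS claim, while mflops_per_mhz(6030.0, 1200.0) gives about 5.0 for the measured single precision maximum.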

On the other hand, the same source code, compiled for Intel CPUs with GCC, obtained 23 out of 32 MFLOPS/MHz with SSE instructions and 45.6 out of 64 MFLOPS/MHz with AVX 1 options. This was on a Core i7 CPU. See Linux MP-MFLOPS Benchmarks.

Some of the instructions generated by the compiler, for the Raspberry Pi, are shown below, with some explanation.

Below are performance results for the RPi, at normal MHz settings and with maximum overclocking. Speed improvements, due to the latter, are approximately proportional to differences in CPU and SDRAM MHz. Results from the Android version, running on a four core CPU, are also provided. This shows speed gains of up to four times that for a single core but, in this case, needs eight threads to do it.

The Atom has Hyperthreading that allows more than one thread to run at the same time on a single core CPU. Results indicate that performance is mainly dependent on CPU speed, whereas there is some degradation due to cache and RAM speeds on the other systems. The quad core AMD CPU speed using L2 cache can be faster than with data from L1 cache, probably due to some conflict on storing results of calculations. Note that Intel numeric results are slightly different to those from ARM CPUs.

Comparing MP-MFLOPS speeds between the old RPi and the Raspberry Pi 2 (1 core vs 4 cores), all at 1000 MHz, shows performance increases of 8 to 10 times at 2 operations per word and 6 to 7 times with the higher instruction count. This benchmark only uses single precision floating point arithmetic. As for other benchmarks, the new V7A compilation produced essentially the same MFLOPS speeds as the original MP-MFLOPS program. Running time on the RPi 2 was rather short, so the run pass parameter was increased for a longer run. As expected, this led to slightly different numeric results (see below).

The program was recompiled including the -funsafe-math-optimizations parameter, to force the use of NEON instructions, as MP-NeonMFLOPS. The V7A2 NEON entry below shows results at 1000 MHz, achieving a peak performance of just over 3 GFLOPS (3040 MFLOPS). With the lower processing per word tests, average speed gains of 24 times were demonstrated, for cache based data, compared with the original RPi. The revised compilation included fused multiply accumulate instructions, where slightly different answers can be produced (see @@@@@ below).

An earlier Android benchmark executes the same functions as MP-MFLOPS, but using NEON intrinsic functions. This was also converted for the RPi 2 - see MP-NeonMFLOPS, where results are almost the same as the NEON compiled version.

On the Raspberry Pi 3, with a CPU MHz 1.33 times that of the Raspberry Pi 2, some MP-MFLOPS benchmark speeds did not improve in proportion. That was the case with MP-MFLOPSPiA7 at 2 operations per word, but it averaged 55% faster at 32 operations per word. MP-MFLOPSPiNeon did much better, with average performance ratios of 1.34 and 2.30 for the two sets of tests. It was then much faster than MP-MFLOPSPiA7, by more than twice with 32 calculations per word and by up to 4.66 times from cached data at 2 per word (see assembly code below). MP-MFLOPSDP, the double precision compilation of MP-MFLOPSPiA7, compiled to the same instructions as the single precision version, but applicable to 64 bit registers. Speeds were effectively the same, except that some tests were slower with RAM based data.


 pi@raspberrypi ~/benchmarks/mpmflops $ ./MP-MFLOPS
                                   V7A  ./MP-MFLOPSPiA7 
  
 *****************************************************
 Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz

 MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 17:41:13 2013

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T       43      33      31     191     170     161
 2T       44      42      31     192     174     160
 4T       44      43      31     192     176     159
 8T       43      51      31     192     184     160
 Results x 100000 
 1T    86735   98519   99984   79897   97638   99975

         End of test Sat Jul 27 17:42:00 2013

  
############################ RPi OC ##############################

 Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts

 MP-MFLOPS Linux/ARM v1.0 Sat Jul 27 18:45:14 2013

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T       49      58      45     278     255     237
 2T       72      60      46     278     262     240
 4T       72      62      46     279     265     239
 8T       72      70      46     279     225     234
 Results x 100000 
 1T    86735   98519   99984   79897   97638   99975

         End of test Sat Jul 27 18:45:46 2013


 ########################### RPi 2 ###############################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-MFLOPS Linux/ARM v1.0 Mon Mar  2 17:14:57 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      149     147     130     410     395     387
 2T      298     291     254     820     803     782
 4T      526     409     393    1519    1622    1456
 8T      494     486     335    1581    1518    1436
 Results x 100000 
 1T    86735   98519   99984   79897   97638   99975

         End of test Mon Mar  2 17:15:07 2015


 ######################### RPi 2 OC ############################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

 MP-MFLOPS Linux/ARM v1.0 Wed Mar  4 10:24:36 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      136     161     144     452     451     433
 2T      328     322     282     904     900     862
 4T      593     546     449    1739    1790    1711
 8T      543     537     437    1588    1679    1578
 Results x 100000
 1T    86735   98519   99984   79897   97638   99975

         End of test Wed Mar  4 10:24:45 2015


######################### RPi 2 V7A ############################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-MFLOPS Linux/ARM V7A v1.0 Sun Mar 15 12:50:06 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      113     158     138     438     437     417
 2T      321     315     266     884     873     831
 4T      424     611     343    1706    1731    1629
 8T      560     512     332    1579    1622    1520
 Results x 100000
 1T    86735   98519   99984   79897   97639   99975

         End of test Sun Mar 15 12:50:15 2015


##################### V7A2 Increased Passes ####################
######################### RPi 2 V7A2 ###########################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-MFLOPS Linux/ARM V7A v1.0 Fri Mar 20 16:59:26 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      158     158     136     424     435     414
 2T      322     314     264     875     868     824
 4T      528     533     394    1731    1744    1612
 8T      549     505     392    1639    1629    1518
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951

         End of test Fri Mar 20 16:59:44 2015

################## RPi 2 V7A2 Compiled NEON ####################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-MFLOPS Compiled NEON v1.0 Tue Aug 16 11:18:32 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      357     451     337     690     688     657
 2T      885     769     426    1355    1354    1315
 4T     1320    1747     382    2700    2721    2552
 8T     1391    1405     381    2548    2653    2446
 Results x 100000
 1T    76406   97075   99969   66008   95367   99951

         End of test Tue Aug 16 11:18:43 2016


######################## RPi 2 V7A2 OC #########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

 MP-MFLOPS Linux/ARM V7A v1.0 Fri Mar 20 17:17:44 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      138     177     158     488     487     454
 2T      359     352     308     976     968     922
 4T      552     627     465    1939    1906    1760
 8T      554     586     453    1763    1830    1779
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951

         End of test Fri Mar 20 17:18:00 2015

################ RPi 2 V7A2 Compiled NEON OC ###################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 


 MP-MFLOPS Compiled NEON v1.0 Fri Mar 20 17:18:25 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      369     504     386     769     766     736
 2T     1052     947     486    1521    1513    1440
 4T     2052    2023     470    3040    2917    2854
 8T     1764    1920     459    2860    2883    2597
 Results x 100000
 1T    76406   97075   99969   66008   95367   99951
                               @@@@@   @@@@@

         End of test Fri Mar 20 17:18:35 2015


######################### RPi 3 V7A2 ###########################

    Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS Linux/ARM V7A v1.0 Mon Aug 15 19:07:03 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      168     182     171     691     693     684
 2T      364     358     329    1382    1381    1358
 4T      408     484     401    2451    2561    2436
 8T      609     554     420    2531    2425    2315
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951

         End of test Mon Aug 15 19:07:15 2016

########## RPi 3 V7A2 New OpenGL GLUT Driver Disabled ##########

    Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS Linux/ARM V7A v1.0 Tue Aug 30 14:16:59 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      159     181     178     690     692     685
 2T      342     364     353    1384    1386    1368
 4T      466     501     456    2451    2473    2633
 8T      581     643     479    2618    2502    2550
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951

         End of test Tue Aug 30 14:17:11 2016


################# RPi 3 V7A2 Double Precision ##################

    Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS Double Precision v1.0 Wed Sep  7 17:07:12 2016
 
   FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      143     182     171     678     680     674
 2T      343     361     240    1360    1360    1335
 4T      441     712     240    2232    2208    2185
 8T      406     593     241    2345    2315    2272
 Results x 100000
 1T    76384   97072   99969   66065   95370   99951

         End of test Wed Sep  7 17:07:18 2016


################## RPi 3 V7A2 Compiled NEON ####################

    Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      419     782     437    1672    1660    1637
 2T     1324    1529     442    3331    3308    3212
 4T     1903    1574     439    5040    6073    5738
 8T     1613    2204     433    5543    5780    5445
 Results x 100000
 1T    76406   97075   99969   66008   95367   99951

         End of test Mon Aug 15 19:09:52 2016

####### RPi 3 V7A2 NEON New OpenGL GLUT Driver Disabled ########

 MP-MFLOPS Compiled NEON v1.0 Tue Aug 30 14:18:13 2016

    Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      488     774     485    1674    1652    1644
 2T     1438    1503     488    3341    3299    3262
 4T     1984    1703     472    5045    5125    5256
 8T     1567    2098     470    5527    5400    5021
 Results x 100000
 1T    76406   97075   99969   66008   95367   99951

         End of test Tue Aug 30 14:18:18 2016


########################## Other ###############################
 
 P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4

   Android MP-MFLOPS v7 Benchmark V1.0 23-Dec-2012 14.12

     FPU Add & Multiply using 1, 2, 4 and 8 Threads

         2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      208     188     172     588     675     643
 2T      392     375     302    1323    1342    1311
 4T      472     439     321    1824    1758    1645
 8T      619     608     381    2666    2537    2645

           Total Elapsed Time    6.7 seconds

  
 *****************************************************
 Intel Atom 1.66 GHz, Linux Ubuntu 10.10

 MP-MFLOPS Linux/Intel v1.0 Sat Jul 27 18:18:15 2013

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      207     206     201     406     404     404
 2T      303     363     354     799     793     783
 4T      330     367     357     798     795     788
 8T      321     366     354     796     793     788
 Results x 100000 
 1T    86723   98518   99984   79927   97642   99975

         End of test Sat Jul 27 18:18:26 2013

  
 *****************************************************
 Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04

 MP-MFLOPS Linux/Intel v1.0 Tue Jul 30 15:12:02 2013

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T     1987    1977    1585    3827    3826    3732
 2T     3835    3975    2527    7631    7654    7442
 4T     6723    7873    2932   10822   14463   13728
 8T     5890    7659    5497   10300   14452   14006
 Results x 100000 
 1T    86723   98518   99984   79927   97642   99975

         End of test Tue Jul 30 15:12:03 2013

   




MP Whetstone Benchmarks - MP-WHETS, MP-WHETSPiA7

The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Details and performance of early to modern systems can be found in Whetstone Benchmark History And Results and Results On PCs. The overall performance rating is in Millions of Whetstone Instructions Per Second (MWIPS). Later, it was found necessary to measure the speed of the eight different test functions used, to demonstrate that compilers were not over optimising and to allow code tweaks to avoid this situation. The additional measurements are in terms of Millions of Operations Per Second (MOPS) or MFLOPS for straight floating point calculations. As the design authority, nominated by the original author, I have to say that versions that do not provide these separate measurements cannot be taken as valid.

This multithreading benchmark runs using 1, 2, 4 and 8 threads, executing multiple copies of the same program. An initial calibration, using a single thread, determines the number of passes needed for an overall execution time of 5 seconds. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread but using dedicated variables. The numeric results of calculations are noted for the first thread with others checked for the same values and an error message displayed if they are inconsistent.
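
The calibration step described above can be sketched as scaling a short trial run up to the target time. The function names and the idealised timing model (running time extends in whole rounds when threads outnumber cores) are assumptions for illustration, not the benchmark's own code.

```c
/* Scale a trial pass count so that a run should last about target_secs.
   trial_passes and trial_secs come from a short single thread timing run. */
unsigned long calibrate_passes(unsigned long trial_passes, double trial_secs,
                               double target_secs)
{
    if (trial_secs <= 0.0)
        return trial_passes;   /* timer too coarse: keep the trial count */
    double scaled = (double)trial_passes * target_secs / trial_secs;
    return scaled < 1.0 ? 1UL : (unsigned long)scaled;
}

/* Idealised elapsed time when every thread runs the same pass count:
   time extends once there are more threads than CPU cores. */
double expected_secs(double one_thread_secs, int threads, int cores)
{
    int rounds = (threads + cores - 1) / cores;   /* ceiling division */
    return one_thread_secs * rounds;
}
```

For a quad core system calibrated to 5 seconds, this model gives about 5 seconds for 1, 2 and 4 threads and about 10 seconds for 8 threads, which is the pattern seen in the Overall Seconds lines of the quad core results.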

Displayed speeds are in the order that tests are run but are sorted for logged results, as shown below.




MP-WHETS Results

Relative performance due to overclocking is similar to MP-MFLOPS, an exception being fixed point calculations, where the particular compiler might have optimised the code too much, and many more passes could be needed to produce consistent speeds. The Galaxy SIII and AMD Phenom are also more inclined to achieve a four times performance gain with quad cores. The Atom hyperthreading shows improved throughput with multiple threads on all tests.

Running the original benchmark on the Raspberry Pi 2 shows a performance increase of between 1.3 and 2.1 times, on the different tests, on a single core at 1000 MHz. With multithreading, this leads to a 10.2 times increase on MFLOPS, 5.8 times on functions and 7.5 times on integer MOPS. The revised PiA7 compilation generates slower code on some tests but the three quoted ratios increase to 12.8, 6.1 and 8.2 times.

Raspberry Pi 3 overall MWIPS ratings are 1.37 times RPi 2 speeds, with ratios for other tests in the range 1.19 to 1.79, except for the last copy test (Equal MOPS), where the average is 2.73.


 pi@raspberrypi ~/benchmarks/mpwhetss $ ./MP-WHETS 
                                   V7A  ./MP-WHETSPiA7
 
 *****************************************************
 Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz

  MP-Whetstone Benchmark Linux/ARM v1.0 Sat Jul 27 17:44:25 2013

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp  Fixpt     If  Equal
                 1      2      3  MOPS  MOPS   MOPS   MOPS   MOPS

 1T   243.6   91.9   90.9   84.4   4.9   2.7 2143.6  471.5  114.9
 2T   256.8   61.2   90.0   82.3   5.6   2.7 2201.7  496.9  120.4
 4T   258.5   74.5   96.0   84.2   5.6   2.7 2272.7  501.5  118.7
 8T   258.5   84.2   95.2   85.1   5.6   2.7 2774.6  522.9  106.0

   Overall Seconds   3.26 1T,   6.34 2T,  12.57 4T,  25.80 8T

  
 ############################ RPi OC ##############################

 Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts

  MP-Whetstone Benchmark Linux/ARM v1.0 Sat Jul 27 18:41:42 2013

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp  Fixpt     If  Equal
                 1      2      3  MOPS  MOPS   MOPS   MOPS   MOPS

 1T   373.6  136.0  135.3  122.5   8.2   3.8 3271.4  705.5  180.1
 2T   375.2  122.5  133.4  118.4   8.2   3.9 3247.6  733.4  180.0
 4T   377.0  132.2  139.3  122.3   8.2   3.9 6267.6  737.9  172.3
 8T   377.0  135.1  140.6  123.2   8.2   3.9 4585.5  749.7  162.5

   Overall Seconds   3.52 1T,   7.07 2T,  14.23 4T,  28.83 8T


 ########################### RPi 2 ###############################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

  MP-Whetstone Benchmark Linux/ARM v1.0 Tue Mar  3 16:37:24 2015

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T   515.4  250.2  250.0  223.0  10.0   5.1  4421.6  891.1  334.5
 2T  1035.2  500.9  501.9  447.7  20.0  10.2  8878.8 1789.0  671.3
 4T  2063.4  960.6  996.0  893.6  39.9  20.5 17560.2 3559.3 1334.9
 8T  2140.9 1192.4 1325.4  992.3  40.3  21.2 24312.0 3968.1 1379.2

   Overall Seconds   4.98 1T,   4.98 2T,   5.06 4T,  10.11 8T


 ######################### RPi 2 OC #############################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

MP-Whetstone Benchmark Linux/ARM v1.0 Wed Mar  4 10:34:01 2015

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T   577.6  280.2  280.1  249.7  11.2   5.7  4961.5  998.2  374.7
 2T  1155.5  560.1  560.1  499.4  22.4  11.4  9915.1 1995.8  749.3
 4T  2290.3 1080.3 1110.8  994.2  44.4  22.8 13471.3 3642.0 1491.6
 8T  2405.6 1506.5 1490.2 1103.8  45.9  23.5 28234.7 5151.7 1552.5

   Overall Seconds   4.74 1T,   4.74 2T,   4.84 4T,   9.82 8T


######################### RPi 2 V7A #############################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

  MP-Whetstone Benchmark Linux/ARM V7A v1.0 Tue Mar  3 16:44:08 2015

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T   527.8  361.8  362.9  184.2  10.0   5.6  3316.1  889.2  445.2
 2T  1056.9  724.9  729.2  368.6  20.0  11.2  6638.7 1779.1  891.6
 4T  2119.0 1381.1 1454.5  739.2  40.1  22.5 13301.0 3571.3 1788.4
 8T  2195.2 1912.9 1849.8  805.7  40.8  23.1 17643.5 4808.5 1893.6

   Overall Seconds   4.70 1T,   4.70 2T,   4.75 4T,   9.56 8T


######################## RPi 2 V7A OC ###########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

  MP-Whetstone Benchmark Linux/ARM V7A v1.0 Tue Mar  3 17:54:38 2015

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T   593.5  409.5  409.3  206.7  11.3   6.3  3729.2  998.2  499.6
 2T  1181.3  814.4  801.0  411.7  22.4  12.5  7423.4 1988.4  994.7
 4T  2351.2 1486.6 1527.9  813.0  44.7  25.0 14800.0 3825.0 1989.5
 8T  2452.9 2199.5 2099.1  890.8  45.3  26.1 21104.2 5439.9 2084.2

   Overall Seconds   4.98 1T,   5.03 2T,   5.10 4T,  10.26 8T


######################### RPi 3 V7A #############################

    Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-Whetstone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:34:21 2016

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T   723.1  517.2  517.0  254.9  12.1   8.8  5853.9 1181.8 1189.8
 2T  1464.7  960.5 1025.1  511.3  24.1  18.5 11899.0 2381.2 2385.7
 4T  2902.3 1696.4 1867.3 1013.4  47.8  36.8 19754.6 4541.3 4687.1
 8T  3004.0 2747.8 2569.0 1066.4  48.6  38.0 25502.9 6075.2 5610.8

   Overall Seconds   4.77 1T,   4.74 2T,   4.88 4T,   9.76 8T

  
 ########################## Other #################################

 P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4

     Android MP-Whetstone Benchmark V1.0 23-Dec-2012 14.36

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp  Fixpt     If  Equal
                 1      2      3  MOPS  MOPS   MOPS   MOPS   MOPS

 1T  1206.4  266.3  269.4  310.1  30.1  17.6  522.8  551.8  597.9
 2T  2411.7  520.5  530.0  619.1  60.0  35.1 1026.4 1359.2 1195.9
 4T  4719.0  874.2  881.7 1231.1 119.1  69.6 2072.8 2779.4 2369.0
 8T  4676.4 1227.1 1105.1 1182.4 120.0  63.2 2254.4 2821.8 2299.5

 Overall Seconds   4.84 1T,   4.82 2T,   5.14 4T,  10.25 8T

  
 *****************************************************
 Intel Atom 1.66 GHz, Linux Ubuntu 10.10

  MP-Whetstone Benchmark Linux/Intel v1.0 Sat Jul 27 18:08:28 2013

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp  Fixpt     If  Equal
                 1      2      3  MOPS  MOPS   MOPS   MOPS   MOPS

 1T   704.4  329.0  328.9  280.8  17.1   7.6  763.7 1248.6  117.7
 2T  1203.2  562.1  613.8  484.0  30.4  13.4  997.5 1688.6  176.8
 4T  1203.8  605.2  618.4  477.3  30.4  13.4  993.0 1688.2  178.0
 8T  1206.9  608.1  619.8  486.1  30.3  13.4 1008.3 1702.2  177.9

   Overall Seconds   4.99 1T,   6.28 2T,  12.48 4T,  24.93 8T


 *****************************************************
 Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04

  MP-Whetstone Benchmark Linux/Intel v1.0 Tue Jul 30 15:11:23 2013

                    Using 1, 2, 4 and 8 Threads

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T  2662.8  924.4  926.2  695.0  64.3  34.7  3088.3 2258.4  620.7
 2T  5304.1 1850.8 1850.9 1387.8 128.4  69.1  6210.0 4507.3 1200.1
 4T 10582.7 3551.7 3668.0 2771.6 256.7 138.0 12173.2 8966.3 2399.4
 8T 10637.9 3758.1 3754.6 2772.7 257.2 138.5 12389.8 9104.7 2441.1

   Overall Seconds   4.90 1T,   4.98 2T,   5.04 4T,   9.94 8T

  




MP Dhrystone Benchmarks - MP-DHRY, MP-DHRYPiA7

The Dhrystone "C" benchmark provides a measure of integer performance (no floating point instructions). It became the key standard benchmark from 1984. Speed was originally measured in Dhrystones per second. This was later changed to VAX MIPS by dividing Dhrystones per second by 1757, the DEC VAX 11/780 result, that machine being regarded as the first 1 MIPS minicomputer. Details and results from Windows and Linux based PCs can be found in Dhrystone Results.htm.
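The VAX MIPS conversion described above can be sketched in C as follows. This is a minimal illustration and not the benchmark's own source; the function names vax_mips and print_rating are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Dhrystones per second recorded for the DEC VAX 11/780 (the 1 MIPS reference). */
#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0

/* Convert a Dhrystones per second score into a VAX MIPS rating. */
double vax_mips(double dhrystones_per_sec)
{
    assert(dhrystones_per_sec >= 0.0);
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC;
}

/* Hypothetical helper: print one score with its rating. */
void print_rating(double dhrystones_per_sec)
{
    printf("%10.0f Dhrys/sec = %6.0f VAX MIPS\n",
           dhrystones_per_sec, vax_mips(dhrystones_per_sec));
}
```

For example, the single thread score of 1650351 Dhrystones per second divided by 1757 gives a rating of about 939 VAX MIPS.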

This multithreading benchmark runs using 1, 2, 4 and 8 threads, executing multiple copies of the same program. An initial calibration, using a single thread, determines the number of passes needed for an overall execution time of 1 second. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread. Some variables can be used by all threads and it might be foreseen that this could cause the program to crash. Data arrays have been moved so that different RAM will be allocated for each thread. One of the locations is used to count the number of passes used by each thread and these are checked for consistency.
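The arrangement of running the same pass count on every thread, then checking the per thread counts, might look something like the following POSIX threads sketch. It is not taken from the benchmark source: the structure thread_data, the functions run_passes and run_threads, and the placeholder work inside the loop are all assumptions for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 8

/* Hypothetical per-thread data, kept separate for each thread. */
typedef struct {
    long passes_wanted;          /* pass count from the calibration run */
    long passes_done;            /* counted by the thread, checked later */
    volatile unsigned long work; /* stand-in for the benchmark's data */
} thread_data;

/* Thread body: execute the requested number of passes. */
static void *run_passes(void *arg)
{
    thread_data *td = (thread_data *)arg;
    for (long p = 0; p < td->passes_wanted; p++) {
        td->work += (unsigned long)p;  /* placeholder for one pass */
        td->passes_done++;
    }
    return NULL;
}

/* Run the same pass count on n threads. Returns 1 if all threads were
   started and each reports the expected pass count, otherwise 0. */
int run_threads(int n, long passes)
{
    pthread_t tid[MAX_THREADS];
    thread_data td[MAX_THREADS];
    int created = 0;
    int ok = 1;

    assert(n >= 1 && n <= MAX_THREADS);
    for (int t = 0; t < n; t++) {
        td[t].passes_wanted = passes;
        td[t].passes_done = 0;
        td[t].work = 0;
        if (pthread_create(&tid[t], NULL, run_passes, &td[t]) != 0)
            break;
        created++;
    }
    /* Wait for every thread that was started before checking counts. */
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    if (created != n)
        return 0;
    for (int t = 0; t < n; t++)
        if (td[t].passes_done != passes)
            ok = 0;
    return ok;
}
```

A caller would first time a single thread to choose the pass count, then call run_threads with 1, 2, 4 and 8 threads, timing each call to derive Dhrystones per second.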


  MP-Dhrystone Benchmark Linux/ARM v1.0 
        Fri Jul 26 12:25:24 2013

     Using 1, 2, 4 and 8 Threads

     Threads  Dhrys/sec   VAX MIPS
         1      1650351     939
         2      1547631     881
         4      1594706     908
         8      1619087     922

  Internal pass count correct all threads

   End of test Fri Jul 26 12:25:40 2013

  




MP-Dhry Results

RasPi performance improvement due to overclocking is again proportional to CPU MHz. The Android quad core phone shows limited performance gains of up to 2.63 times (using 8 threads), possibly a shared data effect. Again, the Atom shows performance gains due to Hyperthreading.

The worst comparisons are on the Phenom PC, where performance using two threads is less than half that of one thread, probably due to a conflict in updating results.

The Raspberry Pi 2 results also show multithreading performance degradations, through handling the shared data, with wide variations in measured speed. At 1000 MHz, a single RPi 2 core produces a 40% improvement in performance, compared with RPi 1 at the same frequency, with no gain via the PiA7 recompilation.

Raspberry Pi 3 performance, using a single thread, is 1.43 times that of the model 2 (V7A versions), not much more than the CPU MHz ratio of 1.33. It then appears to perform much better using more threads, at 3.49 times faster with four threads.
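The comparison ratios quoted above appear to be simple quotients of the Dhrystones per second scores in the V7A tables below. A minimal C sketch, with speed_ratio as a hypothetical name:

```c
#include <assert.h>

/* Ratio of two benchmark scores (newer system over older system). */
double speed_ratio(double new_score, double old_score)
{
    assert(old_score > 0.0);
    return new_score / old_score;
}
```

With the single thread scores, speed_ratio(4229473, 2956666) is about 1.43, and with the four thread scores, speed_ratio(10091677, 2895209) is about 3.49. The CPU MHz ratio, speed_ratio(1200, 900), is about 1.33.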


 pi@raspberrypi ~/benchmarks/mphry $ ./MP-DHRY 
                                 V7A ./MP-DHRYPiA7 
  
 *****************************************************
 Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz

  MP-Dhrystone Benchmark Linux/ARM v1.0 Sat Jul 27 17:38:10 2013

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.97     2.07     4.01     7.91
 Dhrystones per Second    1650351  1547631  1594706  1619087
 VAX MIPS rating              939      881      908      922

         Internal pass count correct all threads

         End of test Sat Jul 27 17:38:26 2013

 
 ########################### RPi OC ##############################

 Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts

  MP-Dhrystone Benchmark Linux/ARM v1.0 Sat Jul 27 18:48:23 2013

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.67     1.38     2.72     5.41
 Dhrystones per Second    2388323  2324087  2354828  2364828
 VAX MIPS rating             1359     1323     1340     1346

         Internal pass count correct all threads

         End of test Sat Jul 27 18:48:34 2013


########################### RPi 2 ###############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

  MP-Dhrystone Benchmark Linux/ARM v1.0 Mon Mar  2 17:12:58 2015

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.67     3.45     7.33    14.56
 Dhrystones per Second    2985075  1159420  1091405  1098901
 VAX MIPS rating             1699      660      621      625

         Internal pass count correct all threads

         End of test Mon Mar  2 17:13:06 2015


 ######################### RPi 2 OC #############################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

  MP-Dhrystone Benchmark Linux/ARM v1.0 Wed Mar  4 12:04:06 2015

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.96     2.22     9.92    19.82
 Dhrystones per Second    3333333  2882883  1290323  1291625
 VAX MIPS rating             1897     1641      734      735

         Internal pass count correct all threads

         End of test Wed Mar  4 12:04:17 2015


######################### RPi 2 V7A #############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

  MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Tue Mar  3 15:53:27 2015

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.54     0.68     2.21     2.95
 Dhrystones per Second    2956666  4706235  2895209  4339729
 VAX MIPS rating             1683     2679     1648     2470

         Internal pass count correct all threads

         End of test Tue Mar  3 15:53:34 2015


####################### RPi 2 V7A OC ###########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

  MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Tue Mar  3 17:41:09 2015

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.97     1.18     2.51     5.02
 Dhrystones per Second    3286275  5439640  5096932  5094843
 VAX MIPS rating             1870     3096     2901     2900

         Internal pass count correct all threads

         End of test Tue Mar  3 17:41:20 2015


######################### RPi 3 V7A #############################

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:47:57 2016

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.95     1.12     1.59     3.04
 Dhrystones per Second    4229473  7124952 10091677 10523432
 VAX MIPS rating             2407     4055     5744     5989

         Internal pass count correct all threads

         End of test Mon Aug 15 19:48:04 2016

 
########################## Other #################################

 P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4

Android MP-Dhrystone 2 Benchmark V1.0 23-Dec-2012 14.47

                   Using 1, 2, 4 and 8 Threads

Threads                        1        2        4        8
Seconds                     0.50     0.57     1.07     1.53
Dhrystones per Second    3187471  5583906  5972050  8389079
VAX MIPS rating             1814     3178     3399     4775

        Internal pass count correct all threads

          Total Elapsed Time    4.2 seconds


 *****************************************************
 Intel Atom 1.66 GHz, Linux Ubuntu 10.10

  MP-Dhrystone Benchmark Linux/Intel v1.0 Sat Jul 27 17:59:01 2013

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.87     1.37     2.78     5.47
 Dhrystones per Second    4624003  5836935  5756209  5845862
 VAX MIPS rating             2632     3322     3276     3327

         Internal pass count correct all threads

         End of test Sat Jul 27 17:59:12 2013

  
 *****************************************************
 Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04

  MP-Dhrystone Benchmark Linux/Intel v1.0 Tue Jul 30 15:10:47 2013

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.58     2.63     6.91    13.86
 Dhrystones per Second   13854293  6080597  4628248  4618862
 VAX MIPS rating             7885     3461     2634     2629

         Internal pass count correct all threads

         End of test Tue Jul 30 15:11:11 2013

  




MP BusSpeed Benchmarks - MP-BusSpd, MP-BusSpdPiA7, MP-BusSpd2PiA7

This uses the same calculations as my original BusSpeed2K Benchmark, the link providing data and results, including Windows and Linux MP varieties. Data is read using AND instructions at a range of data sizes covering caches and RAM. The program starts by reading words with 32 word address increments, then reduces the increment to eventually read all words sequentially. Speed reductions of around 50% at each higher increment suggest reading in bursts over the bus. This is normal for reading from RAM and is sometimes found reading cached data. In this case, only 12.3 KB, 123 KB and 12.3 MB memory sizes are used, via 1, 2, 4 and 8 threads.
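Reading words with AND instructions at a given address increment could be sketched in C as below. This is an assumed simplification, not the BusSpeed source: the function name read_with_increment and its parameters are invented for the example, and the real program may unroll or arrange its loops differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read every inc-th 32 bit word of data[] and AND the values together.
   The ANDed value is returned so the reads cannot be optimised away.
   If words_read is not NULL it receives the number of words read. */
uint32_t read_with_increment(const uint32_t *data, size_t words,
                             size_t inc, size_t *words_read)
{
    uint32_t and_sum = 0xFFFFFFFFu;
    size_t count = 0;

    assert(inc >= 1);
    for (size_t i = 0; i < words; i += inc) {
        and_sum &= data[i];
        count++;
    }
    if (words_read != NULL)
        *words_read = count;
    return and_sum;
}
```

Calling this with inc set to 32, 16, 8, 4, 2 and then 1 corresponds to the Inc32 to RdAll columns in the results. With large increments, only one word in each burst transferred from memory is used, which is consistent with the lower speeds reported in those columns.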



MP-BusSpd Results

Speeds using L1 cache and large address increments can be unpredictable and not show performance gains using multiple cores. Some of this might be due to high overheads compared with actual execution time. Note that RasPi L2 cache speeds are relatively slow, compared with those from L1.

Ignoring the wildly variable burst reading comparisons, with CPUs at 1000 MHz, single core Raspberry Pi 2 performance, via L1 cache, showed no improvement over RPi 1. There were significant gains in the 122.9 KB L2 cache test. The PiA7 compilation produced some performance increases at 122.9 KB and double speed via RAM. Comparing four thread RdAll speeds of this version with those of the single core RPi 1, both at 1000 MHz, gains were 4.1 times from L1 cache, 17.5 times via L2 cache and 11.7 times with RAM based data, at 3.35 GB/second.

The exaggerated performance is valid where all threads read the same data but, as the 512 KB L2 cache is shared between all cores, measured speed does not reflect RAM speed. In order to demonstrate more realistic memory speeds, a second version, MP-BusSpd2PiA7, was produced, where each thread starts reading from a different address (see RPi 2 V7A 2 and RPi 2 V7A 2 OC below). This produced fairly constant RAM speeds using multiple threads.
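One way of giving each thread a different starting address is sketched below in C. The source does not say how MP-BusSpd2PiA7 chooses its offsets, so the even spacing used here, and the names start_word and word_index, are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scheme: spread thread starting points evenly over the array. */
size_t start_word(int thread, int threads, size_t words)
{
    assert(threads >= 1 && thread >= 0 && thread < threads);
    return (size_t)thread * (words / (size_t)threads);
}

/* Index of the k-th word read by a thread, wrapping at the end of the
   array so that every thread still covers the same amount of data. */
size_t word_index(size_t start, size_t k, size_t inc, size_t words)
{
    assert(words > 0);
    return (start + k * inc) % words;
}
```

With four threads and 1024 words, the threads would start at words 0, 256, 512 and 768, so at any moment they are reading from different parts of the array and gain less from data another thread has just brought into the shared cache.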

Raspberry Pi 3 - Results for the latest MP-BusSpd are shown below. Compared with default RPi 2 performance, the best RAM speed improvements were in line with the difference in memory bus speed. Cache speed improvements were around 1.9 times, compared with a CPU MHz ratio of 1.33. The benchmark was rerun with the new graphics driver disabled, as this driver tended to degrade memory performance.


 pi@raspberrypi ~/benchmarks/mpbusspd $ ./MP-BusSpd 
                                  V7A   ./MP-BusSpdPiA7 
  
 *****************************************************
 Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz

 MP-BusSpd Linux/ARM v1.0 Sat Jul 27 17:32:12 2013

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T    550   1200   1194   1258   1273   1179
      2T    469   1207   1206   1261   1266   1217
      4T    585   1207   1225   1252   1243   1249
      8T   1046   1184   1208   1237   1236   1245
122.9 1T     22     46     45     55    107    218
      2T     22     46     43     55    105    224
      4T     21     46     43     55    106    224
      8T     22     45     44     54     92    217
12288 1T     32     32     33     42     85    182
      2T     15     33     30     41     80    165
      4T     14     18     31     42     82    175
      8T     15     28     32     43     81    178

         End of test Sat Jul 27 17:32:29 2013

  
 ########################### RPi OC ##############################

 Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts

 MP-BusSpd Linux/ARM v1.0 Sat Jul 27 18:50:22 2013

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   1246   1751   1794   1768   1847   1859
      2T   1591   1756   1741   1830   1837   1760
      4T   1162   1709   1784   1830   1802   1840
      8T   1574   1732   1739   1774   1817   1820
122.9 1T     90     90     84    106    198    415
      2T     65     88     86    106    204    418
      4T     90     88     86    103    196    403
      8T     89     88     82    103    192    407
12288 1T     49     49     50     71    138    293
      2T     37     49     50     71    138    295
      4T     45     50     49     69    129    288
      8T     30     48     50     70    135    291

         End of test Sat Jul 27 18:50:35 2013


########################### RPi 2 ###############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-BusSpd Linux/ARM v1.0 Mon Mar  2 17:09:03 2015

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   1050   1681   1685   1729   1732   1734
      2T   2955   3232   3324   3375   3425   3370
      4T   5045   6417   6600   6753   6795   6868
      8T   5053   5285   6087   5814   5845   6346
122.9 1T    383    391    695   1173   1493   1324
      2T    712    738   1382   2324   2960   2652
      4T    728    787   1593   3053   5693   4697
      8T    774    771   1575   3192   4622   4704
12288 1T     71     76    151    295    635    349
      2T    134    152    300    583   1242    691
      4T    146    164    272    755   1415   1366
      8T    137     77    240    421    930   1129

         End of test Mon Mar  2 17:09:16 2015


 ######################### RPi 2 OC #############################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

 MP-BusSpd Linux/ARM v1.0 Wed Mar  4 12:40:34 2015

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   1032   1869   1884   1919   1923   1927
      2T   3302   3606   3704   3794   3819   3821
      4T   5833   7058   7356   7517   7348   7618
      8T   5534   5699   6209   6517   6572   6674
122.9 1T    425    431    768   1285   1650   1469
      2T    809    815   1540   2583   3306   2944
      4T    824    875   1768   3651   6262   5809
      8T    858    822   1709   3464   5574   4615
12288 1T     96    110    218    424    914    447
      2T    193    219    436    785   1702    877
      4T    165    246    457    754   2236   1738
      8T    111    131    283    623   1348   1474

         End of test Wed Mar  4 12:40:47 2015


######################### RPi 2 V7A #############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-BusSpd Linux/ARM V7A v1.0 Tue Mar  3 16:08:07 2015

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   1571   1662   1670   1174   1698   1725
      2T   3072   3266   3362   3379   3416   3443
      4T   5077   6562   6582   6719   6771   6847
      8T   5318   5731   6009   5939   5820   5535
122.9 1T    376    396    702   1192   1558   1624
      2T    710    738   1388   2359   3111   3228
      4T    708    779   1618   3238   5729   6383
      8T    692    761   1612   2970   5056   5648
12288 1T     69     82    163    292    629   1251
      2T    138    160    329    579   1247   2380
      4T    217    175    364    485   1135   2582
      8T    106    100    210    585    871   1817

         End of test Tue Mar  3 16:08:21 2015


####################### RPi 2 V7A OC ###########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

 MP-BusSpd Linux/ARM V7A v1.0 Tue Mar  3 17:14:10 2015

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   1029   1851   1875   1903   1912   1909
      2T   3413   3616   3736   3799   3813   3818
      4T   6821   7300   5957   7493   7523   7621
      8T   5668   5894   6455   6372   6508   7495
122.9 1T    433    442    782   1305   1738   1789
      2T    810    813   1542   2588   3429   3574
      4T    818    887   1780   3584   6552   7071
      8T    839    854   1629   3284   5229   6202
12288 1T     92    116    228    407    854   1286
      2T    184    230    450    699   1619   2531
      4T    236    253    564   1492   2178   3356
      8T    156    164    258    699   1065   3018

         End of test Tue Mar  3 17:14:23 2015


######################## RPi 2 V7A 2 ############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-BusSpd ARM V7A v2 Fri Mar  6 17:29:14 2015

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T    961   1602   1638   1668   1733   2227
      2T   2644   3012   3154   3240   3393   4138
      4T   4024   5503   6027   6389   6705   8153
      8T   2780   3979   4777   5031   6028   6376
122.9 1T    356    389    688   1185   1541   2050
      2T    706    731   1373   2343   3070   4065
      4T    743    800   1595   3198   5894   7872
      8T    750    775   1566   2928   5406   7139
12288 1T     66     71    159    281    628   1147
      2T     87     87    177    311    697   1256
      4T     84     98    191    297    700   1186
      8T    103     93    177    294    742   1147

         End of test Fri Mar  6 17:29:26 2015


###################### RPi 2 V7A 2 OC ##########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

 MP-BusSpd ARM V7A v2 Fri Mar  6 17:35:56 2015

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T    952   1772   1817   1862   1921   2466
      2T   2958   3387   3552   3654   3826   4832
      4T   4448   6110   6708   7037   7344   9078
      8T   3358   4852   5570   5684   6631   7153
122.9 1T    435    436    787   1318   1718   2285
      2T    813    816   1534   2610   3426   4527
      4T    821    864   1780   3536   6523   8823
      8T    813    812   1607   3307   5750   8159
12288 1T     94    104    229    406    904   1648
      2T    141    141    289    454   1165   1785
      4T    143    148    256    407   1000   1584
      8T    148    133    250    485   1062   1531

         End of test Fri Mar  6 17:36:08 2015


######################## RPi 3 V7A 2 ############################

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-BusSpd ARM V7A v2 Sun Jul 24 09:26:21 2016

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   3011   3715   3792   4080   4400   4149
      2T   5391   6873   7125   7827   8466   8124
      4T   8622  11926  13488  15276  16419  13422
      8T   4922   7930   9659  11732  13307  11995
122.9 1T    565    563   1070   1792   2830   3865
      2T    886    901   1762   3225   5402   7584
      4T    901    921   1863   3727   7185  13816
      8T    874    919   1762   3712   6269   9242
12288 1T    120    125    244    420    968   1926
      2T    126    128    246    537   1000   2184
      4T    110    118    231    443    990   1824
      8T    120    137    262    517   1043   2124

         End of test Sun Jul 24 09:26:33 2016


########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ##########

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

MP-BusSpd ARM V7A v2 Tue Aug 30 13:45:43 2016

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   1565   3749   3718   4078   4385   4160
      2T   5041   6829   7066   7813   8584   7839
      4T   5480  11958  13330  15256  16863  15614
      8T   6006   8477   8873   7777   8918   8315
122.9 1T    566    566   1062   1822   2831   3907
      2T    899    906   1742   2395   5433   7638
      4T    907    935   1876   3757   7241  13871
      8T    863    919   1789   3491   6411   9403
12288 1T    130    136    263    513   1047   2080
      2T    185    138    276    554   1108   2149
      4T    131    137    269    536   1169   2383
      8T    125    133    224    513   1038   2142

         End of test Tue Aug 30 13:45:55 2016

 
########################## Other #################################

 P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4

Android MP-BusSpd v7 Benchmark V1.0 23-Dec-2012 14.42

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   3452   3697   4088   4122   3860   4183
      2T   6616   7251   8016   8179   8307   8191
      4T   8108   7430  10052   8511   8305   8404
      8T   8729  10701  11687  12938  15297  15116
122.9 1T    747    762    746    966    992   1401
      2T   1132   1161   1155   1554   1873   2668
      4T   1127   1133   1137   2193   2987   4614
      8T   1134   1145   1133   2210   3153   4231
12288 1T     82     89    200    376    739   1184
      2T    204    179    407    797   1449   2205
      4T    399    359    334   1227   1183   4038
      8T    134    123    226    502   1378   3718

          Total Elapsed Time   13.4 seconds

 
 *****************************************************
 Intel Atom 1.66 GHz, Linux Ubuntu 10.10

 MP-BusSpd Linux/Intel v1.0 Sat Jul 27 18:03:37 2013

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   5512   6061   6219   6362   6359   6388
      2T   5866   6412   6556   6638   6595   6659
      4T   6157   6445   6551   6607   6605   6639
      8T   6139   6424   6510   6611   6303   6070
122.9 1T    513    417    787   1476   2518   3945
      2T    586    696   1316   2347   3655   4741
      4T    625    686   1334   2270   3614   4736
      8T    615    720   1255   2273   3635   4777
12288 1T    135    261    522   1034   1966   3280
      2T    128    261    567   1146   2250   4535
      4T    118    277    562   1183   2300   4454
      8T    122    250    549   1122   2225   4413

         End of test Sat Jul 27 18:03:49 2013

  
 *****************************************************
 Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04

 MP-BusSpd Linux/Intel v1.0 Tue Jul 30 15:13:37 2013

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T  10273  13905  14184  13640  13542  13651
      2T   7599  14053  19451  22479  24301  25801
      4T   7743  15110  29846  44953  48783  51672
      8T   7613  15116  29805  44501  48082  51027
122.9 1T   1494   1496   2987   6001  11037  12857
      2T   2980   2987   5952  11900  21852  25515
      4T   5344   5967  11735  23781  43699  50429
      8T   5947   5903  11661  23333  42953  50569
12288 1T    459    466    922   1878   3117   5236
      2T    741    773   1452   2648   4731   8370
      4T    839    887   1814   4006   7923  14909
      8T    903    933   1921   4244   7997  14597

         End of test Tue Jul 30 15:13:49 2013
  




MP RandMem Benchmarks - MP-RandMem, MP-RandMemPiA7

The RandMem benchmark carries out four tests at increasing data sizes to produce data transfer speeds in MBytes per second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests using 32 bit integers. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is not used, and by the size of preceding caches. Details and results for Windows and Linux versions can be found in RandMem Results.htm. This benchmark uses data from the same array for all threads, but starting at different points. Results of the Serial Reading tests are checked for the same result on all threads.
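Using the same program structure for serial and random address selection can be illustrated with an index array, as in the C sketch below. This is an assumed approach rather than the RandMem source; serial_order, random_order and read_indexed are hypothetical names, and rand() is used only for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill idx[] with 0..n-1 in serial order. */
void serial_order(uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
}

/* Fill idx[] with 0..n-1 in shuffled order (Fisher-Yates).
   rand() has a limited range on some systems, so this only suits
   small arrays; a real test would want a wider generator. */
void random_order(uint32_t *idx, size_t n, unsigned seed)
{
    serial_order(idx, n);
    srand(seed);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        uint32_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Read data[] in the order given by idx[]. The same loop serves both
   the serial and the random test; only the index contents differ. */
uint32_t read_indexed(const uint32_t *data, const uint32_t *idx, size_t n)
{
    uint32_t sum = 0;

    assert(data != NULL && idx != NULL);
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}
```

Because both orders visit every word once, the two sums should match, which gives a simple check of the kind of result comparison mentioned above.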

The original Windows version produces extremely slow speeds with read/write tests, particularly with random access. Later Linux varieties included Mutex, or mutual exclusion, functions that avoid the updating conflict by only allowing one thread at a time to access common data. This can still lead to multiple threads being slower than one but, with random access, there can be a significant improvement compared with multiple thread speeds without Mutex, except when accessing RAM (see linux multithreading benchmarks.htm). This and the Android benchmarks also use Mutex and some speeds continue to be unpredictable.
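Mutual exclusion of this kind might be arranged with a POSIX mutex, as in the following C sketch. Where the benchmark actually takes and releases its lock is not stated in the text, so locking around a whole pass, and the names data_lock, rw_job, read_write_pass and rw_thread, are assumptions for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* One lock guarding the array shared by all threads. */
static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical description of the work given to one thread. */
typedef struct {
    uint32_t *data;   /* shared array */
    size_t words;     /* number of 32 bit words */
    int passes;       /* read/write passes to execute */
} rw_job;

/* One read/write pass: each word is read, updated and written back.
   The mutex allows only one thread at a time to update the data. */
void read_write_pass(uint32_t *data, size_t words)
{
    assert(data != NULL);
    pthread_mutex_lock(&data_lock);
    for (size_t i = 0; i < words; i++)
        data[i] = data[i] + 1;
    pthread_mutex_unlock(&data_lock);
}

/* Thread body: repeat the pass the requested number of times. */
void *rw_thread(void *arg)
{
    rw_job *job = (rw_job *)arg;
    for (int p = 0; p < job->passes; p++)
        read_write_pass(job->data, job->words);
    return NULL;
}
```

With this arrangement, the threads take turns at the shared array, so several threads would be expected to complete about the same total work per second as one.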

The revised PiA7 compilation made little difference to Raspberry Pi 2 MP-RandMem results. Multithreading did not provide any performance improvement with read/write tests (Mutex effect) but, compared with RPi 1, RPi 2 read/write gains were around 1.6 times using L1 cache and 4 to 6 times from L2 cache, with RAM at 6 times for serial access and 1.6 times with random access. RPi 2 single thread read only L1 cache tests showed no performance increase, with more than three times gain from L2 cache and a 70% improvement from RAM. Serial Read/Random Read, quad core versus RPi 1 single core, performance ratios were about 4/4 times for L1 cache, 15/9 times with L2 cache and 5.7/4.3 times from RAM.



MP-RandMem Results

The results show that, with read/write tests, there is no gain in using multiple threads on systems with multiple cores, at least for cache based data, due to Mutex effects (but this is better than being much slower - see above). As could be anticipated, random access is slow, compared with serial reading and writing, when burst transfers are involved. Note the similarities with BusSpeed above.

Raspberry Pi 3 results are provided with and without the new graphics driver, as memory performance was degraded by the latter. Comparisons with RPi 2, without the driver, are included below. Some were no better than the 1.33 CPU MHz increase, but RAM speed improvements were significant.


 pi@raspberrypi ~/benchmarks/mprandmem$ ./MP-RandMem
                                  V7A   ./MP-RandMemPiA7 
  
 *****************************************************
 Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz

 MP-RandMem Linux/ARM v1.0 Mon Jul 29 16:23:20 2013

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    1564    1347    1605    1343
      2T    1576    1338    1584    1217
      4T    1550    1297    1544    1324
      8T    1500    1303    1489    1183
122.9 1T     236     202     112      99
      2T     234     201     111     110
      4T     232     201     110      96
      8T     226     200     109      99
12288 1T     170     135      23      26
      2T     170     134      22      26
      4T     169     132      23      25
      8T     123     105      23      26

     No Errors Found

    End of test Mon Jul 29 16:24:19 2013

 
########################### RPi OC ##############################

 Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts

 MP-RandMem Linux/ARM v1.0 Mon Jul 29 17:05:25 2013

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    2306    1938    2312    1937
      2T    2281    1900    2288    1933
      4T    2238    1918    2240    1918
      8T    2171    1890    2178    1880
122.9 1T     460     369     205     194
      2T     448     371     202     195
      4T     441     371     204     194
      8T     427     367     202     193
12288 1T     270     198      36      42
      2T     270     198      36      42
      4T     270     198      36      42
      8T     270     198      36      42

     No Errors Found

    End of test Mon Jul 29 17:06:16 2013


########################### RPi 2 ###############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-RandMem Linux/ARM v1.0 Tue Mar  3 16:13:52 2015

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    2256    2857    2257    2858
      2T    4480    2847    4480    2849
      4T    8738    2795    8759    2808
      8T    8032    2772    8439    2794
122.9 1T    1624    1483     628     682
      2T    3208    1467    1183     683
      4T    6203    1457    1673     681
      8T    5793    1385    1670     689
12288 1T     359     940      55      57
      2T     670     941     105      57
      4T    1180     936     126      57
      8T    1161     938     127      56

     No Errors Found

    End of test Tue Mar  3 16:14:38 2015


 ######################### RPi 2 OC #############################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

 MP-RandMem Linux/ARM v1.0 Wed Mar  4 12:53:39 2015

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    2493    3157    2509    3176
      2T    4979    3161    4979    3153
      4T    9357    3144    9699    3125
      8T    8706    3104    8633    3085
122.9 1T    1796    2152     701     761
      2T    3577    2142    1331     762
      4T    6916    2151    1870     766
      8T    6421    2135    1823     765
12288 1T     461    1233      68      70
      2T     862    1218     129      69
      4T    1561    1210     159      69
      8T    1514    1202     162      69

     No Errors Found

    End of test Wed Mar  4 12:54:25 2015


######################### RPi 2 V7A #############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-RandMem Linux/ARM V7A v1.0 Tue Mar  3 16:28:30 2015

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    1967    2784    1968    2793
      2T    3711    2787    3740    2788
      4T    7019    2730    7313    2762
      8T    6783    2410    6881    2704
122.9 1T    1413    1489     532     681
      2T    2788    1470    1013     679
      4T    5393    1485    1593     681
      8T    5207    1448    1587     686
12288 1T     357     950      45      57
      2T     697     946      89      57
      4T    1212     930     126      56
      8T    1157     938     123      57

     No Errors Found

    End of test Tue Mar  3 16:29:18 2015


####################### RPi 2 V7A OC ###########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

 MP-RandMem Linux/ARM V7A v1.0 Tue Mar  3 17:51:16 2015

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    2209    3129    2209    3130
      2T    4177    3101    4156    3098
      4T    8289    3056    8292    3059
      8T    7641    3009    7574    2990
122.9 1T    1584    2121     592     754
      2T    3109    2105    1127     758
      4T    5983    2114    1715     759
      8T    5669    2118    1682     764
12288 1T     453    1219      55      69
      2T     841    1217     109      69
      4T    1535    1209     157      69
      8T    1523    1189     154      68

     No Errors Found

    End of test Tue Mar  3 17:52:01 2015


######################### RPi 3 V7A #############################

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-RandMem Linux/ARM V7A v1.0 Mon Aug 15 19:37:27 2016

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    2907    3773    2917    3790
      2T    5480    3768    5187    3775
      4T   11198    3679   10960    3712
      8T   10094    3697   10038    3685
122.9 1T    2673    3340     686     892
      2T    5031    3386    1251     888
      4T    9398    3378    2002     890
      8T    9291    3370    1916     886
12288 1T    1896     899      50      64
      2T    2535     900      98      65
      4T    2878     896     137      64
      8T    2631     897     130      65

     No Errors Found

    End of test Mon Aug 15 19:38:14 2016


########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ##########

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-RandMem Linux/ARM V7A v1.0 Tue Aug 30 14:13:08 2016

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    2930    3791    2918    3791
      2T    5571    3766    5194    3776
      4T   11196    3722   11205    3722
      8T   10063    3685   10051    3702
122.9 1T    2675    3398     681     893
      2T    5124    3387    1256     886
      4T   10041    3387    1916     891
      8T    9593    3367    1952     890
12288 1T    2120     979      54      71
      2T    3255     980     107      71
      4T    3346     979     138      70
      8T    2226     979     143      71

     No Errors Found

    End of test Tue Aug 30 14:13:54 2016

 RPi3/RPi2 Average

           SerRD SerRDWR   RndRD RndRDWR

 L1 cache   1.53    1.36    1.47    1.35
 L2 cache   1.86    2.29    1.24    1.31
 RAM        4.46    1.04    1.17    1.25

 
########################## Other #################################

 P11 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4

  Android MP-RndMem v7 Benchmark V1.0 23-Dec-2012 14.40

  MB/Second Using 1, 2, 4 and 8 Threads
  KB       SerRD SerRDWR   RndRD RndRDWR
12.29 1T    2043    2028    2066    2027
      2T    6788    3058    6835    3346
      4T    6251    3104    6478    3376
      8T    6635    3244    5408    3242
122.9 1T    1365    1392    1150    1151
      2T    2415    1386    1927    1159
      4T    2495    1374    1870    1117
      8T    2470    1352    1772    1013
12288 1T     581     351      71      77
      2T    1674     934     143      96
      4T    1675     882     143      95
      8T    1838     939     142      96

          Total Elapsed Time    5.5 seconds


 *****************************************************
 Intel Atom 1.66 GHz, Linux Ubuntu 10.10

 MP-RandMem Linux/Intel v1.0 Mon Jul 29 17:14:26 2013

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    4207    5242    4206    5244
      2T    6219    5159    5770    5118
      4T    6155    5158    6206    5149
      8T    5765    5019    6088    4956
122.9 1T    3084    3455     789    1077
      2T    4692    3451    1230    1078
      4T    4753    3408    1246    1076
      8T    4689    3400    1243    1045
12288 1T    1291    1339      57      88
      2T    3008    1323     108      88
      4T    3043    1336     105      88
      8T    3092    1329     108      87

     No Errors Found

    End of test Mon Jul 29 17:15:12 2013

  
 *****************************************************
 Quad Core AMD Phenom 3.0 GHz, Linux Ubuntu 12.04

 MP-RandMem Linux/Intel v1.0 Tue Jul 30 15:15:12 2013

  MB/Second Using 1, 2, 4 and 8 Threads

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T   14913   11834   14229   11681
      2T   24219   11686   23129   11552
      4T   35885   11566   33095   11443
      8T   29820   11596   29206   11518
122.9 1T   10936   10580    5543    4835
      2T   20167   10563    9942    4814
      4T   38266   10522   18061    4845
      8T   37272   10437   17753    4813
12288 1T    3858    3864     655     559
      2T    6280    3866    1137     558
      4T   10752    3859    1920     558
      8T   11107    3827    1924     558

     No Errors Found

    End of test Tue Jul 30 15:15:53 2013

  




OpenMP MFLOPS Results - OpenMP-MFLOPS, notOpenMP-MFLOPS

These benchmarks use the same calculations as the original MP_MFLOPS benchmark for Linux, whereas MP-MFLOPS above uses a cut-down version, produced for use on Android devices. The OpenMP-MFLOPS benchmark places the simplest OpenMP directive, #pragma omp parallel for, before the for loops where parallelisation might be expected, and is compiled with the -fopenmp parameter. notOpenMP-MFLOPS is the same code, compiled without that parameter.

The default memory sizes used, starting at 400 KB, are much larger than those of MP-MFLOPS, as is the number of repeat passes. However, these benchmarks have run time parameters, shown below, that can change both. In fact, test runs show that performance depends mainly on the number of operations per word, and notOpenMP-MFLOPS speed is almost the same as the 1 Thread MP-MFLOPS results.

Besides notOpenMP-MFLOPS, results below include OpenMP-MFLOPS set to run on a single core. At 32 operations per word, notOpenMP-MFLOPS was nearly twice as fast as this single core OpenMP run. Examination of the assembly code generated for this particular test function shows that the OpenMP version has 67 instructions and the notOpenMP version 346, clearly with more options to suit data size. Both use vfma fused floating-point multiply accumulate instructions, where there are rounding complications. Note the different numeric results at 32 operations per word. At least the default OpenMP-MFLOPS benchmark shows speed gains of up to 3.9 times those using a single core.

Raspberry Pi 3 results, using the same parameters as MP-MFLOPS, have been included. Those for notOpenMP-MFLOPS at 2 and 32 Ops/Word were almost identical to the equivalent MP-MFLOPSPiNeon results but, unlike the latter, little gain was produced with multithreading. MP performance appeared to be improved somewhat by increasing the pass count. Compared to Raspberry Pi 2, notOpenMP-MFLOPS was around twice as fast at 8 and 32 operations per word, but less so with 2 operations. With OpenMP-MFLOPS, it sometimes appeared to be slower. Poor performance could well be associated with running time and overheads.


######################## Run Time Parameters #############################

 To match the latest MP-MFLOPS:  ./OpenMP-MFLOPS Words 3200, Repeats 10000

############################## RPi 2 V7A #################################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

            OpenMP MFLOPS Benchmark 1 Sat Mar  7 15:44:05 2015

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.676731      739    0.929538   Yes
 Data in & out    1000000     2      250   1.365332      366    0.992550   Yes
 Data in & out   10000000     2       25   1.308170      382    0.999250   Yes

 Data in & out     100000     8     2500   1.076658     1858    0.957126   Yes
 Data in & out    1000000     8      250   1.390932     1438    0.995524   Yes
 Data in & out   10000000     8       25   1.356837     1474    0.999550   Yes

 Data in & out     100000    32     2500   5.561007     1439    0.890232   Yes
 Data in & out    1000000    32      250   5.843752     1369    0.988068   Yes
 Data in & out   10000000    32       25   5.791580     1381    0.998785   Yes

                End of test Sat Mar  7 15:44:30 2015


 ***************** taskset 0x00000001 ./OpenMP-MFLOPS 1 Core *****************

            OpenMP MFLOPS Benchmark 1 Mon Mar  9 11:56:40 2015

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   1.592769      314    0.929538   Yes
 Data in & out    1000000     2      250   1.928113      259    0.992550   Yes
 Data in & out   10000000     2       25   1.917868      261    0.999250   Yes

 Data in & out     100000     8     2500   4.049420      494    0.957126   Yes
 Data in & out    1000000     8      250   4.766354      420    0.995524   Yes
 Data in & out   10000000     8       25   4.757556      420    0.999550   Yes

 Data in & out     100000    32     2500  21.886468      366    0.890232   Yes
 Data in & out    1000000    32      250  22.745527      352    0.988068   Yes
 Data in & out   10000000    32       25  22.726837      352    0.998785   Yes

                End of test Mon Mar  9 11:58:07 2015


 -----------------------------------------------------------------------------
 
            Not OpenMP MFLOPS Benchmark 1 Sat Mar  7 15:41:17 2015

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   1.256587      398    0.929538   Yes
 Data in & out    1000000     2      250   1.470944      340    0.992550   Yes
 Data in & out   10000000     2       25   1.467244      341    0.999250   Yes

 Data in & out     100000     8     2500   2.574641      777    0.957126   Yes
 Data in & out    1000000     8      250   3.241242      617    0.995524   Yes
 Data in & out   10000000     8       25   3.226519      620    0.999550   Yes

 Data in & out     100000    32     2500  11.566683      692    0.890268   Yes
 Data in & out    1000000    32      250  12.312695      650    0.988078   Yes
 Data in & out   10000000    32       25  12.309223      650    0.998806   Yes

                End of test Sat Mar  7 15:42:07 2015


####################### RPi 2 V7A OC ###########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

            OpenMP MFLOPS Benchmark 1 Sat Mar  7 19:21:01 2015

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.502595      995    0.929538   Yes
 Data in & out    1000000     2      250   1.061047      471    0.992550   Yes
 Data in & out   10000000     2       25   1.027811      486    0.999250   Yes

 Data in & out     100000     8     2500   0.962144     2079    0.957126   Yes
 Data in & out    1000000     8      250   1.202937     1663    0.995524   Yes
 Data in & out   10000000     8       25   1.158232     1727    0.999550   Yes

 Data in & out     100000    32     2500   4.947005     1617    0.890232   Yes
 Data in & out    1000000    32      250   5.147261     1554    0.988068   Yes
 Data in & out   10000000    32       25   5.111022     1565    0.998785   Yes

                End of test Sat Mar  7 19:21:23 2015

 
            Not OpenMP MFLOPS Benchmark 1 Sat Mar  7 19:19:54 2015

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   1.085229      461    0.929538   Yes
 Data in & out    1000000     2      250   1.314159      380    0.992550   Yes
 Data in & out   10000000     2       25   1.307451      382    0.999250   Yes

 Data in & out     100000     8     2500   2.323887      861    0.957126   Yes
 Data in & out    1000000     8      250   2.859657      699    0.995524   Yes
 Data in & out   10000000     8       25   2.851960      701    0.999550   Yes

 Data in & out     100000    32     2500  10.461870      765    0.890268   Yes
 Data in & out    1000000    32      250  11.074036      722    0.988078   Yes
 Data in & out   10000000    32       25  11.070011      723    0.998806   Yes

                End of test Sat Mar  7 19:20:39 2015


######################### RPi 3 V7A #############################

       Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

            OpenMP MFLOPS Benchmark 1 Sat Jul 30 13:01:12 2016

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.363631     1375    0.929538   Yes
 Data in & out    1000000     2      250   1.133716      441    0.992550   Yes
 Data in & out   10000000     2       25   1.150107      435    0.999250   Yes

 Data in & out     100000     8     2500   0.432833     4621    0.957126   Yes
 Data in & out    1000000     8      250   1.177219     1699    0.995524   Yes
 Data in & out   10000000     8       25   1.151536     1737    0.999550   Yes

 Data in & out     100000    32     2500   3.845114     2081    0.890232   Yes
 Data in & out    1000000    32      250   3.754590     2131    0.988068   Yes
 Data in & out   10000000    32       25   3.737356     2141    0.998785   Yes

                End of test Sat Jul 30 13:01:29 2016


-----------------------------------------------------------------------------

            Not OpenMP MFLOPS Benchmark 1 Mon Aug 15 19:23:03 2016

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.697952      716    0.929538   Yes
 Data in & out    1000000     2      250   1.160158      431    0.992550   Yes
 Data in & out   10000000     2       25   1.140070      439    0.999250   Yes

 Data in & out     100000     8     2500   1.178477     1697    0.957126   Yes
 Data in & out    1000000     8      250   1.442497     1386    0.995524   Yes
 Data in & out   10000000     8       25   1.428921     1400    0.999550   Yes

 Data in & out     100000    32     2500   5.060230     1581    0.890268   Yes
 Data in & out    1000000    32      250   5.203246     1538    0.988078   Yes
 Data in & out   10000000    32       25   5.203889     1537    0.998806   Yes

                End of test Mon Aug 15 19:23:26 2016


######################### RPi 3 V7A #############################
  Run with parameters ./OpenMP-MFLOPS Words 3200, Repeats 10000

       Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

            OpenMP MFLOPS Benchmark 1 Thu Sep 15 18:13:47 2016

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out       3200     2    10000   0.138179      463    0.764063   Yes
 Data in & out      32000     2     1000   0.091516      699    0.970753   Yes
 Data in & out     320000     2      100   0.193833      330    0.997008   Yes

 Data in & out       3200     8    10000   0.148140     1728    0.850919   Yes
 Data in & out      32000     8     1000   0.120691     2121    0.982347   Yes
 Data in & out     320000     8      100   0.429023      597    0.998205   Yes

 Data in & out       3200    32    10000   0.514128     1992    0.660291   Yes
 Data in & out      32000    32     1000   0.703450     1456    0.953632   Yes
 Data in & out     320000    32      100   1.067654      959    0.995180   Yes

                End of test Thu Sep 15 18:13:50 2016

-----------------------------------------------------------------------------

            Not OpenMP MFLOPS Benchmark 1 Thu Sep 15 18:14:47 2016

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out       3200     2    10000   0.152466      420    0.764063   Yes
 Data in & out      32000     2     1000   0.081762      783    0.970753   Yes
 Data in & out     320000     2      100   0.134984      474    0.997008   Yes

 Data in & out       3200     8    10000   0.147960     1730    0.850919   Yes
 Data in & out      32000     8     1000   0.148731     1721    0.982347   Yes
 Data in & out     320000     8      100   0.168795     1517    0.998205   Yes

 Data in & out       3200    32    10000   0.644568     1589    0.660158   Yes
 Data in & out      32000    32     1000   0.649362     1577    0.953663   Yes
 Data in & out     320000    32      100   0.663790     1543    0.995240   Yes

                End of test Thu Sep 15 18:14:50 2016

  




OpenMP MemSpeed Results
OpenMP-MemSpeed, OpenMP-MemSpeed2, NotOpenMP-MemSpeed2

The MemSpeed benchmark measures data reading speeds in megabytes per second, carrying out calculations on arrays of cache and RAM data. The calculations are shown in the results’ headings. As with the OpenMP-MFLOPS benchmark, OpenMP-MemSpeed uses the simplest OpenMP directive (#pragma omp parallel) before the test loops. Full results are below for the RPi 2 running at 900 and 1000 MHz. The compile command is also shown.

With OpenMP-MemSpeed Version 1, the OpenMP directive was placed before an inner loop, leading to possible performance degradation due to overheads. For Version 2, or OpenMP-MemSpeed2, the directive was moved to an outer loop. For the following results, Version 2 was run normally and also restricted to one CPU core, via the command taskset 0x00000001 ./OpenMP-MemSpeed2. Then, another compilation (NotOpenMP-MemSpeed2) was produced without the -fopenmp compile option, to use a single core without OMP overheads. All three versions are in Raspberry_Pi_MP_Benchmarks.zip.

Raspberry Pi 3 - Results are below, along with RPi3/RPi2 average performance ratios, plus those for Raspberry Pi 3 OpenMP/NotOpenMP and NotOpenMP/1 core OpenMP. Some RPi3/RPi2 comparisons were close to the 1.33 CPU MHz ratio (1200/900 MHz), but most were higher, particularly on RAM speed, at up to 4.68 times, and on all integer arithmetic tests, where all MP ratios were between 4.08 and 5.62. RPi 3 multiprocessing gains were disappointing on integer operations, but were mainly over 3.5 times for cache based floating point and over 3.0 times from RAM. Using one thread, the OpenMP benchmark produced wide variations from the unthreaded code, mainly worse, as expected, but some were better. It could be assumed that different instructions were generated.


######################### RPi 2 V7A #############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

  gcc memSpeedOMP.c cpuidc.c -lrt -lc -lm -O3 -mcpu=cortex-a7
   -mfloat-abi=hard -mfpu=neon-vfpv4 -funsafe-math-optimizations 
   -fopenmp -o OpenMP-MemSpeed

     Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom

               Start of test Sat Mar  7 19:12:39 2015

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4     589    759    843    925    871    896    517    491    490
       8    1487   1056   1367   1707   1161   1472    971    876    876
      16    2357   1595   1941   2852   2186   2348   1737   1422   1355
      32    2565   2045   2669   4329   2876   3161   2945   2246   2235
      64    3964   2294   3080   5634   3497   3962   4224   2936   2934
     128    2420   2317   3096   5661   3478   3928   1831   3416   3425
     256    2884   2150   2838   4411   3179   3578   1184   1392   1357
     512    1837   1731   2155   3061   2327   2557   1064   1217   1218
    1024     650    990   1106   1254   1134   1162   1050   1055    937
    2048     793    833    907   1010    935    889    851    676    825
    4096     792    705    864   1004    871    953    767    771    748
    8192     760    829    881   1009    935    961    761    736    766
   16384     839    810    873   1004    934    961    765    772    762
   32768     850    829    906   1005    725    953    770    776    777
   65536     951    838    894   1022    928    963    772    779    779
  131072     949    835    867   1010    937    950    774    786    788

                End of test Sat Mar  7 19:13:10 2015


####################### RPi 2 V7A OC ###########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

     Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom

               Start of test Sat Mar  7 19:17:09 2015

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4     669    855    948   1081    980   1004    578    548    496
       8    1595   1339   1549   1948   1643   1399   1099    915    994
      16    2467   1859   2292   3238   2464   2686   1966   1538   1660
      32    3429   2302   3012   4932   3330   3720   3324   2527   2332
      64    4190   2585   3520   6499   3969   4520   4881   3357   3366
     128    4327   2670   3656   6991   4134   4789   4914   3094   3827
     256    4185   2392   3524   6035   3994   4607   1710   2719   2713
     512    2757   2119   2329   4008   2944   3250   1587   1726   1717
    1024    1393   1161   1303   1493   1350   1408   1488   1476   1465
    2048     903    996   1086   1207   1113    921   1083   1093   1086
    4096     632    911   1058   1177   1094   1122    969    995    998
    8192    1141    988   1074   1198   1113   1112    985    994    998
   16384     825    980   1070   1184    950   1131   1014   1022   1015
   32768    1111    994   1083   1209   1117   1155    994    957   1007
   65536    1161    982   1084   1223   1112   1163    996    999    997
  131072    1134    986   1083   1212   1098   1135   1005   1017   1022

                End of test Sat Mar  7 19:17:40 2015


#################################################################
                        Version 2
########################## RPi 2 ################################

     Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz        

     Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

               Start of test Mon Sep  5 21:29:03 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    3259   2499    271   3261   2383    286   1333   2099    432
       8    2854   2507    256   3160   2594    305   1235   2116    445
      16    3329   2507    256   3331   3098    270   1173   1547    446
      32    3210   2509    264   3328   3026    267   1155   1452    433
      64    3461   1889    249   5869   3399    250   1128   2024    317
     128    3215   2229    257   5719   3672    262   1117   1123    293
     256    3896   2387    250   5677   3647    257   1119   1132    301
     512    2521   1527    217   2718   2258    230   1115   1112    282
    1024     931    871    185   1408   1254    182   1092   1094    258
    2048     863    777    212   1217   1203    198   1095   1088    275
    4096     846    724    159    962    885    168   1092   1078    251
    8192     824    779    234   1151   1191    200   1090   1070    266
   16384     791    701    362    961   1223    335   1078   1057    334
   32768     845    641    398    930    973    391    913   1066    300
   65536     312    256    331    359    306    338    956   1069    301
  131072     312    255    332    360    306    338    994    812    356

                End of test Mon Sep  5 21:29:41 2016

####################### RPi 2 1 Core ############################

     Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz        

     Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

               Start of test Mon Sep  5 21:30:34 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4     857    674    676   2199   1096    658   2297   1172    489
       8    1234    677    676   2269   1131    690   2351   1179    490
      16    1236    677    677   2273   1138    694   2362   1183    490
      32    1225    673    674   2258   1132    691   2354   1181    489
      64    1056    616    623   1732    950    638   1428   1093    471
     128     968    605    614   1660    947    626   1242   1127    476
     256     910    602    611   1635    947    626   1191   1131    475
     512     705    499    529   1242    743    515   1119    954    438
    1024     347    282    350    434    359    357    803    785    339
    2048     309    256    326    359    305    333    814    744    299
    4096     304    251    324    353    302    331    856    785    313
    8192     304    252    322    352    300    331    879    839    331
   16384     305    251    324    352    300    331    891    864    342
   32768     308    251    325    354    301    332    859    773    313
   65536     309    253    325    355    293    331    836    737    302
  131072     309    253    326    355    302    332    838    713    295

                End of test Mon Sep  5 21:31:10 2016

####################### RPi 2 Not OMP ###########################

     Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz        

    Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom

               Start of test Mon Sep  5 21:31:37 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4     823   1666   2560   2001   2172   2564   1985   1564   1565
       8    1262   1671   2571   2024   2181   2570   2018   1573   1571
      16    1261   1670   2576   2023   2179   2570   2023   1575   1575
      32    1064   1275   1705   1552   1569   1724   1468   1325   1272
      64     971   1292   1714   1493   1577   1721   1616   1296   1297
     128     995   1319   1767   1539   1626   1718   1464   1317   1318
     256     912   1294   1722   1494   1580   1714   1209   1376   1374
     512     655    885   1091    977   1025   1078   1108    932    948
    1024     364    408    451    422    439    450    863    510    511
    2048     309    334    356    343    352    360    914    562    557
    4096     305    304    350    338    345    350    930    607    604
    8192     306    331    356    340    349    356    922    608    609
   16384     311    332    358    343    349    358    917    609    609
   32768     313    333    355    344    349    359    925    609    607
   65536     313    333    359    343    351    358    926    611    612
  131072     314    331    359    345    351    359    927    609    612

                End of test Mon Sep  5 21:32:12 2016
 

######################### RPi 3 V7A #############################

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

     Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom

               Start of test Mon Aug 15 19:29:18 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4     749   1064   1302   1407   1236   1266    775    745    745
       8     379   1597   2200   2473   1968   2433   1481   1364   1378
      16    3180   2126   3319   3928   2901   3866   2718   2364   2372
      32    4244   2546   4492   5565   3733   5495   4685   3700   3713
      64    4930   2772   5252   7185   3845   6693   6947   4959   4960
     128    3699   2924   5785   8169   4592   7349   5553   6047   6009
     256    5553   2970   5939   8340   4585   7720   9048   6657   6653
     512    5167   2854   5537   7555   4116   7009   6125   5464   5288
    1024     903   1436   1456   1329   1461   1456   1585   1609   1600
    2048     950   1164   1155   1186   1171    914   1043   1036   1024
    4096     974   1148   1039   1174   1164   1162    919    923    928
    8192     920   1131   1158   1168   1163   1093    938    936    945
   16384     919    838    948   1169   1165    990    931    940    946
   32768    1166   1159   1168   1171   1167   1168    923    926    916
   65536    1156   1146   1167   1170   1163   1147    928    939    931
  131072    1163   1151   1148   1171   1075   1092    934    915    957

                End of test Mon Aug 15 19:29:47 2016

########## RPi 3 V7A 2 New OpenGL GLUT Driver Disabled ##########

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

     Memory Reading Speed Test OpenMP Version 1 by Roy Longbottom

               Start of test Tue Aug 30 14:03:24 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4     565   1059   1293   1403   1228   1381    773    740    743
       8     433   1590   2185   2480   1989    886   1458   1349   1367
      16     274   2118   3212   3987   2882   3846   2678   2335   2334
      32    4234   2547   4489   5786   3723   5476   4645   3685   3690
      64    3613   2791   5328   7263   3959   6777   7146   5065   5075
     128    1349   2889   5624   6927   4090   7274   9530   5908   5923
     256    3597   2960   5877   8177   4676   7637   8693   6697   6725
     512    4140   2985   3621   8556   4784   7931   7867   6723   6768
    1024    1534   1547   1631   1646   1629   1634   1872   1848   1852
    2048    1274   1270   1274   1267   1106   1274   1106   1108   1090
    4096     675   1263   1270   1277   1265   1266   1025   1031   1028
    8192    1271   1256   1281   1280   1263   1265    996    994    959
   16384    1281   1277   1289   1288   1102   1278    986    971    976
   32768    1285   1283   1287   1301   1286   1291    977    966    986
   65536    1291   1285   1291   1292   1291   1289    988    982    986
  131072    1293   1283   1293   1298   1295   1287    970    979    998

                End of test Tue Aug 30 14:03:53 2016

#################################################################
                        Version 2
########################## RPi 3 ################################

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

     Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

               Start of test Mon Sep  5 14:27:38 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    5518   2990   1309   8808   4732   1455  15426   7656   1244
       8    5414   3115   1322  10150   5068   1470  14323   8301   1254
      16    5503   3143   1270  10255   5154   1378  16743   8043   1221
      32    5507   3145   1344  10142   5089   1458  16572   7732   1206
      64    5033   2999   1257   9230   4867   1419  16012   7869   1228
     128    5255   3041   1258   9372   5014   1365   9452   8192   1252
     256    5266   3093   1282   9401   5006   1372   8418   7864   1313
     512    4494   2765   1358   7248   4482   1332   5748   5460   1410
    1024    3810   2683   1078   4425   3668   1155   1753   1732   1265
    2048    2008   1425   1098   2274   2214    980   1086   1094   1333
    4096    3972   2413   1075   4628   3672    945   1058   1057    839
    8192    1597   2435    920   3671   3649   1199   1059   1067   1043
   16384    3838   1624   1867   4440   1550   1108   1065   1076   1166
   32768    1658   2273   1695   4227   1876   1054   1066   1039    921
   65536    3657   1247   1286   4839   3801   1308   1053   1046   1133
  131072     990    655    810   1260    932    826   1129   1083    619

                End of test Mon Sep  5 14:28:08 2016

####################### RPi 3 1 Core ############################

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

     Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

               Start of test Mon Sep  5 14:30:31 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4     775    789    994   2578   1309   1027   4087   2337    654
       8    1551    793   1003   2620   1313   1029   4176   2361    656
      16    1553    793   1003   2626   1314   1029   4209   2372    657
      32    1512    782    982   2501   1282   1009   4146   2338    647
      64    1464    770    961   2379   1242    982   3976   2183    636
     128    1476    773    963   2406   1253    990   3837   2160    639
     256    1478    773    964   2389   1256    982   3867   2208    639
     512    1401    748    926   2204   1202    958   3342   2119    636
    1024    1082    663    798   1347    979    814   1759   1634    616
    2048     968    651    776   1193    923    791   1272   1215    604
    4096     962    645    779   1171    909    812   1253   1247    615
    8192     977    654    807   1233    925    820   1240   1245    619
   16384    1016    653    794   1226    920    818   1223   1231    617
   32768    1018    656    815   1263    930    806   1175   1176    615
   65536    1026    658    816   1270    935    829    971    988    614
  131072    1030    660    818   1269    938    830    866    870    608

                End of test Mon Sep  5 14:30:57 2016

####################### RPi 3 Not OMP ###########################

        Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

     Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom

               Start of test Mon Sep  5 14:28:22 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4     785   2536   3789   2360   3448   3787   2670   2693   2692
       8    1594   2547   3812   2389   3465   3812   2715   2716   2716
      16    1595   2551   3824   2392   3477   3823   2727   2728   2728
      32    1556   2435   3564   2300   3272   3565   2730   2722   2723
      64    1513   2314   3330   2189   3091   3327   2599   2435   2435
     128    1516   2312   3357   2188   3118   3353   2635   2569   2569
     256    1521   2316   3381   2187   3130   3384   2676   2618   2617
     512    1419   2034   2765   1977   2674   2835   2593   2481   2524
    1024    1113   1379   1544   1348   1521   1543   1691   1583   1586
    2048     995   1203   1282   1193   1277   1257   1263   1231   1232
    4096     992   1196   1248   1178   1252   1259   1203   1176   1166
    8192    1041   1237   1290   1213   1298   1291    927    943    954
   16384    1052   1262   1311   1229   1252   1303    874    866    867
   32768    1053   1271   1317   1239   1325   1303    995    987    991
   65536    1057   1281   1323   1245   1343   1316    920    920    918
  131072    1057   1283   1323   1184   1350   1327    856    849    840

                End of test Mon Sep  5 14:28:50 2016

#################################################################
                        Comparisons
#################################################################

########################### RPi3/RPi2 ###########################

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  Area      Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
            MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

                   RPi3/RPi2 OpenMP-MemSpeed2

 L1         1.75   1.25   5.07   3.11   1.76   5.11  13.37   4.71   2.78
 L2         1.70   1.64   5.38   1.85   1.62   5.62   7.43   4.80   4.46
 RAM        3.73   2.94   4.68   4.32   2.90   4.05   1.03   0.99   3.73

                   RPi3/RPi2 NotOpenMP-MemSpeed2

 L1         1.32   1.63   1.63   1.26   1.72   1.63   1.48   1.83   1.85
 L2         1.82   1.99   2.13   1.67   2.17   2.16   1.95   2.15   2.15
 RAM        3.33   3.79   3.64   3.56   3.70   3.61   1.12   1.70   1.70

###################### RPi3 OpenMP/NotOpenMP ####################

 L1         3.46   1.25   0.35   4.31   1.50   0.38   5.83   2.95   0.45
 L2         3.37   1.41   0.43   4.01   1.70   0.46   3.39   2.66   0.55
 RAM        2.70   1.53   1.02   3.30   2.16   0.85   1.03   1.04   1.05

################## RPi3 NotOpenMP/1 core OpenMP #################
 
 L1         1.03   3.18   3.75   0.91   2.61   3.65   0.65   1.15   4.17
 L2         1.03   2.78   3.12   0.92   2.28   3.06   0.73   1.13   3.71
 RAM        1.04   1.90   1.62   0.99   1.40   1.59   0.87   0.86   1.66
  




MP-NeonMFLOPS

This executes the same functions as MP-MFLOPS, in two versions. One uses NEON intrinsic functions; the other is MP-MFLOPS compiled with directives to use NEON. The two benchmarks obtain similar performance, as reflected in the results below, where the compiled NEON version is shown first. The numeric results of the two versions show rounding differences, identified by @@@@@.

Raspberry Pi 3 average performance gains over RPi 2 were 1.34 and 2.30 for the two sets of tests (2 and 32 Operations Per Word) with the compiled version, and effectively the same for the program with intrinsic functions. As written, the 32 Operations Per Word arithmetic statements were in a loop with one load and one store, but the compiled code contains numerous additional load instructions, similar to MP-MFLOPSPiNeon (see assembly code below). It could probably have been anticipated that there were insufficient registers for all the variables.
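The difference between the two versions can be sketched for the 2 Operations Per Word case, where each word has one constant added and is then multiplied by a second constant (the add and multiply visible in the assembly code later in this document). This is an illustration only, not the benchmark source: the function names add_mul_compiled and add_mul_intrinsics are hypothetical, and on a non-ARM build the intrinsics variant falls back to the plain loop so that the example still compiles.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Plain C loop: vectorisation is left to the compiler and its options. */
void add_mul_compiled(float *x, size_t n, float a, float b)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;          /* 2 operations per word */
}

/* Hand written NEON intrinsics: four single precision words per step. */
void add_mul_intrinsics(float *x, size_t n, float a, float b)
{
#if HAVE_NEON
    float32x4_t va = vdupq_n_f32(a);    /* constant a in all four lanes */
    float32x4_t vb = vdupq_n_f32(b);    /* constant b in all four lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(x + i);       /* one quad word load  */
        v = vmulq_f32(vaddq_f32(v, va), vb);    /* add then multiply   */
        vst1q_f32(x + i, v);                    /* one quad word store */
    }
    for (; i < n; i++)                  /* scalar tail for leftover words */
        x[i] = (x[i] + a) * b;
#else
    add_mul_compiled(x, n, a, b);       /* fallback when NEON is absent */
#endif
}
```

Calling both functions on copies of the same array should give matching values for ordinary inputs. Small rounding differences of the kind marked @@@@@ can appear once fused instructions or options such as -funsafe-math-optimizations change the order of evaluation.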


################## RPi 2 V7A2 Compiled NEON ####################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-MFLOPS Compiled NEON v1.0 Fri Mar 20 17:01:47 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      361     446     329     692     678     647
 2T      887     841     430    1371    1358    1300
 4T     1596    1141     381    2719    2725    2482
 8T     1542    1502     384    2604    2701    2460
 Results x 100000
 1T    76406   97075   99969   66008   95367   99951

         End of test Fri Mar 20 17:01:58 2015


################## RPi 2 NEON Intrinsics #######################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 MP-MFLOPS NEON Intrinsics v1.0 Fri Mar 20 17:07:09 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      249     347     268     709     706     679
 2T      635     667     411    1403    1386    1323
 4T      919    1342     377    2783    2798    2623
 8T     1076    1341     380    2589    2476    2409
 Results x 100000
 1T    76406   97075   99969   66014   95363   99951
                               @@@@@   @@@@@

         End of test Fri Mar 20 17:07:20 2015


################ RPi 2 NEON Intrinsics OC #######################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

 MP-MFLOPS NEON Intrinsics v1.0 Fri Mar 20 17:19:01 2015

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      309     386     308     788     785     758
 2T      778     745     500    1554    1546    1483
 4T     1048    1461     468    3097    3072    2931
 8T     1377    1253     465    2780    2781    2689
 Results x 100000
 1T    76406   97075   99969   66014   95363   99951

         End of test Fri Mar 20 17:19:11 2015


################## RPi 3 V7A2 Compiled NEON ####################

      Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      419     782     437    1672    1660    1637
 2T     1324    1529     442    3331    3308    3212
 4T     1903    1574     439    5040    6073    5738
 8T     1613    2204     433    5543    5780    5445
 Results x 100000
 1T    76406   97075   99969   66008   95367   99951

         End of test Mon Aug 15 19:09:52 2016

################## RPi 3 NEON Intrinsics #######################

      Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 MP-MFLOPS NEON Intrinsics v1.0 Mon Aug 15 19:41:37 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      347     583     427    1706    1703    1657
 2T     1080    1157     438    3397    3398    3226
 4T      979    1430     437    6265    6128    5464
 8T     1218    1351     436    5507    5766    5426
 Results x 100000
 1T    76406   97075   99969   66014   95363   99951

         End of test Mon Aug 15 19:41:42 2016

  




MP-Linpack via NEON - linpackNeonMP

As indicated in Raspberry Pi Benchmarks.htm, the original Linpack benchmark operates on double precision floating point 100x100 matrices (N = 100). This version uses mainly the same C programming code as the single precision floating point NEON compilation. It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. The 0 thread procedures are identical to those in the single core 100x100 NEON compilation, using intrinsic functions.

The code differences were slight changes to allow a higher level of parallelism. The initial 100x100 Linpack benchmark is only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for this is the same as 100x100, except users are allowed to use their own linear equation solver.

Unlike the NEON MP MFLOPS benchmark, that carries out the same multiply/add calculations, this program can run much slower using multiple threads. This is due to the overhead of creating and closing threads too frequently. At 100x100, around 0.67 million floating point calculations are executed in daxpy, the critical function. With the present equations, threads have to be created 99 times (unless someone can do better and change more things). At 100x100, data size is 40 KB, L2 cache based. With larger matrices, performance becomes more dependent on RAM, but multi-threading overheads have less influence.
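The figures quoted above can be checked with a short calculation. The sketch below assumes the usual Linpack factorisation structure, which is not spelled out in the text: one elimination step per column except the last, and at each step one daxpy per remaining column, with one multiply and one add per element. The function names are hypothetical.

```c
#include <stddef.h>

/* Floating point operations executed by daxpy while factorising an
   n x n matrix, under the assumed structure: at step k, (n-k-1) daxpy
   calls of length (n-k-1), each element costing 2 operations. */
unsigned long daxpy_flops(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long k = 0; k + 1 < n; k++) {
        unsigned long len = n - k - 1;
        total += 2UL * len * len;
    }
    return total;
}

/* Elimination steps, taken here as the number of times threads
   have to be created. */
unsigned long elimination_steps(unsigned long n)
{
    return n > 0 ? n - 1 : 0;
}

/* Bytes occupied by an n x n single precision matrix. */
unsigned long matrix_bytes(unsigned long n)
{
    return (unsigned long)(n * n * sizeof(float));
}
```

For N=100 this gives 656,700 operations in 99 steps and 40,000 bytes of data, in line with the 0.67 million, 99 and 40 KB quoted. That is an average of only about 6,600 operations per thread creation, compared with roughly 666,000 at N=1000, which is consistent with threading overheads mattering less on the larger matrices.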

Without threading at N=100, as shown below, speed is a little faster than single core Linpack NEON MFLOPS, due to improved coding, but not as fast as MP-NeonMFLOPS, which has less variety in accessing data. Performance is worse at N=500 and 1000, where data is mainly from RAM.

The benchmark checks that the numeric results produced, using threads, are identical to those without threading. As expected, these are not the same using different matrix sizes, and the n=100 results are the same as linpackPiNEONi, the single core version.

Raspberry Pi 3 - At N=100, average speed was 1.73 times that of an RPi 2, and 1.52 to 1.59 times with the larger matrices. These can be compared with a CPU MHz ratio of 1.33.
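The averages can be reproduced from the RPi 2 (900 MHz) and RPi 3 result tables in this section. The text does not say how they were formed; the sketch below assumes a simple mean of the four RPi 3/RPi 2 ratios (None, 1, 2 and 4 threads) for each matrix size. The array and function names are hypothetical.

```c
#include <stddef.h>

#define SIZES 3   /* N = 100, 500, 1000 */
#define COLS  4   /* Threads: None, 1, 2, 4 */

/* MFLOPS copied from the RPi 2 NEON (900 MHz) table. */
static const double rpi2[SIZES][COLS] = {
    { 323.06,  66.59,  64.76,  64.64 },
    { 276.52, 216.62, 215.69, 216.28 },
    { 235.25, 221.69, 222.63, 223.98 },
};

/* MFLOPS copied from the RPi 3 NEON (1200 MHz) table. */
static const double rpi3[SIZES][COLS] = {
    { 538.46, 116.24, 113.61, 113.47 },
    { 467.73, 335.53, 338.61, 338.97 },
    { 363.87, 336.10, 336.72, 336.22 },
};

/* Mean of the per-column RPi 3 / RPi 2 ratios for one matrix size. */
double average_gain(size_t row)
{
    double sum = 0.0;
    for (size_t c = 0; c < COLS; c++)
        sum += rpi3[row][c] / rpi2[row][c];
    return sum / COLS;
}
```

With these inputs, average_gain(0), average_gain(1) and average_gain(2) come out at about 1.73, 1.59 and 1.52 respectively, matching the figures quoted, all above the 1200/900 = 1.33 clock ratio.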



######################### RPi 2 NEON ############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 linpackPiNEONi MFLOPS 300
  
 MP-NeonMFLOPS  MFLOPS 347 at 128 KB


######################### RPi 2 NEON ############################

Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Sun Mar 22 15:37:56 2015

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     323.06    66.59    64.76    64.64 
 N  500     276.52   216.62   215.69   216.28 
 N 1000     235.25   221.69   222.63   223.98 

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

Thread
 0 - 4 Same Results    Same Results    Same Results


####################### RPi 2 NEON OC ###########################

Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2 

Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Sun Mar 22 15:47:04 2015

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     362.42    74.74    74.63    75.16 
 N  500     326.00   259.13   257.42   258.82 
 N 1000     280.61   262.30   262.31   262.38 

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

Thread
 0 - 4 Same Results    Same Results    Same Results


######################### RPi 3 NEON ############################

     Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Mon Aug 15 19:44:30 2016

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     538.46   116.24   113.61   113.47 
 N  500     467.73   335.53   338.61   338.97 
 N 1000     363.87   336.10   336.72   336.22 

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

Thread
 0 - 4 Same Results    Same Results    Same Results
  




Assembly Code

Below are examples of disassembled code for MP-MFLOPS and the NEON variety. The first uses 32 bit single precision floating point registers. With 32 arithmetic calculations per word, at least, use is made of the advanced instructions VFMA and VFMS (Vector Fused Multiply Accumulate or Subtract). Ten of these execute 20 of the 32 floating point operations, the other twelve coming from conventional add and multiply instructions.

The NEON compilation uses the same VFMA and VFMS instructions, but on 128 bit quad words, for SIMD operation, in an unrolled loop where 10 VFMAs (or VFMSs) execute 80 floating point operations. Four word vectors are also used for adds and multiplies. This produces up to 1.7 GFLOPS per core on a Raspberry Pi 3, which is not very good against a maximum of 9.6 GFLOPS. Part of the reason is the excessive number of load instructions, probably due to an insufficient number of registers. With compiler generated unrolling, disassembled code can show many more sets of calculations, to cover situations where the data is too small for the whole unrolled loop.
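The text does not show how the 9.6 GFLOPS maximum is arrived at. One reading, which is an assumption here, is a four word fused multiply-add (two operations per word) completing every clock cycle at 1200 MHz. The hypothetical helpers below do that arithmetic and express a measured speed as a fraction of it.

```c
/* Assumed peak: clock (MHz) x words per vector x operations per word,
   converted from MFLOPS to GFLOPS. */
double peak_gflops(double mhz, int words_per_vector, int ops_per_word)
{
    return mhz * words_per_vector * ops_per_word / 1000.0;
}

/* Measured speed as a fraction of the assumed peak. */
double fraction_of_peak(double measured_gflops, double peak)
{
    return measured_gflops / peak;
}
```

Under this assumption peak_gflops(1200.0, 4, 2) gives 9.6, so the measured 1.7 GFLOPS is a little under 18% of the maximum.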

The notOpenMP-MFLOPS benchmark (OpenMP-MFLOPS but not using OMP threads) has an extra test with 8 operations per word. As shown below, the inner loop is unrolled by the compiler to produce 32 calculations via quad word vectors, but at not much more than 1.7 GFLOPS. Manually unrolling the loop to 16 x 8 calculations did not lead to further unrolling by the compiler. With four times more calculations in the loop, a maximum of just over 3 GFLOPS could be demonstrated, still a long way from 9.6.

  
   MP-MFLOPSPiA7                   MP-MFLOPSPiNeon

   2 Operations Per Word           2 Operations Per Word

   .L27:                           .L83:
   flds      s15, [r1]             vld1.64   {d16-d17}, [lr:64]
   fadds     s15, s0, s15          add       r4, r4, #1
   fmuls     s15, s15, s1          add       lr, lr, #16
   fstmias   r1!, {s15}            cmp       r2, r4
   cmp       r1, r0                add       r3, r3, #16
   bne       .L27                  vadd.f32  q8, q8, q10
                                   vmul.f32  q8, q8, q9
                                   vstr      d16, [r3, #-16]
                                   vstr      d17, [r3, #-8]
                                   bhi       .L83

   32 Operations Per Word          32 Operations Per Word

   .L21:                           .L61:
   flds      s23, [r1]             vld1.64   {d18-d19}, [lr:64]
   fadds     s16, s23, s2          vldr      d16, [sp, #64]
   fadds     s24, s23, s0          vldr      d17, [sp, #72]
   fadds     s31, s23, s4          vldr      d14, [sp, #80]
   fadds     s30, s23, s6          vldr      d15, [sp, #88]
   fnmuls    s16, s3, s16          vadd.f32  q8, q9, q8
   fadds     s29, s23, s8          vld1.64   {d20-d21}, [sp:64]
   fadds     s28, s23, s10         vmul.f32  q8, q8, q7
   fadds     s27, s23, s12         vadd.f32  q10, q9, q10
   vfma.f32  s16, s24, s1          vldr      d14, [sp, #16]
   fadds     s26, s23, s14         vldr      d15, [sp, #24]
   fadds     s25, s23, s17         vldr      d22, [sp, #144]
   fadds     s24, s23, s19         vldr      d23, [sp, #152]
   fadds     s23, s23, s21         vfma.f32  q8, q10, q7
   vfma.f32  s16, s31, s5          vldr      d20, [sp, #128]
   vfms.f32  s16, s30, s7          vldr      d21, [sp, #136]
   vfma.f32  s16, s29, s9          vldr      d14, [sp, #192]
   vfms.f32  s16, s28, s11         vldr      d15, [sp, #200]
   vfma.f32  s16, s27, s13         vadd.f32  q10, q9, q10
   vfms.f32  s16, s26, s15         vadd.f32  q7, q9, q7
   vfma.f32  s16, s25, s18         vfma.f32  q8, q10, q11
   vfms.f32  s16, s24, s20         vldr      d22, [sp, #208]
   vfma.f32  s16, s23, s22         vldr      d23, [sp, #216]
   fstmias   r1!, {s16}            vadd.f32  q10, q9, q15
   cmp       r1, r0                add       r4, r4, #1
   bne       .L21                  add       lr, lr, #16
                                   cmp       r2, r4
                                   add       r3, r3, #16
   NotOpenMP                       vfma.f32  q8, q7, q11
                                   vldr      d22, [sp, #256]
   8 Operations Per Word           vldr      d23, [sp, #264]
                                   vadd.f32  q7, q9, q11
   .L31:                           vldr      d22, [sp, #240]
   vld1.64   {d18-d19}, [lr:64]    vldr      d23, [sp, #248]
   add       r4, r4, #1            vfma.f32  q8, q10, q14
   add       lr, lr, #16           vldr      d20, [sp, #32]
   cmp       r2, r4                vldr      d21, [sp, #40]
   add       r3, r3, #16           vadd.f32  q10, q9, q10
   vadd.f32  q8, q9, q12           vfma.f32  q8, q7, q11
   vadd.f32  q10, q9, q3           vldr      d22, [sp, #96]
   vmul.f32  q8, q8, q11           vldr      d23, [sp, #104]
   vadd.f32  q9, q9, q14           vadd.f32  q7, q9, q11
   vfma.f32  q8, q10, q15          vldr      d22, [sp, #48]
   vfms.f32  q8, q9, q13           vldr      d23, [sp, #56]
   vstr      d16, [r3, #-16]       vfms.f32  q8, q10, q11
   vstr      d17, [r3, #-8]        vldr      d22, [sp, #112]
   bhi       .L31                  vldr      d23, [sp, #120]
                                   vldr      d20, [sp, #160]
                                   vldr      d21, [sp, #168]
                                   vadd.f32  q10, q9, q10
                                   vfms.f32  q8, q7, q11
                                   vldr      d22, [sp, #224]
                                   vldr      d23, [sp, #232]
                                   vadd.f32  q7, q9, q11
                                   vldr      d22, [sp, #176]
                                   vldr      d23, [sp, #184]
                                   vadd.f32  q9, q9, q13
                                   vfms.f32  q8, q10, q11
                                   vfms.f32  q8, q7, q6
                                   vfms.f32  q8, q9, q12
                                   vstr      d16, [r3, #-16]
                                   vstr      d17, [r3, #-8]
                                   bhi       .L61
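Reading the NotOpenMP 8 Operations Per Word loop (.L31) in the listing above back into C gives roughly the following. This is a reconstruction from the instruction order, not the benchmark source: the function name and the constant names a to f are hypothetical, and the compiler may have rearranged the terms of the original statement.

```c
#include <stddef.h>

/* Three adds, one multiply, one multiply-accumulate and one
   multiply-subtract per word: 3 + 1 + 2 + 2 = 8 operations. */
void eight_ops_per_word(float *x, size_t n,
                        float a, float b, float c,
                        float d, float e, float f)
{
    for (size_t i = 0; i < n; i++) {
        float v = x[i];                     /* single load per word  */
        x[i] = (v + a) * b                  /* vadd + vmul           */
             + (v + c) * d                  /* vadd + vfma           */
             - (v + e) * f;                 /* vadd + vfms           */
    }                                       /* single store per word */
}
```

In the disassembly each of these statements operates on four words at once, which is how the loop body yields 32 calculations per pass.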
  




Roy Longbottom at Linkedin Roy Longbottom September 2016

The Official Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection