Raspberry Pi 4B 32 Bit Benchmarks

Roy Longbottom


Contents


Summary Introduction Benchmark Results
Whetstone Benchmark Dhrystone Benchmark Linpack 100 Benchmark
Livermore Loops Benchmark FFT Benchmarks BusSpeed Benchmark
MemSpeed Benchmark NeonSpeed Benchmark MultiThreading Benchmarks
MP-Whetstone Benchmark MP-Dhrystone Benchmark MP NEON Linpack Benchmark
MP-BusSpeed Benchmark MP-RandMem Benchmark MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks Floating Point Assembly Code OpenMP-MemSpeed Benchmarks
I/O Benchmarks WiFi Benchmark LAN Benchmark
USB 2 and 3 Benchmarks Pi 4 Main Drive benchmark Java Whetstone Benchmark
JavaDraw Benchmark OpenGL Benchmark Stress Tests



Summary

Previously, I have run my 32 bit and 64 bit benchmarks on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in a PDF file. The early opportunity to run the programs on the new model arose from my acceptance of a request to become a volunteer consultant, exercising the system prior to launch. This report contains brief reminders of the benchmarks, with 32 bit results on the new Raspberry Pi 4 using the Raspbian Buster Operating System. Existing benchmarks were used to provide comparisons with the old 3B+ model. The benchmarks were also recompiled using gcc 8, which came with Buster, to provide further comparisons. The benchmarks and results are summarised as follows.

Single Core CPU Tests - comprising Whetstone, Dhrystone, Linpack and Livermore Loops Classic Benchmarks. Compared with a Pi 4B/Pi 3B+ CPU MHz ratio of 1.07, the overall performance gains for these four programs increased to around 1.8, 2.0, 4.0 and 2.8 times, with some further improvements between 1.05 and 1.26 from gcc 8 compilations.

Single Core Memory Benchmarks - measuring performance using data from caches and RAM. These include eight different measurements of FFTs, at 11 increasing sizes, with average Pi 4B speed gains of 3.26 times. BusSpeed is intended to identify maximum reading speeds. Here there was not much difference using L1 cache, some gain via L2 cache and 80% from RAM, increasing by a further 25% using the gcc 8 compilation. MemSpeed and NeonSpeed carry out floating point and integer calculations, providing Pi 4B speed gains at all levels, best with double precision floating point calculations at greater than five times.

Multithreading Benchmarks - Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. The first are for Whetstone, Dhrystone and Linpack benchmarks, providing similar Pi 4B gains as the single core versions, with only Whetstones providing effective four core performance.

Various multithreaded and OpenMP cache/RAM benchmarks were run, mainly demonstrating the sort of code that is good and bad for efficient MP utilisation. Most demonstrated appropriate single core Pi 4B performance gains, but some other relationships were totally confusing.

Finally, a number of benchmarks attempt to measure maximum MFLOPS floating point speed, using the same series of calculations, with variants covering single and double precision (SP and DP), vector intrinsic functions and OpenMP. Best DP performance was 10.4 GFLOPS with SP at 19.9 GFLOPS. Highest Pi 4B/Pi 3B+ gains were 6.69 times DP and 5.15 times SP. The gcc 8 compilations provided some improvement in speed.

Java and OpenGL Benchmarks - A Java Whetstone benchmark is provided, along with one using JavaDraw procedures. Test functions of the former were more than twice as fast on the Pi 4B as on the 3B+, with similar gains via JavaDraw on the more demanding tests and on many of the 25 OpenGL test routines. Initially Oracle Java 8 was used, but later tests were via OpenJDK 11.

Drive, LAN and WiFi Benchmarks - Variations of the same program are provided to benchmark internal and USB drives or LAN and WiFi connections, measuring performance using large files, small files and random access. Considering large files, the Pi 4B performance improvements shown were up to four times for LAN and over five times for USB 3, with similar scores using WiFi.

Stress Tests - These have also been run and will be covered in a later report. Default mode provides useful benchmarking information, as shown below. Pi 4B/Pi 3B+ performance ratios are shown to be up to 4.23 for cache based data and 2.09 using RAM.



Introduction

The Raspberry Pi 4B uses a quad core ARM A72 CPU, with 32 KB L1 cache and shared 1 MB L2 cache. RAM is 3200-LPDDR4 with 1, 2 or 4 GB options. Other enhancements are USB 3 connections and gigabit Ethernet.

I have run my benchmarks on the new system. More descriptions and earlier results can be found on ResearchGate in a PDF file. The early opportunity to run the programs arose from my acceptance of a request to become a volunteer consultant, exercising the system prior to launch.

The programs and source codes used are available for downloading in Raspberry-Pi-4-Benchmarks.tar.gz.

My most recent benchmarks were compiled for the Raspberry Pi 2, using gcc 4.8. I tried others later, but they did not seem to make much difference. I thought that using a Cortex A72 might, so I have compiled the programs using gcc 8. The first step was to change the functions used to identify the hardware, where the existing procedures replicate information for each core (even four lots were too much). I noted that the lscpu command now provides adequate detail, so I use this now. The Raspbian release is also provided. RPi 3B+ and RPi 4B details are as follows:

Pi 3B+

Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Model:                 4
Model name:            ARMv7 Processor rev 4 (v7l)
CPU max MHz:           1400.0000
CPU min MHz:           600.0000
BogoMIPS:              89.60
Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
                       idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2018-04-18



Pi 4B
 
Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Vendor ID:             ARM
Model:                 3
Model name:            Cortex-A72
Stepping:              r0p3
CPU max MHz:           1500.0000
CPU min MHz:           600.0000
BogoMIPS:              270.00
Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
                       idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-05-13
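
The listings above are lscpu output. As a rough illustration only (the benchmarks themselves are C programs and this is not their code), the following Python sketch shows one way such output could be split into key/value pairs. The names `parse_lscpu` and `read_lscpu` and the sample text are assumptions made for this example.

```python
import subprocess

def parse_lscpu(text):
    """Split 'Key:   value' lines of lscpu-style output into a dict.

    Continuation lines without a colon (such as the second Flags line)
    are ignored in this simple sketch.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info

def read_lscpu():
    """Run lscpu (Linux only) and return its parsed output."""
    result = subprocess.run(["lscpu"], capture_output=True, text=True, check=True)
    return parse_lscpu(result.stdout)

# Sample text copied from the Pi 4B listing above.
sample = """Architecture:          armv7l
Model name:            Cortex-A72
CPU max MHz:           1500.0000
"""
details = parse_lscpu(sample)
print(details["Model name"], float(details["CPU max MHz"]))
```

On a Raspberry Pi, `read_lscpu()` would return the same kind of dictionary for the live system.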
  



Benchmark Results

The following provide benchmark results with limited comments on Raspberry Pi 4B performance gains over Pi 3B+ and relative Pi 4B relationships between older ARM V7 and gcc 8 compilations. For the first few, ancient benchmarks, ARM V6 compilations are also compared.



Whetstone Benchmark - whetstonePiA6, whetstonePiA7, whetstonePiC8

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, lately those identified as COS and EXP. The last three can be over optimised, but the time does not affect the overall rating much.

In this case, the overall MWIPS ratios provide valid comparisons, the Pi 4B being between 1.76 and 1.87 times faster than the 3B+. The gcc 8 compilation provided no real improvement.


 System    MHz  MWIPS   ------MFLOPS------    ------------MOPS---------------
                        1      2      3       COS    EXP  FIXPT     IF  EQUAL
 ARM V6
 Pi 3B+   1400   1094    391    407    348   21.7   12.3   1740   2084   1391
 Pi 4B    1500   2048    520    473    389   53.8   27.1   2497   2245   2246
 4B/3B+   1.07   1.87   1.33   1.16   1.12   2.47   2.20   1.44   1.08   1.61
 
 ARM V7
 Pi 3B+   1400   1060    391    383    298   21.7   12.3   1740   2083   1392
 Pi 4B    1500   1884    516    478    310   54.7   27.1   2498   2247    999
 4B/3B+   1.07   1.78   1.32   1.25   1.04   2.52   2.21   1.44   1.08   0.72
 
 gcc 8
 Pi 3B+   1400   1063    393    373    300   21.8   12.3   1748   2097   1398
 Pi 4B    1500   1883    522    471    313   54.9   26.4   2496   3178    998
 4B/3B+   1.07   1.76   1.33   1.26   1.05   2.51   2.09   1.43   1.52   0.71
 
 gcc 8/V7
 Pi 4B    1.00   1.00   1.01   0.99   1.01   1.00   0.97   1.00   1.41   1.00
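
The ratio rows in these tables are simple quotients of the Pi 4B and Pi 3B+ results. A minimal Python sketch, using the ARM V6 MHz and MWIPS figures from the table above, shows the arithmetic and also separates the clock speed contribution from the per MHz gain. The helper name `ratio` is an assumption for this example.

```python
def ratio(new, old, places=2):
    """Return new/old rounded to the given number of decimal places."""
    return round(new / old, places)

# MHz and MWIPS taken from the ARM V6 rows of the table above.
pi3 = {"MHz": 1400, "MWIPS": 1094}
pi4 = {"MHz": 1500, "MWIPS": 2048}

mhz_ratio = ratio(pi4["MHz"], pi3["MHz"])        # clock speed ratio
mwips_ratio = ratio(pi4["MWIPS"], pi3["MWIPS"])  # overall gain
# Gain remaining after allowing for the higher clock speed.
per_mhz_gain = ratio(pi4["MWIPS"] / pi4["MHz"], pi3["MWIPS"] / pi3["MHz"])

print(mhz_ratio, mwips_ratio, per_mhz_gain)
```

Most of the improvement therefore comes from the CPU design rather than the small increase in MHz.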
 


Dhrystone Benchmark - dhrystonePiA6, dhrystonePiA7, dhrystonePiC8

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so results from different compilers cannot be compared safely. Ignoring this, results in VAX MIPS (also known as DMIPS) and comparisons follow.

The Pi 4B was shown to be around twice as fast as the 3B+ and gcc 8 performance was similar to the ARM V7 compilation.


                                              Best
                ----- Compiler -----         DMIPS
 System     MHz  ARM V6 ARM V7  gcc 8  G8/V7  /MHz

 Pi 3B+    1400   2520   2825   2838   1.00   2.03
 Pi 4B     1500   5077   5366   5646   1.05   3.76

 4B/3B+    1.07   2.01   1.90   1.99          1.86
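
For reference, DMIPS is conventionally the Dhrystones per second score divided by 1757, the score of the VAX 11/780 that is treated as a 1 MIPS machine. The sketch below shows that conversion and the DMIPS/MHz calculation, using the gcc 8 results from the table above. The function names are assumptions for this example.

```python
VAX_DHRYSTONES_PER_SECOND = 1757  # VAX 11/780 reference score, treated as 1 MIPS

def dmips(dhrystones_per_second):
    """Convert a raw Dhrystones per second score to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND

def dmips_per_mhz(dmips_score, mhz):
    """Normalise a DMIPS score by CPU clock speed."""
    return dmips_score / mhz

# gcc 8 DMIPS results and CPU MHz from the table above.
pi3_per_mhz = round(dmips_per_mhz(2838, 1400), 2)
pi4_per_mhz = round(dmips_per_mhz(5646, 1500), 2)
print(pi3_per_mhz, pi4_per_mhz)
```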
  


Linpack 100 Benchmark MFLOPS - linpackPiA6, linpackPiSP, linpackPiA7, linpackPiA7SP, linpackPiC8, linpackPiC8SP

This original Linpack benchmark is dependent on fused multiply and add instructions, but the overheads in the standard source code restrict processing speed. It seems that the latest hardware has been modified to execute this type of code more efficiently.

All measurements demonstrate that the Pi 4B was between 3.6 and 4.7 times faster than the Pi 3B+.


                      ARM V6        ARM V7        gcc 8       gcc8/ARMV7
 System    MHz      DP     SP     DP     SP     DP     SP     DP     SP

 Pi 3B+    1400  206.0  220.2  210.5  225.2  224.8  227.3   1.00   1.01
 Pi 4B     1500  764.7  880.6  760.2  921.6  957.1 1068.8   1.04   1.12

 4B/3B+    1.07   3.71   4.00   3.61   4.09   4.26   4.70              
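
Most of the time in the Linpack 100 benchmark is spent in a daxpy style loop, y[i] = y[i] + a * x[i], which is why multiply and add performance matters. The Python sketch below is not the benchmark source. It shows the loop, and the nominal operation count of 2/3 n^3 + 2 n^2 that is normally used to convert running time into MFLOPS. The function names are assumptions for this example.

```python
def daxpy(a, x, y):
    """Update y in place with y[i] = y[i] + a * x[i].

    Each element needs one multiply and one add, which suits a fused
    multiply and add instruction.
    """
    for i in range(len(y)):
        y[i] += a * x[i]
    return y

def linpack_flops(n):
    """Nominal operation count for solving an n x n system."""
    return 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2

def mflops(n, seconds):
    """MFLOPS for one n x n solution taking the given time."""
    return linpack_flops(n) / seconds / 1.0e6

print(daxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
print(linpack_flops(100))
```

With n = 100 the count is about 687 thousand operations, so at the speeds in the table a single solution takes well under a hundredth of a second.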
  


Livermore Loops Benchmark MFLOPS - liverloopsPiA7, liverloopsPiC8

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, where the Cray 1 supercomputer was rated as 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores.

Based on Geomean results, Pi 4B is shown as being 2.36 times faster than the 3B+, and even more so via the gcc 8 compilation, where the gcc8/V7 performance ratio identified is 1.31.

 MFLOPS for 24 loops
 
 Pi 3B+
   225   266   465   394   147   196   411   449   408   207   155    87
   100   125   263   258   359   335   236   248   133    93   339   199

 Pi 4B
   746   964   988   943   212   538  1169  1800  1032   469   214   186
   159   335   778   623   732  1034   320   350   489   360   749   187

 Pi 3B+ gcc 8
   330   262   459   407   231   198   538   542   462   247   174   198
   122   123   281   240   394   325   275   294   213    94   354   198

 Pi 4B gcc 8
  1480  1017   974   930   383   657  1624  1861  1664   617   498   741
   221   320   803   640   737  1003   451   378  1047   411   763   187

Comparisons

 System    MHz   Maximum Average Geomean Harmean Minimum

 ARM V7
 Pi 3B+   1400     464.8   246.7   220.1   193.9    78.3
 Pi 4B    1500    1800.2   635.1   519.0   416.1   155.3
 4B/3B+   1.07      3.87    2.57    2.36    2.15    1.98

 gcc 8
 Pi 3B+   1400     541.7   283.4   257.4   231.5    92.7
 Pi 4B    1500    1860.8   800.4   679.0   564.1   179.5
 4B/3B+   1.07      3.40    2.80    2.61    2.41    1.90
                                                
 g8/V7    1.00      1.03    1.26    1.31    1.36    1.16
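
The comparison columns are different ways of averaging the per kernel MFLOPS. The following Python sketch, using the standard statistics module (Python 3.8 or later), shows how the three averages are defined. The input values are made up for illustration, and the function name `summarise` is an assumption. The figures in the table come from the benchmark's own output and need not match a recalculation from the rounded per kernel values listed above.

```python
from statistics import fmean, geometric_mean, harmonic_mean

def summarise(mflops):
    """Return the summary statistics used in the comparison table."""
    return {
        "Maximum": max(mflops),
        "Average": fmean(mflops),            # arithmetic mean
        "Geomean": geometric_mean(mflops),   # official Livermore Loops average
        "Harmean": harmonic_mean(mflops),
        "Minimum": min(mflops),
    }

example = [100.0, 200.0, 400.0]  # made-up values, not benchmark results
stats = summarise(example)
for name, value in stats.items():
    print(f"{name:8s} {value:8.1f}")
```

For positive data the harmonic mean is never above the geometric mean, which is never above the arithmetic mean, matching the ordering seen in the table.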
   


Fast Fourier Transforms Benchmarks - fft1-RPi2, fft3c-Rpi2, fft1PiC8, fft3cPiC8

This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements are made using both single and double precision data, calculating FFT sizes between 1K and 1024K.

Following are average running times from the three passes, then RPi 4B performance gains (fewer milliseconds), where most of those for the optimised version were greater than 3 times, as were many from the original benchmark. Most gcc 8 running times on the Pi 4B were slightly faster than those produced by the older version.


                              Time in milliseconds

               Raspberry Pi 3B+ FFT 1        Raspberry Pi 3B+ FFT 3     

          ARM V7           gcc 8          ARM V7           gcc 8        
   Size
      K       SP      DP      SP      DP      SP      DP      SP      DP

      1     0.14    0.14    0.16    0.17    0.18    0.14    0.15    0.14
      2     0.31    0.36    0.35    0.48    0.39    0.32    0.33    0.32
      4     0.78    0.92    0.91    1.32    1.05    0.77    0.78    0.75
      8     1.92    2.17    3.02    3.36    2.14    1.76    1.84    1.76
     16     4.67    5.28    5.09    5.99    4.71    5.46    4.27    4.89
     32    10.95   20.57   12.31   20.62   10.71   15.03    9.55   13.65
     64    34.54  128.96   37.33  130.93   28.94   36.78   26.09   33.23
    128   246.04  308.67  254.23  320.44   70.03   84.44   64.74   76.98
    256   586.84  638.88  620.49  734.14  157.29  196.35  145.14  180.66
    512  1232.41 1374.18 1235.39 1447.85  363.61  434.28  336.57  405.09
   1024  2759.71 2993.38 2779.37 3094.66  806.78  975.33  736.46  912.78


   Size       Raspberry Pi 4B FFT 1           Raspberry Pi 4B FFT 3     
      K                                                                 
      1     0.04    0.04    0.04    0.04    0.06    0.05    0.05    0.04
      2     0.08    0.12    0.08    0.13    0.13    0.11    0.10    0.10
      4     0.32    0.37    0.29    0.34    0.27    0.24    0.24    0.23
      8     0.77    0.97    0.79    0.82    0.58    0.55    0.57    0.51
     16     1.69    2.01    1.65    1.85    1.49    1.35    1.32    1.19
     32     4.37    4.89    3.76    4.71    2.96    3.63    2.69    3.30
     64     9.12   26.55    8.82   30.64    7.46   10.75    6.60    9.47
    128    55.52  160.11   58.54  132.41   17.93   26.03   16.92   23.85
    256   305.92  423.06  275.44  373.12   41.16   55.06   37.61   55.97
    512   833.10  854.88  780.89  751.27   86.93  120.53   81.54  128.13
   1024  1617.49 1875.52 1578.70 1812.20  190.28  266.60  186.45  288.27

   Size             RPi 4B Gains (>1.0 4B running time is less)         
      K                                                                 
      1     3.45    3.46    4.02    3.94    3.06    2.66    2.88    3.45
      2     3.79    3.14    4.27    3.84    3.10    2.93    3.28    3.29
      4     2.46    2.50    3.19    3.84    3.86    3.23    3.24    3.22
      8     2.51    2.24    3.82    4.12    3.67    3.18    3.21    3.44
     16     2.76    2.62    3.08    3.23    3.17    4.06    3.25    4.10
     32     2.51    4.21    3.27    4.38    3.62    4.14    3.55    4.13
     64     3.79    4.86    4.23    4.27    3.88    3.42    3.95    3.51
    128     4.43    1.93    4.34    2.42    3.91    3.24    3.83    3.23
    256     1.92    1.51    2.25    1.97    3.82    3.57    3.86    3.23
    512     1.48    1.61    1.58    1.93    4.18    3.60    4.13    3.16
   1024     1.71    1.60    1.76    1.71    4.24    3.66    3.95    3.17
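
For readers unfamiliar with the calculation being timed, the following is a textbook recursive radix-2 FFT in Python, with a simple millisecond timer around it. It is only a sketch of the general technique. It is neither the original program nor the optimised version used in these benchmarks, and the names `fft` and `time_fft_ms` are assumptions for this example.

```python
import cmath
import time

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor applied to the odd-indexed half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def time_fft_ms(size):
    """Return the running time in milliseconds for one FFT of the given size."""
    data = [float(i % 7) for i in range(size)]
    start = time.perf_counter()
    fft(data)
    return (time.perf_counter() - start) * 1000.0

print(f"1K FFT took {time_fft_ms(1024):.2f} ms")
```

The gains in the last table are then the Pi 3B+ time divided by the Pi 4B time for the same size and precision.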
  


BusSpeed Benchmark - busspeedPiA7, busspeedPiC8

This is a read only benchmark with data from caches and RAM. The program starts by reading one word in every 32, skipping the words in between, then repeats with decreasing increments (16, 8, 4 and 2 words), finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds.

The speed via these increments can vary considerably, so comparisons are provided for the Read All column. Both the Pi 4B hardware and the gcc 8 compilation contribute to the performance gains of the new system, particularly the highest ratio of 2.81 at 1024 KBytes, where the larger L2 cache has an impact.
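
The access pattern can be sketched in a few lines of Python. This is an illustration of strided reading only, not the benchmark code. The function names and the use of an exclusive OR checksum (so that every word read is actually used) are assumptions for this example.

```python
def strided_read(words, inc):
    """Touch every inc-th word; return (checksum, number of words read)."""
    checksum = 0
    count = 0
    for i in range(0, len(words), inc):
        checksum ^= words[i]
        count += 1
    return checksum, count

def mbytes_per_second(words_read, seconds, bytes_per_word=4):
    """Speed based only on the words actually read, as 4 byte words."""
    return words_read * bytes_per_word / seconds / 1.0e6

data = list(range(4096))  # 4096 four byte words = 16 KBytes
for inc in (32, 16, 8, 4, 2, 1):
    _, count = strided_read(data, inc)
    print(f"Inc{inc:<3d} reads {count:5d} words")
```

When memory is transferred in bursts, a large increment uses only a small part of each burst, so measured MB/second falls. In the Pi 3B+ RAM results above, for example, speed roughly doubles each time the increment halves from 16 words to 2.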


Pi 3B+ ARM V7 

  BusSpeed vfpv4 32b V1 Fri Apr 12 21:39:00 2019
 
    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3885   4365   4755   5013   5078   5118
      32   1688   1765   2513   3489   4279   4737
      64    716    720   1315   2268   3399   4147
     128    665    668   1206   2137   3281   4085
     256    632    635   1160   2053   3195   4032
     512    268    277    550   1058   1925   3088
    1024    140    153    296    581   1115   2199
    4096    120    131    257    498   1001   1777
   16384    126    132    256    496    991   1677
   65536    128    132    256    491    991   1950
 
                    Pi 4B  ARM V7                        
 
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read       
  KBytes  Words  Words  Words  Words  Words    All   Gain
  
      16   3836   4049   4467   5885   4641   5858   1.14
      32    761   1473   2594   3216   3960   4780   1.01
      64    409    801   1684   2422   3745   3940   0.95
     128    406    803   1202   1914   3037   5377   1.32
     256    415    700   1165   2481   4789   5137   1.27
     512    392    760   1243   2455   3764   4264   1.38
    1024    230    256    623   1061   2455   3501   1.59
    4096    197    214    454    938   1852   3195   1.80
   16384    138    215    445    897   1724   3210   1.91
   65536    174    215    398    744   1655   3130   1.61

Pi 3B+ gcc 8 

   BusSpeed vfpv4 32b gcc 8 Wed May 15 09:51:20 2019
 
    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3833   4346   4729   5002   5046   5069
      32   2435   2532   3152   4860   4949   4999
      64    696    705   1313   2213   3278   3983
     128    651    662   1227   2077   3207   3950
     256    620    630   1183   2007   3152   3925
     512    481    503    955   1641   2618   3318
    1024    133    145    286    506   1012   1694
    4096    117    130    249    453    915   1476
   16384    124    129    247    455    910   1415
   65536    124    108    251    453    905   1445
 
                                                    
                      Pi 4B  gcc 8                              

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  Pi 4B  gcc 8
  KBytes  Words  Words  Words  Words  Words    All   Gain   Gain

      16   4880   5075   5612   5852   5877   5864   1.16   1.00
      32    846   1138   2153   3229   4908   5300   0.99   1.11
      64    746   1019   2035   3027   4910   5360   1.50   1.36
     128    728    983   1952   2908   4888   5389   1.52   1.00
     256    683    934   1901   2794   4874   5431   1.55   1.06
     512    656    900   1760   2625   4585   5259   1.75   1.23
    1024    301    410    870   1356   2846   4238   2.81   1.21
    4096    233    248    531    996   2151   4045   2.35   1.27
   16384    236    258    511    891   2143   4011   2.35   1.25
   65536    237    257    508    881   2172   4015   2.40   1.28
  


MemSpeed Benchmark MB/Second - memspeedPiA7, memspeedPiC8

This includes CPU speed dependent calculations using data from caches and RAM. The calculations are shown in the results column titles. Following are full Pi 3B+ and 4B results from running the original and gcc 8 recompiled versions, plus full Pi4B/3B+ and old/gcc 8 comparisons.

Using the original ARM V7 versions, the Pi 4B is indicated as faster on all test functions, with the best case on double precision calculations using cached data, being between three and six times faster. Similar gains are also shown in the gcc 8 comparisons. Then, gcc 8/V7 compiler comparisons show gains with floating point, but the old compiler produces some faster speeds using integers. Maximum MFLOPS performance is shown for the calculations in the first two columns, rising from 237 DP and 532 SP on the 3B+ to 1485 DP and 2740 SP on the Pi 4B, using gcc 8 - improvements of 6.27 times DP and 5.15 times SP.
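
Most of the Max MFLOPS figures appear consistent with a simple conversion from the best MB/second result for x[m]=x[m]+s*y[m]. The Python sketch below shows that conversion under two assumptions inferred from the results, not taken from the source code: that the MB/second figure counts two values per element (x[m] and y[m]) and that each element needs two floating point operations (a multiply and an add). The function name `triad_mflops` is hypothetical.

```python
def triad_mflops(mbytes_per_second, bytes_per_value):
    """Convert MB/second for x[m] = x[m] + s*y[m] into MFLOPS.

    Assumes two values (x[m] and y[m]) are counted per element and two
    floating point operations (multiply and add) are performed on them.
    """
    bytes_per_element = 2 * bytes_per_value
    flops_per_element = 2
    return mbytes_per_second / bytes_per_element * flops_per_element

# Best double precision results (8 bytes per value) from the Pi 4B tables.
print(triad_mflops(8459, 8))    # Pi 4B ARM V7
print(triad_mflops(11880, 8))   # Pi 4B gcc 8
```

For double precision this reduces to MB/second divided by 8, and for single precision to MB/second divided by 4.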

   Pi 3B+ ARM V7 

Pi 3B+ Memory Reading Speed Test vfpv4 32 Bit Version 1 by Roy Longbottom

               Start of test Fri Apr 12 21:39:51 2019                  

 Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
 KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
   Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

      8    1896   2125   4046   2784   2624   4448   3165   3694   3693
     16    1900   2129   4058   2791   2627   4462   3181   3711   3711
     32    1821   2000   3664   2602   2426   3965   3187   3719   3717
     64    1807   1974   3625   2567   2369   3923   3057   3615   3599
    128    1792   1959   3620   2545   2364   3906   3079   3544   3544
    256    1738   1914   3472   2468   2291   3719   3064   3545   3553
    512    1380   1493   2199   1769   1715   2331   2192   2522   2383
   1024    1003   1138   1319   1250   1219   1298   1487   1324   1324
   2048     925   1001   1104   1065   1049   1103   1093   1032   1035
   4096     901    972   1073   1037   1005   1081   1002    968    973
   8192     852    948   1076   1041   1021   1080   1009    977    975

Max MFLOPS  237    532                                                 

 Pi 4B ARM V7  
 
      8    8459   4766  13344   8303   4768  15553   7806   9926   9927
     16    7142   3918   8649   7103   4094   9309   7899  10086  10056
     32    7969   4490  10339   7941   4532  11627   7758  10070  10048
     64    8126   4602   9909   8114   4617  11069   7425   8021   8070
    128    8302   4651   9623   8311   4657  10836   7374   8049   7934
    256    8319   4663   9627   8360   4666  10768   7530   7922   7925
    512    8088   4629   9453   8239   4650  10696   5023   7904   7949
   1024    3581   3113   3618   3577   3150   3675   5358   2431   1560
   2048    1338   1808   1780   1811   1832   1773   2131    950    956
   4096    1881   1880   1852   1879   1664   1336   1988    984   1054
   8192    1890   1901   1884   1729   1319   1367   2252   1018   1021

Max MFLOPS 1057   1192                                                 

Pi 4B/3B+

      8    4.46   2.24   3.30   2.98   1.82   3.50   2.47   2.69   2.69
     16    3.76   1.84   2.13   2.54   1.56   2.09   2.48   2.72   2.71
     32    4.38   2.25   2.82   3.05   1.87   2.93   2.43   2.71   2.70
     64    4.50   2.33   2.73   3.16   1.95   2.82   2.43   2.22   2.24
    128    4.63   2.37   2.66   3.27   1.97   2.77   2.39   2.27   2.24
    256    4.79   2.44   2.77   3.39   2.04   2.90   2.46   2.23   2.23
    512    5.86   3.10   4.30   4.66   2.71   4.59   2.29   3.13   3.34
   1024    3.57   2.74   2.74   2.86   2.58   2.83   3.60   1.84   1.18
   2048    1.45   1.81   1.61   1.70   1.75   1.61   1.95   0.92   0.92
   4096    2.09   1.93   1.73   1.81   1.66   1.24   1.98   1.02   1.08
   8192    2.22   2.01   1.75   1.66   1.29   1.27   2.23   1.04   1.05

Pi 3B+ gcc 8 


       8    2024   3191   1931   2973   4464   2077   3415   4426   4426
      16    2031   3194   1933   2977   4470   2078   3430   4451   4451
      32    1972   3111   1902   2842   4291   2059   3433   4455   4451
      64    1932   3042   1875   2752   4121   2008   3240   4223   4223
     128    1972   3083   1888   2825   4163   2012   3281   4272   4276
     256    1980   3089   1888   2851   4177   2013   3312   4244   4239
     512    1750   2778   1739   2460   3711   1846   3106   4029   4096
    1024     979   1862   1390   1213   2230   1463   1463   1225   1220
    2048     979   1858   1379   1137   2111   1442    859    828    828
    4096     975   1809   1363   1136   2091   1428    944    924    920
    8192     976   1788   1364   1139   2053   1409    802    792    733

Max MFLOPS   254    799                                                 
 
Pi 4B gcc 8

 Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
 KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
   Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

      8   11768   9844   3841  11787   9934   4351  10309   7816   7804
     16   11880   9880   3822  11886  10043   4363  10484   7902   7892
     32    9539   8528   3678   9517   8661   4098  10564   7948   7945
     64    9952   9310   3733   9997   9470   4160   8452   7717   7732
    128    9947   9591   3757   9990   9757   4178   8205   7680   7753
    256   10015   9604   3758  10030   9781   4186   8120   7734   7707
    512    9073   9300   3751   9472   9526   4175   7995   7709   7602
   1024    2681   5303   3594   2664   4965   3760   4828   3592   3569
   2048    1671   3488   3242   1757   3635   3540   2882   1036   1023
   4096    1777   3700   3283   1827   3627   3555   2433   1052   1054
   8192    1931   3805   3420   1933   3815   3629   2465    980    971

Max MFLOPS 1485   2740

Pi 4B/3B+

      8    5.81   3.08   1.99   3.96   2.23   2.09   3.02   1.77   1.76
     16    5.85   3.09   1.98   3.99   2.25   2.10   3.06   1.78   1.77
     32    4.84   2.74   1.93   3.35   2.02   1.99   3.08   1.78   1.78
     64    5.15   3.06   1.99   3.63   2.30   2.07   2.61   1.83   1.83
    128    5.04   3.11   1.99   3.54   2.34   2.08   2.50   1.80   1.81
    256    5.06   3.11   1.99   3.52   2.34   2.08   2.45   1.82   1.82
    512    5.18   3.35   2.16   3.85   2.57   2.26   2.57   1.91   1.86
   1024    2.74   2.85   2.59   2.20   2.23   2.57   3.30   2.93   2.93
   2048    1.71   1.88   2.35   1.55   1.72   2.45   3.36   1.25   1.24
   4096    1.82   2.05   2.41   1.61   1.73   2.49   2.58   1.14   1.15
   8192    1.98   2.13   2.51   1.70   1.86   2.58   3.07   1.24   1.32

4B gcc 8 gains

      8    1.39   2.07   0.29   1.42   2.08   0.28   1.32   0.79   0.79
     16    1.66   2.52   0.44   1.67   2.45   0.47   1.33   0.78   0.78
     32    1.20   1.90   0.36   1.20   1.91   0.35   1.36   0.79   0.79
     64    1.22   2.02   0.38   1.23   2.05   0.38   1.14   0.96   0.96
    128    1.20   2.06   0.39   1.20   2.10   0.39   1.11   0.95   0.98
    256    1.20   2.06   0.39   1.20   2.10   0.39   1.08   0.98   0.97
    512    1.12   2.01   0.40   1.15   2.05   0.39   1.59   0.98   0.96
   1024    0.75   1.70   0.99   0.74   1.58   1.02   0.90   1.48   2.29
   2048    1.25   1.93   1.82   0.97   1.98   2.00   1.35   1.09   1.07
   4096    0.94   1.97   1.77   0.97   2.18   2.66   1.22   1.07   1.00
   8192    1.02   2.00   1.82   1.12   2.89   2.65   1.09   0.96   0.95


NeonSpeed Benchmark MB/Second - NeonSpeed, NeonSpeedC8

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm functions are as generated by the compiler, using NEON directives, and Neon functions use intrinsic functions. Of late, both methods produce similar performance at up to 3000 million operations per second, across the board. Pi 4B/3B+ comparisons are also included below, showing the best gains in the L2 cache area. Pi 4B gcc 8 gains and losses are also provided, with the main loss on normal integer calculations from cached data.

                     Pi 3B+                       

  NEON Speed Test V 1.0 Fri Apr 12 22:11:38 2019  

       Vector Reading Speed in MBytes/Second      
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v 
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3170   4669   4037   4930   5220   5545
      32   3119   4531   3952   4780   5071   5374
      64   2845   3920   3558   4075   4235   4438
     128   2873   3954   3626   4095   4227   4484
     256   2917   4027   3705   4184   4313   4563
     512   2271   2923   2777   3000   3075   3127
    1024   1181   1209   1221   1201   1163   1198
    4096   1062   1077   1071   1050   1073   1076
   16384   1087   1115   1111   1043   1094   1086
   65536   1125   1144   1139    851   1126   1110
 
                     Pi 4B                        
  
      16   9677  10072   8905   9358   9776  10473
      32  10149  10330   9364   9539   9988  10543
      64  10948  11708  10466  10568  11318  11994
     128  10484  11232  10410  10104  11200  11792
     256  10509  11369  10428  10264  11273  11842
     512  10406  11066  10134  10054  11075  11467
    1024   3069   3202   3159   3166   3204   3203
    4096   1721   1910   1908   1882   1903   1900
   16384   2023   2009   2008   1965   2032   2013
   65536   2073   2074   2074   2073   2068   2064

                 Pi 4B/3B+ Comparisons            

   Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3.05   2.16   2.21   1.90   1.87   1.89
     512   4.58   3.79   3.65   3.35   3.60   3.67
    1024   2.60   2.65   2.59   2.64   2.75   2.67
   16384   1.86   1.80   1.81   1.88   1.86   1.85

                   Pi 3B+ gcc 8                  

  NEON Speed Test gcc 8 Wed May 15 09:57:18 2019  

       Vector Reading Speed in MBytes/Second      
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3289   5377   2010   5076   5731   5732
      32   3280   5341   1995   5043   5706   5706
      64   3115   4547   1923   4348   4771   4771
     128   3145   4683   1927   4482   4886   4888
     256   3146   4698   1926   4500   4906   4908
     512   2666   3762   1779   3527   3903   3915
    1024   1879   1228   1395   1225   1238   1238
    4096   1792   1151   1373   1144   1164   1162
   16384   1698   1167   1353   1119   1167   1170
   65536   1229   1157   1328    874   1165   1166

                    Pi 4B gcc 8                   

      16   9884  12882   3910  12773  13090  15133
      32   9904  13061   3916  13002  13162  15239
      64   9029  11526   3450  10704  11708  12084
     128   9242  11784   3391  11016  11816  12179
     256   9283  11890   3396  11215  11929  12284
     512   9043  10680   3413  10211  10925  11241
    1024   5818   3310   3507   3288   3239   2902
    4096   4060   1994   3497   1991   2009   2011
   16384   4030   2063   3445   2068   2072   2067
   65536   3936   2109   3391   1858   2122   2121

 
                 Pi 4B/3B+ Comparisons            

  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3.01   2.40   1.95   2.52   2.28   2.64
     512   3.39   2.84   1.92   2.90   2.80   2.87
    1024   3.10   2.70   2.51   2.68   2.62   2.34
   16384   2.37   1.77   2.55   1.85   1.78   1.77

               4B gcc 8 gains and losses          

      16   1.02   1.28   0.44   1.36   1.34   1.44
     512   0.87   0.97   0.34   1.02   0.99   0.98
   16384   1.99   1.03   1.72   1.05   1.02   1.03


MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in two different versions, using standard compiled “C” code for single and double precision arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.


MP-Whetstone Benchmark - MP-WHETSPiA7, MP-WHETSPC8

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Mutex functions are used to avoid update conflicts, by only allowing one thread at a time to access common data. Performance is generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a different compiler being used. None of the test functions are suitable for SIMD operation, with the simpler instructions being used. Overall seconds indicates MP efficiency.
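
The structure described above can be sketched as follows in Python. This is not the benchmark code, which is written in C. The names `run_threads` and `worker` and the small arithmetic loop are assumptions for illustration. Note also that CPU bound Python threads do not run in parallel under the standard interpreter, so the sketch shows the arrangement of dedicated variables, mutex and timing, not multi core scaling.

```python
import threading
import time

def run_threads(n_threads, passes=1000):
    """Run the same work in n_threads threads; time until the last one ends."""
    lock = threading.Lock()        # mutex protecting the common data
    shared = {"passes": 0}         # common data updated by every thread

    def worker():
        local = 0.0                # dedicated variable, private to this thread
        for i in range(passes):
            local += i * 0.5
        with lock:                 # only one thread at a time updates the total
            shared["passes"] += passes

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()                   # returns once the last thread has finished
    elapsed = time.perf_counter() - start
    return shared["passes"], elapsed

for n in (1, 2, 4, 8):
    total, seconds = run_threads(n)
    print(f"{n}T  {total} passes in {seconds:.4f} seconds")
```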

Based on the 4 thread MWIPS rating, both compilations indicate the same Pi 4B performance improvement, but there are variations on the individual test functions.


                      Pi 3B+ ARM V7                               

  MP-Whetstone Benchmark Linux/ARM V7A v1.0 Wed Apr 24 22:48:42 2019

                    Using 1, 2, 4 and 8 Threads                   

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T  1116.9  582.4  603.6  299.7  21.7  13.3  6969.0 1364.0 1398.5
 2T  2226.5 1167.8 1181.0  593.5  43.4  26.4 12545.8 2789.0 2794.1
 4T  4436.8 2354.9 2387.3 1190.1  86.3  52.5 27429.4 5539.7 5546.8
 8T  4614.6 3174.1 3140.6 1250.0  88.1  54.7 36555.2 6409.9 6051.1

   Overall Seconds   4.99 1T,   5.02 2T,   5.10 4T,  10.20 8T     

                        Pi 4B ARM V7                               

 1T  2059.3  672.8  680.1  310.6  55.6  33.1  7461.6  2244.6  995.2
 2T  4117.1 1341.7 1390.7  624.2 110.7  65.9 14887.3  4466.5 1986.2
 4T  7910.0 2652.0 2722.2 1180.0 208.5 132.6 29291.2  8952.4 3832.3
 8T  8651.6 3057.1 2971.1 1268.3 233.2 149.6 38367.5 11922.5 3941.7

   Overall Seconds   4.99 1T,   5.01 2T,   5.29 4T,  10.71 8T      

                      Pi 3B+ gcc 8                                 

  MP-Whetstone Benchmark Linux/ARM gcc 8 Fri Jun 14 14:25:28 2019  
                   
      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS

 1T  1057.5  390.9  392.6  298.1  21.0  12.3  5227.8  1363.1 1399.4
 2T  2121.8  777.4  778.5  598.3  42.3  24.6 10185.9  2769.0 2762.9
 4T  4225.9 1509.6 1532.2 1192.3  84.7  48.8 19273.0  5326.5 5552.9
 8T  4419.6 1914.9 2041.9 1260.8  86.0  51.3 27645.3  7213.5 6031.5

   Overall Seconds   4.98 1T,   5.00 2T,   5.11 4T,  10.09 8T      

                      Pi 4B gcc 8                                  
 
 1T  1889.5  538.7  537.6  311.4  56.3  26.1  7450.5  2243.2  659.9
 2T  3782.7 1065.5 1071.2  627.1 112.3  52.0 14525.7  4460.9 1327.3
 4T  7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5  8944.2 2660.8
 8T  8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4

   Overall Seconds   4.99 1T,   5.00 2T,   5.03 4T,  10.06 8T      

                4 Thread 4B/3B+ Performance ratios                 

 V7    1.78   1.13   1.14   0.99   2.42   2.53   1.07   1.62   0.69
 gcc8  1.79   1.39   1.40   1.05   2.66   2.13   1.53   1.68   0.48
 


MP-Dhrystone Benchmark - MP-DHRYPiA7, MP-DHRYPiC8

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance, as reflected in the results. The single thread speeds were similar to the earlier Dhrystone results, with Pi 4B ratings around twice as fast as those for the Pi 3B+, for both compilations, with gcc 8 code being slightly faster.

                         Pi 3B+ ARM V7                      

   MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Wed Apr 24 22:57:46 2019

                    Using 1, 2, 4 and 8 Threads             

 Threads                        1        2        4        8
 Seconds                     0.85     0.96     1.36     2.71
 Dhrystones per Second    4733611  8295393 11750518 11789451
 VAX MIPS rating             2694     4721     6688     6710

                        Pi 4B  ARM V7                       
 
 Seconds                     0.82     1.59     2.70     5.04
 Dhrystones per Second    9731507 10082787 11833655 12706636
 VAX MIPS rating             5539     5739     6735     7232

                        Pi 3B+ gcc 8                        

 Threads                        1        2        4        8
 Seconds                     0.79     0.92     1.23     2.46
 Dhrystones per Second    5035879  8678942 13020489 13028455
 VAX MIPS rating             2866     4940     7411     7415

                       Pi 4B gcc 8                           
 
 Threads                        1        2        4        8
 Seconds                     0.79     1.21     2.62     4.88
 Dhrystones per Second   10126308 13262168 12230188 13106002
 VAX MIPS rating             5763     7548     6961     7459
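The VAX MIPS ratings in these tables are consistent with the conventional Dhrystone scaling, in which 1757 Dhrystones per second counts as 1 MIPS (the DEC VAX 11/780 reference score). The short check below applies that divisor to the single thread figures quoted above. The function name `vax_mips` is just an illustrative choice.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # conventional 1 MIPS reference score


def vax_mips(dhrystones_per_second):
    """Convert a Dhrystones per second score to a VAX MIPS rating."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


# Single thread Dhrystones per second, copied from the tables above.
single_thread = {
    "Pi 3B+ ARM V7": 4733611,
    "Pi 4B ARM V7": 9731507,
    "Pi 3B+ gcc 8": 5035879,
    "Pi 4B gcc 8": 10126308,
}

for name, score in single_thread.items():
    print(f"{name:14s} {round(vax_mips(score)):5d} VAX MIPS")

# Pi 4B/3B+ single thread ratios for the two compilations.
ratio_v7 = single_thread["Pi 4B ARM V7"] / single_thread["Pi 3B+ ARM V7"]
ratio_gcc8 = single_thread["Pi 4B gcc 8"] / single_thread["Pi 3B+ gcc 8"]
print(f"V7 ratio {ratio_v7:.2f}, gcc 8 ratio {ratio_gcc8:.2f}")
```

Both ratios come out at a little over 2, in line with the "around twice as fast" comment.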
  


MP SP NEON Linpack Benchmark - linpackNeonMP, linpackNeonMPC8

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running as a single thread. The main reasons appear to be updating data in RAM to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.

Unthreaded performance (the None column) was slowest when accessing the larger data arrays (higher N values), where speeds were more constant across the four sets of results. Fastest Pi 4B improvements were at N = 100, at around three times.

The programs produce the sumchecks, as shown below, with the four sets of calculations producing identical numeric results (as they should).


                      Pi 3B+ ARM V7           
 
 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Wed Apr 24 23:03:08 2019

  MFLOPS 0 to 4 Threads, N 100, 500, 1000     

 Threads      None        1        2        4 

 N  100     627.07    66.31    64.79    64.14 
 N  500     465.16   293.95   292.37   293.76 
 N 1000     346.63   311.81   309.19   311.76 

                      Pi 4B  ARM V7           

 N  100    1921.53   108.66   101.88   102.46 
 N  500    1548.81   530.23   714.37   733.09 
 N 1000     399.94   378.11   364.78   398.21 

                      Pi 3B+ gcc 8            

 N  100     638.49    66.92    66.23    66.14 
 N  500     471.71   304.69   297.05   305.51 
 N 1000     356.13   317.22   316.88   316.33 

                      Pi 4B gcc 8             

 N  100    2007.38   112.55   107.85   106.98 
 N  500    1332.24   686.10   686.11   689.02 
 N 1000     402.61   435.26   432.21   432.01 

                      Sumchecks                    

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04
  


MP BusSpeed (read only) Benchmark - MP-BusSpeedPiA7, MP-BusSpd2PiC8

Each thread accesses all of the data in separate sections covering caches and RAM, starting at different points, with this V7A v2 version. See single processor BusSpeed details regarding burst reading that can indicate significant differences. RdAll is the main area for comparison, where MP reading RAM is thought to indicate maximum performance.

Comparisons are provided for RdAll, at 1, 2 and 4 threads. These are subject to multiprocessing peculiarities, but Pi 4B/Pi 3B+ performance gains were indicated as being around 2.5, using L1 cache data, and twice as fast, via L2 cache and RAM, with the gcc 8 produced version little different from the earlier compilations.

                      Pi 3B+ ARM V7                        

 MP-BusSpd ARM V7A v2 Wed Apr 24 22:58:50 2019           

   MB/Second Reading Data, 1, 2, 4 and 8 Threads         
   Staggered starting addresses to avoid caching         

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll       

 12.3 1T   3470   4390   4408   4760   5138   4926       
      2T   6272   7807   8321   9131   9780   9599       
      4T   9867  13732  15514  17568  19512  18209       
      8T   7385  10918  12320  14591  17357  16462       
122.9 1T    662    648   1253   2129   3291   4475       
      2T   1044   1032   2003   3611   6135   8931       
      4T   1068   1085   2180   4354   8409  16053       
      8T   1057   1078   2124   4247   8227  15070       
12288 1T    125    131    252    494   1009   1996       
      2T    195    136    272    501   1088   2121       
      4T    126    135    263    515   1017   1922       
      8T    114    136    305    545    994   2076       

                      Pi 4B  ARM V7               
                                                Pi 4B/3B+

 12.3 1T   5263   5637   5809   5894   5936  13445   2.73
      2T   9412  10020  10567  11454  11604  24980   2.60
      4T  16282  15577  16418  21222  20000  45530   2.50
      8T  11600  13285  16070  18579  20593  36837       
122.9 1T    739    956   1888   3153   5008   9527   2.13
      2T    629   1158   1568   5058   9509  16489   1.85
      4T    600   1093   2134   4527   8732  16816   1.05
      8T    593   1104   2121   4382   8629  17158       
12288 1T    238    258    518   1005   2001   4029   2.02
      2T    278    228    453   1690   1826   3628   1.71
      4T    269    257    740   1019   1790   4145   2.16
      8T    233    292    532    926   2186   3581       

                      Pi 3B+ gcc 8                

 MP-BusSpd ARM V7A gcc 8 Wed May 15 10:06:27 2019        
 
  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll       

 12.3 1T   3555   4451   4382   4788   5124   5205       
      2T   6515   8132   8332   9016   9793  10100        
      4T  10667  14186  15956  17529  19228  16522       
      8T   7463  10987  13299  14948  17756  16781       
122.9 1T    681    683   1211   2133   3280   4713       
      2T   1049   1057   2009   3848   6155   9293       
      4T   1049   1085   2191   4360   7921  16268       
      8T   1072   1092   2180   4303   8156  15722       
12288 1T    125    131    256    495   1005   1970       
      2T    135    133    273    505   1100   2110       
      4T    116    130    243    511   1009   2059       
      8T    126    138    260    532   1061   2017       

                      Pi 4B gcc 8                        
                                                Pi 4B/3B+

 12.3 1T   5310   5616   5801   5898   5940  13425   2.54
      2T   9393  10008  11293  11293  11368  24932   2.47
      4T  15781  15015  17606  19034  22279  40736   2.47
      8T   8465   9599  14580  18465  20034  36831       
122.9 1T    664    930   1861   3191   5017  10281   2.18
      2T    564    726   1523   5376   9387  18985   2.04
      4T    486    919   1886   4289   8337  16979   1.04
      8T    487    912   1854   4275   8271  16826       
12288 1T    225    258    514   1010   1992   3975   2.02
      2T    202    421    450   1765   3307   7396   3.51
      4T    261    288    825   1332   1772   5014   2.44
      8T    218    273    496   1041   2571   4021       
  


MP RandMem Benchmark - MP-RandMemPiA7, MP-RandMemPiC8

This benchmark potentially reads and writes all data, in sections covering caches and RAM, with each thread starting at a different address. Random access can select any address after that. Serial reading speed is normally similar to BusSpeed RdAll. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.

Besides the full results, comparisons of the four thread results are shown below as Pi 4B/3B+ performance ratios. The Pi 3B+ appears to be faster reading data from the shared L2 cache, with 4 threads only. Otherwise, the average performance of the new processor was indicated as 80% faster.

                  Pi 3B+ ARM V7         

 MP-RandMem Linux/ARM V7A v1.0 Wed Apr 24 22:54:55 2019

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    3419    4333    3420    4422
      2T    6531    4397    6515    4397
      4T   12814    4308   12896    4303
      8T   12922    4289   12561    4244
122.9 1T    3133    3959     800    1041
      2T    5992    3959    1469    1040
      4T   11584    3913    2322    1025
      8T   11417    3895    2288    1028
12288 1T    2034     795      48      62
      2T    2176     799      93      63
      4T    3183     790     128      63
      8T    2008     788     130      62

                  Pi 4B  ARM V7         

 12.3 1T    5860    7905    5927    7657
      2T   11747    7908   11182    7746
      4T   21416    7626   17382    7731
      8T   20649    7528   20431    7378
122.9 1T    5479    7269    1826    1923
      2T   10355    6964    1667    1920
      4T    9808    7177    1715    1908
      8T   11677    7058    1697    1919
12288 1T    3438    1271     179     152
      2T    4176    1204     213     167
      4T    4227    1117     337     161
      8T    3479    1093     287     168

                 Pi 4B/3B+                      

 12.3 4T    1.67    1.77    1.35    1.80
122.9 4T    0.85    1.83    0.74    1.86
12288 4T    1.33    1.41    2.63    2.56

                  Pi 3B+ gcc 8          

 12.3 1T    4362    4386    4363    4386
      2T    8222    4308    8132    4311
      4T   16391    4268   16396    4286
      8T   16297    4244   15510    4228
122.9 1T    3643    3879     925    1025
      2T    7008    3873    1692    1040
      4T   12553    3877    2373    1038
      8T   12000    3881    2330    1043
12288 1T    1848     833      67      62
      2T    2183     829     119      63
      4T    3672     825     135      63
      8T    2608     826     136      63
 
                  Pi 4B gcc 8           

 12.3 1T    5950    7903    5945    7896
      2T   11849    7923   11887    7917
      4T   23404    7785   23395    7761
      8T   21903    7669   23104    7655
122.9 1T    5670    7309    2002    1924
      2T   10682    7285    1648    1923
      4T    9944    7266    1813    1927
      8T    9896    7216    1812    1919
12288 1T    3904    1075     179     164
      2T    7317    1055     215     164
      4T    3398    1063     343     165
      8T    4156    1062     350     165

                Pi 4B/3B+ gcc 8         

 12.3 4T    1.43    1.82    1.43    1.81
122.9 4T    0.79    1.87    0.76    1.86
12288 4T    0.93    1.29    2.54    2.62


MP-MFLOPS Benchmarks - MP-MFLOPSPiA7, MP-MFLOPSDP, MP-NeonMFLOPS,
MP-MFLOPSPiC8, MP-MFLOPSDPC8, MP-NeonMFLOPSC8

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. The numeric results are converted into a simple sumcheck, that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.

Note the across the board Pi 4B performance gains on all programs, with maximum speeds of 17.2 GFLOPS for single precision calculations and 10.4 GFLOPS using double precision.
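The idea of threads working on separate segments, with a sumcheck that does not depend on the thread count, can be sketched as below. This is a Python illustration rather than the benchmark's C code. The constants, the initial data and the averaging sumcheck are hypothetical, and only the three terms of the formula quoted above are used (eight floating point operations per word), since the rest of the 32 operation expression is not given here.

```python
import threading

# Hypothetical constants; the benchmark's own values are not given in this report.
A, B, C, D, E, F = 0.2, 0.999, 0.3, 0.5, 0.1, 0.4


def update_segment(x, start, stop):
    """Apply the quoted terms to x[start:stop]: 3 adds, 3 multiplies, 1 subtract, 1 add."""
    for i in range(start, stop):
        x[i] = (x[i] + A) * B - (x[i] + C) * D + (x[i] + E) * F


def run(words, n_threads):
    """Update all words using n_threads threads and return a simple sumcheck."""
    x = [0.5 + i / words for i in range(words)]             # hypothetical initial data
    bounds = [words * t // n_threads for t in range(n_threads + 1)]
    workers = [
        threading.Thread(target=update_segment, args=(x, bounds[t], bounds[t + 1]))
        for t in range(n_threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sum(x) / words                                    # hypothetical sumcheck


checks = [run(3200, n) for n in (1, 2, 4, 8)]
print(checks)
```

Each word is updated independently of its neighbours, so splitting the array into segments changes only who does the work, not the arithmetic. That is why the four values printed are identical.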


                Single Precision Version         

                      Pi 3B+ ARM V7                 

 MP-MFLOPS Linux/ARM V7A v1.0 Wed Apr 24 23:08:19 2019

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
 1T      214     212     189     813     812     797
 2T      403     427     354    1613    1587    1573
 4T      717     811     372    3044    3027    2982
 8T      756     777     388    3005    3101    3064

                     Pi 4B  ARM V7                  

 1T      987     993     606    2816    2794    2804
 2T     1823    1837     567    5610    5541    5497
 4T     2119    3349     647    9884   10702    9081
 8T     3136    3783     609   10230   10504    9240
Max                                                 
4B/3B+  4.15    4.66    1.67    3.36    3.45    3.02

                      Pi 3B+ gcc 8                   

 1T      214     212     189     799     784     781
 2T      417     417     365    1568    1583    1540
 4T      754     683     385    3026    3017    2919
 8T      738     761     401    3053    2997    2866

                     Pi 4B gcc 8                    

 1T     1224    1257     520    2814    2800    2803
 2T     2485    2257     525    5608    5575    5576
 4T     4119    3243     534   11018   10645    8358
 8T     4131    4618     541    9941   10339    8165
Max                                                 
4B/3B+  5.48    6.07    1.35    3.61    3.53    2.86
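The "Max 4B/3B+" line appears to be the best Pi 4B speed in each column divided by the best Pi 3B+ speed in the same column, whichever thread count produced it. The sketch below applies that assumed rule to the gcc 8 single precision figures above. The helper name `max_ratios` is illustrative.

```python
# MFLOPS rows for 1, 2, 4 and 8 threads; columns are 2 Ops/Word then 32 Ops/Word,
# each at 12.8, 128 and 12800 KB (gcc 8 single precision tables above).
pi3b_plus = [
    [214, 212, 189, 799, 784, 781],
    [417, 417, 365, 1568, 1583, 1540],
    [754, 683, 385, 3026, 3017, 2919],
    [738, 761, 401, 3053, 2997, 2866],
]
pi4b = [
    [1224, 1257, 520, 2814, 2800, 2803],
    [2485, 2257, 525, 5608, 5575, 5576],
    [4119, 3243, 534, 11018, 10645, 8358],
    [4131, 4618, 541, 9941, 10339, 8165],
]


def max_ratios(new, old):
    """Ratio of column maxima, rounded to two decimal places."""
    return [round(max(n) / max(o), 2) for n, o in zip(zip(*new), zip(*old))]


print(max_ratios(pi4b, pi3b_plus))
```

The result reproduces the six ratios printed in the table, which supports the assumed rule.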

 ###################################################

            NEON Intrinsic Functions Version      

                      Pi 3B+ ARM V7                 

 MP-MFLOPS NEON Intrinsics v1.0 Wed Apr 24 22:41:38 2019

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
 1T      692     685     393    2052    2017    1887
 2T     1126    1358     403    4096    3924    3697
 4T     2434    2030     405    7848    7740    5547
 8T     2363    2095     407    7584    7609    6097

                     Pi 4B  ARM V7                  

 1T     2491    2399     615    4325    4285    4261
 2T     5629    5520     591    8602    8463    8308
 4T    10580    5594     553   16991   16493    9124
 8T     7047   10785     513   14325   16219    8867
Max                                                 
4B/3B+  4.35    5.15    1.36    2.17    2.13    1.50

 
                      Pi 3B+ gcc 8

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      691     684     407    1910    1874    1828
 2T     1214    1306     410    3746    3747    3392
 4T     1943    2568     410    7403    7435    5913
 8T     2093    2233     411    7217    7087    6044

                      Pi 4B gcc 8

 1T     2797    2870     641    4422    4454    4405
 2T     3217    5601     569    8587    8800    8377
 4T     7902    9864     611   17061   17215    9704
 8T     7070   10562     603   15531   16203    9516
Max
4B/3B+  3.78    4.13    1.49    2.30    2.32    1.61

 ###################################################

                Double Precision Version

                      Pi 3B+ ARM V7

 MP-MFLOPS Double Precision v1.0 Sat Jun 15 12:07:33 2019

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      209     206     166     782     797     747
 2T      415     416     198    1566    1590    1462
 4T      663     801     198    3125    3122    2770
 8T      746     729     199    3061    2909    2745

                      Pi 4B  ARM V7

 1T     1187    1220     309    2682    2714    2701
 2T     2420    2416     282    5379    5415    4780
 4T     4665    2381     317   10256   10336    5242
 8T     4385    3114     310    9721   10340    5131
Max
4B/3B+  6.25    3.89    1.59    3.28    3.31    1.89

                      Pi 3B+ gcc 8

 1T      214     213     168     798     797     776
 2T      409     416     194    1567    1590    1466
 4T      694     675     195    3122    3120    2751
 8T      698     797     198    3055    3005    2779

                      Pi 4B gcc 8

 1T     1203    1211     315    2675    2719    2674
 2T     2291    2441     293    5406    5421    4907
 4T     4673    2501     309   10313   10393    5256
 8T     4394    3550     265    8782   10110    5197
Max
4B/3B+  6.69    4.45    1.56    3.30    3.33    1.89

                        Sumchecks

 SP      76406   97075   99969   66015   95363   99951
 NEON    76406   97075   99969   66014   95363   99951
 DP      76384   97072   99969   66065   95370   99951



OpenMP-MFLOPS - OpenMP-MFLOPS, notOpenMP-MFLOPS, OpenMP-MFLOPSC8,
OpenMP-MFLOPSDPC8, notOpenMP-MFLOPSC8, notOpenMP-MFLOPSDPC8

This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks but, in addition, includes calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive. With gcc 8, additional versions have been produced, using double precision floating point. The general format and standard parameters are as follows.

The final data values are checked for consistency. Different compilers or different CPUs could involve using alternative instructions or rounding effects, with variable accuracy. Even so, OpenMP sumchecks could be expected to be the same as those from NotOpenMP single core values. However, this is not always the case. The double precision gcc 8 benchmarks appear to be consistent, but only single precision sumchecks are provided.

This benchmark was a compilation of code used for desktop PCs, starting at 100 KB, then 1 MB and 10 MB.


            OpenMP MFLOPS Benchmark 1 Wed Apr 24 22:51:10 2019                

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.281575     1776    0.929538   Yes
 Data in & out    1000000     2      250   1.265817      395    0.992550   Yes
 Data in & out   10000000     2       25   1.222289      409    0.999250   Yes

 Data in & out     100000     8     2500   0.376635     5310    0.957126   Yes
 Data in & out    1000000     8      250   1.305504     1532    0.995524   Yes
 Data in & out   10000000     8       25   1.267736     1578    0.999550   Yes

 Data in & out     100000    32     2500   3.285631     2435    0.890232   Yes
 Data in & out    1000000    32      250   3.351830     2387    0.988068   Yes
 Data in & out   10000000    32       25   3.329400     2403    0.998785   Yes

                End of test Wed Apr 24 22:51:26 2019                          
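The MFLOPS column in the log above follows from the other columns as words x operations per word x repeat passes, divided by the seconds taken. The following check recomputes it for every row of the example. The function name `mflops` is an illustrative choice, not taken from the benchmark source.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test row."""
    return words * ops_per_word * passes / seconds / 1e6


# (words, ops/word, passes, seconds, reported MFLOPS) copied from the log above.
rows = [
    (100000, 2, 2500, 0.281575, 1776),
    (1000000, 2, 250, 1.265817, 395),
    (10000000, 2, 25, 1.222289, 409),
    (100000, 8, 2500, 0.376635, 5310),
    (1000000, 8, 250, 1.305504, 1532),
    (10000000, 8, 25, 1.267736, 1578),
    (100000, 32, 2500, 3.285631, 2435),
    (1000000, 32, 250, 3.351830, 2387),
    (10000000, 32, 25, 3.329400, 2403),
]

for words, ops, passes, seconds, reported in rows:
    print(f"{words:9d} {ops:3d} {round(mflops(words, ops, passes, seconds)):6d} {reported:6d}")
```

Note that words x passes is the same for every row (250 million word updates), so the run time differences come only from data size and operations per word.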

SumChecks

V7A OMP 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890232 0.988068 0.998785

V7A Not 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890268 0.988078 0.998806

gcc 8 OMP  3B+, Not 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890282 0.988096 0.998806

gcc 8 4B OMP
0.098043 0.810084 0.922891 0.144870 0.922568 0.918226 0.401577 0.935064 0.916277

gcc 8 DP OMP 3B+ 4B, Not 3B+ 4B
0.929474 0.992543 0.999249 0.957164 0.995525 0.999550 0.890377 0.988101 0.998799
  

MFLOPS Performance and Comparisons

The first comparisons below identify OpenMP 4 core performance gains. For 2 and 8 operations per word read and written, real gains can only be seen with 100 KB data size. With CPU speed limitations at 32 operations per word, single core MFLOPS is shown to be constant at all data sizes, but with high OpenMP speeds only occurring at 100 KB data size.

The other comparisons identify Pi 4B performance gains over the Pi 3B+, where those for single core use are generally better than those via OpenMP. The highest OpenMP improvement was 4.5 times, via gcc 8 and double precision operation. Maximum demonstrated Pi 4B speeds were 19.9 GFLOPS single precision and 9.2 GFLOPS double precision.


         V7A Compiler                                          
         Pi 3B+               Pi 4B                  Pi4 Gains 
 KB+Ops    4       1   4 core   4       1   4 core   4      1  
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2  1776    831   2.14   4716   2850   1.65   2.66   3.43
 1000- 2   395    391   1.01    556    429   1.30   1.41   1.10
10000- 2   409    409   1.00    544    632   0.86   1.33   1.55

  100- 8  5310   2009   2.64   7981   5191   1.54   1.50   2.58
 1000- 8  1532   1445   1.06   2389   2082   1.15   1.56   1.44
10000- 8  1578   1478   1.07   2199   2003   1.10   1.39   1.36

  100-32  2435   1855   1.31   8147   5449   1.50   3.35   2.94
 1000-32  2387   1733   1.38   7951   5385   1.48   3.33   3.11
10000-32  2403   1736   1.38   8030   5379   1.49   3.34   3.10

 
         gcc 8 Compiler
         Pi 3B+               Pi 4B                  Pi4 Gains
 KB+Ops    4       1   4 core   4       1   4 core   4      1
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2  2139    778   2.75   5100   2270   2.25   2.38   2.92
 1000- 2   398    403   0.99    617    632   0.98   1.55   1.57
10000- 2   412    415   0.99    542    631   0.86   1.32   1.52

  100- 8  7348   1919   3.83  13805   5511   2.50   1.88   2.87
 1000- 8  1597   1448   1.10   2168   2217   0.98   1.36   1.53
10000- 8  1635   1444   1.13   2178   2542   0.86   1.33   1.76

  100-32  8497   2023   4.20  19921   5341   3.73   2.34   2.64
 1000-32  5997   1903   3.15   8556   5267   1.62   1.43   2.77
10000-32  6057   1914   3.16   8731   5276   1.65   1.44   2.76

         gcc 8 Double Precision
         Pi 3B+               Pi 4B                  Pi4 Gains
 KB+Ops    4       1   4 core   4       1   4 core   4      1
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2   711    203   3.50   3200    977   3.28   4.50   4.81
 1000- 2   193    168   1.15    274    295   0.93   1.42   1.76
10000- 2   199    172   1.16    273    307   0.89   1.37   1.78

  100- 8  1898    503   3.77   6771   2440   2.78   3.57   4.85
 1000- 8   730    434   1.68   1102   1072   1.03   1.51   2.47
10000- 8   755    435   1.74   1108   1255   0.88   1.47   2.89

  100-32  3072    793   3.87   9229   2725   3.39   3.00   3.44
 1000-32  2695    765   3.52   4256   2674   1.59   1.58   3.50
10000-32  2719    765   3.55   4469   2677   1.67   1.64   3.50



Floating Point Assembly Code

The latest floating point performance improvements, via gcc 8, are due to better use of NEON instructions. If I have read this report correctly, double precision ARM NEON SIMD is not supported on V7 CPUs, only Single Instruction Single Data (SISD), where fused multiply and add instructions can produce two results per clock cycle, or a maximum of 3 GFLOPS per core on Pi 4, or 12 GFLOPS overall.

In my MP-MFLOPS programs, for the routines that include 32 double precision floating point operations per data word read, disassembly indicates that the following instructions are used, with 64 bit d registers, where maximum measured speed was just over 10 GFLOPS.

.L18:                            
    vldr.64         d17, [r1]   
    vadd.f64        d16, d17, d4 
    vadd.f64        d18, d17, d0 
    vadd.f64        d25, d17, d15
    vadd.f64        d24, d17, d11
    vmul.f64        d16, d16, d5 
    vadd.f64        d23, d17, d31
    vadd.f64        d22, d17, d27
    vadd.f64        d21, d17, d2 
    vadd.f64        d20, d17, d6 
    vadd.f64        d19, d17, d13
    vfma.f64        d16, d18, d1 
    vadd.f64        d18, d17, d9 
    vadd.f64        d17, d17, d29
    vfma.f64        d16, d25, d14
    vfma.f64        d16, d24, d10
    vfma.f64        d16, d23, d30
    vfma.f64        d16, d22, d28
    vfms.f64        d16, d21, d3 
    vfms.f64        d16, d20, d7 
    vfms.f64        d16, d19, d12
    vfms.f64        d16, d18, d8 
    vfms.f64        d16, d17, d26
    vstmia.64       r1!, {d16}   
    cmp             r0, r1       
    bne             .L18         
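As a quick cross-check, the loop above can be tallied to confirm that it performs 32 floating point operations per data word. The sketch counts each add or multiply as one operation and each fused instruction (vfma, vfms) as two. Only the mnemonics are copied from the listing, and the variable names are illustrative.

```python
from collections import Counter

# Mnemonics of the loop body, in the order listed above.
LOOP = """
vldr.64 vadd.f64 vadd.f64 vadd.f64 vadd.f64 vmul.f64
vadd.f64 vadd.f64 vadd.f64 vadd.f64 vadd.f64 vfma.f64
vadd.f64 vadd.f64 vfma.f64 vfma.f64 vfma.f64 vfma.f64
vfms.f64 vfms.f64 vfms.f64 vfms.f64 vfms.f64
vstmia.64 cmp bne
""".split()

# Floating point operations performed by each arithmetic instruction.
FLOPS = {"vadd.f64": 1, "vmul.f64": 1, "vfma.f64": 2, "vfms.f64": 2}

counts = Counter(LOOP)
flops_per_word = sum(FLOPS.get(m, 0) * n for m, n in counts.items())
fp_instructions = sum(n for m, n in counts.items() if m in FLOPS)

print(counts["vadd.f64"], "adds,", counts["vmul.f64"], "multiply,",
      counts["vfma.f64"] + counts["vfms.f64"], "fused")
print(flops_per_word, "operations from", fp_instructions,
      "arithmetic instructions out of", len(LOOP))
```

With 11 adds, 1 multiply and 10 fused instructions, 22 arithmetic instructions deliver the 32 operations. The loop therefore cannot reach the two results per cycle ceiling on every instruction, even before the load, store and branch overheads are considered.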
   
It is not clear (to me) what the maximum speed is for single precision calculations. These appear to compile to full SIMD operation, using quad word registers. With fused multiply and add, that could amount to 8 results per clock cycle, with 12 GFLOPS from one Pi 4 core and 48 GFLOPS overall. Maximum obtained was around 20 GFLOPS.
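The theoretical maxima quoted in this section follow from results per clock cycle x clock speed x number of cores. The sketch below reproduces the arithmetic, assuming the Pi 4B's 1.5 GHz clock and four cores (the clock speed is not stated in this section), and compares the peaks with the best measured speeds reported above.

```python
CLOCK_GHZ = 1.5   # assumed Pi 4B CPU clock
CORES = 4         # assumed number of cores


def peak_gflops(results_per_cycle):
    """Return (per core, all cores) theoretical GFLOPS."""
    per_core = results_per_cycle * CLOCK_GHZ
    return per_core, per_core * CORES


dp_core, dp_all = peak_gflops(2)   # scalar fused multiply and add
sp_core, sp_all = peak_gflops(8)   # 4-way SIMD fused multiply and add

# Best measured MFLOPS from the results above, converted to GFLOPS.
dp_measured = 10393 / 1000         # MP-MFLOPS double precision, gcc 8
sp_measured = 19921 / 1000         # OpenMP-MFLOPS single precision, gcc 8

print(f"DP peak {dp_core} per core, {dp_all} overall, measured {dp_measured / dp_all:.0%} of peak")
print(f"SP peak {sp_core} per core, {sp_all} overall, measured {sp_measured / sp_all:.0%} of peak")
```

On these assumptions the double precision result is within about 15% of the theoretical limit, while the single precision result is well under half of it.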



OpenMP-MemSpeed - OpenMP-MemSpeed2, NotOpenMP-MemSpeed2,
OpenMP-MemSpeed2C8, NotOpenMP-MemSpeed2C8

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with the example single core results also shown after the detailed measurements. Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP, with effects on Pi 3B+ and Pi 4B not the same. Detailed comparisons of these results are rather meaningless.

                               Pi 3B+ ARM V7                             

     Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom       

               Start of test Wed Apr 24 22:45:07 2019                   

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    6432   3483   1646  10276   5514   1770  18468   9721   1534
       8    7041   3603   1651  11747   5783   1788  19068  10085   1538
      16    7023   3606   1557  11694   5839   1672  19316   9528   1469
      32    6983   3600   1525  11413   5915   1656  19385   9532   1442
      64    6283   3554   1584  10861   5751   1621  14307   9466   1443
     128    6828   3578   1580  11074   5828   1659  10791   8935   1490
     256    5384   3365   1521  11216   5166   1687   9806   8148   1519
     512    5371   3253   1511   8917   4858   1412   7752   4363   1365
    1024    3084   2643   1066   3772   3504   1314   1450   1403   1136
    2048    3345   2087   1086   4148   3589   1471   1052   1063   1139
    4096     915   2648    894   4143   2456   1655    984    987   1190
    8192    3644   2504   1124   4183   3530   1496    903    909   1074
   16384     963   2050    922   3867   3154   1478    752    849   1156
   32768    3889   2467   1179   3562   3328   1667    838    833   1150
   65536    3902   2009   1109   3843   1437   1596    917    917    927
  131072     986    667    819   1145    904    820    858    865    584
 
  Not OMP                                                               
       8    1860   2972   4449   2787   4039   4449   3168   3164   3170
     256    1810   2791   4137   2655   3860   4135   3126   3065   3066
   65536     960   1121   1109   1100   1120   1115    901    793    844

                            Pi 4B  ARM V7                                

    Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom        

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7732   8092   1266   7627   8431   1616  31436  15892    889
       8    7546   8158   1284   7925   8537   1597  30383  16635    884
      16    7695   8198   1261   7854   8549   1598  27037  15644    896
      32    7773   7808   1255   8036   7727   1612  29621  16928    897
      64    9728   9094   1233   9355   9028   1602  16855  13297    867
     128   11296  10068   1002  11342  10813   1686  13594  15106    794
     256   13987  11677   1231  15357  13496   1732  12707  10415    878
     512   17763   8841   1170  10023  13404   1529  12655   9137    693
    1024    6070   6553   1262  10196  10069   1455   5405   5027    670
    2048    3858   6609   1343   6440   6643   1657   2234   2324    877
    4096    6055   6743    989   6608   6568   1664   2114   2369    777
    8192    1669   2047   1126   7071   6894   1581   2532   2569    857
   16384    1974   1953   1385   6748   4399   1763   2643   1845    753
   32768    1594   3482   1115   7680   7494   1814   1739   1908   1147
   65536    2630   7446   1320   1632   1826   1651   2061   2920    904
  131072    1438   1540   1249   1714   1694   1244   1760   2011    856

 Not OMP                                                                
       8    8602  11536  13324   8607  11756  13378   7826   7689   7670
     256    8319   9856  10030   8338   8984   9308   5800   7510   7535
   65536    1373   1725   2071   2059   2072   2044   2170    912    900

 
                               Pi 3B+ gcc 8

     Memory Reading Speed Test OpenMP gcc 8 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7065   3661   8370  10058   5245   9260  18199   9342   9242
       8    7350   3854   9338  11747   5786  10201  19108   9663   9412
      16    7444   3955   9543  11918   5961  10696  19339   9854   9831
      32    7198   3953   9537   9783   5908  10683  19075   9958   9971
      64    6848   3901   9057  11146   5168   9187  10408   9399   9440
     128    7655   3916   9113  11204   5785  10073  10315   9185   9191
     256    7044   3921   9154  11263   5785  10114   9601   9002   9019
     512    6662   3579   7738   9326   5206   7931   8313   7911   7903
    1024    4050   2892   4167   3997   3674   4318   1437   1422   1435
    2048    3996   2879   4134   4038   3624   4325   1042   1012    999
    4096    3909   2803   4078   3981   3591   4223   1047    988   1044
    8192    3880   2871   3805   4196   3555   4117    935    948    940
   16384    1366   2193   3757   4058   3178   3895    902    894    843
   32768    2202   2138   3428   3577   3335   3559    871    793    893
   65536    1180   1119   1696   1447   1178   1721    853    874    868
  131072    1016    688   1096   1133    893   1141    844   1141   1080

  Not OMP
       8    2020   1878   2056   2959   2018   2068   3398   4406   4406
     256    1973   1833   1990   2845   1966   1993   3306   4215   4215
   65536    1016   1248   1287   1130   1302   1301   1005    928    915

                               Pi 4B gcc 8

     Memory Reading Speed Test OpenMP gcc 8 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    8097   8322   8641   8020   8436   8384  39701  19701  19712
       8    7814   8555   8756   8321   8548   8526  39042  19984  19996
      16    8149   7738   7742   8303   7779   8192  37995  19883  19984
      32    8969   8769   8799   9040   8759   8743  37737  20133  20130
      64    7617   7457   7437   7575   7380   7422  17770  15332  14248
     128   11221  10936  11003  11105  11011  10986  13650  13910  13881
     256   17883  18144  18036  17691  18094  17844  13073  12465  12535
     512   18001  18468  19675  17075  18221  19264  13511  13895  12008
    1024    9532  10590   9772  11842  11282  11277   7173   9473   9496
    2048    7095   7025   6866   7117   7043   6946   2914   3475   3468
    4096    7244   6927   7036   5951   7054   6531   2582   3130   3122
    8192    4578   7173   7025   6322   7078   7182   2504   3127   3115
   16384    5470   7043   7067   7103   7052   7020   2557   3093   3088
   32768    7359   7817   7766   7158   7078   7757   2618   3066   3094
   65536    7810   7268   7266   3824   7478   5164   2486   3016   2931
  131072    2460   2655   7224   7513   7308   7339   2540   2944   2940

  Not OMP
       8   11775   3895   4342  11787   4325   4354  10334   7806   7816
     256   10032   3699   4223   9978   4289   4185   7105   7612   7621
   65536    2099   2587   3033   2103   3021   3001   2585   1105   1101



I/O Benchmarks

Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and WiFi network connections. The Raspberry Pi programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB and, finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of large files and the file path. The same program code is used for both varieties, the only difference being file opening properties. The drive benchmark includes extra options to use direct I/O, avoiding data caching in main memory, and also includes an extra test with caching allowed. For further details and downloads see the usual PDF file.
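
The large file and random access measurements described above can be sketched as follows. This is a minimal Python illustration, not the benchmark's own code: the function names, the 64 KB write block size and the use of fsync are assumptions made for the example. It uses ordinary buffered I/O, so reading a freshly written file will mostly reflect caching in RAM rather than device speed.

```python
import os
import random
import tempfile
import time


def time_large_file(path, size_mb, block_kb=64):
    """Write, then read back, one file of size_mb MB; return (write, read) MB/s."""
    block = bytes(block_kb * 1024)            # zero-filled buffer (assumed size)
    blocks = size_mb * 1024 // block_kb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                  # ask the OS to flush data to the device
    write_secs = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):
            pass
    read_secs = max(time.perf_counter() - start, 1e-9)

    return size_mb / write_secs, size_mb / read_secs


def time_random_reads(path, span_mb, count=1000, seed=1):
    """Read `count` random 1 KB blocks from the first span_mb MB; return msecs each."""
    rng = random.Random(seed)
    last_block = span_mb * 1024 - 1           # index of the last 1 KB block
    start = time.perf_counter()
    with open(path, "rb") as f:
        for _ in range(count):
            f.seek(rng.randint(0, last_block) * 1024)
            f.read(1024)
    return (time.perf_counter() - start) * 1000 / count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "large.bin")
        for size_mb in (8, 16):               # default sizes quoted in the text
            write_mbps, read_mbps = time_large_file(path, size_mb)
            print(f"{size_mb:4d} MB  write {write_mbps:8.2f}  read {read_mbps:8.2f} MB/s")
        print(f"random 1 KB reads: {time_random_reads(path, 16):.4f} msecs")
```

Pointing the path at a USB stick or a mounted network share, instead of a temporary folder, gives the drive and LAN varieties of the same measurement.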



LanSpeed Benchmarks - WiFi - LanSpeed

Following are Raspberry Pi 3B+ and Pi 4B results using what I believe were both the 2.4 GHz and 5 GHz WiFi frequencies. Details on setting up the links can be found in the LAN/WiFi section of the PDF file. Performance of the two systems was similar at both frequencies.

 ******************** Pi 3B+ 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    5.71    6.07    5.96    5.69    5.46    4.76
      16    6.14    6.38    6.47    6.14    6.15    5.91

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs      2.94   3.081   3.185    3.04    2.89     3.7

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.16    0.57    0.96    0.36    0.63    1.17
 ms/file    25.3   14.31    17.1   11.46   13.04   14.06   2.138

 ********************* Pi 3B+ 5 GHz *********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3
 
       8   12.82   14.52   14.00   10.98   11.09    8.94
      16   11.60   12.91    4.48    9.16    8.19    7.69

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.41    0.76    1.46    0.41    0.74    1.46
 ms/file    9.96   10.83   11.19   10.11   11.02   11.23   1.990

 Random similar to 2.4 GHz


 ********************* Pi 4B 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    6.35    6.33    6.38    7.05    6.98    7.10
      16    6.70    6.82    6.76    7.19    6.53    7.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     2.691   2.875   3.048    3.13    2.93    2.84

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.34    0.44    1.04    0.37    0.37    1.26
 ms/file   12.14   18.59    15.7    11.1    22.2   12.99   2.153


 ********************** Pi 4B 5 GHz *********************

                         MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   11.90   12.96   13.16   10.11    9.55    9.66
      16   11.50   13.93   14.13    9.91    8.88    9.92

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.13    0.46    0.91    0.25    0.55    1.02
 ms/file   30.85   17.83   18.10   16.62   14.93   16.01   3.361

 Random similar to 2.4 GHz
  


LanSpeed Benchmark - (1G bits per second Ethernet on Pi 4B) - LanSpeed

There can be significant variability in performance with these small samples. For the large files, the default sizes were increased to produce more stable speeds. In this case, 1 Gbps was clearly demonstrated using the Pi 4B, around three times faster than the Pi 3B+. Random access was mostly slightly faster on the Pi 4B and, with the small files, it was perhaps 25% faster on writing and 50% faster on reading.
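
As a rough check on the 1 Gbps and "around three times" statements, the arithmetic can be sketched in Python using the largest file rows from the tables below. Whether the benchmark's MB means 1,000,000 or 1,048,576 bytes is not stated here, so both are shown as assumptions, and the figures are file payload rates only.

```python
from statistics import mean

# MBytes/Second from the largest "Larger Files" row of each table below.
PI_3B_PLUS_64MB = {"write": [32.33, 32.37, 32.35], "read": [26.94, 26.98, 26.7]}
PI_4B_256MB = {"write": [106.75, 105.43, 106.4], "read": [85.78, 108.99, 106.29]}


def to_mbits(mbytes_per_sec, bytes_per_mb=1_000_000):
    """Convert MBytes/second to Mbits/second (bytes_per_mb is an assumption)."""
    return mbytes_per_sec * bytes_per_mb * 8 / 1_000_000


def gain(new, old):
    """Ratio of mean speeds, new system over old."""
    return mean(new) / mean(old)


write_gain = gain(PI_4B_256MB["write"], PI_3B_PLUS_64MB["write"])
read_gain = gain(PI_4B_256MB["read"], PI_3B_PLUS_64MB["read"])
write_mbits = to_mbits(mean(PI_4B_256MB["write"]))
write_mbits_binary = to_mbits(mean(PI_4B_256MB["write"]), bytes_per_mb=1 << 20)

print(f"Pi 4B write {write_mbits:.0f} to {write_mbits_binary:.0f} Mbits/second")
print(f"Pi 4B / Pi 3B+ gains: write {write_gain:.2f}, read {read_gain:.2f}")
```

On these rows the Pi 4B writing speed works out at roughly 850 to 890 Mbits/second, with gains over the Pi 3B+ of a little over three times.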

 ************************ Pi 3B+ ************************
 
                       MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   31.17   31.62   31.61    13.5   26.19   26.38
      16   31.62   31.89   31.76    26.7   26.94   27.01

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     0.007    1.09   0.688    1.16    1.04    1.08

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     1.15    2.26    4.18    1.73    3.18    5.66
 ms/file    3.57    3.62    3.92    2.36    2.58    2.89   0.511

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32   31.99   31.61   32.13   21.39   27.09   26.87
      64   32.33   32.37   32.35   26.94   26.98    26.7

 ************************ Pi 4B ************************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   67.82   12.97   90.19   99.84   93.49   96.83
      16   92.25   92.66   92.96   103.9  105.28   91.17

Random     Read                    Write
From MB        4       8      16       4       8      16
msecs      0.007    0.01    0.04    1.01    0.85    0.91

200 Files  Write                   Read                  Delete
File KB        4       8      16       4       8      16  secs
MB/sec      1.47     2.8    5.14    2.47    4.71    8.61
ms/file     2.78    2.92    3.19    1.66    1.74     1.9   0.256

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32    78.2   34.46   80.71   84.94   87.11   84.97
      64   88.18   87.52   87.03  111.34  109.58  107.28

     128   98.84   99.24   96.58  110.99  110.57   87.43
     256  106.75  105.43   106.4   85.78  108.99  106.29
  


USB Benchmarks - DriveSpeed

Following are DriveSpeed results on Pi 3B+ and 4B, using the same high speed USB 3 stick (SanDisk Extreme with write/read ratings of 110/190 MB/s and 16 KB sectors). Other sticks would probably provide different comparative performance.

On the largest files shown, Pi 4B performance gains were 2.2 times on writing and 5.3 times on reading. Unlike LanSpeed, DriveSpeed uses direct I/O, leading to an extra entry for cached files, where reading is mainly influenced by RAM speeds. Results can be too variable to provide meaningful comparisons.

Random access speeds were quite similar. On small files, reading speed was indicated as up to five times faster on the Pi 4B, but the 3B+ appeared to be nearly 30 times faster on writing.
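
The large and small file comparisons quoted above can be recomputed from the SanDisk Extreme tables below. The following Python sketch assumes the gains are ratios of the means of the three passes on the largest file row, which is one plausible reading rather than a method stated in the source.

```python
from statistics import mean

# MBytes/Second, largest file row of each SanDisk Extreme table below.
LARGE_3B_PLUS = {"write": [27.30, 27.07, 27.32], "read": [31.32, 31.26, 31.26]}   # 2000 MB
LARGE_4B = {"write": [59.56, 61.46, 60.54], "read": [161.69, 165.97, 166.49]}     # 1000 MB

# 200 small files: write time in ms/file and read speed in MB/sec (4, 8, 16 KB).
SMALL_WRITE_MS = {"3B+": [3.71, 3.86, 4.28], "4B": [110.04, 109.97, 110.01]}
SMALL_READ_MBPS = {"3B+": [6.04, 9.17, 14.01], "4B": [20.64, 41.04, 70.01]}

large_write_gain = mean(LARGE_4B["write"]) / mean(LARGE_3B_PLUS["write"])
large_read_gain = mean(LARGE_4B["read"]) / mean(LARGE_3B_PLUS["read"])

# Per file size: how many times faster the Pi 3B+ wrote, and the Pi 4B read.
small_write_3b_advantage = [
    t4 / t3 for t3, t4 in zip(SMALL_WRITE_MS["3B+"], SMALL_WRITE_MS["4B"])
]
small_read_4b_gain = [
    r4 / r3 for r3, r4 in zip(SMALL_READ_MBPS["3B+"], SMALL_READ_MBPS["4B"])
]

print(f"Large files, Pi 4B gain: write {large_write_gain:.1f}, read {large_read_gain:.1f}")
print("Small files, Pi 3B+ write advantage:", [f"{x:.1f}" for x in small_write_3b_advantage])
print("Small files, Pi 4B read gain:       ", [f"{x:.1f}" for x in small_read_4b_gain])
```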

For the Pi 4B, additional large file results are included for a Patriot Rage 2 USB 3 stick, rated as reading at up to 400 MB/second, with nearly 300 MB/second demonstrated using a Windows version of DriveSpeed. In this case, it appeared to be slightly slower than the first one on reading, but faster on writing, at 80 MB/second. This second drive also obtained those painfully slow speeds on writing small files.

   ********************* Pi 3B+ USB 2 ********************

   DriveSpeed RasPi 1.1 Wed Apr 24 22:09:09 2019

 /media/pi/REMIX_OS/
 Total MB    9017, Free MB    7486, Used MB    1531

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   27.71   27.35   27.13   30.72    30.9   31.31
      16   27.21   27.54   23.69   29.89   31.34   31.27
Cached
       8   52.24   59.57   46.88  333.08  741.57  780.68

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.403   0.403   0.404    0.74    0.85    0.59

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      1.10    2.12    3.82    6.04    9.17   14.01
ms/file     3.71    3.86    4.28    0.68    0.89    1.17   0.123

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

    1000   27.25   27.25   27.19   31.23   31.27   31.27
    2000   27.30   27.07   27.32   31.32   31.26   31.26

 ********************* Pi 4B USB 3 *********************

   DriveSpeed RasPi 1.1 Fri Apr 26 17:21:56 2019

 /media/pi/REMIXOSSYS//
 Total MB    5108, Free MB    3982, Used MB    1126

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   33.28   32.27   32.28  161.34  162.25  163.85
      16   39.85   41.95   43.02  164.07  165.53  165.84
Cached
       8   33.32   34.96   34.96  593.94  582.25  589.22

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.383   0.372   0.371    0.77    0.83    0.63

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      0.04    0.07    0.15   20.64   41.04   70.01
ms/file   110.04  109.97  110.01    0.20    0.20    0.23   0.089

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

     500   56.36   58.13   55.25  166.31  165.46  165.43
    1000   59.56   61.46   60.54  161.69  165.97  166.49

 /media/pi/PATRIOT/
 Total MB  120832, Free MB  120832, Used MB       0

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

    1000   80.87   80.23   81.92  131.41  130.72  130.39
    2000   83.67   81.82   82.14  130.85  131.29  131.36
  



Pi 4B Main Drive Benchmark - DriveSpeed

This shows DriveSpeed performance measured on the main drive which, in this case, was nowhere near USB 3 speeds.
  
   DriveSpeed RasPi 1.1 Mon Apr 29 10:20:57 2019

 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks/DriveSpeed/drive1
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   16.41   11.21   12.27   39.81   40.10   40.39
      16   11.79   21.10   34.05   40.18   40.19   40.33
Cached
       8  137.47  156.43  285.59  580.73  598.66  587.97

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.371   0.371   0.363    1.28    1.53    1.30

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      3.49    6.41    8.26    7.67   11.68   17.51
ms/file     1.17    1.28    1.98    0.53    0.70    0.94   0.014
  


Java Whetstone Benchmark - whetstc.class

The Java benchmarks were run after installing Oracle Java 8, then OpenJDK11 later.

Pi 4B performance was nearly as good as the compiled C version. However, there can be wide variations involving new Java versions. Here, the Pi 3B+ overall MWIPS rating was particularly slow, entirely due to the time taken by the sin, cos and exp, sqrt tests. Other than these, Pi 4B gains ranged from almost none on assignments to just over three times on the if then else and N6 floating point tests.

 ************************ Pi 3B+ ************************

      Whetstone Benchmark Java Version, May 14 2019, 15:02:11

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    215.20             0.0892
  N2 floating point  -1.131330490    208.76             0.6438
  N3 if then else     1.000000000             103.58    0.9992
  N4 fixed point     12.000000000             538.09    0.5854
  N5 sin,cos etc.     0.499110103               7.04   11.8100
  N6 floating point   0.999999821    106.22             5.0780
  N7 assignments      3.000000000             322.85    0.5724
  N8 exp,sqrt etc.    0.751108646               1.38   26.9200

  MWIPS                              214.14            46.6980

  Operating System    Linux, Arch. arm, Version 4.14.70-v7+
  Java Vendor         Oracle Corporation, Version  1.8.0_212
 

 ************************ Pi 4B ************************

     Whetstone Benchmark Java Version, May 14 2019, 14:16:44

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    503.94             0.0381
  N2 floating point  -1.131330490    488.37             0.2752
  N3 if then else     1.000000000             332.80    0.3110
  N4 fixed point     12.000000000             881.37    0.3574
  N5 sin,cos etc.     0.499110132              42.92    1.9384
  N6 floating point   0.999999821    345.77             1.5600
  N7 assignments      3.000000000             332.97    0.5550
  N8 exp,sqrt etc.    0.825148463              25.00    1.4880

  MWIPS                             1533.01             6.5231

  Operating System    Linux, Arch. arm, Version 4.19.29-v7l+
  Java Vendor         Oracle Corporation, Version  1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

  Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900
  N6 floating point   0.999999821    345.95             1.5592
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft
   



JavaDraw Benchmark - JavaDrawPi.class

The benchmark uses from a small to a rather excessive number of simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load. In order for this to run at maximum speed, it was necessary to disable the experimental GL driver.

Pi 4B performance gains were best on the most complex test function.

A later version was produced and run via OpenJDK11.
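
The relationship between the Frames and FPS columns in the tables below can be checked with a short Python sketch. It assumes FPS is simply frames drawn divided by elapsed seconds for each test, which is inferred from the figures rather than stated in the source.

```python
# (frames, FPS) pairs copied from the Pi 3B+ table below.
PI_3B_PLUS = [
    (566, 56.55), (651, 65.00), (665, 66.45),
    (660, 65.93), (442, 44.16), (334, 33.30),
]


def seconds_per_test(results):
    """Elapsed seconds implied by each (frames, FPS) pair."""
    return [frames / fps for frames, fps in results]


secs = seconds_per_test(PI_3B_PLUS)
total = sum(secs)

for (frames, fps), s in zip(PI_3B_PLUS, secs):
    print(f"{frames:5d} frames at {fps:6.2f} FPS = {s:5.2f} seconds")
print(f"Total {total:.1f} seconds")
```

Each test works out at about 10 seconds, and the six together come close to the reported total elapsed time of 60.1 seconds.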

 ************************ Pi 3B+ ************************

   Java Drawing Benchmark, May 14 2019, 15:32:06
            Produced by javac 1.6.0_27

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      566    56.55
  Display PNG Bitmap Twice Pass 2      651    65.00
  Plus 2 SweepGradient Circles         665    66.45
  Plus 200 Random Small Circles        660    65.93
  Plus 320 Long Lines                  442    44.16
  Plus 4000 Random Small Circles       334    33.30

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.14.70-v7+
  Java Vendor         Oracle Corporation, Version  1.8.0_212

 ************************ Pi 4B ************************

   Java Drawing Benchmark, May 14 2019, 14:33:58
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      791    79.05
  Display PNG Bitmap Twice Pass 2      932    93.11
  Plus 2 SweepGradient Circles        1152   115.17
  Plus 200 Random Small Circles       1200   119.98
  Plus 320 Long Lines                  784    78.31
  Plus 4000 Random Small Circles       621    62.03

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.29-v7l+
  Java Vendor         Oracle Corporation, Version  1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


  



OpenGL GLUT Benchmark - videogl32

In 2012, I approved a request from a Quality Engineer at Canonical to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark or, using selected functions, as a stress test of any duration.

After installing freeglut3, the benchmark ran as before. The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file:

  export vblank_mode=0                                     
  ./videogl32 Width 320, Height 240, NoEnd                 
  ./videogl32 Width 640, Height 480, NoHeading, NoEnd      
  ./videogl32 Width 1024, Height 768, NoHeading, NoEnd     
  ./videogl32 NoHeading                                    
  
Following are results from the Pi 3B+ and Pi 4B. The early tests depend on graphics speed, with the later ones becoming CPU speed dependent.
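
One way to see the shift from graphics to CPU dependence is to compare how much each test slows down between the smallest and largest windows. The Python sketch below uses the Pi 4B figures from the table that follows. Treating a slowdown near 1.0 as a sign of CPU dependence is an interpretation made for this example, not a measurement from the benchmark.

```python
TESTS = [
    "Coloured Few", "Coloured All", "Textured Few",
    "Textured All", "WireFrm Kitchen", "Texture Kitchen",
]

# FPS copied from the Pi 4B table below, keyed by (width, height).
PI_4B_FPS = {
    (320, 240): [766.7, 371.4, 230.6, 130.2, 32.5, 22.7],
    (1920, 1080): [81.4, 79.4, 74.6, 68.3, 30.8, 20.0],
}

small, large = (320, 240), (1920, 1080)
pixel_ratio = (large[0] * large[1]) / (small[0] * small[1])

# Slowdown = FPS in the small window divided by FPS in the large window.
slowdown = {
    name: fast / slow
    for name, fast, slow in zip(TESTS, PI_4B_FPS[small], PI_4B_FPS[large])
}

print(f"Pixels per frame increase by {pixel_ratio:.0f} times")
for name, ratio in slowdown.items():
    print(f"{name:16s} slows down by {ratio:5.2f} times")
```

The tunnel tests slow down considerably as the window grows, while the two kitchen tests barely change, which fits the description above.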

 ************************ Pi 3B+ ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Fri Apr 12 22:21:35 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    343.8    208.3     88.4     56.6     24.3     15.5
   640   480    243.0    170.3     82.8     54.5     24.2     15.5
  1024   768    110.6    101.2     63.6     47.8     24.1     15.4
  1920  1080     49.5     47.3     36.8     32.9     23.4     14.9

 ************************ Pi 4B ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May  2 19:01:05 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0
  


Stress Tests - MP-IntStress, MP-FPUStress, MP-FPUStressDP

A series of stress tests have also been run on the Raspberry Pi 4B and these will be covered in a later report. They have command line parameters for running time, data size, number of threads, log number and complexity of calculations. In default mode, combinations of these are used to indicate relative performance, providing useful benchmarks. Following are Pi 3B+ and Pi 4B results.

 ************************ Pi 3B+ ************************

 MP-Integer-Test 32 Bit v1.0 Fri Jun 21 15:09:22 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   9.4    1   3497  3284  1813  00000000    Yes
   6.3    2   6994  6505  2123  FFFFFFFF    Yes
   5.6    4  13839 12528  1882  5A5A5A5A    Yes
   5.6    8  13723 13780  1872  AAAAAAAA    Yes
   5.6   16  13734 14049  1857  CCCCCCCC    Yes
   5.6   32  13499 13881  1879  0F0F0F0F    Yes

 ************************ Pi 4B ************************
  MP-Integer-Test 32 Bit v1.0 Fri Jun 21 15:39:57 2019

   4.9    1   5956  5754  3977  00000000    Yes
   3.6    2  11861 11429  3763  FFFFFFFF    Yes
   3.1    4  22998 21799  3464  5A5A5A5A    Yes
   3.1    8  22695 21128  3490  AAAAAAAA    Yes
   3.1   16  22835 23491  3485  CCCCCCCC    Yes
   3.0   32  22593 23485  3591  0F0F0F0F    Yes

  Average Gains Caches 1.68, RAM 1.91
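
The average gains quoted for the integer test are consistent with a simple mean of the Pi 4B / Pi 3B+ speed ratios. The Python sketch below assumes that "Caches" covers the 16 KB and 160 KB columns and "RAM" the 16 MB column. That grouping, and the function name, are assumptions for illustration and not a description of how the figures were actually produced.

```python
from statistics import mean

# MB/second columns (16 KB, 160 KB, 16 MB) for 1, 2, 4, 8, 16 and 32 threads.
PI_3B_PLUS = [
    (3497, 3284, 1813), (6994, 6505, 2123), (13839, 12528, 1882),
    (13723, 13780, 1872), (13734, 14049, 1857), (13499, 13881, 1879),
]
PI_4B = [
    (5956, 5754, 3977), (11861, 11429, 3763), (22998, 21799, 3464),
    (22695, 21128, 3490), (22835, 23491, 3485), (22593, 23485, 3591),
]


def average_gains(old_rows, new_rows):
    """Return (cache gain, RAM gain) as means of per-cell speed ratios."""
    cache, ram = [], []
    for old, new in zip(old_rows, new_rows):
        cache += [new[0] / old[0], new[1] / old[1]]   # 16 KB and 160 KB columns
        ram.append(new[2] / old[2])                   # 16 MB column
    return mean(cache), mean(ram)


cache_gain, ram_gain = average_gains(PI_3B_PLUS, PI_4B)
print(f"Average Gains Caches {cache_gain:.2f}, RAM {ram_gain:.2f}")
```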

 ************************ Pi 3B+ ************************

  MP-Threaded-MFLOPS 32 Bit v1.0 Fri Jun 21 15:10:28 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.1    T1   2   857   849   414   40392  76406  99700
   5.5    T2   2  1661  1678   411   40392  76406  99700
   7.4    T4   2  3086  3336   413   40392  76406  99700
   9.4    T8   2  3194  3168   414   40392  76406  99700
  13.8    T1   8  1942  1935  1495   54756  85091  99820
  16.7    T2   8  3756  3824  1659   54756  85091  99820
  19.0    T4   8  7209  7528  1643   54756  85091  99820
  21.3    T8   8  6978  7341  1657   54756  85091  99820
  36.8    T1  32  2019  2050  1915   35296  66020  99519
  44.6    T2  32  4078  4031  3757   35296  66020  99519
  48.9    T4  32  7927  7910  6095   35296  66020  99519
  53.1    T8  32  7919  8141  6336   35296  66020  99519

 ************************ Pi 4B ************************

  MP-Threaded-MFLOPS 32 Bit v1.0 Sun May 26 21:23:49 2019

   1.6    T1   2  2134  2607   656   40392  76406  99700
   2.9    T2   2  5048  5156   621   40392  76406  99700
   4.0    T4   2  7536  9939   681   40392  76406  99700
   5.2    T8   2  7934  9839   639   40392  76406  99700
   7.2    T1   8  5535  5420  2569   54756  85091  99820
   8.7    T2   8 10757 10732  2454   54756  85091  99820
  10.1    T4   8 18108 20703  2444   54756  85091  99820
  11.5    T8   8 19236 20286  2245   54756  85091  99820
  17.4    T1  32  5309  5270  5262   35296  66020  99519
  20.4    T2  32 10551 10528  9753   35296  66020  99519
  22.4    T4  32 20120 20886 11064   35296  66020  99519
  24.5    T8  32 19415 20464  9929   35296  66020  99519

  Average Gains Caches 2.72, RAM 1.75

 
Stress Tests Continued Below
 ************************ Pi 3B+ ************************

  MP-Threaded-MFLOPS 32 Bit v1.0 Fri Jun 21 15:11:41 2019

      Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   9.7    T1   2   215   213   173   40395  76384  99700
  15.9    T2   2   420   426   206   40395  76384  99700
  20.6    T4   2   819   830   205   40395  76384  99700
  25.3    T8   2   807   823   205   40395  76384  99700
  41.4    T1   8   508   502   437   54805  85108  99820
  49.8    T2   8  1002  1008   778   54805  85108  99820
  55.8    T4   8  1985  1955   768   54805  85108  99820
  61.6    T8   8  1974  1958   817   54805  85108  99820
 100.5    T1  32   799   794   775   35159  66065  99521
 120.1    T2  32  1595  1588  1533   35159  66065  99521
 130.5    T4  32  3115  3087  2731   35159  66065  99521
 140.7    T8  32  3154  3126  2821   35159  66065  99521

 ************************ Pi 4B ************************

  MP-Threaded-MFLOPS 32 Bit v1.0 Sun May 26 21:26:37 2019

      Double Precision Benchmark 1, 2, 4 and 8 Threads

   3.4    T1   2   921   998   326   40395  76384  99700
   6.1    T2   2  1968  1995   308   40395  76384  99700
   8.4    T4   2  3465  3925   342   40395  76384  99700
  10.9    T8   2  3646  3702   301   40395  76384  99700
  15.1    T1   8  2377  2446  1283   54805  85108  99820
  18.1    T2   8  4916  4860  1326   54805  85108  99820
  20.5    T4   8  9202  9510  1391   54805  85108  99820
  23.1    T8   8  9090  9006  1298   54805  85108  99820
  34.5    T1  32  2695  2725  2707   35159  66065  99521
  40.3    T2  32  5416  5441  5121   35159  66065  99521
  44.1    T4  32 10666 10831  5275   35159  66065  99521
  48.3    T8  32 10427 10602  4832   35159  66065  99521

  Average Gains Caches 4.23, RAM 2.09