Raspberry Pi 4B 32 Bit Benchmarks

Roy Longbottom

Summary	Introduction	Benchmark Results
Whetstone Benchmark	Dhrystone Benchmark	Linpack 100 Benchmark
Livermore Loops Benchmark	FFT Benchmarks	BusSpeed Benchmark
MemSpeed Benchmark	NeonSpeed Benchmark	MultiThreading Benchmarks
MP-Whetstone Benchmark	MP-Dhrystone Benchmark	MP NEON Linpack Benchmark
MP-BusSpeed Benchmark	MP-RandMem Benchmark	MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks	Floating Point Assembly Code	OpenMP-MemSpeed Benchmarks
I/O Benchmarks	WiFi Benchmark	LAN Benchmark
USB 2 and 3 Benchmarks	Pi 4 Main Drive benchmark	Java Whetstone Benchmark
JavaDraw Benchmark	OpenGL Benchmark	Stress Tests

Summary

I have previously run my 32 bit and 64 bit benchmarks on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in a PDF file. This report contains brief reminders of the benchmarks, with 32 bit results on the new Raspberry Pi 4 using the Raspbian Buster operating system. The early opportunity to run the programs arose because I accepted a request to become a volunteer consultant, exercising the system prior to launch. Existing benchmarks were used to provide comparisons with the older 3B+ model. The benchmarks were also recompiled using gcc 8, which came with Buster, to provide further comparisons. The benchmarks and results are summarised as follows.

Single Core CPU Tests - comprising Whetstone, Dhrystone, Linpack and Livermore Loops Classic Benchmarks. Compared with a Pi 4B/Pi 3B+ CPU MHz ratio of 1.07, the overall performance gains for these four programs increased to around 1.8, 2.0, 4.0 and 2.8 times, with some further improvements between 1.05 and 1.26 from gcc 8 compilations.

Single Core Memory Benchmarks - measuring performance using data from caches and RAM. These include eight different measurements of FFTs, at 11 increasing sizes, with average Pi 4B speed gains of 3.26 times. BusSpeed was intended to identify maximum reading speeds, where there was not much difference from L1 cache, some gain via L2 cache and 80% from RAM, increasing by a further 25% using the gcc 8 compilation. MemSpeed and NeonSpeed carry out floating point and integer calculations, providing Pi 4B speed gains at all levels, best with double precision floating point calculations at greater than five times.

Multithreading Benchmarks - Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. The first are for Whetstone, Dhrystone and Linpack benchmarks, providing similar Pi 4B gains as the single core versions, with only Whetstones providing effective four core performance.

Various multithreaded and OpenMP cache/RAM benchmarks were run, mainly demonstrating the sort of code that is good and bad for efficient MP utilisation. Most demonstrated appropriate single core Pi 4B performance gains, but some other relationships were totally confusing.

Finally, a number of benchmarks attempt to measure maximum MFLOPS floating point speed, using the same series of calculations, with variants covering single and double precision (SP and DP), vector intrinsic functions and OpenMP. Best DP performance was 10.4 GFLOPS with SP at 19.9 GFLOPS. Highest Pi 4B/Pi 3B+ gains were 6.69 times DP and 5.15 times SP. The gcc 8 compilations provided some improvement in speed.

Java and OpenGL Benchmarks - A Java Whetstone benchmark is provided, along with one using JavaDraw procedures. Test functions of the former were more than twice as fast on the Pi 4B as on the 3B+. Gains were similar for the more demanding JavaDraw tests and for many of the 25 OpenGL test routines. Initially Oracle Java 8 was used, but later tests were run via OpenJDK 11.

Drive, LAN and WiFi Benchmarks - Variations of the same program are provided to benchmark internal and USB drives or LAN and WiFi connections, measuring performance using large files, small files and random access. Considering large files, the Pi 4B performance improvements shown were up to four times for LAN and over five times for USB 3, with similar scores using WiFi.

Stress Tests - These have also been run and will be covered in a later report. Default mode provides useful benchmarking information, as shown below. Pi 4B/Pi 3B+ performance ratios are shown to be up to 4.23 for cache based data and 2.09 using RAM.

Introduction

The Raspberry Pi 4B uses a quad core ARM Cortex-A72 CPU, with 32 KB L1 cache and shared 1 MB L2 cache. RAM is LPDDR4-3200 with 1, 2 or 4 GB options. Other enhancements are USB 3 connections and gigabit Ethernet.

I have run my benchmarks on the new system. More descriptions and earlier results can be found on ResearchGate in a PDF file. The early opportunity to run the programs arose because I accepted a request to become a volunteer consultant, exercising the system prior to launch.

The programs and source codes used are available for downloading in Raspberry-Pi-4-Benchmarks.tar.gz.

My most recent benchmarks were compiled for the Raspberry Pi 2, using gcc 4.8. I tried others later, but they did not seem to make much difference. I thought that using a Cortex A72 might, so I have compiled the programs using gcc 8. The first step was to change the functions used to identify the hardware, where the existing procedures replicate information for each core (even four lots were too much). I noted that the lscpu command now provides adequate detail, so I use this now. The Raspbian release is also provided. RPi 3B+ and RPi 4B details are as follows:

Pi 3B+

Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Model:                 4
Model name:            ARMv7 Processor rev 4 (v7l)
CPU max MHz:           1400.0000
CPU min MHz:           600.0000
BogoMIPS:              89.60
Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
                       idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2018-04-18



Pi 4B
 
Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Vendor ID:             ARM
Model:                 3
Model name:            Cortex-A72
Stepping:              r0p3
CPU max MHz:           1500.0000
CPU min MHz:           600.0000
BogoMIPS:              270.00
Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
                       idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-05-13
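
The listings above follow the "Key: value" layout printed by lscpu. As a minimal sketch (this is not the code used in the benchmarks, which are compiled C programs), the following Python function reads text in that layout into a dictionary and joins wrapped lines such as the second Flags line. The function name parse_lscpu and the shortened sample text are assumptions for illustration.

```python
def parse_lscpu(text):
    """Parse 'Key: value' lines, as printed by lscpu, into a dict.

    Indented lines are treated as continuations of the previous
    value (for example the wrapped Flags list). Unindented lines
    without a colon are ignored.
    """
    info = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            if last_key is not None:
                info[last_key] += " " + line.strip()
        elif ":" in line:
            key, value = line.split(":", 1)
            last_key = key.strip()
            info[last_key] = value.strip()
    return info


sample = """\
Architecture:          armv7l
CPU(s):                4
Model name:            Cortex-A72
CPU max MHz:           1500.0000
Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4
                       idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-05-13
"""

cpu = parse_lscpu(sample)
print(cpu["Model name"], float(cpu["CPU max MHz"]))
```

On a live system the text could be obtained with subprocess.run(["lscpu"], capture_output=True, text=True).stdout instead of the sample string.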

Benchmark Results

The following provide benchmark results with limited comments on Raspberry Pi 4B performance gains over Pi 3B+ and relative Pi 4B relationships between older ARM V7 and gcc 8 compilations. For the first few, ancient benchmarks, ARM V6 compilations are also compared.

Whetstone Benchmark - whetstonePiA6, whetstonePiA7, whetstonePiC8

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, lately those identified as COS and EXP. The last three tests (FIXPT, IF and EQUAL) can be over optimised, but their running time does not affect the overall rating much.

In this case, the overall MWIPS ratings provide valid comparisons, with the Pi 4B being between 1.76 and 1.87 times faster than the 3B+. The gcc 8 compilation provided no real improvement.


 System    MHz  MWIPS   ------MFLOPS------    ------------MOPS---------------
                        1      2      3       COS    EXP  FIXPT     IF  EQUAL
 ARM V6
 Pi 3B+   1400   1094    391    407    348   21.7   12.3   1740   2084   1391
 Pi 4B    1500   2048    520    473    389   53.8   27.1   2497   2245   2246
 4B/3B+   1.07   1.87   1.33   1.16   1.12   2.47   2.20   1.44   1.08   1.61
 
 ARM V7
 Pi 3B+   1400   1060    391    383    298   21.7   12.3   1740   2083   1392
 Pi 4B    1500   1884    516    478    310   54.7   27.1   2498   2247    999
 4B/3B+   1.07   1.78   1.32   1.25   1.04   2.52   2.21   1.44   1.08   0.72
 
 gcc 8
 Pi 3B+   1400   1063    393    373    300   21.8   12.3   1748   2097   1398
 Pi 4B    1500   1883    522    471    313   54.9   26.4   2496   3178    998
 4B/3B+   1.07   1.76   1.33   1.26   1.05   2.51   2.09   1.43   1.52   0.71
 
 gcc 8/V7
 Pi 4B    1.00   1.00   1.01   0.99   1.01   1.00   0.97   1.00   1.41   1.00
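
The 4B/3B+ rows are simple quotients of the two result rows. A minimal sketch using the ARM V6 figures from the table above follows. The function name ratio_row is illustrative. Ratios recomputed from the rounded table values can differ from the published ones in the last digit, presumably because the originals were calculated from unrounded results.

```python
def ratio_row(new, old, digits=2):
    """Element-wise new/old ratios, rounded for display."""
    return [round(n / o, digits) for n, o in zip(new, old)]


# MHz, MWIPS, MFLOPS 1-3, COS, EXP, FIXPT, IF, EQUAL (ARM V6 rows)
pi3b_plus = [1400, 1094, 391, 407, 348, 21.7, 12.3, 1740, 2084, 1391]
pi4b = [1500, 2048, 520, 473, 389, 53.8, 27.1, 2497, 2245, 2246]

ratios = ratio_row(pi4b, pi3b_plus)
print(ratios)

# Gain per clock: MWIPS ratio divided by the MHz ratio
per_clock = round((pi4b[1] / pi3b_plus[1]) / (pi4b[0] / pi3b_plus[0]), 2)
print(per_clock)
```

Dividing the MWIPS ratio by the MHz ratio separates the part of the gain that comes from the CPU design from the part due to the higher clock speed.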

Dhrystone Benchmark - dhrystonePiA6, dhrystonePiA7, dhrystonePiC8

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so you can’t compare results from different compilers. Ignoring this, results in VAX MIPS, aka DMIPS, and comparisons follow.

The Pi 4B was shown to be around twice as fast as the 3B+ and gcc 8 performance was similar to the ARM V7 compilation.


                                              Best
                ----- Compiler -----         DMIPS
 System     MHz  ARM V6 ARM V7  gcc 8  G8/V7  /MHz

 Pi 3B+    1400   2520   2825   2838   1.00   2.03
 Pi 4B     1500   5077   5366   5646   1.05   3.76

 4B/3B+    1.07   2.01   1.90   1.99          1.86
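
The DMIPS/MHz column divides the best DMIPS score by the clock speed. VAX MIPS ratings are conventionally obtained by dividing Dhrystones per second by 1757, the score attributed to the VAX 11/780. A sketch under those assumptions follows, with illustrative function names.

```python
VAX_11_780_DHRYSTONES_PER_SEC = 1757  # conventional reference for 1 DMIPS


def dmips(dhrystones_per_second):
    """Convert Dhrystones per second to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SEC


def dmips_per_mhz(dmips_score, mhz):
    """Normalise a DMIPS score by CPU clock speed."""
    return dmips_score / mhz


pi3b_plus = round(dmips_per_mhz(2838, 1400), 2)  # best Pi 3B+ score (gcc 8)
pi4b = round(dmips_per_mhz(5646, 1500), 2)       # best Pi 4B score (gcc 8)
print(pi3b_plus, pi4b)
```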

Linpack 100 Benchmark MFLOPS - linpackPiA6, linpackPiSP, linpackPiA7, linpackPiA7SP, linpackPiC8, linpackPiC8SP

This original Linpack benchmark is dependent on fused multiply and add instructions, but the overheads in the standard source code restrict processing speed. It seems that the latest hardware has been modified to execute this type of code more efficiently.

All measurements demonstrate that the Pi 4B was between 3.6 and 4.7 times faster than the Pi 3B+.


                      ARM V6        ARM V7        gcc 8       gcc8/ARMV7
 System    MHz      DP     SP     DP     SP     DP     SP     DP     SP

 Pi 3B+    1400  206.0  220.2  210.5  225.2  224.8  227.3   1.00   1.01
 Pi 4B     1500  764.7  880.6  760.2  921.6  957.1 1068.8   1.04   1.12

 4B/3B+    1.07   3.71   4.00   3.61   4.09   4.26   4.70

Livermore Loops Benchmark MFLOPS - liverloopsPiA7, liverloopsPiC8

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, where the Cray 1 supercomputer was rated at 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores.

Based on Geomean results, Pi 4B is shown as being 2.36 times faster than the 3B+, and even more so via the gcc 8 compilation, where the gcc8/V7 performance ratio identified is 1.31.

 MFLOPS for 24 loops
 
 Pi 3B+
   225   266   465   394   147   196   411   449   408   207   155    87
   100   125   263   258   359   335   236   248   133    93   339   199

 Pi 4B
   746   964   988   943   212   538  1169  1800  1032   469   214   186
   159   335   778   623   732  1034   320   350   489   360   749   187

 Pi 3B+ gcc 8
   330   262   459   407   231   198   538   542   462   247   174   198
   122   123   281   240   394   325   275   294   213    94   354   198

 Pi 4B gcc 8
  1480  1017   974   930   383   657  1624  1861  1664   617   498   741
   221   320   803   640   737  1003   451   378  1047   411   763   187

Comparisons

 System    MHz   Maximum Average Geomean Harmean Minimum

 ARM V7
 Pi 3B+   1400     464.8   246.7   220.1   193.9    78.3
 Pi 4B    1500    1800.2   635.1   519.0   416.1   155.3
 4B/3B+   1.07      3.87    2.57    2.36    2.15    1.98

 gcc 8
 Pi 3B+   1400     541.7   283.4   257.4   231.5    92.7
 Pi 4B    1500    1860.8   800.4   679.0   564.1   179.5
 4B/3B+   1.07      3.40    2.80    2.61    2.41    1.90
                                                
 g8/V7    1.00      1.03    1.26    1.31    1.36    1.16
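
The comparison columns use the usual arithmetic, geometric and harmonic means. A sketch of these definitions follows. The published summary figures come from the benchmark's own output and need not be reproduced exactly by applying these functions to the 24 rounded values listed above, so the data is used here only to illustrate the calculations. Function names are illustrative.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    # exp of the mean logarithm; all values must be positive
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    return len(values) / sum(1.0 / v for v in values)


# Pi 3B+ ARM V7 MFLOPS for the 24 loops, as listed above
pi3b_plus = [225, 266, 465, 394, 147, 196, 411, 449, 408, 207, 155, 87,
             100, 125, 263, 258, 359, 335, 236, 248, 133, 93, 339, 199]

for name, fn in [("Maximum", max), ("Average", arithmetic_mean),
                 ("Geomean", geometric_mean), ("Harmean", harmonic_mean),
                 ("Minimum", min)]:
    print(f"{name:8s} {fn(pi3b_plus):7.1f}")
```

The geometric mean is less dominated by a few very fast kernels than the arithmetic average, and the harmonic mean even less so, which is why the three always appear in that descending order.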

Fast Fourier Transforms Benchmarks - fft1-RPi2, fft3c-Rpi2, fft1PiC8, fft3cPiC8

This is a real application provided by my collaborator at the CompuServe Forum. There are two benchmarks. The first is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements are made using both single and double precision data, calculating FFT sizes between 1K and 1024K.

Following are average running times from the three passes, then RPi 4B performance gains (fewer milliseconds). Most of the gains for the optimised version were greater than three times, as were many from the original benchmark. Most gcc 8 running times on the Pi 4B were slightly faster than those produced by the older version.


                              Time in milliseconds

               Raspberry Pi 3B+ FFT 1        Raspberry Pi 3B+ FFT 3     

          ARM V7           gcc 8          ARM V7           gcc 8        
   Size
      K       SP      DP      SP      DP      SP      DP      SP      DP

      1     0.14    0.14    0.16    0.17    0.18    0.14    0.15    0.14
      2     0.31    0.36    0.35    0.48    0.39    0.32    0.33    0.32
      4     0.78    0.92    0.91    1.32    1.05    0.77    0.78    0.75
      8     1.92    2.17    3.02    3.36    2.14    1.76    1.84    1.76
     16     4.67    5.28    5.09    5.99    4.71    5.46    4.27    4.89
     32    10.95   20.57   12.31   20.62   10.71   15.03    9.55   13.65
     64    34.54  128.96   37.33  130.93   28.94   36.78   26.09   33.23
    128   246.04  308.67  254.23  320.44   70.03   84.44   64.74   76.98
    256   586.84  638.88  620.49  734.14  157.29  196.35  145.14  180.66
    512  1232.41 1374.18 1235.39 1447.85  363.61  434.28  336.57  405.09
   1024  2759.71 2993.38 2779.37 3094.66  806.78  975.33  736.46  912.78


   Size       Raspberry Pi 4B FFT 1           Raspberry Pi 4B FFT 3     
      K                                                                 
      1     0.04    0.04    0.04    0.04    0.06    0.05    0.05    0.04
      2     0.08    0.12    0.08    0.13    0.13    0.11    0.10    0.10
      4     0.32    0.37    0.29    0.34    0.27    0.24    0.24    0.23
      8     0.77    0.97    0.79    0.82    0.58    0.55    0.57    0.51
     16     1.69    2.01    1.65    1.85    1.49    1.35    1.32    1.19
     32     4.37    4.89    3.76    4.71    2.96    3.63    2.69    3.30
     64     9.12   26.55    8.82   30.64    7.46   10.75    6.60    9.47
    128    55.52  160.11   58.54  132.41   17.93   26.03   16.92   23.85
    256   305.92  423.06  275.44  373.12   41.16   55.06   37.61   55.97
    512   833.10  854.88  780.89  751.27   86.93  120.53   81.54  128.13
   1024  1617.49 1875.52 1578.70 1812.20  190.28  266.60  186.45  288.27

   Size             RPi 4B Gains (>1.0 4B running time is less)         
      K                                                                 
      1     3.45    3.46    4.02    3.94    3.06    2.66    2.88    3.45
      2     3.79    3.14    4.27    3.84    3.10    2.93    3.28    3.29
      4     2.46    2.50    3.19    3.84    3.86    3.23    3.24    3.22
      8     2.51    2.24    3.82    4.12    3.67    3.18    3.21    3.44
     16     2.76    2.62    3.08    3.23    3.17    4.06    3.25    4.10
     32     2.51    4.21    3.27    4.38    3.62    4.14    3.55    4.13
     64     3.79    4.86    4.23    4.27    3.88    3.42    3.95    3.51
    128     4.43    1.93    4.34    2.42    3.91    3.24    3.83    3.23
    256     1.92    1.51    2.25    1.97    3.82    3.57    3.86    3.23
    512     1.48    1.61    1.58    1.93    4.18    3.60    4.13    3.16
   1024     1.71    1.60    1.76    1.71    4.24    3.66    3.95    3.17

BusSpeed Benchmark - busspeedPiA7, busspeedPiC8

This is a read only benchmark with data from caches and RAM. The program first reads one word in every 32, skipping the words in between, then repeats with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds.

The speed via these increments can vary considerably, so comparisons are provided for the Read All column. Both the Pi 4B hardware and the gcc 8 compilation contribute to the performance gains of the new system, particularly the highest ratio of 2.81 at 1024 KBytes, where the larger L2 cache has an impact.


Pi 3B+ ARM V7 

  BusSpeed vfpv4 32b V1 Fri Apr 12 21:39:00 2019
 
    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3885   4365   4755   5013   5078   5118
      32   1688   1765   2513   3489   4279   4737
      64    716    720   1315   2268   3399   4147
     128    665    668   1206   2137   3281   4085
     256    632    635   1160   2053   3195   4032
     512    268    277    550   1058   1925   3088
    1024    140    153    296    581   1115   2199
    4096    120    131    257    498   1001   1777
   16384    126    132    256    496    991   1677
   65536    128    132    256    491    991   1950
 
                    Pi 4B  ARM V7                        
 
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read       
  KBytes  Words  Words  Words  Words  Words    All   Gain
  
      16   3836   4049   4467   5885   4641   5858   1.14
      32    761   1473   2594   3216   3960   4780   1.01
      64    409    801   1684   2422   3745   3940   0.95
     128    406    803   1202   1914   3037   5377   1.32
     256    415    700   1165   2481   4789   5137   1.27
     512    392    760   1243   2455   3764   4264   1.38
    1024    230    256    623   1061   2455   3501   1.59
    4096    197    214    454    938   1852   3195   1.80
   16384    138    215    445    897   1724   3210   1.91
   65536    174    215    398    744   1655   3130   1.61

Pi 3B+ gcc 8 

   BusSpeed vfpv4 32b gcc 8 Wed May 15 09:51:20 2019
 
    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3833   4346   4729   5002   5046   5069
      32   2435   2532   3152   4860   4949   4999
      64    696    705   1313   2213   3278   3983
     128    651    662   1227   2077   3207   3950
     256    620    630   1183   2007   3152   3925
     512    481    503    955   1641   2618   3318
    1024    133    145    286    506   1012   1694
    4096    117    130    249    453    915   1476
   16384    124    129    247    455    910   1415
   65536    124    108    251    453    905   1445
 
                                                    
                      Pi 4B  gcc 8                              

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  Pi 4B  gcc 8
  KBytes  Words  Words  Words  Words  Words    All   Gain   Gain

      16   4880   5075   5612   5852   5877   5864   1.16   1.00
      32    846   1138   2153   3229   4908   5300   0.99   1.11
      64    746   1019   2035   3027   4910   5360   1.50   1.36
     128    728    983   1952   2908   4888   5389   1.52   1.00
     256    683    934   1901   2794   4874   5431   1.55   1.06
     512    656    900   1760   2625   4585   5259   1.75   1.23
    1024    301    410    870   1356   2846   4238   2.81   1.21
    4096    233    248    531    996   2151   4045   2.35   1.27
   16384    236    258    511    891   2143   4011   2.35   1.25
   65536    237    257    508    881   2172   4015   2.40   1.28
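
The pattern in the increment columns can be related to burst reading with a simple model. The sketch below assumes that data arrives in 64 byte bursts (16 four byte words, the cache line size commonly quoted for these Cortex-A cores). This is an assumption and not something stated in the results. It counts how many bursts are fetched per word actually read, for each increment. The function name is illustrative.

```python
WORD_BYTES = 4
LINE_BYTES = 64  # assumed burst / cache line size
WORDS_PER_LINE = LINE_BYTES // WORD_BYTES


def lines_per_word_read(increment, total_words=1 << 16):
    """Cache lines touched per word read when stepping by `increment` words."""
    indices = range(0, total_words, increment)
    lines = {i // WORDS_PER_LINE for i in indices}
    return len(lines) / len(indices)


for inc in (32, 16, 8, 4, 2, 1):
    print(f"Inc{inc:<3d} {lines_per_word_read(inc):.4f} lines per word")
```

Under this model each word read at Inc32 or Inc16 costs a whole burst, which is consistent with the near identical Inc32 and Inc16 speeds from RAM in the Pi 3B+ tables and the approximate doubling at each step from Inc16 to Inc2. Multiplying the Inc16 RAM speed by 16 would then give a rough bus speed estimate, for example 132 x 16 = 2112 MBytes/second for the Pi 3B+ at 16384 KBytes.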

MemSpeed Benchmark MB/Second - memspeedPiA7, memspeedPiC8

This includes CPU speed dependent calculations using data from caches and RAM. The calculations are shown in the results column titles. Following are full Pi 3B+ and 4B results from running the original and gcc 8 recompiled versions, plus full Pi4B/3B+ and old/gcc 8 comparisons.

Using the original ARM V7 versions, the Pi 4B is indicated as faster on all test functions, with the best case on double precision calculations using cached data, being between three and six times faster. Similar gains are also shown in the gcc 8 comparisons. Then, gcc8/V7 compiler comparisons show gains with floating point but the old compiler producing some faster speeds using integers. Maximum MFLOPS performance is shown for the calculations in the first two columns, rising from 237 DP and 532 SP on the 3B+ to 1485 DP and 2740 SP on the Pi 4B using gcc 8, improvements of 6.27 times DP and 5.15 times SP.
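
The Max MFLOPS figures in this section appear to follow from the MB/second results for x[m]=x[m]+s*y[m]. Each element needs two floating point operations (one multiply and one add), and the reported figures are consistent with counting two operands read per element (16 bytes DP, 8 bytes SP). This relationship is inferred from the published numbers rather than taken from the benchmark source, so the sketch below is an assumption, with illustrative names.

```python
def mflops_from_mbytes(mb_per_second, bytes_per_word,
                       flops_per_element=2, words_per_element=2):
    """Convert a MemSpeed style MB/second figure to MFLOPS (inferred model)."""
    bytes_per_element = bytes_per_word * words_per_element
    return mb_per_second * flops_per_element / bytes_per_element


sp_3b_v7 = round(mflops_from_mbytes(2129, 4))     # Pi 3B+ ARM V7, SP, 16 KB
dp_4b_v7 = round(mflops_from_mbytes(8459, 8))     # Pi 4B ARM V7, DP, 8 KB
dp_4b_gcc8 = round(mflops_from_mbytes(11880, 8))  # Pi 4B gcc 8, DP, 16 KB
print(sp_3b_v7, dp_4b_v7, dp_4b_gcc8)
```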

   Pi 3B+ ARM V7 

Pi 3B+ Memory Reading Speed Test vfpv4 32 Bit Version 1 by Roy Longbottom

               Start of test Fri Apr 12 21:39:51 2019                  

 Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
 KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
   Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

      8    1896   2125   4046   2784   2624   4448   3165   3694   3693
     16    1900   2129   4058   2791   2627   4462   3181   3711   3711
     32    1821   2000   3664   2602   2426   3965   3187   3719   3717
     64    1807   1974   3625   2567   2369   3923   3057   3615   3599
    128    1792   1959   3620   2545   2364   3906   3079   3544   3544
    256    1738   1914   3472   2468   2291   3719   3064   3545   3553
    512    1380   1493   2199   1769   1715   2331   2192   2522   2383
   1024    1003   1138   1319   1250   1219   1298   1487   1324   1324
   2048     925   1001   1104   1065   1049   1103   1093   1032   1035
   4096     901    972   1073   1037   1005   1081   1002    968    973
   8192     852    948   1076   1041   1021   1080   1009    977    975

Max MFLOPS  237    532

 Pi 4B ARM V7
 
      8    8459   4766  13344   8303   4768  15553   7806   9926   9927
     16    7142   3918   8649   7103   4094   9309   7899  10086  10056
     32    7969   4490  10339   7941   4532  11627   7758  10070  10048
     64    8126   4602   9909   8114   4617  11069   7425   8021   8070
    128    8302   4651   9623   8311   4657  10836   7374   8049   7934
    256    8319   4663   9627   8360   4666  10768   7530   7922   7925
    512    8088   4629   9453   8239   4650  10696   5023   7904   7949
   1024    3581   3113   3618   3577   3150   3675   5358   2431   1560
   2048    1338   1808   1780   1811   1832   1773   2131    950    956
   4096    1881   1880   1852   1879   1664   1336   1988    984   1054
   8192    1890   1901   1884   1729   1319   1367   2252   1018   1021

Max MFLOPS 1057   1192

Pi 4B/3B+

      8    4.46   2.24   3.30   2.98   1.82   3.50   2.47   2.69   2.69
     16    3.76   1.84   2.13   2.54   1.56   2.09   2.48   2.72   2.71
     32    4.38   2.25   2.82   3.05   1.87   2.93   2.43   2.71   2.70
     64    4.50   2.33   2.73   3.16   1.95   2.82   2.43   2.22   2.24
    128    4.63   2.37   2.66   3.27   1.97   2.77   2.39   2.27   2.24
    256    4.79   2.44   2.77   3.39   2.04   2.90   2.46   2.23   2.23
    512    5.86   3.10   4.30   4.66   2.71   4.59   2.29   3.13   3.34
   1024    3.57   2.74   2.74   2.86   2.58   2.83   3.60   1.84   1.18
   2048    1.45   1.81   1.61   1.70   1.75   1.61   1.95   0.92   0.92
   4096    2.09   1.93   1.73   1.81   1.66   1.24   1.98   1.02   1.08
   8192    2.22   2.01   1.75   1.66   1.29   1.27   2.23   1.04   1.05

Pi 3B+ gcc 8 


       8    2024   3191   1931   2973   4464   2077   3415   4426   4426
      16    2031   3194   1933   2977   4470   2078   3430   4451   4451
      32    1972   3111   1902   2842   4291   2059   3433   4455   4451
      64    1932   3042   1875   2752   4121   2008   3240   4223   4223
     128    1972   3083   1888   2825   4163   2012   3281   4272   4276
     256    1980   3089   1888   2851   4177   2013   3312   4244   4239
     512    1750   2778   1739   2460   3711   1846   3106   4029   4096
    1024     979   1862   1390   1213   2230   1463   1463   1225   1220
    2048     979   1858   1379   1137   2111   1442    859    828    828
    4096     975   1809   1363   1136   2091   1428    944    924    920
    8192     976   1788   1364   1139   2053   1409    802    792    733

Max MFLOPS   254    799                                                 
  
Pi 4B gcc 8  

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   11768   9844   3841  11787   9934   4351  10309   7816   7804
      16   11880   9880   3822  11886  10043   4363  10484   7902   7892
      32    9539   8528   3678   9517   8661   4098  10564   7948   7945
      64    9952   9310   3733   9997   9470   4160   8452   7717   7732
     128    9947   9591   3757   9990   9757   4178   8205   7680   7753
     256   10015   9604   3758  10030   9781   4186   8120   7734   7707
     512    9073   9300   3751   9472   9526   4175   7995   7709   7602
    1024    2681   5303   3594   2664   4965   3760   4828   3592   3569
    2048    1671   3488   3242   1757   3635   3540   2882   1036   1023
    4096    1777   3700   3283   1827   3627   3555   2433   1052   1054
    8192    1931   3805   3420   1933   3815   3629   2465    980    971
 
Max MFLOPS  1485   2740

Pi 4B/3B+

       8    5.81   3.08   1.99   3.96   2.23   2.09   3.02   1.77   1.76
      16    5.85   3.09   1.98   3.99   2.25   2.10   3.06   1.78   1.77
      32    4.84   2.74   1.93   3.35   2.02   1.99   3.08   1.78   1.78
      64    5.15   3.06   1.99   3.63   2.30   2.07   2.61   1.83   1.83
     128    5.04   3.11   1.99   3.54   2.34   2.08   2.50   1.80   1.81
     256    5.06   3.11   1.99   3.52   2.34   2.08   2.45   1.82   1.82
     512    5.18   3.35   2.16   3.85   2.57   2.26   2.57   1.91   1.86
    1024    2.74   2.85   2.59   2.20   2.23   2.57   3.30   2.93   2.93
    2048    1.71   1.88   2.35   1.55   1.72   2.45   3.36   1.25   1.24
    4096    1.82   2.05   2.41   1.61   1.73   2.49   2.58   1.14   1.15
    8192    1.98   2.13   2.51   1.70   1.86   2.58   3.07   1.24   1.32

4B gcc 8 gains

       8    1.39   2.07   0.29   1.42   2.08   0.28   1.32   0.79   0.79
      16    1.66   2.52   0.44   1.67   2.45   0.47   1.33   0.78   0.78
      32    1.20   1.90   0.36   1.20   1.91   0.35   1.36   0.79   0.79
      64    1.22   2.02   0.38   1.23   2.05   0.38   1.14   0.96   0.96
     128    1.20   2.06   0.39   1.20   2.10   0.39   1.11   0.95   0.98
     256    1.20   2.06   0.39   1.20   2.10   0.39   1.08   0.98   0.97
     512    1.12   2.01   0.40   1.15   2.05   0.39   1.59   0.98   0.96
    1024    0.75   1.70   0.99   0.74   1.58   1.02   0.90   1.48   2.29
    2048    1.25   1.93   1.82   0.97   1.98   2.00   1.35   1.09   1.07
    4096    0.94   1.97   1.77   0.97   2.18   2.66   1.22   1.07   1.00
    8192    1.02   2.00   1.82   1.12   2.89   2.65   1.09   0.96   0.95

NeonSpeed Benchmark MB/Second - NeonSpeed, NeonSpeedC8

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. The Norm functions are as generated by the compiler, using NEON directives, and the Neon functions use intrinsic functions. Of late, both methods produce similar performance, at up to 3000 million operations per second, across the board. Pi 4B/3B+ comparisons are also included below, showing the best gains in the L2 cache area. Pi 4B gcc 8 gains and losses are also provided, with the main loss on normal integer calculations from cached data.


                     Pi 3B+                       

  NEON Speed Test V 1.0 Fri Apr 12 22:11:38 2019  

       Vector Reading Speed in MBytes/Second      
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v 
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3170   4669   4037   4930   5220   5545
      32   3119   4531   3952   4780   5071   5374
      64   2845   3920   3558   4075   4235   4438
     128   2873   3954   3626   4095   4227   4484
     256   2917   4027   3705   4184   4313   4563
     512   2271   2923   2777   3000   3075   3127
    1024   1181   1209   1221   1201   1163   1198
    4096   1062   1077   1071   1050   1073   1076
   16384   1087   1115   1111   1043   1094   1086
   65536   1125   1144   1139    851   1126   1110
 
                     Pi 4B                        
  
      16   9677  10072   8905   9358   9776  10473
      32  10149  10330   9364   9539   9988  10543
      64  10948  11708  10466  10568  11318  11994
     128  10484  11232  10410  10104  11200  11792
     256  10509  11369  10428  10264  11273  11842
     512  10406  11066  10134  10054  11075  11467
    1024   3069   3202   3159   3166   3204   3203
    4096   1721   1910   1908   1882   1903   1900
   16384   2023   2009   2008   1965   2032   2013
   65536   2073   2074   2074   2073   2068   2064

                 Pi 4B/3B+ Comparisons            

   Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3.05   2.16   2.21   1.90   1.87   1.89
     512   4.58   3.79   3.65   3.35   3.60   3.67
    1024   2.60   2.65   2.59   2.64   2.75   2.67
   16384   1.86   1.80   1.81   1.88   1.86   1.85

                   Pi 3B+ gcc 8                  

  NEON Speed Test gcc 8 Wed May 15 09:57:18 2019  

       Vector Reading Speed in MBytes/Second      
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3289   5377   2010   5076   5731   5732
      32   3280   5341   1995   5043   5706   5706
      64   3115   4547   1923   4348   4771   4771
     128   3145   4683   1927   4482   4886   4888
     256   3146   4698   1926   4500   4906   4908
     512   2666   3762   1779   3527   3903   3915
    1024   1879   1228   1395   1225   1238   1238
    4096   1792   1151   1373   1144   1164   1162
   16384   1698   1167   1353   1119   1167   1170
   65536   1229   1157   1328    874   1165   1166

                    Pi 4B gcc 8                   

      16   9884  12882   3910  12773  13090  15133
      32   9904  13061   3916  13002  13162  15239
      64   9029  11526   3450  10704  11708  12084
     128   9242  11784   3391  11016  11816  12179
     256   9283  11890   3396  11215  11929  12284
     512   9043  10680   3413  10211  10925  11241
    1024   5818   3310   3507   3288   3239   2902
    4096   4060   1994   3497   1991   2009   2011
   16384   4030   2063   3445   2068   2072   2067
   65536   3936   2109   3391   1858   2122   2121


                 Pi 4B/3B+ Comparisons            

   Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3.01   2.40   1.95   2.52   2.28   2.64
     512   3.39   2.84   1.92   2.90   2.80   2.87
    1024   3.10   2.70   2.51   2.68   2.62   2.34
   16384   2.37   1.77   2.55   1.85   1.78   1.77

              4B gcc 8 gains and losses           

      16   1.02   1.28   0.44   1.36   1.34   1.44
     512   0.87   0.97   0.34   1.02   0.99   0.98
   16384   1.99   1.03   1.72   1.05   1.02   1.03

MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in two different versions, using standard compiled “C” code for single and double precision arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.

MP-Whetstone Benchmark - MP-WHETSPiA7, MP-WHETSPC8

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Mutex functions are used to avoid updating conflicts, by allowing only one thread at a time to access common data. Performance is generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a different compiler being used. None of the test functions are suitable for SIMD operation, with the simpler instructions being used. The overall seconds indicate MP efficiency.

Based on the 4 thread MWIPS rating, both compilations indicate almost the same Pi 4B performance improvement (1.78 and 1.79 times), but there are variations on the individual test functions.


                      Pi 3B+ ARM V7                               

  MP-Whetstone Benchmark Linux/ARM V7A v1.0 Wed Apr 24 22:48:42 2019

                    Using 1, 2, 4 and 8 Threads                   

      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt     If  Equal
                 1      2      3  MOPS  MOPS    MOPS   MOPS   MOPS

 1T  1116.9  582.4  603.6  299.7  21.7  13.3  6969.0 1364.0 1398.5
 2T  2226.5 1167.8 1181.0  593.5  43.4  26.4 12545.8 2789.0 2794.1
 4T  4436.8 2354.9 2387.3 1190.1  86.3  52.5 27429.4 5539.7 5546.8
 8T  4614.6 3174.1 3140.6 1250.0  88.1  54.7 36555.2 6409.9 6051.1

   Overall Seconds   4.99 1T,   5.02 2T,   5.10 4T,  10.20 8T     

                        Pi 4B ARM V7                               

 1T  2059.3  672.8  680.1  310.6  55.6  33.1  7461.6  2244.6  995.2
 2T  4117.1 1341.7 1390.7  624.2 110.7  65.9 14887.3  4466.5 1986.2
 4T  7910.0 2652.0 2722.2 1180.0 208.5 132.6 29291.2  8952.4 3832.3
 8T  8651.6 3057.1 2971.1 1268.3 233.2 149.6 38367.5 11922.5 3941.7

   Overall Seconds   4.99 1T,   5.01 2T,   5.29 4T,  10.71 8T      

                      Pi 3B+ gcc 8                                 

  MP-Whetstone Benchmark Linux/ARM gcc 8 Fri Jun 14 14:25:28 2019  
                   
      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS

 1T  1057.5  390.9  392.6  298.1  21.0  12.3  5227.8  1363.1 1399.4
 2T  2121.8  777.4  778.5  598.3  42.3  24.6 10185.9  2769.0 2762.9
 4T  4225.9 1509.6 1532.2 1192.3  84.7  48.8 19273.0  5326.5 5552.9
 8T  4419.6 1914.9 2041.9 1260.8  86.0  51.3 27645.3  7213.5 6031.5

   Overall Seconds   4.98 1T,   5.00 2T,   5.11 4T,  10.09 8T      

                      Pi 4B gcc 8                                  
 
 1T  1889.5  538.7  537.6  311.4  56.3  26.1  7450.5  2243.2  659.9
 2T  3782.7 1065.5 1071.2  627.1 112.3  52.0 14525.7  4460.9 1327.3
 4T  7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5  8944.2 2660.8
 8T  8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4

   Overall Seconds   4.99 1T,   5.00 2T,   5.03 4T,  10.06 8T      

                4 Thread 4B/3B+ Performance ratios                 

 V7    1.78   1.13   1.14   0.99   2.42   2.53   1.07   1.62   0.69
 gcc8  1.79   1.39   1.40   1.05   2.66   2.13   1.53   1.68   0.48

MP-Dhrystone Benchmark - MP-DHRYPiA7, MP-DHRYPiC8

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance, as reflected in the results. The single thread speeds were similar to the earlier Dhrystone results, with Pi 4B ratings around twice as fast as those for the Pi 3B+, for both compilations, with gcc 8 code being slightly faster.


                         Pi 3B+ ARM V7                      

   MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Wed Apr 24 22:57:46 2019

                    Using 1, 2, 4 and 8 Threads             

 Threads                        1        2        4        8
 Seconds                     0.85     0.96     1.36     2.71
 Dhrystones per Second    4733611  8295393 11750518 11789451
 VAX MIPS rating             2694     4721     6688     6710

                        Pi 4B  ARM V7                       
 
 Seconds                     0.82     1.59     2.70     5.04
 Dhrystones per Second    9731507 10082787 11833655 12706636
 VAX MIPS rating             5539     5739     6735     7232

                        Pi 3B+ gcc 8                        

 Threads                        1        2        4        8
 Seconds                     0.79     0.92     1.23     2.46
 Dhrystones per Second    5035879  8678942 13020489 13028455
 VAX MIPS rating             2866     4940     7411     7415

                       Pi 4B gcc 8                           
 
 Threads                        1        2        4        8
 Seconds                     0.79     1.21     2.62     4.88
 Dhrystones per Second   10126308 13262168 12230188 13106002
 VAX MIPS rating             5763     7548     6961     7459
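
The VAX MIPS rows above are consistent with the conventional Dhrystone scaling, where the score is divided by the 1757 Dhrystones per second of the VAX 11/780 reference machine. A minimal sketch of that arithmetic follows; the function name vax_mips is hypothetical, and it is an assumption that the benchmark uses exactly this divisor.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780 reference machine, the
   conventional divisor for a "VAX MIPS" rating (assumed here). */
#define VAX_DHRYSTONES_PER_SEC 1757.0

double vax_mips(double dhrystones_per_sec)
{
    assert(dhrystones_per_sec >= 0.0);
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}
```

For example, the Pi 3B+ single thread score of 4733611 Dhrystones per second gives about 2694, and the Pi 4B score of 9731507 gives about 5539, matching the ratings in the tables.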

MP SP NEON Linpack Benchmark - linpackNeonMP, linpackNeonMPC8

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running as a single thread. The main reasons appear to be updating data in RAM, to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.

Unthreaded performance was slowest when accessing the largest data arrays (N = 1000), where speeds were also more constant across the four sets of results. The largest Pi 4B improvements were at N = 100, at around three times.

The programs produce the sumchecks, as shown below, with the four sets of calculations producing identical numeric results (as they should).


                      Pi 3B+ ARM V7           
 
 Linpack Single Precision MultiThreaded Benchmark
 Using NEON Intrinsics, Wed Apr 24 23:03:08 2019

  MFLOPS 0 to 4 Threads, N 100, 500, 1000     

 Threads      None        1        2        4 

 N  100     627.07    66.31    64.79    64.14 
 N  500     465.16   293.95   292.37   293.76 
 N 1000     346.63   311.81   309.19   311.76 

                      Pi 4B  ARM V7           

 N  100    1921.53   108.66   101.88   102.46 
 N  500    1548.81   530.23   714.37   733.09 
 N 1000     399.94   378.11   364.78   398.21 

                      Pi 3B+ gcc 8            

 N  100     638.49    66.92    66.23    66.14 
 N  500     471.71   304.69   297.05   305.51 
 N 1000     356.13   317.22   316.88   316.33 

                      Pi 4B gcc 8             

 N  100    2007.38   112.55   107.85   106.98 
 N  500    1332.24   686.10   686.11   689.02 
 N 1000     402.61   435.26   432.21   432.01 

                      Sumchecks                    

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

MP BusSpeed (read only) Benchmark - MP-BusSpeedPiA7, MP-BusSpd2PiC8

In this V7A v2 version, each thread accesses all of the data, in separate sections covering caches and RAM, starting at different points. See the single processor BusSpeed details regarding burst reading, which can produce significant differences. RdAll is the main area for comparison, where MP reading RAM is thought to indicate maximum performance.

Comparisons are provided for RdAll, at 1, 2 and 4 threads. These are subject to multiprocessing peculiarities, but Pi 4B/Pi 3B+ performance gains were indicated as being around 2.5, using L1 cache data, and twice as fast, via L2 cache and RAM, with the gcc 8 produced version little different from the earlier compilations.
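
The column names in the results that follow (Inc32 down to RdAll) suggest reads at decreasing address increments. The sketch below illustrates that idea only. It assumes that IncN means every Nth 32 bit word is read, and that values are combined with AND so the loads cannot be optimised away. The names read_strided and mb_per_second are hypothetical and are not taken from the benchmark source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads every inc-th 32 bit word and ANDs the values together.
   inc = 1 corresponds to reading all of the data. */
uint32_t read_strided(const uint32_t *data, size_t words, size_t inc)
{
    assert(data != NULL && inc >= 1);
    uint32_t and_sum = 0xFFFFFFFFu;
    for (size_t i = 0; i < words; i += inc)
        and_sum &= data[i];
    return and_sum;
}

/* Converts the number of words read and elapsed seconds to MB/second. */
double mb_per_second(size_t words_read, double seconds)
{
    assert(seconds > 0.0);
    return (double)words_read * sizeof(uint32_t) / seconds / 1.0e6;
}
```

With a large increment, each read is likely to touch a new cache line or memory burst, which is one way the lower Inc32 and Inc16 speeds could arise.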

                      Pi 3B+ ARM V7                        

 MP-BusSpd ARM V7A v2 Wed Apr 24 22:58:50 2019           

   MB/Second Reading Data, 1, 2, 4 and 8 Threads         
   Staggered starting addresses to avoid caching         

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll       

 12.3 1T   3470   4390   4408   4760   5138   4926       
      2T   6272   7807   8321   9131   9780   9599       
      4T   9867  13732  15514  17568  19512  18209       
      8T   7385  10918  12320  14591  17357  16462       
122.9 1T    662    648   1253   2129   3291   4475       
      2T   1044   1032   2003   3611   6135   8931       
      4T   1068   1085   2180   4354   8409  16053       
      8T   1057   1078   2124   4247   8227  15070       
12288 1T    125    131    252    494   1009   1996       
      2T    195    136    272    501   1088   2121       
      4T    126    135    263    515   1017   1922       
      8T    114    136    305    545    994   2076       

                      Pi 4B  ARM V7               
                                                Pi 4B/3B+

 12.3 1T   5263   5637   5809   5894   5936  13445   2.73
      2T   9412  10020  10567  11454  11604  24980   2.60
      4T  16282  15577  16418  21222  20000  45530   2.50
      8T  11600  13285  16070  18579  20593  36837       
122.9 1T    739    956   1888   3153   5008   9527   2.13
      2T    629   1158   1568   5058   9509  16489   1.85
      4T    600   1093   2134   4527   8732  16816   1.05
      8T    593   1104   2121   4382   8629  17158       
12288 1T    238    258    518   1005   2001   4029   2.02
      2T    278    228    453   1690   1826   3628   1.71
      4T    269    257    740   1019   1790   4145   2.16
      8T    233    292    532    926   2186   3581       

                      Pi 3B+ gcc 8                

 MP-BusSpd ARM V7A gcc 8 Wed May 15 10:06:27 2019        
 
  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll       

 12.3 1T   3555   4451   4382   4788   5124   5205       
      2T   6515   8132   8332   9016   9793  10100        
      4T  10667  14186  15956  17529  19228  16522       
      8T   7463  10987  13299  14948  17756  16781       
122.9 1T    681    683   1211   2133   3280   4713       
      2T   1049   1057   2009   3848   6155   9293       
      4T   1049   1085   2191   4360   7921  16268       
      8T   1072   1092   2180   4303   8156  15722       
12288 1T    125    131    256    495   1005   1970       
      2T    135    133    273    505   1100   2110       
      4T    116    130    243    511   1009   2059       
      8T    126    138    260    532   1061   2017       

                      Pi 4B gcc 8                        
                                                Pi 4B/3B+

 12.3 1T   5310   5616   5801   5898   5940  13425   2.54
      2T   9393  10008  11293  11293  11368  24932   2.47
      4T  15781  15015  17606  19034  22279  40736   2.47
      8T   8465   9599  14580  18465  20034  36831       
122.9 1T    664    930   1861   3191   5017  10281   2.18
      2T    564    726   1523   5376   9387  18985   2.04
      4T    486    919   1886   4289   8337  16979   1.04
      8T    487    912   1854   4275   8271  16826       
12288 1T    225    258    514   1010   1992   3975   2.02
      2T    202    421    450   1765   3307   7396   3.51
      4T    261    288    825   1332   1772   5014   2.44
      8T    218    273    496   1041   2571   4021

MP RandMem Benchmark - MP-RandMemPiA7, MP-RandMemPiC8

This benchmark potentially reads and writes all data, in sections covering caches and RAM, each thread starting at different addresses. Random access can select any address after that. Serial reading speed is normally similar to BusSpeed RdAll. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.

Besides the full results, comparisons of the four thread results are shown below for Pi 4B/3B+ performance ratios. The Pi 3B+ appears to be faster reading data from the shared L2 cache, but only with 4 threads. Otherwise, the average performance of the new processor was indicated as 80% faster.
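
The difference between serial and random access can be sketched by reading data through an index array. This is an illustration under assumptions, not the benchmark code: build_index and read_indexed are hypothetical names, and the shuffle uses an arbitrary small generator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fills idx with 0..n-1, then optionally shuffles it with a small
   LCG-driven Fisher-Yates pass (generator constants are illustrative). */
void build_index(uint32_t *idx, size_t n, int randomise)
{
    assert(idx != NULL);
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
    if (!randomise)
        return;
    uint32_t state = 12345u;
    for (size_t i = n; i > 1; i--) {
        state = state * 1664525u + 1013904223u;
        size_t j = (size_t)(state >> 8) % i;   /* 0 <= j < i */
        uint32_t t = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = t;
    }
}

/* Reads data through the index array: sequential addresses when idx is
   in order, scattered addresses when it has been shuffled. */
uint64_t read_indexed(const uint32_t *data, const uint32_t *idx, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += data[idx[i]];
    return total;
}
```

Both orderings visit every word once, so the totals agree. Only the address pattern changes, which is why random speeds fall so sharply once the data no longer fits in cache.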

                  Pi 3B+ ARM V7         

 MP-RandMem Linux/ARM V7A v1.0 Wed Apr 24 22:54:55 2019

  KB       SerRD SerRDWR   RndRD RndRDWR

 12.3 1T    3419    4333    3420    4422
      2T    6531    4397    6515    4397
      4T   12814    4308   12896    4303
      8T   12922    4289   12561    4244
122.9 1T    3133    3959     800    1041
      2T    5992    3959    1469    1040
      4T   11584    3913    2322    1025
      8T   11417    3895    2288    1028
12288 1T    2034     795      48      62
      2T    2176     799      93      63
      4T    3183     790     128      63
      8T    2008     788     130      62

                  Pi 4B  ARM V7         

 12.3 1T    5860    7905    5927    7657
      2T   11747    7908   11182    7746
      4T   21416    7626   17382    7731
      8T   20649    7528   20431    7378
122.9 1T    5479    7269    1826    1923
      2T   10355    6964    1667    1920
      4T    9808    7177    1715    1908
      8T   11677    7058    1697    1919
12288 1T    3438    1271     179     152
      2T    4176    1204     213     167
      4T    4227    1117     337     161
      8T    3479    1093     287     168

                 Pi 4B/3B+                      

 12.3 4T    1.67    1.77    1.35    1.80
122.9 4T    0.85    1.83    0.74    1.86
12288 4T    1.33    1.41    2.63    2.56

                  Pi 3B+ gcc 8          

 12.3 1T    4362    4386    4363    4386
      2T    8222    4308    8132    4311
      4T   16391    4268   16396    4286
      8T   16297    4244   15510    4228
122.9 1T    3643    3879     925    1025
      2T    7008    3873    1692    1040
      4T   12553    3877    2373    1038
      8T   12000    3881    2330    1043
12288 1T    1848     833      67      62
      2T    2183     829     119      63
      4T    3672     825     135      63
      8T    2608     826     136      63
 
                  Pi 4B gcc 8           

 12.3 1T    5950    7903    5945    7896
      2T   11849    7923   11887    7917
      4T   23404    7785   23395    7761
      8T   21903    7669   23104    7655
122.9 1T    5670    7309    2002    1924
      2T   10682    7285    1648    1923
      4T    9944    7266    1813    1927
      8T    9896    7216    1812    1919
12288 1T    3904    1075     179     164
      2T    7317    1055     215     164
      4T    3398    1063     343     165
      8T    4156    1062     350     165

                Pi 4B/3B+ gcc 8         

 12.3 4T    1.43    1.82    1.43    1.81
122.9 4T    0.79    1.87    0.76    1.86
12288 4T    0.93    1.29    2.54    2.62

MP-MFLOPS Benchmarks - MP-MFLOPSPiA7, MP-MFLOPSDP, MP-NeonMFLOPS,
MP-MFLOPSPiC8, MP-MFLOPSDPC8, MP-NeonMFLOPSC8

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. The numeric results are converted into a simple sumcheck, that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.

Note the across the board Pi 4B performance gains on all programs, with maximum speeds of 17.2 GFLOPS for single precision calculations and 10.4 GFLOPS using double precision.
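
The calculations and the segmenting of data between threads can be sketched as follows. This is a hedged illustration: the function names and constants are hypothetical, the form of the 2 operation test is assumed, only the part of the 32 operation expression that is written out above is reproduced (8 operations), and the segments are processed in turn instead of on separate threads.

```c
#include <assert.h>
#include <stddef.h>

/* First test: one add and one multiply per data word (assumed form). */
void triad2(float *x, size_t first, size_t last, float a, float b)
{
    for (size_t i = first; i < last; i++)
        x[i] = (x[i] + a) * b;
}

/* Visible part of the second test's expression: 8 of the 32 operations.
   The remaining terms are elided in the text and are not guessed here. */
void terms8(float *x, size_t first, size_t last,
            float a, float b, float c, float d, float e, float f)
{
    for (size_t i = first; i < last; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}

/* Splits n words into one segment per thread and processes them in turn;
   a real run would hand each segment to its own thread. */
void run_segments(float *x, size_t n, int threads)
{
    assert(threads >= 1);
    size_t seg = n / (size_t)threads;
    for (int t = 0; t < threads; t++) {
        size_t first = (size_t)t * seg;
        size_t last = (t == threads - 1) ? n : first + seg;
        terms8(x, first, last, 0.25f, 0.5f, 0.125f, 0.25f, 0.5f, 0.125f);
    }
}

/* Simple sumcheck: must not depend on how the data was segmented. */
double sumcheck(const float *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}
```

Because every word is updated independently, the sumcheck is the same for 1, 2, 4 or 8 segments, which is the property the benchmark relies on to confirm correct multithreaded results.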


                Single Precision Version         

                      Pi 3B+ ARM V7                 

 MP-MFLOPS Linux/ARM V7A v1.0 Wed Apr 24 23:08:19 2019

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
 1T      214     212     189     813     812     797
 2T      403     427     354    1613    1587    1573
 4T      717     811     372    3044    3027    2982
 8T      756     777     388    3005    3101    3064

                     Pi 4B  ARM V7                  

 1T      987     993     606    2816    2794    2804
 2T     1823    1837     567    5610    5541    5497
 4T     2119    3349     647    9884   10702    9081
 8T     3136    3783     609   10230   10504    9240
Max                                                 
4B/3B+  4.15    4.66    1.67    3.36    3.45    3.02

                      Pi 3B+ gcc 8                   

 1T      214     212     189     799     784     781
 2T      417     417     365    1568    1583    1540
 4T      754     683     385    3026    3017    2919
 8T      738     761     401    3053    2997    2866

                     Pi 4B gcc 8                    

 1T     1224    1257     520    2814    2800    2803
 2T     2485    2257     525    5608    5575    5576
 4T     4119    3243     534   11018   10645    8358
 8T     4131    4618     541    9941   10339    8165
Max                                                 
4B/3B+  5.48    6.07    1.35    3.61    3.53    2.86

 ###################################################

            NEON Intrinsic Functions Version      

                      Pi 3B+ ARM V7                 

 MP-MFLOPS NEON Intrinsics v1.0 Wed Apr 24 22:41:38 2019

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
 1T      692     685     393    2052    2017    1887
 2T     1126    1358     403    4096    3924    3697
 4T     2434    2030     405    7848    7740    5547
 8T     2363    2095     407    7584    7609    6097

                     Pi 4B  ARM V7                  

 1T     2491    2399     615    4325    4285    4261
 2T     5629    5520     591    8602    8463    8308
 4T    10580    5594     553   16991   16493    9124
 8T     7047   10785     513   14325   16219    8867
Max                                                 
4B/3B+  4.35    5.15    1.36    2.17    2.13    1.50

                      Pi 3B+ gcc 8                  

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
 1T      691     684     407    1910    1874    1828
 2T     1214    1306     410    3746    3747    3392
 4T     1943    2568     410    7403    7435    5913
 8T     2093    2233     411    7217    7087    6044

                      Pi 4B gcc 8                    

 1T     2797    2870     641    4422    4454    4405
 2T     3217    5601     569    8587    8800    8377
 4T     7902    9864     611   17061   17215    9704
 8T     7070   10562     603   15531   16203    9516
Max                                                 
4B/3B+  3.78    4.13    1.49    2.30    2.32    1.61

 ###################################################

              Double Precision Version            

                     Pi 3B+ ARM V7                  

 MP-MFLOPS Double Precision v1.0 Sat Jun 15 12:07:33 2019

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
 1T      209     206     166     782     797     747
 2T      415     416     198    1566    1590    1462
 4T      663     801     198    3125    3122    2770
 8T      746     729     199    3061    2909    2745

                     Pi 4B  ARM V7                  

 1T     1187    1220     309    2682    2714    2701
 2T     2420    2416     282    5379    5415    4780
 4T     4665    2381     317   10256   10336    5242
 8T     4385    3114     310    9721   10340    5131
Max                                                 
4B/3B+  6.25    3.89    1.59    3.28    3.31    1.89

                      Pi 3B+ gcc 8                  

 1T      214     213     168     798     797     776
 2T      409     416     194    1567    1590    1466
 4T      694     675     195    3122    3120    2751
 8T      698     797     198    3055    3005    2779

                      Pi 4B gcc 8                   

 1T     1203    1211     315    2675    2719    2674
 2T     2291    2441     293    5406    5421    4907
 4T     4673    2501     309   10313   10393    5256
 8T     4394    3550     265    8782   10110    5197
Max                                                 
4B/3B+  6.69    4.45    1.56    3.30    3.33    1.89

                        Sumchecks                   

 SP    76406   97075   99969   66015   95363   99951
 NEON  76406   97075   99969   66014   95363   99951
 DP    76384   97072   99969   66065   95370   99951

OpenMP-MFLOPS - OpenMP-MFLOPS, notOpenMP-MFLOPS, OpenMP-MFLOPSC8,
OpenMP-MFLOPSDPC8, notOpenMP-MFLOPSC8, notOpenMP-MFLOPSDPC8

This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks and adds calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive. With gcc 8, additional versions have been produced, using double precision floating point. The general format and standard parameters are as follows.

The final data values are checked for consistency. Different compilers or different CPUs could involve using alternative instructions or rounding effects, with variable accuracy. For a given compiler and CPU, OpenMP sumchecks could then be expected to be the same as those from notOpenMP single core values. However, this is not always the case. The double precision gcc 8 benchmarks appear to be consistent, but only single precision sumchecks are provided.

This benchmark was a compilation of code used for desktop PCs, starting at 100 KB, then 1 MB and 10 MB.


            OpenMP MFLOPS Benchmark 1 Wed Apr 24 22:51:10 2019                

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.281575     1776    0.929538   Yes
 Data in & out    1000000     2      250   1.265817      395    0.992550   Yes
 Data in & out   10000000     2       25   1.222289      409    0.999250   Yes

 Data in & out     100000     8     2500   0.376635     5310    0.957126   Yes
 Data in & out    1000000     8      250   1.305504     1532    0.995524   Yes
 Data in & out   10000000     8       25   1.267736     1578    0.999550   Yes

 Data in & out     100000    32     2500   3.285631     2435    0.890232   Yes
 Data in & out    1000000    32      250   3.351830     2387    0.988068   Yes
 Data in & out   10000000    32       25   3.329400     2403    0.998785   Yes

                End of test Wed Apr 24 22:51:26 2019                          
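
The MFLOPS column in the listing above appears to follow directly from the other columns: words multiplied by operations per word and by repeat passes, divided by elapsed seconds. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
#include <assert.h>

/* MFLOPS from the table columns: words x operations per word x passes,
   divided by elapsed seconds and by one million. */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For the first line of the listing, 100000 words at 2 operations per word for 2500 passes in 0.281575 seconds gives about 1776 MFLOPS, as reported.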

SumChecks

V7A OMP 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890232 0.988068 0.998785

V7A Not 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890268 0.988078 0.998806

gcc 8 OMP  3B+, Not 3B+ 4B
0.929538 0.992550 0.999250 0.957126 0.995524 0.999550 0.890282 0.988096 0.998806

gcc 8 4B OMP
0.098043 0.810084 0.922891 0.144870 0.922568 0.918226 0.401577 0.935064 0.916277

gcc 8 DP OMP 3B+ 4B, Not 3B+ 4B
0.929474 0.992543 0.999249 0.957164 0.995525 0.999550 0.890377 0.988101 0.998799

MFLOPS Performance and Comparisons

The first comparisons below identify OpenMP 4 core performance gains. For 2 and 8 operations per word read and written, real gains can only be seen with the 100 KB data size. With CPU speed limitations at 32 operations per word, single core MFLOPS is shown to be constant at all data sizes, but the highest OpenMP speeds only occur using the 100 KB data size.

The other comparisons identify Pi 4B performance gains over the Pi 3B+, where those applying to single core use are mostly better than those via OpenMP. The highest OpenMP improvement was 4.5 times, via gcc 8 and double precision operation. Maximum demonstrated Pi 4B speeds were 19.9 GFLOPS single precision and 9.2 GFLOPS double precision.


         V7A Compiler                                          
         Pi 3B+               Pi 4B                  Pi4 Gains 
 KB+Ops    4       1   4 core   4       1   4 core   4      1  
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2  1776    831   2.14   4716   2850   1.65   2.66   3.43
 1000- 2   395    391   1.01    556    429   1.30   1.41   1.10
10000- 2   409    409   1.00    544    632   0.86   1.33   1.55

  100- 8  5310   2009   2.64   7981   5191   1.54   1.50   2.58
 1000- 8  1532   1445   1.06   2389   2082   1.15   1.56   1.44
10000- 8  1578   1478   1.07   2199   2003   1.10   1.39   1.36

  100-32  2435   1855   1.31   8147   5449   1.50   3.35   2.94
 1000-32  2387   1733   1.38   7951   5385   1.48   3.33   3.11
10000-32  2403   1736   1.38   8030   5379   1.49   3.34   3.10

         gcc 8 Compiler                                          
         Pi 3B+               Pi 4B                  Pi4 Gains 
 KB+Ops    4       1   4 core   4       1   4 core   4      1  
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2  2139    778   2.75   5100   2270   2.25   2.38   2.92
 1000- 2   398    403   0.99    617    632   0.98   1.55   1.57
10000- 2   412    415   0.99    542    631   0.86   1.32   1.52

  100- 8  7348   1919   3.83  13805   5511   2.50   1.88   2.87
 1000- 8  1597   1448   1.10   2168   2217   0.98   1.36   1.53
10000- 8  1635   1444   1.13   2178   2542   0.86   1.33   1.76

  100-32  8497   2023   4.20  19921   5341   3.73   2.34   2.64
 1000-32  5997   1903   3.15   8556   5267   1.62   1.43   2.77
10000-32  6057   1914   3.16   8731   5276   1.65   1.44   2.76

         gcc 8 Double Precision                                  
         Pi 3B+               Pi 4B                  Pi4 Gains 
 KB+Ops    4       1   4 core   4       1   4 core   4      1  
  /Word  Cores   Core   Gain  Cores   Core   Gain  Cores   Core

  100- 2   711    203   3.50   3200    977   3.28   4.50   4.81
 1000- 2   193    168   1.15    274    295   0.93   1.42   1.76
10000- 2   199    172   1.16    273    307   0.89   1.37   1.78

  100- 8  1898    503   3.77   6771   2440   2.78   3.57   4.85
 1000- 8   730    434   1.68   1102   1072   1.03   1.51   2.47
10000- 8   755    435   1.74   1108   1255   0.88   1.47   2.89

  100-32  3072    793   3.87   9229   2725   3.39   3.00   3.44
 1000-32  2695    765   3.52   4256   2674   1.59   1.58   3.50
10000-32  2719    765   3.55   4469   2677   1.67   1.64   3.50

Floating Point Assembly Code

The latest floating point performance improvements, via gcc 8, are due to better use of NEON instructions. If I have read this report correctly, double precision ARM NEON SIMD is not supported on V7 CPUs, only Single Instruction Single Data (SISD), where fused multiply and add instructions can produce two results per clock cycle, or a maximum of 3 GFLOPS per core on Pi 4, or 12 GFLOPS overall.

For the routines in my MP-MFLOPS programs that execute 32 double precision floating point operations per data word read, disassembly indicates that the following instructions are used, with 64 bit d registers. Maximum measured speed was just over 10 GFLOPS.

.L18:                            
    vldr.64         d17, [r1]   
    vadd.f64        d16, d17, d4 
    vadd.f64        d18, d17, d0 
    vadd.f64        d25, d17, d15
    vadd.f64        d24, d17, d11
    vmul.f64        d16, d16, d5 
    vadd.f64        d23, d17, d31
    vadd.f64        d22, d17, d27
    vadd.f64        d21, d17, d2 
    vadd.f64        d20, d17, d6 
    vadd.f64        d19, d17, d13
    vfma.f64        d16, d18, d1 
    vadd.f64        d18, d17, d9 
    vadd.f64        d17, d17, d29
    vfma.f64        d16, d25, d14
    vfma.f64        d16, d24, d10
    vfma.f64        d16, d23, d30
    vfma.f64        d16, d22, d28
    vfms.f64        d16, d21, d3 
    vfms.f64        d16, d20, d7 
    vfms.f64        d16, d19, d12
    vfms.f64        d16, d18, d8 
    vfms.f64        d16, d17, d26
    vstmia.64       r1!, {d16}   
    cmp             r0, r1       
    bne             .L18

It is not clear (to me) what the maximum speed is for single precision calculations. These appear to compile to full SIMD operation, using quad word registers. With fused multiply and add, that could amount to 8 results per clock cycle, with 12 GFLOPS from one Pi 4 core and 48 GFLOPS overall. Maximum obtained was around 20 GFLOPS.
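
The peak figures quoted in this section follow from clock speed multiplied by results per clock cycle and by the number of cores. The sketch below reproduces that arithmetic, assuming the Pi 4 clock of 1.5 GHz that the 3 GFLOPS per core figure implies. The function names are hypothetical.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS = clock (GHz) x results per clock x cores. */
double peak_gflops(double clock_ghz, double results_per_clock, int cores)
{
    assert(clock_ghz > 0.0 && results_per_clock > 0.0 && cores >= 1);
    return clock_ghz * results_per_clock * cores;
}

/* Fraction of the theoretical peak reached by a measured result. */
double efficiency(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return measured_gflops / peak;
}
```

With 2 results per clock, peak_gflops(1.5, 2.0, 4) gives 12 GFLOPS for double precision, so the measured 10.4 GFLOPS is about 87% of peak. With 8 results per clock, peak_gflops(1.5, 8.0, 4) gives 48 GFLOPS for single precision, so the measured 20 GFLOPS or so is about 41% of that estimate.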

OpenMP-MemSpeed - OpenMP-MemSpeed2, NotOpenMP-MemSpeed2,
OpenMP-MemSpeed2C8, NotOpenMP-MemSpeed2C8

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with the example single core results also shown after the detailed measurements. Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP, with effects on Pi 3B+ and Pi 4B not the same. Detailed comparisons of these results are rather meaningless.
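
As an illustration of the kind of loop involved, the first test function shown in the column headings below, x[m]=x[m]+s*y[m], could be offered to OpenMP as sketched here. The placement and form of the directive are assumptions, and triad_omp is a hypothetical name, not taken from the benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* x[m] = x[m] + s * y[m], with the loop offered to OpenMP for parallelism.
   Without -fopenmp the pragma is ignored and the loop runs on one core. */
void triad_omp(double *x, const double *y, double s, long n)
{
    assert(x != NULL && y != NULL && n >= 0);
    #pragma omp parallel for
    for (long m = 0; m < n; m++)
        x[m] = x[m] + s * y[m];
}
```

Compiling the same source with and without -fopenmp mirrors the OpenMP and NotOpenMP pairing used here. For short loops over small arrays, the cost of starting and synchronising the thread team on every call can outweigh the work, which is one possible reason for the slower OpenMP results.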


                               Pi 3B+ ARM V7                             

     Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom       

               Start of test Wed Apr 24 22:45:07 2019                   

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    6432   3483   1646  10276   5514   1770  18468   9721   1534
       8    7041   3603   1651  11747   5783   1788  19068  10085   1538
      16    7023   3606   1557  11694   5839   1672  19316   9528   1469
      32    6983   3600   1525  11413   5915   1656  19385   9532   1442
      64    6283   3554   1584  10861   5751   1621  14307   9466   1443
     128    6828   3578   1580  11074   5828   1659  10791   8935   1490
     256    5384   3365   1521  11216   5166   1687   9806   8148   1519
     512    5371   3253   1511   8917   4858   1412   7752   4363   1365
    1024    3084   2643   1066   3772   3504   1314   1450   1403   1136
    2048    3345   2087   1086   4148   3589   1471   1052   1063   1139
    4096     915   2648    894   4143   2456   1655    984    987   1190
    8192    3644   2504   1124   4183   3530   1496    903    909   1074
   16384     963   2050    922   3867   3154   1478    752    849   1156
   32768    3889   2467   1179   3562   3328   1667    838    833   1150
   65536    3902   2009   1109   3843   1437   1596    917    917    927
  131072     986    667    819   1145    904    820    858    865    584
 
  Not OMP                                                               
       8    1860   2972   4449   2787   4039   4449   3168   3164   3170
     256    1810   2791   4137   2655   3860   4135   3126   3065   3066
   65536     960   1121   1109   1100   1120   1115    901    793    844

                            Pi 4B  ARM V7                                

    Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom        

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7732   8092   1266   7627   8431   1616  31436  15892    889
       8    7546   8158   1284   7925   8537   1597  30383  16635    884
      16    7695   8198   1261   7854   8549   1598  27037  15644    896
      32    7773   7808   1255   8036   7727   1612  29621  16928    897
      64    9728   9094   1233   9355   9028   1602  16855  13297    867
     128   11296  10068   1002  11342  10813   1686  13594  15106    794
     256   13987  11677   1231  15357  13496   1732  12707  10415    878
     512   17763   8841   1170  10023  13404   1529  12655   9137    693
    1024    6070   6553   1262  10196  10069   1455   5405   5027    670
    2048    3858   6609   1343   6440   6643   1657   2234   2324    877
    4096    6055   6743    989   6608   6568   1664   2114   2369    777
    8192    1669   2047   1126   7071   6894   1581   2532   2569    857
   16384    1974   1953   1385   6748   4399   1763   2643   1845    753
   32768    1594   3482   1115   7680   7494   1814   1739   1908   1147
   65536    2630   7446   1320   1632   1826   1651   2061   2920    904
  131072    1438   1540   1249   1714   1694   1244   1760   2011    856

 Not OMP                                                                
       8    8602  11536  13324   8607  11756  13378   7826   7689   7670
     256    8319   9856  10030   8338   8984   9308   5800   7510   7535
   65536    1373   1725   2071   2059   2072   2044   2170    912    900

                           Pi 3B+ gcc 8                                  
 
   Memory Reading Speed Test OpenMP gcc 8 by Roy Longbottom             

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7065   3661   8370  10058   5245   9260  18199   9342   9242
       8    7350   3854   9338  11747   5786  10201  19108   9663   9412
      16    7444   3955   9543  11918   5961  10696  19339   9854   9831
      32    7198   3953   9537   9783   5908  10683  19075   9958   9971
      64    6848   3901   9057  11146   5168   9187  10408   9399   9440
     128    7655   3916   9113  11204   5785  10073  10315   9185   9191
     256    7044   3921   9154  11263   5785  10114   9601   9002   9019
     512    6662   3579   7738   9326   5206   7931   8313   7911   7903
    1024    4050   2892   4167   3997   3674   4318   1437   1422   1435
    2048    3996   2879   4134   4038   3624   4325   1042   1012    999
    4096    3909   2803   4078   3981   3591   4223   1047    988   1044
    8192    3880   2871   3805   4196   3555   4117    935    948    940
   16384    1366   2193   3757   4058   3178   3895    902    894    843
   32768    2202   2138   3428   3577   3335   3559    871    793    893
   65536    1180   1119   1696   1447   1178   1721    853    874    868
  131072    1016    688   1096   1133    893   1141    844   1141   1080

Not OMP                                                                 
       8    2020   1878   2056   2959   2018   2068   3398   4406   4406
     256    1973   1833   1990   2845   1966   1993   3306   4215   4215
   65536    1016   1248   1287   1130   1302   1301   1005    928    915

                           Pi 4B gcc 8                                   

     Memory Reading Speed Test OpenMP gcc 8 by Roy Longbottom           

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    8097   8322   8641   8020   8436   8384  39701  19701  19712
       8    7814   8555   8756   8321   8548   8526  39042  19984  19996
      16    8149   7738   7742   8303   7779   8192  37995  19883  19984
      32    8969   8769   8799   9040   8759   8743  37737  20133  20130
      64    7617   7457   7437   7575   7380   7422  17770  15332  14248
     128   11221  10936  11003  11105  11011  10986  13650  13910  13881
     256   17883  18144  18036  17691  18094  17844  13073  12465  12535
     512   18001  18468  19675  17075  18221  19264  13511  13895  12008
    1024    9532  10590   9772  11842  11282  11277   7173   9473   9496
    2048    7095   7025   6866   7117   7043   6946   2914   3475   3468
    4096    7244   6927   7036   5951   7054   6531   2582   3130   3122
    8192    4578   7173   7025   6322   7078   7182   2504   3127   3115
   16384    5470   7043   7067   7103   7052   7020   2557   3093   3088
   32768    7359   7817   7766   7158   7078   7757   2618   3066   3094
   65536    7810   7268   7266   3824   7478   5164   2486   3016   2931
  131072    2460   2655   7224   7513   7308   7339   2540   2944   2940

 Not OMP                                                                
       8   11775   3895   4342  11787   4325   4354  10334   7806   7816
     256   10032   3699   4223   9978   4289   4185   7105   7612   7621
   65536    2099   2587   3033   2103   3021   3001   2585   1105   1101

I/O Benchmarks

Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and WiFi network connections. The Raspberry Pi programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB and, finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of large files and the file path. The same program code is used for both varieties, the only difference being file opening properties. The drive benchmark opens files with extra options for direct I/O, avoiding data caching in main memory, but includes an extra test with caching allowed. For further details and downloads see the usual PDF file.
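
The large-file part of this procedure can be sketched as follows. This is a minimal illustration in Python, not the benchmark's own code. The function name time_large_file, the 1 MB chunk size and the use of a temporary directory are assumptions made for the example, and ordinary cached I/O is used, not direct I/O.

```python
import os
import tempfile
import time


def time_large_file(path, size_mb, chunk_mb=1):
    """Write then read one file of size_mb megabytes; return (write, read) in MB/s.

    Hypothetical helper for illustration only; uses normal cached I/O.
    """
    chunk = bytes(chunk_mb * 1024 * 1024)
    chunks = size_mb // chunk_mb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(chunks):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # push data to the device before stopping the clock
    write_secs = time.perf_counter() - start

    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(len(chunk))
            if not data:
                break
            total += len(data)
    read_secs = time.perf_counter() - start

    assert total == chunks * len(chunk)
    return size_mb / write_secs, size_mb / read_secs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for size in (8, 16):  # default large-file sizes quoted above
            w, r = time_large_file(os.path.join(d, "test.dat"), size)
            print(f"{size:8d} MB  write {w:8.2f} MB/s  read {r:8.2f} MB/s")
```

Because caching is allowed here, read speeds from this sketch would be expected to resemble the Cached rows in the drive results, not the direct I/O rows.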

LanSpeed Benchmarks - WiFi - LanSpeed

Following are Raspberry Pi 3B+ and Pi 4B results using what I believe were both 2.4 GHz and 5 GHz WiFi frequencies. Details on setting up the links can be found in the LAN/WiFi section of this PDF file. Performance of the two systems was similar at both frequencies.


 ******************** Pi 3B+ 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    5.71    6.07    5.96    5.69    5.46    4.76
      16    6.14    6.38    6.47    6.14    6.15    5.91

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs      2.94   3.081   3.185    3.04    2.89     3.7

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.16    0.57    0.96    0.36    0.63    1.17
 ms/file    25.3   14.31    17.1   11.46   13.04   14.06   2.138

 ********************* Pi 3B+ 5 GHz *********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3
 
       8   12.82   14.52   14.00   10.98   11.09    8.94
      16   11.60   12.91    4.48    9.16    8.19    7.69

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.41    0.76    1.46    0.41    0.74    1.46
 ms/file    9.96   10.83   11.19   10.11   11.02   11.23   1.990

 Random similar to 2.4 GHz


 ********************* Pi 4B 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    6.35    6.33    6.38    7.05    6.98    7.10
      16    6.70    6.82    6.76    7.19    6.53    7.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     2.691   2.875   3.048    3.13    2.93    2.84

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.34    0.44    1.04    0.37    0.37    1.26
 ms/file   12.14   18.59    15.7    11.1    22.2   12.99   2.153


 ********************** Pi 4B 5 GHz *********************

                         MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   11.90   12.96   13.16   10.11    9.55    9.66
      16   11.50   13.93   14.13    9.91    8.88    9.92

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.13    0.46    0.91    0.25    0.55    1.02
 ms/file   30.85   17.83   18.10   16.62   14.93   16.01   3.361

 Random similar to 2.4 GHz

LanSpeed Benchmark - (1G bits per second Ethernet on Pi 4B) - LanSpeed

There can be significant variability in performance with these small samples. For the large files, the default sizes were increased to produce more stable speeds. In this case, 1 Gbps was clearly demonstrated using the Pi 4B, around three times faster than the Pi 3B+. Random access was mostly slightly faster on the Pi 4B and, with the small files, perhaps 25% faster on writing and 50% faster on reading.


 ************************ Pi 3B+ ************************
 
                       MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   31.17   31.62   31.61    13.5   26.19   26.38
      16   31.62   31.89   31.76    26.7   26.94   27.01

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     0.007    1.09   0.688    1.16    1.04    1.08

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     1.15    2.26    4.18    1.73    3.18    5.66
 ms/file    3.57    3.62    3.92    2.36    2.58    2.89   0.511

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32   31.99   31.61   32.13   21.39   27.09   26.87
      64   32.33   32.37   32.35   26.94   26.98    26.7

 ************************ Pi 4B ************************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   67.82   12.97   90.19   99.84   93.49   96.83
      16   92.25   92.66   92.96   103.9  105.28   91.17

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     0.007    0.01    0.04    1.01    0.85    0.91

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     1.47     2.8    5.14    2.47    4.71    8.61
 ms/file    2.78    2.92    3.19    1.66    1.74     1.9   0.256

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32    78.2   34.46   80.71   84.94   87.11   84.97
      64   88.18   87.52   87.03  111.34  109.58  107.28

     128   98.84   99.24   96.58  110.99  110.57   87.43
     256  106.75  105.43   106.4   85.78  108.99  106.29
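
To relate the MBytes/second figures to the 1 Gbps line rate, the conversion below may help. It is a sketch only: whether the benchmark's MB means 10^6 or 2^20 bytes is not stated here, so both interpretations are shown, and the function name is an assumption.

```python
def mbytes_to_mbits(mb_per_sec, bytes_per_mb=1_000_000):
    """Convert MBytes/second to megabits/second (10^6 bits per megabit)."""
    return mb_per_sec * bytes_per_mb * 8 / 1_000_000


# Fastest 64 MB results from the tables above (MBytes/second).
results = {"Pi 3B+ write 64 MB": 32.37, "Pi 4B read 64 MB": 111.34}

for name, speed in results.items():
    decimal = mbytes_to_mbits(speed)
    binary = mbytes_to_mbits(speed, bytes_per_mb=1024 * 1024)
    print(f"{name}: {decimal:.0f} to {binary:.0f} Mbit/s")
```

On either interpretation the best Pi 4B figure is close to 900 Mbit/s, which is near what a gigabit link can be expected to deliver as file data once protocol overheads are allowed for.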

USB Benchmarks - DriveSpeed

Following are DriveSpeed results on Pi 3B+ and 4B, using the same high speed USB 3 stick (SanDisk Extreme with write/read ratings of 110/190 MB/s and 16 KB sectors). Other sticks would probably provide different comparative performance.

On the largest files shown, Pi 4B performance gains were 2.2 times on writing and 5.3 times on reading. Unlike LanSpeed, DriveSpeed uses direct I/O, leading to an extra entry for cached files, where reading is mainly influenced by RAM speeds. Those results can be too variable to provide meaningful comparisons.

Random access speeds were quite similar. On small files, reading was indicated as up to five times faster on the Pi 4B, but the 3B+ appeared to be nearly 30 times faster on writing.

For the Pi 4B, additional large file results are included for a Patriot Rage 2 USB 3 stick, rated as reading at up to 400 MB/second, with nearly 300 MB/second demonstrated using a Windows version of DriveSpeed. In this case, it appeared to be slightly slower than the first one on reading, but faster on writing, at 80 MB/second. This second drive also obtained those painfully slow speeds on writing small files.

 ********************* Pi 3B+ USB 2 ********************

   DriveSpeed RasPi 1.1 Wed Apr 24 22:09:09 2019

 /media/pi/REMIX_OS/
 Total MB 9017, Free MB 7486, Used MB 1531

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   27.71   27.35   27.13   30.72    30.9   31.31
      16   27.21   27.54   23.69   29.89   31.34   31.27
 Cached
       8   52.24   59.57   46.88  333.08  741.57  780.68

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     0.403   0.403   0.404    0.74    0.85    0.59

 200 File   Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     1.10    2.12    3.82    6.04    9.17   14.01
 ms/file    3.71    3.86    4.28    0.68    0.89    1.17   0.123

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

    1000   27.25   27.25   27.19   31.23   31.27   31.27
    2000   27.30   27.07   27.32   31.32   31.26   31.26

 ********************* Pi 4B USB 3 *********************

   DriveSpeed RasPi 1.1 Fri Apr 26 17:21:56 2019

 /media/pi/REMIXOSSYS//
 Total MB 5108, Free MB 3982, Used MB 1126

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   33.28   32.27   32.28  161.34  162.25  163.85
      16   39.85   41.95   43.02  164.07  165.53  165.84
 Cached
       8   33.32   34.96   34.96  593.94  582.25  589.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     0.383   0.372   0.371    0.77    0.83    0.63

 200 File   Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.04    0.07    0.15   20.64   41.04   70.01
 ms/file  110.04  109.97  110.01    0.20    0.20    0.23   0.089

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

     500   56.36   58.13   55.25  166.31  165.46  165.43
    1000   59.56   61.46   60.54  161.69  165.97  166.49

 /media/pi/PATRIOT/
 Total MB 120832, Free MB 120832, Used MB 0

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

    1000   80.87   80.23   81.92  131.41  130.72  130.39
    2000   83.67   81.82   82.14  130.85  131.29  131.36
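
The comparisons for the SanDisk stick can be checked from the DriveSpeed figures with a few lines of Python. This is a sketch: taking the third pass of the largest file on each system is an assumption about which values were compared, and the helper name gain is hypothetical.

```python
# Values copied from the DriveSpeed results above (MBytes/second).
pi3_large = {"write": 27.32, "read": 31.26}    # Pi 3B+, 2000 MB, third pass
pi4_large = {"write": 60.54, "read": 166.49}   # Pi 4B, 1000 MB, third pass

# 200 small files of 4, 8 and 16 KB (MB/sec rows).
pi3_small = {"write": [1.10, 2.12, 3.82], "read": [6.04, 9.17, 14.01]}
pi4_small = {"write": [0.04, 0.07, 0.15], "read": [20.64, 41.04, 70.01]}


def gain(new, old):
    """Speed ratio of new over old; values above 1 favour `new`."""
    return new / old


for op in ("write", "read"):
    print(f"large files {op}: Pi 4B gain {gain(pi4_large[op], pi3_large[op]):.1f}")

for op in ("write", "read"):
    ratios = [gain(a, b) for a, b in zip(pi4_small[op], pi3_small[op])]
    print(f"small files {op}: Pi 4B gains", ", ".join(f"{r:.2f}" for r in ratios))
```

A small-file ratio well below 1 means that the Pi 3B+ was the faster system on that test.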

Pi 4B Main Drive Benchmark - DriveSpeed

These are DriveSpeed results for the main drive, in this case nowhere near USB 3 speeds.

  
   DriveSpeed RasPi 1.1 Mon Apr 29 10:20:57 2019

 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks/DriveSpeed/drive1
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   16.41   11.21   12.27   39.81   40.10   40.39
      16   11.79   21.10   34.05   40.18   40.19   40.33
Cached
       8  137.47  156.43  285.59  580.73  598.66  587.97

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.371   0.371   0.363    1.28    1.53    1.30

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      3.49    6.41    8.26    7.67   11.68   17.51
ms/file     1.17    1.28    1.98    0.53    0.70    0.94   0.014

Java Whetstone Benchmark - whetstc.class

The Java benchmarks were run after installing Oracle Java 8, then OpenJDK11 later.

Pi 4B performance was nearly as good as the compiled C version. However, there can be wide variations involving new Java versions. Here, the Pi 3B+ overall MWIPS rating was particularly slow, entirely due to the time taken by the sin, cos and exp, sqrt tests. On the other tests, Pi 4B gains ranged from almost none (assignments) to a little over three times (if then else and N6 floating point).

 ************************ Pi 3B+ ************************

   Whetstone Benchmark Java Version, May 14 2019, 15:02:11

                                                         1 Pass
  Test                     Result    MFLOPS      MOPS millisecs

  N1 floating point  -1.124750137    215.20              0.0892
  N2 floating point  -1.131330490    208.76              0.6438
  N3 if then else     1.000000000              103.58    0.9992
  N4 fixed point     12.000000000              538.09    0.5854
  N5 sin,cos etc.     0.499110103                7.04   11.8100
  N6 floating point   0.999999821    106.22              5.0780
  N7 assignments      3.000000000              322.85    0.5724
  N8 exp,sqrt etc.    0.751108646                1.38   26.9200

  MWIPS                              214.14             46.6980

  Operating System    Linux, Arch. arm, Version 4.14.70-v7+
  Java Vendor         Oracle Corporation, Version 1.8.0_212

 ************************ Pi 4B ************************

   Whetstone Benchmark Java Version, May 14 2019, 14:16:44

                                                         1 Pass
  Test                     Result    MFLOPS      MOPS millisecs

  N1 floating point  -1.124750137    503.94              0.0381
  N2 floating point  -1.131330490    488.37              0.2752
  N3 if then else     1.000000000              332.80    0.3110
  N4 fixed point     12.000000000              881.37    0.3574
  N5 sin,cos etc.     0.499110132               42.92    1.9384
  N6 floating point   0.999999821    345.77              1.5600
  N7 assignments      3.000000000              332.97    0.5550
  N8 exp,sqrt etc.    0.825148463               25.00    1.4880

  MWIPS                             1533.01              6.5231

  Operating System    Linux, Arch. arm, Version 4.19.29-v7l+
  Java Vendor         Oracle Corporation, Version 1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

   Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                         1 Pass
  Test                     Result    MFLOPS      MOPS millisecs

  N1 floating point  -1.124750137    524.02              0.0366
  N2 floating point  -1.131330490    494.12              0.2720
  N3 if then else     1.000000000              289.92    0.3570
  N4 fixed point     12.000000000             1092.99    0.2882
  N5 sin,cos etc.     0.499110132               59.86    1.3900
  N6 floating point   0.999999821    345.95              1.5592
  N7 assignments      3.000000000              331.54    0.5574
  N8 exp,sqrt etc.    0.825148463               25.41    1.4640

  MWIPS                             1687.92              5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version 11.0.2-BellSoft
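
How far the two function tests dominate the Pi 3B+ rating can be seen from the millisecs column. The sketch below assumes, from the figures alone, that MWIPS for one pass equals 10000 divided by the total milliseconds. That relationship matches the printed totals but is inferred here, not taken from the program source, and the function names are hypothetical.

```python
# Milliseconds per test (N1 to N8), copied from the Oracle Java 8 results above.
times_ms = {
    "Pi 3B+": [0.0892, 0.6438, 0.9992, 0.5854, 11.8100, 5.0780, 0.5724, 26.9200],
    "Pi 4B": [0.0381, 0.2752, 0.3110, 0.3574, 1.9384, 1.5600, 0.5550, 1.4880],
}
SLOW_TESTS = (4, 7)  # zero-based indexes of N5 (sin,cos) and N8 (exp,sqrt)


def mwips(times):
    """Inferred rating: 10000 / total milliseconds for a single pass."""
    return 10000.0 / sum(times)


def slow_share(times):
    """Fraction of total time spent in the N5 and N8 function tests."""
    return sum(times[i] for i in SLOW_TESTS) / sum(times)


for system, times in times_ms.items():
    print(f"{system}: MWIPS {mwips(times):.2f}, "
          f"N5+N8 share of time {100 * slow_share(times):.0f}%")
```

Because the rating depends on the sum of the times, a test that takes most of the elapsed time has a matching influence on the overall MWIPS figure.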

JavaDraw Benchmark - JavaDrawPi.class

The benchmark draws simple objects, in numbers ranging from small to rather excessive, to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load. For this to run at maximum speed, it was necessary to disable the experimental GL driver.

Pi 4B performance gains were best on the most complex test function.

A later version was produced and run via OpenJDK11.

 ************************ Pi 3B+ ************************

   Java Drawing Benchmark, May 14 2019, 15:32:06
            Produced by javac 1.6.0_27

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      566    56.55
  Display PNG Bitmap Twice Pass 2      651    65.00
  Plus 2 SweepGradient Circles         665    66.45
  Plus 200 Random Small Circles        660    65.93
  Plus 320 Long Lines                  442    44.16
  Plus 4000 Random Small Circles       334    33.30

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.14.70-v7+
  Java Vendor         Oracle Corporation, Version 1.8.0_212

 ************************ Pi 4B ************************

   Java Drawing Benchmark, May 14 2019, 14:33:58
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      791    79.05
  Display PNG Bitmap Twice Pass 2      932    93.11
  Plus 2 SweepGradient Circles        1152   115.17
  Plus 200 Random Small Circles       1200   119.98
  Plus 320 Long Lines                  784    78.31
  Plus 4000 Random Small Circles       621    62.03

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.29-v7l+
  Java Vendor         Oracle Corporation, Version 1.8.0_212

 ******************* Pi 4B OpenJDK11 *******************

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version 11.0.2-BellSoft
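
The per-test gains behind the JavaDraw comparison can be listed with a short Python sketch. It uses only the Oracle Java 8 FPS figures for the two systems. The variable names are illustrative.

```python
tests = [
    "Display PNG Bitmap Twice Pass 1",
    "Display PNG Bitmap Twice Pass 2",
    "Plus 2 SweepGradient Circles",
    "Plus 200 Random Small Circles",
    "Plus 320 Long Lines",
    "Plus 4000 Random Small Circles",
]
fps_pi3 = [56.55, 65.00, 66.45, 65.93, 44.16, 33.30]    # Pi 3B+, Oracle Java 8
fps_pi4 = [79.05, 93.11, 115.17, 119.98, 78.31, 62.03]  # Pi 4B, Oracle Java 8

# Pi 4B / Pi 3B+ frame rate ratio for each test.
gains = [new / old for new, old in zip(fps_pi4, fps_pi3)]
for name, g in zip(tests, gains):
    print(f"{name:32s} {g:5.2f}")

best = tests[gains.index(max(gains))]
print("Largest gain:", best)
```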

OpenGL GLUT Benchmark - videogl32

In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress test of any duration.

After installing freeglut3, the benchmark ran as before. The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six tests, from simple to more complex. The first four tests portray moving up and down a tunnel that includes various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file:

  export vblank_mode=0                                     
  ./videogl32 Width 320, Height 240, NoEnd                 
  ./videogl32 Width 640, Height 480, NoHeading, NoEnd      
  ./videogl32 Width 1024, Height 768, NoHeading, NoEnd     
  ./videogl32 NoHeading

Following are results from the Pi 3B+ and Pi 4B. The early tests depend on graphics speed, with the later ones becoming dependent on CPU speed.


 ************************ Pi 3B+ ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Fri Apr 12 22:21:35 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    343.8    208.3     88.4     56.6     24.3     15.5
   640   480    243.0    170.3     82.8     54.5     24.2     15.5
  1024   768    110.6    101.2     63.6     47.8     24.1     15.4
  1920  1080     49.5     47.3     36.8     32.9     23.4     14.9

 ************************ Pi 4B ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May  2 19:01:05 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0
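
One way to see which tests are limited by graphics speed and which by CPU speed is to compare FPS at the smallest and largest window sizes. The Python sketch below does this for the Pi 4B figures. Treating a near constant frame rate as a sign of CPU dependence is an interpretation, not something the program reports.

```python
columns = ["Coloured Few", "Coloured All", "Textured Few", "Textured All",
           "WireFrm Kitchen", "Texture Kitchen"]

# Pi 4B FPS at the smallest and largest window sizes, from the table above.
fps_320x240 = [766.7, 371.4, 230.6, 130.2, 32.5, 22.7]
fps_1920x1080 = [81.4, 79.4, 74.6, 68.3, 30.8, 20.0]

pixel_ratio = (1920 * 1080) / (320 * 240)  # how many times more pixels to fill

# Factor by which FPS falls when moving to the largest window.
slowdowns = [small / large for small, large in zip(fps_320x240, fps_1920x1080)]

for name, slowdown in zip(columns, slowdowns):
    print(f"{name:16s} FPS falls by a factor of {slowdown:5.2f} "
          f"for {pixel_ratio:.0f}x the pixels")
```

A test whose frame rate hardly changes with window size is unlikely to be limited by pixel filling, which fits the kitchen tests here.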

Stress Tests - MP-IntStress, MP-FPUStress, MP-FPUStressDP

A series of stress tests have also been run on the Raspberry Pi 4B and these will be covered in a later report. They have command line parameters for running time, data size, number of threads, log number and complexity of calculations. In default mode, combinations of these are used to indicate relative performance, providing useful benchmarks. Following are Pi 3B+ and Pi 4B results.


 ************************ Pi 3B+ ************************

 MP-Integer-Test 32 Bit v1.0 Fri Jun 21 15:09:22 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   9.4    1   3497  3284  1813  00000000    Yes
   6.3    2   6994  6505  2123  FFFFFFFF    Yes
   5.6    4  13839 12528  1882  5A5A5A5A    Yes
   5.6    8  13723 13780  1872  AAAAAAAA    Yes
   5.6   16  13734 14049  1857  CCCCCCCC    Yes
   5.6   32  13499 13881  1879  0F0F0F0F    Yes

 ************************ Pi 4B ************************
  MP-Integer-Test 32 Bit v1.0 Fri Jun 21 15:39:57 2019

   4.9    1   5956  5754  3977  00000000    Yes
   3.6    2  11861 11429  3763  FFFFFFFF    Yes
   3.1    4  22998 21799  3464  5A5A5A5A    Yes
   3.1    8  22695 21128  3490  AAAAAAAA    Yes
   3.1   16  22835 23491  3485  CCCCCCCC    Yes
   3.0   32  22593 23485  3591  0F0F0F0F    Yes

  Average Gains Caches 1.68, RAM 1.91
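
The average gains quoted for the integer test are consistent with the arithmetic mean of the individual Pi 4B / Pi 3B+ ratios, taking the 16 KB and 160 KB columns as caches and the 16 MB column as RAM. The Python sketch below shows that calculation. The averaging method is inferred from the figures, and the function name mean_gain is hypothetical.

```python
# MB/second columns (16 KB, 160 KB, 16 MB) for 1 to 32 threads, from the tables above.
pi3 = [(3497, 3284, 1813), (6994, 6505, 2123), (13839, 12528, 1882),
       (13723, 13780, 1872), (13734, 14049, 1857), (13499, 13881, 1879)]
pi4 = [(5956, 5754, 3977), (11861, 11429, 3763), (22998, 21799, 3464),
       (22695, 21128, 3490), (22835, 23491, 3485), (22593, 23485, 3591)]


def mean_gain(new_rows, old_rows, cols):
    """Arithmetic mean of new/old ratios over the chosen column indexes."""
    ratios = [new[c] / old[c]
              for new, old in zip(new_rows, old_rows) for c in cols]
    return sum(ratios) / len(ratios)


print(f"Caches {mean_gain(pi4, pi3, cols=(0, 1)):.2f}")  # 16 KB and 160 KB
print(f"RAM    {mean_gain(pi4, pi3, cols=(2,)):.2f}")    # 16 MB
```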

 ************************ Pi 3B+ ************************

  MP-Threaded-MFLOPS 32 Bit v1.0 Fri Jun 21 15:10:28 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.1    T1   2   857   849   414   40392  76406  99700
   5.5    T2   2  1661  1678   411   40392  76406  99700
   7.4    T4   2  3086  3336   413   40392  76406  99700
   9.4    T8   2  3194  3168   414   40392  76406  99700
  13.8    T1   8  1942  1935  1495   54756  85091  99820
  16.7    T2   8  3756  3824  1659   54756  85091  99820
  19.0    T4   8  7209  7528  1643   54756  85091  99820
  21.3    T8   8  6978  7341  1657   54756  85091  99820
  36.8    T1  32  2019  2050  1915   35296  66020  99519
  44.6    T2  32  4078  4031  3757   35296  66020  99519
  48.9    T4  32  7927  7910  6095   35296  66020  99519
  53.1    T8  32  7919  8141  6336   35296  66020  99519

 ************************ Pi 4B ************************

  MP-Threaded-MFLOPS 32 Bit v1.0 Sun May 26 21:23:49 2019

   1.6    T1   2  2134  2607   656   40392  76406  99700
   2.9    T2   2  5048  5156   621   40392  76406  99700
   4.0    T4   2  7536  9939   681   40392  76406  99700
   5.2    T8   2  7934  9839   639   40392  76406  99700
   7.2    T1   8  5535  5420  2569   54756  85091  99820
   8.7    T2   8 10757 10732  2454   54756  85091  99820
  10.1    T4   8 18108 20703  2444   54756  85091  99820
  11.5    T8   8 19236 20286  2245   54756  85091  99820
  17.4    T1  32  5309  5270  5262   35296  66020  99519
  20.4    T2  32 10551 10528  9753   35296  66020  99519
  22.4    T4  32 20120 20886 11064   35296  66020  99519
  24.5    T8  32 19415 20464  9929   35296  66020  99519

  Average Gains Caches 2.72, RAM 1.75

 ************************ Pi 3B+ ************************

  MP-Threaded-MFLOPS 32 Bit v1.0 Fri Jun 21 15:11:41 2019

   Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   9.7    T1   2   215   213   173   40395  76384  99700
  15.9    T2   2   420   426   206   40395  76384  99700
  20.6    T4   2   819   830   205   40395  76384  99700
  25.3    T8   2   807   823   205   40395  76384  99700
  41.4    T1   8   508   502   437   54805  85108  99820
  49.8    T2   8  1002  1008   778   54805  85108  99820
  55.8    T4   8  1985  1955   768   54805  85108  99820
  61.6    T8   8  1974  1958   817   54805  85108  99820
 100.5    T1  32   799   794   775   35159  66065  99521
 120.1    T2  32  1595  1588  1533   35159  66065  99521
 130.5    T4  32  3115  3087  2731   35159  66065  99521
 140.7    T8  32  3154  3126  2821   35159  66065  99521

 ************************ Pi 4B ************************

  MP-Threaded-MFLOPS 32 Bit v1.0 Sun May 26 21:26:37 2019

   Double Precision Benchmark 1, 2, 4 and 8 Threads

   3.4    T1   2   921   998   326   40395  76384  99700
   6.1    T2   2  1968  1995   308   40395  76384  99700
   8.4    T4   2  3465  3925   342   40395  76384  99700
  10.9    T8   2  3646  3702   301   40395  76384  99700
  15.1    T1   8  2377  2446  1283   54805  85108  99820
  18.1    T2   8  4916  4860  1326   54805  85108  99820
  20.5    T4   8  9202  9510  1391   54805  85108  99820
  23.1    T8   8  9090  9006  1298   54805  85108  99820
  34.5    T1  32  2695  2725  2707   35159  66065  99521
  40.3    T2  32  5416  5441  5121   35159  66065  99521
  44.1    T4  32 10666 10831  5275   35159  66065  99521
  48.3    T8  32 10427 10602  4832   35159  66065  99521

  Average Gains Caches 4.23, RAM 2.09
