Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests

Roy Longbottom

Contents


Summary Introduction Whetstone Benchmark
Dhrystone Benchmark Linpack 100 Benchmark Livermore Loops Benchmark
FFT Benchmarks BusSpeed Benchmark MemSpeed Benchmark
NeonSpeed Benchmark MultiThreading Benchmarks MP-Whetstone Benchmark
MP-Dhrystone Benchmark MP NEON Linpack Benchmark MP-BusSpeed Benchmark
MP-BusSpeed Disassembly MP-RandMem Benchmark MP-MFLOPS Benchmarks
MP-MFLOPS Disassembly MP-MFLOPS Sumchecks OpenMP-MFLOPS Benchmarks
OpenMP-MemSpeed Benchmarks Stress Testing Benchmarks Integer Stressing Benchmark
Single Precision Stress Benchmark Double Precision Stress Benchmark High Performance Linpack
DriveSpeed Benchmark USB 3 and 2 Benchmarks Drive Write/Reboot/Read Tests
LAN and WiFi Benchmarks Java Whetstone Benchmark JavaDraw Benchmark
OpenGL Benchmark Stress Tests HP Linpack Stress Test
Integer Stress Test Single Precision FPU Stress Test Double Precision FPU Stress Test
OpenGL + 3 x Livermore Loops Input/Output Stress Test CPU + Main SD + USB + LAN Test

Summary

Previously, I have run my 32 bit and 64 bit benchmarks and stress tests on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in a PDF file. I have also run the 32 bit versions on the Raspberry Pi 4, with results in the Raspberry Pi 4 Benchmarks PDF file and the Raspberry Pi 4 Stress Tests PDF file. This new report contains brief reminders of the benchmarks, with 64 bit results on the Raspberry Pi 4 and Pi 3B+ using the Gentoo operating system. Pi 4/Pi 3B+ comparisons are included, then others with 32 bit systems and later gcc 9 compilations. The range of benchmarking targets was as follows.

Single Core CPU Tests - comprising Whetstone, Dhrystone, Linpack and Livermore Loops Classic Benchmarks.

Single Core Memory Benchmarks - measuring performance using data from caches and RAM. These comprise FFTs with floating point, BusSpeed, with integer arithmetic, then MemSpeed and NeonSpeed with both.

Multithreading Benchmarks - Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. Some are multithreaded versions of the previous programs, comprising Whetstone, Dhrystone, Linpack and BusSpeed benchmarks. Then there is MP-RandMem for random access and MP-MFLOPS for high speed floating point. Finally, there are OpenMP versions of the latter and MemSpeed.

Stress Testing Benchmarks - The Raspberry Pi 4 can become excessively hot and might need a cooling fan attachment for efficient operation of certain applications. Stress tests are detailed later; this area covers benchmarks intended to identify which area to test. Three programs cover floating point and integer arithmetic, with different processing profiles, accessing caches and RAM. Then there is High Performance Linpack that can be a killer.

Input/Output Benchmarks - DriveSpeed and LanSpeed are used to measure performance of the main SD card, USB connected storage and networks via WiFi or Ethernet.

Java and OpenGL Benchmarks - A Java Whetstone benchmark is provided and one using JavaDraw procedures. The OpenGL benchmark has six test functions of increasing complexity and is run using a range of different window sizes.

Stress Tests - Stress tests mainly have run time options to specify running time, memory used and alternative test functions, then run with continuous displays showing any changes in performance. An extra program measured CPU MHz, temperature and voltage. The main CPU stress tests are mentioned above; the Livermore Loops and OpenGL benchmark programs can also be used, along with one geared up to exercise input/output. Stress test results identify cases of temperature related CPU speed throttling down to 600 MHz, with temperatures up to 85C, when a cooling fan is not fitted.

Performance Comparisons - More than 1400 comparisons are provided. For the main 1000 plus applicable to CPU speed, the Pi 4 was on average faster than the Pi 3B+, with average, minimum and maximum ratios of 2.62, 0.70 and 16.8 times, the latter arising from use of the larger L2 cache. There were also average performance gains from 64 bit compilations, compared with those at 32 bits, and some losses, the three ratios being 1.28, 0.31 and 4.90. The same story applied to gcc 9 versus gcc 6 compilations at 1.16, 0.37 and 2.93. A key area is maximum floating point speed running the High Performance Linpack Benchmark, with the 4 GB Pi 4 achieving more than 10 GFLOPS, presumably at double precision, close to my benchmark's score of 13, with single precision at 26.

Other Issues

Dual Monitors - handled in different ways. Gentoo provided mirroring or a wide image squashed on one monitor. Raspbian spread wide images across both displays, but had no mirroring option.

C Direct I/O - This worked as expected at 32 bits but in the 64 bit Gentoo version could lead to failure to write or read. Separate write and read programs were produced to enable performance to be measured.

5 GHz WiFi - There were difficulties in connecting at 5 GHz using Raspbian, and it seemed to be impossible using Gentoo.



Introduction

The Raspberry Pi 4B uses a quad core ARM A72 CPU, with 32 KB L1 cache and shared 1 MB L2 cache. RAM is 3200-LPDDR4 with 1, 2 or 4 GB options. Other enhancements are USB 3 connections and gigabit Ethernet. The benchmarks and stress tests covered here were run on 4 GB models.

Previously, I have run my 32 bit and 64 bit benchmarks and stress tests on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in a Raspberry Pi 3B 32 bit and 64 bit Benchmarks and Stress Tests PDF file. I have also run the 32 bit versions on the Raspberry Pi 4, with results in Raspberry Pi 4 Benchmarks PDF file and Raspberry Pi 4 Stress Tests PDF file. This new report contains brief reminders of the benchmarks, with 64 bit results on the Raspberry Pi 4 and Pi 3B+ using Gentoo Operating System. Pi 4/Pi 3B+ comparisons are included, then others with 32 bit systems and later gcc 9 compilations. The programs and source codes for the original 64 bit versions are available for downloading in Raspberry-Pi-4-Benchmarks.tar.gz, and the new gcc 9 compilations in Raspberry-Pi-4-64-Bit-Benchmarks.tar.gz.

New gcc 9 program versions - On producing these, the first step was to change the functions used to identify the hardware, where the existing procedures replicate information for each core (even four lots were too much). I noted that the lscpu command now provides adequate detail, so I use this now. RPi 3B+ and RPi 4B CPUID results are now as follows:

             Pi 3B+

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1400.0000
CPU min MHz:         600.0000
BogoMIPS:            38.40
Flags:               fp asimd evtstrm crc32 cpuid
Linux pi64 4.19.67-v8-174fcab91765-bis+ #2 SMP PREEMPT
Tue Aug 27 13:29:20 GMT 2019 aarch64 GNU/Linux


             Pi 4B

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               fp asimd evtstrm crc32 cpuid
Linux pi64 4.19.67-v8-174fcab91765-p4-bis+ #2 SMP PREEMPT 
Tue Aug 27 13:58:09 GMT 2019 aarch64 GNU/Linux
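As an illustration of the lscpu approach, the following minimal Python sketch turns "Key: value" output of the kind listed above into a dictionary. It is not the code used in the benchmarks, which are C programs; the function name parse_lscpu and the embedded sample text are assumptions made for this example.

```python
def parse_lscpu(text):
    """Turn 'Key:   value' lines, as printed by lscpu, into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain commas etc.
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skip lines with no 'key: value' pair
            info[key.strip()] = value.strip()
    return info


# Sample lines copied from the Pi 4B listing above.
SAMPLE = """\
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
CPU(s):              4
Model name:          Cortex-A72
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
"""

cpu = parse_lscpu(SAMPLE)
print(cpu["Model name"], cpu["CPU(s)"], "cores,",
      float(cpu["CPU max MHz"]), "MHz max")
```

On a live system the text could instead be obtained with subprocess.run(["lscpu"], capture_output=True, text=True).stdout.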
   




Whetstone Benchmark - whetstonePi64, whetstonePi64g9, whetstonePiA7

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, lately those identified as COS and EXP. The last three can be over optimised (N/A), but the time does not affect the overall rating much.

For this simple code, at 64 bits, average Pi 4 performance gain, over the Pi 3B+, was 2.12 times, but only around 1.3 times for straightforward floating point calculations. Then, as should be expected, the Pi 4B 32 bit speed was not much slower.

Performance of the gcc 9 compilations for the Pi 4B was effectively the same as the earlier versions. The Pi 3B+ results indicated improvements, but this was due to the EXP type function calculations. The new compilation included a minor tweak for the IF tests, to avoid over optimisation.

System       MHz  MWIPS   ------MFLOPS------    ------------MOPS---------------
                             1      2      3    COS    EXP  FIXPT     IF  EQUAL

Pi 3B+      1400   1071    383    403    328   20.9   12.4   1704    N/A   1357
Pi 4B       1500   2269    522    534    398   54.8   39.8   2487    N/A    997
Pi4/3B+     1.07   2.12   1.36   1.32   1.21   2.63   3.21   1.46    N/A   0.73

Pi 4B 32b   1500   1884    516    478    310   54.7   27.1   2498   2247    999
64b/32b     1.00   1.20   1.01   1.12   1.28   1.00   1.47   1.00    N/A   1.00

 ===========================================================================

                                 gcc 9                                         

Pi 3B+      1400   1482    384    404    329   27.4   28.2   1712   2042   1362
Pi 4B       1500   2330    522    533    398   60.4   40.3   2493   2984    997
Pi4/3B+     1.07   1.57   1.36   1.32   1.21   2.21   1.43   1.46   1.46   0.73

                                 gcc 9/6                                       
Pi 4B       1.00   1.03   1.00   1.00   1.00   1.10   1.01   1.00    N/A   1.00


  


Dhrystone Benchmark - dhrystonePi64, dhrystonePi64g9, dhrystonePiA7

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so you can't compare results from different compilers. Ignoring this, results in VAX MIPS (aka DMIPS) and comparisons follow. This benchmark has no significant data arrays suitable for vectorisation.

Using the same 64 bit program, the Pi 4 was more than twice as fast as the Pi 3B+, and 52% faster than the 32 bit compilation.

The gcc 9 compilations led to no real difference in performance.

                 Compiled  DMIPS
System       MHz   DMIPS    /MHz
 
Pi 3B+      1400    4028    2.88
Pi 4B       1500    8176    5.45
Pi4/3B+     1.07    2.03        

Pi 4B 32b   1500    5366    3.58
64b/32b     1.00    1.52        

 ===============================

             gcc 9            
    
Pi 3B+      1400    3896    2.78
Pi 4B       1500    8190    5.46
Pi4/3B+     1.07    2.10        

gcc 9/6                         
Pi 4B       1.00    1.00        
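The DMIPS/MHz column and the ratio rows above are simple quotients of the printed figures. The short Python sketch below reproduces them for the gcc 6 rows; the dictionary layout and function names are assumptions made for illustration only.

```python
# DMIPS and clock speeds copied from the gcc 6 rows of the table above.
results = {
    "Pi 3B+":    {"mhz": 1400, "dmips": 4028},
    "Pi 4B":     {"mhz": 1500, "dmips": 8176},
    "Pi 4B 32b": {"mhz": 1500, "dmips": 5366},
}


def dmips_per_mhz(name):
    """DMIPS divided by CPU MHz, rounded to two places as in the table."""
    r = results[name]
    return round(r["dmips"] / r["mhz"], 2)


def ratio(faster, slower):
    """Relative DMIPS of two systems, rounded to two places."""
    return round(results[faster]["dmips"] / results[slower]["dmips"], 2)


for name in results:
    print(f"{name:10s} {dmips_per_mhz(name):5.2f} DMIPS/MHz")
print("Pi4/3B+ ", ratio("Pi 4B", "Pi 3B+"))
print("64b/32b ", ratio("Pi 4B", "Pi 4B 32b"))
```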


  


Linpack 100 Benchmark MFLOPS - linpackPi64, linpackPiSP64, linpackPiNEONi64, linpackPi64g9, linpackPi64g9SP, linpackPi64NEONig9, linpackPiA7, linpackPiA7SP

The original Linpack benchmark specified the use of double precision (DP) floating point arithmetic, and the code used here is identical to that initially approved for use on old PCs. For the benefit of early ARM computers, the code is also run using single precision (SP) numbers. A version was also produced, replacing the key Daxpy code with NEON Intrinsic Functions, using vector operations, also with single precision calculations.
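For readers unfamiliar with Daxpy, it updates one vector with a scaled copy of another, y = y + a * x, costing one multiply and one add per element. The following is a plain Python sketch of that operation only, not the benchmark's C code or its NEON Intrinsic version; the function names are hypothetical.

```python
def daxpy(a, x, y):
    """Update y in place with y[i] = y[i] + a * x[i]."""
    for i in range(len(x)):
        y[i] += a * x[i]
    return y


def daxpy_flops(n):
    """Floating point operations for n elements: one multiply, one add each."""
    return 2 * n


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(daxpy(0.5, x, y))      # [10.5, 21.0, 31.5, 42.0]
print(daxpy_flops(len(x)))   # 8
```

Because every element is handled independently, a loop of this shape is a natural candidate for vector instructions, which is why it was the part replaced with NEON Intrinsic Functions.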

The Pi 3B+ 32 bit results are also provided for clarification. My results were highlighted in the MagPi magazine, on announcement of the Pi 4, particularly the 2 GFLOPS 32 bit NEON speed. See raspberry-pi-4-specs-benchmarks.

At 64 bits, Pi 4/3B+ performance ratios were generally higher than those from the earlier benchmarks. Then, as could be expected, virtually compiler independent performance, using NEON Intrinsic Functions, was similar at 32 bits and 64 bits. The main 64 bit gain was with the compiled single precision version, obtaining the same performance as that via NEON Intrinsics.

The new gcc 9 compilations produced the same performance as the older versions, within the variations normally seen on this benchmark.

                    ------ MFLOPS ------       
 
System      MHz      DP      SP  SP NEON

Pi 3B+      1400   396.6   562.1   604.2
Pi 4B       1500  1059.9  1977.8  1968.6
Pi4/3B+     1.07    2.67    3.52    3.26

Pi 4B 32b   1500   760.2   921.6  2010.5
64b/32b     1.00    1.39    2.15    0.98

Pi 3B+ 32   1400   210.5   225.2   562.5
Pi4/3B+     1.07    3.61    4.09    3.57

 =======================================

                  gcc 9                 

Pi 3B+      1400   396.2   571.3   566.7
Pi 4B       1500  1110.6  2052.4  1887.5
Pi4/3B+     1.07    2.80    3.59    3.33

gcc 9/6                                 
Pi 4B       1.00    1.05    1.04    0.96



  


Livermore Loops Benchmark MFLOPS - liverloopsPi64, liverloopsPi64g9, liverloopsPiA7

This original main benchmark for supercomputers was first introduced in 1970, initially comprising 14 kernels of numerical application, written in Fortran. This was increased to 24 kernels in the 1980s. Following are overall MFLOPS ratings, geometric mean being the official average performance, followed by details from the 24 kernels. Note that these are for double precision calculations.

All the ratings indicate reasonably significant performance gains of Pi 4 over Pi 3B+ and 64 bits over 32 bits. Results from the 24 kernels indicate some higher gains. Also note the maximum speed of 2.49 GFLOPS (Double Precision).

The speed of the original Raspberry Pi could be rated as 4.5 times faster than the Cray 1 supercomputer (Geomean 11.9) - see my quote in this report. Now, one core of the Raspberry Pi 4B, at 64 bits, produces performance equivalent to 61 Cray 1 supercomputers.
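The "61 Cray 1 supercomputers" figure is the ratio of the two geometric means. A minimal check, using only the numbers quoted in this section (the variable names are for illustration):

```python
CRAY1_GEOMEAN = 11.9      # MFLOPS, Cray 1 geometric mean quoted above
PI4B_64_GEOMEAN = 730.3   # MFLOPS, Pi 4B 64 bit Geomean in the table below

cray1_equivalents = PI4B_64_GEOMEAN / CRAY1_GEOMEAN
print(f"{cray1_equivalents:.1f}")   # 61.4
```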

There were some performance differences in gcc 9 results but average speeds were quite similar.

                   Overall Ratings - MFLOPS          

System      MHz Maximum Average Geomean Harmean Minimum

Pi 3B+ 64b 1400   737.7   319.4   284.7   250.6    91.6
Pi 4B  64b 1500  2490.5     892   730.3   603.3   212.4
Pi4/3B+    1.07    3.38    2.79    2.57    2.41    2.32

Pi 4B 32b  1500  1800.2   635.1   519.0   416.1   155.3
64b/32b    1.00    1.38    1.40    1.41    1.45    1.37

 ======================================================
                            gcc 9                      
Pi 3B+     1400  1000.7   347.8   308.0   275.2   117.3
Pi 4B      1500  2744.5   962.5   768.2   596.2   132.1
Pi4/3B+    1.07    2.74    2.77    2.49    2.17    1.13

                            gcc 9/6                    
Pi 4B      1.00    1.10    1.08    1.05    0.99    0.62

                      MFLOPS for 24 loops             
 
MFLOPS Of 24 Kernels

Pi 3B+     540   296   539   527   226   175   738   428   484   251   169   245
           127   161   291   258   440   520   333   280   310    93   362   209

Pi 4B     2026   997   987   948   372   739  2033  2491  1980   758   495   875
           220   404   811   710   753  1124   444   397  1061   414   822   283

Pi 4B/    3.75  3.37  1.83  1.80  1.65  4.23  2.76  5.83  4.09  3.02  2.92  3.57
Pi 3B+    1.73  2.51  2.79  2.75  1.71  2.16  1.33  1.42  3.43  4.48  2.27  1.36
          Min   1.33  Max   5.83

Pi 4B 32   746   964   988   943   212   538  1169  1800  1032   469   214   186
           159   335   778   623   732  1034   320   350   489   360   749   187

64b/32b   2.72  1.03  1.00  1.00  1.76  1.37  1.74  1.38  1.92  1.62  2.31  4.70
          1.38  1.20  1.04  1.14  1.03  1.09  1.39  1.13  2.17  1.15  1.10  1.51
          Min   1.00  Max   4.70

 ===========================================================================

                                     gcc9                                       

Pi 3B+     565   320   319   535   227   207  1001   581   541   234   171   248
           121   160   293   280   456   547   337   287   367   190   386   209

Pi 4B     2146   989   970   965   390   785  2386  2479  1879   632   500   973
           134   423   814   670   726  1177   450   397  1675   561   818   283

Pi 4B/    3.80  3.09  3.04  1.80  1.72  3.80  2.38  4.27  3.48  2.70  2.93  3.93
Pi 3B+    1.10  2.65  2.78  2.39  1.59  2.15  1.33  1.39  4.56  2.95  2.12  1.35
          Min   1.10  Max   4.56

                                    gcc 9/6                                     

Pi 4B     1.06  0.99  0.98  1.02  1.05  1.06  1.17  1.00  0.95  0.83  1.01  1.11
          0.61  1.05  1.00  0.94  0.96  1.05  1.01  1.00  1.58  1.35  1.00  1.00
          Min   0.61  Max   1.58
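The overall ratings use five summary statistics: maximum, arithmetic mean, geometric mean, harmonic mean and minimum. The sketch below applies them to the 24 Pi 4B 64 bit kernel results listed above, using Python's statistics module. It is an illustration of the definitions only: the printed overall ratings appear to be derived from more than this one set of 24 figures, so the output is not expected to match the Overall Ratings table exactly.

```python
from statistics import mean, geometric_mean, harmonic_mean

# MFLOPS for the 24 kernels, copied from the Pi 4B 64 bit (gcc 6) rows above.
pi4b_kernels = [
    2026, 997, 987, 948, 372, 739, 2033, 2491, 1980, 758, 495, 875,
    220, 404, 811, 710, 753, 1124, 444, 397, 1061, 414, 822, 283,
]

summary = {
    "Maximum": max(pi4b_kernels),
    "Average": mean(pi4b_kernels),
    "Geomean": geometric_mean(pi4b_kernels),
    "Harmean": harmonic_mean(pi4b_kernels),
    "Minimum": min(pi4b_kernels),
}

for name, value in summary.items():
    print(f"{name:8s} {value:8.1f}")
```

For any set of positive values the harmonic mean is no larger than the geometric mean, which is no larger than the arithmetic mean, matching the ordering of the Average, Geomean and Harmean columns.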


   


Fast Fourier Transforms Benchmarks - fft1-RPi64, fft3c-RPi64, fft1Pi64g9,
fft3cPi64g9, fft1-RPi2, fft3c-Rpi2

This is a real application provided by my collaborator at Compuserve Forum. There are two versions. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements are made, at each size, using both single and double precision data, calculating FFT sizes between 1K and 1024K. Results are in milliseconds, with those here, the average of three measurements.
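The measurement method described above (several timed runs per FFT size, reported as an average in milliseconds) can be sketched as follows. This is a textbook recursive radix-2 FFT in pure Python, used here only as a stand-in: it is neither the FFT1 nor the FFT3 code, it handles double precision only, and the function names and test data are assumptions.

```python
import cmath
import time


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def average_ms(size_k, repeats=3):
    """Average time in milliseconds over `repeats` FFTs of size_k * 1024 points."""
    data = [complex(i % 7, 0.0) for i in range(size_k * 1024)]
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        fft(data)
        total += time.perf_counter() - start
    return 1000.0 * total / repeats


# Pure Python is slow, so only the two smallest sizes are timed here.
for size_k in (1, 2):
    print(f"{size_k:4d}K {average_ms(size_k):10.2f} ms")
```

Timings from this sketch are not comparable with the compiled results in the tables; it shows only the shape of the measurement loop.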

There were gains all round on the Pi 4, compared with the 3B+, mainly between 3 and 4 times on the optimised version, less so using FFT1, with more data transfer speed dependency.

On the Pi 4, performance from the 32 bit compilation was often similar to that at 64 bits. This is probably due to much of the data being read on a skipped sequential basis, not good for vectorisation.

The Pi 4B/3B+ performance gains were similar using both gcc 9 and gcc 6 compiled programs, but the gcc 9 compilation produced some faster FFT1 speeds, as shown in the Pi 4B gcc 9/6 comparisons.

            Gentoo 64b Pi 3B+

    Size    FFT1            FFT3
       K      SP      DP      SP      DP

       1    0.13    0.15    0.15    0.17
       2    0.29    0.39    0.32    0.38
       4    0.76    1.13    0.79    0.85
       8    1.93    2.66    1.77    1.94
      16    4.02    5.51    4.69    5.14
      32    9.50   25.11    9.51   13.67
      64   42.53  110.21   25.30   32.25
     128  151.08  257.41   57.68   76.71
     256  355.88  589.07  129.47  174.85
     512  819.91 1324.89  297.80  390.74
    1024 1746.23 2943.08  641.50  863.82

            Gentoo 64b Pi 4B                Pi4/3B+
  
    Size    FFT1            FFT3            FFT1            FFT3
       K      SP      DP      SP      DP      SP      DP      SP      DP

       1    0.04    0.04    0.04    0.04    3.30    3.62    3.60    4.13
       2    0.08    0.14    0.11    0.09    3.81    2.88    2.82    4.03
       4    0.25    0.38    0.19    0.22    3.05    2.93    4.13    3.86
       8    0.79    1.31    0.46    0.50    2.45    2.04    3.87    3.87
      16    2.15    2.91    1.15    1.09    1.87    1.89    4.07    4.71
      32    5.71    6.76    2.48    3.18    1.66    3.71    3.83    4.30
      64   15.22   51.00    5.43    9.29    2.79    2.16    4.66    3.47
     128   83.47  151.95   16.28   24.75    1.81    1.69    3.54    3.10
     256  231.24  362.64   39.13   57.28    1.54    1.62    3.31    3.05
     512  561.16  765.18   90.20  133.21    1.46    1.73    3.30    2.93
    1024 1250.51 1878.44  213.35  303.39    1.40    1.57    3.01    2.85

            Raspbian 32b Pi 4B              64b/32b

    Size    FFT1            FFT3            FFT1            FFT3
       K      SP      DP      SP      DP      SP      DP      SP      DP
   
       1    0.04    0.04    0.06    0.05    0.99    0.96    1.44    1.18
       2    0.08    0.12    0.13    0.11    1.04    0.89    1.14    1.18
       4    0.32    0.37    0.27    0.24    1.28    0.96    1.42    1.09
       8    0.77    0.97    0.58    0.55    0.98    0.74    1.26    1.09
      16    1.69    2.01    1.49    1.35    0.78    0.69    1.29    1.24
      32    4.37    4.89    2.96    3.63    0.77    0.72    1.19    1.14
      64    9.12   26.55    7.46   10.75    0.60    0.52    1.37    1.16
     128   55.52  160.11   17.93   26.03    0.67    1.05    1.10    1.05
     256  305.92  423.06   41.16   55.06    1.32    1.17    1.05    0.96
     512  833.10  854.88   86.93  120.53    1.48    1.12    0.96    0.90
    1024 1617.49 1875.52  190.28  266.60    1.29    1.00    0.89    0.88


  
 ===========================================================================

            Gentoo Pi 3B+ gcc 9             Gentoo Pi 4B gcc 9

    Size    FFT1            FFT3            FFT1            FFT3
       K      SP      DP      SP      DP      SP      DP      SP      DP

       1    0.15    0.16    0.15    0.14    0.04    0.04    0.04    0.04
       2    0.34    0.39    0.31    0.31    0.08    0.13    0.08    0.09
       4    0.89    1.00    0.82    0.79    0.19    0.33    0.19    0.21
       8    2.19    2.70    1.66    1.89    0.71    0.74    0.46    0.46
      16    4.32    5.94    4.88    5.32    1.63    2.06    1.17    1.09
      32   12.47   24.05    9.59   14.82    3.73    4.03    2.44    3.09
      64   66.46  116.11   26.53   36.64    7.92   27.12    5.46    9.06
     128  169.06  268.02   63.65   84.00   43.28  100.75   16.09   22.00
     256  401.86  600.72  141.83  195.69  192.57  254.20   37.08   49.76
     512  853.48 1266.96  329.26  435.23  590.20  651.24   82.54  110.23
    1024 1966.69 2808.07  721.36  981.82 1463.15 1749.37  202.20  251.71

            Pi 4B/3B+                       Pi 4B gcc 9/6

       1    3.53    3.77    3.63    3.78    0.97    0.98    1.02    1.18
       2    4.39    3.05    3.97    3.64    1.00    1.06    1.46    1.08
       4    4.75    3.03    4.23    3.81    1.34    1.16    0.98    1.06
       8    3.06    3.62    3.62    4.10    1.10    1.76    1.00    1.09
      16    2.65    2.89    4.16    4.89    1.32    1.41    0.98    1.00
      32    3.34    5.97    3.93    4.79    1.53    1.68    1.02    1.03
      64    8.39    4.28    4.85    4.04    1.92    1.88    0.99    1.03
     128    3.91    2.66    3.96    3.82    1.93    1.51    1.01    1.12
     256    2.09    2.36    3.82    3.93    1.20    1.43    1.06    1.15
     512    1.45    1.95    3.99    3.95    0.95    1.17    1.09    1.21
    1024    1.34    1.61    3.57    3.90    0.85    1.07    1.06    1.21


BusSpeed Benchmark - busSpdPi64, busspeedPi64g9, busspeedPiA7

This is a read only benchmark with data from caches and RAM. The program reads one word with 32 word address increments before the next, followed by reading after decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds. The two comparison columns are for two word and one word increments.
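The access pattern can be sketched in a few lines of Python. This shows only which words are touched at each increment; it is not the benchmark's C code, and summing the words is an assumed stand-in for whatever integer operation the real program applies. The names strided_read and KBYTES are hypothetical.

```python
def strided_read(words, inc):
    """Read every inc-th word; return (number of words read, running total)."""
    total = 0
    count = 0
    for i in range(0, len(words), inc):
        total += words[i]
        count += 1
    return count, total


KBYTES = 16
words = list(range(KBYTES * 1024 // 4))   # array of 4 byte words

for inc in (32, 16, 8, 4, 2, 1):          # an increment of 1 is "Read All"
    count, _ = strided_read(words, inc)
    print(f"Inc{inc:<2d} reads {count:5d} of {len(words)} words")
```

Where memory delivers data in bursts, words skipped by a large increment may still be transferred, so speed measured on the words actually used would be expected to fall as the increment grows. That is the effect visible in the RAM rows of the tables below.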

Most data transfers were 2.0 to 2.5 times faster on the Pi 4, including from RAM, and somewhat higher with L2 cache based data.

The 64 bit version still deals with 32 bit words but transferred data somewhat quicker than the 32 bit program, as shown by the Pi 4 results.

Results from the gcc 9 compilations were virtually the same as those from gcc 6.

                   Gentoo 64b Pi 3B+

     BusSpeed armv8 64 Bit Fri Aug 16 12:53:43 2019

     Reading Speed 4 Byte Words in MBytes/Second
     Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
     KBytes  Words  Words  Words  Words  Words    All

         16   3819   4253   4622   5041   5089   3870
         32   1234   1328   2067   3158   4082   3674
         64    681    704   1325   2208   3350   3602
        128    638    646   1214   2070   3238   3625
        256    592    617   1165   1991   3164   3622
        512    295    309    640    985   2085   2790
       1024    108    120    271    525   1070   1636
       4096     98    123    249    486    881   1840
      16384    121    114    246    480    977   1642
      65536    121    124    248    409    989   1864
 
                     Gentoo 64b Pi 4B                  Inc2  Rd All
                                                     4B/3B+  4B/3B+

         16   4999   5042   5665   5885   5891   8217   1.16   2.12
         32   1578   2105   3283   4339   5154   7507   1.26   2.04
         64    585    911   1855   3085   5163   7918   1.54   2.20
        128    590    932   1888   3110   5161   7874   1.59   2.17
        256    598    934   1908   3056   5265   7883   1.66   2.18
        512    603    939   1822   3019   5124   7716   2.46   2.77
       1024    319    482   1060   1885   3283   5721   3.07   3.50
       4096    209    253    503   1006   2009   4111   2.28   2.23
      16384    209    261    520   1041   2071   4115   2.12   2.51
      65536    203    263    489   1011   2023   4036   2.05   2.17

                       Raspbian 32b Pi 4B             Rd All
                                                     64b/32b

         16   3836   4049   4467   5885   4641   5858   1.14
         32    761   1473   2594   3216   3960   4780   1.01
         64    409    801   1684   2422   3745   3940   0.95
        128    406    803   1202   1914   3037   5377   1.32
        256    415    700   1165   2481   4789   5137   1.27
        512    392    760   1243   2455   3764   4264   1.38
       1024    230    256    623   1061   2455   3501   1.59
       4096    197    214    454    938   1852   3195   1.80
      16384    138    215    445    897   1724   3210   1.91
      65536    174    215    398    744   1655   3130   1.61


  
 =====================================================================

                   Gentoo 64b Pi 3B+ gcc 9

     BusSpeed 64 Bit gcc 9 Thu Sep 26 12:51:15 2019
     BusSpeed armv8 64 Bit Fri Aug 16 12:53:43 2019

     Reading Speed 4 Byte Words in MBytes/Second
     Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
     KBytes  Words  Words  Words  Words  Words    All

         16   3860   4283   4677   4901   5022   3591
         32   2228   2433   2989   4740   4912   3629
         64    700    697   1299   2200   3310   3348
        128    637    636   1208   2064   3151   3396
        256    597    600   1161   1945   3105   3377
        512    232    194    500    884   1629   2350
       1024    118    131    159    440    692   1682
       4096     91     99    197    463    923   1878
      16384    119    117    200    392    775   1606
      65536    101    105    238    464    873   1876

                     Gentoo 64b Pi 4B                Rd All  Rd All
                                                     4B/3B+ gcc 9/6

         16   4815   5060   5573   5808   5741   8935   2.49   1.09
         32   1534   1828   2967   4254   4930   7825   2.16   1.04
         64    792   1007   1988   3269   4844   8062   2.41   1.02
        128    730    950   1881   3133   5007   8162   2.40   1.04
        256    733    955   1901   3128   5071   8236   2.44   1.04
        512    737    952   1885   3139   5058   8237   3.51   1.07
       1024    374    539   1047   1884   3177   5537   3.29   0.97
       4096    235    255    497    990   1975   3386   1.80   0.82
      16384    239    263    501    913   1984   3973   2.47   0.97
      65536    239    237    502    995   1984   3971   2.12   0.98


MemSpeed Benchmark MB/Second - memSpdPi64, memSpdPi64g9, memspeedPiA7

MemSpeed benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of cache and RAM data, normally sized 2 x 4 KB to 2 x 4 MB. Calculations are as shown in the result headings. For the first two double precision tests, speed MFLOPS can be calculated by dividing MB/second by 8 and 16. For single precision divide by 4 and 8.
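The MB/second to MFLOPS conversion can be checked against the "Max MFLOPS" lines in the results. The sketch below assumes the divisors reflect bytes read per floating point operation (two operands read per result, with two operations for x[m]+s*y[m] and one for x[m]+y[m]); that reading, and the names used, are assumptions for illustration.

```python
import math

# (test, precision) -> divisor, taken from the paragraph above.
DIVISOR = {
    ("x[m]=x[m]+s*y[m]", "Dble"): 8,    # 2 operations per 16 bytes read
    ("x[m]=x[m]+s*y[m]", "Sngl"): 4,    # 2 operations per 8 bytes read
    ("x[m]=x[m]+y[m]",   "Dble"): 16,   # 1 operation per 16 bytes read
    ("x[m]=x[m]+y[m]",   "Sngl"): 8,    # 1 operation per 8 bytes read
}


def mflops(mb_per_sec, test, precision):
    """Convert a MemSpeed MB/second result to MFLOPS."""
    return mb_per_sec / DIVISOR[(test, precision)]


def round_half_up(value):
    """Round to the nearest integer with halves rounded up (Python's round() uses half-to-even)."""
    return math.floor(value + 0.5)


# 16 KB row of the Gentoo 64b Pi 4B results below.
print(round_half_up(mflops(15719, "x[m]=x[m]+s*y[m]", "Dble")))   # 1965
print(round_half_up(mflops(14042, "x[m]=x[m]+s*y[m]", "Sngl")))   # 3511
```

The two values agree with the Max MFLOPS line printed under the Pi 4B table.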

Results are provided below for the Gentoo 64 bit version on the Pi 3B+ and Pi 4B, and the Raspbian 32 bit variety on the Pi 4B, then a sample of relative performance, covering data from L1 cache, L2 cache and RAM.

Gains, greater than the 7% CPU MHz difference, were recorded all round by the Pi 4B over the Pi 3B+. The most impressive were on using L2 cache based data and the more intensive floating point calculations. On the Pi 4B, speeds of 64 bit and 32 bit compilations were similar using RAM based data and executing some integer tests, but significantly faster from cache based floating point calculations.

Many Pi 4B/3B+ comparisons were similar, but the gcc 9 compilation gave rise to a number of changes, compared with the older version. The latter was slightly faster using some double precision calculations, but gcc 9 produced speed increases between 1.3 and 2.6 times with integers and single precision, the latter providing a maximum of 5.5 GFLOPS compared with 3.5.
 
          Memory Reading Speed Test armv8 64 Bit by Roy Longbottom

                  Start of test Fri Aug 16 12:48:51 2019

     Memory  x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]       x[m]=y[m]
     KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
       Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

                              Gentoo 64b Pi 3B+

          8   4813   2897   4350   6180   3954   4831   5378   4324   4324
         16   4540   2900   4356   6213   3961   4838   5401   4344   4333
         32   4184   2780   4047   5540   3721   4483   5421   4285   4316
         64   3784   2678   3803   4776   3547   4171   4925   4087   4051
        128   3613   2694   3842   4731   3562   4188   4967   4087   4103
        256   3133   2652   3800   4626   3493   4027   4967   4093   4096
        512    670    882   1630   2913   2422   2718   3101   3141   2780
       1024    587    774   1017   1310   1287   1184   1105   1526   1543
       2048    555    746    917   1143   1131   1043   1071   1007   1128
       4096    545    691   1130   1039   1015   1140   1045   1087    892
       8192    537    795   1139    980   1133   1148    887    854    922
 Max MFLOPS    602    725                                                 

                              Gentoo 64b Pi 4B

          8  15530  13973  12509  15570  14025  15534  11417   9308   7798
         16  15719  14042  12750  15745  14200  15660  11753   9447   7890
         32  14062  12228  11435  14052  12699  12855  11864   9459   7937
         64  12195  11344  10698  12211  11705  12025   8872   8752   7904
        128  12172  11360  10755  12166  11862  11975   8569   8460   7913
        256  12228  11369  10697  12123  11790  12082   8073   8222   7896
        512  11269  10738  10206  10985  11164  11590   8017   6280   6557
       1024   3407   2635   3281   3396   3242   2979   3765   3947   4029
       2048   1525   1832   1838   1851   1607   1838   2819   2790   2770
       4096   1407   1851   1859   1861   1666   1840   2485   2487   2410
       8192   1913   1914   1922   1528   1895   1891   2496   2234   2489
 Max MFLOPS   1965   3511                                                 

                              Comparison 64b Pi4/3B+

          8   3.23   4.82   2.88   2.52   3.55   3.22   2.12   2.15   1.80
         16   3.46   4.84   2.93   2.53   3.58   3.24   2.18   2.17   1.82
 
        256   3.90   4.29   2.82   2.62   3.38   3.00   1.63   2.01   1.93
        512  16.82  12.17   6.26   3.77   4.61   4.26   2.59   2.00   2.36
       1024   5.80   3.40   3.23   2.59   2.52   2.52   3.41   2.59   2.61

       4096   2.58   2.68   1.65   1.79   1.64   1.61   2.38   2.29   2.70
       8192   3.56   2.41   1.69   1.56   1.67   1.65   2.81   2.62   2.70

                              Raspbian 32b Pi 4B

          8   8459   4766  13344   8303   4768  15553   7806   9926   9927
         16   7142   3918   8649   7103   4094   9309   7899  10086  10056
         32   7969   4490  10339   7941   4532  11627   7758  10070  10048
         64   8126   4602   9909   8114   4617  11069   7425   8021   8070
        128   8302   4651   9623   8311   4657  10836   7374   8049   7934
        256   8319   4663   9627   8360   4666  10768   7530   7922   7925
        512   8088   4629   9453   8239   4650  10696   5023   7904   7949
       1024   3581   3113   3618   3577   3150   3675   5358   2431   1560
       2048   1338   1808   1780   1811   1832   1773   2131    950    956
       4096   1881   1880   1852   1879   1664   1336   1988    984   1054
       8192   1890   1901   1884   1729   1319   1367   2252   1018   1021
 Max MFLOPS   1057   1192                                                 

  
                              Comparison Pi 4B 64b/32b

          8   1.84   2.93   0.94   1.88   2.94   1.00   1.46   0.94   0.79
         16   2.20   3.58   1.47   2.22   3.47   1.68   1.49   0.94   0.78

        256   1.47   2.44   1.11   1.45   2.53   1.12   1.07   1.04   1.00
        512   1.39   2.32   1.08   1.33   2.40   1.08   1.60   0.79   0.82
       1024   0.95   0.85   0.91   0.95   1.03   0.81   0.70   1.62   2.58

       4096   0.75   0.98   1.00   0.99   1.00   1.38   1.25   2.53   2.29
       8192   1.01   1.01   1.02   0.88   1.44   1.38   1.11   2.19   2.44

 =====================================================================

                           Gentoo 64b Pi 3B+ gcc 9

        Memory Reading Speed Test 64 Bit gcc 9 by Roy Longbottom

                  Start of test Thu Sep 26 12:43:02 2019

     Memory  x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]       x[m]=y[m]
     KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
       Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

          8   4565   5140   7847   5439   5827   7928   6161   4288   4334
         16   4445   5145   7942   5362   5829   7941   6207   4358   4310
         32   4094   4853   7251   4750   5396   7250   6139   4312   4303
         64   3767   4748   7008   4320   5309   6954   5461   4097   4100
        128   3912   4799   7319   4442   5486   7325   5328   4133   4134
        256   3838   4824   6934   4400   5426   7247   5354   3844   4010
        512   2570   3661   3826   2773   3975   4912   3302   2532   3017
       1024    878   2120   2228    938   2182   2239   1098   1215   1361
       2048    848   1961   2046   1016   2008   2033    758    805    814
       4096    856   1961   2040   1007   1984   2036    839    863    856
       8192    885   1940   1956   1013   1921   1957    844    865    868
 Max MFLOPS    571   1286

                              Gentoo 64b Pi 4B

          8  13385  21854  24413  13416  23402  24404  11630   9316   9315
         16  13527  22116  24712  13551  23675  24722  11800   9447   9446
         32  12170  19681  21716  12164  21047  21740  11403   9511   9514
         64  11402  19074  20086  11613  20057  20101   9317   8651   8663
        128  11770  20334  21119  12124  21389  21087   8003   8136   8136
        256  11740  20281  21115  12029  21384  21111   8098   8184   8015
        512  11671  20255  20873  12058  21561  21072   7721   6684   6929
       1024   2818   7728   5968   3957   7839   7831   4691   3610   3832
       2048   1884   3436   3743   1880   3578   3281   2597   2717   2696
       4096   1284   2399   2555   1446   3802   3625   2420   2630   2632
       8192   1913   3759   3459   1937   3798   3772   2468   2482   2482
 Max MFLOPS   1691   5529

                              Comparison 64b Pi4/3B+

          8   2.93   4.25   3.11   2.47   4.02   3.08   1.89   2.17   2.15
         16   3.04   4.30   3.11   2.53   4.06   3.11   1.90   2.17   2.19

        256   3.06   4.20   3.05   2.73   3.94   2.91   1.51   2.13   2.00
        512   4.54   5.53   5.46   4.35   5.42   4.29   2.34   2.64   2.30
       1024   3.21   3.65   2.68   4.22   3.59   3.50   4.27   2.97   2.82

       4096   1.50   1.22   1.25   1.44   1.92   1.78   2.88   3.05   3.07
       8192   2.16   1.94   1.77   1.91   1.98   1.93   2.92   2.87   2.86

                              Comparison Pi4B gcc 9/6

          8   0.86   1.56   1.95   0.86   1.67   1.57   1.02   1.00   1.19
         16   0.86   1.57   1.94   0.86   1.67   1.58   1.00   1.00   1.20

        256   0.96   1.78   1.97   0.99   1.81   1.75   1.00   1.00   1.02
        512   1.04   1.89   2.05   1.10   1.93   1.82   0.96   1.06   1.06
       1024   0.83   2.93   1.82   1.17   2.42   2.63   1.25   0.91   0.95

       4096   0.91   1.30   1.37   0.78   2.28   1.97   0.97   1.06   1.09
       8192   1.00   1.96   1.80   1.27   2.00   1.99   0.99   1.11   1.00


NeonSpeed Benchmark MB/Second - NeonSpeedPi64, NeonSpeedPi64g9, NeonSpeed

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. The Norm results are from code generated by the compiler, using NEON directives, and the Neon results are from code using NEON Intrinsic Functions.

Unlike running the same programs on the Pi 3B+, on the Pi 4 the compiled codes were no longer slower than those produced via Intrinsic Functions. This led to performance gains of more than five times in some cases.

Except when using L1 cache based data, performance was essentially the same using 32 bit and 64 bit benchmarks.

With the gcc 9 compilation, the Pi 4B continued to be significantly faster than the 3B+. Comparing Pi 4B gcc 9 and 6 results, performance was essentially the same when NEON Intrinsic Functions were used, but, as with MemSpeed, normal compilations were faster, averaging around 80% faster, in this case.
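The difference between the Norm and Neon columns can be illustrated with a minimal sketch. It is not the benchmark source: the function names saxpy_norm and saxpy_neon are hypothetical, the calculation x = x + s*y is assumed from the "Float v=v+s*v" column heading, and a scalar fallback is included so that the code also compiles on non-ARM systems.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* "Norm" style: plain C, vectorisation is left to the compiler. */
void saxpy_norm(float *x, const float *y, float s, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s * y[i];
}

/* "Neon" style: four floats per step through intrinsic functions. */
void saxpy_neon(float *x, const float *y, float s, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));
    size_t i = 0;
#if defined(__ARM_NEON)
    float32x4_t vs = vdupq_n_f32(s);            /* s in all four lanes */
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vx = vaddq_f32(vx, vmulq_f32(vs, vy));  /* separate multiply and add */
        vst1q_f32(x + i, vx);
    }
#endif
    for (; i < n; i++)          /* tail, or all words on non-NEON targets */
        x[i] = x[i] + s * y[i];
}
```

Both functions should produce the same values. Whether the first is as fast as the second depends on the compiler version and options, which is what the Norm and Neon columns compare.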

    NEON Speed Test armv8 64 Bit V 1.0 Fri Aug 16 2019

       Vector Reading Speed in MBytes/Second 
     Memory  Float v=v+s*v  Int v=v+v+s  Neon v=v+v
     KBytes   Norm   Neon   Norm   Neon  Float    Int

                    Gentoo 64b Pi 3B+

         16   2715   5110   3945   4826   5426   5598
         32   2528   4326   3569   4191   4596   4661
         64   2491   4153   3494   4068   4407   4429
        128   2537   4228   3583   4120   4461   4473
        256   2526   4265   3614   4140   4480   4514
        512   1917   2830   2545   2579   2896   2964
       1024   1166   1299   1152   1257   1205   1229
       4096   1022   1135   1132   1122   1130   1100
      16384   1080   1026   1131   1016   1064   1094
      65536    996   1120   1061    831   1110   1069

                     Gentoo 64b Pi 4B

         16  13982  16424  12505  15239  16065  17193
         32   9554  10753   8981   9657  10970  11025
         64  10658  11833  10274  10722  12110  12134
        128  10657  11887  10337  10680  11994  11973
        256  10709  11970  10360  10774  12003  12083
        512  10147  11441   9733  10209  11264  11532
       1024   2964   3222   2876   3216   3270   2942
       4096   1734   1712   1729   1772   1586   1728
      16384   1592   1922   1818   1923   1926   1667
      65536   1970   1736   1997   1747   1884   2021

                   Comparison 64b Pi4/3B+

         16   5.15   3.21   3.17   3.16   2.96   3.07
        256   4.24   2.81   2.87   2.60   2.68   2.68
        512   5.29   4.04   3.82   3.96   3.89   3.89
      65536   1.98   1.55   1.88   2.10   1.70   1.89
                             
                      Raspbian 32b Pi 4B

         16   9677  10072   8905   9358   9776  10473
         32  10149  10330   9364   9539   9988  10543
         64  10948  11708  10466  10568  11318  11994
        128  10484  11232  10410  10104  11200  11792
        256  10509  11369  10428  10264  11273  11842
        512  10406  11066  10134  10054  11075  11467
       1024   3069   3202   3159   3166   3204   3203
       4096   1721   1910   1908   1882   1903   1900
      16384   2023   2009   2008   1965   2032   2013
      65536   2073   2074   2074   2073   2068   2064

                   Comparison Pi 4B 64b/32b

         16   1.44   1.63   1.40   1.63   1.64   1.64
        256   1.02   1.05   0.99   1.05   1.06   1.02
        512   0.98   1.03   0.96   1.02   1.02   1.01
      65536   0.95   0.84   0.96   0.84   0.91   0.98


                  NeonSpeed Continued Below
  
 =====================================================================

                   Gentoo 64b Pi 3B+ gcc 9

    NEON Speed Test 64 Bit gcc 9 Thu Sep 26 12:45:07 2019

       Vector Reading Speed in MBytes/Second 
     Memory  Float v=v+s*v  Int v=v+v+s  Neon v=v+v
     KBytes   Norm   Neon   Norm   Neon  Float    Int

         16   5118   5461   6218   5298   6024   6011
         32   4894   4980   5886   4855   5431   5445
         64   4713   4557   5669   4452   4868   4867
        128   4824   4703   5814   4598   4995   4946
        256   4857   4750   5815   4643   5028   4964
        512   3694   2652   4265   2675   3003   3007
       1024   2085   1135   2204   1132   1128   1077
       4096   2008   1021   2070   1033   1056   1036
      16384   1912   1061   2042    958   1065   1047
      65536   1783   1062   1873    769   1080   1081

                     Gentoo 64b Pi 4B

         16  21046  14555  16698  13502  14565  16970
         32  17797  12061  14509  10785  12282  13112
         64  19517  10860  15252   9981  10793  11419
        128  19839  10936  15468  10120  11001  11579
        256  20094  10838  15603  10229  10885  11566
        512  20076  10846  15469  10185  10943  11667
       1024   7016   3040   6826   3211   3417   3548
       4096   3945   1940   3599   1950   1768   1937
      16384   3394   2017   3386   1963   1848   2014
      65536   3484   2043   3839   1765   2060   2049

                   Comparison 64b Pi4/3B+

         16   4.11   2.67   2.69   2.55   2.42   2.82
         32   3.64   2.42   2.47   2.22   2.26   2.41
         64   4.14   2.38   2.69   2.24   2.22   2.35
        128   4.11   2.33   2.66   2.20   2.20   2.34
        256   4.14   2.28   2.68   2.20   2.16   2.33
        512   5.43   4.09   3.63   3.81   3.64   3.88
       1024   3.36   2.68   3.10   2.84   3.03   3.29
       4096   1.96   1.90   1.74   1.89   1.67   1.87
      16384   1.78   1.90   1.66   2.05   1.74   1.92
      65536   1.95   1.92   2.05   2.30   1.91   1.90

                   Comparison Pi4B gcc 9/6

         16   1.51   0.89   1.34   0.89   0.91   0.99
         32   1.86   1.12   1.62   1.12   1.12   1.19
         64   1.83   0.92   1.48   0.93   0.89   0.94
        128   1.86   0.92   1.50   0.95   0.92   0.97
        256   1.88   0.91   1.51   0.95   0.91   0.96
        512   1.98   0.95   1.59   1.00   0.97   1.01
       1024   2.37   0.94   2.37   1.00   1.04   1.21
       4096   2.28   1.13   2.08   1.10   1.11   1.12
      16384   2.13   1.05   1.86   1.02   0.96   1.21
      65536   1.77   1.18   1.92   1.01   1.09   1.01

    Average   1.95   1.00   1.73   1.00   0.99   1.06


MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in two different versions, using standard compiled C code for single and double precision arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.


MP-Whetstone Benchmark - MP-WhetsPi64, MP-WhetsPi64g9, MP-WHETSPiA7

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish, with Mutex functions used to avoid update conflicts by allowing only one thread at a time to access common data. Performance was generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a different compiler being used. None of the test functions are suitable for SIMD operation, so the simpler instructions are used. Overall seconds indicates MP efficiency.

As with the single core version, the average Pi 4 MWIPS performance gain over the Pi 3B+ was just over 2 times. The Pi 4B 64 bit and 32 bit speeds were closer, with the 32 bit version this time being somewhat faster on some floating point calculations.

Most of the important Pi 4B gcc 9 results were virtually the same as those from the earlier gcc 6 compilations but the 3B+ COS and EXP speeds were somewhat slower.
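The threading arrangement described above can be sketched as follows. This is an illustration only, not the MP-Whetstone source: the names worker, run_mp and mp_total_ops are hypothetical, the loop body is a stand-in for the eight test functions, and the C11 wall clock is assumed to be adequate for timing.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

#define MAX_THREADS 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static double total_ops;     /* common data, updated under the mutex */
static double last_finish;   /* finish time of the slowest thread    */

static double now_secs(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);   /* C11 wall clock; adequate for a sketch */
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

static void *worker(void *arg)
{
    long passes = *(const long *)arg;
    volatile double x = 1.0;       /* dedicated, thread-private variable */
    for (long i = 0; i < passes; i++)
        x = x * 0.999999 + 0.000001;           /* 2 operations per pass */
    double finished = now_secs();

    pthread_mutex_lock(&lock);     /* one thread at a time from here */
    total_ops += 2.0 * (double)passes;
    if (finished > last_finish)
        last_finish = finished;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Runs the workers and returns millions of operations per second,
   timed up to the last thread to finish. */
double run_mp(int threads, long passes)
{
    assert(threads >= 1 && threads <= MAX_THREADS && passes > 0);
    pthread_t tid[MAX_THREADS];
    total_ops = 0.0;
    double start = now_secs();
    last_finish = start;
    for (int i = 0; i < threads; i++)
        pthread_create(&tid[i], NULL, worker, &passes);
    for (int i = 0; i < threads; i++)
        pthread_join(tid[i], NULL);
    double secs = last_finish - start;
    return secs > 0.0 ? total_ops / secs / 1.0e6 : 0.0;
}

/* Total operations counted by the most recent run_mp() call. */
double mp_total_ops(void)
{
    return total_ops;
}
```

With more threads than cores, elapsed time roughly doubles for the same work per thread, as the 8T overall seconds in the tables indicate.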


           MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Threads             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

                              Gentoo RPi 3B+ 64 Bit

    1       1152    383    383    328   23.2   13.0   N/A    2721   1365
    2       2312    767    767    657   46.5   26.0   N/A    5461   2738
    4       4580   1506   1526   1304   92.0   51.6   N/A   10777   5449
    8       4788   1815   1961   1382   95.0   53.3   N/A   13827   5811

            Overall Seconds   4.96 1T,   4.95 2T,   5.05 4T,  10.07 8T

                              Gentoo RPi 4B 64 Bit

    1       2395    536    538    397   60.8   39.0   N/A    4483    997
    2       4784   1062   1079    794  121.2   77.9   N/A    8932   1990
    4       9476   2125   2080   1568  240.8  155.3   N/A   17718   3962
    8       9834   2631   2744   1630  243.6  160.1   N/A   22265   4053

            Overall Seconds   4.99 1T,   5.01 2T,   5.12 4T,  10.17 8T

                              Comparison 64b Pi4/3B+

    1       2.08   1.40   1.41   1.21   2.62   3.00   N/A    1.65   0.73
    2       2.07   1.39   1.41   1.21   2.61   3.00   N/A    1.64   0.73
    4       2.07   1.41   1.36   1.20   2.62   3.01   N/A    1.64   0.73
    8       2.05   1.45   1.40   1.18   2.56   3.00   N/A    1.61   0.70

                              Raspbian RPi 4B 32 Bit

    1       2059    673    680    311   55.6   33.1   7462   2245    995
    2       4117   1342   1391    624  110.7   65.9  14887   4467   1986
    4       7910   2652   2722   1180  208.5  132.6  29291   8952   3832
    8       8652   3057   2971   1268  233.2  149.6  38368  11923   3942

            Overall Seconds   4.99 1T,   5.01 2T,   5.29 4T,  10.71 8T

                              Comparison Pi 4B 64b/32b

    1       1.16   0.80   0.79   1.28   1.09   1.18   N/A    2.00   1.00
    2       1.16   0.79   0.78   1.27   1.09   1.18   N/A    2.00   1.00
    4       1.20   0.80   0.76   1.33   1.15   1.17   N/A    1.98   1.03
    8       1.14   0.86   0.92   1.28   1.04   1.07   N/A    1.87   1.03


                  MP-Whetstone  Continued Below
  
 ===========================================================================

           MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Threads             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

                              Gentoo 64b Pi 3B+ gcc 9

    1       1500    381    384    328   27.2   28.1   5098   2049   1368
    2       3001    766    762    656   54.5   56.5  10130   4102   2737
    4       5940   1488   1528   1304  107.8  111.5  19741   7665   5423
    8       5987   1528   1666   1267  107.4  117.9  25862   9518   5666

            Overall Seconds   4.98 1T,   4.98 2T,   5.16 4T,  10.30 8T

                              Gentoo 64b Pi 4B gcc 9

    1       2364    530    532    395   60.6   40.0   7426   2242    996
    2       4724   1060   1052    789  121.0   80.4  14853   4476   1994
    4       9413   2103   2112   1579  241.0  159.5  29161   8638   3968
    8       9848   2671   2453   1644  247.0  168.1  37385  11636   4108

            Overall Seconds   5.00 1T,   5.01 2T,   5.07 4T,  10.20 8T

                              Comparison 64b Pi4/3B+

    1       1.58   1.39   1.38   1.20   2.23   1.42   1.46   1.09   0.73
    2       1.57   1.38   1.38   1.20   2.22   1.42   1.47   1.09   0.73
    4       1.58   1.41   1.38   1.21   2.24   1.43   1.48   1.13   0.73
    8       1.64   1.75   1.47   1.30   2.30   1.43   1.45   1.22   0.72

                              Comparison Pi4B gcc 9/6

    1       0.99   0.99   0.99   1.00   1.00   1.03   N/A    0.50   1.00
    2       0.99   1.00   0.97   0.99   1.00   1.03   N/A    0.50   1.00
    4       0.99   0.99   1.02   1.01   1.00   1.03   N/A    0.49   1.00
    8       1.00   1.02   0.89   1.01   1.01   1.05   N/A    0.52   1.01



MP-Dhrystone Benchmark - MP-DHRYPi64, MP-DHRYPi64g9, MP-DHRYPiA7

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance with not much gain using multiple cores.

The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those for the Pi 3B+. The single thread Pi 4B 64 bit/32 bit speed ratio was also similar to that during the single core tests.

As indicated for the earlier gcc 6 results, this benchmark produces inconsistent performance and does not provide a good example of multithreading but, in this case, gcc 6 and gcc 9 results were similar, with a reasonably high Pi 4B/3B+ performance gain.

                  Example Results Log File

  MP-Dhrystone Benchmark 64 Bit gcc 9 Thu Sep 26 11:46:22 2019

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.55     1.19     2.31     4.57
 Dhrystones per Second   14579147 13499628 13827400 14017880
 VAX MIPS rating             8298     7683     7870     7978

         Internal pass count correct all threads

         End of test Thu Sep 26 11:46:31 2019

#############################################################

                      Comparisons

Threads                        1       2       4       8

VAX MIPS rating Pi 3B+ 6    4207    6804    7401    7415
VAX MIPS rating Pi 4B 64    8880    7828    8303    8314
VAX MIPS rating Pi 4B 32    5539    5739    6735    7232

Pi 4B/3B+ 64 bits           2.11    1.15    1.12    1.12
Pi 4B 64 bits/32 bits       1.60    1.36    1.23    1.15

 =======================================================

                        Gentoo gcc 9

VAX MIPS rating Pi 3B+ 6    4062    6504    8242    8343
VAX MIPS rating Pi 4B 64    8298    7683    7870    7978

Pi 4B/3B+ 64 bits           2.04    1.18    0.95    0.96
Pi 4B gcc 9/6               0.93    0.98    0.95    0.96
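
The VAX MIPS ratings and comparison ratios in the results above follow from simple arithmetic, sketched here. The divisor of 1757 Dhrystones per second is the conventional VAX 11/780 reference figure and is an assumption as far as this program is concerned, although it reproduces the logged ratings. The function names are hypothetical.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780, the conventional 1 MIPS reference. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Converts a Dhrystones per second score into a VAX MIPS rating. */
double vax_mips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SEC;
}

/* Ratio of two ratings, as used in the comparison rows. */
double speed_ratio(double numerator, double denominator)
{
    assert(denominator > 0.0);
    return numerator / denominator;
}
```

For example, the single thread gcc 9 log entry of 14579147 Dhrystones per second gives 8297.75, reported as 8298, and 8298 divided by the Pi 3B+ rating of 4062 gives the 2.04 shown in the Pi 4B/3B+ row.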
  



MP SP NEON Linpack Benchmark - linpackMPNeonPi64, linpackMPNeonPi64g9, linpackNeonMP

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running as a single thread. The main reasons appear to be updating data in RAM, to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.

This benchmark uses the same NEON Intrinsic Functions as the single core program, with similar speeds at N = 100, without the threading overheads, but decreasing with larger data sizes, involving RAM accesses.

The full logged output is shown for the first entry, to demonstrate error checking facilities. The sumchecks were identical from the Pi 3B+ and Pi 4B at Gentoo 64 bits, but those from the Raspbian 32 bit test were different, as shown below. Ignoring the slow threaded results, performance ratios of CPU speed limited tests were similar to the single core version.

At least for the unthreaded tests, the gcc 9 results for the Pi 4B were mainly within 10% of those from gcc 6.
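
As an illustration of where thread start/stop overheads can come from, the following sketch splits a single precision y = y + a*x calculation across threads that are created and joined on every call. It is an assumption about the general approach, not the benchmark code, which uses NEON Intrinsic Functions. The names axpy_slice and axpy_mt are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 4

struct slice {
    float *y;
    const float *x;
    float a;
    size_t n;
};

/* Each thread updates its own contiguous slice of y. */
static void *axpy_slice(void *arg)
{
    struct slice *s = arg;
    for (size_t i = 0; i < s->n; i++)
        s->y[i] += s->a * s->x[i];
    return NULL;
}

/* y = y + a*x over n words, using 1 to MAX_THREADS threads.
   Threads are started and stopped on every call, so the fixed cost
   dominates when n is small, as at N = 100. */
void axpy_mt(float *y, const float *x, float a, size_t n, int threads)
{
    assert(threads >= 1 && threads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    struct slice part[MAX_THREADS];
    size_t chunk = n / (size_t)threads;
    for (int t = 0; t < threads; t++) {
        size_t begin = (size_t)t * chunk;
        size_t len = (t == threads - 1) ? n - begin : chunk;  /* last takes remainder */
        part[t] = (struct slice){ y + begin, x + begin, a, len };
        pthread_create(&tid[t], NULL, axpy_slice, &part[t]);
    }
    for (int t = 0; t < threads; t++)
        pthread_join(tid[t], NULL);
}
```

In Linpack this type of function is called many times with short vectors, so a create and join per call can cost far more than the arithmetic it parallelises.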

                  Example Results Log File


 Linpack Single Precision MultiThreaded Benchmark
  64 Bit NEON Intrinsics, Fri Aug 23 00:45:54 2019

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads     None      1       2       4

N  100    642.56   66.69   66.05   65.54
N  500    479.48  274.36  274.85  269.07
N 1000    363.77  316.17  310.37  316.71

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            1.97            5.40           13.51
 RE  4.69621336e-05  6.44138840e-04  3.22485110e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -1.31130219e-05  5.79357147e-05 -3.08930874e-04
 XN -1.30534172e-05  3.51667404e-05  1.90019608e-04

Thread
 0 - 4 Same Results    Same Results    Same Results

####################################################

                      Comparisons

 Threads     None      1       2       4

          Gentoo Pi 3B+ 64 Bits

N  100    642.56   66.69   66.05   65.54
N  500    479.48  274.36  274.85  269.07
N 1000    363.77  316.17  310.37  316.71

            Gentoo 64b Pi 4B

N  100    2252.7    97.3    97.4    97.4
N  500    1628.2   665.2   646.6   674.4
N 1000     399.9   406.8   405.8   399.5

          Comparison 64b Pi4/3B+

N  100      3.51    1.46    1.48    1.49
N  500      3.40    2.42    2.35    2.51
N 1000      1.10    1.29    1.31    1.26

            Raspbian 32b Pi 4B

N  100    1921.5   108.7   101.9   102.5
N  500    1548.8   530.2   714.4   733.1
N 1000     399.9   378.1   364.8   398.2

          Comparison Pi 4B 64b/32b

N  100      1.17    0.89    0.96    0.95
N  500      1.05    1.25    0.91    0.92
N 1000      1.00    1.08    1.11    1.00

   MP SP NEON Linpack Continued Below
  
 ========================================

 gcc 9 MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads     None      1       2       4

          Gentoo 64b Pi 3B+ gcc 9

N  100     641.6    63.0    62.3    61.9
N  500     326.6   229.3   222.6   227.0
N 1000     320.1   275.0   274.3   275.2

          Gentoo 64b Pi 4B gcc 9

N  100    2076.2    98.6    96.6    96.2
N  500    1327.1   631.9   632.5   639.2
N 1000     394.6   375.3   382.3   375.7

          Comparison 64b Pi4/3B+

N  100      3.24    1.57    1.55    1.55
N  500      4.06    2.76    2.84    2.82
N 1000      1.23    1.36    1.39    1.37

          Comparison Pi4B gcc 9/6

N  100      0.92    1.01    0.99    0.99
N  500      0.82    0.95    0.98    0.95
N 1000      0.99    0.92    0.94    0.94

####################################################

               32 bit numeric results

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04



MP BusSpeed Benchmark - MP-BusSpd2Pi64, MP-BusSpd2Pi64g9, MP-BusSpeedPiA7
(read only)

With this version, each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points. See the single processor BusSpeed details regarding burst reading, which can indicate significant differences.

Comparisons are provided for RdAll, at 1, 2 and 4 threads. Pi 4B/3B+ performance ratios were similar to those for the single core tests. There was an exception with two threads, on the Pi 4, using RAM at 64 bits, probably due to caching effects and not seen on subsequent repeated tests.

Particularly note that performance was significantly better using the 32 bit Raspbian compiler. Below are examples of disassembly, showing that the 64 bit Pi 4 code employed scalar operation, using 32 bit w registers, with the 32 bit version benefiting from using 128 bit q registers, for Single Instruction Multiple Data (SIMD) operation. Compile options are included below. Alternatives were also tried on the Pi 4B, but failed to implement SIMD operation.

At least most of the gcc 9 compiled RdAll tests were significantly faster than those produced by gcc 6.
 
    MP-BusSpd armv8 64 Bit Fri Aug 23 00:47:43 2019
    MB/Second Reading Data, 1, 2, 4 and 8 Threads

                        Gentoo 64b Pi 3B+

  KB        Inc32   Inc16    Inc8    Inc4    Inc2   RdAll

 12.3 1T     3138    2822    3044    2383    1708    1737
      2T     5354    4865    5647    4519    3303    3361
      4T     7922    7504    9717    6794    6216    6597
      8T     5125    4159    6987    6696    5350    5195
122.9 1T      640     666    1191    1864    1627    1712
      2T     1008    1018    1926    3496    3268    3387
      4T      962    1042    2157    4259    6427    4372
      8T     1031    1047    2147    3952    6317    6514
12288 1T      124     114     260     527    1016    1363
      2T      137     138     275     487     946    2182
      4T      105     118     240     409     975    2158
      8T      108     117     236     504    1077    2051

                          Gentoo 64b Pi 4B                  RdAll
                                                           4B/3B+

 12.3 1T     4864    4879    5378    4379    4115    4221    2.43
      2T     8159    6924    9179    8006    7689    7837    2.33
      4T    12677   11531   14850   12554   13807   14794    2.24
      8T     7398    6927   10881   11675   11497   13075    2.52
122.9 1T      665     926    1869    2714    3557    4152    2.43
      2T      610     696    1549    4898    7188    8184    2.42
      4T      476     865    1885    4107    8058   14617    3.34
      8T      474     883    1848    3919    7939   13633    2.09
12288 1T      202     210     514    1044    2033    3616    2.65
      2T      258     425     853    1551    3693    6228    2.85
      4T      217     346     497    1024    2181    3789    1.76
      8T      220     275     540    1030    1937    3577    1.74

                          Raspbian 32b Pi 4B                RdAll
                                                          64b/32b

 12.3 1T     5263    5637    5809    5894    5936   13445    0.31
      2T     9412   10020   10567   11454   11604   24980    0.31
      4T    16282   15577   16418   21222   20000   45530    0.32
      8T    11600   13285   16070   18579   20593   36837    0.35
122.9 1T      739     956    1888    3153    5008    9527    0.44
      2T      629    1158    1568    5058    9509   16489    0.50
      4T      600    1093    2134    4527    8732   16816    0.87
      8T      593    1104    2121    4382    8629   17158    0.79
12288 1T      238     258     518    1005    2001    4029    0.90
      2T      278     228     453    1690    1826    3628    1.72
      4T      269     257     740    1019    1790    4145    0.91
      8T      233     292     532     926    2186    3581    1.00


                  MP-BusSpeed Continued Below
  
 ===================================================================

    MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB        Inc32   Inc16    Inc8    Inc4    Inc2   RdAll

                      Gentoo 64b Pi 3B+ gcc 9

 12.3 1T     3453    4178    4428    3543    3584    2335
      2T     5594    7732    8086    6856    6924    4654
      4T     9065   12522   13157   12942   13415    9209
      8T     6661   10770   13266   11955   12573    8478
122.9 1T      640     646    1197    1970    2909    2272
      2T     1030    1012    2006    3671    5784    4528
      4T     1001    1041    2145    4266    8337    6729
      8T     1043    1061    2123    4005    8133    8572
12288 1T      114     104     241     444     932    1352
      2T      126     122     253     370    1005    1997
      4T      104     138     197     471    1133    1745
      8T      102      96     231     466     796    1893

                      Gentoo 64b Pi 4B gcc 9                RdAll   Pi 4B
                                                           4B/3B+ gcc 9/6

 12.3 1T     5573    5750    5057    5646    5800    9129    3.91    2.16
      2T     7191    9038   10035   11020   11125   17757    3.82    2.27
      4T     7023   12144   14591   17681   20490   29184    3.17    1.97
      8T     7553   11837   12565   15640   18546   30517    3.60    2.33
122.9 1T      672     922    1864    3092    4744    7741    3.41    1.86
      2T      577     947    2100    3051    8780   14975    3.31    1.83
      4T      519     983    1884    3980    8701   18139    2.70    1.24
      8T      515     951    1913    4181    8797   16899    1.97    1.24
12288 1T      230     261     499    1016    1678    3873    2.86    1.07
      2T      276     225     418     925    1929    5629    2.82    0.90
      4T      258     267     579     802    1749    5758    3.30    1.52
      8T      214     213     538    1069    2145    4680    2.47    1.31



MP BusSpeed Disassembly

The following shows part of the source code used to read all data, the compile commands used and a disassembly of part of the long (100+) sequences of instructions used for the 32 bit and 64 bit gcc 9 benchmarks. A disassembly of the 64 bit gcc 6 version was not available.

        Source Code 64 AND instructions in main loop
  
   for (i=start; i<end; i=i+64)
   {
       andsum1[t] = andsum1[t] 
           & array[i   ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
           & array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
    To
           & array[i+56] & array[i+57] & array[i+58] & array[i+59]
           & array[i+60] & array[i+61] & array[i+62] & array[i+63];
   }
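
For reference, the same read all calculation can be written compactly with an inner loop, as in the sketch below. This is not the benchmark source, which spells out all 64 AND terms in one statement. The function name and_read_all is hypothetical and a single running AND is used in place of the per thread andsum1[t].

```c
#include <assert.h>
#include <stddef.h>

/* ANDs every word in array[start..end), 64 words per outer pass.
   Assumes (end - start) is a multiple of 64, as in the loop above. */
unsigned int and_read_all(const unsigned int *array, size_t start, size_t end)
{
    assert(end >= start && (end - start) % 64 == 0);
    unsigned int andsum = 0xFFFFFFFFu;
    for (size_t i = start; i < end; i += 64) {
        for (size_t j = 0; j < 64; j++)   /* 64 AND operations per pass */
            andsum &= array[i + j];
    }
    return andsum;
}
```

Whether the compiler turns this type of loop into SIMD instructions or into scalar loads and ANDs is the point at issue in the disassembly comparison that follows.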


Pi 32 Bit Raspbian Compile

gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -mcpu=cortex-a7
           -mfloat-abi=hard -mfpu=neon-vfpv4 -o MP-BusSpd2PiA7

Pi 64 Bit Gentoo Compile

gcc mpbusspd2.c -lpthread -lm -lrt -O3 -march=armv8-a -no-pie -o MP-BusSpd2Pi64g9

Parameters also tried

-march=armv8-a+crc -mtune=cortex-a72 -ftree-vectorize -O2 -pipe 
-fomit-frame-pointer

Pi 32 Bit Disassembly          Pi 64 Bit Disassembly

vld1.32 {q6}, [lr]              ldp     w30, w17, [x0, 52]
vld1.32 {q7}, [r6]              and     w18, w18, w30
vand    q10, q10, q6            and     w1, w1, w18
vld1.32 {q6}, [r0]              ldp     w18, w30, [x0, 60]
vand    q9, q9, q7              and     w17, w17, w18
vand    q12, q12, q6            and     w1, w1, w17
vld1.32 {q7}, [ip]              ldp     w17, w18, [x0, 68]
vld1.32 {q6}, [r7]              and     w30, w30, w17
add     r1, r3, #96             and     w1, w1, w30
add     r6, r3, #144            ldp     w30, w17, [x0, 76]
vand    q11, q11, q7            and     w18, w18, w30
vand    q14, q14, q6            and     w1, w1, w18
vld1.32 {q7}, [r1]              ldp     w18, w30, [x0, 84]
vld1.32 {q6}, [r6]              and     w17, w17, w18
  


MP RandMem Benchmark - MP-RandMemPi64, MP-RandMemPi64g9, MP-RandMemPiA7

This benchmark potentially reads and writes all data, in sections covering caches and RAM, each thread starting at different addresses. Random access can select any address after that. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.

At 64 bits, the Pi 4B provided variable gains over the Pi 3B+, but the gains on the Pi 4B from 64 bits over 32 bits were smaller.

Using gcc 9, the Pi 4B again produced some moderate performance gains over the Pi 3B+, but the gcc 6 version was, possibly, a little faster.
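
The difference between serial and random access can be sketched with an index array, as below. This is a hypothetical illustration, not the MP-RandMem source: the use of an index array, the shuffle and all function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fills index[] with 0..n-1, in order (serial) or shuffled (random). */
void fill_index(size_t *index, size_t n, int random, unsigned int seed)
{
    assert(n == 0 || index != NULL);
    for (size_t i = 0; i < n; i++)
        index[i] = i;
    if (!random || n < 2)
        return;
    srand(seed);
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = index[i];
        index[i] = index[j];
        index[j] = tmp;
    }
}

/* Read test: AND together data[index[i]] for all i. */
unsigned int read_indexed(const unsigned int *data, const size_t *index, size_t n)
{
    unsigned int andsum = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++)
        andsum &= data[index[i]];
    return andsum;
}

/* Read/write test: update every selected word in place. */
void rdwr_indexed(unsigned int *data, const size_t *index, size_t n, unsigned int mask)
{
    for (size_t i = 0; i < n; i++)
        data[index[i]] ^= mask;
}
```

With a shuffled index, successive words rarely come from the same burst of data, which is consistent with the much lower random access speeds from RAM in the tables.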


            MB/Second Using 1, 2, 4 and 8 Threads

         Serial Serial Random Random Serial Serial Random Random
KB+Thread  Read   RdWr   Read   RdWr   Read   RdWr   Read   RdWr

            Gentoo Pi 4B 64 Bits

 12.3 1T    5922   7871   5892   7857
      2T   11856   7882  11902   7923
      4T   22964   7821  22276   7832
      8T   23225   7751  22082   7717
122.9 1T    5827   7276   2052   1921
      2T   10965   7258   1754   1924
      4T   10969   7232   1848   1929
      8T   10896   7158   1834   1909
12288 1T    3879   1052    188    170
      2T    4848    935    218    168
      4T    4684    943    332    170
      8T    3982   1049    340    171

            Gentoo Pi 3B+ 64 Bits       Raspbian Pi 4B 32 Bits

 12.3 1T    4901   3587   4912   3585   5860   7905   5927   7657
      2T    8749   3564   8719   3556  11747   7908  11182   7746
      4T   17108   3504  17160   3505  21416   7626  17382   7731
      8T   16885   3475  16650   3485  20649   7528  20431   7378
122.9 1T    3921   3339   1010    974   5479   7269   1826   1923
      2T    7360   3350   1814    972  10355   6964   1667   1920
      4T   12199   3313   2281    969   9808   7177   1715   1908
      8T   12089   3313   2279    968  11677   7058   1697   1919
12288 1T    2024    828     83     67   3438   1271    179    152
      2T    2169    820    142     67   4176   1204    213    167
      4T    2178    818    154     67   4227   1117    337    161
      8T    2219    821    161     67   3479   1093    287    168

4 Thread Pi 4B/3B+ 64 Bits           4 Thread Pi 4B 64 bits/32 bits

 12.3 4T    1.34   2.23   1.30   2.23   1.07   1.03   1.28   1.01
122.9 4T    0.90   2.18   0.81   1.99   1.12   1.01   1.08   1.01
12288 4T    2.15   1.15   2.16   2.54   1.11   0.84   0.99   1.06

===================================================================

             MB/Second Using 1, 2, 4 and 8 Threads

         Serial Serial Random Random Serial Serial Random Random
KB+Thread  Read   RdWr   Read   RdWr   Read   RdWr   Read   RdWr

            Gentoo 64b Pi 3B+ gcc 9    Gentoo 64b Pi 4B gcc 9

 12.3 1T    4886   3581   4878   3590   5737   6884   5763   7537
      2T    8723   3550   8724   3550  11536   7592  10238   6898
      4T   16836   3498  17531   3509  21084   7575  15160   7390
      8T   15777   3459  16783   3466  20089   7339  15311   7200
122.9 1T    3913   3346    987    972   5739   7231   2006   1906
      2T    7285   3339   1753    964  10662   7217   1742   1896
      4T   12354   3344   2350    972  10376   6741   1815   1812
      8T   11841   3333   2300    962  10298   6937   1823   1848
12288 1T    1795    761     69     60   3477    905    181    162
      2T    1915    735    118     60   3750    794    215    164
      4T    2452    730    128     59   4669    968    259    162
      8T    1805    755    137     60   3419    981    301    157

                  4 Thread                    4 Thread 
            Comparison 64b Pi4/3B+      Comparison Pi4B gcc 9/6

 12.3 4T    1.25   2.17   0.86   2.11   0.92   0.97   0.68   0.94
122.9 4T    0.84   2.02   0.77   1.86   0.95   0.93   0.98   0.94
12288 4T    1.90   1.33   2.02   2.75   1.00   1.03   0.78   0.95
  



MP-MFLOPS Benchmarks - MP-MFLOPSPi64, MP-MFLOPSPi64g9, MP-MFLOPSPi64DP,
MP-MFLOPSPi64DPg9, MP-NeonMFLOPS64, MP-NeonMFLOPS64g9, MP-MFLOPSPiA7

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. Versions are available using single precision and double precision data, plus one with NEON intrinsic functions. The numeric results are converted into a simple sumcheck, that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.

There can be wide variations in speeds, affected by the short running times and by factors such as cached data variations. In order to help in interpreting results, comparisons are provided of results using one and four threads. These indicate that, with cache based data, the Pi 4B was more than 3.5 times faster than the Pi 3B+ at two operations per word, but less so at 32 operations.

The 64 bit and 32 bit comparisons were, no doubt, influenced by the particular compiler version used, and this is reflected in the main disassembled code shown below, for 32 operations per word. The 32 bit compile included -mfpu=neon-vfpv4, but NEON was not implemented, resulting in scalar operation, using single word s registers. I have another version, with the compile including -funsafe-math-optimizations, that compiles NEON instructions, with similar performance to the 64 bit version, but more sumcheck differences.

The benchmark compiled to use NEON Intrinsic Functions does not include any that specify fused multiply and add operations, reducing maximum possible speed. The 64 bit compiler converts the functions to include fused instructions, providing the fastest speeds.

The main compiler independent feature that provides a clear advantage to 64 bit operation is that the CPU, at 32 bits, does not support double precision SIMD (NEON) operation, with single word d registers being compiled. On the other hand, the performance gain does not appear to meet the potential. This suggests that there are other limiting factors - see disassembly below.

It is difficult to judge relative gcc 9 and 6 performance, probably due to the short running times. The former appears to be more than 10% faster, running the single precision tests. For these, the disassembled instructions look the same as those shown below, but in a different sequence.
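
The calculations and the MFLOPS arithmetic can be sketched as follows. This is not the benchmark source: the function names are hypothetical, the two operation form (x[i]+a)*b is an assumption, and only the first eight of the 32 operations are included, these being the terms quoted in the description above.

```c
#include <assert.h>
#include <stddef.h>

/* 2 operations per word: one add and one multiply (assumed form). */
void ops2(float *x, size_t n, float a, float b)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;
}

/* First 8 of the 32 operations per word: 4 adds/subtracts between
   terms and within brackets, plus 3 multiplies and 1 further add. */
void ops8(float *x, size_t n, float a, float b, float c,
          float d, float e, float f)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}

/* Millions of floating point operations per second. */
double mflops(size_t words, unsigned int ops_per_word,
              unsigned int passes, double seconds)
{
    assert(seconds > 0.0);
    return (double)words * ops_per_word * passes / seconds / 1.0e6;
}
```

Each element is calculated independently, so splitting the array into segments for separate threads does not change the individual results. This is why a sumcheck of the results can be expected to be the same for any number of threads.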

                              Single Precision 

          MP-MFLOPS armv8 64Bit Thu Aug 22 19:50:10 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

    2 Ops/Word         32 Ops/Word        2 Ops/Word        32 Ops/Word
KB  12.8  128  12800   12.8    128 12800  12.8  128  12800  12.8  128 12800

    ---- Gentoo Pi 4B 64 Bits MFLOPS ---

1T  2908  2854   459   5778   5734  5405
2T  5700  5311   457  10935  11212  7968
4T 10375  5588   490  18181  21842  7637
8T  9675  8460   511  20128  20567  8568

    --- Gentoo Pi 3B+ 64 Bits MFLOPS ---  -- Raspbian Pi 4B 32 Bits MFLOPS -

1T     792   806   373  1780  1783  1724   987   993   606  2816  2794  2804
2T    1482  1596   382  3542  3509  3380  1823  1837   567  5610  5541  5497
4T    2861  2742   429  5849  7013  5465  2119  3349   647  9884 10702  9081
8T    2770  2877   429  6434  6700  6101  3136  3783   609 10230 10504  9240

                                  Comparisons
    --------- Pi 4B/3B+ 64 Bits --------  ------ Pi 4B 64 bits/32 bits -----

1T    3.67  3.54  1.23  3.25  3.22  3.14  2.95  2.87  0.76  2.05  2.05  1.93
2T    3.85  3.33  1.20  3.09  3.20  2.36  3.13  2.89  0.81  1.95  2.02  1.45
4T    3.63  2.04  1.14  3.11  3.11  1.40  4.90  1.67  0.76  1.84  2.04  0.84


                  MP-MFLOPS Continued Below
  
===========================================================================

          MP-MFLOPS 64 Bit gcc 9 Thu Sep 26 12:36:54 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

          Gentoo 64b Pi 3B+ gcc 9              Gentoo 64b Pi 4B gcc 9

       2 Ops/Word        32 Ops/Word       2 Ops/Word        32 Ops/Word
KB   12.8   128 12800  12.8   128 12800  12.8   128 12800  12.8   128 12800

1T    827   805   371  3232  3157  2802  3162  3072   468  6754  6714  6340
2T   1608  1567   360  6420  6423  5286  6498  6029   496 13329 12397  7623
4T   1764  3142   400 11240 12355  6029 11709  6141   529 24825 25055  8723
8T   2548  2575   381 10813 11755  5827 10828  8158   493 19452 22190  8426

                                  Comparisons
    ........... 64b Pi4/3B+ ..........  .......... Pi4B gcc 9/6 ..........

1T   3.82  3.82  1.26  2.09  2.13  2.26  1.09  1.08  1.02  1.17  1.17  1.17
2T   4.04  3.85  1.38  2.08  1.93  1.44  1.14  1.14  1.09  1.22  1.11  0.96
4T   6.64  1.95  1.32  2.21  2.03  1.45  1.13  1.10  1.08  1.37  1.15  1.14

###########################################################################

                              Double Precision

     MP-MFLOPS armv8 64Bit Double Precision Thu Aug 22 19:51:42 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

       2 Ops/Word        32 Ops/Word       2 Ops/Word        32 Ops/Word
KB   12.8   128 12800  12.8   128 12800  12.8   128 12800  12.8   128 12800

    ---- Gentoo Pi 4B 64 Bits MFLOPS ---

1T   1464  1386   225  3398  3386  3182
2T   2837  2792   228  6720  6741  4547
4T   5172  3414   251 10405 12762  4763
8T   4774  4353   275 11506 12118  4865

    --- Gentoo Pi 3B+ 64 Bits MFLOPS ---  -- Raspbian Pi 4B 32 Bits MFLOPS -

1T    415   386   206  1400  1403  1333  1187  1220   309  2682  2714  2701
2T    820   813   209  2804  2767  2597  2420  2416   282  5379  5415  4780
4T   1328  1323   212  5433  5340  2465  4665  2381   317 10256 10336  5242
8T   1343  1308   214  5090  5006  3280  4385  3114   310  9721 10340  5131

                                  Comparisons
    --------- Pi 4B/3B+ 64 Bits --------  ------ Pi 4B 64 bits/32 bits -----

1T   3.53  3.59  1.09  2.43  2.41  2.39  1.23  1.14  0.73  1.27  1.25  1.18
2T   3.46  3.43  1.09  2.40  2.44  1.75  1.17  1.16  0.81  1.25  1.24  0.95
4T   3.89  2.58  1.18  1.92  2.39  1.93  1.11  1.43  0.79  1.01  1.23  0.91

===========================================================================

 MP-MFLOPS 64 Bit gcc 9 Double Precision Thu Sep 26 22:05:10 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

       ---- Gentoo 64b Pi 3B+ gcc 9 ----       ----- Gentoo 64b Pi 4B gcc 9 ----

         2 Ops/Word          32 Ops/Word         2 Ops/Word          32 Ops/Word
 KB    12.8   128 12800    12.8   128 12800    12.8   128 12800    12.8   128 12800

 1T     384   350   127    1582  1546  1372     657   663   183    3283  3358  3169
 2T     753   753   184    3109  3157  2645    3203  2690   223    6573  6353  4535
 4T    1346  1330   194    4228  6099  3067    5799  3866   292   12432 12665  4906
 8T    1234  1340   201    4888  5748  3190    5322  4583   269   10738  8891  4521

 Comparisons

       ........... 64b Pi4/3B+ ..........      .......... Pi4B gcc 9/6 ..........

 1T    1.71  1.89  1.44    2.08  2.17  2.31    0.45  0.48  0.81    0.97  0.99  1.00
 2T    4.25  3.57  1.21    2.11  2.01  1.71    1.13  0.96  0.98    0.98  0.94  1.00
 4T    4.31  2.91  1.51    2.94  2.08  1.60    1.12  1.13  1.16    1.19  0.99  1.03

MP-MFLOPS Continued Below

                         NEON Single Precision

 MP-MFLOPS NEON Intrinsics 64 Bit Thu Aug 22 19:52:48 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

         2 Ops/Word          32 Ops/Word         2 Ops/Word          32 Ops/Word
 KB    12.8   128 12800    12.8   128 12800    12.8   128 12800    12.8   128 12800

      ---- Gentoo Pi 4B 64 Bits MFLOPS ---
 1T    3311  3192   535    6442  6548  6198
 2T    4607  6186   552   13030 13012  8468
 4T    6279  5725   562   23798 24128  9374
 8T    7815 12044   486   22725 21712  9395

      --- Gentoo Pi 3B+ 64 Bits MFLOPS --       -- Raspbian Pi 4B 32 Bits MFLOPS -
 1T     830   823   406    2989  2986  2792    2491  2399   615    4325  4285  4261
 2T    1575  1498   414    5981  5872  5445    5629  5520   591    8602  8463  8308
 4T    2217  2650   431   11661 11644  6061   10580  5594   553   16991 16493  9124
 8T    2733  3197   437   10505 10637  6708    7047 10785   513   14325 16219  8867

 Comparisons

      --------- Pi 4B/3B+ 64 Bits --------      ------ Pi 4B 64 bits/32 bits -----
 1T    3.99  3.88  1.32    2.16  2.19  2.22    1.33  1.33  0.87    1.49  1.53  1.45
 2T    2.93  4.13  1.33    2.18  2.22  1.56    0.82  1.12  0.93    1.51  1.54  1.02
 4T    2.83  2.16  1.30    2.04  2.07  1.55    0.59  1.02  1.02    1.40  1.46  1.03

===========================================================================

 MP-MFLOPS NEON Intrinsics 64 Bit gcc 9 Thu Sep 26 22:02:00 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

       ---- Gentoo 64b Pi 3B+ gcc 9 ----       ----- Gentoo 64b Pi 4B gcc 9 ----

         2 Ops/Word          32 Ops/Word         2 Ops/Word          32 Ops/Word
 KB    12.8   128 12800    12.8   128 12800    12.8   128 12800    12.8   128 12800

 1T     769   765   354    3009  2967  2638    1233  1313   507    6451  6428  6224
 2T    1315  1324   293    5863  5990  5097    6307  4824   389   12559 12784  7612
 4T    1750  2647   380   10081 11250  5748    8101  5186   531   24762 24708  7902
 8T    2180  2664   392    9719 11010  6368    6782  8444   504   22598 24113  7979

       ........... 64b Pi4/3B+ ..........      .......... Pi4B gcc 9/6 ..........

 1T    1.60  1.72  1.43    2.14  2.17  2.36    0.37  0.41  0.95    1.00  0.98  1.00
 2T    4.80  3.64  1.33    2.14  2.13  1.49    1.37  0.78  0.70    0.96  0.98  0.90
 4T    4.63  1.96  1.40    2.46  2.20  1.37    1.29  0.91  0.94    1.04  1.02  0.84



MP-MFLOPS Disassembly

On the Pi 4B, with single precision floating point and SIMD, registers holding four words were used (see 4s below). With this, four calculation results might be expected per clock cycle, or 6 GFLOPS per core and up to 24 GFLOPS using all four cores. Instructions such as fused multiply and add could then double the speed, to 12 GFLOPS per core. For the mix of instructions below, expectations might be 70% of this, or 8.4 GFLOPS. Using double precision, with two words in the 128 bit registers, expectations might be half that, at 4.2 GFLOPS per core, with this code.
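
The expectations quoted above are simple arithmetic, which the following minimal Python sketch reproduces. It assumes a 1.5 GHz clock (implied by four results per cycle giving 6 GFLOPS) and one 128 bit SIMD instruction completed per cycle; the function name and its parameters are illustrative only and are not part of the benchmark.

```python
# Rough peak estimate for one Pi 4B core.
# Assumptions: 1.5 GHz clock, one 128 bit SIMD instruction per clock cycle.
CLOCK_GHZ = 1.5
REGISTER_BITS = 128


def peak_gflops(word_bits, fused=True, efficiency=1.0, cores=1):
    """Estimated GFLOPS for a given word size and instruction mix efficiency."""
    lanes = REGISTER_BITS // word_bits   # 4 words for SP, 2 words for DP
    ops_per_lane = 2 if fused else 1     # fused multiply and add counts as 2
    return CLOCK_GHZ * lanes * ops_per_lane * efficiency * cores


sp_plain = peak_gflops(32, fused=False)    # four results per cycle
sp_fused = peak_gflops(32)                 # all fused instructions
sp_mix = peak_gflops(32, efficiency=0.7)   # 70% allowance for the mix
dp_mix = peak_gflops(64, efficiency=0.7)   # two words per register

for label, value in [("SP, no fused", sp_plain), ("SP, all fused", sp_fused),
                     ("SP, 70% mix", sp_mix), ("DP, 70% mix", dp_mix)]:
    print(f"{label:14s} {value:5.1f} GFLOPS per core")
```

For comparison, the measured single core speeds in the listing heading below are 6.55 GFLOPS for SP NEON and 3.39 GFLOPS for DP.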


SP NEON 24.1 GFLOPS 6.55 1 core          DP 12.7 GFLOPS - 3.39 1 core 

.L41:                                   .L84:
ldr     q1, [x1]                        ldr     q16, [x2, x0]
ldr     q0, [sp, 64]                    add     w3, w3, 1
fadd    v18.4s, v20.4s, v1.4s           cmp     w3, w6
fadd    v17.4s, v22.4s, v1.4s           fadd    v15.2d, v16.2d, v14.2d
fadd    v0.4s, v0.4s, v1.4s             fadd    v17.2d, v16.2d, v12.2d
fadd    v16.4s, v24.4s, v1.4s           fmul    v15.2d, v15.2d, v13.2d
fadd    v7.4s, v26.4s, v1.4s            fmls    v15.2d, v17.2d, v11.2d
fadd    v6.4s, v28.4s, v1.4s            fadd    v17.2d, v16.2d, v10.2d
fadd    v5.4s, v30.4s, v1.4s            fmla    v15.2d, v17.2d, v9.2d
fmul    v0.4s, v0.4s, v19.4s            fadd    v17.2d, v16.2d, v8.2d
fadd    v4.4s, v10.4s, v1.4s            fmls    v15.2d, v17.2d, v31.2d
fadd    v3.4s, v12.4s, v1.4s            fadd    v17.2d, v16.2d, v30.2d
fadd    v2.4s, v14.4s, v1.4s            fmla    v15.2d, v17.2d, v29.2d
fadd    v1.4s, v8.4s, v1.4s             fadd    v17.2d, v16.2d, v28.2d
fmls    v0.4s, v21.4s, v18.4s           fmls    v15.2d, v17.2d, v0.2d
fmla    v0.4s, v23.4s, v17.4s           fadd    v17.2d, v16.2d, v27.2d
fmls    v0.4s, v25.4s, v16.4s           fmla    v15.2d, v17.2d, v26.2d
fmla    v0.4s, v27.4s, v7.4s            fadd    v17.2d, v16.2d, v25.2d
fmls    v0.4s, v29.4s, v6.4s            fmls    v15.2d, v17.2d, v24.2d
fmla    v0.4s, v31.4s, v5.4s            fadd    v17.2d, v16.2d, v23.2d
fmls    v0.4s, v9.4s, v1.4s             fmla    v15.2d, v17.2d, v22.2d
fmla    v0.4s, v4.4s, v11.4s            fadd    v17.2d, v16.2d, v21.2d
fmls    v0.4s, v3.4s, v13.4s            fadd    v16.2d, v16.2d, v19.2d
fmla    v0.4s, v2.4s, v15.4s            fmls    v15.2d, v17.2d, v20.2d
str     q0, [x1], 16                    fmla    v15.2d, v16.2d, v18.2d
cmp     x1, x0                          str     q15, [x2, x0]
bne     .L41                            add     x0, x0, 16
                                        bcc     .L84


                     32 bit    64 bit    32 bit     64 bit   32 bit    64 bit
                         SP        SP        DP        DP   NEON SP   NEON SP

Maximum GFLOPS         10.7      21.8      10.3      12.7      17.0      24.1

Instructions
Total                    27        39        26        27        67        27
Floating point           22        32        22        32        32        22

FP operations
Total                    32       128        32        64       128       128
Add or subtract          11        44        11        22        21        44
Multiply                  1         4         1         2        11         4
Fused                    20        80        20        40         0        80

Add example           fadds      fadd     faddd      fadd  vadd.f32      fadd
                        s16,   v15.4s,      d25,   v15.2d,       q9,    v1.4s,
                        s23,   v16.4s,      d17,   v16.2d,       q8,    v8.4s,
                        s2     v15.4s       d15    v14.2d        q14    v1.4s

Multiply example     fnmuls      fmul     fmuld      fmul  vmul.f32      fmul
                        s16,   v15.4s,      d16,   v15.2d,       q9,    v0.4s,
                         s3,   v15.4s,      d16,   v15.2d,       q9,    v0.4s,
                        s16    v17.4s       d5     v13.2d       q12    v19.4s

Fused example      vfma.f32      fmla  vfma.f64      fmla       N/A      fmla
                        s16,   v15.4s,      d16,   v15.2d,              v0.4s,
                        s29,   v17.4s,      d22,   v17.2d,              v4.4s,
                         s9     v0.4s       d28    v22.2d              v11.4s

FP registers used        32         4        32        25        16        32
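
As a cross-check of the 64 bit NEON SP column, the following Python sketch counts the instructions in the .L41 loop listed earlier and converts them to floating point operations per pass. The mnemonic list is transcribed from that listing; the assumption is that each fused instruction counts as two operations per word and that every instruction works on four words.

```python
from collections import Counter

# Mnemonics of the SP NEON loop (.L41), in listing order.
SP_NEON_LOOP = (
    "ldr ldr "
    "fadd fadd fadd fadd fadd fadd fadd fmul fadd fadd fadd fadd "
    "fmls fmla fmls fmla fmls fmla fmls fmla fmls fmla "
    "str cmp bne"
).split()

LANES = 4  # .4s registers hold four single precision words

# Operations per word for each floating point mnemonic (fused counts as 2).
OPS_PER_WORD = {"fadd": 1, "fsub": 1, "fmul": 1, "fmla": 2, "fmls": 2}


def loop_summary(mnemonics, lanes):
    """Return (total instructions, FP instructions, FP operations per pass)."""
    counts = Counter(mnemonics)
    fp_instructions = sum(n for m, n in counts.items() if m in OPS_PER_WORD)
    fp_operations = sum(OPS_PER_WORD[m] * n * lanes
                        for m, n in counts.items() if m in OPS_PER_WORD)
    return len(mnemonics), fp_instructions, fp_operations


total, fp_instr, fp_ops = loop_summary(SP_NEON_LOOP, LANES)
print(total, fp_instr, fp_ops)
```

The three values agree with the 27 instructions, 22 floating point instructions and 128 operations shown in the 64 bit NEON SP column.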
  


MP-MFLOPS Sumchecks

Different instructions, such as those used for SP and DP, may not produce identical numeric results. Variations also depend on the number of passes; here, results came closer to 1.0 as data size increased. The only anomaly is the value marked -X below.

              2 Ops/Word              32 Ops/Word 
  KB          12.8    128    12800    12.8     128   12800  

SP
4B/64	 1T    76406   97075   99969   66015   95363   99951
3B/64	 1T    76406   97075   99969   66015   95363   99951
4B/32	 1T    76406   97075   99969   66015   95363   99951

DP		
4B/64	 1T    76384   97072   99969   66065   95370   99951	
3B/64	 1T    76384   97072   99969   66065   95370   99951	
4B/32	 1T    76384   97072   99969   66065   95370   99951	

NEON Bit SP		
4B/64	 1T    76406   97075   99969   66015   95363   99951	
3B/64	 1T    76406   97075   99969   66015   95363   99951	
4B/32	 1T    76406   97075   99969   66014-X 95363   99951	
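
A check of this kind is easy to automate. The sketch below is illustrative only, using the NEON SP rows from the table above and a hypothetical helper that reports any value differing from a chosen reference system.

```python
# NEON SP sumchecks from the table above, 1T, in column order:
# 2 Ops/Word at 12.8, 128, 12800 KB, then 32 Ops/Word at the same sizes.
NEON_SP = {
    "4B/64": [76406, 97075, 99969, 66015, 95363, 99951],
    "3B/64": [76406, 97075, 99969, 66015, 95363, 99951],
    "4B/32": [76406, 97075, 99969, 66014, 95363, 99951],
}


def mismatches(rows, reference):
    """List (system, column index, value, reference value) for each difference."""
    ref = rows[reference]
    return [(name, i, value, ref[i])
            for name, values in rows.items()
            for i, value in enumerate(values)
            if value != ref[i]]


differences = mismatches(NEON_SP, reference="4B/64")
print(differences)
```

Only the 32 bit Pi 4B value in the fourth column (32 Ops/Word, 12.8 KB) is reported, matching the entry marked -X.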
  


OpenMP MFLOPS - OpenMP-MFLOPS64, OpenMP-MFLOPS64g9, notOpenMP-MFLOPS64,
notOpenMP-MFLOPS64g9, OpenMP-MFLOPS, notOpenMP-MFLOPS

This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks but adds calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.

Following is an example of full output. The strange test names were carried forward from a 2014 CUDA benchmark, via Windows and Linux Intel CPU versions. Details are in the following GigaFLOPS Benchmarks report, covering MP-MFLOPS, QPAR and OpenMP. This showed nearly 100 GFLOPS from a Core i7 CPU and 400 GFLOPS from a GeForce GTX 650 graphics card, via CUDA. See GigaFLOPS Benchmarks.htm.

The detail is followed by MFLOPS results on Pi 3B+ and Pi 4B. The direct conversions of the code from large systems lead to excessive memory demands for Raspberry Pi systems, with too many tests dependent on RAM speed, and low MP performance gains. There were glimpses of the usual performance gains and a maximum of over 20 SP GFLOPS on a 64 bit Pi 4B.

The Pi 4B gcc 9/6 performance ratios indicate no real advantage for either compilation, except that the gcc 9 results reach 24.7 SP GFLOPS.

                   Gentoo 64b Pi 4B  gcc 9

            OpenMP MFLOPS64g9 Thu Sep 26 16:51:07 2019

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.124228     4025    0.929538   Yes
 Data in & out    1000000     2      250   0.842066      594    0.992550   Yes
 Data in & out   10000000     2       25   0.873622      572    0.999250   Yes

 Data in & out     100000     8     2500   0.147889    13524    0.957117   Yes
 Data in & out    1000000     8      250   0.904478     2211    0.995518   Yes
 Data in & out   10000000     8       25   0.951405     2102    0.999549   Yes

 Data in & out     100000    32     2500   0.324246    24673    0.890215   Yes
 Data in & out    1000000    32      250   1.097993     7286    0.988088   Yes
 Data in & out   10000000    32       25   1.045087     7655    0.998796   Yes
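
The MFLOPS column in this output can be reproduced from the other columns. The Python sketch below assumes that MFLOPS is words x operations per word x repeat passes divided by seconds and by one million; that relationship is inferred from the figures rather than taken from the benchmark source.

```python
# (4 byte words, ops per word, repeat passes, seconds, reported MFLOPS)
ROWS = [
    (100000, 2, 2500, 0.124228, 4025),
    (1000000, 8, 250, 0.904478, 2211),
    (100000, 32, 2500, 0.324246, 24673),
    (10000000, 32, 25, 1.045087, 7655),
]


def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test."""
    return words * ops_per_word * passes / seconds / 1e6


for words, ops, passes, seconds, reported in ROWS:
    computed = mflops(words, ops, passes, seconds)
    print(f"{words:9d} {ops:3d} {computed:9.1f} reported {reported}")
```

Each computed value rounds to the reported figure, so the repeat passes are scaled to keep the total operation count constant for a given Ops/Word setting.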

                                               --------- gcc 9 ---------                            
Mbytes/  Pi 3B+       Pi 4B        Pi 4B       Pi 3B+       Pi 4B
Ops/Word   64b          64b          32b         64b          64b
           All    1T    All     1T   All    1T   All    1T    All     1T

0.4/2     2674   755   5386   2780  4716  2850  2341   795   4025   2236
4/2        411   404    563    557   556   429   381   362    594    403
40/2       419   408    545    588   544   632   401   387    572    493

0.4/8     7029  1886  15401   5555  7981  5191  6051  1906  13524   5373
4/8       1656  1495   2223   2116  2389  2082  1491  1352   2211   1948
40/8      1725  1507   2361   2310  2199  2003  1598  1418   2102   2308

0.4/32    6648  1699  20429   5647  8147  5449 12002  3185  24673   6786
4/32      5977  1616   8082   5445  7951  5385  5641  2809   7286   6385
40/32     6027  1616   8470   5479  8030  5379  6142  2809   7655   6415
 
                      Pi 4B        gcc 9       Pi 4B
         4b/3b       64/32b        4b/3b       gcc 9/6
           All    1T    All     1T   All    1T   All    1T

0.4/2     2.01  3.68   1.14   0.98  1.72  2.81  0.75  0.80
4/2       1.37  1.38   1.01   1.30  1.56  1.11  1.06  0.72
40/2      1.30  1.44   1.00   0.93  1.43  1.27  1.05  0.84

0.4/8     2.19  2.95   1.93   1.07  2.24  2.82  0.88  0.97
4/8       1.34  1.42   0.93   1.02  1.48  1.44  0.99  0.92
40/8      1.37  1.53   1.07   1.15  1.32  1.63  0.89  1.00

0.4/32    3.07  3.32   2.51   1.04  2.06  2.13  1.21  1.20
4/32      1.35  3.37   1.02   1.01  1.29  2.27  0.90  1.17
40/32     1.41  3.39   1.05   1.02  1.25  2.28  0.90  1.17
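
The ratio columns are derived directly from the MFLOPS table. As an illustration, this Python sketch recomputes the four "All" ratios for the 0.4 MB, 2 Ops/Word row; the dictionary keys are labels chosen for the example only.

```python
# "All" (OpenMP) MFLOPS for the 0.4/2 row of the results table above.
ROW_0P4_2 = {
    "pi3b_64": 2674,     # Pi 3B+ 64b
    "pi4b_64": 5386,     # Pi 4B 64b
    "pi4b_32": 4716,     # Pi 4B 32b
    "pi3b_64_g9": 2341,  # Pi 3B+ 64b gcc 9
    "pi4b_64_g9": 4025,  # Pi 4B 64b gcc 9
}


def ratios(row):
    """Return the four comparison ratios, rounded as in the table."""
    return {
        "4b/3b": round(row["pi4b_64"] / row["pi3b_64"], 2),
        "64/32b": round(row["pi4b_64"] / row["pi4b_32"], 2),
        "gcc 9 4b/3b": round(row["pi4b_64_g9"] / row["pi3b_64_g9"], 2),
        "gcc 9/6": round(row["pi4b_64_g9"] / row["pi4b_64"], 2),
    }


result = ratios(ROW_0P4_2)
print(result)
```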
  



OpenMP-MemSpeed - OpenMP-MemSpeed264, OpenMP-MemSpeed264g9,
NotOpenMP-MemSpeed264, NotOpenMP-MemSpeed264g9, OpenMP-MemSpeed2,
NotOpenMP-MemSpeed2

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2). Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP. Detailed comparisons of these results are rather meaningless. Below are Pi 4B results from a gcc 9 compilation. See MemSpeed results for other comparisons.


     Memory Reading Speed Test OpenMP 64 Bit gcc 9 by Roy Longbottom

               Start of test Thu Sep 26 22:08:22 2019

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S
 
       4    7616   8480   8749   7548   8520   8530  35856  18594  18601
       8    8195   8660   8876   8147   5740   8365  37153  18878  18864
      16    7992   7684   8189   8064   8139   8023  35774  18896  18898
      32    8975   8535   8024   9048   8536   8512  37465  18392  19024
      64    8622   7997   8057   8511   7953   7994  19618  16857  16701
     128   11940  11637  11554  12101  11659  11498  13815  13417  13964
     256   17008  17339  16359  17104  17396  17038  11877  12344  12376
     512   17740  15986  18607  17522  18547  15612  12575  13616  13495
    1024    7011  10208  10016  11310   5287  11413   7060   6279  10045
    2048    7024   4201   7006   7017   6943   3225   2822   3386   3391
    4096    3854   7002   7126   6912   7074   3985   2199   3127   3132
    8192    2632   6950   7151   5291   2796   6813   2546   3091   2403
   16384    7350   7073   3537   7583   5327   3200   2609   3053   1907
   32768    7514   7616   7725   7807   2344   2936   2702   2559   3042
   65536    7065   2937   7571   4306   7086   2975   2127   3017   2677
  131072    1772   1779   2562   8092   2583   2800   2035   1866   2869

    Memory Reading Speed Test notOpenMP 64 Bit gcc 9 by Roy Longbottom

       4   12991  21391  23815  13044  22904  23856  11216   9060   9062
       8   13380  21857  24416  13414  23420  24400  11630   9313   9312
      16   13534  22119  24711  13550  23683  24718  11797   9447   9447
      32   11981  19879  21566  12100  21243  21572   9552   8928   8924
      64   11695  19992  20989  12044  21020  20966   9356   8613   8602
     128   11824  20347  21045  12116  21217  21067   8132   8149   8178
     256   11705  20247  21090  12041  21382  21013   8081   8182   5919
     512   11515  20242  21155  12059  21089  20938   8093   8127   7376
    1024    4504   8674   8151   4658   8682   8680   3894   3739   3887
    2048    1868   3231   3636   1868   3581   3491   2639   2871   2896
    4096    1921   2994   3748   1925   3781   3443   2589   2634   2636
    8192    1836   3719   3695   1921   3624   3791   2603   2596   2595
   16384    1951   3724   3002   1977   3838   3249   2584   2572   2384
   32768    1710   3431   3427   2008   3186   3449   2545   2531   2529
   65536    2030   3034   2135   2047   3035   2394   2550   2535   2546
  131072    2029   2023   2024   1873   2059   1652   2378   2466   2392
   


Stress Testing Programs Benchmarking Mode

My latest stress testing programs have parameters that specify running time, data size, number of threads, log file number and, in two cases, processing density. When run without parameters, the full range of options is used, providing a useful benchmark. Log file results from Pi 4B tests, and comparisons, are provided below.


Integer Stress Test Benchmark - MP-IntStress64, MP-IntStress

The integer program test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Disassembly shows that the test loop, in fact, used 68 instructions, most additional ones being load register type. The result of these is 68/32 instructions per 4 byte word. At the maximum of 1943M words per second, using a single core, resultant execution speed was 4129 MIPS with nearly four times more using all cores.
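
The MIPS figure follows from the measured data rate. A minimal Python sketch of the arithmetic, using the single thread 16 KB speed of 7771 MB/second from the Pi 4B 64 bit log below:

```python
BYTES_PER_WORD = 4
LOOP_INSTRUCTIONS = 68   # instructions in the disassembled test loop
WORDS_PER_LOOP = 32      # one add or subtract per word, 32 per loop

mb_per_second = 7771     # 1 thread, 16 KB, Pi 4B 64 bits

# Millions of 4 byte words processed per second, rounded as in the text.
words_m = round(mb_per_second / BYTES_PER_WORD)

# Millions of instructions per second at 68/32 instructions per word.
mips = words_m * LOOP_INSTRUCTIONS / WORDS_PER_LOOP

print(words_m, round(mips))
```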

The tables below show speeds on the systems considered and the average performance gains of the Pi 4B at 64 bits, which were somewhat limited in this case.

                 Gentoo Pi 4B 64 Bits

  MP-Integer-Test 64 Bit v1.0 Fri Sep  6 16:33:36 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   4.3    1   7771  7352  3895  00000000    Yes
   3.3    2  15467 14218  3714  FFFFFFFF    Yes
   3.0    4  28715 26652  3345  5A5A5A5A    Yes
   3.0    8  30292 26310  3334  AAAAAAAA    Yes
   3.0   16  29466 28503  3337  CCCCCCCC    Yes
   3.0   32  29351 30358  3390  0F0F0F0F    Yes


              Pi 4B 32 bit MB/sec        Pi 3B+ 64 bit MB/sec

               KB      KB      MB         KB      KB      MB
               16     160      16         16     160      16
   Threads
        1    5964    5756    3931       4823    3884    1209
        2   11787   11430    3748       9613    7709    1908
        4   23214   22060    3456      17737   15137    1779
        8   22197   22171    3472      17651   18692    1767
       16   22671   23299    3256      18255   18793    1757
       32   21379   21881    3346      18246   18674    1748

             Pi 4B 64b/32b              64b Pi 4B/3B+
Average
Gain         1.31    1.25    0.99       1.63    1.67    2.13
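
The average gains appear to be the mean of the individual speed ratios over the six thread counts. That is an assumption, but the Python sketch below, using the MB/second figures from the tables above, reproduces the quoted values to two decimal places.

```python
from statistics import mean

# MB/second at 1, 2, 4, 8, 16 and 32 threads; columns are 16 KB, 160 KB, 16 MB.
PI4_64 = [(7771, 7352, 3895), (15467, 14218, 3714), (28715, 26652, 3345),
          (30292, 26310, 3334), (29466, 28503, 3337), (29351, 30358, 3390)]
PI4_32 = [(5964, 5756, 3931), (11787, 11430, 3748), (23214, 22060, 3456),
          (22197, 22171, 3472), (22671, 23299, 3256), (21379, 21881, 3346)]
PI3_64 = [(4823, 3884, 1209), (9613, 7709, 1908), (17737, 15137, 1779),
          (17651, 18692, 1767), (18255, 18793, 1757), (18246, 18674, 1748)]


def average_gain(new, old):
    """Mean of the per thread count speed ratios, one value per data size."""
    columns = len(new[0])
    return [mean(n[c] / o[c] for n, o in zip(new, old))
            for c in range(columns)]


gain_64_vs_32 = average_gain(PI4_64, PI4_32)
gain_4b_vs_3b = average_gain(PI4_64, PI3_64)

print("Pi 4B 64b/32b ", [f"{g:.2f}" for g in gain_64_vs_32])
print("64b Pi 4B/3B+ ", [f"{g:.2f}" for g in gain_4b_vs_3b])
```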
  



Single Precision Floating Point Stress Test Benchmark - MP-FPUStress64, MP-FPUStress

This and the double precision program carry out the same calculations as MP-MFLOPS, but are slightly faster because they include a loop that repeats the tests within the calculate functions. Maximum speeds were 6.75 GFLOPS, using one core, and 26.7 GFLOPS with four cores.

These programs were produced using a later compiler than the one used for MP-MFLOPS, which at least resulted in similar speeds between the 32 bit and 64 bit versions. Typical Pi 4B/3B+ performance improvements were indicated.


                 Gentoo Pi 4B 64 Bits

  MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep  6 16:30:12 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   1.7    T1   2  2819  2874   504   40392  76406  99700
   3.2    T2   2  5592  5702   511   40392  76406  99700
   4.6    T4   2  9223  7520   519   40392  76406  99700
   6.0    T8   2  9520 10471   545   40392  76406  99700
   8.2    T1   8  5381  5595  2050   54764  85092  99820
   9.8    T2   8 11039 10883  2173   54764  85092  99820
  11.3    T4   8 19087 21040  2044   54764  85092  99820
  12.9    T8   8 19747 21107  2016   54764  85092  99820
  17.5    T1  32  6693  6753  6377   35206  66015  99520
  20.2    T2  32 13491 13464  8710   35206  66015  99520
  22.2    T4  32 25732 26704  9160   35206  66015  99520
  24.1    T8  32 25708 25770  8927   35206  66015  99520

            End of test Fri Sep  6 16:30:37 2019
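
Multithreading gains can be read from the log as simple ratios. This illustrative Python sketch uses the 32 Ops/Word, 128 KB column above and assumes the four cores of the Pi 4B, so eight threads cannot be expected to exceed a gain of four.

```python
CORES = 4  # the Pi 4B has four CPU cores

# MFLOPS at 32 Ops/Word with 128 KB data, by thread count, from the log above.
MFLOPS_128KB = {1: 6753, 2: 13464, 4: 26704, 8: 25770}


def speedups(results):
    """Speed relative to one thread, rounded to two decimal places."""
    base = results[1]
    return {threads: round(value / base, 2)
            for threads, value in results.items()}


gains = speedups(MFLOPS_128KB)
for threads, gain in gains.items():
    # Efficiency compares the gain with the number of cores actually usable.
    efficiency = gain / min(threads, CORES)
    print(f"{threads} threads: x{gain:.2f}, efficiency {efficiency:.0%}")
```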


              Pi 4B 32 bit               Pi 3B+ 64 bit
Threads       KB      KB      MB         KB      KB      MB
Ops/wd       12.8     128    12.8       12.8     128    12.8

T1   2       2641    2607     646        838     826     373
T2   2       5089    5116     618       1659    1650     380
T4   2       8282    8522     683       2584    3296     384
T8   2       8756    9847     686       3013    3056     391
T1   8       5543    5428    2597       1981    1972    1354
T2   8      10754   10603    2711       3936    3923    1518
T4   8      18716   20823    2844       7482    7396    1531
T8   8      19859   21684    2555       7399    7705    1534
T1  32       5309    5274    5265       2820    2809    2462
T2  32      10557   10509    9991       5636    5583    4754
T4  32      20416   20919   11340      10640   10882    6020
T8  32      20072   19787    9330      10641   10926    6159

              Average Pi 4B Performance Gains

  Ops/Word      Pi 4B 64b/32b              64b Pi 4B/3B+

        2    1.09    1.04    0.79       3.37    3.16    1.36
        8    1.00    1.01    0.77       2.69    2.80    1.40
       32    1.27    1.29    0.96       2.40    2.41    1.85
  



Double Precision Floating Point Stress Test Benchmark - MP-FPUStress64DP,
MP-FPUStressDP

Maximum measured DP speeds were 3.39 GFLOPS, using one core, and 13.2 GFLOPS with four cores. Some of the 64/32 bit and 4B/3B+ performance ratios were similar to those from MP-MFLOPS.


                 Gentoo Pi 4B 64 Bits

  MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep  6 16:31:24 2019

    Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.2    T1   2  1398  1462   285   40395  76384  99700
   6.2    T2   2  2799  2807   256   40395  76384  99700
   8.9    T4   2  5024  4589   257   40395  76384  99700
  11.5    T8   2  5089  5545   280   40395  76384  99700
  15.7    T1   8  2668  2790  1103   54805  85108  99820
  18.8    T2   8  5670  5545  1158   54805  85108  99820
  21.7    T4   8 10259 10011  1068   54805  85108  99820
  24.7    T8   8 10239 10824  1036   54805  85108  99820
  34.1    T1  32  3317  3390  3195   35159  66065  99521
  39.2    T2  32  6791  6754  4753   35159  66065  99521
  43.1    T4  32 12940 13200  4497   35159  66065  99521
  46.9    T8  32 13200 13049  4557   35159  66065  99521

            End of test Fri Sep  6 16:32:11 2019

              Pi 4B 32 bit               Pi 3B+ 64 bit
Threads       KB      KB      MB         KB      KB      MB
Ops/wd       12.8     128    12.8       12.8     128    12.8

T1   2        993     998     329        412     411     193
T2   2       1971    1995     309        828     824     194
T4   2       3633    3937     340       1543    1514     197
T8   2       3635    3796     339       1525    1551     196
T1   8       2378    2445    1288        980     978     696
T2   8       4770    4860    1282       1975    1964     782
T4   8       9281    9556    1210       3688    3688     781
T8   8       9119    9448    1245       3726    3689     787
T1  32       2697    2726    2708       1402    1403    1231
T2  32       5397    5446    5163       2808    2808    2399
T4  32      10689   10806    5146       5379    5413    3195
T8  32      10716   10494    4497       5450    5485    3150

              Average Pi 4B Performance Gains

  Ops/Word   Pi 4B 64b/32b              64b Pi 4B/3B+

        2    1.40    1.37    0.82       3.34    3.39    1.38
        8    1.13    1.12    0.87       2.78    2.83    1.44
       32    1.23    1.24    1.00       2.40    2.41    1.86
  


High Performance Linpack Benchmark

Earlier, the High Performance Linpack Benchmark was run on Raspberry Pi 3 models and, later, on the Raspberry Pi 4 system, both via the 32 bit Raspbian Operating System. Details and results can be found in the following reports: Pi 3B and 3B+ results and Pi 4B 32 bit results.

Initially, two versions of HPL tests were run, one accessing precompiled Basic Linear Algebra Subprograms and the other with ATLAS alternatives, that had to be built. The whole benchmark suite was produced according to these instructions.

The ATLAS version was installed, as the older benchmark would not run on the Pi 4. One issue is the time required for the build, apparently due to the numerous tuning tests. Time taken was 14 hours using a Pi 3B+, then 8 hours on a Pi 4. Later, 64 bit ATLAS was built on the Pi 3B+, via Gentoo, taking 26 hours, which included extended periods swapping data with the rather slow main drive. The procedure specified in the above was used, successfully leading to a working package. Only one change was required, to Make.rpi line 95, replacing the path as follows:

LAdir = /home/pi/atlas-build to LAdir = /home/demouser/atlas-build.

Following the introduction of 64 bit Gentoo for the Pi 4B, ATLAS was again created, taking more than 10 hours. As indicated in the above links, the HPL benchmark can be a useful stress test, due to the long running time with heavy processing. It can lead to CPU MHz being throttled on the Pi 4B, producing slow GFLOPS speeds. The tests reported here were run using a Pi 4B with a cooling fan, with CPU MHz monitored to help to indicate that the processor was running at full speed.

The benchmark was run on various Raspberry Pi models, using the same parameters. An example of the main output produced is shown below. Key areas are array size parameter N, running time, GFLOPS speed rating and sumcheck (0.0010188 in this case), including whether acceptable (PASSED).

pi@raspberrypi:~/hpl-2.2/bin/rpi $ mpiexec -f nodes-1pi ./xhpl
================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   20000 
NB     :     128 
PMAP   : Row-major process mapping
P      :       2 
Q      :       2 
PFACT  :   Right 
NBMIN  :       4 
NDIV   :       2 
RFACT  :   Crout 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             494.46              1.079e+01
HPL_pdgesv() start time Fri Oct 11 22:34:37 2019

HPL_pdgesv() end time   Fri Oct 11 22:42:52 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0010188 ...... PASSED
================================================================================
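
The Gflops figure can be checked from N and the running time. The Python sketch below assumes the usual HPL operation count of 2/3 N^3 + 3/2 N^2 floating point operations; the function name is chosen for this example only.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for solving an order n system in the given wall time.

    Assumes the standard HPL operation count of 2/3 n^3 + 3/2 n^2.
    """
    operations = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return operations / seconds / 1e9


# N = 20000 in 494.46 seconds, as in the output above.
rate = hpl_gflops(20000, 494.46)
print(f"{rate:.3e} GFLOPS")
```

This gives about 10.79 GFLOPS, in line with the 1.079e+01 reported above.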
  



High Performance Linpack Benchmark Results

Importantly, maximum performance depends on the amount of RAM available. As with the original single CPU Linpack benchmark, where N is the matrix problem size, minimum memory used is N x N x 8 Bytes (double precision), that is 512 MB for N = 8000 or 3.2 GB for N = 20000. The end of the detailed output indicates a further problem, where the first run at maximum size might be slow, with extra time spent swapping data out of RAM to create space for the HPL data.
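
The memory requirement is straightforward to tabulate. The Python sketch below computes N x N x 8 bytes for the problem sizes used here and, as a purely hypothetical illustration, the largest N whose matrix alone would fit in a 4 GB (decimal) budget. In practice the operating system and HPL work space need room as well, so a usable N is smaller.

```python
import math

BYTES_PER_DOUBLE = 8


def matrix_bytes(n):
    """Minimum memory for an n x n double precision matrix."""
    return n * n * BYTES_PER_DOUBLE


def largest_n(ram_bytes):
    """Largest n whose matrix alone fits in ram_bytes (upper bound only)."""
    return math.isqrt(ram_bytes // BYTES_PER_DOUBLE)


for n in (4000, 8000, 16000, 20000):
    print(f"N = {n:5d}: {matrix_bytes(n) / 1e6:7.0f} MB")

# Hypothetical 4 GB budget, ignoring all other memory use.
print("Upper bound on N for 4 GB:", largest_n(4_000_000_000))
```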

Next, the benchmark produces a sumcheck but, in the case of the ATLAS implementation, this is not consistent for the same problem size. All those shown here were indicated as PASSED (within specified tolerances). The variation could be caused by different CPU models or alternative compilations, but the least understandable case is identified at the end of the detailed output, where the sumcheck is shown to vary on repeating the program on the same system.

Comparing Pi 4B 32 bit and 64 bit GFLOPS maximum speeds, the 32 bit version appears to be slightly faster (or the same within reasonable tolerances). Then it is not clear (to me), whether the compiled code completely embraces the difference in technology or whether external compile options should be included for the different packages involved.

Anyway, around 10 double precision GFLOPS was the maximum produced by other benchmarks, reported above.
  
       ------ Time ------   ----- GFLOPS -----  ----------- Sumcheck ----------

          4B     4B    3B+     4B     4B    3B+         4B         4B        3B+
    N    64b    32b    64b    64b    32b    64b        64b        32b        64b

 4000   5.51   5.20  14.53   7.75   8.20   2.94  0.0022808  0.0023975  0.0025857
 8000  38.22  36.70 101.59   8.93   9.30   3.36  0.0017216  0.0016746  0.0017518
16000 269.26 263.00         10.14  10.40         0.0012577  0.0011258
20000 513.67 494.30         10.38  10.80         0.0009637  0.0010188

                           GFLOPS Comparisons 

                                  4B           64b
                       N        64b/32b       4B/3B+

                    4000          0.95          2.64
                    8000          0.96          2.66
                   16000          0.98
                   20000          0.96


                          Example Logged Results

                                                    Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             516.71              1.032e+01

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008697 ...... PASSED
================================================================================

First Run

WR11C2R4       20000   128     2     2             656.89              8.120e+00

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009470 ...... PASSED
================================================================================
  



DriveSpeed Benchmark - DriveSpeed64v2, DriveSpeed64v2g9, DriveSpeed

This benchmark has the format shown below, measuring writing and reading speeds of large files, cached files, random access and numerous small files. Run time parameters are available to specify large file size and the file path.

In order to test a USB drive, it must be mounted: plug it in, then right click and select Mount Volume, or double click to open. Run the df command to find the path, which is needed as a run time parameter. Following is an example log file and the command used to run the program to test a USB 3 stick. With no MB parameter, default large file sizes are 8 and 16 MB.


############################## Pi 4B USB 3 ###############################

Run command ./DriveSpeed64v2g9 MB 512 FilePath /run/media/demouser/PATRIOT

##########################################################################

   DriveSpeed RasPi 64 Bit 2.0 Fri Sep 13 22:25:40 2019
 
 Selected File Path: 
 /run/media/demouser/PATRIOT/
 Total MB  120832, Free MB  119778, Used MB    1054

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

 512    30.72    31.11    34.01   287.24   295.04   311.90
1024    34.66    36.11    35.45   298.87   302.38   300.26
 Cached
   8    42.03    39.58    38.85  1167.71  1029.35  1061.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.007    0.310     9.65    10.42     9.71

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.13   268.10   427.95   657.48
 ms/file   122.73   122.28   122.22     0.02     0.02     0.02    2.557
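
The Total, Free and Used MB figures at the top of the log can be obtained for any mounted path. The following is a minimal Python sketch, not the benchmark's own code. The function name drive_space_mb is hypothetical, and it assumes that Used is reported as Total minus Free, which matches the figures above (120832 - 119778 = 1054).

```python
import shutil

def drive_space_mb(path):
    """Return (total, free, used) in whole MB for the filesystem holding path."""
    usage = shutil.disk_usage(path)      # sizes are reported in bytes
    mb = 1024 * 1024
    total = usage.total // mb
    free = usage.free // mb
    used = total - free                  # assumption: Used = Total - Free, as in the log
    return total, free, used

if __name__ == "__main__":
    # Replace "." with the mounted drive path found via the df command
    total, free, used = drive_space_mb(".")
    print(f" Total MB {total:7d}, Free MB {free:7d}, Used MB {used:7d}")
```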
  

For non-cached tests, the standard version of this benchmark opens files with the O_DIRECT option, specifying Direct I/O (no caching). The latest minor variant appears to work, as expected, on 32 bit Raspbian, on both main and USB drives. The 64 bit compilation failed to write to the main SD drive and failed to read from USB flash drives. Omitting O_DIRECT for reading appeared to correct the latter (see above). To check this and enable main drive measurements, separate large file write only and read only programs, without Direct I/O, were produced, to follow write/reboot/read procedures. These were also needed to measure throughput when simultaneously writing to, or reading from, two USB 3 drives.
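
As an illustration of the O_DIRECT behaviour described above, the following Python sketch tries a Direct I/O write first and falls back to ordinary buffered writes if the filesystem refuses it. This is not the benchmark's source code. The function names, the 1 MB block size and the fallback policy are all assumptions made for the example.

```python
import mmap
import os
import tempfile
import time

BLOCK = 1024 * 1024  # 1 MB per write call (assumed block size)

def _timed_write(path, extra_flags, mbytes):
    """Write mbytes MB to path using the extra open flags; return MB/second."""
    buf = mmap.mmap(-1, BLOCK)           # anonymous map is page aligned, as O_DIRECT needs
    buf.write(b"\xAA" * BLOCK)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(mbytes):
            if os.write(fd, buf) != BLOCK:
                raise OSError("short write")
        os.fsync(fd)                     # flush anything still held in the cache
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        buf.close()
    return mbytes / elapsed

def write_speed(path, mbytes=8):
    """Return (MB/second, direct_io_used), trying Direct I/O first."""
    o_direct = getattr(os, "O_DIRECT", 0)    # the flag is not defined on every platform
    if o_direct:
        try:
            return _timed_write(path, o_direct, mbytes), True
        except OSError:
            pass                             # for example EINVAL from the filesystem
    return _timed_write(path, 0, mbytes), False

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        speed, direct = write_speed(os.path.join(folder, "test.bin"))
        print(f"{speed:.2f} MB/second, Direct I/O used: {direct}")
```

When the fallback path is taken, the measured speed includes caching effects, which is why separate write and read programs with a reboot in between were needed for the main drive.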

Following are 64 bit Pi 4B SD main drive results from the separate write and read tests, followed by full results from a Pi 4B with 32 bit Raspbian, using an SD card of the same brand. Note the similarity in writing and reading speeds of large files.


################# Main SD Drive From Write/Read Tests Below ################

       Write1   Write2   Write3    Read1    Read2    Read3

Write   18.99    19.34    19.47  1337.09  1164.91  1325.96  - cached
Read     N/A      N/A      N/A     45.80    45.88    45.89  - not cached

============================== 32 Bit Results ==============================

   DriveSpeed RasPi 1.1 Mon Apr 29 10:20:57 2019

 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks/DriveSpeed/drive1
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second

      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   16.41   11.21   12.27   39.81   40.10   40.39
      16   11.79   21.10   34.05   40.18   40.19   40.33
Cached
       8  137.47  156.43  285.59  580.73  598.66  587.97

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.371   0.371   0.363    1.28    1.53    1.30

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      3.49    6.41    8.26    7.67   11.68   17.51
ms/file     1.17    1.28    1.98    0.53    0.70    0.94   0.014
  

USB Flash Drives below or Go To Start


USB 3 and 2 Flash Drive Benchmarks

Two FAT 32 formatted USB 3 sticks were used, P at 128 GB, with 32 KB sectors, reading speed rated as up to 400 MB/second, and R 8.8 GB partition, with 8 KB sectors, reading speed rated as up to 190 MB/second (but appears to do better sometimes). The benchmark was run using USB 2 connections, on a Pi 3B+ and a Pi 4B, and via USB 3 slots on the Pi 4B.

Following is a summary of results. USB 3 large file reading speeds improved by between 6.7 and 8.1 times, but writing performance was disappointing. The slower P writing speeds might be affected by the mysteries of updating file allocation tables, which would also influence random access and handling of numerous small files, including file delete times. USB 3 provided little or no performance gain for the latter. Cached reading reflects RAM speed, the only area showing a clear difference in performance between the Pi 3B+ and Pi 4B.


MB/second 16 MB USB 2, 1024 MB USB 3

 System   Drive   Write1  Write2  Write3   Read1   Read2   Read3

Pi 3B+  USB 2 P     11.5    11.4    11.5    36.6    37.7    37.3
Pi 3B+  USB 2 R     15.9    16.4    13.9    37.1    40.1    39.8

Pi 4B   USB 2 P     12.6    12.6    12.6    37.0    37.3    37.2
Pi 4B   USB 2 R     22.6    22.9    22.9    36.5    36.3    36.5
Pi 4B   USB 3 P     34.7    36.1    35.5   298.9   302.4   300.3
Pi 4B   USB 3 R     48.9    44.6    53.4   249.4   248.8   246.2
Compare MB/second
Pi 4B   P USB 3/2   2.75    2.88    2.81    8.07    8.11    8.07
Pi 4B   R USB 3/2   2.17    1.94    2.33    6.83    6.85    6.74

Cached MB/second  Write1  Write2  Write3   Read1   Read2   Read3

Pi 3B+  USB 2 P     13.6    14.2    14.4   633.4   544.0   464.3
Pi 3B+  USB 2 R     13.7    14.4    19.4   623.5   661.4   557.6

Pi 4B   USB 2 P     15.0    14.7    14.8  1204.0  1047.3  1066.3
Pi 4B   USB 2 R     20.8    21.2    13.9   930.2   933.6  1230.3
Pi 4B   USB 3 P     42.0    39.6    38.9  1167.7  1029.4  1061.6
Pi 4B   USB 3 R     21.1    15.9    36.2  1103.6   944.9   981.0
Compare
Pi 4B   P USB 3/2   2.80    2.70    2.63    0.97    0.98    1.00
Pi 4B   R USB 3/2   1.01    0.75    2.60    1.19    1.01    0.80

Random milliseconds
                    Read                   Write

Pi 3B+  USB 2 P    0.013   0.013   0.254   11.76   10.18    9.80
Pi 3B+  USB 2 R    0.017   0.008   0.032    1.09    1.39   11.72

Pi 4B   USB 2 P    0.006   0.007   0.215    9.56    8.54    8.75
Pi 4B   USB 2 R    0.009   0.005   0.016    1.35    2.12    1.34
Pi 4B   USB 3 P    0.004   0.007   0.310    9.65   10.42    9.71
Pi 4B   USB 3 R    0.004   0.004   0.008    1.75    0.85    0.92
Compare
Pi 4B   P USB 3/2   1.50    1.00    0.69    0.99    0.82    0.90
Pi 4B   R USB 3/2   2.25    1.25    2.00    0.77    2.49    1.46

200 Small Files  milliseconds
                  Write                    Read                  Delete

Pi 3B+  USB 2 P    134.2   128.6   129.6    0.08    0.12    0.07    3.36
Pi 3B+  USB 2 R    105.5   104.7   107.6    0.05    0.05    0.07    0.26

Pi 4B   USB 2 P    125.8   125.5   125.8    0.02    0.02    0.02    3.12
Pi 4B   USB 2 R    104.1   104.0   104.0    0.02    0.02    0.03    0.14
Pi 4B   USB 3 P    122.7   122.3   122.2    0.02    0.02    0.02    2.56
Pi 4B   USB 3 R    105.4   104.0   104.3    0.02    0.02    0.03    0.15
Compare
Pi 4B   P USB 3/2   1.03    1.03    1.03    1.00    1.00    1.00    1.22
Pi 4B   R USB 3/2   0.99    1.00    1.00    1.00    1.00    1.00    0.95
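
The Compare rows above can be recalculated from the tabulated figures. The short Python sketch below uses the Pi 4B large file values. It assumes that speed comparisons are USB 3 divided by USB 2, and that time comparisons (milliseconds) are inverted, USB 2 divided by USB 3, so that values above 1.0 always mean USB 3 was faster. The function names are hypothetical. Results can differ from the table in the second decimal place because the inputs here are already rounded.

```python
# Pi 4B large file MB/second: Write1, Write2, Write3, Read1, Read2, Read3
usb2 = {"P": [12.6, 12.6, 12.6, 37.0, 37.3, 37.2],
        "R": [22.6, 22.9, 22.9, 36.5, 36.3, 36.5]}
usb3 = {"P": [34.7, 36.1, 35.5, 298.9, 302.4, 300.3],
        "R": [48.9, 44.6, 53.4, 249.4, 248.8, 246.2]}

def speed_ratios(usb3_speeds, usb2_speeds):
    """MB/second comparison: higher is better, so divide USB 3 by USB 2."""
    return [round(new / old, 2) for new, old in zip(usb3_speeds, usb2_speeds)]

def time_ratios(usb3_times, usb2_times):
    """Millisecond comparison: lower is better, so divide USB 2 by USB 3."""
    return [round(old / new, 2) for new, old in zip(usb3_times, usb2_times)]

if __name__ == "__main__":
    for drive in ("P", "R"):
        print(drive, "USB 3/2", speed_ratios(usb3[drive], usb2[drive]))
    # Drive P random read times in milliseconds, USB 3 then USB 2
    print("P random read", time_ratios([0.004, 0.007, 0.310], [0.006, 0.007, 0.215]))
```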

  

Drive Write/Reboot/Read Tests below or Go To Start


Drive Write/Reboot/Read Tests - DriveSpeed264WR, DriveSpeed264Rd

As a reminder, different programs were produced to enable separate measurements of writing and reading, because of the inability to avoid written data being cached on a main drive, invalidating drive reading speed measurements. These were also required to measure overall throughput, when using two USB drives. The write test also reads the data for verification, but this will normally be cached in RAM, with high data transfer speeds. VMSTAT results are provided, covering reading speeds.

Main SD Drive - This is rated at up to 98 MB/second reading speed but only achieves near 46 MB/second. VMSTAT results confirm the data transfer speed, with three files eventually occupying around 3 GB of the cache, low CPU utilisation of 2% (x4) and 23% (x4) waiting for I/O.

  Run Commands ./DriveSpeed264WR MB 1024 and ./DriveSpeed264Rd MB 1024

 Current Directory Path: /home/demouser/RPi3-64-Bit-Benchmarks/IOtests/writeread
 Total MB   28225, Free MB   18761, Used MB    9464
 
                1024 MB   MBytes/Second

       Write1   Write2   Write3    Read1    Read2    Read3

Write   18.99    19.34    19.47  1337.09  1164.91  1325.96
Read     N/A      N/A      N/A     45.80    45.88    45.89
 
                                  vmstat

procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st

0  1     0 673848  60668 2792716    0    0 45056     0  767 1181  0  2 75 23  0
0  1     0 630228  60668 2835544    0    0 44544     0  789 1199  0  2 74 23  0
0  1     0 585204  60668 2880268    0    0 45056     0  691 1041  0  3 75 23  0

USB 3 Drive P -
Read only speed was similar to that from the earlier detailed test. Note the high average CPU utilisation of 17%, equivalent to 68% of one core.

 Run Commands ./DriveSpeed264WR MB 1024 FilePath /run/media/demouser/PATRIOT
    and       ./DriveSpeed264Rd MB 1024 FilePath /run/media/demouser/PATRIOT

 Selected File Path: 
 /run/media/demouser/PATRIOT/
 Total MB  120832, Free MB  119752, Used MB    1080

                 1024 MB   MBytes/Second

       Write1   Write2   Write3    Read1    Read2    Read3

Write   58.45    23.10    22.91  1368.04  1190.71  1354.84
Read     N/A      N/A      N/A    306.18   294.93   302.91
 
                                  vmstat

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

1  0   256 811672  20920 2696504    0    0 305664     0 3898 6182  1 15 73 11  0
0  1   256 510852  20920 2996188    0    0 303616     0 4304 5936  1 16 72 12  0
1  0   256 239400  20920 3267636    0    0 307184     0 4512 6177  1 17 71 11  0
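
The 17% and 68% figures quoted for drive P can be derived from the vmstat lines above. The Python sketch below averages the us and sy columns and multiplies by the four cores. It also averages the bi column, assuming this can be read as thousands of bytes per second, as done elsewhere in this report. The function name summarise and the dictionary keys are hypothetical.

```python
VMSTAT_USB3_P = """\
1  0   256 811672  20920 2696504    0    0 305664     0 3898 6182  1 15 73 11  0
0  1   256 510852  20920 2996188    0    0 303616     0 4304 5936  1 16 72 12  0
1  0   256 239400  20920 3267636    0    0 307184     0 4512 6177  1 17 71 11  0
"""

def summarise(vmstat_rows, cores=4):
    """Average the bi, us + sy and wa columns over the supplied vmstat lines."""
    rows = [[int(x) for x in line.split()]
            for line in vmstat_rows.splitlines() if line.strip()]
    count = len(rows)
    # Column order: r b swpd free buff cache si so bi bo in cs us sy id wa st
    blocks_in = sum(r[8] for r in rows) / count
    busy = sum(r[12] + r[13] for r in rows) / count
    waiting = sum(r[15] for r in rows) / count
    return {
        "read_mb_per_sec": blocks_in / 1000,   # assumption: bi taken as KB/second
        "cpu_pct": busy,                       # percent of all cores
        "one_core_pct": busy * cores,          # equivalent load on a single core
        "wait_pct": waiting,
    }

if __name__ == "__main__":
    for name, value in summarise(VMSTAT_USB3_P).items():
        print(f"{name:16s} {value:8.1f}")
```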

USB 3 Drive R -
This time data transfer speed was slower than the earlier example.

 Selected File Path: 
 /run/media/demouser/REMIX_OS/
 Total MB    9017, Free MB    7485, Used MB    1532

                 1024 MB   MBytes/Second                  

       Write1   Write2   Write3    Read1    Read2    Read3

Write   46.43    28.81    36.57  1265.07  1103.23  1236.02
Read     N/A      N/A      N/A    172.71   172.14   176.49
 
                                  vmstat

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

0  1   256 111512    912 3417624    0    0 175189     0 4315 5929  1 12 71 17  0
0  1   256 169756    992 3358840    0    0 169043     0 4064 5515  1 11 71 17  0
0  1   256 177444   1068 3351176    0    0 155724     0 4088 6023  1 12 70 16  0
  
USB 3 Drives R and P Together below or Go To Start


USB 3 Drives R and P Together

File sizes were reduced to 512 MB for these tests, in order to ensure that there would be sufficient RAM to contain six copies, as indicated in VMSTAT cache occupancy. This makes it trickier to measure total throughput, but the following appears to provide a best case example, with a maximum of 386 MB/second and CPU utilisation near 100% of one core. Different log files are needed for reading, to avoid confusion.

Later is a bad example, where one drive appears to be running at USB 2 speed.

 Run Commands ./DriveSpeed264WR MB 512 FilePath /run/media/demouser/PATRIOT
    and       ./DriveSpeed264WR MB 512 FilePath /run/media/demouser/REMIX_OS
    and       ./DriveSpeed264Rd MB 512 FilePath /run/media/demouser/PATRIOT Log 1
    and       ./DriveSpeed264Rd MB 512 FilePath /run/media/demouser/REMIX_OS Log 2

Write/Read Thu Sep 19 16:07:48 2019  /run/media/demouser/REMIX_OS/
Write/Read Thu Sep 19 16:07:46 2019  /run/media/demouser/PATRIOT/

                   512 MB MBytes/Second

       Write1   Write2   Write3    Read1    Read2    Read3

  R     28.72    33.89    44.69  1302.19  1131.65  1374.24
  P     11.93     8.86     6.21  1232.47  1072.38  1213.36

Sep 23 17:11:21 2019 /run/media/demouser/PATRIOT/
Sep 23 17:11:20 2019 /run/media/demouser/REMIX_OS/

                  512 MB MBytes/Second

       Write1   Write2   Write3    Read1    Read2    Read3   Seconds

 P       N/A      N/A      N/A    159.78   187.44   294.23   7.7 
 R       N/A      N/A      N/A    221.83   232.10   230.94   6.7+2 delayed start
 
                                  vmstat

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

0  0     0 3160720  74616  296092   0    0     0     0  2031 3601  4  2 94  0  0
0  1     0 3112052  74616  342188   0    0  45552    0  1512 2257  1  3 93  4  0
0  1     0 2908004  74616  547600   0    0 206336    0  4684 7169  4 14 67 15  0
2  0     0 2531960  74616  919400   0    0 369136    0  5495 8033  4 24 47 25  0
2  0     0 2149064  74616 1303288   0    0 382960    0  5168 7007  1 21 52 26  0
1  1     0 1771492  74616 1681348   0    0 385024    0  5969 8255  1 23 49 26  0
1  1     0 1383524  74616 2068788   0    0 386016    0  5621 7926  1 21 49 29  0
0  2     0  999100  74616 2453280   0    0 383488    0  4602 6895  1 19 54 26  0
0  1     0  628988  74616 2824188   0    0 368640    0  5405 8153  2 20 56 22  0
1  0     0  310748  74624 3142732   0    0 317424   20  4622 6551  1 17 72 10  0
1  0     0  223052  73680 3231812   0    0 268288    0  2815 5012  1 18 72 10  0
0  0     0  223824  73680 3231280   0    0  32768    0  1044 2009  1  3 95  1  0
0  0     0  223824  73680 3231280   0    0      0    0   393  619  0  0 99  0  0
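
The Seconds column in the read results above follows from the file size and the three measured speeds, and an overall average can be estimated in the same way. The Python sketch below is illustrative only. It assumes that "+2 delayed start" means the drive R reads began 2 seconds after those of drive P, and the variable and function names are hypothetical.

```python
FILE_MB = 512
READ_SPEEDS = {                       # MB/second for Read1, Read2, Read3
    "P": [159.78, 187.44, 294.23],
    "R": [221.83, 232.10, 230.94],
}
R_START_DELAY = 2.0                   # seconds (assumed meaning of "+2 delayed start")

def read_seconds(speeds, file_mb=FILE_MB):
    """Total time to read one file of file_mb at each of the given speeds."""
    return sum(file_mb / speed for speed in speeds)

p_secs = read_seconds(READ_SPEEDS["P"])
r_secs = read_seconds(READ_SPEEDS["R"])

# Elapsed time from the first read starting to the last one finishing
elapsed = max(p_secs, R_START_DELAY + r_secs)
total_mb = FILE_MB * sum(len(s) for s in READ_SPEEDS.values())
average_mb_per_sec = total_mb / elapsed

if __name__ == "__main__":
    print(f"P {p_secs:.1f} seconds, R {r_secs:.1f} seconds")
    print(f"Overall average {average_mb_per_sec:.0f} MB/second")
```

Under these assumptions the overall average comes out at around 350 MB/second, lower than the 386 MB/second peak shown by vmstat, since both drives were only active together for part of the time.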

 ===============================================================================

 Bad Example

       Write1   Write2   Write3    Read1    Read2    Read3

 P       N/A      N/A      N/A     36.37    37.72    37.48
 R       N/A      N/A      N/A    248.18   248.22   223.53
  

LAN and WiFi Benchmarks below or Go To Start


LAN and WiFi Benchmarks - LanSpeed64, LanSpeed64g9, LanSpdx86Win.exe, LanSpeed

The Raspberry Pi LanSpeed64 version uses the same programming code as for the DriveSpeed benchmark, except O_DIRECT is not used on creating files. The measurements were made between the Pi 4B and a Windows 7 based PC, where the data transfer speed was confirmed via Task Manager Network information and sysstat sar -n DEV on the Raspberry Pi 4. SAMBA was also installed to connect a remote PC and enable an Intel Windows version, LanSpdx86Win.exe, to be run.

An example of a LanSpeed64 log file is provided below, preceded by examples of the required mount and run commands. For further details of required procedures see This PDF file, LAN/WiFi section. The 64 bit results are followed by details from running the benchmark on a 32 bit system, and showing the same levels of performance, within the usual variability.

Commands

sudo mount -t cifs -o dir_mode=0777,file_mode=0777 //192.168.1.68/d /media/public

./LanSpeed64 FilePath /media/public/test

                       Log File

   LanSpeed RasPi 64 Bit 1.0 Thu Sep 12 22:06:06 2019
 
 Selected File Path: 
 /media/public/test/
 Total MB  266240, Free MB   70991, Used MB  195249

                        MBytes/Second

  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    66.13    92.09    92.76    96.36    96.85    97.30
  16    80.79    93.59    94.61   103.99   104.34   104.57

 Random         Read                       Write
 From MB        4        8       16        4        8       16

 msecs      0.004    0.009    0.435     0.95     0.92     0.93

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs

 MB/sec      1.37     2.45     4.77     1.37     2.49     4.92
 ms/file     2.99     3.35     3.43     2.98     3.29     3.33    0.467

 ************************ 32 Bit Pi 4B ************************

                        MBytes/Second

  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    67.82    12.97    90.19    99.84    93.49    96.83
  16    92.25    92.66    92.96    103.9   105.28    91.17

 Random         Read                       Write
 From MB        4        8       16        4        8       16

 msecs      0.007     0.01     0.04     1.01     0.85     0.91

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs

 MB/sec      1.47      2.8     5.14     2.47     4.71     8.61
 ms/file     2.78     2.92     3.19     1.66     1.74     1.90    0.256
  

LAN and WiFi Benchmark Results below or Go To Start


LAN and WiFi Benchmark Results

Below are results from programs run on the Pi 3B+ and 4B, plus others from running on a PC. Dealing with large files, PC to Pi 4B and Pi 4B to PC LAN speeds demonstrated some gigabit performance examples (over 100 MB/second), around three times faster than on the Pi 3B+. My BT Hub has dual 2.4 GHz (WiFi1) and 5 GHz (WiFi2) capabilities, leading to the following erratic performance, where (I think) greater than 10 MB/second is indicative of 5 GHz and around 4 MB/second for 2.4 GHz, the former usually only on writing. In this case, the hub was inches away from the Pi.

I changed the hub settings to provide separate 2.4 and 5 GHz hub address selections, with 72 and 180 Mbits/second being indicated, respectively. These sorts of numbers were confirmed on my Smartphone, but were variable. The 64 bit version would not connect to the network at 5 GHz, unlike the 32 bit program which, for example, obtained 15 MB/second writing and 8 MB/second reading. These differences could be, I suppose, due to program, software and/or hub incompatibility.

Random access times appeared to be quite similar on all WiFi tests, with faster but variable comparative times via LAN. There were similar relationships on dealing with numerous small files.

Some results from running the 32 bit benchmark on a Pi 4B are provided. Performance there was also erratic, these speeds representing best case measurements, reading large files somewhat faster than those achieved at 64 bits.
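
For reference when reading the results, link rates in Mbits/second can be converted to the MB/second units used by the benchmark. The Python sketch below simply divides by 8 and ignores protocol overheads, so the figures are upper limits, not expected speeds. The function names are hypothetical.

```python
def mbits_to_mbytes(mbits_per_sec):
    """Upper limit in MB/second for a link rate in Mbits/second (8 bits per byte)."""
    return mbits_per_sec / 8.0

def percent_of_limit(measured_mb_per_sec, link_mbits_per_sec):
    """Measured speed as a percentage of the raw link limit."""
    return 100.0 * measured_mb_per_sec / mbits_to_mbytes(link_mbits_per_sec)

if __name__ == "__main__":
    for name, rate in (("2.4 GHz WiFi", 72), ("5 GHz WiFi", 180), ("Gigabit LAN", 1000)):
        print(f"{name:13s} {rate:5d} Mbits/second = {mbits_to_mbytes(rate):6.1f} MB/second maximum")
    # Example: a 4B LAN large file read measurement of 104.57 MB/second
    print(f"4B LAN read at {percent_of_limit(104.57, 1000):.0f}% of the gigabit limit")
```

On this basis, a link indicated at 72 Mbits/second cannot exceed 9 MB/second, which is consistent with treating speeds above 10 MB/second as 5 GHz connections.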

  
                       Large Files MB/second

System   MB   Write1   Write2   Write3    Read1    Read2    Read3

PC WiFi  16     4.08     4.16     4.11     2.34     1.68     1.30
PC LAN   16   106.11   106.11   105.89    50.67    33.86    25.47
LAN 3B+  16    28.63    29.03    28.96    22.18    32.28    32.61
3B+ WiFi 16    11.15    11.00    10.76     4.01     3.89     3.09
4B WiFi1 16     6.43     6.39     6.47     4.33     4.13     4.86
4B WiFi2 16    13.26    13.34    13.25     3.69     4.22     4.00
4B LAN   16    80.79    93.59    94.61   103.99   104.34   104.57
4B LAN  128    96.58    96.67    95.74   106.41   107.24   107.82
32 Bit
4B WiFi1 16     6.70     6.82     6.76     7.19     6.53     7.22
4B WiFi2 16    11.50    13.93    14.13     9.91     8.88     9.92


                        Random milliseconds

System          Read                      Write

PC WiFi        1.711    1.972    2.015     2.26     2.28     2.25
PC LAN         0.606    0.590    0.532     0.47     0.48     0.47
LAN 3B+        0.030    0.816    0.484     1.19     1.16     1.16
3B+ WiFi       3.052    3.167    3.475     3.60     3.39     3.45
4B WiFi1       3.286    3.549    3.627     4.02     3.45     3.72
4B WiFi2       2.786    2.822    2.944     3.20     2.94     2.92
4B LAN         0.004    0.009    0.435     0.95     0.92     0.93
32 Bit
4B WiFi1       2.691    2.875    3.048     3.13     2.93     2.84
4B WiFi2       Similar     


                    200 Small Files  milliseconds per file

System      Write                       Read                     Delete

PC WiFi     10.09    12.42    13.81     5.50     6.11     8.06    1.507 
PC LAN       4.05     4.59     4.53     2.38     2.23     2.64    0.661 
LAN 3B+      3.72     4.36     4.45     3.33     3.40     3.60    0.378
3B+ WiFi    12.61    13.53    14.97    13.17    14.06    15.88    2.534
4B WiFi1    15.08    16.53    22.83    12.96    14.23    17.29    2.509
4B WiFi2    11.38    12.85    12.82    10.64    11.83    14.15    2.083
4B LAN       2.99     3.35     3.43     2.98     3.29     3.33    0.467
32 Bit
4B WiFi1    12.14    18.59    15.70    11.10    22.20    12.99    2.153
4B WiFi2    30.85    17.83    18.10    16.62    14.93    16.01    3.361
   

Java Whetstone Benchmark below or Go To Start


Java Whetstone Benchmark - whetstc.class

The benchmark measures performance of various floating point and integer calculations, with an overall rating in Million Whetstone Instructions Per Second (MWIPS). Results are also provided for a 32 bit version run on a Pi 4B, showing variations in performance, using a different version of Java.

############################# Pi 3B+ #############################

    Whetstone Benchmark Java Version, Sep 20 2019, 11:06:12

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    310.88             0.0618
  N2 floating point  -1.131330490    289.41             0.4644
  N3 if then else     1.000000000             241.15    0.4292
  N4 fixed point     12.000000000             706.28    0.4460
  N5 sin,cos etc.     0.499110132              23.31    3.5700
  N6 floating point   0.999999821    130.04             4.1480
  N7 assignments      3.000000000              89.19    2.0720
  N8 exp,sqrt etc.    0.825148463              21.92    1.6970

  MWIPS                              775.89            12.8884

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


############################# Pi 4B ##############################
 
    Whetstone Benchmark Java Version, Sep 12 2019, 20:15:35

                                                      1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs Gains

  N1 floating point  -1.124750137    488.80             0.0393   1.57
  N2 floating point  -1.131330490    475.92             0.2824   1.64
  N3 if then else     1.000000000             344.31    0.3006   1.43
  N4 fixed point     12.000000000            1571.86    0.2004   2.23
  N5 sin,cos etc.     0.499110132              43.55    1.9104   1.87
  N6 floating point   0.999999821    264.15             2.0420   2.03
  N7 assignments      3.000000000             264.00    0.7000   2.96
  N8 exp,sqrt etc.    0.825148463              25.80    1.4420   1.18

  MWIPS                             1445.70             6.9171   1.86

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


######################### Pi 4B 32 Bit ###########################

 Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900
  N6 floating point   0.999999821    345.95             1.5592
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft
  
JavaDraw Benchmark below or Go To Start


JavaDraw Benchmark - JavaDrawPi.class

The benchmark uses small to rather excessive numbers of simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.

Pi 4B performance gains shown below were indicated between 2.1 and 3.42 times.

At the end are 32 bit results from a Pi 4B test, using alternative Java software, with similar results.

############################# Pi 3B+ #############################

   Java Drawing Benchmark, Sep 20 2019, 11:08:33
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      335    33.46
  Display PNG Bitmap Twice Pass 2      546    54.53
  Plus 2 SweepGradient Circles         502    50.08
  Plus 200 Random Small Circles        366    36.59
  Plus 320 Long Lines                  134    13.30
  Plus 4000 Random Small Circles        46     4.59

         Total Elapsed Time  60.2 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


############################# Pi 4B ##############################

   Java Drawing Benchmark, Sep 12 2019, 20:18:28
            Produced by javac 1.7.0_02

  Test                              Frames      FPS  Gains

  Display PNG Bitmap Twice Pass 1     1146   114.52   3.42
  Display PNG Bitmap Twice Pass 2     1318   131.79   2.42
  Plus 2 SweepGradient Circles        1237   123.66   2.47
  Plus 200 Random Small Circles        972    97.13   2.65
  Plus 320 Long Lines                  415    41.48   3.12
  Plus 4000 Random Small Circles        97     9.65   2.10

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222

######################### Pi 4B 32 Bit ###########################

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft
  

OpenGL GLUT Benchmark below or Go To Start


OpenGL GLUT Benchmark - videogl64, videogl64g9, videogl32

In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress test of any duration.

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

Pi 4B average performance gains are included below, with textured objects the best, at 2.1 times, and worst, at around 1.5 times, with the slow kitchen displays.

Dual Monitors - The benchmark was also run with two 1920x1080 monitors connected. It displayed two identical displays when the mirror option was selected. Without this, the normal display, from where the program is executed, appeared on one display, and the OpenGL images on the other. This was fine when the usual display dimensions, as shown below, were specified. With no parameters, full screen image was assumed to be 3840x1080 and this was displayed horizontally squashed into 1920 pixels. FPS measurements for the latter are shown below. On running the 32 bit version via Raspbian, the default display was 3840x1080, across both monitors, but only on one monitor, when 1920x1080 parameters or less were specified. There was no mirror option. See performance below.

In order to demonstrate maximum speeds, VSYNC (vblank) has to be switched off. The command is shown in the following script, which is used to run a series of tests.

export vblank_mode=0                                
./videogl64g9 Width 160, Height 120, NoEnd            
./videogl64g9 Width 320, Height 240, NoHeading, NoEnd 
./videogl64g9 Width 640, Height 480, NoHeading, NoEnd 
./videogl64g9 Width 1024, Height 768, NoHeading, NoEnd
./videogl64g9 NoHeading                               

The benchmark can also be run as a stress test, using run time parameters for running time and test to run, besides window size, as shown above.

32 bit Pi 4B results are also provided, in this case, a bit slower than the 64 bit speeds.

############################# Pi 3B+ #############################

 GLUT OpenGL Benchmark 64 Bit Version 1, Fri Sep 20 11:15:47 2019

         Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    389.6    227.2    122.6     75.3     30.0     21.5
   320   240    328.1    201.7    113.8     73.3     30.2     21.3
   640   480    203.3    144.7     87.8     62.0     30.2     21.0
  1024   768    107.1     94.5     60.3     51.1     28.9     20.0
  1920  1080     45.3     47.5     36.9     33.1     28.7     20.0

############################## Pi 4B #############################

   160   120    767.4    420.3    258.3    154.3     45.7     31.7
   320   240    682.9    388.8    245.0    148.3     45.1     30.8
   640   480    367.1    262.6    217.9    140.1     46.2     30.9
  1024   768    150.8    148.8    128.6    117.3     45.3     30.4
  1920  1080     71.9     73.9     64.0     61.6     43.3     27.9

  Pi 4B Gains    1.77     1.74     2.12     2.10     1.52     1.46

  Dual Monitor - mirrored displays
  1920  1080     65.0     66.3     61.6     58.2     42.7     27.5

  Dual Monitor - not mirrored, squashed image on one monitor
  3840  1080     60.9     59.6     57.2     54.8     40.8     26.8

  Dual Monitor - 32 bit, two monitors
  3840  1080     26.9     26.6     26.1     25.1     25.5     15.9

 ************************ Pi 4B 32 Bit ************************

  GLUT OpenGL Benchmark 32 Bit Version 1, Fri Oct 11 19:12:24 2019

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen

   320   240    663.3    365.9    218.6    126.3     33.1     23.5
   640   480    318.7    259.7    192.4    116.8     32.2     22.1
  1024   768    138.9    134.1    112.7    102.7     31.9     21.4
  1920  1080     57.5     56.1     53.3     50.0     29.3     19.5

  Avg 64b/32b    1.13     1.13     1.15     1.19     1.42     1.39
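
The Pi 4B Gains line above appears to be the average, over the five window sizes, of the Pi 4B FPS divided by the Pi 3B+ FPS for each test. The Python sketch below checks this against the tabulated results. The averaging method is inferred from the numbers, not taken from the benchmark source, and the function name is hypothetical.

```python
# FPS by window size (160x120 up to 1920x1080), six tests per row
PI_3B_PLUS = [
    [389.6, 227.2, 122.6, 75.3, 30.0, 21.5],
    [328.1, 201.7, 113.8, 73.3, 30.2, 21.3],
    [203.3, 144.7, 87.8, 62.0, 30.2, 21.0],
    [107.1, 94.5, 60.3, 51.1, 28.9, 20.0],
    [45.3, 47.5, 36.9, 33.1, 28.7, 20.0],
]
PI_4B = [
    [767.4, 420.3, 258.3, 154.3, 45.7, 31.7],
    [682.9, 388.8, 245.0, 148.3, 45.1, 30.8],
    [367.1, 262.6, 217.9, 140.1, 46.2, 30.9],
    [150.8, 148.8, 128.6, 117.3, 45.3, 30.4],
    [71.9, 73.9, 64.0, 61.6, 43.3, 27.9],
]

def average_gains(new_rows, old_rows):
    """Per test, average the new/old FPS ratios over all window sizes."""
    tests = len(new_rows[0])
    gains = []
    for col in range(tests):
        ratios = [new[col] / old[col] for new, old in zip(new_rows, old_rows)]
        gains.append(sum(ratios) / len(ratios))
    return gains

if __name__ == "__main__":
    print("Pi 4B Gains", "  ".join(f"{g:6.2f}" for g in average_gains(PI_4B, PI_3B_PLUS)))
```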

Stress Tests below or Go To Start


Stress Tests

The first stress tests used cover the central processor, for which an extra program was produced to measure the environment whilst running. Variable parameters are:

Passes and sampling seconds to determine running time. If the stress test also has sampling periods, it is normally not possible to synchronise them but approximate periods can be matched.

CPU MHz - This can vary faster than any sampling time based on seconds, but the general trend can be useful. Tests that measure speed over sampling periods provide a better indication.

Core Voltage - This appears to vary a little, reason unknown.

CPU Temperature - Assuming that it is correct, as it changes slowly, this is the most useful measurement.

PMIC temperature - No issue so far with Power Management Integrated Circuit temperatures.


 ###################################################

 Parameters - upper or lower case

 ./RPiHeatMHzVolts2 passes 33 secs 20 log 12
 or
 ./RPiHeatMHzVolts2 P 33 S 20 L 12

 For 33 samples at 20 second intervals, log file RPiHeatMHz12.txt 
 
 To cover 10 minute test
 
###################################################

 Temperature and CPU MHz Measurement

 Start at Mon Oct 28 20:49:52 2019

 Using 33 samples at 20 second intervals

 Seconds
    0.0   ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
   20.0   ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
   40.3   ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=66.5'C
   60.5   ARM MHz=1500, core volt=0.8437V, CPU temp=79.0'C, pmic temp=69.4'C
   80.7   ARM MHz=1500, core volt=0.8437V, CPU temp=80.0'C, pmic temp=70.3'C
  101.0   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C
  121.2   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  141.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  161.7   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  181.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
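The log lines above follow a fixed layout, so they are easy to summarise after a run. The following is a minimal Python sketch, not part of the benchmark programs, that parses lines in the format shown and reports averages and the maximum CPU temperature. The names parse_sample and summarise are illustrative only.

```python
import re

# Matches one sample line as printed in the log, for example:
#   20.0   ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
# A trailing MB/sec or MFLOPS column, if present, is ignored.
SAMPLE = re.compile(
    r"^\s*(?P<secs>\d+\.\d+)\s+ARM MHz=\s*(?P<mhz>\d+),"
    r"\s*core volt=(?P<volts>\d+\.\d+)V,"
    r"\s*CPU temp=(?P<cpu>\d+\.\d+)'C,"
    r"\s*pmic temp=(?P<pmic>\d+\.\d+)'C"
)


def parse_sample(line):
    """Return a dict of floats for one log line, or None if it does not match."""
    m = SAMPLE.match(line)
    if m is None:
        return None
    return {key: float(value) for key, value in m.groupdict().items()}


def summarise(lines):
    """Average MHz, average CPU temperature and maximum CPU temperature."""
    samples = [s for s in map(parse_sample, lines) if s is not None]
    if not samples:
        raise ValueError("no sample lines found")
    n = len(samples)
    return {
        "samples": n,
        "avg_mhz": sum(s["mhz"] for s in samples) / n,
        "avg_cpu_temp": sum(s["cpu"] for s in samples) / n,
        "max_cpu_temp": max(s["cpu"] for s in samples),
    }


if __name__ == "__main__":
    # Four lines copied from the log above.
    log = """\
    0.0   ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
   20.0   ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
  141.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  181.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
"""
    print(summarise(log.splitlines()))
```

As noted for the CPU MHz parameter, an average of sampled MHz is only a general indication, since the frequency can change between samples.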
  
Next are results for High Performance Linpack, which runs for a long time and, without a cooling fan in place, significantly increases CPU temperatures and causes the CPU to slow down. These results can be compared with those for the 32 bit version, available in the report Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. The comparison shows that the 32 bit version obtains the same sort of performance levels as the 64 bit version, with and without a cooling fan.

Following the HPL results are some for my integer and floating point stress tests. Although further comparative tests are needed to be conclusive, it does seem that the 64 bit floating point versions are faster than the 32 bit varieties and subject to lower temperature increases.



High Performance Linpack Stress Test

The HPL benchmark results quoted earlier were 8.1 GFLOPS on a cold start and 10.8 GFLOPS later, with a cooling fan in operation for both. The first results below were run without a fan, at a room temperature of around 21C, producing 7.6 GFLOPS on a cold start. The average CPU frequency then came out at 1056 MHz, with an average temperature of 80.3C.

The second results followed a warm reboot to use a different version of Gentoo with HPL installed. They obtained 5.54 GFLOPS, with severe CPU frequency throttling, down to 600 MHz, and temperatures up to 86C. Averages were 790 MHz and 84.1C.

Shortly afterwards, with the fan in place, the Pi ran at 1500 MHz continuously, achieving 10.4 GFLOPS, with a maximum temperature of 64C.

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             702.81              7.589e+00
HPL_pdgesv() start time Sat Aug 24 10:42:58 2019

HPL_pdgesv() end time   Sat Aug 24 10:54:41 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008453 ...... PASSED
================================================================================

                   Example 2 - Note different sumchecks again 

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             963.16              5.538e+00
HPL_pdgesv() start time Tue Oct 29 11:51:10 2019

HPL_pdgesv() end time   Tue Oct 29 12:07:13 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009005 ...... PASSED
================================================================================
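The Gflops column can be cross-checked from N and Time. The minimal Python sketch below assumes the usual HPL operation count of 2/3 N^3 + 3/2 N^2 floating point operations; that formula is an assumption here, not something stated in this report, but it reproduces both figures above to three decimal places.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for an HPL run of order n that took the given time.

    Assumes an operation count of 2/3*n^3 + 3/2*n^2.
    """
    flops = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return flops / seconds / 1.0e9


if __name__ == "__main__":
    # N and Time taken from the two result tables above.
    runs = (("no fan, cold start", 702.81), ("no fan, warm reboot", 963.16))
    for label, secs in runs:
        print(f"{label:20s} {hpl_gflops(20000, secs):.3f} GFLOPS")
```

Because N is the same in both runs, the ratio of the two speeds is simply the inverse ratio of the run times.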

 Temperature and CPU MHz Measurement

 Start at Tue Oct 29 11:50:27 2019

 Using 40 samples at 30 second intervals

 Seconds
    0.0   ARM MHz=1500, core volt=0.8542V, CPU temp=63.0'C, pmic temp=58.0'C
   30.0   ARM MHz=1500, core volt=0.8542V, CPU temp=79.0'C, pmic temp=69.4'C
   60.3   ARM MHz=1000, core volt=0.8542V, CPU temp=83.0'C, pmic temp=72.2'C
   91.6   ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
  122.2   ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=74.1'C
  152.7   ARM MHz= 750, core volt=0.8490V, CPU temp=83.0'C, pmic temp=74.1'C
  183.2   ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
  213.8   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  244.3   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  274.7   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  305.2   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  335.6   ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  366.1   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  396.6   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  427.2   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  457.5   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  488.0   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  518.6   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
  549.0   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  579.6   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
  610.1   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  640.6   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  671.1   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  701.6   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
  732.0   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  762.4   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  792.9   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  823.4   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
  853.9   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  884.4   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  914.9   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  945.3   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  975.8   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
 1006.3   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
 1036.7   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
 1067.0   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
 Averages          790                              84.1              75.5
  



Integer Stress Test - MP-IntStress64, MP-IntStress

As for my other CPU stress tests, the 4 and 8 thread results are shown, from running in benchmarking mode. Run time parameters are also provided, the commands used for the particular tests being included.

In this case, summaries of separate tests for L1 cache, L2 cache and RAM are given. During the 10 minute sessions, the cache tests were mainly running at 1000 MHz, with those using RAM at the full speed of 1500 MHz. No temperatures above 84C were recorded.

Examining the full detail of the first test indicated that average CPU MHz and measured MB/second were around 75% of the maximum.


                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   3.0    4  28715 26652  3345  5A5A5A5A    Yes
   3.0    8  30292 26310  3334  AAAAAAAA    Yes


  ./RPiHeatMHzVolts2 passes 66 secs 10 log 34 - used for all 10 minute stress tests

 ==== Stress Test Parameters - upper or lower case, only first letter counts ====

   Threads 1, 2, 4, 8, 16, 32  KB between 12 and 15624  Log < 100  Minutes any > 0

  ./MP-IntStress64 KB 16 Threads 8 Mins 10 Log 34

 Seconds                                                                     MB/sec
    0.0   ARM MHz=1500, core volt=0.8455V, CPU temp=62.0'C, pmic temp=57.1'C
   10.0   ARM MHz=1500, core volt=0.8455V, CPU temp=69.0'C, pmic temp=62.8'C  28695
   20.2   ARM MHz=1500, core volt=0.8402V, CPU temp=73.0'C, pmic temp=64.6'C  28729
  152.5   ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=72.2'C  21523
  305.5   ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C  20026
  448.2   ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C  19611
  601.1   ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C  19199
                                                                    %Min/Max   66.9

  ./MP-IntStress64 KB 160 Threads 8 Mins 10 Log 34

 Seconds                                                                     MB/sec
    0.0   ARM MHz=1500, core volt=0.8402V, CPU temp=64.0'C, pmic temp=57.1'C
   10.0   ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C  26323
   20.2   ARM MHz=1500, core volt=0.8402V, CPU temp=75.0'C, pmic temp=66.5'C  26140
  152.9   ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=74.1'C  18016
  306.5   ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C  17306
  449.8   ARM MHz=1000, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C  17248
  603.3   ARM MHz= 750, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C  16832
                                                                    %Min/Max   63.9

  ./MP-IntStress64 KB 16000 Threads 8 Mins 10 Log 34

 Seconds                                                                     MB/sec
    0.0   ARM MHz=1500, core volt=0.8402V, CPU temp=66.0'C, pmic temp=60.9'C
   10.0   ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C   3372
   20.3   ARM MHz=1500, core volt=0.8402V, CPU temp=72.0'C, pmic temp=62.8'C   3369
  155.2   ARM MHz=1500, core volt=0.8402V, CPU temp=76.0'C, pmic temp=68.4'C   3365
  309.8   ARM MHz=1500, core volt=0.8402V, CPU temp=79.0'C, pmic temp=69.4'C   3367
  454.4   ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C   3367
  599.7   ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C   3368
                                                                    %Min/Max   99.8
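The %Min/Max figure appears to be the lowest sampled throughput expressed as a percentage of the highest. That reading is inferred from the summaries, not taken from the program source. A minimal Python sketch using the listed samples for the 160 KB and 16000 KB tests:

```python
def percent_min_max(rates):
    """Lowest sampled rate as a percentage of the highest."""
    if not rates:
        raise ValueError("no samples")
    return 100.0 * min(rates) / max(rates)


if __name__ == "__main__":
    # MB/second samples from the 160 KB (L2 cache) summary above.
    l2_cache = [26323, 26140, 18016, 17306, 17248, 16832]
    # MB/second samples from the 16000 KB (RAM) summary above.
    ram = [3372, 3369, 3365, 3367, 3367, 3368]
    print(f"L2 cache {percent_min_max(l2_cache):.1f}%")
    print(f"RAM      {percent_min_max(ram):.1f}%")
```

Only selected samples are listed in the summaries, so a figure calculated this way can differ slightly from the printed one when the full log contains a lower or higher value.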
  



Single Precision Floating Point Stress Test - MP-FPUStress64, MP-FPUStress

Two sets of result summaries are provided below, both using 1280 KB memory space and 8 threads. With four cores, this results in data being in L2 cache (4 x 160 KB) to run at full speed, with additional overhead of moving data to/from RAM. One test uses 8 operations per word, with 32 in the other. With hot starts, neither reached a CPU temperature of 84C, and both had similar performance degradation at the highest temperatures.

After the above was written, the 32 bit stress test was repeated, with results shown below. Although a single run is not conclusive, they indicate a more severe impact than in the 64 bit run, with a sampled CPU speed as low as 600 MHz, higher temperatures and a larger performance degradation.


             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   4.6    T4   2  9223  7520   519   40392  76406  99700
   6.0    T8   2  9520 10471   545   40392  76406  99700
  11.3    T4   8 19087 21040  2044   54764  85092  99820
  12.9    T8   8 19747 21107  2016   54764  85092  99820
  22.2    T4  32 25732 26704  9160   35206  66015  99520
  24.1    T8  32 25708 25770  8927   35206  66015  99520

 ==== Stress Test Parameters - upper or lower case, only first letter counts ====

Threads 1,2,4,8,16,32,64  KB 12 to 15624  Ops/Word 2,8,32  Log<100  Minutes any>0 


./MP-FPUStress64 KB 1280 T 8 Ops 8 Mins 10 Log 33

 Seconds                                                                     MFLOPS
    0.0   ARM MHz=1500, core volt=0.8437V, CPU temp=64.0'C, pmic temp=59.0'C
   10.0   ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C  17309
   20.2   ARM MHz=1500, core volt=0.8437V, CPU temp=75.0'C, pmic temp=66.5'C  18018
  101.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C  14224
  204.2   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C  12806
  306.8   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=73.1'C  12447
  409.4   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C  11870
  501.6   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C  12191
  604.1   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C  12169
                                                                    %Min/Max   65.9   

  ./MP-FPUStress64 KB 1280 T 8 Ops 32 Mins 10 Log 33

 Seconds                                                                     MFLOPS
    0.0   ARM MHz=1500, core volt=0.8437V, CPU temp=65.0'C, pmic temp=59.0'C
   10.0   ARM MHz=1500, core volt=0.8437V, CPU temp=72.0'C, pmic temp=65.6'C  22634
   20.2   ARM MHz=1500, core volt=0.8437V, CPU temp=76.0'C, pmic temp=67.5'C  22992
  101.9   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  18629
  204.0   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C  16674
  306.3   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  16448
  408.6   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C  16158
  500.7   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C  16081
  603.0   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C  15553
                                                                    %Min/Max   67.6

 ======================================================================================

          32 Bit Version   ./MP-FPUStress KB 1280 T 8 Ops 32 Mins 10 Log 73

 Seconds                                                                     MFLOPS
    0.0   ARM MHz=1500, core volt=0.8560V, CPU temp=56.0'C, pmic temp=50.5'C
   10.0   ARM MHz=1500, core volt=0.8560V, CPU temp=70.0'C, pmic temp=60.9'C  20233
   20.7   ARM MHz=1500, core volt=0.8560V, CPU temp=74.0'C, pmic temp=64.6'C  20221
  106.4   ARM MHz=1000, core volt=0.8560V, CPU temp=83.0'C, pmic temp=70.3'C  14173
  204.3   ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=73.1'C  13115
  302.2   ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  12650
  400.2   ARM MHz= 750, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  11957
  508.8   ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  11485
  585.1   ARM MHz= 600, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C  11454
  606.9   ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C  11242
                                                                    %Min/Max   55.6


 



Double Precision Floating Point Stress Test - MP-FPUStress64DP, MP-FPUStressDP

Below are full results for a 10 minute test using the double precision floating point stress test, with data in L2 cache with four cores in use. Although the measured MFLOPS was greater than that obtained by HP Linpack, the same range of high temperatures and performance degradation was not produced.

The 32 bit version was also rerun, producing results similar to those at 64 bits.

             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   8.9    T4   2  5024  4589   257   40395  76384  99700
  11.5    T8   2  5089  5545   280   40395  76384  99700
  21.7    T4   8 10259 10011  1068   54805  85108  99820
  24.7    T8   8 10239 10824  1036   54805  85108  99820
  43.1    T4  32 12940 13200  4497   35159  66065  99521
  46.9    T8  32 13200 13049  4557   35159  66065  99521

 ==== Stress Test Parameters - upper or lower case, only first letter counts ====

Threads 1,2,4,8,16,32,64  KB 12 to 15624  Ops/Word 2,8,32  Log<100  Minutes any>0 

 ./MP-FPUStress64DP KB 1280 T 8 Ops 32 Mins 10 Log 31

 Seconds                                                                     MFLOPS
    0.0   ARM MHz=1500, core volt=0.8437V, CPU temp=63.0'C, pmic temp=57.1'C
   10.0   ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C  12718
   20.2   ARM MHz=1500, core volt=0.8437V, CPU temp=74.0'C, pmic temp=66.5'C  12755
   30.5   ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=68.4'C  12750
   40.7   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C  12755
   50.9   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C  12183
   61.2   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  11358
   71.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  10922
   81.6   ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C  10333
   91.8   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9948
  102.0   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9692
  112.3   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9466
  122.6   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   9217
  132.8   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C   9181
  143.0   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   9145
  153.2   ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C   9043
  163.4   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8921
  173.6   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   9838
  183.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8755
  194.1   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8737
  204.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   8721
  214.7   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8721
  224.9   ARM MHz=1500, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C   8670
  235.1   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C   8619
  245.4   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8592
  255.6   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8592
  265.9   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8540
  276.2   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C   8488
  286.4   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8547
  296.7   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8510
  307.0   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8473
  317.2   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8507
  327.5   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8541
  337.7   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8544
  347.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8464
  358.2   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8531
  368.4   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8495
  378.7   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8460
  388.9   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8514
  399.2   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8484
  409.4   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8454
  419.6   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8459  
  429.8   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8489
  440.1   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8472
  450.3   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8428
  460.6   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8384
  470.9   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8384
  481.2   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8387
  491.4   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8391
  501.7   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8244   
  511.9   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8346
  522.1   ARM MHz= 750, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8272
  532.4   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8272
  542.6   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8329
  552.8   ARM MHz= 750, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8239
  563.1   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8183
  573.3   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8129
  583.6   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8343
  593.9   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8266
  604.1   ARM MHz=1000, core volt=0.8437V, CPU temp=85.0'C, pmic temp=74.1'C   8190
                                                                    %Min/Max   63.7


OpenGL + 3 x Livermore Loops - liverloopsPi64Rg9, liverloopsPi64, liverloopsPiA7R

In order to make it easier to run these stress tests, lxterminal was installed and the script shown below used to open four terminal windows and run the environmental monitor program plus three copies of a modified Loops benchmark, which allows different log files to be specified. This executes 72 loops for a minimum time of 12 seconds each. The second script file is provided to run the kitchen display tests for 16 minutes in full screen mode. A further terminal was opened to run the VMSTAT resource monitor.

The tests were run twice, without and with a cooling fan in place. Results are shown below. In this case, the no fan tests were not that much slower, obtaining averages of 77 to 80% of the fan cooled speeds on OpenGL FPS, CPU MHz and total Loops MFLOPS.

These results were produced with all programs compiled by gcc 9 and not run on a hot day. Compared with performance using 32 bit versions, detailed in this 32 Bit Report, the 64 bit results were far better, but the former were produced by an older compiler and run on a hot day. The tests were repeated, using 32 bit programs produced by the later gcc 8 compiler.

As before, the 64 bit gcc 9 Livermore Loops and OpenGL single core benchmarks were faster than the new 32 bit versions, in this case by 14% for the former and 40% for the latter. On running the stress test, both had similar average CPU MHz, CPU temperature and PMIC temperature, with 64 bit FPS and MFLOPS maintaining a performance advantage, in ratios similar to those obtained from the single core tests.


run.sh
lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 21 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 22 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 23

runogl.sh
export vblank_mode=0
./videogl64g9 Test 6 Mins 16 Log 20

            No Fan                          With Fan
 Seconds     MHz   CPU C  PMIC C     FPS     MHz   CPU C  PMIC C     FPS

       0    1500      57      51            1500      37      32
      30    1500      75      63      27    1500      53      44      27
      60    1500      76      68      29    1500      53      44      28
      90    1500      81      72      25    1500      58      50      27
     120    1500      81      70      23    1500      55      48      26
     150    1000      82      74      23    1500      57      49      29
     180    1000      80      72      22    1500      54      47      27
     210    1000      81      72      24    1500      55      46      29
     240    1500      80      72      26    1500      54      44      28
     270    1500      81      72      27    1500      55      47      28
     300    1000      82      72      22    1500      56      48      29
     330    1500      82      72      24    1500      56      50      29
     360    1000      82      72      24    1500      56      49      28
     390    1000      82      72      22    1500      58      50      26
     420    1000      83      72      22    1500      57      50      26
     450    1000      82      74      19    1500      56      50      30
     480    1000      82      74      21    1500      56      48      28
     510    1000      82      72      22    1500      54      46      29
     540    1000      81      72      22    1500      55      47      30
     570    1500      81      72      24    1500      55      47      30
     600    1000      82      74      24    1500      57      49      30
     630    1500      81      72      23    1500      58      51      29
     660    1000      82      72      23    1500      57      50      29
     690    1000      83      73      22    1500      59      51      28
     720    1000      83      72      21    1500      57      51      28
     750    1000      82      74      21    1500      57      50      29
     780    1000      84      74      19    1500      54      47      29
     810    1000      82      72      19    1500      56      48      29
     840    1000      82      72      20    1500      54      46      29
     870    1000      82      72      20    1500      53      46      30
     900    1000      82      72      23    1500      49      42      31

Average     1161      81      71      23    1500      55      47      29
Minimum     1000      57      51      19    1500      37      32      26
Maximum     1500      84      74      29    1500      59      51      31

% Hot/Cold
Average       77      68      66      80
Minimum       67      65      61      73
Maximum      100      70      69      94

 MFLOPS  Average Geomean Harmean         Average Geomean Harmean
       1     684     562     453             898     732     590
       2     716     574     451             887     712     571
       3     716     566     438             895     724     582

Total %Hot/Cold
MFLOPS        79      78      77
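The MHz and MFLOPS percentages can be reproduced from the table values, as in the minimal Python sketch below. The function name is illustrative. Note that the temperature percentages in the table appear to be the inverse ratio (with fan as a percentage of no fan), and that figures derived from the rounded table values can differ by a point from those printed, which were presumably calculated from unrounded data.

```python
def percent_of(hot, cold):
    """No fan value as a rounded percentage of the with fan value."""
    return round(100.0 * hot / cold)


if __name__ == "__main__":
    # Average CPU MHz over the run, no fan versus with fan.
    print("CPU MHz", percent_of(1161, 1500))
    # Total MFLOPS is the sum over the three Livermore Loops copies
    # (Average column of the MFLOPS table).
    no_fan = [684, 716, 716]
    with_fan = [898, 887, 895]
    print("MFLOPS ", percent_of(sum(no_fan), sum(with_fan)))
```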
  



Input/Output Stress Test - burnindrive264g9, burnindrive2

This is essentially the same as my program used during hundreds of UK Government and University computer acceptance trials during the 1970s and 1980s, with some significant achievements. Burnindrive writes four files, using 164 blocks of 64 KB, repeated 16 times (164.0 MB), with each block containing a unique data pattern. The files are then read for two minutes, in a sort of random sequence, with data and file ID checked for correct values. Then each block (unique pattern) is read numerous times, over one second, again with checking for correct values. Total time is normally about 5 minutes for all tests, with default parameters. The data patterns are shown below, followed by run time parameters, then examples of the results provided, including added calculations of speed.


Patterns

 No.    Hex No.     Hex No.     Hex No.     Hex  No.     Hex No.      Hex No.      Hex

  1       0 25   800000 49        3 73       FF  97 FFFFDFFF 121 FFFFEAAA 145 FFFFF0F0
  2       1 26  1000000 50       33 74   FF00FF  98 FFFFBFFF 122 FFFFAAAA 146 FFF0F0F0
  3       2 27  2000000 51      333 75      1FF  99 FFFF7FFF 123 FFFEAAAA 147 F0F0F0F0
  4       4 28  4000000 52     3333 76      3FF 100 FFFEFFFF 124 FFFAAAAA 148 FFFFFFE0
  5       8 29  8000000 53    33333 77      7FF 101 FFFDFFFF 125 FFEAAAAA 149 FFFF83E0
  6      10 30 10000000 54   333333 78      FFF 102 FFFBFFFF 126 FFAAAAAA 150 FE0F83E0
  7      20 31 20000000 55  3333333 79     1FFF 103 FFF7FFFF 127 FEAAAAAA 151 FFFFFFC0
  8      40 32 40000000 56 33333333 80     3FFF 104 FFEFFFFF 128 FAAAAAAA 152 FFFC0FC0
  9      80 33        1 57        7 81     7FFF 105 FFDFFFFF 129 EAAAAAAA 153 FFFFFF80
 10     100 34        5 58      1C7 82     FFFF 106 FFBFFFFF 130 AAAAAAAA 154 FFE03F80
 11     200 35       15 59     71C7 83 FFFFFFFF 107 FF7FFFFF 131 FFFFFFFC 155 FFFFFF00
 12     400 36       55 60   1C71C7 84 FFFFFFFE 108 FEFFFFFF 132 FFFFFFCC 156 FF00FF00
 13     800 37      155 61  71C71C7 85 FFFFFFFD 109 FDFFFFFF 133 FFFFFCCC 157 FFFFFE00
 14    1000 38      555 62        F 86 FFFFFFFB 110 FBFFFFFF 134 FFFFCCCC 158 FFFFFC00
 15    2000 39     1555 63      F0F 87 FFFFFFF7 111 F7FFFFFF 135 FFFCCCCC 159 FFFFF800
 16    4000 40     5555 64    F0F0F 88 FFFFFFEF 112 EFFFFFFF 136 FFCCCCCC 160 FFFFF000
 17    8000 41    15555 65  F0F0F0F 89 FFFFFFDF 113 DFFFFFFF 137 FCCCCCCC 161 FFFFE000
 18   10000 42    55555 66       1F 90 FFFFFFBF 114 BFFFFFFF 138 CCCCCCCC 162 FFFFC000
 19   20000 43   155555 67     7C1F 91 FFFFFF7F 115 FFFFFFFE 139 FFFFFFF8 163 FFFF8000
 20   40000 44   555555 68  1F07C1F 92 FFFFFEFF 116 FFFFFFFA 140 FFFFFE38 164 FFFF0000
 21   80000 45  1555555 69       3F 93 FFFFFDFF 117 FFFFFFEA 141 FFFF8E38
 22  100000 46  5555555 70    3F03F 94 FFFFFBFF 118 FFFFFFAA 142 FFE38E38
 23  200000 47 15555555 71       7F 95 FFFFF7FF 119 FFFFFEAA 143 F8E38E38
 24  400000 48 55555555 72   1FC07F 96 FFFFEFFF 120 FFFFFAAA 144 FFFFFFF0
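Two of the pattern families in the table are easy to recognise: patterns 2 to 32 are a single set bit moving through the word (walking ones) and patterns 84 to 114 are their complements (walking zeros). The Python sketch below is not taken from the burnindrive source. It generates those two families and shows one way a 64 KB block filled with a 32 bit pattern might be checked. The function names and the little-endian word layout are assumptions, and the file ID check mentioned above is not modelled.

```python
BLOCK_BYTES = 64 * 1024            # block size quoted in the text
WORDS_PER_BLOCK = BLOCK_BYTES // 4
MASK32 = 0xFFFFFFFF


def walking_ones():
    """Patterns 2 to 32 in the table: one set bit, positions 0 to 30."""
    return [1 << bit for bit in range(31)]


def walking_zeros():
    """Patterns 84 to 114 in the table: one clear bit, positions 0 to 30."""
    return [~(1 << bit) & MASK32 for bit in range(31)]


def make_block(pattern):
    """Build a 64 KB block holding the 32 bit pattern in every word."""
    return pattern.to_bytes(4, "little") * WORDS_PER_BLOCK


def first_bad_word(block, pattern):
    """Return the index of the first word that differs, or -1 if all match."""
    expected = pattern.to_bytes(4, "little")
    for i in range(0, len(block), 4):
        if block[i:i + 4] != expected:
            return i // 4
    return -1


if __name__ == "__main__":
    print([f"{p:X}" for p in walking_ones()[:4]])
    print([f"{p:X}" for p in walking_zeros()[:4]])
    block = bytearray(make_block(0xFFFF7FFF))
    print(first_bad_word(block, 0xFFFF7FFF))
    block[400] ^= 0x01             # simulate a single bit error in word 100
    print(first_bad_word(block, 0xFFFF7FFF))
```

Giving every block a different pattern means that a block read from the wrong position, as well as a corrupted one, fails the comparison.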

 Sequences - First 16

 No.   File         No.   File          No.   File          No.   File

  1    0  1  2  3    5    0  2  1  3     9    0  3  1  2    13    0  1  2  3
  2    1  2  3  0    6    1  3  2  0    10    1  0  3  2    14    1  2  3  0
  3    2  3  0  1    7    2  0  1  3    11    2  1  0  3    15    2  3  0  1
  4    3  0  2  1    8    3  1  2  0    12    3  2  1  0    16    3  0  2  1

 ###########################################################################

Run Time Parameters - Upper or Lower Case
                                                                      Default
R or Repeats             Data size, multiplier of 10.25 MB, more or less     16
P or Patterns            Number of patterns for smaller files < 164         164
M or Minutes             Large file reading time                              2
L or Log                 Log file name extension 0 to 99                      0
S or Seconds             Time to read each block, last section                1
F or FilePath            For other than SD card or SD card directory
C or CacheData           Omit O_DIRECT on opening files to allow caching      No  
O or OutputPatterns      Log patterns and file sequences used as above        No
D or DontRunReadTests    Or only run write tests                              No   

  Format ./burnindrive2 Repeats 16, Minutes 2, Log 0, Seconds 1 
     or  ./burnindrive2 R 16, M 2, L 0, S 1

 ###########################################################################

Examples of Results Main SD Card Default Parameters

   File 1  164.00 MB written in   14.66 seconds                - 11.2 MB/second
To File 4  164.00 MB written in   12.15 seconds                - 13.5 MB/second 

   Read passes     1 x 4 Files x  164.00 MB in  0.33 minutes   - 33.1 MB/second
To Read passes     7 x 4 Files x  164.00 MB in  2.28 minutes   - 33.6 MB/second

   Passes in 1 second(s) for each of 164 blocks of 64KB:       - 164 measurements

    580    580    580    580    580    580    580    580    580    580    580
    580    580    580    580    580    580    580    580    580    580    580

    95120 read passes of 64KB blocks in  2.76 minutes          - 36.8 MB/second
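The file size and the added speed calculations can be reproduced as in the minimal Python sketch below. The function names are illustrative. The file size uses 1024 KB per MB, while the block read figure only matches when 64 KB passes are converted at 1000 KB per MB; both conventions are inferred from the numbers above and are not stated in the program output.

```python
def file_megabytes(blocks=164, block_kb=64, repeats=16):
    """File size in MB, taking 1 MB as 1024 KB."""
    return blocks * block_kb * repeats / 1024.0


def mb_per_second(megabytes, seconds):
    """Plain transfer rate."""
    return megabytes / seconds


def block_passes_mb_per_second(passes, minutes, block_kb=64):
    """Rate for repeated block reads, taking 1 MB as 1000 KB (inferred)."""
    return passes * block_kb / 1000.0 / (minutes * 60.0)


if __name__ == "__main__":
    size = file_megabytes()
    print(f"File size      {size:.2f} MB")
    print(f"File 1 write   {mb_per_second(size, 14.66):.1f} MB/second")
    print(f"7 read passes  {mb_per_second(7 * 4 * size, 2.28 * 60):.1f} MB/second")
    print(f"Block reads    {block_passes_mb_per_second(95120, 2.76):.1f} MB/second")
```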
  


CPU + Main SD + USB + LAN Test

A system test was run using the following script file, comprising commands to run programs to monitor the environment, and others to exercise the main SD card, two USB 3 drives, 1 Gbps Ethernet and CPU floating point with two threads. The programs were run via the script file so that they all started at the same time, as indicated in the summaries below. They also all ran for between 12 and 13 minutes. Performance levels obtained when each program is run by itself (BI) are also shown, often indicating little improvement over the combined run. Performance is not as high as shown by other benchmarks, probably because data transfers are based on 64 KB block sizes and all data in each block is checked for correctness.

A snapshot of vmstat system performance is also provided. The bo and bi KB/second writing and reading speeds are essentially the same as the sum of those reported by the programs handling the main and USB drives. LAN speeds are not included in vmstat. Total CPU utilisation (us + sy) is shown to be nearly 90% at the start of writing and closer to 75% on reading, representing average utilisation per core, equivalent to at least three cores at 100%. The next page shows variations in performance with time.

 ############################### Script File ###############################

  lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
  lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 Log 21 &
  lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 \
                 FilePath /run/media/demouser/PATRIOT Log 22 &
  lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 \
                 FilePath /run/media/demouser/REMIXOSSYS Log 23 &
  lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 \
                 FilePath /media/public/test Log 24 &
  lxterminal -e ./MP-FPUStress64 KB 256 T 2 Ops 32 Mins 12 Log 33
  vmstat 10 96 > vmstat.txt

############################################################################

Main SD Drive Tue Nov  5 15:47:03 2019
  End of test Tue Nov  5 16:00:06 2019

Write 164 MB x files 4       53.6 seconds = 12.2 MB/second (BI 12.7)
Read  164 MB x files 3 x 4   67.2 seconds = 29.3 MB/second (BI 33.6)
Read  329480 x 64 KB        659.4 seconds = 32.0 MB/second (BI 36.8) 
============================================================

USB 3 Drive 1 Tue Nov  5 15:47:03 2019
  End of test Tue Nov  5 15:59:31 2019

Write 164 MB x files 4       17.5 seconds = 37.5 MB/second (BI 68.3)
Read  164 MB x files 6 x 4   72.0 seconds = 54.7 MB/second (BI 75.0)
Read  735800 x 64 KB        657.6 seconds = 71.6 MB/second (BI 66.5)
============================================================

USB 3 Drive 2  Tue Nov  5 15:47:03 2019
End of test    Tue Nov  5 15:59:57 2019

Write 164 MB x files 4       37.4 seconds = 17.5 MB/second (BI 23.8)
Read  164 MB x files 3 x 4   75.6 seconds = 26.0 MB/second (BI 28.5)
Read  282740 x 64 KB        660.0 seconds = 27.4 MB/second (BI 29.8)
============================================================

1 Gbps LAN     Tue Nov  5 15:47:03 2019
End of test    Tue Nov  5 15:59:35 2019

Write 164 MB x files 4       18.1 seconds = 36.2 MB/second (BI 55.7)
Read  164 MB x files 3 x 4   74.4 seconds = 26.4 MB/second (BI 34.0)
Read  303920 x 64 KB        659.4 seconds = 29.5 MB/second (BI 45.3)       
============================================================

MP-Threaded-MFLOPS 64 Bit v1.1 Tue Nov  5 15:47:03 2019
                   End of test Tue Nov  5 15:59:13 2019

   2 core GFLOPS 10.9 to 7.4 with CPU throttling.
   See RPiHeatMHzVolts2 results where detail is included
============================================================
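
The MB/second figures in the summaries above follow from the pass counts and elapsed times. As a hedged illustration, the block-read speeds can be recomputed as shown below. The function name `mb_per_second` is hypothetical, and the 1 MB = 1000 KB convention is an assumption, inferred from the fact that it reproduces the reported values.

```python
def mb_per_second(passes, seconds, block_kb=64):
    """Throughput for `passes` reads of `block_kb` KB blocks in `seconds`.

    Assumption: 1 MB is taken as 1000 KB, which matches the reported figures.
    """
    return passes * block_kb / 1000.0 / seconds


# (device, 64 KB read passes, seconds, reported MB/second) from the summaries.
block_reads = [
    ("Main SD", 329480, 659.4, 32.0),
    ("USB 3 Drive 1", 735800, 657.6, 71.6),
    ("USB 3 Drive 2", 282740, 660.0, 27.4),
    ("1 Gbps LAN", 303920, 659.4, 29.5),
]

for device, passes, seconds, reported in block_reads:
    speed = mb_per_second(passes, seconds)
    print(f"{device:14s} {speed:5.1f} MB/second (reported {reported})")
```

The same division applies to the file tests, for example 4 files of 164 MB written in 53.6 seconds on the main SD drive gives 656 / 53.6, or about 12.2 MB/second.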

                      From vmstat 10 second sampling  

Secs procs  ---------memory---------- ---swap--  -----io---- --system--  ------cpu-----
      r  b  swpd   free   buff  cache   si   so     bi    bo    in    cs us sy id wa st

  10  5  3     0 3059800  94956 346060   0    0     14 63204 17819 19051 51 38  2  9  0
  20  3  2     0 3058696  95248 346704   0    0  14265 60713 17613 18789 51 33  4 12  0

  60  4  2     0 3061196  95668 343572   0    0  93479  7577 24239 24987 54 19  4 23  0
  70  4  3     0 3050632  95684 353600   0    0 112115    24 24496 25316 54 20 12 14  0
   
 710  3  3     0 3058696  96532 349460   0    0 132992    16 18936 22387 53 22  3 22  0
 720  5  1     0 3058728  96548 349452   0    0 134400    13 20635 23842 54 23  1 23  0
   


Speeds and Temperature - These tests were run without an active cooling fan, resulting in some CPU throttling, with clock speed down to 1000 MHz some of the time, when the temperature reached 80C. The MP-Threaded-MFLOPS dual core performance measurements have been added to the environmental details, mainly indicating the effects of throttling.

The final section of the burnindrive results records the number of read passes in 4 seconds for each block, in a table comprising 14 lines of 11 recordings and one of 10 (164 measurements), over approximately 11 minutes. The average burnindrive results for each line are provided below. They are not exactly synchronised across the drives, but give an indication of changes in throughput with time. Total passes and their percentages of the first line's total are also shown, the degradation in throughput not being as severe as the CPU speed reductions.


 Temperature and CPU MHz Measurement + MP-FPUStress64 2 Core MFLOPS

 Start at Tue Nov  5 15:47:03 2019

 Using 25 samples at 30 second intervals

 Seconds                                                                     MFLOPS
    0.0   ARM MHz=1500, core volt=0.8560V, CPU temp=66.0'C, pmic temp=59.0'C 
   30.0   ARM MHz=1500, core volt=0.8560V, CPU temp=75.0'C, pmic temp=65.6'C  10890
   60.2   ARM MHz=1500, core volt=0.8560V, CPU temp=78.0'C, pmic temp=68.4'C  10551
   90.4   ARM MHz=1500, core volt=0.8560V, CPU temp=80.0'C, pmic temp=70.3'C  10549
  120.6   ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C  10452
  150.8   ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C   9862
  181.1   ARM MHz=1000, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C   9482
  211.4   ARM MHz=1500, core volt=0.8560V, CPU temp=82.0'C, pmic temp=72.2'C   9137
  241.6   ARM MHz=1500, core volt=0.8507V, CPU temp=81.0'C, pmic temp=72.2'C   9132
  271.9   ARM MHz=1000, core volt=0.8507V, CPU temp=82.0'C, pmic temp=70.3'C   9122
  302.2   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   9389
  332.4   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8550
  362.7   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   9043
  392.9   ARM MHz=1500, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8045
  423.3   ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8174
  453.6   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8444
  483.9   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8335
  514.3   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7951
  544.6   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8125
  574.8   ARM MHz=1500, core volt=0.8455V, CPU temp=83.0'C, pmic temp=72.2'C   8078
  605.1   ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8280
  635.4   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7845
  665.7   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7761
  696.0   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=73.1'C   8341
  726.2   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7407
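
The extent of throttling in the log above can be summarised with a short script. The following Python sketch is an illustration under stated assumptions: the regular expression is written for the line layout shown here (with the MFLOPS column appended), and the names `LOG`, `LINE` and `parse` are hypothetical rather than part of RPiHeatMHzVolts2.

```python
import re

# Log lines copied from the table above (seconds, MHz, volts, temperatures, MFLOPS).
LOG = """\
    0.0   ARM MHz=1500, core volt=0.8560V, CPU temp=66.0'C, pmic temp=59.0'C
   30.0   ARM MHz=1500, core volt=0.8560V, CPU temp=75.0'C, pmic temp=65.6'C  10890
   60.2   ARM MHz=1500, core volt=0.8560V, CPU temp=78.0'C, pmic temp=68.4'C  10551
   90.4   ARM MHz=1500, core volt=0.8560V, CPU temp=80.0'C, pmic temp=70.3'C  10549
  120.6   ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C  10452
  150.8   ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C   9862
  181.1   ARM MHz=1000, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C   9482
  211.4   ARM MHz=1500, core volt=0.8560V, CPU temp=82.0'C, pmic temp=72.2'C   9137
  241.6   ARM MHz=1500, core volt=0.8507V, CPU temp=81.0'C, pmic temp=72.2'C   9132
  271.9   ARM MHz=1000, core volt=0.8507V, CPU temp=82.0'C, pmic temp=70.3'C   9122
  302.2   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   9389
  332.4   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8550
  362.7   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   9043
  392.9   ARM MHz=1500, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8045
  423.3   ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8174
  453.6   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8444
  483.9   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8335
  514.3   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7951
  544.6   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8125
  574.8   ARM MHz=1500, core volt=0.8455V, CPU temp=83.0'C, pmic temp=72.2'C   8078
  605.1   ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8280
  635.4   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7845
  665.7   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7761
  696.0   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=73.1'C   8341
  726.2   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7407
"""

# Pattern assumed from the layout above; the trailing MFLOPS value is optional.
LINE = re.compile(
    r"^\s*([\d.]+)\s+ARM MHz=(\d+), core volt=([\d.]+)V, "
    r"CPU temp=([\d.]+)'C, pmic temp=([\d.]+)'C\s*(\d+)?\s*$"
)


def parse(log):
    """Return one dict per log line; mflops is None where no value was recorded."""
    samples = []
    for line in log.splitlines():
        match = LINE.match(line)
        if match:
            secs, mhz, _volts, cpu_c, _pmic_c, mflops = match.groups()
            samples.append({
                "secs": float(secs),
                "mhz": int(mhz),
                "cpu_c": float(cpu_c),
                "mflops": int(mflops) if mflops else None,
            })
    return samples


samples = parse(LOG)
throttled = sum(1 for s in samples if s["mhz"] < 1500)
mflops = [s["mflops"] for s in samples if s["mflops"] is not None]
drop_pct = 100.0 * (mflops[0] - mflops[-1]) / mflops[0]
max_temp = max(s["cpu_c"] for s in samples)

print(f"{throttled} of {len(samples)} samples below 1500 MHz")
print(f"MFLOPS {mflops[0]} to {mflops[-1]}: {drop_pct:.1f}% lower")
print(f"maximum CPU temperature {max_temp:.1f} C")
```

For this log, 10 of the 25 samples show 1000 MHz, and the last MFLOPS measurement is about 32% lower than the first.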

 Passes in 4 seconds for each of 164 blocks of 64KB

 Seconds  Main SD    USB 1    USB 2      LAN    Total   %First

      44     2013     4522     1884     1915    10333      100
      88     2007     4533     1838     1911    10289      100
     132     2016     4496     1760     1809    10082       98
     176     2011     4536     1785     1845    10178       99
     220     2002     4493     1729     1913    10136       98
     264     1971     4262     1751     1904     9887       96
     308     1980     4540     1747     1911    10178       99
     352     2002     4464     1660     1845     9971       96
     396     1987     4442     1629     1844     9902       96
     440     1964     4453     1585     1771     9773       95
     484     1995     4504     1635     1731     9864       95
     528     1989     4229     1696     1762     9676       94
     572     1947     4616     1684     1833    10080       98
     616     2013     4476     1660     1798     9947       96
     660     2262     4758     1826     2022    10868      105
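
The Total and %First columns above can be recomputed from the four per-device averages. The Python sketch below is an illustration only, with hypothetical names `ROWS` and `recompute`. Because the table holds rounded averages, the recomputed totals can differ from the listed ones by one pass and the percentages by up to one point.

```python
# (seconds, Main SD, USB 1, USB 2, LAN, listed Total, listed %First) from the table.
ROWS = [
    (44, 2013, 4522, 1884, 1915, 10333, 100),
    (88, 2007, 4533, 1838, 1911, 10289, 100),
    (132, 2016, 4496, 1760, 1809, 10082, 98),
    (176, 2011, 4536, 1785, 1845, 10178, 99),
    (220, 2002, 4493, 1729, 1913, 10136, 98),
    (264, 1971, 4262, 1751, 1904, 9887, 96),
    (308, 1980, 4540, 1747, 1911, 10178, 99),
    (352, 2002, 4464, 1660, 1845, 9971, 96),
    (396, 1987, 4442, 1629, 1844, 9902, 96),
    (440, 1964, 4453, 1585, 1771, 9773, 95),
    (484, 1995, 4504, 1635, 1731, 9864, 95),
    (528, 1989, 4229, 1696, 1762, 9676, 94),
    (572, 1947, 4616, 1684, 1833, 10080, 98),
    (616, 2013, 4476, 1660, 1798, 9947, 96),
    (660, 2262, 4758, 1826, 2022, 10868, 105),
]


def recompute(rows):
    """Return (seconds, summed total, percentage of first line's total) per row."""
    first = sum(rows[0][1:5])
    return [(r[0], sum(r[1:5]), 100.0 * sum(r[1:5]) / first) for r in rows]


for (secs, total, pct), row in zip(recompute(ROWS), ROWS):
    print(f"{secs:4d}  total {total:6d} (listed {row[5]:6d})  "
          f"{pct:6.1f}% of first (listed {row[6]})")
```

The lowest recomputed value is about 94% of the first line's total, at 528 seconds.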
