Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Roy Longbottom

Contents


Summary Introduction Benchmark Results
Whetstone Benchmark Dhrystone Benchmark Linpack 100 Benchmark
Livermore Loops Benchmark FFT Benchmarks BusSpeed Benchmark
MemSpeed Benchmark NeonSpeed Benchmark MultiThreading Benchmarks
MP-Whetstone Benchmark MP-Dhrystone Benchmark MP NEON Linpack Benchmark
MP-BusSpeed Benchmark MP-RandMem Benchmark MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks OpenMP-MemSpeed Benchmarks I/O Benchmarks
WiFi Benchmark LAN Benchmark USB 3 Benchmarks
Pi 4 Main Drive benchmark Java Whetstone Benchmark JavaDraw Benchmark
OpenGL Benchmark Usable RAM High Performance Linpack
Floating Point Stress Tests Integer Stress Tests 64 GB SD Card
System Stress Tests Power Over Ethernet CPU Performance Throttling Effects


Summary

This report covers the May 2020 Raspberry Pi 4B upgrades, comprising 8GB RAM and the Beta pre-release 64 bit Raspberry Pi OS. Note that observations and performance measured might not apply to an officially released Operating System. Objectives of this exercise were to show that my programs could be compiled and run on the 64 bit system and to compare performance with that of the original 32 bit Pi 4B.

Single Core CPU Tests - These “Classic Benchmarks” set the original performance standards of computers. There are four, one with three varieties, and some with multiple test functions. All showed average 64 bit performance gains in the range of 11% to 81%, the highest where the new vector instructions were generated by the compiler.

Single Core Memory Benchmarks - These measure performance using data from caches and RAM. There are four benchmarks, each with between 60 and 100 measurements. A bottom line assessment is that 64 bit and 32 bit speeds from RAM were the same, as were around half of CPU dependent routines, with the other half an average near 30% faster at 64 bits.

Multithreading Benchmarks - There were twelve, including some intended to show that particular workloads are unsuitable for multithreaded operation. Five measured floating point performance, where the average 64 bit gain was 39%, demonstrating maxima of 25.9 single precision GFLOPS and 12.7 double precision GFLOPS. Of the other two applicable benchmarks, one was 13% faster at 64 bits, with the other indicating the same performance.

Drive and Network Benchmarks - These mainly ran successfully at 64 bits, providing similar performance to 32 bit runs. A major difference is that file sizes appeared to be limited to 2 GB minus 1 byte (2^31-1) at 32 bits. At this stage there were free space limitations but, at 64 bits, up to 3 x 12 GB files could be exercised.

Java and OpenGL Benchmarks - 64 bit Java CPU speed, Java drawing and OpenGL benchmarks were run, with different window settings, including using dual monitors.

Usable RAM - Two simple repetitive exercises were carried out to see how much RAM space could be used, via allocation and dimensioned arrays. With one program, memory was allocated in 1 billion byte steps. Maximums were 3 billion bytes at 32 bits; at 64 bits they were 3 billion with 4 GB RAM and 7 billion with 8 GB. With dimensioning, more precise values were obtained, indicating 3.43 GB and 7.9 GB at 64 bits, but 2 GB minus 1 byte at 32 bits.

High Performance Linpack Benchmark - Performance depends on the memory size parameter N squared. With a fan in use, maximum 32 bit and 64 bit speeds were similar, at around 11.25 double precision GFLOPS with N=30000 and 8 GB RAM; best performance with 4 GB was 10.8 GFLOPS at N=20000. As a stress test, with no fan, the original Pi 4 board obtained 6.2 GFLOPS at N=20000, with the new one reaching at least 8.5 GFLOPS, demonstrating a significant improvement in thermal management.

CPU Stress Tests - Floating point tests demonstrated the same best case 64 bit performance gains as earlier benchmarks and details of 10 minute stress tests confirmed better thermal management, in a more linear way. A single thread 10 minute stress test was run with integer calculations using more than 7.2 GB of RAM, with some swapping, but no severe performance degradation. The stress tests were run without an operational fan.

64 GB Main Drive SD Card - This was obtained to show that extra large files could be used. A single near 40 GB file was written and read with a new benchmark variation, taking 33 minutes.

System Stress Tests - Fifteen minute tests were run, with and without cooling, using four benchmarks covering CPU, nearly 6 GB of RAM, the main drive and graphics. There were temperature rises with no cooling, but with little performance degradation, continuously providing around 0.6 GFLOPS, 1140 MB/second from RAM, 30 MB/second from the drive and 21 FPS graphics speed.

Power over Ethernet - Following more comprehensive earlier activity, some long cable PoE tests were repeated to confirm that it was still applicable for this 64 bit configuration.

CPU Performance Throttling Effects - Again, after an earlier exercise, frequency scaling settings forced the CPU to run at 600 MHz, normally the lowest throttling frequency, whilst playing programmes via BBC iPlayer for more than two hours to an HD TV, over WiFi. This ran with acceptable picture and sound quality.



Introduction

This report covers the May 2020 Raspberry Pi 4B upgrades, comprising 8 GB RAM and the Beta pre-release 64 bit Raspberry Pi OS (Operating System). This is a continuation of earlier activity with details at ResearchGate in Raspberry Pi 4B 32 Bit Benchmarks.pdf and Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. These provide more detailed information of the programs used and comparisons with older systems.

Most of the benchmarks and stress testing programs were recompiled for use here, using the supplied gcc 8 compiler, with two not producing acceptable code; these were substituted by earlier 64 bit versions. All the programs are available for download from ResearchGate in Raspberry-Pi-OS-64-Bit-Benchmarks.tar.xz.

Traditionally, the benchmarks provide details of the system being tested by accessing built-in CPUID information. Following are the latest details, identifying the difference between 32 bit and 64 bit operation.

32 bit
 
Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Vendor ID:             ARM
Model:                 3
Model name:            Cortex-A72
Stepping:              r0p3
CPU max MHz:           1500.0000
CPU min MHz:           600.0000
BogoMIPS:              270.00
Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
                       idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-05-13

64 bit
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               fp asimd evtstrm crc32 cpuid
Linux raspberrypi 4.19.118-v8+ #1311 SMP PREEMPT Mon Apr 27 14:32:38 BST 2020 aarch64 GNU/Linux
  



Benchmark Results

The following provide benchmark results with limited comments on Raspberry Pi 4B performance, compiled as 32 bit and 64 bit programs. There are also considerations of the impact of the larger 8 GB RAM and the possibility of larger file sizes.



Whetstone Benchmark - whetstonePiC8, whetstonePi64g8

This has a number of simple programming loops, with the overall MWIPS rating mainly dependent on the floating point calculations, particularly those identified as COS and EXP. The last three tests can be over optimised, but their times do not affect the overall rating much.

Performance is normally more dependent on CPU MHz than on advanced instructions, but an overall improvement of 11% was indicated, with gains of up to 27% on the straightforward floating point calculations.


 System    MHz   MWIPS  ------MFLOPS------   -------------MOPS---------------
                          1      2      3     COS    EXP  FIXPT     IF  EQUAL

 32 bit   1500   1883    522    471    313   54.9   26.4   2496   3178    998
 64 bit   1500   2085    524    535    398   57.6   27.3   2493   2979    997

 64/32 bit       1.11   1.00   1.14   1.27   1.05   1.03   1.00   0.94   1.0


Dhrystone Benchmark - dhrystonePiC8, dhrystonePi64g8

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so results from different compilers cannot be compared reliably. Ignoring this, results in VAX MIPS, aka DMIPS, and comparisons follow.

The 64 bit compilation provided an apparent 54% performance improvement, but this was possibly due to over optimisation.

                            DMIPS
 System     MHz    DMIPS     /MHz

 32 bit    1500     5077     3.76
 64 bit    1500     7814     5.21

 64/32 bit          1.54         


Linpack 100 Benchmark MFLOPS - linpackPiC8, linpackPiC8SP, linpackPiNEONiC8, linpackPi64g8, linpackPi64gSP, linpackPi64NEONig8

This original Linpack benchmark uses a small data array, unsuitable for higher speed multiprocessing. It executes double precision arithmetic. I introduced a single precision version with a NEON variety, to indicate vector processing speed.

The NEON version, which uses intrinsic functions, was the star of the show when the Pi 4B was introduced, producing the most significant performance improvements over the Pi 3B, the benefit being reflected in the 32 bit NEON/SP results below. The 64 bit SP result now shows that 64 bit vector instructions can achieve the same sort of performance gains, this time 81% faster than at 32 bits.

                                    NEON
 System     MHz      DP      SP      SP

 32 bit    1500   957.1  1068.8  1819.9
 64 bit    1500  1111.5  1938.2  2030.9

 64/32 bit         1.16    1.81    1.12


Livermore Loops Benchmark MFLOPS - liverloopsPiC8, liverloopsPi64g8

This benchmark measures performance of 24 double precision kernels, originally used in selecting supercomputers. The official average is the geometric mean, on which the Cray 1 supercomputer was rated at 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores.

Based on the Geomean results, the overall 64 bit speed rating was 13% faster than at 32 bits, but vector instructions pushed individual kernel gains up to a maximum of 67%.

 MFLOPS for 24 loops

  32 bit
  1480  1017   974   930   383   657  1624  1861  1664   617   498   741
   221   320   803   640   737  1003   451   378  1047   411   763   187

  64 bit
  2108   936   960   965   383   809  2313  2488  2066   669   500   981
   181   405   815   644   727  1190   450   397  1716   367   818   313

  64 bit / 32 bit gain range - 0.82 to 1.67                             

 Comparisons

 System    MHz   Maximum Average Geomean Harmean Minimum

 32 bit   1500    1860.8   800.4   679.0   564.1   179.5
 64 bit   1500    2616.7   959.8   766.7   613.0   169.7

 64/32 bit          1.41    1.20    1.13    1.09    0.95


Fast Fourier Transforms Benchmarks - fft1PiC8, fft3cPiC8, fft1Pi64g, fft3cPi64g8

This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements use both single and double precision data, calculating FFT sizes between 1K and 1024K, with data from caches and RAM. Note that steps in performance levels occur at data size changes from L1 to L2 caches, then to RAM.

Following are average running times from the three passes of each FFT calculation. There were no significant variations in overall performance between 32 bit and 64 bit compilations. This could be expected using RAM, but there is probably too much diversity in data flow from caches to benefit from advanced vector operation.


                              Time in milliseconds                     

            32 bit FFT 1    32 bit FFT 3    64 bit FFT 1    64 bit FFT 3  

             SP      DP      SP      DP      SP      DP      SP      DP
 Size K
      1     0.04    0.04    0.05    0.04    0.04    0.04    0.04    0.04
      2     0.08    0.13    0.10    0.10    0.08    0.14    0.08    0.10
      4     0.29    0.34    0.24    0.23    0.23    0.40    0.21    0.24
      8     0.79    0.82    0.57    0.51    0.74    0.99    0.47    0.51
     16     1.65    1.85    1.32    1.19    1.88    2.67    1.15    1.20
     32     3.76    4.71    2.69    3.30    5.04    5.16    2.26    3.31
     64     8.82   30.64    6.60    9.47    8.72   32.58    5.72   10.19
    128    58.54  132.41   16.92   23.85   49.92  160.12   15.92   24.43
    256   275.44  373.12   37.61   55.97  293.06  389.40   37.85   54.60
    512   780.89  751.27   81.54  128.13  559.88  780.79   82.06  119.23
   1024  1578.70 1812.20  186.45  288.27 1376.28 1890.46  178.37  262.30

    Ratios > 1.0 64 bit faster  Average     1.05    0.89    1.13    1.02
                                Minimum     0.75    0.69    0.99    0.93
                                Maximum     1.39    0.96    1.26    1.12
  


BusSpeed Benchmark - busspeedPiC8, busspeedPi64g8

This is a read only benchmark with data from caches and RAM. The program first reads one word in every 32, then repeats the pass with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling bus speed estimates to be made, as 16 times the speed measured at Inc16.

The speed via these increments can vary considerably, so comparisons are provided for the Read All column. There, the 32 bit RAM speeds are indicated as being slightly faster but, with data from caches, average 64 bit gains were around 55%.


     Reading Speed 4 Byte Words in MBytes/Second         

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read       
  KBytes  Words  Words  Words  Words  Words    All       

                        32 bit                           

      16   4880   5075   5612   5852   5877   5864       
      32    846   1138   2153   3229   4908   5300       
      64    746   1019   2035   3027   4910   5360       
     128    728    983   1952   2908   4888   5389       
     256    683    934   1901   2794   4874   5431       
     512    656    900   1760   2625   4585   5259       
    1024    301    410    870   1356   2846   4238       
    4096    233    248    531    996   2151   4045       
   16384    236    258    511    891   2143   4011       
   65536    237    257    508    881   2172   4015       

                         64 bit                   64 bit/
                                                   32 bit
      16   4898   5109   5626   5860   5879   9238   1.58
      32   1109   1389   2485   3804   5026   8435   1.59
      64    804   1030   2025   3285   4871   8312   1.55
     128    737    951   1877   3130   4908   8556   1.59
     256    732    953   1897   3147   4941   8617   1.55
     512    701    939   1766   2902   4601   8150   1.31
    1024    323    494    986   1807   3060   5553   1.31
    4096    242    259    486    964   1932   3856   0.95
   16384    236    268    493    971   1939   3878   0.97
   65536    242    271    494    973   1942   3884   0.97

  


MemSpeed Benchmark MB/Second - memspeedPiC8, memspeedPi64g8

The benchmark includes CPU speed dependent calculations using data from caches and RAM. The calculations are shown in the results column titles. Following are full Pi 32 bit and 64 bit results, plus some calculations of maximum MFLOPS.

Ignoring the last three columns, with no calculations, which are subject to over optimisation, the arithmetic overhead led to similar RAM performance in the two environments. Integer speeds appeared to be the same, but double precision tests indicated a 64 bit advantage of over 20% to 30%, depending on which cache was involved. This time (as seen before) the 64 bit compiler generated implausible code for the single precision calculations, producing much slower speeds than at double precision.

Below are results from running the original 64 bit version, compiled by gcc 7 (I think). This confirmed that the strange results are unlikely to be caused by the 64 bit hardware or Operating System.

 
 Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
 KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
   Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 32 bit

       8   11768   9844   3841  11787   9934   4351  10309   7816   7804
      16   11880   9880   3822  11886  10043   4363  10484   7902   7892
      32    9539   8528   3678   9517   8661   4098  10564   7948   7945
      64    9952   9310   3733   9997   9470   4160   8452   7717   7732
     128    9947   9591   3757   9990   9757   4178   8205   7680   7753
     256   10015   9604   3758  10030   9781   4186   8120   7734   7707
     512    9073   9300   3751   9472   9526   4175   7995   7709   7602
    1024    2681   5303   3594   2664   4965   3760   4828   3592   3569
    2048    1671   3488   3242   1757   3635   3540   2882   1036   1023
    4096    1777   3700   3283   1827   3627   3555   2433   1052   1054
    8192    1931   3805   3420   1933   3815   3629   2465    980    971
  MFLOPS    1471   2470                                                 
 64 bit

       8   15531   3999   3957  15576   4387   4358  11629   9313   9314
      16   15717   3992   3922  15770   4355   4377  11799   9444   9446
      32   12020   3818   3814  12043   4179   4198  11549   9496   9497
      64   12228   3816   3887  12220   4166   4195   8935   8506   8506
     128   12265   3869   3941  12157   4182   4206   8080   8193   8196
     256   12230   3873   3932  12073   4199   4216   8129   8224   8223
     512    9731   3832   3902   9709   4150   4171   8029   7845   7865
    1024    3772   3682   3769   3467   3887   3920   5478   5543   5378
    2048    1896   3463   3496   1886   3616   3612   2937   2945   2923
    4096    1924   3520   3528   1933   3651   3394   2752   2796   2785
    8192    1996   3523   3555   1988   3643   3630   2668   2661   2663
  MFLOPS    1964   1000                                                 
64 bit / 32 bit

      16    1.32   0.40   1.03   1.33   0.43   1.00   1.13   1.20   1.20
     256    1.22   0.40   1.05   1.20   0.43   1.01   1.00   1.06   1.07
    8192    1.03   0.93   1.04   1.03   0.95   1.00   1.08   2.72   2.74


 ########################### Earlier Version ###########################

     Memory Reading Speed Test armv8 64 Bit by Roy Longbottom

               Start of test Wed Jun 10 10:04:22 2020

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   15504  13974  12580  15552  14024  15534  11521   9313   7791
      16   15707  14173  12747  15758  14183  15746  11751   9445   7890
      32   13356  11998  11123  13372  12300  12836  11450   9500   7937
      64   12340  11302  10651  12156  11698  12044   9415   8937   7910
     128   12253  11384  10707  12207  11861  12083   8260   8299   7821
     256   12259  11408  10694  12089  11896  12091   8101   8220   7894
     512    9855   9593   9246  10264   9482   9801   7917   8057   7754
    1024    3317   3613   3571   3640   3602   3600   5885   5833   5616
    2048    1881   1885   1881   1890   1879   1879   2911   2999   3015
    4096    1950   1946   1949   1952   1941   1925   2672   2666   2661
    8192    1952   1964   1964   1968   1962   1961   2546   2536   2537
 


NeonSpeed Benchmark MB/Second - NeonSpeedC8, NeonSpeedPi64g8

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm functions were as generated by the compiler and NEON through using intrinsic functions.

The same slow single precision calculation speeds as above were produced again at 64 bits, whereas the earlier version results included below show normal speeds, again suggesting a compiler issue. As could be expected, 32 bit and 64 bit calculation speeds obtained via NEON intrinsic functions were effectively the same.


      Vector Reading Speed in MBytes/Second      
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v 
  KBytes   Norm   Neon   Norm   Neon  Float    Int

                      32 bit                  

      16   9884  12882   3910  12773  13090  15133
      32   9904  13061   3916  13002  13162  15239
      64   9029  11526   3450  10704  11708  12084
     128   9242  11784   3391  11016  11816  12179
     256   9283  11890   3396  11215  11929  12284
     512   9043  10680   3413  10211  10925  11241
    1024   5818   3310   3507   3288   3239   2902
    4096   4060   1994   3497   1991   2009   2011
   16384   4030   2063   3445   2068   2072   2067
   65536   3936   2109   3391   1858   2122   2121

                      64 bit                  

      16   3629  14987   3925  13643  14457  16642
      32   3475  10933   3821   9970  11029  11055
      64   3447  11749   3845  11098  11802  12079
     128   3332  11392   3912  10813  11430  11513
     256   3325  11565   3926  10981  11598  11699
     512   3313  10553   3917  10269  10755  10740
    1024   3239   3331   3737   3291   3302   3321
    4096   2987   1888   3331   1777   1881   1878
   16384   3150   1821   3347   1814   1812   1834
   65536   2747   1954   3132   2017   1904   2021

64 bit / 32 bit

      16   0.37   1.16   1.00   1.07   1.10   1.10
     256   0.36   0.97   1.16   0.98   0.97   0.95
   65536   0.70   0.93   0.92   1.09   0.90   0.95


 ########################### Earlier Version ###########################

  NEON Speed Test armv8 64 Bit V 1.0 Wed Jun 10 10:06:03 2020

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16  13999  16429  12687  15238  16213  17194
      32  12384  13367  11232  12767  14406  14493
      64  10736  11870  10305  10790  11940  11976
     128  10728  11826  10393  10739  11951  11956
     256  10760  11908  10386  10816  12026  12064
     512  10697  11911  10404  10781  12070  12006
    1024   3854   3941   3810   4015   4315   4402
    4096   2007   2000   2018   1985   1995   1999
   16384   2002   2008   1997   1927   1997   1997
   65536   2030   2027   2022   2020   2012   2023
 


MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in versions using standard compiled “C” code for single and double precision arithmetic, a further version using NEON intrinsic functions, and another using OpenMP procedures for automatic parallelism.


MP-Whetstone Benchmark - MP-WHETSPC8, MP-WHETSPi64g8

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Performance was generally proportional to the number of cores used. Overall seconds indicates MP efficiency.

The MWIPS performance rating indicated that 64 bit code was 13% faster than that at 32 bits.


      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS
 
                            32 bit                                 
 
 1T  1889.5  538.7  537.6  311.4  56.3  26.1  7450.5  2243.2  659.9
 2T  3782.7 1065.5 1071.2  627.1 112.3  52.0 14525.7  4460.9 1327.3
 4T  7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5  8944.2 2660.8
 8T  8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4

   Overall Seconds   4.99 1T,   5.00 2T,   5.03 4T,  10.06 8T      

                             64 bit                                
 
 1T  2147.8  530.7  530.0  397.8  60.5  27.3  7462.8  2237.7  998.2
 2T  4294.1 1058.4 1059.5  795.8 120.9  54.6 14877.9  4457.8 1994.8
 4T  8558.2 2093.8 2112.2 1590.3 241.8 108.3 29221.8  8909.9 3982.1
 8T  8987.0 2689.8 2721.9 1641.0 254.1 112.0 37422.9 10873.9 4122.3

   Overall Seconds   5.00 1T,   5.00 2T,   5.05 4T,  10.13 8T

                4 Thread 64 bit/32 bit Performance ratios          

       1.13   1.00   0.98   1.27  1.07  1.04    0.99   1.00    1.50
 


MP-Dhrystone Benchmark - MP-DHRYPiC8, MP-DHRYPi64g8

This executes multiple copies of the same program, but with some shared data, leading to unacceptable multithreaded performance. The single thread speeds were similar to the earlier Dhrystone results, with 44% 64 bit performance gains. The other results don’t mean much.


                      MP-Dhrystone Benchmark              

                    Using 1, 2, 4 and 8 Threads            
            
                              32 bit                        

 Threads                        1        2        4        8
 Seconds                     0.79     1.21     2.62     4.88
 Dhrystones per Second   10126308 13262168 12230188 13106002
 VAX MIPS rating             5763     7548     6961     7459

                              64 bit                        

 Seconds                     0.55     1.08     2.15     4.30
 Dhrystones per Second   14531390 14791730 14896723 14872767
 VAX MIPS rating             8271     8419     8478     8465

64 bit / 32 bit              1.44     1.12     1.22     1.13

  


MP SP NEON Linpack Benchmark - linpackNeonMPC8, linpackMPNeonPi64g8

This was produced to show that the original Linpack benchmark was completely unsuitable for benchmarking multiple CPUs or cores, and this is reflected in the results. The program uses NEON intrinsic functions, with increasing data sizes. The unthreaded results are of interest but, using NEON functions, the 64 bit program cannot improve performance much.

 Linpack Single Precision MultiThreaded Benchmark
             Using NEON Intrinsics           

  MFLOPS 0 to 4 Threads, N 100, 500, 1000     

 Threads      None        1        2        4 

                       32 bit                 

 N  100    2007.38   112.55   107.85   106.98 
 N  500    1332.24   686.10   686.11   689.02 
 N 1000     402.61   435.26   432.21   432.01 

                       64 bit                 

 N  100    2167.70    91.82    89.65    89.96 
 N  500    1438.27   644.85   635.89   635.33 
 N 1000     394.99   376.97   383.92   384.19 

                   64 bit / 32 bit            

 N  100       1.08     0.82     0.83     0.84 
 N  500       1.08     0.94     0.93     0.92 
 N 1000       0.98     0.87     0.89     0.89 


MP BusSpeed (read only) Benchmark - MP-BusSpd2PiC8, MP-BusSpd2Pi64g8

Each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points, the latter to avoid misrepresentation of performance using the shared L2 cache. Each set of results shows appropriate performance gains on increasing the number of threads used, but the 64 bit compiler somehow loses its way on decreasing addressing increments after Inc8, leading to the 32 bit version appearing to be three times faster.

Below are example results from a version compiled by gcc 9 for 64 bit Gentoo, showing that the performance issue was probably not caused by the 64 bit hardware or Operating System.

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll       

                         32 bit                          

 12.3 1T   5310   5616   5801   5898   5940  13425       
      2T   9393  10008  11293  11293  11368  24932       
      4T  15781  15015  17606  19034  22279  40736       
      8T   8465   9599  14580  18465  20034  36831       
122.9 1T    664    930   1861   3191   5017  10281       
      2T    564    726   1523   5376   9387  18985       
      4T    486    919   1886   4289   8337  16979       
      8T    487    912   1854   4275   8271  16826       
12288 1T    225    258    514   1010   1992   3975       
      2T    202    421    450   1765   3307   7396       
      4T    261    288    825   1332   1772   5014       
      8T    218    273    496   1041   2571   4021       

                         64 bit                   Rd All 
                                                  64 bit/
                                                  32 bit 
 12.3 1T   5168   5542   5641   4205   4095   4230   0.32
      2T   8968  10728  10161   8110   8058   8368   0.34
      4T   7874  13255  15586  13641  15485  16533   0.41
      8T   8186  13386  15239  13469  14431  16372   0.44
122.9 1T    598    927   1876   2792   3746   4059   0.39
      2T    514    719   1538   4846   7596   8083   0.43
      4T    486    933   2060   4126   8175  13690   0.81
      8T    483    937   2059   4160   8166  13817   0.82
12288 1T    224    257    488    964   1933   3579   0.90
      2T    219    427    889   1832   3493   5371   0.73
      4T    280    353    562    859   2168   3286   0.66
      8T    229    230    527   1075   1880   4480   1.11


 ###################### gcc 9 Version ###################

 MP-BusSpd 64 Bit gcc 9 Fri May 29 09:56:08 2020         

 12.3 4T   7317  13937  15720  18355  20549  33244
122.9 4T    492    937   1883   4009   7820  16423
  


MP RandMem Benchmark - MP-RandMemPiC8, MP-RandMemPi64g8

The benchmark uses the same complex indexing for serial and random access, with separate read only and read/write tests. The performance patterns were as expected, and essentially the same at 32 bits and 64 bits, with no scope for vectorisation. Random access is limited by the impact of burst reading and writing, producing the slow speeds shown. Read only performance increased, as expected, with the thread count, whilst read/write speed remained constant at a particular data size, probably due to write back to shared data space.

  KB       SerRD SerRDWR   RndRD RndRDWR
 
                    32 bit              

 12.3 1T    5950    7903    5945    7896
      2T   11849    7923   11887    7917
      4T   23404    7785   23395    7761
      8T   21903    7669   23104    7655
122.9 1T    5670    7309    2002    1924
      2T   10682    7285    1648    1923
      4T    9944    7266    1813    1927
      8T    9896    7216    1812    1919
12288 1T    3904    1075     179     164
      2T    7317    1055     215     164
      4T    3398    1063     343     165
      8T    4156    1062     350     165

                    64 bit              

 12.3 1T    5945    7898    5948    7895
      2T   11913    7937   11905    7929
      4T   23601    7875   23385    7867
      8T   23139    7777   23016    7770
122.9 1T    5785    7090    2026    1977
      2T   10941    7074    1654    1968
      4T   10364    7052    1854    1970
      8T   10256    7031    1844    1973
12288 1T    3861    1244     180     169
      2T    3793    1242     220     171
      4T    3941    1100     343     170
      8T    4065    1247     351     171


                64 bit / 32 bit         

 12.3 4T    1.01    1.01    1.00    1.01
122.9 4T    1.04    0.97    1.02    1.02
12288 4T    1.16    1.03    1.00    1.03


MP-MFLOPS Benchmarks - MP-MFLOPSPiC8, MP-MFLOPSDPC8, MP-NeonMFLOPSC8,
MP-MFLOPSPi64g8, MP-MFLOPSDPPi64g8, MP-NeonMFLOPSPi64g8

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are those used in the Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word, of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f and so on. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data.

There are three varieties, single precision, double precision and single precision through NEON intrinsic functions, all attempting to show near maximum MP floating point processing speeds. 64 bit operation implemented vector processing, with single precision expected to reach twice the maximum performance of double precision. The best gains over 32 bit working were more than 2.5 times, with four thread performance near 26 GFLOPS single precision and near 13 GFLOPS double precision. This time, the 32 bit NEON version provided performance improvements over the plain single precision version but, at 64 bits, more efficient vector instructions were implemented, operating at up to near 25 GFLOPS.

The numeric results are converted into a simple sumcheck that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences between NEON functions and double or single precision floating point instructions.

                Single Precision Version         

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
                        32 bit                      

 1T     1224    1257     520    2814    2800    2803
 2T     2485    2257     525    5608    5575    5576
 4T     4119    3243     534   11018   10645    8358
 8T     4131    4618     541    9941   10339    8165

                        64 bit                      

 1T     3303    3113     526    6750    6713    6429
 2T     6410    4860     540   13378   13373    9005
 4T    11696    6413     571   25479   25917   10126
 8T    10262   10054     571   23140   23427    8726

Max                                                 
64b/32b 2.83    2.18    1.06    2.31    2.43    1.21

            NEON Intrinsic Functions Version      

                       32 bit                    

 1T     2797    2870     641    4422    4454    4405
 2T     3217    5601     569    8587    8800    8377
 4T     7902    9864     611   17061   17215    9704
 8T     7070   10562     603   15531   16203    9516

                       64 bit                    

 1T     3319    3245     527    6569    6538    6294
 2T     5737    5333     556   12810   12784    9565
 4T     8497   11088     572   24775   24885    9570
 8T     8037   11330     573   22658   21773    9443

Max                                                 
64b/32b 1.08    1.07    0.89    1.45    1.45    0.99

              Double Precision Version            

                       32 bit                    

 1T     1203    1211     315    2675    2719    2674
 2T     2291    2441     293    5406    5421    4907
 4T     4673    2501     309   10313   10393    5256
 8T     4394    3550     265    8782   10110    5197

                       64 bit                    

 1T     1637    1553     273    3356    3351    3220
 2T     3180    3031     278    6664    6676    4531
 4T     5778    3102     283   12522   12675    4791
 8T     3927    4272     286   12304   11351    4875

Max                                                 
64b/32b 1.24    1.20    0.91    1.21    1.22    0.93

                        Sumchecks                   

 SP    76406   97075   99969   66015   95363   99951
 NEON  76406   97075   99969   66014   95363   99951
 DP    76384   97072   99969   66065   95370   99951
OpenMP-MFLOPS Benchmarks below or Go To Start


OpenMP-MFLOPS - OpenMP-MFLOPSC8, notOpenMP-MFLOPSC8, OpenMP-MFLOPS64g8, notOpenMP-MFLOPS64g8

This benchmark carries out the same single precision calculations as the MP-MFLOPS Benchmarks but, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.

The final data values are checked for consistency. Different compilers or different CPUs can use alternative instructions or rounding, with variable accuracy, so OpenMP sumchecks would normally be expected to match those from the NotOpenMP single core runs. However, this is not always the case. This benchmark was a compilation of code used for desktop PCs, with data sizes starting at 100 KB, then 1 MB and 10 MB.

The main purposes of this benchmark are to see whether OpenMP can produce similar maximum performance to MP-MFLOPS and whether this increases in line with the number of cores used. These objectives were met using 32 floating point operations per data word, where the 64 bit tests achieved up to 24 GFLOPS, 21% faster than at 32 bits.



  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

                                OpenMP MFLOPS 32 bit 
 
 Data in & out     100000     2     2500   0.098043     5100    0.929538   Yes       
 Data in & out    1000000     2      250   0.810084      617    0.992550   Yes
 Data in & out   10000000     2       25   0.922891      542    0.999250   Yes

 Data in & out     100000     8     2500   0.144870    13805    0.957126   Yes
 Data in & out    1000000     8      250   0.922568     2168    0.995524   Yes
 Data in & out   10000000     8       25   0.918226     2178    0.999550   Yes

 Data in & out     100000    32     2500   0.401577    19921    0.890282   Yes
 Data in & out    1000000    32      250   0.935064     8556    0.988096   Yes
 Data in & out   10000000    32       25   0.916277     8731    0.998806   Yes

                                 OpenMP MFLOPS 64 bit                           64b/
                                                                                 32b
 Data in & out     100000     2     2500   0.092784     5389    0.929538   Yes  1.06
 Data in & out    1000000     2      250   0.794744      629    0.992550   Yes  1.02
 Data in & out   10000000     2       25   0.784255      638    0.999250   Yes  1.18

 Data in & out     100000     8     2500   0.114583    17455    0.957117   Yes  1.26
 Data in & out    1000000     8      250   0.797846     2507    0.995518   Yes  1.16
 Data in & out   10000000     8       25   0.879850     2273    0.999549   Yes  1.04

 Data in & out     100000    32     2500   0.332392    24068    0.890215   Yes  1.21
 Data in & out    1000000    32      250   0.849420     9418    0.988088   Yes  1.10
 Data in & out   10000000    32       25   0.933336     8571    0.998796   Yes  0.98

                                 notOpenMP MFLOPS 32 bit                                           

 Data in & out     100000     2     2500   0.220277     2270    0.929538   Yes
 Data in & out    1000000     2      250   0.791373      632    0.992550   Yes
 Data in & out   10000000     2       25   0.792594      631    0.999250   Yes

 Data in & out     100000     8     2500   0.362916     5511    0.957126   Yes
 Data in & out    1000000     8      250   0.902125     2217    0.995524   Yes
 Data in & out   10000000     8       25   0.786859     2542    0.999550   Yes

 Data in & out     100000    32     2500   1.497859     5341    0.890282   Yes
 Data in & out    1000000    32      250   1.518747     5267    0.988096   Yes
 Data in & out   10000000    32       25   1.516393     5276    0.998806   Yes

                                 notOpenMP MFLOPS 64 bit                        64b/
                                                                                 32b                      
 Data in & out     100000     2     2500   0.152535     3278    0.929538   Yes  1.44     
 Data in & out    1000000     2      250   0.965797      518    0.992550   Yes  0.82
 Data in & out   10000000     2       25   0.781680      640    0.999250   Yes  1.01

 Data in & out     100000     8     2500   0.356388     5612    0.957117   Yes  1.02
 Data in & out    1000000     8      250   0.925742     2160    0.995518   Yes  0.97
 Data in & out   10000000     8       25   0.840113     2381    0.999549   Yes  0.94

 Data in & out     100000    32     2500   1.176455     6800    0.890215   Yes  1.27
 Data in & out    1000000    32      250   1.227945     6515    0.988088   Yes  1.24
 Data in & out   10000000    32       25   1.225311     6529    0.998796   Yes  1.24

  
OpenMP-MemSpeed Benchmarks below or Go To Start


OpenMP-MemSpeed - OpenMP-MemSpeed2C8, NotOpenMP-MemSpeed2C8
OpenMP-MemSpeed264g8, NotOpenMP-MemSpeed64g8

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with example single core results shown after the detailed measurements. Although the source code appears suitable for speed up by parallelisation, many of the test functions are slower using OpenMP. Detailed comparisons of these results are rather meaningless, but they demonstrate that OpenMP might fail to produce performance gains even on apparently suitable code. There might also be compile options that overcome this problem.

                  Memory Reading Speed Test OpenMP                      

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

                               32 bit                                   

       4    8097   8322   8641   8020   8436   8384  39701  19701  19712
       8    7814   8555   8756   8321   8548   8526  39042  19984  19996
      16    8149   7738   7742   8303   7779   8192  37995  19883  19984
      32    8969   8769   8799   9040   8759   8743  37737  20133  20130
      64    7617   7457   7437   7575   7380   7422  17770  15332  14248
     128   11221  10936  11003  11105  11011  10986  13650  13910  13881
     256   17883  18144  18036  17691  18094  17844  13073  12465  12535
     512   18001  18468  19675  17075  18221  19264  13511  13895  12008
    1024    9532  10590   9772  11842  11282  11277   7173   9473   9496
    2048    7095   7025   6866   7117   7043   6946   2914   3475   3468
    4096    7244   6927   7036   5951   7054   6531   2582   3130   3122
    8192    4578   7173   7025   6322   7078   7182   2504   3127   3115
   16384    5470   7043   7067   7103   7052   7020   2557   3093   3088
   32768    7359   7817   7766   7158   7078   7757   2618   3066   3094
   65536    7810   7268   7266   3824   7478   5164   2486   3016   2931
  131072    2460   2655   7224   7513   7308   7339   2540   2944   2940

 Not OMP                                                                
       8   11775   3895   4342  11787   4325   4354  10334   7806   7816
     256   10032   3699   4223   9978   4289   4185   7105   7612   7621
   65536    2099   2587   3033   2103   3021   3001   2585   1105   1101

                               64 bit                                   

       4    7749   8500   8716   7451   8520   8533  39508  18586  18589
       8    8198   8669   8874   8148   8678   8691  38972  18863  18861
      16    8023   8499   8335   7895   8355   8507  38305  19003  19004
      32    9034   8517   8619   9127   8550   8522  37928  19071  18409
      64    8652   8201   8178   8565   8223   8093  25191  17494  17508
     128   11397  11616  11715  11345  11649  11029  13861  14097  14170
     256   18242  18745  18195  17417  18605  18019  12535  12637  12623
     512   17580  18467  18787  18010  18414  18321  12900  13180  13121
    1024    8043  10172  11540  12510  10220  12082   9800   9586   9857
    2048    4816   6807   6850   6922   6805   6666   3137   3372   3369
    4096    7029   6846   6881   7017   5145   6801   2776   3124   3112
    8192    2428   7085   7124   7068   7134   6904   2571   3092   3112
   16384    7133   7152   7328   7008   3445   7178   2473   3099   3104
   32768    2656   7643   7669   7802   7616   7559   2043   3112   3104
   65536    7995   6523   2572   7059   6514   6485   2431   2955   3036
  131072    1981   7273   7327   1878   3615   7267   2538   2968   2976

 Not OMP                                                                
       8   15532   3990   4394  15567   4386   4394  11629   9315   9314
     256   12318   3871   4219  12134   4206   4219   8092   8231   8229
   65536    2005   2588   2937   2011   2930   2621   2577   2565   2566
  
I/O Benchmarks below or Go To Start


I/O Benchmarks

Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and WiFi network connections. The Raspberry Pi programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB files and, finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of large files and the file path. The same program code is used for both varieties, the only difference being file opening properties. The drive benchmark includes extra options to use direct I/O, avoiding data caching in main memory, but includes an extra test with caching allowed. For further details and downloads see the usual PDF file.


LanSpeed Benchmarks - WiFi - LanSpeed, LanSpeed64g8

Following are Raspberry Pi 32 bit and 64 bit results using what I believe were both 2.4 GHz and 5 GHz WiFi frequencies. Details on setting up the links can be found in this PDF file, LAN/WiFi section. Performance of the two systems was reasonably similar at both frequencies, but speeds can vary widely. Also (perhaps due to my setup), consistent 5 GHz operation was extremely difficult to achieve in both cases.

 *********************. 32 bit 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    6.35    6.33    6.38    7.05    6.98    7.10
      16    6.70    6.82    6.76    7.19    6.53    7.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     2.691   2.875   3.048    3.13    2.93    2.84

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.34    0.44    1.04    0.37    0.37    1.26
 ms/file   12.14   18.59    15.7    11.1    22.2   12.99   2.153


 ********************** 32 bit 5 GHz *********************

                         MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   11.90   12.96   13.16   10.11    9.55    9.66
      16   11.50   13.93   14.13    9.91    8.88    9.92

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.13    0.46    0.91    0.25    0.55    1.02
 ms/file   30.85   17.83   18.10   16.62   14.93   16.01   3.361

 Random similar to 2.4 GHz


 ********************** 64 bit 2.4 GHz *******************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     5.48     5.14     5.39     6.86     6.61     5.30
  16     5.62     5.64     5.69     5.17     5.02     5.18

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      3.666    4.035    5.131     4.82     4.67     3.90

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.24     0.52     0.95     0.34     0.60     1.14
 ms/file    17.10    15.73    17.20    12.00    13.68    14.35    2.437


 ********************** 64 bit 5 GHz *********************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    11.43    11.70    11.57     8.21     3.64     7.05
  16    10.96     7.30    11.84     8.40     6.24     7.94

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.38     0.73     1.12     0.39     0.73     0.98
 ms/file    10.82    11.15    14.62    10.55    11.23    16.73    2.618

 Random similar to 2.4 GHz
  
LAN Benchmark below or Go To Start


LanSpeed Benchmark - (1G bits per second Ethernet) - LanSpeed, LanSpeed64g8

Measured performance can vary significantly, but both 32 bit and 64 bit tests demonstrated Gigabit performance on the large files. Of particular note (with my program), the 32 bit system indicated that the 2 GB file could not be written, the actual file size ending at 2,147,483,647 bytes (2^31 - 1). On the other hand, at 64 bits, three files of up to 8 GB and 16 GB were successfully written and read (in around 25 minutes).

 ************************ 32 bit ************************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   67.82   12.97   90.19   99.84   93.49   96.83
      16   92.25   92.66   92.96   103.9  105.28   91.17

Random     Read                    Write
From MB        4       8      16       4       8      16
msecs      0.007    0.01    0.04    1.01    0.85    0.91

200 Files  Write                   Read                  Delete
File KB        4       8      16       4       8      16  secs
MB/sec      1.47     2.8    5.14    2.47    4.71    8.61
ms/file     2.78    2.92    3.19    1.66    1.74     1.9   0.256

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32    78.2   34.46   80.71   84.94   87.11   84.97
      64   88.18   87.52   87.03  111.34  109.58  107.28

     128   98.84   99.24   96.58  110.99  110.57   87.43
     256  106.75  105.43   106.4   85.78  108.99  106.29

    1024   96.13   93.34   94.98  114.51  112.16  114.91
    2048   Error writing file  Segmentation fault
           Wrote 2,147,483,647 bytes

 ************************ 64 bit ************************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024    93.63    93.17    96.38   108.02   109.36   109.30
2048    98.41    96.54    99.18   111.26   111.89   111.83

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.003    0.005    0.014     0.81     0.75     1.23

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      1.42     2.82     5.24     2.30     4.56     8.09
 ms/file     2.89     2.90     3.13     1.78     1.80     2.02    0.288


 Much Larger Files

 8192   89.77    89.98    91.86   117.29   117.21   117.17
16384   90.64    89.47    91.10   116.58   117.24   117.13
  
USB Benchmarks below or Go To Start


USB Benchmarks - DriveSpeed, DriveSpeed64v2g8

Following are DriveSpeed results at 32 bits and 64 bits, accessing the same USB 3 drive. Note the difference in performance during the various test procedures (they might not be the same next time). The 32 bit system again failed on attempting to write a 2 GB file (2^31-1 limit).

At 64 bits, a 4 GB file could not be written, a disappointing size limit. This benchmark uses Direct I/O. As I later discovered, running with caching enabled (using the LanSpeed benchmark) can write and read much larger files, including those too large to cache. The example below is for writing and reading three files, each near 6 GB and then 12 GB. The vmstat recordings show that there was no serious memory swapping, with around 7.5 GB of RAM used for caching.

 ********************* 32 bit USB 3 *********************

   DriveSpeed RasPi 1.1 Sat May 30 15:31:20 2020
 
 Selected File Path: /media/pi/PATRIOT1/
 Total MB  120832, Free MB  112565, Used MB    8267

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

 512    73.43    74.88    74.88   217.60   219.98   218.02
1024    63.03    76.64    74.46   220.72   220.60   219.97
 Cached
   8    38.07    41.95    39.95   700.06   693.26   677.20

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.982    0.981    1.001     6.81     6.31     6.31

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.14     2.58     5.23    10.32
 ms/file   120.08   120.06   120.00     1.59     1.57     1.59    2.491

 
 Larger Files           MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

2000    75.14    74.93    74.93   216.19   217.22   216.53
2048 Error writing file Segmentation fault

 ********************* 64 bit USB 3 *********************

   DriveSpeed RasPi 64 Bit gcc 8 Wed May 27 11:43:43 2020
 
 Selected File Path: /media/pi/PATRIOT1/
 Total MB  120832, Free MB  114614, Used MB    6218

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024    27.78    21.39    21.43   270.32   278.81   274.98
2048    21.40    21.14    21.44   275.79   273.14   319.95
 Cached
   8    40.27    42.81    42.81  1206.64  1068.72  1031.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.004    0.184     4.33     4.00     4.04

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.14   261.45    11.19    84.39
 ms/file   119.60   119.05   119.64     0.02     0.73     0.19    2.477

 Larger Files           MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

2048    23.77    19.89    20.64   320.34   272.90   271.96 
4096    Write failure

2000    21.72    22.38    26.57   275.40   273.85   309.57
4000    37.38    36.30    37.67   297.09   299.91   286.94

Caching Benchmark - USB 3 Hard Drive - 3 files up to near 36 GB capacity used  

 6000  169.80   136.20   126.26    90.43   146.13   144.05
12000  146.65   108.83    67.14   108.13   146.84   143.76

 swpd    free   buff  cache    si   so     bi     bo   vmstat memory and I/O activity

  768 7417668 102040  250844    0    0   1299   1329   Start
  768 1970544  94436 5704132    0    0      0 132723   Writing 12000 MB
  768  107908  92712 7568500    0    0 140339      0   Reading 12000 MB

Main Drive Benchmark below or Go To Start


Pi 4B Main Drive Benchmark - DriveSpeed, DriveSpeed64v2g8

The DriveSpeed benchmark failed to execute on the 64 bit system, providing the message “Error writing file Segmentation fault”. It had run previously on the Pi 4B but, again, would only write files smaller than 2 GB, as shown below. This also applied when running LanSpeed on the main drive. Note the faster reading speeds at 1024 MB below; this was because the file was small enough to be cached.

Below are default results from running LanSpeed on the Pi 4 at 64 bits, initially intended to verify that the main drive could be accessed by one of my programs. At first, I could not specify large files, as there was limited free space on the OS drive. After cloning the card to a 32 GB version, 19 GB of free space was indicated. I then ran the program to write three 6000 MB files. This was followed by specifying 16000 MB, where one file was written and the second generated an error after writing around 2500 MB. The good news was that the test did not crash the system.

 ************************ 32 bit ************************
  
 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   16.41   11.21   12.27   39.81   40.10   40.39
      16   11.79   21.10   34.05   40.18   40.19   40.33
Cached
       8  137.47  156.43  285.59  580.73  598.66  587.97

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.371   0.371   0.363    1.28    1.53    1.30

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      3.49    6.41    8.26    7.67   11.68   17.51
ms/file     1.17    1.28    1.98    0.53    0.70    0.94   0.014

Larger Files
        
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3


1024    13.38    13.35    13.39    42.68    42.59    42.36
2048    Error writing file Segmentation fault

LanSpeed
 
1024    11.65    13.46    13.48   560.78   574.76   617.67
2048    Error writing file Segmentation fault

 ************************ 64 bit ************************

   LanSpeed RasPi 64 Bit gcc 8 Wed May 27 10:36:54 2020
 
 Current Directory Path: /home/pi/Raspberry-Pi-4-64-Bit-Benchmarks
 Total MB   14637, Free MB    8724, Used MB    5913

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8   265.13   281.30   292.28  1270.88  1286.35  1329.42
  16   246.59   277.53   299.05  1201.20  1327.24  1095.78

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.002    0.002    0.002     7.68     9.01     7.14

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec     56.52    64.92    94.20   303.96   549.54   538.32
 ms/file     0.07     0.13     0.17     0.01     0.01     0.03    0.014

Larger File - 32 GB SD card  

Total MB   29643, Free MB   19776, Used MB    9868         

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

 6000    24.14    18.80    19.39    31.07    45.60    45.76
16000    21.12      Error writing file Segmentation fault

File 1  15.6 GiB (16,777,216,000 bytes)
File 2   2.5 GiB ( 2,645,176,320 bytes) - Not enough free space

Java Whetstone Benchmark below or Go To Start


Java Whetstone Benchmark - whetstc.class

The Java benchmarks comprise class files that were produced some time ago, but source code is available to regenerate them. Performance can vary significantly using different Java Virtual Machines, so comparisons might not be appropriate.

The results below suggest that overall 32 bit performance, in MWIPS, was faster than at 64 bits, due to the most time consuming functions (N5 and N6, marked x) taking less time. Note that some speeds are effectively the same as those found running the C compiled version above.

 ************************* 32 bit *************************

  Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900 x
  N6 floating point   0.999999821    345.95             1.5592 x
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


 ************************* 64 bit *************************

    Whetstone Benchmark Java Version, May 22 2020, 14:24:09

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    520.61             0.0369
  N2 floating point  -1.131330490    481.38             0.2792
  N3 if then else     1.000000000             236.41    0.4378
  N4 fixed point     12.000000000            1320.20    0.2386
  N5 sin,cos etc.     0.499110132              47.96    1.7348 x
  N6 floating point   0.999999821    276.33             1.9520 x
  N7 assignments      3.000000000             320.17    0.5772
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1487.99             6.7205

  Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
  Java Vendor         Debian, Version  11.0.7
  

JavaDraw Benchmark below or Go To Start


JavaDraw Benchmark - JavaDrawPi.class

The benchmark draws simple objects, in numbers ranging from small to rather excessive, to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load. For this to run at maximum speed, it was necessary to disable the experimental GL driver.

In this case, performance at 32 bits and 64 bits was quite similar.

 ************************* 32 bit *************************

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


 ************************* 64 bit *************************

   Java Drawing Benchmark, May 22 2020, 14:25:15
            Produced by javac 1.8.0_222

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      833    83.26
  Display PNG Bitmap Twice Pass 2     1001   100.05
  Plus 2 SweepGradient Circles         994    99.39
  Plus 200 Random Small Circles        836    83.54
  Plus 320 Long Lines                  380    37.98
  Plus 4000 Random Small Circles        95     9.44

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
  Java Vendor         Debian, Version  11.0.7
  

OpenGL GLUT Benchmark below or Go To Start


OpenGL GLUT Benchmark - videogl32, videogl64

In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress test of any duration.

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel containing various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file, where the first command is needed to avoid VSYNC, allowing FPS to exceed 60.

  export vblank_mode=0                                     
  ./videogl32 Width 320, Height 240, NoEnd                 
  ./videogl32 Width 640, Height 480, NoHeading, NoEnd      
  ./videogl32 Width 1024, Height 768, NoHeading, NoEnd     
  ./videogl32 Width 1920, Height 1080, NoHeading           
  

The benchmark could not be recompiled at 64 bits, as certain freeglut functions were not readily available, so an earlier version was used. In this case, at the higher pixel settings, the 64 bit version appeared to be slower on the graphics speed dependent tests, but faster elsewhere.

As indicated below, the dual monitor connections enabled this option to be tested at 64 bits.

 ************************ 32 bit ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May  2 19:01:05 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0

 ************************ 64 bit ************************

 GLUT OpenGL Benchmark 64 Bit gcc 9, Fri May 22 13:50:00 2020

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    753.4    414.5    258.3    152.0     42.7     30.0
   320   240    644.5    385.9    243.9    145.6     41.5     29.1
   640   480    320.6    270.6    217.9    136.8     43.0     29.4
  1024   768    140.6    135.1    122.6    114.1     41.8     28.5
  1920  1080     57.7     56.4     55.7     52.4     40.5     26.7

 ****************** 64 bit Dual Monitor ******************

  3840  1080     26.9     26.7     27.0     26.0     27.5     21.6
  

Usable RAM below or Go To Start


Usable RAM - MALLOC

On running various benchmarks, it became clear that there were restrictions on how much RAM could be used by my C based benchmarks. A simple program was written that allocated a specified amount of memory using malloc, filled it with data, freed the space, then repeated the sequence incrementally until an allocation failure was indicated. Both 32 bit and 64 bit versions were produced and each was run on 4 GB and 8 GB systems. Except at 64 bit 8 GB, all were restricted to less than 4,000,000,000 bytes. For the 64 bit 8 GB case, vmstat memory utilisation details are provided, showing the low points and samples between, identifying that memory space had been freed.

 ############################### 32 Bit OS ###############################
 
                                 4 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3050000000 (2.84 GB)
 
                                 8 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3060000000 (2.85 GB)

 ############################### 64 Bit OS ###############################
 
                                 4 GB RAM 
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3700000000 (3.45 GB)
 
                                 8 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  1000000000 words allocated  1000000000 written finished 
 Bytes 5000000000  1250000000 words allocated  1250000000 written finished 
 Bytes 6000000000  1500000000 words allocated  1500000000 written finished 
 Bytes 7000000000  1750000000 words allocated  1750000000 written finished 
 Bytes 8000000000  Memory allocation failed - Exit Later OK to 7900000000 (7.36 GB)

pass swpd    free   buff  cache    pass swpd    free   buff  cache

        0 7412260  85908 274472            0 7234852  85908 278140
 1      0 6615688  85908 277608     5      0 2600856  85908 277096
        0 7385388  85908 277264            0 7184736  85908 277612
 2      0 5671192  85908 277612     6      0 1571436  85908 277096
        0 7210328  85908 277264            0 7257464  85908 277096
 3      0 4526104  85908 277096     7      0 624436   86228 281456
        0 7324312  85908 277096            0 7402400  86228 283200
 4      0 3665272  85908 277264                                   

Usable RAM - Specified Dimensions

Where dimensions were specified in the programs, rather than using malloc, some differences were apparent. Using the 32 bit system, a compile error was indicated when the dimensions required 2 GB (2^31 bytes), with 1 byte less being accepted. As shown below, at 64 bits, more than 2 GB was allowed on the 4 GB system. Then, at both 4 GB and 8 GB, array sizes close to the full RAM capacities could be used.

 ######################## 32 Bit OS 4 GB and 8 GB ########################
  int    array[536870912]; size of array 'array' is too large 2 GB
  int    array[536870911]; compiles
  float  array[536870912]; size of array 'array' is too large 2 GB
  float  array[536870911]; compiles
  double array[268435456]; size of array 'array' is too large 2 GB
  double array[268435455]; compiles

 ############################# 64 Bit OS 4 GB ############################
 int    array[920000000];  OK 3.43 GB
 int    array[1073741824]; Segmentation fault 4 GB
 float  array[920000000];  OK 3.43 GB
 float  array[1073741824]; Segmentation fault 4 GB
 double array[460000000];  OK 3.43 GB
 double array[536870912];  Segmentation fault 4 GB

 ############################# 64 Bit OS 8 GB ############################
 int    array[1950000000]; OK 7.9 GB paging
 int    array[2147483648]; Segmentation fault 8 GB
 float  array[1950000000]; OK 7.9 GB
 float  array[2147483648]; Segmentation fault 8 GB 
 double array[975000000];  OK 7.9 GB
 double array[1073741824]; Segmentation fault 8 GB

High Performance Linpack Benchmark below or Go To Start


High Performance Linpack Benchmark - xhpl

I ported my ATLAS version of HPL, which I have run on earlier Raspberry Pi systems, to both the 64 bit and 32 bit SD cards. See my report at ResearchGate, Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. That report showed that the amount of memory used followed the same proportions as the original Linpack benchmark, somewhat greater than N x N x 8 bytes for double precision operation on an N x N dimensioned array. For RAM residence, N=20000 requires 4 GB and N=30000 needs 8 GB.

Following are results from tests run without and with a cooling fan in place. The first were for the original Pi 4 with 4 GB RAM, carried out in June 2019. The others, with 8 GB, were run in 2020 via recent 32 bit and 64 bit Operating System versions. With the fan in place, clock speeds were effectively constant at 1500 MHz on all three test rigs, with the same MFLOPS performance at each problem size. Then, the 4 GB system appeared to be running at a higher temperature, but not high enough to introduce CPU MHz throttling.

With no fan in use, throttling occurred on all systems, at N=16000. From then on, the 4 GB system suffered from more of this than the 8 GB models, reflected in higher temperatures and slower performance. The difference is thought to be due to the improvements that have been made in thermal management.

These tests show that the HPL benchmark is an excellent stress testing application, demonstrated here using most of the available RAM and running at high performance levels. The double precision speed approached the 12.6 GFLOPS achieved by one of my benchmarks. The 64 bit build does not appear to benefit from using advanced vector operations, but I could not identify whether other compiling parameters could be included.

                 No Fan                           Fan

 RAM at bits    N  GFLOPS Seconds  Max °C  Min MHz  GFLOPS Seconds  Max °C Min MHz

 4 GB 32b    8000     8.6      40     81      1500     9.3      37     61     1500
 8 GB 32b    8000     9.7      35     58      1500     9.6      35     57     1500
 8 GB 64b    8000     8.8      39     76      1500     8.7      39     55     1500

 4 GB 32b   16000     6.8     404     86   750/600    10.4     263     70     1500
 8 GB 32b   16000     8.6     319     83      1000    10.4     263     63     1500
 8 GB 64b   16000     8.1     338     84      1000    10.0     273     61     1500

 4 GB 32b   20000     6.2     856     87   750/600    10.8     494     71     1500
 8 GB 32b   20000     8.8     604     85      1000    10.7     497     63     1500
 8 GB 64b   20000     8.5     625     85  1000/600    10.3     519     63     1500

 4 GB 32b   30000     N/A                              N/A
 8 GB 32b   30000     8.2    2195     85  1000/600    11.3    1590     64     1500
 8 GB 64b   30000     7.6    2370     86  1000/600    11.4    1584     63     1500
  
Below are vmstat details, showing that most of the RAM was in use and four cores were running at 100% utilisation. Then there are examples of environmental differences between older 32 bit and later 64 bit operation, particularly MHz throttling variations, core voltage and pmic temperature differences.

 8 GB 64b 30000 
procs  -----------memory---------- ---swap--  -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

 0  0      0 7422216  83712 264952    0    0   213     4  211  345   2  2 96  1  0
 4  0      0 5366940  83720 269572    0    0   144     2 1130  483  82  3 15  0  0
 4  0      0 2974924  83728 271960    0    0     0     3 1287  585  97  3  0  0  0
 4  0      0  637296  83960 275704    0    0     0    48 1859 2130  96  4  0  0  0
 4  0   3072  246724  43176 207604    1   83   141    95 1663 1402  97  3  0  0  0
 4  0   3584  243388  32412 191932    3   17    11    23 1110  131 100  0  0  0  0
 6  0   3584  247168  32420 187520    0    0     0     2 1085   59 100  0  0  0  0
 Later
 5  0   3584  238580  34324 193432    0    0     4     2 1196  361  99  1  0  0  0
 5  0   7936  238124  26356 193392    0  140   386   193 1993 2064  97  3  0  0  0
 4  0   7936  247408  27264 194160    1    0    70    11 1889 1888  98  2  0  0  0

 4 GB 32b 20000 No Fan
  485.3   ARM MHz=1000, core volt=0.8771V, CPU temp=84.0'C, pmic temp=74.1'C
  506.6   ARM MHz= 750, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  528.0   ARM MHz= 750, core volt=0.8771V, CPU temp=86.0'C, pmic temp=74.1'C
  549.2   ARM MHz= 600, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  570.6   ARM MHz=1000, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  591.9   ARM MHz= 750, core volt=0.8771V, CPU temp=84.0'C, pmic temp=74.1'C

 8 GB 64b 30000 No Fan 
 1546.8   ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
 1577.8   ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1608.8   ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
 1639.9   ARM MHz=1000, core volt=0.8350V, CPU temp=85.0'C, pmic temp=70.3'C
 1670.8   ARM MHz=1000, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1701.8   ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1732.8   ARM MHz=1000, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
  

Floating Point Stress Tests below or Go To Start


Floating Point Stress Tests - MP-FPUStress, MP-FPUStressDP, MP-FPUStress64g8, MP-FPUStress64DPg8

These stress tests have a benchmarking mode that provides choices for a long running test. They cover number of threads, floating point operations carried out on each data word, and memory size to cover caches and RAM. Numeric sumchecks are carried out, where the same number of calculations apply at different thread counts in each section. Below are results for both 64 bit and 32 bit compilations, where sumchecks were identical. Performance at 64 bits can be seen to be faster than at 32 bits, with the best case nearly twice as fast.

Next, below, are results from 10 minute stress tests, showing measured GFLOPS and CPU temperatures for fanless operation. CPU MHz variations were between 1500/1000/750 at 32 bits and 1500/1000 for all 64 bit tests, again indicating improved thermal management.

                    64 Bits MFLOPS       Numeric Results      32 Bits MFLOPS
             Ops/   KB    KB    MB      KB     KB     MB      KB    KB    MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8    12.8   128  12.8

  Single Precision
   0.9    T1   2  3845  4032  1232   40394  76395  99700    2134  2607   656
   1.6    T2   2  7947  7992  1083   40394  76395  99700    5048  5156   621
   2.3    T4   2 14295 14760  1145   40394  76395  99700    7536  9939   681
   3.0    T8   2 13427 14985  1166   40394  76395  99700    7934  9839   639
   4.9    T1   8  4665  4740  3200   54764  85092  99820    5535  5420  2569
   6.0    T2   8  9334  9453  4143   54764  85092  99820   10757 10732  2454
   6.9    T4   8 17902 18462  4693   54764  85092  99820   18108 20703  2444
   7.7    T8   8 17473 18460  4570   54764  85092  99820   19236 20286  2245
  13.0    T1  32  5827  5869  5861   35206  66015  99520    5309  5270  5262
  15.6    T2  32 11712 11729 11524   35206  66015  99520   10551 10528  9753
  17.2    T4  32 23149 22887 16343   35206  66015  99520   20120 20886 11064
  18.7    T8  32 22202 23048 16411   35206  66015  99520   19415 20464  9929

  Double Precision
   1.8    T1   2  1802  1878   587   40395  76384  99700     921   998   326
   3.4    T2   2  3716  3741   527   40395  76384  99700    1968  1995   308
   4.8    T4   2  6814  7335   547   40395  76384  99700    3465  3925   342
   6.1    T8   2  6633  7011   588   40395  76384  99700    3646  3702   301
   9.2    T1   8  2738  2796  2014   54805  85108  99820    2377  2446  1283
  11.4    T2   8  5598  5582  2114   54805  85108  99820    4916  4860  1326
  13.0    T4   8 10545 11132  2196   54805  85108  99820    9202  9510  1391
  14.7    T8   8 10693 10849  2149   54805  85108  99820    9090  9006  1298
  24.1    T1  32  3280  3296  3279   35159  66065  99521    2695  2725  2707
  28.8    T2  32  6583  6588  6430   35159  66065  99521    5416  5441  5121
  31.6    T4  32 12785 13162  8477   35159  66065  99521   10666 10831  5275
  34.4    T8  32 12718 12781  8816   35159  66065  99521   10427 10602  4832

Stress Tests Original 32 Bits  ------------------ 64 Bits ------------------
             
               8 Ops/word      8 Ops/word      32 Ops/Word     32 Ops/Word DP
    Seconds    °C  GFLOPS      °C  GFLOPS       °C  GFLOPS      °C  GFLOPS

        0      61              59               58              58
       20      76    19.2      65    18.4       71    22.9      73    12.9
       40      81    19.0      74    18.2       74    23.1      77    12.9
       60      82    17.8      76    18.4       76    22.9      78    12.9
       80      83    15.5      78    18.1       78    23.0      80    13.0
      100      84    15.0      78    18.1       79    23.0      83    12.4
      120      83    14.0      82    18.2       81    23.0      82    11.7
      140      84    13.3      82    17.6       82    22.5      82    11.2
      160      84    13.3      81    16.8       82    21.6      82    10.9
      180      86    12.9      82    16.3       82    21.0      83    10.9
      200      85    13.0      82    16.2       82    20.7      83    10.5
      220      84    12.8      82    15.8       82    20.4      82    10.2
      240      84    12.6      83    15.6       82    20.1      83    10.2
      260      83    12.6      83    15.9       83    19.9      82    10.2
      280      85    12.2      83    15.3       82    19.9      83    10.0
      300      84    12.1      83    15.4       81    19.6      83     9.9
      320      85    12.0      83    15.5       82    19.5      82     9.7
      340      84    11.6      82    15.2       82    19.5      82     9.9
      360      85    11.6      83    14.7       83    19.3      83     9.8
      380      85    11.3      82    14.7       82    19.2      83     9.6
      400      85    11.6      83    14.8       82    19.0      83     9.6
      420      84    11.6      83    14.9       82    18.9      82     9.5
      440      85    11.5      82    14.6       83    18.8      82     9.6
      460      84    11.5      83    14.9       83    18.7      82     9.5
      480      85    11.5      83    14.6       82    18.8      83     9.5
      500      84    11.1      83    14.7       83    18.8      83     9.5
      520      85    11.3      82    14.6       82    18.6      83     9.4
      540      84    11.4      83    14.7       82    18.7      83     9.4
      560      84    11.3      83    14.6       82    18.7      83     9.6
      580      85    11.3      83    14.6       83    18.4      83     9.6
      600      85    11.3      83    14.5       83    18.5      83     9.7

 Average     83.9    12.9    81.2    15.9     81.1    20.2    81.9    10.5
 Min/max             0.58            0.78             0.80            0.72

Integer Stress Tests below or Go To Start


Integer Stress Tests - MP-IntStress, MP-IntStress64g8, MP-IntStress64

This program has variables for number of threads, memory required and running time. The test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Performance is measured in MBytes per second. Full results show the varying hexadecimal data patterns used and verification by comparison, not shown in the summary benchmarking mode details logged below. Here, it can be seen that 64 bit performance was much slower using the latest gcc 8 compilation. Earlier 64 bit results confirm that the poor performance was due to a compiling issue.

Following the benchmark results are details from two stress tests, running without an operational fan. The first represents one user demanding 7600000 KB (7.25 GB) of memory space. Performance throughout was effectively the same as the memory speed indicated by the benchmark (1 thread, 16 MB), CPU MHz being constant, with little change in temperatures. As shown by the vmstat details, some data was swapped out to make room for that of the application.

The second stress test involved 8 threads and cache based data, initially running at maximum CPU speed (for this code). This time, there was CPU clock throttling down to 1000 MHz, with CPU temperature rises up to 84°C and a 31% decrease in measured MBytes per second.

                              Benchmark MBytes/second

         ------ 32 Bits ------    ------ 64 Bits ------    --- 2019 64 Bits ---
            KB      KB      MB      KB       KB      MB      KB      KB      MB
 Threads    16     160      16      16      160      16      16     160      16

     1    5956    5754    3977    2878     2936    2602    5928    6786    3903
     2   11861   11429    3763    5855     5817    3641   14468   13292    3772
     4   22998   21799    3464   11403    11416    3564   27146   25103    3425
     8   22695   21128    3490   10853    11297    3557   27576   24844    3432
    16   22835   23491    3485   11069    11612    3548   27365   28511    3434
    32   22593   23485    3591   10790    11646    3758   26377   28527    3455

 Stress Test Start

                                          Data    Same All
  Seconds       Size  Threads  MB/sec  Sumcheck   Threads

     20.0 7600000 KB        1    2606  00000000     Yes
     57.8 7600000 KB        1    2604  FFFFFFFF     Yes
     91.0 7600000 KB        1    2575  5A5A5A5A     Yes
    129.5 7600000 KB        1    2608  AAAAAAAA     Yes

 vmstat 10 second samples
 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b   swpd    free   buff  cache  si   so   bi   bo   in   cs  us sy id wa st
  0  0      0 7433336  83140 266140   0    0  222    2  177  273   1  1 97  1  0
  1  0      0 5535964  83152 268248   0    0    2    7  501  707   2  7 92  0  0
  1  8  69888   63404   1048 106744  16 6943  515 6951  664  506   3 18 54 25  0
  1  0  67072   63916   4548 123920   3    0    3    8  468  260  26  1 74  0  0
 Later to end
  1  0  95336   62748   4868 135672   4    0    4    6  475  274  26  1 73  0  0

        -------- 7600000 KB 1 Thread --------   --------- 1280 KB 8 Threads --------
  Secs  MB/sec   MHz   Volts  °C CPU  °C PMIC   MB/sec   MHz   Volts  °C CPU °C PMIC

     0           1500  0.8500    59     55.2             1500  0.8600    57    54.3
    20    2606   1500  0.8500    61     55.2     10902   1500  0.8600    70    55.2
    40    2599   1500  0.8500    63     55.2     10267   1500  0.8600    73    58.0
    60    2604   1500  0.8500    63     56.2     10150   1500  0.8600    75    59.0
    80    2575   1500  0.8500    65     57.1     11046   1500  0.8600    79    61.8
   100    2566   1500  0.8500    65     57.1     11039   1500  0.8600    80    62.8
   120    2605   1500  0.8500    66     58.0     10503   1000  0.8600    81    64.6
   140    2608   1500  0.8500    66     58.0      8780   1500  0.8600    82    65.6
   160    2583   1500  0.8500    67     59.0      8501   1500  0.8600    82    66.5
   180    2605   1500  0.8500    66     59.0      8704   1500  0.8600    83    66.5
   200    2604   1500  0.8500    66     59.0      8507   1500  0.8600    83    66.5
   220    2608   1500  0.8500    67     59.0      8829   1000  0.8600    83    67.5
   240    2608   1500  0.8500    68     59.0      8749   1000  0.8600    82    67.5
   260    2605   1500  0.8500    68     59.0      8542   1500  0.8600    83    68.4
   280    2573   1500  0.8500    67     59.0      8500   1000  0.8600    82    67.5
   300    2601   1500  0.8500    68     59.0      8434   1000  0.8600    83    68.4
   320    2607   1500  0.8500    68     59.0      8360   1500  0.8600    83    68.4
   340    2605   1500  0.8500    68     59.0      8302   1000  0.8600    83    68.4
   360    2575   1500  0.8500    67     59.0      8179   1000  0.8600    82    68.4
   380    2608   1500  0.8500    68     59.0      8102   1000  0.8600    84    68.4
   400    2584   1500  0.8500    68     59.0      8215   1500  0.8600    84    68.4
   420    2575   1500  0.8500    68     59.0      8070   1000  0.8600    82    69.4
   440    2574   1500  0.8500    66     59.0      8042   1500  0.8600    82    69.4
   460    2608   1500  0.8500    67     59.0      7945   1500  0.8600    82    69.4
   480    2581   1500  0.8500    68     59.0      8100   1000  0.8600    84    69.4
   500    2583   1500  0.8500    67     59.0      8024   1000  0.8600    84    69.4
   520    2609   1500  0.8500    69     59.0      7933   1000  0.8600    82    69.4
   540    2602   1500  0.8500    67     60.9      7813   1000  0.8600    84    69.4
   560    2606   1500  0.8500    68     59.0      7988   1500  0.8600    83    69.4
   580    2606   1500  0.8500    69     60.9      7882   1000  0.8600    83    69.4
   600    2704   1500  0.8500    69     60.9      7597   1500  0.8600    83    69.4

64 GB SD Card below or Go To Start


64 GB SD Card - DriveSpeed64v2g8, LanSpeed64g8, DriveSpeed264WRg8, DriveSpeed264Rd2g8

My initial 64 bit Raspberry Pi OS was installed on a 16 GB SD card, later cloned (by Windows Win32DiskImager) to one with 32 GB capacity. It soon became apparent that this was too small to handle extra large files on the main drive. So I bought a 64 GB higher speed version, which, surprisingly, resized free space after booting. I then ran some tests to see how much of this could be used.

USB Drive - The first exercise was to compare performance of 64 GB and 32 GB SanDisk cards, using a USB 3 card reader, via DriveSpeed Direct I/O. The former has maximum MB/second ratings of 160 read and 60 write, whereas the latter is rated only for reading, at 98. For the large file tests, handling near 6 GB (3 x 2000 MB), reading speeds were similar, with the 64 GB card being much faster on writing. Random access and small file performance were also similar.

Main Drive - Up to nearly 24 GB (3 x 8 GB) of file space was used running LanSpeed, the same program as DriveSpeed, writing and reading using a 1 MB data array in RAM, with caching allowed, but caching was negated on handling such large files. Data from the random access and small size file tests was cached and can be ignored. Output from vmstat, with 10 second sampling, indicates that most of the memory was used then released, repeating the activity for the second three files. As observed in other tests, it seems that writing of cached data is deferred, overlapped with reading.

Huge File - Finally, an example of results from separate write/read and read only benchmarks, with caching enabled, is provided below. These just deal with large files, where up to three can be selected. In this case, one file of near 40 GB was written. The read only test loads the data into an array in RAM, where the maximum size appears to be around 6 GB. When dealing with smaller files, the system should be rebooted before reading, so that the data is no longer cached.

############################ USB 3 ############################

64 GB Total MB   59639, Free MB   48318, Used MB   11321
32 GB Total MB   29643, Free MB   19707, Used MB    9936

                            MBytes/Second
         MB   Write1   Write2   Write3    Read1    Read2    Read3

64 GB  2000    58.77    59.24    59.10    68.68    69.18    68.84
32 GB  2000    21.23    21.14    21.16    70.22    70.27    70.33

########################## Main Drive #########################

                        MBytes/Second
         MB   Write1   Write2   Write3    Read1    Read2    Read3

64 GB  4000    54.53    38.01    38.91    32.91    45.90    45.88
64 GB  8000    43.16    36.73    36.63    38.34    45.90    45.91

     -----------memory---------- ---swap-- -----io----
    swpd    free    buff   cache   si   so    bi    bo
Start/Write
       0 6430000 1024660  317064    0    0   270     3
       0 4232720 1024696 2511212    0    0     0 27790
       0 3138388 1024744 3605256    0    0     0 37690
Write/Read
     512  258740  427420 7089616    0    0  8336 30214
     512  67632   400000 7309488    0    0 24475 14101
     512  61368   340176 7376464    0    0 44800     0
Delete/Read/Write
     512   56868  121324 7600856    0    0 44817     0
     512 5605880  115092 2057148    0    0 18298 17233
     512 4472096  115140 3191272    0    0     0 36872
Write/Read
     512  267968   17524 7492716    0    0     3 33253
     512   75996   17596 7684276    0    0  8107 31443
     512   63056   17652 7698440    0    0 44817     0
End  512 7521128   18700  238356    0    0 37260     0

#################### Main Drive Near 40 GB ####################

 Before  Total MB   59639, Free MB   48324, Used MB   11315
 After   Total MB   59639, Free MB    8325, Used MB   51314

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

40000    36.65                      45.89
Read only
 6000     N/A      N/A      N/A     45.74

     -----------memory---------- ---swap-- -----io----
    swpd    free    buff   cache   si   so    bi    bo

Example write
     256  270432   33192 7473084    0    0     1 36069
Example read
     256   62384   31332 7681720    0    0 44809     0
Example read only after reboot
     256  272032   25052 3041320    0    0 44812     0 

System Stress Tests below or Go To Start


System Stress Tests

These stress tests were run twice, once with a cooling fan in use and then with the fan disabled. The following script file was run to open six terminals to execute my CPU MHz, Voltage and Temperature Measurement program and vmstat system monitor, whilst running my Livermore Loops, MP Integer RAM Exerciser, BurninDrive and OpenGL benchmarks, in stress testing mode, with nominal running time of 15 minutes.

On running these, as indicated by the environmental monitor, the system ran at much higher temperatures with no fan in use, but with no indication of CPU MHz throttling in the periodic instantaneous measurement samples. Vmstat recordings were virtually the same with and without cooling, starting with MP-IntStress64g8 grabbing near 6 GB of RAM, with continuing CPU utilisation of around 82% (three cores at 100%, one at 28%) and, after a short write phase, the main drive being read at 30 MB/second.

A variation of the Livermore Loops Benchmark has options to change the running time of each of the 72 program floating point kernels, to control running time for stress testing purposes, where results are also checked for correctness, and log numbers assigned to enable multiple copies to be run.

######################## Script File ########################

lxterminal -e ./RPiHeatMHzVolts2 Passes 16 Seconds 60 Log 31 &
lxterminal -e ./liverloopsPi64Rg8 Seconds 12 Log 31 &
lxterminal -e  ./MP-IntStress64g8 Threads 1 KB 6000000 Mins 15 Log 31 &
lxterminal -e ./burnindrive264g8 Repeats 16, Minutes 12, Log 31, Seconds 1  &
export vblank_mode=0  &
lxterminal -e ./videogl64g9 Test 6 Mins 15 Log 31 &
vmstat 60 16 > vmstat31.txt

############## With Cooling #############  ############### No Cooling ##############

================== CPU MHz CPU Voltage and Temperature Measurement =================

Secs  Start at Wed Jun 10 12:56:49 2020    Secs  Start at Wed Jun 10 13:19:58 2020
  0 ARM MHz=1500 0.85V CPU=39'C pmic=34'C    0 ARM MHz=1500 0.85V CPU=40'C pmic=35'C
 60 ARM MHz=1500 0.85V CPU=47'C pmic=39'C   60 ARM MHz=1500 0.85V CPU=58'C pmic=46'C
120 ARM MHz=1500 0.85V CPU=50'C pmic=41'C  120 ARM MHz=1500 0.85V CPU=65'C pmic=53'C
180 ARM MHz=1500 0.85V CPU=50'C pmic=42'C  180 ARM MHz=1500 0.85V CPU=68'C pmic=55'C
241 ARM MHz=1500 0.85V CPU=49'C pmic=41'C  241 ARM MHz=1500 0.85V CPU=71'C pmic=59'C
301 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  301 ARM MHz=1500 0.85V CPU=74'C pmic=60'C
362 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  362 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
422 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  422 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
483 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  482 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
543 ARM MHz=1500 0.85V CPU=51'C pmic=41'C  543 ARM MHz=1500 0.85V CPU=77'C pmic=64'C
604 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  603 ARM MHz=1500 0.85V CPU=78'C pmic=65'C
664 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  664 ARM MHz=1500 0.85V CPU=81'C pmic=66'C
725 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  724 ARM MHz=1500 0.85V CPU=80'C pmic=67'C
785 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  785 ARM MHz=1500 0.85V CPU=81'C pmic=67'C
846 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  845 ARM MHz=1500 0.85V CPU=76'C pmic=66'C
906 ARM MHz=1500 0.85V CPU=46'C pmic=42'C  905 ARM MHz=1500 0.85V CPU=73'C pmic=65'C
966 ARM MHz=1500 0.85V CPU=40'C pmic=37'C  966 ARM MHz=1500 0.85V CPU=65'C pmic=60'C
End at   Wed Jun 10 13:12:56 2020          End at   Wed Jun 10 13:36:04 2020

============================== vmstat 60 second samples =============================

  Memory MB        Swap MB/sec %utilise      Memory MB        Swap MB/sec %utilise
swpd free buf cach si so bi bo us sy id wa swpd free buf cach si so bi bo us sy id wa

   0 7231  45  486  0  0  1  0 14  2 81  3    0 7231  45  486  0  0  1  0 14  2 81  3
   0 1147  45  533  0  0 11 11 71 11  1 17    0 1147  45  533  0  0 11 11 71 11  1 17
   0 1145  45  535  0  0 29  0 76  8  1 16    0 1145  45  535  0  0 29  0 76  8  1 16
   0 1142  45  538  0  0 30  0 75  8  1 17    0 1142  45  538  0  0 30  0 75  8  1 17
   0 1142  45  536  0  0 30  0 75  7  1 17    0 1142  45  536  0  0 30  0 75  7  1 17
   0 1143  45  536  0  0 30  0 75  7  1 17    0 1143  45  536  0  0 30  0 75  7  1 17
   0 1141  45  539  0  0 30  0 75  7  1 17    0 1141  45  539  0  0 30  0 75  7  1 17
   0 1141  45  538  0  0 30  0 75  8  1 16    0 1141  45  538  0  0 30  0 75  8  1 16
   0 1138  45  541  0  0 30  0 75  7  1 17    0 1138  45  541  0  0 30  0 75  7  1 17
   0 1141  45  536  0  0 30  0 76  7  0 17    0 1141  45  536  0  0 30  0 76  7  0 17
   0 1139  45  540  0  0 30  0 75  7  1 16    0 1139  45  540  0  0 30  0 75  7  1 16
   0 1140  46  539  0  0 30  0 74  7  2 17    0 1140  46  539  0  0 30  0 74  7  2 17
   0 1143  46  536  0  0 30  0 75  7  2 17    0 1143  46  536  0  0 30  0 75  7  2 17
   0 1139  46  537  0  0 30  0 75  7  1 16    0 1139  46  537  0  0 30  0 75  7  1 16
   0 1143  46  537  0  0 31  0 61  7 13 18    0 1143  46  537  0  0 31  0 61  7 13 18
   0 1142  46  537  0  0 31  0 52  7 21 20    0 1142  46  537  0  0 31  0 52  7 21 20

======= Livermore Loops 64 Bit Reliability test 12 seconds each loop x 24 x 3 =======

Wed Jun 10 12:56:49 2020                   Wed Jun 10 13:19:58 2020
Numeric results were as expected           Numeric results were as expected
MFLOPS for 24 loops                        MFLOPS for 24 loops
2061.5  944.0  950.8  946.9  362.4  646.6  1498.8  991.4  920.0  733.5  370.3  561.1
2073.5 2695.3 1403.8  547.2  493.9  959.9  2202.2 2453.3 1991.9  711.4  473.4  676.4
 206.5  362.3  794.9  634.4  721.9 1143.2   178.3  349.0  766.6  601.3  641.1 1007.9
 411.8  367.7 1469.5  389.4  739.6  306.1   435.3  376.9 1530.5  365.2  801.5  309.5
Maximum Average Geomean Harmean Minimum     Maximum Average Geomean Harmean Minimum
 2698.1   912.3   737.2   602.3   187.7      2654.4   924.2   742.1   597.9   158.9
End of test Wed Jun 10 13:11:53 2020       End of test Wed Jun 10 13:33:21 2020 

Other Stress Testing Programs used are below or Go To Start


Other Stress Testing Programs - run with the above

MP Integer RAM Exerciser and OpenGL Benchmark - Both report results as the tests progress, and performance for the two is provided together below. The former was testing nearly 6 GB of RAM and the latter was running the OpenGL kitchen display test at 1920 x 1080 pixels. Performance varied over the whole period, probably due to the influence of the other programs, but averages over the 15 minutes were no different with and without cooling.

BurnInDrive uses 64 KB block sizes, with 164 variations of data patterns, where a parameter controls file size, in this case 16 blocks for 164 MB files. Four such files are written, then read by random selection for a specified time. Finally, blocks are read continuously for a specified number of seconds (see more information here). Again, there was no real difference with and without cooling. Measured performance, such as 33 x 4 x 164 MB in 12.32 minutes, is 29.3 MB/second, of the same order as that measured by vmstat.

MP Integer RAM and OpenGL Tests    With Cooling    No Cooling

 Seconds     KB Threads  Pattern   MB/sec     FPS  MB/sec     FPS

      30 6000000   1    00000000     1978      21    1999      21
      60 6000000   1    FFFFFFFF     1976      21    1864      21
      90 6000000   1    FFFFFFFF     2053      20    1979      21
     120 6000000   1    5A5A5A5A     1918      18    1762      20
     150 6000000   1    AAAAAAAA     1867      19    2066      20
     180 6000000   1    CCCCCCCC     2113      19    1974      21
     210 6000000   1    0F0F0F0F     1841      20    1995      20
     240 6000000   1    FFFFFFFF     1902      20    1928      21
     270 6000000   1    FFFFFFFF     1971      20    2089      20
     300 6000000   1    00000000     2033      20    2084      19
     330 6000000   1    5A5A5A5A     1863      21    1840      21
     360 6000000   1    AAAAAAAA     1974      21    1966      22
     390 6000000   1    AAAAAAAA     2012      21    1956      19
     420 6000000   1    CCCCCCCC     1929      20    1860      20
     450 6000000   1    0F0F0F0F     1964      20    1911      22
     480 6000000   1    00000000     1954      20    2007      21
     510 6000000   1    FFFFFFFF     2019      21    2010      20
     540 6000000   1    FFFFFFFF     1987      21    1999      21
     570 6000000   1    5A5A5A5A     1836      21    1981      21
     600 6000000   1    AAAAAAAA     1991      21    2551      18
     630 6000000   1    CCCCCCCC     1837      21    1996      20
     660 6000000   1    0F0F0F0F     2025      12    1824      21
     690 6000000   1    FFFFFFFF     2017      20    1870      21
     720 6000000   1    FFFFFFFF     2017      20    1843      21
     750 6000000   1    00000000     1858      21    1847      21
     780 6000000   1    5A5A5A5A     2100      21    1905      21
     810 6000000   1    AAAAAAAA     2008      20    1963      21
     840 6000000   1    CCCCCCCC     1966      21    1962      21
     870 6000000   1    CCCCCCCC     1983      21    1980      21
     900 6000000   1    0F0F0F0F     1970      20    1897      21

                        Average      1965    20.1    1964    20.6

======================== burnindrive264g8 Pi 4 Main Drive =========================

Current Path: /home/pi/0test/morestress  Total MB 59639 Free MB 20353, Used MB 39286

Wed Jun 10 12:56:49 2020                   Wed Jun 10 13:19:58 2020
File 1 164 MB written 9.19 seconds         File 1 164 MB written in 9.15 seconds
File 2 164 MB written 9.05 seconds         File 2 164 MB written in 8.94 seconds
File 3 164 MB written 9.63 seconds         File 3 164 MB written in 9.67 seconds
File 4 164 MB written 8.91 seconds         File 4 164 MB written in 8.97 seconds
Total                36.78 seconds         Total                   36.74 seconds

Start Reading Wed Jun 10 12:57:26 2020     Start Reading Wed Jun 10 13:20:35 2020
Passes  1 x 4 Files x 164 MB  0.38 minutes Passes  1 x 4 Files x 164 MB  0.38 minutes
Passes  2 x 4 Files x 164 MB  0.76 minutes Passes  2 x 4 Files x 164 MB  0.75 minutes
Passes  3 x 4 Files x 164 MB  1.13 minutes Passes  3 x 4 Files x 164 MB  1.14 minutes
To
Passes 31 x 4 Files x 164 MB 11.58 minutes Passes 31 x 4 Files x 164 MB 11.56 minutes
Passes 32 x 4 Files x 164 MB 11.95 minutes Passes 32 x 4 Files x 164 MB 11.93 minutes
Passes 33 x 4 Files x 164 MB 12.32 minutes Passes 33 x 4 Files x 164 MB 12.31 minutes

Start Repeat Read Wed Jun 10 13:09:45 2020 Start Repeat Read Wed Jun 10 13:32:53 2020
Passes in 1 second for 164 blocks of 64KB  Passes in 1 second for 164 blocks of 64KB

460 500 540 540 520 440 420 480 540 520    560 560 480 440 440 500 540 540 520 440
520 440 440 460 540 540 520 460 440 440    440 440 540 540 540 420 420 420 520 540
540 540 540 440 440 420 540 540 540 440    540 480 440 460 500 540 540 480 440 460
To                                         To
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580

83300 Passes of 64KB blocks  2.78 minutes  83900 Passes of 64KB blocks 2.78 minutes
No errors found during reading tests       No errors found during reading tests
End of test Wed Jun 10 13:12:32 2020       End of test Wed Jun 10 13:35:40 2020

Power Over Ethernet below or Go To Start


Power Over Ethernet (PoE)

I recently carried out tests of Raspberry Pi 4 systems using power supplied over LAN cables. My report is at ResearchGate in Benchmarking Raspberry Pi 4 Running From Power Over Ethernet.pdf. This covers using long, short, thick and thin cables, measuring data transmission speeds and the ability to run my most power consuming benchmarks, particularly with the only wire connected to the Pi being the Ethernet cable. Screenshots of remote control via Windows, Linux and Android are provided. PoE requires additional hardware that injects high voltage power on to the cable and, at the other end, converts it back to the voltage normally used by the destination device. For Raspberry Pi, there is a PoE HAT, with a fan, for this purpose, or separate fanless connectors can be obtained.

A few simple tests were run on the configuration being considered here, simply to verify that the facility was operational. In this case, 48 metres of CAT 6 cable was used with a fanless connector (the 8 GB Pi was fitted with an inexpensive fan). A hard disk and a USB flash drive were plugged in to USB 3 sockets, but not in use. The tests were executed via remote control terminals, using PuTTY on a Windows 7 based PC. After the first one, the only wire plugged in to the Pi was the power connector, from the PoE converter, with communication via WiFi. Results below were all copied from the Windows PuTTY displays.

The first tests were run using the LAN Benchmark, with only large file results shown. Ethernet performance was at the same 1 Gbps speeds identified earlier. WiFi was used from a greater distance, apparently operating mainly at 2.4 GHz speeds.

The other example is from running a Floating Point Stress Test for 10 minutes, with 8 threads sustaining the same near 24 GFLOPS continuously. The vmstat report indicates 8 processes in use and 100% CPU utilisation (of 4 cores) over the whole period. With the fan in use, temperature increases were insignificant. Core voltage did not change between idle and full speed operation.

################ Data Transmission Speeds ################

                       MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3
Ethernet
 512    80.81    81.27    83.18   112.53   111.69   112.38
1024    93.91    91.64    88.02   112.68   112.64   112.68
WiFi
  50     7.28     8.55     8.15     5.51     6.10     6.37
 100     5.95     7.97     7.14     6.58     5.26     6.75


############ High Power Demanding CPU Stress Test ###########
 
           Data             Ops/         Numeric
 Seconds    Size  Threads    Word  MFLOPS Results     Passes

    9.3  1280 KB        8      32   23435   50160      19677
   18.2  1280 KB        8      32   23274   50160      19677
   27.0  1280 KB        8      32   23375   50160      19677
   35.8  1280 KB        8      32   23374   50160      19677
   44.7  1280 KB        8      32   23357   50160      19677
 To
  566.3  1280 KB        8      32   23396   50160      19677
  575.1  1280 KB        8      32   23406   50160      19677
  583.9  1280 KB        8      32   23424   50160      19677
  592.7  1280 KB        8      32   23359   50160      19677
  601.7  1280 KB        8      32   23145   50160      19677


############################# vmstat Activity Monitor #############################

 procs  -----------memory---------- ---swap-- -----io---- -system--  ------cpu-----
  r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

  8  0      0 7723004  15720 148020    0    0    11     5  975  421  91  0  9  0  0
  8  0      0 7722396  15780 148140    0    0     0     4 1048  428 100  0  0  0  0
  8  0      0 7725468  15844 148044    0    0     0     5 1059  447 100  0  0  0  0
  8  0      0 7725720  15892 148052    0    0     0     3 1052  431 100  0  0  0  0
  8  0      0 7725404  15948 148072    0    0     0     3 1051  432 100  0  0  0  0
  8  0      0 7725368  16004 148072    0    0     0     3 1040  413 100  0  0  0  0
  8  0      0 7725984  16060 148076    0    0     0     4 1050  431 100  0  0  0  0
  9  0      0 7725908  16116 148076    0    0     0     3 1040  409 100  0  0  0  0
  8  0      0 7725656  16164 148084    0    0     0     3 1044  415 100  0  0  0  0
  8  0      0 7725372  16220 148092    0    0     0     4 1067  437 100  0  0  0  0


##################### CPU MHz, Voltage and Temperatures ####################
Seconds
    0.0   ARM MHz= 600, core volt=0.8500V, CPU temp=34.0'C, pmic temp=33.5'C
   60.0   ARM MHz=1500, core volt=0.8500V, CPU temp=51.0'C, pmic temp=38.2'C
  120.4   ARM MHz=1500, core volt=0.8500V, CPU temp=52.0'C, pmic temp=40.1'C
  180.8   ARM MHz=1500, core volt=0.8500V, CPU temp=52.0'C, pmic temp=40.1'C
  241.3   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  301.7   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  362.1   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  422.4   ARM MHz=1500, core volt=0.8500V, CPU temp=54.0'C, pmic temp=41.1'C
  482.8   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  543.2   ARM MHz=1500, core volt=0.8500V, CPU temp=54.0'C, pmic temp=41.1'C
  603.6   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C

CPU Performance Throttling Effects below or Go To Start


CPU Performance Throttling Effects

Another of my reports covered Raspberry Pi 4 CPU MHz Throttling Performance Effects. This was demonstrated by forcing the CPU clock speed to run continuously at 600 MHz, by setting the frequency scaling governor to powersave mode.

This exercise involved using BBC iPlayer for two and a half hours, the main reason being to see whether it remained usable with the minimum available resources.

The Raspberry Pi was connected to a TV with a 1920 x 1080 display, using WiFi communication and with the CPU at 600 MHz. A drama programme was watched for two hours, with no apparent buffering and, in my opinion, a perfectly good display, where the activity report indicated 960 x 540 size at 1700 kbps. A second programme, a wildlife documentary, did produce the occasional short delay, with buffering, reporting the same size but down to 923 kbps. The tests were run without an active cooling fan.

Following are vmstat details, showing CPU utilisation of around 47%, equivalent to two of the four CPU cores running at 100% for most of the time. Then, the environment monitor shows constant MHz and voltage, without significant rises in temperatures.


 vmstat
    -----------memory---------- ---swap--  -----io---- -system-- ------cpu-----
     swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

 Early  0 6475260 109296 736232    0    0     0   242 2795 3640  40  7 52  0  0
 End    0 6467036 111324 740656    0    0     0   248 2867 3752  39  7 54  0  0


 RPiHeatMHzVolts2 Program - Room at 27°C
  
 Hot start ARM MHz= 600, core volt=0.8500V, CPU temp=69.0'C, pmic temp=62.8'C
 Later     ARM MHz= 600, core volt=0.8500V, CPU temp=70.0'C, pmic temp=64.6'C
 Near End  ARM MHz= 600, core volt=0.8500V, CPU temp=72.0'C, pmic temp=66.5'C

  

Go To Start