Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks

Roy Longbottom

Contents


Summary Introduction Benchmark Results
Whetstone Benchmark Dhrystone Benchmark Linpack 100 Benchmark
Livermore Loops Benchmark FFT Benchmarks BusSpeed Benchmark
MemSpeed Benchmark NeonSpeed Benchmark MultiThreading Benchmarks
MP-Whetstone Benchmark MP-Dhrystone Benchmark MP NEON Linpack Benchmark
MP-BusSpeed Benchmark MP-RandMem Benchmark MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks OpenMP-MemSpeed Benchmarks I/O Benchmarks
WiFi Benchmark LAN Benchmark USB 3 Benchmarks
Pi 4 Main Drive benchmark Java Whetstone Benchmark JavaDraw Benchmark
OpenGL Benchmark Usable RAM High Performance Linpack
Floating Point Stress Tests Integer Stress Tests 64 GB SD Card
System Stress Tests Power Over Ethernet CPU Performance Throttling Effects


Summary

This report covers the May 2020 Raspberry Pi 4B upgrades, comprising 8GB RAM and the Beta pre-release 64 bit Raspberry Pi OS. Note that observations and performance measured might not apply to an officially released Operating System. Objectives of this exercise were to show that my programs could be compiled and run on the 64 bit system and to compare performance with that of the original 32 bit Pi 4B.

Single Core CPU Tests - These “Classic Benchmarks” set the original performance standards of computers. There are four, one with three varieties, and some with multiple test functions. All showed average 64 bit performance gains in the range of 11% to 81%, the highest where the new vector instructions were generated by the compiler.

Single Core Memory Benchmarks - These measure performance using data from caches and RAM. There are four benchmarks, each with between 60 and 100 measurements. A bottom line assessment is that 64 bit and 32 bit speeds from RAM were the same, as were around half of CPU dependent routines, with the other half an average near 30% faster at 64 bits.

Multithreading Benchmarks - There were twelve, including some intended to show that particular workloads are unsuitable for multithreaded operation. Five measured floating point performance, where the average 64 bit gain was 39%, demonstrating maxima of 25.9 single precision GFLOPS and 12.7 double precision GFLOPS. Of the other two applicable benchmarks, one was 13% faster at 64 bits, with the other indicating the same performance.

Drive and Network Benchmarks - These mainly ran successfully at 64 bits, providing similar performance to 32 bit runs. A major difference is that file sizes appeared to be limited to 2 GB minus 1 byte (2^31-1) at 32 bits. At this stage there were free space limitations but, at 64 bits, up to 3 x 12 GB files could be exercised.

Java and OpenGL Benchmarks - 64 bit Java CPU speed, Java drawing and OpenGL benchmarks were run, with different window settings, including using dual monitors.

Usable RAM - Two simple repetitive exercises were carried out to see how much RAM space could be used, via allocation and dimensioned arrays. With one program, memory was allocated in 1 billion byte steps. Maximums were 3 billion bytes at 32 bits; at 64 bits they were 3 billion with 4 GB RAM and 7 billion with 8 GB. With dimensioning, more precise values were obtained, indicating 3.43 GB and 7.9 GB at 64 bits, but 2 GB minus 1 byte at 32 bits.

High Performance Linpack Benchmark - Performance depends on the memory size parameter N squared. With a fan in use, maximum 32 bit and 64 bit speeds were similar, at around 11.25 double precision GFLOPS with N=30000 and 8 GB RAM; best performance with 4 GB was 10.8 GFLOPS at N=20000. As a stress test, with no fan, the original Pi 4 board obtained 6.2 GFLOPS at N=20000, with the new one reaching at least 8.5 GFLOPS, demonstrating a significant improvement in thermal management.

CPU Stress Tests - Floating point tests demonstrated the same best case 64 bit performance gains as earlier benchmarks and details of 10 minute stress tests confirmed better thermal management, in a more linear way. A single thread 10 minute stress test was run with integer calculations using more than 7.2 GB of RAM, with some swapping, but no severe performance degradation. The stress tests were run without an operational fan.

64 GB Main Drive SD Card - This was obtained to show that extra large files could be used. A single near 40 GB file was written and read with a new benchmark variation, taking 33 minutes.

System Stress Tests - Fifteen minute tests were run, with and without cooling, using four benchmarks covering CPU, nearly 6 GB of RAM, the main drive and graphics. There were temperature rises with no cooling, but with little performance degradation, continuously providing around 0.6 GFLOPS, 1140 MB/second from RAM, 30 MB/second from the drive and 21 FPS graphics speed.

Power over Ethernet - Following more comprehensive earlier activity, some long cable PoE tests were repeated to confirm that it was still applicable for this 64 bit configuration.

CPU Performance Throttling Effects - Again, after an earlier exercise, frequency scaling settings forced the CPU to run at 600 MHz, normally the lowest throttling frequency, whilst playing programmes via BBC iPlayer for more than two hours to an HD TV, over WiFi. This ran with acceptable picture and sound quality.



Introduction

This report covers the May 2020 Raspberry Pi 4B upgrades, comprising 8 GB RAM and the Beta pre-release 64 bit Raspberry Pi OS (Operating System). This is a continuation of earlier activity with details at ResearchGate in Raspberry Pi 4B 32 Bit Benchmarks.pdf and Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. These provide more detailed information of the programs used and comparisons with older systems.

Most of the benchmarks and stress testing programs were recompiled for use here, using the supplied gcc 8 compiler, with two not producing acceptable code; these were substituted by earlier 64 bit versions. All the programs are available for download from ResearchGate in Raspberry-Pi-OS-64-Bit-Benchmarks.tar.xz.

Traditionally, the benchmarks provide details of the system being tested by accessing built-in CPUID information. Following are the latest details, identifying the difference between 32 bit and 64 bit operation.

32 bit
 
Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Vendor ID:             ARM
Model:                 3
Model name:            Cortex-A72
Stepping:              r0p3
CPU max MHz:           1500.0000
CPU min MHz:           600.0000
BogoMIPS:              270.00
Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
                       idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-05-13

64 bit
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               fp asimd evtstrm crc32 cpuid
Linux raspberrypi 4.19.118-v8+ #1311 SMP PREEMPT Mon Apr 27 14:32:38 BST 2020 aarch64 GNU/Linux
  



Benchmark Results

The following provide benchmark results with limited comments on Raspberry Pi 4B performance, compiled as 32 bit and 64 bit programs. There are also considerations of the impact of the larger 8 GB RAM and the possibility of larger file sizes.



Whetstone Benchmark - whetstonePiC8, whetstonePi64g8

This has a number of simple programming loops, with the overall MWIPS rating mainly dependent on the floating point calculations, particularly those identified as COS and EXP. The last three tests can be over optimised, but their times do not affect the overall rating much.

Performance is normally more dependent on CPU MHz than on advanced instructions, but an overall improvement of 11% was indicated, with gains of up to 27% on the straightforward floating point calculations.


 System    MHz   MWIPS  ------MFLOPS------   -------------MOPS---------------
                          1      2      3     COS    EXP  FIXPT     IF  EQUAL

 32 bit   1500   1883    522    471    313   54.9   26.4   2496   3178    998
 64 bit   1500   2085    524    535    398   57.6   27.3   2493   2979    997

 64/32 bit       1.11   1.00   1.14   1.27   1.05   1.03   1.00   0.94   1.0


Dhrystone Benchmark - dhrystonePiC8, dhrystonePi64g8

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so results from different compilers cannot be compared reliably. Ignoring this, results in VAX MIPS, aka DMIPS, and comparisons follow.

The 64 bit compilation provided an apparent 54% performance improvement, but this was possibly due to over optimisation.

                            DMIPS
 System     MHz    DMIPS     /MHz

 32 bit    1500     5077     3.76
 64 bit    1500     7814     5.21

 64/32 bit          1.54         


Linpack 100 Benchmark MFLOPS - linpackPiC8, linpackPiC8SP, linpackPiNEONiC8, linpackPi64g8, linpackPi64gSP, linpackPi64NEONig8

This original Linpack benchmark uses a small data array, unsuitable for higher speed multiprocessing. It executes double precision arithmetic. I introduced a single precision version with a NEON variety, to indicate vector processing speed.

The NEON version, which uses intrinsic functions, was the star of the show when the Pi 4B was introduced, producing the most significant performance improvements over the Pi 3B, the benefit being reflected in the 32 bit NEON/SP results below. The 64 bit SP result now shows that 64 bit vector instructions can achieve the same sort of performance gains, this time 81% faster than at 32 bits.

                                    NEON
 System     MHz      DP      SP      SP

 32 bit    1500   957.1  1068.8  1819.9
 64 bit    1500  1111.5  1938.2  2030.9

 64/32 bit         1.16    1.81    1.12


Livermore Loops Benchmark MFLOPS - liverloopsPiC8, liverloopsPi64g8

This benchmark measures performance of 24 double precision kernels, originally used in selecting supercomputers. The official average is the geometric mean, on which the Cray 1 supercomputer was rated at 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores.

Based on the Geomean results, the overall 64 bit speed rating was 13% faster than at 32 bits, but vector instructions pushed individual kernel gains up to a maximum of 67%.

 MFLOPS for 24 loops

  32 bit
  1480  1017   974   930   383   657  1624  1861  1664   617   498   741
   221   320   803   640   737  1003   451   378  1047   411   763   187

  64 bit
  2108   936   960   965   383   809  2313  2488  2066   669   500   981
   181   405   815   644   727  1190   450   397  1716   367   818   313

  64 bit / 32 bit gain range - 0.82 to 1.67                             

 Comparisons

 System    MHz   Maximum Average Geomean Harmean Minimum

 32 bit   1500    1860.8   800.4   679.0   564.1   179.5
 64 bit   1500    2616.7   959.8   766.7   613.0   169.7

 64/32 bit          1.41    1.20    1.13    1.09    0.95


Fast Fourier Transforms Benchmarks - fft1PiC8, fft3cPiC8, fft1Pi64g, fft3cPi64g8

This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements use both single and double precision data, calculating FFT sizes between 1K and 1024K, with data from caches and RAM. Note that steps in performance levels occur at data size changes from L1 to L2 caches, then to RAM.

Following are average running times from the three passes of each FFT calculation. There were no significant variations in overall performance between 32 bit and 64 bit compilations. This could be expected using RAM, but there is probably too much diversity in data flow from caches to benefit from advanced vector operation.


                              Time in milliseconds                     

            32 bit FFT 1    32 bit FFT 3    64 bit FFT 1    64 bit FFT 3  

             SP      DP      SP      DP      SP      DP      SP      DP
 Size K
      1     0.04    0.04    0.05    0.04    0.04    0.04    0.04    0.04
      2     0.08    0.13    0.10    0.10    0.08    0.14    0.08    0.10
      4     0.29    0.34    0.24    0.23    0.23    0.40    0.21    0.24
      8     0.79    0.82    0.57    0.51    0.74    0.99    0.47    0.51
     16     1.65    1.85    1.32    1.19    1.88    2.67    1.15    1.20
     32     3.76    4.71    2.69    3.30    5.04    5.16    2.26    3.31
     64     8.82   30.64    6.60    9.47    8.72   32.58    5.72   10.19
    128    58.54  132.41   16.92   23.85   49.92  160.12   15.92   24.43
    256   275.44  373.12   37.61   55.97  293.06  389.40   37.85   54.60
    512   780.89  751.27   81.54  128.13  559.88  780.79   82.06  119.23
   1024  1578.70 1812.20  186.45  288.27 1376.28 1890.46  178.37  262.30

    Ratios > 1.0 64 bit faster  Average     1.05    0.89    1.13    1.02
                                Minimum     0.75    0.69    0.99    0.93
                                Maximum     1.39    0.96    1.26    1.12
  


BusSpeed Benchmark - busspeedPiC8, busspeedPi64g8

This is a read only benchmark with data from caches and RAM. The program first reads one word in every 32, then repeats the pass with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling bus speed estimates to be made, as 16 times the speed measured at Inc16.

The speed via these increments can vary considerably, so comparisons are provided for the Read All column. There, the 32 bit RAM speeds are indicated as being slightly faster but, with data from caches, average 64 bit gains were around 55%.


     Reading Speed 4 Byte Words in MBytes/Second         

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read       
  KBytes  Words  Words  Words  Words  Words    All       

                        32 bit                           

      16   4880   5075   5612   5852   5877   5864       
      32    846   1138   2153   3229   4908   5300       
      64    746   1019   2035   3027   4910   5360       
     128    728    983   1952   2908   4888   5389       
     256    683    934   1901   2794   4874   5431       
     512    656    900   1760   2625   4585   5259       
    1024    301    410    870   1356   2846   4238       
    4096    233    248    531    996   2151   4045       
   16384    236    258    511    891   2143   4011       
   65536    237    257    508    881   2172   4015       

                         64 bit                   64 bit/
                                                   32 bit
      16   4898   5109   5626   5860   5879   9238   1.58
      32   1109   1389   2485   3804   5026   8435   1.59
      64    804   1030   2025   3285   4871   8312   1.55
     128    737    951   1877   3130   4908   8556   1.59
     256    732    953   1897   3147   4941   8617   1.55
     512    701    939   1766   2902   4601   8150   1.31
    1024    323    494    986   1807   3060   5553   1.31
    4096    242    259    486    964   1932   3856   0.95
   16384    236    268    493    971   1939   3878   0.97
   65536    242    271    494    973   1942   3884   0.97

  


MemSpeed Benchmark MB/Second - memspeedPiC8, memspeedPi64g8

The benchmark includes CPU speed dependent calculations using data from caches and RAM. The calculations are shown in the results column titles. Following are full Pi 32 bit and 64 bit results, plus some calculations of maximum MFLOPS.

Ignoring the last three columns, with no calculations, which are subject to over optimisation, the arithmetic overhead led to similar RAM performance in the two environments. Integer speeds appeared to be the same, but double precision tests indicated a 64 bit advantage of over 20% to 30%, depending on which cache was involved. This time (as seen before) the 64 bit compiler generated implausible code for the single precision calculations, producing much slower speeds than at double precision.

Below are results from running the original 64 bit version, compiled by gcc 7 (I think). This confirmed that the strange results are unlikely to be caused by the 64 bit hardware or Operating System.

 
 Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
 KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
   Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 32 bit

       8   11768   9844   3841  11787   9934   4351  10309   7816   7804
      16   11880   9880   3822  11886  10043   4363  10484   7902   7892
      32    9539   8528   3678   9517   8661   4098  10564   7948   7945
      64    9952   9310   3733   9997   9470   4160   8452   7717   7732
     128    9947   9591   3757   9990   9757   4178   8205   7680   7753
     256   10015   9604   3758  10030   9781   4186   8120   7734   7707
     512    9073   9300   3751   9472   9526   4175   7995   7709   7602
    1024    2681   5303   3594   2664   4965   3760   4828   3592   3569
    2048    1671   3488   3242   1757   3635   3540   2882   1036   1023
    4096    1777   3700   3283   1827   3627   3555   2433   1052   1054
    8192    1931   3805   3420   1933   3815   3629   2465    980    971
  MFLOPS    1471   2470                                                 
 64 bit

       8   15531   3999   3957  15576   4387   4358  11629   9313   9314
      16   15717   3992   3922  15770   4355   4377  11799   9444   9446
      32   12020   3818   3814  12043   4179   4198  11549   9496   9497
      64   12228   3816   3887  12220   4166   4195   8935   8506   8506
     128   12265   3869   3941  12157   4182   4206   8080   8193   8196
     256   12230   3873   3932  12073   4199   4216   8129   8224   8223
     512    9731   3832   3902   9709   4150   4171   8029   7845   7865
    1024    3772   3682   3769   3467   3887   3920   5478   5543   5378
    2048    1896   3463   3496   1886   3616   3612   2937   2945   2923
    4096    1924   3520   3528   1933   3651   3394   2752   2796   2785
    8192    1996   3523   3555   1988   3643   3630   2668   2661   2663
  MFLOPS    1964   1000                                                 
64 bit / 32 bit

      16    1.32   0.40   1.03   1.33   0.43   1.00   1.13   1.20   1.20
     256    1.22   0.40   1.05   1.20   0.43   1.01   1.00   1.06   1.07
    8192    1.03   0.93   1.04   1.03   0.95   1.00   1.08   2.72   2.74


 ########################### Earlier Version ###########################

     Memory Reading Speed Test armv8 64 Bit by Roy Longbottom

               Start of test Wed Jun 10 10:04:22 2020

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   15504  13974  12580  15552  14024  15534  11521   9313   7791
      16   15707  14173  12747  15758  14183  15746  11751   9445   7890
      32   13356  11998  11123  13372  12300  12836  11450   9500   7937
      64   12340  11302  10651  12156  11698  12044   9415   8937   7910
     128   12253  11384  10707  12207  11861  12083   8260   8299   7821
     256   12259  11408  10694  12089  11896  12091   8101   8220   7894
     512    9855   9593   9246  10264   9482   9801   7917   8057   7754
    1024    3317   3613   3571   3640   3602   3600   5885   5833   5616
    2048    1881   1885   1881   1890   1879   1879   2911   2999   3015
    4096    1950   1946   1949   1952   1941   1925   2672   2666   2661
    8192    1952   1964   1964   1968   1962   1961   2546   2536   2537
 


NeonSpeed Benchmark MB/Second - NeonSpeedC8, NeonSpeedPi64g8

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm functions were as generated by the compiler and NEON through using intrinsic functions.

The same slow single precision calculation speeds as above were produced again at 64 bits, whereas the earlier version results included below show normal speeds, again suggesting a compiler issue. As could be expected, 32 bit and 64 bit calculation speeds obtained via NEON intrinsic functions were effectively the same.


      Vector Reading Speed in MBytes/Second      
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v 
  KBytes   Norm   Neon   Norm   Neon  Float    Int

                      32 bit                  

      16   9884  12882   3910  12773  13090  15133
      32   9904  13061   3916  13002  13162  15239
      64   9029  11526   3450  10704  11708  12084
     128   9242  11784   3391  11016  11816  12179
     256   9283  11890   3396  11215  11929  12284
     512   9043  10680   3413  10211  10925  11241
    1024   5818   3310   3507   3288   3239   2902
    4096   4060   1994   3497   1991   2009   2011
   16384   4030   2063   3445   2068   2072   2067
   65536   3936   2109   3391   1858   2122   2121

                      64 bit                  

      16   3629  14987   3925  13643  14457  16642
      32   3475  10933   3821   9970  11029  11055
      64   3447  11749   3845  11098  11802  12079
     128   3332  11392   3912  10813  11430  11513
     256   3325  11565   3926  10981  11598  11699
     512   3313  10553   3917  10269  10755  10740
    1024   3239   3331   3737   3291   3302   3321
    4096   2987   1888   3331   1777   1881   1878
   16384   3150   1821   3347   1814   1812   1834
   65536   2747   1954   3132   2017   1904   2021

64 bit / 32 bit

      16   0.37   1.16   1.00   1.07   1.10   1.10
     256   0.36   0.97   1.16   0.98   0.97   0.95
   65536   0.70   0.93   0.92   1.09   0.90   0.95


 ########################### Earlier Version ###########################

  NEON Speed Test armv8 64 Bit V 1.0 Wed Jun 10 10:06:03 2020

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16  13999  16429  12687  15238  16213  17194
      32  12384  13367  11232  12767  14406  14493
      64  10736  11870  10305  10790  11940  11976
     128  10728  11826  10393  10739  11951  11956
     256  10760  11908  10386  10816  12026  12064
     512  10697  11911  10404  10781  12070  12006
    1024   3854   3941   3810   4015   4315   4402
    4096   2007   2000   2018   1985   1995   1999
   16384   2002   2008   1997   1927   1997   1997
   65536   2030   2027   2022   2020   2012   2023
 


MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in versions using standard compiled “C” code for single and double precision arithmetic, a further version using NEON intrinsic functions, and another using OpenMP procedures for automatic parallelism.


MP-Whetstone Benchmark - MP-WHETSPC8, MP-WHETSPi64g8

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Performance was generally proportional to the number of cores used. Overall seconds indicates MP efficiency.

The MWIPS performance rating indicated that 64 bit code was 13% faster than that at 32 bits.


      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS
 
                            32 bit                                 
 
 1T  1889.5  538.7  537.6  311.4  56.3  26.1  7450.5  2243.2  659.9
 2T  3782.7 1065.5 1071.2  627.1 112.3  52.0 14525.7  4460.9 1327.3
 4T  7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5  8944.2 2660.8
 8T  8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4

   Overall Seconds   4.99 1T,   5.00 2T,   5.03 4T,  10.06 8T      

                             64 bit                                
 
 1T  2147.8  530.7  530.0  397.8  60.5  27.3  7462.8  2237.7  998.2
 2T  4294.1 1058.4 1059.5  795.8 120.9  54.6 14877.9  4457.8 1994.8
 4T  8558.2 2093.8 2112.2 1590.3 241.8 108.3 29221.8  8909.9 3982.1
 8T  8987.0 2689.8 2721.9 1641.0 254.1 112.0 37422.9 10873.9 4122.3

   Overall Seconds   5.00 1T,   5.00 2T,   5.05 4T,  10.13 8T

                4 Thread 64 bit/32 bit Performance ratios          

       1.13   1.00   0.98   1.27  1.07  1.04    0.99   1.00    1.50
 


MP-Dhrystone Benchmark - MP-DHRYPiC8, MP-DHRYPi64g8

This executes multiple copies of the same program, but with some shared data, leading to unacceptable multithreaded performance. The single thread speeds were similar to the earlier Dhrystone results, with 44% 64 bit performance gains. The other results don’t mean much.


                      MP-Dhrystone Benchmark              

                    Using 1, 2, 4 and 8 Threads            
            
                              32 bit                        

 Threads                        1        2        4        8
 Seconds                     0.79     1.21     2.62     4.88
 Dhrystones per Second   10126308 13262168 12230188 13106002
 VAX MIPS rating             5763     7548     6961     7459

                              64 bit                        

 Seconds                     0.55     1.08     2.15     4.30
 Dhrystones per Second   14531390 14791730 14896723 14872767
 VAX MIPS rating             8271     8419     8478     8465

64 bit / 32 bit              1.44     1.12     1.22     1.13

  


MP SP NEON Linpack Benchmark - linpackNeonMPC8, linpackMPNeonPi64g8

This was produced to show that the original Linpack benchmark was completely unsuitable for benchmarking multiple CPUs or cores, and this is reflected in the results. The program uses NEON intrinsic functions, with increasing data sizes. The unthreaded results are of interest but, using NEON functions, the 64 bit program cannot improve performance much.

 Linpack Single Precision MultiThreaded Benchmark
             Using NEON Intrinsics           

  MFLOPS 0 to 4 Threads, N 100, 500, 1000     

 Threads      None        1        2        4 

                       32 bit                 

 N  100    2007.38   112.55   107.85   106.98 
 N  500    1332.24   686.10   686.11   689.02 
 N 1000     402.61   435.26   432.21   432.01 

                       64 bit                 

 N  100    2167.70    91.82    89.65    89.96 
 N  500    1438.27   644.85   635.89   635.33 
 N 1000     394.99   376.97   383.92   384.19 

                   64 bit / 32 bit            

 N  100       1.08     0.82     0.83     0.84 
 N  500       1.08     0.94     0.93     0.92 
 N 1000       0.98     0.87     0.89     0.89 


MP BusSpeed (read only) Benchmark - MP-BusSpd2PiC8, MP-BusSpd2Pi64g8

Each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points, the latter to avoid misrepresentation of performance using the shared L2 cache. Each set of results shows appropriate performance gains on increasing the number of threads used, but the 64 bit compiler somehow loses its way on decreasing addressing increments after Inc8, leading to the 32 bit version appearing to be three times faster.

Below are example results from a version compiled by gcc 9 for 64 bit Gentoo, showing that the performance issue was probably not caused by the 64 bit hardware or Operating System.

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll       

                         32 bit                          

 12.3 1T   5310   5616   5801   5898   5940  13425       
      2T   9393  10008  11293  11293  11368  24932       
      4T  15781  15015  17606  19034  22279  40736       
      8T   8465   9599  14580  18465  20034  36831       
122.9 1T    664    930   1861   3191   5017  10281       
      2T    564    726   1523   5376   9387  18985       
      4T    486    919   1886   4289   8337  16979       
      8T    487    912   1854   4275   8271  16826       
12288 1T    225    258    514   1010   1992   3975       
      2T    202    421    450   1765   3307   7396       
      4T    261    288    825   1332   1772   5014       
      8T    218    273    496   1041   2571   4021       

                         64 bit                   Rd All 
                                                  64 bit/
                                                  32 bit 
 12.3 1T   5168   5542   5641   4205   4095   4230   0.32
      2T   8968  10728  10161   8110   8058   8368   0.34
      4T   7874  13255  15586  13641  15485  16533   0.41
      8T   8186  13386  15239  13469  14431  16372   0.44
122.9 1T    598    927   1876   2792   3746   4059   0.39
      2T    514    719   1538   4846   7596   8083   0.43
      4T    486    933   2060   4126   8175  13690   0.81
      8T    483    937   2059   4160   8166  13817   0.82
12288 1T    224    257    488    964   1933   3579   0.90
      2T    219    427    889   1832   3493   5371   0.73
      4T    280    353    562    859   2168   3286   0.66
      8T    229    230    527   1075   1880   4480   1.11


 ###################### gcc 9 Version ###################

 MP-BusSpd 64 Bit gcc 9 Fri May 29 09:56:08 2020         

 12.3 4T   7317  13937  15720  18355  20549  33244
122.9 4T    492    937   1883   4009   7820  16423
  


MP RandMem Benchmark - MP-RandMemPiC8, MP-RandMemPi64g8

The benchmark uses the same complex indexing for serial and random access, with separate read only and read/write tests. The performance patterns were as expected, and essentially the same at 32 bits and 64 bits, with no scope for vectorisation. Random access is limited by the impact of burst reading and writing, producing the slow speeds shown. Read only performance increased, as expected, with the thread count, whilst read/write speed remained constant at a particular data size, probably due to write back to shared data space.

  KB       SerRD SerRDWR   RndRD RndRDWR
 
                    32 bit              

 12.3 1T    5950    7903    5945    7896
      2T   11849    7923   11887    7917
      4T   23404    7785   23395    7761
      8T   21903    7669   23104    7655
122.9 1T    5670    7309    2002    1924
      2T   10682    7285    1648    1923
      4T    9944    7266    1813    1927
      8T    9896    7216    1812    1919
12288 1T    3904    1075     179     164
      2T    7317    1055     215     164
      4T    3398    1063     343     165
      8T    4156    1062     350     165

                    64 bit              

 12.3 1T    5945    7898    5948    7895
      2T   11913    7937   11905    7929
      4T   23601    7875   23385    7867
      8T   23139    7777   23016    7770
122.9 1T    5785    7090    2026    1977
      2T   10941    7074    1654    1968
      4T   10364    7052    1854    1970
      8T   10256    7031    1844    1973
12288 1T    3861    1244     180     169
      2T    3793    1242     220     171
      4T    3941    1100     343     170
      8T    4065    1247     351     171


                64 bit / 32 bit         

 12.3 4T    1.01    1.01    1.00    1.01
122.9 4T    1.04    0.97    1.02    1.02
12288 4T    1.16    1.03    1.00    1.03


MP-MFLOPS Benchmarks - MP-MFLOPSPiC8, MP-MFLOPSDPC8, MP-NeonMFLOPSC8,
MP-MFLOPSPi64g8, MP-MFLOPSDPPi64g8, MP-NeonMFLOPSPi64g8

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are those used in the Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word, of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f and so on. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data.

There are three varieties, single precision, double precision and single precision through NEON intrinsic functions, all attempting to show near maximum MP floating point processing speeds. 64 bit operation implemented vector processing, with single precision expected to reach twice the maximum performance of double precision. The best gains over 32 bit working were more than 2.5 times, with four thread performance near 26 GFLOPS single precision and near 13 GFLOPS double precision. This time, the 32 bit NEON version provided performance improvements over the plain single precision version but, at 64 bits, more efficient vector instructions were implemented, operating at up to near 25 GFLOPS.

The numeric results are converted into a simple sumcheck that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences between NEON functions and double or single precision floating point instructions.

                Single Precision Version         

        2 Ops/Word              32 Ops/Word         
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS                                             
                        32 bit                      

 1T     1224    1257     520    2814    2800    2803
 2T     2485    2257     525    5608    5575    5576
 4T     4119    3243     534   11018   10645    8358
 8T     4131    4618     541    9941   10339    8165

                        64 bit                      

 1T     3303    3113     526    6750    6713    6429
 2T     6410    4860     540   13378   13373    9005
 4T    11696    6413     571   25479   25917   10126
 8T    10262   10054     571   23140   23427    8726

Max                                                 
64b/32b 2.83    2.18    1.06    2.31    2.43    1.21

            NEON Intrinsic Functions Version      

                       32 bit                    

 1T     2797    2870     641    4422    4454    4405
 2T     3217    5601     569    8587    8800    8377
 4T     7902    9864     611   17061   17215    9704
 8T     7070   10562     603   15531   16203    9516

                       64 bit                    

 1T     3319    3245     527    6569    6538    6294
 2T     5737    5333     556   12810   12784    9565
 4T     8497   11088     572   24775   24885    9570
 8T     8037   11330     573   22658   21773    9443

Max                                                 
64b/32b 1.08    1.07    0.89    1.45    1.45    0.99

              Double Precision Version            

                       32 bit                    

 1T     1203    1211     315    2675    2719    2674
 2T     2291    2441     293    5406    5421    4907
 4T     4673    2501     309   10313   10393    5256
 8T     4394    3550     265    8782   10110    5197

                       64 bit                    

 1T     1637    1553     273    3356    3351    3220
 2T     3180    3031     278    6664    6676    4531
 4T     5778    3102     283   12522   12675    4791
 8T     3927    4272     286   12304   11351    4875

Max                                                 
64b/32b 1.24    1.20    0.91    1.21    1.22    0.93

                        Sumchecks                   

 SP    76406   97075   99969   66015   95363   99951
 NEON  76406   97075   99969   66014   95363   99951
 DP    76384   97072   99969   66065   95370   99951
OpenMP-MFLOPS Benchmarks below or Go To Start


OpenMP-MFLOPS - OpenMP-MFLOPSC8, notOpenMP-MFLOPSC8, OpenMP-MFLOPS64g8, notOpenMP-MFLOPS64g8

This benchmark carries out the same single precision calculations as the MP-MFLOPS Benchmarks but, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.

The final data values are checked for consistency. Different compilers or different CPUs can use alternative instructions or rounding, with variable accuracy, so OpenMP sumchecks would normally be expected to match those from the NotOpenMP single core runs. However, this is not always the case. This benchmark was a compilation of code used for desktop PCs, with data sizes starting at 100 KB, then 1 MB and 10 MB.

The main purposes of this benchmark are to see whether OpenMP can produce similar maximum performance to MP-MFLOPS and whether this increases in line with the number of cores used. These objectives were met using 32 floating point operations per data word, where the 64 bit tests achieved up to 24 GFLOPS, 21% faster than at 32 bits.



  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

                                OpenMP MFLOPS 32 bit 
 
 Data in & out     100000     2     2500   0.098043     5100    0.929538   Yes       
 Data in & out    1000000     2      250   0.810084      617    0.992550   Yes
 Data in & out   10000000     2       25   0.922891      542    0.999250   Yes

 Data in & out     100000     8     2500   0.144870    13805    0.957126   Yes
 Data in & out    1000000     8      250   0.922568     2168    0.995524   Yes
 Data in & out   10000000     8       25   0.918226     2178    0.999550   Yes

 Data in & out     100000    32     2500   0.401577    19921    0.890282   Yes
 Data in & out    1000000    32      250   0.935064     8556    0.988096   Yes
 Data in & out   10000000    32       25   0.916277     8731    0.998806   Yes

                                 OpenMP MFLOPS 64 bit                           64b/
                                                                                 32b
 Data in & out     100000     2     2500   0.092784     5389    0.929538   Yes  1.06
 Data in & out    1000000     2      250   0.794744      629    0.992550   Yes  1.02
 Data in & out   10000000     2       25   0.784255      638    0.999250   Yes  1.18

 Data in & out     100000     8     2500   0.114583    17455    0.957117   Yes  1.26
 Data in & out    1000000     8      250   0.797846     2507    0.995518   Yes  1.16
 Data in & out   10000000     8       25   0.879850     2273    0.999549   Yes  1.04

 Data in & out     100000    32     2500   0.332392    24068    0.890215   Yes  1.21
 Data in & out    1000000    32      250   0.849420     9418    0.988088   Yes  1.10
 Data in & out   10000000    32       25   0.933336     8571    0.998796   Yes  0.98

                                 notOpenMP MFLOPS 32 bit                                           

 Data in & out     100000     2     2500   0.220277     2270    0.929538   Yes
 Data in & out    1000000     2      250   0.791373      632    0.992550   Yes
 Data in & out   10000000     2       25   0.792594      631    0.999250   Yes

 Data in & out     100000     8     2500   0.362916     5511    0.957126   Yes
 Data in & out    1000000     8      250   0.902125     2217    0.995524   Yes
 Data in & out   10000000     8       25   0.786859     2542    0.999550   Yes

 Data in & out     100000    32     2500   1.497859     5341    0.890282   Yes
 Data in & out    1000000    32      250   1.518747     5267    0.988096   Yes
 Data in & out   10000000    32       25   1.516393     5276    0.998806   Yes

                                 notOpenMP MFLOPS 64 bit                        64b/
                                                                                 32b                      
 Data in & out     100000     2     2500   0.152535     3278    0.929538   Yes  1.44     
 Data in & out    1000000     2      250   0.965797      518    0.992550   Yes  0.82
 Data in & out   10000000     2       25   0.781680      640    0.999250   Yes  1.01

 Data in & out     100000     8     2500   0.356388     5612    0.957117   Yes  1.02
 Data in & out    1000000     8      250   0.925742     2160    0.995518   Yes  0.97
 Data in & out   10000000     8       25   0.840113     2381    0.999549   Yes  0.94

 Data in & out     100000    32     2500   1.176455     6800    0.890215   Yes  1.27
 Data in & out    1000000    32      250   1.227945     6515    0.988088   Yes  1.24
 Data in & out   10000000    32       25   1.225311     6529    0.998796   Yes  1.24

  
OpenMP-MemSpeed Benchmarks below or Go To Start


OpenMP-MemSpeed - OpenMP-MemSpeed2C8, NotOpenMP-MemSpeed2C8
OpenMP-MemSpeed264g8, NotOpenMP-MemSpeed64g8

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with example single core results shown after the detailed measurements. Although the source code appears suitable for speed up by parallelisation, many of the test functions are slower using OpenMP. Detailed comparisons of these results are rather meaningless, but they demonstrate that OpenMP might fail to produce performance gains even on apparently suitable code. There might also be compile options that overcome this problem.

                  Memory Reading Speed Test OpenMP                      

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

                               32 bit                                   

       4    8097   8322   8641   8020   8436   8384  39701  19701  19712
       8    7814   8555   8756   8321   8548   8526  39042  19984  19996
      16    8149   7738   7742   8303   7779   8192  37995  19883  19984
      32    8969   8769   8799   9040   8759   8743  37737  20133  20130
      64    7617   7457   7437   7575   7380   7422  17770  15332  14248
     128   11221  10936  11003  11105  11011  10986  13650  13910  13881
     256   17883  18144  18036  17691  18094  17844  13073  12465  12535
     512   18001  18468  19675  17075  18221  19264  13511  13895  12008
    1024    9532  10590   9772  11842  11282  11277   7173   9473   9496
    2048    7095   7025   6866   7117   7043   6946   2914   3475   3468
    4096    7244   6927   7036   5951   7054   6531   2582   3130   3122
    8192    4578   7173   7025   6322   7078   7182   2504   3127   3115
   16384    5470   7043   7067   7103   7052   7020   2557   3093   3088
   32768    7359   7817   7766   7158   7078   7757   2618   3066   3094
   65536    7810   7268   7266   3824   7478   5164   2486   3016   2931
  131072    2460   2655   7224   7513   7308   7339   2540   2944   2940

 Not OMP                                                                
       8   11775   3895   4342  11787   4325   4354  10334   7806   7816
     256   10032   3699   4223   9978   4289   4185   7105   7612   7621
   65536    2099   2587   3033   2103   3021   3001   2585   1105   1101

                               64 bit                                   

       4    7749   8500   8716   7451   8520   8533  39508  18586  18589
       8    8198   8669   8874   8148   8678   8691  38972  18863  18861
      16    8023   8499   8335   7895   8355   8507  38305  19003  19004
      32    9034   8517   8619   9127   8550   8522  37928  19071  18409
      64    8652   8201   8178   8565   8223   8093  25191  17494  17508
     128   11397  11616  11715  11345  11649  11029  13861  14097  14170
     256   18242  18745  18195  17417  18605  18019  12535  12637  12623
     512   17580  18467  18787  18010  18414  18321  12900  13180  13121
    1024    8043  10172  11540  12510  10220  12082   9800   9586   9857
    2048    4816   6807   6850   6922   6805   6666   3137   3372   3369
    4096    7029   6846   6881   7017   5145   6801   2776   3124   3112
    8192    2428   7085   7124   7068   7134   6904   2571   3092   3112
   16384    7133   7152   7328   7008   3445   7178   2473   3099   3104
   32768    2656   7643   7669   7802   7616   7559   2043   3112   3104
   65536    7995   6523   2572   7059   6514   6485   2431   2955   3036
  131072    1981   7273   7327   1878   3615   7267   2538   2968   2976

 Not OMP                                                                
       8   15532   3990   4394  15567   4386   4394  11629   9315   9314
     256   12318   3871   4219  12134   4206   4219   8092   8231   8229
   65536    2005   2588   2937   2011   2930   2621   2577   2565   2566
  
I/O Benchmarks below or Go To Start


I/O Benchmarks

Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and WiFi network connections. The Raspberry Pi programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB files and, finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of large files and the file path. The same program code is used for both varieties, the only difference being file opening properties. The drive benchmark includes extra options to use direct I/O, avoiding data caching in main memory, but includes an extra test with caching allowed. For further details and downloads see the usual PDF file.


LanSpeed Benchmarks - WiFi - LanSpeed, LanSpeed64g8

Following are Raspberry Pi 32 bit and 64 bit results using what I believe were both 2.4 GHz and 5 GHz WiFi frequencies. Details on setting up the links can be found in this PDF file, LAN/WiFi section. Performance of the two systems was reasonably similar at both frequencies, but speeds can vary widely. Also (perhaps due to my setup), consistent 5 GHz operation was extremely difficult to achieve in both cases.

 *********************. 32 bit 2.4 GHz ********************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    6.35    6.33    6.38    7.05    6.98    7.10
      16    6.70    6.82    6.76    7.19    6.53    7.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     2.691   2.875   3.048    3.13    2.93    2.84

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.34    0.44    1.04    0.37    0.37    1.26
 ms/file   12.14   18.59    15.7    11.1    22.2   12.99   2.153


 ********************** 32 bit 5 GHz *********************

                         MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   11.90   12.96   13.16   10.11    9.55    9.66
      16   11.50   13.93   14.13    9.91    8.88    9.92

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.13    0.46    0.91    0.25    0.55    1.02
 ms/file   30.85   17.83   18.10   16.62   14.93   16.01   3.361

 Random similar to 2.4 GHz


 ********************** 64 bit 2.4 GHz *******************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     5.48     5.14     5.39     6.86     6.61     5.30
  16     5.62     5.64     5.69     5.17     5.02     5.18

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      3.666    4.035    5.131     4.82     4.67     3.90

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.24     0.52     0.95     0.34     0.60     1.14
 ms/file    17.10    15.73    17.20    12.00    13.68    14.35    2.437


 ********************** 64 bit 5 GHz *********************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    11.43    11.70    11.57     8.21     3.64     7.05
  16    10.96     7.30    11.84     8.40     6.24     7.94

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.38     0.73     1.12     0.39     0.73     0.98
 ms/file    10.82    11.15    14.62    10.55    11.23    16.73    2.618

 Random similar to 2.4 GHz
  
LAN Benchmark below or Go To Start


LanSpeed Benchmark - (1G bits per second Ethernet) - LanSpeed, LanSpeed64g8

Measured performance can vary significantly, but both 32 bit and 64 bit tests demonstrated Gigabit performance on the large files. Of particular note (with my program), the 32 bit system indicated that the 2 GB file could not be written, the actual file size ending at 2,147,483,647 bytes (2^31 - 1). On the other hand, at 64 bits, three files of up to 8 GB and 16 GB were successfully written and read (in around 25 minutes).

 ************************ 32 bit ************************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   67.82   12.97   90.19   99.84   93.49   96.83
      16   92.25   92.66   92.96   103.9  105.28   91.17

Random     Read                    Write
From MB        4       8      16       4       8      16
msecs      0.007    0.01    0.04    1.01    0.85    0.91

200 Files  Write                   Read                  Delete
File KB        4       8      16       4       8      16  secs
MB/sec      1.47     2.8    5.14    2.47    4.71    8.61
ms/file     2.78    2.92    3.19    1.66    1.74     1.9   0.256

 Larger Files
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

      32    78.2   34.46   80.71   84.94   87.11   84.97
      64   88.18   87.52   87.03  111.34  109.58  107.28

     128   98.84   99.24   96.58  110.99  110.57   87.43
     256  106.75  105.43   106.4   85.78  108.99  106.29

    1024   96.13   93.34   94.98  114.51  112.16  114.91
    2048   Error writing file  Segmentation fault
           Wrote 2,147,483,647 bytes

 ************************ 64 bit ************************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024    93.63    93.17    96.38   108.02   109.36   109.30
2048    98.41    96.54    99.18   111.26   111.89   111.83

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.003    0.005    0.014     0.81     0.75     1.23

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      1.42     2.82     5.24     2.30     4.56     8.09
 ms/file     2.89     2.90     3.13     1.78     1.80     2.02    0.288


 Much Larger Files

 8192   89.77    89.98    91.86   117.29   117.21   117.17
16384   90.64    89.47    91.10   116.58   117.24   117.13
  
USB Benchmarks below or Go To Start


USB Benchmarks - DriveSpeed, DriveSpeed64v2g8

Following are DriveSpeed results at 32 bits and 64 bits, accessing the same USB 3 drive. Note the difference in performance during the various test procedures (they might not be the same next time). The 32 bit system again failed on attempting to write a 2 GB file (2^31-1 limit).

At 64 bits, a 4 GB file could not be written, a disappointing size limit. This benchmark uses Direct I/O. As I later discovered, running with caching enabled (using the LanSpeed benchmark) can write and read much larger files, including those too large to cache. The example below is for writing and reading three files, each near 6 GB and then 12 GB. The vmstat recordings show that there was no serious memory swapping, with around 7.5 GB of RAM used for caching.

 ********************* 32 bit USB 3 *********************

   DriveSpeed RasPi 1.1 Sat May 30 15:31:20 2020
 
 Selected File Path: /media/pi/PATRIOT1/
 Total MB  120832, Free MB  112565, Used MB    8267

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

 512    73.43    74.88    74.88   217.60   219.98   218.02
1024    63.03    76.64    74.46   220.72   220.60   219.97
 Cached
   8    38.07    41.95    39.95   700.06   693.26   677.20

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.982    0.981    1.001     6.81     6.31     6.31

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.14     2.58     5.23    10.32
 ms/file   120.08   120.06   120.00     1.59     1.57     1.59    2.491

 
 Larger Files           MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

2000    75.14    74.93    74.93   216.19   217.22   216.53
2048 Error writing file Segmentation fault

 ********************* 64 bit USB 3 *********************

   DriveSpeed RasPi 64 Bit gcc 8 Wed May 27 11:43:43 2020
 
 Selected File Path: /media/pi/PATRIOT1/
 Total MB  120832, Free MB  114614, Used MB    6218

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024    27.78    21.39    21.43   270.32   278.81   274.98
2048    21.40    21.14    21.44   275.79   273.14   319.95
 Cached
   8    40.27    42.81    42.81  1206.64  1068.72  1031.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.004    0.184     4.33     4.00     4.04

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.14   261.45    11.19    84.39
 ms/file   119.60   119.05   119.64     0.02     0.73     0.19    2.477

 Larger Files           MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

2048    23.77    19.89    20.64   320.34   272.90   271.96 
4096    Write failure

2000    21.72    22.38    26.57   275.40   273.85   309.57
4000    37.38    36.30    37.67   297.09   299.91   286.94

Caching Benchmark - USB 3 Hard Drive - 3 files up to near 36 GB capacity used  

 6000  169.80   136.20   126.26    90.43   146.13   144.05
12000  146.65   108.83    67.14   108.13   146.84   143.76

 swpd    free   buff  cache    si   so     bi     bo   vmstat memory and I/O activity

  768 7417668 102040  250844    0    0   1299   1329   Start
  768 1970544  94436 5704132    0    0      0 132723   Writing 12000 MB
  768  107908  92712 7568500    0    0 140339      0   Reading 12000 MB

Main Drive Benchmark below or Go To Start


Pi 4B Main Drive Benchmark - DriveSpeed, DriveSpeed64v2g8

The DriveSpeed benchmark failed to execute on the 64 bit system, providing the message “Error writing file Segmentation fault”. It had run previously on the Pi 4B but, again, would only write files smaller than 2 GB, as shown below. This also applied when running LanSpeed on the main drive. Note the faster reading speeds at 1024 MB below; this was because the file was small enough to be cached.

Below are default results from running LanSpeed on the Pi 4 at 64 bits, initially intended to verify that the main drive could be accessed by one of my programs. At first, I could not specify large files, as there was limited free space on the OS drive. After cloning the card to a 32 GB version, 19 GB of free space was indicated. I then ran the program to write three 6000 MB files. This was followed by specifying 16000 MB, where one file was written and the second generated an error after writing around 2500 MB. The good news was that the test did not crash the system.

 ************************ 32 bit ************************
  
 Current Directory Path: /home/pi/Raspberry_Pi_Benchmarks
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   16.41   11.21   12.27   39.81   40.10   40.39
      16   11.79   21.10   34.05   40.18   40.19   40.33
Cached
       8  137.47  156.43  285.59  580.73  598.66  587.97

Random      Read                   Write
From MB        4       8      16       4       8      16
msecs      0.371   0.371   0.363    1.28    1.53    1.30

200 File   Write                    Read                  Delete
File KB        4       8      16       4       8      16   secs
MB/sec      3.49    6.41    8.26    7.67   11.68   17.51
ms/file     1.17    1.28    1.98    0.53    0.70    0.94   0.014

Larger Files
        
                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3


1024    13.38    13.35    13.39    42.68    42.59    42.36
2048    Error writing file Segmentation fault

LanSpeed
 
1024    11.65    13.46    13.48   560.78   574.76   617.67
2048    Error writing file Segmentation fault

 ************************ 64 bit ************************

   LanSpeed RasPi 64 Bit gcc 8 Wed May 27 10:36:54 2020
 
 Current Directory Path: /home/pi/Raspberry-Pi-4-64-Bit-Benchmarks
 Total MB   14637, Free MB    8724, Used MB    5913

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8   265.13   281.30   292.28  1270.88  1286.35  1329.42
  16   246.59   277.53   299.05  1201.20  1327.24  1095.78

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.002    0.002    0.002     7.68     9.01     7.14

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec     56.52    64.92    94.20   303.96   549.54   538.32
 ms/file     0.07     0.13     0.17     0.01     0.01     0.03    0.014

Larger File - 32 GB SD card  

Total MB   29643, Free MB   19776, Used MB    9868         

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

 6000    24.14    18.80    19.39    31.07    45.60    45.76
16000    21.12      Error writing file Segmentation fault

File 1  15.6 GiB (16,777,216,000 bytes)
File 2   2.5 GiB ( 2,645,176,320 bytes) - Not enough free space

Java Whetstone Benchmark below or Go To Start


Java Whetstone Benchmark - whetstc.class

The Java benchmarks comprise class files that were produced some time ago, but source code is available to regenerate them. Performance can vary significantly using different Java Virtual Machines, so comparisons might not be appropriate.

The results below suggest that overall 32 bit performance, in MWIPS, was faster than at 64 bits, due to the most time consuming functions (N5 and N6, marked x) taking less time. Note that some speeds are effectively the same as those found running the C compiled version above.

 ************************* 32 bit *************************

  Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900 x
  N6 floating point   0.999999821    345.95             1.5592 x
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


 ************************* 64 bit *************************

    Whetstone Benchmark Java Version, May 22 2020, 14:24:09

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    520.61             0.0369
  N2 floating point  -1.131330490    481.38             0.2792
  N3 if then else     1.000000000             236.41    0.4378
  N4 fixed point     12.000000000            1320.20    0.2386
  N5 sin,cos etc.     0.499110132              47.96    1.7348 x
  N6 floating point   0.999999821    276.33             1.9520 x
  N7 assignments      3.000000000             320.17    0.5772
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1487.99             6.7205

  Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
  Java Vendor         Debian, Version  11.0.7
  

JavaDraw Benchmark below or Go To Start


JavaDraw Benchmark - JavaDrawPi.class

The benchmark draws simple objects, in numbers ranging from small to rather excessive, to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load. For this to run at maximum speed, it was necessary to disable the experimental GL driver.

In this case, performance at 32 bits and 64 bits was quite similar.

 ************************* 32 bit *************************

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


 ************************* 64 bit *************************

   Java Drawing Benchmark, May 22 2020, 14:25:15
            Produced by javac 1.8.0_222

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      833    83.26
  Display PNG Bitmap Twice Pass 2     1001   100.05
  Plus 2 SweepGradient Circles         994    99.39
  Plus 200 Random Small Circles        836    83.54
  Plus 320 Long Lines                  380    37.98
  Plus 4000 Random Small Circles        95     9.44

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
  Java Vendor         Debian, Version  11.0.7
  

OpenGL GLUT Benchmark below or Go To Start


OpenGL GLUT Benchmark - videogl32, videogl64

In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress test of any duration.

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel containing various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file, where the first command is needed to avoid VSYNC, allowing FPS to exceed 60.

  export vblank_mode=0                                     
  ./videogl32 Width 320, Height 240, NoEnd                 
  ./videogl32 Width 640, Height 480, NoHeading, NoEnd      
  ./videogl32 Width 1024, Height 768, NoHeading, NoEnd     
  ./videogl32 Width 1920, Height 1080, NoHeading           
  

The benchmark could not be recompiled at 64 bits, as certain freeglut functions were not readily available, so an earlier version was used. In this case, at the higher pixel settings, the 64 bit version appeared to be slower on the graphics speed dependent tests, but faster elsewhere.

As indicated below, the dual monitor connections enabled this option to be tested at 64 bits.

 ************************ 32 bit ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May  2 19:01:05 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0

 ************************ 64 bit ************************

 GLUT OpenGL Benchmark 64 Bit gcc 9, Fri May 22 13:50:00 2020

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    753.4    414.5    258.3    152.0     42.7     30.0
   320   240    644.5    385.9    243.9    145.6     41.5     29.1
   640   480    320.6    270.6    217.9    136.8     43.0     29.4
  1024   768    140.6    135.1    122.6    114.1     41.8     28.5
  1920  1080     57.7     56.4     55.7     52.4     40.5     26.7

 ****************** 64 bit Dual Monitor ******************

  3840  1080     26.9     26.7     27.0     26.0     27.5     21.6
  

Usable RAM below or Go To Start


Usable RAM - MALLOC

On running various benchmarks, it became clear that there were restrictions on how much RAM could be used by my C based benchmarks. A simple program was written that allocated a specified amount of memory using malloc, filled it with data, freed the space, then repeated the sequence incrementally until an allocation failure was indicated. Both 32 bit and 64 bit versions were produced and each was run on 4 GB and 8 GB systems. Except at 64 bit 8 GB, all were restricted to less than 4,000,000,000 bytes. For the 64 bit 8 GB case, vmstat memory utilisation details are provided, showing the low points and samples between, identifying that memory space had been freed.

 ############################### 32 Bit OS ###############################
 
                                 4 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3050000000 (2.84 GB)
 
                                 8 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3060000000 (2.85 GB)

 ############################### 64 Bit OS ###############################
 
                                 4 GB RAM 
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  Memory allocation failed - Exit Later OK to 3700000000 (3.45 GB)
 
                                 8 GB RAM
 Bytes 1000000000   250000000 words allocated   250000000 written finished 
 Bytes 2000000000   500000000 words allocated   500000000 written finished 
 Bytes 3000000000   750000000 words allocated   750000000 written finished 
 Bytes 4000000000  1000000000 words allocated  1000000000 written finished 
 Bytes 5000000000  1250000000 words allocated  1250000000 written finished 
 Bytes 6000000000  1500000000 words allocated  1500000000 written finished 
 Bytes 7000000000  1750000000 words allocated  1750000000 written finished 
 Bytes 8000000000  Memory allocation failed - Exit Later OK to 7900000000 (7.36 GB)

pass swpd    free   buff  cache    pass swpd    free   buff  cache

        0 7412260  85908 274472            0 7234852  85908 278140
 1      0 6615688  85908 277608     5      0 2600856  85908 277096
        0 7385388  85908 277264            0 7184736  85908 277612
 2      0 5671192  85908 277612     6      0 1571436  85908 277096
        0 7210328  85908 277264            0 7257464  85908 277096
 3      0 4526104  85908 277096     7      0 624436   86228 281456
        0 7324312  85908 277096            0 7402400  86228 283200
 4      0 3665272  85908 277264                                   

Usable RAM - Specified Dimensions

Where dimensions were specified in the programs, rather than using malloc, some differences were apparent. Using the 32 bit system, a compile error was indicated when the dimensions required 2 GB (2^31 bytes), with 1 byte less being accepted. As shown below, at 64 bits, more than 2 GB was allowed on the 4 GB system. Then, at both 4 GB and 8 GB, array sizes close to the full RAM capacities could be used.

 ######################## 32 Bit OS 4 GB and 8 GB ########################
  int    array[536870912]; size of array 'array' is too large 2 GB
  int    array[536870911]; compiles
  float  array[536870912]; size of array 'array' is too large 2 GB
  float  array[536870911]; compiles
  double array[268435456]; size of array 'array' is too large 2 GB
  double array[268435455]; compiles

 ############################# 64 Bit OS 4 GB ############################
 int    array[920000000];  OK 3.43 GB
 int    array[1073741824]; Segmentation fault 4 GB
 float  array[920000000];  OK 3.43 GB
 float  array[1073741824]; Segmentation fault 4 GB
 double array[460000000];  OK 3.43 GB
 double array[536870912];  Segmentation fault 4 GB

 ############################# 64 Bit OS 8 GB ############################
 int    array[1950000000]; OK 7.9 GB paging
 int    array[2147483648]; Segmentation fault 8 GB
 float  array[1950000000]; OK 7.9 GB
 float  array[2147483648]; Segmentation fault 8 GB 
 double array[975000000];  OK 7.9 GB
 double array[1073741824]; Segmentation fault 8 GB

High Performance Linpack Benchmark below or Go To Start


High Performance Linpack Benchmark - xhpl

I ported my ATLAS version of HPL, which I have run on earlier Raspberry Pi systems, to both the 64 bit and 32 bit SD cards. See my report at ResearchGate, Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. That report showed that the amount of memory used followed the same proportions as the original Linpack benchmark, somewhat greater than N x N x 8 bytes for double precision operation on an N x N dimensioned array. For RAM residence, N=20000 requires 4 GB and N=30000 needs 8 GB.

Following are results from tests run without and with a cooling fan in place. The first were for the original Pi 4 with 4 GB RAM, carried out in June 2019. The others, with 8 GB, were run in 2020 via recent 32 bit and 64 bit Operating System versions. With the fan in place, clock speeds were effectively constant at 1500 MHz on all three test rigs, with the same MFLOPS performance at each problem size. Then, the 4 GB system appeared to be running at a higher temperature, but not high enough to introduce CPU MHz throttling.

With no fan in use, throttling occurred on all systems, at N=16000. From then on, the 4 GB system suffered from more of this than the 8 GB models, reflected in higher temperatures and slower performance. The difference is thought to be due to the improvements that have been made in thermal management.

These tests show that the HPL benchmark is an excellent stress testing application, demonstrated here using most of the available RAM and running at high performance levels. The double precision speed approached the 12.6 GFLOPS achieved by one of my benchmarks. The 64 bit build does not appear to benefit from using advanced vector operations, but I could not identify whether other compiling parameters could be included.

                 No Fan                           Fan

 RAM at bits    N  GFLOPS Seconds  Max °C  Min MHz  GFLOPS Seconds  Max °C Min MHz

 4 GB 32b    8000     8.6      40     81      1500     9.3      37     61     1500
 8 GB 32b    8000     9.7      35     58      1500     9.6      35     57     1500
 8 GB 64b    8000     8.8      39     76      1500     8.7      39     55     1500

 4 GB 32b   16000     6.8     404     86   750/600    10.4     263     70     1500
 8 GB 32b   16000     8.6     319     83      1000    10.4     263     63     1500
 8 GB 64b   16000     8.1     338     84      1000    10.0     273     61     1500

 4 GB 32b   20000     6.2     856     87   750/600    10.8     494     71     1500
 8 GB 32b   20000     8.8     604     85      1000    10.7     497     63     1500
 8 GB 64b   20000     8.5     625     85  1000/600    10.3     519     63     1500

 4 GB 32b   30000     N/A                              N/A
 8 GB 32b   30000     8.2    2195     85  1000/600    11.3    1590     64     1500
 8 GB 64b   30000     7.6    2370     86  1000/600    11.4    1584     63     1500
  
Below are vmstat details, showing that most of the RAM was in use and four cores were running at 100% utilisation. Then there are examples of environmental differences between older 32 bit and later 64 bit operation, particularly MHz throttling variations, core voltage and pmic temperature differences.

 8 GB 64b 30000 
procs  -----------memory---------- ---swap--  -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

 0  0      0 7422216  83712 264952    0    0   213     4  211  345   2  2 96  1  0
 4  0      0 5366940  83720 269572    0    0   144     2 1130  483  82  3 15  0  0
 4  0      0 2974924  83728 271960    0    0     0     3 1287  585  97  3  0  0  0
 4  0      0  637296  83960 275704    0    0     0    48 1859 2130  96  4  0  0  0
 4  0   3072  246724  43176 207604    1   83   141    95 1663 1402  97  3  0  0  0
 4  0   3584  243388  32412 191932    3   17    11    23 1110  131 100  0  0  0  0
 6  0   3584  247168  32420 187520    0    0     0     2 1085   59 100  0  0  0  0
 Later
 5  0   3584  238580  34324 193432    0    0     4     2 1196  361  99  1  0  0  0
 5  0   7936  238124  26356 193392    0  140   386   193 1993 2064  97  3  0  0  0
 4  0   7936  247408  27264 194160    1    0    70    11 1889 1888  98  2  0  0  0

 4 GB 32b 20000 No Fan
  485.3   ARM MHz=1000, core volt=0.8771V, CPU temp=84.0'C, pmic temp=74.1'C
  506.6   ARM MHz= 750, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  528.0   ARM MHz= 750, core volt=0.8771V, CPU temp=86.0'C, pmic temp=74.1'C
  549.2   ARM MHz= 600, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  570.6   ARM MHz=1000, core volt=0.8771V, CPU temp=85.0'C, pmic temp=74.1'C
  591.9   ARM MHz= 750, core volt=0.8771V, CPU temp=84.0'C, pmic temp=74.1'C

 8 GB 64b 30000 No Fan 
 1546.8   ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
 1577.8   ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1608.8   ARM MHz=1000, core volt=0.8600V, CPU temp=86.0'C, pmic temp=70.3'C
 1639.9   ARM MHz=1000, core volt=0.8350V, CPU temp=85.0'C, pmic temp=70.3'C
 1670.8   ARM MHz=1000, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1701.8   ARM MHz= 600, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
 1732.8   ARM MHz=1000, core volt=0.8600V, CPU temp=85.0'C, pmic temp=70.3'C
  

Floating Point Stress Tests below or Go To Start


Floating Point Stress Tests - MP-FPUStress, MP-FPUStressDP, MP-FPUStress64g8, MP-FPUStress64DPg8

These stress tests have a benchmarking mode that provides choices for a long running test. They cover number of threads, floating point operations carried out on each data word, and memory size to cover caches and RAM. Numeric sumchecks are carried out, where the same number of calculations apply at different thread counts in each section. Below are results for both 64 bit and 32 bit compilations, where sumchecks were identical. Performance at 64 bits can be seen to be faster than at 32 bits, with the best case nearly twice as fast.

Next, below, are results from 10 minute stress tests, showing measured GFLOPS and CPU temperatures for fanless operation. CPU MHz variations were between 1500/1000/750 at 32 bits and 1500/1000 for all 64 bit tests, again indicating improved thermal management.

                    64 Bits MFLOPS       Numeric Results      32 Bits MFLOPS
             Ops/   KB    KB    MB      KB     KB     MB      KB    KB    MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8    12.8   128  12.8

  Single Precision
   0.9    T1   2  3845  4032  1232   40394  76395  99700    2134  2607   656
   1.6    T2   2  7947  7992  1083   40394  76395  99700    5048  5156   621
   2.3    T4   2 14295 14760  1145   40394  76395  99700    7536  9939   681
   3.0    T8   2 13427 14985  1166   40394  76395  99700    7934  9839   639
   4.9    T1   8  4665  4740  3200   54764  85092  99820    5535  5420  2569
   6.0    T2   8  9334  9453  4143   54764  85092  99820   10757 10732  2454
   6.9    T4   8 17902 18462  4693   54764  85092  99820   18108 20703  2444
   7.7    T8   8 17473 18460  4570   54764  85092  99820   19236 20286  2245
  13.0    T1  32  5827  5869  5861   35206  66015  99520    5309  5270  5262
  15.6    T2  32 11712 11729 11524   35206  66015  99520   10551 10528  9753
  17.2    T4  32 23149 22887 16343   35206  66015  99520   20120 20886 11064
  18.7    T8  32 22202 23048 16411   35206  66015  99520   19415 20464  9929

  Double Precision
   1.8    T1   2  1802  1878   587   40395  76384  99700     921   998   326
   3.4    T2   2  3716  3741   527   40395  76384  99700    1968  1995   308
   4.8    T4   2  6814  7335   547   40395  76384  99700    3465  3925   342
   6.1    T8   2  6633  7011   588   40395  76384  99700    3646  3702   301
   9.2    T1   8  2738  2796  2014   54805  85108  99820    2377  2446  1283
  11.4    T2   8  5598  5582  2114   54805  85108  99820    4916  4860  1326
  13.0    T4   8 10545 11132  2196   54805  85108  99820    9202  9510  1391
  14.7    T8   8 10693 10849  2149   54805  85108  99820    9090  9006  1298
  24.1    T1  32  3280  3296  3279   35159  66065  99521    2695  2725  2707
  28.8    T2  32  6583  6588  6430   35159  66065  99521    5416  5441  5121
  31.6    T4  32 12785 13162  8477   35159  66065  99521   10666 10831  5275
  34.4    T8  32 12718 12781  8816   35159  66065  99521   10427 10602  4832

Stress Tests Original 32 Bits  ------------------ 64 Bits ------------------
             
               8 Ops/word      8 Ops/word      32 Ops/Word     32 Ops/Word DP
    Seconds    °C  GFLOPS      °C  GFLOPS       °C  GFLOPS      °C  GFLOPS

        0      61              59               58              58
       20      76    19.2      65    18.4       71    22.9      73    12.9
       40      81    19.0      74    18.2       74    23.1      77    12.9
       60      82    17.8      76    18.4       76    22.9      78    12.9
       80      83    15.5      78    18.1       78    23.0      80    13.0
      100      84    15.0      78    18.1       79    23.0      83    12.4
      120      83    14.0      82    18.2       81    23.0      82    11.7
      140      84    13.3      82    17.6       82    22.5      82    11.2
      160      84    13.3      81    16.8       82    21.6      82    10.9
      180      86    12.9      82    16.3       82    21.0      83    10.9
      200      85    13.0      82    16.2       82    20.7      83    10.5
      220      84    12.8      82    15.8       82    20.4      82    10.2
      240      84    12.6      83    15.6       82    20.1      83    10.2
      260      83    12.6      83    15.9       83    19.9      82    10.2
      280      85    12.2      83    15.3       82    19.9      83    10.0
      300      84    12.1      83    15.4       81    19.6      83     9.9
      320      85    12.0      83    15.5       82    19.5      82     9.7
      340      84    11.6      82    15.2       82    19.5      82     9.9
      360      85    11.6      83    14.7       83    19.3      83     9.8
      380      85    11.3      82    14.7       82    19.2      83     9.6
      400      85    11.6      83    14.8       82    19.0      83     9.6
      420      84    11.6      83    14.9       82    18.9      82     9.5
      440      85    11.5      82    14.6       83    18.8      82     9.6
      460      84    11.5      83    14.9       83    18.7      82     9.5
      480      85    11.5      83    14.6       82    18.8      83     9.5
      500      84    11.1      83    14.7       83    18.8      83     9.5
      520      85    11.3      82    14.6       82    18.6      83     9.4
      540      84    11.4      83    14.7       82    18.7      83     9.4
      560      84    11.3      83    14.6       82    18.7      83     9.6
      580      85    11.3      83    14.6       83    18.4      83     9.6
      600      85    11.3      83    14.5       83    18.5      83     9.7

 Average     83.9    12.9    81.2    15.9     81.1    20.2    81.9    10.5
 Min/max             0.58            0.78             0.80            0.72

Integer Stress Tests below or Go To Start


Integer Stress Tests - MP-IntStress, MP-IntStress64g8, MP-IntStress64

This program has variables for number of threads, memory required and running time. The test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Performance is measured in MBytes per second. Full results show the varying hexadecimal data patterns used and verification by comparison, not shown in the summary benchmarking mode details logged below. Here, it can be seen that 64 bit performance was much slower using the latest gcc 8 compilation. Earlier 64 bit results confirm that the poor performance was due to a compiling issue.

Following the benchmark results are details from two stress tests, running without an operational fan. The first represents one user demanding 7600000 KB (7.25 GB) of memory space. Performance throughout was effectively the same as the memory speed indicated by the benchmark (1 thread, 16 MB), CPU MHz being constant, with little change in temperatures. As shown by the vmstat details, some data was swapped out to make room for that of the application.

The second stress test involved 8 threads and cache based data, initially running at maximum CPU speed (for this code). This time, there was CPU clock throttling down to 1000 MHz, with CPU temperature rises up to 84°C and a 31% decrease in measured MBytes per second.

                              Benchmark MBytes/second

         ------ 32 Bits ------    ------ 64 Bits ------    --- 2019 64 Bits ---
            KB      KB      MB      KB       KB      MB      KB      KB      MB
 Threads    16     160      16      16      160      16      16     160      16

     1    5956    5754    3977    2878     2936    2602    5928    6786    3903
     2   11861   11429    3763    5855     5817    3641   14468   13292    3772
     4   22998   21799    3464   11403    11416    3564   27146   25103    3425
     8   22695   21128    3490   10853    11297    3557   27576   24844    3432
    16   22835   23491    3485   11069    11612    3548   27365   28511    3434
    32   22593   23485    3591   10790    11646    3758   26377   28527    3455

 Stress Test Start

                                          Data    Same All
  Seconds       Size  Threads  MB/sec  Sumcheck   Threads

     20.0 7600000 KB        1    2606  00000000     Yes
     57.8 7600000 KB        1    2604  FFFFFFFF     Yes
     91.0 7600000 KB        1    2575  5A5A5A5A     Yes
    129.5 7600000 KB        1    2608  AAAAAAAA     Yes

 vmstat 10 second samples
 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b   swpd    free   buff  cache  si   so   bi   bo   in   cs  us sy id wa st
  0  0      0 7433336  83140 266140   0    0  222    2  177  273   1  1 97  1  0
  1  0      0 5535964  83152 268248   0    0    2    7  501  707   2  7 92  0  0
  1  8  69888   63404   1048 106744  16 6943  515 6951  664  506   3 18 54 25  0
  1  0  67072   63916   4548 123920   3    0    3    8  468  260  26  1 74  0  0
 Later to end
  1  0  95336   62748   4868 135672   4    0    4    6  475  274  26  1 73  0  0

        -------- 7600000 KB 1 Thread --------   --------- 1280 KB 8 Threads --------
  Secs  MB/sec   MHz   Volts  °C CPU  °C PMIC   MB/sec   MHz   Volts  °C CPU °C PMIC

     0           1500  0.8500    59     55.2             1500  0.8600    57    54.3
    20    2606   1500  0.8500    61     55.2     10902   1500  0.8600    70    55.2
    40    2599   1500  0.8500    63     55.2     10267   1500  0.8600    73    58.0
    60    2604   1500  0.8500    63     56.2     10150   1500  0.8600    75    59.0
    80    2575   1500  0.8500    65     57.1     11046   1500  0.8600    79    61.8
   100    2566   1500  0.8500    65     57.1     11039   1500  0.8600    80    62.8
   120    2605   1500  0.8500    66     58.0     10503   1000  0.8600    81    64.6
   140    2608   1500  0.8500    66     58.0      8780   1500  0.8600    82    65.6
   160    2583   1500  0.8500    67     59.0      8501   1500  0.8600    82    66.5
   180    2605   1500  0.8500    66     59.0      8704   1500  0.8600    83    66.5
   200    2604   1500  0.8500    66     59.0      8507   1500  0.8600    83    66.5
   220    2608   1500  0.8500    67     59.0      8829   1000  0.8600    83    67.5
   240    2608   1500  0.8500    68     59.0      8749   1000  0.8600    82    67.5
   260    2605   1500  0.8500    68     59.0      8542   1500  0.8600    83    68.4
   280    2573   1500  0.8500    67     59.0      8500   1000  0.8600    82    67.5
   300    2601   1500  0.8500    68     59.0      8434   1000  0.8600    83    68.4
   320    2607   1500  0.8500    68     59.0      8360   1500  0.8600    83    68.4
   340    2605   1500  0.8500    68     59.0      8302   1000  0.8600    83    68.4
   360    2575   1500  0.8500    67     59.0      8179   1000  0.8600    82    68.4
   380    2608   1500  0.8500    68     59.0      8102   1000  0.8600    84    68.4
   400    2584   1500  0.8500    68     59.0      8215   1500  0.8600    84    68.4
   420    2575   1500  0.8500    68     59.0      8070   1000  0.8600    82    69.4
   440    2574   1500  0.8500    66     59.0      8042   1500  0.8600    82    69.4
   460    2608   1500  0.8500    67     59.0      7945   1500  0.8600    82    69.4
   480    2581   1500  0.8500    68     59.0      8100   1000  0.8600    84    69.4
   500    2583   1500  0.8500    67     59.0      8024   1000  0.8600    84    69.4
   520    2609   1500  0.8500    69     59.0      7933   1000  0.8600    82    69.4
   540    2602   1500  0.8500    67     60.9      7813   1000  0.8600    84    69.4
   560    2606   1500  0.8500    68     59.0      7988   1500  0.8600    83    69.4
   580    2606   1500  0.8500    69     60.9      7882   1000  0.8600    83    69.4
   600    2704   1500  0.8500    69     60.9      7597   1500  0.8600    83    69.4

64 GB SD Card below or Go To Start


64 GB SD Card - DriveSpeed64v2g8, LanSpeed64g8, DriveSpeed264WRg8, DriveSpeed264Rd2g8

My initial 64 bit Raspberry Pi OS was installed on a 16 GB SD card, later cloned (by Windows Win32DiskImager) to one with 32 GB capacity. It soon became apparent that this was too small to handle extra large files on the main drive. So I bought a 64 GB higher speed version, which, surprisingly, resized free space after booting. I then ran some tests to see how much of this could be used.

USB Drive - The first exercise was to compare performance of 64 GB and 32 GB SanDisk cards, using a USB 3 card reader, via DriveSpeed Direct I/O. The former has maximum MB/second ratings of 160 read and 60 write, whereas the latter is rated only for reading, at 98. For the large file tests, handling near 6 GB (3 x 2000 MB), reading speeds were similar, with the 64 GB card being much faster on writing. Random access and small file performance were also similar.

Main Drive - Up to nearly 24 GB (3 x 8 GB) of file space was used running LanSpeed, the same program as DriveSpeed, writing and reading using a 1 MB data array in RAM, with caching allowed, but caching was negated on handling such large files. Data from the random access and small size file tests was cached and can be ignored. Output from vmstat, with 10 second sampling, indicates that most of the memory was used then released, repeating the activity for the second three files. As observed in other tests, it seems that writing of cached data is deferred, overlapped with reading.

Huge File - Finally, an example of results from separate write/read and read only benchmarks, with caching enabled, is provided below. These just deal with large files, where up to three can be selected. In this case, one file of near 40 GB was written. The read only test loads the data into an array in RAM, where the maximum size appears to be around 6 GB. When dealing with smaller files, the system should be rebooted before reading, so that the data is no longer cached.

############################ USB 3 ############################

64 GB Total MB   59639, Free MB   48318, Used MB   11321
32 GB Total MB   29643, Free MB   19707, Used MB    9936

                            MBytes/Second
         MB   Write1   Write2   Write3    Read1    Read2    Read3

64 GB  2000    58.77    59.24    59.10    68.68    69.18    68.84
32 GB  2000    21.23    21.14    21.16    70.22    70.27    70.33

########################## Main Drive #########################

                        MBytes/Second
         MB   Write1   Write2   Write3    Read1    Read2    Read3

64 GB  4000    54.53    38.01    38.91    32.91    45.90    45.88
64 GB  8000    43.16    36.73    36.63    38.34    45.90    45.91

     -----------memory---------- ---swap-- -----io----
    swpd    free    buff   cache   si   so    bi    bo
Start/Write
       0 6430000 1024660  317064    0    0   270     3
       0 4232720 1024696 2511212    0    0     0 27790
       0 3138388 1024744 3605256    0    0     0 37690
Write/Read
     512  258740  427420 7089616    0    0  8336 30214
     512  67632   400000 7309488    0    0 24475 14101
     512  61368   340176 7376464    0    0 44800     0
Delete/Read/Write
     512   56868  121324 7600856    0    0 44817     0
     512 5605880  115092 2057148    0    0 18298 17233
     512 4472096  115140 3191272    0    0     0 36872
Write/Read
     512  267968   17524 7492716    0    0     3 33253
     512   75996   17596 7684276    0    0  8107 31443
     512   63056   17652 7698440    0    0 44817     0
End  512 7521128   18700  238356    0    0 37260     0

#################### Main Drive Near 40 GB ####################

 Before  Total MB   59639, Free MB   48324, Used MB   11315
 After   Total MB   59639, Free MB    8325, Used MB   51314

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

40000    36.65                      45.89
Read only
 6000     N/A      N/A      N/A     45.74

     -----------memory---------- ---swap-- -----io----
    swpd    free    buff   cache   si   so    bi    bo

Example write
     256  270432   33192 7473084    0    0     1 36069
Example read
     256   62384   31332 7681720    0    0 44809     0
Example read only after reboot
     256  272032   25052 3041320    0    0 44812     0 

System Stress Tests below or Go To Start


System Stress Tests

These stress tests were run twice, once with a cooling fan in use and then with the fan disabled. The following script file was run to open six terminals to execute my CPU MHz, Voltage and Temperature Measurement program and vmstat system monitor, whilst running my Livermore Loops, MP Integer RAM Exerciser, BurninDrive and OpenGL benchmarks, in stress testing mode, with nominal running time of 15 minutes.

On running these, as indicated by the environmental monitor, the system ran at much higher temperatures with no fan in use, but with no indication of CPU MHz throttling in the periodic instantaneous measurement samples. Vmstat recordings were virtually the same with and without cooling, starting with MP-IntStress64g8 grabbing near 6 GB of RAM, with continuing CPU utilisation of around 82% (three cores at 100%, one at 28%) and, after a short write phase, the main drive being read at 30 MB/second.

A variation of the Livermore Loops Benchmark has options to change the running time of each of the 72 program floating point kernels, to control running time for stress testing purposes, where results are also checked for correctness, and log numbers assigned to enable multiple copies to be run.

######################## Script File ########################

lxterminal -e ./RPiHeatMHzVolts2 Passes 16 Seconds 60 Log 31 &
lxterminal -e ./liverloopsPi64Rg8 Seconds 12 Log 31 &
lxterminal -e  ./MP-IntStress64g8 Threads 1 KB 6000000 Mins 15 Log 31 &
lxterminal -e ./burnindrive264g8 Repeats 16, Minutes 12, Log 31, Seconds 1  &
export vblank_mode=0  &
lxterminal -e ./videogl64g9 Test 6 Mins 15 Log 31 &
vmstat 60 16 > vmstat31.txt

############## With Cooling #############  ############### No Cooling ##############

================== CPU MHz CPU Voltage and Temperature Measurement =================

Secs  Start at Wed Jun 10 12:56:49 2020    Secs  Start at Wed Jun 10 13:19:58 2020
  0 ARM MHz=1500 0.85V CPU=39'C pmic=34'C    0 ARM MHz=1500 0.85V CPU=40'C pmic=35'C
 60 ARM MHz=1500 0.85V CPU=47'C pmic=39'C   60 ARM MHz=1500 0.85V CPU=58'C pmic=46'C
120 ARM MHz=1500 0.85V CPU=50'C pmic=41'C  120 ARM MHz=1500 0.85V CPU=65'C pmic=53'C
180 ARM MHz=1500 0.85V CPU=50'C pmic=42'C  180 ARM MHz=1500 0.85V CPU=68'C pmic=55'C
241 ARM MHz=1500 0.85V CPU=49'C pmic=41'C  241 ARM MHz=1500 0.85V CPU=71'C pmic=59'C
301 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  301 ARM MHz=1500 0.85V CPU=74'C pmic=60'C
362 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  362 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
422 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  422 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
483 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  482 ARM MHz=1500 0.85V CPU=76'C pmic=62'C
543 ARM MHz=1500 0.85V CPU=51'C pmic=41'C  543 ARM MHz=1500 0.85V CPU=77'C pmic=64'C
604 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  603 ARM MHz=1500 0.85V CPU=78'C pmic=65'C
664 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  664 ARM MHz=1500 0.85V CPU=81'C pmic=66'C
725 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  724 ARM MHz=1500 0.85V CPU=80'C pmic=67'C
785 ARM MHz=1500 0.85V CPU=52'C pmic=42'C  785 ARM MHz=1500 0.85V CPU=81'C pmic=67'C
846 ARM MHz=1500 0.85V CPU=51'C pmic=42'C  845 ARM MHz=1500 0.85V CPU=76'C pmic=66'C
906 ARM MHz=1500 0.85V CPU=46'C pmic=42'C  905 ARM MHz=1500 0.85V CPU=73'C pmic=65'C
966 ARM MHz=1500 0.85V CPU=40'C pmic=37'C  966 ARM MHz=1500 0.85V CPU=65'C pmic=60'C
End at   Wed Jun 10 13:12:56 2020          End at   Wed Jun 10 13:36:04 2020

============================== vmstat 60 second samples =============================

  Memory MB        Swap MB/sec %utilise      Memory MB        Swap MB/sec %utilise
swpd free buf cach si so bi bo us sy id wa swpd free buf cach si so bi bo us sy id wa

   0 7231  45  486  0  0  1  0 14  2 81  3    0 7231  45  486  0  0  1  0 14  2 81  3
   0 1147  45  533  0  0 11 11 71 11  1 17    0 1147  45  533  0  0 11 11 71 11  1 17
   0 1145  45  535  0  0 29  0 76  8  1 16    0 1145  45  535  0  0 29  0 76  8  1 16
   0 1142  45  538  0  0 30  0 75  8  1 17    0 1142  45  538  0  0 30  0 75  8  1 17
   0 1142  45  536  0  0 30  0 75  7  1 17    0 1142  45  536  0  0 30  0 75  7  1 17
   0 1143  45  536  0  0 30  0 75  7  1 17    0 1143  45  536  0  0 30  0 75  7  1 17
   0 1141  45  539  0  0 30  0 75  7  1 17    0 1141  45  539  0  0 30  0 75  7  1 17
   0 1141  45  538  0  0 30  0 75  8  1 16    0 1141  45  538  0  0 30  0 75  8  1 16
   0 1138  45  541  0  0 30  0 75  7  1 17    0 1138  45  541  0  0 30  0 75  7  1 17
   0 1141  45  536  0  0 30  0 76  7  0 17    0 1141  45  536  0  0 30  0 76  7  0 17
   0 1139  45  540  0  0 30  0 75  7  1 16    0 1139  45  540  0  0 30  0 75  7  1 16
   0 1140  46  539  0  0 30  0 74  7  2 17    0 1140  46  539  0  0 30  0 74  7  2 17
   0 1143  46  536  0  0 30  0 75  7  2 17    0 1143  46  536  0  0 30  0 75  7  2 17
   0 1139  46  537  0  0 30  0 75  7  1 16    0 1139  46  537  0  0 30  0 75  7  1 16
   0 1143  46  537  0  0 31  0 61  7 13 18    0 1143  46  537  0  0 31  0 61  7 13 18
   0 1142  46  537  0  0 31  0 52  7 21 20    0 1142  46  537  0  0 31  0 52  7 21 20

======= Livermore Loops 64 Bit Reliability test 12 seconds each loop x 24 x 3 =======

Wed Jun 10 12:56:49 2020                   Wed Jun 10 13:19:58 2020
Numeric results were as expected           Numeric results were as expected
MFLOPS for 24 loops                        MFLOPS for 24 loops
2061.5  944.0  950.8  946.9  362.4  646.6  1498.8  991.4  920.0  733.5  370.3  561.1
2073.5 2695.3 1403.8  547.2  493.9  959.9  2202.2 2453.3 1991.9  711.4  473.4  676.4
 206.5  362.3  794.9  634.4  721.9 1143.2   178.3  349.0  766.6  601.3  641.1 1007.9
 411.8  367.7 1469.5  389.4  739.6  306.1   435.3  376.9 1530.5  365.2  801.5  309.5
Maximum Average Geomean Harmean Minimum     Maximum Average Geomean Harmean Minimum
 2698.1   912.3   737.2   602.3   187.7      2654.4   924.2   742.1   597.9   158.9
End of test Wed Jun 10 13:11:53 2020       End of test Wed Jun 10 13:33:21 2020 

Other Stress Testing Programs used are below or Go To Start


Other Stress Testing Programs - run with the above

MP Integer RAM Exerciser and OpenGL Benchmark - Both report results as the tests progress, and performance for the two is provided together below. The former was testing nearly 6 GB of RAM and the latter was running the OpenGL kitchen display test at 1920 x 1080 pixels. Performance varied over the whole period, probably due to the influence of the other programs, but averages over the 15 minutes were no different with and without cooling.

BurnInDrive uses 64 KB block sizes, with 164 variations of data patterns, where a parameter controls file size, in this case 16 blocks for 164 MB files. Four such files are written, then read by random selection for a specified time. Finally, blocks are read continuously for a specified number of seconds (see more information here). Again, there was no real difference with and without cooling. Measured performance, such as 33 x 4 x 164 MB in 12.32 minutes, is 29.3 MB/second, of the same order as that measured by vmstat.

MP Integer RAM and OpenGL Tests    With Cooling    No Cooling

 Seconds     KB Threads  Pattern   MB/sec     FPS  MB/sec     FPS

      30 6000000   1    00000000     1978      21    1999      21
      60 6000000   1    FFFFFFFF     1976      21    1864      21
      90 6000000   1    FFFFFFFF     2053      20    1979      21
     120 6000000   1    5A5A5A5A     1918      18    1762      20
     150 6000000   1    AAAAAAAA     1867      19    2066      20
     180 6000000   1    CCCCCCCC     2113      19    1974      21
     210 6000000   1    0F0F0F0F     1841      20    1995      20
     240 6000000   1    FFFFFFFF     1902      20    1928      21
     270 6000000   1    FFFFFFFF     1971      20    2089      20
     300 6000000   1    00000000     2033      20    2084      19
     330 6000000   1    5A5A5A5A     1863      21    1840      21
     360 6000000   1    AAAAAAAA     1974      21    1966      22
     390 6000000   1    AAAAAAAA     2012      21    1956      19
     420 6000000   1    CCCCCCCC     1929      20    1860      20
     450 6000000   1    0F0F0F0F     1964      20    1911      22
     480 6000000   1    00000000     1954      20    2007      21
     510 6000000   1    FFFFFFFF     2019      21    2010      20
     540 6000000   1    FFFFFFFF     1987      21    1999      21
     570 6000000   1    5A5A5A5A     1836      21    1981      21
     600 6000000   1    AAAAAAAA     1991      21    2551      18
     630 6000000   1    CCCCCCCC     1837      21    1996      20
     660 6000000   1    0F0F0F0F     2025      12    1824      21
     690 6000000   1    FFFFFFFF     2017      20    1870      21
     720 6000000   1    FFFFFFFF     2017      20    1843      21
     750 6000000   1    00000000     1858      21    1847      21
     780 6000000   1    5A5A5A5A     2100      21    1905      21
     810 6000000   1    AAAAAAAA     2008      20    1963      21
     840 6000000   1    CCCCCCCC     1966      21    1962      21
     870 6000000   1    CCCCCCCC     1983      21    1980      21
     900 6000000   1    0F0F0F0F     1970      20    1897      21

                        Average      1965    20.1    1964    20.6

======================== burnindrive264g8 Pi 4 Main Drive =========================

Current Path: /home/pi/0test/morestress  Total MB 59639 Free MB 20353, Used MB 39286

Wed Jun 10 12:56:49 2020                   Wed Jun 10 13:19:58 2020
File 1 164 MB written 9.19 seconds         File 1 164 MB written in 9.15 seconds
File 2 164 MB written 9.05 seconds         File 2 164 MB written in 8.94 seconds
File 3 164 MB written 9.63 seconds         File 3 164 MB written in 9.67 seconds
File 4 164 MB written 8.91 seconds         File 4 164 MB written in 8.97 seconds
Total                36.78 seconds         Total                   36.74 seconds

Start Reading Wed Jun 10 12:57:26 2020     Start Reading Wed Jun 10 13:20:35 2020
Passes  1 x 4 Files x 164 MB  0.38 minutes Passes  1 x 4 Files x 164 MB  0.38 minutes
Passes  2 x 4 Files x 164 MB  0.76 minutes Passes  2 x 4 Files x 164 MB  0.75 minutes
Passes  3 x 4 Files x 164 MB  1.13 minutes Passes  3 x 4 Files x 164 MB  1.14 minutes
To
Passes 31 x 4 Files x 164 MB 11.58 minutes Passes 31 x 4 Files x 164 MB 11.56 minutes
Passes 32 x 4 Files x 164 MB 11.95 minutes Passes 32 x 4 Files x 164 MB 11.93 minutes
Passes 33 x 4 Files x 164 MB 12.32 minutes Passes 33 x 4 Files x 164 MB 12.31 minutes

Start Repeat Read Wed Jun 10 13:09:45 2020 Start Repeat Read Wed Jun 10 13:32:53 2020
Passes in 1 second for 164 blocks of 64KB  Passes in 1 second for 164 blocks of 64KB

460 500 540 540 520 440 420 480 540 520    560 560 480 440 440 500 540 540 520 440
520 440 440 460 540 540 520 460 440 440    440 440 540 540 540 420 420 420 520 540
540 540 540 440 440 420 540 540 540 440    540 480 440 460 500 540 540 480 440 460
To                                         To
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580    580 580 580 580 580 580 580 580 580 580

83300 Passes of 64KB blocks  2.78 minutes  83900 Passes of 64KB blocks 2.78 minutes
No errors found during reading tests       No errors found during reading tests
End of test Wed Jun 10 13:12:32 2020       End of test Wed Jun 10 13:35:40 2020

Power Over Ethernet below or Go To Start


Power Over Ethernet (PoE)

I recently carried out tests of Raspberry Pi 4 systems using power supplied over LAN cables. My report is at ResearchGate in Benchmarking Raspberry Pi 4 Running From Power Over Ethernet.pdf. This covers using long, short, thick and thin cables, measuring data transmission speeds and the ability to run my most power consuming benchmarks, particularly with the only wire connected to the Pi being the Ethernet cable. Screenshots of remote control via Windows, Linux and Android are provided. PoE requires additional hardware that injects high voltage power on to the cable and, at the other end, converts it back to the voltage normally used by the destination device. For Raspberry Pi, there is a PoE HAT, with a fan, for this purpose, or separate fanless connectors can be obtained.

A few simple tests were run on the configuration being considered here, simply to verify that the facility was operational. In this case, 48 metres of CAT 6 cable was used with a fanless connector (the 8 GB Pi was fitted with an inexpensive fan). A hard disk and a USB flash drive were plugged in to USB 3 sockets, but not in use. The tests were executed via remote control terminals, using PuTTY on a Windows 7 based PC. After the first one, the only wire plugged in to the Pi was the power connector, from the PoE converter, with communication via WiFi. Results below were all copied from the Windows PuTTY displays.

The first tests were run using the LAN Benchmark, with only large file results shown. Ethernet performance was at the same 1 Gbps speeds identified earlier. WiFi was used from a greater distance, apparently operating mainly at 2.4 GHz speeds.

The other example is from running a Floating Point Stress Test for 10 minutes, with 8 threads sustaining the same near 24 GFLOPS continuously. The vmstat report indicates 8 processes in use and 100% CPU utilisation (of 4 cores) over the whole period. With the fan in use, temperature increases were insignificant. Core voltage did not change between idle and full speed operation.

################ Data Transmission Speeds ################

                       MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3
Ethernet
 512    80.81    81.27    83.18   112.53   111.69   112.38
1024    93.91    91.64    88.02   112.68   112.64   112.68
WiFi
  50     7.28     8.55     8.15     5.51     6.10     6.37
 100     5.95     7.97     7.14     6.58     5.26     6.75


############ High Power Demanding CPU Stress Test ###########
 
           Data             Ops/         Numeric
 Seconds    Size  Threads    Word  MFLOPS Results     Passes

    9.3  1280 KB        8      32   23435   50160      19677
   18.2  1280 KB        8      32   23274   50160      19677
   27.0  1280 KB        8      32   23375   50160      19677
   35.8  1280 KB        8      32   23374   50160      19677
   44.7  1280 KB        8      32   23357   50160      19677
 To
  566.3  1280 KB        8      32   23396   50160      19677
  575.1  1280 KB        8      32   23406   50160      19677
  583.9  1280 KB        8      32   23424   50160      19677
  592.7  1280 KB        8      32   23359   50160      19677
  601.7  1280 KB        8      32   23145   50160      19677


############################# vmstat Activity Monitor #############################

 procs  -----------memory---------- ---swap-- -----io---- -system--  ------cpu-----
  r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

  8  0      0 7723004  15720 148020    0    0    11     5  975  421  91  0  9  0  0
  8  0      0 7722396  15780 148140    0    0     0     4 1048  428 100  0  0  0  0
  8  0      0 7725468  15844 148044    0    0     0     5 1059  447 100  0  0  0  0
  8  0      0 7725720  15892 148052    0    0     0     3 1052  431 100  0  0  0  0
  8  0      0 7725404  15948 148072    0    0     0     3 1051  432 100  0  0  0  0
  8  0      0 7725368  16004 148072    0    0     0     3 1040  413 100  0  0  0  0
  8  0      0 7725984  16060 148076    0    0     0     4 1050  431 100  0  0  0  0
  9  0      0 7725908  16116 148076    0    0     0     3 1040  409 100  0  0  0  0
  8  0      0 7725656  16164 148084    0    0     0     3 1044  415 100  0  0  0  0
  8  0      0 7725372  16220 148092    0    0     0     4 1067  437 100  0  0  0  0


##################### CPU MHz, Voltage and Temperatures ####################
Seconds
    0.0   ARM MHz= 600, core volt=0.8500V, CPU temp=34.0'C, pmic temp=33.5'C
   60.0   ARM MHz=1500, core volt=0.8500V, CPU temp=51.0'C, pmic temp=38.2'C
  120.4   ARM MHz=1500, core volt=0.8500V, CPU temp=52.0'C, pmic temp=40.1'C
  180.8   ARM MHz=1500, core volt=0.8500V, CPU temp=52.0'C, pmic temp=40.1'C
  241.3   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  301.7   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  362.1   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  422.4   ARM MHz=1500, core volt=0.8500V, CPU temp=54.0'C, pmic temp=41.1'C
  482.8   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C
  543.2   ARM MHz=1500, core volt=0.8500V, CPU temp=54.0'C, pmic temp=41.1'C
  603.6   ARM MHz=1500, core volt=0.8500V, CPU temp=53.0'C, pmic temp=41.1'C

CPU Performance Throttling Effects below or Go To Start


CPU Performance Throttling Effects

Another of my reports covered Raspberry Pi 4 CPU MHz Throttling Performance Effects. This was demonstrated by forcing the CPU clock speed to run continuously at 600 MHz, by setting the frequency scaling governor to powersave mode.

This exercise involved using BBC iPlayer for two and a half hours, the main reason being to see whether it remained usable with the minimum available resources.

The Raspberry Pi was connected to a TV with a 1920 x 1080 display, using WiFi communication and with the CPU at 600 MHz. A drama programme was watched for two hours, with no apparent buffering and, in my opinion, a perfectly good display, where the activity report indicated 960 x 540 size at 1700 kbps. A second programme, a wildlife documentary, did produce the occasional short delay, with buffering, reporting the same size but down to 923 kbps. The tests were run without an active cooling fan.

Following are vmstat details, showing CPU utilisation of around 47%, equivalent to two of the four CPU cores running at 100% for most of the time. Then, the environment monitor shows constant MHz and voltage, without significant rises in temperatures.


 vmstat
    -----------memory---------- ---swap--  -----io---- -system-- ------cpu-----
     swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

 Early  0 6475260 109296 736232    0    0     0   242 2795 3640  40  7 52  0  0
 End    0 6467036 111324 740656    0    0     0   248 2867 3752  39  7 54  0  0


 RPiHeatMHzVolts2 Program - Room at 27°C
  
 Hot start ARM MHz= 600, core volt=0.8500V, CPU temp=69.0'C, pmic temp=62.8'C
 Later     ARM MHz= 600, core volt=0.8500V, CPU temp=70.0'C, pmic temp=64.6'C
 Near End  ARM MHz= 600, core volt=0.8500V, CPU temp=72.0'C, pmic temp=66.5'C

  

Go To Start