Raspberry Pi 400 PC 32 Bit and 64 Bit Benchmarks and Stress Tests

Roy Longbottom

Contents


Summary Introduction Benchmark Results
Whetstone Benchmark Dhrystone Benchmark Linpack 100 Benchmark
Livermore Loops Benchmark FFT Benchmarks BusSpeed Benchmark
MemSpeed Benchmark NeonSpeed Benchmark MultiThreading Benchmarks
MP-Whetstone Benchmark MP-Dhrystone Benchmark MP NEON Linpack Benchmark
MP-BusSpeed Benchmark MP-RandMem Benchmark MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks OpenMP-MemSpeed Benchmarks Java Whetstone Benchmark
JavaDraw Benchmark OpenGL GLUT Benchmark I/O Benchmarks
LAN Benchmark WiFi Benchmark USB Booting
USB 3 and Main Drive Benchmarks High Performance Linpack 32 Bit Stress Test Benchmarks
64 Bit Stress Test Benchmarks Stress Test Parameters 32 Bit Floating Point Stress Tests
64 Bit Floating Point Stress Tests 32 Bit Integer Stress Tests 64 Bit Integer Stress Tests
32 Bit System Stress Tests 32 Bit System Stress Tests Part 2 64 Bit System Stress Tests
32 Bit TV Test Plus Remote Access 64 Bit TV Test Using Bluetooth 64 Bit Danger


Summary

This report provides results of benchmarks and stress tests run on a Raspberry Pi 400 PC, using the 32 bit and 64 bit Operating Systems. The PC comprises an upgraded version of the Raspberry Pi 4B CPU, fitted without a fan inside a Raspberry Pi keyboard and running at 1800 MHz instead of 1500 MHz. Benchmark results were compared with those from the original Pi 4B at 32 bits, and Pi 400 32 bit results were compared with those at 64 bits.

CPU and RAM Benchmarks The first group of 18 benchmarks measures various aspects of CPU performance, including the use of multiple CPU cores. At 32 bits, the Pi 400 generally provides the expected 20% improvement in performance where CPU time dominates, but little difference where RAM speed is the limitation. Average performance was superior using 64 bit operation, but too variable to be conclusive. The compiler version used was identified as a potentially significant issue.

Graphics Benchmarks These comprise Java and OpenGL programs that execute a range of test functions. All ran successfully on each configuration, including dual monitor operation. Some Pi 400 and 64 bit gains were apparent, but these depended on the version of system software used.

LAN Benchmarks These demonstrated that the Pi 400 PC obtained the same Gigabit performance as the original Pi 4B, writing and reading large files at around 112 MB/second. The benchmark also demonstrated that 64 bit operation could handle much larger file sizes.

WiFi Benchmarks All systems were run using 2.4 and 5 GHz operation. In my environment, consistent 5 GHz operation was extremely difficult to achieve and the Pi 400 appeared to be slower on data reception speeds.

USB and Main Drive Benchmarks Local and USB 3 performance was measured using a range of low and high speed drives, along with the surprising availability of USB Booting. File size limitations with 32 bit working were also exposed.

High Performance Linpack Benchmark was ported to work at 32 and 64 bits on the Pi 400, both achieving the highest Pi 4 GB RAM rating of 11.7 GFLOPS. As Stress Tests, over 30 minutes, CPU speed was constant, unlike a fanless Pi 4B with a best case performance of 8.8 GFLOPS.

CPU Stress Tests Fifteen half hour stress tests were run, covering 32 bit and 64 bit operation, single and double precision (SP and DP) floating point and integer calculations, using four threads. Best Pi 400 32 bit floating point average performance was 25 GFLOPS SP and 13 GFLOPS DP, with 64 bits somewhat faster. Most were run at room temperatures of 30C, some using side by side Pi 400 and 4B systems. With one exception, tests on the Pi 400 and the 4B with a fan ran at constant CPU speeds at temperatures less than 70C, where Pi 400 performance was around 20% faster. The exception was a Pi 400 session outside, where temperatures were greater than 40C. This time the performance was also constant, with CPU temperature up to 71C. Tests with the fanless Pi 4 saw temperatures of 86C and CPU MHz sometimes throttled down to 600 MHz.

System Stress Tests Six programs were run at the same time for 15 minutes, exercising integer and floating point hardware, all RAM space, OpenGL and drive data transfers, whilst monitoring environment and system utilisation. These were run at 32 and 64 bits on the Pi 400 and fan controlled Pi 4B. There were no excessive CPU temperatures and no data comparison errors.

TV Tests BBC iPlayer programmes were viewed on the Pi 400 for at least seven hours each, via TV at 32 bits and a PC monitor at 64 bits, with external Bluetooth speaker sound for the latter. There were a few peculiarities for consideration, but no interruptions to service.



Introduction

This is a continuation of earlier activity with details at ResearchGate in Raspberry Pi 4B 32 Bit Benchmarks.pdf, Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf and Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks.pdf. The 32 bit benchmarks are available in Raspberry-Pi-4-Benchmarks.tar.gz, with the 64 bit versions in Raspberry-Pi-OS-64-Bit-Benchmarks.tar.xz.

This report covers the Pi 400 PC, using July/August 2020 32 bit and 64 bit Operating Systems, with 64 bit/32 bit comparisons and others with the original Pi 32 bit 4B. Brief descriptions are generally provided. For more comprehensive information, see the above PDF files.

The Pi 400 PC is essentially a Raspberry Pi keyboard containing an upgraded and fanless Pi 4B with enhanced facilities and options. The latest Pi 4 CPU default clock speed is 1800 MHz, compared with 1500 MHz for the original model.

Traditionally, the benchmarks provided details of the system being tested by accessing built-in CPUID details. Following are the latest details, which identify the differences between the two Pi 4 models and the 32 bit/64 bit variations.


Pi 4B 32 Bit 2019 OS
 
Architecture:          armv7l
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
Vendor ID:             ARM
Model:                 3
Model name:            Cortex-A72
Stepping:              r0p3
CPU max MHz:           1500.0000
CPU min MHz:           600.0000
BogoMIPS:              270.00
Flags:                 half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 
                       idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2019-09-26

Pi 400 PC 32 Bit 2020 OS

Architecture:        armv7l
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1800.0000
CPU min MHz:         600.0000
BogoMIPS:            324.00
Flags:               half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm crc32
Raspberry Pi reference 2020-07-06

Pi 400 64 Bit 2020 OS - as 32 bit except

BogoMIPS:            108.00
Flags:               fp asimd evtstrm crc32 cpuid
Linux raspberrypi 5.4.51-v8+ #1333 SMP PREEMPT Mon Aug 10 16:58:35 BST 2020 aarch64 GNU/Linux
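
The listings above are plain "key: value" text, so the differences between systems can be picked out mechanically. The following Python sketch is illustrative only and is not part of the benchmarks: the helper names parse_listing and diff_listings are hypothetical, and the two sample strings are abbreviated copies of the 32 bit listings above.

```python
def parse_listing(text):
    """Parse 'Key: value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip()
    return fields


def diff_listings(a, b):
    """Return {key: (value_a, value_b)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Abbreviated copies of the Pi 4B and Pi 400 32 bit listings.
PI_4B_32 = """\
Architecture:          armv7l
CPU max MHz:           1500.0000
CPU min MHz:           600.0000
BogoMIPS:              270.00
"""

PI_400_32 = """\
Architecture:        armv7l
CPU max MHz:         1800.0000
CPU min MHz:         600.0000
BogoMIPS:            324.00
"""

changes = diff_listings(parse_listing(PI_4B_32), parse_listing(PI_400_32))
for key, (old, new) in changes.items():
    print(f"{key}: {old} -> {new}")
```

With these samples, only the maximum MHz and BogoMIPS entries are reported as different.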

 


Benchmark Results

The following provides benchmark results from the original 32 bit Raspberry Pi 4B and later ones from the Pi 400 PC, working at 32 bits and 64 bits. Comparisons and limited comments are provided.



Whetstone Benchmark - whetstonePiC8, whetstonePi64g8

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations. With no accessing of data in L2 cache or RAM, all 400/4B performance ratios were the same as the ratio of MHz speeds.

Pi 400 overall rating was 11% faster at 64 bits, with variations of -6% to +27%.

 System       MHz   MWIPS  ------MFLOPS------   -------------MOPS---------------
                             1      2      3     COS    EXP  FIXPT     IF  EQUAL

 Pi 4B  32b  1500   1883    522    471    313   54.9   26.4   2496   3178    998
 Pi 400 32b  1800   2258    628    565    376   65.7   31.7   2998   3826   1198
 Pi 400 64b  1800   2505    628    643    478   69.2   32.8   2996   3592   1198

 400/4B      1.20   1.20   1.20   1.20   1.20   1.20   1.20   1.20   1.20   1.20
 400 64/32b  1.00   1.11   1.00   1.14   1.27   1.05   1.04   1.00   0.94   1.00
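
The two comparison rows are simple quotients of the measured values. As a check on the arithmetic, here is a short Python sketch using the MHz, MWIPS and MFLOPS columns copied from the table above. The dictionary layout and the function name ratios are my own choices for illustration.

```python
# Values copied from the Whetstone table above (MHz, MWIPS, MFLOPS 1 to 3).
RESULTS = {
    "Pi 4B 32b":  {"MHz": 1500, "MWIPS": 1883, "MFLOPS1": 522, "MFLOPS2": 471, "MFLOPS3": 313},
    "Pi 400 32b": {"MHz": 1800, "MWIPS": 2258, "MFLOPS1": 628, "MFLOPS2": 565, "MFLOPS3": 376},
    "Pi 400 64b": {"MHz": 1800, "MWIPS": 2505, "MFLOPS1": 628, "MFLOPS2": 643, "MFLOPS3": 478},
}


def ratios(new, old):
    """Return new/old for every column, rounded to two decimals as in the table."""
    return {col: round(RESULTS[new][col] / RESULTS[old][col], 2)
            for col in RESULTS[new]}


print("400/4B    ", ratios("Pi 400 32b", "Pi 4B 32b"))
print("400 64/32b", ratios("Pi 400 64b", "Pi 400 32b"))
```

For these columns the 400/4B ratios all round to 1.20, matching the 1800/1500 MHz ratio.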


Dhrystone Benchmark - dhrystonePiC8, dhrystonePi64g8

This is the most popular ARM integer benchmark, often subject to over-optimisation, rated in VAX MIPS, aka DMIPS. Again, Pi 400 32 bit performance improvement was 20%. The Pi 400 also indicated a 38% gain at 64 bits.
                                  DMIPS                             
 System           MHz    DMIPS     /MHz   Compare      MHz    DMIPS

 Pi 4B  32 bit   1500     5648     3.77                            
 Pi 400 32 bit   1800     6779     3.77   400/4B      1.20     1.20
 Pi 400 64 bit   1800     9337     5.19   64/32 bit   1.00     1.38
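
The DMIPS/MHz column and the comparison ratios follow directly from the DMIPS and MHz figures. The Python sketch below reproduces them from the table values. The constant 1757 is the conventional VAX 11/780 Dhrystones per second reference used to define DMIPS, not a figure from this report, and the function names are hypothetical.

```python
# Conventional reference: a VAX 11/780 (1 MIPS) ran 1757 Dhrystones per second.
VAX_DHRYSTONES_PER_SECOND = 1757


def dmips_from_dhrystones(dhrystones_per_second):
    """Convert a raw Dhrystones per second score to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dmips, mhz):
    """Clock-normalised rating, as in the DMIPS/MHz column."""
    return round(dmips / mhz, 2)


# (MHz, DMIPS) pairs copied from the table above.
pi_4b_32 = (1500, 5648)
pi_400_32 = (1800, 6779)
pi_400_64 = (1800, 9337)

for name, (mhz, dmips) in [("Pi 4B 32 bit", pi_4b_32),
                           ("Pi 400 32 bit", pi_400_32),
                           ("Pi 400 64 bit", pi_400_64)]:
    print(name, dmips_per_mhz(dmips, mhz))

gain_400_vs_4b = round(pi_400_32[1] / pi_4b_32[1], 2)    # 400/4B DMIPS ratio
gain_64_vs_32 = round(pi_400_64[1] / pi_400_32[1], 2)    # 64/32 bit DMIPS ratio
print(gain_400_vs_4b, gain_64_vs_32)
```

The unchanged DMIPS/MHz between the two 32 bit systems is what identifies the gain as purely clock related.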


Linpack 100 Benchmark MFLOPS - linpackPiC8, linpackPiC8SP, linpackPiNEONiC8,
linpackPi64g8, linpackPi64gSP, linpackPi64NEONig8

This original Linpack benchmark executes double precision arithmetic. I introduced two single precision versions, one using NEON functions to include vector processing. Performance of this benchmark can vary, with its dependence on data placement in L2 cache. In this case, best results from more than one run were used, reflecting around 20% gain by the 32 bit Pi 400.

64 bit results indicated 10% to 79% speed gain, best being SP where the compiler probably generated vector instructions.

                                        NEON                               NEON
 System          MHz      DP      SP      SP   Compare   MHz     DP    SP    SP

 Pi 4B  32 bit  1500   957.1  1068.8  1819.9                                   
 Pi 400 32 bit  1800  1146.9  1306.2  2174.6   400/4B    1.20  1.20  1.22  1.19
 Pi 400 64 bit  1800  1337.2  2343.9  2400.4   64/32 bit 1.00  1.17  1.79  1.10


Livermore Loops Benchmark MFLOPS - liverloopsPiC8, liverloopsPi64g8

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is geometric mean, where Cray 1 supercomputer was rated as 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores. Although each kernel is executed for a relatively long time, performance of some can be inconsistent, as reflected in the performance ratios. Fortunately, in this case, overall 32 bit Pi 400 performance ratings indicated a 20% improvement.

There were wide 64 bit/32 bit comparison variations, but the geometric mean indicated a 15% higher rating.

  MFLOPS for 24 loops Pi 4B 1500 MHz 32 bit
     1480   1017    974    930    383    657   1624   1861   1664    617    498     741
      221    320    803    640    737   1003    451    378   1047    411    763     187

  MFLOPS for 24 loops Pi 400 1800 MHz 32 bit
     1751   1225   1187   1120    469    608   1944   2262   2004    888    591     878
      267    374    965    767    897   1218    542    454   1261    473    915     225

  MFLOPS for 24 loops Pi 400 1800 MHz 64 bit
     2553   1194   1163   1139    452    887   2762   3352   2451    999    601    1169
      255    480    979    771    872   1377    540    477   1978    481    982     375

400/4B  1.18   1.20   1.22   1.20   1.22   0.93   1.20   1.22   1.20   1.44   1.19   1.18
        1.21   1.17   1.20   1.20   1.22   1.21   1.20   1.20   1.20   1.15   1.20   1.20

64/32b  1.46   0.97   0.98   1.02   0.96   1.46   1.42   1.48   1.22   1.13   1.02   1.33
        0.96   1.28   1.01   1.01   0.97   1.13   1.00   1.05   1.57   1.02   1.07   1.67

 System          MHz   Maximum Average Geomean Harmean Minimum

 Pi 4B  32 bit  1500    1860.8   800.4   679.0   564.1   179.5
 Pi 400 32 bit  1800    2262.0   965.1   818.6   679.6   217.4
 Pi 400 64 bit  1800    3353.1  1170.9   938.2   761.6   242.0

 400/4B 32 bit  1.20      1.22    1.21    1.21    1.20    1.21
 64 bit/32 bit  1.00      1.48    1.21    1.15    1.12    1.11
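
For readers unfamiliar with the three averages in the summary, the Python sketch below computes arithmetic, geometric and harmonic means from the 24 Pi 4B kernel values listed above, using the standard statistics module (Python 3.8 or later). The summary table's figures, such as a minimum of 179.5 against 187 in the listing, suggest it covers more measurements than the 24 shown, so this sketch illustrates the definitions and is not expected to reproduce the summary exactly.

```python
from statistics import mean, geometric_mean, harmonic_mean

# MFLOPS for 24 loops, Pi 4B 1500 MHz 32 bit, copied from the listing above.
PI_4B_32BIT = [
    1480, 1017, 974, 930, 383, 657, 1624, 1861, 1664, 617, 498, 741,
    221, 320, 803, 640, 737, 1003, 451, 378, 1047, 411, 763, 187,
]

summary = {
    "Maximum": max(PI_4B_32BIT),
    "Average": mean(PI_4B_32BIT),            # arithmetic mean
    "Geomean": geometric_mean(PI_4B_32BIT),  # the official Livermore average
    "Harmean": harmonic_mean(PI_4B_32BIT),
    "Minimum": min(PI_4B_32BIT),
}

for name, value in summary.items():
    print(f"{name:8s} {value:8.1f}")
```

The geometric mean is less influenced by a few very fast kernels than the arithmetic mean, which is why it always falls between the harmonic and arithmetic values.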


Fast Fourier Transforms Benchmarks - fft1PiC8, fft3cPiC8, fft1Pi64g, fft3cPi64g8

This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements use both single and double precision data, calculating FFT sizes between 1K and 1024K, with data from caches and RAM. Note that steps in performance levels occur at data size changes from L1 to L2 caches, then to RAM.

Following are average running times from the three passes of each FFT calculation. Performance can vary, particularly for the calculations with the shortest running times, but the results indicate that the CPU speed dependent calculations were around 20% faster using the 1800 MHz processor at 32 bit working. The remainder, increasingly dependent on RAM speed, showed no gain.

The Pi 400 64 bit and 32 bit comparisons indicate that the FFT calculations involving RAM data transfers are of similar speeds. Then, as with the Linpack benchmarks, single precision performance gains are more apparent than those with double precision.

   
               Time in milliseconds                             Comparison              

           Pi 4B FFT1 32b  Pi 4B FFT3 32b                                               
              SP      DP      SP      DP                                                
  Size K                                                                                
       1    0.04    0.04    0.05    0.04                                                
       2    0.08    0.13    0.10    0.10                                                
       4    0.29    0.34    0.24    0.23                                                
       8    0.79    0.82    0.57    0.51                                                
      16    1.65    1.85    1.32    1.19                                                
      32    3.76    4.71    2.69    3.30                                                
      64    8.82   30.64    6.60    9.47                                                
     128   58.54  132.41   16.92   23.85                                                
     256  275.44  373.12   37.61   55.97                                                
     512  780.89  751.27   81.54  128.13                                                
    1024 1578.70 1812.20  186.45  288.27                                                

          Pi 400 FFT1 32b Pi 400 FFT3 32b            400/4B1 FFT1    400/4B1 FFT3       
              SP      DP      SP      DP              SP      DP      SP      DP Average
  Size K                                                                                
       1    0.03    0.03    0.05    0.04            1.17    1.26    1.09    1.11    1.16
       2    0.07    0.10    0.09    0.08            1.15    1.34    1.13    1.24    1.21
       4    0.21    0.29    0.21    0.19            1.35    1.19    1.17    1.20    1.23
       8    0.63    0.82    0.47    0.42            1.26    1.00    1.21    1.20    1.17
      16    1.45    1.60    1.27    1.02            1.13    1.16    1.04    1.16    1.12
      32    3.54    4.22    2.80    3.08            1.06    1.12    0.96    1.07    1.05
      64    7.72   37.12    6.94    8.83            1.14    0.83    0.95    1.07    1.00
     128   55.94  111.66   15.70   22.27            1.05    1.19    1.08    1.07    1.10
     256  230.26  326.14   34.93   53.51            1.20    1.14    1.08    1.05    1.12
     512  667.66  901.75   76.35  121.06            1.17    0.83    1.07    1.06    1.03
    1024 1503.53 1948.66  167.64  279.32            1.05    0.93    1.11    1.03    1.03


          Pi 400 FFT1 64b Pi 400 FFT3 64b            64b/32b FFT1   64b/32b FFT3        
              SP      DP      SP      DP              SP      DP      SP      DP Average
  Size K                                                                                
       1    0.03    0.03    0.03    0.03            1.00    0.94    1.36    1.06    1.09
       2    0.07    0.11    0.07    0.08            1.05    0.85    1.26    1.01    1.04
       4    0.18    0.29    0.17    0.19            1.16    0.99    1.24    0.99    1.10
       8    0.52    0.84    0.38    0.44            1.20    0.98    1.23    0.96    1.09
      16    1.27    1.57    0.96    1.03            1.14    1.02    1.32    0.99    1.12
      32    3.25    4.00    1.96    3.00            1.09    1.06    1.43    1.02    1.15
      64    7.24   28.72    5.35    9.97            1.07    1.29    1.30    0.89    1.14
     128   45.45  187.78   14.83   23.77            1.23    0.59    1.06    0.94    0.96
     256  321.30  465.17   36.13   52.24            0.72    0.70    0.97    1.02    0.85
     512  825.72 1073.88   77.95  113.83            0.81    0.84    0.98    1.06    0.92
    1024 1622.96 2014.74  166.64  250.60            0.93    0.97    1.01    1.11    1.00
  


BusSpeed Benchmark - busspeedPiC8, busspeedPi64g8

This is a read only benchmark with data from caches and RAM. The program first reads one word with 32 word increments to the next one, skipping the following data words, then repeats with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling bus speeds to be estimated as 16 times the speed of the appropriate measurements at Inc16.
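
As a rough illustration of the access pattern, the Python sketch below reads every 32nd, 16th, 8th, 4th and 2nd word, then all words, and applies the 16 times Inc16 estimate. It is not the benchmark code: the function names are hypothetical, combining the words with AND is simply this sketch's way of using every value read, and picking the 65536 KB Pi 4B Inc16 figure (257 MB/second) as the "appropriate" RAM measurement is my assumption. The factor of 16 is consistent with one 4 byte word in every 16 touching each 64 byte burst once.

```python
def strided_read(words, inc):
    """Read every inc-th word, ANDing them together so each read is used.

    Returns (accumulated value, number of words read).
    """
    acc = 0xFFFFFFFF
    count = 0
    for i in range(0, len(words), inc):
        acc &= words[i]
        count += 1
    return acc, count


def estimated_bus_mb_per_second(inc16_mb_per_second):
    """Bus speed estimate as 16 times the Inc16 measurement."""
    return 16 * inc16_mb_per_second


words = [0xFFFFFFFF] * 1024          # 4 KBytes of 4 byte words
for inc in (32, 16, 8, 4, 2, 1):     # Inc32 down to Read All
    acc, count = strided_read(words, inc)
    print(f"Inc{inc:<2d} words read {count:5d}")

# Assumed example: Pi 4B 32 bit, 65536 KBytes, Inc16 = 257 MB/second.
print(estimated_bus_mb_per_second(257), "MB/second estimated")
```

The estimate of 4112 MB/second from this example is close to the 4015 MB/second measured when reading all data at the same memory size.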

Performance gains of the 32 bit Pi 400 PC CPU increased in line with MHz difference, using data from L1 and L2 caches, with no gain using RAM based data.

With the Pi 400 at 64 bits, RAM speeds were, again, virtually the same as at 32 bits. Based on reading all data, average 64 bit cache based performance gains were 55%. At the larger increments with L1 cache based data, the 64 bit compilation appears to generate less efficient code, with results resembling burst reading effects.

      Reading Speed 4 Byte Words in MBytes/Second         

   Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
   KBytes  Words  Words  Words  Words  Words    All

                   Pi 4B 1500 MHz 32 bits          

       16   4880   5075   5612   5852   5877   5864
       32    846   1138   2153   3229   4908   5300
       64    746   1019   2035   3027   4910   5360
      128    728    983   1952   2908   4888   5389
      256    683    934   1901   2794   4874   5431
      512    656    900   1760   2625   4585   5259
     1024    301    410    870   1356   2846   4238
     4096    233    248    531    996   2151   4045
    16384    236    258    511    891   2143   4011
    65536    237    257    508    881   2172   4015

                   Pi 400 1800 MHz 32 bits         

       16   5859   6098   6734   7023   7053   7036
       32   1726   2247   3593   5034   6093   6412
       64    800   1098   2259   3425   5886   6506
      128    825   1125   2258   3353   5842   6513
      256    815   1125   2279   3351   5837   6534
      512    822   1103   2198   3308   5849   6499
     1024    315    533   1035   1172   3134   4961
     4096    232    266    557   1062   2148   4256
    16384    239    256    487    940   1987   3787
    65536    227    256    481    935   1945   3766

                   Pi 400/4B 32 bits               

       16   1.20   1.20   1.20   1.20   1.20   1.20
       32   2.04   1.97   1.67   1.56   1.24   1.21
       64   1.07   1.08   1.11   1.13   1.20   1.21
      128   1.13   1.14   1.16   1.15   1.20   1.21
      256   1.19   1.20   1.20   1.20   1.20   1.20
      512   1.25   1.23   1.25   1.26   1.28   1.24
     1024   1.05   1.30   1.19   0.86   1.10   1.17
     4096   1.00   1.07   1.05   1.07   1.00   1.05
    16384   1.01   0.99   0.95   1.05   0.93   0.94
    65536   0.96   1.00   0.95   1.06   0.90   0.94

                   Pi 400 1800 MHz 64 bits         

       16   1576   2079   4920   6419   6612  10274
       32   1506   1857   3213   4859   6087  10126
       64    965   1239   2442   3969   5844  10015
      128    885   1142   2246   3773   5889  10266
      256    880   1129   2271   3782   5909  10346
      512    875   1135   2203   3682   5818  10175
     1024    425    570   1105   1973   3312   6064
     4096    246    259    560   1122   2182   4276
    16384    236    256    493    987   1968   3921
    65536    243    258    477    944   1887   3780

                   Pi 400 64 bits/32 bits          

       16   0.27   0.34   0.73   0.91   0.94   1.46
       32   0.87   0.83   0.89   0.97   1.00   1.58
       64   1.21   1.13   1.08   1.16   0.99   1.54
      128   1.07   1.02   0.99   1.13   1.01   1.58
      256   1.08   1.00   1.00   1.13   1.01   1.58
      512   1.06   1.03   1.00   1.11   0.99   1.57
     1024   1.35   1.07   1.07   1.68   1.06   1.22
     4096   1.06   0.97   1.01   1.06   1.02   1.00
    16384   0.99   1.00   1.01   1.05   0.99   1.04
    65536   1.07   1.01   0.99   1.01   0.97   1.00
  


MemSpeed Benchmark MB/Second - memspeedPiC8, memspeedPi64g8

The benchmark includes CPU speed dependent calculations using data from caches and RAM, via single and double precision floating point and integer functions. The instruction sequences used are shown in the results column titles.
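
The three instruction sequences named in the column titles can be sketched as follows. This Python version is for illustration only and will run far slower than the compiled benchmark. The function names are hypothetical, and the way bytes are counted for the MB/second figure (both arrays, once per pass) is an assumption of this sketch, not a statement of how MemSpeed counts them.

```python
import time
from array import array


def triad(x, y, s):
    """x[m] = x[m] + s * y[m]"""
    for m in range(len(x)):
        x[m] = x[m] + s * y[m]


def add(x, y):
    """x[m] = x[m] + y[m]"""
    for m in range(len(x)):
        x[m] = x[m] + y[m]


def copy(x, y):
    """x[m] = y[m]"""
    for m in range(len(x)):
        x[m] = y[m]


def measure_mb_per_second(kernel, x, y, *args, passes=10):
    """Time repeated passes of a kernel and convert to MBytes/second."""
    start = time.perf_counter()
    for _ in range(passes):
        kernel(x, y, *args)
    seconds = max(time.perf_counter() - start, 1e-9)
    bytes_moved = passes * len(x) * x.itemsize * 2   # assumption: count x and y
    return bytes_moved / seconds / 1e6


words = 1024                               # 8 KBytes per double precision array
x = array("d", [1.0] * words)              # "d" = double, "f" would be single
y = array("d", [2.0] * words)
print("Dble triad MB/S", round(measure_mb_per_second(triad, x, y, 0.5)))
print("Dble add   MB/S", round(measure_mb_per_second(add, x, y)))
print("Dble copy  MB/S", round(measure_mb_per_second(copy, x, y)))
```

Increasing the array size moves the data from L1 cache to L2 cache and then RAM, which is where the steps in the tables below come from.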

Subject to normal variations, 32 bit comparisons again indicate the expected 20% improved performance of the later processor, at the lower data sizes, and no difference accessing RAM.

Under 64 bit working, it seems that performance from RAM can still have CPU speed influences, where the 64/32 bit performance ratio can vary from 1.0. With data from caches, 64 bit double precision floating point functions were mainly faster than at 32 bits, with integer calculations similar. Completely unexpectedly, the 64 bit single precision floating point tests were slower than those at double precision, making the 32 bit benchmark appear to be more than twice as fast. See 64 Bit Danger.

 Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
 KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
   Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

Pi 4B 1500 MHz 32 bits

       8   11768   9844   3841  11787   9934   4351  10309   7816   7804
      16   11880   9880   3822  11886  10043   4363  10484   7902   7892
      32    9539   8528   3678   9517   8661   4098  10564   7948   7945
      64    9952   9310   3733   9997   9470   4160   8452   7717   7732
     128    9947   9591   3757   9990   9757   4178   8205   7680   7753
     256   10015   9604   3758  10030   9781   4186   8120   7734   7707
     512    9073   9300   3751   9472   9526   4175   7995   7709   7602
    1024    2681   5303   3594   2664   4965   3760   4828   3592   3569
    2048    1671   3488   3242   1757   3635   3540   2882   1036   1023
    4096    1777   3700   3283   1827   3627   3555   2433   1052   1054
    8192    1931   3805   3420   1933   3815   3629   2465    980    971

Pi 400 1800 MHz 32 bits

       8   14084  11813   4591  14142  12013   5226  12383   9380   9364
      16   14259  11857   4586  14263  12061   5243  12589   9483   9476
      32   14323  11877   4563  14321  12078   5114  12688   9539   9478
      64   12010  11155   4479  11965  11345   4980  10951   9127   9121
     128   12147  11512   4515  11998  11714   5030   9677   9176   9191
     256   12149  11527   4522  12000  11735   5026   9683   9145   9249
     512   11383  11071   4508  10765  11467   5007   9675   9187   9139
    1024    3531   6947   4300   4229   7006   4662   5530   5380   5325
    2048    1730   3427   3507   1979   3938   3947   2836   1021   1022
    4096    1772   4032   3891   2027   3484   4044   2511   1038   1038
    8192    2016   4005   3896   2021   3908   3956   2544   1000   1003

Pi 400/4B 32 bits

       8    1.20   1.20   1.20   1.20   1.21   1.20   1.20   1.20   1.20
      16    1.20   1.20   1.20   1.20   1.20   1.20   1.20   1.20   1.20
      32    1.50   1.39   1.24   1.50   1.39   1.25   1.20   1.20   1.19
      64    1.21   1.20   1.20   1.20   1.20   1.20   1.30   1.18   1.18
     128    1.22   1.20   1.20   1.20   1.20   1.20   1.18   1.19   1.19
     256    1.21   1.20   1.20   1.20   1.20   1.20   1.19   1.18   1.20
     512    1.25   1.19   1.20   1.14   1.20   1.20   1.21   1.19   1.20
    1024    1.32   1.31   1.20   1.59   1.41   1.24   1.15   1.50   1.49
    2048    1.04   0.98   1.08   1.13   1.08   1.11   0.98   0.99   1.00
    4096    1.00   1.09   1.19   1.11   0.96   1.14   1.03   0.99   0.98
    8192    1.04   1.05   1.14   1.05   1.02   1.09   1.03   1.02   1.03

Pi 400 1800 MHz 64 bits

       8   18133   4792   4749  18693   5259   5275  13962  11182  11182
      16   13147   4574   4532  13052   5015   5043  14049  11327  11340
      32   16248   4614   4702  16355   5030   5090  13598  11393  11391
      64   15292   4617   4710  15106   5020   5056  11114  10488  10527
     128   14771   4641   4734  14603   5007   5058   9832   9836   9837
     256   14783   4646   4716  14698   5053   5063   9666   9768   9809
     512   14842   4648   4717  14705   5057   5066   9768   9925   9877
    1024    5441   4436   4494   5484   4486   4646   3852   4179   4389
    2048    1703   3940   3918   2034   4037   4036   2913   2918   2874
    4096    2053   3968   4025   2060   4091   4070   2735   2714   2685
    8192    2036   3940   3882   2034   3935   3995   2642   2643   2638

Pi 400 64 bits/32 bits

       8    1.29   0.41   1.03   1.32   0.44   1.01   1.13   1.19   1.19
      16    0.92   0.39   0.99   0.92   0.42   0.96   1.12   1.19   1.20
      32    1.13   0.39   1.03   1.14   0.42   1.00   1.07   1.19   1.20
      64    1.27   0.41   1.05   1.26   0.44   1.02   1.01   1.15   1.15
     128    1.22   0.40   1.05   1.22   0.43   1.01   1.02   1.07   1.07
     256    1.22   0.40   1.04   1.22   0.43   1.01   1.00   1.07   1.06
     512    1.30   0.42   1.05   1.37   0.44   1.01   1.01   1.08   1.08
    1024    1.54   0.64   1.05   1.30   0.64   1.00   0.70   0.78   0.82
    2048    0.98   1.15   1.12   1.03   1.03   1.02   1.03   2.86   2.81
    4096    1.16   0.98   1.03   1.02   1.17   1.01   1.09   2.61   2.59
    8192    1.01   0.98   1.00   1.01   1.01   1.01   1.04   2.64   2.63


NeonSpeed Benchmark MB/Second - NeonSpeedC8, NeonSpeedPi64g8

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm results are from code as generated by the compiler and Neon results are from using NEON intrinsic functions.

32 bit performance ratios, at different data sizes, were essentially the same as those for the MemSpeed benchmark, around 1.2 from caches and 1.0 from RAM.

At 64 bits, the first column calculations are the same as in MemSpeed, where the 64 bit compiler produces ridiculous results. See 64 Bit Danger. Some gains are indicated with normal integer calculations. The others are from using NEON intrinsic functions, where 64 bit vector instructions can be similar to those from 32 bit NEON code.


      Vector Reading Speed in MBytes/Second      
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v 
  KBytes   Norm   Neon   Norm   Neon  Float    Int

                   Pi 4B 1500 MHz 32 bits         

      16   9884  12882   3910  12773  13090  15133
      32   9904  13061   3916  13002  13162  15239
      64   9029  11526   3450  10704  11708  12084
     128   9242  11784   3391  11016  11816  12179
     256   9283  11890   3396  11215  11929  12284
     512   9043  10680   3413  10211  10925  11241
    1024   5818   3310   3507   3288   3239   2902
    4096   4060   1994   3497   1991   2009   2011
   16384   4030   2063   3445   2068   2072   2067
   65536   3936   2109   3391   1858   2122   2121

                  Pi 400 1800 MHz 32 bits          

      16  11860  15690   4736  15664  15690  18155
      32  11884  15563   4702  15462  15770  17912
      64  10911  14009   4085  13499  14226  14435
     128  11003  14111   4065  13282  14149  14400
     256  11098  14167   4079  13366  14164  14490
     512  10756  12779   4087  12544  13100  13275
    1024   5812   3561   4161   3314   3268   3352
    4096   3798   1813   3666   1870   1870   1868
   16384   3892   1988   3850   1928   1991   1990
   65536   3815   2031   3719   2044   2022   2018

                    Pi 400/4B 32 bits             

      16   1.20   1.22   1.21   1.23   1.20   1.20
      32   1.20   1.19   1.20   1.19   1.20   1.18
      64   1.21   1.22   1.18   1.26   1.22   1.19
     128   1.19   1.20   1.20   1.21   1.20   1.18
     256   1.20   1.19   1.20   1.19   1.19   1.18
     512   1.19   1.20   1.20   1.23   1.20   1.18
    1024   1.00   1.08   1.19   1.01   1.01   1.16
    4096   0.94   0.91   1.05   0.94   0.93   0.93
   16384   0.97   0.96   1.12   0.93   0.96   0.96
   65536   0.97   0.96   1.10   1.10   0.95   0.95

                 Pi 400 1800 MHz 64 bits          

      16   4496  19696   4790  17870  18908  21817
      32   4302  16223   4658  14602  16138  16890
      64   4043  13754   4620  13009  13995  14035
     128   4002  14077   4700  13371  14157  14231
     256   3992  14148   4716  13508  14311  14312
     512   4007  14178   4716  13649  14524  14515
    1024   3867   5478   4490   5301   5458   5531
    4096   3706   2088   4070   2092   2101   2098
   16384   3636   2063   3985   2062   2058   2057
   65536   3319   2057   3803   2011   2059   2063

                 Pi 400 64 bits/32 bits           

      16   0.38   1.26   1.01   1.14   1.21   1.20
      32   0.36   1.04   0.99   0.94   1.02   0.94
      64   0.37   0.98   1.13   0.96   0.98   0.97
     128   0.36   1.00   1.16   1.01   1.00   0.99
     256   0.36   1.00   1.16   1.01   1.01   0.99
     512   0.37   1.11   1.15   1.09   1.11   1.09
    1024   0.67   1.54   1.08   1.60   1.67   1.65
    4096   0.98   1.15   1.11   1.12   1.12   1.12
   16384   0.93   1.04   1.04   1.07   1.03   1.03
   65536   0.87   1.01   1.02   0.98   1.02   1.02


MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in two different versions, using standard compiled C code for single and double precision arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.


MP-Whetstone Benchmark - MP-WHETSPC8, MP-WHETSPi64g8

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Performance was generally proportional to the number of cores used. Overall seconds indicates MP efficiency.
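
The timing rule, where speed is based on the last thread to finish, can be sketched as below. This is an illustrative Python outline of the structure only: the names run_threads and work are hypothetical, each thread keeps its own dedicated variable, and Python threads will not show the near proportional scaling of the compiled benchmark because of the interpreter lock.

```python
import threading
import time


def run_threads(n_threads, work, passes):
    """Run the same work in n_threads threads and rate it on the last finisher.

    Returns total operations per second across all threads.
    """
    finish_times = [0.0] * n_threads

    def worker(index):
        local = 0.0                       # dedicated variable for this thread
        for _ in range(passes):
            local = work(local)
        finish_times[index] = time.perf_counter()

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Elapsed time runs from the common start to the last thread finishing.
    elapsed = max(max(finish_times) - start, 1e-9)
    return n_threads * passes / elapsed


for n in (1, 2, 4, 8):
    rate = run_threads(n, lambda v: v * 0.5 + 1.0, 20000)
    print(f"{n}T {rate:12.0f} operations per second")
```

With four cores, a benchmark structured this way keeps the same overall seconds up to 4 threads and roughly doubles at 8 threads, as in the Overall Seconds lines below.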

During the 32 bit tests, 1800 MHz gain ratios were effectively at the expected 1.20 level, except for the last memory copy (Equal) test and the 8 thread section, which can have variable performance.

Performance of the Pi 400 64 bit version was similar to that at 32 bits on a number of test functions, but the overall score indicated a 14% improvement at all significant thread counts.


      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS
 
                        Pi 4B 1500 MHz 32 bits                     
 
 1T  1889.5  538.7  537.6  311.4  56.3  26.1  7450.5  2243.2  659.9
 2T  3782.7 1065.5 1071.2  627.1 112.3  52.0 14525.7  4460.9 1327.3
 4T  7564.1 2101.0 2145.9 1250.4 225.0 104.1 29430.5  8944.2 2660.8
 8T  8003.6 2598.8 2797.0 1313.0 233.2 110.4 37906.3 10786.7 2799.4

   Overall Seconds   4.99 1T,   5.00 2T,   5.03 4T,  10.06 8T      

                       Pi 400 1800 MHz 32 bits                    
 
 1T  2269.3  639.9  646.3  376.3  67.6  31.3  8975.2  2690.7  762.9
 2T  4533.6 1275.0 1292.1  752.8 134.5  62.6 17936.5  5348.1 1539.8
 4T  9057.1 2527.4 2574.9 1502.3 269.8 124.6 35455.7 10736.3 3086.1
 8T  9658.2 3009.7 3068.6 1577.6 287.3 133.8 46331.2 13940.2 3208.7

   Overall Seconds   5.08 1T,   5.08 2T,   5.12 4T,  10.22 8T

                         Pi 400/4B 32 bits                         

 1T    1.20   1.19   1.20   1.21  1.20  1.20    1.20    1.20   1.16
 2T    1.20   1.20   1.21   1.20  1.20  1.20    1.23    1.20   1.16
 4T    1.20   1.20   1.20   1.20  1.20  1.20    1.20    1.20   1.16
 8T    1.21   1.16   1.10   1.20  1.23  1.21    1.22    1.29   1.15

                       Pi 400 1800 MHz 64 bits                     
 
 1T  2577.2  636.9  635.2  477.4  72.6  32.8  8878.4  2693.8 1198.3
 2T  5153.3 1266.0 1264.9  954.4 145.2  65.6 17863.2  5388.7 2394.6
 4T 10289.2 2520.1 2537.3 1906.5 290.3 131.1 33002.1 10732.0 4781.4
 8T 10768.5 3027.7 3361.7 1960.4 298.5 137.4 45139.4 12877.9 4887.0

   Overall Seconds   4.99 1T,   5.00 2T,   5.04 4T,  10.10 8T

                       Pi 400 64 bits/32 bits                      

 1T    1.14   1.00   0.98   1.27  1.07  1.05    0.99    1.00   1.57
 2T    1.14   0.99   0.98   1.27  1.08  1.05    1.00    1.01   1.56
 4T    1.14   1.00   0.99   1.27  1.08  1.05    0.93    1.00   1.55
 8T    1.11   1.01   1.10   1.24  1.04  1.03    0.97    0.92   1.52
 


MP-Dhrystone Benchmark - MP-DHRYPiC8, MP-DHRYPi64g8

This executes multiple copies of the same program, but with some shared data, leading to unacceptable multithreaded performance. The 32 bit single thread speeds were similar to the earlier Dhrystone result, with the usual improvement of around 20% at 1800 MHz. The other results don't mean much but, in this case, were not too far from 20%.

The single thread test at 64 bits was 42% faster than at 32 bits, but the gain was much smaller using more than one thread.

                      MP-Dhrystone Benchmark              

                    Using 1, 2, 4 and 8 Threads            
            
                        Pi 4B 1500 MHz 32 bits              

 Threads                        1        2        4        8
 Seconds                     0.79     1.21     2.62     4.88
 Dhrystones per Second   10126308 13262168 12230188 13106002
 VAX MIPS rating             5763     7548     6961     7459

                        Pi 400 1800 MHz 32 bits             

 Seconds                     0.65     1.00     2.09     3.87
 Dhrystones per Second   12259203 15949971 15292691 16517837
 VAX MIPS rating             6977     9078     8704     9401

                           Pi 400/4B 32 bits                

 Comparison                  1.21     1.20     1.25     1.26

                        Pi 400 1800 MHz 64 bits             

 Seconds                     0.92     1.78     3.58     7.17
 Dhrystones per Second   17447778 17937022 17879626 17858080
 VAX MIPS rating             9930    10209    10176    10164

                        Pi 400 64 bits/32 bits              

 Comparison                  1.42     1.12     1.17     1.08
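
The VAX MIPS rows above follow from the Dhrystones per Second rows through the conventional divisor of 1757, the Dhrystone score of the VAX 11/780 reference machine. A minimal C sketch of that conversion (the function name is hypothetical):

```c
/* Conventional Dhrystone scaling: the VAX 11/780 reference machine,
   rated at 1 MIPS, scored 1757 Dhrystones per second. */
double vax_mips(double dhrystones_per_second)
{
    return dhrystones_per_second / 1757.0;
}
```

For example, 10126308 Dhrystones per second gives about 5763 VAX MIPS, matching the Pi 4B single thread entry.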


MP SP NEON Linpack Benchmark - linpackNeonMPC8, linpackMPNeonPi64g8

This was produced to show that the original Linpack benchmark was completely unsuitable for benchmarking multiple CPUs or cores, and this is reflected in the results. The program uses NEON intrinsic functions, with increasing data sizes.

At 32 bits, the unthreaded N=100 1800 MHz performance gains were close to expectations. At N=500, memory demands border on the L2 cache/RAM area that can lead to inconsistent results, with N=1000 limited by RAM speed.

The Pi 400 64 bit version had similarly slow multithreaded performance to the other examples, and faster unthreaded N=100 performance than the Pi 4B.


 Linpack Single Precision MultiThreaded Benchmark
             Using NEON Intrinsics           

  MFLOPS 0 to 4 Threads, N 100, 500, 1000     

 Threads      None        1        2        4 

           Pi 4B 1500 MHz 32 bits             
 N  100    2007.38   112.55   107.85   106.98 
 N  500    1332.24   686.10   686.11   689.02 
 N 1000     402.61   435.26   432.21   432.01 

           Pi 400 1800 MHz 32 bits             
 N  100    2345.82   109.74   104.55   104.73 
 N  500    2022.64   812.55   827.02   819.69 
 N 1000     423.56   438.79   440.90   443.00 

              Pi 400/4B 32 bits               
 N  100       1.17     0.98     0.97     0.98 
 N  500       1.52     1.18     1.21     1.19 
 N 1000       1.05     1.01     1.02     1.03 

           Pi 400 1800 MHz 64 bits            
 N  100    2611.69    95.68    97.32    97.17 
 N  500    1611.41   660.61   658.51   654.51 
 N 1000     409.08   436.27   435.62   416.79 

              Pi 400 64 bits/32 bits          
 N  100       1.11     0.87     0.93     0.93
 N  500       0.80     0.81     0.80     0.80
 N 1000       0.97     0.99     0.99     0.94
 


MP BusSpeed (read only) Benchmark - MP-BusSpd2PiC8, MP-BusSpd2Pi64g8

Each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points, the latter to avoid misrepresentation of performance using shared L2 cache.

Performance variations can be expected from this benchmark, but 32 bit 1800/1500 MHz ratios can be interpreted as around 1.2 for cache based data and 1.0 for data from RAM.

The 64 bit compiler somehow manages to lose its way on decreasing addressing increments after Inc8, leading to the 32 bit version appearing to be up to three times faster.
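
The idea of reading with a given address increment, with each thread starting at a different point, can be sketched in C as follows. This is an illustration under stated assumptions, not the benchmark's code: the function name and parameters are hypothetical, the increment is counted in 32 bit words, and a bitwise AND is assumed as the way of consuming each word so that the loads cannot be optimised away.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one word every `inc` words, beginning at `start` and wrapping
   round, so threads with different starts touch different data first. */
uint32_t read_with_increment(const uint32_t *data, size_t words,
                             size_t inc, size_t start)
{
    uint32_t acc = 0xffffffffu;
    if (words == 0 || inc == 0)
        return acc;
    size_t pos = start % words;
    for (size_t done = 0; done < words; done += inc) {
        acc &= data[pos];           /* read only: data is never written */
        pos = (pos + inc) % words;
    }
    return acc;
}
```

An increment of 1 corresponds to reading every word (as in the RdAll column), while larger increments skip words, so fewer useful bytes arrive with each burst read from cache or RAM.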

  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

                  Pi 4B 1500 MHz 32 bits          
 12.3 1T   5310   5616   5801   5898   5940  13425
      2T   9393  10008  11293  11293  11368  24932
      4T  15781  15015  17606  19034  22279  40736
      8T   8465   9599  14580  18465  20034  36831
122.9 1T    664    930   1861   3191   5017  10281
      2T    564    726   1523   5376   9387  18985
      4T    486    919   1886   4289   8337  16979
      8T    487    912   1854   4275   8271  16826
12288 1T    225    258    514   1010   1992   3975
      2T    202    421    450   1765   3307   7396
      4T    261    288    825   1332   1772   5014
      8T    218    273    496   1041   2571   4021

                  Pi 400 1800 MHz 32 bits        
 12.3 1T   6387   6760   6851   6749   7138  16074
      2T  11269   9282  11428  12968  13224  29163
      4T  16205  12617  16819  21992  26820  52253
      8T   9723  13073  17001  21763  23693  43707
122.9 1T    797   1117   2070   3794   6006  12286
      2T    690    834   1786   5527  11282  22883
      4T    578   1102   2259   5090   9868  19940
      8T    576   1101   2242   5041   9870  19975
12288 1T    230    255    495   1003   1981   3993
      2T    210    223    906   1587   1918   3794
      4T    303    283    476    904   2082   3549
      8T    250    238    524   1046   2284   3701

                     Pi 400/4B 32 bits            
 12.3 1T   1.20   1.20   1.18   1.14   1.20   1.20
      2T   1.20   0.93   1.01   1.15   1.16   1.17
      4T   1.03   0.84   0.96   1.16   1.20   1.28
      8T   1.15   1.36   1.17   1.18   1.18   1.19
122.9 1T   1.20   1.20   1.11   1.19   1.20   1.20
      2T   1.22   1.15   1.17   1.03   1.20   1.21
      4T   1.19   1.20   1.20   1.19   1.18   1.17
      8T   1.18   1.21   1.21   1.18   1.19   1.19
12288 1T   1.02   0.99   0.96   0.99   0.99   1.00
      2T   1.04   0.53   2.01   0.90   0.58   0.51
      4T   1.16   0.98   0.58   0.68   1.17   0.71
      8T   1.15   0.87   1.06   1.00   0.89   0.92

                  Pi 400 1800 MHz 64 bits        
 12.3 1T   6198   6657   6770   5048   4907   5080
      2T   8825  12952  11558   9422   9531   9937
      4T  10051  11518  17686  16592  18680  19757
      8T   8994  10828  16669  16241  16797  19140
122.9 1T    718   1114   2257   3357   4610   4877
      2T    676    939   2587   5818   9114   9703
      4T    579   1126   2447   4861   9587  17663
      8T    572   1119   2427   4911   9556  16538
12288 1T    226    255    473    940   1882   3738
      2T    311    297    446    944   1880   3786
      4T    236    352    568   1352   2773   3205
      8T    246    263    563    931   1904   4182

                 Pi 400 64 bits/32 bits           
 12.3 1T   0.97   0.98   0.99   0.75   0.69   0.32
      2T   0.78   1.40   1.01   0.73   0.72   0.34
      4T   0.62   0.91   1.05   0.75   0.70   0.38
      8T   0.93   0.83   0.98   0.75   0.71   0.44
122.9 1T   0.90   1.00   1.09   0.88   0.77   0.40
      2T   0.98   1.13   1.45   1.05   0.81   0.42
      4T   1.00   1.02   1.08   0.96   0.97   0.89
      8T   0.99   1.02   1.08   0.97   0.97   0.83
12288 1T   0.98   1.00   0.96   0.94   0.95   0.94
      2T   1.48   1.33   0.49   0.59   0.98   1.00
      4T   0.78   1.24   1.19   1.50   1.33   0.90
      8T   0.98   1.11   1.07   0.89   0.83   1.13
 


MP RandMem Benchmark - MP-RandMemPiC8, MP-RandMemPi64g8

The benchmark uses the same complex indexing for serial and random access, with separate read only and read/write tests. The performance patterns were as expected. Random access is dependent on the impact of burst reading and writing, producing those slow speeds. Read only performance increased, as expected, relative to the thread count, with that for read/write remaining constant at a particular data size, probably due to write back to shared data space.

Performance comparisons at 32 bits again indicated 20% gains at 1800 MHz using cached data and no gain using RAM.

On the Pi 400, 64 bit performance was generally the same as at 32 bits, with no scope for vectorisation.
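
Using the same indexing for serial and random access means that the test loop is identical and only the contents of an index array change. The following C sketch shows one way to arrange that. All names are hypothetical, and the shuffle (Fisher-Yates driven by a small linear congruential generator) and the summing of words are assumptions made for illustration, not details taken from the benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator, used here only to shuffle. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;             /* drop the weakest low bits */
}

/* Serial order: index[i] = i. */
void fill_serial(uint32_t *index, size_t n)
{
    for (size_t i = 0; i < n; i++)
        index[i] = (uint32_t)i;
}

/* Random order: Fisher-Yates shuffle of the serial order. */
void fill_random(uint32_t *index, size_t n, uint32_t seed)
{
    fill_serial(index, n);
    for (size_t i = n; i > 1; i--) {
        size_t j = lcg_next(&seed) % i;
        uint32_t tmp = index[i - 1];
        index[i - 1] = index[j];
        index[j] = tmp;
    }
}

/* Read only test body: the same code serves both access orders. */
uint32_t read_indexed(const uint32_t *data, const uint32_t *index, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[index[i]];
    return sum;
}
```

Because the random index is a permutation of the serial one, both orders read exactly the same words, so any speed difference comes from the access pattern alone.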


  KB       SerRD SerRDWR   RndRD RndRDWR
 
            Pi 4B 1500 MHz 32 bits      
 12.3 1T    5950    7903    5945    7896
      2T   11849    7923   11887    7917
      4T   23404    7785   23395    7761
      8T   21903    7669   23104    7655
122.9 1T    5670    7309    2002    1924
      2T   10682    7285    1648    1923
      4T    9944    7266    1813    1927
      8T    9896    7216    1812    1919
12288 1T    3904    1075     179     164
      2T    7317    1055     215     164
      4T    3398    1063     343     165
      8T    4156    1062     350     165

            Pi 400 1800 MHz 32 bits     
 12.3 1T    7135    9348    7136    9370
      2T   14256    9359   14273    9352
      4T   28119    9240   28114    9258
      8T   26441    9130   26328    9144
122.9 1T    6790    8281    2381    2246
      2T   12555    8297    2098    2313
      4T   11951    8481    2177    2317
      8T   12044    8485    2155    2305
12288 1T    3777     946     178     162
      2T    7474    1066     211     165
      4T    4319    1184     343     164
      8T    4407    1227     340     165

               Pi 400/4B 32 bits        
 12.3 1T    1.20    1.18    1.20    1.19
      2T    1.20    1.18    1.20    1.18
      4T    1.20    1.19    1.20    1.19
      8T    1.21    1.19    1.14    1.19
122.9 1T    1.20    1.13    1.19    1.17
      2T    1.18    1.14    1.27    1.20
      4T    1.20    1.17    1.20    1.20
      8T    1.22    1.18    1.19    1.20
12288 1T    0.97    0.88    0.99    0.99
      2T    1.02    1.01    0.98    1.01
      4T    1.27    1.11    1.00    0.99
      8T    1.06    1.16    0.97    1.00

            Pi 400 1800 MHz 64 bits     
 12.3 1T    7138    9489    7129    9478
      2T   14187    9516   13922    9506
      4T   22329    9352   23537    9153
      8T   20274    9216   24488    9252
122.9 1T    6921    8444    2397    2242
      2T   13046    8419    1983    2339
      4T   12397    8443    2127    2347
      8T   12567    8295    2127    2371
12288 1T    2761    1264     183     167
      2T    7408    1278     200     162
      4T    3354     772     254     167
      8T    3993    1251     253     141

            Pi 400 64 bits/32 bits      
 12.3 1T    1.00    1.02    1.00    1.01
      2T    1.00    1.02    0.98    1.02
      4T    0.79    1.01    0.84    0.99
      8T    0.77    1.01    0.93    1.01
122.9 1T    1.02    1.02    1.01    1.00
      2T    1.04    1.01    0.95    1.01
      4T    1.04    1.00    0.98    1.01
      8T    1.04    0.98    0.99    1.03
12288 1T    0.73    1.34    1.03    1.03
      2T    0.99    1.20    0.95    0.98
      4T    0.78    0.65    0.74    1.02
      8T    0.91    1.02    0.74    0.85


MP-MFLOPS Benchmarks - MP-MFLOPSPiC8, MP-MFLOPSDPC8, MP-NeonMFLOPSC8,
MP-MFLOPSPi64g8, MP-MFLOPSDPPi64g8, MP-NeonMFLOPSPi64g8

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. There are three varieties, single precision, double precision and single precision through NEON intrinsic functions, all attempting to show near maximum MP floating point processing speeds.
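
The two kinds of calculation can be sketched in C as below. This is a hedged illustration: the function names and the `begin`/`end` segment parameters are hypothetical, the two operation form `(x[i]+a)*b` is an assumption consistent with "a multiply and an add", and the second function contains only the three terms quoted above (8 operations per word), since the remaining terms of the 32 operation version are not given here.

```c
#include <stddef.h>

/* 2 operations per word: one add and one multiply. */
void two_ops(float *x, size_t begin, size_t end, float a, float b)
{
    for (size_t i = begin; i < end; i++)
        x[i] = (x[i] + a) * b;
}

/* Only the three quoted terms: 4 adds/subtracts between and inside
   the brackets plus 3 multiplies, counted as 8 operations per word. */
void quoted_terms(float *x, size_t begin, size_t end,
                  float a, float b, float c, float d, float e, float f)
{
    for (size_t i = begin; i < end; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

Each thread would call such a function on its own `begin` to `end` segment. With few operations per word, the loop spends most of its time moving data, which is why the 2 Ops/Word results from RAM show little dependence on CPU speed.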

At 32 bits, subject to normal intermittent variations, the Pi 400 exhibited the normal 20% performance increase over the Pi 4B using the cache based 12.8 KB and 128 KB data sizes, but similar speeds at 12.8 MB from RAM, where there was little dependence on floating point calculating times. Note the NEON 32 bit performance gains.

Pi 400 performance was similar at 64 bits and 32 bits using RAM, again where there was little dependence on arithmetic calculations. With cached data, Pi 400 single precision speed improved considerably, typically 2.5 times faster, but half that at double precision. The NEON cached calculations were generally somewhat faster, with the two processors executing different varieties of vector instructions.


               Single Precision MFLOPS                     Comparisons     

       2 Ops/Word           32 Ops/Word          2 Ops/Word           32 Ops/Word
 KB    12.8    128  12800   12.8    128  12800   12.8    128  12800   12.8     128   12800
 
       Pi 4B 1500 MHz 32 bits Max 11.0 GFLOPS
 1T    1224   1257    520   2814   2800   2803
 2T    2485   2257    525   5608   5575   5576
 4T    4119   3243    534  11018  10645   8358
 8T    4131   4618    541   9941  10339   8165

       Pi 400 1800 MHz 32 bits Max 13.2 GFLOPS               Pi 400/4B
 1T    1526   1504    593   3380   3366   3356   1.25   1.20   1.14   1.20    1.20    1.20
 2T    2972   2593    598   6725   6696   6675   1.20   1.15   1.14   1.20    1.20    1.20
 4T    5414   5304    621  13179  13144   9339   1.31   1.64   1.16   1.20    1.23    1.12
 8T    4862   5790    602  12055  13010   9343   1.18   1.25   1.11   1.21    1.26    1.14

       Pi 400 1800 MHz 64 bits Max 30.9 GFLOPS           Pi 400 64 bits/32 bits
 1T    3995   3752    476   8108   8066   7436   2.62   2.49   0.80   2.40    2.40    2.22
 2T    7743   5887    594  15761  15762  10185   2.61   2.27   0.99   2.34    2.35    1.53
 4T   13859  13899    600  30674  30891  10476   2.56   2.62   0.97   2.33    2.35    1.12
 8T   12741  13809    583  27381  30559   8492   2.62   2.38   0.97   2.27    2.35    0.91
                                                 
             NEON Intrinsic Functions MFLOPS

       Pi 4B 1500 MHz 32 bits Max 17.2 GFLOPS              NEON SP/Normal SP
 1T    2797   2870    641   4422   4454   4405   2.29   2.28   1.23   1.57    1.59    1.57
 2T    3217   5601    569   8587   8800   8377   1.29   2.48   1.08   1.53    1.58    1.50
 4T    7902   9864    611  17061  17215   9704   1.92   3.04   1.14   1.55    1.62    1.16
 8T    7070  10562    603  15531  16203   9516   1.71   2.29   1.11   1.56    1.57    1.17

       Pi 400 1800 MHz 32 bits Max 20.1 GFLOPS               Pi 400/4B
 1T    3471   3459    597   5318   5345   5244   1.24   1.21   0.93   1.20    1.20    1.19
 2T    6842   4295    575  10587  10460   9499   2.13   0.77   1.01   1.23    1.19    1.13
 4T    9441   6507    608  20053  20147   9377   1.19   0.66   1.00   1.18    1.17    0.97
 8T    7133   8382    500  18080  20187   8589   1.01   0.79   0.83   1.16    1.25    0.90

       Pi 400 1800 MHz 64 bits Max 30.2 GFLOPS            Pi 400 64 bits/32 bits
 1T    4015   3865    447   7902   7860   7326   1.16   1.12   0.75   1.49    1.47    1.40
 2T    7412   7347    573  15625  15543   9123   1.08   1.71   1.00   1.48    1.49    0.96
 4T    9292  13936    605  29605  30067  10412   0.98   2.14   1.00   1.48    1.49    1.11
 8T   10169   9622    585  28978  30150   8537   1.43   1.15   1.17   1.60    1.49    0.99

               Double Precision MFLOPS

       Pi 4B 1500 MHz 32 bits Max 10.4 GFLOPS
 1T    1203   1211    315   2675   2719   2674
 2T    2291   2441    293   5406   5421   4907
 4T    4673   2501    309  10313  10393   5256
 8T    4394   3550    265   8782  10110   5197

       Pi 400 1800 MHz 32 bits Max 12.6 GFLOPS               Pi 400/4B
 1T    1441   1470    259   3274   3262   3116   1.20   1.21   0.82   1.22    1.20    1.17
 2T    2944   2640    258   6491   6368   4420   1.29   1.08   0.88   1.20    1.17    0.90
 4T    5555   2860    270  12560  12604   4344   1.19   1.14   0.87   1.22    1.21    0.83
 8T    3730   5499    267  12154  11558   4398   0.85   1.55   1.01   1.38    1.14    0.85

       Pi 400 1800 MHz 64 bits Max 15.1 GFLOPS            Pi 400 64 bits/32 bits
 1T    2003   1955    250   4085   4071   3744   1.39   1.33   0.97   1.25    1.25    1.20
 2T    3789   3780    296   8110   8098   5141   1.29   1.43   1.15   1.25    1.27    1.16
 4T    6974   7093    300  14998  15093   5085   1.26   2.48   1.11   1.19    1.20    1.17
 8T    4784   3983    281  14296  14433   4238   1.28   0.72   1.05   1.18    1.25    0.96
 


OpenMP-MFLOPS - OpenMP-MFLOPSC8, notOpenMP-MFLOPSC8, OpenMP-MFLOPS64g8, notOpenMP-MFLOPS64g8

This benchmark carries out the same single precision calculations as the MP-MFLOPS Benchmarks but, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.
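
The relationship between the two builds can be illustrated with a single loop in C. This is a minimal sketch, assuming GCC: compiled with `-fopenmp` the pragma shares the iterations among threads, and compiled without that option the pragma is ignored and the loop runs on one core. The function name is hypothetical, and how the notOpenMP version was actually produced is not shown here.

```c
/* Same arithmetic either way; only the OpenMP directive decides
   whether the iterations are divided among threads. */
void omp_two_ops(float *x, long n, float a, float b)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;
}
```

Each iteration writes only its own `x[i]`, so no synchronisation is needed and results should match the single core version for this simple case.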

The final data values are checked for consistency. Different compilers or different CPUs could involve using alternative instructions or rounding effects, with variable accuracy. Then, OpenMP sumchecks could be expected to be the same as those from NotOpenMP single core values. However, this is not always the case. This benchmark was a compilation of code used for desktop PCs, starting at 400 KB (100K Words), then 4 MB and 40 MB.

The main purposes of this benchmark are to see if OpenMP can produce similar maximum performance to MP-MFLOPS and that this can increase in line with the number of cores used. In fact, faster OpenMP 32 bit performance was apparent, with 24.2 SP GFLOPS at 1800 MHz, 21% faster than at 1500 MHz, indicating more efficient operation than via my hand coded NEON functions. At 64 bits, a maximum speed of 30.2 single precision GFLOPS was demonstrated, effectively the same as MP-MFLOPS.

With the 400 KB minimum data size, probably mainly from L2 cache, performance can be quite variable, as it can with the other sizes, which represent speed from RAM. Appropriate four core performance gains were demonstrated in some cases.
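
The MFLOPS column in the tables below is consistent with words times operations per word times repeat passes, divided by seconds. A small C sketch of that arithmetic (the function name is hypothetical):

```c
/* MFLOPS from the table columns: total floating point operations
   divided by elapsed seconds, scaled to millions. */
double mflops(double words, double ops_per_word, double passes,
              double seconds)
{
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For the Pi 400 32 bit row with 100000 words, 32 operations per word, 2500 passes and 0.330647 seconds, this gives about 24195 MFLOPS, the 24.2 GFLOPS maximum quoted.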


  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All   400
                    Words  Word   Passes                         Results  Same   /4B

                   OpenMP MFLOPS Pi 4B 1500 MHz 32 bits Max 20.0 GFLOPS
 
 Data in & out     100000     2     2500   0.098043     5100    0.929538   Yes
 Data in & out    1000000     2      250   0.810084      617    0.992550   Yes
 Data in & out   10000000     2       25   0.922891      542    0.999250   Yes

 Data in & out     100000     8     2500   0.144870    13805    0.957126   Yes
 Data in & out    1000000     8      250   0.922568     2168    0.995524   Yes
 Data in & out   10000000     8       25   0.918226     2178    0.999550   Yes

 Data in & out     100000    32     2500   0.401577    19921    0.890282   Yes
 Data in & out    1000000    32      250   0.935064     8556    0.988096   Yes
 Data in & out   10000000    32       25   0.916277     8731    0.998806   Yes

                   OpenMP MFLOPS Pi 400 1800 MHz 32 bits Max 24.2 GFLOPS

 Data in & out     100000     2     2500   0.157307     3178    0.929538   Yes  0.62
 Data in & out    1000000     2      250   1.049819      476    0.992550   Yes  0.77
 Data in & out   10000000     2       25   0.908061      551    0.999250   Yes  1.02

 Data in & out     100000     8     2500   0.132957    15042    0.957126   Yes  1.09
 Data in & out    1000000     8      250   0.797976     2506    0.995524   Yes  1.16
 Data in & out   10000000     8       25   0.917282     2180    0.999550   Yes  1.00

 Data in & out     100000    32     2500   0.330647    24195    0.890282   Yes  1.21
 Data in & out    1000000    32      250   0.872310     9171    0.988096   Yes  1.07
 Data in & out   10000000    32       25   0.948771     8432    0.998806   Yes  0.97

                                     Next Run

 Data in & out     100000     2     2500   0.087220     5733    0.929538   Yes  1.12
 Data in & out     100000     8     2500   0.108323    18463    0.957126   Yes  1.34
 Data in & out     100000    32     2500   0.987574     8101    0.890282   Yes  0.41

                   notOpenMP MFLOPS Pi 4B 1500 MHz 32 bits Max 5.5 GFLOPS

 Data in & out     100000     2     2500   0.220277     2270    0.929538   Yes
 Data in & out    1000000     2      250   0.791373      632    0.992550   Yes
 Data in & out   10000000     2       25   0.792594      631    0.999250   Yes

 Data in & out     100000     8     2500   0.362916     5511    0.957126   Yes
 Data in & out    1000000     8      250   0.902125     2217    0.995524   Yes
 Data in & out   10000000     8       25   0.786859     2542    0.999550   Yes

 Data in & out     100000    32     2500   1.497859     5341    0.890282   Yes
 Data in & out    1000000    32      250   1.518747     5267    0.988096   Yes
 Data in & out   10000000    32       25   1.516393     5276    0.998806   Yes

                   notOpenMP MFLOPS Pi 400 1800 MHz 32 bits Max 6.6 GFLOPS

 Data in & out     100000     2     2500   0.127996     3906    0.929538   Yes  1.72
 Data in & out    1000000     2      250   0.802889      623    0.992550   Yes  0.99
 Data in & out   10000000     2       25   0.774740      645    0.999250   Yes  1.02

 Data in & out     100000     8     2500   0.302848     6604    0.957126   Yes  1.20
 Data in & out    1000000     8      250   0.897527     2228    0.995524   Yes  1.00
 Data in & out   10000000     8       25   0.858763     2329    0.999550   Yes  0.82

 Data in & out     100000    32     2500   1.247949     6411    0.890282   Yes  1.20
 Data in & out    1000000    32      250   1.303086     6139    0.988096   Yes  1.17
 Data in & out   10000000    32       25   1.293210     6186    0.998806   Yes  1.17

                           64 Bit OpenMP-MFLOPS Results below


Results OpenMP-MFLOPS64g8, notOpenMP-MFLOPS64g8


  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All 64 bits/
                    Words  Word   Passes                         Results  Same 32 bits

                   OpenMP MFLOPS Pi 400 1800 MHz 64 bits Max 30.2 GFLOPS

 Data in & out     100000     2     2500   0.085683     5835    0.929538   Yes  1.84
 Data in & out    1000000     2      250   0.781184      640    0.992550   Yes  1.34
 Data in & out   10000000     2       25   0.781273      640    0.999250   Yes  1.16

 Data in & out     100000     8     2500   0.097116    20594    0.957117   Yes  1.37
 Data in & out    1000000     8      250   0.802633     2492    0.995518   Yes  0.99
 Data in & out   10000000     8       25   0.817137     2448    0.999549   Yes  1.16

 Data in & out     100000    32     2500   0.288180    27760    0.890215   Yes  1.15
 Data in & out    1000000    32      250   0.832001     9615    0.988088   Yes  1.05
 Data in & out   10000000    32       25   0.850003     9412    0.998796   Yes  1.12

                                     Other Run

 Data in & out     100000    32     2500   0.265007    30188    0.890215   Yes
 Data in & out    1000000    32      250   0.836860     9560    0.988088   Yes
 Data in & out   10000000    32       25   0.850294     9409    0.998796   Yes


                   notOpenMP MFLOPS Pi 400 1800 MHz 64 bits Max 8.2 GFLOPS

 Data in & out     100000     2     2500   0.128715     3885    0.929538   Yes  0.99
 Data in & out    1000000     2      250   1.012816      494    0.992550   Yes  0.79
 Data in & out   10000000     2       25   1.232502      406    0.999250   Yes  0.63

 Data in & out     100000     8     2500   0.301728     6628    0.957117   Yes  1.00
 Data in & out    1000000     8      250   1.097623     1822    0.995518   Yes  0.82
 Data in & out   10000000     8       25   1.021233     1958    0.999549   Yes  0.84

 Data in & out     100000    32     2500   0.981493     8151    0.890215   Yes  1.27
 Data in & out    1000000    32      250   1.243212     6435    0.988088   Yes  1.05
 Data in & out   10000000    32       25   1.131976     7067    0.998796   Yes  1.14






  


OpenMP-MemSpeed - OpenMP-MemSpeed2C8, NotOpenMP-MemSpeed2C8
OpenMP-MemSpeed264g8, NotOpenMP-MemSpeed64g8

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with the example single core results also shown after the detailed measurements. Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP. The benchmark demonstrates that OpenMP might be unsuitable for producing performance gains on what appears to be suitable code. There might also be compile options that overcome this problem.
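
The three test forms named in the column headings of the tables below can be sketched in C for the double precision case. This is an illustration only: the function names are hypothetical, the placement of the OpenMP directive is an assumption, and the pragma is simply ignored by GCC when `-fopenmp` is not used.

```c
/* x[m] = x[m] + s*y[m] */
void ms_triad(double *x, const double *y, double s, long n)
{
    #pragma omp parallel for
    for (long m = 0; m < n; m++)
        x[m] = x[m] + s * y[m];
}

/* x[m] = x[m] + y[m] */
void ms_add(double *x, const double *y, long n)
{
    #pragma omp parallel for
    for (long m = 0; m < n; m++)
        x[m] = x[m] + y[m];
}

/* x[m] = y[m] */
void ms_copy(double *x, const double *y, long n)
{
    #pragma omp parallel for
    for (long m = 0; m < n; m++)
        x[m] = y[m];
}
```

Loops this simple do very little arithmetic per word, so the cost of starting and synchronising threads, or changes in how the compiler vectorises the loop, can outweigh any gain from using more cores.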

Performance comparisons are provided for samples of the OpenMP results and some for single core operation. These can be interpreted as demonstrating the usual 1800 MHz gains of 20% for CPU speed limited tests, at 32 bits, and no gain using RAM based data, but subject to wide variations. MP speed at 64 bits was little different to that at 32 bits, but the single core version appeared to be somewhat faster executing double precision calculations.

                  Memory Reading Speed Test OpenMP                      

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]     
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

                         Pi 4B 1500 MHz 32 bits                         

       4    8097   8322   8641   8020   8436   8384  39701  19701  19712
       8    7814   8555   8756   8321   8548   8526  39042  19984  19996
      16    8149   7738   7742   8303   7779   8192  37995  19883  19984
      32    8969   8769   8799   9040   8759   8743  37737  20133  20130
      64    7617   7457   7437   7575   7380   7422  17770  15332  14248
     128   11221  10936  11003  11105  11011  10986  13650  13910  13881
     256   17883  18144  18036  17691  18094  17844  13073  12465  12535
     512   18001  18468  19675  17075  18221  19264  13511  13895  12008
    1024    9532  10590   9772  11842  11282  11277   7173   9473   9496
    2048    7095   7025   6866   7117   7043   6946   2914   3475   3468
    4096    7244   6927   7036   5951   7054   6531   2582   3130   3122
    8192    4578   7173   7025   6322   7078   7182   2504   3127   3115
   16384    5470   7043   7067   7103   7052   7020   2557   3093   3088
   32768    7359   7817   7766   7158   7078   7757   2618   3066   3094
   65536    7810   7268   7266   3824   7478   5164   2486   3016   2931
  131072    2460   2655   7224   7513   7308   7339   2540   2944   2940

 Not OMP                                                                
       8   11775   3895   4342  11787   4325   4354  10334   7806   7816
     256   10032   3699   4223   9978   4289   4185   7105   7612   7621
   65536    2099   2587   3033   2103   3021   3001   2585   1105   1101

                            Pi 400 1800 MHz 32 bits                     

       4    9870  10099  10417   9594  10121  10071  47620  23674  23662
       8    9413  10284  10511   9978  10239  10233  46992  24003  24005
      16    9462  10322  10557   9446  10264  10234  45814  24091  23992
      32   10985  10180  10356  10898  10204  10258  45479  24148  24160
      64   11212  10952  10978  11184  10992  10938  30749  22302  21768
     128   14481  14069  14237  14437  14228  14265  16353  17408  17293
     256   20737  21740  21905  20853  21742  21892  14898  15113  15109
     512   20922  22457  23702  20469  21381  22893  14975  16248  16272
    1024   14626  12757  12104  13595  13508  12422  10711  12240  10897
    2048    5193   7184   7224   7309   7238   7227   2990   3347   3355
    4096    7839   6201   7620   7822   7646   7561   2650   2997   3016
    8192    7867   7820   7844   7778   7736   7736   2494   2961   2877
   16384    8089   7768   7800   5995   7829   7996   2508   2858   2840
   32768    1921   7278   7313   7756   7308   7552   2659   2895   2869
   65536    2302   7267   2708   5814   6992   7310   2597   2769   2801
  131072    3730   2546   2841   7442   2617   5254   2611   2804   2764

 Not OMP                                                                
       4   13781   4653   5235  13804   5216   5215  11951   9142   9130
     256   12185   4441   5064  11889   5009   5001   9702   9260   9256
   65536    1108   1418   3026   2016   2264   3038   2551    945    901

400/4B 32 bits Samples

       4    1.22   1.21   1.21   1.20   1.20   1.20   1.20   1.20   1.20
       8    1.20   1.20   1.20   1.20   1.20   1.20   1.20   1.20   1.20
      16    1.16   1.33   1.36   1.14   1.32   1.25   1.21   1.21   1.20
     128    1.29   1.29   1.29   1.30   1.29   1.30   1.20   1.25   1.25
     256    1.16   1.20   1.21   1.18   1.20   1.23   1.14   1.21   1.21
     512    1.16   1.22   1.20   1.20   1.17   1.19   1.11   1.17   1.36
   32768    0.26   0.93   0.94   1.08   1.03   0.97   1.02   0.94   0.93
   65536    0.29   1.00   0.37   1.52   0.94   1.42   1.04   0.92   0.96
  131072    1.52   0.96   0.39   0.99   0.36   0.72   1.03   0.95   0.94

 Not OMP                                                                
       8    1.17   1.19   1.21   1.17   1.21   1.20   1.16   1.17   1.17
     256    1.21   1.20   1.20   1.19   1.17   1.19   1.37   1.22   1.21
   65536    0.53   0.55   1.00   0.96   0.75   1.01   0.99   0.86   0.82

  64 Bit OpenMP-MemSpeed Results below


Results OpenMP-MemSpeed264g8, NotOpenMP-MemSpeed64g8

  
                            Pi 400 1800 MHz 64 bits                     

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S
  
       4    9304  10175  10500   8950  10229  10240  47444  22323  22326
       8    9867  10400  10669   9767  10411  10432  46796  22663  22659
      16    9624  10210  10012   9400  10097  10223  45939  22831  22830
      32   11543  10488  10590  11533  10484  10479  45358  22849  22850
      64    9976   9351   9360   9833   9300   9287  23856  19396  19353
     128   12504  12426  12326  12438  12399  12252  16646  16057  16029
     256   20937  22629  22372  21707  22453  22434  14964  15050  15079
     512   21415  22091  21606  20618  22166  21289  15254  16012  15761
    1024    9413  11347   9278  12991  12294  12168   6405   9009   9213
    2048    7451   6795   7128   7292   7099   7047   2945   3091   3179
    4096    6679   7016   7220   7210   7028   7107   2784   2897   2919
    8192    7741   6345   7626   7625   7532   7374   2461   2837   2842
   16384    7557   7331   7650   7556   7345   7729   2559   2924   2927
   32768    2214   7172   7066   7283   7125   7105   2489   2836   2829
   65536    2309   6584   7193   7265   6608   6570   2662   2748   2755
  131072    5200   6211   7275   7400   7287   7248   2603   2805   2764

 Not OMP                                                                
       8   18146   4545   5046  17725   5043   5051  13599  10933  10999
     256   14712   4648   5064  14594   5040   5064   9724   9855   9853
   65536    2046   3002   3077   2029   3067   3070   2550   2569   2568

Pi 400 64 bits/32 bits

       4    0.94   1.01   1.01   0.93   1.01   1.02   1.00   0.94   0.94
       8    1.05   1.01   1.02   0.98   1.02   1.02   1.00   0.94   0.94
      16    1.02   0.99   0.95   1.00   0.98   1.00   1.00   0.95   0.95
     128    0.86   0.88   0.87   0.86   0.87   0.86   1.02   0.92   0.93
     256    1.01   1.04   1.02   1.04   1.03   1.02   1.00   1.00   1.00
     512    1.02   0.98   0.91   1.01   1.04   0.93   1.02   0.99   0.97
   32768    1.15   0.99   0.97   0.94   0.97   0.94   0.94   0.98   0.99
   65536    1.00   0.91   2.66   1.25   0.95   0.90   1.03   0.99   0.98
  131072    1.39   2.44   2.56   0.99   2.78   1.38   1.00   1.00   1.00

 Not OMP                                                                
       8    1.32   0.98   0.96   1.28   0.97   0.97   1.14   1.20   1.20
     256    1.21   1.05   1.00   1.23   1.01   1.01   1.00   1.06   1.06
   65536    1.85   2.12   1.02   1.01   1.35   1.01   1.00   2.72   2.85
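The ratio tables above can be reproduced from the MB/S listings. The following is a minimal sketch, not the method used to produce the report, using the two 131072 KB OpenMP rows (the 32 bit row is the last OpenMP row listed before the ratio tables). The function name is hypothetical.

```python
# Rows copied from the 131072 KB lines of the OpenMP-MemSpeed listings (MB/S).
pi400_32bit = [3730, 2546, 2841, 7442, 2617, 5254, 2611, 2804, 2764]
pi400_64bit = [5200, 6211, 7275, 7400, 7287, 7248, 2603, 2805, 2764]


def speed_ratios(new, old):
    """Return new/old for each column, rounded to two decimals as in the tables."""
    return [round(n / o, 2) for n, o in zip(new, old)]


ratios = speed_ratios(pi400_64bit, pi400_32bit)
print(" ".join(f"{r:6.2f}" for r in ratios))
```

The values agree with the 131072 KB line of the 64 bits/32 bits table, where the large gains come from particularly slow 32 bit measurements rather than unusually fast 64 bit ones.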
 


Java Whetstone Benchmark - whetstc.class

The Java benchmarks comprise class files that were produced some time ago, but the source code is available to regenerate them. Performance can vary significantly between Java Virtual Machines, so comparisons might not be appropriate. Note that some speeds here are effectively the same as those found running the C compiled version above, with Pi 400 speed gains, at 32 bits, of around 20%. Using these particular versions of Java, some floating point functions were slower at 64 bits.

 ********************* Pi 4B 1500 MHz 32 bits ********************

  Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900
  N6 floating point   0.999999821    345.95             1.5592
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft

 ******************** Pi 400 1800 MHz 32 bits ********************
 
    Whetstone Benchmark Java Version, Jul 30 2020, 11:49:33

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    629.92             0.0305
  N2 floating point  -1.131330490    584.35             0.2300
  N3 if then else     1.000000000             415.00    0.2494
  N4 fixed point     12.000000000            1315.79    0.2394
  N5 sin,cos etc.     0.499110132              71.72    1.1600
  N6 floating point   0.999999821    415.05             1.2996
  N7 assignments      3.000000000             399.48    0.4626
  N8 exp,sqrt etc.    0.825148463              32.95    1.1290

  MWIPS                             2083.13             4.8005

  Operating System    Linux, Arch. arm, Version 5.4.51-v7l+
  Java Vendor         Raspbian, Version  11.0.8

 ******************** Pi 400 1800 MHz 64 bits ********************
 
     Whetstone Benchmark Java Version, Aug 26 2020, 19:37:28

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    624.19             0.0308
  N2 floating point  -1.131330490    577.32             0.2328
  N3 if then else     1.000000000             405.88    0.2550
  N4 fixed point     12.000000000            1582.12    0.1991
  N5 sin,cos etc.     0.499110132              57.54    1.4460
  N6 floating point   0.999999821    331.53             1.6270
  N7 assignments      3.000000000             359.25    0.5144
  N8 exp,sqrt etc.    0.825148463              30.47    1.2210

  MWIPS                             1809.61             5.5261

  Operating System    Linux, Arch. aarch64, Version 5.4.51-v8+
  Java Vendor         Debian, Version  11.0.8
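As a rough check on the comments above, the sketch below compares the floating point lines and MWIPS ratings from the three listings. The dictionary layout and function name are illustrative only.

```python
# Speeds copied from the three Java Whetstone listings above
# (MFLOPS for N1, N2, N6; MOPS for N5, N8; overall MWIPS).
pi4b_32 = {"N1": 524.02, "N2": 494.12, "N5": 59.86, "N6": 345.95, "N8": 25.41, "MWIPS": 1687.92}
pi400_32 = {"N1": 629.92, "N2": 584.35, "N5": 71.72, "N6": 415.05, "N8": 32.95, "MWIPS": 2083.13}
pi400_64 = {"N1": 624.19, "N2": 577.32, "N5": 57.54, "N6": 331.53, "N8": 30.47, "MWIPS": 1809.61}


def ratios(new, old):
    """Speed ratio new/old for each test, rounded to two decimals."""
    return {name: round(new[name] / old[name], 2) for name in old}


gain_32 = ratios(pi400_32, pi4b_32)      # Pi 400 versus Pi 4B, both 32 bits
gain_64 = ratios(pi400_64, pi400_32)     # Pi 400 64 bits versus 32 bits

for name in pi4b_32:
    print(f"{name:6} 400/4B {gain_32[name]:5.2f}   64/32 {gain_64[name]:5.2f}")
```

On these figures the 32 bit Pi 400 gains are close to the 1.20 clock ratio, while the 64 bit N5 and N6 tests run at about 0.80 of the 32 bit speed.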
  


JavaDraw Benchmark - JavaDrawPi.class

The benchmark draws simple objects, in numbers ranging from small to rather excessive, to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.

At 32 bits, for this to run at maximum speed on the original Pi 4B, it was necessary to disable the experimental GL driver. The initial Pi 400 run was also made without that driver enabled. In that run, the early tests, which depend on graphics speed, were the same as those for the Pi 4B, with the later CPU limited ones up to some 20% faster. This Pi 400 test was rerun with part of the display window on the monitor and part on a TV, via dual monitor operation, then completely on the TV. The affected, somewhat slower, FPS speeds are shown below and were seen in both cases.

Using the 64 bit Pi 400 configuration, performance was much the same with the original and experimental GL drivers. It was also similar to the first Pi 400 results at 32 bits.
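The relationship between the Frames and FPS columns in the listings that follow can be checked with a short sketch. It assumes FPS is simply frames divided by elapsed seconds for each test, which is an inference from the figures and not a statement of how the benchmark is coded.

```python
# (frames, fps) pairs copied from the Pi 4B 1500 MHz 32 bits listing.
pi4b_results = [
    (877, 87.65),
    (1042, 104.18),
    (1015, 101.47),
    (779, 77.85),
    (336, 33.52),
    (83, 8.25),
]

# Assumed relationship: fps = frames / seconds, so seconds = frames / fps.
seconds_per_test = [frames / fps for frames, fps in pi4b_results]
total_seconds = sum(seconds_per_test)

for (frames, fps), secs in zip(pi4b_results, seconds_per_test):
    print(f"{frames:5d} frames at {fps:7.2f} FPS -> {secs:6.3f} seconds")
print(f"Total {total_seconds:.1f} seconds")
```

Each test works out at slightly more than 10 seconds, and the sum agrees with the reported total elapsed time of 60.1 seconds.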

 ********************* Pi 4B 1500 MHz 32 bits ********************

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


 ******************** Pi 400 1800 MHz 32 bits ********************

   Java Drawing Benchmark, Jul 30 2020, 12:01:08
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      904    90.36
  Display PNG Bitmap Twice Pass 2     1038   103.79
  Plus 2 SweepGradient Circles        1019   101.84
  Plus 200 Random Small Circles        855    85.41
  Plus 320 Long Lines                  391    39.08
  Plus 4000 Random Small Circles       102    10.11

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 5.4.51-v7l+
  Java Vendor         Raspbian, Version  11.0.8

 ******************* Dual Monitor + TV  Involved *****************

  Display PNG Bitmap Twice Pass 1      698    69.75
  Display PNG Bitmap Twice Pass 2      909    90.84
  Plus 2 SweepGradient Circles         918    91.78


 ** 32 bit Pi 400 after enabling experimental desktop GL driver **

   Java Drawing Benchmark, Jul 31 2020, 10:08:07
            Produced by javac 1.6.0_27

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1     1164   116.33
  Display PNG Bitmap Twice Pass 2     1346   134.49
  Plus 2 SweepGradient Circles        1317   131.62
  Plus 200 Random Small Circles        976    97.53
  Plus 320 Long Lines                  402    40.12
  Plus 4000 Random Small Circles       103    10.27

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 5.4.51-v7l+
  Java Vendor         Raspbian, Version  11.0.8
  



64 Bit JavaDraw Benchmark - JavaDrawPi.class


 ******************** Pi 400 1800 MHz 64 bits ********************

   Java Drawing Benchmark, Aug 26 2020, 19:38:46
            Produced by javac 1.8.0_222

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      860    85.92
  Display PNG Bitmap Twice Pass 2      957    95.68
  Plus 2 SweepGradient Circles        1002   100.18
  Plus 200 Random Small Circles        843    84.24
  Plus 320 Long Lines                  402    40.12
  Plus 4000 Random Small Circles        99     9.86

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 5.4.51-v8+
  Java Vendor         Debian, Version  11.0.8


 ** 64 bit Pi 400 after enabling experimental desktop GL driver **

   Java Drawing Benchmark, Aug 26 2020, 20:09:05
            Produced by javac 1.8.0_222

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      800    79.94
  Display PNG Bitmap Twice Pass 2      966    96.51
  Plus 2 SweepGradient Circles         999    99.81
  Plus 200 Random Small Circles        864    86.30
  Plus 320 Long Lines                  409    40.83
  Plus 4000 Random Small Circles       109    10.86

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 5.4.51-v8+
  Java Vendor         Debian, Version  11.0.8


 ******************* Dual Monitor + TV  Involved *****************

  Dual Monitor Part on monitor and part on TV

  Display PNG Bitmap Twice Pass 1      748    74.72
  Display PNG Bitmap Twice Pass 2      872    87.15
  Plus 2 SweepGradient Circles         914    91.37
  


32 Bit OpenGL GLUT Benchmark - videogl32

In 2012, I approved a request from a Quality Engineer at Canonical to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark or, with selected functions, as a stress test of any duration.

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file. The first command is needed to avoid VSYNC, allowing FPS to be greater than 60.

  export vblank_mode=0                                     
  ./videogl32 Width 320, Height 240, NoEnd                 
  ./videogl32 Width 640, Height 480, NoHeading, NoEnd      
  ./videogl32 Width 1024, Height 768, NoHeading, NoEnd     
  ./videogl32 Width 1920, Height 1080, NoHeading           
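For runs with other window sizes, the same sequence of commands could be generated from a list. This is only a sketch: the helper names are hypothetical, and the comments on NoHeading and NoEnd describe how they appear to be used in the script above, not documented behaviour.

```python
import os
import subprocess

# Window sizes taken from the script above.
SIZES = [(320, 240), (640, 480), (1024, 768), (1920, 1080)]


def build_commands(program="./videogl32", sizes=SIZES):
    """Return one argument list per window size, mirroring the script above."""
    commands = []
    for index, (width, height) in enumerate(sizes):
        options = []
        if index > 0:
            options.append("NoHeading")   # used on every run except the first
        if index < len(sizes) - 1:
            options.append("NoEnd")       # used on every run except the last
        line = f"{program} Width {width}, Height {height}, " + ", ".join(options)
        commands.append(line.split())
    return commands


def run_all(commands):
    """Run each command with vblank_mode=0 set, as 'export vblank_mode=0' does."""
    env = dict(os.environ, vblank_mode="0")
    for args in commands:
        subprocess.run(args, env=env, check=True)


for args in build_commands():
    print(" ".join(args))
```

Calling run_all(build_commands()) from the directory containing the program would then execute the four runs in order.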
  

The first Pi 400 results indicated slower performance on most tests, particularly at the larger window sizes, excluding those for the kitchen displays. The latter are more CPU speed limited and provided the hoped for performance improvement of 20% or more. Then, I remembered using an experimental desktop GL driver, enabled via sudo raspi-config. This was used on the Pi 400, where G3 GL OpenGL desktop driver with full KMS was selected. This produced the same or better Pi 400 performance than the Pi 4B.

As indicated below, the dual monitor connections enabled this option to be tested. The default full screen pixel settings were applied across both monitors, 2 x 1920 wide in this case.

 ********************* Pi 4B 1500 MHz 32 bits ********************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu May  2 19:01:05 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    766.7    371.4    230.6    130.2     32.5     22.7
   640   480    427.3    276.5    206.0    121.8     31.7     22.2
  1024   768    193.1    178.8    150.5    110.4     31.9     21.5
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0

 ******************** Pi 400 1800 MHz 32 bits ********************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu Jul 30 12:31:31 2020

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    688.1    405.2    223.1    138.2     42.8     29.0
   640   480    319.4    281.4    200.1    126.8     41.4     27.8
  1024   768    140.3    134.5    113.9    103.0     40.2     27.1
  1920  1080     57.7     56.3     53.5     49.6     37.4     24.0

 ******************* Pi 400 New Driver 32 bits ******************

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu Jul 30 13:59:55 2020

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    823.6    435.1    244.5    140.7     42.5     28.7
   640   480    427.8    310.0    219.6    134.3     42.1     28.3
  1024   768    192.3    181.9    149.9    116.3     40.9     27.0
  1920  1080     81.7     79.0     73.7     67.4     38.1     24.5


 ****************** Pi 400 Dual Monitor 32 bits ******************

  3840  1080     27.0     26.6     26.3     25.1     27.3     19.3
 



64 Bit OpenGL GLUT Benchmark - videogl64

In this case (at this early stage?), the 64 bit default driver appeared to produce much slower performance than the 32 bit Pi 400 system. The later driver also produced slower performance on the early tests, which depend on graphics speed, but was some 20% faster on the last tests, which depend on processor power.

The dual monitor test results were similar to those at 32 bits.

 ******************** Pi 400 1800 MHz 64 bits ********************

 GLUT OpenGL Benchmark 64 Bit gcc 9, Wed Aug 26 19:53:43 2020

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    334.3    162.1    173.9     90.0     27.1     23.7
   320   240    220.5    131.9    128.7     74.1     25.0     21.5
   640   480    109.4     81.0     80.6     55.7     22.2     17.9
  1024   768     57.5     47.5     45.4     34.2     18.2     13.4
  1920  1080     27.0     24.3     22.0     18.9     14.3      8.4


******************* Pi 400 New Driver 64 bits ******************

 GLUT OpenGL Benchmark 64 Bit gcc 9, Wed Aug 26 20:03:54 2020

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    783.4    446.7    286.4    170.7     50.9     35.5
   320   240    659.3    406.0    265.5    160.8     51.9     35.1
   640   480    319.2    276.9    229.0    144.2     47.5     32.7
  1024   768    140.2    134.4    122.4    113.2     48.1     32.5
  1920  1080     57.8     56.5     55.6     52.4     46.7     29.8


 ****************** Pi 400 Dual Monitor 64 bits ******************

  3840  1080     27.2     26.6     27.0     26.0     27.5     21.4
  



I/O Benchmarks

Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and WiFi network connections. The programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB and, finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of large files and the file path. The same program code is used for both varieties, the only difference being file opening properties. The drive benchmark includes extra options to use direct I/O, avoiding data caching in main memory, but also includes an extra test with caching allowed. For further details and downloads see the PDF file.
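In the 200 small files results, the MB/sec and ms/file lines are two views of the same measurement. The sketch below shows one conversion that appears consistent with the listings, assuming 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes. Those units are inferred from the figures and are not confirmed by the source.

```python
def small_file_mb_per_sec(file_kb, ms_per_file):
    """MB/sec for one file, assuming 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes."""
    bytes_per_file = file_kb * 1024
    seconds = ms_per_file / 1000.0
    return bytes_per_file / seconds / 1_000_000


# (file KB, ms/file, reported MB/sec) from the Pi 4B 32 bit LanSpeed listing.
pi4b_lan = [
    (4, 2.78, 1.47), (8, 2.92, 2.80), (16, 3.19, 5.14),   # writing
    (4, 1.66, 2.47), (8, 1.74, 4.71), (16, 1.90, 8.61),   # reading
]

for file_kb, ms_per_file, reported in pi4b_lan:
    calculated = small_file_mb_per_sec(file_kb, ms_per_file)
    print(f"{file_kb:3d} KB {ms_per_file:5.2f} ms  calculated {calculated:5.2f}  reported {reported:5.2f}")
```

Small differences in the last digit are expected, as the ms/file values are themselves rounded.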


LanSpeed Benchmark - (1G bits per second Ethernet) - LanSpeed, LanSpeed64g8

Measured performance can vary significantly, but both Pi 4B and Pi 400 tests demonstrated Gigabit performance on the large files. Of particular note (with my program), these 32 bit systems indicated that the 2 GB file could not be written, with the actual file size ending at 2,147,483,647 Bytes (or 2^31 - 1). Also note the more consistent speeds handling 1 GB files.

The default 64 bit benchmark produced similar performance to the 32 bit version. However, a major advantage of the former is its ability to handle much larger files, as illustrated below at 3 and 6 GB.
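The 2,147,483,647 byte figure noted above is the largest value of a signed 32 bit integer. The following sketch shows the arithmetic, assuming the 2 GB file is 2 x 2^30 bytes and that file sizes or offsets in the 32 bit program are held in a signed 32 bit value. Both points are assumptions, as the source does not state the cause.

```python
INT32_MAX = 2**31 - 1   # largest signed 32 bit integer
GIB = 2**30             # assumed size of "1 GB" in bytes


def fits_signed_32(n_bytes):
    """True if a byte count can be represented as a signed 32 bit integer."""
    return n_bytes <= INT32_MAX


for label, size in [("1 GB", 1 * GIB), ("2 GB", 2 * GIB), ("3 GB", 3 * GIB), ("6 GB", 6 * GIB)]:
    print(f"{label}: {size:>13,} bytes, fits in signed 32 bits: {fits_signed_32(size)}")
```

Under these assumptions only the 1 GB file fits, with the 2 GB file exceeding the limit by a single byte.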

 ******************** Pi 4B 1500 MHz 32 bits ******************

                      MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    67.82    12.97    90.19    99.84    93.49    96.83
  16    92.25    92.66    92.96    103.9   105.28    91.17

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.007     0.01     0.04     1.01     0.85     0.91

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16    secs
 MB/sec      1.47     2.80     5.14     2.47     4.71     8.61
 ms/file     2.78     2.92     3.19     1.66     1.74     1.90    0.256
 
Large File  Write MBytes/Second   Read MBytes/Second

1 GB    96.13    93.34    94.98   114.51   112.16   114.91
2 GB   Error writing file Segmentation fault

  ******************* Pi 400 1800 MHz 32 bits ******************

                      MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    47.07    87.12    90.94   102.11   100.03   100.24
  16    82.75    90.84    91.03   106.19   106.39   105.10

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.007     0.02     0.43     0.98     0.90     0.89

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      1.35     2.62     5.04     2.21     4.10     6.88
 ms/file     3.03     3.12     3.25     1.85     2.00     2.38    0.184
 
 Large File Write MBytes/Second   Read MBytes/Second

 1 GB  109.69   111.03   107.39   112.28   112.72   112.02
 2 GB  Error writing file Segmentation fault

  ******************* Pi 400 1800 MHz 64 bits ******************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    46.59    89.13    93.19   103.35    73.78    65.73
  16    65.89    96.57    67.83    90.43   105.20   105.43

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.017    0.397     1.09     1.02     1.05

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      1.36     2.64     5.11     1.95     3.33     8.55
 ms/file     3.01     3.11     3.21     2.10     2.46     1.92    0.194
 
 Large File Write MBytes/Second    Read MBytes/Second

 3 GB   114.00   114.11   114.93   112.31   114.79   116.96
 6 GB    92.46    92.06   114.06   115.22   115.57   113.66



LanSpeed Benchmarks - WiFi - LanSpeed, LanSpeed64g8

Following are the earlier Pi 4B results and those for the Pi 400, using 2.4 GHz and 5 GHz WiFi frequencies, communicating with a Windows 7 based PC. Details on setting up the links can be found in the LAN/WiFi section of this PDF file. Performance of the two systems was reasonably similar at 2.4 GHz, exhibiting normal variations at these file sizes. With my setup, consistent 5 GHz operation was extremely difficult to achieve in both cases, but the results shown indicate the most frequent performance patterns. The main difference was the particularly slow Pi 400 reading speeds, which at 5 GHz were apparently lower than at 2.4 GHz.

Results at 64 bits were similar to those at 32 bits, but it took many more attempts to run at 5 GHz.
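To put the reading speed comparison into figures, the sketch below averages the six large file read speeds from each 32 bit WiFi listing and converts them to megabits per second. The conversion assumes 1 MByte = 1,000,000 bytes, which is not stated in the source.

```python
from statistics import mean

# Read1 to Read3 at 8 MB, then at 16 MB, in MBytes/second, from the 32 bit listings.
reads = {
    ("Pi 4B", "2.4 GHz"): [7.05, 6.98, 7.10, 7.19, 6.53, 7.22],
    ("Pi 4B", "5 GHz"): [10.11, 9.55, 9.66, 9.91, 8.88, 9.92],
    ("Pi 400", "2.4 GHz"): [6.91, 5.82, 7.01, 7.04, 6.05, 6.36],
    ("Pi 400", "5 GHz"): [4.03, 4.20, 4.14, 4.18, 4.17, 4.16],
}


def mean_read(system, band):
    """Average large file read speed in MBytes/second."""
    return mean(reads[(system, band)])


def to_mbits(mbytes_per_sec):
    """Assumption: 1 MByte = 1,000,000 bytes, so Mbits/s = 8 x MBytes/s."""
    return 8 * mbytes_per_sec


for system, band in reads:
    avg = mean_read(system, band)
    print(f"{system:6} {band:7} {avg:6.2f} MB/s  {to_mbits(avg):6.1f} Mbit/s")
```

On these samples the Pi 4B reads faster at 5 GHz than at 2.4 GHz, while the Pi 400 reads slower.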

 **************** Pi 4B 1500 MHz 2.4 GHz 32 bit **************

                        MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8    6.35    6.33    6.38    7.05    6.98    7.10
      16    6.70    6.82    6.76    7.19    6.53    7.22

 Random     Read                    Write
 From MB       4       8      16       4       8      16
 msecs     2.691   2.875   3.048    3.13    2.93    2.84

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.34    0.44    1.04    0.37    0.37    1.26
 ms/file   12.14   18.59    15.7    11.1    22.2   12.99   2.153


***************** Pi 4B 1500 MHz 5 GHz 32 bit ***************

                         MBytes/Second
      MB  Write1  Write2  Write3   Read1   Read2   Read3

       8   11.90   12.96   13.16   10.11    9.55    9.66
      16   11.50   13.93   14.13    9.91    8.88    9.92

 200 Files  Write                   Read                  Delete
 File KB       4       8      16       4       8      16   secs
 MB/sec     0.13    0.46    0.91    0.25    0.55    1.02
 ms/file   30.85   17.83   18.10   16.62   14.93   16.01   3.361

 Random similar to 2.4 GHz


 *************** Pi 400 1800 MHz 2.4 GHz 32 bit **************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     2.02     6.08     6.59     6.91     5.82     7.01
  16     6.78     6.64     6.70     7.04     6.05     6.36

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      3.234    3.354    3.637     4.12     3.72     3.72

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.36     0.61     1.07     0.46     0.85     1.55
 ms/file    11.50    13.37    15.34     8.88     9.59    10.55    2.924



 **************** Pi 400 1800 MHz 5 GHz 32 bit ***************

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     2.85     9.75     9.82     4.03     4.20     4.14
  16    11.42    10.20    10.14     4.18     4.17     4.16

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      3.006    3.206    3.276     3.55     3.29     3.28

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.42     0.50     0.34     0.48     0.88     1.44
 ms/file     9.72    16.44    48.26     8.61     9.30    11.39    2.812
 



LanSpeed Benchmarks - WiFi - LanSpeed64g8

Windows Perfmon was executed, to indicate volume and reliability of network traffic, at the same time as a run of the 5 GHz benchmark. This confirmed the measured data transfer speeds of large files and indicated no errors or discards.

 *************** Pi 400 1800 MHz 2.4 GHz 64 bit **************
 
                       MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     5.93     5.91     5.98     6.79     5.75     6.62
  16     6.51     3.23     6.61     6.08     5.72     6.19

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      3.240    3.720    3.651     4.14     3.92     4.16

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.32     0.58     1.00     0.42     0.79     1.44
 ms/file    12.92    14.14    16.39     9.80    10.42    11.36    1.335


 *************** Pi 400 1800 MHz 5 GHz 64 bit ****************

                         MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    11.55    12.00    12.24     4.16     4.32     4.28
  16    12.21    12.41    12.34     4.13     4.28     4.24

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      2.738    2.882    2.967     3.10     2.87     2.89

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.49     0.43     0.64     0.54     0.92     1.54
 ms/file     8.42    19.06    25.54     7.65     8.87    10.66    1.009
 


USB Booting

The initial 8 GB SD card was particularly slow on booting and on running initial benchmarks. I cloned it to a 32 GB SanDisk Ultra card and that was fine. In turn, to test USB booting, I then copied that to a faster 64 GB USB stick and, for the highest data transfer rate, to a 500 GB partition of a 1 TB hard drive. Without investigating, I expanded the filesystem on the HD and that made it unusable (Windows indicated a total size of 256 GB, unreadable). That did not matter as there was no useful information on the drive. I repartitioned the drive, via Ubuntu, with >64 GB for Raspbian, where cloning to that from the 64 GB image was satisfactory, without expansion. Two further partitions were created, formatted as FAT32 and EXT4, each occupying half of the remaining space.

As indicated later, the USB 3 drives produced higher data transfer speeds than the 32 GB SD card, but were slower on booting, as shown in the following early life measurements, which could change. Part of the reason for slow booting is explained in an initial display, indicating that an SD card cannot be found, followed later by an apparent search for a bootable device.

                              Seconds
                              Initial   Total To  Reboot To
  Drive                       Display    Desktop    Desktop

  32 bit  8 GB SD card            N/A         46         68
  32 bit 32 GB SD card            N/A         22         26
  32 bit 32 GB SD USB Reader        7         31         30
  32 bit 64 GB USB Stick           25         46         29
  32 bit 64 GB HD Partition        29         63         64
  64 bit 32 GB SD card            N/A         22         26
  64 bit 32 GB SD USB Reader        7         25         28



32 Bit USB 3 and Main Drive Benchmarks - DriveSpeed

Large Files - Below are DriveSpeed benchmark results, initially from the four main drives. The SD cards obtained the same level of performance when booted using a USB 3 card reader. The main drive data area is formatted as Ext4. Performance on large files identifies possible benefits of using alternative drives to SD cards. The 64 GB USB stick appeared to write at a much slower rate with larger files, starting at somewhat less than 500 MB. Using the secondary disk partitions, writing to EXT4 format is shown to be faster than to FAT32. It seems that, compared with the Pi 4B, the Pi 400 can provide superior writing performance with FAT32 but not with EXT4.

Random Access - The measured access times can vary widely, with the reasons for differences difficult to identify. Traditionally, hard disk drive times would normally be greater than half the revolution time, 5.5 ms in this case. However, this Toshiba Canvio is said to have an 8 MB buffer, indicating that most accesses could be to the buffer, at bus speeds.

200 Small Files - Under EXT4 format, hard drive performance was indicated as being far superior, with the 8 GB SD card worst. Hard drive performance at FAT32 was exceptionally bad, with a sector size of 32 KB, when each of the 200 files was that size. Pi 400 and Pi 4B performance was essentially the same running all these tests (those identical results were double checked).
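The half revolution time quoted above follows from the spindle speed. The sketch below assumes a 5400 RPM drive, which is inferred from the 5.5 ms figure and is not stated in the source.

```python
def half_revolution_ms(rpm):
    """Average rotational latency: time for half a revolution, in milliseconds."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2


# Assumption: 5400 RPM spindle speed (not given in the report).
print(f"5400 RPM: {half_revolution_ms(5400):.2f} ms")
```

This gives about 5.56 ms, in line with the 5.5 ms quoted. The measured random read times in the listings below are well under 1 ms, which supports the view that most accesses were satisfied from a buffer or cache.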

Large Files                             MBytes/Second
                      MB   Write1   Write2   Write3    Read1    Read2    Read3

 8 GB SD Card         16     7.44     5.23     6.13    22.88    22.71    22.12
32 GB SD Card         16    19.02    17.56    17.39    44.71    43.50    44.84
64 GB USB Drive       16    74.42    77.55    76.80   129.86   130.65   129.75
64 GB USB Drive      500    30.92    23.74    29.67   132.11   131.10   132.16
64 GB USB Drive     2000    28.78    28.77    29.45   131.87   132.27   132.33

64 GB HD Partition    16    55.80    81.05    52.98   134.06   142.09   143.91
64 GB HD Partition  2000   149.83   148.52   146.76   151.64   151.99   150.15
64 GB HD Pi 4B USB  2000   147.17   146.79   146.45   148.38   151.14    97.80

Same HD Pi400 FAT   2000    83.27    82.66    83.22   143.79   144.02   144.06
Same HD Pi400 EXT4  2000   125.74   123.60   120.72   130.20   128.47   124.88
Same HD Pi4B  FAT   2000    68.10    66.83    67.67   148.63   148.69   149.25
Same HD Pi4B  EXT4  2000   125.36   118.11   122.68   130.29   127.10   128.45

Random                          Read                       Write
                     From MB        4        8       16        4        8       16

 8 GB SD Card        msecs      0.436    0.417    0.406     2.86     2.87    79.52
32 GB SD Card        msecs      0.250    0.249    0.279     1.61     1.50     1.55
64 GB USB Drive      msecs      0.671    0.675    0.671     2.14     2.20     2.18
64 GB HD Partition   msecs      0.170    0.647    0.426     5.18    11.79    11.13
64 GB HD Pi 4B USB   msecs      0.976    0.356    0.367     0.68     0.64     0.68
Same HD Pi400 FAT    msecs      0.169    0.170    0.170     0.66     0.63     0.70
Same HD Pi400 EXT4   msecs      0.436    0.486    0.314     0.71     0.64     0.70
Same HD Pi4B  FAT    msecs      0.573    0.515    0.368     0.63     0.58     0.65
Same HD Pi4B  EXT4   msecs      1.087    0.391    0.286     0.68     0.63     0.68

200 Small Files                 Write                      Read
                     File KB        4        8       16        4        8       16

 8 GB SD Card        MB/sec      0.42     2.59     2.61     5.63     8.95    12.15
32 GB SD Card        MB/sec      2.57     5.10     5.59     9.08    12.42    20.69
64 GB USB Drive      MB/sec      1.95     2.55     4.58     7.33    11.85    21.22
64 GB HD Partition   MB/sec      4.20    16.53    13.64    13.32    20.21    50.28
64 GB HD Pi 4B USB   MB/sec      8.58    20.83    35.28    20.83    36.94    61.32
Same HD Pi400 FAT    MB/sec      0.04     0.07     0.15     0.37     0.73     1.46
Same HD Pi400 EXT4   MB/sec      8.15    15.02    20.04     8.86    12.86    34.40
Same HD Pi4B  FAT    MB/sec      0.04     0.07     0.15     0.37     0.73     1.46
Same HD Pi4B  EXT4   MB/sec      9.90    15.22    14.05    13.42     7.95    19.51
  



64 Bit USB 3 and Main Drive Benchmarks Pi 400 - DriveSpeed64v2g8, LanSpeed64g8, DriveSpeed264WRg8, DriveSpeed264Rd2g8

A major advantage of 64 bit working is that much larger files can be handled, but there is a disadvantage in running my benchmarks, in that Direct I/O does not appear to be available. Attempting to run DriveSpeed on an Ext4 formatted partition leads to an error report (see below). The alternatives are LanSpeed, using data larger than RAM size to minimise caching, or separate programs to write and read, requiring a reboot before the latter (DriveSpeed264WRg8 and DriveSpeed264Rd2g8). These two are variations of LanSpeed that deal only with the large file tests, writing 1 MB at a time, with the read only program declaring an array to contain all the data being read. The example below indicates the limitation with 4 GB RAM, where only reading of the first 1024 MB of each file was successful.

Large Files - Compared with 32 bit operation, and using the appropriate formatting, performance was similar using Ext4 partitions, but much larger files could be handled at 64 bits. With FAT32, files of twice the size could be dealt with, but performance on writing was much worse.

Random Access - None of the 64 bit Pi 400 reading times represent drive hardware performance, being accelerated by caching or HD buffering, but 32 bit reading was also faster than expected. Writing times produced inexplicable variations.

Small Files - FAT32 performance was again particularly slow. The Ext4 speeds reported via LanSpeed were accelerated by buffering.

Large Files                             MBytes/Second
                      MB   Write1   Write2   Write3    Read1    Read2    Read3

HD Ext4  LanSpeed   4096   130.60   112.66   110.96    85.23   118.60   119.20
HD Ext4  LanSpeed   8192*  122.62   111.55   103.17   101.44   124.52   119.92
HD FAT32 LanSpeed   4096   Error writing file Segmentation fault
HD FAT32 LanSpeed   4000=  125.45   137.00   137.94   147.63   146.50   146.16
HD Ext4  DriveSpeed        Error writing file Segmentation fault
HD FAT32 DriveSpeed 4096   Error writing file Segmentation fault
HD FAT32 DriveSpeed 4000=   20.50    20.56    12.53   143.32   146.59   146.32
SD Main  LanSpeed   4096    21.34    18.22    17.40    34.78    45.86    45.33
SD Main  Write/Read 4096#   18.73    18.87    18.83    45.96    46.01    46.04
SD Main  Read only  4096    Memory allocation failed asked for 3 x 4096 MB
SD Main  Read only  1333    Memory allocation failed asked for 3 x 1333 MB
SD Main  Read only  1024      N/A      N/A      N/A    46.26    46.23    45.87
SD FAT32 LanSpeed   4096   Error writing file Segmentation fault
SD FAT32 LanSpeed   4000    20.14    20.12    20.09    95.33    95.21    95.32
SD FAT32 DriveSpeed 4096   Error writing file Segmentation fault
SD FAT32 DriveSpeed 4000    17.13    17.20    17.25    95.79    95.71    95.54

32 Bit From Above For Comparison
HD EXT4  DriveSpeed 2000*  125.74   123.60   120.72   130.20   128.47   124.88
HD FAT32 DriveSpeed 2000=   83.27    82.66    83.22   143.79   144.02   144.06
SD Main  DriveSpeed   16#   19.02    17.56    17.39    44.71    43.50    44.84

Random                          Read                       Write
                     From MB        4        8       16        4        8       16

HD Ext4  LanSpeed    msecs*     0.002    0.002    0.002    43.48    45.76    41.66
HD FAT32 LanSpeed    msecs=     0.003    0.003    0.003    12.22    12.24    16.22
HD FAT32 DriveSpeed  msecs=     0.003    0.003    0.004    12.68    12.37    12.26
SD Main  LanSpeed    msecs#     0.002    0.002    0.002     4.46     4.17     4.63
SD FAT32 LanSpeed    msecs      0.003    0.003    0.003     6.05     5.87     6.05
SD FAT32 DriveSpeed  msecs      0.004    0.004    0.010     2.97     2.55     2.42

32 Bit From Above For Comparison
HD EXT4  DriveSpeed  msecs*     0.436    0.486    0.314     0.71     0.64     0.70
HD FAT32 DriveSpeed  msecs=     0.169    0.170    0.170     0.66     0.63     0.70
SD Main  DriveSpeed  msecs#     0.250    0.249    0.279     1.61     1.50     1.55

200 Small Files                 Write                      Read
                     File KB        4        8       16        4        8       16

HD Ext4  LanSpeed     MB/sec*    69.10   115.19   175.42   232.95   395.30   624.64
HD FAT32 LanSpeed     MB/sec=     0.04     0.08     0.16   296.33   485.65   736.98
HD FAT32 DriveSpeed   MB/sec=     0.04     0.07     0.15   292.47    40.44   391.07
SD Main  LanSpeed     MB/sec#    83.78    36.88   148.72   335.38   216.77   786.13
SD FAT32 LanSpeed     MB/sec      0.04     0.08     0.15   306.56   493.11   730.44
SD FAT32 DriveSpeed   MB/sec      0.04     0.08     0.15   299.67   130.72    34.53

32 Bit From Above For Comparison
HD EXT4  DriveSpeed   MB/sec*     8.15    15.02    20.04     8.86    12.86    34.40
HD FAT32 DriveSpeed   MB/sec=     0.04     0.07     0.15     0.37     0.73     1.46
SD Main  DriveSpeed   MB/sec#     2.57     5.10     5.59     9.08    12.42    20.69



High Performance Linpack Benchmark - xhpl

My ATLAS version of HPL has been ported onto the Pi 400. For more detail and results, see my ResearchGate report Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. Besides being the gold standard benchmark for massively parallel supercomputers, it makes an excellent stress test, a disadvantage being that there is no ongoing progress report to indicate deteriorating performance. It has an N input parameter that determines how much memory is required (N x N x 8 bytes), where the maximum for 4 GB RAM is not much higher than N = 20000, which needs 3.2 GB (base 10).
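
The N x N x 8 bytes rule can be checked with a few lines of Python. This is a minimal sketch, not part of the benchmark itself: the function names are hypothetical, and the 4 GiB budget ignores the memory needed by the Operating System and HPL's own buffers, which is why the practical limit is lower than the upper bound it prints.

```python
import math

BYTES_PER_DOUBLE = 8  # the N x N matrix is held in double precision


def hpl_matrix_bytes(n: int) -> int:
    """Bytes needed for the N x N double precision matrix."""
    return n * n * BYTES_PER_DOUBLE


def max_n_for_bytes(budget_bytes: int) -> int:
    """Largest N whose matrix fits in the given byte budget."""
    return math.isqrt(budget_bytes // BYTES_PER_DOUBLE)


if __name__ == "__main__":
    for n in (16000, 20000):
        print(f"N = {n}: {hpl_matrix_bytes(n) / 1e9:.2f} GB (base 10)")
    # Assumption: the whole 4 GiB is available, with no OS or buffer overheads.
    print("Upper bound on N for 4 GiB:", max_n_for_bytes(4 * 1024**3))
```

At N = 20000 this gives the 3.2 GB quoted above, with the theoretical upper bound for 4 GiB only a little over 23000.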

Below is an example of the main output from the first set of tests on the Pi 400, followed by a summary of later results, comprising five runs over around 50 minutes at N = 20000. There are start and end overheads not reported in benchmark execution time. Performance is shown to be constant over this period.

Next are details of VMSTAT system monitor results, showing use of 3.2 GB RAM and 100% CPU utilisation of four cores. These are followed by CPU and Power Management IC temperatures during the five runs, nowhere near where CPU MHz throttling might be expected. Room temperature was 27C, with hot spot readings on the keyboard of up to 36C.

 ================================================================================
 T/V                N    NB     P     Q               Time                 Gflops
 --------------------------------------------------------------------------------
 WR11C2R4       20000   128     2     2             451.75              1.181e+01
 HPL_pdgesv() start time Thu Jul 23 21:28:59 2020
 HPL_pdgesv() end time   Thu Jul 23 21:36:31 2020
 --------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0010188 ...... PASSED
 ================================================================================

 Start Time   Run Time   GFLOPS    SumCheck

 12:04:00       455.77    11.70   0.0010188
 12:14:13       453.90    11.75   0.0010188
 12:25:13       458.17    11.64   0.0010188
 12:36:14       453.06    11.77   0.0010188
 12:46:58       457.73    11.65   0.0010188

 VMSTAT      -----------memory---------- ---swap-- ---io---- -system--  ------cpu-----
 SUMMARY      swpd    free   buff  cache   si   so   bi   bo   in   cs  us sy id wa st

 Pre Start       0 3545296  24620 225724    0    0  102    4  176  263   1  1 97  0  0
 1 Started     512  304936  24640 199252    0   17   44   18  491   88  98  2  0  0  0
 1 Finished    512  421176  24916 199220    0    0    4    4  490   97  98  2  0  0  0
 2 Started    7680  314692  18484 194812    1  241   23  243  550  229  97  3  0  0  0
 2 Finished   7680  433492  18836 190508    0    0    0    4  535  202  98  2  0  0  0
 Later near   7424  427324  20520 193888    0    0    0    3  513  159  97  3  0  0  0

            -------- CPU C ---------   -------- PMIC C --------
 Run           1    2    3    4    5      1    2    3    4    5
 Seconds
      0       33   40   43   44   45     36   43   46   48   49
     30       41   48   52   53   53     38   45   48   50   51
     60       47   53   57   57   58     41   48   51   51   52
     90       50   54   57   58   58     44   50   51   53   54
     To
    390       55   58   61   60   61     51   53   55   55   57
    420       55   59   59   62   62     51   53   55   55   57
    450       55   58   60   62   62     51   53   55   55   57
    480       56   55   59   60   62     51   53   55   55   57
    Max       56   59   61   62   62     51   53   55   55   57

Below are Pi 4B and Pi 400 results at larger N values. Later 4B results are included, where fanless performance improved. With performance being partly affected by RAM speed, the fanless Pi 400 gain was around 10%, compared to the Pi 4B with a fan.

 System   Fan       N   GFLOPS Seconds  Max C  Min MHz

 Pi 4B     No   16000     6.8     404     86    750/600
          Yes   16000    10.4     263     70      1500 
 Later 4B  No   16000     8.6     319     83      1000 
          Yes   16000    10.4     263     63      1500 
 Pi 400         16000    11.4     239     57      1800 
 Pi 4B     No   20000     6.2     856     87    750/600
          Yes   20000    10.8     494     71      1500 
 Later 4B  No   20000     8.8     604     85      1000 
          Yes   20000    10.7     497     63      1500 
 Pi 400         20000    11.8     452     62      1800 

64 Bit High Performance Linpack Benchmark - xhpl

Following are results from four consecutive runs of HPL using the 64 bit configuration, with environmental and system activity monitoring. Moderate temperature increases ensured constant CPU MHz and measured GFLOPS, effectively at the same speed as the 32 bit version. As noted before, and inexplicably, the calculated and accepted sumchecks were different.


64 Bit HPL Benchmark Results

Start Time            N    NB    P    Q     Time     Gflops    SumCheck
Sep 8 12:07:45    20000   128    2    2   456.61  1.168e+01   0.0009306 .. PASSED
Sep 8 12:16:31    20000   128    2    2   459.68  1.160e+01   0.0009602 .. PASSED
Sep 8 12:25:20    20000   128    2    2   460.25  1.159e+01   0.0011412 .. PASSED
Sep 8 12:34:10    20000   128    2    2   454.22  1.174e+01   0.0009636 .. PASSED
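
The Gflops column can be cross-checked from N and the run time. The sketch below assumes the customary HPL operation count of 2/3 N^3 + 3/2 N^2 floating point operations. The function name is hypothetical and the code is only an illustration, not taken from the benchmark source.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """GFLOPS from problem size and run time, assuming the usual
    HPL operation count of (2/3)N^3 + (3/2)N^2."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Run times in seconds, taken from the four 64 bit runs above.
    for seconds in (456.61, 459.68, 460.25, 454.22):
        print(f"{seconds:8.2f} s  {hpl_gflops(20000, seconds):6.2f} GFLOPS")
```

Under that assumption, the calculated values agree with the reported Gflops figures to the precision shown.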

 Temperature and CPU MHz Measurement  Start at Tue Sep  8 12:07:12 2020
 Seconds
    0.0   ARM MHz=1800, core volt=0.9500V, CPU temp=39.0'C, pmic temp=41.1'C
   60.0   ARM MHz=1800, core volt=0.9500V, CPU temp=52.0'C, pmic temp=45.8'C
  121.4   ARM MHz=1800, core volt=0.9500V, CPU temp=53.0'C, pmic temp=48.6'C
  182.6   ARM MHz=1800, core volt=0.9500V, CPU temp=54.0'C, pmic temp=49.6'C
  243.8   ARM MHz=1800, core volt=0.9500V, CPU temp=55.0'C, pmic temp=50.5'C
  304.9   ARM MHz=1800, core volt=0.9500V, CPU temp=55.0'C, pmic temp=51.4'C
  366.1   ARM MHz=1800, core volt=0.9500V, CPU temp=56.0'C, pmic temp=52.4'C
  427.2   ARM MHz=1800, core volt=0.9500V, CPU temp=57.0'C, pmic temp=52.4'C
  488.4   ARM MHz=1800, core volt=0.9500V, CPU temp=54.0'C, pmic temp=52.4'C
  549.3   ARM MHz=1800, core volt=0.9500V, CPU temp=54.0'C, pmic temp=51.4'C
  610.2   ARM MHz=1800, core volt=0.9500V, CPU temp=57.0'C, pmic temp=53.3'C
  671.5   ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=54.3'C
  732.7   ARM MHz=1800, core volt=0.9500V, CPU temp=57.0'C, pmic temp=54.3'C
  794.0   ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=54.3'C
  855.2   ARM MHz=1800, core volt=0.9500V, CPU temp=58.0'C, pmic temp=55.2'C
  916.3   ARM MHz=1800, core volt=0.9500V, CPU temp=58.0'C, pmic temp=55.2'C
  977.5   ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=55.2'C
 1038.7   ARM MHz=1800, core volt=0.9500V, CPU temp=56.0'C, pmic temp=54.3'C
 1099.7   ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=54.3'C
 1160.8   ARM MHz=1800, core volt=0.9500V, CPU temp=61.0'C, pmic temp=55.2'C
 1222.0   ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=55.2'C
 1283.1   ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=56.2'C
 1344.2   ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=56.2'C
 1405.4   ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=57.1'C
 1466.5   ARM MHz=1800, core volt=0.9500V, CPU temp=61.0'C, pmic temp=57.1'C
 1527.8   ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=56.2'C
 1589.0   ARM MHz=1800, core volt=0.9500V, CPU temp=57.0'C, pmic temp=55.2'C
 1649.9   ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=55.2'C
 1710.9   ARM MHz=1800, core volt=0.9500V, CPU temp=60.0'C, pmic temp=57.1'C
 1772.1   ARM MHz=1800, core volt=0.9500V, CPU temp=61.0'C, pmic temp=57.1'C
 1833.4   ARM MHz=1800, core volt=0.9500V, CPU temp=61.0'C, pmic temp=57.1'C
 1894.5   ARM MHz=1800, core volt=0.9500V, CPU temp=62.0'C, pmic temp=57.1'C
 1955.6   ARM MHz=1800, core volt=0.9500V, CPU temp=62.0'C, pmic temp=58.0'C
 2016.7   ARM MHz=1800, core volt=0.9500V, CPU temp=62.0'C, pmic temp=58.0'C
 2077.9   ARM MHz=1800, core volt=0.9500V, CPU temp=59.0'C, pmic temp=57.1'C

vmstat 60 seconds sampling
procs  -----------memory----------  ---swap-- ----io---- --system- ------cpu------
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs  us sy id wa st

 0  0      0 3436000  23368 250372    0    0   286     6  204  293   2  2 95  1  0
 4  0  15616  266040    880 131528    0  259   109   263 1250  462  94  2  4  0  0
 4  0  18688  258616   3928 136892    4   49   251    51 1114  119 100  0  0  0  0
 4  0  21504  262664   4240 132936    0   46    13    49 1102   92 100  0  0  0  0
 4  0  21504  262160   4264 132940    1    0     1     2 1097   84 100  0  0  0  0
 4  0  21504  262932   4288 132932    0    0     0     2 1095   79 100  0  0  0  0
 4  0  21504  262428   4316 132944    0    0     0     2 1102   85 100  0  0  0  0
 4  0  21504  265200   4340 130812    1    0     1     2 1092   74 100  0  0  0  0
 4  0  21504  264948   4708 132172    3    0    31     2 1099   93 100  0  0  0  0
 4  0  21504 2423512   4852 135656    0    0    20     3 1105  100  99  1  0  0  0
 4  0  21504  280360   4880 117156    1    0     3     3 1100   89  99  1  0  0  0
 4  0  27904  294848   4908 105784    0  105    79   107 1115  110 100  0  0  0  0
 4  0  27904  293336   4928 105984    0    0     3     2 1097   79 100  0  0  0  0
 4  0  57600  301452   9764 120232   13  495   758   517 1458  809  99  1  0  0  0
 4  0  73728  312128   9576 123948   25  283   257   305 1336  548  99  1  0  0  0
 4  0  73728  311372   9740 124008    0    0     3     2 1099   86 100  0  0  0  0
 4  0  73728  310868   9752 124016    1    0     1     2 1096   80 100  0  0  0  0
 4  0  73728  309828   9764 124624    0    0    10     2 1098   85 100  0  0  0  0
 4  0  73472 1445224  10136 127356    1    0    73     3 1118  128  98  2  0  0  0
 4  0  73472  306776  10172 128404    1    0     1     4 1098   87  99  1  0  0  0
 4  0  73472  306280  10196 128488    1    0     3     2 1166  219 100  0  0  0  0
 5  0  73472  305920  10216 128516    0    0     0     2 1095   78 100  0  0  0  0
 4  0  73472  305044  10244 128524    1    0     1     2 1100   90 100  0  0  0  0
 4  0  73216  305412  10268 128620    0    0     2     2 1094   80 100  0  0  0  0
 4  0  73216  305040  10292 128632    0    0     0     2 1091   75 100  0  0  0  0
 4  0  72960  304916  10320 128640    0    0     0     2 1100   81 100  0  0  0  0
 4  0  72960  302436  10348 131852    1    0     1     3 1096   80 100  0  0  0  0
 4  0  72704  470192  10380 127388    0    0     0     3 1111  110  98  2  0  0  0
 4  0  72704  306264  10412 128684    0    0     0     2 1126  146 100  0  0  0  0
 4  0  72704  305768  10440 128696    1    0     1     3 1095   82 100  0  0  0  0
 4  0  72704  305768  10464 128708    0    0     0     2 1095   79 100  0  0  0  0
 4  0  72704  305876  10488 128752    0    0     1     2 1096   77 100  0  0  0  0
 4  0  72704  305752  10516 128760    0    0     0     3 1094   77 100  0  0  0  0
 4  0  72704  305504  10540 128768    1    0     1     2 1099   87 100  0  0  0  0
 4  0  72704  305380  10568 128784    0    0     0     3 1090   74 100  0  0  0  0
 4  0  72704  299800  10596 134108    5    0    19     2 1227  353 100  0  0  0  0
   


32 Bit Stress Test Benchmarks - MP-FPUStress, MP-FPUStressDP, MP-IntStress

These stress tests have a benchmarking mode, which provides the choices available for a long running test. The choices cover the number of threads and the memory size, to exercise caches or RAM, plus, for the floating point programs, the number of operations carried out on each data word. Numeric sumchecks are carried out to verify all calculations.

Floating Point - Below are Pi 4B single precision speeds in MFLOPS then sumchecks, followed by those for double precision working, then the same for the Pi 400 and comparisons. The latter indicate a 20% Pi 400 performance gain for cache based tests and no difference on those that were RAM speed dependent. The sumcheck comparisons show that the two systems produced the same numeric results, carrying out millions of calculations.

Integer - The test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Performance is measured in MBytes per second. Results show the varying hexadecimal data patterns used and confirm that the sumchecks were the same in all tests. Comparative performance again shows that the Pi 400 was 20% faster on CPU speed dependent tasks and no different when reliant on RAM speed.
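
The verification principle of the integer test can be illustrated in Python. This is a sketch under stated assumptions, not the benchmark code: the eight operand values in DELTAS and the function names are hypothetical, and the real program's operands and instruction ordering are not given in this report. It shows only that 8 subtracts followed by the matching 8 adds, with 32 bit wraparound, leave each data pattern unchanged, so any difference indicates a fault.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, as in integer registers

# Hypothetical operands; the values used by the real test are not stated here.
DELTAS = (0x11111111, 0x01234567, 0x89ABCDEF, 0x0000FFFF,
          0xFFFF0000, 0x13579BDF, 0x2468ACE0, 0x00000001)


def integer_pass(words, deltas=DELTAS):
    """Apply 8 subtracts then 8 adds, twice over, giving 32 operations per word."""
    result = []
    for w in words:
        for _ in range(2):
            for d in deltas:
                w = (w - d) & MASK32
            for d in deltas:
                w = (w + d) & MASK32
        result.append(w)
    return result


def verify(words, pattern):
    """True if every word still holds the original data pattern."""
    return all(w == pattern for w in words)


if __name__ == "__main__":
    # Data patterns as listed in the Sumcheck column of the integer results.
    for pattern in (0x00000000, 0xFFFFFFFF, 0x5A5A5A5A,
                    0xAAAAAAAA, 0xCCCCCCCC, 0x0F0F0F0F):
        data = integer_pass([pattern] * 1024)
        print(f"{pattern:08X}", "Yes" if verify(data, pattern) else "No")
```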


            32 Bit Single Precision Floating Point     32 Bit Double Precision Floating Point

           ------ MFLOPS -----  ----- Sumchecks ---   ------ MFLOPS -----   ----- Sumchecks ---
                              *                     *                     *
           -------------------------------------- Pi 4B -------------------------------------

      Ops    KB     KB     MB *    KB     KB     MB *    KB     KB     MB *    KB     KB     MB
Thrds /Wd  12.8    128   12.8 *  12.8    128   12.8 *  12.8    128   12.8 *  12.8    128   12.8

 T1    2   2603   2607    651 * 40392  76406  99700 *   992    990    317 * 40395  76384  99700
 T2    2   5017   5138    645 * 40392  76406  99700 *  1940   1993    319 * 40395  76384  99700
 T4    2   7045   9724    656 * 40392  76406  99700 *  3639   3925    329 * 40395  76384  99700
 T8    2   8747   9690    633 * 40392  76406  99700 *  3690   3913    331 * 40395  76384  99700
 T1    8   5542   5427   2479 * 54756  85091  99820 *  2390   2435   1266 * 54805  85108  99820
 T2    8  10774  10716   2579 * 54756  85091  99820 *  4608   4853   1170 * 54805  85108  99820
 T4    8  19196  20561   2595 * 54756  85091  99820 *  8902   9081   1165 * 54805  85108  99820
 T8    8  18718  20629   2512 * 54756  85091  99820 *  8852   8971   1098 * 54805  85108  99820
 T1   32   5307   5244   5217 * 35296  66020  99519 *  2703   2724   2672 * 35159  66065  99521
 T2   32  10559  10521   9764 * 35296  66020  99519 *  5385   5442   5009 * 35159  66065  99521
 T4   32  20070  20557   9864 * 35296  66020  99519 * 10582  10836   4824 * 35159  66065  99521
 T8   32  19793  20919   9460 * 35296  66020  99519 * 10484  10749   4765 * 35159  66065  99521

            ------------------------------------- Pi 400 -------------------------------------

 T1    2   3163   3129    646 * 40392  76406  99700 *  1192   1187    321 * 40395  76384  99700
 T2    2   6145   6144    646 * 40392  76406  99700 *  2362   2392    324 * 40395  76384  99700
 T4    2   8974  10119    655 * 40392  76406  99700 *  4155   4692    278 * 40395  76384  99700
 T8    2   9584  11780    645 * 40392  76406  99700 *  4232   4730    272 * 40395  76384  99700
 T1    8   6606   6514   2515 * 54756  85091  99820 *  2899   2931   1250 * 54805  85108  99820
 T2    8  13028  12755   2831 * 54756  85091  99820 *  5643   5829   1128 * 54805  85108  99820
 T4    8  22820  25005   2778 * 54756  85091  99820 * 10637  11351   1208 * 54805  85108  99820
 T8    8  23260  24714   2345 * 54756  85091  99820 * 10850  10938   1217 * 54805  85108  99820
 T1   32   6368   6327   6115 * 35296  66020  99519 *  3252   3257   3156 * 35159  66065  99521
 T2   32  12643  12602  10838 * 35296  66020  99519 *  6484   6538   5455 * 35159  66065  99521
 T4   32  24016  25146  10124 * 35296  66020  99519 * 12833  12791   4790 * 35159  66065  99521
 T8   32  23811  24068   8760 * 35296  66020  99519 * 12093  12226   4463 * 35159  66065  99521

            --------------------------------- Pi 400 / Pi 4B  --------------------------------

             L1     L2    RAM *  ---- Sumchecks --- *    L1     L2    RAM *  ---- Sumchecks ---

 T1    2   1.22   1.20   0.99 *  1.00   1.00   1.00 *  1.20   1.20   1.01 *  1.00   1.00   1.00
 T2    2   1.22   1.20   1.00 *  1.00   1.00   1.00 *  1.22   1.20   1.02 *  1.00   1.00   1.00
 T4    2   1.27   1.04   1.00 *  1.00   1.00   1.00 *  1.14   1.20   0.84 *  1.00   1.00   1.00
 T8    2   1.10   1.22   1.02 *  1.00   1.00   1.00 *  1.15   1.21   0.82 *  1.00   1.00   1.00
 T1    8   1.19   1.20   1.01 *  1.00   1.00   1.00 *  1.21   1.20   0.99 *  1.00   1.00   1.00
 T2    8   1.21   1.19   1.10 *  1.00   1.00   1.00 *  1.22   1.20   0.96 *  1.00   1.00   1.00
 T4    8   1.19   1.22   1.07 *  1.00   1.00   1.00 *  1.19   1.25   1.04 *  1.00   1.00   1.00
 T8    8   1.24   1.20   0.93 *  1.00   1.00   1.00 *  1.23   1.22   1.11 *  1.00   1.00   1.00
 T1   32   1.20   1.21   1.17 *  1.00   1.00   1.00 *  1.20   1.20   1.18 *  1.00   1.00   1.00
 T2   32   1.20   1.20   1.11 *  1.00   1.00   1.00 *  1.20   1.20   1.09 *  1.00   1.00   1.00
 T4   32   1.20   1.22   1.03 *  1.00   1.00   1.00 *  1.21   1.18   0.99 *  1.00   1.00   1.00
 T8   32   1.20   1.15   0.93 *  1.00   1.00   1.00 *  1.15   1.14   0.94 *  1.00   1.00   1.00

            ---------------------------- 32 Bit Integers ---------------------------
           
             Pi 4B  MB/second            Same     Pi 400 MB/second     Pi 400/Pi4B  
             KB     KB     MB             All     KB     KB     MB    KB    KB    MB
  Threads    16    160     16 Sumcheck  Tests     16    160     16    16   160    16

       1   5751   5755   3882 00000000   Yes    7062   6907   3825  1.23  1.20  0.99
       2  11820  11302   3772 FFFFFFFF   Yes   14215  13724   3736  1.20  1.21  0.99
       4  22467  21906   3375 5A5A5A5A   Yes   27026  26533   3397  1.20  1.21  1.01
       8  22019  22094   3415 AAAAAAAA   Yes   26959  25993   3419  1.22  1.18  1.00
      16  22891  22448   3395 CCCCCCCC   Yes   27424  27479   3413  1.20  1.22  1.01
      32  22574  23412   3436 0F0F0F0F   Yes   27143  27869   3458  1.20  1.19  1.01
  



64 Bit Stress Test Benchmarks - MP-FPUStress64g8, MP-FPUStress64DPg8, MP-IntStress64g8

Unlike the earlier CPU benchmarks reported here, the 32 bit stress tests were produced by an earlier compiler, so comparisons may not be valid. In this case, 64 bit floating point performance is generally shown to be faster, with the same data sumchecks, but integer arithmetic is indicated as often running at half speed. Faster results from an earlier 64 bit version are provided to identify the later compiler deficiency. See 64 Bit Danger.


            64 Bit Single Precision Floating Point     64 Bit Double Precision Floating Point

           ------ MFLOPS -----  ----- Sumchecks ---   ------ MFLOPS -----   ----- Sumchecks ---
                              *                     *                     *
      Ops    KB     KB     MB *    KB     KB     MB *    KB     KB     MB *    KB     KB     MB
Thrds /Wd  12.8    128   12.8 *  12.8    128   12.8 *  12.8    128   12.8 *  12.8    128   12.8

 T1    2   3114   4852   1191 * 40394  76395  99700 *  1822   2252    613 * 40395  76384  99700
 T2    2   9362   9555   1236 * 40394  76395  99700 *  4190   4493    604 * 40395  76384  99700
 T4    2  16966  15205   1119 * 40394  76395  99700 *  8082   8708    603 * 40395  76384  99700
 T8    2  16096  17963   1027 * 40394  76395  99700 *  8275   7905    603 * 40395  76384  99700
 T1    8   5645   5697   3695 * 54764  85092  99820 *  3342   3354   2190 * 54805  85108  99820
 T2    8  11333  11335   4125 * 54764  85092  99820 *  6643   6718   2142 * 54805  85108  99820
 T4    8  21208  22499   4151 * 54764  85092  99820 * 12734  13322   2058 * 54805  85108  99820
 T8    8  21585  21456   4115 * 54764  85092  99820 * 12919  12523   2101 * 54805  85108  99820
 T1   32   7025   7049   7006 * 35206  66015  99520 *  4002   4009   3961 * 35159  66065  99521
 T2   32  14081  14047  13565 * 35206  66015  99520 *  7993   8016   7511 * 35159  66065  99521
 T4   32  27027  28036  16116 * 35206  66015  99520 * 15462  15988   8132 * 35159  66065  99521
 T8   32  26548  27040  16049 * 35206  66015  99520 * 15722  15825   8038 * 35159  66065  99521

            ------------------------------- Pi 400 64 Bit/32 Bit ------------------------------

             L1     L2    RAM *  ---- Sumchecks --- *    L1     L2    RAM *  ---- Sumchecks ---

 T1    2   0.98   1.55   1.84 *  1.00   1.00   1.00 *  1.53   1.90   1.91 *  1.00   1.00   1.00
 T2    2   1.52   1.56   1.91 *  1.00   1.00   1.00 *  1.77   1.88   1.86 *  1.00   1.00   1.00
 T4    2   1.89   1.50   1.71 *  1.00   1.00   1.00 *  1.95   1.86   2.17 *  1.00   1.00   1.00
 T8    2   1.68   1.52   1.59 *  1.00   1.00   1.00 *  1.96   1.67   2.22 *  1.00   1.00   1.00
 T1    8   0.85   0.87   1.47 *  1.00   1.00   1.00 *  1.15   1.14   1.75 *  1.00   1.00   1.00
 T2    8   0.87   0.89   1.46 *  1.00   1.00   1.00 *  1.18   1.15   1.90 *  1.00   1.00   1.00
 T4    8   0.93   0.90   1.49 *  1.00   1.00   1.00 *  1.20   1.17   1.70 *  1.00   1.00   1.00
 T8    8   0.93   0.87   1.75 *  1.00   1.00   1.00 *  1.19   1.14   1.73 *  1.00   1.00   1.00
 T1   32   1.10   1.11   1.15 *  1.00   1.00   1.00 *  1.23   1.23   1.26 *  1.00   1.00   1.00
 T2   32   1.11   1.11   1.25 *  1.00   1.00   1.00 *  1.23   1.23   1.38 *  1.00   1.00   1.00
 T4   32   1.13   1.11   1.59 *  1.00   1.00   1.00 *  1.20   1.25   1.70 *  1.00   1.00   1.00
 T8   32   1.11   1.12   1.83 *  1.00   1.00   1.00 *  1.30   1.29   1.80 *  1.00   1.00   1.00

            ------------------------------ Pi 400 Integers ----------------------------

             gcc 8 MB/second             Same   Pi 400 64 Bit/32 Bit   Version 1 MB/sec
             KB     KB     MB             All     KB     KB     MB     KB     KB     MB
  Threads    16    160     16 Sumcheck  Tests     16    160     16     16    160     16

       1   3455   3481   3074 00000000   Yes    0.49   0.50   0.80   8774   8150   3772
       2   7047   6975   3507 FFFFFFFF   Yes    0.50   0.51   0.94  17241  15941   3687
       4  13712  13977   3357 5A5A5A5A   Yes    0.51   0.53   0.99  32768  29966   3339
       8  13631  13696   3353 AAAAAAAA   Yes    0.51   0.53   0.98  32845  33055   3366
      16  13184  13906   3351 CCCCCCCC   Yes    0.48   0.51   0.98  32959  34188   3364
      32  12617  13960   3414 0F0F0F0F   Yes    0.46   0.50   0.99  31531  33694   3388
  


Stress Test Parameters

The following are the stress test run time parameters. The classification keywords can be upper or lower case, and only the first character is interpreted.

   ./MP-FPUStress   Threads tt, Minutes mm, KB kk, Ops oo, Log ll
   ./MP-FPUStressDP Threads tt, Minutes mm, KB kk, Ops oo, Log ll
   ./MP-IntStress   Threads tt, Minutes mm, KB kk, Log ll
   ./RPiHeatMHzVolts2  Passes pp, Seconds ss, Log ll
   vmstat ss pp

   tt = Threads 1, 2, 4, 8, 16, 32, (64 FPU)       mm = Minutes greater than 0                       
   kk = KBytes 12 to 15624                         oo = Operations Per Word 2, 8 or 32  
   ll = number added to log file name, 0 to 99     pp = Passes (at ss second intervals) 
   ss = Second intervals 
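
As an illustration of how these parameters fit together, the following Python sketch builds a command line in the format shown above and rejects values outside the listed ranges. The helper name fpu_stress_command is hypothetical, and the output simply reproduces the format as printed in this report. The example values (4 threads, 30 minutes, 256 KB, 32 operations per word) match the half hour floating point stress tests described in the following sections, with the log number chosen arbitrarily.

```python
VALID_THREADS = (1, 2, 4, 8, 16, 32, 64)  # 64 is listed for the FPU tests only
VALID_OPS = (2, 8, 32)                    # operations per word


def fpu_stress_command(threads, minutes, kb, ops, log,
                       program="./MP-FPUStress"):
    """Return a command line in the 'Threads tt, Minutes mm, ...' format."""
    if threads not in VALID_THREADS:
        raise ValueError(f"threads must be one of {VALID_THREADS}")
    if minutes <= 0:
        raise ValueError("minutes must be greater than 0")
    if not 12 <= kb <= 15624:
        raise ValueError("kb must be in the range 12 to 15624")
    if ops not in VALID_OPS:
        raise ValueError(f"ops must be one of {VALID_OPS}")
    if not 0 <= log <= 99:
        raise ValueError("log must be in the range 0 to 99")
    return (f"{program} Threads {threads}, Minutes {minutes}, "
            f"KB {kb}, Ops {ops}, Log {log}")


if __name__ == "__main__":
    print(fpu_stress_command(threads=4, minutes=30, kb=256, ops=32, log=1))
```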
  


32 Bit Floating Point Stress Tests - MP-FPUStress, MP-FPUStressDP

Half hour single and double precision cache based stress tests (256 KB, 4 Threads, 32 Ops/Word) were run on the latest fan cooled 8 GB Pi 4, with appropriate firmware and Operating System, side by side with the same on the fanless Pi 400. This was on a hot August day, where the room temperature was 30C. Following is a summary of results, where both ran at constant CPU MHz and voltage and, effectively, constant measured MFLOPS performance, the latter reflecting the expected Pi 400 gain. CPU temperatures did not approach the level where MHz throttling would be anticipated, with maximum measurements on both systems being similar, unlike those for the PMIC, where the Pi 400 recordings were hotter. The Pi 400 has a full width metallic heat spreader between the keyboard and the circuit board, a quick look suggesting thermal contact with the CPU. This appears to be an excellent cooling arrangement.

An extra test was carried out on the Pi 4B, with the fan disabled, demonstrating severe CPU MHz throttling and much worse performance, underlining the Pi 400 advantage. Average temperature, over half an hour, was 84C, accompanied by a 42% reduction in performance.

The Pi 400 keyboard temperature was measured during the stress tests, reaching a warm to touch 40C, according to my infrared thermometer.

               Pi 4B                       Pi 4B No Fan  Pi 400 Fanless

               Single        Double        Single        Single        Double

   MFLOPS Avg   20896         10797         12151         25056         12953
   MFLOPS Min   20541         10587         10946         24587         12754

      MHz Avg    1500          1500           870          1800          1800
      MHz Min    1500          1500           600          1800          1800

        Volts  0.8600        0.8600        0.8600        0.9500        0.9500
 
 Temperatures     CPU   PMIC    CPU   PMIC    CPU   PMIC    CPU   PMIC    CPU   PMIC
                    C      C      C      C      C      C      C      C      C      C
           Avg   66.6   49.5   60.3   47.1   84.0   71.2   61.8   57.7   61.1   59.5
           Max   69.0   50.5   62.0   48.6   86.0   72.2   68.0   61.8   64.0   60.9
       Minutes
             0     43   38.2     44   41.1     61   57.1     38   43.9     48   53.3
             1     63   45.8     59   43.9     82   65.6     56   49.6     60   57.1
             2     66   48.6     61   46.7     83   68.4     57   51.4     59   58.0
             3     67   49.6     61   47.7     84   70.3     58   53.3     61   59.0
             4     67   49.6     62   46.7     85   70.3     59   54.3     62   59.0
             5     67   49.6     61   47.7     84   70.3     61   55.2     61   59.0
             6     68   49.6     61   47.7     85   71.2     61   55.2     62   59.0
             7     67   49.6     62   47.7     85   72.2     62   57.1     63   59.0
             8     69   49.6     61   47.7     85   72.2     62   57.1     54   58.0
             9     69   49.6     62   47.7     86   72.2     63   58.0     61   59.0
            10     68   49.6     62   47.7     85   72.2     63   58.0     62   59.0
            11     67   49.6     61   46.7     86   72.2     64   59.0     62   59.0
            12     67   49.6     61   46.7     86   72.2     65   59.0     62   59.0
            13     68   50.5     59   46.7     86   72.2     53   55.2     62   59.0
            14     68   50.5     60   46.7     86   72.2     64   59.0     63   59.0
            15     68   50.5     60   46.7     85   72.2     61   57.1     64   60.9
            16     69   50.5     59   46.7     85   72.2     64   59.0     64   60.9
            17     69   50.5     60   46.7     85   72.2     66   59.0     63   60.9
            18     68   49.6     60   46.7     86   72.2     66   59.9     61   60.9
            19     68   49.6     61   47.7     85   72.2     66   60.9     64   60.9
            20     69   50.5     61   47.7     85   72.2     66   60.9     56   59.0
            21     68   50.5     62   47.7     86   72.2     66   60.9     63   59.0
            22     68   50.5     61   47.7     85   72.2     66   60.9     62   60.9
            23     68   50.5     62   47.7     85   72.2     67   60.9     64   60.9
            24     67   50.5     62   48.6     85   72.2     67   61.8     63   60.9
            25     68   50.5     62   48.6     84   72.2     54   58.0     63   60.9
            26     68   50.5     61   47.7     86   72.2     67   60.9     63   60.9
            27     68   50.5     62   47.7     85   72.2     66   61.8     64   60.9
            28     69   50.5     61   47.7     86   72.2     68   61.8     63   60.9
            29     68   50.5     60   47.7     85   72.2     68   61.8     63   60.9
            30     57   50.5     59   47.7     78   72.2     51   57.1     53   59.0
  



64 Bit Floating Point Stress Tests - MP-FPUStress64g8, MP-FPUStress64DPg8

Three half hour floating point stress tests were run consecutively on the Pi 400, two at single precision and one at double precision. All ran using four threads, sharing 256 KB, with 32 floating point operations per data word read/written. The usual environment and system monitors were run at the same time. Room temperature was 27C.

With temperatures remaining relatively low, CPU MHz and measured performance were constant and a little faster than at 32 bits. Average Pi 400 64 bit double precision performance of 15.9 GFLOPS can be judged against 11.6 GFLOPS from High Performance Linpack.

 10-Sep-20        09:45         10:18         10:49

 Precision       Single        Single        Double

 MFLOPS Avg       28011         27995         15948
 MFLOPS Min       26336         27019         15467

      MHz Avg      1800          1800          1800
      MHz Min      1800          1800          1800

        Volts    0.9500        0.9500        0.9500

 Temperatures     CPU   PMIC    CPU   PMIC    CPU   PMIC
                   C     C     C     C     C     C
          Avg    52.3   51.7   56.6   56.2   60.5   58.5
          Max    57.0   55.2   59.0   58.0   63.0   60.9
       Minutes
             0     35   39.2     47   50.5     47   50.5
             1     45   43.9     55   53.3     58   55.2
             2     48   45.8     55   54.3     58   55.2
             3     49   46.7     54   55.2     59   57.1
             4     50   47.7     56   55.2     59   57.1
             5     49   48.6     56   55.2     59   57.1
             6     51   49.6     57   55.2     61   58.0
             7     52   49.6     55   55.2     60   58.0
             8     51   50.5     57   55.2     61   59.0
             9     52   50.5     56   55.2     60   59.0
            10     52   51.4     57   56.2     60   59.0
            11     53   51.4     58   56.2     61   59.0
            12     53   51.4     57   57.1     62   59.0
            13     54   52.4     57   57.1     61   59.0
            14     54   52.4     57   57.1     61   59.0
            15     54   53.3     58   57.1     61   59.0
            16     54   53.3     57   57.1     61   59.0
            17     54   53.3     58   57.1     62   59.0
            18     55   53.3     58   57.1     62   59.0
            19     55   54.3     58   57.1     62   59.0
            20     56   54.3     57   57.1     62   59.0
            21     55   54.3     59   57.1     63   59.0
            22     55   54.3     58   57.1     63   59.0
            23     55   54.3     58   57.1     61   59.0
            24     54   55.2     58   57.1     62   59.0
            25     56   55.2     58   57.1     63   59.0
            26     55   55.2     59   57.1     63   60.9
            27     55   55.2     59   57.1     63   60.9
            28     56   55.2     58   58.0     63   60.9
            29     57   55.2     59   58.0     63   60.9
            30     48   54.3     50   56.2     53   59.0
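
As a rough cross-check of the table above, the average MFLOPS figures can be converted into floating point operations per clock cycle per core. The following Python sketch is not part of the benchmark suite. The function name is hypothetical, and it assumes that the work is spread evenly over four cores (one thread each) running at the reported 1800 MHz.

```python
def flops_per_cycle_per_core(mflops: float, mhz: float, cores: int = 4) -> float:
    """MFLOPS divided by (MHz x cores) gives operations per clock per core."""
    return mflops / (mhz * cores)


# Average MFLOPS from the table above (two single precision runs, one double)
runs = [("Single 09:45", 28011), ("Single 10:18", 27995), ("Double 10:49", 15948)]

for label, mflops in runs:
    rate = flops_per_cycle_per_core(mflops, 1800)
    print(f"{label}: {rate:.2f} FLOPs per cycle per core")
```

On these figures, single precision works out at just under 4 operations per clock per core and double precision at a little over 2.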
  



32 Bit Integer Stress Tests - MP-IntStress

The Pi 400 integer stress tests were run shortly after those on floating point, leading to higher temperatures on starting, but not much difference in the maximum recordings. They were also run using a data size of 256 KB and four threads. Both the fan cooled Pi 4B and the Pi 400 again ran continuously at maximum MHz and constant voltage. Measured MB/second of each was also effectively constant, with the Pi 400 being 20% faster. The Pi 4B was also run with the fan inoperable, where there were fewer CPU MHz throttling effects and lower performance degradation than in the floating point tests, at 29%.

An additional test was carried out on the Pi 400 outside on a sheltered table, where the local temperature was initially 40C, increasing to 44C with the sun shining on part of the keyboard. The keyboard temperature increased to 51C for the last minute of the test. Over the testing time, maximum temperatures increased by around 7C, not sufficient to invoke throttling and providing virtually the same performance as in the earlier test.


            Pi 4B Fan    Pi 4B No Fan  Pi 400 Fanless  Pi 400 Outside

     MB/S Avg   22164         15736         26395         26215
     MB/S Min   21472         13756         25541         25779

      MHz Avg    1500          1053          1800          1800
      MHz Min    1500           600          1800          1800

        Volts  0.8600        0.8600        0.9500        0.9500

  Temperatures    CPU   PMIC    CPU   PMIC    CPU   PMIC    CPU   PMIC
                   C     C     C     C     C     C     C     C
           Avg   61.5   47.6   82.7   69.9   62.1   59.8   65.1   63.5
           Max   64.0   48.6   86.0   72.2   64.0   61.8   71.0   69.4

       Minutes
             0     45   41.1     60   55.2     48   53.3     45   49.6
             1     59   43.9     78   62.8     58   55.2     56   54.3
             2     63   46.7     82   66.5     62   57.1     58   56.2
             3     63   48.6     83   69.4     60   59.0     60   58.0
             4     63   48.6     83   70.3     62   59.0     62   59.0
             5     63   48.6     83   70.3     62   59.0     63   60.9
             6     63   47.7     84   70.3     62   59.0     63   61.8
             7     64   47.7     84   70.3     63   59.0     63   61.8
             8     62   47.7     83   70.3     61   59.0     64   62.8
             9     60   46.7     84   70.3     64   59.0     65   62.8
            10     62   47.7     83   70.3     63   59.0     65   62.8
            11     62   47.7     83   70.3     62   59.0     64   62.8
            12     63   48.6     83   70.3     63   59.0     66   62.8
            13     62   48.6     83   70.3     63   60.9     67   62.8
            14     63   48.6     83   70.3     63   59.0     68   64.6
            15     64   48.6     83   70.3     62   59.0     68   64.6
            16     63   48.6     83   70.3     63   60.9     67   65.6
            17     61   47.7     83   70.3     63   60.9     68   65.6
            18     63   47.7     84   71.2     62   60.9     67   65.6
            19     62   47.7     83   70.3     64   60.9     68   65.6
            20     63   47.7     83   70.3     64   60.9     67   65.6
            21     63   48.6     84   70.3     61   60.9     69   66.5
            22     63   48.6     84   70.3     63   60.9     67   65.6
            23     64   47.7     85   72.2     63   60.9     67   65.6
            24     62   48.6     84   72.2     63   61.8     68   66.5
            25     61   47.7     86   72.2     63   60.9     69   66.5
            26     62   47.7     86   72.2     64   61.8     70   67.5
            27     61   47.7     84   72.2     63   61.8     70   67.5
            28     62   47.7     86   72.2     62   60.9     71   68.4
            29     62   48.6     84   72.2     64   61.8     71   69.4
            30     54   47.7     84   72.2     64   61.8     62   68.4
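
The percentages quoted in the text can be reproduced from the averages in the table above. This is a minimal Python sketch for checking the arithmetic, not something produced by the stress test programs. The helper name is hypothetical.

```python
def percent_change(new: float, base: float) -> float:
    """Percentage change of new relative to base (positive means higher)."""
    return (new - base) / base * 100.0


# MB/S Avg values from the table above
pi4b_fan, pi4b_no_fan, pi400 = 22164, 15736, 26395

print(f"Pi 400 versus Pi 4B with fan:   {percent_change(pi400, pi4b_fan):+.1f}%")
print(f"Pi 4B no fan versus with fan:   {percent_change(pi4b_no_fan, pi4b_fan):+.1f}%")
# Average MHz without the fan, against the nominal 1500 MHz
print(f"Pi 4B no fan MHz versus 1500:   {percent_change(1053, 1500):+.1f}%")
```

The first two results round to the 20% gain and 29% degradation mentioned earlier, and the third shows that the average clock reduction without the fan was of a similar size to the performance loss.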
  



64 Bit Integer Stress Tests - MP-IntStress64

As indicated in 64 Bit Stress Test Benchmarks, the gcc 8 integer calculations are shown as often running at half speed (see 64 Bit Danger). So, the earlier MP-IntStress64 program was used for these stress tests. Three half hour runs were carried out, using four threads, covering data from L1 caches, shared L2 cache and RAM. For the latter, more than 3 GB was used, as reflected in the vmstat details shown below.

Again, temperatures were low and performance constant, within normal variations.

   Memory  KB      16           256       3500000       vmstat Memory
                                                         3500000 KB      
MB/sec Avg      34803         28756          3571
MB/sec Min      32815         26940          2804

      MHz Avg    1800          1800          1800
      MHz Min    1800          1800          1800

        Volts  0.9500        0.9500        0.9500

 Temperatures     CPU   PMIC    CPU   PMIC    CPU   PMIC
                   C     C     C     C     C     C
      Avg        55.8   55.3   59.9   58.6   48.2   51.6
      Max        59.0   57.1   63.0   61.8   50.0   52.4
                                                           swpd    free
       Minutes
             0     42   46.7     44   49.6     42   46.7      0 3417124
             1     51   50.5     56   53.3     46   49.6  77312  115776
             2     53   52.4     57   55.2     48   50.5  76800  107948
             3     55   52.4     57   55.2     47   50.5  76800  107648
             4     54   53.3     58   57.1     48   50.5  76800  107884
             5     54   54.3     59   57.1     48   51.4  76800  107884
             6     55   54.3     58   57.1     48   51.4  76800  107128
             7     55   54.3     61   57.1     48   51.4  76800  108388
             8     55   55.2     60   58.0     48   51.4  76544  106120
             9     55   55.2     61   59.0     48   51.4  76544  105868
            10     56   55.2     60   59.0     49   51.4  76544  104356
            11     55   55.2     59   59.0     49   51.4  76544  105364
            12     58   55.2     60   59.0     49   51.4  76544  104356
            13     57   55.2     61   59.0     49   51.4  74240  116292
            14     58   55.2     62   59.0     49   52.4  74240  129052
            15     57   57.1     63   59.0     48   52.4  74240  129052
            16     56   55.2     61   59.0     49   52.4  74240  128548
            17     57   57.1     62   59.0     49   52.4  74240  128664
            18     57   57.1     63   59.0     49   52.4  74240  128012
            19     58   57.1     62   59.0     49   52.4  74240  129272
            20     58   57.1     62   60.9     50   52.4  74240  128784
            21     58   57.1     62   60.9     49   52.4  74240  128028
            22     59   57.1     61   59.0     49   52.4  74240  128532
            23     58   57.1     63   60.9     48   52.4  74240  128280
            24     58   57.1     63   60.9     49   52.4  74240  127020
            25     58   57.1     63   60.9     49   52.4  74240  128532
            26     58   57.1     62   60.9     48   52.4  74240  128532
            27     58   57.1     60   60.9     50   52.4  74240  128280
            28     59   57.1     62   60.9     48   52.4  73728  127776
            29     58   57.1     63   61.8     49   52.4  73728  126516
            30     50   55.2     52   59.0     44   51.4  73472  127524
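
The "more than 3 GB" statement can be checked against the vmstat columns in the table above. The Python sketch below is only an illustration. It assumes that vmstat was reporting in its default units of 1024 bytes, and the variable names are hypothetical.

```python
KB_PER_GB = 1024 * 1024  # assumes vmstat default units of 1024 bytes

free_before_kb = 3417124   # "free" at minute 0
free_during_kb = 115776    # "free" at minute 1, with the test running
swapped_kb = 77312         # "swpd" at minute 1

taken_from_free_kb = free_before_kb - free_during_kb
print(f"Free memory consumed: {taken_from_free_kb} KB "
      f"= {taken_from_free_kb / KB_PER_GB:.2f} GB")
print(f"Swapped out as well:  {swapped_kb} KB "
      f"= {swapped_kb / KB_PER_GB:.2f} GB")
```

The drop in free memory alone is over 3 GB, with a small amount of other data pushed out to swap space to make room.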
  



32 Bit System Stress Tests

These stress tests comprised running programs, each for 15 minutes at the same time, exercising floating point calculations and OpenGL graphics activity, whilst others were validating data transfers from RAM and the main drive. The run time environment was also monitored. The variations of programs used can be obtained from Raspberry-Pi-4-Benchmarks.tar.gz.

The script file, shown below, was used to kick off the programs at the same time (within 10 seconds, validated by provided results logs). The tests were run on the latest 8 GB Pi 4B, with cooling fan, and the fanless Pi 400 PC. The 4B drive was a 32 GB SD card with the Pi 400 using a higher speed USB 3 booted disk drive.


######################## Script File ########################

lxterminal -e ./RPiHeatMHzVolts2 Passes 16 Seconds 60 Log 31 &
lxterminal -e ./liverloopsPiA7R Seconds 12 Log 31 &
lxterminal -e  ./MP-IntStress Threads 1 KB 15000 Mins 15 Log 31 &
lxterminal -e ./burnindrive2 Repeats 16, Minutes 12, Log 31, Seconds 1  &
export vblank_mode=0  &
lxterminal -e ./videogl32 Test 6 Mins 15 Log 31 &
vmstat 60 16 > vmstat31.txt

The following results cover CPU MHz, voltage, temperatures and utilisation of memory, drives and CPU, with details for other programs on the next page. Both systems appeared to run continuously at maximum CPU MHz, without temperatures increasing anywhere near the point where throttling would occur. The Pi 4B CPU started 5C higher, continuing with the same difference until the end. The Pi 400 PMIC started 4C higher and that increased to 6C.

VMSTAT shows that not much RAM was needed for these tests, both systems having similar CPU + Wait For I/O utilisations, with around 1% idle time. The main difference was main drive MB/second, with the Pi 400 disk drive some 75% faster than the Pi 4B SD card. Not too much can be read into that. The effect might have been reversed, had the Pi 4B been the one using the hard drive.

Results on the next page indicate that the Pi 400 obtained an official Livermore Loops average of 592.1 MFLOPS, compared with 494.4 on the Pi 4B, a difference of 20%. The two systems obtained similar speeds during the integer RAM tests, of over 2.3 GB/second, with the Pi 400 producing an 11% performance advantage running the OpenGL Textured Kitchen routine.

################# Pi 4B #################  ################# Pi 400 ################

================== CPU MHz CPU Voltage and Temperature Measurement =================

Secs   Start at Wed Aug 12 14:03:08 2020   Secs   Start at Wed Aug 12 14:02:58 2020

  0 ARM MHz=1500 0.86V CPU=46C pmic=42C    0 ARM MHz=1800 0.95V CPU=41C pmic=46C
 60 ARM MHz=1500 0.86V CPU=56C pmic=47C   60 ARM MHz=1800 0.95V CPU=49C pmic=50C
121 ARM MHz=1500 0.86V CPU=59C pmic=49C  121 ARM MHz=1800 0.95V CPU=51C pmic=51C
182 ARM MHz=1500 0.86V CPU=59C pmic=49C  182 ARM MHz=1800 0.95V CPU=50C pmic=52C
243 ARM MHz=1500 0.86V CPU=57C pmic=49C  243 ARM MHz=1800 0.95V CPU=51C pmic=52C
304 ARM MHz=1500 0.86V CPU=59C pmic=49C  303 ARM MHz=1800 0.95V CPU=53C pmic=53C
365 ARM MHz=1500 0.86V CPU=59C pmic=50C  364 ARM MHz=1800 0.95V CPU=54C pmic=54C
426 ARM MHz=1500 0.86V CPU=59C pmic=50C  425 ARM MHz=1800 0.95V CPU=53C pmic=54C
486 ARM MHz=1500 0.86V CPU=59C pmic=50C  486 ARM MHz=1800 0.95V CPU=52C pmic=55C
547 ARM MHz=1500 0.86V CPU=59C pmic=49C  548 ARM MHz=1800 0.95V CPU=53C pmic=55C
608 ARM MHz=1500 0.86V CPU=59C pmic=49C  609 ARM MHz=1800 0.95V CPU=55C pmic=55C
669 ARM MHz=1500 0.86V CPU=58C pmic=50C  670 ARM MHz=1800 0.95V CPU=55C pmic=55C
730 ARM MHz=1500 0.86V CPU=59C pmic=49C  731 ARM MHz=1800 0.95V CPU=54C pmic=55C
790 ARM MHz=1500 0.86V CPU=58C pmic=49C  792 ARM MHz=1800 0.95V CPU=54C pmic=55C
851 ARM MHz=1500 0.86V CPU=59C pmic=49C  854 ARM MHz=1800 0.95V CPU=54C pmic=56C
912 ARM MHz=1500 0.86V CPU=51C pmic=48C  915 ARM MHz=1800 0.95V CPU=48C pmic=55C
       End at Wed Aug 12 14:19:21 2020             End at  Wed Aug 12 14:19:14 2020

============================== vmstat 60 second samples =============================

  Memory MB       MB/sec CPU %utilise Wait   Memory MB       MB/sec CPU %utilise Wait
  free buff cache in out user sys idle I/O   free buff cache in out user sys idle I/O
 
  7505  23   224   0   0   3   1  95    1    3377  89   229   0   0   3   1  94    2
  7428  24   263   0   7  75   7   3   15    3313  89   256  33  11  76  12   2   10
  7424  24   266  15   4  76   9   0   15    3315  89   252  42   0  76  11   0   12
  7424  24   265  24   0  76  10   1   13    3316  89   252  42   0  76  11   1   12
  7423  24   265  25   0  75  10   1   15    3315  89   252  43   0  76  10   2   12
  7423  24   265  24   0  75  10   1   15    3312  89   256  42   0  76  11   1   13
  7422  24   266  24   0  75   9   1   16    3313  89   254  41   0  76  10   1   13
  7422  24   268  24   0  75  10   1   14    3311  89   256  41   0  77  11   0   12
  7422  24   268  24   0  76  10   1   13    3310  89   257  42   0  76  11   1   12
  7420  24   269  24   0  75   9   1   14    3311  89   255  41   0  77  10   1   12
  7422  24   267  24   0  76  10   1   13    3310  89   257  41   0  77  11   0   12
  7423  24   267  24   0  74  10   1   15    3308  89   258  41   0  77  11   1   11
  7420  24   269  24   0  75   9   0   15    3308  89   258  43   0  76  11   2   12
  7419  25   270  24   0  74   9   0   16    3309  89   256  64   0  77   1   5    1
  7420  25   268  24   0  75  10   1   14    3309  90   256  75   0  77   1   6    1
  7423  25   266  25   0  70  10   4   16    3309  90   258  78   0  63   1   6    1
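
The monitor lines in the first table above have a regular layout, so they can be picked apart with a short script when checking for throttling. The following Python sketch is an assumption about how that might be done, not the author's tooling. The pattern and function names are hypothetical, and each string holds one system's half of a line.

```python
import re

# Matches samples such as "  0 ARM MHz=1500 0.86V CPU=46C pmic=42C"
SAMPLE = re.compile(r"(\d+) ARM MHz=(\d+) ([\d.]+)V CPU=(\d+)C pmic=(\d+)C")


def parse_sample(text: str) -> dict:
    """Extract one monitor sample; raises ValueError if the layout differs."""
    match = SAMPLE.search(text)
    if match is None:
        raise ValueError(f"unrecognised sample: {text!r}")
    secs, mhz, volts, cpu, pmic = match.groups()
    return {"secs": int(secs), "mhz": int(mhz), "volts": float(volts),
            "cpu_c": int(cpu), "pmic_c": int(pmic)}


# A few Pi 400 samples copied from the table above
pi400 = [parse_sample(line) for line in (
    "  0 ARM MHz=1800 0.95V CPU=41C pmic=46C",
    "609 ARM MHz=1800 0.95V CPU=55C pmic=55C",
    "915 ARM MHz=1800 0.95V CPU=48C pmic=55C",
)]

throttled = any(sample["mhz"] < 1800 for sample in pi400)
print("Throttled:", throttled)
print("Highest CPU temperature:", max(sample["cpu_c"] for sample in pi400), "C")
```

Feeding in every line of a log in this way gives a quick confirmation that MHz stayed at the maximum throughout.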
 



Other Stress Testing Programs - run with the above

Livermore Loops - There are 24 of these, each with an individual MFLOPS measurement. A number of summaries are also produced, the official average being the geometric mean. Note that there are three passes of this benchmark with differing memory demands. The detailed figures are from one of these runs but the summaries are for all results.

MP Integer RAM Exerciser and OpenGL Benchmark - These report results as the tests progress, and performance for both is provided together below. There can be performance variations over the testing time, depending on activities in other programs or manual interventions.

BurnInDrive uses 64 KB block sizes, with 164 variations of data patterns, where a parameter controls file size, in this case 16 blocks per pattern for 164 MB files. Four of these files are written, then read by random selection for a specified time. Finally, blocks are read continuously for a specified number of seconds. Performance from the Pi 400 hard drive was clearly superior to that from the Pi 4B SD card. Calculated reading speeds were effectively the same as indicated by VMSTAT.
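
The 164 MB file size follows from the block size, the pattern count and the Repeats parameter. The short Python sketch below shows the arithmetic. It assumes that each of the 164 patterns is written as 16 blocks of 64 KB, which is inferred from the description above rather than taken from the program source.

```python
BLOCK_KB = 64    # block size stated above
PATTERNS = 164   # data pattern variations stated above
REPEATS = 16     # "Repeats 16" parameter, assumed to be blocks per pattern

file_kb = BLOCK_KB * REPEATS * PATTERNS
file_mb = file_kb / 1024

print(f"{file_kb} KB per file = {file_mb:.2f} MB")
print(f"Four files = {4 * file_mb:.2f} MB")
```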

======= Livermore Loops 64 Bit Reliability test 12 seconds each loop x 24 x 3 =======

Pi 4B                                      Pi 400
Wed Aug 12 14:03:08 2020                   Wed Aug 12 14:02:58 2020
Numeric results were as expected           Numeric results were as expected
MFLOPS for 24 loops                        MFLOPS for 24 loops

 734.0  933.4  982.3  939.1  204.1  717.0   820.5 1060.4 1063.9 1066.8  233.7  517.4
1128.8 1600.7 1225.6  383.5  211.8  184.8  1358.0 1911.3 1521.3  487.6  251.1  220.1
 135.9  267.9  710.0  619.7  731.0 1012.8   188.6  363.1  842.4  734.0  868.1 1177.8
 315.4  330.3  305.8  352.7  681.0  186.4   379.2  393.9  303.3  416.7  835.5  207.1

Maximum Average Geomean Harmean Minimum     Maximum Average Geomean Harmean Minimum
 1600.7   610.8   494.4   390.8   117.9      1911.3   728.0   592.1   472.4   164.5

End of test Wed Aug 12 14:17:56 2020        End of test Wed Aug 12 14:17:24 2020
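
The summary line can be illustrated using the Python standard library. This sketch takes the 24 Pi 400 figures listed above and calculates the same five ratings. It is an illustration only: the function name is hypothetical, and the results will not match the summary line exactly because, as noted earlier, the detailed figures come from one pass while the summaries cover all three.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Pi 400 MFLOPS for the 24 loops, copied from the detailed figures above
pi400_mflops = [
    820.5, 1060.4, 1063.9, 1066.8, 233.7, 517.4,
    1358.0, 1911.3, 1521.3, 487.6, 251.1, 220.1,
    188.6, 363.1, 842.4, 734.0, 868.1, 1177.8,
    379.2, 393.9, 303.3, 416.7, 835.5, 207.1,
]


def summarise(values):
    """Return the five ratings in the order used by the benchmark summary."""
    return {
        "Maximum": max(values),
        "Average": fmean(values),
        "Geomean": geometric_mean(values),
        "Harmean": harmonic_mean(values),
        "Minimum": min(values),
    }


for name, value in summarise(pi400_mflops).items():
    print(f"{name:8s} {value:8.1f}")
```

The geometric mean is always at or below the arithmetic average, so a few very fast loops have less influence on the official rating.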

===================== MP Integer RAM and OpenGL Tests ======================

                                          Pi 4B             Pi 400          
 Start   Aug 12 2020                     14:03:08 14:03:08 14:02:58 14:02:58

    Secs  Kbytes Thrds Pattern All Same    MB/sec      FPS   MB/sec      FPS

      30   15000  1   00000000      Yes      2528       13     2428       14
      60   15000  1   FFFFFFFF      Yes      2501       13     2379       15
      90   15000  1   FFFFFFFF      Yes      2217       13     2539       13
To
     840   15000  1   AAAAAAAA      Yes      2217       14     2175       17
     870   15000  1   CCCCCCCC      Yes      2569       12     2343       15
     900   15000  1   CCCCCCCC      Yes      2455       13     2348       16

     Average                                 2351     13.2     2394     14.7

    End  Aug 12 2020                     14:18:12 14:18:11 14:18:03 14:18:01

======================== burnindrive2 Main Drive =========================

 Pi 400                                                               Pi 4B
 Start   Wed Aug 12 14:02:58 2020                                     14:03:08 2020

 Write  seconds 164.00 MB x 4 files  11.93                               82.49

 Read files for 12+ minutes                                              Files Minutes
                                                                          x 4
 Read passes     1 x 4 Files x  164.00 MB in     0.26 minutes                1    0.44
 Read passes     2 x 4 Files x  164.00 MB in     0.52 minutes                2    0.92
To
 Read passes    25 x 4 Files x  164.00 MB in     6.52 minutes               13    5.88
 Read passes    26 x 4 Files x  164.00 MB in     6.78 minutes               14    6.34
To
 Read passes    45 x 4 Files x  164.00 MB in    11.79 minutes               26   11.79
 Read passes    46 x 4 Files x  164.00 MB in    12.08 minutes               27   12.25

 Calculated MB/second over 12+ minutes          41.6                             24.1

 Passes in 1 second(s) for each of 164 blocks of 64KB:
                                                                         Examples
   1140   1180   1160   1220   1280   1360   1520   1520   1460   1240     420     420
   1260   1200   1160   1140   1140   1140   1160   1140   1160   1120     380     400
To
   1320   1400   1360   1300   1160   1240   1360   1380   1400   1140     540     560
   1240   1240   1240   1220   1220   1240   1180   1180   1160   1180     560     560

                                                                        Passes Minutes
 200220 read passes of 64KB blocks in     2.76 minutes                   79580    2.80

 No errors found during reading tests
 End    Wed Aug 12 14:18:00 2020                                       14:19:34 2020
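
The calculated MB/second figures above can be reproduced from the pass counts and elapsed minutes. The following Python sketch is a check on the arithmetic, with a hypothetical helper name. It assumes 1 MB = 1024 KB for the 64 KB block tests.

```python
def mb_per_second(megabytes: float, minutes: float) -> float:
    """Average transfer rate over an elapsed time given in minutes."""
    return megabytes / (minutes * 60.0)


FILE_MB = 164.0

# Random file reads: passes x 4 files x 164 MB
pi400_files = mb_per_second(46 * 4 * FILE_MB, 12.08)
pi4b_files = mb_per_second(27 * 4 * FILE_MB, 12.25)

# Repeated block reads: passes x 64 KB
pi400_blocks = mb_per_second(200220 * 64 / 1024, 2.76)
pi4b_blocks = mb_per_second(79580 * 64 / 1024, 2.80)

print(f"File reads  Pi 400 {pi400_files:.1f}  Pi 4B {pi4b_files:.1f} MB/second")
print(f"Block reads Pi 400 {pi400_blocks:.1f}  Pi 4B {pi4b_blocks:.1f} MB/second")
```

The file reading results agree with the 41.6 and 24.1 MB/second shown in the log.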



64 Bit System Stress Tests

These 15 minute tests were run using the same script file sequence as the above 32 bit session, using 64 bit program compilations, except that most of the RAM was used in the single core integer test.

The vmstat report shows that all these programs ran without memory swapping, with nearly all four cores being used continuously. Recorded data transfer speeds confirmed those measured by the drive program. Processor speed and measured OpenGL frames per second were constant, with low temperatures being maintained.

        vmstat                                      RPiHeatMHzVolts    MP-Int  OpenGL
        Memory MB-------  MB/sec CPU %util-- %wait  ARM Volts CPU PMIC Stress  Test 6
Minutes swpd  free cache  in out usr sys idl I/O    MHz        C  C  MB/sec     FPS

     0     0  3285   298   0   0   8   0  91   0   1800  0.95  38  41
     1     0   291   326   0  11  74   8   2  17   1800  0.95  46  45    2198      22
     2     0   277   329  28   0  77   8   1  14   1800  0.95  47  46    2202      21
     3     0   273   332  28   0  76   7   1  16   1800  0.95  46  47    2203      21
     4     0   273   333  28   0  76   7   1  16   1800  0.95  48  48    2211      21
     5     0   272   334  28   0  76   7   1  15   1800  0.95  49  48    2201      22
     6     0   275   331  28   0  76   7   1  16   1800  0.95  48  48    2196      22
     7     0   275   330  28   0  76   7   1  16   1800  0.95  50  49    2193      21
     8     0   270   334  28   0  76   7   1  16   1800  0.95  48  49    2189      21
     9     0   275   330  28   0  76   7   1  16   1800  0.95  49  49    2175      22
    10     0   274   331  28   0  76   7   1  15   1800  0.95  51  50    2169      21
    11     0   273   331  28   0  76   7   1  15   1800  0.95  51  50    2166      20
    12     0   271   333  28   0  76   7   1  16   1800  0.95  51  50    2162      21
    13     0   271   334  28   0  76   7   2  15   1800  0.95  51  50    2156      21
    14     0   270   335  30   0  76   7   1  16   1800  0.95  51  50    2148      21
    15     0   271   335  30   0  70   7   6  16   1800  0.95  46  49    2129      20

                                            Avg    1800  0.95  48  48    2180      21
                                            Min    1800  0.95  38  41    2129      20
                                            Max    1800  0.95  51  50    2211      22


Livermore Loops Benchmark 64 Bit

Reliability test 12 seconds each loop x 24 x 3

MFLOPS for 24 loops

 2447.2 1053.5 1083.2 1140.1  452.8  921.3
 2738.0 3288.1 2377.3  694.8  566.2 1129.5
  203.9  430.8  925.5  718.4  835.0 1291.8
  516.4  440.1 1899.1  435.7  910.7  363.6

Overall Ratings
Maximum Average Geomean Harmean Minimum
 3288.1  1110.8   881.5   703.7   185.7

Main SD Card Storage Stress Test ARM 64 Bit

 4 x 164.00 MB written in 58.38 seconds = 11.2 MB/second
 Read passes 31 x 4 Files x 164.00 MB in 12.07 minutes = 28.1 MB/second
 85660 read passes of 64KB blocks in 2.79 minutes = 32.0 MB/second



32 Bit TV Test Plus Remote Access

Using a TV connection, the Pi 400 was run for eight hours displaying live TV via BBC iPlayer. Before increasing the image to full screen, my environment monitor program was started. The room temperature was 24C. As can be seen below, CPU and PMIC temperatures did not increase significantly. Full screen picture resolution was reported as 960 x 540 with Ethernet traffic at 1700 kbps.

Later, two terminals were connected from PuTTY, on a PC. Sysstat software was installed from there, to enable monitoring of network data transfer speeds. The VMSTAT system utilisation monitor was started from the second terminal, with both saving results on the Pi 400 SD card.

Received network data mainly arrived continuously at around 214 kB per second. Taking into account extra overhead bits, that is similar to 1700 kbits per second. The increases to more than 250 kB/s, and the associated transmitted bytes, occurred after I opened VNC Viewer on my smartphone to have a look at the TV picture there. It was really bad, with jumpy rather than smooth flow. Assuming that full screen data is transferred, rather than in compressed input format, 960 x 540 pixels at 4 bytes per pixel indicates over 2000 kB per frame, implying that data supplied to the phone would result in an extremely low displayed frames per second.
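
The arithmetic behind these estimates can be set out as a short Python sketch. It is an illustration under stated assumptions: frames are taken to be sent uncompressed at 4 bytes per pixel, as supposed above, kB is treated as 1000 bytes, and the transmitted rate is the 30 minute average from the 12:08:54 sar sample, so the result is only a rough guide.

```python
WIDTH, HEIGHT, BYTES_PER_PIXEL = 960, 540, 4

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
frame_kb = frame_bytes / 1000     # decimal kB, an assumption

rx_kbits = 214 * 8                # received kB/s expressed as kbits/s, payload only
tx_kb_per_sec = 2104.84           # txkB/s in the 12:08:54 sar sample
frames_per_sec = tx_kb_per_sec / frame_kb

print(f"Uncompressed frame: {frame_bytes} bytes = {frame_kb:.1f} kB")
print(f"Received traffic:   {rx_kbits} kbits/second")
print(f"Frames per second supported by transmitted data: {frames_per_sec:.2f}")
```

On these assumptions, the transmitted data rate corresponds to about one full frame per second, consistent with the jumpy picture seen on the phone.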

VMSTAT indicates low Pi 400 CPU utilisation. The only noticeable activity is data output to the main drive being the same as kB/s received over the network. The burst of reading from the drive, near the end, occurred following pausing the iPlayer for a short time, followed by continuing playing the recording.

sar -n DEV 1800 20  Communications Traffic
10:38:54     rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil

11:08:54      147.41     61.02    214.87      4.27      0.00      0.00      0.04      0.18
11:38:54      146.16     62.63    211.44      4.38      0.00      0.00      0.04      0.17
12:08:54      898.76   1548.94    261.53   2104.84      0.00      0.00      0.54      1.72
12:38:54     1028.00   1794.83    273.61    155.17      0.00      0.00      1.40      0.22
13:08:54      148.80     62.42    216.92      4.36      0.00      0.00      0.04      0.18
13:38:54      148.77     62.69    216.94      4.38      0.00      0.00      0.03      0.18
14:08:54     3266.34   3909.49    155.45   2299.59      0.00      0.00      2.33      1.88
14:38:54      147.26     62.17    214.69      4.34      0.00      0.00      0.05      0.18
15:08:54      149.15     62.01    216.70      4.33      0.00      0.00      0.66      0.18
15:38:54      146.12     62.80    211.66      4.39      0.00      0.00      1.17      0.17
16:08:54      148.26     61.73    216.21      4.31      0.00      0.00      0.03      0.18
16:38:54      148.85     62.48    217.04      4.37      0.00      0.00      0.03      0.18

vmstat 1800 20 System Utilisation
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st

 1  1      0 2772060  39876 714888    0    0    14    58  680 1022  6  3 91  0  0
 0  0      0 2659720  46688 797124    0    0    31   259 2771 4188  6  3 91  0  0
 0  0      0 2670300  49392 790776    0    0     0   233 2697 4046  5  3 92  0  0
 1  0      0 2636548  52248 812228    0    0     3   232 3014 4191 14  4 81  0  0
 1  0      0 2606896  55732 819748    0    0     4   239 3271 4198 24  6 70  0  0
 0  1      0 2651508  58032 812232    0    0     0   237 2704 4037  6  3 91  0  0
 0  0      0 2626876  60160 822588    0    0     0   235 2687 4038  5  3 92  0  0
 0  0      0 2631420  62128 821656    0    0     0   238 2703 4034  5  3 92  0  0
 0  0      0 2634884  64036 817980    0    0     0   235 2688 4033  5  3 92  0  0
 0  0      0 2643896  65856 813900    0    0     0   237 2684 4032  5  3 92  0  0
 0  0      0 2629000  67704 816040    0    0     0   233 2682 4036  5  3 91  0  0
 4  0      0 2529104  68992 899540    0    0    40   238 2818 4258  6  3 90  0  0
 0  0      0 2529352  70632 896856    0    0     0   237 2693 4034  5  2 92  0  0

 Temperature and CPU MHz Measurement

 Start at Wed Aug 19 08:42:13 2020

 Using samples at 1800 second intervals

 Seconds
     0.0   ARM MHz=1800, core volt=0.9500V, CPU temp=32.0'C, pmic temp=32.6'C
  1800.0   ARM MHz=1800, core volt=0.9500V, CPU temp=35.0'C, pmic temp=38.2'C
  3600.3   ARM MHz=1800, core volt=0.9500V, CPU temp=36.0'C, pmic temp=39.2'C
  5400.5   ARM MHz=1800, core volt=0.9500V, CPU temp=36.0'C, pmic temp=40.1'C
  7200.8   ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=40.1'C
  9001.1   ARM MHz=1800, core volt=0.9500V, CPU temp=36.0'C, pmic temp=40.1'C
 10801.3   ARM MHz=1800, core volt=0.9500V, CPU temp=38.0'C, pmic temp=40.1'C
 12601.7   ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=42.9'C
 14401.9   ARM MHz=1800, core volt=0.9500V, CPU temp=38.0'C, pmic temp=42.0'C
 16202.2   ARM MHz=1800, core volt=0.9500V, CPU temp=38.0'C, pmic temp=41.1'C
 18002.4   ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
 19802.7   ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
 21603.0   ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
 23403.3   ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
 25203.5   ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
 27003.8   ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C
 28804.1   ARM MHz=1800, core volt=0.9500V, CPU temp=37.0'C, pmic temp=41.1'C

 Terminated Wed Aug 19 16:42 
  



64 Bit TV Test Using Bluetooth

The session via the 64 bit Operating System displayed BBC iPlayer programmes on a PC monitor for more than 7 hours. The Pi 400 PC that I used has no AV jack, so sound was played by pairing a Bluetooth speaker. As with the 32 bit tests, the Ethernet network connection was used.

Without paying serious attention, full screen close up picture quality was acceptable. Then, right clicking on the screen, from time to time, indicated the following properties, showing large differences in the amount of data handled and displayed. There were corresponding variations in monitored statistics, with network received traffic (rxkB/s), vmstat drive kB data out (bo), and CPU utilisation (us + sy). CPU and PMIC temperatures reduced, but that might have been due to the room becoming cooler approaching midnight. I suppose that the varying traffic levels were caused by network congestion (but was it?).

Bluetooth - I found it difficult to connect Bluetooth devices (in my environment?). After failing to pair, I could find no menu based operation to prevent further error indications. Executing the commands shown below allowed more attempts and sometimes successful connection.

                             kbps         pixels
              Periodic       1700       960 x 540
              Properties     5166      1280 x 720 at 18:30
              Displayed       932       704 x 396
                              533       512 x 288
                             5166      1280 x 720
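
The displayed bit rates above can be converted to kB per second for comparison with the network monitor. This Python sketch is a simple unit conversion with a hypothetical function name. It ignores protocol overheads, so exact agreement should not be expected.

```python
def kbps_to_kbytes_per_sec(kbps: float) -> float:
    """Convert kilobits per second to kilobytes per second (8 bits per byte)."""
    return kbps / 8.0


# kbps and pixel sizes from the iPlayer properties shown above
properties = [(1700, "960 x 540"), (5166, "1280 x 720"),
              (932, "704 x 396"), (533, "512 x 288")]

for kbps, pixels in properties:
    print(f"{pixels:>10s}  {kbps:5d} kbps = {kbps_to_kbytes_per_sec(kbps):6.1f} kB/second")
```

The results of about 646, 117 and 67 kB/second are close to the rxkB/s levels of around 650, 117 and 70 reported by sar in the table that follows.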

sar -n DEV 1800 20  Communications Traffic
16:25:17   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s   %ifutil

16:55:17    142.08     73.26    205.16      5.13      0.00      0.00      1.93     16.81
17:25:17    137.60     70.84    198.17      4.97      0.00      0.00      2.10     16.23
17:55:17    130.85     67.14    188.22      4.71      0.00      0.00      2.18     15.42
18:25:17    397.85    193.62    576.05     13.28      0.00      0.00      0.49     47.19
18:55:17    455.64    221.75    661.41     15.08      0.00      0.00      0.03     54.18
19:25:17    450.00    217.99    654.41     14.73      0.00      0.00      0.04     53.61
19:55:17    446.87    216.61    649.94     14.68      0.00      0.00      0.03     53.24
20:25:17    176.93     88.05    256.34      6.11      0.00      0.00      1.43     21.00
20:55:17    134.94     68.95    193.33      4.90      0.00      0.00      2.08     15.84
21:25:17     84.61     44.44    117.08      3.32      0.00      0.00      2.04      9.59
21:55:17     79.75     42.79    110.32      3.24      0.00      0.00      2.07      9.04
22:25:17     51.18     28.03     69.52      2.26      0.00      0.00      1.99      5.70
22:55:17     51.49     28.67     69.98      2.32      0.00      0.00      2.07      5.73
23:25:17     37.40     21.18     49.87      1.82      0.00      0.00      2.04      4.09
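
As a rough cross-check of the sampled stream properties against the sar measurements, the following sketch converts each sampled bitrate to kB/s and finds the sar interval with the closest received traffic. The data values are copied from the tables above. The function names are illustrative, and the conversion assumes 1 kbps = 1000 bits per second and that sar reports kB as 1024 bytes. Only the 5166 kbps sample has a recorded time, so matching the others to an interval is a guess.

```python
# Sampled stream bitrates (kbps) from the iPlayer properties table above.
SAMPLED_KBPS = [1700, 5166, 932, 533]

# (interval end time, rxkB/s) rows copied from the sar output above.
SAR_RX = [
    ("16:55:17", 205.16), ("17:25:17", 198.17), ("17:55:17", 188.22),
    ("18:25:17", 576.05), ("18:55:17", 661.41), ("19:25:17", 654.41),
    ("19:55:17", 649.94), ("20:25:17", 256.34), ("20:55:17", 193.33),
    ("21:25:17", 117.08), ("21:55:17", 110.32), ("22:25:17", 69.52),
    ("22:55:17", 69.98), ("23:25:17", 49.87),
]


def kbps_to_kb_per_s(kbps):
    """Convert kilobits/s to kB/s (assumes 1000 bits per kbit, 1024 bytes per kB)."""
    return kbps * 1000 / 8 / 1024


def nearest_interval(kb_per_s):
    """Return the sar row whose rxkB/s is closest to the given rate."""
    return min(SAR_RX, key=lambda row: abs(row[1] - kb_per_s))


for kbps in SAMPLED_KBPS:
    expected = kbps_to_kb_per_s(kbps)
    end_time, measured = nearest_interval(expected)
    print(f"{kbps:5d} kbps = {expected:6.1f} kB/s, "
          f"nearest sar interval ends {end_time} at {measured:6.2f} kB/s")
```

On these figures, 5166 kbps converts to about 631 kB/s, against 650 to 661 kB/s measured between 18:25 and 19:55, with the small excess plausibly being protocol overhead and audio. This is consistent with received traffic following the stream quality in use, but it does not identify why the quality changed.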

vmstat 1800 20 System Utilisation
procs   -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa st

 2  0      0 2586672  50628  714756    0    0    39     4  148  161  2  1 97  0  0
 0  0      0 2314608  54248  897128    0    0     0   235 6690 5244 15  4 81  0  0
 4  0      0 2289668  57772  911392    0    0     0   219 6638 5156 14  3 82  0  0
 1  0      0 2287436  60836  913288    0    0     0   226 6688 5248 14  4 82  0  0
 4  0      0 2123628  64124  970940    0    0     1   356 5991 4691 27  5 68  0  0
 3  0      0 2102296  67052  973788    0    0     0   650 7919 5869 50  9 41  0  0
10  0      0 2019576  69760 1020768    0    0     0   665 7963 5894 51  9 40  0  0
 7  0      0 2011040  72504 1028204    0    0     0   636 7882 5837 50  9 41  0  0
 1  0      0 2017004  74848 1021604    0    0     0   473 7445 5680 37  7 56  0  0
 2  0      0 2009756  77008 1013644    0    0     0   230 6684 5016 15  3 82  0  0
 0  0      0 2019832  79008 1002208    0    0     0   164 6491 4966 12  3 85  0  0
 0  0      0 2006692  80804 1013468    0    0     0   140 6471 4992 11  3 86  0  0
 0  0      0 2005796  82740 1007964    0    0     0   107 6381 4839  9  3 88  0  0
 1  0      0 1986264  84220 1029168    0    0     0    92 6344 4781  8  3 89  0  0
 0  0      0 1995508  85804 1010960    0    0     0    85 6351 4789  8  3 90  0  0
 1  0      0 1995000  87296 1011092    0    0     0    75 6295 4743  7  2 90  0  0

 Temperature and CPU MHz Measurement Start at Sat Sep 12 16:11:35 2020
 Seconds
     0.0   ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=43.9'C
  1800.0   ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=46.7'C
  3600.3   ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=46.7'C
  5400.5   ARM MHz=1800, core volt=0.9500V, CPU temp=43.0'C, pmic temp=47.7'C
  7200.8   ARM MHz=1800, core volt=0.9500V, CPU temp=47.0'C, pmic temp=50.5'C
  9001.3   ARM MHz=1800, core volt=0.9500V, CPU temp=48.0'C, pmic temp=51.4'C
 10801.7   ARM MHz=1800, core volt=0.9500V, CPU temp=48.0'C, pmic temp=51.4'C
 12602.2   ARM MHz=1800, core volt=0.9500V, CPU temp=48.0'C, pmic temp=51.4'C
 14402.6   ARM MHz=1800, core volt=0.9500V, CPU temp=44.0'C, pmic temp=48.6'C
 16202.9   ARM MHz=1800, core volt=0.9500V, CPU temp=44.0'C, pmic temp=47.7'C
 18003.2   ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=46.7'C
 19803.4   ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=46.7'C
 21603.7   ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=45.8'C
 23404.0   ARM MHz=1800, core volt=0.9500V, CPU temp=40.0'C, pmic temp=45.8'C
 25204.2   ARM MHz=1800, core volt=0.9500V, CPU temp=41.0'C, pmic temp=45.8'C
 27004.4   ARM MHz=1800, core volt=0.9500V, CPU temp=40.0'C, pmic temp=45.8'C
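
For longer sessions, a log in this format can be summarised by a short script rather than by eye. The sketch below is an illustration only, not part of the benchmark programs: it parses three lines copied from the log above, reports the hottest sample and lists any sample where the clock fell below 1800 MHz. The 1800 MHz threshold and all names in the code are assumptions made for this example.

```python
import re

# Three lines copied from the 64 bit TV test log above.
LOG = """\
     0.0   ARM MHz=1800, core volt=0.9500V, CPU temp=42.0'C, pmic temp=43.9'C
  9001.3   ARM MHz=1800, core volt=0.9500V, CPU temp=48.0'C, pmic temp=51.4'C
 27004.4   ARM MHz=1800, core volt=0.9500V, CPU temp=40.0'C, pmic temp=45.8'C
"""

# One named group per field in a log line.
LINE = re.compile(
    r"^\s*(?P<secs>\d+\.\d+)\s+ARM MHz=(?P<mhz>\d+), core volt=(?P<volt>[\d.]+)V, "
    r"CPU temp=(?P<cpu>[\d.]+)'C, pmic temp=(?P<pmic>[\d.]+)'C"
)


def parse_log(text):
    """Return one dict of floats per log line that matches the expected layout."""
    samples = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            samples.append({key: float(value) for key, value in match.groupdict().items()})
    return samples


samples = parse_log(LOG)
hottest = max(samples, key=lambda s: s["cpu"])
throttled = [s for s in samples if s["mhz"] < 1800]  # assumed full speed is 1800 MHz

print(f"Hottest CPU sample: {hottest['cpu']}'C at {hottest['secs']} seconds")
print(f"Samples below 1800 MHz: {len(throttled)}")
```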

            Bluetooth Commands   sudo hciconfig hci0 reset
                                 sudo invoke-rc.d bluetooth restart



64 Bit Danger

For all my Raspberry Pi and other Linux benchmarks, I have relied on the -O3 compiler optimisation option, possibly with additional parameters for SIMD operation, like -mavx with Linux/Intel and -funsafe-math-optimizations with ARM Cortex-A7, to produce NEON instructions. With 64 bit operation, using -O3, the gcc compiler produced unsuitably slow code (with my programming procedures?). I included other compile directives, but these produced the same slow performance with -O3 present. Then I tried -O2, which seems to avoid vectorisation, the results being shown below, along with samples from the -O3 runs.

For the first two examples, using -O2 produced faster single precision and integer calculations from cached data, but performance using RAM was reduced to around half speed (for example, MemSpeed single precision at 8192 KB fell from 3940 to 1961 MB/S). The integer stress test also regained appropriate cache based speeds, but with no loss of RAM performance.

It seems that anyone hoping for faster SIMD operation with these types of program should also compile without vectorisation and compare the results, to verify that there are real performance gains.

  ########### Memory Reading Speed Test 64 Bit gcc 8 Opt -O2 ###########

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   16065  11315   8296  16099   9473   9459  12353   8037   9349
      16   16245  11407   8309  16259   9522   9513  12569   7993   9466
      32   14290  10468   7747  14377   8451   8248  12673   8039   9525
      64   12853  10212   7867  13049   7747   7975  10854   7452   9026
     128   12970  10307   7958  13149   7852   8070  10159   7610   9094
     256   13021  10286   7986  13157   7958   8078   9714   7706   8986
     512   12781  10259   7958  13009   7951   8079   9631   7665   9033
    1024    3689   4372   3978   4432   3886   3902   5865   5469   5928
    2048    1800   1792   1722   1805   1769   1750   3023   2984   2949
    4096    1921   1933   1905   1918   1910   1894   2658   2678   2686
    8192    1962   1961   1809   1952   1955   1926   2596   2601   2613

  ########### Memory Reading Speed Test 64 Bit gcc 8 Opt -O3 ###########

       8   18133   4792   4749  18693   5259   5275  13962  11182  11182
     256   14783   4646   4716  14698   5053   5063   9666   9768   9809
    8192    2036   3940   3882   2034   3935   3995   2642   2643   2638

  ##### NEON Speed Test 64 Bit gcc 8 Opt -O2 #####

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16  11286  19667   8090  18132  19678  22533
      32  10394  14494   7193  13225  14233  14562
      64  10765  13825   7457  12642  13846  14040
     128  11057  14324   7769  13237  14394  14612
     256  11113  14477   7844  13318  14530  14674
     512  11149  14560   7893  13392  14627  14637
    1024   4513   4758   3637   3808   4211   4770
    4096   2063   2053   2086   2042   2060   2062
   16384   2058   2051   2054   2056   2054   2043
   65536   2059   2045   2049   2064   2049   2050

  ##### NEON Speed Test 64 Bit gcc 8 Opt -O3 #####

      16   4496  19696   4790  17870  18908  21817
     256   3992  14148   4716  13508  14311  14312
   65536   3319   2057   3803   2011   2059   2063

  #### MP-Integer-Test 64 Bit v2-gcc8 Opt -O2 ####

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   4.2    1   8040  7892  3783  00000000    Yes
   3.2    2  17193 15430  3685  FFFFFFFF    Yes
   3.0    4  29261 29819  3329  5A5A5A5A    Yes
   3.0    8  29886 31708  3383  AAAAAAAA    Yes
   3.0   16  30410 33010  3365  CCCCCCCC    Yes
   2.9   32  30375 33435  3392  0F0F0F0F    Yes

  #### MP-Integer-Test 64 Bit v2-gcc8 Opt -O3 ####

   7.4    1   3455  3481  3074  00000000    Yes
   4.7    2   7047  6975  3507  FFFFFFFF    Yes
   3.6    4  13712 13977  3357  5A5A5A5A    Yes
   3.6    8  13631 13696  3353  AAAAAAAA    Yes
   3.7   16  13184 13906  3351  CCCCCCCC    Yes
   3.6   32  12617 13960  3414  0F0F0F0F    Yes
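
The size of the -O2 versus -O3 differences can be put into figures by dividing matching entries from the tables above. The sketch below does this for a handful of entries, one cache based and one RAM based from each benchmark. The choice of entries and the names in the code are mine, made for illustration. A ratio above 1.0 means the -O2 build was faster.

```python
# (label, MB/s at -O2, MB/s at -O3), copied from the tables above.
RESULTS = [
    ("MemSpeed Sngl x[m]=x[m]+s*y[m], 8 KB", 11315, 4792),
    ("MemSpeed Sngl x[m]=x[m]+s*y[m], 8192 KB", 1961, 3940),
    ("NeonSpeed Float Norm, 16 KB", 11286, 4496),
    ("NeonSpeed Float Norm, 65536 KB", 2059, 3319),
    ("MP-Integer 1 thread, 16 KB", 8040, 3455),
    ("MP-Integer 1 thread, 16 MB", 3783, 3074),
]


def o2_over_o3(o2_mb_per_s, o3_mb_per_s):
    """Speed of the -O2 build relative to the -O3 build."""
    return o2_mb_per_s / o3_mb_per_s


for label, o2, o3 in RESULTS:
    print(f"{label:42s} -O2/-O3 = {o2_over_o3(o2, o3):.2f}")
```

For these entries, the cache based ratios are between about 2.3 and 2.5, the MemSpeed and NeonSpeed RAM based ratios are about 0.5 and 0.6, and the integer test RAM based ratio is about 1.2.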
 
