Raspberry Pi 5 64 Bit Benchmarks and Stress Tests

Roy Longbottom

Contents


Summary Introduction Benchmark Results
Whetstone Benchmark Dhrystone Benchmark Linpack 100 Benchmark
Livermore Loops Benchmark FFT Benchmarks BusSpeed Benchmark
MemSpeed Benchmark NeonSpeed Benchmark MultiThreading Benchmarks
MP-Whetstone Benchmark MP-Dhrystone Benchmark MP NEON Linpack Benchmark
MP-BusSpeed Benchmark MP-RandMem Benchmark MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks OpenMP-MemSpeed Benchmarks Java Whetstone Benchmark
JavaDraw Benchmark OpenGL Benchmark I/O Benchmarks
DriveSpeed Benchmark FAT32 Wired and WiFi Benchmark USB Benchmark
New Benchmark More Files New Benchmark Large Files New Benchmark Small Files
Booting Time, Volts and Amps Drive Stress Test Drive Stress Performance Monitor
Disk Drive Errors and Crashes Other System Crashes CPU Stress Testing Benchmarks
CPU Stress Tests No Fan Integer Stress Tests With Fan Floating Point Stress Tests With Fan
4 Amps Power Supply No Disk Crash New INTitHOT Integer Stress Test INTitHOT PI 5 4 Maximum Speeds
INTitHOT Pi 5 Stress Tests INTitHOT Stress Test No Fan 64 KB INTitHOT Stress Test No Fan 512 KB
System Stress Tests Light System Stress Test Light Test With Fan
Light Test No Fan Heavy System Stress Test Heavy Test No Fan
Heavy Test With Fan - FAILED Heavy Test With Fan - Passed Firefox, Bluetooth and YouTube
Pi 5 The Vector Processor PC and Pi Performance Comparisons
New 5 Amps Power Supply and Active Cooler
CPU Stress Tests Heavy System Stress Test Solid State Hard Drive


Summary

As indicated below, some of the benchmarks provided higher average Pi5/Pi4 performance gains than the official claim of two to three times, and individual programs or test functions were between 10 and 18 times faster. This was due to the improved CPU caching arrangements and advanced SIMD hardware and compilation facilities. Examples of compiled SIMD vector instructions are included.

The latest 5 amps power supply and active cooler were not available initially, when tests were run with no cooling fan. Then, stress tests led to CPU temperatures of up to 91.7C, but the Pi 5 continued running at a lower speed with controlled CPU MHz and voltage variations, still much faster than a fan cooled Pi 4.

On the downside, my rather extreme stress tests produced a number of system crashes and disk drive reading errors. I believe the results show that these were caused by inadequate USB power rather than high temperatures. Although stress tests ran successfully using the 5 amps power supply, USB power demands of disk and solid state drives appear to be rather excessive. In this case, the system could be easily crashed by overloading, so these drives should probably only be connected via a powered hub.

Surprisingly, execution of a new stress test, with integer calculations, generated more heat than the floating point variety. The hottest occurred when handling data from L2 cache with higher power demands. Faster L1 cache based data transfers produced somewhat lower temperatures.

Benchmarks - Besides detailed results, Pi5/Pi4 performance comparisons are provided using older gcc 8 compiled versions. The Pi 5 gcc 8 results are also compared with those from new gcc 12 compilations, gcc 12 being included in the new 64 bit Operating System software.

Single Core CPU Tests - comprising varieties of Whetstone, Dhrystone, Linpack 100 and Livermore Loops Classic Benchmarks. Pi 5 gains were between 2.14 and 4.65 times from 182 measurements.

Single Core Memory Benchmarks - measuring performance using data from caches and RAM. More than 250 Pi5/Pi4 comparisons are provided from five benchmarks, indicating a Pi 5 average gain of 3.1 times and a maximum of 13.3 times. Pi 5 new compilation average gain was 2.6 times and maximum 10 times. High gains were due to improved caching and SIMD vector processing operations.

MultiThreading Benchmarks - These 8 benchmarks execute the same calculations using 1, 2, 4 and 8 threads. From 150 plus comparisons Pi5/Pi4 average/maximum gains were 3.4/18.2 times, with 1.2/5.6 times for Pi 5 gcc12/gcc8 compilations. The reasons for the high gains were improved caching and SIMD as above.

Miscellaneous - average Pi5/Pi4 performance gains for a series of tests were Java Whetstones 2.47 times, JavaDraw 1.98 times and OpenGL 4.0 times for 6 tests at 4 screen resolutions.

Input/Output Benchmarks - These measure performance of large files, small files and random access with numerous performance measurements of Gbps LAN, WiFi, large files with 64 bit OS, main SD and USB 3 FAT and Ext disk drives and 11 main and USB boot drives. Also included are booting times and main and USB power usage in volts and amps. The first test result indicated that the Pi 5 was typically 50% faster than the Pi 4 handling large files on a high speed USB 3 flash drive.

Drive Stress Test - This writes four large files with data comprising numerous binary data patterns, reads them randomly for a specified time, then repetitively reads each different data block for a time. Eleven 15 minute tests were successfully run on the Pi 5 comprising LAN, WiFi, OS SD, 3 USB 3 flash drives and 5 disk drive partitions, plus 2 network tests from a Pi 400.

Disk Drive Errors and System Crashes - (Power supply issues) - Two out of three tests using 2 disk drives caused crashes. The first was with both drives on a USB 3 hub, due to exceeding the 900 mA USB 3 port specification. The next crash was with one drive via the hub, one on a direct USB port and a CPU stress test, leading to the measured main power supply current exceeding the 3 amps specification. This led to reading the wrong file and data comparison failures. Two disks on different USB 3 ports ran successfully.

CPU Stress Tests - Initial 3 floating point and 3 integer tests were run without fan cooling, each for 15 minutes, using 1, 2 and 4 threads, whilst recording performance, CPU MHz, volts and temperatures. All suffered from MHz throttling at temperatures up to 90C, with measured performance deterioration less than 50%, still faster than a fan cooled Pi 4. I acquired a 4 amps power supply and repeated the test that crashed at 3 amps, this time with no failures.

INTitHOT New Integer Stress - This read only test produced the hottest and fastest effects, through executing continuous SIMD AND instructions. On the Pi 5, the fastest result, via L1 cache sized data, was 240 GB/second, or 1.92 Tbps. Via L2 cache, maximum speed was 168 GB/second with higher power consumption and temperature. The Pi 5 was around 4.6 times faster than a Pi 4 using 1 or 2 threads, and faster by a much greater margin at 4 threads, where the Pi 4 was unbelievably slow.

System Stress Tests - These were run for 30 minutes using the 4 amps power supply and included INTitHOT, disk drive and OpenGL stress tests. Initial tests ran successfully at near maximum speed with the fan but reached a CPU temperature of 91.7C with a 40% reduction in CPU and graphics performance without the fan. The next ones included floating point and network stress tests. The no fan test ran successfully with the usual high temperature and degraded performance but, with the fan, crashed with disk drive errors again. Then a low USB voltage was recorded.

Other Tests and Comparisons - Tests were carried out involving Firefox, Bluetooth sound and YouTube videos. Next is Pi 5 The Vector Processor, with examples and comparisons of performance with 1978 to 1991 supercomputers, then comparisons with PCs from 1991 to 2021. Results for the latter indicate that the Raspberry Pi 5 can be assumed to be 194 times faster than the Cray 1 supercomputer.

New 5 Amps Power Supply and Active Cooler - Graphs of temperature increases with time are provided for initial CPU only stress tests, followed by others using the new items, now all much less than the CPU MHz throttling level. Hottest was not the floating point test but the one using integer calculations with L2 cache based data. Next was a repeat of the Heavy System Stress Tests. This ran successfully twice. It was then repeated with the 4 amps power supply and failed as before but at a much lower CPU temperature, then ran without any issues at a second attempt. The strange measured power volts and amps probably indicate a marginal condition, compared to the 5 amps measurements.

Solid State Hard Drive - Following an earlier disastrous attempt, I repeated the last system stress test powered with 4 and 5 amps supplies on the Pi 5, providing similar performance. Then I ran the drive benchmarks, where average large file write/reading speeds were around 360/400 MB/second, faster than the old hard drive. A surprise was that the measured USB current was a relatively high 640 mA.



Introduction

This report provides results from a wide range of benchmarks and stress tests run on the Raspberry Pi 5 during the Alpha Testing stage, and includes comparisons with the Pi 4. It follows the format of many other reports from 2014 to 2023, available from the ResearchGate index. The latter includes access to historic results, opening the opportunity to compare Pi 5 performance with computers from as far back as the pre-1960 iron age.

The new Raspberry Pi 5 features a 2.4GHz quad-core 64-bit Arm Cortex-A76 CPU, with near 64 KB L1 and 512 KB L2 caches per core, and a 2MB shared L3 cache, also a host of other enhanced features. Compared to the Raspberry Pi 4, it was claimed to have between two and three times the CPU and GPU performance, with roughly twice the memory and I/O bandwidth. Part of the reason for this is that the Pi 4 runs at 1.5 GHz with a 32 KB L1 cache and 1024 KB shared L2 cache.

The first benchmarks measure performance of a single CPU core, covering integer and floating point performance plus data transfer speeds at all memory cache and RAM levels. Then there are multi-core benchmarks of the same variety and more, plus others for Java and graphics. The stress testing programs measure performance, CPU MHz and temperatures with and without fan cooling, initially for each program then during systems tests, including all CPU cores, disk and network drives and graphics. Then there are other measurements as identified in the contents table, including comparisons with PCs and supercomputers.

The benchmarks can be downloaded in RaspberryPi5BenchmarksandStressTests.tar.xz. This includes folders containing source code with compile commands, compiled programs, example results and script files to select run time parameters. A preprint of the report is also included.

All the programs save their results in log files, and full details from some are included in the report. The logs include the following information about the system under test.

Raspberry Pi 4 Old OS

Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               fp asimd evtstrm crc32 cpuid
Linux raspberrypi 4.19.118-v8+ #1311 SMP PREEMPT 
Mon Apr 27 14:32:38 BST 2020 aarch64 GNU/Linux

Raspberry Pi 5

Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Vendor ID:                       ARM
Model name:                      Cortex-A76
Model:                           1
Thread(s) per core:              1
Core(s) per cluster:             4
Socket(s):                       -
Cluster(s):                      1
Stepping:                        r4p1
CPU(s) scaling MHz:              100%
CPU max MHz:                     2400.0000
CPU min MHz:                     1000.0000
BogoMIPS:                        108.00
Flags:                           fp asimd evtstrm aes pmull sha1
                                 sha2 crc32 atomics fphp asimdhp 
                                 cpuid asimdrdm lrcpc dcpop asimddp
Linux raspberrypi 6.1.32-v8+ #1 SMP PREEMPT 
Sat Aug  5 07:03:33 BST 2023 aarch64 GNU/Linux
  

The last count indicated that 31 different benchmarking and stress testing programs were run, producing hundreds of results included here. The devil is in the details.



Whetstone Benchmark - whetstonePi64g8 and g12
Vector Versions - Whetv64SPg8 and g12, whetvDP64g8 and g12

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, with no accessing of data in L2 cache or RAM.

Results are provided for the original scalar single precision (SP) version, along with those for single and double precision (DP) varieties of the vector version, originally written for use on the first Cray 1 supercomputer delivered to the UK. For more information see Pi 5 The Vector Processor later. Examination of the time used by the different tests shows that the total can be dominated by those executing functions such as COS and EXP.

Pi 5/Pi 4 comparisons are provided for the gcc 8 scalar versions, indicating performance gains between 2.44 and 2.59 times for the three (MFLOPS) floating point tests and 2.79 on overall MWIPS. Performance of the Pi 5 gcc 12 compilation was essentially identical to that from gcc 8, apart from the N3 if then else test, which ran at half the gcc 8 speed.

Pi 5/Pi 4 vector SP and DP gcc 8 performance gains were similar, between 2.34 and 3.10 times for MFLOPS and around 2.3 for MWIPS. Pi 5 SP Vector/Scalar gains are also provided, giving 5.40 to 7.86 times for MFLOPS but only 1.88 times for overall MWIPS, deflated by the COS/EXP tests. Maximum SP scalar speed was 1.36 GFLOPS, with vectors at 8.08 GFLOPS SP and 4.0 GFLOPS DP.
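The gain figures quoted above are simple ratios of the logged scores. The following is a minimal sketch of that arithmetic, using values copied from the scalar gcc 8 logs in this section. The helper name `gain` is illustrative only and is not part of the benchmark programs.

```python
# Pi5/Pi4 gain = Pi 5 score / Pi 4 score, rounded to two decimals.
# Scores are copied from the scalar gcc 8 Whetstone logs in this section.
pi4 = {"N1": 524.251, "N2": 534.904, "N6": 397.676, "MWIPS": 2085.311}
pi5 = {"N1": 1279.196, "N2": 1364.748, "N6": 1027.998, "MWIPS": 5822.922}


def gain(new_score, old_score):
    """Speed ratio of two higher-is-better scores (hypothetical helper)."""
    return round(new_score / old_score, 2)


gains = {name: gain(pi5[name], pi4[name]) for name in pi4}
for name, value in gains.items():
    print(f"{name:6s} {value:.2f}")
```

The same calculation applies to the MOPS tests and to the Vector/Scalar and GCC 12/8 columns, with the appropriate pair of scores substituted.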

Pi 4 GCC 8
Whetstone Single Precision C Benchmark  64 Bit gcc 8R, Fri May 22 10:48:53 2020

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.12475013732910156       524.251               0.076
N2 floating point      -1.12274742126464844       534.904               0.524
N3 if then else         1.00000000000000000                2978.570     0.073
N4 fixed point         12.00000000000000000                2493.078     0.264
N5 sin,cos etc.         0.49911010265350342                  57.643     3.012
N6 floating point       0.99999982118606567       397.676               2.831
N7 assignments          3.00000000000000000                 996.647     0.387
N8 exp,sqrt etc.        0.75110864639282227                  27.327     2.841

MWIPS                                            2085.311              10.008

Pi 5 GCC 8
Whetstone Single Precision C Benchmark  64 Bit gcc 8R, Thu Aug 10 15:44:50 2023

Loop content                   Result              MFLOPS      MOPS   Seconds  G8 Pi5/4

N1 floating point      -1.12475013732910156      1279.196               0.087    2.44
N2 floating point      -1.12274742126464844      1364.748               0.573    2.55
N3 if then else         1.00000000000000000                7190.834     0.084    2.41
N4 fixed point         12.00000000000000000                5995.954     0.306    2.41
N5 sin,cos etc.         0.49911010265350342                 154.725     3.131    2.68
N6 floating point       0.99999982118606567      1027.998               3.055    2.59
N7 assignments          3.00000000000000000                2398.668     0.449    2.41
N8 exp,sqrt etc.        0.75110864639282227                  93.596     2.314    3.43

MWIPS                                            5822.922               9.998    2.79

Pi 5 GCC 12
 Whetstone Single Precision C Benchmark  64 Bit gcc 12, Thu Sep 28 11:46:43 2023

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.12475013732910156      1279.140               0.088
N2 floating point      -1.12274742126464844      1364.558               0.575
N3 if then else         1.00000000000000000                3594.939     0.168
N4 fixed point         12.00000000000000000                5994.963     0.307
N5 sin,cos etc.         0.49911010265350342                 157.996     3.075
N6 floating point       0.99999982118606567      1027.940               3.064
N7 assignments          3.00000000000000000                2398.054     0.450
N8 exp,sqrt etc.        0.75110864639282227                  95.590     2.273

MWIPS                                           5839.767              10.000

#################### Vector Whetstone Vector Length 258 ####################

Pi 4 GCC 8 SP
Whetstone Vector Benchmark 64 Bit Single Precision, Wed Aug 30 10:41:57 2023

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.13316142559051514      2338.496               0.391
N2 floating point      -1.13312149047851562      1651.957               3.877
N3 if then else         1.00000000000000000                4427.445     1.114
N4 fixed point         12.00000000000000000                1733.458     8.659
N5 sin,cos etc.         0.49998238682746887                  74.913    52.923
N6 floating point       0.99999982118606567      2573.346               9.988
N7 assignments          3.00000000000000000               18596.381     0.474
N8 exp,sqrt etc.        0.75002217292785645                  78.503    22.581

MWIPS                                            4764.843             100.007

Note the different single and double precision numeric results.

Pi 5 GCC 8 SP
Whetstone Vector Benchmark 64 Bit Single Precision, Sat Oct  7 10:15:16 2023

Loop content                   Result              MFLOPS      MOPS   Seconds  G8 Pi5/4

N1 floating point      -1.13316142559051514      7111.676               0.290    3.04
N2 floating point      -1.13312149047851562      3857.446               3.746    2.34
N3 if then else         1.00000000000000000               10141.446     1.097    2.29
N4 fixed point         12.00000000000000000                2396.242    14.135    1.38
N5 sin,cos etc.         0.49998238682746887                 177.032    50.534    2.36
N6 floating point       0.99999982118606567      7986.011               7.263    3.10
N7 assignments          3.00000000000000000               42584.598     0.467    2.29
N8 exp,sqrt etc.        0.75002217292785645                 178.102    22.459    2.27

MWIPS                                           10753.538              99.990    2.26

Pi 5 GCC 12 SP
Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sat Oct  7 10:46:30 2023
                                                                                      Vector/
                                                                                 Pi 5 Scalar
Loop content                   Result              MFLOPS      MOPS   Seconds GCC12/8 G12 Pi5    

N1 floating point      -1.13316142559051514      7393.282               0.286    1.04    5.78    
N2 floating point      -1.13312149047851562      7364.751               2.009    1.91    5.40
N3 if then else         1.00000000000000000               14169.053     0.804    1.40    3.94
N4 fixed point         12.00000000000000000                2398.742    14.457    1.00    0.40
N5 sin,cos etc.         0.49998238682746887                 177.260    51.673    1.00    1.12
N6 floating point       0.99999982118606567      8078.622               7.351    1.01    7.86
N7 assignments          3.00000000000000000               26419.105     0.770    0.62   11.02
N8 exp,sqrt etc.        0.75002217292785645                 178.359    22.961    1.00    1.87

MWIPS                                           10974.928             100.311    1.02    1.88


Pi 4 GCC 8 DP
Whetstone Vector Benchmark 64 Bit Double Precision, Wed Aug 30 10:48:05 2023

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.13314558088707962      1146.624               0.709
N2 floating point      -1.13310306766606850      1094.230               5.203
N3 if then else         1.00000000000000000                4405.221     0.995
N4 fixed point         12.00000000000000000                1730.427     7.711
N5 sin,cos etc.         0.49998080312723675                  73.193    48.149
N6 floating point       0.99999988868927014      1294.129              17.655
N7 assignments          3.00000000000000000                9967.123     0.785
N8 exp,sqrt etc.        0.75002006515491115                  83.614    18.845

MWIPS                                            4233.571             100.052


Pi 5 GCC 8 DP
Whetstone Vector Benchmark 64 Bit Double Precision, Sat Oct  7 10:18:59 2023   

Loop content                   Result              MFLOPS      MOPS   Seconds  G8 Pi5/4

N1 floating point      -1.13314558088707962      3499.307               0.535    3.05
N2 floating point      -1.13310306766606850      2793.370               4.688    2.55
N3 if then else         1.00000000000000000               10158.471     0.993    2.31
N4 fixed point         12.00000000000000000                2396.163    12.809    1.38
N5 sin,cos etc.         0.49998080312723675                 171.834    47.176    2.35
N6 floating point       0.99999988868927014      3994.760              13.156    3.09
N7 assignments          3.00000000000000000               21713.754     0.829    2.18
N8 exp,sqrt etc.        0.75002006515491115                 184.857    19.607    2.21

MWIPS                                            9763.593              99.793    2.31

Pi 5 GCC 12 DP
Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sat Oct  7 10:50:40 2023

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.13314558088707962      3602.841               0.523
N2 floating point      -1.13310306766606739      3619.564               3.647
N3 if then else         1.00000000000000000               14167.623     0.718
N4 fixed point         12.00000000000000000                2398.696    12.898
N5 sin,cos etc.         0.49998080312723675                 172.068    47.491
N6 floating point       0.99999988868927014      3997.801              13.252
N7 assignments          3.00000000000000000               13172.392     1.378
N8 exp,sqrt etc.        0.75002006515491115                 182.557    20.014

MWIPS                                            9829.517              99.920
  



Dhrystone Benchmark - dhrystonePi64g8 and g12

This is the most popular ARM integer benchmark, often subject to over optimisation, rated in VAX MIPS aka DMIPS.

Pi 5 GCC 8 gain over Pi 4 was 2.37 times. There was a slight gain using GCC 12, where DMIPS/MHz ratio reached 8.57.
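The VAX MIPS rating is conventionally the Dhrystones per second score divided by 1757, the score of the VAX 11/780 reference machine, and the figures in the logs below are consistent with that. A minimal sketch follows. The function names are illustrative, and the clock speeds are assumed to be the CPU max MHz values from the system information shown in the Introduction.

```python
# VAX 11/780 reference score, defined as 1 MIPS for Dhrystone ratings.
VAX_DHRYSTONES_PER_SECOND = 1757


def vax_mips(dhrystones_per_second):
    """Convert a Dhrystones per second score to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dhrystones_per_second, cpu_mhz):
    """DMIPS normalised by clock speed (assumes the CPU ran at cpu_mhz)."""
    return vax_mips(dhrystones_per_second) / cpu_mhz


# (label, Dhrystones per second from the logs, assumed CPU MHz)
runs = [
    ("Pi 4 gcc 8", 13729822, 1500),
    ("Pi 5 gcc 8", 32578833, 2400),
    ("Pi 5 gcc 12", 36120831, 2400),
]
for label, dps, mhz in runs:
    print(f"{label:12s} {vax_mips(dps):9.2f} DMIPS "
          f"{dmips_per_mhz(dps, mhz):5.2f} DMIPS/MHz")
```

Dividing by MHz separates the gain from the faster clock (2400 against 1500 MHz) from the gain per clock cycle.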
 
 Pi 4 GCC 8
 Dhrystone Benchmark 2.1 64 Bit gcc8, Mon May 25 22:16:05 2020

 Nanoseconds one Dhrystone run:        72.83
 Dhrystones per Second:             13729822
 VAX MIPS rating =                   7814.36

 Numeric results were correct


 Pi 5 GCC 8
 Dhrystone Benchmark 2.1 64 Bit gcc8, Thu Aug 10 15:49:13 2023

 Nanoseconds one Dhrystone run:        30.69
 Dhrystones per Second:             32578833
 VAX MIPS rating =                  18542.31   Pi 5/Pi 4 Gain 2.37

 Numeric results were correct


 Pi 5 GCC 12
 Dhrystone Benchmark 2.1 64 Bit gcc12, Thu Sep 28 11:44:33 2023

 Nanoseconds one Dhrystone run:        27.68
 Dhrystones per Second:             36120831
 VAX MIPS rating =                  20558.24   GCC 12/8 Gain 1.11

 Numeric results were correct

  



Linpack 100 Benchmark MFLOPS - linpackPi64g8 and g12, linpackPi64gSP, linpackPi64NEONig8

This original Linpack benchmark executes double precision arithmetic. I introduced two single precision versions, one using NEON functions to include vector processing. Performance of this benchmark can vary, with its dependence on data placement in L2 cache.

Unlike when the Pi 5 was introduced, later compilers produced code as fast as the NEON version. Now, with GCC 12, the NEON variety was slower and the others produced a small gain over GCC 8 compilations. Comparisons for the latter indicated that Pi 5 gains were between 3.16 and 3.54 times over the three versions. Maximum Pi 5 GCC 8 speeds were 6.60 GFLOPS SP and 3.93 GFLOPS DP, with GCC 12 raising the DP speed to 4.14 GFLOPS.
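To give a feel for what the MFLOPS figures mean at n = 100, the sketch below converts a logged speed into time per solution. It assumes the conventional Linpack operation count of 2n^3/3 + 2n^2, which this report does not state explicitly. The function names are illustrative.

```python
def linpack_flops(n):
    """Conventional Linpack operation count for an n x n system (assumed)."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def seconds_per_solve(mflops, n=100):
    """Time for one factor-and-solve pass at the given MFLOPS rate."""
    return linpack_flops(n) / (mflops * 1.0e6)


# Double precision gcc 8 speeds copied from the logs in this section.
for label, mflops in [("Pi 4 DP gcc 8", 1111.51), ("Pi 5 DP gcc 8", 3933.38)]:
    microseconds = seconds_per_solve(mflops) * 1.0e6
    print(f"{label}: {microseconds:.1f} microseconds per solve")
```

With each pass taking well under a millisecond, the benchmark has to repeat the calculation many times to obtain a measurable running time.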

 Pi 4 GCC 8
 Linpack Double Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 8, Mon May 25 22:05:47 2020

 Speed    1111.51 MFLOPS

 Numeric results were as expected

 Linpack Single Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 8, Mon May 25 22:09:12 2020

 Speed    1930.27 MFLOPS

 Numeric results were as expected

 Linpack Single Precision Benchmark n @ 100
  NEON Intrinsics 64 bit gcc 8, Mon May 25 22:11:15 2020

 Speed    2030.95 MFLOPS

 Numeric results were as expected

------------------------------------------------------
 Pi 5 GCC 8                                                   Pi5/Pi4
 Linpack Double Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 8, Thu Aug 10 16:12:47 2023

 Speed    3933.38 MFLOPS                                        3.54

 Numeric results were as expected

 Linpack Single Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 8, Thu Aug 10 16:04:18 2023

 Speed    6106.68 MFLOPS                                        3.16

 Numeric results were as expected
 
 Linpack Single Precision Benchmark n @ 100
  NEON Intrinsics 64 bit gcc 8, Thu Aug 10 16:13:52 2023

 Speed    6603.58 MFLOPS                                        3.25

 Numeric results were as expected

------------------------------------------------------
 Pi 5 GCC 12                                                   GCC 12/8
 Linpack Double Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 12, Thu Sep 28 15:58:07 2023

 Speed    4136.39 MFLOPS                                        1.05

 Numeric results were as expected

 Linpack Single Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 12, Thu Sep 28 16:04:19 2023

 Speed    6472.77 MFLOPS                                        1.06

 Numeric results were as expected

 Linpack Single Precision Benchmark n @ 100
  NEON Intrinsics 64 bit gcc 12, Thu Sep 28 15:49:56 2023

 Speed    5665.39 MFLOPS                                        0.86

 Numeric results were as expected
 But 4 of the expected values needed changing in the program,
 via #define GCC12ARM64N, to avoid unnecessary error reports.
 



Livermore Loops Benchmark MFLOPS - liverloopsPi64g8 and g12

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, where the Cray 1 supercomputer was rated at 11.9 MFLOPS. The MFLOPS for the individual kernels are shown below, followed by overall scores. Although each kernel is executed for a relatively long time, performance of some can be inconsistent.

Pi 5 GCC 8 maximum speed was 9.87 DP GFLOPS, with gains over the Pi 4 between 2.14 and 4.65 over the 24 loops.

Maximum performance via GCC 12 was 10.58 DP GFLOPS, with scores for most of the loops similar to those from GCC 8.
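The Overall Ratings line in each log reports five summary statistics. The sketch below shows how figures of this kind can be derived from the 24 kernel speeds listed for the Pi 5 gcc 8 run. The logged ratings appear to cover more measurements than the 24 values printed (the logged minimum of 646.6 is not among them), so this is an illustration under that assumption and its output will not match the log exactly.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Pi 5 gcc 8 MFLOPS for the 24 kernels, copied from the log in this section.
mflops = [
    7423.6, 2147.9, 2356.6, 2472.9, 911.5, 1871.0, 9872.3, 5317.7,
    5162.9, 2125.8, 1173.2, 2672.0, 709.1, 1108.7, 2966.6, 1598.5,
    1761.3, 5526.8, 1190.0, 956.0, 5425.1, 1489.5, 2147.9, 858.2,
]

ratings = {
    "Maximum": max(mflops),
    "Average": mean(mflops),
    "Geomean": geometric_mean(mflops),  # the official Livermore average
    "Harmean": harmonic_mean(mflops),
    "Minimum": min(mflops),
}
for name, value in ratings.items():
    print(f"{name:8s} {value:8.1f}")
```

The geometric mean is less influenced by a few very fast kernels than the arithmetic average, which is one reason it suits a mixed set of loops like this.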

 Pi 4 GCC 8
 Livermore Loops Benchmark 64 Bit gcc 8 via C/C++ Mon May 25 10:39:10 2020

 MFLOPS for 24 loops
 2108.4  936.3  959.9  965.1  382.5  808.6 2312.9 2488.4 2065.7  668.7  500.3  980.7
  180.7  404.8  815.0  643.8  726.8 1189.6  449.8  397.2 1716.0  366.9  817.7  312.7

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2616.7   959.8   766.7   613.0   169.7

 Numeric results were as expected


 Pi 5 GCC 8
 Livermore Loops Benchmark 64 Bit gcc 8 via C/C++ Thu Aug 10 16:14:33 2023

 MFLOPS for 24 loops
 7423.6 2147.9 2356.6 2472.9  911.5 1871.0 9872.3 5317.7 5162.9 2125.8 1173.2 2672.0
  709.1 1108.7 2966.6 1598.5 1761.3 5526.8 1190.0  956.0 5425.1 1489.5 2147.9  858.2

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  9872.3  2873.9  2208.3  1763.4   646.6

 Numeric results were as expected

-----------------------------------------------------------------------------------
 GCC 8 Pi5/Pi4 Performance Ratios
 For 24 loops
   3.52   2.29   2.46   2.56   2.38   2.31   4.27   2.14   2.50   3.18   2.34   2.72
   3.92   2.74   3.64   2.48   2.42   4.65   2.65   2.41   3.16   4.06   2.63   2.74
   Min    2.14   Max    4.65

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
    3.77    2.99    2.88    2.88    3.81

-----------------------------------------------------------------------------------
 Pi 5 GCC 12
 Livermore Loops Benchmark 64 Bit gcc 12 via C/C++ Thu Sep 28 16:38:37 2023

 MFLOPS for 24 loops
 7833.8 2404.6 2377.2 2346.8  913.0 1857.1  10577 5350.6 5109.2 2117.4 1186.0 2351.4
  760.0 1121.2 3103.4 1597.7 1776.1 5455.9 1197.2 2490.5 5657.5 1855.7 2139.8  780.4

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
 10576.9  2964.4  2308.1  1870.7   733.9

 Numeric results were as expected via #define GCC12ARMPI
  



Fast Fourier Transforms Benchmarks - fft1Pi64g, fft3cPi64g8 and g12

This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements use both single and double precision data, calculating FFT sizes between 1K and 1024K, with data from caches and RAM. Note that steps in performance levels occur at data size changes between caches, then to RAM.

Comparisons of averages of the three runs are provided. Those for FFT1 demonstrate the clear and different advantage of the Pi 5 over the Pi 4, depending on the source of the data, with that from L3 cache providing gains of up to 13.34 times and up to 4.71 times involving the larger L2 cache. Most other gains are in the two to four times range. With the faster CPU speed limited FFT3c, gains were mainly between 2 and 3 times. GCC 12 over GCC 8 comparisons indicate a slight advantage of the former using data from caches, but the roles were reversed when dealing with RAM data transfers.
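Because these results are times rather than speeds, the gain is the older average time divided by the newer one. A minimal sketch, using the three 256K single precision FFT1 timings from the Pi 4 and Pi 5 gcc 8 logs in this section. The function name `time_gain` is illustrative only.

```python
from statistics import mean

# 256K single precision FFT1 times in milliseconds, three runs each,
# copied from the Pi 4 and Pi 5 gcc 8 logs in this section.
pi4_ms = [295.55, 280.43, 303.20]
pi5_ms = [23.23, 21.27, 21.40]


def time_gain(old_ms, new_ms):
    """Gain for lower-is-better timings: old average / new average."""
    return mean(old_ms) / mean(new_ms)


ratio = time_gain(pi4_ms, pi5_ms)
print(f"Pi5/Pi4 gain at 256K SP: {ratio:.2f}")
```

For the smallest FFT sizes the logs only show times to two decimal places, so ratios recalculated from the printed values can differ slightly from those in the tables.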

 Pi 4 GCC 8
  Pi 4 RPi FFT gcc 8 64 Bit Benchmark 1 Mon May 25 10:54:42 2020

    Size                    milliseconds
       K      Single Precision        Double Precision
       1    0.05    0.04    0.04    0.04    0.04    0.05
       2    0.08    0.08    0.08    0.15    0.14    0.14
       4    0.23    0.23    0.23    0.39    0.38    0.44
       8    0.73    0.80    0.70    0.97    1.04    0.97
      16    1.98    1.87    1.79    2.66    2.52    2.83
      32    4.92    4.92    5.29    5.67    4.92    4.89
      64    8.80    8.69    8.67   32.21   32.23   33.31
     128   49.82   49.79   50.17  161.36  159.61  159.39
     256  295.55  280.43  303.20  411.97  415.90  340.34
     512  506.01  601.29  572.36  781.10  779.05  782.21
    1024 1375.42 1377.64 1375.77 1898.28 1876.88 1896.22

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

               End at Mon May 25 10:55:00 2020


 Pi 4  RPi FFT gcc 8 64 Bit Benchmark 3c.0 Mon May 25 10:56:49 2020

    Size                    milliseconds
       K      Single Precision        Double Precision
       1    0.06    0.04    0.04    0.04    0.04    0.03
       2    0.09    0.07    0.07    0.10    0.10    0.10
       4    0.23    0.20    0.20    0.23    0.26    0.23
       8    0.50    0.44    0.46    0.52    0.50    0.50
      16    1.21    1.19    1.05    1.23    1.17    1.19
      32    2.36    2.23    2.18    3.33    3.32    3.29
      64    6.16    5.70    5.31   10.20   10.20   10.18
     128   16.39   15.69   15.69   24.35   24.45   24.48
     256   38.70   37.46   37.40   54.57   54.65   54.59
     512   83.83   80.96   81.40  119.71  118.70  119.27
    1024  182.08  176.05  176.97  268.43  259.16  259.30

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

               End at Mon May 25 10:56:52 2020

 Pi 5 GCC 8
  Pi 5 RPi FFT gcc 8 64 Bit Benchmark 1 Fri Aug 11 16:47:11 2023

    Size                    milliseconds                  Average Pi5/Pi4
       K      Single Precision        Double Precision         SP      DP
       1    0.02    0.02    0.02    0.02    0.02    0.02     2.20    2.51
       2    0.04    0.04    0.04    0.04    0.04    0.04     1.98    3.81
       4    0.09    0.09    0.09    0.09    0.09    0.09     2.64    4.71
       8    0.19    0.20    0.19    0.29    0.29    0.29     3.88    3.48
      16    0.56    0.56    0.56    0.65    0.67    0.78     3.35    3.82
      32    1.30    1.27    1.29    1.55    1.50    1.80     3.92    3.18
      64    3.18    3.00    2.99    4.16    3.90    3.91     2.85    8.17
     128    7.76    7.30    7.28   14.27   14.44   13.71     6.70   11.33
     256   23.23   21.27   21.40   99.92   94.38   94.97    13.34    4.04
     512  157.82  152.33  173.93  329.15  321.16  323.41     3.47    2.41
    1024  608.66  606.77  600.94 1069.84 1048.00 1049.41     2.27    1.79

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

               End at Fri Aug 11 16:47:19 2023

 
 Pi 5 RPi FFT gcc 8 64 Bit Benchmark 3c.0 Fri Aug 11 16:48:27 2023

    Size                    milliseconds                  Average Pi5/Pi4
       K      Single Precision        Double Precision         SP      DP
       1    0.03    0.02    0.02    0.02    0.02    0.02     1.88    1.96
       2    0.05    0.04    0.04    0.04    0.04    0.04     1.93    2.61
       4    0.10    0.08    0.08    0.09    0.09    0.09     2.37    2.74
       8    0.21    0.18    0.18    0.23    0.21    0.21     2.43    2.37
      16    0.45    0.41    0.41    0.53    0.48    0.49     2.70    2.40
      32    1.16    0.90    0.93    1.22    1.07    1.06     2.27    2.97
      64    2.39    2.04    2.39    2.98    2.76    2.69     2.52    3.63
     128    5.26    4.82    4.86    9.92    9.90    9.86     3.20    2.47
     256   14.58   13.92   13.89   29.15   27.71   26.90     2.68    1.96
     512   42.03   39.73   39.84   72.71   72.32   71.70     2.02    1.65
    1024  101.56   99.35   98.31  176.62  171.45  175.48     1.79    1.50

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

               End at Fri Aug 11 16:48:29 2023

 Pi 5 GCC 12
  RPi FFT gcc 12 64 Bit Benchmark 1 Thu Sep 28 19:10:33 2023

    Size                    milliseconds                  Average GCC 12/8
       K      Single Precision        Double Precision         SP      DP
       1    0.02    0.02    0.02    0.02    0.02    0.02     1.15    1.02
       2    0.06    0.04    0.04    0.04    0.04    0.04     0.92    1.05
       4    0.08    0.08    0.08    0.08    0.08    0.08     1.09    1.05
       8    0.18    0.18    0.18    0.80    0.26    0.25     1.09    0.65
      16    0.55    0.62    0.61    0.78    0.62    0.68     0.95    1.01
      32    1.19    1.19    1.18    3.14    1.66    2.23     1.08    0.69
      64    2.90    2.87    3.12    4.14    3.83    4.62     1.03    0.95
     128    8.01    7.72    8.41   19.04   16.31   19.17     0.93    0.78
     256   28.65   29.22   30.38  142.81  143.44  144.91     0.75    0.67
     512  256.41  209.11  215.07  400.84  410.99  448.06     0.71    0.77
    1024  798.30  749.85  753.61 1073.95 1075.09 1051.38     0.79    0.99

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

               End at Thu Sep 28 19:10:41 2023


  RPi FFT gcc 12 64 Bit Benchmark 3c.0 Thu Sep 28 19:13:51 2023

    Size                    milliseconds                  Average GCC 12/8
       K      Single Precision        Double Precision         SP      DP
       1    0.02    0.02    0.02    0.02    0.02    0.02     1.20    1.06
       2    0.04    0.04    0.04    0.04    0.04    0.04     1.04    1.06
       4    0.09    0.08    0.08    0.08    0.08    0.08     1.06    1.06
       8    0.19    0.18    0.18    0.20    0.19    0.19     1.06    1.10
      16    0.41    0.39    0.39    0.46    0.43    0.43     1.07    1.12
      32    0.88    0.85    0.86    1.01    0.96    0.96     1.15    1.14
      64    1.98    1.91    1.91    2.57    2.48    2.47     1.17    1.12
     128    5.65    4.68    4.63   10.10   10.04   10.06     1.00    0.98
     256   14.59   14.50   14.59   36.02   35.29   34.84     0.97    0.79
     512   55.50   54.91   55.79  100.99  102.62   99.96     0.73    0.71
    1024  143.39  142.49  143.22  231.27  228.44  229.17     0.70    0.76

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

               End at Thu Sep 28 19:13:53 2023



BusSpeed Benchmark - busspeedPi64g8 and g12

This is a read only benchmark with data from caches and RAM. The program first reads one word, with 32 word increments to the next one, skipping the data words in between, then repeats the pass with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates of bus speed to be made, as 16 times the speed of the appropriate measurements at Inc16.
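The reading pattern and the bus speed estimate can be sketched as follows. This is an illustrative sketch only, not the benchmark source. The function names are hypothetical, and the real program may combine the words differently from the simple sum used here. The estimate assumes that each word read at Inc16 pulls in a full burst of 16 four byte words (64 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read every inc-th 32 bit word of the buffer and combine the values,
   so the compiler cannot discard the loads. inc = 32, 16, 8, 4, 2 and
   finally 1 (Read All) gives the sequence of tests described above. */
uint32_t read_with_increment(const uint32_t *data, size_t words, size_t inc)
{
    assert(inc > 0);
    uint32_t acc = 0;
    for (size_t i = 0; i < words; i += inc)
        acc += data[i];
    return acc;
}

/* Estimated bus speed in MB/second from an Inc16 measurement, on the
   assumption that 16 times as much data crosses the bus as the
   program counts. */
double bus_mb_per_sec_from_inc16(double inc16_mb_per_sec)
{
    return 16.0 * inc16_mb_per_sec;
}
```

As an example, the Pi 5 GCC 8 Inc16 result of 864 MB/second at 16384 KB gives an estimate of 13824 MB/second, close to the 14015 MB/second measured by Read All at the same size.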

The most important ratios are from Read All. The others demonstrate what happens when not all data is read sequentially, where the Pi 5 appears to be significantly faster than the Pi 4. The main results indicate Pi 5 gains of just over twice reading data from L1 and L2 caches, but these can be more than four times from L3 and more than three times from RAM. Maximum bus speed, using one CPU core, is estimated as around 14 GB/second from Inc16, as also shown under Read All. See MP results for higher estimates.

Pi 5 performance produced from GCC 8 and GCC 12 compilations was essentially the same.

 Pi 4 GCC 8
   BusSpeed 64 Bit gcc 8 Mon May 25 22:13:11 2020

    Reading Speed 4 Byte Words in MBytes/Second
   Memory Inc32  Inc16  Inc8   Inc4   Inc2   Read
    KBytes Words  Words  Words  Words  Words  All  Cache      Pi 5

     16   4898   5109   5626   5860   5879   9238  L1          L1
     32   1109   1389   2485   3804   5026   8435
     64    804   1030   2025   3285   4871   8312  L2 Shared
    128    737    951   1877   3130   4908   8556              L2
    256    732    953   1897   3147   4941   8617
    512    701    939   1766   2902   4601   8150
   1024    323    494    986   1807   3060   5553  RAM         L3 Shared
   4096    242    259    486    964   1932   3856              RAM
  16384    236    268    493    971   1939   3878
  65536    242    271    494    973   1942   3884

        End of test Mon May 25 22:13:21 2020


 Pi 5 GCC 8                                       P5/P4 Comparison
   BusSpeed 64 Bit gcc 8 Fri Aug 11 16:46:13 2023

    Reading Speed 4 Byte Words in MBytes/Second
 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All  Words  Words  Words  Words  Words    All

     16   8300   8413  15451  17849  18151  18721   1.69   1.65   2.75   3.05   3.09   2.03
     32   9159   9235  15509  17911  18132  18721   8.26   6.65   6.24   4.71   3.61   2.22
     64   7460   7644  13739  17008  17665  18593   9.28   7.42   6.78   5.18   3.63   2.24
    128   2375   4452   7168  11555  13968  18203   3.22   4.68   3.82   3.69   2.85   2.13
    256   2375   4425   7225  11540  13964  18243   3.24   4.64   3.81   3.67   2.83   2.12
    512   1784   2980   5758  10362  13685  18203   2.54   3.17   3.26   3.57   2.97   2.23
   1024   1225   2325   4639   9336  13467  18281   3.79   4.71   4.70   5.17   4.40   3.29
   4096    656   1375   2700   5120   9599  15984   2.71   5.31   5.56   5.31   4.97   4.15
  16384    579    864   1741   3502   7020  14015   2.45   3.22   3.53   3.61   3.62   3.61
  65536    604    796   1595   3195   6351  12699   2.50   2.94   3.23   3.28   3.27   3.27

        End of test Fri Aug 11 16:46:22 2023


 Pi 5 GCC 12                                      Pi 5 GCC 12/8 Comparison
  BusSpeed 64 Bit gcc 12 Thu Sep 28 19:02:33 2023

   Reading Speed 4 Byte Words in MBytes/Second
 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All  Words  Words  Words  Words  Words    All

     16   8493   8509  16377  17918  18170  18733   1.02   1.01   1.06   1.00   1.00   1.00
     32   9127   9295  16478  18023  18212  18740   1.00   1.01   1.06   1.01   1.00   1.00
     64   7530   7604  14030  17241  17877  18603   1.01   0.99   1.02   1.01   1.01   1.00
    128   2375   4189   7212  11566  13961  18230   1.00   0.94   1.01   1.00   1.00   1.00
    256   2358   4275   7265  11595  13985  18274   0.99   0.97   1.01   1.00   1.00   1.00
    512   1557   2879   5524  10229  13877  18231   0.87   0.97   0.96   0.99   1.01   1.00
   1024   1225   2339   4606   9318  13902  18271   1.00   1.01   0.99   1.00   1.03   1.00
   4096    780   1387   2672   5115   9407  16053   1.19   1.01   0.99   1.00   0.98   1.00
  16384    652    880   1763   3479   7034  13979   1.13   1.02   1.01   0.99   1.00   1.00
  65536    624    801   1605   3178   6416  12800   1.03   1.01   1.01   0.99   1.01   1.01

  



MemSpeed Benchmark MB/Second - memspeedPi64g8 and g12

The benchmark includes CPU speed dependent calculations using data from caches and RAM, via single and double precision floating point and integer functions. The instruction sequences used are shown in the results column titles.

When compiled with GCC 6, earlier results identified unusually slow operation dealing with 32 bit floating point and integer calculations. It looks as though the effect is the same as reading data from RAM instead of caches, which is why Pi 5 performance gains were mainly less than two times. With double precision floating point, average Pi 5 gains were around four times for the first two sets of calculations, including more than 10 times with L3 cache involvement.

The GCC 12 compilation appears to have corrected the above problem, providing gains of more than eight times over GCC 8. The comparisons also show slight improvements in double precision calculations. Maximum calculated speeds are provided, indicating 15.3 single core GFLOPS SP and 6.86 DP, the relationship expected using SIMD calculations. The tests also confirmed this, with near 6.4 GFLOPS/GHz SP and near half that DP. This performance was obtained using data from L1 and L2 caches, with almost the same from L3 cache.
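The Max MFLOPS figures in the Pi 5 GCC 12 results are consistent with a simple conversion from the MB/second values of the x[m]=x[m]+s*y[m] test. The sketch below is inferred from the printed numbers, not taken from the benchmark source, and the function names are hypothetical. It assumes two operand reads and two floating point operations per element. The 2400 MHz Pi 5 clock used in the usage note is also an assumption, as it is not stated in this section.

```c
#include <assert.h>

/* MFLOPS implied by a MemSpeed MB/second figure for x[m]=x[m]+s*y[m].
   Assumption: each element needs two operand reads (x[m] and y[m]) of
   bytes_per_value each and performs two floating point operations
   (one multiply, one add). */
double triad_mflops(double mb_per_sec, int bytes_per_value)
{
    assert(bytes_per_value > 0);
    double bytes_per_element = 2.0 * bytes_per_value;
    double flops_per_element = 2.0;
    return mb_per_sec * flops_per_element / bytes_per_element;
}

/* GFLOPS per GHz; MFLOPS divided by MHz gives the same number. */
double gflops_per_ghz(double mflops, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return mflops / cpu_mhz;
}
```

With the 8 KB results, triad_mflops(61264, 4) gives 15316 for single precision and triad_mflops(54902, 8) gives 6862.75 for double precision, matching the Max MFLOPS line. At an assumed 2400 MHz, gflops_per_ghz(15316, 2400) is about 6.38.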

 Pi 4 GCC 8
    Memory Reading Speed Test 64 Bit gcc 8 by Roy Longbottom

               Start of test Mon May 25 22:23:53 2020

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   15531   3999   3957  15576   4387   4358  11629   9313   9314
      16   15717   3992   3922  15770   4355   4377  11799   9444   9446
      32   12020   3818   3814  12043   4179   4198  11549   9496   9497
      64   12228   3816   3887  12220   4166   4195   8935   8506   8506
     128   12265   3869   3941  12157   4182   4206   8080   8193   8196
     256   12230   3873   3932  12073   4199   4216   8129   8224   8223
     512    9731   3832   3902   9709   4150   4171   8029   7845   7865
    1024    3772   3682   3769   3467   3887   3920   5478   5543   5378
    2048    1896   3463   3496   1886   3616   3612   2937   2945   2923
    4096    1924   3520   3528   1933   3651   3394   2752   2796   2785
    8192    1996   3523   3555   1988   3643   3630   2668   2661   2663

                End of test Mon May 25 22:24:10 2020

 Pi 5 GCC 8
     Memory Reading Speed Test 64 Bit gcc 8 by Roy Longbottom

               Start of test Fri Aug 11 16:34:06 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   50862   6851   6746  50686   7193   7490  37629  18595  25168
      16   51032   6820   6717  51024   7164   7468  38002  18888  24946
      32   49985   6814   6676  50568   7150   7446  37609  18972  25259
      64   50868   6857   6656  50864   7168   7411  37799  19114  25426
     128   32618   6797   6670  32666   7142   7278  35466  19143  25439
     256   32540   6788   6640  32744   7183   7278  34821  19144  25360
     512   26949   6786   6668  30112   7155   7246  33493  14598  16816
    1024   25094   6719   6645  19272   6821   7206  21805  17292  22671
    2048   20586   6365   6586  19261   6887   7172   4740   4662  13673
    4096    5004   6680   6710   4963   6776   6249   7938   8990   8797
    8192    3229   5589   4662   3205   6496   6573   6654   6719   4613

                End of test Fri Aug 11 16:34:22 2023

 P5/P4 Comparison
  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8    3.27   1.71   1.70   3.25   1.64   1.72   3.24   2.00   2.70
      16    3.25   1.71   1.71   3.24   1.65   1.71   3.22   2.00   2.64
      32    4.16   1.78   1.75   4.20   1.71   1.77   3.26   2.00   2.66
      64    4.16   1.80   1.71   4.16   1.72   1.77   4.23   2.25   2.99
     128    2.66   1.76   1.69   2.69   1.71   1.73   4.39   2.34   3.10
     256    2.66   1.75   1.69   2.71   1.71   1.73   4.28   2.33   3.08
     512    2.77   1.77   1.71   3.10   1.72   1.74   4.17   1.86   2.14
    1024    6.65   1.82   1.76   5.56   1.75   1.84   3.98   3.12   4.22
    2048   10.86   1.84   1.88  10.21   1.90   1.99   1.61   1.58   4.68
    4096    2.60   1.90   1.90   2.57   1.86   1.84   2.88   3.22   3.16
    8192    1.62   1.59   1.31   1.61   1.78   1.81   2.49   2.52   1.73

 
 Pi 5 GCC 12
     Memory Reading Speed Test 64 Bit gcc 12 by Roy Longbottom

               Start of test Thu Sep 28 18:54:28 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   54902  61264  65610  55241  65554  63848  37768  25475  25486
      16   54803  60539  64671  55169  64700  64750  38078  24891  24891
      32   51859  60967  64278  52558  65247  65275  37520  25234  25234
      64   52597  61169  65523  52485  65514  65523  37945  25408  25402
     128   33580  60278  63742  33647  63692  62897  37218  25370  25457
     256   33724  60317  63873  33711  63840  63865  35555  25371  25375
     512   33522  59194  63298  33502  63259  63175  35909  25459  25451
    1024   32078  57946  60718  31576  60680  59199  26110  22319  23059
    2048   29249  55376  57648  29028  57558  57290  16245  18242  19514
    4096    4508  11981  11906   4864  11894   9313  10254  10529  10668
    8192    3175   6507   6150   3178   6441   6499   6678   6904   6364
     Max
  MFLOPS    6862  15316

                End of test Thu Sep 28 18:54:43 2023

 Pi 5 GCC 12/8
  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8    1.08   8.94   9.73   1.09   9.11   8.52   1.00   1.37   1.01
      16    1.07   8.88   9.63   1.08   9.03   8.67   1.00   1.32   1.00
      32    1.04   8.95   9.63   1.04   9.13   8.77   1.00   1.33   1.00
      64    1.03   8.92   9.84   1.03   9.14   8.84   1.00   1.33   1.00
     128    1.03   8.87   9.56   1.03   8.92   8.64   1.05   1.33   1.00
     256    1.04   8.89   9.62   1.03   8.89   8.78   1.02   1.33   1.00
     512    1.24   8.72   9.49   1.11   8.84   8.72   1.07   1.74   1.51
    1024    1.28   8.62   9.14   1.64   8.90   8.22   1.20   1.29   1.02
    2048    1.42   8.70   8.75   1.51   8.36   7.99   3.43   3.91   1.43
    4096    0.90   1.79   1.77   0.98   1.76   1.49   1.29   1.17   1.21
    8192    0.98   1.16   1.32   0.99   0.99   0.99   1.00   1.03   1.38



NeonSpeed Benchmark MB/Second - NeonSpeedPi64g8 and g12

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm functions are as generated by the compiler, and NEON functions are produced by using intrinsic functions.
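The difference between the two kinds of function can be sketched as below. This is not the benchmark source. It assumes the Float v=v+s*v test has the same form as the MemSpeed x[m]=x[m]+s*y[m] calculation, and the function names are hypothetical. The intrinsics used (vld1q_f32, vmlaq_n_f32, vst1q_f32) are standard arm_neon.h calls, guarded so that the code still compiles, using the plain loop, on systems without NEON.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* "Norm" style: plain C, left to the compiler to vectorise or not. */
void add_scaled_norm(float *x, const float *y, float s, size_t n)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s * y[i];
}

/* "Neon" style: four floats per step using intrinsics. Falls back to
   the plain loop when not compiled for an ARM target with NEON. */
void add_scaled_neon(float *x, const float *y, float s, size_t n)
{
    assert(x != NULL && y != NULL);
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t vx = vld1q_f32(x + i);
        float32x4_t vy = vld1q_f32(y + i);
        vx = vmlaq_n_f32(vx, vy, s);      /* vx + vy * s */
        vst1q_f32(x + i, vx);
    }
#endif
    for (; i < n; i++)                    /* remaining elements */
        x[i] = x[i] + s * y[i];
}
```

Whether the Norm version runs as fast as the Neon one depends on the compiler choosing to generate SIMD instructions for the plain loop, which is the effect examined in the results that follow.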

The initial GCC 8 test functions produced the same irregular results as the first MemSpeed tests, with the Normal Float and Int calculations appearing to only read RAM based data. Performance from NEON code indicated that the Pi 5 was typically 2.5 times faster than the Pi 4, using cache based data, and 1.5 times from RAM. Exceptions were gains of up to 7.9 times using L3 cache and nearly 4.8 from lower level caches.

The GCC 12 compiler produced acceptable Normal performance on the Pi 5, reflected by gains of up to just over ten times over GCC 8 results. The compiled code is also shown to be faster than the NEON functions. Many of the latter show 20% improvements over GCC 8, but some were slower. Maximum floating point speed demonstrated was nearly 17 GFLOPS.

Pi 4 GCC 8
NEON Speed 64 Bit gcc 8 Mon May 25 22:21:51 2020
       Vector Reading Speed in MBytes/Second
Memory  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v
KBytes   Norm   Neon   Norm   Neon  Float    Int

     16   3629  14987   3925  13643  14457  16642
     32   3475  10933   3821   9970  11029  11055
     64   3447  11749   3845  11098  11802  12079
    128   3332  11392   3912  10813  11430  11513
    256   3325  11565   3926  10981  11598  11699
    512   3313  10553   3917  10269  10755  10740
   1024   3239   3331   3737   3291   3302   3321
   4096   2987   1888   3331   1777   1881   1878
  16384   3150   1821   3347   1814   1812   1834
  65536   2747   1954   3132   2017   1904   2021
    Max 
 MFLOPS          3747

       End of test Mon May 25 22:22:11 2020

Pi 5 GCC 8                                        P5/P4 Comparison
NEON Speed 64 Bit gcc 8 Fri Aug 11 16:44:52 2023

       Vector Reading Speed in MBytes/Second
Memory  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v
KBytes   Norm   Neon   Norm   Neon  Float    Int   Norm   Neon   Norm   Neon  Float    Int

     16   6745  46851   6968  44490  46849  46847   1.86   3.13   1.78   3.26   3.24   2.81
     32   6727  47104   6947  44618  47061  47056   1.94   4.31   1.82   4.48   4.27   4.26
     64   6703  46642   6962  44166  47040  46955   1.94   3.97   1.81   3.98   3.99   3.89
    128   6587  27383   6840  27199  27404  27398   1.98   2.40   1.75   2.52   2.40   2.38
    256   6579  27491   6857  27299  27509  27509   1.98   2.38   1.75   2.49   2.37   2.35
    512   6571  27433   6862  26599  24237  26163   1.98   2.60   1.75   2.59   2.25   2.44
   1024   6531  26340   6756  25226  24597  24527   2.02   7.91   1.81   7.67   7.45   7.39
   4096   6414   9410   6505   9986   9474   8835   2.15   4.98   1.95   5.62   5.04   4.70
  16384   5690   2850   5501   2830   2865   2488   1.81   1.57   1.64   1.56   1.58   1.36
  65536   4837   2534   4736   2458   2401   2450   1.76   1.30   1.51   1.22   1.26   1.21
    Max 
 MFLOPS         11776

       End of test Fri Aug 11 16:45:12 2023

  Pi 5 GCC 12                                     Pi 5 GCC 12/8
NEON Speed 64 Bit gcc 12 Thu Sep 28 18:57:35 

       Vector Reading Speed in MBytes/Second
Memory  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v
KBytes   Norm   Neon   Norm   Neon  Float    Int   Norm   Neon   Norm   Neon  Float    Int

     16  67042  45164  67037  45358  54228  54166   9.94   0.96   9.62   1.02   1.16   1.16
     32  67631  45190  67621  45415  53833  53675  10.05   0.96   9.73   1.02   1.14   1.14
     64  67812  44856  67491  45171  52338  51321  10.12   0.96   9.69   1.02   1.11   1.09
    128  62779  33147  64360  33074  33619  33458   9.53   1.21   9.41   1.22   1.23   1.22
    256  64352  33405  64803  33187  33699  33719   9.78   1.22   9.45   1.22   1.23   1.23
    512  61159  33171  61798  32263  33178  28319   9.31   1.21   9.01   1.21   1.37   1.08
   1024  58937  32149  57732  31639  32219  32108   9.02   1.22   8.55   1.25   1.31   1.31
   4096   9215   2639   7168   3800   3823   3776   1.44   0.28   1.10   0.38   0.40   0.43
  16384   5546   2830   5592   2772   2753   2503   0.97   0.99   1.02   0.98   0.96   1.01
  65536   4633   2445   4196   1922   2196   2294   0.96   0.96   0.89   0.78   0.91   0.94
    Max
 MFLOPS  16953
  



MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in two different versions, using standard compiled C code for single and double precision arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.


MP-Whetstone Benchmark - MP-WHETSPi64g8 and g12

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Performance was generally proportional to the number of cores used. Overall seconds indicates MP efficiency, with around 5 seconds for 1, 2 and 4 threads, doubling with 8.
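The rule that measured speed is based on the last thread to finish can be expressed as a small calculation. The following is a sketch under that reading only, with hypothetical function names. It does not reproduce the benchmark's own timing code.

```c
#include <assert.h>
#include <stddef.h>

/* Elapsed time for a multithreaded test: the time of the slowest thread. */
double last_finish_seconds(const double *thread_seconds, size_t threads)
{
    assert(threads > 0);
    double longest = thread_seconds[0];
    for (size_t i = 1; i < threads; i++)
        if (thread_seconds[i] > longest)
            longest = thread_seconds[i];
    return longest;
}

/* Combined rate in millions of operations per second: total work of all
   threads divided by the time taken by the last one to finish. */
double combined_mops(double million_ops_per_thread,
                     const double *thread_seconds, size_t threads)
{
    return million_ops_per_thread * (double)threads
           / last_finish_seconds(thread_seconds, threads);
}
```

As a made-up example, four threads that each execute 1000 million operations and finish in 4.9, 5.0, 5.0 and 5.2 seconds are rated at 4000 / 5.2, or about 769 MOPS. One slow thread therefore lowers the reported speed for the whole test.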

The Pi 5 CPU temperature reached 80.7C within the 26 second testing time. Pi5/Pi4 4 thread performance ratios were between 2.22 and 3.43.

Performance of all GCC 8 compilations was essentially the same as that from GCC 12.

Pi 4 GCC 8
MP-Whetstone Benchmark 64 Bit gcc 8 Mon May 25 10:18:21 2020

                    Using 1, 2, 4 and 8 Threads

           MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
                       1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

1T        2146.7   530.1   530.1   397.2    60.5    27.3  7451.7  2240.2   998.1
2T        4290.4  1056.0  1055.3   794.0   120.9    54.7 14859.4  4488.5  1995.2
4T        8583.9  2115.8  2113.4  1590.5   241.8   109.3 29265.9  8940.7  3984.5
8T        8806.6  2676.0  2140.1  1627.3   244.8   113.0 37995.0 11565.4  4097.5

   Overall Seconds   5.00 1T,   5.01 2T,   5.02 4T,  10.10 8T

   All calculations produced consistent numeric results

Pi 5 GCC 8
MP-Whetstone Benchmark 64 Bit gcc 8 Mon Aug 14 10:09:58 2023

                    Using 1, 2, 4 and 8 Threads

           MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
                       1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

1T        6138.4  1278.2  1278.2  1020.4   174.1    94.8 17273.2  7033.6  2394.9
2T       12198.6  2542.8  2549.5  2029.7   344.4   188.4 35246.9 14307.3  4794.1
4T       24008.3  5013.1  4683.8  4045.3   674.5   374.4 69938.6 28558.3  9381.9
8T       24768.0  5170.6  5867.3  4080.9   693.9   385.9 74272.7 30002.8  9478.1

   Overall Seconds   5.00 1T,   5.04 2T,   5.22 4T,  10.37 8T

   All calculations produced consistent numeric results

P5/P4 Comparison
1T          2.86    2.41    2.41    2.57    2.88    3.47    2.32    3.14    2.40
2T          2.84    2.41    2.42    2.56    2.85    3.44    2.37    3.19    2.40
4T          2.80    2.37    2.22    2.54    2.79    3.43    2.39    3.19    2.35
8T          2.81    1.93    2.74    2.51    2.83    3.42    1.95    2.59    2.31

Pi 5 GCC 12
MP-Whetstone Benchmark 64 Bit gcc 12 Thu Sep 28 21:58:24 2023

                    Using 1, 2, 4 and 8 Threads

           MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos    Exp   Fixpt      If    Equal
                       1       2       3    MOPS   MOPS    MOPS    MOPS     MOPS

1T        6180.4  1279.0  1273.5  1028.0   173.8    96.7 17586.5  7187.4  2396.5
2T       12353.4  2550.4  2556.9  2049.9   347.7   193.3 35875.6 14220.6  4796.8
4T       24647.0  5100.9  5078.2  4106.7   695.5   385.9 63256.4 28609.7  9549.0
8T       25053.6  5121.0  5293.6  4174.6   706.8   386.4 78259.8 31001.5  9658.4

   Overall Seconds   5.00 1T,   5.01 2T,   5.06 4T,  10.10 8T

Pi 5 GCC 12/8
1T          1.01    1.00    1.00    1.01    1.00    1.02    1.02    1.02    1.00
2T          1.01    1.00    1.00    1.01    1.01    1.03    1.02    0.99    1.00
4T          1.03    1.02    1.08    1.02    1.03    1.03    0.90    1.00    1.02
8T          1.01    0.99    0.90    1.02    1.02    1.00    1.05    1.03    1.02

 



MP-Dhrystone Benchmark - MP-DHRYPi64g8 and g12

This executes multiple copies of the same program, but with some shared data, leading to unacceptable multithreaded performance. Results are in VAX MIPS aka DMIPS.

Using the GCC 8 version, the Pi 5 was 2.27 times faster than the Pi 4, achieving 7.67 DMIPS/MHz. Running on the Pi 5, the GCC 12 compilation was slightly faster than the GCC 8 one.
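The VAX MIPS and DMIPS/MHz figures follow from the Dhrystones per second results. The sketch below uses the conventional reference of 1757 Dhrystones per second for the VAX 11/780. The function names are hypothetical, and the 2400 MHz Pi 5 clock in the usage note is an assumption, as it is not stated in this section.

```c
#include <assert.h>

/* Dhrystones per second achieved by the VAX 11/780, the 1 MIPS
   reference machine behind the VAX MIPS (DMIPS) rating. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* VAX MIPS rating from a Dhrystones per second result. */
double vax_mips(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* DMIPS per MHz for a given CPU clock. */
double dmips_per_mhz(double dmips, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return dmips / cpu_mhz;
}
```

For the Pi 5 GCC 12 single thread result, vax_mips(35046385) is about 19947, as reported. For the GCC 8 result, dmips_per_mhz(18415, 2400) is about 7.67.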
 

Pi 4 GCC 8
MP-Dhrystone Benchmark 64 Bit gcc 8 Tue May 26 11:41:49 2020

                    Using 1, 2, 4 and 8 Threads

Threads                        1       2       4       8
Seconds                     0.55    1.08    2.15     4.3
Dhrystones per Second    1.5E+07 1.5E+07 1.5E+07 1.5E+07
VAX MIPS rating             8271    8419    8478    8465

         Internal pass count correct all threads

         End of test Tue May 26 11:41:57 2020

Pi 5 GCC 8
MP-Dhrystone Benchmark 64 Bit gcc 8 Mon Aug 14 10:16:15 2023

                    Using 1, 2, 4 and 8 Threads

Threads                        1       2       4       8
Seconds                     0.62    1.88    4.18    8.45    Pi5/Pi4
Dhrystones per Second    3.2E+07 2.1E+07 1.9E+07 1.9E+07
VAX MIPS rating            18415   12137   10899   10771      2.27

         Internal pass count correct all threads

         End of test Mon Aug 14 10:16:31 2023

Pi 5 GCC 12
  MP-Dhrystone Benchmark 64 Bit gcc 12 Thu Sep 28 22:03:10 2023

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.57     1.95     4.31     8.70     Pi 5 GCC 12/8
 Dhrystones per Second   35046385 20477300 18570390 18398880
 VAX MIPS rating            19947    11655    10569    10472        1.08

         Internal pass count correct all threads

         End of test Thu Sep 28 22:03:26 2023

  



MP SP NEON Linpack Benchmark - linpackMPNeonPi64g8 and g12

This was produced to show that the original Linpack benchmark was completely unsuitable for benchmarking multiple CPUs or cores, and this is reflected in the results. The program uses NEON intrinsic functions, with increasing data sizes. Single core performance ratios are provided below for the three different memory array sizes that use N x N x 4 bytes or 40 KB, 1 MB and 4 MB. The three Pi 5/Pi 4 performance ratios were 2.94, 5.24, and 4.13 times. Maximum single core speed was 6.85 GFLOPS.
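The array sizes and performance ratios quoted above are simple arithmetic, sketched here with hypothetical function names. The sizes come out in decimal units, so N = 500 gives 1,000,000 bytes.

```c
#include <assert.h>

/* Bytes used by an N x N single precision matrix (4 bytes per value). */
double matrix_bytes_sp(unsigned n)
{
    return (double)n * (double)n * 4.0;
}

/* Ratio of two MFLOPS results, for example Pi 5 over Pi 4. */
double mflops_ratio(double new_mflops, double old_mflops)
{
    assert(old_mflops > 0.0);
    return new_mflops / old_mflops;
}
```

For N = 100, 500 and 1000, matrix_bytes_sp gives 40,000, 1,000,000 and 4,000,000 bytes. Using the unthreaded GCC 8 results from the tables in this section, mflops_ratio(6375.62, 2167.70) is about 2.94 and mflops_ratio(7536.07, 1438.27) is about 5.24.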

Two out of three of the new GCC 12 compilations produced slower performance on the Pi 5 and completely different numeric sumchecks.

Pi 4 GCC 8
Linpack Single Precision MultiThreaded Benchmark
 NEON Intrinsics 64 Bit gcc 8, Tue May 26 11:43:46 2020

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads    None        1       2       4

N  100   2167.70   91.82   89.65   89.96
N  500   1438.27  644.85  635.89  635.33
N 1000    394.99  376.97  383.92  384.19

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            1.97            5.40           13.51
 RE  4.69621336e-05  6.44138840e-04  3.22485110e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -1.31130219e-05  5.79357147e-05 -3.08930874e-04
 XN -1.30534172e-05  3.51667404e-05  1.90019608e-04

Thread
 0 - 4 Same Results    Same Results    Same Results

Pi 5 GCC 8
Linpack Single Precision MultiThreaded Benchmark
 NEON Intrinsics 64 Bit gcc 8, Mon Aug 14 10:22:53 2023

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads    None        1       2       4    Pi5/Pi4

N  100   6375.62  154.59  151.48  150.82      2.94
N  500   7536.07 2250.75 2263.15 2222.61      5.24
N 1000   1631.94 1452.80 1401.29 1298.10      4.13

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            1.97            5.40           13.51
 RE  4.69621336e-05  6.44138840e-04  3.22485110e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -1.31130219e-05  5.79357147e-05 -3.08930874e-04
 XN -1.30534172e-05  3.51667404e-05  1.90019608e-04

Thread
 0 - 4 Same Results    Same Results    Same Results

Pi 5 GCC 12
 Linpack Single Precision MultiThreaded Benchmark
 NEON Intrinsics 64 Bit gcc 12, Thu Sep 28 22:05:37 2023

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads    None        1       2       4  Pi 5 GCC 12/8

N  100   5461.61  169.27  176.25  174.14      0.86
N  500   6853.70 2538.16 2554.26 2562.31      0.91
N 1000   1741.83 1486.68 1493.84 1501.34      1.07

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

Thread
 0 - 4 Same Results    Same Results    Same Results
  



MP BusSpeed (read only) Benchmark - MP-BusSpd2Pi64g8 and g12

For further details see the single core BusSpeed Benchmark, which obtains GCC 8 results of the same order as those from the single thread of this MP version. In the latter, each thread exercises a dedicated segment of the data, circulated on a round robin basis, reading all data every pass.
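One plausible reading of the round robin arrangement is sketched below. It is an assumption about how the segments might be circulated, not the benchmark source, and the function names are hypothetical. The data is split evenly into one segment per thread, and each thread moves on to the next segment on every pass.

```c
#include <assert.h>
#include <stddef.h>

/* Segment index read by a thread on a given pass: every thread starts
   on its own segment and moves to the next one each pass, wrapping
   round, so no two threads share a segment within a pass. */
size_t segment_for(size_t thread, size_t pass, size_t threads)
{
    assert(threads > 0 && thread < threads);
    return (thread + pass) % threads;
}

/* First word of a segment when the data is split evenly between threads. */
size_t segment_start(size_t segment, size_t total_words, size_t threads)
{
    assert(threads > 0 && segment < threads);
    return segment * (total_words / threads);
}
```

With four threads, pass 0 assigns segments 0, 1, 2, 3 and pass 1 assigns 1, 2, 3, 0, so all of the data is read on every pass.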

Considering the most important GCC 8 RdAll tests, Pi5/Pi4 performance gains mainly approached three times for cache based data, but multithreaded operation showed gains of up to 9.47 times. Highest gains of up to 18.17 times were in other areas. The high gains are due to improved caching on a read only basis.

The first Pi 5 GCC 12/8 comparisons, at the larger increments, indicated similar performance, but gains increased progressively as more data was being read, reaching more than five times on RdAll. Here, single thread data transfer speeds reached nearly 68 GB/second and 4 thread up to 150 GB/second. This led me to write a new program, New INTitHOT Integer Stress Test, where it is shown that GCC 12 produced highly efficient SIMD vector instructions.

Pi 4 GCC 8
MP-BusSpd 64 Bit gcc 8 Tue May 26 11:51:30 2020

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

    KB   Inc32  Inc16   Inc8   Inc4   Inc2  RdAll 

12.3      5168   5542   5641   4205   4095   4230
    2T    8968  10728  10161   8110   8058   8368
    4T    7874  13255  15586  13641  15485  16533
    8T    8186  13386  15239  13469  14431  16372
122.9      598    927   1876   2792   3746   4059
    2T     514    719   1538   4846   7596   8083
    4T     486    933   2060   4126   8175  13690
    8T     483    937   2059   4160   8166  13817
12288      224    257    488    964   1933   3579
    2T     219    427    889   1832   3493   5371
    4T     280    353    562    859   2168   3286
    8T     229    230    527   1075   1880   4480

         No Errors Found

         End of test Tue May 26 11:51:43 2020

Pi 5 GCC 8                                       Pi 5/4 GCC 8
MP-BusSpd 64 Bit gcc 8 Mon Aug 14 10:37:37 2023

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

    KB   Inc32  Inc16   Inc8   Inc4   Inc2  RdAll  Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

12.3      9289   9450  15464  12578  12443  12073   1.80   1.71   2.74   2.99   3.04   2.85
    2T   11465  15018  23403  20058  22357  22997   1.28   1.40   2.30   2.47   2.77   2.75
    4T    8757  11343  21200  26582  32854  42575   1.11   0.86   1.36   1.95   2.12   2.58
    8T    9036   8602  11448  17821  26795  30949   1.10   0.64   0.75   1.32   1.86   1.89
122.9     2358   4293   7257  11306  11657  11609   3.94   4.63   3.87   4.05   3.11   2.86
    2T    4466   7819  13844  21220  23109  23119   8.69  10.87   9.00   4.38   3.04   2.86
    4T    8831  14835  20781  42375  45809  44669  18.17  15.90  10.09  10.27   5.60   3.26
    8T    7011  11818  19792  34990  39720  43742  14.52  12.61   9.61   8.41   4.86   3.17
12288      654    884   1585   3502   7243  10088   2.92   3.44   3.25   3.63   3.75   2.82
    2T     726    743   1303   3454   7723  18286   3.32   1.74   1.47   1.89   2.21   3.40
    4T     735   1551   1405   5166  10906  31106   2.63   4.39   2.50   6.01   5.03   9.47
    8T     771    933   1486   3197   9182  18377   3.37   4.06   2.82   2.97   4.88   4.10

         No Errors Found

         End of test Mon Aug 14 10:37:49 2023

Pi 5 GCC 12                                      Pi 5 GCC 12/8
MP-BusSpd 64 Bit gcc 12 Thu Sep 28 22:11:28 2023

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

    KB   Inc32  Inc16   Inc8   Inc4   Inc2  RdAll  Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

12.3      9444   9504  16195  17543  27434  67773   1.02   1.01   1.05   1.39   2.20   5.61
    2T   10884  14542  23738  28964  38304  92983   0.95   0.97   1.01   1.44   1.71   4.04
    4T   10566  11790  21233  28439  44074  91129   1.21   1.04   1.00   1.07   1.34   2.14
    8T    8657  10289  12122  19920  30038  45788   0.96   1.20   1.06   1.12   1.12   1.48
122.9     2380   4359   7261  11627  20970  44300   1.01   1.02   1.00   1.03   1.80   3.82
    2T    4586   7699  13845  22597  40901  73723   1.03   0.98   1.00   1.06   1.77   3.19
    4T    5469  10629  24698  38945  69318 150304   0.62   0.72   1.19   0.92   1.51   3.36
    8T    6902  11176  19387  36720  64760 144651   0.98   0.95   0.98   1.05   1.63   3.31
12288      632    806   1838   3628   7366  13161   0.97   0.91   1.16   1.04   1.02   1.30
    2T     961    711   1520   3527   5546  13012   1.32   0.96   1.17   1.02   0.72   0.71
    4T     670   1566   3062   5403  13675  19563   0.91   1.01   2.18   1.05   1.25   0.63
    8T     726   1117   2322   4747   9371  17111   0.94   1.20   1.56   1.48   1.02   0.93
 



MP RandMem Benchmark - MP-RandMemPi64g8 and g12

The benchmark uses the same complex indexing for serial and random access, with separate read only and read/write tests. The performance patterns were as expected. Random access speed depends on how much of the benefit of burst reading and writing is lost, which produces the slow speeds with the larger data sizes. Read only performance increased with the thread count, as expected, whereas read/write performance remained constant at a particular data size, probably due to write back to shared data space.

Again, the new Pi 5 caching arrangement produced high performance gains over the Pi 4, via GCC 8 compilations. In this case they were between 4 and 18 times, mainly at 122.9 KB. Others were between 2 and 3 times for cache based data and half that from RAM.

Performance from the GCC 12 version was little different to that from GCC 8.
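
The indexed access idea described above can be sketched as follows. This is a minimal illustration, assuming the data is addressed through an index array that is either in sequence or shuffled. The function names, the seed and the array size are hypothetical and are not taken from the benchmark source.

```python
import random

def make_index(n, shuffled, seed=1):
    """Return indexes 0..n-1, in sequence (serial) or shuffled (random)."""
    idx = list(range(n))
    if shuffled:
        random.Random(seed).shuffle(idx)
    return idx

def read_only(data, idx):
    """Read every word through the index array and accumulate a checksum."""
    total = 0
    for i in idx:
        total += data[i]
    return total

def read_write(data, idx, delta=1):
    """Read, modify and write back every word through the index array."""
    for i in idx:
        data[i] = data[i] + delta

n = 3072  # hypothetical size: about 12.3 KB of 4 byte words
data = list(range(n))
serial = make_index(n, shuffled=False)
rand = make_index(n, shuffled=True)

# Same loop and same amount of work; only the order of addresses differs.
serial_sum = read_only(data, serial)
random_sum = read_only(data, rand)
```

Both passes touch every word once, so the checksums agree. On real hardware the shuffled order defeats burst transfers, which is the effect the RndRD and RndRW columns measure.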

 Pi 4 GCC 8
 MP-RandMem 64 Bit gcc 8 Tue May 26 11:53:37 2020

    MB/Second Using 1, 2, 4 and 8 Threads
     KB         SerRD  SerRW  RndRD  RndRW

   12.3    1T    5945   7898   5948   7895
           2T   11913   7937  11905   7929
           4T   23601   7875  23385   7867
           8T   23139   7777  23016   7770
  122.9    1T    5785   7090   2026   1977
           2T   10941   7074   1654   1968
           4T   10364   7052   1854   1970
           8T   10256   7031   1844   1973
  12288    1T    3861   1244    180    169
           2T    3793   1242    220    171
           4T    3941   1100    343    170
           8T    4065   1247    351    171
    No Errors Found
    End of test Tue May 26 11:54:20 2020

 Pi 5 GCC 8                                       Pi 5/4 GCC 8
 MP-RandMem 64 Bit gcc 8 Mon Aug 14 10:45:21 2023

    MB/Second Using 1, 2, 4 and 8 Threads
     KB         SerRD  SerRW  RndRD  RndRW         SerRD  SerRW  RndRD  RndRW

   12.3    1T   18593  18938  17858  17066          3.13   2.40   3.00   2.16
           2T   32655  18759  32998  16990          2.74   2.36   2.77   2.14
           4T   47087  18905  45181  17027          2.00   2.40   1.93   2.16
           8T   34725  18602  33955  17087          1.50   2.39   1.48   2.20
  122.9    1T   15501  16259  10950   9853          2.68   2.29   5.40   4.98
           2T   29970  16392  21177   9921          2.74   2.32  12.80   5.04
           4T   51762  16408  33068   9781          4.99   2.33  17.84   4.96
           8T   46575  15741  27979   9235          4.54   2.24  15.17   4.68
  12288    1T   12227   1729    538    328          3.17   1.39   2.99   1.94
           2T   16713   1724    617    311          4.41   1.39   2.80   1.82
           4T   16771   1825    722    312          4.26   1.66   2.10   1.84
           8T   13124   1739    622    319          3.23   1.39   1.77   1.87
    No Errors Found
    End of test Mon Aug 14 10:46:01 2023

 Pi 5 GCC 12                                      Pi 5 GCC 12/8
 MP-RandMem 64 Bit gcc 12 Thu Sep 28 22:15:02 2023

    MB/Second Using 1, 2, 4 and 8 Threads
     KB         SerRD  SerRW  RndRD  RndRW         SerRD  SerRW  RndRD  RndRW

   12.3    1T   18667  19102  18108  17246           1.0    1.0    1.0    1.0
           2T   34841  19037  33292  16912           1.1    1.0    1.0    1.0
           4T   47204  18694  46771  17137           1.0    1.0    1.0    1.0
           8T   35115  18676  34015  17230           1.0    1.0    1.0    1.0
  122.9    1T   15826  16395  10993   9928           1.0    1.0    1.0    1.0
           2T   30566  16400  21397   9940           1.0    1.0    1.0    1.0
           4T   56413  16361  38355   9921           1.1    1.0    1.2    1.0
           8T   54596  16372  37617   9889           1.2    1.0    1.3    1.1
  12288    1T   13622   1902    539    343           1.1    1.1    1.0    1.0
           2T   20937   1830    603    345           1.3    1.1    1.0    1.1
           4T   26993   1892    682    343           1.6    1.0    0.9    1.1
           8T   18621   1797    650    347           1.4    1.0    1.0    1.1
    No Errors Found
    End of test Thu Sep 28 22:15:42 2023

  



MP-MFLOPS Benchmarks - MP-MFLOPSPi64g8 and g12, MP-MFLOPSPi64DPg8 and g12

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in the Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. There are two varieties, single precision and double precision, both attempting to show near maximum MP floating point processing speeds.

At a given precision, result sumchecks should be identical when using the same run time parameters. Here, gcc 12 compiled programs were run using parameters that produce longer running times, with different sumchecks to those from earlier versions.

These are all short tests, running at full MHz with low increases in temperature. Most results at 12.8 and 128 KB show performance nearly doubling each time the thread count doubles, up to four threads. Maximum GCC 12 Pi 5 single precision (SP) 4 thread performance was 84.9 GFLOPS, with double precision (DP) at 42.5 GFLOPS, and slightly less via GCC 8. Comments on comparisons follow the single precision results.
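
The way each thread works on its own segment of the data can be sketched as below. This is an illustration only. The 2 operation form shown is an assumed reading of "a multiply and an add per data word", the longer expression is cut off after its first three terms (8 operations), and the constants are hypothetical. Python threads will not reproduce the measured speed-ups, so the sketch shows the partitioning and the identical-results check, not the timing. The MFLOPS per MHz helper assumes the standard clock speeds of 1500 MHz for the Pi 4 and 2400 MHz for the Pi 5.

```python
from concurrent.futures import ThreadPoolExecutor

def two_ops(x, start, end, a, b):
    """Assumed 2 operations per word: one add and one multiply."""
    for i in range(start, end):
        x[i] = (x[i] + a) * b

def eight_ops(x, start, end, a, b, c, d, e, f):
    """First three terms of the longer expression: 8 operations per word."""
    for i in range(start, end):
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f

def run_threads(kernel, x, threads, *consts):
    """Run the same kernel on equal segments of x, one segment per thread.

    Assumes len(x) is a multiple of the thread count.
    """
    seg = len(x) // threads
    with ThreadPoolExecutor(max_workers=threads) as pool:
        jobs = [pool.submit(kernel, x, t * seg, (t + 1) * seg, *consts)
                for t in range(threads)]
        for job in jobs:
            job.result()  # re-raise any exception from a worker
    return x

def mflops_per_mhz(mflops, mhz):
    """MFLOPS per MHz, rounded to one decimal place as in the listings."""
    return round(mflops / mhz, 1)

words = 3200  # 12.8 KB of 4 byte words
consts = (0.25, 0.5, 0.125, 0.75, 0.375, 0.625)  # hypothetical constants
one = run_threads(eight_ops, [0.5] * words, 1, *consts)
four = run_threads(eight_ops, [0.5] * words, 4, *consts)
same = one == four  # results should not depend on the thread count

print(same)
print(mflops_per_mhz(26028, 1500))  # Pi 4 GCC 8, 4 threads
print(mflops_per_mhz(84933, 2400))  # Pi 5 GCC 12, 4 threads
```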

Pi 4 GCC 8 MP-MFLOPS 64 Bit gcc 8 Tue May 26 12:01:44 2020

        2 Ops/Word              32 Ops/Word
    KB  12.8     128   12800    12.8     128   12800     Maximum  MFLOPS       
        MFLOPS                                            GFLOPS per MHz
 1T     3212    3162     416    6741    6720    6393         6.7     4.5
 2T     6343    5109     565   13381   13376    9914        13.4     8.9
 4T    11644    5077     584   25506   26028    9883        26.0    17.4
 8T     7804    7953     579   20537   24446    8651
 Results x 100000, 0 indicates ERRORS
 1T    76406   97075   99969   66015   95363   99951
 2T    76406   97075   99969   66015   95363   99951
 4T    76406   97075   99969   66015   95363   99951
 8T    76406   97075   99969   66015   95363   99951

         End of test Tue May 26 12:01:46 2020

Pi 5 GCC 8 MP-MFLOPS 64 Bit gcc 8 Mon Aug 14 11:16:36 2023

        2 Ops/Word              32 Ops/Word
    KB  12.8     128   12800    12.8     128   12800     Maximum  MFLOPS
        MFLOPS                                            GFLOPS per MHz
 1T     9309    8856     540   20396   19543   11710        19.5     8.1
 2T    17114   18565     683   35842   40506   11937        40.5    16.9
 4T    29453   34610     826   75120   77896   12646        77.9    32.5
 8T    28688   31506     959   59804   57700   15374
 Results x 100000, 0 indicates ERRORS
 1T    76406   97075   99969   66015   95363   99951
 2T    76406   97075   99969   66015   95363   99951
 4T    76406   97075   99969   66015   95363   99951
 8T    76406   97075   99969   66015   95363   99951
 
        End of test Mon Aug 14 11:16:37 2023

Pi 5/4 GCC8
        2 Ops/Word              32 Ops/Word
    KB  12.8     128   12800    12.8     128   12800

 1T     2.90    2.80    1.30    3.03    2.91    1.83
 2T     2.70    3.63    1.21    2.68    3.03    1.20
 4T     2.53    6.82    1.41    2.95    2.99    1.28
 8T     3.68    3.96    1.66    2.91    2.36    1.78

Pi 5 GCC 12 MP-MFLOPS2 64 Bit gcc 12 Tue Oct  3 09:52:45 2023

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800     Maximum  MFLOPS
 MFLOPS                                                   GFLOPS per MHz
 1T    10549   10320    1116   21519   21452   16879        21.5    9.0
 2T    19881   20929     982   42488   43002   14280        43.0   17.9
 4T    33400   40206     929   80947   84933   14772        84.9   35.4
 8T    33448   37854    1093   77117   85086   17371
 Results x 100000, 0 indicates ERRORS
 1T    40015   44934   98519   35186   36769   97639
 2T    40015   44934   98519   35186   36769   97639
 4T    40015   44934   98519   35186   36769   97639
 8T    40015   44934   98519   35186   36769   97639

         End of test Tue Oct  3 09:53:21 2023

Pi 5 GCC 12/8
        2 Ops/Word              32 Ops/Word
    KB  12.8     128   12800    12.8     128   12800

 1T     1.09    1.05    1.11    1.03    1.09    1.00
 2T     1.12    0.98    0.98    1.15    0.94    0.89
 4T     1.09    1.13    0.99    0.88    0.89    1.01
 8T     0.85    0.85    1.02    0.97    1.07    0.98

Double Precision Results and More Comments below


With the running times being relatively short, individual comparison ratios might not be accurate, so averages have been calculated across the single and double precision results. Pi5/Pi4 GCC 8 ratios were between 2.36 and 6.82 with cached data, averaging 3.18, then between 1.10 and 1.83 from RAM, averaging 1.42. The improved cache sizes of the Pi 5 lead to the higher ratios. Longer running stress tests provide more reliable performance indications.

GCC 12/8 averages indicated similar single precision performance, with a slight gain for the newer compiler with double precision calculations.
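
The Pi5/Pi4 GCC 8 ranges and averages quoted above can be reproduced with a short calculation. This sketch assumes they pool the single precision ratios listed earlier with the double precision ratios listed further down, and that the 12800 KB columns represent data from RAM. The variable names are illustrative.

```python
# Pi 5/4 GCC 8 ratios copied from the two ratio tables.
# Each row: 2 ops/word at 12.8, 128 and 12800 KB, then 32 ops/word likewise.
sp = [
    [2.90, 2.80, 1.30, 3.03, 2.91, 1.83],  # 1T
    [2.70, 3.63, 1.21, 2.68, 3.03, 1.20],  # 2T
    [2.53, 6.82, 1.41, 2.95, 2.99, 1.28],  # 4T
    [3.68, 3.96, 1.66, 2.91, 2.36, 1.78],  # 8T
]
dp = [
    [2.93, 2.60, 1.10, 3.10, 3.02, 1.52],  # 1T
    [2.60, 3.32, 1.25, 3.08, 2.87, 1.22],  # 2T
    [2.51, 5.29, 1.41, 3.08, 3.29, 1.33],  # 4T
    [2.61, 3.89, 1.66, 3.20, 2.50, 1.63],  # 8T
]
RAM_COLUMNS = (2, 5)  # the two 12800 KB columns

def summarise(values):
    """Return (minimum, maximum, average rounded to two decimal places)."""
    return min(values), max(values), round(sum(values) / len(values), 2)

rows = sp + dp
cached = [v for row in rows for i, v in enumerate(row) if i not in RAM_COLUMNS]
ram = [v for row in rows for i, v in enumerate(row) if i in RAM_COLUMNS]

print(summarise(cached))  # (2.36, 6.82, 3.18)
print(summarise(ram))     # (1.1, 1.83, 1.42)
```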

Pi 4 GCC 8 MP-MFLOPS 64 Bit gcc 8 Double Precision Tue May 26 12:11:50 2020

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800     Maximum  MFLOPS
 MFLOPS                                                   GFLOPS per MHz
 1T     1591    1587     269    3386    3379    3240         3.4     2.3
 2T     3228    2803     267    6728    6711    4556         6.7     4.5
 4T     5870    3284     283   12812   12866    4940        12.9     8.6
 8T     5506    4063     277   12077   11538    4695
 Results x 100000, 0 indicates ERRORS
 1T    76384   97072   99969   66065   95370   99951
 2T    76384   97072   99969   66065   95370   99951
 4T    76384   97072   99969   66065   95370   99951
 8T    76384   97072   99969   66065   95370   99951

         End of test Tue May 26 12:11:52 2020

Pi 5 GCC 8 MP-MFLOPS 64 Bit gcc 8 Double Precision Mon Aug 14 11:18:26 2023

        2 Ops/Word              32 Ops/Word
    KB  12.8     128   12800    12.8     128   12800     Maximum  MFLOPS
        MFLOPS                                            GFLOPS per MHz
 1T     4661    4127     296   10498   10217    4938        10.2     4.3
 2T     8408    9292     333   20699   19275    5579        19.3     8.0
 4T    14723   17372     399   39480   42352    6572        42.4    17.6
 8T    14387   15799     461   38706   28821    7667
 Results x 100000, 0 indicates ERRORS
 1T    76384   97072   99969   66065   95370   99951
 2T    76384   97072   99969   66065   95370   99951
 4T    76384   97072   99969   66065   95370   99951
 8T    76384   97072   99969   66065   95370   99951

         End of test Mon Aug 14 11:18:27 2023

Pi 5/4 GCC8
        2 Ops/Word              32 Ops/Word
    KB  12.8     128   12800    12.8     128   12800

 1T     2.93    2.60    1.10    3.10    3.02    1.52
 2T     2.60    3.32    1.25    3.08    2.87    1.22
 4T     2.51    5.29    1.41    3.08    3.29    1.33
 8T     2.61    3.89    1.66    3.20    2.50    1.63

 Pi 5 GCC 12 DP MP-MFLOPS2 64 Bit gcc 12 Double Precision Tue Oct  3 10:00:48 2023

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800     Maximum  MFLOPS
 MFLOPS                                                   GFLOPS per MHz
 1T     4713    4740     562   10748   10727    8440        10.7     4.5  
 2T     9355    9554     491   21389   21515    7875        21.5     9.0
 4T    17485   18403     468   41704   42464    7499        42.5    17.7
 8T    16645   18592     543   41049   41910    8596
 Results x 100000, 0 indicates ERRORS
 1T    39991   44914   98518   35119   36721   97642
 2T    39991   44914   98518   35119   36721   97642
 4T    39991   44914   98518   35119   36721   97642
 8T    39991   44914   98518   35119   36721   97642

        End of test Tue Oct  3 10:01:24 2023

Pi 5 GCC 12/8
        2 Ops/Word              32 Ops/Word
    KB  12.8     128   12800    12.8     128   12800

 1T     1.01    1.15    1.90    1.02    1.05    1.71
 2T     1.11    1.03    1.47    1.03    1.12    1.41
 4T     1.19    1.06    1.17    1.06    1.00    1.14
 8T     1.16    1.18    1.18    1.06    1.45    1.12 
  



OpenMP-MFLOPS - OpenMP-MFLOPS64g8 and g12, notOpenMP-MFLOPS64g8 and g12

This benchmark carries out the same single precision calculations as the MP-MFLOPS Benchmarks but, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive. Again, gcc 12 compilations were run for longer times, which resulted in different First Results sumchecks.

In this case, the data sizes used were 400 KB, 4 MB and 40 MB. With the Pi 5, only the first would be expected to be fully serviced from L1 or L2 caches, with a possible impact of L3 cache on the second. With the GCC 8 full OpenMP version, Pi5/Pi4 performance gains were around 3.0 times at 8 and 32 operations per word at 400 KB, with most others lower due to data size or fewer operations. At 400 KB, Pi 5 GCC 12 performance was 3.2 times faster than GCC 8 at 2 operations per word and slightly faster on the other measurements.

Maximum 4 core performance was 80.1 GFLOPS from GCC 12, at 3.73 times that for a single core, nearly the same as that for MP-MFLOPS.
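
The MFLOPS column in the listings that follow can be related to the other columns with a simple calculation. This is a sketch that assumes MFLOPS is words times operations per word times repeat passes, divided by seconds. Small differences from the listed values are expected because the seconds are printed to three decimal places.

```python
def data_kb(words, bytes_per_word=4):
    """Data size in KB for an array of 4 byte words."""
    return words * bytes_per_word / 1000

def mflops(words, ops_per_word, passes, secs):
    """Millions of floating point operations per second."""
    return words * ops_per_word * passes / secs / 1e6

# The three data sizes: 400 KB, 4 MB and 40 MB.
sizes = [data_kb(w) for w in (100000, 1000000, 10000000)]

# Pi 5 GCC 12 OpenMP, 100000 words, 32 operations per word, 50000 passes.
fastest = mflops(100000, 32, 50000, 1.997)

# Four core gain over the single core version at the same settings.
gain = round(80104 / 21479, 2)

print(sizes)
print(round(fastest))
print(gain)
```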

Pi 4 GCC 8        OpenMP MFLOPS64g8 Tue May 26 12:06:36 2020

Test            4 Byte Ops/  Repeat  Secs MFLOPS   First  All                MP/
                 Words Word  Passes              Results Same              notMP

Data in & out    100000    2    2500 0.093  5389  0.92954 Yes               1.64
Data in & out   1000000    2     250 0.795   629  0.99255 Yes               1.21
Data in & out  10000000    2      25 0.784   638  0.99925 Yes               1.00

Data in & out    100000    8    2500 0.115 17455  0.95712 Yes               3.11
Data in & out   1000000    8     250 0.798  2507  0.99552 Yes               1.16
Data in & out  10000000    8      25 0.880  2273  0.99955 Yes               0.95

Data in & out    100000   32    2500 0.332 24068  0.89022 Yes               3.54
Data in & out   1000000   32     250 0.849  9418  0.98809 Yes               1.45
Data in & out  10000000   32      25 0.933  8571  0.99880 Yes               1.31

                End of test Tue May 26 12:06:42 2020

Pi 5 GCC 8        OpenMP MFLOPS64g8 Mon Aug 14 12:08:35 2023

Test            4 Byte Ops/  Repeat  Secs MFLOPS   First  All      Pi5/4     MP/
                 Words Word  Passes              Results Same       GCC8   notMP

Data in & out    100000    2    2500 0.054  9204  0.92954 Yes       1.71    1.00
Data in & out   1000000    2     250 0.439  1140  0.99255 Yes       1.81    0.80
Data in & out  10000000    2      25 0.618   809  0.99925 Yes       1.27    1.09

Data in & out    100000    8    2500 0.038 52914  0.95712 Yes       3.03    2.92
Data in & out   1000000    8     250 0.410  4880  0.99552 Yes       1.95    0.83
Data in & out  10000000    8      25 0.664  3014  0.99955 Yes       1.33    1.00

Data in & out    100000   32    2500 0.112 71522  0.89022 Yes       2.97    3.60
Data in & out   1000000   32     250 0.424 18865  0.98809 Yes       2.00    1.07
Data in & out  10000000   32      25 0.622 12853  0.99880 Yes       1.50    0.93

                End of test Mon Aug 14 12:08:38 2023

Pi 5 GCC 12     OpenMP MFLOPSL64g12 Tue Oct  3 16:27:53 2023

Test            4 Byte Ops/  Repeat  Secs MFLOPS   First  All       Pi 5     MP/
                 Words Word  Passes              Results Same   GCC 12/8   notMP

Data in & out    100000    2   50000 0.339 29459  0.44935 Yes       3.20    3.10
Data in & out   1000000    2    5000 7.021  1424  0.86736 Yes       1.25    0.82
Data in & out  10000000    2     500 12.322  812  0.98519 Yes       1.00    0.80

Data in & out    100000    8   50000 0.634 63086  0.60398 Yes       1.19    3.46
Data in & out   1000000    8    5000 6.956  5750  0.91822 Yes       1.18    0.88
Data in & out  10000000    8     500 12.360 3236  0.99109 Yes       1.07    0.80

Data in & out    100000   32   50000 1.997 80104  0.36770 Yes       1.12    3.73
Data in & out   1000000   32    5000 6.891 23219  0.79898 Yes       1.23    1.18
Data in & out  10000000   32     500 12.294 13015  0.97639 Yes       1.01    0.79

                End of test Tue Oct  3 16:28:54 2023
 
Single Core Results below


For the single core benchmark, some Pi5/Pi4 GCC 8 comparisons were different to those above, at between 2.70 and 3.22 for the 400 KB and 4 MB tests. Maximum performance was nearly 21.5 GFLOPS, from GCC 12.

Pi 4 GCC 8      notOpenMP MFLOPS64g8 Tue May 26 12:07:34 2020

Test            4 Byte Ops/  Repeat  Secs MFLOPS   First  All
                 Words Word  Passes              Results Same

Data in & out    100000    2    2500 0.153  3278  0.92954 Yes
Data in & out   1000000    2     250 0.966   518  0.99255 Yes
Data in & out  10000000    2      25 0.782   640  0.99925 Yes

Data in & out    100000    8    2500 0.356  5612  0.95712 Yes
Data in & out   1000000    8     250 0.926  2160  0.99552 Yes
Data in & out  10000000    8      25 0.840  2381  0.99955 Yes

Data in & out    100000   32    2500 1.176  6800  0.89022 Yes
Data in & out   1000000   32     250 1.228  6515  0.98809 Yes
Data in & out  10000000   32      25 1.225  6529  0.99880 Yes

                End of test Tue May 26 12:07:42 2020

Pi 5 GCC 8     notOpenMP MFLOPS64g8 Mon Aug 14 12:04:30 2023

Test            4 Byte Ops/  Repeat  Secs MFLOPS   First  All      Pi5/4
                 Words Word  Passes              Results Same       GCC8

Data in & out    100000    2    2500 0.054  9236  0.92954 Yes       2.82
Data in & out   1000000    2     250 0.350  1429  0.99255 Yes       2.76
Data in & out  10000000    2      25 0.675   740  0.99925 Yes       1.16

Data in & out    100000    8    2500 0.111 18092  0.95712 Yes       3.22
Data in & out   1000000    8     250 0.340  5888  0.99552 Yes       2.73
Data in & out  10000000    8      25 0.666  3002  0.99955 Yes       1.26

Data in & out    100000   32    2500 0.402 19891  0.89022 Yes       2.93
Data in & out   1000000   32     250 0.456 17563  0.98809 Yes       2.70
Data in & out  10000000   32      25 0.579 13810  0.99880 Yes       2.12

                End of test Mon Aug 14 12:04:33 2023

Pi 5 GCC 12  notOpenMP MFLOPSL64g12 Tue Oct  3 16:31:00 2023

Test            4 Byte Ops/  Repeat  Secs MFLOPS   First  All       Pi 5
                 Words Word  Passes              Results Same   GCC 12/8

Data in & out    100000    2   50000 1.053  9493  0.44935 Yes       1.03
Data in & out   1000000    2    5000 5.732  1745  0.86736 Yes       1.22
Data in & out  10000000    2     500 9.859  1014  0.98519 Yes       1.37

Data in & out    100000    8   50000 2.194 18228  0.60398 Yes       1.01
Data in & out   1000000    8    5000 6.121  6535  0.91822 Yes       1.11
Data in & out  10000000    8     500 9.872  4052  0.99109 Yes       1.35

Data in & out    100000   32   50000 7.449 21479  0.36770 Yes       1.08
Data in & out   1000000   32    5000 8.121 19701  0.79898 Yes       1.12
Data in & out  10000000   32     500 9.698 16498  0.97639 Yes       1.19

                End of test Tue Oct  3 16:32:01 2023
  



OpenMP-MemSpeed Benchmarks - OpenMP-MemSpeed264g8 and g12, NotOpenMP-MemSpeed64g8 and g12

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed64). Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP.

Complete output for the Pi 4 is shown below, but just the first few results for the others. The first two lines of single core results are also included to show that the OpenMP options used were clearly unsuitable for this program.

Pi 4 GCC 8
     Memory Reading Speed Test OpenMP 64 Bit gcc 8 by Roy Longbottom

               Start of test Tue May 26 12:14:39 2020

  Memory   x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]       x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S
 1 Core
       4   15001   4010   4387  15087   4406   4400  11211   9061   9061
       8   15532   3990   4394  15567   4386   4394  11629   9315   9314
 4 Cores
       4    7749   8500   8716   7451   8520   8533  39508  18586  18589
       8    8198   8669   8874   8148   8678   8691  38972  18863  18861
      16    8023   8499   8335   7895   8355   8507  38305  19003  19004
      32    9034   8517   8619   9127   8550   8522  37928  19071  18409
      64    8652   8201   8178   8565   8223   8093  25191  17494  17508
     128   11397  11616  11715  11345  11649  11029  13861  14097  14170
     256   18242  18745  18195  17417  18605  18019  12535  12637  12623
     512   17580  18467  18787  18010  18414  18321  12900  13180  13121
    1024    8043  10172  11540  12510  10220  12082   9800   9586   9857
    2048    4816   6807   6850   6922   6805   6666   3137   3372   3369
    4096    7029   6846   6881   7017   5145   6801   2776   3124   3112
    8192    2428   7085   7124   7068   7134   6904   2571   3092   3112
   16384    7133   7152   7328   7008   3445   7178   2473   3099   3104
   32768    2656   7643   7669   7802   7616   7559   2043   3112   3104
   65536    7995   6523   2572   7059   6514   6485   2431   2955   3036
  131072    1981   7273   7327   1878   3615   7267   2538   2968   2976

                End of test Tue May 26 12:15:06 2020

Pi 5 GCC 8
     Memory Reading Speed Test OpenMP 64 Bit gcc 8 by Roy Longbottom

               Start of test Mon Aug 14 11:42:10 2023

  Memory   x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]       x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S
 1 Core
       4   50151   6872   7511  50254   7170   7181  37548  18867  25383
       8   50904   6848   7485  48915   7202   7487  38102  19038  25477
 4 Cores
       4   31324  14321  12707  28712  14606  21136  27075  18075  18075
       8   28580  13022  13365  32094  14657  21740  26558  13931  16817
      16   27074  19393  19847  32121  19067  24532  35440  24095  23527
      32   37880  31590  31455  34779  32095  29027  37245  22243  24984
      64   23823  29763  30980  30310  28829  28209  23569  27625  24428

                End of test Mon Aug 14 11:42:37 2023

Pi 5 GCC 12
     Memory Reading Speed Test OpenMP 64 Bit gcc 12 by Roy Longbottom

               Start of test Thu Sep 28 22:43:26 2023

  Memory   x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]       x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S
 1 Core
       4   54368  65257  65165  53930  60045  60975  37606  25361  25384
       8   54564  65580  65162  55228  61180  60995  37829  25015  25010
 4 Cores
       4   31314  14584  13443  31523  14625  21373  26964  17800  17883
       8   29471  14672  13405  32067  14677  21719  27561  18585  18540
      16   32013  19352  19797  32164  19549  25666  36645  25085  25423
      32   43228  38115  33331  42989  38653  39254  49341  30903  30892

                End of test Thu Sep 28 22:43:51 2023
 
Single Core Results below


Single Core Benchmark - Again, a complete output is provided, plus limited results and comparisons. As expected, the latter are similar to those from the original MemSpeed included above. Here, the maximum Pi 5/4 ratio was 13.9, at 2048 KB, reflecting L3 cache versus RAM speed.

As before, GCC 12 provided corrections for the GCC 8 fault, now indicating Pi 5 GCC 12/8 performance gains of up to 9.6 times for single precision calculations.

Pi 4 GCC 8
     Memory Reading Speed Test notOpenMP 64 Bit gcc 8 by Roy Longbottom

               Start of test Tue May 26 12:18:16 2020

  Memory    x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes    Dble    Sngl   Int32    Dble    Sngl   Int32    Dble    Sngl   Int32
    Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S

       4   15001    4010    4387   15087    4406    4400   11211    9061    9061
       8   15532    3990    4394   15567    4386    4394   11629    9315    9314
      16   15707    3998    4376   15770    4388    4393   11801    9447    9444
      32   14552    3885    4245   14761    4186    4260   11627    9488    9495
      64   12272    3855    4211   12089    4196    4220    8866    8606    8630
     128   12321    3867    4217   12148    4182    4215    8221    8296    8292
     256   12318    3871    4219   12134    4206    4219    8092    8231    8229
     512   12118    3870    4218   12195    4211    4218    8077    8209    8226
    1024    3224    3738    4032    3701    4009    4066    5387    5529    5331
    2048    1945    3474    3615    1949    3598    3612    2848    2860    2945
    4096    1940    3442    3610    1941    3406    3607    2614    2595    2597
    8192    1951    3425    3637    1954    3617    3644    2595    2581    2583
   16384    1962    3330    3467    1965    3443    3469    2588    2575    2564
   32768    2003    3364    3303    1997    3292    3303    2503    2554    2557
   65536    2005    2588    2937    2011    2930    2621    2577    2565    2566
  131072    2024    2021    2025    2013    2014    2018    2586    2572    2570

                End of test Tue May 26 12:18:42 2020

Pi 5 GCC 8
     Memory Reading Speed Test notOpenMP 64 Bit gcc 8 by Roy Longbottom

               Start of test Mon Aug 14 11:34:27 2023

  Memory  x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes   Dble    Sngl   Int32    Dble    Sngl   Int32    Dble    Sngl   Int32
    Used   MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S

       4   50151    6872    7511   50254    7170    7181   37548   18867   25383
      64   50862    6800    7423   50901    7140    7426   36297   19013   25373
     256   32627    6790    7153   32638    7183    7276   34872   19156   25339
    1024   30004    6804    7283   30354    7171    7122   23523   18525   23493
    8192    2992    6089    5571    2005    5255    6448    4794    5279    5340

                End of test Mon Aug 14 11:34:52 2023

Pi 5/4 GCC8
       4    3.34    1.71    1.71    3.33    1.63    1.63    3.35    2.08    2.80
      64    4.14    1.76    1.76    4.21    1.70    1.76    4.09    2.21    2.94
     256    2.65    1.75    1.70    2.69    1.71    1.72    4.31    2.33    3.08
    1024    9.31    1.82    1.81    8.20    1.79    1.75    4.37    3.35    4.41
    2048   12.94    1.91    1.98   13.90    1.98    2.04    6.95    5.99    4.05
    8192    1.53    1.78    1.53    1.03    1.45    1.77    1.85    2.05    2.07

Pi 5 GCC 12
     Memory Reading Speed Test notOpenMP 64 Bit gcc 12 by Roy Longbottom

               Start of test Thu Sep 28 22:42:10 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes   Dble    Sngl   Int32    Dble    Sngl   Int32    Dble    Sngl   Int32
    Used   MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S

       4   54368   65257   65165   53930   60045   60975   37606   25361   25384
      64   52501   65304   65319   53250   59544   59850   37508   25373   25401
     256   33354   63081   63764   33718   60298   60351   35597   25397   25398
    2048   22287   52312   53008   22349   50665   49230   11449   12273   16589
    8192    3087    6050    6120    3132    6038    6491    6902    6608    6778

                End of test Thu Sep 28 22:42:35 2023

Pi 5 GCC 12/8
       4    1.08    9.50    8.68    1.07    8.37    8.49    1.00    1.34    1.00
      64    1.03    9.60    8.80    1.05    8.34    8.06    1.03    1.33    1.00
     256    1.02    9.29    8.91    1.03    8.39    8.29    1.02    1.33    1.00
    2048    0.89    7.88    7.42    0.82    7.10    6.68    0.58    0.72    1.39
    8192    1.03    0.99    1.10    1.56    1.15    1.01    1.44    1.25    1.27
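
The single precision entries in the Pi 5 GCC 12/8 table can be checked against the two Pi 5 listings. The sketch below uses the Sngl column of the x[m]=x[m]+s*y[m] test for the data sizes that appear in both listings. The dictionary names are illustrative.

```python
# Sngl MB/second for x[m]=x[m]+s*y[m], keyed by KBytes used.
gcc8 = {4: 6872, 64: 6800, 256: 6790, 8192: 6089}
gcc12 = {4: 65257, 64: 65304, 256: 63081, 8192: 6050}

# GCC 12 speed divided by GCC 8 speed at each data size.
ratios = {kb: round(gcc12[kb] / gcc8[kb], 2) for kb in gcc8}
largest = max(ratios.values())

print(ratios)
print(largest)
```

The gain is large while the data is cache based and disappears at 8192 KB, where both versions are limited by RAM speed.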
  



Java Whetstone Benchmark - whetstc.class

The Java benchmarks comprise class files that were produced some time ago, but the source code is available to rebuild them. Performance can vary significantly using different Java Virtual Machines.

Pi 5 performance gains, over the Pi 4, were between 1.94 and 3.81 times.

   Pi 4 Whetstone Benchmark Java Version, May 22 2020, 14:24:09

                                                     1 Pass
   Test                 Result     MFLOPS    MOPS  millisecs

  N1 floating point  -1.124750137     521           0.0369
  N2 floating point  -1.131330490     481           0.2792
  N3 if then else     1.000000000             236   0.4378
  N4 fixed point     12.000000000            1320   0.2386
  N5 sin,cos etc.     0.499110132              48   1.7348
  N6 floating point   0.999999821     276           1.9520
  N7 assignments      3.000000000             320   0.5772
  N8 exp,sqrt etc.    0.825148463              25   1.4640

  MWIPS                              1488           6.7205

    Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
    Java Vendor         Debian, Version  11.0.7
    CPU null

   Pi 5 Whetstone Benchmark Java Version, Aug 24 2023, 23:25:17

                                                   1 Pass    Pi 5/4
  Test                  Result     MFLOPS    MOPS  millisecs

  N1 floating point  -1.124750137    1232           0.0156     2.37
  N2 floating point  -1.131330490    1048           0.1282     2.18
  N3 if then else     1.000000000             715   0.1448     3.02 
  N4 fixed point     12.000000000            2559   0.1231     1.94
  N5 sin,cos etc.     0.499110132             183   0.4550     3.81
  N6 floating point   0.999999821     554           0.9730     2.00
  N7 assignments      3.000000000             624   0.2960     1.95
  N8 exp,sqrt etc.    0.935364604              63   0.5920     2.47

  MWIPS                              3666           2.7277     2.46

 



JavaDraw Benchmark - JavaDrawPi.class

The benchmark draws simple objects, from a small number up to a rather excessive number, to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.

The first runs of this benchmark on the Pi 5 indicated that it was much slower than the Pi 4 on the more demanding functions. Some time later, I reran the benchmark on the Pi 4, using the OS acquired for the Pi 5, and that also produced the slow results. Using this OS, the Pi 5 was around twice as fast on average.

 Pi 4 Java Drawing Benchmark, May 22 2020, 14:25:15
            Produced by javac 1.8.0_222

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      833    83.26
  Display PNG Bitmap Twice Pass 2     1001   100.05
  Plus 2 SweepGradient Circles         994    99.39
  Plus 200 Random Small Circles        836    83.54
  Plus 320 Long Lines                  380    37.98
  Plus 4000 Random Small Circles        95     9.44

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
  Java Vendor         Debian, Version  11.0.7
  null, null CPUs


 Pi 4 Java Drawing Benchmark, Dec 2 2023, 10:01:16
            Produced by javac 1.8.0_222

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      469    46.86
  Display PNG Bitmap Twice Pass 2      561    56.06
  Plus 2 SweepGradient Circles         523    52.21
  Plus 200 Random Small Circles         31     2.97
  Plus 320 Long Lines                   13     1.22
  Plus 4000 Random Small Circles         2     0.18

         Total Elapsed Time  62.5 seconds

  Operating System    Linux, Arch. aarch64, Version 6.1.47-v8+
  Java Vendor         Debian, Version  17.0.8
  null, null CPUs


 Pi 5 Java Drawing Benchmark, Aug 26 2023, 15:06:26
            Produced by javac 1.8.0_222

  Test                              Frames      FPS   Pi5/Pi4

  Display PNG Bitmap Twice Pass 1     1000    99.96     2.13
  Display PNG Bitmap Twice Pass 2     1077   107.66     1.92
  Plus 2 SweepGradient Circles        1010   100.99     1.93
  Plus 200 Random Small Circles         63     6.16     2.07
  Plus 320 Long Lines                   26     2.50     2.05
  Plus 4000 Random Small Circles         4     0.32     1.78

         Total Elapsed Time  63.1 seconds

  Operating System    Linux, Arch. aarch64, Version 6.1.32-v8+
  Java Vendor         Debian, Version  17.0.8
  null, null CPUs


 Pi 5 Java Drawing Benchmark, Aug 26 2023, 15:15:27
            Produced by javac openjdk 17.0.8

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1     1014   101.33
  Display PNG Bitmap Twice Pass 2     1067   106.59
  Plus 2 SweepGradient Circles        1028   102.70
  Plus 200 Random Small Circles         61     6.04
  Plus 320 Long Lines                   25     2.47
  Plus 4000 Random Small Circles         4     0.33

         Total Elapsed Time  62.3 seconds

  Operating System    Linux, Arch. aarch64, Version 6.1.32-v8+
  Java Vendor         Debian, Version  17.0.8
  null, null CPUs
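
The effect of the OS change, and the Pi5/Pi4 column, can be rechecked with a few lines of Python. This is only a sketch using the FPS figures from the listings above; the variable and function names are my own and are not part of the benchmark.

```python
# FPS values copied from the listings above, in test order.
tests = ["PNG pass 1", "PNG pass 2", "2 gradient circles",
         "200 small circles", "320 long lines", "4000 small circles"]
pi4_old_os = [83.26, 100.05, 99.39, 83.54, 37.98, 9.44]   # Pi 4, May 2020 run
pi4_new_os = [46.86, 56.06, 52.21, 2.97, 1.22, 0.18]      # Pi 4, Dec 2023 run
pi5_new_os = [99.96, 107.66, 100.99, 6.16, 2.50, 0.32]    # Pi 5, first Aug 2023 run

def ratios(numerators, denominators):
    """Element-wise speed ratios between two runs of the same tests."""
    return [n / d for n, d in zip(numerators, denominators)]

pi5_gain = ratios(pi5_new_os, pi4_new_os)      # Pi 5 / Pi 4 on the newer OS
os_slowdown = ratios(pi4_old_os, pi4_new_os)   # Pi 4 older OS / Pi 4 newer OS
average_gain = sum(pi5_gain) / len(pi5_gain)

for name, gain, slow in zip(tests, pi5_gain, os_slowdown):
    print(f"{name:20s} Pi5/Pi4 {gain:5.2f}   Pi 4 old/new OS {slow:6.2f}")
print(f"Average Pi5/Pi4 gain {average_gain:.2f}")
```

The second column of output shows how much the Pi 4 slowed down on the three most demanding tests when moved to the newer OS.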
  



64 Bit OpenGL Benchmark - videogl64C10, videogl64C12

In 2012, I approved a request from a Quality Engineer at Canonical to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark or, using selected functions, as a stress test of any duration.

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file format. The first command is needed to avoid VSYNC, allowing FPS to be greater than 60.

  export vblank_mode=0
  ./videogl64Cxx Width 320, Height 240, NoEnd
  ./videogl64Cxx Width 640, Height 480, NoHeading, NoEnd
  ./videogl64Cxx Width 1024, Height 768, NoHeading, NoEnd
  ./videogl64Cxx Width 1920, Height 1080, NoHeading
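
For anyone who prefers to drive the runs from Python, the same sequence might be generated as in the sketch below. The helper names are hypothetical, the default binary name assumes the gcc 12 variety, and the comments on NoHeading and NoEnd describe their apparent purpose in the script rather than documented behaviour.

```python
import os

# Window sizes taken from the script above.
RESOLUTIONS = [(320, 240), (640, 480), (1024, 768), (1920, 1080)]

def build_commands(binary="./videogl64C12"):
    """Return one command string per window size, mirroring the script."""
    commands = []
    last = len(RESOLUTIONS) - 1
    for index, (width, height) in enumerate(RESOLUTIONS):
        options = [f"Width {width}", f"Height {height}"]
        if index > 0:
            options.append("NoHeading")   # assumed: only the first run prints headings
        if index < last:
            options.append("NoEnd")       # assumed: only the last run prints the end line
        commands.append(f"{binary} " + ", ".join(options))
    return commands

def benchmark_environment():
    """Copy the current environment and add the VSYNC override."""
    env = dict(os.environ)
    env["vblank_mode"] = "0"
    return env

for command in build_commands():
    print(command)
```

On a system where the binary exists, each string could then be passed to `subprocess.run(command.split(), env=benchmark_environment())`.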
  
The original benchmark was compiled using freeglut3 but, more recently, this was not available for new systems. The gcc12 version was compiled without this but will not run on my Pi 4. Similarly, the gcc10 program is incompatible with the Pi 5.

Performance comparisons indicate that the Pi 5 was between 2.9 and 5.2 times faster than the Pi 4, with an average of 4.0 times over the 24 measurements. When the GLUT variety was recompiled on the Pi 4, using GCC 12, the average Pi 5 gain fell to 2.5 times.

 Pi 4 gcc 10
 GLUT OpenGL Benchmark 64 GCC 10, Wed Sep 20 00:48:11 2023

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    727.7    413.0    219.7    131.9     42.8     28.9
   640   480    388.6    281.9    189.2    118.0     42.5     28.1
  1024   768    144.0    141.2    129.8     96.9     41.6     26.8
  1920  1080     54.1     50.2     52.7     56.7     38.4     23.9

                   End at Wed Sep 20 00:50:26 2023

 Pi 5 gcc 12
 GLUT OpenGL Benchmark 64 Bit GCC 12, Thu Oct 26 14:52:15 2023

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240   3184.7   1554.8    894.7    474.2    224.0    120.0
   640   480   1441.4    956.8    819.1    442.2    220.4    116.7
  1024   768    624.6    493.7    474.7    364.0    199.1    106.4
  1920  1080    221.4    198.6    194.4    165.8    137.9     87.6

                   End at Thu Oct 26 14:54:28 2023

 Pi 5/4 Comparison
 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240      4.4      3.8      4.1      3.6      5.2      4.2
   640   480      3.7      3.4      4.3      3.7      5.2      4.2
  1024   768      4.3      3.5      3.7      3.8      4.8      4.0
  1920  1080      4.1      4.0      3.7      2.9      3.6      3.7

 #####################################################################
 Pi 4 gcc 12
 GLUT OpenGL Benchmark 64 Bit GCC 12, Sat Dec  2 11:35:48 2023

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240   1137.1    517.1    308.3    159.7     93.5     49.6
   640   480    579.0    356.8    283.9    150.5     92.5     48.7
  1024   768    239.5    200.9    203.4    134.7     84.9     45.3
  2032  1080     92.8     74.3     93.6     81.1     75.2     37.6

                   End at Sat Dec  2 11:38:02 2023
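
The quoted ranges and averages can be reproduced from the three FPS tables above. The following Python sketch is a check on the arithmetic only, with names of my own choosing. Note that the last Pi 4 GCC 12 row is printed as 2032 x 1080 and is compared here with the Pi 5 1920 x 1080 row.

```python
# FPS tables copied from the listings above: rows are window sizes,
# columns are the six tests in printed order.
pi4_gcc10 = [
    [727.7, 413.0, 219.7, 131.9, 42.8, 28.9],
    [388.6, 281.9, 189.2, 118.0, 42.5, 28.1],
    [144.0, 141.2, 129.8, 96.9, 41.6, 26.8],
    [54.1, 50.2, 52.7, 56.7, 38.4, 23.9],
]
pi4_gcc12 = [
    [1137.1, 517.1, 308.3, 159.7, 93.5, 49.6],
    [579.0, 356.8, 283.9, 150.5, 92.5, 48.7],
    [239.5, 200.9, 203.4, 134.7, 84.9, 45.3],
    [92.8, 74.3, 93.6, 81.1, 75.2, 37.6],
]
pi5_gcc12 = [
    [3184.7, 1554.8, 894.7, 474.2, 224.0, 120.0],
    [1441.4, 956.8, 819.1, 442.2, 220.4, 116.7],
    [624.6, 493.7, 474.7, 364.0, 199.1, 106.4],
    [221.4, 198.6, 194.4, 165.8, 137.9, 87.6],
]

def gain_table(new, old):
    """Cell-by-cell FPS ratio of one table over another."""
    return [[n / o for n, o in zip(new_row, old_row)]
            for new_row, old_row in zip(new, old)]

def summarise(table):
    """Return (minimum, maximum, average) over all cells."""
    flat = [value for row in table for value in row]
    return min(flat), max(flat), sum(flat) / len(flat)

for label, old in (("Pi 4 gcc 10", pi4_gcc10), ("Pi 4 gcc 12", pi4_gcc12)):
    low, high, average = summarise(gain_table(pi5_gcc12, old))
    print(f"Pi 5 gcc 12 / {label}: min {low:.1f} max {high:.1f} average {average:.1f}")
```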
  



DriveSpeed and LanSpeed I/O Benchmarks

Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and WiFi network connections. The programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB and, finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of large files and the file path. The same program code is used for both varieties, the only difference being file opening properties. The drive benchmark includes extra options to use direct I/O, avoiding data caching in main memory, but also includes an extra test with caching allowed.
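
The principle of the large file tests can be illustrated with a minimal Python sketch. This is not the benchmark code, which is a compiled program. The function name, the 64 KB block size and the use of `os.fsync` are assumptions made for illustration. With a small file like this, the read will almost certainly come from the memory cache, which is the effect the direct I/O options are intended to avoid.

```python
import os
import tempfile
import time

def timed_write_read(path, size_mb, block_kb=64):
    """Write then read one file, returning (write, read) speeds in MB/second."""
    block = bytes(block_kb * 1024)
    blocks = size_mb * 1024 // block_kb

    start = time.perf_counter()
    with open(path, "wb") as handle:
        for _ in range(blocks):
            handle.write(block)
        handle.flush()
        os.fsync(handle.fileno())   # ask the OS to push the data to the device
    write_secs = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as handle:
        while handle.read(block_kb * 1024):
            pass
    read_secs = time.perf_counter() - start
    return size_mb / write_secs, size_mb / read_secs

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "speedtest.bin")
    write_speed, read_speed = timed_write_read(target, size_mb=8)
    print(f"8 MB  write {write_speed:.2f} MB/s  read {read_speed:.2f} MB/s")
```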

As found during previous tests on 64 bit systems when accessing the system SD card, DriveSpeed with Direct I/O failed, indicating Error writing file. Later it was established that this also applied to external drives with Ext type formats, but the benchmark operated correctly on drives formatted as FAT32. A limitation of the latter (at 64 bits) is that file sizes must be less than 4096 MB.
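
The FAT32 limit follows from its maximum file size of 4 GiB minus one byte. The short Python check below assumes that the benchmark counts a MB as 1024 x 1024 bytes, which is not stated here; the function name is my own.

```python
FAT32_MAX_FILE_BYTES = 2**32 - 1   # 4 GiB minus one byte
MB = 1024 * 1024                   # assumption: benchmark MB is 2**20 bytes

def fits_on_fat32(size_mb):
    """True if a file of size_mb megabytes is within the FAT32 limit."""
    return size_mb * MB <= FAT32_MAX_FILE_BYTES

for size in (2048, 4094, 4095, 4096):
    print(size, "MB", "fits" if fits_on_fat32(size) else "too large")
```

On that assumption, 4096 MB is exactly one byte over the limit, so any smaller whole number of MB is acceptable.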

The best option for measuring 64 bit performance, using these benchmarks, is to run LanSpeed, specifying large files that cannot be cached for reading. However, random and small file reading functions are likely to be accessing cached data.


DriveSpeed Benchmark FAT32 - DriveSpeed64v2g8 and g12

The first of the following results are for Pi 4 and Pi 5, both with 8 GB RAM, exercising the same high speed flash drive via USB 3, using 1 GB and 2 GB files.

Average Pi 5 gains were around 1.5 times for writing and reading large files, somewhat less writing to cache and nearly 4 times reading from cache, representing RAM speed. The Pi 5 results indicated a slower speed on random reading, then much faster speeds on reading small files, where more of the data appears to have been cached.

As during the Pi 4 tests, a starting large file parameter of 2048 MB failed to execute the second part at 4096 MB. The results below include a successful run at 4094 MB.

Pi 4   DriveSpeed RasPi 64 Bit gcc 8 Wed May 27 11:43:43 2020
 
 Selected File Path: /media/pi/PATRIOT1/
 Total MB  120832, Free MB  114614, Used MB    6218

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024    27.78    21.39    21.43   270.32   278.81   274.98
2048    21.40    21.14    21.44   275.79   273.14   319.95
 Cached
   8    40.27    42.81    42.81  1206.64  1068.72  1031.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.004    0.184     4.33     4.00     4.04

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.14   261.45    11.19    84.39
 ms/file   119.60   119.05   119.64     0.02     0.73     0.19    2.477


Pi 5   DriveSpeed RasPi 64 Bit gcc 8 Mon Sep  4 16:50:50 2023
 
 Selected File Path: 
 /media/roy/PATRIOT/test/
 Total MB  120832, Free MB  113866, Used MB    6966

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024    30.89    31.14    38.40   349.35   376.91   375.03
2048    42.62    42.11    34.53   377.20   378.08   375.97
 Cached
   8    50.11    52.44    53.78  2327.93  4688.75  6184.63

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.005    0.005    0.233    13.34    12.74    13.10

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.13   386.06   667.63   950.87
 ms/file   123.74   124.04   123.19     0.01     0.01     0.02    3.234


 Pi 5 at 4094 MB
                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

4094    42.74    38.90    45.55   372.93   349.44   376.49 
 




Performance Monitor - The following provides vmstat examples handling large files, confirming the benchmark large file data transfer speeds and that the data was actually written to and read from the drive at the benchmark reported time.

Pi 5 VMSTAT Writing and Reading Large Files - volumes in kB, speeds in kB/second
     %CPU utilisation us + sy, 100% means 4 cores being used 

procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 7260884  22404 399188    0    0  1121  1288  179  284  1  1 93  5  0
 1  1      0 7260884  22404 399188    0    0     0 40005 3082 6308  0  4 74 23  0
 1  1      0 7260884  22404 399188    0    0     0 41030 3651 6074  0  3 74 23  0
 1  1      0 7260884  22404 399188    0    0     0 43080 3839 6375  0  3 75 22  0
 1  1      0 7260884  22404 399188    0    0     0 41033 3807 6275  0  3 74 22  0

 1  1      0 7260884  22404 399188    0    0 355824     0 3879 9207  1  9 73 17  0
 1  1      0 7260884  22404 399188    0    0 355320     0 2824 7807  1  9 73 17  0
 1  1      0 7260884  22404 399188    0    0 364544     0 2728 5560  1  9 72 17  0
 1  1      0 7260884  22404 399188    0    0 364540     0 4022 5513  0  8 73 18  0
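
To compare these samples with the benchmark figures, the bo and bi columns can be averaged and converted to MB/second. The Python sketch below assumes that vmstat counts in units of 1024 bytes and, as the benchmark's definition of a MB is not given here, shows both the decimal and binary conversions. The first vmstat row is left out, as it appears to be the usual average since boot.

```python
# Rows copied from the vmstat listing above (first row omitted).
WRITE_ROWS = """
 1  1      0 7260884  22404 399188    0    0     0 40005 3082 6308  0  4 74 23  0
 1  1      0 7260884  22404 399188    0    0     0 41030 3651 6074  0  3 74 23  0
 1  1      0 7260884  22404 399188    0    0     0 43080 3839 6375  0  3 75 22  0
 1  1      0 7260884  22404 399188    0    0     0 41033 3807 6275  0  3 74 22  0
"""
READ_ROWS = """
 1  1      0 7260884  22404 399188    0    0 355824     0 3879 9207  1  9 73 17  0
 1  1      0 7260884  22404 399188    0    0 355320     0 2824 7807  1  9 73 17  0
 1  1      0 7260884  22404 399188    0    0 364544     0 2728 5560  1  9 72 17  0
 1  1      0 7260884  22404 399188    0    0 364540     0 4022 5513  0  8 73 18  0
"""
BI, BO = 8, 9   # zero-based positions of the bi and bo columns

def mean_column(rows, index):
    """Average one whitespace-separated column over all rows."""
    values = [int(line.split()[index]) for line in rows.strip().splitlines()]
    return sum(values) / len(values)

def to_mb_per_sec(kb_per_sec, bytes_per_mb=1_000_000):
    """Convert vmstat units of 1024 bytes/second to MB/second."""
    return kb_per_sec * 1024 / bytes_per_mb

write_kb = mean_column(WRITE_ROWS, BO)
read_kb = mean_column(READ_ROWS, BI)
for label, kb in (("Write", write_kb), ("Read", read_kb)):
    print(f"{label}: {to_mb_per_sec(kb):.1f} MB/s decimal, "
          f"{to_mb_per_sec(kb, 1024 * 1024):.1f} MB/s binary")
```

Either conversion gives figures close to the benchmark's reported large file writing and reading speeds.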
  








Pi 5 LanSpeed Benchmark - LanSpeedt64g8 and g12 - Wired LAN and WiFi

As indicated above, this benchmark is effectively the same as DriveSpeed, but with Direct I/O not specified. Following are data transfer speeds to a PC via gigabit LAN, 2.4 GHz WiFi and 5 GHz WiFi, plus measurements from a Pi 400 to confirm the same performance levels.

The parameter for large file sizes was intended to be large enough to avoid local caching, and some runs used data sizes of 4 GB or, in one case, 16 GB. Random access tests use small files that are clearly cached for reading. The many small files used could involve some caching, but the results indicate some consistency.

                                 MBytes/Second To PC
                MB   Write1   Write2   Write3    Read1    Read2    Read3

WiFi 2.4GHz   1024     5.27     5.56     5.69     6.16     5.92     5.72
WiFi 5GHz     1024    11.47    11.85    12.83    11.86    11.12    11.31
LAN 1Gbps 1  16384    55.25    51.88    54.17   114.38   116.13   114.81
LAN 1Gbps 2   4096    53.83    49.33    54.38   113.70   109.48   113.51
LAN Pi 400    4096    62.19    62.11    61.27   102.43   104.56   102.60

                          Milliseconds To PC
Random          Read                       Write
From MB         4        8       16        4        8       16

WiFi 2.4GHz  0.002    0.002    0.002     8.48     8.15     7.79
WiFi 5GHz    0.002    0.002    0.002    14.52    21.38    21.96  
LAN 1Gbps 1  0.002    0.002    0.002     5.04     1.45     0.98
LAN 1Gbps 2  0.002    0.002    0.002     1.71     1.37     1.38
LAN Pi 400   0.005    0.005    0.005     1.43     1.13     1.18

                          MBytes/Second To PC                           
 200 Files         Write                      Read             
 File KB           4        8       16        4        8       16

WiFi 2.4GHz     0.33     0.62     0.92     0.52     0.66     1.21
WiFi 5GHz       0.11     0.16     0.34     0.14     0.83     0.52
LAN 1Gbps 1     1.43     2.39     3.13     4.06     8.28    15.30
LAN 1Gbps 2     1.59     1.53     4.80     4.41     7.78    16.67
LAN Pi 400      0.68     2.46     3.55     3.91     6.17    12.45
 




Raspberry Pi Performance Monitor - The first example below is for VMSTAT, which does not include network data transfer speeds. This is for the first part of the LAN 2 test, writing and reading three 2048 MB files. This ends up using most of the 8 GB RAM as a cache, where data appears to be read from the network. CPU utilisation was mainly low, but the maximum of 14% is for 4 cores, equivalent to 56% of one core (if you want to calculate CPU time).

PC Performance Monitor - In some cases network data transfer speeds could be confirmed on the Windows PC, using Task Manager Performance display and Perfmon detailed tables. However, this became confusing due to deferred writing to the PC disk, with overlapped reading. Also, Perfmon data collector could not keep up with the volume of data, missing output in time slots and indicating unobtainable speeds in a following slot. Also, transferring the largest files could produce a complete overload of the PC, with a dead keyboard. An example of Perfmon results is provided below.

The PC had a four core 3 GHz CPU, running under Windows 7. The statistics show significant time waiting for I/O and utilisation of up to all four cores. The second example shows network traffic, disk drive data transfers and CPU utilisation, where a 25% recording represents 100% of one core.

The important consideration for the Pi 5 is confirmation of the data transfer speeds measured by the benchmark. Beyond that, on reading, no PC disk activity was indicated, the data being supplied from the PC's RAM based cache, whereas writing involved saving to disk, which might have reduced measured speed. In the bigger picture, it seemed that not all data had been written to disk when reading began.

LAN 1Gbps 2 VMSTAT initial part writing and reading three 2048 MB files.

procs -----------memory--------- ---swap-- ----io----  -system-- ------cpu-----
r  b  swpd    free   buff  cache   si   so   bi   bo    in   cs  us sy id wa st
Power On
1  0     0 7096944  29968 646800    0    0 4147 1026   859  1470  8  6 74 13  0
Write
1  0     0 1613712  32944 6076752   0    0  203   51  1406  1245  1  2 89  8  0
2  1     0 1352208  32944 6339728   0    0    0    0  3962  3469  0  2 75 23  0
3  0     0   58304   4192 7665904   0    0  175   44  1311  1122  1  2 90  7  0
Read
1  1     0 2727744   944  5000080   0    0  152   38  2153  1921  1  3 87  9  0
3  0     0 1480192   960  6244480   0    0    0    0 38445 42406  0 10 65 25  0
1  2     0  347872   960  7377648   0    0 1472   28 39595 42997  1 13 60 26  0
Write
2  1     0   52176  2688  7674272   0    0  148   37  2458  2198  1  3 87  9  0
1  1     0   94448  2688  7635744   0    0  148   37  2519  2253  1  3 87  9  0
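
The conversion between vmstat percentages and single core utilisation, mentioned above, is simple arithmetic. The Python sketch below uses the us and sy values from the Write and Read rows of the listing. The function names and the 10 second interval in the last line are hypothetical, chosen only to show how CPU time might be estimated.

```python
CORES = 4

# (us, sy) percentages from the Write and Read rows above.
samples = [(1, 2), (0, 2), (1, 2), (1, 3), (0, 10), (1, 13), (1, 3), (1, 3)]

def one_core_percent(percent_all_cores, cores=CORES):
    """vmstat reports a share of all cores; scale to percent of one core."""
    return percent_all_cores * cores

def cpu_seconds(percent_all_cores, elapsed_seconds, cores=CORES):
    """Estimated CPU time used over an interval at a given utilisation."""
    return percent_all_cores / 100 * cores * elapsed_seconds

peak = max(us + sy for us, sy in samples)
print(f"Peak us + sy {peak}% of 4 cores = {one_core_percent(peak)}% of one core")
print(f"CPU seconds over 10 s at peak: {cpu_seconds(peak, 10):.1f}")
```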

 ##############################################################################
 
 PC Perfmon

         Comms               Disk
         Mbytes/second       Mbytes/second   %CPU
 Second  Received    Sent    Read  Written

     11       50       0       0      90      49
     12       49       0       0       0      47
     13       50       0       0      88      55
     14       49       0       0       0      46
     15       49       0       0      89      45
    To     
     45       37       0       0       0      36
     46     1461       4       0      99      34
     82        3       0       0      40      49
     83       79       0       0      41      56
     86      178       0       0      58      90
     94        0       5       0      43      85

     95        1     122       2      64      42
     96        1     120       1       1      36
     97        1     122       0      56      32
     98        1     121       0       0      35
     99        1     120       0      49      31
  



LanSpeed Benchmark - Pi 5 USB Drives and Operating System SD Card

In most cases, as Direct I/O was not supported, LanSpeed was executed using large files that avoid caching.

These tests were run to confirm that the hardware could support 64 bit type file sizes and to show any major differences. It was found that 4096 MB could not be supported using FAT32 format, but a smaller size, such as the 2048 MB used for the FAT32 results below, was fine. Also, at 2048 MB, the 8 GB RAM might cache all the data.

                                   MBytes/Second
                MB   Write1   Write2   Write3    Read1    Read2    Read3

USB3 HD FAT1  2048    98.07    80.66    74.72   306.43  9209.88  8687.44
USB3 HD Ext2  4096   158.98    28.25   113.34    38.47   143.80   114.56
USB3 HD Ext3  4096   122.73    26.33    61.23    48.78   122.24   109.04
USB3 HD Ext4  4096   164.59    81.99    19.61   103.72   143.48   120.17

Pi 5 SD       4096    27.95    20.58    19.20    43.45   104.53    92.26
SD USB boot   2048    52.82    20.68    20.41 10305.38 11463.08 11496.93
              4096    30.06    20.52    20.60    42.12   260.46    97.04

                            Milliseconds
Random          Read                       Write
From MB         4        8       16        4        8       16

USB3 HD FAT1   N/A as failed to write 4096 MB
USB3 HD Ext2 0.002    0.002    0.002    44.90    15.38    16.10
USB3 HD Ext3 0.002    0.002    0.002    54.50    40.68    45.18
USB3 HD Ext4 0.002    0.002    0.002    52.50    45.27    51.93

Pi 5 SD      0.002    0.002    0.002     3.96     3.60     3.68
SD USB boot  0.002    0.002    0.002     6.83     4.24     3.90


 
                              MBytes/Second                             
 200 Files         Write                      Read             
 File KB           4        8       16        4        8       16

USB3 HD FAT1     N/A
USB3 HD Ext2  141.38    37.47    63.37   587.85   592.36   834.73
USB3 HD Ext3   64.24    21.61    35.24   310.16   601.22   927.89
USB3 HD Ext4  129.74    55.08   104.42   423.15   473.34   465.93

Pi 5 SD        78.41    95.12   194.19   554.82   732.07  1189.95
SD USB boot   106.88   121.88   309.35   596.63   789.24  1504.37
 



New Benchmark More Files - LANSpeed64Long

Having encountered VMSTAT performance monitoring problems on running my LANSpeed program, I found that my original Linux version, LANSpeed64Long, avoided this, when compiled for the Raspberry Pi. This writes and reads five large files, followed by other tests, including some for random access and handling numerous small files. As with the earlier program, measured performance can be influenced by caching, sometimes in an unexpected way. Using extra large files helps to avoid the latter. Following is an example of results and sample details from VMSTAT system monitor.

 Current Directory Path: 
 /home/???????
 Total MB  119699, Free MB  102167, Used MB   17531

 Linux LAN Speed Test 64-Bit Version 1.2, Wed Sep 20 13:38:14 2023


  4096 MB File         1          2          3          4          5
 Writing MB/sec      35.46      35.54      35.53      35.49      35.61
 Reading MB/sec     198.94     153.10      92.52      92.67      92.66

 Running Time Too Long At 793 Seconds - No More File Sizes
 ---------------------------------------------------------------------
 8 MB Cached File      1          2          3          4          5
 Writing MB/sec     895.98     859.22     817.44     770.10    1032.07
 Reading MB/sec    3337.54    6467.72    6574.06    6768.83    6643.57

 ---------------------------------------------------------------------
 Bus Speed Block KB     64        128        256        512       1024
 Reading MB/sec   13574.63   15329.45   16213.07   14365.65    9021.80

 ---------------------------------------------------------------------
 1 KB Blocks File MB >    2      4      8     16     32     64    128
 Random Read  msecs    0.40   0.44   0.45   0.45   0.45   0.45   0.45
 Random Write msecs    4.50   4.63   4.60   4.64   4.58   4.68   4.58

 ---------------------------------------------------------------------
 500 Files   Write             Read             Delete
 File KB     MB/sec  ms/File   MB/sec  ms/File  Seconds
       2       0.42     4.85   357.91     0.01    0.012
       4       0.82     5.01   636.20     0.01    0.012
       8       1.64     5.00  1224.07     0.01    0.013
      16       2.91     5.62  1288.33     0.01    0.033
      32       5.51     5.94  2573.57     0.01    0.014
      64       9.22     7.11  4727.86     0.01    0.015
     128      15.04     8.72  5015.65     0.03    0.019
     256      22.87    11.46  5514.21     0.05    0.024
     512      30.27    17.32  6487.64     0.08    0.061
    1024      34.50    30.39  5629.98     0.19    0.054
    2048      36.80    56.99 11498.58     0.18    0.087

VMSTAT Samples Large Files

procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
Before Start 
 1  0      0 6245248  54480 1069568   0    0     0     0  199  275  0  0 100  0  0
Write
 1  1      0   41088  76480 7254656    0    0     0 34584  714 1313  0  2 75 23  0
 1  1      0   41088  76480 7254656    0    0    16 35656 2310 4149  0  2 73 24  0
 1  1      0   41088  76480 7254656    0    0     0 36656 1830 3219  1  3 72 23  0
 1  1      0   41088  76480 7254656    0    0    16 34584 2012 3287  6  4 68 22  0
Read
 1  1      0   59568  76624 7238688    0    0 90112     0  812 1778  1  1 75 24  0
 1  1      0   59568  76624 7238688    0    0 90112     0  738 1661  1  2 74 24  0
 1  1      0   59568  76624 7238688    0    0 90624     0  667 1524  0  1 75 24  0
 1  1      0   59568  76624 7238688    0    0 90112     0  559 1479  0  1 75 24  0
  


New Benchmark Large Files

These mainly involved 4096 MB files with smaller ones limited by FAT formatting, available free space or slower WiFi. Approximate vmstat reported performance is also shown. This helps to highlight benchmark results affected by caching.

The first benchmark results were for boot drives, including SD cards, flash drives and hard disk drives, with some from a USB card reader and a USB hub. The other results are for LAN, WiFi and an attached USB flash drive, booted from the SD card. The main use is to demonstrate variations in performance.

Boot Drive               File        1       2       3       4       5    VMSTAT
                                                                          MB/sec
32 GB SD        Writing MB/sec     17.31   17.59   17.69   17.64   17.52      17
3072 MB File    Reading MB/sec    106.05 8253.16  103.94   90.49   90.38      90

128 GB SD       Writing MB/sec     35.46   35.54   35.53   35.49   35.61      36
                Reading MB/sec    198.94   153.1   92.52   92.67   92.66      90

128 GB SD USB   Writing MB/sec     39.04   38.86   39.14   38.98   38.98      39
                Reading MB/sec    132.76   297.8   97.62   97.54   97.12

32 GB Flash     Writing MB/sec     45.32   51.26   45.14   39.56   40.95      37
SanDisk         Reading MB/sec     347.2  764.03  263.08  259.51  256.98     250

128 GB Flash    Writing MB/sec     65.18   59.06   55.93   51.48   44.54  20to70
PATRIOT         Reading MB/sec    529.24  880.72  283.78  358.71  357.57     350

Disk USB        Writing MB/sec     19.00   20.76   21.03   19.03   16.37      20
                Reading MB/sec    187.19  390.54  115.75  103.51   91.63     125

Disk USB HUB    Writing MB/sec     19.36   20.97   19.67   14.24   18.25      20
                Reading MB/sec    206.35  221.78   86.34  111.81  104.16     120

SD Booted

GB LAN          Writing MB/sec     36.31   36.92   36.69   36.94   37.18     N/A
                Reading MB/sec    113.61   112.8  113.33  113.87  114.18

5 GHz WiFi      256 MB File            1       2       3       4       5
                Writing MB/sec     24.82   19.87   17.58   24.74    19.8     N/A
                Reading MB/sec     12.13   11.47   11.53   11.67    9.18

USB Drive FAT32 Writing MB/sec     30.21   30.01   30.06   30.18   30.16      29
3072 MB File    Reading MB/sec    304.19  9936.6  343.77  311.99  309.92     290

USB Drive Ext3  Writing MB/sec Cannot open data file for writing
Use sudo        Writing MB/sec     30.56   30.35   30.39   30.37   30.23      30
                Reading MB/sec    385.17  877.37  311.63  303.94  303.83
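
One way of using the VMSTAT column to highlight cached results is to flag any file whose reported reading speed is well above the vmstat rate. The Python sketch below does this for a selection of rows from the table above. The 1.5 times threshold is an arbitrary assumption of mine, not a figure from the tests.

```python
# Reading MB/sec for files 1 to 5 and the approximate vmstat MB/sec,
# copied from the table above.
readings = {
    "32 GB SD": ([106.05, 8253.16, 103.94, 90.49, 90.38], 90),
    "128 GB SD": ([198.94, 153.1, 92.52, 92.67, 92.66], 90),
    "32 GB Flash": ([347.2, 764.03, 263.08, 259.51, 256.98], 250),
    "128 GB Flash": ([529.24, 880.72, 283.78, 358.71, 357.57], 350),
    "USB Drive FAT32": ([304.19, 9936.6, 343.77, 311.99, 309.92], 290),
}
THRESHOLD = 1.5   # hypothetical cut-off

def likely_cached(speeds, vmstat_rate, threshold=THRESHOLD):
    """Return 1-based file numbers whose speed exceeds threshold x vmstat rate."""
    return [number for number, speed in enumerate(speeds, start=1)
            if speed > threshold * vmstat_rate]

for device, (speeds, vmstat_rate) in readings.items():
    print(f"{device:16s} files probably read from cache: "
          f"{likely_cached(speeds, vmstat_rate)}")
```

In these examples it is the first or second file that is flagged, with later files settling at around the vmstat rate.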
  



New Benchmark Small Files, Booting Time, Volts and Amps

Performance measurements are for writing and reading small files and for random access, again demonstrating wide variations. Variations are also seen in measured booting time (from inserting the power plug to the full display, including warnings). One of the flash drives was particularly slow at 97 seconds. This drive had also produced unusually slow results during earlier tests.

I have two meters that measure USB voltage and current. One was connected to measure power in and the other USB 3 power out. The main power supply voltage did not appear to vary much, during these tests, and current was well within the 3 available Amps. The disk drive produced the most impact, falling to below 5 volts when connected by a USB hub. Even then, the benchmark ran successfully to the end.

500 Files Write MB/sec

         32 GB 128 GB 128 GB  32 GB 128 GB   Disk   Disk   Gbps  5 GHz  FAT32   Ext3
            SD     SD     SD USB Dr USB Dr    USB    USB    LAN   WiFi    USB    USB
File KB  Board  Board    USB                         HUB

      2   0.38   0.42   0.45   0.42   0.02   0.05   0.05   0.65   0.11   0.02   0.36
      4   0.74   0.82   0.90   0.68   0.19   0.15   0.09   1.11   0.38   0.04   0.63
      8   1.61   1.64   1.75   2.04   0.15   0.30   0.19   1.93   0.93   0.08   1.42
     16   2.74   2.91   3.11   2.67   0.95   0.46   0.40   4.24   1.77   0.15   2.89
     32   3.22   5.51   5.92   4.58   1.12   0.83   0.81   7.06   3.27   0.30   5.51
     64   8.06   9.22   9.88   8.92   4.66   1.64   1.58  12.41   5.71   0.60   8.45
    128   9.48  15.04  16.17  10.08   4.24   3.21   3.11  17.79   8.14   1.18  13.01
    256  12.46  22.87  24.02  14.43  12.69   6.35   6.03  23.18  11.43   2.29  18.55
    512  15.43  30.27  31.96  20.40  21.03  11.42  11.33  27.59  13.07   4.28  23.51
   1024  16.31  34.50  38.04  32.05  36.48  17.08  16.03  33.55          7.60  27.54
   2048  18.15  36.80  41.70  47.85  46.68  28.00  27.30  35.39         12.35  30.07

Random Access millisecs V = Variable
Read      0.47   0.45   0.61   0.45   0.44V  1.10V  1.52   0.67V 18.77   0.40   0.38
Write     3.20   4.60   4.65V  1.89  16.55V 43.33V 48.80   2.08V 16.23   2.77   4.80

Boot Secs   21     21     30     21     97     46     44    N/A    N/A    N/A    N/A

Power Volts and Amps
Main V    5.20   5.28   5.21   5.24   5.20   5.18   5.21   5.16   5.18   5.18   5.17
Main A    0.87   0.92   1.13   1.09   0.98   1.21   1.52   1.10   0.85   0.91   0.93
USB V      N/A    N/A   5.11   5.12   5.10   5.04   4.97    N/A    N/A   5.11   5.11
USB A      N/A    N/A   0.28   0.24   0.14   0.44   0.83    N/A    N/A   0.14   0.14
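
The main supply readings can be converted to power and checked against the 3 Amps available, as in the Python sketch below. The column labels are as I read them from the table header, and the names used are my own.

```python
columns = ["32 GB SD", "128 GB SD", "128 GB SD USB", "32 GB USB Dr",
           "128 GB USB Dr", "Disk USB", "Disk USB HUB", "Gbps LAN",
           "5 GHz WiFi", "FAT32 USB", "Ext3 USB"]
main_volts = [5.20, 5.28, 5.21, 5.24, 5.20, 5.18, 5.21, 5.16, 5.18, 5.18, 5.17]
main_amps = [0.87, 0.92, 1.13, 1.09, 0.98, 1.21, 1.52, 1.10, 0.85, 0.91, 0.93]
SUPPLY_AMPS = 3.0   # rating of the power supply used for these tests

# Power in watts is volts multiplied by amps.
watts = [volts * amps for volts, amps in zip(main_volts, main_amps)]
peak = max(range(len(watts)), key=watts.__getitem__)

for name, power in zip(columns, watts):
    print(f"{name:14s} {power:5.2f} W")
print(f"Highest: {columns[peak]} at {watts[peak]:.2f} W, "
      f"{SUPPLY_AMPS - main_amps[peak]:.2f} A below the supply rating")
```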
  



Drive Stress Test - burnindrive264g12

The program uses 64 KB block sizes, with 164 variations of data patterns, giving a minimum file size of 10.25 MB. Larger files can be produced via a run time multiplication parameter, in this case 16, for 164 MB files. Four of these are written, then read sequentially for 12 minutes, but with the choice of files randomised. Finally, each block/data pattern is reread continuously for a second, at full bus speed from disk drives that cache the data. On reading, file number and data values are compared and errors reported.
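
The file sizes follow directly from the block size and number of patterns, as this small Python calculation shows (the function name is my own):

```python
BLOCK_KB = 64
PATTERNS = 164

def file_size_mb(multiplier=1):
    """File size in MB for a given run time multiplication parameter."""
    return BLOCK_KB * PATTERNS * multiplier / 1024

print(file_size_mb())     # minimum file size
print(file_size_mb(16))   # size used for these tests
```

On the same basis, a multiplier of 256 would give 2624 MB files.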

Note that measured speeds are generally slower than from DriveSpeed benchmark, covered earlier, as data transfers are based on using smaller 64 KB blocks.

The following provides summary Pi 5 results including MB/second performance calculations. The tests exercised the main SD drive, LAN, WiFi and USB 3. Devices on the latter were a hard drive with Ext2, Ext3, Ext4 and FAT32 partitions and three flash drives. The LAN and WiFi tests were also run on a Pi 400 to confirm similar performance. No errors were detected.

A gigabit LAN connection was used and WiFi was reported as 5 GHz, with the former around 5 times faster on writing and up to 10 times faster on reading. There were performance variations on the various solid state drives that could affect certain applications. One of the disk drive tests, using the Ext3 partition, had inexplicably slow speeds and, when repeated, was still somewhat slower than the other partitions on writing. Note the much faster transfer speeds with repeated reading of 64 KB blocks, indicating cached data and bus speed.

                    Write            Read                    Blocks Repeated
Source              Seconds  MB/sec  Passes Minutes  MB/sec  Number Minutes  MB/sec

Comms
LAN Pi 5 to PC         19.3    34.0     156   12.06    35.4   99360    2.79    37.1
LAN Pi 400 to PC       20.2    32.6     132   12.37    29.2   80900    2.79    30.2
WiFi Pi 5 to PC        99.6     6.6      20   14.41     3.8   12540    3.61     3.6
WiFi Pi 400 to PC     101.7     6.5      20   12.78     4.3   14720    3.66     4.2

SD OS Card             41.7    15.7     260   12.03    59.1  174960    2.76    66.0

USB 3 Flash Drive
Flash 1                20.7    31.7     328   12.01    74.6  179200    2.76    67.6
Flash 2                 8.0    82.0     352   12.06    79.8  219400    2.75    83.1
Flash 3               145.2     4.5     136   12.12    30.7   89860    2.77    33.8

USB HD
FAT32 Partition         8.4    78.1     268   12.15    60.3  408280    2.75   154.7
Ext 2 Partition         8.9    73.7     272   12.03    61.8  432060    2.74   164.3
Ext 3 Partition        1320     0.5     100   12.14    22.5  427360    2.74   162.5
Ext 3 Repeat           11.8    55.6     256   12.09    57.9  431820    2.74   164.2
Ext 4 Partition         9.0    72.9     284   12.10    64.2  432200    2.74   164.3


  



BurnInDrive Stress Test With Performance Monitoring

Following are details of a run handling four 2624 MB files, along with associated results from vmstat performance monitor and my CPU Voltage, MHz and Temperature recorder. The tests were run using the Ext3 partition.

First below are the program results, with faster writing speeds than above, slightly slower reading speeds and similar repeat reading speeds. These differences might be due to handling larger files.

Second are sample vmstat results (size numbers are in KB), with nothing unusual in utilisation of the 8 GB memory. There were variations in bo (writing) and bi (reading) speeds, but they essentially confirm the program measurements. Percentage user + system CPU utilisation was low (note that 25% here represents 100% of one core and 100% indicates four cores fully utilised).

Finally are samples of the environment measurements, which were effectively constant. Results are provided for the start, middle and end of the tests. With ondemand CPU frequency scaling being used, a constant 1500 MHz was indicated for most of the time.

This test was run later on a Pi 4, where writing was 9% slower, reading 6% slower and repeat reading 18% slower, with similar CPU utilisation. See results below.

                    Write            Read                    Blocks Repeated
Source              Seconds  MB/sec  Passes Minutes  MB/sec  Number Minutes  MB/sec

Ext 3 Partition       129.2    81.2      16   13.99    50.0  419020    2.74   159.3
Pi 4 Ext3             142.2    73.8      16   14.81    47.2  345680    2.75   130.9 

VMSTAT
procs -----------memory---------- ---swap-- -----io---- -system--    ------cpu-----
 r  b   swpd    free   buff  cache   si   so    bi   bo   in   cs    us sy id wa st
WRITE
 1  1      0 6901476 137524 682832    0    0     0 77806 8123 11887   1  6 74 20  0
 2  0      0 6901476 137524 682832    0    0     8 90292 9889 13562   1  7 74 18  0
READ
 1  1      0 6901476 137524 682832    0    0 32538    46 3377 5344    0  1 75 24  0
 1  1      0 6901476 137524 682832    0    0 60064    16 7630 10652   3  2 72 24  0
REPEAT
 1  1      0 6868408 149372 699428    0    0 162170    3 19231 25503  0  4 72 24  0
 1  1      0 6868408 149372 699428    0    0 162144    3 17290 25480  0  4 72 23  0

ENVIRONMENT
 Seconds
    0.0   ARM MHz=1500, core volt=0.9067V, CPU temp=37.3C, pmic temp=38.4C
  453.6   ARM MHz=1500, core volt=0.9067V, CPU temp=38.9C, pmic temp=38.4C
  897.4   ARM MHz=1500, core volt=0.9067V, CPU temp=38.9C, pmic temp=38.6C
  



Disk Drive Errors and Crashes - Power Supply Problems

I have two 1TB USB 3 disk drives. The first crash occurred on attempting to run the new benchmark on both disk drives, when they were connected to a USB hub via one USB port. The cause would have been obvious if I had looked up the specification. That indicated a maximum of 900 mA, whereas up to 660 mA had been observed on one drive. It seems that a 5 amps power supply would not help with this sort of activity and that a powered USB hub should be used instead.

The second crash was running two disk drive benchmarks with one on the hub, plus my 4 thread integer CPU stress test. This time the crash appeared to be due to the power demand being greater than the 3 Amps supply. 3.06 Amps was indicated shortly before the crash.

Before the next crash I successfully ran two copies of my burnindrive264g12 stress test on separate USB ports. Then, with one of these and one integer stress test, the last measurements before the screen went blank were a data transfer failure reported by my program and a power input recording of 2.72 Amps. Following is a report from the last failing test session, indicating the seriousness of the situation, with the wrong file being read and data corrupted.

Later tests were run using a 4 amps power supply. At the time of testing, the official 5 amps power supply was not available.

 Selected File Path: 
 /media/raspberrypi/EXT3/
 Total MB  348052, Free MB  348052, Used MB       0

 Storage Stress Test ARM 64 Bit v2.0 gcc 8, Fri Oct  6 21:28:44 2023

 File size 2624.00 MB x 4 files, minimum reading time 12.0 minutes

 File 1 2624.00 MB written in   30.97 seconds 
 File 2 2624.00 MB written in   28.80 seconds 
 File 3 2624.00 MB written in   29.70 seconds 
 File 4 2624.00 MB written in   32.35 seconds 


         Total  121.83 seconds, Elapsed  121.83 seconds

              Start Reading Fri Oct  6 21:30:46 2023

 Error reading file 1

 Wrong File Read szzztestz-820 instead of szzztestz1 

 Error reading file 2

 Wrong File Read szzztestz-820 instead of szzztestz2 

 Error reading file 3

 Pass   1 file szzztestz1 word  1, data error was FFFFFCCC expected FFFFCCCC
 Pass   1 file szzztestz1 word  2, data error was FFFFFCCC expected FFFFCCCC

 ERRORS found during reading tests, see above

              End of test Fri Oct  6 21:34:09 2023
  


Other System Crashes

The first tests carried out were run with the Pi 5 operating via a 2 amps power supply, without any real problems running the short duration benchmarks. However, there were reductions in performance on running a series of tests, due to temperature increases. I had a cheap cooling fan module used for Pi 4 tests that I fitted on top of the Pi 5, to connect for use when needed, such as for the following procedures.

High Performance Linpack - I attempted to build this benchmark, to continue using it as a stress test. It takes an excessive amount of time to build, appearing to execute the code repeatedly in order to tune it for the particular computer. In view of the timescale, I ensured that the cooling fan was working.

The first attempt was left to run overnight, only to find, in the morning, that the system had crashed. A second attempt crashed after 7 hours. Later with a 3 amps power supply, it took 12 hours to build (but other required software was found to be incompatible).

Stress Test Crash - I had successfully run many of my floating point and integer stress tests, using a data size parameter aimed at achieving maximum performance from the L1 caches on all four CPU cores. Other runs with L2 cache sized data occasionally crashed. Later these tests ran successfully using the 3 amps power supply, with similar temperature and CPU throttling levels.

Even later, with more demanding system stress tests, the 3 amps supply was found to be inadequate.



  



CPU Stress Testing Benchmarks - MP-FPUStress64g8 and g12, MP-FPUStress64DPg8 and g12, MP-IntStress64g8 and g12

These are provided to help in determining parameters to use for a stress test. The floating point programs run a series of tests using 1, 2, 4 and 8 threads, at three different memory demands, in single precision and double precision versions. An integer program is also provided, extending to 16 and 32 threads and accessing three similar memory sizes.

Pi 5 GCC 12 SP
 MP-Threaded-MFLOPS 64 Bit V2 gcc 12 Fri Sep 29 09:59:04 2023

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   0.4    T1   2 13111 12985  2003   40394  76395  99700
   0.8    T2   2 24716 26088  1849   40394  76395  99700
   1.2    T4   2 41053 45232  1847   40394  76395  99700
   1.5    T8   2 34398 44918  2141   40394  76395  99700
   2.2    T1   8 17572 17484  8265   54764  85092  99820
   2.8    T2   8 33483 35138  5731   54764  85092  99820
   3.2    T4   8 59976 69804  6737   54764  85092  99820
   3.6    T8   8 58659 69463  8481   54764  85092  99820
   5.3    T1  32 18265 18246 17917   35206  66015  99520
   6.3    T2  32 35625 36482 22484   35206  66015  99520
   7.0    T4  32 69359 72766 29572   35206  66015  99520
   7.6    T8  32 69370 66234 33184   35206  66015  99520

            End of test Fri Sep 29 09:59:12 2023

Pi 5 GCC 8 SP
  MP-Threaded-MFLOPS 64 Bit V2 gcc 8 Thu Aug 17 21:21:35 2023

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   0.4    T1   2 12746 12885  2029   40394  76395  99700
   0.8    T2   2 25127 24925  1791   40394  76395  99700
   1.2    T4   2 43633 45111  1797   40394  76395  99700
   1.6    T8   2 39439 44308  2151   40394  76395  99700
   2.2    T1   8 17069 17333  7672   54764  85092  99820
   2.7    T2   8 34070 34766  7170   54764  85092  99820
   3.2    T4   8 58695 69177  7229   54764  85092  99820
   3.6    T8   8 59622 65856  8346   54764  85092  99820
   5.3    T1  32 18202 18288 18037   35206  66015  99520
   6.2    T2  32 36321 36549 27452   35206  66015  99520
   6.9    T4  32 68760 73025 27221   35206  66015  99520
   7.5    T8  32 68598 72071 32869   35206  66015  99520

            End of test Thu Aug 17 21:21:42 2023

Pi 5 GCC 12 DP
  MP-Threaded-MFLOPS 64 Bit gcc 12 Fri Sep 29 10:05:24 2023

   Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   0.9    T1   2  6570  6565  1003   40395  76384  99700
   1.9    T2   2 12052 13057   696   40395  76384  99700
   2.7    T4   2 22815 25654   831   40395  76384  99700
   3.5    T8   2 21088 25978   838   40395  76384  99700
   4.9    T1   8  8348  8388  3290   54805  85108  99820
   6.3    T2   8 15906 16532  2530   54805  85108  99820
   7.3    T4   8 23730 28755  2932   54805  85108  99820
   8.3    T8   8 30036 30142  3327   54805  85108  99820
  11.4    T1  32 10027  9975  9486   35159  66065  99521
  13.3    T2  32 19719 19508 12462   35159  66065  99521
  14.6    T4  32 40249 39892 13452   35159  66065  99521
  15.9    T8  32 38383 39453 13637   35159  66065  99521

            End of test Fri Sep 29 10:05:40 2023

 


Pi 5 GCC 8 DP
  MP-Threaded-MFLOPS 64 Bit gcc 8 Thu Aug 17 21:29:32 2023

   Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   0.9    T1   2  5832  5779   964   40395  76384  99700
   1.8    T2   2 11389 11537   891   40395  76384  99700
   2.6    T4   2 18744 21914   794   40395  76384  99700
   3.5    T8   2 18803 22948   842   40395  76384  99700
   4.7    T1   8  9375  9433  3984   54805  85108  99820
   5.9    T2   8 18190 18819  2758   54805  85108  99820
   6.8    T4   8 33842 37329  3233   54805  85108  99820
   7.7    T8   8 33857 34347  3393   54805  85108  99820
  10.9    T1  32  9633  9642  9458   35159  66065  99521
  12.7    T2  32 19227 19248 14292   35159  66065  99521
  14.0    T4  32 37215 38597 13208   35159  66065  99521
  15.4    T8  32 35943 36029 13288   35159  66065  99521

            End of test Thu Aug 17 21:29:47 2023

Pi 5 GCC 12
  MP-Integer-Test 64 Bit v2-gcc12 Fri Sep 29 10:11:39 2023

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   1.5    1  18233 17590 13957  00000000    Yes
   1.1    2  36284 35095 13303  FFFFFFFF    Yes
   1.0    4  71208 73154 11228  5A5A5A5A    Yes
   1.0    8  64036 68274 11499  AAAAAAAA    Yes
   0.9   16  70658 71792 12459  CCCCCCCC    Yes
   0.5   32  69044 72425 26917  0F0F0F0F    Yes

            End of test Fri Sep 29 10:11:45 2023

Pi 5 GCC 8
  MP-Integer-Test 64 Bit v2-gcc8 Thu Aug 17 21:32:43 2023

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   1.7    1  15193 15083 13106  00000000    Yes
   1.2    2  30256 30277 13472  FFFFFFFF    Yes
   1.0    4  58317 60842 11173  5A5A5A5A    Yes
   1.0    8  56279 54906 12132  AAAAAAAA    Yes
   0.9   16  54716 59296 13475  CCCCCCCC    Yes
   0.5   32  53649 59206 34738  0F0F0F0F    Yes

            End of test Thu Aug 17 21:32:49 2023


  


Floating Point and Integer Stress Tests - No Fan

Following are early gcc8 compiled result summaries for the first stress tests, without a fan being fitted. The tests ran for 15 minutes, using 1, 2 and 4 threads, measuring average performance over 10 seconds, with samples of MHz, volts and temperature recorded within each period. The summaries show 5 sets of results from the beginning, middle and end of each run, then minimum and maximum values of each column, plus maximum/minimum ratios. Note that, with more than 1 thread, each thread's share of the data should fit in the L1 cache of the core used. Every test ran successfully but all showed CPU MHz throttling and some voltage reductions, with maximum speeds between 23% and 55% higher than the minimum (Max/Min 1.23 to 1.55). At the end of the integer 4 thread tests, temperatures of up to 90C were recorded, along with some CPU clock speeds of 1000 MHz.

      Floating Point Stress Test 128 KB    Integer Stress Test 160 KB

                                 CPU  PMIC                           CPU   PMIC
Seconds  MFLOPS    MHz   Volts    C    C  MB/sec    MHz   Volts     C     C
1 Thread
       0          2400  0.9065  68.6  61.8           2400  0.9065   71.9   64.8
      10  18279   2400  0.9065  73.0  63.0   15128   2400  0.9065   77.4   66.0
      20  18273   2400  0.9065  76.8  63.7   15132   2400  0.9065   78.5   66.8
      30  18284   2400  0.9065  75.2  64.4   15094   2400  0.9065   79.0   67.4
      40  18283   2400  0.9065  78.5  65.0   15095   2400  0.9065   81.8   68.1
      50  18277   2400  0.9065  79.0  65.7   15117   2400  0.9065   82.3   68.9

     420  16459   2201  0.7200  84.5  72.8   12906   2146  0.9065   85.1   73.3
     430  16396   2146  0.9065  85.1  72.8   11522   1500  0.9065   84.0   73.0
     440  16440   2256  0.9065  84.5  72.6   12905   1500  0.9065   84.5   73.3
     450  14862   1500  0.9065  86.2  72.5   12437   1500  0.9065   84.5   73.2
     460  15332   2146  0.9065  84.5  72.5   11505   1500  0.9065   85.1   73.0

     860  15370   2256  0.9065  84.0  72.3   12181   1500  0.7200   85.1   73.6
     870  15318   2201  0.9065  84.5  72.5   11929   2146  0.9065   84.0   73.3
     880  17227   2201  0.7200  84.0  72.8   13275   2201  0.9065   84.5   73.2
     890  16381   1500  0.9065  85.6  72.5   12913   1500  0.9065   84.0   73.4
     900  16364   2201  0.7200  82.9  72.4   11974   1500  0.9065   84.5   73.2

    Max   18284   2400  0.9065  86.2  72.8   15132   2400  0.9065   85.1   73.6
    Min   14862   1500    0.72  68.6  61.8   11505   1500    0.72   71.9   64.8
Max/Min    1.23   1.60    1.26  1.26  1.18    1.32   1.60    1.26   1.18   1.14

2 Threads
       0          2400  0.9065  71.4  64.2           2400  0.9065   71.9   64.4
      10  36520   2400  0.9065  79.0  66.8   30425   2400  0.9065   80.7   66.7
      20  35794   2311  0.9065  84.0  68.1   29123   2256  0.9065   84.0   67.8
      30  33156   2256  0.7200  84.5  69.3   28064   2256  0.9065   85.1   68.9
      40  31361   2146  0.7200  85.1  70.0   25692   2201  0.9065   84.0   69.4
      50  30525   2146  0.9065  85.1  70.8   25456   1500  0.9065   84.0   70.1

     420  27102   1500  0.7200  84.5  73.5   21687   1500  0.7200   85.6   73.8
     430  26742   2146  0.7200  85.1  73.5   20675   1500  0.9065   86.2   73.9
     440  27006   1500  0.9065  85.6  73.4   20980   1500  0.7200   85.6   73.6
     450  27092   2201  0.7200  85.6  73.5   21997   1500  0.7200   85.1   73.9
     460  26822   1500  0.9065  85.6  73.3   20854   1500  0.7200   85.1   73.6

     860  26691   2146  0.7200  85.1  73.9   21072   2146  0.7200   85.1   73.9
     870  26989   1500  0.7200  85.1  73.9   21111   1500  0.7200   85.6   73.6
     880  28018   1500  0.7200  85.1  73.9   21035   1500  0.9065   85.6   73.6
     890  27595   1500  0.9065  85.6  73.9   21011   2256  0.7200   84.5   73.8
     900  26449   2256  0.7200  85.1  74.0   21028   1500  0.7200   84.5   73.8

    Max   36520   2400  0.9065  85.6  74.0   30425   2400  0.9065   86.2   73.9
    Min   26449   1500  0.7200  71.4  64.2   20675   1500  0.7200   71.9   64.4
Max/Min    1.38   1.60    1.26  1.20  1.15    1.47   1.60    1.26   1.20   1.15

4 Threads
       0          2400  0.9065  71.4  64.3           2400  0.9065   70.8   64.3
      10  61133   1500  0.9065  85.1  68.0   52566   2256  0.7200   83.4   68.1
      20  52128   1500  0.7200  85.6  69.1   44870   1500  0.7200   84.5   69.2
      30  50301   1500  0.7200  85.1  70.8   43266   2256  0.7200   85.1   70.0
      40  49068   1500  0.9065  86.2  71.0   42129   2201  0.7200   84.5   71.2
      50  48448   2201  0.9065  87.3  71.6   41617   1500  0.7200   85.1   71.4

     420  45854   1500  0.7200  86.2  74.3   34701   1500  0.7200   89.5   76.6
     430  45456   1500  0.7200  86.2  74.3   35108   1500  0.7200   88.4   76.6
     440  45859   1500  0.7200  85.6  74.3   35034   1500  0.7200   90.0   76.6
     450  45853   1500  0.7200  85.6  74.3   35099   1500  0.7200   88.9   76.5
     460  45810   1500  0.7200  85.1  74.3   35176   1000  0.7200   89.5   76.6

     860  45686   1500  0.7200  85.1  74.3   34503   1500  0.7200   88.9   76.8
     870  45337   1500  0.7200  84.5  74.3   34056   1500  0.7200   90.0   77.0
     880  46261   1500  0.7200  85.6  74.3   34053   1500  0.7200   88.9   76.6
     890  45069   1500  0.7200  86.2  74.3   33955   1500  0.7200   89.5   77.0
     900  45285   1500  0.7200  86.2  74.6   34188   1500  0.7200   90.0   76.9

    Max   61133   2400  0.9065  87.3  74.6   52566   2400  0.9065   90.0   77.0
    Min   45069   1500  0.7200  71.4  64.3   33955   1000  0.7200   70.8   64.3
Max/Min    1.36   1.60    1.26  1.22  1.16    1.55   2.40    1.26   1.27   1.20
  


Integer Stress Tests - With Fan

The fan came as part of a 2019 GeeekPi Acrylic Case for Raspberry Pi 4 Model B and is probably not powerful enough for the Pi 5.

The results provided cover data from L1 and L2 caches, with a starting temperature around 40C, in a room at 26C to 27C. One example made use of one thread, running continuously at full speed and reaching a maximum CPU temperature of 57.1C. Similarly, one used two threads and ran at full speed, with temperature up to 70.3C.

There are four examples using 4 threads, with data sizes of 128 KB, 512 KB and two at 1024 KB (to show variations). These all have maximum CPU temperatures indicated as between 84.5C and 85.1C, with MHz throttling, maximum speeds of around 60 GB/second and minimum speeds of about 51 GB/second. The examples using 1 and 2 threads indicated constant performance near 15 and 30 GB/second respectively, all at 2400 MHz.

         4 Threads 128 KB 4 x L1 Cache      4 threads 1024 KB 4 x L2 Cache
                                 CPU  PMIC                           CPU   PMIC
 Seconds MB/sec    MHz   Volts    C    C  MB/sec    MHz   Volts     C     C

       0          2400  0.9067  38.9  40.1           2400  0.9067   41.1   39.9
      10  59953   2400  0.9067  57.6  43.8   60553   2400  0.9067   56.0   43.7
      20  59448   2400  0.9067  67.0  47.3   60320   2400  0.9067   63.7   45.9
      30  60019   2400  0.9067  70.8  50.0   59929   2400  0.9067   67.0   47.9

     420  51124   2256  0.9067  84.5  62.2   53503   2256  0.9067   84.5   61.4
     430  51011   2146  0.9067  84.5  62.2   53653   2256  0.9067   84.0   61.0
     440  51219   2256  0.9067  84.5  62.4   53297   2146  0.9067   84.5   61.4

     860  50943   2201  0.9067  84.5  62.1   53756   2201  0.9067   83.4   61.7
     870  51446   2311  0.9067  84.0  62.3   53352   2146  0.9067   83.4   61.7
     880  51378   2146  0.7200  82.3  61.9   54173   2201  0.9067   84.5   61.7

    Max   60025   2400  0.9067  84.5  62.4   60553   2400  0.9067   84.5   61.7
    Min   50943   2146  0.7200  38.9  40.1   53157   2146  0.7200   41.1   39.9
Max/Min    1.18   1.12    1.26  2.17  1.56    1.14   1.12    1.26   2.06   1.55

          4 Threads 512 KB 4 x L2 Cache      1 Thread 512 KB L2 Cache

       0          2400  0.9067  41.7  40.5           2400  0.9067   40.6   39.5
      10  58969   2400  0.9067  59.8  44.9   14995   2400  0.9067   46.6   40.7
      20  59611   2400  0.9067  66.4  47.2   15070   2400  0.9067   48.8   42.1
      30  59488   2400  0.9067  70.8  50.0   15018   2400  0.9067   50.5   43.1

     420  51217   1500  0.9067  84.0  62.1   15068   2400  0.9067   54.3   47.0
     430  50975   2201  0.9067  85.1  61.5   15081   2400  0.9067   53.2   46.9
     440  51841   2256  0.9067  84.0  62.3   15064   2400  0.9067   53.8   46.8

     860  51128   2146  0.9067  85.1  61.3   15031   2400  0.9067   56.5   48.2
     870  50938   2311  0.9067  84.5  62.1   15074   2400  0.9067   56.5   48.1
     880  51460   2400  0.9067  84.0  61.7   15055   2400  0.9067   57.1   48.1
  
    3560  51254   1500  0.9067  84.0  62.4   15038   2400  0.9067   56.5   47.8
    3570  51414   2146  0.9067  85.1  61.8   15062   2400  0.9067   56.5   47.7
    3580  51197   1500  0.9067  84.5  62.2   15051   2400  0.9067   56.5   47.7

    Max   59611   2400  0.9067  85.1  62.4   15081   2400  0.9067   57.1   48.2
    Min   50938   1500    0.72  41.7  40.5   14995   2400  0.9067   40.6   39.5
Max/Min    1.17   1.60    1.26  2.04  1.54    1.01   1.00    1.00   1.41   1.22

          2 Threads 512 KB 2 x L2 Cache      4 Threads 1024 KB 4 x L2 Cache

       0          2400  0.9067  39.5  40.0           2400  0.9065   41.1   39.7
      10  30115   2400  0.9067  51.0  42.5   59776   2400  0.9065   57.6   44.2
      20  30172   2400  0.9067  54.9  43.8   59619   2400  0.9065   67.0   47.0
      30  30254   2400  0.9067  55.4  45.0   59773   2400  0.9065   70.8   49.7

     420  30258   2400  0.9067  70.3  53.0   51820   2311  0.7200   84.0   62.0
     430  30295   2400  0.9067  70.3  53.1   51644   2201  0.7200   82.9   61.3
     440  30272   2400  0.9067  68.6  53.2   51512   2146  0.9065   84.5   62.1

     860  30265   2400  0.9067  69.2  53.1   52739   2201  0.9065   83.4   61.7
     870  30252   2400  0.9067  68.1  53.4   52652   2400  0.9065   84.5   61.5
     880  30289   2400  0.9067  68.1  53.2   50956   2201  0.9065   84.5   61.8

    3560  30274   2400  0.9067  69.7  53.2   51051   2311  0.9065   84.5   62.5
    3570  30296   2400  0.9067  68.6  53.2   51008   2146  0.7200   82.3   62.5
    3580  30246   2400  0.9067  68.6  53.2   51157   1500  0.9065   83.4   62.5
               
    Max   30296   2400  0.9067  70.3  53.4   59812   2400  0.9065   84.5   62.5
    Min   30115   2400  0.9067  39.5  40.0   50776   1500  0.7200   41.1   39.7
Max/Min    1.01   1.00    1.00  1.78  1.34    1.18   1.60    1.26   2.06   1.57
  



Floating Point Stress Tests - With Fan

Only two sets of results are provided, both using 4 threads with the same data size of 512 KB, one with 2 floating point operations per data word, starting at 51.2 GFLOPS, and the other with 32 floating point operations per data word, starting at 72.3 GFLOPS. At the end of the 15 minute runs, performance was indicated at 43.3 and 72.2 GFLOPS respectively, the slower one running at higher temperatures. The near constant performance of the faster test was confirmed by constant CPU MHz reports.

Estimating data flow from MFLOPS and Ops/Word indicates that the test with the slower CPU performance has a much higher data transfer speed and that can influence CPU temperatures.

        4 Threads 2 Ops/Word 512 KB 4 x L2  4 Threads 32 Ops/Word 512 KB 4 x L2

                                 CPU  PMIC                           CPU   PMIC
 Seconds MFLOPS    MHz   Volts    C    C  MFLOPS    MHz   Volts     C     C

       0          2400  0.9067  41.7  41.2           1500  0.9067   40.0   40.6
      10  51228   2400  0.9067  65.9  48.3   72366   2400  0.9067   59.3   44.6
      20  50610   2400  0.9067  76.8  52.3   72350   2400  0.9067   67.0   47.3
      30  50799   2400  0.9067  82.3  55.9   72370   2400  0.9067   70.3   49.3
      40  51452   2201  0.9067  83.4  57.7   72348   2400  0.9067   71.9   51.2
      50  50451   2256  0.9067  82.9  59.0   72212   2400  0.9067   74.1   52.6

     420  43777   1500  0.9067  84.0  62.3   72348   2400  0.9067   81.2   58.9
     430  43870   2400  0.9067  84.5  62.5   72381   2400  0.9067   81.2   58.9
     440  43733   2201  0.9067  84.0  62.3   72617   2400  0.9067   80.7   58.9
     450  43887   2146  0.9067  84.5  61.7   72201   2400  0.9067   80.7   58.8
     460  43609   2201  0.9067  85.1  61.9   72229   2400  0.9067   81.2   58.9

     860  43726   2366  0.9067  84.5  62.3   72294   2400  0.9067   81.2   59.2
     870  43346   2201  0.9067  84.5  62.3   72465   2400  0.9067   81.2   59.1
     880  44063   2146  0.9067  85.1  61.9   72257   2400  0.9067   81.8   59.3
     890  43412   2201  0.9067  84.5  62.2   72173   2400  0.9067   81.2   59.2
     900  43353   2146  0.9067  84.5  62.5   72163   2366  0.9067   81.2   59.2

    Max   51452   2400  0.9067  85.1  62.5   72617   2400  0.9067   81.8   59.3
    Min   43346   1500  0.9067  41.7  41.2   72163   1500  0.9067   40.0   40.6
Max/Min    1.19   1.60    1.00  2.04  1.52    1.01   1.60    1.00   2.05   1.46
  


Stress Test Parameters

The following shows the stress test run time parameters. The parameter names can be upper or lower case and only the first character is interpreted.

   ./MP-FPUStress   Threads tt, Minutes mm, KB kk, Ops oo, Log ll
   ./MP-FPUStressDP Threads tt, Minutes mm, KB kk, Ops oo, Log ll
   ./MP-IntStress   Threads tt, Minutes mm, KB kk, Log ll
   ./RPiHeatMHzVolts2  Passes pp, Seconds ss, Log ll
   vmstat ss pp

   tt = Threads 1, 2, 4, 8, 16, 32, (64 FPU)       mm = Minutes greater than 0                       
   kk = KBytes 12 to 15624                         oo = Operations Per Word 2, 8 or 32  
   ll = number added to log file name, 0 to 99     pp = Passes (at ss second intervals) 
   ss = Second intervals 
  


New 4 Amps Power Supply No Disk Crash

Earlier I reported that the Pi 5 crashed when running a stress test on a USB based disk drive along with one executing integer calculations via four threads. A 3 amps power supply was in use.

With no 5 amps power supplies being available, I investigated the Power over Ethernet (PoE) route. My existing Power Injector and Splitter were limited to providing 2.5 amps. There are lots of Injectors delivering 25 or 30 watts but I could not find a Splitter producing 5 amps at 5 volts. However, I acquired a GeeekPi Gigabit USB-C PoE Splitter 48V to 5V, 4A and YuanLey Gigabit PoE Injector 30W, PoE+.

They did not explode on connecting them and I was able to run those tests successfully, the first with SD booting and the disk on USB 3 and the second booting and testing a disk on a USB 3 hub. My monitors typically indicated power input of 5.2V at 2.8A and USB supply of 4.9V at 0.75A.

  



New Integer Stress Test - INTitHOT64g12

Above, I showed that my MP-BusSpeed benchmark could achieve a data transfer rate of 150 GB/second. I have now converted the particular procedures to work as a stress test, with variable options that operate at up to 168 GB/second. Later, 240 GB/second was obtained using L1 cache sized data. As the program is executing AND instructions, this demonstrated Terabit performance at 1.92 Tbps.

The tests identified three particular problems. With no fan, CPU temperature appeared to reach 90C. Then, with a fan, current draw was indicated as being up to 2.3 amps. Also, in both cases there was significant CPU MHz throttling.

Following are the C program function calculations and the main disassembled code. It is effectively a read only test of 64 words, from a large array, executing AND instructions for a one word output. Each thread exercises a dedicated segment of the data, circulated on a round robin basis, reading all data every pass. The disassembly shows (I believe) loading data into eight pairs of quad word registers, then sixteen quad word AND operations.

In case anybody is interested in running (or modifying) the program, the source and compiled codes, along with my environmental monitor, are available from ResearchGate in INTitHOT.tar.xz.

  Test Function Calculations

  andsum1[t] = andsum1[t] & array[i   ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
                          & array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
                          & array[i+8 ] & array[i+9 ] & array[i+10] & array[i+11]
                          & array[i+12] & array[i+13] & array[i+14] & array[i+15]
                          & array[i+16] & array[i+17] & array[i+18] & array[i+19]
                          & array[i+20] & array[i+21] & array[i+22] & array[i+23]
                          & array[i+24] & array[i+25] & array[i+26] & array[i+27]
                          & array[i+28] & array[i+29] & array[i+30] & array[i+31]
                          & array[i+32] & array[i+33] & array[i+34] & array[i+35]
                          & array[i+36] & array[i+37] & array[i+38] & array[i+39]
                          & array[i+40] & array[i+41] & array[i+42] & array[i+43]
                          & array[i+44] & array[i+45] & array[i+46] & array[i+47]
                          & array[i+48] & array[i+49] & array[i+50] & array[i+51]
                          & array[i+52] & array[i+53] & array[i+54] & array[i+55]
                          & array[i+56] & array[i+57] & array[i+58] & array[i+59]
                          & array[i+60] & array[i+61] & array[i+62] & array[i+63];

 Inner Loop Disassembly

  .L128:
        ldp	q31, q30, [x0]
	add	w13, w13, 1
	ldp	q29, q28, [x0, 32]
	ldp	q27, q26, [x0, 64]
	ldp	q25, q24, [x0, 96]
	ldp	q23, q22, [x0, 128]
	ldp	q21, q20, [x0, 160]
	ldp	q19, q18, [x0, 192]
	ldp	q17, q16, [x0, 224]
	add	x0, x0, 256
	and	v15.16b, v15.16b, v31.16b
	and	v0.16b, v0.16b, v30.16b
	and	v14.16b, v14.16b, v29.16b
	and	v13.16b, v13.16b, v28.16b
	and	v12.16b, v12.16b, v27.16b
	and	v11.16b, v11.16b, v26.16b
	and	v10.16b, v10.16b, v25.16b
	and	v9.16b, v9.16b, v24.16b
	and	v8.16b, v8.16b, v23.16b
	and	v7.16b, v7.16b, v22.16b
	and	v6.16b, v6.16b, v21.16b
	and	v5.16b, v5.16b, v20.16b
	and	v4.16b, v4.16b, v19.16b
	and	v3.16b, v3.16b, v18.16b
	and	v2.16b, v2.16b, v17.16b
	and	v1.16b, v1.16b, v16.16b
	cmp	w2, w13
	bhi	.L128
  



INTitHOT Pi 5 and Pi 4 Maximum Speeds - With Fan

The INTitHOT tests were run with the fan operational to demonstrate maximum speeds over the first few passes, using the same run time parameters on the Pi 5 and Pi 4. These accessed 64 KB using 1, 2 and 4 threads. Here, near constant elapsed times at all thread levels indicate high efficiency. This applied to the Pi 5 results. But, for an inexplicable reason, the Pi 4 failed to benefit from using 4 threads. Note that the latter system was booted and used via the Pi 5 OS SD card.

Pi 5 performance gains over Pi 4 results were 3.94 and 4.62 times at 1 and 2 threads and around 10 times at 4 threads. Fastest Pi 5 performance was 240 Gigabytes per second, using 4 threads. This indicates the equivalent of 120 Giga Instructions Per Second (GIPS) or 60 Giga Integer Arithmetic Operations Per Second (GIAOPS).

Also below are maximum speeds using 9 data sizes between 64 and 16384 KB. This test was included in my benchmark, intended to measure bus speeds. In this case, the memory bus speed is indicated as 27 GB/second. Here, at 16 MB data size, each of the 4 threads would be cycling through dedicated segments of 4 MB. Maximum observed current draw was 2.3 amps at 512 KB data size, higher than at 64 KB but with slower performance.

 Pi 5                                    Pi 4

 INTitHOT 64 Bit gcc 12 Thu              INTitHOT 64 Bit gcc 12 Thu
 Oct 19 15:51:53 2023                    Oct 19 15:11:35 2023

 1 Threads. 64 KBytes, 500000            1 Threads. 64 KBytes, 500000
 Passes 1+ Minutes                       Passes 1+ Minutes

 Repeat   MB/second  Seconds             Repeat   MB/second  Seconds

    1         56796    0.58                 1         14418    2.27
    2         56612    0.58                 2         14412    2.27
    3         56704    0.58                 3         14404    2.27

####################################    ####################################

 INTitHOT 64 Bit gcc 12 Thu              INTitHOT 64 Bit gcc 12 Thu
 Oct 19 15:51:16 2023                    Oct 19 15:11:06 2023

 2 Threads. 64 KBytes, 500000            2 Threads. 64 KBytes, 500000
 Passes 1+ Minutes                       Passes 1+ Minutes

 Repeat   MB/second  Seconds             Repeat   MB/second  Seconds

    1        113194    0.58                 1         24510    2.67
    2        113663    0.58                 2         24415    2.68
    3        113272    0.58                 3         24412    2.68

####################################    ####################################

 INTitHOT 64 Bit gcc 12 Thu              INTitHOT 64 Bit gcc 12 Thu
 Oct 19 15:50:53 2023                    Oct 19 15:10:29 2023

 4 Threads. 64 KBytes, 500000            4 Threads. 64 KBytes, 500000
 Passes 1+ Minutes                       Passes 1+ Minutes

 Repeat   MB/second  Seconds             Repeat   MB/second  Seconds

    1        240850    0.54                 1         23839    5.50
    2        231406    0.57                 2         23832    5.50
    3        240861    0.54                 3         23836    5.50

####################################    ####################################
 
 Pi 5 4 Threads Maximum speeds         Power

 Passes    KB MB/sec    Secs            amps

 500000    64 240850    0.54 L1         1.8 to 1.9
 500000   128 165221    1.59 L2         1.9 to 2.0
 500000   256 168499    3.11            1.9 to 2.0
 500000   512 158777    6.64            2.1 to 2.3
  50000   512 158019    0.66            2.1 to 2.3
  50000  1024  73043    2.87 L3         1.8 to 1.9
  50000  2048  52050    8.06 L3         1.7 to 1.8
  50000  4096  32024   26.18 RAM        1.6 to 1.7
  50000  8192  30767   54.53            1.5 to 1.6
  50000 16384  26983  124.35            1.5 to 1.7
  



INTitHOT Stress Tests

The tests were all run for 15 minutes using 4 threads, covering two data sizes, 64 KB for the fastest via L1 caches and the hottest at 512 KB using L2 caches. In each table, every performance measurement is for the same pass count, so the time taken can increase due to CPU MHz throttling. The environmental monitor was run at the same time, sampling at 30 second intervals.
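
Because the pass count is fixed, MB/second follows directly from the elapsed seconds. The following is a minimal sketch of that relationship, not the benchmark's own code: the function name is hypothetical, and the unit conventions (1 KB = 1024 bytes, 1 MB = 1,000,000 bytes, pass count applied per thread) are assumptions inferred from the published figures.

```python
def mb_per_second(threads, kbytes, passes, seconds):
    """Aggregate MB/second for a fixed pass count over `seconds`.

    Assumes 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes, the combination
    that appears to match the reported results most closely.
    """
    total_bytes = threads * kbytes * 1024 * passes
    return total_bytes / seconds / 1e6


# 64 KB no fan test: 4 threads, 15000000 passes per repeat.
fast = mb_per_second(4, 64, 15_000_000, 16.4)  # first repeat, unthrottled
slow = mb_per_second(4, 64, 15_000_000, 26.1)  # later repeat, throttled
print(f"{fast:.0f} MB/s unthrottled, {slow:.0f} MB/s throttled")
```

The reported seconds are rounded to one decimal place, so results computed this way agree with the tabulated MB/second only to within a fraction of a percent.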

Full details are provided later for the two test sessions run with the fan disconnected and the default ondemand CPU frequency scaling setting. Others were also run with the performance setting, producing similar long term variations in performance. Here, we have summaries of fan and no fan situations.

With no fan in use, there was significant CPU MHz throttling at both data sizes, less so at 64 KB with the higher MB/second data transfer speeds.

With fan cooling, the 64 KB test was not affected much by MHz throttling, suffering a mere 5% degradation in performance, compared with 16% at 512 KB, where there was additional throttling but not much increase in CPU temperature.

          MB/sec    Secs                     MHz   Volts  CPU C PMIC C

64 KB No Fan
    Min   150715    16.4                    1500  0.7200    42.8    44.2
    Max   240498    26.1                    2256  0.9060    87.3    75.4
 Average                                    1689  0.7492    84.0    71.5

512 KB No Fan
    Min    84743    29.0                    1000  0.7200    47.7    47.3
    Max   144811    49.5                    2146  0.9060    90.0    77.4
 Average                                    1380  0.7433    86.8    74.1

64 KB Fan
    Min   228738    32.7                    2256  0.9067    41.7    39.9
    Max   240414    34.4                    2400  0.9067    84.0    60.1
Average                                     2306  0.9067    82.3    59.7

512 KB Fan
    Min   124143    29.2                    1500  0.7200    41.7    43.0
    Max   143845    33.8                    2400  0.9060    85.6    62.5
Average                                     2193  0.8700    83.6    61.5
  



INTitHOT Stress Test 64 KB - No Fan

With no fan, CPU temperature reached 87.3C and the CPU ran at 1500 MHz for most of the time, leading to a 37% reduction in measured performance. Core voltage dropped from 0.9060 to 0.7200 after about two minutes.

PI 5 Stress Test 64 KB,  no fan, ondemand MHz scaling

INTitHOT Fri Oct 20 11:20:38     Temperature and CPU MHz Measurement
4 Threads 64 KB 15000000 Passes  Start at Fri Oct 20 11:20:33 2023

  Repeat  MB/sec    Secs         Seconds     MHz   Volts  CPU C PMIC C
                                       0    1500  0.9060    42.8    44.2
       1  240498    16.4              30    2256  0.9060    83.4    58.8
       2  225209    17.5              60    1500  0.9060    85.6    65.4
       3  195713    20.1              91    1500  0.9060    86.2    69.0
       4  182682    21.5             121    1500  0.7200    84.5    71.4
       5  172867    22.8             151    1500  0.7200    85.1    72.0
       6  166663    23.6             182    1500  0.7200    85.1    72.5
       7  163066    24.1             212    2146  0.7200    86.2    73.1
       8  160312    24.5             242    1500  0.7200    84.5    73.9
       9  158921    24.7             273    1500  0.7200    85.6    73.4
      10  157789    24.9             303    1500  0.7200    85.1    73.8
      11  156465    25.1             334    1500  0.7200    85.6    73.8
      12  154721    25.4             364    1500  0.7200    85.6    73.8
      13  155261    25.3             394    1500  0.7200    85.1    73.9
      14  154156    25.5             425    1500  0.7200    86.2    74.2
      15  153030    25.7             455    1500  0.7200    86.2    74.1
      16  152971    25.7             485    1500  0.7200    86.2    74.5
      17  153125    25.7             515    1500  0.7200    85.6    74.5
      18  152132    25.9             546    1500  0.7200    85.6    74.5
      19  152081    25.9             576    1500  0.7200    86.2    74.8
      20  152261    25.8             606    1500  0.7200    86.2    74.8
      21  151389    26.0             637    1500  0.7200    85.6    74.6
      22  151139    26.0             667    1500  0.7200    86.7    74.9
      23  151028    26.0             697    1500  0.7200    86.7    75.0
      24  151525    26.0             728    1500  0.7200    86.2    75.1
      25  151101    26.0             758    1500  0.7200    86.7    75.0
      26  151200    26.0             788    1500  0.7200    86.2    75.2
      27  151501    26.0             819    1500  0.7200    87.3    75.2
      28  150845    26.1             849    1500  0.7200    86.7    75.4
      29  150795    26.1             879    1500  0.7200    86.7    75.2
      30  150715    26.1             910    1500  0.7200    87.3    75.2
      31  151059    26.0             940    1500  0.9060    76.8    72.8
      32  150767    26.1
      33  150751    26.1
      34  150959    26.1
      35  150927    26.1
      36  150783    26.1
      37  151009    26.0

    Min   150715    16.4                    1500  0.7200    42.8    44.2
    Max   240498    26.1                    2256  0.9060    87.3    75.4
  Average                                   1689  0.7492    84.0    71.5

  


INTitHOT Stress Test 512 KB - No Fan

This recorded the highest temperatures, at 90C, and a 42% reduction in MB/second, with the lowest CPU frequency regularly at 1000 MHz. Voltage was mainly constant at 0.7200, with temperature near the top end.

PI 5 Stress Test Detail - 512 KB,  no fan, ondemand MHz scaling

INTitHOT Fri Oct 20 10:49:05     Temperature and CPU MHz Measurement
4 Threads 512 KB 2000000 Passes  Start at Fri Oct 20 10:48:58 2023

  Repeat  MB/sec    Secs         Seconds     MHz   Volts  CPU C PMIC C
                                       0    1500  0.9060    47.7    47.3
       1  144811    29.0              30    1500  0.9060    84.5    62.8
       2  117807    35.6              60    1500  0.9060    86.7    67.7
       3  109939    38.2              91    2146  0.7200    85.1    70.3
       4  106055    39.6             121    1500  0.7200    85.6    71.3
       5  104401    40.2             152    1500  0.7200    85.6    72.2
       6  103921    40.4             182    1500  0.7200    85.1    72.6
       7  103770    40.4             212    1500  0.7200    86.7    73.1
       8  103705    40.4             243    1500  0.7200    87.8    74.1
       9  101765    41.2             273    1500  0.7200    87.8    74.9
      10   98730    42.5             303    1500  0.7200    88.9    75.3
      11   96339    43.5             334    1500  0.7200    89.5    75.8
      12   93876    44.7             364    1500  0.7200    89.5    76.0
      13   92469    45.4             394    1500  0.7200    90.0    76.0
      14   90528    46.3             425    1000  0.7200    89.5    76.2
      15   88594    47.3             455    1500  0.7200    88.9    76.3
      16   88113    47.6             485    1500  0.7200    88.4    76.6
      17   87023    48.2             515    1500  0.7200    90.0    76.5
      18   86581    48.4             546    1500  0.7200    90.0    77.0
      19   85699    48.9             576    1500  0.7200    89.5    77.1
      20   84743    49.5             606    1000  0.7200    88.9    77.0
      21   84760    49.5             637    1000  0.7200    90.0    77.0
                                     667    1000  0.7200    88.4    77.2
                                     698    1000  0.7200    88.4    77.2
                                     728    1500  0.7200    89.5    77.3
                                     758    1000  0.7200    89.5    77.2
                                     789    1000  0.7200    90.0    77.3
                                     819    1500  0.7200    90.0    77.2
                                     849    1000  0.7200    90.0    77.2
                                     880    1500  0.7200    89.5    77.4
                                     910    1000  0.7200    89.5    77.4
                                     940    1500  0.9060    75.7    73.0

    Min    84743   28.96                    1000  0.7200    47.7    47.3
    Max   144811   49.49                    2146  0.9060    90.0    77.4
 Average                                    1380  0.7433    86.8    74.1


  


System Stress Tests

All these tests were run for 30 minutes, exercising the CPU, graphics and data input/output, and included my environment and VMSTAT performance monitors, the latter to validate the program MBytes per second measurements and confirm that CPU utilisation was at the expected near 100% level. A script file was used to ensure that the programs started at the same time. In most cases, performance was measured or sampled every 60 seconds.

An example script file is below, along with the commands to run the OpenGL program from a separate terminal, with VSYNC turned off to produce maximum frames per second (FPS).

Script File

lxterminal -e ./RPiHeatMHzVolts64 Passes 31 Seconds 60 Log 7 &
lxterminal -e ./INTitHOT64g12 threads 2, kBStress 64, Minutes 30, passCount 4000000, logNumber 7 &
lxterminal -e ./MP-FPUStress64g12 threads 2, kb 512, ops 32, Minutes 30, log 7 &
lxterminal -e sudo ./burnindrive264g12 Repeats 16, Minutes 27, Log 8, Seconds 1, F /media/raspberrypi/public/ray &
lxterminal -e sudo ./burnindrive264g12 Repeats 16, Minutes 27, Log 9, Seconds 1, F /media/raspberrypi/EXT3 &
lxterminal -e vmstat 60 30 . vmstat7.txt

Separate Terminal

export vblank_mode=0
./videogl64C12 Test 6 Minutes 30
  
Of particular note, the first set of tests identifies increases in CPU temperature up to 91.7C, with no fan running.

A more significant, if questionable, problem during the second set of tests was the disk program indicating errors and the drive temporarily dropping off line during a test with the fan operational. The errors were the same as on earlier runs using a 3 amps power supply, the present PoE connection supposedly providing 4 amps.

Monitoring the input power used, and that supplied for the USB drive, indicated that consumption was fairly constant between 2 and 15 minutes testing time, providing the following typical meter readings. These suggest that the disk drive might be more vulnerable to failure when the CPU is fully loaded, and that CPU MHz throttling might be useful if danger can be predicted.

     No Fan Poor CPU Performance   With Fan Good CPU Performance

         Power          USB             Power          USB      
     Volts   Amps  Volts   Amps     Volts   Amps  Volts   Amps  
     
      5.26   1.75   5.06   0.53      5.20   2.60   4.94   0.53  
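
The meter readings can be converted to watts to compare the two situations. This is a small illustrative sketch using only the volts and amps quoted above; the function and dictionary names are hypothetical.

```python
def watts(volts, amps):
    """Electrical power in watts from a volts and amps meter reading."""
    return volts * amps


# Typical readings from the table: (volts, amps) at the input and USB meters.
readings = {
    "no fan": {"input": (5.26, 1.75), "usb": (5.06, 0.53)},
    "with fan": {"input": (5.20, 2.60), "usb": (4.94, 0.53)},
}

for label, meters in readings.items():
    total = watts(*meters["input"])
    usb = watts(*meters["usb"])
    print(f"{label}: input {total:.1f} W, USB drive {usb:.1f} W")
```

On these figures the input power with the fan is roughly half as much again as without it, while the USB drive power is nearly unchanged at a slightly lower voltage.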

  


Light System Stress Test

The first sessions involved INTitHOT64g12, using 4 threads accessing 512 KB data, with a pass count to control minimum running time. Then, with this test, total running time was specified as 30 minutes, leading to fewer results when the CPU MHz was throttled. These MB/second results were allocated at two minute intervals. Other inclusions were burnindrive264g12 to a USB3 disk drive, plus videogl64C12 accessing the most demanding display test, producing FPS results every 30 seconds, with results provided at 60 second intervals, as shown in the detailed tables below.

Following are two sets of results for one run with the fan in use and one without the fan. On the bright side, these and a number of other tests, using the same parameters, ran without any issues. But CPU MHz throttling occurred in all cases.

Summaries

Minimum values are often isolated examples and can often be ignored. Best scores shown at the head of the table are from standalone runs. Maximum benchmark performance measurements suffer from being noted a minute after start time. Averages indicate significant reductions for the integer and OpenGL tests but little difference on disk drive data transfer speeds.

Of particular note is the CPU temperature measurement of 91.7C with the fan out of use.

                                                   VMSTAT 
                                          Integer   Disk    OpenGL
              MHz   Volts  CPU C PMIC C  MB/sec  KB/sec     FPS

    Best                                   145000   63000     102

 512 KB FAN
 Average     2128  0.8878    82.8    61.8   97568   60368    65.3
     Min     1500  0.7200    42.2    39.7   95281   59159    61.0
     Max     2400  0.9058    85.1    63.2  106457   61815    69.0

 512 KB NO FAN
 Average     1174  0.7260    88.7    77.0   55898   56081    40.0
     Min     1000  0.7200    56.0    53.7   45528   19941    33.0
     Max     2400  0.9058    91.7    79.5   79094   58095    58.0
 
 Average No Fan 
 %Reduction    45      18       7      20      43       7      39
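
The %Reduction row can be reproduced from the two Average rows. The sketch below assumes each percentage is the difference relative to the larger of the two averages, which is an inference from the numbers (it also covers the temperature columns, where the no fan value is the higher one). The names are illustrative only.

```python
# Average rows copied from the summary table above.
FAN = {"MHz": 2128, "Volts": 0.8878, "CPU C": 82.8, "PMIC C": 61.8,
       "Integer MB/sec": 97568, "Disk KB/sec": 60368, "OpenGL FPS": 65.3}
NO_FAN = {"MHz": 1174, "Volts": 0.7260, "CPU C": 88.7, "PMIC C": 77.0,
          "Integer MB/sec": 55898, "Disk KB/sec": 56081, "OpenGL FPS": 40.0}


def percent_difference(a, b):
    """Difference between two averages as a percentage of the larger one."""
    return 100 * abs(a - b) / max(a, b)


row = {name: round(percent_difference(FAN[name], NO_FAN[name]))
       for name in FAN}
for name, percent in row.items():
    print(f"{name:>15}: {percent}%")
```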
   



Light Test With Fan

Note that CPU temperature is shown to be more than 84C for most of the time.

 512 KB FAN
                                                  VMSTAT 
                                         Integer   Disk    OpenGL
  Seconds    MHz   Volts  CPU C PMIC C  MB/sec  KB/sec     FPS
 
       0    2400  0.9058    42.2    39.7
      60    2146  0.9058    84.5    59.5  106457   61815      69
     120    2146  0.9058    84.0    62.2           60132      68
     181    2201  0.9058    84.5    62.1           61054      66
     241    2366  0.9058    84.0    62.5   97930   60130      65
     301    2201  0.9058    85.1    62.4           60235      67
     362    2256  0.9058    84.0    62.8           60548      64
     422    2146  0.9058    84.0    62.5   96799   59701      65
     482    2146  0.9058    84.0    63.1           60461      67
     542    2201  0.9058    85.1    62.0           60175      66
     603    2146  0.7200    84.0    63.0   96761   60006      65
     663    2146  0.9058    85.1    61.9           61348      64
     723    2311  0.9058    84.5    62.8           59479      67
     784    2146  0.9058    84.5    62.9   97231   61585      64
     844    2146  0.7200    82.9    62.8           59742      64
     904    2146  0.9058    82.3    62.8           60262      66
     965    1500  0.9058    84.5    62.8   96604   61429      67
    1025    2366  0.9058    84.0    62.9           59341      65
    1086    1500  0.9058    84.0    62.3           60804      64
    1146    2201  0.9058    83.4    62.8   96213   59546      65
    1206    2256  0.9058    84.0    62.8           59360      64
    1267    2366  0.9058    84.5    63.2           61687      68
    1327    1500  0.9058    84.5    63.0   96053              64
    1387    2146  0.9058    84.5    62.8           59159      66
    1447    2146  0.9058    85.1    61.9           60655      65
    1508    1500  0.9058    84.5    62.9   96349              67
    1568    2400  0.7200    81.8    62.7           60491      66
    1629    2146  0.9058    85.1    62.1           59962      64
    1689    2400  0.9058    85.1    62.1   95281              63
    1749    2146  0.9058    84.0    62.3           60429      61
    1809    2146  0.9058    84.5    62.9           60390      64

Average     2128  0.8878    82.8    61.8   97568   60368    65.3
    Min     1500  0.7200    42.2    39.7   95281   59159    61.0
    Max     2400  0.9058    85.1    63.2  106457   61815    69.0
  


Light Test No Fan

Note that the CPU is running at 1000 MHz for much of the time, with CPU temperature around 90C and that for the Power Management Integrated Circuit more than 78C.

  512 KB NO FAN
 Seconds     MHz   Volts  CPU C PMIC C  MB/sec  KB/sec     FPS
       0    2400  0.9058    56.0    53.7
      60    1500  0.7200    86.2    69.5   79094   19941      58
     120    1500  0.7200    85.6    72.5           58012      52
     181    1500  0.7200    87.8    73.9           57754      50
     241    1500  0.7200    88.9    75.8   70129   56880      50
     301    1500  0.7200    89.5    76.9           57616      48
     362    1500  0.7200    89.5    77.0   64348   57313      45
     422    1000  0.7200    90.6    77.1           57850      44
     482    1500  0.7200    88.9    77.6   57341   57980      42
     543    1000  0.7200    89.5    78.2           57245      44
     603    1000  0.7200    90.0    78.1           57311      41
     663    1000  0.7200    90.0    78.2   53759   57391      39
     724    1000  0.7200    88.9    78.6           57486      37
     784    1000  0.7200    89.5    78.1           57786      38
     844    1000  0.7200    90.0    78.3   50933   57456      36
     905    1000  0.7200    90.0    78.5           57914      37
     965    1000  0.7200    90.6    78.7           56861      38
    1025    1000  0.7200    90.0    78.6   49921   57428      37
    1086    1500  0.7200    89.5    78.9           57705      36
    1146    1000  0.7200    90.6    78.9           57445      38
    1206    1000  0.7200    90.0    78.6   48803   57803      39
    1267    1000  0.7200    90.0    78.9           57618      36
    1327    1000  0.7200    90.0    79.1                      36
    1387    1000  0.7200    90.6    78.9   47790   57545      37
    1448    1000  0.7200    90.0    78.5           58095      36
    1508    1000  0.7200    90.6    79.4                      34
    1568    1000  0.7200    90.0    79.0   47234   57055      35
    1629    1000  0.7200    91.7    79.1           57110      35
    1689    1000  0.7200    91.1    79.5                      34
    1750    1000  0.7200    91.7    79.3   45528   56708      35
    1810    1000  0.7200    91.7    79.4           56874      33

Average     1174  0.7260    88.7    77.0   55898   56081    40.0
    Min     1000  0.7200    56.0    53.7   45528   19941    33.0
    Max     2400  0.9058    91.7    79.5   79094   58095    58.0
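
As a check on how the summary rows relate to the detail, the MHz column above can be reduced to its Average, Min and Max. This is a minimal sketch with the 31 sampled values typed in by hand in table order; the variable names are hypothetical.

```python
from statistics import mean

# MHz column of the no fan table, one value per 60 second sample, in order:
# start, six at 1500, then mostly 1000 with occasional returns to 1500.
mhz = ([2400] + [1500] * 6 + [1000, 1500] + [1000] * 9
       + [1500] + [1000] * 12)

summary = {"Average": round(mean(mhz)), "Min": min(mhz), "Max": max(mhz)}
print(summary)

# Share of samples at the lowest throttled frequency.
share_at_1000 = mhz.count(1000) / len(mhz)
print(f"{share_at_1000:.0%} of samples at 1000 MHz")
```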

  


Heavy System Stress Test

This session comprised INTitHOT64g12, with 2 threads at 64 KB, MP-FPUStress64g12 with 2 threads at 512 KB, burnindrive264g12 to a PC via Ethernet, burnindrive264g12 to a USB 3 disk drive and videogl64C12 as before. Detailed important results are provided for fan and no fan scenarios, with two for the former as the first one failed. Note that, compared with 4 thread results, those for 2 threads can be slower than expected as the main data source can be from L2 cache instead of L1.

On running these tests the main issue was that the second test failed due to data comparison failures on reading. The first indication was a system warning that the disk drive was no longer available but it was remounted. Following are examples of reported errors, similar to the earlier ones described above in Disk Drive Errors and Crashes. These were thought to have been caused by the inadequate 3 amps power supply. Also, see the comments in the initial System Stress Testing summary.

 Read passes    74 x 4 Files x  164.00 MB in    14.03 minutes
 Error reading file 1

 Wrong File Read szzztestz-3 instead of szzztestz1
 Error reading file 2

 Pass  76 file szzztestz1 word  1, data error was FFFFFFFD expected FFFFFFFB
 Pass  76 file szzztestz1 word  2, data error was FFFFFFFD expected FFFFFFFB
 
A summary of the three test sessions follows. As indicated above, power consumption was higher during the tests run with the fan operational, which reduced temperatures, enabling faster performance. Without the fan, MHz throttling, involving higher temperatures, reduced current demands with slower performance. It seems that power consumption was more important than system temperature when considering stability.

                                         Integer Floating OpenGL  & VMSTAT Program
             MHz   Volts  CPU C PMIC C  MB/sec  MFLOPS     FPS Disk MB/s LAN MB/s

   Best     2400                          114000   32000     102       63      36

Test 9 NO FAN
Average     1239  0.7312    88.7    77.5   38696   12361      39   Mainly      27
    Min     1000  0.7200    70.8    64.7   30093    9836      31    58-59
    Max     2400  0.9118    90.6    79.4   76652   22873      51
Test 10 FAN
Average     2288  0.9118    81.2    60.2   71940   24046      66    Error      27
    Min     2146  0.9118    42.8    40.5   64379   22518      61 
    Max     2400  0.9118    84.0    61.7   78453   27388      70
Test 11 FAN
Average     2276  0.9080    80.8    59.7   71794   24003      66   Mainly      27
    Min     1500  0.7950    41.7    38.8   59602   20594      60    57-58
    Max     2400  0.9118    84.0    61.4   82481   26551      72

Average No Fan
%Reductions   46      19       9      23      46      49      41      -2        0

 



Heavy Test No Fan

At 100% CPU utilisation, the following measurements were similar to those during the No Fan Light System Test, with the CPU running at 1000 MHz for much of the time, temperatures around 90C and that for the Power Management Integrated Circuit more than 78C.

 Test 9 NO FAN                           Integer Floating OpenGL  VMSTAT
  Second     MHz   Volts  CPU C PMIC C  MB/sec  MFLOPS     FPS Disk MB/s
       0    2400  0.9118    70.8    64.7
      60    1500  0.7200    85.6    72.5   76652   22873      51     0.3
     120    1500  0.7200    86.2    74.1   50138   15511      50    41.9
     180    1500  0.7200    88.4    75.8   44886   15027      48    58.8
     240    1500  0.7200    89.5    76.6   49106   15012      46    58.1
     300    1500  0.7200    88.9    77.2   44702   14215      45    59.6
     360    1000  0.7200    90.0    77.5   41739   12596      43    58.5
     420    1500  0.7200    89.5    77.6   41734   12524      43    59.3
     480    1000  0.7200    90.0    77.7   40211   12041      42    58.1
     540    1000  0.7200    90.0    78.0   39083   13329      41    58.4
     600    1500  0.7200    89.5    78.2   37814   12529      38    58.3
     660    1500  0.7200    90.0    78.2   36144   11875      38    58.5
     720    1000  0.7200    89.5    78.3   35741   11720      36    58.2
     780    1000  0.7200    90.6    78.5   37614   13467      38    58.5
     840    1000  0.7200    89.5    78.7   33104   10712      35    57.6
     900    1000  0.7200    90.0    78.6   39563   11029      38    58.6
     960    1000  0.7200    90.0    78.4   37259   11448      38    58.2
    1020    1000  0.7200    89.5    78.9   34469   11583      39    57.8
    1080    1000  0.7200    90.0    78.3   35970   11306      38    57.4
    1140    1500  0.7200    90.0    78.7   34045   12281      36    58.6
    1200    1000  0.7200    90.0    78.4   35297   10928      38    59.1
    1260    1500  0.7200    90.0    78.9   37365   12002      36    58.3
    1320    1000  0.7200    90.0    78.5   34004   11252      36    58.2
    1380    1000  0.7200    90.0    78.4   34892   11070      34    58.8
    1440    1000  0.7200    90.0    78.7   36255   10274      37    58.8
    1500    1000  0.7200    88.9    78.7   33912   11320      37    58.3
    1560    1500  0.7200    89.5    79.0   33513   11426      35    58.7
    1620    1000  0.7200    89.5    79.0   30093   10650      35    58.8
    1680    1000  0.7200    89.5    79.4   32852    9836      32    58.7
    1740    1000  0.7200    90.0    79.1   30465   10273      31   122.6
    1800    1500  0.8769    85.1    77.1   32262   10709      32   146.5

 Average    1239  0.7312    88.7    77.5   38696   12361      39
     Min    1000  0.7200    70.8    64.7   30093    9836      31
     Max    2400  0.9118    90.6    79.4   76652   22873      51
  


Heavy Test With Fan - FAILED

As shown initially below, system behaviour did not appear to be much different to that, at the same point, during the later successful test. However, these are instantaneous measurements that can be different in the next picosecond. Also I did note USB power measurements of 4.8 volts at 0.53 amps, compared with 4.94 and 0.53 quoted above. But this might be due to infrequent manual sampling.

                  Tests 10 and 11 at 900 seconds

T11   900   2366  0.9118    83.4    61.0   61490   24333      68    58.1
T10   900   2256  0.9118    83.4    61.5   70134   22929      61    59.1
 
 Test 10 FAN                             Integer Floating OpenGL  VMSTAT
  Second     MHz   Volts  CPU C PMIC C  MB/sec  MFLOPS     FPS Disk MB/s
       0    2400  0.9118    42.8    40.5
      60    2400  0.9118    79.0    55.6   70918   25009      65     9.5
     120    2201  0.9118    82.3    59.7   73729   23355      68    42.9
     180    2366  0.9118    82.9    60.9   68151   24311      67    59.5
     240    2311  0.9118    83.4    61.0   70410   23307      67    59.7
     300    2146  0.9118    82.9    61.0   73093   23714      65    58.6
     360    2311  0.9118    82.3    61.3   69355   22632      64    59.1
     420    2311  0.9118    82.9    61.5   74376   23902      62    59.1
     480    2311  0.9118    83.4    61.0   64379   23731      63    59.2
     540    2201  0.9118    82.9    61.4   72430   22757      66    58.4
     600    2201  0.9118    83.4    61.2   67268   25440      65    58.9
     660    2256  0.9118    82.9    61.7   70452   22864      66    58.2
     720    2311  0.9118    83.4    61.5   66588   22796      64    59.0
     780    2256  0.9118    82.9    61.4   71766   22518      64    59.5
     840    2146  0.9118    84.0    61.7   69162   23801      65    59.0
     900    2256  0.9118    83.4    61.5   70134   22929      61    59.1
     960    2201  0.9118    82.9    61.2   75122   24518      61    31.5
    1020    2400  0.9118    82.9    61.4   74535   23855      64     0.1 FAILED
    1080    2311  0.9118    82.9    61.0   74460   23832      62       0
    1140    2256  0.9118    82.9    61.0   71397   23861      64       0
    1200    2311  0.9118    83.4    61.0   75347   23264      64       0
    1260    2311  0.9118    82.3    61.0   72384   24361      62       0
    1320    2366  0.9118    83.4    61.5   74719   25401      70       2
    1380    2400  0.9118    82.3    61.2   71234   24356      69       0
    1440    2311  0.9118    83.4    61.4   73853   24652      67       0
    1500    2366  0.9118    82.9    61.3   71402   24619      66       0
    1560    2146  0.9118    84.0    61.4   78453   23417      70       0
    1620    2256  0.9118    84.0    61.0   71631   24961      70       0
    1680    2311  0.9118    82.9    61.0   74461   25101      69       0
    1740    2201  0.9118    83.4    61.3   73486   24737      69       0
    1800    2400  0.9118    70.3    57.1   73493   27388      68       0

 Average    2288  0.9118    81.2    60.2   71940   24046      66
     Min    2146  0.9118    42.8    40.5   64379   22518      61
     Max    2400  0.9118    84.0    61.7   78453   27388      70

  

Second Heavy Test With Fan

Here, performance did not vary much but there was some CPU MHz throttling. Perhaps the official fan will avoid this, and the new 5 amps power supply will overcome the observed undesirable power variations.

 Test 11 FAN                             Integer Floating OpenGL  VMSTAT
  Second     MHz   Volts  CPU C PMIC C  MB/sec  MFLOPS     FPS Disk MB/s
       0    2400  0.9118    41.7    38.8
      60    2400  0.9118    74.7    53.7   77484   26076      67     4.5
     120    2400  0.9118    81.8    58.7   82481   25011      72    42.3
     180    2400  0.9118    82.9    60.0   74579   26236      66    58.3
     240    2366  0.9118    81.8    60.1   69930   23368      63    57.7
     300    2311  0.9118    83.4    60.5   76266   22233      68    57.9
     360    2311  0.9118    83.4    60.7   72493   25286      66    58.7
     420    2311  0.9118    82.3    61.0   67909   23927      70    57.9
     480    2311  0.9118    83.4    60.8   73526   25794      63    57.6
     540    2256  0.9118    83.4    61.0   74888   26551      67    57.9
     600    2366  0.9118    82.9    61.0   74110   23912      66    57.4
     660    2256  0.9118    82.9    61.1   75024   25414      65    57.6
     720    2256  0.9118    82.9    61.0   59602   25025      65    59.1
     780    2256  0.9118    83.4    61.0   67930   22907      65    57.1
     840    2256  0.9118    84.0    61.0   71962   24011      67    58.2
     900    2366  0.9118    83.4    61.0   61490   24333      68    58.1
     960    2311  0.9118    82.3    61.1   63462   22888      65    58.2
    1020    2256  0.9118    83.4    61.0   67540   25537      68    57.3
    1080    2256  0.9118    82.9    61.0   70804   23791      66    57.8
    1140    2400  0.9118    83.4    61.0   71113   22011      64    57.5
    1200    2256  0.9118    82.3    61.4   77050   23111      70    58.7
    1260    2311  0.9118    83.4    61.0   73053   24148      63    57.7
    1320    2256  0.9118    82.3    60.9   74469   23307      66    57.6
    1380    2256  0.9118    83.4    61.2   72160   22726      66    58.2
    1440    2256  0.9118    82.3    60.9   73994   24276      66    59.5
    1500    2256  0.9118    83.4    61.0   72659   22260      67    56.9
    1560    2256  0.9118    82.9    61.2   74870   21866      68    57.8
    1620    2256  0.9118    83.4    61.0   76735   23945      66    57.5
    1680    2201  0.9118    83.4    60.9   70727   20594      66    57.6
    1740    2311  0.9118    83.4    61.2   65023   24760      63   123.7
    1800    1500  0.7950    64.2    55.4   70479   24786      60   158.3

 Average    2276  0.9080    80.8    59.7   71794   24003      66
     Min    1500  0.7950    41.7    38.8   59602   20594      60
     Max    2400  0.9118    84.0    61.4   82481   26551      72
  

Firefox, Bluetooth and YouTube

Whilst looking at numbers for this report, among other things, I had movies playing for a few hours via the readily accessible YouTube at 1080p HD. YouTube was accessed via Firefox, with Bluetooth sound played on a rechargeable speaker. Examples of MHz, volts and temperatures, with ondemand frequency scaling, were:

 Start at Fri Aug 25 10:33:03 2023

 Using 361 samples at 10 second intervals

 Seconds
    0.0   ARM MHz=1500, core volt=0.9065V, CPU temp=47.2C, pmic temp=42.3C
   10.0   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3C, pmic temp=42.5C
   20.1   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3C, pmic temp=42.3C
   30.2   ARM MHz=2400, core volt=0.9065V, CPU temp=48.8C, pmic temp=42.7C

 1028.3   ARM MHz=1500, core volt=0.9065V, CPU temp=43.9C, pmic temp=40.7C
 1038.4   ARM MHz=2400, core volt=0.9065V, CPU temp=46.6C, pmic temp=41.0C
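The sample lines above have a regular layout, so a log of this kind can be summarised with a short script. The following is a minimal Python sketch, not the program used for this report. The function name parse_samples and the regular expression are illustrative assumptions, and the six embedded lines are copied from the listing above.

```python
import re

# Pattern for one sample line in the layout shown above (an assumption based
# on the six example lines, not on the monitoring program's source).
SAMPLE = re.compile(
    r"\s*(?P<secs>[\d.]+)\s+ARM MHz=(?P<mhz>\d+), core volt=(?P<volts>[\d.]+)V, "
    r"CPU temp=(?P<cpu>[\d.]+)C, pmic temp=(?P<pmic>[\d.]+)C"
)

def parse_samples(text):
    """Return one dict of float values per recognised sample line."""
    rows = []
    for line in text.splitlines():
        match = SAMPLE.match(line)
        if match:
            rows.append({key: float(val) for key, val in match.groupdict().items()})
    return rows

LOG = """\
    0.0   ARM MHz=1500, core volt=0.9065V, CPU temp=47.2C, pmic temp=42.3C
   10.0   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3C, pmic temp=42.5C
   20.1   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3C, pmic temp=42.3C
   30.2   ARM MHz=2400, core volt=0.9065V, CPU temp=48.8C, pmic temp=42.7C
 1028.3   ARM MHz=1500, core volt=0.9065V, CPU temp=43.9C, pmic temp=40.7C
 1038.4   ARM MHz=2400, core volt=0.9065V, CPU temp=46.6C, pmic temp=41.0C
"""

rows = parse_samples(LOG)
max_cpu_temp = max(row["cpu"] for row in rows)
samples_at_2400 = sum(1 for row in rows if row["mhz"] == 2400)

print(f"{len(rows)} samples, max CPU temp {max_cpu_temp}C, "
      f"{samples_at_2400} samples at 2400 MHz")
```

With the full 361 sample log, the same summary would show how often ondemand scaling dropped the clock back to 1500 MHz.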

  
Pi 5 Bluetooth sound levels were not loud enough for me. They were significantly louder from a side by side Pi 400. This applied to YouTube movies and to local music played via VLC media player.


Pi 5 The Vector Processor including whetv64SPg12 and whetv64DPg12

During the 1980s and early 90s I was responsible for evaluation and acceptance testing of supercomputers for the UK government, and of those centrally funded for universities. For multiple user development, the latter were particularly interested in vector versus scalar performance. I converted my Fortran scalar Whetstone benchmark to one where every test function could vectorise, with a default vector length of 256 words.

The vector version was finely tuned, hands on, on Cray 1 serial 1, which was at Didcot Rutherford Laboratory for a time. Its first real use was during factory and site trials of the first UK full scale Cray 1. Next came the first CDC Cyber 205 and, last, I attended user benchmark tests in Japan for ULCC at NEC and Fujitsu, where my benchmarks were also run.

I recompiled the scalar and vector C Whetstone benchmarks on the Pi 5, using gcc 12. The scalar results were effectively the same as those from gcc 8, quoted earlier in this topic. Results for the single and double precision vector versions were as follows. Note that the N5 and N8 tests, which use functions such as sin, cos, exp and sqrt (both executed at DP), mainly determine the final MWIPS rating.

The gcc 12 vector benchmark was also run on the Pi 4, to compare like with like. Then, for the three main MFLOPS measurements, the Pi 5 was effectively 3.1 times faster for both single and double precision operation. For both systems, double precision MFLOPS results were effectively half those at single precision, as expected with SIMD vector operation.

        Pi 4 GCC 12 SP
        Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sun Dec 10 17:42:10 2023

        Loop content          Result          MFLOPS  MOPS Seconds

        N1 floating point    -1.13316142559051  2387           0.4
        N2 floating point    -1.13312149047851  2407           2.8
        N3 if then else       1.00000000000000        7428     0.7
        N4 fixed point       12.00000000000000        1736     9.0
        N5 sin,cos etc.       0.49998238682747          79    52.2
        N6 floating point     0.99999982118607  2577          10.4
        N7 assignments        3.00000000000000       10223     0.9
        N8 exp,sqrt etc.      0.75002217292786          78    23.7

        MWIPS                                   4955         100.0

        Pi 4 GCC 12 DP
        Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sun Dec 10 17:47:48 2023

        Loop content          Result          MFLOPS  MOPS Seconds

        N1 floating point    -1.13314558088707  1164           0.7
        N2 floating point    -1.13310306766606  1173           4.9
        N3 if then else       1.00000000000000        7424     0.6
        N4 fixed point       12.00000000000000        1735     7.8
        N5 sin,cos etc.       0.49998080312724          76    47.0
        N6 floating point     0.99999988868927  1295          18.0
        N7 assignments        3.00000000000000        5325     1.5
        N8 exp,sqrt etc.      0.75002006515491          83    19.4

        MWIPS                                   4314         100.0

        Pi 5 GCC 12 SP
        Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sat Oct  7 10:46:30 2023

        Loop content          Result          MFLOPS  MOPS Seconds   Pi 5/4

        N1 floating point    -1.13316142559051  7393           0.3     3.10
        N2 floating point    -1.13312149047851  7365           2.0     3.06
        N3 if then else       1.00000000000000       14169     0.8     1.91
        N4 fixed point       12.00000000000000        2399    14.5     1.38
        N5 sin,cos etc.       0.49998238682747         177    51.7     2.24
        N6 floating point     0.99999982118607  8079           7.4     3.13
        N7 assignments        3.00000000000000       26419     0.8     2.58
        N8 exp,sqrt etc.      0.75002217292786         178    23.0     2.29

        MWIPS                                  10975         100.3     2.21

        Pi 5 GCC 12 DP
        Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sat Oct  7 10:50:40 2023

        Loop content          Result          MFLOPS  MOPS Seconds   Pi 5/4

        N1 floating point    -1.13314558088707  3603           0.5     3.10
        N2 floating point    -1.13310306766606  3620           3.6     3.09
        N3 if then else       1.00000000000000       14168     0.7     1.91
        N4 fixed point       12.00000000000000        2399    12.9     1.38
        N5 sin,cos etc.       0.49998080312724         172    47.5     2.25
        N6 floating point     0.99999988868927  3998          13.3     3.09
        N7 assignments        3.00000000000000       13172     1.4     2.47
        N8 exp,sqrt etc.      0.75002006515491         183    20.0     2.21

        MWIPS                                   9830          99.9     2.28
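The 3.1 times gain and the halving of speed at double precision can be checked from the N1, N2 and N6 MFLOPS figures in the four tables above. This is a small Python sketch of that arithmetic. The dictionary layout and the mean_ratio helper are illustrative, and only the numbers come from the tables.

```python
# MFLOPS for the three floating point tests, copied from the tables above.
MFLOPS = {
    "Pi 4 SP": {"N1": 2387, "N2": 2407, "N6": 2577},
    "Pi 4 DP": {"N1": 1164, "N2": 1173, "N6": 1295},
    "Pi 5 SP": {"N1": 7393, "N2": 7365, "N6": 8079},
    "Pi 5 DP": {"N1": 3603, "N2": 3620, "N6": 3998},
}

def mean_ratio(numerator, denominator):
    """Average of the per-test MFLOPS ratios between two table entries."""
    tests = MFLOPS[numerator].keys()
    ratios = [MFLOPS[numerator][t] / MFLOPS[denominator][t] for t in tests]
    return sum(ratios) / len(ratios)

pi5_over_pi4_sp = mean_ratio("Pi 5 SP", "Pi 4 SP")
pi5_over_pi4_dp = mean_ratio("Pi 5 DP", "Pi 4 DP")
dp_over_sp_pi5 = mean_ratio("Pi 5 DP", "Pi 5 SP")
dp_over_sp_pi4 = mean_ratio("Pi 4 DP", "Pi 4 SP")

print(f"Pi 5 / Pi 4  SP {pi5_over_pi4_sp:.2f}  DP {pi5_over_pi4_dp:.2f}")
print(f"DP / SP      Pi 5 {dp_over_sp_pi5:.2f}  Pi 4 {dp_over_sp_pi4:.2f}")
```

Individual ratios recomputed from the rounded MFLOPS values can differ from the Pi 5/4 column by 0.01, presumably because that column was derived from unrounded measurements.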
  



Example Of Vector Instructions Compiled

The following instructions are for the first single precision test function, covering what is probably the key part. Maximum speed of operation would come from a long sequence of fused multiply and add or subtract instructions (fmla or fmls), which can produce 8 results per clock cycle for each linked vector pipeline. The disassembled code has too many non-arithmetic instructions, resulting in just over 3 operations per clock cycle on the Pi 5.

.L11:   add     x0, x0, 16
        ldr     q4, [x0, -16]
        ldr     q0, [x0, 4816]
        ldr     q9, [x0, 9648]
        fadd    v4.4s, v0.4s, v4.4s
        ldr     q8, [x0, 14480]
        fadd    v4.4s, v4.4s, v9.4s
        fsub    v4.4s, v4.4s, v8.4s
        fmla    v0.4s, v1.4s, v4.4s
        fsub    v0.4s, v0.4s, v9.4s
        fadd    v0.4s, v0.4s, v8.4s
        fmul    v0.4s, v0.4s, v1.4s
        fneg    v2.4s, v0.4s
        mov     v5.16b, v0.16b
        mov     v3.16b, v0.16b
        fmla    v2.4s, v1.4s, v4.4s
        fmls    v5.4s, v1.4s, v4.4s
        fmla    v3.4s, v1.4s, v4.4s
        fadd    v2.4s, v2.4s, v9.4s
        mov     v4.16b, v5.16b
        fadd    v2.4s, v2.4s, v8.4s
        fmla    v4.4s, v2.4s, v1.4s
        fmla    v3.4s, v2.4s, v1.4s
        fadd    v4.4s, v4.4s, v8.4s
        fmls    v3.4s, v4.4s, v1.4s
        fmul    v3.4s, v3.4s, v1.4s
        fadd    v0.4s, v3.4s, v0.4s
        str     q3, [x0, -16]
        fmls    v0.4s, v2.4s, v1.4s
        fmla    v0.4s, v4.4s, v1.4s
        fmul    v0.4s, v0.4s, v1.4s
        fsub    v5.4s, v3.4s, v0.4s
        str     q0, [x0, 4816]
        fsub    v0.4s, v0.4s, v3.4s
        mov     v3.16b, v5.16b
        fmla    v3.4s, v2.4s, v1.4s
        mov     v2.16b, v3.16b
        fmla    v2.4s, v4.4s, v1.4s
        fmul    v2.4s, v2.4s, v1.4s
        fadd    v0.4s, v0.4s, v2.4s
        str     q2, [x0, 9648]
        fmla    v0.4s, v4.4s, v1.4s
        fmul    v0.4s, v0.4s, v1.4s
        str     q0, [x0, 14480]
        cmp     x0, x22
        bne     .L11  
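The instruction mix in this loop can be tallied to see why it falls short of the fused multiply-add maximum. The Python sketch below uses the mnemonics transcribed from the listing above. It assumes that the loop corresponds to test N1 in the Pi 5 single precision table (7393 MFLOPS at 2400 MHz), that fmla and fmls count as two operations per lane, and that fneg is not counted. The benchmark's own operation count may be defined differently.

```python
from collections import Counter

# Mnemonics of the loop shown above, transcribed in listing order.
LOOP = """
add ldr ldr ldr fadd ldr fadd fsub fmla fsub fadd fmul fneg mov mov
fmla fmls fmla fadd mov fadd fmla fmla fadd fmls fmul fadd str fmls
fmla fmul fsub str fsub mov fmla mov fmla fmul fadd str fmla fmul str
cmp bne
""".split()

counts = Counter(LOOP)
fused = counts["fmla"] + counts["fmls"]                    # two operations per lane
simple = counts["fadd"] + counts["fsub"] + counts["fmul"]  # one operation per lane
other = len(LOOP) - fused - simple  # loads, stores, moves, fneg, loop control

LANES = 4  # the .4s suffix means four single precision lanes per register
ops_per_pass = LANES * (2 * fused + simple)  # assumption: fneg not counted

# Assumption: this loop is test N1 from the Pi 5 SP table above.
measured_ops_per_clock = 7393 / 2400

print(f"{len(LOOP)} instructions: {fused} fused, {simple} simple FP, {other} other")
print(f"{ops_per_pass} floating point operations per pass")
print(f"{measured_ops_per_clock:.2f} operations per clock cycle measured")
```

On this count, only about a quarter of the instructions are fused multiply-adds, which is consistent with the measured rate being well below 8 per clock cycle.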

Comparison With Old Supercomputers

The following are scalar and vector Whetstone benchmark results for the original supercomputers. In the 1980s these results provided a useful tool for confirming the choice of system for university work with multiple user access, typically with programs containing 90% vectorisable code. The choices then depended on scalar versus vector performance and on multiple processors versus multiple pipelines.

Pi 5 results are included and can look good on a per MHz basis. See the next section for comparisons, including those for the benchmark originally used to validate performance of the first Cray 1 supercomputer.

                                                     Vector
                         Scalar        Vector       /Scalar
                     MHz  MWIPS MFLOPS  MWIPS MFLOPS MFLOPS   DATE

Cray 1                80   16.2    5.9     98     47    8.0   1978
CDC Cyber 205         50   11.9    4.9    161     57   11.7   1981
Cray XMP1            118   30.3   11.0    313    151   13.7   1982
Cray 2/1             244   25.8    N/A    425    N/A          1984
Amdahl VP 500   #    143   21.7    7.5    250    103   13.8   1984
Amdahl VP 1100  #    143   21.7    7.5    374    146   19.5   1984
Amdahl VP 1200  #    143   21.7    7.5    581    264   35.3   1984
IBM 3090-150 VP       54   12.1    4.9     60     17    3.6   1986
(CDC) ETA 10E         95   15.7    6.5    335    124   19.2   1987
Cray YMP1            154   31.0   12.0    449    195   16.3   1987
Fujitsu VP-2400/4    312   71.7   25.4   1828    794   31.3   1991
NEC SX-3/11          345   42.9   17.0   1106    441   25.9   1991
NEC SX-3/12          345   42.9   17.0   1667    753   44.3   1991
                # Fujitsu Systems

Raspberry Pi 5 SP   2400   5843   1206  10986   7599    6.3   2023
Raspberry Pi 5 DP   2400    N/A    N/A   9816   3731    3.1   2023
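The Vector/Scalar column and the per MHz comparison mentioned above are simple ratios of the tabulated figures. A short Python sketch follows, using a few rows copied from the table. The selection of rows and the variable names are illustrative only.

```python
# (MHz, scalar MFLOPS, vector MFLOPS) copied from the table above.
SYSTEMS = {
    "Cray 1": (80, 5.9, 47),
    "Cray XMP1": (118, 11.0, 151),
    "NEC SX-3/12": (345, 17.0, 753),
    "Raspberry Pi 5 SP": (2400, 1206, 7599),
}

vector_over_scalar = {}
vector_per_mhz = {}
for name, (mhz, scalar, vector) in SYSTEMS.items():
    vector_over_scalar[name] = vector / scalar
    vector_per_mhz[name] = vector / mhz

for name in SYSTEMS:
    print(f"{name:18s} vector/scalar {vector_over_scalar[name]:5.1f}   "
          f"vector MFLOPS per MHz {vector_per_mhz[name]:5.2f}")
```

The Pi 5 has a lower vector to scalar ratio than the supercomputers listed here, but the highest vector MFLOPS per MHz of the rows used.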
  



PC and Pi Performance Comparisons

The following results are for the original Classic Benchmarks, comprising Livermore Loops, Linpack 100 and Whetstone applications, for PCs from 1991 and the Pi 5. They tended to be produced by the latest compiler version available at the time. These probably represent best case Pi 5 comparative performance, mainly better than the Core i5 CPU on a per MHz basis.

To be fair, the later MP-MFLOPS results, included below, reflect the other extreme via SIMD vector performance. However, my present compiling procedures might be confusing for a newbie. For the Pi 5, the compiling parameters used for all programs were -O3 and -march=armv8-a, for optimisation level 3 using the armv8-a architecture. For Intel, the method I adopted requires inclusion of compile directives for options such as SSE, AVX, AVX2 or AVX512.

For those who only consider maximum performance, the Intel based PC MP-MFLOPS speeds are indicated as being far superior. But on a MFLOPS per MHz basis, the Pi 5 results were between Intel SSE and AVX measurements. Considering these and repeated runs, the Core i5 CPUs (on a laptop in this case) appear to be running at a lower MHz, using 4 threads or more.
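The per MHz comparison and the Core i5 scaling observation can be reproduced from the MP-MFLOPS table further down this section. The following Python sketch uses the Core i5 and Pi 5 rows at 1 and 4 threads. The names are illustrative, and the scaling figure is only a rough indicator of clock speed reduction, since other factors can limit multithreaded speed.

```python
# (MHz, MFLOPS with 1 thread, MFLOPS with 4 threads) from the MP-MFLOPS table.
RESULTS = {
    "Core i5 SSE": (4150, 33273, 86194),
    "Core i5 AVX": (4150, 64946, 153955),
    "Pi 5": (2400, 21519, 80947),
}

per_mhz_1 = {name: one / mhz for name, (mhz, one, four) in RESULTS.items()}
per_mhz_4 = {name: four / mhz for name, (mhz, one, four) in RESULTS.items()}
scaling_4 = {name: four / one for name, (mhz, one, four) in RESULTS.items()}

for name in RESULTS:
    print(f"{name:12s} MFLOPS/MHz 1 thread {per_mhz_1[name]:5.1f}  "
          f"4 threads {per_mhz_4[name]:5.1f}  4 thread speedup {scaling_4[name]:4.2f}")
```

The Pi 5 gains close to four times from four threads, while the Core i5 gains around 2.4 to 2.6 times, in line with the suggestion that the laptop CPU runs at a lower MHz with 4 threads or more.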

Given an application mainly running 4 core vector MP-MFLOPS type code and a much smaller part executing the slow Whetstone scalar MFLOPS type functions, the Pi 5 can appear to be faster than that Core i5 PC. This is shown in the (tongue in cheek) example performance calculations below. Note the Pi 5 / Cray 1 comparisons, particularly for Livermore Loops, the benchmark originally run to validate the required performance of the first Cray 1 system. Here, Gmean MFLOPS was the official average, where the Raspberry Pi 5 is indicated as being 194 times faster.

                                                                           LOOPS Gmean
                            LLLOOPS MFLOPS    MFLOPS  MWIPS MFLOPS Device  MFLOPS
CPU                  MHz    Max  Gmean    Min Linpack Whets  Whets   Year per MHz
Main Columns          V             V             V             V

Cray 1                80   82.1   11.9    1.2     27   16.2    6.0   1978    0.15

Windows or Linux PCs
AMD 80386             40    1.2    0.6    0.2    0.5    5.7    0.8   1991    0.02
80486 DX2             66    4.9    2.7    0.7    2.6     15    3.3   1992    0.04
Pentium               75     24    7.7    1.3    7.6     48     11   1994    0.10
Pentium              100     34     12    2.1     12     66     16   1994    0.12
Pentium              200     66     22    3.8           132     31   1996    0.11
AMD K6               200     68     22    2.7     23    124     26   1997    0.11
Pentium Pro          200    121     34    3.6     49    161     41   1995    0.17
Pentium II           300    177     51    5.5     48    245     61   1997    0.17
AMD K62              500    172     55    6.0     46    309     67   1999    0.11
Pentium III          450    267     77    8.3     62    368     92   1999    0.17
Pentium 4           1700   1043    187     19    382    603    146   2002    0.11
Athlon Tbird        1000   1124    201     23    373    769    161   2000    0.20
Core 2              1830   1650    413     40    998   1557    374   2007    0.23
Core i5             2300   2326    438     35   1065   1813    428   2009    0.19
Athlon 64           2150   2484    447     48    812   1720    355   2005    0.21
Phenom II           3000   3894    644     64   1413   2145    424   2009    0.21
Core i7 930         3066   2751    732     68   1765   2496    576   2010    0.24
Core i7 4820K       3900   5508   1108     88   2680   3114    716   2013    0.28
Core i5 1135G7      4150   7505   1387     92   3541   3293    802   2021    0.33
Linux PCs AVX New Compiler
Core i7 4820K       3900  12878   2615    597   5098   5887   1174   2013    0.67
Core i5 1135G7      4150  19794   3568    943   6998   6477   1077   2021    0.86

Raspberry Pi         700    140     55     17     42    271     94   2013    0.08
Raspberry Pi 2B      900    248    115     42    120    525    244   2015    0.13
Raspberry Pi 3B     1200    436    184     56    180    725    324   2016    0.15
Raspberry Pi 4B     1500   1861    679    180    957   1883    415   2019    0.35
Raspberry Pi 4B 64b 1500   2491    730    212   1060   2269    476   2019    0.35
Raspberry Pi 5  64b 2400  10577   2308    734   4136   5843   1206   2023    0.96

Core i5 / Pi 5      1.73   1.87   1.55   1.28   1.69   1.11   0.89           0.90

Pi 5 / Cray 1         30    129    194    612    153    361    201

#################################################################################
MP-MFLOPS              -----------MFLOPS------------  ------MFLOPS/MHz------
Threads          MHz       1       2       4       8     1     2     4     8

Core i7 SSE     3900   23355   46883   88776  119313   6.0  12.0  22.8  30.6
Core i7 AVX     3900   45459   91277  172443  184765  11.7  23.4  44.2  47.4
Core i5 SSE     4150   33273   64727   86194  119426   8.0  15.6  20.8  28.8
Core i5 AVX     4150   64946  128515  153955  225265  15.6  31.0  37.1  54.3
Core i5 AVX512  4150   94417  185785  324870  325915  22.8  44.8  78.3  78.5

Pi 5            2400   21519   42488   80947   85086   9.0  17.7  33.7  35.5

#################################################################################
   Performance Calculations

                 i5 SSE        i5 AVX         Pi 5
    MOPS   MFLOPS  secs  MFLOPS  secs  MFLOPS  secs

    5000     1077  4.64    1077  4.64    1206  4.15
   50000    86194  0.58  153955  0.32   80947  0.62
   Total           5.22          4.96          4.77
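The performance calculations are simply millions of operations divided by MFLOPS for each part of the hypothetical workload. A minimal Python sketch of the arithmetic follows, with the workload sizes and speeds copied from the table above. The dictionary names are illustrative.

```python
# Millions of operations in each part of the hypothetical application.
WORKLOAD = {"scalar": 5000, "vector": 50000}

# MFLOPS for each part: Whetstone scalar MFLOPS and 4 thread MP-MFLOPS.
SPEEDS = {
    "i5 SSE": {"scalar": 1077, "vector": 86194},
    "i5 AVX": {"scalar": 1077, "vector": 153955},
    "Pi 5": {"scalar": 1206, "vector": 80947},
}

seconds = {
    name: {part: WORKLOAD[part] / mflops[part] for part in WORKLOAD}
    for name, mflops in SPEEDS.items()
}
totals = {name: sum(parts.values()) for name, parts in seconds.items()}

for name in SPEEDS:
    print(f"{name:7s} scalar {seconds[name]['scalar']:.2f}  "
          f"vector {seconds[name]['vector']:.2f}  total {totals[name]:.2f} secs")
```

Totals computed from unrounded times can differ by 0.01 seconds from those in the table, which sum the rounded values. The ordering is unchanged, with the Pi 5 ahead because the slow scalar part dominates the elapsed time.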
 



New 5 Amps Power Supply and Active Cooler

CPU Stress Tests

The fan on my new active cooler did not spin; I might have broken the JST connector while trying to insert the fiddly little thing. However, I have run some stress tests by plonking my cheap old Pi 4 fan on top of the dead new one. That and the new heatsink appear to do a good job and might be recommended as a useful backup arrangement.

Below are temperature graphs of my earlier integer and floating point tests using 64 KB and 512 KB of data. Maximum 4 thread performance was 73 GFLOPS for both floating point tests. For integers it was 240 GB/second at 64 KB then 160 GB/second at 512 KB, the latter being the hottest with data transfers reading from L2 cache as opposed to L1 at 64 KB.

The (part) active cooler graph indicates less than 80C for all measurements, others demonstrating constant maximum CPU MHz and performance. The other graph only covers the integer tests, with and without the old Pi 4 fan. Then, using 64 KB with the fan, CPU MHz throttling was just about avoided. When running without an operational fan, it is commendable that the Pi 5 can continue running at those high temperatures, where throttled performance can be shown to be far superior to that of a super cooled Pi 4.

[Temperature graphs: Pi5 Old and Pi5]


  


Heavy System Stress Test

This is a repeat of the above, comprising INTitHOT64g12, with 2 threads at 64 KB, MP-FPUStress64g12 with 2 threads at 512 KB, burnindrive264g12 to a PC via Ethernet, burnindrive264g12 to a USB 3 disk drive and videogl64C12. They were run with the Active Cooler enabled, initially using the new 5 amps power supply, then controlled by the 4 amps PoE arrangement. The two drive MB/second results are reading speeds, the second being for repetitive reading of the same blocks, representing bus speed where the drive has a buffer.

There were some differences in results between the two sessions at 5 amps, but nothing unusual for a mixed workload. The first test at 4 amps failed, as earlier, with disk reading errors being recorded, this time after 100 seconds. The second one at 4 amps ran successfully, essentially providing the same levels of performance as those at 5 amps. For the first 4 amps test, the benchmark results that were recorded indicated slower performance.

There were noticeable differences in measured power where the input level was less than 5 volts, using the 4 amps supply. For some inexplicable reason, the failed test input current recording was particularly low.

An additional test was run excluding the floating point program, using the 4 amps power supply and a 512 KB data size for INTitHOT via 4 threads. The latter is slower than at 64 KB but requires a higher current and produces a higher CPU temperature. Higher USB voltage might have helped in avoiding disk errors.

                           INT      MP                 CPU  PMIC OpenGL Drive   LAN
      Volts  Amps       MB/sec  MFLOPS   MHz   Volts    C    C   FPS   MB/s  MB/s
5A Supply
Power  5.15  2.38  Min   62371   19494  2400  0.8833  37.8  40.0  59.0   52.8  35.1
USB    4.92  0.53  Avg   75234   24713  2400  0.8833  63.5  62.4  64.4  117.7  36.7
                   Max   89243   28868  2400  0.8833  67.5  65.0  68.0

Repeat             Min   63097   23625  2400  0.8833  38.4  40.1  60.0   58.5  28.6
                   Avg   77075   25451  2400  0.8833  64.4  62.8  66.4  159.1  31.7
                   Max   89625   27352  2400  0.8833  68.6  66.0  71.0
4A Supply
Power  4.88  1.98  Min   56159   18062  2400  0.7200  37.3  37.9  44.0    N/A  31.3
USB    4.71  0.54  Avg   63134   20087  2400  0.8567  51.5  49.9  56.6    N/A   N/A
FAILED             Max   69947   23773  2400  0.8840  59.8  57.2  70.0

Repeat
Power  4.84  2.39  Min   63472   22513  2400  0.8840  37.8  39.5  59.0   52.6  30.1
USB    4.71  0.54  Avg   76104   25127  2400  0.8840  59.4  58.4  64.7  159.0  32.2
                   Max   84488   27214  2400  0.8840  62.6  60.7  70.0

4A Supply
Power  5.07  2.74  Min   95040          2400  0.8833  35.1  38.6  50.0   57.3  28.6
USB    4.81  0.53  Avg  100302          2400  0.8833  65.0  64.3  61.9  156.8  31.4
                   Max  104684          2400  0.8833  69.2  67.2  66.0
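The differences in measured power can be put in watts by multiplying the recorded volts and amps. The Python sketch below does this for the main power input readings in the table above. The run labels are my own shorthand for the table sections, and the last entry is the additional test without the floating point program.

```python
# (label, volts, amps) for the main power input, copied from the table above.
POWER_INPUT = [
    ("5A supply", 5.15, 2.38),
    ("4A supply, failed", 4.88, 1.98),
    ("4A supply, repeat", 4.84, 2.39),
    ("4A supply, INTitHOT 512 KB only", 5.07, 2.74),
]

watts = {label: volts * amps for label, volts, amps in POWER_INPUT}

for label, volts, amps in POWER_INPUT:
    print(f"{label:32s} {volts:.2f} V x {amps:.2f} A = {watts[label]:5.2f} W")

lowest = min(watts, key=watts.get)
print("Lowest input power:", lowest)
```

On these figures the failed run drew under 10 watts, against roughly 11.6 to 13.9 watts for the runs that completed.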

  



Solid State Hard Drive

I obtained another Pi 5 at the same time as the 5 amps power supply and active cooler. I had overstressed the original board, creating an irrecoverable hardware failure. This occurred on plugging in a new Solid State Drive, where tests indicated power supply irregularities. It is a SanDisk 1TB Extreme Portable SSD, USB-C USB 3.2 Gen 2, External NVMe Solid State Drive up to 1050 MB/s, now with FAT32 and Ext3 partitions. I quite rightly completed all other proposed tests before returning to those for the SSD, this time with the active cooler in use.

I repeated the last heavy stress test via both the 5 amps and 4 amps power supplies. The results indicate around a 10% increase in USB current, with slightly faster operation at 4 amps but at a higher temperature. A few more runs would be required to determine the truth.

With these particular drives, SSD reading speed was around 2.45 times faster.

                           INT      MP                 CPU  PMIC OpenGL Drive
      Volts  Amps       MB/sec  MFLOPS   MHz   Volts    C    C   FPS   MB/s
5A Supply SSD
Power  5.12  2.74  Min   94755          2400  0.8838  36.7  40.2  60.0  146.7
USB    4.80  0.59  Avg   96325          2400  0.8838  64.8  64.6  64.7  166.1
                   Max  109008          2400  0.8838  68.6  68.3  69.0
4A Supply SSD
Power  5.12  2.95  Min  109197          2400  0.8830  38.4  41.7  64.0  148.5
USB    4.84  0.59  Avg  111188          2400  0.8830  67.7  67.9  67.2  168.4
                   Max  119425          2400  0.8830  71.9  71.1  70.0  
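The USB current and speed comparisons can be checked with a few ratios. This Python sketch assumes that the hard drive reference is the last run in the earlier Heavy System Stress Test table (USB 4.81 volts, 0.53 amps). The variable names are illustrative.

```python
# USB (volts, amps) readings: hard drive run from the earlier table (assumed
# to be the reference), then the two SSD runs from the table above.
HDD_USB = (4.81, 0.53)
SSD_USB_5A = (4.80, 0.59)
SSD_USB_4A = (4.84, 0.59)

usb_current_increase = SSD_USB_5A[1] / HDD_USB[1] - 1.0

# Average INT MB/sec and average CPU temperature for the two SSD runs.
INT_AVG = {"5A": 96325, "4A": 111188}
CPU_TEMP_AVG = {"5A": 64.8, "4A": 67.7}

speed_gain_4a = INT_AVG["4A"] / INT_AVG["5A"] - 1.0
temp_rise_4a = CPU_TEMP_AVG["4A"] - CPU_TEMP_AVG["5A"]

print(f"USB current increase with SSD: {usb_current_increase:.1%}")
print(f"INT speed gain at 4A: {speed_gain_4a:.1%}, "
      f"CPU temperature rise: {temp_rise_4a:.1f}C")
```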
 
DriveSpeed and LanSpeed I/O Benchmarks

As indicated in I/O Benchmarks above, there are two varieties of the original drive benchmark, DriveSpeed using Direct I/O and LanSpeed without that option. The former would not run via 64 bit OS software, and extra large files have to be selected to avoid caching data when using the latter.

The first of the following results is for LanSpeed using Ext3 formatted files, where one of the 4096 MB files appears to have been partially cached, with this not being identified in vmstat sampling. Note that USB power consumption was up to 640 mA at 5.14 volts.

The second set of details comprises partial results from running DriveSpeed on a FAT32 partition, where writing large files was slower than during the Ext3 test but reading was similar. The main observation is the exceptionally slow speed on handling small files, particularly on writing. Partition size was around 500 GB.

New Benchmark Large Files above indicates best USB 3 hard drive results such as 30 MB/second writing and 310 MB/second reading. Results for that benchmark on the SSD were around 165 and 415 MB/second respectively.

   LanSpeed RasPi 64 Bit gcc 8 Tue Dec 26 12:49:03 2023
 
 Selected File Path: /media/raspberrypi/Ext3/ Total MB  491955, Free MB  491955

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

4096   491.86   393.63   360.86   416.77   937.70   420.40
8192   407.49   364.13   365.28   579.91   412.14   411.16

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.002    0.002    0.002     0.52     0.49     0.48

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec    139.48    34.81   100.02   479.48   558.20  1353.81
 ms/file     0.03     0.24     0.16     0.01     0.01     0.01    0.019

                End of test Tue Dec 26 12:52:22 2023

procs  -----------memory---------- ---swap--  -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff  cache   si   so     bi     bo   in   cs us sy id wa st
 1  3      0 6805744 182608 752752    0    0      0 413554 3775 2544  0 22 46 31  0
 2  2      0 6805744 182608 752752    0    0      0 401661 6715 8275  0 18 32 50  0
 1  3      0 6805744 182608 752752    0    0    123 382200 4824 5126  0 20 32 48  0
 1  3      0 6805744 182608 752752    0    0     13 332742 4379 4918  0 18 27 55  0
 1  3      0 6805744 182608 752752    0    0     66 363967 4509 4615  0 17 47 36  0
 2  2      0 6805744 182608 752752    0    0     46 345998 6905 9378  0 17 45 38  0
 2  0      0 6805744 182608 752752    0    0  85870 272317 4082 4434  0  4 55 41  0
 1  1      0 6805744 182608 752752    0    0 409245      0 3435  648  0  5 73 21  0
 1  1      0 6805744 182608 752752    0    0 381261      0 3076  616  0  5 74 20  0
 1  1      0 6805744 182608 752752    0    0 406957      3 3332  846  0  5 74 21  0
 2  0      0 6805744 182608 752752    0    0 414537      1 3147  597  0  5 74 21  0
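The vmstat samples can be compared with the benchmark's own MB/second figures. The Python sketch below assumes that the bi and bo columns are in KiB per second, as documented for Linux vmstat, and that the sampling interval was one second, which is not stated here. It averages the samples that are clearly in the write phase or the read phase.

```python
# bo values (KiB written per second, by assumption) from the first six samples.
WRITE_BO = [413554, 401661, 382200, 332742, 363967, 345998]
# bi values (KiB read per second, by assumption) from the last four samples.
READ_BI = [409245, 381261, 406957, 414537]

def mean_mib_per_sec(kib_samples):
    """Average of KiB/second samples, converted to MiB/second."""
    return sum(kib_samples) / len(kib_samples) / 1024

write_rate = mean_mib_per_sec(WRITE_BO)
read_rate = mean_mib_per_sec(READ_BI)

print(f"vmstat write phase: {write_rate:.1f} MiB/second")
print(f"vmstat read phase:  {read_rate:.1f} MiB/second")
```

Under those assumptions the vmstat rates are of the same order as the uncached Write and Read columns reported by LanSpeed, which is what would be expected if the data was really going to and from the drive.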
 
   DriveSpeed RasPi 64 Bit gcc 8 Tue Dec 26 12:33:43 2023 /media/raspberrypi/FAT32/

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

1024   194.07   198.99   218.42   426.35   426.37   425.99

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 ms/file   104.09   104.07   104.07     0.14     0.21     0.12    0.052
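The milliseconds per file figures can be converted to an effective data rate to show how slow small file writing is on FAT32. A minimal Python sketch follows, assuming 1 MB is taken as 1024 KB. The helper name is illustrative.

```python
def mb_per_second(file_kb, ms_per_file):
    """Effective data rate for one file, assuming 1 MB = 1024 KB."""
    return (file_kb / 1024) / (ms_per_file / 1000)

# FAT32 DriveSpeed write times from the table above: file KB -> ms per file.
FAT32_WRITE_MS = {4: 104.09, 8: 104.07, 16: 104.07}

fat32_write_rate = {kb: mb_per_second(kb, ms) for kb, ms in FAT32_WRITE_MS.items()}
files_per_second = 1000 / FAT32_WRITE_MS[4]

for kb, rate in fat32_write_rate.items():
    print(f"{kb:2d} KB files: {rate:.3f} MB/second")
print(f"About {files_per_second:.1f} files written per second")
```

The near constant 104 ms per file, whatever the size, suggests a fixed overhead for each file rather than a data transfer limit.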
  
