Raspberry Pi 5 64 Bit Benchmarks and Stress Tests

Roy Longbottom

Summary	Introduction	Benchmark Results
Whetstone Benchmark	Dhrystone Benchmark	Linpack 100 Benchmark
Livermore Loops Benchmark	FFT Benchmarks	BusSpeed Benchmark
MemSpeed Benchmark	NeonSpeed Benchmark	MultiThreading Benchmarks
MP-Whetstone Benchmark	MP-Dhrystone Benchmark	MP NEON Linpack Benchmark
MP-BusSpeed Benchmark	MP-RandMem Benchmark	MP-MFLOPS Benchmarks
OpenMP-MFLOPS Benchmarks	OpenMP-MemSpeed Benchmarks	Java Whetstone Benchmark
JavaDraw Benchmark	OpenGL Benchmark	I/O Benchmarks
DriveSpeed Benchmark FAT32	Wired and WiFi Benchmark	USB Benchmark
New Benchmark More Files	New Benchmark Large Files	New Benchmark Small Files
Booting Time, Volts and Amps	Drive Stress Test	Drive Stress Performance Monitor
Disk Drive Errors and Crashes	Other System Crashes	CPU Stress Testing Benchmarks
CPU Stress Tests No Fan	Integer Stress Tests With Fan	Floating Point Stress Tests With Fan
4 Amps Power Supply No Disk Crash	New INTitHOT Integer Stress Test	INTitHOT PI 5 4 Maximum Speeds
INTitHOT Pi 5 Stress Tests	INTitHOT Stress Test No Fan 64 KB	INTitHOT Stress Test No Fan 512 KB
System Stress Tests	Light System Stress Test	Light Test With Fan
Light Test No Fan	Heavy System Stress Test	Heavy Test No Fan
Heavy Test With Fan - FAILED	Heavy Test With Fan - Passed	Firefox, Bluetooth and YouTube
Pi 5 The Vector Processor	PC and Pi Performance Comparisons

New 5 Amps Power Supply and Active Cooler

CPU Stress Tests

Heavy System Stress Test

Solid State Hard Drive

Summary

As indicated below, some of the benchmarks provided higher average Pi5/Pi4 performance gains than the official claim of two to three times, with individual programs or test functions between 10 and 18 times faster. This was due to the improved CPU caching arrangements and advanced SIMD hardware and compilation facilities. Examples of compiled SIMD vector instructions are included.

The latest 5 amps power supply and active cooler were not available initially, so the first tests were run with no cooling fan. Stress tests then led to CPU temperatures of up to 91.7°C, but the Pi 5 continued running at a lower speed, with controlled CPU MHz and voltage variations, still much faster than a fan cooled Pi 4.

On the downside, my rather extreme stress tests produced a number of system crashes and disk drive reading errors. I believe the results show that these were not associated with high temperatures but were caused by inadequate USB power. Although stress tests ran successfully using the 5 amps power supply, the USB power demands of disk and solid state drives appear to be rather excessive, and the system could easily be crashed by overloading. These drives should probably only be connected via a powered hub.

Surprisingly, execution of a new stress test, with integer calculations, generated more heat than the floating point variety. The hottest occurred when handling data from L2 cache with higher power demands. Faster L1 cache based data transfers produced somewhat lower temperatures.

Benchmarks - Besides detailed results, Pi5/Pi4 performance comparisons are provided using the older gcc 8 compiled versions. The Pi 5 gcc 8 results are also compared with those from new varieties compiled with gcc 12, which is included in the new 64 bit Operating System software.

Single Core CPU Tests - comprising varieties of Whetstone, Dhrystone, Linpack 100 and Livermore Loops Classic Benchmarks. Pi 5 gains were between 2.14 and 4.65 times from 182 measurements.

Single Core Memory Benchmarks - measuring performance using data from caches and RAM. More than 250 Pi5/Pi4 comparisons are provided from five benchmarks, indicating a Pi 5 average gain of 3.1 times and a maximum of 13.3 times. The Pi 5 new compilation average gain was 2.6 times, with a maximum of 10 times. High gains were due to improved caching and SIMD vector processing operations.

MultiThreading Benchmarks - These 8 benchmarks execute the same calculations using 1, 2, 4 and 8 threads. From 150 plus comparisons Pi5/Pi4 average/maximum gains were 3.4/18.2 times, with 1.2/5.6 times for Pi 5 gcc12/gcc8 compilations. The reasons for the high gains were improved caching and SIMD as above.

Miscellaneous - average Pi5/Pi4 performance gains for a series of tests were Java Whetstones 2.47 times, JavaDraw 1.98 times and OpenGL 4.0 times for 6 tests at 4 screen resolutions.

Input/Output Benchmarks - These measure performance of large files, small files and random access, with numerous performance measurements of Gbps LAN, WiFi, large files with 64 bit OS, main SD and USB 3 FAT and Ext disk drives and 11 main and USB boot drives. Also included are booting times and main and USB volts and amps power usage. The first test result indicated that the Pi 5 was typically 50% faster than the Pi 4 handling large files on a high speed USB 3 flash drive.

Drive Stress Test - This writes four large files with data comprising numerous binary data patterns, reads them randomly for a specified time, then repetitively reads each different data block for a time. Eleven 15 minute tests were successfully run on the Pi 5 comprising LAN, WiFi, OS SD, 3 USB 3 flash drives and 5 disk drive partitions, plus 2 network tests from a Pi 400.

Disk Drive Errors and System Crashes - (Power supply issues) - Two out of three tests using 2 disk drives caused crashes. The first was with both drives on a USB 3 hub, due to exceeding the 900 mA USB 3 port specification. The next crash was with one drive via the hub, one on a direct USB port and a CPU stress test running, leading to the measured main power supply current exceeding the 3 amps specification. This led to reading the wrong file and data comparison failures. Two disks on different USB 3 ports ran successfully.

CPU Stress Tests - Initial 3 floating point and 3 integer tests were run without fan cooling, each for 15 minutes, using 1, 2 and 4 threads, whilst recording performance, CPU MHz, volts and temperatures. All suffered from MHz throttling at temperatures up to 90°C, with measured performance deterioration less than 50%, still faster than a fan cooled Pi 4. I acquired a 4 amps power supply and repeated the test that crashed at 3 amps, this time with no failures.

INTitHOT New Integer Stress - This read only test produced the hottest and fastest effects, through executing continuous SIMD AND instructions. On the Pi 5, the fastest speed, via L1 cache sized data, was 240 GB/second, or a Terabit speed of 1.92 Tbps. Via L2 cache, maximum speed was 168 GB/second, with higher power consumption and temperature. The Pi 5 was around 4.6 times faster than a Pi 4 using 1 or 2 threads, and much greater at 4 threads, where the Pi 4 was unbelievably slow.
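
As a quick check of the unit conversion quoted above, the following minimal Python sketch converts GB/second to Tbps. It assumes decimal units (1 TB = 1000 GB) and 8 bits per byte. The function name is only illustrative and is not part of INTitHOT.

```python
def gbytes_to_tbits(gbytes_per_second: float) -> float:
    """Convert GB/second to Tbps, assuming decimal units and 8 bits per byte."""
    return gbytes_per_second * 8 / 1000


# Speeds quoted in the summary for L1 and L2 cache sized data
l1_tbps = gbytes_to_tbits(240)
l2_tbps = gbytes_to_tbits(168)

print(f"L1 cache: {l1_tbps:.2f} Tbps")
print(f"L2 cache: {l2_tbps:.2f} Tbps")
```

On the same basis, the 168 GB/second L2 cache speed is a little over 1.3 Tbps.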

System Stress Tests - These were run for 30 minutes using the 4 amps power supply and included INTitHOT, disk drive and OpenGL stress tests. Initial tests ran successfully at near maximum speed with the fan but reached a CPU temperature of 91.7°C with a 40% reduction in CPU and graphics performance without the fan. The next ones included floating point and network stress tests. The no fan test ran successfully with the usual high temperature and degraded performance but, with the fan, crashed with disk drive errors again. Then a low USB voltage was recorded.

Other Tests and Comparisons - Tests were carried out involving Firefox, Bluetooth sound and YouTube videos. Next is Pi 5 The Vector Processor, with examples and performance comparisons with 1978 to 1991 supercomputers, then comparisons with PCs from 1991 to 2021. Results for the latter indicate that the Raspberry Pi 5 can be assumed to be 194 times faster than the Cray 1 supercomputer.

New 5 Amps Power Supply and Active Cooler - Graphs of temperature increases with time are provided for initial CPU only stress tests, followed by others using the new items, now all much less than the CPU MHz throttling level. The hottest was not the floating point test but the one using integer calculations with L2 cache based data. Next was a repeat of the Heavy System Stress Tests. This ran successfully twice. It was then repeated with the 4 amps power supply and failed as before, but at a much lower CPU temperature, then ran without any issues at a second attempt. The strange measured power volts and amps probably indicate a marginal condition, compared to the 5 amps measurements.

Solid State Hard Drive - Following an earlier disastrous attempt, I repeated the last system stress test powered with 4 and 5 amps supplies on the Pi 5, providing similar performance. Then I ran the drive benchmarks, where average large file writing/reading speeds were around 360/400 MB/second, faster than the old hard drive. A surprise was that the measured USB current was a relatively high 640 mA.

Introduction

This report provides results from a wide range of benchmarks and stress tests run on the Raspberry Pi 5 during the Alpha Testing stage, and includes comparisons with the Pi 4. It follows the format of many other reports from 2014 to 2023 available from This ResearchGate Index. The latter includes access to historic results, opening the opportunity to compare Pi 5 performance with computers from as far back as the pre-1960 iron age.

The new Raspberry Pi 5 features a 2.4GHz quad-core 64-bit Arm Cortex-A76 CPU, with near 64 KB L1 and 512 KB L2 caches per core, and a 2MB shared L3 cache, also a host of other enhanced features. Compared to the Raspberry Pi 4, it was claimed to have between two and three times the CPU and GPU performance, with roughly twice the memory and I/O bandwidth. Part of the reason for this is that the Pi 4 runs at 1.5 GHz with a 32 KB L1 cache and 1024 KB shared L2 cache.

The first benchmarks measure performance of a single CPU core, covering integer and floating point performance plus data transfer speeds at all memory cache and RAM levels. Then there are multi-core benchmarks of the same variety and more, plus others for Java and graphics. The stress testing programs measure performance, CPU MHz and temperatures with and without fan cooling, initially for each program then during systems tests, including all CPU cores, disk and network drives and graphics. Then there are other measurements as identified in the contents table, including comparisons with PCs and supercomputers.

The benchmarks can be downloaded in RaspberryPi5BenchmarksandStressTests.tar.xz. This includes folders containing source code with compile commands, compiled programs, example results and script files to select run time parameters. A preprint of the report is also included.

All the programs save the results in log files, and full details from some are included in the report. These include the following information on the system under test.

Raspberry Pi 4 Old OS

Architecture:         aarch64
Byte Order:           Little Endian
CPU(s):               4
On-line CPU(s) list:  0-3
Thread(s) per core:   1
Core(s) per socket:   4
Socket(s):            1
Vendor ID:            ARM
Model:                3
Model name:           Cortex-A72
Stepping:             r0p3
CPU max MHz:          1500.0000
CPU min MHz:          600.0000
BogoMIPS:             108.00
Flags:                fp asimd evtstrm crc32 cpuid

Linux raspberrypi 4.19.118-v8+ #1311 SMP PREEMPT Mon Apr 27 14:32:38 BST 2020 aarch64 GNU/Linux

Raspberry Pi 5

Architecture:         aarch64
CPU op-mode(s):       32-bit, 64-bit
Byte Order:           Little Endian
CPU(s):               4
On-line CPU(s) list:  0-3
Vendor ID:            ARM
Model name:           Cortex-A76
Model:                1
Thread(s) per core:   1
Core(s) per cluster:  4
Socket(s):            -
Cluster(s):           1
Stepping:             r4p1
CPU(s) scaling MHz:   100%
CPU max MHz:          2400.0000
CPU min MHz:          1000.0000
BogoMIPS:             108.00
Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp

Linux raspberrypi 6.1.32-v8+ #1 SMP PREEMPT Sat Aug 5 07:03:33 BST 2023 aarch64 GNU/Linux

The last count indicated that 31 different benchmarking and stress testing programs were run, producing hundreds of results included here. The devil is in the details.

Whetstone Benchmark - whetstonePi64g8 and g12
Vector Versions - Whetv64SPg8 and g12, whetvDP64g8 and g12

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, with no accessing of data in L2 cache or RAM.

Results are provided for the original scalar single precision (SP) version, along with those for single and double precision (DP) varieties of the vector version, originally written for use on the first Cray 1 supercomputer delivered to the UK. For more information see Pi 5 The Vector Processor later. Examination of the time used by the different tests shows that the total can be dominated by those executing functions such as COS and EXP.

Pi 5/Pi 4 comparisons are provided for the gcc 8 scalar versions, indicating performance gains between 2.44 and 2.59 times for the three (MFLOPS) floating point tests and 2.79 times on overall MWIPS. Performance of the Pi 5 gcc 12 compilations was essentially identical to that from gcc 8.
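
The gains quoted above can be recalculated from the logged speeds. The following is a minimal Python sketch, using the N1, N2 and N6 MFLOPS and the MWIPS ratings from the scalar gcc 8 logs. The dictionary layout is only for illustration and is not how the benchmark stores its results.

```python
# Scalar single precision gcc 8 results copied from the logs in this section
pi4 = {"N1": 524.251, "N2": 534.904, "N6": 397.676, "MWIPS": 2085.311}
pi5 = {"N1": 1279.196, "N2": 1364.748, "N6": 1027.998, "MWIPS": 5822.922}

# Pi 5 / Pi 4 gain for each floating point test and for the overall rating
gains = {name: round(pi5[name] / pi4[name], 2) for name in pi4}

for name, gain in gains.items():
    print(f"{name:6s} Pi5/Pi4 gain {gain:.2f}")
```

This gives 2.44, 2.55 and 2.59 for the three MFLOPS tests and 2.79 for MWIPS.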

Pi 5/Pi 4 vector SP and DP gcc 8 performance gains were similar, between 2.34 and 3.10 times for MFLOPS and around 2.3 times for MWIPS. Pi 5 SP Vector/Scalar gains are also provided, giving 5.40 to 7.86 times for MFLOPS but only 1.88 times for overall MWIPS, deflated by the COS/EXP tests. Maximum SP scalar speed was 1.36 GFLOPS, with vectors at 8.08 GFLOPS SP and 4.0 GFLOPS DP.

Pi 4 GCC 8

Whetstone Single Precision C Benchmark  64 Bit gcc 8R, Fri May 22 10:48:53 2020

Loop content                      Result     MFLOPS       MOPS  Seconds

N1 floating point   -1.12475013732910156    524.251               0.076
N2 floating point   -1.12274742126464844    534.904               0.524
N3 if then else      1.00000000000000000              2978.570    0.073
N4 fixed point      12.00000000000000000              2493.078    0.264
N5 sin,cos etc.      0.49911010265350342                57.643    3.012
N6 floating point    0.99999982118606567    397.676               2.831
N7 assignments       3.00000000000000000               996.647    0.387
N8 exp,sqrt etc.     0.75110864639282227                27.327    2.841

MWIPS                                      2085.311              10.008

Pi 5 GCC 8

Whetstone Single Precision C Benchmark  64 Bit gcc 8R, Thu Aug 10 15:44:50 2023

Loop content                      Result     MFLOPS       MOPS  Seconds G8 Pi5/4

N1 floating point   -1.12475013732910156   1279.196               0.087     2.44
N2 floating point   -1.12274742126464844   1364.748               0.573     2.55
N3 if then else      1.00000000000000000              7190.834    0.084     2.41
N4 fixed point      12.00000000000000000              5995.954    0.306     2.41
N5 sin,cos etc.      0.49911010265350342               154.725    3.131     2.68
N6 floating point    0.99999982118606567   1027.998               3.055     2.59
N7 assignments       3.00000000000000000              2398.668    0.449     2.41
N8 exp,sqrt etc.     0.75110864639282227                93.596    2.314     3.43

MWIPS                                      5822.922               9.998     2.79

Pi 5 GCC 12

Whetstone Single Precision C Benchmark  64 Bit gcc 12, Thu Sep 28 11:46:43 2023

Loop content                      Result     MFLOPS       MOPS  Seconds

N1 floating point   -1.12475013732910156   1279.140               0.088
N2 floating point   -1.12274742126464844   1364.558               0.575
N3 if then else      1.00000000000000000              3594.939    0.168
N4 fixed point      12.00000000000000000              5994.963    0.307
N5 sin,cos etc.      0.49911010265350342               157.996    3.075
N6 floating point    0.99999982118606567   1027.940               3.064
N7 assignments       3.00000000000000000              2398.054    0.450
N8 exp,sqrt etc.     0.75110864639282227                95.590    2.273

MWIPS                                      5839.767              10.000

#################### Vector Whetstone Vector Length 258 ####################

Pi 4 GCC 8 SP

Whetstone Vector Benchmark 64 Bit Single Precision, Wed Aug 30 10:41:57 2023

Loop content                      Result     MFLOPS       MOPS  Seconds

N1 floating point   -1.13316142559051514   2338.496               0.391
N2 floating point   -1.13312149047851562   1651.957               3.877
N3 if then else      1.00000000000000000              4427.445    1.114
N4 fixed point      12.00000000000000000              1733.458    8.659
N5 sin,cos etc.      0.49998238682746887                74.913   52.923
N6 floating point    0.99999982118606567   2573.346               9.988
N7 assignments       3.00000000000000000             18596.381    0.474
N8 exp,sqrt etc.     0.75002217292785645                78.503   22.581

MWIPS                                      4764.843             100.007
Note the different single and double precision numeric results.
Pi 5 GCC 8 SP

Whetstone Vector Benchmark 64 Bit Single Precision, Sat Oct  7 10:15:16 2023

Loop content                      Result     MFLOPS       MOPS  Seconds G8 Pi5/4

N1 floating point   -1.13316142559051514   7111.676               0.290     3.04
N2 floating point   -1.13312149047851562   3857.446               3.746     2.34
N3 if then else      1.00000000000000000             10141.446    1.097     2.29
N4 fixed point      12.00000000000000000              2396.242   14.135     1.38
N5 sin,cos etc.      0.49998238682746887               177.032   50.534     2.36
N6 floating point    0.99999982118606567   7986.011               7.263     3.10
N7 assignments       3.00000000000000000             42584.598    0.467     2.29
N8 exp,sqrt etc.     0.75002217292785645               178.102   22.459     2.27

MWIPS                                     10753.538              99.990     2.26

Pi 5 GCC 12 SP

Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sat Oct  7 10:46:30 2023

                                                                                  Vector/
                                                                           Pi 5   Scalar
Loop content                      Result     MFLOPS       MOPS  Seconds  GCC12/8  G12 Pi5

N1 floating point   -1.13316142559051514   7393.282               0.286     1.04     5.78
N2 floating point   -1.13312149047851562   7364.751               2.009     1.91     5.40
N3 if then else      1.00000000000000000             14169.053    0.804     1.40     3.94
N4 fixed point      12.00000000000000000              2398.742   14.457     1.00     0.40
N5 sin,cos etc.      0.49998238682746887               177.260   51.673     1.00     1.12
N6 floating point    0.99999982118606567   8078.622               7.351     1.01     7.86
N7 assignments       3.00000000000000000             26419.105    0.770     0.62    11.02
N8 exp,sqrt etc.     0.75002217292785645               178.359   22.961     1.00     1.87

MWIPS                                     10974.928             100.311     1.02     1.88

Pi 4 GCC 8 DP

Whetstone Vector Benchmark 64 Bit Double Precision, Wed Aug 30 10:48:05 2023

Loop content                      Result     MFLOPS       MOPS  Seconds

N1 floating point   -1.13314558088707962   1146.624               0.709
N2 floating point   -1.13310306766606850   1094.230               5.203
N3 if then else      1.00000000000000000              4405.221    0.995
N4 fixed point      12.00000000000000000              1730.427    7.711
N5 sin,cos etc.      0.49998080312723675                73.193   48.149
N6 floating point    0.99999988868927014   1294.129              17.655
N7 assignments       3.00000000000000000              9967.123    0.785
N8 exp,sqrt etc.     0.75002006515491115                83.614   18.845

MWIPS                                      4233.571             100.052

Pi 5 GCC 8 DP

Whetstone Vector Benchmark 64 Bit Double Precision, Sat Oct  7 10:18:59 2023

Loop content                      Result     MFLOPS       MOPS  Seconds G8 Pi5/4

N1 floating point   -1.13314558088707962   3499.307               0.535     3.05
N2 floating point   -1.13310306766606850   2793.370               4.688     2.55
N3 if then else      1.00000000000000000             10158.471    0.993     2.31
N4 fixed point      12.00000000000000000              2396.163   12.809     1.38
N5 sin,cos etc.      0.49998080312723675               171.834   47.176     2.35
N6 floating point    0.99999988868927014   3994.760              13.156     3.09
N7 assignments       3.00000000000000000             21713.754    0.829     2.18
N8 exp,sqrt etc.     0.75002006515491115               184.857   19.607     2.21

MWIPS                                      9763.593              99.793     2.31

Pi 5 GCC 12 DP

Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sat Oct  7 10:50:40 2023

Loop content                      Result     MFLOPS       MOPS  Seconds

N1 floating point   -1.13314558088707962   3602.841               0.523
N2 floating point   -1.13310306766606739   3619.564               3.647
N3 if then else      1.00000000000000000             14167.623    0.718
N4 fixed point      12.00000000000000000              2398.696   12.898
N5 sin,cos etc.      0.49998080312723675               172.068   47.491
N6 floating point    0.99999988868927014   3997.801              13.252
N7 assignments       3.00000000000000000             13172.392    1.378
N8 exp,sqrt etc.     0.75002006515491115               182.557   20.014

MWIPS                                      9829.517              99.920

Dhrystone Benchmark - dhrystonePi64g8 and g12

This is the most popular ARM integer benchmark, often subject to over optimisation, rated in VAX MIPS aka DMIPS.

Pi 5 GCC 8 gain over Pi 4 was 2.37 times. There was a slight gain using GCC 12, where DMIPS/MHz ratio reached 8.57.
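
The ratings above can be reproduced from the logged Dhrystones per second. This minimal Python sketch assumes the usual convention that VAX MIPS is Dhrystones per second divided by 1757, the score attributed to the VAX 11/780, and uses the 2400 MHz maximum CPU clock reported for the Pi 5. The function and variable names are illustrative only.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # conventional reference score (assumption)
PI5_MAX_MHZ = 2400                       # CPU max MHz reported for the Pi 5


def vax_mips(dhrystones_per_second: float) -> float:
    """VAX MIPS (DMIPS) rating from a Dhrystones per second score."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


pi4_gcc8 = vax_mips(13729822)
pi5_gcc8 = vax_mips(32578833)
pi5_gcc12 = vax_mips(36120831)

print(f"Pi 5/Pi 4 gain  {pi5_gcc8 / pi4_gcc8:.2f}")
print(f"GCC 12/8 gain   {pi5_gcc12 / pi5_gcc8:.2f}")
print(f"DMIPS/MHz       {pi5_gcc12 / PI5_MAX_MHZ:.2f}")
```

With these inputs the calculated ratings agree with the logged VAX MIPS values to two decimal places.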

Pi 4 GCC 8

Dhrystone Benchmark 2.1 64 Bit gcc8, Mon May 25 22:16:05 2020

Nanoseconds one Dhrystone run:       72.83
Dhrystones per Second:            13729822
VAX  MIPS rating =                 7814.36

Numeric results were correct

Pi 5 GCC 8

Dhrystone Benchmark 2.1 64 Bit gcc8, Thu Aug 10 15:49:13 2023

Nanoseconds one Dhrystone run:       30.69
Dhrystones per Second:            32578833
VAX  MIPS rating =                18542.31    Pi 5/Pi 4 Gain 2.37

Numeric results were correct

Pi 5 GCC 12

Dhrystone Benchmark 2.1 64 Bit gcc12, Thu Sep 28 11:44:33 2023

Nanoseconds one Dhrystone run:       27.68
Dhrystones per Second:            36120831
VAX  MIPS rating =                20558.24    GCC 12/8 Gain 1.11

Numeric results were correct

Linpack 100 Benchmark MFLOPS - linpackPi64g8 and g12, linpackPi64gSP, linpackPi64NEONig8

This original Linpack benchmark executes double precision arithmetic. I introduced two single precision versions, one using NEON functions to include vector processing. Performance of this benchmark can vary, with its dependence on data placement in L2 cache.

Unlike when the Pi 5 was introduced, later compilers produced code as fast as the NEON version. Now with GCC 12, the NEON variety was slower and the others produced a small gain over GCC 8 compilations. Comparisons for the latter indicated Pi 5 gains of between 3.16 and 3.54 times over the three versions. Maximum Pi 5 speeds were 6.60 GFLOPS SP and 3.93 GFLOPS DP.
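
To put the MFLOPS figures in context, the following minimal Python sketch estimates the time for one n = 100 solution. It assumes the standard Linpack operation count of 2n³/3 + 2n²; the count used inside these particular programs is not shown in this report, so the result is indicative only.

```python
def linpack_flops(n: int) -> float:
    """Standard Linpack floating point operation count (assumed here)."""
    return 2 * n**3 / 3 + 2 * n**2


def solve_time_microseconds(n: int, mflops: float) -> float:
    """Time for one solution; MFLOPS equals operations per microsecond."""
    return linpack_flops(n) / mflops


flops_100 = linpack_flops(100)
pi4_dp_us = solve_time_microseconds(100, 1111.51)  # Pi 4 gcc 8 double precision
pi5_dp_us = solve_time_microseconds(100, 3933.38)  # Pi 5 gcc 8 double precision

print(f"Operations per solution: {flops_100:.0f}")
print(f"Pi 4 DP: {pi4_dp_us:.0f} microseconds, Pi 5 DP: {pi5_dp_us:.0f} microseconds")
```

On this basis each solution involves under 0.7 million floating point operations, which helps to explain why performance depends on where the data sits in the caches.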

Pi 4 GCC 8

Linpack Double Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 8, Mon May 25 22:05:47 2020

Speed    1111.51 MFLOPS

Numeric results were as expected

Linpack Single Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 8, Mon May 25 22:09:12 2020

Speed    1930.27 MFLOPS

Numeric results were as expected

Linpack Single Precision Benchmark n @ 100
NEON Intrinsics 64 bit gcc 8, Mon May 25 22:11:15 2020

Speed    2030.95 MFLOPS

Numeric results were as expected

------------------------------------------------------

Pi 5 GCC 8                        Pi5/Pi4

Linpack Double Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 8, Thu Aug 10 16:12:47 2023

Speed    3933.38 MFLOPS              3.54

Numeric results were as expected

Linpack Single Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 8, Thu Aug 10 16:04:18 2023

Speed    6106.68 MFLOPS              3.16

Numeric results were as expected

Linpack Single Precision Benchmark n @ 100
NEON Intrinsics 64 bit gcc 8, Thu Aug 10 16:13:52 2023

Speed    6603.58 MFLOPS              3.25

Numeric results were as expected

------------------------------------------------------

Pi 5 GCC 12                      GCC 12/8

Linpack Double Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 12, Thu Sep 28 15:58:07 2023

Speed    4136.39 MFLOPS              1.05

Numeric results were as expected

Linpack Single Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 12, Thu Sep 28 16:04:19 2023

Speed    6472.77 MFLOPS              1.06

Numeric results were as expected

Linpack Single Precision Benchmark n @ 100
NEON Intrinsics 64 bit gcc 12, Thu Sep 28 15:49:56 2023

Speed    5665.39 MFLOPS              0.86

Numeric results were as expected, but 4 needed changing in the program, via #define GCC12ARM64N, to avoid unnecessary error reports.

Livermore Loops Benchmark MFLOPS - liverloopsPi64g8 and g12

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, where the Cray 1 supercomputer was rated at 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores. Although each kernel is executed for a relatively long time, performance of some can be inconsistent.

Pi 5 GCC 8 maximum speed was 9.87 DP GFLOPS, with gains over the Pi 4 of between 2.14 and 4.65 times over the 24 loops.

Maximum performance via GCC 12 was 10.57 DP GFLOPS, with scores for most of the loops similar to those from GCC 8.
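
The per loop gains and the different kinds of average can be recalculated from the 24 MFLOPS values in the gcc 8 logs. The following is a minimal Python sketch, using the standard library statistics module. Note that the logged Overall Ratings do not exactly match figures derived from these 24 values (for example, the Pi 4 logged maximum is 2616.7 but the largest value listed is 2488.4), so they appear to be based on more measurements than are shown here.

```python
from statistics import geometric_mean, harmonic_mean, mean

# MFLOPS for the 24 loops, copied from the gcc 8 logs in this section
pi4 = [2108.4, 936.3, 959.9, 965.1, 382.5, 808.6, 2312.9, 2488.4,
       2065.7, 668.7, 500.3, 980.7, 180.7, 404.8, 815.0, 643.8,
       726.8, 1189.6, 449.8, 397.2, 1716.0, 366.9, 817.7, 312.7]
pi5 = [7423.6, 2147.9, 2356.6, 2472.9, 911.5, 1871.0, 9872.3, 5317.7,
       5162.9, 2125.8, 1173.2, 2672.0, 709.1, 1108.7, 2966.6, 1598.5,
       1761.3, 5526.8, 1190.0, 956.0, 5425.1, 1489.5, 2147.9, 858.2]

# Pi 5 / Pi 4 gain for each loop
ratios = [fast / slow for slow, fast in zip(pi4, pi5)]
print(f"Pi5/Pi4 gains: min {min(ratios):.2f} max {max(ratios):.2f}")

# Arithmetic, geometric and harmonic means of the 24 listed values
for name, values in (("Pi 4", pi4), ("Pi 5", pi5)):
    print(f"{name}: average {mean(values):.1f} "
          f"geomean {geometric_mean(values):.1f} "
          f"harmean {harmonic_mean(values):.1f}")
```

The geometric mean is less influenced by a few very fast loops than the arithmetic average, which is one reason it is preferred as the official rating.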

Pi 4 GCC 8

Livermore Loops Benchmark 64 Bit gcc 8 via C/C++ Mon May 25 10:39:10 2020

MFLOPS for 24 loops
 2108.4   936.3   959.9   965.1   382.5   808.6  2312.9  2488.4  2065.7   668.7   500.3   980.7
  180.7   404.8   815.0   643.8   726.8  1189.6   449.8   397.2  1716.0   366.9   817.7   312.7

Overall Ratings
Maximum Average Geomean Harmean Minimum
 2616.7   959.8   766.7   613.0   169.7

Numeric results were as expected

Pi 5 GCC 8

Livermore Loops Benchmark 64 Bit gcc 8 via C/C++ Thu Aug 10 16:14:33 2023

MFLOPS for 24 loops
 7423.6  2147.9  2356.6  2472.9   911.5  1871.0  9872.3  5317.7  5162.9  2125.8  1173.2  2672.0
  709.1  1108.7  2966.6  1598.5  1761.3  5526.8  1190.0   956.0  5425.1  1489.5  2147.9   858.2

Overall Ratings
Maximum Average Geomean Harmean Minimum
 9872.3  2873.9  2208.3  1763.4   646.6

Numeric results were as expected

-----------------------------------------------------------------------------------

GCC 8 Pi5/Pi4 Performance Ratios

For 24 loops
   3.52    2.29    2.46    2.56    2.38    2.31    4.27    2.14    2.50    3.18    2.34    2.72
   3.92    2.74    3.64    2.48    2.42    4.65    2.65    2.41    3.16    4.06    2.63    2.74

Min 2.14   Max 4.65

Overall Ratings
Maximum Average Geomean Harmean Minimum
   3.77    2.99    2.88    2.88    3.81

-----------------------------------------------------------------------------------

Pi 5 GCC 12

Livermore Loops Benchmark 64 Bit gcc 12 via C/C++ Thu Sep 28 16:38:37 2023

MFLOPS for 24 loops
 7833.8  2404.6  2377.2  2346.8   913.0  1857.1   10577  5350.6  5109.2  2117.4  1186.0  2351.4
  760.0  1121.2  3103.4  1597.7  1776.1  5455.9  1197.2  2490.5  5657.5  1855.7  2139.8   780.4

Overall Ratings
Maximum Average Geomean Harmean Minimum
10576.9  2964.4  2308.1  1870.7   733.9

Numeric results were as expected via #define GCC12ARMPI

Fast Fourier Transforms Benchmarks - fft1Pi64g, fft3cPi64g8 and g12

This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements use both single and double precision data, calculating FFT sizes between 1K and 1024K, with data from caches and RAM. Note that steps in performance levels occur at data size changes between caches, then to RAM.

Comparisons of averages of the three runs are provided. Those for FFT1 demonstrate the clear, but varying, advantage of the Pi 5 over the Pi 4, depending on the source of the data, with that from L3 cache providing gains of up to 13.34 times and up to 4.71 times involving the larger L2 cache. Most other gains are in the two to four times range. With the faster, CPU speed limited, FFT3c, gains were mainly between 2 and 3 times. GCC 12 over GCC 8 comparisons indicate a slight advantage of the former using data from caches, but the roles are reversed when dealing with RAM data transfers.
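
The comparison columns are based on averages of the three timed runs. As a minimal Python sketch of that calculation, the following uses the FFT1 single precision 256K times from the gcc 8 logs. The times are in milliseconds, so the gain is the Pi 4 average divided by the Pi 5 average. Ratios recalculated for the smallest sizes will differ somewhat from the tables, as the logged times are only shown to two decimal places.

```python
from statistics import mean

# FFT1 single precision, 256K size, milliseconds for the three runs (gcc 8 logs)
pi4_ms = [295.55, 280.43, 303.20]
pi5_ms = [23.23, 21.27, 21.40]

# Lower time is better, so the gain is Pi 4 average time over Pi 5 average time
gain = mean(pi4_ms) / mean(pi5_ms)

print(f"Pi 4 average {mean(pi4_ms):.2f} ms, Pi 5 average {mean(pi5_ms):.2f} ms")
print(f"Pi5/Pi4 gain {gain:.2f}")
```

This reproduces the maximum gain of 13.34 times quoted above.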

Pi 4 GCC 8

Pi 4 RPi FFT gcc 8 64 Bit Benchmark 1 Mon May 25 10:54:42 2020

  Size                              milliseconds
     K          Single Precision              Double Precision

     1      0.05      0.04      0.04      0.04      0.04      0.05
     2      0.08      0.08      0.08      0.15      0.14      0.14
     4      0.23      0.23      0.23      0.39      0.38      0.44
     8      0.73      0.80      0.70      0.97      1.04      0.97
    16      1.98      1.87      1.79      2.66      2.52      2.83
    32      4.92      4.92      5.29      5.67      4.92      4.89
    64      8.80      8.69      8.67     32.21     32.23     33.31
   128     49.82     49.79     50.17    161.36    159.61    159.39
   256    295.55    280.43    303.20    411.97    415.90    340.34
   512    506.01    601.29    572.36    781.10    779.05    782.21
  1024   1375.42   1377.64   1375.77   1898.28   1876.88   1896.22

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Mon May 25 10:55:00 2020

Pi 4 RPi FFT gcc 8 64 Bit Benchmark 3c.0 Mon May 25 10:56:49 2020

  Size                              milliseconds
     K          Single Precision              Double Precision

     1      0.06      0.04      0.04      0.04      0.04      0.03
     2      0.09      0.07      0.07      0.10      0.10      0.10
     4      0.23      0.20      0.20      0.23      0.26      0.23
     8      0.50      0.44      0.46      0.52      0.50      0.50
    16      1.21      1.19      1.05      1.23      1.17      1.19
    32      2.36      2.23      2.18      3.33      3.32      3.29
    64      6.16      5.70      5.31     10.20     10.20     10.18
   128     16.39     15.69     15.69     24.35     24.45     24.48
   256     38.70     37.46     37.40     54.57     54.65     54.59
   512     83.83     80.96     81.40    119.71    118.70    119.27
  1024    182.08    176.05    176.97    268.43    259.16    259.30

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Mon May 25 10:56:52 2020

Pi 5 GCC 8

Pi 5 RPi FFT gcc 8 64 Bit Benchmark 1 Fri Aug 11 16:47:11 2023

  Size                              milliseconds                      Average Pi5/Pi4
     K          Single Precision              Double Precision            SP      DP

     1      0.02      0.02      0.02      0.02      0.02      0.02    2.20    2.51
     2      0.04      0.04      0.04      0.04      0.04      0.04    1.98    3.81
     4      0.09      0.09      0.09      0.09      0.09      0.09    2.64    4.71
     8      0.19      0.20      0.19      0.29      0.29      0.29    3.88    3.48
    16      0.56      0.56      0.56      0.65      0.67      0.78    3.35    3.82
    32      1.30      1.27      1.29      1.55      1.50      1.80    3.92    3.18
    64      3.18      3.00      2.99      4.16      3.90      3.91    2.85    8.17
   128      7.76      7.30      7.28     14.27     14.44     13.71    6.70   11.33
   256     23.23     21.27     21.40     99.92     94.38     94.97   13.34    4.04
   512    157.82    152.33    173.93    329.15    321.16    323.41    3.47    2.41
  1024    608.66    606.77    600.94   1069.84   1048.00   1049.41    2.27    1.79

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Fri Aug 11 16:47:19 2023

Pi 5 RPi FFT gcc 8 64 Bit Benchmark 3c.0 Fri Aug 11 16:48:27 2023

  Size                              milliseconds                      Average Pi5/Pi4
     K          Single Precision              Double Precision            SP      DP

     1      0.03      0.02      0.02      0.02      0.02      0.02    1.88    1.96
     2      0.05      0.04      0.04      0.04      0.04      0.04    1.93    2.61
     4      0.10      0.08      0.08      0.09      0.09      0.09    2.37    2.74
     8      0.21      0.18      0.18      0.23      0.21      0.21    2.43    2.37
    16      0.45      0.41      0.41      0.53      0.48      0.49    2.70    2.40
    32      1.16      0.90      0.93      1.22      1.07      1.06    2.27    2.97
    64      2.39      2.04      2.39      2.98      2.76      2.69    2.52    3.63
   128      5.26      4.82      4.86      9.92      9.90      9.86    3.20    2.47
   256     14.58     13.92     13.89     29.15     27.71     26.90    2.68    1.96
   512     42.03     39.73     39.84     72.71     72.32     71.70    2.02    1.65
  1024    101.56     99.35     98.31    176.62    171.45    175.48    1.79    1.50

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Fri Aug 11 16:48:29 2023

Pi 5 GCC 12

RPi FFT gcc 12 64 Bit Benchmark 1 Thu Sep 28 19:10:33 2023

  Size                              milliseconds                      Average GCC 12/8
     K          Single Precision              Double Precision            SP      DP

     1      0.02      0.02      0.02      0.02      0.02      0.02    1.15    1.02
     2      0.06      0.04      0.04      0.04      0.04      0.04    0.92    1.05
     4      0.08      0.08      0.08      0.08      0.08      0.08    1.09    1.05
     8      0.18      0.18      0.18      0.80      0.26      0.25    1.09    0.65
    16      0.55      0.62      0.61      0.78      0.62      0.68    0.95    1.01
    32      1.19      1.19      1.18      3.14      1.66      2.23    1.08    0.69
    64      2.90      2.87      3.12      4.14      3.83      4.62    1.03    0.95
   128      8.01      7.72      8.41     19.04     16.31     19.17    0.93    0.78
   256     28.65     29.22     30.38    142.81    143.44    144.91    0.75    0.67
   512    256.41    209.11    215.07    400.84    410.99    448.06    0.71    0.77
  1024    798.30    749.85    753.61   1073.95   1075.09   1051.38    0.79    0.99

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Thu Sep 28 19:10:41 2023

RPi FFT gcc 12 64 Bit Benchmark 3c.0 Thu Sep 28 19:13:51 2023

  Size                              milliseconds                      Average GCC 12/8
     K          Single Precision              Double Precision            SP      DP

     1      0.02      0.02      0.02      0.02      0.02      0.02    1.20    1.06
     2      0.04      0.04      0.04      0.04      0.04      0.04    1.04    1.06
     4      0.09      0.08      0.08      0.08      0.08      0.08    1.06    1.06
     8      0.19      0.18      0.18      0.20      0.19      0.19    1.06    1.10
    16      0.41      0.39      0.39      0.46      0.43      0.43    1.07    1.12
    32      0.88      0.85      0.86      1.01      0.96      0.96    1.15    1.14
    64      1.98      1.91      1.91      2.57      2.48      2.47    1.17    1.12
   128      5.65      4.68      4.63     10.10     10.04     10.06    1.00    0.98
   256     14.59     14.50     14.59     36.02     35.29     34.84    0.97    0.79
   512     55.50     54.91     55.79    100.99    102.62     99.96    0.73    0.71
  1024    143.39    142.49    143.22    231.27    228.44    229.17    0.70    0.76

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Thu Sep 28 19:13:53 2023

BusSpeed Benchmark - busspeedPi64g8 and g12

This is a read only benchmark with data from caches and RAM. The program first reads one word, with 32 word increments to the next one, then repeats the test, skipping fewer data words through decreasing increments, and finally reads all the data. This shows where data is read in bursts, enabling estimates to be made of bus speeds, as 16 times the speed of appropriate measurements at Inc16.

The most important ratios are those from Read All. The others show what happens when not all data is read sequentially, where the Pi 5 appears to be significantly faster than the Pi 4. The main results indicate Pi 5 gains of just over twice reading data from L1 and L2 caches, but gains can be more than four times from L3 and more than three times from RAM. Maximum bus speed, using one CPU core, is estimated as around 14 GB/second from Inc16, as also shown under Read All. See MP results for higher estimates.
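
The bus speed estimate can be illustrated with a minimal Python sketch, using the Pi 5 gcc 8 results at 16384 KBytes, where data comes from RAM. The factor of 16 is taken from the description above. It is presumably because a 64 byte burst holds sixteen 4 byte words, so reading one word in 16 still transfers every burst, but that reasoning is an assumption here.

```python
WORD_INCREMENTS = [32, 16, 8, 4, 2, 1]  # increments used by the tests, 1 being Read All

# Pi 5 gcc 8 results at 16384 KBytes (data from RAM), MBytes/second
inc16_mb_per_second = 864
read_all_mb_per_second = 14015

# Estimated bus speed is 16 times the Inc16 measurement
estimated_bus_mb_per_second = 16 * inc16_mb_per_second

print(f"Estimated bus speed {estimated_bus_mb_per_second / 1000:.1f} GB/second")
print(f"Read All speed      {read_all_mb_per_second / 1000:.1f} GB/second")
```

Both figures are close to the 14 GB/second quoted above.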

Pi 5 performance produced from GCC 8 and GCC 12 compilations was essentially the same.

Pi 4 GCC 8

BusSpeed 64 Bit gcc 8 Mon May 25 22:13:11 2020

Reading Speed 4 Byte Words in MBytes/Second

 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All    Cache      Pi 5

     16   4898   5109   5626   5860   5879   9238    L1         L1
     32   1109   1389   2485   3804   5026   8435
     64    804   1030   2025   3285   4871   8312    L2 Shared
    128    737    951   1877   3130   4908   8556               L2
    256    732    953   1897   3147   4941   8617
    512    701    939   1766   2902   4601   8150
   1024    323    494    986   1807   3060   5553    RAM        L3 Shared
   4096    242    259    486    964   1932   3856               RAM
  16384    236    268    493    971   1939   3878
  65536    242    271    494    973   1942   3884

End of test Mon May 25 22:13:21 2020

Pi 5 GCC 8                                           P5/P4 Comparison

BusSpeed 64 Bit gcc 8 Fri Aug 11 16:46:13 2023

Reading Speed 4 Byte Words in MBytes/Second

 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read    Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All    Words  Words  Words  Words  Words    All  MP-bus

     16   8300   8413  15451  17849  18151  18721     1.69   1.65   2.75   3.05   3.09   2.03
     32   9159   9235  15509  17911  18132  18721     8.26   6.65   6.24   4.71   3.61   2.22
     64   7460   7644  13739  17008  17665  18593     9.28   7.42   6.78   5.18   3.63   2.24
    128   2375   4452   7168  11555  13968  18203     3.22   4.68   3.82   3.69   2.85   2.13
    256   2375   4425   7225  11540  13964  18243     3.24   4.64   3.81   3.67   2.83   2.12
    512   1784   2980   5758  10362  13685  18203     2.54   3.17   3.26   3.57   2.97   2.23
   1024   1225   2325   4639   9336  13467  18281     3.79   4.71   4.70   5.17   4.40   3.29
   4096    656   1375   2700   5120   9599  15984     2.71   5.31   5.56   5.31   4.97   4.15
  16384    579    864   1741   3502   7020  14015     2.45   3.22   3.53   3.61   3.62   3.61
  65536    604    796   1595   3195   6351  12699     2.50   2.94   3.23   3.28   3.27   3.27

End of test Fri Aug 11 16:46:22 2023

Pi 5 GCC 12                                          Pi 5 GCC 12/8 Comparison

BusSpeed 64 Bit gcc 12 Thu Sep 28 19:02:33 2023

Reading Speed 4 Byte Words in MBytes/Second

 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read    Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All    Words  Words  Words  Words  Words    All

     16   8493   8509  16377  17918  18170  18733     1.02   1.01   1.06   1.00   1.00   1.00
     32   9127   9295  16478  18023  18212  18740     1.00   1.01   1.06   1.01   1.00   1.00
     64   7530   7604  14030  17241  17877  18603     1.01   0.99   1.02   1.01   1.01   1.00
    128   2375   4189   7212  11566  13961  18230     1.00   0.94   1.01   1.00   1.00   1.00
    256   2358   4275   7265  11595  13985  18274     0.99   0.97   1.01   1.00   1.00   1.00
    512   1557   2879   5524  10229  13877  18231     0.87   0.97   0.96   0.99   1.01   1.00
   1024   1225   2339   4606   9318  13902  18271     1.00   1.01   0.99   1.00   1.03   1.00
   4096    780   1387   2672   5115   9407  16053     1.19   1.01   0.99   1.00   0.98   1.00
  16384    652    880   1763   3479   7034  13979     1.13   1.02   1.01   0.99   1.00   1.00
  65536    624    801   1605   3178   6416  12800     1.03   1.01   1.01   0.99   1.01   1.01


MemSpeed Benchmark MB/Second - memspeedPi64g8 and g12

The benchmark includes CPU speed dependent calculations using data from caches and RAM, via single and double precision floating point and integer functions. The instruction sequences used are shown in the results column titles.

When compiled with GCC 6, earlier results identified unusually slow operation with 32 bit floating point and integer calculations. The effect looks as though data is read from RAM instead of from caches, which is why Pi 5 performance gains were mainly less than two times. With double precision floating point, average Pi 5 gains were around four times for the first two sets of calculations, including more than 10 times with L3 cache involvement.

The GCC 12 compilation appears to have corrected the above misoperations, providing gains of more than eight times over GCC 8. The comparisons also show slight improvements in double precision calculations. Maximum calculated speeds are provided, indicating 15.3 single core GFLOPS SP and 6.86 DP, the relationship expected using SIMD calculations. The tests also confirmed this, with near 6.4 GFLOPS/GHz SP and near half that DP. This performance was obtained using data from L1 and L2 caches, with almost as much from L3 cache.
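The Max MFLOPS figures quoted above are consistent with a simple conversion from the measured MB/second, and the GFLOPS/GHz figures follow from the 2.4 GHz Pi 5 clock. The Python sketch below works through that arithmetic. It assumes that the reported MB/second counts both the x and y words read, so that x[m]=x[m]+s*y[m] delivers one floating point operation per word read; the function and variable names are illustrative and are not taken from the benchmark source.

```python
# Illustrative arithmetic only; names are not from the benchmark source.
PI5_GHZ = 2.4  # Pi 5 CPU clock, used for the per-GHz figures


def mflops_from_mb_per_sec(mb_per_sec, bytes_per_word):
    # Assumption: MB/second covers both the x and y words read, and
    # x[m] = x[m] + s * y[m] performs 2 operations per 2 words read,
    # so MFLOPS equals millions of words read per second.
    return mb_per_sec / bytes_per_word


# Pi 5 GCC 12 results at 8 KB, first calculation
sp_mflops = mflops_from_mb_per_sec(61264, 4)  # single precision, 4 byte words
dp_mflops = mflops_from_mb_per_sec(54902, 8)  # double precision, 8 byte words

sp_gflops_per_ghz = sp_mflops / 1000 / PI5_GHZ
dp_gflops_per_ghz = dp_mflops / 1000 / PI5_GHZ

print(f"SP {sp_mflops:.0f} MFLOPS, {sp_gflops_per_ghz:.2f} GFLOPS/GHz")
print(f"DP {dp_mflops:.0f} MFLOPS, {dp_gflops_per_ghz:.2f} GFLOPS/GHz")
```

Under that assumption the conversion reproduces the 15316 SP and 6862 DP Max MFLOPS shown at the end of the GCC 12 results.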

Pi 4 GCC 8

  Memory Reading Speed Test 64 Bit gcc 8 by Roy Longbottom

  Start of test Mon May 25 22:23:53 2020

  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8  15531   3999   3957  15576   4387   4358  11629   9313   9314
      16  15717   3992   3922  15770   4355   4377  11799   9444   9446
      32  12020   3818   3814  12043   4179   4198  11549   9496   9497
      64  12228   3816   3887  12220   4166   4195   8935   8506   8506
     128  12265   3869   3941  12157   4182   4206   8080   8193   8196
     256  12230   3873   3932  12073   4199   4216   8129   8224   8223
     512   9731   3832   3902   9709   4150   4171   8029   7845   7865
    1024   3772   3682   3769   3467   3887   3920   5478   5543   5378
    2048   1896   3463   3496   1886   3616   3612   2937   2945   2923
    4096   1924   3520   3528   1933   3651   3394   2752   2796   2785
    8192   1996   3523   3555   1988   3643   3630   2668   2661   2663

  End of test Mon May 25 22:24:10 2020

Pi 5 GCC 8

  Memory Reading Speed Test 64 Bit gcc 8 by Roy Longbottom

  Start of test Fri Aug 11 16:34:06 2023

  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8  50862   6851   6746  50686   7193   7490  37629  18595  25168
      16  51032   6820   6717  51024   7164   7468  38002  18888  24946
      32  49985   6814   6676  50568   7150   7446  37609  18972  25259
      64  50868   6857   6656  50864   7168   7411  37799  19114  25426
     128  32618   6797   6670  32666   7142   7278  35466  19143  25439
     256  32540   6788   6640  32744   7183   7278  34821  19144  25360
     512  26949   6786   6668  30112   7155   7246  33493  14598  16816
    1024  25094   6719   6645  19272   6821   7206  21805  17292  22671
    2048  20586   6365   6586  19261   6887   7172   4740   4662  13673
    4096   5004   6680   6710   4963   6776   6249   7938   8990   8797
    8192   3229   5589   4662   3205   6496   6573   6654   6719   4613

  End of test Fri Aug 11 16:34:22 2023

P5/P4 Comparison

  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used

       8   3.27   1.71   1.70   3.25   1.64   1.72   3.24   2.00   2.70
      16   3.25   1.71   1.71   3.24   1.65   1.71   3.22   2.00   2.64
      32   4.16   1.78   1.75   4.20   1.71   1.77   3.26   2.00   2.66
      64   4.16   1.80   1.71   4.16   1.72   1.77   4.23   2.25   2.99
     128   2.66   1.76   1.69   2.69   1.71   1.73   4.39   2.34   3.10
     256   2.66   1.75   1.69   2.71   1.71   1.73   4.28   2.33   3.08
     512   2.77   1.77   1.71   3.10   1.72   1.74   4.17   1.86   2.14
    1024   6.65   1.82   1.76   5.56   1.75   1.84   3.98   3.12   4.22
    2048  10.86   1.84   1.88  10.21   1.90   1.99   1.61   1.58   4.68
    4096   2.60   1.90   1.90   2.57   1.86   1.84   2.88   3.22   3.16
    8192   1.62   1.59   1.31   1.61   1.78   1.81   2.49   2.52   1.73

Continued below
Pi 5 GCC 12

  Memory Reading Speed Test 64 Bit gcc 12 by Roy Longbottom

  Start of test Thu Sep 28 18:54:28 2023

  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8  54902  61264  65610  55241  65554  63848  37768  25475  25486
      16  54803  60539  64671  55169  64700  64750  38078  24891  24891
      32  51859  60967  64278  52558  65247  65275  37520  25234  25234
      64  52597  61169  65523  52485  65514  65523  37945  25408  25402
     128  33580  60278  63742  33647  63692  62897  37218  25370  25457
     256  33724  60317  63873  33711  63840  63865  35555  25371  25375
     512  33522  59194  63298  33502  63259  63175  35909  25459  25451
    1024  32078  57946  60718  31576  60680  59199  26110  22319  23059
    2048  29249  55376  57648  29028  57558  57290  16245  18242  19514
    4096   4508  11981  11906   4864  11894   9313  10254  10529  10668
    8192   3175   6507   6150   3178   6441   6499   6678   6904   6364

  Max MFLOPS   6862  15316

  End of test Thu Sep 28 18:54:43 2023

Pi 5 GCC 12/8

  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used

       8   1.08   8.94   9.73   1.09   9.11   8.52   1.00   1.37   1.01
      16   1.07   8.88   9.63   1.08   9.03   8.67   1.00   1.32   1.00
      32   1.04   8.95   9.63   1.04   9.13   8.77   1.00   1.33   1.00
      64   1.03   8.92   9.84   1.03   9.14   8.84   1.00   1.33   1.00
     128   1.03   8.87   9.56   1.03   8.92   8.64   1.05   1.33   1.00
     256   1.04   8.89   9.62   1.03   8.89   8.78   1.02   1.33   1.00
     512   1.24   8.72   9.49   1.11   8.84   8.72   1.07   1.74   1.51
    1024   1.28   8.62   9.14   1.64   8.90   8.22   1.20   1.29   1.02
    2048   1.42   8.70   8.75   1.51   8.36   7.99   3.43   3.91   1.43
    4096   0.90   1.79   1.77   0.98   1.76   1.49   1.29   1.17   1.21
    8192   0.98   1.16   1.32   0.99   0.99   0.99   1.00   1.03   1.38


NeonSpeed Benchmark MB/Second - NeonSpeedPi64g8 and g12

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm functions are as generated by the compiler, while NEON functions use intrinsic functions.

The initial GCC 8 test functions produced the same irregular results as MemSpeed for the “Normal” float and integer calculations, which appear to read only RAM based data. Performance from NEON code indicated that the Pi 5 was typically 2.5 times faster than the Pi 4 using cache based data, and 1.5 times faster from RAM. Exceptions were gains of up to 7.9 times using L3 cache and nearly 4.8 from lower level caches.

The GCC 12 compiler produced acceptable “Normal” performance on the Pi 5, reflected by gains of up to just over ten times relative to GCC 8 results. This compiler is also shown to provide faster operation than that from NEON functions. Many of the latter show 20% improvements over GCC 8, but some were slower. Maximum floating point speed demonstrated was nearly 17 GFLOPS.

Pi 4 GCC 8

  NEON Speed 64 Bit gcc 8 Mon May 25 22:21:51 2020

    Vector Reading Speed in MBytes/Second

  Memory  Float v=v+s*v   Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3629  14987   3925  13643  14457  16642
      32   3475  10933   3821   9970  11029  11055
      64   3447  11749   3845  11098  11802  12079
     128   3332  11392   3912  10813  11430  11513
     256   3325  11565   3926  10981  11598  11699
     512   3313  10553   3917  10269  10755  10740
    1024   3239   3331   3737   3291   3302   3321
    4096   2987   1888   3331   1777   1881   1878
   16384   3150   1821   3347   1814   1812   1834
   65536   2747   1954   3132   2017   1904   2021

  Max MFLOPS 3747

  End of test Mon May 25 22:22:11 2020

Pi 5 GCC 8                                           P5/P4 Comparison

  NEON Speed 64 Bit gcc 8 Fri Aug 11 16:44:52 2023

    Vector Reading Speed in MBytes/Second

  Memory  Float v=v+s*v   Int v=v+v+s   Neon v=v+v  Float v=v+s*v   Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int   Norm   Neon   Norm   Neon  Float    Int

      16   6745  46851   6968  44490  46849  46847   1.86   3.13   1.78   3.26   3.24   2.81
      32   6727  47104   6947  44618  47061  47056   1.94   4.31   1.82   4.48   4.27   4.26
      64   6703  46642   6962  44166  47040  46955   1.94   3.97   1.81   3.98   3.99   3.89
     128   6587  27383   6840  27199  27404  27398   1.98   2.40   1.75   2.52   2.40   2.38
     256   6579  27491   6857  27299  27509  27509   1.98   2.38   1.75   2.49   2.37   2.35
     512   6571  27433   6862  26599  24237  26163   1.98   2.60   1.75   2.59   2.25   2.44
    1024   6531  26340   6756  25226  24597  24527   2.02   7.91   1.81   7.67   7.45   7.39
    4096   6414   9410   6505   9986   9474   8835   2.15   4.98   1.95   5.62   5.04   4.70
   16384   5690   2850   5501   2830   2865   2488   1.81   1.57   1.64   1.56   1.58   1.36
   65536   4837   2534   4736   2458   2401   2450   1.76   1.30   1.51   1.22   1.26   1.21

  Max MFLOPS 11776

  End of test Fri Aug 11 16:45:12 2023

Pi 5 GCC 12                                          Pi 5 GCC 12/8

  NEON Speed 64 Bit gcc 12 Thu Sep 28 18:57:35

    Vector Reading Speed in MBytes/Second

  Memory  Float v=v+s*v   Int v=v+v+s   Neon v=v+v  Float v=v+s*v   Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int   Norm   Neon   Norm   Neon  Float    Int

      16  67042  45164  67037  45358  54228  54166   9.94   0.96   9.62   1.02   1.16   1.16
      32  67631  45190  67621  45415  53833  53675  10.05   0.96   9.73   1.02   1.14   1.14
      64  67812  44856  67491  45171  52338  51321  10.12   0.96   9.69   1.02   1.11   1.09
     128  62779  33147  64360  33074  33619  33458   9.53   1.21   9.41   1.22   1.23   1.22
     256  64352  33405  64803  33187  33699  33719   9.78   1.22   9.45   1.22   1.23   1.23
     512  61159  33171  61798  32263  33178  28319   9.31   1.21   9.01   1.21   1.37   1.08
    1024  58937  32149  57732  31639  32219  32108   9.02   1.22   8.55   1.25   1.31   1.31
    4096   9215   2639   7168   3800   3823   3776   1.44   0.28   1.10   0.38   0.40   0.43
   16384   5546   2830   5592   2772   2753   2503   0.97   0.99   1.02   0.98   0.96   1.01
   65536   4633   2445   4196   1922   2196   2294   0.96   0.96   0.89   0.78   0.91   0.94

  Max MFLOPS 16953


MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in two different versions, using standard compiled “C” code for single and double precision arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.

MP-Whetstone Benchmark - MP-WHETSPi64g8 and g12

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Performance was generally proportional to the number of cores used. Overall seconds indicates MP efficiency, with around 5 seconds for 1, 2 and 4 threads, doubling with 8.
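As a worked example of the proportionality described above, the following Python sketch computes thread speedup and scaling efficiency from the Pi 5 GCC 8 MWIPS ratings in the results. The helper names are illustrative, and the efficiency measure (speedup divided by the smaller of thread count and the four Pi 5 cores) is an assumption made for this example, not something the benchmark reports.

```python
# MWIPS ratings from the Pi 5 GCC 8 results (threads -> MWIPS)
mwips = {1: 6138.4, 2: 12198.6, 4: 24008.3, 8: 24768.0}
CORES = 4  # the Pi 5 has four CPU cores


def speedup(threads):
    # Throughput relative to the single thread run
    return mwips[threads] / mwips[1]


def efficiency(threads):
    # Fraction of ideal scaling, where ideal is capped at the core count
    return speedup(threads) / min(threads, CORES)


for t in sorted(mwips):
    print(f"{t}T speedup {speedup(t):.2f}, efficiency {efficiency(t):.0%}")
```

With eight threads on four cores the rating barely changes while the overall running time doubles, which is the pattern noted in the Overall Seconds figures.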

The Pi 5 CPU temperature reached 80.7°C within the 26 second testing time. Pi5/Pi4 4 thread performance ratios were between 2.22 and 3.43.

Performance of all GCC 8 compilations was essentially the same as that from GCC 12.

Pi 4 GCC 8

  MP-Whetstone Benchmark 64 Bit gcc 8 Mon May 25 10:18:21 2020

                    Using 1, 2, 4 and 8 Threads

        MWIPS   MFLOPS   MFLOPS   MFLOPS      Cos      Exp    Fixpt       If    Equal
                     1        2        3     MOPS     MOPS     MOPS     MOPS     MOPS

  1T   2146.7    530.1    530.1    397.2     60.5     27.3   7451.7   2240.2    998.1
  2T   4290.4   1056.0   1055.3    794.0    120.9     54.7  14859.4   4488.5   1995.2
  4T   8583.9   2115.8   2113.4   1590.5    241.8    109.3  29265.9   8940.7   3984.5
  8T   8806.6   2676.0   2140.1   1627.3    244.8    113.0  37995.0  11565.4   4097.5

  Overall Seconds 5.00 1T, 5.01 2T, 5.02 4T, 10.10 8T

  All calculations produced consistent numeric results

Pi 5 GCC 8

  MP-Whetstone Benchmark 64 Bit gcc 8 Mon Aug 14 10:09:58 2023

                    Using 1, 2, 4 and 8 Threads

        MWIPS   MFLOPS   MFLOPS   MFLOPS      Cos      Exp    Fixpt       If    Equal
                     1        2        3     MOPS     MOPS     MOPS     MOPS     MOPS

  1T   6138.4   1278.2   1278.2   1020.4    174.1     94.8  17273.2   7033.6   2394.9
  2T  12198.6   2542.8   2549.5   2029.7    344.4    188.4  35246.9  14307.3   4794.1
  4T  24008.3   5013.1   4683.8   4045.3    674.5    374.4  69938.6  28558.3   9381.9
  8T  24768.0   5170.6   5867.3   4080.9    693.9    385.9  74272.7  30002.8   9478.1

  Overall Seconds 5.00 1T, 5.04 2T, 5.22 4T, 10.37 8T

  All calculations produced consistent numeric results

P5/P4 Comparison

  1T     2.86     2.41     2.41     2.57     2.88     3.47     2.32     3.14     2.40
  2T     2.84     2.41     2.42     2.56     2.85     3.44     2.37     3.19     2.40
  4T     2.80     2.37     2.22     2.54     2.79     3.43     2.39     3.19     2.35
  8T     2.81     1.93     2.74     2.51     2.83     3.42     1.95     2.59     2.31

Pi 5 GCC 12

  MP-Whetstone Benchmark 64 Bit gcc 12 Thu Sep 28 21:58:24 2023

                    Using 1, 2, 4 and 8 Threads

        MWIPS   MFLOPS   MFLOPS   MFLOPS      Cos      Exp    Fixpt       If    Equal
                     1        2        3     MOPS     MOPS     MOPS     MOPS     MOPS

  1T   6180.4   1279.0   1273.5   1028.0    173.8     96.7  17586.5   7187.4   2396.5
  2T  12353.4   2550.4   2556.9   2049.9    347.7    193.3  35875.6  14220.6   4796.8
  4T  24647.0   5100.9   5078.2   4106.7    695.5    385.9  63256.4  28609.7   9549.0
  8T  25053.6   5121.0   5293.6   4174.6    706.8    386.4  78259.8  31001.5   9658.4

  Overall Seconds 5.00 1T, 5.01 2T, 5.06 4T, 10.10 8T

Pi 5 GCC 12/8

  1T     1.01     1.00     1.00     1.01     1.00     1.02     1.02     1.02     1.00
  2T     1.01     1.00     1.00     1.01     1.01     1.03     1.02     0.99     1.00
  4T     1.03     1.02     1.08     1.02     1.03     1.03     0.90     1.00     1.02
  8T     1.01     0.99     0.90     1.02     1.02     1.00     1.05     1.03     1.02


MP-Dhrystone Benchmark - MP-DHRYPi64g8 and g12

This executes multiple copies of the same program, but with some shared data, leading to unacceptable multithreaded performance. Results are in VAX MIPS aka DMIPS.

Using the GCC 8 version, the Pi 5 performance was 2.27 times faster than the Pi 4, achieving 7.67 DMIPS/MHz. The GCC 12 compilation was slightly faster than the former, running on the Pi 5.
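The ratings quoted above can be checked with a little arithmetic, sketched below in Python. It relies on the conventional definition of 1 VAX MIPS as 1757 Dhrystones per second (the VAX 11/780 score) and on the 2400 MHz Pi 5 clock; the variable and function names are illustrative only.

```python
VAX_DHRYSTONES_PER_SEC = 1757  # VAX 11/780 score that defines 1 VAX MIPS (DMIPS)
PI5_MHZ = 2400                 # Pi 5 CPU clock


def vax_mips(dhrystones_per_sec):
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC


# Single thread figures from the results
gcc12_vax_mips = vax_mips(35046385)   # Pi 5 GCC 12 Dhrystones per second
gcc8_dmips_per_mhz = 18415 / PI5_MHZ  # Pi 5 GCC 8 VAX MIPS rating per MHz
gcc12_over_gcc8 = 19947 / 18415       # GCC 12/8 single thread ratio

print(f"GCC 12 VAX MIPS {gcc12_vax_mips:.0f}")
print(f"GCC 8 DMIPS/MHz {gcc8_dmips_per_mhz:.2f}")
print(f"GCC 12/8 {gcc12_over_gcc8:.2f}")
```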

Pi 4 GCC 8

  MP-Dhrystone Benchmark 64 Bit gcc 8 Tue May 26 11:41:49 2020

                    Using 1, 2, 4 and 8 Threads

  Threads                        1         2         4         8
  Seconds                     0.55      1.08      2.15       4.3
  Dhrystones per Second    1.5E+07   1.5E+07   1.5E+07   1.5E+07
  VAX MIPS rating             8271      8419      8478      8465

  Internal pass count correct all threads

  End of test Tue May 26 11:41:57 2020

Pi 5 GCC 8

  MP-Dhrystone Benchmark 64 Bit gcc 8 Mon Aug 14 10:16:15 2023

                    Using 1, 2, 4 and 8 Threads

  Threads                        1         2         4         8
  Seconds                     0.62      1.88      4.18      8.45   Pi5/Pi4
  Dhrystones per Second    3.2E+07   2.1E+07   1.9E+07   1.9E+07
  VAX MIPS rating            18415     12137     10899     10771      2.27

  Internal pass count correct all threads

  End of test Mon Aug 14 10:16:31 2023

Pi 5 GCC 12

  MP-Dhrystone Benchmark 64 Bit gcc 12 Thu Sep 28 22:03:10 2023

                    Using 1, 2, 4 and 8 Threads

  Threads                        1         2         4         8
  Seconds                     0.57      1.95      4.31      8.70   Pi 5 GCC 12/8
  Dhrystones per Second   35046385  20477300  18570390  18398880
  VAX MIPS rating            19947     11655     10569     10472      1.08

  Internal pass count correct all threads

  End of test Thu Sep 28 22:03:26 2023


MP SP NEON Linpack Benchmark - linpackMPNeonPi64g8 and g12

This was produced to show that the original Linpack benchmark was completely unsuitable for benchmarking multiple CPUs or cores, and this is reflected in the results. The program uses NEON intrinsic functions, with increasing data sizes. Single core performance ratios are provided below for the three different memory array sizes that use N x N x 4 bytes or 40 KB, 1 MB and 4 MB. The three Pi 5/Pi 4 performance ratios were 2.94, 5.24, and 4.13 times. Maximum single core speed was 6.85 GFLOPS.
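The array sizes and performance ratios quoted above follow directly from the figures in the results, as the short Python sketch below illustrates. It uses the unthreaded (Threads None) MFLOPS values for the GCC 8 runs; the names are illustrative only.

```python
BYTES_PER_WORD = 4  # single precision


def matrix_bytes(n):
    # An N x N single precision matrix occupies N x N x 4 bytes
    return n * n * BYTES_PER_WORD


sizes = {n: matrix_bytes(n) for n in (100, 500, 1000)}

# Unthreaded MFLOPS from the GCC 8 results
pi4_mflops = {100: 2167.70, 500: 1438.27, 1000: 394.99}
pi5_mflops = {100: 6375.62, 500: 7536.07, 1000: 1631.94}

ratios = {n: round(pi5_mflops[n] / pi4_mflops[n], 2) for n in pi4_mflops}

for n in sizes:
    print(f"N {n}: {sizes[n]} bytes, Pi5/Pi4 {ratios[n]}")
```

The three sizes correspond to the 40 KB, 1 MB and 4 MB quoted, taking 1 KB as 1000 bytes.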

Two out of three of the new GCC 12 compilations produced slower performance on the Pi 5 and completely different numeric sumchecks.

Pi 4 GCC 8

  Linpack Single Precision MultiThreaded Benchmark
  NEON Intrinsics 64 Bit gcc 8, Tue May 26 11:43:46 2020

    MFLOPS 0 to 4 Threads, N 100, 500, 1000

  Threads      None        1        2        4

  N  100    2167.70    91.82    89.65    89.96
  N  500    1438.27   644.85   635.89   635.33
  N 1000     394.99   376.97   383.92   384.19

  NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

  N               100              500             1000

  NR             1.97             5.40            13.51
  RE   4.69621336e-05   6.44138840e-04   3.22485110e-03
  MA   1.19209290e-07   1.19209290e-07   1.19209290e-07
  X0  -1.31130219e-05   5.79357147e-05  -3.08930874e-04
  XN  -1.30534172e-05   3.51667404e-05   1.90019608e-04
  Thread
  0 - 4  Same Results     Same Results     Same Results

Pi 5 GCC 8

  Linpack Single Precision MultiThreaded Benchmark
  NEON Intrinsics 64 Bit gcc 8, Mon Aug 14 10:22:53 2023

    MFLOPS 0 to 4 Threads, N 100, 500, 1000

  Threads      None        1        2        4  Pi5/Pi4

  N  100    6375.62   154.59   151.48   150.82     2.94
  N  500    7536.07  2250.75  2263.15  2222.61     5.24
  N 1000    1631.94  1452.80  1401.29  1298.10     4.13

  NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

  N               100              500             1000

  NR             1.97             5.40            13.51
  RE   4.69621336e-05   6.44138840e-04   3.22485110e-03
  MA   1.19209290e-07   1.19209290e-07   1.19209290e-07
  X0  -1.31130219e-05   5.79357147e-05  -3.08930874e-04
  XN  -1.30534172e-05   3.51667404e-05   1.90019608e-04
  Thread
  0 - 4  Same Results     Same Results     Same Results

Pi 5 GCC 12

  Linpack Single Precision MultiThreaded Benchmark
  NEON Intrinsics 64 Bit gcc 12, Thu Sep 28 22:05:37 2023

    MFLOPS 0 to 4 Threads, N 100, 500, 1000

  Threads      None        1        2        4  Pi 5 GCC 12/8

  N  100    5461.61   169.27   176.25   174.14     0.86
  N  500    6853.70  2538.16  2554.26  2562.31     0.91
  N 1000    1741.83  1486.68  1493.84  1501.34     1.07

  NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

  N               100              500             1000

  NR             2.17             5.42             9.50
  RE   5.16722466e-05   6.46698638e-04   2.26586126e-03
  MA   1.19209290e-07   1.19209290e-07   1.19209290e-07
  X0  -2.38418579e-07  -5.54323196e-05  -1.26898289e-04
  XN  -5.06639481e-06  -4.70876694e-06   1.41978264e-04
  Thread
  0 - 4  Same Results     Same Results     Same Results


MP BusSpeed (read only) Benchmark - MP-BusSpd2Pi64g8 and g12

For further details see the single core BusSpeed Benchmark that obtains the same order of GCC 8 results as the single thread of this MP version. For the latter, each thread exercises a dedicated segment of the data, circulated on a round robin basis, reading all data every pass.

Considering the most important GCC 8 Rdall tests, Pi5/Pi4 performance gains mainly approached three times for cache based data but multithreaded application showed gains up to 9.47 times. Highest gains of up to 18.17 times were in other areas. The high gains are due to improved caching on a read only basis.

The early Pi 5 GCC 12/8 comparisons indicated similar performance, but the ratios increased progressively as more data was being read, reaching more than five times on RdAll. Here, single thread data transfer speeds reached nearly 68 GB/second and 4 thread up to 150 GB/second. This led to me writing a new program, New INTitHOT Integer Stress Test, where it is shown that GCC 12 produced highly efficient SIMD vector instructions.

Pi 4 GCC 8

  MP-BusSpd 64 Bit gcc 8 Tue May 26 11:51:30 2020

    MB/Second Reading Data, 1, 2, 4 and 8 Threads

      KB  Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

    12.3   5168   5542   5641   4205   4095   4230
      2T   8968  10728  10161   8110   8058   8368
      4T   7874  13255  15586  13641  15485  16533
      8T   8186  13386  15239  13469  14431  16372
   122.9    598    927   1876   2792   3746   4059
      2T    514    719   1538   4846   7596   8083
      4T    486    933   2060   4126   8175  13690
      8T    483    937   2059   4160   8166  13817
   12288    224    257    488    964   1933   3579
      2T    219    427    889   1832   3493   5371
      4T    280    353    562    859   2168   3286
      8T    229    230    527   1075   1880   4480

  No Errors Found

  End of test Tue May 26 11:51:43 2020

Pi 5 GCC 8                                           Pi 5/4 GCC 8

  MP-BusSpd 64 Bit gcc 8 Mon Aug 14 10:37:37 2023

    MB/Second Reading Data, 1, 2, 4 and 8 Threads

      KB  Inc32  Inc16   Inc8   Inc4   Inc2  RdAll  Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

    12.3   9289   9450  15464  12578  12443  12073   1.80   1.71   2.74   2.99   3.04   2.85
      2T  11465  15018  23403  20058  22357  22997   1.28   1.40   2.30   2.47   2.77   2.75
      4T   8757  11343  21200  26582  32854  42575   1.11   0.86   1.36   1.95   2.12   2.58
      8T   9036   8602  11448  17821  26795  30949   1.10   0.64   0.75   1.32   1.86   1.89
   122.9   2358   4293   7257  11306  11657  11609   3.94   4.63   3.87   4.05   3.11   2.86
      2T   4466   7819  13844  21220  23109  23119   8.69  10.87   9.00   4.38   3.04   2.86
      4T   8831  14835  20781  42375  45809  44669  18.17  15.90  10.09  10.27   5.60   3.26
      8T   7011  11818  19792  34990  39720  43742  14.52  12.61   9.61   8.41   4.86   3.17
   12288    654    884   1585   3502   7243  10088   2.92   3.44   3.25   3.63   3.75   2.82
      2T    726    743   1303   3454   7723  18286   3.32   1.74   1.47   1.89   2.21   3.40
      4T    735   1551   1405   5166  10906  31106   2.63   4.39   2.50   6.01   5.03   9.47
      8T    771    933   1486   3197   9182  18377   3.37   4.06   2.82   2.97   4.88   4.10

  No Errors Found

  End of test Mon Aug 14 10:37:49 2023

Pi 5 GCC 12                                          Pi 5 GCC 12/8

  MP-BusSpd 64 Bit gcc 12 Thu Sep 28 22:11:28 2023

    MB/Second Reading Data, 1, 2, 4 and 8 Threads

      KB  Inc32  Inc16   Inc8   Inc4   Inc2  RdAll  Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

    12.3   9444   9504  16195  17543  27434  67773   1.02   1.01   1.05   1.39   2.20   5.61
      2T  10884  14542  23738  28964  38304  92983   0.95   0.97   1.01   1.44   1.71   4.04
      4T  10566  11790  21233  28439  44074  91129   1.21   1.04   1.00   1.07   1.34   2.14
      8T   8657  10289  12122  19920  30038  45788   0.96   1.20   1.06   1.12   1.12   1.48
   122.9   2380   4359   7261  11627  20970  44300   1.01   1.02   1.00   1.03   1.80   3.82
      2T   4586   7699  13845  22597  40901  73723   1.03   0.98   1.00   1.06   1.77   3.19
      4T   5469  10629  24698  38945  69318 150304   0.62   0.72   1.19   0.92   1.51   3.36
      8T   6902  11176  19387  36720  64760 144651   0.98   0.95   0.98   1.05   1.63   3.31
   12288    632    806   1838   3628   7366  13161   0.97   0.91   1.16   1.04   1.02   1.30
      2T    961    711   1520   3527   5546  13012   1.32   0.96   1.17   1.02   0.72   0.71
      4T    670   1566   3062   5403  13675  19563   0.91   1.01   2.18   1.05   1.25   0.63
      8T    726   1117   2322   4747   9371  17111   0.94   1.20   1.56   1.48   1.02   0.93


MP RandMem Benchmark - MP-RandMemPi64g8 and g12

The benchmark uses the same complex indexing for serial and random access, with separate read only and read/write tests. The performance patterns were as expected. Random access is dependent on the impact of burst reading and writing, producing those slow speeds. Read only performance increased, as expected, relative to the thread count, with that for read/write remaining constant at a particular data size, probably due to write back to shared data space.

Again the new Pi 5 caching arrangement produced high performance gains over the Pi 4, via GCC 8 compilations. In this case they were between 4 and 18 times. Others were between 2 and 3 times for cache based data and half that from RAM.

Performance from the GCC 12 version was little different to that from GCC 8.

Pi 4 GCC 8

  MP-RandMem 64 Bit gcc 8 Tue May 26 11:53:37 2020

    MB/Second Using 1, 2, 4 and 8 Threads

      KB      SerRD  SerRW  RndRD  RndRW

    12.3 1T    5945   7898   5948   7895
         2T   11913   7937  11905   7929
         4T   23601   7875  23385   7867
         8T   23139   7777  23016   7770
   122.9 1T    5785   7090   2026   1977
         2T   10941   7074   1654   1968
         4T   10364   7052   1854   1970
         8T   10256   7031   1844   1973
   12288 1T    3861   1244    180    169
         2T    3793   1242    220    171
         4T    3941   1100    343    170
         8T    4065   1247    351    171

  No Errors Found

  End of test Tue May 26 11:54:20 2020

Pi 5 GCC 8                               Pi 5/4 GCC 8

  MP-RandMem 64 Bit gcc 8 Mon Aug 14 10:45:21 2023

    MB/Second Using 1, 2, 4 and 8 Threads

      KB      SerRD  SerRW  RndRD  RndRW  SerRD  SerRW  RndRD  RndRW

    12.3 1T   18593  18938  17858  17066   3.13   2.40   3.00   2.16
         2T   32655  18759  32998  16990   2.74   2.36   2.77   2.14
         4T   47087  18905  45181  17027   2.00   2.40   1.93   2.16
         8T   34725  18602  33955  17087   1.50   2.39   1.48   2.20
   122.9 1T   15501  16259  10950   9853   2.68   2.29   5.40   4.98
         2T   29970  16392  21177   9921   2.74   2.32  12.80   5.04
         4T   51762  16408  33068   9781   4.99   2.33  17.84   4.96
         8T   46575  15741  27979   9235   4.54   2.24  15.17   4.68
   12288 1T   12227   1729    538    328   3.17   1.39   2.99   1.94
         2T   16713   1724    617    311   4.41   1.39   2.80   1.82
         4T   16771   1825    722    312   4.26   1.66   2.10   1.84
         8T   13124   1739    622    319   3.23   1.39   1.77   1.87

  No Errors Found

  End of test Mon Aug 14 10:46:01 2023

Pi 5 GCC 12                              Pi 5 GCC 12/8

  MP-RandMem 64 Bit gcc 12 Thu Sep 28 22:15:02 2023

    MB/Second Using 1, 2, 4 and 8 Threads

      KB      SerRD  SerRW  RndRD  RndRW  SerRD  SerRW  RndRD  RndRW

    12.3 1T   18667  19102  18108  17246    1.0    1.0    1.0    1.0
         2T   34841  19037  33292  16912    1.1    1.0    1.0    1.0
         4T   47204  18694  46771  17137    1.0    1.0    1.0    1.0
         8T   35115  18676  34015  17230    1.0    1.0    1.0    1.0
   122.9 1T   15826  16395  10993   9928    1.0    1.0    1.0    1.0
         2T   30566  16400  21397   9940    1.0    1.0    1.0    1.0
         4T   56413  16361  38355   9921    1.1    1.0    1.2    1.0
         8T   54596  16372  37617   9889    1.2    1.0    1.3    1.1
   12288 1T   13622   1902    539    343    1.1    1.1    1.0    1.0
         2T   20937   1830    603    345    1.3    1.1    1.0    1.1
         4T   26993   1892    682    343    1.6    1.0    0.9    1.1
         8T   18621   1797    650    347    1.4    1.0    1.0    1.1

  No Errors Found

  End of test Thu Sep 28 22:15:42 2023


MP-MFLOPSPi64g8 and g12, MP-MFLOPSPi64DPg8 and g12

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. There are two varieties, single precision and double precision, both attempting to show near maximum MP floating point processing speeds.
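To make the form of the calculation concrete, here is a minimal Python sketch of the terms shown in the expression above, together with a hypothetical way of dividing the data between threads. Only the three bracketed terms quoted in the text are reproduced; the remainder of the 32 operation expression is not given here and is not guessed at. The constants, the function names and the even contiguous split are all assumptions made for illustration, not the benchmark's own code.

```python
def visible_terms(x, a, b, c, d, e, f):
    # Only the terms quoted in the text; the full 32 operation
    # expression continues beyond this and is not reproduced here.
    return [(xi + a) * b - (xi + c) * d + (xi + e) * f for xi in x]


# Operations in the quoted fragment: 3 adds inside the brackets,
# 3 multiplies, 1 subtract and 1 add joining the terms
OPS_IN_VISIBLE_TERMS = 3 + 3 + 1 + 1


def segments(n_words, n_threads):
    # Hypothetical even split: one contiguous segment of words per thread
    size = n_words // n_threads
    return [(t * size, (t + 1) * size) for t in range(n_threads)]


# 12.8 KB of 4 byte words is 3200 words, shared here between 4 threads
print(segments(3200, 4))
print(visible_terms([1.0], 1.0, 2.0, 1.0, 1.0, 1.0, 0.5))
```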

At a given precision, result sumchecks should be identical when using the same run time parameters. Here, gcc 12 compiled programs were run using parameters that produce longer running times, with different sumchecks to those from earlier versions.

These are all short tests, running at full MHz with low increases in temperature. Results at 12.8 and 128 KB all demonstrate some near doubling of performance with twice as many threads. Maximum GCC 12 Pi 5 SP 4 thread performance was 84.9 GFLOPS, with DP at 42.5 GFLOPS, and slightly less via GCC 8. See next page for comments on comparisons.

Pi 4 GCC 8

  MP-MFLOPS 64 Bit gcc 8 Tue May 26 12:01:44 2020

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800  Maximum  MFLOPS
  MFLOPS                                              GFLOPS per MHz

  1T       3212   3162    416    6741   6720   6393      6.7     4.5
  2T       6343   5109    565   13381  13376   9914     13.4     8.9
  4T      11644   5077    584   25506  26028   9883     26.0    17.4
  8T       7804   7953    579   20537  24446   8651

  Results x 100000, 0 indicates ERRORS

  1T      76406  97075  99969   66015  95363  99951
  2T      76406  97075  99969   66015  95363  99951
  4T      76406  97075  99969   66015  95363  99951
  8T      76406  97075  99969   66015  95363  99951

  End of test Tue May 26 12:01:46 2020

Pi 5 GCC 8

  MP-MFLOPS 64 Bit gcc 8 Mon Aug 14 11:16:36 2023

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800  Maximum  MFLOPS
  MFLOPS                                              GFLOPS per MHz

  1T       9309   8856    540   20396  19543  11710     19.5     8.1
  2T      17114  18565    683   35842  40506  11937     40.5    16.9
  4T      29453  34610    826   75120  77896  12646     77.9    32.5
  8T      28688  31506    959   59804  57700  15374

  Results x 100000, 0 indicates ERRORS

  1T      76406  97075  99969   66015  95363  99951
  2T      76406  97075  99969   66015  95363  99951
  4T      76406  97075  99969   66015  95363  99951
  8T      76406  97075  99969   66015  95363  99951

  End of test Mon Aug 14 11:16:37 2023

Pi 5/4 GCC8

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800

  1T       2.90   2.80   1.30    3.03   2.91   1.83
  2T       2.70   3.63   1.21    2.68   3.03   1.20
  4T       2.53   6.82   1.41    2.95   2.99   1.28
  8T       3.68   3.96   1.66    2.91   2.36   1.78

Pi 5 GCC 12

  MP-MFLOPS2 64 Bit gcc 12 Tue Oct 3 09:52:45 2023

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800  Maximum  MFLOPS
  MFLOPS                                              GFLOPS per MHz

  1T      10549  10320   1116   21519  21452  16879     21.5     9.0
  2T      19881  20929    982   42488  43002  14280     43.0    17.9
  4T      33400  40206    929   80947  84933  14772     84.9    35.4
  8T      33448  37854   1093   77117  85086  17371

  Results x 100000, 0 indicates ERRORS

  1T      40015  44934  98519   35186  36769  97639
  2T      40015  44934  98519   35186  36769  97639
  4T      40015  44934  98519   35186  36769  97639
  8T      40015  44934  98519   35186  36769  97639

  End of test Tue Oct 3 09:53:21 2023

Pi 5 GCC 12/8

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800

  1T       1.09   1.05   1.11    1.03   1.09   1.00
  2T       1.12   0.98   0.98    1.15   0.94   0.89
  4T       1.09   1.13   0.99    0.88   0.89   1.01
  8T       0.85   0.85   1.02    0.97   1.07   0.98

Double Precision Results and More Comments below

With the running times being relatively short, individual comparison ratios might not be accurate, so averages have been calculated. Pi5/Pi4 GCC 8 ratios were between 2.36 and 6.82 times (average 3.18) with cached data, then between 1.10 and 1.83 (average 1.42) from RAM. The improved Pi 5 cache sizes lead to the higher ratios. Longer running stress tests provide more reliable performance indications.
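The ranges and averages quoted above can be reproduced by pooling the Pi 5/4 GCC 8 ratios from the single precision and double precision results, treating the 12.8 KB and 128 KB columns as cached data and the 12800 KB columns as RAM. The Python sketch below shows that calculation; the grouping into cached and RAM columns is an interpretation of the text, and the names are illustrative.

```python
from statistics import mean

# Pi 5/4 GCC 8 ratios, rows are 1T, 2T, 4T and 8T.
# Columns: 2 Ops/Word at 12.8, 128, 12800 KB, then 32 Ops/Word likewise.
single_precision = [
    [2.90, 2.80, 1.30, 3.03, 2.91, 1.83],
    [2.70, 3.63, 1.21, 2.68, 3.03, 1.20],
    [2.53, 6.82, 1.41, 2.95, 2.99, 1.28],
    [3.68, 3.96, 1.66, 2.91, 2.36, 1.78],
]
double_precision = [
    [2.93, 2.60, 1.10, 3.10, 3.02, 1.52],
    [2.60, 3.32, 1.25, 3.08, 2.87, 1.22],
    [2.51, 5.29, 1.41, 3.08, 3.29, 1.33],
    [2.61, 3.89, 1.66, 3.20, 2.50, 1.63],
]

RAM_COLUMNS = (2, 5)  # the two 12800 KB columns

cached, ram = [], []
for table in (single_precision, double_precision):
    for row in table:
        for column, ratio in enumerate(row):
            (ram if column in RAM_COLUMNS else cached).append(ratio)

print(f"cached {min(cached)} to {max(cached)}, average {mean(cached):.2f}")
print(f"RAM    {min(ram)} to {max(ram)}, average {mean(ram):.2f}")
```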

GCC 12/8 averages indicated similar single precision performance, with a slight gain for the newer compiler on double precision calculations.

Pi 4 GCC 8

  MP-MFLOPS 64 Bit gcc 8 Double Precision Tue May 26 12:11:50 2020

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800  Maximum  MFLOPS
  MFLOPS                                              GFLOPS per MHz

  1T       1591   1587    269    3386   3379   3240      3.4     2.3
  2T       3228   2803    267    6728   6711   4556      6.7     4.5
  4T       5870   3284    283   12812  12866   4940     12.9     8.6
  8T       5506   4063    277   12077  11538   4695

  Results x 100000, 0 indicates ERRORS

  1T      76384  97072  99969   66065  95370  99951
  2T      76384  97072  99969   66065  95370  99951
  4T      76384  97072  99969   66065  95370  99951
  8T      76384  97072  99969   66065  95370  99951

  End of test Tue May 26 12:11:52 2020

Pi 5 GCC 8

  MP-MFLOPS 64 Bit gcc 8 Double Precision Mon Aug 14 11:18:26 2023

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800  Maximum  MFLOPS
  MFLOPS                                              GFLOPS per MHz

  1T       4661   4127    296   10498  10217   4938     10.2     4.3
  2T       8408   9292    333   20699  19275   5579     19.3     8.0
  4T      14723  17372    399   39480  42352   6572     42.4    17.6
  8T      14387  15799    461   38706  28821   7667

  Results x 100000, 0 indicates ERRORS

  1T      76384  97072  99969   66065  95370  99951
  2T      76384  97072  99969   66065  95370  99951
  4T      76384  97072  99969   66065  95370  99951
  8T      76384  97072  99969   66065  95370  99951

  End of test Mon Aug 14 11:18:27 2023

Pi 5/4 GCC8

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800

  1T       2.93   2.60   1.10    3.10   3.02   1.52
  2T       2.60   3.32   1.25    3.08   2.87   1.22
  4T       2.51   5.29   1.41    3.08   3.29   1.33
  8T       2.61   3.89   1.66    3.20   2.50   1.63

Pi 5 GCC 12 DP

  MP-MFLOPS2 64 Bit gcc 12 Double Precision Tue Oct 3 10:00:48 2023

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800  Maximum  MFLOPS
  MFLOPS                                              GFLOPS per MHz

  1T       4713   4740    562   10748  10727   8440     10.7     4.5
  2T       9355   9554    491   21389  21515   7875     21.5     9.0
  4T      17485  18403    468   41704  42464   7499     42.5    17.7
  8T      16645  18592    543   41049  41910   8596

  Results x 100000, 0 indicates ERRORS

  1T      39991  44914  98518   35119  36721  97642
  2T      39991  44914  98518   35119  36721  97642
  4T      39991  44914  98518   35119  36721  97642
  8T      39991  44914  98518   35119  36721  97642

  End of test Tue Oct 3 10:01:24 2023

Pi 5 GCC 12/8

               2 Ops/Word            32 Ops/Word
  KB       12.8    128  12800    12.8    128  12800

  1T       1.01   1.15   1.90    1.02   1.05   1.71
  2T       1.11   1.03   1.47    1.03   1.12   1.41
  4T       1.19   1.06   1.17    1.06   1.00   1.14
  8T       1.16   1.18   1.18    1.06   1.45   1.12


OpenMP-MFLOPS - OpenMP-MFLOPS64g8 and g12, notOpenMP-MFLOPS64g8 and g12

This benchmark carries out the same single precision calculations as the MP-MFLOPS Benchmarks but, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive. Again, gcc 12 compilations were run for longer times that resulted in different “First Results” sumchecks.

In this case, data sizes used were 400 KB, 4 MB and 40 MB where, with the Pi 5, only the first would be expected to provide a full service from L1 or L2 caches and the second with possible impact of L3 cache. With the GCC 8 full OpenMP version, Pi5/Pi4 performance gains were around 3.0 times at 8 and 32 Operations per word at 400 KB, with most others lower due to data size or fewer operations. At 400 KB Pi 5 GCC 12 performance was 3.2 times faster than GCC 8 at 2 operations per word and slightly faster on the other measurements.

Maximum 4 core performance was 80.1 GFLOPS from GCC 12, at 3.73 times that for a single core, nearly the same as that for MP-MFLOPS.
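The MFLOPS values in the results are consistent with words x operations per word x repeat passes, divided by the running time. The Python sketch below applies that relationship to the GCC 12 runs at 100000 words and 32 operations per word, then derives the 4 core gain quoted above. The formula is inferred from the table columns rather than taken from the benchmark source, and the names are illustrative; small differences from the reported MFLOPS are presumably due to the seconds being shown to three decimal places.

```python
def mflops(words, ops_per_word, passes, seconds):
    # Millions of floating point operations per second for one test
    return words * ops_per_word * passes / seconds / 1e6


# Pi 5 GCC 12, 100000 words (400 KB), 32 operations per word, 50000 passes
openmp_mflops = mflops(100_000, 32, 50_000, 1.997)       # table reports 80104
single_core_mflops = mflops(100_000, 32, 50_000, 7.449)  # table reports 21479

# Multi core gain from the reported figures
mp_gain = 80104 / 21479

print(f"OpenMP {openmp_mflops:.0f} MFLOPS")
print(f"Single core {single_core_mflops:.0f} MFLOPS")
print(f"MP/notMP {mp_gain:.2f}")
```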

Pi 4 GCC 8

  OpenMP MFLOPS64g8 Tue May 26 12:06:36 2020

  Test             4 Byte  Ops/  Repeat    Secs  MFLOPS    First   All    MP/
                    Words  Word  Passes                  Results  Same  notMP

  Data in & out    100000     2    2500   0.093    5389  0.92954   Yes   1.64
  Data in & out   1000000     2     250   0.795     629  0.99255   Yes   1.21
  Data in & out  10000000     2      25   0.784     638  0.99925   Yes   1.00

  Data in & out    100000     8    2500   0.115   17455  0.95712   Yes   3.11
  Data in & out   1000000     8     250   0.798    2507  0.99552   Yes   1.16
  Data in & out  10000000     8      25   0.880    2273  0.99955   Yes   0.95

  Data in & out    100000    32    2500   0.332   24068  0.89022   Yes   3.54
  Data in & out   1000000    32     250   0.849    9418  0.98809   Yes   1.45
  Data in & out  10000000    32      25   0.933    8571  0.99880   Yes   1.31

  End of test Tue May 26 12:06:42 2020

Pi 5 GCC 8

  OpenMP MFLOPS64g8 Mon Aug 14 12:08:35 2023

  Test             4 Byte  Ops/  Repeat    Secs  MFLOPS    First   All  Pi5/4    MP/
                    Words  Word  Passes                  Results  Same   GCC8  notMP

  Data in & out    100000     2    2500   0.054    9204  0.92954   Yes   1.71   1.00
  Data in & out   1000000     2     250   0.439    1140  0.99255   Yes   1.81   0.80
  Data in & out  10000000     2      25   0.618     809  0.99925   Yes   1.27   1.09

  Data in & out    100000     8    2500   0.038   52914  0.95712   Yes   3.03   2.92
  Data in & out   1000000     8     250   0.410    4880  0.99552   Yes   1.95   0.83
  Data in & out  10000000     8      25   0.664    3014  0.99955   Yes   1.33   1.00

  Data in & out    100000    32    2500   0.112   71522  0.89022   Yes   2.97   3.60
  Data in & out   1000000    32     250   0.424   18865  0.98809   Yes   2.00   1.07
  Data in & out  10000000    32      25   0.622   12853  0.99880   Yes   1.50   0.93

  End of test Mon Aug 14 12:08:38 2023

Pi 5 GCC 12

  OpenMP MFLOPSL64g12 Tue Oct 3 16:27:53 2023

  Test             4 Byte  Ops/  Repeat    Secs  MFLOPS    First   All    Pi 5    MP/
                    Words  Word  Passes                  Results  Same  GCC 12/8  notMP

  Data in & out    100000     2   50000   0.339   29459  0.44935   Yes    3.20   3.10
  Data in & out   1000000     2    5000   7.021    1424  0.86736   Yes    1.25   0.82
  Data in & out  10000000     2     500  12.322     812  0.98519   Yes    1.00   0.80

  Data in & out    100000     8   50000   0.634   63086  0.60398   Yes    1.19   3.46
  Data in & out   1000000     8    5000   6.956    5750  0.91822   Yes    1.18   0.88
  Data in & out  10000000     8     500  12.360    3236  0.99109   Yes    1.07   0.80

  Data in & out    100000    32   50000   1.997   80104  0.36770   Yes    1.12   3.73
  Data in & out   1000000    32    5000   6.891   23219  0.79898   Yes    1.23   1.18
  Data in & out  10000000    32     500  12.294   13015  0.97639   Yes    1.01   0.79

  End of test Tue Oct 3 16:28:54 2023
Single Core Results below

For the single core benchmark, some Pi5/Pi4 GCC 8 comparisons were different to those above, mostly between 2.70 and 3.22, with lower gains of 1.16 to 2.12 at the largest data size. Maximum performance was nearly 21.5 GFLOPS.

Pi 4 GCC 8

 notOpenMP MFLOPS64g8 Tue May 26 12:07:34 2020

 Test             4 Byte  Ops/  Repeat     Secs  MFLOPS    First   All
                   Words  Word  Passes                   Results  Same

 Data in & out    100000     2    2500    0.153    3278  0.92954   Yes
 Data in & out   1000000     2     250    0.966     518  0.99255   Yes
 Data in & out  10000000     2      25    0.782     640  0.99925   Yes
 Data in & out    100000     8    2500    0.356    5612  0.95712   Yes
 Data in & out   1000000     8     250    0.926    2160  0.99552   Yes
 Data in & out  10000000     8      25    0.840    2381  0.99955   Yes
 Data in & out    100000    32    2500    1.176    6800  0.89022   Yes
 Data in & out   1000000    32     250    1.228    6515  0.98809   Yes
 Data in & out  10000000    32      25    1.225    6529  0.99880   Yes

 End of test Tue May 26 12:07:42 2020

Pi 5 GCC 8

 notOpenMP MFLOPS64g8 Mon Aug 14 12:04:30 2023

 Test             4 Byte  Ops/  Repeat     Secs  MFLOPS    First   All  Pi5/4
                   Words  Word  Passes                   Results  Same   GCC8

 Data in & out    100000     2    2500    0.054    9236  0.92954   Yes   2.82
 Data in & out   1000000     2     250    0.350    1429  0.99255   Yes   2.76
 Data in & out  10000000     2      25    0.675     740  0.99925   Yes   1.16
 Data in & out    100000     8    2500    0.111   18092  0.95712   Yes   3.22
 Data in & out   1000000     8     250    0.340    5888  0.99552   Yes   2.73
 Data in & out  10000000     8      25    0.666    3002  0.99955   Yes   1.26
 Data in & out    100000    32    2500    0.402   19891  0.89022   Yes   2.93
 Data in & out   1000000    32     250    0.456   17563  0.98809   Yes   2.70
 Data in & out  10000000    32      25    0.579   13810  0.99880   Yes   2.12

 End of test Mon Aug 14 12:04:33 2023

Pi 5 GCC 12

 notOpenMP MFLOPSL64g12 Tue Oct  3 16:31:00 2023

 Test             4 Byte  Ops/  Repeat     Secs  MFLOPS    First   All     Pi 5
                   Words  Word  Passes                   Results  Same GCC 12/8

 Data in & out    100000     2   50000    1.053    9493  0.44935   Yes     1.03
 Data in & out   1000000     2    5000    5.732    1745  0.86736   Yes     1.22
 Data in & out  10000000     2     500    9.859    1014  0.98519   Yes     1.37
 Data in & out    100000     8   50000    2.194   18228  0.60398   Yes     1.01
 Data in & out   1000000     8    5000    6.121    6535  0.91822   Yes     1.11
 Data in & out  10000000     8     500    9.872    4052  0.99109   Yes     1.35
 Data in & out    100000    32   50000    7.449   21479  0.36770   Yes     1.08
 Data in & out   1000000    32    5000    8.121   19701  0.79898   Yes     1.12
 Data in & out  10000000    32     500    9.698   16498  0.97639   Yes     1.19

 End of test Tue Oct  3 16:32:01 2023


OpenMP-MemSpeed264g8 and g12, NotOpenMP-MemSpeed64g8 and g12

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed64). Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP.

Complete output for the Pi 4 is shown below, but just the first few results for the others. The first two lines of single core results are also included to show that the OpenMP options used were clearly unsuitable for this program.

Pi 4 GCC 8

 Memory Reading Speed Test OpenMP 64 Bit gcc 8 by Roy Longbottom

 Start of test Tue May 26 12:14:39 2020

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 1 Core
       4  15001   4010   4387  15087   4406   4400  11211   9061   9061
       8  15532   3990   4394  15567   4386   4394  11629   9315   9314
 4 Cores
       4   7749   8500   8716   7451   8520   8533  39508  18586  18589
       8   8198   8669   8874   8148   8678   8691  38972  18863  18861
      16   8023   8499   8335   7895   8355   8507  38305  19003  19004
      32   9034   8517   8619   9127   8550   8522  37928  19071  18409
      64   8652   8201   8178   8565   8223   8093  25191  17494  17508
     128  11397  11616  11715  11345  11649  11029  13861  14097  14170
     256  18242  18745  18195  17417  18605  18019  12535  12637  12623
     512  17580  18467  18787  18010  18414  18321  12900  13180  13121
    1024   8043  10172  11540  12510  10220  12082   9800   9586   9857
    2048   4816   6807   6850   6922   6805   6666   3137   3372   3369
    4096   7029   6846   6881   7017   5145   6801   2776   3124   3112
    8192   2428   7085   7124   7068   7134   6904   2571   3092   3112
   16384   7133   7152   7328   7008   3445   7178   2473   3099   3104
   32768   2656   7643   7669   7802   7616   7559   2043   3112   3104
   65536   7995   6523   2572   7059   6514   6485   2431   2955   3036
  131072   1981   7273   7327   1878   3615   7267   2538   2968   2976

 End of test Tue May 26 12:15:06 2020

Pi 5 GCC 8

 Memory Reading Speed Test OpenMP 64 Bit gcc 8 by Roy Longbottom

 Start of test Mon Aug 14 11:42:10 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 1 Core
       4  50151   6872   7511  50254   7170   7181  37548  18867  25383
       8  50904   6848   7485  48915   7202   7487  38102  19038  25477
 4 Cores
       4  31324  14321  12707  28712  14606  21136  27075  18075  18075
       8  28580  13022  13365  32094  14657  21740  26558  13931  16817
      16  27074  19393  19847  32121  19067  24532  35440  24095  23527
      32  37880  31590  31455  34779  32095  29027  37245  22243  24984
      64  23823  29763  30980  30310  28829  28209  23569  27625  24428

 End of test Mon Aug 14 11:42:37 2023

Pi 5 GCC 12

 Memory Reading Speed Test OpenMP 64 Bit gcc 12 by Roy Longbottom

 Start of test Thu Sep 28 22:43:26 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 1 Core
       4  54368  65257  65165  53930  60045  60975  37606  25361  25384
       8  54564  65580  65162  55228  61180  60995  37829  25015  25010
 4 Cores
       4  31314  14584  13443  31523  14625  21373  26964  17800  17883
       8  29471  14672  13405  32067  14677  21719  27561  18585  18540
      16  32013  19352  19797  32164  19549  25666  36645  25085  25423
      32  43228  38115  33331  42989  38653  39254  49341  30903  30892

 End of test Thu Sep 28 22:43:51 2023

Single Core Results below

Single Core Benchmark - Again, a complete output is provided for the Pi 4, plus limited results and comparisons for the others. As expected, the latter are similar to those from the original MemSpeed included above. Here, the maximum Pi 5/4 gain was 13.9 times, at 2048 KB, reflecting L3 cache versus RAM speed.

As before, GCC 12 provided corrections for the GCC 8 fault, now indicating Pi 5 GCC 12/8 performance gains of up to 9.6 times for single precision calculations.

Pi 4 GCC 8

 Memory Reading Speed Test notOpenMP 64 Bit gcc 8 by Roy Longbottom

 Start of test Tue May 26 12:18:16 2020

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  15001   4010   4387  15087   4406   4400  11211   9061   9061
       8  15532   3990   4394  15567   4386   4394  11629   9315   9314
      16  15707   3998   4376  15770   4388   4393  11801   9447   9444
      32  14552   3885   4245  14761   4186   4260  11627   9488   9495
      64  12272   3855   4211  12089   4196   4220   8866   8606   8630
     128  12321   3867   4217  12148   4182   4215   8221   8296   8292
     256  12318   3871   4219  12134   4206   4219   8092   8231   8229
     512  12118   3870   4218  12195   4211   4218   8077   8209   8226
    1024   3224   3738   4032   3701   4009   4066   5387   5529   5331
    2048   1945   3474   3615   1949   3598   3612   2848   2860   2945
    4096   1940   3442   3610   1941   3406   3607   2614   2595   2597
    8192   1951   3425   3637   1954   3617   3644   2595   2581   2583
   16384   1962   3330   3467   1965   3443   3469   2588   2575   2564
   32768   2003   3364   3303   1997   3292   3303   2503   2554   2557
   65536   2005   2588   2937   2011   2930   2621   2577   2565   2566
  131072   2024   2021   2025   2013   2014   2018   2586   2572   2570

 End of test Tue May 26 12:18:42 2020

Pi 5 GCC 8

 Memory Reading Speed Test notOpenMP 64 Bit gcc 8 by Roy Longbottom

 Start of test Mon Aug 14 11:34:27 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  50151   6872   7511  50254   7170   7181  37548  18867  25383
      64  50862   6800   7423  50901   7140   7426  36297  19013  25373
     256  32627   6790   7153  32638   7183   7276  34872  19156  25339
    1024  30004   6804   7283  30354   7171   7122  23523  18525  23493
    8192   2992   6089   5571   2005   5255   6448   4794   5279   5340

 End of test Mon Aug 14 11:34:52 2023

 Pi 5/4 GCC8

       4   3.34   1.71   1.71   3.33   1.63   1.63   3.35   2.08   2.80
      64   4.14   1.76   1.76   4.21   1.70   1.76   4.09   2.21   2.94
     256   2.65   1.75   1.70   2.69   1.71   1.72   4.31   2.33   3.08
    1024   9.31   1.82   1.81   8.20   1.79   1.75   4.37   3.35   4.41
    2048  12.94   1.91   1.98  13.90   1.98   2.04   6.95   5.99   4.05
    8192   1.53   1.78   1.53   1.03   1.45   1.77   1.85   2.05   2.07

Pi 5 GCC 12

 Memory Reading Speed Test notOpenMP 64 Bit gcc 12 by Roy Longbottom

 Start of test Thu Sep 28 22:42:10 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  54368  65257  65165  53930  60045  60975  37606  25361  25384
      64  52501  65304  65319  53250  59544  59850  37508  25373  25401
     256  33354  63081  63764  33718  60298  60351  35597  25397  25398
    2048  22287  52312  53008  22349  50665  49230  11449  12273  16589
    8192   3087   6050   6120   3132   6038   6491   6902   6608   6778

 End of test Thu Sep 28 22:42:35 2023

 Pi 5 GCC 12/8

       4   1.08   9.50   8.68   1.07   8.37   8.49   1.00   1.34   1.00
      64   1.03   9.60   8.80   1.05   8.34   8.06   1.03   1.33   1.00
     256   1.02   9.29   8.91   1.03   8.39   8.29   1.02   1.33   1.00
    2048   0.89   7.88   7.42   0.82   7.10   6.68   0.58   0.72   1.39
    8192   1.03   0.99   1.10   1.56   1.15   1.01   1.44   1.25   1.27


Java Whetstone Benchmark - whetstc.class

The Java benchmarks comprise class files that were produced some time ago, but the source code is available to regenerate them. Performance can vary significantly using different Java Virtual Machines.

Pi 5 performance gains, over the Pi 4, were between 1.94 and 3.81 times.

Pi 4

 Whetstone Benchmark Java Version, May 22 2020, 14:24:09

                                                    1 Pass
 Test                  Result       MFLOPS     MOPS  millisecs

 N1 floating point  -1.124750137       521              0.0369
 N2 floating point  -1.131330490       481              0.2792
 N3 if then else     1.000000000                236     0.4378
 N4 fixed point     12.000000000               1320     0.2386
 N5 sin,cos etc.     0.499110132                 48     1.7348
 N6 floating point   0.999999821       276              1.9520
 N7 assignments      3.000000000                320     0.5772
 N8 exp,sqrt etc.    0.825148463                 25     1.4640

 MWIPS                                1488              6.7205

 Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
 Java Vendor         Debian, Version  11.0.7
 CPU null

Pi 5

 Whetstone Benchmark Java Version, Aug 24 2023, 23:25:17

                                                    1 Pass
 Test                  Result       MFLOPS     MOPS  millisecs  Pi 5/4

 N1 floating point  -1.124750137      1232              0.0156    2.37
 N2 floating point  -1.131330490      1048              0.1282    2.18
 N3 if then else     1.000000000                715     0.1448    3.02
 N4 fixed point     12.000000000               2559     0.1231    1.94
 N5 sin,cos etc.     0.499110132                183     0.4550    3.81
 N6 floating point   0.999999821       554              0.9730    2.00
 N7 assignments      3.000000000                624     0.2960    1.95
 N8 exp,sqrt etc.    0.935364604                 63     0.5920    2.47

 MWIPS                                3666              2.7277    2.46


JavaDraw Benchmark - JavaDrawPi.class

The benchmark uses small to rather excessive simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.

The first runs of this benchmark on the Pi 5 indicated that it was much slower than the Pi 4 on the more demanding functions. Some time later, I reran the benchmark on the Pi 4, using the OS acquired for the Pi 5, and that also produced the slow results. Using this OS, the Pi 5 was, on average, around twice as fast as the Pi 4.

Pi 4

 Java Drawing Benchmark, May 22 2020, 14:25:15
 Produced by javac 1.8.0_222

 Test                              Frames      FPS

 Display PNG Bitmap Twice Pass 1      833    83.26
 Display PNG Bitmap Twice Pass 2     1001   100.05
 Plus 2 SweepGradient Circles         994    99.39
 Plus 200 Random Small Circles        836    83.54
 Plus 320 Long Lines                  380    37.98
 Plus 4000 Random Small Circles        95     9.44

 Total Elapsed Time  60.1 seconds

 Operating System    Linux, Arch. aarch64, Version 4.19.118-v8+
 Java Vendor         Debian, Version  11.0.7
 null, null CPUs

Pi 4

 Java Drawing Benchmark, Dec 2 2023, 10:01:16
 Produced by javac 1.8.0_222

 Test                              Frames      FPS

 Display PNG Bitmap Twice Pass 1      469    46.86
 Display PNG Bitmap Twice Pass 2      561    56.06
 Plus 2 SweepGradient Circles         523    52.21
 Plus 200 Random Small Circles         31     2.97
 Plus 320 Long Lines                   13     1.22
 Plus 4000 Random Small Circles         2     0.18

 Total Elapsed Time  62.5 seconds

 Operating System    Linux, Arch. aarch64, Version 6.1.47-v8+
 Java Vendor         Debian, Version  17.0.8
 null, null CPUs

Pi 5

 Java Drawing Benchmark, Aug 26 2023, 15:06:26
 Produced by javac 1.8.0_222

 Test                              Frames      FPS  Pi5/Pi4

 Display PNG Bitmap Twice Pass 1     1000    99.96     2.13
 Display PNG Bitmap Twice Pass 2     1077   107.66     1.92
 Plus 2 SweepGradient Circles        1010   100.99     1.93
 Plus 200 Random Small Circles         63     6.16     2.07
 Plus 320 Long Lines                   26     2.50     2.05
 Plus 4000 Random Small Circles         4     0.32     1.78

 Total Elapsed Time  63.1 seconds

 Operating System    Linux, Arch. aarch64, Version 6.1.32-v8+
 Java Vendor         Debian, Version  17.0.8
 null, null CPUs

Pi 5

 Java Drawing Benchmark, Aug 26 2023, 15:15:27
 Produced by javac openjdk 17.0.8

 Test                              Frames      FPS

 Display PNG Bitmap Twice Pass 1     1014   101.33
 Display PNG Bitmap Twice Pass 2     1067   106.59
 Plus 2 SweepGradient Circles        1028   102.70
 Plus 200 Random Small Circles         61     6.04
 Plus 320 Long Lines                   25     2.47
 Plus 4000 Random Small Circles         4     0.33

 Total Elapsed Time  62.3 seconds

 Operating System    Linux, Arch. aarch64, Version 6.1.32-v8+
 Java Vendor         Debian, Version  17.0.8
 null, null CPUs


64 Bit OpenGL Benchmark - videogl64C10, videogl64C12

In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress test of any duration.

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

As a benchmark, it was run using the following script file format, the first command needed to avoid VSYNC, allowing FPS to be greater than 60.

  export vblank_mode=0
  ./videogl64Cxx Width 320, Height 240, NoEnd
  ./videogl64Cxx Width 640, Height 480, NoHeading, NoEnd
  ./videogl64Cxx Width 1024, Height 768, NoHeading, NoEnd
  ./videogl64Cxx Width 1920, Height 1080, NoHeading

The original benchmark was compiled using freeglut3 but, more recently, this was not available for new systems. The gcc12 version was compiled without this but will not run on my Pi 4. Similarly, the gcc10 program is incompatible with the Pi 5.

Performance comparisons indicate that the Pi 5 was between 2.9 and 5.2 times faster than the Pi 4, with an average of 4.0 times over the 24 measurements. The GLUT variety was recompiled on the Pi 4, using GCC 12. The average Pi5 gain then became 2.5 times.
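
As a check on the summary figures, the following sketch recomputes the range and average from the 24 published Pi 5/4 ratios. It is only illustrative arithmetic on the rounded values in the comparison table, not part of the benchmark.

```python
from statistics import mean

# Pi 5 / Pi 4 FPS ratios from the comparison table,
# one row per window size, six tests per row.
ratios = [
    [4.4, 3.8, 4.1, 3.6, 5.2, 4.2],   # 320 x 240
    [3.7, 3.4, 4.3, 3.7, 5.2, 4.2],   # 640 x 480
    [4.3, 3.5, 3.7, 3.8, 4.8, 4.0],   # 1024 x 768
    [4.1, 4.0, 3.7, 2.9, 3.6, 3.7],   # 1920 x 1080
]

flat = [value for row in ratios for value in row]
print(len(flat), min(flat), max(flat), round(mean(flat), 1))
```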

Pi 4 gcc 10

 GLUT OpenGL Benchmark 64 GCC 10, Wed Sep 20 00:48:11 2023

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    727.7    413.0    219.7    131.9     42.8     28.9
   640   480    388.6    281.9    189.2    118.0     42.5     28.1
  1024   768    144.0    141.2    129.8     96.9     41.6     26.8
  1920  1080     54.1     50.2     52.7     56.7     38.4     23.9

 End at Wed Sep 20 00:50:26 2023

Pi 5 gcc 12

 GLUT OpenGL Benchmark 64 Bit GCC 12, Thu Oct 26 14:52:15 2023

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240   3184.7   1554.8    894.7    474.2    224.0    120.0
   640   480   1441.4    956.8    819.1    442.2    220.4    116.7
  1024   768    624.6    493.7    474.7    364.0    199.1    106.4
  1920  1080    221.4    198.6    194.4    165.8    137.9     87.6

 End at Thu Oct 26 14:54:28 2023

Pi 5/4 Comparison

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240      4.4      3.8      4.1      3.6      5.2      4.2
   640   480      3.7      3.4      4.3      3.7      5.2      4.2
  1024   768      4.3      3.5      3.7      3.8      4.8      4.0
  1920  1080      4.1      4.0      3.7      2.9      3.6      3.7

 #####################################################################

Pi 4

 GLUT OpenGL Benchmark 64 Bit GCC 12, Sat Dec  2 11:35:48 2023

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240   1137.1    517.1    308.3    159.7     93.5     49.6
   640   480    579.0    356.8    283.9    150.5     92.5     48.7
  1024   768    239.5    200.9    203.4    134.7     84.9     45.3
  2032  1080     92.8     74.3     93.6     81.1     75.2     37.6

 End at Sat Dec  2 11:38:02 2023


DriveSpeed and LanSpeed I/O Benchmarks

Two varieties of I/O benchmarks are provided, one to measure performance of main and USB drives, and the other for LAN and WiFi network connections. The programs write and read three files at two sizes (defaults 8 and 16 MB), followed by random reading and writing of 1 KB blocks out of 4, 8 and 16 MB and finally, writing and reading 200 small files, sized 4, 8 and 16 KB. Run time parameters are provided for the size of large files and file path. The same program code is used for both varieties, the only difference being file opening properties. The drive benchmark includes extra options to use direct I/O, avoiding data caching in main memory, but includes an extra test with caching allowed.
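
The large file part of this procedure can be sketched as follows. This is a minimal Python illustration of timing one write and one read, not the author's C code; the function name `time_large_file` and its parameters are hypothetical, and no attempt is made to use direct I/O, so the read is likely to be served from the page cache.

```python
import os
import tempfile
import time

def time_large_file(path: str, size_mb: int, block_kb: int = 1024) -> tuple[float, float]:
    """Write then read one file, returning (write MB/s, read MB/s)."""
    block = bytes(block_kb * 1024)            # a block of zero bytes
    blocks = size_mb * 1024 // block_kb

    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())                  # include the time to reach the device
    write_secs = time.perf_counter() - start

    start = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(block_kb * 1024):        # read until end of file
            pass
    read_secs = time.perf_counter() - start   # probably cached, so RAM speed

    return size_mb / write_secs, size_mb / read_secs

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        write_speed, read_speed = time_large_file(os.path.join(folder, "test.bin"), size_mb=8)
        print(f"write {write_speed:.1f} MB/s, read {read_speed:.1f} MB/s")
```

With a file this small, the reading speed mainly reflects memory caching, which is the effect that the larger file sizes used below are intended to avoid.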

As found during previous tests on 64 bit systems and accessing the system SD card, DriveSpeed with Direct I/O failed, indicating “Error writing file”. Later it was established that this also applied to external drives with Ext type format, but that it operated correctly when the drive was formatted as FAT32. A limitation of the latter (at 64 bits) is that file sizes must be less than 4096 MB.
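
The FAT32 limit follows from its 32 bit file size field, which allows at most 2^32 - 1 bytes per file. The sketch below illustrates why 4096 MB does not fit but a slightly smaller file does; it assumes the benchmark counts a megabyte as 1024 x 1024 bytes, and the function name is mine.

```python
MB = 1024 * 1024              # assumption: sizes are binary megabytes
FAT32_MAX_FILE = 2**32 - 1    # largest file size FAT32 can record, in bytes

def fits_fat32(size_mb: int) -> bool:
    """True if a file of size_mb megabytes can be stored on FAT32."""
    return size_mb * MB <= FAT32_MAX_FILE

for size in (2048, 4094, 4096):
    print(size, fits_fat32(size))
```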

The best option for measuring 64 bit performance, using these benchmarks, is to run LanSpeed, specifying large files that cannot be cached for reading. However, random and small file reading functions are likely to be accessing cached data.

DriveSpeed Benchmark FAT32 - DriveSpeed64v2g8 and g12

The first of the following results are for Pi 4 and Pi 5, both with 8 GB RAM, exercising the same high speed flash drive via USB 3, using 1 GB and 2 GB files.

Average Pi 5 gains were around 1.5 times for writing and reading large files, somewhat less writing to cache and nearly 4 times reading from cache, representing RAM speed. The Pi 5 results indicated a slower speed on random reading, then much faster speeds on reading small files, where more of the data appears to have been cached.

As during the Pi 4 tests, a starting large file parameter of 2048 MB failed to execute the second part at 4096 MB. The last results below are from a successful run at 4094 MB.

Pi 4

 DriveSpeed RasPi 64 Bit gcc 8 Wed May 27 11:43:43 2020

 Selected File Path:
 /media/pi/PATRIOT1/
 Total MB  120832, Free MB  114614, Used MB    6218

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

 1024    27.78    21.39    21.43   270.32   278.81   274.98
 2048    21.40    21.14    21.44   275.79   273.14   319.95
 Cached
    8    40.27    42.81    42.81  1206.64  1068.72  1031.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.004    0.184     4.33     4.00     4.04

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.14   261.45    11.19    84.39
 ms/file   119.60   119.05   119.64     0.02     0.73     0.19    2.477

Pi 5

 DriveSpeed RasPi 64 Bit gcc 8 Mon Sep  4 16:50:50 2023

 Selected File Path:
 /media/roy/PATRIOT/test/
 Total MB  120832, Free MB  113866, Used MB    6966

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

 1024    30.89    31.14    38.40   349.35   376.91   375.03
 2048    42.62    42.11    34.53   377.20   378.08   375.97
 Cached
    8    50.11    52.44    53.78  2327.93  4688.75  6184.63

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.005    0.005    0.233    13.34    12.74    13.10

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.13   386.06   667.63   950.87
 ms/file   123.74   124.04   123.19     0.01     0.01     0.02    3.234

Pi 5 at 4094 MB

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

 4094    42.74    38.90    45.55   372.93   349.44   376.49


Performance Monitor - The following provides vmstat examples handling large files, confirming the benchmark large file data transfer speeds and that the data was actually written to and read from the drive at the benchmark reported time.

Pi 5 VMSTAT Writing and Reading Large Files - volumes in kB, speeds in kB/second
     %CPU utilisation us + sy, 100% means 4 cores being used 

procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  1      0 7260884  22404 399188    0    0  1121  1288  179  284  1  1 93  5  0
 1  1      0 7260884  22404 399188    0    0     0 40005 3082 6308  0  4 74 23  0
 1  1      0 7260884  22404 399188    0    0     0 41030 3651 6074  0  3 74 23  0
 1  1      0 7260884  22404 399188    0    0     0 43080 3839 6375  0  3 75 22  0
 1  1      0 7260884  22404 399188    0    0     0 41033 3807 6275  0  3 74 22  0

 1  1      0 7260884  22404 399188    0    0 355824     0 3879 9207  1  9 73 17  0
 1  1      0 7260884  22404 399188    0    0 355320     0 2824 7807  1  9 73 17  0
 1  1      0 7260884  22404 399188    0    0 364544     0 2728 5560  1  9 72 17  0
 1  1      0 7260884  22404 399188    0    0 364540     0 4022 5513  0  8 73 18  0
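
The vmstat samples can be converted to MB/second for comparison with the benchmark figures. The following is a rough sketch only: the helper names are hypothetical, the rows are retyped from the table above with single spaces, and 1 MB is assumed to be 1024 kB, which may not match the benchmark's own accounting.

```python
# Column names in the order printed by vmstat.
FIELDS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()

def parse_vmstat(line: str) -> dict:
    """Turn one vmstat data row into a {column: value} mapping."""
    return dict(zip(FIELDS, (int(value) for value in line.split())))

def mean_mb_per_sec(lines: list, column: str) -> float:
    """Average a kB/second column and convert it to MB/second."""
    rows = [parse_vmstat(line) for line in lines]
    return sum(row[column] for row in rows) / len(rows) / 1024  # assumes 1 MB = 1024 kB

# Steady-state samples from the table above. The first vmstat row is
# an average since boot, so it is left out.
writing = [
    "1 1 0 7260884 22404 399188 0 0 0 40005 3082 6308 0 4 74 23 0",
    "1 1 0 7260884 22404 399188 0 0 0 41030 3651 6074 0 3 74 23 0",
    "1 1 0 7260884 22404 399188 0 0 0 43080 3839 6375 0 3 75 22 0",
    "1 1 0 7260884 22404 399188 0 0 0 41033 3807 6275 0 3 74 22 0",
]
reading = [
    "1 1 0 7260884 22404 399188 0 0 355824 0 3879 9207 1 9 73 17 0",
    "1 1 0 7260884 22404 399188 0 0 355320 0 2824 7807 1 9 73 17 0",
    "1 1 0 7260884 22404 399188 0 0 364544 0 2728 5560 1 9 72 17 0",
    "1 1 0 7260884 22404 399188 0 0 364540 0 4022 5513 0 8 73 18 0",
]

print(f"written {mean_mb_per_sec(writing, 'bo'):.1f} MB/s")
print(f"read    {mean_mb_per_sec(reading, 'bi'):.1f} MB/s")
```

The averages are in the same range as the Pi 5 DriveSpeed large file speeds reported earlier.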


Pi 5 LanSpeed Benchmark - LanSpeedt64g8 and g12 - Wired LAN and WiFi

As indicated above, this benchmark is effectively the same as DriveSpeed, but with Direct I/O not specified. Following are data transfer speeds to a PC via gigabit LAN, 2.4 GHz WiFi and 5 GHz WiFi, plus measurement from a Pi 400 to confirm the same performance levels.

The parameter for large file sizes was intended to be large enough to avoid local caching and some were included to use data sizes of 4 GB or 16 GB in one case. Random access tests access small files that are clearly cached for reading. The many small files used could involve some caching but indicate some consistency.

                          MBytes/Second To PC
                  MB   Write1   Write2   Write3    Read1    Read2    Read3

 Wifi 2.4GHz    1024     5.27     5.56     5.69     6.16     5.92     5.72
 WiFi 5GHz      1024    11.47    11.85    12.83    11.86    11.12    11.31
 LAN 1Gbps 1   16384    55.25    51.88    54.17   114.38   116.13   114.81
 LAN 1Gbps 2    4096    53.83    49.33    54.38   113.70   109.48   113.51
 LAN Pi 400     4096    62.19    62.11    61.27   102.43   104.56   102.60

                          Milliseconds To PC
 Random                Read                       Write
 From MB               4        8       16        4        8       16

 Wifi 2.4GHz       0.002    0.002    0.002     8.48     8.15     7.79
 WiFi 5GHz         0.002    0.002    0.002    14.52    21.38    21.96
 LAN 1Gbps 1       0.002    0.002    0.002     5.04     1.45     0.98
 LAN 1Gbps 2       0.002    0.002    0.002     1.71     1.37     1.38
 LAN Pi 400        0.005    0.005    0.005     1.43     1.13     1.18

                          MBytes/Second To PC
 200 Files             Write                      Read
 File KB               4        8       16        4        8       16

 Wifi 2.4GHz        0.33     0.62     0.92     0.52     0.66     1.21
 WiFi 5GHz          0.11     0.16     0.34     0.14     0.83     0.52
 LAN 1Gbps          1.43     2.39     3.13     4.06     8.28    15.30
 LAN 1Gbps 2        1.59     1.53     4.80     4.41     7.78    16.67
 LAN Pi 400         0.68     2.46     3.55     3.91     6.17    12.45
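
To put the wired LAN reading speeds in context, the measured MB/second can be expressed as a fraction of the nominal link rate. This is a rough sketch; the function name is mine and it assumes 1 MB is 1,000,000 bytes, which may differ from the benchmark's definition, and it ignores protocol overheads.

```python
def link_utilisation(mb_per_sec: float, link_mbps: float,
                     bytes_per_mb: int = 1_000_000) -> float:
    """Fraction of a link's nominal bit rate used by a payload data rate."""
    bits_per_sec = mb_per_sec * bytes_per_mb * 8
    return bits_per_sec / (link_mbps * 1_000_000)

# Fastest LAN 1Gbps 1 large file read (Read2) against a 1000 Mbps link.
print(f"{link_utilisation(116.13, 1000):.0%}")
```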


Raspberry Pi Performance Monitor - The first example below is from VMSTAT, which does not include network data transfer speeds. It covers the first part of the LAN 2 test, writing and reading three 2048 MB files. This ends up using most of the 8 GB RAM as a cache, where data appears to be read from the network. CPU utilisation was mainly low, but the maximum of 14% is for 4 cores, or 56% of one core (if you want to calculate CPU time).
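
The conversion from vmstat percentages to single core utilisation is a simple scaling, sketched here with a hypothetical helper. The same reasoning applies to the four core PC, where a 25% reading represents 100% of one core.

```python
def one_core_percent(us: int, sy: int, cores: int = 4) -> int:
    """vmstat us + sy is a share of all cores; scale it to a share of one core."""
    return (us + sy) * cores

# Busiest sample in the reading phase of the listing: us = 1, sy = 13.
print(one_core_percent(1, 13))
```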

PC Performance Monitor - In some cases, network data transfer speeds could be confirmed on the Windows PC, using the Task Manager Performance display and Perfmon detailed tables. However, this became confusing due to deferred writing to the PC disk, with overlapped reading. The Perfmon data collector also could not keep up with the volume of data, missing output in some time slots and indicating unobtainable speeds in a following slot. In addition, transferring the largest files could produce a complete overload of the PC, with a dead keyboard. An example of Perfmon results is provided below.

The PC had a four core 3 GHz CPU, running under Windows 7. The statistics show significant time waiting for I/O and utilisation of up to all four cores. The second example shows network traffic, disk drive data transfers and CPU utilisation, where a 25% recording represents 100% of one core.

The important consideration for the Pi 5 is confirmation of the data transfer speeds measured by the benchmark. Beyond that, the results indicate that, on reading, there was no disk involvement, the data being supplied from the PC RAM based cache, and that, on writing, saving to disk was involved, which might have reduced measured speed. In the bigger picture, it seemed that not all data had been written to disk when reading began.

LAN 1Gbps 2 VMSTAT initial part writing and reading three 2048 MB files.

procs -----------memory--------- ---swap-- ----io---- -system-- ------cpu-----
 r  b swpd    free  buff   cache   si   so   bi   bo    in    cs us sy id wa st
Power On
 1  0    0 7096944 29968  646800    0    0 4147 1026   859  1470  8  6 74 13  0
Write
 1  0    0 1613712 32944 6076752    0    0  203   51  1406  1245  1  2 89  8  0
 2  1    0 1352208 32944 6339728    0    0    0    0  3962  3469  0  2 75 23  0
 3  0    0   58304  4192 7665904    0    0  175   44  1311  1122  1  2 90  7  0
Read
 1  1    0 2727744   944 5000080    0    0  152   38  2153  1921  1  3 87  9  0
 3  0    0 1480192   960 6244480    0    0    0    0 38445 42406  0 10 65 25  0
 1  2    0  347872   960 7377648    0    0 1472   28 39595 42997  1 13 60 26  0
Write
 2  1    0   52176  2688 7674272    0    0  148   37  2458  2198  1  3 87  9  0
 1  1    0   94448  2688 7635744    0    0  148   37  2519  2253  1  3 87  9  0

##############################################################################

PC Perfmon

              Comms               Disk
          Mbytes/second       Mbytes/second
 Second  Received    Sent     Read  Written   %CPU

     11        50       0        0       90     49
     12        49       0        0        0     47
     13        50       0        0       88     55
     14        49       0        0        0     46
     15        49       0        0       89     45
 To
     45        37       0        0        0     36
     46      1461       4        0       99     34

     82         3       0        0       40     49
     83        79       0        0       41     56
     86       178       0        0       58     90

     94         0       5        0       43     85
     95         1     122        2       64     42
     96         1     120        1        1     36
     97         1     122        0       56     32
     98         1     121        0        0     35
     99         1     120        0       49     31


LanSpeed Benchmark - Pi 5 USB Drives and Operating System SD Card

In most cases, as Direct I/O was not supported, LanSpeed was executed using large files that avoid caching.

These tests were run to confirm that the hardware could support 64 bit type file sizes and to show any major differences. It was found that 4096 MB could not be supported using FAT32 format, but a slightly smaller size, such as 4094 MB, was fine. Also, at 2048 MB, the 8 GB RAM might cache all the data.

                             MBytes/Second
                  MB   Write1   Write2   Write3     Read1     Read2     Read3

 USB3 HD FAT1   2048    98.07    80.66    74.72    306.43   9209.88   8687.44
 USB3 HD Ext2   4096   158.98    28.25   113.34     38.47    143.80    114.56
 USB3 HD Ext3   4096   122.73    26.33    61.23     48.78    122.24    109.04
 USB3 HD Ext4   4096   164.59    81.99    19.61    103.72    143.48    120.17
 Pi 5 SD        4096    27.95    20.58    19.20     43.45    104.53     92.26
 SD USB boot    2048    52.82    20.68    20.41  10305.38  11463.08  11496.93
                4096    30.06    20.52    20.60     42.12    260.46     97.04

                             Milliseconds
 Random                Read                       Write
 From MB               4        8       16        4        8       16

 USB3 HD FAT1    N/A as failed to write 4096 MB
 USB3 HD Ext2      0.002    0.002    0.002    44.90    15.38    16.10
 USB3 HD Ext3      0.002    0.002    0.002    54.50    40.68    45.18
 USB3 HD Ext4      0.002    0.002    0.002    52.50    45.27    51.93
 Pi 5 SD           0.002    0.002    0.002     3.96     3.60     3.68
 SD USB boot       0.002    0.002    0.002     6.83     4.24     3.90

                             MBytes/Second
 200 Files             Write                      Read
 File KB               4        8       16        4        8       16

 USB3 HD FAT1    N/A
 USB3 HD Ext2     141.38    37.47    63.37   587.85   592.36   834.73
 USB3 HD Ext3      64.24    21.61    35.24   310.16   601.22   927.89
 USB3 HD Ext4     129.74    55.08   104.42   423.15   473.34   465.93
 Pi 5 SD           78.41    95.12   194.19   554.82   732.07  1189.95
 SD USB boot      106.88   121.88   309.35   596.63   789.24  1504.37


New Benchmark More Files - LANSpeed64Long

Having encountered VMSTAT performance monitoring problems on running my LANSpeed program, I found that my original Linux version, LANSpeed64Long, avoided this, when compiled for the Raspberry Pi. This writes and reads five large files, followed by other tests, including some for random access and handling numerous small files. As with the earlier program, measured performance can be influenced by caching, sometimes in an unexpected way. Using extra large files helps to avoid the latter. Following is an example of results and sample details from VMSTAT system monitor.

 Current Directory Path: 
 /home/???????
 Total MB  119699, Free MB  102167, Used MB   17531

 Linux LAN Speed Test 64-Bit Version 1.2, Wed Sep 20 13:38:14 2023


  4096 MB File         1          2          3          4          5
 Writing MB/sec      35.46      35.54      35.53      35.49      35.61
 Reading MB/sec     198.94     153.10      92.52      92.67      92.66

 Running Time Too Long At 793 Seconds - No More File Sizes
 ---------------------------------------------------------------------
 8 MB Cached File      1          2          3          4          5
 Writing MB/sec     895.98     859.22     817.44     770.10    1032.07
 Reading MB/sec    3337.54    6467.72    6574.06    6768.83    6643.57

 ---------------------------------------------------------------------
 Bus Speed Block KB     64        128        256        512       1024
 Reading MB/sec   13574.63   15329.45   16213.07   14365.65    9021.80

 ---------------------------------------------------------------------
 1 KB Blocks File MB >    2      4      8     16     32     64    128
 Random Read  msecs    0.40   0.44   0.45   0.45   0.45   0.45   0.45
 Random Write msecs    4.50   4.63   4.60   4.64   4.58   4.68   4.58

 ---------------------------------------------------------------------
 500 Files   Write             Read             Delete
 File KB     MB/sec  ms/File   MB/sec  ms/File  Seconds
       2       0.42     4.85   357.91     0.01    0.012
       4       0.82     5.01   636.20     0.01    0.012
       8       1.64     5.00  1224.07     0.01    0.013
      16       2.91     5.62  1288.33     0.01    0.033
      32       5.51     5.94  2573.57     0.01    0.014
      64       9.22     7.11  4727.86     0.01    0.015
     128      15.04     8.72  5015.65     0.03    0.019
     256      22.87    11.46  5514.21     0.05    0.024
     512      30.27    17.32  6487.64     0.08    0.061
    1024      34.50    30.39  5629.98     0.19    0.054
    2048      36.80    56.99 11498.58     0.18    0.087

VMSTAT Samples Large Files

procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st
Before Start 
 1  0      0 6245248  54480 1069568   0    0     0     0  199  275  0  0 100  0  0
Write
 1  1      0   41088  76480 7254656    0    0     0 34584  714 1313  0  2 75 23  0
 1  1      0   41088  76480 7254656    0    0    16 35656 2310 4149  0  2 73 24  0
 1  1      0   41088  76480 7254656    0    0     0 36656 1830 3219  1  3 72 23  0
 1  1      0   41088  76480 7254656    0    0    16 34584 2012 3287  6  4 68 22  0
Read
 1  1      0   59568  76624 7238688    0    0 90112     0  812 1778  1  1 75 24  0
 1  1      0   59568  76624 7238688    0    0 90112     0  738 1661  1  2 74 24  0
 1  1      0   59568  76624 7238688    0    0 90624     0  667 1524  0  1 75 24  0
 1  1      0   59568  76624 7238688    0    0 90112     0  559 1479  0  1 75 24  0


New Benchmark Large Files

These mainly involved 4096 MB files with smaller ones limited by FAT formatting, available free space or slower WiFi. Approximate vmstat reported performance is also shown. This helps to highlight benchmark results affected by caching.

The first benchmark results were for boot drives, including SD cards, flash drives and hard disk drives, with some from a USB card reader and a USB hub. The other results are for LAN, WiFi and an attached USB flash drive, booted from the SD card. The main use is to demonstrate variations in performance.

 Boot Drive       File                  1        2        3        4        5  VMSTAT
                                                                               MB/sec

 32 GB SD         Writing MB/sec    17.31    17.59    17.69    17.64    17.52      17
 3072 MB File     Reading MB/sec   106.05  8253.16   103.94    90.49    90.38      90

 128 GB SD        Writing MB/sec    35.46    35.54    35.53    35.49    35.61      36
                  Reading MB/sec   198.94    153.1    92.52    92.67    92.66      90

 128 GB SD USB    Writing MB/sec    39.04    38.86    39.14    38.98    38.98      39
                  Reading MB/sec   132.76    297.8    97.62    97.54    97.12

 32 GB Flash      Writing MB/sec    45.32    51.26    45.14    39.56    40.95      37
 SanDisk          Reading MB/sec    347.2   764.03   263.08   259.51   256.98     250

 128 GB Flash     Writing MB/sec    65.18    59.06    55.93    51.48    44.54  20to70
 PATRIOT          Reading MB/sec   529.24   880.72   283.78   358.71   357.57     350

 Disk USB         Writing MB/sec    19.00    20.76    21.03    19.03    16.37      20
                  Reading MB/sec   187.19   390.54   115.75   103.51    91.63     125

 Disk USB HUB     Writing MB/sec    19.36    20.97    19.67    14.24    18.25      20
                  Reading MB/sec   206.35   221.78    86.34   111.81   104.16     120

 SD Booted

 GB LAN           Writing MB/sec    36.31    36.92    36.69    36.94    37.18     N/A
                  Reading MB/sec   113.61    112.8   113.33   113.87   114.18

 5 GHz WiFi
 256 MB File                            1        2        3        4        5
                  Writing MB/sec    24.82    19.87    17.58    24.74     19.8     N/A
                  Reading MB/sec    12.13    11.47    11.53    11.67     9.18

 USB Drive FAT32  Writing MB/sec    30.21    30.01    30.06    30.18    30.16      29
 3072 MB File     Reading MB/sec   304.19   9936.6   343.77   311.99   309.92     290

 USB Drive Ext3   Writing MB/sec   Cannot open data file for writing
 Use sudo         Writing MB/sec    30.56    30.35    30.39    30.37    30.23      30
                  Reading MB/sec   385.17   877.37   311.63   303.94   303.83

New Benchmark Small Files, Booting Time, Volts and Amps

Performance measurements are for writing and reading small files and for random access, again demonstrating wide variations. These variations are also reflected in measured booting time (from inserting the power plug to the full display, including warnings). One of the flash drives was particularly slow at 97 seconds. This drive had also produced unusually slow results during earlier tests.

I have two meters that measure USB voltage and current. One was connected to measure power in and the other USB 3 power out. The main power supply voltage did not appear to vary much, during these tests, and current was well within the 3 available Amps. The disk drive produced the most impact, falling to below 5 volts when connected by a USB hub. Even then, the benchmark ran successfully to the end.

 500 Files Write MB/sec

            32 GB SD 128 GB SD               32 GB    128 GB      Disk      Disk      Gbps     5 GHz     FAT32      Ext3
  File KB     Board     Board       USB    USB Dr    USB Dr       USB   USB HUB       LAN      WiFi       USB       USB

        2      0.38      0.42      0.45      0.42      0.02      0.05      0.05      0.65      0.11      0.02      0.36
        4      0.74      0.82      0.90      0.68      0.19      0.15      0.09      1.11      0.38      0.04      0.63
        8      1.61      1.64      1.75      2.04      0.15      0.30      0.19      1.93      0.93      0.08      1.42
       16      2.74      2.91      3.11      2.67      0.95      0.46      0.40      4.24      1.77      0.15      2.89
       32      3.22      5.51      5.92      4.58      1.12      0.83      0.81      7.06      3.27      0.30      5.51
       64      8.06      9.22      9.88      8.92      4.66      1.64      1.58     12.41      5.71      0.60      8.45
      128      9.48     15.04     16.17     10.08      4.24      3.21      3.11     17.79      8.14      1.18     13.01
      256     12.46     22.87     24.02     14.43     12.69      6.35      6.03     23.18     11.43      2.29     18.55
      512     15.43     30.27     31.96     20.40     21.03     11.42     11.33     27.59     13.07      4.28     23.51
     1024     16.31     34.50     38.04     32.05     36.48     17.08     16.03     33.55                7.60     27.54
     2048     18.15     36.80     41.70     47.85     46.68     28.00     27.30     35.39               12.35     30.07

 Random Access millisecs  V = Variable

     Read      0.47      0.45      0.61      0.45     0.44V     1.10V      1.52     0.67V     18.77      0.40      0.38
    Write      3.20      4.60     4.65V      1.89    16.55V    43.33V     48.80     2.08V     16.23      2.77      4.80

 Boot Secs       21        21        30        21        97        46        44       N/A       N/A       N/A       N/A

 Power Volts and Amps

   Main V      5.20      5.28      5.21      5.24      5.20      5.18      5.21      5.16      5.18      5.18      5.17
   Main A      0.87      0.92      1.13      1.09      0.98      1.21      1.52      1.10      0.85      0.91      0.93
    USB V       N/A       N/A      5.11      5.12      5.10      5.04      4.97       N/A       N/A      5.11      5.11
    USB A       N/A       N/A      0.28      0.24      0.14      0.44      0.83       N/A       N/A      0.14      0.14

Drive Stress Test - burnindrive264g12

The program uses 64 KB blocks with 164 variations of data patterns, giving a minimum file size of 10.25 MB. Larger files can be produced via a run time multiplication parameter, in this case 16 for 164 MB files. Four of these files are written, then read sequentially for 12 minutes, but with the choice of files randomised. Finally, each block/data pattern is reread continuously for a second, at full bus speed from disk drives that cache the data. On reading, file number and data values are compared and errors reported.
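The file size arithmetic above can be checked with a few lines of C. This is only a sketch of the calculation described in the text, not code from burnindrive; the function name `file_size_kb` is hypothetical.

```c
#include <assert.h>

/* Size in KB of one stress test file: one block per data pattern,
   repeated `multiplier` times (the run time multiplication parameter).
   Hypothetical helper, not taken from the benchmark source. */
unsigned long file_size_kb(unsigned block_kb, unsigned patterns,
                           unsigned multiplier)
{
    assert(block_kb > 0 && patterns > 0 && multiplier > 0);
    return (unsigned long)block_kb * patterns * multiplier;
}
```

With 64 KB blocks and 164 patterns, a multiplier of 1 gives 10496 KB (10.25 MB) and a multiplier of 16 gives 167936 KB (164 MB).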

Note that measured speeds are generally slower than those from the DriveSpeed benchmark, covered earlier, as data transfers here use smaller 64 KB blocks.

The following provides summary Pi 5 results including MB/second performance calculations. The tests exercised the main SD drive, LAN, WiFi and USB 3. Devices on the latter were for a hard drive with Ext2, Ext3, Ext4 and FAT32 partitions and three flash drives. The LAN and WiFi tests were also run on a Pi 400 to confirm the similar performance. No errors were detected.

A gigabit LAN connection was used and WiFi was reported as 5 GHz, with the former around 5 times faster on writing and up to 10 times faster on reading. There were performance variations between the various solid state drives that could affect certain applications. One of the disk drive tests, using the Ext3 partition, had inexplicably slow speeds and, when repeated, was still somewhat slower than the other partitions on writing. Note the much faster transfer speeds with repeated reading of 64 KB blocks, indicating cached data and bus speed.

                             Write              Read             Blocks Repeated
 Source               Seconds MB/sec  Passes Minutes MB/sec  Number Minutes MB/sec

 Comms
 LAN Pi 5 to PC          19.3   34.0     156   12.06   35.4   99360    2.79   37.1
 LAN Pi 400 to PC        20.2   32.6     132   12.37   29.2   80900    2.79   30.2
 WiFi Pi 5 to PC         99.6    6.6      20   14.41    3.8   12540    3.61    3.6
 WiFi Pi 400 to PC      101.7    6.5      20   12.78    4.3   14720    3.66    4.2

 SD
 OS Card                 41.7   15.7     260   12.03   59.1  174960    2.76   66.0

 USB 3 Flash Drive
 Flash 1                 20.7   31.7     328   12.01   74.6  179200    2.76   67.6
 Flash 2                  8.0   82.0     352   12.06   79.8  219400    2.75   83.1
 Flash 3                145.2    4.5     136   12.12   30.7   89860    2.77   33.8

 USB HD
 FAT32 Partition          8.4   78.1     268   12.15   60.3  408280    2.75  154.7
 Ext 2 Partition          8.9   73.7     272   12.03   61.8  432060    2.74  164.3
 Ext 3 Partition         1320    0.5     100   12.14   22.5  427360    2.74  162.5
 Ext 3 Repeat            11.8   55.6     256   12.09   57.9  431820    2.74  164.2
 Ext 4 Partition          9.0   72.9     284   12.10   64.2  432200    2.74  164.3
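The MB/second figures in the table appear to follow from the other columns. The sketch below assumes four 164 MB files, one whole file read per pass and 64 KB blocks in the repeat phase; these assumptions reproduce the LAN Pi 5 row, but the function names are hypothetical and the code is not from the benchmark itself.

```c
#include <assert.h>

/* MB/second for writing `files` files of `file_mb` MB in `seconds`. */
double write_mb_per_sec(int files, double file_mb, double seconds)
{
    assert(seconds > 0.0);
    return files * file_mb / seconds;
}

/* MB/second for reading, assuming each pass reads one whole file. */
double read_mb_per_sec(int passes, double file_mb, double minutes)
{
    assert(minutes > 0.0);
    return passes * file_mb / (minutes * 60.0);
}

/* MB/second for the repeat phase: `blocks` blocks of `block_kb` KB. */
double repeat_mb_per_sec(long blocks, double block_kb, double minutes)
{
    assert(minutes > 0.0);
    return blocks * block_kb / 1024.0 / (minutes * 60.0);
}
```

For the LAN Pi 5 row, `write_mb_per_sec(4, 164.0, 19.3)`, `read_mb_per_sec(156, 164.0, 12.06)` and `repeat_mb_per_sec(99360, 64.0, 2.79)` come out close to the reported 34.0, 35.4 and 37.1 MB/second.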

BurnInDrive Stress Test With Performance Monitoring

Following are details of a run handling four 2624 MB files, along with associated results from vmstat performance monitor and my CPU Voltage, MHz and Temperature recorder. The tests were run using the Ext3 partition.

First below are the program results with faster writing speeds than above, reading speeds a little slower and repeat reading similar. These might be due to handling larger files.

Second are the sample vmstat results (size numbers are in KB), with nothing unusual in the utilisation of the 8 GB of memory. There were variations in bo (writing) and bi (reading) speeds, but these essentially confirm the program measurements. Percentage user + system CPU utilisation was low (note that 25% reflects 100% utilisation of one core and 100% indicates all four cores fully utilised).
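As a small illustration of reading the vmstat columns, the following sketch converts us + sy percentages into an equivalent number of fully busy cores and a bi or bo figure into MB/second. The helper names are hypothetical, and dividing by 1024 assumes the io columns are in KB per second, as stated above.

```c
#include <assert.h>

/* Convert vmstat user + system percentages (of all cores combined)
   into the equivalent number of fully busy cores. */
double busy_cores(int user_pct, int system_pct, int cores)
{
    assert(cores > 0 && user_pct >= 0 && system_pct >= 0);
    return (user_pct + system_pct) * cores / 100.0;
}

/* Convert a vmstat bi or bo value, assumed to be KB/second, to MB/second. */
double kb_to_mb_per_sec(long kb_per_sec)
{
    assert(kb_per_sec >= 0);
    return kb_per_sec / 1024.0;
}
```

On a four core system, 25% gives exactly one busy core, and a bi value of 162170 corresponds to roughly 158 MB/second, close to the program's repeat reading figure.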

Finally, there are samples of the environment measurements, which were effectively constant. Results are provided for the start, middle and end of the tests. With ondemand CPU frequency scaling being used, a constant 1500 MHz was indicated for most of the time.

This test was run later on a Pi 4, where writing was 9% slower, reading 6% slower and repeat reading 18% slower, with similar CPU utilisation. See results below.

                             Write              Read             Blocks Repeated
 Source               Seconds MB/sec  Passes Minutes MB/sec  Number Minutes MB/sec

 Ext 3 Partition        129.2   81.2      16   13.99   50.0  419020    2.74  159.3
 Pi 4 Ext3              142.2   73.8      16   14.81   47.2  345680    2.75  130.9

 VMSTAT
         procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
          r  b   swpd    free   buff  cache   si   so     bi    bo    in    cs us sy id wa st
 WRITE    1  1      0 6901476 137524 682832    0    0      0 77806  8123 11887  1  6 74 20  0
          2  0      0 6901476 137524 682832    0    0      8 90292  9889 13562  1  7 74 18  0
 READ     1  1      0 6901476 137524 682832    0    0  32538    46  3377  5344  0  1 75 24  0
          1  1      0 6901476 137524 682832    0    0  60064    16  7630 10652  3  2 72 24  0
 REPEAT   1  1      0 6868408 149372 699428    0    0 162170     3 19231 25503  0  4 72 24  0
          1  1      0 6868408 149372 699428    0    0 162144     3 17290 25480  0  4 72 23  0

 ENVIRONMENT
 Seconds
    0.0   ARM MHz=1500, core volt=0.9067V, CPU temp=37.3°C, pmic temp=38.4°C
  453.6   ARM MHz=1500, core volt=0.9067V, CPU temp=38.9°C, pmic temp=38.4°C
  897.4   ARM MHz=1500, core volt=0.9067V, CPU temp=38.9°C, pmic temp=38.6°C

Disk Drive Errors and Crashes - Power Supply Problems

I have two 1 TB USB 3 disk drives. The first crash occurred in attempting to run the new benchmark on both disk drives when connected to the USB hub via one USB port. The cause would have been obvious, if I had looked up the specification. That indicated a maximum of 900 mA, where up to 660 mA had been observed on one drive. It seems that a 5 amps power supply would not help with this sort of activity and that a powered USB hub should be used instead.
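The power budget problem can be expressed as a simple check. This is a sketch only, assuming both drives could draw the observed 660 mA peak at the same time through a single port limited to 900 mA; the function name is hypothetical.

```c
#include <assert.h>

/* Return 1 if the summed device currents (mA) stay within the port
   limit, 0 otherwise. Assumes all devices peak simultaneously. */
int within_port_budget(const int *device_ma, int count, int limit_ma)
{
    assert(device_ma != 0 && count > 0 && limit_ma > 0);
    int total = 0;
    for (int i = 0; i < count; i++) {
        total += device_ma[i];
    }
    return total <= limit_ma;
}
```

One drive at 660 mA fits within 900 mA, but two such drives on the same port would need up to 1320 mA.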

The second crash was running two disk drive benchmarks with one on the hub, plus my 4 thread integer CPU stress test. This time the crash appeared to be due to the power demand being greater than the 3 Amps supply. 3.06 Amps was indicated shortly before the crash.

Before the next crash I successfully ran two copies of my burnindrive264g12 stress test on separate USB ports. Then, with one of these and one integer stress test, the last measurements before the screen went blank were a data transfer failure reported by my program and a power input recording of 2.72 Amps. Following is a report from the last failing test session, indicating the seriousness of the situation, reading the wrong file and corrupted data.

Later tests were run using a 4 amps power supply. At the time of testing, the official 5 amps power supply was not available.

 Selected File Path:
 /media/raspberrypi/EXT3/
 Total MB  348052, Free MB  348052, Used MB       0

 Storage Stress Test ARM 64 Bit v2.0 gcc 8, Fri Oct  6 21:28:44 2023

 File size 2624.00 MB x 4 files, minimum reading time 12.0 minutes

 File 1  2624.00 MB written in   30.97 seconds
 File 2  2624.00 MB written in   28.80 seconds
 File 3  2624.00 MB written in   29.70 seconds
 File 4  2624.00 MB written in   32.35 seconds
         Total  121.83 seconds, Elapsed  121.83 seconds

 Start Reading Fri Oct  6 21:30:46 2023

 Error reading file 1
 Wrong File Read szzztestz-820 instead of szzztestz1
 Error reading file 2
 Wrong File Read szzztestz-820 instead of szzztestz2
 Error reading file 3
 Pass 1 file szzztestz1 word 1, data error was FFFFFCCC expected FFFFCCCC
 Pass 1 file szzztestz1 word 2, data error was FFFFFCCC expected FFFFCCCC

 ERRORS found during reading tests, see above

 End of test Fri Oct  6 21:34:09 2023

Other System Crashes

The first tests carried out were run with the Pi 5 operating via a 2 amps power supply, without any real problems running the short duration benchmarks. However, there were reductions in performance on running a series of tests, due to temperature increases. I had a cheap cooling fan module used for Pi 4 tests that I fitted on top of the Pi 5, to connect for use when needed, such as for the following procedures.

High Performance Linpack - I attempted to build this benchmark, to continue using as a stress test. This takes an excessive amount of time to build, appearing to repetitively execute the code for tuning purposes for a particular computer. In view of the timescale, I ensured that the cooling fan was working.

The first attempt was left to run overnight, only to find, in the morning, that the system had crashed. A second attempt crashed after 7 hours. Later with a 3 amps power supply, it took 12 hours to build (but other required software was found to be incompatible).

Stress Test Crash - I had successfully run numerous floating point and integer stress tests using a data size parameter aiming to achieve maximum performance using L1 caches on all four CPU cores. Other runs with L2 cache sized data occasionally crashed. Later these tests ran successfully using the 3 amps power supply, with similar temperature and CPU throttling levels.

Even later, with more demanding system stress tests, the 3 amps supply was found to be inadequate.

CPU Stress Testing Benchmarks - MP-FPUStress64g8 and g12, MP-FPUStress64DPg8 and g12
MP-IntStress64g8 and g12

These are provided to help in determining parameters to use for a stress test. They run a series of floating point tests using 1, 2, 4 and 8 threads, with three different memory demands, in single precision and double precision versions. An integer program is also provided, additionally using 16 and 32 threads and accessing three similar memory sizes.


Pi 5 GCC 12 SP
 MP-Threaded-MFLOPS 64 Bit V2 gcc 12 Fri Sep 29 09:59:04 2023

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   0.4    T1   2 13111 12985  2003   40394  76395  99700
   0.8    T2   2 24716 26088  1849   40394  76395  99700
   1.2    T4   2 41053 45232  1847   40394  76395  99700
   1.5    T8   2 34398 44918  2141   40394  76395  99700
   2.2    T1   8 17572 17484  8265   54764  85092  99820
   2.8    T2   8 33483 35138  5731   54764  85092  99820
   3.2    T4   8 59976 69804  6737   54764  85092  99820
   3.6    T8   8 58659 69463  8481   54764  85092  99820
   5.3    T1  32 18265 18246 17917   35206  66015  99520
   6.3    T2  32 35625 36482 22484   35206  66015  99520
   7.0    T4  32 69359 72766 29572   35206  66015  99520
   7.6    T8  32 69370 66234 33184   35206  66015  99520

            End of test Fri Sep 29 09:59:12 2023

Pi 5 GCC 8 SP
  MP-Threaded-MFLOPS 64 Bit V2 gcc 8 Thu Aug 17 21:21:35 2023

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   0.4    T1   2 12746 12885  2029   40394  76395  99700
   0.8    T2   2 25127 24925  1791   40394  76395  99700
   1.2    T4   2 43633 45111  1797   40394  76395  99700
   1.6    T8   2 39439 44308  2151   40394  76395  99700
   2.2    T1   8 17069 17333  7672   54764  85092  99820
   2.7    T2   8 34070 34766  7170   54764  85092  99820
   3.2    T4   8 58695 69177  7229   54764  85092  99820
   3.6    T8   8 59622 65856  8346   54764  85092  99820
   5.3    T1  32 18202 18288 18037   35206  66015  99520
   6.2    T2  32 36321 36549 27452   35206  66015  99520
   6.9    T4  32 68760 73025 27221   35206  66015  99520
   7.5    T8  32 68598 72071 32869   35206  66015  99520

            End of test Thu Aug 17 21:21:42 2023

Pi 5 GCC 12 DP
  MP-Threaded-MFLOPS 64 Bit gcc 12 Fri Sep 29 10:05:24 2023

   Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   0.9    T1   2  6570  6565  1003   40395  76384  99700
   1.9    T2   2 12052 13057   696   40395  76384  99700
   2.7    T4   2 22815 25654   831   40395  76384  99700
   3.5    T8   2 21088 25978   838   40395  76384  99700
   4.9    T1   8  8348  8388  3290   54805  85108  99820
   6.3    T2   8 15906 16532  2530   54805  85108  99820
   7.3    T4   8 23730 28755  2932   54805  85108  99820
   8.3    T8   8 30036 30142  3327   54805  85108  99820
  11.4    T1  32 10027  9975  9486   35159  66065  99521
  13.3    T2  32 19719 19508 12462   35159  66065  99521
  14.6    T4  32 40249 39892 13452   35159  66065  99521
  15.9    T8  32 38383 39453 13637   35159  66065  99521

            End of test Fri Sep 29 10:05:40 2023


Pi 5 GCC 8 DP
  MP-Threaded-MFLOPS 64 Bit gcc 8 Thu Aug 17 21:29:32 2023

   Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   0.9    T1   2  5832  5779   964   40395  76384  99700
   1.8    T2   2 11389 11537   891   40395  76384  99700
   2.6    T4   2 18744 21914   794   40395  76384  99700
   3.5    T8   2 18803 22948   842   40395  76384  99700
   4.7    T1   8  9375  9433  3984   54805  85108  99820
   5.9    T2   8 18190 18819  2758   54805  85108  99820
   6.8    T4   8 33842 37329  3233   54805  85108  99820
   7.7    T8   8 33857 34347  3393   54805  85108  99820
  10.9    T1  32  9633  9642  9458   35159  66065  99521
  12.7    T2  32 19227 19248 14292   35159  66065  99521
  14.0    T4  32 37215 38597 13208   35159  66065  99521
  15.4    T8  32 35943 36029 13288   35159  66065  99521

            End of test Thu Aug 17 21:29:47 2023

Pi 5 GCC 12
  MP-Integer-Test 64 Bit v2-gcc12 Fri Sep 29 10:11:39 2023

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   1.5    1  18233 17590 13957  00000000    Yes
   1.1    2  36284 35095 13303  FFFFFFFF    Yes
   1.0    4  71208 73154 11228  5A5A5A5A    Yes
   1.0    8  64036 68274 11499  AAAAAAAA    Yes
   0.9   16  70658 71792 12459  CCCCCCCC    Yes
   0.5   32  69044 72425 26917  0F0F0F0F    Yes

            End of test Fri Sep 29 10:11:45 2023

Pi 5 GCC 8
  MP-Integer-Test 64 Bit v2-gcc8 Thu Aug 17 21:32:43 2023

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   1.7    1  15193 15083 13106  00000000    Yes
   1.2    2  30256 30277 13472  FFFFFFFF    Yes
   1.0    4  58317 60842 11173  5A5A5A5A    Yes
   1.0    8  56279 54906 12132  AAAAAAAA    Yes
   0.9   16  54716 59296 13475  CCCCCCCC    Yes
   0.5   32  53649 59206 34738  0F0F0F0F    Yes

            End of test Thu Aug 17 21:32:49 2023

Floating Point and Integer Stress Tests - No Fan

Following are early gcc 8 compiled result summaries for the first stress tests, run without a fan being fitted. Each test ran for 15 minutes, using 1, 2 and 4 threads, measuring average performance over 10 second periods, with samples of MHz, volts and temperatures recorded within each period. The summaries show 5 sets of performance results from the beginning, middle and end of each run, then the maximum and minimum values of each column, plus maximum/minimum ratios. Note that, for more than 1 thread, each thread's share of the data should fit in the L1 cache of the core used. Every test ran successfully, but with MHz throttling and some voltage reductions, the maximum/minimum ratios indicating performance degradation of between 23% and 55%. At the end of the integer 4 thread tests, temperatures of up to 90°C were recorded, along with some CPU clock speeds of 1000 MHz.
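The 23% and 55% figures correspond to the Max/Min ratios in the tables that follow (1.23 and 1.55). A minimal sketch of that calculation is given here, together with the alternative of measuring the drop relative to the maximum, which gives smaller percentages. Both function names are hypothetical.

```c
#include <assert.h>

/* Degradation as used in the text: how much faster the best 10 second
   result was than the worst, as a percentage (Max/Min - 1). */
double degradation_pct(double max_value, double min_value)
{
    assert(min_value > 0.0 && max_value >= min_value);
    return (max_value / min_value - 1.0) * 100.0;
}

/* Alternative view: percentage lost relative to the maximum speed. */
double drop_from_max_pct(double max_value, double min_value)
{
    assert(max_value > 0.0 && max_value >= min_value);
    return (1.0 - min_value / max_value) * 100.0;
}
```

For the 1 thread floating point test (18284 and 14862 MFLOPS) the first function gives about 23%, and for the 4 thread integer test (52566 and 33955 MB/second) about 55%.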

      Floating Point Stress Test 128 KB    Integer Stress Test 160 KB

                                 CPU  PMIC                           CPU   PMIC
Seconds  MFLOPS    MHz   Volts    °C    °C  MB/sec    MHz   Volts     °C     °C
1 Thread
       0          2400  0.9065  68.6  61.8           2400  0.9065   71.9   64.8
      10  18279   2400  0.9065  73.0  63.0   15128   2400  0.9065   77.4   66.0
      20  18273   2400  0.9065  76.8  63.7   15132   2400  0.9065   78.5   66.8
      30  18284   2400  0.9065  75.2  64.4   15094   2400  0.9065   79.0   67.4
      40  18283   2400  0.9065  78.5  65.0   15095   2400  0.9065   81.8   68.1
      50  18277   2400  0.9065  79.0  65.7   15117   2400  0.9065   82.3   68.9

     420  16459   2201  0.7200  84.5  72.8   12906   2146  0.9065   85.1   73.3
     430  16396   2146  0.9065  85.1  72.8   11522   1500  0.9065   84.0   73.0
     440  16440   2256  0.9065  84.5  72.6   12905   1500  0.9065   84.5   73.3
     450  14862   1500  0.9065  86.2  72.5   12437   1500  0.9065   84.5   73.2
     460  15332   2146  0.9065  84.5  72.5   11505   1500  0.9065   85.1   73.0

     860  15370   2256  0.9065  84.0  72.3   12181   1500  0.7200   85.1   73.6
     870  15318   2201  0.9065  84.5  72.5   11929   2146  0.9065   84.0   73.3
     880  17227   2201  0.7200  84.0  72.8   13275   2201  0.9065   84.5   73.2
     890  16381   1500  0.9065  85.6  72.5   12913   1500  0.9065   84.0   73.4
     900  16364   2201  0.7200  82.9  72.4   11974   1500  0.9065   84.5   73.2

    Max   18284   2400  0.9065  86.2  72.8   15132   2400  0.9065   85.1   73.6
    Min   14862   1500    0.72  68.6  61.8   11505   1500    0.72   71.9   64.8
Max/Min    1.23   1.60    1.26  1.26  1.18    1.32   1.60    1.26   1.18   1.14

2 Threads
       0          2400  0.9065  71.4  64.2           2400  0.9065   71.9   64.4
      10  36520   2400  0.9065  79.0  66.8   30425   2400  0.9065   80.7   66.7
      20  35794   2311  0.9065  84.0  68.1   29123   2256  0.9065   84.0   67.8
      30  33156   2256  0.7200  84.5  69.3   28064   2256  0.9065   85.1   68.9
      40  31361   2146  0.7200  85.1  70.0   25692   2201  0.9065   84.0   69.4
      50  30525   2146  0.9065  85.1  70.8   25456   1500  0.9065   84.0   70.1

     420  27102   1500  0.7200  84.5  73.5   21687   1500  0.7200   85.6   73.8
     430  26742   2146  0.7200  85.1  73.5   20675   1500  0.9065   86.2   73.9
     440  27006   1500  0.9065  85.6  73.4   20980   1500  0.7200   85.6   73.6
     450  27092   2201  0.7200  85.6  73.5   21997   1500  0.7200   85.1   73.9
     460  26822   1500  0.9065  85.6  73.3   20854   1500  0.7200   85.1   73.6

     860  26691   2146  0.7200  85.1  73.9   21072   2146  0.7200   85.1   73.9
     870  26989   1500  0.7200  85.1  73.9   21111   1500  0.7200   85.6   73.6
     880  28018   1500  0.7200  85.1  73.9   21035   1500  0.9065   85.6   73.6
     890  27595   1500  0.9065  85.6  73.9   21011   2256  0.7200   84.5   73.8
     900  26449   2256  0.7200  85.1  74.0   21028   1500  0.7200   84.5   73.8

    Max   36520   2400  0.9065  85.6  74.0   30425   2400  0.9065   86.2   73.9
    Min   26449   1500  0.7200  71.4  64.2   20675   1500  0.7200   71.9   64.4
Max/Min    1.38   1.60    1.26  1.20  1.15    1.47   1.60    1.26   1.20   1.15

4 Threads
       0          2400  0.9065  71.4  64.3           2400  0.9065   70.8   64.3
      10  61133   1500  0.9065  85.1  68.0   52566   2256  0.7200   83.4   68.1
      20  52128   1500  0.7200  85.6  69.1   44870   1500  0.7200   84.5   69.2
      30  50301   1500  0.7200  85.1  70.8   43266   2256  0.7200   85.1   70.0
      40  49068   1500  0.9065  86.2  71.0   42129   2201  0.7200   84.5   71.2
      50  48448   2201  0.9065  87.3  71.6   41617   1500  0.7200   85.1   71.4

     420  45854   1500  0.7200  86.2  74.3   34701   1500  0.7200   89.5   76.6
     430  45456   1500  0.7200  86.2  74.3   35108   1500  0.7200   88.4   76.6
     440  45859   1500  0.7200  85.6  74.3   35034   1500  0.7200   90.0   76.6
     450  45853   1500  0.7200  85.6  74.3   35099   1500  0.7200   88.9   76.5
     460  45810   1500  0.7200  85.1  74.3   35176   1000  0.7200   89.5   76.6

     860  45686   1500  0.7200  85.1  74.3   34503   1500  0.7200   88.9   76.8
     870  45337   1500  0.7200  84.5  74.3   34056   1500  0.7200   90.0   77.0
     880  46261   1500  0.7200  85.6  74.3   34053   1500  0.7200   88.9   76.6
     890  45069   1500  0.7200  86.2  74.3   33955   1500  0.7200   89.5   77.0
     900  45285   1500  0.7200  86.2  74.6   34188   1500  0.7200   90.0   76.9

    Max   61133   2400  0.9065  87.3  74.6   52566   2400  0.9065   90.0   77.0
    Min   45069   1500  0.7200  71.4  64.3   33955   1000  0.7200   70.8   64.3
Max/Min    1.36   1.60    1.26  1.22  1.16    1.55   2.40    1.26   1.27   1.20

Integer Stress Tests - With Fan

The fan came as part of a 2019 GeeekPi Acrylic Case for Raspberry Pi 4 Model B and is probably not powerful enough for the Pi 5.

The results provided cover data from L1 and L2 caches, with a starting temperature around 40°C, in a room at 26°C to 27°C. One example made use of one thread, running continuously at full speed and reaching a maximum CPU temperature of 57.1°C. Similarly, one used two threads and ran at full speed, with temperature up to 70.3°C.

There are four examples using 4 threads, with data sizes of 128 KB, 512 KB and two at 1024 KB (to show variations). These all have maximum CPU temperatures indicated as between 84.5°C and 85.1°C, with MHz throttling, maximum speeds of around 60 GB/second and minimum about 51 GB/second. Examples using 1 and 2 threads indicated constant performance near 15 and 30 GB/second respectively, all at 2400 MHz.

      4 Threads 128 KB 4 x L1 Cache          4 Threads 1024 KB 4 x L2 Cache

                                 CPU  PMIC                             CPU  PMIC
 Seconds MB/sec    MHz   Volts    °C    °C   MB/sec    MHz   Volts    °C    °C

       0            2400  0.9067  38.9  40.1             2400  0.9067  41.1  39.9
      10  59953   2400  0.9067  57.6  43.8    60553   2400  0.9067  56.0  43.7
      20  59448   2400  0.9067  67.0  47.3    60320   2400  0.9067  63.7  45.9
      30  60019   2400  0.9067  70.8  50.0    59929   2400  0.9067  67.0  47.9

     420  51124   2256  0.9067  84.5  62.2    53503   2256  0.9067  84.5  61.4
     430  51011   2146  0.9067  84.5  62.2    53653   2256  0.9067  84.0  61.0
     440  51219   2256  0.9067  84.5  62.4    53297   2146  0.9067  84.5  61.4

     860  50943   2201  0.9067  84.5  62.1    53756   2201  0.9067  83.4  61.7
     870  51446   2311  0.9067  84.0  62.3    53352   2146  0.9067  83.4  61.7
     880  51378   2146  0.7200  82.3  61.9    54173   2201  0.9067  84.5  61.7

     Max  60025   2400  0.9067  84.5  62.4    60553   2400  0.9067  84.5  61.7
     Min  50943   2146  0.7200  38.9  40.1    53157   2146  0.7200  41.1  39.9
 Max/Min   1.18   1.12    1.26  2.17  1.56     1.14   1.12    1.26  2.06  1.55

      4 Threads 512 KB 4 x L2 Cache          1 Thread 512 KB L2 Cache

       0            2400  0.9067  41.7  40.5             2400  0.9067  40.6  39.5
      10  58969   2400  0.9067  59.8  44.9    14995   2400  0.9067  46.6  40.7
      20  59611   2400  0.9067  66.4  47.2    15070   2400  0.9067  48.8  42.1
      30  59488   2400  0.9067  70.8  50.0    15018   2400  0.9067  50.5  43.1

     420  51217   1500  0.9067  84.0  62.1    15068   2400  0.9067  54.3  47.0
     430  50975   2201  0.9067  85.1  61.5    15081   2400  0.9067  53.2  46.9
     440  51841   2256  0.9067  84.0  62.3    15064   2400  0.9067  53.8  46.8

     860  51128   2146  0.9067  85.1  61.3    15031   2400  0.9067  56.5  48.2
     870  50938   2311  0.9067  84.5  62.1    15074   2400  0.9067  56.5  48.1
     880  51460   2400  0.9067  84.0  61.7    15055   2400  0.9067  57.1  48.1

    3560  51254   1500  0.9067  84.0  62.4    15038   2400  0.9067  56.5  47.8
    3570  51414   2146  0.9067  85.1  61.8    15062   2400  0.9067  56.5  47.7
    3580  51197   1500  0.9067  84.5  62.2    15051   2400  0.9067  56.5  47.7

     Max  59611   2400  0.9067  85.1  62.4    15081   2400  0.9067  57.1  48.2
     Min  50938   1500    0.72  41.7  40.5    14995   2400  0.9067  40.6  39.5
 Max/Min   1.17   1.60    1.26  2.04  1.54     1.01   1.00    1.00  1.41  1.22

      2 Threads 512 KB 2 x L2 Cache          4 Threads 1024 KB 4 x L2 Cache

       0            2400  0.9067  39.5  40.0             2400  0.9065  41.1  39.7
      10  30115   2400  0.9067  51.0  42.5    59776   2400  0.9065  57.6  44.2
      20  30172   2400  0.9067  54.9  43.8    59619   2400  0.9065  67.0  47.0
      30  30254   2400  0.9067  55.4  45.0    59773   2400  0.9065  70.8  49.7

     420  30258   2400  0.9067  70.3  53.0    51820   2311  0.7200  84.0  62.0
     430  30295   2400  0.9067  70.3  53.1    51644   2201  0.7200  82.9  61.3
     440  30272   2400  0.9067  68.6  53.2    51512   2146  0.9065  84.5  62.1

     860  30265   2400  0.9067  69.2  53.1    52739   2201  0.9065  83.4  61.7
     870  30252   2400  0.9067  68.1  53.4    52652   2400  0.9065  84.5  61.5
     880  30289   2400  0.9067  68.1  53.2    50956   2201  0.9065  84.5  61.8

    3560  30274   2400  0.9067  69.7  53.2    51051   2311  0.9065  84.5  62.5
    3570  30296   2400  0.9067  68.6  53.2    51008   2146  0.7200  82.3  62.5
    3580  30246   2400  0.9067  68.6  53.2    51157   1500  0.9065  83.4  62.5

     Max  30296   2400  0.9067  70.3  53.4    59812   2400  0.9065  84.5  62.5
     Min  30115   2400  0.9067  39.5  40.0    50776   1500  0.7200  41.1  39.7
 Max/Min   1.01   1.00    1.00  1.78  1.34     1.18   1.60    1.26  2.06  1.57

Floating Point Stress Tests - With Fan

Only two sets of results are provided, both using 4 threads with the same data size of 512 KB, one with 2 floating point operations per data word, starting at 51.2 GFLOPS, and the other with 32 floating point operations per data word, starting at 72.3 GFLOPS. At the end of the 15 minute runs, performance was indicated at 43.3 and 72.2 GFLOPS respectively, the slower one running at higher temperatures. The near constant performance of the faster test was confirmed by constant CPU MHz reports.

Estimating data flow from MFLOPS and Ops/Word indicates that the test with the slower CPU performance has a much higher data transfer speed and that can influence CPU temperatures.
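The estimate can be sketched as follows: MFLOPS divided by operations per word gives millions of words processed per second, and multiplying by the word size gives MB/second. The 4 byte word size is an assumption (single precision data), and the function name is hypothetical.

```c
#include <assert.h>

/* Estimated data flow in MB/second from measured MFLOPS, the number of
   floating point operations per data word and an assumed word size. */
double data_mb_per_sec(double mflops, int ops_per_word, int bytes_per_word)
{
    assert(ops_per_word > 0 && bytes_per_word > 0);
    return mflops / ops_per_word * bytes_per_word;
}
```

Under these assumptions, 51228 MFLOPS at 2 operations per word corresponds to around 102 GB/second of data, whereas 72366 MFLOPS at 32 operations per word needs only about 9 GB/second.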

   4 Threads 2 Ops/Word 512 KB 4 x L2       4 Threads 32 Ops/Word 512 KB 4 x L2

                                 CPU  PMIC                             CPU  PMIC
 Seconds MFLOPS    MHz   Volts    °C    °C   MFLOPS    MHz   Volts    °C    °C

       0            2400  0.9067  41.7  41.2             1500  0.9067  40.0  40.6
      10  51228   2400  0.9067  65.9  48.3    72366   2400  0.9067  59.3  44.6
      20  50610   2400  0.9067  76.8  52.3    72350   2400  0.9067  67.0  47.3
      30  50799   2400  0.9067  82.3  55.9    72370   2400  0.9067  70.3  49.3
      40  51452   2201  0.9067  83.4  57.7    72348   2400  0.9067  71.9  51.2
      50  50451   2256  0.9067  82.9  59.0    72212   2400  0.9067  74.1  52.6

     420  43777   1500  0.9067  84.0  62.3    72348   2400  0.9067  81.2  58.9
     430  43870   2400  0.9067  84.5  62.5    72381   2400  0.9067  81.2  58.9
     440  43733   2201  0.9067  84.0  62.3    72617   2400  0.9067  80.7  58.9
     450  43887   2146  0.9067  84.5  61.7    72201   2400  0.9067  80.7  58.8
     460  43609   2201  0.9067  85.1  61.9    72229   2400  0.9067  81.2  58.9

     860  43726   2366  0.9067  84.5  62.3    72294   2400  0.9067  81.2  59.2
     870  43346   2201  0.9067  84.5  62.3    72465   2400  0.9067  81.2  59.1
     880  44063   2146  0.9067  85.1  61.9    72257   2400  0.9067  81.8  59.3
     890  43412   2201  0.9067  84.5  62.2    72173   2400  0.9067  81.2  59.2
     900  43353   2146  0.9067  84.5  62.5    72163   2366  0.9067  81.2  59.2

     Max  51452   2400  0.9067  85.1  62.5    72617   2400  0.9067  81.8  59.3
     Min  43346   1500  0.9067  41.7  41.2    72163   1500  0.9067  40.0  40.6
 Max/Min   1.19   1.60    1.00  2.04  1.52     1.01   1.60    1.00  2.05  1.46

Stress Test Parameters

The following show stress test run time parameters. The classifications can be upper or lower case and only the first character is interpreted.

   ./MP-FPUStress   Threads tt, Minutes mm, KB kk, Ops oo, Log ll
   ./MP-FPUStressDP Threads tt, Minutes mm, KB kk, Ops oo, Log ll
   ./MP-IntStress   Threads tt, Minutes mm, KB kk, Log ll
   ./RPiHeatMHzVolts2  Passes pp, Seconds ss, Log ll
   vmstat ss pp

   tt = Threads 1, 2, 4, 8, 16, 32, (64 FPU)       mm = Minutes greater than 0
   kk = KBytes 12 to 15624                         oo = Operations Per Word 2, 8 or 32
   ll = number added to log file name, 0 to 99     pp = Passes (at ss second intervals)
   ss = Second intervals
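To illustrate how classifications can be matched on the first character only, regardless of case, here is a minimal sketch. It is not the parsing code used by the stress test programs; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical holder for the FPU stress test run time parameters. */
struct stress_params {
    int threads, minutes, kbytes, ops, log;
};

/* Apply one classification/value pair, looking only at the first
   character of the classification. Returns 1 if it was recognised. */
int apply_param(struct stress_params *p, const char *name, const char *value)
{
    assert(p != NULL && name != NULL && value != NULL);
    int v = atoi(value);   /* stops at a trailing comma, if present */
    switch (toupper((unsigned char)name[0])) {
    case 'T': p->threads = v; return 1;
    case 'M': p->minutes = v; return 1;
    case 'K': p->kbytes  = v; return 1;
    case 'O': p->ops     = v; return 1;
    case 'L': p->log     = v; return 1;
    default:  return 0;
    }
}
```

With this approach, "Threads 4", "threads 4" and "T 4" would all set the thread count to 4.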

New 4 Amps Power Supply No Disk Crash

Earlier I reported that the Pi 5 crashed when running a stress test on a USB based disk drive along with one executing integer calculations via four threads. A 3 amps power supply was in use.

With no 5 amps power supplies being available, I investigated the Power over Ethernet (PoE) route. My existing Power Injector and Splitter were limited to providing 2.5 amps. There are lots of Injectors delivering 25 or 30 watts but I could not find a Splitter producing 5 amps at 5 volts. However, I acquired a GeeekPi Gigabit USB-C PoE Splitter 48V to 5V, 4A and YuanLey Gigabit PoE Injector 30W, PoE+.

They did not explode on connecting them and I was able to run those tests successfully, first booting from the SD card with the disk on USB 3, then booting from and testing a disk on a USB 3 hub. My monitors typically indicated power in at 5.2V and 2.8A, with the USB supply at 4.9V and 0.75A.

New Integer Stress Test - INTitHOT64g12

Above, I showed that my MP-BusSpeed benchmark could achieve a data transfer rate of 150 GB/second. I have now converted the particular procedures to work as a stress test, with variable options that operate at up to 168 GB/second. Later, 240 GB/second was obtained using L1 cache sized data. As the program is executing AND instructions, this demonstrated Terabit performance at 1.92 Tbps.
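The Terabit figure is a straightforward unit conversion, sketched below with decimal units (1 GB = 8 Gb, 1000 Gb = 1 Tb). The function name is hypothetical.

```c
#include <assert.h>

/* Convert gigabytes per second to terabits per second (decimal units). */
double gb_per_sec_to_tbps(double gb_per_sec)
{
    assert(gb_per_sec >= 0.0);
    return gb_per_sec * 8.0 / 1000.0;
}
```

For 240 GB/second this gives 1.92 Tbps, and for the earlier 168 GB/second about 1.34 Tbps.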

The tests identified three particular problems. With no fan, CPU temperature appeared to reach 90°C. Then, with a fan, current draw was indicated as being up to 2.3 amps. Also, in both cases, there was significant CPU MHz throttling.

Following are the C program function calculations and the main disassembled code. It is effectively a read only test of 64 words, from a large array, executing AND instructions for a one word output. Each thread exercises a dedicated segment of the data, circulated on a round robin basis, reading all data every pass. The disassembly shows (I believe) loading data into eight pairs of quad word registers, then sixteen quad word AND operations.

In case anybody is interested in running (or modifying) the program, the source and compiled codes, along with my environmental monitor, are available from ResearchGate in INTitHOT.tar.xz.

 Test Function Calculations

   andsum1[t] = andsum1[t]
          & array[i   ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
          & array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
          & array[i+8 ] & array[i+9 ] & array[i+10] & array[i+11]
          & array[i+12] & array[i+13] & array[i+14] & array[i+15]
          & array[i+16] & array[i+17] & array[i+18] & array[i+19]
          & array[i+20] & array[i+21] & array[i+22] & array[i+23]
          & array[i+24] & array[i+25] & array[i+26] & array[i+27]
          & array[i+28] & array[i+29] & array[i+30] & array[i+31]
          & array[i+32] & array[i+33] & array[i+34] & array[i+35]
          & array[i+36] & array[i+37] & array[i+38] & array[i+39]
          & array[i+40] & array[i+41] & array[i+42] & array[i+43]
          & array[i+44] & array[i+45] & array[i+46] & array[i+47]
          & array[i+48] & array[i+49] & array[i+50] & array[i+51]
          & array[i+52] & array[i+53] & array[i+54] & array[i+55]
          & array[i+56] & array[i+57] & array[i+58] & array[i+59]
          & array[i+60] & array[i+61] & array[i+62] & array[i+63];

 Inner Loop Disassembly

 .L128:
        ldp     q31, q30, [x0]
        add     w13, w13, 1
        ldp     q29, q28, [x0, 32]
        ldp     q27, q26, [x0, 64]
        ldp     q25, q24, [x0, 96]
        ldp     q23, q22, [x0, 128]
        ldp     q21, q20, [x0, 160]
        ldp     q19, q18, [x0, 192]
        ldp     q17, q16, [x0, 224]
        add     x0, x0, 256
        and     v15.16b, v15.16b, v31.16b
        and     v0.16b, v0.16b, v30.16b
        and     v14.16b, v14.16b, v29.16b
        and     v13.16b, v13.16b, v28.16b
        and     v12.16b, v12.16b, v27.16b
        and     v11.16b, v11.16b, v26.16b
        and     v10.16b, v10.16b, v25.16b
        and     v9.16b, v9.16b, v24.16b
        and     v8.16b, v8.16b, v23.16b
        and     v7.16b, v7.16b, v22.16b
        and     v6.16b, v6.16b, v21.16b
        and     v5.16b, v5.16b, v20.16b
        and     v4.16b, v4.16b, v19.16b
        and     v3.16b, v3.16b, v18.16b
        and     v2.16b, v2.16b, v17.16b
        and     v1.16b, v1.16b, v16.16b
        cmp     w2, w13
        bhi     .L128
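For readers who want to experiment, the calculation can be written as a compact loop. This is a simplified single thread sketch, not the INTitHOT source: it assumes 32 bit words (64 words being the 256 bytes loaded per pass of the disassembled loop) and uses the hypothetical name `and_segment`. A compiler may or may not vectorise it in the same way as the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AND together every word of one data segment, 64 words per outer step,
   returning a single word result. `words` must be a multiple of 64. */
uint32_t and_segment(const uint32_t *array, size_t words)
{
    assert(array != NULL && words % 64 == 0);
    uint32_t andsum = 0xFFFFFFFFu;
    for (size_t i = 0; i < words; i += 64) {
        for (size_t j = 0; j < 64; j++) {
            andsum &= array[i + j];
        }
    }
    return andsum;
}
```

If every word holds the same pattern the result is that pattern, so a changed result after a pass would indicate a data error.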

INTitHOT PI 5 and Pi 4 Maximum Speeds - With Fan

The INTitHOT tests were run with the fan operational to demonstrate maximum speeds over the first few passes, using the same run time parameters on the Pi 5 and Pi 4. These accessed 64 KB using 1, 2 and 4 threads. Here, near constant elapsed times at all thread counts indicate high efficiency, and this applied to the Pi 5 results. But, for no reason that I could identify, the Pi 4 failed to benefit from using 4 threads. Note that the Pi 4 was booted and run from the Pi 5 OS SD card.

Pi 5 performance gains over the Pi 4 were 3.94 and 4.62 times at 1 and 2 threads, and around 10 times at 4 threads. The fastest Pi 5 performance was 240 Gigabytes per second, using 4 threads. This indicates the equivalent of 120 Giga Instructions Per Second (GIPS) or 60 Giga Integer Arithmetic Operations Per Second (GIAOPS).
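The derivation of these rates is not spelled out, so the following is a minimal sketch of one reading that reproduces them. It assumes 4 byte words, one AND per word for the GIAOPS figure, and one load plus one AND per word for the equivalent GIPS figure. The function names are hypothetical. For comparison, it also gives the number of real 128 bit AND instructions, each of which handles 16 bytes.

```c
#include <assert.h>

#define WORD_BYTES 4.0   /* assumed 32 bit word                  */
#define QUAD_BYTES 16.0  /* one 128 bit quad word (q) register   */

/* Ratio of two MB/second results, for example Pi 5 over Pi 4 */
double speedup(double pi5_mb_per_sec, double pi4_mb_per_sec)
{
    assert(pi4_mb_per_sec > 0.0);
    return pi5_mb_per_sec / pi4_mb_per_sec;
}

/* Words ANDed per second, in units of 10^9 (assumed GIAOPS reading) */
double giga_word_ands(double gb_per_sec)
{
    return gb_per_sec / WORD_BYTES;
}

/* Equivalent scalar instructions: one load plus one AND per word
   (assumed GIPS reading) */
double giga_equiv_instructions(double gb_per_sec)
{
    return 2.0 * giga_word_ands(gb_per_sec);
}

/* Real 128 bit AND instructions per second, in units of 10^9 */
double giga_quad_ands(double gb_per_sec)
{
    return gb_per_sec / QUAD_BYTES;
}
```

At 240 GB/second these give 60, 120 and 15 respectively, and speedup(56796, 14418) returns about 3.94, matching the 1 thread gain quoted.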

Also below are maximum speeds using 9 data sizes between 64 and 16384 KB. This test was included in my benchmark, intended to measure bus speeds. In this case, the memory bus speed is indicated as 27 GB/second. Here, at 16 MB data size, each of the 4 threads would be cycling through dedicated segments of 4 MB. Maximum observed current draw was 2.3 amps at 512 KB data size, higher than at 64 KB but with slower performance.

              Pi 5                                  Pi 4

 INTitHOT 64 Bit gcc 12 Thu            INTitHOT 64 Bit gcc 12 Thu
 Oct 19 15:51:53 2023                  Oct 19 15:11:35 2023
 1 Threads. 64 KBytes, 500000          1 Threads. 64 KBytes, 500000
 Passes 1+ Minutes                     Passes 1+ Minutes

  Repeat MB/second  Seconds             Repeat MB/second  Seconds
       1     56796     0.58                  1     14418     2.27
       2     56612     0.58                  2     14412     2.27
       3     56704     0.58                  3     14404     2.27

 ####################################  ####################################

 INTitHOT 64 Bit gcc 12 Thu            INTitHOT 64 Bit gcc 12 Thu
 Oct 19 15:51:16 2023                  Oct 19 15:11:06 2023
 2 Threads. 64 KBytes, 500000          2 Threads. 64 KBytes, 500000
 Passes 1+ Minutes                     Passes 1+ Minutes

  Repeat MB/second  Seconds             Repeat MB/second  Seconds
       1    113194     0.58                  1     24510     2.67
       2    113663     0.58                  2     24415     2.68
       3    113272     0.58                  3     24412     2.68

 ####################################  ####################################

 INTitHOT 64 Bit gcc 12 Thu            INTitHOT 64 Bit gcc 12 Thu
 Oct 19 15:50:53 2023                  Oct 19 15:10:29 2023
 4 Threads. 64 KBytes, 500000          4 Threads. 64 KBytes, 500000
 Passes 1+ Minutes                     Passes 1+ Minutes

  Repeat MB/second  Seconds             Repeat MB/second  Seconds
       1    240850     0.54                  1     23839     5.50
       2    231406     0.57                  2     23832     5.50
       3    240861     0.54                  3     23836     5.50

 ####################################  ####################################

 Pi 5 4 Threads Maximum speeds

                                           Power
   Passes      KB   MB/sec     Secs        amps
   500000      64   240850     0.54  L1    1.8 to 1.9
   500000     128   165221     1.59  L2    1.9 to 2.0
   500000     256   168499     3.11        1.9 to 2.0
   500000     512   158777     6.64        2.1 to 2.3
    50000     512   158019     0.66        2.1 to 2.3
    50000    1024    73043     2.87  L3    1.8 to 1.9
    50000    2048    52050     8.06  L3    1.7 to 1.8
    50000    4096    32024    26.18  RAM   1.6 to 1.7
    50000    8192    30767    54.53        1.5 to 1.6
    50000   16384    26983   124.35        1.5 to 1.7

INTitHOT Stress Tests

The tests were all run for 15 minutes using 4 threads, covering two data sizes, 64 KB for the fastest via L1 caches and 512 KB for the hottest using L2 caches. In each table, every performance measurement is for the same pass count, so the time taken can increase due to CPU MHz throttling. The environmental monitor was run at the same time, sampling at 30 second intervals.

Full details are provided later for the two test sessions run with the cooling fan disconnected and the default ondemand CPU frequency scaling setting. Others were also run with the performance setting, producing similar long term variations in performance. Here, we have summaries of the fan and no fan situations.

With no fan in use, there was significant CPU MHz throttling at both data sizes, less so at 64 KB with its higher MB/second data transfer speeds.

With fan cooling, the 64 KB test was not affected much by MHz throttling, suffering a mere 5% degradation in performance, compared with 16% at 512 KB, where there was additional throttling but not much of an increase in CPU temperature.

                   MB/sec    Secs     MHz   Volts  CPU °C PMIC °C

 64 KB No Fan
         Min       150715    16.4    1500  0.7200    42.8    44.2
         Max       240498    26.1    2256  0.9060    87.3    75.4
     Average                         1689  0.7492    84.0    71.5

 512 KB No Fan
         Min        84743    29.0    1000  0.7200    47.7    47.3
         Max       144811    49.5    2146  0.9060    90.0    77.4
     Average                         1380  0.7433    86.8    74.1

 64 KB Fan
         Min       228738    32.7    2256  0.9067    41.7    39.9
         Max       240414    34.4    2400  0.9067    84.0    60.1
     Average                         2306  0.9067    82.3    59.7

 512 KB Fan
         Min       124143    29.2    1500  0.7200    41.7    43.0
         Max       143845    33.8    2400  0.9060    85.6    62.5
     Average                         2193  0.8700    83.6    61.5

INTitHOT Stress Test 64 KB - No Fan

With no fan, the CPU temperature reached 87.3°C, leading to a 37% reduction in measured performance, from 240498 to 150715 MB/second. The temperature, CPU MHz and voltage had regular variations.


PI 5 Stress Test 64 KB,  no fan, ondemand MHz scaling

INTitHOT Fri Oct 20 11:20:38     Temperature and CPU MHz Measurement
4 Threads 64 KB 15000000 Passes  Start at Fri Oct 20 11:20:33 2023

  Repeat  MB/sec    Secs         Seconds     MHz   Volts  CPU °C PMIC °C
                                       0    1500  0.9060    42.8    44.2
       1  240498    16.4              30    2256  0.9060    83.4    58.8
       2  225209    17.5              60    1500  0.9060    85.6    65.4
       3  195713    20.1              91    1500  0.9060    86.2    69.0
       4  182682    21.5             121    1500  0.7200    84.5    71.4
       5  172867    22.8             151    1500  0.7200    85.1    72.0
       6  166663    23.6             182    1500  0.7200    85.1    72.5
       7  163066    24.1             212    2146  0.7200    86.2    73.1
       8  160312    24.5             242    1500  0.7200    84.5    73.9
       9  158921    24.7             273    1500  0.7200    85.6    73.4
      10  157789    24.9             303    1500  0.7200    85.1    73.8
      11  156465    25.1             334    1500  0.7200    85.6    73.8
      12  154721    25.4             364    1500  0.7200    85.6    73.8
      13  155261    25.3             394    1500  0.7200    85.1    73.9
      14  154156    25.5             425    1500  0.7200    86.2    74.2
      15  153030    25.7             455    1500  0.7200    86.2    74.1
      16  152971    25.7             485    1500  0.7200    86.2    74.5
      17  153125    25.7             515    1500  0.7200    85.6    74.5
      18  152132    25.9             546    1500  0.7200    85.6    74.5
      19  152081    25.9             576    1500  0.7200    86.2    74.8
      20  152261    25.8             606    1500  0.7200    86.2    74.8
      21  151389    26.0             637    1500  0.7200    85.6    74.6
      22  151139    26.0             667    1500  0.7200    86.7    74.9
      23  151028    26.0             697    1500  0.7200    86.7    75.0
      24  151525    26.0             728    1500  0.7200    86.2    75.1
      25  151101    26.0             758    1500  0.7200    86.7    75.0
      26  151200    26.0             788    1500  0.7200    86.2    75.2
      27  151501    26.0             819    1500  0.7200    87.3    75.2
      28  150845    26.1             849    1500  0.7200    86.7    75.4
      29  150795    26.1             879    1500  0.7200    86.7    75.2
      30  150715    26.1             910    1500  0.7200    87.3    75.2
      31  151059    26.0             940    1500  0.9060    76.8    72.8
      32  150767    26.1
      33  150751    26.1
      34  150959    26.1
      35  150927    26.1
      36  150783    26.1
      37  151009    26.0

    Min   150715    16.4                    1500  0.7200    42.8    44.2
    Max   240498    26.1                    2256  0.9060    87.3    75.4
  Average                                   1689  0.7492    84.0    71.5

INTitHOT Stress Test 512 KB - No Fan

This test recorded the highest temperature, at 90°C, and a 42% reduction in MB/second, with the lowest CPU frequency regularly at 1000 MHz. Voltage was mainly constant at 0.7200, with the temperature staying near the top end.


PI 5 Stress Test Detail - 512 KB,  no fan, ondemand MHz scaling

INTitHOT Fri Oct 20 10:49:05     Temperature and CPU MHz Measurement
4 Threads 512 KB 2000000 Passes  Start at Fri Oct 20 10:48:58 2023

  Repeat  MB/sec    Secs         Seconds     MHz   Volts  CPU °C PMIC °C
                                       0    1500  0.9060    47.7    47.3
       1  144811    29.0              30    1500  0.9060    84.5    62.8
       2  117807    35.6              60    1500  0.9060    86.7    67.7
       3  109939    38.2              91    2146  0.7200    85.1    70.3
       4  106055    39.6             121    1500  0.7200    85.6    71.3
       5  104401    40.2             152    1500  0.7200    85.6    72.2
       6  103921    40.4             182    1500  0.7200    85.1    72.6
       7  103770    40.4             212    1500  0.7200    86.7    73.1
       8  103705    40.4             243    1500  0.7200    87.8    74.1
       9  101765    41.2             273    1500  0.7200    87.8    74.9
      10   98730    42.5             303    1500  0.7200    88.9    75.3
      11   96339    43.5             334    1500  0.7200    89.5    75.8
      12   93876    44.7             364    1500  0.7200    89.5    76.0
      13   92469    45.4             394    1500  0.7200    90.0    76.0
      14   90528    46.3             425    1000  0.7200    89.5    76.2
      15   88594    47.3             455    1500  0.7200    88.9    76.3
      16   88113    47.6             485    1500  0.7200    88.4    76.6
      17   87023    48.2             515    1500  0.7200    90.0    76.5
      18   86581    48.4             546    1500  0.7200    90.0    77.0
      19   85699    48.9             576    1500  0.7200    89.5    77.1
      20   84743    49.5             606    1000  0.7200    88.9    77.0
      21   84760    49.5             637    1000  0.7200    90.0    77.0
                                     667    1000  0.7200    88.4    77.2
                                     698    1000  0.7200    88.4    77.2
                                     728    1500  0.7200    89.5    77.3
                                     758    1000  0.7200    89.5    77.2
                                     789    1000  0.7200    90.0    77.3
                                     819    1500  0.7200    90.0    77.2
                                     849    1000  0.7200    90.0    77.2
                                     880    1500  0.7200    89.5    77.4
                                     910    1000  0.7200    89.5    77.4
                                     940    1500  0.9060    75.7    73.0

    Min    84743   28.96                    1000  0.7200    47.7    47.3
    Max   144811   49.49                    2146  0.9060    90.0    77.4
 Average                                    1380  0.7433    86.8    74.1

System Stress Tests

All these tests were run for 30 minutes, exercising the CPU, graphics and data input/output, and included my environment and VMSTAT performance monitors, the latter to validate the program MBytes per second measurements and confirm that CPU utilisation was at the expected near 100% level. A script file was used to ensure that the programs started at the same time. In most cases, performance was measured or sampled every 60 seconds.

An example script file is below, also the commands to run the OpenGL program from a separate terminal, with VSYNC turned off to produce maximum frames per second (FPS).

 Script File

 lxterminal -e ./RPiHeatMHzVolts64 Passes 31 Seconds 60 Log 7 &
 lxterminal -e ./INTitHOT64g12 threads 2, kBStress 64, Minutes 30, passCount 4000000, logNumber 7 &
 lxterminal -e ./MP-FPUStress64g12 threads 2, kb 512, ops 32, Minutes 30, log 7 &
 lxterminal -e sudo ./burnindrive264g12 Repeats 16, Minutes 27, Log 8, Seconds 1, F /media/raspberrypi/public/ray &
 lxterminal -e sudo ./burnindrive264g12 Repeats 16, Minutes 27, Log 9, Seconds 1, F /media/raspberrypi/EXT3 &
 lxterminal -e vmstat 60 30 . vmstat7.txt

 Separate Terminal

 export vblank_mode=0
 ./videogl64C12 Test 6 Minutes 30

Of particular note, the first set of tests identifies increases in CPU temperature up to 91.7°C, with no fan running.

A more significant, though questionable, problem during the second set of tests was the disk program reporting errors and the drive temporarily dropping offline during a test with the fan operational. The errors were the same as on earlier runs using a 3 amps power supply, the present PoE connection supposedly providing 4 amps.

Monitoring the input power and that supplied to the USB drive indicated that consumption was fairly constant between 2 and 15 minutes of testing time, providing the following typical meter readings. These suggest that the disk drive might be more vulnerable to failure when the CPU is fully loaded, and that CPU MHz throttling might be useful if danger can be predicted.

     No Fan Poor CPU Performance   With Fan Good CPU Performance

         Power          USB             Power          USB      
     Volts   Amps  Volts   Amps     Volts   Amps  Volts   Amps  
     
      5.26   1.75   5.06   0.53      5.20   2.60   4.94   0.53
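The meter readings are easier to compare as power in watts. A minimal sketch of the arithmetic follows, where the function name watts is hypothetical and the readings are taken as steady DC values, which is an assumption.

```c
#include <assert.h>

/* DC power in watts from a voltage and current reading */
double watts(double volts, double amps)
{
    assert(volts >= 0.0 && amps >= 0.0);
    return volts * amps;
}
```

On that basis, the input power was roughly 9.2 watts with no fan (5.26 x 1.75) and 13.5 watts with the fan (5.20 x 2.60), while the USB drive took about 2.7 and 2.6 watts respectively.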

Light System Stress Test

The first sessions involved INTitHOT64g12, using 4 threads accessing 512 KB data, with a pass count to control minimum running time. Then, with this test, total running time was specified as 30 minutes, leading to fewer results when the CPU MHz was throttled. These MB/second results were allocated at two minute intervals. Other inclusions were burnindrive264g12 to a USB3 disk drive, plus videogl64C12 accessing the most demanding display test, producing FPS results every 30 seconds, with results provided at 60 second intervals, as shown in the detailed tables below.

Following are two sets of results for one run with the fan in use and one without the fan. On the bright side, these and a number of other tests, using the same parameters, ran without any issues. But CPU MHz throttling occurred in all cases.

Summaries

Minimum values are often isolated examples and can often be ignored. Best scores shown at the head of the table are from standalone runs. Maximum benchmark performance measurements suffer from being noted a minute after start time. Averages indicate significant reductions for the integer and OpenGL tests but little difference on disk drive data transfer speeds.

Of particular note is the CPU temperature measurement of 91.7°C with the fan out of use.

                                                   VMSTAT
                                          Integer    Disk  OpenGL
               MHz   Volts  CPU °C PMIC °C  MB/sec  KB/sec     FPS

 Best                                       145000   63000     102

 512 KB FAN
 Average      2128  0.8878    82.8    61.8   97568   60368    65.3
     Min      1500  0.7200    42.2    39.7   95281   59159    61.0
     Max      2400  0.9058    85.1    63.2  106457   61815    69.0

 512 KB NO FAN
 Average      1174  0.7260    88.7    77.0   55898   56081    40.0
     Min      1000  0.7200    56.0    53.7   45528   19941    33.0
     Max      2400  0.9058    91.7    79.5   79094   58095    58.0

 Average No Fan
 %Reduction     45      18       7      20      43       7      39
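The %Reduction row for MHz, volts and the three performance columns can be reproduced from the two Average rows. A minimal sketch follows, assuming the reduction is taken relative to the fan average. The function name percent_reduction is hypothetical.

```c
#include <assert.h>

/* Percentage by which the no fan average falls below the fan average */
double percent_reduction(double fan_average, double no_fan_average)
{
    assert(fan_average != 0.0);
    return 100.0 * (fan_average - no_fan_average) / fan_average;
}
```

For example, percent_reduction(2128, 1174) is about 44.8 for CPU MHz and percent_reduction(97568, 55898) is about 42.7 for integer MB/second, which round to the 45 and 43 shown. The temperature columns rise without the fan, so this formula returns negative values for them.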

Light Test With Fan

Note that CPU temperature is shown to be more than 84°C for most of the time.


 512 KB FAN
                                                  VMSTAT 
                                         Integer   Disk    OpenGL
  Seconds    MHz   Volts  CPU °C PMIC °C  MB/sec  KB/sec     FPS
 
       0    2400  0.9058    42.2    39.7
      60    2146  0.9058    84.5    59.5  106457   61815      69
     120    2146  0.9058    84.0    62.2           60132      68
     181    2201  0.9058    84.5    62.1           61054      66
     241    2366  0.9058    84.0    62.5   97930   60130      65
     301    2201  0.9058    85.1    62.4           60235      67
     362    2256  0.9058    84.0    62.8           60548      64
     422    2146  0.9058    84.0    62.5   96799   59701      65
     482    2146  0.9058    84.0    63.1           60461      67
     542    2201  0.9058    85.1    62.0           60175      66
     603    2146  0.7200    84.0    63.0   96761   60006      65
     663    2146  0.9058    85.1    61.9           61348      64
     723    2311  0.9058    84.5    62.8           59479      67
     784    2146  0.9058    84.5    62.9   97231   61585      64
     844    2146  0.7200    82.9    62.8           59742      64
     904    2146  0.9058    82.3    62.8           60262      66
     965    1500  0.9058    84.5    62.8   96604   61429      67
    1025    2366  0.9058    84.0    62.9           59341      65
    1086    1500  0.9058    84.0    62.3           60804      64
    1146    2201  0.9058    83.4    62.8   96213   59546      65
    1206    2256  0.9058    84.0    62.8           59360      64
    1267    2366  0.9058    84.5    63.2           61687      68
    1327    1500  0.9058    84.5    63.0   96053              64
    1387    2146  0.9058    84.5    62.8           59159      66
    1447    2146  0.9058    85.1    61.9           60655      65
    1508    1500  0.9058    84.5    62.9   96349              67
    1568    2400  0.7200    81.8    62.7           60491      66
    1629    2146  0.9058    85.1    62.1           59962      64
    1689    2400  0.9058    85.1    62.1   95281              63
    1749    2146  0.9058    84.0    62.3           60429      61
    1809    2146  0.9058    84.5    62.9           60390      64

Average     2128  0.8878    82.8    61.8   97568   60368    65.3
    Min     1500  0.7200    42.2    39.7   95281   59159    61.0
    Max     2400  0.9058    85.1    63.2  106457   61815    69.0

Light Test No Fan

Note that the CPU is running at 1000 MHz for much of the time, with CPU temperature around 90°C and that for the Power Management Integrated Circuit more than 78°C.


  512 KB NO FAN
 Seconds     MHz   Volts  CPU °C PMIC °C  MB/sec  KB/sec     FPS
       0    2400  0.9058    56.0    53.7
      60    1500  0.7200    86.2    69.5   79094   19941      58
     120    1500  0.7200    85.6    72.5           58012      52
     181    1500  0.7200    87.8    73.9           57754      50
     241    1500  0.7200    88.9    75.8   70129   56880      50
     301    1500  0.7200    89.5    76.9           57616      48
     362    1500  0.7200    89.5    77.0   64348   57313      45
     422    1000  0.7200    90.6    77.1           57850      44
     482    1500  0.7200    88.9    77.6   57341   57980      42
     543    1000  0.7200    89.5    78.2           57245      44
     603    1000  0.7200    90.0    78.1           57311      41
     663    1000  0.7200    90.0    78.2   53759   57391      39
     724    1000  0.7200    88.9    78.6           57486      37
     784    1000  0.7200    89.5    78.1           57786      38
     844    1000  0.7200    90.0    78.3   50933   57456      36
     905    1000  0.7200    90.0    78.5           57914      37
     965    1000  0.7200    90.6    78.7           56861      38
    1025    1000  0.7200    90.0    78.6   49921   57428      37
    1086    1500  0.7200    89.5    78.9           57705      36
    1146    1000  0.7200    90.6    78.9           57445      38
    1206    1000  0.7200    90.0    78.6   48803   57803      39
    1267    1000  0.7200    90.0    78.9           57618      36
    1327    1000  0.7200    90.0    79.1                      36
    1387    1000  0.7200    90.6    78.9   47790   57545      37
    1448    1000  0.7200    90.0    78.5           58095      36
    1508    1000  0.7200    90.6    79.4                      34
    1568    1000  0.7200    90.0    79.0   47234   57055      35
    1629    1000  0.7200    91.7    79.1           57110      35
    1689    1000  0.7200    91.1    79.5                      34
    1750    1000  0.7200    91.7    79.3   45528   56708      35
    1810    1000  0.7200    91.7    79.4           56874      33

Average     1174  0.7260    88.7    77.0   55898   56081    40.0
    Min     1000  0.7200    56.0    53.7   45528   19941    33.0
    Max     2400  0.9058    91.7    79.5   79094   58095    58.0

Heavy System Stress Test

This session comprised INTitHOT64g12, with 2 threads at 64 KB, MP-FPUStress64g12 with 2 threads at 512 KB, burnindrive264g12 to a PC via Ethernet, burnindrive264g12 to a USB 3 disk drive and videogl64C12 as before. Detailed important results are provided for fan and no fan scenarios, with two for the former as the first one failed. Note that, compared with 4 thread results, those for 2 threads can be slower than expected as the main data source can be from L2 cache instead of L1.

On running these tests the main issue was that the second test failed due to data comparison failures on reading. The first indication was a system warning that the disk drive was no longer available but it was remounted. Following are examples of reported errors, similar to the earlier ones described above in Disk Drive Errors and Crashes. These were thought to have been caused by the inadequate 3 amps power supply. Also, see the comments in the initial System Stress Testing summary.

 Read passes 74 x 4 Files x 164.00 MB in 14.03 minutes
 Error reading file 1
 Wrong File Read szzztestz-3 instead of szzztestz1
 Error reading file 2
 Pass 76 file szzztestz1 word 1, data error was FFFFFFFD expected FFFFFFFB
 Pass 76 file szzztestz1 word 2, data error was FFFFFFFD expected FFFFFFFB
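When reading reports like these, it can help to see which bits differ between the word read and the word expected. The following minimal sketch is not taken from burnindrive, and the function name bits_in_error is hypothetical. It forms the exclusive OR of the two words and counts the set bits.

```c
#include <assert.h>
#include <stdint.h>

/* Number of bit positions where the word read differs from the expected word */
int bits_in_error(uint32_t was, uint32_t expected)
{
    uint32_t diff = was ^ expected;   /* 1 bits mark the differences */
    int count = 0;

    while (diff != 0) {
        diff &= diff - 1;             /* clear the lowest set bit */
        count++;
    }
    assert(count <= 32);
    return count;
}
```

For the errors shown, FFFFFFFD exclusive ORed with FFFFFFFB gives 00000006, so two adjacent bits differ in each reported word.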

A summary of the three test sessions follows. As indicated above, power consumption was higher during the tests run with the fan operational, which reduced temperatures, enabling faster performance. Without the fan, MHz throttling, involving higher temperatures, reduced current demands with slower performance. It seems that power consumption was more important than system temperature when considering stability.

                                          Integer Floating OpenGL  & VMSTAT   Program
               MHz   Volts  CPU °C PMIC °C  MB/sec  MFLOPS     FPS Disk MB/s  LAN MB/s

 Best         2400                          114000   32000     102      63        36

 Test 9 NO FAN
 Average      1239  0.7312    88.7    77.5   38696   12361      39  Mainly        27
     Min      1000  0.7200    70.8    64.7   30093    9836      31   58-59
     Max      2400  0.9118    90.6    79.4   76652   22873      51

 Test 10 FAN
 Average      2288  0.9118    81.2    60.2   71940   24046      66   Error        27
     Min      2146  0.9118    42.8    40.5   64379   22518      61
     Max      2400  0.9118    84.0    61.7   78453   27388      70

 Test 11 FAN
 Average      2276  0.9080    80.8    59.7   71794   24003      66  Mainly        27
     Min      1500  0.7950    41.7    38.8   59602   20594      60   57-58
     Max      2400  0.9118    84.0    61.4   82481   26551      72

 Average No Fan
 %Reductions    46      19       9      23      46      49      41      -2         0

Heavy Test No Fan

At 100% CPU utilisation, the following measurements were similar to those during the No Fan Light System Test, with the CPU running at 1000 MHz for much of the time, temperatures around 90°C and that for the Power Management Integrated Circuit more than 78°C.


 Test 9 NO FAN                           Integer Floating OpenGL  VMSTAT
  Second     MHz   Volts  CPU °C PMIC °C  MB/sec  MFLOPS     FPS Disk MB/s
       0    2400  0.9118    70.8    64.7
      60    1500  0.7200    85.6    72.5   76652   22873      51     0.3
     120    1500  0.7200    86.2    74.1   50138   15511      50    41.9
     180    1500  0.7200    88.4    75.8   44886   15027      48    58.8
     240    1500  0.7200    89.5    76.6   49106   15012      46    58.1
     300    1500  0.7200    88.9    77.2   44702   14215      45    59.6
     360    1000  0.7200    90.0    77.5   41739   12596      43    58.5
     420    1500  0.7200    89.5    77.6   41734   12524      43    59.3
     480    1000  0.7200    90.0    77.7   40211   12041      42    58.1
     540    1000  0.7200    90.0    78.0   39083   13329      41    58.4
     600    1500  0.7200    89.5    78.2   37814   12529      38    58.3
     660    1500  0.7200    90.0    78.2   36144   11875      38    58.5
     720    1000  0.7200    89.5    78.3   35741   11720      36    58.2
     780    1000  0.7200    90.6    78.5   37614   13467      38    58.5
     840    1000  0.7200    89.5    78.7   33104   10712      35    57.6
     900    1000  0.7200    90.0    78.6   39563   11029      38    58.6
     960    1000  0.7200    90.0    78.4   37259   11448      38    58.2
    1020    1000  0.7200    89.5    78.9   34469   11583      39    57.8
    1080    1000  0.7200    90.0    78.3   35970   11306      38    57.4
    1140    1500  0.7200    90.0    78.7   34045   12281      36    58.6
    1200    1000  0.7200    90.0    78.4   35297   10928      38    59.1
    1260    1500  0.7200    90.0    78.9   37365   12002      36    58.3
    1320    1000  0.7200    90.0    78.5   34004   11252      36    58.2
    1380    1000  0.7200    90.0    78.4   34892   11070      34    58.8
    1440    1000  0.7200    90.0    78.7   36255   10274      37    58.8
    1500    1000  0.7200    88.9    78.7   33912   11320      37    58.3
    1560    1500  0.7200    89.5    79.0   33513   11426      35    58.7
    1620    1000  0.7200    89.5    79.0   30093   10650      35    58.8
    1680    1000  0.7200    89.5    79.4   32852    9836      32    58.7
    1740    1000  0.7200    90.0    79.1   30465   10273      31   122.6
    1800    1500  0.8769    85.1    77.1   32262   10709      32   146.5

 Average    1239  0.7312    88.7    77.5   38696   12361      39
     Min    1000  0.7200    70.8    64.7   30093    9836      31
     Max    2400  0.9118    90.6    79.4   76652   22873      51

Heavy Test With Fan - FAILED

As shown initially below, system behaviour did not appear to be much different to that, at the same point, during the later successful test. However, these are instantaneous measurements that can be different in the next picosecond. Also I did note USB power measurements of 4.8 volts at 0.53 amps, compared with 4.94 and 0.53 quoted above. But this might be due to infrequent manual sampling.


                  Tests 10 and 11 at 900 seconds

T11   900   2366  0.9118    83.4    61.0   61490   24333      68    58.1
T10   900   2256  0.9118    83.4    61.5   70134   22929      61    59.1
 
 Test 10 FAN                             Integer Floating OpenGL  VMSTAT
  Second     MHz   Volts  CPU °C PMIC °C  MB/sec  MFLOPS     FPS Disk MB/s
       0    2400  0.9118    42.8    40.5
      60    2400  0.9118    79.0    55.6   70918   25009      65     9.5
     120    2201  0.9118    82.3    59.7   73729   23355      68    42.9
     180    2366  0.9118    82.9    60.9   68151   24311      67    59.5
     240    2311  0.9118    83.4    61.0   70410   23307      67    59.7
     300    2146  0.9118    82.9    61.0   73093   23714      65    58.6
     360    2311  0.9118    82.3    61.3   69355   22632      64    59.1
     420    2311  0.9118    82.9    61.5   74376   23902      62    59.1
     480    2311  0.9118    83.4    61.0   64379   23731      63    59.2
     540    2201  0.9118    82.9    61.4   72430   22757      66    58.4
     600    2201  0.9118    83.4    61.2   67268   25440      65    58.9
     660    2256  0.9118    82.9    61.7   70452   22864      66    58.2
     720    2311  0.9118    83.4    61.5   66588   22796      64    59.0
     780    2256  0.9118    82.9    61.4   71766   22518      64    59.5
     840    2146  0.9118    84.0    61.7   69162   23801      65    59.0
     900    2256  0.9118    83.4    61.5   70134   22929      61    59.1
     960    2201  0.9118    82.9    61.2   75122   24518      61    31.5
    1020    2400  0.9118    82.9    61.4   74535   23855      64     0.1 FAILED
    1080    2311  0.9118    82.9    61.0   74460   23832      62       0
    1140    2256  0.9118    82.9    61.0   71397   23861      64       0
    1200    2311  0.9118    83.4    61.0   75347   23264      64       0
    1260    2311  0.9118    82.3    61.0   72384   24361      62       0
    1320    2366  0.9118    83.4    61.5   74719   25401      70       2
    1380    2400  0.9118    82.3    61.2   71234   24356      69       0
    1440    2311  0.9118    83.4    61.4   73853   24652      67       0
    1500    2366  0.9118    82.9    61.3   71402   24619      66       0
    1560    2146  0.9118    84.0    61.4   78453   23417      70       0
    1620    2256  0.9118    84.0    61.0   71631   24961      70       0
    1680    2311  0.9118    82.9    61.0   74461   25101      69       0
    1740    2201  0.9118    83.4    61.3   73486   24737      69       0
    1800    2400  0.9118    70.3    57.1   73493   27388      68       0

 Average    2288  0.9118    81.2    60.2   71940   24046      66
     Min    2146  0.9118    42.8    40.5   64379   22518      61
     Max    2400  0.9118    84.0    61.7   78453   27388      70

Second Heavy Test With Fan

Here, performance did not vary much but there was some CPU MHz throttling. Perhaps the official fan will avoid this, and the new 5 amps power supply will overcome the observed undesirable power variations.


 Test 11 FAN                             Integer Floating OpenGL  VMSTAT
  Second     MHz   Volts  CPU °C PMIC °C  MB/sec  MFLOPS     FPS Disk MB/s
       0    2400  0.9118    41.7    38.8
      60    2400  0.9118    74.7    53.7   77484   26076      67     4.5
     120    2400  0.9118    81.8    58.7   82481   25011      72    42.3
     180    2400  0.9118    82.9    60.0   74579   26236      66    58.3
     240    2366  0.9118    81.8    60.1   69930   23368      63    57.7
     300    2311  0.9118    83.4    60.5   76266   22233      68    57.9
     360    2311  0.9118    83.4    60.7   72493   25286      66    58.7
     420    2311  0.9118    82.3    61.0   67909   23927      70    57.9
     480    2311  0.9118    83.4    60.8   73526   25794      63    57.6
     540    2256  0.9118    83.4    61.0   74888   26551      67    57.9
     600    2366  0.9118    82.9    61.0   74110   23912      66    57.4
     660    2256  0.9118    82.9    61.1   75024   25414      65    57.6
     720    2256  0.9118    82.9    61.0   59602   25025      65    59.1
     780    2256  0.9118    83.4    61.0   67930   22907      65    57.1
     840    2256  0.9118    84.0    61.0   71962   24011      67    58.2
     900    2366  0.9118    83.4    61.0   61490   24333      68    58.1
     960    2311  0.9118    82.3    61.1   63462   22888      65    58.2
    1020    2256  0.9118    83.4    61.0   67540   25537      68    57.3
    1080    2256  0.9118    82.9    61.0   70804   23791      66    57.8
    1140    2400  0.9118    83.4    61.0   71113   22011      64    57.5
    1200    2256  0.9118    82.3    61.4   77050   23111      70    58.7
    1260    2311  0.9118    83.4    61.0   73053   24148      63    57.7
    1320    2256  0.9118    82.3    60.9   74469   23307      66    57.6
    1380    2256  0.9118    83.4    61.2   72160   22726      66    58.2
    1440    2256  0.9118    82.3    60.9   73994   24276      66    59.5
    1500    2256  0.9118    83.4    61.0   72659   22260      67    56.9
    1560    2256  0.9118    82.9    61.2   74870   21866      68    57.8
    1620    2256  0.9118    83.4    61.0   76735   23945      66    57.5
    1680    2201  0.9118    83.4    60.9   70727   20594      66    57.6
    1740    2311  0.9118    83.4    61.2   65023   24760      63   123.7
    1800    1500  0.7950    64.2    55.4   70479   24786      60   158.3

 Average    2276  0.9080    80.8    59.7   71794   24003      66
     Min    1500  0.7950    41.7    38.8   59602   20594      60
     Max    2400  0.9118    84.0    61.4   82481   26551      72

Firefox, Bluetooth and YouTube

Whilst looking at numbers for this report and other things, I had movies playing via the readily accessible YouTube at 1080p HD for a few hours. YouTube was accessed via Firefox with Bluetooth sound played on a rechargeable speaker. Examples of MHz, Volts and Temperatures, with ondemand frequency scaling, were :


 Start at Fri Aug 25 10:33:03 2023

 Using 361 samples at 10 second intervals

 Seconds
    0.0   ARM MHz=1500, core volt=0.9065V, CPU temp=47.2°C, pmic temp=42.3°C
   10.0   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3°C, pmic temp=42.5°C
   20.1   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3°C, pmic temp=42.3°C
   30.2   ARM MHz=2400, core volt=0.9065V, CPU temp=48.8°C, pmic temp=42.7°C

 1028.3   ARM MHz=1500, core volt=0.9065V, CPU temp=43.9°C, pmic temp=40.7°C
 1038.4   ARM MHz=2400, core volt=0.9065V, CPU temp=46.6°C, pmic temp=41.0°C
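
Log lines in this format are regular enough to summarise with a short script. The following Python sketch is not the monitor program used for these tests. The regular expression and the names parse_sample and summarise are assumptions made for illustration, and the sample lines are copied from the log above.

```python
import re

# One log line looks like:
#   10.0   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3°C, pmic temp=42.5°C
# The regular expression and field names below are assumptions for this sketch.
LINE = re.compile(
    r"\s*(?P<secs>[\d.]+)\s+ARM MHz=(?P<mhz>\d+), core volt=(?P<volts>[\d.]+)V, "
    r"CPU temp=(?P<cpu>[\d.]+)°C, pmic temp=(?P<pmic>[\d.]+)°C"
)


def parse_sample(line):
    """Return the numeric fields of one log line, or None if it does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


def summarise(samples, field):
    """Return (min, average, max) of one field over all samples."""
    values = [sample[field] for sample in samples]
    return min(values), sum(values) / len(values), max(values)


LOG = """\
    0.0   ARM MHz=1500, core volt=0.9065V, CPU temp=47.2°C, pmic temp=42.3°C
   10.0   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3°C, pmic temp=42.5°C
   20.1   ARM MHz=2400, core volt=0.9065V, CPU temp=48.3°C, pmic temp=42.3°C
   30.2   ARM MHz=2400, core volt=0.9065V, CPU temp=48.8°C, pmic temp=42.7°C

 1028.3   ARM MHz=1500, core volt=0.9065V, CPU temp=43.9°C, pmic temp=40.7°C
 1038.4   ARM MHz=2400, core volt=0.9065V, CPU temp=46.6°C, pmic temp=41.0°C
"""

# Blank lines do not match the pattern and are skipped.
samples = [s for s in map(parse_sample, LOG.splitlines()) if s is not None]
for field in ("mhz", "volts", "cpu", "pmic"):
    low, average, high = summarise(samples, field)
    print(f"{field:6s} min {low:9.4f}  avg {average:9.4f}  max {high:9.4f}")
```

The same approach would produce Average, Min and Max rows like those at the end of the stress test tables, given the full set of samples.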

Pi 5 Bluetooth sound levels were not loud enough for me. They were significantly louder from a side by side Pi 400. This applied to both YouTube movies and local music played by VLC media player.

Pi 5 The Vector Processor including whetv64SPg12 and whetv64DPg12

During the 1980s and early 90s I was responsible for evaluating and acceptance testing of supercomputers for the UK government and those centrally funded for universities. For multiple user development the latter were particularly interested in vector versus scalar performance. I converted my Fortran scalar Whetstone benchmark to one where every test function could vectorize, with a default vector length of 256 words.

The vector version was finely tuned, hands on, on Cray 1 serial 1 that was at Didcot Rutherford Laboratory for a time. First real use was during factory and site trials of the first UK full scale Cray 1. Next was the first CDC Cyber 205 and last was attending user benchmark tests in Japan for ULCC at NEC and Fujitsu, where my benchmarks were also run.

I recompiled the scalar and vector C Whetstone benchmarks on the Pi 5, using gcc 12. The scalar results were effectively the same as those from gcc 8, quoted earlier in this topic. Results for the single and double precision vector version were as follows. Note that the N5 and N8 tests, which call mathematical functions (both executed at DP), mainly determine the final rating.

The gcc 12 vector benchmark was also run on the Pi 4, to compare like with like. Then, for the three main MFLOPS measurements, the Pi 5 was effectively 3.1 times faster for both single and double precision operation. For both systems, double precision MFLOPS results were effectively half those at single precision, as expected with SIMD vector operation.

 Pi 4 GCC 12 SP

 Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sun Dec 10 17:42:10 2023

 Loop content                  Result    MFLOPS    MOPS  Seconds

 N1 floating point  -1.13316142559051      2387              0.4
 N2 floating point  -1.13312149047851      2407              2.8
 N3 if then else     1.00000000000000              7428      0.7
 N4 fixed point     12.00000000000000              1736      9.0
 N5 sin,cos etc.     0.49998238682747                79     52.2
 N6 floating point   0.99999982118607      2577             10.4
 N7 assignments      3.00000000000000             10223      0.9
 N8 exp,sqrt etc.    0.75002217292786                78     23.7

 MWIPS                                     4955            100.0

 Pi 4 GCC 12 DP

 Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sun Dec 10 17:47:48 2023

 Loop content                  Result    MFLOPS    MOPS  Seconds

 N1 floating point  -1.13314558088707      1164              0.7
 N2 floating point  -1.13310306766606      1173              4.9
 N3 if then else     1.00000000000000              7424      0.6
 N4 fixed point     12.00000000000000              1735      7.8
 N5 sin,cos etc.     0.49998080312724                76     47.0
 N6 floating point   0.99999988868927      1295             18.0
 N7 assignments      3.00000000000000              5325      1.5
 N8 exp,sqrt etc.    0.75002006515491                83     19.4

 MWIPS                                     4314            100.0

 Pi 5 GCC 12 SP

 Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sat Oct  7 10:46:30 2023

 Loop content                  Result    MFLOPS    MOPS  Seconds  Pi 5/4

 N1 floating point  -1.13316142559051      7393              0.3    3.10
 N2 floating point  -1.13312149047851      7365              2.0    3.06
 N3 if then else     1.00000000000000             14169      0.8    1.91
 N4 fixed point     12.00000000000000              2399     14.5    1.38
 N5 sin,cos etc.     0.49998238682747               177     51.7    2.24
 N6 floating point   0.99999982118607      8079              7.4    3.13
 N7 assignments      3.00000000000000             26419      0.8    2.58
 N8 exp,sqrt etc.    0.75002217292786               178     23.0    2.29

 MWIPS                                    10975            100.3    2.21

 Pi 5 GCC 12 DP

 Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sat Oct  7 10:50:40 2023

 Loop content                  Result    MFLOPS    MOPS  Seconds  Pi 5/4

 N1 floating point  -1.13314558088707      3603              0.5    3.10
 N2 floating point  -1.13310306766606      3620              3.6    3.09
 N3 if then else     1.00000000000000             14168      0.7    1.91
 N4 fixed point     12.00000000000000              2399     12.9    1.38
 N5 sin,cos etc.     0.49998080312724               172     47.5    2.25
 N6 floating point   0.99999988868927      3998             13.3    3.09
 N7 assignments      3.00000000000000             13172      1.4    2.47
 N8 exp,sqrt etc.    0.75002006515491               183     20.0    2.21

 MWIPS                                     9830             99.9    2.28
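
The 3.1 times figure and the single versus double precision relationship can be checked from the N1, N2 and N6 MFLOPS results. The following Python sketch uses only numbers copied from the four result tables. The variable names are arbitrary and the averaging method (a simple mean of the three ratios) is an assumption, not necessarily how the quoted figure was derived.

```python
# MFLOPS for the N1, N2 and N6 tests, copied from the Pi 4 and Pi 5 results.
pi4 = {"SP": [2387, 2407, 2577], "DP": [1164, 1173, 1295]}
pi5 = {"SP": [7393, 7365, 8079], "DP": [3603, 3620, 3998]}


def mean(values):
    return sum(values) / len(values)


# Pi 5 / Pi 4 speed ratios for each precision.
gain = {}
for precision in ("SP", "DP"):
    ratios = [new / old for old, new in zip(pi4[precision], pi5[precision])]
    gain[precision] = mean(ratios)
    print(precision, [f"{r:.2f}" for r in ratios], f"mean {gain[precision]:.2f}")

# Double precision speed as a fraction of single precision speed.
dp_fraction = {
    "Pi 4": mean([dp / sp for sp, dp in zip(pi4["SP"], pi4["DP"])]),
    "Pi 5": mean([dp / sp for sp, dp in zip(pi5["SP"], pi5["DP"])]),
}
for system, fraction in dp_fraction.items():
    print(f"{system} DP/SP {fraction:.2f}")
```

Both mean gains come out close to 3.1, and the DP/SP fractions are close to one half on both systems.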

Example Of Vector Instructions Compiled

These are for the first single precision test function for what is probably the key part. Maximum speed of operation would be a long sequence of fused multiply and add or subtract instructions (fmla or fmls) that can produce 8 results per clock cycle for each linked vector pipeline. The disassembled code has too many non-arithmetic instructions, resulting in just over 3 operations per clock cycle on the Pi 5.


 .L11:  add     x0, x0, 16
        ldr     q4, [x0, -16]
        ldr     q0, [x0, 4816]
        ldr     q9, [x0, 9648]
        fadd    v4.4s, v0.4s, v4.4s
        ldr     q8, [x0, 14480]
        fadd    v4.4s, v4.4s, v9.4s
        fsub    v4.4s, v4.4s, v8.4s
        fmla    v0.4s, v1.4s, v4.4s
        fsub    v0.4s, v0.4s, v9.4s
        fadd    v0.4s, v0.4s, v8.4s
        fmul    v0.4s, v0.4s, v1.4s
        fneg    v2.4s, v0.4s
        mov     v5.16b, v0.16b
        mov     v3.16b, v0.16b
        fmla    v2.4s, v1.4s, v4.4s
        fmls    v5.4s, v1.4s, v4.4s
        fmla    v3.4s, v1.4s, v4.4s
        fadd    v2.4s, v2.4s, v9.4s
        mov     v4.16b, v5.16b
        fadd    v2.4s, v2.4s, v8.4s
        fmla    v4.4s, v2.4s, v1.4s
        fmla    v3.4s, v2.4s, v1.4s
        fadd    v4.4s, v4.4s, v8.4s
        fmls    v3.4s, v4.4s, v1.4s
        fmul    v3.4s, v3.4s, v1.4s
        fadd    v0.4s, v3.4s, v0.4s
        str     q3, [x0, -16]
        fmls    v0.4s, v2.4s, v1.4s
        fmla    v0.4s, v4.4s, v1.4s
        fmul    v0.4s, v0.4s, v1.4s
        fsub    v5.4s, v3.4s, v0.4s
        str     q0, [x0, 4816]
        fsub    v0.4s, v0.4s, v3.4s
        mov     v3.16b, v5.16b
        fmla    v3.4s, v2.4s, v1.4s
        mov     v2.16b, v3.16b
        fmla    v2.4s, v4.4s, v1.4s
        fmul    v2.4s, v2.4s, v1.4s
        fadd    v0.4s, v0.4s, v2.4s
        str     q2, [x0, 9648]
        fmla    v0.4s, v4.4s, v1.4s
        fmul    v0.4s, v0.4s, v1.4s
        str     q0, [x0, 14480]
        cmp     x0, x22
        bne     .L11
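
The instruction mix in that loop can be tallied to see how far it is from a pure sequence of fused instructions. The following Python sketch is an illustration only: the mnemonic list is transcribed from the listing, and the operation counts (one per lane for fadd, fsub and fmul, two per lane for fmla and fmls, four lanes per .4s register) are my assumptions about how to count. The benchmark's own MFLOPS figure is based on operations in the source code, so it need not match this count exactly.

```python
from collections import Counter

# Mnemonics of the 46 instructions in the loop, in program order.
LOOP = """
add ldr ldr ldr fadd ldr fadd fsub fmla fsub fadd fmul fneg mov mov fmla
fmls fmla fadd mov fadd fmla fmla fadd fmls fmul fadd str fmls fmla fmul
fsub str fsub mov fmla mov fmla fmul fadd str fmla fmul str cmp bne
""".split()

LANES = 4  # .4s means four single precision values per 128 bit register
OPS_PER_LANE = {"fadd": 1, "fsub": 1, "fmul": 1, "fmla": 2, "fmls": 2}

counts = Counter(LOOP)
total = len(LOOP)
arithmetic = sum(counts[m] for m in OPS_PER_LANE)
fused = counts["fmla"] + counts["fmls"]
flops = sum(counts[m] * ops * LANES for m, ops in OPS_PER_LANE.items())

print(f"instructions per pass      {total}")
print(f"counted arithmetic         {arithmetic}")
print(f"fused multiply add/sub     {fused}")
print(f"operations per pass        {flops}")
print(f"operations per instruction {flops / total:.2f}")

# Measured N1 single precision result at 2400 MHz, from the Pi 5 table.
measured_per_clock = 7393 / 2400
print(f"measured MFLOPS per MHz    {measured_per_clock:.2f}")
```

Only 12 of the 46 instructions are the fused type that can deliver 8 results each, which is consistent with the measured rate of just over 3 operations per clock cycle.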

Comparison With Old Supercomputers

Following are Scalar and Vector Whetstone benchmark results for the original supercomputers. In the 1980s they provided a useful tool in confirming the choice for university work in dealing with multiple user access, typically with programs containing 90% vectorisable code. Then the choices depended on scalar versus vector performance and multiple processors versus multiple pipelines.

Pi 5 results are included and can look good on a per MHz basis. See the next page for comparisons, including for the benchmark originally used to validate performance of the first Cray 1 supercomputer.

                                                          Vector
                              Scalar          Vector      /Scalar
                     MHz   MWIPS  MFLOPS   MWIPS  MFLOPS   MFLOPS   DATE

 Cray 1               80    16.2     5.9      98      47      8.0   1978
 CDC Cyber 205        50    11.9     4.9     161      57     11.7   1981
 Cray XMP1           118    30.3    11.0     313     151     13.7   1982
 Cray 2/1            244    25.8     N/A     425     N/A            1984
 Amdahl VP 500 #     143    21.7     7.5     250     103     13.8   1984
 Amdahl VP 1100 #    143    21.7     7.5     374     146     19.5   1984
 Amdahl VP 1200 #    143    21.7     7.5     581     264     35.3   1984
 IBM 3090-150 VP      54    12.1     4.9      60      17      3.6   1986
 (CDC) ETA 10E        95    15.7     6.5     335     124     19.2   1987
 Cray YMP1           154    31.0    12.0     449     195     16.3   1987
 Fujitsu VP-2400/4   312    71.7    25.4    1828     794     31.3   1991
 NEC SX-3/11         345    42.9    17.0    1106     441     25.9   1991
 NEC SX-3/12         345    42.9    17.0    1667     753     44.3   1991

 # Fujitsu Systems

 Raspberry Pi 5 SP  2400    5843    1206   10986    7599      6.3   2023
 Raspberry Pi 5 DP  2400     N/A     N/A    9816    3731      3.1   2023
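
The Vector/Scalar column and the per MHz view can be reproduced from the MHz and MFLOPS columns. The following Python sketch covers a few of the systems, with values copied from the table. The per MHz figure is not in the table and is calculated here only to illustrate the point about Pi 5 results on that basis.

```python
# (MHz, scalar MFLOPS, vector MFLOPS) copied from the table above.
systems = {
    "Cray 1": (80, 5.9, 47),
    "Cray XMP1": (118, 11.0, 151),
    "NEC SX-3/12": (345, 17.0, 753),
    "Raspberry Pi 5 SP": (2400, 1206, 7599),
}

results = {}
for name, (mhz, scalar, vector) in systems.items():
    results[name] = {
        "vector_over_scalar": vector / scalar,
        "vector_per_mhz": vector / mhz,
    }
    print(
        f"{name:18s} vector/scalar {vector / scalar:5.1f}"
        f"   vector MFLOPS per MHz {vector / mhz:5.2f}"
    )
```

On these numbers the Pi 5 has a lower vector to scalar ratio than most of the supercomputers, because its scalar speed is so much higher, but a higher vector MFLOPS per MHz figure.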

PC and Pi Performance Comparisons

The following results are for the original Classic Benchmarks, comprising Livermore Loops, Linpack 100 and Whetstone applications, for PCs from 1991 and the Pi 5. They tended to be produced by the latest compiler version, available at the time. These probably represent best case Pi 5 comparative performance, mainly better than the Core i5 CPU on a per MHz basis.

To be fair, the later MP-MFLOPS results, included below, reflect the other extreme via SIMD vector performance. However, my present compiling procedures might be confusing for a newbie. For the Pi 5, the compiling parameters used for all programs were -O3 and -march=armv8-a, for optimisation level 3 using the armv8-a architecture. For Intel, the method I adopted requires inclusion of compile directives for options such as SSE, AVX, AVX2 or AVX512.

For those who only consider maximum performance, the Intel based PC MP-MFLOPS speeds are indicated as being far superior. But on a MFLOPS per MHz basis, the Pi 5 results were between Intel SSE and AVX measurements. Considering these and repeated runs, the Core i5 CPUs (on a laptop in this case) appear to be running at a lower MHz, using 4 threads or more.

Given an application mainly running 4 core vector MP-MFLOPS type code and a much smaller part executing the slow Whetstone scalar MFLOPS type functions, the Pi 5 can appear to be faster than that Core i5 PC. This is shown in the example (tongue in cheek) performance calculations shown below. Note the Pi 5 / Cray 1 comparisons, particularly Livermore Loops results, the benchmark originally run to validate required performance of the first Cray 1 system. Here, Gmean MFLOPS was the official average, where the Raspberry Pi 5 is indicated as being 194 times faster.

                                                                              LOOPS
                                                                              Gmean
                                LLLOOPS MFLOPS  MFLOPS  MWIPS MFLOPS Device  MFLOPS
 CPU                   MHz    Max  Gmean   Min Linpack  Whets  Whets   Year per MHz
 Main Columns V V V V

 Cray 1                 80   82.1   11.9   1.2      27   16.2    6.0   1978    0.15

 Windows or Linux PCs

 AMD 80386              40    1.2    0.6   0.2     0.5    5.7    0.8   1991    0.02
 80486 DX2              66    4.9    2.7   0.7     2.6     15    3.3   1992    0.04
 Pentium                75     24    7.7   1.3     7.6     48     11   1994    0.10
 Pentium               100     34     12   2.1      12     66     16   1994    0.12
 Pentium               200     66     22   3.8            132     31   1996    0.11
 AMD K6                200     68     22   2.7      23    124     26   1997    0.11
 Pentium Pro           200    121     34   3.6      49    161     41   1995    0.17
 Pentium II            300    177     51   5.5      48    245     61   1997    0.17
 AMD K62               500    172     55   6.0      46    309     67   1999    0.11
 Pentium III           450    267     77   8.3      62    368     92   1999    0.17
 Pentium 4            1700   1043    187    19     382    603    146   2002    0.11
 Athlon Tbird         1000   1124    201    23     373    769    161   2000    0.20
 Core 2               1830   1650    413    40     998   1557    374   2007    0.23
 Core i5              2300   2326    438    35    1065   1813    428   2009    0.19
 Athlon 64            2150   2484    447    48     812   1720    355   2005    0.21
 Phenom II            3000   3894    644    64    1413   2145    424   2009    0.21
 Core i7 930          3066   2751    732    68    1765   2496    576   2010    0.24
 Core i7 4820K        3900   5508   1108    88    2680   3114    716   2013    0.28
 Core i5 1135G7       4150   7505   1387    92    3541   3293    802   2021    0.33

 Linux PCs AVX New Compiler

 Core i7 4820K        3900  12878   2615   597    5098   5887   1174   2013    0.67
 Core i5 1135G7       4150  19794   3568   943    6998   6477   1077   2021    0.86

 Raspberry Pi          700    140     55    17      42    271     94   2013    0.08
 Raspberry Pi 2B       900    248    115    42     120    525    244   2015    0.13
 Raspberry Pi 3B      1200    436    184    56     180    725    324   2016    0.15
 Raspberry Pi 4B      1500   1861    679   180     957   1883    415   2019    0.35
 Raspberry Pi 4B 64b  1500   2491    730   212    1060   2269    476   2019    0.35
 Raspberry Pi 5 64b   2400  10577   2308   734    4136   5843   1206   2023    0.96

 Core i5 / Pi 5       1.73   1.87   1.55  1.28    1.69   1.11   0.89           0.90
 Pi 5 / Cray 1          30    129    194   612     153    361    201

 #################################################################################

 MP-MFLOPS              -----------MFLOPS------------  ------MFLOPS/MHz------
 Threads          MHz       1       2       4       8     1     2     4     8

 Core i7 SSE     3900   23355   46883   88776  119313   6.0  12.0  22.8  30.6
 Core i7 AVX     3900   45459   91277  172443  184765  11.7  23.4  44.2  47.4
 Core i5 SSE     4150   33273   64727   86194  119426   8.0  15.6  20.8  28.8
 Core i5 AVX     4150   64946  128515  153955  225265  15.6  31.0  37.1  54.3
 Core i5 AVX512  4150   94417  185785  324870  325915  22.8  44.8  78.3  78.5
 Pi 5            2400   21519   42488   80947   85086   9.0  17.7  33.7  35.5

 #################################################################################

 Performance Calculations

                  i5 SSE          i5 AVX            Pi 5
    MOPS   MFLOPS   secs   MFLOPS   secs   MFLOPS   secs

    5000     1077   4.64     1077   4.64     1206   4.15
   50000    86194   0.58                    80947   0.62
   50000                   153955   0.32
   Total            5.22            4.96            4.77
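
The tongue in cheek performance calculations are simply operations divided by speed, summed over the scalar and vector parts. The following Python sketch repeats the arithmetic with numbers copied from the tables. The function name seconds and the dictionary layout are arbitrary choices, and unrounded totals can differ from the table by 0.01 because the table adds rounded values.

```python
def seconds(mops, mflops):
    """Time to execute mops million operations at mflops million per second."""
    return mops / mflops


SCALAR_MOPS = 5000   # part running Whetstone scalar MFLOPS type functions
VECTOR_MOPS = 50000  # part running 4 thread MP-MFLOPS type vector code

# (scalar Whetstone MFLOPS, 4 thread MP-MFLOPS) copied from the tables above.
systems = {
    "i5 SSE": (1077, 86194),
    "i5 AVX": (1077, 153955),
    "Pi 5": (1206, 80947),
}

totals = {}
for name, (scalar_mflops, vector_mflops) in systems.items():
    scalar_secs = seconds(SCALAR_MOPS, scalar_mflops)
    vector_secs = seconds(VECTOR_MOPS, vector_mflops)
    totals[name] = scalar_secs + vector_secs
    print(f"{name:7s} {scalar_secs:5.2f} + {vector_secs:5.2f} = {totals[name]:5.2f} secs")

# Livermore Loops Gmean MFLOPS, Pi 5 versus Cray 1.
gmean_ratio = 2308 / 11.9
print(f"Pi 5 / Cray 1 Gmean ratio {gmean_ratio:.0f}")
```

The Pi 5 total is the lowest only because the slow scalar part dominates the run time. With a larger vector share the Core i5 AVX result would win.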

New 5 Amps Power Supply and Active Cooler

CPU Stress Tests

The fan on my new active cooler did not spin. I might have broken the JST connection while trying to insert the fiddly little thing. However, I have run some stress tests by plonking my cheap old Pi 4 fan on top of the dead new one. That and the new heatsink appear to do a good job and might be recommended as a useful backup arrangement.

Below are temperature graphs of my earlier integer and floating point tests using 64 KB and 512 KB of data. Maximum 4 thread performance was 73 GFLOPS for both floating point tests. For integers it was 240 GB/second at 64 KB then 160 GB/second at 512 KB, the latter being the hottest with data transfers reading from L2 cache as opposed to L1 at 64 KB.
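
Those maximum speeds can be put on a per core, per clock cycle basis for comparison with the 8 results per clock cycle discussed for the vector instructions. The following Python sketch assumes all four cores were running at the full 2400 MHz when the maximum speeds were measured, which the measurements here do not confirm.

```python
CORES = 4
MHZ = 2400  # assumed constant for all cores during the maximum speed runs


def per_core_per_clock(total_per_second):
    """Convert a 4 thread total rate into units per core per clock cycle."""
    return total_per_second / (CORES * MHZ * 1e6)


flops_per_clock = per_core_per_clock(73e9)         # 73 GFLOPS
bytes_per_clock_64k = per_core_per_clock(240e9)    # 240 GB/second at 64 KB
bytes_per_clock_512k = per_core_per_clock(160e9)   # 160 GB/second at 512 KB

print(f"floating point  {flops_per_clock:5.2f} results per clock per core")
print(f"integer  64 KB  {bytes_per_clock_64k:5.2f} bytes per clock per core")
print(f"integer 512 KB  {bytes_per_clock_512k:5.2f} bytes per clock per core")
```

On that assumption the floating point tests run at about 7.6 results per clock cycle per core, close to the maximum of 8.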

The (part) active cooler graph indicates less than 80°C for all measurements, others demonstrating constant maximum CPU MHz and performance. The other graph only covers the integer tests, with and without the old Pi 4 fan. Then, using 64 KB with the fan, CPU MHz throttling was just about avoided. On running without an operational fan, it is commendable that the Pi 5 can continue running at those high temperatures, where throttled performance can be demonstrated to be far superior to that of a super cooled Pi 4.

Heavy System Stress Test

This is a repeat of the above, comprising INTitHOT64g12, with 2 threads at 64 KB, MP-FPUStress64g12 with 2 threads at 512 KB, burnindrive264g12 to a PC via Ethernet, burnindrive264g12 to a USB 3 disk drive and videogl64C12. They were run with the Active Cooler enabled, initially using the new 5 amps power supply, then controlled by the 4 amps PoE arrangement. The two drive MB/second results are reading speeds, the second being for repetitive reading of the same blocks, representing bus speed where the drive has a buffer.

There were some differences in results of the two sessions at 5 amps, but nothing unusual for a mixed workload. The first test at 4 amps failed, as earlier, with disk reading errors being recorded, this time after 100 seconds. The second one at 4 amps ran successfully, essentially providing the same levels of performance as those at 5 amps. For the first 4 amps test, the benchmark results that were recorded indicated slower performance.

There were noticeable differences in measured power where the input level was less than 5 volts, using the 4 amps supply. For some inexplicable reason, the failed test input current recording was particularly low.

An additional test was run excluding the floating point program, using the 4 amps power supply and 512 KB data size for INTitHOT via 4 threads. The latter is slower than at 64 KB but draws a higher current and produces a higher CPU temperature. Higher USB voltage might have helped in avoiding disk errors.

                              INT     MP                 CPU  PMIC OpenGL  Drive   LAN
         Volts  Amps       MB/sec MFLOPS   MHz   Volts    °C    °C    FPS   MB/s  MB/s

 5A Supply
 Power    5.15  2.38  Min   62371  19494  2400  0.8833  37.8  40.0   59.0   52.8  35.1
 USB      4.92  0.53  Avg   75234  24713  2400  0.8833  63.5  62.4   64.4  117.7  36.7
                      Max   89243  28868  2400  0.8833  67.5  65.0   68.0

 Repeat
                      Min   63097  23625  2400  0.8833  38.4  40.1   60.0   58.5  28.6
                      Avg   77075  25451  2400  0.8833  64.4  62.8   66.4  159.1  31.7
                      Max   89625  27352  2400  0.8833  68.6  66.0   71.0

 4A Supply
 Power    4.88  1.98  Min   56159  18062  2400  0.7200  37.3  37.9   44.0    N/A  31.3
 USB      4.71  0.54  Avg   63134  20087  2400  0.8567  51.5  49.9   56.6    N/A   N/A
 FAILED               Max   69947  23773  2400  0.8840  59.8  57.2   70.0

 Repeat
 Power    4.84  2.39  Min   63472  22513  2400  0.8840  37.8  39.5   59.0   52.6  30.1
 USB      4.71  0.54  Avg   76104  25127  2400  0.8840  59.4  58.4   64.7  159.0  32.2
                      Max   84488  27214  2400  0.8840  62.6  60.7   70.0

 4A Supply
 Power    5.07  2.74  Min   95040         2400  0.8833  35.1  38.6   50.0   57.3  28.6
 USB      4.81  0.53  Avg  100302         2400  0.8833  65.0  64.3   61.9  156.8  31.4
                      Max  104684         2400  0.8833  69.2  67.2   66.0
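
The differences in measured power are easier to see in watts, as volts multiplied by amps. The following Python sketch uses the Power readings copied from the table. The session labels are mine, and treating each volts and amps pair as simultaneous readings is an assumption.

```python
def watts(volts, amps):
    """Electrical power from a voltage and current reading."""
    return volts * amps


# (volts, amps) Power readings copied from the table above.
sessions = {
    "5A supply": (5.15, 2.38),
    "4A supply, failed": (4.88, 1.98),
    "4A supply, repeat": (4.84, 2.39),
    "4A supply, no floating point": (5.07, 2.74),
}

power = {name: watts(v, a) for name, (v, a) in sessions.items()}
for name, value in power.items():
    print(f"{name:30s} {value:6.2f} W")
```

The failed session stands out at under 10 watts, against roughly 11.6 to 13.9 watts for the others.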

Solid State Hard Drive

I obtained another Pi 5 at the same time as the 5 amps power supply and active cooler. I had overstressed the original board, creating an irrecoverable hardware failure. This occurred on plugging in a new Solid State Drive, where tests indicated power supply irregularities. It is a SanDisk 1TB Extreme Portable SSD, USB-C USB 3.2 Gen 2, External NVMe Solid State Drive up to 1050 MB/s, now with FAT32 and Ext3 partitions. I quite rightly completed all other proposed tests before returning to those for the SSD, this time with the active cooler in use.

I repeated the last heavy stress test via both the 5 amps and 4 amps power supplies. The results indicate around a 10% increase in USB current, with slightly faster operation at 4 amps but at a higher temperature. A few more runs would be required to determine the truth.

With these particular drives, SSD reading speed was around 2.45 times faster.

                              INT     MP                 CPU  PMIC OpenGL  Drive
         Volts  Amps       MB/sec MFLOPS   MHz   Volts    °C    °C    FPS   MB/s

 5A Supply SSD
 Power    5.12  2.74  Min   94755         2400  0.8838  36.7  40.2   60.0  146.7
 USB      4.80  0.59  Avg   96325         2400  0.8838  64.8  64.6   64.7  166.1
                      Max  109008         2400  0.8838  68.6  68.3   69.0

 4A Supply SSD
 Power    5.12  2.95  Min  109197         2400  0.8830  38.4  41.7   64.0  148.5
 USB      4.84  0.59  Avg  111188         2400  0.8830  67.7  67.9   67.2  168.4
                      Max  119425         2400  0.8830  71.9  71.1   70.0

DriveSpeed and LanSpeed I/O Benchmarks

As indicated in I/O Benchmarks above, there are two varieties of the original drive benchmark, DriveSpeed using Direct I/O and LanSpeed without that option. The former would not run via 64 bit OS software and extra large files have to be selected to avoid caching data using the latter.

First of the following results is for LanSpeed using Ext3 formatted files where one of the 4096 MB files appears to have been partially cached and not identified in vmstat sampling. Note that USB power consumption was up to 640 mA at 5.14 volts.

The second details are partial results running DriveSpeed on a FAT32 partition, where writing large files was slower than during the Ext3 test but similar on reading. The main observation is the exceptionally slow speed on handling small files, particularly on writing. Partition size was around 500 GB.

New Benchmark Large Files above indicates best USB 3 hard drive results like 30 MB/second writing and 310 MB/second reading. Results for that benchmark on the SSD were around 165 and 415 MB/second respectively.

 LanSpeed RasPi 64 Bit gcc 8 Tue Dec 26 12:49:03 2023

 Selected File Path:
 /media/raspberrypi/Ext3/
 Total MB  491955, Free MB  491955

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

 4096   491.86   393.63   360.86   416.77   937.70   420.40
 8192   407.49   364.13   365.28   579.91   412.14   411.16

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.002    0.002    0.002     0.52     0.49     0.48

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec    139.48    34.81   100.02   479.48   558.20  1353.81
 ms/file     0.03     0.24     0.16     0.01     0.01     0.01    0.019

 End of test Tue Dec 26 12:52:22 2023

 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
  1  3      0 6805744 182608 752752    0    0     0 413554 3775 2544  0 22 46 31  0
  2  2      0 6805744 182608 752752    0    0     0 401661 6715 8275  0 18 32 50  0
  1  3      0 6805744 182608 752752    0    0   123 382200 4824 5126  0 20 32 48  0
  1  3      0 6805744 182608 752752    0    0    13 332742 4379 4918  0 18 27 55  0
  1  3      0 6805744 182608 752752    0    0    66 363967 4509 4615  0 17 47 36  0
  2  2      0 6805744 182608 752752    0    0    46 345998 6905 9378  0 17 45 38  0
  2  0      0 6805744 182608 752752    0    0 85870 272317 4082 4434  0  4 55 41  0
  1  1      0 6805744 182608 752752    0    0 409245     0 3435  648  0  5 73 21  0
  1  1      0 6805744 182608 752752    0    0 381261     0 3076  616  0  5 74 20  0
  1  1      0 6805744 182608 752752    0    0 406957     3 3332  846  0  5 74 21  0
  2  0      0 6805744 182608 752752    0    0 414537     1 3147  597  0  5 74 21  0

 DriveSpeed RasPi 64 Bit gcc 8 Tue Dec 26 12:33:43 2023

 /media/raspberrypi/FAT32/

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

 1024   194.07   198.99   218.42   426.35   426.37   425.99

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 ms/file   104.09   104.07   104.07     0.14     0.21     0.12    0.052
