Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests

Roy Longbottom

Summary	Introduction	Whetstone Benchmark
Dhrystone Benchmark	Linpack 100 Benchmark	Livermore Loops Benchmark
FFT Benchmarks	BusSpeed Benchmark	MemSpeed Benchmark
NeonSpeed Benchmark	MultiThreading Benchmarks	MP-Whetstone Benchmark
MP-Dhrystone Benchmark	MP NEON Linpack Benchmark	MP-BusSpeed Benchmark
MP-BusSpeed Disassembly	MP-RandMem Benchmark	MP-MFLOPS Benchmarks
MP-MFLOPS Disassembly	MP-MFLOPS Sumchecks	OpenMP-MFLOPS Benchmarks
OpenMP-MemSpeed Benchmarks	Stress Testing Benchmarks	Integer Stressing Benchmark
Single Precision Stress Benchmark	Double Precision Stress Benchmark	High Performance Linpack
DriveSpeed Benchmark	USB 3 and 2 Benchmarks	Drive Write/Reboot/Read Tests
LAN and WiFi Benchmarks	Java Whetstone Benchmark	JavaDraw Benchmark
OpenGL Benchmark	Stress Tests	HP Linpack Stress Test
Integer Stress Test	Single Precision FPU Stress Test	Double Precision FPU Stress Test
OpenGL + 3 x Livermore Loops	Input/Output Stress Test	CPU + Main SD + USB + LAN Test

Summary

Previously, I ran my 32 bit and 64 bit benchmarks and stress tests on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in a PDF file. I have also run the 32 bit versions on the Raspberry Pi 4, with results in the Raspberry Pi 4 Benchmarks PDF file and the Raspberry Pi 4 Stress Tests PDF file. This new report contains brief reminders of the benchmarks, with 64 bit results on the Raspberry Pi 4 and Pi 3B+ using the Gentoo operating system. Pi 4/Pi 3B+ comparisons are included, followed by comparisons with 32 bit systems and with later gcc 9 compilations. The range of benchmarking targets was as follows.

Single Core CPU Tests - comprising Whetstone, Dhrystone, Linpack and Livermore Loops Classic Benchmarks.

Single Core Memory Benchmarks - measuring performance using data from caches and RAM. These comprise FFTs with floating point, BusSpeed with integer arithmetic, then MemSpeed and NeonSpeed with both.

Multithreading Benchmarks - Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. Some are multithreaded versions of the previous programs, comprising Whetstone, Dhrystone, Linpack and BusSpeed benchmarks. Then there is MP-RandMem for random access and MP-MFLOPS for high speed floating point. Finally, there are OpenMP versions of the latter and MemSpeed.

Stress Testing Benchmarks - The Raspberry Pi 4 can become excessively hot and might need a cooling fan attachment for efficient operation of certain applications. Stress tests are detailed later; this section covers the benchmarks used to identify which areas to stress. Three programs cover floating point and integer arithmetic, with different processing profiles, accessing caches and RAM. Then there is High Performance Linpack, which can be a killer.

Input/Output Benchmarks - DriveSpeed and LanSpeed are used to measure performance of the main SD card, USB connected storage and networks via WiFi or Ethernet.

Java and OpenGL Benchmarks - A Java Whetstone benchmark is provided, along with one using JavaDraw procedures. The OpenGL benchmark has six test functions of increasing complexity and is run using a range of different window sizes.

Stress Tests - The stress tests mainly have run time options to specify running time, memory used and alternative test functions, and run with continuous displays showing any changes in performance. An extra program measured CPU MHz, temperature and voltage. The main CPU stress tests are mentioned above; the Livermore Loops and OpenGL benchmark programs can also be used, along with one geared up to exercise input/output. Stress test results identify cases of temperature related CPU speed throttling down to 600 MHz, with temperatures up to 85°C, when a cooling fan is not fitted.

Performance Comparisons - More than 1400 comparisons are provided. For the main 1000 plus that apply to CPU speed, the Pi 4 was faster than the Pi 3B+ by an average of 2.62 times, with a minimum of 0.70 and a maximum of 16.8 times, the maximum arising where the Pi 4’s larger L2 cache came into play. The 64 bit compilations also produced average performance gains over those at 32 bits, along with some losses, the three ratios being 1.28, 0.31 and 4.90. The same story applied to gcc 9 versus gcc 6 compilations, at 1.16, 0.37 and 2.93. A key area is maximum floating point speed running the High Performance Linpack Benchmark, with the 4 GB Pi 4 achieving more than 10 GFLOPS, presumably at double precision. This is close to the 13 GFLOPS from my own benchmark, which reached 26 GFLOPS at single precision.

Other Issues

Dual Monitors - handled in different ways. Gentoo provided mirroring or a wide image squashed on one monitor. Raspbian spread wide images across both displays, but had no mirroring option.

C Direct I/O - This worked as expected at 32 bits but in the 64 bit Gentoo version could lead to failure to write or read. Separate write and read programs were produced to enable performance to be measured.

5 GHz WiFi - Connecting at 5 GHz was difficult using Raspbian and seemed to be impossible using Gentoo.

Introduction

The Raspberry Pi 4B uses a quad core ARM Cortex-A72 CPU, with 32 KB L1 cache and shared 1 MB L2 cache. RAM is LPDDR4-3200 with 1, 2 or 4 GB options. Other enhancements are USB 3 connections and gigabit Ethernet. The benchmarks and stress tests covered here were run on 4 GB models.

Previously, I ran my 32 bit and 64 bit benchmarks and stress tests on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate in the Raspberry Pi 3B 32 bit and 64 bit Benchmarks and Stress Tests PDF file. I have also run the 32 bit versions on the Raspberry Pi 4, with results in the Raspberry Pi 4 Benchmarks PDF file and the Raspberry Pi 4 Stress Tests PDF file. This new report contains brief reminders of the benchmarks, with 64 bit results on the Raspberry Pi 4 and Pi 3B+ using the Gentoo operating system. Pi 4/Pi 3B+ comparisons are included, followed by comparisons with 32 bit systems and with later gcc 9 compilations. The programs and source codes for the original 64 bit versions are available for downloading in Raspberry-Pi-4-Benchmarks.tar.gz, and the new gcc 9 compilations in Raspberry-Pi-4-64-Bit-Benchmarks.tar.gz.

New gcc 9 program versions - On producing these, the first step was to change the functions used to identify the hardware, as the existing procedures replicated the information for each core (even four lots were too much). The lscpu command now provides adequate detail, so I use that instead. RPi 3B+ and RPi 4B CPUID results, from lscpu, are now as follows:
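As an illustration of the lscpu approach described above, the following is a minimal sketch of collecting the "Key: value" lines into a lookup table. The benchmarks themselves are C programs, so this Python version is not their code; the function names parse_lscpu and read_lscpu are hypothetical, and read_lscpu assumes that lscpu is available on the PATH.

```python
import subprocess


def parse_lscpu(text):
    """Turn the 'Key: value' lines of lscpu output into a dictionary."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as
        # "32-bit, 64-bit" are kept whole.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def read_lscpu():
    """Run lscpu (assumed to be on the PATH) and parse its output."""
    result = subprocess.run(["lscpu"], capture_output=True,
                            text=True, check=True)
    return parse_lscpu(result.stdout)


# Sample lines copied from the Pi 4B listing in this report.
sample = """Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
CPU(s): 4
Model name: Cortex-A72
CPU max MHz: 1500.0000"""

info = parse_lscpu(sample)
print(info["Model name"], info["CPU max MHz"])
```

Because lscpu reports one set of values for the whole system, a single call avoids the per-core repetition mentioned above.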

Pi 3B+

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1400.0000
CPU min MHz:         600.0000
BogoMIPS:            38.40
Flags:               fp asimd evtstrm crc32 cpuid

Linux pi64 4.19.67-v8-174fcab91765-bis+ #2 SMP PREEMPT Tue Aug 27 13:29:20 GMT 2019 aarch64 GNU/Linux

Pi 4B

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               fp asimd evtstrm crc32 cpuid

Linux pi64 4.19.67-v8-174fcab91765-p4-bis+ #2 SMP PREEMPT Tue Aug 27 13:58:09 GMT 2019 aarch64 GNU/Linux

Whetstone Benchmark - whetstonePi64, whetstonePi64g9, whetstonePiA7

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, lately those identified as COS and EXP. The last three tests (FIXPT, IF and EQUAL) can be over optimised (shown as N/A), but their time does not affect the overall rating much.

For this simple code, at 64 bits, average Pi 4 performance gain, over the Pi 3B+, was 2.12 times, but only around 1.3 times for straightforward floating point calculations. Then, as should be expected, the Pi 4B 32 bit speed was not much slower.

Performance of the gcc 9 compilations for the Pi 4B was effectively the same as the earlier versions. The Pi 3B+ results indicated improvements, but this was due to the EXP type function calculations. The new compilation included a minor tweak for the IF tests, to avoid over optimisation.

System          MHz  MWIPS -----MFLOPS------ -------------MOPS--------------
                               1     2     3   COS   EXP  FIXPT    IF  EQUAL

Pi 3B+         1400   1071   383   403   328  20.9  12.4   1704   N/A   1357
Pi 4B          1500   2269   522   534   398  54.8  39.8   2487   N/A    997
Pi4/3B+        1.07   2.12  1.36  1.32  1.21  2.63  3.21   1.46   N/A   0.73

Pi 4B 32b      1500   1884   516   478   310  54.7  27.1   2498  2247    999
64b/32b        1.00   1.20  1.01  1.12  1.28  1.00  1.47   1.00   N/A   1.00

===========================================================================
gcc 9

Pi 3B+         1400   1482   384   404   329  27.4  28.2   1712  2042   1362
Pi 4B          1500   2330   522   533   398  60.4  40.3   2493  2984    997
Pi4/3B+        1.07   1.57  1.36  1.32  1.21  2.21  1.43   1.46  1.46   0.73

gcc 9/6 Pi 4B  1.00   1.03  1.00  1.00  1.00  1.10  1.01   1.00   N/A   1.00

Dhrystone Benchmark - dhrystonePi64, dhrystonePi64g9, dhrystonePiA7

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so you can’t compare results from different compilers. Ignoring this, results in VAX MIPS, aka DMIPS, and comparisons follow. This benchmark has no significant data arrays that would be suitable for vectorisation.

Using the same 64 bit program, the Pi 4 was more than twice as fast as the Pi 3B+. On the Pi 4, the 64 bit program was 52% faster than the 32 bit compilation.
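The DMIPS/MHz column and the comparison ratios can be reproduced with simple division. The sketch below is only a worked example, not part of the benchmark; it uses the gcc 6 figures from the results table for this benchmark, and the helper name dmips_per_mhz is hypothetical.

```python
def dmips_per_mhz(dmips, mhz):
    """DMIPS normalised by clock speed, as in the DMIPS/MHz column."""
    return dmips / mhz


# (MHz, DMIPS) pairs taken from the Dhrystone results table.
results = {
    "Pi 3B+": (1400, 4028),
    "Pi 4B": (1500, 8176),
    "Pi 4B 32b": (1500, 5366),
}

for system, (mhz, dmips) in results.items():
    print(f"{system:10s} {dmips_per_mhz(dmips, mhz):.2f} DMIPS/MHz")

# Speed ratios: Pi 4B over Pi 3B+, and 64 bit over 32 bit on the Pi 4B.
pi4_over_pi3 = results["Pi 4B"][1] / results["Pi 3B+"][1]
b64_over_b32 = results["Pi 4B"][1] / results["Pi 4B 32b"][1]
print(f"Pi4/3B+ {pi4_over_pi3:.2f}, 64b/32b {b64_over_b32:.2f}")
```

The ratio of about 2.03 is well above the 1.07 clock speed ratio, so most of the gain comes from the Cortex-A72 core itself rather than from the higher MHz.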

The gcc 9 compilations led to no real difference in performance.

                   Compiled  DMIPS
System          MHz   DMIPS   /MHz

Pi 3B+         1400    4028   2.88
Pi 4B          1500    8176   5.45
Pi4/3B+        1.07    2.03

Pi 4B 32b      1500    5366   3.58
64b/32b        1.00    1.52

===============================
gcc 9

Pi 3B+         1400    3896   2.78
Pi 4B          1500    8190   5.46
Pi4/3B+        1.07    2.10

gcc 9/6 Pi 4B  1.00    1.00

Linpack 100 Benchmark MFLOPS - linpackPi64, linpackPiSP64, linpackPiNEONi64, linpackPi64g9, linpackPi64g9SP, linpackPi64NEONig9, linpackPiA7, linpackPiA7SP

The original Linpack benchmark specified the use of double precision (DP) floating point arithmetic, and the code used here is identical to that initially approved for use on old PCs. For the benefit of early ARM computers, the code is also run using single precision (SP) numbers. A version was also produced, replacing the key Daxpy code with NEON Intrinsic Functions, using vector operations, also with single precision calculations.
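For readers unfamiliar with the Daxpy kernel mentioned above, it updates one vector with a scaled copy of another, y = y + a * x. The following is a minimal Python sketch of that operation, not the benchmark's C code. The second function, axpy_by_fours, is a hypothetical illustration of working in groups of four values, on the assumption that a 128 bit NEON register holds four single precision numbers.

```python
def daxpy(n, a, x, y):
    """Scalar form: y[i] += a * x[i]; returns floating point operations done."""
    if n <= 0 or a == 0.0:
        return 0
    for i in range(n):
        y[i] += a * x[i]
    return 2 * n  # one multiply and one add per element


def axpy_by_fours(n, a, x, y):
    """Same result, computed in groups of four as a vector unit might."""
    main = n - n % 4
    for i in range(0, main, 4):
        # One "vector" step: four multiplies and four adds together.
        y[i:i + 4] = [yi + a * xi for xi, yi in zip(x[i:i + 4], y[i:i + 4])]
    for i in range(main, n):  # leftover elements handled one at a time
        y[i] += a * x[i]
    return 2 * n


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_scalar = [10.0] * 5
y_vector = [10.0] * 5
flops = daxpy(5, 2.0, x, y_scalar)
axpy_by_fours(5, 2.0, x, y_vector)
print(y_scalar, flops)
```

Both forms carry out the same arithmetic; the grouped form only shows why a vectorised version can complete several elements per instruction.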

The Pi 3B+ 32 bit results are also provided for clarification. My results were highlighted in the MagPi magazine, on announcement of the Pi 4, particularly the 2 GFLOPS 32 bit NEON speed. See raspberry-pi-4-specs-benchmarks.

At 64 bits, Pi 4/3B+ performance ratios were generally higher than those from the earlier benchmarks. Then, as could be expected, the virtually compiler independent performance using NEON Intrinsic Functions was similar at 32 bits and 64 bits. The main 64 bit gain was with the compiled single precision version, which obtained the same performance as that via NEON Intrinsics.

The new gcc 9 compilations produced the same performance as the older versions, within the variations normally seen on this benchmark.

                    ------- MFLOPS --------
System          MHz      DP      SP SP NEON

Pi 3B+         1400   396.6   562.1   604.2
Pi 4B          1500  1059.9  1977.8  1968.6
Pi4/3B+        1.07    2.67    3.52    3.26

Pi 4B 32b      1500   760.2   921.6  2010.5
64b/32b        1.00    1.39    2.15    0.98

Pi 3B+ 32      1400   210.5   225.2   562.5
Pi4/3B+        1.07    3.61    4.09    3.57

=======================================
gcc 9

Pi 3B+         1400   396.2   571.3   566.7
Pi 4B          1500  1110.6  2052.4  1887.5
Pi4/3B+        1.07    2.80    3.59    3.33

gcc 9/6 Pi 4B  1.00    1.05    1.04    0.96

Livermore Loops Benchmark MFLOPS - liverloopsPi64, liverloopsPi64g9, liverloopsPiA7

This original main benchmark for supercomputers was first introduced in 1970, initially comprising 14 kernels of numerical application, written in Fortran. This was increased to 24 kernels in the 1980s. Following are overall MFLOPS ratings, geometric mean being the official average performance, followed by details from the 24 kernels. Note that these are for double precision calculations.
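The overall ratings combine the kernel speeds in several ways. As a reminder of how the different averages behave, here is a minimal sketch using Python's statistics module (geometric_mean needs Python 3.8 or later). The function name overall_ratings and the three sample speeds are made up for illustration; they are not benchmark results, and this is not the benchmark's own calculation.

```python
from statistics import geometric_mean, harmonic_mean, mean


def overall_ratings(mflops):
    """Summarise a list of MFLOPS values in the style of the ratings table."""
    return {
        "Maximum": max(mflops),
        "Average": mean(mflops),
        "Geomean": geometric_mean(mflops),
        "Harmean": harmonic_mean(mflops),
        "Minimum": min(mflops),
    }


# Made-up speeds, chosen so the geometric mean is easy to check by hand:
# the cube root of 100 * 400 * 1600 is 400.
sample = [100.0, 400.0, 1600.0]
ratings = overall_ratings(sample)
for name, value in ratings.items():
    print(f"{name:8s} {value:8.1f}")
```

For any set of positive values that are not all equal, the harmonic mean is below the geometric mean, which is below the arithmetic average. The geometric mean is less dominated by a few very fast kernels than the plain average.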

All the ratings indicate reasonably significant performance gains of Pi 4 over Pi 3B+ and 64 bits over 32 bits. Results from the 24 kernels indicate some higher gains. Also note the maximum speed of 2.49 GFLOPS (Double Precision).

The speed of the original Raspberry Pi could be rated as 4.5 times faster than the Cray 1 supercomputer (Geomean 11.9) - see my quote in this report. Now, one core of the Raspberry Pi 4B, at 64 bits, produces performance equivalent to 61 Cray 1 supercomputers.

There were some performance differences in gcc 9 results but average speeds were quite similar.

Overall Ratings - MFLOPS

System          MHz  Maximum  Average  Geomean  Harmean  Minimum

Pi 3B+ 64b     1400    737.7    319.4    284.7    250.6     91.6
Pi 4B 64b      1500   2490.5      892    730.3    603.3    212.4
Pi4/3B+        1.07     3.38     2.79     2.57     2.41     2.32

Pi 4B 32b      1500   1800.2    635.1    519.0    416.1    155.3
64b/32b        1.00     1.38     1.40     1.41     1.45     1.37

======================================================
gcc 9

Pi 3B+         1400   1000.7    347.8    308.0    275.2    117.3
Pi 4B          1500   2744.5    962.5    768.2    596.2    132.1
Pi4/3B+        1.07     2.74     2.77     2.49     2.17     1.13

gcc 9/6 Pi 4B  1.00     1.10     1.08     1.05     0.99     0.62

MFLOPS for 24 loops

MFLOPS Of 24 Kernels

Pi 3B+       540   296   539   527   226   175   738   428   484   251   169   245
             127   161   291   258   440   520   333   280   310    93   362   209

Pi 4B       2026   997   987   948   372   739  2033  2491  1980   758   495   875
             220   404   811   710   753  1124   444   397  1061   414   822   283

Pi 4B/      3.75  3.37  1.83  1.80  1.65  4.23  2.76  5.83  4.09  3.02  2.92  3.57
Pi 3B+      1.73  2.51  2.79  2.75  1.71  2.16  1.33  1.42  3.43  4.48  2.27  1.36
Min 1.33  Max 5.83

Pi 4B 32     746   964   988   943   212   538  1169  1800  1032   469   214   186
             159   335   778   623   732  1034   320   350   489   360   749   187

64b/32b     2.72  1.03  1.00  1.00  1.76  1.37  1.74  1.38  1.92  1.62  2.31  4.70
            1.38  1.20  1.04  1.14  1.03  1.09  1.39  1.13  2.17  1.15  1.10  1.51
Min 1.00  Max 4.70

===========================================================================
gcc9

Pi 3B+       565   320   319   535   227   207  1001   581   541   234   171   248
             121   160   293   280   456   547   337   287   367   190   386   209

Pi 4B       2146   989   970   965   390   785  2386  2479  1879   632   500   973
             134   423   814   670   726  1177   450   397  1675   561   818   283

Pi 4B/      3.80  3.09  3.04  1.80  1.72  3.80  2.38  4.27  3.48  2.70  2.93  3.93
Pi 3B+      1.10  2.65  2.78  2.39  1.59  2.15  1.33  1.39  4.56  2.95  2.12  1.35
Min 1.10  Max 4.56

gcc 9/6     1.06  0.99  0.98  1.02  1.05  1.06  1.17  1.00  0.95  0.83  1.01  1.11
Pi 4B       0.61  1.05  1.00  0.94  0.96  1.05  1.01  1.00  1.58  1.35  1.00  1.00
Min 0.61  Max 1.58

Fast Fourier Transforms Benchmarks - fft1-RPi64, fft3c-RPi64, fft1Pi64g9, fft3cPi64g9, fft1-RPi2, fft3c-Rpi2

This is a real application provided by my collaborator at Compuserve Forum. There are two versions. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements are made, at each size, using both single and double precision data, calculating FFT sizes between 1K and 1024K. Results are in milliseconds, those shown here being the average of three measurements.
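The reporting method described above, three timed runs averaged and shown in milliseconds, can be sketched as follows. This is only an assumed outline of such a timing harness, not the benchmark's code; the function average_ms is hypothetical and the workload is a stand-in, not an FFT.

```python
import time


def average_ms(func, repeats=3):
    """Run func `repeats` times and return the mean elapsed milliseconds."""
    total_seconds = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        total_seconds += time.perf_counter() - start
    return total_seconds / repeats * 1000.0


def workload():
    """Stand-in calculation so that there is something to time."""
    return sum(i * i for i in range(1 << 12))


elapsed = average_ms(workload)
print(f"average of 3 runs: {elapsed:.3f} ms")
```

Averaging several runs reduces the effect of one-off delays, which matters most at the small sizes where a single run takes well under a millisecond.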

There were gains all round on the Pi 4, compared with the 3B+, mainly between 3 and 4 times on the optimised version, less so using FFT1, with more data transfer speed dependency.

On the Pi 4, performance from the 32 bit compilation was often similar to that at 64 bits. This is probably due to much of the data being read on a skipped sequential basis, not good for vectorisation.

The Pi 4B/3B+ performance gains were similar using both gcc 9 and gcc 6 compiled programs, but the gcc 9 compilation produced some faster FFT1 speeds, as shown in the Pi 4B gcc 9/6 comparisons.

           Gentoo 64b Pi 3B+

 Size        FFT1              FFT3
    K       SP       DP       SP       DP

    1     0.13     0.15     0.15     0.17
    2     0.29     0.39     0.32     0.38
    4     0.76     1.13     0.79     0.85
    8     1.93     2.66     1.77     1.94
   16     4.02     5.51     4.69     5.14
   32     9.50    25.11     9.51    13.67
   64    42.53   110.21    25.30    32.25
  128   151.08   257.41    57.68    76.71
  256   355.88   589.07   129.47   174.85
  512   819.91  1324.89   297.80   390.74
 1024  1746.23  2943.08   641.50   863.82

           Gentoo 64b Pi 4B                        Pi4/3B+

 Size        FFT1              FFT3              FFT1              FFT3
    K       SP       DP       SP       DP       SP       DP       SP       DP

    1     0.04     0.04     0.04     0.04     3.30     3.62     3.60     4.13
    2     0.08     0.14     0.11     0.09     3.81     2.88     2.82     4.03
    4     0.25     0.38     0.19     0.22     3.05     2.93     4.13     3.86
    8     0.79     1.31     0.46     0.50     2.45     2.04     3.87     3.87
   16     2.15     2.91     1.15     1.09     1.87     1.89     4.07     4.71
   32     5.71     6.76     2.48     3.18     1.66     3.71     3.83     4.30
   64    15.22    51.00     5.43     9.29     2.79     2.16     4.66     3.47
  128    83.47   151.95    16.28    24.75     1.81     1.69     3.54     3.10
  256   231.24   362.64    39.13    57.28     1.54     1.62     3.31     3.05
  512   561.16   765.18    90.20   133.21     1.46     1.73     3.30     2.93
 1024  1250.51  1878.44   213.35   303.39     1.40     1.57     3.01     2.85

           Raspbian 32b Pi 4B                      64B/32b

 Size        FFT1              FFT3              FFT1              FFT3
    K       SP       DP       SP       DP       SP       DP       SP       DP

    1     0.04     0.04     0.06     0.05     0.99     0.96     1.44     1.18
    2     0.08     0.12     0.13     0.11     1.04     0.89     1.14     1.18
    4     0.32     0.37     0.27     0.24     1.28     0.96     1.42     1.09
    8     0.77     0.97     0.58     0.55     0.98     0.74     1.26     1.09
   16     1.69     2.01     1.49     1.35     0.78     0.69     1.29     1.24
   32     4.37     4.89     2.96     3.63     0.77     0.72     1.19     1.14
   64     9.12    26.55     7.46    10.75     0.60     0.52     1.37     1.16
  128    55.52   160.11    17.93    26.03     0.67     1.05     1.10     1.05
  256   305.92   423.06    41.16    55.06     1.32     1.17     1.05     0.96
  512   833.10   854.88    86.93   120.53     1.48     1.12     0.96     0.90
 1024  1617.49  1875.52   190.28   266.60     1.29     1.00     0.89     0.88

===========================================================================

           Gentoo Pi 3B+ gcc 9                     Gentoo Pi 4B gcc 9

 Size        FFT1              FFT3              FFT1              FFT3
    K       SP       DP       SP       DP       SP       DP       SP       DP

    1     0.15     0.16     0.15     0.14     0.04     0.04     0.04     0.04
    2     0.34     0.39     0.31     0.31     0.08     0.13     0.08     0.09
    4     0.89     1.00     0.82     0.79     0.19     0.33     0.19     0.21
    8     2.19     2.70     1.66     1.89     0.71     0.74     0.46     0.46
   16     4.32     5.94     4.88     5.32     1.63     2.06     1.17     1.09
   32    12.47    24.05     9.59    14.82     3.73     4.03     2.44     3.09
   64    66.46   116.11    26.53    36.64     7.92    27.12     5.46     9.06
  128   169.06   268.02    63.65    84.00    43.28   100.75    16.09    22.00
  256   401.86   600.72   141.83   195.69   192.57   254.20    37.08    49.76
  512   853.48  1266.96   329.26   435.23   590.20   651.24    82.54   110.23
 1024  1966.69  2808.07   721.36   981.82  1463.15  1749.37   202.20   251.71

           Pi 4B/3B+                               Pi 4B gcc 9/6

    1     3.53     3.77     3.63     3.78     0.97     0.98     1.02     1.18
    2     4.39     3.05     3.97     3.64     1.00     1.06     1.46     1.08
    4     4.75     3.03     4.23     3.81     1.34     1.16     0.98     1.06
    8     3.06     3.62     3.62     4.10     1.10     1.76     1.00     1.09
   16     2.65     2.89     4.16     4.89     1.32     1.41     0.98     1.00
   32     3.34     5.97     3.93     4.79     1.53     1.68     1.02     1.03
   64     8.39     4.28     4.85     4.04     1.92     1.88     0.99     1.03
  128     3.91     2.66     3.96     3.82     1.93     1.51     1.01     1.12
  256     2.09     2.36     3.82     3.93     1.20     1.43     1.06     1.15
  512     1.45     1.95     3.99     3.95     0.95     1.17     1.09     1.21
 1024     1.34     1.61     3.57     3.90     0.85     1.07     1.06     1.21

BusSpeed Benchmark - busSpdPi64, busspeedPi64g9, busspeedPiA7

This is a read only benchmark with data from caches and RAM. The program reads one word with 32 word address increments before the next, followed by reading after decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds. The two comparison columns are for two word and one word increments.
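The access pattern can be pictured with a short sketch. This is an illustration in Python, not the benchmark's C code: the names strided_read and implied_burst_mb_per_sec are hypothetical, the addition is a stand-in for whatever operation the benchmark applies to each word, and the burst model (each word read pulls in a whole burst, here assumed to be 16 words or 64 bytes) is an assumption used only to show how a bus speed estimate might be formed.

```python
WORD_BYTES = 4  # the benchmark reads 4 byte words


def strided_read(data, increment):
    """Read one word every `increment` words; return (words_read, checksum)."""
    checksum = 0
    words_read = 0
    for index in range(0, len(data), increment):
        checksum += data[index]  # stand-in for the benchmark's read operation
        words_read += 1
    return words_read, checksum


def implied_burst_mb_per_sec(measured_mb_per_sec, increment, burst_words):
    """Assumed model: each word read transfers a full burst of burst_words."""
    return measured_mb_per_sec * min(increment, burst_words)


data = list(range(4096))  # 16 KB of 4 byte words
for increment in (32, 16, 8, 4, 2, 1):  # 1 corresponds to "Read All"
    words_read, _ = strided_read(data, increment)
    print(f"Inc{increment:<2d} reads {words_read:4d} of {len(data)} words")

# Pi 4B, 65536 KB, Inc16 result of 263 MB/second from the table.
estimate = implied_burst_mb_per_sec(263, 16, 16)
print(f"implied transfer rate {estimate} MB/second")
```

Under this assumed model the Inc16 RAM figure implies roughly 4200 MB/second of actual data movement, which is of the same order as the measured Read All speed of 4036 MB/second at that size.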

Most data transfers were 2.0 to 2.5 times faster on the Pi 4, including from RAM, and somewhat higher with L2 cache based data.

The 64 bit version still deals with 32 bit words but transferred data somewhat quicker than the 32 bit program, as shown by the Pi 4 results.

Results from the gcc 9 compilations were virtually the same as those from gcc 6.

  Gentoo 64b Pi 3B+

  BusSpeed armv8 64 Bit Fri Aug 16 12:53:43 2019

  Reading Speed 4 Byte Words in MBytes/Second

 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All

     16   3819   4253   4622   5041   5089   3870
     32   1234   1328   2067   3158   4082   3674
     64    681    704   1325   2208   3350   3602
    128    638    646   1214   2070   3238   3625
    256    592    617   1165   1991   3164   3622
    512    295    309    640    985   2085   2790
   1024    108    120    271    525   1070   1636
   4096     98    123    249    486    881   1840
  16384    121    114    246    480    977   1642
  65536    121    124    248    409    989   1864

  Gentoo 64b Pi 4B                                    Inc2  Rd All
                                                    4B/3B+  4B/3B+

     16   4999   5042   5665   5885   5891   8217    1.16    2.12
     32   1578   2105   3283   4339   5154   7507    1.26    2.04
     64    585    911   1855   3085   5163   7918    1.54    2.20
    128    590    932   1888   3110   5161   7874    1.59    2.17
    256    598    934   1908   3056   5265   7883    1.66    2.18
    512    603    939   1822   3019   5124   7716    2.46    2.77
   1024    319    482   1060   1885   3283   5721    3.07    3.50
   4096    209    253    503   1006   2009   4111    2.28    2.23
  16384    209    261    520   1041   2071   4115    2.12    2.51
  65536    203    263    489   1011   2023   4036    2.05    2.17

  Raspbian 32b Pi 4B                                Rd All
                                                   64b/32b

     16   3836   4049   4467   5885   4641   5858    1.14
     32    761   1473   2594   3216   3960   4780    1.01
     64    409    801   1684   2422   3745   3940    0.95
    128    406    803   1202   1914   3037   5377    1.32
    256    415    700   1165   2481   4789   5137    1.27
    512    392    760   1243   2455   3764   4264    1.38
   1024    230    256    623   1061   2455   3501    1.59
   4096    197    214    454    938   1852   3195    1.80
  16384    138    215    445    897   1724   3210    1.91
  65536    174    215    398    744   1655   3130    1.61

=====================================================================

  Gentoo 64b Pi 3B+ gcc 9

  BusSpeed 64 Bit gcc 9 Thu Sep 26 12:51:15 2019
  BusSpeed armv8 64 Bit Fri Aug 16 12:53:43 2019

  Reading Speed 4 Byte Words in MBytes/Second

 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All

     16   3860   4283   4677   4901   5022   3591
     32   2228   2433   2989   4740   4912   3629
     64    700    697   1299   2200   3310   3348
    128    637    636   1208   2064   3151   3396
    256    597    600   1161   1945   3105   3377
    512    232    194    500    884   1629   2350
   1024    118    131    159    440    692   1682
   4096     91     99    197    463    923   1878
  16384    119    117    200    392    775   1606
  65536    101    105    238    464    873   1876

  Gentoo 64b Pi 4B                                  Rd All  Rd All
                                                    4B/3B+ gcc 9/6

     16   4815   5060   5573   5808   5741   8935    2.49    1.09
     32   1534   1828   2967   4254   4930   7825    2.16    1.04
     64    792   1007   1988   3269   4844   8062    2.41    1.02
    128    730    950   1881   3133   5007   8162    2.40    1.04
    256    733    955   1901   3128   5071   8236    2.44    1.04
    512    737    952   1885   3139   5058   8237    3.51    1.07
   1024    374    539   1047   1884   3177   5537    3.29    0.97
   4096    235    255    497    990   1975   3386    1.80    0.82
  16384    239    263    501    913   1984   3973    2.47    0.97
  65536    239    237    502    995   1984   3971    2.12    0.98

MemSpeed Benchmark MB/Second - memSpdPi64, memSpdPi64g9, memspeedPiA7

The MemSpeed benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of cache and RAM data, normally sized 2 x 4 KB to 2 x 4 MB. Calculations are as shown in the result headings. For the first two double precision tests, speed in MFLOPS can be calculated by dividing MB/second by 8 and 16. For single precision, divide by 4 and 8.
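The conversion from MB/second to MFLOPS can be written out as a small worked example. The divisors are those stated above; the helper name mflops and the lookup table are hypothetical, and the explanation in the comments (two operations per pair of values read for the first test, one for the second) is my reading of the divisors rather than something taken from the benchmark source.

```python
# Bytes read per floating point operation, from the divisors in the text.
# x[m]=x[m]+s*y[m] does a multiply and an add for each x, y pair read;
# x[m]=x[m]+y[m] does a single add for each pair.
DIVISORS = {
    ("double", "x[m]=x[m]+s*y[m]"): 8,
    ("double", "x[m]=x[m]+y[m]"): 16,
    ("single", "x[m]=x[m]+s*y[m]"): 4,
    ("single", "x[m]=x[m]+y[m]"): 8,
}


def mflops(mb_per_second, precision, test):
    """Convert a MemSpeed MB/second result into MFLOPS."""
    return mb_per_second / DIVISORS[(precision, test)]


# Pi 4B 64 bit, 16 KB, double precision: 15719 MB/second.
dp_max = mflops(15719, "double", "x[m]=x[m]+s*y[m]")
# Pi 4B 64 bit gcc 9, 16 KB, single precision: 22116 MB/second.
sp_max = mflops(22116, "single", "x[m]=x[m]+s*y[m]")
print(f"DP {dp_max:.0f} MFLOPS, SP {sp_max:.0f} MFLOPS")
```

These two values agree with the Max MFLOPS entries of 1965 and 5529 shown beneath the corresponding Pi 4B results.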

Results are provided below for the Gentoo 64 bit version on the Pi 3B+ and Pi 4B, and the Raspbian 32 bit variety on the Pi 4B, then a sample of relative performance, covering data from L1 cache, L2 cache and RAM.

Gains greater than the 7% CPU MHz difference were recorded all round by the Pi 4B over the Pi 3B+. The most impressive were on using L2 cache based data and the more intensive floating point calculations. On the Pi 4B, speeds of 64 bit and 32 bit compilations were similar using RAM based data and executing some integer tests, but the 64 bit version was significantly faster on cache based floating point calculations.

Many Pi 4B/3B+ comparisons were similar, but the gcc 9 compilation gave rise to a number of changes, compared with the older version. The latter was slightly faster using some double precision calculations, but gcc 9 produced speed increases between 1.3 and 2.6 times with integers and single precision, the latter providing a maximum of 5.5 GFLOPS compared with 3.5.

     Memory Reading Speed Test armv8 64 Bit by Roy Longbottom

               Start of test Fri Aug 16 12:48:51 2019

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

  Gentoo 64b Pi 3B+

       8   4813   2897   4350   6180   3954   4831   5378   4324   4324
      16   4540   2900   4356   6213   3961   4838   5401   4344   4333
      32   4184   2780   4047   5540   3721   4483   5421   4285   4316
      64   3784   2678   3803   4776   3547   4171   4925   4087   4051
     128   3613   2694   3842   4731   3562   4188   4967   4087   4103
     256   3133   2652   3800   4626   3493   4027   4967   4093   4096
     512    670    882   1630   2913   2422   2718   3101   3141   2780
    1024    587    774   1017   1310   1287   1184   1105   1526   1543
    2048    555    746    917   1143   1131   1043   1071   1007   1128
    4096    545    691   1130   1039   1015   1140   1045   1087    892
    8192    537    795   1139    980   1133   1148    887    854    922

Max MFLOPS  602    725

  Gentoo 64b Pi 4B

       8  15530  13973  12509  15570  14025  15534  11417   9308   7798
      16  15719  14042  12750  15745  14200  15660  11753   9447   7890
      32  14062  12228  11435  14052  12699  12855  11864   9459   7937
      64  12195  11344  10698  12211  11705  12025   8872   8752   7904
     128  12172  11360  10755  12166  11862  11975   8569   8460   7913
     256  12228  11369  10697  12123  11790  12082   8073   8222   7896
     512  11269  10738  10206  10985  11164  11590   8017   6280   6557
    1024   3407   2635   3281   3396   3242   2979   3765   3947   4029
    2048   1525   1832   1838   1851   1607   1838   2819   2790   2770
    4096   1407   1851   1859   1861   1666   1840   2485   2487   2410
    8192   1913   1914   1922   1528   1895   1891   2496   2234   2489

Max MFLOPS 1965   3511

  Comparison 64b Pi4/3B+

       8   3.23   4.82   2.88   2.52   3.55   3.22   2.12   2.15   1.80
      16   3.46   4.84   2.93   2.53   3.58   3.24   2.18   2.17   1.82
     256   3.90   4.29   2.82   2.62   3.38   3.00   1.63   2.01   1.93
     512  16.82  12.17   6.26   3.77   4.61   4.26   2.59   2.00   2.36
    1024   5.80   3.40   3.23   2.59   2.52   2.52   3.41   2.59   2.61
    4096   2.58   2.68   1.65   1.79   1.64   1.61   2.38   2.29   2.70
    8192   3.56   2.41   1.69   1.56   1.67   1.65   2.81   2.62   2.70

  Raspbian 32b Pi 4B

       8   8459   4766  13344   8303   4768  15553   7806   9926   9927
      16   7142   3918   8649   7103   4094   9309   7899  10086  10056
      32   7969   4490  10339   7941   4532  11627   7758  10070  10048
      64   8126   4602   9909   8114   4617  11069   7425   8021   8070
     128   8302   4651   9623   8311   4657  10836   7374   8049   7934
     256   8319   4663   9627   8360   4666  10768   7530   7922   7925
     512   8088   4629   9453   8239   4650  10696   5023   7904   7949
    1024   3581   3113   3618   3577   3150   3675   5358   2431   1560
    2048   1338   1808   1780   1811   1832   1773   2131    950    956
    4096   1881   1880   1852   1879   1664   1336   1988    984   1054
    8192   1890   1901   1884   1729   1319   1367   2252   1018   1021

Max MFLOPS 1057   1192

  Comparison Pi 4B 64b/32b

       8   1.84   2.93   0.94   1.88   2.94   1.00   1.46   0.94   0.79
      16   2.20   3.58   1.47   2.22   3.47   1.68   1.49   0.94   0.78
     256   1.47   2.44   1.11   1.45   2.53   1.12   1.07   1.04   1.00
     512   1.39   2.32   1.08   1.33   2.40   1.08   1.60   0.79   0.82
    1024   0.95   0.85   0.91   0.95   1.03   0.81   0.70   1.62   2.58
    4096   0.75   0.98   1.00   0.99   1.00   1.38   1.25   2.53   2.29
    8192   1.01   1.01   1.02   0.88   1.44   1.38   1.11   2.19   2.44

=====================================================================

  Gentoo 64b Pi 3B+ gcc 9

     Memory Reading Speed Test 64 Bit gcc 9 by Roy Longbottom

               Start of test Thu Sep 26 12:43:02 2019

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   4565   5140   7847   5439   5827   7928   6161   4288   4334
      16   4445   5145   7942   5362   5829   7941   6207   4358   4310
      32   4094   4853   7251   4750   5396   7250   6139   4312   4303
      64   3767   4748   7008   4320   5309   6954   5461   4097   4100
     128   3912   4799   7319   4442   5486   7325   5328   4133   4134
     256   3838   4824   6934   4400   5426   7247   5354   3844   4010
     512   2570   3661   3826   2773   3975   4912   3302   2532   3017
    1024    878   2120   2228    938   2182   2239   1098   1215   1361
    2048    848   1961   2046   1016   2008   2033    758    805    814
    4096    856   1961   2040   1007   1984   2036    839    863    856
    8192    885   1940   1956   1013   1921   1957    844    865    868

Max MFLOPS  571   1286

  Gentoo 64b Pi 4B

       8  13385  21854  24413  13416  23402  24404  11630   9316   9315
      16  13527  22116  24712  13551  23675  24722  11800   9447   9446
      32  12170  19681  21716  12164  21047  21740  11403   9511   9514
      64  11402  19074  20086  11613  20057  20101   9317   8651   8663
     128  11770  20334  21119  12124  21389  21087   8003   8136   8136
     256  11740  20281  21115  12029  21384  21111   8098   8184   8015
     512  11671  20255  20873  12058  21561  21072   7721   6684   6929
    1024   2818   7728   5968   3957   7839   7831   4691   3610   3832
    2048   1884   3436   3743   1880   3578   3281   2597   2717   2696
    4096   1284   2399   2555   1446   3802   3625   2420   2630   2632
    8192   1913   3759   3459   1937   3798   3772   2468   2482   2482

Max MFLOPS 1691   5529

  Comparison 64b Pi4/3B+

       8   2.93   4.25   3.11   2.47   4.02   3.08   1.89   2.17   2.15
      16   3.04   4.30   3.11   2.53   4.06   3.11   1.90   2.17   2.19
     256   3.06   4.20   3.05   2.73   3.94   2.91   1.51   2.13   2.00
     512   4.54   5.53   5.46   4.35   5.42   4.29   2.34   2.64   2.30
    1024   3.21   3.65   2.68   4.22   3.59   3.50   4.27   2.97   2.82
    4096   1.50   1.22   1.25   1.44   1.92   1.78   2.88   3.05   3.07
    8192   2.16   1.94   1.77   1.91   1.98   1.93   2.92   2.87   2.86

  Comparison Pi4B gcc 9/6

       8   0.86   1.56   1.95   0.86   1.67   1.57   1.02   1.00   1.19
      16   0.86   1.57   1.94   0.86   1.67   1.58   1.00   1.00   1.20
     256   0.96   1.78   1.97   0.99   1.81   1.75   1.00   1.00   1.02
     512   1.04   1.89   2.05   1.10   1.93   1.82   0.96   1.06   1.06
    1024   0.83   2.93   1.82   1.17   2.42   2.63   1.25   0.91   0.95
    4096   0.91   1.30   1.37   0.78   2.28   1.97   0.97   1.06   1.09
    8192   1.00   1.96   1.80   1.27   2.00   1.99   0.99   1.11   1.00

NeonSpeed Benchmark MB/Second - NeonSpeedPi64, NeonSpeedPi64g9, NeonSpeed

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. The Norm functions were as generated by the compiler, using NEON directives, and the Neon measurements were obtained using Intrinsic Functions.

Unlike running the same programs on the Pi 3B+, on the Pi 4 compiled codes were no longer slower than those produced via Intrinsic Functions. This led to performance gains of more than five times in some cases.

Except using L1 cache based data, performance was essentially the same using 32 bit and 64 bit benchmarks.

With the gcc 9 compilation, the Pi 4B continued to be significantly faster than the 3B+. Comparing Pi 4B gcc 9 and 6 results, performance was essentially the same when NEON Intrinsic Functions were used, but, as with MemSpeed, normal compilations were faster, averaging around 80% faster, in this case.

  NEON Speed Test armv8 64 Bit V 1.0 Fri Aug 16 2019

      Vector Reading Speed in MBytes/Second

 Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
 KBytes   Norm   Neon   Norm   Neon  Float    Int

  Gentoo 64b Pi 3B+

     16   2715   5110   3945   4826   5426   5598
     32   2528   4326   3569   4191   4596   4661
     64   2491   4153   3494   4068   4407   4429
    128   2537   4228   3583   4120   4461   4473
    256   2526   4265   3614   4140   4480   4514
    512   1917   2830   2545   2579   2896   2964
   1024   1166   1299   1152   1257   1205   1229
   4096   1022   1135   1132   1122   1130   1100
  16384   1080   1026   1131   1016   1064   1094
  65536    996   1120   1061    831   1110   1069

  Gentoo 64b Pi 4B

     16  13982  16424  12505  15239  16065  17193
     32   9554  10753   8981   9657  10970  11025
     64  10658  11833  10274  10722  12110  12134
    128  10657  11887  10337  10680  11994  11973
    256  10709  11970  10360  10774  12003  12083
    512  10147  11441   9733  10209  11264  11532
   1024   2964   3222   2876   3216   3270   2942
   4096   1734   1712   1729   1772   1586   1728
  16384   1592   1922   1818   1923   1926   1667
  65536   1970   1736   1997   1747   1884   2021

  Comparison 64b Pi4/3B+

     16   5.15   3.21   3.17   3.16   2.96   3.07
    256   4.24   2.81   2.87   2.60   2.68   2.68
    512   5.29   4.04   3.82   3.96   3.89   3.89
  65536   1.98   1.55   1.88   2.10   1.70   1.89

  Raspbian 32b Pi 4B

     16   9677  10072   8905   9358   9776  10473
     32  10149  10330   9364   9539   9988  10543
     64  10948  11708  10466  10568  11318  11994
    128  10484  11232  10410  10104  11200  11792
    256  10509  11369  10428  10264  11273  11842
    512  10406  11066  10134  10054  11075  11467
   1024   3069   3202   3159   3166   3204   3203
   4096   1721   1910   1908   1882   1903   1900
  16384   2023   2009   2008   1965   2032   2013
  65536   2073   2074   2074   2073   2068   2064

  Comparison Pi 4B 64b/32b

     16   1.44   1.63   1.40   1.63   1.64   1.64
    256   1.02   1.05   0.99   1.05   1.06   1.02
    512   0.98   1.03   0.96   1.02   1.02   1.01
  65536   0.95   0.84   0.96   0.84   0.91   0.98

=====================================================================

  Gentoo 64b Pi 3B+ gcc 9

  NEON Speed Test 64 Bit gcc 9 Thu Sep 26 12:45:07 2019

      Vector Reading Speed in MBytes/Second

 Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
 KBytes   Norm   Neon   Norm   Neon  Float    Int

     16   5118   5461   6218   5298   6024   6011
     32   4894   4980   5886   4855   5431   5445
     64   4713   4557   5669   4452   4868   4867
    128   4824   4703   5814   4598   4995   4946
    256   4857   4750   5815   4643   5028   4964
    512   3694   2652   4265   2675   3003   3007
   1024   2085   1135   2204   1132   1128   1077
   4096   2008   1021   2070   1033   1056   1036
  16384   1912   1061   2042    958   1065   1047
  65536   1783   1062   1873    769   1080   1081

  Gentoo 64b Pi 4B

     16  21046  14555  16698  13502  14565  16970
     32  17797  12061  14509  10785  12282  13112
     64  19517  10860  15252   9981  10793  11419
    128  19839  10936  15468  10120  11001  11579
    256  20094  10838  15603  10229  10885  11566
    512  20076  10846  15469  10185  10943  11667
   1024   7016   3040   6826   3211   3417   3548
   4096   3945   1940   3599   1950   1768   1937
  16384   3394   2017   3386   1963   1848   2014
  65536   3484   2043   3839   1765   2060   2049

  Comparison 64b Pi4/3B+

     16   4.11   2.67   2.69   2.55   2.42   2.82
     32   3.64   2.42   2.47   2.22   2.26   2.41
     64   4.14   2.38   2.69   2.24   2.22   2.35
    128   4.11   2.33   2.66   2.20   2.20   2.34
    256   4.14   2.28   2.68   2.20   2.16   2.33
    512   5.43   4.09   3.63   3.81   3.64   3.88
   1024   3.36   2.68   3.10   2.84   3.03   3.29
   4096   1.96   1.90   1.74   1.89   1.67   1.87
  16384   1.78   1.90   1.66   2.05   1.74   1.92
  65536   1.95   1.92   2.05   2.30   1.91   1.90

  Comparison Pi4B gcc 9/6

     16   1.51   0.89   1.34   0.89   0.91   0.99
     32   1.86   1.12   1.62   1.12   1.12   1.19
     64   1.83   0.92   1.48   0.93   0.89   0.94
    128   1.86   0.92   1.50   0.95   0.92   0.97
    256   1.88   0.91   1.51   0.95   0.91   0.96
    512   1.98   0.95   1.59   1.00   0.97   1.01
   1024   2.37   0.94   2.37   1.00   1.04   1.21
   4096   2.28   1.13   2.08   1.10   1.11   1.12
  16384   2.13   1.05   1.86   1.02   0.96   1.21
  65536   1.77   1.18   1.92   1.01   1.09   1.01

Average   1.95   1.00   1.73   1.00   0.99   1.06

MultiThreading Benchmarks

Most of the multithreading benchmarks execute the same calculations using 1, 2, 4 and 8 threads. One of them, MP-MFLOPS, is available in two different versions, using standard compiled “C” code for single and double precision arithmetic. A further version uses NEON intrinsic functions. Another variety uses OpenMP procedures for automatic parallelism.

MP-Whetstone Benchmark - MP-WhetsPi64, MP-WhetsPi64g9, MP-WHETSPiA7

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish. Mutex functions are used to avoid update conflicts, by only allowing one thread at a time to access common data. Performance was generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a different compiler being used. None of the test functions are suitable for SIMD operation, so the simpler scalar instructions are used. Overall seconds indicates MP efficiency: run time stayed near 5 seconds up to four threads, then doubled with eight threads on the four cores.
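The arrangement described above can be sketched as follows. This is a minimal illustration, not the benchmark source: the names `worker`, `run_threads` and `shared_passes` are hypothetical, and a simple summation stands in for the eight Whetstone test functions. Each thread works on its own dedicated variable, updates the common counter under a mutex, and the elapsed time runs until the last thread has been joined.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

#define MAX_THREADS 8

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_passes;            /* common data, updated under the mutex */

struct job {
    long passes;                      /* work units for this thread */
    double private_sum;               /* dedicated (per thread) variable */
};

static void *worker(void *arg)
{
    struct job *j = arg;
    for (long i = 0; i < j->passes; i++)
        j->private_sum += 1.0 / (double)(i + 1);   /* stand-in for a test function */

    pthread_mutex_lock(&lock);        /* one thread at a time updates common data */
    shared_passes += j->passes;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Runs nthreads workers and returns the seconds taken until the LAST one ends. */
double run_threads(int nthreads, long passes, long *total_passes)
{
    pthread_t tid[MAX_THREADS];
    struct job jobs[MAX_THREADS];
    struct timespec t0, t1;

    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    shared_passes = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < nthreads; t++) {
        jobs[t].passes = passes;
        jobs[t].private_sum = 0.0;
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);   /* returns only when every thread is done */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    *total_passes = shared_passes;
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1.0e9;
}
```

With a fixed number of passes per thread, the total work grows with the thread count, so a constant elapsed time up to the number of cores corresponds to the proportional scaling reported in the results.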

As with the single core version, the average Pi 4 MWIPS performance gain over the Pi 3B+ was just over 2 times. Pi 4B 64 bit and 32 bit speeds were closer to each other, this time with the latter being somewhat faster on some floating point calculations.

Most of the important Pi 4B gcc 9 results were virtually the same as those from the earlier gcc 6 compilations but the 3B+ COS and EXP speeds were somewhat slower.

           MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
 Threads              1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

 Gentoo RPi 3B+ 64 Bit

       1   1152    383    383    328   23.2   13.0    N/A   2721   1365
       2   2312    767    767    657   46.5   26.0    N/A   5461   2738
       4   4580   1506   1526   1304   92.0   51.6    N/A  10777   5449
       8   4788   1815   1961   1382   95.0   53.3    N/A  13827   5811

 Overall Seconds   4.96 1T,   4.95 2T,   5.05 4T,  10.07 8T

 Gentoo RPi 4B 64 Bit

       1   2395    536    538    397   60.8   39.0    N/A   4483    997
       2   4784   1062   1079    794  121.2   77.9    N/A   8932   1990
       4   9476   2125   2080   1568  240.8  155.3    N/A  17718   3962
       8   9834   2631   2744   1630  243.6  160.1    N/A  22265   4053

 Overall Seconds   4.99 1T,   5.01 2T,   5.12 4T,  10.17 8T

 Comparison 64b Pi4/3B+

       1   2.08   1.40   1.41   1.21   2.62   3.00    N/A   1.65   0.73
       2   2.07   1.39   1.41   1.21   2.61   3.00    N/A   1.64   0.73
       4   2.07   1.41   1.36   1.20   2.62   3.01    N/A   1.64   0.73
       8   2.05   1.45   1.40   1.18   2.56   3.00    N/A   1.61   0.70

 Raspbian RPi 4B 32 Bit

       1   2059    673    680    311   55.6   33.1   7462   2245    995
       2   4117   1342   1391    624  110.7   65.9  14887   4467   1986
       4   7910   2652   2722   1180  208.5  132.6  29291   8952   3832
       8   8652   3057   2971   1268  233.2  149.6  38368  11923   3942

 Overall Seconds   4.99 1T,   5.01 2T,   5.29 4T,  10.71 8T

 Comparison Pi 4B 64b/32b

       1   1.16   0.80   0.79   1.28   1.09   1.18    N/A   2.00   1.00
       2   1.16   0.79   0.78   1.27   1.09   1.18    N/A   2.00   1.00
       4   1.20   0.80   0.76   1.33   1.15   1.17    N/A   1.98   1.03
       8   1.14   0.86   0.92   1.28   1.04   1.07    N/A   1.87   1.03

 ===========================================================================

           MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
 Threads              1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

 Gentoo 64b Pi 3B+ gcc 9

       1   1500    381    384    328   27.2   28.1   5098   2049   1368
       2   3001    766    762    656   54.5   56.5  10130   4102   2737
       4   5940   1488   1528   1304  107.8  111.5  19741   7665   5423
       8   5987   1528   1666   1267  107.4  117.9  25862   9518   5666

 Overall Seconds   4.98 1T,   4.98 2T,   5.16 4T,  10.30 8T

 Gentoo 64b Pi 4B gcc 9

       1   2364    530    532    395   60.6   40.0   7426   2242    996
       2   4724   1060   1052    789  121.0   80.4  14853   4476   1994
       4   9413   2103   2112   1579  241.0  159.5  29161   8638   3968
       8   9848   2671   2453   1644  247.0  168.1  37385  11636   4108

 Overall Seconds   5.00 1T,   5.01 2T,   5.07 4T,  10.20 8T

 Comparison 64b Pi4/3B+

       1   1.58   1.39   1.38   1.20   2.23   1.42   1.46   1.09   0.73
       2   1.57   1.38   1.38   1.20   2.22   1.42   1.47   1.09   0.73
       4   1.58   1.41   1.38   1.21   2.24   1.43   1.48   1.13   0.73
       8   1.64   1.75   1.47   1.30   2.30   1.43   1.45   1.22   0.72

 Comparison Pi4B gcc 9/6

       1   0.99   0.99   0.99   1.00   1.00   1.03    N/A   0.50   1.00
       2   0.99   1.00   0.97   0.99   1.00   1.03    N/A   0.50   1.00
       4   0.99   0.99   1.02   1.01   1.00   1.03    N/A   0.49   1.00
       8   1.00   1.02   0.89   1.01   1.01   1.05    N/A   0.52   1.01


MP-Dhrystone Benchmark - MP-DHRYPi64, MP-DHRYPi64g9, MP-DHRYPiA7

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance with not much gain using multiple cores.

The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those for the Pi 3B+. The single thread Pi 4B 64 bit/32 bit speed ratio was also similar to that during the single core tests.

As indicated for the earlier gcc 6 results, this benchmark produces inconsistent performance and does not provide a good example of multithreading but, in this case, gcc 6 and gcc 9 results were similar, with a reasonably high Pi 4B/3B+ performance gain.

 Example Results Log File

 MP-Dhrystone Benchmark 64 Bit gcc 9 Thu Sep 26 11:46:22 2019

                    Using 1, 2, 4 and 8 Threads

 Threads                         1         2         4         8
 Seconds                      0.55      1.19      2.31      4.57
 Dhrystones per Second    14579147  13499628  13827400  14017880
 VAX MIPS rating              8298      7683      7870      7978

 Internal pass count correct all threads

 End of test Thu Sep 26 11:46:31 2019

 #############################################################

 Comparisons

 Threads                         1         2         4         8

 VAX MIPS rating Pi 3B+ 6     4207      6804      7401      7415
 VAX MIPS rating Pi 4B 64     8880      7828      8303      8314
 VAX MIPS rating Pi 4B 32     5539      5739      6735      7232

 Pi 4B/3B+ 64 bits            2.11      1.15      1.12      1.12
 Pi 4B 64 bits/32 bits        1.60      1.36      1.23      1.15

 =======================================================

 Gentoo gcc 9

 VAX MIPS rating Pi 3B+ 6     4062      6504      8242      8343
 VAX MIPS rating Pi 4B 64     8298      7683      7870      7978

 Pi 4B/3B+ 64 bits            2.04      1.18      0.95      0.96
 Pi 4B gcc 9/6                0.93      0.98      0.95      0.96
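The relationship between the two rating lines in the log can be checked with a short calculation. The sketch below assumes the conventional Dhrystone reference of 1757 Dhrystones per second for a VAX 11/780 (1 MIPS), which reproduces the logged values; the function names are illustrative only.

```c
/* Dhrystones per second of the VAX 11/780 reference machine (rated 1 MIPS).
   Assumed conventional divisor; it reproduces the ratings in the log. */
#define VAX_DHRYSTONES_PER_SECOND 1757.0

/* Converts a Dhrystones per second score to a VAX MIPS rating. */
double vax_mips(double dhrystones_per_second)
{
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND;
}

/* Ratio of two ratings, as used in the comparison rows. */
double speed_ratio(double rating_a, double rating_b)
{
    return rating_a / rating_b;
}
```

For the single thread gcc 9 result, `vax_mips(14579147.0)` gives about 8297.8, logged as 8298, and `speed_ratio(8298.0, 4062.0)` gives about 2.04, the Pi 4B/3B+ figure shown.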


MP SP NEON Linpack Benchmark - linpackMPNeonPi64, linpackMPNeonPi64g9, linpackNeonMP

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running unthreaded. The main reasons appear to be updating data in RAM to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.
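To illustrate where the start/stop overhead comes from, the following is a minimal sketch of a multithreaded axpy step (y = y + a*x) in single precision. It is not the benchmark code, which uses NEON intrinsics; `axpy_threaded` and its structure are hypothetical. Threads are created and joined on every call, so when the routine is called many times with short vectors, as in a Linpack solve at N = 100, thread management can easily cost more than the arithmetic.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 4

struct axpy_job {
    size_t begin, end;      /* half-open range [begin, end) for this thread */
    float a;
    const float *x;
    float *y;
};

static void *axpy_worker(void *arg)
{
    struct axpy_job *j = arg;
    for (size_t i = j->begin; i < j->end; i++)
        j->y[i] += j->a * j->x[i];
    return NULL;
}

/* y = y + a*x over n elements; nthreads <= 0 runs inline without threads. */
void axpy_threaded(size_t n, float a, const float *x, float *y, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct axpy_job jobs[MAX_THREADS];

    if (nthreads <= 0) {                       /* the "None" (unthreaded) case */
        struct axpy_job all = { 0, n, a, x, y };
        axpy_worker(&all);
        return;
    }
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    for (int t = 0; t < nthreads; t++) {       /* thread start cost on each call */
        jobs[t].begin = n * (size_t)t / (size_t)nthreads;
        jobs[t].end   = n * (size_t)(t + 1) / (size_t)nthreads;
        jobs[t].a = a;
        jobs[t].x = x;
        jobs[t].y = y;
        pthread_create(&tid[t], NULL, axpy_worker, &jobs[t]);
    }
    for (int t = 0; t < nthreads; t++)         /* thread stop cost on each call */
        pthread_join(tid[t], NULL);
}
```

Because each thread writes a separate section of y, the numeric results are the same whatever the thread count.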

This benchmark uses the same NEON Intrinsic Functions as the single core program, with similar speeds at N = 100, without the threading overheads, but decreasing with larger data sizes, involving RAM accesses.

The full logged output is shown for the first entry, to demonstrate error checking facilities. The sumchecks were identical from the Pi 3B+ and Pi 4B at Gentoo 64 bits, but those from the Raspbian 32 bit test were different, as shown below. Ignoring the slow threaded results, performance ratios of CPU speed limited tests were similar to the single core version.

At least for the unthreaded tests, the gcc 9 results for the Pi 4B were mainly within 10% of those from gcc 6.

 Example Results Log File

 Linpack Single Precision MultiThreaded Benchmark
  64 Bit NEON Intrinsics, Fri Aug 23 00:45:54 2019

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     642.56    66.69    66.05    65.54
 N  500     479.48   274.36   274.85   269.07
 N 1000     363.77   316.17   310.37   316.71

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            1.97            5.40           13.51
 RE  4.69621336e-05  6.44138840e-04  3.22485110e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -1.31130219e-05  5.79357147e-05 -3.08930874e-04
 XN -1.30534172e-05  3.51667404e-05  1.90019608e-04

 Thread
 0 - 4 Same Results    Same Results    Same Results

 ####################################################

 Comparisons

 Threads      None        1        2        4

 Gentoo Pi 3B+ 64 Bits

 N  100     642.56    66.69    66.05    65.54
 N  500     479.48   274.36   274.85   269.07
 N 1000     363.77   316.17   310.37   316.71

 Gentoo 64b Pi 4B

 N  100     2252.7     97.3     97.4     97.4
 N  500     1628.2    665.2    646.6    674.4
 N 1000      399.9    406.8    405.8    399.5

 Comparison 64b Pi4/3B+

 N  100       3.51     1.46     1.48     1.49
 N  500       3.40     2.42     2.35     2.51
 N 1000       1.10     1.29     1.31     1.26

 Raspbian 32b Pi 4B

 N  100     1921.5    108.7    101.9    102.5
 N  500     1548.8    530.2    714.4    733.1
 N 1000      399.9    378.1    364.8    398.2

 Comparison Pi 4B 64b/32b

 N  100       1.17     0.89     0.96     0.95
 N  500       1.05     1.25     0.91     0.92
 N 1000       1.00     1.08     1.11     1.00

 ========================================

 gcc 9 MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 Gentoo 64b Pi 3B+ gcc 9

 N  100      641.6     63.0     62.3     61.9
 N  500      326.6    229.3    222.6    227.0
 N 1000      320.1    275.0    274.3    275.2

 Gentoo 64b Pi 4B gcc 9

 N  100     2076.2     98.6     96.6     96.2
 N  500     1327.1    631.9    632.5    639.2
 N 1000      394.6    375.3    382.3    375.7

 Comparison 64b Pi4/3B+

 N  100       3.24     1.57     1.55     1.55
 N  500       4.06     2.76     2.84     2.82
 N 1000       1.23     1.36     1.39     1.37

 Comparison Pi4B gcc 9/6

 N  100       0.92     1.01     0.99     0.99
 N  500       0.82     0.95     0.98     0.95
 N 1000       0.99     0.92     0.94     0.94

 ####################################################

 32 bit numeric results

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04


MP BusSpeed Benchmark - MP-BusSpd2Pi64, MP-BusSpd2Pi64g9, MP-BusSpeedPiA7
(read only)

Each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points, with this version. See single processor BusSpeed details regarding burst reading that can indicate significant differences.

Comparisons are provided for RdAll, at 1, 2 and 4 threads. Pi 4B/3B+ performance ratios were similar to those for the single core tests. There was an exception with two threads, on the Pi 4, using RAM at 64 bits, probably due to caching effects and not seen on subsequent repeated tests.

Particularly note that performance was significantly better using the 32 bit Raspbian compiler. Below are examples of disassembly, showing that the 64 bit Pi 4 code employed scalar operation, using 32 bit w registers, with the 32 bit version benefiting from using 128 bit q registers, for Single Instruction Multiple Data (SIMD) operation. Compile options are included below, where alternatives were also tried on the Pi 4B, but failed to implement SIMD operation.

At least most of the gcc 9 compiled RdAll tests were significantly faster than those produced by gcc 6.

 MP-BusSpd armv8 64 Bit Fri Aug 23 00:47:43 2019

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

 Gentoo 64b Pi 3B+

     KB     Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

   12.3 1T   3138   2822   3044   2383   1708   1737
        2T   5354   4865   5647   4519   3303   3361
        4T   7922   7504   9717   6794   6216   6597
        8T   5125   4159   6987   6696   5350   5195
  122.9 1T    640    666   1191   1864   1627   1712
        2T   1008   1018   1926   3496   3268   3387
        4T    962   1042   2157   4259   6427   4372
        8T   1031   1047   2147   3952   6317   6514
  12288 1T    124    114    260    527   1016   1363
        2T    137    138    275    487    946   2182
        4T    105    118    240    409    975   2158
        8T    108    117    236    504   1077   2051

 Gentoo 64b Pi 4B                                      RdAll
                                                      4B/3B+

   12.3 1T   4864   4879   5378   4379   4115   4221    2.43
        2T   8159   6924   9179   8006   7689   7837    2.33
        4T  12677  11531  14850  12554  13807  14794    2.24
        8T   7398   6927  10881  11675  11497  13075    2.52
  122.9 1T    665    926   1869   2714   3557   4152    2.43
        2T    610    696   1549   4898   7188   8184    2.42
        4T    476    865   1885   4107   8058  14617    3.34
        8T    474    883   1848   3919   7939  13633    2.09
  12288 1T    202    210    514   1044   2033   3616    2.65
        2T    258    425    853   1551   3693   6228    2.85
        4T    217    346    497   1024   2181   3789    1.76
        8T    220    275    540   1030   1937   3577    1.74

 Raspbian 32b Pi 4B                                    RdAll
                                                     64b/32b

   12.3 1T   5263   5637   5809   5894   5936  13445    0.31
        2T   9412  10020  10567  11454  11604  24980    0.31
        4T  16282  15577  16418  21222  20000  45530    0.32
        8T  11600  13285  16070  18579  20593  36837    0.35
  122.9 1T    739    956   1888   3153   5008   9527    0.44
        2T    629   1158   1568   5058   9509  16489    0.50
        4T    600   1093   2134   4527   8732  16816    0.87
        8T    593   1104   2121   4382   8629  17158    0.79
  12288 1T    238    258    518   1005   2001   4029    0.90
        2T    278    228    453   1690   1826   3628    1.72
        4T    269    257    740   1019   1790   4145    0.91
        8T    233    292    532    926   2186   3581    1.00

 ===================================================================

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

     KB     Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 Gentoo 64b Pi 3B+ gcc 9

   12.3 1T   3453   4178   4428   3543   3584   2335
        2T   5594   7732   8086   6856   6924   4654
        4T   9065  12522  13157  12942  13415   9209
        8T   6661  10770  13266  11955  12573   8478
  122.9 1T    640    646   1197   1970   2909   2272
        2T   1030   1012   2006   3671   5784   4528
        4T   1001   1041   2145   4266   8337   6729
        8T   1043   1061   2123   4005   8133   8572
  12288 1T    114    104    241    444    932   1352
        2T    126    122    253    370   1005   1997
        4T    104    138    197    471   1133   1745
        8T    102     96    231    466    796   1893

 Gentoo 64b Pi 4B gcc 9                                RdAll   Pi 4B
                                                      4B/3B+ gcc 9/6

   12.3 1T   5573   5750   5057   5646   5800   9129    3.91    2.16
        2T   7191   9038  10035  11020  11125  17757    3.82    2.27
        4T   7023  12144  14591  17681  20490  29184    3.17    1.97
        8T   7553  11837  12565  15640  18546  30517    3.60    2.33
  122.9 1T    672    922   1864   3092   4744   7741    3.41    1.86
        2T    577    947   2100   3051   8780  14975    3.31    1.83
        4T    519    983   1884   3980   8701  18139    2.70    1.24
        8T    515    951   1913   4181   8797  16899    1.97    1.24
  12288 1T    230    261    499   1016   1678   3873    2.86    1.07
        2T    276    225    418    925   1929   5629    2.82    0.90
        4T    258    267    579    802   1749   5758    3.30    1.52
        8T    214    213    538   1069   2145   4680    2.47    1.31


MP BusSpeed Disassembly

The following shows part of the source code used to read all data, the compile commands used and a disassembly of part of the long (100+) sequences of instructions used for the 32 bit and 64 bit gcc 9 benchmarks. A disassembly of the 64 bit gcc 6 version was not available.


        Source Code 64 AND instructions in main loop
  
   for (i=start; i<end; i=i+64)
   {
       andsum1[t] = andsum1[t] 
           & array[i   ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
           & array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
    To
           & array[i+56] & array[i+57] & array[i+58] & array[i+59]
           & array[i+60] & array[i+61] & array[i+62] & array[i+63];
   }
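For readers who want to try the read-all loop in isolation, a cut-down version is sketched below. It is an illustration only, assuming 32 bit unsigned words: `and_all` is a hypothetical name, and four words are ANDed per pass instead of the 64 in the benchmark source. Whether a compiler turns the loop into SIMD instructions can then be checked by disassembling the result, as was done for the listings that follow.

```c
#include <stddef.h>

/* ANDs every word in array[start..end) and returns the result.
   end - start must be a multiple of 4. The benchmark source unrolls
   64 words per pass; 4 are used here to keep the sketch short. */
unsigned int and_all(const unsigned int *array, size_t start, size_t end)
{
    unsigned int andsum = 0xFFFFFFFFu;
    for (size_t i = start; i < end; i += 4) {
        andsum = andsum
               & array[i    ] & array[i + 1]
               & array[i + 2] & array[i + 3];
    }
    return andsum;
}
```

Any word with cleared bits anywhere in the range shows up in the returned value, so the result doubles as a check that all of the data was read.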


Pi 32 Bit Raspbian Compile

gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -mcpu=cortex-a7
           -mfloat-abi=hard -mfpu=neon-vfpv4 -o MP-BusSpd2PiA7

Pi 64 Bit Gentoo Compile

gcc mpbusspd2.c -lpthread -lm -lrt -O3 -march=armv8-a -no-pie -o MP-BusSpd2Pi64g9

Parameters also tried

-march=armv8-a+crc -mtune=cortex-a72 -ftree-vectorize -O2 -pipe 
-fomit-frame-pointer

Pi 32 Bit Disassembly          Pi 64 Bit Disassembly

vld1.32 {q6}, [lr]              ldp     w30, w17, [x0, 52]
vld1.32 {q7}, [r6]              and     w18, w18, w30
vand    q10, q10, q6            and     w1, w1, w18
vld1.32 {q6}, [r0]              ldp     w18, w30, [x0, 60]
vand    q9, q9, q7              and     w17, w17, w18
vand    q12, q12, q6            and     w1, w1, w17
vld1.32 {q7}, [ip]              ldp     w17, w18, [x0, 68]
vld1.32 {q6}, [r7]              and     w30, w30, w17
add     r1, r3, #96             and     w1, w1, w30
add     r6, r3, #144            ldp     w30, w17, [x0, 76]
vand    q11, q11, q7            and     w18, w18, w30
vand    q14, q14, q6            and     w1, w1, w18
vld1.32 {q7}, [r1]              ldp     w18, w30, [x0, 84]
vld1.32 {q6}, [r6]              and     w17, w17, w18


MP RandMem Benchmark - MP-RandMemPi64, MP-RandMemPi64g9, MP-RandMemPiA7

This benchmark potentially reads and writes all data, in sections covering caches and RAM, each thread starting at different addresses. Random access can select any address after that. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.
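The difference between the serial and random tests can be pictured with an index array that decides the order of access. The sketch below is an assumption about the general technique, not the benchmark source: all function names are hypothetical, and a small linear congruential generator stands in for whatever random sequence the benchmark uses. The same read and read/write loops are run with either index pattern, so only the access order changes.

```c
#include <stddef.h>

/* Serial pattern: positions 0, 1, 2, ... n-1. */
void fill_serial(unsigned int *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = (unsigned int)i;
}

/* Random pattern: any position in 0..n-1 may be selected (and repeated).
   The LCG constants are a common textbook choice, used here as a stand-in. */
void fill_random(unsigned int *idx, size_t n, unsigned int seed)
{
    unsigned int state = seed;
    for (size_t i = 0; i < n; i++) {
        state = state * 1664525u + 1013904223u;
        idx[i] = (unsigned int)((state >> 8) % n);
    }
}

/* Read test: AND together the words selected by idx. */
unsigned int read_indexed(const unsigned int *data,
                          const unsigned int *idx, size_t n)
{
    unsigned int andsum = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++)
        andsum &= data[idx[i]];
    return andsum;
}

/* Read/write test: update each selected word in place. */
void rdwr_indexed(unsigned int *data, const unsigned int *idx,
                  size_t n, unsigned int toggle)
{
    for (size_t i = 0; i < n; i++)
        data[idx[i]] ^= toggle;
}
```

With serial indices, consecutive words share the same burst from memory; with random indices most of each burst is wasted, which is consistent with the much lower random speeds once data no longer fits in cache.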

The Pi 4B provided variable gains over the Pi 3B+ at 64 bits, but smaller gains from 64 bits over 32 bits on the Pi 4B itself.

Some moderate Pi4/3B+ performance gains were produced using gcc 9, but the older version was, possibly, a little faster.

           MB/Second Using 1, 2, 4 and 8 Threads

            Serial Serial Random Random   Serial Serial Random Random
 KB+Thread    Read   RdWr   Read   RdWr     Read   RdWr   Read   RdWr

            Gentoo Pi 4B 64 Bits

   12.3 1T    5922   7871   5892   7857
        2T   11856   7882  11902   7923
        4T   22964   7821  22276   7832
        8T   23225   7751  22082   7717
  122.9 1T    5827   7276   2052   1921
        2T   10965   7258   1754   1924
        4T   10969   7232   1848   1929
        8T   10896   7158   1834   1909
  12288 1T    3879   1052    188    170
        2T    4848    935    218    168
        4T    4684    943    332    170
        8T    3982   1049    340    171

            Gentoo Pi 3B+ 64 Bits         Raspbian Pi 4B 32 Bits

   12.3 1T    4901   3587   4912   3585     5860   7905   5927   7657
        2T    8749   3564   8719   3556    11747   7908  11182   7746
        4T   17108   3504  17160   3505    21416   7626  17382   7731
        8T   16885   3475  16650   3485    20649   7528  20431   7378
  122.9 1T    3921   3339   1010    974     5479   7269   1826   1923
        2T    7360   3350   1814    972    10355   6964   1667   1920
        4T   12199   3313   2281    969     9808   7177   1715   1908
        8T   12089   3313   2279    968    11677   7058   1697   1919
  12288 1T    2024    828     83     67     3438   1271    179    152
        2T    2169    820    142     67     4176   1204    213    167
        4T    2178    818    154     67     4227   1117    337    161
        8T    2219    821    161     67     3479   1093    287    168

            4 Thread                      4 Thread
            Pi 4B/3B+ 64 Bits             Pi 4B 64 bits/32 bits

   12.3 4T    1.34   2.23   1.30   2.23     1.07   1.03   1.28   1.01
  122.9 4T    0.90   2.18   0.81   1.99     1.12   1.01   1.08   1.01
  12288 4T    2.15   1.15   2.16   2.54     1.11   0.84   0.99   1.06

 ===================================================================

           MB/Second Using 1, 2, 4 and 8 Threads

            Serial Serial Random Random   Serial Serial Random Random
 KB+Thread    Read   RdWr   Read   RdWr     Read   RdWr   Read   RdWr

            Gentoo 64b Pi 3B+ gcc 9       Gentoo 64b Pi 4B gcc 9

   12.3 1T    4886   3581   4878   3590     5737   6884   5763   7537
        2T    8723   3550   8724   3550    11536   7592  10238   6898
        4T   16836   3498  17531   3509    21084   7575  15160   7390
        8T   15777   3459  16783   3466    20089   7339  15311   7200
  122.9 1T    3913   3346    987    972     5739   7231   2006   1906
        2T    7285   3339   1753    964    10662   7217   1742   1896
        4T   12354   3344   2350    972    10376   6741   1815   1812
        8T   11841   3333   2300    962    10298   6937   1823   1848
  12288 1T    1795    761     69     60     3477    905    181    162
        2T    1915    735    118     60     3750    794    215    164
        4T    2452    730    128     59     4669    968    259    162
        8T    1805    755    137     60     3419    981    301    157

            4 Thread                      4 Thread
            Comparison 64b Pi4/3B+        Comparison Pi4B gcc 9/6

   12.3 4T    1.25   2.17   0.86   2.11     0.92   0.97   0.68   0.94
  122.9 4T    0.84   2.02   0.77   1.86     0.95   0.93   0.98   0.94
  12288 4T    1.90   1.33   2.02   2.75     1.00   1.03   0.78   0.95


MP-MFLOPS Benchmarks - MP-MFLOPSPi64, MP-MFLOPSPi64g9, MP-MFLOPSPi64DP,
MP-MFLOPSPi64DPg9, MP-NeonMFLOPS64, MP-NeonMFLOPS64g9, MP-MFLOPSPiA7

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. Versions are available using single precision and double precision data, plus one with NEON intrinsic functions. The numeric results are converted into a simple sumcheck, that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.
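A minimal sketch of this arrangement is given below, for the two operations per word case. It is not the benchmark source: the names, the exact form `(x[i] + a) * b` and the averaging sumcheck are assumptions made for illustration. Each thread applies the same calculation to its own segment of the array, so the sumcheck should not depend on the number of threads.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8

struct segment {
    float *x;             /* shared data array */
    size_t begin, end;    /* this thread's half-open range [begin, end) */
    float a, b;
    int passes;
};

static void *two_ops_worker(void *arg)
{
    struct segment *s = arg;
    for (int p = 0; p < s->passes; p++)
        for (size_t i = s->begin; i < s->end; i++)
            s->x[i] = (s->x[i] + s->a) * s->b;   /* one add and one multiply */
    return NULL;
}

/* Runs the two operations per word test over n words using nthreads threads. */
void run_two_ops(float *x, size_t n, float a, float b, int passes, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    struct segment seg[MAX_THREADS];

    if (nthreads < 1) nthreads = 1;
    if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;
    for (int t = 0; t < nthreads; t++) {
        seg[t].x = x;
        seg[t].begin = n * (size_t)t / (size_t)nthreads;
        seg[t].end   = n * (size_t)(t + 1) / (size_t)nthreads;
        seg[t].a = a;
        seg[t].b = b;
        seg[t].passes = passes;
        pthread_create(&tid[t], NULL, two_ops_worker, &seg[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}

/* Simple sumcheck: the average of all results (assumed form). */
double sumcheck(const float *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return n ? sum / (double)n : 0.0;
}
```

Starting from zero with a = 1.0 and b = 0.5, three passes give 0.875 in every word, whether one thread or four are used, and further passes move the values closer to 1.0.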

There can be wide variations in speeds, affected by the short running times and by such factors as variations in cached data. In order to help in interpreting results, comparisons are provided of results using one and four threads. These indicate that, with cache based data, the Pi 4B was more than 3.5 times faster than the Pi 3B+ at two operations per word, but less so at 32 operations.

The 64 bit and 32 bit comparisons were, no doubt, influenced by the particular compiler version used, and this is reflected in the main disassembled code shown below, for 32 operations per word. The 32 bit version compile included -mfpu=neon-vfpv4, but NEON was not implemented, resulting in scalar operation, using single word s registers. I have another version with compile including -funsafe-math-optimizations, that compiles NEON instructions, with similar performance to the 64 bit version, but more sumcheck differences.

The benchmark compiled to use NEON Intrinsic Functions does not include any that specify fused multiply and add operations, reducing maximum possible speed. The 64 bit compiler converts the functions to include fused instructions, providing the fastest speeds.

The main compiler independent feature that provides a clear advantage to 64 bit operation is that the CPU, at 32 bits, does not support double precision SIMD (NEON) operation, with single word d registers being compiled. On the other hand, performance gain does not appear to meet the potential. This suggests that there are other limiting factors - see disassembly below.

It is difficult to judge relative gcc 9 and 6 performance, probably due to the short running times. The former appears to be more than 10% faster, running the single precision tests. For these, the disassembled instructions look the same as those shown below, but in a different sequence.

 Single Precision

 MP-MFLOPS armv8 64Bit Thu Aug 22 19:50:10 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

            2 Ops/Word           32 Ops/Word            2 Ops/Word           32 Ops/Word
  KB   12.8    128  12800   12.8    128  12800     12.8    128  12800   12.8    128  12800

      ---- Gentoo Pi 4B 64 Bits MFLOPS ---

  1T   2908   2854    459   5778   5734   5405
  2T   5700   5311    457  10935  11212   7968
  4T  10375   5588    490  18181  21842   7637
  8T   9675   8460    511  20128  20567   8568

      --- Gentoo Pi 3B+ 64 Bits MFLOPS ---         -- Raspbian Pi 4B 32 Bits MFLOPS -

  1T    792    806    373   1780   1783   1724      987    993    606   2816   2794   2804
  2T   1482   1596    382   3542   3509   3380     1823   1837    567   5610   5541   5497
  4T   2861   2742    429   5849   7013   5465     2119   3349    647   9884  10702   9081
  8T   2770   2877    429   6434   6700   6101     3136   3783    609  10230  10504   9240

 Comparisons

      --------- Pi 4B/3B+ 64 Bits --------         ------ Pi 4B 64 bits/32 bits -----

  1T   3.67   3.54   1.23   3.25   3.22   3.14     2.95   2.87   0.76   2.05   2.05   1.93
  2T   3.85   3.33   1.20   3.09   3.20   2.36     3.13   2.89   0.81   1.95   2.02   1.45
  4T   3.63   2.04   1.14   3.11   3.11   1.40     4.90   1.67   0.76   1.84   2.04   0.84

 ===========================================================================

 MP-MFLOPS 64 Bit gcc 9 Thu Sep 26 12:36:54 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

      ------ Gentoo 64b Pi 3B+ gcc 9 ------        ------ Gentoo 64b Pi 4B gcc 9 ------

            2 Ops/Word           32 Ops/Word            2 Ops/Word           32 Ops/Word
  KB   12.8    128  12800   12.8    128  12800     12.8    128  12800   12.8    128  12800

  1T    827    805    371   3232   3157   2802     3162   3072    468   6754   6714   6340
  2T   1608   1567    360   6420   6423   5286     6498   6029    496  13329  12397   7623
  4T   1764   3142    400  11240  12355   6029    11709   6141    529  24825  25055   8723
  8T   2548   2575    381  10813  11755   5827    10828   8158    493  19452  22190   8426

 Comparisons

      ........... 64b Pi4/3B+ ..........           .......... Pi4B gcc 9/6 ..........

  1T   3.82   3.82   1.26   2.09   2.13   2.26     1.09   1.08   1.02   1.17   1.17   1.17
  2T   4.04   3.85   1.38   2.08   1.93   1.44     1.14   1.14   1.09   1.22   1.11   0.96
  4T   6.64   1.95   1.32   2.21   2.03   1.45     1.13   1.10   1.08   1.37   1.15   1.14

 ###########################################################################

 Double Precision

 MP-MFLOPS armv8 64Bit Double Precision Thu Aug 22 19:51:42 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

            2 Ops/Word           32 Ops/Word            2 Ops/Word           32 Ops/Word
  KB   12.8    128  12800   12.8    128  12800     12.8    128  12800   12.8    128  12800

      ---- Gentoo Pi 4B 64 Bits MFLOPS ---

  1T   1464   1386    225   3398   3386   3182
  2T   2837   2792    228   6720   6741   4547
  4T   5172   3414    251  10405  12762   4763
  8T   4774   4353    275  11506  12118   4865

      --- Gentoo Pi 3B+ 64 Bits MFLOPS ---         -- Raspbian Pi 4B 32 Bits MFLOPS -

  1T    415    386    206   1400   1403   1333     1187   1220    309   2682   2714   2701
  2T    820    813    209   2804   2767   2597     2420   2416    282   5379   5415   4780
  4T   1328   1323    212   5433   5340   2465     4665   2381    317  10256  10336   5242
  8T   1343   1308    214   5090   5006   3280     4385   3114    310   9721  10340   5131

 Comparisons

      --------- Pi 4B/3B+ 64 Bits --------         ------ Pi 4B 64 bits/32 bits -----

  1T   3.53   3.59   1.09   2.43   2.41   2.39     1.23   1.14   0.73   1.27   1.25   1.18
  2T   3.46   3.43   1.09   2.40   2.44   1.75     1.17   1.16   0.81   1.25   1.24   0.95
  4T   3.89   2.58   1.18   1.92   2.39   1.93     1.11   1.43   0.79   1.01   1.23   0.91

 ===========================================================================

 MP-MFLOPS 64 Bit gcc 9 Double Precision Thu Sep 26 22:05:10 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

      ---- Gentoo 64b Pi 3B+ gcc 9 ----            ----- Gentoo 64b Pi 4B gcc 9 ----

            2 Ops/Word           32 Ops/Word            2 Ops/Word           32 Ops/Word
  KB   12.8    128  12800   12.8    128  12800     12.8    128  12800   12.8    128  12800

  1T    384    350    127   1582   1546   1372      657    663    183   3283   3358   3169
  2T    753    753    184   3109   3157   2645     3203   2690    223   6573   6353   4535
  4T   1346   1330    194   4228   6099   3067     5799   3866    292  12432  12665   4906
  8T   1234   1340    201   4888   5748   3190     5322   4583    269  10738   8891   4521

 Comparisons

      ........... 64b Pi4/3B+ ..........           .......... Pi4B gcc 9/6 ..........

  1T   1.71   1.89   1.44   2.08   2.17   2.31     0.45   0.48   0.81   0.97   0.99   1.00
  2T   4.25   3.57   1.21   2.11   2.01   1.71     1.13   0.96   0.98   0.98   0.94   1.00
  4T   4.31   2.91   1.51   2.94   2.08   1.60     1.12   1.13   1.16   1.19   0.99   1.03

 ###########################################################################

 NEON Single Precision

 MP-MFLOPS NEON Intrinsics 64 Bit Thu Aug 22 19:52:48 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

            2 Ops/Word           32 Ops/Word            2 Ops/Word           32 Ops/Word
  KB   12.8    128  12800   12.8    128  12800     12.8    128  12800   12.8    128  12800

      ---- Gentoo Pi 4B 64 Bits MFLOPS ---

  1T   3311   3192    535   6442   6548   6198
  2T   4607   6186    552  13030  13012   8468
  4T   6279   5725    562  23798  24128   9374
  8T   7815  12044    486  22725  21712   9395

      --- Gentoo Pi 3B+ 64 Bits MFLOPS --          -- Raspbian Pi 4B 32 Bits MFLOPS -

  1T    830    823    406   2989   2986   2792     2491   2399    615   4325   4285   4261
  2T   1575   1498    414   5981   5872   5445     5629   5520    591   8602   8463   8308
  4T   2217   2650    431  11661  11644   6061    10580   5594    553  16991  16493   9124
  8T   2733   3197    437  10505  10637   6708     7047  10785    513  14325  16219   8867

 Comparisons

      --------- Pi 4B/3B+ 64 Bits --------         ------ Pi 4B 64 bits/32 bits -----

  1T   3.99   3.88   1.32   2.16   2.19   2.22     1.33   1.33   0.87   1.49   1.53   1.45
  2T   2.93   4.13   1.33   2.18   2.22   1.56     0.82   1.12   0.93   1.51   1.54   1.02
  4T   2.83   2.16   1.30   2.04   2.07   1.55     0.59   1.02   1.02   1.40   1.46   1.03

 ===========================================================================

 MP-MFLOPS NEON Intrinsics 64 Bit gcc 9 Thu Sep 26 22:02:00 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

      ---- Gentoo 64b Pi 3B+ gcc 9 ----            ----- Gentoo 64b Pi 4B gcc 9 ----

            2 Ops/Word           32 Ops/Word            2 Ops/Word           32 Ops/Word
  KB   12.8    128  12800   12.8    128  12800     12.8    128  12800   12.8    128  12800

  1T    769    765    354   3009   2967   2638     1233   1313    507   6451   6428   6224
  2T   1315   1324    293   5863   5990   5097     6307   4824    389  12559  12784   7612
  4T   1750   2647    380  10081  11250   5748     8101   5186    531  24762  24708   7902
  8T   2180   2664    392   9719  11010   6368     6782   8444    504  22598  24113   7979

      ........... 64b Pi4/3B+ ..........           .......... Pi4B gcc 9/6 ..........

  1T   1.60   1.72   1.43   2.14   2.17   2.36     0.37   0.41   0.95   1.00   0.98   1.00
  2T   4.80   3.64   1.33   2.14   2.13   1.49     1.37   0.78   0.70   0.96   0.98   0.90
  4T   4.63   1.96   1.40   2.46   2.20   1.37     1.29   0.91   0.94   1.04   1.02   0.84


MP-MFLOPS Disassembly

On the Pi 4B, with single precision floating point and SIMD, four word registers were used (see 4s below). With this, four results of calculations might be expected per clock cycle, or 6 GFLOPS per core and up to 24 GFLOPS using all four cores. Fused multiply and add instructions could then double the speed, to 12 GFLOPS per core. For the mix of instructions below, expectations might be 70% of this, or 8.4 GFLOPS. Using double precision, with two words in the 128 bit registers, expectations might be half that, at 4.2 GFLOPS per core, with this code.
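The arithmetic behind these expectations can be written out as a small function. This is a sketch under stated assumptions: a 1.5 GHz Pi 4B clock, one 128 bit SIMD instruction completed per clock cycle, and the 70% instruction mix factor quoted above. The function name and parameters are illustrative.

```c
/* Expected GFLOPS per core for 128 bit SIMD registers.
   clock_ghz  : CPU clock in GHz (1.5 assumed for the Pi 4B)
   word_bits  : 32 for single precision, 64 for double precision
   fused      : non-zero if fused multiply and add counts two operations
   efficiency : fraction of the peak expected for the instruction mix */
double expected_gflops_per_core(double clock_ghz, int word_bits,
                                int fused, double efficiency)
{
    int lanes = 128 / word_bits;                /* 4 for SP, 2 for DP */
    double ops_per_lane = fused ? 2.0 : 1.0;
    return clock_ghz * (double)lanes * ops_per_lane * efficiency;
}
```

With these assumptions, `expected_gflops_per_core(1.5, 32, 0, 1.0)` gives 6, fused operation gives 12, and the 70% mix gives 8.4 for single precision and 4.2 for double precision. The measured single core results shown below, 6.55 and 3.39 GFLOPS, are about 78% and 81% of those last two figures.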


SP NEON 24.1 GFLOPS 6.55 1 core          DP 12.7 GFLOPS - 3.39 1 core 

.L41:                                   .L84:
ldr     q1, [x1]                        ldr     q16, [x2, x0]
ldr     q0, [sp, 64]                    add     w3, w3, 1
fadd    v18.4s, v20.4s, v1.4s           cmp     w3, w6
fadd    v17.4s, v22.4s, v1.4s           fadd    v15.2d, v16.2d, v14.2d
fadd    v0.4s, v0.4s, v1.4s             fadd    v17.2d, v16.2d, v12.2d
fadd    v16.4s, v24.4s, v1.4s           fmul    v15.2d, v15.2d, v13.2d
fadd    v7.4s, v26.4s, v1.4s            fmls    v15.2d, v17.2d, v11.2d
fadd    v6.4s, v28.4s, v1.4s            fadd    v17.2d, v16.2d, v10.2d
fadd    v5.4s, v30.4s, v1.4s            fmla    v15.2d, v17.2d, v9.2d
fmul    v0.4s, v0.4s, v19.4s            fadd    v17.2d, v16.2d, v8.2d
fadd    v4.4s, v10.4s, v1.4s            fmls    v15.2d, v17.2d, v31.2d
fadd    v3.4s, v12.4s, v1.4s            fadd    v17.2d, v16.2d, v30.2d
fadd    v2.4s, v14.4s, v1.4s            fmla    v15.2d, v17.2d, v29.2d
fadd    v1.4s, v8.4s, v1.4s             fadd    v17.2d, v16.2d, v28.2d
fmls    v0.4s, v21.4s, v18.4s           fmls    v15.2d, v17.2d, v0.2d
fmla    v0.4s, v23.4s, v17.4s           fadd    v17.2d, v16.2d, v27.2d
fmls    v0.4s, v25.4s, v16.4s           fmla    v15.2d, v17.2d, v26.2d
fmla    v0.4s, v27.4s, v7.4s            fadd    v17.2d, v16.2d, v25.2d
fmls    v0.4s, v29.4s, v6.4s            fmls    v15.2d, v17.2d, v24.2d
fmla    v0.4s, v31.4s, v5.4s            fadd    v17.2d, v16.2d, v23.2d
fmls    v0.4s, v9.4s, v1.4s             fmla    v15.2d, v17.2d, v22.2d
fmla    v0.4s, v4.4s, v11.4s            fadd    v17.2d, v16.2d, v21.2d
fmls    v0.4s, v3.4s, v13.4s            fadd    v16.2d, v16.2d, v19.2d
fmla    v0.4s, v2.4s, v15.4s            fmls    v15.2d, v17.2d, v20.2d
str     q0, [x1], 16                    fmla    v15.2d, v16.2d, v18.2d
cmp     x1, x0                          str     q15, [x2, x0]
bne     .L41                            add     x0, x0, 16
                                        bcc     .L84


                     32 bit    64 bit    32 bit     64 bit   32 bit    64 bit
                         SP        SP        DP        DP   NEON SP   NEON SP

Maximum GFLOPS         10.7      21.8      10.3      12.7      17.0      24.1

Instructions
Total                    27        39        26        27        67        27
Floating point           22        32        22        32        32        22

FP operations
Total                    32       128        32        64       128       128
Add or subtract          11        44        11        22        21        44
Multiply                  1         4         1         2        11         4
Fused                    20        80        20        40         0        80

Add example           fadds      fadd     faddd      fadd  vadd.f32      fadd
                        s16,   v15.4s,      d25,   v15.2d,       q9,    v1.4s,
                        s23,   v16.4s,      d17,   v16.2d,       q8,    v8.4s,
                        s2     v15.4s       d15    v14.2d        q14    v1.4s

Multiply example     fnmuls      fmul     fmuld      fmul  vmul.f32      fmul
                        s16,   v15.4s,      d16,   v15.2d,       q9,    v0.4s,
                         s3,   v15.4s,      d16,   v15.2d,       q9,    v0.4s,
                        s16    v17.4s       d5     v13.2d       q12    v19.4s

Fused example      vfma.f32      fmla  vfma.f64      fmla       N/A      fmla
                        s16,   v15.4s,      d16,   v15.2d,              v0.4s,
                        s29,   v17.4s,      d22,   v17.2d,              v4.4s,
                         s9     v0.4s       d28    v22.2d              v11.4s

FP registers used        32         4        32        25        16        32


MP-MFLOPS Sumchecks

Different instructions, such as those for SP and DP, may not produce identical numeric results. Variations also depend on the number of passes; here, results were closer to 1.0 as data size increased. The only anomaly is marked -X below.


              2 Ops/Word              32 Ops/Word 
  KB          12.8    128    12800    12.8     128   12800  

SP
4B/64	 1T    76406   97075   99969   66015   95363   99951
3B/64	 1T    76406   97075   99969   66015   95363   99951
4B/32	 1T    76406   97075   99969   66015   95363   99951

DP		
4B/64	 1T    76384   97072   99969   66065   95370   99951	
3B/64	 1T    76384   97072   99969   66065   95370   99951	
4B/32	 1T    76384   97072   99969   66065   95370   99951	

NEON Bit SP		
4B/64	 1T    76406   97075   99969   66015   95363   99951	
3B/64	 1T    76406   97075   99969   66015   95363   99951	
4B/32	 1T    76406   97075   99969   66014-X 95363   99951


OpenMP MFLOPS - OpenMP-MFLOPS64, OpenMP-MFLOPS64g9, notOpenMP-MFLOPS64,
notOpenMP-MFLOPS64g9, OpenMP-MFLOPS, notOpenMP-MFLOPS

This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks but, in addition, includes calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.
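The idea of one source serving both versions can be sketched as below. This is an illustration, not the benchmark code: `eight_ops` is a hypothetical name, and the eight operations are assumed to be the first three terms of the expression quoted for MP-MFLOPS. Built with gcc and -fopenmp, the pragma spreads the loop across the available cores; built without that option, the pragma is ignored and the same loop runs on a single core.

```c
#include <stddef.h>

/* Eight floating point operations per data word:
   three adds, three multiplies, one subtract and one add. */
void eight_ops(float *x, size_t n,
               float a, float b, float c, float d, float e, float f)
{
    long count = (long)n;
    long i;

#pragma omp parallel for
    for (i = 0; i < count; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

Each word is calculated independently of the others, which is what makes the loop safe for automatic parallelism and is why numeric results should be the same with and without OpenMP.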

Following is an example of full output. The strange test names were carried forward from a 2014 CUDA benchmark, via Windows and Linux Intel CPU versions. Details are in the following GigaFLOPS Benchmarks report, covering MP-MFLOPS, QPAR and OpenMP. This showed nearly 100 GFLOPS from a Core i7 CPU and 400 GFLOPS from a GeForce GTX 650 graphics card, via CUDA. See GigaFLOPS Benchmarks.htm.

The detail is followed by MFLOPS results on Pi 3B+ and Pi 4B. The direct conversions of the code from large systems lead to excessive memory demands for Raspberry Pi systems, with too many tests dependent on RAM speed, and low MP performance gains. There were glimpses of the usual performance gains and a maximum of over 20 SP GFLOPS on a 64 bit Pi 4B.

The Pi 4B gcc 9/6 performance ratios indicate no real advantage for either compilation, except that the gcc 9 results reach 24.7 SP GFLOPS.

                      Gentoo 64b Pi 4B gcc 9

            OpenMP MFLOPS64g9 Thu Sep 26 16:51:07 2019

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.124228     4025    0.929538   Yes
 Data in & out    1000000     2      250   0.842066      594    0.992550   Yes
 Data in & out   10000000     2       25   0.873622      572    0.999250   Yes

 Data in & out     100000     8     2500   0.147889    13524    0.957117   Yes
 Data in & out    1000000     8      250   0.904478     2211    0.995518   Yes
 Data in & out   10000000     8       25   0.951405     2102    0.999549   Yes

 Data in & out     100000    32     2500   0.324246    24673    0.890215   Yes
 Data in & out    1000000    32      250   1.097993     7286    0.988088   Yes
 Data in & out   10000000    32       25   1.045087     7655    0.998796   Yes


                                                       --------- gcc 9 ---------
 Mbytes/       Pi 3B+        Pi 4B         Pi 4B         Pi 3B+        Pi 4B
 Ops/Word       64b           64b           32b           64b           64b
              All     1T    All     1T    All     1T    All     1T    All     1T

 0.4/2       2674    755   5386   2780   4716   2850   2341    795   4025   2236
 4/2          411    404    563    557    556    429    381    362    594    403
 40/2         419    408    545    588    544    632    401    387    572    493
 0.4/8       7029   1886  15401   5555   7981   5191   6051   1906  13524   5373
 4/8         1656   1495   2223   2116   2389   2082   1491   1352   2211   1948
 40/8        1725   1507   2361   2310   2199   2003   1598   1418   2102   2308
 0.4/32      6648   1699  20429   5647   8147   5449  12002   3185  24673   6786
 4/32        5977   1616   8082   5445   7951   5385   5641   2809   7286   6385
 40/32       6027   1616   8470   5479   8030   5379   6142   2809   7655   6415


                             Pi 4B         gcc 9         Pi 4B
               4b/3b         64/32b        4b/3b        gcc 9/6
              All     1T    All     1T    All     1T    All     1T

 0.4/2       2.01   3.68   1.14   0.98   1.72   2.81   0.75   0.80
 4/2         1.37   1.38   1.01   1.30   1.56   1.11   1.06   0.72
 40/2        1.30   1.44   1.00   0.93   1.43   1.27   1.05   0.84
 0.4/8       2.19   2.95   1.93   1.07   2.24   2.82   0.88   0.97
 4/8         1.34   1.42   0.93   1.02   1.48   1.44   0.99   0.92
 40/8        1.37   1.53   1.07   1.15   1.32   1.63   0.89   1.00
 0.4/32      3.07   3.32   2.51   1.04   2.06   2.13   1.21   1.20
 4/32        1.35   3.37   1.02   1.01   1.29   2.27   0.90   1.17
 40/32       1.41   3.39   1.05   1.02   1.25   2.28   0.90   1.17
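As a rough cross-check on the gcc 9 log above, the MFLOPS column appears to follow from words x operations per word x repeat passes, divided by running time. The following Python sketch assumes that relationship; the function name is illustrative and is not taken from the benchmark source.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test line.

    Assumes total operations = words * ops_per_word * passes.
    """
    total_ops = words * ops_per_word * passes
    return total_ops / seconds / 1e6


# Values copied from the gcc 9 log: 100000 words, 2500 repeat passes
print(round(mflops(100000, 2, 2500, 0.124228)))   # 2 operations per word
print(round(mflops(100000, 32, 2500, 0.324246)))  # 32 operations per word
```

With the logged times, these two lines come out at the 4025 and 24673 MFLOPS shown in the log, the latter being the 24.7 SP GFLOPS maximum quoted for gcc 9.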


OpenMP-MemSpeed - OpenMP-MemSpeed264, OpenMP-MemSpeed264g9,
NotOpenMP-MemSpeed264, NotOpenMP-MemSpeed264g9, OpenMP-MemSpeed2,
NotOpenMP-MemSpeed2

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2). Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP. Detailed comparisons of these results are rather meaningless. Below are Pi 4B results from a gcc 9 compilation. See MemSpeed results for other comparisons.


     Memory Reading Speed Test OpenMP 64 Bit gcc 9 by Roy Longbottom

               Start of test Thu Sep 26 22:08:22 2019

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S
 
       4    7616   8480   8749   7548   8520   8530  35856  18594  18601
       8    8195   8660   8876   8147   5740   8365  37153  18878  18864
      16    7992   7684   8189   8064   8139   8023  35774  18896  18898
      32    8975   8535   8024   9048   8536   8512  37465  18392  19024
      64    8622   7997   8057   8511   7953   7994  19618  16857  16701
     128   11940  11637  11554  12101  11659  11498  13815  13417  13964
     256   17008  17339  16359  17104  17396  17038  11877  12344  12376
     512   17740  15986  18607  17522  18547  15612  12575  13616  13495
    1024    7011  10208  10016  11310   5287  11413   7060   6279  10045
    2048    7024   4201   7006   7017   6943   3225   2822   3386   3391
    4096    3854   7002   7126   6912   7074   3985   2199   3127   3132
    8192    2632   6950   7151   5291   2796   6813   2546   3091   2403
   16384    7350   7073   3537   7583   5327   3200   2609   3053   1907
   32768    7514   7616   7725   7807   2344   2936   2702   2559   3042
   65536    7065   2937   7571   4306   7086   2975   2127   3017   2677
  131072    1772   1779   2562   8092   2583   2800   2035   1866   2869

    Memory Reading Speed Test notOpenMP 64 Bit gcc 9 by Roy Longbottom

       4   12991  21391  23815  13044  22904  23856  11216   9060   9062
       8   13380  21857  24416  13414  23420  24400  11630   9313   9312
      16   13534  22119  24711  13550  23683  24718  11797   9447   9447
      32   11981  19879  21566  12100  21243  21572   9552   8928   8924
      64   11695  19992  20989  12044  21020  20966   9356   8613   8602
     128   11824  20347  21045  12116  21217  21067   8132   8149   8178
     256   11705  20247  21090  12041  21382  21013   8081   8182   5919
     512   11515  20242  21155  12059  21089  20938   8093   8127   7376
    1024    4504   8674   8151   4658   8682   8680   3894   3739   3887
    2048    1868   3231   3636   1868   3581   3491   2639   2871   2896
    4096    1921   2994   3748   1925   3781   3443   2589   2634   2636
    8192    1836   3719   3695   1921   3624   3791   2603   2596   2595
   16384    1951   3724   3002   1977   3838   3249   2584   2572   2384
   32768    1710   3431   3427   2008   3186   3449   2545   2531   2529
   65536    2030   3034   2135   2047   3035   2394   2550   2535   2546
  131072    2029   2023   2024   1873   2059   1652   2378   2466   2392


Stress Testing Programs Benchmarking Mode

My latest stress testing programs have parameters that specify running time, data size, number of threads, log file number and, in two cases, processing density. When run without parameters, the full range of options is used, providing a useful benchmark. Log file results from Pi 4B tests, and comparisons, are provided below.

Integer Stress Test Benchmark - MP-IntStress64, MP-IntStress

The integer program test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Disassembly shows that the test loop, in fact, used 68 instructions, most additional ones being load register type. The result of these is 68/32 instructions per 4 byte word. At the maximum of 1943M words per second, using a single core, resultant execution speed was 4129 MIPS with nearly four times more using all cores.
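The MIPS arithmetic in the paragraph above can be sketched as follows. This is only an illustration: the function name is made up here, and the assumption that the 1943M words per second figure comes from the single thread 7771 MB/second log entry (4 bytes per word) is mine.

```python
LOOP_INSTRUCTIONS = 68  # instructions in the disassembled test loop
LOOP_WORDS = 32         # 4 byte words processed per loop (one add/subtract each)
BYTES_PER_WORD = 4


def mips(million_words_per_sec):
    """Instruction rate implied by 68 instructions per 32 words."""
    return million_words_per_sec * LOOP_INSTRUCTIONS / LOOP_WORDS


# Assumed source of the words/second figure: 7771 MB/second, one thread
million_words = 7771 / BYTES_PER_WORD
print(round(million_words))   # about 1943 million words per second
print(round(mips(1943)))      # MIPS at 1943M words per second
```

Using the rounded 1943M words per second reproduces the quoted 4129 MIPS.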

The tables below, with speeds on the considered systems, provide average performance gains of the Pi 4B at 64 bits, somewhat limited in this case.

                 Gentoo Pi 4B 64 Bits

  MP-Integer-Test 64 Bit v1.0 Fri Sep  6 16:33:36 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                  KB     KB     MB             Same All
   Secs Thrds     16    160     16   Sumcheck   Tests

    4.3    1    7771   7352   3895   00000000    Yes
    3.3    2   15467  14218   3714   FFFFFFFF    Yes
    3.0    4   28715  26652   3345   5A5A5A5A    Yes
    3.0    8   30292  26310   3334   AAAAAAAA    Yes
    3.0   16   29466  28503   3337   CCCCCCCC    Yes
    3.0   32   29351  30358   3390   0F0F0F0F    Yes


             Pi 4B 32 bit MB/sec     Pi 3B+ 64 bit MB/sec
                 KB     KB     MB        KB     KB     MB
                 16    160     16        16    160     16
 Threads

     1         5964   5756   3931      4823   3884   1209
     2        11787  11430   3748      9613   7709   1908
     4        23214  22060   3456     17737  15137   1779
     6        22197  22171   3472     17651  18692   1767
    16        22671  23299   3256     18255  18793   1757
    32        21379  21881   3346     18246  18674   1748

               Pi 4B 64b/32b            64b Pi 4B/3B+
 Average
 Gain          1.31   1.25   0.99      1.63   1.67   2.13


Single Precision Floating Point Stress Test Benchmark - MP-FPUStress64, MP-FPUStress

This and the double precision program carry out the same calculations as MP-MFLOPS, but are slightly faster by including a loop that repeats the tests within the calculate functions. Maximum speeds were 6.75 GFLOPS, using one core, and 26.7 GFLOPS with four cores.

These programs were written using a later compiler than those used for MP-MFLOPS, at least resulting in similar speeds between 32 bit and 64 bit versions. Typical Pi 4B/3B+ performance improvements were indicated.

                 Gentoo Pi 4B 64 Bits

  MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep  6 16:30:12 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   1.7    T1   2  2819  2874   504   40392  76406  99700
   3.2    T2   2  5592  5702   511   40392  76406  99700
   4.6    T4   2  9223  7520   519   40392  76406  99700
   6.0    T8   2  9520 10471   545   40392  76406  99700
   8.2    T1   8  5381  5595  2050   54764  85092  99820
   9.8    T2   8 11039 10883  2173   54764  85092  99820
  11.3    T4   8 19087 21040  2044   54764  85092  99820
  12.9    T8   8 19747 21107  2016   54764  85092  99820
  17.5    T1  32  6693  6753  6377   35206  66015  99520
  20.2    T2  32 13491 13464  8710   35206  66015  99520
  22.2    T4  32 25732 26704  9160   35206  66015  99520
  24.1    T8  32 25708 25770  8927   35206  66015  99520

            End of test Fri Sep  6 16:30:37 2019

              Pi 4B 32 bit               Pi 3B+ 64 bit
Threads       KB      KB      MB         KB      KB      MB
Ops/wd       12.8     128    12.8       12.8     128    12.8

T1   2       2641    2607     646        838     826     373
T2   2       5089    5116     618       1659    1650     380
T4   2       8282    8522     683       2584    3296     384
T8   2       8756    9847     686       3013    3056     391
T1   8       5543    5428    2597       1981    1972    1354
T2   8      10754   10603    2711       3936    3923    1518
T4   8      18716   20823    2844       7482    7396    1531
T8   8      19859   21684    2555       7399    7705    1534
T1  32       5309    5274    5265       2820    2809    2462
T2  32      10557   10509    9991       5636    5583    4754
T4  32      20416   20919   11340      10640   10882    6020
T8  32      20072   19787    9330      10641   10926    6159

              Average Pi 4B Performance Gains

  Ops/Word   Pi 4B 64b/32b              64b Pi 4B/3B+

        2    1.09    1.04    0.79       3.37    3.16    1.36
        8    1.00    1.01    0.77       2.69    2.80    1.40
       32    1.27    1.29    0.96       2.40    2.41    1.85


Double Precision Floating Point Stress Test Benchmark - MP-FPUStress64DP,
MP-FPUStressDP

Maximum measured DP speeds were 3.39 GFLOPS, using one core, and 13.2 GFLOPS with four cores. Some of the 64/32 bit and 4B/3B+ performance ratios were similar to those from MP-MFLOPS.


                 Gentoo Pi 4B 64 Bits

  MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep  6 16:31:24 2019

    Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.2    T1   2  1398  1462   285   40395  76384  99700
   6.2    T2   2  2799  2807   256   40395  76384  99700
   8.9    T4   2  5024  4589   257   40395  76384  99700
  11.5    T8   2  5089  5545   280   40395  76384  99700
  15.7    T1   8  2668  2790  1103   54805  85108  99820
  18.8    T2   8  5670  5545  1158   54805  85108  99820
  21.7    T4   8 10259 10011  1068   54805  85108  99820
  24.7    T8   8 10239 10824  1036   54805  85108  99820
  34.1    T1  32  3317  3390  3195   35159  66065  99521
  39.2    T2  32  6791  6754  4753   35159  66065  99521
  43.1    T4  32 12940 13200  4497   35159  66065  99521
  46.9    T8  32 13200 13049  4557   35159  66065  99521

            End of test Fri Sep  6 16:32:11 2019

              Pi 4B 32 bit               Pi 3B+ 64 bit
Threads       KB      KB      MB         KB      KB      MB
Ops/wd       12.8     128    12.8       12.8     128    12.8

T1   2        993     998     329        412     411     193
T2   2       1971    1995     309        828     824     194
T4   2       3633    3937     340       1543    1514     197
T8   2       3635    3796     339       1525    1551     196
T1   8       2378    2445    1288        980     978     696
T2   8       4770    4860    1282       1975    1964     782
T4   8       9281    9556    1210       3688    3688     781
T8   8       9119    9448    1245       3726    3689     787
T1  32       2697    2726    2708       1402    1403    1231
T2  32       5397    5446    5163       2808    2808    2399
T4  32      10689   10806    5146       5379    5413    3195
T8  32      10716   10494    4497       5450    5485    3150

              Average Pi 4B Performance Gains

  Ops/Word   Pi 4B 64b/32b              64b Pi 4B/3B+

        2    1.40    1.37    0.82       3.34    3.39    1.38
        8    1.13    1.12    0.87       2.78    2.83    1.44
       32    1.23    1.24    1.00       2.40    2.41    1.86
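The average gains in the table above can be approximated from the MFLOPS figures. The sketch below assumes each average is the mean of the four per-thread speed ratios (T1, T2, T4, T8) for one data size; the report does not state the exact averaging method, and the function name is illustrative.

```python
from statistics import mean


def average_gain(new_speeds, old_speeds):
    """Mean of per-thread speed ratios (assumed averaging method)."""
    return mean(new / old for new, old in zip(new_speeds, old_speeds))


# Double precision, 2 Ops/Word, 12.8 KB column, threads T1, T2, T4, T8
pi4_64bit = [1398, 2799, 5024, 5089]
pi4_32bit = [993, 1971, 3633, 3635]
pi3_64bit = [412, 828, 1543, 1525]

print(f"Pi 4B 64b/32b  {average_gain(pi4_64bit, pi4_32bit):.2f}")
print(f"64b Pi 4B/3B+  {average_gain(pi4_64bit, pi3_64bit):.2f}")
```

Under this assumption the two results agree with the 1.40 and 3.34 entries in the first row of the gains table.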


High Performance Linpack Benchmark

Earlier, the High Performance Linpack Benchmark was run on Raspberry Pi 3 models, and later, on the Raspberry Pi 4 system, both via 32 bit Raspbian Operating System. Details and results can be found in the following reports: Pi 3B and 3B+ results and Pi 4B 32 bit results.

Initially, two versions of HPL tests were run, one accessing precompiled Basic Linear Algebra Subprograms and the other with ATLAS alternatives, that had to be built. The whole benchmark suite was produced according to these instructions.

The ATLAS version was installed, as the older benchmark would not run on the Pi 4. One issue is the time required for the build, apparently due to the numerous tuning tests. Time taken was 14 hours using a Pi 3B+, then 8 hours on a Pi 4. Later, 64 bit ATLAS was built on the Pi 3B+, via Gentoo, taking 26 hours, which included extended periods swapping data with the rather slow main drive. The procedure specified in the above instructions was used, successfully leading to a working package. Only one change was required, to Make.rpi line 95, from:

LAdir = /home/pi/atlas-build

to:

LAdir = /home/demouser/atlas-build

Following the introduction of 64 bit Gentoo for the Pi 4B, ATLAS was again created, taking more than 10 hours. As indicated in the above links, the HPL benchmark can be a useful stress test, due to the long running time with heavy processing. It can lead to CPU MHz being throttled on the Pi 4B, producing slow GFLOPS speeds. The tests reported here were run using a Pi 4B with a cooling fan, with CPU MHz monitored to help to indicate that the processor was running at full speed.

The benchmark was run on various Raspberry Pi models, using the same parameters. An example of the main output produced is shown below. Key areas are array size parameter N, running time, GFLOPS speed rating and sumcheck (0.0010188 in this case), including whether acceptable (PASSED).

pi@raspberrypi:~/hpl-2.2/bin/rpi $ mpiexec -f nodes-1pi ./xhpl
================================================================================
HPLinpack 2.2  --  High-Performance Linpack benchmark  --   February 24, 2016
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   20000
NB     :     128
PMAP   : Row-major process mapping
P      :       2
Q      :       2
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             494.46              1.079e+01
HPL_pdgesv() start time Fri Oct 11 22:34:37 2019

HPL_pdgesv() end time   Fri Oct 11 22:42:52 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0010188 ...... PASSED
================================================================================
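The Gflops figure in the output above can be related to N and Time. The following Python sketch assumes the operation count usually quoted for the HPL solve, 2/3 N^3 + 3/2 N^2 floating point operations; the function name is illustrative and not part of HPL.

```python
def hpl_gflops(n, seconds):
    """Estimated GFLOPS for an HPL run of order n taking the given time.

    Assumes the usual LU solve operation count: 2/3 n^3 + 3/2 n^2.
    """
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9


# N and Time taken from the example output above
print(f"{hpl_gflops(20000, 494.46):.2f} GFLOPS")
```

For N = 20000 and 494.46 seconds this gives 10.79 GFLOPS, in line with the 1.079e+01 reported.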


High Performance Linpack Benchmark Results

Particularly important, maximum performance is dependent on the amount of RAM available. As with the original single CPU Linpack benchmark, where N is the matrix problem size, minimum memory used is N x N x 8 Bytes (double precision) or 512 MB for N = 8000 or 3.2 GB for N = 20000. The end of the detailed output indicates a further problem, where the first run at maximum size might be slow, with extra time swapping data out of RAM, to create space for the HPL data.
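The minimum memory figures quoted above follow directly from N x N x 8 bytes. A minimal sketch of that arithmetic is given below; the helper names and the 4 GB figure used in the last line are illustrative assumptions, and real runs need additional memory for the operating system and HPL workspace.

```python
import math


def hpl_matrix_bytes(n, bytes_per_element=8):
    """Minimum bytes needed for an n x n double precision matrix."""
    return n * n * bytes_per_element


def largest_n(ram_bytes, bytes_per_element=8):
    """Largest n whose matrix alone would fit in ram_bytes (upper bound only)."""
    return math.isqrt(ram_bytes // bytes_per_element)


for n in (4000, 8000, 16000, 20000):
    print(f"N = {n:5d}  {hpl_matrix_bytes(n) / 1e6:7.0f} MB")

# Hypothetical example: matrix size limit if all of 4 GB were available
print(largest_n(4_000_000_000))
```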

Next, the benchmark produces a sumcheck but, in the case of the ATLAS implementation, these are not consistent for the same problem size. All those shown here were indicated as PASSED (within specified tolerances). The anomaly could be produced by different CPU models or alternative compilations, but the least understandable case is identified at the end of the detailed output, where the sumcheck is shown to vary on repeating the program on the same system.

Comparing Pi 4B 32 bit and 64 bit GFLOPS maximum speeds, the 32 bit version appears to be slightly faster (or the same within reasonable tolerances). Then it is not clear (to me), whether the compiled code completely embraces the difference in technology or whether external compile options should be included for the different packages involved.

Anyway, around 10 double precision GFLOPS was the maximum produced by other benchmarks, reported above.

          ------ Time ------   ----- GFLOPS -----   ----------- Sumcheck ----------
             4B     4B    3B+      4B     4B    3B+         4B         4B        3B+
      N     64b    32b    64b     64b    32b    64b        64b        32b        64b

   4000    5.51   5.20  14.53    7.75   8.20   2.94  0.0022808  0.0023975  0.0025857
   8000   38.22  36.70 101.59    8.93   9.30   3.36  0.0017216  0.0016746  0.0017518
  16000  269.26 263.00          10.14  10.40         0.0012577  0.0011258
  20000  513.67 494.30          10.38  10.80         0.0009637  0.0010188

          GFLOPS Comparisons

               4B      64b
      N   64b/32b   4B/3B+

   4000      0.95     2.64
   8000      0.96     2.66
  16000      0.98
  20000      0.96

          Example Logged Results

                                                     Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             516.71              1.032e+01
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008697 ...... PASSED
================================================================================
First Run
WR11C2R4       20000   128     2     2             656.89              8.120e+00
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009470 ...... PASSED
================================================================================


DriveSpeed Benchmark - DriveSpeed64v2, DriveSpeed64v2g9, DriveSpeed

This benchmark has the format shown below, measuring writing and reading speeds of large files, cached files, random access and numerous small files. Run time parameters are available to specify large file size and the file path.

In order to test a USB drive, it must be mounted - plug in, then right click and select Mount Volume, or double click to open. Run the df command to find the path, needed for use as a run time parameter. Following is an example log file and the command used to run the program to test a USB 3 stick. With no MB parameter, default large file sizes are 8 and 16 MB.

 ############################## Pi 4B USB 3 ###############################

 Run command ./DriveSpeed64v2g9 MB 512 FilePath /run/media/demouser/PATRIOT

 ##########################################################################

   DriveSpeed RasPi 64 Bit 2.0 Fri Sep 13 22:25:40 2019

 Selected File Path:
 /run/media/demouser/PATRIOT/
 Total MB  120832, Free MB  119778, Used MB    1054

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

  512    30.72    31.11    34.01   287.24   295.04   311.90
 1024    34.66    36.11    35.45   298.87   302.38   300.26
 Cached
    8    42.03    39.58    38.85  1167.71  1029.35  1061.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.007    0.310     9.65    10.42     9.71

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.13   268.10   427.95   657.48
 ms/file   122.73   122.28   122.22     0.02     0.02     0.02    2.557

For non-cached tests, in the standard version of this benchmark, the file opening handle includes the O_DIRECT option, specifying Direct I/O (no caching). The latest minor variety of this appears to work, as expected, on the 32 bit Raspbian version, on both main and USB drives. The 64 bit compilation of this indicated a failure to write to the main SD drive and a failure to read from USB flash drives. Omitting O_DIRECT, for reading, appeared to correct the latter (see above). To check this and enable main drive measurements, separate direct I/O free large file write and read only programs were produced, to follow write/reboot/read procedures. These were also necessary to indicate throughput simultaneously writing or reading two USB 3 drives.

Following are 64 bit Pi 4B SD main drive results from the separate write and read tests, followed by full results from Pi 4B with 32 bit Raspbian, using a same brand SD card. Note the similarity in writing and reading speeds of large files.

 ################# Main SD Drive From Write/Read Tests Below #################

        Write1   Write2   Write3    Read1    Read2    Read3

 Write   18.99    19.34    19.47  1337.09  1164.91  1325.96  - cached
 Read     N/A      N/A      N/A     45.80    45.88    45.89  - not cached

 ============================== 32 Bit Results ==============================

   DriveSpeed RasPi 1.1 Mon Apr 29 10:20:57 2019

 Current Directory Path:
 /home/pi/Raspberry_Pi_Benchmarks/DriveSpeed/drive1
 Total MB   14845, Free MB    8198, Used MB    6646

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

    8    16.41    11.21    12.27    39.81    40.10    40.39
   16    11.79    21.10    34.05    40.18    40.19    40.33
 Cached
    8   137.47   156.43   285.59   580.73   598.66   587.97

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.371    0.371    0.363     1.28     1.53     1.30

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      3.49     6.41     8.26     7.67    11.68    17.51
 ms/file     1.17     1.28     1.98     0.53     0.70     0.94    0.014


USB 3 and 2 Flash Drive Benchmarks

Two FAT 32 formatted USB 3 sticks were used, P at 128 GB, with 32 KB sectors, reading speed rated as up to 400 MB/second, and R 8.8 GB partition, with 8 KB sectors, reading speed rated as up to 190 MB/second (but appears to do better sometimes). The benchmark was run using USB 2 connections, on a Pi 3B+ and a Pi 4B, and via USB 3 slots on the Pi 4B.

Following is a summary of results, indicating USB 3 large file reading speed improvements of between 6.7 and 8.1 times, but disappointing writing performance. The slower P speeds might be affected by the mysteries of updating file allocation tables, which also influence random access and dealing with lots of small files, including file delete times. USB 3 use provided little or no performance gain for the latter. Cached reading reflects RAM speed, the only area showing a clear difference in performance between the Pi 3B+ and Pi 4B.

                   MB/second 16 MB USB 2, 1024 MB USB 3

 System Drive      Write1  Write2  Write3    Read1   Read2   Read3

 Pi 3B+ USB 2 P      11.5    11.4    11.5     36.6    37.7    37.3
 Pi 3B+ USB 2 R      15.9    16.4    13.9     37.1    40.1    39.8
 Pi 4B  USB 2 P      12.6    12.6    12.6     37.0    37.3    37.2
 Pi 4B  USB 2 R      22.6    22.9    22.9     36.5    36.3    36.5
 Pi 4B  USB 3 P      34.7    36.1    35.5    298.9   302.4   300.3
 Pi 4B  USB 3 R      48.9    44.6    53.4    249.4   248.8   246.2

 Compare MB/second
 Pi 4B P USB 3/2     2.75    2.88    2.81     8.07    8.11    8.07
 Pi 4B R USB 3/2     2.17    1.94    2.33     6.83    6.85    6.74

                   Cached MB/second

                   Write1  Write2  Write3    Read1   Read2   Read3

 Pi 3B+ USB 2 P      13.6    14.2    14.4    633.4   544.0   464.3
 Pi 3B+ USB 2 R      13.7    14.4    19.4    623.5   661.4   557.6
 Pi 4B  USB 2 P      15.0    14.7    14.8   1204.0  1047.3  1066.3
 Pi 4B  USB 2 R      20.8    21.2    13.9    930.2   933.6  1230.3
 Pi 4B  USB 3 P      42.0    39.6    38.9   1167.7  1029.4  1061.6
 Pi 4B  USB 3 R      21.1    15.9    36.2   1103.6   944.9   981.0

 Compare
 Pi 4B P USB 3/2     2.80    2.70    2.63     0.97    0.98    1.00
 Pi 4B R USB 3/2     1.01    0.75    2.60     1.19    1.01    0.80

                   Random milliseconds

                            Read                     Write

 Pi 3B+ USB 2 P     0.013   0.013   0.254    11.76   10.18    9.80
 Pi 3B+ USB 2 R     0.017   0.008   0.032     1.09    1.39   11.72
 Pi 4B  USB 2 P     0.006   0.007   0.215     9.56    8.54    8.75
 Pi 4B  USB 2 R     0.009   0.005   0.016     1.35    2.12    1.34
 Pi 4B  USB 3 P     0.004   0.007   0.310     9.65   10.42    9.71
 Pi 4B  USB 3 R     0.004   0.004   0.008     1.75    0.85    0.92

 Compare
 Pi 4B P USB 3/2     1.50    1.00    0.69     0.99    0.82    0.90
 Pi 4B R USB 3/2     2.25    1.25    2.00     0.77    2.49    1.46

                   200 Small Files milliseconds

                            Write                    Read            Delete

 Pi 3B+ USB 2 P     134.2   128.6   129.6     0.08    0.12    0.07     3.36
 Pi 3B+ USB 2 R     105.5   104.7   107.6     0.05    0.05    0.07     0.26
 Pi 4B  USB 2 P     125.8   125.5   125.8     0.02    0.02    0.02     3.12
 Pi 4B  USB 2 R     104.1   104.0   104.0     0.02    0.02    0.03     0.14
 Pi 4B  USB 3 P     122.7   122.3   122.2     0.02    0.02    0.02     2.56
 Pi 4B  USB 3 R     105.4   104.0   104.3     0.02    0.02    0.03     0.15

 Compare
 Pi 4B P USB 3/2     1.03    1.03    1.03     1.00    1.00    1.00     1.22
 Pi 4B R USB 3/2     0.99    1.00    1.00     1.00    1.00    1.00     0.95


Drive Write/Reboot/Read Tests - DriveSpeed264WR, DriveSpeed264Rd

As a reminder, different programs were produced to enable separate measurements of writing and reading, because of the inability to avoid written data being cached on a main drive, invalidating drive reading speed measurements. These were also required to measure overall throughput, when using two USB drives. The write test also reads the data for verification, but this will normally be cached in RAM, with high data transfer speeds. VMSTAT results are provided, covering reading speeds.

Main SD Drive - This is rated at up to 98 MB/second reading speed but only achieves near 46 MB/second. VMSTAT results confirm data transfer speed and three files eventually occupying around 3 GB of the cache, with the low 2% (x4) CPU utilisation and 23% (x4) waiting for I/O.

  Run Commands ./DriveSpeed264WR MB 1024 and ./DriveSpeed264Rd MB 1024

 Current Directory Path: /home/demouser/RPi3-64-Bit-Benchmarks/IOtests/writeread
 Total MB   28225, Free MB   18761, Used MB    9464
 
                1024 MB   MBytes/Second

       Write1   Write2   Write3    Read1    Read2    Read3

Write   18.99    19.34    19.47  1337.09  1164.91  1325.96
Read     N/A      N/A      N/A     45.80    45.88    45.89
 
                                  vmstat

procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st

0  1     0 673848  60668 2792716    0    0 45056     0  767 1181  0  2 75 23  0
0  1     0 630228  60668 2835544    0    0 44544     0  789 1199  0  2 74 23  0
0  1     0 585204  60668 2880268    0    0 45056     0  691 1041  0  3 75 23  0

USB 3 Drive P - Read only speed was similar to that from the earlier detailed test. Note the high average CPU utilisation of 17%, equivalent to 68% of one core.

 Run Commands ./DriveSpeed264WR MB 1024 FilePath /run/media/demouser/PATRIOT
    and       ./DriveSpeed264Rd MB 1024 FilePath /run/media/demouser/PATRIOT

 Selected File Path: 
 /run/media/demouser/PATRIOT/
 Total MB  120832, Free MB  119752, Used MB    1080

                 1024 MB   MBytes/Second

       Write1   Write2   Write3    Read1    Read2    Read3

Write   58.45    23.10    22.91  1368.04  1190.71  1354.84
Read     N/A      N/A      N/A    306.18   294.93   302.91
 
                                  vmstat

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

1  0   256 811672  20920 2696504    0    0 305664     0 3898 6182  1 15 73 11  0
0  1   256 510852  20920 2996188    0    0 303616     0 4304 5936  1 16 72 12  0
1  0   256 239400  20920 3267636    0    0 307184     0 4512 6177  1 17 71 11  0
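The 17% and 68% figures quoted for Drive P can be derived from the vmstat lines above. The sketch below assumes CPU utilisation is taken as user plus system time (the us and sy columns), averaged over the three samples, then multiplied by the four cores; the function name is illustrative.

```python
CORES = 4  # the Pi 4B has four CPU cores


def cpu_utilisation(us_sy_samples, cores=CORES):
    """Return (average % of all cores, equivalent % of one core)."""
    busy = [us + sy for us, sy in us_sy_samples]
    average = sum(busy) / len(busy)
    return average, average * cores


# (us, sy) pairs from the three vmstat samples for USB 3 Drive P
samples = [(1, 15), (1, 16), (1, 17)]
average, one_core = cpu_utilisation(samples)
print(f"{average:.0f}% of four cores = {one_core:.0f}% of one core")
```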

USB 3 Drive R - This time data transfer speed was slower than the earlier example.

 Selected File Path: 
 /run/media/demouser/REMIX_OS/
 Total MB    9017, Free MB    7485, Used MB    1532

                 1024 MB   MBytes/Second                  

       Write1   Write2   Write3    Read1    Read2    Read3

Write   46.43    28.81    36.57  1265.07  1103.23  1236.02
Read     N/A      N/A      N/A    172.71   172.14   176.49
 
                                  vmstat

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

0  1   256 111512    912 3417624    0    0 175189     0 4315 5929  1 12 71 17  0
0  1   256 169756    992 3358840    0    0 169043     0 4064 5515  1 11 71 17  0
0  1   256 177444   1068 3351176    0    0 155724     0 4088 6023  1 12 70 16  0


USB 3 Drives R and P Together

File sizes were reduced to 512 MB for these tests, in order to ensure that there would be sufficient RAM to contain six copies, as indicated in VMSTAT cache occupancy. This makes it more tricky to measure total throughput, but the following appears to provide a best case example, with a maximum of up to 386 MB/second, with CPU utilisation near 100% of one core. Different log files are needed for reading, to avoid confusion.

Later is a bad example, where one drive appears to be running at USB 2 speed.

 Run Commands ./DriveSpeed264WR MB 512 FilePath /run/media/demouser/PATRIOT
     and      ./DriveSpeed264WR MB 512 FilePath /run/media/demouser/REMIX_OS
     and      ./DriveSpeed264Rd MB 512 FilePath /run/media/demouser/PATRIOT Log 1
     and      ./DriveSpeed264Rd MB 512 FilePath /run/media/demouser/REMIX_OS Log 2

 Write/Read Thu Sep 19 16:07:48 2019 /run/media/demouser/REMIX_OS/
 Write/Read Thu Sep 19 16:07:46 2019 /run/media/demouser/PATRIOT/

                  512 MB   MBytes/Second

        Write1   Write2   Write3    Read1    Read2    Read3

 R       28.72    33.89    44.69  1302.19  1131.65  1374.24
 P       11.93     8.86     6.21  1232.47  1072.38  1213.36

 Sep 23 17:11:21 2019 /run/media/demouser/PATRIOT/
 Sep 23 17:11:20 2019 /run/media/demouser/REMIX_OS/

                  512 MB   MBytes/Second

        Write1   Write2   Write3    Read1    Read2    Read3  Seconds

 P        N/A      N/A      N/A    159.78   187.44   294.23   7.7
 R        N/A      N/A      N/A    221.83   232.10   230.94   6.7+2 delayed start

                                  vmstat

procs -----------memory---------- ---swap-- -----io----  -system-- ------cpu-----
r  b  swpd    free   buff   cache   si   so     bi    bo   in   cs us sy id wa st

0  0     0 3160720  74616  296092    0    0      0     0 2031 3601  4  2 94  0  0
0  1     0 3112052  74616  342188    0    0  45552     0 1512 2257  1  3 93  4  0
0  1     0 2908004  74616  547600    0    0 206336     0 4684 7169  4 14 67 15  0
2  0     0 2531960  74616  919400    0    0 369136     0 5495 8033  4 24 47 25  0
2  0     0 2149064  74616 1303288    0    0 382960     0 5168 7007  1 21 52 26  0
1  1     0 1771492  74616 1681348    0    0 385024     0 5969 8255  1 23 49 26  0
1  1     0 1383524  74616 2068788    0    0 386016     0 5621 7926  1 21 49 29  0
0  2     0  999100  74616 2453280    0    0 383488     0 4602 6895  1 19 54 26  0
0  1     0  628988  74616 2824188    0    0 368640     0 5405 8153  2 20 56 22  0
1  0     0  310748  74624 3142732    0    0 317424    20 4622 6551  1 17 72 10  0
1  0     0  223052  73680 3231812    0    0 268288     0 2815 5012  1 18 72 10  0
0  0     0  223824  73680 3231280    0    0  32768     0 1044 2009  1  3 95  1  0
0  0     0  223824  73680 3231280    0    0      0     0  393  619  0  0 99  0  0

 ===============================================================================

 Bad Example

        Write1   Write2   Write3    Read1    Read2    Read3

 P        N/A      N/A      N/A     36.37    37.72    37.48
 R        N/A      N/A      N/A    248.18   248.22   223.53


LAN and WiFi Benchmarks - LanSpeed64, LanSpeed64g9, LanSpdx86Win.exe, LanSpeed

The Raspberry Pi LanSpeed64 version uses the same programming code as for the DriveSpeed benchmark, except O_DIRECT is not used on creating files. The measurements were made between the Pi 4B and a Windows 7 based PC, where the data transfer speed was confirmed via Task Manager Network information and sysstat sar -n DEV on the Raspberry Pi 4. SAMBA was also installed to connect a remote PC and enable an Intel Windows version, LanSpdx86Win.exe, to be run.

An example of a LanSpeed64 log file is provided below, preceded by examples of the required mount and run commands. For further details of required procedures see This PDF file, LAN/WiFi section. The 64 bit results are followed by details from running the benchmark on a 32 bit system, and showing the same levels of performance, within the usual variability.

 Commands

 sudo mount -t cifs -o dir_mode=0777,file_mode=0777 //192.168.1.68/d /media/public
 ./LanSpeed64 FilePath /media/public/test

 Log File

   LanSpeed RasPi 64 Bit 1.0 Thu Sep 12 22:06:06 2019

  Selected File Path: /media/public/test/
  Total MB 266240, Free MB 70991, Used MB 195249

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

    8    66.13    92.09    92.76    96.36    96.85    97.30
   16    80.79    93.59    94.61   103.99   104.34   104.57

  Random         Read                       Write
  From MB        4        8       16        4        8       16
  msecs      0.004    0.009    0.435     0.95     0.92     0.93

  200 Files      Write                      Read                 Delete
  File KB        4        8       16        4        8       16    secs
  MB/sec      1.37     2.45     4.77     1.37     2.49     4.92
  ms/file     2.99     3.35     3.43     2.98     3.29     3.33   0.467

 ************************ 32 Bit Pi 4B ************************

                        MBytes/Second
   MB   Write1   Write2   Write3    Read1    Read2    Read3

    8    67.82    12.97    90.19    99.84    93.49    96.83
   16    92.25    92.66    92.96    103.9   105.28    91.17

  Random         Read                       Write
  From MB        4        8       16        4        8       16
  msecs      0.007     0.01     0.04     1.01     0.85     0.91

  200 Files      Write                      Read                 Delete
  File KB        4        8       16        4        8       16    secs
  MB/sec      1.47      2.8     5.14     2.47     4.71     8.61
  ms/file     2.78     2.92     3.19     1.66     1.74     1.90   0.256


LAN and WiFi Benchmark Results

Below are results from programs run on the Pi 3B+ and 4B, plus others from running on a PC. Dealing with large files, PC to Pi 4B and Pi 4B to PC LAN speeds demonstrated some gigabit performance examples (over 100 MB/second), around three times faster than on the Pi 3B+. My BT Hub has dual 2.4 GHz (WiFi1) and 5 GHz (WiFi2) capabilities, leading to the following erratic performance, where (I think) speeds greater than 10 MB/second indicate 5 GHz and those around 4 MB/second indicate 2.4 GHz, the former usually only on writing. In this case, the hub was inches away from the Pi.

I changed the hub settings to provide separate 2.4 and 5 GHz hub address selections, with 72 and 180 Mbits/second being indicated, respectively. These sorts of numbers were confirmed on my Smartphone, but were variable. The 64 bit version would not connect to the network at 5 GHz, unlike the 32 bit program which, for example, obtained 15 MB/second writing and 8 MB/second reading. These differences could be, I suppose, due to program, software and/or hub incompatibility.
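To relate the indicated link rates to the MB/second figures reported by the benchmark, a minimal sketch of the conversion follows. It assumes the indicated Mbits/second are raw link rates, with 8 bits per byte and no allowance for protocol overhead; the function names are made up for the example.

```python
def mbits_to_mbytes(mbits_per_second):
    """Convert a link rate in Mbits/second to MBytes/second (8 bits per byte)."""
    return mbits_per_second / 8.0


def percent_of_link(measured_mbytes, link_mbits):
    """Measured MB/second as a percentage of the raw link rate."""
    return 100.0 * measured_mbytes / mbits_to_mbytes(link_mbits)


if __name__ == "__main__":
    for name, mbits in (("2.4 GHz WiFi", 72), ("5 GHz WiFi", 180), ("Gigabit LAN", 1000)):
        print(f"{name:13s} {mbits:5d} Mbits/s = {mbits_to_mbytes(mbits):6.1f} MB/s")
    # The 32 bit program's 15 MB/second write speed against the 180 Mbits/second link
    print(f"15 MB/s is {percent_of_link(15, 180):.1f}% of a 180 Mbits/s link")
```

On this basis, 72 and 180 Mbits/second correspond to upper limits of 9 and 22.5 MB/second, and a gigabit LAN to 125 MB/second.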

Random access times appeared to be quite similar on all WiFi tests, with faster but variable comparative times via LAN. There were similar relationships on dealing with numerous small files.

Some results from running the 32 bit benchmark on a Pi 4B are provided. Performance there was also erratic, these speeds representing best case measurements, reading large files somewhat faster than those achieved at 64 bits.

 Large Files                      MB/second
 System          MB   Write1   Write2   Write3    Read1    Read2    Read3

 PC WiFi         16     4.08     4.16     4.11     2.34     1.68     1.30
 PC LAN          16   106.11   106.11   105.89    50.67    33.86    25.47
 LAN 3B+         16    28.63    29.03    28.96    22.18    32.28    32.61
 3B+ WiFi        16    11.15    11.00    10.76     4.01     3.89     3.09
 4B WiFi1        16     6.43     6.39     6.47     4.33     4.13     4.86
 4B WiFi2        16    13.26    13.34    13.25     3.69     4.22     4.00
 4B LAN          16    80.79    93.59    94.61   103.99   104.34   104.57
 4B LAN         128    96.58    96.67    95.74   106.41   107.24   107.82
 32 Bit
 4B WiFi1        16     6.70     6.82     6.76     7.19     6.53     7.22
 4B WiFi2        16    11.50    13.93    14.13     9.91     8.88     9.92

 Random                           milliseconds
 System                   Read                       Write

 PC WiFi              1.711    1.972    2.015     2.26     2.28     2.25
 PC LAN               0.606    0.590    0.532     0.47     0.48     0.47
 LAN 3B+              0.030    0.816    0.484     1.19     1.16     1.16
 3B+ WiFi             3.052    3.167    3.475     3.60     3.39     3.45
 4B WiFi1             3.286    3.549    3.627     4.02     3.45     3.72
 4B WiFi2             2.786    2.822    2.944     3.20     2.94     2.92
 4B LAN               0.004    0.009    0.435     0.95     0.92     0.93
 32 Bit
 4B WiFi1             2.691    2.875    3.048     3.13     2.93     2.84
 4B WiFi2             Similar

 200 Small Files                  milliseconds per file
 System                   Write                      Read                Delete

 PC WiFi              10.09    12.42    13.81     5.50     6.11     8.06   1.507
 PC LAN                4.05     4.59     4.53     2.38     2.23     2.64   0.661
 LAN 3B+               3.72     4.36     4.45     3.33     3.40     3.60   0.378
 3B+ WiFi             12.61    13.53    14.97    13.17    14.06    15.88   2.534
 4B WiFi1             15.08    16.53    22.83    12.96    14.23    17.29   2.509
 4B WiFi2             11.38    12.85    12.82    10.64    11.83    14.15   2.083
 4B LAN                2.99     3.35     3.43     2.98     3.29     3.33   0.467
 32 Bit
 4B WiFi1             12.14    18.59    15.70    11.10    22.20    12.99   2.153
 4B WiFi2             30.85    17.83    18.10    16.62    14.93    16.01   3.361


Java Whetstone Benchmark - whetstc.class

The benchmark measures performance of various floating point and integer calculations, with an overall rating in Million Whetstone Instructions Per Second (MWIPS). Results are also provided for a 32 bit version run on a Pi 4B, showing variations in performance, using a different version of Java.

############################# Pi 3B+ #############################

    Whetstone Benchmark Java Version, Sep 20 2019, 11:06:12

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    310.88             0.0618
  N2 floating point  -1.131330490    289.41             0.4644
  N3 if then else     1.000000000             241.15    0.4292
  N4 fixed point     12.000000000             706.28    0.4460
  N5 sin,cos etc.     0.499110132              23.31    3.5700
  N6 floating point   0.999999821    130.04             4.1480
  N7 assignments      3.000000000              89.19    2.0720
  N8 exp,sqrt etc.    0.825148463              21.92    1.6970

  MWIPS                              775.89            12.8884

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


############################# Pi 4B ##############################
 
    Whetstone Benchmark Java Version, Sep 12 2019, 20:15:35

                                                      1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs Gains

  N1 floating point  -1.124750137    488.80             0.0393   1.57
  N2 floating point  -1.131330490    475.92             0.2824   1.64
  N3 if then else     1.000000000             344.31    0.3006   1.43
  N4 fixed point     12.000000000            1571.86    0.2004   2.23
  N5 sin,cos etc.     0.499110132              43.55    1.9104   1.87
  N6 floating point   0.999999821    264.15             2.0420   2.03
  N7 assignments      3.000000000             264.00    0.7000   2.96
  N8 exp,sqrt etc.    0.825148463              25.80    1.4420   1.18

  MWIPS                             1445.70             6.9171   1.86

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


######################### Pi 4B 32 Bit ###########################

 Whetstone Benchmark OpenJDK11 Java Version, May 15 2019, 18:48:20

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    524.02             0.0366
  N2 floating point  -1.131330490    494.12             0.2720
  N3 if then else     1.000000000             289.92    0.3570
  N4 fixed point     12.000000000            1092.99    0.2882
  N5 sin,cos etc.     0.499110132              59.86    1.3900
  N6 floating point   0.999999821    345.95             1.5592
  N7 assignments      3.000000000             331.54    0.5574
  N8 exp,sqrt etc.    0.825148463              25.41    1.4640

  MWIPS                             1687.92             5.9244

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version  11.0.2-BellSoft


JavaDraw Benchmark - JavaDrawPi.class

The benchmark uses small to rather excessive simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.

Pi 4B performance gains, shown below, ranged from 2.10 to 3.42 times.

At the end are 32 bit results from a Pi 4B test, using alternative Java software, with similar results.

############################# Pi 3B+ #############################

   Java Drawing Benchmark, Sep 20 2019, 11:08:33
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      335    33.46
  Display PNG Bitmap Twice Pass 2      546    54.53
  Plus 2 SweepGradient Circles         502    50.08
  Plus 200 Random Small Circles        366    36.59
  Plus 320 Long Lines                  134    13.30
  Plus 4000 Random Small Circles        46     4.59

         Total Elapsed Time  60.2 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version 1.8.0_222


############################# Pi 4B ##############################

   Java Drawing Benchmark, Sep 12 2019, 20:18:28
            Produced by javac 1.7.0_02

  Test                              Frames      FPS  Gains

  Display PNG Bitmap Twice Pass 1     1146   114.52   3.42
  Display PNG Bitmap Twice Pass 2     1318   131.79   2.42
  Plus 2 SweepGradient Circles        1237   123.66   2.47
  Plus 200 Random Small Circles        972    97.13   2.65
  Plus 320 Long Lines                  415    41.48   3.12
  Plus 4000 Random Small Circles        97     9.65   2.10

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version 1.8.0_222


######################### Pi 4B 32 Bit ###########################

   Java Drawing Benchmark, May 15 2019, 18:55:41
            Produced by OpenJDK 11 javac

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      877    87.65
  Display PNG Bitmap Twice Pass 2     1042   104.18
  Plus 2 SweepGradient Circles        1015   101.47
  Plus 200 Random Small Circles        779    77.85
  Plus 320 Long Lines                  336    33.52
  Plus 4000 Random Small Circles        83     8.25

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. arm, Version 4.19.37-v7l+
  Java Vendor         BellSoft, Version 11.0.2-BellSoft


OpenGL GLUT Benchmark - videogl64, videogl64g9, videogl32

In 2012, I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of the Unity desktop software. The program can be run as a benchmark, or selected functions, as a stress test of any duration.

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

Pi 4B average performance gains are included below. They were best with textured objects, at around 2.1 times, and worst with the slow kitchen displays, at around 1.5 times.

Dual Monitors - The benchmark was also run with two 1920x1080 monitors connected. It displayed two identical displays when the mirror option was selected. Without this, the normal display, from where the program is executed, appeared on one display, and the OpenGL images on the other. This was fine when the usual display dimensions, as shown below, were specified. With no parameters, full screen image was assumed to be 3840x1080 and this was displayed horizontally squashed into 1920 pixels. FPS measurements for the latter are shown below. On running the 32 bit version via Raspbian, the default display was 3840x1080, across both monitors, but only on one monitor, when 1920x1080 parameters or less were specified. There was no mirror option. See performance below.

In order to demonstrate maximum speeds, VSYNC (vblank) has to be switched off. The command is shown in the following script, which is used to run a series of tests.

 export vblank_mode=0
 ./videogl64g9 Width 160, Height 120, NoEnd
 ./videogl64g9 Width 320, Height 240, NoHeading, NoEnd
 ./videogl64g9 Width 640, Height 480, NoHeading, NoEnd
 ./videogl64g9 Width 1024, Height 768, NoHeading, NoEnd
 ./videogl64g9 NoHeading

The benchmark can also be run as a stress test, using run time parameters for running time and test to run, besides window size, as shown above.

32 bit Pi 4B results are also provided, in this case, a bit slower than the 64 bit speeds.

############################# Pi 3B+ #############################

 GLUT OpenGL Benchmark 64 Bit Version 1, Fri Sep 20 11:15:47 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    389.6    227.2    122.6     75.3     30.0     21.5
   320   240    328.1    201.7    113.8     73.3     30.2     21.3
   640   480    203.3    144.7     87.8     62.0     30.2     21.0
  1024   768    107.1     94.5     60.3     51.1     28.9     20.0
  1920  1080     45.3     47.5     36.9     33.1     28.7     20.0

############################## Pi 4B #############################

   160   120    767.4    420.3    258.3    154.3     45.7     31.7
   320   240    682.9    388.8    245.0    148.3     45.1     30.8
   640   480    367.1    262.6    217.9    140.1     46.2     30.9
  1024   768    150.8    148.8    128.6    117.3     45.3     30.4
  1920  1080     71.9     73.9     64.0     61.6     43.3     27.9

 Pi 4B Gains     1.77     1.74     2.12     2.10     1.52     1.46

 Dual Monitor - mirrored displays
  1920  1080     65.0     66.3     61.6     58.2     42.7     27.5

 Dual Monitor - not mirrored squashed image on one monitor
  3840  1080     60.9     59.6     57.2     54.8     40.8     26.8

 Dual Monitor 32 bit two monitors
  3840  1080     26.9     26.6     26.1     25.1     25.5     15.9

************************ Pi 4B 32 Bit ************************

 GLUT OpenGL Benchmark 32 Bit Version 1, Fri Oct 11 19:12:24 2019

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen

   320   240    663.3    365.9    218.6    126.3     33.1     23.5
   640   480    318.7    259.7    192.4    116.8     32.2     22.1
  1024   768    138.9    134.1    112.7    102.7     31.9     21.4
  1920  1080     57.5     56.1     53.3     50.0     29.3     19.5

 Avg 64b/32b     1.13     1.13     1.15     1.19     1.42     1.39
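The "Pi 4B Gains" row appears to be the average, over the five window sizes, of the Pi 4B to Pi 3B+ FPS ratios for each test. The sketch below is an illustration under that assumption, with the FPS values copied from the tables above and a made-up function name; it is not part of the benchmark.

```python
# FPS by window size (rows) and test (columns), copied from the 64 bit results.
PI_3B_PLUS = [
    [389.6, 227.2, 122.6, 75.3, 30.0, 21.5],
    [328.1, 201.7, 113.8, 73.3, 30.2, 21.3],
    [203.3, 144.7, 87.8, 62.0, 30.2, 21.0],
    [107.1, 94.5, 60.3, 51.1, 28.9, 20.0],
    [45.3, 47.5, 36.9, 33.1, 28.7, 20.0],
]
PI_4B = [
    [767.4, 420.3, 258.3, 154.3, 45.7, 31.7],
    [682.9, 388.8, 245.0, 148.3, 45.1, 30.8],
    [367.1, 262.6, 217.9, 140.1, 46.2, 30.9],
    [150.8, 148.8, 128.6, 117.3, 45.3, 30.4],
    [71.9, 73.9, 64.0, 61.6, 43.3, 27.9],
]


def average_gains(slow, fast):
    """For each test column, average the fast/slow FPS ratio over all rows."""
    gains = []
    for column in range(len(slow[0])):
        ratios = [f[column] / s[column] for s, f in zip(slow, fast)]
        gains.append(sum(ratios) / len(ratios))
    return gains


if __name__ == "__main__":
    print(" ".join(f"{gain:.2f}" for gain in average_gains(PI_3B_PLUS, PI_4B)))
```

Averaging the ratios, rather than dividing the column totals, gives every window size equal weight, so the very high frame rates of the small windows do not dominate the result.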


Stress Tests

The first stress tests used cover the central processor, for which an extra program was produced to measure the environment whilst running. Variable parameters are:

Passes and sampling seconds to determine running time. If the stress test also has sampling periods, it is normally not possible to synchronise them but approximate periods can be matched.

CPU MHz - This can vary faster than any sampling time based on seconds, but the general trend can be useful. Tests that measure speed over sampling periods provide a better indication.

Core Voltage - This appears to vary a little, reason unknown.

CPU Temperature - assuming that it is correct, as it changes slowly, this is the most useful measurement.

PMIC temperature - No issue so far with Power Management Integrated Circuit temperatures.

 ###################################################

 Parameters - upper or lower case

 ./RPiHeatMHzVolts2 passes 33 secs 20 log 12
 or
 ./RPiHeatMHzVolts2 P 33 S 20 L 12

 For 33 samples at 20 second intervals, log file RPiHeatMHz12.txt
 To cover 10 minute test

 ###################################################

 Temperature and CPU MHz Measurement

 Start at Mon Oct 28 20:49:52 2019

 Using 33 samples at 20 second intervals

 Seconds
    0.0   ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
   20.0   ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
   40.3   ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=66.5'C
   60.5   ARM MHz=1500, core volt=0.8437V, CPU temp=79.0'C, pmic temp=69.4'C
   80.7   ARM MHz=1500, core volt=0.8437V, CPU temp=80.0'C, pmic temp=70.3'C
  101.0   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C
  121.2   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  141.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  161.7   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  181.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
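For anyone wanting to summarise such a log, the following sketch extracts the samples with a regular expression and averages MHz and temperatures. It is a hypothetical post-processing helper, not part of RPiHeatMHzVolts2; the pattern simply assumes the line layout shown in the log above, and the embedded sample text is a four line excerpt from it.

```python
import re

# One monitor sample, for example:
#   20.0 ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
SAMPLE = re.compile(
    r"(?P<secs>\d+\.\d+)\s+ARM MHz=\s*(?P<mhz>\d+),\s*"
    r"core volt=(?P<volts>\d+\.\d+)V,\s*"
    r"CPU temp=(?P<cpu>\d+\.\d+)'C,\s*"
    r"pmic temp=(?P<pmic>\d+\.\d+)'C"
)


def parse_samples(text):
    """Return a list of (seconds, MHz, volts, CPU temp, PMIC temp) tuples."""
    return [
        (float(m["secs"]), int(m["mhz"]), float(m["volts"]),
         float(m["cpu"]), float(m["pmic"]))
        for m in SAMPLE.finditer(text)
    ]


def averages(samples):
    """Average MHz, CPU temperature and PMIC temperature over all samples."""
    count = len(samples)
    return (sum(s[1] for s in samples) / count,
            sum(s[3] for s in samples) / count,
            sum(s[4] for s in samples) / count)


# Excerpt from the log above.
LOG = """
   0.0   ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
  20.0   ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
 141.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
 181.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
"""

if __name__ == "__main__":
    mhz, cpu, pmic = averages(parse_samples(LOG))
    print(f"Averages {mhz:.0f} MHz, CPU {cpu:.1f}'C, PMIC {pmic:.1f}'C")
```

Because the pattern tolerates the padding in entries such as "MHz= 750", the same helper should also cope with samples recorded during throttling.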

Next are results for the High Performance Linpack that runs for a long time, significantly increasing CPU temperatures and slowing down, without a cooling fan being in place. These results can be compared with those for the 32 bit version, available in the report Raspberry Pi 4B Stress Tests Including High Performance Linpack.pdf. This shows that the same sort of performance levels as the 64 bit version are obtained, with and without a cooling fan.

Following HPL results here, are some for my integer and floating point stress tests. Although further comparative tests are needed to be conclusive, it does seem that the 64 bit floating point versions are faster than the 32 bit varieties and subject to lower temperature increases.


High Performance Linpack Stress Test

The earlier HPL benchmark results quoted obtained speeds of 8.1 GFLOPS on a cold start and 10.8 GFLOPS later, with a cooling fan in operation for both. The first results below were run without a fan, with a room temperature around 21°C, producing 7.6 GFLOPS on a cold start. Then average CPU frequency came out at 1056 MHz, with an average temperature of 80.3°C.

The second results followed a warm reboot to use a different version of Gentoo with HPL installed, obtaining 5.54 GFLOPS, with severe CPU frequency throttling, down to 600 MHz, and temperatures up to 86°C. Averages were 790 MHz and 84.1°C, as shown in the log below.

Shortly afterwards, with the fan in place, the Pi ran at 1500 MHz continuously, achieving 10.4 GFLOPS, with a maximum temperature of 64°C.

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             702.81              7.589e+00
HPL_pdgesv() start time Sat Aug 24 10:42:58 2019
HPL_pdgesv() end time   Sat Aug 24 10:54:41 2019
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008453 ...... PASSED
================================================================================

Example 2 - Note different sumchecks again

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             963.16              5.538e+00
HPL_pdgesv() start time Tue Oct 29 11:51:10 2019
HPL_pdgesv() end time   Tue Oct 29 12:07:13 2019
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009005 ...... PASSED
================================================================================

 Temperature and CPU MHz Measurement

 Start at Tue Oct 29 11:50:27 2019

 Using 40 samples at 30 second intervals

 Seconds
    0.0   ARM MHz=1500, core volt=0.8542V, CPU temp=63.0'C, pmic temp=58.0'C
   30.0   ARM MHz=1500, core volt=0.8542V, CPU temp=79.0'C, pmic temp=69.4'C
   60.3   ARM MHz=1000, core volt=0.8542V, CPU temp=83.0'C, pmic temp=72.2'C
   91.6   ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
  122.2   ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=74.1'C
  152.7   ARM MHz= 750, core volt=0.8490V, CPU temp=83.0'C, pmic temp=74.1'C
  183.2   ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
  213.8   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  244.3   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  274.7   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  305.2   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  335.6   ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  366.1   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  396.6   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  427.2   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  457.5   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  488.0   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  518.6   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
  549.0   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  579.6   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
  610.1   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  640.6   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  671.1   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  701.6   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
  732.0   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  762.4   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  792.9   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  823.4   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
  853.9   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  884.4   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  914.9   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  945.3   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  975.8   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
 1006.3   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
 1036.7   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
 1067.0   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C

 Averages          790                             84.1                75.5
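The Gflops figures follow from the problem size N and the elapsed time. The sketch below uses the operation count conventionally applied by HPL, 2/3 N^3 + 3/2 N^2 floating point operations, which is my assumption about how the figures were derived; it does reproduce both results reported above. The function name is illustrative only.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for solving an n x n system in `seconds`, using the
    conventional HPL operation count of 2/3 n^3 + 3/2 n^2."""
    operations = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return operations / seconds / 1.0e9


if __name__ == "__main__":
    cold = hpl_gflops(20000, 702.81)   # first run, no fan, cold start
    hot = hpl_gflops(20000, 963.16)    # second run, after a warm reboot
    print(f"{cold:.3e} {hot:.3e}")
    print(f"Hot start speed was {100.0 * hot / cold:.0f}% of the cold start speed")
```

As N is the same for both runs, the ratio of the two speeds is simply the inverse ratio of the elapsed times.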


Integer Stress Test - MP-IntStress64, MP-IntStress

As for my other CPU stress tests, the 4 and 8 thread results are shown, from running in benchmarking mode. Run time parameters are also provided, with the commands used for the particular tests being included.

In this case, a summary of separate tests for L1 cache, L2 cache and RAM are given. During the 10 minute sessions, the cache tests were mainly running at 1000 MHz, with those using RAM at the full speed 1500 MHz. No temperatures above 84°C were recorded.

Examining the full detail of the first test indicated that average CPU MHz and measured MB/second were around 75% of the maximum.

                 KB     KB     MB              Same All
  Secs Thrds     16    160     16   Sumcheck   Tests

   3.0    4   28715  26652   3345   5A5A5A5A    Yes
   3.0    8   30292  26310   3334   AAAAAAAA    Yes

 ./RPiHeatMHzVolts2 passes 66 secs 10 log 34 - used for all 10 minute stress tests

 ==== Stress Test Parameters - upper or lower case, only first letter counts ====

 Threads 1, 2, 4, 8, 16, 32   KB between 12 and 15624   Log < 100   Minutes any > 0

 ./MP-IntStress64 KB 16 Threads 8 Mins 10 Log 34

 Seconds                                                                       MB/sec
    0.0   ARM MHz=1500, core volt=0.8455V, CPU temp=62.0'C, pmic temp=57.1'C
   10.0   ARM MHz=1500, core volt=0.8455V, CPU temp=69.0'C, pmic temp=62.8'C   28695
   20.2   ARM MHz=1500, core volt=0.8402V, CPU temp=73.0'C, pmic temp=64.6'C   28729
  152.5   ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=72.2'C   21523
  305.5   ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C   20026
  448.2   ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C   19611
  601.1   ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C   19199
                                                                    %Min/Max    66.9

 ./MP-IntStress64 KB 160 Threads 8 Mins 10 Log 34

 Seconds                                                                       MB/sec
    0.0   ARM MHz=1500, core volt=0.8402V, CPU temp=64.0'C, pmic temp=57.1'C
   10.0   ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C   26323
   20.2   ARM MHz=1500, core volt=0.8402V, CPU temp=75.0'C, pmic temp=66.5'C   26140
  152.9   ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=74.1'C   18016
  306.5   ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C   17306
  449.8   ARM MHz=1000, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C   17248
  603.3   ARM MHz= 750, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C   16832
                                                                    %Min/Max    63.9

 ./MP-IntStress64 KB 16000 Threads 8 Mins 10 Log 34

 Seconds                                                                       MB/sec
    0.0   ARM MHz=1500, core volt=0.8402V, CPU temp=66.0'C, pmic temp=60.9'C
   10.0   ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C    3372
   20.3   ARM MHz=1500, core volt=0.8402V, CPU temp=72.0'C, pmic temp=62.8'C    3369
  155.2   ARM MHz=1500, core volt=0.8402V, CPU temp=76.0'C, pmic temp=68.4'C    3365
  309.8   ARM MHz=1500, core volt=0.8402V, CPU temp=79.0'C, pmic temp=69.4'C    3367
  454.4   ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C    3367
  599.7   ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C    3368
                                                                    %Min/Max    99.8


Single Precision Floating Point Stress Test - MP-FPUStress64, MP-FPUStress

Two sets of result summaries are provided below, both using 1280 KB memory space and 8 threads. With four cores, this results in data being in L2 cache (4 x 160 KB) to run at full speed, with additional overhead of moving data to/from RAM. One test uses 8 operations per word, with 32 in the other. With hot starts, neither reached a CPU temperature of 84°C and had similar performance degradation at the highest temperatures.

After writing the above, the 32 bit stress test was repeated, with results shown below. Although not conclusive from a single run, they indicate that the impact was more severe than in the 64 bit run, with CPU speed samples reducing to 600 MHz, higher temperatures and a larger performance degradation.

             Ops/     KB     KB     MB     KB     KB     MB
  Secs Thrd  Word   12.8    128   12.8   12.8    128   12.8

   4.6   T4     2   9223   7520    519  40392  76406  99700
   6.0   T8     2   9520  10471    545  40392  76406  99700
  11.3   T4     8  19087  21040   2044  54764  85092  99820
  12.9   T8     8  19747  21107   2016  54764  85092  99820
  22.2   T4    32  25732  26704   9160  35206  66015  99520
  24.1   T8    32  25708  25770   8927  35206  66015  99520

 ==== Stress Test Parameters - upper or lower case, only first letter counts ====

 Threads 1,2,4,8,16,32,64   KB 12 to 15624   Ops/Wordd 2,8,32   Log<100   Minutes any>0

 ./MP-FPUStress64 KB 1280 T 8 Ops 8 Mins 10 Log 33

 Seconds                                                                       MFLOPS
    0.0   ARM MHz=1500, core volt=0.8437V, CPU temp=64.0'C, pmic temp=59.0'C
   10.0   ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C   17309
   20.2   ARM MHz=1500, core volt=0.8437V, CPU temp=75.0'C, pmic temp=66.5'C   18018
  101.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   14224
  204.2   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   12806
  306.8   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=73.1'C   12447
  409.4   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C   11870
  501.6   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   12191
  604.1   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   12169
                                                                    %Min/Max    65.9

 ./MP-FPUStress64 KB 1280 T 8 Ops 32 Mins 10 Log 33

 Seconds                                                                       MFLOPS
    0.0   ARM MHz=1500, core volt=0.8437V, CPU temp=65.0'C, pmic temp=59.0'C
   10.0   ARM MHz=1500, core volt=0.8437V, CPU temp=72.0'C, pmic temp=65.6'C   22634
   20.2   ARM MHz=1500, core volt=0.8437V, CPU temp=76.0'C, pmic temp=67.5'C   22992
  101.9   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   18629
  204.0   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C   16674
  306.3   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   16448
  408.6   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   16158
  500.7   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   16081
  603.0   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   15553
                                                                    %Min/Max    67.6

 ======================================================================================

 32 Bit Version

 ./MP-FPUStress KB 1280 T 8 Ops 32 Mins 10 Log 73

 Seconds                                                                       MFLOPS
    0.0   ARM MHz=1500, core volt=0.8560V, CPU temp=56.0'C, pmic temp=50.5'C
   10.0   ARM MHz=1500, core volt=0.8560V, CPU temp=70.0'C, pmic temp=60.9'C   20233
   20.7   ARM MHz=1500, core volt=0.8560V, CPU temp=74.0'C, pmic temp=64.6'C   20221
  106.4   ARM MHz=1000, core volt=0.8560V, CPU temp=83.0'C, pmic temp=70.3'C   14173
  204.3   ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=73.1'C   13115
  302.2   ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C   12650
  400.2   ARM MHz= 750, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C   11957
  508.8   ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C   11485
  585.1   ARM MHz= 600, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C   11454
  606.9   ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C   11242
                                                                    %Min/Max    55.6
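The %Min/Max figure summarises the performance degradation. The sketch below assumes it is the lowest MFLOPS sample as a percentage of the highest. That assumption reproduces the three figures above from the sample rows shown, although the full logs contain more samples than this summary. The function and variable names are illustrative only.

```python
def percent_min_max(samples):
    """Lowest sample as a percentage of the highest sample."""
    return 100.0 * min(samples) / max(samples)


# MFLOPS samples copied from the three summaries above.
BITS64_OPS8 = [17309, 18018, 14224, 12806, 12447, 11870, 12191, 12169]
BITS64_OPS32 = [22634, 22992, 18629, 16674, 16448, 16158, 16081, 15553]
BITS32_OPS32 = [20233, 20221, 14173, 13115, 12650, 11957, 11485, 11454, 11242]

if __name__ == "__main__":
    for name, samples in (("64 bit Ops 8", BITS64_OPS8),
                          ("64 bit Ops 32", BITS64_OPS32),
                          ("32 bit Ops 32", BITS32_OPS32)):
        print(f"{name:14s} %Min/Max {percent_min_max(samples):.1f}")
```

A lower percentage means a larger slowdown, so 55.6 for the 32 bit run against 67.6 for the equivalent 64 bit run reflects the more severe throttling noted in the text.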


Double Precision Floating Point Stress Test - MP-FPUStress64DP, MP-FPUStressDP

Below are full results for a 10 minute test using the double precision floating point stress test, with data in L2 cache and four cores in use. Although the measured MFLOPS was greater than that obtained by HPL, the same range of high temperatures and performance degradation was not generated.

The 32 bit version was also rerun, producing similar results to those at 64 bits.

              Ops/     KB     KB     MB       KB      KB      MB
  Secs  Thrd  Word   12.8    128   12.8     12.8     128    12.8

   8.9   T4      2   5024   4589    257    40395   76384   99700
  11.5   T8      2   5089   5545    280    40395   76384   99700
  21.7   T4      8  10259  10011   1068    54805   85108   99820
  24.7   T8      8  10239  10824   1036    54805   85108   99820
  43.1   T4     32  12940  13200   4497    35159   66065   99521
  46.9   T8     32  13200  13049   4557    35159   66065   99521

 ==== Stress Test Parameters - upper or lower case, only first letter counts ====

 Threads 1,2,4,8,16,32,64   KB 12 to 15624   Ops/Word 2,8,32   Log<100   Minutes any>0

 ./MP-FPUStress64DP KB 1280 T 8 Ops 32 Mins 10 Log 31

 Seconds                                                                   MFLOPS

    0.0 ARM MHz=1500, core volt=0.8437V, CPU temp=63.0'C, pmic temp=57.1'C
   10.0 ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C  12718
   20.2 ARM MHz=1500, core volt=0.8437V, CPU temp=74.0'C, pmic temp=66.5'C  12755
   30.5 ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=68.4'C  12750
   40.7 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C  12755
   50.9 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C  12183
   61.2 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  11358
   71.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  10922
   81.6 ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C  10333
   91.8 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9948
  102.0 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9692
  112.3 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9466
  122.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   9217
  132.8 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C   9181
  143.0 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   9145
  153.2 ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C   9043
  163.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8921
  173.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   9838
  183.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8755
  194.1 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8737
  204.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   8721
  214.7 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8721
  224.9 ARM MHz=1500, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C   8670
  235.1 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C   8619
  245.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8592
  255.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8592
  265.9 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8540
  276.2 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C   8488
  286.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8547
  296.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8510
  307.0 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8473
  317.2 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8507
  327.5 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8541
  337.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8544
  347.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8464
  358.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8531
  368.4 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8495
  378.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8460
  388.9 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8514
  399.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8484
  409.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8454
  419.6 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8459
  429.8 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8489
  440.1 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8472
  450.3 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8428
  460.6 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8384
  470.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8384
  481.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8387
  491.4 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8391
  501.7 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8244
  511.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8346
  522.1 ARM MHz= 750, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8272
  532.4 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8272
  542.6 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8329
  552.8 ARM MHz= 750, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8239
  563.1 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8183
  573.3 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8129
  583.6 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8343
  593.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8266
  604.1 ARM MHz=1000, core volt=0.8437V, CPU temp=85.0'C, pmic temp=74.1'C   8190

 %Min/Max                                                                    63.7
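The %Min/Max figure at the end of the log can be reproduced from the MFLOPS column. The following is a minimal sketch, not the author's program: it assumes log lines in the format shown above, and the names `parse_log` and `summarise` are hypothetical. Only five lines from the log are used as sample input, chosen to include the highest and lowest MFLOPS readings.

```python
import re

# A few lines copied from the log above (including the maximum and minimum MFLOPS).
SAMPLE_LOG = """\
   0.0 ARM MHz=1500, core volt=0.8437V, CPU temp=63.0'C, pmic temp=57.1'C
  20.2 ARM MHz=1500, core volt=0.8437V, CPU temp=74.0'C, pmic temp=66.5'C 12755
  71.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 10922
 522.1 ARM MHz= 750, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C  8272
 573.3 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C  8129
"""

# Assumed line layout; the trailing MFLOPS value is optional (absent at 0.0 seconds).
LINE = re.compile(
    r"^\s*(?P<secs>\d+\.\d+)\s+ARM MHz=\s*(?P<mhz>\d+), core volt=(?P<volt>[\d.]+)V, "
    r"CPU temp=(?P<cpu>[\d.]+)'C, pmic temp=(?P<pmic>[\d.]+)'C(?:\s+(?P<mflops>\d+))?\s*$"
)


def parse_log(text):
    """Return one dict per recognised log line; unrecognised lines are skipped."""
    samples = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        samples.append({
            "secs": float(m["secs"]),
            "mhz": int(m["mhz"]),
            "cpu_c": float(m["cpu"]),
            "mflops": int(m["mflops"]) if m["mflops"] else None,
        })
    return samples


def summarise(samples, full_mhz=1500):
    """Minimum MFLOPS as a percentage of maximum, plus simple throttling counts."""
    rates = [s["mflops"] for s in samples if s["mflops"] is not None]
    return {
        "min_max_pct": 100.0 * min(rates) / max(rates),
        "throttled": sum(1 for s in samples if s["mhz"] < full_mhz),
        "max_cpu_c": max(s["cpu_c"] for s in samples),
    }


summary = summarise(parse_log(SAMPLE_LOG))
print(f"%Min/Max {summary['min_max_pct']:.1f}, throttled samples {summary['throttled']}")
```

With the full log as input, the same calculation uses 8129 and 12755 MFLOPS as the minimum and maximum, which corresponds to the 63.7% reported.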


OpenGL + 3 x Livermore Loops - liverloopsPi64Rg9, liverloopsPi64, liverloopsPiA7R

To make it easier to run these stress tests, lxterminal was installed and the script shown below was used to open four terminal windows, running the environmental monitor program plus three copies of a modified Loops benchmark that allows different log files to be specified. This executes 72 loops for a minimum time of 12 seconds each. The second script file is provided to run the kitchen display tests for 16 minutes in full screen mode. A further terminal was opened to run the vmstat resource monitor.

The tests were run twice, without and with a cooling fan in place. Results are shown below. In this case, the tests without a fan were not much slower, averaging 77 to 80% of the fan cooled speeds for OpenGL FPS, CPU MHz and total Loops MFLOPS.

These results were produced with all programs compiled by gcc 9, and the tests were not run on a hot day. Compared with performance using 32 bit versions, detailed in this 32 Bit Report, the 64 bit results were far better, but the former were produced by an older compiler and run on a hot day. The tests were therefore repeated, using 32 bit programs produced by the later gcc 8 compiler.

As before, the 64 bit gcc 9 Livermore Loops and OpenGL single core benchmarks were faster than the new 32 bit versions, by 14% for the former and 40% for the latter. During the stress test, both versions had similar average CPU MHz, CPU temperature and PMIC temperature, with the 64 bit FPS and MFLOPS maintaining a performance advantage, in ratios similar to those obtained from the single core tests.

 run.sh

 lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
 lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 21 &
 lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 22 &
 lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 23

 runogl.sh

 export vblank_mode=0 &
 ./videogl64g9 Test 6 Mins 16 Log 20

                   No Fan                       With Fan
 Seconds    MHz CPU C PMIC C   FPS      MHz CPU C PMIC C   FPS

       0   1500    57     51           1500    37     32
      30   1500    75     63    27     1500    53     44    27
      60   1500    76     68    29     1500    53     44    28
      90   1500    81     72    25     1500    58     50    27
     120   1500    81     70    23     1500    55     48    26
     150   1000    82     74    23     1500    57     49    29
     180   1000    80     72    22     1500    54     47    27
     210   1000    81     72    24     1500    55     46    29
     240   1500    80     72    26     1500    54     44    28
     270   1500    81     72    27     1500    55     47    28
     300   1000    82     72    22     1500    56     48    29
     330   1500    82     72    24     1500    56     50    29
     360   1000    82     72    24     1500    56     49    28
     390   1000    82     72    22     1500    58     50    26
     420   1000    83     72    22     1500    57     50    26
     450   1000    82     74    19     1500    56     50    30
     480   1000    82     74    21     1500    56     48    28
     510   1000    82     72    22     1500    54     46    29
     540   1000    81     72    22     1500    55     47    30
     570   1500    81     72    24     1500    55     47    30
     600   1000    82     74    24     1500    57     49    30
     630   1500    81     72    23     1500    58     51    29
     660   1000    82     72    23     1500    57     50    29
     690   1000    83     73    22     1500    59     51    28
     720   1000    83     72    21     1500    57     51    28
     750   1000    82     74    21     1500    57     50    29
     780   1000    84     74    19     1500    54     47    29
     810   1000    82     72    19     1500    56     48    29
     840   1000    82     72    20     1500    54     46    29
     870   1000    82     72    20     1500    53     46    30
     900   1000    82     72    23     1500    49     42    31

 Average   1161    81     71    23     1500    55     47    29
 Minimum   1000    57     51    19     1500    37     32    26
 Maximum   1500    84     74    29     1500    59     51    31

 % Hot/Cold
 Average     77    68     66    80
 Minimum     67    65     61    73
 Maximum    100    70     69    94

                   No Fan                       With Fan
 MFLOPS  Average Geomean Harmean      Average Geomean Harmean

       1     684     562     453          898     732     590
       2     716     574     451          887     712     571
       3     716     566     438          895     724     582

 Total %Hot/Cold MFLOPS                    79      78      77
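As a check on the % Hot/Cold figures, the ratios can be recomputed from the averages in the table above. This is only a sketch using the rounded table values, so results can differ by a point from figures that were presumably derived from unrounded data (FPS, for example, comes out nearer 79 than 80). The helper name `pct` is hypothetical.

```python
# Averages copied from the table above (no fan = hot, with fan = cold).
NO_FAN = {"mhz": 1161, "mflops_avg": [684, 716, 716]}
WITH_FAN = {"mhz": 1500, "mflops_avg": [898, 887, 895]}


def pct(hot, cold):
    """Hot result as a percentage of the cold (fan cooled) result."""
    return 100.0 * hot / cold


mhz_pct = pct(NO_FAN["mhz"], WITH_FAN["mhz"])
# Total MFLOPS is taken as the sum over the three Livermore Loops copies.
mflops_pct = pct(sum(NO_FAN["mflops_avg"]), sum(WITH_FAN["mflops_avg"]))

print(f"CPU MHz hot/cold      {mhz_pct:.0f}%")
print(f"Total MFLOPS hot/cold {mflops_pct:.0f}%")
```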


Input/Output Stress Test - burnindrive264g9, burnindrive2

This is essentially the same as my program used during hundreds of UK Government and University computer acceptance trials during the 1970s and 1980s, with some significant achievements. Burnindrive writes four files, each using 164 blocks of 64 KB, repeated 16 times (164.0 MB), with each block containing a unique data pattern. The files are then read for two minutes, in a quasi-random file sequence, with data and file ID checked for correct values. Then each block (unique pattern) is read numerous times, over one second, again with checking for correct values. Total time is normally about 5 minutes for all tests, with default parameters. The data patterns are shown below, followed by run time parameters, then examples of results provided, including added calculations of speed.


Patterns

 No.    Hex No.     Hex No.     Hex No.     Hex  No.     Hex No.      Hex No.      Hex

  1       0 25   800000 49        3 73       FF  97 FFFFDFFF 121 FFFFEAAA 145 FFFFF0F0
  2       1 26  1000000 50       33 74   FF00FF  98 FFFFBFFF 122 FFFFAAAA 146 FFF0F0F0
  3       2 27  2000000 51      333 75      1FF  99 FFFF7FFF 123 FFFEAAAA 147 F0F0F0F0
  4       4 28  4000000 52     3333 76      3FF 100 FFFEFFFF 124 FFFAAAAA 148 FFFFFFE0
  5       8 29  8000000 53    33333 77      7FF 101 FFFDFFFF 125 FFEAAAAA 149 FFFF83E0
  6      10 30 10000000 54   333333 78      FFF 102 FFFBFFFF 126 FFAAAAAA 150 FE0F83E0
  7      20 31 20000000 55  3333333 79     1FFF 103 FFF7FFFF 127 FEAAAAAA 151 FFFFFFC0
  8      40 32 40000000 56 33333333 80     3FFF 104 FFEFFFFF 128 FAAAAAAA 152 FFFC0FC0
  9      80 33        1 57        7 81     7FFF 105 FFDFFFFF 129 EAAAAAAA 153 FFFFFF80
 10     100 34        5 58      1C7 82     FFFF 106 FFBFFFFF 130 AAAAAAAA 154 FFE03F80
 11     200 35       15 59     71C7 83 FFFFFFFF 107 FF7FFFFF 131 FFFFFFFC 155 FFFFFF00
 12     400 36       55 60   1C71C7 84 FFFFFFFE 108 FEFFFFFF 132 FFFFFFCC 156 FF00FF00
 13     800 37      155 61  71C71C7 85 FFFFFFFD 109 FDFFFFFF 133 FFFFFCCC 157 FFFFFE00
 14    1000 38      555 62        F 86 FFFFFFFB 110 FBFFFFFF 134 FFFFCCCC 158 FFFFFC00
 15    2000 39     1555 63      F0F 87 FFFFFFF7 111 F7FFFFFF 135 FFFCCCCC 159 FFFFF800
 16    4000 40     5555 64    F0F0F 88 FFFFFFEF 112 EFFFFFFF 136 FFCCCCCC 160 FFFFF000
 17    8000 41    15555 65  F0F0F0F 89 FFFFFFDF 113 DFFFFFFF 137 FCCCCCCC 161 FFFFE000
 18   10000 42    55555 66       1F 90 FFFFFFBF 114 BFFFFFFF 138 CCCCCCCC 162 FFFFC000
 19   20000 43   155555 67     7C1F 91 FFFFFF7F 115 FFFFFFFE 139 FFFFFFF8 163 FFFF8000
 20   40000 44   555555 68  1F07C1F 92 FFFFFEFF 116 FFFFFFFA 140 FFFFFE38 164 FFFF0000
 21   80000 45  1555555 69       3F 93 FFFFFDFF 117 FFFFFFEA 141 FFFF8E38
 22  100000 46  5555555 70    3F03F 94 FFFFFBFF 118 FFFFFFAA 142 FFE38E38
 23  200000 47 15555555 71       7F 95 FFFFF7FF 119 FFFFFEAA 143 F8E38E38
 24  400000 48 55555555 72   1FC07F 96 FFFFEFFF 120 FFFFFAAA 144 FFFFFFF0

 Sequences - First 16

 No.   File         No.   File          No.   File          No.   File

  1    0  1  2  3    5    0  2  1  3     9    0  3  1  2    13    0  1  2  3
  2    1  2  3  0    6    1  3  2  0    10    1  0  3  2    14    1  2  3  0
  3    2  3  0  1    7    2  0  1  3    11    2  1  0  3    15    2  3  0  1
  4    3  0  2  1    8    3  1  2  0    12    3  2  1  0    16    3  0  2  1
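
The write-then-verify idea behind these patterns can be sketched as follows. This is an illustration under stated assumptions, not the burnindrive source: it uses only six of the 164 patterns listed above, omits the file ID check and the O_DIRECT option, writes to a temporary directory, and all function names are hypothetical. Without O_DIRECT, reads may be served from the operating system cache rather than the drive.

```python
import os
import struct
import tempfile

BLOCK_BYTES = 64 * 1024  # 64 KB block, as described in the text

# A few of the 32 bit patterns from the table above (Nos. 1, 2, 48, 130, 147, 164).
PATTERNS = [0x0, 0x1, 0x55555555, 0xAAAAAAAA, 0xF0F0F0F0, 0xFFFF0000]


def make_block(pattern):
    """Fill one 64 KB block with a repeated 32 bit word."""
    return struct.pack("<I", pattern) * (BLOCK_BYTES // 4)


def write_file(path, patterns):
    """Write one block per pattern and force the data out to the device."""
    with open(path, "wb") as f:
        for p in patterns:
            f.write(make_block(p))
        f.flush()
        os.fsync(f.fileno())


def verify_file(path, patterns):
    """Return the number of blocks that differ from the expected pattern."""
    errors = 0
    with open(path, "rb") as f:
        for p in patterns:
            if f.read(BLOCK_BYTES) != make_block(p):
                errors += 1
    return errors


def demo():
    """Write and verify a small file in a temporary directory."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "burnin_demo.bin")
        write_file(path, PATTERNS)
        return verify_file(path, PATTERNS)


print("blocks in error:", demo())
```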

 ###########################################################################

Run Time Parameters - Upper or Lower Case
                                                                      Default
R or Repeats             Data size, multiplier of 10.25 MB, more or less     16
P or Patterns            Number of patterns for smaller files < 164         164
M or Minutes             Large file reading time                              2
L or Log                 Log file name extension 0 to 99                      0
S or Seconds             Time to read each block, last section                1
F or FilePath            For other than SD card or SD card directory
C or CacheData           Omit O_DIRECT on opening files to allow caching      No  
O or OutputPatterns      Log patterns and file sequences used as above        No
D or DontRunReadTests    Or only run write tests                              No   

  Format ./burnindrive2 Repeats 16, Minutes 2, Log 0, Seconds 1 
     or  ./burnindrive2 R 16, M 2, L 0, S 1

 ###########################################################################

Examples of Results Main SD Card Default Parameters

   File 1  164.00 MB written in   14.66 seconds                - 11.2 MB/second
To File 4  164.00 MB written in   12.15 seconds                - 13.5 MB/second 

   Read passes     1 x 4 Files x  164.00 MB in  0.33 minutes   - 33.1 MB/second
To Read passes     7 x 4 Files x  164.00 MB in  2.28 minutes   - 33.6 MB/second

   Passes in 1 second(s) for each of 164 blocks of 64KB:       - 164 measurements

    580    580    580    580    580    580    580    580    580    580    580
    580    580    580    580    580    580    580    580    580    580    580

    95120 read passes of 64KB blocks in  2.76 minutes          - 36.8 MB/second
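
The added speed calculations can be reproduced from the reported sizes and times. The sketch below is a check on the arithmetic, not the program's own code. One detail is an inference from the numbers rather than a documented fact: the final 36.8 MB/second is only reproduced if each 64 KB block is counted as 0.064 MB (division by 1000).

```python
FILE_MB = 164.0  # size of each of the four files


def mb_per_second(megabytes, seconds):
    """Throughput in MB/second."""
    return megabytes / seconds


# Writing: one 164 MB file in 14.66 seconds.
write_speed = mb_per_second(FILE_MB, 14.66)

# Large file reading: 7 passes x 4 files x 164 MB in 2.28 minutes.
read_speed = mb_per_second(7 * 4 * FILE_MB, 2.28 * 60)

# Block reading: 580 passes for each of 164 blocks of 64 KB, in 2.76 minutes.
block_passes = 580 * 164
block_mb = block_passes * 64 / 1000  # assumption: KB converted to MB by dividing by 1000
block_speed = mb_per_second(block_mb, 2.76 * 60)

print(f"write {write_speed:.1f}, read {read_speed:.1f}, block read {block_speed:.1f} MB/second")
```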


CPU + Main SD + USB + LAN Test

A system test was run using the following script file, comprising commands to run programs to monitor the environment, and others to exercise the main SD card, two USB 3 drives, 1 Gbps Ethernet and CPU floating point with two threads. The programs were run via the script file so that they all started at the same time, as indicated in the summaries below. They also all ran for between 12 and 13 minutes. The performance of each program when run by itself (BI) is also shown, often not much better than in the combined test. Performance is not as high as shown by other benchmarks, probably because data transfers are based on 64 KB block sizes and all data in each block is checked for correctness.

A snapshot of vmstat system performance is also provided. The bo and bi KB/second writing and reading speeds are essentially the same as the sum of those reported by the programs handling the main and USB drives. LAN speeds are not included in vmstat. Total CPU utilisation (us + sy) is shown to be nearly 90% at the start of writing and closer to 75% on reading. These are averages per core, so are equivalent to at least three of the four cores running at 100%. The Speeds and Temperature details that follow show variations in performance with time.

 ############################### Script File ###############################

 lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
 lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 Log 21 &
 lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 FilePath /run/media/demouser/PATRIOT Log 22 &
 lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 FilePath /run/media/demouser/REMIXOSSYS Log 23 &
 lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 FilePath /media/public/test Log 24 &
 lxterminal -e ./MP-FPUStress64 KB 256 T 2 Ops 32 Mins 12 Log 33
 vmstat 10 96 > vmstat.txt

 ############################################################################

 Main SD Drive                   Tue Nov 5 15:47:03 2019
 End of test                     Tue Nov 5 16:00:06 2019

 Write 164 MB x files 4           53.6 seconds = 12.2 MB/second (BI 12.7)
 Read  164 MB x files 3 x 4       67.2 seconds = 29.3 MB/second (BI 33.6)
 Read  329480 x 64 KB            659.4 seconds = 32.0 MB/second (BI 36.8)

 ============================================================

 USB 3 Drive 1                   Tue Nov 5 15:47:03 2019
 End of test                     Tue Nov 5 15:59:31 2019

 Write 164 MB x files 4           17.5 seconds = 37.5 MB/second (BI 68.3)
 Read  164 MB x files 6 x 4       72.0 seconds = 54.7 MB/second (BI 75.0)
 Read  735800 x 64 KB            657.6 seconds = 71.6 MB/second (BI 66.5)

 ============================================================

 USB 3 Drive 2                   Tue Nov 5 15:47:03 2019
 End of test                     Tue Nov 5 15:59:57 2019

 Write 164 MB x files 4           37.4 seconds = 17.5 MB/second (BI 23.8)
 Read  164 MB x files 3 x 4       75.6 seconds = 26.0 MB/second (BI 28.5)
 Read  282740 x 64 KB            660.0 seconds = 27.4 MB/second (BI 29.8)

 ============================================================

 1 Gbps LAN                      Tue Nov 5 15:47:03 2019
 End of test                     Tue Nov 5 15:59:35 2019

 Write 164 MB x files 4           18.1 seconds = 36.2 MB/second (BI 55.7)
 Read  164 MB x files 3 x 4       74.4 seconds = 26.4 MB/second (BI 34.0)
 Read  303920 x 64 KB            659.4 seconds = 29.5 MB/second (BI 45.3)

 ============================================================

 MP-Threaded-MFLOPS 64 Bit v1.1  Tue Nov 5 15:47:03 2019
 End of test                     Tue Nov 5 15:59:13 2019

 2 core GFLOPS 10.9 to 7.4 with CPU throttling.
 See RPiHeatMHzVolts2 results where detail is included

 ============================================================

 From vmstat 10 second sampling

 Secs procs ---------memory---------- ---swap-- -----io---- --system-- ------cpu-----
       r  b swpd    free   buff  cache   si   so     bi    bo    in    cs us sy id wa st

   10  5  3    0 3059800  94956 346060    0    0     14 63204 17819 19051 51 38  2  9  0
   20  3  2    0 3058696  95248 346704    0    0  14265 60713 17613 18789 51 33  4 12  0
   60  4  2    0 3061196  95668 343572    0    0  93479  7577 24239 24987 54 19  4 23  0
   70  4  3    0 3050632  95684 353600    0    0 112115    24 24496 25316 54 20 12 14  0
  710  3  3    0 3058696  96532 349460    0    0 132992    16 18936 22387 53 22  3 22  0
  720  5  1    0 3058728  96548 349452    0    0 134400    13 20635 23842 54 23  1 23  0
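The utilisation and throughput comparisons made from the vmstat snapshot can be checked with a short calculation. This is a sketch only: it uses the first and last vmstat rows above, assumes the bi and bo columns can be read as KB/second (as the text does), and compares a 10 second snapshot with speeds averaged over longer phases, so only rough agreement should be expected. The names `parse_rows` and `busy_cores` are hypothetical.

```python
# First and last rows of the vmstat snapshot above, with the added Secs column.
VMSTAT_ROWS = """\
 10 5 3 0 3059800 94956 346060 0 0 14 63204 17819 19051 51 38 2 9 0
720 5 1 0 3058728 96548 349452 0 0 134400 13 20635 23842 54 23 1 23 0
"""

FIELDS = ["secs", "r", "b", "swpd", "free", "buff", "cache", "si", "so",
          "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st"]
CORES = 4  # the Raspberry Pi 4B has four CPU cores


def parse_rows(text):
    """Split each non-empty row into a dict of integer fields."""
    return [dict(zip(FIELDS, map(int, line.split())))
            for line in text.splitlines() if line.strip()]


def busy_cores(row):
    """Convert average us + sy utilisation into an equivalent number of fully busy cores."""
    return (row["us"] + row["sy"]) / 100.0 * CORES


rows = parse_rows(VMSTAT_ROWS)
writing, reading = rows[0], rows[1]

# MB/second reported by the programs for the main SD card and the two USB drives.
write_sum = 12.2 + 37.5 + 17.5   # write phase
read_sum = 32.0 + 71.6 + 27.4    # final 64 KB block reading phase

print(f"writing: us+sy {writing['us'] + writing['sy']}% = {busy_cores(writing):.2f} cores, "
      f"bo {writing['bo'] / 1000:.1f} MB/s vs programs {write_sum:.1f}")
print(f"reading: us+sy {reading['us'] + reading['sy']}% = {busy_cores(reading):.2f} cores, "
      f"bi {reading['bi'] / 1000:.1f} MB/s vs programs {read_sum:.1f}")
```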


Speeds and Temperature - These tests were run without an active cooling fan, resulting in some CPU throttling, with clock speed down to 1000 MHz some of the time, when the temperature reached 80°C. The MP-Threaded-MFLOPS dual core performance measurements have been added to the environmental details, mainly indicating the effects of throttling.

The final burnindrive results record the number of read passes in 4 seconds for each block, in a table comprising 14 lines of 11 recordings and one with 10 (164 in total), over approximately 11 minutes. The average burnindrive results for each line are provided below, not exactly synchronised, but giving an indication of changes in throughput with time. Total passes and percentage degradation are also shown, the latter not being as severe as the CPU speed reductions.

 Temperature and CPU MHz Measurement + MP-FPUStress64 2 Core MFLOPS

 Start at Tue Nov 5 15:47:03 2019

 Using 25 samples at 30 second intervals

 Seconds                                                                   MFLOPS

    0.0 ARM MHz=1500, core volt=0.8560V, CPU temp=66.0'C, pmic temp=59.0'C
   30.0 ARM MHz=1500, core volt=0.8560V, CPU temp=75.0'C, pmic temp=65.6'C  10890
   60.2 ARM MHz=1500, core volt=0.8560V, CPU temp=78.0'C, pmic temp=68.4'C  10551
   90.4 ARM MHz=1500, core volt=0.8560V, CPU temp=80.0'C, pmic temp=70.3'C  10549
  120.6 ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C  10452
  150.8 ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C   9862
  181.1 ARM MHz=1000, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C   9482
  211.4 ARM MHz=1500, core volt=0.8560V, CPU temp=82.0'C, pmic temp=72.2'C   9137
  241.6 ARM MHz=1500, core volt=0.8507V, CPU temp=81.0'C, pmic temp=72.2'C   9132
  271.9 ARM MHz=1000, core volt=0.8507V, CPU temp=82.0'C, pmic temp=70.3'C   9122
  302.2 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   9389
  332.4 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8550
  362.7 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   9043
  392.9 ARM MHz=1500, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8045
  423.3 ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8174
  453.6 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8444
  483.9 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8335
  514.3 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7951
  544.6 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8125
  574.8 ARM MHz=1500, core volt=0.8455V, CPU temp=83.0'C, pmic temp=72.2'C   8078
  605.1 ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8280
  635.4 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7845
  665.7 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7761
  696.0 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=73.1'C   8341
  726.2 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7407

 Passes in 4 seconds for each of 164 blocks of 64KB

 Seconds  Main SD   USB 1   USB 2     LAN   Total  %First

      44     2013    4522    1884    1915   10333     100
      88     2007    4533    1838    1911   10289     100
     132     2016    4496    1760    1809   10082      98
     176     2011    4536    1785    1845   10178      99
     220     2002    4493    1729    1913   10136      98
     264     1971    4262    1751    1904    9887      96
     308     1980    4540    1747    1911   10178      99
     352     2002    4464    1660    1845    9971      96
     396     1987    4442    1629    1844    9902      96
     440     1964    4453    1585    1771    9773      95
     484     1995    4504    1635    1731    9864      95
     528     1989    4229    1696    1762    9676      94
     572     1947    4616    1684    1833   10080      98
     616     2013    4476    1660    1798    9947      96
     660     2262    4758    1826    2022   10868     105
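The difference in degradation between drive throughput and CPU speed can be quantified from the two tables above. The sketch below uses the rounded table values: the first and lowest Total passes, and the first and lowest MFLOPS readings. The variable names are hypothetical.

```python
# Total read passes per 4 seconds: first line and lowest line of the table above.
FIRST_TOTAL, LOWEST_TOTAL = 10333, 9676

# MP-FPUStress64 2 core MFLOPS: first reading and lowest reading in the log above.
FIRST_MFLOPS, LOWEST_MFLOPS = 10890, 7407


def pct_of_first(value, first):
    """Value as a percentage of the first measurement."""
    return 100.0 * value / first


io_pct = pct_of_first(LOWEST_TOTAL, FIRST_TOTAL)
cpu_pct = pct_of_first(LOWEST_MFLOPS, FIRST_MFLOPS)

print(f"lowest drive/LAN throughput {io_pct:.0f}% of first")
print(f"lowest CPU MFLOPS           {cpu_pct:.0f}% of first")
```

On these figures, combined drive and LAN throughput fell by about 6% at worst, while the floating point rate fell by about 32% under throttling.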
