Raspberry Pi, Pi 2 and Pi 3 Benchmarks - Roy Longbottom's PC benchmark Collection

Roy Longbottom's Raspberry Pi, Pi 2 and Pi 3 Benchmarks

New

I have now run my 32 bit and 64 bit Raspberry Pi benchmarks and stress tests on the model 3B+. Full details and results are available from ResearchGate in the Raspberry Pi 3B+ PDF Report. The tests demonstrate performance improvements of the 3B+ over the 3B (though not in every case), the new thermal characteristics and higher speed LAN and WiFi data transfers.

General	Raspberry Pi System	64 Bit SUSE & Gentoo
Standards/Configuration Details	Whetstone Benchmark	Dhrystone 2 Benchmark
Linpack Benchmark	Livermore Loops Benchmark	Memory Speed Benchmark
Bus Speed Benchmark	FFT Benchmarks

NEON Benchmarks	Linpack NEON Benchmarks	NEON Float & Integer Benchmark
NEON MemSpeed Benchmark	Maximum 1 Core MFLOPS

MultiThreading Benchmarks	MP-MFLOPS	MP-Whetstone
MP-Dhrystone	MP-BusSpeed	MP-RandMem
OpenMP-MFLOPS	OpenMP-MemSpeed
NEON MP Benchmarks	MP-NeonMFLOPS	linpackNeonMP

Java Benchmarks	Java Whetstone Benchmarks	JavaDraw Benchmark
OpenGL ES Benchmark	OpenGL GLUT Benchmark

DriveSpeed Benchmark	LAN/WiFi Benchmark	64 Bit Drive & LAN Benchmarks
Temperature & MHz Recorder	Reliability Tests	64 Bit Reliability Tests
Performance Monitor	Assembly Code

General

Roy Longbottom’s PC Benchmark Collection comprises numerous FREE benchmarks and reliability testing programs, for processors, caches, memory, buses, disks, flash drives, graphics, local area networks and Internet. Original ones run via DOS and later versions under all varieties of Windows. Most have also been converted to run under Linux on PCs, and many to run via Android on tablets and phones. Some of the Linux variety C/C++ source code was changed slightly to compile for execution on the Raspberry Pi.

After reading that compilation time on the Raspberry Pi was painfully slow, the programs were compiled on a Linux Ubuntu 12.04 based PC via the Raspbian Toolchain, using instructions downloaded from www.xappsoftware.com. This allows programs to be compiled from a Terminal window. Using this, the C/C++ code can first be compiled to run on the Linux driven PC, then cross compiled and transferred to the Raspberry Pi via LAN or a USB flash drive. In order to execute after transferring, a change to Properties, Permissions is needed to make the file executable. One complication is that setting the path to the cross compiler did not work as suggested by xappsoftware. Below are examples of commands used for the two executable files - note the path for gcc:

   cc  whets.c cpuidc.c -lm -O3 -o whetstoneIL

  ~/toolchain/raspbian-toolchain-gcc-4.7.2-linux32/bin/arm-linux-gnueabihf-gcc whets.c \
      cpuidc.c -lm -O3 -march=armv6 -mfloat-abi=hard -mfpu=vfp -o whetstonePiA6

  Command to execute -  ./whetstonePiA6

The last three parameters (-march to -mfpu) made no difference to performance, but others are likely to be needed to take advantage of later ARM floating point functions. Note, the first four benchmark programs were compiled later on the Raspberry Pi itself. Both the above cc and gcc (with no Toolchain path) commands were used for compilation. These and the PC based files all produced the same numeric results and mainly the same performance. Compilation time was acceptable at between 8 and 36 seconds.

The benchmarks and source codes can be downloaded in Raspberry_Pi_Benchmarks.zip. This includes the executables compiled, as above, to run on Intel CPUs via Linux and the versions compiled on the Raspberry Pi. To download the benchmarks, click on the Raspberry_Pi_Benchmarks.zip link, select Save to download to Home (assume /home/pi). Open File Manager and right click on zip file and select Extract here.

To enable execution of the programs, a security setting is required. Double click on Raspberry_Pi_Benchmarks folder to open, right click on each executable (dhrystonePiA6, linpackPiA6, linpackPiSP, liverloopsPiA6, memspeedPiA6, whetstonePiA6), select Properties, Permissions, tick Make the file executable. The new program titles mainly end in PiA7.

To run, open LX Terminal, type cd Raspberry_Pi_Benchmarks to enter the directory, type ls to ensure the path is correct and to list files, then execute for example using ./dhrystonePiA6. Information will be displayed as the benchmarks are running and results will be saved in log files, example Dhry.txt.
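The Properties, Permissions step can also be scripted. The following is a minimal Python sketch, assuming Python 3 is installed and the zip file was extracted to the home directory; the helper name make_executable is hypothetical and the file names are those listed above.

```python
import os
import stat

def make_executable(path):
    """Add execute permission for user, group and others (similar to chmod +x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

if __name__ == "__main__":
    # Assumed location of the extracted folder; adjust if it was saved elsewhere.
    folder = os.path.expanduser("~/Raspberry_Pi_Benchmarks")
    names = ["dhrystonePiA6", "linpackPiA6", "linpackPiSP",
             "liverloopsPiA6", "memspeedPiA6", "whetstonePiA6"]
    for name in names:
        path = os.path.join(folder, name)
        if os.path.isfile(path):          # skip names that are not present
            make_executable(path)
            print("made executable:", path)
```

Files that are not found are skipped, so the same script can be reused after adding the PiA7 names.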


Raspberry Pi System

For those who do not know, the Raspberry Pi has a 3.5 x 2.5 inch motherboard, in this case, containing a 700 MHz ARM 1176JZF v6 single core CPU and 512 MB RAM. External connectors include two full size USB sockets with others for a full size HDMI plug, a micro USB socket for power, an RJ45 Ethernet port and a slot for an SD card, used as the main drive.

The operating system is Raspbian, based on Linux Debian, in this case Wheezy-Raspbian. This can be obtained pre-loaded on an SD card or downloaded from raspberrypi.org and copied to an SD card to produce a bootable drive. I used Image Writer for Microsoft Windows for this purpose.

In my case, booting time, from connecting power to desktop display, is 30 seconds. A simple command (see below) produces a menu where CPU speed can be selected up to 1 GHz, also increasing memory bus speed.

Raspberry Pi 2 Model B has a 900 MHz quad core Broadcom BCM2836 ARM V7 CPU with 1 GB RAM and can be overclocked to 1 GHz, using the configuration menu. L1 data cache size is 32 KB and L2 cache 512 KB, shared by all cores. Existing benchmarks were run on the new computer along with additional programs, produced by a newer compiler, to see if additional hardware features were used. The additional benchmarks were produced using gcc 4.8, where a typical compile command is:

 gcc whets.c cpuidc.c -lm -O3 -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -o newA7

Raspberry Pi 3 Model B includes a quad core Broadcom BCM2837 system-on-chip running at 1200 MHz, each core having a 32 KB L1 cache. There is a shared 512 KB L2 cache and 1 GB RAM. The CPU is an ARM Cortex-A53, capable of 64 bit working, but presently only supports 32 bit operation. Benchmark results are now included.

Performance of a Cortex-A53 based Android tablet is available, for the same benchmarks, at both 32 bit and 64 bit working. These results are included below, to identify potential differences at 64 bits.


64 Bit OpenSUSE and Gentoo For Raspberry Pi 3

Up until late 2016, readily available operating systems have been 32 bit versions. The first reference I have seen, for a 64 bit variety, was for OpenSUSE for Raspberry Pi 3. There are different distros available, one being for SUSE Linux Enterprise Server (SLES). A number of free OpenSUSE downloads are for both Leap 42.2 and Tumbleweed versions. Registration is required for SLES, with free use for at least a year. All downloads are raw.xz compressed files.

Converting the xz files to successfully bootable SD cards can be difficult. I had to extract the raw files on a PC using Linux Ubuntu and copy them to the card via Windows, using Win32 Disk Imager. I managed to produce working systems for Leap 42.2 but not Tumbleweed.

I installed GCC-6. That produced what appeared to be good 64 bit code (from disassembly), but performance was variable. This was due to a default “on demand” boot setting that produced variable CPU MHz. In order to understand the implications of this, I compiled and ran some MP tests, details of which are in SUSE RPi3 Stress Tests.htm. Some benchmarks were also compiled by gcc 4.8, using 64 bit SLES, to explore performance differences between 32 bit and 64 bit working.

The benchmarks and source codes are being included in Rpi3-64-Bit-Benchmarks.tar.gz. The source codes include the compile and link commands used, an example being below.

 Example Compile Command

 gcc-6 whets.c cpuidc.c -lm -lrt -O3 -march=armv8-a -o whetstonePi64

 ARM options, such as -mcpu and other CPUs for -march, were not available

Linux Gentoo - Details of a bootable 64-bit Gentoo image for the Raspberry Pi 3 became available in February 2017. Details and downloads are available from Rpi3-64-Bit-Benchmarks.tar.gz.

The bootable SD card was created as for OpenSUSE above. The OpenSUSE produced benchmarks are being run via Gentoo and, where appropriate, results included below. This time, although “on demand” CPU MHz was used, benchmarks consistently ran at full speed, with lower MHz only being shown when the CPU was idle.


Standards/Configuration Details

All the benchmarks are run from Terminal commands and provide continuous displays of current activity. This was included in original versions of the benchmarks when CPUs were really slow. They all produce a summary of results in a .txt based log file and this includes system information, where the following example is for my particular system. Note that this includes the meaningless BogoMIPS measurement that does not change when the processor is overclocked. Raspberry Pi 2 has additional features such as neon, vfpv3 and vfpv4.

SUSE and Gentoo for Raspberry Pi 3 - CPU architecture: 8 identifies 64 bit working.
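The same check can be made by a program. The following is a minimal Python sketch, assuming /proc/cpuinfo text in the "key : value" layout used in the logs; the function names are hypothetical and the two sample strings are abbreviated from the Raspberry Pi 3 log entries.

```python
def cpu_architecture(cpuinfo_text):
    """Return the number on the first 'CPU architecture' line, or None."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "CPU architecture":
            value = value.strip()
            # Some kernels may report a non-numeric value; treat that as unknown.
            return int(value) if value.isdigit() else None
    return None

def is_64_bit(cpuinfo_text):
    # Assumption taken from the text: architecture 8 identifies 64 bit working.
    arch = cpu_architecture(cpuinfo_text)
    return arch is not None and arch >= 8

sample_32 = "model name : ARMv7 Processor rev 4 (v7l)\nCPU architecture: 7\nCPU part : 0xd03\n"
sample_64 = "Features : fp asimd evtstrm crc32\nCPU architecture: 8\nCPU part : 0xd03\n"
print(is_64_bit(sample_32), is_64_bit(sample_64))  # False True
```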

The programs provide keyboard input at the end to include comments in the log, such as "overclocked at 1000 MHz". The source code has expected numeric answers, selected for particular hardware. These are checked for correctness and errors reported in the log. Running on a variation of the hardware could produce false error reports for floating point calculations.

Also shown below is the command to select the menu with the overclocking option and commands to obtain CPU MHz and these do not change when the CPU is overclocked.

 SYSTEM INFORMATION

 From File /proc/cpuinfo
 Processor       : ARMv6-compatible processor rev 7 (v6l)
 BogoMIPS        : 464.48 was #371 PREEMPT
 BogoMIPS        : 697.95 later #557 PREEMPT
 Features        : swp half thumb fastmult vfp edsp java tls
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant     : 0x0
 CPU part        : 0xb76
 CPU revision    : 7

 Hardware        : BCM2708
 Revision        : 000d
 Serial          : 00000000db690cb4

 From File /proc/version
 Linux version 3.6.11+ (dc4@dc4-arm-01) (gcc version 4.7.2 20120731 (prerelease)
 (crosstool-NG linaro-1.13.1+bzr2458 - Linaro GCC 2012.08) ) #371 PREEMPT
 Thu Feb 7 16:31:35 GMT 2013

 ####################################################

 Raspberry Pi 2

 processor       : 0, 1, 2 and 3
 model name      : ARMv7 Processor rev 5 (v7l)
 BogoMIPS        : 38.40
 Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
                   vfpd32 lpae evtstrm
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant     : 0x0
 CPU part        : 0xc07
 CPU revision    : 5

 Linux version 3.18.5-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.8.3 20140303 (prerelease)
 (crosstool-NG linaro-1.13.1+bzr2650 - Linaro GCC 2014.03) ) #225 SMP PREEMPT
 Fri Jan 30 18:53:55 GMT 2015

 ####################################################

 Raspberry Pi 3 - 32 Bit Mode

 processor       : 0, 1, 2 and 3
 model name      : ARMv7 Processor rev 4 (v7l)
 BogoMIPS        : 38.40
 Features        : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt
                   vfpd32 lpae evtstrm crc32
 CPU implementer : 0x41
 CPU architecture: 7
 CPU variant     : 0x0
 CPU part        : 0xd03
 CPU revision    : 4

 Linux version 4.1.19-v7+ (dc4@dc4-XPS13-9333) (gcc version 4.9.3
 (crosstool-NG crosstool-ng-1.22.0-88-g8460611) ) #858 SMP
 Tue Mar 15 15:56:00 GMT 2016

 ####################################################

 Raspberry Pi 3 - 64 Bit OpenSUSE and Gentoo

 processor       : 0, 1, 2 and 3
 BogoMIPS        : 38.40
 Features        : fp asimd evtstrm crc32
 CPU implementer : 0x41
 CPU architecture: 8
 CPU variant     : 0x0
 CPU part        : 0xd03
 CPU revision    : 4

 OpenSUSE
 Linux version 4.4.36-8-default (geeko@buildhost) (gcc version 4.8.5 (SUSE Linux))
 #1 SMP Fri Dec 9 16:18:38 UTC 2016 (3ec5648)

 Gentoo
 Linux version 4.10.0-rc5-v8 (sakaki@chiyo) (gcc version 5.4.0
 (Gentoo 5.4.0-r2 p1.2, pie-0.6.5) ) #1 SMP PREEMPT Wed Jan 25 20:13:50 GMT 2017

 ####################################################

 Commands to obtain CPU MHz                 See later for more details

 vcgencmd measure_clock arm
 frequency(45)=700074000

 cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
 700000

 With Raspbian and Gentoo, both identify full and standby clock frequencies
 (RPi 3 1200 and 600 MHz), but the ARM function also provides measurements
 when the clock speed is reduced due to high temperatures.

 SUSE - does not support vcgencmd but following appears to identify MHz

 cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

 Command for overclocking selection - not RPi 3

 sudo raspi-config
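The two MHz commands report in different units: vcgencmd gives Hz and the cpufreq file gives kHz. The following is a minimal Python sketch of converting both replies to MHz, using the sample values shown above; the function names are hypothetical.

```python
def measure_clock_to_mhz(text):
    """Convert a reply such as 'frequency(45)=700074000' (Hz) to MHz."""
    hz = int(text.strip().split("=", 1)[1])
    return hz / 1_000_000

def scaling_freq_to_mhz(text):
    """Convert a cpufreq reading, reported in kHz, to MHz."""
    return int(text.strip()) / 1000

print(measure_clock_to_mhz("frequency(45)=700074000"))  # 700.074
print(scaling_freq_to_mhz("700000"))                    # 700.0
```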


Whetstone Benchmark - whetstonePiA6, whetstonePiA7, whetstonePi64

See Comparisons Below

The Whetstone Benchmark was the first general purpose benchmark that set industry standards of performance, particularly for minicomputers, and was introduced in 1972. The benchmark produced speed ratings in terms of Thousands of Whetstone Instructions Per Second (KWIPS). In 1978, self timing versions (by yours truly) produced speed ratings, for each of the eight test procedures, in MOPS (Millions of Operations Per Second) or MFLOPS (Millions of Floating Point Operations Per Second), with an overall rating in MWIPS, mainly dependent on floating point speed.

Unlike some other floating point benchmarks, the new PiA7 compilation produces identical numeric results to those below.

Besides the logged results, other information, shown below, is displayed on the Terminal, particularly for calibrating to run for a total of about 10 seconds. The time for each test identifies what determines the overall MWIPS rating. It now depends on those with mathematical functions but was N6 floating point originally.

 pi@raspberrypi ~/benchmarks $ ./whetstonePiA6

 ##########################################
 Single Precision C Whetstone Benchmark Opt 3 32 Bit, Sun May 12 11:05:53 2013

 Calibrate
       0.04 Seconds          1   Passes (x 100)
       0.19 Seconds          5   Passes (x 100)
       0.93 Seconds         25   Passes (x 100)
       4.68 Seconds        125   Passes (x 100)

 Use 267  passes (x 100)

           Single Precision C/C++ Whetstone Benchmark

 Loop content                  Result              MFLOPS      MOPS   Seconds

 N1 floating point     -1.12475013732910156        97.811               0.053
 N2 floating point     -1.12274742126464844       100.800               0.360
 N3 if then else        1.00000000000000000                  698.625    0.040
 N4 fixed point        12.00000000000000000                  425.250    0.200
 N5 sin,cos etc.        0.49911010265350342                    5.850    3.840
 N6 floating point      0.99999982118606567        85.669               1.700
 N7 assignments         3.00000000000000000                  498.960    0.100
 N8 exp,sqrt etc.       0.75110864639282227                    2.722    3.690

 MWIPS                                            270.460               9.983

 A new results file, whets.txt, will have been created in the same
 directory as the .EXE files, if one did not already exist.

 Type additional information to include in whets.txt - Press Enter
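The calibration figures can be checked by hand. The sketch below is an assumption about the arithmetic rather than a copy of the benchmark source: it scales the last calibration step (125 passes in 4.68 seconds) to the 10 second target, which gives the 267 passes displayed above. The function name is hypothetical.

```python
def passes_for_target(calibration_passes, calibration_seconds, target_seconds=10.0):
    """Scale a timed calibration run to a target running time, truncating."""
    return int(target_seconds * calibration_passes / calibration_seconds)

print(passes_for_target(125, 4.68))  # 267
```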


Whetstone Benchmark Comparisons

Results below are for the Raspberry Pi running at 700 MHz and overclocked at 1000 MHz. For comparison purposes, also shown are speeds obtained on various Android based ARM CPUs and Intel processors running under Linux, compiled as above. The latter are similar to those from my earlier Linux benchmarks. Results on many more systems are in Whetstone Results.htm with speeds of ancient computers in Whetstone Benchmark History and Results.

Raspberry Pi 2, with default settings, is just over twice as fast as the original, on average, or 57% faster at 1000 MHz. Performance via gcc 4.8 can be slightly slower than the earlier benchmarks. The programming code used is not really suitable to produce performance gains through advanced instructions.

This benchmark is particularly sensitive to optimisation in compiling the COS and EXP function tests that can determine the overall MWIPS rating. The other main influence is the third MFLOPS measurement. On all fronts, the Raspberry Pi 3 performance is around 1.33 times that of a non-overclocked Raspberry Pi 2, similar to the CPU MHz ratio.

Except for the function tests, other results from the Cortex-A53 based tablet are similar to the Raspberry Pi 3, adjusted for CPU MHz, and that also applies to 32 bit versus 64 bit operation. Much of the similarity is due to execution loops containing few simple instructions.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - other than the COS and EXP type function tests, speeds were similar to the 32 bit version and the Android 64 bit app. With SUSE on-demand CPU frequency, overall MWIPS ratings were 20% to 40% slower. SUSE and Gentoo MWIPS ratings were slightly different, again due to those volatile function test results, with others essentially the same, as would be expected with the simple processing arrangements. As indicated for the IF test, the compiler detected that it was not necessary to repeat the calculations, but this would make no real difference to MWIPS.

 System         MHz   MWIPS  ------MFLOPS-------  ---------------MOPS-----------------
                                 1      2      3    COS    EXP   FIXPT      IF   EQUAL

 Raspberry Pi   700   270.5   97.8  100.8   85.7    5.9    2.7   425.3   698.6   499.0
 Raspberry Pi  1000   390.6  136.8  146.3  122.9    8.5    3.9   617.4  1014.3   804.9
 RPi 2 v7-A7    900   525.0  252.0  261.3  223.0   10.2    5.1  1102.5  1358.4   882.0
 RPi 2 v7-A7   1000   584.6  280.3  290.7  248.0   11.3    5.7  1314.0  1208.9   981.1
 RPi 3 v8-A53  1200   724.5  331.0  347.5  298.1   12.1    8.7  1520.4  1873.4  1216.3

 gcc 4.8
 RPi 2 v7-A7    900   507.0  250.4  227.1  184.6   10.1    5.1  1113.7  1334.9   668.4
 RPi 2 v7-A7   1000   568.4  280.4  254.4  206.7   11.3    5.7  1248.8  1497.9   749.2
 RPi 3 v8-A53  1200   711.6  336.5  329.7  256.9   12.2    8.8  1498.5  1796.7  1198.7

 gcc-6 64 Bit Working
 OpenSuse
 RPi 3 v8-A53  1200   997.2  336.6  354.1  287.8   18.4   12.3  1498.7  ######  1197.3
 Gentoo
 RPi 3 v8-A53  1200  1022.9  327.6  346.3  282.1   20.3   12.6  1467.3  ######  1166.4

 Android
 ARM 926EJ      800    31.2   10.2   10.2   11.4    0.6    0.3    38.8   278.4   219.4
 ARM v7-A9      800   687.4  165.4  149.9  153.4   15.9    9.3   723.1  1082.1   725.3
 ARM v7-A9     1300  1115.0  271.3  250.7  256.4   25.8   14.6  1190.0  1797.0  1198.7
 ARM v7-A15    1700  1333.6  315.5  291.2  298.6   39.8   18.1  1394.7  2089.9  1395.5
 ARM v8-A53    1300   834.7  348.9  312.7  310.9   36.7    5.4  1556.7  1867.2   570.5
 64 Bit Version
 ARM v8-A53    1300  1494.2  347.1  307.0  305.9   37.5   20.6  1552.2  1863.7  1239.1

 Intel Atom    1666   822.3  332.4  325.7  308.6   17.2    8.1  1013.8  2368.9  1228.0
 Core 2        2400  2316.1  810.0  790.4  576.2   56.8   23.8  3986.9  7532.4  2831.4
 Core i7       3900  3959.0 1331.0 1330.9  938.4   96.5   42.1  6515.7 10966.7  5850.8

 ###### compiler optimiser produces 1 pass, this test does not affect MWIPS much
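The "around 1.33 times" comparison can be reproduced from the table. The following is a minimal Python sketch that divides each Raspberry Pi 3 result by the corresponding Raspberry Pi 2 (900 MHz) result from the first compilation and sets it against the CPU MHz ratio; the variable and function names are hypothetical.

```python
# Values copied from the table above (first compilation, not gcc 4.8).
labels = ["MWIPS", "MFLOPS1", "MFLOPS2", "MFLOPS3", "COS", "EXP", "FIXPT", "IF", "EQUAL"]
rpi2_900 = [525.0, 252.0, 261.3, 223.0, 10.2, 5.1, 1102.5, 1358.4, 882.0]
rpi3_1200 = [724.5, 331.0, 347.5, 298.1, 12.1, 8.7, 1520.4, 1873.4, 1216.3]

def ratios(new, old):
    """Element by element speed ratio of two result rows."""
    return [n / o for n, o in zip(new, old)]

mhz_ratio = 1200 / 900
speed_ratios = ratios(rpi3_1200, rpi2_900)

print("MHz ratio %.2f" % mhz_ratio)
for label, ratio in zip(labels, speed_ratios):
    print("%-8s %.2f" % (label, ratio))
```

Most ratios fall between about 1.31 and 1.38; the two function tests are the outliers, at roughly 1.19 for COS and 1.71 for EXP.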


Dhrystone 2 Benchmark - dhrystonePiA6, dhrystonePiA7, dhrystonePi64

See Comparisons Below

The Dhrystone "C" benchmark provides a measure of integer performance (no floating point instructions). It became the key standard benchmark from 1984, with the growth of Unix systems. The first version was produced by Reinhold P. Weicker in ADA and translated to "C" by Rick Richardson. Two versions are available - Dhrystone versions 1.1 and 2.1. The second version, used here, was produced to avoid over-optimisation problems encountered with version 1, but some is still possible. Speed was originally measured in Dhrystones per second. This was later changed to VAX MIPS by dividing Dhrystones per second by 1757, the DEC VAX 11/780 result, the latter being regarded as the first 1 MIPS minicomputer.

This again runs for 10 seconds after calibration. In this case, logged results are nanoseconds for one Dhrystone run, Dhrystones per Second and VAX MIPS rating, plus details of detected errors or “Numeric results were correct”. Below is the execution command and details of displayed information, excluding standard system information.

 pi@raspberrypi ~/benchmarks $ ./dhrystonePiA6

 ##########################################

 Dhrystone Benchmark, Version 2.1 (Language: C or C++)

 Optimisation    Opt 3 32 Bit
 Register option not selected

      10000 runs   0.00 seconds
     100000 runs   0.07 seconds
     200000 runs   0.15 seconds
     400000 runs   0.28 seconds
     800000 runs   0.56 seconds
    1600000 runs   1.13 seconds
    3200000 runs   2.26 seconds

 Final values (* implementation-dependent):

 Int_Glob:      O.K.  5
 Bool_Glob:     O.K.  1
 Ch_1_Glob:     O.K.  A
 Ch_2_Glob:     O.K.  B
 Arr_1_Glob[8]: O.K.  7
 Arr_2_Glob8/7: O.K.  3200010
 Ptr_Glob->
   Ptr_Comp:       *  5722488
   Discr:       O.K.  0
   Enum_Comp:   O.K.  2
   Int_Comp:    O.K.  17
   Str_Comp:    O.K.  DHRYSTONE PROGRAM, SOME STRING
 Next_Ptr_Glob->
   Ptr_Comp:       *  5722488 same as above
   Discr:       O.K.  0
   Enum_Comp:   O.K.  1
   Int_Comp:    O.K.  18
   Str_Comp:    O.K.  DHRYSTONE PROGRAM, SOME STRING
 Int_1_Loc:     O.K.  5
 Int_2_Loc:     O.K.  13
 Int_3_Loc:     O.K.  7
 Enum_Loc:      O.K.  1
 Str_1_Loc:     O.K.  DHRYSTONE PROGRAM, 1'ST STRING
 Str_2_Loc:     O.K.  DHRYSTONE PROGRAM, 2'ND STRING

 Nanoseconds one Dhrystone run:       671.88
 Dhrystones per Second:              1488372
 VAX  MIPS rating =                   847.11

 Type additional information to include in Dhry.txt - Press Enter
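The three logged figures are related by simple arithmetic. The following is a minimal Python sketch, using the 1757 divisor given earlier and the Dhrystones per Second value from the display above; the function names are hypothetical.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # the DEC VAX 11/780 result quoted above

def vax_mips(dhrystones_per_second):
    """VAX MIPS rating: Dhrystones per second divided by the VAX 11/780 result."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND

def nanoseconds_per_run(dhrystones_per_second):
    """Time for one Dhrystone run, in nanoseconds."""
    return 1e9 / dhrystones_per_second

print("%.2f" % vax_mips(1488372))             # 847.11
print("%.2f" % nanoseconds_per_run(1488372))  # 671.88
```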


Dhrystone 2 Benchmark Comparisons

Below is a similar combination of results as for the Whetstone Benchmark. For results on other systems see Dhrystone Results.htm. Unlike the Whetstone results, which depend on floating point calculations, the Raspberry Pi speed executing these integer functions is close to that of ARM Cortex-A9 processors, on a per MHz basis. The Raspberry Pi 2 is faster than the first version, performance ratios being shown below. The new gcc 4.8 compilation provides slightly higher speed ratings.

The Raspberry Pi 3 averages 45% faster than the Pi 2 on these compilations, compared with a 33% faster CPU MHz. These results are similar to those from the Cortex-A53 based tablet at 64 bits, where optimisation may not have been as good as possible at 32 bits.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - Speeds were more than 40% faster than 32 bit system results, and up to 2.95 VAX MIPS (DMIPS) per MHz. Variations between the two 64 bit tests are quite normal. The worse Android 64 bit performance suggests that the later compiler might be responsible.

 System         MHz  VAX MIPS

 Raspberry Pi   700       847
 Raspberry Pi  1000      1226
 RPi 2 v7-A7    900      1538   1.82 x RPi 700
 RPi 2 v7-A7   1000      1694   1.38 x RPi 1000
 RPi 3 v8-A53  1200      2201   1.43 x RPi 2 900

 gcc 4.8
 RPi 2 v7-A7    900      1667   1.08 x RPi 2
 RPi 2 v7-A7   1000      1852   1.09 x RPi 2
 RPi 3 v8-A53  1200      2469   1.48 x RPi 2 900

 gcc-6 64 Bit Working
 OpenSuse
 RPi 3 v8-A53  1200      3536   1.43 x RPi 3 32 bits
 Gentoo
 RPi 3 v8-A53  1200      3475   0.98 x Suse 64 bits

 Android
 ARM 926EJ      800       356
 ARM v7-A9      800       962
 ARM v7-A9     1300      1610
 ARM v7-A15    1700      3189
 ARM v8-A53    1300      1423
 64 Bit Version
 ARM v8-A53    1300      2569

 Linux using CC
 Intel Atom    1666      2629
 Core 2        2400      6857

 Linux using older GCC
 Intel Atom    1666      2055
 Core 2        2400      5582
 Core i7       3900     16356
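The per MHz figures quoted in the comparisons follow directly from the table. The following is a minimal Python sketch using a few of the rows; the selection of rows and the function name are hypothetical choices for illustration.

```python
# (system, MHz, VAX MIPS) copied from the table above.
results = [
    ("Raspberry Pi", 700, 847),
    ("RPi 2 v7-A7", 900, 1538),
    ("RPi 3 v8-A53 32 bit gcc 4.8", 1200, 2469),
    ("RPi 3 v8-A53 64 bit OpenSuse", 1200, 3536),
    ("Android ARM v7-A9", 1300, 1610),
]

def mips_per_mhz(vax_mips, mhz):
    """VAX MIPS (DMIPS) per MHz of CPU clock."""
    return vax_mips / mhz

for system, mhz, mips in results:
    print("%-30s %.2f" % (system, mips_per_mhz(mips, mhz)))

# Gain of the 64 bit OpenSuse run over the 32 bit gcc 4.8 run.
print("64 bit gain %.2f" % (3536 / 2469))
```

The OpenSuse row gives the 2.95 DMIPS per MHz mentioned in the text, and the original Raspberry Pi and the 1300 MHz Cortex-A9 both come out at a little over 1.2.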


Linpack Benchmark 32b - linpackPiA6, linpackPiSP, linpackPiA7, linpackPiA7SP
Linpack Benchmark 64b - linpackPi64, linpackPiSP64

See Comparisons Below

The Linpack Benchmark was produced from the "LINPACK" package of linear algebra routines. It became the primary benchmark for scientific applications, particularly under Unix, from the mid 1980's, with a slant towards supercomputer performance. The original double precision C version, used here, operates on 100x100 matrices. Performance is governed by an inner loop in function daxpy() with a linked triad dy[i] = dy[i] + da * dx[i], and is measured in Millions of Floating Point Operations Per Second (MFLOPS).
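As an illustration of the linked triad, here is a minimal Python sketch of the daxpy() inner loop. It is a simplified stand-in, not the unrolled C routine used by the benchmark, and the helper daxpy_flops is hypothetical.

```python
def daxpy(n, da, dx, dy):
    """Linked triad: dy[i] = dy[i] + da * dx[i] for the first n elements."""
    if n <= 0 or da == 0.0:
        return                      # nothing to add
    for i in range(n):
        dy[i] = dy[i] + da * dx[i]

def daxpy_flops(n):
    """One multiply and one add per element."""
    return 2 * n

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
daxpy(3, 2.0, x, y)
print(y)               # [12.0, 24.0, 36.0]
print(daxpy_flops(3))  # 6
```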

Displayed output is the same as the original version for PCs (My conversion at Netlib - 1996), where the bloated detail was needed due to using a low resolution timer. The line starting with norm resid 1.7 shows the numeric results of calculations. These can vary using different hardware and compilers - see examples in Linpack numeric results Android. For comparison purposes, these are set in the C source code and checked at run time, a “Numeric results were as expected” message being logged if correct, or details provided if incorrect. Note that the compiled code could give consistently different results on other Linux based ARM processors. The log file shows only one MFLOPS speed.

Unlike normal Intel floating point, double precision calculations are often slower than those using single precision on ARM processors. So, besides linpackPiA6, a single precision compilation, linpackPiSP, is also provided. As for the double precision results, these are identical to those on Android based ARM systems.

The gcc 4.8 equivalents are linpackPiA7 and linpackPiA7SP where, as shown below, these produce different numeric answers. These are probably acceptable and due to different rounding with the assembly code used. Below is that used for the performance dependent code.

 pi@raspberrypi ~/benchmarks $ ./linpackPiA6

 ##########################################
 Unrolled Double Precision Linpack Benchmark - Linux Version in 'C/C++'

 Optimisation Opt 3 32 Bit

 norm resid      resid           machep         x[0]-1          x[n-1]-1
    1.7    7.41628980e-14  2.22044605e-16 -1.49880108e-14 -1.89848137e-14

 Times are reported for matrices of order          100
 1 pass times for array with leading dimension of  201

      dgefa      dgesl      total     Mflops       unit      ratio
    0.00000    0.00000    0.00000       0.00     0.0000     0.0000

 Calculating matgen overhead
        10 times   0.01 seconds
       100 times   0.15 seconds
       200 times   0.28 seconds
       400 times   0.58 seconds
       800 times   1.13 seconds
 Overhead for 1 matgen      0.00141 seconds

 Calculating matgen/dgefa passes for 1 seconds
        10 times   0.17 seconds
        20 times   0.35 seconds
        40 times   0.69 seconds
        80 times   1.38 seconds
 Passes used         57

 Times for array with leading dimension of 201

      dgefa      dgesl      total     Mflops       unit      ratio
    0.01578    0.00053    0.01631      42.11     0.0475     0.2912
    0.01596    0.00053    0.01648      41.66     0.0480     0.2943
    0.01578    0.00053    0.01631      42.11     0.0475     0.2912
    0.01596    0.00053    0.01648      41.66     0.0480     0.2943
    0.01578    0.00070    0.01648      41.66     0.0480     0.2943
 Average                               41.84

 Calculating matgen2 overhead
 Overhead for 1 matgen      0.00144 seconds

 Times for array with leading dimension of 200

      dgefa      dgesl      total     Mflops       unit      ratio
    0.01523    0.00053    0.01576      43.58     0.0459     0.2813
    0.01540    0.00053    0.01593      43.10     0.0464     0.2845
    0.01540    0.00053    0.01593      43.10     0.0464     0.2845
    0.01523    0.00070    0.01593      43.10     0.0464     0.2845
    0.01523    0.00070    0.01593      43.10     0.0464     0.2845
 Average                               43.20

 Unrolled Double  Precision       41.84 Mflops

 Type additional information to include in linpack.txt - Press Enter

 Raspberry Pi Results of Calculations

              norm resid      resid           x[0]-1          x[n-1]-1

 DP Pi            1.7    7.41628980e-14  -1.49880108e-14  -1.89848137e-14
 DP Pi 2-3        1.9    8.46778499E-14  -1.11799459E-13  -9.60342916E-14
 DP Pi 64         1.9    8.46778499e-14  -1.11799459e-13  -9.60342916e-14
 SP Pi            1.6    3.80277634e-05  -1.38282776e-05  -7.51018524e-06
 SP Pi NEON       2.2    5.16722466e-05  -2.38418579e-07  -5.06639481e-06
 SP Pi 2-3        2.0    4.69621336E-05  -1.31130219E-05  -1.30534172E-05
 SP Pi 64         2.0    4.69621336e-05  -1.31130219e-05  -1.30534172e-05
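The Mflops column in the Linpack display can be related to the total time column. The sketch below assumes the operation count conventionally used for the Linpack benchmark, 2n³/3 + 2n² for a matrix of order n; that formula is not stated in this document, so treat it as an assumption. The function names are hypothetical.

```python
def linpack_ops(n):
    """Assumed floating point operation count for a matrix of order n."""
    return (2.0 * n ** 3) / 3.0 + 2.0 * n ** 2

def linpack_mflops(n, total_seconds):
    """MFLOPS for one dgefa + dgesl pass taking total_seconds."""
    return linpack_ops(n) / (1.0e6 * total_seconds)

# Total times copied from the display above, for matrices of order 100.
for total in (0.01631, 0.01648, 0.01576, 0.01593):
    print("%.5f seconds  %.1f MFLOPS" % (total, linpack_mflops(100, total)))
```

Under that assumption the results agree with the displayed 42.11, 41.66, 43.58 and 43.10 to within about 0.01 MFLOPS, the small differences being explained by the displayed times having been rounded to five decimal places.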


Linpack Benchmark Comparisons

The first Raspberry Pi results do not look too good but they would on a cost/performance basis. Also, the MFLOPS ratings should be compared with Linpack Results on PCs and older mainframes, supercomputers, Unix boxes and minicomputers with Netlib Linpack Results. The Linpack benchmark depends on data in L2 cache and this might lead to variations in running time. Other versions might specify larger array sizes (like 1000 x 1000) that can depend on slower memory.

The Raspberry Pi 2 is faster than the first version, performance ratios being shown below. In this case, the new code from gcc 4.8 is faster than the original, but only for the double precision benchmark, due to the more efficient instructions shown below. The benchmark has also been compiled to use ARM NEON Single Instruction Multiple Data (SIMD) functions (linpackPiNEONi, linpackPiNEON64), speed being included in the results table. Further details are in a later section.

Based on MFLOPS/MHz, the Raspberry Pi 3 can be slower than the RPi 2, but is quite a bit faster on the NEON version. The Cortex-A53 based tablet 32 bit performance is similar to the RPi 3, but 64 bit working is much faster.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - The measurements for these and the Android 64 bit version, were essentially the same. Speed improvements, over the 32 bit version, were around 1.9 times DP and 2.5 times SP. NEON speeds were not much different, where the intrinsic functions are translated into different variations of vector instructions.

                       -------MFLOPS-------   ----GAIN----
 System         MHz     DP     SP  NEON SP      DP     SP  Against

 Raspberry Pi   700     42     58      N/A
 Raspberry Pi  1000     68     88      N/A
 RPi 2 v7-A7    900    120    156      N/A    2.86   2.69  RPi 700
 RPi 2 v7-A7   1000    134    175      N/A    1.97   1.99  RPi 1000
 RPi 3 v8-A53  1200    176    190      N/A    1.47   1.22  RPi 2 900

 gcc 4.8
 RPi 2 v7-A7    900    154    156      300    1.28   1.00  RPi 2 900
 RPi 2 v7-A7   1000    169    176      334    1.26   1.01  RPi 2 1000
 RPi 3 v8-A53  1200    180    194      486    1.17   1.24  RPi 2 900

 gcc-6 64 Bit Working
 OpenSuse
 RPi 3 v8-A53  1200    348    494      530    1.93   2.55  RPi 3 32b
 Gentoo
 RPi 3 v8-A53  1200    343    482      521    1.90   2.48  RPi 3 32b

 Android
 ARM 926EJ      800      6     10      N/A
 ARM v7-A9      800    101    129      256
 ARM v7-A9     1300    151    201      377
 ARM v7-A15    1700    459    803     1335
 gcc 4.8
 ARM v7-A9     1300    159    200
 ARM v7-A15    1700    795    977
 ARM v8-A53    1300    178    187      423
 64 Bit Version
 ARM v8-A53    1300    348    493      521

 Linux using CC
 Intel Atom    1666    211
 Core 2        2400   1631

 Linux using older GCC
 Intel Atom    1666    196
 Core 2        2400   1288
 Core i7       3900   2534


Livermore Loops Benchmark - liverloopsPiA6, liverloopsPiA7, liverloopsPi64

See Comparisons Below

This original main benchmark for supercomputers was first introduced in 1970, initially comprising 14 kernels of numerical application, written in Fortran. This was increased to 24 kernels in the 1980s. Performance measurements are in terms of Millions of Floating Point Operations Per Second or MFLOPS. The kernels are executed three times with different double precision data array sizes. Following are overall MFLOPS results for various systems, geometric mean being the official average performance. [Reference - F.H. McMahon, The Livermore Fortran Kernels: A Computer Test Of The Numerical Performance Range, Lawrence Livermore National Laboratory, Livermore, California, UCRL-53745, December 1986]

                    ---------------- MFLOPS ---------------               
CPU            MHz  Maximum Average Geomean Harmean Minimum   Measured in

CDC 6600        10     1.1     0.5     0.5     0.4     0.2      1970  *  
CDC 7600        36.4   7.3     4.2     3.9     2.5     1.4      1974  *  
Cray 1A         80    83.5    25.8    14.4     7.9     2.7      1980  *  
Cray 1S         80    82.1    22.2    11.9     6.5     1.0      1985     
CDC Cyber 205   50   146.9    36.4    14.6     5.0     0.6      1982  *  
Cray 2         244   146.4    36.7    14.2     5.8     1.7      1985     
Cray XMP1      105   187.8    61.3    31.5    15.6     3.6      1986     

                        * Fewer than 24 Kernels

Below is the run command, then displayed calibration phase, final results and details for the 24 loops using the largest data sizes. Calibration arranges for each loop to run for around one second. The Checksums OK column is an indication of accuracy, compared with a specification and probably based on results from CDC 6600 and 7600. These hardware/compiler dependent numeric answers are checked as in the Linpack benchmark. Results included in the log file are Minimum, Maximum, Averages and 24 weighted average MFLOPS speeds.

As with the Linpack benchmark, liverloopsPiA7, the gcc 4.8 compilation, produced different numeric answers to the earlier version, this time for 22 out of the 24 kernels. All were only slightly different and are shown below, for part 3 of 3. The benchmark produced a run time error from the initial gcc 4.8 compilation. This was due to the way in which shared array space is allocated and was also apparent with earlier Android compilations. So, the same code changes were made and the revised source code is included in Raspberry_Pi_Benchmarks.zip.

 pi@raspberrypi ~/benchmarks $ ./liverloopsPiA6

 ##########################################

 L.L.N.L. 'C' KERNELS: MFLOPS   P.C.  VERSION 4.0

 Optimisation  Opt 3 32 Bit

 Calculating outer loop overhead
      1000 times  0.00 seconds
     10000 times  0.00 seconds
    100000 times  0.00 seconds
   1000000 times  0.06 seconds
   2000000 times  0.11 seconds
   4000000 times  0.23 seconds
 Overhead for each loop   5.7500e-08 seconds

 Calibrating part 3 of 3

 Loop count     32  0.00 seconds
 Loop count    128  0.01 seconds
 Loop count    512  0.04 seconds

 Loops  200 x  8 x Passes

 Kernel       Floating Pt ops
 No  Passes E No    Total      Secs.  MFLOPS Span     Checksums          OK
 ------------ -- ------------- ----- ------- ---- ---------------------- --
  1   28 x 11  5  6.652800e+07  0.97   68.29   27  3.855104502494961e+01 16
  2   46 x 18  4  5.829120e+07  0.93   62.65   15  3.953296986903059e+01 16
  3   37 x 36  2  1.150848e+08  0.85  135.70   27  2.699309089320672e-01 16
  4   38 x 36  2  6.566400e+07  0.88   75.04   27  5.999250595473891e-01 16
  5   40 x 12  2  3.993600e+07  1.08   36.99   27  3.182615248447483e+00 16
  6   21 x 34  2  5.483520e+07  1.26   43.52    8  1.120309393467088e+00 15
  7   20 x 14 16  1.505280e+08  1.03  146.64   21  2.845720217644024e+01 16
  8    9 x 10 36  1.347840e+08  1.08  124.52   14  2.960543667875005e+03 15
  9   26 x 11 17  1.166880e+08  1.27   92.17   15  2.623968460874250e+03 16
 10   25 x 10  9  5.400000e+07  1.16   46.59   15  1.651291227698265e+03 16
 11   46 x 18  1  3.444480e+07  1.10   31.30   27  6.551161335845770e+02 16
 12   48 x 14  1  2.795520e+07  1.13   24.66   26  1.943435981130448e-06 16
 13   31 x  9  7  2.499840e+07  1.19   21.07    8  3.847124199949431e+10 15
 14    8 x 11 11  4.181760e+07  1.08   38.63   27  2.923540598672009e+06 15
 15    1 x 17 33  6.283200e+07  0.98   64.21   15  1.108997288134785e+03 16
 16   14 x 34 10  8.377600e+07  1.41   59.41   15  5.152160000000000e+05 16
 17   26 x 17  9  9.547200e+07  1.13   84.27   15  2.947368618589361e+01 16
 18    2 x 11 44  1.006720e+08  1.16   86.92   14  9.700646212337041e+02 16
 19   28 x 23  6  9.273600e+07  1.30   71.56   15  1.268230698051003e+01 15
 20    7 x  9 26  6.814080e+07  1.19   57.04   26  5.987713249475302e+02 16
 21    1 x  2  2  8.000000e+07  1.51   52.99   20  5.009945671204667e+07 16
 22    8 x  8 17  2.611200e+07  1.16   22.42   15  6.109968728263972e+00 16
 23    7 x 11 11  8.808800e+07  0.98   89.56   14  4.850340602749970e+02 16
 24   23 x 35  1  3.348800e+07  1.17   28.56   27  1.300000000000000e+01 16

                 Maximum   Rate  146.64
                 Average   Rate   65.20
                 Geometric Mean   56.66
                 Harmonic  Mean   48.85
                 Minimum   Rate   21.07
                 Do Span             19

                 Overall

                 Part 1 weight 1
                 Part 2 weight 2
                 Part 3 weight 1

                 Maximum   Rate  148.29
                 Average   Rate   64.41
                 Geometric Mean   54.74
                 Harmonic  Mean   46.40
                 Minimum   Rate   16.62
                 Do Span            167

 Type additional information to include in linpack.txt - Press Enter

 gcc 4.8 and gcc-6 Different Results Later at 64 bits - Checks for was results

  1 was  3.855104502494985e+01 expected  3.855104502494961e+01
  2 was  3.953296986903406e+01 expected  3.953296986903059e+01
  3 was  2.699309089321338e-01 expected  2.699309089320672e-01
  4 was  5.999250595474085e-01 expected  5.999250595473891e-01
  5 was  3.182615248448323e+00 expected  3.182615248447483e+00
  6 was  1.120309393467610e+00 expected  1.120309393467088e+00
  7 was  2.845720217644064e+01 expected  2.845720217644024e+01
  8 was  2.960543667877653e+03 expected  2.960543667875005e+03
  9 was  2.623968460874436e+03 expected  2.623968460874250e+03
 10 was  1.651291227698388e+03 expected  1.651291227698265e+03
 11 was  6.551161335846584e+02 expected  6.551161335845770e+02
 12 was  1.943435982643127e-06 expected  1.943435981130448e-06
 13 was  3.847124173932926e+10 expected  3.847124199949431e+10
 14 was  2.923540598700724e+06 expected  2.923540598672009e+06
 15 was  1.108997288135077e+03 expected  1.108997288134785e+03
 17 was  2.947368618590736e+01 expected  2.947368618589361e+01
 18 was  9.700646212341634e+02 expected  9.700646212337041e+02
 19 was  1.268230698051755e+01 expected  1.268230698051003e+01
 20 was  5.987713249471707e+02 expected  5.987713249475302e+02
 21 was  5.009945671206671e+07 expected  5.009945671204667e+07
 22 was  6.109968728264851e+00 expected  6.109968728263972e+00
 23 was  4.850340602751729e+02 expected  4.850340602749970e+02
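
The summary lines of the log can be cross-checked from the 24 kernel MFLOPS values. The following is only a minimal Python sketch, not the benchmark's own C code; the function name `summarise` is hypothetical and it assumes plain unweighted means over the 24 kernels. On that assumption the results should come out close to the first summary block in the log (maximum 146.64, average 65.20, geometric mean 56.66, harmonic mean 48.85, minimum 21.07).

```python
import math

# MFLOPS for kernels 1 to 24, copied from the liverloopsPiA6 log above.
mflops = [68.29, 62.65, 135.70, 75.04, 36.99, 43.52, 146.64, 124.52,
          92.17, 46.59, 31.30, 24.66, 21.07, 38.63, 64.21, 59.41,
          84.27, 86.92, 71.56, 57.04, 52.99, 22.42, 89.56, 28.56]


def summarise(rates):
    """Return the five summary figures, assuming unweighted means."""
    n = len(rates)
    return {
        "maximum": max(rates),
        "average": sum(rates) / n,
        # Geometric mean via logs, to avoid a very large product.
        "geometric": math.exp(sum(math.log(r) for r in rates) / n),
        # Harmonic mean: n divided by the sum of reciprocals.
        "harmonic": n / sum(1.0 / r for r in rates),
        "minimum": min(rates),
    }


stats = summarise(mflops)
for name, value in stats.items():
    print(f"{name:10s} {value:8.2f}")
```

The Overall figures in the log combine three parts with weights 1, 2 and 1, so they cannot be reproduced from this single part alone.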

Livermore Loops Benchmark Comparisons

For Cray 1 comparison purposes, it is more appropriate to use Cray 1S results, as these are from running all 24 kernels. Geometric mean for this system is 11.9 MFLOPS. In 1978, the Cray 1 supercomputer cost $7 Million, weighed 10,500 pounds and had a 115 kilowatt power supply. It was, by far, the fastest computer in the world. The Raspberry Pi costs around $70 (CPU board, case, power supply, SD card), weighs a few ounces, uses a 5 watt power supply and is more than 4.5 times faster than the Cray 1.

Average performance gains of the Raspberry Pi 2 are not as high as those for the Linpack benchmark, but the best test loop, at 900 MHz, is 4.25 times faster than the original Pi at 700 MHz. The highest geometric mean, 138 MFLOPS, is 11.6 times faster than a Cray 1.

The Raspberry Pi 3 average speed shown is 46% faster than the RPi 2, compared with a 33% higher clock speed, and is also a little faster than the Android tablet with the Cortex-A53. The tablet's 64 bit compilation indicates an average improvement of 46%, with a wide variation in MFLOPS from individual tests.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - In this case, the SUSE based test results all appeared to be slightly faster than those for Gentoo, with Gentoo speeds between 92% and 99% of those from SUSE. Average Android 64 bit speed was 10% slower but some results were faster, probably due to a different compiler handling the relatively large code in a different way. Compared to 32 bit speeds, the 64 bit scores were between 1.02 and 2.88 times faster. The official geometric mean rating was 1.34 times faster. On the same basis, the RPi 3 can be rated as the equivalent of 21 times the Cray 1 supercomputer.
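
The Cray 1 comparisons quoted in this section are simple ratios of geometric means. As a minimal Python sketch, with the geometric means copied from the comparison table below (the dictionary labels and the helper name `times_cray` are only for illustration):

```python
CRAY_1S_GEOMEAN = 11.9  # MFLOPS for all 24 kernels, as quoted in the text

# Geometric mean MFLOPS taken from the comparison table.
geomeans = {
    "Raspberry Pi 700 MHz": 54.7,
    "RPi 2 1000 MHz gcc 4.8": 138.2,
    "RPi 3 1200 MHz gcc 4.8 32 bit": 185.9,
    "RPi 3 1200 MHz OpenSuse 64 bit": 249.4,
}


def times_cray(geomean_mflops):
    """Speed relative to the Cray 1S geometric mean."""
    return geomean_mflops / CRAY_1S_GEOMEAN


for system, geomean in geomeans.items():
    print(f"{system:32s} {times_cray(geomean):5.1f} x Cray 1")

# 64 bit gain over the 32 bit gcc 4.8 result on the same RPi 3.
gain_64_bit = (geomeans["RPi 3 1200 MHz OpenSuse 64 bit"]
               / geomeans["RPi 3 1200 MHz gcc 4.8 32 bit"])
print(f"64 bit / 32 bit geometric mean ratio {gain_64_bit:.2f}")
```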

See also Livermore Loops Results on PCs.

                                                                   Compare
 System              MHz Maximum Average Geomean Harmean Minimum   Geomean  Against

 Raspberry Pi #      700   148.3    64.4    54.7    46.4    16.6
 Raspberry Pi #     1000   216.8    94.8    80.8    68.7    29.3
 RPi 2 v7-A7         900   248.0   126.1   114.9   103.9    41.5      2.10  RPi 700
 RPi 2 v7-A7        1000   273.5   139.7   127.3   115.2    46.5      1.58  RPi 1000
 gcc 4.8
 RPi 2 v7-A7 *       900   223.8   136.9   125.6   113.0    42.3      1.09  RPi 2# 900
 RPi 2 v7-A7        1000   244.9   150.7   138.2   124.4    46.7      1.09  RPi 2# 1000
 RPi 3 v8-A53 ^     1200   435.5   206.9   183.6   159.8    55.6      1.46  RPi 2* 900
 gcc 4.8
 RPi 3 v8-A53 >     1200   398.4   210.6   185.9   160.2    56.5      1.01  RPi 3^ 1200
 gcc-6 64 Bit Working
 OpenSuse gcc-6
 RPi 3 v8-A53 >     1200   649.0   278.8   249.4   221.6    95.0      1.34  RPi 3 32 bit
 Gentoo
 RPi 3 v8-A53 >     1200   627.3   275.7   246.8   219.2    90.6      1.34  RPi 3 32 bit

 Android
 ARM 926EJ           800     9.9     5.6     5.4     5.2     2.4
 ARM v7-A9           800   253.2   129.3   115.3   101.6    46.7
 ARM v7-A9          1200   391.9   202.1   181.3   160.9    68.1
 ARM v7-A15         1700  1252.8   476.0   375.8   288.8    90.8
 ARM v8-A53 $       1300   393.4   188.3   158.3   124.6    27.1      0.85  RPi 3> 1200
 64 Bit Version
 ARM v8-A53         1300   772.2   265.9   232.5   206.3    97.8      1.47  RPi 3$ 1300
 Atom Z3745         1866  1031.2   480.0   429.8   378.6   154.7

 Linux using CC
 Intel Atom         1666   480.3   217.6   189.9   162.2    59.7
 Core 2             2400  2264.7  1039.3   822.9   606.4   161.6

 Linux using older GCC
 Intel Atom         1666   465.2   212.2   185.1   157.4    49.7
 Core 2             2400  2384.9  1038.1   805.8   582.1   161.0
 Core i7 4820K      3900  5551.3  2196.8  1712.4  1286.6   415.3
 MFLOPS for 24 loops

 Raspberry Pi 700 MHz
   66.1   79.8  132.8  141.1   23.8   29.3  110.8  129.7   90.2   38.7   32.0   25.2
   22.1   16.6   61.0   58.6   81.5   59.8   73.5   42.2   29.9   22.5   66.4   29.5
 Raspberry Pi 1000 MHz
   97.0  116.2  197.2  206.0   37.4   47.2  169.0  185.6  132.6   57.4   46.2   35.9
   32.7   32.0   89.7   85.6  118.4   88.8  107.1   75.6   47.6   32.4  106.0   42.6
 Raspberry Pi 2 900 MHz
  114.1  129.1  221.7  218.0   84.7   96.8  196.3  248.0  155.2  137.4   74.2   63.6
   62.4   70.6  125.6  125.1  196.3  153.3  132.6  115.2   78.4   41.6  166.5   89.0
 Raspberry Pi 2 1000 MHz
  126.7  143.7  246.8  242.7   94.0  108.2  218.5  273.4  172.7  135.8   82.6   70.8
   69.0   78.3  140.1  139.3  218.5  170.7  147.7  128.6   80.0   46.7  184.5   99.1
 Raspberry Pi 2 900 MHz gcc 4.8
  132.0  163.4  223.8  220.6   85.4  126.3  217.5  212.5  189.9  123.4   99.3   56.0
   67.9   83.9  125.0  133.2  202.0  180.8  160.3  125.1   86.3   42.5  185.5  127.5
 Raspberry Pi 2 1000 MHz gcc 4.8
  139.0  166.2  244.9  243.7   88.1  140.1  232.0  234.5  210.7  136.1  109.1   61.6
   74.8   92.8  137.9  147.0  223.1  199.2  177.0  133.8   95.2   47.0  204.6  140.9
 Raspberry Pi 3 1200 MHz
  191.8  242.9  295.6  292.1  139.6  165.7  362.0  435.4  282.7  162.4  108.1   85.0
   82.1  107.0  223.8  208.4  358.6  277.4  208.8  201.9  113.9   55.6  305.2  148.6
 Raspberry Pi 3 1200 MHz gcc 4.8
  192.9  228.0  398.4  337.4  124.6  167.5  359.7  384.3  347.7  171.6  132.5   74.7
   83.9  109.1  225.4  221.2  307.9  288.6  202.2  211.9  114.7   56.9  300.2  170.1
 Raspberry Pi 3 1200 MHz gcc-6
 OpenSuse 64 Bit Working
  468.5  260.9  474.4  463.7  196.7  179.6  649.0  399.9  426.1  223.5  148.9  215.3
  109.0  140.8  256.3  226.2  386.4  454.5  291.7  246.1  273.6   99.5  316.7  183.3
 Gentoo 64 Bit Working
  462.9  256.1  465.6  454.5  193.1  178.4  627.3  366.3  417.9  215.1  146.2  211.3
  107.1  136.6  251.0  222.3  379.7  446.9  286.5  240.9  253.1   91.6  314.5  180.1

 Android
 ARM 926EJ 800 MHz
    5.6    6.4    6.2    6.1    4.6    4.9    5.9    6.1    6.0    9.0    5.8    3.9
    4.0    3.6    3.8    5.6    7.6    4.5    5.7    4.3    5.2    2.5    5.7    7.4
 ARM v7-A9 800 MHz
  172.6  127.5  253.2  248.6   71.6  141.2  197.6  190.4  202.3  109.2   55.2   51.2
   54.1   51.5  100.0  144.1  192.1  139.4  130.1  105.4  111.2   63.1  136.3   56.8
 ARM v7-A9 1200 MHz
  241.7  233.4  383.5  388.7   98.4  147.1  293.1  258.5  314.6  181.1   99.1   95.3
   80.6   68.1  171.6  226.9  346.2  176.9  202.6  184.9  119.5  102.1  200.9   88.5
 ARM v8-A53 1300 MHz
  163.4  243.4  272.1  270.3  109.5  111.2  282.2  389.0  360.6  219.6  124.0   61.8
   67.6   87.4   27.3  224.2  340.1  241.9  168.5  198.8  120.2  120.6  277.7   79.1
 ARM v8-A53 1300 MHz 64 Bit
  451.4  191.4  243.2  272.4  144.9  144.5  749.4  411.1  453.6  261.1  138.0  206.1
  122.5  130.1  215.0  249.8  411.6  395.4  241.7  248.1  152.8  118.7  317.2  103.7

 Linux using CC
 Intel Atom 1666 MHz
    308    297    480    468    206    175    312    308    406    125    169    140
     64    101    122    216    236    195    220    134    188     61    304     94
 Core 2 2400 MHz
   1952   1302   1583   1527    341   1186   2184   2263   2155   1184    800    795
    162    396    371    874   1341   1029    509    384   1597    174   1190    558

 Linux using older GCC
 Intel Atom 1666 MHz
    260    250    336    374    167    178    312    306    406    128    168    105
     64     99    121    212    228    194    224    134    197     56    304     99
 Core 2 2400 MHz
   1953   1223   1584   1534    343   1238   2192   2385   2147   1187    795    479
    161    396    276    956   1368    959    509    385   1385    165   1182    560

Memory Speed Benchmark - memspeedPiA6, memspeedPiA7, memSpdPi64

See Comparisons Below

The MemSpeed benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of cache and RAM data, normally sized 2 x 4 KB to 2 x 4 MB. Calculations are as shown in the results' headings. For the first two double precision tests, speed in Millions of Floating Point Operations Per Second (MFLOPS) can be calculated by dividing MB/second by 8 and 16. For single precision, divide by 4 and 8. A disassembly showed that Millions of [Assembler] Instructions Per Second (MIPS), for the first two integer tests, can be calculated by multiplying MB/second by 0.78 and 0.59. For the three copy tests, MIPS are MB/second times 0.344 for double precision and 0.688 for the other two. These calculations are shown below. Note that the changes in speed, as data size increases, indicate the sizes of the caches. As different instruction counts are produced with later NEON compilations, MOPS are shown for the first integer test.
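
The conversions described above can be sketched in a few lines of Python. This is an illustration only: the divisors and multipliers are those given in the text, the function names are hypothetical, and the MB/second values are the column maxima from the 700 MHz Raspberry Pi log that follows.

```python
def mflops(mb_per_sec, bytes_per_flop):
    """MFLOPS from MB/second, given bytes read per floating point operation."""
    return mb_per_sec / bytes_per_flop


def mips(mb_per_sec, instructions_per_byte):
    """MIPS from MB/second, using the factor found by disassembly."""
    return mb_per_sec * instructions_per_byte


# x[m]=x[m]+s*y[m]: divide by 8 (double) or 4 (single).
triad_dp = mflops(568, 8)
triad_sp = mflops(640, 4)
# x[m]=x[m]+y[m]: divide by 16 (double) or 8 (single).
add_dp = mflops(602, 16)
add_sp = mflops(731, 8)

# Integer tests: multiply by 0.78 and 0.59.
int_triad_mips = mips(930, 0.78)
int_add_mips = mips(1094, 0.59)
# Copy tests: multiply by 0.344 (double) or 0.688 (single and integer).
copy_dp_mips = mips(1230, 0.344)
copy_sp_mips = mips(465, 0.688)

print(round(triad_dp), round(triad_sp), round(add_dp), round(add_sp))
print(round(int_triad_mips), round(int_add_mips),
      round(copy_dp_mips), round(copy_sp_mips))
```

The rounded values agree with the Max MFLOPS and Max MIPS lines printed at the end of that log.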

The two executables are memspeedPiA6 for the Raspberry Pi and memspeedIL for Intel/Linux. Particularly for the latter, the default maximum of 8 MB might be too small to demonstrate RAM speed. For either, a run time parameter is provided to use more memory, up to 128, 256, 512 or 1024 MB, for example memspeedPiA6 MB 256 and memspeedIL MB 1024.

 Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz

     Memory Reading Speed Test 32 Bit Version 4 by Roy Longbottom

               Start of test Mon May 20 10:25:17 2013

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8    538    640    930    602    731   1094   1230    465    465 L1
      16    568    602    787    602    731   1023   1000    426    507
      32    292    256    310    276    262    330   1066    426    547 L2
      64    276    238    276    262    238    292    341    269    284
     128    189    170    193    182    170    200    222    196    204
     256    140    129    142    136    129    144    138    119    124 RAM
     512    138    127    138    134    127    144    131    111    119
    1024    136    127    138    134    127    144    124    111    119
    2048    136    127    138    132    128    144    128    111    121
    4096    136    128    138    134    126    144    128    111    119
    8192    138    127    138    136    127    144    126    111    119

               End of test Mon May 20 10:26:06 2013

Max MFLOPS   71    160            38     91
Max MIPS                  725                  645    423    320    320
Max MOPS                  233

Memory Speed Comparison

The first results below are for the Raspberry Pi at the maximum overclocked settings. The overheads on repetitively running the tests cause variations in speeds of the lower data sizes but average overclocked speed gain, using L1 cache, is 1.41 times, compared with 1.43 times CPU MHz. Average RAM speed gains are 1.53 times, similar to expectations. A surprise is for L2 cache based data, where the average gain is 1.72 times and some speeds appear to be faster than using L1 cache.
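
As a worked example of how such an average gain can be derived, the Python sketch below takes the 8 KB and 16 KB rows (data in L1 cache) from the 700 MHz log above and from the 1000 MHz log below, and averages the nine speed ratios of each row. Treating those two rows as the L1 cache sample, and using a plain mean of ratios, are assumptions made for illustration; the function name `average_gain` is hypothetical.

```python
# 8 KB and 16 KB rows copied from the 700 MHz and 1000 MHz MemSpeed logs.
default_l1 = [
    [538, 640, 930, 602, 731, 1094, 1230, 465, 465],
    [568, 602, 787, 602, 731, 1023, 1000, 426, 507],
]
overclocked_l1 = [
    [602, 640, 1185, 930, 1163, 1662, 1422, 511, 761],
    [787, 930, 1292, 853, 1023, 1523, 1777, 537, 761],
]


def average_gain(base_rows, new_rows):
    """Mean of the element by element speed ratios new / base."""
    ratios = [new / base
              for base_row, new_row in zip(base_rows, new_rows)
              for base, new in zip(base_row, new_row)]
    return sum(ratios) / len(ratios)


l1_gain = average_gain(default_l1, overclocked_l1)
mhz_ratio = 1000 / 700
print(f"L1 gain {l1_gain:.2f}, CPU MHz ratio {mhz_ratio:.2f}")
```

On these assumptions the result matches the 1.41 times quoted above, against a CPU MHz ratio of 1.43.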

Comparing 900 MHz Raspberry Pi 2 results, from gcc 4.8 (PiA7), with the original system, at 700 MHz, indicates average performance gains of 3.3, 5.3 and 3.8 times for L1 cache, L2 cache and RAM based data, increased from the old PiA6 version at 2.4, 4.5 and 3.5 times. The first calculations are the same as those that determine Linpack benchmark speeds. In this case, gcc 4.8 single precision speeds are again slower than using the original benchmarks (PiA7 vfma.f32 instructions and PiA6 fmacs). The PiA7 integer calculations provide the highest performance gains, from cached data, the test loop containing 2 vector loads to quad word registers (vld1.32), 2 vector adds (vadd.i32) and one vector store (vst1.32), compared with 8 loads, 8 adds and 4 stores in PiA6.

Results for a version compiled to use NEON instructions, providing some of the fastest speeds, are included below. For more details see MemSpeed NEON.

Later results are for the same code compiled for Android devices, less the copy tests, where the later ARM systems are considerably faster. In this case, the Pi performs relatively well on single precision floating point. For other results see Android Benchmarks.htm.

The other results are using the Intel/Linux version, where speeds are generally much faster. An exception is L1 cache speed using single precision floating point, where the Pi is faster than the Atom on a MFLOPS/MHz basis. For older PC speeds that are slower than the Raspberry Pi see MemSpd2k results.htm.

Compared to default speed Raspberry Pi 2 results, RPi 3 L1 cache performance is not much faster than the 1.33 times clock MHz ratio, but L2 results are more than twice as fast, where RPi 3 L2 and L1 cache speeds are similar. RPi 3 RAM MB/second measurements indicate an average improvement of 2.5 times, where memory clock speed is double.

The Cortex-A53 based Android tablet 64 bit performance is generally much faster than from the 32 bit compilation, but the 32 bit compiler is not as effective as that used for the Raspberry Pi. Best 64 bit gains are when using 64 bit double precision numbers, where cache based speed can be twice that from the RPi 3.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - Results below include benchmarks compiled with gcc 4.8 and gcc 6, run via SUSE, and the latter using Gentoo. These are followed by comparisons of L1 cache, L2 cache and RAM speeds. The first is for SUSE/Gentoo, which are essentially the same. The next comparisons are for gcc 6/gcc 4.8, then gcc 6/32 bit A7. The former indicates gains on DP calculations using caches, and the latter on all L1 cache speeds and L2 cache speeds, other than for some integer tests. Both are more efficient running the last data copying procedures. For the first DP tests, gcc 4.8 and gcc 6 both use 64 bit fused multiply and add vector instructions, gcc 4.8 being slower due to using additional and different load instructions - see Assembly Code.

Compile options to use NEON instructions for MemSpeed NEON are not available at 64 bit working.

 Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8    602    640   1185    930   1163   1662   1422    511    761 L1
      16    787    930   1292    853   1023   1523   1777    537    761
      32    487    426    487    465    426    568   1939    820   1142 L2
      64    465    393    465    426    393    511    592    457    508
     128    330    310    341    320    301    365    341    301    341
     256    208    200    213    204    200    217    196    170    189 RAM
     512    204    200    213    200    200    213    196    176    182
    1024    213    200    208    200    200    217    196    170    182
    2048    204    196    213    204    200    217    196    170    182
    4096    204    200    213    200    200    217    196    170    182
    8192    204    200    213    200    200    218    204    169    182

Max MFLOPS   98    232            58    145
Max MIPS                 1007                  980    667    563    785
Max MOPS                  323

 ############################## RPi 2 ##################################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 PiA6
       8    731   1280   1142   2133   1454   1422   2666   1523   1641 L1
      16   1066   1333   1292   1969   1406   1523   2666   1523   1641
      32   1023   1293   1094   1828   1333   1406   2051   1428   1428
      64    930   1016   1067   1662   1185   1230   1230   1333   1333 L2
     128    853   1016   1023   1524   1186   1186   1163   1454   1333
     256    853   1068    930   1423   1186   1186   1143   1455   1455
     512    602    853    787   1168    853    930   1144   1027   1066
    1024    365    512    393    465    538    426    984    511    465 RAM
    2048    310    445    310    353    465    330    853    496    496
    4096    301    445    301    341    445    330    834    546    511
    8192    307    446    317    351    446    338    945    580    580

Max MFLOPS  133    333
Max MOPS                  323

 PiA7
       8    929    832   2047   2044   1366   2862   2035   2690   2845
      16   1398   1197   2050   2049   1368   2868   2044   2861   2861
      32   1264   1094   1768   1773   1227   2272   1700   2159   2160
      64   1195   1042   1634   1635   1161   1997   1450   1479   1488
     128   1133    991   1512   1526   1095   1792   1154   1121   1124
     256    961    981   1500   1506   1089   1787   1132   1078   1064
     512    629    669    895    878    717    979   1146    786    788
    1024    400    396    470    458    413    496    943    642    644
    2048    326    313    357    354    328    374    958    678    678
    4096    322    311    354    351    326    372    954    721    718
    8192    325    311    355    353    327    372    952    732    733

Max MFLOPS  175    299
Max MOPS                  512

 ########################### RPi 2 OC ##################################

 Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 PiA6
       8    682    853   1306   2327   1523   1777   2909   1777   1777
      16   1185   1406   1306   2327   1523   1777   2666   1777   1777
      32   1185   1333   1333   1969   1523   1523   2279   1641   1641
      64   1023   1293   1094   1778   1333   1306   1882   1599   1428
     128   1023   1186   1094   1641   1230   1333   1641   1524   1539
     256    930   1142   1016   1642   1333   1333   1778   1429   1524
     512    682    930    787   1094    930    930   1642   1068    984
    1024    465    602    487    568    639    538   1168    618    618
    2048    379    538    409    465    538    409    914    597    597
    4096    379    538    379    445    538    409    904    658    682
    8192    378    546    393    446    546    427    819    750    760

Max MFLOPS  148    351
Max MOPS                  333

 PiA7
       8    918    928   2261   2258   1509   3162   2248   3142   3143
      16   1547   1322   2265   2264   1511   3168   2258   3160   3160
      32   1536   1314   2251   2245   1501   3146   2247   3141   3130
      64   1296   1135   1773   1776   1263   2134   1795   1789   1797
     128   1226   1098   1679   1676   1213   1996   1822   1483   1486
     256   1013    985   1442   1446   1083   1672   1549   1311   1304
     512    568    553    694    682    579    742   1371    989    993
    1024    473    465    550    548    485    591   1279    913    916
    2048    413    400    459    456    415    484    943    688    688
    4096    410    398    455    446    411    480    871    620    620
    8192    411    399    457    454    412    482    847    601    600

Max MFLOPS  193    330           142    189
Max MOPS                  566

 ########################### RPi 2 NEON #################################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8    918   1778   2031   2029   2369   2838   2020   2825   2823
      16   1388   1781   2034   2034   2374   2847   2029   2840   2828
      32   1380   1768   2021   2020   2357   2811   2024   2832   2831
      64   1169   1435   1595   1597   1785   1924   1573   1392   1391
     128   1124   1366   1509   1513   1688   1794   1608    990    986
     256    875   1163   1270   1269   1391   1460   1163    892    900
     512    675    886    953    941   1022   1074   1081    776    785
    1024    363    401    409    399    419    428    904    596    596
    2048    318    338    341    343    355    362    751    539    541
    4096    316    333    339    339    351    359    720    501    503
    8192    317    334    340    340    352    361    709    483    484

Max MFLOPS  174    445           127    297
Max MOPS                  509

 ######################## RPi 2 NEON OC ################################

 Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2

     Memory Reading Speed Test NEON 32 Bit Version 1 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 NEON
       8   1542   1963   2257   2253   2633   3143   1672   2143   3078
      16   1542   1978   2248   2258   2638   3163   2247   3111   3116
      32   1402   1744   1961   1965   2221   2481   1958   2532   2534
      64   1303   1596   1770   1778   1988   2146   1700   1756   1756
     128   1242   1508   1665   1667   1862   1977   1599   1458   1467
     256    976   1276   1376   1395   1532   1483   1610   1313   1315
     512    756    966   1031   1020   1111   1156   1643   1099   1107
    1024    476    544    569    554    584    606   1376    953    956
    2048    401    432    447    444    458    471   1268    968    967
    4096    401    429    443    436    455    466   1239   1043   1039
    8192    404    434    448    446    460    472   1001    777    779

Max MFLOPS  193    493           141    330
Max MOPS                  562

 ############################## RPi 3 ##################################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 PiA6
  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   1523   1777   1828   2461   1969   2327   3657   2285   2461 L1
      16   1662   1777   1828   2285   2133   2327   3846   2285   2461
      32   1662   1777   1939   2461   1969   2327   3657   2461   2381
      64   1524   1641   1778   2133   1969   1969   3657   2279   2285 L2
     128   1524   1778   1828   2328   1829   2133   3657   2279   2279
     256   1525   1779   1828   2327   1828   2001   3657   2280   2286
     512   1456   1642   1779   2133   1832   1969   3413   2287   2135
    1024    930   1094   1094   1236   1186   1186   1232   1144   1070 RAM
    2048    930    992   1023   1102   1102    853   1066    914    921
    4096    930   1023   1092   1102   1102   1102    834    837    834
    8192    893    983   1071   1111   1160   1071    976    945    877

Max MFLOPS  208    444
Max MOPS                  485

 PiA7
       8   1619   1812   3448   2375   2237   3793   2698   3121   3147
      16   1621   1814   3459   2379   2240   3793   2710   3136   3162
      32   1577   1743   3243   2277   2132   3138   2702   3123   3131
      64   1537   1690   3126   2196   2047   3362   2566   2890   2917
     128   1570   1714   3257   2243   2076   3502   2624   2993   3027
     256   1573   1720   3285   2261   2084   3522   2652   3071   2930
     512   1453   1598   2785   2055   1906   2081   2430   2783   2815
    1024    918   1097   1327   1204   1185   1355   1606   1261   1263
    2048    891   1032   1224   1133   1113   1191    882    811    817
    4096    885   1023   1223   1127   1104   1201    787    756    755
    8192    876   1019   1225   1118    954   1203    876    871    873

Max MFLOPS  203    454           149    280
Max MOPS                  865

 ########################### RPi 3 NEON #################################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

     Memory Reading Speed Test NEON 32 Bit Version 1 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   1627   2387   3467   2387   3181   3812   2713   3164   3149
      16   1621   2377   3457   2377   3169   3805   2713   3164   3165
      32   1577   2273   3238   2280   2985   3535   2647   3103   3105
      64   1526   2150   3018   2157   2793   3256   2568   2921   2921
     128   1554   2217   3190   2216   2925   3436   2631   3028   3029
     256   1561   2228   3225   2221   2948   3471   2654   3077   3077
     512   1434   2010   2742   1978   2534   2313   2468   2840   2840
    1024    950   1227   1324   1182   1306   1339   1581   1298   1298
    2048    935   1136   1215   1128   1212   1214    915    880    885
    4096    913   1121   1180   1131   1213   1212    825    844    842
    8192    926   1134   1212   1126    936   1199    792    774    790

Max MFLOPS  203    594           149    396
Max MOPS                  864

 ############################# RPi 3 SUSE ##############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Compiled for 64 bit ARM v8a

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   4224   2547   3813   5433   3469   4237   4717   3793   3794 L1
      16   4211   2546   3820   5423   3469   4236   4759   3815   3815
      32   3380   2287   3225   4132   3003   3526   4603   3752   3752
      64   3290   2266   3179   3994   2966   3451   4539   3724   3723 L2
     128   3386   2321   3301   4039   3076   3567   4359   3589   3590
     256   3342   2346   3359   4096   3132   3643   4355   3593   3593
     512   2961   2070   2824   3371   2640   3025   3599   3087   3082
    1024    757   1268   1344   1331   1341   1369   1487   1479   1419 RAM
    2048    756    959   1227   1193   1226   1254   1134   1237   1212
    4096    699    952   1230   1226   1226   1248   1063   1173   1165
    8192    754   1169   1203   1206   1207   1210   1036   1045   1033

Max MFLOPS  528    637           340    433
Max MOPS                  955

 ############################## RPi 3 Gentoo #############################

       8   4158   2503   3749   5341   3411   4164   4639   3729   3730 L1
      16   4014   2506   3758   5359   3416   4174   4675   3751   3750
      32   3925   2483   3722   5307   3384   4125   4665   3712   3712
      64   3253   2301   3271   4121   3043   3581   4342   3544   3535 L2
     128   3196   2360   3394   4190   3165   3719   4221   3487   3484
     256   3125   2385   3437   4225   3201   3767   4215   3501   3504
     512    672   2079   2937   3551   2725   3223   3858   3249   3255
    1024    618   1189   1266   1265   1255   1274   1156   1433   1355 RAM
    2048    607   1133   1183   1162   1178   1194    978   1027   1026
    4096    619   1135   1185   1170   1175   1200    995   1060   1048
    8192    554   1140   1189   1171   1178   1206   1009   1081   1081

Max MFLOPS  520    627           335    427
Max MOPS                  940

 ############################# RPi 3 SUSE ##############################

 Compiled for 64 bit ARM v8a+fp+sim

       8   2726   2544   3468   4013   3468   4233   4206   3791   3788 L1
      16   2728   2552   3477   4026   3478   4247   4232   3814   3814
      32   2557   2392   3190   3611   3191   3812   4248   3819   3822
      64   2416   2248   2961   3246   2961   3478   4037   3725   3728 L2
     128   2452   2276   3038   3283   3025   3530   3908   3567   3566
     256   2414   2313   3093   3350   3088   3600   3940   3594   3594
     512   2156   2027   2603   2779   2583   2989   3473   3255   3075
    1024    707    954   1315   1330   1314   1330   1597   1591   1538 RAM
    2048    704    955   1146   1148   1134   1156   1038   1039   1037
    4096    697    983   1136   1135   1109   1142    843    907    898
    8192    694   1106   1140   1135   1141   1136    877    957    940

Max MFLOPS  341    636           251    435
Max MOPS                  869

 ######################## 64 Bit Comparison #############################

                x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
                 Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
 L1 16KB
 SUSE/Gentoo     1.05   1.02   1.02   1.01   1.02   1.01   1.02   1.02   1.02
 SUSE/gcc 4.8    1.54   1.00   1.10   1.35   1.00   1.00   1.12   1.00   1.00
 SUSE/32 bit     2.60   1.40   1.10   2.28   1.55   1.12   1.76   1.22   1.21

 L2 256 KB
 SUSE/Gentoo     1.07   0.98   0.98   0.97   0.98   0.97   1.03   1.03   1.03
 SUSE/gcc 4.8    1.38   1.01   1.09   1.22   1.01   1.01   1.11   1.00   1.00
 SUSE/32 bit     2.12   1.36   1.02   1.81   1.50   1.03   1.64   1.17   1.23

 RAM 4 MB
 SUSE/Gentoo     1.13   0.84   1.04   1.05   1.04   1.04   1.07   1.11   1.11
 SUSE/gcc 4.8    1.00   0.97   1.08   1.08   1.11   1.09   1.26   1.29   1.30
 SUSE/32 bit     0.79   0.93   1.01   1.09   1.11   1.04   1.35   1.55   1.54

 ######################## Other Cortex A53 ##############################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.16
 Compiled for 32 bit ARM v7a

         Reading Speed in MBytes/Second
  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]
  KBytes   Dble   Sngl    Int   Dble   Sngl    Int

      16   1940    971   1693   2470   1278   2084 L1
      32   1879    955   1676   2378   1255   1967
      64   1801    938   1615   2254   1218   1912 L2
     128   1706    941   1620   2279   1224   1872
     256   1818    935   1570   2291   1155   1875
     512   1633    884   1451   2008   1132   1704
    1024   1276    781   1181   1454    938   1324 RAM
    4096   1335    808   1260   1533   1010   1386
   16384   1342    813   1270   1487   1013   1419
   65536   1346    809   1274   1546   1031   1252

Max MFLOPS  242    243           154    160
Max MOPS                  419

 ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.29
 Compiled for 64 bit ARM v8a

      16   4092   2198   3951   5293   3611   4408
      32   3753   2496   3630   4651   3300   3992
      64   3407   2388   3368   3715   3023   3677
     128   3496   2462   3521   4137   3139   3844
     256   3535   2481   3573   4199   3322   3911
     512   3054   2248   3126   3556   2548   3372
    1024   1714   1704   2029   2069   1854   2099
    4096   1832   1595   1841   1914   1780   1897
   16384   1844   1601   1850   1925   1798   1891
   65536   1859   1608   1837   1921   1795   1812

Max MFLOPS  512    624           331    451
Max MOPS                  988

 ############################# Other ####################################

 Android MemSpeed Benchmark 17-Oct-2012 20.19
 ARM Cortex-A9 1300 MHz, 1 GB DDR3 RAM

         Reading Speed in MBytes/Second
  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]
  KBytes   Dble   Sngl    Int   Dble   Sngl    Int

      16   1735    888   2456   2726   1364   2818 L1
      32   1448    760   1474   1700   1039   1648
      64   1318    719   1290   1468    952   1385 L2
     128   1279    715   1289   1443    944   1336
     256   1268    714   1279   1435    943   1313
     512   1158    691   1204   1321    892   1228
    1024    729    553    735    772    632    742
    4096    445    392    425    442    421    439 RAM
   16384    435    390    428    435    412    431
   65536    445    404    393    450    432    449

 Intel Atom 1666 MHz memspeedIL

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   1720    853   2150   2203   1086   3686   1379   1851   1785 L1
      16   1612    825   2051   2150   1075   2962   1599   1777   1612
      32   1517    825   1785   2019   1041   2666   1290   1388   1379 L2
      64   1470    825   1785   2051   1041   2580   1379   1333   1646
     128   1724    948   2272   2580   1358   3463   1612   1785   1785
     256   1725    948   2299   2499   1403   3572   1613   1731   1785
     512   1624    914   2151   2349   1315   3228   1533   1670   1668
    1024   1590    882   1990   2155   1296   2515   1251   1292   1292 RAM
    2048   1590    882   1998   2095   1263   2235   1081   1117   1076
    4096   1553    914   1951   2111   1279   2180   1076   1084   1055
    8192   1592    910   1985   2113   1279   2171   1092   1085   1119

 Core 2 2400 MHz, Dual channel DDR2 RAM, memspeedIL

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8  17427   6736   6249  12498   6450   6399  12498   6348   6348 L1
      16  13839   6450   6249  12498   6450   6450  12985   6348   6249
      32  16664   6249   6450  13134   6399   6450  12498   6348   6143
      64  10751   4999   5262   7528   4999   5332   5119   3555   3555 L2
     128   7831   4999   5332   7313   4999   5333   5119   3703   3703
     256  11494   4999   5332   7691   4999   5333   5208   3555   3656
     512  11347   5160   5333   7313   4999   5264   5209   3555   3656
    1024   9142   5160   5333   7699   5160   5332   5211   3707   3656
    2048  10239   5007   5341   7528   4949   5341   5119   3555   3451
    4096   7110   4790   5023   6920   4790   5023   4013   3135   3236
    8192   3949   3686   3813   4031   3794   3794   2047   2015   1974 RAM

Bus Speed Benchmark - busspeedPiA6, busspeedPiA7, busSpdPi64

See Comparisons Below

This benchmark is designed to identify reading data in bursts over buses. The program starts by reading a word (4 bytes) with an address increment of 32 words (128 bytes) before reading another word. The increment is reduced by half on successive tests, until all data is read.
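
The sequence of tests described above can be pictured with the short Python sketch below. It only counts which words would be read for each address increment and does no timing; it is not the benchmark's C code, and the names `WORD_BYTES` and `read_patterns` are hypothetical.

```python
WORD_BYTES = 4  # the benchmark reads 4 byte words


def read_patterns(kbytes, start_inc=32):
    """Return {increment in words: number of words read} for one data size."""
    total_words = kbytes * 1024 // WORD_BYTES
    patterns = {}
    inc = start_inc
    while inc >= 1:
        # One word is read at index 0, inc, 2 * inc, ... to the end of the data.
        patterns[inc] = len(range(0, total_words, inc))
        inc //= 2  # the increment is halved for the next test
    return patterns


patterns = read_patterns(16)  # 16 KB of data
for inc, words in patterns.items():
    label = "Read All" if inc == 1 else f"Inc{inc}"
    print(f"{label:8s} {words:5d} words read, {inc * WORD_BYTES:3d} byte stride")
```

The six increments correspond to the Inc32, Inc16, Inc8, Inc4, Inc2 and Read All columns in the result logs.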

Maximum MB/second data transfer speed is calculated as bus clock MHz x 2 for Double Data Rate (DDR) x bus width (at this time 4 bytes ARM, 8 bytes Intel) x number of memory channels. However, some of these specifications can be misleading and maximum speed options might not be provided on a particular platform. Where the maximum is not provided, there can be confusion as to whether the specified MHz is the raw bus clock speed or already includes the DDR doubling.
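
As a worked example of this formula, the Python sketch below reproduces some of the Max MB/sec figures in the table that follows. The function name and parameters are hypothetical, and treating the x16# entry as 8 bytes on each of two channels is an assumption based on the "# dual channel" note.

```python
def max_mb_per_sec(bus_mhz, width_bytes, channels=1, ddr=True):
    """Theoretical peak: bus MHz x 2 (if DDR) x bus width x memory channels."""
    return bus_mhz * (2 if ddr else 1) * width_bytes * channels


old_atom = max_mb_per_sec(400, 8)                 # 8 byte Intel bus
atom_z3745 = max_mb_per_sec(533, 8, channels=2)   # assumed 2 x 8 bytes
raspberry_pi = max_mb_per_sec(400, 4)             # original Pi, 4 byte bus
raspberry_pi_2 = max_mb_per_sec(450, 4)

print(old_atom, atom_z3745, raspberry_pi, raspberry_pi_2)
```

Entries quoted as 666 MHz give 5328 rather than the tabulated 5333, which suggests the table was calculated from the unrounded clock of about 666.67 MHz.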

One thing is quite clear, and that is multiple threads or programs are required to demonstrate highest obtainable throughput and this should be less than maximum specified speed due to start up (CAS latency) and other overheads. In order to minimise CPU time influence, estimates of maximum MB/second can be calculated from burst speeds (as shown below for 16 word address increments), and these should normally be greater than the Read All results. In the original benchmark, all threads started reading from word one, but this could lead to unreasonably fast speeds when shared L2 or L3 caches were provided. Results below are for a revised benchmark with staggered starting addresses, for example 4 threads at 3 MB intervals using 12 MB RAM.
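
The "Max est" column in the table below is 16 times the Inc16 speed. A minimal Python sketch of that estimate follows, using the 4 thread figures from the table. The reasoning assumed here is that, with a 16 word increment, only one 4 byte word in every 64 bytes is counted, so if the hardware transfers the whole 64 bytes the bus is moving 16 times the reported figure; the names used are hypothetical.

```python
WORDS_PER_STEP = 16  # one word counted out of every 16 word (64 byte) step

# (Inc16 MB/second, Read All MB/second) for the 4 thread runs in the table.
four_thread = {
    "Atom Z3745": (455, 6971),
    "Nexus 7 Cortex-A9": (68, 835),
    "Kindle HDX 7": (605, 8243),
    "Lenovo Tab 2": (277, 4328),
    "Moto G4": (353, 6058),
    "Raspberry Pi 2": (98, 1186),
    "Raspberry Pi 3": (137, 2383),
}


def estimated_max(inc16_mb_per_sec):
    """Estimated maximum MB/second from the Inc16 burst reading speed."""
    return WORDS_PER_STEP * inc16_mb_per_sec


for system, (inc16, read_all) in four_thread.items():
    estimate = estimated_max(inc16)
    note = "above" if estimate > read_all else "below"
    print(f"{system:18s} estimate {estimate:5d} MB/s, {note} Read All {read_all}")
```

For most systems the estimate is above the Read All speed, as expected, but the Moto G4 and Raspberry Pi 3 entries are exceptions.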

Multithreaded benchmark results are provided below to help to identify why the single core BusSpeed benchmark speeds might be different from expectation. For comparison purposes, results are included for Android MP Benchmarks besides Raspberry Pi MP-BusSpeed.

                                                                    Bus
                 Inc16   Inc8   Inc4   Inc2   Read  Clock    DDR  Width    Max 
                 Words  Words  Words  Words    All    MHz     x2  Bytes  MB/sec
   
   Old Atom        262    541   1048   1973   3262    400    800   x8     6400

   Atom 1.86 GHz
   Z3745           275    611   1183   2328   3922    533   1066   x16#  17056
   2 Threads       435    787   1671   3323   6507
   4 Threads       455    884   1754   3490   6971    Max est 16 x 455    7280

   Nexus 7  1.2 GHz 
   Cortex-A9        51     81    126    172    330    666   1333   x4     5333
   2 Threads        67    107    196    335    620
   4 Threads        68    108    215    426    835    Max est 16 x 68     1088

   Kindle HDX 7 2.15 GHz
   Snapdragon 800  406    516    899   1663   2929    933   1866   x8#   14928
   2 Threads       541    962   1569   2851   4776
   4 Threads       605   1109   2439   4161   8243    Max est 16 x 605    9680

   Lenovo Tab 2 1.3 GHz
   Cortex-A53      175    344    677   1285   1979    666   1333   x4     5333
   2 Threads       241    479    968   1883   3724
   4 Threads       277    556   1130   2126   4328    Max est 16 x 277    4432
  
   Moto G4 1.5 GHz
   Cortex-A53      172    339    658   1247   2014    933   1866   x4     7466
   2 Threads       307    591   1124   2192   3839
   4 Threads       353    813   1692   3015   6058    Max est 16 x 353    5648

   Raspberry Pi 2 0.9 GHz
   ARM-V7           71    159    281    628   1147    450    900   x4     3600
   2 Threads        87    177    311    697   1256
   4 Threads        98    191    297    700   1186    Max est 16 x 98     1568

   Raspberry Pi 3 1.2 GHz
   Cortex-A53      136    263    513   1047   2080    450    900   x4     3600
   2 Threads       138    276    554   1108   2149
   4 Threads       137    269    536   1169   2383    Max est 16 x 137    2192

                                                                # dual channel

Below are the Raspberry Pi results from busSpeed.txt log file, running at the default speed settings. The program main test had 64 C statements that translate into 64 load and 64 AND instructions. With loop overheads that translates to 132 instructions on 256 bytes, where MIPS will be MB/second x 0.516.

The results suggest that data transfer bursts are 32 bytes (8 transfers of 4 bytes), with a possible maximum speed of 8 x 34 = 272 MB/second, at this single core level. They imply that there is also burst reading from caches besides using RAM, and performance of the latter is not very good, with this single core CPU.
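
The two figures quoted above follow from simple arithmetic, sketched here in Python. The values are those given in the text and in the log below; reading the equal Inc32, Inc16 and Inc8 RAM speeds of about 34 MB/second as one 32 byte burst per word read is the interpretation being illustrated, not a measured fact.

```python
# MIPS factor: 132 instructions are executed for every 256 bytes read.
INSTRUCTIONS_PER_PASS = 132
BYTES_PER_PASS = 256
mips_per_mb = INSTRUCTIONS_PER_PASS / BYTES_PER_PASS

# Burst estimate: Inc8 reads one 4 byte word in every 32 bytes at 34 MB/second
# from RAM, so a full 32 byte burst would carry 8 times as much data.
WORD_BYTES = 4
BURST_BYTES = 32
inc8_ram_mb_per_sec = 34
transfers_per_burst = BURST_BYTES // WORD_BYTES
burst_max_mb_per_sec = transfers_per_burst * inc8_ram_mb_per_sec

print(f"MIPS = MB/second x {mips_per_mb:.3f}")
print(f"Possible maximum {burst_max_mb_per_sec} MB/second")
```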

 Raspberry Pi CPU 700 MHz, Core 400 MHz, SDRAM 400 MHz
 Maximum speed 400 x 2 (DDR) x 4 Width = 3.2 GB/sec

   BusSpeed 32 Bit V1.1 Wed May 22 15:28:01 2013

    Reading Speed 4 Byte Words in MBytes/Second

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16    290    304    568    984   1125   1142 L1
      32    133    116    131    133    225    465 L2
      64    116     98    116    109    192    409
     128     60     54     62     68    126    273
     256     34     34     34     43     88    192 RAM
     512     34     34     34     45     91    200
    1024     34     31     34     45     91    181
    4096     32     33     33     45     87    183
   16384     32     32     34     44     83    186
   65536     34     32     34     44     88    186

         End of test Wed May 22 15:28:13 2013

Bus Speed Comparison

The first one for comparison is the overclocked Pi, where most results are as might be expected at the higher clock frequencies but, again, with some L2 cache speeds quite a bit faster.

Raspberry Pi 2 results are shown with the CPU at 900 MHz and overclocked to 1 GHz, with corresponding SDRAM frequencies of 450 and 500 MHz. The busspeedPiA6 speeds are most unusual, in that reading all data is slower than reading every other word. Assembly Code appears to show that there is little difference in generated instructions, from the two versions, except PiA7 uses negative indexing. Comparisons, shown with PiA7 1 GHz details, suggest that speed from RAM is at least 2.5 times faster from gcc 4.8. The other comparisons are for busspeedPiA6, where the highest performance gains, of RPi 2, are via data in L2 cache.

Next results are for one CPU core on a Nexus 7, with a 1300 MHz ARM Cortex-A9 processor. The overclocked Pi is not too far away on RAM performance but falls behind on L1 and L2 cache based data.

The two Intel examples are clearly much faster but BusSpd2k Results on PCs provides results on older systems where the Raspberry Pi is the winner (ignore the last two columns for MMX instructions). There are also results of some slower systems in Android Benchmarks.htm.

On the Raspberry Pi 3, busspeedPiA6, and the newer busspeedPiA7 benchmark, demonstrate almost identical performance. With the former, considering just the Read All results, the RPi 3 is shown to average 2.85 times faster than the RPi 2 using cache based data and 5.26 times from RAM. Corresponding ratios using PiA7 are 2.31 and 1.42 times.
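The quoted averages can be reproduced from the busspeedPiA6 Read All columns of the RPi 2 (900 MHz) and RPi 3 results. The C sketch below is only an illustration: the helper name mean_ratio is hypothetical, and the split of rows into cache based (16 to 512 KB) and RAM based (1024 to 65536 KB) is an assumption that happens to match the published 2.85 and 5.26 figures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: mean of the row by row ratios newer[i] / older[i]. */
double mean_ratio(const double *newer, const double *older, size_t n)
{
    assert(newer != NULL && older != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += newer[i] / older[i];
    return sum / (double)n;
}

/* busspeedPiA6 Read All speeds in MB/second, copied from the results tables.
   Assumed grouping: 16 to 512 KB as cache, 1024 to 65536 KB as RAM. */
static const double rpi2_cache[] = {1489, 1641, 1365, 1191, 1169, 782};
static const double rpi3_cache[] = {4266, 3657, 3276, 3413, 3414, 2983};
static const double rpi2_ram[]   = {412, 322, 335, 335};
static const double rpi3_ram[]   = {1879, 1852, 1806, 1789};
```

Calling mean_ratio(rpi3_cache, rpi2_cache, 6) gives approximately 2.85 and mean_ratio(rpi3_ram, rpi2_ram, 4) approximately 5.26. The same calculation on the PiA7 columns gives the 2.31 and 1.42 ratios.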

The 32 bit compiler, used for the Cortex-A53 based tablet tests, produced different performance characteristics to those used on the Raspberry Pi, some better scores and some worse. The same might apply to the 64 bit version, but results from RAM were faster than other A53 tablet and RPi 3 tests, the latter by an average of 50%.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - With the Read All tests being the most representative of data transfers normally used, comparisons are provided for these. There was not much difference in performance between 32 bit PiA7, 64 bit gcc 6 and 64 bit gcc 4.8 speeds. The one exception is at 16 KB data size, where gcc 6 tests were slow. The C code loop has 64 AND statements. Disassembly shows that the gcc 4.8 version has 64 AND and 65 load (ldr) instructions, using 8 w registers. The gcc 6 program has 64 AND, 19 load (ldr) and 23 load pair (ldp) instructions, using up to 16 w registers - more registers, fewer instructions but slower?

 Raspberry Pi CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16    290    387    984   1505   1575   1750  L1
      32    246    186    232    232    393    731  L2
      64    146    113    131    148    273    546
     128    102     87     93    113    210    420
     256     53     48     53     75    131    303  RAM
     512     48     48     50     75    137    300
    1024     48     50     49     69    139    305
    4096     50     52     52     72    134    299
   16384     48     52     52     69    139    296
   65536     49     52     49     72    139    291

 ############################## RPi 2 ##################################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All
 PiA6
      16   1346   1428   1575   1641   1706   1489  L1
      32    930    984   1163   1422   1489   1641
      64    426    372    630   1024   1365   1365  L2
     128    341    380    682   1137   1462   1191
     256    213    232    512    813   1191   1169
     512    129    136    273    570    840    782
    1024     73     83    167    360    685    412  RAM
    4096     63     76    152    293    629    322
   16384     69     74    149    314    599    335
   65536     69     78    148    279    629    335
 PiA7
      16    950   1509   1632   1726   1734   1738
      32   1240   1318   1437   1716   1633   1681
      64    419    429    747   1214   1479   1587
     128    386    411    702   1211   1572   1625
     256    367    399    691   1194   1573   1634
     512    138    164    313    598    990   1363
    1024     79     88    175    372    673   1264
    4096     66     76    154    300    632   1266
   16384     71     77    154    299    633   1264
   65536     71     76    154    297    633   1261

 ########################### RPi 2 OC ##################################

 Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read Pi2/Pi
  KBytes  Words  Words  Words  Words  Words    All  1 GHz
 PiA6
      16   1066   1662   1706   1975   1861   1896   1.08
      32    930   1163   1367   1706   1706   1861   2.55
      64    465    474    820   1219   1575   1462   2.68
     128    372    426    787   1241   1706   1490   3.55
     256    393    426    745   1260   1626   1491   4.92
     512    266    281    522    916   1367   1196   3.99
    1024    105    114    249    456    913    508   1.67
    4096     93    115    220    396    880    419   1.40
   16384    100    113    227    419    838    441   1.49
   65536     97    111    209    419    883    447   1.54
                                                    A7/A6
 PiA7                                               1 GHz
      16   1554   1662   1813   1894   1892   1894   1.00
      32    629    648    911   1328   1604   1756   0.94
      64    453    461    803   1245   1572   1752   1.20
     128    394    430    773   1284   1705   1783   1.20
     256    280    410    747   1306   1733   1798   1.21
     512    242    253    472    891   1335   1607   1.34
    1024    107    122    243    481    919   1287   2.53
    4096     95    108    216    420    886   1204   2.87
   16384     98    108    216    419    885   1205   2.73
   65536     99    109    216    419    888   1204   2.69

 ############################## RPi 3 ##################################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All
 PiA6
      16   3429   3555   3938   4266   4266   4266  L1
      32   1066   1066   1693   2625   3413   3657
      64    639    609   1125   1896   2978   3276  L2
     128    533    546   1023   1862   2844   3413
     256    533    525   1023   1706   2730   3414
     512    351    393    758   1310   2184   2983
    1024    123    136    274    548   1012   1879  RAM
    4096    100    117    254    471    943   1852
   16384    119    129    244    489    978   1806
   65536    122    123    258    479   1032   1789
 PiA7                                               /PiA6
      16   3335   3741   4075   4371   4388   4413   1.03
      32   1964   2229   2787   4271   4308   4311   1.18
      64    612    615   1121   1932   2880   3546   1.08
     128    570    573   1034   1803   2756   3467   1.02
     256    541    544    995   1758   2737   3457   1.01
     512    382    408    794   1360   2269   3105   1.04
    1024    128    136    256    533   1025   1945   1.04
    4096    109    125    245    482    961   1585   0.86
   16384    120    125    241    477    964   1744   0.97
   65536    120    125    243    477    947   1881   1.05

 ############################# RPi 3 SUSE ##############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All  /PiA7

      16   3370   3765   4085   4440   4477   3399   0.80
      32   2070   2222   2768   4314   4386   3389   0.93
      64    590    604   1138   1875   2866   3100   0.95
     128    559    568   1061   1784   2781   3135   0.92
     256    534    542   1023   1741   2759   3161   0.93
     512    477    485    948   1628   2648   3107   1.04
    1024    100    142    273    519   1082   2038   1.08
    4096     90    128    254    493    988   1935   1.04
   16384    123    128    253    495    999   1963   1.09
   65536    123    128    254    497    994   1980   1.11

 ############################ RPi 3 Gentoo #############################
                                                    /SUSE
      16   1927   3680   4011   4336   4394   3335   0.98
      32   2022   2159   2688   4171   4257   3299   0.97
      64    579    595   1121   1859   2835   3065   0.99
     128    549    557   1041   1750   2735   3082   0.98
     256    518    528   1001   1700   2701   3095   0.98
     512    384    397    788   1397   2284   2744   0.88
    1024    128    131    253    505   1010   1923   0.94
    4096     88    119    238    461    938   1737   0.90
   16384    115    116    238    455    929   1657   0.84
   65536    115    119    238    459    934   1764   0.89

 ############################# RPi 3 SUSE ##############################

 Compiled for 64 bit ARM v8a+fp+sim                 /SUSE64

      16   3275   3775   4021   4277   4330   4399   1.29
      32    914    966   1582   2441   3246   3771   1.11
      64    601    611   1144   1958   2899   3548   1.14
     128    559    567   1054   1824   2796   3471   1.01
     256    534    543   1019   1758   2744   3405   1.08
     512    319    348    682   1280   2164   3021   0.97
    1024    114    138    274    539   1064   2045   1.00
    4096     86    124    247    489    966   1788   0.92
   16384    121    123    247    488    971   1858   0.95
   65536    121    125    247    490    963   1736   0.88

 ######################## Other Cortex A53 ##############################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 10.57
 Compiled for 32 bit ARM v7a

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16    874    932   1814   2302   2355   2263  L1
      32    758    803   1309   1820   2323   2386
      64    653    671   1203   1741   2206   2332  L2
     128    603    620   1107   1693   2222   2351
     256    574    589   1075   1711   2211   2327
     512    332    372    681   1075   1863   2120
    1024    137    193    371    578   1322   2129  RAM
    4096    172    179    351    567   1151   2126
   16384    172    178    351    504   1117   2136
   65536    172    177    349    478    882   2129

 ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 11.02
 Compiled for 64 bit ARM v8a

      16   3188   3635   3937   4327   4372   4462
      32   1478   1607   2246   3382   3853   4144
      64    600    622   1163   2011   2972   3585
     128    558    575   1056   1889   2892   3525
     256    538    550   1028   1826   2837   3260
     512    371    425    813   1490   2403   3202
    1024    136    196    382    728   1423   2750
    4096    170    177    346    669   1340   2652
   16384    169    174    341    678   1352   2663
   65536    168    174    341    676   1347   2611

 ############################# Other ####################################

 Android BusSpeed Benchmark 19-Oct-2012 17.29

 ARM Cortex-A9 1300 MHz, 1 GB DDR3 RAM
 RAM 1 GB DDR3L-1333 Bandwidth 5.3 GB/sec

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   2723   2420   3044   3364   3499   3500  L1
      32   1054   1087   1061   1382   1565   2145
      64    436    433    419    652    751   1160  L2
     128    345    337    337    542    633    943
     256    329    309    322    522    614    961
     512    339    299    311    506    574    937
    1024    170    168    180    269    349    629
    4096     59     55     84    127    176    338  RAM
   16384     56     56     83    125    173    335
   65536     56     56     82    125    174    334

 Intel Atom 1666 MHz busspeedIL

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3703   5160   5881   6249   6399   6529  L1
      32    484    396    745   1474   2499   3931  L2
      64    484    393    787   1516   2482   3878
     128    491    410    775   1462   2509   3923
     256    492    415    775   1454   2540   3887
     512    225    327    606   1213   2184   3534
    1024    130    266    533   1034   1952   3306  RAM
    4096    126    262    524   1048   1941   3313
   16384    135    270    508   1048   1917   3276
   65536    135    262    541   1048   1973   3262

 Core 2 2400 MHz, Dual channel DDR2 RAM, busspeedIL

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   6535   5516   6059   6490   6205   6304  L1
      32   5925   3225   3938   6023   6094   5966
      64   1721   1305   2154   3047   4444   5269  L2
     128   1407   1333   2172   3033   4571   5333
     256   1538   1365   2206   3047   4432   5334
     512   1391   1376   2150   3102   4552   5336
    1024   1377   1376   2202   3104   4519   5460
    4096    731    814   1425   2206   3669   4882
   16384    345    380    761   1310   2530   4343  RAM
   65536    321    374    748   1310   2485   4066

FFT Benchmarks - fft1-RPi2, fft3c-Rpi2, fft1-RPi64, FFT3c-RPi64

In 2000, I provided optimised code for a Fast Fourier Transform program, resulting in a series of Windows benchmarks that provided graphical output - see fftgraf results.htm. The fastest one used SSE type assembly code that modern compilers can also produce. The new versions use all C code, with identical calculations compiled to run via Linux, Windows and Android. The benchmarks and source codes are in FFT Benchmarks.zip with further details and results from PCs, Android devices and RPi 2 in FFTBenchmarks.htm.

There are two benchmarks, FFT1, the original, and FFT3c, optimised, with 32 bit and 64 bit versions, when appropriate. Performance is measured in milliseconds, for FFTs sized 1K to 1024K, with three measurements using both single and double precision floating point data, plus some sumchecks for the largest ones. Results from a Raspberry Pi 2, at 900 MHz, are below. These are similar to a year 2000 Pentium III PC.

Raspberry Pi 3 average performance gains were similar to the clock speed ratio of 1.33.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - On running the newly compiled 64 bit versions on both systems, wide variations in performance were observed, with the smaller FFTs, where measured time is less than a millisecond. Full speed could be achieved by using “performance” CPU MHz setting (where available) or running another CPU bound program at the same time. These slower speeds also became apparent on 32 bit results, including via Raspberry Pi 2. All tests were repeated to run at maximum speeds, producing the results shown below. In some cases, the earlier slow measurements are also included.

Gentoo and SUSE produced virtually the same performance, with variations probably caused by different L2 cache presence. The 64 bit version averaged 24% faster on the single precision FFTs but with no real difference using double precision calculations.

The 64 bit benchmarks and source codes are included in Rpi3-64-Bit-Benchmarks.tar.gz.

Small FFT Tests - shortfft64 and shortfft32 - New programs were produced to identify differences in MHz settings. These execute 30 of the smallest 1K single precision FFTs 500 times. A summary of results is below. Besides the 500 measurements, a total time is provided that includes data generation and checking overheads, these being included in a final summary. With the on demand CPU MHz setting, 32 bit Raspbian and 64 bit Gentoo generally produce much slower execution times over the first few measurements, with the remainder at similar faster speeds. 64 bit OpenSUSE tends to produce the same slow speeds at the start but also has random longer periods of slow performance. Results from all three systems indicate constant running time with the performance MHz setting or when running another CPU benchmark at the same time.
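The relationship between the summary columns can be sketched in C as follows. This is an illustration only, under the assumption that "Av 30" is the mean time for one group of 30 FFTs, "Each" is that mean divided by 30 and "500x30" is the sum over all groups. The type and function names are hypothetical and are not taken from the shortfft source code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary record for a run of timed groups of 30 FFTs. */
typedef struct {
    double each_ms;   /* mean time for one 1K FFT               */
    double av30_ms;   /* mean time for one group of 30 FFTs     */
    double total_ms;  /* sum of all group times (500 in a run)  */
} ShortFftSummary;

/* group_ms holds one measured time, in milliseconds, per group of 30 FFTs. */
ShortFftSummary summarise(const double *group_ms, size_t groups)
{
    assert(group_ms != NULL && groups > 0);
    ShortFftSummary s;
    s.total_ms = 0.0;
    for (size_t i = 0; i < groups; i++)
        s.total_ms += group_ms[i];
    s.av30_ms = s.total_ms / (double)groups;
    s.each_ms = s.av30_ms / 30.0;
    return s;
}
```

For example, the Raspbian On Demand average of 6.17 ms per group corresponds to about 0.206 ms per FFT and close to 3086 ms for 500 groups, before overheads.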

These are also included in Rpi3-64-Bit-Benchmarks.tar.gz.

   RPi2 FFT 32 Bit Benchmark Version 1.0 Thu Feb 16 12:23:55 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.212     0.206     0.206     0.246     0.245     0.252
     2     0.462     0.447     0.447     0.689     0.678     0.723
     4     1.244     1.206     1.192     1.704     1.634     1.616
     8     2.995     3.133     2.989     4.397     3.963     3.899
    16     6.983     6.785     6.767    13.282    10.515     9.748
    32    17.142    17.182    16.855    31.020    30.025    31.891
    64    52.794    52.885    52.727   152.318   146.516   145.472
   128   278.668   280.006   285.012   358.963   362.587   360.340
   256   624.823   636.579   632.442   779.830   790.282   815.686
   512  1506.681  1512.883  1514.028  1678.495  1681.863  1668.933
  1024  3288.894  3293.423  3312.335  3792.264  3808.471  3789.059

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Thu Feb 16 12:24:51 2017

 ========================== Slow Speed =========================

     1     0.309     0.305     0.307     0.364     0.356     0.355
     2     0.666     0.673     0.680     0.928     0.912     0.900

 ###################################################

   RPi2 FFT 32 Bit Benchmark Version 3c.0 Thu Feb 16 12:21:57 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.282     0.237     0.232     0.255     0.246     0.247
     2     0.612     0.529     0.582     0.574     0.627     0.635
     4     1.523     1.249     1.203     1.498     1.668     1.543
     8     2.925     2.781     2.727     3.226     3.141     3.063
    16     7.220     6.679     6.672     8.954     8.808     8.737
    32    16.862    17.276    15.712    23.606    23.662    23.527
    64    41.294    41.568    40.916    57.516    56.900    56.923
   128    98.052    97.028    96.708   128.591   127.978   127.868
   256   217.731   214.874   214.927   277.817   276.615   280.291
   512   466.673   461.412   462.023   596.874   598.976   595.552
  1024  1009.119   998.319   999.178  1325.278  1310.229  1304.572

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565233e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Thu Feb 16 12:22:39 2017

 ========================== Slow Speed =========================

     1     0.393     0.349     0.348     0.253     0.237     0.283
     2     0.820     0.781     0.802     0.562     0.551     0.552

 ######################### Raspberry Pi 3 #########################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

   RPi2 FFT 32 Bit Benchmark Version 1.0 Wed Feb 15 11:05:59 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.167     0.164     0.163     0.166     0.164     0.165
     2     0.393     0.366     0.366     0.419     0.417     0.418
     4     1.036     1.007     0.934     1.117     1.091     1.088
     8     2.269     2.247     2.236     2.550     2.506     2.501
    16     5.624     5.290     5.231     6.086     5.852     5.842
    32    12.714    12.569    12.844    22.068    22.479    21.907
    64    43.349    44.585    43.293   110.424   110.410   110.581
   128   214.541   217.334   216.575   269.974   269.617   269.755
   256   526.296   525.924   525.682   615.746   615.259   615.811
   512  1199.912  1199.233  1199.311  1364.511  1364.153  1367.418
  1024  2509.227  2538.168  2523.659  2831.903  2831.330  2826.171

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565233e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Wed Feb 15 11:06:47 2017

 ========================== Slow Speed =========================

     1     0.329     0.335     0.326     0.446     0.340     0.340
     2     0.729     0.733     0.765     0.913     0.840     0.824

 ###################################################

   RPi2 FFT 32 Bit Benchmark Version 3c.0 Wed Feb 15 11:03:37 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.215     0.199     0.199     0.170     0.164     0.164
     2     0.453     0.462     0.455     0.376     0.373     0.373
     4     1.027     1.279     1.023     0.888     0.889     0.883
     8     2.333     2.320     2.282     2.052     2.047     2.043
    16     5.465     5.362     5.613     5.987     5.977     6.043
    32    12.309    12.468    12.216    15.382    15.479    15.396
    64    30.695    31.084    30.685    37.030    36.987    37.003
   128    72.510    72.023    72.091    84.237    84.239    84.367
   256   161.194   160.483   160.714   193.733   193.813   193.760
   512   369.130   367.713   367.509   426.499   426.238   425.983
  1024   802.163   799.225   798.768   957.992   948.540   948.625

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565233e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Wed Feb 15 11:04:06 2017

 ========================== Slow Speed =========================

     1     0.427     0.397     0.398     0.175     0.165     0.166
     2     0.996     0.952     0.924     0.396     0.395     0.393

 ############################# RPi 3 SUSE ##############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

   armv8 64 Bit FFT Benchmark Version 1.0 Wed Feb  8 20:01:52 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.153     0.152     0.152     0.175     0.170     0.168
     2     0.347     0.339     0.334     0.402     0.387     0.387
     4     0.817     0.763     0.766     1.946     1.433     1.242
     8     3.296     2.018     1.963     2.966     2.716     2.698
    16     4.623     4.456     4.392     6.719     6.229     6.759
    32    10.551    10.417    10.301    18.407    18.816    18.941
    64    28.290    28.555    28.032   126.881   127.317   127.272
   128   173.229   173.332   172.477   299.374   298.644   298.596
   256   405.373   405.188   407.602   657.365   657.037   657.864
   512   905.640   921.727   921.347  1461.983  1463.511  1462.099
  1024  2018.414  2018.043  2018.976  3163.591  3163.848  3164.858

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Wed Feb  8 20:02:29 2017

 ###################################################

   armv8 64 Bit FFT Benchmark Version 3c.0 Wed Feb  8 20:11:05 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.195     0.161     0.159     0.190     0.184     0.185
     2     0.380     0.355     0.360     0.421     0.419     0.420
     4     0.988     0.796     0.778     0.959     0.956     0.957
     8     2.282     2.183     1.802     2.131     2.101     2.100
    16     4.371     4.191     4.091     5.203     5.160     5.176
    32     9.477     9.550     9.520    14.318    14.219    14.188
    64    26.061    25.553    25.462    33.704    33.668    33.720
   128    61.337    60.707    60.460    77.791    77.816    77.922
   256   137.002   134.328   134.307   179.822   179.707   181.027
   512   315.380   313.872   313.642   392.380   394.200   392.586
  1024   692.640   689.569   689.751   859.132   854.983   852.890

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Wed Feb  8 20:11:22 2017

 ############################ RPi 3 Gentoo #############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

   armv8 64 Bit FFT Benchmark Version 1.0 Wed Feb  8 19:46:59 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.177     0.155     0.166     0.190     0.168     0.168
     2     0.346     0.370     0.348     0.640     0.496     0.471
     4     0.806     0.776     0.773     1.792     1.811     2.455
     8     2.879     2.026     2.313     3.143     2.673     2.614
    16     4.694     4.487     4.446     6.501     6.077     6.090
    32    10.824    11.067    10.520    27.899    27.393    32.721
    64    49.580    37.161    37.028   119.094   118.648   118.820
   128   172.333   186.946   172.173   294.386   294.253   294.366
   256   406.581   407.594   406.053   670.012   670.096   670.169
   512   938.983   938.567   939.929  1486.050  1485.846  1486.961
  1024  1987.861  1989.141  1997.740  3143.410  3143.533  3143.669

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Wed Feb  8 19:47:34 2017

 ###################################################

   armv8 64 Bit FFT Benchmark Version 3c.0 Wed Feb  8 19:55:51 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.181     0.172     0.160     0.190     0.185     0.185
     2     0.400     0.366     0.362     0.458     0.420     0.423
     4     0.892     0.937     0.932     0.989     0.976     0.994
     8     1.986     1.967     2.604     2.269     2.270     2.334
    16     5.590     4.686     4.433     5.615     5.533     5.621
    32    10.438    10.081    10.263    14.656    14.616    14.669
    64    27.759    27.381    27.154    34.832    34.816    34.853
   128    63.303    62.331    62.107    79.898    79.849    79.896
   256   138.935   170.902   137.272   186.385   186.580   186.381
   512   318.062   315.184   315.421   409.840   410.370   410.283
  1024   691.349   683.468   685.295   919.255   904.665   904.236

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Wed Feb  8 19:56:07 2017

 ###################################################

        Example 64 Bit Results With On Demand CPU MHz

 ========================== Slow Speed =========================

   armv8 64 Bit FFT Benchmark Version 3c.0 Wed Feb  8 19:58:15 2017

  Size                        milliseconds
     K       Single Precision              Double Precision

     1     0.367     0.321     0.320     0.200     0.213     0.188
     2     0.875     0.835     0.805     0.443     0.444     0.425
     4     1.974     2.038     1.862     0.996     0.978     0.993
     8     6.018     5.208     3.971     2.294     2.285     2.278
    16     9.424     4.566     4.586     5.574     5.585     5.561
    32    10.608    10.236    10.202    14.902    14.826    14.728
    64    28.013    27.164    27.240    34.889    34.867    34.939
   128    63.213    62.583    62.562    80.222    80.257    80.036
   256   139.365   137.684   137.460   186.954   187.057   187.003
   512   318.927   316.056   315.992   412.486   412.306   412.417
  1024   693.102   684.980   686.608   918.059   902.694   902.613

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

        End at Wed Feb  8 19:58:31 2017

 ###################################################

      RPi 3 500 x 30 1K Single Precision FFT milliseconds

 Raspbian On Demand

  12.9  12.2   7.4   6.0   6.0   6.4   6.0   6.0   6.0   6.0
   6.1   6.0   6.0   6.0   6.0   6.0   6.1   6.1   6.0   6.2
   6.2   6.0   6.0   6.1   6.0   6.0   6.0   6.0   6.1   6.0
   6.2   6.0   6.0   7.0   6.1   6.0   6.0   6.0   6.1   6.0
   6.2   6.1   6.0   6.0   6.2   6.0   6.0   6.0   6.0   7.2
 To
   6.5   6.3   6.1   6.2   6.1   6.1   6.1   6.1   6.1   6.1
   6.5   6.3   6.1   6.1   6.1   6.1   6.1   6.1   6.1   6.1
   6.4   6.2   6.1   6.1   6.2   6.1   6.1   6.1   6.1   6.1

 Raspbian With Stress Test

   6.7   6.2   6.0   6.0   6.0   6.0   6.1   6.0   6.1   6.0
   6.5   6.2   6.0   6.0   6.0   6.0   6.0   6.0   6.0   6.0
   6.4   6.2   6.0   6.0   6.0   6.0   6.0   6.1   6.0   6.0
 To
   6.3   6.2   6.0   6.0   6.0   6.0   6.0   6.0   6.0   6.0
   6.3   6.2   6.0   6.0   6.0   6.0   6.0   6.0   6.0   6.0
   6.3   6.2   6.0   6.0   6.1   6.0   6.0   6.0   6.0   6.0

 OpenSUSE On Demand

  12.1  12.5   8.9   5.3   5.3   5.3   5.3   5.3   5.3   5.3
   5.3   5.7   5.3   5.3   5.3   5.3   5.3   5.3   5.3   5.3
   5.3   5.6   5.3   5.3   5.3   5.3   5.3   5.3   5.3   5.3
 To
   7.9  11.7  10.7  10.6  10.6  10.6  10.6  10.6  10.6  10.7
  11.6  11.2  10.6  10.7  10.6  10.6  10.6  10.6  10.6  10.6
  11.7  11.5  10.6  10.6  10.6  10.6  10.6  10.6  10.6  10.6
  11.8  11.1  10.6  10.6  10.7  10.6  10.6  10.7  10.6  10.6
 To
   5.5   6.0   5.8   5.3   5.3   5.3   5.3   5.3   5.3   5.3
   5.5   5.9   5.7   5.3   5.3   5.3   5.3   5.3   5.3   5.3
   5.5   6.0   5.8   5.3   5.3   5.3   5.3   5.3   5.3   5.4

 OpenSUSE Performance

   6.1   6.0   5.4   5.5   5.4   5.3   5.3   5.3   5.3   5.3
   5.5   6.0   5.8   5.3   5.3   5.3   5.3   5.3   5.3   5.3
   5.5   6.1   5.8   5.3   5.3   5.3   5.3   5.3   5.3   5.3
 To
   5.5   6.2   5.7   5.3   5.3   5.3   5.3   5.3   5.3   5.3
   5.5   6.1   5.8   5.3   5.3   5.3   5.3   5.3   5.3   5.3
   5.5   6.0   5.7   5.3   5.3   5.3   5.3   5.3   5.3   5.3

 Gentoo On Demand

  17.5  15.4  11.8   8.6   5.4   5.4   5.4   5.4   5.4   5.4
   5.5   5.8   6.0   5.4   5.5   5.4   5.5   5.4   5.4   5.4
   5.5   5.6   6.1   5.4   5.5   5.4   5.5   5.5   5.4   5.4
 To
   5.7   6.9   5.7   5.4   5.4   5.4   5.5   5.4   5.4   5.4
   5.8   6.8   5.8   5.6   5.4   5.4   5.4   5.5   5.4   5.4
   5.7   6.4   5.7   5.5   5.4   5.4   5.5   5.4   5.4   5.4

 Gentoo With Stress Test

   5.9   7.2   5.9   5.5   5.4   5.4   5.4   5.4   5.4   5.5
   5.6   6.9   5.7   5.4   5.4   5.4   5.4   5.4   5.4   5.4
   5.6   6.5   5.7   5.4   5.4   5.4   5.4   5.4   5.4   5.4
   5.8   7.1   5.9   5.4   5.4   5.4   5.4   5.4   5.4   5.4
 To
   5.7   6.8   5.7   5.4   5.4   5.4   5.4   5.4   5.4   5.4
   5.7   6.7   6.1   5.4   5.4   5.4   5.4   5.4   5.4   5.4
   5.8   6.6   5.6   5.4   5.4   5.4   5.4   5.4   5.4   5.4

 ################### Summary milliseconds ###################

                               Each   Av 30  500x30  +Overheads

 Raspbian On Demand           0.206    6.17    3086       14402
 Raspbian Plus Stress Test    0.202    6.07    3036       14222
 OpenSUSE On Demand           0.221    6.23    3314       13035
 OpenSUSE Performance         0.182    5.45    2725       10663
 Gentoo On Demand             0.190    5.70    2852        8994
 Gentoo Plus Stress Test      0.187    5.61    2802        8872

Single Core NEON Benchmarks

Some of these are essentially the same as my Android NEON Benchmarks.htm, using NEON Intrinsic Functions. Others are produced by including the compile option -funsafe-math-optimizations, alongside -mfpu=neon-vfpv4. Results for single core NEON benchmarks are included in this document, with the programs and source codes in Raspberry_Pi_Benchmarks.zip. For MultiThreading versions, see Raspberry Pi Multithreading Benchmarks.htm and Raspberry_Pi_MP_Benchmarks.zip.

64 Bit Versions - The compiler does not have the NEON directive, but translates NEON intrinsic functions into 64 bit vector instructions. The 64 bit benchmarks and source codes are in Rpi3-64-Bit-Benchmarks.tar.gz.

Linpack NEON Benchmarks - linpackPiNEONi, linpackPiFSSP, linpackPiNEONi64

The Android version was written using NEON Intrinsic Functions, and was converted to Linux format in linpackneon.c, compiled as linpackPiNEONi. The standard Linux single precision version was recompiled with the additional -funsafe-math-optimizations parameter as linpackPiFSSP. Comparative performance of the intrinsic program is shown in Linpack Benchmark Comparisons above.

Linpack benchmark performance is mainly determined by the daxpy function, specifically an unrolled loop with four dy[i] = dy[i] + da * dx[i] statements, accessing sequential data. NEON q registers are 128 bits or four words and there are multiply and add instructions, using three registers. The assembly code loop has two loads and one store, with linpackPiNEONi using vmla Vector Multiply Accumulate instruction and linpackPiFSSP using the faster vfma Fused Multiply Accumulate - one instruction for 4 multiplies and 4 adds.
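The loop described above can be sketched in plain C as below. This is a minimal single precision illustration, not the benchmark source: the function name saxpy_unrolled is hypothetical, and whether a compiler turns the four statements into one vmla or vfma instruction depends on the options used, such as the -mfpu=neon-vfpv4 and -funsafe-math-optimizations settings mentioned in this document.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single precision version of the Linpack daxpy inner loop:
   dy = dy + da * dx, unrolled four times over sequential data. */
void saxpy_unrolled(size_t n, float da, const float *dx, float *dy)
{
    assert(dx != NULL && dy != NULL);
    size_t i = 0;
    size_t m = n % 4;

    /* Clean-up pass so the main loop handles a multiple of four. */
    for (; i < m; i++)
        dy[i] = dy[i] + da * dx[i];

    /* Four multiply-adds per pass, matching one 128 bit q register
       holding four 32 bit words. */
    for (; i < n; i += 4) {
        dy[i]     = dy[i]     + da * dx[i];
        dy[i + 1] = dy[i + 1] + da * dx[i + 1];
        dy[i + 2] = dy[i + 2] + da * dx[i + 2];
        dy[i + 3] = dy[i + 3] + da * dx[i + 3];
    }
}
```

Each pass of the main loop needs loads of dx and dy and a store of dy, which corresponds to the two loads and one store noted for the assembly code.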

Raspberry Pi 3 speeds are shown to be 54% to 57% faster than the non-overclocked Raspberry Pi 2, compared with a 33% faster CPU MHz.

These instructions are known to produce rounding complications, with differences in results being shown below. I could not say whether they are acceptable.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - As both use different varieties of SIMD instructions, performance is not that much better than the 32 bit version.

                           linpackPiNEONi  linpackPiFSSP  linpackPiNEONi64

 RPi 2 MFLOPS at  900 MHz        300            311
 RPi 2 MFLOPS at 1000 MHz        334            348
 RPi 3 MFLOPS at 1200 MHz        486            488             530

 NEON Function
   linpackPiNEONi     vmla.f32 q8, q9, q10
   linpackPiFSSP      vfma.f32 q8, q9, q10
   linpackPiNEONi64   fmla v0.4s, v1.4s, v2.4s

                    norm resid      resid           x[0]-1         x[n-1]-1

 Pi, Android+NEON      1.6     3.80277634e-05  -1.38282776e-05  -7.51018524e-06
 Pi 2/3 Not NEON       2.0     4.69621336E-05  -1.31130219E-05  -1.30534172E-05
 Pi 3 64 NEON In       2.0     4.69621336E-05  -1.31130219E-05  -1.30534172E-05
 Pi 3 64 Not NEON      2.0     4.69621336E-05  -1.31130219E-05  -1.30534172E-05
 Pi 2/3 Intrinsic      2.2     5.16722466e-05  -2.38418579e-07  -5.06639481e-06
 Pi 2/3 Compiled       1.9     4.62468779e-05  -1.31130219e-05  -1.30534172e-05

NEON Float & Integer Benchmark - NeonSpeed, NeonSpeedPi64

This was the first benchmark produced to measure speed using NEON instructions on ARM v7 CPUs using Android. It executes some of the code used in Memory Speed Benchmark, with additional tests recoded using NEON intrinsic functions. The benchmark and source code are included in Raspberry_Pi_Benchmarks.zip.

The compile command (for gcc 4.8) is shown below, where the -funsafe-math-optimizations option leads to the compiler generating NEON code for normal floating point statements. In this case, vfma Fused Multiply Accumulate instructions were generated, as opposed to vmla Vector Multiply Accumulate from the intrinsic functions. Then, vadd.i32 was produced for all integer tests. In this case, performance from both methods was quite similar.

Raspberry Pi 3 speeds were quite a bit faster than the Raspberry Pi 2 at 900 MHz. Average, minimum and maximum improvements, using data in L1 cache, were 1.71, 1.37 and 2.02 times. L2 cache ratios were 3.13, 1.90 and 2.45, with RAM, best, at 3.57, 2.92 and 5.33. The RPi 3 was also more efficient in running the NEON instructions using caches.

Example Android results logs are also provided, to show the difference where compiled NEON instructions are not provided at 32 bits. Performance at 64 bits is also provided, for the tablet with the ARM-A53 CPU, where NEON instructions are compiled and cache based speeds are similar to the RPi 3.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - 64 bit and 32 bit speeds are, again, nearly the same, using different variations of vector instructions. An exception is the slower performance from gcc 4.8 in translating NEON intrinsic functions for the v=v+s*v test.

 gcc neonspeed.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 -funsafe-math-optimizations -o NeonSpeed

 ##############################################

 Raspberry Pi 2 CPU 900 MHz, Core 250 MHz, SDRAM 450 MHz

   NEON Speed Test V 1.0 Tue Mar 17 12:06:58 2015

       Vector Reading Speed in MBytes/Second
  Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   1914   1978   2049   2293   2341   2797  L1
      32   1897   1951   2032   2253   2310   2745
      64   1517   1543   1619   1694   1718   1915  L2
     128   1417   1435   1510   1569   1594   1791
     256   1414   1433   1499   1571   1593   1771
     512    680    578    654    600    577    604
    1024    434    403    451    414    396    409  RAM
    4096    327    328    332    324    324    330
   16384    333    334    338    345    330    337
   65536    339    336    340    172    331    338

 Max MFLOPS 479    495
 Max MOPS                   512    573

 ##################### OC ######################

 Raspberry Pi 2 CPU 1000 MHz, Core 500 MHz, SDRAM 500 MHz, over_voltage=2

   NEON Speed Test V 1.0 Tue Mar 17 12:12:37 2015

       Vector Reading Speed in MBytes/Second
  Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   2114   2183   2265   2531   2587   3090  L1
      32   2078   2134   2228   2461   2532   3003
      64   1673   1703   1785   1870   1900   2118  L2
     128   1565   1581   1668   1736   1761   1974
     256   1545   1577   1660   1726   1752   1951
     512   1055   1042   1100   1121   1101   1178
    1024    499    506    523    525    512    530  RAM
    4096    429    431    440    428    433    445
   16384    436    438    448    453    440    454
   65536    446    443    452    229    444    458

 Max MFLOPS 529    546
 Max MOPS                   566    633

        End of test Tue Mar 17 12:12:57 2015

 ################### RPi 3 ###################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

   NEON Speed Test V 1.0 Fri Jul 29 12:03:47 2016

       Vector Reading Speed in MBytes/Second
  Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   2720   4001   3459   4225   4474   4750
      32   2598   3706   3268   3879   4091   4320
      64   2453   3389   3069   3526   3675   3859
     128   2503   3466   3178   3598   3718   3918
     256   2530   3516   3230   3649   3779   3950
     512   2221   2923   2718   2964   3104   3217
    1024   1262   1326   1317   1316   1324   1316
    4096   1170   1213   1204   1213   1210   1195
   16384   1177   1229   1218   1147   1222   1215
   65536   1181   1226   1221    916   1208   1218

 Max MFLOPS 680   1000
 Max MOPS                   865   1056

        End of test Fri Jul 29 12:04:07 2016

 ################ RPi 3 SUSE ################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

       Vector Reading Speed in MBytes/Second
  Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   2393   4497   3479   4252   4783   4932
      32   2299   4081   3284   3910   4362   4441
      64   2193   3663   3067   3593   3896   3904
     128   2227   3701   3144   3603   3909   3926
     256   2226   3693   3153   3586   3896   3923
     512   1913   3461   2958   3358   3609   3577
    1024   1271   1408   1406   1364   1363   1422
    4096   1130   1207   1219   1158   1186   1208
   16384   1102   1116   1132   1037   1111   1116
   65536   1089   1095   1107    810   1091   1095

 Max MFLOPS 598   1124
 Max MOPS                   870   1063

 ############### RPi 3 Gentoo #################

      16   2352   4419   3418   4178   4700   4850
      32   2330   4355   3388   4122   4664   4806
      64   2177   3678   3066   3607   3932   3923
     128   2230   3772   3174   3683   4012   4007
     256   2240   3785   3199   3694   4024   4024
     512   1936   3095   2690   2996   3241   3279
    1024   1143   1203   1253   1162   1178   1229
    4096   1097   1182   1182   1115   1138   1192
   16384   1103   1193   1188   1138   1143   1201
   65536   1109   1199   1200    866   1165   1214

 Max MFLOPS 588   1104
 Max MOPS                   855   1045

 ################ RPi 3 SUSE ##################

 Compiled for 64 bit ARM v8a+fp+sim

      16   2390   3001   3187   3925   4135   4372
      32   2381   2985   3187   3894   4135   4371
      64   2174   2674   2817   3300   3468   3608
     128   2177   2704   2859   3341   3512   3654
     256   2200   2712   2848   3315   3520   3637
     512   2010   2400   2539   2894   3018   3094
    1024   1238   1314   1338   1356   1382   1385
    4096   1098   1148   1159   1158   1170   1188
   16384   1063   1082   1120   1041   1109   1114
   65536   1063   1067   1108    815   1092   1097

 Max MFLOPS 598    750
 Max MOPS                   797    981

 #################### Android #####################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.32
 Compiled for 32 bit ARM v7a

       Vector Reading Speed in MBytes/Second
  Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16    971   3853   1807   4059   3957   4397
      32    970   3812   1800   3983   3891   4323
      64    927   3228   1605   3038   3269   3521
     128    926   3321   1681   3343   3354   3596
     256    936   3386   1693   3449   3413   3667
     512    898   2889   1578   2996   2927   3118
    1024    794   1859   1345   2057   1996   1924
    4096    794   1796   1250   1788   1813   1835
   16384    792   1773   1270   1820   1829   1864
   65536    796   1811   1289   1852   1832   1880

        Total Elapsed Time 11.3 seconds

 ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.37
 Compiled for 64 bit ARM v8a

       Vector Reading Speed in MBytes/Second
  Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3054   4055   3605   4376   4911   5094
      32   2922   3787   3435   4198   4546   4682
      64   2795   3514   3259   3658   4050   4116
     128   2886   3529   3373   3924   4148   3963
     256   2883   3641   3264   3942   4193   4276
     512   2454   3165   2985   3385   3586   3542
    1024   1633   2000   1835   2043   2114   2105
    4096   1738   1893   1899   1900   1956   1955
   16384   1757   1870   1886   1802   1921   1846
   65536   1755   1875   1870   1903   1936   1937

 Max MFLOPS 764   1014
 Max MOPS                   901   1094

        Total Elapsed Time 10.2 seconds

 #################### Android #####################

 Nexus 7 Quad 1200 MHz Cortex-A9, Android 4.1.2

 Android NeonSpeed Benchmark 15-Dec-2012 14.38

       Vector Reading Speed in MBytes/Second
  Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16    860   2575   2325   2918   3053   3245  L1
      32    950   2551   2400   2823   2944   3131
      64    744   1396   1329   1434   1465   1496  L2
     128    713   1342   1319   1365   1392   1417
     256    714   1339   1311   1357   1377   1400
     512    708   1323   1299   1348   1358   1383
    1024    608    875    869    917    930    952
    4096    460    493    492    481    488    504  RAM
   16384    460    498    487    507    506    504
   65536    459    495    469    251    503    505

 Max MFLOPS 238    644
 Max MOPS                   600    730

MemSpeed NEON - memSpdPiNEON

This is compiled from the Memory Speed Benchmark source code, using the additional -funsafe-math-optimizations compile parameter. An example of results is included above. The memspeedPiA7 benchmarks, compiled with the -mfpu=neon-vfpv4 option, generated NEON instructions for integer arithmetic (vadd.i32 q8, q8, q10), as with memSpdPiNEON, leading to the same performance. Then four scalar fused multiply and add instructions (fadds s12, s8, s12) were generated for the single precision (SP) floating point test, as opposed to NEON (vfma.f32 q8, q9, q6) with the new benchmark, with similar differences for the second set of calculations. Details are above, and maximum MFLOPS below, showing a gain approaching 50% through using NEON instructions. Note: currently NEON floating point functions are only available at single precision. For reference, double precision (DP) results are also shown.

Both compilations for memspeedPiA7 and memSpdPiNEON have NEON integer instructions of the form vadd.i32 q8, q8, q9, providing significant performance gains, as shown by integer MOPS below.

Raspberry Pi 3 - Best gains were on integer MOPS, at 1.5 to 1.7 times those of the 900 MHz RPi 2. Some double precision gains were lower than the clock MHz ratio of 1.33.

Raspberry Pi 3 SUSE and Gentoo 64 Bits - Compile options not available, but see Memory Speed Benchmark above.

 MFLOPS                  memspeedPiA6  memspeedPiA7  memSpdPiNEON

 Raspberry Pi 2
 SP MFLOPS at  900 MHz        333           299           445
 SP MFLOPS at 1000 MHz        351           330           493
 DP MFLOPS at  900 MHz        133           175           174
 DP MFLOPS at 1000 MHz        148           193           193

 Raspberry Pi 3
 SP MFLOPS at 1200 MHz        444           454           594
 DP MFLOPS at 1200 MHz        208           203           203

 INT MOPS                memspeedPiA6  memspeedPiA7  memSpdPiNEON

 Raspberry Pi 2
 Int MOPS at  900 MHz         323           512           509
 Int MOPS at 1000 MHz         333           566           562

 Raspberry Pi 3
 Int MOPS at 1200 MHz         485           865           864

Maximum One Core Single Precision MFLOPS

notOpenMP-MFLOPS, notOpenMP-MFLOPS64, MP-MFLOPSPiA7, MP-MFLOPS64
MP-NeonMFLOPS, MP-NeonMLOPS64, MP-MFLOPSPiNeon

All of these carry out the same calculations executed in the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Full results are provided below. OpenMP-MFLOPS automatically uses all available cores and notOpenMP-MFLOPS uses one core with no MP overheads. All others use 1, 2, 4 and 8 threads, best MFLOPS from 1 thread shown here.
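As a minimal sketch of the calculation described above, the 8 operations per word case can be written as a simple loop, with MFLOPS derived from the word count, operations per word, repeat passes and elapsed seconds. The function names triad8 and mflops are illustrative assumptions, not names taken from the benchmark source. With the figures reported later for notOpenMP-MFLOPS (100000 words, 2 operations per word, 2500 passes, 0.697952 seconds), the formula gives about 716 MFLOPS.

```c
#include <stddef.h>

/* One pass of the 8 operations per word calculation:
   3 adds, 3 multiplies, 1 subtract and 1 final add.
   The constants a to f are supplied by the caller (illustrative values). */
void triad8(float *x, size_t n,
            float a, float b, float c, float d, float e, float f)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}

/* MFLOPS = words * operations per word * passes / seconds / 1,000,000 */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

A caller would time a number of passes of triad8 over the same array and hand the elapsed seconds to mflops.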

The compilation for PiA7 MP-MFLOPS includes an option to use NEON instructions, but does not do so in the 32 bit version. MP-MFLOPS64 and OpenMP-MFLOPS64 varieties use the simple “-march=armv8-a” directive.

The compiled MP-MFLOPSPiNeon and OpenMP benchmarks include the “-funsafe-math-optimizations” parameter that produces SIMD instructions. This option is not available at 64 bits. MP-NeonMFLOPS and MP-NeonMFLOPS64 use a well ordered structure of NEON intrinsic functions, clearly suitable for SIMD operation. The compile command for MP-NeonMFLOPS was:

gcc neonmflops.c cpuidc.c -lm -lrt -O3 -mcpu=cortex-a7 -mfloat-abi=hard -mfpu=neon-vfpv4 -funsafe-math-optimizations -lpthread -o MP-NeonMFLOPS
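To illustrate what a structure of NEON intrinsic functions can look like, here is a minimal sketch of an add and multiply per word, processing four single precision values per instruction. The function name add_mul and the exact expression are assumptions made for illustration and are not copied from the benchmark source. A scalar path is included so the same file still compiles on systems without NEON.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* x[i] = (x[i] + a) * b, two operations per word (hypothetical example). */
void add_mul(float *x, size_t n, float a, float b)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t va = vdupq_n_f32(a);          /* a in all four lanes */
    float32x4_t vb = vdupq_n_f32(b);          /* b in all four lanes */
    for (; i + 4 <= n; i += 4) {
        float32x4_t v = vld1q_f32(&x[i]);     /* load four words */
        v = vmulq_f32(vaddq_f32(v, va), vb);  /* (v + a) * b */
        vst1q_f32(&x[i], v);                  /* store four words */
    }
#endif
    /* Remaining words, or all words when NEON is not available */
    for (; i < n; i++)
        x[i] = (x[i] + a) * b;
}
```

Keeping the add and the multiply as separate intrinsics means the vector and scalar paths perform the same two roundings per word.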

Raspberry Pi 3 speeds were 1.75 times faster than model 2, at two operations per word, increasing to 2.28 times at 32 operations per word.

64 Bit Versions - MP-MFLOPS results were between 2.3 and 4 times faster than the 32 bit version, due to using SIMD instructions. The notOpenMP-MFLOPS performance was similar, with both versions using SIMD. MP-NeonMLOPS64 intrinsics were compiled as more effective vector instructions, to produce gains between 1.25 and 1.54 times.

Cortex-A53 based Android tablet results are also shown, with similar performance. Details are in Android 64 Bit Benchmarks.htm.

Reliability Tests - The MP-MFLOPS functions were used in stress testing programs that have command line options to define which function to use and the running time. See Reliability Tests, 64 Bit Reliability Tests and Raspberry Pi 2 and 3 Stress Tests.htm. The original versions, such as burninfpuPiA7 and MFLOPS benchmarks, produced less than 1.5 MFLOPS per MHz, where the test functions were driven by repetitive external calls. A later one, burninfpuPi2, in Raspberry_Pi_2_Stress_Tests.zip, included the repeat calls within the functions, and unrolled some of the calculations, producing some much faster speeds. The 64 bit version, burninfpuPi64, in Rpi3-64-Bit-Benchmarks.tar.gz, produced similar superior performance, as reflected in the results below.

 Single Precision MFLOPS         MHz  2 Ops/word  8 Ops/word  32 Ops/word

 Raspberry Pi 2
 notOpenMP-MFLOPS                900      398         777         692
 notOpenMP-MFLOPS               1000      461         861         765
 burninfpuPiA7 L2                900      450         777         685

 Raspberry Pi 3                 1200
 notOpenMP-MFLOPS                         716        1697        1581
 MP-MFLOPSPiA7                            182                     693
 MP-MFLOPSPiNeon Compiled                 782                    1672
 MP-NeonMFLOPS Intrinsics                 583                    1706
 burninfpuPiA7 L2 cache data              721        1644        1703

 notOpenMP-MFLOPS64                       718        1720        1496
 MP-MFLOPS64                              730                    1579
 MP-MFLOPSNeon Compiled                   N/A
 MP-NeonMLOPS64 Intrinsics                729                    2640
 burninfpuPi64 L2 cache data             1721        3796        1562

 Cortex A53 Android Tablet 1300 MHz 1 Core Threaded
 SP MFLOPS 32 bit Intrinsics              619                    1426
 SP MFLOPS 64 bit Intrinsics              726                    2639

MultiThreading Benchmarks

These are essentially the same as my Android Multithreading Benchmarks.htm. Except for OpenMP tests, all run the benchmarks using 1, 2, 4 and 8 threads. Those that use caches and RAM have data sizes around 12.8 KB, 128 KB and 12.8 MB. The test runs considered below are to provide Raspberry Pi 3 comparisons of 64 bit versus 32 bit operation. The new benchmarks and source codes are included in Rpi3-64-Bit-Benchmarks.tar.gz. Details and results of earlier measurements can be found in Raspberry Pi Multithreading Benchmarks.htm, with benchmarks and source codes in Raspberry_Pi_MP_Benchmarks.zip.

Where appropriate, the benchmarks show that the same numerical results are produced using a varying number of threads. Example results for different compilations of MP-MFLOPS are shown below. At 32 bits, the benchmark was compiled firstly with normal floating point parameters, secondly with additional NEON directives and thirdly with NEON intrinsic functions, replacing normal C code. At 64 bits, the first and last of these were appropriate. The intrinsic functions were translated into different forms of vector instructions. The end products produced variations in numerical results, as shown in the following.

 ################ MP-MFLOPS FORMAT #################

  MP-MFLOPS armv8 64Bit Fri Feb 24 13:30:16 2017

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

             2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      730     717     413    1579    1575    1541
 2T     1361    1351     389    3075    3145    2849
 4T     2259    2417     370    5399    6114    4944
 8T     2226    1919     352    5346    5948    4986
 Results x 100000, 0 indicates ERRORS
 1T    76406   97075   99969   66015   95363   99951
 2T    76406   97075   99969   66015   95363   99951
 4T    76406   97075   99969   66015   95363   99951
 8T    76406   97075   99969   66015   95363   99951

  End of test Fri Feb 24 13:30:21 2017

 MP-MFLOPS Linux/ARM V7A
 1T    76406   97075   99969   66015   95363   99951
 MP-MFLOPS Compiled NEON
 1T    76406   97075   99969   66008   95367   99951
 MP-MFLOPS NEON Intrinsics
 1T    76406   97075   99969   66014   95363   99951
 MP-MFLOPS 64 Bit
 1T    76406   97075   99969   66015   95363   99951
 MP-MFLOPS NEON Intrinsics 64 Bit
 1T    76406   97075   99969   66015   95363   99951
 MP-MFLOPS Double Precision
 1T    76384   97072   99969   66065   95370   99951
 MP-MFLOPS 64 Bit DP
 1T    76384   97072   99969   66065   95370   99951

MP-MFLOPS - MP-MFLOPSPiA7, MP-MFLOPSDP, MP-MFLOPSPi64, MP-MFLOPSPi64DP

MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. Others use more calculations in the form of x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 8 or 32 operations per input data word. Each thread carries out the same calculations but accesses different segments of the data. The result, on cache based calculations, is often performance proportional to the number of cores used.
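The arrangement where each thread carries out the same calculations on a different segment of the data can be sketched with POSIX threads as follows. This is an illustrative outline only: the names segment_t, segment_worker and run_segments, and the use of the two operations per word expression, are assumptions and not the benchmark's own code.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Work description for one thread (hypothetical structure) */
typedef struct {
    float *data;    /* start of this thread's segment */
    size_t words;   /* number of words in the segment */
    float a, b;     /* constants for the calculation */
} segment_t;

static void *segment_worker(void *arg)
{
    segment_t *s = (segment_t *)arg;
    for (size_t i = 0; i < s->words; i++)
        s->data[i] = (s->data[i] + s->a) * s->b;   /* add and multiply */
    return NULL;
}

/* Split x into one segment per thread and process the segments in parallel.
   Returns 0 on success, -1 on bad arguments or thread creation failure. */
int run_segments(float *x, size_t n, int threads, float a, float b)
{
    pthread_t tid[MAX_THREADS];
    segment_t seg[MAX_THREADS];
    int started = 0, rc = 0;

    if (threads < 1 || threads > MAX_THREADS)
        return -1;

    size_t chunk = n / (size_t)threads;
    for (int t = 0; t < threads; t++) {
        size_t first = (size_t)t * chunk;
        seg[t].data = x + first;
        /* the last thread also takes any words left over */
        seg[t].words = (t == threads - 1) ? n - first : chunk;
        seg[t].a = a;
        seg[t].b = b;
        if (pthread_create(&tid[t], NULL, segment_worker, &seg[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return rc;
}
```

Because the segments do not overlap, no locking is needed, and the numeric results are the same whatever the thread count.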

64 Bit vs 32 Bit - Bearing in mind that results in the third column are likely to be dependent on memory speed, the first cache based tests averaged four times faster, with a 25% improvement from RAM. Then, with 32 operations per word, a speed gain of 2.19 times applied. Double precision improvements were much less.

Single/Double Precision - Results were quite similar using the 32 bit benchmark. At 64 bits, SP speed averaged 2.1 times DP speed at 2 operations per word, and was 37% faster with the higher number of calculations.

SUSE vs Gentoo - Except for the isolated blip, which can be expected on this type of test, performance was essentially the same.

MP gains - Ignoring 12800 memory based tests, that can be lower, four versus 1 thread gains averaged 3.38 times, with a maximum of 3.88 times.

Comparison with other MP-MFLOPS benchmarks - see Maximum 1 Core MFLOPS above.
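The MP gains quoted above are simple ratios of multi-thread to single-thread MFLOPS. As a small worked sketch (the function name mp_gain is an assumption for illustration), the 3.88 maximum corresponds to the SUSE single precision results below at 128 KB and 32 operations per word, 6114 MFLOPS with four threads against 1575 MFLOPS with one.

```c
/* Ratio of multi-thread to single-thread throughput */
double mp_gain(double mflops_multi, double mflops_single)
{
    return mflops_multi / mflops_single;
}
```

For example, mp_gain(6114.0, 1575.0) is a little over 3.88, against an ideal of 4 for four cores.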

 ###################### RPi 3 #######################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-MFLOPS Linux/ARM V7A v1.0 Tue Aug 30 14:16:59 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

             2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      159     181     178     690     692     685
 2T      342     364     353    1384    1386    1368
 4T      466     501     456    2451    2473    2633
 8T      581     643     479    2618    2502    2550
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951

 ########### RPi 3 V7A2 Double Precision ############

  MP-MFLOPS Double Precision v1.0 Wed Sep 7 17:07:12 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

             2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      143     182     171     678     680     674
 2T      343     361     240    1360    1360    1335
 4T      441     712     240    2232    2208    2185
 8T      406     593     241    2345    2315    2272
 Results x 100000
 1T    76384   97072   99969   66065   95370   99951

 ################## RPi 3 SUSE #####################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-MFLOPS armv8 64Bit Fri Feb 24 13:30:16 2017

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

             2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      730     717     413    1579    1575    1541
 2T     1361    1351     389    3075    3145    2849
 4T     2259    2417     370    5399    6114    4944
 8T     2226    1919     352    5346    5948    4986
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951

 ########### RPi 3 SUSE Double Precision ############

  MP-MFLOPS armv8 64Bit Double Precision Fri Feb 24 13:53:27 2017

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

             2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      365     356     198    1233    1230    1127
 2T      659     657     166    2401    2397    1923
 4T     1200     927     176    4678    4640    2776
 8T     1051    1039     174    4678    4682    2909
 Results x 100000
 1T    76384   97072   99969   66065   95370   99951

 ################ RPi 3 Gentoo ######################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-MFLOPS armv8 64Bit Thu Mar 2 16:48:04 2017

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

             2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      709     634     391    1541    1535    1497
 2T     1095    1072     355    3095    3023    2925
 4T     1503    2249     350    5419    6070    5230
 8T     2475    1985     381    5440    5975    5030
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951

 ########## RPi 3 Gentoo Double Precision ###########

  MP-MFLOPS armv8 64Bit Double Precision Thu Mar 2 16:52:33 2017

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

             2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      354     327     197    1205    1187    1081
 2T      685     691     201    2411    2369    1763
 4T     1202    1063     202    4681    4595    2064
 8T     1145    1077     201    4520    4581    2663
 Results x 100000
 1T    76384   97072   99969   66065   95370   99951

MP-Whetstone - MP-WHETSPiA7, MP-WHETSPi64

Multiple threads each run the eight test functions at the same time, but with some dedicated variables. Measured speed is based on the last thread to finish, with Mutex functions used to avoid updating conflicts by only allowing one thread at a time to access common data. Again, performance is generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests due to a different compiler being used.
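The combination of dedicated per-thread variables and a Mutex guarding common data can be sketched as below. This is a simplified illustration under stated assumptions: the names whet_worker, run_with_mutex and shared_total are hypothetical, and a trivial loop stands in for the eight Whetstone test functions.

```c
#include <pthread.h>

#define MAX_THREADS 8

static pthread_mutex_t result_lock = PTHREAD_MUTEX_INITIALIZER;
static double shared_total;   /* common data updated by every thread */

static void *whet_worker(void *arg)
{
    long passes = *(long *)arg;
    double local = 0.0;                 /* dedicated variable, no lock needed */

    for (long i = 0; i < passes; i++)   /* stand-in for the test functions */
        local += 1.0;

    pthread_mutex_lock(&result_lock);   /* one thread at a time from here */
    shared_total += local;
    pthread_mutex_unlock(&result_lock);
    return NULL;
}

/* Run the worker on the given number of threads and return the shared total,
   or -1.0 if the thread count is out of range. */
double run_with_mutex(int threads, long passes)
{
    pthread_t tid[MAX_THREADS];
    int started = 0;

    if (threads < 1 || threads > MAX_THREADS)
        return -1.0;

    pthread_mutex_lock(&result_lock);
    shared_total = 0.0;
    pthread_mutex_unlock(&result_lock);

    for (int t = 0; t < threads; t++) {
        if (pthread_create(&tid[t], NULL, whet_worker, &passes) != 0)
            break;
        started++;
    }
    /* Joining every thread means this returns only when the last has finished,
       so a timer stopped here measures the slowest thread. */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);

    return shared_total;
}
```

Holding the lock only for the final update keeps contention low, so throughput can still scale with the number of cores.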

None of the test functions are suitable for SIMD operation, and the simpler instructions being used can lead to some 32 bit tests being faster than those compiled for 64 bits. The Fixed Point MIPS loops are clearly over optimised but, in any case, the time taken has little influence on the overall MWIPS rating.

 ############################## RPi 3 ##################################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-Whetstone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:34:21 2016

                    Using 1, 2, 4 and 8 Threads

       MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp     Fixpt      If   Equal
                  1      2      3   MOPS   MOPS      MOPS    MOPS    MOPS

 1T    723.1  517.2  517.0  254.9   12.1    8.8    5853.9  1181.8  1189.8
 2T   1464.7  960.5 1025.1  511.3   24.1   18.5   11899.0  2381.2  2385.7
 4T   2902.3 1696.4 1867.3 1013.4   47.8   36.8   19754.6  4541.3  4687.1
 8T   3004.0 2747.8 2569.0 1066.4   48.6   38.0   25502.9  6075.2  5610.8

 Overall Seconds 4.77 1T, 4.74 2T, 4.88 4T, 9.76 8T

 ############################# RPi 3 SUSE ##############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-Whetstone Benchmark armv8 64 Bit Tue Mar 7 23:27:25 2017

                    Using 1, 2, 4 and 8 Threads

       MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp     Fixpt      If   Equal
                  1      2      3   MOPS   MOPS      MOPS    MOPS    MOPS

 1T    985.3  336.2  336.3  287.7   18.1   12.3 1478579.3  2331.7  1198.9
 2T   1964.8  670.7  672.6  566.7   36.2   24.6 2794892.5  4724.7  2372.4
 4T   3900.7 1248.1 1330.8 1139.9   71.6   48.9 3931546.6  9424.8  4747.9
 8T   3925.4 1314.4 1349.8 1146.9   72.0   49.1 6508657.2  9578.2  4779.7

 Overall Seconds 4.94 1T, 4.98 2T, 5.14 4T, 10.11 8T

 ############################ RPi 3 Gentoo #############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-Whetstone Benchmark armv8 64 Bit Wed Mar 8 11:48:21 2017

                    Using 1, 2, 4 and 8 Threads

       MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp     Fixpt      If   Equal
                  1      2      3   MOPS   MOPS      MOPS    MOPS    MOPS

 1T   1045.1  322.6  330.4  282.5   20.4   12.8 1527755.4  2316.1  1178.5
 2T   2091.3  653.1  661.0  563.9   40.9   25.5 2764929.6  4599.8  2356.7
 4T   2460.5 1199.4 1314.7 1124.8   41.2   27.3 5201735.3  9305.0  2480.2
 8T   3394.6 1422.0 1697.0 1192.2   56.4   44.8 4006323.7 10229.6  2480.3

 Overall Seconds 5.02 1T, 5.02 2T, 8.57 4T, 13.51 8T

MP-Dhrystone - MP-DHRYPiA7, MP-DHRYPi64

This runs multiple copies of the whole program at the same time. Dedicated data arrays are used for each thread but there are numerous other variables that are shared. The latter reduces performance gains via multiple threads and, in some cases, these can be slower than using a single thread.

The only reliable measurement, for comparison purposes, is the single thread speed. Here, the 64 bit version indicates a speed improvement of 50%, over the 32 bit program.
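The VAX MIPS ratings in the results below are consistent with the usual convention of dividing Dhrystones per second by 1757, the score attributed to the VAX 11/780. A short sketch of that arithmetic follows; the function names are illustrative assumptions. The single thread figures of 4229473 and 6343818 Dhrystones per second give ratings of about 2407 and 3611, a ratio of close to 1.5.

```c
/* VAX MIPS rating from Dhrystones per second (1757 = VAX 11/780 score) */
double vax_mips(double dhrystones_per_second)
{
    return dhrystones_per_second / 1757.0;
}

/* Percentage improvement of a new score over an old one */
double percent_gain(double new_score, double old_score)
{
    return (new_score / old_score - 1.0) * 100.0;
}
```

Since both ratings use the same divisor, the percentage gain is the same whether it is calculated from Dhrystones per second or from VAX MIPS.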

 ############################## RPi 3 ##################################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Mon Aug 15 19:47:57 2016

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.95     1.12     1.59     3.04
 Dhrystones per Second    4229473  7124952 10091677 10523432
 VAX MIPS rating             2407     4055     5744     5989

 Internal pass count correct all threads

  End of test Mon Aug 15 19:48:04 2016

 ############################# RPi 3 SUSE ##############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-Dhrystone Benchmark armv8 64 Bit Tue Mar 7 22:20:45 2017

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.63     0.77     1.40     2.77
 Dhrystones per Second    6343818 10382333 11459690 11533058
 VAX MIPS rating             3611     5909     6522     6564

 Internal pass count correct all threads

  End of test Tue Mar 7 22:20:51 2017

 ############################ RPi 3 Gentoo #############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-Dhrystone Benchmark armv8 64 Bit Wed Mar 8 11:34:32 2017

                    Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.63     0.78     2.75     3.11
 Dhrystones per Second    6367171 10213192  5810865 10285768
 VAX MIPS rating             3624     5813     3307     5854

 Internal pass count correct all threads

  End of test Wed Mar 8 11:34:40 2017

MP-BusSpeedPiA7, MP-BusSpeedPi64

This runs integer read only tests using caches and RAM, each thread accessing the same data sequentially. To start with, data is read with large address increments to demonstrate burst data transfers. Performance gains, using L1 cache, can be proportional to the number of cores, but not quite so using L2. The program is designed to produce maximum throughput over buses and demonstrates the fastest RAM speeds using multiple cores.

In the original version, each thread started reading data from the same starting point. This produced acceptable results until shared L2 caches appeared. Then it produced excessive RAM speeds, using more than one thread. With version 2, as used for the following, each thread starts reading from different addresses, providing more realistic results.
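The two ideas described so far, reading with a given address increment and giving each thread a different starting address, can be sketched as follows. The function names start_offset and read_and are assumptions for illustration, and the wrap-round indexing is a simplification rather than the method used in the benchmark itself.

```c
#include <stddef.h>

/* Starting word for a thread: starts are spread evenly through the array,
   so that threads do not all read the same cached data at the same time. */
size_t start_offset(size_t words, int thread, int threads)
{
    return (words / (size_t)threads) * (size_t)thread;
}

/* Read words with the given increment (32, 16, 8, 4, 2, or 1 for read all),
   beginning at 'start' and wrapping round, ANDing together the values read. */
unsigned int read_and(const unsigned int *data, size_t words,
                      size_t start, size_t inc)
{
    unsigned int result = 0xFFFFFFFFu;
    for (size_t i = 0; i < words; i += inc)
        result &= data[(start + i) % words];
    return result;
}
```

With an increment of 32 words only one word in every 128 bytes is used, so the reported speed mainly reflects how quickly whole bursts of data can be fetched.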

The 32 bit ARM V7A compilation produced the expected pattern of speeds, doubling up with decreasing address increments, where burst reading is used, and improving L1 cache data transfer rate, also providing reasonable MP performance gains. The 64 bit results were much slower and, particularly, demonstrated slower L1 cache speeds at reducing address increments. The reason can be identified from a disassembly of the code used for the important “Read All” tests. Here, the C code has a loop with 64 AND operations. The 32 bit version translated these arithmetic operations into 16 NEON four way vector instructions. The 64 bit version had 64 scalar AND and 64 data load instructions, overall executing 2.5 times the number of instructions used by the 32 bit version to deal with the same amount of data.

 ############################## RPi 3 ##################################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-BusSpd ARM V7A v2 Sun Jul 24 09:26:21 2016

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

      KB        Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

    12.3 1T      3011   3715   3792   4080   4400   4149
         2T      5391   6873   7125   7827   8466   8124
         4T      8622  11926  13488  15276  16419  13422
         8T      4922   7930   9659  11732  13307  11995
   122.9 1T       565    563   1070   1792   2830   3865
         2T       886    901   1762   3225   5402   7584
         4T       901    921   1863   3727   7185  13816
         8T       874    919   1762   3712   6269   9242
   12288 1T       120    125    244    420    968   1926
         2T       126    128    246    537   1000   2184
         4T       110    118    231    443    990   1824
         8T       120    137    262    517   1043   2124

  End of test Sun Jul 24 09:26:33 2016

 ############################# RPi 3 SUSE ##############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-BusSpd armv8 64 Bit Tue Mar 7 22:44:44 2017

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

      KB        Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

    12.3 1T      2885   2407   2576   2093   1460   1521
         2T      4764   4197   4944   3960   2890   2929
         4T      6842   6443   8343   6997   5360   5667
         8T      4563   4352   6368   6106   4600   5184
   122.9 1T       545    584   1043   1596   1456   1462
         2T       872    890   1718   3001   2807   2861
         4T       828    900   1859   3687   5523   5789
         8T       866    913   1875   3691   5477   5704
   12288 1T       113    123    244    486    915   1145
         2T        69    125    226    435   1149   1964
         4T        86     91    268    490    998   2092
         8T        89    104    219    480    976   1798

  End of test Tue Mar 7 22:44:57 2017

 ############################ RPi 3 Gentoo #############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-BusSpd armv8 64 Bit Wed Mar 15 11:23:10 2017

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

      KB        Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

    12.3 1T      2699   2430   2645   2052   1467   1493
         2T      4687   4153   4854   3797   2827   2933
         4T      6825   6472   8358   7148   4789   5680
         8T      4272   4146   5928   5705   4588   4977
   122.9 1T       550    568   1022   1615   1427   1472
         2T       872    852   1691   3027   2821   2932
         4T       821    894   1845   3654   5570   5822
         8T       896    892   1850   3602   5136   5439
   12288 1T       108    115    224    455    852   1085
         2T        51    120    216    432    856   1722
         4T        68    109    229    402    887   1604
         8T        67    109    240    583    975   1834

  End of test Wed Mar 15 11:23:23 2017

MP-RandMemPiA7, MP-RandMemPi64

The benchmark has cache and RAM read only and read/write tests using sequential and random access, each thread accessing the same data but starting at different points. It uses the Mutex functions as in Whetstone above, sometimes leading to no performance gains using multiple threads. Although performance via the L1 cache, L2 cache and RAM can be different, it is normally consistent, in each of these areas, during read/write tests. With the read only tests, performance via L1 cache typically produced a throughput gain of 3.6 to 3.8 times using four cores, but somewhat less so, using shared data in L2 cache. Random access is also demonstrated as being relatively slow where burst data transfers are involved. Note that performance can vary somewhat, and a few runs might be needed to demonstrate best case results.
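One common way to compare sequential and random access over the same data is to read through an index array that holds either serial or shuffled positions. The sketch below follows that approach as an assumption; it is not taken from the benchmark source, and the names fill_serial, fill_random and read_indexed are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Index array for serial access: 0, 1, 2, ... n-1 */
void fill_serial(size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = i;
}

/* Index array for random access: a shuffle of 0 .. n-1 (Fisher-Yates),
   so every word is still visited exactly once. */
void fill_random(size_t *idx, size_t n, unsigned int seed)
{
    fill_serial(idx, n);
    srand(seed);
    for (size_t i = n; i > 1; i--) {
        size_t j = (size_t)rand() % i;
        size_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Read only test: sum the data in the order given by the index array */
unsigned int read_indexed(const unsigned int *data, const size_t *idx, size_t n)
{
    unsigned int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}
```

Both orders touch the same words and produce the same sum, so any difference in measured speed comes from the access pattern alone.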

L1 cache 64 bit speeds are shown to be 43% faster than those at 32 bits, for read only tests and 20% via L2 cache, but in the same areas, up to 20% slower when writing is involved.

 ############################## RPi 3 ##################################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-RandMem Linux/ARM V7A v1.0 Mon Aug 15 19:37:27 2016

    MB/Second Using 1, 2, 4 and 8 Threads

      KB        SerRD SerRDWR   RndRD RndRDWR

    12.3 1T      2907    3773    2917    3790
         2T      5480    3768    5187    3775
         4T     11198    3679   10960    3712
         8T     10094    3697   10038    3685
   122.9 1T      2673    3340     686     892
         2T      5031    3386    1251     888
         4T      9398    3378    2002     890
         8T      9291    3370    1916     886
   12288 1T      1896     899      50      64
         2T      2535     900      98      65
         4T      2878     896     137      64
         8T      2631     897     130      65

  End of test Mon Aug 15 19:38:14 2016

 ############################# RPi 3 SUSE ##############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-RandMem armv8 64 Bit Tue Mar 7 23:20:26 2017

    MB/Second Using 1, 2, 4 and 8 Threads

      KB        SerRD SerRDWR   RndRD RndRDWR

    12.3 1T      4251    3142    4180    3074
         2T      7641    3118    7586    3120
         4T     15308    3077   15309    3060
         8T     14920    3041   14761    3043
   122.9 1T      3462    2848     889     858
         2T      6356    2899    1590     846
         4T     11078    2910    2013     857
         8T     11069    2917    2018     843
   12288 1T      1858     873      83      67
         2T      2331     864     148      66
         4T      2359     878     160      66
         8T      2108     890     163      66

  End of test Tue Mar 7 23:21:12 2017

 ############################ RPi 3 Gentoo #############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-RandMem armv8 64 Bit Sun Mar 12 11:18:10 2017

    MB/Second Using 1, 2, 4 and 8 Threads

      KB        SerRD SerRDWR   RndRD RndRDWR

    12.3 1T      4268    3087    4267    3087
         2T      7520    3062    7525    3055
         4T     15295    3021   14322    3021
         8T     15200    2973   14897    2999
   122.9 1T      3384    2851     872     839
         2T      6314    2877    1523     838
         4T     11027    2871    2012     836
         8T     10344    2864    1937     835
   12288 1T      1795     846      78      63
         2T      1933     771     136      63
         4T      1760     845     152      63
         8T      1972     843     138      63

  End of test Sun Mar 12 11:18:56 2017

OpenMP-MFLOPS, notOpenMP-MFLOPS, OpenMP-MFLOPS64, notOpenMP-MFLOPS64

The benchmark uses the same calculations as the original MP_MFLOPS benchmark for Linux; MP-MFLOPS above uses a cut down version, implemented for use on Android devices. The OpenMP-MFLOPS benchmark uses the simplest OpenMP directive, #pragma omp parallel for, before the for loops where parallelisation might be expected, and a -fopenmp compile parameter. Also, notOpenMP-MFLOPS is the same, without the compile parameter.
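A minimal sketch of the directive in use is shown below. The function name omp_add_mul and the two operations per word expression are assumptions for illustration. Compiled with gcc and -fopenmp, the loop iterations are shared between the available cores; compiled without that parameter, the pragma is ignored and the same source runs on one core, which is the notOpenMP arrangement described above.

```c
/* x[i] = (x[i] + a) * b over n words (hypothetical example).
   The directive takes effect only when compiled with -fopenmp. */
void omp_add_mul(float *x, long n, float a, float b)
{
    long i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;
}
```

Each iteration writes only its own element, so the loop needs no further clauses to be safe to parallelise.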

Samples of full results are below for 32 bit and 64 bit benchmarks. At this time OpenMP libraries are not included in gcc for 64 bit Gentoo but, of course, the notOpenMP-MFLOPS64 program could be run.

Below the detailed results are performance comparisons and a table of numeric results. Although the latter were constant during a test run, variations occur on values from different compilations. It should be noted that the minimum data size is 400 KB, that is, in L2 cache whether using one core or four cores.

64 Bit vs 32 Bit - Main gains were at 32 operations per word read, little different with the single core test, maybe a little slower, but up to 2.8 times faster using all cores.

MP gains - The main gains were on tests using L2 cache and 32 calculations per word, with maximum of 2.72 times at 32 bits and 3.89 times at 64 bits.

Different Numeric Results - 32 bit and 64 bit results can be different. The with and without OpenMP values are the same, except for 32 operations per word at 32 bits. Here, the same type of instructions are used, but in a different order.

Comparison with other MP-MFLOPS benchmarks - see Maximum 1 Core MFLOPS above.

 ############################## RPi 3 ##################################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  Not OpenMP MFLOPS Benchmark 1 Mon Aug 15 19:23:03 2016

  Test             4 Byte  Ops/  Repeat   Seconds  MFLOPS     First  All
                    Words  Word  Passes                     Results Same

 Data in & out     100000     2    2500  0.697952     716  0.929538  Yes
 Data in & out    1000000     2     250  1.160158     431  0.992550  Yes
 Data in & out   10000000     2      25  1.140070     439  0.999250  Yes
 Data in & out     100000     8    2500  1.178477    1697  0.957126  Yes
 Data in & out    1000000     8     250  1.442497    1386  0.995524  Yes
 Data in & out   10000000     8      25  1.428921    1400  0.999550  Yes
 Data in & out     100000    32    2500  5.060230    1581  0.890268  Yes
 Data in & out    1000000    32     250  5.203246    1538  0.988078  Yes
 Data in & out   10000000    32      25  5.203889    1537  0.998806  Yes

  OpenMP MFLOPS Benchmark 1 Sat Jul 30 13:01:12 2016

  Test             4 Byte  Ops/  Repeat   Seconds  MFLOPS     First  All
                    Words  Word  Passes                     Results Same

 Data in & out     100000     2    2500  0.363631    1375  0.929538  Yes
 Data in & out    1000000     2     250  1.133716     441  0.992550  Yes
 Data in & out   10000000     2      25  1.150107     435  0.999250  Yes
 Data in & out     100000     8    2500  0.432833    4621  0.957126  Yes
 Data in & out    1000000     8     250  1.177219    1699  0.995524  Yes
 Data in & out   10000000     8      25  1.151536    1737  0.999550  Yes
 Data in & out     100000    32    2500  3.845114    2081  0.890232  Yes
 Data in & out    1000000    32     250  3.754590    2131  0.988068  Yes
 Data in & out   10000000    32      25  3.737356    2141  0.998785  Yes

 ############################# RPi 3 SUSE ##############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  notOpenMP MFLOPS64 Fri Feb 24 15:48:41 2017

  Test             4 Byte  Ops/  Repeat   Seconds  MFLOPS     First  All
                    Words  Word  Passes                     Results Same

 Data in & out     100000     2    2500  0.696362     718  0.929538  Yes
 Data in & out    1000000     2     250  1.202102     416  0.992550  Yes
 Data in & out   10000000     2      25  1.140033     439  0.999250  Yes
 Data in & out     100000     8    2500  1.162491    1720  0.957117  Yes
 Data in & out    1000000     8     250  1.504922    1329  0.995518  Yes
 Data in & out   10000000     8      25  1.478444    1353  0.999549  Yes
 Data in & out     100000    32    2500  5.346043    1496  0.890215  Yes
 Data in & out    1000000    32     250  5.482719    1459  0.988088  Yes
 Data in & out   10000000    32      25  5.477190    1461  0.998796  Yes

  OpenMP MFLOPS64 Fri Feb 24 16:49:35 2017

  Test             4 Byte  Ops/  Repeat   Seconds  MFLOPS     First  All
                    Words  Word  Passes                     Results Same

 Data in & out     100000     2    2500  0.229756    2176  0.929538  Yes
 Data in & out    1000000     2     250  1.230560     406  0.992550  Yes
 Data in & out   10000000     2      25  1.159971     431  0.999250  Yes
 Data in & out     100000     8    2500  0.344756    5801  0.957117  Yes
 Data in & out    1000000     8     250  1.245537    1606  0.995518  Yes
 Data in & out   10000000     8      25  1.187876    1684  0.999549  Yes
 Data in & out     100000    32    2500  1.373730    5824  0.890215  Yes
 Data in & out    1000000    32     250  1.519274    5266  0.988088  Yes
 Data in & out   10000000    32      25  1.469316    5445  0.998796  Yes

 ############################ RPi 3 Gentoo #############################
 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  notOpenMP MFLOPS64 Thu Mar 2 17:05:47 2017

  Test             4 Byte  Ops/  Repeat   Seconds  MFLOPS     First  All
                    Words  Word  Passes                     Results Same

 Data in & out     100000     2    2500  0.739649     676  0.929538  Yes
 Data in & out    1000000     2     250  1.230036     406  0.992550  Yes
 Data in & out   10000000     2      25  1.179612     424  0.999250  Yes
 Data in & out     100000     8    2500  1.196997    1671  0.957117  Yes
 Data in & out    1000000     8     250  1.560925    1281  0.995518  Yes
 Data in & out   10000000     8      25  1.483354    1348  0.999549  Yes
 Data in & out     100000    32    2500  5.437056    1471  0.890215  Yes
 Data in & out    1000000    32     250  5.585995    1432  0.988088  Yes
 Data in & out   10000000    32      25  5.576582    1435  0.998796  Yes

  OpenMP MFLOPS64 - OpenMP library file not available

 ################### Comparison ###################

     Words  Ops/     MP Gains       64 Bit Gains
            Word  32 Bit  64 Bit  Not OMP     OMP

    100000     2    1.92    3.03     1.00    1.58
   1000000     2    1.02    0.98     0.97    0.92
  10000000     2    0.99    0.98     1.00    0.99
    100000     8    2.72    3.37     1.01    1.26
   1000000     8    1.23    1.21     0.96    0.95
  10000000     8    1.24    1.24     0.97    0.97
    100000    32    1.32    3.89     0.95    2.80
   1000000    32    1.39    3.61     0.95    2.47
  10000000    32    1.39    3.73     0.95    2.54

 #################### Numeric Results #####################

     Words  Ops/   Not OMP       OMP   Not OMP       OMP
            Word    32 Bit    32 Bit    64 Bit    64 Bit

    100000     2  0.929538  0.929538  0.929538  0.929538
   1000000     2  0.992550  0.992550  0.992550  0.992550
  10000000     2  0.999250  0.999250  0.999250  0.999250
    100000     8  0.957126  0.957126  0.957117  0.957117
   1000000     8  0.995524  0.995524  0.995518  0.995518
  10000000     8  0.999550  0.999550  0.999549  0.999549
    100000    32  0.890268  0.890232  0.890215  0.890215
   1000000    32  0.988078  0.988068  0.988088  0.988088
  10000000    32  0.998806  0.998785  0.998796  0.998796

OpenMP-MemSpeed2, NotOpenMP-MemSpeed2, OpenMP-MemSpeed264, NotOpenMP-MemSpeed264

This is the same as Memory Speed Benchmark but with measurements extending to test more memory, also using the OpenMP directive and compile parameter. The NotOpenMP tests use the same code without specifying a compilation using OpenMP. These allow comparisons of MP performance gains over the full range of memory use. At this time, OpenMP was not available in Gentoo, but the NotOpenMP benchmark was run.

MP Gains and Losses - As all the test functions involve writing back results, with few instructions in between, MP benefits are often not that good. With the OpenMP 64 bit version, integer tests averaged 12% to 30% slower, but floating point calculations were faster, at 1.62 to 2.45 times DP and 1.25 to 1.88 times SP. 32 bit ratios were 33% to 61%, 2.85 to 3.75 and 1.44 to 1.88 respectively.

64/32 Bit Ratios - 64 bit versus 32 bit comparisons were also diverse, starting with the former’s RAM speeds being somewhat slower. For cache based data, average integer, DP and SP performance ratios, with OpenMP, were 1.23 to 1.45, 0.82 to 1.05 and 0.71 to 1.04, then with notOpenMP, 1.05 to 1.35, 1.63 to 2.60 and 0.96 to 1.35.

 ############################## RPi 3 ##################################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

  Start of test Mon Sep 5 14:27:38 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   5518   2990   1309   8808   4732   1455  15426   7656   1244
       8   5414   3115   1322  10150   5068   1470  14323   8301   1254
      16   5503   3143   1270  10255   5154   1378  16743   8043   1221
      32   5507   3145   1344  10142   5089   1458  16572   7732   1206
      64   5033   2999   1257   9230   4867   1419  16012   7869   1228
     128   5255   3041   1258   9372   5014   1365   9452   8192   1252
     256   5266   3093   1282   9401   5006   1372   8418   7864   1313
     512   4494   2765   1358   7248   4482   1332   5748   5460   1410
    1024   3810   2683   1078   4425   3668   1155   1753   1732   1265
    2048   2008   1425   1098   2274   2214    980   1086   1094   1333
    4096   3972   2413   1075   4628   3672    945   1058   1057    839
    8192   1597   2435    920   3671   3649   1199   1059   1067   1043
   16384   3838   1624   1867   4440   1550   1108   1065   1076   1166
   32768   1658   2273   1695   4227   1876   1054   1066   1039    921
   65536   3657   1247   1286   4839   3801   1308   1053   1046   1133
  131072    990    655    810   1260    932    826   1129   1083    619

  End of test Mon Sep 5 14:28:08 2016

 ####################### RPi 3 Not OMP ###########################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom

  Start of test Mon Sep 5 14:28:22 2016

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    785   2536   3789   2360   3448   3787   2670   2693   2692
       8   1594   2547   3812   2389   3465   3812   2715   2716   2716
      16   1595   2551   3824   2392   3477   3823   2727   2728   2728
      32   1556   2435   3564   2300   3272   3565   2730   2722   2723
      64   1513   2314   3330   2189   3091   3327   2599   2435   2435
     128   1516   2312   3357   2188   3118   3353   2635   2569   2569
     256   1521   2316   3381   2187   3130   3384   2676   2618   2617
     512   1419   2034   2765   1977   2674   2835   2593   2481   2524
    1024   1113   1379   1544   1348   1521   1543   1691   1583   1586
    2048    995   1203   1282   1193   1277   1257   1263   1231   1232
    4096    992   1196   1248   1178   1252   1259   1203   1176   1166
    8192   1041   1237   1290   1213   1298   1291    927    943    954
   16384   1052   1262   1311   1229   1252   1303    874    866    867
   32768   1053   1271   1317   1239   1325   1303    995    987    991
   65536   1057   1281   1323   1245   1343   1316    920    920    918
  131072   1057   1283   1323   1184   1350   1327    856    849    840

  End of test Mon Sep 5 14:28:50 2016

 ############################# RPi 3 SUSE ##############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  Memory Reading Speed Test OpenMP 64 Bit by Roy Longbottom

  Start of test Tue Mar 7 23:41:04 2017

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   5788   3106   1698   8203   4576   1827  11038   5622   2042
       8   6182   3187   1711   9272   4842   1848  11315   5645   2054
      16   5631   3197   1639   9320   4850   1753  11223   5520   1813
      32   6132   3174   1604   9124   4833   1640  11040   5408   1731
      64   5967   3168   1602   8641   4764   1688   9768   5338   1763
     128   5469   3173   1572   8682   4408   1727   9054   5358   1811
     256   5242   3177   1625   8630   4668   1678   8276   4972   1822
     512   3684   3015   1640   7187   4321   1659   6745   5019   1585
    1024   2326   2620   1307   4240   3442   1284   1656   1634   1071
    2048   3767   2494   1283   4155   3419   1243   1088   1066    989
    4096   1996   2628    962   4139   3391   1361   1049   1034    803
    8192    854   1627   1304   4122   3352   1595   1010    987   1045
   16384    859   2221   1635   3764   1334   1735    865    819    843
   32768   1349   1109   1821   1222   1645   1101    912    797    979
   65536   3554    533    550   1493   1223   1258    857    874   1600
  131072    927    640    737   1097    860    740    891   1272    549

  End of test Tue Mar 7 23:41:35 2017

 ####################### RPi 3 SUSE Not OMP ###########################

  Memory Reading Speed Test notOpenMP 64 Bit by Roy Longbottom

  Start of test Tue Mar 7 23:40:15 2017

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   3991   2484   4123   5232   3381   4123   4521   3674   3672
       8   4033   2494   4144   5322   3398   4149   4628   3714   3693
      16   4006   2497   4156   5339   3402   4157   4656   3734   3733
      32   3487   2331   3711   4570   3102   3710   4615   3600   3628
      64   3181   2199   3436   3980   2954   3455   4371   3580   3580
     128   3313   2259   3478   3941   2999   3478   4197   3466   3464
     256   3418   2263   3502   3966   3011   3507   4244   3493   3489
     512   2683   1918   2749   2994   2425   2701   3585   3129   3143
    1024    673   1215   1298   1330   1314   1337   1365   1562   1548
    2048    729    982   1135   1141   1122   1133   1039   1087   1093
    4096    713   1069   1116   1128   1117   1089    968    993    986
    8192    704   1074   1111   1125   1119   1110    927    969    960
   16384    624   1077   1108   1123   1118   1107    953    958    801
   32768    795   1064   1108   1123   1120   1107    962    720    949
   65536   1112   1081   1109   1123   1121    921    966    997   1006
  131072    908    805    886   1118   1148   1159    959    965   1014

  End of test Tue Mar 7 23:40:43 2017

 ############################ RPi 3 Gentoo #############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

 OpenMP Not Available

 ####################### RPi 3 Gentoo Not OMP ###########################

  Memory Reading Speed Test notOpenMP 64 Bit by Roy Longbottom

  Start of test Wed Mar 8 11:50:40 2017

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   4059   2496   4143   5250   3396   4140   4545   3689   3692
       8   4135   2503   4138   5342   3412   4168   4646   3728   3727
      16   4134   2509   4172   5365   3418   4172   4678   3751   3750
      32   3563   2327   3668   4399   3082   3671   4443   3611   3616
      64   3390   2287   3549   4081   3026   3549   4309   3524   3558
     128   3344   2298   3561   4025   3060   3561   4231   3500   3520
     256   3337   2276   3496   3951   3040   3495   4243   3508   3510
     512   2542   1744   2559   2790   2282   2575   2587   2816   2845
    1024    766   1167   1268   1290   1245   1264   1268   1437   1418
    2048    765   1054   1118   1126   1107   1106    828    902    897
    4096    793   1056   1105   1105   1098   1106    875    903    909
    8192    797   1083   1141   1124   1122   1135    782    788    787
   16384    913   1098   1164   1147   1140   1163    734    733    740
   32768   1137   1105   1175   1156   1150   1166    734    737    741
   65536   1137   1106   1176   1157   1151   1176    712    719    718
  131072   1135   1087   1180   1155   1155   1175    745    761    761

  End of test Wed Mar 8 11:51:08 2017

To Start

NEON MultiThreading Benchmarks

32 bit versions are in Raspberry_Pi_MP_Benchmarks.zip with details and results in Raspberry Pi Multithreading Benchmarks.htm. The new 64 bit varieties are included in Rpi3-64-Bit-Benchmarks.tar.gz.

There is no NEON compile option, for C code, at 64 bits but, as at 32 bits, programs using NEON intrinsic functions are compiled as different vector instructions.

To Start

MP-MFLOPSPiNeon, MP-NeonMFLOPS, MP-NeonMFLOPS64

MP-MFLOPSPiNeon is compiled, at 32 bits, from the same source code as MP-MFLOPS, using NEON compile options. The other two carry out the same calculations using a well ordered structure of NEON intrinsic functions, clearly suitable for SIMD operation. The two 32 bit benchmarks produce similar speeds but slightly different results of numeric calculations, through using different instructions.

64 Bit vs 32 Bit - The average speed improvement, with cache based data, was 60%, with RAM speeds slightly slower.

MP gains - At 64 bits and cached data, throughput increased between 3.0 and 3.97 times, but somewhat less at 32 bits.

Comparison with other MP-MFLOPS benchmarks - see Maximum 1 Core MFLOPS above.

The 64 bit NEON intrinsics version produced the best MP performance, at just over 10 GFLOPS, compared with 6.3 GFLOPS at 32 bits.
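As a rough check on the MP gains quoted above, the throughput ratio between thread counts can be worked out from the MFLOPS figures in the results that follow. The sketch below is not part of the benchmark programs. The function name mp_gain is made up for illustration, and the figures used are the 128 KB, 32 Ops/Word results for 4 threads and 1 thread from the SUSE 64 bit and the 32 bit NEON intrinsics runs.

```shell
#!/bin/sh
# Hypothetical helper, not part of the benchmark programs:
# ratio of multi-thread MFLOPS to single-thread MFLOPS, to 2 decimals.
mp_gain() {
    awk -v mt="$1" -v st="$2" 'BEGIN { printf "%.2f\n", mt / st }'
}

# 128 KB, 32 Ops/Word, 4 threads versus 1 thread
echo "64 bit SUSE gain: $(mp_gain 10075 2570)"
echo "32 bit gain:      $(mp_gain 6128 1703)"
```

Other columns can be compared the same way by passing the multi-thread score first and the single-thread score second.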

 ################## RPi 3 V7A2 Compiled NEON ####################

  MP-MFLOPS Compiled NEON v1.0 Mon Aug 15 19:09:46 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

              2 Ops/Word              32 Ops/Word
 KB        12.8     128   12800    12.8     128   12800
 MFLOPS
 1T         419     782     437    1672    1660    1637
 2T        1324    1529     442    3331    3308    3212
 4T        1903    1574     439    5040    6073    5738
 8T        1613    2204     433    5543    5780    5445
 Results x 100000
 1T       76406   97075   99969   66008   95367   99951

 ############################## RPi 3 ##################################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-MFLOPS NEON Intrinsics v1.0 Mon Aug 15 19:41:37 2016

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

              2 Ops/Word              32 Ops/Word
 KB        12.8     128   12800    12.8     128   12800
 MFLOPS
 1T         347     583     427    1706    1703    1657
 2T        1080    1157     438    3397    3398    3226
 4T         979    1430     437    6265    6128    5464
 8T        1218    1351     436    5507    5766    5426
 Results x 100000
 1T       76406   97075   99969   66014   95363   99951

 ############################# RPi 3 SUSE ##############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-MFLOPS NEON Intrinsics 64 Bit Tue Feb 28 15:37:21 2017

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

              2 Ops/Word              32 Ops/Word
 KB        12.8     128   12800    12.8     128   12800
 MFLOPS
 1T         729     688     404    2640    2570    2479
 2T        1427    1423     348    5142    5119    4646
 4T        1433    2729     361    9844   10075    5788
 8T        2202    2481     358    9199   10119    5844
 Results x 100000
 1T       76406   97075   99969   66015   95363   99951

 ############################ RPi 3 Gentoo #############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

  MP-MFLOPS NEON Intrinsics 64 Bit Thu Mar 2 17:03:53 2017

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

              2 Ops/Word              32 Ops/Word
 KB        12.8     128   12800    12.8     128   12800
 MFLOPS
 1T         695     688     386    2595    2577    2423
 2T        1373    1242     375    5163    5096    4445
 4T        1389    1785     371   10035   10030    6171
 8T        2071    2470     339    9410    9481    6209
 Results x 100000
 1T       76406   97075   99969   66015   95363   99951

To Start

linpackNeonMP, linpackNeonMP64

The original Linpack benchmark for Raspberry Pi operates on double precision floating point 100x100 matrices (N = 100). This version uses mainly the same C programming code as the single precision floating point NEON compilation. It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. The 0 thread procedures are identical to those in the single core 100 x 100 NEON compilation, using NEON intrinsic functions. The benchmark was produced to demonstrate that the original Linpack 100x100 code could not be converted (by me) to show increased performance using multiple threads. The official line is that users are allowed to use their own linear equation solver for this purpose.
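For readers converting MFLOPS scores back into run times, the conventional Linpack operation count for an N x N system is 2N^3/3 + 2N^2. The sketch below assumes this benchmark uses that conventional count, as the source code is not reproduced here, and the function names are illustrative only.

```shell
#!/bin/sh
# Conventional Linpack floating point operation count for an N x N system:
# 2*N^3/3 + 2*N^2. Assumed here, not taken from the benchmark source.
linpack_ops() {
    awk -v n="$1" 'BEGIN { printf "%.0f\n", 2 * n * n * n / 3 + 2 * n * n }'
}

# Hypothetical helper: microseconds per factor-and-solve pass implied by
# an MFLOPS score. MFLOPS is operations per microsecond, so time = ops / MFLOPS.
linpack_usecs() {
    awk -v n="$1" -v mflops="$2" \
        'BEGIN { printf "%.1f\n", (2 * n * n * n / 3 + 2 * n * n) / mflops }'
}

for n in 100 500 1000; do
    echo "N $n: $(linpack_ops "$n") operations"
done

# N 100 with no threads, using the 32 bit result of 538.46 MFLOPS
echo "N 100 at 538.46 MFLOPS: $(linpack_usecs 100 538.46) microseconds"
```

The operation count grows with the cube of N, which is why the N 1000 runs take far longer than the N 100 runs even at similar MFLOPS.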

In this case, the 64 bit version tends to be a little faster than the 32 bit program. Results from the numeric calculations are not the same, due to different instructions being compiled from intrinsic NEON functions.

 ############################## RPi 3 ##################################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

 Linpack Single Precision MultiThreaded Benchmark
  Using NEON Intrinsics, Mon Aug 15 19:44:30 2016

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     538.46   116.24   113.61   113.47
 N  500     467.73   335.53   338.61   338.97
 N 1000     363.87   336.10   336.72   336.22

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04

 ############################# RPi 3 SUSE ##############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

 Linpack Single Precision MultiThreaded Benchmark
  64 Bit NEON Intrinsics, Tue Mar 7 22:30:51 2017

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     566.59   130.23   127.00   123.08
 N  500     475.35   349.60   346.54   340.43
 N 1000     355.27   326.07   324.62   325.75

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            1.97            5.40           13.51
 RE  4.69621336e-05  6.44138840e-04  3.22485110e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -1.31130219e-05  5.79357147e-05 -3.08930874e-04
 XN -1.30534172e-05  3.51667404e-05  1.90019608e-04

 ############################ RPi 3 Gentoo #############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

 Linpack Single Precision MultiThreaded Benchmark
  64 Bit NEON Intrinsics, Wed Mar 8 11:36:25 2017

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

 Threads      None        1        2        4

 N  100     552.47   112.73   105.19   105.31
 N  500     442.32   303.75   303.64   305.03
 N 1000     353.88   315.96   309.15   308.31

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            1.97            5.40           13.51
 RE  4.69621336e-05  6.44138840e-04  3.22485110e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -1.31130219e-05  5.79357147e-05 -3.08930874e-04
 XN -1.30534172e-05  3.51667404e-05  1.90019608e-04

To Start

Java Benchmarks

Java programs can run via any Operating System, assuming that a compatible Java RunTime Environment (JRE) is available. The JRE translates a general purpose .class file into hardware dependent computer instructions. The .class files are produced using a Java Development Kit (JDK) and these can be run via suitable Operating Systems.

Java programming code can be arranged in two different ways, that is off-line, in this case via a Terminal command (example java myprog), or on-line as an Applet launched by an HTML document. In both cases, the .class files are produced using javac command (example javac myprog.java).

Relating my experiences, the initial Raspbian Operating System had no JREs or JDKs installed. After executing the following, the command java -version indicated Java version “1.6.0_27”. Java Applets and off-line files could then be executed but floating point arithmetic was painfully slow.

    sudo apt-get update                    
    sudo apt-get install icedtea-plugin

It was then discovered that Oracle Java SE 8 Developer Preview for ARM was needed to provide high speed hard float support. This and JRE 7 were installed using Instructions to install Java 8 and 7 (Then the various JREs could be selected using sudo update-alternatives --config java).

Using JRE 8 produced the desired effect with off-line Java but made no difference accessing on-line Applets. This question in Raspberry Pi Forum provided a solution, at least for faster floating point. This specifies changes to .cfg files to enable JamVM to be used as an alternative Java Virtual Machine. Further details from a Pi Forum message suggested that Cacao VM would be faster than JamVM on floating point calculations, and this proved to be true. The available version was icedtea-6-jre-cacao, that is, for JRE 6. It seems that this is run using the command java -cacao program, and that -jamvm, or -zero for the original slow version, can be used instead, but only when JRE 6 is selected.

Using a newly installed Raspbian, the on-line versions only run via the Midori browser. After loading the page, it can take longer than 10 seconds before a benchmark starts running.

As usual the benchmarks and source codes are included in Raspberry_Pi_Benchmarks.zip.

64 Bit Versions - Existing Java benchmarks could be run using OpenSUSE, via readily available Java Runtime Environment software, but those for Gentoo would have involved a lot of studying. However, Oracle JDK 1.8 was downloaded for temporary use. This allowed execution of the Java Whetstone Benchmark but not the JavaDraw tests. On the other hand, it could compile a 1.8 version of the latter that would run using OpenSUSE. The Java benchmarks are available in Rpi3-64-Bit-Benchmarks.tar.gz.

To Start

Java Whetstone Benchmarks

See Comparisons Below

Details of the Whetstone Benchmark are provided above. Both off-line and on-line versions are provided in the zip file, including source code for off-line and Applet versions, along with the HTML page to run the program. A text log file is produced with the off-line version but a screen copy (scrot -s command, click on browser - needs installing - sudo apt-get install scrot) has to be made for on-line runs, if a record of performance is required. Examples of both are below.

The benchmark .class files were compiled using JDK 1.6 via Linux Ubuntu 10.04, and these run via Windows and Linux. The zip file also includes .class files produced by JDK 7 on the Raspberry Pi. WARNING: this failed to run via Ubuntu using JRE 6 but it does on the Pi that also has JRE 7.

The on-line versions are run by clicking on whetjava2.htm (or right click and select browser) and the off-line varieties using the command “java whetstc”. The on-line version can also be run via Online Benchmarks.htm.

   Whetstone Benchmark Java Version, May 27 2013, 18:09:00

                                                    1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137     49.18              0.3904
  N2 floating point  -1.131330490     46.54              2.8880
  N3 if then else     1.000000000              27.73     3.7320
  N4 fixed point     12.000000000              92.48     3.4060
  N5 sin,cos etc.     0.499110103               1.08    77.3100
  N6 floating point   0.999999821     26.69             20.2100
  N7 assignments      3.000000000              39.90     4.6320
  N8 exp,sqrt etc.    0.751108646               0.31   119.9300

  MWIPS                               43.01            232.4984

  Operating System    Linux, Arch. arm, Version 3.6.11+
  Java Vendor         Oracle Corporation, Version 1.8.0-ea

To Start

Java Whetstone Comparison

Following are off-line and on-line results, showing the changes in performance through using upgraded JREs and with overclocking (CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, 6 volts). There is not much difference in overall MWIPS on using JamVM and the original is faster on the COS/EXP tests, but much slower on the other tests. Similarly, JRE 8 averages about four times faster than JRE 7 with JamVM, if the COS/EXP results are excluded. Later results, using Cacao VM, show that this is much faster than JamVM, again except for the COS/EXP functions.

The off-line versions ran without any problems on the Raspberry Pi 2, other than the EXP test that was particularly slow via JRE 8, reducing the overall MWIPS rating. Other than this, performance via JRE 8 was significantly better than using JRE 6 or JRE 7, as it was on the original RPi. JRE 8 RPi 2 speeds were between 1.5 and 4.3 times faster than RPi 1.

I was unable to persuade Epiphany Browser to use JRE 7 or 8, to run the on-line version. I installed IcedTea 1.6, then Midori Browser to obtain the first results (subject to providing permission to JRE 6). Then, out of the blue, Epiphany ran the benchmark applet, obtaining the same performance as Midori. Results, with the CPU at 1 GHz, were 2.0 to 5.7 times faster than those run on the earlier Raspberry Pi.

Some time was spent attempting to run this benchmark efficiently on a Raspberry Pi 3, but without success in enabling hardware acceleration. This was achieved running the off-line version using Java 1.8 Runtime Environment. Here, average performance was 1.5 times RPi 2, compared with 1.33 times MHz speed. Performance was somewhat different to the Cortex-A53 based tablet.

A result of the C version is also shown, along with another from the Android version. For further details and results see Android Benchmarks and Whetstone Benchmark Java Results.

64 Bit Versions - Performance varied considerably using the different Java implementations. OpenSUSE compared well with an original 32 bit version, with performance ratios, for the different tests, varying between 0.28 and 16.1, average 3.53. The Gentoo version was 12% faster than OpenSUSE, on average.

 Version        JRE  MWIPS  ------MFLOPS-------   -------------MOPS---------------
                                1      2      3    COS    EXP  FIXPT     IF  EQUAL

 Off-line

 Original         6   18.3    4.4    6.4    3.3   0.99   0.30    7.9    2.9    2.5
 JamVM            6   23.4    9.4   10.0    8.9   0.69   0.23   17.8    8.1    5.4
 Cacao            6   32.7   25.5   36.7   25.7   0.76   0.24   55.2   28.9   25.8
 Original         7   18.7    4.3    6.2    3.5   0.98   0.30    7.9    2.9    2.6
 JamVM            7   25.7   12.3   11.7    9.7   0.74   0.24   23.6   10.9    6.2
 Original         8   47.8   49.4   47.8   26.7   1.19   0.36   93.3   27.8   40.0
 1000 MHz         8   75.1   71.4   69.2   39.8   2.10   0.53  134.9   40.3   57.8

 Raspberry Pi 2

 Original         6  101.8   30.3   43.6   20.2   2.89   1.98   60.7   38.2   15.0
 Original         7  100.8   30.4   43.6   19.9   2.84   1.99   60.8   38.3   14.8
 Original         8  117.4  118.8  125.3   62.2   3.89   0.74  278.8   60.8  224.8
 javac 1.7        7  100.6   30.4   43.6   19.9   2.83   1.99   60.8   38.3   14.8
 javac 1.7        8  116.9  119.5  125.1   62.2   3.91   0.73  278.5   60.8  223.0
 javac 1.8        8  116.8  119.5  125.0   62.2   3.81   0.74  278.3   60.8  224.8
 1000 MHz 1.7     8  128.6  133.0  139.4   69.3   4.19   0.81  310.3   67.7  249.7

 Raspberry Pi 3

 javac 1.7 #      7   83.6   25.3   25.8   21.5   4.55   0.94   42.0   18.6   10.4
 javac 1.7        8  183.0  183.8  178.0   91.1   6.02   1.18  461.1   88.4  276.7
 javac 1.8        8  183.4  184.1  179.6   91.1   5.94   1.19  460.5   88.6  276.6

                    # no hardware acceleration

 ########################## RPi 3 SUSE 64 Bit ###########################

 javac 1.7        7  521.8  269.4  247.3  114.2  12.33  10.04  599.5  191.7   80.0
 javac 1.7        8  692.7  273.8  254.5  114.2  21.88  19.01  620.3  211.7   78.2

 ######################### RPi 3 Gentoo 64 Bit ##########################

 javac 1.7        8  783.0  335.4  296.3  207.0  18.95  18.09  667.1  160.8   88.3

 ###############################################################################

 On-line

 JamVM            6   25.3    9.9   10.3    7.7   0.63   0.25   17.9    8.5    5.1
 1000 MHz         6   39.0   14.1   13.7   11.2   1.30   0.40   25.4   12.2    7.2

 Raspberry Pi 2

 900 MHz          6  101.8   40.8   42.8   26.5   2.90   2.01   65.9   51.4   12.4
 1000 MHz         6  120.4   45.7   47.9   28.5   3.20   2.27   73.8   57.7   14.1

 Raspberry Pi 3

 Similar to javac 1.7 # no hardware acceleration

 ###############################################################################

 Other          MHz

 C Version      700  270.5   97.8  100.8   85.7   5.90   2.70  425.3  698.6  499.0
 C RPi 2        900  525.0  252.0  261.3  223.0  10.20   5.10 1102.5 1358.4  882.0
 C RPi 3       1200  724.5  331.0  347.5  298.1   12.1   8.70 1520.4 1873.4 1216.3

 Android Native C

 v8-A53        1300  834.7  348.9  312.7  310.9   36.7   5.40 1556.7 1867.2  570.5
 v7-A9         1200  731.1  273.6  253.0  252.8   28.0   5.00 1185.2 2383.4 1192.1

 Android Java

 v8-A53        1300  391.3   86.0  155.7   66.8   12.3   8.60  277.8   89.8   61.4
 v7-A9         1000  286.7   53.7   84.7   46.7   14.5   5.40  183.0   69.7   33.2

To Start

JavaDraw Benchmark

See Comparisons Below

JavaDraw is intended to run the same test functions as my JavaDraw.apk benchmark for Android devices, where details and results can be found in Android Graphics Benchmarks.htm. The benchmark uses small to rather excessive simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades. For further details and results see JavaDraw.htm, where links to on-line versions are also provided. However, some displays from these can be erratic and tearing.

Test 1 loads two PNG files, one bitmap moving left/right for each frame, the other circling. This is repeated twice, in this version, as the long start up time leads to slow speeds being reported.

Plus Test 2 for JavaDraw.apk generates 2 SweepGradient multi-coloured circles moving towards the centre and back. The circles are loaded from a PNG file for this version.

Plus Test 3 draws 200 random small circles in the middle of the screen.

Plus Test 4 draws 80 lines from the centre of each side to the opposite side, again with changing colours.

Plus Test 5 draws the same small random circles as Test 3 but with 4000, filling the screen.

Each test runs for approximately 10 seconds at window size 1280 x 720 pixels.
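Frames Per Second is the number of frames drawn divided by the measured elapsed time of the test. A minimal sketch of that calculation follows, using a hypothetical helper name and illustrative frame counts rather than measured results.

```shell
#!/bin/sh
# FPS = frames drawn / elapsed seconds (helper name is illustrative only).
fps() {
    awk -v frames="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", frames / secs }'
}

echo "120 frames in 10.0 seconds: $(fps 120 10.0) FPS"
echo "25 frames in 10.4 seconds:  $(fps 25 10.4) FPS"
```

Because each test stops only after a frame completes, slow tests can overrun the nominal 10 seconds, which is why the elapsed time, not the nominal time, belongs in the division.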

Two versions are available, JavaDrawPC, compiled using JDK 6 via Linux Ubuntu, and JavaDrawPi produced on the Pi using JDK 7. Both can be run via Windows and Linux, subject to the appropriate JRE being available. Commands to use are “java JavaDrawPC” and “java JavaDrawPi”.

Measured speeds are displayed in the Terminal window and in JavaDraw.txt log file. An example is shown below, preceded by the display during Test 4. The benchmark identifies the Operating System and JRE used.


   Java Drawing Benchmark, May 30 2013, 12:40:39
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1       24     2.39
  Display PNG Bitmap Twice Pass 2      118    11.72
  Plus 2 SweepGradient Circles         116    11.56
  Plus 200 Random Small Circles         95     9.48
  Plus 320 Long Lines                   56     5.60
  Plus 4000 Random Small Circles        20     1.92

         Total Elapsed Time  60.8 seconds

  Operating System    Linux, Arch. arm, Version 3.6.11+
  Java Vendor         Oracle Corporation, Version  1.7.0_07

To Start

JavaDraw Comparison - Frames Per Second

Following are Raspberry Pi results at normal and overclocked settings, using JRE 6, 7 and 8. Basic JRE 6 and with JamVM produced similar results with Cacao VM being slightly faster. A surprise was that JRE 8 speeds were much slower (early release - see later results below). JavaDrawPC (from JDK 6) and JavaDrawPi (from Pi JDK 7) also produced similar performance, via JRE 7.

Raspberry Pi 2 results provided an average speed gain of 4.2 times for JRE 7, the CPU running at 1000 MHz, with JRE 8 performance being even better. As the original tests, on the older RPi, indicated extremely slow performance using JRE 8, the software was updated to a new version, and the benchmark rerun. Further tests were carried out, a second one restricted to using one CPU core, whilst running vmstat performance monitor at the same time - See Details and results below. These confirm that JavaDraw can use more than one core to improve performance.

Raspberry Pi 3 had the latest (May 2016) version of Raspbian installed, along with the new OpenGL GLUT experimental driver. As shown below, performance, using JRE 8, was too slow. The previous Operating System was loaded, and that produced average speeds 50% faster than the Raspberry Pi 2. Back to the newer OS, the GLUT driver was disabled, enabling the faster performance, but somewhat slower with JRE 7 selected.

Android results on a Nexus 7 are also shown and these are matched by the Pi at 1 GHz. The final much faster results are for running the same tests via Linux on Atom and Core 2 based PCs. The screen shot above was from a PC with a quad core 3 GHz Phenom CPU under Windows, where CPU utilisation was around 57%, indicating that more than two cores were fully utilised.
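The reasoning behind that last remark is that overall utilisation is a percentage of all cores, so multiplying by the core count gives the equivalent number of fully busy cores. A sketch of the arithmetic, with a hypothetical function name:

```shell
#!/bin/sh
# Equivalent fully busy cores = overall utilisation % x core count / 100.
# Helper name is illustrative only.
busy_cores() {
    awk -v pct="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", pct * cores / 100 }'
}

# Quad core Phenom at around 57% overall utilisation
echo "Equivalent busy cores: $(busy_cores 57 4)"
```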

64 Bit Versions - Even more variations were demonstrated, where one OpenSUSE download consistently produced different results to another. Then, maximum FPS speeds were significantly slower than those on a Raspberry Pi 2 and best case using Raspberry Pi 3 at 32 bits.

                              PNG      PNG   +Sweep     +200     +320    +4000
                          Bitmaps  Bitmaps Gradient    Small     Long    Small
                   JRE          1        2  Circles  Circles    Lines  Circles

 Pi 700 MHz        6          3.6     12.0     11.9      9.5      5.5      1.8
 Pi 700 MHz        6 Cac      0.2     13.6     14.7     12.4      7.2      2.8
 Pi 700 MHz        7          2.4     11.7     11.6      9.5      5.6      1.9
 Pi 700 MHz        8          0.4      2.7      2.6      1.9      0.8      0.4

 Pi 1000 MHz       6         10.1     19.5     19.3     15.9      9.4      3.1
 Pi 1000 MHz       6 Cac      8.3     23.1     21.9     18.2     11.0      4.2
 Pi 1000 MHz       7         11.1     19.2     18.7     16.2      9.5      3.1
 Pi 1000 MHz       8          2.0      4.3      4.2      3.0      1.3      0.6

 Later Java
 Pi 700 MHz        7          0.3     10.9     10.8      8.2      5.0      1.3
 Pi 700 MHz        8          0.2      7.2     10.7     11.7      7.7      5.1

 Raspberry Pi 2

 Pi 2 900 MHz      6         43.1     54.3     54.2     48.4     31.8     18.4
 Pi 2 900 MHz      7         40.8     52.5     51.9     46.9     30.5     17.1
 Pi 2 900 MHz      8         44.4     56.8     57.3     55.0     38.6     25.2
 1 CPU 900 MHz     8         22.1     35.8     35.4     36.0     38.3     22.4

 Pi 2 1000 MHz     6         51.4     65.7     64.5     57.5     37.4     20.3
 Pi 2 1000 MHz     7         51.5     63.7     62.8     56.4     36.9     20.1
 Pi 2 1000 MHz     8         55.0     69.5     70.0     67.7     46.4     29.5

 Raspberry Pi 3

 Pi 3 1200 MHz     7@         2.9      3.2      7.3      8.1      7.5      7.0
 Pi 3 1200 MHz     8@         2.9      3.1      7.3      8.1      7.6      7.0
 Pi 3 1200 MHz     7+        71.8     78.9     84.0     73.1     47.7     23.2
 Pi 3 1200 MHz     8+        81.1     96.9     96.9     88.7     62.8     43.3
 Pi 3 1200 MHz     8-        81.0     97.2     96.7     92.1     61.9     42.1

   @ Later Raspbian with the new OpenGL GLUT driver
   + Driver installed but not enabled
   - Older Raspbian without GLUT driver

 Raspberry Pi 3 64 Bit OpenSUSE

 Pi 3A 1.7 > 1.8.0_111        8.6     10.9     10.7     10.1      7.9      3.6
 Pi 3B 1.7 > 1.8.0_121       22.8     32.1     32.3     27.7     15.3      6.2

   A and B different OpenSUSE downloads

 Raspberry Pi 3 64 Bit Gentoo - Not available

 Other

 ARM A53 1300 MHz                     55.7     36.7     28.5     17.8      5.0
 ARM A9 1300 MHz                      20.4     16.5     14.5     11.3      3.8
 Atom 1666 MHz               57.3     83.2     80.1     74.8     53.6     24.5
 Core 2 2400 MHz            271.5    360.6    227.7    237.6    205.2    142.5

   Cac = Cacao VM

To Start

OpenGL ES Benchmark - OpenGL1Pi.bin

See Comparisons Below

This benchmark is essentially the same as JavaOpenGL1 described in Android Graphics Benchmarks.htm. This has four tests that draw a background of 50 cubes first as wireframes then colour shaded. The third test views the cubes in and out of a tunnel with slotted sides and roof, also containing rotating plates. The last test adds textures to the cubes and plates. The 50 cubes are redrawn 15, 30 and 60 times, with randomised positions, colours and rotational settings. With 6 x 2 triangles per cube, minimum triangles per frame for the three sets of tests are 9000, 18000 and 36000.
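The triangle counts quoted above follow directly from the cube geometry: cubes x 6 faces x 2 triangles per face x number of redraws. A small sketch of that arithmetic, using a function name invented for illustration:

```shell
#!/bin/sh
# Minimum triangles per frame: cubes x 6 faces x 2 triangles x repeats.
# Function name is hypothetical, not taken from the benchmark source.
triangles_per_frame() {
    cubes=$1
    repeats=$2
    echo $(( cubes * 6 * 2 * repeats ))
}

for repeats in 15 30 60; do
    echo "50 cubes x $repeats repeats: $(triangles_per_frame 50 "$repeats") triangles"
done
```

With five cubes and 15 repeats the same function gives 900, which corresponds to the 900+ rows in the results.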

Speed is measured in Frames Per Second (FPS). With Android, maximum FPS is 60, limited by the imposition of wait for vertical blank (VSYNC). So, there is not much point in using lighter loading. As VSYNC appears not to be forced under Raspbian, additional tests using five cubes (x15 repeats) are included.

The commands to compile the OpenGL ES program were extracted from sample program hello_pi makefile.

Nominal duration of each test is 10 seconds. Actual elapsed times and FPS scores are displayed on the LXTerminal display as the tests progress. On completion, results are saved in a text log file. See example below, along with compile and execute commands, the latter having parameters that define the window size to use. As usual, the benchmark, source code and image files used are available in Raspberry_Pi_Benchmarks.zip.

Of particular note, CPU utilisation, shown in Task Manager, is less than 50% for the most stressful test. The run time parameters were changed to allow the benchmark to run for a specified time- see Reliability Tests. This still runs 16 tests but each generates 36000 textured triangles.

June 2015 - Version 1.2 produced. The original version was found to be counting frames twice, doubling FPS speed results. This is not important when comparing performance at different system settings or with Raspberry Pi 2. The revised program has the correct frame count. The results suggest that displays are synchronised to run at a maximum of 50 FPS, using the UK standard frequency of 50 Hz, as the VSYNC setting.

 Compile Commands

 Use the two extremely long (>512 characters) cc compile commands and the
 cc link command in comments at the start of OpenGL1Pi.c

 Make files

 These are now included in the zip file. A make command executes Makefile
 that uses Makefile.include to compile and link the benchmark programs.

 Run Commands

 ./OpenGL1Pi.bin Wide pppp, High pppp RunTime mm
                 pppp = pixels, mm - minutes for reliability test

 Default ./OpenGL1Pi.bin - 1280 x 720, 16 x 10 second tests

 parameter names just first letter used upper or lower case
 pppp any size e.g. W 1920, H 1080 - W 120, H 60 - W 60, H 120

 Example OpenGLPi.txt Log File

 Raspberry Pi OpenGL ES Benchmark 1.2, Mon Jun 8 11:22:12 2015

           --------- Frames Per Second --------
 Triangles WireFrame   Shaded  Shaded+ Textured

     900+      50.05    50.01    43.50    39.30
    9000+      20.20    20.06    15.06    11.60
   18000+      10.27    10.19     8.72     6.41
   36000+       5.15     5.13     4.74     3.43

        Screen Pixels 1280 Wide 720 High

 End Time Mon Jun 8 11:24:54 2015

To Start

OpenGL ES Comparison - Frames Per Second

The following results show that maximum overclocking, larger window sizes and smaller ones do not produce significant variations in performance.

Raspberry Pi 2 - The benchmarks were run on RPi 2 and resultant speeds are little different. Measured CPU utilisation was typically 6% or 24% of one CPU core. Recompilation with Cortex A7 parameters made no difference. Details are below.

Raspberry Pi 3 - The benchmark would not run with the new OpenGL GLUT driver installed and enabled, but did with it disabled. Then, as shown below, performance was similar to that from the Raspberry Pi 2.

Lastly, some Android JavaOpenGL1 results are shown for comparison purposes.

 ############ Original Raspberry Pi #############

 RPi 700 MHz, Screen Pixels 1280 x 720

           --------- Frames Per Second --------
 Triangles WireFrame   Shaded  Shaded+ Textured

     900+      50.05    50.01    43.50    39.30
    9000+      20.20    20.06    15.06    11.60
   18000+      10.27    10.19     8.72     6.41
   36000+       5.15     5.13     4.74     3.43

 RPi 1000 MHz, Screen Pixels 1280 x 720

     900+      50.07    50.01    43.82    39.58
    9000+      20.20    20.18    15.13    11.64
   18000+      10.25    10.25     8.76     6.42
   36000+       5.15     5.16     4.76     3.44

 RPi 700 MHz, Screen Pixels 1920 x 1080

     900+      50.05    50.01    43.50    39.30
    9000+      20.20    20.06    15.06    11.60
   18000+      10.27    10.19     8.72     6.41
   36000+       5.15     5.13     4.74     3.43

 RPi 700 MHz, Screen Pixels 320 x 180

     900+      50.11    50.01    44.90    41.80
    9000+      20.60    20.49    15.33    12.79
   18000+      10.41    10.35     8.85     7.21
   36000+       5.23     5.20     4.79     3.87

 ################ Raspberry Pi 2 #################

 RPi 2 900 MHz, Screen Pixels 1280 x 720

           --------- Frames Per Second --------
 Triangles WireFrame   Shaded  Shaded+ Textured

     900+      50.07    50.00    44.76    41.10
    9000+      20.38    20.61    15.36    12.24
   18000+      10.37    10.42     8.90     6.89
   36000+       5.21     5.23     4.82     3.72

 RPi 2 900 MHz, Screen Pixels 1920 x 1080

     900+      50.07    50.00    43.32    38.94
    9000+      19.63    19.75    14.85    11.69
   18000+      10.15    10.03     8.60     6.02
   36000+       4.99     5.06     4.66     3.07

 ################ Raspberry Pi 3 #################

 OpenGL GLUT driver disabled

 RPi 3 1200 MHz, Screen Pixels 1280 x 720

           --------- Frames Per Second --------
 Triangles WireFrame   Shaded  Shaded+ Textured

     900+      60.02    60.00    43.48    40.03
    9000+      20.30    20.13    15.03    12.15
   18000+      10.29    10.19     8.71     6.83
   36000+       5.17     5.13     4.72     3.67

 RPi 3 1200 MHz, Screen Pixels 1920 x 1080

           --------- Frames Per Second --------
 Triangles WireFrame   Shaded  Shaded+ Textured

     900+      59.99    60.00    41.45    37.88
    9000+      19.38    19.17    14.42    11.59
   18000+       9.84     9.75     8.34     6.49
   36000+       4.91     4.90     4.52     3.30

 ##################### Other #####################

 Excludes Tests With 900+ Triangles

 1.3 GHz quad core 64 bit MediaTek ARM Cortex-A53
 Android 5.0, GPU Mali T720 MP2
 Screen Pixels 800 Wide 1216 High

 Android Java OpenGL Benchmark 26-Aug-2015 16.24

           --------- Frames Per Second --------
 Triangles WireFrame   Shaded  Shaded+ Textured

    9000+      22.55    22.11    16.67    14.27
   18000+      11.55    11.60     9.98     8.27
   36000+       5.92     5.98     5.48     4.48

 Android JavaOpenGL1 Galaxy SIII, Quad Cortex-A9 1.4 GHz, Android 4.0.4,
 ARM Mali-400 MP4 quad core graphics. Screen Pixels 1280 x 720

           --------- Frames Per Second --------
 Triangles WireFrame   Shaded  Shaded+ Textured

    9000+      57.98    59.62    51.93    41.19
   18000+      34.46    34.28    29.61    15.25
   36000+      14.45    13.11    13.03     7.34

 Android JavaOpenGL1 Nexus 7 Quad 1300 MHz Cortex-A9, Android 4.1.2,
 nVidia ULP GeForce Graphics 12 core, 416 MHz. Screen Pixels 1280 x 736

    9000+      42.18    43.57    33.38    23.54
   18000+      23.68    23.47    19.91    13.38
   36000+      12.05    11.95    11.00     7.10

To Start

OpenGL GLUT Benchmark - videogl32, videogl64

See Comparisons Below

In 2011, I produced a Linux version of my 2004 Windows VideoGL1 benchmark. Its pedigree was established in 2012, when I approved a request from a Quality Engineer at Canonical, to use this OpenGL benchmark in the testing framework of the Unity desktop software. One reason probably was that it can be run for extended periods as a stress test. Further details and Linux results are in Linux OpenGL Benchmarks.htm.

The OpenGL version required minimum conversion, with OpenGL code functions unchanged. The benchmark, source code and image files are included in the OpenGL folder in Raspberry_Pi_Benchmarks.zip, and also separately in Raspberry_Pi_OpenGL_Benchmark.zip.

The benchmarks measure graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces. The textures are obtained from 24 bit BMP files that can be up to 256 x 256 pixels at 192 KB, with those supplied being 64 x 64 pixels at 12 KB.
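The texture sizes quoted above follow from the 24 bit pixel format. A minimal Python sketch of the arithmetic, assuming uncompressed pixel data and ignoring the small BMP file header (the function name is purely illustrative):

```python
def bmp_pixel_bytes(width: int, height: int, bits_per_pixel: int = 24) -> int:
    """Return the size of uncompressed pixel data in bytes (file header ignored)."""
    return width * height * bits_per_pixel // 8


# The two texture sizes mentioned in the text
for side in (64, 256):
    size_kb = bmp_pixel_bytes(side, side) / 1024
    print(f"{side} x {side} pixels at 24 bits: {size_kb:.0f} KB")
```

The larger textures therefore hold 16 times as much data as the supplied ones.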

After booting Raspbian-jessie on a Raspberry Pi 2, freeglut software was installed via:

sudo apt-get update               
sudo apt-get install freeglut3    
sudo apt-get install freeglut3-dev

The command raspi-config had to be run, the function updated and the experimental GL driver enabled via Advanced Options.

The benchmark was compiled and linked by the following Terminal command:

gcc ogl1.c cpuidc.c -lrt  -lm -O3 -lglut -lGLU -lGL -o videogl32

The default benchmark runs all tests, each for 5 seconds, at the current display size settings, with an output header (configuration details and main headings), one line of results and end messages (date and time). Other parameters are pixel dimensions W or Width and H or Height. Initial results, in the next section, were produced via the following script (runit included in zip file), to provide a single table with minimum additional data. The export command is a later addition to turn off Wait For Vertical Blank (or VSYNC), to demonstrate maximum speeds.

export vblank_mode=0                                
./videogl32 Width 320, Height 240, NoEnd            
./videogl32 Width 640, Height 480, NoHeading, NoEnd 
./videogl32 Width 1024, Height 768, NoHeading, NoEnd
./videogl32 NoHeading

Stress test results, running videogl32 and CPU tests, are included in the stress test report. See: Livermore Loops and Maximum MFLOPS benchmarks.

64 Bit Version - The later compilations for 64 bit operation are available in Rpi3-64-Bit-Benchmarks.tar.gz. This also includes the source code and script file to run the benchmark at different window sizes. The program can also be used as a stress test, as indicated above. In order to compile and run, FreeGLUT library (or equivalent) has to be installed. This did not work on some OpenSUSE Leap 42.2 distributions.


OpenGL GLUT Benchmark Comparisons

The first set of results demonstrated extremely slow speeds. Then, via sudo raspi-config I enabled the experimental desktop GL driver, to produce the much improved second set of results. These appear to be limited to a maximum of 50 Frames Per Second, assumed to be due to Wait For Vertical Blank (VSYNC) being active. Googling indicated that an export vblank_mode=0 command was needed. So this was added to the script file to produce the third report.

The fourth table loaded 192 KB BMP texture files instead of the default ones at 12 KB. These could reduce displayed speed by a factor of up to three.

The fifth set of scores is with the system overclocked from 900 to 1000 MHz (1.11 times). Average improvements in FPS speeds were 1.13 times, with small window plain colour tests appearing to be up to 25% faster, these probably being more dependent on graphics speed.
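The average gains quoted for overclocking can be checked from the Raspberry Pi 2 tables in this section. The following Python sketch assumes that each "Average Gain" figure is the arithmetic mean of the four per-window-size FPS ratios in a column; the variable and function names are illustrative only.

```python
# FPS rows for 320x240, 640x480, 1024x768 and 1920x1080 windows, copied from
# the "After disabling VSYNC" (900 MHz) and "Overclocked" (1000 MHz) tables.
BASE_900 = [
    [210.3, 114.4, 52.6, 32.5, 12.1, 7.8],
    [115.0, 89.5, 48.5, 30.6, 11.9, 7.7],
    [47.9, 46.7, 37.5, 28.3, 11.6, 7.6],
    [20.6, 18.6, 16.8, 15.9, 11.4, 7.4],
]
OC_1000 = [
    [266.7, 138.7, 60.4, 36.9, 13.5, 8.7],
    [126.7, 103.8, 56.4, 35.8, 13.5, 8.8],
    [55.3, 51.1, 41.4, 32.3, 13.1, 8.5],
    [21.6, 20.7, 18.0, 17.2, 12.8, 8.5],
]


def column_gains(base, fast):
    """Mean of fast/base ratios down each column (one column per test)."""
    columns = len(base[0])
    return [
        sum(f[c] / b[c] for b, f in zip(base, fast)) / len(base)
        for c in range(columns)
    ]


gains = column_gains(BASE_900, OC_1000)
print("Per test:", [round(g, 2) for g in gains])
print("Overall :", round(sum(gains) / len(gains), 2))
print("Clock   :", round(1000 / 900, 2))
```

Under that assumption the per-test gains and the overall 1.13 figure agree with the published table, slightly above the 1.11 clock ratio.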

Raspberry Pi 3 - results below show that this was 47% to 79% faster than the non-overclocked Raspberry Pi 2 at the smallest window size. At full screen, 1920 x 1080 pixels, performance was similar using plain colours, then up to 81% faster with the more complex displays.

64 Bit Version - Results below include some from 64 bit versions of OpenSUSE and Gentoo, plus comparisons of the latter and the 32 bit Raspbian based version. At the smaller window sizes, and simple objects, the Raspbian results were faster, but the opposite was apparent on running the more complex kitchen displays. Performance became more equal at increasing window sizes. OpenSUSE speeds were slower than those using Gentoo, particularly with simple objects and all tests at window size 1920 x 1080 pixels.

 First results

 ######################## Raspberry Pi 2 ########################

 GLUT OpenGL Benchmark 32 Bit Version 1, Mon Apr 18 10:01:21 2016

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240      9.5      5.4      7.1      3.7      1.5      1.1
   640   480      3.5      2.9      2.8      1.9      1.3      0.7
  1024   768      1.5      1.3      1.3      1.3      1.0      0.4
  1824   984      0.7      0.6      0.6      0.5      0.7      0.2

                   End at Mon Apr 18 10:04:58 2016

 After enabling the experimental desktop GL driver

 ######################## Raspberry Pi 2 ########################

 GLUT OpenGL Benchmark 32 Bit Version 1, Mon Apr 18 10:18:33 2016

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240     49.4     49.4     39.9     24.9     10.0      7.1
   640   480     50.0     49.4     30.1     23.8     10.0      7.1
  1024   768     47.2     45.4     24.7     23.3     10.0      7.0
  1920  1080     18.5     18.2     16.5     15.5      9.8      7.0

                   End at Mon Apr 18 10:20:48 2016

 After disabling VSYNC

 ######################## Raspberry Pi 2 ########################

 GLUT OpenGL Benchmark 32 Bit Version 1, Tue Apr 19 09:02:30 2016

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    210.3    114.4     52.6     32.5     12.1      7.8
   640   480    115.0     89.5     48.5     30.6     11.9      7.7
  1024   768     47.9     46.7     37.5     28.3     11.6      7.6
  1920  1080     20.6     18.6     16.8     15.9     11.4      7.4

                   End at Tue Apr 19 09:04:45 2016

 Larger texture files - 192 KB instead of 12 KB

 ######################## Raspberry Pi 2 ########################

 GLUT OpenGL Benchmark 32 Bit Version 1, Tue Apr 19 13:30:49 2016

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    213.4    110.6     39.5     12.4     11.7      2.6
   640   480    111.1     84.1     34.0     12.1     11.9      2.5
  1024   768     49.1     47.0     27.7     11.0     11.7      2.4
  1920  1080     20.2     17.3     15.7      9.3     11.5      2.2

                   End at Tue Apr 19 13:33:07 2016

 Default Textures, Overclocked CPU at 1000 MHz (1.11 times faster)

 ######################## Raspberry Pi 2 ########################

 GLUT OpenGL Benchmark 32 Bit Version 1, Thu Apr 21 15:41:04 2016

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    266.7    138.7     60.4     36.9     13.5      8.7
   640   480    126.7    103.8     56.4     35.8     13.5      8.8
  1024   768     55.3     51.1     41.4     32.3     13.1      8.5
  1920  1080     21.6     20.7     18.0     17.2     12.8      8.5

 Average Gain    1.14     1.14     1.12     1.13     1.13     1.13

 Raspberry Pi 3 1200 MHz Default Textures

 ######################## Raspberry Pi 3 ########################

 GLUT OpenGL Benchmark 32 Bit Version 1, Wed Jul 27 20:31:52 2016

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    308.4    182.1     82.6     52.3     21.6     13.7
   640   480    129.5    119.6     74.6     49.2     21.6     13.8
  1024   768     54.8     52.2     43.7     39.2     21.4     13.6
  1920  1080     21.5     17.9     20.3     19.6     20.6     13.4

                   End at Wed Jul 27 20:34:06 2016

 Comparison With Raspberry Pi 2 At Default 900 MHz

   320   240     1.47     1.59     1.57     1.61     1.79     1.76
  1920  1080     1.04     0.96     1.21     1.23     1.81     1.81

 Larger texture files - 192 KB instead of 12 KB

 GLUT OpenGL Benchmark 32 Bit Version 1, Wed Jul 27 20:42:04 2016

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    312.2    178.0     61.0     20.1     21.1      4.2
   640   480    129.8    119.5     47.8     18.3     21.3      4.1
  1024   768     54.8     52.9     37.3     17.0     21.6      3.9
  1920  1080     21.9     18.5     18.0     14.0     20.6      3.6

                   End at Wed Jul 27 20:44:19 2016

 ############################# RPi 3 SUSE ##############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

 GLUT OpenGL Benchmark 64 Bit Version 1, Sat Mar 18 19:03:25 2017

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120     87.1     76.3     64.3     46.9     24.3     15.6
   320   240     59.2     54.7     53.7     43.9     25.6     15.6
   640   480     33.4     31.7     31.0     27.6     24.4     15.3
  1024   768     17.5     17.5     17.7     17.0     16.2     14.1
  1920  1080      8.2      8.3      9.0      9.3      8.4      7.6

                   End at Sat Mar 18 19:06:16 2017

 ############################ RPi 3 Gentoo #############################

 Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz
 Compiled for 64 bit ARM v8a

 GLUT OpenGL Benchmark 64 Bit Version 1, Sat Mar 18 18:21:44 2017

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    184.3    127.5     72.7     48.4     26.8     16.6
   320   240    161.8    116.0     67.1     46.3     26.7     16.7
   640   480     76.8     74.8     49.8     41.4     25.9     16.3
  1024   768     35.7     34.8     29.7     26.7     25.0     15.7
  1920  1080     18.0     18.7     16.4     15.8     17.1     13.1

                   End at Sat Mar 18 18:24:35 2017

 64 Bit / 32 Bit FPS Speed Comparisons

   320   240     0.52     0.64     0.81     0.89     1.24     1.22
  1024   768     0.65     0.67     0.68     0.68     1.17     1.15
  1920  1080     0.84     1.04     0.81     0.81     0.83     0.98


DriveSpeed Benchmark

The main execution C code in version 1.0 was the same as the Android version. However, as some of the results were vastly different to a version produced for Linux, the program was revised. The executable and source code are again in Raspberry_Pi_Benchmarks.zip. The benchmark is provided to measure speeds of the main SD card drive and USB attached storage devices. In my case, a mini USB hub was used that has multiple ports and card reading slots. An example of results, displayed and saved in the driveSpeed.txt log file, is shown below. Tests carried out and changes made are:

Test 1 - Write and read three files at 8 MB and at 16 MB; Results given in MBytes/second
Test 2 - Write 8 MB, read can be cached in RAM; Results given in MBytes/second
Test 3 - Random write and read 1 KB from 4 to 16 MB; Results are average time in milliseconds.
         The original version appeared to enable caching on reading.
Test 4 - Write and read 200 files 4 KB to 16 KB; Results in MB/sec, msecs/file and delete seconds.
         Version 1.0 included an extra “safe to remove” flush that increased file writing times.

Below is a log file for the benchmark running on the SD card. Raspberry Pi 2 speeds were little different (See Comparisons), except for the caching test, where the results below demonstrate the faster RAM speed of the RPi 2.

64 Bit Versions - See Drive and LAN Benchmarks.

 #####################################################

   DriveSpeed RasPi 1.1 Mon Dec 16 16:20:35 2013

 Current Directory Path: /home/pi/benchmarks/DriveSpeed
 Total MB  14894, Free MB  12338, Used MB   2556

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     8.33     8.82     6.87    22.46    22.74    22.74
  16    14.45    14.07    19.45    22.66    22.78    22.76
 Cached
   8    45.95    49.94    58.35   156.96   156.18   155.54

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.711    0.709    0.757     3.34     2.97     6.67

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      1.49     2.54     3.72     5.35     8.65    11.91
 ms/file     2.75     3.23     4.41     0.77     0.95     1.38    0.086

                End of test Mon Dec 16 16:21:06 2013

 #####################################################

 Raspberry Pi 2

   DriveSpeed RasPi 1.1 Sun Mar 1 10:43:41 2015

 Current Directory Path: /home/pi/benchmarks/drivespd
 Total MB   6266, Free MB   3444, Used MB   2822

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

 Cached
   8   101.13   118.31   143.66   487.11   495.06   481.95
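The MB/sec and ms/file lines of the 200 files test are two views of the same measurement. As a rough cross-check, the Python sketch below converts one to the other, assuming 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes, a convention inferred from the logged writing figures rather than taken from the benchmark source.

```python
def mb_per_second(file_kb: float, ms_per_file: float) -> float:
    """Convert average milliseconds per file into MBytes/second.

    Assumed units: KB = 1024 bytes, MB = 1,000,000 bytes.
    """
    bytes_per_file = file_kb * 1024
    seconds_per_file = ms_per_file / 1000
    return bytes_per_file / seconds_per_file / 1_000_000


# Writing results from the original Raspberry Pi SD card log
for kb, ms in ((4, 2.75), (8, 3.23), (16, 4.41)):
    print(f"{kb:2d} KB files at {ms} ms/file = {mb_per_second(kb, ms):.2f} MB/sec")
```

The reading figures agree less exactly, presumably because the logged ms/file values are rounded to two decimal places.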

As usual, the tests are run from an LX Terminal command (./DriveSpeed), pointing to the directory containing the benchmark. As shown above, there are also run time parameters for starting file size (example for 16 and 32 MB) and path for the data. The latter is particularly important for measuring speeds via USB connections. Format options for the run command to use a different path for data and file size are:

./DriveSpeed MBytes nn, FilePath /dddd/dddd (or M nn, F /dddd/dddd).

Mounted USB devices can be identified by executing a df command, with my results shown below (/dev) for a USB Flash drive and a USB powered disk drive with two partitions, the first for a FAT formatted area and the second (long number) for a Linux bootable Ext4 section. The benchmark can be executed, as shown below, using the displayed path. Sometimes, a sudo command might be needed. The benchmark can also be saved on the USB drive and run from there.

At least my system appears to crash sometimes, on changing the USB drive, even after executing an unmount command (see below). A better option appeared to be via the Places tab on File Manager.

 Step 1 display paths

 pi@raspberrypi ~ $ df
 Filesystem      Size  Used Avail Use% Mounted on
 rootfs           15G  2.0G   12G  14% /
 /dev/root        15G  2.0G   12G  14% /
 devtmpfs        180M     0  180M   0% /dev
 tmpfs            38M  300K   38M   1% /run
 tmpfs           5.0M     0  5.0M   0% /run/lock
 tmpfs            75M     0   75M   0% /run/shm
 /dev/mmcblk0p1   56M   19M   38M  34% /boot
 /dev/sdc1       1.9G  222M  1.7G  12% /media/USB2   (USB2 = volume name)
 /dev/sda1        56M   19M   38M  34% /media/C522-EA52
 /dev/sda2       7.3G  1.7G  5.4G  24% /media/62ba9ec9-47d9-4421-aaee-71dd6c0f3707

 Execute Examples

 pi@raspberrypi ~/testdir $ ./DriveSpeed FilePath /media/USB2
 pi@raspberrypi ~/testdir $ sudo ./DriveSpeed FilePath /media/that-long-path

 Benchmark On USB Drive Examples

 pi@raspberrypi ~ $ /media/path/DriveSpeed
 pi@raspberrypi /media/path $ ./DriveSpeed

 Possibility Permissions need setting

 pi@raspberrypi /media/bmarkhere $ sudo chmod 0777 DriveSpeed

 Unmount

 pi@raspberrypi ~/testdir $ sudo umount /media/path
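Picking the USB paths out of the df listing can be scripted. This Python sketch is only an illustration, using a copy of the listing above as sample text; it assumes mount points contain no spaces and that removable drives appear under /media, and the function name is hypothetical.

```python
DF_OUTPUT = """\
Filesystem      Size  Used Avail Use% Mounted on
rootfs           15G  2.0G   12G  14% /
/dev/root        15G  2.0G   12G  14% /
devtmpfs        180M     0  180M   0% /dev
/dev/mmcblk0p1   56M   19M   38M  34% /boot
/dev/sdc1       1.9G  222M  1.7G  12% /media/USB2
/dev/sda1        56M   19M   38M  34% /media/C522-EA52
/dev/sda2       7.3G  1.7G  5.4G  24% /media/62ba9ec9-47d9-4421-aaee-71dd6c0f3707
"""


def media_mounts(df_text: str, prefix: str = "/media/"):
    """Return (device, mount point) pairs for entries mounted under prefix."""
    mounts = []
    for line in df_text.splitlines()[1:]:  # skip the heading line
        fields = line.split()
        if len(fields) >= 6 and fields[5].startswith(prefix):
            mounts.append((fields[0], fields[5]))
    return mounts


# Print a DriveSpeed command line for each removable drive found
for device, mount_point in media_mounts(DF_OUTPUT):
    print(f"{device}: ./DriveSpeed FilePath {mount_point}")
```

On a live system the sample text could be replaced by the captured output of the df command.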


DriveSpeed Comparison

Following are example results, but note that there can be considerable variations from different test runs. The first two of the SD cards have a Class 4 specification, where the number represents minimum speed for recording a video, in MBytes/second. SD 3 has a Class 10 rating but can be the slowest. SD 4 is a SanDisk Extreme Pro microSDHC UHS-1 Class 10 card, rated at up to 633X or 95 MB/second. This clearly has the fastest card writing speeds but, as for most reading speeds, is limited by bus clock frequency. CPU utilisation was less than 10% during writing and reading the large files. All SD cards are Ext4 formatted as system drives. RPi booting times are also shown, where SD 3 is again the slowest, maybe relative to random reading times.
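The 633X and 95 MB/second ratings are equivalent under the usual card marketing convention, where 1X is taken as 150 KB/second (the original CD-ROM speed). A small Python sketch, with that convention as the stated assumption:

```python
KB_PER_SECOND_PER_X = 150  # assumed convention: 1X = 150 KB/second


def x_rating_to_mb_per_second(x_rating: float) -> float:
    """Convert an "X" speed rating to MB/second (1 MB = 1000 KB)."""
    return x_rating * KB_PER_SECOND_PER_X / 1000


print(f"633X = {x_rating_to_mb_per_second(633):.0f} MB/second")
```

This is a rated maximum, well above the speeds of around 20 MB/second measured for this card on the original Raspberry Pi.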

Next are a series of USB Flash memory sticks with FAT formatting. St1 is a SanDisk Cruzer with maximum writing speed rated at 10 MB/s and has an 8 KB sector size. For reference, St2 is an old drive. Patriot Rage XT St3 write/read ratings are 25/27 MB/s with 4 KB sectors. St5 is a high speed SanDisk Extreme USB 3.0 drive, with write/read ratings of 110/190 MB/s and 16 KB sectors. St4 and St6 are the last two with Ext4 format. The main observations are that the faster drive provides little advantage on reading performance, limited by bus speed and other overheads, but produces the fastest writing speeds, with significant gains using Ext4 format over that using FAT.

The disk drive (USB2 HD) results probably reflect bus and RPi overheads, with similar performance to the fastest USB stick on large file tests, including gains from the improved formatting. This is not the case on random access and small files, particularly on writing and more so using FAT format.

The last results are on a Linux based PC with a 2.4 GHz CPU and USB2 sockets. Source code, identical to that used for the RPi, was compiled for the tests. The first is for a SATA based disk drive, with its superior performance, particularly on large files. Then there are results for USB sticks St1, St5 and St6, indicating faster hardware speeds and lower overheads, particularly on writing and reading large files. Then FAT formatting led to the worst performance writing small files.

Raspberry Pi 2 speeds are provided for the main SD card and USB sticks St5 FAT and St6 Ext4, and also for the fast micro SD 4 card via two different readers. The latter produced faster USB 2 speeds, on large files, using a USB 3 card reader. With other devices, performance could be somewhat better or worse than that via the original Raspberry Pi.

 MB/second 16 MB files                                                    Boot
 Large                   Write1  Write2  Write3   Read1   Read2   Read3 Seconds

 SD Main 16 GB             11.5    10.3    11.5    22.7    22.7    22.8      36
 SD 2 4 GB                  8.0     9.4     8.2    20.2    20.2    20.2      37
 SD 3 8 GB                  3.8     6.7     4.6    18.3    18.4    18.2      59
 SD 4 16 GB                19.6    19.8    19.9    22.6    22.2    22.8      37
 USB2 St1 16 GB FAT         3.8     4.0     3.8    24.3    24.7    24.1
 USB2 St2 2 GB old          3.9     3.9     3.9    14.4    14.6    14.6
 USB2 St3 8 GB FAT          9.1     9.2     9.3    25.6    25.5    24.6
 USB2 St4 8 GB Ext4        11.8    11.7    10.8    25.6    25.3    25.3
 USB3 St5 32 GB FAT        17.2    17.3    16.2    25.9    26.1    26.0  [2]
 USB3 St6 32 GB Ext4       26.1    26.4    26.4    26.5    26.2    26.2  [2]
 USB3 St7 32 GB F2fs       22.0    22.0    22.3    24.9    25.4    25.8  [2]
 USB2 HD FAT               17.0    16.0    16.0    24.0    25.6    25.7
 USB2 HD Ext4              24.8    23.8    24.7    22.5    23.4    21.1

 Raspberry Pi 2                                                    Boot Seconds
 SD Main 8 GB              12.6    12.5    12.6    19.5    19.2    19.5      33
 SD 4A 16 GB               29.8    28.6    29.6    30.1    29.4    28.9
 SD 4B 16 GB               15.6    15.7    15.7    19.2    19.3    19.4
 USB3 St5 32 GB FAT         9.2     7.8    11.6    29.1    29.0    29.3
 USB3 St6 32 GB Ext4       19.0    19.2    26.3    24.1    30.3    30.3

 Raspberry Pi 3                                                    Boot Seconds
 SD Main 16 GB              8.7     7.3    11.0    16.9    22.9    23.1      17
 USB2 8 GB FAT             10.6     4.8     3.7    35.0    35.6    34.8

 Linux Main HD PC Ext4     68.4    52.0    77.0    77.7    69.7    70.0
 USB2 St1 Linux FAT         4.3     3.4     4.0    26.7    26.1    25.8
 USB3 St5 Linux FAT        28.1    28.5    27.5    39.0    39.3    39.3  [1]
 USB3 St6 Linux Ext4       29.5    29.6    29.6    39.1    39.3    39.3  [1]
 USB3 St7 Linux F2fs       29.6    30.0    29.6    39.6    39.7    39.4  [1]


 Random milliseconds       Read                    Write
 From MB                      4       8      16       4       8      16

 SD Main Kingston         0.568   0.538   0.535     4.6     5.0     5.2
 SD 2 PNY                 0.821   0.775   0.997    11.1    26.6    28.9
 SD 3 Verbatim            0.995   1.076   1.144     8.5   113.8    70.3
 SD 4 SanDisk EP          0.748   0.735   0.696     2.6     4.4     2.4
 USB2 St1 San Cruzer      0.806   0.799   0.791    20.1    22.2    62.9
 USB2 St2 Old             0.906   0.888   0.889    42.2    56.7   291.7
 USB2 St3 Patriot Rge     0.817   0.789   0.937     3.7    10.1    30.0
 USB2 St4 Pat Ext4        0.775   0.776   0.801     6.0     3.6    10.3
 USB3 St5 San Extreme     0.894   0.891   0.871     1.4     1.2     0.8  [3]
 USB3 St6 San Ex Ext4     0.839   0.822   0.845     0.9     0.8     0.8  [3]
 USB3 St7 San Ex F2fs     0.851   0.903   0.896     2.1     3.2     2.3  [3]
 USB3 St6 4 KB Ext4       0.928   0.940   0.950     1.0     1.0     1.0  [4]
 USB3 St7 4 KB F2fs       0.926   0.943   0.946     0.9     0.9     0.9  [4]
 USB3 St6 Cached Ext4     0.024   0.034   0.114    0.03    0.03    0.04  [5]
 USB3 St7 Cached F2fs     0.025   0.021   0.191    0.01    0.01    0.01  [5]

 1 GB file 4 KB from 256, 512, 1024 MB
 USB3 St6 Cached Ext4     1.168   1.137   1.117    0.32    1.07    0.56  [6]
 USB3 St7 Cached F2fs     1.212   1.160   1.149    0.13    0.12    0.14  [6]

 USB2 HD FAT              0.904   1.490   3.879     1.7     2.1     2.3
 USB2 HD Ext4             0.892   1.750   4.250     1.6     2.2     2.4

 Raspberry Pi 2
 SD Main 8 GB             0.389   0.571   0.403     3.5     8.3     3.4
 SD 4A 16 GB              0.656   0.708   0.698     2.2     3.2     2.7
 SD 4B 16 GB              0.807   0.856   0.843     2.8     4.5     2.1
 USB3 St5 32 GB FAT       0.979   0.484   0.481    1.40    1.60    0.61
 USB3 St6 32 GB Ext4      0.415   0.416   0.439    0.69    0.75    0.59

 Raspberry Pi 3
 SD Main 16 GB            0.460   0.450   0.400    1.68    2.60    1.77
 USB2 8 GB FAT            0.717   0.771   0.797    1.94    2.38    2.41

 Linux Main HD PC Ext4    0.501   0.385   4.163     1.5     2.5     3.3
 USB2 St1 Linux FAT       0.501   0.498   0.499    91.9    41.5    80.1
 USB3 St5 Linux FAT       0.505   0.501   0.500     0.8     1.0     1.5  [3]
 USB3 St6 Linux Ext4      0.503   0.498   0.499     1.1     0.8     0.6  [3]
 USB3 St7 Linux F2fs      0.602   0.624   0.624     1.8     1.7     1.8  [3]


 Milliseconds per file     Write                   Read                  Delete
 File KB                      4       8      16       4       8      16 Seconds

 SD Main                   5.30    4.15    4.49    0.87    0.94    1.39   0.108
 SD 2                      4.99    6.25    6.39    1.16    1.65    2.23   0.122
 SD 3                      5.83   17.44    8.40    1.37    1.99    2.64   0.105
 SD 4                      2.68    2.58    3.79    0.82    0.94    1.33   0.094
 USB2 St1 FAT             30.75   18.41   25.15    1.09    1.17    1.55   0.100
 USB2 St2 FAT             53.83   40.84   35.26    1.75    1.63    1.98   0.058
 USB2 St3 Pat FAT         15.01   15.71   19.27    1.25    1.46    1.57   0.096
 USB2 St4 Pat Ext4         4.48    4.73    8.72    1.14    1.30    1.61   0.043
 USB3 St5 San Ex FAT       4.95    4.48    4.94    1.01    1.30    1.51   0.445  [7]
 USB3 St6 San Ex Ext4      1.57    1.43    2.02    0.98    1.06    1.32   0.043  [7]
 USB3 St7 San Ex F2fs      1.56    1.51    1.87    0.92    1.05    1.36   0.032  [7]
 USB2 HD FAT               8.87    8.20    8.49    1.46    1.37    1.97   0.409
 USB2 HD Ext4              2.86    1.88    2.23    4.43    1.50    1.57   0.109

 Raspberry Pi 2
 SD Main                   2.79    2.27    2.72    0.57    0.84    1.25   0.036
 SD 4A                     1.49    2.22    1.20    0.64    0.91    1.14   0.037
 SD 4B                     0.96    1.21    1.74    0.60    0.86    1.29   0.037
 USB3 St5 32 GB FAT        2.39    1.77    9.88    0.42    0.67    3.60   0.043
 USB3 St6 32 GB Ext4       1.02    0.84    2.37    0.71    0.57    0.76   0.025

 Raspberry Pi 3
 SD Main 16 GB             4.39    1.75    3.83    0.54    0.70    1.09   0.019
 USB2 8 GB FAT             7.24    9.12   12.72    0.64    0.74    0.63   0.012

 Linux Main HD PC          1.25    0.24    0.35    0.30    0.29    0.37   0.004
 USB2 St1 Linux FAT       40.85   27.53   37.09    0.60    0.64    0.89   0.004
 USB3 St5 Linux FAT 1     10.49   10.70   10.86    0.53    0.67    0.73   0.004  [7]
 USB3 St5 Linux FAT 2      1.22    1.07    0.96    0.69    0.73    0.76   0.003  [7]
 USB3 St6 Linux Ext4       0.72    0.65    0.90    0.38    0.52    0.76   0.004  [7]
 USB3 St7 Linux F2fs       0.51    0.59    0.51    0.39    0.51    0.40   0.003  [7]

 FAT 1 and FAT 2 Typical variations on this device using FAT
 SD 4A Old USB 2 Hub, SD 4B USB 3 card reader


DriveSpeed F2FS Format

F2FS (Flash Friendly File System) was created by Samsung to work with Linux, specifically to suit the characteristics of devices such as SSDs and SD cards. Published benchmark results often show that writing performance is superior to using Ext4 format, particularly with random access. Others indicate faster speeds on handling small files.

In order to format a USB Flash drive, a recent version of Linux is required. In my case, Ubuntu 13.10 with Linux 3.12.0 was installed, followed by f2fs-tools. I formatted my SanDisk Extreme USB 3.0 drive, using GParted, with three partitions, FAT, Ext4 and F2fs. The F2fs partition was shown as having an unknown format and did not show using the df command. However, it could be mounted as shown here. Even then it was not visible in Ubuntu, but the directory path could be accessed by the benchmark (using sudo).

For the Raspberry Pi, I downloaded and installed 2013.12.20 Raspbian with Linux 3.10. This provides support at least for reading and writing F2fs partitions. Initially, the existing F2fs USB drive partition was not visible using the df command but, as the drive had another partition, the Filesystem path could be assumed and mounted. DriveSpeed benchmark was run on the Linux PC and RPi, results being included above under St5, St6 and St7 - See [] references.

Large Files - The three different formats produced the same high speed writing and reading on the Linux PC [1] but with some degradation on writing on the RPi to F2fs and particularly FAT [2].

Random Access - Random reading was slightly slower using F2fs and noticeably the slowest on writing. Again the Linux PC was faster [3]. Random access for the benchmark is via 1 KB block sizes. Using VMSTAT and F2fs, it was found that 4 KB was being read and written for each 1 KB access. Increasing block size to 4 KB avoided the reading and F2fs was slightly faster than Ext4 [4].

Random Access Cached - The benchmark opens the file for random access using Direct I/O, avoiding data being kept in the RAM based cache. Enabling caching produces ridiculously fast response times, with the file sizes used [5] (at 1 KB block size).

Random Access Larger Files - The next step was to see what happens with larger files, where up to 1 GB was used [6] (with 4 KB blocks). In this case, random writing times varied considerably with Ext4 (more than shown) but were consistently much faster with F2fs formatting, apparently due to the way in which data is stored. The benchmark is supposed to measure speeds over four seconds but, with Ext4, actual time could be much longer, probably due to shuffling the memory after writing was committed.

Small Files - [7] Average writing and reading times of small files could vary quite a bit but, using Ext4 and F2fs, were generally faster than via FAT formatting and F2fs marginally the winner.

Random Accesses Longer Time - Below [8] are further cached results with 4 KB from 256, 512 and 1024 MB, but running for 40 seconds (Ext4 up to 45 seconds), from which the number of transactions executed has been calculated. Other statistics shown were derived from running VMSTAT at the same time.

Ext4 and F2fs response times and system loading are similar on reading. The speed is now much faster reading from 256 MB, with higher CPU utilisation, due to more data being in the RAM based cache. KB per transaction numbers represent data read over the USB and this can be larger than the 4 KB data requests.

Writing response times are a little slower than with 4 second tests but more consistent with Ext4. The most important observation is that F2fs is still remarkably fast, transferring data over USB at near maximum speed and with high CPU utilisation organising the data.

 4 KB Random Access Over 40 Seconds [8]

                              Read                    Write
 From MB                    256     512    1024     256     512    1024

 USB3 St6 Cached Ext4
 msecs                    0.099   0.617   0.967    0.80    1.35    1.26
 Transactions x 1000        404      65      41      53      33      33
 Million Bytes              263     314     324     151     139     115
 KB per transaction         0.7     4.8     7.8     2.9     4.2     3.4
 MB per second              6.6     7.8     8.1     3.6     3.1     2.7
 CPU Utilisation            64%     49%     49%     33%     27%     30%

 USB3 St7 Cached F2fs
 msecs                    0.107   0.636   0.997    0.14    0.17    0.18
 Transactions x 1000        374      63      40     286     235     222
 Million Bytes              262     310     318     945     885     833
 KB per transaction         0.7     4.9     7.9     3.3     3.8     3.7
 MB per second              6.6     7.7     7.9    23.6    22.1    20.8
 CPU Utilisation            62%     50%     49%     95%     91%     92%
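The derived rows in table [8] can be approximated from the measured response time and the VMSTAT byte counts. The Python sketch below assumes a running time of exactly 40 seconds and 1 KB = 1000 bytes, which is what reproduces the F2fs figures; the Ext4 writing tests ran for up to 45 seconds, so their rows will not match as closely. The function name is illustrative.

```python
def derive_stats(msecs_per_access: float, million_bytes: float,
                 run_seconds: float = 40.0) -> dict:
    """Derive transaction statistics from response time and bytes transferred."""
    transactions = run_seconds * 1000 / msecs_per_access
    return {
        "transactions_k": transactions / 1000,
        # Bytes moved over USB per 4 KB request, in KB of 1000 bytes
        "kb_per_transaction": million_bytes * 1_000_000 / transactions / 1000,
        "mb_per_second": million_bytes / run_seconds,
    }


# F2fs random writing from 256 MB: 0.14 msecs, 945 million bytes
stats = derive_stats(0.14, 945)
print(f"Transactions x 1000 {stats['transactions_k']:.0f}")
print(f"KB per transaction  {stats['kb_per_transaction']:.1f}")
print(f"MB per second       {stats['mb_per_second']:.1f}")
```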


Copying F2FS Files

Performance investigations of USB drives formatted with F2fs, compared with Ext4, were prompted by reports in XBMC Community Forum that copying files to the former was up to nine times faster than to the same drive formatted as Ext4. The particular page is not now directly available but might still be found by Googling for “OpenELEC Testbuilds for RaspberryPi Part 2” 2013-12-19 20:03 (was page 199, later 133). DriveSpeed benchmark did not demonstrate this level of performance gain, except during an extended period of random writing. Now, copying files is likely to involve normal reading and writing, transferring data via a RAM based file cache.

DriveSpeed measures speed with caching enabled, but for larger files. A modified caching version was produced using a large number of small files of increasing sizes where, unlike copying, writing precedes reading. Average results of three tests are shown below [9]. F2fs is faster using smaller files, but not that much, with the position reversed as file sizes are increased. Data transfer speed in MBytes per second is provided [10], to demonstrate caching, where USB speed is exceeded (like > 30 MB/second). Data was not cached, starting at 256 MB, half RAM size.

Next stage involved producing a series of directories, with average file sizes between 6 KB and 500 KB, occupying >100 MB (similar to sizes quoted in XBMC Forum). Results, below [11] show that, still using the SanDisk Extreme USB 3.0 drive, F2fs is a little faster at the larger file sizes, but the position is reversed at reducing file sizes. Most significant is at 6 KB, where Ext4 is 70% faster, with the du command reporting 178 MB, compared with 269 MB with F2fs. VMSTAT recorded MegaBytes written, read, memory used and cache space are also shown for this test, confirming at least these volumes. Windows identified total file size and disk space used under NTFS are also shown. For comparison purposes, calculated MB/second speeds are based on the former.

I installed XBMC Media Center, on a Windows based PC, to produce a Thumbnails directory from photographs, included in the mix, in case there was something special about them. The directory comprised 4370 JPG files at around 34 KB average size, occupying 161 MB with Ext4 and 178 MB under F2fs, the former being slightly faster. These directories were also copied, using two other USB sticks, via the Raspberry Pi and a Linux based PC (plus limited tests with FAT formatting). Linux was faster on all, and the other drives were slower than the Extreme, but there were no significant variations between Ext4 and F2fs formatting. Results are again shown below.

XBMC for the Raspberry Pi is part of OpenElec (Open Embedded Linux Entertainment Center). I installed various versions of this on SD cards and ran DriveSpeed benchmark and file copying tests, booted to OpenElec. Details are in Raspberry Pi OpenElec Benchmarks.htm.

 [9] DriveSpeed 1000 small files, cached, average milliseconds per file, Extreme Drive

 File KB            4      8     16     32     64    128    256    512   1024

 F2FS  Write     0.35   0.32   0.45   0.63   1.50   3.40   9.67  20.49  40.97
       Read      0.09   0.12   0.18   0.28   0.69   1.72  13.00  23.29  43.94

 Ext4  Write     0.46   0.48   0.60   1.07   2.33   5.28  10.21  20.29  43.40
       Read      0.12   0.16   0.21   0.33   0.63   1.43  11.80  21.33  43.75

 [10] F2FS MB/second

       Write     11.5   25.0   35.8   51.1   42.8   37.6   26.5   25.0   25.0
       Read      44.4   68.6   88.9  112.9   92.8   74.6   19.7   22.0   23.3

 Copying command and results format

 time sh -c "cp -r /source /destination && sync"

 real 0m35.851s
 user 0m0.420s
 sys  0m7.420s

 [11] Copying Six Different Directories Extreme Drive

                                                               Based on Win MB
              Win KB    Win  Win on   F2FS   Ext4    F2FS    Ext4   F2FS   Ext4
 Set   Files   /file     MB   Drive  du MB  du MB    Secs    Secs MB/sec MB/sec

  1    22945       6    129     173 269 xxx   178   106.9    63.0    1.2    2.0
  2    12974      11    140     171    227    176    66.5    57.0    2.1    2.5
  3     7118      23    161     179    212    184    47.2    39.1    3.4    4.1
  4T    4370      34    148     156    178    161    35.9    30.0    4.1    4.9
  5      932     107    100     102    109    105    14.9    18.0    6.7    5.6
  6      959     492    472     474    466    462    46.2    51.6   10.2    9.1

 xxx vmstat MB F2FS Read 272 Write 277, Ext4 Read 184 Write 223
 xxx vmstat MB F2FS RAM 298 Cache 288, Ext4 RAM 286 Cache 248

 XBMC Thumbnails 4T - 4370 Files 148 MB

                  F2FS                 Ext4                 FAT
 Drive      Elap   CPU MB/sec    Elap   CPU MB/sec    Elap   CPU MB/sec
            Secs  Secs           Secs  Secs           Secs  Secs
 Rpi
 Extreme    35.9   7.8    4.1    30.0   8.2    4.9    64.9  19.9    2.3
 Attache    75.3   7.8    2.0    75.4   9.2    2.0
 Cruzer    118.9   7.9    1.2   103.7   9.2    1.4

 Linux
 Extreme    26.9   0.8    5.5    26.8   0.8    5.5    55.7   1.8    2.7
 Attache    54.7   0.7    2.7    65.0   0.7    2.3
 Cruzer     98.6   0.7    1.5    86.4   0.8    1.7
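The MB/sec columns in table [11] are the Windows reported file size divided by elapsed copying time. A short Python sketch reproduces them from the table data; the dictionary layout and names are illustrative only.

```python
# Set name: (Win MB, F2FS seconds, Ext4 seconds), copied from table [11]
COPY_SETS = {
    "1": (129, 106.9, 63.0),
    "2": (140, 66.5, 57.0),
    "3": (161, 47.2, 39.1),
    "4T": (148, 35.9, 30.0),
    "5": (100, 14.9, 18.0),
    "6": (472, 46.2, 51.6),
}


def copy_speeds(win_mb: float, f2fs_secs: float, ext4_secs: float):
    """Return (F2FS MB/sec, Ext4 MB/sec), both based on the Windows file size."""
    return win_mb / f2fs_secs, win_mb / ext4_secs


for name, (mb, f2fs_secs, ext4_secs) in COPY_SETS.items():
    f2fs_speed, ext4_speed = copy_speeds(mb, f2fs_secs, ext4_secs)
    ratio = ext4_speed / f2fs_speed
    print(f"Set {name:2s} F2FS {f2fs_speed:5.1f}  Ext4 {ext4_speed:5.1f}  Ext4/F2FS {ratio:4.2f}")
```

For set 1, with 6 KB average file size, the ratio is 1.70, the 70% Ext4 advantage noted in the text.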


LAN/WiFi Benchmark - LanSpeed

This is mainly the same as the DriveSpeed benchmark, described above. The exception is that the cached data test is not possible and the open file options to avoid caching produce run time errors. The benchmark and source code are again in Raspberry_Pi_Benchmarks.zip. Tests carried out are:

Test 1 - Write and read three files at 8 MB and at 16 MB; Results given in MBytes/second
Test 2 - Random write and read 1 KB from 4 to 16 MB; Results are average time in milliseconds
Test 3 - Write and read 200 files 4 KB to 16 KB; Results in MB/sec, msecs/file and delete seconds.

The benchmark can measure performance communicating to both Windows and Linux via Local Area Network (LAN), including a wireless connection, in my case via a Windows Workgroup. The first step is to set up a directory on the Raspberry Pi to mirror the remote sharable data, in my case /public in /media. Then, a directory on the remote system is useful, in my case /test.

The second step is to obtain the Internet Protocol (IP) address of remote PCs - in my case this is dynamic, so it can change between sessions. The appropriate commands are shown below, followed by those for the third step to mount the sharable drive, partition or directory.

The benchmark can be run in three ways with LAN involvement, firstly with the Terminal pointing to the directory on the RPi containing the benchmark and a FilePath parameter /media/public/test (in my case). The second method requires a copy of LanSpeed in /media/public/test with Terminal pointing to that source. The final method uses the remote copy but just loads the benchmark and uses the home (or whichever) folder for writing and reading files, with no LAN activity. As with DriveSpeed, a run time parameter can also specify minimum size for the large file tests (example ./LanSpeed MB 32 for 32 and 64 MB).

64 Bit Versions - See Drive and LAN Benchmarks.

 Create new folder command - sudo mkdir /media/public

 NOTE: there should be no spaces after commas with multiple -o options

 Windows Command Prompt ipconfig command = 192.168.0.2
 Windows share drive (partition) d

 sudo mount -t cifs -o dir_mode=0777,file_mode=0777 //192.168.0.2/d /media/public

 can also add -o password=pi - in this case unchanged default password

 Linux Terminal command ifconfig eth0 (or eth1) = 192.168.0.3
 Linux Wireless Connection Information = 192.168.0.4
 Linux share directory all

 sudo mount -t cifs -o user=UU,password=PP //192.168.0.3/all /media/public

 UU and PP are IDs for Linux system, -o dir_mode=0777,file_mode=0777 not needed

 NOTE: If wrong IDs are used, a locked file will be generated and this leads
       to a failure to open a new file when correct IDs are used. The file
       must be deleted.

 Benchmark and log on Raspberry Pi
 pi@raspberrypi ~/benchmarks/lanspeed $ ./LanSpeed FilePath /media/public/test

 Benchmark and log on remote system
 pi@raspberrypi /media/public/test $ ./LanSpeed

 Benchmark remote, data and log /home/pi - does not use LAN
 pi@raspberrypi ~ $ /media/public/test/LanSpeed

 sudo umount //192.168.0.2/d or //192.168.0.3/all
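Because the -o option list must contain no spaces after the commas, it can be safer to assemble the mount command programmatically. The Python sketch below only builds the argument list shown above and does not run it; the function name is hypothetical and the addresses are the example ones from this page.

```python
def cifs_mount_command(server_ip: str, share: str, mount_point: str,
                       options: dict) -> list:
    """Build a mount command, joining -o options with commas and no spaces."""
    option_text = ",".join(f"{key}={value}" for key, value in options.items())
    return ["sudo", "mount", "-t", "cifs", "-o", option_text,
            f"//{server_ip}/{share}", mount_point]


windows_share = cifs_mount_command(
    "192.168.0.2", "d", "/media/public",
    {"dir_mode": "0777", "file_mode": "0777"},
)
print(" ".join(windows_share))
```

The returned list could be passed to subprocess.run on the Raspberry Pi, which avoids shell quoting problems if a password contains special characters.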


LAN/WiFi Benchmark - More

The Raspberry Pi LAN speed is rated at 100 Mbps, so the maximum data transfer speed will be less than 12.5 MB/second, due to overheads. See the example results below. The overheads also lead to the fairly constant average time to write and read small files. See Raspberry Pi 2 results in comparisons, below.
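The 12.5 MB/second ceiling follows directly from dividing the rated bit rate by eight. The short Python sketch below works through that arithmetic and expresses a measured speed as a fraction of the ceiling. The function names are made up for illustration, and it assumes 1 MB is counted as 1,000,000 bytes, which is not confirmed for the large file tests.

```python
def link_ceiling_mb_per_s(rated_mbps: float) -> float:
    """Raw ceiling in MB/second for a link rated in megabits/second.

    Assumes 1 MB = 1,000,000 bytes and ignores all protocol overheads.
    """
    return rated_mbps / 8.0


def link_utilisation(measured_mb_per_s: float, rated_mbps: float) -> float:
    """Fraction of the raw ceiling achieved by a measured transfer speed."""
    return measured_mb_per_s / link_ceiling_mb_per_s(rated_mbps)


# 100 Mbps Ethernet: 100 / 8 = 12.5 MB/second before overheads
ceiling = link_ceiling_mb_per_s(100)

# 11.57 MB/second is one of the large file read speeds in the example log
used = link_utilisation(11.57, 100)
print(f"ceiling {ceiling} MB/s, utilisation {used:.1%}")
```

On this reckoning, the best read speeds in the example log use a little over 90% of the raw link rate, with the remainder lost to overheads.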

 #####################################################

   LanSpeed RasPi 1.0 Tue Jul 2 10:56:28 2013

 Current Directory Path: /media/public/test
 Total MB  230000, Free MB   85052, Used MB  144948

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     7.49     5.84     8.13    11.56     9.04    11.57
  16     7.29     8.13     6.78    11.53    11.60    11.58

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.011    2.272    1.651     3.40     4.12     4.17

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.62     1.20     2.13     1.05     1.62     2.74
 ms/file     6.63     6.83     7.68     3.88     5.07     5.98    0.280

              End of test Tue Jul 2 10:57:12 2013
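The MB/sec and ms/file rows of the 200 Files section appear to be two views of the same measurement. The Python sketch below converts one into the other, assuming 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes. That unit convention is inferred from the published figures, not taken from lanspeed.c, and the function name is hypothetical.

```python
def small_file_mb_per_s(file_kb: int, ms_per_file: float) -> float:
    """Convert average milliseconds per file into MB/second.

    Assumed units (inferred from the log, not from the source code):
    1 KB = 1024 bytes, 1 MB = 1,000,000 bytes.
    """
    bytes_per_file = file_kb * 1024
    seconds_per_file = ms_per_file / 1000.0
    return bytes_per_file / seconds_per_file / 1_000_000


# Write figures from the example log: (file KB, ms/file)
for kb, ms in [(4, 6.63), (8, 6.83), (16, 7.68)]:
    print(f"{kb:2d} KB files: {small_file_mb_per_s(kb, ms):.2f} MB/sec")
```

Because the time per file is nearly constant, MB/second roughly doubles each time the file size doubles, which is the overhead effect described earlier.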

Intel Linux and Windows Versions - LanSpdx86Lin, LanSpdx86Win.exe

Versions to run on Intel processors via Linux and Windows have been produced. The former was compiled from the supplied lanspeed.c code, but with a different version string for printing. The Windows version has some slight changes, inherited from an earlier benchmark. The execution files are included in the zip file. They can be run from the host PC or stored on the RPi drive and executed via the LAN.

In order for Windows Workgroup systems to access RPi files, samba and samba-common-bin need to be installed, along with changes to /etc/samba/smb.conf. Detailed procedures are in Treating Raspberry Pi as just another Windows machine.

The remote Pi can be made visible using Windows “Map network drive” (T: on my PC). The Raspberry Pi user name and password need to be entered (I seem to have changed the password from raspberry to pi, so mine are pi and pi). The benchmark can then be run from a Windows Command prompt in two ways, as shown below, where LanSpdx86Win.exe is in folder D:\WinDDK\32bit\lanspeed. The .exe file can also be copied to a folder on the Pi, the folder selected in Windows and the benchmark run by double clicking on the .exe file. The log will be saved in the same folder.

The rather convoluted mount command, shown below, is needed to run from Linux. The benchmark (in my case from roy@roy-64Bit:~/all/lanspeed$) can be run from a Terminal command, also shown below. The program can be saved on the RPi (in /media/public/lanspeed). I also copied a script file, runlan86, with the command "./LanSpdx86Lin" and execution permission set. This can be run by clicking on the script file, where output is on the Linux Terminal display.

 Windows

   D:\WinDDK\32bit\lanspeed>LanSpdx86Win FilePath T:\test
   D:\WinDDK\32bit\lanspeed>LanSpdx86Win FilePath \\MYPI\pi\test

 Raspberry Pi ifconfig eth0 = 192.168.0.8

 Linux Ubuntu 10.10 using smbfs - mount all on one line

   sudo mount -t smbfs -o user=pi,password=pi,dir_mode=0777,file_mode=0777 //192.168.0.8/public/home/pi/benchmarks /media/public

 Linux run command - ./LanSpdx86Lin FilePath /media/public/lanspeed


LAN/WiFi Comparison

The results log files identify the system running the tests in Configuration Details with somewhat different variations using Windows and Linux. The destination system can normally be identified from logged Current Directory Path and Total MB (drive capacity).

The first four results are for the RPi handling data to/from Windows and Linux, then as destination from/to the two PCs. It should be noted that there can be significant performance differences depending on which system is the source or destination.

The next two sets of results are from RPi to a laptop via WiFi, showing the reduction in speed when the laptop is some distance from the router. These are followed by a test not using the LAN, but with RPi accessing the local drive, as DriveSpeed above, where data is cached in RAM.

The last four results are using a Gigabit LAN, again with wide variations in performance depending on the configuration used.

Some LanSpeed Raspberry Pi 2 results are included. Running this, accessing a Windows PC, appeared to produce more consistently high reading and writing speeds for the large files, at over 11 MB/second (demonstrating 100 Mbps LAN), compared with the original RPi. Running LanSpdx86Win.exe, stored on the SD drive, demonstrated some improvement.

 Source  Dest                          MBytes/Second
 CPU     CPU/drive  MB   Write1  Write2  Write3   Read1   Read2   Read3

 Rpi     Ph Win     16     7.29    8.13    6.78   11.53   11.60   11.58
 Ph Win  Rpi        16    11.29   11.18   10.70    4.22    2.70    1.97
 Rpi 2   Ph Win     16    11.31   11.32   11.32   11.65   10.80   11.65
 Ph Win  RPi 2      16    11.51   11.53   11.49    5.33    3.47    2.57
 Rpi 3   Ph Win     16    11.41   11.40   11.39   11.68   11.66   11.67  LAN
 Rpi 3   Ph Win     16     2.69    3.16    2.63    1.60    1.46    0.80  WiFi
 Rpi     C2 Lin     16     7.79    7.52    7.84   11.62   11.61   11.66
 C2 Lin  Rpi        16     6.53    6.36    6.23    5.58    5.49    6.01
 Rpi     LT Lin     16     3.23    3.24    3.20    3.59    3.50    3.50  WiFi
 Rpi     LT Lin     16     1.78    1.62    1.00    0.92    0.89    0.39  WiFi outside
 Rpi     Rpi        16    57.41   60.05   50.00  155.48  152.66  155.89  cached
 C2 Lin  Ph Win     16    57.76   54.31   55.02   33.82   31.91   32.13  1Gbps
 Ph Win  C2 Lin     16   108.62   89.62  109.83   36.45   22.09   15.30  write later
 Ph Win  C2 Win     16    29.19   38.20   38.18   21.48   14.95   11.59  1Gbps
 C2 Win  Ph Win     16    72.36   68.46   50.16   25.96   18.76   12.71  1Gbps

 Random msecs              Read                    Write
 From MB                   4       8      16       4       8      16

 Rpi     Ph Win        0.011   2.272   1.651    3.40    4.12    4.17
 Ph Win  Rpi           1.299   1.208   1.275    1.29    1.37    1.28
 Rpi 2   Ph Win        0.124   0.911   0.998    1.96    1.56    1.68
 Ph Win  RPi 2         0.722   0.699   0.688    0.73    0.73    0.73
 Rpi 3   Ph Win        0.459   0.864   0.743    3.47    2.77    3.16  LAN
 Rpi 3   Ph Win        7.178  10.447   7.784   11.18    9.79    8.99  WiFi
 Rpi     C2 Lin        0.637   2.160   0.872    2.42    2.14    2.15
 C2 Lin  Rpi           1.820   0.978   1.259    3.05    2.49    2.45
 Rpi     LT Lin        4.520   5.391   3.234    4.08    3.22    3.16  WiFi
 Rpi     LT Lin       10.264  11.906  11.107    5.16    4.08    4.29  WiFi outside
 Rpi     Rpi           0.012   0.012   0.012   23.03   24.69   25.01  cached
 C2 Lin  Ph Win        0.001   0.002   0.002    1.79    2.04    1.77  1Gbps
 Ph Win  C2 Lin        0.556   0.468   0.423    0.43    0.43    0.43  write later
 Ph Win  C2 Win        0.846   0.875   5.553    1.13    2.41    2.88  1Gbps
 C2 Win  Ph Win        0.613   0.585   0.583    0.88    1.24    1.37  1Gbps

                        milliseconds per file
 200 Files                 Write                   Read            Delete
 File KB                   4       8      16       4       8      16    secs

 Rpi     Ph Win         6.63    6.83    7.68    3.88    5.07    5.98    0.28
 Ph Win  Rpi           14.15   14.21   15.76   10.32   10.52   11.47    1.79
 Rpi 2   Ph Win         3.92    4.31    5.08    2.33    2.65    3.48    0.15
 Ph Win  RPi 2          7.78    8.33    9.54    4.84    5.31    5.96    0.74
 Rpi 3   Ph Win         4.39    4.66    5.39    1.79    2.31    3.29    0.33  LAN
 Rpi 3   Ph Win        30.22   34.09   53.57   36.84   22.94   40.33    3.13  WiFi
 Rpi     C2 Lin         5.74    6.83    8.96    4.87    5.97    6.74    0.60
 C2 Lin  Rpi            9.87   10.55   11.73    7.13    7.52    8.44    1.30
 Rpi     LT Lin         9.79   10.81   13.34    7.69    8.95   11.53    1.07  WiFi
 Rpi     LT Lin        12.26   16.08   18.94    9.30   12.53   15.09    1.54  WiFi outside
 Rpi     Rpi            0.87    0.73    0.67    0.08    0.15    0.19    0.05  cached
 C2 Lin  Ph Win         2.57    2.52    2.61    0.85    0.83    0.86    0.13  1Gbps
 Ph Win  C2 Lin         3.72    3.58    3.60    3.20    3.22    3.31    0.52  write later
 Ph Win  C2 Win         4.92    3.46    3.50    3.22    3.09    3.42    0.40  1Gbps
 C2 Win  Ph Win         3.10    3.12    3.19    3.99    2.93    2.73    0.46  1Gbps

 Ph Win = Phenom Windows 7
 C2 Win = Core 2 Vista
 C2 Lin = Core 2 Ubuntu 10.1
 LT Lin = Laptop Ubuntu 10.1
 RPi    = Raspberry Pi


DriveSpeed64, DriveSpeed64Long and LanSpeed64

As indicated earlier, the DriveSpeed and LanSpeed benchmarks are identical, except that the former opens files with a parameter to not use caching, other than for the cached test. Results for both are shown below, where many were not as expected and those from Gentoo, OpenSUSE and SUSE SLES were often different. Perhaps this could be changed via Linux tuning parameters. LanSpeed benchmark speeds are also shown from running on local drives, to help to explain unusual results.
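As a rough illustration of the cached versus uncached distinction, the Python sketch below times a large file write with and without the Linux O_DIRECT open flag. This is an assumption: the text only says that DriveSpeed uses "a parameter to not use caching", and the function names, block size and fill pattern are invented for the example. O_DIRECT needs aligned buffers, so an anonymous mmap block is used, and some file systems reject the flag altogether.

```python
import mmap
import os
import time

BLOCK_BYTES = 1024 * 1024  # 1 MB blocks, taken here as 2**20 bytes


def open_flags(direct: bool) -> int:
    """Flags for creating the test file, optionally bypassing the page cache."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if direct:
        # O_DIRECT is Linux specific; getattr keeps the sketch importable elsewhere
        flags |= getattr(os, "O_DIRECT", 0)
    return flags


def timed_write_mb_per_s(path: str, megabytes: int, direct: bool) -> float:
    """Write `megabytes` blocks of 1 MB to `path` and return MB/second."""
    block = mmap.mmap(-1, BLOCK_BYTES)      # anonymous mapping, page aligned
    block.write(b"\xa5" * BLOCK_BYTES)      # arbitrary data pattern
    fd = os.open(path, open_flags(direct), 0o644)
    try:
        start = time.perf_counter()
        for _ in range(megabytes):
            if os.write(fd, block) != BLOCK_BYTES:
                raise OSError("short write")
        os.fsync(fd)                        # include flush time in the measurement
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        block.close()
    return megabytes / elapsed
```

Calling timed_write_mb_per_s("testfile", 8, direct=True) on a drive that cannot handle direct I/O would be expected to raise OSError, which may correspond to the "Error writing file" cases in the logs below.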

Performance of these cannot be compared with earlier 32 bit varieties via Raspbian, as the 64 bit Linux systems are installed on faster micro SD cards, but what can be compared is relative behaviour. These comments are included in the results below.

DriveSpeed64 and DriveSpeed64Long - This only ran successfully via SUSE SLES on the main drive and a Gentoo USB connected external SD card with btrfs format, but still with certain peculiarities. Although the benchmarks did not run properly using USB flash drives, there were no problems in using normal USB connections via Gentoo, but rather complex procedures were required to mount USB drives via OpenSUSE.

LanSpeed64 - Only OpenSUSE could be used, following instructions provided here. Samba for Gentoo was said to be not tested at 64 bits and that for SUSE SLES could not be downloaded following a necessary reinstallation of the system. LanSpeed64 was also run targeting USB and main drives, mainly to identify caching effects. As shown below, there were some strange results.

These 64 bit benchmarks and source codes are now included in Rpi3-64-Bit-Benchmarks.tar.gz.

Complete logs of the latest 32 Bit Results are also provided, showing that the expected performance pattern is still being produced.

 ###################### DriveSpeed64 SUSE SLES ######################

   DriveSpeed RasPi 64 Bit 1.1 Mon Apr 3 23:40:21 2017

 Current Directory Path: /home/roy/driveLANSUSE
 Total MB   29465, Free MB   27495, Used MB    1970

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    10.26    15.50     7.78    47.27    51.62    48.91
  16    10.58    13.86    10.14    54.05    55.50    45.78
 Cached
   8   520.96   586.68   601.25   709.43   709.23   706.46

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.005    0.004    0.004    16.91    20.31    22.13

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.25     0.36     1.06   252.55   403.28   621.47
 ms/file    16.10    23.00    15.43     0.02     0.02     0.03    0.029

              End of test Mon Apr 3 23:40:59 2017

 >>>>>>>>>>>>>>>>>>> Comparison with 32 Bit Version <<<<<<<<<<<<<<<<<<<

 Large Files > Faster SD card reflected, reading > twice as fast
 Random      > Writing exceptionally slow, reading far too fast, data cached?
 Small Files > Writing exceptionally slow, reading far too fast, data cached?

 #################### DriveSpeed64Long SUSE SLES ####################

   Linux Storage Speed Test 64-Bit Version 1.2, Mon Apr 3 23:42:37 2017

 Current Directory Path: /home/roy/driveLANSUSE
 Total MB   29465, Free MB   27495, Used MB    1970

 8 MB File              1        2        3        4        5
 Writing MB/sec     16.26     7.47    10.28     7.43    10.54
 Reading MB/sec     45.71    53.63    48.12    61.91    38.85

 16 MB File             1        2        3        4        5
 Writing MB/sec     13.88    10.92    11.15     8.22    10.59
 Reading MB/sec     55.18    55.05    52.56    56.11    67.87

 32 MB File             1        2        3        4        5
 Writing MB/sec      9.18    10.36    10.23     9.45    10.99
 Reading MB/sec     54.81    51.33    52.77    50.28    54.82
 ---------------------------------------------------------------------
 8 MB Cached File       1        2        3        4        5
 Writing MB/sec    505.97   566.27   554.70   567.73   572.96
 Reading MB/sec    583.78   689.28   696.95   687.48   695.33
 ---------------------------------------------------------------------
 Bus Speed Block KB    64      128      256      512     1024
 Reading MB/sec   3096.23  3292.99  1743.08   808.57   733.75
 ---------------------------------------------------------------------
 1 KB Blocks File MB >    2       4       8      16      32      64     128
 Random Read  msecs    0.54    0.57    1.01    0.71    0.64    0.65    0.65
 Random Write msecs   17.21   10.61   12.56   10.99   22.69   23.40   15.47
 ---------------------------------------------------------------------
 500 Files       Write             Read            Delete
 File KB    MB/sec  ms/File   MB/sec  ms/File     Seconds
      2       0.17    11.79    57.04     0.04       0.073
      4       0.27    15.16   184.70     0.02       0.063
      8       0.55    14.86   414.95     0.02       0.071
     16       0.96    17.12   334.40     0.05       0.077
     32       1.92    17.10   700.71     0.05       0.080
     64       2.30    28.51   207.18     0.32       0.093

              End of test Mon Apr 3 23:45:06 2017

 >>>>>>>>>>>>>>>>>>> Comparison with Short Version <<<<<<<<<<<<<<<<<<<<

 Large Files > Similar speeds
 Random      > Writing speed similar, reading speed quite normal.
               The file used is 128 MB, compared with 16 MB
 Small Files > Writing and reading speed similar, latter still too fast

 ############### DriveSpeed64 & Long OpenSUSE & Gentoo ###############

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8 Error writing file   <<<<<< Could not handle Direct I/O
 Segmentation fault       <<<<<< See next, same program normal cached I/O

 ############## LanSpeed64 Example Running On Main Drive #############

 To demonstrate caching effects

   LanSpeed RasPi 64 Bit 1.0 Tue Apr 4 12:21:50 2017

 Current Directory Path: /home/driveLANOSUSE48
 Total MB   14210, Free MB   10766, Used MB    3444

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8   219.04   239.01   192.05   382.28   380.89   380.13
  16   217.11   210.56   179.51   359.47   278.78   372.17

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.005    0.004   156.99    32.99    16.36

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec     49.40    66.06   106.01   194.50   377.95   537.56
 ms/file     0.08     0.12     0.15     0.02     0.02     0.03    0.012

              End of test Tue Apr 4 12:22:22 2017

 ######################### LanSpeed64 Example #######################

   LanSpeed RasPi 64 Bit 1.0 Tue Apr 4 13:04:06 2017

 Selected File Path: /root/Desktop/sharepc/
 Total MB  266240, Free MB  134653, Used MB  131587

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    11.23    11.40    11.40     8.10    11.62    11.64
  16    11.27    11.42    11.44    11.66    11.66    11.64

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.724    0.886    1.333     1.58     1.50     1.37

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.99     1.81     2.73     1.77     3.02     4.50
 ms/file     4.13     4.54     6.01     2.32     2.71     3.64    0.201

              End of test Tue Apr 4 13:04:43 2017

 >>>>>>>>>>>>> Comparison with 32 Bit Version Rpi 3 Ph Win <<<<<<<<<<<<<

 Large Files > Similar speeds reflecting 100 Mbps
 Random      > Similar but writing faster, no apparent caching
 Small Files > Similar speeds

 ################### DriveSpeed64 USB Flash Drive ###################

   DriveSpeed RasPi 64 Bit 1.1 Tue Apr 4 10:41:03 2017

 Gentoo USB 3 Stick

 Selected File Path: /run/media/demouser/8EDA-9C1C/gentoo//
 Total MB   59665, Free MB   59663, Used MB       2

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    12.16    11.78    11.65 Error reading file

 Gentoo USB 2 Stick

   8     3.10     3.50     2.78 Error reading file

 OPenSUSE USB 3 Stick

   8     5.30    14.01    14.27 Error reading file

 >>>>>>>>> Direct I/O to USB Flash Drives, cannot read
 >>>>>>>>> See later LAN test, same program, normal cached I/O

 ################## DriveSpeed64 External SD Card ###################

 Gentoo via USB, btrfs format

   DriveSpeed RasPi 64 Bit 1.1 Tue Apr 4 10:28:11 2017

 Selected File Path: /run/media/demouser/ROOT/home/roy/benchmarks//
 Total MB   29465, Free MB   27511, Used MB    1953

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     5.53    10.64    12.23    29.99    31.88    33.25
  16     6.88     6.82     8.53    31.21    26.41    28.64
 Cached
   8   159.30   175.77   158.98   235.45   229.22   266.71

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.016    0.006    0.006    20.67    50.55    22.84

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.25     0.40     0.97    58.44   160.18   150.07
 ms/file    16.09    20.66    16.87     0.07     0.05     0.11    0.160

 >>>>>>>> Similar pattern to DriveSpeed64 SUSE SLES main drive above
 >>>>>>>> But slower due to USB data transfer speed

 ################ LanSpeed64 Gentoo USB 3 Flash Drive ###############

   LanSpeed RasPi 64 Bit 1.0 Tue Apr 4 10:57:15 2017

 Selected File Path: /run/media/demouser/8EDA-9C1C/gentoo/
 Total MB   59665, Free MB   59639, Used MB      26

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    29.84    32.89    36.14   619.40   644.11   423.39
  16    53.39    18.56    17.62   675.52   337.83   380.98

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.003    0.003    0.003     5.71     5.26     4.07

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.97     0.93     3.24   163.54   268.83   430.68
 ms/file     4.21     8.83     5.06     0.03     0.03     0.04    0.066

              End of test Tue Apr 4 10:57:43 2017

 >>>>>>>> Similar pattern to LanSpeed64 Running On Main Drive above
 >>>>>>>> But slower due to USB data transfer speed

 ################ LanSpeed64 Gentoo USB 2 Flash Drive ###############

   LanSpeed RasPi 64 Bit 1.0 Tue Apr 4 10:22:24 2017

 Current Directory Path: /home/demouser/driveLANGENTOO
 Total MB   14118, Free MB    7431, Used MB    6688

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    83.67   114.47   122.94   372.52   369.38   238.50
  16   113.20   152.20   114.87   332.33   237.96   392.18

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.003    0.003    0.003     5.08     8.85     4.28

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec     27.93    13.34    30.62   133.59   394.21   372.22
 ms/file     0.15     0.61     0.53     0.03     0.02     0.04    0.021

              End of test Tue Apr 4 10:22:42 2017

 >>>>>>>> Faster writing than USB 3, cached vs writing to drive
 >>>>>>>> Possibly because USB 3 drive is 64 GB vs 16 GB

 ################ 32 Bit DriveSpeed Reference Results ################

 #################### DriveSpeed USB 3 Flash Drive ###################

   DriveSpeed RasPi 1.1 Tue Apr 18 13:45:16 2017

 Selected File Path: /media/pi/8EDA-9C1C/
 Total MB   59665, Free MB   59638, Used MB      27

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    17.30    16.31    15.95    34.91    35.87    34.29
  16     8.96    11.45    13.55    35.26    35.58    35.73
 Cached
   8    51.79    59.95     3.80   524.88   497.86   594.12

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.762    0.818    0.826     5.27     4.96     4.51

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.54     0.69     1.26     4.00     5.29    14.04
 ms/file     7.64    11.94    13.01     1.02     1.55     1.17    0.013

              End of test Tue Apr 18 13:45:53 2017

 >>>>>>>> Unlike the 64 bit version, this runs as expected

 ################# External MicroSD Ultra card via USB ###############

   DriveSpeed RasPi 1.1 Tue Apr 18 10:50:17 2017

 Selected File Path: /media/pi/ROOT/home/benchmarks/test/
 Total MB   14210, Free MB   10860, Used MB    3350

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8     6.75     7.55     8.33    10.86    10.90    10.81
  16     7.94     6.38     7.11    10.82    10.80    10.81
 Cached
   8   160.69   173.57   296.09   719.21   742.73   713.39

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.804    0.827    0.809     2.90     2.55     2.55

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      3.34     5.25     3.87     4.11     6.10     5.42
 ms/file     1.23     1.56     4.23     1.00     1.34     3.02    0.022

              End of test Tue Apr 18 10:50:55 2017

 >>>>>>>> Unlike here, the 64 bit version unexpectedly produced
 >>>>>>>> fast (cached?) reading and slow writing speeds

 ################# 32 Bit LanSpeed Reference Results #################

 ##################### LanSpeed USB 3 Flash Drive ####################

   LanSpeed RasPi 1.0 Tue Apr 18 13:49:04 2017

 Selected File Path: /media/pi/8EDA-9C1C/
 Total MB   59665, Free MB   59662, Used MB       3

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    46.32    59.94    20.98   703.69   716.66   694.25
  16    90.11    20.46    15.68   700.87   703.35   692.30

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.003    0.003    0.003     3.83     4.92     8.12

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.58     2.09     1.97   254.55   213.75   318.84
 ms/file     7.04     3.91     8.32     0.02     0.04     0.05    0.028

              End of test Tue Apr 18 13:49:30 2017

 >>>>>>>> Accepting wide variations, 32 bit and 64 bit results are the same


Temperature Recorder - RPiTemperature - Later RPiHeatMHz For RPi 2

RPiTemperature has been replaced by RPiHeatMHz, to measure and log CPU MHz besides CPU temperature. The program is included in Raspberry_Pi_Benchmarks.zip. This uses data from the following to display and log results (see RPiHeatMHz.c in zip file):

                   /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq  
                   /opt/vc/bin/vcgencmd measure_temp

Run time parameters specify number of samples and interval - see below. Default is 10 samples with 1 second delay between samples. System settings on booting are also included. These are different on the Raspberry Pi 3, where overclocking is not provided.

2016 - the original MHz was measured using function scaling_cur_freq. Now, it is apparent that this does not show dynamic variations. The latest version of RPiHeatMHz now includes results from the measure_clock arm command. Below are example reports from the revised program running on a Raspberry Pi 3, showing that the two measures can be the same when nearly idling.
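A minimal Python sketch of this kind of sampling is shown below; it is not the RPiHeatMHz.c code. It assumes that scaling_cur_freq holds kHz, that measure_clock arm prints a line such as frequency(45)=1200000000 in Hz and that measure_temp prints temp=58.0'C. The function names are invented, and the defaults follow the 10 samples at 1 second intervals mentioned above.

```python
import re
import subprocess
import time

SCALING_FILE = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"
VCGENCMD = "/opt/vc/bin/vcgencmd"


def parse_scaling_mhz(text: str) -> int:
    """scaling_cur_freq is assumed to be in kHz, e.g. '1200000'."""
    return int(text.strip()) // 1000


def parse_measure_clock_mhz(text: str) -> int:
    """Assumed format: 'frequency(45)=1200000000' with the value in Hz."""
    match = re.search(r"=(\d+)", text)
    if match is None:
        raise ValueError(f"unexpected measure_clock output: {text!r}")
    return round(int(match.group(1)) / 1_000_000)


def parse_measure_temp_c(text: str) -> float:
    """Assumed format: "temp=58.0'C"."""
    match = re.search(r"temp=(-?\d+(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"unexpected measure_temp output: {text!r}")
    return float(match.group(1))


def vcgencmd(*args: str) -> str:
    """Run vcgencmd with the given arguments and return its output."""
    result = subprocess.run([VCGENCMD, *args], capture_output=True,
                            text=True, check=True)
    return result.stdout


def log_samples(passes: int = 10, seconds: float = 1.0) -> None:
    """Print elapsed seconds, both MHz measures and temperature per sample."""
    start = time.time()
    for _ in range(passes):
        with open(SCALING_FILE) as handle:
            scaling = parse_scaling_mhz(handle.read())
        arm = parse_measure_clock_mhz(vcgencmd("measure_clock", "arm"))
        temp = parse_measure_temp_c(vcgencmd("measure_temp"))
        print(f"{time.time() - start:7.1f} {scaling:5d} scaling MHz, "
              f"{arm:5d} ARM MHz, temp={temp:.1f}°C")
        time.sleep(seconds)
```

Calling log_samples(passes=5, seconds=2) on a Raspberry Pi would produce lines in the style of the later reports below. Under throttling, the ARM MHz column would be expected to fall while the scaling MHz column stays at the nominal value.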

64 Bit Temperature Measurements - Two versions were produced: RPiHeatMHz64G, using the same functions as the latest 32 bit versions, and RPiHeatMHz64S for the SUSE Operating Systems. With the latter, the vcgencmd command was not available, so scaling_cur_freq was used for CPU throttling MHz and the sensors function had to be installed to obtain CPU temperature (see log below).

 Command - ./RPiTemperature passes 5, seconds 2

 Temperature Measurement - Start at Tue Jun 18 11:57:19 2013

 Using 5 samples at 2 second intervals

 Seconds
    0.0     temp=50.8°C
    2.0     temp=50.8°C
    4.1     temp=50.8°C
    6.1     temp=51.4°C
    8.2     temp=50.8°C
   10.2     temp=50.8°C

 Temperature Measurement - End at Tue Jun 18 11:57:29 2013

 ##########################################################

 New Command - ./RPiHeatMHz passes 5, seconds 2
 Switches to 900 MHz whilst running CPU benchmark

 Temperature and CPU MHz Measurement

 Start at Sun Mar 1 07:14:19 2015

 Using 5 samples at 2 second intervals

 Boot Settings

 arm_freq=900
 hdmi_force_hotplug=1
 config_hdmi_boost=4
 overscan_left=24
 overscan_right=24
 overscan_top=16
 overscan_bottom=16
 disable_overscan=0
 core_freq=250
 sdram_freq=450
 over_voltage=0

 Seconds
    0.0    600 MHz   temp=44.4°C
    2.0    600 MHz   temp=44.4°C
    4.1    600 MHz   temp=44.4°C
    6.1    600 MHz   temp=44.4°C
    8.2    600 MHz   temp=43.9°C
   10.3    600 MHz   temp=44.4°C

 End at Sun Mar 1 07:14:30 2015

 #################### New RPiHeatMHz ####################
 #################### & RPiHeatMHz64G ###################

 From cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
      vcgencmd measure_clock arm
      vcgencmd measure_temp

 Seconds
    0.0   1200 scaling MHz,   1200 ARM MHz, temp=58.0°C
   15.0   1200 scaling MHz,   1200 ARM MHz, temp=67.1°C
   30.1   1200 scaling MHz,   1200 ARM MHz, temp=70.9°C
   45.1   1200 scaling MHz,   1200 ARM MHz, temp=73.6°C
   60.2   1200 scaling MHz,   1200 ARM MHz, temp=75.8°C
   75.3   1200 scaling MHz,   1200 ARM MHz, temp=78.4°C
   90.5   1200 scaling MHz,   1200 ARM MHz, temp=79.5°C
  105.6   1200 scaling MHz,   1160 ARM MHz, temp=80.6°C
  120.7   1200 scaling MHz,   1075 ARM MHz, temp=81.1°C
  135.8   1200 scaling MHz,   1051 ARM MHz, temp=81.7°C
  150.9   1200 scaling MHz,   1023 ARM MHz, temp=81.7°C
  166.0   1200 scaling MHz,   1020 ARM MHz, temp=82.2°C
  181.1   1200 scaling MHz,   1006 ARM MHz, temp=82.2°C

 Seconds
    0.0    600 scaling MHz,    600 ARM MHz, temp=55.8°C
    1.0   1200 scaling MHz,   1200 ARM MHz, temp=56.4°C
    2.0   1200 scaling MHz,   1200 ARM MHz, temp=56.4°C
    3.1   1200 scaling MHz,   1200 ARM MHz, temp=56.9°C
    4.1   1200 scaling MHz,   1200 ARM MHz, temp=56.9°C
    5.2    600 scaling MHz,    600 ARM MHz, temp=56.4°C

 ##################### RPiHeatMHz64S ####################

 From cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq
      sensors -A

 Seconds
    0.0   1200 scaling MHz, bcm2835_thermal-virtual-0
 Adapter: Virtual device
 temp1:        +60.7°C

   10.0   1200 scaling MHz, bcm2835_thermal-virtual-0
 Adapter: Virtual device
 temp1:        +63.4°C


Reliability Tests

Following are example results from running the modified OpenGL ES Benchmark and Livermore Loops Stability Test in reliability testing mode. The tests comprised running the OpenGL functions, then these plus the Loops program, both at normal (700 MHz) and overclocked CPU settings (CPU 1000 MHz, Core 500 MHz, SDRAM 600 MHz, over_voltage 6), measuring temperatures with RPiTemperature. The temperature recording, at 30 second intervals with 36 samples, was started first. With both test programs, the Livermore Loops were started next, at 10 seconds per test (24 x 3 x 10 = 720 seconds), but this runs for longer due to early calibration. Finally, a full screen OpenGL test was started with a 15 minute setting (approximate, adjusted to 16 tests at 57 seconds).
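The run lengths quoted above can be checked with a few lines of Python. This is only a worked example of the arithmetic. The function names are invented, and the temperature span assumes the first sample is taken at time zero, as in the recorded results.

```python
LIVERMORE_LOOPS = 24   # kernels per part
LIVERMORE_PARTS = 3    # parts per run


def livermore_nominal_seconds(seconds_per_loop: int) -> int:
    """Nominal Livermore Loops running time, excluding early calibration."""
    return LIVERMORE_LOOPS * LIVERMORE_PARTS * seconds_per_loop


def opengl_total_seconds(tests: int = 16, seconds_per_test: int = 57) -> int:
    """Total OpenGL reliability mode running time."""
    return tests * seconds_per_test


def temperature_span_seconds(samples: int = 36, interval: int = 30) -> int:
    """Time of the last temperature sample when the first is at 0 seconds."""
    return (samples - 1) * interval


print(livermore_nominal_seconds(10))   # 24 x 3 x 10
print(opengl_total_seconds() / 60)     # minutes for 16 tests of 57 seconds
print(temperature_span_seconds())      # last sample time in seconds
```

So the 36 temperature samples cover slightly more than the nominal 15 minute OpenGL run, leaving a few readings to show the CPU cooling down afterwards.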

On running just the OpenGL tests, FPS speed was virtually the same at 700 and 1000 MHz and only 7.5% slower running the Livermore Loops at the same time, at 700 MHz. As indicated earlier, OpenGL CPU utilisation was about 50%, leading to the Loops recording around half speed, when run at the same time.

Recorded temperatures for all tests are shown below, where room temperature was 23°C and the CPU was allowed to cool down between tests. At 700 MHz, adding the Loops led to a slightly faster temperature increase, but ending only about 3°C higher at 69.7°C. At 1000 MHz, and just OpenGL, maximum temperature was 69.1°C.

Repeating earlier observations, with hotter room temperature, the overclocked tests failed on running OpenGL and Livermore Loops tests at the same time. This time, the OpenGL program terminated with an “Illegal Instruction” after 75 seconds and the display froze on restarting, after a short delay. Temperatures were recorded at 15 second intervals, reaching 72.9°C.

Further reliability test programs have been produced. See Raspberry Pi Stress Tests.htm and Raspberry Pi 2 and 3 Stress Tests.htm. The latter includes some using the new OpenGL GLUT Benchmark. This required installation of a new experimental OpenGL GLUT driver to demonstrate much faster speeds. Unfortunately, this was found to cause the monitor to fail occasionally, displaying the multi-coloured square seen on booting. This occurred on Raspberry Pi 2, when overclocked, but more frequently using the Raspberry Pi 3. The solution was to install the new driver in a later version of the Raspbian Operating System. However, the stress tests showed that the Raspberry Pi 3 was getting too hot and it automatically reduced clock MHz (throttling) to lighten the load and decrease measured performance. Later, the RPi 3 was fitted into a FLIRC Case, where the whole aluminium case becomes the heatsink. Then, CPU MHz and all the stress tests ran at maximum speeds, due to temperatures being lower than the point where clock throttling occurs. See detailed reports on multitasking tests.

 ######################################################################

 Command - ./OpenGL1Pi.bin Wide 1920, High 1080, RunMinutes 15

 Raspberry Pi OpenGL ES Benchmark 1.1, Fri Jun 21 10:41:01 2013

 Reliability Mode 16 Tests of 57 Seconds

           --------- Frames Per Second --------
 Triangles             All Textured

   36000+     5.28     5.30     5.01     5.37
   36000+     5.37     5.51     5.78     5.80
   36000+     5.75     5.47     5.54     5.32
   36000+     5.29     5.30     5.42     5.91

 Screen Pixels 1920 Wide 1080 High

 End Time Fri Jun 21 10:56:17 2013

 ######################################################################

 Command - ./liverloopsPiA6 Seconds 10

 Livermore Loops Benchmark Opt 3 32 Bit via C/C++ Fri Jun 21 10:40:49 2013

 Reliability test 10 seconds each loop x 24 x 3

 Part 1 of 3 start at Fri Jun 21 10:40:49 2013
 Part 2 of 3 start at Fri Jun 21 10:48:21 2013
 Part 3 of 3 start at Fri Jun 21 10:52:31 2013

 Numeric results were as expected

 MFLOPS for 24 loops
   59.4   65.6   97.1   81.5    9.0   13.7   55.2   72.0   41.9   19.9   17.0   12.3
   10.1    6.9   26.6   34.4   55.1   19.9   44.6   18.9   11.3   13.6   30.1   14.2

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
    97.1    35.2    28.9    23.7     6.9

 ######################################################################

 Command - ./RPiTemperature Passes 36, Seconds 30
 O/C2    - ./RPiTemperature Passes 72, Seconds 15

               Normal        Overclocked
 Seconds    OGL  OG+LPs     OGL  OG+LPs

      0    50.3    49.8    50.8    51.4
     15                            59.5
     30    56.2    56.2    56.2    67.0
     45                            70.2
     60    60.0    60.5    61.1    71.8
     75                            72.9  Illegal
     90    61.1    62.7    63.8    70.2  Instruction
    115                            65.9
    120    62.1    63.8    63.8    64.3  Restart OGL
    135                            62.7  Screen Froze
    150    62.1    64.3    64.8
    180    63.2    65.4    64.8
    210    63.8    65.4    65.9
    240    64.3    65.9    65.9
    270    64.3    66.4    65.9
    300    64.3    66.4    65.9
    330    64.8    66.4    66.4
    360    64.8    67.0    66.4
    390    65.4    67.0    67.0
    420    64.8    68.1    67.0
    450    65.4    67.0    67.5
    480    64.8    68.6    67.0
    510    65.4    68.1    67.0
    540    65.9    68.1    68.1
    570    65.9    68.1    67.5
    600    65.9    68.6    68.1
    630    66.4    68.6    68.6
    660    66.4    69.1    68.1
    690    66.4    68.6    68.1
    720    66.4    69.1    68.1
    750    65.9    69.1    68.1
    780    66.4    69.1    68.1
    810    67.0    69.1    68.1
    840    66.4    69.7    68.1
    870    67.0    69.1    68.6
    900    67.0    69.1    69.1
    930    60.0    64.3    65.4
    960    56.8    59.5    59.5
    990    55.1    57.3    57.3
   1020    53.5    56.8    55.7
   1050    53.0    55.7    55.7


Livermore Loops Stability Test

A long time ago, the Livermore Loops Benchmark produced wrong numeric results on an overclocked Pentium Pro CPU. A revised benchmark included a run time option to specify the nominal running time of each loop. An example of the 5 seconds per test parameter used here is shown below. With this, the start time of each section is logged and the results of every pass are checked for correctness. Run time displays and reported performance are the same as before.

The stability test was run on the Pi at 700 MHz and overclocked 1000 MHz, at 5 seconds per test (see command format), or a total time of 6 minutes. CPU temperature was measured (see measure_temp command) at 30 second intervals. Results are provided below. Room temperature was 22.6°C. At 700 MHz temperature increased from 48.7 to 53.0°C and, higher at 1000 MHz, from 50.3 to 60.5°C.

See also Reliability Tests and Raspberry Pi 2 Stress Tests.htm, describing stress tests using multiple copies of this benchmark, along with graphics programs. These include demonstrations of failures and slow performance caused by overheating, using Raspberry Pi 2 and 3.

 Command ./liverloopsPiA6 Secs 5

 #####################################################

 Livermore Loops Benchmark Opt 3 32 Bit via C/C++ Fri May 17 15:52:01 2013

 Reliability test 5 seconds each loop x 24 x 3

 Part 1 of 3 start at Fri May 17 15:52:01 2013
 Part 2 of 3 start at Fri May 17 15:54:09 2013
 Part 3 of 3 start at Fri May 17 15:56:24 2013

 Numeric results were as expected

 #####################################################

 Temperatures Degrees C using /opt/vc/bin/vcgencmd measure_temp

       Minutes
  MHz     0   0.5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0

  700  48.7  50.8  51.4  51.9  51.9  51.9  51.9  51.9  51.9  52.5  53.0  53.0  52.5
 1000  50.3  56.8  58.4  57.8  57.8  59.5  59.5  59.5  59.5  59.5  60.5  60.5  59.5


64 Bit Reliability Tests

The two main CPU stress tests have been recompiled as stressintPi64 and burninfpuPi64 and are now included in Rpi3-64-Bit-Benchmarks.tar.gz, along with RPiHeatMHz64S and RPiHeatMHz64G, plus the other appropriate programs, videogl64 and liverloopsPi64.

The new CPU tests have already been run via OpenSUSE and SUSE SLES, as reported in SUSE RPi3 Stress Tests.htm. The purpose here is to demonstrate test results running via Gentoo; at least for the CPU stress tests, these have been confirmed as producing similar effects via OpenSUSE. In this case, except for one example, the Raspberry Pi 3 was tested outside the case, showing far more CPU MHz throttling than when enclosed in the FLIRC case - see above. Rather than using a script file to run the programs, it was found more convenient to open four or five different terminal windows, initially copying the execution commands without starting the tests. Below are results for the following tests. All include the new RPiHeatMHz program, with a test duration of around 15 minutes.

Integer Arithmetic Stress Test - comprising four runs of stressintPi64 using 40 KB of data, aimed at all using L2 cache, with 12 tests each running for 80 seconds. Performance on all cores was essentially the same, with CPU throttling starting after 30 seconds, eventually reducing CPU MHz by nearly 32%, with maximum recorded sample CPU temperature of 84.4 °C. Compared with stand alone results, CPU performance was degraded to a greater extent due to MP overheads.
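The derived columns in the integer stress test log further below (average MB/second, MB/sec per MHz and % of Max) can be reproduced with simple arithmetic. The Python sketch below uses figures copied from that log; the function names are made up for illustration.

```python
def mean_speed(speeds_mb_per_s):
    """Average MB/second over the copies of the test (one per core)."""
    return sum(speeds_mb_per_s) / len(speeds_mb_per_s)


def per_mhz(speed_mb_per_s: float, cpu_mhz: float) -> float:
    """MB/second delivered per CPU MHz, to separate throttling from other effects."""
    return speed_mb_per_s / cpu_mhz


def percent_of_max(cpu_mhz: float, max_mhz: float = 1200) -> float:
    """Throttled clock as a percentage of the full Raspberry Pi 3 clock."""
    return 100.0 * cpu_mhz / max_mhz


# First write/read sample in the log: four copies at 946 MHz
average = mean_speed([2225, 2182, 2167, 2106])
print(average, round(per_mhz(average, 946), 2))

# Lowest recorded clock in the log was 820 MHz
print(round(percent_of_max(820), 1), round(100 - percent_of_max(820), 1))
```

The last line gives the share of full speed retained and the reduction, the latter being the "nearly 32%" quoted above.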

Reminder - This has six write/read and six read only tests, with different variations of data patterns. The read phase comprises an equal number of additions and subtractions, with the data being unchanged afterwards and checked for correctness. Speed is measured in MB/second. Results are displayed at 10 second intervals. Run time parameters are provided for KBytes memory used, seconds for each of the twelve tests and log number for use in multitasking.
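One possible reading of the read phase is sketched below in Python: every word holds the same pattern, alternate words are added and subtracted with 32 bit wraparound, and the accumulator should return to zero if no word has been corrupted. This is an illustration only and not the stressintPi64 code; the 32 bit word size, the pattern value and the function names are assumptions.

```python
MASK = 0xFFFFFFFF  # 32 bit wraparound, assumed word size


def words_for_kb(kbytes: int) -> int:
    """Number of 4 byte words in the requested KBytes of test data."""
    return kbytes * 1024 // 4


def fill(words: int, pattern: int) -> list:
    """Write phase: store the same data pattern in every word."""
    return [pattern & MASK] * words


def read_check(data: list) -> int:
    """Read phase: equal numbers of additions and subtractions.

    Returns the accumulator, which is 0 when all words still match.
    """
    acc = 0
    for i in range(0, len(data) - 1, 2):
        acc = (acc + data[i]) & MASK
        acc = (acc - data[i + 1]) & MASK
    return acc


data = fill(words_for_kb(40), 0x5A5A5A5A)   # KB 40, as in the test commands
print("errors" if read_check(data) else "no errors")
```

A single flipped bit in any word leaves a non-zero accumulator, which is the sort of failure a stress test of this kind is intended to expose.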

Floating Point Arithmetic Stress Test - having four burninfpuPi64 test procedures, using L2 cache with 8 operations per data word. Again performance was effectively constant from all cores, with maximum total throughput of 13.7 GFLOPS, reducing by nearly 4 GFLOPS due to CPU throttling down to 843 MHz, again with a maximum temperature of 84.4 °C.

Reminder - This is based on MP-MFLOPS but to use a single CPU core, with the code rearranged to obtain a higher throughput. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. The same variables are used for each word and final results are checked for consistency, any errors being reported. The benchmark has input parameters for KWords, Section 1, 2 or 3 (for 2, 8 or 32 operations per word) and log number (0 to 99).
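The expression above contains three additions inside the brackets, three multiplications, one subtraction and one further addition, which gives the 8 operations per word case. The Python sketch below applies it to an array of identical words, checks the results for consistency and converts an operation count into MFLOPS. The constants and function names are hypothetical, not those used by burninfpuPi64.

```python
OPS_PER_WORD = 8  # 3 adds in brackets + 3 multiplies + 1 subtract + 1 add


def burn_in_pass(x, a, b, c, d, e, f):
    """One pass of x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f."""
    return [(v + a) * b - (v + c) * d + (v + e) * f for v in x]


def all_words_equal(x) -> bool:
    """Consistency check: identical inputs must give identical results."""
    first = x[0]
    return all(v == first for v in x)


def mflops(words: int, passes: int, seconds: float) -> float:
    """Millions of floating point operations per second for a timed run."""
    return OPS_PER_WORD * words * passes / seconds / 1_000_000


# Hypothetical constants, chosen only so that values stay bounded
constants = (0.1, 0.5, 0.2, 0.25, 0.3, 0.5)

x = [0.5] * 10_000          # Kwds 10, every word starting with the same value
for _ in range(100):
    x = burn_in_pass(x, *constants)

print("consistent" if all_words_equal(x) else "ERROR - results differ")
```

Since every word starts equal and receives the same operations, any difference between results would indicate a hardware fault, which is what makes the formula usable as a burn-in check.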

Livermore Loops Stress Test - This uses four copies of the Livermore Loops Benchmark, running in the stress testing mode, with each of the 72 test kernels set to run for 12 seconds. Summary performance MFLOPS speeds are measured over all the tests but short term variations can be seen in RPiHeatMHz64 results, including inconsistent effects of different arithmetic functions. Maximum temperature recorded was 84.9 °C with a CPU MHz of 744.

Integer and OpenGL Stress Tests - The most complicated OpenGL kitchen test was run for 15 minutes, along with three Integer Stress Tests, the former being at a window size of 1024 x 768 pixels, to allow space to display the other tests' ongoing results. The tests were run using on demand and performance MHz settings with the system board out of its case. Performance was then quite similar at both settings, with OpenGL FPS and integer MB/second reduced to around 60% of initial speeds. A maximum of 85.4 °C was recorded, with many more samples at 84.9 °C, and CPU MHz temporarily dropped to half speed at 600 MHz (at both on demand and performance settings). The tests were repeated with the system in the FLIRC case, producing consistent performance, whilst nearly reaching the temperature where CPU throttling would occur.

###########################################################

Integer Stress Test RPi 64

Commands from different terminals

./RPiHeatMHz64G passes 100 seconds 10
./stressintPi64 KB 40 Secs 80 Log 11
./stressintPi64 KB 40 Secs 80 Log 12
./stressintPi64 KB 40 Secs 80 Log 13
./stressintPi64 KB 40 Secs 80 Log 14

             Log MB/Second                       MB/sec
 Secs    11    12    13    14    Av   MHz    °C    /MHz

    0                                1200  54.8
   10                                1199  63.4
   20                                1199  74.1
   30                                1200  79.5
   40                                1074  81.7
   50                                1019  82.2
   60                                 993  82.7
   70                                 972  82.7
Wr/Rd
   80  2225  2182  2167  2106  2170   946  82.7    2.29
  160  1800  1806  1786  1791  1796   881  84.4    2.04
  240  1704  1709  1704  1686  1701   828  84.4    2.05
  320  1682  1684  1658  1676  1675   826  83.8    2.03
  400  1672  1671  1655  1662  1665   826  84.4    2.02
  480  1665  1666  1666  1656  1663   820  84.4    2.03
Read
  560  1799  1831  1802  1816  1812   843  83.8    2.15
  640  1761  1763  1748  1740  1753   852  83.8    2.06
  720  1697  1698  1701  1694  1698   823  83.8    2.06
  800  1688  1703  1683  1683  1689   830  83.8    2.04
  880  1702  1705  1677  1690  1694   836  84.4    2.03
  960  1695  1706  1720  1770  1723   824  84.4    2.09
 1040                                1197  60.1

Stand alone Write/Read         2530  1200          2.11
Stand alone Read               2870  1200          2.39
Min or Max                            820  84.4
% of Max                             68.3

###########################################################

Burn-In-FPU RPi 64 Using 40 KBytes, 8 Operations Per Word

Commands from different terminals

./RPiHeatMHz64G passes 100 seconds 10
./burninfpuPi64 Kwds 10 Sect 2 Mins 15 Log 51
./burninfpuPi64 Kwds 10 Sect 2 Mins 15 Log 52
./burninfpuPi64 Kwds 10 Sect 2 Mins 15 Log 53
./burninfpuPi64 Kwds 10 Sect 2 Mins 15 Log 54

               Log MFLOPS                        MFLOPS
 Secs    51    52    53    54    Av   MHz    °C    /MHz

    0                                1200  54.8
   30  3416  3397  3462  3397  3418  1200  76.3    2.85
   60  3154  3142  3042  3025  3091  1072  81.7    2.88
   90  2917  2878  2862  2827  2871   982  82.7    2.92
  120  2816  2765  2737  2753  2768   954  83.8    2.90
  150  2615  2663  2605  2633  2629   934  83.8    2.81
  180  2630  2583  2583  2616  2603   905  82.7    2.88
  210  2613  2585  2607  2524  2582   901  83.8    2.87
  240  2548  2566  2590  2506  2553   885  83.8    2.88
  270  2586  2541  2508  2528  2541   874  83.8    2.91
  300  2518  2496  2515  2517  2512   886  83.8    2.83
  330  2545  2510  2515  2494  2516   867  83.8    2.90
  360  2523  2490  2520  2513  2512   877  83.8    2.86
  390  2444  2461  2512  2537  2489   869  83.8    2.86
  420  2449  2482  2484  2535  2488   872  83.8    2.85
  450  2526  2499  2470  2479  2494   885  83.8    2.82
  480  2405  2378  2477  2360  2405   865  83.8    2.78
  510  2474  2458  2510  2470  2478   884  83.8    2.80
  540  2475  2435  2484  2484  2470   887  83.8    2.78
  570  2506  2437  2434  2501  2470   877  83.8    2.82
  600  2465  2494  2449  2479  2472   870  83.8    2.84
  630  2487  2447  2500  2455  2472   877  83.8    2.82
  660  2496  2479  2434  2442  2463   859  83.8    2.87
  690  2480  2448  2435  2499  2466   882  84.4    2.80
  720  2472  2471  2494  2412  2462   863  83.8    2.85
  750  2462  2461  2516  2429  2467   871  83.8    2.83
  780  2472  2467  2464  2427  2458   859  84.4    2.86
  810  2472  2483  2471  2430  2464   845  83.8    2.92
  840  2479  2444  2456  2494  2468   850  83.8    2.90
  870  2471  2483  2469  2445  2467   860  83.8    2.87
  900  2427  2490  2489  2450  2464   861  84.4    2.86
  930  2460  2491  2434  2475  2465   850  83.8    2.90
  960  2477  2494  2429  2486  2472   862  83.8    2.87
  990  2490  2458  2455  2450  2463   843  83.8    2.92
 1020  2433  2474  2458  2468  2458   888  84.4    2.77
 1050  2522  2500  2526  2540  2522   864  83.8    2.92

Stand Alone                    3570  1200          2.98
Min or Max                            843  84.4
% of Max                             70.3

###########################################################

Livermore Loops Stress Test 64 Bit 24x3x12 = 864 seconds

Commands from different terminals

./RPiHeatMHz64G passes 100 seconds 10
./liverloopsPi64 Seconds 12
./liverloopsPi64 Seconds 12
./liverloopsPi64 Seconds 12
./liverloopsPi64 Seconds 12

Overall Ratings

  Maximum  Average  Geomean  Harmean  Minimum

    570.2    218.4    193.9    171.1     69.4
    591.3    221.2    196.3    173.0     72.4
    576.4    218.1    194.1    171.6     72.4
    572.5    219.9    194.8    171.2     67.6

One Program - 72 seconds

    623.9    275.9    246.1    217.8     98.6

83 measurements of 10+ seconds

   MHz    °C       MHz    °C       MHz    °C

  1200  51.5      1011  82.7       931  83.3
  1200  59.1      1019  82.7       828  83.8
  1200  67.7       945  82.7       928  82.7
  1200  72.0      1025  82.2       843  84.9
  1199  79.5       913  83.8       744  84.9
   967  82.7       842  83.8       800  84.4
  1043  82.2       840  84.4       916  83.3
  1195  80.6       891  82.7       896  83.8
  1193  80.6       829  83.8       918  83.8
  1117  81.7       818  83.8       951  82.2
  1138  81.7       873  83.3      1024  81.7
  1173  80.6       996  82.7      1023  82.2
  1200  80.6       963  82.7       924  82.7
  1148  81.7       994  82.7       892  83.8
  1059  81.7      1001  83.3       883  83.8
  1031  82.2       971  82.7       895  83.8
  1002  82.7       947  83.8      1006  82.7
   926  83.3       883  83.8       940  82.7
  1022  82.7       976  82.7       931  82.7
   973  82.7       936  83.3       949  83.3
  1079  82.2       993  82.2       933  83.3
  1028  82.7       943  83.3       995  82.7
   998  82.7       941  83.8      1071  81.7
  1019  82.2       927  83.3       941  83.3
   995  82.7      1043  81.7      1145  81.1
  1097  82.2      1058  82.2      1008  82.7
   965  82.7       980  82.7      1193  75.2
  1122  82.2      1016  82.7

   Min   Max       Min   Max       Min   Max
   926  83.3       818  84.4       744  84.9

% of Max  62.0

###################################################################################

Integer and OpenGL Stress Tests

Commands from different terminals

1 ./RPiHeatMHz64G passes 100 seconds 10
2 export vblank_mode=0
  ./videogl64 Wide 1024, High 768, Minutes 15 Test 6
3 ./stressintPi64 KB 8 Secs 80 Log 31
4 ./stressintPi64 KB 8 Secs 80 Log 32
5 ./stressintPi64 KB 8 Secs 80 Log 33

       On demand out of case    Performance out of case  Performance FLIRC case
        Total                    Total                    Total
 Secs    MB/s  FPS   MHz    °C    MB/s  FPS   MHz    °C    MB/s  FPS   MHz    °C

    0               1200  52.1               1200  55.8               1200  44.0
   30           13  1200  78.4           13  1107  80.6           13  1200  60.1
   60           11   953  82.7           11   910  82.7           13  1199  63.4
   80    6206   10   890  82.7    6064    9   850  83.8    7116   13  1200  65.0
  160    4861    9   804  83.8    4656    9   744  84.9    7041   13  1199  68.8
  240    4414    8   743  84.9    4305    8   600  82.7    7072   13  1200  70.9
  320    4211    8   706  84.9    4217    8   600  82.7    7075   13  1200  72.0
  400    4163    8   716  84.9    4209    8   738  84.9    7095   13  1200  74.1
  480    4132    8   600  82.7    4209    8   600  82.7    7081   13  1200  75.8
  560    4858    8   737  85.4    4802    8   738  84.9    8067   13  1200  74.7
  640    4720    8   721  84.9    4768    8   722  84.9    8092   13  1200  76.8
  720    4662    8   899  83.3    4730    8   743  84.9    7989   13  1200  77.4
  800    4624    8   764  84.9    4664    8   823  84.9    8050   13  1200  78.4
  880    4638    8   875  83.8    4712    8   719  84.9    7984   13  1200  79.5
  960    5874    8  1015  82.7    5917    8   938  82.7    8344   13  1200  74.1

Performance Monitor

JavaDraw - the following shows JavaDraw benchmark speeds, at 10 seconds per test, and simultaneous vmstat performance monitor CPU utilisation, with 5 second samples. The normal tests were run, then run again with an affinity setting to use 1 CPU core.

When running a single core CPU benchmark, %user time is recorded as 25%, that is one of the four cores fully used, with a little system overhead. For some reason, JavaDraw seemed to use more than one core for the last two tests, even with the affinity setting. Overall, the details show that Raspberry Pi 2 can use more than one CPU core to improve performance on drawing with a Java program.
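Since vmstat reports utilisation as a percentage of all cores, the figures in the log below can be converted into an approximate number of busy cores. A small sketch, assuming a four core Pi 2 or Pi 3 (the function name is made up for illustration):

```python
def cores_busy(pct_user, pct_sys, cores=4):
    """Convert vmstat %usr and %sys (of all cores) into busy cores."""
    return (pct_user + pct_sys) * cores / 100.0


# First vmstat sample of Bitmap Twice Pass 1 in the Raspberry Pi 2 log
print(cores_busy(43, 9))   # normal run
print(cores_busy(22, 4))   # affinity set to 1 CPU
```

On this basis the normal run keeps roughly two cores busy, and the run with affinity set keeps about one.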

Raspberry Pi 3 results are also shown. For full speed, the GLUT driver, required to run the new OpenGL GLUT benchmark, has to be disabled. Again two cores are used for best performance.

                                  Normal          Affinity 1 CPU
                             FPS  %usr  %sys     FPS  %usr  %sys

Bitmap Twice Pass 1         45.0    43     9    22.1    22     4
                                    43     8            23     6
Bitmap Twice Pass 2         56.8    42     9    35.8    23     6
                                    41    10            24     6
Plus 2 Circles              57.8    41     9    35.4    24     6
                                    44     8            25     5
Plus 200 Rand Circles       54.9    43     8    36.0    25     5
                                    43     7            23     7
Plus 320 Long Lines         38.3    42     8    33.4    33     5
                                    42     9            32     6
Plus 4000 Rand Circles      25.1    48     9    22.4    38     5
                                    48     9

vmstat command for 20 5 second samples - vmstat 5 20 > vmstatlog1.txt
benchmark commands - java JavaDrawPC and
                     taskset 0x00000001 java JavaDrawPC

########################### Raspberry Pi 3 ###########################

Bitmap Twice Pass 1         76.7    42    10    47.0    26     7
                                    43    10            26     9
Bitmap Twice Pass 2         97.5    40    13    67.1    26    10
                                    42    11            25    10
Plus 2 Circles              97.4    40    11    56.9    23     6
                                    40    10            25     8
Plus 200 Rand Circles       90.8    40    10    58.6    26     8
                                    40    10            27     8
Plus 320 Long Lines         62.4    39    10    60.0    30     8
                                    40    10            32     7
Plus 4000 Rand Circles      42.8    46    11    36.4    24     8
                                    44    10            24     6

Assembly Code

Linpack benchmark performance is completely dependent on the daxpy function with a linked triad dy[i] = dy[i] + da * dx[i], with an unrolled loop containing four linked add and multiply statements. Compilers can produce a range of instruction combinations, to cover a number of different accesses to the function. The following seem to be the most likely frequent instructions executed. The linpackPiA7SP compilation has the same instructions as linpackPiA7, except using 32 bit registers, for example vfma.f32 s14, s0, s13, maybe executing at the same speed as the 64 bit vfma instruction.
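For orientation, the unrolled linked triad has roughly the following shape. This is a Python sketch of the loop structure only, assuming unit strides and a clean-up loop for lengths not divisible by four; it is not the benchmark's C source.

```python
def daxpy_unrolled(n, da, dx, dy):
    """dy = dy + da * dx over n elements, unrolled four times."""
    m = n % 4
    for i in range(m):                      # clean-up for the odd elements
        dy[i] = dy[i] + da * dx[i]
    for i in range(m, n, 4):                # four linked multiply and adds
        dy[i] = dy[i] + da * dx[i]
        dy[i + 1] = dy[i + 1] + da * dx[i + 1]
        dy[i + 2] = dy[i + 2] + da * dx[i + 2]
        dy[i + 3] = dy[i + 3] + da * dx[i + 3]
    return dy


print(daxpy_unrolled(6, 2.0, [1.0] * 6, [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]))
```

Each of the four statements in the main loop corresponds to one load pair, one multiply and accumulate and one store in the assembly listings that follow.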

Instruction fmacd is double precision multiply-accumulate and vfma is fused floating-point multiply accumulate, where the result of the multiply is not rounded before the accumulation, and might be the reason for different numeric answers. If true to form, FMA can produce a maximum of two results per CPU clock cycle, doubling performance.
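The effect of the missing intermediate rounding can be illustrated without ARM hardware. The sketch below emulates a fused multiply add with exact rational arithmetic, rounding once at the end, and compares it with an ordinary multiply followed by an add. The operand values are contrived for the demonstration and the function names are made up.

```python
from fractions import Fraction


def multiply_add(a, b, c):
    """Unfused: the product is rounded to double before the addition."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Fused (emulated): exact product and sum, rounded once at the end."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -27            # a * a needs more than 53 bits
c = -(1.0 + 2.0 ** -26)
print(multiply_add(a, a, c))
print(fused_multiply_add(a, a, c))
```

The unfused version loses the lowest bit of the product and returns zero, whereas the fused version keeps it, so the two disagree in the last place. Differences of this kind are a plausible source of different numeric answers between builds.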

Next are details of assembly code for BusSpeed reading all data, where RAM speed from the original PiA6 benchmark is at half the expected speed, and slower than reading every other word. The benchmark test loop has 64 AND statements, read sequentially. The only difference appears to be that gcc 4.8, for PiA7, produces negative indexing.
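A rough picture of the read test, sketched in Python rather than the benchmark's C. The benchmark unrolls 64 AND statements per pass of the loop; here a plain loop stands in for them, and the function name, stride parameter and data pattern are assumptions for illustration.

```python
def and_words(data, stride=1):
    """AND together every stride-th word of data, starting at word 0."""
    result = 0xFFFFFFFF
    for i in range(0, len(data), stride):
        result &= data[i]
    return result


pattern = 0x0F0F0F0F                       # hypothetical data pattern
data = [pattern] * 256
print(hex(and_words(data)))                # reading all data
print(hex(and_words(data, stride=2)))      # reading every other word
```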

LinpackPiA6                     LinpackPiA7

gcc 4.6 armv6 vfp               gcc 4.8 cortex-a7 vfpv4

.L185:                          .L208:
  fldd  d6, [r1, #-24]            fldd  d16, [r1, #-24]
  fldd  d7, [r3, #-24]            fldd  d19, [r3, #-24]
  fldd  d5, [r3, #-16]            fldd  d18, [r3, #-16]
  fldd  d4, [r3, #-8]             vfma.f64 d19, d0, d16
  fmacd d7, d0, d6                mov   r4, r1
  mov   r4, r1                    fldd  d17, [r3, #-8]
  fldd  d3, [r3, #0]              fldd  d16, [r3]
  add   r2, r2, #4                add   r2, r2, #4
  cmp   r0, r2                    add   r1, r1, #32
  fstd  d7, [r3, #-24]            cmp   r0, r2
  fldd  d7, [r1, #-16]            fstd  d19, [r3, #-24]
  fmacd d5, d0, d7                fldd  d19, [r1, #-48]
  fstd  d5, [r3, #-16]            vfma.f64 d18, d0, d19
  fldd  d7, [r1, #-8]             fstd  d18, [r3, #-16]
  add   r1, r1, #32               fldd  d18, [r1, #-40]
  fmacd d4, d0, d7                vfma.f64 d17, d0, d18
  fstd  d4, [r3, #-8]             fstd  d17, [r3, #-8]
  fldd  d7, [r4, #0]              fldd  d17, [r4]
  fmacd d3, d0, d7                vfma.f64 d16, d0, d17
  fmrrd r4, r5, d3                fmrrd r4, r5, d16
  strd  r4, [r3], #32             strd  r4, [r3], #32
  bgt   .L185                     bgt   .L208

busspeedPiA6                    busspeedPiA7

.L19:                           .L17:
  ldmia r3, {r0, ip}              ldr   r0, [r3]
  ldr   r1, [r3, #8]              add   r2, r2, #64
  ldr   r5, [r3, #248]            ldr   ip, [r3, #4]
  and   ip, ip, r0                add   r3, r3, #256
  ldr   r0, [r3, #12]             ldr   r1, [r3, #-248]
  and   ip, ip, r1                and   ip, ip, r0
  ldr   r1, [r3, #16]             ldr   r0, [r3, #-244]
  and   ip, ip, r0                and   ip, ip, r1
  ldr   r0, [r3, #20]             ldr   r1, [r3, #-240]
  and   ip, ip, r1                and   ip, ip, r0

  To

  ldr   r0, [r3, #244]            ldr   r1, [r3, #-16]
  and   ip, ip, r1                and   ip, ip, r0
  and   ip, ip, r0                ldr   r0, [r3, #-12]
  ldr   r4, [r7]                  and   ip, ip, r1
  ldr   r0, [r3, #252]            ldr   r4, [r3, #-8]
  add   r2, r2, #64               ldr   r5, [r7]
  and   ip, ip, r5                and   ip, ip, r0
  and   r1, ip, r0                ldr   r0, [r3, #-4]
  cmp   r4, r2                    and   ip, ip, r4
  and   r6, r1, r6                cmp   r5, r2
  add   r3, r3, #256              and   r1, ip, r0
  bgt   .L19                      and   r6, r1, r6
                                  bgt   .L17

Both mainly: ldr r1, and r0, ldr r0, and r1

memSpdPi64 Double Precision Multiply and Add

gcc 4.8                         gcc 6

.L26:                           .L30:
  ld1   {v4.2d}, [x2]             ldr   q4, [x2, 16]
  ld1   {v1.2d}, [x1]             add   w1, w1, 1
  add   x3, x2, 16                ldr   q1, [x0, 16]
  ld1   {v0.2d}, [x0]             cmp   w1, w23
  ld1   {v5.2d}, [x3]             ldr   q3, [x2], 32
  fmla  v1.2d, v4.2d, v3.2d       add   x0, x0, 32
  st1   {v1.2d}, [x1]             ldr   q0, [x0, -32]
  fmla  v0.2d, v5.2d, v3.2d       fmla  v1.2d, v4.2d, v2.2d
  add   x1, x1, 32                fmla  v0.2d, v3.2d, v2.2d
  st1   {v0.2d}, [x0]             str   q1, [x0, -16]
  cmp   x1, x4                    str   q0, [x0, -32]
  add   x2, x2, 32                bcc   .L30
  add   x0, x0, 32
  bne   .L26

32 Bit memspeedPiA7 uses scalar fldd and fmacd instructions

Roy Longbottom May 2017

The Official Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection
