Below is a list of links to a large number of my reports, under the following headings, along with separate summaries of each.
The Collection comprises a large number of free benchmarks and stress testing programs, with no advertisements, that run under Windows, Linux and Android, using PCs, Raspberry Pi or other single board computers, phones and tablets. Besides details of the programs and extensive results, the following reports also provide links to compressed files containing the source codes and execution files. Except for Android apps, no formal installation process is needed; simply extract from the compressed file and run by a command or click.
The programs are aimed at identifying best and worst performance characteristics, not a single overall rating. They are mainly calibrated to run for noticeable times, with results displayed on an ongoing basis and saved in text log files.
The Historic Data Reports included provide performance ratings of computers released from 1954 to more modern times, most including cost besides the year of manufacture. They are based on benchmark results and information collected by myself and my colleagues, engineers working for the UK Government’s Central Computer Agency, formed in 1957.
This New Home Page provides links to more than 40 reports, mainly in both HTM and PDF format, along with brief summaries. With current browsers, automatic word wrap makes all the text visible in a PC window or on a mobile phone or tablet screen. HTM reports can also be manually stretched and moved from side to side. The identified reports contain wide tables of numeric data that are not exactly mobile phone friendly, but these can also be stretched and moved sideways. PDF files might need to be downloaded to view the detail.
The PDF files are provided from ResearchGate and the archived HTM files from roylongbottom.org.uk at the Wayback Machine.
About Roy Longbottom
Celebrating 50 years of computer benchmarking and stress testing 1972 to 2022 - From 1972 to 2022 I produced and ran computer benchmarking and stress testing programs. The Whetstone Benchmark, for which I became the design authority, also covered exactly the same time span.
Stress Tests - I wrote a series of programs to use during acceptance trials of computers purchased by the UK Government. From 1972, these were used on many hundreds of acceptance trials up to 1990.
I personally supervised trials of the first range of supercomputers, including CDC 7600 and Cray 1, where, out of five such systems, my programs led to three failed first trials.
Then, over the years, I produced stress tests to run via Windows, Linux and other Operating Systems, covering PCs, Android based devices and Raspberry Pi systems.
In 2019 (aged 84), I was recruited as a voluntary member of the Raspberry Pi pre-release Alpha testing team, my 2022 contribution being for the Raspberry Pi Pico W.
Whetstone Benchmarks - This was produced by my colleague Harold Curnow, who later passed responsibility over to me. In 1972, I included it in the acceptance trials suite of programs. I introduced timing and output format changes, aimed at verifying final numeric calculations and identifying unexpected performance attributes.
I produced a vector processing version for supercomputers, then new varieties for the same range of technology quoted for stress tests.
Original Main Page (see About Roy) - This page was overcomplex for viewing on a mobile phone but might continue to be available via Wayback Machine archive, where it seems that compressed files, containing benchmarks, source codes and most reports, are accessible to download.
Historic Data Summary
Computer Speeds From Instruction Mixes pre-1960 to 1971 - 190 Gibson and ADP instruction mix results from 18 manufacturers
Headings - Manufacturer, Model, Word Size bits, Memory Max, Memory Cycle Time, Gibson Mix KIPS, ADP Mix KIPS, Intro Year
Computer Speed Claims 1980 to 1996 - For more than 2000 mainframes, minicomputers, supercomputers and workstations, from around 120 suppliers
Headings - No. of CPUs, OS/CPU chip, MHz, MIPS, MAX MFLOPS, Type, Year, Cost GBP
PC CPUID 1994 to 2013, plus Measured Maximum Speeds Via Assembler Code -
Sections - Features Codes, Model Codes, more than 80 sets of results from 80486 to Core i7 and Phenom
Headings - Model, MHz, MIPS and MFLOPS using 1, 2, 3 and 4 registers, 32 bit and 64 bit. Operations normal, MMX, SSE, SSE2, AVX, 3DNow, SP, DP, 1, 2, 4, and 8 threads
PC CPU Specifications 1994 to 2014, plus Measured MIPS and MFLOPS per MHz -
Intel and AMD CPU Characteristics - 28 pages, Model, CPUs, Cores, MHz from to, KB L1 L2 L3 caches, HT and RAM MHz, CPUID
Measured MIPS and MFLOPS per MHz - 80486 to Core i7 and Phenom, 8 pages derived from benchmarks CPUID, BusSpeed, RandMem, Classics (Whetstone, Dhrystone, Linpack, Livermore Loops), SSE3DNow, FFTGGraf, some covering CPU, caches and RAM.
Whetstone Benchmark History and Results 1973 to 2014 - In The Beginning, Whetting The Stone, Rolling The Stone, Throwing The Stone, Compiler Optimisation, Table Headings, Detailed Results (more than 500 from 53 manufacturers)
Headings - System, CPU, MHz, MWIPS, MFLOPS, VAX MIPS, DP MWIPS, Language, Opt, Cost $K, Intro Date
Plus PCs - 75 results, MWIPS from 22 CPUs using 12 different interpreters and compilers, MP MWIPS 1 to 8 cores on 5 systems, %MWIPS/MHz efficiency (between 0.03 and 311)
Cray 1 Supercomputer Performance Comparisons With Home Computers Phones and Tablets - Based on Livermore Loops, the benchmark used to verify performance of the first Cray 1, supported by similar vintage scalar and vector Whetstone and Linpack 100 benchmarks, then later MP-MFLOPS and other MP benchmarks
Topics - My background and benchmarks, Main tests on Cray 1, Raspberry Pi 1 to 4 and 400, Android phones and tablets, Windows and Linux based PCs, SIMD considerations, other supercomputers, Performance Summary Cray 1, PC AVX 512, Android phone, Raspberry Pi 400, Error reports.
Classic Benchmarks Summary
These have an initial calibration to run individual test functions for a noticeable finite time, with results displayed as the programs progress.
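The calibration step can be pictured with a minimal sketch. It is written in Python for brevity, whereas the benchmarks themselves are compiled programs, and it is not taken from their source code. The names calibrate, run_passes, target_seconds and min_seconds are hypothetical, as is the approach of doubling the pass count until the trial is long enough to time.

```python
import time

def calibrate(run_passes, target_seconds=1.0, min_seconds=0.05,
              timer=time.perf_counter):
    """Return a pass count expected to run for about target_seconds.

    run_passes(n) must execute the test function n times.
    All names and thresholds here are illustrative assumptions.
    """
    passes = 1
    while True:
        start = timer()
        run_passes(passes)
        elapsed = timer() - start
        if elapsed >= min_seconds:
            break
        passes *= 2  # too quick to time reliably, so double the work
    # Scale the trial pass count to the requested running time.
    return max(1, round(passes * target_seconds / elapsed))
```

Passing the timer as a parameter keeps the sketch testable with a simulated clock; a real run would simply call calibrate with a function that executes the test loop.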
The benchmarks measure performance of single CPUs, which tends to be proportional to MHz, particularly at a given level of technology. For PCs, they cover processors from 80386 to Core i7. Following the latter, CPU MHz has not increased sufficiently to pursue further results.
But some are included in the Historic Data section Cray 1 report, for a 2021 Intel 11th Generation CPU that has advanced vector processing type functions. For PCs, first versions included compilations with and without optimisation, with some from other compilers.
The bulk of PC results are from DOS, OS/2, Windows and Linux varieties. Limited ones are provided for Android and Raspberry Pi devices, where many more up to date performance details are covered in other reports.
From 1972 Whetstone - 8 test functions with measurements in floating point MFLOPS, integer MOPS, scientific function MOPS and overall rating in MWIPS. Initial target minicomputers and mainframes.
Besides from C/C++ compilations, results are included from Fortran, Java, Basic and Visual Basic versions.
There are 21 pages covering around 670 sets of results, each with 10 entries, over 17 categories (including SP, DP, 1 core, MP, Opt, No Opt, 16 bit, 32 bit, 64 bit, different Operating Systems, different programming languages and compilers, different manufacturers).
Largest group is for original C compilations, for 76 1991 to 2017 vintage CPUs.
From 1984 Dhrystone - overall score in VAX MIPS, AKA DMIPS. Initial target UNIX based systems. The benchmarks reported here are all compiled by C/C++ for single processor operation.
There are 149 sets of results containing between 1 and 4 DMIPS ratings, covering the same range of appropriate categories and vintage as those for the Whetstone Benchmarks.
For PCs, there are 75 results, each containing DMIPS for versions Dhrystone 1 and 2, produced by optimised and non-optimised compilations.
From 1979 Linpack 100 - performance measured in MFLOPS. Initial target scientific workstations. There are 188 sets of results of similar mixture to above.
The largest batch is for the original double precision Linpack 100 benchmark, running on PCs and comprising 80 optimised and 80 non-optimised MFLOPS measurements.
From 1970 Livermore Loops - latest version has 24 loops run three times with different memory demands, each measured in MFLOPS, with overall averages, minimum and maximum. Initial target supercomputers.
The main performance ratings are three variations of average MFLOPS, with minimum and maximum, for each of 203 results. In turn, the MFLOPS scores of the 24 selected loops are provided for most.
There are 59 sets of ratings for the 1991 to 2017 vintage CPUs, for both optimised and non-optimised benchmark compilations.
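As an illustration of how a Livermore Loops style summary can be derived from the per-loop scores, the following Python sketch computes a maximum, minimum and three kinds of average. It assumes that the three variations of average are the arithmetic, geometric and harmonic means; the function name summarise_mflops and the sample figures in the usage note are hypothetical.

```python
from statistics import fmean, geometric_mean, harmonic_mean

def summarise_mflops(loop_mflops):
    """Summarise per-loop MFLOPS scores (assumed choice of averages)."""
    return {
        "maximum": max(loop_mflops),
        "average": fmean(loop_mflops),              # arithmetic mean
        "geometric": geometric_mean(loop_mflops),
        "harmonic": harmonic_mean(loop_mflops),
        "minimum": min(loop_mflops),
    }
```

For example, summarise_mflops([100.0, 200.0, 400.0]) gives an arithmetic mean above the geometric mean, which in turn is above the harmonic mean, so a few very fast loops raise the first far more than the last.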
Memory Benchmarks Summary
Windows and Linux CPU, Cache and RAM PC Benchmarks - For all of the Windows memory benchmarks, results are provided covering more than 20 years from 80386 or 80486 CPUs to Core i7 and AMD equivalents, with separate tables providing sample performance measurements, mainly in MBytes per second, for RAM and each variety of cache.
Examples of full output are shown for all benchmarks. For some, calculations are carried out by assembly code, others by C/C++ compilations. There are also both 32 bit and 64 bit varieties.
Linux results cover the same areas for three of the later processors, concentrating on providing comparisons between 32 bit and 64 bit working, with some including the use of more advanced SIMD operation.
MemSpeed - carries out three different sets of single and double precision floating point and integer calculations via two data arrays. Two versions are available, the first one, originally to run under DOS, based on assembly code.
BusSpeed - The benchmark is intended to demonstrate maximum data transfer speeds from buses and caches. On the latest PCs, use of multiple cores appears to be required, to achieve this goal.
The program starts by reading one word, with a large address increment for the next one, the increment being reduced by a half for following measurements, until all data is read. This identifies where data is read in bursts and provides a means of estimating bus and maximum RAM (or cache) speed.
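The halving address increment can be sketched as follows. This is a simplified Python illustration and not the benchmark code: stride_schedule and read_with_stride are hypothetical names, and summing the words read merely stands in for whatever operation the real program applies.

```python
def stride_schedule(start_increment):
    """Yield address increments, halving each time until all data is read."""
    inc = start_increment
    while inc >= 1:
        yield inc
        inc //= 2

def read_with_stride(data, inc):
    """Read one word every inc words; return (words read, checksum)."""
    count = 0
    total = 0
    for i in range(0, len(data), inc):
        total += data[i]  # illustrative use of the word that was read
        count += 1
    return count, total
```

With a starting increment of 32 words, the schedule is 32, 16, 8, 4, 2, 1. One plausible reading of the results is that, while the increment exceeds the burst length, each word read costs a whole burst, so the measured speed in MBytes per second roughly doubles with each halving once the increment falls below that length.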
RandMem - Serial and random address selections are employed by this benchmark, using the same complex integer based indexing, with read and read/write tests for 32 bit integers and 64 bit floating point numbers.
The main purpose is to show the difference between serial and random data transfer speed, where that for the latter is considerably reduced by burst reading or writing, in turn affected by data size.
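The idea of using the same indexing code for serial and random access can be sketched in Python as below. The names build_index and read_via_index are hypothetical and the benchmark's own indexing structure is more complex; the point is only that the read loop is identical, with the order of the index array deciding whether access is serial or random.

```python
import random

def build_index(words, randomise, seed=1):
    """Return an index array in serial order, or shuffled for random access."""
    index = list(range(words))
    if randomise:
        random.Random(seed).shuffle(index)
    return index

def read_via_index(data, index):
    """Read every word through the index array; the same code serves both tests."""
    total = 0
    for j in index:
        total += data[j]
    return total
```

Both orders touch every word once, so the checksums agree and only the timing would differ. Interpreted Python will not reproduce the cache and burst effects that the compiled benchmark measures.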
SSEfpu - This carries out floating point calculations, similar to MemSpeed, to compare data transfer speeds, and associated MFLOPS, between two at a time SSE2 double precision, four at a time SSE and single word calculations.
FFT Benchmarks - Three versions were produced, the first being the original C code, the second with further optimised assembly language and the third using SSE SIMD instructions.
The benchmarks run code for single and double precision Fast Fourier Transforms of size 1024 to 1048576 (1K to 1024K), each one being run a number of times to identify variance, with results in milliseconds.
MultiThreading Benchmarks Summary
Windows and Linux MultiThreading Benchmarks - These benchmarks execute the same code as the original, designed to exercise a single CPU, but implementing multithreading to use up to all available cores. For most, multithreading levels are controlled by the program, with others using OpenMP and QPAR to automatically generate parallelism.
This report concentrates on showing variations in performance of a quad core, 8 thread CPU, with links to other reports covering many different processors. With up to 8 columns of results, details are provided for each thread count between 1 and 8, compiled for 32 bit and 64 bit working, via Windows and Linux.
Whetstone MP Benchmark - is mainly dependent on floating point speed but with some independently timed integer test functions.
Each thread executes shared code using mainly L1 cache based independent variables, leading to performance being proportional to the number of cores, or higher with hyperthreading.
Assembly Code Arithmetic - This executes integer and SSE floating point add instructions via independent threads.
BusSpeed MP Benchmark - provides read only access to data in caches and RAM. It is intended to demonstrate bus operation and speed where data is transferred in bursts and maximum data transfer speed.
In the original Windows version, each thread read all the data, starting at the same point. This had to be modified for Linux, due to excessive impact of caching.
RandMem MP Benchmark - The program uses the same code for serial and random access via a complex indexing structure and comprises Read and Read/Write tests, covering data from caches and RAM. This benchmark uses data from the same array for all threads, but starting at different points.
MP MFLOPS Benchmark - The benchmark carries out calculations of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word, via caches and RAM.
Each thread deals with separate segments of the data, via shared code, fully demonstrating multithreading speed gains. Performance is highly dependent on ability of a compiler available at production time, particularly using SIMD options.
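The 8 operations per word form quoted above, the split of the data into per-thread segments and the MFLOPS arithmetic can be sketched in Python as follows. The function names and the constants used in the usage note are hypothetical, and the real benchmark is compiled code where the compiler may generate SIMD instructions.

```python
def triad8(x, a, b, c, d, e, f):
    """Apply the 8 operation form in place: 4 adds or subtracts, 3 multiplies, 1 more add."""
    for i in range(len(x)):
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f

def segments(words, threads):
    """Split the data into one (start, end) range per thread; the last takes any remainder."""
    size = words // threads
    return [(t * size, words if t == threads - 1 else (t + 1) * size)
            for t in range(threads)]

def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second."""
    return words * ops_per_word * passes / seconds / 1e6
```

For example, one million words at 8 operations per word, repeated for 10 passes in 2 seconds, is 40 MFLOPS. Counting the operations in the expression gives three additions inside the brackets, three multiplications, one subtraction and one final addition, hence 8.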
OpenMP MFLOPS Benchmark - The benchmark carries out the same calculations as MP MFLOPS Benchmark, essentially using the same code, without any OpenMP code requirements, but with critical loops preceded by a simple “go parallel” directive.
QPAR MFLOPS Benchmark - QPAR is a Microsoft alternative to OpenMP.
Graphics Benchmarks Summary
Windows and Linux Graphics Benchmarks - Reports on the following contain numerous results and links to download variations, plus benchmarks and source codes. Here, the main results are for a Core i7 CPU, some with comparisons with older computers and different Operating Systems, at 32 bits or 64 bits, covering a range of monitor screen resolutions.
Windows Drawing Benchmarks - draws different shapes, copies blocks of image data, colours an area, and pokes pixels, with performance measured in Millions of Pixels Per Second and Frames Per Second.
Windows DirectDraw Benchmarks - uses DirectDraw functions to copy image data and to colour fill an area, with performance measured in Millions of Pixels Per Second and Frames Per Second.
Windows Direct3D Benchmark - uses Direct3D functions operating on wireframe, coloured and textured moving objects, with performance measured in Frames Per Second, replaced with Direct3D9 Benchmarks for 32 bit and 64 bit versions.
Windows Direct3D9 Benchmarks - with similar wireframe, coloured and textured objects plus use of Pixel and Vertex Shaders, with performance measured in Frames Per Second.
Windows OpenGL Benchmarks - Coloured and textured moving objects, again, and a complex wireframe and textured real simulation of a kitchen, with performance measured in Frames Per Second.
Windows BMPSpeed Benchmarks - This is a system test, with graphics activity, comprising writing and reading small to enlarged images, scrolling and rotating them, with time in seconds and milliseconds plus MB/second for scrolling.
JavaDraw Benchmark - for running via Windows, Linux and Android, starting with a simple scene with added complexity for subsequent tests, with performance measured in Frames Per Second. There is also an on-line version of this benchmark, executed via a downloaded HTML document.
Linux OpenGL Benchmarks - This is similar to, but an enhanced version of, the Windows OpenGL program. Approval was given to Canonical to include this benchmark in the testing framework for the Unity desktop.
Linux SDL BMPSpeed Benchmarks - carrying out the same functional tests as Windows BMPSpeed, but written using Simple DirectMedia Layer functions.
Stress Testing - The benchmarks have run time parameters to include them in a stress testing exercise, including which section to run and running time.
Input/Output Benchmarks Summary
Windows, Linux and Android Data Storage Device Benchmarks - Again, the general htm file provides examples of performance, with the detail provided in the following main files. These also include links to download programs and source codes.
DiskGRAF is a full Windows application that measures speeds of serial writing and reading, then for cached and random access activity. Results are logged in a text file and graphically for serial operation.
The main report includes 4 tables of results, each containing more than 70 sets of performance and CPU utilisation results of disk drives, covering 1994 to 2014 vintage PCs. Other results are for CD and DVD writers, flash, firewire and network drives.
CDDVDSpd is another full Windows application that measures writing, reading times/speeds of a large file and 520 small files, on most types of mass storage devices.
Results are provided for the same period as DiskGRAF, with 67 devices covered from floppy disks to 7200 RPM disks, then SSD, SD, USB and firewire drives, plus those accessed via WiFi and LAN networks.
DriveSpeed is a command line driven program, with variations for Windows and Linux, having parameters for path/device to use and large file sizes. For the latter, a number of files are written and read.
Then there is a test for cached data, followed by one handling random access. Finally, a large number of differently sized small files are written and read.
The identified main file has a number of tables, one covering 15 disk drives, variously using Linux or Windows, NTFS, FAT or Ext formatting, and main or USB drives. Then there are 12 similar entries for flash drives, 14 for a revised benchmark with random access, 8 for 2014 drives and 3 covering 2016 Windows tablets.
LANspeed is a variant of DriveSpeed that enables running on a selected network drive. A Windows executable version is also available and is run from a Windows based PC by clicking on the file resident on a remote computer.
This has 12 sets of LAN and WiFi results accessing PCs, desktops and a netbook, using 32 bit and 64 bit compilations.
Stress Testing Programs Summary
DOS, Windows and Linux Stress Testing Programs - The first stress tests, in this collection, were based on programs that I wrote for acceptance trials of computers purchased by the UK Government. See Celebrating 50 years of computer benchmarking and stress testing.
Initial requirements were that running times and data volumes should be controllable and results of calculations should be checked for correctness or consistency, with a clear indication provided of any errors or absolute minimum output, if needed for manual checking.
Checking written data, on reading, or results of integer calculations, presented no problems. For floating point, either a simple integer sumcheck was produced or a series of calculations arranged to obtain a theoretical value of 1.0, that would be multiplied by results from repeating the calculations, to generate a final answer close to 1.0.
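The second floating point checking method can be sketched in Python as below. The particular expression, constants, pass count and tolerance are assumptions chosen only to illustrate the principle, and are not those of the original programs.

```python
def unity_calculation(a, b, c):
    """A calculation with a theoretical value of 1.0 (illustrative choice)."""
    return (a + b) * c / (a * c + b * c)

def run_check(passes, a=0.1, b=0.2, c=0.3, tolerance=1e-9):
    """Multiply the results of repeated calculations; the answer should stay close to 1.0."""
    answer = 1.0
    for _ in range(passes):
        answer *= unity_calculation(a, b, c)
    return answer, abs(answer - 1.0) <= tolerance
```

Rounding keeps each result within a few units in the last place of 1.0, so the product remains very close to 1.0 over many passes, whereas a single grossly wrong result from faulty hardware would push the final answer outside the tolerance.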
The program used for stressing input/output writes a number of files, filled with blocks of different data patterns. Reading is carried out, one block at a time, with the target file selected on a random basis. Finally each block, from one file, is read repetitively, intended to be from a disk’s buffer. File sizes and running times can be specified.
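The three phases described above (write patterned files, read blocks from randomly selected files, then read each block of one file repetitively) can be sketched in Python. The data patterns, file names and function names here are hypothetical, and the sketch checks data only, without the timing and run time controls of the real program.

```python
import os
import random
import tempfile

PATTERNS = [b"\x00", b"\xff", b"\xaa", b"\x55"]  # hypothetical data patterns

def expected_block(file_no, block_no, block_size):
    """Return the pattern block that should be stored at this position."""
    return PATTERNS[(file_no + block_no) % len(PATTERNS)] * block_size

def write_files(directory, files, blocks, block_size):
    """Write the test files, each filled with blocks of different patterns."""
    paths = []
    for f in range(files):
        path = os.path.join(directory, f"stress{f}.dat")
        with open(path, "wb") as out:
            for b in range(blocks):
                out.write(expected_block(f, b, block_size))
        paths.append(path)
    return paths

def random_read_check(paths, blocks, block_size, seed=1):
    """Read one block at a time from a randomly selected file; count errors."""
    rng = random.Random(seed)
    errors = 0
    for b in range(blocks):
        f = rng.randrange(len(paths))
        with open(paths[f], "rb") as src:
            src.seek(b * block_size)
            if src.read(block_size) != expected_block(f, b, block_size):
                errors += 1
    return errors

def repeat_read_check(path, file_no, blocks, block_size, repeats):
    """Read each block of one file repetitively; count comparison errors."""
    errors = 0
    with open(path, "rb") as src:
        for b in range(blocks):
            for _ in range(repeats):
                src.seek(b * block_size)
                if src.read(block_size) != expected_block(file_no, b, block_size):
                    errors += 1
    return errors
```

A trial run would create the files in a directory from tempfile.TemporaryDirectory() and expect both checks to return zero errors.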
The file, accessible here, has the following sections. Each provides further links to detailed reports and for benchmark downloads, also sample log files produced by the programs.
DOS and Windows PC CPU Tests - CPU benchmarks CPR4DOS.EXE, FPtest.exe - includes sample results from 1997 and 2017, plus example sumchecks on different CPUs.
DOS and Windows PC Drive Tests - CDK1DOS.EXE, DiskTest.exe - with program data patterns, plus 1997 and 2017 logs
Livermore Loops Benchmark EXE files - Modified for extended running time and for checking results. In its original form, it was found to produce the wrong results of numeric calculations on an overclocked PC.
BusSpd2k.exe Full Windows app - Stress test added to benchmarking options, particularly to select data size to test caches or RAM. The program uses a variety of different data patterns. Examples of data comparison failures are provided, believed to be from an overclocked PC.
IntBurn64.exe Full 64 bit Windows app - Same program as BusSpd2k stress test
Windows Multiprocessor Integer Stress Tests - Identifies files covering MP tests using multiple copies of other stress tests. Example of performance provided, using 1, 4 and 8 copies on a quad core/8 thread PC.
Windows Floating Point Stress Tests - SSE3DSoak.exe and SSEburn64.exe, use assembly code SSE, SSE2 or 3DNow Single Instruction Multiple Data (SIMD) floating point instructions to soak test the CPU, Cache or RAM. Includes temperature graph over 8 minutes, running 4 copies of the program.
Windows Graphics Stress Tests - CUDA MFLOPS, VideoD3D9_64, VideoD3D9_32 - These graphics benchmarks have parameters to specify running time and which test procedure to use. The report, directly accessible here, includes results of 10 minutes tests that ran at constant speeds on a particular PC. The CUDA test identified graphics processor temperature increase of 30°C.
Linux PC CPU Tests - lloops, lloops_64, intburn32, intburn64, burninsse32 and burninsse64 - (same as Windows programs) - These new 32/64 bit command line driven benchmarks were the forerunners of my later test programs, avoiding the overcomplex Windows procedures.
The more detailed summary report identifies excessive CPU temperatures and result of cleaning the heatsink. These tests caused a laptop to overheat to the point of failure and, for the first time, identified the effects of system induced CPU MHz changes.
Linux PC Drive Tests - drivestress32, drivestress64 - (same as Windows program)
Linux Graphics Stress Tests - cudamflops32SP, cudamflops64SP - (same as Windows programs), videogl32, videogl64 (OpenGL) - Report includes samples of performance and CPU/GPU temperatures, running seven copies of the CPU tests along with the OpenGL program.
Raspberry Pi Benchmarks and Stress Tests Summary 1
These benchmarks were compiled to run on ARM processors and are essentially the same as the latest programs produced to run on Intel CPUs, via Windows and Linux. ARM versions were also included to suit newer technology, for both 32 bit and 64 bit working. In many cases, detailed descriptions of the benchmarks are included.
Raspberry Pi, Pi 2 and Pi 3 32 Bit and 64 Bit Benchmarks and Stress Tests - All of the benchmarks in the Classic and Memory categories were run on all three processors with the full detail of measurements provided. Different compilers or compile options were used to embrace new facilities. Pi 3 tests were run for 32 bit, 64 bit and multithreading operation. For comparison purposes, some Android/ARM and PC/Intel results are included.
Next we have Java Whetstone, JavaDraw, OpenGL ES and the cross platform OpenGL GLUT results, along with screenshots. DriveSpeed measurements are included for all processors, using main SD cards, USB drives and various formatting options, then many covering LanSpeed data transfers, including at 64 bits.
Finally, there are examples of stress tests that highlight identified problems. The first is for the single core Pi 1, where running a CPU test and an OpenGL one led to failures using the CPU overclocking option. The second problem is the Pi 3 system crashing when running my new OpenGL GLUT benchmark, where a new version of the Operating System provided a fix.
The main considerations are temperature effects on the Pi 3 at 64 bits, using all four CPU cores, with several tables identifying excessive temperatures producing CPU MHz throttling. Then there are some that show slow single core performance using default power settings. Lastly, results demonstrate less throttling on installing a CPU heatsink, then full speed after installing the system board in a special metal case.
Raspberry Pi OpenElec Benchmarks - A few benchmarks were run under Open Embedded Linux Entertainment Center, identifying high memory occupancy and CPU utilisation, then file formatting, benchmark and copying performance.
Raspberry Pi 1, 2, 3 Multithreading Benchmarks - These demonstrate best and worst case MP performance running 1, 2, 4 and 8 threads. Detailed results are provided using MP-MFLOPS, MP-WHETS, MP-Dhry, MP-BusSpd, MP-RandMem, OpenMP-MFLOPS, OpenMP-MemSpeed, MP-NeonMFLOPS and linpackNeonMP. Some PC and Android device results are included. Raspberry Pi assembly code is provided for compilations producing scalar and NEON vector instructions.
Raspberry Pi 2 and 3 Stress Tests - These cover the same area included in “Raspberry Pi, Pi 2 and Pi 3 32 Bit and 64 Bit Benchmarks and Stress Tests”, but with more information in some cases. Most make use of multiple CPU stress tests with OpenGL. Differences are paging and systems tests. The former runs multiple copies of a program that uses 720 MB, monitored by vmstat to demonstrate memory utilisation. The other deals with three drive and two CPU tests.
Raspberry Pi 3B 32 bit and 64 bit Benchmarks and Stress Tests - The main objective was to compare Pi 3B and 3B+ performance, where gains of CPU benchmarks were generally proportional to CPU MHz, but not the case with RAM speed. For multi-core testing the MP range of programs was used, each using 1, 2, 4 and 8 threads. Finally, improved performance levels were as expected, but earlier ones switched to running at minimum CPU MHz, when not using the latest power supply plug.
For most benchmarks, results of using both 32 bit and 64 bit working are provided, generally showing performance gains of the latter. Problems encountered were 64 bit Linux Gentoo handling drive input/output in a non-standard way and peculiarities running LAN and WiFi benchmarks.
A new program was produced for stress testing, measuring CPU MHz, voltage and temperature. This demonstrated that the 3B+ CPU MHz reduced from 1400 to 1200 when the temperature reached 70°C, with further throttling at 80°C. Core voltage also reduced.
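A monitor of this kind could be approximated with the Raspberry Pi vcgencmd utility, as in the following Python sketch. Using vcgencmd is an assumption here, since the text does not say how the stress testing program obtains its readings, and the function names are hypothetical. The sample strings in the comments follow the format that vcgencmd normally prints.

```python
import re
import subprocess

def parse_temp(text):
    """Parse output such as "temp=70.4'C" into degrees C."""
    return float(re.search(r"temp=([\d.]+)", text).group(1))

def parse_mhz(text):
    """Parse output such as "frequency(45)=1200000000" into MHz."""
    return int(re.search(r"=(\d+)", text).group(1)) / 1e6

def parse_volts(text):
    """Parse output such as "volt=1.2000V" into volts."""
    return float(re.search(r"volt=([\d.]+)", text).group(1))

def vcgencmd(*args):
    """Run vcgencmd with the given arguments and return its output text."""
    done = subprocess.run(["vcgencmd", *args], capture_output=True,
                          text=True, check=True)
    return done.stdout.strip()

def sample():
    """Take one reading of CPU MHz, core voltage and temperature."""
    return {
        "mhz": parse_mhz(vcgencmd("measure_clock", "arm")),
        "volts": parse_volts(vcgencmd("measure_volts", "core")),
        "temp_c": parse_temp(vcgencmd("measure_temp")),
    }
```

Calling sample() at regular intervals while a stress test runs, on a system where vcgencmd is installed, would give a log in which MHz reductions can be lined up against temperature thresholds.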
Integer and floating point stress tests were run at both 32 bits and 64 bits. With no heatsink and a plastic case, all reached the 70°C threshold, and 80°C with the former. The latter 64 bit code benefited from using NEON SIMD vector instructions (disassembled examples provided). Using a special metal case, with three 15 minute CPU stressing programs and an OpenGL one, most tests recorded 1400 (sample not average) MHz, with the odd reduction to 1200 and up to 6% average performance reduction.
Raspberry Pi 3B and 3B+ High Performance Linpack and Error Tests - High Performance Linpack Benchmark (HPL) is normally used to measure MFLOPS performance on the latest supercomputers. It can run using all CPU cores, benefiting from larger memory data array space that needs to be specified at run time. HPL is also a notable stress testing program.
This report covers running an existing version of HPL that uses BLAS (Basic Linear Algebra Subprograms) and another with ATLAS (Automatically Tuned Linear Algebra Software) that I built for 32 bit operation. Numerous tests were run on a Pi 3B and 3B+ housed in that special metal case, covering data sizes between 8 and 512 Mbytes using 1, 2 and 4 CPU cores.
Bottom line achievements were successful runs on the 3B+, at all sizes, but with performance degradation due to reducing CPU MHz at a temperature of 60°C. The 3B suffered from failures due to apparently wrong sumchecks, system crashes, fatal error indications, when using an older operating system and crashes with 4 cores using 512 MB.
My floating point stress tests were also run, producing numerous wrong numeric results and system crashes on the Pi 3B, but not on the Pi 3B+. These tests provide minute by minute changes in performance, CPU MHz and temperature.
Raspberry Pi Benchmarks and Stress Tests Summary 2
Raspberry Pi 4B 32 Bit Benchmarks - Following the last reports (aged 84), I was recruited as a voluntary member of the Raspberry Pi pre-release Alpha Testing Team. This represents my first effort, which was endorsed by Eben Upton, the CEO, and praised by Gordon Hollingworth, Chief Product Officer, in this Twitter topic.
The then ARM V6 and V7 Classic, Memory, Multithreading, Java and OpenGL Benchmarks were run on the Pi 4B for comparison with Pi 3B+ results. Those written in C/C++ were reproduced using the later GCC 8 compiler and run on both computers for further comparisons.
Compared with a 1.07 times increase in CPU MHz, the Classics overall scores increased between 1.87 and 4.70 times. For other CPU speed dependent benchmarks floating point improvements were often between 4 and 6 times faster.
Numerous results and comparisons are provided, too many for a quick survey. For example, 300 comparisons are provided for GCC 8 Memory benchmarks, that cover data from caches and RAM. There, average and maximum ratios were 2.33 and 4.9 times, with 6% noticeably slower.
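The kind of summary quoted above can be reproduced from two sets of matching results with a few lines of Python. This is an illustrative sketch with hypothetical names, and treating a ratio below 1.0 as "slower" is an assumption, since the text does not define the threshold used.

```python
def compare(new, old, slower_below=1.0):
    """Summarise new/old speed ratios for matching measurements."""
    ratios = [n / o for n, o in zip(new, old)]
    slower = sum(r < slower_below for r in ratios)
    return {
        "average": sum(ratios) / len(ratios),
        "maximum": max(ratios),
        "percent_slower": 100.0 * slower / len(ratios),
    }
```

For example, compare([200, 300, 90, 400], [100, 100, 100, 100]) gives an average ratio of 2.475, a maximum of 4.0 and 25% of the measurements slower.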
Some of the Multithreading benchmarks, run here, are intended to demonstrate that this form of programming can produce slow and inconsistent performance. These cover 1, 2, 4 and 8 threads, where best case examples show gains nearly proportional to the thread count, up to 4. Pi 4B/3B+ performance improvements were similar to those for Memory benchmarks.
Oracle Java was used to run the Whetstone benchmark and a drawing program, providing Pi 4B/3B+ average gains of 3.43 times over 14 test procedures (range 1 to 18). OpenJDK was also tried on the Pi 4, producing some much faster drawing speeds. My OpenGL benchmark demonstrated average speed gains of 1.82 times comprising 6 tests at 4 window sizes.
Input/Output benchmarks Pi 4B/3B+ performance comparisons - 2.4 and 5 GHz WiFi speeds were similar. LANs were 1 Gbps vs 100 Mbps, with 4B large file data transfer speeds 3 to 4 times faster. USB was USB3 vs USB2, where example Pi 4B large files were around 3 times faster on writing and 4.2 times on reading but, on small files, 4B was similar on reading but 27 times slower on writing.
Further Alpha Test activity is covered in a Stress Testing report, where those for floating point and integer programs now have benchmarking options that measure performance over the full range of data sizes and test complexity, using between 1 and 32 threads, with Pi 4B integers up to 1.9 times faster and floating point 2.6 times, at 20.8 GFLOPS.
Raspberry Pi 4B Stress Tests Including High Performance Linpack - Programs used were MP-IntStress, MP-FPUStress, MP-FPUStressDP, videogl32, liverloopsPiA7R, burnindrive2 and High Performance Linpack. Monitors used were a new RPiHeatMHzVolts2, measuring CPU MHz, core volts, CPU and Power Chip temperatures, then vmstat for CPU, memory and I/O utilisation, also sar (from sysstat) for network activity.
Initially, program descriptions, example cold state results output and available run time parameters are provided. The first tests were without cooling, for five minutes, to identify weakest links. Five MP-IntStress tests were run via 1, 4 or 8 threads using caches and RAM, showing CPU MHz throttling starting at 80°C, the main offender being when using the shared L2 cache, with a performance degradation of 43%. Then videogl32, by itself, ran with constant CPU MHz and frames per second.
The remaining CPU stress tests were all run for 15 minutes, each with a number of runs involving (some of) no cooling, heatsink, case fan or Raspberry Pi PoE HAT fan, using 8 threads and 1280 KB (>L2 cache size). Full details of results are provided, with some graphs to show variations by time or CPU MHz throttling variations.
CPU Stressing tests with fans all effectively ran continuously at full speed with CPU temperatures less than 70°C. With no fans MP-IntStress, MP-FPUStress and two variations of MP-FPUStressDP indicated CPU temperatures up to 86°C with 44% performance degradations and CPU MHz occasionally half speed at 750 MHz.
High Performance Linpack was run with parameters to use four memory demands between 128 MB and 3.2 GB, each without and with fans, at 3.2 GB achieving 6.2 GFLOPS at 87°C without and 10.8 at 71°C with. A 10 second sampling graph indicates CPU MHz spikes down to 600 MHz to reduce temperature.
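As a rough illustration of how an HPL memory demand relates to problem size, the sketch below assumes the usual rule that the benchmark stores one N x N matrix of 8 byte double precision values, and that MB and GB are decimal units. The function name is hypothetical and the N values actually used in the report are not stated here.

```python
import math

def hpl_problem_size(memory_bytes: float, bytes_per_element: int = 8) -> int:
    """Largest N such that an N x N matrix of doubles fits in memory_bytes.

    Assumption: only the main matrix is counted, ignoring workspace.
    """
    return math.isqrt(int(memory_bytes) // bytes_per_element)

# Smallest and largest memory demands quoted in the summary (decimal units assumed)
for label, mem in (("128 MB", 128e6), ("3.2 GB", 3.2e9)):
    n = hpl_problem_size(mem)
    print(f"{label}: N = {n}, matrix = {n * n * 8 / 1e9:.3f} GB")
```

In practice some RAM must be left for the operating system, so a smaller N than the one that exactly fills memory is normally chosen.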
CPU + OpenGL - Three copies of liverloopsPiA7R plus the most CPU dependent videogl32 test were run (1) with and (2) without a cooling fan and (3) the latter using dual monitors (2 x pixels), each for around 16 minutes. Test 1 recorded continuous maximum CPU MHz and OpenGL FPS. Without the fan, both tests recorded temperatures of 82°C within 30 seconds, with approaching half speed CPU MHz, Loops MFLOPS and OpenGL FPS.
The six OpenGL test functions were run using both a single monitor and in dual mode, without liverloopsPiA7R. Those depending on the pixel count ran at half FPS on the dual, but the CPU speed dependent ones, slower to start with, suffered from a further reduction of between 20% and 30%.
Input/Output Stress Tests - For these, three copies of burnindrive2 were run, accessing the main drive, a USB 3 stick and a remote PC via a 1 Gbps LAN, along with MP-IntStress using four threads, for 15 minutes without a fan. No errors were detected. Following 80°C being reached, after 2 minutes, CPU MHz throttling came into play.
Stand alone speeds are provided, showing that LAN data transfers continued at this rate throughout. The CPU program ran at 58% of maximum MB/second, at the start, falling to 45%. Drive speeds varied but were up to 10% slower than maximum. Performance monitors showed near 100% CPU utilisation of four cores, LAN speed, as measured by the program, at around 33 MB/second with total for drives at up to 80 MB/second.
Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests -These were produced and run under 64 bit Linux Gentoo Operating system and include later compilations from GCC 9. They comprised the full range of benchmarks and stress tests covered in earlier Pi 4 reports.
More than 1000 performance comparisons are included. At 64 bits, Pi 4/3B+ average gains were 2.62 times in the range of 0.70 to 16.8. 64bit/32 bit ratios were 1.28 times, from 0.31 to 4.90 and GCC 9/6 near similar to the latter.
Stress Tests - Maximum speeds of MP-IntStress, MP-FPUStress and MP-FPUStressDP, with short running times, were around 40% faster using the 64 bit versions, now 28.7 GB/second, 26.7 GFLOPS and 13.2 GFLOPS. A series of 10 minute runs of these, without a cooling fan, produced the same order of CPU MHz throttling as at 32 bits.
High Performance Linpack - Similar fan cooled performance to the 32 bit version was indicated, at 10.4 DP GFLOPS. Other runs demonstrated the no fan performance variability and different, but valid, sumchecks.
15 Minute tests, comprising 64 bit OpenGL and 3 x Livermore Loops programs, were run with and without fan cooling. Performance was much better than that at 32 bits, but running in an improved environment.
Input/Output Stress Tests - A wide variety of these were run, mainly to establish that all was well using a 64 bit Operating System.
Errors? - During all of these tests, other than the High Performance Linpack sumcheck issue, no data comparison failures were detected nor any system crashes, in spite of CPU speed sometimes reducing to 600 MHz. DriveSpeed64 file handling operated in a different way and required a new approach in order to avoid unwanted data caching.
Raspberry Pi 4 CPU MHz Throttling Performance Effects -Without cooling, the Pi 4B appears to run at 1500, 1000, 750 or 600 MHz. The latter can be selected, to run continuously, by using a frequency scaling governor command. These tests were run at the two extremes of 1500 and 600 MHz.
Using the latter, performance of benchmarks, with short running times, can indicate worst case CPU speed throttling effects. The bcmstat performance monitor was run to obtain CPU utilisation of each of the four cores and other details. In case you are unaware, %total indicates average over 4 cores.
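The exact governor command used for these tests is not given in this summary. As a minimal sketch, assuming the standard Linux cpufreq sysfs files and the standard "powersave" (minimum MHz) and "performance" (maximum MHz) governor names, the setting could be read and changed as follows. The function names are hypothetical, and writing the real sysfs files needs root privileges.

```python
from pathlib import Path

# Standard Linux location of per-core cpufreq controls
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")

def read_governors(root: Path = CPUFREQ_ROOT) -> dict:
    """Return {core name: current governor} for every core exposing cpufreq."""
    result = {}
    for gov_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_governor")):
        result[gov_file.parent.parent.name] = gov_file.read_text().strip()
    return result

def set_governor(name: str, root: Path = CPUFREQ_ROOT) -> None:
    """Write the governor name for every core (root needed on a real system)."""
    for gov_file in root.glob("cpu[0-9]*/cpufreq/scaling_governor"):
        gov_file.write_text(name + "\n")

if __name__ == "__main__":
    # Read only by default; call set_governor("powersave") to hold minimum MHz
    print(read_governors())
```

The root argument exists only so the functions can be tried against a scratch directory before touching the real system files.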
Video Playback - These tests were run using BBC iPlayer with data transfers via LAN. Unlike with WiFi connection, no buffering was indicated using both MHz settings but, at 600 MHz, pixel dimension quality was worse viewing complex images, then the same with plain backgrounds. The bcmstat monitor indicated that all four cores were heavily utilised, at an average of 81% each at 600 MHz.
OpenGL Benchmark - Performance was the same or worse, at 600 MHz, depending whether graphics or CPU speed was the limiting factor and nearly proportional to CPU MHz. However, this was not reflected in CPU utilisation ratios, possibly due to lack of multithreading or graphics processor time.
Main Drive Benchmark - Writing and reading large files, average data transfer speeds were similar at both MHz settings, as was CPU utilisation, equivalent to a little more than 100% of one CPU core. Then this was nearly all recorded as waiting for I/O.
LAN Benchmark - Again transferring large files, as for the drive benchmark. Gigabit speeds were demonstrated at the higher MHz, some 25% faster than at 600 MHz. CPU utilisation differences were similar, but influenced by waiting time for I/O and serving interrupt requests.
LAN Plus CPU Benchmarks - Using the same LAN benchmark plus a single threaded processor test, network speeds were the same as before but the CPU benchmark performance was proportional to MHz settings. This time, the latter increased equivalent average CPU utilisation by around 25% per core, as might be expected.
Copying 1 GB Files From Pi 4 USB 3 Drive Via LAN To Windows PC - Copying speed MB/second performance degradation at 600 MHz was 40%, compared with 60% in MHz. CPU utilisation and data transfer speeds were lower than those for the LAN benchmark.
Core Utilisation Variations - These are from bcmstat, using 1 second sampling, showing details of variations for the file copying and video playback tests.
Benchmarking Raspberry Pi 4 Running From Power Over Ethernet -This report compares network performance and power supply effects of Raspberry Pi 4 computers activated using PoE (Power over Ethernet) and normal power supplies, then considers some applications that can be run with no other input/output or power cable connections.
Hardware required for PoE is a unit that injects power on to an ethernet cable, at up to 50 volts, and another to extract it at a remote destination, converted to 5 volts. The latter can be a Raspberry Pi PoE HAT, that includes a Pi system fan, or a separate unit.
The main cables used were combinations of three (30+10+8) for 48 metres CAT 6 and (30+10+10) for 50 metres, the last one being an unlabeled thin one. Programs run involve large and very small files, using LanSpeed benchmark and Burnindrive stress test, then CPU stress tests that were known to consume the most power. The report includes detailed results logs that can be subject to different interpretations.
LAN Benchmarks - The first LanSpeed run, from the Pi 4, was via a short CAT 6 cable, to determine maximum speeds. This was followed by using the two long cables running from normal power and PoE. Subject to wide variability, the CAT 6 cable essentially demonstrated 1000 Mbps speeds but, including the thin cable, only 100 Mbps was possible. Using the latter with PoE failed to read the 2000 MB large files and, using all CAT 6 with PoE, 100 Mbps performance was only possible using much smaller files.
LAN Stress Tests - These were each run for 21 minutes, transmitting numerous different data patterns with random switching between files. Performance is also measured, but was slower than from benchmarks, due to the time required to compare the data with expected values. Tests using PoE and the long CAT 6 cable were completed successfully with no errors detected. Using the long cable, with the thin section, failed to run properly at lengths of 50 and 40 metres, but ran without errors at 18 metres.
PoE Voltage Tests - CPU stress tests, that had been identified as those with the highest current demands, were run for 10 minutes. The system was fan cooled, then CPU MHz, Voltage and temperatures were monitored. Five MP-FPUStress and five MP-IntStress tests were carried out, covering normal power, PoE, long thick and thin cables, HAT and external power connections. Effectively, long term voltages, temperatures and performance measurements were constant throughout.
PoE CPU Stress Tests Plus USB 3 Drives - The CPU stress tests were repeated with USB 3 disk or Flash drives connected, both active and non-active. Intermittent system crashes occurred, in most cases. Results from successful runs are provided but with no indications of unacceptable behaviour.
One Wire PoE WiFi Only - A series of tests were carried out using PoE, with the Ethernet cable unplugged at the Pi 4 end, with WiFi communications active. Screen shots of all are provided.
1. LanSpeed to Windows 7 by clicking and PuTTY, long CAT 6 cable.
Raspberry Pi 64 Bit OS and 8 GB Pi 4B Benchmarks -This report covers the May 2020 Raspberry Pi 4B upgrades, comprising 8GB RAM and the Beta pre-release 64 bit Raspberry Pi OS. Tests included all the usual Classic, Memory, Multithreading, Input/Output, OpenGL Benchmarks and Stress Tests. Most benchmarks were run, using GCC 8 compilations. Full detailed results are provided, along with 32 bit and 64 bit performance comparisons.
Classic Benchmarks - All showed 64 bit average performance gains in the range of 11% to 81%, the highest where the new vector instructions were compiled.
Memory Benchmarks - 64 bit and 32 bit speeds from RAM were the same, as were around half of CPU dependent routines, with the other half an average near 30% faster at 64 bits.
Multithreading Benchmarks - There were twelve, covering some intended to show that they were unsuitable for multithreading operation. Five measured floating point performance, where the average 64 bit gain was 39%, demonstrating a maximum of 25.9 single precision GFLOPS and 12.7 at double precision. Comparisons from most others were irrelevant.
Drive and Network Benchmarks - 32 bit and 64 bit performance was generally the same. But 32 bit file sizes were limited to 2 GB minus 1, whereas 3 x 12 GB could be exercised at 64 bits. There were a number of caching issues at 64 bits.
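The "2 GB minus 1" figure matches the largest value a signed 32 bit integer can hold, which is the usual reason for such a file size limit on 32 bit systems. That explanation is an assumption here, as the summary does not state the cause. A short Python check of the arithmetic, with a hypothetical helper name:

```python
import struct

# Largest signed 32 bit value: 2 GB minus 1 byte
MAX_OFFSET_32 = 2**31 - 1

def fits_signed_32(value: int) -> bool:
    """True if value can be stored in a signed 32 bit field."""
    try:
        struct.pack("<i", value)
        return True
    except struct.error:
        return False

print(MAX_OFFSET_32)
print(fits_signed_32(MAX_OFFSET_32), fits_signed_32(MAX_OFFSET_32 + 1))
```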
Java and OpenGL Benchmarks - 64 bit Java CPU speed, Java drawing and OpenGL benchmarks were run, with different window settings, including using dual monitors. 32 bit versions were slightly faster on some test functions. 64/32 bit ratios were between 0.7 and 1.3 with OpenGL.
Measured Usable RAM Capacity - 3.43 out of 4 GB and 7.9 out of 8 GB at 64 bits, but just under 2 GB at 32 bits.
High Performance Linpack Benchmark - Maximum performance was similar at 32 and 64 bits, at around 11.25 double precision GFLOPS, with 8 GB RAM, and 10.8 GFLOPS, using 4 GB. Without an active cooling fan, the latest improvements in thermal management led to significantly increased performance, working in this state.
Other 8 GB Stress Tests - these included using 7.2 GB RAM with swapping, exercising a 40 GB file and demonstrating less performance degradation caused by CPU MHz throttling.
Other Tests - these included using Power over Ethernet (PoE) and playing TV programmes.
Raspberry Pi 400 PC Benchmarks and Stress Tests -This report provides results of benchmarks and stress tests run on a Raspberry Pi 400 PC, using the 32 bit and 64 bit Operating Systems. The PC comprises an upgraded version of the Raspberry Pi 4B CPU, fitted and fanless within a Raspberry Pi keyboard, running at 1800 MHz instead of 1500 MHz. The system has a full width metallic heat spreader between the keyboard and the circuit board. This appears to be an excellent cooling arrangement.
Tests included all the usual Classic, Memory, Multithreading, Input/Output, OpenGL Benchmarks and Stress Tests. Full details of the programs' logged results and comparisons are provided for Pi 400 at 32 and 64 bits and a Pi 4B at 32 bits.
CPU and RAM Benchmarks - The first group of 18 benchmarks measure various aspects of CPU performance, including accessing multiple CPU cores. At 32 bits, the Pi 400 generally provides the expected 20% improvement in performance, where CPU time dominates but little difference with RAM speed limitations. Average performance was superior using 64 bit operation, but too variable to be conclusive. The compiler version used was identified as a potential significant issue.
Input/Output Benchmarks - These were runs of OpenGL on single and dual monitors, LanSpeed at 1 Gbps, WiFi at 2.4 and 5 GHz, and DriveSpeed on a range of SD cards, USB 3 flash and hard drives, some with different formats. The 32 bit drive benchmark identified associated file size limitations but no other real issues. The 64 bit version demonstrated handling larger files, but the long established method of avoiding caching produced various failures.
USB Booting - This new feature was tried using SD cards, flash drives and a hard drive. These tests were completed satisfactorily, with a few complications, the major one being a restriction in a drive’s partition size.
High Performance Linpack Benchmark - A number of runs were carried out to demonstrate consistent performance and temperature control using the novel heat spreader. The higher CPU MHz speed now produces a rating of 11.7 DP GFLOPS with 4 GB RAM. The 64 bit version, again, produced different numeric sumchecks that were accepted as valid by the program.
Stress Test Benchmarks - MP-FPUStress+, MP-IntStress+ - Using multiple threads and data sizes, performance was generally 20% faster than the Pi 4B, using cache based data. Maximum 64 bit GFLOPS were SP 28.0, DP 16.0 and Integer GB/second 34.2, all clearly using advanced SIMD instructions and 4 CPU cores. Compilations for 32 bit operation were somewhat slower. That 64 bit GB/second rating was from an earlier version of the benchmark, the later one being much slower, not benefiting from the usual compilation parameters.
CPU Stress Tests - Fifteen 30 minute tests were run covering floating point SP and DP and integer, 32 bit and 64 bit operation, exercising the Pi 400, along with a fan and fanless Pi 4B. Full details of measured CPU and PMIC (power chip) temperatures are provided and summaries of voltages, plus performance and CPU MHz range. One Pi 400 test was run outside at ambient temperature greater than 40°C. Overall, Pi 400 cooling and performance advantages were demonstrated.
System Stress Tests - Six programs were run at the same time for 15 minutes, exercising integer and floating point hardware, all RAM space, OpenGL and drive data transfers, whilst monitoring environment and system utilisation. These were run at 32 and 64 bits on the Pi 400 and fan controlled Pi 4B. There were no excessive CPU temperatures and no data comparison errors.
TV Tests - BBC iPlayer programmes were viewed, using the Pi 400, for at least seven hours each, via TV at 32 bits and a PC monitor at 64 bits, with external bluetooth speaker sound for the latter. There were a few peculiarities for consideration, but no interruptions to service.
Raspberry Pi Benchmarks and Stress Tests Summary 3
Raspberry Pi Pico, Pi 4 and Pi 400 Python and C Basic Beginners Bit Banging Benchmarks -The Pico is a microcontroller with many advanced options, such as DMA, ADC, UART, I2C and PWM. Beginners in this area might be initially interested in exploiting general purpose input/output. This report covers measuring bit banging performance, switching LEDs on and off at various frequencies, using C and (new to me) Python programs, accessing Pico and Raspberry Pi 4B and 400 computers. Additionally, some general purpose CPU benchmarks were run.
Bit Banging Tests - Details of different wiring and program code are provided for using Python and C, on a Pi computer and Pico. The full tests are carried out executing on then off cycles to 1 and 13 output pins, specifying 6 sleep time delays between 100 milliseconds and 1 microsecond. These are repeated between 100 and 10000000 times, where theoretical running time is 20 seconds for all steps. The 6 tests are repeated, without sleeping, then with sleeping and no output. These enable execution and sleep overheads to be calculated, along with maximum possible cycles per second performance.
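The repeat counts quoted above are consistent with each cycle containing two sleeps, one after switching on and one after switching off. The following sketch works through that arithmetic and the overhead calculation. The decade steps between 100 milliseconds and 1 microsecond and both function names are assumptions for illustration, not taken from the report's programs.

```python
TARGET_SECONDS = 20.0  # theoretical running time quoted for every step

def repeats_for(sleep_seconds: float, target: float = TARGET_SECONDS) -> int:
    """Cycles needed so that on-sleep plus off-sleep totals `target` seconds."""
    return round(target / (2 * sleep_seconds))

def overhead_per_cycle(measured_seconds: float, repeats: int,
                       sleep_seconds: float) -> float:
    """Extra time per on/off cycle beyond the two requested sleeps."""
    return measured_seconds / repeats - 2 * sleep_seconds

# Assumed decade steps: 100 ms, 10 ms, 1 ms, 100 us, 10 us, 1 us
sleeps = [10.0 ** -e for e in range(1, 7)]
for s in sleeps:
    print(f"sleep {s:.6f} s -> {repeats_for(s)} cycles")

# Hypothetical measurement: 100 cycles at 100 ms sleep taking 25 seconds
print(f"overhead {overhead_per_cycle(25.0, 100, 0.1):.3f} s per cycle")
```

Running the same step without sleeping gives the execution overhead alone, and running with sleeping but no output gives the timer overhead alone, which is how the two can be separated.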
Full details of results are provided, amounting to more than 200 measurements, including Pi 400 running at 1800 and 600 MHz and different sleep timers. Also included are details of power consumption and a program that validates data transfer speeds on a monitoring Pi computer.
Maximum Bit Banging Speed (no sleeping) - Measured cycles per second were much slower at 13 outputs, but converting to bits per second could produce similar performance to that from one output. Maximum Pi 400 speed, via C, was around 66 Mbps at 1800 MHz and, proportional to MHz, 22 Mbps at 600 MHz. The Pico, at 125 MHz, achieved 42 to 52 Mbps, indicating less dependency on MHz. In all cases, the C code produced maximum speeds more than 500 times faster than from Python programs.
Sleep Timer Overheads (no output) - These lead to lower than possible speeds on changing the run time parameter for higher cycles per second. However, a desired speed can often be obtained by experimenting with lower sleep time settings. Performance also depends on availability of a timer with minimum overheads. The only one observed, that produced linear increases in speed, over the range specified, was one using C on the Pico. All tests indicated desired speed with parameters for 100 milliseconds sleep and within 10% at 1 millisecond, but out of reach with lower time settings.
Overall Performance (output + sleeping) - compared with sleep only speeds, the best C Program indicated the same 100% timing accuracy with one output, but not quite with 13 and at 1 microsecond sleep time. Those using Python still obtained the same minimum speed but gradually became less accurate than the sleep only tests, by up to a further 27%.
CPU Benchmarks - I compiled my Whetstone, Dhrystone and MemSpeed C/C++ benchmarks to run on the Pico. The benchmarks, source codes and necessary make procedures are made available to download. Performance, of course, is much slower than from recent Pi computers, particularly using floating point calculations. MemSpeed, with calculations, demonstrated 48 Mbps with floating point and at least 760 Mbps with integers, the latter much faster than needed for the bit banging benchmarks run here.
Raspberry Pi Pico W Basic Beginners Bit Banging, CPU and WiFi Benchmarks -The Pico W is essentially the same as the original Pico, with additional WiFi connectivity and the option of using a new embedded MicroPython package. Full details of test results are included in the report.
Bit Banging Tests (see above) - These include precompiled versions, where performance was the same as on the original Pico, and others run via various MicroPython interpreters. For the latter, the embedded MicroPython and later releases of the original version, provided an increase of nearly four times in maximum bit switching speeds, but still much slower than from compiled programs.
CPU Benchmarks (see above) - Performance of the precompiled Dhrystone and MemSpeed benchmarks was the same as before, with Whetstone slightly different and involving a variation in output format, now the same as from Pi CPUs. In addition, Python Pystone benchmark was run. This was produced by the original author of the Dhrystone program, allowing approximate benefits of compilation to be calculated and indicating that the existing C version was 175 times faster.
WiFi Tests - These were carried out between Pico W, Raspberry Pi 4, with 2.4 and 5 GHz WiFi, and a PC, all on the same internal network. The first tests were via the iPerf benchmark, providing estimates of the maximum achievable bandwidth of the network. Calculations included the effects of 5 GHz or 2.4 GHz WiFi, CPU MHz impact, and network packet sizes used. The Pico performance was relatively very slow, but nearly 10M bits per second might be adequate for Pico W applications, dealing with this sort of activity. Note, that these tests were with a Pico W Client sending data to a remote Server. At this time Pico W iPerf Server operation was attempted, established connectivity but failed to transfer data.
Ping Tests - The next tests were via compiled ping Windows and Linux utilities, accessing the Pico W. These can be used to confirm that a Pico W is up and running, connected to the network and provide guidance on the likely performance of dealing with small sized data transfers. At the time of testing, pinging the Pico was extremely slow, but a temporary fix was obtained to allow tests to continue. Data transfer speed, whilst pinging, was calculated, achieving the same sort of levels as iPerf.
The last detailed ping tests were carried out using a recommended Python program available from GitHub, started by satisfactorily pinging up to 4096 bytes, from Pico W to a Raspberry Pi and a PC. Windows and Linux network monitoring utilities were used to confirm reception. This Python utility has a parameter to control speed of operation. Using maximum speed, ping produced false reports, with the remote performance monitor recording much greater data volumes. Other monitoring indicated that the Pico W was overloaded and could not cope.
Android Benchmarks and Stress Tests Summary
Most reports for these Android programs provide direct access to download and install the Apps or a folder to reside on an SD card. Installation normally requires approval, in Settings, to take onboard non-Market applications. Links to download Project Files, containing source codes, are also provided. These are arranged to run under Eclipse Integrated Development Environment (some were later converted by Android Studio, but not included in the collection).
The Apps usually have three buttons, Run, Info and Save or Email. Originally, default for the latter was to Email the results to me. Now, with the latest versions of Android, multiple choices are provided, like Gmail, Bluetooth, Message or Drive.
The Apps require Java code to communicate with Android but, in this case, C/C++ programs are also used, mainly to produce faster performance. Note that downloaded Apps might not operate correctly using later or earlier versions of Android than those shown here, nor with alternative hardware.
2013 Android Benchmarks2 -This covers results of Classic, Memory and DriveSpeed benchmarks, running under Android 2 and 4, with descriptions of MultiThreading and Graphics programs. A selection of results are provided for 26 tablets or phones, with others for PC hardware or emulators. ARM technology CPU speeds were between 800 and 2000 MHz, working in 32 bit mode. Some benchmarks were compiled to run using A5 and/or A7 architecture levels that particularly affect floating point performance.
Classic Benchmarks - Whetstone and Linpack Java versions were run, along with those using C/C++ compiled native ARM code, the latter being more than twice as fast on the later CPUs. Results for Linpack include programs using five varieties of compiling options, with fastest system producing between 29 and 1335 MFLOPS. Optimised and non-optimised versions of the Dhrystone benchmarks were run, the former typically being around twice as fast and maximum performance achieving VAX MIPS (AKA DMIPS) per MHz of 2.17. MFLOPS of all 24 Livermore Loops are provided, one demonstrating more than 1 GFLOPS, and many shown to be faster than the Cray 1 supercomputer, for which this benchmark was originally used.
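For readers unfamiliar with the VAX MIPS per MHz rating, the sketch below shows the conventional calculation, where a Dhrystones per second score is divided by 1757, the score of the VAX 11/780 reference machine. The function names and the sample score are hypothetical and are not taken from the report.

```python
# Dhrystones per second of the VAX 11/780, conventionally rated at 1 MIPS
VAX_11_780_DHRYSTONES_PER_SECOND = 1757

def vax_mips(dhrystones_per_second: float) -> float:
    """Convert a Dhrystones per second score to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND

def dmips_per_mhz(dhrystones_per_second: float, cpu_mhz: float) -> float:
    """Normalise DMIPS by clock speed to compare different CPUs."""
    return vax_mips(dhrystones_per_second) / cpu_mhz

# Hypothetical example: 3,000,000 Dhrystones per second at 1000 MHz
score, mhz = 3_000_000, 1000
print(f"{vax_mips(score):.0f} DMIPS, {dmips_per_mhz(score, mhz):.2f} DMIPS per MHz")
```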
Memory Benchmarks - MemSpeed, BusSpeed and RandMem benchmarks were run, each measuring cache and RAM data transfer speeds at 10 different capacity demands between 16 KB and 65 MB. With 180 results for each system tested, much can be learned. With MemSpeed best GB/second were up to 9.4 L1 cache, 6.4 L2 cache and 1.6 RAM, compared with worst at 0.69, 0.15 and 0.15 respectively. Performance was similar on other benchmarks, except for random access, where best case from RAM was 0.1 GB/second.
DriveSpeed Benchmark - This is not easy to use as the drive path normally has to be typed in and can be difficult to identify. There are sometimes caching issues, where a file is written but a reboot is needed in order to ensure that the drive is read and not data cached in RAM. See the report for details. Example results are provided for main and external SD cards and USB 2/3 drives, some of which are identified as running particularly slow.
CPU MHz Monitor - This demonstrated where MHz varied, using power and energy saving settings and on battery power.
On-Line Benchmarks - Image Loading Times - These procedures do not work anymore.
2015 Android Graphics Benchmark Apps -These benchmarks only use Java functions.
JavaOpenGL1 - This measures frames per second (FPS) of WireFrame, Shaded, Shaded+ and Textured displays at three different pixel densities. All sorts of complications were identified. Here, best score for the test with the heaviest loading was around 8 FPS. Then, for lightest, performance was limited by a system forced 60 FPS.
JavaDraw - Five tests draw on a background of continuously changing colour shades with ever increasing drawing content, again measuring FPS. Slowest system produced performance ratings between 4 and 12 FPS, and fastest 6 to 60 FPS.
Battery Test - The program runs the second most demanding JavaDraw test, with CPU MHz displayed, along with FPS and running time in minutes. Five systems were tested for between 4 and 6 hours. Some ran with little variation in MHz samples or FPS, most eventually turning off due to the lack of power. Others had higher variation in MHz or peculiar behaviour.
2016 Android MultiThreading Benchmark Apps -These are run using 1, 2, 4 and 8 threads with different memory demands or CPU functions. Most demonstrate the number of CPU cores, not always clear from that identified in CPUID data. Results are provided for 17 different systems, some including programs produced for older A5 architecture. Some of the usual good and bad performance ratings obtained are as follows, in this case from ARM CPUs.
Fastest Floating Point with own segment of shared data. - MP-MFLOPS benchmark 1 thread 1.2 GFLOPS, 4 threads 4.2 GFLOPS.
Example Fast Data Transfer speed with own segment of shared data. - MP-BusSpeed benchmark L1 cache 1 thread 6.0 GB/sec, 4 threads 23.7 GB/sec, RAM 1 thread 2.7 GB/sec, 4 threads 9.1 GB/sec.
Best Performance each thread with own data - Whetstone benchmark 1 thread 1877 MWIPS, 4 threads 7426 MWIPS.
Worst MultiThreading Performance - Write/Read test MP-RandMem benchmark with no MP gain at around 3.5 GB/sec using 1, 2, 4 and 8 threads.
Limited MultiThreading Performance - MP-Dhrystone benchmark with some shared data - 1584, 2749, 3836 DMIPS 1, 2, 4 threads.
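The good and bad ratings listed above can be compared by calculating the multithreading gain and the efficiency per thread. This is a small sketch using only the 1 and 4 thread figures quoted in the list. The function names are hypothetical.

```python
# (1 thread, 4 threads) figures quoted in the list above
results = {
    "MP-MFLOPS (GFLOPS)": (1.2, 4.2),
    "MP-BusSpeed L1 cache (GB/sec)": (6.0, 23.7),
    "MP-BusSpeed RAM (GB/sec)": (2.7, 9.1),
    "MP-Whetstone (MWIPS)": (1877, 7426),
    "MP-Dhrystone (DMIPS)": (1584, 3836),
}

def speedup(one_thread: float, n_threads_score: float) -> float:
    """Gain from multithreading, relative to the single thread score."""
    return n_threads_score / one_thread

def efficiency(one_thread: float, n_threads_score: float, threads: int) -> float:
    """Fraction of ideal scaling achieved (1.0 means perfect scaling)."""
    return speedup(one_thread, n_threads_score) / threads

for name, (t1, t4) in results.items():
    print(f"{name}: gain {speedup(t1, t4):.2f}, "
          f"efficiency {efficiency(t1, t4, 4):.0%}")
```

A gain near 4 indicates four cores working independently, while the MP-Dhrystone figures show the cost of sharing data between threads.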
2016 Android NEON Benchmark Apps -ARM NEON SIMD instructions are available for single precision floating point and integer calculations, operating on four 32 bit numbers in the same clock cycle. In this case they are produced using C/C++ intrinsic functions.
NEON MP-MFLOPS Benchmark - Compared with the best case performance quoted above for MP-MFLOPS, NEON results were more than twice as fast, on the same system, at 2.9 GFLOPS using 1 thread and 11.6 GFLOPS with 4 threads.
NeonSpeed Benchmark - This covers the same single precision floating point and integer calculations and memory demands as MemSpeed, providing normal and NEON speeds for comparative purposes, the latter indicating up to 2.3 times improvement on the system quoted for MemSpeed.
NEON-Linpack Benchmark - A table is provided with 17 sets of results covering 5 variations of compiler and options used, running the single core programs. Here, the NEON version was typically twice as fast as the normal single precision benchmark.
NEON-Linpack Benchmark-MP - The Linpack benchmark is completely unsuitable for multithreading operation, using my usual method. This one is run accessing three different sized data matrices, using normal operation then 1, 2 and 4 separate threads. Example performance, with the normal N=100 parameter, was 1498 MFLOPS, without threading and around 61 MFLOPS using 1, 2 and 4 threads.
2016 Android Benchmarks32 -The Apps and details are essentially the same as 2013 Android Benchmarks2, but with additional results. Some faster hardware performance was obtained, particularly with main memory speeds.
Results from both Android 4 and 5 are provided for some systems, indicating improvement in Java performance, in one case. Tests were run using Android on Intel CPUs for comparison purposes, also some running Windows and Linux versions of the benchmarks.
2016 Android Native ARM + Intel Benchmarks -With the number of devices running Android on Intel CPUs and the introduction of 64 bit architecture and Operating Systems, GCC 4.8 on Eclipse enabled Apps to select 32 bit or 64 bit code on various platforms at run time. Those then available were arm64-v8a, armeabi (A5), armeabi-v7a (A7), mips, mips64, x86 and x86-64. All were selected, along with new icons to produce a new set of benchmarks, downloadable from this document.
These results are included below in 2018 Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM, Intel and MIPS.
2016 Android 64 Bit Benchmarks -This contains details of a few tests using the new compilation facilities.
Again, these results are included below in 2018 Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM, Intel and MIPS.
2018 Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM, Intel and MIPS -These benchmarks generally ran successfully on devices controlled by up to Android 7. They could be installed, using Android 8, but failed to run due to a minor incompatibility. Results are included for benchmarks running Android on devices using Intel Atom CPUs, also that and a Core i7 processor, using Windows 10 and Remix OS for Android.
Full details of results are provided from a wide range of systems from a 2013 Cortex v7-A9 to 2017. Many under the headings of compilation versions Original ARM, ARM/Intel 32 Bit, ARM/Intel 64 Bit, Intel/Windows 32 Bit, and Intel/Windows 64 Bit, and a few also from Android Java and Windows Java. Android variations were from versions 4, 5, 6 and 7. One tablet had options to boot for Android or Windows 10. Numerous calculations are included, the main ones being mentioned below. Results, using 64 Bit Android, were only available for one device. Comparisons for these are in later publications.
Classic Benchmarks - Many of the new ARM 32 bit compilations produced similar performance to the original but the odd ones were faster. Intel results include some using Atom models and one using a high end Core i7 CPU. In most cases, the latter produced far superior speeds but would not be competitive on cost/performance grounds.
Memory Benchmarks - All results include samples of MB/second per MHz calculations, with separate ratios using L1 cache, L2 cache and RAM. Regarding ARM processors, raw RAM BusSpeed results were up to 6.3 times faster than the 2013 Cortex v7-A9, and 5.0 times on a MB/second per MHz basis. For the latter, L2 cache improvements were up to 3.1 times. L1 cache and other benchmark MB/second per MHz results were much lower, with wide variations. Two variations of My Fast Fourier benchmarks were included, involving single and double precision calculations at 11 different FFT sizes.
MultiThreading Benchmarks - All identify performance using 1, 2, 4 and 8 threads. In some, the expected 4 thread gain is not always demonstrated. Multiple runs are required to establish that this is normal behaviour.
ARM MP-Classic Benchmarks - Some Whetstone compilations produce slow performance on two critical tests that use functions such as COS and EXP, depending on the default libraries used. Best 4 thread MWIPS scores have improved to reach 7491 and maximum GFLOPS to 2.5.
ARM MP-Memory Benchmarks - BusSpeed demonstrates that RAM throughput can benefit from multithreading, now up to 8 GB/second. RandMem continues to produce no performance gain with read/write tests, good improvement on serial reading but some disappointing results on random access.
ARM MP-MFLOPS Benchmarks - Over the period and devices considered here, MFLOPS per MHz improved from 3.5 to 5.4 to produce 11.6 GFLOPS (also quoted above). This is when using NEON SIMD instructions. Note that a PC with an Intel Core processor is shown to reach 23.7 MFLOPS per MHz.
ARM OpenGL and Java Drawing Benchmarks - For OpenGL, the 2013 V7-A9 obtained 7.10 FPS on the heaviest test at screen size 1280 x 720 pixels. Best was 17.6 FPS at 2048 x 1440 pixels. With Java Draw, V7-A9 reached 3.81 FPS and the best shown was 6.72 FPS at 1290 x 1032.
CPU MHz Benchmark and Battery Test - Example results of the MHz benchmark and battery test are provided. Later, these were found to be no longer applicable and alternatives were produced.
New CPU Stress Tests - These comprise MP-FPU-Stress.apk for floating point operation and MP-Int-Stress.apk using integer calculations. They both have a benchmarking mode that provides use of between 1 and 8 (FPU) or 32 (Int) threads, with data sizes for L1 cache, L2 cache or RAM and 2, 8 and 32 arithmetic operations per word, for floating point. These variables can be set for stress testing, besides running time in minutes and up to 32 threads can be selected in both cases. Then, results are displayed after each pass, where pass count depends on CPU speed.
For FPU and Integer stress tests, 8 sets of results are provided for 15 minute tests on 8 different systems or battery use, with 8 threads, accessing L2 cache sized data. The FPU test used 32 operations per data word. In both cases, recorded performance of three tests ended with a speed reduced by more than 40%.
Next, 8 thread FPU and Integer tests were run at the same time, with each running at half speed at the start. Finally, these tests were repeated, each mode using 32 threads. After a while, performance decreased but in an unpredictable manner.
No data sumcheck errors were reported, but results were occasionally lost due to system crashes or a flat battery. See report for other unacceptable behaviour.
2018 Updated Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM and Intel -
2018 Updated Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM and Intel -The latest ARM benchmarks failed to run via Android 8 due to an unimportant configuration request. They were recompiled, omitting this request, using gcc 4.9, with slightly different names, new icons and a 4A8 title in the results. These are available to download in the main report.
ARM Classic Benchmarks - As for most other benchmarks considered here, sample results are provided for the earlier and 4A8 compilations at both 32 bits and 64 bits.
For the Whetstone benchmark, calculated MFLOPS/MHz ratios were little different across the range, but those for MWIPS/MHz improved, at 64 bit working, due to faster MOPS/MHz speeds using COS and EXP functions. Dhrystone 64 bit DMIPS/MHz ratings were much higher, best at 5.87 (some would suggest over optimisation, again). ARM Linpack speeds have improved at 64 bits and with the latest V8 CPU at 4A8 32 bits, probably due to the use of advanced SIMD operation, with best here at 1.38 GFLOPS or 0.59 MFLOPS/MHz. Livermore Loops speeds also improved in line with those from Linpack, with best maximum at 2.6 GFLOPS at 1.0 MFLOPS/MHz and average 0.45 MFLOPS/MHz.
ARM Memory Benchmarks - Calculated MemSpeed maximum Single and Double Precision (SP and DP) MFLOPS/MHz ratios were faster at 64 bits but not so much using 4A8 compilations. One phone, using the 4A8 version, was much slower than running the earlier version. Using the latest technology, ARM 64 bit SIMD vector operation was demonstrated (but limited by lack of complexity), where maximum calculated speeds were 3.9 GFLOPS DP and 6.8 GFLOPS SP. These are normally the same with scalar operation.
64 bit SIMD compiled operation was also demonstrated using NeonSpeed for SP floating point and integer calculations, by producing the same speeds as those from using NEON intrinsic functions. Here, best was 21 to 25 GB/second CPU data transfer speed and 2.5 to 3 GB/second from RAM. There was nothing particularly outstanding in results from the other memory benchmarks.
ARM MultiThreading Benchmarks - There were limited 4A8 and 64 bit performance gains running MP-Whetstones, ignoring those by alternative libraries. A new PC provided a new best 4 thread speed of 11762 MWIPS. MP-Dhrystone continued to demonstrate poor MP performance. Best MP-BusSpeed RAM result is now 14.5 GB/second. MP-RandMem provided some gains and some losses using newer versions.
ARM MP-MFLOPS - Both 64 bit and 4A8 compilations provided performance gains. Best at 4 threads is now 42.0 GFLOPS, clearly using SIMD, including fused multiply and add operations, with single core 5.27 SP MFLOPS/MHz or 18.27 using 4 threads. NEON-MFLOPS-MP also obtained the same level of performance, NEON intrinsic functions being converted to 64 bit vector instructions.
ARM OpenGL and Java Drawing Benchmarks - These Java programs ran successfully under Android 8, as did all other programs run under this OS for this report.
CPU Stress Tests - See previous summary. In this case, MHz measurements of each core were recorded. Results from eight different 10 minute stress tests are provided, covering six tablets or phones, with one running at constant speed and others with unpredictable reductions.
A near best example I have is for a 10 minute 8 thread integer test, running on a CPU with 8 cores, with rated speeds of 4 at 2350 MHz and 4 at 1900 MHz. This ran with all cores being utilised throughout, mainly at the specified speed, with final average MHz reduced by 9%.
Worst case was a floating point test, using 8 cores, where only six were active after the first 20 seconds. The CPU comprised 4 Cortex A53 cores rated 1500 MHz and 4 Cortex A57 at 2000 MHz. Highest measured were 1330 and 1555 MHz ending at 384 and 960, after 10 minutes. Measured MFLOPS reduced from 17955 to 6713.
Battery Test - An example of running the integer stress test on a phone with a near flat battery is provided, starting using 8 cores, running at 1517 or 1210 MHz, reducing to four cores after a minute. This was followed by four cores mainly running at the lower speed, then at 998 MHz until the end of a ten minute test. On restarting, the test continued to run at that speed until the phone died, after a short time. Measured GB/second reduced from 37.0 to 14.3.
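The reductions quoted in the two paragraphs above can be expressed as percentages. The following is a minimal Python sketch for illustration only, not code from the report; the function name percent_reduction is my own and the input figures are those quoted above.

```python
def percent_reduction(start: float, end: float) -> float:
    """Return the drop from start to end as a percentage of start."""
    if start <= 0:
        raise ValueError("start must be positive")
    return (start - end) / start * 100.0


# (start, end) pairs as quoted in the summary text above.
cases = {
    "Worst case floating point MFLOPS": (17955, 6713),
    "Worst case MHz, first quoted group": (1330, 384),
    "Worst case MHz, second quoted group": (1555, 960),
    "Battery test GB/second": (37.0, 14.3),
}

for label, (start, end) in cases.items():
    print(f"{label}: {percent_reduction(start, end):.1f}% reduction")
```

On these figures, floating point speed fell by about 63% in the worst case and battery test throughput by about 61%.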
2020 Android 9 Benchmarks and Stress Tests On 32 Bit and 64 Bit CPUs -
2020 Android 9 Benchmarks and Stress Tests On 32 Bit and 64 Bit CPUs -The latest benchmark runs were primarily aimed to show that they worked under Android 9, but also provided the opportunity to identify differences in performance between 64 bit and 32 bit operation, using what was essentially the same version of CPU technology. Some of the earlier results are provided for comparison purposes. All the apps can be downloaded from the report.
Classic Benchmarks - Comparing measured speed/MHz, one Cortex-A73 appeared to be slower than the other, at 64 bits. Later, it was found that the CPU was running at 1805 MHz, and not the claimed 2000 MHz. Allowing for this, performance from Android 8 and 9 could be assumed to be the same. Whetstone results were similar between 32 bit and 64 bit versions. On the other benchmarks, the latter was up to twice as fast.
Memory Benchmarks - As these benchmarks access data covering caches and RAM, performance levels can indicate which cache is used. These are labelled in BusSpeed results and are shown to be different on CPUs labelled as Cortex A73, where maximum RAM speed was 9.1 GB/second. MemSpeed maximum GFLOPS were 3.11 DP and 8.57 SP. There were wide variations in random access performance during RandMem.
MultiThreading Benchmarks - Most devices had 8 cores, with 2 groups of four, running at different MHz. Best 8 thread MP-Whetstone score was 20501 MWIPS, 6.4 times that for 1 thread. MP-Dhrystone continued to have unacceptable MP performance. Fastest MP-BusSpeed RAM result was 14.5 GB/second, 1.8 times faster than 1 thread. An even faster result was indicated for MP-RandMem at 20.9 GB/second from RAM, but this is probably affected by the 3 MB L2 cache size.
Single Precision MP-MFLOPS Benchmarks - As reported earlier, highest 4 thread speed obtained was 42.0 GFLOPS, 3.5 times faster than 1 thread. Here, I also point out that performance using 8 threads was lower, at 36.7 GFLOPS, on that device.
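The multithreading gains quoted above imply single thread speeds and a scaling efficiency. This is a rough Python sketch of that arithmetic, not taken from the report; the function names are hypothetical, and because the quoted gains are rounded, the derived single thread figures are approximations.

```python
def single_thread_speed(multi_thread_speed: float, gain: float) -> float:
    """Single thread speed implied by a multi-thread result and its quoted gain."""
    return multi_thread_speed / gain


def scaling_efficiency(gain: float, threads: int) -> float:
    """Gain as a percentage of the ideal gain (one per thread)."""
    return gain / threads * 100.0


# MP-Whetstone: 20501 MWIPS at 8 threads, quoted as 6.4 times 1 thread.
mwips_1 = single_thread_speed(20501, 6.4)
# MP-MFLOPS: 42.0 GFLOPS at 4 threads, quoted as 3.5 times 1 thread.
gflops_1 = single_thread_speed(42.0, 3.5)
# The 8 thread MP-MFLOPS result of 36.7 GFLOPS relative to that single thread figure.
gain_8 = 36.7 / gflops_1

print(f"MP-Whetstone 1 thread: about {mwips_1:.0f} MWIPS, "
      f"8 thread efficiency {scaling_efficiency(6.4, 8):.1f}%")
print(f"MP-MFLOPS 1 thread: about {gflops_1:.1f} GFLOPS, "
      f"4 thread efficiency {scaling_efficiency(3.5, 4):.1f}%")
print(f"MP-MFLOPS 8 thread gain: {gain_8:.2f}, "
      f"efficiency {scaling_efficiency(gain_8, 8):.1f}%")
```

As most devices had two groups of four cores running at different MHz, efficiency relative to thread count is only a rough indicator when more than four threads are used.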
OpenGL and Java Drawing Benchmarks - These ran successfully on Android 9, as did DriveSpeed, but with the usual data caching problems.
CPU Stress Tests - These were carried out on my mainly 8 core tablets and phones, two of which have the same CPU running under Android 9, one at 32 bits and the other at 64 bits. First we have examples of one minute 6 thread floating point and integer tests, with results alongside measured MHz of the 8 cores, sampled every 5 seconds, showing reducing performance. The main observation is the unpredictable variation in core MHz speeds.
Next are examples of stress test benchmarks, covering Android 5, 7, 9, 32 bits and 64 bits, identifying identical floating point sumchecks and error free integer calculations. This is followed by a 100 second integer test, with 1 second MHz samples of the 8 cores plus reducing MB/second and associated MHz reductions.
Finally there are full details of 16 stress tests, covering the identical CPUs running at 32 bits or 64 bits, 4 and 8 threads, floating point and integer operation, 5 minute and 15 minute durations. These identify variations in 64 bit/32 bit and 8 thread/4 thread performance ratios.
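The summary above does not say how the per core MHz samples were obtained. As an illustration only, this Python sketch reads the Linux cpufreq file scaling_cur_freq, which reports kHz and is also present on many Android kernels, though access can be restricted on some devices. The function names and the 8 core default are my assumptions, not the method used by the stress test apps.

```python
import time
from pathlib import Path
from typing import List, Optional

CPU_BASE = "/sys/devices/system/cpu"  # standard Linux sysfs location


def read_core_mhz(core: int, base: str = CPU_BASE) -> Optional[float]:
    """Return the current frequency of one core in MHz, or None if unreadable."""
    path = Path(base) / f"cpu{core}" / "cpufreq" / "scaling_cur_freq"
    try:
        return int(path.read_text().strip()) / 1000.0  # file holds kHz
    except (OSError, ValueError):
        return None  # core offline, file missing or access denied


def sample_all_cores(cores: int = 8, base: str = CPU_BASE) -> List[Optional[float]]:
    """Take one MHz reading from each core."""
    return [read_core_mhz(core, base) for core in range(cores)]


def sample_for(duration_s: int, interval_s: int, cores: int = 8,
               base: str = CPU_BASE, sleep=time.sleep) -> List[List[Optional[float]]]:
    """Collect one row of per core MHz readings every interval_s seconds."""
    rows = []
    for _ in range(int(duration_s // interval_s)):
        rows.append(sample_all_cores(cores, base))
        sleep(interval_s)
    return rows
```

Calling sample_for(60, 5) would collect 12 rows of readings over one minute, matching the 5 second sampling interval mentioned above.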
2021 Android 10 and 11 Benchmarks and ARM big.LITTLE Architecture Issues -
2021 Android 10 and 11 Benchmarks and ARM big.LITTLE Architecture Issues -This report provides results from testing a new phone, running under Android 11, with CPU core specification of 2x2.0 GHz Kryo 480 and 6x1.8 GHz Kryo 460, said to be based on Cortex-A76 and Cortex-A55. These are compared with one using Android 10, the CPU cores comprising 4x2.0 GHz Cortex-A73 and 4x2.0 GHz Cortex-A53. Of particular interest was the mismatch of big.LITTLE Architecture.
64 Bit Classic Single Core Benchmarks - With MHz of the fastest cores being the same, Kryo performance gains greater than 1.0 indicate improved internal architecture (as claimed for A76). There are 35 performance measurements in this group, mainly floating point. Gains over the earlier phone averaged 1.83, with a minimum of 1.26 and a maximum of 2.38.
64 Bit Memory Benchmarks - There are 220 MB/second scores from the four main memory benchmarks. Average Kryo gains were 2.25 from L1 cache, 1.98 from L2 cache, 2.12 from L3 cache vs RAM and 1.51 from RAM vs RAM.
64 Bit MultiThreading Benchmarks - Comparing Kryo 2+6 cores with older 4+4 CPU. MP-Classic Benchmarks - As indicated before, two of these were produced to demonstrate unsuitability for multithreading operation. They are MP-Dhrystone and MP-Linpack. The latter is no longer run, as execution time can be greater than 5 minutes. MP-Whetstone reflects the mismatch in big.LITTLE CPU MHz operation, where the Kryo MWIPS gains for 1, 2, 4 and 8 threads were 1.47, 1.61, 1.28 and 1.04. Sample MHz measurements of the 8 cores are provided. MP-BusSpeed results indicate that the MHz mismatch can lead to the older CPU being much faster using 4 and 8 threads. MP-RandMem also has similar lower speeds but Kryo random write/read access can be more than four times faster because of the large L3 cache.
MP-MFLOPS Benchmarks - The Kryo achieved up to 12178 MFLOPS using one thread, or around 6 MFLOPS/MHz, clearly demonstrating SIMD operation with fused multiply and add instructions, followed by 23674 using 2 threads. Then there were the disappointing mismatch results of 26173 and 35686 at 4 and 8 threads. Performance gains over the other CPU were between 1.01 and 3.45 times. NEON-MFLOPS-MP results were similar.
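The "around 6 MFLOPS/MHz" figure and the thread gains behind the disappointing 4 and 8 thread results can be checked with a short calculation. This Python sketch is for illustration only, with hypothetical names. It uses both the 2000 MHz specification and the 2035 MHz measured on the two fast cores, as quoted in the stress test summary of this report.

```python
# Kryo MP-MFLOPS results quoted above, keyed by thread count.
kryo_mflops = {1: 12178, 2: 23674, 4: 26173, 8: 35686}

NOMINAL_MHZ = 2000   # 2x2.0 GHz specification of the fast cores
MEASURED_MHZ = 2035  # sampled MHz of the fast cores, from the stress test summary


def mflops_per_mhz(mflops: float, mhz: float) -> float:
    """Speed normalised by clock frequency."""
    return mflops / mhz


def gains_over_one_thread(results: dict) -> dict:
    """Speed at each thread count relative to the 1 thread result."""
    base = results[1]
    return {threads: speed / base for threads, speed in results.items()}


print(f"Nominal:  {mflops_per_mhz(kryo_mflops[1], NOMINAL_MHZ):.2f} MFLOPS/MHz")
print(f"Measured: {mflops_per_mhz(kryo_mflops[1], MEASURED_MHZ):.2f} MFLOPS/MHz")
for threads, gain in gains_over_one_thread(kryo_mflops).items():
    print(f"{threads} threads: {gain:.2f} times 1 thread")
```

The gain is close to 2 at 2 threads, matching the two fast cores, but only about 2.15 at 4 threads and 2.93 at 8 threads, once the slower cores are involved.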
Java Benchmarks comprising OpenGL, Draw, Whetstone and Linpack all ran successfully under Android 11 and faster with the Kryo CPU.
CPU Stress Test Benchmarks - Examination of the detail can identify unexpected performance, such as faster speeds using 16 threads on an 8 core CPU, as this leads to execution in a lower level cache. The floating point stress test is essentially the same as MP-MFLOPS, but with more run time options. The example integer stress test used up to 32 threads, where the fastest speeds were demonstrated, in this case 49046 MB/second using Kryo, not much faster than the older CPU, due to the CPU MHz mismatch.
CPU Stress Tests - Results from many 15 minute 8 thread tests are provided. The first are 30 second samples of CPU MHz on all 8 cores and measured performance of both CPUs being considered. The Integer Test MHz sampling indicated that the Kryo had 2 cores running at 2035 MHz and 6 at 1805, producing between 56 and 57 GB/second over 30 seconds. The older CPU had variable MHz readings with average reducing from 1989 to 1404 and performance from 52 to 40 GB/second. The Kryo Floating Point Test indicated constant average samples of 1862.5 MHz and 37 to 38 GFLOPS. The other CPU came with average MHz reducing from 1989 to 1504 and GFLOPS from 31 to 25.
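The constant 1862.5 MHz average quoted for the Kryo Floating Point Test is consistent with the Integer Test sampling of 2 cores at 2035 MHz and 6 at 1805 MHz. A minimal Python sketch of that weighted average, with a function name of my own choosing:

```python
def average_mhz(groups):
    """Average MHz over all cores, given (core_count, mhz) pairs."""
    total_cores = sum(count for count, _ in groups)
    return sum(count * mhz for count, mhz in groups) / total_cores


# Kryo core groups as sampled during the Integer Test.
kryo_groups = [(2, 2035), (6, 1805)]
print(average_mhz(kryo_groups))  # (2 x 2035 + 6 x 1805) / 8 = 1862.5
```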
The final results are for a series of 15 minute tests using 2, 4 and 32 threads without MHz recording. Average Kryo/older Integer Test performance ratios varied from 0.95 to 1.65 at the start and 1.12 to 1.63 at the end, with Floating Point Tests starting between 1.11 and 2.50 and ending between 1.17 and 2.7. The lowest ratios are on using 4 threads.