Android 64 Bit Benchmarks - Roy Longbottom's PC benchmark Collection

Roy Longbottom's Android 64 Bit Benchmarks

For latest results see Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM, Intel and MIPS.

General	Logged Configuration	Whetstone Benchmark
Dhrystone Benchmark	Linpack Benchmark	Livermore Loops Benchmark
MemSpeed Benchmark	BusSpeed Benchmark	RandMem Benchmark
MP-MFLOPS Benchmarks	MP-MFLOPS Benchmark Results	MP-Whetstone Benchmark
MP-Dhrystone Benchmark	MP-BusSpeed Benchmark	MP-RandMem Benchmark
NEON-Linpack Benchmark	NeonSpeed Benchmark	NEON-MFLOPS-MP Benchmark
NEON-Linpack-MP Benchmark	FFT Benchmarks	Assembly Code

Download Benchmark Apps

A Settings, Security option may need changing to allow installation of non-Market applications.

	NativeWhetstone2.apk First standard benchmark				Dhrystone2i.apk First integer benchmark
	LinpackDP2.apk First computational benchmark				LinpackSP2.apk Single precision Linpack
	LivermoreLoops2.apk First supercomputer benchmark				MemSpeedi.apk Floating Point Cache and RAM Test
	BusSpeedv7i.apk Integer Bus, Cache and RAM Test				RandMemi.apk Random/Serial Access Cache and RAM Test
	MP-MFLOPSi.apk CPU, Cache, RAM MFLOPS Test				MP-MFLOPS2i.apk Long Running MP-MFLOPS
	MP-WHETSi.apk Whetstone Floating and Fixed Point Tests				MP-Dhryi.apk Dhrystone Integer Benchmark
	MP-BusSpdi.apk Multithreaded BusSpeed Benchmark				MP-RndMemi.apk Multithreaded RandMem Benchmark
	NEON-Linpacki.apk Linpack Benchmark using ARM NEON Intrinsic Functions				NeonSpeedi.apk NEON Memory Speed Test Using Intrinsic Functions
	NEON-MFLOPS2i-MP.apk MP-MFLOPS using ARM NEON Intrinsic Functions				NEON-Linpacki-MP.apk Linpack MP Benchmark using NEON Intrinsic Functions
	MP-BusSpd2i.apk Long running version with staggered start				fft1.apk Original FFT Benchmark
	fft3c.apk Optimised FFT Benchmark

All the above were produced using gcc 4.8, via Eclipse, running under Linux Ubuntu 14.04. They are compiled to run on both 32 bit and 64 bit CPUs from ARM, Intel and MIPS, automatically selected at run time. Downloads are identical to those in Android Native ARM-Intel Benchmarks.htm.

General

As indicated above, the benchmarks, downloadable from here, were compiled for both 32 bit and 64 bit operation. The purpose of this document is to report results from running at 64 bits and to provide comparisons with those at 32 bits, including the latter also compiled by gcc 4.8. Eclipse (or Android Studio?) projects for the new compilations are included in Android Intel-ARM Benchmarks.zip.

The tablet used was a Lenovo Tab 2 A8-50, 8 Inch Tablet, with a 1.3 GHz MediaTek mt8161 quad core processor (64 bit ARM Cortex-A53) and Android 5.0.2. After initially proving that the benchmarks were in 64 bit mode, new versions are being produced that indicate the mode, in this case for 64 or 32 bit ARM, as in the options below:

             Compiled for 32 bit ARM v7a     Compiled for 64 bit ARM v8a
             Compiled for 32 bit Intel x86   Compiled for 64 bit Intel x86_64
             Compiled for 32 bit Mips CPU    Compiled for 64 bit Mips CPU

For comparison on other systems, the 32 bit apps are provided in Android Intel-ARM 32 Bit Benchmarks.zip. Note that these will overwrite the 64 bit app installation.

At the time of starting this report (July 2015), other benchmarks indicated similar performance to a Nexus 7 32 bit tablet. Results for this are included below and they suggest that only 32 bit benchmarks were being used, when 64 bit varieties can be much faster.

Results below are for the revised benchmarks, that indicate which section of the code is used. The new projects are included in the zip file and others will follow.

Logged Configuration

In line with other Android benchmarks in Android Benchmarks.htm, the programs identify system information, in this case the following for the Tab 2 A8-50. Strangely, only three CPU cores are reported.

 System Information

 Device LENOVO Lenovo TAB 2 A8-50F
 Screen pixels w x h 800 x 1216
 Android Build Version 5.0.2

 Processor       : AArch64 Processor rev 3 (aarch64)
 processor       : 0
 BogoMIPS        : 26.00
 processor       : 1
 BogoMIPS        : 26.00
 processor       : 2
 BogoMIPS        : 26.00
 Features        : fp asimd aes pmull sha1 sha2 crc32
 CPU implementer : 0x41
 CPU architecture: AArch64
 CPU variant     : 0x0
 CPU part        : 0xd03
 CPU revision    : 3
 Hardware        : MT8161

 Linux version 3.10.65 (jenkins@ubuntu12) (gcc version 4.9 20140514 (mtk-20150408) (GCC) ) #1 SMP PREEMPT Fri Jun 19 11:01:08 CST 2015
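The core count can be checked by counting the "processor : N" entries in the logged text. The following is a minimal Python sketch, not part of the benchmark apps; the helper name count_processors and the trimmed sample are illustrative assumptions, with the sample lines copied from the log above.

```python
# Sketch: count "processor : N" entries in /proc/cpuinfo style text.
# count_processors is a hypothetical helper, not taken from the benchmarks.

def count_processors(cpuinfo_text):
    """Return the number of lines of the form 'processor : <number>'."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The capitalised "Processor" line names the CPU type and is skipped.
        if sep and key.strip() == "processor" and value.strip().isdigit():
            count += 1
    return count

# Lines copied from the Tab 2 A8-50 log.
SAMPLE = """\
Processor : AArch64 Processor rev 3 (aarch64)
processor : 0
BogoMIPS : 26.00
processor : 1
BogoMIPS : 26.00
processor : 2
BogoMIPS : 26.00
"""

reported_cores = count_processors(SAMPLE)
print("Cores reported:", reported_cores)
```

For the log shown, this counts three entries, although the MT8161 is described as a quad core part.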

Whetstone Benchmark - NativeWhetstone2.apk

This provides an overall rating in MWIPS, plus separate results for the eight test procedures in MFLOPS (floating point) and MOPS (functions and integer). For full details and results via Windows, Linux and Android, and via different programming languages, see Whetstone Benchmark Results on PCs. On the latest CPUs, running time largely depends on the COS and EXP function tests. This is highlighted in the examples below.

As with the next four benchmarks, tests comprised the original, from an earlier compiler, then gcc 4.8 separate compilations for 32 bit and 64 bit CPUs, finally the one produced covering all ARM, Intel and MIPS based systems. The latter shows that this tablet uses the 64 bit code option.

In this case, most tests indicated that the later versions, and 64 bit operation, provided no performance gains, but were somewhat faster than the Nexus 7 Cortex-A9 CPU.

 Version                MWIPS     ------MFLOPS-------          ------------MOPS--------------
                                    1       2       3     COS     EXP   FIXPT      IF   EQUAL

 1300 MHz Cortex-A53
 Original 32 bit       1433.7   348.0   319.3   308.2    36.3    19.8  1551.4  1861.9   611.0
 ARM/Intel 32 bit       834.7   348.9   312.7   310.9    36.7     5.4  1556.7  1867.2   570.5
 ARM/Intel 64 bit      1504.4   348.8   304.9   309.3    38.2    20.5  1536.4  1862.0  1242.4
 ARM/Intel 32/64 bit   1494.2   347.1   307.0   305.9    37.5    20.6  1552.2  1863.7  1239.1

 1200 MHz Cortex-A9
 Original 32 bit       1115.0   271.3   250.7   256.4    25.8    14.6  1190.0  1797.0  1198.7
 ARM/Intel 32/64 bit    731.1   273.6   253.0   252.8    28.0     5.0  1185.2  2383.4  1192.1

Dhrystone Benchmark - Dhrystone2i.apk

The Dhrystone integer benchmark produces a performance rating in Vax MIPS (AKA DMIPS). Further details of the Dhrystone benchmark, and results from Windows and Linux based PCs, can be found in Dhrystone Results.htm. The ratio MIPS/MHz is often quoted, but this depends on compiler optimisation (or over-optimisation).

The 32 bit gcc 4.8 compilations were slower than the original and similar to the Nexus 7, but the 64 bit version was significantly faster using the Cortex-A53.

 Version               Vax MIPS   MIPS/MHz

 1300 MHz Cortex-A53
 Original 32 bit           1683       1.29
 ARM/Intel 32 bit          1423       1.09
 ARM/Intel 64 bit          2549       1.96
 ARM/Intel 32/64 bit       2569       1.98

 1200 MHz Cortex-A9
 Original 32 bit           1610       1.34
 ARM/Intel 32/64 bit       1317       1.10
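The MIPS/MHz column is simply the Vax MIPS rating divided by the CPU clock in MHz. A short Python sketch of that arithmetic for the Cortex-A53 rows follows; the function name mips_per_mhz is an illustrative assumption, and the figures are those tabulated above.

```python
# Sketch: derive MIPS/MHz from the Vax MIPS ratings in the table.
# mips_per_mhz is a hypothetical helper name.

def mips_per_mhz(vax_mips, cpu_mhz):
    """Vax MIPS rating per MHz of CPU clock."""
    return vax_mips / cpu_mhz

A53_MHZ = 1300

# (version, Vax MIPS) pairs copied from the 1300 MHz Cortex-A53 rows.
a53_results = [
    ("Original 32 bit", 1683),
    ("ARM/Intel 32 bit", 1423),
    ("ARM/Intel 64 bit", 2549),
    ("ARM/Intel 32/64 bit", 2569),
]

for version, vax_mips in a53_results:
    ratio = mips_per_mhz(vax_mips, A53_MHZ)
    print(f"{version:22s}{vax_mips:6d}{ratio:7.2f}")
```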

Linpack Benchmark - LinpackDP2.apk, LinpackSP2.apk

The Linpack benchmark speed is measured in MFLOPS, officially for double precision floating point calculations. A version was produced using NEON functions (see later) that only provides single precision operation. So, for comparison purposes, an available C code option, to define single precision data, was used to produce a new version, and this has usually led to a higher MFLOPS speed. Results from various hardware and software platforms can be found in Linpack Results.htm.

Performance of the Linpack benchmark is almost entirely dependent on the calculation x[i]=x[i]+c*y[i], in the daxpy() function. The ARM compilations generated floating point multiply and accumulate instructions for this, such as fmacd d6, d7, d5 at 32 bits and fmadd d1, d0, d1, d5, using four registers, at 64 bits.
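To make the inner loop concrete, here is a minimal Python sketch of the x[i]=x[i]+c*y[i] update. It is an illustration only, not the benchmark's C source; the name daxpy_sketch and the returned operation count are assumptions added for clarity.

```python
# Sketch of the daxpy style update quoted in the text: x[i] = x[i] + c * y[i].
# daxpy_sketch is a hypothetical name; the real benchmark is written in C.

def daxpy_sketch(c, x, y):
    """Update x in place and return the number of floating point operations."""
    for i in range(len(x)):
        # One multiply and one add per element, the pair that the ARM
        # compilations map onto a single multiply and accumulate instruction.
        x[i] = x[i] + c * y[i]
    return 2 * len(x)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
flops = daxpy_sketch(0.5, x, y)
print(x, flops)
```

Counting two operations per element is the convention that gives an MFLOPS figure from the elapsed time.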

In this case, 64 bit operation increased speed by almost 2 times with double precision calculations and 2.7 times at single precision. Performance at 32 bits was similar to that on the Nexus 7.

September 2015 - New best score from 2 GHz Qualcomm Snapdragon 810, (Cortex-A57) and Android 5.0.2, with SP speed of 1277 MFLOPS at 64 bits.

Below is the general output produced, indicating different (but probably acceptable) numeric results of computation in the various modes of operation.

              32 bit DP compilation   32 bit SP compilation
 norm. resid                    1.7                     1.6
 resid               7.41628980e-14          3.80277634e-05
 machep              2.22044605e-16          1.19209290e-07
 x[0]-1             -1.49880108e-14         -1.38282776e-05
 x[n-1]-1           -1.89848137e-14         -7.51018524e-06

              64 bit DP compilation   64 bit SP compilation
 norm. resid                    1.9                     2.0
 resid               8.46778499e-14          4.69621336e-05
 machep              2.22044605e-16          1.19209290e-07
 x[0]-1             -1.11799459e-13         -1.31130219e-05
 x[n-1]-1           -9.60342916e-14         -1.30534172e-05

 Version              LinpackDP   LinpackSP
                         MFLOPS      MFLOPS
 1300 MHz Cortex-A53
 Original 32 bit         156.70      184.09
 ARM/Intel 32 bit        172.28      180.64
 ARM/Intel 64 bit        337.97      473.10
 ARM/Intel 32/64 bit     340.18      482.43

 2000 MHz Cortex-A57
 ARM/Intel 64 bit                   1277.76

 1200 MHz Cortex-A9
 Original 32 bit         151.05      201.30
 ARM/Intel 32/64 bit     159.34      199.84

Livermore Loops Benchmark - LivermoreLoops2.apk

The Livermore Loops comprise 24 kernels of numerical application with speeds calculated in MFLOPS. A summary is also produced, with maximum, minimum and various mean values, geometric mean being the official average. As with the other benchmarks, details and results are provided, in this case in Livermore Loops Results.htm.

Summary results, below, indicate similar Cortex-A53 and Cortex-A9 speeds at 32 bits, and the former faster at 64 bits. This is followed by MFLOPS for each of the 24 test functions, where 64 bit/32 bit performance ratios vary between 0.8 and 7.9 times, with a geometric mean ratio of 1.5. The identified numeric results are also shown, and they can again be slightly different.

 Version                   Max  Average  Geomean  Harmean      Min

 1300 MHz Cortex-A53    MFLOPS
 Original 32 bit         371.5    192.4    171.9    151.8     67.1
 ARM/Intel 32 bit        393.4    188.3    158.3    124.6     27.1
 ARM/Intel 64 bit        781.4    265.4    231.8    205.6     98.1
 ARM/Intel 32/64 bit     772.2    265.9    232.5    206.3     97.8

 1200 MHz Cortex-A9     MFLOPS
 Original 32 bit         391.9    202.1    181.3    160.9     68.1
 ARM/Intel 32/64 bit     396.6    207.6    175.6    136.1     26.8

 A53 ARM/Intel 32 bit
 MFLOPS for 24 loops
 Do Span 471
   163.4  243.4  272.1  270.3  109.5  111.2  282.2  389.0
   360.6  219.6  124.0   61.8   67.6   87.4   27.3  224.2
   340.1  241.9  168.5  198.8  120.2  120.6  277.7   79.1

 Results of last two calculations
 4.850340602749970e+02  1.300000000000000e+01

 A53 ARM/Intel 32/64 bit
 MFLOPS for 24 loops
 Do Span 471
   451.4  191.4  243.2  272.4  144.9  144.5  749.4  411.1
   453.6  261.1  138.0  206.1  122.5  130.1  215.0  249.8
   411.6  395.4  241.7  248.1  152.8  118.7  317.2  103.7

 Results of last two calculations
 4.850340602751729e+02  1.300000000000000e+01
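The quoted range of 64 bit/32 bit ratios, and their geometric mean, can be reproduced from the two sets of 24 loop speeds above. The Python sketch below does this; the variable names are illustrative, and the second list is the "ARM/Intel 32/64 bit" run, which used the 64 bit code on this tablet.

```python
# Sketch: per-loop 64 bit / 32 bit speed ratios for the 24 Livermore kernels,
# using the MFLOPS values listed above (Do Span 471).
import math

mflops_32 = [
    163.4, 243.4, 272.1, 270.3, 109.5, 111.2, 282.2, 389.0,
    360.6, 219.6, 124.0, 61.8, 67.6, 87.4, 27.3, 224.2,
    340.1, 241.9, 168.5, 198.8, 120.2, 120.6, 277.7, 79.1,
]
mflops_64 = [
    451.4, 191.4, 243.2, 272.4, 144.9, 144.5, 749.4, 411.1,
    453.6, 261.1, 138.0, 206.1, 122.5, 130.1, 215.0, 249.8,
    411.6, 395.4, 241.7, 248.1, 152.8, 118.7, 317.2, 103.7,
]

ratios = [fast / slow for slow, fast in zip(mflops_32, mflops_64)]

# Geometric mean: exponential of the mean of the logarithms.
geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(f"min {min(ratios):.1f}  max {max(ratios):.1f}  geomean {geomean:.1f}")
# prints: min 0.8  max 7.9  geomean 1.5
```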

MemSpeed Benchmark - MemSpeedi.apk

MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers via two data arrays. It uses data volumes of 4 KBytes upwards to indicate performance via caches and RAM. A version was produced to run under Linux with a variation of calculations (mainly to use via OpenMP). The Android benchmark is the same as this but with fewer tests, but still reflecting cache and RAM speeds.

The 64 bit compilation was nearly twice as fast as the 32 bit version with double precision floating point calculations, using cached data, and provided a 33% increase from RAM. Corresponding single precision ratios were 2.6 and 2.0 times and integer ratios of 2.2 and 1.5. For floating point, the C program loop has four each of loads, stores, multiplies and adds, where the latter two are linked or fused into one function. At 64 bits, vector SIMD instructions were produced, leading to 2 each at double precision and 1 each at single precision (4 words in registers). With 32 bit integers, 8 scalar adds were reduced to two vector adds.

At 32 bits, the Nexus 7 was slightly faster using L1 cache, but the A8 gains averaged nearly 1.4 times from L2 cache and 2.4 times with RAM based data.

Tab 2 A8 64 bit maximum MFLOPS were somewhat faster than with similar calculations in the Linpack benchmark, but a modern Intel CPU could be three times faster, at the same CPU MHz, using SSE type SIMD instructions.

 ####################################################
 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

  ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.16
            Compiled for 32 bit ARM v7a

           Reading Speed in MBytes/Second
  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]
  KBytes   Dble   Sngl    Int   Dble   Sngl    Int

      16   1940    971   1693   2470   1278   2084 L1
      32   1879    955   1676   2378   1255   1967
      64   1801    938   1615   2254   1218   1912 L2
     128   1706    941   1620   2279   1224   1872
     256   1818    935   1570   2291   1155   1875
     512   1633    884   1451   2008   1132   1704
    1024   1276    781   1181   1454    938   1324 RAM
    4096   1335    808   1260   1533   1010   1386
   16384   1342    813   1270   1487   1013   1419
   65536   1346    809   1274   1546   1031   1252

         Total Elapsed Time   11.7 seconds

  ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.29
            Compiled for 64 bit ARM v8a

           Reading Speed in MBytes/Second
  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]
  KBytes   Dble   Sngl    Int   Dble   Sngl    Int

      16   4092   2198   3951   5293   3611   4408
      32   3753   2496   3630   4651   3300   3992
      64   3407   2388   3368   3715   3023   3677
     128   3496   2462   3521   4137   3139   3844
     256   3535   2481   3573   4199   3322   3911
     512   3054   2248   3126   3556   2548   3372
    1024   1714   1704   2029   2069   1854   2099
    4096   1832   1595   1841   1914   1780   1897
   16384   1844   1601   1850   1925   1798   1891
   65536   1859   1608   1837   1921   1795   1812

         Total Elapsed Time   10.2 seconds

  Max MFLOPS  512    624

 ####################################################
 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  ARM/Intel MemSpeed Benchmark 1.1 25-Apr-2015 12.24

           Reading Speed in MBytes/Second
  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]
  KBytes   Dble   Sngl    Int   Dble   Sngl    Int

      16   1856   1019   2537   2913   1459   2544
      32   1416    832   1327   1508    920   1345
      64   1286    779   1198   1418    908   1296
     128   1282    781   1195   1424    912   1305
     256   1278    774   1190   1433    878   1298
     512   1197    752   1122   1340    862   1216
    1024    833    626    822    903    695    857
    4096    463    420    456    463    440    459
   16384    459    426    453    455    435    458
   65536    463    430    411    462    443    452

         Total Elapsed Time   11.5 seconds
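The "Max MFLOPS 512 624" line in the 64 bit log appears consistent with dividing the fastest x[m]=x[m]+s*y[m] reading speed by the word size, on the assumption of one floating point operation per word read (two words read, one multiply and one add). The Python sketch below checks that arithmetic; the function name and the assumed relation are not taken from the benchmark source.

```python
# Sketch: relate MBytes/second to MFLOPS for x[m] = x[m] + s * y[m].
# Assumption (not confirmed by the source): one floating point operation
# per data word read.

def mflops_from_mbytes(mbytes_per_second, bytes_per_word):
    """Millions of floating point operations per second at the given read speed."""
    return mbytes_per_second / bytes_per_word

# Fastest 64 bit compilation speeds from the log: Dble 4092, Sngl 2496.
dp_mflops = mflops_from_mbytes(4092, 8)   # double precision, 8 byte words
sp_mflops = mflops_from_mbytes(2496, 4)   # single precision, 4 byte words

print(dp_mflops, sp_mflops)   # the log reports 512 and 624
```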

BusSpeed Benchmark - BusSpeedv7i.apk

BusSpeed Benchmark is particularly designed to identify reading data in bursts over buses. The program starts by reading a word, with address increments of 32 words for the next data. The increment is reduced to 16 words, then halved repeatedly until all data is read. In this case, maximum speed can be estimated as 16 times the MB/second measured at 16 word increments. Normally, an MP version is needed for maximum throughput.
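The access pattern and the maximum speed estimate can be sketched in a few lines of Python. This is an illustration of the description above, not the benchmark's code; the function names are hypothetical, and the 174 MB/second value is the Inc16 RAM speed logged for the 64 bit run in this section.

```python
# Sketch: BusSpeed style address increments and the 16 x Inc16 estimate.
# All function names here are hypothetical.

def stride_sequence(first_increment=32):
    """Increments used in turn: 32, 16, 8, 4, 2, then 1 (read all)."""
    increments = []
    inc = first_increment
    while inc >= 1:
        increments.append(inc)
        inc //= 2
    return increments

def strided_indices(n_words, increment):
    """Word indices visited in one pass at the given increment."""
    return list(range(0, n_words, increment))

def estimated_max_mbytes(inc16_mbytes_per_second):
    """Estimate of burst reading speed, taken as 16 times the Inc16 rate."""
    return 16 * inc16_mbytes_per_second

print(stride_sequence())
print(strided_indices(64, 16))
print(estimated_max_mbytes(174), "MB/second")
```

With 16 word increments only one word in 16 is used, so the estimate assumes that whole bursts are still being transferred from memory.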

Other than identifying burst data transfers, the final column, reading all data, is the major performance guide. Here, 64/32 bit comparison ratios were up to 2.0 from L1 cache, 1.5 from L2 cache and 1.25 from RAM.

At 32 bits, the Lenovo A8 was slower than the Nexus 7 on L1 cache based data, but the position was reversed on L2 cache tests, and particularly on RAM data transfers.

 ####################################################
 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
 Single Channel RAM, LPDDR3 666 MHz, 5.3 GB/second

  ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 10.57
            Compiled for 32 bit ARM v7a

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16    874    932   1814   2302   2355   2263 L1
      32    758    803   1309   1820   2323   2386
      64    653    671   1203   1741   2206   2332 L2
     128    603    620   1107   1693   2222   2351
     256    574    589   1075   1711   2211   2327
     512    332    372    681   1075   1863   2120
    1024    137    193    371    578   1322   2129 RAM
    4096    172    179    351    567   1151   2126
   16384    172    178    351    504   1117   2136
   65536    172    177    349    478    882   2129

         Total Elapsed Time    5.3 seconds

  ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 11.02
            Compiled for 64 bit ARM v8a

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3188   3635   3937   4327   4372   4462
      32   1478   1607   2246   3382   3853   4144
      64    600    622   1163   2011   2972   3585
     128    558    575   1056   1889   2892   3525
     256    538    550   1028   1826   2837   3260
     512    371    425    813   1490   2403   3202
    1024    136    196    382    728   1423   2750
    4096    170    177    346    669   1340   2652
   16384    169    174    341    678   1352   2663
   65536    168    174    341    676   1347   2611

         Total Elapsed Time    5.2 seconds

  Estimated maximum = 16 x 174 = 2784 MB/second

 ####################################################
 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  Android BusSpeed Benchmark 19-Oct-2012 17.29

     Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   2723   2420   3044   3364   3499   3500 L1
      32   1054   1087   1061   1382   1565   2145
      64    436    433    419    652    751   1160 L2
     128    345    337    337    542    633    943
     256    329    309    322    522    614    961
     512    339    299    311    506    574    937
    1024    170    168    180    269    349    629
    4096     59     55     84    127    176    338 RAM
   16384     56     56     83    125    173    335
   65536     56     56     82    125    174    334

         Total Elapsed Time    5.7 seconds

RandMem Benchmark - RandMemi.apk

RandMem benchmark carries out four tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests using 32 bit integers. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is not used, and by the size of preceding caches. For more details and further results see RandMem in Android Benchmarks.htm.
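The idea of using the same program structure for serial and random access can be sketched as follows. This Python illustration assumes an index array that is either left in order or shuffled; the function names, the seed and the read/write operation are hypothetical and are not taken from the benchmark source.

```python
# Sketch: the same loops driven by a serial or a shuffled index array.
# make_index, read_test and read_write_test are hypothetical names.
import random

def make_index(n_words, random_order, seed=1):
    """Return word indices 0..n_words-1, in order or shuffled."""
    index = list(range(n_words))
    if random_order:
        random.Random(seed).shuffle(index)
    return index

def read_test(data, index):
    """Read every word once, in the order given by index."""
    total = 0
    for i in index:
        total += data[i]
    return total

def read_write_test(data, index):
    """Read and rewrite every word once (illustrative operation only)."""
    for i in index:
        data[i] = data[i] + 1

data = list(range(1000))
serial_index = make_index(len(data), random_order=False)
random_index = make_index(len(data), random_order=True)

# Both orders touch the same words, so the checksums agree; only the
# memory access pattern, and hence the speed, differs.
serial_sum = read_test(data, serial_index)
random_sum = read_test(data, random_index)
print(serial_sum == random_sum)
```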

This program uses quite complex memory address indexing and Tab A8 32 bit and 64 bit versions were not that different overall, each one slightly faster on some tests.

At 32 bits, the A8 had the L2 cache and RAM speed advantages over the Nexus 7 on serial reading and writing but, on random access, the latter's larger L2 cache led to faster speeds on later cache based data sizes and also affected RAM data transfer speeds.

 ####################################################
 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

  ARM/Intel RandMem Benchmark 1.2 06-Aug-2015 12.29
            Compiled for 32 bit ARM v7a

    MBytes/Second Transferring 4 Byte Words
  Memory  Serial.......  Random.......
  KBytes   Read  Rd/Wrt   Read  Rd/Wrt

      16   2807    3606   2753    3595 L1
      32   2719    3433   1429    1930
      64   2615    3266    914    1166 L2
     128   2592    3243    705     828
     256   2570    3223    637     720
     512   2367    2684    237     347
    1024   2137    1855    120     163 RAM
    4096   1918    1658     83      97
   16384   2152    1665     74      85
   65536   2104    1652     72      64

         Total Elapsed Time   11.6 seconds

  ARM/Intel RandMem Benchmark 1.2 06-Aug-2015 12.32
            Compiled for 64 bit ARM v8a

    MBytes/Second Transferring 4 Byte Words
  Memory  Serial.......  Random.......
  KBytes   Read  Rd/Wrt   Read  Rd/Wrt

      16   3865    3033   3798    3027
      32   3622    2760   3105    2734
      64   3094    2803   1011    1077
     128   3074    2740    776     801
     256   3050    2771    718     693
     512   2420    2463    270     371
    1024   1322    1853    131     164
    4096   1754    1598     87     100
   16384   1791    1586     75      91
   65536   1856    1609     57      68

         Total Elapsed Time   14.6 seconds

 ####################################################
 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  ARM/Intel RandMem Benchmark 1.1 25-Apr-2015 12.33

    MBytes/Second Transferring 4 Byte Words
  Memory  Serial.......  Random.......
  KBytes   Read  Rd/Wrt   Read  Rd/Wrt

      16   2521    3175   2490    3038 L1
      32   1427    1451   1218    1446
      64   1133    1052    853     907 L2
     128   1039     871    646     650
     256   1028     909    543     518
     512   1025     895    499     502
    1024    700     489    242     236
    4096    487     282     90      88 RAM
   16384    483     281     71      70
   65536    478     274     63      62

         Total Elapsed Time   11.3 seconds

MP-MFLOPS Benchmarks - MP-MFLOPSi and MP-MFLOPS2i

MP-MFLOPS arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2 and 32 operations per input data word, using 1, 2, 4 and 8 threads. Data sizes are limited to three, to use L1 cache, L2 cache and RAM at 12.8, 128 and 12800 KB (3200, 32000 and 3200000 single precision floating point words). Each thread uses the same calculations but accessing different segments of the data. The program checks for consistent numeric results, primarily to show that all calculations are carried out and can be run. The numeric results start with values of 1.0, with subsequent calculations reducing the values, the amount depending on the number of calculations. An example log file is shown below.
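As an illustration of the arithmetic form and the data sizes quoted above, a minimal Python sketch follows. It is not the benchmark's code: the function names, the constants passed in and the even division of words between threads are assumptions. The expression as written contains eight arithmetic operations; the 2 and 32 operation variants presumably shorten or extend the same pattern.

```python
# Sketch: the quoted expression, the data sizes and a per-thread split.
# triad_update, kbytes and segments are hypothetical names.

def triad_update(x, a, b, c, d, e, f):
    """x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, in place."""
    for i in range(len(x)):
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f

def kbytes(words, bytes_per_word=4):
    """Data size in KB (1000 bytes) for single precision words."""
    return words * bytes_per_word / 1000

def segments(n_words, n_threads):
    """Assumed split: each thread works on its own contiguous segment."""
    bounds = []
    for t in range(n_threads):
        start = (t * n_words) // n_threads
        end = ((t + 1) * n_words) // n_threads
        bounds.append((start, end))
    return bounds

print([kbytes(w) for w in (3200, 32000, 3200000)])   # 12.8, 128 and 12800 KB
print(segments(3200, 4))

values = [1.0]
triad_update(values, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)   # illustrative constants
print(values)
```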

The original benchmark runs too fast on later CPUs, so a revised version, MP-MFLOPS2, was produced, with 50 times more calculations, producing the expected reduction in result values, as also shown below. Those from the 32 bit benchmark are slightly different to those from 64 bit operation.

  ARM/Intel MP-MFLOPS v7 Benchmark V1.2 09-Aug-2015 21.20
            Compiled for 64 bit ARM v8a

     FPU Add & Multiply using 1, 2, 4 and 8 Threads

          2 Ops/Word              32 Ops/Word
  KB     12.8     128   12800    12.8     128   12800
  MFLOPS
  1T      701     695     583    1391    1394    1349
  2T     1347    1370     712    2792    2798    2743
  4T     1641    1544     716    3587    3491    3374
  8T     1370    1803     693    4001    4255    5016
  Results x 100000, 0 indicates ERRORS
  1T    86735   98519   99984   79897   97639   99975
  2T    86735   98519   99984   79897   97639   99975
  4T    86735   98519   99984   79897   97639   99975
  8T    86735   98519   99984   79897   97639   99975

          Total Elapsed Time    3.1 seconds

  MP-MFLOPS2i 32 bit
  1T    40392   76406   99700   35218   66014   99520
  MP-MFLOPS2i 64 bit
  1T    40392   76406   99700   35206   66015   99520

MP-MFLOPS Benchmark Results

Except for producing faster results with data in RAM, the 32 bit Tab 2 performance was, again, similar to the Cortex-A9 based Nexus 7. At 32 operations per word, the Tab 2 was just over twice as fast at 64 bits, rising to 3.7 times at 2 operations per word with cache based data. The reason is that vector SIMD instructions were produced at 64 bits, instead of scalars.

For further comparisons see NEON-MFLOPS-MP Benchmark and Assembly Code below.

 ####################################################
 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

  ARM/Intel MP-MFLOPS2 Benchmark V2.2 09-Aug-2015 21.17
            Compiled for 32 bit ARM v7a

     FPU Add & Multiply using 1, 2, 4 and 8 Threads

          2 Ops/Word              32 Ops/Word
  KB     12.8     128   12800    12.8     128   12800
  MFLOPS
  1T      190     190     184     670     672     664
  2T      377     378     370    1343    1345    1329
  4T      707     755     725    2657    2669    2621
  8T      722     736     714    2640    2672    2631

          Total Elapsed Time  113.0 seconds

  ARM/Intel MP-MFLOPS2 Benchmark V2.2 09-Aug-2015 21.24
            Compiled for 64 bit ARM v8a

     FPU Add & Multiply using 1, 2, 4 and 8 Threads

          2 Ops/Word              32 Ops/Word
  KB     12.8     128   12800    12.8     128   12800
  MFLOPS
  1T      705     701     636    1398    1394    1362
  2T     1376    1395     942    2794    2797    2757
  4T     2063    2602     962    5491    5546    5336
  8T     2474    2611     957    5367    5500    5417

          Total Elapsed Time   51.6 seconds

 ####################################################
 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  ARM/Intel MP-MFLOPS2 Benchmark V2.1 28-Apr-2015 17.44

     FPU Add & Multiply using 1, 2, 4 and 8 Threads

          2 Ops/Word              32 Ops/Word
  KB     12.8     128   12800    12.8     128   12800
  MFLOPS
  1T      188     156     116     598     578     574
  2T      365     319     197    1195    1161    1145
  4T      682     709     237    2372    2345    2249
  8T      678     731     237    2361    2381    2254

          Total Elapsed Time  135.0 seconds

MP-Whetstone Benchmark - MP-WHETSi

This is a multithreaded version of the Whetstone Benchmark above. Tab 2 A8-50 performance on the 32 bit version was, again, similar to the Nexus 7. At 64 bits, the Fixpt test was clearly almost optimised out, but this makes little difference to the overall MWIPS rating, which was 2.25 times that of the 32 bit benchmark.

 ####################################################
 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

  ARM/Intel MP-Whetstone Benchmark V1.2 10-Aug-2015 11.30
            Compiled for 32 bit ARM v7a

                 Using 1, 2, 4 and 8 Threads

       MWIPS   MFLOPS   MFLOPS   MFLOPS      Cos      Exp    Fixpt       If    Equal
                    1        2        3     MOPS     MOPS     MOPS     MOPS     MOPS

  1T   676.4    275.9    281.9    147.9     35.4      5.3    600.3    901.0    285.5
  2T  1362.5    533.8    561.7    298.0     70.9     10.8   1203.1   1838.9    574.0
  4T  2698.6    903.9   1071.7    594.4    141.2     21.5   2346.1   3305.5   1138.5
  8T  2830.1   1463.2   1393.0    614.2    152.5     21.9   3243.9   4418.3   1171.4

  Overall Seconds   4.95 1T,   4.94 2T,   5.11 4T,  10.09 8T

  ARM/Intel MP-Whetstone Benchmark V1.2 10-Aug-2015 11.34
            Compiled for 64 bit ARM v8a

                 Using 1, 2, 4 and 8 Threads

       MWIPS   MFLOPS   MFLOPS   MFLOPS      Cos      Exp    Fixpt       If    Equal
                    1        2        3     MOPS     MOPS     MOPS     MOPS     MOPS

  1T  1524.8    328.6    348.8    297.6     37.3     19.9  1462579   1867.2   1238.0
  2T  3062.5    688.8    697.9    596.0     75.5     39.8  2097113   3726.7   2481.3
  4T  6085.4   1214.9   1360.5   1185.4    150.5     79.4  2449153   7055.0   4951.8
  8T  6222.4   1495.2   1545.6   1204.2    152.2     80.6  3869846   9218.8   5154.1

  Overall Seconds   4.92 1T,   4.90 2T,   5.05 4T,   9.97 8T

 ####################################################
 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  ARM/Intel MP-Whetstone Benchmark V1.1 30-Apr-2015 21.32

                 Using 1, 2, 4 and 8 Threads

       MWIPS   MFLOPS   MFLOPS   MFLOPS      Cos      Exp    Fixpt       If    Equal
                    1        2        3     MOPS     MOPS     MOPS     MOPS     MOPS

  1T   602.2    242.3    242.3    140.2     27.2      4.9    482.8   1425.2    239.1
  2T  1208.7    481.2    484.2    280.8     55.0      9.9    970.0   2869.6    478.7
  4T  2398.7    805.4    966.7    562.5    109.5     19.5   1938.2   5722.5    957.1
  8T  2429.1    974.6   1076.2    562.4    110.9     19.7   1981.5   5816.1    963.6

  Overall Seconds   4.94 1T,   4.93 2T,   5.08 4T,   9.93 8T

MP Dhrystone Benchmark - MP-Dhryi.apk

This is a multithreaded version of the Dhrystone Benchmark above. Tab 2 A8-50 performance on the 32 bit version was, again, similar to the Nexus 7.

Each thread executes the same code but some variables are shared and that can lead to non-linear MP activity, in this case, two CPUs producing increased throughput of 1.8 times or less. At least, single threaded performance is essentially the same as the non-threaded benchmark.

 ####################################################
 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

  ARM/Intel MP-Dhrystone 2 Benchmark V1.2 10-Aug-2015 11.32
            Compiled for 32 bit ARM v7a

                 Using 1, 2, 4 and 8 Threads

  Threads                        1        2        4        8
  Seconds                     0.64     0.71     0.90     1.70
  Dhrystones per Second    2481286  4495793  7094180  7540038
  VAX MIPS rating             1412     2559     4038     4291

  ARM/Intel MP-Dhrystone 2 Benchmark V1.2 10-Aug-2015 11.36
            Compiled for 64 bit ARM v8a

                 Using 1, 2, 4 and 8 Threads

  Threads                        1        2        4        8
  Seconds                     0.89     1.06     1.64     3.24
  Dhrystones per Second    4476736  7574470  9768350  9861922
  VAX MIPS rating             2548     4311     5560     5613

 ####################################################
 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  ARM/Intel MP-Dhrystone 2 Benchmark V1.1 04-May-2015 17.18

  Threads                        1        2        4        8
  Seconds                     0.78     0.95     1.27     2.44
  Dhrystones per Second    2572642  4214238  6280420  6565767
  VAX MIPS rating             1464     2399     3575     3737
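The VAX MIPS ratings in these logs follow from the Dhrystones per second figures, using the customary divisor of 1757 (the VAX 11/780 score), and the thread scaling can be derived in the same pass. A short Python sketch using the Tab 2 A8-50 figures above follows; the names are illustrative.

```python
# Sketch: VAX MIPS rating and throughput relative to one thread, from the
# Dhrystones per second figures logged above.

VAX_11_780_DHRYSTONES_PER_SECOND = 1757   # reference machine rated at 1 MIPS

def vax_mips(dhrystones_per_second):
    """Dhrystone rating relative to the VAX 11/780."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND

runs = {
    "32 bit": {1: 2481286, 2: 4495793, 4: 7094180, 8: 7540038},
    "64 bit": {1: 4476736, 2: 7574470, 4: 9768350, 8: 9861922},
}

for mode, per_thread_count in runs.items():
    base = per_thread_count[1]
    for threads, dps in per_thread_count.items():
        # Scaling is throughput relative to the single thread run.
        print(f"{mode} {threads}T  {round(vax_mips(dps)):5d} VAX MIPS"
              f"  x{dps / base:.2f}")
```

The two thread ratios come out at about 1.8 for the 32 bit run and about 1.7 at 64 bits, in line with the "1.8 times or less" noted in the text.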

MP-BusSpeed Benchmark - MP-BusSpdi.apk and MP-BusSpd2i.apk

This is a multithreaded version of the BusSpeed Benchmark, with data sizes considered suitable to measure performance from L1 cache, L2 cache and RAM.

The original MP-BusSpd benchmark read all the data with every thread, each starting at the beginning. With some devices having large shared L2 caches, some of the RAM based data could be cached, sometimes indicating an impossible performance level. All threads in the new version, MP-BusSpd2, read all the data, but with staggered starting points. The difference is not that great on the Tab 2 A8, as indicated below.
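One way to picture the staggered start is for each thread to begin part way through the data and wrap round to the beginning, so that all threads read every word but rarely the same one at the same time. The Python sketch below assumes evenly spaced starting points; that spacing and the function name are assumptions, as the source does not state how the offsets are chosen.

```python
# Sketch: staggered starting points, assuming evenly spaced offsets.
# staggered_order is a hypothetical helper.

def staggered_order(n_words, thread, n_threads):
    """Word indices for one thread: start part way in, wrap, read all."""
    start = (thread * n_words) // n_threads
    return [(start + i) % n_words for i in range(n_words)]

for thread in range(4):
    print(thread, staggered_order(8, thread, 4))
```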

Just considering MP-BusSpd2 and reading all data, at 32 bits, the Cortex-A53/Cortex-A9 L1 cache, L2 cache and RAM performance ratios are around 0.8, 3.0 and 5.5. A53 64/32 bit ratios average 2.2, 1.8 and 1.0.

Maximum 64 bit memory transfer rate is 4.328 GB/second, or 4.448, based on 16 word increments, out of a possible 5.33 - See BusSpeed. Note that multithreading increases memory throughput by more than 60%.
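These figures can be checked against the 64 bit MP-BusSpd2 results for the 49152 KB data size in this section. The Python sketch below is illustrative; the variable names are assumptions and the 5.33 GB/second limit is the one quoted in the text.

```python
# Sketch: RAM throughput arithmetic for MP-BusSpd2 at 64 bits, 49152 KB.
# Speeds in MB/second, copied from the results in this section.

PEAK_MB_PER_SECOND = 5330    # "possible 5.33" GB/second quoted in the text
read_all_1_thread = 2653     # RdAll, 1T
read_all_4_threads = 4328    # RdAll, 4T (fastest RdAll)
inc16_8_threads = 278        # Inc16, 8T (fastest Inc16)

estimate_from_inc16 = 16 * inc16_8_threads
multithreading_gain = read_all_4_threads / read_all_1_thread
share_of_peak = read_all_4_threads / PEAK_MB_PER_SECOND

print(estimate_from_inc16, f"{multithreading_gain:.2f}", f"{share_of_peak:.2f}")
# prints: 4448 1.63 0.81
```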

 ####################################################
 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
 Single Channel RAM, LPDDR3 666 MHz, 5.3 GB/second

  ARM/Intel MP-BusSpd Benchmark V1.2 12-Aug-2015 16.13
            Compiled for 32 bit ARM v7a

     MB/Second Reading Data, 1, 2, 4 and 8 Threads

       KB      Inc32   Inc16    Inc8    Inc4    Inc2   RdAll

     12.3 1T    1849    2140    2079    2211    2270    2297
          2T    3663    4252    4294    4400    4370    4580
          4T    4630    5574    5691    5893    6015    6083
          8T    5331    5775    6033    6622    7968    8023
    122.9 1T     597     621    1119    1815    2135    2237
          2T     869     943    1644    2992    3740    4412
          4T     949     951    1922    3736    6468    7779
          8T     948     978    1911    3717    6464    7542
    12288 1T     123     174     344     678    1215    1840
          2T     243     310     672    1332    2383    3974
          4T     302     285     594    1282    2271    4606
          8T     279     295     654    1198    2749    4660

          Total Elapsed Time   12.8 seconds

  ARM/Intel MP-BusSpd2 Benchmark V1.2 12-Aug-2015 16.14
            Compiled for 32 bit ARM v7a

     12.3 1T    1877    2124    2176    2266    2296    2343
          2T    3625    4198    4341    4468    4536    4613
          4T    5733    7541    8293    8830    8024    9042
          8T    2985    3829    7438    6117    8108    8923
    122.9 1T     604     625    1142    1846    2150    2284
          2T     924     950    1793    3277    4270    4504
          4T     962     989    1939    3765    6798    8862
          8T     965     993    1933    3748    6651    8239
    49152 1T     165     175     344     677    1285    1979
          2T     234     238     482     961    1907    3547
          4T     266     298     562    1224    2296    4478
          8T     272     275     538    1098    2149    4282

          Total Elapsed Time   48.8 seconds

  ARM/Intel MP-BusSpd Benchmark V1.2 12-Aug-2015 16.17
            Compiled for 64 bit ARM v8a

     12.3 1T    3247    3895    4031    4182    4286    4367
          2T    5676    7211    7771    8320    8539    7887
          4T   10390   13919   14891   14949   15595   12977
          8T    9693   12748   14246   14325   14434   16076
    122.9 1T     577     575    1107    1884    2882    3568
          2T     924     939    1827    3380    5554    6890
          4T     959     972    1897    3659    6554    8508
          8T     956     980    1913    3814    7206   11996
    12288 1T     133     182     351     690    1381    2720
          2T     309     282     669    1329    2625    5265
          4T     281     286     715    1383    2614    5040
          8T     303     341     670    1180    2303    4354

          Total Elapsed Time   13.1 seconds

  ARM/Intel MP-BusSpd2 Benchmark V1.2 12-Aug-2015 16.18
            Compiled for 64 bit ARM v8a

     12.3 1T    2610    2472    2586    2727    2748    5841
          2T    4404    4681    4994    5369    5420   11297
          4T    6546    8125    9105   10243   10319   20610
          8T    3380    4023    7919    7146    9871   19852
    122.9 1T     604     621    1110    1872    2446    5100
          2T     919     948    1855    3433    4853   10037
          4T     961     974    1984    3924    7491   14935
          8T     963     942    1931    3915    7572   14689
    49152 1T     173     177     340     692    1300    2653
          2T     266     241     479     968    1883    3724
          4T     304     277     556    1130    2126    4328
          8T     279     278     544    1138    2179    4275

          Total Elapsed Time   49.4 seconds

 ####################################################
 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  ARM/Intel MP-BusSpd2 Benchmark V1.0 24-Jul-2015 15.59

     MB/Second Reading Data, 1, 2, 4 and 8 Threads

       KB      Inc32   Inc16    Inc8    Inc4    Inc2   RdAll

     12.3 1T    2166    2774    3181    3307    3377    3263
          2T    3924    5188    5207    5754    5759    5805
          4T    7570   10011   10252   11165   11375   11777
          8T    3510    4786    9011    8318   11351   11544
    122.9 1T     383     409     359     558     663     983
          2T     525     541     520     741    1241    1814
          4T     739     752     753    1219    1590    2776
          8T     735     741     753    1218    1607    2737
    49152 1T      56      51      81     126     172     330
          2T      65      67     107     196     335     620
          4T      70      68     108     215     426     835
          8T      70      68     109     215     428     851

          Total Elapsed Time   48.2 seconds

MP-RandMem Benchmark - MP-RndMemi.apk

This is a multithreaded version of the RandMem Benchmark. Probably because performance depends on the complex indexing used, the A53 is mostly not much faster at 64 bits. At 32 bits, it is clearly faster than the Cortex-A9 with serial access, using L2 cache and RAM, but the latter is comparable on random reading and writing.

 ####################################################
 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

  ARM/Intel MP-RndMem Benchmark V1.2 12-Aug-2015 17.13
            Compiled for 32 bit ARM v7a

      MB/Second Using 1, 2, 4 and 8 Threads

       KB      SerRD SerRDWR   RndRD RndRDWR

    12.29 1T    2894    2438    2887    2433
          2T    5665    2402    5663    2403
          4T   10922    2369   11100    2310
          8T   10065    2293   10648    2265
    122.9 1T    2681    2368     757     758
          2T    5351    2360    1398     769
          4T   10056    2308    2121     772
          8T    8838    2351    1916     742
    12288 1T    2309    1662      80      78
          2T    3986    1683     164      73
          4T    5419    1684     283      82
          8T    4658    1694     279      82

  ARM/Intel MP-RndMem Benchmark V1.2 12-Aug-2015 17.15
            Compiled for 64 bit ARM v8a

    12.29 1T    4445    3109    4455    3089
          2T    8010    3100    8072    3105
          4T   15909    3057   14711    3040
          8T   14764    3036   14570    3037
    122.9 1T    3457    2888     842     876
          2T    6537    2924    1524     876
          4T   11095    2892    2119     861
          8T   11729    2916    2080     874
    12288 1T    2475    1679      81      78
          2T    4155    1713     163      73
          4T    5503    1711     285      89
          8T    4519    1717     281      89

          Total Elapsed Time   48.1 seconds

 ####################################################
 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  ARM/Intel MP-RndMem Benchmark V1.1 06-May-2015 11.59

      MB/Second Using 1, 2, 4 and 8 Threads

       KB      SerRD SerRDWR   RndRD RndRDWR

    12.29 1T    3060    2001    2867    1904
          2T    5459    1879    5463    1867
          4T   10797    1852   10537    1856
          8T   10090    1802   10608    1813
    122.9 1T     968     823     588     547
          2T    1749     785     902     618
          4T    2716     812    1328     672
          8T    2733     810    1407     673
    12288 1T     329     274      90      82
          2T     636     272     112      82
          4T     849     271     128      82
          8T     869     271     126      81

          Total Elapsed Time   45.4 seconds

NEON-Linpack Benchmark - NEON-Linpacki.apk

This is identical to the Linpack Benchmark above, except that the main calculations, in the performance dependent daxpy() function, were replaced using NEON intrinsic functions. These only operate on single precision floating point numbers. Results from 32 bit and 64 bit compilations were similar, as the programs use identical intrinsic functions. The speed of the original 64 bit benchmark (64b Above in the results) is also not that different. That version is compiled using fmadd, the scalar floating point fused multiply-add instruction, compared with the NEON vmla vector multiply accumulate instruction (4 words at a time). See Assembly Code below.

####################################################

Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

ARM/Intel NEON Linpack Benchmark V 1.2 13-Aug-2015

Compiled for   32b ARM v7a   64b ARM v8a   64b Above
SP MFLOPS              407           505         482

####################################################

ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

ARM/Intel NEON Linpack Benchmark V 1.0 03-May-2015

SP MFLOPS              347


NeonSpeed Benchmark - NeonSpeedi.apk

The benchmark carries out the same calculations as MemSpeed Benchmark, repeating the standard single precision multiply/add and integer tests with two adds, for comparison with those via NEON intrinsic functions.

As with NEON-Linpack, many results from 32 bit and 64 bit compilations, via NEON instructions, were similar. Compared with normal code, NEON functions produced significant performance gains at 32 bits (around four times on the floating point test using L1 cache), but gains were limited to around 30% at 64 bits. NEON tests were faster than those on the Cortex-A9 at all memory sizes, approaching four times faster when using RAM.

####################################################

Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.32
Compiled for 32 bit ARM v7a

       Vector Reading Speed in MBytes/Second

  Memory   Float v=v+s*v     Int v=v+v+s      Neon v=v+v
  KBytes    Norm    Neon    Norm    Neon   Float     Int

      16     971    3853    1807    4059    3957    4397
      32     970    3812    1800    3983    3891    4323
      64     927    3228    1605    3038    3269    3521
     128     926    3321    1681    3343    3354    3596
     256     936    3386    1693    3449    3413    3667
     512     898    2889    1578    2996    2927    3118
    1024     794    1859    1345    2057    1996    1924
    4096     794    1796    1250    1788    1813    1835
   16384     792    1773    1270    1820    1829    1864
   65536     796    1811    1289    1852    1832    1880

          Total Elapsed Time   11.3 seconds

ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.37
Compiled for 64 bit ARM v8a

       Vector Reading Speed in MBytes/Second

  Memory   Float v=v+s*v     Int v=v+v+s      Neon v=v+v
  KBytes    Norm    Neon    Norm    Neon   Float     Int

      16    3054    4055    3605    4376    4911    5094
      32    2922    3787    3435    4198    4546    4682
      64    2795    3514    3259    3658    4050    4116
     128    2886    3529    3373    3924    4148    3963
     256    2883    3641    3264    3942    4193    4276
     512    2454    3165    2985    3385    3586    3542
    1024    1633    2000    1835    2043    2114    2105
    4096    1738    1893    1899    1900    1956    1955
   16384    1757    1870    1886    1802    1921    1846
   65536    1755    1875    1870    1903    1936    1937

Max MFLOPS   764    1014

          Total Elapsed Time   10.2 seconds

####################################################

ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

ARM/Intel NeonSpeed Benchmark V1.1 09-May-2015 18.07

       Vector Reading Speed in MBytes/Second

  Memory   Float v=v+s*v     Int v=v+v+s      Neon v=v+v
  KBytes    Norm    Neon    Norm    Neon   Float     Int

      16     881    2440    2501    3334    3206    3465
      32     901    1868    1705    2260    2083    2186
      64     801    1395    1365    1573    1548    1581
     128     784    1282    1278    1405    1389    1411
     256     787    1279    1285    1420    1380    1409
     512     777    1266    1267    1409    1370    1394
    1024     604     786     762     769     770     828
    4096     458     479     477     463     486     488
   16384     436     447     448     469     470     469
   65536     450     472     469     240     482     483

          Total Elapsed Time   11.5 seconds
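The Max MFLOPS figures in the 64 bit log appear to follow from the 16 KB Float speeds. Assuming that each result of v=v+s*v involves reading two 4 byte words (8 bytes) for one multiply and one add (2 operations), MFLOPS is MBytes/Second divided by 4. The sketch below encodes that assumption. The function name mflops_from_mbps is hypothetical, and the relationship is inferred from the published figures, not taken from the benchmark source.

```c
/* Estimate MFLOPS for v = v + s*v from a measured reading speed,
   assuming 8 bytes read and 2 floating point operations per result. */
double mflops_from_mbps(double mbytes_per_second)
{
    const double bytes_per_result = 8.0;  /* one float from each of two arrays */
    const double flops_per_result = 2.0;  /* one multiply and one add */
    return mbytes_per_second * flops_per_result / bytes_per_result;
}
```

On this basis, 3054 MB/second gives 763.5 and 4055 MB/second gives 1013.75, in line with the reported 764 and 1014 MFLOPS. The 32 bit speeds of 971 and 3853 MB/second give about 243 and 963, matching the MFLOPS quoted for NeonSpeed in the Assembly Code section.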


NEON-MFLOPS-MP Benchmark - NEON-MFLOPS2i-MP.apk

NEON-MFLOPS-MP benchmark is the same as MP-MFLOPS, except using NEON intrinsic functions for the calculations. For comparison purposes, single thread MP-MFLOPS results are included below (1TNN), with details of program source code and CPU assembly instructions used below.

Using NEON intrinsic functions, Tab 2 A8 performance of the 32 bit compilations was up to 3.2 times faster than the original MP-MFLOPS benchmark, but the NEON source code carries out four times as many calculations within the test loops. Results were also similar to those on the Nexus 7, except for RAM speed, measured at 12800 KB, where the Tab 2 excelled. In the 64 bit compilation, the loops of the original benchmark were unrolled four ways, using the same type of vector instructions as the NEON version (see Assembly Code), and speeds at 2 operations per word were similar. The same unrolling applied for calculations at 32 operations per word, except that the original incurred heavy addressing overheads, using 10 vector registers compared with 32 via NEON, leading to the NEON version being measured as twice as fast. In both cases, the instruction count was reduced by using fused multiply-add or multiply-subtract.

The NEON 64 bit compilation produced a small performance gain over 32 bit results at 2 operations per word, but near double speed at 32 operations per word, where the 32 bit version suffers from having fewer registers available for the variables. Using one core, maximum speed was 2.77 GFLOPS, rising to 10.8 GFLOPS via four cores. The one core speed equates to just over two floating point operations per clock cycle. This is disappointing compared with Intel processors, from the Core 2 onwards, which achieve 6 per clock cycle out of a maximum of 8 with SSE SIMD code (see Linux results).

September 2015 - New best score from a 2 GHz Qualcomm Snapdragon 810 (Cortex-A57) with Android 5.0.2, at 64 bits. Performance, with 8 threads, is up to 23.6 GFLOPS, and up to nearly 3.5 results per clock cycle, using one core.
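The per clock cycle figures quoted above are MFLOPS divided by the CPU clock speed in MHz. The following is a minimal sketch of that arithmetic, using hypothetical helper names and assuming that each CPU ran at its nominal clock speed throughout the test.

```c
/* Floating point results per clock cycle for one core, assuming the
   core runs at its nominal clock speed (MFLOPS divided by MHz). */
double flops_per_clock(double mflops, double clock_mhz)
{
    return mflops / clock_mhz;
}

/* Speed ratio between two measurements, for example NEON intrinsic
   functions versus the original compiled code. */
double speedup(double new_mflops, double old_mflops)
{
    return new_mflops / old_mflops;
}
```

For example, 2766 MFLOPS at 1300 MHz is about 2.13 per clock cycle on the Cortex-A53, and 6943 MFLOPS at 2000 MHz is about 3.47 on the Snapdragon 810. Similarly, 619 MFLOPS against 190 MFLOPS for the 32 bit compilations is a ratio of just over 3.2.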

####################################################

Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 13-Aug-2015 16.35
Compiled for 32 bit ARM v7a

  FPU Add & Multiply using 1, 2, 4 and 8 Threads

            2 Ops/Word              32 Ops/Word
   KB    12.8     128   12800    12.8     128   12800
 MFLOPS
   1T     619     613     575    1444    1446    1426
   2T    1174    1206     889    2894    2902    2839
   4T    1585    1616     901    5679    5726    5596
   8T    2075    2130     944    5400    5585    5519

          Total Elapsed Time   25.8 seconds

 1TNN     190     190     184     670     672     664

ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 13-Aug-2015 16.38
Compiled for 64 bit ARM v8a

  FPU Add & Multiply using 1, 2, 4 and 8 Threads

            2 Ops/Word              32 Ops/Word
   KB    12.8     128   12800    12.8     128   12800
 MFLOPS
   1T     726     745     647    2766    2774    2639
   2T    1397    1402     903    5523    5552    5371
   4T    1871    1930     898   10780   10479   10439
   8T    2496    2876    1011    9736   10679    9900

          Total Elapsed Time   15.1 seconds

 1TNN     705     701     636    1398    1394    1362

####################################################

ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

ARM/Intel NEON-MFLOPS2-MP Benchmark V2.1 13-May-2015 12.24

  FPU Add & Multiply using 1, 2, 4 and 8 Threads

            2 Ops/Word              32 Ops/Word
   KB    12.8     128   12800    12.8     128   12800
 MFLOPS
   1T     657     407     132    1077    1074    1053
   2T    1265     817     222    2147    2150    2078
   4T    2024    1695     234    4214    4276    3555
   8T    2435    2495     234    4196    4100    3523

          Total Elapsed Time   39.0 seconds

 1TNN     188     156     116     598     578     574

####################################################

Quad-core 2 GHz Qualcomm Snapdragon 810, Android 5.0.2

ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 16-Sep-2015 17.59
Compiled for 64 bit ARM v8a

  FPU Add & Multiply using 1, 2, 4 and 8 Threads

            2 Ops/Word              32 Ops/Word
   KB    12.8     128   12800    12.8     128   12800
 MFLOPS
   1T    2811    3126    1089    6943    6589    6342
   2T    2488    4114    1541   12084   10559    8809
   4T    4759    5480    2038   16516   14826   11960
   8T    4840    8985    2452   22082   23563   12461

          Total Elapsed Time    7.6 seconds


NEON-Linpack-MP Benchmark - NEON-Linpacki-MP.apk

This version uses mainly the same C programming code as the single precision floating point NEON compilation above. It is run on 100x100, 500x500 and 1000x1000 matrices, unthreaded and using 1, 2 and 4 separate threads. The code differences were slight changes to allow a higher level of parallelism. The initial 100x100 Linpack benchmark is only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for this is the same as 100x100, except users are allowed to use their own linear equation solver.

Unlike the NEON MP MFLOPS benchmark, which carries out the same multiply/add calculations, this program can run much slower using multiple threads. This is due to the overhead of creating and closing threads too frequently. Note the difference between the unthreaded speeds and those using one thread, for example 461 versus 22 MFLOPS at N = 100 with the 32 bit compilation.

Ignoring multiple thread speeds, with the 32 bit variety, the Tab 2 A8 is notably faster than the Nexus 7 at N = 500 and 1000, due to the larger L2 cache and faster RAM.

MFLOPS 0 to 4 Threads, N 100, 500, 1000

####################################################

Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

ARM/Intel Linpack NEON SP MP Benchmark 1.2 13-Aug-2015 12.52
Compiled for 32 bit ARM v7a

 Threads      None         1         2         4

 N 100      460.74     22.35     23.16     23.82
 N 500      480.63    336.52    339.94    303.66
 N 1000     470.02    405.86    403.01    405.98

ARM/Intel Linpack NEON SP MP Benchmark 1.2 13-Aug-2015 12.57
Compiled for 64 bit ARM v8a

 Threads      None         1         2         4

 N 100      548.67     27.70     33.93     37.00
 N 500      470.04    285.95    297.79    301.67
 N 1000     519.02    441.84    443.47    441.91

####################################################

ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

ARM/Intel Linpack NEON SP MP Benchmark 14-May-2015 15.40

 Threads      None         1         2         4

 N 100      385.49     28.79     29.06     29.25
 N 500      272.07    184.85    183.70    183.18
 N 1000     147.09    131.92    132.44    130.05


FFT Benchmarks - fft1.apk, fft3c.apk

The benchmarks run code for single and double precision Fast Fourier Transforms of size 1024 to 1048576 (1K to 1024K), each one being run three times to identify variance. Results are displayed and saved in a log file (FFT-tests.txt), with FFT running time in milliseconds. Besides Android, the benchmarks are available to run via Windows and Linux. Two versions are available: FFT1, the original version, and FFT3c, with optimised C code. Further details, results, and links for benchmarks and source code are in FFTBenchmarks.htm. Below is an example of results.

Kindle Fire HDX 7, 2.2 GHz Quad Core Qualcomm Snapdragon 800

ARM/Intel FFT Benchmark 3c.0 08-Sep-2015 23.15
Compiled for 32 bit ARM v7a

 Size                         milliseconds
    K       Single Precision              Double Precision

    1     0.155     0.352     1.341     0.087     0.073     0.073
    2     0.812     0.814     0.750     0.201     0.187     0.251
    4     1.751     1.658     1.776     0.414     0.405     0.443
    8     3.712     1.083     1.065     0.930     0.899     0.890
   16     2.880     3.356     2.430     2.579     2.658     2.380
   32     6.124     6.541     5.605     5.907     6.070     5.681
   64    13.430    12.566    12.774    13.792    13.556    13.997
  128    30.737    27.408    27.132    33.318    33.088    33.071
  256    64.472    63.394    64.690    73.288    72.546    72.786
  512   153.609   150.383   156.046   155.788   156.304   163.178
 1024   315.283   306.323   307.409   369.426   337.074   336.684

      1024 Square Check   Maximum Noise   Average Noise
      SP   9.999520e-01    3.346482e-06    4.565234e-11
      DP   1.000000e+00    1.133294e-23    1.428110e-28

          Total Elapsed Time    6.5 seconds
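The test plan described above (sizes doubling from 1K to 1024K, each timed three times) can be sketched in C as follows. This is not the benchmark's code: fft_sizes, time_runs_ms and the callback type are hypothetical names, and the standard C clock() function, which measures processor time, is assumed as the timer.

```c
#include <time.h>

#define RUNS_PER_SIZE 3

/* Fill sizes[] with 1K, 2K, 4K ... 1024K points and return the count. */
int fft_sizes(int *sizes, int capacity)
{
    int count = 0;
    for (int n = 1024; n <= 1048576 && count < capacity; n *= 2)
        sizes[count++] = n;
    return count;
}

/* Time RUNS_PER_SIZE calls of a transform of size n, in milliseconds.
   The transform callback stands in for the FFT routine under test. */
void time_runs_ms(void (*transform)(int n), int n, double ms[RUNS_PER_SIZE])
{
    for (int r = 0; r < RUNS_PER_SIZE; r++) {
        clock_t start = clock();
        transform(n);
        ms[r] = 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
    }
}
```

Doubling from 1K to 1024K gives 11 sizes, matching the 11 rows of the example log, and the three times per size correspond to the three columns shown for each precision.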


Assembly Code

 Single Precision Floating Point Instructions 32 Bit Compile

 fadds    s13, s13, s15             Add
 fmuls    s13, s13, s14             Multiply
 fmacs    s15, s14, s23             Multiply and accumulate
 fnmacs   s15, s24, s2              Negated multiply and accumulate
 fmscs    s15, s24, s12             Multiply and subtract

 NEON
 vadd.f32 q10, q10, q8              Vector add
 vmul.f32 q10, q10, q9              Vector multiply
 vsub.f32 q7, q6, q7                Vector subtract
 vmla.f32 q8, q10, q9               Vector multiply accumulate

 Single Precision Floating Point Instructions 64 Bit Compile

 fmadd    s4, s0, s1, s4            Fused multiply-add
 fadd     v2.4s, v2.4s, v4.4s       Add
 fmul     v2.4s, v2.4s, v3.4s       Multiply
 fmla     v0.4s, v22.4s, v17.4s     Fused multiply-add to accumulator
 fmls     v0.4s, v8.4s, v4.4s       Fused multiply-subtract from accumulator

 #####################################################################

 MP-MFLOPS 2 Operations Per Word                NEON-MFLOPS2i-MP 2 Operations Per Word

 for(i=0; i < n; i++)                           Loop Functions
    x[i] = (x[i]+a)*b;                          ptrx1
                                                vld1q_f32
                                                vst1q_f32
                                                vaddq_f32  1
                                                vmulq_f32  1

 1 add, 1 multiply                              4 add, 4 multiply

 ===========================================================================================

 Main Assembly Code 32 Bit                      Main Assembly Code NEON 32 Bit

 Code     No.  Ops  Example         190 MFLOPS  Code     No.  Ops  Example         619 MFLOPS

 add        1                                   cmp        1
 cmp        1                                   bge        1
 bge        1                                   b          1
 b          1                                   adds       1
 adds       1
 flds       1       flds s13, [r3]              vld1.32    1       vld1.32 {d20-d21}, [r1]
 fstmias    1       fstmias r3!, {s13}          vst1.32    1       vst1.32 {d20-d21}, [r1]
 fadds      1    1  fadds s13, s13, s15         vadd.f32   1    4  vadd.f32 q10, q10, q8
 fmuls      1    1  fmuls s13, s13, s14         vmul.f32   1    4  vmul.f32 q10, q10, q9

 ===========================================================================================

 Main Assembly Code 64 Bit 4 way unroll         Main Assembly Code NEON 64 Bit

 Code     No.  Ops  Example         705 MFLOPS  Code     No.  Ops  Example         745 MFLOPS

 add        1                                   cmp        1
 cmp        1                                   bne        1
 bhi        1
 ldr        1       ldr q0, [x3]                ldr        1       ldr q2, [x5]
 str        1       str q0, [x3],16             str        1       str q2, [x5],16
 fadd       1    4  fadd v0.4s, v0.4s, v2.4s    fadd       1    4  fadd v2.4s, v2.4s, v4.4s
 fmul       1    4  fmul v0.4s, v1.4s, v0.4s    fmul       1    4  fmul v2.4s, v2.4s, v3.4s

 ##########################################################################################

 MP-MFLOPS 32 Operations Per Word               NEON-MFLOPS2i-MP 32 Operations Per Word

 for(i=0; i < n; i++)                           Loop Functions
    x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f     ptrx1
          -(x[i]+g)*h+(x[i]+j)*k-(x[i]+l)*m     vld1q_f32
          +(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t     vst1q_f32
          -(x[i]+u)*v+(x[i]+w)*y;               vaddq_f32 16
                                                vmulq_f32 11
                                                vsubq_f32  5
 23 variables
 16 add, 5 subtract, 11 multiply                64 add, 20 subtract, 44 multiply

 ===========================================================================================

 Main Assembly Code 32 Bit                      Main Assembly Code NEON 32 Bit

 Code     No.  Ops  Example         672 MFLOPS  Code     No.  Ops  Example        1446 MFLOPS

 add        1                                   adds       1
 b          1                                   cmp        1
 bge        1                                   bge        1
 cmp        1                                   b          1
                                                vldr      18       vldr d18, [sp, #16]
 adds       1                                   vld1.64    1       vld1.64 {d18-d19}, [sp:64]
 flds       1       flds s14, [r2]              vld1.32    1       vld1.32 {d16-d17}, [r2]
 fstmias    1       fstmias r2!, {s15}          vst1.32    1       vst1.32 {d12-d13}, [r2]
 fadds     11   11  fadds s14, s14, s22         vadd.f32  16   64  vadd.f32 q6, q8, q9
 fmuls      1    1  fmuls s15, s15, s10         vmul.f32  11   44  vmul.f32 q6, q6, q7
 fmacs      5   10  fmacs s15, s14, s23         vsub.f32   5   20  vsub.f32 q7, q6, q7
 fnmacs     4    8  fnmacs s15, s24, s2
 fmscs      1    2  fmscs s15, s24, s12
                                                16 q registers used

 ===========================================================================================

 Main Assembly Code 64 Bit 4 way unroll         Main Assembly Code NEON 64 Bit

 Code     No.  Ops  Example        1398 MFLOPS  Code     No.  Ops  Example        2766 MFLOPS

 orr        2
 cmp        1
 bcc        1
 add       16
 fmov      11                                   cmp        1
 ins       11                                   bne        1
 ldr       13       ldr q17, [x28]              ldr        1       ldr q1, [x8]
 str        4       str q8, [x28]               str        1       str q0, [x8],16
 fadd      11   44  fadd v9.4s, v16.4s, v17.4s  fadd      11   44  fadd v17.4s, v21.4s, v1.4s
 fmla       5   40  fmla v17.4s, v9.4s, v13.4s  fmla       5   40  fmla v0.4s, v22.4s, v17.4s
 fmls       5   40  fmls v17.4s, v8.4s, v9.4s   fmls       5   40  fmls v0.4s, v8.4s, v4.4s
 fmul       1    4  fmul v17.4s, v9.4s, v17.4s  fmul       1    4  fmul v0.4s, v18.4s, v0.4s

 10 Vector Registers used                       32 Vector Registers used

 ##########################################################################################

 LinpackSP2                                     NEON-Linpack

 for (i = m; i < n; i = i + 4)                  for (i = m; i < n; i=i+4)
 {                                              {
     dy[i] = dy[i] + da*dx[i];                      x41 = vld1q_f32(ptrx1);
     dy[i+1] = dy[i+1] + da*dx[i+1];                y41 = vld1q_f32(ptry1);
     dy[i+2] = dy[i+2] + da*dx[i+2];                r41 = vmlaq_f32(y41, x41, c41);
     dy[i+3] = dy[i+3] + da*dx[i+3];                vst1q_f32(ptry1, r41);
 }                                                  ptrx1 = ptrx1 + 4;
                                                    ptry1 = ptry1 + 4;
                                                }

 32 Bit Compilation 181 MFLOPS                  32 Bit Compilation 407 MFLOPS

 .L42:                                          .L38:
 cmp     r1, r0                                 cmp     r3, r0
 add     r3, r3, #16                            bge     .L33
 add     r2, r2, #16                            vld1.32 {d20-d21}, [r2]
 bge     .L33                                   adds    r3, r3, #4
 flds    s13, [r2, #-16]                        adds    r2, r2, #16
 flds    s14, [r3, #-16]                        vld1.32 {d16-d17}, [r1]
 fmacs   s14, s15, s13                          vmla.f32 q8, q10, q9
 adds    r1, r1, #4                             vst1.32 {d16-d17}, [r1]
 fsts    s14, [r3, #-16]                        adds    r1, r1, #16
 flds    s14, [r3, #-12]                        b       .L38
 flds    s13, [r2, #-12]
 fmacs   s14, s15, s13
 fsts    s14, [r3, #-12]
 flds    s14, [r3, #-8]
 flds    s13, [r2, #-8]
 fmacs   s14, s15, s13
 fsts    s14, [r3, #-8]
 flds    s14, [r3, #-4]
 flds    s13, [r2, #-4]
 fmacs   s14, s15, s13
 fsts    s14, [r3, #-4]
 b       .L42

 64 Bit Compilation 482 MFLOPS                  64 Bit Compilation 505 MFLOPS

 .L59:                                          .L54:
 add     x0, x0, 16                             ldr     q1, [x1],16
 add     x1, x1, 16                             ldr     q0, [x3]
 ldr     s1, [x1,-16]                           fmla    v0.4s, v2.4s, v1.4s
 cmp     x0, x3                                 str     q0, [x3],16
 ldr     s4, [x0,-16]                           cmp     x3, x0
 ldr     s2, [x0,-12]                           bne     .L54
 fmadd   s4, s0, s1, s4
 ldr     s1, [x0,-8]
 ldr     s5, [x0,-4]
 str     s4, [x0,-16]
 ldr     s3, [x1,-12]
 fmadd   s3, s0, s3, s2
 str     s3, [x0,-12]
 ldr     s2, [x1,-8]
 fmadd   s2, s0, s2, s1
 str     s2, [x0,-8]
 ldr     s1, [x1,-4]
 fmadd   s1, s0, s1, s5
 str     s1, [x0,-4]
 bne     .L59

 #####################################################################

 NeonSpeed Normal                               NeonSpeed NEON

 for (m=0; m < ks; m=m+4)                       for(i=0; i < size/16; i++)
 {                                              {
     xs[m] = xs[m] + sums * ys[m];                  x41 = vld1q_f32(ptrx1);
     xs[m+1] = xs[m+1] + sums * ys[m+1];            x42 = vld1q_f32(ptrx2);
     xs[m+2] = xs[m+2] + sums * ys[m+2];            x43 = vld1q_f32(ptrx3);
     xs[m+3] = xs[m+3] + sums * ys[m+3];            x44 = vld1q_f32(ptrx4);
 }                                                  y41 = vld1q_f32(ptry1);
                                                    y42 = vld1q_f32(ptry2);
                                                    y43 = vld1q_f32(ptry3);
                                                    y44 = vld1q_f32(ptry4);
                                                    z41 = vmlaq_f32(x41, y41, c4);
                                                    z42 = vmlaq_f32(x42, y42, c4);
                                                    z43 = vmlaq_f32(x43, y43, c4);
                                                    z44 = vmlaq_f32(x44, y44, c4);
                                                    vst1q_f32(ptrx1, z41);
                                                    vst1q_f32(ptrx2, z42);
                                                    vst1q_f32(ptrx3, z43);
                                                    vst1q_f32(ptrx4, z44);
                                                    ptrx1 = ptrx1 + 16; ptry1 = ptry1 + 16;
                                                    ptrx2 = ptrx2 + 16; ptry2 = ptry2 + 16;
                                                    ptrx3 = ptrx3 + 16; ptry3 = ptry3 + 16;
                                                    ptrx4 = ptrx4 + 16; ptry4 = ptry4 + 16;
                                                }

 32 Bit Compilation 243 MFLOPS                  32 Bit Compilation 963 MFLOPS

 .L53:                                          .L24:
 cmp     r1, r9                                 cmp     r2, r3
 add     r3, r3, #16                            add     r8, r1, #48
 add     r2, r2, #16                            add     ip, r1, #32
 bge     .L102                                  add     r7, r1, #16
 flds    s14, [r2, #-16]                        add     r6, r0, #48
 flds    s15, [r3, #-16]                        add     r5, r0, #32
 fmacs   s15, s14, s18                          add     r4, r0, #16
 flds    s14, [r2, #-12]                        bge     .L26
 adds    r1, r1, #4                             vld1.32 {d24-d25}, [r0]
 fsts    s15, [r3, #-16]                        adds    r2, r2, #1
 flds    s15, [r3, #-12]                        vld1.32 {d6-d7}, [r1]
 fmacs   s15, s14, s18                          adds    r1, r1, #64
 flds    s14, [r2, #-8]                         vmla.f32 q12, q3, q8
 fsts    s15, [r3, #-12]                        vld1.32 {d22-d23}, [r4]
 flds    s15, [r3, #-8]                         vld1.32 {d20-d21}, [r5]
 fmacs   s15, s14, s18                          vld1.32 {d18-d19}, [r6]
 flds    s14, [r2, #-4]                         vld1.32 {d30-d31}, [r7]
 fsts    s15, [r3, #-8]                         vmla.f32 q11, q15, q8
 flds    s15, [r3, #-4]                         vld1.32 {d28-d29}, [ip]
 fmacs   s15, s14, s18                          vmla.f32 q10, q14, q8
 fsts    s15, [r3, #-4]                         vld1.32 {d26-d27}, [r8]
 b       .L53                                   vmla.f32 q9, q13, q8
                                                vst1.32 {d24-d25}, [r0]
                                                adds    r0, r0, #64
                                                vst1.32 {d22-d23}, [r4]
                                                vst1.32 {d20-d21}, [r5]
                                                vst1.32 {d18-d19}, [r6]
                                                b       .L24

 64 Bit Compilation 764 MFLOPS                  64 Bit Compilation 1014 MFLOPS

 .L49:                                          .L16:
 ldr     q1, [x2],16                            mov     x3, x1
 add     w0, w0, 1                              ldr     q4, [x0]
 ldr     q0, [x1]                               add     x6, x0, 16
 cmp     w0, w26                                add     x5, x0, 32
 fmla    v0.4s, v1.4s, v2.4s                    ldr     q5, [x3],16
 str     q0, [x1],16                            add     x7, x1, 32
 bcc     .L49                                   ldr     q3, [x6]
                                                add     x4, x0, 48
                                                add     x2, x1, 48
                                                ldr     q2, [x5]
                                                ldr     q7, [x7]
                                                add     x1, x1, 64
                                                ldr     q1, [x4]
                                                ldr     q6, [x3]
                                                fmla    v4.4s, v5.4s, v0.4s
                                                ldr     q5, [x2]
                                                fmla    v2.4s, v7.4s, v0.4s
                                                str     q4, [x0]
                                                add     x0, x0, 64
                                                fmla    v3.4s, v6.4s, v0.4s
                                                cmp     x0, x8
                                                fmla    v1.4s, v5.4s, v0.4s
                                                str     q3, [x6]
                                                str     q2, [x5]
                                                str     q1, [x4]
                                                bne     .L16


Roy Longbottom January 2016

The Official Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection
