Contents
Download Benchmark Apps
A security option in Settings may need changing to allow installation of non-Market applications.
NativeWhetstone2.apk    First standard benchmark
Dhrystone2i.apk         First integer benchmark
LinpackDP2.apk          First computational benchmark
LinpackSP2.apk          Single precision Linpack
LivermoreLoops2.apk     First supercomputer benchmark
MemSpeedi.apk           Floating Point Cache and RAM Test
BusSpeedv7i.apk         Integer Bus, Cache and RAM Test
RandMemi.apk            Random/Serial Access Cache and RAM Test
MP-MFLOPSi.apk          CPU, Cache, RAM MFLOPS Test
MP-MFLOPS2i.apk         Long Running MP-MFLOPS
MP-WHETSi.apk           Whetstone Floating and Fixed Point Tests
MP-Dhryi.apk            Dhrystone Integer Benchmark
MP-BusSpdi.apk          Multithreaded BusSpeed Benchmark
MP-RndMemi.apk          Multithreaded RandMem Benchmark
NEON-Linpacki.apk       Linpack Benchmark using ARM NEON Intrinsic Functions
NeonSpeedi.apk          NEON Memory Speed Test Using Intrinsic Functions
NEON-MFLOPS2i-MP.apk    MP-MFLOPS using ARM NEON Intrinsic Functions
NEON-Linpacki-MP.apk    Linpack MP Benchmark using NEON Intrinsic Functions
MP-BusSpd2i.apk         Long running version with staggered start
fft1.apk                Original FFT Benchmark
fft3c.apk               Optimised FFT Benchmark
All the above were produced using gcc 4.8, via Eclipse, running under Linux Ubuntu 14.04.
They are compiled to run on both 32 bit and 64 bit CPUs from ARM, Intel and MIPS, automatically selected at run time. Downloads are identical to those in Android Native ARM-Intel Benchmarks.htm.
General
As indicated above, the benchmarks, downloadable from here, were compiled for both 32 bit and 64 bit operation. The purpose of this document is to report results from running at 64 bits and to provide comparisons with those at 32 bits, including the latter also compiled by gcc 4.8.
Eclipse (or Android Studio?) projects for the new compilations are included in
Android Intel-ARM Benchmarks.zip.
The tablet used was a Lenovo Tab 2 A8-50, 8 Inch Tablet, with a 1.3 GHz MediaTek mt8161 quad core processor (64 bit ARM Cortex-A53) and Android 5.0.2. After initially proving that the benchmarks were in 64 bit mode, new versions are being produced that indicate the mode, in this case for 64 or 32 bit ARM, as in the options below:
Compiled for 32 bit ARM v7a      Compiled for 64 bit ARM v8a
Compiled for 32 bit Intel x86    Compiled for 64 bit Intel x86_64
Compiled for 32 bit Mips CPU     Compiled for 64 bit Mips CPU
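As a rough illustration of how a multi-architecture package might report its mode, the sketch below maps Android ABI names to the messages listed above. The ABI names are the standard Android NDK ones of that period; the table and the function name mode_message are hypothetical, and the real apps make this choice in compiled native code rather than in Python.

```python
# Hypothetical lookup table; the benchmarks report this from native code.
MODE_BY_ABI = {
    "armeabi-v7a": "Compiled for 32 bit ARM v7a",
    "arm64-v8a": "Compiled for 64 bit ARM v8a",
    "x86": "Compiled for 32 bit Intel x86",
    "x86_64": "Compiled for 64 bit Intel x86_64",
    "mips": "Compiled for 32 bit Mips CPU",
    "mips64": "Compiled for 64 bit Mips CPU",
}


def mode_message(abi):
    """Return the mode line for an ABI name, or a fallback for unknown ABIs."""
    return MODE_BY_ABI.get(abi, "Unknown ABI: " + abi)


print(mode_message("arm64-v8a"))
```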
For comparison on other systems, the 32 bit apps are provided in
Android Intel-ARM 32 Bit Benchmarks.zip.
Note that these will overwrite the 64 bit app installation.
At the time of starting this report (July 2015), other benchmarks indicated similar performance to a Nexus 7 32 bit tablet. Results for this are included below and they suggest that only 32 bit benchmarks were being used, when 64 bit varieties can be much faster.
Results below are for the revised benchmarks, that indicate which section of the code is used. The new projects are included in the zip file and others will follow.
Logged Configuration
In line with other Android benchmarks in Android Benchmarks.htm, the programs identify system information, in this case the following for the Tab 2 A8-50. Strangely, only three of the four CPU cores are reported.
System Information
Device LENOVO Lenovo TAB 2 A8-50F
Screen pixels w x h 800 x 1216
Android Build Version 5.0.2
Processor : AArch64 Processor rev 3 (aarch64)
processor : 0
BogoMIPS : 26.00
processor : 1
BogoMIPS : 26.00
processor : 2
BogoMIPS : 26.00
Features : fp asimd aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: AArch64
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 3
Hardware : MT8161
Linux version 3.10.65 (jenkins@ubuntu12) (gcc version 4.9 20140514
(mtk-20150408) (GCC) ) #1 SMP PREEMPT Fri Jun 19 11:01:08 CST 2015
Whetstone Benchmark - NativeWhetstone2.apk
This provides an overall rating in MWIPS, plus separate results for the eight test procedures in MFLOPS (floating point) and MOPS (functions and integer). For full details and results via Windows, Linux and Android, and via different programming languages, see Whetstone Benchmark Results on PCs.
On the latest CPUs, running time largely depends on the COS and EXP function tests. This is highlighted in the examples below.
As with the next four benchmarks, tests comprised the original version, from an earlier compiler, then separate gcc 4.8 compilations for 32 bit and 64 bit CPUs, and finally the combined version covering all ARM, Intel and MIPS based systems. The last of these shows that this tablet uses the 64 bit code option.
In this case, most tests indicated that the later versions, and 64 bit operation, provided no performance gains, but were somewhat faster than the Nexus 7 Cortex-A9 CPU.
Version MWIPS ------MFLOPS------- ------------MOPS--------------
1 2 3 COS EXP FIXPT IF EQUAL
1300 MHz Cortex-A53
Original 32 bit 1433.7 348.0 319.3 308.2 36.3 19.8 1551.4 1861.9 611.0
ARM/Intel 32 bit 834.7 348.9 312.7 310.9 36.7 5.4 1556.7 1867.2 570.5
ARM/Intel 64 bit 1504.4 348.8 304.9 309.3 38.2 20.5 1536.4 1862.0 1242.4
ARM/Intel 32/64 bit 1494.2 347.1 307.0 305.9 37.5 20.6 1552.2 1863.7 1239.1
1200 MHz Cortex-A9
Original 32 bit 1115.0 271.3 250.7 256.4 25.8 14.6 1190.0 1797.0 1198.7
ARM/Intel 32/64 bit 731.1 273.6 253.0 252.8 28.0 5.0 1185.2 2383.4 1192.1
Dhrystone Benchmark - Dhrystone2i.apk
The Dhrystone integer benchmark produces a performance rating in Vax MIPS (AKA DMIPS). Further details of the Dhrystone benchmark, and results from Windows and Linux based PCs, can be found in Dhrystone Results.htm.
The ratio MIPS/MHz is often quoted, but this depends on compiler optimisation (or over-optimisation).
The 32 bit gcc 4.8 compilations were slower than the original and similar to the Nexus 7, but the 64 bit version was significantly faster using the Cortex-A53.
Version              Vax MIPS  MIPS/MHz
1300 MHz Cortex-A53
Original 32 bit 1683 1.29
ARM/Intel 32 bit 1423 1.09
ARM/Intel 64 bit 2549 1.96
ARM/Intel 32/64 bit 2569 1.98
1200 MHz Cortex-A9
Original 32 bit 1610 1.34
ARM/Intel 32/64 bit 1317 1.10
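The MIPS/MHz column is simply the Vax MIPS rating divided by the CPU clock in MHz. A minimal sketch of that arithmetic, using values copied from the table above (the function name is illustrative only):

```python
def mips_per_mhz(vax_mips, cpu_mhz):
    """Vax MIPS (DMIPS) normalised by the CPU clock in MHz."""
    return vax_mips / cpu_mhz


# 1300 MHz Cortex-A53 rows from the table above.
a53_32 = mips_per_mhz(1423, 1300)  # ARM/Intel 32 bit
a53_64 = mips_per_mhz(2549, 1300)  # ARM/Intel 64 bit
gain = 2549 / 1423                 # 64 bit over 32 bit, same clock

print(f"32 bit {a53_32:.2f}  64 bit {a53_64:.2f}  gain {gain:.2f}x")
```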
Linpack Benchmark - LinpackDP2.apk, LinpackSP2.apk
The Linpack benchmark speed is measured in MFLOPS, officially for double precision floating point calculations. A version was produced using NEON functions (see later) that only provides single precision operation. So, for comparison purposes, an available C code option, to define single precision data, was used to produce a new version, and this has usually led to a higher MFLOPS speed. Results from various hardware and software platforms can be found in Linpack Results.htm.
Performance of the Linpack benchmark is almost entirely dependent on the calculation x[i]=x[i]+c*y[i], in the daxpy() function.
The ARM compilations generated floating point multiply and accumulate instructions for this, such as fmacd d6, d7, d5 at 32 bits and fmadd d1, d0, d1, d5, which uses four registers, at 64 bits.
In this case, 64 bit operation increased speed by almost 2 times with double precision calculations and 2.7 times at single precision. Performance at 32 bits was similar to that on the Nexus 7.
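For readers unfamiliar with the kernel, the calculation quoted above can be sketched as follows. This is an illustration in Python that follows the formula as written in this document, not the benchmark's C source; the helper flop_count is hypothetical and simply records that each element costs one multiply and one add.

```python
def daxpy(c, x, y):
    """Update x in place with x[i] = x[i] + c * y[i], as quoted above."""
    for i in range(len(x)):
        x[i] = x[i] + c * y[i]
    return x


def flop_count(n):
    """One multiply and one add per element, so 2n floating point operations."""
    return 2 * n


x = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(x, flop_count(len(x)))
```

A fused multiply and accumulate instruction performs both operations of one loop iteration in a single instruction, which is why the generated code is dominated by fmacd or fmadd.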
September 2015 - New best score from a 2 GHz Qualcomm Snapdragon 810 (Cortex-A57) with Android 5.0.2, with SP speed of 1277 MFLOPS at 64 bits.
Below is the general output produced, showing different (but probably acceptable) numeric results of computation in the various modes of operation.
32 bit DP compilation 32 bit SP compilation
norm. resid 1.7 1.6
resid 7.41628980e-14 3.80277634e-05
machep 2.22044605e-16 1.19209290e-07
x[0]-1 -1.49880108e-14 -1.38282776e-05
x[n-1]-1 -1.89848137e-14 -7.51018524e-06
64 bit DP compilation 64 bit SP compilation
norm. resid 1.9 2.0
resid 8.46778499e-14 4.69621336e-05
machep 2.22044605e-16 1.19209290e-07
x[0]-1 -1.11799459e-13 -1.31130219e-05
x[n-1]-1 -9.60342916e-14 -1.30534172e-05
Version LinpackDP LinpackSP
MFLOPS MFLOPS
1300 MHz Cortex-A53
Original 32 bit 156.70 184.09
ARM/Intel 32 bit 172.28 180.64
ARM/Intel 64 bit 337.97 473.10
ARM/Intel 32/64 bit 340.18 482.43
2000 MHz Cortex-A57
ARM/Intel 64 bit 1277.76
1200 MHz Cortex-A9
Original 32 bit 151.05 201.30
ARM/Intel 32/64 bit 159.34 199.84
Livermore Loops Benchmark - LivermoreLoops2.apk
The Livermore Loops comprise 24 kernels of numerical application with speeds calculated in MFLOPS. A summary is also produced, with maximum, minimum and various mean values, geometric mean being the official average. As with the other benchmarks here, details and results are provided, in this case in Livermore Loops Results.htm.
Summary results, below, indicate similar Cortex-A53 and Cortex-A9 speeds at 32 bits, and the former faster at 64 bits. This is followed by MFLOPS for each of the 24 test functions, where 64 bit/32 bit performance ratios vary between 0.8 and 7.9 times, with a geometric mean ratio of 1.5. The identified numeric results are also shown, and they can again be slightly different.
Version Max Average Geomean Harmean Min
1300 MHz Cortex-A53 MFLOPS
Original 32 bit 371.5 192.4 171.9 151.8 67.1
ARM/Intel 32 bit 393.4 188.3 158.3 124.6 27.1
ARM/Intel 64 bit 781.4 265.4 231.8 205.6 98.1
ARM/Intel 32/64 bit 772.2 265.9 232.5 206.3 97.8
1200 MHz Cortex-A9 MFLOPS
Original 32 bit 391.9 202.1 181.3 160.9 68.1
ARM/Intel 32/64 bit 396.6 207.6 175.6 136.1 26.8
A53 ARM/Intel 32 bit
MFLOPS for 24 loops Do Span 471
163.4 243.4 272.1 270.3 109.5 111.2
282.2 389.0 360.6 219.6 124.0 61.8
67.6 87.4 27.3 224.2 340.1 241.9
168.5 198.8 120.2 120.6 277.7 79.1
Results of last two calculations
4.850340602749970e+02 1.300000000000000e+01
A53 ARM/Intel 32/64 bit
MFLOPS for 24 loops Do Span 471
451.4 191.4 243.2 272.4 144.9 144.5
749.4 411.1 453.6 261.1 138.0 206.1
122.5 130.1 215.0 249.8 411.6 395.4
241.7 248.1 152.8 118.7 317.2 103.7
Results of last two calculations
4.850340602751729e+02 1.300000000000000e+01
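The 64 bit/32 bit ratios quoted in the text can be checked from the two sets of 24 loop speeds listed above. The sketch below copies those figures and computes the per-loop ratios and their geometric mean; the variable names are illustrative.

```python
import math

# MFLOPS for the 24 loops, copied from the two listings above.
a53_32bit = [163.4, 243.4, 272.1, 270.3, 109.5, 111.2,
             282.2, 389.0, 360.6, 219.6, 124.0, 61.8,
             67.6, 87.4, 27.3, 224.2, 340.1, 241.9,
             168.5, 198.8, 120.2, 120.6, 277.7, 79.1]
a53_64bit = [451.4, 191.4, 243.2, 272.4, 144.9, 144.5,
             749.4, 411.1, 453.6, 261.1, 138.0, 206.1,
             122.5, 130.1, 215.0, 249.8, 411.6, 395.4,
             241.7, 248.1, 152.8, 118.7, 317.2, 103.7]

# Per-loop speed ratio, 64 bit over 32 bit.
ratios = [new / old for new, old in zip(a53_64bit, a53_32bit)]

# Geometric mean: exponential of the average of the logarithms.
geomean = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

print(f"min {min(ratios):.2f}  max {max(ratios):.2f}  geomean {geomean:.2f}")
```

The geometric mean is used here for the same reason it is the official Livermore Loops average: one very large ratio does not dominate the result as it would with an arithmetic mean.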
MemSpeed Benchmark - MemSpeedi.apk
The MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers, via two data arrays. It uses data volumes of 4 KBytes upwards to indicate performance via caches and RAM. A version was produced to run under Linux with a variation of calculations (mainly for use via OpenMP). The Android benchmark is the same as this, with fewer tests, but still reflects cache and RAM speeds.
The 64 bit compilation was nearly twice as fast as the 32 bit version with double precision floating point calculations, using cached data, and provided a 33% increase from RAM. Corresponding single precision ratios were 2.6 and 2.0 times and integer ratios of 2.2 and 1.5. For floating point, the C program loop has four each of loads, stores, multiplies and adds, where the latter two are linked or fused into one function. At 64 bits, vector SIMD instructions were produced, leading to 2 each at double precision and 1 each at single precision (4 words in registers). With 32 bit integers, 8 scalar adds were reduced to two vector adds.
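The instruction counts above follow from how many data elements fit in one vector register. A minimal sketch of that arithmetic, assuming 128 bit ARMv8 vector registers (the function name is illustrative):

```python
REGISTER_BITS = 128  # assumed width of an ARMv8 NEON/ASIMD vector register


def vector_ops(scalar_ops, element_bits):
    """Vector instructions needed to replace scalar_ops same-type scalar operations."""
    lanes = REGISTER_BITS // element_bits
    return -(-scalar_ops // lanes)  # ceiling division


dp = vector_ops(4, 64)       # four double precision operations
sp = vector_ops(4, 32)       # four single precision operations
int_adds = vector_ops(8, 32)  # eight 32 bit integer adds

print(dp, sp, int_adds)
```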
At 32 bits, the Nexus 7 was slightly faster using L1 cache, but the A8 gains averaged nearly 1.4 times from L2 cache and 2.4 times with RAM based data.
Tab 2 A8 64 bit maximum MFLOPS were somewhat faster than with similar calculations in the Linpack benchmark, but a modern Intel CPU could be three times faster, at the same CPU MHz, using SSE type SIMD instructions.
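The Max MFLOPS line in the 64 bit log below (512 and 624) appears to follow directly from the fastest MBytes/second figures. The sketch assumes that each x[m]=x[m]+s*y[m] step reads two words and performs two floating point operations, so one operation per word read; that reading is an inference from the reported numbers, not something stated by the benchmark.

```python
def max_mflops(mbytes_per_second, bytes_per_word):
    """MFLOPS under the assumption of one floating point operation per word read."""
    return mbytes_per_second / bytes_per_word


# Fastest 64 bit results from the log: Dble 4092 MB/s at 16 KB, Sngl 2496 MB/s at 32 KB.
dp_mflops = max_mflops(4092, 8)  # 8 byte double precision words
sp_mflops = max_mflops(2496, 4)  # 4 byte single precision words

print(f"{dp_mflops:.1f} {sp_mflops:.1f}")
```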
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.16
Compiled for 32 bit ARM v7a
Reading Speed in MBytes/Second
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]
KBytes Dble Sngl Int Dble Sngl Int
16 1940 971 1693 2470 1278 2084 L1
32 1879 955 1676 2378 1255 1967
64 1801 938 1615 2254 1218 1912 L2
128 1706 941 1620 2279 1224 1872
256 1818 935 1570 2291 1155 1875
512 1633 884 1451 2008 1132 1704
1024 1276 781 1181 1454 938 1324 RAM
4096 1335 808 1260 1533 1010 1386
16384 1342 813 1270 1487 1013 1419
65536 1346 809 1274 1546 1031 1252
Total Elapsed Time 11.7 seconds
ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.29
Compiled for 64 bit ARM v8a
Reading Speed in MBytes/Second
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]
KBytes Dble Sngl Int Dble Sngl Int
16 4092 2198 3951 5293 3611 4408
32 3753 2496 3630 4651 3300 3992
64 3407 2388 3368 3715 3023 3677
128 3496 2462 3521 4137 3139 3844
256 3535 2481 3573 4199 3322 3911
512 3054 2248 3126 3556 2548 3372
1024 1714 1704 2029 2069 1854 2099
4096 1832 1595 1841 1914 1780 1897
16384 1844 1601 1850 1925 1798 1891
65536 1859 1608 1837 1921 1795 1812
Total Elapsed Time 10.2 seconds
Max MFLOPS 512 624
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel MemSpeed Benchmark 1.1 25-Apr-2015 12.24
Reading Speed in MBytes/Second
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]
KBytes Dble Sngl Int Dble Sngl Int
16 1856 1019 2537 2913 1459 2544
32 1416 832 1327 1508 920 1345
64 1286 779 1198 1418 908 1296
128 1282 781 1195 1424 912 1305
256 1278 774 1190 1433 878 1298
512 1197 752 1122 1340 862 1216
1024 833 626 822 903 695 857
4096 463 420 456 463 440 459
16384 459 426 453 455 435 458
65536 463 430 411 462 443 452
Total Elapsed Time 11.5 seconds
BusSpeed Benchmark - BusSpeedv7i.apk
The BusSpeed benchmark is designed to identify reading data in bursts over buses. The program starts by reading a word, with address increments of 32 words for the next data. The increment is reduced to 16 words, then halved repeatedly until all data is read. In this case, an estimate of maximum speed can be 16 times the MB/second measured at 16 word increments. Normally, an MP version is needed for maximum throughput.
Other than identifying burst data transfers, the final column, reading all data, is the major performance guide. Here, 64/32 bit comparison ratios were up to 2.0 from L1 cache, 1.5 from L2 cache and 1.25 from RAM.
At 32 bits, the Lenovo A8 was slower than the Nexus 7 on L1 cache based data, but the position was reversed on L2 cache tests, and particularly on RAM data transfers.
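The access pattern and the maximum speed estimate can be sketched as below. This is an illustration only: the function names are hypothetical, and the explanation that each 16 word (64 byte) step lands in a fresh burst, so that a whole burst is transferred for every word counted, is an assumption about why the factor of 16 applies.

```python
def stride_sequence(start_words=32):
    """Address increments for successive passes: 32, 16, 8, 4, 2, then all words."""
    strides = []
    inc = start_words
    while inc >= 1:
        strides.append(inc)
        inc //= 2
    return strides


def words_read(total_words, inc):
    """Number of words actually read when stepping through the data by inc words."""
    return len(range(0, total_words, inc))


def estimated_max_mb_per_s(inc16_mb_per_s):
    """Estimate from the text: 16 times the speed measured at 16 word increments."""
    return 16 * inc16_mb_per_s


print(stride_sequence())
print(words_read(4096, 16))
print(estimated_max_mb_per_s(174))  # 64 bit RAM result from the log below
```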
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
Single Channel RAM, LPDDR3 666 MHz, 5.3 GB/second
ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 10.57
Compiled for 32 bit ARM v7a
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 874 932 1814 2302 2355 2263 L1
32 758 803 1309 1820 2323 2386
64 653 671 1203 1741 2206 2332 L2
128 603 620 1107 1693 2222 2351
256 574 589 1075 1711 2211 2327
512 332 372 681 1075 1863 2120
1024 137 193 371 578 1322 2129 RAM
4096 172 179 351 567 1151 2126
16384 172 178 351 504 1117 2136
65536 172 177 349 478 882 2129
Total Elapsed Time 5.3 seconds
ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 11.02
Compiled for 64 bit ARM v8a
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 3188 3635 3937 4327 4372 4462
32 1478 1607 2246 3382 3853 4144
64 600 622 1163 2011 2972 3585
128 558 575 1056 1889 2892 3525
256 538 550 1028 1826 2837 3260
512 371 425 813 1490 2403 3202
1024 136 196 382 728 1423 2750
4096 170 177 346 669 1340 2652
16384 169 174 341 678 1352 2663
65536 168 174 341 676 1347 2611
Total Elapsed Time 5.2 seconds
Estimated maximum = 16 x 174 = 2784 MB/second
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
Android BusSpeed Benchmark 19-Oct-2012 17.29
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 2723 2420 3044 3364 3499 3500 L1
32 1054 1087 1061 1382 1565 2145
64 436 433 419 652 751 1160 L2
128 345 337 337 542 633 943
256 329 309 322 522 614 961
512 339 299 311 506 574 937
1024 170 168 180 269 349 629
4096 59 55 84 127 176 338 RAM
16384 56 56 83 125 173 335
65536 56 56 82 125 174 334
Total Elapsed Time 5.7 seconds
RandMem Benchmark - RandMemi.apk
RandMem benchmark carries out four tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests using 32 bit integers. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is not used, and by the size of preceding caches. For more details and further results see
RandMem in Android Benchmarks.htm.
This program uses quite complex memory address indexing, and the Tab 2 A8 32 bit and 64 bit versions were not that different overall, each one slightly faster on some tests.
At 32 bits, the A8 had L2 cache and RAM speed advantages over the Nexus 7 on serial reading and writing but, on random access, the latter’s larger L2 cache led to faster speeds at the later cache based data sizes and also affected RAM data transfer speeds.
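The idea of using the same program structure for serial and random access can be sketched with an index array that is either in order or shuffled. This is an assumed simplification in Python, not the benchmark's indexing scheme, and all names are hypothetical.

```python
import random


def make_indices(n, randomise, seed=1):
    """Index array for n words: in order for serial access, shuffled for random."""
    indices = list(range(n))
    if randomise:
        random.Random(seed).shuffle(indices)
    return indices


def read_test(data, indices):
    """Read every word once, in the order given by the index array."""
    total = 0
    for i in indices:
        total += data[i]
    return total


def read_write_test(data, indices):
    """Read, modify and write back every word once, in index array order."""
    for i in indices:
        data[i] = data[i] + 1
    return data


data = list(range(1000))
serial_sum = read_test(data, make_indices(len(data), randomise=False))
random_sum = read_test(data, make_indices(len(data), randomise=True))
print(serial_sum == random_sum)
```

Both orders touch exactly the same words, so any speed difference comes from the access pattern alone, which is the point the benchmark sets out to demonstrate.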
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel RandMem Benchmark 1.2 06-Aug-2015 12.29
Compiled for 32 bit ARM v7a
MBytes/Second Transferring 4 Byte Words
Memory Serial....... Random.......
KBytes Read Rd/Wrt Read Rd/Wrt
16 2807 3606 2753 3595 L1
32 2719 3433 1429 1930
64 2615 3266 914 1166 L2
128 2592 3243 705 828
256 2570 3223 637 720
512 2367 2684 237 347
1024 2137 1855 120 163 RAM
4096 1918 1658 83 97
16384 2152 1665 74 85
65536 2104 1652 72 64
Total Elapsed Time 11.6 seconds
ARM/Intel RandMem Benchmark 1.2 06-Aug-2015 12.32
Compiled for 64 bit ARM v8a
MBytes/Second Transferring 4 Byte Words
Memory Serial....... Random.......
KBytes Read Rd/Wrt Read Rd/Wrt
16 3865 3033 3798 3027
32 3622 2760 3105 2734
64 3094 2803 1011 1077
128 3074 2740 776 801
256 3050 2771 718 693
512 2420 2463 270 371
1024 1322 1853 131 164
4096 1754 1598 87 100
16384 1791 1586 75 91
65536 1856 1609 57 68
Total Elapsed Time 14.6 seconds
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel RandMem Benchmark 1.1 25-Apr-2015 12.33
MBytes/Second Transferring 4 Byte Words
Memory Serial....... Random.......
KBytes Read Rd/Wrt Read Rd/Wrt
16 2521 3175 2490 3038 L1
32 1427 1451 1218 1446
64 1133 1052 853 907 L2
128 1039 871 646 650
256 1028 909 543 518
512 1025 895 499 502
1024 700 489 242 236
4096 487 282 90 88 RAM
16384 483 281 71 70
65536 478 274 63 62
Total Elapsed Time 11.3 seconds
MP-MFLOPS Benchmarks - MP-MFLOPSi and MP-MFLOPS2i
MP-MFLOPS arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2 and 32 operations per input data word, using 1, 2, 4 and 8 threads. Data sizes are limited to three, to use L1 cache, L2 cache and RAM at 12.8, 128 and 12800 KB (3200, 32000 and 3200000 single precision floating point words). Each thread uses the same calculations but accessing different segments of the data. The program checks for consistent numeric results, primarily to show that all calculations are carried out and can be run. The numeric results start with values of 1.0, with subsequent calculations reducing the values, the amount depending on the number of calculations. An example log file is shown below.
The original benchmark runs too fast on later CPUs, so a revised version, MP-MFLOPS2, was produced, with 50 times more calculations, producing the expected reduction in result values, as also shown below. Those from the 32 bit benchmark are slightly different to those from 64 bit operation.
ARM/Intel MP-MFLOPS v7 Benchmark V1.2 09-Aug-2015 21.20
Compiled for 64 bit ARM v8a
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 701 695 583 1391 1394 1349
2T 1347 1370 712 2792 2798 2743
4T 1641 1544 716 3587 3491 3374
8T 1370 1803 693 4001 4255 5016
Results x 100000, 0 indicates ERRORS
1T 86735 98519 99984 79897 97639 99975
2T 86735 98519 99984 79897 97639 99975
4T 86735 98519 99984 79897 97639 99975
8T 86735 98519 99984 79897 97639 99975
Total Elapsed Time 3.1 seconds
MP-MFLOPS2i 32 bit
1T 40392 76406 99700 35218 66014 99520
MP-MFLOPS2i 64 bit
1T 40392 76406 99700 35206 66015 99520
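The way each thread works on its own segment of the data, and the check that every thread count produces the same numeric result, can be sketched as follows. This is a sketch under stated assumptions: the two operation form x[i] = (x[i] + a) * b and the constants A and B are hypothetical, chosen only so that values start at 1.0 and fall slightly with each pass, and Python threads will not show the benchmark's speed gains.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical constants; the benchmark's real values are not given here.
A = 0.00002
B = 0.99998


def update_segment(x, start, stop, passes):
    """Apply two operations per word (one add, one multiply) to x[start:stop]."""
    for _ in range(passes):
        for i in range(start, stop):
            x[i] = (x[i] + A) * B


def run(words, threads, passes):
    """Run the same calculation with the data split into one segment per thread."""
    x = [1.0] * words
    segment = words // threads
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(update_segment, x, t * segment, (t + 1) * segment, passes)
            for t in range(threads)
        ]
        for future in futures:
            future.result()  # re-raise any error from a worker thread
    return x


results = {t: run(3200, t, 5) for t in (1, 2, 4, 8)}
print(all(results[t] == results[1] for t in results), results[1][0] < 1.0)
```

Because the segments do not overlap, every word receives exactly the same sequence of operations whatever the thread count, so identical results across 1, 2, 4 and 8 threads indicate that all calculations were carried out.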
MP-MFLOPS Benchmark Results
Except for faster results with data in RAM, the 32 bit Tab 2 performance was, again, similar to the Cortex-A9 based Nexus 7. At 32 operations per word, the Tab 2 was just over twice as fast at 64 bits, rising to as much as 3.7 times at 2 operations per word with cache based data. The reason is that vector SIMD instructions were produced at 64 bits, instead of scalar ones.
For further comparisons see NEON-MFLOPS-MP Benchmark and Assembly Code below.
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel MP-MFLOPS2 Benchmark V2.2 09-Aug-2015 21.17
Compiled for 32 bit ARM v7a
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 190 190 184 670 672 664
2T 377 378 370 1343 1345 1329
4T 707 755 725 2657 2669 2621
8T 722 736 714 2640 2672 2631
Total Elapsed Time 113.0 seconds
ARM/Intel MP-MFLOPS2 Benchmark V2.2 09-Aug-2015 21.24
Compiled for 64 bit ARM v8a
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 705 701 636 1398 1394 1362
2T 1376 1395 942 2794 2797 2757
4T 2063 2602 962 5491 5546 5336
8T 2474 2611 957 5367 5500 5417
Total Elapsed Time 51.6 seconds
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel MP-MFLOPS2 Benchmark V2.1 28-Apr-2015 17.44
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 188 156 116 598 578 574
2T 365 319 197 1195 1161 1145
4T 682 709 237 2372 2345 2249
8T 678 731 237 2361 2381 2254
Total Elapsed Time 135.0 seconds
MP-Whetstone Benchmark - MP-WHETSi
This is a multithreaded version of the Whetstone Benchmark above.
Tab 2 A8-50 performance on the 32 bit version was, again, similar to the Nexus 7. At 64 bits, the Fixpt test was clearly almost optimised out, but this makes little difference to the overall MWIPS rating, which was 2.25 times that of the 32 bit benchmark.
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel MP-Whetstone Benchmark V1.2 10-Aug-2015 11.30
Compiled for 32 bit ARM v7a
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 676.4 275.9 281.9 147.9 35.4 5.3 600.3 901.0 285.5
2T 1362.5 533.8 561.7 298.0 70.9 10.8 1203.1 1838.9 574.0
4T 2698.6 903.9 1071.7 594.4 141.2 21.5 2346.1 3305.5 1138.5
8T 2830.1 1463.2 1393.0 614.2 152.5 21.9 3243.9 4418.3 1171.4
Overall Seconds 4.95 1T, 4.94 2T, 5.11 4T, 10.09 8T
ARM/Intel MP-Whetstone Benchmark V1.2 10-Aug-2015 11.34
Compiled for 64 bit ARM v8a
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 1524.8 328.6 348.8 297.6 37.3 19.9 1462579 1867.2 1238.0
2T 3062.5 688.8 697.9 596.0 75.5 39.8 2097113 3726.7 2481.3
4T 6085.4 1214.9 1360.5 1185.4 150.5 79.4 2449153 7055.0 4951.8
8T 6222.4 1495.2 1545.6 1204.2 152.2 80.6 3869846 9218.8 5154.1
Overall Seconds 4.92 1T, 4.90 2T, 5.05 4T, 9.97 8T
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel MP-Whetstone Benchmark V1.1 30-Apr-2015 21.32
Using 1, 2, 4 and 8 Threads
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
1T 602.2 242.3 242.3 140.2 27.2 4.9 482.8 1425.2 239.1
2T 1208.7 481.2 484.2 280.8 55.0 9.9 970.0 2869.6 478.7
4T 2398.7 805.4 966.7 562.5 109.5 19.5 1938.2 5722.5 957.1
8T 2429.1 974.6 1076.2 562.4 110.9 19.7 1981.5 5816.1 963.6
Overall Seconds 4.94 1T, 4.93 2T, 5.08 4T, 9.93 8T
MP Dhrystone Benchmark - MP-Dhryi.apk
This is a multithreaded version of the Dhrystone Benchmark above. Tab 2 A8-50 performance on the 32 bit version was, again, similar to the Nexus 7.
Each thread executes the same code, but some variables are shared and that can lead to non-linear MP scaling, in this case two CPUs producing increased throughput of 1.8 times or less.
At least single threaded performance is essentially the same as that of the non-threaded benchmark.
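The VAX MIPS ratings in the logs below are the Dhrystones per second figure divided by 1757, the conventional score of the VAX 11/780 reference machine. The sketch below applies that to figures from the Tab 2 logs and also computes the two thread scaling mentioned above; the function names are illustrative.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # conventional 1 MIPS reference score


def vax_mips(dhrystones_per_second):
    """Convert Dhrystones per second to a VAX MIPS (DMIPS) rating."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


def scaling(one_thread, n_threads):
    """Throughput gain of a multithreaded run over the single thread run."""
    return n_threads / one_thread


# Dhrystones per second from the Tab 2 A8-50 logs below.
mips_32_1t = vax_mips(2481286)
mips_64_1t = vax_mips(4476736)
gain_32_2t = scaling(2481286, 4495793)
gain_64_2t = scaling(4476736, 7574470)

print(f"{mips_32_1t:.0f} {mips_64_1t:.0f} {gain_32_2t:.2f} {gain_64_2t:.2f}")
```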
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel MP-Dhrystone 2 Benchmark V1.2 10-Aug-2015 11.32
Compiled for 32 bit ARM v7a
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.64 0.71 0.90 1.70
Dhrystones per Second 2481286 4495793 7094180 7540038
VAX MIPS rating 1412 2559 4038 4291
ARM/Intel MP-Dhrystone 2 Benchmark V1.2 10-Aug-2015 11.36
Compiled for 64 bit ARM v8a
Using 1, 2, 4 and 8 Threads
Threads 1 2 4 8
Seconds 0.89 1.06 1.64 3.24
Dhrystones per Second 4476736 7574470 9768350 9861922
VAX MIPS rating 2548 4311 5560 5613
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel MP-Dhrystone 2 Benchmark V1.1 04-May-2015 17.18
Threads 1 2 4 8
Seconds 0.78 0.95 1.27 2.44
Dhrystones per Second 2572642 4214238 6280420 6565767
VAX MIPS rating 1464 2399 3575 3737
MP-BusSpeed Benchmark - MP-BusSpdi.apk and MP-BusSpd2i.apk
This is a multithreaded version of the BusSpeed Benchmark, with data sizes considered suitable to measure performance from L1 cache, L2 cache and RAM.
The original MP-BusSpd benchmark read all the data with every thread, each starting at the beginning. With some devices having large shared L2 caches, some of the RAM based data could be cached, sometimes indicating an impossible performance level. All threads in the new version, MP-BusSpd2, read all the data, but with staggered starting points. The difference is not that great on the Tab 2 A8, as indicated below.
Just considering MP-BusSpd2 and reading all data, at 32 bits, the Cortex-A53/Cortex-A9 L1 cache, L2 cache and RAM performance ratios are around 0.8, 3.0 and 5.5. A53 64/32 bit ratios average 2.2, 1.8 and 1.0.
Maximum 64 bit memory transfer rate is 4.328 GB/second, or 4.448 GB/second based on 16 word increments, out of a possible 5.33 GB/second (see BusSpeed).
Note that multithreading increases memory throughput by more than 60%.
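The memory figures in this section can be reproduced with a little arithmetic. The sketch assumes the single LPDDR3 channel is 32 bits (4 bytes) wide and transfers data twice per clock, which is consistent with the quoted 5.33 GB/second but is not stated in the logs; the measured values are from the 64 bit MP-BusSpd2 results at 49152 KB, and the single thread figure is taken from the same run.

```python
def peak_mb_per_s(clock_mhz, bus_bytes, transfers_per_clock=2):
    """Theoretical peak transfer rate of double data rate memory in MB/second."""
    return clock_mhz * transfers_per_clock * bus_bytes


peak = peak_mb_per_s(666.67, 4)  # assumed 4 byte wide single channel

read_all_1t = 2653  # RdAll, 1 thread
read_all_4t = 4328  # RdAll, 4 threads (the quoted maximum)
inc16_8t = 278      # Inc16, 8 threads

estimate_from_inc16 = 16 * inc16_8t
mp_gain_percent = (read_all_4t / read_all_1t - 1) * 100

print(f"peak {peak:.0f}  estimate {estimate_from_inc16}  "
      f"MP gain {mp_gain_percent:.0f}%  of peak {read_all_4t / peak:.0%}")
```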
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
Single Channel RAM, LPDDR3 666 MHz, 5.3 GB/second
ARM/Intel MP-BusSpd Benchmark V1.2 12-Aug-2015 16.13
Compiled for 32 bit ARM v7a
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 1849 2140 2079 2211 2270 2297
2T 3663 4252 4294 4400 4370 4580
4T 4630 5574 5691 5893 6015 6083
8T 5331 5775 6033 6622 7968 8023
122.9 1T 597 621 1119 1815 2135 2237
2T 869 943 1644 2992 3740 4412
4T 949 951 1922 3736 6468 7779
8T 948 978 1911 3717 6464 7542
12288 1T 123 174 344 678 1215 1840
2T 243 310 672 1332 2383 3974
4T 302 285 594 1282 2271 4606
8T 279 295 654 1198 2749 4660
Total Elapsed Time 12.8 seconds
ARM/Intel MP-BusSpd2 Benchmark V1.2 12-Aug-2015 16.14
Compiled for 32 bit ARM v7a
12.3 1T 1877 2124 2176 2266 2296 2343
2T 3625 4198 4341 4468 4536 4613
4T 5733 7541 8293 8830 8024 9042
8T 2985 3829 7438 6117 8108 8923
122.9 1T 604 625 1142 1846 2150 2284
2T 924 950 1793 3277 4270 4504
4T 962 989 1939 3765 6798 8862
8T 965 993 1933 3748 6651 8239
49152 1T 165 175 344 677 1285 1979
2T 234 238 482 961 1907 3547
4T 266 298 562 1224 2296 4478
8T 272 275 538 1098 2149 4282
Total Elapsed Time 48.8 seconds
ARM/Intel MP-BusSpd Benchmark V1.2 12-Aug-2015 16.17
Compiled for 64 bit ARM v8a
12.3 1T 3247 3895 4031 4182 4286 4367
2T 5676 7211 7771 8320 8539 7887
4T 10390 13919 14891 14949 15595 12977
8T 9693 12748 14246 14325 14434 16076
122.9 1T 577 575 1107 1884 2882 3568
2T 924 939 1827 3380 5554 6890
4T 959 972 1897 3659 6554 8508
8T 956 980 1913 3814 7206 11996
12288 1T 133 182 351 690 1381 2720
2T 309 282 669 1329 2625 5265
4T 281 286 715 1383 2614 5040
8T 303 341 670 1180 2303 4354
Total Elapsed Time 13.1 seconds
ARM/Intel MP-BusSpd2 Benchmark V1.2 12-Aug-2015 16.18
Compiled for 64 bit ARM v8a
12.3 1T 2610 2472 2586 2727 2748 5841
2T 4404 4681 4994 5369 5420 11297
4T 6546 8125 9105 10243 10319 20610
8T 3380 4023 7919 7146 9871 19852
122.9 1T 604 621 1110 1872 2446 5100
2T 919 948 1855 3433 4853 10037
4T 961 974 1984 3924 7491 14935
8T 963 942 1931 3915 7572 14689
49152 1T 173 177 340 692 1300 2653
2T 266 241 479 968 1883 3724
4T 304 277 556 1130 2126 4328
8T 279 278 544 1138 2179 4275
Total Elapsed Time 49.4 seconds
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel MP-BusSpd2 Benchmark V1.0 24-Jul-2015 15.59
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 2166 2774 3181 3307 3377 3263
2T 3924 5188 5207 5754 5759 5805
4T 7570 10011 10252 11165 11375 11777
8T 3510 4786 9011 8318 11351 11544
122.9 1T 383 409 359 558 663 983
2T 525 541 520 741 1241 1814
4T 739 752 753 1219 1590 2776
8T 735 741 753 1218 1607 2737
49152 1T 56 51 81 126 172 330
2T 65 67 107 196 335 620
4T 70 68 108 215 426 835
8T 70 68 109 215 428 851
Total Elapsed Time 48.2 seconds
MP-RandMem Benchmark - MP-RndMemi.apk
This is a multithreaded version of the RandMem Benchmark.
Probably because performance depends on the complex indexing used, A53 performance is mostly not much faster at 64 bits. At 32 bits, it is clearly faster than the Cortex-A9 with serial access, using L2 cache and RAM, but the latter is comparable on random reading and writing.
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel MP-RndMem Benchmark V1.2 12-Aug-2015 17.13
Compiled for 32 bit ARM v7a
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.29 1T 2894 2438 2887 2433
2T 5665 2402 5663 2403
4T 10922 2369 11100 2310
8T 10065 2293 10648 2265
122.9 1T 2681 2368 757 758
2T 5351 2360 1398 769
4T 10056 2308 2121 772
8T 8838 2351 1916 742
12288 1T 2309 1662 80 78
2T 3986 1683 164 73
4T 5419 1684 283 82
8T 4658 1694 279 82
ARM/Intel MP-RndMem Benchmark V1.2 12-Aug-2015 17.15
Compiled for 64 bit ARM v8a
12.29 1T 4445 3109 4455 3089
2T 8010 3100 8072 3105
4T 15909 3057 14711 3040
8T 14764 3036 14570 3037
122.9 1T 3457 2888 842 876
2T 6537 2924 1524 876
4T 11095 2892 2119 861
8T 11729 2916 2080 874
12288 1T 2475 1679 81 78
2T 4155 1713 163 73
4T 5503 1711 285 89
8T 4519 1717 281 89
Total Elapsed Time 48.1 seconds
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel MP-RndMem Benchmark V1.1 06-May-2015 11.59
MB/Second Using 1, 2, 4 and 8 Threads
KB SerRD SerRDWR RndRD RndRDWR
12.29 1T 3060 2001 2867 1904
2T 5459 1879 5463 1867
4T 10797 1852 10537 1856
8T 10090 1802 10608 1813
122.9 1T 968 823 588 547
2T 1749 785 902 618
4T 2716 812 1328 672
8T 2733 810 1407 673
12288 1T 329 274 90 82
2T 636 272 112 82
4T 849 271 128 82
8T 869 271 126 81
Total Elapsed Time 45.4 seconds
NEON-Linpack Benchmark - NEON-Linpacki.apk
This is identical to the Linpack Benchmark above, except the main calculations, in the performance dependent daxpy() function, were replaced using NEON intrinsic functions. These only operate on single precision floating point numbers.
Results from 32 bit and 64 bit compilations were similar, as the programs use identical intrinsic functions. The speed of the original 64 bit benchmark is also not that different. That version is compiled using fmadd, a scalar floating point fused multiply-add instruction, compared with the NEON vmla vector multiply accumulate (4 words at a time). See Assembly Code below.
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel NEON Linpack Benchmark V 1.2 13-Aug-2015
Compiled for 32b ARM v7a 64b ARM v8a 64b Above
SP MFLOPS 407 505 482
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel NEON Linpack Benchmark V 1.0 03-May-2015
SP MFLOPS 347
NeonSpeed Benchmark - NeonSpeedi.apk
The benchmark carries out the same calculations as
MemSpeed Benchmark,
repeating the standard single precision multiply/add and integer tests with two adds, for comparison with those via NEON intrinsic functions.
As with NEON-Linpack, many results from 32 bit and 64 bit compilations, via NEON instructions, were similar. NEON functions produced significant performance gains over normal code at 32 bits, up to around four times with floating point, but gains were limited to around 30% at 64 bits. NEON tests were also faster than those on the Cortex-A9, particularly from RAM, where floating point speeds were nearly four times higher.
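The NEON gains quoted here can be checked from the Norm and Neon columns of the results. The following is a small sketch of that arithmetic; the function name max_neon_gain is hypothetical and the code is not part of the benchmark.

```c
#include <assert.h>

/* Hypothetical helper: largest Neon/Norm speed ratio over a set of rows. */
double max_neon_gain(const double *norm_mbps, const double *neon_mbps, int rows)
{
    double best = 0.0;

    assert(norm_mbps != 0 && neon_mbps != 0 && rows > 0);
    for (int i = 0; i < rows; i++) {
        assert(norm_mbps[i] > 0.0);
        double gain = neon_mbps[i] / norm_mbps[i];
        if (gain > best)
            best = gain;
    }
    return best;
}
```

Fed with the Float Norm and Neon columns from the Tab 2 A8 tables, the largest ratio is at 16 KB in both cases: 3853 / 971, or nearly 4.0, for the 32 bit compilation and 4055 / 3054, or roughly 1.3, at 64 bits.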
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.32
Compiled for 32 bit ARM v7a
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 971 3853 1807 4059 3957 4397
32 970 3812 1800 3983 3891 4323
64 927 3228 1605 3038 3269 3521
128 926 3321 1681 3343 3354 3596
256 936 3386 1693 3449 3413 3667
512 898 2889 1578 2996 2927 3118
1024 794 1859 1345 2057 1996 1924
4096 794 1796 1250 1788 1813 1835
16384 792 1773 1270 1820 1829 1864
65536 796 1811 1289 1852 1832 1880
Total Elapsed Time 11.3 seconds
ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.37
Compiled for 64 bit ARM v8a
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 3054 4055 3605 4376 4911 5094
32 2922 3787 3435 4198 4546 4682
64 2795 3514 3259 3658 4050 4116
128 2886 3529 3373 3924 4148 3963
256 2883 3641 3264 3942 4193 4276
512 2454 3165 2985 3385 3586 3542
1024 1633 2000 1835 2043 2114 2105
4096 1738 1893 1899 1900 1956 1955
16384 1757 1870 1886 1802 1921 1846
65536 1755 1875 1870 1903 1936 1937
Max MFLOPS 764 1014
Total Elapsed Time 10.2 seconds
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel NeonSpeed Benchmark V1.1 09-May-2015 18.07
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 881 2440 2501 3334 3206 3465
32 901 1868 1705 2260 2083 2186
64 801 1395 1365 1573 1548 1581
128 784 1282 1278 1405 1389 1411
256 787 1279 1285 1420 1380 1409
512 777 1266 1267 1409 1370 1394
1024 604 786 762 769 770 828
4096 458 479 477 463 486 488
16384 436 447 448 469 470 469
65536 450 472 469 240 482 483
Total Elapsed Time 11.5 seconds
NEON-MFLOPS-MP Benchmark - NEON-MFLOPS2i-MP.apk
NEON-MFLOPS-MP benchmark is the same as
MP-MFLOPS,
except using NEON intrinsic functions for the calculations. For comparison purposes, single thread MP-MFLOPS results are included below (1TNN), with details of program source code and CPU assembly instructions used
below.
On the Tab 2 A8, the 32 bit NEON compilation was up to 3.2 times faster than the original MP-MFLOPS benchmark (1TNN), but the NEON source code carries out four times more calculations within the test loops. Results were also similar to those on the Nexus 7, except for RAM speed, measured at 12800 KB, where the Tab 2 excelled.
At 64 bits, the original MP-MFLOPS code was compiled with four way unrolling, using the same kind of vector instructions as the NEON version. At 32 operations per word, however, the original incurred heavy addressing overheads, using 10 vector registers, compared with 32 via NEON, leading to the latter being measured as twice as fast. In both cases, the instruction count was reduced by using fused multiply-add or multiply-subtract.
The NEON 64 bit compilation produced a small performance gain over 32 bit results at 2 operations per word, but nearly double the speed at 32 operations per word, where the 32 bit version suffers from having fewer registers available for the variables. Using one core, maximum speed was 2.77 GFLOPS, rising to 10.8 GFLOPS via four cores. The one core speed equates to just over two floating point operations per clock cycle. This is disappointing compared with Intel processors, from the Core 2 onwards, at 6 per clock cycle out of a maximum of 8, with SSE SIMD code
(See Linux results).
September 2015 - New best score from a 2 GHz Qualcomm Snapdragon 810 (Cortex-A57) running Android 5.0.2, at 64 bits. Performance, with 8 threads, is up to 23.6 GFLOPS, and nearly 3.5 results per clock cycle using one core.
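The results per clock cycle figures quoted above are simply the single core MFLOPS divided by the CPU clock in MHz. A minimal sketch of the calculation follows; the function name flops_per_cycle is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: floating point results per clock cycle on one core. */
double flops_per_cycle(double mflops, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return mflops / cpu_mhz;
}
```

Using the best one thread results at 32 operations per word, flops_per_cycle(2774, 1300) gives about 2.13 for the 1.3 GHz Cortex-A53 and flops_per_cycle(6943, 2000) gives about 3.47 for the 2 GHz Snapdragon 810.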
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 13-Aug-2015 16.35
Compiled for 32 bit ARM v7a
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 619 613 575 1444 1446 1426
2T 1174 1206 889 2894 2902 2839
4T 1585 1616 901 5679 5726 5596
8T 2075 2130 944 5400 5585 5519
Total Elapsed Time 25.8 seconds
1TNN 190 190 184 670 672 664
ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 13-Aug-2015 16.38
Compiled for 64 bit ARM v8a
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 726 745 647 2766 2774 2639
2T 1397 1402 903 5523 5552 5371
4T 1871 1930 898 10780 10479 10439
8T 2496 2876 1011 9736 10679 9900
Total Elapsed Time 15.1 seconds
1TNN 705 701 636 1398 1394 1362
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel NEON-MFLOPS2-MP Benchmark V2.1 13-May-2015 12.24
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 657 407 132 1077 1074 1053
2T 1265 817 222 2147 2150 2078
4T 2024 1695 234 4214 4276 3555
8T 2435 2495 234 4196 4100 3523
Total Elapsed Time 39.0 seconds
1TNN 188 156 116 598 578 574
####################################################
Quad-core 2 GHz Qualcomm Snapdragon 810, Android 5.0.2
ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 16-Sep-2015 17.59
Compiled for 64 bit ARM v8a
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 2811 3126 1089 6943 6589 6342
2T 2488 4114 1541 12084 10559 8809
4T 4759 5480 2038 16516 14826 11960
8T 4840 8985 2452 22082 23563 12461
Total Elapsed Time 7.6 seconds
NEON-Linpack-MP Benchmark - NEON-Linpacki-MP.apk
This version uses mainly the same C programming code as the single precision floating point NEON compilation
above.
It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. The code differences were slight changes to allow a higher level of parallelism. The initial 100x100 Linpack benchmark is only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for this is the same as 100x100, except users are allowed to use their own linear equation solver.
Unlike the NEON MP MFLOPS benchmark, which carries out the same multiply/add calculations, this program can run much slower using multiple threads. This is due to the overhead of creating and closing threads too frequently.
Note the difference between the unthreaded speeds and those using one thread.
Ignoring multiple thread speeds, the 32 bit version on the Tab 2 A8 is much faster than the Nexus 7 at N = 500 and 1000, due to the larger L2 cache and faster RAM.
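The size of the threading overhead can be estimated from the reported MFLOPS. The sketch below assumes the benchmark counts operations with the standard Linpack formula, 2n^3/3 + 2n^2, which is not confirmed here; the function names are hypothetical.

```c
#include <assert.h>

/* Standard Linpack operation count for an n x n system (assumed here). */
double linpack_flops(double n)
{
    assert(n > 0.0);
    return 2.0 * n * n * n / 3.0 + 2.0 * n * n;
}

/* Run time in milliseconds implied by a reported MFLOPS figure. */
double run_time_ms(double n, double mflops)
{
    assert(mflops > 0.0);
    return linpack_flops(n) / (mflops * 1.0e6) * 1.0e3;
}

/* Extra time in milliseconds of a threaded run over the unthreaded run. */
double threading_overhead_ms(double n, double mflops_none, double mflops_threaded)
{
    return run_time_ms(n, mflops_threaded) - run_time_ms(n, mflops_none);
}
```

For the 32 bit Tab 2 A8 results at N = 100, threading_overhead_ms(100, 460.74, 22.35) comes to roughly 29 milliseconds, against an unthreaded solution time of about 1.5 milliseconds. Under the same assumption, the extra time is a much smaller fraction of the run at N = 1000, which is consistent with the threaded speeds being much closer to the unthreaded ones there.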
MFLOPS 0 to 4 Threads, N 100, 500, 1000
####################################################
Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
ARM/Intel Linpack NEON SP MP Benchmark 1.2 13-Aug-2015 12.52
Compiled for 32 bit ARM v7a
Threads None 1 2 4
N 100 460.74 22.35 23.16 23.82
N 500 480.63 336.52 339.94 303.66
N 1000 470.02 405.86 403.01 405.98
ARM/Intel Linpack NEON SP MP Benchmark 1.2 13-Aug-2015 12.57
Compiled for 64 bit ARM v8a
Threads None 1 2 4
N 100 548.67 27.70 33.93 37.00
N 500 470.04 285.95 297.79 301.67
N 1000 519.02 441.84 443.47 441.91
####################################################
ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
ARM/Intel Linpack NEON SP MP Benchmark 14-May-2015 15.40
Threads None 1 2 4
N 100 385.49 28.79 29.06 29.25
N 500 272.07 184.85 183.70 183.18
N 1000 147.09 131.92 132.44 130.05
FFT Benchmarks - fft1.apk, fft3c.apk
The benchmarks run code for single and double precision Fast Fourier Transforms of size 1024 to 1048576 (1K to 1024K), each one being run three times to identify variance. Results are displayed and saved in a log file (FFT-tests.txt), with FFT running time in milliseconds.
Besides Android, the benchmarks are available to run via Windows and Linux.
Two versions are available: FFT1, the original version, and FFT3c, with optimised C code. Further details, results, and links for benchmarks and source code are in
FFTBenchmarks.htm.
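Since each FFT size is run three times to identify variance, one simple way to summarise a row of the log is the relative spread of its timings. This is a sketch of one possible measure, not something the benchmark itself reports; the function name relative_spread is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: (max - min) / min over repeated timings in ms. */
double relative_spread(const double *ms, int runs)
{
    assert(ms != 0 && runs > 0);
    double lo = ms[0];
    double hi = ms[0];
    for (int i = 1; i < runs; i++) {
        if (ms[i] < lo)
            lo = ms[i];
        if (ms[i] > hi)
            hi = ms[i];
    }
    assert(lo > 0.0);
    return (hi - lo) / lo;
}
```

Applied to the Kindle Fire HDX 7 single precision results in this section, the 1024K timings (315.283, 306.323, 307.409) give a spread of about 3%, whereas the 1K timings (0.155, 0.352, 1.341) differ by a factor of more than eight, so the shortest tests are far less repeatable.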
Below is an example of results.
Kindle Fire HDX 7, 2.2 GHz Quad Core Qualcomm Snapdragon 800
ARM/Intel FFT Benchmark 3c.0 08-Sep-2015 23.15
Compiled for 32 bit ARM v7a
Size milliseconds
K Single Precision Double Precision
1 0.155 0.352 1.341 0.087 0.073 0.073
2 0.812 0.814 0.750 0.201 0.187 0.251
4 1.751 1.658 1.776 0.414 0.405 0.443
8 3.712 1.083 1.065 0.930 0.899 0.890
16 2.880 3.356 2.430 2.579 2.658 2.380
32 6.124 6.541 5.605 5.907 6.070 5.681
64 13.430 12.566 12.774 13.792 13.556 13.997
128 30.737 27.408 27.132 33.318 33.088 33.071
256 64.472 63.394 64.690 73.288 72.546 72.786
512 153.609 150.383 156.046 155.788 156.304 163.178
1024 315.283 306.323 307.409 369.426 337.074 336.684
1024 Square Check Maximum Noise Average Noise
SP 9.999520e-01 3.346482e-06 4.565234e-11
DP 1.000000e+00 1.133294e-23 1.428110e-28
Total Elapsed Time 6.5 seconds
Assembly Code
Single Precision Floating Point Instructions 32 Bit Compile
fadds s13, s13, s15 Add
fmuls s13, s13, s14 Multiply
fmacs s15, s14, s23 Multiply and accumulate
fnmacs s15, s24, s2 Negated multiply and accumulate
fmscs s15, s24, s12 Multiply and subtract
NEON
vadd.f32 q10, q10, q8 Vector add
vmul.f32 q10, q10, q9 Vector multiply
vsub.f32 q7, q6, q7 Vector subtract
vmla.f32 q8, q10, q9 Vector multiply accumulate
Single Precision Floating Point Instructions 64 Bit Compile
fmadd s4, s0, s1, s4 Fused multiply-add
fadd v2.4s, v2.4s, v4.4s Add
fmul v2.4s, v2.4s, v3.4s Multiply
fmla v0.4s, v22.4s, v17.4s Fused multiply-add to accumulator
fmls v0.4s, v8.4s, v4.4s Fused multiply-subtract from accumulator
#####################################################################
MP-MFLOPS 2 Operations Per Word NEON-MFLOPS2i-MP 2 Operations Per Word
for(i=0; i < n; i++) Loop Functions
x[i] = (x[i]+a)*b; ptrx1
vld1q_f32
vst1q_f32
vaddq_f32 1
vmulq_f32 1
1 add, 1 multiply 4 add, 4 multiply
===========================================================================================
Main Assembly Code 32 Bit Main Assembly Code NEON 32 Bit
Code No.Ops Example 190 MFLOPS Code No.Ops Example 619 MFLOPS
add 1
cmp 1 cmp 1
bge 1 bge 1
b 1 b 1
adds 1 adds 1
flds 1 flds s13, [r3] vld1.32 1 vld1.32 {d20-d21}, [r1]
fstmias 1 fstmias r3!, {s13} vst1.32 1 vst1.32 {d20-d21}, [r1]
fadds 1 1 fadds s13, s13, s15 vadd.f32 1 4 vadd.f32 q10, q10, q8
fmuls 1 1 fmuls s13, s13, s14 vmul.f32 1 4 vmul.f32 q10, q10, q9
===========================================================================================
Main Assembly Code 64 Bit 4 way unroll Main Assembly Code NEON 64 Bit
Code No.Ops. Example 705 MFLOPS Code No.OPs Example 745 MFLOPS
add 1 cmp 1
cmp 1 bne 1
bhi 1 ldr 1 ldr q0, [x3]
ldr 1 ldr q2, [x5] str 1 str q0, [x3],16
str 1 str q2, [x5],16 fadd 1 4 fadd v0.4s, v0.4s, v2.4s
fadd 1 4 fadd v2.4s, v2.4s, v4.4s fmul 1 4 fmul v0.4s, v1.4s, v0.4s
fmul 1 4 fmul v2.4s, v2.4s, v3.4s
##########################################################################################
MP-MFLOPS 32 Operations Per Word NEON-MFLOPS2i-MP 32 Operations Per Word
for(i=0; i < n; i++) Loop Functions
x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f ptrx1
-(x[i]+g)*h+(x[i]+j)*k-(x[i]+l)*m vld1q_f32
+(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t vst1q_f32
-(x[i]+u)*v+(x[i]+w)*y; vaddq_f32 16
vmulq_f32 11
vsubq_f32 5
23 variables
16 add, 5 subtract, 11 multiply 64 add, 20 subtract, 44 multiply
===========================================================================================
Main Assembly Code 32 Bit Main Assembly Code NEON 32 Bit
Code No.Ops. Example 672 MFLOPS Code No.OPs Example 1446 MFLOPS
add 1
adds 1
b 1
cmp 1 bge 1
bge 1 cmp 1
b 1 vldr 18 vldr d18, [sp, #16]
adds 1 vld1.64 1 vld1.64 {d18-d19}, [sp:64]
flds 1 flds s14, [r2] vld1.32 1 vld1.32 {d16-d17}, [r2]
fstmias 1 fstmias r2!, {s15} vst1.32 1 vst1.32 {d12-d13}, [r2]
fadds 11 11 fadds s14, s14, s22 vadd.f32 16 64 vadd.f32 q6, q8, q9
fmuls 1 1 fmuls s15, s15, s10 vmul.f32 11 44 vmul.f32 q6, q6, q7
fmacs 5 10 fmacs s15, s14, s23 vsub.f32 5 20 vsub.f32 q7, q6, q7
fnmacs 4 8 fnmacs s15, s24, s2
fmscs 1 2 fmscs s15, s24, s12 16 q registers used
===========================================================================================
Main Assembly Code 64 Bit 4 way unroll Main Assembly Code NEON 64 Bit
Code No.Ops. Example 1398 MFLOPS Code No.OPs Example 2766 MFLOPS
orr 2
cmp 1
bcc 1
add 16
fmov 11 cmp 1
ins 11 bne 1
ldr 13 ldr q17, [x28] ldr 1 ldr q1, [x8]
str 4 str q8, [x28] str 1 str q0, [x8],16
fadd 11 44 fadd v9.4s, v16.4s, v17.4s fadd 11 44 fadd v17.4s, v21.4s, v1.4s
fmla 5 40 fmla v17.4s, v9.4s, v13.4s fmla 5 40 fmla v0.4s, v22.4s, v17.4s
fmls 5 40 fmls v17.4s, v8.4s, v9.4s fmls 5 40 fmls v0.4s, v8.4s, v4.4s
fmul 1 4 fmul v17.4s, v9.4s, v17.4s fmul 1 4 fmul v0.4s, v18.4s, v0.4s
10 Vector Registers used 32 Vector Registers used
##########################################################################################
LinpackSP2 NEON-Linpack
for (i = m; i < n; i = i + 4) for (i = m; i < n; i=i+4)
{ {
dy[i] = dy[i] + da*dx[i]; x41 = vld1q_f32(ptrx1);
dy[i+1] = dy[i+1] + da*dx[i+1]; y41 = vld1q_f32(ptry1);
dy[i+2] = dy[i+2] + da*dx[i+2]; r41 = vmlaq_f32(y41, x41, c41);
dy[i+3] = dy[i+3] + da*dx[i+3]; vst1q_f32(ptry1, r41);
} ptrx1 = ptrx1 + 4;
ptry1 = ptry1 + 4;
}
32 Bit Compilation 181 MFLOPS 32 Bit Compilation 407 MFLOPS
.L42: .L38:
cmp r1, r0 cmp r3, r0
add r3, r3, #16 bge .L33
add r2, r2, #16 vld1.32 {d20-d21}, [r2]
bge .L33 adds r3, r3, #4
flds s13, [r2, #-16] adds r2, r2, #16
flds s14, [r3, #-16] vld1.32 {d16-d17}, [r1]
fmacs s14, s15, s13 vmla.f32 q8, q10, q9
adds r1, r1, #4 vst1.32 {d16-d17}, [r1]
fsts s14, [r3, #-16] adds r1, r1, #16
flds s14, [r3, #-12] b .L38
flds s13, [r2, #-12]
fmacs s14, s15, s13
fsts s14, [r3, #-12]
flds s14, [r3, #-8]
flds s13, [r2, #-8]
fmacs s14, s15, s13
fsts s14, [r3, #-8]
flds s14, [r3, #-4]
flds s13, [r2, #-4]
fmacs s14, s15, s13
fsts s14, [r3, #-4]
b .L42
64 Bit Compilation 482 MFLOPS 64 Bit Compilation 505 MFLOPS
.L59: .L54:
add x0, x0, 16 ldr q1, [x1],16
add x1, x1, 16 ldr q0, [x3]
ldr s1, [x1,-16] fmla v0.4s, v2.4s, v1.4s
cmp x0, x3 str q0, [x3],16
ldr s4, [x0,-16] cmp x3, x0
ldr s2, [x0,-12] bne .L54
fmadd s4, s0, s1, s4
ldr s1, [x0,-8]
ldr s5, [x0,-4]
str s4, [x0,-16]
ldr s3, [x1,-12]
fmadd s3, s0, s3, s2
str s3, [x0,-12]
ldr s2, [x1,-8]
fmadd s2, s0, s2, s1
str s2, [x0,-8]
ldr s1, [x1,-4]
fmadd s1, s0, s1, s5
str s1, [x0,-4]
bne .L59
#####################################################################
NeonSpeed Normal NeonSpeed NEON
for (m=0; m < ks; m=m+4) for(i=0; i < size/16; i++)
{ {
xs[m] = xs[m] + sums * ys[m]; x41 = vld1q_f32(ptrx1);
xs[m+1] = xs[m+1] + sums * ys[m+1]; x42 = vld1q_f32(ptrx2);
xs[m+2] = xs[m+2] + sums * ys[m+2]; x43 = vld1q_f32(ptrx3);
xs[m+3] = xs[m+3] + sums * ys[m+3]; x44 = vld1q_f32(ptrx4);
} y41 = vld1q_f32(ptry1);
y42 = vld1q_f32(ptry2);
y43 = vld1q_f32(ptry3);
y44 = vld1q_f32(ptry4);
z41 = vmlaq_f32(x41, y41, c4);
z42 = vmlaq_f32(x42, y42, c4);
z43 = vmlaq_f32(x43, y43, c4);
z44 = vmlaq_f32(x44, y44, c4);
vst1q_f32(ptrx1, z41);
vst1q_f32(ptrx2, z42);
vst1q_f32(ptrx3, z43);
vst1q_f32(ptrx4, z44);
ptrx1 = ptrx1 + 16;
ptry1 = ptry1 + 16;
ptrx2 = ptrx2 + 16;
ptry2 = ptry2 + 16;
ptrx3 = ptrx3 + 16;
ptry3 = ptry3 + 16;
ptrx4 = ptrx4 + 16;
ptry4 = ptry4 + 16;
}
32 Bit Compilation 243 MFLOPS 32 Bit Compilation 963 MFLOPS
.L53: .L24:
cmp r1, r9 cmp r2, r3
add r3, r3, #16 add r8, r1, #48
add r2, r2, #16 add ip, r1, #32
bge .L102 add r7, r1, #16
flds s14, [r2, #-16] add r6, r0, #48
flds s15, [r3, #-16] add r5, r0, #32
fmacs s15, s14, s18 add r4, r0, #16
flds s14, [r2, #-12] bge .L26
adds r1, r1, #4 vld1.32 {d24-d25}, [r0]
fsts s15, [r3, #-16] adds r2, r2, #1
flds s15, [r3, #-12] vld1.32 {d6-d7}, [r1]
fmacs s15, s14, s18 adds r1, r1, #64
flds s14, [r2, #-8] vmla.f32 q12, q3, q8
fsts s15, [r3, #-12] vld1.32 {d22-d23}, [r4]
flds s15, [r3, #-8] vld1.32 {d20-d21}, [r5]
fmacs s15, s14, s18 vld1.32 {d18-d19}, [r6]
flds s14, [r2, #-4] vld1.32 {d30-d31}, [r7]
fsts s15, [r3, #-8] vmla.f32 q11, q15, q8
flds s15, [r3, #-4] vld1.32 {d28-d29}, [ip]
fmacs s15, s14, s18 vmla.f32 q10, q14, q8
fsts s15, [r3, #-4] vld1.32 {d26-d27}, [r8]
b .L53 vmla.f32 q9, q13, q8
vst1.32 {d24-d25}, [r0]
adds r0, r0, #64
vst1.32 {d22-d23}, [r4]
vst1.32 {d20-d21}, [r5]
vst1.32 {d18-d19}, [r6]
b .L24
64 Bit Compilation 764 MFLOPS 64 Bit Compilation 1014 MFLOPS
.L49: .L16:
ldr q1, [x2],16 mov x3, x1
add w0, w0, 1 ldr q4, [x0]
ldr q0, [x1] add x6, x0, 16
cmp w0, w26 add x5, x0, 32
fmla v0.4s, v1.4s, v2.4s ldr q5, [x3],16
str q0, [x1],16 add x7, x1, 32
bcc .L49 ldr q3, [x6]
add x4, x0, 48
add x2, x1, 48
ldr q2, [x5]
ldr q7, [x7]
add x1, x1, 64
ldr q1, [x4]
ldr q6, [x3]
fmla v4.4s, v5.4s, v0.4s
ldr q5, [x2]
fmla v2.4s, v7.4s, v0.4s
str q4, [x0]
add x0, x0, 64
fmla v3.4s, v6.4s, v0.4s
cmp x0, x8
fmla v1.4s, v5.4s, v0.4s
str q3, [x6]
str q2, [x5]
str q1, [x4]
bne .L16
Roy Longbottom January 2016
The Official Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection