Roy Longbottom's Android 64 Bit Benchmarks

For latest results see Android Benchmarks For 32 Bit and 64 Bit CPUs from ARM, Intel and MIPS.

Contents


General Logged Configuration Whetstone Benchmark
Dhrystone Benchmark Linpack Benchmark Livermore Loops Benchmark
MemSpeed Benchmark BusSpeed Benchmark RandMem Benchmark
MP-MFLOPS Benchmarks MP-MFLOPS Benchmark Results MP-Whetstone Benchmark
MP-Dhrystone Benchmark MP-BusSpeed Benchmark MP-RandMem Benchmark
NEON-Linpack Benchmark NeonSpeed Benchmark NEON-MFLOPS-MP Benchmark
NEON-Linpack-MP Benchmark FFT Benchmarks
Assembly Code

Download Benchmark Apps


A Settings, Security option may need changing to allow installation of non-Market applications.

 NativeWhetstone2.apk     First standard benchmark
 Dhrystone2i.apk          First integer benchmark
 LinpackDP2.apk           First computational benchmark
 LinpackSP2.apk           Single precision Linpack
 LivermoreLoops2.apk      First supercomputer benchmark
 MemSpeedi.apk            Floating Point Cache and RAM Test
 BusSpeedv7i.apk          Integer Bus, Cache and RAM Test
 RandMemi.apk             Random/Serial Access Cache and RAM Test
 MP-MFLOPSi.apk           CPU, Cache, RAM MFLOPS Test
 MP-MFLOPS2i.apk          Long Running MP-MFLOPS
 MP-WHETSi.apk            Whetstone Floating and Fixed Point Tests
 MP-Dhryi.apk             Dhrystone Integer Benchmark
 MP-BusSpdi.apk           Multithreaded BusSpeed Benchmark
 MP-RndMemi.apk           Multithreaded RandMem Benchmark
 NEON-Linpacki.apk        Linpack Benchmark using ARM NEON Intrinsic Functions
 NeonSpeedi.apk           NEON Memory Speed Test Using Intrinsic Functions
 NEON-MFLOPS2i-MP.apk     MP-MFLOPS using ARM NEON Intrinsic Functions
 NEON-Linpacki-MP.apk     Linpack MP Benchmark using NEON Intrinsic Functions
 MP-BusSpd2i.apk          Long running version with staggered start
 fft1.apk                 Original FFT Benchmark
 fft3c.apk                Optimised FFT Benchmark




All the above were produced using gcc 4.8, via Eclipse, running under Linux Ubuntu 14.04. They are compiled to run on both 32 bit and 64 bit CPUs from ARM, Intel and MIPS, automatically selected at run time. Downloads are identical to those in Android Native ARM-Intel Benchmarks.htm.

General

As indicated above, the benchmarks, downloadable from here, were compiled for both 32 bit and 64 bit operation. The purpose of this document is to report results from running at 64 bits and to provide comparisons with those at 32 bits, including the latter also compiled by gcc 4.8. Eclipse (or Android Studio?) projects for the new compilations are included in Android Intel-ARM Benchmarks.zip.

The tablet used was a Lenovo Tab 2 A8-50, 8 Inch Tablet, with a 1.3 GHz MediaTek mt8161 quad core processor (64 bit ARM Cortex-A53) and Android 5.0.2. After initially proving that the benchmarks were in 64 bit mode, new versions are being produced that indicate the mode, in this case for 64 or 32 bit ARM, as in the options below:

             Compiled for 32 bit ARM v7a     Compiled for 64 bit ARM v8a
             Compiled for 32 bit Intel x86   Compiled for 64 bit Intel x86_64
             Compiled for 32 bit Mips CPU    Compiled for 64 bit Mips CPU
For comparison on other systems, the 32 bit apps are provided in Android Intel-ARM 32 Bit Benchmarks.zip. Note that these will overwrite the 64 bit app installation.
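As an illustration of how a benchmark could report the mode strings listed above, the following is a minimal sketch that selects the label at compile time from predefined compiler macros. The function name compiled_for() is hypothetical and this is not necessarily how the benchmarks themselves do it; the macros are those normally predefined by gcc and clang, and __arm__ covers any 32 bit ARM target, not only v7a.

```c
/* Hypothetical helper: returns a mode label like those in the table above.
   The label is fixed when each variant of the native code is compiled. */
static const char *compiled_for(void)
{
#if defined(__aarch64__)
    return "Compiled for 64 bit ARM v8a";
#elif defined(__arm__)
    /* Any 32 bit ARM target; assumed here to be the v7a build. */
    return "Compiled for 32 bit ARM v7a";
#elif defined(__x86_64__)
    return "Compiled for 64 bit Intel x86_64";
#elif defined(__i386__)
    return "Compiled for 32 bit Intel x86";
#elif defined(__mips64)
    /* Checked before __mips__, which is also defined on 64 bit MIPS. */
    return "Compiled for 64 bit Mips CPU";
#elif defined(__mips__)
    return "Compiled for 32 bit Mips CPU";
#else
    return "Compiled for unknown CPU";
#endif
}
```

A log routine would simply print the returned string under the benchmark title.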

At the time of starting this report (July 2015), other benchmarks indicated similar performance to a Nexus 7 32 bit tablet. Results for this are included below and they suggest that only 32 bit benchmarks were being used, when 64 bit varieties can be much faster.

Results below are for the revised benchmarks, that indicate which section of the code is used. The new projects are included in the zip file and others will follow.



Logged Configuration

In line with other Android benchmarks in Android Benchmarks.htm, the programs identify system information, in this case the following for the Tab 2 A8-50. Strangely, only three CPU cores are reported.


 System Information
 Device LENOVO Lenovo TAB 2 A8-50F
 Screen pixels w x h 800 x 1216
 Android Build Version      5.0.2

 Processor : AArch64 Processor rev 3 (aarch64)
 processor : 0
 BogoMIPS : 26.00

 processor : 1
 BogoMIPS : 26.00

 processor : 2
 BogoMIPS : 26.00

 Features : fp asimd aes pmull sha1 sha2 crc32
 CPU implementer : 0x41
 CPU architecture: AArch64
 CPU variant : 0x0
 CPU part : 0xd03
 CPU revision : 3

 Hardware : MT8161

 Linux version 3.10.65 (jenkins@ubuntu12) (gcc version 4.9 20140514
 (mtk-20150408) (GCC) ) #1 SMP PREEMPT Fri Jun 19 11:01:08 CST 2015


   



Whetstone Benchmark - NativeWhetstone2.apk

This provides an overall rating in MWIPS, plus separate results for the eight test procedures in MFLOPS (floating point) and MOPS (functions and integer). For full details and results via Windows, Linux, Android and different programming languages, see Whetstone Benchmark Results on PCs. On the latest CPUs, running time largely depends on the COS and EXP function tests. This is highlighted in the examples below.

As with the next four benchmarks, tests comprised the original, from an earlier compiler, then gcc 4.8 separate compilations for 32 bit and 64 bit CPUs, finally the one produced covering all ARM, Intel and MIPS based systems. The latter shows that this tablet uses the 64 bit code option.

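In this case, most tests indicated that the later compilations, and 64 bit operation, provided no performance gains, but the Cortex-A53 was somewhat faster than the Nexus 7 Cortex-A9 CPU. The exceptions were the EXP test, which was much slower with the 32 bit gcc 4.8 compilation (5.4 against 19.8 MOPS), reducing the MWIPS rating, and the EQUAL test, which was twice as fast at 64 bits.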



 Version               MWIPS  ------MFLOPS-------   ------------MOPS--------------
                                1      2      3     COS   EXP  FIXPT      IF  EQUAL
 1300 MHz Cortex-A53
 Original  32 bit     1433.7  348.0  319.3  308.2  36.3  19.8 1551.4  1861.9  611.0
 ARM/Intel 32 bit      834.7  348.9  312.7  310.9  36.7   5.4 1556.7  1867.2  570.5
 ARM/Intel 64 bit     1504.4  348.8  304.9  309.3  38.2  20.5 1536.4  1862.0 1242.4
 ARM/Intel 32/64 bit  1494.2  347.1  307.0  305.9  37.5  20.6 1552.2  1863.7 1239.1
 
 1200 MHz Cortex-A9
 Original  32 bit     1115.0  271.3  250.7  256.4  25.8  14.6 1190.0  1797.0 1198.7
 ARM/Intel 32/64 bit   731.1  273.6  253.0  252.8  28.0   5.0 1185.2  2383.4 1192.1
   



Dhrystone Benchmark - Dhrystone2i.apk

The Dhrystone integer benchmark produces a performance rating in Vax MIPS (AKA DMIPS). Further details of the Dhrystone benchmark, and results from Windows and Linux based PCs, can be found in Dhrystone Results.htm. The ratio MIPS/MHz is often quoted, but this depends on compiler optimisation (or over-optimisation).
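The two ratings can be reproduced from the raw score with a little arithmetic. The sketch below assumes the customary reference of 1757 Dhrystones per second for the VAX 11/780 (1 MIPS), which is consistent with the MP-Dhrystone log later in this document; the function names are illustrative only.

```c
#include <assert.h>

/* Dhrystones per second of the VAX 11/780, the customary 1 MIPS reference. */
#define VAX_DHRYSTONES_PER_SECOND 1757.0

/* Convert a raw Dhrystones per second score into a Vax MIPS (DMIPS) rating. */
static double vax_mips(double dhrystones_per_second)
{
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND;
}

/* The often quoted MIPS/MHz ratio. */
static double mips_per_mhz(double mips, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return mips / cpu_mhz;
}
```

For example, vax_mips(2481286) gives about 1412 and mips_per_mhz(1683, 1300) gives about 1.29, matching figures reported in the logs.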

The 32 bit gcc 4.8 compilations were slower than the original and similar to the Nexus 7, but the 64 bit version was significantly faster using the Cortex-A53.


 Version                   Vax       MIPS
                          MIPS       /MHz
 1300 MHz Cortex-A53
 Original  32 bit         1683       1.29
 ARM/Intel 32 bit         1423       1.09
 ARM/Intel 64 bit         2549       1.96
 ARM/Intel 32/64 bit      2569       1.98
 
 1200 MHz Cortex-A9
 Original  32 bit         1610       1.34
 ARM/Intel 32/64 bit      1317       1.10
   



Linpack Benchmark - LinpackDP2.apk, LinpackSP2.apk

The Linpack benchmark speed is measured in MFLOPS, officially for double precision floating point calculations. A version was produced using NEON functions (see later) that only provides single precision operation. So, for comparison purposes, an available C code option, to define single precision data, was used to produce a new version and this has usually led to a higher MFLOPS speed. Results from various hardware and software platforms can be found in Linpack Results.htm.

Performance of the Linpack benchmark is almost entirely dependent on the calculation x[i]=x[i]+c*y[i], in the daxpy() function. The ARM compilations generated floating point multiply and accumulate instructions for this, such as fmacd d6, d7, d5 at 32 bits and fmadd d1, d0, d1, d5, using four registers, at 64 bits.
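The inner calculation can be sketched in C as follows. This is a minimal illustration using the notation of the paragraph above, not the benchmark's actual daxpy() source (which also handles strides and unrolling); the names daxpy_sketch and daxpy_flops are hypothetical.

```c
#include <stddef.h>

/* Multiply and accumulate over n elements: x[i] = x[i] + c * y[i].
   A compiler can map each iteration to one fused multiply-add. */
static void daxpy_sketch(size_t n, double c, double *x, const double *y)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = x[i] + c * y[i];
    }
}

/* Each element costs one multiply and one add, so 2 floating point operations. */
static double daxpy_flops(size_t n)
{
    return 2.0 * (double)n;
}
```

Dividing daxpy_flops(n) by the measured time in microseconds gives a speed in MFLOPS.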

In this case, 64 bit operation increased speed by almost 2 times with double precision calculations and 2.7 times at single precision. Performance at 32 bits was similar to that on the Nexus 7.

September 2015 - New best score from a 2 GHz Qualcomm Snapdragon 810 (Cortex-A57) with Android 5.0.2, with SP speed of 1277 MFLOPS at 64 bits.

Below is the general output produced, indicating different (but probably acceptable) numeric results of computation in the various modes of operation.


 32 bit DP compilation             32 bit SP compilation

 norm. resid                 1.7                     1.6
 resid            7.41628980e-14          3.80277634e-05
 machep           2.22044605e-16          1.19209290e-07
 x[0]-1          -1.49880108e-14         -1.38282776e-05
 x[n-1]-1        -1.89848137e-14         -7.51018524e-06

 64 bit DP compilation             64 bit SP compilation

 norm. resid                 1.9                     2.0
 resid            8.46778499e-14          4.69621336e-05
 machep           2.22044605e-16          1.19209290e-07
 x[0]-1          -1.11799459e-13         -1.31130219e-05
 x[n-1]-1        -9.60342916e-14         -1.30534172e-05


 Version                LinpackDP  LinpackSP
                          MFLOPS     MFLOPS
 1300 MHz Cortex-A53
 Original  32 bit         156.70     184.09
 ARM/Intel 32 bit         172.28     180.64
 ARM/Intel 64 bit         337.97     473.10
 ARM/Intel 32/64 bit      340.18     482.43
 
 2000 MHz Cortex-A57
 ARM/Intel 64 bit                   1277.76
 
 1200 MHz Cortex-A9
 Original  32 bit         151.05     201.30
 ARM/Intel 32/64 bit      159.34     199.84
 
    



Livermore Loops Benchmark - LivermoreLoops2.apk

The Livermore Loops comprise 24 kernels of numerical application with speeds calculated in MFLOPS. A summary is also produced, with maximum, minimum and various mean values, geometric mean being the official average. As with the other benchmarks here, details and results are provided, in this case in Livermore Loops Results.htm.

Summary results, below, indicate similar Cortex-A53 and Cortex-A9 speeds at 32 bits, and the former faster at 64 bits. This is followed by MFLOPS for each of the 24 test functions, where 64 bit/32 bit performance ratios vary between 0.8 and 7.9 times, with a geometric mean ratio of 1.5. The identified numeric results are also shown, and they can again be slightly different.
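The range of 64 bit/32 bit ratios quoted above can be checked from the per-loop MFLOPS in the logs that follow. The sketch below copies those 24 values for the 32 bit run and the 32/64 bit run (which uses the 64 bit code on this tablet) and finds the smallest and largest ratio. It assumes the values are listed in loop order, row by row; the type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define LOOPS 24

/* Per-loop MFLOPS copied from the A53 logs in this section. */
static const double mflops_32[LOOPS] = {
    163.4, 243.4, 272.1, 270.3, 109.5, 111.2,
    282.2, 389.0, 360.6, 219.6, 124.0,  61.8,
     67.6,  87.4,  27.3, 224.2, 340.1, 241.9,
    168.5, 198.8, 120.2, 120.6, 277.7,  79.1
};
static const double mflops_64[LOOPS] = {
    451.4, 191.4, 243.2, 272.4, 144.9, 144.5,
    749.4, 411.1, 453.6, 261.1, 138.0, 206.1,
    122.5, 130.1, 215.0, 249.8, 411.6, 395.4,
    241.7, 248.1, 152.8, 118.7, 317.2, 103.7
};

typedef struct {
    double min, max;             /* smallest and largest ratio */
    size_t min_index, max_index; /* zero-based positions of those ratios */
} RatioRange;

/* Scan num[i] / den[i] for i in 0..n-1 and record the extremes. */
static RatioRange ratio_range(const double *num, const double *den, size_t n)
{
    assert(n > 0);
    RatioRange r = { num[0] / den[0], num[0] / den[0], 0, 0 };
    for (size_t i = 1; i < n; i++) {
        double q = num[i] / den[i];
        if (q < r.min) { r.min = q; r.min_index = i; }
        if (q > r.max) { r.max = q; r.max_index = i; }
    }
    return r;
}
```

Calling ratio_range(mflops_64, mflops_32, LOOPS) gives a minimum near 0.8 at the second value and a maximum near 7.9 at the fifteenth, in line with the range quoted in the text.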

      
  Version                 Max  Average  Geomean Harmean   Min

  1300 MHz Cortex-A53 MFLOPS
  Original  32 bit       371.5   192.4   171.9   151.8    67.1
  ARM/Intel 32 bit       393.4   188.3   158.3   124.6    27.1
  ARM/Intel 64 bit       781.4   265.4   231.8   205.6    98.1
  ARM/Intel 32/64 bit    772.2   265.9   232.5   206.3    97.8
 
  1200 MHz Cortex-A9 MFLOPS
  Original  32 bit       391.9   202.1   181.3   160.9    68.1
  ARM/Intel 32/64 bit    396.6   207.6   175.6   136.1    26.8
  
  A53 ARM/Intel 32 bit    
  MFLOPS for 24 loops Do Span 471
   163.4   243.4   272.1   270.3   109.5   111.2
   282.2   389.0   360.6   219.6   124.0    61.8
    67.6    87.4    27.3   224.2   340.1   241.9
   168.5   198.8   120.2   120.6   277.7    79.1

   Results of last two calculations
   4.850340602749970e+02  1.300000000000000e+01

  
  A53 ARM/Intel 32/64 bit 
  MFLOPS for 24 loops Do Span 471
   451.4   191.4   243.2   272.4   144.9   144.5
   749.4   411.1   453.6   261.1   138.0   206.1
   122.5   130.1   215.0   249.8   411.6   395.4
   241.7   248.1   152.8   118.7   317.2   103.7

   Results of last two calculations
   4.850340602751729e+02  1.300000000000000e+01

   



MemSpeed Benchmark - MemSpeedi.apk

MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers via two data arrays. It uses data volumes of 4 KBytes upwards to indicate performance via caches and RAM. A version was produced to run under Linux with a variation of calculations (mainly to use via OpenMP). The Android benchmark is the same as this, with fewer tests, but still reflects cache and RAM speeds.

The 64 bit compilation was nearly twice as fast as the 32 bit version with double precision floating point calculations, using cached data, and provided a 33% increase from RAM. Corresponding single precision ratios were 2.6 and 2.0 times and integer ratios of 2.2 and 1.5. For floating point, the C program loop has four each of loads, stores, multiplies and adds, where the latter two are linked or fused into one function. At 64 bits, vector SIMD instructions were produced, leading to 2 each at double precision and 1 each at single precision (4 words in registers). With 32 bit integers, 8 scalar adds were reduced to two vector adds.

At 32 bits, the Nexus 7 was slightly faster using L1 cache, but the A8 gains averaged nearly 1.4 times from L2 cache and 2.4 times with RAM based data.

Tab 2 A8 64 bit maximum MFLOPS were somewhat faster than with similar calculations in the Linpack benchmark, but a modern Intel CPU could be three times faster, at that CPU MHz, using SSE type SIMD instructions.
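The floating point test and the MFLOPS figure can be sketched as below. The loop is a minimal illustration of x[m]=x[m]+s*y[m] unrolled by four, not the benchmark source. The conversion assumes that the reported MBytes/second counts both arrays being read, so each result reads two words and performs two floating point operations; this assumption reproduces the "Max MFLOPS 512 624" line in the 64 bit log. Function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Four multiply-add results per pass; n must be a multiple of 4. */
static void triad_double(size_t n, double s, double *x, const double *y)
{
    assert(n % 4 == 0);
    for (size_t m = 0; m < n; m += 4) {
        x[m]     = x[m]     + s * y[m];
        x[m + 1] = x[m + 1] + s * y[m + 1];
        x[m + 2] = x[m + 2] + s * y[m + 2];
        x[m + 3] = x[m + 3] + s * y[m + 3];
    }
}

/* Assumed conversion: two words read and two operations per result,
   so MFLOPS works out as MBytes/second divided by bytes per word. */
static double mflops_from_mbytes(double mbytes_per_second, size_t bytes_per_word)
{
    assert(bytes_per_word > 0);
    double mwords_read = mbytes_per_second / (double)bytes_per_word;
    double mresults = mwords_read / 2.0;   /* x[m] and y[m] per result */
    return mresults * 2.0;                 /* one multiply, one add */
}
```

With the best 64 bit speeds, mflops_from_mbytes(4092, 8) gives 511.5 for double precision and mflops_from_mbytes(2496, 4) gives 624 for single precision.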


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
 
 ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.16
           Compiled for 32 bit ARM v7a

              Reading Speed in MBytes/Second
  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]
  KBytes   Dble   Sngl    Int   Dble   Sngl    Int

      16   1940    971   1693   2470   1278   2084 L1
      32   1879    955   1676   2378   1255   1967
      64   1801    938   1615   2254   1218   1912 L2
     128   1706    941   1620   2279   1224   1872
     256   1818    935   1570   2291   1155   1875
     512   1633    884   1451   2008   1132   1704
    1024   1276    781   1181   1454    938   1324 RAM
    4096   1335    808   1260   1533   1010   1386
   16384   1342    813   1270   1487   1013   1419
   65536   1346    809   1274   1546   1031   1252

          Total Elapsed Time   11.7 seconds

 ARM/Intel MemSpeed Benchmark 1.2 05-Aug-2015 17.29
           Compiled for 64 bit ARM v8a

              Reading Speed in MBytes/Second
  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]
  KBytes   Dble   Sngl    Int   Dble   Sngl    Int

      16   4092   2198   3951   5293   3611   4408
      32   3753   2496   3630   4651   3300   3992
      64   3407   2388   3368   3715   3023   3677
     128   3496   2462   3521   4137   3139   3844
     256   3535   2481   3573   4199   3322   3911
     512   3054   2248   3126   3556   2548   3372
    1024   1714   1704   2029   2069   1854   2099
    4096   1832   1595   1841   1914   1780   1897
   16384   1844   1601   1850   1925   1798   1891
   65536   1859   1608   1837   1921   1795   1812

          Total Elapsed Time   10.2 seconds

Max MFLOPS  512    624


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
 
 ARM/Intel MemSpeed Benchmark 1.1 25-Apr-2015 12.24

              Reading Speed in MBytes/Second
  Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]
  KBytes   Dble   Sngl    Int   Dble   Sngl    Int

      16   1856   1019   2537   2913   1459   2544
      32   1416    832   1327   1508    920   1345
      64   1286    779   1198   1418    908   1296
     128   1282    781   1195   1424    912   1305
     256   1278    774   1190   1433    878   1298
     512   1197    752   1122   1340    862   1216
    1024    833    626    822    903    695    857
    4096    463    420    456    463    440    459
   16384    459    426    453    455    435    458
   65536    463    430    411    462    443    452

          Total Elapsed Time   11.5 seconds
   



BusSpeed Benchmark - BusSpeedv7i.apk

BusSpeed Benchmark is particularly designed to identify reading data in bursts over buses. The program starts by reading a word, with address increments of 32 words for the next data. The increment is reduced to 16 words, then halved repeatedly until all data is read. In this case, maximum speed can be estimated as 16 times the MB/second measured at 16 word increments. Normally, an MP version is needed for maximum throughput.
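The access pattern can be sketched as a read loop with a variable word increment. This is a minimal illustration, not the benchmark source: combining the words with a bitwise AND is an assumption made here so that the reads cannot be optimised away, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read every inc-th 4 byte word and return how many words were read.
   The AND of the words read is stored in *and_out. */
static size_t read_with_increment(const uint32_t *data, size_t words,
                                  size_t inc, uint32_t *and_out)
{
    assert(inc > 0);
    uint32_t acc = 0xFFFFFFFFu;
    size_t touched = 0;
    for (size_t i = 0; i < words; i += inc) {
        acc &= data[i];
        touched++;
    }
    *and_out = acc;
    return touched;
}

/* Estimate of maximum speed as described in the text, for example
   16 times the MB/second measured at 16 word increments. */
static double estimated_max_mbps(double mbps_at_increment, unsigned increment_words)
{
    return mbps_at_increment * (double)increment_words;
}
```

A full run would call read_with_increment() with increments of 32, 16, 8, 4, 2 and 1, timing each. With the 64 bit RAM result of 174 MB/second at Inc16, estimated_max_mbps(174, 16) gives 2784 MB/second, as in the log.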

Other than identifying burst data transfers, the final column, reading all data, is the major performance guide. Here, 64/32 bit comparison ratios were up to 2.0 from L1 cache, 1.5 from L2 cache and 1.25 from RAM.

At 32 bits, the Lenovo A8 was slower than the Nexus 7 on L1 cache based data, but the position was reversed on L2 cache tests, and particularly on RAM data transfers.


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
  Single Channel RAM, LPDDR3 666 MHz, 5.3 GB/second
 
 ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 10.57
           Compiled for 32 bit ARM v7a

    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16    874    932   1814   2302   2355   2263 L1
      32    758    803   1309   1820   2323   2386
      64    653    671   1203   1741   2206   2332 L2
     128    603    620   1107   1693   2222   2351
     256    574    589   1075   1711   2211   2327
     512    332    372    681   1075   1863   2120
    1024    137    193    371    578   1322   2129 RAM
    4096    172    179    351    567   1151   2126
   16384    172    178    351    504   1117   2136
   65536    172    177    349    478    882   2129

          Total Elapsed Time    5.3 seconds

 ARM/Intel BusSpeed Benchmark 1.2 06-Aug-2015 11.02
           Compiled for 64 bit ARM v8a

    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3188   3635   3937   4327   4372   4462
      32   1478   1607   2246   3382   3853   4144
      64    600    622   1163   2011   2972   3585
     128    558    575   1056   1889   2892   3525
     256    538    550   1028   1826   2837   3260
     512    371    425    813   1490   2403   3202
    1024    136    196    382    728   1423   2750
    4096    170    177    346    669   1340   2652
   16384    169    174    341    678   1352   2663
   65536    168    174    341    676   1347   2611

          Total Elapsed Time    5.2 seconds

   Estimated maximum = 16 x 174 = 2784 MB/second


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
 
 Android BusSpeed Benchmark 19-Oct-2012 17.29

    Reading Speed 4 Byte Words in MBytes/Second
  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   2723   2420   3044   3364   3499   3500 L1
      32   1054   1087   1061   1382   1565   2145
      64    436    433    419    652    751   1160 L2
     128    345    337    337    542    633    943
     256    329    309    322    522    614    961
     512    339    299    311    506    574    937
    1024    170    168    180    269    349    629
    4096     59     55     84    127    176    338 RAM
   16384     56     56     83    125    173    335
   65536     56     56     82    125    174    334

          Total Elapsed Time    5.7 seconds
   



RandMem Benchmark - RandMemi.apk

RandMem benchmark carries out four tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests using 32 bit integers. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is not used, and by the size of preceding caches. For more details and further results see RandMem in Android Benchmarks.htm.
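The idea of using the same program structure for serial and random access can be sketched with an index array: the read loop is identical and only the index contents differ. This is a minimal illustration under that assumption, not the benchmark source; the shuffle uses a small linear congruential generator chosen here for brevity, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Serial order: idx[i] = i. */
static void fill_serial(uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        idx[i] = (uint32_t)i;
    }
}

/* Fisher-Yates shuffle driven by a simple LCG, giving a random visiting order. */
static void shuffle(uint32_t *idx, size_t n, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = n; i > 1; i--) {
        state = state * 1664525u + 1013904223u;
        size_t j = (size_t)(state >> 16) % i;
        uint32_t t = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = t;
    }
}

/* The same loop serves both tests; every word is read exactly once. */
static uint32_t read_indexed(const uint32_t *data, const uint32_t *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[idx[i]];
    }
    return sum;
}
```

Because the shuffled index is a permutation, both orders return the same sum, so any timing difference comes from the memory access pattern alone.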

This program uses quite complex memory address indexing and Tab A8 32 bit and 64 bit versions were not that different overall, each one slightly faster on some tests.

At 32 bits, the A8 had the L2 cache and RAM speed advantages over the Nexus 7 on serial reading and writing but, on random access, the latter’s larger L2 cache led to faster speeds at the larger cache based data sizes and also affected RAM data transfer speeds.


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel RandMem Benchmark 1.2 06-Aug-2015 12.29
           Compiled for 32 bit ARM v7a

    MBytes/Second Transferring 4 Byte Words
   Memory     Serial.......     Random.......
   KBytes     Read   Rd/Wrt     Read   Rd/Wrt

       16     2807     3606     2753     3595 L1
       32     2719     3433     1429     1930
       64     2615     3266      914     1166 L2
      128     2592     3243      705      828
      256     2570     3223      637      720
      512     2367     2684      237      347
     1024     2137     1855      120      163 RAM
     4096     1918     1658       83       97
    16384     2152     1665       74       85
    65536     2104     1652       72       64

          Total Elapsed Time   11.6 seconds

 ARM/Intel RandMem Benchmark 1.2 06-Aug-2015 12.32
           Compiled for 64 bit ARM v8a

    MBytes/Second Transferring 4 Byte Words
   Memory     Serial.......     Random.......
   KBytes     Read   Rd/Wrt     Read   Rd/Wrt

       16     3865     3033     3798     3027
       32     3622     2760     3105     2734
       64     3094     2803     1011     1077
      128     3074     2740      776      801
      256     3050     2771      718      693
      512     2420     2463      270      371
     1024     1322     1853      131      164
     4096     1754     1598       87      100
    16384     1791     1586       75       91
    65536     1856     1609       57       68

          Total Elapsed Time   14.6 seconds


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM
 
 ARM/Intel RandMem Benchmark 1.1 25-Apr-2015 12.33

    MBytes/Second Transferring 4 Byte Words
   Memory     Serial.......     Random.......
   KBytes     Read   Rd/Wrt     Read   Rd/Wrt

       16     2521     3175     2490     3038 L1
       32     1427     1451     1218     1446
       64     1133     1052      853      907 L2
      128     1039      871      646      650
      256     1028      909      543      518
      512     1025      895      499      502
     1024      700      489      242      236
     4096      487      282       90       88 RAM
    16384      483      281       71       70
    65536      478      274       63       62

          Total Elapsed Time   11.3 seconds
  



MP-MFLOPS Benchmarks - MP-MFLOPSi and MP-MFLOPS2i

MP-MFLOPS arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2 and 32 operations per input data word, using 1, 2, 4 and 8 threads. Data sizes are limited to three, to use L1 cache, L2 cache and RAM at 12.8, 128 and 12800 KB (3200, 32000 and 3200000 single precision floating point words). Each thread uses the same calculations but accessing different segments of the data. The program checks for consistent numeric results, primarily to show that all calculations are carried out and can be run. The numeric results start with values of 1.0, with subsequent calculations reducing the values, the amount depending on the number of calculations. An example log file is shown below.
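The calculation and the division of data between threads can be sketched as follows. This is a minimal single threaded illustration: run_segmented() processes each segment in turn where the benchmark would give each to its own thread, the form shown has 8 operations per word (the benchmark's 2 and 32 operation variants are shorter and longer versions of the same pattern), and the constants used in any call are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One pass of x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f,
   which is 8 floating point operations per word. */
static void triad_pass(float *x, size_t n,
                       float a, float b, float c, float d, float e, float f)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
    }
}

/* Same calculations on different segments of the data, one segment
   per would-be thread. */
static void run_segmented(float *x, size_t n, size_t segments,
                          float a, float b, float c, float d, float e, float f)
{
    assert(segments > 0);
    for (size_t s = 0; s < segments; s++) {
        size_t begin = n * s / segments;
        size_t end = n * (s + 1) / segments;
        triad_pass(x + begin, end - begin, a, b, c, d, e, f);
    }
}
```

Since every word receives the same calculations whichever segment it falls in, results are the same for any thread count, which is the consistency that the "Results x 100000" lines in the log confirm.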

The original benchmark runs too fast on later CPUs, so a revised version, MP-MFLOPS2, was produced, with 50 times more calculations, producing the expected reduction in result values, as also shown below. Those from the 32 bit benchmark are slightly different to those from 64 bit operation.


 ARM/Intel MP-MFLOPS v7 Benchmark V1.2 09-Aug-2015 21.20
            Compiled for 64 bit ARM v8a

     FPU Add & Multiply using 1, 2, 4 and 8 Threads
         2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      701     695     583    1391    1394    1349
 2T     1347    1370     712    2792    2798    2743
 4T     1641    1544     716    3587    3491    3374
 8T     1370    1803     693    4001    4255    5016
 Results x 100000, 0 indicates ERRORS
 1T    86735   98519   99984   79897   97639   99975
 2T    86735   98519   99984   79897   97639   99975
 4T    86735   98519   99984   79897   97639   99975
 8T    86735   98519   99984   79897   97639   99975
 
           Total Elapsed Time    3.1 seconds

 MP-MFLOPS2i 32 bit
 1T    40392   76406   99700   35218   66014   99520
 MP-MFLOPS2i 64 bit
 1T    40392   76406   99700   35206   66015   99520

   



MP-MFLOPS Benchmark Results

Except for producing faster results with data in RAM, the 32 bit Tab 2 performance was, again, similar to the Cortex-A9 based Nexus 7. At 32 operations per word, the Tab 2 was just over twice as fast at 64 bits, rising to 3.7 times at 2 operations per word, with cache based data. The reason is that vector SIMD instructions were produced at 64 bits, instead of scalars.

For further comparisons see NEON-MFLOPS-MP Benchmark and Assembly Code below.


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel MP-MFLOPS2 Benchmark V2.2 09-Aug-2015 21.17
            Compiled for 32 bit ARM v7a

     FPU Add & Multiply using 1, 2, 4 and 8 Threads
         2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      190     190     184     670     672     664
 2T      377     378     370    1343    1345    1329
 4T      707     755     725    2657    2669    2621
 8T      722     736     714    2640    2672    2631

           Total Elapsed Time  113.0 seconds

 ARM/Intel MP-MFLOPS2 Benchmark V2.2 09-Aug-2015 21.24
            Compiled for 64 bit ARM v8a

     FPU Add & Multiply using 1, 2, 4 and 8 Threads
         2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      705     701     636    1398    1394    1362
 2T     1376    1395     942    2794    2797    2757
 4T     2063    2602     962    5491    5546    5336
 8T     2474    2611     957    5367    5500    5417

           Total Elapsed Time   51.6 seconds


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

 ARM/Intel MP-MFLOPS2 Benchmark V2.1 28-Apr-2015 17.44

     FPU Add & Multiply using 1, 2, 4 and 8 Threads
         2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      188     156     116     598     578     574
 2T      365     319     197    1195    1161    1145
 4T      682     709     237    2372    2345    2249
 8T      678     731     237    2361    2381    2254

           Total Elapsed Time  135.0 seconds
  



MP-Whetstone Benchmark - MP-WHETSi

This is a multithreaded version of the Whetstone Benchmark above. Tab 2 A8-50 performance on the 32 bit version was, again, similar to the Nexus 7. At 64 bits, the Fixpt test was clearly almost optimised out, but this makes little difference to the overall MWIPS rating, which was 2.25 times higher than for the 32 bit benchmark.


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

     ARM/Intel MP-Whetstone Benchmark V1.2 10-Aug-2015 11.30
            Compiled for 32 bit ARM v7a

                    Using 1, 2, 4 and 8 Threads
      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS

 1T   676.4  275.9  281.9  147.9  35.4   5.3   600.3   901.0  285.5
 2T  1362.5  533.8  561.7  298.0  70.9  10.8  1203.1  1838.9  574.0
 4T  2698.6  903.9 1071.7  594.4 141.2  21.5  2346.1  3305.5 1138.5
 8T  2830.1 1463.2 1393.0  614.2 152.5  21.9  3243.9  4418.3 1171.4

 Overall Seconds   4.95 1T,   4.94 2T,   5.11 4T,  10.09 8T

     ARM/Intel MP-Whetstone Benchmark V1.2 10-Aug-2015 11.34
            Compiled for 64 bit ARM v8a

                    Using 1, 2, 4 and 8 Threads
      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS

 1T  1524.8  328.6  348.8  297.6  37.3  19.9 1462579  1867.2 1238.0
 2T  3062.5  688.8  697.9  596.0  75.5  39.8 2097113  3726.7 2481.3
 4T  6085.4 1214.9 1360.5 1185.4 150.5  79.4 2449153  7055.0 4951.8
 8T  6222.4 1495.2 1545.6 1204.2 152.2  80.6 3869846  9218.8 5154.1

 Overall Seconds   4.92 1T,   4.90 2T,   5.05 4T,   9.97 8T

 
 ####################################################

  ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

     ARM/Intel MP-Whetstone Benchmark V1.1 30-Apr-2015 21.32

                    Using 1, 2, 4 and 8 Threads
      MWIPS MFLOPS MFLOPS MFLOPS   Cos   Exp   Fixpt      If  Equal
                 1      2      3  MOPS  MOPS    MOPS    MOPS   MOPS
 
 1T   602.2  242.3  242.3  140.2  27.2   4.9   482.8  1425.2  239.1
 2T  1208.7  481.2  484.2  280.8  55.0   9.9   970.0  2869.6  478.7
 4T  2398.7  805.4  966.7  562.5 109.5  19.5  1938.2  5722.5  957.1
 8T  2429.1  974.6 1076.2  562.4 110.9  19.7  1981.5  5816.1  963.6

  Overall Seconds   4.94 1T,   4.93 2T,   5.08 4T,   9.93 8T
   



MP-Dhrystone Benchmark - MP-Dhryi.apk

This is a multithreaded version of the Dhrystone Benchmark above. Tab 2 A8-50 performance on the 32 bit version was, again, similar to the Nexus 7.

Each thread executes the same code but some variables are shared and that can lead to non-linear MP activity, in this case with two CPUs producing increased throughput of 1.8 times or less. At least, single threaded performance is essentially the same as the non-threaded benchmark.
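The throughput gain from extra threads is simply the ratio of Dhrystones per second figures from the logs below. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Ratio of multithreaded to single threaded throughput. */
static double throughput_gain(double multi_thread_rate, double single_thread_rate)
{
    assert(single_thread_rate > 0.0);
    return multi_thread_rate / single_thread_rate;
}
```

Using the Tab 2 figures for two threads, throughput_gain(4495793, 2481286) is about 1.8 at 32 bits and throughput_gain(7574470, 4476736) is about 1.7 at 64 bits.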


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel MP-Dhrystone 2 Benchmark V1.2 10-Aug-2015 11.32
           Compiled for 32 bit ARM v7a

                   Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.64     0.71     0.90     1.70
 Dhrystones per Second    2481286  4495793  7094180  7540038
 VAX MIPS rating             1412     2559     4038     4291

 ARM/Intel MP-Dhrystone 2 Benchmark V1.2 10-Aug-2015 11.36
            Compiled for 64 bit ARM v8a

                   Using 1, 2, 4 and 8 Threads

 Threads                        1        2        4        8
 Seconds                     0.89     1.06     1.64     3.24
 Dhrystones per Second    4476736  7574470  9768350  9861922
 VAX MIPS rating             2548     4311     5560     5613


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

 ARM/Intel MP-Dhrystone 2 Benchmark V1.1 04-May-2015 17.18

 Threads                        1        2        4        8
 Seconds                     0.78     0.95     1.27     2.44
 Dhrystones per Second    2572642  4214238  6280420  6565767
 VAX MIPS rating             1464     2399     3575     3737

   



MP-BusSpeed Benchmark - MP-BusSpdi.apk and MP-BusSpd2i.apk

This is a multithreaded version of the BusSpeed Benchmark, with data sizes considered suitable to measure performance from L1 cache, L2 cache and RAM.

The original MP-BusSpd benchmark read all the data with every thread, each starting at the beginning. With some devices having large shared L2 caches, some of the RAM based data could be cached, sometimes indicating an impossible performance level. All threads in the new version, MP-BusSpd2, read all the data, but with staggered starting points. The difference is not that great on the Tab 2 A8, as indicated below.
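The staggered start can be sketched as each thread beginning at a different offset and wrapping round, so that every thread still reads all the data. This is a minimal illustration, assuming evenly spaced starting points; the spacing used by MP-BusSpd2 itself is not stated here, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed spacing: thread t of `threads` starts t/threads of the way in. */
static size_t staggered_start(size_t words, size_t threads, size_t t)
{
    assert(threads > 0 && t < threads);
    return words * t / threads;
}

/* Read every word once, beginning at `start` and wrapping to the beginning. */
static uint32_t read_all_from(const uint32_t *data, size_t words, size_t start)
{
    assert(words > 0 && start < words);
    uint32_t acc = 0;
    for (size_t k = 0; k < words; k++) {
        size_t i = start + k;
        if (i >= words) {
            i -= words;
        }
        acc += data[i];
    }
    return acc;
}
```

With this arrangement no two threads are reading the same region at the same moment, so data fetched by one thread is less likely to be served to another from a shared L2 cache.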

Considering only MP-BusSpd2 reading all data (RdAll), at 32 bits, the Cortex-A53/Cortex-A9 performance ratios for L1 cache, L2 cache and RAM are around 0.8, 3.0 and 5.5. The A53 64 bit/32 bit ratios average 2.2, 1.8 and 1.0.

The maximum 64 bit memory transfer rate is 4.328 GB/second, or 4.448 GB/second based on 16 word increments, out of a possible 5.33 GB/second (see BusSpeed). Note that multithreading increases memory throughput by more than 60%.
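The 4.448 figure is consistent with multiplying the 8 thread Inc16 result from the 64 bit MP-BusSpd2 log (278 MB/second) by 16, on the assumption that each word read at a 16 word (64 byte) stride brings a whole cache line in from RAM. That derivation is an assumption rather than something stated in the log, and the helper names below are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: MB/second moved from memory when only one word in
   every 'inc_words' is read, ASSUMING each read pulls in a whole line of
   inc_words words (16 x 4 byte words = one 64 byte cache line). */
double line_traffic_mb_per_sec(double measured_mb_per_sec, int inc_words)
{
    assert(inc_words > 0);
    return measured_mb_per_sec * inc_words;
}

/* Fraction of the quoted peak rate that a measured rate achieves. */
double fraction_of_peak(double mb_per_sec, double peak_mb_per_sec)
{
    assert(peak_mb_per_sec > 0.0);
    return mb_per_sec / peak_mb_per_sec;
}
```

With the logged values, line_traffic_mb_per_sec(278.0, 16) is 4448 MB/second, and fraction_of_peak(4328.0, 5330.0) is about 0.81 of the quoted 5.33 GB/second.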


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53
  Single Channel RAM, LPDDR3 666 MHz, 5.3 GB/second

 ARM/Intel MP-BusSpd Benchmark V1.2 12-Aug-2015 16.13
           Compiled for 32 bit ARM v7a

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   1849   2140   2079   2211   2270   2297
      2T   3663   4252   4294   4400   4370   4580
      4T   4630   5574   5691   5893   6015   6083
      8T   5331   5775   6033   6622   7968   8023
122.9 1T    597    621   1119   1815   2135   2237
      2T    869    943   1644   2992   3740   4412
      4T    949    951   1922   3736   6468   7779
      8T    948    978   1911   3717   6464   7542
12288 1T    123    174    344    678   1215   1840
      2T    243    310    672   1332   2383   3974
      4T    302    285    594   1282   2271   4606
      8T    279    295    654   1198   2749   4660

          Total Elapsed Time   12.8 seconds

 ARM/Intel MP-BusSpd2 Benchmark V1.2 12-Aug-2015 16.14
           Compiled for 32 bit ARM v7a

 12.3 1T   1877   2124   2176   2266   2296   2343
      2T   3625   4198   4341   4468   4536   4613
      4T   5733   7541   8293   8830   8024   9042
      8T   2985   3829   7438   6117   8108   8923
122.9 1T    604    625   1142   1846   2150   2284
      2T    924    950   1793   3277   4270   4504
      4T    962    989   1939   3765   6798   8862
      8T    965    993   1933   3748   6651   8239
49152 1T    165    175    344    677   1285   1979
      2T    234    238    482    961   1907   3547
      4T    266    298    562   1224   2296   4478
      8T    272    275    538   1098   2149   4282

          Total Elapsed Time   48.8 seconds

 ARM/Intel MP-BusSpd Benchmark V1.2 12-Aug-2015 16.17
           Compiled for 64 bit ARM v8a

 12.3 1T   3247   3895   4031   4182   4286   4367
      2T   5676   7211   7771   8320   8539   7887
      4T  10390  13919  14891  14949  15595  12977
      8T   9693  12748  14246  14325  14434  16076
122.9 1T    577    575   1107   1884   2882   3568
      2T    924    939   1827   3380   5554   6890
      4T    959    972   1897   3659   6554   8508
      8T    956    980   1913   3814   7206  11996
12288 1T    133    182    351    690   1381   2720
      2T    309    282    669   1329   2625   5265
      4T    281    286    715   1383   2614   5040
      8T    303    341    670   1180   2303   4354

          Total Elapsed Time   13.1 seconds

 ARM/Intel MP-BusSpd2 Benchmark V1.2 12-Aug-2015 16.18
           Compiled for 64 bit ARM v8a

 12.3 1T   2610   2472   2586   2727   2748   5841
      2T   4404   4681   4994   5369   5420  11297
      4T   6546   8125   9105  10243  10319  20610
      8T   3380   4023   7919   7146   9871  19852
122.9 1T    604    621   1110   1872   2446   5100
      2T    919    948   1855   3433   4853  10037
      4T    961    974   1984   3924   7491  14935
      8T    963    942   1931   3915   7572  14689
49152 1T    173    177    340    692   1300   2653
      2T    266    241    479    968   1883   3724
      4T    304    277    556   1130   2126   4328
      8T    279    278    544   1138   2179   4275

          Total Elapsed Time   49.4 seconds


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

 ARM/Intel MP-BusSpd2 Benchmark V1.0 24-Jul-2015 15.59

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
  KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

 12.3 1T   2166   2774   3181   3307   3377   3263
      2T   3924   5188   5207   5754   5759   5805
      4T   7570  10011  10252  11165  11375  11777
      8T   3510   4786   9011   8318  11351  11544
122.9 1T    383    409    359    558    663    983
      2T    525    541    520    741   1241   1814
      4T    739    752    753   1219   1590   2776
      8T    735    741    753   1218   1607   2737
49152 1T     56     51     81    126    172    330
      2T     65     67    107    196    335    620
      4T     70     68    108    215    426    835
      8T     70     68    109    215    428    851

          Total Elapsed Time   48.2 seconds
   



MP-RandMem Benchmark - MP-RndMemi.apk

This is a multithreaded version of the RandMem Benchmark. Probably because performance depends on the complex indexing used, the A53 is mostly not much faster at 64 bits. At 32 bits, it is clearly faster than the Cortex-A9 with serial access using L2 cache and RAM, but the Cortex-A9 is comparable on random reading and writing.
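The difference between serial and random access through an index can be sketched as below. This is not the benchmark source: the function names, the small LCG random generator and the summing loop are assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Serial order: index i selects word i. */
void fill_serial(uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        idx[i] = (uint32_t)i;
}

/* Random order: Fisher-Yates shuffle driven by a small LCG. The generator
   is an assumption for this sketch, not the one used by the benchmark. */
void shuffle_indices(uint32_t *idx, size_t n, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = n; i > 1; i--) {
        state = state * 1664525u + 1013904223u;
        size_t j = (size_t)(state >> 8) % i;
        uint32_t tmp = idx[i - 1];
        idx[i - 1] = idx[j];
        idx[j] = tmp;
    }
}

/* Read the data through the index array. The same loop serves serial and
   random tests, so only the memory access pattern differs. */
uint32_t read_indexed(const uint32_t *data, const uint32_t *idx, size_t n)
{
    assert(data != NULL && idx != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}
```

Because every load address comes from idx[], the extra index fetch is paid in both modes. Once the data no longer fits in cache, the shuffled order defeats sequential prefetching, which is in line with the very low RndRD and RndRDWR speeds logged at 12288 KB.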


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel MP-RndMem Benchmark V1.2 12-Aug-2015 17.13
           Compiled for 32 bit ARM v7a

   MB/Second Using 1, 2, 4 and 8 Threads
   KB       SerRD SerRDWR   RndRD RndRDWR

 12.29 1T    2894    2438    2887    2433
       2T    5665    2402    5663    2403
       4T   10922    2369   11100    2310
       8T   10065    2293   10648    2265
 122.9 1T    2681    2368     757     758
       2T    5351    2360    1398     769
       4T   10056    2308    2121     772
       8T    8838    2351    1916     742
 12288 1T    2309    1662      80      78
       2T    3986    1683     164      73
       4T    5419    1684     283      82
       8T    4658    1694     279      82

 ARM/Intel MP-RndMem Benchmark V1.2 12-Aug-2015 17.15
           Compiled for 64 bit ARM v8a

 12.29 1T    4445    3109    4455    3089
       2T    8010    3100    8072    3105
       4T   15909    3057   14711    3040
       8T   14764    3036   14570    3037
 122.9 1T    3457    2888     842     876
       2T    6537    2924    1524     876
       4T   11095    2892    2119     861
       8T   11729    2916    2080     874
 12288 1T    2475    1679      81      78
       2T    4155    1713     163      73
       4T    5503    1711     285      89
       8T    4519    1717     281      89

          Total Elapsed Time   48.1 seconds


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

  ARM/Intel MP-RndMem Benchmark V1.1 06-May-2015 11.59

   MB/Second Using 1, 2, 4 and 8 Threads
   KB       SerRD SerRDWR   RndRD RndRDWR

 12.29 1T    3060    2001    2867    1904
       2T    5459    1879    5463    1867
       4T   10797    1852   10537    1856
       8T   10090    1802   10608    1813
 122.9 1T     968     823     588     547
       2T    1749     785     902     618
       4T    2716     812    1328     672
       8T    2733     810    1407     673
 12288 1T     329     274      90      82
       2T     636     272     112      82
       4T     849     271     128      82
       8T     869     271     126      81

          Total Elapsed Time   45.4 seconds
  



NEON-Linpack Benchmark - NEON-Linpacki.apk

This is identical to the Linpack Benchmark above, except that the main calculations, in the performance dependent daxpy() function, were replaced by NEON intrinsic functions. These only operate on single precision floating point numbers. Results from 32 bit and 64 bit compilations were similar, as the programs use identical intrinsic functions. The speed of the original 64 bit benchmark (64b Above) is also not that different. It is compiled to use fmadd, the scalar floating point fused multiply-add instruction, whereas the NEON version uses vmla, vector multiply accumulate (4 words at a time). See Assembly Code below.


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel NEON Linpack Benchmark V 1.2 13-Aug-2015

 Compiled for  32b ARM v7a  64b ARM v8a   64b Above

 SP MFLOPS         407          505          482


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

 ARM/Intel NEON Linpack Benchmark V 1.0 03-May-2015

 SP MFLOPS          347 
   



NeonSpeed Benchmark - NeonSpeedi.apk

The benchmark carries out the same calculations as MemSpeed Benchmark, repeating the standard single precision multiply/add and integer tests with two adds, for comparison with those via NEON intrinsic functions.

As with NEON-Linpack, many results from 32 bit and 64 bit compilations, via NEON instructions, were similar. NEON functions produced significant performance gains at 32 bits, over normal code, but were limited to no more than 30% at 64 bits. NEON tests were quite a bit faster than those on the Cortex-A9.


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.32
           Compiled for 32 bit ARM v7a

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16    971   3853   1807   4059   3957   4397
      32    970   3812   1800   3983   3891   4323
      64    927   3228   1605   3038   3269   3521
     128    926   3321   1681   3343   3354   3596
     256    936   3386   1693   3449   3413   3667
     512    898   2889   1578   2996   2927   3118
    1024    794   1859   1345   2057   1996   1924
    4096    794   1796   1250   1788   1813   1835
   16384    792   1773   1270   1820   1829   1864
   65536    796   1811   1289   1852   1832   1880

          Total Elapsed Time   11.3 seconds

 ARM/Intel NeonSpeed Benchmark V1.2 13-Aug-2015 16.37
           Compiled for 64 bit ARM v8a

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   3054   4055   3605   4376   4911   5094
      32   2922   3787   3435   4198   4546   4682
      64   2795   3514   3259   3658   4050   4116
     128   2886   3529   3373   3924   4148   3963
     256   2883   3641   3264   3942   4193   4276
     512   2454   3165   2985   3385   3586   3542
    1024   1633   2000   1835   2043   2114   2105
    4096   1738   1893   1899   1900   1956   1955
   16384   1757   1870   1886   1802   1921   1846
   65536   1755   1875   1870   1903   1936   1937

Max MFLOPS  764   1014 

          Total Elapsed Time   10.2 seconds


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

 ARM/Intel NeonSpeed Benchmark V1.1 09-May-2015 18.07

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16    881   2440   2501   3334   3206   3465
      32    901   1868   1705   2260   2083   2186
      64    801   1395   1365   1573   1548   1581
     128    784   1282   1278   1405   1389   1411
     256    787   1279   1285   1420   1380   1409
     512    777   1266   1267   1409   1370   1394
    1024    604    786    762    769    770    828
    4096    458    479    477    463    486    488
   16384    436    447    448    469    470    469
   65536    450    472    469    240    482    483

          Total Elapsed Time   11.5 seconds
   



NEON-MFLOPS-MP Benchmark - NEON-MFLOPS2i-MP.apk

NEON-MFLOPS-MP benchmark is the same as MP-MFLOPS, except using NEON intrinsic functions for the calculations. For comparison purposes, single thread MP-MFLOPS results are included below (1TNN), with details of program source code and CPU assembly instructions used below.

With the 32 bit compilations on the Tab 2 A8, the NEON intrinsic version was up to 3.2 times faster than the original MP-MFLOPS benchmark, but its source code includes four times as many calculations within the test loops. Results were also similar to those on the Nexus 7, except for RAM speed, measured at 12800 KB, where the Tab 2 excelled. The same unrolling applied for calculations at 32 operations per word, except that the original incurred heavy addressing overheads, using 10 vector registers compared with 32 via NEON, leading to the NEON version being measured as twice as fast. In both cases, the instruction count was reduced by using fused multiply-add or multiply-subtract.

The NEON 64 bit compilation produced a small performance gain over 32 bit results at 2 operations per word, but nearly double the speed at 32 operations, where the 32 bit code suffers from having fewer registers for the variables. Using one core, maximum speed was 2.77 GFLOPS, rising to 10.8 GFLOPS via four cores. The one core speed equates to just over two floating point operations per clock cycle. This is disappointing compared with Intel processors, from the Core 2 onwards, at 6 per clock cycle out of a maximum of 8, with SSE SIMD code (see Linux results).

September 2015 - New best score from a 2 GHz Qualcomm Snapdragon 810 (Cortex-A57) with Android 5.0.2, at 64 bits. Performance, with 8 threads, is up to 23.6 GFLOPS, and up to nearly 3.5 results per clock cycle, using one core.
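The results per clock cycle quoted above follow directly from the logged MFLOPS and the CPU clock speed. The helper below is a hypothetical illustration of that arithmetic, using the MHz ratings given in the log headings.

```c
#include <assert.h>

/* Hypothetical helper: floating point results per clock cycle per core,
   from a logged MFLOPS figure and the CPU clock in MHz. */
double flops_per_clock(double mflops, double cpu_mhz, int cores)
{
    assert(cpu_mhz > 0.0 && cores > 0);
    return mflops / (cpu_mhz * cores);
}
```

For the 1.3 GHz Cortex-A53 at 64 bits, flops_per_clock(2766.0, 1300.0, 1) is about 2.13 and flops_per_clock(10780.0, 1300.0, 4) about 2.07 per core. For the 2 GHz Snapdragon 810, flops_per_clock(6943.0, 2000.0, 1) is about 3.47.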


 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 13-Aug-2015 16.35
           Compiled for 32 bit ARM v7a

     FPU Add & Multiply using 1, 2, 4 and 8 Threads
         2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      619     613     575    1444    1446    1426
 2T     1174    1206     889    2894    2902    2839
 4T     1585    1616     901    5679    5726    5596
 8T     2075    2130     944    5400    5585    5519

          Total Elapsed Time   25.8 seconds

 1TNN    190     190     184     670     672     664


 ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 13-Aug-2015 16.38
           Compiled for 64 bit ARM v8a

      FPU Add & Multiply using 1, 2, 4 and 8 Threads
        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      726     745     647    2766    2774    2639
 2T     1397    1402     903    5523    5552    5371
 4T     1871    1930     898   10780   10479   10439
 8T     2496    2876    1011    9736   10679    9900

          Total Elapsed Time   15.1 seconds

 1TNN    705     701     636    1398    1394    1362


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

 ARM/Intel NEON-MFLOPS2-MP Benchmark V2.1 13-May-2015 12.24

     FPU Add & Multiply using 1, 2, 4 and 8 Threads
         2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T      657     407     132    1077    1074    1053
 2T     1265     817     222    2147    2150    2078
 4T     2024    1695     234    4214    4276    3555
 8T     2435    2495     234    4196    4100    3523

          Total Elapsed Time   39.0 seconds

 1TNN    188     156     116     598     578     574


 ####################################################

 Quad-core 2 GHz Qualcomm Snapdragon 810, Android 5.0.2 

 ARM/Intel NEON-MFLOPS2-MP Benchmark V2.2 16-Sep-2015 17.59
           Compiled for 64 bit ARM v8a

     FPU Add & Multiply using 1, 2, 4 and 8 Threads
         2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T     2811    3126    1089    6943    6589    6342
 2T     2488    4114    1541   12084   10559    8809
 4T     4759    5480    2038   16516   14826   11960
 8T     4840    8985    2452   22082   23563   12461

          Total Elapsed Time    7.6 seconds
   



NEON-Linpack-MP Benchmark - NEON-Linpacki-MP.apk

This version uses mainly the same C programming code as the single precision floating point NEON compilation above. It is run on 100x100, 500x500 and 1000x1000 matrices using 0, 1, 2 and 4 separate threads. The code differences were slight changes to allow a higher level of parallelism. The initial 100x100 Linpack benchmark is only of use for measuring performance of single processor systems. The one for shared memory multiple processor systems is a 1000x1000 variety. The programming code for this is the same as 100x100, except users are allowed to use their own linear equation solver.

Unlike the NEON MP MFLOPS benchmark, that carries out the same multiply/add calculations, this program can run much slower using multiple threads. This is due to the overhead of creating and closing threads too frequently. Note the difference between the unthreaded speeds and those using one thread.

Ignoring multiple thread speeds, the 32 bit version on the Tab 2 A8 is notably faster than the Nexus 7 at N = 500 and 1000, due to the larger L2 cache and faster RAM.
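The size of the threading overhead can be estimated from the MFLOPS figures in the logs. The sketch below assumes the benchmark uses the standard Linpack operation count of 2n^3/3 + 2n^2 for an n x n system, and the function names are hypothetical.

```c
#include <assert.h>

/* Floating point operations for solving an n x n system, using the
   standard Linpack count 2n^3/3 + 2n^2 (assumed to apply here). */
double linpack_flops(int n)
{
    assert(n > 0);
    double dn = (double)n;
    return 2.0 * dn * dn * dn / 3.0 + 2.0 * dn * dn;
}

/* Seconds taken for one solve at a reported MFLOPS rate. */
double solve_seconds(int n, double mflops)
{
    assert(mflops > 0.0);
    return linpack_flops(n) / (mflops * 1.0e6);
}

/* Hypothetical estimate of the extra time per solve when threads are
   used, taken as the difference between the two solve times. */
double threading_overhead_seconds(int n, double mflops_unthreaded,
                                  double mflops_threaded)
{
    return solve_seconds(n, mflops_threaded)
         - solve_seconds(n, mflops_unthreaded);
}
```

Using the 32 bit Tab 2 A8 results at N = 100 (460.74 MFLOPS unthreaded, 22.35 MFLOPS with one thread), the solve time rises from about 1.5 milliseconds to about 30.7 milliseconds, so threading_overhead_seconds(100, 460.74, 22.35) is roughly 0.029 seconds per solve.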


        MFLOPS 0 to 4 Threads, N 100, 500, 1000

 ####################################################

 Tab 2 A8-50, 1.3 GHz quad core 64 bit ARM Cortex-A53

 ARM/Intel Linpack NEON SP MP Benchmark 1.2 13-Aug-2015 12.52
           Compiled for 32 bit ARM v7a

 Threads     None        1        2        4

 N  100     460.74    22.35    23.16    23.82
 N  500     480.63   336.52   339.94   303.66
 N 1000     470.02   405.86   403.01   405.98

 ARM/Intel Linpack NEON SP MP Benchmark 1.2 13-Aug-2015 12.57
           Compiled for 64 bit ARM v8a

 Threads     None        1        2        4

 N  100     548.67    27.70    33.93    37.00
 N  500     470.04   285.95   297.79   301.67
 N 1000     519.02   441.84   443.47   441.91


 ####################################################

 ARM Cortex-A9 1200 MHz, Android 4.1.2, 1 GB DDR3 RAM

 ARM/Intel Linpack NEON SP MP Benchmark 14-May-2015 15.40

  Threads     None        1        2        4

  N  100     385.49    28.79    29.06    29.25
  N  500     272.07   184.85   183.70   183.18
  N 1000     147.09   131.92   132.44   130.05

   


FFT Benchmarks - fft1.apk, fft3c.apk

The benchmarks run code for single and double precision Fast Fourier Transforms of size 1024 to 1048576 (1K to 1024K), each one being run three times to identify variance. Results are displayed and saved in a log file (FFT-tests.txt), with FFT running time in milliseconds. Besides Android, the benchmarks are available to run via Windows and Linux. Two versions are available: FFT1, the original version, and FFT3c, with optimised C code. Further details, results, and links for benchmarks and source code are in FFTBenchmarks.htm. Below is an example of results.


    Kindle Fire HDX 7, 2.2 GHz  Quad Core Qualcomm Snapdragon 800

       ARM/Intel FFT Benchmark 3c.0 08-Sep-2015 23.15
             Compiled for 32 bit ARM v7a

  Size                     milliseconds
    K     Single Precision              Double Precision
    1     0.155     0.352     1.341     0.087     0.073     0.073 
    2     0.812     0.814     0.750     0.201     0.187     0.251 
    4     1.751     1.658     1.776     0.414     0.405     0.443 
    8     3.712     1.083     1.065     0.930     0.899     0.890 
   16     2.880     3.356     2.430     2.579     2.658     2.380 
   32     6.124     6.541     5.605     5.907     6.070     5.681 
   64    13.430    12.566    12.774    13.792    13.556    13.997 
  128    30.737    27.408    27.132    33.318    33.088    33.071 
  256    64.472    63.394    64.690    73.288    72.546    72.786 
  512   153.609   150.383   156.046   155.788   156.304   163.178 
 1024   315.283   306.323   307.409   369.426   337.074   336.684 

        1024 Square Check Maximum Noise Average Noise
        SP   9.999520e-01  3.346482e-06  4.565234e-11
        DP   1.000000e+00  1.133294e-23  1.428110e-28

       Total Elapsed Time    6.5 seconds
   



Assembly Code




 Single Precision Floating Point Instructions 32 Bit Compile

 fadds   s13, s13, s15         Add
 fmuls   s13, s13, s14         Multiply
 fmacs   s15, s14, s23         Multiply and accumulate
 fnmacs  s15, s24, s2          Negated multiply and accumulate
 fmscs   s15, s24, s12         Multiply and subtract
 
 NEON
     
 vadd.f32 q10, q10, q8         Vector add
 vmul.f32 q10, q10, q9         Vector multiply
 vsub.f32 q7, q6, q7           Vector subtract
 vmla.f32 q8, q10, q9          Vector multiply accumulate


 Single Precision Floating Point Instructions 64 Bit Compile


 fmadd   s4, s0, s1, s4        Fused multiply-add 

 fadd   v2.4s, v2.4s, v4.4s    Add
 fmul   v2.4s, v2.4s, v3.4s    Multiply
 fmla   v0.4s, v22.4s, v17.4s  Fused multiply-add to accumulator
 fmls   v0.4s, v8.4s, v4.4s    Fused multiply-subtract from accumulator 


 #####################################################################

 MP-MFLOPS 2 Operations Per Word              NEON-MFLOPS2i-MP 2 Operations Per Word

 for(i=0; i < n; i++)                         Loop Functions
      x[i] = (x[i]+a)*b;                      ptrx1
                                              vld1q_f32
                                              vst1q_f32
                                              vaddq_f32    1
                                              vmulq_f32    1

 1 add, 1 multiply                            4 add, 4 multiply

===========================================================================================
 Main Assembly Code 32 Bit                    Main Assembly Code NEON 32 Bit

 Code    No.Ops Example    190 MFLOPS         Code    No.Ops Example    619 MFLOPS

                                              add       1
 cmp       1                                  cmp       1
 bge       1                                  bge       1
 b         1                                  b         1
 adds      1                                  adds      1
 flds      1    flds    s13, [r3]             vld1.32   1    vld1.32  {d20-d21}, [r1]
 fstmias   1    fstmias r3!, {s13}            vst1.32   1    vst1.32  {d20-d21}, [r1]
 fadds     1  1 fadds   s13, s13, s15         vadd.f32  1  4 vadd.f32 q10, q10, q8
 fmuls     1  1 fmuls   s13, s13, s14         vmul.f32  1  4 vmul.f32 q10, q10, q9


===========================================================================================
 Main Assembly Code 64 Bit 4 way unroll       Main Assembly Code NEON 64 Bit

 Code    No.Ops. Example    705 MFLOPS        Code    No.OPs Example    745 MFLOPS

 add       1                                  cmp       1
 cmp       1                                  bne       1
 bhi       1                                  ldr       1    ldr   q0, [x3]
 ldr       1    ldr     q2, [x5]              str       1    str   q0, [x3],16
 str       1    str     q2, [x5],16           fadd      1  4 fadd  v0.4s, v0.4s, v2.4s
 fadd      1  4 fadd    v2.4s, v2.4s, v4.4s   fmul      1  4 fmul  v0.4s, v1.4s, v0.4s
 fmul      1  4 fmul    v2.4s, v2.4s, v3.4s


##########################################################################################

 MP-MFLOPS 32 Operations Per Word            NEON-MFLOPS2i-MP 32 Operations Per Word

 for(i=0; i < n; i++)                         Loop Functions
   x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f    ptrx1
         -(x[i]+g)*h+(x[i]+j)*k-(x[i]+l)*m    vld1q_f32
          +(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t   vst1q_f32
          -(x[i]+u)*v+(x[i]+w)*y;             vaddq_f32   16
                                              vmulq_f32   11
                                              vsubq_f32    5

 23 variables
 16 add, 5 subtract, 11 multiply              64 add, 20 subtract, 44 multiply

===========================================================================================
 Main Assembly Code 32 Bit                    Main Assembly Code NEON 32 Bit

 Code    No.Ops. Example    672 MFLOPS        Code    No.OPs Example   1446 MFLOPS

                                              add       1
                                              adds      1
                                              b         1
 cmp       1                                  bge       1
 bge       1                                  cmp       1
 b         1                                  vldr     18    vldr     d18, [sp, #16]
 adds      1                                  vld1.64   1    vld1.64  {d18-d19}, [sp:64]
 flds      1    flds    s14, [r2]             vld1.32   1    vld1.32  {d16-d17}, [r2]
 fstmias   1    fstmias r2!, {s15}            vst1.32   1    vst1.32  {d12-d13}, [r2]
 fadds    11 11 fadds   s14, s14, s22         vadd.f32 16 64 vadd.f32 q6, q8, q9
 fmuls     1  1 fmuls   s15, s15, s10         vmul.f32 11 44 vmul.f32 q6, q6, q7
 fmacs     5 10 fmacs   s15, s14, s23         vsub.f32  5 20 vsub.f32 q7, q6, q7
 fnmacs    4  8 fnmacs  s15, s24, s2
 fmscs     1  2 fmscs   s15, s24, s12                     16 q registers used


===========================================================================================
 Main Assembly Code 64 Bit 4 way unroll       Main Assembly Code NEON 64 Bit

  Code    No.Ops. Example   1398 MFLOPS       Code    No.OPs Example  2766 MFLOPS

 orr       2
 cmp       1
 bcc       1
 add      16
 fmov     11                                  cmp       1
 ins      11                                  bne       1
 ldr      13    ldr   q17, [x28]              ldr       1    ldr   q1, [x8]
 str       4    str   q8, [x28]               str       1    str   q0, [x8],16
 fadd     11 44 fadd  v9.4s, v16.4s, v17.4s   fadd     11 44 fadd  v17.4s, v21.4s, v1.4s
 fmla      5 40 fmla  v17.4s, v9.4s, v13.4s   fmla      5 40 fmla  v0.4s, v22.4s, v17.4s
 fmls      5 40 fmls  v17.4s, v8.4s, v9.4s    fmls      5 40 fmls  v0.4s, v8.4s, v4.4s
 fmul      1  4 fmul  v17.4s, v9.4s, v17.4s   fmul      1  4 fmul  v0.4s, v18.4s, v0.4s

             10 Vector Registers used                     32 Vector Registers used
 

##########################################################################################

 LinpackSP2                            NEON-Linpack

 for (i = m; i < n; i = i + 4)         for (i = m; i < n; i=i+4)
 {                                     {
    dy[i] = dy[i] + da*dx[i];             x41 = vld1q_f32(ptrx1);
    dy[i+1] = dy[i+1] + da*dx[i+1];       y41 = vld1q_f32(ptry1);
    dy[i+2] = dy[i+2] + da*dx[i+2];       r41 = vmlaq_f32(y41, x41, c41);
    dy[i+3] = dy[i+3] + da*dx[i+3];       vst1q_f32(ptry1, r41);
 }                                        ptrx1 = ptrx1 + 4;
                                          ptry1 = ptry1 + 4;
                                       }

   32 Bit Compilation  181 MFLOPS           32 Bit Compilation 407 MFLOPS
    .L42                                     .L38:
       cmp     r1, r0                           cmp      r3, r0
       add     r3, r3, #16                      bge      .L33
       add     r2, r2, #16                      vld1.32  {d20-d21}, [r2]
       bge     .L33                             adds     r3, r3, #4
       flds    s13, [r2, #-16]                  adds     r2, r2, #16
       flds    s14, [r3, #-16]                  vld1.32  {d16-d17}, [r1]
       fmacs   s14, s15, s13                    vmla.f32 q8, q10, q9
       adds    r1, r1, #4                       vst1.32  {d16-d17}, [r1]
       fsts    s14, [r3, #-16]                  adds     r1, r1, #16
       flds    s14, [r3, #-12]                  b        .L38
       flds    s13, [r2, #-12]
       fmacs   s14, s15, s13
       fsts    s14, [r3, #-12]
       flds    s14, [r3, #-8]
       flds    s13, [r2, #-8]
       fmacs   s14, s15, s13
       fsts    s14, [r3, #-8]
       flds    s14, [r3, #-4]
       flds    s13, [r2, #-4]
       fmacs   s14, s15, s13
       fsts    s14, [r3, #-4]
       b       .L42

   64 Bit Compilation 482 MFLOPS            64 Bit Compilation 505 MFLOPS
   .L59:                                     .L54:
       add     x0, x0, 16                       ldr     q1, [x1],16
       add     x1, x1, 16                       ldr     q0, [x3]
       ldr     s1, [x1,-16]                     fmla    v0.4s, v2.4s, v1.4s
       cmp     x0, x3                           str     q0, [x3],16
       ldr     s4, [x0,-16]                     cmp     x3, x0
       ldr     s2, [x0,-12]                     bne     .L54
       fmadd   s4, s0, s1, s4
       ldr     s1, [x0,-8]
       ldr     s5, [x0,-4]
       str     s4, [x0,-16]
       ldr     s3, [x1,-12]
       fmadd   s3, s0, s3, s2
       str     s3, [x0,-12]
       ldr     s2, [x1,-8]
       fmadd   s2, s0, s2, s1
       str     s2, [x0,-8]
       ldr     s1, [x1,-4]
       fmadd   s1, s0, s1, s5
       str     s1, [x0,-4]
       bne     .L59


 #####################################################################

    NeonSpeed Normal                         NeonSpeed NEON

    for (m=0; m < ks; m=m+4)                   for(i=0; i < size/16; i++)
    {                                        {
       xs[m]   = xs[m]   + sums * ys[m];        x41 = vld1q_f32(ptrx1);
       xs[m+1] = xs[m+1] + sums * ys[m+1];      x42 = vld1q_f32(ptrx2);
       xs[m+2] = xs[m+2] + sums * ys[m+2];      x43 = vld1q_f32(ptrx3);
       xs[m+3] = xs[m+3] + sums * ys[m+3];      x44 = vld1q_f32(ptrx4);
    }                                           y41 = vld1q_f32(ptry1);
                                                y42 = vld1q_f32(ptry2);
                                                y43 = vld1q_f32(ptry3);
                                                y44 = vld1q_f32(ptry4);
                                                z41 = vmlaq_f32(x41, y41, c4);
                                                z42 = vmlaq_f32(x42, y42, c4);
                                                z43 = vmlaq_f32(x43, y43, c4);
                                                z44 = vmlaq_f32(x44, y44, c4);
                                                vst1q_f32(ptrx1, z41);
                                                vst1q_f32(ptrx2, z42);
                                                vst1q_f32(ptrx3, z43);
                                                vst1q_f32(ptrx4, z44);
                                                ptrx1 = ptrx1 + 16;
                                                ptry1 = ptry1 + 16;
                                                ptrx2 = ptrx2 + 16;
                                                ptry2 = ptry2 + 16;
                                                ptrx3 = ptrx3 + 16;
                                                ptry3 = ptry3 + 16;
                                                ptrx4 = ptrx4 + 16;
                                                ptry4 = ptry4 + 16;
                                             }

   32 Bit Compilation 243 MFLOPS            32 Bit Compilation 963 MFLOPS
    .L53:                                   .L24:
       cmp     r1, r9                           cmp     r2, r3
       add     r3, r3, #16                      add     r8, r1, #48
       add     r2, r2, #16                      add     ip, r1, #32
       bge     .L102                            add     r7, r1, #16
       flds    s14, [r2, #-16]                  add     r6, r0, #48
       flds    s15, [r3, #-16]                  add     r5, r0, #32
       fmacs   s15, s14, s18                    add     r4, r0, #16
       flds    s14, [r2, #-12]                  bge     .L26
       adds    r1, r1, #4                       vld1.32 {d24-d25}, [r0]
       fsts    s15, [r3, #-16]                  adds    r2, r2, #1
       flds    s15, [r3, #-12]                  vld1.32 {d6-d7}, [r1]
       fmacs   s15, s14, s18                    adds    r1, r1, #64
       flds    s14, [r2, #-8]                   vmla.f32        q12, q3, q8
       fsts    s15, [r3, #-12]                  vld1.32 {d22-d23}, [r4]
       flds    s15, [r3, #-8]                   vld1.32 {d20-d21}, [r5]
       fmacs   s15, s14, s18                    vld1.32 {d18-d19}, [r6]
       flds    s14, [r2, #-4]                   vld1.32 {d30-d31}, [r7]
       fsts    s15, [r3, #-8]                   vmla.f32        q11, q15, q8
       flds    s15, [r3, #-4]                   vld1.32 {d28-d29}, [ip]
       fmacs   s15, s14, s18                    vmla.f32        q10, q14, q8
       fsts    s15, [r3, #-4]                   vld1.32 {d26-d27}, [r8]
       b       .L53                             vmla.f32        q9, q13, q8
                                                vst1.32 {d24-d25}, [r0]
                                                adds    r0, r0, #64
                                                vst1.32 {d22-d23}, [r4]
                                                vst1.32 {d20-d21}, [r5]
                                                vst1.32 {d18-d19}, [r6]
                                                b       .L24

   64 Bit Compilation 764 MFLOPS            64 Bit Compilation 1014 MFLOPS
    .L49:                                   .L16:
       ldr     q1, [x2],16                      mov     x3, x1
       add     w0, w0, 1                        ldr     q4, [x0]
       ldr     q0, [x1]                         add     x6, x0, 16
       cmp     w0, w26                          add     x5, x0, 32
       fmla    v0.4s, v1.4s, v2.4s              ldr     q5, [x3],16
       str     q0, [x1],16                      add     x7, x1, 32
       bcc     .L49                             ldr     q3, [x6]
                                                add     x4, x0, 48
                                                add     x2, x1, 48
                                                ldr     q2, [x5]
                                                ldr     q7, [x7]
                                                add     x1, x1, 64
                                                ldr     q1, [x4]
                                                ldr     q6, [x3]
                                                fmla    v4.4s, v5.4s, v0.4s
                                                ldr     q5, [x2]
                                                fmla    v2.4s, v7.4s, v0.4s
                                                str     q4, [x0]
                                                add     x0, x0, 64
                                                fmla    v3.4s, v6.4s, v0.4s
                                                cmp     x0, x8
                                                fmla    v1.4s, v5.4s, v0.4s
                                                str     q3, [x6]
                                                str     q2, [x5]
                                                str     q1, [x4]
                                                bne     .L16
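
The listings above all appear to perform a multiply-accumulate of the form dst[i] = dst[i] + a * src[i]. The 32 bit scalar code uses fmacs on one float at a time, unrolled four times, while the NEON code issues four 4-wide vmla or fmla operations, covering 16 floats per pass. A minimal portable C sketch of those two loop shapes follows. The function names madd_unroll4 and madd_block16 and the parameter names are hypothetical and are not taken from the benchmark source. The second function uses plain C rather than NEON intrinsics so that it compiles on any target, and it assumes n is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar shape: dst[i] += a * src[i], unrolled four times, similar to the
   flds/fmacs/fsts pattern in the 32 bit scalar listing.
   All names here are hypothetical. */
void madd_unroll4(size_t n, float a, const float *src, float *dst)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     += a * src[i];
        dst[i + 1] += a * src[i + 1];
        dst[i + 2] += a * src[i + 2];
        dst[i + 3] += a * src[i + 3];
    }
    /* Remaining 0 to 3 elements. */
    for (; i < n; i++)
        dst[i] += a * src[i];
}

/* Blocked shape: 16 floats per pass, standing in for four 128 bit
   multiply-accumulate operations. Assumption: n is a multiple of 16. */
void madd_block16(size_t n, float a, const float *src, float *dst)
{
    assert(n % 16 == 0);
    for (size_t i = 0; i < n; i += 16) {
        for (size_t lane = 0; lane < 16; lane++)
            dst[i + lane] += a * src[i + lane];
    }
}
```

An optimising compiler may turn the inner 16 element loop into vector instructions on its own, which may be why the 64 bit compilation of the scalar source already uses fmla on v registers.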
   



Roy Longbottom January 2016



The official Internet home for my benchmarks is
Roy Longbottom's PC Benchmark Collection