Roy Longbottom's Android Benchmark Apps
Java Benchmark Results
The first measurements obtained were via emulators running on a 3 GHz quad core Phenom, the benchmark only using one core, of course. They suggest a slightly slower performance using a screen with a higher pixel density and much better performance with a later Android version and/or a more modern CPU. Compared with the ARM 926EJ CPU, the v7 has VFPv3 enhanced floating point hardware. P1 and P2 v7 Cortex-A8 processors have performance proportional to CPU MHz. The v7 Cortex-A9 CPU has dual cores but only one will be used. Performance improvements are through higher performance VFPv3 and a new out-of-order speculative issue superscalar execution pipeline.
The last two sets of results are for the same Java code running on Intel CPUs under Linux, adjusting the speeds to represent processors running at 1 GHz.
The ARM processors appear to be catching up with the Atom on fixed point operation, and do particularly well on standard and trigonometric functions. The latter have significant impact on the overall MWIPS score (see millisecs in example results).
Java Numeric Results
Both emulated and real numeric results using ARMv7 are different from ARM926EJ for some floating point calculations. This is not unusual for different compilers or types of processor and is due to variations in instruction sequences or hardware rounding arrangements. It looks as though these two processors are not logically identical or program optimisation procedures are different.
ARMv7 P3 has enhanced architecture that probably changes the calculated results of the last test.
Results from Native Code versions are also provided.
System     ARM   MHz Android  MWIPS  ------MFLOPS-------    COS    EXP  FIXPT     IF  EQUAL
See        CPU         Build              1      2      3     ------------MOPS--------------
T1 @5    926EJ   800     2.2   31.2   10.2   10.2   11.4    0.6    0.3   38.8  278.4  219.4
T1 @7    926EJ   800     2.2   30.3   10.2    9.3   11.5    0.6    0.3   39.0  293.5  220.1
T2 @5    v7-A9   800   2.3.4  170.9   20.4   21.4   28.4    7.6    2.2   85.5  756.0  764.3
T2 @7    v7-A9   800   2.3.4  687.4  165.4  149.9  153.4   15.9    9.3  723.1 1082.1  725.3
EP1 @5   926EJ  Emul     2.2   20.1    7.0    6.7    9.3    0.4    0.2   30.9  218.6   98.5
ET2 @5   v7-A8  Emul    4.03   43.7    7.2    7.0    9.3    1.1    0.6   30.8  225.1  100.9
ET2 @7   v7-A8  Emul    4.03   96.7   37.0   32.1   36.1    1.6    1.3  121.9  238.4  216.4
Atom             800   Linux  184.9  107.1   60.0   27.9    6.3    3.5  100.4   78.8   45.6
Core 2           800   Linux  496.3  179.3  177.7   98.0   16.0    7.4  266.3  238.3   94.0
System - T = Tablet, P = Phone, E = Emulator, @7 for vfpv3 FPU
Results are also shown for Linux C compilations on two Intel processors, adjusted to assume clock speeds of 800 MHz. These show that ARM CPUs with the fast FPU, running at the same clock speed as Intel processors, can produce equal or better performance.
Linpack Benchmark 
The Linpack Benchmark was produced from the "LINPACK" package of linear algebra routines. It became the primary benchmark for scientific applications, particularly under Unix, from the mid 1980s, with a slant towards supercomputer performance. The original double precision C version, used here, operates on 100x100 matrices. Performance is governed by an inner loop in function daxpy() with a linked triad dy[i] = dy[i] + da * dx[i], and is measured in Millions of Floating Point Operations Per Second (MFLOPS). This version uses a Java front end, again providing Run, Info and Email buttons, with the main C code compiled by the Android Native Development Kit. Two varieties are available: Linpackv5.apk (LP5), using old, slow instructions, and Linpackv7.apk (LPK), which will use faster vfpv3 hardware, if available.
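The linked triad that governs Linpack speed can be sketched in a few lines of C. This is a simplified, unit-stride version written for illustration only: the real daxpy() in the benchmark source may differ in its argument list and loop unrolling, and the mflops() helper is a hypothetical name, not part of the benchmark.

```c
#include <stddef.h>

/* Linked triad: dy[i] = dy[i] + da * dx[i] for i in [0, n).
   Each element costs one multiply and one add, so 2 floating
   point operations per loop pass. Simplified sketch: unit stride only. */
void daxpy(size_t n, double da, const double *dx, double *dy)
{
    if (da == 0.0) {
        return;                 /* adding zero changes nothing */
    }
    for (size_t i = 0; i < n; i++) {
        dy[i] = dy[i] + da * dx[i];
    }
}

/* Hypothetical helper: MFLOPS from an operation count and elapsed seconds. */
double mflops(double floating_point_ops, double seconds)
{
    return floating_point_ops / seconds / 1.0e6;
}
```

Because each pass needs two loads, a multiply, an add and a store, the measured speed depends heavily on whether the compiler can use hardware floating point for the multiply and add.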
The .apk application files can be downloaded from www.roylongbottom.org.uk/Linpackv5.apk and www.roylongbottom.org.uk/Linpackv7.apk. Further details of the Linpack benchmark, and results from Windows and Linux based PCs, can be found in Linpack Results.htm.
Output results provide the same System Information as shown for the Whetstone Benchmark, preceded by MFLOPS speed and numeric results, examples being shown below.
In this case, calculations from both versions produce the same numeric results. These are also identical to those from Microsoft Visual C under Windows and 64-Bit GCC under Linux on PCs, whereas other compilers used produced differences.
Android Linpack v7 Benchmark       Android Linpack v5 Benchmark
Speed      101.39 MFLOPS           Speed       10.56 MFLOPS
norm. resid              1.7       norm. resid              1.7
resid         7.41628980e-14       resid         7.41628980e-14
machep        2.22044605e-16       machep        2.22044605e-16
x[0]-1       -1.49880108e-14       x[0]-1       -1.49880108e-14
x[n-1]-1     -1.89848137e-14       x[n-1]-1     -1.89848137e-14
Results below again show that the compilation using the vfpv3 FPU library produces much faster speed on the tablet with the Cortex-A9 processor, while on the older CPU, where the alternative library is used, the two versions run at almost the same speed. Results scaled to represent Intel processor speeds, at 800 MHz, are also shown. This time, the Cortex-A9 MFLOPS are similar to those for the Atom CPU, but the latter is really twice as fast, as it runs at 1600 MHz.
Results from this benchmark are significantly faster than those from the apparently popular Linpack For Android, which produced 30 MFLOPS on the xTAB-70 tablet, compared with 101 MFLOPS with my benchmark. The main difference appears to be that this one uses pre-compiled C code and the other is Java based, reflecting earlier Whetstone Benchmark comparisons.
System     ARM    MHz  Android   Linpackv5  Linpackv7
See                                 MFLOPS     MFLOPS
T1       926EJ    800  2.2            5.63       5.67
T2       v7-A9    800  2.3.4         10.56     101.39
EP1      926EJ   Emul  2.2            4.27       4.54
ET2      v7-A8   Emul  4.03           4.39      12.24
Atom              800  Linux                    94.11
Atom              800  Windows                  87.98
Core 2            800  Linux                   429.33
Core 2            800  Windows                 438.43
System - T = Tablet, P = Phone, E = Emulator
Java Linpack is also available to run via Windows and Linux browsers. On running this, the 800 MHz ratings for the Atom and Core 2 CPUs are 55 and 377 MFLOPS.
Dhrystone 2 Benchmark 
The Dhrystone "C" benchmark provides a measure of integer performance (no floating point instructions). It became the key standard benchmark from 1984, with the growth of Unix systems. The first version was produced by Reinhold P. Weicker in Ada and translated to "C" by Rick Richardson. Two versions are available - Dhrystone versions 1.1 and 2.1. The second version, used here, was produced to avoid over-optimisation problems encountered with version 1, but some over-optimisation is still possible. Because of this, optimised and non-optimised compilations are provided. Speed was originally measured in Dhrystones per second. This was later changed to VAX MIPS by dividing Dhrystones per second by 1757, the DEC VAX 11/780 result, the latter being regarded as the first 1 MIPS minicomputer.
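The conversion to VAX MIPS is a single division, sketched below in C. The function and macro names are illustrative only and are not taken from the benchmark source; the per-MHz helper is included because results in this section are also compared on that basis.

```c
/* DEC VAX 11/780 score in Dhrystones per second, the 1 MIPS reference. */
#define VAX_11_780_DHRYSTONES_PER_SECOND 1757.0

/* VAX MIPS (also called DMIPS) from a Dhrystones per second score. */
double vax_mips(double dhrystones_per_second)
{
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND;
}

/* VAX MIPS per MHz, for comparing CPUs running at different clock speeds. */
double vax_mips_per_mhz(double dhrystones_per_second, double cpu_mhz)
{
    return vax_mips(dhrystones_per_second) / cpu_mhz;
}
```

For instance, the optimised score of 1689546 Dhrystones per second reported in this section gives roughly 962 VAX MIPS, or about 1.2 VAX MIPS per MHz on the 800 MHz Cortex-A9.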
The optimised .apk app file (DS2) can be downloaded from www.roylongbottom.org.uk/Dhrystone2.apk and the non-optimised one (DSN) from www.roylongbottom.org.uk/Dhry2Nopt.apk. Further details of the Dhrystone benchmark, and results from Windows and Linux based PCs, can be found in Dhrystone Results.htm.
The same format Java front end, described above, is used, with the two C programs compiled using Android NDK. Examples of results are below, the Emailed version including the standard System Information.
Dhrystone 2 Benchmark 10-Feb-2012 19.08        Dhry2 NoOpt Benchmark 14-Feb-2012 12.15
Nanoseconds one Dhrystone run       592        Nanoseconds one Dhrystone run      1244
Dhrystones per Second           1689546        Dhrystones per Second            804020
VAX MIPS rating                     962        VAX MIPS rating                     458
Unlike when using floating point, on this benchmark, the Cortex-A9 CPU is less than three times faster than the 926EJ on all measurements, a ratio similar to that provided by the BogoMIPS results, shown in System Information. Measurements for Intel Atom and Core 2 CPUs are also provided for Windows (Watcom 32 Bit) and Linux (GCC 32 Bit and 64 Bit) compilations. Relative to CPU MHz, the A9 performance is similar to the Atom 32 bit compilations, the latter being faster at optimised 64 bits, probably due to more registers being available. Core 2 results show considerable variations, highlighting the danger in comparing results from different compilers.
The optimised benchmark produces 1.2 Vax MIPS/MHz for the Cortex-A9. ARM, themselves, quote 2.5 Vax MIPS (DMIPS) per MHz for the same processor, the difference probably being just another compiler variation.
                                       Opt  No Opt
System           ARM   MHz  Android    Vax     Vax    Bogo
See                                   MIPS    MIPS    MIPS
T1             926EJ   800  2.2        356     196     798
T2             v7-A9   800  2.3.4      962     458    2036
EP1            926EJ  Emul  2.2        227     122
ET2            v7-A8  Emul  4.03       286     160
32 Bit Atom           1666  Linux     2055    1194
64 Bit Atom           1666  Linux     2704    1098
32 Bit Atom           1666  Windows   1948     780
32 Bit Core 2         2400  Linux     5852    3348
64 Bit Core 2         2400  Linux    12265    3288
32 Bit Core 2         2400  Windows   6466    1251
System - T = Tablet, P = Phone, E = Emulator
Livermore Loops Benchmark
This original main benchmark for supercomputers was first introduced in 1970, initially comprising 14 kernels of numerical application, written in Fortran. This was increased to 24 kernels in the 1980s. Performance measurements are in terms of Millions of Floating Point Operations Per Second or MFLOPS. The kernels are executed three times with different double precision data array sizes. Following are overall MFLOPS results for Cray 1, geometric mean being the official average performance.
               ---------------- MFLOPS ---------------
CPU       MHz  Maximum Average Geomean Harmean Minimum
Cray 1     80     82.1    22.2    11.9     6.5     1.0
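The five summary figures (maximum, average, geometric mean, harmonic mean, minimum) can be sketched in C as below. This is an unweighted illustration over one set of kernel speeds; the benchmark's own overall figures are weighted across three Do Spans, so this sketch is not expected to reproduce them exactly. All names here are hypothetical. To avoid any dependency on the maths library, the n-th root is taken by Newton's method; exp() and log() from math.h would be the more usual choice.

```c
#include <stddef.h>

typedef struct {
    double maximum, average, geomean, harmean, minimum;
} LoopSummary;

/* r raised to a small non-negative integer power k. */
static double ipow(double r, size_t k)
{
    double p = 1.0;
    while (k-- > 0) {
        p *= r;
    }
    return p;
}

/* n-th root of x by Newton's method. The guess must be at or above
   the root (the arithmetic mean always is), so the iteration
   decreases steadily towards the answer. */
static double nth_root(double x, size_t n, double guess)
{
    double r = guess;
    for (int iter = 0; iter < 1000; iter++) {
        double next = ((double)(n - 1) * r + x / ipow(r, n - 1)) / (double)n;
        double diff = r > next ? r - next : next - r;
        r = next;
        if (diff <= 1e-13 * r) {
            break;
        }
    }
    return r;
}

/* Unweighted summary of n positive MFLOPS values, n >= 1.
   Assumes the product of all values fits in a double, which holds
   for a couple of dozen values of a few thousand MFLOPS each. */
LoopSummary summarise(const double *mflops, size_t n)
{
    LoopSummary s;
    double sum = 0.0, product = 1.0, reciprocal_sum = 0.0;

    s.maximum = s.minimum = mflops[0];
    for (size_t i = 0; i < n; i++) {
        if (mflops[i] > s.maximum) s.maximum = mflops[i];
        if (mflops[i] < s.minimum) s.minimum = mflops[i];
        sum += mflops[i];
        product *= mflops[i];
        reciprocal_sum += 1.0 / mflops[i];
    }
    s.average = sum / (double)n;
    s.harmean = (double)n / reciprocal_sum;
    s.geomean = nth_root(product, n, s.average);
    return s;
}
```

For any set of positive speeds the harmonic mean is the lowest of the three averages and the arithmetic mean the highest, which is why a few slow kernels pull the geometric and harmonic means well below the simple average.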
The benchmark execution file can be downloaded from www.roylongbottom.org.uk/LivermoreLoops.apk. Further details of the Livermore Loops benchmark, and results from Windows and Linux based PCs, can be found in Livermore Loops Results.htm.
The same format Java front end, described above, is used, with the C program compiled using Android NDK. An example of results is below, the Emailed version including the standard System Information.
800 MHz ARM Cortex-A9
Android Livermore Loops Benchmark 12-Feb-2012 21.55
MFLOPS for 24 loops Do Span 471
172.6 127.5 253.2 248.6 71.6 141.2
197.6 190.4 202.3 109.2 55.2 51.2
54.1 51.5 100.0 144.1 192.1 139.4
130.1 105.4 111.2 63.1 136.3 56.8
Overall Weighted MFLOPS Do Spans 471, 90, 19
Maximum Average Geomean Harmean Minimum
253.2 129.3 115.3 101.6 46.7
Results of last two calculations
4.850340602749970e+02 1.300000000000000e+01
Total Elapsed Time 11.9 seconds
So far, results of the last two calculations have been identical on all benchmark runs.
System T2, with the high speed vfpv3 hardware, is again shown to be around 20 times faster than the tablet T1, on these floating point calculations.
This time, T2 performance is quite respectable, compared with an Intel Atom, running at twice the clock speed.
Sys
See  ARM    MHz  Android  Run Time     MFLOPS for 24 loops Do Span 471
T1   926EJ  800  2.2       97.3 secs     5.6    6.4    6.2    6.1    4.6    4.9
                                         5.9    6.1    6.0    9.0    5.8    3.9
                                         4.0    3.6    3.8    5.6    7.6    4.5
                                         5.7    4.3    5.2    2.5    5.7    7.4
       Max  Average  Geomean  Harmean    Min
       9.9      5.6      5.4      5.2    2.4
T2   v7-A9  800  2.3.4     11.9 secs   172.6  127.5  253.2  248.6   71.6  141.2
                                       197.6  190.4  202.3  109.2   55.2   51.2
                                        54.1   51.5  100.0  144.1  192.1  139.4
                                       130.1  105.4  111.2   63.1  136.3   56.8
       Max  Average  Geomean  Harmean    Min
     253.2    129.3    115.3    101.6   46.7
ET2  v7-A8  Emul 4.03     124.8 secs     5.0    4.8    5.1    4.8    4.3    4.9
                                         4.7    4.4    4.6    7.1    5.0    3.0
                                         3.5    3.6    3.2    3.8    5.5    3.4
                                         4.7    3.2    5.1    3.5    4.3    5.4
       Max  Average  Geomean  Harmean    Min
       7.2      4.3      4.2      4.1    2.3
                           Max  Average  Geomean  Harmean    Min
Atom   1666 MHz Linux    465.2    212.2    185.1    157.4   49.7
Core 2 2400 MHz Linux   2384.9   1038.1    805.8    582.1  161.0
MemSpeed Benchmark
This benchmark measures data reading speeds in MegaBytes per second, carrying out calculations on arrays of cache and RAM data, sized 2 x 8 KB to 2 x 32 MB. Calculations are x[m]=x[m]+s*y[m] and x[m]=x[m]+y[m], using double and single precision floating point and x[m]=x[m]+s+y[m] and x[m]=x[m]+y[m] with integers. Million Floating Point Operations Per Second (MFLOPS) speed can be calculated by dividing double precision MB/second by 8 and 16, for the two tests, and single precision speeds by 4 and 8. Assembly listings for integer tests show that Millions of Instructions Per Second can be found by multiplying MB/second by 0.78 with 2 adds and 0.66 for the other test. Cache sizes are indicated by varying performance as memory usage changes. Download the app from www.roylongbottom.org.uk/MemSpeed.apk.
Emailed results include System Information as provided above. Results follow. System T2 appears to have 32 KB L1 cache and 128 KB L2 cache and maximum MFLOPS of 130 DP and 133 SP with Integer MIPS up to 1227. The results for T1 are 7.5 and 11 MFLOPS and 468 MIPS, with only a 16 KB L1 cache.
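The conversions from MB/second to MFLOPS and MIPS described above can be written out as a small C sketch. The divisors and multipliers are the ones quoted in the text; the enum and function names are made up for this illustration, and the integer factors apply only to the compiled code the author inspected.

```c
/* Which of the two MemSpeed calculations a reading belongs to. */
typedef enum {
    TEST_SCALED_ADD,   /* x[m]=x[m]+s*y[m], or x[m]=x[m]+s+y[m] for integers */
    TEST_PLAIN_ADD     /* x[m]=x[m]+y[m] */
} MemSpeedTest;

/* Double precision: 2 operations per 16 bytes read, or 1 per 16 bytes. */
double mflops_double(double mb_per_second, MemSpeedTest test)
{
    return mb_per_second / (test == TEST_SCALED_ADD ? 8.0 : 16.0);
}

/* Single precision: 2 operations per 8 bytes read, or 1 per 8 bytes. */
double mflops_single(double mb_per_second, MemSpeedTest test)
{
    return mb_per_second / (test == TEST_SCALED_ADD ? 4.0 : 8.0);
}

/* Integer: instructions per byte, as derived from the assembly listings. */
double mips_integer(double mb_per_second, MemSpeedTest test)
{
    return mb_per_second * (test == TEST_SCALED_ADD ? 0.78 : 0.66);
}
```

Applied to the best T2 readings this gives 1042 / 8 = 130 MFLOPS double precision, 533 / 4 = 133 MFLOPS single precision and 1574 x 0.78 = 1227 MIPS, matching the maxima quoted above.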
The program code used is the same as Linux Multithreading Benchmarks.htm and (nearly) MemSpd2k Results.htm.
Results on an Intel Atom, for a single thread, using the multithreading benchmark, are shown below. On a per MHz basis, the Cortex-A9 performs well using L1 cache but (DDR2) RAM speeds are particularly slow.
System T2, ARM Cortex-A9 800 MHz, Android 2.3.4
Android MemSpeed Benchmark 17-Feb-2012 17.41
Reading Speed in MBytes/Second
 Memory   x[m]=x[m]+s*y[m] Int+       x[m]=x[m]+y[m]
 KBytes    Dble    Sngl     Int    Dble    Sngl     Int
     16    1002     533    1574    1742     812    1639
     32    1042     530    1533    1717     701    1751
     64     994     461     984    1144     644     942
    128     656     396     691     696     511     673
    256     269     259     273     271     255     280
    512     249     246     244     256     244     247
   1024     249     249     244     240     253     244
   4096     246     244     247     246     242     245
  16384     253     236     252     254     241     246
  65536     254     241     253     250     252     241
Total Elapsed Time 19.4 seconds
System T1, ARM 926EJ 800 MHz, Android 2.2
Android MemSpeed Benchmark 17-Feb-2012 17.47
Reading Speed in MBytes/Second
 Memory   x[m]=x[m]+s*y[m] Int+       x[m]=x[m]+y[m]
 KBytes    Dble    Sngl     Int    Dble    Sngl     Int
     16      60      44     600      93      76     694
     32      46      38     146      60      56     146
     64      48      37     154      66      54     144
    128      48      36     155      65      53     144
    256      48      36     153      65      56     135
    512      48      38     153      65      57     142
   1024      47      37     153      65      57     142
   4096      47      37     152      67      55     142
  16384      47      37     152      70      63     138
  65536      44      37     153     106      92     142
Total Elapsed Time 93.5 seconds
Atom 1666 MHz, DDR2 RAM 533 MHz, Linux
     16    1892     943    1979    2759    1329    2813
     64    1647     879    1690    2334    1269    2323
  65536    1515     834    1517    2010    1208    1945
BusSpeed Benchmark
This benchmark is designed to identify reading data in bursts over buses. The program starts by reading a word (4 bytes) with an address increment of 32 words (128 bytes) before reading another word. The increment is reduced by half on successive tests, until all data is read. On reading data from RAM, 64 Byte bursts are typically used. Then, measured reading speed reduces from a maximum, when all data is read, to a minimum on using 16 word increments. Potential maximum speed can be estimated by multiplying this minimum value by 8. With this burst rate, measured speed at 32 word and 16 word increments are likely to be the same. Cache sizes are indicated by varying speed as memory use changes. Note, with smallest L1 cache demands, measured speed can be low due to overheads when reading little data.
The program C source code is as used for Linux, see BusSpd2K Results.htm. This has unrolled loops with up to 64 AND statements (& array[i] to & array[i+63]). The Linux compiler for Intel CPUs translates this into 64 assembly instructions ANDing data from indexed memory locations. In this case, Integer MIPS approximately equals MB/second divided by 4 (see Atom results below at 16 KB Read All). The Android NDK compiler generates 64 ANDs, 64 loads and 64+ adds/moves, and this reduces comparative performance. The results for memory data transfers also indicate that the Cortex-A9 CPU can be slower than the older ARM processor.
This benchmark application can be downloaded from www.roylongbottom.org.uk/BusSpeed.apk.
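The access pattern described above, one word read per address increment with the increment halved on each test, can be sketched in C as follows. This is a rolled-up illustration under assumed names: the actual benchmark source uses unrolled loops with up to 64 AND statements, and how it counts bytes for the MB/second figure is not shown here.

```c
#include <stddef.h>

/* AND together one 4 byte word out of every inc_words words.
   inc_words = 32 touches one word per 128 bytes; inc_words = 1 reads all data. */
unsigned int read_with_increment(const unsigned int *array,
                                 size_t n_words, size_t inc_words)
{
    unsigned int and_result = 0xFFFFFFFFu;
    for (size_t i = 0; i < n_words; i += inc_words) {
        and_result &= array[i];
    }
    return and_result;
}

/* Number of words actually read for a given array size and increment. */
size_t words_read(size_t n_words, size_t inc_words)
{
    return (n_words + inc_words - 1) / inc_words;
}
```

A test run would call read_with_increment() with increments of 32, 16, 8, 4, 2 and finally 1 word over the same array, timing each pass. Returning the ANDed value gives the compiler a reason to keep the loads.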
System T2, ARM Cortex-A9 800 MHz, Android 2.3.4
Android BusSpeed Benchmark 19-Feb-2012 14.00
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 1748 1347 2154 2331 2331 2285
32 1038 1446 1474 1678 1735 1899
64 407 490 508 592 489 826
128 180 213 183 258 266 530
256 47 42 57 83 79 132
512 41 39 47 73 68 137
1024 39 38 52 70 57 135
4096 38 26 60 69 67 115
16384 39 32 59 71 59 135
65536 34 33 59 67 63 123
Total Elapsed Time 6.9 seconds
Typical variation in results
16 403 421 503 2316 2331 2285
32 1344 1446 1428 1658 1750 1943
System T1, ARM 926EJ 800 MHz, Android 2.2
Android BusSpeed Benchmark 19-Feb-2012 13.47
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 96 95 199 407 426 467
32 35 34 34 68 124 201
64 29 29 30 58 108 174
128 30 30 29 57 108 182
256 29 30 30 56 107 169
512 28 29 29 57 106 181
1024 28 29 29 55 99 176
4096 28 29 29 57 106 177
16384 28 28 29 53 103 181
65536 28 29 29 56 106 179
Total Elapsed Time 6.3 seconds
Atom 1666 MHz, DDR2 RAM 533 MHz, Linux
16 5024 5502 6040 6312 6382 6412
64 493 404 786 1485 2588 3941
65536 136 262 521 1036 2008 3295
RandMem Benchmark
The RandMem benchmark carries out four tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests using 32 bit integers. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is not used, and by the size of preceding caches.
The benchmark uses the first four tests described in RandMem Results.htm and can be downloaded from www.roylongbottom.org.uk/RandMem.apk. The program structure is as follows, with array xi indexing via sequential or random numbers stored in the array.
Read       - toti = toti & xi[xi[i+0]] | xi[xi[i+2]]
                         & xi[xi[i+4]] | xi[xi[i+6]] and &| to i+30
Read/write - xi[xi[i+2]] = xi[xi[i+0]];
             xi[xi[i+6]] = xi[xi[i+4]]; repeated to i+30 and i+28
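A minimal C sketch of this structure is given below, shortened to four indexed accesses per pass instead of running to i+30, and assuming the index pattern continues as i+0, i+2, i+4, i+6. The function names, the simple random number generator and the use of a shuffled permutation for the random case are all assumptions made for this illustration, not details taken from the benchmark source.

```c
#include <stddef.h>

/* Serial selection: xi[i] = i, so xi[xi[i]] walks memory in order. */
void fill_serial(int *xi, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        xi[i] = (int)i;
    }
}

/* Random selection (assumed scheme): shuffle the serial indices with a
   Fisher-Yates pass driven by a small linear congruential generator. */
void fill_random(int *xi, size_t n, unsigned int seed)
{
    fill_serial(xi, n);
    for (size_t i = n; i > 1; i--) {
        seed = seed * 1664525u + 1013904223u;
        size_t j = (size_t)(seed >> 16) % i;
        int tmp = xi[i - 1];
        xi[i - 1] = xi[j];
        xi[j] = tmp;
    }
}

/* Read test. Parentheses follow C precedence, where & binds tighter than |. */
int read_indexed(const int *xi, size_t n)
{
    int toti = -1;                      /* all bits set */
    for (size_t i = 0; i + 8 <= n; i += 8) {
        toti = (toti & xi[xi[i + 0]]) | (xi[xi[i + 2]] & xi[xi[i + 4]])
             | xi[xi[i + 6]];
    }
    return toti;
}

/* Read/write test: every stored value is itself a valid index,
   so the array stays safe to use for further passes. */
void read_write_indexed(int *xi, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        xi[xi[i + 2]] = xi[xi[i + 0]];
        xi[xi[i + 6]] = xi[xi[i + 4]];
    }
}
```

The same loops serve both cases: only the contents of xi change, so any speed difference between serial and random runs comes from the memory system rather than from the instruction stream.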
The results below show that random access performance is approximately the same as BusSpeed with address increments of 32 words, the burst reading effect. This program is again based on indexed memory addressing where the older technology CPU can be faster than the Cortex-A9. This might be due to poor implementation of the memory bus interface on this tablet, as noted on PC tests. Atom results are provided, again showing better relative performance, particularly when using data from RAM. As with BusSpeed, and not noticed so far on the other benchmarks, measured speeds using L1 cache are sometimes slow to start with.
System T2, ARM Cortex-A9 800 MHz, Android 2.3.4
Android RandMem Benchmark 20-Feb-2012 16.45
MBytes/Second transferring 4 Byte Words
Memory Serial....... Random.......
KBytes Read Rd/Wrt Read Rd/Wrt
16 1777 1879 1669 1809
32 1359 1394 1185 1505
64 799 861 621 755
128 394 202 295 333
256 147 146 92 104
512 133 136 71 42
1024 125 125 53 62
4096 129 98 41 53
16384 128 113 42 45
65536 121 115 30 32
Total Elapsed Time 11.7 seconds
System T1, ARM 926EJ 800 MHz, Android 2.2
Android RandMem Benchmark 20-Feb-2012 16.51
MBytes/Second transferring 4 Byte Words
Memory Serial....... Random.......
KBytes Read Rd/Wrt Read Rd/Wrt
16 841 1119 666 955
32 222 147 83 62
64 145 169 56 53
128 198 181 48 57
256 191 178 44 58
512 196 180 27 32
1024 189 180 22 26
4096 193 181 19 23
16384 195 177 19 22
65536 186 166 19 22
Total Elapsed Time 40.5 seconds
Atom 1666 MHz, DDR2 RAM 533 MHz, Linux
16 3976 5132 4100 5134
64 3086 3215 1042 1349
65536 2708 1290 49 74
T1 Device TTFone M013S 10.1 inch tablet, 300-800 MHz VIA 8650
Screen pixels w x h 600 x 1024
Android Build Version 2.2
Processor : ARM926EJ-S rev 5 (v5l)
BogoMIPS : 797.97
Features : swp half thumb fastmult edsp java
CPU part : 0x926
Linux version 2.6.32.9
T2 Device WayTeq xTAB-70 7 inch tablet, 800 MHz Cortex-A9
Screen pixels w x h 600 x 800
Android Build Version 2.3.4
Processor : ARMv7 Processor rev 1 (v7l)
BogoMIPS : 2035.71
Features : swp half thumb fastmult vfp edsp neon vfpv3
CPU part : 0xc09 - Cortex-A9
Linux version 2.6.34
P1 Device Motorola Milestone 1 CyanogenMod 7 ROM overclocked
Screen pixels w x h 854 x 480
Android Build Version 2.3.5
Processor : ARMv7 Processor rev 3 (v7l)
BogoMIPS : 598.90
Features : swp half thumb fastmult vfp edsp neon vfpv3
CPU part : 0xc08 - Cortex-A8
Linux version 2.6.32.9
P2 Device Samsung Galaxy s
Screen pixels w x h 480 x 800
Android Build Version 2.2
Processor : ARMv7 Processor rev 2 (v7l)
BogoMIPS : 996.00
Features : swp half thumb fastmult vfp edsp neon vfpv3
CPU part : 0xc08 - Cortex-A8
Linux version 2.6.32.9
P3 Device Motorola Milestone 3 (XT860)
Screen pixels w x h 960 x 540
Android Build Version 2.3.6
Processor : ARMv7 Processor rev 2 (v7l)
processor : 0 - CPU 1 of 2
BogoMIPS : 598.90 - too low?
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3
CPU part : 0xc09 - Cortex-A9
Linux version 2.6.35.7
EP1 Device Emulator 3 GHz Phenom
or 2.4 GHz Core 2
Screen pixels w x h 240 x 320
Android Build Version 2.2
Processor : ARM926EJ-S rev 5 (v5l)
BogoMIPS : 522.64
Linux version 2.6.29
ET1 Device Emulator 3 GHz Phenom
Screen pixels w x h 600 x 1024
Android Build Version 2.2
Processor : ARM926EJ-S rev 5 (v5l)
BogoMIPS : 530.84
Linux version 2.6.29
ET2 Device Emulator 3 GHz Phenom
or 2.4 GHz Core 2
Screen pixels w x h 600 x 1024
Android Build Version 4.0.3
Processor : ARMv7 Processor rev 0 (v7l)
BogoMIPS : 527.56
Linux version 2.6.29