Title

Atom CPU Hyperthreading Benchmarks

Contents

CPUIDMP Whets32MP BusMP
RandMP32 OpenMPMFLOPS Other Benchmarks

Introduction

Benchmarks were run on a Netbook that has a single core Intel Atom processor. This has 24 KB L1 data cache, 512 KB L2 cache and 533 MHz single channel DDR2 RAM. The CPU is designed for low power consumption, having 16 stage pipelines (longer than Core CPUs) and in-order instruction issue (compared with out-of-order issue on other modern CPUs). Two integer and two floating point arithmetic-logic units are provided (3 or 4 on the latest mainstream processors). As could be expected from these limitations, performance at a given CPU MHz is lower than that of the Core processor line. On the other hand, the Atom has Hyperthreading, where two threads can utilise different parts of the hardware at the same time to enhance performance.

CPUSpeed.htm provides a comparative summary of my single CPU processor benchmarks in terms of %MIPS/MHz and %MFLOPS/MHz, with separate figures for CPU/L1 Cache, L2 Cache and RAM. CPU/L1 cache results show that the 1600 MHz Atom runs at the approximate average equivalent speed of a single Core 2 processor at 1200 MHz for Integer MIPS calculations, 550 MHz for i387 MFLOPS, 750 MHz for SSE MFLOPS and 350 MHz for SSE2 64 bit MFLOPS. L2 results are slightly worse, but performance via RAM can be similar to that of a Core 2 system using the same single channel RAM.

For 32 bit and 64 bit versions of the following multi-threading programs, the benchmarks and source code can be found in DualCore.zip and NewSource.zip, with both benchmark and source code for the fifth benchmark in OpenMPMFLOPS.zip.


  Hardware  Information
   CPU GenuineIntel, Features Code BFE9FBFF, Model Code 000106C2
   Intel(R) Atom(TM) CPU N270   @ 1.60GHz Measured 1596 MHz
   Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
  Windows  Information
   Intel processor architecture, 2 CPUs 
   Windows NT  Version 5.1, build 2600, Service Pack 3
   Memory 1015 MB, Free 680 MB
   User Virtual Space 2048 MB, Free 2043 MB

The benchmarks produce text log files, in addition to on-screen progress and results. All logs include the above system details. Note that a single processor with Hyperthreading (HT) is identified as having two CPUs.

The first four benchmarks have been modified to use 1, 2, 4, 6 and 8 threads, particularly for the Quad Core i7, which has Hyperthreading and appears to Windows as having 8 CPUs. For further detail and results see Quad Core 8 Thread.htm.



CPUIDMP

CPUIDMP executes three passes of simple additions to registers attempting to demonstrate maximum CPU speeds. Firstly an integer and an SSE floating point test are run separately. They are then run as two threads of equal priority, where both should run at full speed with 2 CPUs. The benchmark has a third section using four threads with two SSE tests and two integer tests. Results are available in WhatCPU Results.htm.

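Both types of instruction appear to nearly fully utilise the pipelines, for example producing almost two integer additions per CPU clock cycle (3133 MIPS at 1596 MHz), so there is not much room for improvement with HT. The Percent column entries for the two and four thread tests sum to 110% and 112% of the separate test speeds, indicating a 10% to 12% improvement in throughput.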


 
   CPU ID and MP Speed Test 32 bit Version 1.0 Sun Jul 11 13:31:53 2010
 
          Assembled with Microsoft ml.exe Version 6.15.8803

  Speed adding to registers   Pass 1   Pass 2   Pass 3  Average  Percent

  Separate Tests
  32 bit SSE   MFLOPS          5042     5066     5066      5058
  32 bit Integer MIPS          3123     3148     3128      3133

  Two Threads Equal Priority
  32 bit SSE   MFLOPS          2995     2998     2994      2996      59%
  32 bit Integer MIPS          1585     1588     1586      1586      51%

  Four Threads, First Normal Priority, Others Normal - 1
  32 bit SSE   MFLOPS          3040     3002     3074      3039      60%
  32 bit Integer MIPS           561      608      579       583      19%
  32 bit SSE   MFLOPS          1024      751      615       797      16%
  32 bit Integer MIPS           442      571      542       518      17% 
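
The throughput figures quoted for CPUIDMP can be rechecked from the Average column of the log above. The following is a minimal sketch in Python, not part of the benchmark; the variable and function names are illustrative assumptions and the numbers are copied from the log.

```python
# Average column of the CPUIDMP log above; MHz is the logged measured clock.
MEASURED_MHZ = 1596
SEPARATE = {"sse": 5058, "int": 3133}  # MFLOPS and MIPS when run alone
TWO_THREADS = [("sse", 2996), ("int", 1586)]
FOUR_THREADS = [("sse", 3039), ("int", 583), ("sse", 797), ("int", 518)]


def combined_percent(results):
    """Sum each thread's speed as a percentage of its separate test speed."""
    return sum(100.0 * speed / SEPARATE[kind] for kind, speed in results)


# Integer additions completed per clock cycle in the separate test.
adds_per_clock = SEPARATE["int"] / MEASURED_MHZ

print(f"Integer additions per clock: {adds_per_clock:.2f}")
print(f"Two threads:  {combined_percent(TWO_THREADS):.1f}% of separate speed")
print(f"Four threads: {combined_percent(FOUR_THREADS):.1f}% of separate speed")
```

Working from the averages rather than the rounded Percent column gives totals close to 110% and 111%.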





Whets32MP

The Whetstone Benchmark has various routines that execute floating point and integer instructions. The benchmark is run in the main thread and another copy in a low priority second thread which should mainly run at the same speed with two CPUs. When run on a single processor, the second thread receives little or no CPU time. Further results can be found in Whetstone Results.htm.

The second results shown were produced with CPU Affinity flags set to use one processor, where the second thread made no contribution. With HT, the two threads lead to a near doubling of MFLOPS (688 compared with 346) and a 33% improvement in VAX MIPS.

The third results are from running the benchmark on one CPU of an 1830 MHz Core 2 Duo. As can be seen, HT on the Atom leads to faster floating point calculations for these particular tests.

 
 Whetstone Single Precision MP SSE Benchmark Sun Jul 11 13:41:18 2010

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS
 
    688   6298   1526    704    729    634   27.5   19.8    791   1539   1641
  Thread 1               363    357    311   13.9   10.1    396    771   1240
  Thread 2               341    372    323   13.6   9.68    396    769    401

  Total No HT
    346   4711    820    365    361    314   15.0   10.4    427   1158   1693

  Total one Core 2
    539  10964   1790    630    633    393   42.8   21.9   1458   1403   5153
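
The ratios behind these comparisons can be recomputed from the three total lines of the log. This is a small sketch in Python under the assumption that the first three columns (MFLOPS Gmean, Vax MIPS and MWIPS) are the ones being compared; the dictionary and function names are illustrative only.

```python
# Total lines copied from the Whets32MP log above.
WITH_HT = {"mflops_gmean": 688, "vax_mips": 6298, "mwips": 1526}
NO_HT = {"mflops_gmean": 346, "vax_mips": 4711, "mwips": 820}
ONE_CORE2 = {"mflops_gmean": 539, "vax_mips": 10964, "mwips": 1790}


def ratio(numerator, denominator, key):
    """Speed ratio of two result sets for one measurement."""
    return numerator[key] / denominator[key]


for key in WITH_HT:
    print(
        f"{key:13s} HT/No HT {ratio(WITH_HT, NO_HT, key):.2f}"
        f"   HT/Core 2 {ratio(WITH_HT, ONE_CORE2, key):.2f}"
    )
```

On these figures the Atom with HT is ahead of the single Core 2 CPU on the MFLOPS mean only; the Core 2 remains faster on VAX MIPS and MWIPS.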



BusMP

BusMP starts by reading integer words with 128 byte (32 words) address increments, to indicate memory or cache bus burst reading speeds, then reduces the increment to finally read all words sequentially. The last test reads 128 bits for four 32 bit SSE2 integers. Speed is measured using data in caches and RAM. Results are in BusSpd2K Results.htm.

These results show that performance can be much slower using two threads, as each thread has its own full copy of the data. With each thread requiring 24 KB, data is read from L2 cache instead of L1 and, at 384 KB, from RAM instead of L2 cache. With two threads there is a small gain using L1 cache (107% at 6 KB) and nearly 60% via L2 (159% at 96 KB) but, as might be expected, speed from RAM is slightly slower (97%). The worst case is half speed with 24 KB data. When two threads are run with Affinity set to use one CPU (No HT), the total speed of the two threads is the same as that for one thread.
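
The reasoning about which level of the memory hierarchy serves each test can be sketched as follows. This is a simplified model in Python, assuming that the combined data of all threads must fit entirely in a cache to be served from it; the cache sizes are those given in the introduction and the function name is hypothetical.

```python
L1_DATA_KB = 24  # L1 data cache size from the introduction
L2_KB = 512      # L2 cache size from the introduction


def serving_level(kbytes_per_thread, threads):
    """Return the level expected to supply the data (simplified model)."""
    total_kb = kbytes_per_thread * threads  # each thread has its own copy
    if total_kb <= L1_DATA_KB:
        return "L1"
    if total_kb <= L2_KB:
        return "L2"
    return "RAM"


for kb in (6, 24, 96, 384, 768):
    print(f"{kb:4d} KB  1 thread: {serving_level(kb, 1):3s}"
          f"  2 threads: {serving_level(kb, 2)}")
```

The 24 KB and 384 KB rows are the ones where the level changes between one and two threads, which matches the two sizes where the two thread totals fall below the single thread speeds.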

 
      MP Bus Speed Test 32 bit Version 1.2 Sun Jul 11 13:23:16 2010
 
                  Part 1 - Single Thread MBytes/Second

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     4453     5223     5331     5633     5665     5566    23245
       24     3439     3649     3415     4424     5046     5265    14800
       96      462      394      735     1360     2380     3525     5504
      384      431      386      712     1351     2280     3455     5364
      768      126      223      462      936     1777     3115     3747
     1536      115      220      442      866     1712     3004     3538
    16380      102      207      409      828     1621     2988     3272
   131070      103      204      414      814     1644     2945     3310

               Part 2 - Two Threads Total MBytes/Second

                                                                          Average
   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2   % of 1

        6     5065     5679     5802     5959     5965     5913    23804     107%
       24      696      758     1350     2301     3644     4562     9151      50%
       96      645      744     1307     2243     3642     4540     8957     159%
      384      111      230      477      962     1962     3628     3744      69%
      768      102      209      419      833     1690     3340     3328    ]
     1536      102      208      415      815     1645     2794     3069    ]
    16380       97      204      415      826     1673     3308     3259    ] 
   131070      101      208      418      829     1672     3311     3294    ] 97%

  For 32 bit MIPS divide MB/Second by 4. SSE2 divide by 16 for 128 bit MIPS 
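
The Average % of 1 column appears to be the mean of the seven two thread to one thread speed ratios in each row. That interpretation is an inference from the figures, not something the log states. A minimal Python sketch, using the first four rows of each part and illustrative names, along with the MIPS conversion from the note above:

```python
# Rows copied from Part 1 and Part 2 of the log, keyed by KBytes.
ONE_THREAD = {
    6: [4453, 5223, 5331, 5633, 5665, 5566, 23245],
    24: [3439, 3649, 3415, 4424, 5046, 5265, 14800],
    96: [462, 394, 735, 1360, 2380, 3525, 5504],
    384: [431, 386, 712, 1351, 2280, 3455, 5364],
}
TWO_THREADS = {
    6: [5065, 5679, 5802, 5959, 5965, 5913, 23804],
    24: [696, 758, 1350, 2301, 3644, 4562, 9151],
    96: [645, 744, 1307, 2243, 3642, 4540, 8957],
    384: [111, 230, 477, 962, 1962, 3628, 3744],
}


def average_percent(kbytes):
    """Mean of the per-test two thread / one thread ratios, as a percentage."""
    pairs = zip(ONE_THREAD[kbytes], TWO_THREADS[kbytes])
    ratios = [two / one for one, two in pairs]
    return 100.0 * sum(ratios) / len(ratios)


def mips(mbytes_per_second, bytes_per_load):
    """Loads per microsecond: 4 bytes for 32 bit words, 16 for 128 bit SSE2."""
    return mbytes_per_second / bytes_per_load


for kb in ONE_THREAD:
    print(f"{kb:4d} KB  {average_percent(kb):.0f}% of one thread")
print(f"ReadAll at 6 KB, one thread: {mips(5566, 4):.0f} MIPS")
```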



RandMP32

The program uses the same code for serial and random access, selected via a complex indexing structure, and comprises Read (RD) and Read/Write (RW) tests. These are run to use data from L1 cache, L2 cache and RAM. Results are in RandMem Results.htm. With two threads, each has its own code and both use the same data, but the second thread starts at the half way point.
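
One way that a single loop can serve both serial and random access is to follow a table of next-index links, with only the table contents differing. The Python sketch below illustrates that idea only; it is an assumption and not the benchmark's actual indexing structure, and all names in it are hypothetical. The random table is built with Sattolo's algorithm so that the links form one cycle covering every word.

```python
import random


def serial_links(n):
    """Each word points to the next one, wrapping at the end."""
    return [(i + 1) % n for i in range(n)]


def random_links(n, seed=1):
    """Random single-cycle permutation (Sattolo's algorithm)."""
    rng = random.Random(seed)
    links = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        links[i], links[j] = links[j], links[i]
    return links


def read_pass(data, links, start):
    """Same loop for both access patterns: read a word, follow its link."""
    total = 0
    i = start
    for _ in range(len(data)):
        total += data[i]
        i = links[i]
    return total


data = list(range(1000))
half_way = len(data) // 2  # where a second thread would start
print(read_pass(data, serial_links(len(data)), 0))
print(read_pass(data, random_links(len(data)), half_way))
```

Because both link tables form a single cycle, a pass starting at word 0 or at the half way point reads every word exactly once, so the two passes do the same amount of work in a different order.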

Below the logged results, the speeds using two threads are shown as a percentage of those using one. Random access is particularly slow in terms of MB/second, as not all of the data transferred is used when burst reading is involved. The greater than two times improvement on some Random RW tests might be due to achieving a higher hit rate on cached data.

The last percentages are for a Core 2 Duo CPU, where speeds via L1 cache can be much slower with two threads, reading and writing from/to two L1 caches. This can be put down to cache lines being invalidated and transferred between the two caches to maintain data coherency when sharing the same data array. This effect is also apparent on larger data sizes, to some extent, on dual core CPUs that do not use shared L2 caches.



  RandMP Write/Read Test 32 bit Version 1.1 Sun Jul 11 13:36:02 2010

               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 1 Thread
 Serial RD     3718    3797    2802    2803    2040    2048    2030    2032
 Serial RW     1946    2252    1902    1902    1415    1186    1120    1112
 Random RD     3412    3869     802     489     126      77      55      45
 Random RW     1840    2265     822     516     185     110      80      68

 2 Threads
 Serial RD1    2444    2823    2248    2283    1618    1612    1593    1584
 Serial RD2    2425    2783    2204    2261    1592    1573    1575    1563

 Serial RW1    1823    2141    1722    1712    1328    1082     851     723
 Serial RW2    1762    2090    1706    1696    1209    1022     712     576

 Random RD1    2631    2858     633     408     120      72      52      42
 Random RD2    2602    2819     619     401     119      71      52      41

 Random RW1    1797    2087     657     424     205     137      66      80
 Random RW2    1790    2078     649     414     202     137      64      80

                     End of test Sun Jul 11 13:36:54 2010

           For approximate speed in MIPS divide MBytes/Second by 3.2 

               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
 2 Thread % of 1
 Serial RD     131%    148%    159%    162%    157%    156%    156%    155%
 Serial RW     184%    188%    180%    179%    179%    177%    140%    117%
 Random RD     153%    147%    156%    165%    190%    186%    189%    184%
 Random RW     195%    184%    159%    162%    220%    249%    163%    235%

 Example Dual Core CPU
 Serial RD     205%    196%    191%    193%    191%    189%    184%    184%
 Serial RW      47%     46%    181%    181%    182%    184%    132%    148%
 Random RD     200%    197%    166%    163%    162%    164%    143%    195%
 Random RW      17%     17%     80%    142%    148%    158%    169%    201%



OpenMPMFLOPS

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers. Results are in OpenMP MFLOPS.htm. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations per input data word (the expression shown has 8). Three data sizes are used, 0.1M, 1M and 10M words, or 0.4M, 4M and 40M Bytes. In the case of the Atom, virtually all calculations will involve accessing RAM.
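
The expression above uses four additions or subtractions and... more precisely three additions inside the brackets, three multiplications, one subtraction and one final addition, giving 8 operations per word. The Python sketch below shows that form and the MFLOPS arithmetic implied by the log columns (words, operations per word, repeat passes and seconds). It runs serially and is for illustration only; the benchmark itself is compiled C/C++ using OpenMP, and the function names here are assumptions.

```python
def eight_ops_per_word(x, a, b, c, d, e, f):
    """Apply the 8 operation expression to every word of x."""
    return [(v + a) * b - (v + c) * d + (v + e) * f for v in x]


def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test."""
    return words * ops_per_word * passes / seconds / 1.0e6


# One word, easy constants: (1+1)*2 - (1+3)*4 + (1+5)*6
print(eight_ops_per_word([1.0], 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))

# First line of the No HT log below: 100000 words, 2 ops, 2500 passes.
print(f"{mflops(100000, 2, 2500, 2.895631):.0f} MFLOPS")
```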

The first results shown below were produced with CPU Affinity flags set to use one processor (No HT), and the second with Hyperthreading. HT speeds are in the range 167% to 187% of the No HT speeds. Although these gains are excellent, performance relative to a single core of a Core 2 Duo is not very good, the latter having much larger caches and better arithmetic pipeline arrangements. For example, all Atom HT results are much slower than those from one CPU of an 1830 MHz laptop Core 2 Duo, which is two to more than three times faster.



  32 Bit OpenMP MFLOPS Benchmark 1 Sun Jul 11 13:53:33 2010 - No HT

  Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for 80x86

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   2.895631      173    0.929475   Yes
 Data in & out    1000000     2      250   2.688429      186    0.992543   Yes
 Data in & out   10000000     2       25   2.654217      188    0.999249   Yes

 Data in & out     100000     8     2500   7.871326      254    0.957164   Yes
 Data in & out    1000000     8      250   7.621398      262    0.995525   Yes
 Data in & out   10000000     8       25   7.593257      263    0.999550   Yes

 Data in & out     100000    32     2500  21.799048      367    0.890377   Yes
 Data in & out    1000000    32      250  21.626982      370    0.988102   Yes
 Data in & out   10000000    32       25  21.598040      370    0.998799   Yes


 ******************************************************************************

  32 Bit OpenMP MFLOPS Benchmark 1 Sun Jul 11 13:52:14 2010 - With HT

  Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for 80x86

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   1.546320      323    0.929475   Yes
 Data in & out    1000000     2      250   1.608478      311    0.992543   Yes
 Data in & out   10000000     2       25   1.512073      331    0.999249   Yes

 Data in & out     100000     8     2500   4.345824      460    0.957164   Yes
 Data in & out    1000000     8      250   4.519152      443    0.995525   Yes
 Data in & out   10000000     8       25   4.209331      475    0.999550   Yes

 Data in & out     100000    32     2500  11.787812      679    0.890377   Yes
 Data in & out    1000000    32      250  11.788433      679    0.988102   Yes
 Data in & out   10000000    32       25  11.808370      677    0.998799   Yes 
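
The HT gains can be recomputed from the Seconds columns of the two logs, since both runs execute the same number of operations. This is a minimal Python sketch with the times copied from the logs above; the list names are illustrative.

```python
# Seconds columns of the two logs above, in the order the tests are listed.
NO_HT_SECONDS = [
    2.895631, 2.688429, 2.654217,
    7.871326, 7.621398, 7.593257,
    21.799048, 21.626982, 21.598040,
]
WITH_HT_SECONDS = [
    1.546320, 1.608478, 1.512073,
    4.345824, 4.519152, 4.209331,
    11.787812, 11.788433, 11.808370,
]

# Same work in less time, so the speed ratio is the inverse time ratio.
ht_percent = [
    100.0 * no_ht / with_ht
    for no_ht, with_ht in zip(NO_HT_SECONDS, WITH_HT_SECONDS)
]

for value in ht_percent:
    print(f"{value:.0f}%")
print(f"Range {min(ht_percent):.0f}% to {max(ht_percent):.0f}%")
```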



Other Benchmarks

Disk Speed - The 5400 RPM hard disk in the Netbook operates at a reasonable speed, with DiskGraf Results.htm (jAtoLap) showing 34 MB/second writing and 47 MB/second reading. BMPSpeed Results.htm (AtomM) shows image writing speeds of up to 39 MB/second, aided by caching, and reading/formatting at up to 36 MB/second.

Intel 945GSE Graphics - Taking average BMPSpeed edit, rotate, save and load times of around 1 second as demonstrating suitability for editing images, the Netbook appears to be adequate for images of up to 32 MB, as good as many Pentium 4 based PCs (see BMPSpeed Results.htm 32 MB). Image scrolling speed, at this size, is also very fast at 1.9 milliseconds per full screen window or 323 M Pixels/second.

Results for system Atom 1, at 32 bit colour settings, show that the Netbook is as good as many Pentium 4 systems when running Windows and DirectDraw 2D benchmarks. As might be expected, the Netbook is not very fast on 3D applications, but it will run some of them. Atom 1, 32 bit results for Direct3D show that maximum Frames Per Second (FPS) are similar to those of a laptop with Intel X3100 graphics (at the same screen pixel size), but can be faster on tests that are more dependent on CPU speed. The PC can also run DirectX 9 applications, with Pixel Shader 2 but not Vertex Shader 2, again possibly faster than the X3100. The same arguments apply to OpenGL applications. It should be noted that performance is generally better than that of PCs of the Pentium II era.




Roy Longbottom August 2010



The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection