Title

64-Bit Windows 7, Phenom II and Core i7 Benchmarks

Index

Introduction Systems and compilers used
Identification Identifying 64-Bit Windows 7, Phenom II and Core i7
Paging 64 bit program can only allocate small portion of 8192 GB available virtual space
Vista paging more efficient than Windows XP, Windows 7 even better
Accessing data 400 times slower than using RAM, 8 times slower than normal disk I/O
Dual/Quad Core Processor
Multi Threading
Efficient use Phenom, not Core i7 CPUIDMP64 - 4 Threads
32 bit and 64 bit compilations, Core i7 faster floating point Whets64MP - 2 Threads
Slow on data streaming to 64 bit integer registers BusMP64 - 2 Threads
Windows slow on writing/reading shared data RandMP64 - 2 Threads
OpenMP Efficient calculations using shared data on dual and quad core processors
Multi-Tasking Quad CPU - nearly four times performance level of one CPU
Disk Drives Partitioned disk C: drive slower than D:
Graphics Integrated graphics can be faster than the GeForce card, 3D slower using Aero
Image Processing Slow test times using Classic Desktop and scrolling exceptionally slow under Aero
Floating Point 64 bit compilations slow on Core 2 Duo and Phenom More slow speed
Benchmark Code Benchmark program source and execution codes


Introduction

All of the programs in The PC Benchmark Collection were run on a new PC with a quad core Phenom II CPU using 64-Bit Windows 7. Some were also run on a Core i7 CPU using the same version of Windows. The systems tested and compilers/assemblers used were:


  AMD Phenom II X4 945 3.0 GHz, Asus M4A785TD-V motherboard, 8GB 1333 MHz DDR3 RAM,
  WD6400AACS 640GB 5400 RPM (Green) SATA-300 disk, 16 MB buffer,  
  nVidia GeForce GTS250 1GB card and on-motherboard Radeon HD 4200 graphics 

  Intel Core i7 860 2.8 GHz, Asus P7P55D motherboard, 8GB 1333 MHz DDR3 RAM
  Intel X25-M 80GB SATA Solid State Drive

  Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
  Microsoft ml64.exe Version 8.00.40310.39
  Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
  Microsoft ml.exe Version 6.15.8803
  Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64   (2009)
  Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for 80x86 (2009)

Some results are also compared with those from a Core 2 Duo using 64-Bit Vista and an Athlon 64 x2 with XP X64. Benchmark results are mainly in terms of MBytes Per Second, Millions of Instructions Per Second (MIPS) or Millions of Floating Point Operations Per Second (MFLOPS).


Configuration Statistics

All the latest benchmarks provide the following system identification details. Variations for an Intel CPU, 32 bit versions of Windows and applications compiled for 32 bits are also shown. The only way to detect 64 bit Windows from a program appears to be via the GetSystemInfo flag PROCESSOR_ARCHITECTURE_AMD64, with 32 bit varieties showing PROCESSOR_ARCHITECTURE_INTEL. It appears that 32 bit applications running via 64 bit Windows can use 4 GB of user virtual space (UVS), compared with 2 GB for 32 bit versions of Windows.
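The difference between a 64 bit and a 32 bit application can be illustrated without the Windows API. The following is a minimal Python sketch, not the benchmark's own code (which is compiled C calling GetSystemInfo). The function names are hypothetical. It reports the pointer width of the running process and expresses the User Virtual Space figures listed in the identification details as powers of two.

```python
import struct

def pointer_bits():
    """Pointer width of the running process: 32 or 64 on common systems."""
    return struct.calcsize("P") * 8

def mb_as_power_of_two(megabytes):
    """Return n where megabytes (of 1024 x 1024 bytes) equals 2**n bytes,
    or None if the size is not an exact power of two."""
    size = megabytes * 1024 * 1024
    if size & (size - 1) != 0:
        return None
    return size.bit_length() - 1

# User Virtual Space values reported in the identification details
for label, mb in (("64 bit application", 8388608),
                  ("32 bit application, 64 Bit Windows", 4096),
                  ("32 bit application, 32 Bit Windows", 2048)):
    print(f"{label}: {mb} MB = 2^{mb_as_power_of_two(mb)} bytes")
```

On this basis the 8388608 MB reported for a 64 bit application is 8192 GB, the figure quoted in the paging section.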



  CPUID and RDTSC Assembly Code
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
  AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow, 
  Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
  AMD64 processor architecture, 4 CPUs           [64 Bit Windows 64 bit application]
  Windows NT  Version 6.1, build 7600,           [Windows 7]
  Memory 8192 MB, Free 7105 MB
  User Virtual Space 8388608 MB, Free 8388483 MB [64 Bit Windows, 64 bit application]

  Memory 7936 MB, Free 6340 MB                   [Motherboard Radeon HD 4200 enabled]

  Windows NT  Version 6.0, build 6000,           [Vista]
  Windows NT  Version 5.2, build 3790            [XP Pro x64]

  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000106E5
  Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz Measured 2809 MHz
  AMD64 processor architecture, 8 CPUs           [8 CPUs as Hyperthreading provided]
  Windows NT  Version 6.1, build 7600, 
  Memory 8183 MB, Free 5857 MB
  User Virtual Space 8388608 MB, Free 8388547 MB

  Intel processor architecture, 4 CPUs           [32 Bit Windows or 32 bit application] 
  Intel processor architecture, 8 CPUs 

  User Virtual Space 4096 MB, Free 4035 MB       [64 Bit Windows, 32 bit application]
  User Virtual Space 2048 MB, Free 2024 MB       [32 Bit Windows, 32 bit application]



Paging and Virtual Memory

Besides the results files indicated below, see also Paging.htm. This includes comparisons, further details and Performance Monitor graphs of disk utilisation, showing significant differences between XP, Vista and Windows 7.

Paging speed in terms of MB/second can be measured using BusSpd2K and IntBurn64 Burn-in/Reliability benchmarks. These have six Write/Read and six read only tests. The benchmarks have been modified to include a Paging Test option that uses a single write/read test. Results are in BusSpd2K Results.htm and, along with graphs of Performance Monitor (Perfmon) logs, in Paging.htm.

As indicated above, 8192 GB of user virtual space is available to a 64 bit application. However, paging test results, provided below, show that far less than this can actually be allocated by the program as a single array. Limits were between 5 and 6 GB using XP x64 and 1 GB RAM, 7.9 GB with 64-Bit Vista and 4 GB RAM, then 14 GB via 64-Bit Windows 7 and 8 GB memory.

With each new version of Windows, writing and reading block sizes, as identified via Perfmon, appear to have increased, producing improved performance. Using XP x64, write/read average block sizes were 64/4 KB. Vista mainly wrote blocks of around 1000 KB, with reading between 16 KB and 64 KB for periods. Windows 7 used similar large block sizes on writing but reading involved mainly 50 KB to 64 KB blocks. Performance was also improved by keeping data in memory, rather than paging out on a First In First Out basis. For a Windows 7 test with an 8 GB array, less than half involved reading from disk.



           64 Bit IntBurn64         64 Bit IntBurn64         64 Bit IntBurn64

      CPU     Athlon 64               Core 2 Duo                 Phenom II
      MHz       2210                     2400                       3000
   RAM MB       1024                     4096                       8192
  Windows      XP x64               64-Bit Vista             64-Bit Windows 7
 Disk W/R
   MB/sec         55                       55                         92

         KB    Secs  MB/sec         KB    Secs  MB/sec         KB    Secs  MB/sec

     800000       1    1976    2000000       5    3056    6000000       3    4051
     900000      58      32    3000000       6    2878    7000000       4    4078
    1000000     128      16    3500000       7    1075    8000000     227      72
    1400000     358       8    4000000     145      56    9000000     697      26
    2000000     683       6    5000000    1040      10   10000000    1231      17
    5000000    1707       6    7900000     771      21   14000000    2742      10

    6000000 Cannot allocate    8000000 Cannot allocate   15000000 Cannot allocate
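The MB/sec column in the table above appears to count both the write and the read pass over the array, with 1 KB taken as 1024 bytes and 1 MB as one million bytes. This is an inference from the published figures, not a statement of how IntBurn64 calculates speed, and the function name is hypothetical. A short Python sketch under that assumption:

```python
def paging_mb_per_sec(kbytes, seconds, passes=2):
    """Estimated paging speed, assuming the array is written once and read once
    (passes=2), with 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes."""
    megabytes = kbytes * 1024 / 1_000_000
    return passes * megabytes / seconds

# Rows taken from the table: (array KB, seconds, reported MB/sec)
rows = [
    (1000000, 128, 16),     # Athlon 64, XP x64
    (4000000, 145, 56),     # Core 2 Duo, 64-Bit Vista
    (8000000, 227, 72),     # Phenom II, 64-Bit Windows 7
    (14000000, 2742, 10),   # Phenom II, 64-Bit Windows 7
]
for kbytes, seconds, reported in rows:
    print(kbytes, seconds, reported, round(paging_mb_per_sec(kbytes, seconds), 1))
```

Rows with run times of only a few seconds are left out, as the whole-second timings are too coarse for a useful comparison.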



Dual/Quad Core Benchmarks

The four multi-threaded benchmarks, with 32 bit and 64 bit varieties, were run on the Phenom to demonstrate dual core CPU performance and quad core with one of them. Multi-tasking tests were also run using four copies of BusSpd2K Reliability Tests and IntBurn64. The 64 bit dual core tests were also run on the Core i7. See DualCore.htm, BurnIn64.htm and BurnIn4CPU.htm.

Core i7 2800 MHz (21x133) CPUs could be expected to run at 3466 MHz (26x133) using Turbo Boost and one CPU, 3333 MHz (25x133) using two CPUs, then 2933 MHz (22x133) with three or four in use. The speeds will be reduced if the CPU heats up too much. For these MP tests, Turbo Boost settings appear to have been upset with unused MHz being reported as 2862 MHz (21.5x133), 3000 MHz (22.5x133) using one or two CPUs and 2467 MHz (18.5x133) with four cores in use, via Turbo Boost.
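The clock speeds quoted above are multiplier times base clock. A minimal Python sketch, assuming a nominal base clock of 133.33 MHz (the text abbreviates this to 133) and using a hypothetical function name:

```python
BASE_CLOCK_MHZ = 400 / 3   # assumed nominal base clock, 133.33 MHz

def core_mhz(multiplier, base_clock_mhz=BASE_CLOCK_MHZ):
    """CPU clock in MHz for a given multiplier."""
    return multiplier * base_clock_mhz

# Multipliers mentioned in the text
for multiplier in (21, 22, 25, 26, 18.5, 21.5, 22.5):
    print(f"{multiplier:>4} x 133 = {core_mhz(multiplier):.0f} MHz")
```

Reported speeds can differ slightly from these nominal values, as they are measured by the system and not calculated.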

CPUIDMP and CPUIDMP64 - The benchmark uses an integer test and a floating point test. They are first executed separately, followed by together in two threads of equal priority and finally with two of each type, where three are at a lower priority. With a quad CPU, performance with four threads should be similar to that of the stand alone runs. Results are in WhatCPU Results.htm.

Core i7 results are slower than expected, probably due to wrong settings for Turbo Boost.


                                 Phenom      Phenom     Core i7
                                3.0 GHz     3.0 GHz     *** GHz
                                 32 bit      64 bit      64 bit    
 
   Separate Tests
   32 bit SSE   MFLOPS           11981       12013       11961
   32 bit Integer MIPS            9015        8279        8734#

   Two Threads
   32 bit SSE   MFLOPS           12012       12042       11966
   32 bit Integer MIPS            9027        8265        8760

   Four Threads
   32 bit SSE   MFLOPS           12005       11996       10349
   32 bit Integer MIPS            9029        8265        7560
   32 bit SSE   MFLOPS           11991       12004        9350
   32 bit Integer MIPS            8270        9030        7085

 *** Estimate 3.0 GHz one or two CPUs in use, 2.467 GHz four CPUs
     Could be 3.47 GHz and 2.93 GHz with four CPUs using Turbo Boost

 # 2.4 GHz Core 2 Duo obtains 6940 MIPS,    8734/6940*2.4 = 3.02 GHz
 Original CPUID - 10202 MIPS i7, 7040 C2D, 10202/7040*2.4 = 3.48 GHz
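The clock speed estimates in the footnotes above scale a known reference clock by the ratio of benchmark results, which assumes that the two CPUs execute the test at the same speed per MHz. A small Python sketch of that arithmetic, with a hypothetical function name:

```python
def estimate_ghz(result, reference_result, reference_ghz):
    """Estimate clock speed by scaling a reference CPU's GHz by the result ratio.
    Assumes both CPUs achieve the same result per MHz on this test."""
    return result / reference_result * reference_ghz

# Core i7 against the 2.4 GHz Core 2 Duo, figures from the footnotes
print(round(estimate_ghz(8734, 6940, 2.4), 2))    # CPUIDMP64 integer MIPS
print(round(estimate_ghz(10202, 7040, 2.4), 2))   # original CPUID benchmark
```

The same calculation is used for the later BusMP64 and OpenMP estimates.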




Whets32MP and Whets64MP - The Whetstone Benchmark has various routines that execute floating point and integer instructions. In the MP version, the benchmark is run in the main thread and another copy of each routine in a low priority second thread which should mainly run at the same speed with two CPUs. Results are in Whetstone Results.htm.

The 64 bit compilation could be expected to be faster than the 32 bit version, as more registers are available for optimisation. Although the Core i7 is probably running at the wrong but similar GHz, floating point performance (with this program) is superior to the Phenom.


 Phenom 3.0 GHz 32 bit compilation

 MFLOPS    Vax  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Gmean   MIPS            1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

   1675  25786   6353   1818   1780   1453    138     95   4986   5385   5109
  Thread 1               900    870    721     69     47   2512   2898   3995
  Thread 2               918    910    732     68     48   2473   2487   1113

 Phenom 3.0 GHz 64 bit compilation

   1486  34935   6892   1808   1451   1252    199     93   4964   5804  11837
  Thread 1               900    724    751    100     46   2482   2893  10625
  Thread 2               908    727    501     99     47   2483   2912   1211

 Core i7 at 3.0 GHz? 64 bit compilation

   1858  32842   8287   2191   1994   1469    268    107   5427   2958  17649
  Thread 1              1095    998    780    134     53   2704   1477  16537
  Thread 2              1096    996    689    134     54   2723   1481   1113




BusMP and BusMP64 Two Thread Tests - These run a series of tests to measure performance via caches and RAM, firstly as a single thread and secondly using two threads accessing different data arrays. The tests are based on those in BusSpd2K and results are in BusSpeed2K Results.htm. To indicate bus burst reading speeds, the tests start with reading one word with address increments of 32 words (128 bytes at 32 bits and 256 bytes at 64 bits), then with decreasing increments until all data is read. The last test uses 128 bit SSE2 instructions. The data is read using a sequence of 64 AND instructions to one CPU register, repeated numerous times without programmed interference. Results below are for reading a word with 64 Byte address increments plus reading all data to integer and SSE2 registers.
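The access pattern described above can be sketched in Python. This only illustrates the order in which words are read, not the speed, as the benchmark itself uses unrolled sequences of 64 AND instructions in compiled code. The function names are hypothetical and the halving of the increment at each step is an assumption, as the text only says that the increments decrease.

```python
def strided_and(words, increment):
    """AND every increment-th 32 bit word into a single accumulator."""
    accumulator = 0xFFFFFFFF
    for i in range(0, len(words), increment):
        accumulator &= words[i]
    return accumulator

def increments(start=32):
    """Address increments in words, from start down to 1 (halving is assumed)."""
    increment = start
    while increment >= 1:
        yield increment
        increment //= 2

words = [0xFFFFFFFF] * 1024
words[3] = 0xFFFFFFF0          # only reached when every word is read
for increment in increments():
    print(f"increment {increment:2d} words ({increment * 4:3d} bytes): "
          f"{strided_and(words, increment):08X}")
```

With 32 bit words, an increment of 16 words corresponds to the 64 Byte address increment (Inc64B) reported in the results.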

On running two copies of BusSpd2K at the same time, with one or two CPUs, there is not the unexpected variation shown below. The MP test instructions comprise ANDing to a single register. As shown in WhatCPU Results.htm, in this case, maximum speed is likely to be 24000 MB/second at 64 bits (8 x MHz) and 12000 MB/second at 32 bits. L1 cache results using one CPU produce that level of performance but might not using two threads, where, sometimes, two processors transfer data slower than one. Part of the difference is that the two threads do not finish at the same time but it does seem that Windows is interfering with the data flow and timing is more critical using 64 bit integers.

For RAM, assuming 64 Byte burst reading, expected maximum MB/second would be sixteen times Inc64B speed at 32 bits and eight times at 64 bits and is confirmed here. With bus or RAM speed limitations, throughput using two CPUs would be no greater than with one, but that is not the case for these results. Core i7 DDR3 RAM is shown to be faster than with a Phenom II, but the latter catches up somewhat using two threads.
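The two estimates used in this discussion are simple multiplications. The Python sketch below uses hypothetical function names. The first function assumes one word is transferred to a register per clock cycle, and the second assumes that each word read at a 64 Byte increment brings in a complete 64 Byte burst.

```python
def peak_register_mb_per_sec(cpu_mhz, word_bytes):
    """Maximum MB/second when one word is ANDed to a register per clock cycle."""
    return cpu_mhz * word_bytes

def burst_estimate_mb_per_sec(inc64b_mb_per_sec, word_bytes, burst_bytes=64):
    """Expected maximum MB/second if every word read at 64 Byte increments
    transfers a full burst of burst_bytes."""
    return inc64b_mb_per_sec * burst_bytes / word_bytes

print(peak_register_mb_per_sec(3000, 8))     # 64 bit words at 3000 MHz
print(peak_register_mb_per_sec(3000, 4))     # 32 bit words at 3000 MHz
print(burst_estimate_mb_per_sec(454, 4))     # Phenom RAM, 32 bit, Inc64B 454
print(burst_estimate_mb_per_sec(897, 8))     # Phenom RAM, 64 bit, Inc64B 897
```

The Phenom RAM estimates of 7264 and 7176 MB/second compare with measured one CPU SSE2 speeds of 7289 and 7372 MB/second in the results table.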

Inc64B speeds using L2 and L3 caches show that data transfer also involves burst reading, but addressing increments are different on the two systems. Reading speed of all data from L2 and L3 caches is similar to that using L1 on the i7 but L3 is slower with the Phenom.

Estimated Core i7 CPU GHz using Turbo Boost is shown below.



                             Speed In MBytes/Second

           Phenom 3 GHz DDR3       Phenom 3 GHz DDR3       Core i7 3? GHz DDR3
           32 Bit                  64 Bit                  64 Bit
    CPUs   Inc64B  RdAll   SSE2    Inc64B  RdAll   SSE2    Inc64B  RdAll   SSE2
 L1 6KB  
      1     14041  13617  23847     23804  24340  23800     20910  21836  23732
      2     18754  25133  47214     22114  29916  47175     15019  25961  47356
      %       134    185    198        93    123    198        72    119    200

 L2 96KB
      1      1496  12878  23879      2986  21594  23822      5997  19282  23945
      2      2974  25520  47516      5676  27000  47542      9708  25461  47521
      %       199    198    199       190    125    200       162    132    198

 L3
      1       841  10107  13311      1499  11256  11967      4333  18279  22468
      2      1492  18736  25895      2788  15690  22640      6919  23999  43022
      %       177    185    195       186    139    189       160    131    191

 RAM 128MB
      1       454   5212   7289       897   5792   7372      1657  10108  10953
      2       760   8959  12146      1477   8734  12124      1968  13076  15560
      %       167    172    167       165    151    164       119    129    142

 i7/C2D L1 RdAll 21836/17449*2.4 = 3.0 GHz, BusSpd2K 32b 13742/9252*2.4 = 3.56 GHz



RandMP32 and RandMP64 Two Thread Tests - The program uses the same code for serial and random use via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM, firstly as a single thread and secondly using two threads. The 64 bit compilation uses the same 32 bit integer arrays as the 32 bit version and resultant speeds are generally the same. For this benchmark, the two threads share the same data array but one starts half way through. Examples of MB/second results from Randmem Results.htm are below.
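The arrangement described above, where the same code serves serial and random access through an index and two threads share one array with the second starting half way through, can be sketched as follows. This is a Python illustration of the structure only, not the benchmark's code. All names are hypothetical, and Python threads will not show the speeds discussed here.

```python
import random
import threading

def make_index(n, randomise, seed=1):
    """Index array: 0..n-1 in order for serial access, shuffled for random."""
    index = list(range(n))
    if randomise:
        random.Random(seed).shuffle(index)
    return index

def read_pass(data, index, start, results, slot):
    """Read every element once via the index, beginning at position start."""
    n = len(index)
    total = 0
    for k in range(n):
        total += data[index[(start + k) % n]]
    results[slot] = total

def run_two_threads(data, index):
    """Two threads share the same data array; the second starts half way through."""
    results = [None, None]
    starts = (0, len(index) // 2)
    threads = [threading.Thread(target=read_pass,
                                args=(data, index, start, results, slot))
               for slot, start in enumerate(starts)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

data = list(range(100000))
print(run_two_threads(data, make_index(len(data), randomise=False)))
print(run_two_threads(data, make_index(len(data), randomise=True)))
```

Both threads visit every element, so each returns the same total whichever index is used.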

The main observation is the reduction in throughput in Read/Write tests with cache based data when using two CPUs, particularly for random access. Here, Windows will be updating data in RAM to maintain integrity (a cache killer). For these tests, performance degradation is much lower for Core i7. This CPU, again, appears to be running at a Turbo Boosted speed of 3 GHz where some single processor tests produce similar performance to the Phenom. Some of the i7 RAM speeds are noticeably faster.



                  Speed In MBytes/Second - 64 Bit Version, 32 Bit Integers

       Phenom II 3.0 GHz, 1333 MHz DDR3 RAM       Core i7 3? GHz, 1333 MHz DDR3 RAM
       64-Bit Windows 7                           64-Bit Windows 7

       Serial                                     Serial

       L1 Cache      L2 Cache      RAM            L1 Cache      L2 Cache      RAM
 CPUs  RD     RW     RD     RW     RD     RW      RD     RW     RD     RW     RD     RW

  1  15853   5572  12645   5543   4462   3532   11797   4769  10645   4333   8134   3884
  2  30818   5043  25567   5796   7254   6153   23509   5891  21010   6621  14410   7645
  %    194     91    202    105    163    174     199    124    197    153    177    197

       Random                                     Random

  1  15116   5666   7409   4991    607    522   11759   4777   6282   3699    983    952
  2  30224   1382  14718   1616   1044    974   23588   3061  12602   4454   2030   1912
  %    200     24    199     32    172    187     201     64    201    120    207    201




OpenMP Benchmark

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers. For further detail and results see OpenMP MFLOPS.htm. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words.
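The expression quoted above contains eight floating point operations (four additions, one subtraction and three multiplications), so it corresponds to the 8 operations per word case. The Python sketch below shows the calculation and how an MFLOPS figure follows from the operation count. It is not the benchmark code, which is compiled C using OpenMP. The function names, constant values and timing figures are hypothetical.

```python
def update(x, a, b, c, d, e, f):
    """One pass of the quoted expression for a single word."""
    return (x + a) * b - (x + c) * d + (x + e) * f

OPS_PER_WORD = 8   # 4 additions, 1 subtraction, 3 multiplications

def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second."""
    return words * ops_per_word * passes / seconds / 1_000_000

def data_kbytes(words, bytes_per_word=4):
    """Array size in KB, taking 1 KB as 1000 bytes."""
    return words * bytes_per_word / 1000

print(update(1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0))    # illustrative constants
print(mflops(1_000_000, OPS_PER_WORD, 100, 0.5))     # illustrative timing
print(data_kbytes(100_000))                          # smallest array size
```

The smallest array of 0.1 million words occupies 400 KB, the size referred to in the discussion of cache limits.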

As with later Floating Point tests, the compiler is not very efficient in handling SSE instructions, where one 3.0 GHz CPU could achieve up to 24000 MFLOPS but can obtain less than 4000. The smallest data size of 400 KB is too large for L1 caches and, with two operations per word, cache speed can be the limiting factor. Larger data sizes are likely to depend on shared cache or RAM speeds. With 32 operations per word, performance is likely to be governed by CPU speed.

The benchmark demonstrates that OpenMP declarations can produce a throughput improvement of greater than 1.9 times on dual core systems and more than 3.9 times with a quad core processor. The Core i7 quad core performance is much better than CPUIDMP64 above and this might be influenced by Hyperthreading where eight CPUs are identified (see Identification).



                     64 Bit OpenMP Benchmark MFLOPS

                 Athlon 64 x2 2.2 GHz       Core 2 Duo 2.4 GHz
            
   Data   Ops/   SSE 64b SSE 64b  Gain   SSE 64b SSE 64b  Gain
   Words  Word     1 CPU   2 CPU           1 CPU   2 CPU

   100000    2     1114    2015    1.8      1594    2573   1.6
  1000000    2      638     817    1.3      1577    2589   1.6
 10000000    2      636     831    1.3      1160    1203   1.0

   100000    8     1942    3783    1.9      3423    6404   1.9
  1000000    8     1692    3058    1.8      3363    5956   1.8
 10000000    8     1706    3110    1.8      3301    4221   1.3

   100000   32     1731    3254    1.9      3526    5376   1.5
  1000000   32     1793    3369    1.9      3538    4230   1.2
 10000000   32     1774    3443    1.9      3523    6748   1.9


                 Phenom II 3.0 GHz         Core i7 ### GHz 
 
   Data   Ops/   SSE 64b SSE 64b  Gain   SSE 64b SSE 64b  Gain
   Words  Word     1 CPU   4 CPU           1 CPU   4 CPU

   100000    2      1822    5613   3.1      1661    4263   2.6
  1000000    2      1870    7056   3.8      1922    5142   2.7
 10000000    2      1563    2972   1.9      1824    3838   2.1

   100000    8      3637   12653   3.5      3939   13804   3.5
  1000000    8      3709   14518   3.9      4251   18082   4.3
 10000000    8      3543   11273   3.2      4133   15079   3.6

   100000   32      3652   14092   3.9      4438   16299   3.7
  1000000   32      3663   14510   4.0      4512   18081   4.0
 10000000   32      3633   14034   3.9      4493   17752   4.0

    ### Core i7 is rated as 2.8 GHz but probably running at 
        3.0 GHz using Turbo Boost (4493 / 3523 x 2.4 = 3.0)
 



Multi-Tasking Benchmarks

The multi-threading benchmarks demonstrate some performance limitations on using cached data. As shown in DualCore.htm, BurnIn64.htm and BurnIn4CPU.htm, multiple copies of IntBurn64 and SSEBurn64 burn-in programs can be used to demonstrate multi-tasking performance. Results on the Phenom based PC are shown below for SSEBurn64. This has three different types of tests - CPU registers only, L1 Cached data and Memory. For the latter, data size is selected to test caches or RAM. All tests can be run using either 32 bit or 64 bit floating point numbers. Figures below are reading/execution speeds. SSE MB/second are 4 x MFLOPS and SSE2 MB/second are 8 x MFLOPS.

SSE registers are 128 bits wide, holding four 32 bit SSE words or two 64 bit SSE2 words, and arithmetic instructions can manipulate all the data at the same time, producing 12000 SSE or 6000 SSE2 MFLOPS with a 3 GHz processor. Instructions such as add and multiply can be linked to produce up to eight results per clock cycle with SSE. In this case, 5.6 results per cycle are demonstrated with the cache test. The CPU and RAM tests obtain around 12000 MFLOPS, the same as CPUIDMP above.
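The figures in this paragraph follow from the register width and the clock speed. A short Python sketch of the arithmetic, with hypothetical function names, where the Cache SSE figure of 16802 MFLOPS is the one CPU result from the table:

```python
def words_per_register(word_bits, register_bits=128):
    """Number of floating point words held in one SSE register."""
    return register_bits // word_bits

def peak_mflops(cpu_mhz, word_bits, instructions_per_cycle=1):
    """MFLOPS when each instruction processes a full register every cycle."""
    return cpu_mhz * words_per_register(word_bits) * instructions_per_cycle

def results_per_cycle(mflops, cpu_mhz):
    """Floating point results produced per clock cycle."""
    return mflops / cpu_mhz

print(peak_mflops(3000, 32))                    # SSE, 32 bit words
print(peak_mflops(3000, 64))                    # SSE2, 64 bit words
print(peak_mflops(3000, 32, 2))                 # linked add and multiply
print(round(results_per_cycle(16802, 3000), 1)) # Cache SSE test, one CPU
```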

With four programs running, each operates at approximately the same performance level. With CPU, cache and memory L1/L2 cache tests, performance gains are greater than 3.9 times. The Phenom has 6 MB shared L3 cache and using 1 MB per CPU can be quite efficient. Using four CPUs can more than double memory throughput.




              Phenom II 3.0 GHz, 1333 MHz DDR3 RAM, 64-Bit Windows 7 

  Test           1 CPU   Copy1   Copy2   Copy3   Copy4   Total   Gain 
                MFLOPS  MFLOPS  MFLOPS  MFLOPS  MFLOPS  MFLOPS        

  CPU SSE        12022   11931   11901   11867   12007   47706   3.97

  Cache SSE      16802   16478   16466   16381   16410   65735   3.91

  Cache SSE2      8258    8090    8107    8166    8103   32465   3.93

 

  Memory         1 CPU   Copy1   Copy2   Copy3   Copy4   Total   Gain 
  SSE Test      MB/sec  MB/sec  MB/sec  MB/sec  MB/sec  MB/sec 
  
  L1 32 KB       47484   47109   46907   47553   47222  188791   3.98

  L2 256 KB      23919   23577   23907   23807   23690   94980   3.97

  L3 1024 KB     11250    9171    9225    9264    9224   36884   3.28

  RAM 64 MB       7041    3708    3796    3756    3730   14990   2.13
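The Total and Gain columns above are the sum of the four concurrent copies and that sum divided by the single copy result. A Python sketch using three rows from the table, with a hypothetical function name:

```python
def total_and_gain(single, copies):
    """Combined throughput of concurrent copies and gain over one copy."""
    total = sum(copies)
    return total, total / single

rows = {
    "CPU SSE":    (12022, [11931, 11901, 11867, 12007]),
    "L3 1024 KB": (11250, [9171, 9225, 9264, 9224]),
    "RAM 64 MB":  (7041,  [3708, 3796, 3756, 3730]),
}
for name, (single, copies) in rows.items():
    total, gain = total_and_gain(single, copies)
    print(f"{name:12s} total {total:6d}  gain {gain:.2f}")
```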




Disks

The first two partitions (C: and D:) on the Phenom based PC disk drive are each 224 GB. As expected, maximum data transfer speed is higher using the first partition, writing/reading at 94/104 MB/second, compared with 90/95 MB/second on D:. As with earlier systems, this is not the case writing and reading small files. Below are results from the CDDVDSpd benchmark that writes and reads 520 small files of a chosen size. Here, AVAST anti-virus is enabled for both, where this is also known to reduce performance with small files. For more details see DiskGraf Results.htm and CDDVDSpd Results.htm.

Tests were run using real data via copying and pasting a folder containing downloaded HTML documents with lots of tiny GIF files - 18.1 MB, 28.4 MB on disk, 3638 files in 277 folders. Example copying times were 43 seconds on C: and 8 seconds on D:.



       C: Partition                D: Partition
      
                      Per File                     Per File
  KB  Write   Read  Write   Read   Write   Read  Write   Read
       MB/s   MB/s  msecs  msecs    MB/s   MB/s  msecs  msecs

   2   0.37   0.32    5.4    6.3    2.57   4.12    0.8    0.5
   4   0.95   0.64    4.2    6.3    8.10  10.18    0.5    0.4
   8   1.61   1.10    5.0    7.3   15.64  16.02    0.5    0.5
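The per file times in the table are consistent with dividing the file size by the transfer speed, taking 1 MB as 1000 KB. The Python sketch below shows that relationship and applies the same idea to the folder copying example of 3638 files. The function names are hypothetical.

```python
def msecs_per_file(file_kb, mb_per_sec):
    """Milliseconds per file: KB divided by MB/second, with 1 MB = 1000 KB."""
    return file_kb / mb_per_sec

def copy_msecs_per_file(total_seconds, files):
    """Average milliseconds per file for a timed folder copy."""
    return total_seconds * 1000 / files

print(round(msecs_per_file(2, 0.37), 2))        # C: write, 2 KB files
print(round(msecs_per_file(4, 10.18), 2))       # D: read, 4 KB files
print(round(copy_msecs_per_file(43, 3638), 1))  # folder copy on C:
print(round(copy_msecs_per_file(8, 3638), 1))   # folder copy on D:
```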



Graphics

My Windows graphics benchmarks, compiled for 32 bit and 64 bit working, are available. Further details and results are provided in Direct3D Results.htm, DirectDraw Results.htm, OpenGL Results.htm and VideoWin Results.htm. Performance via 64 bit and 32 bit versions was mainly the same. Below are results of the DirectX9 benchmark showing the impact of using Aero desktop, the effect of large textures and the difference between the motherboard based graphics and a moderately fast graphics card.

The Windows benchmark (VideoWin) showed that the on-board integrated graphics can be faster than the GeForce card but this is far from the case with DirectX9 3D graphics. At the lowest pixel size settings, where CPU speed has more effect, the results using the card are 1.9 to 6.2 times faster, increasing to 5.4 to 8.2 times using the highest number of pixels.
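The range quoted for the lowest pixel size can be reproduced from the 640 x 480 rows of the Aero results for the two graphics devices. The Python sketch below leaves out the WireEgg test, which is limited to the display refresh rate by Vsync on both devices. That exclusion is an assumption about how the range was derived.

```python
tests   = ["Shaded Egg", "500 Cubes", "Texture Tunnel", "Colour Objects",
           "Texture Objects", "Pixel Shader2", "Vertex Shader2"]
radeon  = [1674.0,  75.9,  471.5,  866.0,  454.5,  451.6,  668.4]  # Radeon HD 4200
geforce = [9079.0, 396.3, 2919.4, 2266.4, 1496.9, 1400.0, 1293.1]  # GeForce GTS 250

# Frames per second on the card divided by frames per second on-board
ratios = [card / onboard for card, onboard in zip(geforce, radeon)]
for name, ratio in zip(tests, ratios):
    print(f"{name:16s} {ratio:4.1f} times faster")
print(f"range {min(ratios):.1f} to {max(ratios):.1f}")
```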

As for 64-Bit Vista (see Vista64.htm), speeds with Aero selected are slower than using Classic Desktop, in this case by an average of around 10%. This might be influenced by the benchmark running in windowed mode, and not as a dedicated full screen application.

Some tests were run where objects were textured using 1024 x 1024 (1M pixel) JPG images (tests 4, 6, 7, 8). These textures are rather large but might be needed for close ups of moving objects. Pix1 results use the same file for all five textures, whereas Pix2 uses a different file for each, which does not reduce performance much further. The large textures can reduce speed by more than 50% using the GeForce card but much less via the on-board Radeon graphics, where CPU time is more significant.

Further tests were run using 10M pixel textures (for close ups of part of an object?), resulting in much slower performance. In this case, there was around a 10 second delay with each texture test before movement started, due to a non-timed pass used to calibrate running time parameters. During this delay, CPU utilisation was 25% or 100% of one CPU. When the textured images were being displayed, utilisation was in the range 52% to 84% of one CPU, indicating that graphics hardware speed limits performance.




  Aero ATI Radeon HD 4200

      DirectX9 D3D Test 32 Bit Version 1.1, Sat Nov 21 14:05:59 2009
                      Copyright (C) 2006, Roy Longbottom

               ..................... Frames Per Second ......................
 Resolution    Shaded WireEgg     500 Texture  Colour Texture   Pixel  Vertex
                  Egg   Vsync   Cubes  Tunnel Objects Objects Shader2 Shader2

  640  480 32  1674.0    60.0    75.9   471.5   866.0   454.5   451.6   668.4 
  800  600 32  1083.8    60.0    58.6   313.6   602.9   312.3   310.3   458.9 
 1024  768 32   651.4    60.0    43.0   205.4   386.2   203.9   203.3   299.5 
 1280 1024 32   363.9    60.0    27.7   126.5   214.8   112.1   113.1   169.6 
 1280 1M pix1   371.9    60.0    27.8    80.9   206.9    82.7    80.2   127.9 


 Aero GeForce GTS 250

      DirectX9 D3D Test 64 Bit Version 1.1, Thu Nov 19 14:27:42 2009
                      Copyright (C) 2006, Roy Longbottom

               ..................... Frames Per Second ......................
 Resolution    Shaded WireEgg     500 Texture  Colour Texture   Pixel  Vertex
                  Egg   Vsync   Cubes  Tunnel Objects Objects Shader2 Shader2

  640  480 32  9079.0    60.6   396.3  2919.4  2266.4  1496.9  1400.0  1293.1 
  800  600 32  6368.3    60.6   326.9  2247.2  2001.7  1408.7  1287.7  1259.6 
 1024  768 32  4962.9    60.6   257.6  1609.4  2163.5  1363.7  1251.9  1346.0 
 1280 1024 32  3065.1    60.6   188.5  1093.2  1551.2   908.4   862.0  1167.6 
 1280 1M pix   3057.5    60.6   189.7   498.2  1547.2   449.1   439.3   734.0
 1680 1050 32  2422.1    60.6   181.4   876.3  1329.5   806.8   771.8  1046.4 
 1680 1M pix1  2409.7    60.6   181.2   402.1  1324.0   408.1   400.4   668.9 
 1680 1M pix2  2360.7    60.5   180.6   401.0  1302.7   402.1   397.3   658.2 
 1680 10M pix  2358.9    60.6   180.6   154.6  1300.1   165.2   160.0   306.7 

 10M pix +10 secs each textured tests for untimed first Render()


 GeForce GTS 250 and Radeon HD 4200 Dual Monitor
 
 2960 1000 32  1470.0    30.0   171.7   606.8   966.5   662.2   632.3   787.0  


 Classic GeForce GTS 250

               ..................... Frames Per Second ......................
 Resolution    Shaded WireEgg     500 Texture  Colour Texture   Pixel  Vertex
                  Egg   Vsync   Cubes  Tunnel Objects Objects Shader2 Shader2

  640  480 32  9544.8    60.6   425.4  3164.5  2398.7  1618.5  1501.4  1339.3 
  800  600 32  7801.6    60.6   354.4  2446.3  2366.6  1626.9  1513.1  1365.9 
 1024  768 32  5244.4    60.6   282.5  1728.3  2319.5  1521.1  1358.8  1343.3 
 1280 1024 32  3291.5    60.5   214.8  1188.4  1664.8  1004.4   948.6  1246.8 
 1280 1M pix1  3292.2    60.6   214.2   538.5  1665.4   485.7   478.4   803.9  
 1680 1050 32  2633.7    60.6   206.5   955.0  1435.4   895.4   855.9  1143.6 



Image Processing Benchmark

The BMPSpeed benchmark measures performance using image files increasing in size from 0.5 MB to 512 MB. Tests run are Enlarge/Edit, Save BMP file, Load BMP file, Scroll and Rotate. Further details can be found in BMPSpeed Results.htm. Below are results from the 64 bit version, excluding saving and loading speeds, which were similar to those provided by disk benchmarks. As for graphics, the impact of using Aero desktop and the difference between the motherboard based graphics and the fast graphics card are shown. Earlier results for a Core 2 Duo based PC, using 64-Bit Vista, are provided for comparison purposes.

The benchmark uses fast BitBlt copying when permitted and a slower byte based method when not. With 32 bit Windows, the former is used up to image size of 64 MB. Using 64 bit Windows, no limit is seen. Maximum memory demand within user space is 1.1 GB but the BitBlt method creates 4 bytes per pixel bitmaps outside this area, increasing memory demands up to 2.3 GB.

Performance Monitor (Perfmon) shows that Enlarge and Rotate use 100% utilisation of one CPU. This leads to similar performance using motherboard integrated graphics and the much faster video card. Besides image size, performance depends on whether the data is in caches, RAM or even paged out to disk, when insufficient RAM is available. In this case, caches will only help for the smaller images. Perfmon also shows a linear increase in memory demands according to image size, with no noticeable disk activity. In turn, this produces a fairly linear increase in the enlarge test times. Here, the Core 2 Duo is faster on the small images, probably due to cache speed, but the Phenom excels with large images with its DDR3 memory. In rotating the larger images, it is not clear why measured times are non-linear, why the Core 2 Duo is faster or why the times for the larger two images are similar. This test involves copying bytes from array rows to columns and might involve cache flushing or be affected by reading and writing data in 64 byte bursts.
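The rotate test is described as copying bytes from array rows to columns. The Python sketch below illustrates that kind of copy for a clockwise quarter turn. It is not BMPSpeed's code. It assumes one byte per pixel for brevity and the function name is hypothetical. Source bytes are read sequentially along a row but written at intervals of the image height, which is the type of access pattern where cache behaviour and 64 byte bursts could matter.

```python
def rotate90_clockwise(src, width, height):
    """Rotate a width x height image (one byte per pixel) by 90 degrees.
    Each source row becomes a destination column."""
    dst = bytearray(width * height)
    for y in range(height):
        for x in range(width):
            # Destination image is height pixels wide and width pixels high
            dst[x * height + (height - 1 - y)] = src[y * width + x]
    return dst

# 3 x 2 image: rows (1, 2, 3) and (4, 5, 6)
image = bytes([1, 2, 3,
               4, 5, 6])
print(list(rotate90_clockwise(image, 3, 2)))
```

The result is a 2 x 3 image with rows (4, 1), (5, 2) and (6, 3).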

On the Core 2 Duo with Vista, speed using Classic Desktop settings is slightly slower than with Aero enabled. Under Windows 7 on the Phenom, Classic Desktop enlarge and rotate speeds are significantly slower with both the nVidia and ATI graphics.

The speed on scrolling is expected to be similar to results shown for Classic Desktop and Aero on the Core 2 Duo, where smaller images are cached in video RAM (or RAM used for display purposes). Slower speed for non-cached data represents RAM/bus speeds, clearly faster on the Phenom based system. The glaring anomaly is the slow scrolling using Aero via Windows 7.




     Phenom II 3.0 GHz, 512 KB L2 cache, 6 MB L3 cache, 
     GeForce GTS 250, 64-Bit Windows 7, 1680 x 1050 x 32 bits

             Aero                        Classic
      MBytes Enlarge  Rotate  Scroll     Enlarge  Rotate  Scroll
                Secs    Secs  MB/Sec        Secs    Secs  MB/Sec

         0.5    0.04    0.03    1375        0.08    0.07    4071
           1    0.07    0.04    1612        0.14    0.10    4399
           2    0.08    0.05    1882        0.17    0.15    4173
           4    0.11    0.08    2251        0.23    0.21    2703
           8    0.11    0.13    2575        0.30    0.32    2690
          16    0.18    0.22    2553        0.44    0.49    2643
          32    0.22    0.37    2543        0.58    0.77    2641
          64    0.35    0.69    2551        0.84    1.23    2629
         128    0.54    1.52    2554        1.27    2.34    2614
         256    0.93    7.18    2554        1.98    8.28    2638
         512    1.74    7.87    2546        3.18    9.42    2605


     Phenom II 3.0 GHz, Radeon HD 4200, 64-Bit Windows 7  
     1280 x 1024 x 32 bits

             Aero                        Classic
      MBytes Enlarge  Rotate  Scroll     Enlarge  Rotate  Scroll
                Secs    Secs  MB/Sec        Secs    Secs  MB/Sec

         0.5    0.05    0.03    2503        0.11    0.10    4015
           1    0.07    0.04    2332        0.16    0.14    4356
           2    0.10    0.05    2413        0.23    0.20    4262
           4    0.09    0.08    2460        0.29    0.28    2795
           8    0.11    0.13    2389        0.38    0.41    2754
          16    0.17    0.22    2366        0.52    0.61    2738
          32    0.21    0.38    2367        0.76    0.95    2720
          64    0.33    0.70    2383        1.09    1.48    2731
         128    0.55    1.54    2380        1.62    2.72    2674
         256    0.95    7.29    2370        2.41    8.81    2676
         512    1.74    7.89    2367        3.97   10.13    2650


     Core 2 Duo 2.4 GHz, 4 MB L2 cache
     GeForce 8600 GT, 64-Bit Vista, 1280 x 1024 x 32 bits

             Aero                        Classic
      MBytes Enlarge  Rotate  Scroll     Enlarge  Rotate  Scroll
                Secs    Secs  MB/Sec        Secs    Secs  MB/Sec

         0.5    0.03    0.02    4510        0.03    0.03    4541
           1    0.06    0.03    3881        0.04    0.05    4397
           2    0.05    0.05    2184        0.06    0.06    2083
           4    0.07    0.09    1711        0.07    0.08    1630
           8    0.09    0.12    1623        0.11    0.13    1548
          16    0.14    0.20    1621        0.16    0.20    1530
          32    0.21    0.33    1611        0.25    0.33    1511
          64    0.37    0.56    1604        0.42    0.63    1524
         128    0.68    1.12    1599        0.74    1.10    1504
         256    1.27    4.28    1583        1.35    4.36    1470
         512    2.41    5.89    1500        2.94    5.91    1388



Floating Point

My original benchmarks that measure CPU floating point performance were converted and compiled to run as 64 bit programs. These have to use SSE and SSE2 floating point instructions instead of the old x87 instructions. Other versions were generated as 32 bit programs, with options set to use SSE and SSE2. They were first produced to run on an Athlon 64 X2 CPU via Windows XP Pro x64, and all ran successfully on a Core 2 Duo processor using 64-Bit Vista and now also on a Phenom II via Windows 7.

When using assembly code, SSE and SSE2 floating point instructions are shown to be much faster than the old x87 variety. Unfortunately, the compiler used did not generate Single Instruction Multiple Data (SIMD) instructions properly, only using one variable per register - Single Instruction Single Data (SISD). The result is that maximum speed could be expected to be halved on 64 bit double precision SSE2 operations and reduced to a quarter using 32 bit single precision SSE instructions. In some cases, this leads to programs using the old x87 floating point instructions being faster than the SSE/SSE2 varieties.
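The factors of two and four follow from the 128 bit width of the SSE/SSE2 registers, as the short C sketch below shows. The function names are hypothetical and the calculation only covers theoretical peak rates, not measured results.

```c
#include <assert.h>

/* Width of an SSE/SSE2 XMM register in bits. */
#define XMM_BITS 128

/* Number of values one packed (SIMD) instruction can process. */
int simd_lanes(int element_bits)
{
    assert(element_bits > 0 && XMM_BITS % element_bits == 0);
    return XMM_BITS / element_bits;
}

/* Hypothetical helper: expected peak rate when the compiler uses only
   one value per register (SISD), given the peak for packed code. */
double scalar_peak_mflops(double packed_peak_mflops, int element_bits)
{
    return packed_peak_mflops / simd_lanes(element_bits);
}
```

Here simd_lanes(64) gives 2 for double precision and simd_lanes(32) gives 4 for single precision, so scalar code can at best reach a half or a quarter of the packed rate.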

The major surprise is that Core 2 Duo and Phenom demonstrate particularly slow performance on some 64 bit compilations that produce SSE2 instructions, where the Athlon 64 could be up to twice as fast. These slow results came from the original 2006 versions and were corrected by recompiling with a later compiler in 2009. Below are results for the Linpack and Livermore Loops benchmarks - see Linpack Results.htm and Livermore Loops Results.htm.

For other benchmarks using floating point see WhatCPU Results.htm, FFTGraf Results.htm, Whetstone Results.htm, SSE3DNow Results.htm and MemSpd2K Results.htm.

Core i7 results are included below for original benchmarks. In this case, the CPU appears to be running at the Turbo Boost speed of 3466 MHz.



                     Linpack Benchmark - Results in MFLOPS

                                 64 Bit       32 Bit      Original

  Core 2 Duo 2400 MHz, Vista       823         1480         1315
  Core 2 Duo 2009 compilation     1602

  Athlon 64 2210 MHz, XP x64      1044         1014          838
  Athlon 64 2009 compilation      1091

  Phenom II 3000 MHz, Win7         850         1713         1413
  Phenom II 2009 compilation      1905
  
  Core i7 3466 TB MHz, Win7                                 2004

 
              Livermore Loops Benchmark - Results in MFLOPS

                    64 Bit              32 Bit             Original
                Max  Mean   Min     Max  Mean   Min     Max  Mean   Min

  Core 2 Duo   1175   537   227    2526   804   195    2236   539    52
  2009 comp    2799   893   261

  Athlon 64    2284   661   166    1908   641   162    2566   461    48
  2009 comp    2068   679   171

  Phenom II    1314   625   185    3010   936   209    3893   644    64
  2009 comp    3541  1023   206

  Core i7                                              3147   828    76
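The effect of the 2009 recompilation can be expressed as a ratio of the 64 bit results in the tables above. The C sketch below is a worked example only; the function name speedup is hypothetical and the input figures are taken from the Linpack table.

```c
#include <assert.h>

/* Ratio of the 2009 compilation result to the original 64 bit result,
   both in MFLOPS. A value near 1.0 means the recompilation made
   little difference. */
double speedup(double mflops_2009, double mflops_original)
{
    assert(mflops_original > 0.0);
    return mflops_2009 / mflops_original;
}
```

Using the Linpack figures, speedup(1602, 823) is about 1.95 for the Core 2 Duo and speedup(1905, 850) is about 2.24 for the Phenom II, while speedup(1091, 1044) is only about 1.05 for the Athlon 64.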



SSE3DNow and MemSpeed (latest MemSpd2K) carry out the same floating point data streaming calculations to measure performance via caches and RAM. The former uses SIMD assembly code and the latter was compiled using the original x87 instructions. MemSpeed was then recompiled at 64 bits, where SSE/SSE2 instructions are used, and later again using the 2009 compiler. Results using L1 caches are below, showing slow performance on Core 2 Duo and Phenom using the earlier 64 bit compiler. Examination of the disassembled code showed that the main difference was the use of alternative varieties of move/load instructions at 64 bits. Details of this can be found in Vista64.htm.
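The two calculations named in the table headings can be written in C roughly as follows. This is a minimal sketch and not the MemSpeed source: the function names are hypothetical, and counting the multiply and add as two operations per element (one for the plain add) is an assumption about how MFLOPS are tallied.

```c
#include <assert.h>
#include <stddef.h>

/* s = s + x[m] * y[m]: assumed two floating point operations
   (one multiply, one add) per element. */
double sum_products(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t m = 0; m < n; m++)
        s = s + x[m] * y[m];
    return s;
}

/* x[m] = x[m] + y[m]: assumed one floating point operation per element. */
void add_arrays(double *x, const double *y, size_t n)
{
    for (size_t m = 0; m < n; m++)
        x[m] = x[m] + y[m];
}

/* Hypothetical helper: millions of floating point operations per
   second from an operation count and an elapsed time in seconds. */
double mflops(double operations, double seconds)
{
    assert(seconds > 0.0);
    return operations / seconds / 1.0e6;
}
```

A timing harness would call one of the loops repeatedly over arrays small enough to stay in L1 cache, then pass the total operation count and elapsed time to mflops().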


                         See above for system details - Results in MFLOPS

                    Assembled SIMD - SSE3DNow      Original Compiled x87 - MemSpd2K

                   s=s+x[m]*y[m]  x[m]=x[m]+y[m]      s=s+x[m]*y[m]  x[m]=x[m]+y[m] 
              MHz   Dble   Sngl     Dble   Sngl        Dble   Sngl     Dble   Sngl

  Athlon 64  2210   2048   4070     1096   2165        1100   1103     1030   1103
  Core 2 Duo 2400   3171   6335     2373   4747        1587   1592     1186   1193
  Phenom II  3000   2902   5804     2854   5728        1499   1504     1508   1576
  Core i7    3466?  4433   9053     3332   5803


                  Compiled MemSpeed64 - SSE/SSE2       Compiled MemSpeed64 - 2009

                   s=s+x[m]*y[m]  x[m]=x[m]+y[m]      s=s+x[m]*y[m]  x[m]=x[m]+y[m] 
              MHz   Dble   Sngl     Dble   Sngl        Dble   Sngl     Dble   Sngl

  Athlon 64  2210    903    923      940    735         982    981      979    979
  Core 2 Duo 2400    767*  1274      398*   955        1269   1273     1186   1189
  Phenom II  3000   1330   1499      500*  1144        1499   1499     1497   1164

                        * Slow




Benchmark Codes

  Test and Benchmark                                              Benchmark Code    Source Code

  Paging Test 32 bits - BusSpd2k.exe                              BusSpd2K.zip      BusSpd2K.zip
  Paging Test 64 bits - IntBurn64.exe                             More64bit.zip     More64bit.zip
  Multi-Core Test 32 bits - CPUIDMP.exe                           DualCore.zip      NewSource.zip
  Multi-Core Test 64 bits - CPUIDMP64.exe                         DualCore.zip      NewSource.zip
  Dual Core Test 32 bits - Whets32MP.exe                          DualCore.zip      NewSource.zip
  Dual Core Test 64 bits - Whets64MP.exe                          DualCore.zip      NewSource.zip
  Dual Core Test 32 bits - BusMP32.exe                            DualCore.zip      NewSource.zip
  Dual Core Test 64 bits - BusMP64.exe                            DualCore.zip      NewSource.zip
  Dual Core Test 32 bits - RandMP32.exe                           DualCore.zip      NewSource.zip
  Dual Core Test 64 bits - RandMP64.exe                           DualCore.zip      NewSource.zip
  Multi-Core Test 32 bits - OpenMP32MFLOPS.exe                    OpenMPMFLOPS.zip  OpenMPMFLOPS.zip
  Multi-Core Test 64 bits - OpenMP64MFLOPS.exe                    OpenMPMFLOPS.zip  OpenMPMFLOPS.zip
  Multi Tasking Test 64 bits - IntBurn64.exe                      More64bit.zip     More64bit.zip
  Multi Tasking Test 64 bits - SSEBurn64.exe                      More64bit.zip     More64bit.zip
  Disk Tests 32 bits - Disgraf.exe, CDDVDSpd.exe                  DiskGraf.zip      DiskGraf.zip
  Disk Tests 64 bits - Disgraf64.exe, CDDVDSpd64.exe              More64bit.zip     More64bit.zip
  Graphics Tests 64 bits - VideoWin64.exe, VideDD64.exe,          Video64.zip       Video64.zip
    OpenGL64.exe, VideoD3D9_64.exe, (VideoD3D9_32.exe)
  Image Processing Test 32 bits - BMPSpd.exe                      BMPSpd.zip        BMPSpd.zip
  Image Processing Test 64 bits - BMPSpeed64.exe                  Video64.zip       Video64.zip
  Floating Point Tests original - Linpack, Livermore Loops        BenchNT.zip       BenchNT.zip
  Floating Point Tests new 32/64 bits - Linpack, Livermore Loops  Win64.zip         NewSource.zip
  Floating Point Tests original - MemSpd2K.exe                    MemSpd2K.zip      MemSpd2K.zip
  Floating Point Tests 64 bit SSE/SSE2 - SSE4.exe                 More64bit.zip     More64bit.zip





Roy Longbottom February 2010

The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection