Introduction
All of the programs in The PC Benchmark Collection were run on a new PC with a quad core Phenom II CPU using 64-Bit Windows 7. Some were also run on a Core i7 CPU using the same version of Windows. The systems tested and the compilers/assemblers used were:
AMD Phenom II X4 945 3.0 GHz, Asus M4A785TD-V motherboard, 8GB 1333 MHz DDR3 RAM,
WD6400AACS 640GB 5400 RPM (Green) SATA-300 disk, 16 MB buffer,
nVidia GeForce GTS250 1GB card and on-motherboard Radeon HD 4200 graphics
Intel Core i7 860 2.8 GHz, Asus P7P55D motherboard, 8GB 1333 MHz DDR3 RAM
Intel X25-M 80GB SATA Solid State Drive
Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
Microsoft ml64.exe Version 8.00.40310.39
Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Microsoft ml.exe Version 6.15.8803
Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64 (2009)
Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for 80x86 (2009)
Some results are also compared with those from a Core 2 Duo using 64-Bit Vista and an Athlon 64 x2 with XP X64.
Benchmark results are mainly in terms of MBytes Per Second, Millions of Instructions Per Second (MIPS) or Millions of Floating Point Operations Per Second (MFLOPS).
Configuration Statistics
All the latest benchmarks provide the following system identification details. Variations for an Intel CPU, 32 bit versions of Windows and applications compiled for 32 bits are also shown. The only way to determine 64 bit Windows by programming appears to be via the GetSystemInfo flag PROCESSOR_ARCHITECTURE_AMD64, with 32 bit varieties showing PROCESSOR_ARCHITECTURE_INTEL.
It appears that 32 bit applications running via 64 bit Windows can use 4 GB of virtual address space (UVS), compared with 2 GB for 32 bit versions.
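The identification logic described above can be sketched as a small decoder. This is only an illustration, not the benchmark's own code: the two numeric constants are the values the Windows headers assign to these flags, while `describe_build` and its 2048 MB threshold are hypothetical, drawn from the User Virtual Space figures listed in this section, and will not hold for every Windows configuration.

```python
# Values of SYSTEM_INFO.wProcessorArchitecture as defined in the Windows headers.
PROCESSOR_ARCHITECTURE_INTEL = 0
PROCESSOR_ARCHITECTURE_AMD64 = 9


def describe_build(arch_flag, user_virtual_mb):
    """Hypothetical helper: label a run in the style of the listing below.

    arch_flag       -- architecture value as reported by GetSystemInfo
    user_virtual_mb -- total user virtual space in MB (e.g. from GlobalMemoryStatus)
    """
    if arch_flag == PROCESSOR_ARCHITECTURE_AMD64:
        return "64 Bit Windows, 64 bit application"
    if arch_flag == PROCESSOR_ARCHITECTURE_INTEL:
        # Assumption: more than 2 GB of user space means a 32 bit
        # application is running via 64 bit Windows.
        if user_virtual_mb > 2048:
            return "64 Bit Windows, 32 bit application"
        return "32 Bit Windows, 32 bit application"
    return "unknown"


print(describe_build(PROCESSOR_ARCHITECTURE_AMD64, 8388608))
print(describe_build(PROCESSOR_ARCHITECTURE_INTEL, 4096))
print(describe_build(PROCESSOR_ARCHITECTURE_INTEL, 2048))
```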
CPUID and RDTSC Assembly Code
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
AMD64 processor architecture, 4 CPUs [64 Bit Windows 64 bit application]
Windows NT Version 6.1, build 7600, [Windows 7]
Memory 8192 MB, Free 7105 MB
User Virtual Space 8388608 MB, Free 8388483 MB [64 Bit Windows, 64 bit application]
Memory 7936 MB, Free 6340 MB [Motherboard Radeon HD 4200 enabled]
Windows NT Version 6.0, build 6000, [Vista]
Windows NT Version 5.2, build 3790 [XP Pro x64]
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000106E5
Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz Measured 2809 MHz
AMD64 processor architecture, 8 CPUs [8 CPUs as Hyperthreading provided]
Windows NT Version 6.1, build 7600,
Memory 8183 MB, Free 5857 MB
User Virtual Space 8388608 MB, Free 8388547 MB
Intel processor architecture, 4 CPUs [32 Bit Windows or 32 bit application]
Intel processor architecture, 8 CPUs
User Virtual Space 4096 MB, Free 4035 MB [64 Bit Windows, 32 bit application]
User Virtual Space 2048 MB, Free 2024 MB [32 Bit Windows, 32 bit application]
Paging and Virtual Memory
Besides the results files indicated below, see also Paging.htm. This includes comparisons, further details and Performance Monitor graphs of disk utilisation, showing significant differences between XP, Vista and Windows 7.
Paging speed in terms of MB/second can be measured using BusSpd2K and IntBurn64 Burn-in/Reliability benchmarks. These have six Write/Read and six read only tests. The benchmarks have been modified to include a Paging Test option that uses a single write/read test.
Results are in BusSpd2K Results.htm and, along with graphs of Performance Monitor (Perfmon) logs, in Paging.htm.
As indicated above, 8192 GB user virtual space is available to a 64 bit application. However, paging test results, provided below, show that far less than this can actually be allocated by the program as a single array. Limits were between 5 and 6 GB using XP x64 and 1 GB RAM, 7.9 GB with 64-Bit Vista and 4 GB RAM, then 14 GB via 64-Bit Windows 7 and 8 GB memory.
With each new version of Windows, writing and reading block sizes, as identified via Perfmon, appear to have increased, producing improved performance. Using XP x64 write/read average block sizes were 64/4 KB. Vista mainly wrote blocks of around 1000 KB with reading between 16 KB and 64 KB for periods. Windows 7 used similar large block sizes on writing but reading involved mainly 50 KB to 64 KB blocks.
Performance was also improved by keeping data in memory, rather than paging out on a First In First Out basis. For a Windows 7 test with an 8 GB array, less than half involved reading from disk.
                 64 Bit IntBurn64          64 Bit IntBurn64          64 Bit IntBurn64
CPU              Athlon 64                 Core 2 Duo                Phenom II
MHz              2210                      2400                      3000
RAM MB           1024                      4096                      8192
Windows          XP x64                    64-Bit Vista              64-Bit Windows 7
Disk W/R MB/sec  55                        55                        92

       KB  Secs  MB/sec          KB  Secs  MB/sec          KB  Secs  MB/sec
   800000     1    1976     2000000     5    3056     6000000     3    4051
   900000    58      32     3000000     6    2878     7000000     4    4078
  1000000   128      16     3500000     7    1075     8000000   227      72
  1400000   358       8     4000000   145      56     9000000   697      26
  2000000   683       6     5000000  1040      10    10000000  1231      17
  5000000  1707       6     7900000   771      21    14000000  2742      10
  6000000  Cannot allocate  8000000  Cannot allocate 15000000  Cannot allocate
Dual/Quad Core Benchmarks
The four multi-threaded benchmarks, in 32 bit and 64 bit varieties, were run on the Phenom to demonstrate dual core CPU performance and, with one of them, quad core performance. Multi-tasking tests were also run using four copies of BusSpd2K Reliability Tests and IntBurn64. The 64 bit dual core tests were also run on the Core i7.
See DualCore.htm, BurnIn64.htm and BurnIn4CPU.htm.
Core i7 2800 MHz (21x133) CPUs could be expected to run at 3466 MHz (26x133) using Turbo Boost and one CPU, 3333 MHz (25x133) using two CPUs, then 2933 MHz (22x133) with three or four in use. The speeds will be reduced if the CPU heats up too much. For these MP tests, Turbo Boost settings appear to have been upset, with 2862 MHz (21.5x133) being reported when unused, 3000 MHz (22.5x133) using one or two CPUs and 2467 MHz (18.5x133) with four cores in use.
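The multiplier arithmetic quoted above can be reproduced with a short sketch. It assumes the "133" in each pair stands for the nominal 133.33 MHz base clock, which is why computed figures can differ by a MHz or so from those quoted; `core_mhz` is a hypothetical helper name.

```python
# Assumption: "133" in the text is the nominal 133.33 MHz (400/3) base clock.
BASE_CLOCK_MHZ = 400.0 / 3.0


def core_mhz(multiplier):
    """Core clock in MHz for a given multiplier."""
    return multiplier * BASE_CLOCK_MHZ


settings = [
    ("rated speed", 21),
    ("Turbo Boost, one CPU", 26),
    ("Turbo Boost, two CPUs", 25),
    ("Turbo Boost, three or four CPUs", 22),
    ("observed, one or two CPUs", 22.5),
    ("observed, four cores in use", 18.5),
]

for label, multiplier in settings:
    print(f"{label:32s} {multiplier:4.1f} x 133 = {core_mhz(multiplier):7.1f} MHz")
```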
CPUIDMP and CPUIDMP64 - The benchmark uses an integer test and a floating point test. They are first executed separately, then together in two threads of equal priority and finally with two of each type, where three are at a lower priority. With a quad CPU, performance with four threads should be similar to that of the stand alone runs. Results are in WhatCPU Results.htm.
Core i7 results are slower than expected, probably due to wrong settings for Turbo Boost.
Phenom Phenom Core i7
3.0 GHz 3.0 GHz *** GHz
32 bit 64 bit 64 bit
Separate Tests
32 bit SSE MFLOPS 11981 12013 11961
32 bit Integer MIPS 9015 8279 8734#
Two Threads
32 bit SSE MFLOPS 12012 12042 11966
32 bit Integer MIPS 9027 8265 8760
Four Threads
32 bit SSE MFLOPS 12005 11996 10349
32 bit Integer MIPS 9029 8265 7560
32 bit SSE MFLOPS 11991 12004 9350
32 bit Integer MIPS 8270 9030 7085
*** Estimate 3.0 GHz one or two CPUs in use, 2.467 GHz four CPUs
Could be 3.47 GHz and 2.93 GHz with four CPUs using Turbo Boost
# 2.4 GHz Core 2 Duo obtains 6940 MIPS, 8734/6940*2.4 = 3.02 GHz
Original CPUID - 10202 MIPS i7, 7040 C2D, 10202/7040*2.4 = 3.48 GHz
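The clock estimates in the footnotes scale a score against the 2.4 GHz Core 2 Duo result. A minimal sketch of that arithmetic follows; it assumes, as the footnotes do, that both CPUs deliver the same performance per clock on the test, so the outcome is only a rough indication. `estimated_ghz` is a hypothetical helper name.

```python
def estimated_ghz(score, reference_score, reference_ghz=2.4):
    """Estimate clock speed by scaling against a reference CPU's score.

    Assumes equal performance per clock on both CPUs (a rough approximation).
    """
    return score / reference_score * reference_ghz


# Integer MIPS from the table: Core i7 8734, Core 2 Duo 6940
print(f"{estimated_ghz(8734, 6940):.2f} GHz")
# Original CPUID benchmark: Core i7 10202, Core 2 Duo 7040
print(f"{estimated_ghz(10202, 7040):.2f} GHz")
```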
Whets32MP and Whets64MP - The Whetstone Benchmark has various routines that execute floating point and integer instructions. In the MP version, the benchmark is run in the main thread and another copy of each routine in a low priority second thread which should mainly run at the same speed with two CPUs. Results are in Whetstone Results.htm.
The 64 bit compilation could be expected to be faster than the 32 bit version, as more registers are available for optimisation. Although the Core i7 is probably running at the wrong but similar GHz, floating point performance (with this program) is superior to the Phenom.
Phenom 3.0 GHz 32 bit compilation
MFLOPS Vax MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
Gmean MIPS 1 2 3 MOPS MOPS MOPS MOPS MOPS
1675 25786 6353 1818 1780 1453 138 95 4986 5385 5109
Thread 1 900 870 721 69 47 2512 2898 3995
Thread 2 918 910 732 68 48 2473 2487 1113
Phenom 3.0 GHz 64 bit compilation
1486 34935 6892 1808 1451 1252 199 93 4964 5804 11837
Thread 1 900 724 751 100 46 2482 2893 10625
Thread 2 908 727 501 99 47 2483 2912 1211
Core i7 at 3.0 GHz? 64 bit compilation
1858 32842 8287 2191 1994 1469 268 107 5427 2958 17649
Thread 1 1095 998 780 134 53 2704 1477 16537
Thread 2 1096 996 689 134 54 2723 1481 1113
BusMP and BusMP64 Two Thread Tests - These run a series of tests to measure performance via caches and RAM, firstly as a single thread and secondly using two threads accessing different data arrays. The tests are based on those in BusSpd2K and results are in BusSpeed2K Results.htm. To indicate bus burst reading speeds, the tests start with reading one word with address increments of 32 words (128 bytes at 32 bits and 256 bytes at 64 bits), then with decreasing increments until all data is read. The last test uses 128 bit SSE2 instructions. The data is read using a sequence of 64 AND instructions to one CPU register, repeated numerous times without programmed interference.
Results below are for reading a word with 64 Byte address increments plus reading all data to integer and SSE2 registers.
On running two copies of BusSpd2K at the same time, with one or two CPUs, there is not the unexpected variation shown below. The MP test instructions comprise ANDing to a single register. As shown in WhatCPU Results.htm, in this case, maximum speed is likely to be 24000 MB/second at 64 bits (8 x MHz) and 12000 MB/second at 32 bits. L1 cache results using one CPU produce that level of performance but might not using two threads, where, sometimes, two processors transfer data slower than one. Part of the difference is that the two threads do not finish at the same time but it does seem that Windows is interfering with the data flow and timing is more critical using 64 bit integers.
For RAM, assuming 64 Byte burst reading, expected maximum MB/second would be sixteen times Inc64B speed at 32 bits and eight times at 64 bits and is confirmed here. With bus or RAM speed limitations, throughput using two CPUs would be no greater than with one, but that is not the case for these results. Core i7 DDR3 RAM is shown to be faster than with a Phenom II, but the latter catches up somewhat using two threads.
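The two limits discussed above, one word per clock for the register test and a full 64 byte burst per word read for the Inc64B test, can be written out as follows. This is a sketch of the arithmetic only; the function names are hypothetical, and the Inc64B inputs are the Phenom single CPU RAM figures from the table in this section.

```python
BURST_BYTES = 64  # burst size assumed in the text


def words_per_burst(word_bits):
    """Number of words delivered by one 64 byte burst."""
    return BURST_BYTES // (word_bits // 8)


def expected_max_mb_per_sec(inc64b_mb_per_sec, word_bits):
    """Expected read-all speed if every Inc64B word read fetches a whole burst."""
    return inc64b_mb_per_sec * words_per_burst(word_bits)


def register_limit_mb_per_sec(cpu_mhz, word_bits):
    """Upper limit when one word is ANDed to a register per clock cycle."""
    return cpu_mhz * (word_bits // 8)


print(register_limit_mb_per_sec(3000, 64))   # 64 bit integers at 3000 MHz
print(register_limit_mb_per_sec(3000, 32))   # 32 bit integers at 3000 MHz
print(expected_max_mb_per_sec(454, 32))      # Phenom RAM, 32 bit, 1 CPU
print(expected_max_mb_per_sec(897, 64))      # Phenom RAM, 64 bit, 1 CPU
```

For the Phenom RAM rows this gives 7264 and 7176 MB/second, against measured SSE2 speeds of 7289 and 7372.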
Inc64B speeds using L2 and L3 caches show that data transfer also involves burst reading, but addressing increments are different on the two systems. Reading speed of all data from L2 and L3 caches is similar to that using L1 on the i7 but L3 is slower with the Phenom.
Estimated Core i7 CPU GHz using Turbo Boost is shown below.
Speed In MBytes/Second
Phenom 3 GHz DDR3 Phenom 3 GHz DDR3 Core i7 3? GHz DDR3
32 Bit 64 Bit 64 Bit
CPUs Inc64B RdAll SSE2 Inc64B RdAll SSE2 Inc64B RdAll SSE2
L1 6KB
1 14041 13617 23847 23804 24340 23800 20910 21836 23732
2 18754 25133 47214 22114 29916 47175 15019 25961 47356
% 134 185 198 93 123 198 72 119 200
L2 96KB
1 1496 12878 23879 2986 21594 23822 5997 19282 23945
2 2974 25520 47516 5676 27000 47542 9708 25461 47521
% 199 198 199 190 125 200 162 132 198
L3
1 841 10107 13311 1499 11256 11967 4333 18279 22468
2 1492 18736 25895 2788 15690 22640 6919 23999 43022
% 177 185 195 186 139 189 160 131 191
RAM 128MB
1 454 5212 7289 897 5792 7372 1657 10108 10953
2 760 8959 12146 1477 8734 12124 1968 13076 15560
% 167 172 167 165 151 164 119 129 142
i7/C2D L1 RdAll 21836/17449*2.4 = 3.0 GHz, BusSpd2K 32b 13742/9252*2.4 = 3.56 GHz
RandMP32 and RandMP64 Two Thread Tests - The program uses the same code for serial and random use via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM, firstly as a single thread and secondly using two threads. The 64 bit compilation uses the same 32 bit integer arrays as the 32 bit version and resultant speeds are generally the same. For this benchmark, the two threads share the same data array but one starts half way through. Examples of MB/second results from Randmem Results.htm are below.
The main observations are the reduction in throughput in Read/Write tests with cache based data when using two CPUs, particularly for random access. Here, Windows will be updating data in RAM to maintain integrity (a cache killer).
For these tests, performance degradation is much lower for Core i7. This CPU, again, appears to be running at a Turbo Boosted speed of 3 GHz where some single processor tests produce similar performance to the Phenom. Some of the i7 RAM speeds are noticeably faster.
Speed In MBytes/Second - 64 Bit Version, 32 Bit Integers
Phenom II 3.0 GHz, 1333 MHz DDR3 RAM Core i7 3? GHz, 1333 MHz DDR3 RAM
64-Bit Windows 7 64-Bit Windows 7
Serial Serial
L1 Cache L2 Cache RAM L1 Cache L2 Cache RAM
CPUs RD RW RD RW RD RW RD RW RD RW RD RW
1 15853 5572 12645 5543 4462 3532 11797 4769 10645 4333 8134 3884
2 30818 5043 25567 5796 7254 6153 23509 5891 21010 6621 14410 7645
% 194 91 202 105 163 174 199 124 197 153 177 197
Random Random
1 15116 5666 7409 4991 607 522 11759 4777 6282 3699 983 952
2 30224 1382 14718 1616 1044 974 23588 3061 12602 4454 2030 1912
% 200 24 199 32 172 187 201 64 201 120 207 201
OpenMP Benchmark
OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers. For further detail and results see OpenMP MFLOPS.htm.
The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words.
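To make the operation count concrete, the expression above contains three additions inside the brackets, three multiplications, one subtraction and one further addition, giving eight operations per word. The sketch below shows only the arithmetic and the MFLOPS calculation; it is serial Python rather than OpenMP C++, and the function names and the pass count in the example are hypothetical.

```python
def eight_ops_per_word(x, a, b, c, d, e, f):
    """Apply x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f to each word."""
    return [(v + a) * b - (v + c) * d + (v + e) * f for v in x]


def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for a timed run."""
    return words * ops_per_word * passes / seconds / 1e6


def data_kb(words, bytes_per_word=4):
    """Array size in KB for 4 byte single precision words (1 KB taken as 1000 bytes)."""
    return words * bytes_per_word / 1000


print(eight_ops_per_word([1.0], 1.0, 2.0, 1.0, 1.0, 0.0, 1.0))
print(data_kb(100000))              # smallest array size used
print(mflops(100000, 8, 1000, 1.0)) # hypothetical run: 1000 passes in 1 second
```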
As with the later Floating Point tests, the compiler is not very efficient in handling SSE instructions, where one 3.0 GHz CPU could achieve up to 24000 MFLOPS but obtains less than 4000.
The smallest data size of 400 KB is too large for L1 caches and, with two operations per word, cache speed can be the limiting factor. Larger data sizes are likely to depend on shared cache or RAM speeds. With 32 operations per word, performance is likely to be governed by CPU speed.
The benchmark demonstrates that OpenMP declarations can produce a throughput improvement of greater than 1.9 times on dual core systems and more than 3.9 times with a quad core processor. The Core i7 quad core performance is much better than CPUIDMP64 above and this might be influenced by Hyperthreading where eight CPUs are identified (see Identification).
64 Bit OpenMP Benchmark MFLOPS
Athlon 64 x2 2.2 GHz Core 2 Duo 2.4 GHz
Data Ops/ SSE 64b SSE 64b Gain SSE 64b SSE 64b Gain
Words Word 1 CPU 2 CPU 1 CPU 2 CPU
100000 2 1114 2015 1.8 1594 2573 1.6
1000000 2 638 817 1.3 1577 2589 1.6
10000000 2 636 831 1.3 1160 1203 1.0
100000 8 1942 3783 1.9 3423 6404 1.9
1000000 8 1692 3058 1.8 3363 5956 1.8
10000000 8 1706 3110 1.8 3301 4221 1.3
100000 32 1731 3254 1.9 3526 5376 1.5
1000000 32 1793 3369 1.9 3538 4230 1.2
10000000 32 1774 3443 1.9 3523 6748 1.9
Phenom II 3.0 GHz Core i7 ### GHz
Data Ops/ SSE 64b SSE 64b Gain SSE 64b SSE 64b Gain
Words Word 1 CPU 4 CPU 1 CPU 4 CPU
100000 2 1822 5613 3.1 1661 4263 2.6
1000000 2 1870 7056 3.8 1922 5142 2.7
10000000 2 1563 2972 1.9 1824 3838 2.1
100000 8 3637 12653 3.5 3939 13804 3.5
1000000 8 3709 14518 3.9 4251 18082 4.3
10000000 8 3543 11273 3.2 4133 15079 3.6
100000 32 3652 14092 3.9 4438 16299 3.7
1000000 32 3663 14510 4.0 4512 18081 4.0
10000000 32 3633 14034 3.9 4493 17752 4.0
### Core i7 is rated as 2.8 GHz but probably running at
3.0 GHz using Turbo Boost (4493 / 3523 x 2.4 = 3.0)
Multi-Tasking Benchmarks
The multi-threading benchmarks demonstrate some performance limitations on using cached data. As shown in DualCore.htm, BurnIn64.htm and BurnIn4CPU.htm, multiple copies of IntBurn64 and SSEBurn64 burn-in programs can be used to demonstrate multi-tasking performance.
Results on the Phenom based PC are shown below for SSEBurn64. This has three different types of tests - CPU registers only, L1 Cached data and Memory. For the latter, data size is selected to test caches or RAM. All tests can be run using either 32 bit or 64 bit floating point numbers.
Figures below are reading/execution speeds. SSE MB/second are 4 x MFLOPS and SSE2 MB/second are 8 x MFLOPS.
SSE registers are 128 bits, holding four SSE words or two with SSE2, and arithmetic instructions can manipulate all the data at the same time, producing 12000 SSE or 6000 SSE2 MFLOPS with a 3 GHz processor. Instructions such as add and multiply can be linked to produce up to eight results per clock cycle with SSE. In this case, 5.6 results per cycle are demonstrated with the cache test. The CPU and RAM tests obtain around 12000 MFLOPS, the same as CPUIDMP above.
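The peak figures in the paragraph above follow from the number of words that fit in a 128 bit register. A small sketch of that arithmetic, with hypothetical function names and the 3000 MHz Phenom clock taken from the text:

```python
def simd_lanes(word_bits, register_bits=128):
    """Words processed by one instruction on a 128 bit SSE register."""
    return register_bits // word_bits


def peak_mflops(cpu_mhz, word_bits, linked_instructions=1):
    """Peak MFLOPS with one SIMD instruction per clock, or two when
    add and multiply are linked."""
    return cpu_mhz * simd_lanes(word_bits) * linked_instructions


def results_per_cycle(measured_mflops, cpu_mhz):
    """Floating point results delivered per clock cycle."""
    return measured_mflops / cpu_mhz


print(peak_mflops(3000, 32))                 # SSE, single precision
print(peak_mflops(3000, 64))                 # SSE2, double precision
print(peak_mflops(3000, 32, 2))              # SSE with linked add and multiply
print(round(results_per_cycle(16802, 3000), 1))  # Cache SSE test, 1 CPU
```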
With four programs running, each operates at approximately the same performance level. With CPU, cache and memory L1/L2 cache tests, performance gains are greater than 3.9 times. The Phenom has 6 MB shared L3 cache and using 1 MB per CPU can be quite efficient. Using four CPUs can more than double memory throughput.
Phenom II 3.0 GHz, 1333 MHz DDR3 RAM, 64-Bit Windows 7
Test 1 CPU Copy1 Copy2 Copy3 Copy4 Total Gain
MFLOPS MFLOPS MFLOPS MFLOPS MFLOPS MFLOPS
CPU SSE 12022 11931 11901 11867 12007 47706 3.97
Cache SSE 16802 16478 16466 16381 16410 65735 3.91
Cache SSE2 8258 8090 8107 8166 8103 32465 3.93
Memory 1 CPU Copy1 Copy2 Copy3 Copy4 Total Gain
SSE Test MB/sec MB/sec MB/sec MB/sec MB/sec MB/sec
L1 32 KB 47484 47109 46907 47553 47222 188791 3.98
L2 256 KB 23919 23577 23907 23807 23690 94980 3.97
L3 1024 KB 11250 9171 9225 9264 9224 36884 3.28
RAM 64 MB 7041 3708 3796 3756 3730 14990 2.13
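The Total and Gain columns above are the sum of the four concurrent copies and that sum divided by the single program result. A minimal sketch, using the CPU SSE and RAM rows as input (`total_and_gain` is a hypothetical name):

```python
def total_and_gain(copy_speeds, single_speed):
    """Return combined throughput of concurrent copies and the gain
    relative to one program running alone."""
    total = sum(copy_speeds)
    return total, total / single_speed


cpu_total, cpu_gain = total_and_gain([11931, 11901, 11867, 12007], 12022)
ram_total, ram_gain = total_and_gain([3708, 3796, 3756, 3730], 7041)

print(cpu_total, round(cpu_gain, 2))
print(ram_total, round(ram_gain, 2))
```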
Disks
The first two partitions (C: and D:) on the Phenom based PC disk drive are each 224 GB. As expected, maximum data transfer speed is higher using the first partition, writing/reading at 94/104 MB/second, compared with 90/95 MB/second on D:. As with earlier systems, this is not the case writing and reading small files. Below are results from the CDDVDSpd benchmark that writes and reads 520 small files of a chosen size. Here, AVAST anti-virus is enabled for both, where this is also known to reduce performance with small files.
For more details see DiskGraf Results.htm and CDDVDSpd Results.htm.
Tests were run using real data via copying and pasting a folder containing downloaded HTML documents with lots of tiny GIF files - 18.1 MB, 28.4 MB on disk, 3638 files in 277 folders. Example copying times were 43 seconds on C: and 8 seconds on D:.
            C: Partition                  D: Partition
                      Per File                      Per File
  KB   Write   Read   Write   Read   Write   Read   Write   Read
        MB/s   MB/s   msecs  msecs    MB/s   MB/s   msecs  msecs
   2    0.37   0.32     5.4    6.3    2.57   4.12     0.8    0.5
   4    0.95   0.64     4.2    6.3    8.10  10.18     0.5    0.4
   8    1.61   1.10     5.0    7.3   15.64  16.02     0.5    0.5
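The per file times in the table are consistent with dividing file size by transfer speed, taking 1 MB as 1000 KB. That unit choice is an inference from the figures rather than something stated by the benchmark, and because the MB/s values are rounded the results agree only to about one decimal place. A sketch with a hypothetical function name:

```python
def msecs_per_file(file_kb, mb_per_sec):
    """Milliseconds to transfer one file, assuming 1 MB = 1000 KB.

    KB / (MB/s) = KB / (1000 KB/s) seconds = KB / (MB/s) milliseconds.
    """
    return file_kb / mb_per_sec


# C: partition, 2 KB files: 0.37 MB/s write, 0.32 MB/s read
print(f"{msecs_per_file(2, 0.37):.2f} {msecs_per_file(2, 0.32):.2f}")
# D: partition, 8 KB files: 15.64 MB/s write, 16.02 MB/s read
print(f"{msecs_per_file(8, 15.64):.2f} {msecs_per_file(8, 16.02):.2f}")
```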
Graphics
My Windows graphics benchmarks, compiled for 32 bit and 64 bit working, are available. Further details and results are provided in Direct3D Results.htm, DirectDraw Results.htm, OpenGL Results.htm and VideoWin Results.htm.
Performance via 64 bit and 32 bit versions was mainly the same. Below are results of the DirectX9 benchmark showing the impact of using Aero desktop, the effect of large textures and the difference between the motherboard based graphics and moderately fast graphics card.
The Windows benchmark (VideoWin) showed that the on-board integrated graphics can be faster than the GeForce card but this is far from the case with DirectX9 3D graphics. At the lowest pixel size settings, where CPU speed has more effect, the results using the card are 1.9 to 6.2 times faster, increasing to 5.4 to 8.2 times using the highest number of pixels.
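The 1.9 to 6.2 times range quoted for the lowest pixel size can be checked against the 640 x 480 rows of the two Aero listings in this section. The sketch below leaves out the WireEgg Vsync column, on the assumption that it is capped near 60 frames per second on both devices, and compares the 32 bit Radeon run with the 64 bit GeForce run, which the text treats as equivalent.

```python
TESTS = ["Shaded Egg", "500 Cubes", "Texture Tunnel", "Colour Objects",
         "Texture Objects", "Pixel Shader2", "Vertex Shader2"]

# Frames per second at 640 x 480 x 32, Aero desktop, from the listings in this section
radeon_hd_4200 = [1674.0, 75.9, 471.5, 866.0, 454.5, 451.6, 668.4]
geforce_gts_250 = [9079.0, 396.3, 2919.4, 2266.4, 1496.9, 1400.0, 1293.1]

ratios = [card / onboard for card, onboard in zip(geforce_gts_250, radeon_hd_4200)]

for name, ratio in zip(TESTS, ratios):
    print(f"{name:16s} {ratio:4.1f} times faster")

print(f"range {min(ratios):.1f} to {max(ratios):.1f}")
```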
As for 64-Bit Vista (see Vista64.htm), speeds with Aero selected are slower than using Classic Desktop, in this case by an average of around 10%. This might be influenced by the benchmark running in windowed mode, and not as a dedicated full screen application.
Some tests were run where objects were textured using 1024 x 1024 or 1M pixel JPG images (tests 4, 6, 7, 8). These textures are rather large but might be needed for close ups of moving objects. Pix1 results are from using the same file for all five textures, with Pix2 using different files, which does not reduce performance much further. The large textures can reduce speed by more than 50% using the GeForce card but much less via the on-board Radeon graphics, where CPU time is more significant.
Further tests were run using 10M pixel textures (for close ups of part of an object?), resulting in much slower performance. In this case, there was around a 10 second delay with each texture test before movement started, due to a non-timed pass used to calibrate running time parameters. During this delay, CPU utilisation was 25% or 100% of one CPU.
When the textured images were being displayed, utilisation was in the range 52% to 84% of one CPU, indicating that graphics hardware speed limits performance.
Aero ATI Radeon HD 4200
DirectX9 D3D Test 32 Bit Version 1.1, Sat Nov 21 14:05:59 2009
Copyright (C) 2006, Roy Longbottom
..................... Frames Per Second ......................
Resolution Shaded WireEgg 500 Texture Colour Texture Pixel Vertex
Egg Vsync Cubes Tunnel Objects Objects Shader2 Shader2
640 480 32 1674.0 60.0 75.9 471.5 866.0 454.5 451.6 668.4
800 600 32 1083.8 60.0 58.6 313.6 602.9 312.3 310.3 458.9
1024 768 32 651.4 60.0 43.0 205.4 386.2 203.9 203.3 299.5
1280 1024 32 363.9 60.0 27.7 126.5 214.8 112.1 113.1 169.6
1280 1M pix1 371.9 60.0 27.8 80.9 206.9 82.7 80.2 127.9
Aero GeForce GTS 250
DirectX9 D3D Test 64 Bit Version 1.1, Thu Nov 19 14:27:42 2009
Copyright (C) 2006, Roy Longbottom
..................... Frames Per Second ......................
Resolution Shaded WireEgg 500 Texture Colour Texture Pixel Vertex
Egg Vsync Cubes Tunnel Objects Objects Shader2 Shader2
640 480 32 9079.0 60.6 396.3 2919.4 2266.4 1496.9 1400.0 1293.1
800 600 32 6368.3 60.6 326.9 2247.2 2001.7 1408.7 1287.7 1259.6
1024 768 32 4962.9 60.6 257.6 1609.4 2163.5 1363.7 1251.9 1346.0
1280 1024 32 3065.1 60.6 188.5 1093.2 1551.2 908.4 862.0 1167.6
1280 1M pix 3057.5 60.6 189.7 498.2 1547.2 449.1 439.3 734.0
1680 1050 32 2422.1 60.6 181.4 876.3 1329.5 806.8 771.8 1046.4
1680 1M pix1 2409.7 60.6 181.2 402.1 1324.0 408.1 400.4 668.9
1680 1M pix2 2360.7 60.5 180.6 401.0 1302.7 402.1 397.3 658.2
1680 10M pix 2358.9 60.6 180.6 154.6 1300.1 165.2 160.0 306.7
10M pix +10 secs each textured tests for untimed first Render()
GeForce GTS 250 and Radeon HD 4200 Dual Monitor
2960 1000 32 1470.0 30.0 171.7 606.8 966.5 662.2 632.3 787.0
Classic GeForce GTS 250
..................... Frames Per Second ......................
Resolution Shaded WireEgg 500 Texture Colour Texture Pixel Vertex
Egg Vsync Cubes Tunnel Objects Objects Shader2 Shader2
640 480 32 9544.8 60.6 425.4 3164.5 2398.7 1618.5 1501.4 1339.3
800 600 32 7801.6 60.6 354.4 2446.3 2366.6 1626.9 1513.1 1365.9
1024 768 32 5244.4 60.6 282.5 1728.3 2319.5 1521.1 1358.8 1343.3
1280 1024 32 3291.5 60.5 214.8 1188.4 1664.8 1004.4 948.6 1246.8
1280 1M pix1 3292.2 60.6 214.2 538.5 1665.4 485.7 478.4 803.9
1680 1050 32 2633.7 60.6 206.5 955.0 1435.4 895.4 855.9 1143.6
Image Processing Benchmark
BMPSpeed benchmark measures performance using image files increasing in size from 0.5 MB to 512 MB. Tests run are Enlarge/Edit, Save BMP file, Load BMP file, Scroll and Rotate. Further details can be found in BMPSpeed Results.htm.
Below are results from the 64 bit version, excluding saving and loading speeds which were similar to those provided by disk benchmarks.
As for graphics, the impact of using Aero desktop and the difference between the motherboard based graphics and the fast graphics card are shown. Earlier results for a Core 2 Duo based PC, using 64-Bit Vista, are provided for comparison purposes.
The benchmark uses fast BitBlt copying when permitted and a slower byte based method when not. With 32 bit Windows, the former is used up to image size of 64 MB. Using 64 bit Windows, no limit is seen. Maximum memory demand within user space is 1.1 GB but the BitBlt method creates 4 bytes per pixel bitmaps outside this area, increasing memory demands up to 2.3 GB.
Performance Monitor (Perfmon) shows that Enlarge and Rotate use 100% utilisation of one CPU. This leads to similar performance using motherboard integrated graphics and the much faster video card. Besides image size, performance depends on whether the data is in caches, RAM or even paged out to disk, when insufficient RAM is available. In this case, caches will only help for the smaller images. Perfmon also shows a linear increase in memory demands according to image size, with no noticeable disk activity. In turn, this produces a fairly linear increase in the enlarge test times. Here, the Core 2 Duo is faster on the small images, probably due to cache speed, but the Phenom excels with large images with its DDR3 memory.
In rotating the larger images, it is not clear why measured times are non-linear, why the Core 2 Duo is faster or why the times for the larger two images are similar. This test involves copying bytes from array rows to columns and might involve cache flushing or be affected by reading and writing data in 64 byte bursts.
On the Core 2 Duo with Vista, speed using Classic Desktop settings is slightly slower than with Aero enabled. Under Windows 7 and the Phenom, Classic Desktop enlarge and rotate speeds are significantly slower with both the nVidia and ATI graphics.
The speed on scrolling is expected to be similar to results shown for Classic Desktop and Aero on the Core 2 Duo, where smaller images are cached in video RAM (or RAM used for display purposes). Slower speed for non-cached data represents RAM/bus speeds, clearly faster on the Phenom based system. The glaring anomaly is the slow scrolling using Aero via Windows 7.
Phenom II 3.0 GHz, 512 KB L2 cache, 6 MB L3 cache,
GeForce GTS 250, 64-Bit Windows 7, 1680 x 1050 x 32 bits
Aero Classic
MBytes Enlarge Rotate Scroll Enlarge Rotate Scroll
Secs Secs MB/Sec Secs Secs MB/Sec
0.5 0.04 0.03 1375 0.08 0.07 4071
1 0.07 0.04 1612 0.14 0.10 4399
2 0.08 0.05 1882 0.17 0.15 4173
4 0.11 0.08 2251 0.23 0.21 2703
8 0.11 0.13 2575 0.30 0.32 2690
16 0.18 0.22 2553 0.44 0.49 2643
32 0.22 0.37 2543 0.58 0.77 2641
64 0.35 0.69 2551 0.84 1.23 2629
128 0.54 1.52 2554 1.27 2.34 2614
256 0.93 7.18 2554 1.98 8.28 2638
512 1.74 7.87 2546 3.18 9.42 2605
Phenom II 3.0 GHz, Radeon HD 4200, 64-Bit Windows 7
1280 x 1024 x 32 bits
Aero Classic
MBytes Enlarge Rotate Scroll Enlarge Rotate Scroll
Secs Secs MB/Sec Secs Secs MB/Sec
0.5 0.05 0.03 2503 0.11 0.10 4015
1 0.07 0.04 2332 0.16 0.14 4356
2 0.10 0.05 2413 0.23 0.20 4262
4 0.09 0.08 2460 0.29 0.28 2795
8 0.11 0.13 2389 0.38 0.41 2754
16 0.17 0.22 2366 0.52 0.61 2738
32 0.21 0.38 2367 0.76 0.95 2720
64 0.33 0.70 2383 1.09 1.48 2731
128 0.55 1.54 2380 1.62 2.72 2674
256 0.95 7.29 2370 2.41 8.81 2676
512 1.74 7.89 2367 3.97 10.13 2650
Core 2 Duo 2.4 GHz, 4 MB L2 cache
GeForce 8600 GT, 64-Bit Vista, 1280 x 1024 x 32 bits
Aero Classic
MBytes Enlarge Rotate Scroll Enlarge Rotate Scroll
Secs Secs MB/Sec Secs Secs MB/Sec
0.5 0.03 0.02 4510 0.03 0.03 4541
1 0.06 0.03 3881 0.04 0.05 4397
2 0.05 0.05 2184 0.06 0.06 2083
4 0.07 0.09 1711 0.07 0.08 1630
8 0.09 0.12 1623 0.11 0.13 1548
16 0.14 0.20 1621 0.16 0.20 1530
32 0.21 0.33 1611 0.25 0.33 1511
64 0.37 0.56 1604 0.42 0.63 1524
128 0.68 1.12 1599 0.74 1.10 1504
256 1.27 4.28 1583 1.35 4.36 1470
512 2.41 5.89 1500 2.94 5.91 1388
Floating Point
My original benchmarks, that measure CPU floating point performance, were converted and compiled for 64 bit working. These have to use SSE and SSE2 floating point instead of the old x87 instructions. Other versions were generated for 32 bit working, with options set to use SSE and SSE2. They were first produced to run on an Athlon 64 x2 CPU via Windows XP Pro x64 and all ran successfully on a Core 2 Duo processor using 64-Bit Vista, now also via Windows 7 using a Phenom II.
When using assembly code, processing speed of SSE and SSE2 floating point instructions is shown to be much faster than the old x87 variety. Unfortunately, the compiler used did not implement Single Instruction Multiple Data (SIMD) instructions properly, only using one variable in registers - Single Instruction Single Data (SISD). The result is that maximum speed could be expected to be reduced by two times on 64 bit double precision SSE2 operation and by four times using 32 bit single precision SSE instructions. In some cases, this leads to programs using the old x87 floating point instructions being faster than SSE/SSE2 varieties.
The major surprise is that Core 2 Duo and Phenom demonstrate particularly slow performance on some 64 bit compilations that produce SSE2 instructions, where the Athlon 64 could be up to twice as fast.
These slow results were from the original 2006 versions but were corrected using a later compiler in 2009.
Below are results for the Linpack and Livermore Loops benchmarks - see Linpack Results.htm and Livermore Loops Results.htm.
For other benchmarks using floating point see WhatCPU Results.htm, FFTGraf Results.htm, Whetstone Results.htm, SSE3DNow Results.htm and MemSpd2K Results.htm.
Core i7 results are included below for original benchmarks. In this case, the CPU appears to be running at the Turbo Boost speed of 3466 MHz.
Linpack Benchmark - Results in MFLOPS
64 Bit 32 Bit Original
Core 2 Duo 2400 MHz, Vista 823 1480 1315
Core 2 Duo 2009 compilation 1602
Athlon 64 2210 MHz, XP x64 1044 1014 838
Athlon 64 2009 compilation 1091
Phenom II 3000 MHz, Win7 850 1713 1413
Phenom II 2009 compilation 1905
Core i7 3466 TB MHz, Win7 2004
Livermore Loops Benchmark - Results in MFLOPS
64 Bit 32 Bit Original
Max Mean Min Max Mean Min Max Mean Min
Core 2 Duo 1175 537 227 2526 804 195 2236 539 52
2009 comp 2799 893 261
Athlon 64 2284 661 166 1908 641 162 2566 461 48
2009 comp 2068 679 171
Phenom II 1314 625 185 3010 936 209 3893 644 64
2009 comp 3541 1023 206
Core i7 3147 828 76
SSE3DNow and MemSpeed (latest MemSpd2K) carry out the same floating point data streaming instructions to measure performance via caches and RAM. The former uses SIMD assembly code and the latter was compiled using the original x87 instructions. This was recompiled at 64 bits where SSE/SSE2 instructions are used, then later using the 2009 compiler. Results using L1 caches are below, showing slow performance on Core 2 Duo and Phenom using the earlier 64 bit compiler.
The disassembled code was examined and the main difference was that alternative varieties of move/load instructions were used at 64 bits. Details of this can be found in Vista64.htm.
See above for system details - Results in MFLOPS
Assembled SIMD - SSE3DNow Original Compiled x87 - MemSpd2K
s=s+x[m]*y[m] x[m]=x[m]+y[m] s=s+x[m]*y[m] x[m]=x[m]+y[m]
MHz Dble Sngl Dble Sngl Dble Sngl Dble Sngl
Athlon 64 2210 2048 4070 1096 2165 1100 1103 1030 1103
Core 2 Duo 2400 3171 6335 2373 4747 1587 1592 1186 1193
Phenom II 3000 2902 5804 2854 5728 1499 1504 1508 1576
Core i7 3466? 4433 9053 3332 5803
Compiled MemSpeed64 - SSE/SSE2 Compiled MemSpeed64 - 2009
s=s+x[m]*y[m] x[m]=x[m]+y[m] s=s+x[m]*y[m] x[m]=x[m]+y[m]
MHz Dble Sngl Dble Sngl Dble Sngl Dble Sngl
Athlon 64 2210 903 923 940 735 982 981 979 979
Core 2 Duo 2400 767* 1274 398* 955 1269 1273 1186 1189
Phenom II 3000 1330 1499 500* 1144 1499 1499 1497 1164
* Slow
Benchmark Codes