Image Processing - 64 Bit Windows slow on large images due to paging
Identification - Identifying 64-Bit Vista and Core 2 Duo
Paging - Program can allocate more memory with 64-Bit Vista
Paging - Vista paging more efficient than Windows XP
Paging - Accessing data 500 times slower than using RAM and 8 times slower than normal disk I/O
Dual Core CPU - Efficient use
Dual Core CPU - Slow on data streaming to 64 bit integer registers
Dual Core CPU - Windows slow on writing/reading shared data
Disk Drives - Partitioned disk C: drive slower than D:
Graphics - 3D slower using Aero
Floating Point - 64 bit compilations slow on Core 2 Duo
All of the programs in The PC Benchmark Collection were run on a new PC with a Core 2 Duo CPU using 64-Bit Vista. A surprising number of performance issues were raised but these are mainly related to Core 2 Duo, the compilers used, Windows in general and 64 bit working, rather than just Vista. The system tested and compilers/assemblers used were:
Core 2 Duo 2400 MHz, Asus P5B motherboard, 4 GB 800 MHz DDR2 RAM,
Seagate ST3400633AS SATA-300 disk, 16 MB buffer, 7200 RPM,
GeForce 8600 GT graphics, Vista Home Premium 64-Bit
Microsoft C/C++ Optimizing Compiler Version 14.00.40310.41 for AMD64
Microsoft ml64.exe Version 8.00.40310.39
Microsoft 32-bit C/C++ Optimizing Compiler Version 13.10.3077 for 80x86
Microsoft ml.exe Version 6.15.8803
Results were compared with those from a dual core Athlon 64 using Windows XP Pro x64 and other PCs with Windows XP and 2000.
Paging and Virtual Memory
Besides the results files indicated below, see Paging.htm. This includes comparisons, further details and Performance Monitor graphs of disk activity, showing significant differences between Windows XP and Vista.
Image Processing - The BMPSpeed benchmark measures performance using image files increasing in size from 0.5 MB to 512 MB. Initially (until Memory Remapping was set in the BIOS), only 3 GB of the 4 GB was seen, and slow speed due to paging was reported with the larger images. This follows the worst ever results, which were obtained using 64 bit XP and 1 GB RAM.
See BMPSpeed Results.htm.
The benchmark uses fast BitBlt copying when permitted and a slower byte based method when not. With 32 bit Windows, the former was used up to a maximum image size of 64 MB. Using 64 bit Windows, no limit was seen. A CreateDIBitmap function is used for BitBlt and this uses memory space outside user virtual memory space at 32 bits per pixel instead of 24. The result is that maximum memory demands are increased from 1.1 GB to 2.3 GB. Of lesser significance, 64 bit Windows also shows an increase in user memory requirements.
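As a rough illustration of the 24 to 32 bits per pixel expansion described above, the sketch below scales an image size by the ratio of pixel depths. The function name is hypothetical and it covers a single bitmap only; it does not attempt to reproduce the 1.1 GB and 2.3 GB totals, which depend on how many copies the benchmark holds.

```python
def bitmap_mb(size_mb_at_24_bits, bits_per_pixel=32):
    """Scale the size of a 24 bits/pixel image to another pixel depth."""
    return size_mb_at_24_bits * bits_per_pixel / 24.0


if __name__ == "__main__":
    # Largest BMPSpeed image (512 MB at 24 bits per pixel) held at 32 bits per pixel.
    print("%.1f MB" % bitmap_mb(512))
```

On this basis, the largest 512 MB image would need about a third more space when held at 32 bits per pixel.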
Configuration Statistics - All the latest benchmarks provide the following system identification details. Variations for the AMD CPU, 32 bit versions of Windows and applications compiled for 32 bits are also shown. The only way to determine 64 bit Vista by programming appears to be via the GetSystemInfo flag PROCESSOR_ARCHITECTURE_AMD64, with 32 bit varieties reporting PROCESSOR_ARCHITECTURE_INTEL.
It appears that 32 bit applications running via 64 bit Windows can use 4 GB of virtual address space (UVS), compared with 2 GB for 32 bit versions.
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
AMD64 processor architecture, 2 CPUs [64 Bit Windows]
Windows NT Version 6.0, build 6000, [Vista]
Memory 4095 MB, Free 3088 MB
User Virtual Space 8388608 MB, Free 8388542 MB [64 Bit Windows, 64 bit application]
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00020FB1
AMD Athlon(tm) 64 X2 Dual Core Processor 4200+ Measured 2211 MHz
Windows NT Version 5.2, build 3790 [XP Pro x64]
Intel processor architecture, 2 CPUs [32 Bit Windows]
Intel processor architecture, 1 CPU
User Virtual Space 4096 MB, Free 4035 MB [64 Bit Windows, 32 bit application]
User Virtual Space 2048 MB, Free 2024 MB [32 Bit Windows, 32 bit application]
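The architecture check described above can be sketched as follows. This is not the benchmark's own code (that is compiled C); it is a minimal Python illustration, where describe_architecture and query_architecture are hypothetical names. The two constant values and the SYSTEM_INFO layout are those of the Windows SDK headers, and the GetSystemInfo call is only attempted when running on Windows.

```python
import ctypes
import sys

# Values of SYSTEM_INFO.wProcessorArchitecture from the Windows SDK headers.
PROCESSOR_ARCHITECTURE_INTEL = 0
PROCESSOR_ARCHITECTURE_AMD64 = 9


class SYSTEM_INFO(ctypes.Structure):
    """Layout of the Win32 SYSTEM_INFO structure filled in by GetSystemInfo."""
    _fields_ = [
        ("wProcessorArchitecture", ctypes.c_uint16),
        ("wReserved", ctypes.c_uint16),
        ("dwPageSize", ctypes.c_uint32),
        ("lpMinimumApplicationAddress", ctypes.c_void_p),
        ("lpMaximumApplicationAddress", ctypes.c_void_p),
        ("dwActiveProcessorMask", ctypes.c_size_t),
        ("dwNumberOfProcessors", ctypes.c_uint32),
        ("dwProcessorType", ctypes.c_uint32),
        ("dwAllocationGranularity", ctypes.c_uint32),
        ("wProcessorLevel", ctypes.c_uint16),
        ("wProcessorRevision", ctypes.c_uint16),
    ]


def describe_architecture(code):
    """Map a wProcessorArchitecture value to the wording used in the listing."""
    if code == PROCESSOR_ARCHITECTURE_AMD64:
        return "AMD64 processor architecture [64 Bit Windows]"
    if code == PROCESSOR_ARCHITECTURE_INTEL:
        return "Intel processor architecture [32 Bit Windows]"
    return "Unknown processor architecture (%d)" % code


def query_architecture():
    """Return wProcessorArchitecture from GetSystemInfo, or None off Windows."""
    if sys.platform != "win32":
        return None
    info = SYSTEM_INFO()
    ctypes.windll.kernel32.GetSystemInfo(ctypes.byref(info))
    return info.wProcessorArchitecture


if __name__ == "__main__":
    code = query_architecture()
    print("Not running on Windows" if code is None else describe_architecture(code))
```

Note that a 32 bit process on 64 bit Windows is given the Intel value by GetSystemInfo; GetNativeSystemInfo is the documented call for obtaining the native architecture in that case.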
Paging Benchmarks - Paging speed in terms of MB/second can be measured using BusSpd2K and IntBurn64 Burn-in/Reliability benchmarks. These have six Write/Read and six read only tests. With 4 GB RAM and paging, test running time was greater than an hour. The benchmarks have been modified to include a Paging Test option that uses a single write/read test.
Results are in BusSpd2K Results.htm.
Running with increasing (or decreasing) memory demands reveals the maximum that can be allocated for a single data array:
- 32 bit Windows (UVM 2 GB) - 1,200,000 KB with XP and 1,500,000 KB via Windows 2000
- 64-Bit Vista and 64 bit program (UVM 8192 GB) - 7,900,000 KB, XP x64 (1 GB RAM) - between 5,000,000 KB and 6,000,000 KB
- 64 bit Windows and 32 bit program (UVM 4 GB) - 2,000,000 KB.
Tests were run on four PCs using Windows XP, 2000, x64 and 64-Bit Vista. All had disk drives that could write and read large files at around 50 MB/second.
Best (RAM speed) and worst case results are below. Worst is using XP x64 with 1 GB RAM, where, with similar demands, it can be slower than the PCs with 512 MB. In all cases, writing/reading data with paging can be much slower than using normal disk output/input. The slowdown relative to RAM speed also becomes more severe on systems with more advanced, faster memory technology.
Using Vista showed that speed tended to improve more frequently and to a greater extent, as memory demands increased. The others showed more of a general decline. Performance Monitor logging revealed differences between XP x64 and Vista. The former appeared to consistently write using 64 KB data transfers and read at 4 KB. With Vista there were periods of reading at 8, 16, 32 and 64 KB and writing was mainly at around 1000 KB. Disk benchmarks show that writing at 64 KB block size is not necessarily the fastest and reading at 4 KB is usually much slower. Graphs of Performance Monitor logs are included in Paging.htm, along with many more speed measurements.
Windows            RAM   Maximum   Minimum
                    MB    MB/sec    MB/sec          KB   Seconds
64 Bit Benchmark
64-Bit Vista      4096      3393        10   5,000,000      1024
                                        21   7,900,000       770
XP x64            1024      2040         6   5,000,000      1707
                                         6   2,000,000       683
32 Bit Benchmark
64-Bit Vista      4096      3390      2139   2,000,000         2
XP x64            1024      2041         6   2,000,000       683
                                         8   1,400,000       358
XP                 512       532        13   1,200,000       189
2000               512       970        15   1,500,000       205
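The minimum MB/second figures in the table above appear to be consistent with the data being moved twice (one write pass and one read pass) and with 1 MB taken as 1000 KB. That relationship is an assumption inferred from the figures, not a statement of how the benchmark calculates its results, and the function names below are hypothetical. The sketch also compares the worst case with the 50 MB/second quoted for normal large file disk transfers.

```python
def paging_mb_per_sec(size_kb, seconds, passes=2):
    """MB/second for one write/read test.

    Assumes passes=2 (write then read) and 1 MB = 1000 KB, which is what
    the published figures seem to imply.
    """
    return passes * (size_kb / 1000.0) / seconds


def times_slower(reference_mb_per_sec, measured_mb_per_sec):
    """How many times slower the measured speed is than the reference."""
    return reference_mb_per_sec / measured_mb_per_sec


if __name__ == "__main__":
    # XP x64, 1 GB RAM, 5,000,000 KB in 1707 seconds
    print("XP x64 %.1f MB/sec" % paging_mb_per_sec(5000000, 1707))
    # 64-Bit Vista, 4 GB RAM, 7,900,000 KB in 770 seconds
    print("Vista  %.1f MB/sec" % paging_mb_per_sec(7900000, 770))
    # Worst case 6 MB/sec against around 50 MB/sec for normal disk I/O
    print("%.1f times slower than disk" % times_slower(50, 6))
```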
Dual Core Benchmarks
The four multi-threaded benchmarks, with 32 bit and 64 bit varieties, were run to demonstrate dual core CPU performance. Multi-tasking tests were also run using two copies of BusSpd2K Reliability Tests and IntBurn64. See DualCore.htm.
CPUIDMP and CPUIDMP64 - The benchmark uses an integer test and a floating point test. They are first executed separately, then together in two threads of equal priority and finally with two of each type, where three are at a lower priority. With a dual CPU, performance with two threads should be similar to that of the stand alone runs. The total speed of four threads might give some variation on the latter. Results are in WhatCPU Results.htm. There were no surprises here.
Whets32MP and Whets64MP - The Whetstone Benchmark has various routines that execute floating point and integer instructions. In the MP version, the benchmark is run in the main thread and another copy of each routine in a low priority second thread which should mainly run at the same speed with two CPUs. Results are in Whetstone Results.htm. Again, there were no surprises.
Multi-Tasking Tests - The 32 bit BusSpd2K Reliability Test and 64 bit IntBurn64 were run separately then with two copies concurrently to demonstrate dual core performance using data in caches and RAM. The former uses 64 bit MMX instructions and the latter 64 bit integers. They have write/read and read only tests where both showed similar gains using two CPUs. Example reading speeds in MB/second are below - from BusSpeed2K Results.htm.
These show significant performance differences between the two systems. Total throughput via caches is seen to nearly double using two CPUs, which is surprising for the Core 2 Duo shared L2 cache. There is also a gain on using RAM, more notable with the AMD system.
       Core 2 Duo 2400 MHz, 800 MHz RAM          Athlon 64 X2 2210 MHz, 400 MHz RAM
       64-Bit Vista                              Windows XP x64
       L1            L2            RAM           L1            L2            RAM
Mode      32     64     32     64     32     64     32     64     32     64     32     64
1      15794  16206  13084  13048   5433   5408  20913  22257   9112  10102   2872   3009
2      31401  32248  25111  25033   6066   6019  41503  44389  18023  19957   4706   4838
%        199    199    192    192    112    111    198    199    198    198    164    161
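The % row in the table above is the total throughput of two concurrent copies expressed as a percentage of the single copy speed, so 200 would represent perfect scaling across two CPUs. A minimal sketch of that arithmetic, using a hypothetical function name and figures taken from the table:

```python
def two_copy_percent(one_copy_mb_per_sec, two_copies_mb_per_sec):
    """Total throughput of two copies as a percentage of one copy."""
    return round(100.0 * two_copies_mb_per_sec / one_copy_mb_per_sec)


if __name__ == "__main__":
    print("Core 2 Duo L1  32 bit", two_copy_percent(15794, 31401))
    print("Core 2 Duo RAM 32 bit", two_copy_percent(5433, 6066))
    print("Athlon 64  RAM 32 bit", two_copy_percent(2872, 4706))
```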
BusMP and BusMP64 Two Thread Tests - These run a series of tests to measure performance via caches and RAM, firstly as a single thread and secondly using two threads accessing different data arrays. The tests are based on those in BusSpd2K and results are in BusSpeed2K Results.htm. To indicate bus burst reading speeds, the tests start with reading one word with address increments of 32 words (128 bytes at 32 bits and 256 bytes at 64 bits), then with decreasing increments until all data is read. The last test uses 128 bit SSE2 instructions. The data is read using a sequence of 64 AND instructions to one CPU register, repeated numerous times without programmed interference.
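The access pattern described above can be illustrated with a short sketch. This is not the benchmark code, which uses unrolled sequences of 64 AND instructions in compiled code; the function names are hypothetical and the halving of the increment between tests (32, 16, 8 ... 1 words) is an assumption, as the text only says that increments decrease until all data is read.

```python
def increments(start_words=32):
    """Word increments for successive tests, assumed to halve each time."""
    result = []
    inc = start_words
    while inc >= 1:
        result.append(inc)
        inc //= 2
    return result


def strided_and(words, increment):
    """AND every increment-th word into one accumulator, as one register would.

    Returns the accumulated value and the number of words read.
    """
    accumulator = 0xFFFFFFFF
    words_read = 0
    for index in range(0, len(words), increment):
        accumulator &= words[index]
        words_read += 1
    return accumulator, words_read


if __name__ == "__main__":
    data = [0xFFFFFFFF] * 1024
    for inc in increments():
        value, count = strided_and(data, inc)
        # Byte spacing is inc * 4 for 32 bit words and inc * 8 for 64 bit words.
        print("increment %2d words, %3d bytes at 32 bits, %4d words read"
              % (inc, inc * 4, count))
```

With the largest increment, only one word is used from each 128 byte (or 256 byte) step, so the measured speed reflects burst reading of whole cache lines and not the rate at which words are consumed.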
The first observations are that, using a single CPU, the time used for each test and throughput is approximately the same using one and two threads. This is certainly not the case using two CPUs. Two copies of BusSpd2K were also run simultaneously on a Core 2 Duo CPU and showed no performance degradation.
Example L1 cache MB/second results below are for address increments of 64 bytes, reading all data (32 or 64 bit integers) and 128 bit SSE2. Note that running the 32 bit version, with two CPUs via 32 bit Windows, produces the same variations. Except for Athlon SSE2, the single thread tests execute instructions at around 1 per clock cycle (e.g. 2400 MIPS at 2400 MHz) for 32, 64 and 128 bit data (divide MB/sec by 4, 8 and 16) and are unlikely to run faster using a single register (see WhatCPU Results.htm).
The speed of the two thread tests can vary quite a bit and throughput via two CPUs can be even less than from one processor when data streaming 64 bit integers. Part of the difference is that the two threads do not finish at the same time. Other measurements (program debug option) show that this does not usually seem to affect the tests with large address increments but adjusting for this on the Core 2 Duo Read All results could increase the performance gain from 117% to 140%. It does seem that Windows is interfering with the data flow and timing is more critical using 64 bit integers.
       Core 2 Duo 2400 MHz, 64-Bit Vista         Athlon 64 X2 2210 MHz, XP x64
       32 Bit               64 Bit               32 Bit               64 Bit
      Inc64B  RdAll   SSE2 Inc64B  RdAll   SSE2 Inc64B  RdAll   SSE2 Inc64B  RdAll   SSE2
1       8999   9263  37310  16736  17449  37143   8462   9958  17391  16078  17443  17358
2      11341  17116  66635  15512  20354  71247  10367  18515  34466  13348  21414  34617
%        126    185    179     93    117    192    123    186    198     83    123    199
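The conversion from MB/second to instructions per clock cycle mentioned earlier (divide by 4, 8 or 16 bytes per instruction, then by the CPU MHz) can be checked against the single thread Read All and SSE2 figures in the table above. The function names are hypothetical and the clock speeds are the nominal 2400 and 2210 MHz quoted in the table headings.

```python
def mips(mb_per_sec, bytes_per_instruction):
    """Millions of read instructions per second for a given data width."""
    return mb_per_sec / float(bytes_per_instruction)


def instructions_per_clock(mb_per_sec, bytes_per_instruction, mhz):
    """Read instructions executed per CPU clock cycle."""
    return mips(mb_per_sec, bytes_per_instruction) / mhz


if __name__ == "__main__":
    print("Core 2 Duo 32 bit RdAll %.2f" % instructions_per_clock(9263, 4, 2400))
    print("Core 2 Duo 64 bit RdAll %.2f" % instructions_per_clock(17449, 8, 2400))
    print("Core 2 Duo SSE2         %.2f" % instructions_per_clock(37310, 16, 2400))
    print("Athlon 64  SSE2         %.2f" % instructions_per_clock(17391, 16, 2210))
```

The Core 2 Duo figures all come out a little under 1 per clock, whereas the Athlon SSE2 result is about half that, which is the exception noted in the text.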
RandMP32 and RandMP64 Two Thread Tests - The program uses the same code for serial and random use via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM, firstly as a single thread and secondly using two threads. The 64 bit compilation uses the same 32 bit integer arrays as the 32 bit version and resultant speeds are generally the same. For this benchmark, the two threads share the same data array but one starts half way through. Examples of MB/second results from Randmem Results.htm are below.
The main observations are the reduction in throughput when writing and reading cache based data in two CPUs, particularly for random access. Here, Windows will be updating data in RAM to maintain integrity (a cache killer). The impact of the Core 2 Duo shared L2 cache is demonstrated, with a lower percentage increase on reading but a higher one with writing and reading, plus some gains on RAM based data.
       Core 2 Duo 2400 MHz, 800 MHz RAM          Athlon 64 X2 2210 MHz, 400 MHz RAM
       64-Bit Vista                              Windows XP x64
       L1 Cache      L2 Cache      RAM           L1 Cache      L2 Cache      RAM
CPUs      RD     RW     RD     RW     RD     RW     RD     RW     RD     RW     RD     RW
1       8742   8428   7498   7665   4417   2442   8552   4346   5115   2702   2344   1354
2      16895   4017  14517  13720   7986   3211  16906   2173  10240   2199   4070   1719
%        193     48    194    179    181    131    198     50    200     81    174    127

1       8918   8014   4244   3390    638    418   8176   4384   3733   2865    255    161
2      17174   1463   7008   2807    905    559  16301    989   7384    897    251    173
%        193     18    165     83    142    134    199     23    198     31     98    107
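The arrangement described for RandMP, where two threads share one data array and the second starts half way through, can be sketched as below. This only illustrates the access pattern; the names are hypothetical and, as Python threads do not run this kind of loop in parallel on two CPUs, it says nothing about timing or the cache effects measured above.

```python
import threading


def read_pass(data, start, results, slot):
    """Read every element once, beginning at start and wrapping round."""
    size = len(data)
    total = 0
    for offset in range(size):
        total += data[(start + offset) % size]
    results[slot] = total


def run_two_threads(data):
    """Run two readers over the same array, the second from the half way point."""
    results = [0, 0]
    threads = [
        threading.Thread(target=read_pass, args=(data, 0, results, 0)),
        threading.Thread(target=read_pass, args=(data, len(data) // 2, results, 1)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    print(run_two_threads(list(range(1000))))
```

Both threads read the whole array, so on a real dual core run each cache line is eventually wanted by both CPUs, and in the Read/Write tests it is also modified by both.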
Disk Drives
The disk on the Core 2 Duo/Vista PC is partitioned at 200 GB + 200 GB. On running the DiskGraf and CDDVDSpd benchmarks, it was anticipated that the C: drive would be faster than D:. This was shown to be true for large files with large block sizes, but small files and those with small block sizes were slower on C:. CDDVDSpd writes and reads one large file of selected size and 520 small files that occupy the same amount of space. Below are some results for small files, including average milliseconds per file. For more details see 64 Bit Disk Tests.htm, DiskGraf Results.htm and CDDVDSpd Results.htm.
Tests were run using real data via copying and pasting a folder containing downloaded HTML documents with lots of tiny GIF files - 18.1 MB, 28.4 MB on disk, 3636 files in 277 folders. Example copying times were 41 seconds on C: and 15 seconds on D:.
After the above, the same tests were run on a Pentium 4 PC using Windows XP and the same effects were observed but performance differences were not as great.
      C: Partition                    D: Partition
                        Per File                        Per File
  KB   Write    Read   Write    Read   Write    Read   Write    Read
        MB/s    MB/s   msecs   msecs    MB/s    MB/s   msecs   msecs
   2    0.55    0.50     3.6     4.0    2.26    8.30     0.9     0.2
   4    1.14    1.12     3.5     3.6    7.00   14.71     0.6     0.3
   8    1.86    1.69     4.3     4.7   14.46   26.89     0.6     0.3
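The milliseconds per file columns above appear to follow directly from the file size and the MB/second figure, taking 1 MB as 1000 KB. That is an inference from the numbers, and the function names are hypothetical. The same arithmetic is applied to the folder copying test, where 3636 files took 41 seconds on C: and 15 seconds on D:.

```python
def msecs_per_file(file_kb, mb_per_sec):
    """Milliseconds to transfer one file, assuming 1 MB = 1000 KB.

    seconds = (file_kb / 1000) / mb_per_sec, so msecs = file_kb / mb_per_sec.
    """
    return file_kb / float(mb_per_sec)


def copy_msecs_per_file(total_seconds, file_count):
    """Average milliseconds per file for a folder copy."""
    return 1000.0 * total_seconds / file_count


if __name__ == "__main__":
    print("C: 2 KB write %.1f msecs" % msecs_per_file(2, 0.55))
    print("D: 8 KB read  %.1f msecs" % msecs_per_file(8, 26.89))
    print("C: copy       %.1f msecs per file" % copy_msecs_per_file(41, 3636))
    print("D: copy       %.1f msecs per file" % copy_msecs_per_file(15, 3636))
```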
Graphics
My old Windows, DirectDraw, and OpenGL benchmarks were converted to run at 64 bits and to compile with a more modern 32 bit compiler. A new DirectX 9 benchmark was also produced, as the Direct3D functions used were no longer supported. The benchmarks were run via 64-Bit Vista on a Core 2 Duo CPU, following earlier tests using Windows XP Pro x64 and a dual core Athlon 64 CPU. The Vista tests included running them on Aero and Classic Desktops.
The only real problem was that the programmed WaitForVerticalBlank (VSYNC) failed to synchronise to the monitor refresh rate on the old DirectDraw and D3D benchmarks, program refresh speed being 1.5 to 3 times faster than it should be. Performance via 64 bit and new 32 bit versions was mainly the same. Some different speeds were obtained with Aero and Classic Desktops in the background. Occasionally, Aero was faster but the DirectX 9 tests showed that Classic Desktop results produced an average speed gain of 12%. This might be influenced by the benchmark running in windowed mode, and not as a dedicated full screen application.
Below are results for the 64 bit DirectX 9 benchmark. For further details see - Direct3D Results.htm, DirectDraw Results.htm, OpenGL Results.htm and VideoWin Results.htm.
DirectX 9     ....................... Frames Per Second ........................
                 Shaded  WireEgg      500  Texture   Colour  Texture    Pixel   Vertex
Resolution          Egg    Vsync    Cubes   Tunnel  Objects  Objects  Shader2  Shader2
 640  480 32     3244.0     60.0     95.7    953.3   1336.8    877.8    868.7    894.8
 800  600 32     2479.9     60.0     74.1    673.5   1108.4    642.1    635.6    834.7
1024  768 32     1514.2     60.0     60.9    501.2    845.5    475.5    472.4    673.6
1280 1024 32      848.7     60.0     42.6    323.2    545.2    291.2    290.9    431.3

 640  480 32     3596.4     60.0    111.2   1086.0   1532.4    996.2    963.7    958.5
 800  600 32     2760.9     60.0     89.2    780.7   1209.4    749.0    737.4    931.2
1024  768 32     1832.7     60.0     67.6    529.7    836.5    495.2    503.0    690.4
1280 1024 32     1131.0     60.0     48.9    356.8    571.0    327.1    326.1    461.8
Floating Point
My original benchmarks that measure CPU floating point performance were converted and compiled for 64 bit working. These have to use SSE and SSE2 floating point instead of the old x87 instructions. Other versions were produced for 32 bit working, with options set to use SSE and SSE2. They were first produced to run on an Athlon 64 X2 CPU via Windows XP Pro x64 and all ran successfully on a Core 2 Duo processor using 64-Bit Vista.
When using assembly code, the processing speed of SSE and SSE2 floating point instructions is shown to be much faster than the old x87 variety. Unfortunately, the compiler used did not implement Single Instruction Multiple Data (SIMD) instructions properly, only using one variable in registers - Single Instruction Single Data (SISD). The result is that maximum speed could be expected to be halved on 64 bit double precision SSE2 operation and reduced to a quarter using 32 bit single precision SSE instructions. In some cases, this leads to programs using the old x87 floating point instructions being faster than SSE/SSE2 varieties.
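The factors of two and four quoted above come from the number of values that fit in one 128 bit SSE register. A minimal sketch of that arithmetic, with a hypothetical function name:

```python
def simd_lanes(element_bits, register_bits=128):
    """Number of floating point values processed by one SIMD instruction."""
    return register_bits // element_bits


if __name__ == "__main__":
    print("Double precision SSE2:", simd_lanes(64), "values per instruction")
    print("Single precision SSE: ", simd_lanes(32), "values per instruction")
```

SISD code uses only one of these lanes, so the expected maximum speed is divided by the lane count.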
The major surprise was that Core 2 Duo demonstrated particularly slow performance on some 64 bit compilations that produce SSE2 instructions, where the Athlon 64 could be up to twice as fast.
These slow results were from the original 2006 versions. They were corrected in 2009 using a later compiler.
Below are results for the Linpack and Livermore Loops benchmarks - see Linpack Results.htm and Livermore Loops Results.htm.
Other benchmarks using floating point generally show superior Core 2 Duo performance - see WhatCPU Results.htm, FFTGraf Results.htm, Whetstone Results.htm, SSE3DNow Results.htm and MemSpd2K Results.htm.
Linpack Benchmark - Results in MFLOPS
                                64 Bit    32 Bit  Original
Core 2 Duo 2400 MHz, Vista         823      1480      1315
Core 2 Duo 2009 compilation       1602
Athlon 64 2210 MHz, XP x64        1044      1014       838
Athlon 64 2009 compilation        1091
Livermore Loops Benchmark - Results in MFLOPS
               64 Bit               32 Bit               Original
                Max   Mean    Min    Max   Mean    Min    Max   Mean    Min
Core 2 Duo     1175    537    227   2526    804    195   2236    539     52
2009 comp      2799    893    261
Athlon 64      2284    661    166   1908    641    162   2566    461     48
2009 comp      2068    679    171
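The relative speeds implied by the two tables above can be worked out as simple ratios. The sketch below uses a hypothetical helper, takes its figures from the tables and assumes that the 2009 compilation rows are 64 bit results, as the text suggests.

```python
def speed_ratio(faster_mflops, slower_mflops):
    """How many times faster one MFLOPS result is than another."""
    return faster_mflops / float(slower_mflops)


if __name__ == "__main__":
    # Original 64 bit compilations, Athlon 64 against Core 2 Duo
    print("Linpack   Athlon/Core 2 Duo %.2f" % speed_ratio(1044, 823))
    print("Livermore Athlon/Core 2 Duo %.2f (maximum)" % speed_ratio(2284, 1175))
    # Core 2 Duo, 2009 compilation against the original 64 bit version
    print("Linpack   2009/original     %.2f" % speed_ratio(1602, 823))
```

The Livermore Loops maximum shows the Athlon 64 at nearly twice the Core 2 Duo speed, and the 2009 Linpack recompilation nearly doubles the Core 2 Duo result.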
SSE3DNow and MemSpeed (latest MemSpd2K) carry out the same floating point data streaming instructions to measure performance via caches and RAM. The former uses SIMD assembly code and the latter was compiled using the original x87 instructions. This was recompiled at 64 bits and 32 bits, using SSE/SSE2 options. Results are below, showing slow performance on Core 2 Duo at 64 bits.
The disassembled code was examined and the main difference was that the movsd load instruction was used with the 32 bit compilation and movlpd at 64 bits (as with Linpack and Livermore Loops benchmarks). Assembly code using SSE2 instructions was constructed for the first double precision test, to prove that movlpd produced slower results - see below. A later search found Intel Documentation, confirming that there are complications concerning use of the same register following movlpd, which can cause pipeline stalls.
This was also corrected in the 2009 recompilation using the movsdx instruction.
Maximum Speeds in MFLOPS
                   Core 2 Duo 2400 MHz, Vista      Athlon 64 2210 MHz, XP x64
                   s=s+x[m]*y[m]  x[m]=x[m]+y[m]   s=s+x[m]*y[m]  x[m]=x[m]+y[m]
                    Dble    Sngl    Dble    Sngl    Dble    Sngl    Dble    Sngl
Assembled SIMD      3166    6347    2340    4692    1999    3998    1011    2140
32 bit SISD         1053    1059    1173    1147     673     727     970     878
64 bit SISD          761    1275     398     943     891     918     808     735
2009 compile        1260    1270    1172    1188     976     978     976     979
x87 FPU             1591    1593    1180    1177    1047    1100     853    1085
Assembly Code Experiments
Code used by 32 bit compiler            Code used by 64 bit compiler
Slower due to unnecessary save          Slow due to movlpd
movsd xmm0, QWORD PTR [rax+rdi]         movlpd xmm1, QWORD PTR [rdi+rax*8]
mulsd xmm0, QWORD PTR [rcx+rdi]         movlpd xmm0, QWORD PTR [r8+rax*8+8]
addsd xmm0, QWORD PTR [rbx]             add rax, 4
movsd xmm1, QWORD PTR [rax+rdi+8]       cmp rax, rcx
mulsd xmm1, QWORD PTR [rcx+rdi+8]       mulsd xmm0, QWORD PTR [rdi+rax*8-24]
addsd xmm0, xmm1                        mulsd xmm1, QWORD PTR [r8+rax*8-32]
movsd xmm1, QWORD PTR [rax+rdi+16]      addsd xmm1, xmm2
mulsd xmm1, QWORD PTR [rcx+rdi+16]      movsd xmm2, xmm1
addsd xmm0, xmm1                        addsd xmm2, xmm0
movsd xmm1, QWORD PTR [rax+rdi+24]      movlpd xmm0, QWORD PTR [r8+rax*8-16]
mulsd xmm1, QWORD PTR [rcx+rdi+24]      mulsd xmm0, QWORD PTR [rdi+rax*8-16]
add rdi, 32                             addsd xmm2, xmm0
sub rdx, 4                              movlpd xmm0, QWORD PTR [r8+rax*8-8]
addsd xmm0, xmm1                        mulsd xmm0, QWORD PTR [rdi+rax*8-8]
movsd QWORD PTR [rbx], xmm0             addsd xmm2, xmm0
jg lp                                   jl lp
2400 MHz Core 2 Duo 1065 MFLOPS         2400 MHz Core 2 Duo 767 MFLOPS
2110 MHz Athlon 64 696 MFLOPS           2110 MHz Athlon 64 958 MFLOPS
Modified 64 bit code using movsd        Modified 64 bit code using movlpd
Faster using movsd                      Fastest using movlpd to 4 registers
movsd xmm1, QWORD PTR [rdi+rax*8]       movlpd xmm0, QWORD PTR [rax+rdi]
movsd xmm0, QWORD PTR [r8+rax*8+8]      movlpd xmm1, QWORD PTR [rax+rdi+8]
add rax, 4                              movlpd xmm2, QWORD PTR [rax+rdi+16]
cmp rax, rcx                            movlpd xmm3, QWORD PTR [rax+rdi+24]
mulsd xmm0, QWORD PTR [rdi+rax*8-24]    mulsd xmm0, QWORD PTR [rcx+rdi]
mulsd xmm1, QWORD PTR [r8+rax*8-32]     mulsd xmm1, QWORD PTR [rcx+rdi+8]
addsd xmm1, xmm2                        mulsd xmm2, QWORD PTR [rcx+rdi+16]
movsd xmm2, xmm1                        mulsd xmm3, QWORD PTR [rcx+rdi+24]
addsd xmm2, xmm0                        addsd xmm4, xmm0
movsd xmm0, QWORD PTR [r8+rax*8-16]     addsd xmm4, xmm1
mulsd xmm0, QWORD PTR [rdi+rax*8-16]    add rdi, 32
addsd xmm2, xmm0                        sub rdx, 4
movsd xmm0, QWORD PTR [r8+rax*8-8]      addsd xmm4, xmm2
mulsd xmm0, QWORD PTR [rdi+rax*8-8]     addsd xmm4, xmm3
addsd xmm2, xmm0                        jg lp
2400 MHz Core 2 Duo 1267 MFLOPS         2400 MHz Core 2 Duo 1448 MFLOPS
2110 MHz Athlon 64 964 MFLOPS           2110 MHz Athlon 64 1085 MFLOPS
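Each of the four loops above carries out four multiplies and four adds per pass, so the MFLOPS results can be converted to approximate clock cycles per loop pass. The sketch below uses a hypothetical function name, takes the flop count from the listings and uses the quoted 2400 MHz Core 2 Duo clock speed.

```python
FLOPS_PER_PASS = 8  # four mulsd plus four addsd in each unrolled loop


def cycles_per_pass(mhz, mflops, flops=FLOPS_PER_PASS):
    """Approximate CPU clock cycles used by one pass of the loop."""
    return flops * mhz / float(mflops)


if __name__ == "__main__":
    results = [
        ("32 bit compiler code", 1065),
        ("64 bit compiler code, movlpd", 767),
        ("Modified, movsd", 1267),
        ("Modified, movlpd to 4 registers", 1448),
    ]
    for name, mflops in results:
        print("%-32s %5.1f cycles" % (name, cycles_per_pass(2400, mflops)))
```

On this basis, the compiler's movlpd version needs about 25 clock cycles per pass on the Core 2 Duo, against about 13 for the version loading four separate registers.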