OpenMP Parallel Computing Benchmarks
Summary
OpenMP is a system independent set of directives and procedures that arranges automatic parallel processing of shared memory data when more than one processor is available. This option is available in the latest Microsoft C++ compilers.
The benchmark executes the same functions, using the same data sizes, as the
CUDA Graphics GPU Parallel Computing Benchmark,
with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code.
A run time Affinity option is available to execute the benchmark on a selected single processor.
The benchmarks demonstrate a near doubling of performance, using dual core processors, when not limited by memory speed and when the source code is compatible with Single Instruction Multiple Data (SIMD) operation. All that is needed for the speed increase is an extra directive in the source code (implying parallelise this) and a compilation parameter.
Later tests show up to four times faster speeds using a quad core processor.
Potential performance gains due to hardware SIMD with SSE instructions are not realised due to compiler limitations and this enhances the comparative benefit of CUDA GPU parallel processing. On the other hand, the benchmark, compiled for 64 bit working, demonstrates significant speed improvement using the eight additional SSE registers that are available.
It also appears that certain compiler optimisations, such as loop unrolling, cannot be applied when OpenMP is used.
The benchmarks identify three slightly different numeric results on tests using SSE, old i387 and CUDA floating point instructions. Results output has been revised to provide more detail.
Other benchmarks have been converted to run using OpenMP and are described in
OpenMP Speeds.htm.
Observations are that performance with smaller data arrays can be extremely slow, due to high startup overheads, and wrong numeric results can be produced with careless use of OpenMP directives.
The benchmarks can be downloaded via
OpenMPMflops.zip.
No installation is necessary - Extract All and click on OpenMP32MFLOPS.exe or OpenMP64MFLOPS.exe, but see ReadMe.txt first. The ZIP file also includes the C++ source code.
The OpenMP tests have also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software) - see
linux benchmarks.htm,
linux openmp benchmarks.htm
and download benchmark execution files, source code, compile and run instructions in
linux_openmp.tar.gz.
Under Windows the file downloaded wrongly as linux_openmp.tar.tar, but was fine when renamed to linux_openmp.tar.gz.
See
GigaFLOPS Benchmarks.htm
for further details and results, including comparisons with MP MFLOPS, a threaded C version, CUDA MFLOPS, for GeForce graphics processors, and Qpar MFLOPS, where Qpar is Microsoft’s proprietary equivalent of OpenMP and faster via Windows. The benchmarks and source codes can be obtained via
gigaflops-benchmarks.zip.
To Start
General
OpenMP is a system independent set of directives and procedures that arranges parallel processing of shared memory data. This option is available in the latest Microsoft C++ compilers.
In this case, 32 bit and 64 bit versions used were from the free
Windows Driver Kit Version 7.0.0.
For OpenMP, Microsoft Visual C++ 2008
Redistributable Packages for x86 and x64
were also downloaded.
For comparison purposes, the OpenMP benchmarks execute the same functions as the CUDA tests - see
Benchmark Details.
The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words.
All that is required to arrange for the code to be run on more than one CPU is a simple directive:
#pragma omp parallel for
for(i=0; i < n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
There are some issues with the Microsoft compilers that limit performance. With SSE instructions, each hardware register can hold four data words, permitting, for example, four simultaneous adds - Single Instruction Multiple Data (SIMD) operation. With OpenMP directives, the compilers appear to generate only single data instructions (SISD), operating on 32 bits out of the 128 bits provided.
Using processors at 64 bits, the old i387 instructions are not available and SSE types have to be used but more registers are available for optimisation.
The 64 bit version of the benchmark at least demonstrates more than one floating point result per CPU clock cycle (linked add and multiply?).
Results are provided for an Athlon 64 X2 with XP x64, Core 2 Duo processors using 32-Bit and 64-Bit Vista, a Phenom II X4 via 64-Bit Windows 7 and a Core i7 again using Windows 7. On one CPU of a 2.4 GHz Core 2 Duo, up to 3.5 GFLOPS is produced or 6.8 GFLOPS using both processors.
Corresponding results for a four processor 3 GHz Phenom II are 3.7 and 14.5 GFLOPS.
The quad core Core i7 results are difficult to interpret. The first issue is Hyperthreading, which allows eight threads to run at the same time and could have some impact even with purely floating point calculations. The main problem is Turbo Boost, where a single CPU can run much faster than its rated MHz; even all four processors can exceed the rated speed if not too hot. Results provided are for two 2.8 GHz i7 processors with different Turbo Boost speeds of up to 3.066 GHz and 3.466 GHz.
At 32 bits, the latest compilers refuse to obey the /arch:SSE parameter and produce only i387 floating point instructions. The ZIP file contains SSE32MFLOPS.exe, a single processor version, produced for SSE operation via an earlier compiler. Some results are given below.
The benchmark can be downloaded via
OpenMPMflops.zip.
No installation is necessary - Extract All and click on OpenMP32MFLOPS.exe or OpenMP64MFLOPS.exe, but see ReadMe.txt first. The ZIP file also includes the C++ source code.
The benchmarks have run time parameters to change the number of words used and repeat passes that might need adjusting for timing purposes. There is also an option to select a single processor via an Affinity setting. BAT files containing examples of run time parameters are in the ZIP file.
To Start
Example Log Files
The CUDA graphics parallel computing benchmark
has three sets of tests, two of which do not involve transferring data to and/or from the host CPU's memory. The tests here can be compared with the CUDA "Data in & out" test. Below is a sample log file for the 64 bit version on a 2.4 GHz Core 2 Duo via Vista. The second set of results is for a single selected CPU.
64 Bit OpenMP MFLOPS Benchmark 1 Fri Oct 02 10:21:19 2009
Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.194304 2573 0.929538 Yes
Data in & out 1000000 2 250 0.193139 2589 0.992550 Yes
Data in & out 10000000 2 25 0.415691 1203 0.999250 Yes
Data in & out 100000 8 2500 0.312285 6404 0.957117 Yes
Data in & out 1000000 8 250 0.335818 5956 0.995517 Yes
Data in & out 10000000 8 25 0.473814 4221 0.999549 Yes
Data in & out 100000 32 2500 1.488048 5376 0.890211 Yes
Data in & out 1000000 32 250 1.891056 4230 0.988082 Yes
Data in & out 10000000 32 25 1.185456 6748 0.998796 Yes
64 Bit OpenMP MFLOPS Benchmark 1 Fri Oct 02 10:21:31 2009
Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
Single CPU Affinity 1
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.313641 1594 0.929538 Yes
Data in & out 1000000 2 250 0.317088 1577 0.992550 Yes
Data in & out 10000000 2 25 0.431107 1160 0.999250 Yes
Data in & out 100000 8 2500 0.584243 3423 0.957117 Yes
Data in & out 1000000 8 250 0.594728 3363 0.995517 Yes
Data in & out 10000000 8 25 0.605958 3301 0.999549 Yes
Data in & out 100000 32 2500 2.268676 3526 0.890211 Yes
Data in & out 1000000 32 250 2.261049 3538 0.988082 Yes
Data in & out 10000000 32 25 2.270906 3523 0.998796 Yes
Hardware Information
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows Information
AMD64 processor architecture, 2 CPUs
Windows NT Version 6.0, build 6002, Service Pack 2
Memory 4095 MB, Free 2854 MB
User Virtual Space 8388608 MB, Free 8388560 MB
To Start
Results From Different Versions
Following are results of the single processor SSE test, the 32 bit i387 OpenMP benchmark, the 64 bit SSE OpenMP version and MFLOPS obtained using CUDA.
The latter are given for tests with copying data from/to host RAM or to/from graphics RAM and those for graphics processor calculations without external data transfers.
Systems used are a Core 2 Duo with 64-Bit Vista and an AMD Athlon 64 X2 using XP x64, followed by a Pentium 4 with 32-Bit XP and a Core 2 Duo laptop with 32-Bit Vista.
Later results are for a quad core Phenom II CPU using 64-Bit Windows 7 and a much faster graphics card.
Even later are for a quad core Intel i7 processor, with a top end graphics card and, again, using 64-Bit Windows 7. This processor can use Hyperthreading and appears to Windows as having eight CPUs.
Latest results are for a Core i5 dual that also has Hyperthreading.
Single, Dual and Quad CPUs - Appropriate performance gains are obvious on increasing the number of calculations per memory access. With two calculations per word there can be little gain using more than one CPU, as performance is limited by main memory speed.
Some results on the AMD Athlon CPU reflect the smaller, slower L2 cache.
CPU and GPU - Particularly as the compiler used does not fully implement SSE SIMD instructions, the GPU CUDA operations can be attractively fast, the latest results showing up to 24 times faster using a GTX 480.
SSE and i387 - Again because of compiler limitations, the old i387 floating point instructions can produce comparable performance, in some cases.
32 and 64 Bit SSE - Faster performance using a 64-Bit compilation could be expected, due to the availability of more registers for optimisation, but this is not always the case. Examination of actual intermediate machine code instructions can provide an explanation (see below).
Hyperthreading - This does not appear to improve maximum throughput of the four core i7 by much more than four times. Using one or two threads, the processors are likely to be running at the Turbo Boost speed of 3066 MHz but falling back to 2933 MHz with four threads (or 2800 MHz if hot), reducing relative performance.
For more details and Hyperthreading results with other benchmarks see
Quad Core 8 Thread.htm.
Core i7 930 2.8 GHz increased by Turbo Boost up to 3.066 GHz using
1 CPU and up to 2.933 GHz using 4 CPUs - Windows 7 64 - MFLOPS
CUDA CUDA
Data Ops/ SSE i387 i387 SSE 64b SSE 64b GeFrce No I/O
Words Word 1 CPU 1 CPU 4/8 CPU 1 CPU 4/8 CPU GTX480 GTX480
100000 2 3567 1248 4455 1574 4001 521 5554
1000000 2 3529 1420 5433 1861 4919 819 21493
10000000 2 2388 1364 3038 1735 3076 1014 31991
100000 8 4655 2337 8798 3794 14581 2058 20129
1000000 8 4642 2413 9813 4149 17080 3306 82132
10000000 8 4453 2436 9581 4011 12457 4057 125413
100000 32 3328 2957 12020 4324 16786 7768 52230
1000000 32 3329 3011 12339 4436 17599 13190 254306
10000000 32 3307 3003 12432 4418 17576 16077 425237
Phenom II X4 3.0 GHz, Windows 7 64 - MFLOPS
CUDA CUDA
Data Ops/ SSE i387 i387 SSE 64b SSE 64b GeFrce No I/O
Words Word 1 CPU 1 CPU 4 CPU 1 CPU 4 CPU GTS250 GTS250
100000 2 3552 1920 5587 1822 5613 328 3054
1000000 2 3268 1919 5585 1870 7056 625 9672
10000000 2 1861 1625 2993 1563 2972 714 13038
100000 8 4535 2115 7763 3637 12653 1336 12233
1000000 8 4341 2108 7975 3709 14518 2382 39481
10000000 8 4141 2100 8062 3543 11273 2949 51199
100000 32 4012 2566 9675 3652 14092 5142 36080
1000000 32 3981 2552 10091 3663 14510 9427 108170
10000000 32 3941 2510 9902 3633 14034 11182 135041
Core 2 Duo 2.4 GHz, Vista 64 - MFLOPS
CUDA CUDA
Data Ops/ SSE i387 i387 SSE 64b SSE 64b GeFrce No I/O
Words Word 1 CPU 1 CPU 2 CPU 1 CPU 2 CPU 8600GT 8600GT
100000 2 2524 1599 2660 1594 2573 215 1770
1000000 2 2353 1617 2957 1577 2589 342 3479
10000000 2 1158 1180 1136 1160 1203 417 3874
100000 8 3647 2063 3948 3423 6404 886 6931
1000000 8 3445 2070 3624 3363 5956 1371 13250
10000000 8 3231 2058 3962 3301 4221 1661 14281
100000 32 2590 2653 4909 3526 5376 3329 16583
1000000 32 2659 2658 4580 3538 4230 5019 27027
10000000 32 2663 2649 5183 3523 6748 5975 28923
Core i5-2467M 1.6 GHz to 2.3 GHz Turbo Boost
Dual Core + Hyperthreading, Windows 7 - MFLOPS
Data Ops/ SSE i387 SSE 64b
Words Word 1 CPU 2 CPU 2 CPU
100000 2 1611 975 1613
1000000 2 2247 2100 1917
10000000 2 1625 1603 1681
100000 8 2829 2621 3524
1000000 8 3248 2756 3604
10000000 8 3458 2844 5377
100000 32 3308 3691 4032
1000000 32 3330 3994 4178
10000000 32 3322 4898 5041
AMD Athlon 64 X2 2.2 GHz, XP x64 - MFLOPS
A64 A64 A64 A64 A64
Data Ops/ SSE i387 i387 SSE 64b SSE 64b
Words Word 1 CPU 1 CPU 2 CPU 1 CPU 2 CPU
100000 2 1304 1060 1961 1114 2015
1000000 2 659 639 812 638 817
10000000 2 665 640 837 636 831
100000 8 2084 1495 2922 1942 3783
1000000 8 1853 1369 2629 1692 3058
10000000 8 1861 1376 2701 1706 3110
100000 32 2488 1852 3428 1731 3254
1000000 32 2439 1813 3614 1793 3369
10000000 32 2443 1818 3629 1774 3443
32 Bit Windows Pentium 4 Core 2 Duo Laptop Atom Netbook
MFLOPS MFLOPS MFLOPS
CPU P4 P4 C2D C2D C2D Atom Atom Atom
MHz 1900 1900 1829 1829 1829 1600 1600 1600
XP32 XP32 V32 V32 V32 XP32 XP32 XP32
Data Ops/ SSE i387 SSE i387 i387 SSE i387 i387
Words Word 1 CPU 1 CPU 1 CPU 1 CPU 2 CPU 1 CPU No HT HT
100000 2 221 223 1811 1201 2063 264 175 323
1000000 2 224 224 673 650 630 259 185 311
10000000 2 204 206 651 668 650 258 189 331
100000 8 835 742 2648 1558 2773 409 257 460
1000000 8 817 699 2326 1529 2568 406 263 443
10000000 8 764 771 2331 1508 2645 406 265 475
100000 32 1160 1017 1935 1978 3627 457 369 679
1000000 32 1163 1025 1970 1977 3719 456 371 679
10000000 32 1165 1029 2015 1921 3727 456 372 677
Single processor Atom i387 results are shown with Hyperthreading off and on
To Start
64-Bit Comparisons
Following are OpenMP benchmark results for the version compiled for 64 bit working, with performance gains shown when using multiple processors.
These gains are lowest using 10M words (40 MB) with an add and a multiply for each word read, limited by RAM speed. There is generally no such limitation with 32 operations per word at all data sizes.
These results include those for two 2.8 GHz Core i7 CPUs that have different Turbo Boost characteristics. In this case, the i7 860 had been detuned and, based on results with 32 operations per word, single CPU tests suggest that both were running at around 3 GHz, with Core i7/Core 2 measured speed ratios similar to MHz ratios (3066/2400 = 4510/3530). The i7 860 has faster RAM, affecting tests with fewer operations per word.
64 Bit OpenMP Benchmark MFLOPS
Athlon 64 x2 Core 2 Duo
Data Ops/ SSE 64b SSE 64b Gain SSE 64b SSE 64b Gain
Words Word 1 CPU 2 CPU 1 CPU 2 CPU
100000 2 1114 2015 1.8 1594 2573 1.6
1000000 2 638 817 1.3 1577 2589 1.6
10000000 2 636 831 1.3 1160 1203 1.0
100000 8 1942 3783 1.9 3423 6404 1.9
1000000 8 1692 3058 1.8 3363 5956 1.8
10000000 8 1706 3110 1.8 3301 4221 1.3
100000 32 1731 3254 1.9 3526 5376 1.5
1000000 32 1793 3369 1.9 3538 4230 1.2
10000000 32 1774 3443 1.9 3523 6748 1.9
Phenom II Core i7 860 Core i7 930
Data Ops/ SSE 64b SSE 64b Gain SSE 64b SSE 64b Gain SSE 64b SSE 64b Gain
Words Word 1 CPU 4 CPU 1 CPU 4 CPU 1 CPU 4 CPU
100000 2 1822 5613 3.1 1661 4263 2.6 1574 4001 2.5
1000000 2 1870 7056 3.8 1922 5142 2.7 1861 4919 2.6
10000000 2 1563 2972 1.9 1824 3838 2.1 1735 3076 1.8
100000 8 3637 12653 3.5 3939 13804 3.5 3794 14581 3.8
1000000 8 3709 14518 3.9 4251 18082 4.3 4149 17080 4.1
10000000 8 3543 11273 3.2 4133 15079 3.6 4011 12457 3.1
100000 32 3652 14092 3.9 4438 16299 3.7 4324 16786 3.9
1000000 32 3663 14510 4.0 4512 18081 4.0 4436 17599 4.0
10000000 32 3633 14034 3.9 4493 17752 4.0 4418 17576 4.0
i7 860 2.8 GHz, Turbo Boost possible to 3.47 GHz using 1 CPU to 2.93 GHz using 4
i7 930 2.8 GHz, Turbo Boost possible to 3.07 GHz using 1 CPU to 2.93 GHz using 4
To Start
Assembly Code
The benchmarks were compiled using the /Fa option, which produces a file containing an assembly code listing. These listings show significant differences between 64 bit and 32 bit compilations, and also depending on whether the /openmp option is included.
The most obvious difference is when using two operations per word, where 32 bit compilation unrolls the loop (using x[i], x[i+1], x[i+2] and x[i+3] with four times as many calculations). This results in some much faster speeds for the 32 bit version. A further 64 bit compilation, without /openmp, included unrolling.
At the other extreme, where 64 bit compilation is much faster, memory accesses are reduced by using the additional registers. These accesses are indicated by instructions such as addss xmm6, DWORD PTR _g$[esp], and the extra registers by xmm8 to xmm15 (24 registers are really needed - CUDA has more).
64 Bit SSE Instructions 32 Bit SSE Instructions
2 Operations Per Word
for(i=0; i< n; i++) x[i]=(x[i]+a)*b;
$LL6@triad$omp$: $L56949:
; Line 77 ; Line 77
movaps xmm0, xmm1 movss xmm2, DWORD PTR [eax-8]
add rax, 4 addss xmm2, xmm1
sub rcx, 1 mulss xmm2, xmm0
addss xmm0, DWORD PTR [rax-4] movss DWORD PTR [eax-8], xmm2
mulss xmm0, xmm2 movaps xmm2, xmm1
movss DWORD PTR [rax-4], xmm0 addss xmm2, DWORD PTR [eax-4]
jne SHORT $LL6@triad$omp$ mulss xmm2, xmm0
movss DWORD PTR [eax-4], xmm2
movss xmm2, DWORD PTR [eax]
addss xmm2, xmm1
mulss xmm2, xmm0
movss DWORD PTR [eax], xmm2
movss xmm2, DWORD PTR [eax+4]
addss xmm2, xmm1
mulss xmm2, xmm0
movss DWORD PTR [eax+4], xmm2
add eax, 16
dec edx
jne SHORT $L56949
8 Operations Per Word
for(i=0; i< n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
$LL6@triadplus$: $L56942:
; Line 69 ; Line 69
movss xmm1, DWORD PTR [rcx] movss xmm6, DWORD PTR [eax-8]
add rcx, 4 movss xmm7, DWORD PTR [eax-8]
sub rax, 1 addss xmm7, xmm3
movaps xmm2, xmm1 mulss xmm7, xmm2
movaps xmm0, xmm1 addss xmm6, xmm5
addss xmm1, xmm7 mulss xmm6, xmm4
addss xmm2, xmm3 subss xmm6, xmm7
addss xmm0, xmm5 movss xmm7, DWORD PTR [eax-8]
mulss xmm1, xmm8 addss xmm7, xmm1
mulss xmm2, xmm4 mulss xmm7, xmm0
mulss xmm0, xmm6 addss xmm6, xmm7
subss xmm2, xmm0 movss DWORD PTR [eax-8], xmm6
addss xmm2, xmm1 movaps xmm6, xmm5
movss DWORD PTR [rcx-4], xmm2 addss xmm6, DWORD PTR [eax-4]
jne SHORT $LL6@triadplus$ mulss xmm6, xmm4
movaps xmm7, xmm3
addss xmm7, DWORD PTR [eax-4]
mulss xmm7, xmm2
subss xmm6, xmm7
movaps xmm7, xmm1
addss xmm7, DWORD PTR [eax-4]
mulss xmm7, xmm0
addss xmm6, xmm7
movss xmm7, DWORD PTR [eax]
movss DWORD PTR [eax-4], xmm6
movss xmm6, DWORD PTR [eax]
addss xmm7, xmm3
mulss xmm7, xmm2
addss xmm6, xmm5
mulss xmm6, xmm4
subss xmm6, xmm7
movss xmm7, DWORD PTR [eax]
addss xmm7, xmm1
mulss xmm7, xmm0
addss xmm6, xmm7
movss xmm7, DWORD PTR [eax+4]
movss DWORD PTR [eax], xmm6
movss xmm6, DWORD PTR [eax+4]
addss xmm7, xmm3
addss xmm6, xmm5
mulss xmm7, xmm2
mulss xmm6, xmm4
subss xmm6, xmm7
movss xmm7, DWORD PTR [eax+4]
addss xmm7, xmm1
mulss xmm7, xmm0
addss xmm6, xmm7
movss DWORD PTR [eax+4], xmm6
add eax, 16
dec edx
jne $L56942
32 Operations Per Word
for(i=0; i< n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f-(x[i]+g)*h+(x[i]+j)*k
-(x[i]+l)*m+(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t-(x[i]+u)*v+(x[i]+w)*y;
$LL6@triadplus2: $L56934:
; Line 61 ; Line 61
movss xmm2, DWORD PTR [rbp] movss xmm5, DWORD PTR [edx+ecx*4]
add rbp, 4 addss xmm5, DWORD PTR _a$[esp]
sub r12, 1 mulss xmm5, DWORD PTR _b$[esp]
movaps xmm0, xmm2 movss xmm6, DWORD PTR [edx+ecx*4]
movaps xmm1, xmm2 addss xmm6, DWORD PTR _c$[esp]
movaps xmm3, xmm2 mulss xmm6, DWORD PTR _d$[esp]
addss xmm0, xmm6 subss xmm5, xmm6
addss xmm3, xmm4 movss xmm6, DWORD PTR [edx+ecx*4]
addss xmm1, xmm8 addss xmm6, DWORD PTR _e$[esp]
mulss xmm0, xmm7 mulss xmm6, DWORD PTR _f$[esp]
mulss xmm3, xmm5 addss xmm5, xmm6
mulss xmm1, xmm9 movss xmm6, DWORD PTR [edx+ecx*4]
subss xmm3, xmm0 addss xmm6, DWORD PTR _g$[esp]
movaps xmm0, xmm2 mulss xmm6, DWORD PTR _h$[esp]
addss xmm3, xmm1 subss xmm5, xmm6
addss xmm0, xmm10 movss xmm6, DWORD PTR [edx+ecx*4]
movaps xmm1, xmm2 addss xmm6, DWORD PTR _j$[esp]
mulss xmm0, xmm11 mulss xmm6, DWORD PTR _k$[esp]
subss xmm3, xmm0 addss xmm5, xmm6
addss xmm1, xmm12 movss xmm6, DWORD PTR [edx+ecx*4]
movaps xmm0, xmm2 addss xmm6, DWORD PTR _l$[esp]
mulss xmm1, xmm13 mulss xmm6, DWORD PTR _m$[esp]
addss xmm3, xmm1 subss xmm5, xmm6
addss xmm0, xmm14 movss xmm6, DWORD PTR [edx+ecx*4]
movaps xmm1, xmm2 addss xmm6, DWORD PTR _o$[esp]
mulss xmm0, xmm15 mulss xmm6, DWORD PTR _p$[esp]
addss xmm1, DWORD PTR [rax] addss xmm5, xmm6
subss xmm3, xmm0 movss xmm6, DWORD PTR [edx+ecx*4]
movaps xmm0, xmm2 addss xmm6, DWORD PTR _q$[esp]
mulss xmm1, DWORD PTR [rcx] mulss xmm6, DWORD PTR _r$[esp]
addss xmm0, DWORD PTR [rdx] subss xmm5, xmm6
addss xmm3, xmm1 movss xmm6, DWORD PTR [edx+ecx*4]
mulss xmm0, DWORD PTR [r8] addss xmm6, DWORD PTR _s$[esp]
movaps xmm1, xmm2 mulss xmm6, xmm4
addss xmm1, DWORD PTR [r9] addss xmm5, xmm6
subss xmm3, xmm0 movss xmm6, DWORD PTR [edx+ecx*4]
mulss xmm1, DWORD PTR [r10] addss xmm6, xmm3
movaps xmm0, xmm2 mulss xmm6, xmm2
addss xmm0, DWORD PTR [r11] subss xmm5, xmm6
addss xmm2, DWORD PTR [rdi] movss xmm6, DWORD PTR [edx+ecx*4]
mulss xmm0, DWORD PTR [rbx] addss xmm6, xmm1
mulss xmm2, DWORD PTR [rsi] mulss xmm6, xmm0
addss xmm3, xmm1 addss xmm5, xmm6
subss xmm3, xmm0 movss DWORD PTR [edx+ecx*4], xmm5
addss xmm3, xmm2 inc ecx
movss DWORD PTR [rbp-4], xmm3 cmp ecx, edi
jne $LL6@triadplus2 jl $L56934
To Start
Numeric Accuracy
The run time display and log files show the numeric results of the calculations; values produced using the same default parameters are shown below. There is some variation in rounding after calculations, differing between SSE, i387 and CUDA instructions.
4 Byte Ops Repeat SSE i387 i387 SSE 64b SSE 64b SSE 64b CUDA
Words /Wd Passes 1 CPU 1 CPU 2 CPU 1 CPU 2 CPU 4 CPU 8600GT
100000 2 2500 0.929538 0.929475 0.929475 0.929538 0.929538 0.929538 0.929538
1000000 2 250 0.992550 0.992543 0.992543 0.992550 0.992550 0.992550 0.992550
10000000 2 25 0.999250 0.999249 0.999249 0.999250 0.999250 0.999250 0.999250
100000 8 2500 0.957117 0.957164 0.957164 0.957117 0.957117 0.957117 0.956980
1000000 8 250 0.995517 0.995525 0.995525 0.995517 0.995517 0.995517 0.995509
10000000 8 25 0.999549 0.999550 0.999550 0.999549 0.999549 0.999549 0.999549
100000 32 2500 0.890211 0.890377 0.890377 0.890211 0.890211 0.890211 0.890079
1000000 32 250 0.988082 0.988102 0.988102 0.988082 0.988082 0.988082 0.988073
10000000 32 25 0.998796 0.998799 0.998799 0.998796 0.998796 0.998796 0.998799
To Start
Roy Longbottom July 2014
The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection