
OpenMP Parallel Computing Benchmarks

Contents

General
Example Log Files
Results From Different Versions
64-Bit Comparisons
Assembly Code
Numeric Accuracy

Summary

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is available. This option is provided in the latest Microsoft C++ compilers. The benchmark executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark, with variants compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code. A run time Affinity option is available to execute the benchmark on a selected single processor.

The benchmarks demonstrate a near doubling of performance using dual core processors, when not limited by memory speed and when the source code is suitable for Single Instruction Multiple Data (SIMD) operation. All that is needed for the speed increase is an extra directive in the source code (in effect, parallelise this loop) and a compilation parameter. Later tests show speeds up to four times faster using a quad core processor.

Potential performance gains due to hardware SIMD with SSE instructions are not realised because of compiler limitations, and this enhances the comparative benefit of CUDA GPU parallel processing. On the other hand, the benchmark compiled for 64 bit working demonstrates significant speed improvement using the eight additional SSE registers that are available. It also appears that certain compiler optimisations, such as loop unrolling, are not applied when OpenMP is used.

The benchmarks identify three slightly different numeric results on tests using SSE, old i387 and CUDA floating point instructions. Results output has been revised to provide more detail.

Other benchmarks have been converted to run using OpenMP and are described in OpenMP Speeds.htm. Observations are that execution with smaller data arrays can be extremely slow, due to high startup overheads, and that wrong numeric results can be produced by careless use of OpenMP directives.
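
As an example of the last point, below is a minimal sketch (not taken from the benchmark source) of a careless directive: a shared total updated by all threads is a data race and usually gives a wrong sum, whereas adding a reduction clause restores the correct result.

   #include <stdio.h>

   #define WORDS 1000000

   static float x[WORDS];

   int main(void)
   {
      float total = 0.0f;
      int   i;

      for (i = 0; i < WORDS; i++) x[i] = 1.0f;

      /* Careless - total is shared by all threads, so updates can be
         lost and the sum is usually wrong, varying from run to run */
      #pragma omp parallel for
      for (i = 0; i < WORDS; i++) total = total + x[i];
      printf(" Shared total    %10.1f\n", total);

      /* Correct - each thread accumulates a private partial sum that
         OpenMP combines when the loop finishes */
      total = 0.0f;
      #pragma omp parallel for reduction(+:total)
      for (i = 0; i < WORDS; i++) total = total + x[i];
      printf(" With reduction  %10.1f\n", total);

      return 0;
   }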

The benchmarks can be downloaded via OpenMPMflops.zip. No installation is necessary - Extract All and click on OpenMP32MFLOPS.exe or OpenMP64MFLOPS.exe, but see ReadMe.txt first. The ZIP file also includes the C++ source code.

The OpenMP tests have also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software) - see linux benchmarks.htm and linux openmp benchmarks.htm, and download the benchmark execution files, source code and compile/run instructions in linux_openmp.tar.gz. Under Windows the file downloaded wrongly as linux_openmp.tar.tar but was fine when renamed to linux_openmp.tar.gz.

See GigaFLOPS Benchmarks.htm for further details and results, including comparisons with MP MFLOPS (a threaded C version), CUDA MFLOPS (for GeForce graphics processors) and Qpar MFLOPS, where Qpar is Microsoft’s proprietary equivalent of OpenMP and runs faster under Windows. The benchmarks and source codes can be obtained via gigaflops-benchmarks.zip.

To Start


General

OpenMP is a system independent set of procedures and software that arranges parallel processing of shared memory data. This option is available in the latest Microsoft C++ compilers; in this case, the 32 bit and 64 bit compilers used were from the free Windows Driver Kit Version 7.0.0. For OpenMP, the Microsoft Visual C++ 2008 Redistributable Packages for x86 and x64 were also downloaded. For comparison purposes, the OpenMP benchmarks execute the same functions as the CUDA tests - see Benchmark Details. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations per data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. All that is required to arrange for the code to be run on more than one CPU is a simple directive:

               #pragma omp parallel for
               for(i=0; i < n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
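
For reference, the only compilation change needed with these Microsoft compilers is the /openmp parameter, for example (the source file name here is illustrative, not that in the ZIP file):

               cl /O2 /openmp openmpmflops.cpp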

There are some issues with the Microsoft compilers that limit performance. With SSE instructions, the 128 bit hardware registers can each hold four data words, permitting operations such as four simultaneous adds - Single Instruction Multiple Data (SIMD) operation. The compilers appear to generate only single data (SISD) instructions, operating on 32 bits out of the 128 bits provided.
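
For comparison, a hand coded sketch of full SIMD operation for the two operation loop is shown below, using SSE intrinsics to process four words per instruction. This is illustrative only and not part of the benchmark; it assumes the array is 16 byte aligned and that n is a multiple of four.

   #include <xmmintrin.h>

   /* x[i] = (x[i] + a) * b, four words at a time */
   void triad_simd(float *x, int n, float a, float b)
   {
      __m128 va = _mm_set1_ps(a);    /* a copied to all four lanes */
      __m128 vb = _mm_set1_ps(b);    /* b copied to all four lanes */
      int i;

      for (i = 0; i < n; i += 4)
      {
         __m128 vx = _mm_load_ps(&x[i]);           /* load four words       */
         vx = _mm_mul_ps(_mm_add_ps(vx, va), vb);  /* four adds, four mults */
         _mm_store_ps(&x[i], vx);                  /* store four words      */
      }
   }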

Using processors at 64 bits, the old i387 instructions are not available and SSE types have to be used but more registers are available for optimisation. The 64 bit version of the benchmark at least demonstrates more than one floating point result per CPU clock cycle (linked add and multiply?).

Results are provided for an Athlon 64 X2 with XP x64, Core 2 Duo processors using 32-Bit and 64-Bit Vista, a Phenom II X4 via 64-Bit Windows 7 and a Core i7 again using Windows 7. On one CPU of a 2.4 GHz Core 2 Duo, up to 3.5 GFLOPS is produced or 6.8 GFLOPS using both processors. Corresponding results for a four processor 3 GHz Phenom II are 3.7 and 14.5 GFLOPS.

The quad core Core i7 results are difficult to interpret. The first issue is that Hyperthreading is available, allowing eight threads to run at the same time, and this could have some impact even with purely floating point calculations. The main problem is Turbo Boost, where a single CPU can run much faster than the rated MHz. Even four processors can run faster than the rating if not too hot. Results provided are for two 2.8 GHz i7 processors with different Turbo Boost speeds of up to 3.066 GHz and 3.466 GHz.

At 32 bits, the latest compilers refuse to obey the /arch:SSE parameter and produce only i387 floating point instructions. The ZIP file contains SSE32MFLOPS.exe, a single processor version, produced for SSE operation via an earlier compiler. Some results are given below.

The benchmark can be downloaded via OpenMPMflops.zip. No installation is necessary - Extract All and click on OpenMP32MFLOPS.exe or OpenMP64MFLOPS.exe, but see ReadMe.txt first. The ZIP file also includes the C++ source code.

The benchmarks have run time parameters to change the number of data words used and the number of repeat passes, which might need adjusting for timing purposes. There is also an option to select a single processor via an Affinity setting. BAT files containing examples of run time parameters are in the ZIP file.
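
The benchmark sets Affinity from its own run time parameters; purely as an illustration, a minimal sketch of restricting a Windows process to a single CPU (assumed to be broadly what the option does internally) is:

   #include <windows.h>

   /* Mask bit 0 selects the first CPU; a mask of 2 would select the
      second, and 3 both CPUs of a dual core processor */
   void run_on_first_cpu(void)
   {
      SetProcessAffinityMask(GetCurrentProcess(), 1);
   }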

To Start


Example Log Files

The CUDA graphics parallel computing benchmark has three sets of tests, two of which do not involve transferring data to and/or from the host CPU's memory. The tests here can be compared with the CUDA "Data in & out" test. Below is a sample log file for the 64 Bit version on a 2.4 GHz Core 2 Duo via Vista. The second set of results is for a single selected CPU.


  64 Bit OpenMP MFLOPS Benchmark 1 Fri Oct 02 10:21:19 2009

  Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.194304     2573    0.929538   Yes
 Data in & out    1000000     2      250   0.193139     2589    0.992550   Yes
 Data in & out   10000000     2       25   0.415691     1203    0.999250   Yes

 Data in & out     100000     8     2500   0.312285     6404    0.957117   Yes
 Data in & out    1000000     8      250   0.335818     5956    0.995517   Yes
 Data in & out   10000000     8       25   0.473814     4221    0.999549   Yes

 Data in & out     100000    32     2500   1.488048     5376    0.890211   Yes
 Data in & out    1000000    32      250   1.891056     4230    0.988082   Yes
 Data in & out   10000000    32       25   1.185456     6748    0.998796   Yes


  64 Bit OpenMP MFLOPS Benchmark 1 Fri Oct 02 10:21:31 2009

  Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

  Single CPU Affinity 1

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.313641     1594    0.929538   Yes
 Data in & out    1000000     2      250   0.317088     1577    0.992550   Yes
 Data in & out   10000000     2       25   0.431107     1160    0.999250   Yes

 Data in & out     100000     8     2500   0.584243     3423    0.957117   Yes
 Data in & out    1000000     8      250   0.594728     3363    0.995517   Yes
 Data in & out   10000000     8       25   0.605958     3301    0.999549   Yes

 Data in & out     100000    32     2500   2.268676     3526    0.890211   Yes
 Data in & out    1000000    32      250   2.261049     3538    0.988082   Yes
 Data in & out   10000000    32       25   2.270906     3523    0.998796   Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz Measured 2402 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  AMD64 processor architecture, 2 CPUs 
  Windows NT  Version 6.0, build 6002, Service Pack 2
  Memory 4095 MB, Free 2854 MB
  User Virtual Space 8388608 MB, Free 8388560 MB

To Start


Results From Different Versions

Following are results of the single processor SSE test, the 32 bit i387 OpenMP benchmark, the 64 bit SSE OpenMP version and MFLOPS obtained using CUDA. The latter are given for tests that copy data from/to host RAM or to/from graphics RAM, and for graphics processor calculations without external data transfers. Systems used are a Core 2 Duo with 64-Bit Vista and an AMD Athlon 64 X2 using XP x64, followed by a Pentium 4 with 32-Bit XP and a Core 2 Duo laptop with 32-Bit Vista. Later results are for a quad core Phenom II CPU using 64-Bit Windows 7 and a much faster graphics card. Later still are results for a quad core Intel Core i7 processor with a top end graphics card, again using 64-Bit Windows 7. This processor can use Hyperthreading and appears to Windows as having eight CPUs. The latest results are for a dual core Core i5 that also has Hyperthreading.

Single, Dual and Quad CPUs - Appropriate performance gains are obvious on increasing the number of calculations per memory access. With two calculations per word there can be little gain using more than one CPU, as performance is limited by main memory speed. Some results on the AMD Athlon CPU reflect the smaller, slower L2 cache.

CPU and GPU - Particularly as the compiler used does not fully implement SSE SIMD instructions, the GPU CUDA operations can be attractively fast, the latest results showing up to 24 times faster using a GTX 480.

SSE and i387 - Again because of compiler limitations, the old i387 floating point instructions can produce comparable performance, in some cases.

32 and 64 Bit SSE - Faster performance using a 64-Bit compilation could be expected, due to the availability of more registers for optimisation, but this is not always the case. Examination of actual intermediate machine code instructions can provide an explanation (see below).

Hyperthreading - This does not appear to improve the maximum throughput of the four core i7 to much more than four times the single CPU rate. Using one or two threads, the processors are likely to be running at the Turbo Boost speed of 3066 MHz, but falling back to 2933 MHz with four threads (or 2800 MHz if hot), reducing relative performance. For more details and Hyperthreading results with other benchmarks see Quad Core 8 Thread.htm.


 Core i7 930 2.8 GHz increased by Turbo Boost up to 3.066 GHz using 
 1 CPU and up to 2.933 GHz using 4 CPUs - Windows 7 64 - MFLOPS

                                                          CUDA    CUDA
   Data   Ops/     SSE    i387    i387 SSE 64b SSE 64b  GeFrce  No I/O
   Words  Word   1 CPU   1 CPU 4/8 CPU   1 CPU 4/8 CPU  GTX480  GTX480

   100000    2    3567    1248    4455    1574    4001     521    5554
  1000000    2    3529    1420    5433    1861    4919     819   21493
 10000000    2    2388    1364    3038    1735    3076    1014   31991

   100000    8    4655    2337    8798    3794   14581    2058   20129
  1000000    8    4642    2413    9813    4149   17080    3306   82132
 10000000    8    4453    2436    9581    4011   12457    4057  125413

   100000   32    3328    2957   12020    4324   16786    7768   52230
  1000000   32    3329    3011   12339    4436   17599   13190  254306
 10000000   32    3307    3003   12432    4418   17576   16077  425237


 Phenom II X4 3.0 GHz, Windows 7 64 - MFLOPS
                                                          CUDA    CUDA
   Data   Ops/     SSE    i387    i387 SSE 64b SSE 64b  GeFrce  No I/O
   Words  Word   1 CPU   1 CPU   4 CPU   1 CPU   4 CPU  GTS250  GTS250

   100000    2    3552    1920    5587    1822    5613     328    3054
  1000000    2    3268    1919    5585    1870    7056     625    9672
 10000000    2    1861    1625    2993    1563    2972     714   13038

   100000    8    4535    2115    7763    3637   12653    1336   12233
  1000000    8    4341    2108    7975    3709   14518    2382   39481
 10000000    8    4141    2100    8062    3543   11273    2949   51199

   100000   32    4012    2566    9675    3652   14092    5142   36080
  1000000   32    3981    2552   10091    3663   14510    9427  108170
 10000000   32    3941    2510    9902    3633   14034   11182  135041


 Core 2 Duo 2.4 GHz, Vista 64 - MFLOPS
                                                         CUDA    CUDA
   Data   Ops/     SSE    i387    i387 SSE 64b SSE 64b  GeFrce  No I/O
   Words  Word   1 CPU   1 CPU   2 CPU   1 CPU   2 CPU  8600GT  8600GT

   100000    2    2524    1599    2660    1594    2573     215    1770
  1000000    2    2353    1617    2957    1577    2589     342    3479
 10000000    2    1158    1180    1136    1160    1203     417    3874

   100000    8    3647    2063    3948    3423    6404     886    6931
  1000000    8    3445    2070    3624    3363    5956    1371   13250
 10000000    8    3231    2058    3962    3301    4221    1661   14281

   100000   32    2590    2653    4909    3526    5376    3329   16583
  1000000   32    2659    2658    4580    3538    4230    5019   27027
 10000000   32    2663    2649    5183    3523    6748    5975   28923


 Core i5-2467M 1.6 GHz to 2.3 GHz Turbo Boost
 Dual Core + Hyperthreading, Windows 7 - MFLOPS 

 Data    Ops/     SSE    i387 SSE 64b
 Words   Word   1 CPU   2 CPU   2 CPU

  100000    2    1611     975    1613
 1000000    2    2247    2100    1917
10000000    2    1625    1603    1681

  100000    8    2829    2621    3524
 1000000    8    3248    2756    3604
10000000    8    3458    2844    5377

  100000   32    3308    3691    4032
 1000000   32    3330    3994    4178
10000000   32    3322    4898    5041


 AMD Athlon 64 X2 2.2 GHz, XP x64 - MFLOPS

                   A64     A64     A64     A64     A64
   Data   Ops/     SSE    i387    i387 SSE 64b SSE 64b
   Words  Word   1 CPU   1 CPU   2 CPU   1 CPU   2 CPU

   100000    2    1304    1060    1961    1114    2015
  1000000    2     659     639     812     638     817
 10000000    2     665     640     837     636     831

   100000    8    2084    1495    2922    1942    3783
  1000000    8    1853    1369    2629    1692    3058
 10000000    8    1861    1376    2701    1706    3110

   100000   32    2488    1852    3428    1731    3254
  1000000   32    2439    1813    3614    1793    3369
 10000000   32    2443    1818    3629    1774    3443


 32 Bit Windows    Pentium 4    Core 2 Duo Laptop      Atom Netbook
                    MFLOPS          MFLOPS                MFLOPS
 
           CPU     P4     P4    C2D    C2D    C2D   Atom   Atom   Atom
           MHz   1900   1900   1829   1829   1829   1600   1600   1600
                 XP32   XP32    V32    V32    V32   XP32   XP32   XP32
   Data   Ops/    SSE   i387    SSE   i387   i387    SSE   i387   i387
   Words  Word  1 CPU  1 CPU  1 CPU  1 CPU  2 CPU  1 CPU  No HT     HT

   100000    2    221    223   1811   1201   2063    264    175    323 
  1000000    2    224    224    673    650    630    259    185    311
 10000000    2    204    206    651    668    650    258    189    331

   100000    8    835    742   2648   1558   2773    409    257    460 
  1000000    8    817    699   2326   1529   2568    406    263    443
 10000000    8    764    771   2331   1508   2645    406    265    475

   100000   32   1160   1017   1935   1978   3627    457    369    679
  1000000   32   1163   1025   1970   1977   3719    456    371    679
 10000000   32   1165   1029   2015   1921   3727    456    372    677

      Single processor Atom i387 results Hyperthreading off and on

To Start


64-Bit Comparisons

Following are OpenMP benchmark results for the version compiled for 64 bit working, with performance gains shown when using multiple processors. These gains are lowest using 10M words (40 MB) with an add and a multiply for each word read, where speed is limited by RAM. There is generally no such limitation with 32 operations per word, at any data size.

These results include those for two 2.8 GHz Core i7 CPUs that have different Turbo Boost characteristics. In this case, the i7 860 had been detuned and, based on results with 32 operations per word, single CPU tests suggest that both were running at around 3 GHz, with Core i7/Core 2 measured speed ratios similar to MHz ratios (3066/2400 ≈ 4510/3530). The i7 860 has faster RAM, affecting tests with fewer operations per word.


                 64 Bit OpenMP Benchmark MFLOPS

                 Athlon 64 x2          Core 2 Duo
            
   Data   Ops/   SSE 64b SSE 64b  Gain SSE 64b SSE 64b  Gain
   Words  Word     1 CPU   2 CPU         1 CPU   2 CPU

   100000    2     1114    2015    1.8    1594    2573   1.6
  1000000    2      638     817    1.3    1577    2589   1.6
 10000000    2      636     831    1.3    1160    1203   1.0

   100000    8     1942    3783    1.9    3423    6404   1.9
  1000000    8     1692    3058    1.8    3363    5956   1.8
 10000000    8     1706    3110    1.8    3301    4221   1.3

   100000   32     1731    3254    1.9    3526    5376   1.5
  1000000   32     1793    3369    1.9    3538    4230   1.2
 10000000   32     1774    3443    1.9    3523    6748   1.9


                 Phenom II             Core i7 860           Core i7 930
 
   Data   Ops/   SSE 64b SSE 64b  Gain SSE 64b SSE 64b  Gain SSE 64b SSE 64b  Gain
   Words  Word     1 CPU   4 CPU         1 CPU   4 CPU         1 CPU   4 CPU

   100000    2      1822    5613   3.1    1661    4263   2.6    1574    4001   2.5
  1000000    2      1870    7056   3.8    1922    5142   2.7    1861    4919   2.6
 10000000    2      1563    2972   1.9    1824    3838   2.1    1735    3076   1.8

   100000    8      3637   12653   3.5    3939   13804   3.5    3794   14581   3.8
  1000000    8      3709   14518   3.9    4251   18082   4.3    4149   17080   4.1
 10000000    8      3543   11273   3.2    4133   15079   3.6    4011   12457   3.1

   100000   32      3652   14092   3.9    4438   16299   3.7    4324   16786   3.9
  1000000   32      3663   14510   4.0    4512   18081   4.0    4436   17599   4.0
 10000000   32      3633   14034   3.9    4493   17752   4.0    4418   17576   4.0

 i7 860 2.8 GHz, Turbo Boost possible to 3.47 GHz using 1 CPU to 2.93 GHz using 4 
 i7 930 2.8 GHz, Turbo Boost possible to 3.07 GHz using 1 CPU to 2.93 GHz using 4 


To Start


Assembly Code

The benchmarks were compiled using the /Fa option, which produces an assembly code listing file. The listings show significant differences between 64 bit and 32 bit compilations, and also depending on whether the /openmp parameter is included.

The most obvious difference is with two operations per word, where the 32 bit compilation unrolls the loop (using x[i], x[i+1], x[i+2] and x[i+3], with four times as many calculations per iteration). This results in some much faster speeds for the 32 bit version. A further 64 bit compilation, without /openmp, also included unrolling.
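
In source code terms, the unrolled loop produced by the 32 bit compiler for the two operation case is equivalent to something like the following sketch (assuming n is a multiple of four):

   /* Four results per loop iteration instead of one */
   for (i = 0; i < n; i += 4)
   {
      x[i]   = (x[i]   + a) * b;
      x[i+1] = (x[i+1] + a) * b;
      x[i+2] = (x[i+2] + a) * b;
      x[i+3] = (x[i+3] + a) * b;
   }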

At the other extreme, where the 64 bit compilation is much faster, memory accesses are reduced by using the additional registers. The extra memory accesses in the 32 bit code are indicated by instructions such as addss xmm6, DWORD PTR _g$[esp], and the extra registers in the 64 bit code by xmm8 to xmm15 (the 32 operation case really needs 24 registers - CUDA has more).


   64 Bit SSE Instructions                   32 Bit SSE Instructions

   2 Operations Per Word
   for(i=0; i< n; i++) x[i]=(x[i]+a)*b;

   $LL6@triad$omp$:                          $L56949:
   ; Line 77                                 ; Line 77
   movaps xmm0, xmm1                         movss  xmm2, DWORD PTR [eax-8]
   add    rax, 4                             addss  xmm2, xmm1
   sub    rcx, 1                             mulss  xmm2, xmm0
   addss  xmm0, DWORD PTR [rax-4]            movss  DWORD PTR [eax-8], xmm2
   mulss  xmm0, xmm2                         movaps xmm2, xmm1
   movss  DWORD PTR [rax-4], xmm0            addss  xmm2, DWORD PTR [eax-4]
   jne    SHORT $LL6@triad$omp$              mulss  xmm2, xmm0
                                             movss  DWORD PTR [eax-4], xmm2
                                             movss  xmm2, DWORD PTR [eax]
                                             addss  xmm2, xmm1
                                             mulss  xmm2, xmm0
                                             movss  DWORD PTR [eax], xmm2
                                             movss  xmm2, DWORD PTR [eax+4]
                                             addss  xmm2, xmm1
                                             mulss  xmm2, xmm0
                                             movss  DWORD PTR [eax+4], xmm2
                                             add    eax, 16
                                             dec    edx
                                             jne    SHORT $L56949

   8 Operations Per Word
   for(i=0; i< n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;

   $LL6@triadplus$:                          $L56942:
   ; Line 69                                 ; Line 69
   movss  xmm1, DWORD PTR [rcx]              movss  xmm6, DWORD PTR [eax-8]
   add    rcx, 4                             movss  xmm7, DWORD PTR [eax-8]
   sub    rax, 1                             addss  xmm7, xmm3
   movaps xmm2, xmm1                         mulss  xmm7, xmm2
   movaps xmm0, xmm1                         addss  xmm6, xmm5
   addss  xmm1, xmm7                         mulss  xmm6, xmm4
   addss  xmm2, xmm3                         subss  xmm6, xmm7
   addss  xmm0, xmm5                         movss  xmm7, DWORD PTR [eax-8]
   mulss  xmm1, xmm8                         addss  xmm7, xmm1
   mulss  xmm2, xmm4                         mulss  xmm7, xmm0
   mulss  xmm0, xmm6                         addss  xmm6, xmm7
   subss  xmm2, xmm0                         movss  DWORD PTR [eax-8], xmm6
   addss  xmm2, xmm1                         movaps xmm6, xmm5
   movss  DWORD PTR [rcx-4], xmm2            addss  xmm6, DWORD PTR [eax-4]
   jne    SHORT $LL6@triadplus$              mulss  xmm6, xmm4
                                             movaps xmm7, xmm3
                                             addss  xmm7, DWORD PTR [eax-4]
                                             mulss  xmm7, xmm2
                                             subss  xmm6, xmm7
                                             movaps xmm7, xmm1
                                             addss  xmm7, DWORD PTR [eax-4]
                                             mulss  xmm7, xmm0
                                             addss  xmm6, xmm7
                                             movss  xmm7, DWORD PTR [eax]
                                             movss  DWORD PTR [eax-4], xmm6
                                             movss  xmm6, DWORD PTR [eax]
                                             addss  xmm7, xmm3
                                             mulss  xmm7, xmm2
                                             addss  xmm6, xmm5
                                             mulss  xmm6, xmm4
                                             subss  xmm6, xmm7
                                             movss  xmm7, DWORD PTR [eax]
                                             addss  xmm7, xmm1
                                             mulss  xmm7, xmm0
                                             addss  xmm6, xmm7
                                             movss  xmm7, DWORD PTR [eax+4]
                                             movss  DWORD PTR [eax], xmm6
                                             movss  xmm6, DWORD PTR [eax+4]
                                             addss  xmm7, xmm3
                                             addss  xmm6, xmm5
                                             mulss  xmm7, xmm2
                                             mulss  xmm6, xmm4
                                             subss  xmm6, xmm7
                                             movss  xmm7, DWORD PTR [eax+4]
                                             addss  xmm7, xmm1
                                             mulss  xmm7, xmm0
                                             addss  xmm6, xmm7
                                             movss  DWORD PTR [eax+4], xmm6
                                             add    eax, 16
                                             dec    edx
                                             jne    $L56942


   32 Operations Per Word
   for(i=0; i< n; i++) x[i]=(x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f-(x[i]+g)*h+(x[i]+j)*k
                -(x[i]+l)*m+(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t-(x[i]+u)*v+(x[i]+w)*y;

   $LL6@triadplus2:                          $L56934:
   ; Line 61                                 ; Line 61
   movss  xmm2, DWORD PTR [rbp]              movss  xmm5, DWORD PTR [edx+ecx*4]
   add    rbp, 4                             addss  xmm5, DWORD PTR _a$[esp]
   sub    r12, 1                             mulss  xmm5, DWORD PTR _b$[esp]
   movaps xmm0, xmm2                         movss  xmm6, DWORD PTR [edx+ecx*4]
   movaps xmm1, xmm2                         addss  xmm6, DWORD PTR _c$[esp]
   movaps xmm3, xmm2                         mulss  xmm6, DWORD PTR _d$[esp]
   addss  xmm0, xmm6                         subss  xmm5, xmm6
   addss  xmm3, xmm4                         movss  xmm6, DWORD PTR [edx+ecx*4]
   addss  xmm1, xmm8                         addss  xmm6, DWORD PTR _e$[esp]
   mulss  xmm0, xmm7                         mulss  xmm6, DWORD PTR _f$[esp]
   mulss  xmm3, xmm5                         addss  xmm5, xmm6
   mulss  xmm1, xmm9                         movss  xmm6, DWORD PTR [edx+ecx*4]
   subss  xmm3, xmm0                         addss  xmm6, DWORD PTR _g$[esp]
   movaps xmm0, xmm2                         mulss  xmm6, DWORD PTR _h$[esp]
   addss  xmm3, xmm1                         subss  xmm5, xmm6
   addss  xmm0, xmm10                        movss  xmm6, DWORD PTR [edx+ecx*4]
   movaps xmm1, xmm2                         addss  xmm6, DWORD PTR _j$[esp]
   mulss  xmm0, xmm11                        mulss  xmm6, DWORD PTR _k$[esp]
   subss  xmm3, xmm0                         addss  xmm5, xmm6
   addss  xmm1, xmm12                        movss  xmm6, DWORD PTR [edx+ecx*4]
   movaps xmm0, xmm2                         addss  xmm6, DWORD PTR _l$[esp]
   mulss  xmm1, xmm13                        mulss  xmm6, DWORD PTR _m$[esp]
   addss  xmm3, xmm1                         subss  xmm5, xmm6
   addss  xmm0, xmm14                        movss  xmm6, DWORD PTR [edx+ecx*4]
   movaps xmm1, xmm2                         addss  xmm6, DWORD PTR _o$[esp]
   mulss  xmm0, xmm15                        mulss  xmm6, DWORD PTR _p$[esp]
   addss  xmm1, DWORD PTR [rax]              addss  xmm5, xmm6
   subss  xmm3, xmm0                         movss  xmm6, DWORD PTR [edx+ecx*4]
   movaps xmm0, xmm2                         addss  xmm6, DWORD PTR _q$[esp]
   mulss  xmm1, DWORD PTR [rcx]              mulss  xmm6, DWORD PTR _r$[esp]
   addss  xmm0, DWORD PTR [rdx]              subss  xmm5, xmm6
   addss  xmm3, xmm1                         movss  xmm6, DWORD PTR [edx+ecx*4]
   mulss  xmm0, DWORD PTR [r8]               addss  xmm6, DWORD PTR _s$[esp]
   movaps xmm1, xmm2                         mulss  xmm6, xmm4
   addss  xmm1, DWORD PTR [r9]               addss  xmm5, xmm6
   subss  xmm3, xmm0                         movss  xmm6, DWORD PTR [edx+ecx*4]
   mulss  xmm1, DWORD PTR [r10]              addss  xmm6, xmm3
   movaps xmm0, xmm2                         mulss  xmm6, xmm2
   addss  xmm0, DWORD PTR [r11]              subss  xmm5, xmm6
   addss  xmm2, DWORD PTR [rdi]              movss  xmm6, DWORD PTR [edx+ecx*4]
   mulss  xmm0, DWORD PTR [rbx]              addss  xmm6, xmm1
   mulss  xmm2, DWORD PTR [rsi]              mulss  xmm6, xmm0
   addss  xmm3, xmm1                         addss  xmm5, xmm6
   subss  xmm3, xmm0                         movss  DWORD PTR [edx+ecx*4], xmm5
   addss  xmm3, xmm2                         inc    ecx
   movss  DWORD PTR [rbp-4], xmm3            cmp    ecx, edi
   jne    $LL6@triadplus2                    jl     $L56934


To Start


Numeric Accuracy

The run time display and log files show the numeric results of the calculations; values obtained using the same default parameters are shown below. There is some variation in rounding after calculations, differing between SSE, i387 and CUDA instructions.
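
The i387 differences are consistent with its wider intermediate registers: results are rounded to 32 bits only when stored, rather than after every add and multiply as with scalar SSE. Below is a minimal sketch of the effect, with arbitrary constants (not the benchmark's values) and long double standing in for the extended precision intermediates; the two final values may differ only in the low order digits, if at all, depending on the compiler and constants chosen.

   #include <stdio.h>

   int main(void)
   {
      const float a = 0.000020f, b = 0.999980f;
      float xs = 0.999950f;   /* rounded to 32 bits after every operation (SSE style)    */
      float xi = 0.999950f;   /* wider intermediates, rounded once per pass (i387 style) */
      int i;

      for (i = 0; i < 2500; i++)
      {
         xs = (xs + a) * b;
         xi = (float)(((long double)xi + a) * b);
      }
      printf(" Single precision throughout %10.6f\n", xs);
      printf(" Wider intermediates         %10.6f\n", xi);
      return 0;
   }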


   4 Byte Ops Repeat      SSE      i387      i387   SSE 64b   SSE 64b   SSE 64b      CUDA
    Words /Wd Passes    1 CPU     1 CPU     2 CPU     1 CPU     2 CPU     4 CPU    8600GT

   100000   2  2500  0.929538  0.929475  0.929475  0.929538  0.929538  0.929538  0.929538
  1000000   2   250  0.992550  0.992543  0.992543  0.992550  0.992550  0.992550  0.992550
 10000000   2    25  0.999250  0.999249  0.999249  0.999250  0.999250  0.999250  0.999250

   100000   8  2500  0.957117  0.957164  0.957164  0.957117  0.957117  0.957117  0.956980
  1000000   8   250  0.995517  0.995525  0.995525  0.995517  0.995517  0.995517  0.995509
 10000000   8    25  0.999549  0.999550  0.999550  0.999549  0.999549  0.999549  0.999549

   100000  32  2500  0.890211  0.890377  0.890377  0.890211  0.890211  0.890211  0.890079
  1000000  32   250  0.988082  0.988102  0.988102  0.988082  0.988082  0.988082  0.988073
 10000000  32    25  0.998796  0.998799  0.998799  0.998796  0.998796  0.998796  0.998799  


To Start






Roy Longbottom July 2014



The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection