GigaFLOPS Benchmarks

Roy Longbottom at Linkedin

Contents


  General
  CUDA GPU Benchmarks
  CUDA Comparisons
  MP MFLOPS Benchmarks
  MP MFLOPS Comparisons
  MP MFLOPS Numeric Answers
  MP MFLOPS Assembly Code
  OpenMP Benchmark
  Qpar MP Benchmark
  Qpar and OpenMP Comparisons
  Reliability Tests


General

In this series, four types of benchmark are available: OpenMP and Qpar (Microsoft's proprietary equivalent), both with automatic multiprocessing, CUDA for GeForce graphics, and MP MFLOPS, a CPU benchmark with explicit multithreading (up to 64 threads). All carry out the same calculations, of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations per input data word. They also check the numeric results for consistency, which identifies differences in calculated values between the various instruction sets.
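
An illustration of the form of the calculations is below, as a minimal single thread C sketch of the 8 operations per word test. The constants a to f are invented for the example; the real benchmarks add multithreading, timing, MFLOPS calculation and logging.

   #include <stdio.h>

   #define WORDS   100000        /* 4 byte single precision data words */
   #define REPEATS   2500        /* repeat passes                      */

   static float x[WORDS];

   int main(void)
   {
      /* constants are illustrative - chosen to keep results in range */
      float a = 0.0001f, b = 0.9999f, c = 0.0002f,
            d = 0.4999f, e = 0.0003f, f = 0.5001f;
      int   i, p, same = 1;

      for (i = 0; i < WORDS; i++) x[i] = 0.999999f;

      for (p = 0; p < REPEATS; p++)
         for (i = 0; i < WORDS; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;

      /* consistency check - every word should hold the same value */
      for (i = 1; i < WORDS; i++) if (x[i] != x[0]) same = 0;
      printf(" First Results %10.6f  All Same %s\n", x[0], same ? "Yes" : "No");
      return 0;
   }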

The benchmarks are compiled for 32 bit and 64 bit Operating Systems and are provided for single and double precision floating point calculations, using the original x87, SSE or AVX instruction sets. Tests are run at increasing data sizes, so that data is transferred from caches and from RAM. The latest versions of the benchmarks and source code are included in GigaFLOPS-Benchmarks.zip.

The programs are run from Command Prompt windows and generally have parameter options to run for extended periods as reliability/burn-in tests. Results are displayed and saved in a text-based log file. Speed is shown in MFLOPS, or Millions of FLoating point Operations Per Second.




CUDA GPU Benchmarks

These were the first benchmarks in the series. Four varieties are available: CUDA3MFLOPS-x86SP.exe, CUDA3MFLOPS-x86DP.exe, CUDA3MFLOPS-x64SP.exe and CUDA3MFLOPS-x64DP.exe, two each for 32 bit and 64 bit Windows. Single Precision (SP) and Double Precision (DP) compilations are provided, as the performance difference can be considerable. CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use for executing arithmetic instructions on numerous processing elements simultaneously.

Unlike the CPU benchmarks, three tests are carried out for each combination of data size and calculations per word. This is to show that performance is severely degraded if data transfers are over the relatively slow external bus. The first tests are run with Repeat Passes controlled by the CPU but, to demonstrate the fastest speeds (with these tests), Extra Tests are run with all repeats controlled by the GPU.

The example below is for a GeForce GTX 680, possibly the fastest graphics card in 2012. This has a maximum specification of 3090 GFLOPS, the benchmark achieving up to 1746 GFLOPS. Note that numeric accuracy improves with fewer data returns between calculations.

Details on installing and using CUDA software, with further benchmark details and results, can be found in cuda1.htm, cuda2.htm and cuda3 x64m.htm. These show how the numeric results vary according to precision used and provide details on how to run the programs for extended periods as reliability/burn-in tests.

Note that the total data size transferred can be 1 GB in and out. On reducing data flow by this amount, the difference in test time is mainly around 0.25 seconds, suggesting that effective transfer speed is about 4 GB/second, maybe about right for PCIe 3.0 x16 with a maximum of nearly 16 GB/second.
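
As a rough guide to what is involved, below is a much simplified CUDA sketch, with invented names and constants, not the benchmark source. It shows the Extra Tests arrangement, with the repeat passes inside the kernel; the normal tests launch the kernel once per pass from the CPU, the Data in & out variety also copying the array to and from the card on every pass.

   #include <cstdio>
   #include <cuda_runtime.h>

   __global__ void calc(float *x, int n, int passes, float a, float b,
                        float c, float d, float e, float f)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
      {
         float v = x[i];
         // Extra Tests style - repeats inside the kernel, so data stays
         // in GPU registers with no transfers between passes
         for (int p = 0; p < passes; p++)
            v = (v + a) * b - (v + c) * d + (v + e) * f;
         x[i] = v;
      }
   }

   int main()
   {
      const int n = 10000000, passes = 25, threads = 256;
      float *hostx = new float[n], *devx;

      for (int i = 0; i < n; i++) hostx[i] = 0.999999f;
      cudaMalloc(&devx, n * sizeof(float));

      // Data in & out tests repeat this copy/launch/copy sequence every
      // pass; Calculate only copies once and just relaunches the kernel
      cudaMemcpy(devx, hostx, n * sizeof(float), cudaMemcpyHostToDevice);
      calc<<<(n + threads - 1) / threads, threads>>>(devx, n, passes,
                    0.0001f, 0.9999f, 0.0002f, 0.4999f, 0.0003f, 0.5001f);
      cudaMemcpy(hostx, devx, n * sizeof(float), cudaMemcpyDeviceToHost);

      printf("First result %10.7f\n", hostx[0]);
      cudaFree(devx);
      delete [] hostx;
      return 0;
   }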


   CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.3 Sun Nov 04 17:58:44 2012

  CUDA devices found 
  Device 0: GeForce GTX 680  with 8 Processors 64 cores 
  Global Memory 2000 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  0.967272      517   0.9295383095741  Yes
 Data out only    100000    2    2500  0.387346     1291   0.9295383095741  Yes
 Calculate only   100000    2    2500  0.070436     7099   0.9295383095741  Yes

 Data in & out   1000000    2     250  0.526663      949   0.9925497770309  Yes
 Data out only   1000000    2     250  0.245081     2040   0.9925497770309  Yes
 Calculate only  1000000    2     250  0.019763    25299   0.9925497770309  Yes

 Data in & out  10000000    2      25  0.482678     1036   0.9992496371269  Yes
 Data out only  10000000    2      25  0.240010     2083   0.9992496371269  Yes
 Calculate only 10000000    2      25  0.013708    36475   0.9992496371269  Yes

 Data in & out    100000    8    2500  0.759731     2633   0.9571172595024  Yes
 Data out only    100000    8    2500  0.410652     4870   0.9571172595024  Yes
 Calculate only   100000    8    2500  0.073366    27261   0.9571172595024  Yes

 Data in & out   1000000    8     250  0.524791     3811   0.9955183267593  Yes
 Data out only   1000000    8     250  0.245618     8143   0.9955183267593  Yes
 Calculate only  1000000    8     250  0.020494    97589   0.9955183267593  Yes

 Data in & out  10000000    8      25  0.494677     4043   0.9995489120483  Yes
 Data out only  10000000    8      25  0.240809     8305   0.9995489120483  Yes
 Calculate only 10000000    8      25  0.013834   144575   0.9995489120483  Yes

 Data in & out    100000   32    2500  0.764819    10460   0.8902152180672  Yes
 Data out only    100000   32    2500  0.415392    19259   0.8902152180672  Yes
 Calculate only   100000   32    2500  0.135979    58833   0.8902152180672  Yes

 Data in & out   1000000   32     250  0.529935    15096   0.9880878329277  Yes
 Data out only   1000000   32     250  0.247135    32371   0.9880878329277  Yes
 Calculate only  1000000   32     250  0.024024   333000   0.9880878329277  Yes

 Data in & out  10000000   32      25  0.493384    16215   0.9987964630127  Yes
 Data out only  10000000   32      25  0.242553    32983   0.9987964630127  Yes
 Calculate only 10000000   32      25  0.015177   527122   0.9987964630127  Yes

 Extra tests - Repeat Passes in main CUDA Function

 Calculate      10000000    2      25  0.004503   111041   0.9992496371269  Yes
 Shared Memory  10000000    2      25  0.002053   243601   0.9992496371269  Yes

 Calculate      10000000    8      25  0.005726   349289   0.9995489120483  Yes
 Shared Memory  10000000    8      25  0.002773   721272   0.9995489120483  Yes

 Calculate      10000000   32      25  0.008419   950255   0.9987964630127  Yes
 Shared Memory  10000000   32      25  0.004581  1746493   0.9987964630127  Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206D7
  Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz Measured 3200 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  AMD64 processor architecture, 12 CPUs 
  Windows NT  Version 6.1, build 7601, Service Pack 1
  Memory 32710 MB, Free 28947 MB
  User Virtual Space 8388608 MB, Free 8388539 MB

  2000 MB Graphics RAM, Used 204 Minimum, 240 Maximum

   


CUDA Comparisons

Following are further results for the GTX 680, alongside others for a GTX 650, with single and double precision benchmarks compiled for 32 bit and 64 bit Windows. The PCs have similar CPU, bus and RAM speeds, reflected in the results involving data transfers. Note that the 64 bit compilations are generally faster than those at 32 bits, and single precision speeds can be much higher than those for double precision calculations, the largest difference (>15x) being on the extra tests.

 
                             GeForce GTX 680              GeForce GTX 650
 Maximum Specification       3090 GFLOPS                  1505 GFLOPS

 Test            Wds x Ops   3.1 SP  3.1 SP 3.1 DP 3.1 DP 3.1 SP 3.1 SP 3.1 DP 3.1 DP
                  x Passes      32b     64b    32b    64b    32b    64b    32b    64b

 Data in & out  .1Mx2x2500      622     517    354    353    593    459    333    350
 Data out only  .1Mx2x2500     1241    1291    667    712   1050   1059    643    651
 Calculate only .1Mx2x2500     7310    7099   4126   6140   3854   3449   3351   3069

 Data in & out  1Mx2x250        931     949    479    489    869    893    471    479
 Data out only  1Mx2x250       1960    2040   1000   1049   1765   1790    941    962
 Calculate only 1Mx2x250      21914   25299  16143  16226   9244   8806   6889   6627

 Data in & out  10Mx2x25        999    1036    513    516    959    980    502    508
 Data out only  10Mx2x25       2005    2083   1040   1052   1783   1852    952    972
 Calculate only 10Mx2x25      36175   36475  19876  19882  10867  10530   7746   7533

 Data in & out  .1Mx8x2500     2445    2633   1415   1430   2348   2375   1294   1299
 Data out only  .1Mx8x2500     4280    4870   2724   2906   4135   4151   2396   2415
 Calculate only .1Mx8x2500    25502   27261  15501  22388  14274  13056   9809   9029

 Data in & out  1Mx8x250       3722    3811   1901   1949   3458   3545   1806   1829
 Data out only  1Mx8x250       7694    8143   4058   4143   6982   7107   3455   3504
 Calculate only 1Mx8x250      85455   97589  60592  60056  33375  34014  16613  15852

 Data in & out  10Mx8x25       4015    4043   2061   2052   3841   3896   1915   1926
 Data out only  10Mx8x25       8177    8305   4055   4130   7222   7283   3524   3572
 Calculate only 10Mx8x25     140800  144575  75558  74333  42328  40905  17922  17221

 Data in & out  .1Mx32x2500   10080   10460   5396   5485   9026   9183   4262   4588
 Data out only  .1Mx32x2500   18672   19259   9978  10053  15660  15769   7355   7377
 Calculate only .1Mx32x2500   83989   58833  43334  47974  47957  43975  17469  16658

 Data in & out  1Mx32x250     14629   15096   7269   7440  13634  14006   5947   6015
 Data out only  1Mx32x250     30977   32371  14810  14977  27261  27684   9864   9881
 Calculate only 1Mx32x250    347405  333000  96044  92231 125027 120972  22708  21998

 Data in & out  10Mx32x25     15995   16215   7739   7765  15123  15499   6288   6278
 Data out only  10Mx32x25     31586   32983  14956  15037  27770  28906   9984   9999
 Calculate only 10Mx32x25    519153  527122 105419 101771 149697 147100  23467  22763

 Extra tests - Repeat Passes in main CUDA Function

 Calculate      10Mx2x25     126843  111041  50749  44017  29899  26876  11491   9801
 Shared Memory  10Mx2x25     236915  243601  73186  72369  75618  77049  16407  16268

 Calculate      10Mx8x25     470867  349289  89172  83696 110088  81484  19821  18451
 Shared Memory  10Mx8x25     969416  721272 100825 100254 227879 181190  22425  22322

 Calculate      10Mx32x25   1154649  950255 109800 105142 253381 216570  24083  23353
 Shared Memory  10Mx32x25   1714512 1746493 111425 110982 412313 400966  24587  24560

   


MP MFLOPS Benchmarks

The benchmark was first produced to run on PCs via Linux, with details in Linux Multithreading Benchmarks.htm. Then mini versions were produced, with details in Android MultiThreading Benchmarks.htm and Raspberry Pi Multithreading Benchmarks.htm.

Four versions, compiled from the same source code, are available in the GigaFLOPS ZIP file, now intended to be run via 64 bit Windows. They are:

  • MPmflops32.exe compiled with MS C/C++ Version 15.00.30729.207 for 80x86 - old 8087 floating point instructions
  • MPmflops64.exe compiled with MS C/C++ Version 15.00.30729.207 for x64 - to use SSE floating point calculations
  • MPmflopsc2.exe compiled with MS C/C++ Version 18.00.21005.1 for x64 - to fully implement SSE functions
  • MPmflopsAVX.exe compiled with MS C/C++ Version 18.00.21005.1 for x64, with the /arch:AVX option to use the newer vector instructions. The first compilation failed to run on a PC without AVX. The revised benchmark, now named onlyAVX.exe, is a small program that identifies the configuration, then either reports that AVX is not supported or runs the benchmark.
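
For indication only, using a hypothetical source file name and the same general options as the OpenMP compile command shown later, the SSE and AVX compilations would be along the following lines.

   Rem hypothetical file name - default SSE and /arch:AVX compilations
   cl /O2           /MD /W4 /Zi /TP /EHsc /Fa /c mpmflops.cpp
   cl /O2 /arch:AVX /MD /W4 /Zi /TP /EHsc /Fa /c mpmflops.cpp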

A WhatConfig function identifies how many cores are available, doubled when Hyperthreading is present, and this provides the default thread count. Alternatively, a run time parameter specifies the number of threads to use (see below). This can be zero, where no threads are created but, in this case, speeds have been the same as using one thread.

The threads use a shared data array, but each operates on a dedicated segment. Example results are below. The “Data in & out” label is to identify the equivalent CUDA test. The thread count should be one of the options shown below (all divide into 1024); otherwise, unexpected results will be notified (see example). The numeric results depend on the number of calculations, changed by repeat passes that vary according to the number of identified CPUs (2500 x CPUs, see below). Results via x87 floating point are slightly different to SSE calculations, but the thread count should not affect the numeric answers.
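
A minimal sketch of the threading arrangement, in C with Windows threads and invented names (the real source is in the ZIP file), could be as below, with the shared array split into a dedicated segment per thread; repeat passes, timing and result checking are omitted.

   #include <windows.h>
   #include <process.h>

   #define WORDS   1024000
   #define THREADS 8

   static float x[WORDS];                    /* shared data array */

   typedef struct { int first; int count; } SEGMENT;

   unsigned __stdcall runPart(void *arg)     /* one thread, one segment */
   {
      SEGMENT *s = (SEGMENT *)arg;
      float a = 0.0001f, b = 0.9999f, c = 0.0002f,
            d = 0.4999f, e = 0.0003f, f = 0.5001f;
      int i;
      for (i = s->first; i < s->first + s->count; i++)
         x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      return 0;
   }

   int main(void)
   {
      HANDLE  handles[THREADS];
      SEGMENT segs[THREADS];
      int     i, t, words = WORDS / THREADS; /* dedicated segment per thread */

      for (i = 0; i < WORDS; i++) x[i] = 0.999999f;

      for (t = 0; t < THREADS; t++)
      {
         segs[t].first = t * words;
         segs[t].count = words;
         handles[t] = (HANDLE)_beginthreadex(NULL, 0, runPart, &segs[t], 0, NULL);
      }
      WaitForMultipleObjects(THREADS, handles, TRUE, INFINITE);
      for (t = 0; t < THREADS; t++) CloseHandle(handles[t]);
      return 0;
   }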


 Command Format Example - MPmflopsc2 Threads xx or T xx
                          x = 0, 1, 2, 4, 8, 16, 32 or 64

 Example results log file:
 
 8 CPUs Available
 ##############################################

  64 Bit MP SSE MFLOPS Benchmark C2, 8 Threads, Mon Jun 09 12:15:48 2014

        C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    20000   0.069739    58734    0.620974   Yes
 Data in & out    1024000     2     2000   0.093680    43723    0.942935   Yes
 Data in & out   10240000     2      200   0.410417     9980    0.994032   Yes

 Data in & out     102400     8    20000   0.168666    97139    0.749971   Yes
 Data in & out    1024000     8     2000   0.166426    98446    0.965360   Yes
 Data in & out   10240000     8      200   0.408970    40062    0.996409   Yes

 Data in & out     102400    32    20000   0.717656    91320    0.498060   Yes
 Data in & out    1024000    32     2000   0.698044    93885    0.910573   Yes
 Data in & out   10240000    32      200   0.703741    93125    0.990447   Yes

               End of test Mon Jun 09 12:15:51 2014

 ##############################################

  CPU GenuineIntel, ecx 7FBEE3BF, edx BFEBFBFF, Model 000306E4 
  Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz Measured 3711 MHz 
  Has MMX, Has SSE, Has SSE2, Has SSE3, Has AVX, No 3DNow, 
  Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus 
  AMD64 processor architecture, 8 CPUs  
  Windows NT  Version 6.2, build 9200,  
  Memory 32705 MB, Free 31350 MB 
  User Virtual Space 134217728 MB, Free 134217718 MB 
 
 ##############################################

 Example Calculation

 MFLOPS  = 10240000 / 1000000 x 32 x 200 / 0.703741 = 93125.17  

 ##############################################

 Example display and log file with errors or odd thread count (4 CPUs Available)

 Data in & out     102400     2    10000   0.097761    20949    See later   No

 Data in & out     102400     2    10000 word  102388 was 0.999999 not 0.764063
 Data in & out     102400     2    10000 word  102389 was 0.999999 not 0.764063

  


MP MFLOPS Comparisons

The tables below show L2 and L3 cache size, the former provided per core and the latter shared between cores. Tests using two arithmetic operations per word are the most likely to be affected by RAM or shared L3 cache speed. With 2 operations, L2 cache sizes of 512 KB, 10.24M Words and 16 threads, performance gains over 8 threads are higher than might be expected. Data per thread is 640 KB, where Windows time slices can use multiple L2 caches more effectively.

The second version of the benchmark has a compile option to use SSE instructions, where Single Instruction Multiple Data (SIMD) instructions could execute a multiply or add on four 32 bit single precision floating point numbers at the same time. Instead, MS C/C++ Version 15 caused the benchmark to run as a conventional scalar process, with one 32 bit number in the 128 bit SSE registers, or Single Instruction Single Data (SISD) operation. In some cases, this is no faster than compilations using the original 8087 floating point instructions.

Recompiling with MS C/C++ Version 18, from Visual Studio 2013, produced full SIMD functions. The theoretical maximum MFLOPS speed of earlier processors with SSE was 4 x number of cores x CPU MHz. Later, a multiply and an add could be executed in the same clock cycle, where maximum speed per core is 8 x CPU MHz. More recently, Advanced Vector Extensions (AVX) were introduced, with 256 bit registers and speeds of up to 16 x CPU MHz per core. Unfortunately, compiling with the AVX option with MS C/C++ Version 18 did not fully implement the new instructions; for details of the assembly code produced, see below. Later CPUs have AVX2, with further performance improvements. Results are for:

Core 2 Duo 2.4 GHz 2 cores - Maximum GFLOPS 4.4 i87, 7.3 SISD, 19.0 SIMD or 4.0 per GHz per core.

Phenom II 3.0 GHz 4 cores - Maximum GFLOPS 15.5 i87, 17.7 SISD, 53.7 SIMD or 4.48 per GHz per core.

Core i7 3.9 GHz 4 cores + Hyperthreading - Maximum GFLOPS 24.3 i87, 24.7 SISD, 98.4 SIMD or 6.30 per GHz per core, AVX 98.2. There is little gain through Hyperthreading; note the similar SIMD and AVX speeds, and the improved performance relative to clock speed compared with the Phenom. This PC has four memory channels, producing exceptional relative performance.
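
As a worked example, based on the per core formulas above and the Core i7 figures in the table below, the theoretical peaks and best measured speeds are:

   SSE SIMD peak  4 cores x 3900 MHz x  8 = 124800 MFLOPS, best measured 98446
   AVX peak       4 cores x 3900 MHz x 16 = 249600 MFLOPS, best measured 98220

The SIMD compilation reaches nearly 80% of its peak, with the AVX compilation clearly not exploiting the wider registers.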


                                         MFLOPS 1 to 16 Threads

 Operations Per Word     2      2      2       8      8      8      32     32     32
      Million Words   0.10   1.02  10.24    0.10   1.02  10.24    0.10   1.02  10.24
             Threads

 Core 2 Duo       1   1509   1419   1112    1981   2038   2014    2301   2292   2280
 512 KB x 2 L2    2   2854   2606   1185    3709   3847   3738    4374   4359   4301
 2400 MHz         4   2854   2606   1185    3709   3847   3738    4374   4359   4301
 i87              8   2957   2853   1265    3875   3918   3595    4440   4362   4424
                 16   3056   2855   2118    3963   3935   3636    4396   4386   4345

 Core 2 Duo       1   2504   2181   1165    3319   3664   3506    3595   3577   3559
 512 KB x 2 L2    2   4791   4497   1197    6907   6824   4481    6841   6755   6718
 2400 MHz         4   4773   4572   1211    5015   7040   4832    6849   6951   6153
 SISD             8   4786   4657   1300    6335   6842   4931    6872   6826   6730
                 16   5100   4608   2341    7311   7095   6840    6711   7018   6810

 Core 2 Duo       1   2915   2554   1156    6347   6006   4672    9303   9217   9101
 512 KB x 2 L2    2   5576   5177   1205   11127  10546   4775   16941  17613  16703
 2400 MHz         4   5627   5227   1197   12371  11686   4775   17783  17512  16636
 SIMD             8   5564   5573   1291   12386  12306   5093   17920  17992  16892
 L3 6MB          16   5548   5519   2321   12967  11094   8537   19047  18020  16783

 Phenom II X4     1   2086   1970   1631    3858   3707   3521    3929   3909   3861
 512 KB x 4 L2    2   4158   3945   2593    7698   7482   6934    7846   7825   7698
 3000 MHz         4   8232   7744   2737   14914   9544  11251   15461  15498  14906
 i87              8   8201   7922   2927   14645  12169  10091   15512  14633  14772
 L3 6MB          16   7424   7199   3408   14883  13697  11652   15187  14100  14428

 Phenom II X4     1   3320   3091   1910    4551   4439   4196    3962   3930   3886
 512 KB x 4 L2    2   6579   6312   2705    9074   8940   8138    7909   7863   7763
 3000 MHz         4  12796  11945   3076   17473  17505  11322   12679  15450  14953
 SISD             8  12775  10843   3073   17620  17722  11304   15378  12972  15053
 L3 6MB          16  11478  10623   4656   17601  17164  11734   15486  15207  14894

 Phenom II X4     1   6539   4328   2127   12022  11210   7326   13656  13339  12781
 512 KB x 4 L2    2  12811   8750   2841   23793  22928  10622   27202  26813  25210
 3000 MHz         4  16543  12395   3182   45783  43472  12134   53746  51504  42746
 SIMD             8  24417  23910   3188   46269  46410  12189   53181  52662  43354
 L3 6MB          16  23607  22774   3647   41679  43777  13327   52156  51361  44364

 Core i7 4820K    1   3867   3853   3386    6085   6054   6017    5830   5824   5809
 256 KB x 4 L2    2   7737   7731   6618   12160  12165  11991   11653  11648  11650
 4 core 8 Thrd    4  15433  15459   9833   23487  24291  23886   22666  23175  23220
 3900 MHz i87     8  15359  15395   9846   23554  23708  23586   23418  23464  23416
 L3 10 MB        16  15145  15192  10023   23422  23536  22966   23241  23401  23282

 Core i7 4820K    1   5004   4960   4192    6188   6182   6135    5890   5890   5887
 256 KB x 4 L2    2   9996  10002   8049   12371  12354  12282   11770  11779  11744
 4 core 8 Thrd    4  19923  18532   9866   23946  24704  24347   23219  23531  23497
 3900 MHz SISD    8  19602  19776   9820   24683  24648  24634   23521  23497  23506
 L3 10 MB        16  18727  19077  10073   24316  24243  24442   23469  23393  23385

 Core i7 4820K    1  10116   9864   5852   24636  24436  19881   23353  23389  23243
 256 KB x 4 L2    2  26453  19851   9189   49181  49223  34969   46653  46759  46414
 4 core 8 Thrd    4  41845  26975  10063   85909  93852  40163   89202  90572  87329
 3900 MHz SIMD    8  58734  43723   9980   97139  98446  40062   91320  93885  93125
 L3 10 MB        16  57731  42194  10178   94166  93338  40074   90162  92102  93496

 Core i7 4820K    1  10046   9901   5906   24629  24382  19832   23411  23361  23246
 256 KB x 4 L2    2  26634  19679   9250   49194  49267  35183   46788  46788  46382
 4 core 8 Thrd    4  52424  39057  10092   60266  98220  39744   90948  90611  92515
 3900 MHz AVX     8  58601  43529  10032   85198  98220  40162   93810  93866  93745
 L3 10 MB        16  57098  42920  10319   86267  95243  40427   92929  92995  92356

   


MP MFLOPS Numeric Answers

As indicated earlier, the calculated result depends on the number of calculations on each data word, results closer to the initial value of 0.999999 indicating fewer calculations. The number depends on Repeat Passes, with a starting value of 2500 x identified CPUs. For example, Repeats of 10000 could be for a quad core CPU and 20000 for a quad core CPU with Hyperthreading.

Results via old i87 calculations are slightly different to those using SSE arithmetic.


 Repeats  5000               10000               20000               40000
 Version   SSE       i87       SSE       i87       SSE       i87       SSE       i87

 a2     0.867359  0.867238  0.764063  0.763849  0.620974  0.620631  0.481454  0.481096
 b2     0.985193  0.985180  0.970753  0.970727  0.942935  0.942883  0.891302  0.891203
 c2     0.998502  0.998501  0.997008  0.997006  0.994032  0.994027  0.988125  0.988114

 a8     0.918220  0.918307  0.850923  0.851082  0.749971  0.750239  0.635325  0.635706
 b8     0.991084  0.991095  0.982342  0.982363  0.965360  0.965401  0.933325  0.933397
 c8     0.999099  0.999101  0.998200  0.998204  0.996409  0.996416  0.992853  0.992862

 a32    0.798973  0.799276  0.660143  0.660653  0.498060  0.498797  0.385106  0.384777
 b32    0.976383  0.976422  0.953631  0.953702  0.910573  0.910709  0.833458  0.833707
 c32    0.997595  0.997602  0.995203  0.995214  0.990447  0.990463  0.981037  0.981068
 
 Key - Words a=102400 x Repeats, b=1024000 x Repeats / 10, c=10240000 x Repeats / 100  
                   2, 8 and 32 are Operations Per Word
 
   


MP MFLOPS Assembly Code

Following are details of the assembly code for the test with 3 multiplies, 4 adds and a subtract per data word. The test loop is mainly unrolled, with these sequences repeated multiple times. For full SSE and AVX, the main compiled loop included 16 arithmetic instructions, giving totals of 8 x 8 and 16 x 8 operations. Then there were other sequences to handle remainders that are not multiples of 64 and 128.

The original benchmark was built with the compiler that produced the inefficient SISD instructions, but the later compiler, supplied with Visual Studio 2013, produced SIMD code, though not a full implementation of AVX functions. A compilation under Linux Ubuntu 14.04 produced the expected AVX sequences, using 256 bit ymm registers rather than the 128 bit xmm varieties.

The GFLOPS figures shown are maximums on a 3.9 GHz CPU that can execute a multiply and an add per cycle, or up to 7.8 x 4 = 31.2 GFLOPS with four cores under SISD. With 3 multiplies out of 8 instructions in the test sequence, 25 GFLOPS looks quite good, as does 98 GFLOPS with SIMD.

 
  Assembly Code Sequences for x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;

         GFLOPS are for four Core i7-4820K running at 3.9 GHz

  Four Cores 24 GFLOPS
  Compiler 15.00.30729.207 for 80x86
 
  fadd     DWORD PTR _a$[esp+4]
  fmul     ST(0), ST(2)
  fld      ST(3)
  faddp    ST(2), ST(0)
  fld      ST(4)
  fmulp    ST(2), ST(0)
  fsubrp   ST(1), ST(0)
  fld      DWORD PTR tv1064[esp+4]
  fadd     ST(0), ST(5)
  fmul     ST(0), ST(6)
  faddp    ST(1), ST(0)

 
  SSE Option - SISD                    SSE Option - SIMD
  ss is scalar single precision        ps is packed single precision
  Four Cores 25 GFLOPS                 Four Cores 98 GFLOPS
  Compiler 15.00.30729.207 for x64     Compiler 18.00.21005.1 for x64
 
  addss    xmm1, xmm8                  addps    xmm1, xmm10
  addss    xmm2, xmm4                  addps    xmm2, xmm6
  addss    xmm0, xmm6                  addps    xmm0, xmm8
  mulss    xmm1, xmm9                  mulps    xmm1, xmm11
  mulss    xmm0, xmm7                  mulps    xmm2, xmm7
  mulss    xmm2, xmm5                  mulps    xmm0, xmm9
  subss    xmm2, xmm0                  subps    xmm2, xmm0
  addss    xmm2, xmm1                  addps    xmm2, xmm1

 
  AVX Option - 128 bit registers       AVX Option - 256 bit registers
  Four Cores 98 GFLOPS                 Four Cores 162 GFLOPS
  Compiler 18.00.21005.1 for x64       Compiler gcc version 4.8.2
 
  vaddps   xmm1, xmm0, xmm6            vaddps   ymm5, ymm6,  ymm13
  vaddps   xmm0, xmm0, xmm8            vaddps   ymm3, ymm6,  ymm7
  vmulps   xmm2, xmm0, xmm9            vaddps   ymm1, ymm6,  ymm6
  vmulps   xmm3, xmm1, xmm7            vmulps   ymm4, ymm13, ymm13
  vsubps   xmm4, xmm3, xmm2            vmulps   ymm2, ymm7,  ymm7
  vaddps   xmm1, xmm5, xmm10           vmulps   ymm0, ymm6,  ymm6
  vmulps   xmm0, xmm1, xmm11           vsubps   ymm7, ymm13, ymm7
  vaddps   xmm2, xmm4, xmm0            vaddps   ymm6, ymm7,  ymm6
   


OpenMP Benchmark

The second benchmark in this series was for OpenMP. This did not have the same word counts as MP MFLOPS (divisible by 64), so the numeric results are slightly different. The benchmarks included are OpenMP32MFLOPS.exe, using i387 floating point, OpenMP64MFLOPS.exe, with default SSE instructions, and SSE32MFLOPS.exe, not using OpenMP. Details and results are in openmp mflops.htm, with the benchmarks and source code downloadable in OpenMPMflops.zip.

This benchmark uses standard C/C++ code, with a pragma directive before loops offered for parallelisation (see below) and a /openmp option added to the compile command. The original compiled to one word at a time SISD SSE instructions (see above) but automatically used all CPU cores, confirmed by the example of using one core, via an Affinity setting, shown below. Measured performance is effectively the same as with the MP MFLOPS benchmark, using the same Core i7 based system.

The benchmark was recompiled with MS C/C++ Version 18, from Visual Studio 2013, but, unlike MP MFLOPS, this did not produce full SIMD functions. Results are below, mainly demonstrating the same performance as the original compilation.

  
                      Results on 3900 MHz Core i7 4820K
 
  C Code Directive

  #pragma omp parallel for
  for(i=0; i < n; i++)
  {
     x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
  }

  Compile Command

  cl /O2 /openmp /MD /W4  /Zi /TP /EHsc /Fa  /c OpenMP64MFLOPS.cpp


      64 Bit OpenMP MFLOPS Benchmark 1 Wed May 28 16:06:54 2014

  Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.059061     8466    0.929538   Yes
 Data in & out    1000000     2      250   0.039231    12745    0.992550   Yes
 Data in & out   10000000     2       25   0.052317     9557    0.999250   Yes

 Data in & out     100000     8     2500   0.103711    19284    0.957117   Yes
 Data in & out    1000000     8      250   0.083413    23977    0.995517   Yes
 Data in & out   10000000     8       25   0.083001    24096    0.999549   Yes

 Data in & out     100000    32     2500   0.360390    22198    0.890211   Yes
 Data in & out    1000000    32      250   0.344797    23202    0.988082   Yes
 Data in & out   10000000    32       25   0.345281    23170    0.998796   Yes


 Example 1 Core - Command Start /Affinity 1 OpenMP64MFLOPS

 Data in & out     100000     8     2500   0.356484     5610    0.957117   Yes
 Data in & out    1000000     8      250   0.329129     6077    0.995517   Yes
 Data in & out   10000000     8       25   0.328513     6088    0.999549   Yes


  Via Microsoft C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

 Data in & out     100000     2     2500   0.037910    13189    0.929538   Yes
 Data in & out    1000000     2      250   0.035475    14095    0.992550   Yes
 Data in & out   10000000     2       25   0.055958     8935    0.999250   Yes

 Data in & out     100000     8     2500   0.086657    23079    0.957117   Yes
 Data in & out    1000000     8      250   0.082454    24256    0.995517   Yes
 Data in & out   10000000     8       25   0.083116    24063    0.999549   Yes

 Data in & out     100000    32     2500   0.343880    23264    0.890211   Yes
 Data in & out    1000000    32      250   0.358404    22321    0.988082   Yes
 Data in & out   10000000    32       25   0.342454    23361    0.998796   Yes

   


Qpar MP Benchmark

With Visual Studio 2012, Microsoft added an "Auto-Parallelizer" to the compiler that can automatically generate multiple threads in the same way as OpenMP. This requires a /Qpar compiler option and a pragma directive, as shown below.

The OpenMP benchmark was compiled this way, via Compiler Version 18.00, supplied with Visual Studio 2013, and produced SIMD instructions, mainly providing performance gains of four times OpenMP results, similar to those in MP MFLOPS. The program was also compiled to use AVX instructions, producing the same disappointing speeds as MP MFLOPS.

The #pragma loop directive cannot use a variable to specify the number of threads, so the program was modified to select from fixed counts of 1, 2, 4, 8 or 16 threads, with a default of 4. The new benchmark is QparMP64MFLOPS.exe, with execution and source files included in GigaFLOPS-Benchmarks.zip. This also has copies of DLL files that might be needed to run the benchmark on different systems. Besides the run time parameter to control the number of threads (T or Threads), others are provided for the number of words (W or Words) and repeat passes (R or Repeats) - see below.
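
The arrangement is roughly as sketched below, with invented naming, where a separately compiled loop is provided for each supported thread count, because hint_parallel() needs a constant rather than a variable.

   // Sketch only - one loop per supported thread count, chosen at run time
   void calcRun(float x[], int n, int threads, float a, float b,
                float c, float d, float e, float f)
   {
      int i;
      if (threads == 1)
      {
         #pragma loop(hint_parallel(1))
         for (i = 0; i < n; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      }
      else if (threads == 2)
      {
         #pragma loop(hint_parallel(2))
         for (i = 0; i < n; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      }
      else                            // default 4 threads, 8 and 16 similar
      {
         #pragma loop(hint_parallel(4))
         for (i = 0; i < n; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      }
   }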

 
  C Code Directive

  #pragma loop(hint_parallel(4))
  for(i=0; i < n; i++)
  {
     x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
  }

  Compile Command

  cl /O2 /Qpar /MD /W4  /Zi /TP /EHsc /Fa /c QparMP64MFLOPS.cpp


   64 Bit Qpar MFLOPS Benchmark 1, 4 Threads, Mon Jun 23 16:27:50 2014

  Via Microsoft C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.010905    45849    0.929538   Yes
 Data in & out    1000000     2      250   0.012933    38661    0.992550   Yes
 Data in & out   10000000     2       25   0.052096     9598    0.999250   Yes

 Data in & out     100000     8     2500   0.023440    85324    0.957117   Yes
 Data in & out    1000000     8      250   0.020773    96279    0.995517   Yes
 Data in & out   10000000     8       25   0.052930    37786    0.999549   Yes

 Data in & out     100000    32     2500   0.087836    91079    0.890211   Yes
 Data in & out    1000000    32      250   0.085585    93475    0.988082   Yes
 Data in & out   10000000    32       25   0.086839    92124    0.998796   Yes


  Run Time Commands

   QparMP64MFLOPS.exe Txxx N1, Wxxx N2, Rxxx N3

   Examples  QparMP64MFLOPS.exe T 16      or  QparMP64MFLOPS.exe Thrds 16
             QparMP64MFLOPS.exe W 200000  or  QparMP64MFLOPS.exe Wds 200000
             QparMP64MFLOPS.exe T 8, W 200000, R 5000
   


Qpar and OpenMP Comparisons

Following are some OpenMP, Qpar and MP-MFLOPS comparisons on two quad core systems. On results not dependent on memory speed, Core i7 Qpar and MP-MFLOPS speeds, using four threads, average four times the OpenMP scores, due to SIMD rather than SISD register use; the gain on the Phenom is not quite as large.

With Qpar tests, the Phenom speeds, using eight threads, tend to be much slower than with four threads. The Core i7 degradation is not as severe and is probably helped by Hyperthreading. With MP-MFLOPS, and CPU speed limited tests, performance with eight threads tends to be four times that when using a single thread, Hyperthreading providing a higher gain on the i7.

The 10.24 MB tests, with fewer calculations, suggest that multiple threads are needed to obtain the highest RAM throughput, the i7 also being influenced by its 10 MB L3 cache.

The i7 PC has four memory channels, compared with two on the Phenom, and its RAM is also faster. The CPU MHz ratio is 1.30, but CPU dependent speeds on the i7 average around 1.75 times those on the Phenom, rising to 2.5 times where caching effects apply and to more than three times via RAM.


                                         MFLOPS 1 to 16 Threads

 Operations Per Word     2      2      2       8      8      8      32     32     32
      Million Words   0.10   1.02  10.24    0.10   1.02  10.24    0.10   1.02  10.24
             Threads

 Phenom X4        1   1800   1866   1524    3585   3686   3523    3647   3653   3628
 OpenMP           4   5408   7045   2964   11891  14194  11178   13494  14006  13842

 Phenom           1   6003   4190   2116   11585  11203   7321   14087  13768  13151
 X4               2  10999   8653   2843   23194  22732  10567   27421  27538  25513
 4 Core           4  17699  15915   3125   36186  43438  11926   47664  53064  43406
 3000 MHz         8   8158   3208   2948   25464  29383  10189   32648  36188  32475
 Qpar            16   9937  10422   3200   34765  42399  12125   41226  53077  39954

 Phenom           1   6539   4328   2127   12022  11210   7326   13656  13339  12781
 4 Core           2  12811   8750   2841   23793  22928  10622   27202  26813  25210
 3000 MHz         4  16543  12395   3182   45783  43472  12134   53746  51504  42746
 3000 MHz         8  24417  23910   3188   46269  46410  12189   53181  52662  43354
 MP-MFLOPS       16  23607  22774   3647   41679  43777  13327   52156  51361  44364


 Core i7 4820K    1   3612   3802   3549    6002   6136   6100    5845   5870   5879
 OpenMP           4  13189  14095   8935   23079  24256  24063   23264  22321  23361

 Core i7          1  10181   9972   5842   24458  24086  19646   23497  23533  23373
 4820K            2  25378  19873   9186   47674  48861  34432   46546  46983  46560
 4 core 8 Thrd    4  45194  39092   9839   85928  95689  37602   90159  93022  90933
 3900 MHz         8  42665  38325   9761   75672  88846  37919   88217  91233  86306
 Qpar            16  18840  35358   9757   66481  90022  38735   83196  91050  87909

 Core i7          1  10116   9864   5852   24636  24436  19881   23353  23389  23243
 4820K            2  26453  19851   9189   49181  49223  34969   46653  46759  46414
 4 core 8 Thrd    4  41845  26975  10063   85909  93852  40163   89202  90572  87329
 3900 MHz         8  58734  43723   9980   97139  98446  40062   91320  93885  93125
 MP-MFLOPS       16  57731  42194  10178   94166  93338  40074   90162  92102  93496
   


Reliability Tests

MP-MFLOPS and CUDA-MFLOPS benchmarks have a command parameter to run for extended periods, for burn-in purposes or reliability verification. The parameter specifies the run time in minutes, and one test, possibly representing the highest CPU/GPU loading, is run. With MP-MFLOPS, this uses 102400 words at 32 operations per word, where the number of threads can also be varied. CUDA-MFLOPS uses the last of the Extra Tests, with 10 million words and 32 operations per word, where the default is Calculate, but an added FC (fast cache) option uses the faster Shared Memory test (see example below).

The benchmarks display and log MFLOPS speeds, normally at intervals of 15 to 20 seconds, via a calibrated Repeat Passes count, and numeric results are checked for consistency. Measured times can increase, and MFLOPS fall, if CPU/GPU clock speeds are automatically reduced, due to overheating or power saving.
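
An outline of the timing arrangement is sketched below in C, with assumed details and single threaded for simplicity; the repeat count is first calibrated so each measured block lasts about 15 seconds, then blocks are run and logged until the requested number of minutes has passed.

   #include <stdio.h>
   #include <time.h>

   #define WORDS 102400                  /* reliability test data size */
   static float x[WORDS];

   /* one measured block - returns elapsed seconds for the given passes */
   static double runBlock(long passes)
   {
      float a = 0.0001f, b = 0.9999f, c = 0.0002f,
            d = 0.4999f, e = 0.0003f, f = 0.5001f;
      long p;
      int  i;
      clock_t start = clock();
      for (p = 0; p < passes; p++)
         for (i = 0; i < WORDS; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      return (double)(clock() - start) / CLOCKS_PER_SEC;
   }

   int main(void)
   {
      int    i, minutes = 10, ops = 8;   /* 8 ops/word here, real test uses 32 */
      double interval = 15.0, secs;
      long   passes = 1000;
      time_t endTime;

      for (i = 0; i < WORDS; i++) x[i] = 0.999999f;

      /* calibrate - scale the repeat count to run for about 15 seconds */
      secs   = runBlock(passes);
      passes = (long)(passes * interval / secs);

      endTime = time(NULL) + 60L * minutes;
      while (time(NULL) < endTime)
      {
         secs = runBlock(passes);
         printf(" %8ld passes %10.6f seconds %8.0f MFLOPS\n", passes, secs,
                (double)WORDS / 1000000.0 * ops * passes / secs);
         /* the real benchmarks also recheck the numeric results here */
      }
      return 0;
   }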

The first entries below show the log format of the two programs, followed by a BAT file to run 10 minute tests, with MP-MFLOPS restricted to four out of eight threads on a quad core i7 4820K. The graphics card is a GeForce GTX 650. Next are the results, with CUDA-MFLOPS running with no performance degradation, but stealing some CPU time from MP-MFLOPS to reduce that program's speed.

Temperatures were monitored during the test, using Asus AI Suite II for the CPU (case temperature) and Asus GPU Tweak for the GeForce graphics processor. There is nothing unusual about the recorded temperatures, shown below, but such measurements are useful for later comparisons, in the event of system failures.


 8 CPUs Available
 ##############################################

  64 Bit MP SSE MFLOPS Benchmark C2, 8 Threads, Tue Jul 01 10:39:20 2014

        C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400    32   427524  14.963640    93621    0.352168   Yes
 Data in & out     102400    32   427524  14.964366    93616    0.352168   Yes
 Data in & out     102400    32   427524  14.979681    93521    0.352168   Yes
 Data in & out     102400    32   427524  14.987834    93470    0.352168   Yes


 ##############################################

  CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.31 Tue Jul 01 11:40:41 2014

  Shared Memory  Reliability Test 1 minutes, report every 15 seconds

  Results of all calculations should be -    0.5065064430236816

  Test Seconds   MFLOPS    Errors     First               Value
                                       Word 

   1    14.324   429967   None found
   2    14.324   429970   None found
   3    14.324   429968   None found
   4    14.325   429955   None found

 ##############################################

 Run.bat 10 minute test 

 Start cuda3mflops-x64sp Minutes 10 FC
 Start MPmflopsc2 Minutes 10, Threads 4

 ##############################################

 CUDA

 Test Seconds   MFLOPS    Errors

  22    14.361   429554   None found
  23    14.375   429133   None found
  24    14.360   429598   None found

 ##############################################

 MP-MFLOPS 4 Threads

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400    32   288902  13.039905    72598    0.352168   Yes
 Data in & out     102400    32   288902  13.505035    70098    0.352168   Yes
 Data in & out     102400    32   288902  13.992925    67654    0.352168   Yes

 Just CPU Test
 Data in & out     102400    32   288902  10.430468    90760    0.352168   Yes

 ##############################################

 Temperatures  - Room 27 °C
                                Minutes                                   Spec
            0    1    2    3    4    5    6    7    8    9   10  Increase  Max

  CPU °C   39   42   46   48   49   50   50   51   52   53   53    14     66.8
  GPU °C   31   49   54   56   58   59   60   60   60   60   60    29     98   

   



Roy Longbottom at Linkedin  Roy Longbottom July 2014

The Official Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection