
More OpenMP Parallel Computing Benchmarks

Contents

MemSpeed
Example Log Files
Different Version Results
Results On A Different Processor
Other Benchmark Compilations

General

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers. The first benchmark, described in OpenMP MFLOPS, executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark, with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code.

It was decided to compile other existing benchmarks using the same Microsoft compiler and OpenMP directive, the first one being the Linpack Benchmark, where performance is mainly governed by a loop containing

   dy[i] = dy[i] + da * dx[i]             
The speed measured by the OpenMP version was unexpectedly and extremely slow. So, it was decided to produce a variation of the MemSpeed Benchmark, with the same calculations, but using progressively larger data sizes to test caches and RAM. Other benchmarks were also converted to identify other slow functions. Some of these showed that careless use of OpenMP leads to programs producing wrong and inconsistent numeric results.
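The Linpack loop quoted above, with the OpenMP directive applied, can be sketched as follows. This is a minimal illustration, not the benchmark source: the function name daxpy_omp and its parameter list are assumptions, and only the loop body and the "#pragma omp parallel for" directive come from this page.

```c
#include <assert.h>
#include <stddef.h>

/* Linpack-style inner loop: dy[i] = dy[i] + da * dx[i].
   daxpy_omp is a hypothetical name used for illustration only.
   The pragma takes effect only when OpenMP is enabled at compile time
   (/openmp with the Microsoft compiler, -fopenmp with GCC); otherwise
   it is ignored and the loop runs serially. */
void daxpy_omp(int n, double da, const double *dx, double *dy)
{
    int i;

    assert(n >= 0 && dx != NULL && dy != NULL);

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        dy[i] = dy[i] + da * dx[i];
    }
}
```

Each iteration writes a different element of dy, so the iterations are independent and the directive does not change the numeric results, only the speed.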

The new benchmarks are included for download in OpenMPMflops.zip. No installation is necessary - Extract All and click on EXE files.

The OpenMP benchmarks have also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software) - see linux benchmarks.htm and linux openmp benchmarks.htm, and download the benchmark execution files, source code, and compile and run instructions in linux_openmp.tar.gz. Under Windows, the file was saved wrongly as linux_openmp.tar.tar but was fine when renamed linux_openmp.tar.gz.



MemSpeed

The MemSpeed benchmark employs three different sequences of operations on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers, using two data arrays:

   Sum to register   r = r + x [m] * y[m] (Integer + y [m])
   Sum to memory     x[m] = x[m] + y[m]                    
   Memory to memory  x[m] = y[m]                           
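The three sequences listed above can be written as plain C loops. The sketch below shows the single precision case only; the function names are hypothetical and the real benchmark code may be organised quite differently.

```c
#include <assert.h>

/* Sum to register: r = r + x[m] * y[m] */
float sum_to_register(const float *x, const float *y, int n)
{
    float r = 0.0f;
    int m;

    assert(n >= 0);
    for (m = 0; m < n; m++) {
        r = r + x[m] * y[m];
    }
    return r;
}

/* Sum to memory: x[m] = x[m] + y[m] */
void sum_to_memory(float *x, const float *y, int n)
{
    int m;

    assert(n >= 0);
    for (m = 0; m < n; m++) {
        x[m] = x[m] + y[m];
    }
}

/* Memory to memory: x[m] = y[m] */
void memory_to_memory(float *x, const float *y, int n)
{
    int m;

    assert(n >= 0);
    for (m = 0; m < n; m++) {
        x[m] = y[m];
    }
}
```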
   

MemSpd2K, the latest standard version, uses assembly code to execute the same instructions as the original MemSpeed benchmark. This special version for OpenMP is again all C code, with the first linked triad tests returning results to memory via:

   Sum to memory     x[m] = x[m] + r * y[m]                

The memory tested doubles at each step from 4 KB up to 25% of RAM size, in order to exercise all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For tests using arithmetic operations, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision.
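The MB/second to MFLOPS rule above is a single division. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Convert a measured data reading speed in MB/second to MFLOPS using
   the rule stated in the text: divide by 4 for single precision and
   by 8 for double precision. mflops_from_mbps is an illustrative name. */
double mflops_from_mbps(double mb_per_sec, int bytes_per_value)
{
    assert(bytes_per_value == 4 || bytes_per_value == 8);
    return mb_per_sec / (double)bytes_per_value;
}
```

For example, a single precision speed of 9921 MB/second corresponds to about 2480 MFLOPS.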



Example Log Files

Below are OpenMP (MemSpdOMP.exe) results produced from running on a Quad CPU Phenom processor using 64-Bit Windows 7 and those for the same code produced without the OpenMP compiler parameter (MemSpdNotOMP.exe). The programs each identify the system hardware and software as shown before performance details. Of particular note are the extremely slow OpenMP speeds for the smaller data sizes.

The slowest original OpenMP floating point benchmark results on this PC were 1920 MFLOPS using one CPU and 5587 MFLOPS with four processors. This was at 100,000 words or 400 KBytes. This MemSpeed version is similar at 512 KB, with 9921 MB/second or 2480 single precision MFLOPS with one CPU, and 22009 MB/second or 5502 MFLOPS with four CPUs using OpenMP. The single processor speeds are faster with less data, using L1 cache but, unexpectedly, those for OpenMP are progressively slower.

As with other benchmarks running on this system, more than one processor must be used to obtain maximum throughput from RAM.


  CPUID and RDTSC Assembly Code
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
  AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow, 
  Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
  Intel processor architecture, 4 CPUs 
  Windows NT  Version 6.1, build 7600, 
  Memory 4096 MB, Free 4096 MB
  User Virtual Space 4096 MB, Free 3005 MB


  OpenMP Version

      Memory Reading Speed Test OpenMP Version 4.0 by Roy Longbottom

      0.100 seconds per test, Start Wed Oct 13 12:27:26 2010

  Memory    x[m]=x[m]+s*y[m] Int+  x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl   Int    Dble   Sngl   Int    Dble   Sngl   Int
   Used     MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

      4      418    436    439    438    441    449    222    224    225
      8      874    862    866    849    873    867    443    445    443
     16     1727   1713   1700   1730   1708   1737    878    853    873
     32     3341   3234   3263   3378   3218   3287   1724   1680   1647
     64     6123   5792   5978   6280   5922   6024   3156   3103   3052
    128    10822   9932  10085  11262   9666  10149   5848   5335   5481
    256    17639  15485  16134  18178  15582  16453   9879   8871   8853
    512    25742  22009  22123  26990  21379  22327  13959  12877  13138
   1024    33657  27622  26572  35721  27548  27919  19185  16918  16260
   2048    37554  30171  31756  37599  31174  30073  22600  18869  19298
   4096    24280  22284  23117  26256  22540  22471  14475  11822  12494
   8192    16476  13555  15907  18268  14493  15129   9479   7495   8435
  16384     7394   7137   7077   7743   7004   7248   3920   3697   3692
  32768     7387   6969   7184   7644   7167   7124   3987   3618   3752
  65536     7486   7188   7240   7733   6975   7077   3974   3725   3773
 131072     7462   7163   7249   7775   7197   7258   3976   3603   3654
 262144     7578   7207   7280   7816   7208   7223   4029   3632   3812
 524288     7652   7405   7344   8009   7331   7487   4084   3837   3825
1048576     7720   7373   7469   8012   7181   7480   4112   3837   3789

                End of test Wed Oct 13 12:28:05 2010

 
  Normal Compilation

      Memory Reading Speed Test Version 4.0 by Roy Longbottom

      0.100 seconds per test, Start Wed Oct 13 12:26:33 2010

  Memory    x[m]=x[m]+s*y[m] Int+  x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl   Int    Dble   Sngl   Int    Dble   Sngl   Int
   Used     MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

      4    22924  11651  12725  23949  12063  12721  15055   7771   9346
      8    23536  11839  13242  24553  12230  13677  15577   7855   9488
     16    23834  11887  12790  24828  12294  13728  15816   7957   9557
     32    23407  11902  13659  23941  12159  12991  15478   7913   9482
     64    23669  11847  12910  24528  12337  13543  15626   7913   9464
    128    14703   9926  10290  14750  10443  10243   8688   6701   6981
    256    14644   9906  10175  14884  10130  10166   8593   6680   6927
    512    14302   9921  10171  14895  10376  10188   8611   6687   6899
   1024     8246   7091   7026   8596   7017   7190   4509   3911   3976
   2048     8166   6976   7142   8545   7019   7125   4452   3880   3937
   4096     8006   6898   6984   8392   6836   7003   4469   3788   3857
   8192     4416   3983   4175   4530   4037   4169   2341   2157   2202
  16384     4244   3888   3993   4484   3826   4010   2298   2093   2135
  32768     4249   3885   3958   4467   3888   3966   2256   2095   2123
  65536     4235   3892   3929   4424   3875   3991   2293   2079   2137
 131072     4264   3894   3965   4487   3904   3980   2302   2092   2125
 262144     4279   3870   3991   4394   3903   4007   2305   2090   2131
 524288     4235   3873   3968   4423   3906   3998   2222   2073   2127
1048576     4297   3922   3976   4520   3913   3976   2325   2107   2142

                End of test Wed Oct 13 12:27:12 2010



Results From Different Versions

The OpenMP version was also run using Task Manager Processes Affinity options to execute using one and two processors. These produced the same sort of speeds as the OpenMP log above, using the smaller data sizes. Viewing the Threads column, in Task Manager Processes, shows that four threads are used irrespective of the number of CPUs selected by Affinity settings. Calculations indicate that there is an OpenMP startup overhead, for all these tests, of around 9 microseconds with this Phenom processor. Note that, with the normal compilation, the time to read 100 KB is about 9 microseconds.
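The overhead figure can be approximated from the logged speeds, since the time for one pass is the data size divided by the measured speed. The sketch below assumes 1 KB is 1024 bytes and 1 MB is 1,000,000 bytes; the benchmark's own convention is not stated here, so results are approximate. The function name is hypothetical.

```c
#include <assert.h>

/* Approximate time for one pass over the data, in microseconds.
   Assumption: 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes, so that
   1 MB/second equals 1 byte per microsecond. */
double usecs_per_pass(double kbytes, double mb_per_sec)
{
    assert(kbytes > 0.0 && mb_per_sec > 0.0);
    return (kbytes * 1024.0) / mb_per_sec;
}
```

With the OpenMP log above, 4 KB at 436 MB/second gives a little over 9 microseconds per pass, which is almost entirely startup overhead at that data size.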

The speeds of the OpenMP tests, relative to those for the normal compilation, are shown in the graph. Maximum speeds are only achieved with data in the 6144 KB L3 cache. Performance with the larger data sizes is limited by RAM speed.


Single Precision Floating Point x[m]=x[m]+s*y[m]






Results On Different Processors

Following are results of single and double precision calculations of the x[m]=x[m]+s*y[m] tests on a PC with a Core 2 Duo using 64-Bit Vista. The first two columns are for normal compilations, without OpenMP. The next four columns show data transfer speeds using one and two cores with OpenMP functions. Next are loss and gain ratios for the single precision speeds, where dual core throughput improvement is associated with data in the shared 4096 KB L2 cache. The last column reflects startup overheads of at least 9 microseconds.
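The loss and gain ratios are the OpenMP speed divided by the normal compilation speed for the same data size. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Ratio of OpenMP speed to normal compilation speed. Values below 1.0
   are a loss from using OpenMP, values above 1.0 are a gain.
   loss_gain_ratio is an illustrative name. */
double loss_gain_ratio(double openmp_mbps, double normal_mbps)
{
    assert(normal_mbps > 0.0);
    return openmp_mbps / normal_mbps;
}
```

Using single precision figures from the Core 2 Duo table, 553 / 9185 is about 0.06 at 4 KB with one CPU, and 8937 / 7686 is about 1.16 at 256 KB with two CPUs.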

Later results shown are for a dual core Core i5 that also has Hyperthreading, which is why its log reports "Intel processor architecture, 4 CPUs".


  CPUID and RDTSC Assembly Code
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz Measured 2402 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
  Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
  Intel processor architecture, 2 CPUs 
  Windows NT  Version 6.0, build 6002, Service Pack 2
  Memory 4095 MB, Free 1079 MB
  User Virtual Space 4096 MB, Free 3018 MB

                            x[m]=x[m]+s*y[m]
                                                       Loss/Gain SP
           Not OpenMP    OpenMP 1 CPU  OpenMP 2 CPUs   1 CPU 2 CPUs  2 CPUs
  KBytes   Dble   Sngl    Dble   Sngl    Dble   Sngl    Sngl   Sngl   usecs
    Used   MB/S   MB/S    MB/S   MB/S    MB/S   MB/S   ratio  ratio   /pass

       4  18490   9185     547    553     425    413    0.06   0.04       9
       8  18631   9349    1051   1005     842    830    0.11   0.09      10
      16  18903   9467    1903   1827    1681   1630    0.19   0.17      10
      32  18739   9487    3059   2831    2558   2640    0.30   0.28      11
      64  11535   7631    4552   3986    5148   4751    0.52   0.62      14
     128  11626   7584    6150   5234    7553   6765    0.69   0.89      18
     256  11634   7686    7263   5815   10645   8937    0.76   1.16      30
     512  11632   7524    8375   6395   12273  10469    0.85   1.39      46
    1024  11605   7638    8362   7131   13733   9631    0.93   1.26      87
    2048  11408   7298    8998   7118   15255  11028    0.98   1.51     162
    4096   8626   7057    7792   5856   13525  10211    0.83   1.45     350
    8192   4287   4222    3667   3685    4367   4318    0.87   1.02    2287
   16384   3690   3532    3360   3510    3409   3718    0.99   1.05    4421
   32768   3284   3431    3472   3315    2815   3017    0.97   0.88    9166
   65536   3572   3460    3458   3452    3570   3602    1.00   1.04   19270
  131072   3485   3550    3429   3376    3656   3466    0.95   0.98   36268
  262144   3504   3570    3638   2990    3727   3564    0.84   1.00   70469
  524288   3650   3533    3130   3500    3737   3637    0.99   1.03  143996
 1048576   3696   3534    3616   3598    3603   3002    1.02   0.85  285017


  CPUID and RDTSC Assembly Code
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206A7
  Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz Measured 1596 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
  Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
  Intel processor architecture, 4 CPUs 
  Windows NT  Version 6.1, build 7601, Service Pack 1
  Memory 4096 MB, Free 4096 MB
  User Virtual Space 4096 MB, Free 3006 MB

                       x[m]=x[m]+s*y[m]

           Not OpenMP     OpenMP 2 CPUs    Loss/Gain   2 CPUs
  KBytes   Dble   Sngl     Dble   Sngl    Dble   Sngl   usecs
    Used   MB/S   MB/S     MB/S   MB/S   ratio  ratio   /pass

       4  19157   9719      250    262    0.01   0.03      15
       8  19932  10030      718    697    0.04   0.07      11
      16  20002   9768     1413   1372    0.07   0.14      12
      32  19766  10046     2723   2587    0.14   0.26      12
      64  17504   9708     4940   4536    0.28   0.47      15
     128  17415  10066     8351   7018    0.48   0.70      17
     256  17368   9676    12771   9624    0.74   0.99      25
     512   9736   6919    15949  11184    1.64   1.62      54
    1024   9944   6919    14707  10785    1.48   1.56      91
    2048   9763   6815    16064  10940    1.65   1.61     177
    4096   7895   6077    10684   9087    1.35   1.50     421
    8192   7646   6045     9156   8920    1.20   1.48     966
   16384   7643   5942     9096   9179    1.19   1.54    1751
   32768   7658   6031     9528   9655    1.24   1.60    3475
   65536   7718   6045    10187   9730    1.32   1.61    5767
  131072   7734   6061     9572   9638    1.24   1.59   14493
  262144   7934   6117    10563   9588    1.33   1.57   27239
  524288   8137   6248    10492  10612    1.29   1.70   49118
 1048576   8138   6221    11311  10512    1.39   1.69   98708







Other Benchmark Compilations

The Livermore Loops Benchmark was converted to use OpenMP. This is the 1970’s benchmark that set the standards for the first supercomputers (Cray 1 onwards). It has 24 kernels of numerical application with performance measured in MFLOPS. Each kernel produces a double precision floating point checksum to demonstrate accuracy of the system being tested and this can vary slightly, depending on the compiler and options used. My C++ program checks these numbers against those built-in for a particular compilation (for use as a reliability/burn-in test). The kernels are run three times using decreasing memory demands, mainly starting at 8 KB for each of one or more arrays.
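The checksum test described above can be sketched as a comparison of each kernel's result with a stored expected value. This is an assumption-laden illustration: the function name is hypothetical, and exact equality is inferred only from the log lines further down, where results differing in the last printed digit are reported.

```c
#include <assert.h>
#include <stdio.h>

/* Compare a kernel checksum with the built-in expected value.
   Returns 1 on an exact match; otherwise prints a report line in the
   style of the logs on this page and returns 0. The exponent width in
   the printed numbers depends on the C library used. */
int check_result(int section, int test, double result, double expected)
{
    assert(section >= 1 && test >= 1);
    if (result == expected) {
        return 1;
    }
    printf("Section %d Test %2d  result was %.15e expected %.15e\n",
           section, test, result, expected);
    return 0;
}
```

Exact comparison suits a reliability or burn-in test, where the same compiled program is expected to reproduce the same bits on every run.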

The first results below are for the normal compilation, with checksums identical to the first successful run. The source code for this version includes the “#pragma omp parallel for” directives, but they have no effect because the /openmp compiler parameter is not specified. The other results are for runs with these directives enabled by using the /openmp compiler parameter. Kernels 16 and 17 have no loops to which the pragma can be applied.

The next results are with OpenMP using four processors, where a few tests are slightly faster than above, but many are much slower. Even worse, the calculations do not produce the same checksum numeric results and repeated runs show that the value can be unpredictable. The third results are with OpenMP using one CPU (but two threads), where identical wrong checksums appear to be produced on repeating the benchmark.

There are a number of other OpenMP programming options and the simple directive used here is not suitable for many of the kernels. Anything more complex than the MemSpeed x[m]=x[m]+r*y[m] needs careful consideration to ensure that instructions are executed in a consistent sequence and functions run long enough to absorb startup delays. Maybe it is best to leave it to a compiler that can ensure that the correct and most efficient procedures are used.
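One common cause of inconsistent results with the simple directive is a loop that accumulates into a variable shared by all threads. Whether this is what happens in each failing kernel is an assumption, as the cause is not identified here. The sketch below, with a hypothetical function name, shows the standard OpenMP reduction clause that avoids the problem for a sum.

```c
#include <assert.h>

/* Inner product with a safe parallel accumulation.
   dot_reduction is an illustrative name, not a Livermore kernel. */
double dot_reduction(const double *x, const double *y, int n)
{
    double sum = 0.0;
    int i;

    assert(n >= 0);

    /* With a plain "#pragma omp parallel for", every thread would
       update the shared variable sum at the same time, a data race
       that can give a different total on each run. The reduction
       clause gives each thread a private partial sum and combines
       them at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Even with the reduction clause, the partial sums are added in a different order from the serial loop, so a floating point checksum may still differ in its last digits from one produced without OpenMP.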

Later results are for a Core i5 (dual) CPU, showing the same degradation effects with Intel.


  AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz

  Normal MFLOPS for 24 loops

 2622.5 1851.1  887.0 1454.3  336.3  779.3 3405.7 3011.2 2861.3 1428.9  207.0 1394.6
  280.2  559.9 1162.3  989.0  999.2 2087.7  522.9 1177.1 1815.8  282.1  964.3  661.7

  Numeric results were as expected


  OpenMP MFLOPS for 24 loops

  522.9    6.2  210.0  133.9  193.1   86.5 1560.6  371.6  189.8   99.4   98.6  108.2
   44.5  228.4  279.3  939.7  999.2  154.5   32.9  480.1   22.3  159.0  116.6  108.0

  Section 1 Test  6  result was 4.312366077873135e+003 expected 4.375116344729986e+003
  Section 1 Test 13  result was 1.202533952702805e+011 expected 1.202533961842805e+011
  Section 1 Test 14  result was 3.165549299821230e+009 expected 3.165553044000335e+009
  Section 1 Test 20  result was 3.042067004051425e+007 expected 3.040644339351239e+007

  Section 2 Test 13  result was 9.816387759644356e+010 expected 9.816387810944356e+010
  Section 2 Test 19  result was 5.421816884714813e+002 expected 5.421816960147207e+002

  Section 3 Test 19  result was 1.268230668053491e+001 expected 1.268230698051004e+001


  Different Results Next Run

  Section 1 Test  6  result was 4.345898038418117e+003 expected 4.375116344729986e+003
  Section 1 Test 14  result was 3.165550475680920e+009 expected 3.165553044000335e+009
  Section 1 Test 19  result was 5.421816884714813e+002 expected 5.421816960147207e+002
  Section 1 Test 20  result was 3.042636088846063e+007 expected 3.040644339351239e+007

  Section 3 Test 19  result was 1.268230698051474e+001 expected 1.268230698051004e+001


  Affinity Set To Use 1 CPU - Consistent Results

  MFLOPS for 24 loops

  466.8    6.6  182.7  106.8  141.2  216.7 1169.0  359.1  186.4   93.3   76.4  104.9
   42.3  233.6  235.2  892.8 1001.5  152.8   32.9  838.0   22.7  117.1  113.4  101.3

  Section 1 Test  2  result was 1.542092319263005e+003 expected 1.539721811668385e+003
  Section 1 Test 19  result was 5.421816947167190e+002 expected 5.421816960147207e+002

  Section 2 Test  2  result was 1.542092319263005e+003 expected 1.539721811668385e+003
  Section 2 Test 19  result was 5.421816947167190e+002 expected 5.421816960147207e+002

  Section 3 Test  2  result was 3.958295105509222e+001 expected 3.953296986903060e+001
  Section 3 Test  3  result was 2.699309089320673e-001 expected 2.699309089320672e-001
  Section 3 Test 19  result was 1.268230657539253e+001 expected 1.268230698051004e+001


  Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz Measured 1596 MHz

  Normal MFLOPS for 24 loops

 2094.0 1711.7  964.3 1254.7  286.7  809.9 2761.5 3030.6 3373.6 1285.8  256.4 1127.4
  520.9  681.1  864.9 1250.6 1001.4 1547.4  568.4  892.5 1645.5  238.5  941.4  902.4

  OpenMP MFLOPS for 24 loops

  359.3    4.8  141.7   74.9  104.1  134.4  745.8  221.2  110.0   61.2   67.0   71.8
   30.8  208.0  175.5  873.2  696.8   80.3   20.9  502.8   15.1  102.2   79.0   73.3






Roy Longbottom April 2012



The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection