More OpenMP Parallel Computing Benchmarks
General
OpenMP is a system independent set of compiler directives and library routines that arranges automatic parallel processing of shared memory data when more than one processor is available. This option is supported by the latest Microsoft C++ compilers.
The first benchmark, described in OpenMP MFLOPS, executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark, with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code.
It was decided to compile other existing benchmarks using the same Microsoft compiler and OpenMP directive, the first one being the Linpack Benchmark, where performance is mainly governed by a loop containing:
dy[i] = dy[i] + da * dx[i]
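As a minimal sketch of what such a conversion could look like, the function below applies a "#pragma omp parallel for" directive to this calculation. The name daxpy_omp and its argument list are illustrative assumptions, not the benchmark's own source. Compilers ignore the directive unless OpenMP is enabled (/openmp for Microsoft C++, -fopenmp for GCC), so the same code builds either way.

```c
#include <assert.h>

/* Hypothetical helper: dy[i] = dy[i] + da * dx[i] for i in [0, n).
   Each iteration touches only its own element, so the iterations are
   independent and can be divided among threads. */
void daxpy_omp(int n, double da, const double *dx, double *dy)
{
    int i;
    assert(n >= 0);
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        dy[i] = dy[i] + da * dx[i];
    }
}
```

Every call of a function like this starts a new parallel region, so any fixed startup cost is paid each time, however short the loop.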
The speed measured by the OpenMP version was unexpectedly slow. So it was decided to produce a variation of the MemSpeed Benchmark, with the same calculations, but using data that occupies progressively more memory, to test caches and RAM.
Other benchmarks were also converted to identify other slow functions. Some of these showed that careless use of OpenMP leads to programs producing wrong and inconsistent numeric results.
The new benchmarks are included for download in OpenMPMflops.zip. No installation is necessary: Extract All and click on the EXE files.
The OpenMP benchmarks have also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software). See linux benchmarks.htm and linux openmp benchmarks.htm, and download benchmark execution files, source code, compile and run instructions in linux_openmp.tar.gz.
When downloaded using Windows, the file was wrongly named linux_openmp.tar.tar but was fine when renamed linux_openmp.tar.gz.
MemSpeed
The MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers, via two data arrays:
Sum to register: r = r + x[m] * y[m] (Integer: + y[m])
Sum to memory: x[m] = x[m] + y[m]
Memory to memory: x[m] = y[m]
MemSpd2K, the latest standard version, uses assembly code to execute the same instructions as the original MemSpeed benchmark. This special version for OpenMP is again all C code, with the first linked triad tests returning results to memory via:
Sum to memory x[m] = x[m] + r * y[m]
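The three sequences used by the OpenMP version can be sketched in C as follows, shown here for single precision only. The function names are hypothetical, and where the directives are placed is an assumption; the benchmark's own source may differ.

```c
#include <assert.h>

/* Linked triad, results returned to memory: x[m] = x[m] + r * y[m] */
void triad(int n, float r, float *x, const float *y)
{
    int m;
    assert(n >= 0);
    #pragma omp parallel for
    for (m = 0; m < n; m++) x[m] = x[m] + r * y[m];
}

/* Sum to memory: x[m] = x[m] + y[m] */
void sum_to_memory(int n, float *x, const float *y)
{
    int m;
    assert(n >= 0);
    #pragma omp parallel for
    for (m = 0; m < n; m++) x[m] = x[m] + y[m];
}

/* Memory to memory: x[m] = y[m] */
void memory_to_memory(int n, float *x, const float *y)
{
    int m;
    assert(n >= 0);
    #pragma omp parallel for
    for (m = 0; m < n; m++) x[m] = y[m];
}
```

The double precision and integer variants would differ only in the element type.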
The memory tested doubles in size from 4 KB up to 25% of RAM size, to use all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For tests using arithmetic operations, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision.
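The arithmetic in this paragraph can be sketched as two small helpers. Both names are hypothetical, and the size count assumes that 25% of RAM is an inclusive upper limit.

```c
#include <assert.h>

/* MB/second to MFLOPS using the divisors given in the text:
   4 for single precision, 8 for double precision. */
double mflops_from_mbps(double mb_per_sec, int bytes_per_word)
{
    assert(bytes_per_word == 4 || bytes_per_word == 8);
    return mb_per_sec / (double)bytes_per_word;
}

/* Number of test sizes: start at 4 KB and double up to 25% of RAM. */
int count_test_sizes(long long ram_mb)
{
    long long limit_kb = ram_mb * 1024LL / 4LL;
    long long kb;
    int count = 0;
    assert(ram_mb > 0);
    for (kb = 4; kb <= limit_kb; kb *= 2) count++;
    return count;
}
```

For example, 9921 MB/second in single precision converts to about 2480 MFLOPS, and a system with 4096 MB of RAM gives 19 sizes, from 4 KB to 1048576 KB.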
Example Log Files
Below are OpenMP (MemSpdOMP.exe) results produced from running on a Quad CPU Phenom processor using 64-Bit Windows 7 and those for the same code produced without the OpenMP compiler parameter (MemSpdNotOMP.exe).
Each program identifies the system hardware and software, as shown, before the performance details. Of particular note are the extremely slow OpenMP speeds for the smaller data sizes.
The slowest original OpenMP floating point benchmark results on this PC were 1920 MFLOPS using one CPU and 5587 MFLOPS with four processors. This was at 100,000 words or 400 KBytes. This MemSpeed version is similar at 512 KB, with 9921 MB/second or 2480 single precision MFLOPS with one CPU, and 22009 MB/second or 5502 MFLOPS with four CPUs using OpenMP. The single processor speeds are faster with less data, using L1 cache but, unexpectedly, those for OpenMP are progressively slower.
As with other benchmarks running on this system, more than one processor must be used to obtain maximum throughput from RAM.
CPUID and RDTSC Assembly Code
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
Intel processor architecture, 4 CPUs
Windows NT Version 6.1, build 7600,
Memory 4096 MB, Free 4096 MB
User Virtual Space 4096 MB, Free 3005 MB
OpenMP Version
Memory Reading Speed Test OpenMP Version 4.0 by Roy Longbottom
0.100 seconds per test, Start Wed Oct 13 12:27:26 2010
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int Dble Sngl Int Dble Sngl Int
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 418 436 439 438 441 449 222 224 225
8 874 862 866 849 873 867 443 445 443
16 1727 1713 1700 1730 1708 1737 878 853 873
32 3341 3234 3263 3378 3218 3287 1724 1680 1647
64 6123 5792 5978 6280 5922 6024 3156 3103 3052
128 10822 9932 10085 11262 9666 10149 5848 5335 5481
256 17639 15485 16134 18178 15582 16453 9879 8871 8853
512 25742 22009 22123 26990 21379 22327 13959 12877 13138
1024 33657 27622 26572 35721 27548 27919 19185 16918 16260
2048 37554 30171 31756 37599 31174 30073 22600 18869 19298
4096 24280 22284 23117 26256 22540 22471 14475 11822 12494
8192 16476 13555 15907 18268 14493 15129 9479 7495 8435
16384 7394 7137 7077 7743 7004 7248 3920 3697 3692
32768 7387 6969 7184 7644 7167 7124 3987 3618 3752
65536 7486 7188 7240 7733 6975 7077 3974 3725 3773
131072 7462 7163 7249 7775 7197 7258 3976 3603 3654
262144 7578 7207 7280 7816 7208 7223 4029 3632 3812
524288 7652 7405 7344 8009 7331 7487 4084 3837 3825
1048576 7720 7373 7469 8012 7181 7480 4112 3837 3789
End of test Wed Oct 13 12:28:05 2010
Normal Compilation
Memory Reading Speed Test Version 4.0 by Roy Longbottom
0.100 seconds per test, Start Wed Oct 13 12:26:33 2010
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int Dble Sngl Int Dble Sngl Int
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 22924 11651 12725 23949 12063 12721 15055 7771 9346
8 23536 11839 13242 24553 12230 13677 15577 7855 9488
16 23834 11887 12790 24828 12294 13728 15816 7957 9557
32 23407 11902 13659 23941 12159 12991 15478 7913 9482
64 23669 11847 12910 24528 12337 13543 15626 7913 9464
128 14703 9926 10290 14750 10443 10243 8688 6701 6981
256 14644 9906 10175 14884 10130 10166 8593 6680 6927
512 14302 9921 10171 14895 10376 10188 8611 6687 6899
1024 8246 7091 7026 8596 7017 7190 4509 3911 3976
2048 8166 6976 7142 8545 7019 7125 4452 3880 3937
4096 8006 6898 6984 8392 6836 7003 4469 3788 3857
8192 4416 3983 4175 4530 4037 4169 2341 2157 2202
16384 4244 3888 3993 4484 3826 4010 2298 2093 2135
32768 4249 3885 3958 4467 3888 3966 2256 2095 2123
65536 4235 3892 3929 4424 3875 3991 2293 2079 2137
131072 4264 3894 3965 4487 3904 3980 2302 2092 2125
262144 4279 3870 3991 4394 3903 4007 2305 2090 2131
524288 4235 3873 3968 4423 3906 3998 2222 2073 2127
1048576 4297 3922 3976 4520 3913 3976 2325 2107 2142
End of test Wed Oct 13 12:27:12 2010
Results From Different Versions
The OpenMP version was also run with Task Manager Processes Affinity options set to use one and two processors. At the smaller data sizes, these runs produced much the same speeds as the OpenMP log above.
Viewing the Threads column, in Task Manager Processes, shows that four threads are used irrespective of the number of CPUs selected by Affinity settings.
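The thread count can also be read from inside a program. The sketch below uses the standard OpenMP call omp_get_num_threads inside a parallel region; the function name team_size is hypothetical, and the code falls back to one thread when compiled without OpenMP. Whether the reported count follows Affinity settings depends on the OpenMP runtime in use.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of threads in a default parallel region. */
int team_size(void)
{
    int n = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* One thread records the team size; the others wait at the
           implicit barrier at the end of the single construct. */
        #pragma omp single
        n = omp_get_num_threads();
    }
#endif
    assert(n >= 1);
    return n;
}
```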
Calculations indicate that there is an OpenMP startup overhead, for all these tests, of around 9 microseconds with this Phenom processor.
Note that, with the normal compilation, the time to read 100 KB is about 9 microseconds.
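One way to arrive at a figure of this kind is to compare the time for one pass over the data with and without OpenMP. This is a sketch under stated assumptions, not necessarily the calculation used: the function names are hypothetical, 1 KB is taken as 1024 bytes and 1 MB as 1,000,000 bytes.

```c
#include <assert.h>

/* Microseconds for one pass over kbytes of data at mb_per_sec.
   Assumes 1 KB = 1024 bytes and 1 MB = 1,000,000 bytes, so that
   1 MB/second equals 1 byte per microsecond. */
double usecs_per_pass(double kbytes, double mb_per_sec)
{
    assert(mb_per_sec > 0.0);
    return kbytes * 1024.0 / mb_per_sec;
}

/* Extra time per pass of the OpenMP run over the normal compilation. */
double openmp_overhead_usecs(double kbytes, double omp_mbps, double normal_mbps)
{
    return usecs_per_pass(kbytes, omp_mbps) - usecs_per_pass(kbytes, normal_mbps);
}
```

Using the single precision x[m]=x[m]+s*y[m] figures at 4 KB from the Phenom logs (436 MB/second with OpenMP, 11651 MB/second without), the difference is close to 9 microseconds per pass.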
The speeds of the OpenMP tests, relative to those for the normal compilation, are shown in the graph. Maximum speeds are only achieved with data in the 6144 KB L3 cache. Performance with the larger data sizes is limited by RAM speed.
Graph: Single Precision Floating Point x[m]=x[m]+s*y[m]
Results From Different Processors
Following are results of single and double precision calculations of the x[m]=x[m]+s*y[m] tests on a PC with a Core 2 Duo using 64-Bit Vista. The first two columns are for normal compilations, without OpenMP. The next four columns show data transfer speeds using one and two cores with OpenMP functions. Next are loss and gain ratios for the single precision speeds, where dual core throughput improvement is associated with data in the shared 4096 KB L2 cache. The last column reflects startup overheads of at least 9 microseconds.
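The loss and gain ratios appear to be the OpenMP speed divided by the speed of the normal compilation at the same data size. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* Ratio below 1.0 is a loss from using OpenMP, above 1.0 a gain. */
double loss_gain_ratio(double openmp_mbps, double normal_mbps)
{
    assert(normal_mbps > 0.0);
    return openmp_mbps / normal_mbps;
}
```

With the single precision figures in the table that follows, 553 against 9185 MB/second at 4 KB gives 0.06, and 10469 against 7524 MB/second at 512 KB gives 1.39.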
Later results shown are for a dual core Core i5 that also has Hyperthreading, which is why its log reports "Intel processor architecture, 4 CPUs".
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
Intel processor architecture, 2 CPUs
Windows NT Version 6.0, build 6002, Service Pack 2
Memory 4095 MB, Free 1079 MB
User Virtual Space 4096 MB, Free 3018 MB
x[m]=x[m]+s*y[m]
Loss/Gain SP
Not OpenMP OpenMP 1 CPU OpenMP 2 CPUs 1 CPU 2 CPUs 2 CPUs
KBytes Dble Sngl Dble Sngl Dble Sngl Sngl Sngl usecs
Used MB/S MB/S MB/S MB/S MB/S MB/S ratio ratio /pass
4 18490 9185 547 553 425 413 0.06 0.04 9
8 18631 9349 1051 1005 842 830 0.11 0.09 10
16 18903 9467 1903 1827 1681 1630 0.19 0.17 10
32 18739 9487 3059 2831 2558 2640 0.30 0.28 11
64 11535 7631 4552 3986 5148 4751 0.52 0.62 14
128 11626 7584 6150 5234 7553 6765 0.69 0.89 18
256 11634 7686 7263 5815 10645 8937 0.76 1.16 30
512 11632 7524 8375 6395 12273 10469 0.85 1.39 46
1024 11605 7638 8362 7131 13733 9631 0.93 1.26 87
2048 11408 7298 8998 7118 15255 11028 0.98 1.51 162
4096 8626 7057 7792 5856 13525 10211 0.83 1.45 350
8192 4287 4222 3667 3685 4367 4318 0.87 1.02 2287
16384 3690 3532 3360 3510 3409 3718 0.99 1.05 4421
32768 3284 3431 3472 3315 2815 3017 0.97 0.88 9166
65536 3572 3460 3458 3452 3570 3602 1.00 1.04 19270
131072 3485 3550 3429 3376 3656 3466 0.95 0.98 36268
262144 3504 3570 3638 2990 3727 3564 0.84 1.00 70469
524288 3650 3533 3130 3500 3737 3637 0.99 1.03 143996
1048576 3696 3534 3616 3598 3603 3002 1.02 0.85 285017
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206A7
Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz Measured 1596 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
Intel processor architecture, 4 CPUs
Windows NT Version 6.1, build 7601, Service Pack 1
Memory 4096 MB, Free 4096 MB
User Virtual Space 4096 MB, Free 3006 MB
x[m]=x[m]+s*y[m]
Not OpenMP OpenMP 2 CPUs Loss/Gain 2 CPUs
KBytes Dble Sngl Dble Sngl Dble Sngl usecs
Used MB/S MB/S MB/S MB/S ratio ratio /pass
4 19157 9719 250 262 0.01 0.03 15
8 19932 10030 718 697 0.04 0.07 11
16 20002 9768 1413 1372 0.07 0.14 12
32 19766 10046 2723 2587 0.14 0.26 12
64 17504 9708 4940 4536 0.28 0.47 15
128 17415 10066 8351 7018 0.48 0.70 17
256 17368 9676 12771 9624 0.74 0.99 25
512 9736 6919 15949 11184 1.64 1.62 54
1024 9944 6919 14707 10785 1.48 1.56 91
2048 9763 6815 16064 10940 1.65 1.61 177
4096 7895 6077 10684 9087 1.35 1.50 421
8192 7646 6045 9156 8920 1.20 1.48 966
16384 7643 5942 9096 9179 1.19 1.54 1751
32768 7658 6031 9528 9655 1.24 1.60 3475
65536 7718 6045 10187 9730 1.32 1.61 5767
131072 7734 6061 9572 9638 1.24 1.59 14493
262144 7934 6117 10563 9588 1.33 1.57 27239
524288 8137 6248 10492 10612 1.29 1.70 49118
1048576 8138 6221 11311 10512 1.39 1.69 98708
Other Benchmark Compilations
The Livermore Loops Benchmark was converted to use OpenMP. This is the 1970s benchmark that set the standards for the first supercomputers (Cray 1 onwards). It has 24 kernels of numerical applications, with performance measured in MFLOPS.
Each kernel produces a double precision floating point checksum to demonstrate accuracy of the system being tested and this can vary slightly, depending on the compiler and options used. My C++ program checks these numbers against those built-in for a particular compilation (for use as a reliability/burn-in test).
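A check of this kind could be sketched as below. This is an assumption about the approach, not the program's actual code: values are compared after formatting to 16 significant digits, the precision shown in the logs, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 when result and expected agree to 16 significant digits. */
int checksum_matches(double result, double expected)
{
    char a[40], b[40];
    snprintf(a, sizeof a, "%.15e", result);
    snprintf(b, sizeof b, "%.15e", expected);
    return strcmp(a, b) == 0;
}

/* Prints a line in the style of the logs when a checksum differs.
   Returns 1 for a match and 0 for a mismatch. */
int report_checksum(int section, int test, double result, double expected)
{
    assert(section >= 1 && test >= 1);
    if (checksum_matches(result, expected)) return 1;
    printf("Section %d Test %d result was %.15e expected %.15e\n",
           section, test, result, expected);
    return 0;
}
```

The number of exponent digits printed depends on the C library, so output may show e+03 where the logs here show e+003.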
The kernels are run three times using decreasing memory demands, mainly starting at 8 KB for each of one or more arrays.
The first results below are for the normal compilation, with checksums identical to those from the first successful run. The source code includes the “#pragma omp parallel for” directives but, without the /openmp compiler parameter, they are ignored. The other results are for runs with these directives enabled by using /openmp. Kernels 16 and 17 have no loops to which the pragma can be applied.
The next results are with OpenMP using four processors, where a few tests are slightly faster than above, but many are much slower. Even worse, the calculations do not produce the same numeric checksums, and repeated runs show that the values can be unpredictable.
The third results are with OpenMP using one CPU (but two threads), where identical wrong checksums appear to be produced on repeating the benchmark.
There are a number of other OpenMP programming options and the simple directive used here is not suitable for many of the kernels. Anything more complex than the MemSpeed x[m]=x[m]+r*y[m] needs careful consideration to ensure that instructions are executed in a consistent sequence and functions run long enough to absorb startup delays. Maybe it is best to leave it to a compiler that can ensure that the correct and most efficient procedures are used.
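As one illustration of such an option, a loop that accumulates a sum needs more than the simple directive. With "#pragma omp parallel for" alone, all threads would update the same variable at once, which is a data race and can give different answers from run to run. The reduction clause gives each thread a private partial sum and combines them at the end. The sketch below is an inner product style sum with hypothetical names, not code taken from the benchmark.

```c
#include <assert.h>

/* Inner product q = sum of z[k] * x[k] for k in [0, n). */
double inner_product(int n, const double *z, const double *x)
{
    double q = 0.0;
    int k;
    assert(n >= 0);
    /* reduction(+:q): private partial sums, added together on exit. */
    #pragma omp parallel for reduction(+:q)
    for (k = 0; k < n; k++) {
        q += z[k] * x[k];
    }
    return q;
}
```

Even with the reduction, the additions are carried out in a different order from the serial loop, so the last digits of a floating point checksum may differ. Loops where each iteration depends on the result of the previous one cannot be handled this way at all.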
Later results are for a dual core Core i5 CPU, showing the same degradation effects with Intel.
AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
Normal MFLOPS for 24 loops
2622.5 1851.1 887.0 1454.3 336.3 779.3 3405.7 3011.2 2861.3 1428.9 207.0 1394.6
280.2 559.9 1162.3 989.0 999.2 2087.7 522.9 1177.1 1815.8 282.1 964.3 661.7
Numeric results were as expected
OpenMP MFLOPS for 24 loops
522.9 6.2 210.0 133.9 193.1 86.5 1560.6 371.6 189.8 99.4 98.6 108.2
44.5 228.4 279.3 939.7 999.2 154.5 32.9 480.1 22.3 159.0 116.6 108.0
Section 1 Test 6 result was 4.312366077873135e+003 expected 4.375116344729986e+003
Section 1 Test 13 result was 1.202533952702805e+011 expected 1.202533961842805e+011
Section 1 Test 14 result was 3.165549299821230e+009 expected 3.165553044000335e+009
Section 1 Test 20 result was 3.042067004051425e+007 expected 3.040644339351239e+007
Section 2 Test 13 result was 9.816387759644356e+010 expected 9.816387810944356e+010
Section 2 Test 19 result was 5.421816884714813e+002 expected 5.421816960147207e+002
Section 3 Test 19 result was 1.268230668053491e+001 expected 1.268230698051004e+001
Different Results Next Run
Section 1 Test 6 result was 4.345898038418117e+003 expected 4.375116344729986e+003
Section 1 Test 14 result was 3.165550475680920e+009 expected 3.165553044000335e+009
Section 1 Test 19 result was 5.421816884714813e+002 expected 5.421816960147207e+002
Section 1 Test 20 result was 3.042636088846063e+007 expected 3.040644339351239e+007
Section 3 Test 19 result was 1.268230698051474e+001 expected 1.268230698051004e+001
Affinity Set To Use 1 CPU - Consistent Results
MFLOPS for 24 loops
466.8 6.6 182.7 106.8 141.2 216.7 1169.0 359.1 186.4 93.3 76.4 104.9
42.3 233.6 235.2 892.8 1001.5 152.8 32.9 838.0 22.7 117.1 113.4 101.3
Section 1 Test 2 result was 1.542092319263005e+003 expected 1.539721811668385e+003
Section 1 Test 19 result was 5.421816947167190e+002 expected 5.421816960147207e+002
Section 2 Test 2 result was 1.542092319263005e+003 expected 1.539721811668385e+003
Section 2 Test 19 result was 5.421816947167190e+002 expected 5.421816960147207e+002
Section 3 Test 2 result was 3.958295105509222e+001 expected 3.953296986903060e+001
Section 3 Test 3 result was 2.699309089320673e-001 expected 2.699309089320672e-001
Section 3 Test 19 result was 1.268230657539253e+001 expected 1.268230698051004e+001
Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz Measured 1596 MHz
Normal MFLOPS for 24 loops
2094.0 1711.7 964.3 1254.7 286.7 809.9 2761.5 3030.6 3373.6 1285.8 256.4 1127.4
520.9 681.1 864.9 1250.6 1001.4 1547.4 568.4 892.5 1645.5 238.5 941.4 902.4
OpenMP MFLOPS for 24 loops
359.3 4.8 141.7 74.9 104.1 134.4 745.8 221.2 110.0 61.2 67.0 71.8
30.8 208.0 175.5 873.2 696.8 80.3 20.9 502.8 15.1 102.2 79.0 73.3
Roy Longbottom April 2012
The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection