More OpenMP Parallel Computing Benchmarks
General
OpenMP is a system-independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers.
The first benchmark, described in OpenMP MFLOPS, executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark, with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code.
It was decided to compile other existing benchmarks using the same Microsoft compiler and OpenMP directive, the first one being the Linpack Benchmark, where performance is mainly governed by a loop containing
dy[i] = dy[i] + da * dx[i]
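As a rough illustration of the loop in question, the following is a minimal C sketch of a daxpy-style routine with the simple OpenMP directive applied. The function name `daxpy_omp` and its signature are assumptions for illustration, not the actual Linpack benchmark source.

```c
/* Hypothetical daxpy-style routine: dy = dy + da * dx.
   The pragma is honoured only when OpenMP is enabled at compile time
   (/openmp with Microsoft C++, -fopenmp with GCC). Otherwise it is
   ignored and the loop runs serially. */
void daxpy_omp(int n, double da, const double *dx, double *dy)
{
    int i;  /* signed loop index, as required by older OpenMP versions */

    #pragma omp parallel for
    for (i = 0; i < n; i++)
    {
        dy[i] = dy[i] + da * dx[i];
    }
}
```

Each iteration touches only its own element of dy, so the loop is safe to divide between threads. Whether it runs faster depends on how much work each call contains relative to the thread startup cost.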
The speed measured by the OpenMP version was unexpectedly very slow. So, it was decided to produce a variation of the MemSpeed Benchmark, with the same calculations, but using data that occupies increasing memory sizes to test caches and RAM.
Other benchmarks were also converted to identify other slow functions. Some of these showed that careless use of OpenMP leads to programs producing wrong and inconsistent numeric results.
The new benchmarks are included for download in OpenMPMflops.zip.
No installation is necessary - Extract All and click on EXE files.
The OpenMP benchmarks have also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software) - see linux benchmarks.htm and linux openmp benchmarks.htm, and download benchmark execution files, source code, compile and run instructions in linux_openmp.tar.gz.
Using Windows, the file downloaded wrongly as linux_openmp.tar.tar but was fine when renamed linux_openmp.tar.gz.
MemSpeed
The MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers, via two data arrays:
Sum to register     r = r + x[m] * y[m]   (Integer + y[m])
Sum to memory       x[m] = x[m] + y[m]
Memory to memory    x[m] = y[m]
MemSpd2K, the latest standard version, uses assembly code to execute the same instructions as the original MemSpeed benchmark. This special version for OpenMP is again all C code, with the first linked triad tests returning results to memory via:
Sum to memory x[m] = x[m] + r * y[m]
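The three tests of the OpenMP version can be pictured with the following C sketch for the single precision case. The function names and signatures are hypothetical, and the benchmark's own source may be organised quite differently.

```c
/* Hypothetical single precision forms of the three OpenMP MemSpeed tests.
   Without OpenMP enabled at compile time the pragmas are ignored. */

/* Triad, returning results to memory: x[m] = x[m] + r * y[m] */
void triad_sp(int n, float r, float *x, const float *y)
{
    int m;
    #pragma omp parallel for
    for (m = 0; m < n; m++)
        x[m] = x[m] + r * y[m];
}

/* Sum to memory: x[m] = x[m] + y[m] */
void sum_sp(int n, float *x, const float *y)
{
    int m;
    #pragma omp parallel for
    for (m = 0; m < n; m++)
        x[m] = x[m] + y[m];
}

/* Memory to memory: x[m] = y[m] */
void copy_sp(int n, float *x, const float *y)
{
    int m;
    #pragma omp parallel for
    for (m = 0; m < n; m++)
        x[m] = y[m];
}
```

All three loops write only to their own element of x, which is why this form can be divided between threads without changing the numeric results.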
Memory tested doubles up from 4 KB to 25% of RAM size, to use all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For tests using arithmetic operations, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision.
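The MB/second to MFLOPS conversion above can be written as a small helper. This is a sketch only, and the function name is an assumption. The likely reasoning, for the triad test, is that each element reads two words and performs two floating point operations, giving one operation per word read.

```c
/* Convert a measured data reading speed in MB/second to MFLOPS,
   following the rule in the text: divide by 4 for single precision
   (4 byte words) and by 8 for double precision (8 byte words).
   Hypothetical helper, not part of the benchmark source. */
double mflops_from_mbps(double mb_per_sec, int bytes_per_word)
{
    return mb_per_sec / (double)bytes_per_word;
}
```

For example, 9921 MB/second in single precision corresponds to 9921 / 4, or about 2480 MFLOPS.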
Example Log Files
Below are OpenMP (MemSpdOMP.exe) results produced from running on a Quad CPU Phenom processor using 64-Bit Windows 7 and those for the same code produced without the OpenMP compiler parameter (MemSpdNotOMP.exe).
The programs each identify the system hardware and software as shown before performance details. Of particular note are the extremely slow OpenMP speeds for the smaller data sizes.
The slowest original OpenMP floating point benchmark results on this PC were 1920 MFLOPS using one CPU and 5587 MFLOPS with four processors. This was at 100,000 words or 400 KBytes. This MemSpeed version is similar at 512 KB, with 9921 MB/second or 2480 single precision MFLOPS with one CPU, and 22009 MB/second or 5502 MFLOPS with four CPUs using OpenMP. The single processor speeds are faster with less data, using L1 cache but, unexpectedly, those for OpenMP are progressively slower.
As with other benchmarks running on this system, more than one processor is needed to obtain maximum throughput from RAM.
CPUID and RDTSC Assembly Code
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
Intel processor architecture, 4 CPUs
Windows NT Version 6.1, build 7600,
Memory 4096 MB, Free 4096 MB
User Virtual Space 4096 MB, Free 3005 MB
OpenMP Version
Memory Reading Speed Test OpenMP Version 4.0 by Roy Longbottom
0.100 seconds per test, Start Wed Oct 13 12:27:26 2010
  Memory   x[m]=x[m]+s*y[m] Int+          x[m]=x[m]+y[m]               x[m]=y[m]
  KBytes    Dble    Sngl     Int    Dble    Sngl     Int    Dble    Sngl     Int
    Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S
       4     418     436     439     438     441     449     222     224     225
       8     874     862     866     849     873     867     443     445     443
      16    1727    1713    1700    1730    1708    1737     878     853     873
      32    3341    3234    3263    3378    3218    3287    1724    1680    1647
      64    6123    5792    5978    6280    5922    6024    3156    3103    3052
     128   10822    9932   10085   11262    9666   10149    5848    5335    5481
     256   17639   15485   16134   18178   15582   16453    9879    8871    8853
     512   25742   22009   22123   26990   21379   22327   13959   12877   13138
    1024   33657   27622   26572   35721   27548   27919   19185   16918   16260
    2048   37554   30171   31756   37599   31174   30073   22600   18869   19298
    4096   24280   22284   23117   26256   22540   22471   14475   11822   12494
    8192   16476   13555   15907   18268   14493   15129    9479    7495    8435
   16384    7394    7137    7077    7743    7004    7248    3920    3697    3692
   32768    7387    6969    7184    7644    7167    7124    3987    3618    3752
   65536    7486    7188    7240    7733    6975    7077    3974    3725    3773
  131072    7462    7163    7249    7775    7197    7258    3976    3603    3654
  262144    7578    7207    7280    7816    7208    7223    4029    3632    3812
  524288    7652    7405    7344    8009    7331    7487    4084    3837    3825
 1048576    7720    7373    7469    8012    7181    7480    4112    3837    3789
End of test Wed Oct 13 12:28:05 2010
Normal Compilation
Memory Reading Speed Test Version 4.0 by Roy Longbottom
0.100 seconds per test, Start Wed Oct 13 12:26:33 2010
  Memory   x[m]=x[m]+s*y[m] Int+          x[m]=x[m]+y[m]               x[m]=y[m]
  KBytes    Dble    Sngl     Int    Dble    Sngl     Int    Dble    Sngl     Int
    Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S
       4   22924   11651   12725   23949   12063   12721   15055    7771    9346
       8   23536   11839   13242   24553   12230   13677   15577    7855    9488
      16   23834   11887   12790   24828   12294   13728   15816    7957    9557
      32   23407   11902   13659   23941   12159   12991   15478    7913    9482
      64   23669   11847   12910   24528   12337   13543   15626    7913    9464
     128   14703    9926   10290   14750   10443   10243    8688    6701    6981
     256   14644    9906   10175   14884   10130   10166    8593    6680    6927
     512   14302    9921   10171   14895   10376   10188    8611    6687    6899
    1024    8246    7091    7026    8596    7017    7190    4509    3911    3976
    2048    8166    6976    7142    8545    7019    7125    4452    3880    3937
    4096    8006    6898    6984    8392    6836    7003    4469    3788    3857
    8192    4416    3983    4175    4530    4037    4169    2341    2157    2202
   16384    4244    3888    3993    4484    3826    4010    2298    2093    2135
   32768    4249    3885    3958    4467    3888    3966    2256    2095    2123
   65536    4235    3892    3929    4424    3875    3991    2293    2079    2137
  131072    4264    3894    3965    4487    3904    3980    2302    2092    2125
  262144    4279    3870    3991    4394    3903    4007    2305    2090    2131
  524288    4235    3873    3968    4423    3906    3998    2222    2073    2127
 1048576    4297    3922    3976    4520    3913    3976    2325    2107    2142
End of test Wed Oct 13 12:27:12 2010
Results From Different Versions
The OpenMP version was also run using Task Manager Processes Affinity options to execute using one and two processors. These produced the same sort of speeds as the OpenMP log above, using the smaller data sizes.
Viewing the Threads column, in Task Manager Processes, shows that four threads are used irrespective of the number of CPUs selected by Affinity settings.
Calculations indicate that there is an OpenMP startup overhead, for all these tests, of around 9 microseconds with this Phenom processor.
Note that, with the normal compilation, the time to read 100 KB is about 9 microseconds.
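The overhead estimate can be reproduced approximately from the logged speeds. The sketch below is an assumption about how such a calculation might be done, not the author's actual method. It takes 1 KB as 1024 bytes and 1 MB as 1,000,000 bytes, and the function names are hypothetical.

```c
/* Microseconds needed for one pass over kbytes of data at a measured
   speed in MB/second. Assumes 1 KB = 1024 bytes and 1 MB = 1,000,000
   bytes, so 1 MB/second equals 1 byte per microsecond. The benchmark's
   own definitions may differ. */
double usecs_per_pass(double kbytes, double mb_per_sec)
{
    return (kbytes * 1024.0) / mb_per_sec;
}

/* Estimated OpenMP startup overhead per pass: time with OpenMP minus
   time for the same data size with the normal compilation. */
double omp_overhead_usecs(double kbytes, double omp_mbps, double normal_mbps)
{
    return usecs_per_pass(kbytes, omp_mbps)
         - usecs_per_pass(kbytes, normal_mbps);
}
```

Using the Phenom single precision figures at 4 KB from the logs above (436 MB/second with OpenMP, 11651 MB/second without), this gives roughly 9.4 minus 0.4, or about 9 microseconds per pass.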
The speeds of the OpenMP tests, relative to those for the normal compilation, are shown in the graph. Maximum speeds are only achieved with data in the 6144 KB L3 cache. Performance with the larger data sizes is limited by RAM speed.
Graph: Single Precision Floating Point x[m]=x[m]+s*y[m]
Results From Different Processors
Following are results of single and double precision calculations of the x[m]=x[m]+s*y[m] tests on a PC with a Core 2 Duo using 64-Bit Vista. The first two columns are for normal compilations, without OpenMP. The next four columns show data transfer speeds using one and two cores with OpenMP functions. Next are loss and gain ratios for the single precision speeds, where dual core throughput improvement is associated with data in the shared 4096 KB L2 cache. The last column reflects startup overheads of at least 9 microseconds.
Later results shown are for a dual core Core i5 that also has Hyperthreading (see configuration details - Intel processor architecture, 4 CPUs).
Then, there are full results for a 4 core, 8 thread Core i7, with and without using OpenMP. Here, the impact of OpenMP is even worse with the smaller data sizes, with the single thread version being up to 100 times faster. There are performance gains of up to 3.85 times using shared L3 cache and 2.0 times using RAM.
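The loss and gain ratios quoted here appear to be the OpenMP speed divided by the speed of the normal compilation at the same data size. A minimal sketch, with a hypothetical function name:

```c
/* Loss/gain ratio: values below 1.0 mean OpenMP was slower than the
   normal compilation, values above 1.0 mean it was faster.
   Hypothetical helper for reading the tables. */
double loss_gain_ratio(double omp_mbps, double normal_mbps)
{
    return omp_mbps / normal_mbps;
}
```

For instance, the Core i7 single precision x[m]=x[m]+y[m] speeds at 8192 KB (67789 MB/second with OpenMP, 17608 MB/second without) give a ratio of about 3.85.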
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
Intel processor architecture, 2 CPUs
Windows NT Version 6.0, build 6002, Service Pack 2
Memory 4095 MB, Free 1079 MB
User Virtual Space 4096 MB, Free 3018 MB
 x[m]=x[m]+s*y[m]
                                                            Loss/Gain SP
              Not OpenMP    OpenMP 1 CPU   OpenMP 2 CPUs   1 CPU  2 CPUs  2 CPUs
  KBytes    Dble    Sngl    Dble    Sngl    Dble    Sngl    Sngl    Sngl   usecs
    Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S   ratio   ratio   /pass
       4   18490    9185     547     553     425     413    0.06    0.04       9
       8   18631    9349    1051    1005     842     830    0.11    0.09      10
      16   18903    9467    1903    1827    1681    1630    0.19    0.17      10
      32   18739    9487    3059    2831    2558    2640    0.30    0.28      11
      64   11535    7631    4552    3986    5148    4751    0.52    0.62      14
     128   11626    7584    6150    5234    7553    6765    0.69    0.89      18
     256   11634    7686    7263    5815   10645    8937    0.76    1.16      30
     512   11632    7524    8375    6395   12273   10469    0.85    1.39      46
    1024   11605    7638    8362    7131   13733    9631    0.93    1.26      87
    2048   11408    7298    8998    7118   15255   11028    0.98    1.51     162
    4096    8626    7057    7792    5856   13525   10211    0.83    1.45     350
    8192    4287    4222    3667    3685    4367    4318    0.87    1.02    2287
   16384    3690    3532    3360    3510    3409    3718    0.99    1.05    4421
   32768    3284    3431    3472    3315    2815    3017    0.97    0.88    9166
   65536    3572    3460    3458    3452    3570    3602    1.00    1.04   19270
  131072    3485    3550    3429    3376    3656    3466    0.95    0.98   36268
  262144    3504    3570    3638    2990    3727    3564    0.84    1.00   70469
  524288    3650    3533    3130    3500    3737    3637    0.99    1.03  143996
 1048576    3696    3534    3616    3598    3603    3002    1.02    0.85  285017
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206A7
Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz Measured 1596 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
Intel processor architecture, 4 CPUs
Windows NT Version 6.1, build 7601, Service Pack 1
Memory 4096 MB, Free 4096 MB
User Virtual Space 4096 MB, Free 3006 MB
 x[m]=x[m]+s*y[m]
              Not OpenMP   OpenMP 2 CPUs       Loss/Gain  2 CPUs
  KBytes    Dble    Sngl    Dble    Sngl    Dble    Sngl   usecs
    Used    MB/S    MB/S    MB/S    MB/S   ratio   ratio   /pass
       4   19157    9719     250     262    0.01    0.03      15
       8   19932   10030     718     697    0.04    0.07      11
      16   20002    9768    1413    1372    0.07    0.14      12
      32   19766   10046    2723    2587    0.14    0.26      12
      64   17504    9708    4940    4536    0.28    0.47      15
     128   17415   10066    8351    7018    0.48    0.70      17
     256   17368    9676   12771    9624    0.74    0.99      25
     512    9736    6919   15949   11184    1.64    1.62      54
    1024    9944    6919   14707   10785    1.48    1.56      91
    2048    9763    6815   16064   10940    1.65    1.61     177
    4096    7895    6077   10684    9087    1.35    1.50     421
    8192    7646    6045    9156    8920    1.20    1.48     966
   16384    7643    5942    9096    9179    1.19    1.54    1751
   32768    7658    6031    9528    9655    1.24    1.60    3475
   65536    7718    6045   10187    9730    1.32    1.61    5767
  131072    7734    6061    9572    9638    1.24    1.59   14493
  262144    7934    6117   10563    9588    1.33    1.57   27239
  524288    8137    6248   10492   10612    1.29    1.70   49118
 1048576    8138    6221   11311   10512    1.39    1.69   98708
############################################################
Windows 8.1 64-Bit, Core i7-4820K 3.7 GHz, 4 Channel DDR3 1600 MHz RAM
CPUID and RDTSC Assembly Code
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4
Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz Measured 3711 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
Intel processor architecture, 8 CPUs
Windows NT Version 6.2, build 9200,
Memory 4096 MB, Free 4096 MB
User Virtual Space 4096 MB, Free 2999 MB
Memory Reading Speed Test OpenMP Version 4.0 by Roy Longbottom
0.100 seconds per test, Start Tue Sep 30 10:24:44 2014
  Memory   x[m]=x[m]+s*y[m] Int+          x[m]=x[m]+y[m]               x[m]=y[m]
  KBytes    Dble    Sngl     Int    Dble    Sngl     Int    Dble    Sngl     Int
    Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S
       4     329     328     324     331     346     345     174     173     173  L1
       8     685     697     685     684     689     687     347     345     345
      16    1380    1362    1353    1381    1365    1369     698     683     666
      32    2727    2675    2711    2705    2708    2703    1358    1361    1375
      64    5250    5257    5242    5359    5221    5278    2711    2659    2670  L2
     128   10368    9981    9925   10466   10119   10164    5286    5156    5174
     256   19247   17801   17303   19893   18320   18449   10314    9539    9515
     512   33203   28253   28718   34933   30428   30399   18218   16123   16077  L3
    1024   48844   38945   40108   52994   42481   42635   27973   22975   23084
    2048   65318   49487   50672   68134   55093   48589   36840   30226   30346
    4096   79834   56326   58847   85096   63436   63283   45528   36018   35250
    8192   83167   59969   61809   87526   67789   66920   45066   38250   38200
   16384   26091   25915   25962   26063   26029   26043   13128   13026   13003  RAM
   32768   24690   23614   24635   24782   24723   24611   12502   12398   12381
   65536   24678   24595   24661   24865   24739   24760   12382   12511   12469
  131072   25203   25127   25129   25307   25101   25146   12752   12691   12673
  262144   25489   24881   25358   25433   25297   25346   12645   12777   12748
  524288   25639   25093   25400   25495   24977   25445   12838   12722   12825
 1048576   25953   26054   25955   25926   25957   26063   13043   13033   12999
Not OpenMP
Memory Reading Speed Test Version 4.0 by Roy Longbottom
0.100 seconds per test, Start Tue Sep 30 10:21:40 2014
  Memory   x[m]=x[m]+s*y[m] Int+          x[m]=x[m]+y[m]               x[m]=y[m]
  KBytes    Dble    Sngl     Int    Dble    Sngl     Int    Dble    Sngl     Int
    Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S
       4   32502   16875   18651   37629   19264   18745   26795   14188   11974  L1
       8   33594   17065   18877   38727   19574   19350   28999   15089   12670
      16   35686   18063   19965   41194   20699   20035   30259   15408   12678
      32   35887   18064   19973   41221   20718   19996   26834   14694   12575
      64   31618   17918   19982   34163   20203   19993   23658   13109   12516  L2
     128   31641   17909   19967   34038   20273   20016   23503   13159   12557
     256   30443   17696   19894   32707   20076   19878   22051   12878   12348
     512   24592   17512   18549   25832   18477   18431   15095   10902   10701  L3
    1024   24667   17479   18608   25860   18479   18501   15028   10912   10887
    2048   24675   17485   18552   25867   18461   18540   15015   10910   10896
    4096   24160   17110   18235   25504   18092   18265   14896   10826   10485
    8192   22490   16639   17598   23718   17608   17487   13493   10413   10548
   16384   15145   13134   13423   15247   13157   13401    7646    7775    7842  RAM
   32768   14783   13029   13210   14894   12963   12855    7408    7587    7565
   65536   14827   13100   13226   14923   13016   13258    7432    7641    7645
  131072   14958   13033   13279   15007   13052   12410    7398    7664    7632
  262144   14901   13124   13266   15032   13097   13273    7489    7666    7647
  524288   14897   13124   13304   15077   13063   13273    7456    7688    7568
 1048576   14813   12940   13165   15028   12947   13265    7411    7660    7662
Other Benchmark Compilations
The Livermore Loops Benchmark was converted to use OpenMP. This is the 1970s benchmark that set the standards for the first supercomputers (Cray 1 onwards). It has 24 kernels of numerical application with performance measured in MFLOPS.
Each kernel produces a double precision floating point checksum to demonstrate accuracy of the system being tested and this can vary slightly, depending on the compiler and options used. My C++ program checks these numbers against those built-in for a particular compilation (for use as a reliability/burn-in test).
The kernels are run three times using decreasing memory demands, mainly starting at 8 KB for each of one or more arrays.
The first results below are for the normal compilation, with checksums identical to the first successful run. The source code includes the "#pragma omp parallel for" directives, but they are not used unless the /openmp compiler parameter is specified. The other results are for runs with these directives enabled by using the /openmp compiler parameter. Kernels 16 and 17 have no loops for the pragma to apply to.
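The way the same source serves both builds can be illustrated as follows. This is a minimal sketch with hypothetical function names. It relies on the standard `_OPENMP` macro, which a compiler defines only when OpenMP support is switched on.

```c
/* Returns 1 if this file was compiled with OpenMP enabled, else 0.
   _OPENMP is defined by the compiler only when OpenMP support is
   switched on (for example /openmp or -fopenmp). */
int compiled_with_openmp(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical loop carrying the directive. In a normal compilation
   the pragma is ignored and the loop runs on one thread. With OpenMP
   enabled, the iterations are shared between threads. */
void scale_array(int n, double s, double *a)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = s * a[i];
}
```

Because each iteration is independent, `scale_array` gives the same numbers in both builds. That is not guaranteed for loops where iterations share data.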
The next results are with OpenMP using four processors, where a few tests are slightly faster than above, but many are much slower. Even worse, the calculations do not produce the same checksum numeric results and repeated runs show that the value can be unpredictable.
The third results are with OpenMP using one CPU (but two threads), where identical wrong checksums appear to be produced on repeating the benchmark.
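The cause of the wrong checksums is not identified here, but a common way for the simple directive to give wrong or varying results is a loop that accumulates into a shared variable, or one where each iteration depends on the previous one. The sketch below uses hypothetical function names and loops similar in form to those found in the Livermore kernels. It shows a sum made safe with the standard `reduction` clause, and a recurrence that is deliberately left serial.

```c
/* Inner product style sum. With a plain "parallel for", all threads
   would update q at once and lose additions. The reduction clause
   gives each thread a private partial sum and combines them at the end. */
double dot_reduction(int n, const double *x, const double *z)
{
    double q = 0.0;
    int k;

    #pragma omp parallel for reduction(+:q)
    for (k = 0; k < n; k++)
        q += z[k] * x[k];

    return q;
}

/* First order recurrence: x[i] needs the new value of x[i-1], so the
   iterations cannot be divided between threads. No directive is used. */
void recurrence_serial(int n, double *x, const double *y, const double *z)
{
    int i;
    for (i = 1; i < n; i++)
        x[i] = z[i] * (y[i] - x[i - 1]);
}
```

Even with the reduction clause, the partial sums may be added in a different order from the serial loop, so the last digits of a floating point checksum can still differ slightly.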
There are a number of other OpenMP programming options and the simple directive used here is not suitable for many of the kernels. Anything more complex than the MemSpeed x[m]=x[m]+r*y[m] needs careful consideration to ensure that instructions are executed in a consistent sequence and functions run long enough to absorb startup delays. Maybe it is best to leave it to a compiler that can ensure that the correct and most efficient procedures are used.
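One of the other OpenMP options that bears on startup delays is the standard `if` clause, which has the loop executed by a single thread when its condition is false. The following is a minimal sketch only. The function name and the threshold value are assumptions, and the threshold would need tuning for each system.

```c
/* Hypothetical threshold: below this many elements the loop is not
   divided between threads. 65536 single precision words is 256 KB. */
#define OMP_MIN_ELEMENTS 65536

/* Triad x[m] = x[m] + r * y[m], run in parallel only for large arrays. */
void triad_if(int n, float r, float *x, const float *y)
{
    int m;

    #pragma omp parallel for if(n >= OMP_MIN_ELEMENTS)
    for (m = 0; m < n; m++)
        x[m] = x[m] + r * y[m];
}
```

Whether this removes most of the overhead measured above has not been tested here, and some cost for entering the region may remain.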
Later results are for a dual core 4 thread Core i5 CPU and a quad core, 8 thread Core i7 processor, showing the same (or worse) degradation effects with Intel.
############################################################
AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
Normal MFLOPS for 24 loops
2622.5 1851.1 887.0 1454.3 336.3 779.3 3405.7 3011.2 2861.3 1428.9 207.0 1394.6
280.2 559.9 1162.3 989.0 999.2 2087.7 522.9 1177.1 1815.8 282.1 964.3 661.7
Numeric results were as expected
OpenMP MFLOPS for 24 loops
522.9 6.2 210.0 133.9 193.1 86.5 1560.6 371.6 189.8 99.4 98.6 108.2
44.5 228.4 279.3 939.7 999.2 154.5 32.9 480.1 22.3 159.0 116.6 108.0
Section 1 Test 6 result was 4.312366077873135e+003 expected 4.375116344729986e+003
Section 1 Test 13 result was 1.202533952702805e+011 expected 1.202533961842805e+011
Section 1 Test 14 result was 3.165549299821230e+009 expected 3.165553044000335e+009
Section 1 Test 20 result was 3.042067004051425e+007 expected 3.040644339351239e+007
Section 2 Test 13 result was 9.816387759644356e+010 expected 9.816387810944356e+010
Section 2 Test 19 result was 5.421816884714813e+002 expected 5.421816960147207e+002
Section 3 Test 19 result was 1.268230668053491e+001 expected 1.268230698051004e+001
Different Results Next Run
Section 1 Test 6 result was 4.345898038418117e+003 expected 4.375116344729986e+003
Section 1 Test 14 result was 3.165550475680920e+009 expected 3.165553044000335e+009
Section 1 Test 19 result was 5.421816884714813e+002 expected 5.421816960147207e+002
Section 1 Test 20 result was 3.042636088846063e+007 expected 3.040644339351239e+007
Section 3 Test 19 result was 1.268230698051474e+001 expected 1.268230698051004e+001
Affinity Set To Use 1 CPU - Consistent Results
MFLOPS for 24 loops
466.8 6.6 182.7 106.8 141.2 216.7 1169.0 359.1 186.4 93.3 76.4 104.9
42.3 233.6 235.2 892.8 1001.5 152.8 32.9 838.0 22.7 117.1 113.4 101.3
Section 1 Test 2 result was 1.542092319263005e+003 expected 1.539721811668385e+003
Section 1 Test 19 result was 5.421816947167190e+002 expected 5.421816960147207e+002
Section 2 Test 2 result was 1.542092319263005e+003 expected 1.539721811668385e+003
Section 2 Test 19 result was 5.421816947167190e+002 expected 5.421816960147207e+002
Section 3 Test 2 result was 3.958295105509222e+001 expected 3.953296986903060e+001
Section 3 Test 3 result was 2.699309089320673e-001 expected 2.699309089320672e-001
Section 3 Test 19 result was 1.268230657539253e+001 expected 1.268230698051004e+001
############################################################
Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz Measured 1596 MHz
Normal MFLOPS for 24 loops
2094.0 1711.7 964.3 1254.7 286.7 809.9 2761.5 3030.6 3373.6 1285.8 256.4 1127.4
520.9 681.1 864.9 1250.6 1001.4 1547.4 568.4 892.5 1645.5 238.5 941.4 902.4
OpenMP MFLOPS for 24 loops
359.3 4.8 141.7 74.9 104.1 134.4 745.8 221.2 110.0 61.2 67.0 71.8
30.8 208.0 175.5 873.2 696.8 80.3 20.9 502.8 15.1 102.2 79.0 73.3
############################################################
Windows 8.1 64-Bit, Core i7-4820K 3.7 GHz
Normal MFLOPS for 24 loops
4901.9 3628.6 2568.4 2640.4 564.7 1590.4 4685.2 5227.3 5595.2 2833.1 441.3 1932.3
996.0 1245.6 2289.5 2245.1 1778.9 3549.2 1069.2 1883.2 2827.2 411.1 1598.2 1621.3
OpenMP MFLOPS for 24 loops
440.7 5.0 175.8 108.0 170.7 224.0 1361.9 305.1 150.1 78.1 85.9 87.3
37.6 296.0 237.8 2258.1 1784.1 125.7 26.5 1461.5 17.9 140.6 93.7 84.8
Roy Longbottom October 2014
The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection