More OpenMP Processor Parallel Computing Benchmarks - Roy Longbottom's PC benchmark Collection

More OpenMP Parallel Computing Benchmarks


General

OpenMP is a system-independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers. The first benchmark, described in OpenMP MFLOPS, executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark. Varieties are compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code.

It was decided to compile other existing benchmarks using the same Microsoft compiler and OpenMP directive, the first one being the Linpack Benchmark, where performance is mainly governed by a loop containing

   dy[i] = dy[i] + da * dx[i]

The speed measured by the OpenMP version was unexpectedly and extremely slow. So, it was decided to produce a variation of the MemSpeed Benchmark, with the same calculations, but using data sizes that increase in steps to test caches and RAM. Other benchmarks were also converted to identify other slow functions. Some of these showed that careless use of OpenMP leads to programs producing wrong and inconsistent numeric results.
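As an illustration of what is being parallelised, the Linpack loop might be marked for OpenMP as in the following minimal sketch. The function name daxpy_omp and its signature are assumptions made for this example and are not taken from the benchmark source. Without the OpenMP compiler option (/openmp for the Microsoft compiler, -fopenmp for GCC) the pragma is ignored and the loop runs on one thread.

```c
#include <assert.h>

/* Hypothetical helper for illustration only; not the benchmark's code. */
void daxpy_omp(int n, double da, const double *dx, double *dy)
{
    int i;   /* a signed int index keeps the loop acceptable to OpenMP 2.0 */

    assert(n >= 0);

    /* Each iteration reads and writes only element i, so the
       iterations are independent and can be shared between threads. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
    {
        dy[i] = dy[i] + da * dx[i];
    }
}
```

Starting and joining the threads adds a fixed cost to every call, so a loop like this can only gain when n is large enough for the work to outweigh that cost.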

The new benchmarks are included for download in OpenMPMflops.zip. No installation is necessary - Extract All and click on EXE files.

The OpenMP benchmarks have also been ported to 32-Bit and 64-Bit Linux using the supplied GCC compiler (all free software) - see linux benchmarks.htm, linux openmp benchmarks.htm and download benchmark execution files, source code, compile and run instructions in linux_openmp.tar.gz. Using Windows the file downloaded wrongly as linux_openmp.tar.tar but was fine when renamed linux_openmp.tar.gz.


MemSpeed

The MemSpeed benchmark employs three different sequences of operations on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers, via two data arrays:

   Sum to register   r = r + x[m] * y[m]   (Integer tests use + y[m])
   Sum to memory     x[m] = x[m] + y[m]
   Memory to memory  x[m] = y[m]

MemSpd2K, the latest standard version, uses assembly code to execute the same instructions as the original MemSpeed benchmark. This special version for OpenMP is again all C code, with the first linked triad tests returning results to memory via:

   Sum to memory     x[m] = x[m] + r * y[m]

Memory tested doubles up from 4 KB to 25% of RAM size, to use all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For tests using arithmetic operations, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision.
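The conversion presumably follows because the triad x[m] = x[m] + r * y[m] performs two floating point operations per element (one multiply and one add) while reading two words (x[m] and y[m]), giving one operation per word read. A minimal sketch of the arithmetic, where the helper name mbps_to_mflops is an assumption made for this example:

```c
#include <assert.h>

/* Hypothetical helper: bytes_per_word is 4 for single precision
   and 8 for double precision floating point. */
double mbps_to_mflops(double mb_per_second, int bytes_per_word)
{
    assert(bytes_per_word == 4 || bytes_per_word == 8);

    /* One floating point operation is assumed per word read. */
    return mb_per_second / (double)bytes_per_word;
}
```

For instance, mbps_to_mflops(9921.0, 4) gives 2480.25, the single precision conversion of a 9921 MB/second reading.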


Example Log Files

Below are OpenMP (MemSpdOMP.exe) results produced from running on a Quad CPU Phenom processor using 64-Bit Windows 7 and those for the same code produced without the OpenMP compiler parameter (MemSpdNotOMP.exe). The programs each identify the system hardware and software as shown before performance details. Of particular note are the extremely slow OpenMP speeds for the smaller data sizes.

The slowest original OpenMP floating point benchmark results on this PC were 1920 MFLOPS using one CPU and 5587 MFLOPS with four processors. This was at 100,000 words or 400 KBytes. This MemSpeed version is similar at 512 KB, with 9921 MB/second or 2480 single precision MFLOPS with one CPU, and 22009 MB/second or 5502 MFLOPS with four CPUs using OpenMP. The single processor speeds are faster with less data, using L1 cache but, unexpectedly, those for OpenMP are progressively slower.

As with other benchmarks running on this system, more than one processor must be used to obtain maximum throughput from RAM.

 CPUID and RDTSC Assembly Code
 CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
 AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
 Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
 Intel processor architecture, 4 CPUs
 Windows NT Version 6.1, build 7600,
 Memory 4096 MB, Free 4096 MB
 User Virtual Space 4096 MB, Free 3005 MB

 OpenMP Version

 Memory Reading Speed Test OpenMP Version 4.0 by Roy Longbottom
 0.100 seconds per test, Start Wed Oct 13 12:27:26 2010

   Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
   KBytes   Dble   Sngl    Int   Dble   Sngl    Int   Dble   Sngl    Int
     Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

        4    418    436    439    438    441    449    222    224    225
        8    874    862    866    849    873    867    443    445    443
       16   1727   1713   1700   1730   1708   1737    878    853    873
       32   3341   3234   3263   3378   3218   3287   1724   1680   1647
       64   6123   5792   5978   6280   5922   6024   3156   3103   3052
      128  10822   9932  10085  11262   9666  10149   5848   5335   5481
      256  17639  15485  16134  18178  15582  16453   9879   8871   8853
      512  25742  22009  22123  26990  21379  22327  13959  12877  13138
     1024  33657  27622  26572  35721  27548  27919  19185  16918  16260
     2048  37554  30171  31756  37599  31174  30073  22600  18869  19298
     4096  24280  22284  23117  26256  22540  22471  14475  11822  12494
     8192  16476  13555  15907  18268  14493  15129   9479   7495   8435
    16384   7394   7137   7077   7743   7004   7248   3920   3697   3692
    32768   7387   6969   7184   7644   7167   7124   3987   3618   3752
    65536   7486   7188   7240   7733   6975   7077   3974   3725   3773
   131072   7462   7163   7249   7775   7197   7258   3976   3603   3654
   262144   7578   7207   7280   7816   7208   7223   4029   3632   3812
   524288   7652   7405   7344   8009   7331   7487   4084   3837   3825
  1048576   7720   7373   7469   8012   7181   7480   4112   3837   3789

 End of test Wed Oct 13 12:28:05 2010

 Normal Compilation

 Memory Reading Speed Test Version 4.0 by Roy Longbottom
 0.100 seconds per test, Start Wed Oct 13 12:26:33 2010

   Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
   KBytes   Dble   Sngl    Int   Dble   Sngl    Int   Dble   Sngl    Int
     Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

        4  22924  11651  12725  23949  12063  12721  15055   7771   9346
        8  23536  11839  13242  24553  12230  13677  15577   7855   9488
       16  23834  11887  12790  24828  12294  13728  15816   7957   9557
       32  23407  11902  13659  23941  12159  12991  15478   7913   9482
       64  23669  11847  12910  24528  12337  13543  15626   7913   9464
      128  14703   9926  10290  14750  10443  10243   8688   6701   6981
      256  14644   9906  10175  14884  10130  10166   8593   6680   6927
      512  14302   9921  10171  14895  10376  10188   8611   6687   6899
     1024   8246   7091   7026   8596   7017   7190   4509   3911   3976
     2048   8166   6976   7142   8545   7019   7125   4452   3880   3937
     4096   8006   6898   6984   8392   6836   7003   4469   3788   3857
     8192   4416   3983   4175   4530   4037   4169   2341   2157   2202
    16384   4244   3888   3993   4484   3826   4010   2298   2093   2135
    32768   4249   3885   3958   4467   3888   3966   2256   2095   2123
    65536   4235   3892   3929   4424   3875   3991   2293   2079   2137
   131072   4264   3894   3965   4487   3904   3980   2302   2092   2125
   262144   4279   3870   3991   4394   3903   4007   2305   2090   2131
   524288   4235   3873   3968   4423   3906   3998   2222   2073   2127
  1048576   4297   3922   3976   4520   3913   3976   2325   2107   2142

 End of test Wed Oct 13 12:27:12 2010


Results From Different Versions

The OpenMP version was also run using Task Manager Processes Affinity options to execute using one and two processors. These produced the same sort of speeds as the OpenMP log above, using the smaller data sizes. Viewing the Threads column, in Task Manager Processes, shows that four threads are used irrespective of the number of CPUs selected by Affinity settings. Calculations indicate that there is an OpenMP startup overhead, for all these tests, of around 9 microseconds with this Phenom processor. Note that, with the normal compilation, the time to read 100 KB is about 9 microseconds.
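The overhead figure can be approximated from the logged speeds, as in the following minimal sketch. The function names are assumptions made for this example, and the calculation assumes 1 KB = 1024 bytes and 1 MB/second = 1,000,000 bytes per second, since the logs do not state which definition of MB is used.

```c
#include <assert.h>

/* Time in microseconds for one pass over the data.
   Assumes 1 KB = 1024 bytes and 1 MB/second = 1,000,000 bytes/second,
   so that MB/second is numerically the same as bytes per microsecond. */
double usecs_per_pass(double kbytes, double mb_per_second)
{
    assert(mb_per_second > 0.0);
    return (kbytes * 1024.0) / mb_per_second;
}

/* Extra time per pass with OpenMP, relative to the normal compilation. */
double openmp_overhead_usecs(double kbytes, double omp_mbps, double normal_mbps)
{
    return usecs_per_pass(kbytes, omp_mbps) - usecs_per_pass(kbytes, normal_mbps);
}
```

With the 4 KB single precision figures from the Phenom logs (436 MB/second with OpenMP, 11651 MB/second without), openmp_overhead_usecs(4.0, 436.0, 11651.0) comes to roughly 9 microseconds.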

The speeds of the OpenMP tests, relative to those for the normal compilation, are shown in the graph. Maximum speeds are only achieved with data in the 6144 KB L3 cache. Performance with the larger data sizes is limited by RAM speed.

Single Precision Floating Point x[m]=x[m]+s*y[m]


Results Different Processors

Following are results of single and double precision calculations of the x[m]=x[m]+s*y[m] tests on a PC with a Core 2 Duo using 64-Bit Vista. The first two columns are for normal compilations, without OpenMP. The next four columns show data transfer speeds using one and two cores with OpenMP functions. Next are loss and gain ratios for the single precision speeds, where dual core throughput improvement is associated with data in the shared 4096 KB L2 cache. The last column reflects startup overheads of at least 9 microseconds.

Later results shown are for a dual core Core i5 that also has Hyperthreading (see configuration details - Intel processor architecture, 4 CPUs). Then, there are full results for a 4 core, 8 thread Core i7, with and without using OpenMP. Here, the impact of the latter is even worse, with the single thread version being up to 100 times faster. There are performance gains of up to 3.85 times using shared L3 cache and 2.0 times using RAM.

 CPUID and RDTSC Assembly Code
 CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
 Intel processor architecture, 2 CPUs
 Windows NT Version 6.0, build 6002, Service Pack 2
 Memory 4095 MB, Free 1079 MB
 User Virtual Space 4096 MB, Free 3018 MB

                      x[m]=x[m]+s*y[m]                  Loss/Gain SP
             Not OpenMP   OpenMP 1 CPU  OpenMP 2 CPUs  1 CPU 2 CPUs   2 CPUs
   KBytes   Dble   Sngl   Dble   Sngl   Dble   Sngl   Sngl   Sngl    usecs
     Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S  ratio  ratio    /pass

        4  18490   9185    547    553    425    413   0.06   0.04        9
        8  18631   9349   1051   1005    842    830   0.11   0.09       10
       16  18903   9467   1903   1827   1681   1630   0.19   0.17       10
       32  18739   9487   3059   2831   2558   2640   0.30   0.28       11
       64  11535   7631   4552   3986   5148   4751   0.52   0.62       14
      128  11626   7584   6150   5234   7553   6765   0.69   0.89       18
      256  11634   7686   7263   5815  10645   8937   0.76   1.16       30
      512  11632   7524   8375   6395  12273  10469   0.85   1.39       46
     1024  11605   7638   8362   7131  13733   9631   0.93   1.26       87
     2048  11408   7298   8998   7118  15255  11028   0.98   1.51      162
     4096   8626   7057   7792   5856  13525  10211   0.83   1.45      350
     8192   4287   4222   3667   3685   4367   4318   0.87   1.02     2287
    16384   3690   3532   3360   3510   3409   3718   0.99   1.05     4421
    32768   3284   3431   3472   3315   2815   3017   0.97   0.88     9166
    65536   3572   3460   3458   3452   3570   3602   1.00   1.04    19270
   131072   3485   3550   3429   3376   3656   3466   0.95   0.98    36268
   262144   3504   3570   3638   2990   3727   3564   0.84   1.00    70469
   524288   3650   3533   3130   3500   3737   3637   0.99   1.03   143996
  1048576   3696   3534   3616   3598   3603   3002   1.02   0.85   285017


 CPUID and RDTSC Assembly Code
 CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206A7
 Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz Measured 1596 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
 Intel processor architecture, 4 CPUs
 Windows NT Version 6.1, build 7601, Service Pack 1
 Memory 4096 MB, Free 4096 MB
 User Virtual Space 4096 MB, Free 3006 MB

                 x[m]=x[m]+s*y[m]
             Not OpenMP  OpenMP 2 CPUs    Loss/Gain    2 CPUs
   KBytes   Dble   Sngl   Dble   Sngl   Dble   Sngl    usecs
     Used   MB/S   MB/S   MB/S   MB/S  ratio  ratio    /pass

        4  19157   9719    250    262   0.01   0.03       15
        8  19932  10030    718    697   0.04   0.07       11
       16  20002   9768   1413   1372   0.07   0.14       12
       32  19766  10046   2723   2587   0.14   0.26       12
       64  17504   9708   4940   4536   0.28   0.47       15
      128  17415  10066   8351   7018   0.48   0.70       17
      256  17368   9676  12771   9624   0.74   0.99       25
      512   9736   6919  15949  11184   1.64   1.62       54
     1024   9944   6919  14707  10785   1.48   1.56       91
     2048   9763   6815  16064  10940   1.65   1.61      177
     4096   7895   6077  10684   9087   1.35   1.50      421
     8192   7646   6045   9156   8920   1.20   1.48      966
    16384   7643   5942   9096   9179   1.19   1.54     1751
    32768   7658   6031   9528   9655   1.24   1.60     3475
    65536   7718   6045  10187   9730   1.32   1.61     5767
   131072   7734   6061   9572   9638   1.24   1.59    14493
   262144   7934   6117  10563   9588   1.33   1.57    27239
   524288   8137   6248  10492  10612   1.29   1.70    49118
  1048576   8138   6221  11311  10512   1.39   1.69    98708

 ############################################################

 Windows 8.1 64-Bit, Core i7-4820K 3.7 GHz, 4 Channel DDR3 1600 MHz RAM

 CPUID and RDTSC Assembly Code
 CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4
 Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz Measured 3711 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
 Intel processor architecture, 8 CPUs
 Windows NT Version 6.2, build 9200,
 Memory 4096 MB, Free 4096 MB
 User Virtual Space 4096 MB, Free 2999 MB

 Memory Reading Speed Test OpenMP Version 4.0 by Roy Longbottom
 0.100 seconds per test, Start Tue Sep 30 10:24:44 2014

   Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
   KBytes   Dble   Sngl    Int   Dble   Sngl    Int   Dble   Sngl    Int
     Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

        4    329    328    324    331    346    345    174    173    173  L1
        8    685    697    685    684    689    687    347    345    345
       16   1380   1362   1353   1381   1365   1369    698    683    666
       32   2727   2675   2711   2705   2708   2703   1358   1361   1375
       64   5250   5257   5242   5359   5221   5278   2711   2659   2670  L2
      128  10368   9981   9925  10466  10119  10164   5286   5156   5174
      256  19247  17801  17303  19893  18320  18449  10314   9539   9515
      512  33203  28253  28718  34933  30428  30399  18218  16123  16077  L3
     1024  48844  38945  40108  52994  42481  42635  27973  22975  23084
     2048  65318  49487  50672  68134  55093  48589  36840  30226  30346
     4096  79834  56326  58847  85096  63436  63283  45528  36018  35250
     8192  83167  59969  61809  87526  67789  66920  45066  38250  38200
    16384  26091  25915  25962  26063  26029  26043  13128  13026  13003  RAM
    32768  24690  23614  24635  24782  24723  24611  12502  12398  12381
    65536  24678  24595  24661  24865  24739  24760  12382  12511  12469
   131072  25203  25127  25129  25307  25101  25146  12752  12691  12673
   262144  25489  24881  25358  25433  25297  25346  12645  12777  12748
   524288  25639  25093  25400  25495  24977  25445  12838  12722  12825
  1048576  25953  26054  25955  25926  25957  26063  13043  13033  12999

 Not OpenMP

 Memory Reading Speed Test Version 4.0 by Roy Longbottom
 0.100 seconds per test, Start Tue Sep 30 10:21:40 2014

   Memory  x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
   KBytes   Dble   Sngl    Int   Dble   Sngl    Int   Dble   Sngl    Int
     Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

        4  32502  16875  18651  37629  19264  18745  26795  14188  11974  L1
        8  33594  17065  18877  38727  19574  19350  28999  15089  12670
       16  35686  18063  19965  41194  20699  20035  30259  15408  12678
       32  35887  18064  19973  41221  20718  19996  26834  14694  12575
       64  31618  17918  19982  34163  20203  19993  23658  13109  12516  L2
      128  31641  17909  19967  34038  20273  20016  23503  13159  12557
      256  30443  17696  19894  32707  20076  19878  22051  12878  12348
      512  24592  17512  18549  25832  18477  18431  15095  10902  10701  L3
     1024  24667  17479  18608  25860  18479  18501  15028  10912  10887
     2048  24675  17485  18552  25867  18461  18540  15015  10910  10896
     4096  24160  17110  18235  25504  18092  18265  14896  10826  10485
     8192  22490  16639  17598  23718  17608  17487  13493  10413  10548
    16384  15145  13134  13423  15247  13157  13401   7646   7775   7842  RAM
    32768  14783  13029  13210  14894  12963  12855   7408   7587   7565
    65536  14827  13100  13226  14923  13016  13258   7432   7641   7645
   131072  14958  13033  13279  15007  13052  12410   7398   7664   7632
   262144  14901  13124  13266  15032  13097  13273   7489   7666   7647
   524288  14897  13124  13304  15077  13063  13273   7456   7688   7568
  1048576  14813  12940  13165  15028  12947  13265   7411   7660   7662


Other Benchmark Compilations

The Livermore Loops Benchmark was converted to use OpenMP. This is the 1970s benchmark that set the standards for the first supercomputers (Cray 1 onwards). It has 24 kernels of numerical application with performance measured in MFLOPS. Each kernel produces a double precision floating point checksum to demonstrate accuracy of the system being tested and this can vary slightly, depending on the compiler and options used. My C++ program checks these numbers against those built-in for a particular compilation (for use as a reliability/burn-in test). The kernels are run three times using decreasing memory demands, mainly starting at 8 KB for each of one or more arrays.

The first results below are for the normal compilation, with checksums identical to the first successful run. The source code for this build includes the “#pragma omp parallel for” directives, but the compiler ignores them because the /openmp parameter is not specified. The other results are for runs with these directives enabled by using the /openmp compiler parameter. Kernels 16 and 17 have no loops for the pragma to apply.
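Whether a given build has the directives enabled can be checked from within the program through the standard _OPENMP macro, which the compiler defines only when OpenMP support is switched on. A minimal sketch, with function names that are assumptions made for this example:

```c
#include <stdio.h>

/* Returns the value of the _OPENMP macro (a date code such as 200203
   for OpenMP 2.0) when OpenMP is enabled, otherwise 0. */
long openmp_build_version(void)
{
#ifdef _OPENMP
    return (long)_OPENMP;
#else
    return 0L;
#endif
}

/* Prints which kind of build is running. */
void report_openmp_build(void)
{
    long version = openmp_build_version();

    if (version != 0L)
        printf("OpenMP enabled, version code %ld\n", version);
    else
        printf("Normal compilation, OpenMP pragmas ignored\n");
}
```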

The next results are with OpenMP using four processors, where a few tests are slightly faster than above, but many are much slower. Even worse, the calculations do not produce the same checksum numeric results and repeated runs show that the value can be unpredictable. The third results are with OpenMP using one CPU (but two threads), where identical wrong checksums appear to be produced on repeating the benchmark.

There are a number of other OpenMP programming options and the simple directive used here is not suitable for many of the kernels. Anything more complex than the MemSpeed x[m]=x[m]+r*y[m] needs careful consideration to ensure that instructions are executed in a consistent sequence and functions run long enough to absorb startup delays. Maybe it is best to leave it to a compiler that can ensure that the correct and most efficient procedures are used.
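One common cause of wrong or varying results with the simple directive is a loop that accumulates into a shared variable. The following minimal sketch shows the reduction clause, one of the other OpenMP options, applied to a sum of products. It is a hypothetical example and not one of the Livermore kernels as coded in the benchmark, so it should not be read as the fix for any particular failing test.

```c
#include <assert.h>

/* Hypothetical example for illustration only. */
double dot_product_omp(int n, const double *x, const double *y)
{
    double sum = 0.0;
    int i;

    assert(n >= 0);

    /* Without reduction(+:sum), all threads would update the shared
       variable sum at the same time, and the total could differ from
       run to run. With it, each thread sums privately and the partial
       sums are combined at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
    {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Even with the reduction clause, the order in which the partial sums are added depends on the number of threads, so the last digits of a double precision checksum may differ from those of a serial run. Loops where each iteration depends on the result of the previous one cannot be corrected this way.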

Later results are for a dual core 4 thread Core i5 CPU and a quad core, 8 thread Core i7 processor, showing the same (or worse) degradation effects with Intel.

 ############################################################

 AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz

 Normal

 MFLOPS for 24 loops
  2622.5  1851.1   887.0  1454.3   336.3   779.3  3405.7  3011.2
  2861.3  1428.9   207.0  1394.6   280.2   559.9  1162.3   989.0
   999.2  2087.7   522.9  1177.1  1815.8   282.1   964.3   661.7

 Numeric results were as expected

 OpenMP

 MFLOPS for 24 loops
   522.9     6.2   210.0   133.9   193.1    86.5  1560.6   371.6
   189.8    99.4    98.6   108.2    44.5   228.4   279.3   939.7
   999.2   154.5    32.9   480.1    22.3   159.0   116.6   108.0

 Section 1 Test  6 result was 4.312366077873135e+003 expected 4.375116344729986e+003
 Section 1 Test 13 result was 1.202533952702805e+011 expected 1.202533961842805e+011
 Section 1 Test 14 result was 3.165549299821230e+009 expected 3.165553044000335e+009
 Section 1 Test 20 result was 3.042067004051425e+007 expected 3.040644339351239e+007
 Section 2 Test 13 result was 9.816387759644356e+010 expected 9.816387810944356e+010
 Section 2 Test 19 result was 5.421816884714813e+002 expected 5.421816960147207e+002
 Section 3 Test 19 result was 1.268230668053491e+001 expected 1.268230698051004e+001

 Different Results Next Run

 Section 1 Test  6 result was 4.345898038418117e+003 expected 4.375116344729986e+003
 Section 1 Test 14 result was 3.165550475680920e+009 expected 3.165553044000335e+009
 Section 1 Test 19 result was 5.421816884714813e+002 expected 5.421816960147207e+002
 Section 1 Test 20 result was 3.042636088846063e+007 expected 3.040644339351239e+007
 Section 3 Test 19 result was 1.268230698051474e+001 expected 1.268230698051004e+001

 Affinity Set To Use 1 CPU - Consistent Results

 MFLOPS for 24 loops
   466.8     6.6   182.7   106.8   141.2   216.7  1169.0   359.1
   186.4    93.3    76.4   104.9    42.3   233.6   235.2   892.8
  1001.5   152.8    32.9   838.0    22.7   117.1   113.4   101.3

 Section 1 Test  2 result was 1.542092319263005e+003 expected 1.539721811668385e+003
 Section 1 Test 19 result was 5.421816947167190e+002 expected 5.421816960147207e+002
 Section 2 Test  2 result was 1.542092319263005e+003 expected 1.539721811668385e+003
 Section 2 Test 19 result was 5.421816947167190e+002 expected 5.421816960147207e+002
 Section 3 Test  2 result was 3.958295105509222e+001 expected 3.953296986903060e+001
 Section 3 Test  3 result was 2.699309089320673e-001 expected 2.699309089320672e-001
 Section 3 Test 19 result was 1.268230657539253e+001 expected 1.268230698051004e+001

 ############################################################

 Intel(R) Core(TM) i5-2467M CPU @ 1.60GHz Measured 1596 MHz

 Normal

 MFLOPS for 24 loops
  2094.0  1711.7   964.3  1254.7   286.7   809.9  2761.5  3030.6
  3373.6  1285.8   256.4  1127.4   520.9   681.1   864.9  1250.6
  1001.4  1547.4   568.4   892.5  1645.5   238.5   941.4   902.4

 OpenMP

 MFLOPS for 24 loops
   359.3     4.8   141.7    74.9   104.1   134.4   745.8   221.2
   110.0    61.2    67.0    71.8    30.8   208.0   175.5   873.2
   696.8    80.3    20.9   502.8    15.1   102.2    79.0    73.3

 ############################################################

 Windows 8.1 64-Bit, Core i7-4820K 3.7 GHz

 Normal

 MFLOPS for 24 loops
  4901.9  3628.6  2568.4  2640.4   564.7  1590.4  4685.2  5227.3
  5595.2  2833.1   441.3  1932.3   996.0  1245.6  2289.5  2245.1
  1778.9  3549.2  1069.2  1883.2  2827.2   411.1  1598.2  1621.3

 OpenMP

 MFLOPS for 24 loops
   440.7     5.0   175.8   108.0   170.7   224.0  1361.9   305.1
   150.1    78.1    85.9    87.3    37.6   296.0   237.8  2258.1
  1784.1   125.7    26.5  1461.5    17.9   140.6    93.7    84.8


Roy Longbottom October 2014

The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection
