GigaFLOPS Benchmarks

Roy Longbottom at Linkedin

Contents


  General
  CUDA GPU Benchmarks
  CUDA Comparisons
  MP MFLOPS Benchmarks
  MP MFLOPS Comparisons
  MP MFLOPS Numeric Answers
  MP MFLOPS Assembly Code
  OpenMP Benchmark
  Qpar MP Benchmark
  Qpar and OpenMP Comparisons
  Reliability Tests


General

In this series, four types of benchmark are available: OpenMP and Qpar (Microsoft's proprietary equivalent), both with automatic multiprocessing, CUDA for GeForce graphics, and MP MFLOPS, a CPU benchmark with explicit multithreading (up to 64 threads). All carry out the same calculations, of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations per input data word. They also check the numeric results for consistency, which identifies differences in calculated values between the various instruction sets.
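
An illustration of the form of the calculations is below, as a minimal single thread C sketch of the 8 operations per word test. The constants a to f are invented for the example; the real benchmarks add multithreading, timing, MFLOPS calculation and logging.

   #include <stdio.h>

   #define WORDS   100000        /* 4 byte single precision data words */
   #define REPEATS   2500        /* repeat passes                      */

   static float x[WORDS];

   int main(void)
   {
      /* constants are illustrative - chosen to keep results in range */
      float a = 0.0001f, b = 0.9999f, c = 0.0002f,
            d = 0.4999f, e = 0.0003f, f = 0.5001f;
      int   i, p, same = 1;

      for (i = 0; i < WORDS; i++) x[i] = 0.999999f;

      for (p = 0; p < REPEATS; p++)
         for (i = 0; i < WORDS; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;

      /* consistency check - every word should hold the same value */
      for (i = 1; i < WORDS; i++) if (x[i] != x[0]) same = 0;
      printf(" First Results %10.6f  All Same %s\n", x[0], same ? "Yes" : "No");
      return 0;
   }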

The benchmarks are compiled for 32 bit and 64 bit Operating Systems and are provided for single and double precision floating point calculations, using the original x87, SSE or AVX instruction sets. Tests are run at increasing data sizes, so that data is transferred from caches and from RAM. The latest versions of the benchmarks and source code are included in GigaFLOPS-Benchmarks.zip.

The programs are run from Command Prompt windows and generally have parameter options to run for extended periods as reliability/burn-in tests. Results are displayed and saved in a text-based log file. Speed is shown in MFLOPS, or Millions of FLoating point Operations Per Second.




CUDA GPU Benchmarks

These were the first benchmarks in the series. Four varieties are available: CUDA3MFLOPS-x86SP.exe, CUDA3MFLOPS-x86DP.exe, CUDA3MFLOPS-x64SP.exe and CUDA3MFLOPS-x64DP.exe, two each for 32 bit and 64 bit Windows. Single Precision (SP) and Double Precision (DP) compilations are provided, as the performance difference can be considerable. CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use for executing arithmetic instructions on numerous processing elements simultaneously.

Unlike the CPU benchmarks, three tests are carried out for each combination of data size and calculations per word. This is to show that performance is severely degraded if data transfers are over the relatively slow external bus. The first tests are run with Repeat Passes controlled by the CPU but, to demonstrate the fastest speeds (with these tests), Extra Tests are run with all repeats controlled by the GPU.

The example below is for a GeForce GTX 680, possibly the fastest graphics card in 2012. This has a maximum specification of 3090 GFLOPS, the benchmark achieving up to 1746 GFLOPS. Note that numeric accuracy improves with fewer data returns between calculations.

Details on installing and using CUDA software, with further benchmark details and results, can be found in cuda1.htm, cuda2.htm and cuda3 x64m.htm. These show how the numeric results vary according to precision used and provide details on how to run the programs for extended periods as reliability/burn-in tests.

Note that the total data size transferred can be 1 GB in and out. On reducing data flow by this amount, the difference in test time is mainly around 0.25 seconds, suggesting that effective transfer speed is about 4 GB/second, maybe about right for PCIe 3.0 x16 with a maximum of nearly 16 GB/second.
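
As a rough guide to what is involved, below is a much simplified CUDA sketch, with invented names and constants, not the benchmark source. It shows the Extra Tests arrangement, with the repeat passes inside the kernel; the normal tests launch the kernel once per pass from the CPU, the Data in & out variety also copying the array to and from the card on every pass.

   #include <cstdio>
   #include <cuda_runtime.h>

   __global__ void calc(float *x, int n, int passes, float a, float b,
                        float c, float d, float e, float f)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
      {
         float v = x[i];
         // Extra Tests style - repeats inside the kernel, so data stays
         // in GPU registers with no transfers between passes
         for (int p = 0; p < passes; p++)
            v = (v + a) * b - (v + c) * d + (v + e) * f;
         x[i] = v;
      }
   }

   int main()
   {
      const int n = 10000000, passes = 25, threads = 256;
      float *hostx = new float[n], *devx;

      for (int i = 0; i < n; i++) hostx[i] = 0.999999f;
      cudaMalloc(&devx, n * sizeof(float));

      // Data in & out tests repeat this copy/launch/copy sequence every
      // pass; Calculate only copies once and just relaunches the kernel
      cudaMemcpy(devx, hostx, n * sizeof(float), cudaMemcpyHostToDevice);
      calc<<<(n + threads - 1) / threads, threads>>>(devx, n, passes,
                    0.0001f, 0.9999f, 0.0002f, 0.4999f, 0.0003f, 0.5001f);
      cudaMemcpy(hostx, devx, n * sizeof(float), cudaMemcpyDeviceToHost);

      printf("First result %10.7f\n", hostx[0]);
      cudaFree(devx);
      delete [] hostx;
      return 0;
   }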


   CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.3 Sun Nov 04 17:58:44 2012

  CUDA devices found 
  Device 0: GeForce GTX 680  with 8 Processors 64 cores 
  Global Memory 2000 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  0.967272      517   0.9295383095741  Yes
 Data out only    100000    2    2500  0.387346     1291   0.9295383095741  Yes
 Calculate only   100000    2    2500  0.070436     7099   0.9295383095741  Yes

 Data in & out   1000000    2     250  0.526663      949   0.9925497770309  Yes
 Data out only   1000000    2     250  0.245081     2040   0.9925497770309  Yes
 Calculate only  1000000    2     250  0.019763    25299   0.9925497770309  Yes

 Data in & out  10000000    2      25  0.482678     1036   0.9992496371269  Yes
 Data out only  10000000    2      25  0.240010     2083   0.9992496371269  Yes
 Calculate only 10000000    2      25  0.013708    36475   0.9992496371269  Yes

 Data in & out    100000    8    2500  0.759731     2633   0.9571172595024  Yes
 Data out only    100000    8    2500  0.410652     4870   0.9571172595024  Yes
 Calculate only   100000    8    2500  0.073366    27261   0.9571172595024  Yes

 Data in & out   1000000    8     250  0.524791     3811   0.9955183267593  Yes
 Data out only   1000000    8     250  0.245618     8143   0.9955183267593  Yes
 Calculate only  1000000    8     250  0.020494    97589   0.9955183267593  Yes

 Data in & out  10000000    8      25  0.494677     4043   0.9995489120483  Yes
 Data out only  10000000    8      25  0.240809     8305   0.9995489120483  Yes
 Calculate only 10000000    8      25  0.013834   144575   0.9995489120483  Yes

 Data in & out    100000   32    2500  0.764819    10460   0.8902152180672  Yes
 Data out only    100000   32    2500  0.415392    19259   0.8902152180672  Yes
 Calculate only   100000   32    2500  0.135979    58833   0.8902152180672  Yes

 Data in & out   1000000   32     250  0.529935    15096   0.9880878329277  Yes
 Data out only   1000000   32     250  0.247135    32371   0.9880878329277  Yes
 Calculate only  1000000   32     250  0.024024   333000   0.9880878329277  Yes

 Data in & out  10000000   32      25  0.493384    16215   0.9987964630127  Yes
 Data out only  10000000   32      25  0.242553    32983   0.9987964630127  Yes
 Calculate only 10000000   32      25  0.015177   527122   0.9987964630127  Yes

 Extra tests - Repeat Passes in main CUDA Function

 Calculate      10000000    2      25  0.004503   111041   0.9992496371269  Yes
 Shared Memory  10000000    2      25  0.002053   243601   0.9992496371269  Yes

 Calculate      10000000    8      25  0.005726   349289   0.9995489120483  Yes
 Shared Memory  10000000    8      25  0.002773   721272   0.9995489120483  Yes

 Calculate      10000000   32      25  0.008419   950255   0.9987964630127  Yes
 Shared Memory  10000000   32      25  0.004581  1746493   0.9987964630127  Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206D7
  Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz Measured 3200 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  AMD64 processor architecture, 12 CPUs 
  Windows NT  Version 6.1, build 7601, Service Pack 1
  Memory 32710 MB, Free 28947 MB
  User Virtual Space 8388608 MB, Free 8388539 MB

  2000 MB Graphics RAM, Used 204 Minimum, 240 Maximum

   


CUDA Comparisons

Following are further results for the GTX 680, alongside others for a GTX 650, with single and double precision benchmarks compiled for 32 bit and 64 bit Windows. The PCs have similar CPU, bus and RAM speeds, reflected in the results involving data transfers. Note that the 64 bit compilations are generally faster than those at 32 bits, and single precision speeds can be much higher than those for double precision calculations, the largest difference (>15x) being on the extra tests.

 
                             GeForce GTX 680              GeForce GTX 650
 Maximum Specification       3090 GFLOPS                  1505 GFLOPS

 Test            Wds x Ops   3.1 SP  3.1 SP 3.1 DP 3.1 DP 3.1 SP 3.1 SP 3.1 DP 3.1 DP
                  x Passes      32b     64b    32b    64b    32b    64b    32b    64b

 Data in & out  .1Mx2x2500      622     517    354    353    593    459    333    350
 Data out only  .1Mx2x2500     1241    1291    667    712   1050   1059    643    651
 Calculate only .1Mx2x2500     7310    7099   4126   6140   3854   3449   3351   3069

 Data in & out  1Mx2x250        931     949    479    489    869    893    471    479
 Data out only  1Mx2x250       1960    2040   1000   1049   1765   1790    941    962
 Calculate only 1Mx2x250      21914   25299  16143  16226   9244   8806   6889   6627

 Data in & out  10Mx2x25        999    1036    513    516    959    980    502    508
 Data out only  10Mx2x25       2005    2083   1040   1052   1783   1852    952    972
 Calculate only 10Mx2x25      36175   36475  19876  19882  10867  10530   7746   7533

 Data in & out  .1Mx8x2500     2445    2633   1415   1430   2348   2375   1294   1299
 Data out only  .1Mx8x2500     4280    4870   2724   2906   4135   4151   2396   2415
 Calculate only .1Mx8x2500    25502   27261  15501  22388  14274  13056   9809   9029

 Data in & out  1Mx8x250       3722    3811   1901   1949   3458   3545   1806   1829
 Data out only  1Mx8x250       7694    8143   4058   4143   6982   7107   3455   3504
 Calculate only 1Mx8x250      85455   97589  60592  60056  33375  34014  16613  15852

 Data in & out  10Mx8x25       4015    4043   2061   2052   3841   3896   1915   1926
 Data out only  10Mx8x25       8177    8305   4055   4130   7222   7283   3524   3572
 Calculate only 10Mx8x25     140800  144575  75558  74333  42328  40905  17922  17221

 Data in & out  .1Mx32x2500   10080   10460   5396   5485   9026   9183   4262   4588
 Data out only  .1Mx32x2500   18672   19259   9978  10053  15660  15769   7355   7377
 Calculate only .1Mx32x2500   83989   58833  43334  47974  47957  43975  17469  16658

 Data in & out  1Mx32x250     14629   15096   7269   7440  13634  14006   5947   6015
 Data out only  1Mx32x250     30977   32371  14810  14977  27261  27684   9864   9881
 Calculate only 1Mx32x250    347405  333000  96044  92231 125027 120972  22708  21998

 Data in & out  10Mx32x25     15995   16215   7739   7765  15123  15499   6288   6278
 Data out only  10Mx32x25     31586   32983  14956  15037  27770  28906   9984   9999
 Calculate only 10Mx32x25    519153  527122 105419 101771 149697 147100  23467  22763

 Extra tests - Repeat Passes in main CUDA Function

 Calculate      10Mx2x25     126843  111041  50749  44017  29899  26876  11491   9801
 Shared Memory  10Mx2x25     236915  243601  73186  72369  75618  77049  16407  16268

 Calculate      10Mx8x25     470867  349289  89172  83696 110088  81484  19821  18451
 Shared Memory  10Mx8x25     969416  721272 100825 100254 227879 181190  22425  22322

 Calculate      10Mx32x25   1154649  950255 109800 105142 253381 216570  24083  23353
 Shared Memory  10Mx32x25   1714512 1746493 111425 110982 412313 400966  24587  24560

   


MP MFLOPS Benchmarks

The benchmark was first produced to run on PCs via Linux, with details in Linux Multithreading Benchmarks.htm. Then mini versions were produced, with details in Android MultiThreading Benchmarks.htm and Raspberry Pi Multithreading Benchmarks.htm.

Four versions, compiled from the same source code, are available in the GigaFLOPS ZIP file, now intended to be run via 64 bit Windows. They are:

  • MPmflops32.exe compiled with MS C/C++ Version 15.00.30729.207 for 80x86 - old 8087 floating point instructions
  • MPmflops64.exe compiled with MS C/C++ Version 15.00.30729.207 for x64 - to use SSE floating point calculations
  • MPmflopsc2.exe compiled with MS C/C++ Version 18.00.21005.1 for x64 - to fully implement SSE functions
  • MPmflopsAVX.exe compiled with MS C/C++ Version 18.00.21005.1 for x64, with the /arch:AVX option to use the newer vector instructions. The first compilation failed to run on a PC without AVX. The revised benchmark, now named onlyAVX.exe, is a small program that identifies the configuration, then either reports that AVX is not supported or runs the benchmark.
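
For indication only, using a hypothetical source file name and the same general options as the OpenMP compile command shown later, the SSE and AVX compilations would be along the following lines.

   Rem hypothetical file name - default SSE and /arch:AVX compilations
   cl /O2           /MD /W4 /Zi /TP /EHsc /Fa /c mpmflops.cpp
   cl /O2 /arch:AVX /MD /W4 /Zi /TP /EHsc /Fa /c mpmflops.cpp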

A WhatConfig function identifies how many cores are available, doubled when Hyperthreading is present, and this provides the default thread count. Alternatively, a run time parameter specifies the number of threads to use (see below). This can be zero, where no threads are created but, in this case, speeds have been the same as using one thread.

The threads use a shared data array, but each operates on a dedicated segment. Example results are below. The “Data in & out” label is to identify the equivalent CUDA test. The thread count should be one of the options shown below (all divide into 1024); otherwise, unexpected results will be notified (see example). The numeric results depend on the number of calculations, changed by repeat passes that vary according to the number of identified CPUs (2500 x CPUs, see below). Results via x87 floating point are slightly different to SSE calculations, but the thread count should not affect the numeric answers.
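
A minimal sketch of the threading arrangement, in C with Windows threads and invented names (the real source is in the ZIP file), could be as below, with the shared array split into a dedicated segment per thread; repeat passes, timing and result checking are omitted.

   #include <windows.h>
   #include <process.h>

   #define WORDS   1024000
   #define THREADS 8

   static float x[WORDS];                    /* shared data array */

   typedef struct { int first; int count; } SEGMENT;

   unsigned __stdcall runPart(void *arg)     /* one thread, one segment */
   {
      SEGMENT *s = (SEGMENT *)arg;
      float a = 0.0001f, b = 0.9999f, c = 0.0002f,
            d = 0.4999f, e = 0.0003f, f = 0.5001f;
      int i;
      for (i = s->first; i < s->first + s->count; i++)
         x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      return 0;
   }

   int main(void)
   {
      HANDLE  handles[THREADS];
      SEGMENT segs[THREADS];
      int     i, t, words = WORDS / THREADS; /* dedicated segment per thread */

      for (i = 0; i < WORDS; i++) x[i] = 0.999999f;

      for (t = 0; t < THREADS; t++)
      {
         segs[t].first = t * words;
         segs[t].count = words;
         handles[t] = (HANDLE)_beginthreadex(NULL, 0, runPart, &segs[t], 0, NULL);
      }
      WaitForMultipleObjects(THREADS, handles, TRUE, INFINITE);
      for (t = 0; t < THREADS; t++) CloseHandle(handles[t]);
      return 0;
   }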


 Command Format Example - MPmflopsc2 Threads xx or T xx
                          x = 0, 1, 2, 4, 8, 16, 32 or 64

 Example results log file:
 
 8 CPUs Available
 ##############################################

  64 Bit MP SSE MFLOPS Benchmark C2, 8 Threads, Mon Jun 09 12:15:48 2014

        C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    20000   0.069739    58734    0.620974   Yes
 Data in & out    1024000     2     2000   0.093680    43723    0.942935   Yes
 Data in & out   10240000     2      200   0.410417     9980    0.994032   Yes

 Data in & out     102400     8    20000   0.168666    97139    0.749971   Yes
 Data in & out    1024000     8     2000   0.166426    98446    0.965360   Yes
 Data in & out   10240000     8      200   0.408970    40062    0.996409   Yes

 Data in & out     102400    32    20000   0.717656    91320    0.498060   Yes
 Data in & out    1024000    32     2000   0.698044    93885    0.910573   Yes
 Data in & out   10240000    32      200   0.703741    93125    0.990447   Yes

               End of test Mon Jun 09 12:15:51 2014

 ##############################################

  CPU GenuineIntel, ecx 7FBEE3BF, edx BFEBFBFF, Model 000306E4 
  Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz Measured 3711 MHz 
  Has MMX, Has SSE, Has SSE2, Has SSE3, Has AVX, No 3DNow, 
  Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus 
  AMD64 processor architecture, 8 CPUs  
  Windows NT  Version 6.2, build 9200,  
  Memory 32705 MB, Free 31350 MB 
  User Virtual Space 134217728 MB, Free 134217718 MB 
 
 ##############################################

 Example Calculation

 MFLOPS  = 10240000 / 1000000 x 32 x 200 / 0.703741 = 93125.17  

 ##############################################

 Example display and log file with errors or odd thread count (4 CPUs Available)

 Data in & out     102400     2    10000   0.097761    20949    See later   No

 Data in & out     102400     2    10000 word  102388 was 0.999999 not 0.764063
 Data in & out     102400     2    10000 word  102389 was 0.999999 not 0.764063

  


MP MFLOPS Comparisons

The tables below show L2 and L3 cache size, the former provided per core and the latter shared between cores. Tests using two arithmetic operations per word are the most likely to be affected by RAM or shared L3 cache speed. With 2 operations, L2 cache sizes of 512 KB, 10.24M Words and 16 threads, performance gains over 8 threads are higher than might be expected. Data per thread is 640 KB, where Windows time slices can use multiple L2 caches more effectively.

The second version of the benchmark has a compile option to use SSE instructions, where Single Instruction Multiple Data (SIMD) instructions could execute a multiply or add on four 32 bit single precision floating point numbers at the same time. Instead, MS C/C++ Version 15 caused the benchmark to run as a conventional scalar process, with one 32 bit number in the 128 bit SSE registers, or Single Instruction Single Data (SISD) operation. In some cases, this is no faster than compilations using the original 8087 floating point instructions.

Recompiling with MS C/C++ Version 18, from Visual Studio 2013, produced full SIMD functions. The theoretical maximum MFLOPS speed of earlier processors with SSE was 4 x number of cores x CPU MHz. Later, a multiply and an add could be executed in the same clock cycle, where maximum speed per core is 8 x CPU MHz. More recently, Advanced Vector Extensions (AVX) were introduced, with 256 bit registers and speeds of up to 16 x CPU MHz per core. Unfortunately, compiling with the AVX option with MS C/C++ Version 18 did not fully implement the new instructions; for details of the assembly code produced, see below. Later CPUs have AVX2, with further performance improvements. Results are for:

Core 2 Duo 2.4 GHz 2 cores - Maximum GFLOPS 4.4 i87, 7.3 SISD, 19.0 SIMD or 4.0 per GHz per core.

Phenom II 3.0 GHz 4 cores - Maximum GFLOPS 15.5 i87, 17.7 SISD, 53.7 SIMD or 4.48 per GHz per core.

Core i7 3.9 GHz 4 cores + Hyperthreading - Maximum GFLOPS 24.3 i87, 24.7 SISD, 98.4 SIMD or 6.30 per GHz per core, AVX 98.2. There is little gain through Hyperthreading; note the similar SIMD and AVX speeds, and the improved performance relative to clock speed compared with the Phenom. This PC has four memory channels, producing exceptional relative performance.
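
As a worked example, based on the per core formulas above and the Core i7 figures in the table below, the theoretical peaks and best measured speeds are:

   SSE SIMD peak  4 cores x 3900 MHz x  8 = 124800 MFLOPS, best measured 98446
   AVX peak       4 cores x 3900 MHz x 16 = 249600 MFLOPS, best measured 98220

The SIMD compilation reaches nearly 80% of its peak, with the AVX compilation clearly not exploiting the wider registers.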


                                         MFLOPS 1 to 16 Threads

 Operations Per Word     2      2      2       8      8      8      32     32     32
      Million Words   0.10   1.02  10.24    0.10   1.02  10.24    0.10   1.02  10.24
             Threads

 Core 2 Duo       1   1509   1419   1112    1981   2038   2014    2301   2292   2280
 512 KB x 2 L2    2   2854   2606   1185    3709   3847   3738    4374   4359   4301
 2400 MHz         4   2854   2606   1185    3709   3847   3738    4374   4359   4301
 i87              8   2957   2853   1265    3875   3918   3595    4440   4362   4424
                 16   3056   2855   2118    3963   3935   3636    4396   4386   4345

 Core 2 Duo       1   2504   2181   1165    3319   3664   3506    3595   3577   3559
 512 KB x 2 L2    2   4791   4497   1197    6907   6824   4481    6841   6755   6718
 2400 MHz         4   4773   4572   1211    5015   7040   4832    6849   6951   6153
 SISD             8   4786   4657   1300    6335   6842   4931    6872   6826   6730
                 16   5100   4608   2341    7311   7095   6840    6711   7018   6810

 Core 2 Duo       1   2915   2554   1156    6347   6006   4672    9303   9217   9101
 512 KB x 2 L2    2   5576   5177   1205   11127  10546   4775   16941  17613  16703
 2400 MHz         4   5627   5227   1197   12371  11686   4775   17783  17512  16636
 SIMD             8   5564   5573   1291   12386  12306   5093   17920  17992  16892
 L3 6MB          16   5548   5519   2321   12967  11094   8537   19047  18020  16783

 Phenom II X4     1   2086   1970   1631    3858   3707   3521    3929   3909   3861
 512 KB x 4 L2    2   4158   3945   2593    7698   7482   6934    7846   7825   7698
 3000 MHz         4   8232   7744   2737   14914   9544  11251   15461  15498  14906
 i87              8   8201   7922   2927   14645  12169  10091   15512  14633  14772
 L3 6MB          16   7424   7199   3408   14883  13697  11652   15187  14100  14428

 Phenom II X4     1   3320   3091   1910    4551   4439   4196    3962   3930   3886
 512 KB x 4 L2    2   6579   6312   2705    9074   8940   8138    7909   7863   7763
 3000 MHz         4  12796  11945   3076   17473  17505  11322   12679  15450  14953
 SISD             8  12775  10843   3073   17620  17722  11304   15378  12972  15053
 L3 6MB          16  11478  10623   4656   17601  17164  11734   15486  15207  14894

 Phenom II X4     1   6539   4328   2127   12022  11210   7326   13656  13339  12781
 512 KB x 4 L2    2  12811   8750   2841   23793  22928  10622   27202  26813  25210
 3000 MHz         4  16543  12395   3182   45783  43472  12134   53746  51504  42746
 SIMD             8  24417  23910   3188   46269  46410  12189   53181  52662  43354
 L3 6MB          16  23607  22774   3647   41679  43777  13327   52156  51361  44364

 Core i7 4820K    1   3867   3853   3386    6085   6054   6017    5830   5824   5809
 256 KB x 4 L2    2   7737   7731   6618   12160  12165  11991   11653  11648  11650
 4 core 8 Thrd    4  15433  15459   9833   23487  24291  23886   22666  23175  23220
 3900 MHz i87     8  15359  15395   9846   23554  23708  23586   23418  23464  23416
 L3 10 MB        16  15145  15192  10023   23422  23536  22966   23241  23401  23282

 Core i7 4820K    1   5004   4960   4192    6188   6182   6135    5890   5890   5887
 256 KB x 4 L2    2   9996  10002   8049   12371  12354  12282   11770  11779  11744
 4 core 8 Thrd    4  19923  18532   9866   23946  24704  24347   23219  23531  23497
 3900 MHz SISD    8  19602  19776   9820   24683  24648  24634   23521  23497  23506
 L3 10 MB        16  18727  19077  10073   24316  24243  24442   23469  23393  23385

 Core i7 4820K    1  10116   9864   5852   24636  24436  19881   23353  23389  23243
 256 KB x 4 L2    2  26453  19851   9189   49181  49223  34969   46653  46759  46414
 4 core 8 Thrd    4  41845  26975  10063   85909  93852  40163   89202  90572  87329
 3900 MHz SIMD    8  58734  43723   9980   97139  98446  40062   91320  93885  93125
 L3 10 MB        16  57731  42194  10178   94166  93338  40074   90162  92102  93496

 Core i7 4820K    1  10046   9901   5906   24629  24382  19832   23411  23361  23246
 256 KB x 4 L2    2  26634  19679   9250   49194  49267  35183   46788  46788  46382
 4 core 8 Thrd    4  52424  39057  10092   60266  98220  39744   90948  90611  92515
 3900 MHz AVX     8  58601  43529  10032   85198  98220  40162   93810  93866  93745
 L3 10 MB        16  57098  42920  10319   86267  95243  40427   92929  92995  92356

   


MP MFLOPS Numeric Answers

As indicated earlier, the calculated result depends on the number of calculations on each data word, results closer to the initial value of 0.999999 indicating fewer calculations. The number depends on Repeat Passes, with a starting value of 2500 x identified CPUs. For example, Repeats of 10000 could be for a quad core CPU and 20000 for a quad core CPU with Hyperthreading.

Results via old i87 calculations are slightly different to those using SSE arithmetic.


 Repeats  5000               10000               20000               40000
 Version   SSE       i87       SSE       i87       SSE       i87       SSE       i87

 a2     0.867359  0.867238  0.764063  0.763849  0.620974  0.620631  0.481454  0.481096
 b2     0.985193  0.985180  0.970753  0.970727  0.942935  0.942883  0.891302  0.891203
 c2     0.998502  0.998501  0.997008  0.997006  0.994032  0.994027  0.988125  0.988114

 a8     0.918220  0.918307  0.850923  0.851082  0.749971  0.750239  0.635325  0.635706
 b8     0.991084  0.991095  0.982342  0.982363  0.965360  0.965401  0.933325  0.933397
 c8     0.999099  0.999101  0.998200  0.998204  0.996409  0.996416  0.992853  0.992862

 a32    0.798973  0.799276  0.660143  0.660653  0.498060  0.498797  0.385106  0.384777
 b32    0.976383  0.976422  0.953631  0.953702  0.910573  0.910709  0.833458  0.833707
 c32    0.997595  0.997602  0.995203  0.995214  0.990447  0.990463  0.981037  0.981068
 
 Key - Words a=102400 x Repeats, b=1024000 x Repeats / 10, c=10240000 x Repeats / 100  
                   2, 8 and 32 are Operations Per Word
 
   


MP MFLOPS Assembly Code

Following are details of the assembly code for the test with 3 multiplies, 4 adds and a subtract per data word. The test loop is mainly unrolled, with these sequences repeated multiple times. For full SSE and AVX, the main compiled loop included 16 arithmetic instructions, giving totals of 8 x 8 and 16 x 8 operations. Then there were other sequences to handle remainders that are not multiples of 64 and 128.

The original benchmark was built with the compiler that produced the inefficient SISD instructions, but the later compiler, supplied with Visual Studio 2013, produced SIMD code, though not a full implementation of AVX functions. A compilation under Linux Ubuntu 14.04 produced the expected AVX sequences, using 256 bit ymm registers rather than the 128 bit xmm varieties.

The GFLOPS figures shown are maximums on a 3.9 GHz CPU that can execute a multiply and an add per cycle, or up to 7.8 x 4 = 31.2 GFLOPS with four cores under SISD. With 3 multiplies out of 8 instructions in the test sequence, 25 GFLOPS looks quite good, as does 98 GFLOPS with SIMD.

 
  Assembly Code Sequences for x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;

         GFLOPS are for four Core i7-4820K running at 3.9 GHz

  Four Cores 24 GFLOPS
  Compiler 15.00.30729.207 for 80x86
 
  fadd     DWORD PTR _a$[esp+4]
  fmul     ST(0), ST(2)
  fld      ST(3)
  faddp    ST(2), ST(0)
  fld      ST(4)
  fmulp    ST(2), ST(0)
  fsubrp   ST(1), ST(0)
  fld      DWORD PTR tv1064[esp+4]
  fadd     ST(0), ST(5)
  fmul     ST(0), ST(6)
  faddp    ST(1), ST(0)

 
  SSE Option - SISD                    SSE Option - SIMD
  ss is scalar single precision        ps is packed single precision
  Four Cores 25 GFLOPS                 Four Cores 98 GFLOPS
  Compiler 15.00.30729.207 for x64     Compiler 18.00.21005.1 for x64
 
  addss    xmm1, xmm8                  addps    xmm1, xmm10
  addss    xmm2, xmm4                  addps    xmm2, xmm6
  addss    xmm0, xmm6                  addps    xmm0, xmm8
  mulss    xmm1, xmm9                  mulps    xmm1, xmm11
  mulss    xmm0, xmm7                  mulps    xmm2, xmm7
  mulss    xmm2, xmm5                  mulps    xmm0, xmm9
  subss    xmm2, xmm0                  subps    xmm2, xmm0
  addss    xmm2, xmm1                  addps    xmm2, xmm1

 
  AVX Option - 128 bit registers       AVX Option - 256 bit registers
  Four Cores 98 GFLOPS                 Four Cores 162 GFLOPS
  Compiler 18.00.21005.1 for x64       Compiler gcc version 4.8.2
 
  vaddps   xmm1, xmm0, xmm6            vaddps   ymm5, ymm6,  ymm13
  vaddps   xmm0, xmm0, xmm8            vaddps   ymm3, ymm6,  ymm7
  vmulps   xmm2, xmm0, xmm9            vaddps   ymm1, ymm6,  ymm6
  vmulps   xmm3, xmm1, xmm7            vmulps   ymm4, ymm13, ymm13
  vsubps   xmm4, xmm3, xmm2            vmulps   ymm2, ymm7,  ymm7
  vaddps   xmm1, xmm5, xmm10           vmulps   ymm0, ymm6,  ymm6
  vmulps   xmm0, xmm1, xmm11           vsubps   ymm7, ymm13, ymm7
  vaddps   xmm2, xmm4, xmm0            vaddps   ymm6, ymm7,  ymm6
   


OpenMP Benchmark

The second benchmark in this series was for OpenMP. This did not have the same word counts as MP MFLOPS (divisible by 64), so the numeric results are slightly different. The benchmarks included are OpenMP32MFLOPS.exe, using i387 floating point, OpenMP64MFLOPS.exe, with default SSE instructions, and SSE32MFLOPS.exe, not using OpenMP. Details and results are in openmp mflops.htm, with the benchmarks and source code downloadable in OpenMPMflops.zip.

This benchmark uses standard C/C++ code, with a pragma directive before loops offered for parallelisation (see below) and a /openmp option added to the compile command. The original compiled to one word at a time SISD SSE instructions (see above) but automatically used all CPU cores, confirmed by the example of using one core, via an Affinity setting, shown below. Measured performance is effectively the same as with the MP MFLOPS benchmark, using the same Core i7 based system.

The benchmark was recompiled with MS C/C++ Version 18, from Visual Studio 2013, but, unlike MP MFLOPS, this did not produce full SIMD functions. Results are below, mainly demonstrating the same performance as the original compilation.

  
                      Results on 3900 MHz Core i7 4820K
 
  C Code Directive

  #pragma omp parallel for
  for(i=0; i < n; i++)
  {
     x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
  }

  Compile Command

  cl /O2 /openmp /MD /W4  /Zi /TP /EHsc /Fa  /c OpenMP64MFLOPS.cpp


      64 Bit OpenMP MFLOPS Benchmark 1 Wed May 28 16:06:54 2014

  Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.059061     8466    0.929538   Yes
 Data in & out    1000000     2      250   0.039231    12745    0.992550   Yes
 Data in & out   10000000     2       25   0.052317     9557    0.999250   Yes

 Data in & out     100000     8     2500   0.103711    19284    0.957117   Yes
 Data in & out    1000000     8      250   0.083413    23977    0.995517   Yes
 Data in & out   10000000     8       25   0.083001    24096    0.999549   Yes

 Data in & out     100000    32     2500   0.360390    22198    0.890211   Yes
 Data in & out    1000000    32      250   0.344797    23202    0.988082   Yes
 Data in & out   10000000    32       25   0.345281    23170    0.998796   Yes


 Example 1 Core - Command Start /Affinity 1 OpenMP64MFLOPS

 Data in & out     100000     8     2500   0.356484     5610    0.957117   Yes
 Data in & out    1000000     8      250   0.329129     6077    0.995517   Yes
 Data in & out   10000000     8       25   0.328513     6088    0.999549   Yes


  Via Microsoft C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

 Data in & out     100000     2     2500   0.037910    13189    0.929538   Yes
 Data in & out    1000000     2      250   0.035475    14095    0.992550   Yes
 Data in & out   10000000     2       25   0.055958     8935    0.999250   Yes

 Data in & out     100000     8     2500   0.086657    23079    0.957117   Yes
 Data in & out    1000000     8      250   0.082454    24256    0.995517   Yes
 Data in & out   10000000     8       25   0.083116    24063    0.999549   Yes

 Data in & out     100000    32     2500   0.343880    23264    0.890211   Yes
 Data in & out    1000000    32      250   0.358404    22321    0.988082   Yes
 Data in & out   10000000    32       25   0.342454    23361    0.998796   Yes

   


Qpar MP Benchmark

With Visual Studio 2012, Microsoft added an "Auto-Parallelizer" to the compiler that can automatically generate multiple threads in the same way as OpenMP. This requires a /Qpar compiler option and a pragma directive, as shown below.

The OpenMP benchmark was compiled this way, via Compiler Version 18.00, supplied with Visual Studio 2013, and produced SIMD instructions, mainly providing performance gains of four times OpenMP results, similar to those in MP MFLOPS. The program was also compiled to use AVX instructions, producing the same disappointing speeds as MP MFLOPS.

The #pragma loop directive cannot use a variable to specify the number of threads, so the program was modified to select from fixed counts of 1, 2, 4, 8 or 16 threads, with a default of 4. The new benchmark is QparMP64MFLOPS.exe, with execution and source files included in GigaFLOPS-Benchmarks.zip. This also has copies of DLL files that might be needed to run the benchmark on different systems. Besides the run time parameter to control the number of threads (T or Threads), others are provided for the number of words (W or Words) and repeat passes (R or Repeats) - see below.
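
The arrangement is roughly as sketched below, with invented naming, where a separately compiled loop is provided for each supported thread count, because hint_parallel() needs a constant rather than a variable.

   // Sketch only - one loop per supported thread count, chosen at run time
   void calcRun(float x[], int n, int threads, float a, float b,
                float c, float d, float e, float f)
   {
      int i;
      if (threads == 1)
      {
         #pragma loop(hint_parallel(1))
         for (i = 0; i < n; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      }
      else if (threads == 2)
      {
         #pragma loop(hint_parallel(2))
         for (i = 0; i < n; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      }
      else                            // default 4 threads, 8 and 16 similar
      {
         #pragma loop(hint_parallel(4))
         for (i = 0; i < n; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      }
   }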

 
  C Code Directive

  #pragma loop(hint_parallel(4))
  for(i=0; i < n; i++)
  {
     x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
  }

  Compile Command

  cl /O2 /Qpar /MD /W4  /Zi /TP /EHsc /Fa /c QparMP64MFLOPS.cpp


   64 Bit Qpar MFLOPS Benchmark 1, 4 Threads, Mon Jun 23 16:27:50 2014

  Via Microsoft C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.010905    45849    0.929538   Yes
 Data in & out    1000000     2      250   0.012933    38661    0.992550   Yes
 Data in & out   10000000     2       25   0.052096     9598    0.999250   Yes

 Data in & out     100000     8     2500   0.023440    85324    0.957117   Yes
 Data in & out    1000000     8      250   0.020773    96279    0.995517   Yes
 Data in & out   10000000     8       25   0.052930    37786    0.999549   Yes

 Data in & out     100000    32     2500   0.087836    91079    0.890211   Yes
 Data in & out    1000000    32      250   0.085585    93475    0.988082   Yes
 Data in & out   10000000    32       25   0.086839    92124    0.998796   Yes


  Run Time Commands

   QparMP64MFLOPS.exe Txxx N1, Wxxx N2, Rxxx N3

   Examples  QparMP64MFLOPS.exe T 16      or  QparMP64MFLOPS.exe Thrds 16
             QparMP64MFLOPS.exe W 200000  or  QparMP64MFLOPS.exe Wds 200000
             QparMP64MFLOPS.exe T 8, W 200000, R 5000
   


Qpar and OpenMP Comparisons

Following are some OpenMP, Qpar and MP-MFLOPS comparisons on two quad core systems. On results not dependent on memory speed, Core i7 Qpar and MP-MFLOPS speeds, using four threads, average four times the OpenMP scores, due to SIMD rather than SISD register use; the gain on the Phenom is not quite as large.

With Qpar tests, the Phenom speeds, using eight threads, tend to be much slower than with four threads. The Core i7 degradation is not as severe and is probably helped by Hyperthreading. With MP-MFLOPS, and CPU speed limited tests, performance with eight threads tends to be four times that when using a single thread, Hyperthreading providing a higher gain on the i7.

The 10.24 MB tests, with fewer calculations, suggest that multiple threads are needed to obtain the highest RAM throughput, the i7 also being influenced by its 10 MB L3 cache.

The i7 PC has four memory channels, compared with two on the Phenom, and its RAM is also faster. The CPU MHz ratio is 1.30, but CPU dependent speeds on the i7 average around 1.75 times those on the Phenom, rising to 2.5 times where caching effects apply and to more than three times via RAM.


                                         MFLOPS 1 to 16 Threads

 Operations Per Word     2      2      2       8      8      8      32     32     32
      Million Words   0.10   1.02  10.24    0.10   1.02  10.24    0.10   1.02  10.24
             Threads

 Phenom X4        1   1800   1866   1524    3585   3686   3523    3647   3653   3628
 OpenMP           4   5408   7045   2964   11891  14194  11178   13494  14006  13842

 Phenom           1   6003   4190   2116   11585  11203   7321   14087  13768  13151
 X4               2  10999   8653   2843   23194  22732  10567   27421  27538  25513
 4 Core           4  17699  15915   3125   36186  43438  11926   47664  53064  43406
 3000 MHz         8   8158   3208   2948   25464  29383  10189   32648  36188  32475
 Qpar            16   9937  10422   3200   34765  42399  12125   41226  53077  39954

 Phenom           1   6539   4328   2127   12022  11210   7326   13656  13339  12781
 4 Core           2  12811   8750   2841   23793  22928  10622   27202  26813  25210
 3000 MHz         4  16543  12395   3182   45783  43472  12134   53746  51504  42746
 3000 MHz         8  24417  23910   3188   46269  46410  12189   53181  52662  43354
 MP-MFLOPS       16  23607  22774   3647   41679  43777  13327   52156  51361  44364


 Core i7 4820K    1   3612   3802   3549    6002   6136   6100    5845   5870   5879
 OpenMP           4  13189  14095   8935   23079  24256  24063   23264  22321  23361

 Core i7          1  10181   9972   5842   24458  24086  19646   23497  23533  23373
 4820K            2  25378  19873   9186   47674  48861  34432   46546  46983  46560
 4 core 8 Thrd    4  45194  39092   9839   85928  95689  37602   90159  93022  90933
 3900 MHz         8  42665  38325   9761   75672  88846  37919   88217  91233  86306
 Qpar            16  18840  35358   9757   66481  90022  38735   83196  91050  87909

 Core i7          1  10116   9864   5852   24636  24436  19881   23353  23389  23243
 4820K            2  26453  19851   9189   49181  49223  34969   46653  46759  46414
 4 core 8 Thrd    4  41845  26975  10063   85909  93852  40163   89202  90572  87329
 3900 MHz         8  58734  43723   9980   97139  98446  40062   91320  93885  93125
 MP-MFLOPS       16  57731  42194  10178   94166  93338  40074   90162  92102  93496
   


Reliability Tests

MP-MFLOPS and CUDA-MFLOPS benchmarks have a command parameter to run for extended periods, for burn-in purposes or reliability verification. The parameter specifies the run time in minutes, and one test, possibly representing the highest CPU/GPU loading, is run. With MP-MFLOPS, this uses 102400 words at 32 operations per word, where the number of threads can also be varied. CUDA-MFLOPS uses the last of the Extra Tests, with 10 million words and 32 operations per word, where the default is Calculate, but an added FC (fast cache) option uses the faster Shared Memory test (see example below).

The benchmarks display and log MFLOPS speeds, normally at intervals of 15 to 20 seconds, via a calibrated Repeat Passes count, and numeric results are checked for consistency. Measured times can increase, and MFLOPS fall, if CPU/GPU clock speeds are automatically reduced, due to overheating or power saving.
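
An outline of the timing arrangement is sketched below in C, with assumed details and single threaded for simplicity; the repeat count is first calibrated so each measured block lasts about 15 seconds, then blocks are run and logged until the requested number of minutes has passed.

   #include <stdio.h>
   #include <time.h>

   #define WORDS 102400                  /* reliability test data size */
   static float x[WORDS];

   /* one measured block - returns elapsed seconds for the given passes */
   static double runBlock(long passes)
   {
      float a = 0.0001f, b = 0.9999f, c = 0.0002f,
            d = 0.4999f, e = 0.0003f, f = 0.5001f;
      long p;
      int  i;
      clock_t start = clock();
      for (p = 0; p < passes; p++)
         for (i = 0; i < WORDS; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
      return (double)(clock() - start) / CLOCKS_PER_SEC;
   }

   int main(void)
   {
      int    i, minutes = 10, ops = 8;   /* 8 ops/word here, real test uses 32 */
      double interval = 15.0, secs;
      long   passes = 1000;
      time_t endTime;

      for (i = 0; i < WORDS; i++) x[i] = 0.999999f;

      /* calibrate - scale the repeat count to run for about 15 seconds */
      secs   = runBlock(passes);
      passes = (long)(passes * interval / secs);

      endTime = time(NULL) + 60L * minutes;
      while (time(NULL) < endTime)
      {
         secs = runBlock(passes);
         printf(" %8ld passes %10.6f seconds %8.0f MFLOPS\n", passes, secs,
                (double)WORDS / 1000000.0 * ops * passes / secs);
         /* the real benchmarks also recheck the numeric results here */
      }
      return 0;
   }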

The first entries below show the log format of the two programs, followed by a BAT file to run 10 minute tests, with MP-MFLOPS restricted to four out of eight threads on a quad core i7 4820K. The graphics card is a GeForce GTX 650. Next are the results, with CUDA-MFLOPS running with no performance degradation, but stealing some CPU time from MP-MFLOPS to reduce that program's speed.

Temperatures were monitored during the test, using Asus AI Suite II for the CPU (case temperature) and Asus GPU Tweak for the GeForce graphics processor. There is nothing unusual about the recorded temperatures, shown below, but such measurements are useful for later comparisons, in the event of system failures.


 8 CPUs Available
 ##############################################

  64 Bit MP SSE MFLOPS Benchmark C2, 8 Threads, Tue Jul 01 10:39:20 2014

        C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400    32   427524  14.963640    93621    0.352168   Yes
 Data in & out     102400    32   427524  14.964366    93616    0.352168   Yes
 Data in & out     102400    32   427524  14.979681    93521    0.352168   Yes
 Data in & out     102400    32   427524  14.987834    93470    0.352168   Yes


 ##############################################

  CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.31 Tue Jul 01 11:40:41 2014

  Shared Memory  Reliability Test 1 minutes, report every 15 seconds

  Results of all calculations should be -    0.5065064430236816

  Test Seconds   MFLOPS    Errors     First               Value
                                       Word 

   1    14.324   429967   None found
   2    14.324   429970   None found
   3    14.324   429968   None found
   4    14.325   429955   None found

 ##############################################

 Run.bat 10 minute test 

 Start cuda3mflops-x64sp Minutes 10 FC
 Start MPmflopsc2 Minutes 10, Threads 4

 ##############################################

 CUDA

 Test Seconds   MFLOPS    Errors

  22    14.361   429554   None found
  23    14.375   429133   None found
  24    14.360   429598   None found

 ##############################################

 MP-MFLOPS 4 Threads

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400    32   288902  13.039905    72598    0.352168   Yes
 Data in & out     102400    32   288902  13.505035    70098    0.352168   Yes
 Data in & out     102400    32   288902  13.992925    67654    0.352168   Yes

 Just CPU Test
 Data in & out     102400    32   288902  10.430468    90760    0.352168   Yes

 ##############################################

 Temperatures  - Room 27 °C
                                Minutes                                   Spec
            0    1    2    3    4    5    6    7    8    9   10  Increase  Max

  CPU °C   39   42   46   48   49   50   50   51   52   53   53    14     66.8
  GPU °C   31   49   54   56   58   59   60   60   60   60   60    29     98   

   



Roy Longbottom at Linkedin  Roy Longbottom July 2014

The Official Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection