General
This series comprises four types of benchmark: OpenMP and Qpar (Microsoft’s proprietary equivalent), both with automatic multiprocessing; CUDA, for GeForce graphics processors; and MP MFLOPS, a CPU benchmark with multithreading (up to 64 threads). All carry out the same calculations, of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations per input data word. They also check the numeric results for consistency, which also reveals differences in the results produced by the various instruction sets.
The benchmarks are available compiled for 32 bit and 64 bit operating systems, and can use single or double precision floating point calculations, via the original x87, SSE or AVX instruction sets.
Tests are run at increasing data sizes to transfer data from caches and RAM. The latest versions of the benchmarks and source code are included in GigaFLOPS-Benchmarks.zip.
The programs are run from Command Prompt windows and generally have parameter options to run for extended periods as reliability/burn-in tests. Results are displayed and saved in a text based log file.
Speed is shown in MFLOPS, or Millions of FLoating point Operations Per Second.
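As a minimal sketch of how an MFLOPS figure can be derived, the function below follows the Example Calculation printed in the MP MFLOPS log later on this page (words / 1000000 x operations per word x repeat passes / seconds). The function and parameter names are illustrative only and are not taken from the benchmark source.

```c
/* MFLOPS = words / 1,000,000 x operations per word x repeat passes / seconds.
   Names are hypothetical; the formula is the one shown in the
   Example Calculation of the MP MFLOPS log. */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    return words / 1.0e6 * ops_per_word * passes / seconds;
}
```

Calling mflops(10240000, 32, 200, 0.703741) reproduces the log's example of about 93125 MFLOPS.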
CUDA GPU Benchmarks
These were the first benchmarks in the series. Four varieties are available: CUDA3MFLOPS-x86SP.exe, CUDA3MFLOPS-x86DP.exe, CUDA3MFLOPS-x64SP.exe and CUDA3MFLOPS-x64DP.exe, two each for 32 bit and 64 bit Windows. Single Precision (SP) and Double Precision (DP) compilations are provided, as the performance difference can be considerable.
CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously.
Unlike the CPU benchmarks, three tests are carried out at each data size and calculations per word combination. This is to show that performance is severely degraded if data transfers are over the relatively slow external bus. The first tests are run with Repeat Passes controlled by the CPU but, to demonstrate fastest speeds (with these tests), Extra Tests are run with all repeats controlled by the GPU.
The example below is for a GeForce GTX 680, possibly the fastest graphics card in 2012. This has a maximum specification of 3090 GFLOPS, the benchmark achieving up to 1746 GFLOPS. Note that numeric accuracy improves with fewer data returns between calculations.
Details on installing and using CUDA software, with further benchmark details and results, can be found in cuda1.htm, cuda2.htm and cuda3 x64m.htm.
These show how the numeric results vary according to precision used and provide details on how to run the programs for extended periods as reliability/burn-in tests.
Note that the total data transferred can be 1 GB in and 1 GB out. On reducing data flow by this amount, the difference in test time is mainly around 0.25 seconds, suggesting an effective transfer speed of about 4 GB/second. That may be about right for PCIe 3.0 x16, which has a maximum of nearly 16 GB/second.
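The transfer speed estimate can be checked with a short sketch like the following. It assumes 4 byte words and takes 1 GB as 1,000,000,000 bytes; the function names are hypothetical. The timings to feed in are the "Data in & out" and "Data out only" seconds from the log, for example 0.482678 and 0.240010 at 10 million words, 2 operations per word and 25 passes.

```c
/* Bytes moved in one direction: words x 4 bytes x repeat passes. */
double bytes_one_way(double words, double passes)
{
    return words * 4.0 * passes;
}

/* Effective rate in GB/second (1 GB taken as 1e9 bytes), from the time
   saved when one direction of transfer is dropped. Names are illustrative. */
double transfer_gb_per_sec(double words, double passes,
                           double secs_with, double secs_without)
{
    return bytes_one_way(words, passes) / (secs_with - secs_without) / 1.0e9;
}
```

With the figures quoted, 1 GB is moved in roughly 0.24 seconds, which gives a little over 4 GB/second.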
CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.3 Sun Nov 04 17:58:44 2012
CUDA devices found
Device 0: GeForce GTX 680 with 8 Processors 64 cores
Global Memory 2000 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test 4 Byte Ops Repeat Seconds MFLOPS First All
Words /Wd Passes Results Same
Data in & out 100000 2 2500 0.967272 517 0.9295383095741 Yes
Data out only 100000 2 2500 0.387346 1291 0.9295383095741 Yes
Calculate only 100000 2 2500 0.070436 7099 0.9295383095741 Yes
Data in & out 1000000 2 250 0.526663 949 0.9925497770309 Yes
Data out only 1000000 2 250 0.245081 2040 0.9925497770309 Yes
Calculate only 1000000 2 250 0.019763 25299 0.9925497770309 Yes
Data in & out 10000000 2 25 0.482678 1036 0.9992496371269 Yes
Data out only 10000000 2 25 0.240010 2083 0.9992496371269 Yes
Calculate only 10000000 2 25 0.013708 36475 0.9992496371269 Yes
Data in & out 100000 8 2500 0.759731 2633 0.9571172595024 Yes
Data out only 100000 8 2500 0.410652 4870 0.9571172595024 Yes
Calculate only 100000 8 2500 0.073366 27261 0.9571172595024 Yes
Data in & out 1000000 8 250 0.524791 3811 0.9955183267593 Yes
Data out only 1000000 8 250 0.245618 8143 0.9955183267593 Yes
Calculate only 1000000 8 250 0.020494 97589 0.9955183267593 Yes
Data in & out 10000000 8 25 0.494677 4043 0.9995489120483 Yes
Data out only 10000000 8 25 0.240809 8305 0.9995489120483 Yes
Calculate only 10000000 8 25 0.013834 144575 0.9995489120483 Yes
Data in & out 100000 32 2500 0.764819 10460 0.8902152180672 Yes
Data out only 100000 32 2500 0.415392 19259 0.8902152180672 Yes
Calculate only 100000 32 2500 0.135979 58833 0.8902152180672 Yes
Data in & out 1000000 32 250 0.529935 15096 0.9880878329277 Yes
Data out only 1000000 32 250 0.247135 32371 0.9880878329277 Yes
Calculate only 1000000 32 250 0.024024 333000 0.9880878329277 Yes
Data in & out 10000000 32 25 0.493384 16215 0.9987964630127 Yes
Data out only 10000000 32 25 0.242553 32983 0.9987964630127 Yes
Calculate only 10000000 32 25 0.015177 527122 0.9987964630127 Yes
Extra tests - Repeat Passes in main CUDA Function
Calculate 10000000 2 25 0.004503 111041 0.9992496371269 Yes
Shared Memory 10000000 2 25 0.002053 243601 0.9992496371269 Yes
Calculate 10000000 8 25 0.005726 349289 0.9995489120483 Yes
Shared Memory 10000000 8 25 0.002773 721272 0.9995489120483 Yes
Calculate 10000000 32 25 0.008419 950255 0.9987964630127 Yes
Shared Memory 10000000 32 25 0.004581 1746493 0.9987964630127 Yes
Hardware Information
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000206D7
Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz Measured 3200 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows Information
AMD64 processor architecture, 12 CPUs
Windows NT Version 6.1, build 7601, Service Pack 1
Memory 32710 MB, Free 28947 MB
User Virtual Space 8388608 MB, Free 8388539 MB
2000 MB Graphics RAM, Used 204 Minimum, 240 Maximum
CUDA Comparisons
Following are further results for the GTX 680, along with results for a GTX 650, from single and double precision benchmarks compiled for 32 bit and 64 bit Windows. The PCs have similar CPU, bus and RAM speeds, as reflected in the results involving data transfers. Note that 64 bit compilations are often, but not always, faster than those at 32 bits, and that single precision speeds can be much higher than those for double precision calculations, with the largest difference (>15x) on the extra tests.
GeForce GTX 680 GeForce GTX 650
Maximum Specification 3090 GFLOPS 1505 GFLOPS
Test Wds x Ops 3.1 SP 3.1 SP 3.1 DP 3.1 DP 3.1 SP 3.1 SP 3.1 DP 3.1 DP
x Passes 32b 64b 32b 64b 32b 64b 32b 64b
Data in & out .1Mx2x2500 622 517 354 353 593 459 333 350
Data out only .1Mx2x2500 1241 1291 667 712 1050 1059 643 651
Calculate only .1Mx2x2500 7310 7099 4126 6140 3854 3449 3351 3069
Data in & out 1Mx2x250 931 949 479 489 869 893 471 479
Data out only 1Mx2x250 1960 2040 1000 1049 1765 1790 941 962
Calculate only 1Mx2x250 21914 25299 16143 16226 9244 8806 6889 6627
Data in & out 10Mx2x25 999 1036 513 516 959 980 502 508
Data out only 10Mx2x25 2005 2083 1040 1052 1783 1852 952 972
Calculate only 10Mx2x25 36175 36475 19876 19882 10867 10530 7746 7533
Data in & out .1Mx8x2500 2445 2633 1415 1430 2348 2375 1294 1299
Data out only .1Mx8x2500 4280 4870 2724 2906 4135 4151 2396 2415
Calculate only .1Mx8x2500 25502 27261 15501 22388 14274 13056 9809 9029
Data in & out 1Mx8x250 3722 3811 1901 1949 3458 3545 1806 1829
Data out only 1Mx8x250 7694 8143 4058 4143 6982 7107 3455 3504
Calculate only 1Mx8x250 85455 97589 60592 60056 33375 34014 16613 15852
Data in & out 10Mx8x25 4015 4043 2061 2052 3841 3896 1915 1926
Data out only 10Mx8x25 8177 8305 4055 4130 7222 7283 3524 3572
Calculate only 10Mx8x25 140800 144575 75558 74333 42328 40905 17922 17221
Data in & out .1Mx32x2500 10080 10460 5396 5485 9026 9183 4262 4588
Data out only .1Mx32x2500 18672 19259 9978 10053 15660 15769 7355 7377
Calculate only .1Mx32x2500 83989 58833 43334 47974 47957 43975 17469 16658
Data in & out 1Mx32x250 14629 15096 7269 7440 13634 14006 5947 6015
Data out only 1Mx32x250 30977 32371 14810 14977 27261 27684 9864 9881
Calculate only 1Mx32x250 347405 333000 96044 92231 125027 120972 22708 21998
Data in & out 10Mx32x25 15995 16215 7739 7765 15123 15499 6288 6278
Data out only 10Mx32x25 31586 32983 14956 15037 27770 28906 9984 9999
Calculate only 10Mx32x25 519153 527122 105419 101771 149697 147100 23467 22763
Extra tests - Repeat Passes in main CUDA Function
Calculate 10Mx2x25 126843 111041 50749 44017 29899 26876 11491 9801
Shared Memory 10Mx2x25 236915 243601 73186 72369 75618 77049 16407 16268
Calculate 10Mx8x25 470867 349289 89172 83696 110088 81484 19821 18451
Shared Memory 10Mx8x25 969416 721272 100825 100254 227879 181190 22425 22322
Calculate 10Mx32x25 1154649 950255 109800 105142 253381 216570 24083 23353
Shared Memory 10Mx32x25 1714512 1746493 111425 110982 412313 400966 24587 24560
MP MFLOPS Benchmarks
The benchmark was first produced to run on PCs via Linux, details being in Linux Multithreading Benchmarks.htm. Then mini versions were produced, where details are in Android MultiThreading Benchmarks.htm and Raspberry Pi Multithreading Benchmarks.htm.
Four versions are available, compiled from the same source code, in the GigaFLOPS ZIP file, now intended to be run via 64 bit Windows. They are:
- MPmflops32.exe compiled with MS C/C++ Version 15.00.30729.207 for 80x86 - old 8087 floating point instructions
- MPmflops64.exe compiled with MS C/C++ Version 15.00.30729.207 for x64 - to use SSE floating point calculations
- MPmflopsc2.exe compiled with MS C/C++ Version 18.00.21005.1 for x64 - to fully implement SSE functions
- MPmflopsAVX.exe compiled with MS C/C++ Version 18.00.21005.1 for x64, with /arch:AVX option to use new vector instructions.
The first compilation of this failed to run on a PC without AVX. The revised MPmflopsAVX.exe is a small program that identifies the configuration, then either indicates that AVX is not supported or runs the benchmark itself, now named onlyAVX.exe.
A WhatConfig function identifies how many cores are available, doubled when Hyperthreading is available. This provides the default thread count. Alternatively, a run time parameter specifies the number of threads to use (see below). This can be zero where no threads are created but, in this case, speeds have been the same as using one thread.
The threads use a shared data array, but each uses a dedicated segment. Example results are below. The “Data in & out” label is to identify the equivalent CUDA test. The thread count should be one of the options shown below (those that divide exactly into 1024). Otherwise, unexpected results will be reported (see example).
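The sketch below illustrates one way a shared array could be divided into dedicated per-thread segments, and why a thread count that does not divide the word count would leave some words unprocessed. The equal-split scheme and all function names are assumptions for illustration; the benchmark source may divide the work differently.

```c
/* Words handled by each thread when the array is split into equal
   segments. Threads must be 1 or more. Assumed scheme, for illustration. */
long segment_words(long total_words, long threads)
{
    return total_words / threads;
}

/* First word index of a given thread's segment (threads numbered from 0). */
long segment_start(long total_words, long threads, long thread)
{
    return thread * segment_words(total_words, threads);
}

/* Words at the end of the array that no thread would process. */
long words_left_over(long total_words, long threads)
{
    return total_words - threads * segment_words(total_words, threads);
}
```

With 102400 words and 8 threads, each segment is 12800 words and nothing is left over. With a count such as 3, one word would remain at its initial value under this scheme, which would be flagged by the consistency check.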
The numeric results depend on the number of calculations, which changes with the repeat passes; these vary according to the number of identified CPUs (2500 x CPUs, see below). Results via x87 floating point differ from those via SSE calculations, but the thread count should not affect the numeric answers.
Command Format Example - MPmflopsc2 Threads xx or T xx
xx = 0, 1, 2, 4, 8, 16, 32 or 64
Example results log file:
8 CPUs Available
##############################################
64 Bit MP SSE MFLOPS Benchmark C2, 8 Threads, Mon Jun 09 12:15:48 2014
C/C++ Optimizing Compiler Version 18.00.21005.1 for x64
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 2 20000 0.069739 58734 0.620974 Yes
Data in & out 1024000 2 2000 0.093680 43723 0.942935 Yes
Data in & out 10240000 2 200 0.410417 9980 0.994032 Yes
Data in & out 102400 8 20000 0.168666 97139 0.749971 Yes
Data in & out 1024000 8 2000 0.166426 98446 0.965360 Yes
Data in & out 10240000 8 200 0.408970 40062 0.996409 Yes
Data in & out 102400 32 20000 0.717656 91320 0.498060 Yes
Data in & out 1024000 32 2000 0.698044 93885 0.910573 Yes
Data in & out 10240000 32 200 0.703741 93125 0.990447 Yes
End of test Mon Jun 09 12:15:51 2014
##############################################
CPU GenuineIntel, ecx 7FBEE3BF, edx BFEBFBFF, Model 000306E4
Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz Measured 3711 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, Has AVX, No 3DNow,
Windows GetSystemInfo, GetVersionEx, GlobalMemoryStatus
AMD64 processor architecture, 8 CPUs
Windows NT Version 6.2, build 9200,
Memory 32705 MB, Free 31350 MB
User Virtual Space 134217728 MB, Free 134217718 MB
##############################################
Example Calculation
MFLOPS = 10240000 / 1000000 x 32 X 200 / 0.703741 = 93125.17
##############################################
Example display and log file with errors or odd thread count (4 CPUs Available)
Data in & out 102400 2 10000 0.097761 20949 See later No
Data in & out 102400 2 10000 word 102388 was 0.999999 not 0.764063
Data in & out 102400 2 10000 word 102389 was 0.999999 not 0.764063
MP MFLOPS Comparisons
The tables below show L2 and L3 cache size, the former provided per core and the latter shared between cores.
Tests using two arithmetic operations per word are the most likely to be affected by RAM or shared L3 cache speed. With 2 operations per word, 512 KB L2 caches, 10.24M words and 16 threads, performance gains over 8 threads are higher than might be expected. Each thread then handles 640K words (2.56 MB at 4 bytes per word), where Windows time slices can use multiple L2 caches more effectively.
The second version of the benchmark has a compile option to use SSE instructions, where Single Instruction Multiple Data (SIMD) instructions could execute a multiply or add on four 32 bit single precision floating point numbers at the same time. Instead, MS C/C++ Version 15 produced code that runs as on a conventional scalar processor, with one 32 bit number in each 128 bit SSE register, that is, Single Instruction Single Data (SISD) operation. In some cases, this is no faster than compilations using the original 8087 floating point instructions.
Recompiling with MS C/C++ Version 18, from Visual Studio 2013, produced full SIMD functions. The theoretical maximum MFLOPS speed of earlier processors with SSE was 4 x number of cores x CPU MHz. Later, a multiply and an add could be executed in the same clock cycle, giving a maximum speed per core of 8 x CPU MHz. More recently, Advanced Vector Extensions (AVX) were introduced, with 256 bit registers and speeds of up to 16 x CPU MHz per core. Unfortunately, compiling with the AVX option of MS C/C++ Version 18 did not fully implement the new instructions. For details of the assembly code produced, see below.
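The theoretical maximums quoted above can be expressed as a small sketch, using 4, 8 or 16 results per clock cycle per core for SSE, SSE with a simultaneous multiply and add, and AVX respectively. The function names are hypothetical.

```c
/* Theoretical peak MFLOPS = results per clock cycle per core x cores x MHz.
   Use 4 for early SSE, 8 with a multiply and add per cycle, 16 for AVX. */
double peak_mflops(double flops_per_cycle, double cores, double mhz)
{
    return flops_per_cycle * cores * mhz;
}

/* Measured speed as a fraction of the theoretical peak. */
double fraction_of_peak(double measured_mflops, double peak)
{
    return measured_mflops / peak;
}
```

For a 2 core 2400 MHz CPU at 4 results per cycle this gives 19200 MFLOPS, and for a 4 core 3900 MHz CPU it gives 124800 MFLOPS at 8 per cycle or 249600 MFLOPS at 16 per cycle. Against the 124800 figure, a measured 98446 MFLOPS is a little under 80% of peak.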
Later CPUs have AVX2 with further performance improvements. Results are for:
- Core 2 Duo 2.4 GHz 2 cores - Maximum GFLOPS 4.4 i87, 7.3 SISD, 19.0 SIMD or 4.0 per GHz per core.
- Phenom II 3.0 GHz 4 cores - Maximum GFLOPS 15.5 i87, 17.7 SISD, 53.7 SIMD or 4.48 per GHz per core.
- Core i7 3.9 GHz 4 cores + Hyperthreading - Maximum GFLOPS 24.3 i87, 24.7 SISD, 98.4 SIMD or 6.30 per GHz per core, AVX 98.2. There is little gain through Hyperthreading. Note the similar SIMD and AVX speeds, and the improved performance relative to clock speed compared with the Phenom. This PC has four memory channels, producing exceptional relative performance.
MFLOPS 1 to 16 Threads
Operations Per Word 2 2 2 8 8 8 32 32 32
Million Words 0.10 1.02 10.24 0.10 1.02 10.24 0.10 1.02 10.24
Threads
Core 2 Duo 1 1509 1419 1112 1981 2038 2014 2301 2292 2280
512 KB x 2 L2 2 2854 2606 1185 3709 3847 3738 4374 4359 4301
2400 MHz 4 2854 2606 1185 3709 3847 3738 4374 4359 4301
i87 8 2957 2853 1265 3875 3918 3595 4440 4362 4424
16 3056 2855 2118 3963 3935 3636 4396 4386 4345
Core 2 Duo 1 2504 2181 1165 3319 3664 3506 3595 3577 3559
512 KB x 2 L2 2 4791 4497 1197 6907 6824 4481 6841 6755 6718
2400 MHz 4 4773 4572 1211 5015 7040 4832 6849 6951 6153
SISD 8 4786 4657 1300 6335 6842 4931 6872 6826 6730
16 5100 4608 2341 7311 7095 6840 6711 7018 6810
Core 2 Duo 1 2915 2554 1156 6347 6006 4672 9303 9217 9101
512 KB x 2 L2 2 5576 5177 1205 11127 10546 4775 16941 17613 16703
2400 MHz 4 5627 5227 1197 12371 11686 4775 17783 17512 16636
SIMD 8 5564 5573 1291 12386 12306 5093 17920 17992 16892
L3 6MB 16 5548 5519 2321 12967 11094 8537 19047 18020 16783
Phenom II X4 1 2086 1970 1631 3858 3707 3521 3929 3909 3861
512 KB x 4 L2 2 4158 3945 2593 7698 7482 6934 7846 7825 7698
3000 MHz 4 8232 7744 2737 14914 9544 11251 15461 15498 14906
i87 8 8201 7922 2927 14645 12169 10091 15512 14633 14772
L3 6MB 16 7424 7199 3408 14883 13697 11652 15187 14100 14428
Phenom II X4 1 3320 3091 1910 4551 4439 4196 3962 3930 3886
512 KB x 4 L2 2 6579 6312 2705 9074 8940 8138 7909 7863 7763
3000 MHz 4 12796 11945 3076 17473 17505 11322 12679 15450 14953
SISD 8 12775 10843 3073 17620 17722 11304 15378 12972 15053
L3 6MB 16 11478 10623 4656 17601 17164 11734 15486 15207 14894
Phenom II X4 1 6539 4328 2127 12022 11210 7326 13656 13339 12781
512 KB x 4 L2 2 12811 8750 2841 23793 22928 10622 27202 26813 25210
3000 MHz 4 16543 12395 3182 45783 43472 12134 53746 51504 42746
SIMD 8 24417 23910 3188 46269 46410 12189 53181 52662 43354
L3 6MB 16 23607 22774 3647 41679 43777 13327 52156 51361 44364
Core i7 4820K 1 3867 3853 3386 6085 6054 6017 5830 5824 5809
256 KB x 4 L2 2 7737 7731 6618 12160 12165 11991 11653 11648 11650
4 core 8 Thrd 4 15433 15459 9833 23487 24291 23886 22666 23175 23220
3900 MHz i87 8 15359 15395 9846 23554 23708 23586 23418 23464 23416
L3 10 MB 16 15145 15192 10023 23422 23536 22966 23241 23401 23282
Core i7 4820K 1 5004 4960 4192 6188 6182 6135 5890 5890 5887
256 KB x 4 L2 2 9996 10002 8049 12371 12354 12282 11770 11779 11744
4 core 8 Thrd 4 19923 18532 9866 23946 24704 24347 23219 23531 23497
3900 MHz SISD 8 19602 19776 9820 24683 24648 24634 23521 23497 23506
L3 10 MB 16 18727 19077 10073 24316 24243 24442 23469 23393 23385
Core i7 4820K 1 10116 9864 5852 24636 24436 19881 23353 23389 23243
256 KB x 4 L2 2 26453 19851 9189 49181 49223 34969 46653 46759 46414
4 core 8 Thrd 4 41845 26975 10063 85909 93852 40163 89202 90572 87329
3900 MHz SIMD 8 58734 43723 9980 97139 98446 40062 91320 93885 93125
L3 10 MB 16 57731 42194 10178 94166 93338 40074 90162 92102 93496
Core i7 4820K 1 10046 9901 5906 24629 24382 19832 23411 23361 23246
256 KB x 4 L2 2 26634 19679 9250 49194 49267 35183 46788 46788 46382
4 core 8 Thrd 4 52424 39057 10092 60266 98220 39744 90948 90611 92515
3900 MHz AVX 8 58601 43529 10032 85198 98220 40162 93810 93866 93745
L3 10 MB 16 57098 42920 10319 86267 95243 40427 92929 92995 92356
MP MFLOPS Numeric Answers
As indicated earlier, the calculated result depends on the number of calculations on each data word, closer to initial values of 0.999999 indicating fewer calculations. The number depends on Repeat passes, with the starting value of identified CPUs x 2500. For example Repeats of 10000 could be for a quad core CPU and 20000 could be for a quad core CPU with Hyperthreading.
Results via old i87 calculations are slightly different to those using SSE arithmetic.
Repeats 5000 10000 20000 40000
Version SSE i87 SSE i87 SSE i87 SSE i87
a2 0.867359 0.867238 0.764063 0.763849 0.620974 0.620631 0.481454 0.481096
b2 0.985193 0.985180 0.970753 0.970727 0.942935 0.942883 0.891302 0.891203
c2 0.998502 0.998501 0.997008 0.997006 0.994032 0.994027 0.988125 0.988114
a8 0.918220 0.918307 0.850923 0.851082 0.749971 0.750239 0.635325 0.635706
b8 0.991084 0.991095 0.982342 0.982363 0.965360 0.965401 0.933325 0.933397
c8 0.999099 0.999101 0.998200 0.998204 0.996409 0.996416 0.992853 0.992862
a32 0.798973 0.799276 0.660143 0.660653 0.498060 0.498797 0.385106 0.384777
b32 0.976383 0.976422 0.953631 0.953702 0.910573 0.910709 0.833458 0.833707
c32 0.997595 0.997602 0.995203 0.995214 0.990447 0.990463 0.981037 0.981068
Key - Words a=102400 x Repeats, b=1024000 x Repeats / 10, c=10240000 x Repeats / 100
2, 8 and 32 are Operations Per Word
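The Key can be read as a rule for the repeat passes used at each word count. The sketch below assumes only what is stated above: a starting value of identified CPUs x 2500 at 102400 words, divided by 10 and 100 for the two larger sizes. The function name is hypothetical.

```c
/* Repeat passes for a given CPU count and word count.
   words must be 102400, 1024000 or 10240000 (the three sizes tested). */
long repeat_passes(long cpus, long words)
{
    long scale = words / 102400;   /* 1, 10 or 100 */
    return cpus * 2500 / scale;
}
```

For 8 identified CPUs this gives 20000, 2000 and 200 passes, matching the Repeat Passes column of the example log in the MP MFLOPS Benchmarks section.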
MP MFLOPS Assembly Code
Following are details of assembly code for the test with 3 multiplies, 4 adds and a subtract. The test loop is mainly unrolled, with these sequences repeated multiple times. For full SSE and AVX, the main compiled loop included 16 arithmetic instructions, with totals of 8 x 8 and 16 x 8 operations. Then there were other sequences for non-multiples of 64 and 128.
The original benchmark produced the inefficient SISD instructions but the later one, that came with Visual Studio 2013, produced SIMD, but not a full implementation of AVX functions. A compilation under Linux Ubuntu 14.04 produced the expected AVX sequences, using 256 bit ymm registers and not 128 bit xmm varieties.
The GFLOPS figures shown are maximum on a 3.9 GHz CPU that can execute a multiply and add per cycle, or up to 7.8 x 4 = 31.2 GFLOPS, with four cores, under SISD. With 3 multiplies out of 8 instructions, in the test sequence, 25 GFLOPS looks quite good, also 98 GFLOPS with SIMD.
Assembly Code Sequences for x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
GFLOPS are for four Core i7-4820K running at 3.9 GHz
Four Cores 24 GFLOPS
Compiler 15.00.30729.207 for 80x86
fadd DWORD PTR _a$[esp+4]
fmul ST(0), ST(2)
fld ST(3)
faddp ST(2), ST(0)
fld ST(4)
fmulp ST(2), ST(0)
fsubrp ST(1), ST(0)
fld DWORD PTR tv1064[esp+4]
fadd ST(0), ST(5)
fmul ST(0), ST(6)
faddp ST(1), ST(0)
SSE Option - SISD SSE Option - SIMD
ss is scalar single precision ps is packed single precision
Four Cores 25 GFLOPS Four Cores 98 GFLOPS
Compiler 15.00.30729.207 for x64 Compiler 18.00.21005.1 for x64
addss xmm1, xmm8 addps xmm1, xmm10
addss xmm2, xmm4 addps xmm2, xmm6
addss xmm0, xmm6 addps xmm0, xmm8
mulss xmm1, xmm9 mulps xmm1, xmm11
mulss xmm0, xmm7 mulps xmm2, xmm7
mulss xmm2, xmm5 mulps xmm0, xmm9
subss xmm2, xmm0 subps xmm2, xmm0
addss xmm2, xmm1 addps xmm2, xmm1
AVX Option - 128 bit registers AVX Option - 256 bit registers
Four Cores 98 GFLOPS Four Cores 162 GFLOPS
Compiler 18.00.21005.1 for x64 Compiler gcc version 4.8.2
vaddps xmm1, xmm0, xmm6 vaddps ymm5, ymm6, ymm13
vaddps xmm0, xmm0, xmm8 vaddps ymm3, ymm6, ymm7
vmulps xmm2, xmm0, xmm9 vaddps ymm1, ymm6, ymm6
vmulps xmm3, xmm1, xmm7 vmulps ymm4, ymm13, ymm13
vsubps xmm4, xmm3, xmm2 vmulps ymm2, ymm7, ymm7
vaddps xmm1, xmm5, xmm10 vmulps ymm0, ymm6, ymm6
vmulps xmm0, xmm1, xmm11 vsubps ymm7, ymm13, ymm7
vaddps xmm2, xmm4, xmm0 vaddps ymm6, ymm7, ymm6
OpenMP Benchmark
The second benchmark in this series was for OpenMP. This did not have the same word counts as MP MFLOPS (divisible by 64), so numeric results are slightly different. The benchmarks included are OpenMP32MFLOPS.exe, using i387 Floating Point, OpenMP64MFLOPS.exe, with default SSE instructions, and SSE32MFLOPS.exe, not using OpenMP. Details and results are in openmp mflops.htm, with benchmarks and source codes available for download in OpenMPMflops.zip.
This benchmark uses standard C/C++ code, with a pragma directive before loops offered for parallelisation - see below, with a /openmp option added to the compile command. The original compiled using one word at a time SISD SSE instructions (see above) but automatically selected all CPU cores, confirmed by the example of using one core, via an Affinity setting, shown below. Measured performance is effectively the same as with the MP MFLOPS benchmark, using the same Core i7 based system.
The benchmark was recompiled with MS C/C++ Version 18, from Visual Studio 2013 but, unlike MP MFLOPS, did not produce full SIMD functions. Results are below, mainly demonstrating the same performance as the original program.
Results on 3900 MHz Core i7 4820K
C Code Directive
#pragma omp parallel for
for(i=0; i < n; i++)
{
x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
}
Compile Command
cl /O2 /openmp /MD /W4 /Zi /TP /EHsc /Fa /c OpenMP64MFLOPS.cpp
64 Bit OpenMP MFLOPS Benchmark 1 Wed May 28 16:06:54 2014
Via Microsoft C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.059061 8466 0.929538 Yes
Data in & out 1000000 2 250 0.039231 12745 0.992550 Yes
Data in & out 10000000 2 25 0.052317 9557 0.999250 Yes
Data in & out 100000 8 2500 0.103711 19284 0.957117 Yes
Data in & out 1000000 8 250 0.083413 23977 0.995517 Yes
Data in & out 10000000 8 25 0.083001 24096 0.999549 Yes
Data in & out 100000 32 2500 0.360390 22198 0.890211 Yes
Data in & out 1000000 32 250 0.344797 23202 0.988082 Yes
Data in & out 10000000 32 25 0.345281 23170 0.998796 Yes
Example 1 Core - Command Start /Affinity 1 OpenMP64MFLOPS
Data in & out 100000 8 2500 0.356484 5610 0.957117 Yes
Data in & out 1000000 8 250 0.329129 6077 0.995517 Yes
Data in & out 10000000 8 25 0.328513 6088 0.999549 Yes
Via Microsoft C/C++ Optimizing Compiler Version 18.00.21005.1 for x64
Data in & out 100000 2 2500 0.037910 13189 0.929538 Yes
Data in & out 1000000 2 250 0.035475 14095 0.992550 Yes
Data in & out 10000000 2 25 0.055958 8935 0.999250 Yes
Data in & out 100000 8 2500 0.086657 23079 0.957117 Yes
Data in & out 1000000 8 250 0.082454 24256 0.995517 Yes
Data in & out 10000000 8 25 0.083116 24063 0.999549 Yes
Data in & out 100000 32 2500 0.343880 23264 0.890211 Yes
Data in & out 1000000 32 250 0.358404 22321 0.988082 Yes
Data in & out 10000000 32 25 0.342454 23361 0.998796 Yes
Qpar MP Benchmark
With Visual Studio 2012, Microsoft added an "Auto-Parallelizer" to the compiler that can automatically generate multiple threads in the same way as OpenMP. This requires a /Qpar compiler option and a pragma directive, as shown below.
The OpenMP benchmark was compiled this way, via Compiler Version 18.00, supplied with Visual Studio 2013, and produced SIMD instructions, mainly providing performance gains of four times OpenMP results, similar to those in MP MFLOPS.
The program was also compiled to use AVX instructions, producing the same disappointing speeds as MP MFLOPS.
The #pragma loop directive cannot use a variable to specify the number of threads. So the program was modified to select different fixed counts of 1, 2, 4, 8, or 16 threads, with a default of 4. The new benchmark is QparMP64MFLOPS.exe, with execution and source files included in GigaFLOPS-Benchmarks.zip. This also has copies of DLL files that might be needed to run the benchmark on different systems.
Besides having a run time parameter to control the number of threads (T or Threads), others are provided for the number of words (W or Words) and repeat passes (R or Repeats) - see below.
C Code Directive
#pragma loop(hint_parallel(4))
for(i=0; i < n; i++)
{
x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f;
}
Compile Command
cl /O2 /Qpar /MD /W4 /Zi /TP /EHsc /Fa /c QparMP64MFLOPS.cpp
64 Bit Qpar MFLOPS Benchmark 1, 4 Threads, Mon Jun 23 16:27:50 2014
Via Microsoft C/C++ Optimizing Compiler Version 18.00.21005.1 for x64
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.010905 45849 0.929538 Yes
Data in & out 1000000 2 250 0.012933 38661 0.992550 Yes
Data in & out 10000000 2 25 0.052096 9598 0.999250 Yes
Data in & out 100000 8 2500 0.023440 85324 0.957117 Yes
Data in & out 1000000 8 250 0.020773 96279 0.995517 Yes
Data in & out 10000000 8 25 0.052930 37786 0.999549 Yes
Data in & out 100000 32 2500 0.087836 91079 0.890211 Yes
Data in & out 1000000 32 250 0.085585 93475 0.988082 Yes
Data in & out 10000000 32 25 0.086839 92124 0.998796 Yes
Run Time Commands
QparMP64MFLOPS.exe Txxx N1, Wxxx N2, Rxxx N3
Examples QparMP64MFLOPS.exe T 16 or QparMP64MFLOPS.exe Thrds 16
QparMP64MFLOPS.exe W 200000 or QparMP64MFLOPS.exe Wds 200000
QparMP64MFLOPS.exe T 8, W 200000, R 5000
Qpar and OpenMP Comparisons
Following are some OpenMP, Qpar and MP-MFLOPS comparisons on two quad core systems. On results not dependent on memory speed, the Core i7 Qpar and MP-MFLOPS speeds, using four CPU cores, average four times the OpenMP scores, due to the use of SIMD rather than SISD instructions. The gain is not quite as good on the Phenom.
With Qpar tests, the Phenom speeds, using eight threads, tend to be much slower than with four threads. The Core i7 degradation is not as severe and is probably helped by Hyperthreading. With MP-MFLOPS, and CPU speed limited tests, performance with eight threads tends to be four times that when using a single thread, Hyperthreading providing a higher gain on the i7.
The 10.24 million word tests, with fewer calculations, suggest that multiple threads are needed to obtain the highest RAM throughput, the i7 also being influenced by its 10 MB L3 cache.
The i7 PC has four memory channels, compared with two on the Phenom, and the former is faster. The CPU MHz ratio is 1.30, but CPU dependent speeds on the i7 average around 1.75 times those on the Phenom, rising to 2.5 times with caching effects and more than three times via RAM.
MFLOPS 1 to 16 Threads
Operations Per Word 2 2 2 8 8 8 32 32 32
Million Words 0.10 1.02 10.24 0.10 1.02 10.24 0.10 1.02 10.24
Threads
Phenom X4 1 1800 1866 1524 3585 3686 3523 3647 3653 3628
OpenMP 4 5408 7045 2964 11891 14194 11178 13494 14006 13842
Phenom 1 6003 4190 2116 11585 11203 7321 14087 13768 13151
X4 2 10999 8653 2843 23194 22732 10567 27421 27538 25513
4 Core 4 17699 15915 3125 36186 43438 11926 47664 53064 43406
3000 MHz 8 8158 3208 2948 25464 29383 10189 32648 36188 32475
Qpar 16 9937 10422 3200 34765 42399 12125 41226 53077 39954
Phenom 1 6539 4328 2127 12022 11210 7326 13656 13339 12781
4 Core 2 12811 8750 2841 23793 22928 10622 27202 26813 25210
3000 MHz 4 16543 12395 3182 45783 43472 12134 53746 51504 42746
3000 MHz 8 24417 23910 3188 46269 46410 12189 53181 52662 43354
MP-MFLOPS 16 23607 22774 3647 41679 43777 13327 52156 51361 44364
Core i7 4820K 1 3612 3802 3549 6002 6136 6100 5845 5870 5879
OpenMP 4 13189 14095 8935 23079 24256 24063 23264 22321 23361
Core i7 1 10181 9972 5842 24458 24086 19646 23497 23533 23373
4820K 2 25378 19873 9186 47674 48861 34432 46546 46983 46560
4 core 8 Thrd 4 45194 39092 9839 85928 95689 37602 90159 93022 90933
3900 MHz 8 42665 38325 9761 75672 88846 37919 88217 91233 86306
Qpar 16 18840 35358 9757 66481 90022 38735 83196 91050 87909
Core i7 1 10116 9864 5852 24636 24436 19881 23353 23389 23243
4820K 2 26453 19851 9189 49181 49223 34969 46653 46759 46414
4 core 8 Thrd 4 41845 26975 10063 85909 93852 40163 89202 90572 87329
3900 MHz 8 58734 43723 9980 97139 98446 40062 91320 93885 93125
MP-MFLOPS 16 57731 42194 10178 94166 93338 40074 90162 92102 93496
Reliability Tests
MP-MFLOPS and CUDA-MFLOPS benchmarks have a command parameter to run for extended periods for burn-in purposes or reliability verification.
The parameter specifies the run time in minutes and one test, that possibly represents the highest CPU/GPU loading, is run. With MP-MFLOPS, it is with 102400 words at 32 operations per word, where the number of threads can also be varied. CUDA-MFLOPS uses the last of the Extra Tests with 10 million words and 32 operations per word, where the default is Calculate, but an added FC (fast cache) option uses the faster Shared Memory (see example below).
The benchmarks display and log MFLOPS speeds, normally at intervals of 15 to 20 seconds, via a calibrated Repeat Passes count, and numeric results are checked for consistency. The running time and MFLOPS measurements can fall, if CPU/GPU clock speeds are automatically reduced, due to overheating or power saving.
The first entries below show the log format of the two programs, followed by a BAT file to run 10 minute tests, with MP-MFLOPS restricted to four out of eight threads on a quad core i7 4820K. The graphics card is a GeForce GTX 650.
Next are the results, with CUDA-MFLOPS running with no performance degradation, but stealing some CPU time from MP-MFLOPS to reduce that program's speed.
Temperatures were monitored during the test, using Asus AI Suite II for the CPU (case temperature) and Asus GPU Tweak for the GeForce graphics processor. There is nothing unusual about the recorded temperatures, shown below, but such measurements are useful for later comparisons, in the event of system failures.
8 CPUs Available
##############################################
64 Bit MP SSE MFLOPS Benchmark C2, 8 Threads, Tue Jul 01 10:39:20 2014
C/C++ Optimizing Compiler Version 18.00.21005.1 for x64
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 32 427524 14.963640 93621 0.352168 Yes
Data in & out 102400 32 427524 14.964366 93616 0.352168 Yes
Data in & out 102400 32 427524 14.979681 93521 0.352168 Yes
Data in & out 102400 32 427524 14.987834 93470 0.352168 Yes
##############################################
CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.31 Tue Jul 01 11:40:41 2014
Shared Memory Reliability Test 1 minutes, report every 15 seconds
Results of all calculations should be - 0.5065064430236816
Test Seconds MFLOPS Errors First Value
Word
1 14.324 429967 None found
2 14.324 429970 None found
3 14.324 429968 None found
4 14.325 429955 None found
##############################################
Run.bat 10 minute test
Start cuda3mflops-x64sp Minutes 10 FC
Start MPmflopsc2 Minutes 10, Threads 4
##############################################
CUDA
Test Seconds MFLOPS Errors
22 14.361 429554 None found
23 14.375 429133 None found
24 14.360 429598 None found
##############################################
MP-MFLOPS 4 Threads
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 32 288902 13.039905 72598 0.352168 Yes
Data in & out 102400 32 288902 13.505035 70098 0.352168 Yes
Data in & out 102400 32 288902 13.992925 67654 0.352168 Yes
Just CPU Test
Data in & out 102400 32 288902 10.430468 90760 0.352168 Yes
##############################################
Temperatures - Room 27 °C
Minutes 0 1 2 3 4 5 6 7 8 9 10 Increase Spec Max
CPU °C 39 42 46 48 49 50 50 51 52 53 53 14 66.8
GPU °C 31 49 54 56 58 59 60 60 60 60 60 29 98
Roy Longbottom July 2014
The Official Internet Home for my Benchmarks is via the link Roy Longbottom's PC Benchmark Collection.