Four Core Eight Thread Computing Benchmarks
General
Dual Core benchmark code
(see DualCore.htm)
has been modified to use eight threads, initially intended for measuring performance of four core processors with Hyperthreading, where Windows sees the system as having eight processors.
Download QuadCore.zip
for benchmark source code and EXE files at 64 bits and 32 bits.
Results below include those for a Quad Core Phenom II and a Quad Core i7 with Hyperthreading.
With one core in use, the latter processor can run at 3066 MHz using Turbo Boost, but this is reduced to 2933 MHz when more than one core is in use, or to the specified speed of 2800 MHz if hot. This behaviour makes the effects of Hyperthreading more difficult to determine.
Except for the Whetstone benchmark, which has program loops with few instructions, the test programs have long sequences of streamed data, with some using efficient assembly code. In this case, high performance gains on a quad core processor with Hyperthreading are not really expected when using more than four threads.
CPUIDMP CPU Only Benchmark
Programs CPUID8Thread32.exe and CPUID8Thread64.exe are the same programs but compiled for 32 and 64 bits. They execute three passes of simple additions to four different registers, via assembly code, attempting to demonstrate maximum CPU speeds. Firstly an integer (INT) and an SSE floating point test are run separately. They are then run as two threads, followed by 2 INT and 2 SSE, 3 INT and 3 SSE then 4 INT and 4 SSE. Further information can be found in
WhatCPU Results.htm.
The high speed operation achieved appears to leave a little room to squeeze in additional hyperthreaded instructions on the Core i7. Even using four threads, integer throughput is disappointing and, between four and eight threads, the Phenom appears to be more efficient (on this particular code).
The slow i7 speeds could be due to a reduction in Turbo Boost MHz from 3066 to 2933, where maximum gain might be 2933 / 3066 x 4 x 100 = 383%.
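The Turbo Boost ceiling quoted above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name and parameters are illustrative and are not part of the benchmark code.

```python
def max_gain_percent(single_core_mhz, multi_core_mhz, cores):
    """Best-case multi-thread gain (%) when the clock drops under load."""
    return multi_core_mhz / single_core_mhz * cores * 100


# Core i7 930: 3066 MHz with one core active, 2933 MHz with several.
turbo_limited = max_gain_percent(3066, 2933, 4)
print(round(turbo_limited))  # 383
```

A processor with no clock reduction, such as the Phenom II at a fixed 3000 MHz, gives the ideal 400% from the same formula.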
CPU Core 2 Athlon 64 Core 2 Phenom II Core i7 Core i7
MHz 1830 2211 2400 3000 #### ****
CPUs/Hyperthreads 2/0 2/0 2/0 4/0 4/4 4/4
Windows Vis32 XPx64 Vis64 Win764 Win764 Win864
Separate Tests
32 bit SSE MFLOPS 6781 4400 9222 12020 10178 15460
32 bit Integer MIPS 4556 6612 6296 9018 8611 12292
Two Threads Equal Priority
32 bit SSE MFLOPS 6777 4384 9266 12003 10176 15460
32 bit Integer MIPS 5117 6604 6740 9028 8606 12290
Four Threads, First Normal Priority, Others Normal - 1
32 bit SSE MFLOPS 6816 4363 9086 11935 11032 14611
32 bit SSE MFLOPS 0 2215 0 11986 9161 14611
32 bit Integer MIPS 2508 3232 3257 8897 6739 11916
32 bit Integer MIPS 2642 67 3671 8956 6566 11259
Total SSE MFLOPS 6816 6578 9086 23921 20193 29221
Total Integer MIPS 5150 3300 6929 17853 13305 23175
Gain % SSE 101 150 99 199 198 189
Gain % Integer 113 50 110 198 155 189
Total 353 378
Six Threads, All Normal Priority
32 bit SSE MFLOPS 2200 1439 3059 5864 8166 15021
32 bit SSE MFLOPS 2355 1450 3114 12012 9124 10050
32 bit SSE MFLOPS 2358 1488 3111 11946 6653 9695
32 bit Integer MIPS 1700 2192 2112 4452 4612 8103
32 bit Integer MIPS 1669 2163 2249 4546 4612 8802
32 bit Integer MIPS 1699 2257 2450 4519 4041 8658
Total SSE MFLOPS 6913 4376 9284 29822 23942 34766
Total Integer MIPS 5068 6612 6811 13517 13265 25563
Gain % SSE 102 99 101 248 235 225
Gain % Integer 111 100 108 150 154 208
Total 389 433
Eight Threads, All Normal Priority
32 bit SSE MFLOPS 1705 1083 2283 4077 5445 12063
32 bit SSE MFLOPS 1730 1067 2321 5867 5210 13852
32 bit SSE MFLOPS 1730 1078 2321 11982 5194 12678
32 bit SSE MFLOPS 1728 1130 2314 6141 4693 12593
32 bit Integer MIPS 1252 1630 1680 4451 4032 6604
32 bit Integer MIPS 1251 1634 1672 2973 4029 6926
32 bit Integer MIPS 1411 1639 1671 4495 4036 6582
32 bit Integer MIPS 1244 1732 1879 2968 4035 6736
Total SSE MFLOPS 6893 4358 9240 28067 20541 51187
Total Integer MIPS 5158 6635 6902 14887 16132 26847
Gain % SSE 102 99 100 234 202 331
Gain % Integer 113 100 110 165 187 218
Total 389 549
#### Core i7 930 rated at 2800 MHz but running up to 3066 MHz using Turbo Boost
**** Core i7 4820K rated at 3700 MHz but running up to 3900 MHz using Turbo Boost
Whetstone Benchmark
The Whetstone Benchmark has various routines that execute floating point and integer instructions.
Speed of individual tests is in terms of Millions of Operations Per Second (MOPS), or MFLOPS for those using simple floating point arithmetic, and an overall rating in Millions of Whetstone Instructions Per Second (MWIPS).
Programs Whets8Thread32.exe and Whets8Thread64.exe are the same programs but compiled for 32 and 64 bits.
Unlike the dual core variety, this version uses common code and equal priority for all threads to produce more consistent performance.
Results and further details can be found in
Whetstone Results.htm.
Those at 64 bits are somewhat faster due to improved optimisation.
The total (top line) results shown are calculated using a simple sum of speeds for each thread and can be distorted by threads finishing at different times. Using a harmonic mean makes little difference and the overall MWIPS rating is calculated using the sum of elapsed times of tests in all threads.
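As a rough illustration of why the choice of total makes little difference when thread speeds are similar, the sketch below compares a simple sum with a total derived from the harmonic mean. It uses the four per-thread MFLOP 1 figures from the Phenom II four thread results below; the function names are hypothetical and this is not the benchmark's own calculation.

```python
from statistics import harmonic_mean


def total_by_sum(speeds):
    """Total throughput as a simple sum of per-thread speeds."""
    return sum(speeds)


def total_by_harmonic_mean(speeds):
    """Total throughput as thread count times the harmonic mean speed."""
    return harmonic_mean(speeds) * len(speeds)


single_thread = 902                  # MFLOP 1, one thread
four_threads = [902, 903, 905, 893]  # MFLOP 1, four threads

simple = total_by_sum(four_threads)
harmonic = total_by_harmonic_mean(four_threads)
print(simple, round(harmonic, 1))
print(round(simple / single_thread * 100))  # gain % from the simple sum
```

The two totals diverge only when threads run at clearly different speeds, as in the six thread Phenom II results.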
Considering the four core Phenom results, consistent speeds are produced on all tests using two and four threads, giving performance gains of 200% and nearly 400%. It is not clear why, but average gains using six and eight threads were around 450%.
The Core i7 produces 200% gain using two threads but less than 400% with four threads, no doubt due to the Turbo Boost clock of 3066 MHz being reduced to the specification speed of 2800 MHz.
This benchmark appears to demonstrate Hyperthreading in a most favourable light, producing average gains of around 450%, using six threads, and 700% with eight threads. The main beneficiaries are the floating point tests, in this case translated to SSE code as Single Instruction Single Data (SISD not SIMD/Multiple) operations.
MWIPS MFLOP MFLOP MFLOP COS EXP FIXPT IF EQUAL
CPU MHz 1 2 3 MOPS MOPS MOPS MOPS MOPS
Phenom II Win7 3000 3115 902 739 716 69.5 49.3 2509 3008 1289
Dual Core Thread 1 902 739 716 69.5 49.3 2509 3008 1289
Phenom II Win7 3000 6229 1811 1480 1432 139 98.6 5007 6022 2578
Dual Core Thread 1 906 738 716 69.5 49.3 2508 3010 1289
Thread 2 905 741 716 69.5 49.3 2499 3012 1288
Gain % 200 201 200 200 200 200 200 200 200
Phenom II Win7 3000 12414 3603 2950 2853 277 196 9988 11992 5139
Dual Core Thread 1 902 735 714 69.1 49.2 2481 2983 1278
Thread 2 903 739 715 69.4 49.0 2501 3000 1287
Thread 3 905 739 710 69.3 49.0 2499 2999 1285
Thread 4 893 736 714 69.5 49.2 2508 3009 1288
Gain % 399 399 399 398 399 398 398 399 399
Phenom II Win7 3000 14101 4322 3550 3374 325 231 12239 14250 6019
Dual Core Thread 1 621 767 725 46.3 49.5 2655 1995 860
Thread 2 613 510 482 46.5 32.8 1722 3009 859
Thread 3 617 496 477 46.4 33.0 1741 2116 862
Thread 4 933 767 726 69.7 49.6 1725 3077 1291
Thread 5 604 505 486 46.3 32.8 2651 2043 854
Thread 6 934 506 477 69.8 33.1 1744 2011 1293
Gain % 453 479 480 471 468 469 488 474 467
Phenom II 8 Threads Similar
##################################################################################
Core i7 Win7 #### 3115 1065 886 738 79.3 39.7 2447 2936 1154
Quad Core Thread 1 1065 886 738 79.3 39.7 2447 2936 1154
Core i7 Win7 #### 6228 2130 1773 1474 159 79.4 4894 5872 2308
Quad Core Thread 1 1065 887 737 79.3 39.7 2447 2936 1154
Plus HT Thread 2 1065 886 737 79.3 39.7 2448 2936 1154
Gain % 200 200 200 200 201 200 200 200 200
Core i7 Win7 #### 12043 4243 3529 2930 302 156 9078 10207 4170
Quad Core Thread 1 1059 880 730 75.0 39.4 2102 2332 1018
Plus HT Thread 2 1064 881 733 76.9 38.7 2450 2498 1107
Thread 3 1057 881 729 74.1 38.6 2187 2439 1044
Thread 4 1063 887 738 76.4 39.0 2339 2938 1001
Gain % 387 398 398 397 381 393 371 348 361
Core i7 Win7 #### 17149 6705 5463 4426 422 224 12984 13145 4869
Quad Core Thread 1 1146 919 739 72.3 37.6 2019 1958 816
Plus HT Thread 2 1145 915 736 69.8 37.0 2044 2664 793
Thread 3 1143 916 744 71.8 37.0 2058 2083 793
Thread 4 1111 926 737 68.5 37.6 2398 2023 788
Thread 5 1097 916 742 72.2 37.8 2110 2124 827
Thread 6 1062 872 728 67.8 36.7 2355 2292 852
Gain % 551 630 617 600 532 564 531 448 422
Core i7 Win7 #### 21690 8676 7621 5844 531 291 16643 12027 5034
Quad Core Thread 1 1091 1027 728 66.4 36.5 2050 1501 629
Plus HT Thread 2 1089 1037 742 66.0 36.5 2090 1507 630
Thread 3 1090 946 742 66.8 36.5 2069 1534 631
Thread 4 1092 1037 727 66.6 36.6 2031 1501 630
Thread 5 1042 959 736 66.4 36.5 1912 1483 630
Thread 6 1091 874 723 66.6 36.1 2049 1507 629
Thread 7 1090 867 725 65.6 36.3 2094 1516 631
Thread 8 1091 874 722 66.3 36.3 2350 1476 624
Gain % 696 815 860 792 670 733 680 410 436
#### i7 930 2800 MHz running using Turbo Boost at up to 3066 MHz
##################################################################################
Core i7 Win8 $$$$ 3807 1243 1042 931 86.0 52.4 3570 5741 1543
Quad Core Thread 1 1243 1042 931 86.0 52.4 3570 5741 1543
Core i7 Win8 $$$$ 7319 2461 2104 1782 165 100 6984 11103 2953
Quad Core Thread 1 1231 1052 890 82.5 50.2 3490 5551 1476
Plus HT Thread 2 1231 1052 891 82.5 50.2 3494 5552 1477
Gain % 192 198 202 191 192 191 196 193 191
Core i7 Win8 $$$$ 14616 4931 4229 3560 329 201 13868 22200 5899
Quad Core Thread 1 1233 1058 890 82.4 50.1 3443 5551 1476
Plus HT Thread 2 1233 1058 890 82.3 50.2 3494 5552 1475
Thread 3 1232 1056 889 82.1 50.2 3438 5546 1472
Thread 4 1232 1057 890 82.2 50.2 3494 5551 1476
Gain % 384 397 406 382 383 384 388 387 382
Core i7 Win8 $$$$ 20608 7421 6345 5280 459 287 19153 23418 6721
Quad Core Thread 1 1236 1057 881 78.0 47.7 3241 3545 1087
Plus HT Thread 2 1236 1058 882 78.1 47.3 3092 4881 1149
Thread 3 1239 1058 880 75.3 48.8 3216 3378 1176
Thread 4 1240 1058 880 75.8 47.5 3246 3378 1097
Thread 5 1236 1057 880 76.0 48.5 3275 4314 1099
Thread 6 1235 1057 878 75.7 47.5 3084 3922 1113
Gain % 541 597 609 567 534 548 536 408 436
Core i7 Win8 $$$$ 26301 9876 8162 7022 582 372 24785 22207 7493
Quad Core Thread 1 1235 1006 878 72.6 46.4 3099 2776 937
Plus HT Thread 2 1234 1050 876 72.5 46.5 3094 2777 938
Thread 3 1235 1018 878 73.2 46.3 3097 2777 934
Thread 4 1235 976 877 72.9 46.4 3095 2775 937
Thread 5 1235 1034 883 72.7 46.3 3102 2775 937
Thread 6 1235 1028 881 73.0 46.7 3098 2776 938
Thread 7 1233 1017 871 72.7 46.4 3104 2775 934
Thread 8 1233 1033 877 72.8 46.7 3096 2777 938
Gain % 691 795 783 754 677 710 694 387 486
$$$$ i7 4820K 3700 MHz running using Turbo Boost at up to 3900 MHz
BusMP Maximum Data Flow Benchmark - MBytes/Second
Bus8Thread32.exe and Bus8Thread64.exe are the same programs but compiled for 32 and 64 bits. Results and further details can be found in
BusSpd2K Results.htm.
One difference is that integers for the 64 bit version are declared as 64 bits, rather than the default 32. The first results below show major performance differences between the two varieties, where performance in MBytes Per Second can be nearly twice as fast at 64 bits, indicating a processing speed limitation (64 bit integer arithmetic speed can be the same as at 32 bits).
The program starts by reading words with 32 word address increments, to identify memory bus burst reading speed, then reduces the increment to eventually read all words sequentially. Finally, a test loads data to 128 bit SSE registers. Burst reading is mainly at 64 bytes at a time, so maximum speed is likely to be 16 times the MB/second measured at 16 or 8 word increments for 32 or 64 bit numbers.
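The effect of the address increment can be pictured with a small model. This is a sketch under stated assumptions (4 byte words, 64 byte bursts and a 6 KB data block); it only counts words and bursts and is not the benchmark's assembly code. The names are hypothetical.

```python
WORD_BYTES = 4    # assumption: 32 bit words
LINE_BYTES = 64   # assumption: 64 byte burst (cache line)


def strided_pass(n_words, inc):
    """Return (words read, distinct 64 byte lines touched) for one pass."""
    indices = range(0, n_words, inc)
    lines = {(i * WORD_BYTES) // LINE_BYTES for i in indices}
    return len(indices), len(lines)


n_words = 6 * 1024 // WORD_BYTES  # 6 KB of data
for inc in (32, 16, 8, 4, 2, 1):
    words, lines = strided_pass(n_words, inc)
    print(f"Inc {inc:2d} words: {words:4d} words read, {lines:3d} lines touched")
```

With 16 word increments, every word read pulls in a different 64 byte line, so the data moved over the bus is 16 times the data counted by the program.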
On the single thread results, burst calculations suggest that the Phenom could achieve 7280 MB/second RAM speed from one CPU, similar to that obtained at 128 bits. The figure for the i7 is 11344 MB/second, higher than that achieved.
According to the specifications, maximum speeds are 21333 MB/second (at 667 MHz) for the Phenom and 17067 MB/second (at 533 MHz) for the i7. Multi-Thread tests achieve up to 15000 MB/second and nearly 14000 MB/second respectively.
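The burst estimates and specification figures above can be reproduced as follows. This is a sketch only: the function names are illustrative, and the peak calculation assumes double data rate transfers over 64 bit wide channels in a dual channel configuration, with the nominal 667 and 533 MHz clocks taken as 2000/3 and 1600/3 MHz.

```python
def burst_estimate(mb_per_sec_at_16_words, factor=16):
    """Estimated bus speed from the 32 bit, 16 word increment RAM result."""
    return mb_per_sec_at_16_words * factor


def peak_ram_mb_per_sec(bus_mhz, channels, bytes_per_transfer=8):
    """Theoretical peak: two transfers per clock (assumed DDR), 8 bytes each."""
    return bus_mhz * 2 * bytes_per_transfer * channels


print(burst_estimate(455))   # Phenom II, 32 bit RAM, Inc 16wds -> 7280
print(burst_estimate(709))   # Core i7 930, 32 bit RAM, Inc 16wds -> 11344

print(round(peak_ram_mb_per_sec(2000 / 3, 2)))  # Phenom II -> 21333
print(round(peak_ram_mb_per_sec(1600 / 3, 2)))  # Core i7 930 -> 17067
```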
Part two tables show performance and gains using 1, 2, 4, 6 and 8 threads, for all caches and RAM, using the 32 bit compilation, at Inc 32wds, Read All and 128b SSE2.
Using 4 or more threads, the Phenom achieves performance gains of 360% to 390% via L1 and L2 caches, around 320% via L3 and 200% to 250% using RAM. With the Core i7, there are only significant gains due to Hyperthreading in the 128 bit SSE L1 cache test. Here, the maximum speed is likely to be one result of 16 Bytes per CPU clock per processor, or 16 x 2800 x 4 = 179,200 MB/second. This was nearly achieved using 8 threads. On the downside, it looks as though the system was trying to use eight lots of 1.5 MB (L3 data) at the same time, forcing data to be read from RAM.
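The L1 cache ceiling for the 128 bit SSE test can be compared with the measured eight thread figure. The sketch below simply restates the arithmetic from the paragraph above, using the specified 2800 MHz clock; the function name is hypothetical.

```python
def sse_l1_ceiling_mb_per_sec(bytes_per_clock, mhz, cores):
    """Upper bound if each core loads one 16 byte register per clock."""
    return bytes_per_clock * mhz * cores


ceiling = sse_l1_ceiling_mb_per_sec(16, 2800, 4)
achieved = 170513  # Core i7 930, L1 cache, 128b SSE2, 8 threads
fraction = achieved / ceiling
print(ceiling, f"{fraction:.0%}")
```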
Single Thread Cache MHz Inc Inc Inc Inc Inc Read 128b
Results RAM 32wds 16wds 8wds 4wds 2wds All SSE2
Phenom II 32b L1 3000 10606 13543 13819 13363 13463 14219 23691
Phenom II 32b L2 3000 1496 1495 2957 5972 11352 13145 23798
Phenom II 32b L3 3000 659 751 1377 2995 5656 9562 10838
Phenom II 32b RAM 3000 439 455 894 1846 3097 5214 7302
Phenom II 64b L1 3000 20650 21652 25936 25907 26860 27037 23718
Phenom II 64b L2 3000 2922 2970 2992 5927 11859 22500 23881
Phenom II 64b L3 3000 1419 1462 1492 2908 5958 11097 11891
Phenom II 64b RAM 3000 832 877 911 1784 3676 6237 7360
Core i7 930 32b L1 **** 10303 9510 9654 9122 9134 9023 23326
Core i7 930 32b L2 **** 1996 2041 3677 5980 8009 8643 22092
Core i7 930 32b L3 **** 1948 2004 3608 5848 8074 8614 21650
Core i7 930 32b RAM **** 526 709 1350 2352 4458 7063 9485
Core i7 930 64b L1 **** 20105 18713 19136 17974 18126 17910 23345
Core i7 930 64b L2 **** 3934 3999 4076 7064 12003 15793 21923
Core i7 930 64b L3 **** 3842 3909 4028 6979 11748 15845 21848
Core i7 930 64b RAM **** 949 1048 1419 2736 4698 8812 9459
Core i7 4820 32b L1 $$$$ 15642 15642 22493 21590 21709 21375 61610
Core i7 4820 32b L2 $$$$ 2782 2904 5623 9806 17348 20363 40673
Core i7 4820 32b L3 $$$$ 2741 2821 5499 9736 16795 20679 38331
Core i7 4820 32b RAM $$$$ 644 934 1994 3842 8098 13852 15963
Core i7 4820 64b L1 $$$$ 31565 31291 31178 42042 42508 41978 61606
Core i7 4820 64b L2 $$$$ 5364 5427 5508 10779 19355 33166 37951
Core i7 4820 64b L3 $$$$ 5364 5427 5508 10779 19355 33166 37951
Core i7 4820 64b RAM $$$$ 1034 1272 1866 4023 7724 16029 15980
L1 Cache Results in MBytes/Second - 6 KB % Gain
Cache CPUs/ MHz Inc Read 128b Inc Read 128b
KB HTs 32wds All SSE2 32wds All SSE2
Phenom II 64 4/0 3000 10606 14219 23691
764 2 Threads 128 21150 28435 47423 199 200 200
4 Threads 256 40763 54630 92595 384 384 391
6 Threads 256 31624 54370 88023 298 382 372
8 Threads 256 38638 53126 85948 364 374 363
Core i7 930 32 4/4 **** 10303 9023 23326
764 2 Threads 64 20590 18031 46677 200 200 200
4 Threads 128 29499 31104 91726 286 345 393
6 Threads 128 35391 35846 137181 344 397 588
8 Threads 128 41300 39292 170513 401 435 731
Core i7 4820K 32 4/4 $$$$ 15642 21375 61610
864 2 threads 64 31284 42597 123206 200 199 200
4 Threads 128 39511 70155 238644 253 328 387
6 Threads 128 54064 88245 309920 346 413 503
8 Threads 128 62539 107411 402166 400 503 653
L2 Cache Results in MB/Second - 96 KB % Gain
Phenom II 512 4/0 3000 1496 13145 23798
2 Threads 1024 2983 26351 47336 199 200 199
4 Threads 2048 5761 51226 92184 385 390 387
6 Threads 2048 5863 48050 86055 392 366 362
8 Threads 2048 5380 48529 85650 360 369 360
Core i7 930 256 4/4 **** 1996 8643 22092
2 Threads 512 3378 17305 43722 169 200 198
4 Threads 1024 3866 26611 60836 194 308 275
6 Threads 1024 4049 33262 64866 203 385 294
8 Threads 1024 4178 37228 68711 209 431 311
Core i7 4820K 256 4/4 $$$$ 2782 20363 40673
2 threads 512 5552 40717 80597 200 200 198
4 Threads 1024 8984 74935 123866 323 368 305
6 Threads 1024 9844 83460 143356 354 410 352
8 Threads 1024 10703 98906 164050 385 486 403
L3 Cache - 1536 KB Data % Gain
Phenom II 6144 4/0 3000 659 9562 10838
2 Threads 1431 18082 22559 217 189 208
4 Threads 2222 29623 34566 337 310 319
6 Threads 2221 30682 34525 337 321 319
8 Threads 2240 31417 35148 340 329 324
Core i7 930 8192 4/4 **** 1948 8614 21650
2 Threads 3192 17141 42945 164 199 198
4 Threads 3772 30387 58809 194 353 272
6 Threads 2537 29429 43411 130 342 201
8 Threads 1060 19526 15886 54 227 73
Core i7 4820K 10240 4/4 $$$$ 2741 20679 38331
2 threads 5343 41353 76302 195 200 199
4 Threads 8369 74219 129958 305 359 339
6 Threads 7924 73640 123287 289 356 322
8 Threads 5467 60140 92112 199 291 240
RAM Results in MBytes/Second - 128 MB % Gain
Phenom II 4/0 3000 439 5214 7302
2 Threads 744 8920 12162 169 171 167
4 Threads 913 13000 14952 208 249 205
6 Threads 902 13183 15005 205 253 205
8 Threads 909 12701 14966 207 244 205
Core i7 930 4/4 **** 526 7063 9485
2 Threads 637 11883 12945 121 168 136
4 Threads 724 13600 13828 138 193 146
6 Threads 731 13572 13911 139 192 147
8 Threads 731 13750 13722 139 195 145
Core i7 4820K 4/4 $$$$ 644 13852 15963
2 threads 1135 26066 28578 176 188 179
4 Threads 1316 36384 35472 204 263 222
6 Threads 1291 36347 36784 200 262 230
8 Threads 1374 36504 36414 213 264 228
**** i7 930 2800 MHz running using Turbo Boost at up to 3066 MHz
$$$$ i7 4820K 3700 MHz Turbo Boost at up to 3900 MHz, RAM max 51.2 GB/s
RandMP Serial/Random Access Benchmark
Rand8Thread32.exe and Rand8Thread64.exe are compiled from the same program, but for 32 and 64 bits. The program uses the same code for serial and random use via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM using 1, 2, 4, 6 and 8 threads. Results (32 and 64 bit versions) and further details can be found in
RandMem Results.htm.
This benchmark uses data from the same array for all threads, but starting at different points. As with the dual core version, with RW and particularly random, flushing dedicated caches to maintain data coherency, leads to reduced performance using more than one thread. Here, speed using shared L2 or L3 cache can be faster than using L1 cache.
Results for the 32 bit version below show the total throughput of all threads based on harmonic mean.
Data sizes are, again, 6 KB for L1 cache, 96 KB for L2 cache, 1536 KB for L3 cache but 96 MB for RAM.
On the Phenom, speed on serial reading, from caches and RAM, is similar to that for BusMP Read All tests. This also applies via caches for the Core i7 but, using RAM, the data transfer speed appears to be higher than possible, most likely due to efficient caching of shared data (different data starting point probably suits 8 MB L3 cache). This i7 RAM test is the only one where Hyperthreading has a major impact.
Random reading speed via L1 cache is similar to that for serial reading but becomes progressively slower through other caches and RAM. The Core i7 is the faster from L3 cache and RAM using few threads, but the Phenom nearly catches up at 8 threads.
The i7 is clearly much the faster of the two systems on most read/write tests, but still struggles to achieve a throughput gain greater than 2.0 using more than two threads. Note that, using one thread on random read/write of L1 cache sized data, the i7 is five times faster than using multiple threads and the Phenom up to ten times faster. For the latter, using data in RAM is faster than data that could sit within L1 cache.
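The single thread advantage on random read/write of L1 cache sized data can be read straight from the tables below. The following sketch divides the one thread speed by each multi-thread speed, using the Core i7 930 and Phenom II figures; the function name is illustrative.

```python
def slowdown(single_thread, multi_thread):
    """How many times faster one thread is than each multi-thread run."""
    return [single_thread / speed for speed in multi_thread]


# Random RW, L1 cache, MBytes/second at 1 thread and at 2, 4, 6, 8 threads
i7_930 = slowdown(14375, [3027, 2780, 2901, 3297])
phenom = slowdown(7585, [1381, 752, 833, 757])

print([round(x, 1) for x in i7_930])
print([round(x, 1) for x in phenom])

# Phenom II at 4 threads: Random RW from RAM versus L1 cache sized data
phenom_ram_4, phenom_l1_4 = 1113, 752
print(phenom_ram_4 > phenom_l1_4)
```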
CPUs MBytes Per Second Using Threads Gain At Threads
/HTs 1 2 4 6 8 2 4 6 8
Serial RD
Core i7 4/8 L1 11458 22661 37039 43717 46374 2.0 3.2 3.8 4.0
930 L2 10380 20832 32853 41711 42839 2.0 3.2 4.0 4.1
#### MHz L3 8828 17743 29610 38414 40330 2.0 3.4 4.4 4.6
Win 764 RAM 4266 8712 17347 24946 25589 2.0 4.1 5.8 6.0
Serial RW
Core i7 4/8 L1 15282 13724 16240 16209 18379 0.9 1.1 1.1 1.2
930 L2 12223 18216 25326 28104 27047 1.5 2.1 2.3 2.2
#### MHz L3 10234 19266 21931 24450 26351 1.9 2.1 2.4 2.6
Win 764 RAM 4533 7656 13876 14543 13390 1.7 3.1 3.2 3.0
Random RD
Core i7 4/8 L1 11266 22548 38174 45592 47141 2.0 3.4 4.0 4.2
930 L2 6233 12463 20059 24986 25667 2.0 3.2 4.0 4.1
#### MHz L3 3499 6915 9211 10002 9531 2.0 2.6 2.9 2.7
Win 764 RAM 459 909 1241 1398 1364 2.0 2.7 3.0 3.0
Random RW
Core i7 4/8 L1 14375 3027 2780 2901 3297 0.2 0.2 0.2 0.2
930 L2 5887 4555 6117 6693 7281 0.8 1.0 1.1 1.2
#### MHz L3 3104 4604 4721 5047 4933 1.5 1.5 1.6 1.6
Win 764 RAM 428 860 899 948 1026 2.0 2.1 2.2 2.4
#### 2.8 GHz running at up to 3.06 GHz via Turbo Boost, dual channel 1066 MHz DDR3 RAM
##################################################################################
CPUs Number Of Threads Gain At Threads
/HTs 1 2 4 6 8 2 4 6 8
Serial RD
Core i7 4/8 L1 28442 57130 114198 114435 107457 2.0 4.0 4.0 3.8
4820K L2 20531 41075 82142 87468 92156 2.0 4.0 4.3 4.5
$$$$ MHz L3 17015 34734 69551 77040 81525 2.0 4.1 4.5 4.8
Win 8.1 RAM 6004 12438 25044 38420 42316 2.1 4.2 6.4 7.0
Serial RW
Core i7 4/8 L1 30091 21439 20928 24068 28856 0.7 0.7 0.8 1.0
4820K L2 22100 20942 38196 48821 53497 0.9 1.7 2.2 2.4
$$$$ MHz L3 17341 33271 65558 60361 73659 1.9 3.8 3.5 4.2
Win 8.1 RAM 10680 21454 42836 50906 53162 2.0 4.0 4.8 5.0
Random RD
Core i7 4/8 L1 27862 55813 111471 111534 104011 2.0 4.0 4.0 3.7
4820K L2 13514 27231 54374 54880 59899 2.0 4.0 4.1 4.4
$$$$ MHz L3 5557 11141 20900 21977 14510 2.0 3.8 4.0 2.6
Win 8.1 RAM 627 1238 2472 2533 2479 2.0 3.9 4.0 4.0
Random RW
Core i7 4/8 L1 29930 3734 3215 4134 5002 0.1 0.1 0.1 0.2
4820K L2 9374 5108 8194 8510 9159 0.5 0.9 0.9 1.0
$$$$ MHz L3 4759 7101 12497 13962 13291 1.5 2.6 2.9 2.8
Win 8.1 RAM 588 1256 2496 2526 2521 2.1 4.2 4.3 4.3
$$$$ 3.7 GHz running at up to 3.9 GHz via Turbo Boost, quad channel 1600 MHz DDR3 RAM
RAM max throughput 51.2 GB/second
##################################################################################
CPUs MBytes Per Second Using Threads Gain At Threads
/HTs 1 2 4 6 8 2 4 6 8
Serial RD
Phenom II 4/0 L1 15212 29350 58904 58896 54909 1.9 3.9 3.9 3.6
3000 MHz L2 12236 24767 49039 50798 47318 2.0 4.0 4.2 3.9
Win 764 L3 8148 16402 30391 33436 32457 2.0 3.7 4.1 4.0
1333 MHz DDR3 RAM 3917 6983 11299 12484 12002 1.8 2.9 3.2 3.1
Serial RW
Phenom II 4/0 L1 7741 5100 5750 6598 6517 0.7 0.7 0.9 0.8
3000 MHz L2 7998 5906 7479 8466 8345 0.7 0.9 1.1 1.0
Win 764 L3 7132 13142 7489 8566 8582 1.8 1.1 1.2 1.2
1333 MHz DDR3 RAM 3589 5981 8576 7913 7802 1.7 2.4 2.2 2.2
Random RD
Phenom II 4/0 L1 14367 27877 56817 55300 54129 1.9 4.0 3.8 3.8
3000 MHz L2 7250 14355 28436 29723 27962 2.0 3.9 4.1 3.9
Win 764 L3 1560 3419 6641 7403 7410 2.2 4.3 4.7 4.8
1333 MHz DDR3 RAM 339 679 1140 1336 1242 2.0 3.4 3.9 3.7
Random RW
Phenom II 4/0 L1 7585 1381 752 833 757 0.2 0.1 0.1 0.1
3000 MHz L2 5985 1624 1230 1387 1245 0.3 0.2 0.2 0.2
Win 764 L3 1505 1724 1377 1545 1572 1.1 0.9 1.0 1.0
1333 MHz DDR3 RAM 313 634 1113 1157 1153 2.0 3.6 3.7 3.7
OpenMP, MP-MFLOPS, QPAR Benchmark MFLOPS
OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers.
The benchmark executes the same functions, using the same data sizes, as the
CUDA Graphics GPU Parallel Computing Benchmark,
with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code (OpenMP32MLOPS.exe and OpenMP64MLOPS.exe).
A run time Affinity option is available to execute the benchmark on a selected single processor.
These benchmarks and a non-OpenMP SSE version (SSE32MFLOPS.exe) can be downloaded via
OpenMPMflops.zip.
Results and further details can be found in
OpenMP MFLOPS.htm.
The benchmark demonstrates that OpenMP can make use of four CPUs, but gains little extra from Hyperthreading on the Core i7.
Each test reads 1000 MB and writes 1000 MB where at least the largest data size of 10M words will be from/to RAM and could be limited by memory speed with 2 floating point operations per word. Two example calculations of MB/second are shown below.
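The relationship between elapsed time, MB/second and MFLOPS can be sketched as below, using the two Core i7 930 run times quoted in the table notes (0.163 and 0.455 seconds). The divisor of 8 assumes each 4 byte word is read once and written once; the names are hypothetical and the results differ slightly from the table entries because the quoted times are rounded.

```python
MB_MOVED = 2000           # 1000 MB read plus 1000 MB written per test
BYTES_MOVED_PER_WORD = 8  # assumption: 4 byte word read and written back


def mb_per_sec(seconds):
    """Data transfer rate for one test."""
    return MB_MOVED / seconds


def mflops(seconds, ops_per_word):
    """Floating point rate implied by the transfer rate."""
    return mb_per_sec(seconds) * ops_per_word / BYTES_MOVED_PER_WORD


print(round(mb_per_sec(0.163)), round(mflops(0.163, 2)))   # 10M words, 2 ops/word
print(round(mb_per_sec(0.455)), round(mflops(0.455, 32)))  # 10M words, 32 ops/word
```

At 2 operations per word the test is moving over 12000 MB/second, close to the RAM limits measured by BusMP, whereas at 32 operations per word the data rate is far lower and the CPU is the constraint.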
October 2014 additions are for a second Core i7, with a higher speed CPU than the earlier one, much faster memory via four channels, but slower graphics, when executing CUDA calculations with no data communication with the CPUs. The main additions are for MP MFLOPS and QPAR comparative benchmarks, where full details and source code links are in
GigaFLOPS Benchmarks.htm. OpenMP uses simple compiler directives to produce automatic multiprocessing. MP MFLOPS uses identical calculations but with additional C code to organise threads. QPAR is Microsoft’s proprietary equivalent of OpenMP, just requiring a different compiler parameter.
As with other floating point benchmarks, produced via the same compiler version, OpenMP SSE benchmark did not implement SIMD functions for four simultaneous floating point calculations, but used SSE SISD instructions operating on one variable. OpenMP overheads are also high and the non-OpenMP SSE programs can be faster.
Using the same compiler with MP MFLOPS produced programs with lower overheads than OpenMP but similar maximum speeds using 8 threads.
OpenMP and MP MFLOPS were reproduced using the compiler that came with Microsoft Visual Studio 2013. Performance of the former did not improve, but full SIMD instructions were produced for MP MFLOPS (New SSE). Note that maximum GFLOPS is GHz x 4 (SIMD) x 2 (multiply and add) x cores = 3.9 x 4 x 2 x 4 = 124.8 and up to 98.4 GFLOPS was obtained.
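The peak figure can be checked, and the measured result expressed as a fraction of it, with the short sketch below. The function and parameter names are illustrative only.

```python
def peak_sp_gflops(ghz, simd_width, ops_per_cycle, cores):
    """Theoretical single precision peak: clock x SIMD lanes x ops x cores."""
    return ghz * simd_width * ops_per_cycle * cores


peak = peak_sp_gflops(3.9, 4, 2, 4)  # Core i7 4820K at Turbo Boost speed
achieved = 98.4                      # best MP MFLOPS New SSE result, GFLOPS
print(round(peak, 1), f"{achieved / peak:.0%}")
```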
This processor provides AVX instructions, which were produced using a compile option but, unlike a Linux GCC compilation, the result did not demonstrate speeds potentially twice as fast as the SSE version. However, the program compiled with the QPAR directive provided performance as good as might be expected (up to 91.2 GFLOPS).
Core i7 4820K 3.7 GHz running up to 3.9 GHz via Turbo Boost
CUDA CUDA
Data Ops/ Repeat SSE i387 i387 SSE 64b SSE 64b GeForce No I/O
Words Word Passes 1 CPU 1 CPU 4/8 CPU 1 CPU 4/8 CPU GTX650 GTX650
100000 2 2500 5172 2155 6705 2963 6690 459 3449
1000000 2 250 5126 2534 9710 3560 9694 893 8806
10000000 2 25 4306 2507 9364 3402 9397 980 10530
100000 8 2500 6170 4146 14613 5605 14591 2375 13056
1000000 8 250 6163 4460 17569 6077 17622 3545 34014
10000000 8 25 6130 4453 17885 6088 17887 3896 40905
100000 32 2500 5848 5124 21243 5745 21188 9183 43975
1000000 32 250 5841 5203 22372 5863 22839 14006 120972
10000000 32 25 5847 5218 22350 5860 22391 15499 147100
Maximum Gain 430% 390%
Core i7 4820K 3.7 GHz MP MFLOPS and QPAR
Data Ops/ Repeat i387 SSE 64 New SSE New SSE AVX QPAR QPAR
Words Word Passes 8 Thrd 8 Thrd 1 Thrd 8 Thrd 8 Thrd 1 Thrd 8 Thrd
100000 2 2500 15359 19602 10116 58734 58601 10181 42665
1000000 2 250 15395 19776 9864 43723 43529 9972 38325
10000000 2 25 9846 9820 5852 9980 10032 5842 9761
100000 8 2500 23554 24683 24636 97139 85198 24458 75672
1000000 8 250 23708 24648 24436 98446 98220 24086 88846
10000000 8 25 23586 24634 19881 40062 40162 19646 37919
100000 32 2500 23418 23521 23353 91320 93810 23497 88217
1000000 32 250 23464 23497 23389 93885 93866 23533 91233
10000000 32 25 23416 23506 23243 93125 93745 23373 86306
############################################################################
Core i7 930 2.8 GHz running at up to 3.06 GHz via Turbo Boost
Windows 7 64
CUDA CUDA
Data Ops/ Repeat SSE i387 i387 SSE 64b SSE 64b GeForce No I/O
Words Word Passes 1 CPU 1 CPU 4/8 CPU 1 CPU 4/8 CPU GTX480 GTX480
100000 2 2500 3567 1248 4455 1574 4001 521 5554
1000000 2 250 3529 1420 5433 1861 4919 819 21493
10000000 2 25 2388 1364 3038 1735 3076xx 1014 31991
100000 8 2500 4655 2337 8798 3794 14581 2058 20129
1000000 8 250 4642 2413 9813 4149 17080 3306 82132
10000000 8 25 4453 2436 9581 4011 12457 4057 125413
100000 32 2500 3328 2957 12020 4324 16786 7768 52230
1000000 32 250 3329 3011 12339 4436 17599 13190 254306
10000000 32 25 3307 3003 12432 4418 17576yy 16077 425237
Maximum Gain 414% 412%
xx in 0.163 seconds - MB/Second = 2000 / 0.163 = 12270 (x 2/8 for MFLOPS)
yy in 0.455 seconds - MB/Second = 2000 / 0.455 = 4396 (x 32/8 for MFLOPS)
############################################################################
Phenom II X4 3.0 GHz, Windows 7 64
CUDA CUDA
Data Ops/ Repeat SSE i387 i387 SSE 64b SSE 64b GeForce No I/O
Words Word Passes 1 CPU 1 CPU 4 CPU 1 CPU 4 CPU GTS250 GTS250
100000 2 2500 3552 1920 5587 1822 5613 328 3054
1000000 2 250 3268 1919 5585 1870 7056 625 9672
10000000 2 25 1861 1625 2993 1563 2972 714 13038
100000 8 2500 4535 2115 7763 3637 12653 1336 12233
1000000 8 250 4341 2108 7975 3709 14518 2382 39481
10000000 8 25 4141 2100 8062 3543 11273 2949 51199
100000 32 2500 4012 2566 9675 3652 14092 5142 36080
1000000 32 250 3981 2552 10091 3663 14510 9427 108170
10000000 32 25 3941 2510 9902 3633 14034 11182 135041
Maximum Gain 395% 396%
Multiple Tasks
Multitasking tests were run on the Core i7 using IntBurn64.exe and SSEBurn64.exe which are described in
BurnIn64.htm and
BurnIn4CPU.htm.
The benchmark and source code are in
More64bit.zip.
Tests run were one copy each of the Integer and SSE floating point programs, four concurrent copies of the integer test and four copies of both integer and SSE programs at the same time. Test durations were one minute each and results showed that all multitasking tests started and finished within the same clock time second.
Each test used L1 cache size data of 8 KB. The SSE tests used the Cache Test option, normally the fastest.
Single test results show that the integer test produces around one 64 bit result per clock cycle, and the SSE test four 32 bit (128 bits) floating point results per cycle. As might be expected, the higher Turbo Boost CPU clock frequency with one CPU in use means that four concurrent integer tests do not achieve a 400% performance level. However, running all eight programs together, with Hyperthreading, increases this to between 428% and 450%.
1 Test ----- 4 Concurrent Tests ---- Total Gain
Int Write/Read MB/sec 14195 13955 13902 13879 13905 55641 392%
Int Read MB/sec 20267 20206 20191 20179 20169 80746 398%
Int Write/Read MB/sec 8127 8756 8345 8414 33641 237%
Int Read MB/sec 10914 10794 10790 11049 43547 215%
SSE Calculate MFLOPS 11743 6231 6119 6144 6517 25011 213%
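The gains in the table above, and the combined 428% to 450% figures, follow from summing the concurrent results and dividing by the single test result. The sketch below repeats that arithmetic with the table values; the function name is hypothetical.

```python
def gain_percent(single, concurrent):
    """Total of concurrent results as a percentage of the single test."""
    return sum(concurrent) / single * 100


# Four integer programs only
int_wr_alone = gain_percent(14195, [13955, 13902, 13879, 13905])

# Four integer plus four SSE programs running together
int_wr_mixed = gain_percent(14195, [8127, 8756, 8345, 8414])
int_rd_mixed = gain_percent(20267, [10914, 10794, 10790, 11049])
sse_mixed = gain_percent(11743, [6231, 6119, 6144, 6517])

print(round(int_wr_alone))                  # integer tests alone
print(round(int_rd_mixed + sse_mixed))      # lower combined gain
print(round(int_wr_mixed + sse_mixed))      # higher combined gain
```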
Roy Longbottom October 2014
The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection