
Four Core Eight Thread Computing Benchmarks

Contents

CPU Only Benchmark, Whetstone Benchmark, Maximum Data Flow Benchmark,
Serial/Random Access Benchmark, OpenMP Benchmark, Multiple Tasks

General

Dual Core benchmark code (see DualCore.htm) has been modified to use eight threads, initially intended for measuring performance of four core processors with Hyperthreading, where Windows sees the system as having eight processors. Download QuadCore.zip for benchmark source code and EXE files at 64 bits and 32 bits. Results below include those for a Quad Core Phenom II and a Quad Core i7 with Hyperthreading. With one core in use, the latter processor can run at 3066 MHz using Turbo Boost, but this is reduced to 2933 MHz when more than one core is in use, or to the specified speed of 2800 MHz if hot. This behaviour makes the effects of Hyperthreading more difficult to determine.

Except for the Whetstone benchmark, which has program loops with few instructions, the test programs have long sequences of streamed data, with some using efficient assembly code. In this case, high performance gains on a quad core processor with Hyperthreading are not really expected when using more than four threads.



CPUIDMP CPU Only Benchmark

Programs CPUID8Thread32.exe and CPUID8Thread64.exe are the same programs but compiled for 32 and 64 bits. They execute three passes of simple additions to four different registers, via assembly code, attempting to demonstrate maximum CPU speeds. Firstly an integer (INT) and an SSE floating point test are run separately. They are then run as two threads, followed by 2 INT and 2 SSE, 3 INT and 3 SSE then 4 INT and 4 SSE. Further information can be found in WhatCPU Results.htm.

The high speed operation achieved appears to leave a little room to squeeze in additional hyperthreaded instructions on the Core i7. Even using four threads, integer throughput is disappointing and, between four and eight threads, the Phenom appears to be more efficient (on this particular code). The slow i7 speeds could be due to a reduction in Turbo Boost MHz from 3066 to 2933, where maximum gain might be 2933 / 3066 x 4 x 100 = 383%.
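The 383% ceiling quoted above is simple clock arithmetic. A minimal sketch, assuming throughput scales linearly with clock frequency and core count (the function name is illustrative only, not part of the benchmark):

```python
def max_gain_percent(single_core_mhz, multi_core_mhz, cores):
    """Upper bound on multi-thread gain when the clock drops under load.

    Assumption: throughput is proportional to clock speed times core count.
    """
    return multi_core_mhz / single_core_mhz * cores * 100


# Core i7 930: 3066 MHz with one core busy, 2933 MHz with several busy.
i7_limit = max_gain_percent(3066, 2933, 4)
print(f"Maximum expected gain: {i7_limit:.0f}%")
```

On this reckoning a four core processor with a fixed clock, such as the Phenom II at 3000 MHz, has a ceiling of 400%.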


   CPU                        Core 2 Athlon 64    Core 2 Phenom II   Core i7
   MHz                          1830      2211      2400      3000      ####
   CPUs/Hyperthreads             2/0       2/0       2/0       4/0       4/4
   Windows                     Vis32     XPx64     Vis64    Win764    Win764

   Separate Tests
   32 bit SSE   MFLOPS          6781      4400      9222     12020     10178
   32 bit Integer MIPS          4556      6612      6296      9018      8611


   Two Threads Equal Priority
   32 bit SSE   MFLOPS          6777      4384      9266     12003     10176
   32 bit Integer MIPS          5117      6604      6740      9028      8606


   Four Threads, First Normal Priority, Others Normal - 1
   32 bit SSE   MFLOPS          6816      4363      9086     11935     11032
   32 bit SSE   MFLOPS             0      2215         0     11986      9161
   32 bit Integer MIPS          2508      3232      3257      8897      6739
   32 bit Integer MIPS          2642        67      3671      8956      6566

   Total  SSE   MFLOPS          6816      6578      9086     23921     20193
   Total  Integer MIPS          5150      3300      6929     17853     13305

   Gain % SSE                    101       150        99       199       198
   Gain % Integer                113        50       110       198       155
                                                                   Total 353

   Six Threads, All Normal Priority
   32 bit SSE   MFLOPS          2200      1439      3059      5864      8166
   32 bit SSE   MFLOPS          2355      1450      3114     12012      9124
   32 bit SSE   MFLOPS          2358      1488      3111     11946      6653
   32 bit Integer MIPS          1700      2192      2112      4452      4612
   32 bit Integer MIPS          1669      2163      2249      4546      4612
   32 bit Integer MIPS          1699      2257      2450      4519      4041

   Total  SSE   MFLOPS          6913      4376      9284     29822     23942
   Total  Integer MIPS          5068      6612      6811     13517     13265

   Gain % SSE                    102        99       101       248       235
   Gain % Integer                111       100       108       150       154
                                                                   Total 389

   Eight Threads, All Normal Priority
   32 bit SSE   MFLOPS          1705      1083      2283      4077      5445
   32 bit SSE   MFLOPS          1730      1067      2321      5867      5210
   32 bit SSE   MFLOPS          1730      1078      2321     11982      5194
   32 bit SSE   MFLOPS          1728      1130      2314      6141      4693
   32 bit Integer MIPS          1252      1630      1680      4451      4032
   32 bit Integer MIPS          1251      1634      1672      2973      4029
   32 bit Integer MIPS          1411      1639      1671      4495      4036
   32 bit Integer MIPS          1244      1732      1879      2968      4035

   Total  SSE   MFLOPS          6893      4358      9240     28067     20541
   Total  Integer MIPS          5158      6635      6902     14887     16132

   Gain % SSE                    102        99       100       234       202
   Gain % Integer                113       100       110       165       187
                                                                   Total 389

 #### Core i7 930 rated at 2800 MHz but running up to 3066 MHz using Turbo Boost
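The Gain % rows appear to be the total throughput of all threads of one type divided by the speed of the corresponding separate test, with the Total figure being the sum of the SSE and integer gains. A short sketch reproducing the Core i7 four thread figures on that assumption (helper names are illustrative):

```python
def gain_percent(total, single):
    """Total multi-thread throughput as a percentage of the single test speed."""
    return round(total / single * 100)


# Core i7 figures from the table: separate tests, then four thread totals.
single = {"SSE MFLOPS": 10178, "Integer MIPS": 8611}
four_threads = {"SSE MFLOPS": 20193, "Integer MIPS": 13305}

gains = {name: gain_percent(four_threads[name], single[name]) for name in single}
combined = sum(gains.values())
print(gains, "Total", combined)
```

With two SSE and two integer threads on four cores, the ideal would be 200% each, or 400% in total.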



Whetstone Benchmark

The Whetstone Benchmark has various routines that execute floating point and integer instructions. Speed of individual tests is in terms of Millions of Operations Per Second (MOPS), or MFLOPS for those using simple floating point arithmetic, and an overall rating in Millions of Whetstone Instructions Per Second (MWIPS). Programs Whets8Thread32.exe and Whets8Thread64.exe are the same programs but compiled for 32 and 64 bits. Unlike the dual core variety, this version uses common code and equal priority for all threads to produce more consistent performance. Results and further details can be found in Whetstone Results.htm. Those at 64 bits are somewhat faster due to improved optimisation.

The total (top line) results shown are calculated using a simple sum of speeds for each thread and can be distorted by threads finishing at different times. Using a harmonic mean makes little difference and the overall MWIPS rating is calculated using the sum of elapsed times of tests in all threads. Considering the four core Phenom results, consistent speeds are produced on all tests using two and four threads, giving performance gains of 200% and nearly 400%. It is not clear why, but average gains using six and eight threads were around 450%.
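To illustrate why the choice of average matters little when threads run at similar speeds, the sketch below compares a simple sum with a harmonic mean based total, using the four thread Phenom II MFLOP 1 speeds from the results table in this section. Scaling the harmonic mean by the thread count is one reading of the method, not necessarily what the benchmark does.

```python
from statistics import harmonic_mean

# Phenom II, four threads, MFLOP 1 speed of each thread (from the results table).
thread_speeds = [902, 903, 905, 893]
single_thread = 902  # one thread MFLOP 1 result

simple_total = sum(thread_speeds)
# Assumed interpretation: harmonic mean speed times the number of threads.
harmonic_total = harmonic_mean(thread_speeds) * len(thread_speeds)

print(f"Sum of speeds:      {simple_total}")
print(f"Harmonic mean x 4:  {harmonic_total:.1f}")
print(f"Gain from sum:      {simple_total / single_thread * 100:.0f}%")
```

The two totals differ by less than 1 MFLOPS here. A larger difference would only be expected when thread speeds vary widely, as in the six thread Phenom II results.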

The Core i7 produces 200% gain using two threads but less than 400% with four threads, no doubt due to the Turbo Boost clock of 3066 MHz being reduced to the specification speed of 2800 MHz. This benchmark appears to demonstrate Hyperthreading in a most favourable light, producing average gains of around 450%, using six threads, and 700% with eight threads. The main beneficiaries are the floating point tests, in this case translated to SSE code as Single Instruction Single Data (SISD not SIMD/Multiple) operations.

                             MWIPS  MFLOP  MFLOP  MFLOP    COS    EXP  FIXPT     IF  EQUAL
  CPU                   MHz             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

  Phenom II Win7       3000   3115    902    739    716   69.5   49.3   2509   3008   1289
  Dual Core  Thread 1                 902    739    716   69.5   49.3   2509   3008   1289

  Phenom II Win7       3000   6229   1811   1480   1432    139   98.6   5007   6022   2578
  Dual Core  Thread 1                 906    738    716   69.5   49.3   2508   3010   1289
             Thread 2                 905    741    716   69.5   49.3   2499   3012   1288
  Gain %                       200    201    200    200    200    200    200    200    200

  Phenom II Win7       3000  12414   3603   2950   2853    277    196   9988  11992   5139
  Dual Core  Thread 1                 902    735    714   69.1   49.2   2481   2983   1278
             Thread 2                 903    739    715   69.4   49.0   2501   3000   1287
             Thread 3                 905    739    710   69.3   49.0   2499   2999   1285
             Thread 4                 893    736    714   69.5   49.2   2508   3009   1288
  Gain %                       399    399    399    398    399    398    398    399    399

  Phenom II Win7       3000  14101   4322   3550   3374    325    231  12239  14250   6019
  Dual Core  Thread 1                 621    767    725   46.3   49.5   2655   1995    860
             Thread 2                 613    510    482   46.5   32.8   1722   3009    859
             Thread 3                 617    496    477   46.4   33.0   1741   2116    862
             Thread 4                 933    767    726   69.7   49.6   1725   3077   1291
             Thread 5                 604    505    486   46.3   32.8   2651   2043    854
             Thread 6                 934    506    477   69.8   33.1   1744   2011   1293
  Gain %                       453    479    480    471    468    469    488    474    467

  Phenom II 8 Threads Similar

  Core i7 Win7         ####   3115   1065    886    738   79.3   39.7   2447   2936   1154
  Quad Core  Thread 1                1065    886    738   79.3   39.7   2447   2936   1154

  Core i7 Win7         ####   6228   2130   1773   1474    159   79.4   4894   5872   2308
  Quad Core  Thread 1                1065    887    737   79.3   39.7   2447   2936   1154
  Plus HT    Thread 2                1065    886    737   79.3   39.7   2448   2936   1154
  Gain %                       200    200    200    200    201    200    200    200    200

  Core i7 Win7         ####  12043   4243   3529   2930    302    156   9078  10207   4170
  Quad Core  Thread 1                1059    880    730   75.0   39.4   2102   2332   1018
  Plus HT    Thread 2                1064    881    733   76.9   38.7   2450   2498   1107
             Thread 3                1057    881    729   74.1   38.6   2187   2439   1044
             Thread 4                1063    887    738   76.4   39.0   2339   2938   1001
  Gain %                       387    398    398    397    381    393    371    348    361

  Core i7 Win7         ####  17149   6705   5463   4426    422    224  12984  13145   4869
  Quad Core  Thread 1                1146    919    739   72.3   37.6   2019   1958    816
  Plus HT    Thread 2                1145    915    736   69.8   37.0   2044   2664    793
             Thread 3                1143    916    744   71.8   37.0   2058   2083    793
             Thread 4                1111    926    737   68.5   37.6   2398   2023    788
             Thread 5                1097    916    742   72.2   37.8   2110   2124    827
             Thread 6                1062    872    728   67.8   36.7   2355   2292    852
  Gain %                       551    630    617    600    532    564    531    448    422

  Core i7 Win7         ####  21690   8676   7621   5844    531    291  16643  12027   5034
  Quad Core  Thread 1                1091   1027    728   66.4   36.5   2050   1501    629
  Plus HT    Thread 2                1089   1037    742   66.0   36.5   2090   1507    630
             Thread 3                1090    946    742   66.8   36.5   2069   1534    631
             Thread 4                1092   1037    727   66.6   36.6   2031   1501    630
             Thread 5                1042    959    736   66.4   36.5   1912   1483    630
             Thread 6                1091    874    723   66.6   36.1   2049   1507    629
             Thread 7                1090    867    725   65.6   36.3   2094   1516    631
             Thread 8                1091    874    722   66.3   36.3   2350   1476    624
  Gain %                       696    815    860    792    670    733    680    410    436

  #### i7 930 2800 MHz running using Turbo Boost at up to 3066 MHz



BusMP Maximum Data Flow Benchmark - MBytes/Second

Bus8Thread32.exe and Bus8Thread64.exe are the same programs but compiled for 32 and 64 bits. Results and further details can be found in BusSpd2K Results.htm. One difference is that integers for the 64 bit version are declared as 64 bits, rather than the default 32. The first results below show major performance differences between the two varieties, where performance in MBytes Per Second can be nearly twice as fast at 64 bits, indicating a processing speed limitation (64 bit integer arithmetic speed can be the same as at 32 bits).

The program starts by reading words with 32 word address increments, to identify memory bus burst reading speed, then reduces the increment to eventually read all words sequentially. Finally, a test loads data to 128 bit SSE registers. Burst reading is mainly at 64 bytes at a time, so maximum speed is likely to be 16 times the MB/second measured at 16 or 8 word increments, for 32 or 64 bit numbers. On the single thread results, burst calculations suggest that the Phenom could achieve 7280 MB/second RAM speed from one CPU, similar to that obtained at 128 bits. The figure for the i7 is 11344 MB/second, higher than that achieved. According to the specifications, maximum speeds are 21333 MB/second (at 667 MHz) for the Phenom and 17067 MB/second (at 533 MHz) for the i7. Multi-thread tests achieve up to 15000 MB/second and nearly 14000 MB/second respectively.
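The burst and specification figures above can be reproduced with a few lines of arithmetic. This is a sketch only: the burst estimate uses the 32 bit single thread RAM results at Inc 16wds, and the specification speeds assume dual channel DDR3 with 8 byte wide channels and nominal 667 and 533 MHz clocks of exactly 2000/3 and 1600/3 MHz. The function names are illustrative.

```python
BURST_BYTES = 64  # burst size given in the text
WORD_BYTES = 4    # 32 bit compilation


def burst_estimate(sparse_mb_per_sec):
    """Estimate full burst MB/second when only one word per burst is counted."""
    words_per_burst = BURST_BYTES // WORD_BYTES  # 16 words of 4 bytes
    return sparse_mb_per_sec * words_per_burst


def ddr_peak_mb_per_sec(bus_mhz, channels=2, channel_bytes=8):
    """Theoretical peak: two transfers per clock on each channel.

    Dual channel operation and 8 byte channels are assumptions here.
    """
    return bus_mhz * 2 * channel_bytes * channels


# 32 bit single thread RAM results at Inc 16wds, from the first table.
phenom_burst = burst_estimate(455)
i7_burst = burst_estimate(709)

phenom_peak = ddr_peak_mb_per_sec(2000 / 3)
i7_peak = ddr_peak_mb_per_sec(1600 / 3)

print(phenom_burst, i7_burst)
print(round(phenom_peak), round(i7_peak))
```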

The second set of tables shows performance and gains using 1, 2, 4, 6 and 8 threads, for all caches and RAM, using the 32 bit compilation, at Inc 32wds, Read All and 128b SSE2. Using 4 or more threads, the Phenom achieves performance gains of 360% to 390% via L1 and L2 caches, around 320% via L3 and 200% to 250% using RAM. With the Core i7, the only significant gains due to Hyperthreading are in the 128 bit SSE L1 cache test. Here, the maximum speed is likely to be one result of 16 bytes per CPU clock per core, or 16 x 2800 x 4 = 179,200 MB/second. This was nearly achieved using 8 threads. On the downside, in the L3 cache tests, it looks as though the system was trying to use eight lots of 1.5 MB (L3 data) at the same time, forcing data to be read from RAM.
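Both Core i7 observations can be checked numerically. The sketch below assumes one 16 byte SSE load per clock per core at the rated 2800 MHz, and that each thread in the L3 test works on its own 1536 KB of data, which is what the text suggests rather than something confirmed from the source code.

```python
SSE_BYTES_PER_CLOCK = 16  # one 128 bit load per clock, per core
RATED_MHZ = 2800          # Core i7 930 specification speed
CORES = 4

peak_l1_sse = SSE_BYTES_PER_CLOCK * RATED_MHZ * CORES  # MB/second
measured_8_threads = 170513  # L1 cache, 128b SSE2, 8 threads (from the table)

# L3 test: assumed separate 1536 KB of data for each of 8 threads.
L3_CACHE_KB = 8192
working_set_kb = 8 * 1536

print(f"Peak {peak_l1_sse} MB/s, measured {measured_8_threads / peak_l1_sse:.0%} of peak")
print(f"Working set {working_set_kb} KB exceeds L3: {working_set_kb > L3_CACHE_KB}")
```

On these assumptions the eight thread L3 test needs 12 MB, half as much again as the 8 MB cache, which is consistent with the sharp drop in the 6 and 8 thread L3 results.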

 
  Single Thread    Cache   MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
  Results           RAM          32wds   16wds    8wds    4wds    2wds     All    SSE2
 
  Phenom II    32b   L1   3000   10606   13543   13819   13363   13463   14219   23691
  Phenom II    32b   L2   3000    1496    1495    2957    5972   11352   13145   23798
  Phenom II    32b   L3   3000     659     751    1377    2995    5656    9562   10838
  Phenom II    32b  RAM   3000     439     455     894    1846    3097    5214    7302

  Phenom II    64b   L1   3000   20650   21652   25936   25907   26860   27037   23718
  Phenom II    64b   L2   3000    2922    2970    2992    5927   11859   22500   23881
  Phenom II    64b   L3   3000    1419    1462    1492    2908    5958   11097   11891
  Phenom II    64b  RAM   3000     832     877     911    1784    3676    6237    7360

  Core i7 930  32b   L1   ****   10303    9510    9654    9122    9134    9023   23326
  Core i7 930  32b   L2   ****    1996    2041    3677    5980    8009    8643   22092
  Core i7 930  32b   L3   ****    1948    2004    3608    5848    8074    8614   21650
  Core i7 930  32b  RAM   ****     526     709    1350    2352    4458    7063    9485

  Core i7 930  64b   L1   ****   20105   18713   19136   17974   18126   17910   23345
  Core i7 930  64b   L2   ****    3934    3999    4076    7064   12003   15793   21923
  Core i7 930  64b   L3   ****    3842    3909    4028    6979   11748   15845   21848
  Core i7 930  64b  RAM   ****     949    1048    1419    2736    4698    8812    9459

 
  L1 Cache Results in MBytes/Second - 6 KB                    % Gain

                 Cache CPUs/  MHz     Inc    Read    128b     Inc   Read   128b
                   KB   HTs         32wds     All    SSE2   32wds    All   SSE2
 
  Phenom II        64   4/0  3000   10606   14219   23691
  764  2 Threads  128               21150   28435   47423     199    200    200
  4 Threads       256               40763   54630   92595     384    384    391
  6 Threads       256               31624   54370   88023     298    382    372
  8 Threads       256               38638   53126   85948     364    374    363

  Core i7 930      32   4/4  ****   10303    9023   23326
  764  2 Threads   64               20590   18031   46677     200    200    200
  4 Threads       128               29499   31104   91726     286    345    393
  6 Threads       128               35391   35846  137181     344    397    588
  8 Threads       128               41300   39292  170513     401    435    731

  L2 Cache Results in MB/Second - 96 KB                       % Gain

  Phenom II       512   4/0  3000    1496   13145   23798
  2 Threads      1024                2983   26351   47336     199    200    199
  4 Threads      2048                5761   51226   92184     385    390    387
  6 Threads      2048                5863   48050   86055     392    366    362
  8 Threads      2048                5380   48529   85650     360    369    360

  Core i7 930     256   4/4  ****    1996    8643   22092
  2 Threads       512                3378   17305   43722     169    200    198
  4 Threads      1024                3866   26611   60836     194    308    275
  6 Threads      1024                4049   33262   64866     203    385    294
  8 Threads      1024                4178   37228   68711     209    431    311

  L3 Cache - 1536 KB Data                                     % Gain

  Phenom II      6144   4/0  3000     659    9562   10838
  2 Threads                          1431   18082   22559     217    189    208
  4 Threads                          2222   29623   34566     337    310    319
  6 Threads                          2221   30682   34525     337    321    319
  8 Threads                          2240   31417   35148     340    329    324

  Core i7 930    8192   4/4  ****    1948    8614   21650
  2 Threads                          3192   17141   42945     164    199    198
  4 Threads                          3772   30387   58809     194    353    272
  6 Threads                          2537   29429   43411     130    342    201
  8 Threads                          1060   19526   15886      54    227     73

  RAM Results in MBytes/Second - 128 MB                       % Gain

  Phenom II             4/0  3000     439    5214    7302
  2 Threads                           744    8920   12162     169    171    167
  4 Threads                           913   13000   14952     208    249    205
  6 Threads                           902   13183   15005     205    253    205
  8 Threads                           909   12701   14966     207    244    205

  Core i7 930           4/4  ****     526    7063    9485
  2 Threads                           637   11883   12945     121    168    136
  4 Threads                           724   13600   13828     138    193    146
  6 Threads                           731   13572   13911     139    192    147
  8 Threads                           731   13750   13722     139    195    145

        ****  i7 930 2800 MHz running using Turbo Boost at up to 3066 MHz      




RandMP Serial/Random Access Benchmark

Rand8Thread32.exe and Rand8Thread64.exe are compiled from the same program, but for 32 and 64 bits. The program uses the same code for serial and random use via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM using 1, 2, 4, 6 and 8 threads. Results (32 and 64 bit versions) and further details can be found in RandMem Results.htm.

This benchmark uses data from the same array for all threads, but starting at different points. As with the dual core version, on read/write tests, and particularly random ones, flushing dedicated caches to maintain data coherency leads to reduced performance when using more than one thread. Here, speed using shared L2 or L3 cache can be faster than using L1 cache. Results for the 32 bit version below show the total throughput of all threads based on harmonic mean. Data sizes are, again, 6 KB for L1 cache, 96 KB for L2 cache, 1536 KB for L3 cache but 96 MB for RAM.
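The idea of one loop serving both serial and random access, with each thread starting at a different point in a shared array, can be sketched as follows. This is a hypothetical illustration, not the benchmark's code: the index table, function names and start offsets are all assumptions.

```python
import random


def build_index(n, random_order, seed=1):
    """Index table: sequential for serial access, shuffled for random access."""
    index = list(range(n))
    if random_order:
        random.Random(seed).shuffle(index)
    return index


def read_test(data, index, start=0):
    """Read every word once via the index table, beginning at `start`."""
    n = len(index)
    total = 0
    for i in range(n):
        total += data[index[(start + i) % n]]
    return total


def read_write_test(data, index, start=0):
    """Read, modify and write back every word via the index table."""
    n = len(index)
    for i in range(n):
        j = index[(start + i) % n]
        data[j] = data[j] + 1


data = list(range(1536))  # 6 KB of 4 byte words, the L1 cache test size
serial = build_index(len(data), random_order=False)
scattered = build_index(len(data), random_order=True)

# Same loop and data; the second call starts a quarter of the way in,
# as a second thread might. Both visit every word exactly once.
serial_sum = read_test(data, serial, start=0)
random_sum = read_test(data, scattered, start=384)
print(serial_sum == random_sum)
```

In the read/write case, threads sharing one array modify cache lines that other cores also hold, which is the coherency traffic the text blames for the poor multi-thread RW results.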

On the Phenom, speed on serial reading, from caches and RAM, is similar to that for BusMP Read All tests. This also applies via caches for the Core i7 but, using RAM, the data transfer speed appears to be higher than possible, most likely due to efficient caching of shared data (different data starting point probably suits 8 MB L3 cache). This i7 RAM test is the only one where Hyperthreading has a major impact.

Random reading speed via L1 cache is similar to that for serial reading but becomes progressively slower through other caches and RAM. The Core i7 is the faster from L3 cache and RAM using few threads, but the Phenom nearly catches up at 8 threads. The i7 is clearly the much faster of the two systems on most read/write tests, but still struggles to achieve a throughput gain of greater than 2.0 using more than two threads. Note that, using one thread on random read/write of L1 cache sized data, the i7 is five times faster than using multiple threads and the Phenom up to ten times faster. For the latter, using data in RAM is faster than data that could sit within L1 cache.


             CPUs          MBytes Per Second Using Threads        Gain At Threads
             /HTs         1       2       4       6       8     2     4     6     8
 Serial RD
 Core i7     4/8 L1   11458   22661   37039   43717   46374   2.0   3.2   3.8   4.0
 930             L2   10380   20832   32853   41711   42839   2.0   3.2   4.0   4.1
 #### MHz        L3    8828   17743   29610   38414   40330   2.0   3.4   4.4   4.6
 Win 764        RAM    4266    8712   17347   24946   25589   2.0   4.1   5.8   6.0

 Serial RW
 Core i7     4/8 L1   15282   13724   16240   16209   18379   0.9   1.1   1.1   1.2
 930             L2   12223   18216   25326   28104   27047   1.5   2.1   2.3   2.2
 #### MHz        L3   10234   19266   21931   24450   26351   1.9   2.1   2.4   2.6
 Win 764        RAM    4533    7656   13876   14543   13390   1.7   3.1   3.2   3.0

 Random RD
 Core i7     4/8 L1   11266   22548   38174   45592   47141   2.0   3.4   4.0   4.2
 930             L2    6233   12463   20059   24986   25667   2.0   3.2   4.0   4.1
 #### MHz        L3    3499    6915    9211   10002    9531   2.0   2.6   2.9   2.7
 Win 764        RAM     459     909    1241    1398    1364   2.0   2.7   3.0   3.0

 Random RW
 Core i7     4/8 L1   14375    3027    2780    2901    3297   0.2   0.2   0.2   0.2
 930             L2    5887    4555    6117    6693    7281   0.8   1.0   1.1   1.2
 #### MHz        L3    3104    4604    4721    5047    4933   1.5   1.5   1.6   1.6
 Win 764        RAM     428     860     899     948    1026   2.0   2.1   2.2   2.4

 #### 2.8 GHz running at up to 3.06 GHz via Turbo Boost, dual channel 1066 MHz DDR3 RAM 

 ##################################################################################
 
             CPUs          MBytes Per Second Using Threads        Gain At Threads
             /HTs         1       2       4       6       8     2     4     6     8
 Serial RD
 Phenom II   4/0 L1   15212   29350   58904   58896   54909   1.9   3.9   3.9   3.6
 3000 MHz        L2   12236   24767   49039   50798   47318   2.0   4.0   4.2   3.9
 Win 764         L3    8148   16402   30391   33436   32457   2.0   3.7   4.1   4.0
 1333 MHz DDR3  RAM    3917    6983   11299   12484   12002   1.8   2.9   3.2   3.1
 
 Serial RW
 Phenom II   4/0 L1    7741    5100    5750    6598    6517   0.7   0.7   0.9   0.8
 3000 MHz        L2    7998    5906    7479    8466    8345   0.7   0.9   1.1   1.0
 Win 764         L3    7132   13142    7489    8566    8582   1.8   1.1   1.2   1.2
 1333 MHz DDR3  RAM    3589    5981    8576    7913    7802   1.7   2.4   2.2   2.2
 
 Random RD
 Phenom II   4/0 L1   14367   27877   56817   55300   54129   1.9   4.0   3.8   3.8
 3000 MHz        L2    7250   14355   28436   29723   27962   2.0   3.9   4.1   3.9
 Win 764         L3    1560    3419    6641    7403    7410   2.2   4.3   4.7   4.8
 1333 MHz DDR3  RAM     339     679    1140    1336    1242   2.0   3.4   3.9   3.7
 
 Random RW
 Phenom II   4/0 L1    7585    1381     752     833     757   0.2   0.1   0.1   0.1
 3000 MHz        L2    5985    1624    1230    1387    1245   0.3   0.2   0.2   0.2
 Win 764         L3    1505    1724    1377    1545    1572   1.1   0.9   1.0   1.0
 1333 MHz DDR3  RAM     313     634    1113    1157    1153   2.0   3.6   3.7   3.7




OpenMP Benchmark - MFLOPS

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the latest Microsoft C++ compilers. The benchmark executes the same functions, using the same data sizes, as the CUDA Graphics GPU Parallel Computing Benchmark, with varieties compiled for 32 bit and 64 bit operation, using old style i387 floating point instructions and more recent SSE code (OpenMP32MLOPS.exe and OpenMP64MLOPS.exe). A run time Affinity option is available to execute the benchmark on a selected single processor. These benchmarks and a non-OpenMP SSE version (SSE32MFLOPS.exe) can be downloaded via OpenMPMflops.zip. Results and further details can be found in OpenMP MFLOPS.htm.

The benchmark demonstrates that OpenMP can make use of four CPUs, but gains little extra from Hyperthreading on the Core i7. Each test reads 1000 MB and writes 1000 MB, where at least the largest data size of 10M words will be from/to RAM and, with 2 floating point operations per word, could be limited by memory speed. Two example calculations of MB/second are shown below.


  Core i7 930 2.8 GHz running at up to 3.06 GHz via Turbo Boost
  Windows 7 64
                                                                CUDA    CUDA
   Data    Ops/ Repeat    SSE    i387    i387 SSE 64b SSE 64b  GeFrce  No I/O
   Words   Word Passes  1 CPU   1 CPU 4/8 CPU   1 CPU 4/8 CPU  GTX480  GTX480

    100000    2   2500   3567    1248    4455    1574    4001     521    5554
   1000000    2    250   3529    1420    5433    1861    4919     819   21493
  10000000    2     25   2388    1364    3038    1735    3076xx  1014   31991

    100000    8   2500   4655    2337    8798    3794   14581    2058   20129
   1000000    8    250   4642    2413    9813    4149   17080    3306   82132
  10000000    8     25   4453    2436    9581    4011   12457    4057  125413

    100000   32   2500   3328    2957   12020    4324   16786    7768   52230
   1000000   32    250   3329    3011   12339    4436   17599   13190  254306
  10000000   32     25   3307    3003   12432    4418   17576yy 16077  425237

  Maximum Gain                           414%            412%

  xx in 0.163 seconds - MB/Second = 2000 / 0.163 = 12270 (x  2/8 for MFLOPS)
  yy in 0.455 seconds - MB/Second = 2000 / 0.455 =  4396 (x 32/8 for MFLOPS) 

  Phenom II X4 3.0 GHz, Windows 7 64
                                                                CUDA    CUDA
   Data    Ops/ Repeat    SSE    i387    i387 SSE 64b SSE 64b  GeFrce  No I/O
   Words   Word Passes  1 CPU   1 CPU   4 CPU   1 CPU   4 CPU  GTS250  GTS250

    100000    2   2500   3552    1920    5587    1822    5613     328    3054
   1000000    2    250   3268    1919    5585    1870    7056     625    9672
  10000000    2     25   1861    1625    2993    1563    2972     714   13038

    100000    8   2500   4535    2115    7763    3637   12653    1336   12233
   1000000    8    250   4341    2108    7975    3709   14518    2382   39481
  10000000    8     25   4141    2100    8062    3543   11273    2949   51199

    100000   32   2500   4012    2566    9675    3652   14092    5142   36080
   1000000   32    250   3981    2552   10091    3663   14510    9427  108170
  10000000   32     25   3941    2510    9902    3633   14034   11182  135041

  Maximum Gain                           395%            396%
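The xx and yy footnotes under the Core i7 table can be reproduced as follows. The sketch assumes 4 byte (single precision) words, which is what 10M words x 25 passes = 1000 MB implies, and the function names are illustrative. The timings are the rounded values from the footnotes, so the MFLOPS come out close to, not identical to, the table entries.

```python
BYTES_PER_WORD = 4  # assumed single precision data


def mb_moved(words, passes):
    """MB read plus MB written over all repeat passes."""
    one_way = words * BYTES_PER_WORD * passes / 1e6
    return 2 * one_way


def mb_per_sec_and_mflops(words, passes, ops_per_word, seconds):
    mb_per_sec = mb_moved(words, passes) / seconds
    # Each word moves 8 bytes (4 read + 4 written) for ops_per_word operations.
    mflops = mb_per_sec * ops_per_word / 8
    return mb_per_sec, mflops


xx_mb, xx_mflops = mb_per_sec_and_mflops(10_000_000, 25, 2, 0.163)
yy_mb, yy_mflops = mb_per_sec_and_mflops(10_000_000, 25, 32, 0.455)

print(f"xx: {xx_mb:.0f} MB/second, about {xx_mflops:.0f} MFLOPS")
print(f"yy: {yy_mb:.0f} MB/second, about {yy_mflops:.0f} MFLOPS")
```

At 2 operations per word the test moves over 12000 MB/second, near the multi-thread RAM speeds measured by BusMP, whereas at 32 operations per word the memory demand is well below that, so the calculation rate is the more likely limit.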




Multiple Tasks

Multitasking tests were run on the Core i7 using IntBurn64.exe and SSEBurn64.exe, which are described in BurnIn64.htm and BurnIn4CPU.htm. The benchmark and source code are in More64bit.zip. Tests run were one copy each of the Integer and SSE floating point programs, four concurrent copies of the integer test and four copies of both integer and SSE programs at the same time. Test durations were one minute each and results showed that all multitasking tests started and finished within the same clock time second. Each test used L1 cache size data of 8 K. The SSE tests used the Cache Test option, normally the fastest.

Single test results show that the integer test is producing around one 64 bit result per clock cycle and four 32 bit (128 bits) floating point results per clock cycle using SSE instructions. As might be expected, the higher Turbo Boost CPU clock frequency using one CPU means that four concurrent integer tests do not achieve a 400% performance level. However, running these eight programs, along with Hyperthreading, increases this to between 428% and 450%.


                        1 Test  ----- 4 Concurrent Tests ----   Total   Gain
   
 Int Write/Read MB/sec  14195   13955   13902   13879   13905   55641   392%
 Int Read       MB/sec  20267   20206   20191   20179   20169   80746   398%

 Int Write/Read MB/sec           8127    8756    8345    8414   33641   237%
 Int Read       MB/sec          10914   10794   10790   11049   43547   215%

 SSE Calculate  MFLOPS  11743    6231    6119    6144    6517   25011   213%
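The 428% to 450% range quoted above appears to be the sum of the integer and SSE gains when all eight programs run together. A short sketch on that assumption, using the totals from the table (the helper name is illustrative):

```python
def gain_percent(total, single_test):
    """Combined throughput as a percentage of the single test speed."""
    return round(total / single_test * 100)


# Four integer copies running alone.
int_wr_alone = gain_percent(55641, 14195)
int_rd_alone = gain_percent(80746, 20267)

# Four integer and four SSE copies running together.
int_wr_shared = gain_percent(33641, 14195)
int_rd_shared = gain_percent(43547, 20267)
sse_shared = gain_percent(25011, 11743)

combined_low = int_rd_shared + sse_shared
combined_high = int_wr_shared + sse_shared
print(int_wr_alone, int_rd_alone)
print(combined_low, combined_high)
```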






Roy Longbottom August 2010



The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection