
Roy Longbottom at Linkedin   Linux MultiThreading Benchmarks

Contents


General
Description Simple Add Tests
Comparison Simple Add Tests
Whetstone MP Benchmark
Comparison Whetstone MP
MP MFLOPS Description
Comparison MP MFLOPS
MP MFLOPS Burn-In Test
MP Memory Speed
MPmemSpeed Comparisons
MP Memory Bus Speed
MPbusSpeed Comparisons
MP Memory Random Access
MPrandMem Comparisons

Summary

Six benchmarks are provided that can run using up to 64 concurrent threads, with versions compiled to run using 64 bit or 32 bit systems. Performance is mainly measured as Millions of Instructions Per Second (MIPS), Millions of Floating Point Operations Per Second (MFLOPS) or Millions of Bytes per Second (MB/S).

Simple Add Tests - execute 32 bit or 64 bit integer instructions and 128 bit SSE floating point functions via assembly language. These use simple add operations with little access to external data. Resultant performance is generally proportional to the number of CPU cores with some gains also identified when Hyperthreading is available. Each thread executes independent code.

Whetstone Benchmark - is the first general purpose benchmark that set industry standards of computer system performance, mainly dependent on floating point speed but with some independently timed integer test functions. Data used is generally contained in L1 cache with performance gains again proportional to the number of cores. Each thread again executes independent code.

MP MFLOPS Program - uses the same functions as my CUDA and OpenMP benchmarks, comprising routines with 2, 8 and 32 add or multiply floating point calculations with data from higher level caches or RAM. The 64 bit version compiles using SSE floating point, where up to 6 MFLOPS per CPU MHz per core can be produced. The 32 bit program uses the much slower original 80387 FPU instructions. These programs can also be used as burn-in/reliability tests. Each thread executes the same functions but on a different segment of the data.

MP Memory Speed Tests - employ three sequences of operations, using double and single precision floating point numbers and integers, on data sized between 4 KB and 25% of RAM size. The operations are memory to memory transfers with 0, 1 and 2 arithmetic calculations. The 64 bit version again uses SSE functions but not as efficiently as MP MFLOPS. Again each thread has the same procedures using different segments of the data.

MP Memory Bus Speed Tests - read data at a range of sizes covering caches and RAM. Data is accessed with varying address increments to identify reading data in bursts over the bus and allow estimation of maximum bus/memory speed. This time, each thread reads all the data. The 64 bit version uses the double size 8 byte words, where data transfer speed can be twice that of the 32 bit compilation, demonstrating that 32 and 64 bit integer instructions can execute at the same speed. A second version provides the alternative of each thread starting its reads at a different data address, to avoid overestimation of maximum speed due to large L3 caches.

MP Memory Random Access Speed Benchmark - comprises serial and random access read and read/write tests that cover cache and RAM data sizes. All threads access the same data but starting at different points. In this case, data could be corrupted with concurrent updates, but the Operating System appears to flush caches to avoid this, producing extremely slow performance. Extra tests avoid this conflict by executing one read/write test at a time, leading to some slower and some faster speeds. Random access can be affected by burst reading/writing with associated poor performance.



General

These tests are intended to measure Linux and hardware performance at high speeds using multithreading. The programs were compiled at both 32 bits and 64 bits. The execution files, source code, compilation and running instructions can be found in linux_multithreading_apps.tar.gz. This includes 2014 revised benchmarks, produced by a later compiler. The latter was also used to produce versions using the newer AVX instructions, available in AVX_benchmarks.tar.gz and described in Linux AVX benchmarks.htm. All provide the following information on the system under test. They are based on versions available for running under Windows and described in quad core 8 thread.htm.


##############################################

  Assembler CPUID and RDTSC       
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42 
  AMD Phenom(tm) II X4 945 Processor 
  Measured - Minimum 3014 MHz, Maximum 3014 MHz 
  Linux Functions 
  get_nprocs() - CPUs 4, Configured CPUs 4 
  get_phys_pages() and size - RAM Size  7.81 GB, Page Size 4096 Bytes 
  uname() - Linux, roy-64, 2.6.35-22-generic 
  #33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010, x86_64 




Description Simple Add Tests

CPUMaxMP32 and CPUMaxMP64 execute simple integer and floating point add instructions via assembly language. Floating point arithmetic is identical on the two versions, via SSE instructions that handle four calculations at a time via 128 bit registers. Performance is measured in Millions of Floating Point Operations Per Second (MFLOPS) with expected maximum speed of four adds per CPU clock cycle. Integer calculations are the same, except one uses 32 bit instructions/registers and the other the 64 bit varieties. For these, speed is measured in Millions of Instructions Per Second (MIPS). Results are logged in file MPadds.txt.

The assembly code loops execute two billion add instructions to ensure that elapsed times of a single thread are significant (like 0.5 seconds or more for SSE tests on current CPUs). A command line run time variable is available to specify the number of threads to use, between 1 and 64, with a default of four. Below are example full results, using four threads, and MIPS from a run with 64 threads. Besides total MIPS and MFLOPS, second sums are provided, based on the time for the last thread to finish. As seen in both examples, threads do not finish on a first in, first out basis, but the time is shared fairly evenly, even with 64 threads.


 Command ./cpumaxmp64 Threads 4 (or T 4 or t 4)

 Phenom 4 CPUs Available
##############################################

 Multithreading Add Test 64 bit Version 1.0 Thu May  5 11:35:18 2011
 
 Integer Additions 4 Threads

 Thread   4 -    8281 64 bit Integer MIPS
 Thread   2 -    7996 64 bit Integer MIPS
 Thread   1 -    7815 64 bit Integer MIPS
 Thread   3 -    7800 64 bit Integer MIPS
 Total      -   31892 64 Bit Integer MIPS
 Aggregate  -   31201 64 Bit Integer MIPS, based on last to finish


 SSE Floating Point Additions 4 Threads

 Thread   2 -   12030 32 Bit SSE MFLOPS
 Thread   3 -   11976 32 Bit SSE MFLOPS
 Thread   4 -   11861 32 Bit SSE MFLOPS
 Thread   1 -   11692 32 Bit SSE MFLOPS
 Total      -   47559 32 Bit SSE MFLOPS
 Aggregate  -   46770 32 Bit SSE MFLOPS, based on last to finish

 End of test Thu May  5 11:35:23 2011


 Integer MIPS 64 Threads

 Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS

    1  528   10  522   41  518   11  517   49  516   21  515   52  514   48  514
   24  527   55  521   56  518    5  517   32  515   45  515   47  514   44  514
   57  526   29  521   26  518   28  516    8  515    9  515   27  514   31  514
   62  525   23  520   20  518   58  516   17  515   38  515   63  514    3  514
    6  525   39  520   25  518    7  516   36  515   33  515    2  514   19  513
   59  523   61  520   12  518   51  516   53  515   54  515   34  514   35  513
   46  523   50  519   16  517   37  516   14  515   22  515   64  514   15  513
   13  522   60  519   43  517   42  516    4  515   18  515   30  514   40  513
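
The relationship between the Total and Aggregate lines in the logs above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the benchmark's own code: it assumes that every thread executes the same two billion adds, and the function name is hypothetical.

```python
# Sketch: derive "Total" and "Aggregate" ratings from per-thread speeds.
# Assumption: every thread executes the same number of add instructions.

OPS_PER_THREAD = 2_000_000_000  # two billion adds per thread, as described above


def total_and_aggregate(thread_mips):
    """Return (total, aggregate) MIPS for a list of per-thread MIPS values."""
    # Elapsed seconds of each thread, recovered from its own rating.
    seconds = [OPS_PER_THREAD / (mips * 1e6) for mips in thread_mips]
    # Total: simple sum of the individual thread ratings.
    total = sum(thread_mips)
    # Aggregate: all work done, divided by the time of the last thread to finish.
    aggregate = len(thread_mips) * OPS_PER_THREAD / max(seconds) / 1e6
    return total, aggregate


# Per-thread 64 bit integer MIPS from the four-thread Phenom log above.
total, aggregate = total_and_aggregate([8281, 7996, 7815, 7800])
print(total, round(aggregate))
```

Under this assumption the aggregate is simply the thread count multiplied by the slowest thread's rating, which comes out within rounding of the logged 31201 MIPS.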



Comparison Simple Add Tests

Following are sample results on a range of systems with one, two and four CPUs, using 1, 2, 4 and 64 threads. The range of speeds of individual threads is also shown for the latter.

Atom - This is a netbook, where the single CPU has HyperThreading and 64 bit capability. With HT, two CPUs are identified and, in this case, integer addition throughput using multiple threads is 40% higher than from a single thread and 20% faster with SSE floating point calculations.

Core 2 Duo - Results from the 32-Bit and 64-Bit compilations are shown for this dual core processor, where 32 bit integer speed is somewhat faster than at 64 bits. Integer additions are executed at up to 2.75 per CPU clock cycle (or MIPS/MHz) with SSE calculations at the maximum rate of four per clock cycle. As with earlier tests, this system runs at 1.6 GHz when one CPU is being used under Linux and default “On-Demand” Frequency Scaling is used (see Linux Peculiarities in linux burn-in apps.htm). Results provided are for a “Performance” setting.

Phenom - Results of the 64-Bit version are shown for this quad core processor, via Linux Ubuntu and Fedora. There appear to be some differences between the two versions of Linux but these might be normal variations due to other influences. They at least show that the quad core processor can increase throughput by four times with these tests.

Core i7 - This is a 4 core/8 thread 3.7 GHz 4820K, running at 3.9 GHz Turbo Boost speed. It seems that all 8 threads are needed for a four times performance improvement. With 4 cores and 4 SSE adds at a time, maximum speed would be 62.4 GFLOPS, and that was nearly achieved. With integer addition, nearly 1.6 results per clock cycle is demonstrated.


                    Atom 1.7 GHz  Core 2 Duo 2.4 GHz          Phenom X4 3.0 GHz
                    64 bit        64 bit        32 bit        64 bit        64 bit
                    Ubuntu        Ubuntu        Ubuntu        Ubuntu        Fedora
                    MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS
 Threads
   1    Total       1844   5418   5268   9605   6597   9591   8052  12046   8213  12030
        Aggregate   1844   5418   5268   9605   6597   9591   8052  12046   8213  12030
            %      100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0

   2    Total       2631   6460  10290  18992  12782  18707  15964  24022  16447  24052
        Aggregate   2598   6441  10156  18898  12621  18344  15810  23946  16446  24050
            %       98.7   99.7   98.7   99.5   98.7   98.1   99.0   99.7  100.0  100.0

   4    Total       2652   6473  10508  19159  13011  19047  31892  47559  32701  47889
        Aggregate   2630   6449  10416  19070  12933  18940  31201  46770  32344  47620
            %       99.2   99.6   99.1   99.5   99.4   99.4   97.8   98.3   98.9   99.4

   64   Min           42    101    164    299    205    299    513    749    510    655
        Max           43    103    173    310    229    310    528    798    529    763
        Total       2719   6526  10696  19443  13556  19435  33094  48974  33012  43339
        Aggregate   2657   6466  10503  19111  13129  19120  32840  47932  32664  41938
            %       97.7   99.1   98.2   98.3   96.9   98.4   99.2   97.9   98.9   96.8


 ------------------------------------------------------------------------

3.7 GHz Core i7 4820K

                     1 Thread      2 Threads     4 Threads     8 Threads   8 Thread Gains
                     MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS

 32 bit Total       11694  15450  12154  30907  36237  30993  48733  61911    4.2    4.0
        Aggregate   11694  15450  11953  30907  24342  30902  47020  61763    4.0    4.0
            %       100.0  100.0   98.3  100.0   67.2   99.7   96.5   99.8

 64 bit Total       11937  15450  23069  30887  24717  46409  49162  61916    4.1    4.0
        Aggregate   11937  15450  23061  30874  24167  30903  47387  61540    4.0    4.0
            %       100.0  100.0  100.0  100.0   97.8   66.6   96.4   99.4   96.4   99.4
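
The throughput gains quoted in the comparison notes can be reproduced directly from the Total figures in the tables above. This is a small illustrative sketch; the function name is hypothetical and the inputs are copied from the tables.

```python
def gain(multi_thread_speed, single_thread_speed):
    """Throughput of a multi-thread run relative to the one-thread run."""
    return multi_thread_speed / single_thread_speed


# Total MIPS or MFLOPS values taken from the comparison tables above.
atom_int = gain(2631, 1844)       # Atom, 2 threads, integer adds
atom_sse = gain(6460, 5418)       # Atom, 2 threads, SSE adds
phenom_int = gain(31892, 8052)    # Phenom, 4 threads, integer adds (Ubuntu)
i7_int = gain(48733, 11694)       # Core i7, 8 threads, 32 bit integer adds

for name, value in [("Atom integer", atom_int), ("Atom SSE", atom_sse),
                    ("Phenom integer", phenom_int), ("Core i7 integer", i7_int)]:
    print(f"{name}: {value:.2f} times")
```

The Atom ratios of about 1.43 and 1.19 correspond to the roughly 40% and 20% Hyperthreading gains noted for that system, and the Core i7 ratio rounds to the 4.2 shown in the 8 Thread Gains column.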




Whetstone MP Benchmark

The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Details and performance of early to modern systems can be found in Whetstone Benchmark History And Results and Results On PCs. The overall performance rating is in Millions of Whetstone Instructions Per Second (MWIPS). Later, it was found necessary to measure the speed of the eight different test functions used, to demonstrate that compilers were not over optimising and to allow code tweaks to avoid this situation. The additional measurements are in terms of Millions of Operations Per Second (MOPS) or MFLOPS for straight floating point calculations. As the design authority, nominated by the original author, I have to say that versions that do not provide these separate measurements cannot be taken as valid.

This multithreading benchmark also has a run time parameter to specify the number of threads (up to 64) with a default identified as Configured CPUs in gathered system information (see above). An initial calibration determines the number of passes needed for an overall execution time of 10 seconds. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread but using dedicated variables. The numeric results of calculations are logged for the first thread with others checked for the same values. Actual results might be different on repeated runs as they are dependent on the number of passes.
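
The calibration step can be pictured as timing a short run and scaling the pass count to the target time. The sketch below is an assumption about how such a calibration might be arranged, not the method used in the whetsMP source: the doubling scheme, the 0.1 second threshold and all names are hypothetical.

```python
import time


def scale_passes(calibration_passes, calibration_seconds, target_seconds):
    """Scale a measured pass count linearly to reach the target run time."""
    return max(1, round(calibration_passes * target_seconds / calibration_seconds))


def passes_for_target(workload, target_seconds=10.0, min_calibration_seconds=0.1):
    """Double the pass count until the run is long enough to time, then scale."""
    passes = 1
    while True:
        start = time.perf_counter()
        workload(passes)
        elapsed = time.perf_counter() - start
        if elapsed >= min_calibration_seconds:
            break
        passes *= 2
    return scale_passes(passes, elapsed, target_seconds)


# Example: a calibration run of 100 passes that took 0.5 seconds
# suggests 2000 passes for a 10 second test.
print(scale_passes(100, 0.5, 10.0))
```

Because the pass count comes from a timing measurement, it can differ from run to run, which is consistent with the numeric results varying between repeated runs.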

Four versions are available, whetsMP64, whetsMP64DP, whetsMP32 and whetsMP32DP, for 32-Bit or 64-Bit systems using Single or Double Precision floating point. Results are logged in file MPwhetres.txt.


 Equivalent command ./whetsMP64 Threads 4

 Phenom 4 CPUs Available
 #####################################################################

     Multithreading Single Precision Whetstones 64-Bit Version 1.0

             Using 4 threads - Sat May 14 12:03:51 2011

         MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
 Thread             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

      1   2861    927    872    747     71     38   2947   2259    629
      2   2865    875    892    745     71     38   3294   2198    641
      3   2875    869    892    744     71     38   3408   2202    645
      4   2896    906    895    744     72     38   3141   2232    651

 Total   11496   3577   3550   2979    285    151  12790   8891   2566

 MWIPS   11389 Based on time for last thread to finish


 Results Of Calculations Thread 1

 MFLOPS 1   -1.12475013732910156      MFLOPS 2   -1.13133049011230469
 IFMOPS      1.00000000000000000      FIXPMOPS   12.00000000000000000
 COSMOPS     0.49911013245582581      MFLOPS 3    0.99999982118606567
 EQUMOPS     3.00000000000000000      EXPMOPS     0.93536460399627686

      Numeric results of the other 3 threads were same as above

               End of test Sat May 14 12:04:09 2011
 




Comparison Whetstone MP Benchmark

Following are results of the four versions of the benchmark, using 1, 2, 4 and 16 threads, via the Atom, Core 2 Duo and Phenom CPUs noted above. The 32-Bit compilations use the original i87 floating point instructions, where arithmetic calculations are at double precision, normally producing the same speeds with both precision options. With SSE rather than i87 being the compiler default for floating point at 64-Bits, SSE instructions are used for single precision and SSE2 for double. Using Single Instruction Multiple Data (SIMD) mode, as in the above Add Tests, SSE can be twice as fast as SSE2, with four 32 bit arithmetic calculations at a time, compared with two at 64 bits. Here, however, the source code is unsuitable for SIMD compilation, so scalar or SISD (Single Instruction Single Data) instructions are used, and single precision calculations can be at the same speed or slightly faster than using double precision. This scalar operation also means that 64-Bit and 32-Bit compilations can produce similar performance.

There are differences with the 32-Bit double precision version where speed can be much faster. For the one headed Equal MOPS, the single precision code uses mov instructions rather than store on the faster compilation. For Fixpt MOPS, integer calculations are the same but the faster one involves integer conversion to double precision rather than single precision. The speed difference remains a mystery but this has little effect on the overall performance rating.

As indicated earlier, the single core Atom has Hyperthreading. In this case, some floating point calculations can be twice as fast using more than one thread. One anomaly is the high speed result during the four thread fixed point test. Here, Linux appeared to have run one thread twice as fast as the others, distorting the total. Results on the Core 2 Duo include one for a test using 64 threads. Speeds are also shown using Fedora on the Phenom, rather than Ubuntu.

Later results shown are for a Core i7, with 4 cores and 8 threads, via Hyperthreading. It is also rated at 3.7 GHz but mainly runs at 3.9 GHz via Turbo Boost. On this PC at 64 bits, Cos and Exp tests are much faster than the 32 bit compilations, and this has a significant effect on the MWIPS rating. In some cases, Hyperthreading doubles speed on eight threads, compared with four.


        Threads  MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
                             1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

 Atom        1     751     397     395     363      17       8     698    1230     141
 1.6 GHz     2    1284     747     697     698      31      14     948    1657     190
 64 Bit      4    1301     768     760     715      31      14    1207    1661     190
 SP         16    1309     801     773     726      32      14     956    1698     191
 Aggregate  16    1305

 Atom        1     748     381     381     324      17       8     700    1235     141
 1.6 GHz     2    1287     732     714     634      31      14     950    1662     191
 64 Bit      4    1259     781     748     593      32      13     963    1686     186
 DP         16    1307     765     742     647      32      14     958    1691     191
 Aggregate  16    1302

 Atom        1     698     330     329     282      17       7     758    1230     118
 1.6 GHz     2    1182     594     588     478      29      13     987    1654     178
 32 Bit      4    1193     613     614     483      30      13     998    1690     178
 SP         16    1202     618     589     483      30      13     995    1688     178
 Aggregate  16    1199

 Atom        1     757     330     330     282      17       7    1468     837     299
 1.6 GHz     2    1312     600     592     480      29      13    2420    1248     506
 32 Bit      4    1319     611     604     482      30      13    2504    1263     505
 DP         16    1329     610     615     485      30      13    2575    1268     507
 Aggregate  16    1324

 Core2 Duo   1    2501     876     876     600      68      29    3198    3601     600
 2.4 GHz     2    4926    1726    1632    1192     135      58    6102    6930    1193
 64 Bit      4    4963    1733    1748    1196     135      58    6328    7158    1196
 SP         16    4982    1758    1757    1199     136      58    6420    7212    1198
 Aggregate  16    4966
            64    5054    1783    1782    1215     138      59    6566    7292    1218
 Aggregate  64    4973

 Core2 Duo   1    2364     803     803     533      61      30    3005    3601     600
 2.4 GHz     2    4688    1589    1586    1059     121      60    5911    7082    1196
 64 Bit      4    4698    1599    1601    1062     122      60    5976    7089    1197
 DP         16    4714    1609    1613    1062     122      61    6068    7213    1197
 Aggregate  16    4704

 Core2 Duo   1    2165     817     817     576      58      23    3169    3600     623
 2.4 GHz     2    4270    1564    1558    1130     114      45    6149    6823    1234
 32 Bit      4    4330    1616    1628    1149     116      45    6636    7168    1253
 SP         16    4331    1628    1638    1151     116      45    6586    7229    1256
 Aggregate  16    4317

 Core2 Duo   1    2244     817     817     576      58      23    5140    3596    1028
 2.4 GHz     2    4452    1621    1578    1144     115      46   10028    7120    2049
 32 Bit      4    4450    1624    1630    1150     113      46   10399    7176    2065
 DP         16    4472    1634    1636    1149     115      46   10301    7227    2051
 Aggregate  16    4465

 Phenom x4   1    2909     925     927     753      72      38    3229    2258     644
 3.0 GHz     2    5787    1832    1825    1504     144      76    6375    4488    1253
 64 Bit      4   11496    3577    3550    2979     285     151   12790    8891    2566
 SP         16   11655    3700    3718    3006     289     153   13395    9039    2635
 Aggregate  16   11578
 Fedora     16   11842    3705    3715    3010     296     158   13474    9067    2552
 Aggregate  16   11725

 Phenom x4   1    3002     927     927     753      75      42    3228    2253     601
 3.0 GHz     2    5977    1819    1829    1498     150      83    6410    4491    1184
 64 Bit      4   11810    3583    3610    2976     297     163   12492    8875    2372
 DP         16   11992    3694    3715    3008     300     166   12945    9068    2429
 Aggregate  16   11929

 Phenom x4   1    2586     927     926     695      64      31    3132    2259     621
 3.0 GHz     2    5141    1819    1827    1389     129      62    6213    4484    1200
 32 Bit      4   10178    3564    3623    2747     255     124   11567    8893    2390
 SP         16   10300    3695    3691    2780     256     125   12584    9070    2460
 Aggregate  16   10233

 Phenom x4   1    2768     926     927     695      63      32    7525    1807    1806
 3.0 GHz     2    5504    1815    1824    1388     126      64   14367    3570    3613
 32 Bit      4   10853    3596    3594    2758     249     125   27371    7162    7110
 DP         16   10960    3703    3701    2777     249     127   30629    7212    7177
 Aggregate  16   10903


   Core i7-4820K CPU 3.7 GHz mainly at 3.9 GHz Turbo Boost speed - 4 Cores, 8 Threads

        Threads  MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
                             1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

 Core i7     1    4663    1330    1328     977     132      65    4874    5855     986
 3.9 GHz     2    9323    2660    2657    1954     263     129    9764   11709    1965
 64 bit      4   17005    5271    5281    3875     446     247   12573   17569    3929
 SP          8   30080   10366   10214    7707     731     466   24948   23501    5033
 Aggregate   8   29839

 Core i7     1    4648    1331    1331     977     122      70    4720    5855     983
 3.9 GHz     2    9274    2661    2660    1945     243     140    9769   11717    1964
 64 bit      4   18078    5263    5229    3907     488     265   15620   17408    3929
 DP          8   30524   10321   10384    7657     733     494   25079   23559    5033
 Aggregate   8   30312

 Core i7     1    3663    1331    1330     938      95      42    4601    5852     950
 3.9 GHz     2    7312    2660    2658    1877     189      85    9200   11703    1868
 32 bit      4   14256    5275    5295    3731     377     163   15413   17216    3642
 SP          8   24612   10513   10417    7468     577     312   25897   23424    4728
 Aggregate   8   24463

 Core i7     1    3881    1330    1330     938      94      43    9749    5856    2345
 3.9 GHz     2    7762    2661    2660    1876     187      86   19489   11714    4686
 32 bit      4   14869    5275    5150    3738     373     164   29334   17585    7031
 DP          8   26022   10372   10266    7465     569     314   38564   23700    9383
 Aggregate   8   25910

 




MP MFLOPS Benchmark

This benchmark executes the same functions as my CUDA and OpenMP performance tests. Details and results can be found in linux_cuda_mflops.htm and OpenMP Speeds.htm. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The program checks for consistent numeric results, primarily to show that all calculations are carried out. This variation can be run using between 1 and 64 threads, the default being equal to the number of identified CPUs. Each thread carries out the same calculations but on a different segment of the data. The data size starts at 102400 words, rather than 100000, to ensure that each thread uses the same amount of data. For example, with 64 threads, each will use 1600 words or 6400 bytes.
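
The data segmentation and the eight operation formula can be sketched as follows. This is an illustration only, run serially rather than with real threads: the function names and the constant values used for a to f are hypothetical and are not taken from the benchmark source.

```python
def segment_bounds(total_words, threads):
    """Split total_words evenly; return one (start, end) index pair per thread."""
    words_each = total_words // threads
    return [(t * words_each, (t + 1) * words_each) for t in range(threads)]


def eight_ops_per_word(x, start, end, a, b, c, d, e, f):
    """Apply the 8 operation formula in place to x[start:end]."""
    for i in range(start, end):
        # 3 adds, 3 multiplies, 1 subtract and 1 final add = 8 operations.
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f


bounds = segment_bounds(102400, 64)
words_per_thread = bounds[0][1] - bounds[0][0]
print(words_per_thread, words_per_thread * 4)  # words and bytes per thread

# Each "thread" would process only its own segment; here just the first one,
# with placeholder constants chosen purely for illustration.
data = [1.0] * 102400
start, end = bounds[0]
eight_ops_per_word(data, start, end, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
```

With 102400 words the split is exact for every thread count up to 64 that is a power of two, whereas 100000 words would leave a remainder of 32 words with 64 threads.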

Two versions, MPmflops64 and MPmflops32, were compiled, the first involving the default SSE floating point instructions and the second using the original i87 functions. Speed of the 64-Bit version was so fast that a second 32-Bit benchmark, MPmflops32SSE, was produced. Results are logged in file MPMflopsLog.txt, with examples shown below. These show that the 64-Bit and 32-Bit SSE versions produce the same numeric results and the same speeds (within normal variations). Then the i87 program produces slightly different answers and much slower speeds.


 Phenom Results
##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:43 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    10000   0.091754    22321    0.764063   Yes
 Data in & out    1024000     2     1000   0.136134    15044    0.970753   Yes
 Data in & out   10240000     2      100   0.632075     3240    0.997008   Yes

 Data in & out     102400     8    10000   0.167023    49047    0.850923   Yes
 Data in & out    1024000     8     1000   0.176219    46488    0.982342   Yes
 Data in & out   10240000     8      100   0.658828    12434    0.998200   Yes

 Data in & out     102400    32    10000   0.558509    58670    0.660143   Yes
 Data in & out    1024000    32     1000   0.556450    58888    0.953631   Yes
 Data in & out   10240000    32      100   0.722131    45377    0.995203   Yes


##############################################

  32 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Fri May 20 12:57:17 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    10000   0.092236    22204    0.764063   Yes
 Data in & out    1024000     2     1000   0.135243    15143    0.970753   Yes
 Data in & out   10240000     2      100   0.638202     3209    0.997008   Yes

 Data in & out     102400     8    10000   0.164866    49689    0.850923   Yes
 Data in & out    1024000     8     1000   0.183847    44559    0.982342   Yes
 Data in & out   10240000     8      100   0.677530    12091    0.998200   Yes

 Data in & out     102400    32    10000   0.604816    54178    0.660143   Yes
 Data in & out    1024000    32     1000   0.613424    53418    0.953631   Yes
 Data in & out   10240000    32      100   0.756550    43312    0.995203   Yes


##############################################

  32 Bit MP i87 MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:59 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    10000   0.278444     7355    0.763849   Yes
 Data in & out    1024000     2     1000   0.287133     7133    0.970727   Yes
 Data in & out   10240000     2      100   0.673376     3041    0.997006   Yes

 Data in & out     102400     8    10000   0.625873    13089    0.851082   Yes
 Data in & out    1024000     8     1000   0.629958    13004    0.982363   Yes
 Data in & out   10240000     8      100   0.740114    11069    0.998204   Yes

 Data in & out     102400    32    10000   2.172758    15081    0.660653   Yes
 Data in & out    1024000    32     1000   2.186809    14984    0.953702   Yes
 Data in & out   10240000    32      100   2.236048    14654    0.995214   Yes
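
The MFLOPS column in the logs above is consistent with total floating point operations divided by elapsed time. The following check is a sketch based on that reading of the logged columns; the function name is hypothetical.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test line."""
    return words * ops_per_word * passes / seconds / 1e6


# First line of the 64 bit SSE log: 102400 words, 2 ops/word, 10000 passes.
sse_first = mflops(102400, 2, 10000, 0.091754)
# Seventh line of the 32 bit i87 log: 102400 words, 32 ops/word, 10000 passes.
i87_32ops = mflops(102400, 32, 10000, 2.172758)

print(round(sse_first), round(i87_32ops))
```

Both values agree with the logged 22321 and 15081 MFLOPS to within rounding.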
 




Comparison MP MFLOPS Benchmark

As previously, following are results of the 64-Bit and 32-Bit versions of the benchmark, using 1, 2, 4 and 16 threads, via the Atom, Core 2 Duo and Phenom CPUs. Performance of the single CPU tests is virtually the same as that from the OpenMP benchmark, as expected, using the same C statements but without OpenMP directives. Multiple processor tests are a little faster on the i87 version but significantly faster on the SSE varieties. The OpenMP compilation only produced SISD SSE instructions. The generated code for this MP MFLOPS benchmark not only used full SIMD functions but also clearly included linked add and multiply instructions to produce more than four results per clock cycle. Best case is the Core 2 Duo where up to six adds or multiplies were recorded per clock, per CPU. The Phenom shows the highest throughput here at 60 GFLOPS, with four cores at five results per clock cycle. Performance gains on the Atom again reflect Hyperthreading effects but some are more influenced by smaller cache sizes.

Numeric results of calculations are constant for a given number of repeat passes, but these are arranged to increase in proportion to the number of identified CPUs, to maintain similar running times. Rounding effects also produce slight differences on i87 and SSE versions. Default answers are shown below for systems with 2, 4, 8 and 16 cores.

Besides defining the number of threads, command line input parameters are available to use specified repeat passes, either to extend running time or to check for consistent answers.

Later results, provided below, are for a 3.7 GHz Core i7 4820K, normally running at Turbo Boost speed of 3.9 GHz. Maximum speed per CPU, for a 64 bit Single Precision SSE compilation, is 3.9 x 4 (register width) x 2 (linked multiply and add) = 31.2 GFLOPS. The benchmark was run as a stand alone test and that produced the same performance as a test with a single thread. In both cases, maximum speeds were around 24.5 GFLOPS. Then, it seems that 8 threads are needed to maximise performance at 92 GFLOPS. Both of these demonstrate some linking of multiply and add instructions. As usual, loading and saving data leads to lower performance with two operations per data word.

The benchmark was recompiled to use the AVX instructions, available on this Core i7. These use 256 bit registers, or SIMD on eight 32 bit numbers, with a potential maximum speed of 62.4 GFLOPS per core. The results demonstrate up to 46.5 GFLOPS using one core and 177.8 GFLOPS via 8 threads. These are from a compiled C program.
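
The peak figures quoted for the Core i7 follow from clock speed, register width and the linked multiply and add. The sketch below repeats that arithmetic and compares it with the measured single core speeds; the function and variable names are hypothetical.

```python
def peak_gflops_per_core(ghz, simd_lanes, ops_per_lane_per_cycle):
    """Theoretical single precision peak for one core."""
    return ghz * simd_lanes * ops_per_lane_per_cycle


# 3.9 GHz Turbo Boost, 4 lanes (128 bit SSE) or 8 lanes (256 bit AVX),
# 2 operations per lane per cycle from a linked multiply and add.
sse_peak = peak_gflops_per_core(3.9, 4, 2)
avx_peak = peak_gflops_per_core(3.9, 8, 2)

# Measured single core maxima quoted above, as a fraction of peak.
sse_fraction = 24.5 / sse_peak
avx_fraction = 46.5 / avx_peak

print(f"SSE peak {sse_peak:.1f} GFLOPS, measured {sse_fraction:.0%} of peak")
print(f"AVX peak {avx_peak:.1f} GFLOPS, measured {avx_fraction:.0%} of peak")
```

On these figures a single core reaches roughly three quarters of the theoretical maximum with both instruction sets.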


 Run Time Parameters

 t N or T N or Threads N       where N is between 1 and 64 
 r P or R P or repeats P or Repeats P for P Repeat Passes
 m T or M T or minutes T or Minutes T for T minutes burn-in test 

 Examples  ./MPmflops32 Threads 64  ./MPmflops64 T 4  ./MPmflops64  T 8, R 20000

             Atom 1.7 GHz             Core 2 Duo 2.4 GHz          Phenom X4 3.0 GHz
 Thds   1      2      4     16      1      2      4     16      1      2      4     16

 64 Bit SSE MFLOPS
 
 a2     800   1430   1501   1508   5545   8503   8581  12567   7237  13870  22321  25742
 b2     648    610    640   1396   3779   4290   8929   9374   4611   9135  15044  27084
 c2     660    629    624    628   1248   1248   1242   2192   2152   2819   3240   3649

 a8    1810   3372   3396   3405  13036  23704  23904  26636  13815  26435  49047  51692
 b8    1741   2486   2553   3304  10787  15437  23931  25331  13168  25751  46488  54633
 c8    1746   2536   2504   2528   5003   4970   4993   8546   7152  10816  12434  13898

 a32   1832   3530   3560   3577  14405  28155  28240  27827  15110  30093  58670  59810
 b32   1818   3502   3521   3535  14212  27492  28084  28577  14897  29867  58888  60311
 c32   1820   3504   3531   3535  13620  19913  19964  25607  14208  27760  45377  47678

 32 Bit i87 MFLOPS

 a2     204    327    369    369   1602   3568   3523   3185   1950   3841   7355   7535
 b2     201    354    361    364   1799   3136   3613   3618   1885   3804   7133   7686
 c2     202    358    363    363   1236   1252   1251   2048   1582   2505   3041   3240

 a8     303    557    565    567   3188   6346   6193   6278   3361   6676  13089  13363
 b8     301    550    564    565   3162   6280   6213   6304   3292   6648  13004  13404
 c8     302    556    566    565   3081   4988   4959   5860   3168   6211  11069  11382

 a32    404    777    794    794   3362   6696   6649   6704   3813   7613  15081  15175
 b32    403    777    790    784   3357   6680   6645   6689   3775   7566  14984  15197
 c32    403    776    790    785   3338   6628   6592   6620   3715   7411  14654  14848
 ---------------------------------------------------------------------------------------

3.7 GHz Core i7 4820K

 MFLOPS 1 to 8 Threads

    4 Byte  Ops/  Repeat     SSE  ------- SSE -------  ------- AVX -------
     Words  Word  Passes   1 CPU      1      4      8      1      4      8

    100000     2    2500    9918   9681  45340  54621  12542  62273  60258
   1000000     2     250    9688   9759  21688  41832  11404  23031  44329
  10000000     2      25    5870   5990   9237  10026   5991   8970   9977

    100000     8    2500   24448  24533  49320  92086  35982 159040 173224
   1000000     8     250   24465  24570  49918  92352  36180  80096 151909
  10000000     8      25   20055  19975  36638  39982  23299  40124  40153

    100000    32    2500   23251  23269  46942  92408  46400  90572 173372
   1000000    32     250   23265  23307  89676  93282  46572  91058 177831
  10000000    32      25   23063  23052  91029  92050  44729  88877 158594
 -------------------------------------------------------------------------------------

 Numeric Results

 Repeats        5000               10000               20000               40000
 Version       SSE       i87       SSE       i87       SSE       i87       SSE       i87

 a2       0.867359  0.867238  0.764063  0.763849  0.620974  0.620631  0.481454  0.481096
 b2       0.985193  0.985180  0.970753  0.970727  0.942935  0.942883  0.891302  0.891203
 c2       0.998502  0.998501  0.997008  0.997006  0.994032  0.994027  0.988125  0.988114

 a8       0.918220  0.918307  0.850923  0.851082  0.749971  0.750239  0.635325  0.635706
 b8       0.991084  0.991095  0.982342  0.982363  0.965360  0.965401  0.933325  0.933397
 c8       0.999099  0.999101  0.998200  0.998204  0.996409  0.996416  0.992853  0.992862

 a32      0.798973  0.799276  0.660143  0.660653  0.498060  0.498797  0.385106  0.384777
 b32      0.976383  0.976422  0.953631  0.953702  0.910573  0.910709  0.833458  0.833707
 c32      0.997595  0.997602  0.995203  0.995214  0.990447  0.990463  0.981037  0.981068

 Key - Words a=102400, b=1024000, c=10240000 - Operations per word 2, 8 and 32




MP MFLOPS Burn-In Test

As the benchmarks generated exceptionally high speeds from a single program, it was decided to include a burn-in/reliability test function. This is initiated by including a “Minutes” input parameter. This test just uses the 32 operations per word, 102400 word procedures, with an initial calibration run to identify the number of repeat passes to generate four results per minute.
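As a rough illustration of the calibration step, the sketch below scales a short trial run to the repeat passes needed for four results per minute, and derives MFLOPS from words, operations per word, passes and seconds. The program's actual calibration code is not shown here, so `calibrate_passes` and its trial-run inputs are assumptions; the MFLOPS formula is checked against the first line of the Phenom log in this section.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one timed test."""
    return words * ops_per_word * passes / seconds / 1e6


def calibrate_passes(trial_passes, trial_seconds, results_per_minute=4):
    """Scale a short trial run up to one result every 60/results_per_minute seconds.

    Hypothetical helper: assumes run time is proportional to repeat passes.
    """
    target_seconds = 60.0 / results_per_minute
    return max(1, round(trial_passes * target_seconds / trial_seconds))


# A trial of 1000 passes taking 0.5 seconds suggests 30000 passes per result
print(calibrate_passes(1000, 0.5))

# First line of the quad core Phenom log: 266688 passes in 14.334887 seconds
print(round(mflops(102400, 32, 266688, 14.334887)))

# Core 2 Duo laptop after the CPU dropped to a lower clock speed
print(round(mflops(102400, 32, 94075, 33.148504)))
```

With a fixed number of passes, a drop in clock speed shows up as longer seconds and lower MFLOPS on later lines, which is how the laptop log reveals throttling.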

The first results below are for the quad core Phenom. Overall throughput and CPU temperatures were almost identical to those running four copies of the BurnInSSE program. See - Linux burn-in apps.htm. The second results are from running on a 1.83 GHz Core 2 Duo based laptop. As with the earlier burn-in apps results, the CPU switched to lower GHz CPU speeds when the CPU core temperatures reached around 95°C.


 Command ./MPmflops64 Minutes 2

 ##############################################

 Reliability Test around 2 Minutes

 4 CPUs Available
 ##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Fri May 20 12:41:07 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400    32   266688  14.334887    60962    0.352168   Yes
 Data in & out     102400    32   266688  14.334628    60963    0.352168   Yes
 Data in & out     102400    32   266688  14.506037    60243    0.352168   Yes
 Data in & out     102400    32   266688  14.400784    60683    0.352168   Yes
 Data in & out     102400    32   266688  14.354242    60880    0.352168   Yes
 Data in & out     102400    32   266688  14.418992    60606    0.352168   Yes
 Data in & out     102400    32   266688  14.536283    60117    0.352168   Yes
 Data in & out     102400    32   266688  14.499469    60270    0.352168   Yes
 Data in & out     102400    32   266688  14.583635    59922    0.352168   Yes

               End of test Fri May 20 12:43:18 2011

##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 2 Threads, Sat May 21 16:45:59 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400    32    94075  14.361988    21464    0.352474   Yes
 Data in & out     102400    32    94075  14.255409    21624    0.352474   Yes
 Data in & out     102400    32    94075  23.311603    13224    0.352474   Yes
 Data in & out     102400    32    94075  33.148504     9300    0.352474   Yes
 Data in & out     102400    32    94075  33.139577     9302    0.352474   Yes
 Data in & out     102400    32    94075  33.111674     9310    0.352474   Yes
 Data in & out     102400    32    94075  33.140281     9302    0.352474   Yes
 Data in & out     102400    32    94075  33.586864     9178    0.352474   Yes
 Data in & out     102400    32    94075  14.385304    21429    0.352474   Yes
 Data in & out     102400    32    94075  14.276475    21593    0.352474   Yes

               End of test Sat May 21 16:50:07 2011
 




MP Memory Speed

This is based on my original MemSpeed benchmark. It employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 or 64 bit integers via two data arrays:

   Result to memory     x[m] = x[m] + s * y[m]     
   Sum to memory        x[m] = x[m] + y[m]         
   Memory to memory     x[m] = y[m]                
 

Add is used instead of multiply for the first integer tests. Memory tested doubles up from 4 KB to 25% of RAM size, to use all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For tests using two arithmetic operations, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision. The C programming calculations are identical to those used in an OpenMP version. See - OpenMP Speeds.htm.
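The MB/second to MFLOPS rule above can be written out as below. This is a sketch under the assumption that the reported MB/second counts both arrays read (x[m] and y[m]); the one operation case (x[m] = x[m] + y[m]) is an extension of the stated rule, not something the text gives directly, and the function name is illustrative.

```python
def mflops_from_mb_per_sec(mb_per_sec, bytes_per_word, ops_per_element=2):
    """Convert a data reading speed to MFLOPS.

    Assumes two words are read per array element (x[m] and y[m]).
    """
    words_read = mb_per_sec / bytes_per_word    # millions of words per second
    elements = words_read / 2                   # two words read per element
    return elements * ops_per_element


# Two operation test, x[m] = x[m] + s * y[m]: MB/second divided by 8 or 4
print(mflops_from_mb_per_sec(65111, 8))   # double precision
print(mflops_from_mb_per_sec(33405, 4))   # single precision

# One operation test, x[m] = x[m] + y[m] (assumed extension of the rule)
print(mflops_from_mb_per_sec(85910, 8, ops_per_element=1))
```

The example speeds are Phenom four thread results from the 64 bit table in this section.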

The execution files are MPmemspeed32 and MPmemspeed64. The 32 bit version uses the old i87 floating point instructions and 32 bit integers. The other, as expected, compiles to use SSE instructions, but these are the slow SISD variety. It is also coded to use 64 bit integers. There is an input parameter to use 1, 2, 4, 8, 16, 32 or 64 threads, the default being the number of identified CPUs, possibly rounded up. Again, each thread carries out the same calculations but on a different segment of the data. Results are saved in memSpeedMP.txt, examples below being for the 3.0 GHz quad core Phenom. The data and calculations on each array element are identical. The program checks results for consistency and reports any errors.
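The arrangement described above, with each thread running the same calculation on its own segment followed by a consistency check, might be sketched as follows. This is only an illustration of the data partitioning in Python, not the benchmark's C code, and it will not show the speed gains of native threads; all names are hypothetical.

```python
import threading


def triad_segment(x, y, s, start, stop):
    """x[m] = x[m] + s * y[m] over one thread's segment of the arrays."""
    for m in range(start, stop):
        x[m] = x[m] + s * y[m]


def run_threads(n_words, n_threads, s=2.0):
    # Identical data in every element, so every result should be the same
    x = [1.0] * n_words
    y = [3.0] * n_words
    seg = n_words // n_threads
    threads = []
    for t in range(n_threads):
        start = t * seg
        # The last thread also takes any words left over by the division
        stop = n_words if t == n_threads - 1 else start + seg
        th = threading.Thread(target=triad_segment, args=(x, y, s, start, stop))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()
    # Consistency check: count elements that differ from the first result
    errors = sum(1 for v in x if v != x[0])
    return x[0], errors


first_result, errors = run_threads(1024, 4)
print(first_result, "No errors found" if errors == 0 else "%d errors" % errors)
```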

Below are 64-Bit and 32-Bit results on the 3.0 GHz quad core Phenom using four threads. These show that the SSE floating point speeds are somewhat faster than tests using i87 instructions, except where performance becomes dependent on memory speed. MB/second rates using 64-Bit integers can be much faster than at 32-Bits, firstly, as the CPU can execute both types of instructions at the same speed and, secondly, as more registers are available for optimisation. The number of measurements at 32-Bits is limited as the full 8 GB of RAM cannot be recognised.


     get_phys_pages() and size - RAM Size  7.81 GB

     MP Memory Reading Speed Test 64 Bit Version 1 Using 4 Threads

               Start of test Tue Jun  7 11:32:54 2011

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   18253  12913  18066  18667  14409  27989  14221  11201  14643
       8   26068  17651  25448  29410  20679  39706  21694  16463  22343
      16   39834  25431  38289  47167  29614  57648  21134  21446  35492
      32   50545  29588  47840  65341  37153  76190  46231  24825  45564
      65   57307  31898  53119  71962  40593  86253  48212  25858  48779
     131   64285  33405  56454  83929  42824  93889  51601  26317  52109
     262   65111  32902  58563  85910  43904  96199  52272  26517  52156
     524   58699  32056  53683  67177  39149  66647  44137  26617  44123
    1048   59967  32531  53638  67332  39808  67046  43401  26310  44172
    2097   48409  31709  51453  59630  37829  59008  32561  25079  32687
    4194   36529  27079  37052  37380  32077  37280  18682  18694  18732
    8388   27898  21163  25293  27011  23273  27070  14253  12800  13768
   16777    9006   8909   8869   8978   8806   9023   4488   4462   4516
   33554    8946   8875   8887   8606   8855   8921   4525   4497   4508
   67108    8717   8458   8325   8516   8452   8756   4287   4366   4379
  134217    8688   8339   8362   8696   8473   8698   4276   4357   4355
  268435    8703   8608   8516   8659   8393   8648   4280   4268   4328
  536870    8700   8591   8421   8673   8514   8690   4308   4290   4264
 1073741    8596   8471   8584   8628   8619   8698   4398   4329   4395
 2147483    8825   8790   8835   8790   8763   8842   4397   4402   4468

                           No errors found

                End of test Tue Jun  7 11:33:52 2011

##############################################

     get_phys_pages() and size - RAM Size  3.20 GB

     MP Memory Reading Speed Test 32 Bit Version 1 Using 4 Threads

               Start of test Tue Jun  7 14:23:02 2011

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   15704  11347  10961  17813  12518  15904  13744   8714   8758
       8   24188  15367  14929  26770  17870  21025  20789  10866  10234
      16   33319  19229  18266  38724  23589  23124  31390  13114  13157
      32   40697  20675  21180  51120  27260  25282  39385  13921  13960
      65   45013  22913  22267  57143  30132  24875  42247  14314  14241
     131   45569  23573  22953  61979  31356  27585  44688  14427  13289
     262   48701  23759  22666  63235  32103  27892  44447  14200  14453
     524   44900  22996  20417  53167  30753  25832  36085  14671  13403
    1048   44929  23357  20300  54596  30302  25790  36207  14708  13590
    2097   42017  22864  20927  42429  28809  24778  26734  13125  12659
    4194   34909  20379  19542  36402  25268  21093  18592  12625  12821
    8388   22498  17592  17006  23354  19577  18854  12489   9400   9657
   16777    8906   8697   8781   8884   8841   8844   4433   4217   4440
   33554    8848   8684   8606   8877   8436   8843   4412   4293   4422
   67108    8423   8445   8433   8685   8506   8526   4228   4296   4273
  134217    8704   8453   8572   8563   8426   8485   4383   4303   4346
  268435    8623   8579   8539   8731   8652   8612   4408   4301   4322
  536870    8683   8331   8534   8724   8658   8444   4371   4330   4325

                           No errors found

                End of test Tue Jun  7 14:24:05 2011
 




MP Memory Speeds Comparison

Below are 64-Bit results on the 3.0 GHz quad core Phenom and 2.4 GHz Core 2 Duo for the multiply and add tests using 1 CPU, all CPUs and with 64 threads. On the single thread tests, although the speeds are dependent on CPU GHz, variations generally reflect cache sizes. The full gain in throughput through using more than one CPU is not achieved at the lower data sizes, mainly due to higher overheads. For example, at 4 KB there are two arrays of 2 KB, producing a segment of 32 bytes for each of 64 threads. There are significant additional performance gains using an increasing number of threads with mid to large data sets. This is due to the relatively small data segments being repetitively processed from a lower level cache.
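The segment size example above (4 KB, two arrays, 64 threads) generalises as in this small sketch, which assumes the memory used is split equally between the two arrays and then equally between threads. The function name is illustrative.

```python
def segment_bytes(total_kbytes, threads, arrays=2):
    """Bytes of each array handled by one thread."""
    return total_kbytes * 1024 // arrays // threads


for kbytes in (4, 64, 1024, 16384):
    print("%6d KB: %8d bytes per thread with 4 threads, %7d with 64"
          % (kbytes, segment_bytes(kbytes, 4), segment_bytes(kbytes, 64)))
```

Comparing the per thread figure with the L1 and L2 cache sizes listed for each CPU indicates where a segment becomes small enough to be reprocessed from a faster cache.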

Next are results on the Netbook with an Atom CPU running via 64-Bit Ubuntu 11.04. It can be seen that Hyperthreading provides significant gains in throughput using floating point instructions.

Later are speeds obtained on a Core i7 4820K, with Hyperthreading on 4 cores, normally running at 3.9 GHz Turbo Boost speed. It has 32 GB RAM on 4 memory channels, with a maximum speed of 51.2 GB/second. The original benchmark again shows slower MP speeds via data in L1 cache, with maximum via other caches. Then MP speed using RAM was quite good at up to 25.4 GB/second. As indicated earlier, only SISD instructions were used, where a single thread only produced up to one result per clock cycle (3.9 GFLOPS), with MP speed 3.8 times faster.

Results for a recompiled version, using AVX directives, are also shown. Of special note, performance of the single threaded AVX version is often worse than that without AVX. Full SIMD AVX instructions are implemented, but there are numerous extra instructions used, such as shuffle, unpack and insert (4 vector multiply, 4 vector add, 80 other vector instructions), possibly needed to allow for an unknown number of threads. At least, the multithreaded speeds can be four times that of a single threaded run and twice as fast using RAM based data.


 Quad Core Phenom - Caches L1 64 KB/CPU, L2 512 KB/CPU, L3 6144 KB shared

              Commands      ./MPmemspeed64  Threads 1
                            ./MPmemspeed64
                            ./MPmemspeed64  Threads 64

             1 thread             4 threads            64 threads

    KBytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

         4  16834   8588  16753  18253  12913  18066  17040  12902  21602
         8  17196   8671  16039  26068  17651  25448  30712  18072  31164
        16  17188   8670  18148  39834  25431  38289  34169  21795  41192
        32  17314   8703  16669  50545  29588  47840  39691  21845  35582
        65  15023   8634  15211  57307  31898  53119  45213  25520  52872
       131  15274   8155  13675  64285  33405  56454  53787  29396  58835
       262  15335   8143  13508  65111  32902  58563  56811  31077  59534
       524  14512   8013  13242  58699  32056  53683  62895  32310  58189
      1048  10911   7355  10791  59967  32531  53638  64335  32520  60720
      2097  10909   7350  10784  48409  31709  51453  65249  32553  59855
      4194  10561   7169  10411  36529  27079  37052  65800  32975  57665
      8388   6642   5610   6315  27898  21163  25293  57438  30135  52237
     16777   6128   5410   5853   9006   8909   8869  57345  30844  50350
     33554   6311   5427   5677   8946   8875   8887  54636  30033  49051
     67108   5789   5160   5698   8717   8458   8325  31347  24887  30860
    134217   5969   5138   5908   8688   8339   8362  26846  19845  26308
    268435   5922   5449   5779   8703   8608   8516   8510   8359   8558
    536870   6121   5090   5811   8700   8591   8421   8638   8525   8387
   1073741   6020   5481   5832   8596   8471   8584   8569   8334   8422
   2147483   6264   5630   6028   8825   8790   8835   8834   8699   8625


 Core 2 Duo - Caches L1 32 KB/CPU, L2 4096 KB shared

             1 thread             2 threads            64 threads

    Kbytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

         4  12572   6264  12613  19400  10834  20763   6272   4371   8553
         8  12534   6337  12613  22252  11696  22188   5644   3760   6362
        16  12639   6364  12692  23343  12208  23485  10681   6790  15783
        32  12468   6318  12483  24406  12409  24317  13725   8918  15230
        65   9434   5733   9453  24573  12548  24520  17721  11117  18778
       131   9584   5770   9612  18347  11708  17212  21622  11731  22935
       262   9645   5824   9648  18389  11700  18191  23737  11938  23593
       524   9674   5834   9667  18286  11635  18153  24032  12329  24204
      1048   9696   5843   9684  18324  11663  18206  24579  12484  24682
      2097   9548   5777   9623  18326  11593  18203  24377  12261  24413
      4194   8188   5520   8302  13296  10585  13927  17971  11380  17734
      8388   4381   4219   4407   4701   4597   4664  17905  11450  17698
     16777   3788   3830   3847   3948   3921   3944  17803  11413  17646
     33554   3817   3827   3806   3903   3868   3893  17580  11294  17428
     67108   3845   3872   3856   3908   3888   3917  16531  10438  14364
    134217   3876   3856   3798   3886   3918   3922   9007   8009   9152
    268435   3885   3894   3889   3893   3885   3882   4092   4088   4102
    536870   3827   3816   3829   3923   3922   3918   3900   3893   3887


 Atom 1 CPU with Hyperthreading - Caches L1 24 KB, L2 512 KB
 get_phys_pages() and size - RAM Size  0.96 GB, Page Size 4096 Bytes 
 uname() - Linux, roy-Ubuntu-11, 2.6.38-8-generic 
 #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011, x86_64

             1 thread             2 threads            64 threads

    Kbytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

         4   2112   1149   5817   3849   2085   6203   3329   1896   5237
         8   2117   1148   5922   3254   2106   6284   3519   1939   5653
        16   2113   1150   5938   3870   2085   6307   3608   1970   5882
        32   1742   1050   3764   3182   1879   4926   3682   1935   6005
        65   1770    869   3863   3157   1908   4728   3571   1975   5912
       131   1804   1050   3869   3173   1905   4652   3619   1992   6052
       262   1802   1043   3833   3181   1900   4711   3597   1977   6104
       524   1731   1017   3575   3021   1838   4440   3633   2002   6000
      1048   1656    983   2622   2668   1746   2656   3035   1787   4710
      2097   1652    962   2222   2087   1769   2059   3043   1822   4516
      4194   1655    946   2123   1969   1755   1950   3023   1807   4528
      8388   1538    981   2128   1958   1747   1956   2956   1788   4483
     16777   1571    979   2118   1948   1769   1948   2788   1711   3364
     33554   1661    969   2119   1957   1760   1921   1978   1652   2170
     67108   1606    986   2176   1930   1762   1951   1763   1644   1929
    134217   1660    975   2149   1932   1747   1966   1882   1672   1692

 ------------------------------------------------------------------------

3.7 GHz Core i7 4820K

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 64 Bit SSE 1 Thread

       8   30422  15420  27836  40569  20504  34929  19735   9939  19614  L1
      16   30754  15503  28069  41065  20688  35335  19977  10011  19895
     131   28955  15286  27122  34323  20476  30926  20086  10078  20069  L2
     262   28741  15287  27017  33579  20373  30875  19760  10060  19758
    2097   24424  15207  23963  26358  19342  25851  14665   9648  14824  L3
    4194   24408  14253  23951  26355  19366  25531  14655   9334  14821
  536870   14386  11824  14302  14704  13442  14732   7652   7426   7416  RAM
 1073741   14452  11468  14715  14861  13439  15189   7348   7828   7394

 Max GFLOPS  3.8    3.9           2.6    2.6

 64 Bit SSE 8 Threads

       8   52063  33134  52441  60254  36470  57610  45343  30815  45180  L1
      16   69122  44818  65876  82924  46075  69534  57760  36707  57120
     131  115996  53402 102715 140036  76202 116053  68671  35659  72891  L2
     262  113644  60777 104488 132590  81609 111435  72205  37232  72061
    2097   95433  58470  99412 115476  72032 109176  60839  36350  56557  L3
    4194   98608  57900 102912 105228  78041 106928  59517  36122  58749
  536870   25054  24707  24623  24592  25130  25430  12805  11899  11850  RAM
 1073741   25402  25735  24886  25412  25128  24711  12662  12367  12617

 Max GFLOPS 14.5   15.2           8.8   10.2

 64 Bit AVX 1 Thread

       8   16874  10136  29980  60061  59981  73657  39651  39127  39645  L1
      16   16901  10137  30260  61288  61171  78154  40569  40385  40608
     131   16891  10113  28484  48351  48134  47964  29242  29294  29285  L2
     262   16845  10094  27490  45215  44739  46562  27383  27371  27377
    2097   16725   9985  23943  30302  30323  30669  16994  16576  17163  L3
    4194   16549   9912  23767  30294  30231  30720  16977  16461  17136
  536870   11805   8651  14877  13817  13983  14446   7545   7081   7203  RAM
 1073741   12168   8818  14636  14692  14973  14163   7329   7393   7202

 Max GFLOPS  2.1    2.5           3.8    7.6

 64 Bit AVX 8 Threads

       8   45769  32166  52015  76043  69787  61836  64106  57154  59198  L1
      16   53369  37318  67629 126171 116852  72240  83390  81532  80529
     131   64036  39272 110665 216488 221720 269585 149667 148116 148231  L2
     262   67037  39317 114213 194473 193210 203301 115279 114005 118131
    2097   65163  37914 102215 115502 125560 127970  65827  69361  69114  L3
    4194   62412  40752  94501 123285 110270 118960  63359  64405  64734
  536870   24650  23942  25309  25072  25428  25601  12789  12471  12756  RAM
 1073741   25124  24729  24310  24854  24576  25209  12612  12415  12524

 Max GFLOPS  8.6   10.2          14.4   27.7




MP Memory Bus Speed

MPbusspeed64 and MPbusspeed32 are based on my old BusSpeed2K Benchmark and are essentially the same as the Windows Multithreading Version. Data is read using AND instructions at a range of data sizes covering caches and RAM. The program starts by reading words with 32 word address increments, then reduces the increment to eventually read all words sequentially. Speed reductions of around 50% at each higher increment suggest reading in bursts over the bus. This is normal for reading from RAM and is sometimes found reading cached data. The final results use SSE2 integer AND instructions to read the data into the 16 byte register, the 32 bit and 64 bit procedures using the same assembly code. Note that, except for SSE2 tests, the CPU can execute 64 bit integer calculations at the same speed as those at 32 bits, resulting in higher MB/second at 64 bits.
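The reading pattern described above can be pictured with the sketch below, which ANDs together every 32nd word, then every 16th, and so on down to every word. It is an illustration in Python rather than the benchmark's own code, and the function names and test data are made up for the example.

```python
INCREMENTS = (32, 16, 8, 4, 2, 1)    # Inc32wds ... Inc2wds, ReadAll


def and_read(data, inc):
    """AND together every inc-th word, starting from the first."""
    result = 0xFFFFFFFFFFFFFFFF
    for i in range(0, len(data), inc):
        result &= data[i]
    return result


def words_read(n_words, inc):
    """Number of words actually touched at a given address increment."""
    return len(range(0, n_words, inc))


# 6 KB of 64 bit words, all holding the same bit pattern
data = [0x0F0F0F0F0F0F0F0F] * 768

for inc in INCREMENTS:
    print("Increment %2d: %3d words read, result %016X"
          % (inc, words_read(len(data), inc), and_read(data, inc)))
```

At an increment of 32 only 24 of the 768 words are used, so where the hardware fetches whole bursts, most of the data moved over the bus does not count towards the reported MB/second.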

Again there is an input parameter to use 1, 2, 4, 8, 16, 32 or 64 threads, the default being the number of identified CPUs. This time, each thread reads all the data. Results are saved in busSpeedMP.txt, examples below being for the 3.0 GHz quad core Phenom. Using L1 and L2 caches, data transfer speed with 64 bit integers is around twice as fast as using 32 bit numbers, suggesting a CPU speed limitation. From burst reading, estimates of maximum RAM speed are 904 x 8 = 7232 MB/second and 448 x 16 = 7168 MB/s. Cache examples are - L2 2989 x 8 = 23912 MB/s and L3 1432 x 8 = 11456 MB/s.
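The burst reading estimates above multiply the speed measured at a given increment by the number of words in that increment. The sketch below repeats the arithmetic; that the multipliers of 8 (64 bit words) and 16 (32 bit words) both correspond to 64 byte bursts is my reading of the figures, not something stated in the results.

```python
def burst_estimate(mb_per_sec, increment_words):
    """Estimated bus speed if a whole burst is fetched for each word read."""
    return mb_per_sec * increment_words


def burst_bytes(increment_words, bytes_per_word):
    """Burst size implied by the chosen increment (an inferred quantity)."""
    return increment_words * bytes_per_word


print("RAM 64 bit Inc8wds ", burst_estimate(904, 8), "MB/second")
print("RAM 32 bit Inc16wds", burst_estimate(448, 16), "MB/second")
print("L2  64 bit Inc8wds ", burst_estimate(2989, 8), "MB/second")
print("L3  64 bit Inc8wds ", burst_estimate(1432, 8), "MB/second")
print("Implied burst sizes", burst_bytes(8, 8), "and", burst_bytes(16, 4), "bytes")
```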

Later additions are for a 3.7 GHz Core i7 4820K CPU, normally working at Turbo Boost speed of 3.9 GHz, with 4 cores and Hyperthreading, plus a 10 MB L3 cache. The system has four memory channels, with maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width) or 51.2 GB/second. With the original benchmark, the same data is shared by all threads, with each starting reading at the beginning. With this arrangement, and a large L3 cache, multiple threads can read from there, leading to what seems like impossible memory data transfer speeds. For Version 2 of the benchmarks [MPbusspeed64V2 and 32V2], data is still shared but each thread is started at a different point (example 4 threads, 4 starting points 25% apart). The result is more realistic memory speeds but higher overheads affecting early tests. The i7 results also further demonstrate that multiple threads are required to produce speeds approaching the maximum specification. The benchmarks and source code are included in the tar.gz file.
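The difference between the two versions' starting points might be sketched as below. Version 1 starts every thread at the first word; Version 2 spaces the starts evenly through the shared data. That a thread wraps round to the beginning after reaching the end is an assumption made for this illustration, as are the function names.

```python
def start_offsets(n_words, n_threads, staggered):
    """Starting word for each thread: all zero (V1) or evenly spaced (V2)."""
    if not staggered:
        return [0] * n_threads
    return [t * n_words // n_threads for t in range(n_threads)]


def read_order(n_words, start):
    """Order in which one thread visits all words (wrap round is assumed)."""
    return [(start + i) % n_words for i in range(n_words)]


print("Version 1:", start_offsets(1000, 4, staggered=False))
print("Version 2:", start_offsets(1000, 4, staggered=True))

# Each thread still reads every word once, whatever its starting point
order = read_order(8, 6)
print(order, sorted(order) == list(range(8)))
```

With staggered starts, threads are less likely to be reading the same cache lines at the same moment, which is consistent with the more realistic RAM speeds reported for Version 2.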

Note that Version 1 is still valid, representing an application where multiple threads search data for different values. In order to show that the latest results are representative of maximum RAM speeds, eight copies of BusSpeed RAM reliability tests [IntBurn64] were run at the same time. Results are shown below.


                    3.0 GHz quad core Phenom

 MP Bus Speeds 64 bit Version 1.0, 1 Threads, Fri Jun 17 16:51:46 2011

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    22196    22641    26239    26979    26580    26409    23762
       24    23507    23118    27665    27323    27081    25424    23850
       96     2959     2964     2989     5987    11983    21134    23868
      384     2917     2917     2869     5853    11732    21898    23264
      768     1362     1359     1352     2699     5408    10617    10803
     1536     1322     1293     1432     2856     5764    11098    12081
    16380      862      886      902     1745     3637     6019     7249
   131070      854      885      902     1777     3431     5853     6619
   393210      858      830      904     1757     3602     5995     7074

            64 bit words, Speed in MB/Second - MIPS divide by 8

                   End at Fri Jun 17 16:52:21 2011

 
 MP Bus Speeds 32 bit Version 1.0, 1 Threads, Fri Jun 17 16:45:12 2011

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     8318    12102    12848    13317    13413    13452    23033
       24     9876    12979    13877    13288    13436    13640    23876
       96     1495     1496     2675     5979    11081    13201    23852
      384     1209     1238     2454     4946     8816    12459    18874
      768      721      726     1480     3046     5638     9741    11846
     1536      699      699     1513     3032     5708     9722    11924
    16380      413      423      860     1805     2993     5022     7075
   131070      411      444      841     1793     3024     4881     7060
   393210      427      448      887     1826     3052     4946     6897

            32 bit words, Speed in MB/Second - MIPS divide by 4

                   End at Fri Jun 17 16:45:47 2011
 

              3.7 GHz Core i7 - 4 Cores, 8 Threads, Original
            Memory Maximum Speed Specification 51.2 GB/second

  MP Bus Speeds 64 bit Version 1.0, 8 Threads, Sun Nov 23 10:37:14 2014

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    91241    82686    89258   121404   128237   134549   232859 L1 x 4
       24    90260    96688    94783   131484   133668   150865   236821
       96    45332    45942    46346    67232   107906   121757   219034 L2 x 4
      384    20465    28119    29021    51632    76028   119540   157096 L3 x 1
      768    20462    25185    19519    37238    66783   110456   153193
     1536    19354    22512    21843    35804    71371   112147   142987
    16380     6771     8432    10752    16948    40643    73808    65614 RAM
   131070     4030     5140     5916    12464    25868    42665    56575
   393210     3182     3971     5796    11936    24963    49436    52118

            64 bit words, Speed in MB/Second - MIPS divide by 8

                   End at Sun Nov 23 10:37:44 2014


              3.7 GHz Core i7 - 4 Cores, 8 Threads, Revised

  MP Bus Speeds 64 bit Version 2.0, 8 Threads, Sun Nov 23 10:45:05 2014
   Same as Version 1.0, except each thread starts at different address

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    12369    12496    21021    39879    87865   105814   182821
       24    45987    46458    77473    88435   123606   122993   260009
       96    39888    43919    48699    79316   117534   139709   242467
      384    24258    19148    25700    46969    86177   117704   163429
      768    20185    22569    19117    37035    80091   103124   175760
     1536    20967    21061    20910    39814    77704   116114   156833
    16380     5780     6537    10217    18113    30917    79641    80375
   131070     3073     3818     4822    10452    20528    38770    46567
   393210     2348     3147     4793    11090    20306    39280    38707
   786420     2152     3062     4834    10111    19119    37950    39038
  1572840     2061     2987     4794     9738    19272    37597    38428

            64 bit words, Speed in MB/Second - MIPS divide by 8

                   End at Sun Nov 23 10:46:09 2014

                   8 BusSpeed Programs 40 MB Each

  4891 + 4992 + 4917 + 4854 + 4772 + 4840 + 4908 + 4933 = 39037 MB/second 
 




MPbusSpeed Comparisons

Below are cache and RAM speeds obtained on the Quad Core 3.0 GHz Phenom and 3.7 GHz Core i7, the 2.4 GHz Core 2 Duo and the 1.66 GHz Atom. These are for 1, 2 or 4 (based on identified CPUs) and 64 threads. Comparison of performance gains using multiple threads and 64 bit compilations versus 32 bit are best limited to the last two columns as the earlier burst reading tests produce all sorts of timing peculiarities. Although measured RAM speeds could be improved with multiple threads reading shared data from caches, those provided have mainly been confirmed with programs using dedicated data.

Core i7 - results for 8 threads included, often slower than with 4 threads. RAM speed at 3864 MB also shown, to show possible effects of large L3 cache.

Phenom - some performance gains from cached data were only 3 times with 4 threads but nearer 4 times with 64 threads. Using 4 or 64 threads can double throughput on memory bus. Maximum measured speed was 15.8 GB/second, compared with specification of 21.3 GB/second, comprising 667 MHz x 2 (DDR) x 8 (bus width) x 2 (dual channel).

Core 2 Duo - generally achieved peak performance using two threads and, if anything, was slightly slower using a concurrency of 64. Compared with the Phenom, some tests came out much faster and others significantly slower.

Atom - Hyperthreading had little impact on performance via L1 cache but did provide improvements via L2 cache and RAM.
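The memory specification figures quoted for the Phenom above, and for the Core i7 in the MP Memory Bus Speed section, use the same product of bus speed, DDR factor, bus width and channels. A small sketch of that calculation follows, with the measured Phenom maximum expressed as a fraction of specification; the function name is illustrative.

```python
def peak_mb_per_sec(bus_mhz, channels, bus_bytes=8, transfers_per_clock=2):
    """Specification speed: MHz x 2 (DDR) x bus width in bytes x channels."""
    return bus_mhz * transfers_per_clock * bus_bytes * channels


phenom = peak_mb_per_sec(667, channels=2)
core_i7 = peak_mb_per_sec(800, channels=4)

print("Phenom  %.1f GB/second" % (phenom / 1000))
print("Core i7 %.1f GB/second" % (core_i7 / 1000))

# Phenom maximum measured speed of 15.8 GB/second against specification
print("Phenom measured %.0f%% of specification" % (100 * 15800 / phenom))
```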

L1 Cache Results in MBytes/Second - 6 KB

                CPUs/   MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs         32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b      4/4  3700   14199   15008   18856   21606   21604   20462   61577
 4 Threads                    44729   50128   67114   79464   82422   80118  200833
 8 Threads                    43681   49006   68593   82414   84016   90261  357561
 64 Threads                   53262   55670   82638   90202   93360   93827  347210

 Core i7 V2 32b   4/4  3700   11417   13524   17525   19258   19598   18921   61579
 4 Threads                    10372   17363   38891   43445   50634   62156  238316
 8 Threads                     5470    8950   20258   35494   40223   63398  280884
 64 Threads                     868    1319    2922    5015    9176   18896   38113

 Phenom II 32b    4/0  3000    8318   12102   12848   13317   13413   13452   23033
 4 Threads                     3901    7614   14703   28644   29313   34882   74424
 64 Threads                   10098   14599   20588   30042   32606   37702   76707

 Core 2 Duo 32b   2/0  2400    8069    8772    9036    9283    9369    9390   37413
 2 Threads                    13921   16387   16361   16996   18147   18380   61694
 64 Threads                   15315   16271   17162   17587   17848   18004   60044

 Atom N455 32b    1/1  1667    5092    5813    5959    6272    6290    6289   24663
 2 Threads                     5638    6157    6323    6450    6439    6488   25608
 64 Threads                    5011    5655    5609    5785    5867    5610   22419

 64 Bit Version

 Core i7 64b      4/4  3700   31210   31251   31251   42437   43392   43667   61539
 4 Threads                    76609   51101   75602  140546  104501  167163  205782
 8 Threads                    91241   82686   89258  121404  128237  134549  232859
 64 Threads                  114113  112881  116740  179847  190265  192408  341135

 Core i7 V2 64b   4/4  3700   31501   31266   31243   41117   36617   41277   61526
 4 Threads                    28749   29616   58739   64451   61610  129160  231735
 8 Threads                    12369   12496   21021   39879   87865  105814  182821
 64 Threads                    1961    1981    3954    7413   15514   31971   35788

 Phenom II 64b    4/0  3000   22196   22641   26239   26979   26580   26409   23762
 4 Threads                     4478   17301   15108   29950   58038   58706   76500
 64 Threads                   19577   43542   35782   52893   74743   73745   76027

 Core 2 Duo 64b   2/0  2400   15931   17513   18140   18542   18715   18813   37391
 2 Threads                    30486   32243   35126   36209   35493   35615   73979
 64 Threads                   29474   32288   33585   34516   34670   35792   71354

 Atom N455 64b    1/1  1667    9004   10592   11500   12051   12565   12735   24731
 2 Threads                    10224   11743   12283   12767   12948   13053   25668
 64 Threads                    8574   10100   11089   10638   11905   11649   21789

L2 Cache Results in MBytes/Second - 96 KB

                 CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                 HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b     4/4   3700    7165    7133   10898   16546   20138   20473   60657
    4 Threads                 23189   23575   40872   62006   78956   81301  241144
    8 Threads                 26914   29097   47495   67700   79412   89292  322563
   64 Threads                 33001   35645   53618   72906   85615   91768  318366
 Core i7 V2 32b  4/4   3700    6535    6897   10249   15313   18679   19052   60683
    4 Threads                 21398   22214   36104   55228   68446   72179  241137
    8 Threads                 22385   25289   42997   60219   68834   75464  322513
   64 Threads                 12054   15849   34363   38072   59987   69942  425475
 Phenom II 32b   4/0   3000    1495    1496    2675    5979   11081   13201   23852
    4 Threads                  4648    5085    8422   19230   33948   39486   74050
   64 Threads                  5247    5317   10388   21080   36499   45574   89766
 Core 2 Duo 32b  2/0   2400    2065    2044    3275    4562    6700    8095   19153
    2 Threads                  3138    3050    5218    7776   11911   15415   32030
   64 Threads                  3099    2963    5209    7617   11721   15081   31010
 Atom N455 32b   1/1   1667     505     415     788    1481    2577    3915    5909
    2 Threads                   597     665    1243    2285    3657    4660    8598
   64 Threads                   455     534     996    1861    3133    4100    7417

 64 Bit Version

 Core i7 64b     4/4   3700   13596   14421   15252   24216   33143   40470   60420
    4 Threads                 41962   40737   43299   73311  123250  160399  240730
    8 Threads                 45332   45942   46346   67232  107906  121757  219034
   64 Threads                 64186   66500   65554  102211  146587  170442  293072
 Core i7 V2 64b  4/4   3700   12815   13950   14344   23717   32150   39251   59683
    4 Threads                 39170   40423   42705   76442  110895  154667  240928
    8 Threads                 39888   43919   48699   79316  117534  139709  242467
   64 Threads                 23157   24129   40748   86708  101514  163061  397091
 Phenom II 64b   4/0   3000    2959    2964    2989    5987   11983   21134   23868
    4 Threads                  4478   17301   15108   29950   58038   58706   76500
   64 Threads                 10290   10607   10548   20940   40361   75398   89396
 Core 2 Duo 64b  2/0   2400    4171    4170    4098    6715    9120   13430   19157
    2 Threads                  6322    6435    6127   10810   15448   23521   32104
   64 Threads                  6090    6237    5972   10720   15200   23342   31444
 Atom N455 64b   1/1   1667     993    1020     833    1564    2954    5156    5919
    2 Threads                  1066    1127    1386    2501    4608    7240    8722
   64 Threads                  1003    1033    1055    1991    3674    6250    7222

L3 Cache Results in MBytes/Second - 1536 KB

                 CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                 HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b     4/4   3700    2735    2792    5450    9694   16764   20399   38274
    4 Threads                  9816   10382   19816   36661   62386   81197  151963
    8 Threads                 16717   17096   32534   54967   75466   86510  234195
   64 Threads                 14769   15185   28392   48591   71997   88559  213647
 Core i7 V2 32b  4/4   3700    2735    2793    5445    9573   16373   18990   38263
    4 Threads                  9802   10371   19738   35609   61112   73520  150464
    8 Threads                 15355   15282   27981   51300   66966   75178  236662
   64 Threads                 13224   14216   28137   46946   65679   74795  219197
 Phenom II 32b   4/0   3000     699     699    1513    3032    5708    9722   11924
    4 Threads                  2407    2543    4943   10058   17570   29261   41159
   64 Threads                  2541    2571    5022   10100   18811   30768   41078

 64 Bit Version

 Core i7 64b     4/4   3700    5316    5421    5554   10925   19443   33253   38363
    4 Threads                 19103   19854   20683   39137   54701  127196  152980
    8 Threads                 19354   22512   21843   35804   71371  112147  142987
   64 Threads                 31324   32675   29819   51003   95378  146538  203624
 Core i7 V2 64b  4/4   3700    5269    5367    5499   10811   19434   33779   38372
    4 Threads                 19086   19296   20661   39791   73469  120135  152311
    8 Threads                 20967   21061   20910   39814   77704  116114  156833
   64 Threads                 29225   31754   29372   50516   90849  147624  197627
 Phenom II 64b   4/0   3000    1322    1293    1432    2856    5764   11098   12081
    4 Threads                  5112    4899    5083   10018   19866   37136   38700
   64 Threads                  5092    5101    5101   10051   20203   37850   41309

RAM Results in MBytes/Second - 128 MB

                 CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                 HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b     4/4   3700     731    1040    2188    4264    8773   15159   17640
    4 Threads                  3420    4173    7526   14623   28847   52396   61043
    8 Threads                  4388    5953   11378   22221   41937   69685   94501
   64 Threads                  3991    4486    8871   17654   34047   59458   67917
 Core i7 V2 32b  4/4   3700     737    1048    2236    4313    9022   15072   18058
    4 Threads                  1526    2510    4964    9880   19395   38001   40618
    8 Threads                  1488    2410    5166    9560   18878   37571   39464
   64 Threads                  2892    4051    8141   17176   29628   53519   68451
 Phenom II 32b   4/0   3000     411     444     841    1793    3024    4881    7060
    4 Threads                   786     813    1605    3444    6259   12161   14950
   64 Threads                   891     969    1869    3678    6930   12678   15564
 Core 2 Duo 32b  2/0   2400     353     399     814    1467    2725    5021    5842
    2 Threads                   621     808    1181    2217    4108    7686    9952
   64 Threads                   395     598    1080    1947    3541    6540    8133
 Atom N455 32b   1/1   1667     122     256     514    1029    1978    3256    4122
    2 Threads                   131     318     684    1312    2434    4189    5307
   64 Threads                   125     265     577    1159    2280    4435    4636

 64 Bit Version

 Core i7 64b     4/4   3700    1229    1470    2054    4514    8754   18043   18094
    4 Threads                  5901    6947    8368   15029   29096   51843   61776
    8 Threads                  4030    5140    5916   12464   25868   42665   56575
   64 Threads                  4262    5272    7059   14648   29005   58071   58266
 Core i7 V2 64b  4/4   3700    1226    1484    2096    4411    8462   18188   18382
    4 Threads                  2038    3108    5197   10201   20004   38092   40726
    8 Threads                  3073    3818    4822   10452   20528   38770   46567
   64 Threads                  3020    3618    5513   11215   21773   44187   44666
 Phenom II 64b   4/0   3000     858     830     904    1757    3602    5995    7074
    4 Threads                  1561    1648    1701    3330    6964   13025   14027
   64 Threads                  1808    1854    1947    3773    7488   14176   15516
 Core 2 Duo 64b  2/0   2400     699     711     803    1622    2946    5414    5861
    2 Threads                  1210    1336    1632    2436    4635    7947   10028
   64 Threads                   706     813    1226    2177    3919    7014    8127
 Atom N455 64b   1/1   1667     125     256     514    1038    2024    3994    4057
    2 Threads                   136     294     677    1327    2530    4924    5230
   64 Threads                   142     249     523    1129    2312    4565    4639

RAM Results in MBytes/Second - 384 MB

                 CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                 HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b     4/4   3700     740    1043    2191    4267    8785   15164   17652
    4 Threads                  3425    4187    7525   14625   28846   52165   61051
    8 Threads                  3841    5504   11461   21286   39250   69341   85998
   64 Threads                  2115    3304    6884   13623   27005   50799   56297
 Core i7 V2 32b  4/4   3700     738    1049    2237    4314    9041   15058   18065
    4 Threads                  1507    2496    4928    9821   19225   37965   40423
    8 Threads                  1481    2393    4786    9488   18923   37398   38729
   64 Threads                  1712    2633    5189   10464   20499   40335   41751
 Core i7 64b     4/4   3700    2823    3214    4190    8287   16053   32489   33564
    4 Threads                  5909    5426    8370   12684   29097   58307   59609
    8 Threads                  3182    3971    5796   11936   24963   49436   52118
   64 Threads                  2723    3732    5896   12059   24079   46997   47480
 Core i7 V2 64b  4/4   3700    1724    2609    3858    8029   15699   32236   32708
    4 Threads                  2090    3101    5072    9867   19538   39489 L 39824
    8 Threads                  2348    3147    4793   11090   20306   39280 L 38707
   64 Threads                  2592    3168    5019    9857   20329   40428 M 40392

 L Less L3 cache effects, M More L3 cache effects




MP Memory Random Access

MPrandmem64 and MPrandmem32 are based on my old RandMem Benchmark and are essentially the same as the Windows Multithreading Version, except for added tests identified as Mutex SRW and Mutex RRW. The program uses the same code for serial and random access, via a complex indexing structure, and comprises Read (RD) and Read/Write (RW) tests. The tests use data from L1 cache, L2 cache and RAM, with 1 to 64 threads. The number of threads is set by a run time parameter, the default being the number of identified CPUs. All threads use data from the same array, but start at different points. Results are saved in file randmemMP.txt. In this case, both the 64 bit and 32 bit versions use 32 bit integer data arrays.
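The idea of one access loop serving both serial and random tests can be sketched as below. This is a minimal Python illustration, not the benchmark's own code: the names `build_index`, `walk` and `start_points`, and the layout where each index entry holds the position of the next element to visit, are assumptions made for the example. The "threads" here are simply separate calls that begin at different offsets in the same array.

```python
import random

def build_index(n, randomise, seed=1):
    """Return a list where entry i holds the next position to visit.

    Serial order is the cycle 0 -> 1 -> ... -> n-1 -> 0. Random order is a
    single cycle through a shuffled ordering, so every element is still
    visited exactly once per pass.
    """
    order = list(range(n))
    if randomise:
        random.Random(seed).shuffle(order)
    nxt = [0] * n
    for k in range(n):
        nxt[order[k]] = order[(k + 1) % n]
    return nxt

def walk(data, nxt, start, steps):
    """Read `steps` elements by following the index chain from `start`."""
    total = 0
    pos = start
    for _ in range(steps):
        total += data[pos]
        pos = nxt[pos]
    return total

def start_points(n, threads):
    """Evenly spaced starting offsets, one per thread, in the shared array."""
    return [t * n // threads for t in range(threads)]

n = 1536                      # 6 KB of 32 bit words
data = list(range(n))
serial = build_index(n, randomise=False)
rand = build_index(n, randomise=True)
starts = start_points(n, 4)
serial_sums = [walk(data, serial, s, n) for s in starts]
random_sums = [walk(data, rand, s, n) for s in starts]
```

Because both index lists describe one complete cycle, each pass reads every word once whatever the starting point, so only the order of access differs between the serial and random cases.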

Below are logged results on a 3.0 GHz quad core Phenom using one and four threads. On serial and random read only tests, performance gains are up to four times using dedicated caches, with the smaller data sizes slower due to overheads. Random reading is much slower than serial data transfers, where burst reading leads to more data being transferred than is requested. With reading and writing, there is a possibility that data can be corrupted when more than one thread updates the same data. Although it cannot be proven with this benchmark, it seems that the system (probably the hardware cache coherency mechanism, rather than the Operating System) does not allow shared data to be updated in local caches and flushes them to update in shared data areas, producing significant performance degradation, particularly on random access.

The extra tests with Mutex, or mutual exclusion, functions avoid the updating conflict by only allowing one thread at a time to access common data. Four threads can still be slower than one but, with random access, there can be a significant improvement compared with multiple thread speeds without mutex protection, except when accessing RAM.
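A minimal sketch of mutex protected read/write access is shown below, using Python's standard `threading.Lock`. It is an illustration only: the function name `run_updates`, the array size and the choice to take the lock around every single update are assumptions, as the locking granularity used by the benchmark is not stated here.

```python
import threading

def run_updates(threads, words, passes):
    """Each thread adds 1 to every word of a shared array, holding a lock."""
    shared = [0] * words
    lock = threading.Lock()

    def worker(start):
        for _ in range(passes):
            for k in range(words):
                pos = (start + k) % words   # same array, different start point
                with lock:                  # one thread at a time updates data
                    shared[pos] += 1

    pool = [threading.Thread(target=worker, args=(t * words // threads,))
            for t in range(threads)]
    for th in pool:
        th.start()
    for th in pool:
        th.join()
    return shared

result = run_updates(threads=4, words=256, passes=10)
```

With the lock held for each read-modify-write, every word ends up equal to threads x passes, but the updates are serialised, which is consistent with multiple threads showing no gain over one.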

All results are provided for the Core i7, mentioned previously, for 1, 4 and 8 threads. Appropriate performance gains can be produced on read only tests and, with writing, for shared L3 cache based data. This 10 MB cache is also probably responsible for the rather excessive serial memory reading speeds, due to threads reading the same data. All mutex based tests, at 8 threads, are slower than those using a single thread.


    RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sun Jun 26 18:01:19 2011
 
               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    15757   15827   12155   11879    8534    8511    4392    4385
 Serial RW     9263    9534    8875    8868    7591    7493    3740    3601
 Random RD    14396   14271    7504    3159    2269    1751     622     341
 Random RW     9231    9510    6136    2993    2087    1507     532     319
 Mutex SRW     9264    9534    8875    8869    7591    7492    3740    3608
 Mutex RRW     9231    9510    6138    2993    2087    1507     532     320


    RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sun Jun 26 18:00:21 2011
 
               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    29630   53166   44120   44829   29620   29671   12108   11987
 Serial RW     5040    7334    7442    7402    7353    7395    8532    6247
 Random RD    28388   41211   27807   12265    8866    6611    2103    1271
 Random RW      657    1096    1229    1283    1288    1376    1648     993
 Mutex SRW     5962    8654    7998    7882    6982    6853    3579    3415
 Mutex RRW     6243    8594    5838    2815    1970    1370     486     310

 ------------------------------------------------------------------------
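The gains quoted for the Phenom can be checked directly from the two logs above. The short Python sketch below divides the four thread speeds by the single thread speeds for three of the tests. The figures are copied from the logs; the variable and function names are only for illustration.

```python
SIZES = ["6 KB", "24 KB", "96 KB", "384 KB", "768 KB", "1536 KB", "12 MB", "96 MB"]

# MBytes per second, Phenom 64 bit logs, 1 thread and 4 threads
one_thread = {
    "Serial RD": [15757, 15827, 12155, 11879, 8534, 8511, 4392, 4385],
    "Random RD": [14396, 14271, 7504, 3159, 2269, 1751, 622, 341],
    "Random RW": [9231, 9510, 6136, 2993, 2087, 1507, 532, 319],
}
four_threads = {
    "Serial RD": [29630, 53166, 44120, 44829, 29620, 29671, 12108, 11987],
    "Random RD": [28388, 41211, 27807, 12265, 8866, 6611, 2103, 1271],
    "Random RW": [657, 1096, 1229, 1283, 1288, 1376, 1648, 993],
}

def ratios(test):
    """Four thread speed divided by one thread speed at each data size."""
    return [round(b / a, 2)
            for a, b in zip(one_thread[test], four_threads[test])]

for test in one_thread:
    print(f"{test:10s}", " ".join(f"{r:6.2f}" for r in ratios(test)))
```

The read only ratios stay below 4, the largest being about 3.9 for Random RD at 768 KB, while Random RW at 6 KB falls to about 0.07 of the single thread speed.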

3.7 GHz Core i7 4820K

    RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sat Nov 8 12:39:21 2014

               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    28021   27808   20268   19318   19231   19255   12455   11589
 Serial RW    29972   30232   21894   17867   17410   17420   12242   11581
 Random RD    27479   27463   13595    8251    6228    5605    2470    1011
 Random RW    30429   30076    9224    6120    5177    4782    2800     982
 Mutex SRW    29987   30245   21895   17875   17419   17249   12373   11495
 Mutex RRW    30417   30027    9199    6117    5175    4780    2796     982


    RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sat Nov 8 12:40:59 2014

               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    72892  106273   80223   75623   75330   75842   43100   37773
 Serial RW    14198   33187   38290   43442   45145   63266   49480   34706
 Random RD    72819  104511   52719   32436   24475   11447    9297    3543
 Random RW     3092    5558    8188   11156   11734   12527    9136    1811
 Mutex SRW    20510   29213   21157   17705   16794   16835   11969   11250
 Mutex RRW    21625   29271    8982    5912    5060    4699    2720     973


    RandMemMP Speeds 64 Bit Version 1, 8 Threads, Sat Nov 8 12:41:51 2014

               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    37112   77469   94806   94862   90795   86826   65882   56315
 Serial RW     8924   29533   54380   47712   51176   69146   68008   22145
 Random RD    36944   76814   62245   33838   24552   21588   13472    3341
 Random RW     2000    6016    9058   17412   16237   16733   10066    2806
 Mutex SRW     7829   16705   19723   16432   16331   16570   11550   10669
 Mutex RRW    10672   20797    8933    5659    4844    4561    2659     940
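The statement that all Core i7 mutex tests are slower at 8 threads than at one can be confirmed from the figures above, and the same data shows where mutex protection beats unprotected random writing. The Python sketch below uses numbers copied from the 1 thread and 8 thread logs; the names are illustrative only.

```python
SIZES = ["6 KB", "24 KB", "96 KB", "384 KB", "768 KB", "1536 KB", "12 MB", "96 MB"]

# MBytes per second, Core i7 64 bit logs
one_thread = {
    "Mutex SRW": [29987, 30245, 21895, 17875, 17419, 17249, 12373, 11495],
    "Mutex RRW": [30417, 30027, 9199, 6117, 5175, 4780, 2796, 982],
}
eight_threads = {
    "Mutex SRW": [7829, 16705, 19723, 16432, 16331, 16570, 11550, 10669],
    "Mutex RRW": [10672, 20797, 8933, 5659, 4844, 4561, 2659, 940],
    "Random RW": [2000, 6016, 9058, 17412, 16237, 16733, 10066, 2806],
}

def all_slower(test):
    """True if the 8 thread speed is below the 1 thread speed at every size."""
    return all(m < s for s, m in zip(one_thread[test], eight_threads[test]))

def mutex_faster_sizes():
    """Sizes where 8 thread Mutex RRW beats unprotected 8 thread Random RW."""
    pairs = zip(SIZES, eight_threads["Mutex RRW"], eight_threads["Random RW"])
    return [size for size, m, r in pairs if m > r]

checks = {test: all_slower(test) for test in one_thread}
faster_at = mutex_faster_sizes()
print(checks, faster_at)
```

On this system, at 8 threads, the mutex version of random read/write is only ahead at the two L1 cache sizes (6 KB and 24 KB).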




MPrandmem Comparisons

Below, again, are cache and RAM speeds obtained on the Quad Core 3.0 GHz Phenom, the 2.4 GHz Core 2 Duo and the 1.66 GHz Atom. These are for 1, 2 or 4 and 64 threads, plus 8 and 16 threads for the Phenom at 64 bits.

L1 Cache - Probably due to the overheads involved, performance using 64 threads is noticeably slow, with the Core 2 Duo performing better than the quad core Phenom when writing is involved. Hyperthreading does not lead to much performance gain on the Atom.

L2 Cache - Reading tests can show appropriate performance gains using all processors and not as much degradation with reading and writing. Although dealing with 32 bit integers, and unlike the Phenom, the Core 2 Duo and Atom produce much faster speeds with the 64 bit version using 64 threads. The Mutex tests produce different performance characteristics than when using L1 cache.

RAM - There are some performance gains with multiple threads making better use of memory bandwidth and excessive numbers of threads do not necessarily reduce performance.

L1 Cache Results in MBytes/Second - 6 KB

                 CPUs/  MHz  Serial  Serial  Random  Random   Mutex   Mutex
                 HTs             RD      RW      RD      RW     SRW     RRW

 Phenom II 32b   4/0   3000   15791   11495   14746   11409   11478   11394
    4 Threads                 29613    5014   29595     610    7152    7569
   64 Threads                  2321     188    2234      42     257     255
 Core 2 Duo 32b  2/0   2400    6327    8474    6314    8305   12642    8306
    2 Threads                 13285    3559   13432    1312    6935    9327
   64 Threads                   800     452     802      93     346     433
 Atom N455 32b   1/1   1667    3500    4742    4422    5028    5032    5022
    2 Threads                  4902    4770    4895    1153     677    3101
   64 Threads                   307     296     301      69      55     207

 64 Bit Version

 Phenom II 64b   4/0   3000   15757    9263   14396    9231    9264    9231
    4 Threads                 29630    5040   28388     657    5962    6243
    8 Threads                 14933    2120   14892     338    2465    2867
   16 Threads                  8514     846    8284     174     910    1041
   64 Threads                  2247     189    2173      45     225     214
 Core 2 Duo 64b  2/0   2400    9579   12619    6385    7720   12623    7600
    2 Threads                 14257    3553   14073    1505    7018    7718
   64 Threads                   875     893     875     112     348     358
 Atom N455 64b   1/1   1667    3838    4222    3834    4222    4233    4215
    2 Threads                  4438    4779    4481    1218     970    3130
   64 Threads                   285     291     281      68      42     167

L2 Cache Results in MBytes/Second - 96 KB

                 CPUs/  MHz  Serial  Serial  Random  Random   Mutex   Mutex
                 HTs             RD      RW      RD      RW     SRW     RRW

 Phenom II 32b   4/0   3000   12476   10387    7552    6241   10385    6241
    4 Threads                 45484    7488   27712    1238    9645    5810
   64 Threads                 31650    7033   22035    1162    1989    1431
 Core 2 Duo 32b  2/0   2400    5228    5892    4245    3852    5935    2619
    2 Threads                 15026   16440    7009    2896    7009    3054
   64 Threads                  3304    2662    3268     497    1300    1506
 Atom N455 32b   1/1   1667    2768    3349     855    1175    3464    1173
    2 Threads                  4642    4424    1317    1570    2805     966
   64 Threads                  1177    1138    1118     665     423     584

 64 Bit Version

 Phenom II 64b   4/0   3000   12155    8875    7504    6136    8875    6138
    4 Threads                 44120    7442   27807    1229    7998    5838
    8 Threads                 42685    7413   27567    1240    6875    4867
   16 Threads                 42004    7443   27870    1237    5335    3760
   64 Threads                 30435    7057   21892    1157    1686    1329
 Core 2 Duo 64b  2/0   2400    6234    5992    4320    3777    5947    3779
    2 Threads                 14542   15153    7145    2932    7190    3113
   64 Threads                 11741   12994    6426    2767    4714    2317
 Atom N455 64b   1/1   1667    2813    3064     845    1103    3063    1122
    2 Threads                  4613    4576    1352    1615    3044    1111
   64 Threads                  3551    3506    1179    1312    1759     874

L3 Cache Results in MBytes/Second - 1536 KB

                 CPUs/  MHz  Serial  Serial  Random  Random   Mutex   Mutex
                 HTs             RD      RW      RD      RW     SRW     RRW

 Phenom II 32b   4/0   3000    8756    7918    1743    1505    7919    1503
    4 Threads                 29961    7491    6617    1332    7414    1391
   64 Threads                 30159    7763    6643    1333    2394     472

 64 Bit Version

 Phenom II 64b   4/0   3000    8511    7493    1751    1507    7492    1507
    4 Threads                 29671    7395    6611    1376    6853    1370
    8 Threads                 29330    7549    6558    1342    6229    1234
   16 Threads                 29827    7627    6623    1361    4763     988
   64 Threads                 29812    7733    6650    1337    2163     473

RAM Results in MBytes/Second - 96 MB

                 CPUs/  MHz  Serial  Serial  Random  Random   Mutex   Mutex
                 HTs             RD      RW      RD      RW     SRW     RRW

 Phenom II 32b   4/0   3000    4407    3845     344     320    3826     320
    4 Threads                 12009    6305    1273     994    3615     308
   64 Threads                 12010    6641    1298    1003    2881     308
 Core 2 Duo 32b  2/0   2400    4803    2147     449     282    2232     310
    2 Threads                  5492    2512     621     401    2239     308
   64 Threads                  6206    2567     635     411    2294     304
 Atom N455 32b   1/1   1667    2253    1159      38      54    1275      54
    2 Threads                  3926    1257      63      79    1109      42
   64 Threads                  3951    1274      64      79    1322      54

 64 Bit Version

 Phenom II 64b   4/0   3000    4385    3601     341     319    3608     320
    4 Threads                 11987    6247    1271     993    3415     310
    8 Threads                 11930    6248    1274     990    3321     307
   16 Threads                 11860    6557    1281     997    2777     304
   64 Threads                 11863    6651    1288    1002    2774     302
 Core 2 Duo 64b  2/0   2400    3971    2141     416     283    2148     314
    2 Threads                  5448    2508     632     404    2425     298
   64 Threads                  6065    2612     639     416    2318     305
 Atom N455 64b   1/1   1667    2284    1286      39      54    1298      54
    2 Threads                  3717    1241      62      78    1336      54
   64 Threads                  3785    1270      64      78    1122      42




Roy Longbottom December 2014

The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection