Linux MultiThreading Benchmarks

Contents

  General
  Description Simple Add Tests
  Comparison Simple Add Tests
  Whetstone MP Benchmark
  Comparison Whetstone MP
  MP MFLOPS Description
  Comparison MP MFLOPS
  MP MFLOPS Burn-In Test
  MP Memory Speed
  MPmemSpeed Comparisons
  MP Memory Bus Speed
  MPbusSpeed Comparisons
  MP Memory Random Access
  MPrandMem Comparisons

Summary

Six benchmarks are provided that can run using up to 64 concurrent threads, with versions compiled to run on 64 bit or 32 bit systems. Performance is mainly measured as Millions of Instructions Per Second (MIPS), Millions of Floating Point Operations Per Second (MFLOPS) or Millions of Bytes per Second (MB/S).

Simple Add Tests - execute 32 bit or 64 bit integer instructions and 128 bit SSE floating point functions via assembly language. These use simple add operations with little access to external data. Resultant performance is generally proportional to the number of CPU cores with some gains also identified when Hyperthreading is available. Each thread executes independent code.

Whetstone Benchmark - is the first general purpose benchmark that set industry standards of computer system performance, mainly dependent on floating point speed but with some independently timed integer test functions. Data used is generally contained in L1 cache with performance gains again proportional to the number of cores. Each thread again executes independent code.

MP MFLOPS Program - uses the same functions as my CUDA and OpenMP benchmarks, comprising routines with 2, 8 and 32 add or multiply floating point calculations with data from higher level caches or RAM. The 64 bit version compiles using SSE floating point, where up to 6 MFLOPS per CPU MHz per core can be produced. The 32 bit program uses the much slower original 80387 FPU instructions. These programs can also be used as burn-in/reliability tests. Each thread executes the same functions but on a different segment of the data.

MP Memory Speed Tests - employ three sequences of operations, using double and single precision floating point numbers and integers, on data sized between 4 KB and 25% of RAM size. The operations are memory to memory transfers with 0, 1 and 2 arithmetic calculations. The 64 bit version again uses SSE functions but not as efficiently as MP MFLOPS. Again each thread has the same procedures using different segments of the data.

MP Memory Bus Speed Tests - read data at a range of sizes covering caches and RAM. Data is accessed with varying address increments to identify reading data in bursts over the bus and allow estimation of maximum bus/memory speed. This time, each thread reads all the data. The 64 bit version uses the double size 8 byte words, where data transfer speed can be twice that of the 32 bit compilation, demonstrating that 32 and 64 bit integer instructions can execute at the same speed.

MP Memory Random Access Speed Benchmark - comprises serial and random access read and read/write tests that cover cache and RAM data sizes. All threads access the same data but starting at different points. In this case, data could be corrupted with concurrent updates, but the Operating System appears to flush caches to avoid this, producing extremely slow performance. Extra tests avoid this conflict by executing one read/write test at a time, leading to some slower and some faster speeds. Random access can be affected by burst reading/writing with associated poor performance.



General

These tests are intended to measure Linux and hardware performance at high speeds using multithreading. The programs were compiled at both 32 bits and 64 bits. The execution files, source code, compilation and running instructions can be found in linux_multithreading_apps.tar.gz. The programs are based on versions available for running under Windows and described in quad core 8 thread.htm. All provide the following information on the system under test.


##############################################

  Assembler CPUID and RDTSC       
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42 
  AMD Phenom(tm) II X4 945 Processor 
  Measured - Minimum 3014 MHz, Maximum 3014 MHz 
  Linux Functions 
  get_nprocs() - CPUs 4, Configured CPUs 4 
  get_phys_pages() and size - RAM Size  7.81 GB, Page Size 4096 Bytes 
  uname() - Linux, roy-64, 2.6.35-22-generic 
  #33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010, x86_64 
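
The Linux part of the report above can be reproduced with standard glibc calls. The following is a minimal sketch, not the benchmark's own code: the function name report_system is hypothetical, the page size is assumed to come from sysconf(), and RAM size is assumed to be reported in units of 1024 x 1024 x 1024 bytes.

```c
#define _GNU_SOURCE        /* assumption: make the glibc extensions visible */
#include <assert.h>
#include <stdio.h>
#include <sys/sysinfo.h>   /* get_nprocs, get_nprocs_conf, get_phys_pages */
#include <sys/utsname.h>   /* uname */
#include <unistd.h>        /* sysconf */

/* Hypothetical helper: prints a report loosely modelled on the log above
   and returns the number of CPUs currently available. */
int report_system(FILE *out)
{
    assert(out != NULL);

    int    cpus       = get_nprocs();        /* CPUs currently online */
    int    configured = get_nprocs_conf();   /* CPUs configured       */
    long   pages      = get_phys_pages();    /* physical memory pages */
    long   page_size  = sysconf(_SC_PAGESIZE);
    double ram_gb     = (double)pages * (double)page_size
                        / (1024.0 * 1024.0 * 1024.0);

    fprintf(out, "get_nprocs() - CPUs %d, Configured CPUs %d\n",
            cpus, configured);
    fprintf(out, "get_phys_pages() and size - RAM Size %5.2f GB, "
                 "Page Size %ld Bytes\n", ram_gb, page_size);

    struct utsname u;
    if (uname(&u) == 0) {
        fprintf(out, "uname() - %s, %s, %s\n%s, %s\n",
                u.sysname, u.nodename, u.release, u.version, u.machine);
    }
    return cpus;
}
```

Calling report_system(stdout) prints the lines and returns the CPU count, which a program could then use as its default number of threads.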



Description Simple Add Tests

CPUMaxMP32 and CPUMaxMP64 execute simple integer and floating point add instructions via assembly language. Floating point arithmetic is identical on the two versions, via SSE instructions that handle four calculations at a time via 128 bit registers. Performance is measured in Millions of Floating Point Operations Per Second (MFLOPS) with expected maximum speed of four adds per CPU clock cycle. Integer calculations are the same, except one uses 32 bit instructions/registers and the other the 64 bit varieties. For these, speed is measured in Millions of Instructions Per Second (MIPS). Results are logged in file MPadds.txt.

The assembly code loops execute two billion add instructions to ensure that elapsed times of a single thread are significant (such as 0.5 seconds or more for SSE tests on current CPUs). A command line run time parameter is available to specify the number of threads to use, between 1 and 64, with a default of four. Below are example full results, using four threads, and MIPS from a run with 64 threads. Besides total MIPS and MFLOPS, a second sum (Aggregate) is provided, based on the time for the last thread to finish. As seen in both examples, threads do not complete on a first in first out basis, but the time is shared fairly evenly, even with 64 threads.


 Command ./cpumaxmp64 Threads 4 (or T 4 or t 4)

 Phenom 4 CPUs Available
##############################################

 Multithreading Add Test 64 bit Version 1.0 Thu May  5 11:35:18 2011
 
 Integer Additions 4 Threads

 Thread   4 -    8281 64 bit Integer MIPS
 Thread   2 -    7996 64 bit Integer MIPS
 Thread   1 -    7815 64 bit Integer MIPS
 Thread   3 -    7800 64 bit Integer MIPS
 Total      -   31892 64 Bit Integer MIPS
 Aggregate  -   31201 64 Bit Integer MIPS, based on last to finish


 SSE Floating Point Additions 4 Threads

 Thread   2 -   12030 32 Bit SSE MFLOPS
 Thread   3 -   11976 32 Bit SSE MFLOPS
 Thread   4 -   11861 32 Bit SSE MFLOPS
 Thread   1 -   11692 32 Bit SSE MFLOPS
 Total      -   47559 32 Bit SSE MFLOPS
 Aggregate  -   46770 32 Bit SSE MFLOPS, based on last to finish

 End of test Thu May  5 11:35:23 2011


 Integer MIPS 64 Threads

 Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS

    1  528   10  522   41  518   11  517   49  516   21  515   52  514   48  514
   24  527   55  521   56  518    5  517   32  515   45  515   47  514   44  514
   57  526   29  521   26  518   28  516    8  515    9  515   27  514   31  514
   62  525   23  520   20  518   58  516   17  515   38  515   63  514    3  514
    6  525   39  520   25  518    7  516   36  515   33  515    2  514   19  513
   59  523   61  520   12  518   51  516   53  515   54  515   34  514   35  513
   46  523   50  519   16  517   37  516   14  515   22  515   64  514   15  513
   13  522   60  519   43  517   42  516    4  515   18  515   30  514   40  513
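
The difference between the Total and Aggregate lines in the logs above can be illustrated in C. This is a sketch under stated assumptions, not the benchmark source: the names rate_summary and summarise_rates are hypothetical, every thread is assumed to execute the same number of operations (given in millions), and the Aggregate figure is assumed to be all the work divided by the longest thread time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type, not taken from the benchmark source. */
typedef struct {
    double total;      /* sum of the individual thread rates            */
    double aggregate;  /* all work divided by the slowest thread's time */
} rate_summary;

/* seconds[i] is the elapsed time of thread i, mops_per_thread is the
   work per thread in millions of operations. Rates are MIPS or MFLOPS. */
rate_summary summarise_rates(const double *seconds, size_t threads,
                             double mops_per_thread)
{
    assert(seconds != NULL && threads > 0);

    rate_summary s = { 0.0, 0.0 };
    double slowest = 0.0;

    for (size_t i = 0; i < threads; i++) {
        assert(seconds[i] > 0.0);
        s.total += mops_per_thread / seconds[i];
        if (seconds[i] > slowest) {
            slowest = seconds[i];
        }
    }
    s.aggregate = (double)threads * mops_per_thread / slowest;
    return s;
}
```

With these definitions the Aggregate figure can never exceed the Total, and the two are equal only when all threads take the same time. In the four thread integer example above, four times the slowest thread's 7800 MIPS gives 31200, close to the logged Aggregate of 31201.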



Comparison Simple Add Tests

Following are sample results on a range of systems with one, two and four CPUs, using 1, 2, 4 and 64 threads. The range of speeds of individual threads is also shown for the 64 thread runs.

Atom - This is a netbook, where the single CPU has HyperThreading and 64 bit capability. With HT, two CPUs are identified and, in this case, integer addition throughput using multiple threads is 40% higher than from a single thread and 20% faster with SSE floating point calculations.

Core 2 Duo - Results from the 32-Bit and 64-Bit compilations are shown for this dual core processor, where 32 bit integer speed is somewhat faster than at 64 bits. Integer additions are executed at up to 2.75 per CPU clock cycle (or MIPS/MHz) with SSE calculations at the maximum rate of four per clock cycle. As with earlier tests, this system runs at 1.6 GHz when one CPU is being used under Linux and the default “On-Demand” Frequency Scaling is in effect (see Linux Peculiarities in linux burn-in apps.htm). Results provided are for a “Performance” setting.

Phenom - Results of the 64-Bit version are shown for this quad core processor, via Linux Ubuntu and Fedora. There appear to be some differences between the two versions of Linux but these might be normal variations due to other influences. They at least show that the quad core processor can increase throughput by four times with these tests.


                    Atom 1.7 GHz  Core 2 Duo 2.4 GHz          Phenom X4 3.0 GHz
                    64 bit        64 bit        32 bit        64 bit        64 bit
                    Ubuntu        Ubuntu        Ubuntu        Ubuntu        Fedora
                    MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS
 Threads
   1    Total       1844   5418   5268   9605   6597   9591   8052  12046   8213  12030
        Aggregate   1844   5418   5268   9605   6597   9591   8052  12046   8213  12030
            %      100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0

   2    Total       2631   6460  10290  18992  12782  18707  15964  24022  16447  24052
        Aggregate   2598   6441  10156  18898  12621  18344  15810  23946  16446  24050
            %       98.7   99.7   98.7   99.5   98.7   98.1   99.0   99.7  100.0  100.0

   4    Total       2652   6473  10508  19159  13011  19047  31892  47559  32701  47889
        Aggregate   2630   6449  10416  19070  12933  18940  31201  46770  32344  47620
            %       99.2   99.6   99.1   99.5   99.4   99.4   97.8   98.3   98.9   99.4

   64   Min           42    101    164    299    205    299    513    749    510    655
        Max           43    103    173    310    229    310    528    798    529    763
        Total       2719   6526  10696  19443  13556  19435  33094  48974  33012  43339
        Aggregate   2657   6466  10503  19111  13129  19120  32840  47932  32664  41938
            %       97.7   99.1   98.2   98.3   96.9   98.4   99.2   97.9   98.9   96.8




Whetstone MP Benchmark

The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Details and performance of early to modern systems can be found in Whetstone Benchmark History And Results and Results On PCs. The overall performance rating is in Millions of Whetstone Instructions Per Second (MWIPS). Later, it was found necessary to measure the speed of the eight different test functions used, to demonstrate that compilers were not over optimising and to allow code tweaks to avoid this situation. The additional measurements are in terms of Millions of Operations Per Second (MOPS) or MFLOPS for straight floating point calculations. As the design authority, nominated by the original author, I have to say that versions that do not provide these separate measurements cannot be taken as valid.

This multithreading benchmark also has a run time parameter to specify the number of threads (up to 64), with a default identified as Configured CPUs in the gathered system information (see above). An initial calibration determines the number of passes needed for an overall execution time of 10 seconds. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread but using dedicated variables. The numeric results of calculations are logged for the first thread, with the others checked for the same values. Actual results might be different on repeated runs, as they are dependent on the number of passes.
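
The calibration step can be pictured as scaling a short timed trial up to the target running time. The following is only a sketch of that idea: the function name passes_for_target is hypothetical, and the linear scaling with rounding up is an assumption, as the actual calibration method is not described here.

```c
#include <assert.h>

/* Hypothetical helper: given a trial run of trial_passes that took
   trial_seconds, return the pass count expected to run for about
   target_seconds, assuming time grows linearly with passes. */
long passes_for_target(long trial_passes, double trial_seconds,
                       double target_seconds)
{
    assert(trial_passes > 0);
    assert(trial_seconds > 0.0 && target_seconds > 0.0);

    double exact  = (double)trial_passes * target_seconds / trial_seconds;
    long   passes = (long)exact;          /* truncate toward zero       */

    if ((double)passes < exact) {
        passes++;                         /* round any fraction upwards */
    }
    return passes < 1 ? 1 : passes;       /* always run at least once   */
}
```

For example, a trial of 100 passes taking 0.5 seconds scales to 2000 passes for a 10 second run. Because the pass count depends on the measured trial time, it can vary between runs, which is consistent with the numeric results changing on repeated runs.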

Four versions are available, whetsMP64, whetsMP64DP, whetsMP32 and whetsMP32DP, for 32-Bit or 64-Bit systems using Single or Double Precision floating point. Results are logged in file MPwhetres.txt.


 Equivalent command ./whetsMP64 Threads 4

 Phenom 4 CPUs Available
 #####################################################################

     Multithreading Single Precision Whetstones 64-Bit Version 1.0

             Using 4 threads - Sat May 14 12:03:51 2011

         MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
 Thread             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

      1   2861    927    872    747     71     38   2947   2259    629
      2   2865    875    892    745     71     38   3294   2198    641
      3   2875    869    892    744     71     38   3408   2202    645
      4   2896    906    895    744     72     38   3141   2232    651

 Total   11496   3577   3550   2979    285    151  12790   8891   2566

 MWIPS   11389 Based on time for last thread to finish


 Results Of Calculations Thread 1

 MFLOPS 1   -1.12475013732910156      MFLOPS 2   -1.13133049011230469
 IFMOPS      1.00000000000000000      FIXPMOPS   12.00000000000000000
 COSMOPS     0.49911013245582581      MFLOPS 3    0.99999982118606567
 EQUMOPS     3.00000000000000000      EXPMOPS     0.93536460399627686

      Numeric results of the other 3 threads were same as above

               End of test Sat May 14 12:04:09 2011
 




Comparison Whetstone MP Benchmark

Following are results of the four versions of the benchmark, using 1, 2, 4 and 16 threads, via the Atom, Core 2 Duo and Phenom CPUs noted above. The 32-Bit compilations use the original i87 floating point instructions, where arithmetic calculations are at double precision, normally producing the same speeds with both precision options. With i87 mode not being available at 64-Bits, SSE instructions are used for single precision and SSE2 for double. Using Single Instruction Multiple Data (SIMD) mode, as in the above Add Tests, SSE can be twice as fast as SSE2, with four 32 bit arithmetic calculations at a time, compared with two at 64 bits. Here, however, the source code is unsuitable for SIMD compilation, so scalar or SISD (Single Instruction Single Data) instructions are used, and single precision calculations run at the same speed as, or slightly faster than, double precision. This scalar operation also means that 64-Bit and 32-Bit compilations can produce similar performance.

There are differences with the 32-Bit double precision version where speed can be much faster. For the one headed Equal MOPS, the single precision code uses mov instructions rather than store on the faster compilation. For Fixpt MOPS, integer calculations are the same but the faster one involves integer conversion to double precision rather than single precision. The speed difference remains a mystery but this has little effect on the overall performance rating.

As indicated earlier, the single core Atom has Hyperthreading. In this case, some floating point calculations can be twice as fast using more than one thread. One anomaly is the high speed result during the four thread fixed point test. Here, Linux appeared to have run one thread twice as fast as the others, distorting the total. Results on the Core 2 Duo include one for a test using 64 threads. Speeds are also shown using Fedora on the Phenom, rather than Ubuntu.


        Threads  MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
                             1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

 Atom        1     751     397     395     363      17       8     698    1230     141
 1.6 GHz     2    1284     747     697     698      31      14     948    1657     190
 64 Bit      4    1301     768     760     715      31      14    1207    1661     190
 SP         16    1309     801     773     726      32      14     956    1698     191
 Aggregate  16    1305

 Atom        1     748     381     381     324      17       8     700    1235     141
 1.6 GHz     2    1287     732     714     634      31      14     950    1662     191
 64 Bit      4    1259     781     748     593      32      13     963    1686     186
 DP         16    1307     765     742     647      32      14     958    1691     191
 Aggregate  16    1302

 Atom        1     698     330     329     282      17       7     758    1230     118
 1.6 GHz     2    1182     594     588     478      29      13     987    1654     178
 32 Bit      4    1193     613     614     483      30      13     998    1690     178
 SP         16    1202     618     589     483      30      13     995    1688     178
 Aggregate  16    1199

 Atom        1     757     330     330     282      17       7    1468     837     299
 1.6 GHz     2    1312     600     592     480      29      13    2420    1248     506
 32 Bit      4    1319     611     604     482      30      13    2504    1263     505
 DP         16    1329     610     615     485      30      13    2575    1268     507
 Aggregate  16    1324

 Core2 Duo   1    2501     876     876     600      68      29    3198    3601     600
 2.4 GHz     2    4926    1726    1632    1192     135      58    6102    6930    1193
 64 Bit      4    4963    1733    1748    1196     135      58    6328    7158    1196
 SP         16    4982    1758    1757    1199     136      58    6420    7212    1198
 Aggregate  16    4966
            64    5054    1783    1782    1215     138      59    6566    7292    1218
 Aggregate  64    4973

 Core2 Duo   1    2364     803     803     533      61      30    3005    3601     600
 2.4 GHz     2    4688    1589    1586    1059     121      60    5911    7082    1196
 64 Bit      4    4698    1599    1601    1062     122      60    5976    7089    1197
 DP         16    4714    1609    1613    1062     122      61    6068    7213    1197
 Aggregate  16    4704

 Core2 Duo   1    2165     817     817     576      58      23    3169    3600     623
 2.4 GHz     2    4270    1564    1558    1130     114      45    6149    6823    1234
 32 Bit      4    4330    1616    1628    1149     116      45    6636    7168    1253
 SP         16    4331    1628    1638    1151     116      45    6586    7229    1256
 Aggregate  16    4317

 Core2 Duo   1    2244     817     817     576      58      23    5140    3596    1028
 2.4 GHz     2    4452    1621    1578    1144     115      46   10028    7120    2049
 32 Bit      4    4450    1624    1630    1150     113      46   10399    7176    2065
 DP         16    4472    1634    1636    1149     115      46   10301    7227    2051
 Aggregate  16    4465

 Phenom x4   1    2909     925     927     753      72      38    3229    2258     644
 3.0 GHz     2    5787    1832    1825    1504     144      76    6375    4488    1253
 64 Bit      4   11496    3577    3550    2979     285     151   12790    8891    2566
 SP         16   11655    3700    3718    3006     289     153   13395    9039    2635
 Aggregate  16   11578
 Fedora     16   11842    3705    3715    3010     296     158   13474    9067    2552
 Aggregate  16   11725

 Phenom x4   1    3002     927     927     753      75      42    3228    2253     601
 3.0 GHz     2    5977    1819    1829    1498     150      83    6410    4491    1184
 64 Bit      4   11810    3583    3610    2976     297     163   12492    8875    2372
 DP         16   11992    3694    3715    3008     300     166   12945    9068    2429
 Aggregate  16   11929

 Phenom x4   1    2586     927     926     695      64      31    3132    2259     621
 3.0 GHz     2    5141    1819    1827    1389     129      62    6213    4484    1200
 32 Bit      4   10178    3564    3623    2747     255     124   11567    8893    2390
 SP         16   10300    3695    3691    2780     256     125   12584    9070    2460
 Aggregate  16   10233

 Phenom x4   1    2768     926     927     695      63      32    7525    1807    1806
 3.0 GHz     2    5504    1815    1824    1388     126      64   14367    3570    3613
 32 Bit      4   10853    3596    3594    2758     249     125   27371    7162    7110
 DP         16   10960    3703    3701    2777     249     127   30629    7212    7177
 Aggregate  16   10903
 




MP MFLOPS Benchmark

This benchmark executes the same functions as my CUDA and OpenMP performance tests. Details and results can be found in linux_cuda_mflops.htm and OpenMP Speeds.htm. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The program checks for consistent numeric results, primarily to show that all calculations are carried out. This variation can be run using between 1 and 64 threads, the default being equal to the number of identified CPUs. Each thread carries out the same calculations but on a different segment of the data. The data size starts at 102400 words, rather than 100000, to ensure that each thread uses the same amount of data. For example, with 64 threads, each will use 1600 words or 6400 bytes.
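
The division of data between threads and the eight operation form of the calculation can be sketched as follows. This is an illustration only: the names segment, thread_segment and ops8 are hypothetical, and the sketch assumes a thread count that divides the word count exactly, as in the 64 thread example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of the part of the array used by one thread. */
typedef struct {
    size_t start;   /* index of the first word */
    size_t count;   /* number of words         */
} segment;

/* Split 'words' evenly between 'threads' and return the share of thread
   'index' (0 based). Assumes words is an exact multiple of threads. */
segment thread_segment(size_t words, unsigned threads, unsigned index)
{
    assert(threads > 0 && index < threads);
    assert(words % threads == 0);

    segment s;
    s.count = words / threads;
    s.start = (size_t)index * s.count;
    return s;
}

/* The eight operation form quoted in the text: three adds, three
   multiplies, one subtract and one further add per data word. */
void ops8(float *x, size_t n,
          float a, float b, float c, float d, float e, float f)
{
    assert(x != NULL);
    for (size_t i = 0; i < n; i++) {
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
    }
}
```

Each thread would call ops8 on x + s.start for s.count words, so no two threads write to the same element. With 102400 words and 64 threads, thread_segment returns 1600 words per thread, matching the figure given above.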

Two versions, MPmflops64 and MPmflops32, were compiled, the first involving the default SSE floating point instructions and the second using the original i87 functions. Speed of the 64-Bit version was so fast that a second 32-Bit benchmark, MPmflops32SSE, was produced. Results are logged in file MPMflopsLog.txt, with examples shown below. These show that the 64-Bit and 32-Bit SSE versions produce the same numeric results and the same speeds (within normal variations). The i87 program, by contrast, produces slightly different answers and much slower speeds.


 Phenom Results
##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:43 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    10000   0.091754    22321    0.764063   Yes
 Data in & out    1024000     2     1000   0.136134    15044    0.970753   Yes
 Data in & out   10240000     2      100   0.632075     3240    0.997008   Yes

 Data in & out     102400     8    10000   0.167023    49047    0.850923   Yes
 Data in & out    1024000     8     1000   0.176219    46488    0.982342   Yes
 Data in & out   10240000     8      100   0.658828    12434    0.998200   Yes

 Data in & out     102400    32    10000   0.558509    58670    0.660143   Yes
 Data in & out    1024000    32     1000   0.556450    58888    0.953631   Yes
 Data in & out   10240000    32      100   0.722131    45377    0.995203   Yes


##############################################

  32 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Fri May 20 12:57:17 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    10000   0.092236    22204    0.764063   Yes
 Data in & out    1024000     2     1000   0.135243    15143    0.970753   Yes
 Data in & out   10240000     2      100   0.638202     3209    0.997008   Yes

 Data in & out     102400     8    10000   0.164866    49689    0.850923   Yes
 Data in & out    1024000     8     1000   0.183847    44559    0.982342   Yes
 Data in & out   10240000     8      100   0.677530    12091    0.998200   Yes

 Data in & out     102400    32    10000   0.604816    54178    0.660143   Yes
 Data in & out    1024000    32     1000   0.613424    53418    0.953631   Yes
 Data in & out   10240000    32      100   0.756550    43312    0.995203   Yes


##############################################

  32 Bit MP i87 MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:59 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    10000   0.278444     7355    0.763849   Yes
 Data in & out    1024000     2     1000   0.287133     7133    0.970727   Yes
 Data in & out   10240000     2      100   0.673376     3041    0.997006   Yes

 Data in & out     102400     8    10000   0.625873    13089    0.851082   Yes
 Data in & out    1024000     8     1000   0.629958    13004    0.982363   Yes
 Data in & out   10240000     8      100   0.740114    11069    0.998204   Yes

 Data in & out     102400    32    10000   2.172758    15081    0.660653   Yes
 Data in & out    1024000    32     1000   2.186809    14984    0.953702   Yes
 Data in & out   10240000    32      100   2.236048    14654    0.995214   Yes
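
The MFLOPS column in the logs above is consistent with words times operations per word times repeat passes, divided by the elapsed seconds. A minimal sketch of that arithmetic follows. The function name mflops is hypothetical and the formula is inferred from the logged figures rather than taken from the source code.

```c
#include <assert.h>

/* Hypothetical helper: millions of floating point operations per second
   for a test that performs ops_per_word operations on each of 'words'
   data words, repeated 'passes' times, in 'seconds' of elapsed time. */
double mflops(double words, double ops_per_word, double passes,
              double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For the first 64 bit line, mflops(102400, 2, 10000, 0.091754) gives about 22320.6, which rounds to the logged 22321.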
 




Comparison MP MFLOPS Benchmark

As previously, following are results of the 64-Bit and 32-Bit versions of the benchmark, using 1, 2, 4 and 16 threads, via the Atom, Core 2 Duo and Phenom CPUs. Performance of the single CPU tests is virtually the same as that from the OpenMP benchmark, as expected, using the same C statements but without OpenMP directives. Multiple processor tests are a little faster on the i87 version but significantly faster on the SSE varieties. The OpenMP compilation only produced SISD SSE instructions. The generated code for this MP MFLOPS benchmark not only used full SIMD functions but also clearly included linked add and multiply instructions to produce more than four results per clock cycle. Best case is the Core 2 Duo, where up to six adds or multiplies were recorded per clock, per CPU. The Phenom shows the highest throughput here at 60 GFLOPS, with four cores at five results per clock cycle. Performance gains on the Atom again reflect Hyperthreading effects but some are more influenced by smaller cache sizes.

Numeric results of calculations are constant for a given number of repeat passes, but these are arranged to increase in proportion to the number of identified CPUs, to maintain similar running times. Rounding effects also produce slight differences on i87 and SSE versions. Default answers are shown below for systems with 2, 4, 8 and 16 cores.

Besides defining the number of threads, command line input parameters are available to specify the number of repeat passes, either to extend running time or to check for consistent answers.


 Run Time Parameters

 t N or T N or Threads N       where N is between 1 and 64 
 r P or R P or repeats P or Repeats P for P Repeat Passes
 m T or M T or minutes T or Minutes T for T minutes burn-in test 

 Examples  ./MPmflops32 Threads 64  ./MPmflops64 T 4  ./MPmflops64  T 8, R 20000
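
The run time parameters listed above could be handled along the following lines. This is a hedged sketch, not the program's own parser: the names run_options and parse_options are hypothetical, keywords are assumed to be recognised by their first letter only, and a trailing comma after a number (as in the last example) is assumed to be ignored.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical container for the three run time parameters. */
typedef struct {
    int  threads;   /* 1 to 64                               */
    long repeats;   /* 0 means use the default repeat passes */
    int  minutes;   /* 0 means no burn-in test               */
} run_options;

/* Parse "keyword value" pairs. The keyword is identified by its first
   letter (t/T, r/R, m/M), which is an assumption of this sketch. */
run_options parse_options(int argc, char **argv, int default_threads)
{
    assert(argv != NULL);
    run_options o = { default_threads, 0, 0 };

    for (int i = 1; i + 1 < argc; i += 2) {
        /* strtol stops at the first non-digit, so "8," parses as 8 */
        long value = strtol(argv[i + 1], NULL, 10);

        switch (argv[i][0]) {
        case 't': case 'T':
            if (value >= 1 && value <= 64) o.threads = (int)value;
            break;
        case 'r': case 'R':
            if (value > 0) o.repeats = value;
            break;
        case 'm': case 'M':
            if (value > 0) o.minutes = (int)value;
            break;
        default:
            break;   /* unknown keywords are ignored */
        }
    }
    return o;
}
```

Out of range thread counts leave the default unchanged in this sketch. The real programs may instead report an error or clamp the value.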

             Atom 1.7 GHz             Core 2 Duo 2.4 GHz          Phenom X4 3.0 GHz
 Thds   1      2      4     16      1      2      4     16      1      2      4     16

 64 Bit SSE MFLOPS
 
 a2     800   1430   1501   1508   5545   8503   8581  12567   7237  13870  22321  25742
 b2     648    610    640   1396   3779   4290   8929   9374   4611   9135  15044  27084
 c2     660    629    624    628   1248   1248   1242   2192   2152   2819   3240   3649

 a8    1810   3372   3396   3405  13036  23704  23904  26636  13815  26435  49047  51692
 b8    1741   2486   2553   3304  10787  15437  23931  25331  13168  25751  46488  54633
 c8    1746   2536   2504   2528   5003   4970   4993   8546   7152  10816  12434  13898

 a32   1832   3530   3560   3577  14405  28155  28240  27827  15110  30093  58670  59810
 b32   1818   3502   3521   3535  14212  27492  28084  28577  14897  29867  58888  60311
 c32   1820   3504   3531   3535  13620  19913  19964  25607  14208  27760  45377  47678

 32 Bit i87 MFLOPS

 a2     204    327    369    369   1602   3568   3523   3185   1950   3841   7355   7535
 b2     201    354    361    364   1799   3136   3613   3618   1885   3804   7133   7686
 c2     202    358    363    363   1236   1252   1251   2048   1582   2505   3041   3240

 a8     303    557    565    567   3188   6346   6193   6278   3361   6676  13089  13363
 b8     301    550    564    565   3162   6280   6213   6304   3292   6648  13004  13404
 c8     302    556    566    565   3081   4988   4959   5860   3168   6211  11069  11382

 a32    404    777    794    794   3362   6696   6649   6704   3813   7613  15081  15175
 b32    403    777    790    784   3357   6680   6645   6689   3775   7566  14984  15197
 c32    403    776    790    785   3338   6628   6592   6620   3715   7411  14654  14848


Numeric Results 

 Repeats  5000               10000               20000               40000
 Version   SSE       i87       SSE       i87       SSE       i87       SSE       i87

 a2     0.867359  0.867238  0.764063  0.763849  0.620974  0.620631  0.481454  0.481096
 b2     0.985193  0.985180  0.970753  0.970727  0.942935  0.942883  0.891302  0.891203
 c2     0.998502  0.998501  0.997008  0.997006  0.994032  0.994027  0.988125  0.988114

 a8     0.918220  0.918307  0.850923  0.851082  0.749971  0.750239  0.635325  0.635706
 b8     0.991084  0.991095  0.982342  0.982363  0.965360  0.965401  0.933325  0.933397
 c8     0.999099  0.999101  0.998200  0.998204  0.996409  0.996416  0.992853  0.992862

 a32    0.798973  0.799276  0.660143  0.660653  0.498060  0.498797  0.385106  0.384777
 b32    0.976383  0.976422  0.953631  0.953702  0.910573  0.910709  0.833458  0.833707
 c32    0.997595  0.997602  0.995203  0.995214  0.990447  0.990463  0.981037  0.981068
 
     Key - Words a=102400, b=1024000, c=10240000 - Operations per word 2, 8 and 32
 




MP MFLOPS Burn-In Test

As the benchmarks generated exceptionally high speeds from a single program, it was decided to include a burn-in/reliability test function. This is initiated by including a “Minutes” input parameter. This test just uses the 32 operations per word, 102400 word procedures, with an initial calibration run to identify the number of repeat passes to generate four results per minute.

The first results below are for the quad core Phenom. Overall throughput and CPU temperatures were almost identical to those running four copies of the BurnInSSE program. See - Linux burn-in apps.htm. The second results are from running on a 1.83 GHz Core 2 Duo based laptop. As with the earlier burn-in apps results, the CPU switched to lower clock speeds when the CPU core temperatures reached around 95°C.


 Command ./MPmflops64 Minutes 2

 ##############################################

 Reliability Test around 2 Minutes

 4 CPUs Available
 ##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Fri May 20 12:41:07 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400    32   266688  14.334887    60962    0.352168   Yes
 Data in & out     102400    32   266688  14.334628    60963    0.352168   Yes
 Data in & out     102400    32   266688  14.506037    60243    0.352168   Yes
 Data in & out     102400    32   266688  14.400784    60683    0.352168   Yes
 Data in & out     102400    32   266688  14.354242    60880    0.352168   Yes
 Data in & out     102400    32   266688  14.418992    60606    0.352168   Yes
 Data in & out     102400    32   266688  14.536283    60117    0.352168   Yes
 Data in & out     102400    32   266688  14.499469    60270    0.352168   Yes
 Data in & out     102400    32   266688  14.583635    59922    0.352168   Yes

               End of test Fri May 20 12:43:18 2011

##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 2 Threads, Sat May 21 16:45:59 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400    32    94075  14.361988    21464    0.352474   Yes
 Data in & out     102400    32    94075  14.255409    21624    0.352474   Yes
 Data in & out     102400    32    94075  23.311603    13224    0.352474   Yes
 Data in & out     102400    32    94075  33.148504     9300    0.352474   Yes
 Data in & out     102400    32    94075  33.139577     9302    0.352474   Yes
 Data in & out     102400    32    94075  33.111674     9310    0.352474   Yes
 Data in & out     102400    32    94075  33.140281     9302    0.352474   Yes
 Data in & out     102400    32    94075  33.586864     9178    0.352474   Yes
 Data in & out     102400    32    94075  14.385304    21429    0.352474   Yes
 Data in & out     102400    32    94075  14.276475    21593    0.352474   Yes

               End of test Sat May 21 16:50:07 2011
 




MP Memory Speed

This is based on my original MemSpeed benchmark. It employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 or 64 bit integers, via two data arrays:

   Result to memory     x[m] = x[m] + s * y[m]     
   Sum to memory        x[m] = x[m] + y[m]         
   Memory to memory     x[m] = y[m]                
 

Add is used instead of multiply for the first integer tests. Memory tested doubles up from 4 KB to 25% of RAM size, to use all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For tests using two arithmetic operations, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision. The C programming calculations are identical to those used in an OpenMP version. See - OpenMP Speeds.htm.
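The MFLOPS rule above can be checked with a short calculation. The following Python sketch is illustrative only and is not part of the benchmark. The function name is hypothetical, and it assumes that each result of x[m] = x[m] + s * y[m] reads two words and performs one multiply and one add. The sample speeds are taken from the 262 KB line of the 64 bit, four thread results shown later in this section.

```python
def mflops_from_mbps(mb_per_second, bytes_per_word):
    """Convert a data reading speed for x[m] = x[m] + s * y[m] to MFLOPS.

    Assumption: each result reads two words (x[m] and y[m]) and performs
    two floating point operations (one multiply and one add).
    """
    bytes_read_per_result = 2 * bytes_per_word
    flops_per_result = 2
    return mb_per_second * flops_per_result / bytes_read_per_result


if __name__ == "__main__":
    # Double precision: equivalent to MB/second divided by 8
    print("Dble MFLOPS", mflops_from_mbps(65111, 8))
    # Single precision: equivalent to MB/second divided by 4
    print("Sngl MFLOPS", mflops_from_mbps(32902, 4))
```

On these figures, the single and double precision tests run at a similar rate in MFLOPS, even though the double precision MB/second speed is about twice as high.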

The execution files are MPmemspeed32 and MPmemspeed64. The 32 bit version uses the old i87 floating point instructions and 32 bit integers. The other, as expected, compiles to use SSE instructions, but these are the slower scalar (SISD) variety. It is also coded to use 64 bit integers. There is an input parameter to use 1, 2, 4, 8, 16, 32 or 64 threads, the default being the number of identified CPUs, possibly rounded up. Again, each thread carries out the same calculations but on a different segment of the data. The data and calculations on each array element are identical, and the program checks results for consistency and reports any errors. Results are saved in memSpeedMP.txt, examples below being for the 3.0 GHz quad core Phenom.

Below are 64-Bit and 32-Bit results on the 3.0 GHz quad core Phenom using four threads. These show that the SSE floating point speeds are somewhat faster than those using i87 instructions, except where performance becomes dependent on memory speed. MB/second rates using 64-Bit integers can be much faster than at 32-Bits, firstly, as the CPU can execute both types of instructions at the same speed, so each instruction handles twice as many bytes, and, secondly, as more registers are available for optimisation. The number of measurements at 32-Bits is limited as the full 8 GB of RAM cannot be recognised.


     get_phys_pages() and size - RAM Size  7.81 GB

     MP Memory Reading Speed Test 64 Bit Version 1 Using 4 Threads

               Start of test Tue Jun  7 11:32:54 2011

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   18253  12913  18066  18667  14409  27989  14221  11201  14643
       8   26068  17651  25448  29410  20679  39706  21694  16463  22343
      16   39834  25431  38289  47167  29614  57648  21134  21446  35492
      32   50545  29588  47840  65341  37153  76190  46231  24825  45564
      65   57307  31898  53119  71962  40593  86253  48212  25858  48779
     131   64285  33405  56454  83929  42824  93889  51601  26317  52109
     262   65111  32902  58563  85910  43904  96199  52272  26517  52156
     524   58699  32056  53683  67177  39149  66647  44137  26617  44123
    1048   59967  32531  53638  67332  39808  67046  43401  26310  44172
    2097   48409  31709  51453  59630  37829  59008  32561  25079  32687
    4194   36529  27079  37052  37380  32077  37280  18682  18694  18732
    8388   27898  21163  25293  27011  23273  27070  14253  12800  13768
   16777    9006   8909   8869   8978   8806   9023   4488   4462   4516
   33554    8946   8875   8887   8606   8855   8921   4525   4497   4508
   67108    8717   8458   8325   8516   8452   8756   4287   4366   4379
  134217    8688   8339   8362   8696   8473   8698   4276   4357   4355
  268435    8703   8608   8516   8659   8393   8648   4280   4268   4328
  536870    8700   8591   8421   8673   8514   8690   4308   4290   4264
 1073741    8596   8471   8584   8628   8619   8698   4398   4329   4395
 2147483    8825   8790   8835   8790   8763   8842   4397   4402   4468

                           No errors found

                End of test Tue Jun  7 11:33:52 2011

##############################################

     get_phys_pages() and size - RAM Size  3.20 GB

     MP Memory Reading Speed Test 32 Bit Version 1 Using 4 Threads

               Start of test Tue Jun  7 14:23:02 2011

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   15704  11347  10961  17813  12518  15904  13744   8714   8758
       8   24188  15367  14929  26770  17870  21025  20789  10866  10234
      16   33319  19229  18266  38724  23589  23124  31390  13114  13157
      32   40697  20675  21180  51120  27260  25282  39385  13921  13960
      65   45013  22913  22267  57143  30132  24875  42247  14314  14241
     131   45569  23573  22953  61979  31356  27585  44688  14427  13289
     262   48701  23759  22666  63235  32103  27892  44447  14200  14453
     524   44900  22996  20417  53167  30753  25832  36085  14671  13403
    1048   44929  23357  20300  54596  30302  25790  36207  14708  13590
    2097   42017  22864  20927  42429  28809  24778  26734  13125  12659
    4194   34909  20379  19542  36402  25268  21093  18592  12625  12821
    8388   22498  17592  17006  23354  19577  18854  12489   9400   9657
   16777    8906   8697   8781   8884   8841   8844   4433   4217   4440
   33554    8848   8684   8606   8877   8436   8843   4412   4293   4422
   67108    8423   8445   8433   8685   8506   8526   4228   4296   4273
  134217    8704   8453   8572   8563   8426   8485   4383   4303   4346
  268435    8623   8579   8539   8731   8652   8612   4408   4301   4322
  536870    8683   8331   8534   8724   8658   8444   4371   4330   4325

                           No errors found

                End of test Tue Jun  7 14:24:05 2011
 




MP Memory Speeds Comparison

Below are 64-Bit results on the 3.0 GHz quad core Phenom and 2.4 GHz Core 2 Duo for the multiply and add tests using 1 CPU, all CPUs and with 64 threads. On the single thread tests, although the speeds are dependent on CPU GHz, variations generally reflect cache sizes. The full gain in throughput through using more than one CPU is not achieved at the lower data sizes, mainly due to higher overheads. For example, at 4 KB there are two arrays of 2 KB, producing a segment of 32 bytes for each of 64 threads. There are significant additional performance gains using an increasing number of threads with mid to large data sets. This is due to the relatively small data segments being repetitively processed from a lower level cache.
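The segment size quoted above follows from simple division. The Python sketch below is only an illustration of that arithmetic, with a hypothetical function name, and assumes the memory is split equally between two arrays and then equally between threads.

```python
def segment_bytes(total_bytes, threads, arrays=2):
    """Bytes of each array handled by one thread.

    Assumption: the memory used is divided equally between the arrays,
    and each array is divided equally between the threads.
    """
    bytes_per_array = total_bytes // arrays
    return bytes_per_array // threads


if __name__ == "__main__":
    for threads in (1, 4, 64):
        print(threads, "threads:", segment_bytes(4096, threads), "bytes per segment")
```

With 64 threads at 4 KB, each thread handles 32 bytes per array, or only four double precision words, so thread overheads are large compared with the calculations.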

Further below are results on the Netbook with an Atom CPU, running 64-Bit Ubuntu 11.04. These show that Hyperthreading provides significant gains in throughput using floating point instructions.


 Quad Core Phenom - Caches L1 64 KB/CPU, L2 512 KB/CPU, L3 6144 KB shared

              Commands      ./MPmemspeed64  Threads 1
                            ./MPmemspeed64
                            ./MPmemspeed64  Threads 64

             1 thread             4 threads            64 threads

    KBytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

         4  16834   8588  16753  18253  12913  18066  17040  12902  21602
         8  17196   8671  16039  26068  17651  25448  30712  18072  31164
        16  17188   8670  18148  39834  25431  38289  34169  21795  41192
        32  17314   8703  16669  50545  29588  47840  39691  21845  35582
        65  15023   8634  15211  57307  31898  53119  45213  25520  52872
       131  15274   8155  13675  64285  33405  56454  53787  29396  58835
       262  15335   8143  13508  65111  32902  58563  56811  31077  59534
       524  14512   8013  13242  58699  32056  53683  62895  32310  58189
      1048  10911   7355  10791  59967  32531  53638  64335  32520  60720
      2097  10909   7350  10784  48409  31709  51453  65249  32553  59855
      4194  10561   7169  10411  36529  27079  37052  65800  32975  57665
      8388   6642   5610   6315  27898  21163  25293  57438  30135  52237
     16777   6128   5410   5853   9006   8909   8869  57345  30844  50350
     33554   6311   5427   5677   8946   8875   8887  54636  30033  49051
     67108   5789   5160   5698   8717   8458   8325  31347  24887  30860
    134217   5969   5138   5908   8688   8339   8362  26846  19845  26308
    268435   5922   5449   5779   8703   8608   8516   8510   8359   8558
    536870   6121   5090   5811   8700   8591   8421   8638   8525   8387
   1073741   6020   5481   5832   8596   8471   8584   8569   8334   8422
   2147483   6264   5630   6028   8825   8790   8835   8834   8699   8625


 Core 2 Duo - Caches L1 32 KB/CPU, L2 4096 KB shared

             1 thread             2 threads            64 threads

    Kbytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

         4  12572   6264  12613  19400  10834  20763   6272   4371   8553
         8  12534   6337  12613  22252  11696  22188   5644   3760   6362
        16  12639   6364  12692  23343  12208  23485  10681   6790  15783
        32  12468   6318  12483  24406  12409  24317  13725   8918  15230
        65   9434   5733   9453  24573  12548  24520  17721  11117  18778
       131   9584   5770   9612  18347  11708  17212  21622  11731  22935
       262   9645   5824   9648  18389  11700  18191  23737  11938  23593
       524   9674   5834   9667  18286  11635  18153  24032  12329  24204
      1048   9696   5843   9684  18324  11663  18206  24579  12484  24682
      2097   9548   5777   9623  18326  11593  18203  24377  12261  24413
      4194   8188   5520   8302  13296  10585  13927  17971  11380  17734
      8388   4381   4219   4407   4701   4597   4664  17905  11450  17698
     16777   3788   3830   3847   3948   3921   3944  17803  11413  17646
     33554   3817   3827   3806   3903   3868   3893  17580  11294  17428
     67108   3845   3872   3856   3908   3888   3917  16531  10438  14364
    134217   3876   3856   3798   3886   3918   3922   9007   8009   9152
    268435   3885   3894   3889   3893   3885   3882   4092   4088   4102
    536870   3827   3816   3829   3923   3922   3918   3900   3893   3887


 Atom 1 CPU with Hyperthreading - Caches L1 24 KB, L2 512 KB
 get_phys_pages() and size - RAM Size  0.96 GB, Page Size 4096 Bytes 
 uname() - Linux, roy-Ubuntu-11, 2.6.38-8-generic 
 #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011, x86_64 

             1 thread             2 threads            64 threads

    Kbytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

         4   2112   1149   5817   3849   2085   6203   3329   1896   5237
         8   2117   1148   5922   3254   2106   6284   3519   1939   5653
        16   2113   1150   5938   3870   2085   6307   3608   1970   5882
        32   1742   1050   3764   3182   1879   4926   3682   1935   6005
        65   1770    869   3863   3157   1908   4728   3571   1975   5912
       131   1804   1050   3869   3173   1905   4652   3619   1992   6052
       262   1802   1043   3833   3181   1900   4711   3597   1977   6104
       524   1731   1017   3575   3021   1838   4440   3633   2002   6000
      1048   1656    983   2622   2668   1746   2656   3035   1787   4710
      2097   1652    962   2222   2087   1769   2059   3043   1822   4516
      4194   1655    946   2123   1969   1755   1950   3023   1807   4528
      8388   1538    981   2128   1958   1747   1956   2956   1788   4483
     16777   1571    979   2118   1948   1769   1948   2788   1711   3364
     33554   1661    969   2119   1957   1760   1921   1978   1652   2170
     67108   1606    986   2176   1930   1762   1951   1763   1644   1929
    134217   1660    975   2149   1932   1747   1966   1882   1672   1692
 




MP Memory Bus Speed

MPbusspeed64 and MPbusspeed32 are based on my old BusSpeed2K Benchmark and are essentially the same as the Windows Multithreading Version. Data is read using AND instructions at a range of data sizes covering caches and RAM. The program starts by reading words with 32 word address increments, then reduces the increment to eventually read all words sequentially. Speed reductions of around 50% at each higher increment suggest reading in bursts over the bus. This is normal for reading from RAM and is sometimes found reading cached data. The final results use SSE2 integer AND instructions to read the data into a 16 byte register, the 32 bit and 64 bit procedures using the same assembly code.
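The burst reading effect can be illustrated by counting how many of the words in each burst are actually used at a given address increment. This Python sketch is not taken from the benchmark. The function name is hypothetical and the 64 byte burst size is an assumption.

```python
def words_used_per_burst(increment_words, word_bytes, burst_bytes=64):
    """Number of words read by the program from each burst.

    Assumption: data arrives in bursts of burst_bytes, whether or not
    all of the words in the burst are read by the program.
    """
    stride_bytes = increment_words * word_bytes
    if stride_bytes >= burst_bytes:
        return 1  # one word used, the rest of the burst is wasted
    return burst_bytes // stride_bytes


if __name__ == "__main__":
    increments = (32, 16, 8, 4, 2, 1)
    print("64 bit words", [words_used_per_burst(i, 8) for i in increments])
    print("32 bit words", [words_used_per_burst(i, 4) for i in increments])
```

Under this assumption, counted speed doubles each time the number of words used per burst doubles, and stays flat where only one word is used. The 96 KB results for 64 bit words in the log below (2959, 2964, 2989, 5987, 11983, 21134) broadly follow that pattern.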

Again there is an input parameter to use 1, 2, 4, 8, 16, 32 or 64 threads, the default being the number of identified CPUs. This time, each thread reads all the data. Results are saved in busSpeedMP.txt, examples below being for the 3.0 GHz quad core Phenom. Using L1 and L2 caches, data transfer speed with 64 bit integers is around twice as fast as using 32 bit numbers, suggesting a CPU speed limitation. From burst reading, estimates of maximum RAM speed are 904 x 8 = 7232 MB/second and 448 x 16 = 7168 MB/s. Cache examples are - L2 2989 x 8 = 23912 MB/s and L3 1432 x 8 = 11456 MB/s.
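The estimates above multiply a counted speed by the number of words in a burst. A minimal Python sketch of that arithmetic follows, with a hypothetical function name and an assumed burst size of 64 bytes. It only applies to increments where one word is counted from each burst.

```python
def estimated_transfer_mbps(counted_mbps, word_bytes, burst_bytes=64):
    """Scale a counted speed up to the data assumed to cross the bus.

    Assumption: only one word per burst is counted by the program, but
    the whole burst of burst_bytes is transferred.
    """
    words_per_burst = burst_bytes // word_bytes
    return counted_mbps * words_per_burst


if __name__ == "__main__":
    print("RAM 64 bit Inc8wds ", estimated_transfer_mbps(904, 8))
    print("RAM 32 bit Inc16wds", estimated_transfer_mbps(448, 4))
    print("L2  64 bit Inc8wds ", estimated_transfer_mbps(2989, 8))
    print("L3  64 bit Inc8wds ", estimated_transfer_mbps(1432, 8))
```

The two RAM estimates are close to the 128 bit SSE2 speeds measured at the largest data size, where all data is read.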


 MP Bus Speeds 64 bit Version 1.0, 1 Threads, Fri Jun 17 16:51:46 2011

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    22196    22641    26239    26979    26580    26409    23762
       24    23507    23118    27665    27323    27081    25424    23850
       96     2959     2964     2989     5987    11983    21134    23868
      384     2917     2917     2869     5853    11732    21898    23264
      768     1362     1359     1352     2699     5408    10617    10803
     1536     1322     1293     1432     2856     5764    11098    12081
    16380      862      886      902     1745     3637     6019     7249
   131070      854      885      902     1777     3431     5853     6619
   393210      858      830      904     1757     3602     5995     7074

            64 bit words, Speed in MB/Second - MIPS divide by 8

                   End at Fri Jun 17 16:52:21 2011

 
 MP Bus Speeds 32 bit Version 1.0, 1 Threads, Fri Jun 17 16:45:12 2011

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     8318    12102    12848    13317    13413    13452    23033
       24     9876    12979    13877    13288    13436    13640    23876
       96     1495     1496     2675     5979    11081    13201    23852
      384     1209     1238     2454     4946     8816    12459    18874
      768      721      726     1480     3046     5638     9741    11846
     1536      699      699     1513     3032     5708     9722    11924
    16380      413      423      860     1805     2993     5022     7075
   131070      411      444      841     1793     3024     4881     7060
   393210      427      448      887     1826     3052     4946     6897

            32 bit words, Speed in MB/Second - MIPS divide by 4

                   End at Fri Jun 17 16:45:47 2011
 




MPbusSpeed Comparisons

Below are cache and RAM speeds obtained on the Quad Core 3.0 GHz Phenom, the 2.4 GHz Core 2 Duo and the 1.66 GHz Atom. These are for 1, 2 or 4 (based on identified CPUs) and 64 threads. Comparison of performance gains using multiple threads and 64 bit compilations versus 32 bit are best limited to the last two columns as the earlier burst reading tests produce all sorts of timing peculiarities. Although measured RAM speeds could be improved with multiple threads reading shared data from caches, those provided have mainly been confirmed with programs using dedicated data.

Phenom - some performance gains from cached data were only 3 times with 4 threads but nearer 4 times with 64 threads. Using 4 or 64 threads can double throughput on the memory bus. Maximum measured speed was 15.8 GB/second, compared with a specification of 21.3 GB/second, comprising 667 MHz x 2 (DDR) x 8 (bus width in bytes) x 2 (dual channel).

Core 2 Duo - generally achieved peak performance using two threads and, if anything, was slightly slower using a concurrency of 64. Compared with the Phenom, some tests came out much faster and others significantly slower.

Atom - Hyperthreading had little impact on performance via L1 cache but did provide improvements via L2 cache and RAM.
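The Phenom memory specification quoted above can be reproduced as follows. This Python sketch is only a worked calculation, with hypothetical function and parameter names, using the figures given in the text.

```python
def peak_memory_mbps(clock_mhz, transfers_per_clock, bus_bytes, channels):
    """Theoretical peak memory speed in MB/second."""
    return clock_mhz * transfers_per_clock * bus_bytes * channels


if __name__ == "__main__":
    # 667 MHz x 2 (DDR) x 8 bytes (bus width) x 2 (dual channel)
    peak = peak_memory_mbps(667, 2, 8, 2)
    measured = 15800  # maximum measured speed of 15.8 GB/second
    print("Peak MB/second     ", peak)
    print("Measured / peak    ", round(measured / peak, 2))
```

On these figures, the maximum measured speed is roughly three quarters of the specified maximum.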

L1 Cache Results in MBytes/Second - 6 KB

                  CPUs/   MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs           32wds   16wds    8wds    4wds    2wds     All    SSE2

 Phenom II 32b    4/0    3000    8318   12102   12848   13317   13413   13452   23033
   4 Threads                     3901    7614   14703   28644   29313   34882   74424
  64 Threads                    10098   14599   20588   30042   32606   37702   76707
 Core 2 Duo 32b   2/0    2400    8069    8772    9036    9283    9369    9390   37413
   2 Threads                    13921   16387   16361   16996   18147   18380   61694
  64 Threads                    15315   16271   17162   17587   17848   18004   60044
 Atom N455 32b    1/1    1667    5092    5813    5959    6272    6290    6289   24663
   2 Threads                     5638    6157    6323    6450    6439    6488   25608
  64 Threads                     5011    5655    5609    5785    5867    5610   22419

 64 Bit Version

 Phenom II 64b    4/0    3000   22196   22641   26239   26979   26580   26409   23762
   4 Threads                     4478   17301   15108   29950   58038   58706   76500
  64 Threads                    19577   43542   35782   52893   74743   73745   76027
 Core 2 Duo 64b   2/0    2400   15931   17513   18140   18542   18715   18813   37391
   2 Threads                    30486   32243   35126   36209   35493   35615   73979
  64 Threads                    29474   32288   33585   34516   34670   35792   71354
 Atom N455 64b    1/1    1667    9004   10592   11500   12051   12565   12735   24731
   2 Threads                    10224   11743   12283   12767   12948   13053   25668
  64 Threads                     8574   10100   11089   10638   11905   11649   21789

L2 Cache Results in MBytes/Second - 96 KB

                  CPUs/   MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs           32wds   16wds    8wds    4wds    2wds     All    SSE2

 Phenom II 32b    4/0    3000    1495    1496    2675    5979   11081   13201   23852
   4 Threads                     4648    5085    8422   19230   33948   39486   74050
  64 Threads                     5247    5317   10388   21080   36499   45574   89766
 Core 2 Duo 32b   2/0    2400    2065    2044    3275    4562    6700    8095   19153
   2 Threads                     3138    3050    5218    7776   11911   15415   32030
  64 Threads                     3099    2963    5209    7617   11721   15081   31010
 Atom N455 32b    1/1    1667     505     415     788    1481    2577    3915    5909
   2 Threads                      597     665    1243    2285    3657    4660    8598
  64 Threads                      455     534     996    1861    3133    4100    7417

 64 Bit Version

 Phenom II 64b    4/0    3000    2959    2964    2989    5987   11983   21134   23868
   4 Threads                     4478   17301   15108   29950   58038   58706   76500
  64 Threads                    10290   10607   10548   20940   40361   75398   89396
 Core 2 Duo 64b   2/0    2400    4171    4170    4098    6715    9120   13430   19157
   2 Threads                     6322    6435    6127   10810   15448   23521   32104
  64 Threads                     6090    6237    5972   10720   15200   23342   31444
 Atom N455 64b    1/1    1667     993    1020     833    1564    2954    5156    5919
   2 Threads                     1066    1127    1386    2501    4608    7240    8722
  64 Threads                     1003    1033    1055    1991    3674    6250    7222

L3 Cache Results in MBytes/Second - 1536 KB

                  CPUs/   MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs           32wds   16wds    8wds    4wds    2wds     All    SSE2

 Phenom II 32b    4/0    3000     699     699    1513    3032    5708    9722   11924
   4 Threads                     2407    2543    4943   10058   17570   29261   41159
  64 Threads                     2541    2571    5022   10100   18811   30768   41078

 64 Bit Version

 Phenom II 64b    4/0    3000    1322    1293    1432    2856    5764   11098   12081
   4 Threads                     5112    4899    5083   10018   19866   37136   38700
  64 Threads                     5092    5101    5101   10051   20203   37850   41309

RAM Results in MBytes/Second - 128 MB

                  CPUs/   MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs           32wds   16wds    8wds    4wds    2wds     All    SSE2

 Phenom II 32b    4/0    3000     411     444     841    1793    3024    4881    7060
   4 Threads                      786     813    1605    3444    6259   12161   14950
  64 Threads                      891     969    1869    3678    6930   12678   15564
 Core 2 Duo 32b   2/0    2400     353     399     814    1467    2725    5021    5842
   2 Threads                      621     808    1181    2217    4108    7686    9952
  64 Threads                      395     598    1080    1947    3541    6540    8133
 Atom N455 32b    1/1    1667     122     256     514    1029    1978    3256    4122
   2 Threads                      131     318     684    1312    2434    4189    5307
  64 Threads                      125     265     577    1159    2280    4435    4636

 64 Bit Version

 Phenom II 64b    4/0    3000     858     830     904    1757    3602    5995    7074
   4 Threads                     1561    1648    1701    3330    6964   13025   14027
  64 Threads                     1808    1854    1947    3773    7488   14176   15516
 Core 2 Duo 64b   2/0    2400     699     711     803    1622    2946    5414    5861
   2 Threads                     1210    1336    1632    2436    4635    7947   10028
  64 Threads                      706     813    1226    2177    3919    7014    8127
 Atom N455 64b    1/1    1667     125     256     514    1038    2024    3994    4057
   2 Threads                      136     294     677    1327    2530    4924    5230
  64 Threads                      142     249     523    1129    2312    4565    4639




MP Memory Random Access

MPrandmem64 and MPrandmem32 are based on my old RandMem Benchmark and are essentially the same as the Windows Multithreading Version, except there are added tests identified as Mutex SRW and Mutex RRW. The program uses the same code for serial and random access via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM using 1 to 64 threads, using a run time parameter, the default being equal to the number of identified CPUs. This benchmark uses data from the same array for all threads, but starting at different points. Results are saved in file randmemMP.txt. In this case, both the 64 bit and 32 bit versions use 32 bit integer data arrays.
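One simple way of using the same code for serial and random access is to read the data through an index array that is either in order or shuffled. The Python sketch below is an assumption made for illustration and is not the indexing structure used by the benchmark. All names are hypothetical.

```python
import random


def make_index(size, randomised, seed=1):
    """Return positions 0..size-1, in order or in a shuffled order."""
    index = list(range(size))
    if randomised:
        random.Random(seed).shuffle(index)
    return index


def read_via_index(data, index):
    """Read every word of data in the order given by index."""
    total = 0
    for position in index:
        total += data[position]
    return total


if __name__ == "__main__":
    data = list(range(1536))
    serial = make_index(len(data), randomised=False)
    scattered = make_index(len(data), randomised=True)
    # Same code and same words read, only the order of access differs
    print(read_via_index(data, serial) == read_via_index(data, scattered))
```

Both orders read exactly the same words, so any difference in speed comes from the access pattern alone.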

Below are logged results on a 3.0 GHz quad core Phenom using one and four threads. On serial and random read only tests, performance gains are up to four times using dedicated caches, with the smaller data sizes slower due to overheads. Random reading is much slower than serial data transfers, as burst reading leads to more data being transferred than is requested. With reading and writing, there is a possibility that data can be corrupted when more than one thread updates the same data. Although it cannot be proven with this benchmark, it seems that the Operating System does not allow shared data to be updated in local caches and flushes them to update in shared data areas, producing significant performance degradation, particularly on random access.

The extra tests with Mutex, or mutual exclusion, functions avoid the updating conflict by only allowing one thread at a time to access common data. This can still lead to four threads being slower than one but, with random access, there can be a significant improvement compared with multiple thread speeds without Mutex protection, except when accessing RAM.
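The mutual exclusion pattern can be sketched as below. This uses the Python standard library threading.Lock purely to show the locking pattern and is not the benchmark's code, which is a compiled program. The function names are hypothetical, and Python threads give no guide to the speeds reported here.

```python
import threading


def update_shared(shared, lock, repeats):
    """Update common data, allowing one thread at a time inside the lock."""
    for _ in range(repeats):
        with lock:
            shared[0] += 1


def run_locked_updates(threads, repeats):
    """Run the same update loop in several threads on the same data."""
    shared = [0]
    lock = threading.Lock()
    workers = [
        threading.Thread(target=update_shared, args=(shared, lock, repeats))
        for _ in range(threads)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return shared[0]


if __name__ == "__main__":
    # No update is lost, as only one thread changes the data at a time
    print(run_locked_updates(threads=4, repeats=10000))
```

The cost of this arrangement is that the threads queue for the lock, which is consistent with the Mutex tests using four threads not running faster than with one thread.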


    RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sun Jun 26 18:01:19 2011
 
               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    15757   15827   12155   11879    8534    8511    4392    4385
 Serial RW     9263    9534    8875    8868    7591    7493    3740    3601
 Random RD    14396   14271    7504    3159    2269    1751     622     341
 Random RW     9231    9510    6136    2993    2087    1507     532     319
 Mutex SRW     9264    9534    8875    8869    7591    7492    3740    3608
 Mutex RRW     9231    9510    6138    2993    2087    1507     532     320


    RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sun Jun 26 18:00:21 2011
 
               ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    29630   53166   44120   44829   29620   29671   12108   11987
 Serial RW     5040    7334    7442    7402    7353    7395    8532    6247
 Random RD    28388   41211   27807   12265    8866    6611    2103    1271
 Random RW      657    1096    1229    1283    1288    1376    1648     993
 Mutex SRW     5962    8654    7998    7882    6982    6853    3579    3415
 Mutex RRW     6243    8594    5838    2815    1970    1370     486     310
 




MPrandmem Comparisons

Below, again, are cache and RAM speeds obtained on the Quad Core 3.0 GHz Phenom, the 2.4 GHz Core 2 Duo and the 1.66 GHz Atom. These are for 1, 2 or 4 and 64 threads plus others for the Phenom at 64 bits.

L1 Cache - Probably due to the overheads involved, performance using 64 threads is noticeably slow, with the Core 2 Duo performing better than the quad core Phenom when writing is involved. Hyperthreading does not lead to much performance gain on the Atom.

L2 Cache - Reading tests can show appropriate performance gains using all processors and not as much degradation with reading and writing. Although dealing with 32 bit integers, and unlike the Phenom, the Core 2 Duo and Atom produce much faster speeds with the 64 bit version using 64 threads. The Mutex tests produce different performance characteristics than when using L1 cache.

RAM - There are some performance gains with multiple threads making better use of memory bandwidth and excessive numbers of threads do not necessarily reduce performance.

L1 Cache Results in MBytes/Second - 6 KB

                  CPUs/   MHz  Serial  Serial  Random  Random   Mutex   Mutex
                  HTs              RD      RW      RD      RW     SRW     RRW

 Phenom II 32b    4/0    3000   15791   11495   14746   11409   11478   11394
   4 Threads                    29613    5014   29595     610    7152    7569
  64 Threads                     2321     188    2234      42     257     255
 Core 2 Duo 32b   2/0    2400    6327    8474    6314    8305   12642    8306
   2 Threads                    13285    3559   13432    1312    6935    9327
  64 Threads                      800     452     802      93     346     433
 Atom N455 32b    1/1    1667    3500    4742    4422    5028    5032    5022
   2 Threads                     4902    4770    4895    1153     677    3101
  64 Threads                      307     296     301      69      55     207

 64 Bit Version

 Phenom II 64b    4/0    3000   15757    9263   14396    9231    9264    9231
   4 Threads                    29630    5040   28388     657    5962    6243
   8 Threads                    14933    2120   14892     338    2465    2867
  16 Threads                     8514     846    8284     174     910    1041
  64 Threads                     2247     189    2173      45     225     214
 Core 2 Duo 64b   2/0    2400    9579   12619    6385    7720   12623    7600
   2 Threads                    14257    3553   14073    1505    7018    7718
  64 Threads                      875     893     875     112     348     358
 Atom N455 64b    1/1    1667    3838    4222    3834    4222    4233    4215
   2 Threads                     4438    4779    4481    1218     970    3130
  64 Threads                      285     291     281      68      42     167

L2 Cache Results in MBytes/Second - 96 KB

                  CPUs/   MHz  Serial  Serial  Random  Random   Mutex   Mutex
                  HTs              RD      RW      RD      RW     SRW     RRW

 Phenom II 32b    4/0    3000   12476   10387    7552    6241   10385    6241
   4 Threads                    45484    7488   27712    1238    9645    5810
  64 Threads                    31650    7033   22035    1162    1989    1431
 Core 2 Duo 32b   2/0    2400    5228    5892    4245    3852    5935    2619
   2 Threads                    15026   16440    7009    2896    7009    3054
  64 Threads                     3304    2662    3268     497    1300    1506
 Atom N455 32b    1/1    1667    2768    3349     855    1175    3464    1173
   2 Threads                     4642    4424    1317    1570    2805     966
  64 Threads                     1177    1138    1118     665     423     584

 64 Bit Version

 Phenom II 64b    4/0    3000   12155    8875    7504    6136    8875    6138
   4 Threads                    44120    7442   27807    1229    7998    5838
   8 Threads                    42685    7413   27567    1240    6875    4867
  16 Threads                    42004    7443   27870    1237    5335    3760
  64 Threads                    30435    7057   21892    1157    1686    1329
 Core 2 Duo 64b   2/0    2400    6234    5992    4320    3777    5947    3779
   2 Threads                    14542   15153    7145    2932    7190    3113
  64 Threads                    11741   12994    6426    2767    4714    2317
 Atom N455 64b    1/1    1667    2813    3064     845    1103    3063    1122
   2 Threads                     4613    4576    1352    1615    3044    1111
  64 Threads                     3551    3506    1179    1312    1759     874

L3 Cache Results in MBytes/Second - 1536 KB

                  CPUs/   MHz  Serial  Serial  Random  Random   Mutex   Mutex
                  HTs              RD      RW      RD      RW     SRW     RRW

 Phenom II 32b    4/0    3000    8756    7918    1743    1505    7919    1503
   4 Threads                    29961    7491    6617    1332    7414    1391
  64 Threads                    30159    7763    6643    1333    2394     472

 64 Bit Version

 Phenom II 64b    4/0    3000    8511    7493    1751    1507    7492    1507
   4 Threads                    29671    7395    6611    1376    6853    1370
   8 Threads                    29330    7549    6558    1342    6229    1234
  16 Threads                    29827    7627    6623    1361    4763     988
  64 Threads                    29812    7733    6650    1337    2163     473

RAM Results in MBytes/Second - 96 MB

                  CPUs/   MHz  Serial  Serial  Random  Random   Mutex   Mutex
                  HTs              RD      RW      RD      RW     SRW     RRW

 Phenom II 32b    4/0    3000    4407    3845     344     320    3826     320
   4 Threads                    12009    6305    1273     994    3615     308
  64 Threads                    12010    6641    1298    1003    2881     308
 Core 2 Duo 32b   2/0    2400    4803    2147     449     282    2232     310
   2 Threads                     5492    2512     621     401    2239     308
  64 Threads                     6206    2567     635     411    2294     304
 Atom N455 32b    1/1    1667    2253    1159      38      54    1275      54
   2 Threads                     3926    1257      63      79    1109      42
  64 Threads                     3951    1274      64      79    1322      54

 64 Bit Version

 Phenom II 64b    4/0    3000    4385    3601     341     319    3608     320
   4 Threads                    11987    6247    1271     993    3415     310
   8 Threads                    11930    6248    1274     990    3321     307
  16 Threads                    11860    6557    1281     997    2777     304
  64 Threads                    11863    6651    1288    1002    2774     302
 Core 2 Duo 64b   2/0    2400    3971    2141     416     283    2148     314
   2 Threads                     5448    2508     632     404    2425     298
  64 Threads                     6065    2612     639     416    2318     305
 Atom N455 64b    1/1    1667    2284    1286      39      54    1298      54
   2 Threads                     3717    1241      62      78    1336      54
  64 Threads                     3785    1270      64      78    1122      42




Roy Longbottom July 2011

The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection