Linux PC Benchmarks

Roy Longbottom

Contents


General Configuration Details 32-Bit and 64-Bit Differences
Classic Benchmarks Classic Benchmark Results Maximum CPU Speeds
OpenMP Benchmark MemSpeed Benchmark BusSpeed Benchmark
RandMem Benchmark SSEfpu Benchmark nVidia CUDA Benchmarks
Disk, Bus and LAN Benchmarks Burn-In and Reliability Apps Multithreading Benchmarks
Image Processing Benchmarks OpenGL Benchmark On-Line Benchmarks
Booting Time



General

Both 32-Bit and 64-Bit versions of Ubuntu Linux were installed on an eSATA/USB hard disk and on USB Flash drives, to compile and assemble existing PC benchmarks via the compiler and assembler that are included in the package. The booting method used also enabled loading Ubuntu on a range of different PCs and laptops.

The benchmark programs, including source code and compile/link commands, are compressed in .tar.gz format. Copy the file to your home directory or a subdirectory for extraction, then examine the README file for further directions. The benchmarks are simple execution files and do not need installing. The first ones run in a Terminal window via the normal ./name command or by clicking on a shell script containing the commands. Details are displayed while the tests are running and performance results are saved in a .txt file.

The benchmarks were recompiled under Ubuntu 14.04 using GCC 4.8.2, which can handle later Intel CPU instructions, including AVX1, and results are included below. Where recompiled benchmarks produced significantly different results from the older ones, they are available in AVX_benchmarks.tar.gz. This also contains source code with changes that enable error free compiling and correct execution. Further details are in Linux AVX benchmarks.htm.

Latest results are for a quad core/8 thread 3.7 GHz Core i7 4820K with 10 MB L3 cache, normally running at the Turbo Boost speed of 3.9 GHz. It has 32 GB DDR3 RAM on 4 memory channels, with a maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width in bytes), or 51.2 GB/second.
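The memory bandwidth arithmetic above can be reproduced with a few lines of C. This is only a sketch of the calculation; the function names are made up for illustration and are not part of the benchmark sources.

```c
#include <stdio.h>

/* Theoretical peak memory bandwidth in GB/second.
   bus_mhz             - memory bus clock in MHz
   transfers_per_clock - 2 for DDR memory
   channels            - number of memory channels
   bytes_wide          - bus width per channel in bytes */
double peak_bandwidth_gbs(double bus_mhz, int transfers_per_clock,
                          int channels, int bytes_wide)
{
    double mb_per_sec = bus_mhz * transfers_per_clock * channels * bytes_wide;
    return mb_per_sec / 1000.0;   /* MB/second to GB/second (decimal) */
}

/* Prints the figure for the Core i7 4820K configuration quoted above. */
void print_i7_bandwidth(void)
{
    printf("%.1f GB/second\n", peak_bandwidth_gbs(800.0, 2, 4, 8));
}
```

Calling peak_bandwidth_gbs(800.0, 2, 4, 8) gives the 51.2 GB/second specification figure.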



Configuration Details

All benchmarks include the same configuration details, some of which are produced via assembly language code. Example details shown are for an Intel Core i7 and an AMD Phenom quad core processor via 32-Bit Ubuntu, and an Intel Core 2 Duo using the 64-Bit version.

######################################################################

  Assembler CPUID and RDTSC       
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4 
         Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz 
  Measured - Minimum 3711 MHz, Maximum 3711 MHz 
  Linux Functions 
  get_nprocs() - CPUs 8, Configured CPUs 8 
  get_phys_pages() and size - RAM Size 31.51 GB, Page Size 4096 Bytes 
  uname() - Linux, roy-WD32, 2.6.35-24-generic-pae 
  #42-Ubuntu SMP Thu Dec 2 03:21:31 UTC 2010, i686 

  Assembler CPUID and RDTSC       
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42 
  AMD Phenom(tm) II X4 945 Processor 
  Measured - Minimum 2978 MHz, Maximum 3008 MHz 
  Linux Functions 
  get_nprocs() - CPUs 4, Configured CPUs 4 
  get_phys_pages() and size - RAM Size  7.88 GB, Page Size 4096 Bytes 
  uname() - Linux, roy-C2D, 2.6.35-22-generic-pae 
  #35-Ubuntu SMP Sat Oct 16 22:16:51 UTC 2010, i686 

  Assembler CPUID and RDTSC       
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6 
  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz 
  Measured - Minimum 2407 MHz, Maximum 2407 MHz 
  Linux Functions 
  get_nprocs() - CPUs 2, Configured CPUs 2 
  get_phys_pages() and size - RAM Size  3.87 GB, Page Size 4096 Bytes 
  uname() - Linux, roy-64Bit, 2.6.35-22-generic 
  #33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010, x86_64 

  
  Identified with Fedora Linux

  uname() - Linux, localhost.localdomain, 2.6.34.7-61.fc13.x86_64 
  #1 SMP Tue Oct 19 04:06:30 UTC 2010, x86_64 

######################################################################
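The Linux functions named in the listing above are standard glibc calls, so the second half of each configuration report could be produced along the following lines. This is a minimal sketch, not the benchmark's own code: the function names print_linux_config and ram_size_gb are hypothetical, the RAM size is assumed to be reported in units of 1024 x 1024 x 1024 bytes, and the CPUID/RDTSC assembler part is not covered.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/sysinfo.h>
#include <sys/utsname.h>

/* Physical RAM in GB, assuming 1 GB = 1024 x 1024 x 1024 bytes. */
double ram_size_gb(void)
{
    long page = sysconf(_SC_PAGESIZE);
    return (double)get_phys_pages() * (double)page
           / (1024.0 * 1024.0 * 1024.0);
}

/* Prints details in a layout similar to the listing above. */
void print_linux_config(void)
{
    struct utsname u;

    printf("  get_nprocs() - CPUs %d, Configured CPUs %d\n",
           get_nprocs(), get_nprocs_conf());
    printf("  get_phys_pages() and size - RAM Size %5.2f GB, "
           "Page Size %ld Bytes\n",
           ram_size_gb(), sysconf(_SC_PAGESIZE));
    if (uname(&u) == 0) {
        printf("  uname() - %s, %s, %s\n", u.sysname, u.nodename, u.release);
        printf("  %s, %s\n", u.version, u.machine);
    }
}
```

The last field printed (u.machine) is the one that shows i686 for the 32-Bit installations and x86_64 for the 64-Bit ones.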



32-Bit and 64-Bit Differences

The main advantage of 64-Bit working is that the amount of main memory that can be installed and accessed is much larger than with 32-Bit operation. The downside can be worse performance if integer array variables are defined as 64 bits, leading to twice the data volumes being read and written.

The original x87 floating point instructions are not generated by default in 64-Bit compilations. Instead, SSE instructions are used for 32-Bit Single Precision (SP) floating point numbers and SSE2 for 64-Bit Double Precision (DP). These are potentially Single Instruction Multiple Data (SIMD) instructions, where four SP results or two DP results can be produced per clock cycle or, with linked adds and multiplies, eight or four results. Unfortunately, it seems that only Single Instruction Single Data (SISD) operations are issued, where only one number is used in the 128 bit registers, and this can lead to slower performance than a program compiled for 32-Bits with x87 instructions.

The main performance gain at 64-Bits appears to come from the provision of twice as many general purpose and SSE registers which, with optimisation options, provide faster speeds by reducing the need to save and reload variables that involve access to slower memory.

Some of these effects, for better and for worse, are reflected in the tables below.



Classic Benchmarks

The Classic Benchmarks are the first programs used to measure relative performance of computers. They are:

Livermore Kernels (Livermore Loops) - Produced for the first supercomputers and comprising 14 kernels in 1970, then 24 in the 1980s. The 24 kernels are run at three different data sizes. Results are in Millions of Floating Point Operations Per Second (MFLOPS) with one measurement for each kernel and some overall figures, where Geometric Mean is the official overall rating.

Whetstone Benchmark - the first general purpose benchmark that set industry standards of performance, particularly for minicomputers, and introduced in 1972. The benchmark produced speed ratings in terms of Thousands of Whetstone Instructions Per Second (KWIPS). In 1978, self timing versions (by yours truly) produced speed ratings, for each of the eight test procedures, in MOPS (Millions of Operations Per Second) or MFLOPS, with an overall rating in MWIPS.

Dhrystone Benchmarks 1.1 and 2.1 - The Dhrystone benchmark, a sort of Whetstone without floating point, became the key standard benchmark, from 1984, with the growth of Unix systems. The second version (2.1) was produced to avoid over-optimisation problems encountered with version 1.1. Original performance ratings were in terms of Dhrystones per second. This was later changed to VAX MIPS by dividing Dhrystones per second by 1757, the DEC VAX 11/780 result.

Linpack Benchmark - This benchmark was produced from the "LINPACK" package of linear algebra routines. It became the primary benchmark for scientific applications from the mid 1980's with a slant towards supercomputer performance, with speed measured in MFLOPS.
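The Dhrystone conversion to VAX MIPS described above is a single division by the DEC VAX 11/780 score. As a sketch, with a function name chosen here purely for illustration:

```c
/* Dhrystones per second measured on the DEC VAX 11/780. */
#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0

/* Converts a Dhrystones per second score to a VAX MIPS rating. */
double vax_mips(double dhrystones_per_second)
{
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SEC;
}
```

A machine rated at 1757 Dhrystones per second therefore scores 1.0 VAX MIPS.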

Further details and references can be found in classic.htm

On starting execution, the programs go through a calibration phase to determine the number of passes needed to run for more than 2 seconds with Dhrystone, 1 second for each of 8 tests with Linpack, 1 second for each of 72 tests with Livermore Loops and 10 seconds overall with Whetstone. Displayed results demonstrate that running time is proportional to the number of passes.
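A calibration phase of this kind can be sketched as follows. This is an assumption about the general approach (doubling the pass count until the target time is reached), not a copy of the benchmark code; one_pass is a hypothetical stand-in for the real test procedure and clock() measures CPU time.

```c
#include <time.h>

static volatile double sink;   /* stops the compiler discarding the work */

/* Hypothetical stand-in for one pass of a benchmark test. */
static void one_pass(void)
{
    double x = 1.0;
    int i;
    for (i = 0; i < 100000; i++)
        x = x * 1.0000001 + 0.0000001;
    sink = x;
}

/* CPU seconds taken to run the given number of passes. */
static double time_passes(long passes)
{
    clock_t start = clock();
    long p;
    for (p = 0; p < passes; p++)
        one_pass();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Doubles the pass count until a run takes at least target_secs.
   Returns the pass count; the time of the final run is stored in
   *measured_secs when that pointer is not null. */
long calibrate_passes(double target_secs, double *measured_secs)
{
    long passes = 1;
    double secs = time_passes(passes);

    while (secs < target_secs) {
        passes *= 2;
        secs = time_passes(passes);
    }
    if (measured_secs)
        *measured_secs = secs;
    return passes;
}
```

For example, calibrate_passes(2.0, &secs) would find a pass count that runs for at least 2 seconds, as used for Dhrystone.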

For the benchmark execution codes and source files, download classic_benchmarks.tar.gz. Four execution files are provided for each benchmark. They comprise 32-Bit and 64-Bit compilations, non-optimised and optimised varieties. On downloading to Windows, the file appeared as classic_benchmarks.tar.tar but seemed to be fine with the name changed to classic_benchmarks.tar.gz.



Classic Benchmark Results

Results of these Linux based benchmarks are included with those run via Windows in the following reports. Some examples are given below, all using 1 CPU, for a 2.4 GHz Core 2 Duo and, from 2014, a 3.7 GHz Core i7 running at the Turbo Boost speed of 3.9 GHz.

The benchmarks were recompiled under Ubuntu 14.04 using GCC 4.8.2, which can handle later Intel CPU instructions, including AVX1, and results are shown below (New x64 with SSE/SSE2 and New AVX). The Core i7 maximum speed in GFLOPS per core (4 available) is 3.9 GHz x 4 (SSE single precision) x 2 (with multiply and add) or 31.2 GFLOPS, and 62.4 using AVX1. Using double precision, the best possible scores are 15.6 and 31.2 GFLOPS respectively.

The only real beneficiary of the recompilation is the Linpack benchmark via AVX options. Some of the Livermore Loops should benefit, given the really simple structure used, but this is presently beyond the capabilities of the compiler.
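The peak per core figures quoted above follow from clock speed x numbers per register x operations per number. A small sketch of that arithmetic, with an illustrative function name that does not appear in the benchmarks:

```c
/* Theoretical peak GFLOPS for one core.
   ghz          - CPU clock in GHz
   simd_lanes   - numbers held in one register
                  (SSE: 4 SP or 2 DP, AVX1: 8 SP or 4 DP)
   ops_per_lane - 2 when a multiply and an add can be linked, else 1 */
double peak_gflops(double ghz, int simd_lanes, int ops_per_lane)
{
    return ghz * simd_lanes * ops_per_lane;
}
```

With the 3.9 GHz Turbo Boost clock, peak_gflops(3.9, 4, 2) is 31.2 (SSE single precision) and peak_gflops(3.9, 8, 2) is 62.4 (AVX1 single precision); halving the lane counts gives the double precision values of 15.6 and 31.2.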

Dhrystone Benchmark Results On PCs

Whetstone Benchmark Results On PCs

Linpack Benchmark Results On PCs

Livermore Loops Benchmark Results On PCs



                     Whetstone Benchmark Optimised

          MWIPS  MFLOP  MFLOP  MFLOP    COS    EXP  FIXPT    IF   EQUAL
                    1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

                              2.4 GHz Core 2

 32 Bit    2280    815    811    576   56.5   22.6   4011   7413   3651
 64 Bit    2560    865    885    589   65.7   29.1   3851   5314   1078

                          3.7 GHz (TB 3.9) Core i7 

 32 Bit    3959   1331   1331    938     97   42.1   6516  10967   5851
 64 Bit    4880   1331   1324    977    129   64.2   6517  11657   1812
 New x64   4891   1330   1323    977    120   64.5   6505  11638   3903
 New AVX   4897   1325   1323    977    120   64.5   6515  11649   3909



                   Livermore Loops MFLOPS 24 Kernels Optimised
           Loop
            1     2     3     4     5     6     7     8     9    10    11    12
           13    14    15    16    17    18    19    20    21    22    23    24

                              2.4 GHz Core 2

 32 Bit  1953  1223  1584  1534   343  1238  2192  2385  2147  1187   795   479
          161   396   276   956  1368   959   509   385  1385   165  1182   560
 64 Bit  1702  1340  1593  1531   341  1199  2422  3060  2057   770   798   861
          481   673   444   992  1029  1222   461   423  1251   351  1184   819

                          3.7 GHz (TB 3.9) Core i7 

 32 Bit  4327  3661  2622  2642   527  2250  4217  5549  5223  2511  1311  1279
          450  1036   730  2038  2479  2835   810   783  2820   419  2022   967
 64 Bit  4707  3434  2629  2657   565  2155  4592  6131  5442  2602  1314  1296
          937  1239  2288  2293  2392  3538   839   968  2792   939  2034  1720
 New x64 4729  3422  2639  2657   565  2164  4599  5714  4984  2446  1310  1879
         1018  1267  2287  2012  2397  5343   836   969  3042   940  2011  1840
 New AVX 4692  3488  2638  2654   564  2160  4471  5717  4978  2619  1308  1863
          978  1305  2285  2043  2492  6418   836   968  3069   938  2010  1558


            Dhrystone                            Linpack

            Dhry1  Dhry1  Dhry2  Dhry2
            NoOpt    Opt  NoOpt    Opt
              VAX    VAX    VAX    VAX           No Opt       Opt
             MIPS   MIPS   MIPS   MIPS           MFLOPS    MFLOPS   

                              2.4 GHz Core 2

 32 Bit      3428  13599   3348   5852              404      1288
 64 Bit      3643  18738   3288  12265              378      1577

                          3.7 GHz (TB 3.9) Core i7 

 32 Bit      7108  29277   7478  16356              988      2534
 64 Bit      8436  32659   8481  23607              900      3672
 New x64     8441  32499   8381  24140              946      3631
 New AVX     8441  32575   8395  23626              935      5413





Maximum CPU Speeds

Benchmarks whatcpu32 and whatcpu64 are essentially the same as cpuid and cpuid64, produced for Windows, with description and results in WhatCPU results.htm. The programs were written with a view towards demonstrating maximum CPU performance executing all types of arithmetic instructions. The execution files and source code are available for download in max_cpu_speeds.tar.gz.

The benchmark programs use assembler level instructions, including full SIMD operations where appropriate, to simply add values via 1, 2, 3 and 4 registers. Results are in MIPS and MFLOPS, millions of adds per second in both cases. The programs also check that the end totals are correct. The 32 bit version adds 32 bit integers, then 32 bit single precision and 64 bit double precision floating point numbers using the original x87 instructions. This is followed by adding 32 bit integers using MMX and SSE2 instructions and 64 bit integers also using SSE2 functions. Finally there are 32 bit floating point additions using SSE instructions plus 3DNow, using AMD processors, and 64 bit floating point sums with SSE2 operations.

MMX, x87 and 3DNow instructions are not available at 64 bit working, but normal integer instructions are provided to use 64 bit numbers which, in the case of this register based program, mainly run at the same speed as with 32 bit arithmetic.

Results below are for an AMD Phenom X4 and Intel Core 2 Duo, using one CPU in each case. These suggest three integer adds and two 64 bit MMX operations can be executed per clock cycle. Then SSE/SSE2 floating point calculation speed is based on one 128 bit register dealt with per cycle. Best is eight 32 bit SSE integer adds per cycle. Here, the AMD processor appears to be more efficient than the Intel CPU, but later Intel i7 32 bit and 64 bit results correct some of this anomaly.

Results from a later Core i7 are also shown. This CPU has AVX1 instructions included, with 256 bit registers, producing up to eight 32 bit floating point results per CPU cycle on addition (31.2 GFLOPS at 3.9 GHz), and twice this with linked multiply and add instructions. The latter were included in a new AVX test (AVXid64), demonstrating 62 GFLOPS at 3.9 GHz. Details of the latter can be found in Linux AVX benchmarks.htm.

  
  Word                         32 bit OS Version             64 bit OS Version
  Size                    1 Reg  2 Reg  3 Reg  4 Reg    1 Reg  2 Reg  3 Reg  4 Reg

        Core i7  3.7 GHz 
        at up to 3.9 GHz 
        via Turbo Boost
 
 32 bit  Integer   MIPS    4301   8551  11994  12292     4302   8559  11996  12293
 64 bit  Integer   MIPS      -      -      -      -      4302   8553  11996  12293
 32 bit  x87     MFLOPS    1303   2607   3865   3864       -      -      -      -
 64 bit  x87     MFLOPS    1303   2607   3865   3864       -      -      -      -
 32 bit  MMX Int   MIPS    7822  14900  14932  14900       -      -      -      -
 32 bit  SSE2 Int  MIPS   15642  29800  29868  29800    15643  29802  29870  29805
 64 bit  SSE2 Int  MIPS    7821  14899  14934  14900     7822  14901  14935  14901
 32 bit  SSE     MFLOPS    5214  10427  15459  15457     5214  10429  15460  15459
 64 bit  SSE2    MFLOPS    2607   5214   7730   7729     2607   5215   7730   7729
 32 bit  3DNow   MFLOPS      -      -      -      -        -      -      -      -
 32 bit  AVX1    MFLOPS      -      -      -      -     10430  20860     -   30920
 64 bit  AVX1    MFLOPS      -      -      -      -      5210  10430     -   15460
 32 bit  AVX1 +* MFLOPS      -      -      -      -        -      -      -   62000
 64 bit  AVX1 +* MFLOPS      -      -      -      -        -      -      -   31000

 
        Phenom II X4
         3.0 GHz
 
 32 bit  Integer   MIPS    3314   6629   8664   9040     3315   6629   8664   9040
 64 bit  Integer   MIPS      -      -      -      -      3315   6629   7701   8287
 32 bit  x87     MFLOPS     753   1506   2259   3013       -      -      -      -
 64 bit  x87     MFLOPS     753   1506   2259   3013       -      -      -      -
 32 bit  MMX Int   MIPS    3012   6026   9036  12054       -      -      -      -
 32 bit  SSE2 Int  MIPS    6024  12050  18073  24107     6025  12053  18081  24107
 64 bit  SSE2 Int  MIPS    3012   6025   9037  12053     3013   6027   9040  12053
 32 bit  SSE     MFLOPS    3012   6024   9037  12050     3013   6025   9040  12053
 64 bit  SSE2    MFLOPS    1506   3012   4518   6025     1506   3013   4519   6027
 32 bit  3DNow   MFLOPS    1506   3012   4518   6025       -      -      -      -


         Core 2 Duo 
         2.4 GHz

 32 bit  Integer   MIPS    2629   4915   5356   6605     2601   4410   5226   6606
 64 bit  Integer   MIPS      -      -      -      -      2612   3908   5525   5285
 32 bit  x87     MFLOPS     801   1601   2402   2402       -      -      -      -
 64 bit  x87     MFLOPS     801   1601   2402   2402       -      -      -      -
 32 bit  MMX Int   MIPS    4726   7116   8772   8734       -      -      -      -
 32 bit  SSE2 Int  MIPS    9490  13769  17545  17469     9490  14641  17527  17471
 64 bit  SSE2 Int  MIPS    2402   4575   4586   4575     2402   4576   4585   4576
 32 bit  SSE     MFLOPS    3202   6405   9608   9608     3202   6405   9609   9609
 64 bit  SSE2    MFLOPS    1601   3202   4804   4804     1601   3202   4804   4804
 32 bit  3DNow   MFLOPS      -      -      -      -        -      -      -      -
  




OpenMP Benchmark

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the C/C++ compiler included in the Linux Ubuntu Distribution. In each case, four benchmarks are provided, compiled with and without OpenMP options, to run on 32 bit and 64 bit systems. The execution files and source code along with compile and run instructions can be downloaded in linux_openmp.tar.gz. Details and results are provided in linux_openmp benchmarks.htm and a summary follows.

Original OpenMP Benchmark

The original benchmark used larger data array sizes of 0.4, 4.0 and 40 MBytes with 2, 8 and 32 floating point calculations per word (4 Bytes). The 32 bit version behaved in a similar way to the Windows compilation, showing performance gains of a four core processor of up to four times that of a single CPU. The 64 bit OpenMP version behaved in a similar manner to the 32 bit variation but appears to be relatively worse on comparing with speeds produced by the normal compilation. The reason is that the latter produces full SIMD operation, with four calculations per clock cycle, and the former SISD with one calculation per clock. (See above, where SIMD was not produced). Examples of results are given below.
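As an illustration of the kind of loop involved, the sketch below applies two floating point calculations (an add and a multiply) to every 4 byte word, with an OpenMP directive to share the work between cores. The exact arithmetic in the benchmark is not reproduced here; the expression and the function names are assumptions. With GCC the directive takes effect when compiling with -fopenmp and is otherwise ignored, giving the single CPU version.

```c
/* Two floating point operations per 4 byte word, shared between
   the available cores when OpenMP is enabled at compile time. */
void two_ops_per_word(float *x, int n, float a, float b)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;
}

/* MFLOPS from the number of words, operations per word,
   repeat passes and measured seconds. */
double mflops(double words, double ops_per_word, double passes, double secs)
{
    return words * ops_per_word * passes / secs / 1.0e6;
}
```

Each loop iteration is independent of the others, which is what allows the work to be divided between threads without any locking.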

Later results are for the 64 bit version running on the Core i7. In this case, for comparative purposes, those obtained by a multithreading version are also shown. This is MP MFLOPS - (see below). Also included is the non-OpenMP version and new compilations for SSE and with AVX functions. The former produces the same speeds as MP MFLOPS using one thread, with maximum speed of around 24.5 GFLOPS for one thread, demonstrating SIMD, where the maximum possible is 31.2 GFLOPS [CPU GHz x 4 (register width) x 2 (linked multiply and add)]. Performance of 4 way MP MFLOPS speeds show appropriate gains, to produce up to 93.2 GFLOPS, but could require the use of the 8 threads available via Hyperthreading. The AVX benchmark shows suitable gains, with 8 word registers, where the maximum demonstrated is 177.8 GFLOPS.

Note that the i7 SSE OpenMP speeds, shown below, are from a version recompiled by GCC 4.8.2, as this produces SIMD instructions. The new versions are included in AVX_benchmarks.tar.gz, along with the new AVX benchmark. The original SSE version, in linux_openmp.tar.gz, produces SISD instructions and the maximum speeds shown underneath the i7 table. The new compilations produce SIMD instructions for 2 and 8 operations per word, but performance is degraded due to data handling overheads. Then, at least, AVX scores are double those produced via SSE arithmetic. All the complex data handling seems to lead to SIMD instructions not being generated for the 32 operations tests, leading to SSE and AVX speeds being the same (single data word handling).


                  Linux OpenMP MFLOPS 3 GHz Quad Core Phenom

                  32 Bits                         64 Bits

     Data  Ops/   1 CPU   1 CPU  2 CPUs  4 CPUs   1 CPU   1 CPU  2 CPUs  4 CPUs
    Words  Word   *Norm     OMP     OMP     OMP   *Norm     OMP     OMP     OMP

    100000    2    2439    1903    3575    5758    7624    1974    3597    5769
   1000000    2    2231    1787    3588    6710    4686    1913    3843    6674
  10000000    2    1739    1509    2490    3062    2195    1590    2566    2944

    100000    8    3348    3518    6963   13353   14357    3437    6835   12126
   1000000    8    3195    3453    6943   13524   13376    3375    6802   12420
  10000000    8    3080    3308    6541   11311    7473    3219    6379   10976

    100000   32    3881    3794    7566   14896   15336    3552    7084   13494
   1000000   32    3853    3774    7554   14969   15009    3533    7079   13540
  10000000   32    3817    3735    7465   14883   14318    3490    6970   13450

  Instructions      FPU     FPU     FPU     FPU    SIMD    SISD    SISD    SISD
                    x87     x87     x87     x87     SSE     SSE     SSE     SSE

               *Norm OpenMP Directives not used - 1 CPU core SSE 


               Core i7  3.7 GHz at up to 3.9 GHz via Turbo Boost

               ----- MP MFLOPS 1 to 8 Threads -----          -------- OpenMP ---------
               ----- SSE -----      ----- AVX -----    SSE   --- SSE ---   --- AVX ---
 M 4B  Ops     1      4      8      1      4      8     1*   aff1      8   aff1      8
 Words Word
                                                              ##     ##
 0.1    2   9681  45340  54621  12542  62273  60258   9918   6061  13742  10196  19577
 1.0    2   9759  21688  41832  11404  23031  44329   9688   6215  19477  10025  37906
 10.2   2   5990   9237  10026   5991   8970   9977   5870   5059   9137   5880   7782

 0.1    8  24533  49320  92086  35982 159040 173224  24448  13220  44104  26481  88370
 1.0    8  24570  49918  92352  36180  80096 151909  24465  13373  49499  27045  90579
 10.2   8  19975  36638  39982  23299  40124  40153  20055  12719  38369  20593  35607

 0.1   32  23269  46942  92408  46400  90572 173372  23251   5854  22858   5865  22845
 1.0   32  23307  89676  93282  46572  91058 177831  23265   5863  23234   5870  23141
 10.2  32  23052  91029  92050  44729  88877 158594  23063   5860  23127   5854  23077

 2&8 Ops   ------- SIMD ------  ------- SIMD ------   SIMD   --- SIMD --   --- SIMD --
 32  Ops   ------- SIMD ------  ------- SIMD ------   SIMD   --- SISD --   --- SISD --

     ## new version, Original SISD all cores - 2 Ops 3400, 8 Ops 6100, 32 Ops 5900  
  




OpenMP MemSpeed Benchmark

MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers via two data arrays. It uses data volumes of 4 KBytes upwards to indicate performance via caches and RAM. This version is a variation with evaluation mainly concentrating on the formula x[m] = x[m] + r * y[m]. Below is a sample log file with the 64 bit benchmark using four CPUs. The extremely slow performance at the smaller data sizes is due to the relatively high startup overheads of OpenMP and, probably, cache flushing because shared data is being updated. The 32 bit version produces even slower performance relative to the non-OpenMP compilation. See also Multithreading version.
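The main test loop can be sketched in C as below, using the multiplier name s from the results table heading. The function names are illustrative only. The GFLOPS conversion assumes that the MB/second figures count the data read from both arrays and that each pair of words read represents one multiply and one add, which is consistent with the Max GFLOPS line under the Core i7 table but is not taken from the source code.

```c
#include <stddef.h>

/* The main MemSpeed calculation on double precision data. */
void memspeed_kernel(double *x, const double *y, double s, size_t n)
{
    size_t m;
    for (m = 0; m < n; m++)
        x[m] = x[m] + s * y[m];
}

/* Data transfer speed in MB/second (decimal megabytes). */
double mb_per_second(double bytes_per_pass, double passes, double secs)
{
    return bytes_per_pass * passes / secs / 1.0e6;
}

/* Approximate GFLOPS for this test from MB/second, assuming two
   words read and two floating point operations per result. */
double gflops_from_mbs(double mbs, int word_bytes)
{
    return mbs / word_bytes / 1000.0;
}
```

For example, 39859 MB/second with 8 byte double precision words corresponds to about 5.0 GFLOPS.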

Selected results for the Core i7 include those for the benchmark compiled without OpenMP directives, plus versions with and without OpenMP produced by the later compiler that generates AVX instructions. The CPU has 4 cores plus Hyperthreading. The non-OpenMP versions are compiled to use SIMD instructions, but performance is restricted due to overheads of loading, storing and inserting data. With these, AVX produced suitable gains for cache based data. SISD was generated by OpenMP compilations, leading to SSE and AVX speeds being the same. At least many MP speeds were appropriately faster than those for single core tests, and maximum memory throughput was excellent. Further details are in OpenMP MFLOPS.htm. The same program was also compiled using Pthread multithreading functions; see MP Memory Speed, later.



                   Phenom II X4 3000 MHz OpenMP 

      Memory Reading Speed Test 64 Bit Version 1 by Roy Longbottom

               Start of test Sun Dec  5 12:26:36 2010

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    2413   2340   2426   2408   2371   2593   1301   1302   1306
       8    4642   4379   4655   4739   4488   5045   2562   2478   2583
      16    8321   7942   8513   9215   8412   9668   4989   4695   4982
      32   15714  12698  15446  16397  14036  17359   9112   7963   9159
      64   25533  18268  24526  26971  21394  28979  16033  12269  16032
     128   36147  23064  34023  40018  28460  42871  23255  16389  23172
     256   45821  26908  42782  21679  34353  57114  31501  20370  31889
     512   46924  28555  46191  55514  35557  54808  33583  22754  33376
    1024   45478  28681  45098  48798  34662  47103  25081  22172  24993
    2048   36642  26993  36187  36523  32366  36917  18354  17985  18388
    4096   30960  24342  30259  32057  26483  32862  17172  15049  17153
    8192   22963  20257  22754  23462  21376  23910  12203  11223  12176
   16384    8927   8774   8888   8947   8803   8951   4469   4454   4487
   32768    8938   8817   8875   8963   3681   8964   4494   4465   4488
   65536    8956   8863   8910   8959   8849   8981   4500   4474   4502
  131072    8979   8918   8951   8830   8808   9022   4513   4494   4517
  262144    8784   8657   8706   8760   8826   8919   4436   4422   4433
  524288    8774   8478   8789   8732   8643   8864   4374   3703   4435
 1048576    8664   8559   8617   8689   8612   8678   4368   4360   4336
 2097152    8661   8631   8643   8611   8597   8692   4364   4368   4367


         Core i7 4820K 3900 MHz Turbo Boost - x[m]=x[m]+s*y[m] Int+

          64 bit         64 bit OMP    64 bit AVX    64b AVX OMP   32 bit

  KBytes   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  39311  24057   2666   2628  60212  56633   2716   2670  34682  17663 L1
       8  39076  24566   5058   4962  61736  58608   5163   5100  35342  17780
      16  39851  24795   9662   9412  62061  59459   9818   9526  35555  17824
      32  39859  24862  18780  17122  61951  59466  19317  17272  35391  17781
      64  32844  24462  33953  26599  47441  40896  34221  26564  30900  17303 L2
     128  32879  24498  51235  36875  46181  40101  52329  37762  31022  17313
     256  30516  23886  70872  47353  41612  36928  71102  47183  29852  17324
     512  25604  22420  90020  53395  31463  30294  90080  54397  24994  17127 L3
    1024  25565  22368  97333  57510  30155  29099  97835  57372  24903  17129
    2048  25589  22479  96621  58092  30044  29144  93511  58513  24909  17125
    4096  25600  22405  87122  60230  30056  29218  93758  60141  24951  17194
    8192  25593  22460  94138  60267  29891  29223 104996  59273  24864  17203
   16384  15083  14415  27817  27128  15577  15790  27302  27169  14951  13705 RAM
   32768  14845  14293  24666  24563  15191  15371  24620  24175  14890  13704
   65536  14959  14424  24868  25137  15215  15401  24763  24725  14856  13695
  131072  15041  14492  25625  25696  15230  15401  25636  25597  14880  13726
  262144  15023  14491  25603  25435  15247  15410  25507  25348  14958  13773
  524288  15053  14520  25603  25634  15204  15445  25646  25396  15016  13824
 1048576  15085  14534  25569  25690  15198  15438  25160  25678  15025  13846
 2097152  15096  14538  25634  25814  15254  15462  25656  25700
 4194304  15096  14544  25344  25266  15252  15452  25413  25421

Max GFLOPS  5.0    6.2   12.2   15.1    7.8   14.9   13.1   15.0    4.4    4.5
     




BusSpeed Benchmark

This benchmark is particularly designed to identify reading data in bursts over buses, with a 32 bit version using 32 bit integer words and one for 64 bits using 64 bit numbers. The program starts by reading a word, with address increments of 32 words for the next data. The increment is reduced to 16 words then halving until all data is read. The last test reads all data but using SSE2 instructions.
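The sequence of address increments can be sketched as follows. This is a simplified illustration under stated assumptions: the words read are summed so that the compiler cannot remove the loads, which may differ from the operation used in the benchmark itself, the final SSE2 test is omitted, and the function names are hypothetical.

```c
#include <stddef.h>

/* Reads one word in every inc words and returns the sum of those read. */
long long read_with_increment(const int *data, size_t words, size_t inc)
{
    long long sum = 0;
    size_t i;
    for (i = 0; i < words; i += inc)
        sum += data[i];
    return sum;
}

/* Runs the six tests Inc32wds, Inc16wds, Inc8wds, Inc4wds, Inc2wds
   and ReadAll, storing one sum per test in results[0] to results[5]. */
void run_increments(const int *data, size_t words, long long results[6])
{
    size_t inc = 32;
    int test;
    for (test = 0; test < 6; test++) {
        results[test] = read_with_increment(data, words, inc);
        inc /= 2;   /* 32, 16, 8, 4, 2 then 1 (all data read) */
    }
}
```

In a real measurement each test would be timed and repeated, with speed calculated from the number of words actually read.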

Below are 64 bit results for a Core i7 and a Core 2 Duo, with sample results at 32 bits, and samples of both varieties on a Phenom processor. The data burst size over the memory bus is indicated at the point where performance becomes constant, like Inc8wds at 64 bits and Inc16wds at 32 bits, both suggesting 512 bits or 64 bytes. Burst reading speed is eight times the constant speed at 64 bits and 16 times at 32 bits, or around 6400 MB/second for the Core 2 Duo and 7200 for the Phenom. There also appears to be some burst reading from data in L2 cache.
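The burst size and burst speed estimates above are simple multiplications, sketched here with illustrative function names:

```c
/* Burst size in bytes from the increment (in words) at which
   speed becomes constant and the word size in bytes. */
int burst_bytes(int burst_words, int word_bytes)
{
    return burst_words * word_bytes;
}

/* Estimated burst reading speed in MB/second: only one word of each
   burst is counted in the measured speed, so multiply by the number
   of words in the burst. */
double burst_speed_mbs(double constant_mbs, int burst_words)
{
    return constant_mbs * burst_words;
}
```

For the Core 2 Duo at 64 bits, the RAM speed of 803 MB/second at Inc8wds gives 8 x 8 = 64 bytes per burst and 803 x 8 = 6424 MB/second, in line with the figure of around 6400 quoted above.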

Speeds via L1 cache are fairly constant up to ReadAll, indicating no burst reading but, with the data transfer speed at 32 bits being twice that for 64 bits, a constant instruction execution speed is suggested. This, in MIPS, is slightly less than CPU MHz for the Core 2 Duo and somewhat higher than MHz on the Phenom. The SSE2 test is identical at both bit versions, with the Core 2 Duo showing better efficiency at nearly four 32 bit results (1 SSE register full) per CPU clock cycle. Maximum speed of the Core i7, based on burst speed, is suggested to be around 18 GB/second, a long way from the 51.2 GB/second specification, but i7 Multithreading Benchmarks (below) are needed to approach this.

The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in memory_benchmarks.tar.gz, with more details and results in Linux Results BusSpeed.


 Speed in MB/Second - For MIPS 64 bit divide by 8 and 32 bit divide by 4

Core i7 4820K 3900 MHz Turbo Boost - 1 CPU

 Bus Speed Test 64 bit Version 2.0 Sat Nov 8 12:08:24 2014

  Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2

       6     31233     31271     31267     42205     38182     42586     61438 L1
      24     31300     31277     31262     41632     39363     42724     62272
      96     14511     15005     15180     24371     33172     40471     60769 L2
     384      5367      5423      5502     10797     19594     33646     39043 L3
     768      5280      5366      5435     10797     19322     33431     38081
    1536      5247      5348      5493     10799     19399     33625     38234
   16380      1282      1569      2170      4762      9130     18547     19124 RAM
  131070      1223      1484      2098      4543      8731     18096     18349
  393210      1223      1484      2098      4542      8733     18095     18344

 Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM

       6     15308     15463     20502     18262     20300     21300     60627
      96      7434      7593     11491     16540     20013     21082     60633
    1536      2677      2757      5381      9694     16801     21026     38206
  393210       742      1048      2245      4360      9063     16342     18263

Core 2 Duo 2400 MHz - 1 CPU

 Bus Speed Test 64 bit Version 2.0 Thu Dec 16 23:09:19 2010

  Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2

       6     15997     17525     18167     18540     18734     18804     37355
      24     17759     18484     17865     17822     18531     18526     37980
      96      4189      4158      4107      6724      9128     13435     19175
     384      4182      4137      4091      6721      9133     13450     19206
     768      4109      4123      4094      6723      9129     13448     19229
    1536      3883      4086      4039      6643      9011     13280     18913
   16380       657       691       800      1626      2949      5445      5882
  131070       693       711       803      1622      2942      5440      5874
  393210       698       713       803      1623      2948      5444      5865

 Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM

       6      8568      9076      9176      9315      9412      9433     37350
      96      2112      2053      3277      4561      6714      8097     19170
  393210       356       401       815      1474      2730      5091      5870

Phenom II X4 3000 MHz - 1 CPU

 Bus Speed Test 64 bit Version 2.0 - L1 cache, L2 cache and RAM

       6     21407     22690     26285     27053     27050     26435     23784
      96      2992      2973      2991      5992     11780     20725     23813
  393210       869       901       918      1791      3729      6264      7391

 Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM

       6     11287     12793     13466     13625     13407     13281     23648
      96      1494      1490      2974      5854     10509     13147     23781
  393210       447       453       901      1830      3097      5206      7276




RandMem Benchmark

RandMem benchmark carries out eight tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests for 32 bit integers and 64 bit floating point numbers. In both cases, 32 bit integers are used. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is redundant, and by the size of preceding caches.
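
The idea of using one program structure for both serial and random access can be sketched as below. This is an illustration only and not the RandMem source: it assumes addresses are taken from an index array, which is filled either in order or shuffled, and all names are hypothetical. Interpreter overheads in Python hide most of the cache effects, so the timings only show the structure, not representative speeds.

```python
import random
import time
from array import array


def read_via_index(data, index):
    """Sum data[] in the order given by index[], one element per step."""
    total = 0
    for i in index:
        total += data[i]
    return total


def compare(words=1 << 18, seed=1):
    """Run the same read loop with serial and random index arrays."""
    data = array("i", range(words))            # C int, 32 bits on common platforms
    serial = array("i", range(words))          # 0, 1, 2, ... in address order
    shuffled = list(range(words))
    random.Random(seed).shuffle(shuffled)      # same addresses, random order
    rand = array("i", shuffled)

    results = {}
    for name, index in (("serial", serial), ("random", rand)):
        start = time.perf_counter()
        total = read_via_index(data, index)
        results[name] = (total, time.perf_counter() - start)
    return results


if __name__ == "__main__":
    mbytes = (1 << 18) * 4 / 1e6
    for name, (total, secs) in compare().items():
        print(f"{name:6s} checksum {total}  {mbytes / secs:8.1f} MB/sec")
```

Both passes visit every element once, so the checksums must agree; only the order of access differs.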

Below, all 64 bit results are shown for a Phenom along with sample speeds at 32 bits and for a Core 2 Duo at 64 bits. Many of the low order speeds are similar at 32 bits and 64 bits but, using RAM, some relationships change, with integer random access becoming progressively worse at 64 bits. The lower GHz Core 2 Duo performs better on some tests. The latest results, shown first, are for the Core i7, which is much faster than the earlier systems, particularly relative to CPU clock speed.

The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in memory_benchmarks.tar.gz with more details and results in Linux Results RandMem.


   Core i7 4820K 3.9 GHz Turbo Boost - 1 CPU

   Random/Serial Memory Test 64 Bit Version 2 Sat Nov  8 12:10:51 2014
 
         Integer.......................  Double/Integer................
         Serial........  Random........  Serial........  Random........
    RAM   Read   Rd/Wrt   Read   Rd/Wrt   Read   Rd/Wrt   Read   Rd/Wrt
     KB  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec

      6   26914   28379   26521   25259   30506   43477   30389   43791 L1
     12   26984   28876   26341   28078   29905   43462   29909   43020
     24   27062   29098   26526   28219   29865   43649   29832   42931
     48   23161   23723   18749   12718   29702   33997   29670   30451 L2
     96   23203   23731   13790    8816   29766   33586   22909   14830
    192   23378   23626   11539    7634   29685   32647   18371   12232
    384   22366   18631    8073    5883   27876   24687   14813   10078 L3
    768   22290   18024    6043    4978   27801   23322   10159    8041
   1536   22305   18023    5407    4576   27801   23316    8801    7311
   3072   22449   18119    5170    4374   27443   23151    8202    6887
   6144   22392   18111    5040    4269   27867   23187    7970    6683
  12288   15007   11910    2499    2698   20487   16022    4276    4837 RAM
  24576   13928   11206    1332    1336   17949   13729    2324    2389
  49152   13987   11299    1068    1061   17771   13626    1750    1774
  98304   14041   11331     971     864   18568   13699    1586    1558
 196608   14031   11379     927     685   18627   13752    1491    1175
 393216   14044   11397     908     623   18637   13741    1450     992
 786432   14037   11373     898     603   18579   13650    1430     935
1572864   13844   11407     890     614   18624   13720    1418     924

 At 32 bits
      6   24759   28651   24162   27110   30309   42529   30315   42969
     96   22385   23808   13417    8855   29721   34194   23310   14622
   1536   21480   18032    5369    4573   26884   23312    8845    7302
 393216   13743   11378     906     693   18574   13708    1450    1097
 786432   13809   11398     896     670   18578   13753    1430    1033


   AMD Phenom(tm) II X4 945 Processor 3.0 GHz

   Random/Serial Memory Test 64 Bit Version 2 Tue Dec 14 17:21:46 2010
 
         Integer.......................  Double/Integer................
         Serial........  Random........  Serial........  Random........
    RAM   Read   Rd/Wrt   Read   Rd/Wrt   Read   Rd/Wrt   Read   Rd/Wrt
     KB  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec

      6   12542    9137   12636    9066   16812   13621   16795   13621
     12   12613    9165   12676    9137   17022   13705   17013   13673
     24   12647    9179   12734    9157   17129   13720   17130   13694
     48   12664    9186   12775    9161   17183   13728   17183   13719
     96   11989    8464    6866    5221   16934   11776   16496   11888
    192    7778    8434    3703    3177   16902   11747    7146    6132
    384    7778    8437    3001    2749   16918   11671    5116    4730
    768    4956    7348    1954    1900    9978    9459    3670    3591
   1536    4763    7201    1404    1388    9748    9346    2488    2474
   3072    4016    6914    1078    1045    9531    9200    2048    2043
   6144    3668    6769     750     661    9004    8719    1405    1280
  12288    2771    3636     590     502    6688    5495    1012     848
  24576    2850    3592     504     450    6706    5506     841     736
  49152    2858    3583     439     402    6719    5332     727     659
  98304    2679    3536     333     307    6697    5490     612     564
 196608    2729    3548     266     241    6945    5445     459     422
 393216    2866    3559     229     200    6931    5490     377     336
 786432    2870    3547     192     167    6938    5499     327     283

 At 32 bits
      6   14488   11399   12852   11133   16741   20258   16789   19825
     96   11088    9912    6861    5520   16960   16197   16554   14645
   1536    8044    7528    1410    1390    9668    9223    2475    2461
 393216    4296    3575     281     258    6668    5497     491     458
 786432    4296    3562     238     212    6841    5492     396     361

   Intel Core 2 CPU 6600 @ 2.40GHz 
 At 64 bits
      6    9142   12213    9154    5161   13728   16211   13727   15654
     96    8019    9473    4113    3701   11381   11971    7382    6419
   1536    7978    8586    2691    2497   11269   11044    4760    4222
 393216    3285    2273     238     207    5705    2999     503     374
 786432    3297    2277     149     152    5637    3001     297     281
  




SSEfpu Benchmark

This is a variation of the SSE3DNow Benchmark with extensions but excluding AMD 3DNow tests. The benchmark measures Single Precision (SP) and Double Precision (DP) Floating Point speeds, data streaming from caches and RAM. It uses SSE (SP) and SSE2 (DP) assembly code instructions, along with compiled C code that produces the old x87 instructions at 32 bits and SSE type for working on a 64 bit system. The additional tests avoid intermediate register to register operations using s=(s+x[m])*y[m] and s=s+x[m]+y[m] to produce much faster speeds. The AMD processor performs relatively better on the extra test, with linked add and multiply, at 7.11 floating point results per clock cycle on the Phenom. Then, the Core i7 regains the lost ground and also obtains a high throughput on RAM based data.
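
The MFLOPS and MFLOPS/MHz lines in the results appear to follow from the best MB/second figure in each column by counting floating point operations per byte read. The sketch below reproduces three Core i7 figures under that assumption: each loop step reads one x[m] and one y[m], single precision words are 4 bytes and double precision words are 8 bytes. The function names are illustrative and not taken from the benchmark source.

```python
def mflops_from_mb_per_sec(mb_per_sec, bytes_per_word, flops_per_step):
    """Each step reads x[m] and y[m], so 2 * bytes_per_word bytes per step."""
    steps_per_sec_millions = mb_per_sec / (2 * bytes_per_word)
    return steps_per_sec_millions * flops_per_step


def mflops_per_mhz(mflops, cpu_mhz):
    return mflops / cpu_mhz


# Core i7 4820K at 3900 MHz, best MB/second in each column of the 64 bit table
sse2_dp = mflops_from_mb_per_sec(41447, 8, 2)    # s=s+x[m]*y[m], double precision
sse_sp = mflops_from_mb_per_sec(41511, 4, 2)     # s=s+x[m]*y[m], single precision
linked = mflops_from_mb_per_sec(98675, 4, 2)     # s=(s+x[m])*y[m], single precision

for label, value in (("SSE2 DP", sse2_dp), ("SSE SP", sse_sp), ("+* SSE", linked)):
    print(f"{label:8s} {value:8.0f} MFLOPS  {mflops_per_mhz(value, 3900):.2f} per MHz")
```

Rounded, these give 5181, 10378 and 24669 MFLOPS, matching the Maximum MFLOPS line reported for the Core i7.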

The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in memory_benchmarks.tar.gz with more details and results in Linux Results SSEfpu.


     Core i7 4820K 3.9 GHz Turbo Boost - 1 CPU

     SSE & SSE2 Memory Reading Speed Test 64-Bit Version 2.1

      0.100 seconds per test, Start Tue Dec  2 17:46:19 2014

  Memory    --s=s+x[m]*y[m]---   --x[m]=x[m]+y[m]-- (s+x[m])?y[m]
  KBytes    SSE2    SSE   Sngl   SSE2    SSE   Sngl  +*SSE  ++SSE
   Used     MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   40997  41051  10763  78459  75013  28446  87877  59793 L1
       8   41168  41321  10588  78338  78301  27627  96326  60640
      16   41366  41444  10505  80368  80739  27706  98675  60593
      32   41423  41511  10462  80669  81160  27759  92764  60609
      64   41432  41427  10445  50083  50209  27169  57689  57389 L2
     128   41447  41508  10412  49595  49500  27192  55731  56598
     256   39673  39746  10414  46176  46119  26167  48386  48563
     512   37246  37301  10417  32252  32250  24247  39595  39688 L3
    1024   36639  36601  10425  31307  31197  24044  38688  38794
    2048   36640  36824  10421  31262  31328  24138  38804  38750
    4096   36900  36899  10393  31379  31381  24227  38739  38942
    8192   36585  36615  10403  31076  31063  24076  38442  38534
   16384   23186  23097   9577  15371  15292  16067  22518  22562 RAM
   32768   22592  22574   9573  14973  15013  15743  21935  22058
   65536   22603  22504   9596  15041  14972  15718  22061  22052
  131072   22612  22612   9582  15038  15030  15672  22096  22003
  262144   22629  22610   9584  15049  15044  15698  22040  22109
  524288   22638  22654   9592  15057  15056  15682  22101  22101
 1048576   22618  22481   9598  15038  15049  15605  22110  22104
 2097152   22671  22648   9608  15050  15051  15546  22094  22129
 4194304   22671  22668   9597  15044  15056  15691  22112  22128

            SSE2    SSE   Norm   SSE2    SSE   Norm    SSE    SSE
 Maximum      DP     SP     SP     DP     SP     SP     SP     SP
  MFLOPS    5181  10378   2691   5042  10145   3556  24669  15160

MFLOPS/MHz  1.33   2.66   0.69   1.29   2.60   0.91   6.33   3.89

 MB/sec at 32 bits

       8   41382  41382  10592  79081  78697  20892  92411  61511
     128   41604  41586  10436  49128  49126  18239  55914  56067
    1024   36098  35957  10425  31113  31127  16998  38204  38336
  131072   21010  20979  10092  14783  14774  12497  20655  20626

  
  AMD Phenom(tm) II X4 945 Processor 3.0 GHz

     SSE & SSE2 Memory Reading Speed Test 64-Bit Version 2.0

      0.100 seconds per test, Start Tue Dec 21 12:18:05 2010

  Memory    --s=s+x[m]*y[m]---   --x[m]=x[m]+y[m]-- (s+x[m])?y[m]
  KBytes    SSE2    SSE   Sngl   SSE2    SSE   Sngl  +*SSE  ++SSE
   Used     MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   22773  22689   6156  43460  42950  23333  66361  41700
       8   23421  23377   6089  45716  45433  23624  78620  44642
      16   23623  23691   6059  42561  42562  23724  84534  45885
      32   23834  23827   6043  45141  45140  23797  82980  46315
      64   23921  23918   6035  44686  45478  23823  85405  46897
     128   23859  23901   6029  22154  22157  17973  23785  23782
     256   23821  23764   6027  21555  21535  18026  23888  23889
     512   19300  19264   6010  17865  17840  16359  19219  19222
    1024   10376  10379   5965  10168  10168  10228  10371  10373
    2048   10369  10372   5966  10163  10163  10236  10369  10368
    4096   10261  10281   5862   9975   9975  10025  10278  10278
    8192    8053   8190   5362   6841   6836   6863   8029   8027
   16384    7985   8095   5327   6572   6569   6651   7848   7883
   32768    8074   8099   5314   6424   6531   6660   7858   7928
   65536    8148   8151   5321   6599   6607   6674   7961   7961
  131072    8092   8159   5320   6585   6412   6484   7891   7936
  262144    8112   8173   5318   6580   6556   6665   7887   7960
  524288    8117   8042   5327   6607   6604   6689   7861   7961
 1048576    8147   8108   5328   6535   6581   6668   7941   7816

            SSE2    SSE   Norm   SSE2    SSE   Norm    SSE    SSE
 Maximum      DP     SP     SP     DP     SP     SP     SP     SP
  MFLOPS    2990   5980   1539   2857   5685   2978  21351  11724

MFLOPS/MHz  0.99   1.99   0.51   0.95   1.95   0.99   7.11   3.90

 MB/sec at 32 bits
 Different                                    #####
       8   23188  23276   6057  45641  43156  11688  78703  44729
     128   23634  23692   5997  22418  22250   9893  23671  23664
    1024   10248  10254   5930  10056  10053   8682  10253  10253
  131072    8258   8276   5389   6680   6698   6098   7909   8091

   Intel Core 2 CPU 6600 @ 2.40GHz 
 At 64 bits
 Different                                    #####  #####  #####
       8   25420  25368   6506  37691  37692  13152  36503  36637
     128   18481  18655   6406  17105  17107  12704  19725  19744
    1024   18517  18749   6391  17136  17137  12690  19803  19822
  131072    6444   6419   5455   3955   3956   3863   6399   6393
Maximum
MFLOPS/MHz  1.32   2.64   0.68   0.98   1.96   0.68   3.80   3.81
  




nVidia CUDA Benchmarks and Burn-in Tests

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions can be executed simultaneously on sections of data from a data array. For maximum speeds, the data array has to be large and with little or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache like memory.

The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance using varying data array sizes and increasing numbers of processing instructions per data access. There are five scenarios - New Calculations with data in and out, Update Data with just data out, Graphics Only Data using only graphics RAM, and two extra tests with lower overheads. The tests are run at three different data sizes, by default 100,000 words repeated 2500 times, 1M words 250 times and 10M words 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
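
The operation count and the MFLOPS calculation can be illustrated with a short sketch. It is written in plain Python on the CPU rather than CUDA, purely to show the arithmetic. The function names and the constant values used in the example call are hypothetical; the benchmark's own constants are not given here.

```python
def update_8_ops(x, a, b, c, d, e, f):
    """One data word, 8 operations: 3 adds in brackets, 3 multiplies, 1 subtract, 1 add."""
    return (x + a) * b - (x + c) * d + (x + e) * f


def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test line."""
    return words * ops_per_word * passes / seconds / 1e6


# Illustrative constants only, chosen to be exact in binary floating point
print(update_8_ops(1.0, 0.5, 2.0, 0.25, 1.0, 0.0, 0.5))

# 100,000 words, 2 operations per word, 2500 passes in 1.035893 seconds
print(round(mflops(100000, 2, 2500, 1.035893)))
```

The second call uses the values from the first "Data in & out" line of the example results and rounds to the 483 MFLOPS shown there.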

The 32 and 64 bit benchmarks, source code and instructions can be downloaded in linux_cuda_mflops.tar.gz with more details and results in linux_cuda_mflops.htm, the latter showing how to use the benchmarks as reliability/burn-in tests. Example results are below.


  Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Wed Dec 29 15:35:35 2010

  CUDA devices found 
  Device 0: GeForce GTS 250  with 16 Processors 128 cores 
  Global Memory 999 MB, Shared Memory/Block 16384 B, Max Threads/Block 512

  Using 256 Threads

  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  1.035893      483   0.9295383095741  Yes
 Data out only    100000    2    2500  0.514445      972   0.9295383095741  Yes
 Calculate only   100000    2    2500  0.082464     6063   0.9295383095741  Yes

 Data in & out   1000000    2     250  0.706176      708   0.9925497770309  Yes
 Data out only   1000000    2     250  0.380928     1313   0.9925497770309  Yes
 Calculate only  1000000    2     250  0.051266     9753   0.9925497770309  Yes

 Data in & out  10000000    2      25  0.639933      781   0.9992496371269  Yes
 Data out only  10000000    2      25  0.339051     1475   0.9992496371269  Yes
 Calculate only 10000000    2      25  0.041672    11999   0.9992496371269  Yes

 Data in & out    100000    8    2500  1.013196     1974   0.9569796919823  Yes
 Data out only    100000    8    2500  0.490317     4079   0.9569796919823  Yes
 Calculate only   100000    8    2500  0.088028    22720   0.9569796919823  Yes

 Data in & out   1000000    8     250  0.666709     3000   0.9955092668533  Yes
 Data out only   1000000    8     250  0.351320     5693   0.9955092668533  Yes
 Calculate only  1000000    8     250  0.052704    37948   0.9955092668533  Yes

 Data in & out  10000000    8      25  0.620265     3224   0.9995486140251  Yes
 Data out only  10000000    8      25  0.335467     5962   0.9995486140251  Yes
 Calculate only 10000000    8      25  0.044453    44992   0.9995486140251  Yes

 Data in & out    100000   32    2500  1.057142     7568   0.8900792598724  Yes
 Data out only    100000   32    2500  0.531691    15046   0.8900792598724  Yes
 Calculate only   100000   32    2500  0.128706    62157   0.8900792598724  Yes

 Data in & out   1000000   32     250  0.688714    11616   0.9880728721619  Yes
 Data out only   1000000   32     250  0.375411    21310   0.9880728721619  Yes
 Calculate only  1000000   32     250  0.075172   106423   0.9880728721619  Yes

 Data in & out  10000000   32      25  0.644074    12421   0.9987990260124  Yes
 Data out only  10000000   32      25  0.357000    22409   0.9987990260124  Yes
 Calculate only 10000000   32      25  0.062001   129029   0.9987990260124  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.050288     9943   0.9992496371269  Yes
 Shared Memory  10000000    2      25  0.009206    54313   0.9992496371269  Yes

 Calculate      10000000    8      25  0.049608    40316   0.9995486140251  Yes
 Shared Memory  10000000    8      25  0.017254   115916   0.9995486140251  Yes

 Calculate      10000000   32      25  0.050531   158320   0.9987990260124  Yes
 Shared Memory  10000000   32      25  0.046626   171580   0.9987990260124  Yes




Disk, Bus and LAN Benchmarks

These benchmark tests are based on those produced for Windows, where details and results can be found in DiskGraf Results.htm and CDDVDSpd Results.htm. The tests comprise:

  • Writing and Reading Large Files - Five files each of 8 MB, 16 MB and 32 MB are used.
    System is instructed not to cache the data.


  • Writing and Reading Cached Data - Five files of 8 MB are used. Performance normally
    reflects memory speed.


  • Reading Bus Speed - The same data is read repetitively at block sizes between 64 KB and
    1 MB. This normally reads data from the disk’s buffer to show maximum bus speeds.


  • Random Reading Speed - 1 KB blocks are read randomly from 7 file sizes between 2 MB
    and 128 MB. Results reflect the disk's buffer size and rotation speed.


  • Writing and Reading Small Files - 500 files are written, read and deleted at 6 different
    file sizes each between 2 KB and 64 KB. Besides speed, milliseconds per file is provided
    to reflect overheads.


  • Run time parameters - These are provided to write and read larger files and to specify
    the drive and file path to be used.

Besides testing disk and flash memory drives, it was intended to use the drivespeed benchmarks for measuring speed over other connections, such as Local Area Networks (LANs). In order to avoid data being cached in main memory by the Operating System, the program uses direct I/O (file open parameter O_DIRECT for Linux). This prevented its use on directories mounted over a LAN, so a second program (lanspeed) was produced, identical except with no direct I/O parameter. Compilations at both 32 bits and 64 bits were produced - drivespeed32, lanspeed32, drivespeed64 and lanspeed64. The lanspeed tests can be used to measure speeds between Linux platforms and also between Linux and Windows systems. A Windows program, drivespeed32.exe, is also provided and this can also be used as a LAN speed test.
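
The difference between the two programs can be sketched as a timed write that first tries direct I/O and falls back to normal buffered I/O when the open or write is rejected. This is a hedged illustration, not the drivespeed or lanspeed source. The 4096 byte alignment, the function names and the file name are assumptions, and some devices or filesystems need a different alignment.

```python
import mmap
import os
import tempfile
import time

BLOCK = 4096  # assumed alignment; some devices need a different multiple


def _write_once(path, buf, extra_flags):
    """Write buf to path, flush to the device and return elapsed seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | extra_flags, 0o644)
    try:
        start = time.perf_counter()
        view = memoryview(buf)
        done = 0
        while done < len(view):
            done += os.write(fd, view[done:])
        os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def timed_write(path, size_bytes):
    """Return (MB/second, used_direct_io) for one file write."""
    size_bytes -= size_bytes % BLOCK           # direct I/O needs block multiples
    buf = mmap.mmap(-1, size_bytes)            # anonymous mapping, page aligned
    buf.write(b"\xA5" * size_bytes)
    direct = getattr(os, "O_DIRECT", 0)        # 0 where the platform lacks O_DIRECT
    try:
        secs, used_direct = _write_once(path, buf, direct), direct != 0
    except OSError:
        # For example a network mount or filesystem that rejects direct I/O
        secs, used_direct = _write_once(path, buf, 0), False
    return size_bytes / 1e6 / max(secs, 1e-9), used_direct


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        speed, used = timed_write(os.path.join(folder, "test.bin"), 8 << 20)
        print(f"{speed:.2f} MB/sec, direct I/O used: {used}")
```

Without direct I/O, the figure mainly reflects caching in main memory, which is why the cached file results in the example log are so much faster than the others.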

The execution files, source code along with compiling and running instructions, can be downloaded in linux_disk_usb_lan_benchmarks.tar.gz with more details and results in linux_disk_usb_lan_benchmarks.htm. The latest version has an added test to measure Random Writing Speed. Example results are below.


 Current Directory Path: 
 /media/f816ec76-8bf2-4dd3-9e98-62934909a779/roy/all64/drivespeed2
 Total MB   11263, Free MB    9513, Used MB    1750

 Linux Storage Speed Test 64-Bit Version 1.1, Tue Feb  1 14:20:39 2011

                Copyright (C) Roy Longbottom 2011

     8 MB File         1          2          3          4          5
 Writing MB/sec       4.33      76.73      76.15      82.40     105.84
 Reading MB/sec      57.37      86.62      83.40      80.74      82.34

    16 MB File         1          2          3          4          5
 Writing MB/sec      73.94     108.16      72.53     116.19     116.12
 Reading MB/sec      70.39     103.31     120.31     121.53     121.48

    32 MB File         1          2          3          4          5
 Writing MB/sec     113.01      76.67      73.20     115.83     116.05
 Reading MB/sec     105.19     102.41     113.15     121.55     120.59

 ---------------------------------------------------------------------
 8 MB Cached File      1          2          3          4          5
 Writing MB/sec    1271.71    1503.73    1496.38    1493.27    1491.68
 Reading MB/sec    3406.70    4015.11    4079.82    4081.24    4080.77

 ---------------------------------------------------------------------
 Bus Speed Block KB     64        128        256        512       1024
 Reading MB/sec      84.93     102.31     112.31     121.03     116.41

 ---------------------------------------------------------------------
 1 KB Reads File MB >    2      4      8     16     32     64    128
 Random Read msecs    0.43   0.39   0.45   3.01   4.49   5.93   6.69

 ---------------------------------------------------------------------
 500 Files   Write             Read             Delete
 File KB     MB/sec  ms/File   MB/sec  ms/File  Seconds
       2       7.54     0.27     7.67     0.27    0.015
       4      17.19     0.24    22.27     0.18    0.018
       8      20.24     0.40    27.21     0.30    0.017
      16      33.27     0.49    47.16     0.35    0.019
      32      52.67     0.62    67.20     0.49    0.016
      64      55.43     1.18    75.49     0.87    0.015

              End of test Tue Feb  1 14:21:29 2011 




Burn-In and Reliability Testing Apps

A new set of programs has been designed for soak testing Linux based PCs. The execution files and source code along with compile and run instructions can be downloaded in linux_burn-in_apps.tar.gz. Full details and results are provided in linux burn-in apps.htm.

These programs are intended to stress test CPUs, caches, RAM, buses, disks and other drives using high processing speeds, to induce heating effects, and varying data bit order, to investigate possible pattern conscious faults. Common features are command line options to specify memory/storage demands, running time and different results log file names, for use in multiprocessor tests. Data read and results of calculations are also checked for correct or consistent values. Versions compiled to run on 32-Bit and 64-Bit processors are provided.
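
Checking for pattern conscious faults amounts to writing a known bit pattern, reading it back and comparing. The sketch below is a minimal illustration of that idea and is not the IntBurn assembly code. The six patterns are those listed in the IntBurn example log further down this section; the function names and the 4 KB default are assumptions for the example.

```python
from array import array

# 64 bit patterns as shown in the IntBurn example log
PATTERNS = (
    0x0000000000000000, 0xFFFFFFFFFFFFFFFF, 0xA5A5A5A5A5A5A5A5,
    0x5555555555555555, 0x3333333333333333, 0xF0F0F0F0F0F0F0F0,
)


def pattern_pass(pattern, kbytes=4):
    """Fill a buffer with one 64 bit pattern, read it back, return error count."""
    words = kbytes * 1024 // 8
    buf = array("Q", [pattern]) * words                  # write phase
    return sum(1 for word in buf if word != pattern)     # read and compare phase


def run_patterns(kbytes=4):
    """Map each pattern, as 16 hex digits, to the number of mismatches found."""
    return {f"{p:016X}": pattern_pass(p, kbytes) for p in PATTERNS}


if __name__ == "__main__":
    for name, errors in run_patterns().items():
        print(f"Pattern {name} Result {'OK' if errors == 0 else 'ERROR'}")
```

A real burn-in run repeats such passes for a set time so that heating effects can develop; a single pass like this only demonstrates the check.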

Three new programs provided are BurnInSSE, IntBurn and DriveStress but they can also be used in conjunction with programs produced earlier. BurnInSSE64 and BurnInSSE32 were compiled to use the same range of SSE floating point instructions, where GCC generates fast execution speeds. The IntBurn tests are based on assembly code with IntBurn32 using 32 bit integers and IntBurn64 accessing a larger number of 64 bit registers. DriveStress32 and DriveStress64 were compiled from the same C code and measure drive and bus speeds (e.g. SATA or USB) whilst checking data read for correct values. Earlier programs that also have reliability testing options, and are included in the package, are Livermore Loops and nVidia CUDA Benchmarks.

Successes - Three significant problems were identified during testing. The first was apparent excessive temperatures on a desktop PC, compared with earlier measurements via Windows. This was cured by clearing dust out of the CPU heatsink using a compressed air sprayer. Then there were two Linux peculiarities that seem to be affected by power saving options. A desktop PC with a Core 2 Duo CPU showed a throughput increase of three times using both cores. Here, using one core with “On-Demand” CPU GHz (via Frequency Scaling Monitor), the processor was running at 1.6 GHz instead of 2.4 GHz. Then a laptop, again with a Core 2 Duo CPU, overheated, causing the CPU to run at less than half speed. Unlike with Windows, on powering up into Ubuntu, initial CPU temperatures were high with the fan not appearing to run as fast as it might. On an apparently random basis, the laptop started at a lower temperature and did not overheat, with the fan apparently running at high speed.

Paging/Swapping Tests - Running multiple copies of the processor exercise programs, with appropriate parameters to demand more main memory capacity than is available, will lead to data being swapped out/in to/from disk. However, with excessive demands, running times can be unpredictable.

Multitasking Scripts - Examples are provided showing how to mix and match programs and run time parameters to soak test complete systems for as long as is required. They also demonstrate how to organise dynamically displayed results in multiple X terminal windows.

The test programs display and log results of calculations and speeds at regular intervals. Examples are shown below, with interpretation and more details in linux burn-in apps.htm.


  IntBurn

  Test 4 KB at 10x2 seconds per test, Start at Thu Mar 17 12:00:59 2011

 Write/Read
  1   10529 MB/sec  Pattern 0000000000000000 Result OK   25705389 passes
  2   10579 MB/sec  Pattern FFFFFFFFFFFFFFFF Result OK   25826660 passes
  3   10592 MB/sec  Pattern A5A5A5A5A5A5A5A5 Result OK   25858754 passes
  4   10587 MB/sec  Pattern 5555555555555555 Result OK   25846727 passes
  5   10601 MB/sec  Pattern 3333333333333333 Result OK   25880968 passes
  6   10602 MB/sec  Pattern F0F0F0F0F0F0F0F0 Result OK   25883259 passes
 Max   2236 64 bit MIPS
 Read
  1   16941 MB/sec  Pattern 0000000000000000 Result OK   82719400 passes
  2   16946 MB/sec  Pattern FFFFFFFFFFFFFFFF Result OK   82744300 passes
  3   16932 MB/sec  Pattern A5A5A5A5A5A5A5A5 Result OK   82676600 passes
  4   16927 MB/sec  Pattern 5555555555555555 Result OK   82653700 passes
  5   16883 MB/sec  Pattern 3333333333333333 Result OK   82439400 passes
  6   16857 MB/sec  Pattern F0F0F0F0F0F0F0F0 Result OK   82311300 passes
 Max   2515 64 bit MIPS

  BurnInSSE

 Using 400 KBytes, 32 Operations Per Word, For Approximately 1 Minutes

   Pass    4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All
            Words  Word   Passes                            Results  Same

      1    100000    32    67500      15.10    14304    0.356166393   Yes
      2    100000    32    67500      15.11    14296    0.356166393   Yes
      3    100000    32    67500      15.09    14312    0.356166393   Yes
      4    100000    32    67500      15.33    14091    0.356166393   Yes

  DriveStress

 File size   10.25 MB x 4 files, minimum reading time 1 minutes

 File 1   10.25 MB written in    0.12 seconds 
 File 2   10.25 MB written in    0.14 seconds 
 File 3   10.25 MB written in    0.11 seconds 
 File 4   10.25 MB written in    0.14 seconds 


              Start Reading Sun Apr 17 20:06:07 2011

 Read passes    18 x 4 Files x   10.25 MB in     0.25 minutes
 Read passes    36 x 4 Files x   10.25 MB in     0.51 minutes
 Read passes    54 x 4 Files x   10.25 MB in     0.76 minutes
 Read passes    72 x 4 Files x   10.25 MB in     1.01 minutes

            Start Repeat Read Sun Apr 17 20:08:08 2011

 Passes in 1 second(s) for each of 164 blocks of 64KB:

   1440   1480   1480   1480   1480   1400   1480   1480   1480   1460   1380
   1480   1480   1460   1480   1440   1440   1480   1480   1480   1440   1460
   1480   1440   1480   1460   1500   1460   1480   1760   1540   1480   1480
   1440   1480   1480   1480   1480   1460   1440   1480   1480   1480   1460
 + another 120 results

    No errors found during reading tests
 




Multithreading Benchmarks

These multithreading tests are based on the above benchmarks, in turn, Maximum CPU Speeds, Whetstone Classic Benchmark, Original OpenMP Benchmark, MemSpeed Benchmark, BusSpeed Benchmark and RandMem Benchmark. For further details, sample results, benchmark programs, source code and instructions see linux multithreading benchmarks.htm and linux_multithreading_apps.tar.gz.

See also results on a Core i7 with 4 cores plus 4 Hyperthreading

Six benchmarks are provided that can run using up to 64 concurrent threads, with versions compiled to run using 64 bit or 32 bit systems. Performance is mainly measured as Millions of Instructions Per Second (MIPS), Millions of Floating Point Operations Per Second (MFLOPS) or Millions of Bytes per Second (MB/S).

Simple Add Tests - execute 32 bit or 64 bit integer instructions and 128 bit SSE floating point functions via assembly language. These use simple add operations with little access to external data. Resultant performance is generally proportional to the number of CPU cores with some gains also identified when Hyperthreading is available. Each thread executes independent code.

Whetstone Benchmark - is the first general purpose benchmark that set industry standards of computer system performance, mainly dependent on floating point speed but with some independently timed integer test functions. Data used is generally contained in L1 cache with performance gains again proportional to the number of cores. Each thread again executes independent code.

MP MFLOPS Program - uses the same functions as my CUDA and OpenMP benchmarks, comprising routines with 2, 8 and 32 add or multiply floating point calculations with data from higher level caches or RAM. The 64 bit version compiles using SSE floating point, where up to 6 MFLOPS per CPU MHz per core can be produced. The 32 bit program uses the much slower original 80387 FPU instructions. These programs can also be used as burn-in/reliability tests. Each thread executes the same functions but on a different segment of the data.

MP Memory Speed Tests - employ three sequences of operations, using double and single precision floating point numbers and integers, on data sized between 4 KB and 25% of RAM size. The operations are memory to memory transfers with 0, 1 and 2 arithmetic calculations. The 64 bit version again uses SSE functions but not as efficiently as MP MFLOPS. Again each thread has the same procedures using different segments of the data. Calculations are the same as MemSpeed Benchmark, used with OpenMP, where there is no programmable control on the order in which data is accessed.

MP Memory Bus Speed Tests - read data at a range of sizes covering caches and RAM. Data is accessed with varying address increments to identify reading data in bursts over the bus and allow estimation of maximum bus/memory speed. This time, each thread reads all the data. The 64 bit version uses the double size 8 byte words, where data transfer speed can be twice that of the 32 bit compilation, demonstrating that 32 and 64 bit integer instructions can execute at the same speed.

MP Memory Random Access Speed Benchmark - comprises serial and random access read and read/write tests that cover cache and RAM data sizes. All threads access the same data but starting at different points. In this case, data could be corrupted with concurrent updates, but the Operating System appears to flush caches to avoid this, producing extremely slow performance. Extra tests (Mutex) avoid this conflict by executing one read/write test at a time, leading to some slower and some faster speeds. Random access can be affected by burst reading/writing with associated poor performance.
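
The arrangement described for the MP MFLOPS and MP Memory Speed tests, where every thread runs the same function on its own segment of the data, can be sketched as follows. This is an illustration and not the benchmark source: the names are hypothetical, and in CPython the global interpreter lock stops these threads running arithmetic in parallel, so the sketch shows the partitioning only, not a speed gain.

```python
from concurrent.futures import ThreadPoolExecutor


def triad_segment(x, y, s, start, stop):
    """x[m] = x[m] + s * y[m] over one thread's slice of the data."""
    for m in range(start, stop):
        x[m] = x[m] + s * y[m]


def run_segments(x, y, s, threads=4):
    """Give each thread its own contiguous segment, so no element is shared."""
    n = len(x)
    bounds = [(n * t // threads, n * (t + 1) // threads) for t in range(threads)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        jobs = [pool.submit(triad_segment, x, y, s, lo, hi) for lo, hi in bounds]
        for job in jobs:
            job.result()      # re-raise any exception from a worker
    return x


if __name__ == "__main__":
    x = [1.0] * 16
    y = [2.0] * 16
    print(run_segments(x, y, 0.5))
```

Because the segments do not overlap, no locking is needed. The Random Access benchmark differs in that all threads share the same data, which is where the Mutex tests come in.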

Examples of results log format on a quad core 3.0 GHz Phenom II are given below.

  

Simple Add Tests

 Multithreading Add Test 64 bit Version 1.0 Thu May 5 11:35:18 2011

 Integer Additions 4 Threads

 Thread 4  -   8281 64 bit Integer MIPS
 Thread 2  -   7996 64 bit Integer MIPS
 Thread 1  -   7815 64 bit Integer MIPS
 Thread 3  -   7800 64 bit Integer MIPS
 Total     -  31892 64 Bit Integer MIPS
 Aggregate -  31201 64 Bit Integer MIPS, based on last to finish

 SSE Floating Point Additions 4 Threads

 Thread 2  -  12030 32 Bit SSE MFLOPS
 Thread 3  -  11976 32 Bit SSE MFLOPS
 Thread 4  -  11861 32 Bit SSE MFLOPS
 Thread 1  -  11692 32 Bit SSE MFLOPS
 Total     -  47559 32 Bit SSE MFLOPS
 Aggregate -  46770 32 Bit SSE MFLOPS, based on last to finish
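
The Total and Aggregate lines in this log can be related with a little arithmetic. The sketch below assumes every thread performs the same amount of work, so the whole job runs at the number of threads multiplied by the rate of the slowest thread, which is the last to finish. The function name is illustrative and not taken from the benchmark source.

```python
def total_and_aggregate(thread_rates):
    """thread_rates: per-thread MIPS or MFLOPS, each thread doing equal work."""
    total = sum(thread_rates)
    # With equal work, the slowest rate belongs to the last thread to finish,
    # and the complete job runs at (number of threads) x (that rate).
    aggregate = len(thread_rates) * min(thread_rates)
    return total, aggregate


print(total_and_aggregate([8281, 7996, 7815, 7800]))       # integer MIPS
print(total_and_aggregate([12030, 11976, 11861, 11692]))   # SSE MFLOPS
```

This gives totals of 31892 and 47559, with aggregates of 31200 and 46768, which agree with the logged 31201 and 46770 to within rounding of the per-thread figures.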

Whetstone MP Benchmark

 Multithreading Single Precision Whetstones 64-Bit Version 1.0

 Using 4 threads - Sat May 14 12:03:51 2011

          MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
 Thread               1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

      1    2861     927     872     747      71      38    2947    2259     629
      2    2865     875     892     745      71      38    3294    2198     641
      3    2875     869     892     744      71      38    3408    2202     645
      4    2896     906     895     744      72      38    3141    2232     651

 Total    11496    3577    3550    2979     285     151   12790    8891    2566

 MWIPS    11389 Based on time for last thread to finish

MP MFLOPS Benchmark

 64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:43 2011

 Test              4 Byte  Ops/  Repeat   Seconds  MFLOPS     First   All
                    Words  Word  Passes                     Results  Same

 Data in & out     102400     2   10000  0.091754   22321  0.764063   Yes
 Data in & out    1024000     2    1000  0.136134   15044  0.970753   Yes
 Data in & out   10240000     2     100  0.632075    3240  0.997008   Yes
 Data in & out     102400     8   10000  0.167023   49047  0.850923   Yes
 Data in & out    1024000     8    1000  0.176219   46488  0.982342   Yes
 Data in & out   10240000     8     100  0.658828   12434  0.998200   Yes
 Data in & out     102400    32   10000  0.558509   58670  0.660143   Yes
 Data in & out    1024000    32    1000  0.556450   58888  0.953631   Yes
 Data in & out   10240000    32     100  0.722131   45377  0.995203   Yes

MP Memory Speed

 MP Memory Reading Speed Test 64 Bit Version 1 Using 4 Threads
 Start of test Tue Jun 7 11:32:54 2011

  Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  15704  11347  10961  17813  12518  15904  13744   8714   8758
       8  24188  15367  14929  26770  17870  21025  20789  10866  10234
      16  33319  19229  18266  38724  23589  23124  31390  13114  13157
      32  40697  20675  21180  51120  27260  25282  39385  13921  13960
      65  45013  22913  22267  57143  30132  24875  42247  14314  14241
     131  45569  23573  22953  61979  31356  27585  44688  14427  13289
     262  48701  23759  22666  63235  32103  27892  44447  14200  14453
     524  44900  22996  20417  53167  30753  25832  36085  14671  13403
    1048  44929  23357  20300  54596  30302  25790  36207  14708  13590
    2097  42017  22864  20927  42429  28809  24778  26734  13125  12659
    4194  34909  20379  19542  36402  25268  21093  18592  12625  12821
    8388  22498  17592  17006  23354  19577  18854  12489   9400   9657
   16777   8906   8697   8781   8884   8841   8844   4433   4217   4440
   33554   8848   8684   8606   8877   8436   8843   4412   4293   4422
   67108   8423   8445   8433   8685   8506   8526   4228   4296   4273
  134217   8704   8453   8572   8563   8426   8485   4383   4303   4346
  268435   8623   8579   8539   8731   8652   8612   4408   4301   4322
  536870   8683   8331   8534   8724   8658   8444   4371   4330   4325

MP Memory Bus Speed

 MP Bus Speeds 32 bit Version 1.0, 4 Threads, Fri Jun 17 16:44:21 2011

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6     3901     7614    14703    28644    29313    34882    74424
       24     7466    14648    28660    29468    37750    40926    79860
       96     4648     5085     8422    19230    33948    39486    74050
      384     4774     5131     9864    19142    32406    41067    82021
      768     2726     2746     5361     9874    17152    30193    42259
     1536     2407     2543     4943    10058    17570    29261    41159
    16380      812      837     1684     3635     6772    12743    16252
   131070      786      813     1605     3444     6259    12161    14950
   393210      807      855     1649     3333     6234    11625    14892

MP Memory Random Access

 RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sun Jun 26 18:00:21 2011

            ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    29630   53166   44120   44829   29620   29671   12108   11987
 Serial RW     5040    7334    7442    7402    7353    7395    8532    6247
 Random RD    28388   41211   27807   12265    8866    6611    2103    1271
 Random RW      657    1096    1229    1283    1288    1376    1648     993
 Mutex SRW     5962    8654    7998    7882    6982    6853    3579    3415
 Mutex RRW     6243    8594    5838    2815    1970    1370     486     310




Core i7 Multithreading Benchmarks

This is a quad core/8 thread 3.7 GHz Core i7 4820K with 10 MB L3 cache, normally running at Turbo Burst speed of 3.9 GHz. It has 4 memory channels with maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width) or 51.2 GB/second.
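The 51.2 GB/second figure follows directly from the four factors quoted above. A minimal Python sketch of that arithmetic, where the variable names are illustrative only:

```python
# Peak RAM bandwidth from the factors quoted in the text.
bus_mhz = 800          # memory bus clock in MHz
ddr_factor = 2         # DDR: two transfers per clock
channels = 4           # memory channels
bus_width_bytes = 8    # 64 bit wide bus per channel

peak_mb_per_s = bus_mhz * ddr_factor * channels * bus_width_bytes
peak_gb_per_s = peak_mb_per_s / 1000

print(f"Peak bandwidth: {peak_gb_per_s:.1f} GB/second")
```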

Simple Add Tests - See also Maximum CPU Speed Tests, where the stand-alone speeds are slightly faster than those for single threads. It also seems that, for these particular code sequences, eight threads are required to approach a four-fold performance improvement, where throughput is 12.2 MIPS/MHz and 15.8 MFLOPS/MHz.
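The per-MHz figures can be reproduced from the aggregate (last to finish) rates in the Simple Add Tests log below. A minimal sketch, assuming the CPU was running at its 3.9 GHz Turbo speed throughout; the variable names are illustrative:

```python
TURBO_MHZ = 3900  # assumption: 3.9 GHz Turbo speed during the run

# Rates copied from the Simple Add Tests log (Core i7 4820K).
single_int_mips, single_sse_mflops = 11937, 15450   # one thread
eight_int_mips, eight_sse_mflops = 47387, 61540     # 8 threads, aggregate

int_per_mhz = eight_int_mips / TURBO_MHZ      # about 12.2 MIPS/MHz
sse_per_mhz = eight_sse_mflops / TURBO_MHZ    # about 15.8 MFLOPS/MHz

int_gain = eight_int_mips / single_int_mips       # close to 4 times
sse_gain = eight_sse_mflops / single_sse_mflops   # close to 4 times

print(f"{int_per_mhz:.1f} MIPS/MHz, {sse_per_mhz:.1f} MFLOPS/MHz")
print(f"8 thread gain: {int_gain:.2f}x integer, {sse_gain:.2f}x SSE")
```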

Whetstone MP Benchmark - The single core version of this benchmark does not use pipelines very efficiently but, using 8 threads, performance of the MFLOPS tests is increased by 7.8 times, compared with 4 to 5 times on the integer routines.
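These gains can be checked against the totals in the Whetstone MP log below. A minimal sketch using the 1 thread and 8 thread totals for the first MFLOPS test and the three integer tests; the dictionary keys are illustrative labels:

```python
# Totals copied from the Whetstone MP log (Core i7 4820K).
one_thread = {"MFLOPS 1": 1331, "Fixpt": 4720, "If": 5855, "Equal": 983}
eight_threads = {"MFLOPS 1": 10321, "Fixpt": 25079, "If": 23559, "Equal": 5033}

# Ratio of 8 thread total to single thread result for each test.
gain = {test: eight_threads[test] / one_thread[test] for test in one_thread}

for test, ratio in gain.items():
    print(f"{test:9s} {ratio:.1f} times")
```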

MP MFLOPS Benchmark - This uses the same basic C code as the OpenMP variety. See comparisons above. Note that there is a second version, compiled to use AVX instructions. Maximum speed of one core, with linked multiply and add, is 31.2 GFLOPS using SSE instructions, and twice that with AVX. With 4 cores, SSE and AVX maximum GFLOPS are 124.8 and 249.6, with 75% and 71% of these being demonstrated.
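The theoretical maxima can be derived as follows. This is a sketch under stated assumptions: the 3.9 GHz Turbo clock, four single precision values per 128 bit SSE register (eight for AVX), and one multiply plus one add completed per lane per clock. The measured figures are the best 8 thread results in the MP MFLOPS log below.

```python
clock_ghz = 3.9        # assumed Turbo speed
cores = 4
ops_per_lane = 2       # linked multiply and add
sse_lanes = 4          # 128 bit register / 32 bit floats
avx_lanes = 8          # 256 bit register / 32 bit floats

sse_core = clock_ghz * sse_lanes * ops_per_lane   # about 31.2 GFLOPS
avx_core = clock_ghz * avx_lanes * ops_per_lane   # about 62.4 GFLOPS
sse_max = sse_core * cores                        # about 124.8 GFLOPS
avx_max = avx_core * cores                        # about 249.6 GFLOPS

# Best measured 8 thread speeds from the log, converted to GFLOPS.
sse_measured = 93282 / 1000
avx_measured = 177831 / 1000

sse_pct = 100 * sse_measured / sse_max
avx_pct = 100 * avx_measured / avx_max
print(f"SSE {sse_max:.1f} GFLOPS max, {sse_pct:.0f}% demonstrated")
print(f"AVX {avx_max:.1f} GFLOPS max, {avx_pct:.0f}% demonstrated")
```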

MP BusSpeed - This did not benefit by running via 8 threads, compared with four. Measured maximum RAM speed was greater than the 51.2 GB/second specification. This was due to all threads reading the same data and the 10 MB shared cache. A new version was produced, to minimise the effect, with threads starting reading from different addresses, still in the same data array, reducing maximum speed to 40 GB/second or less.

MP MemSpeed - This firstly shows single and double precision multiply + add tests, using one and eight threads, with normal 64 bit compilation and, again, with AVX options, then with one thread for a 32 bit Operating System. There are some start up overheads, providing slower performance than MemSpeed Benchmark above, using one thread, but, as each thread handles a unique segment of data, cache flushing is minimised with multiple threads. The benchmarks’ assembly code listings show that full SIMD SSE and AVX instructions are used but, possibly because of compiling for multiple threads, there are excessive numbers of addition instructions generated. This leads to some speeds being slower than OpenMP MemSpeed, and to SSE/SSE2 being faster than AVX.

The additional results, for the second tests with just addition, show that the compiled code is much better, with SSE/SSE2 speeds similar to MemSpeed via OpenMP and AVX instructions providing appropriate performance gains. Even so, none of these GFLOPS speeds are close to the maximum potential of 31.2 single precision GFLOPS with SSE, or twice that using AVX instructions (half these with double precision).

MP Random Access Benchmark - As expected, multithreading performance can be worse than using a single thread, when write back to memory is used, but reasonable performance and improvements were possible with data in the large L3 cache. Using Mutex restrictions led to no real gains from multithreading.

  

Simple Add Tests

 Multithreading Add Test 64 bit Version 1.0 Sat Nov 8 12:16:25 2014

 Integer Additions 8 Threads

 Thread 3 -   6318 64 bit Integer MIPS
 Thread 5 -   6307 64 bit Integer MIPS
 Thread 2 -   6241 64 bit Integer MIPS
 Thread 6 -   6212 64 bit Integer MIPS
 Thread 7 -   6124 64 bit Integer MIPS
 Thread 4 -   6036 64 bit Integer MIPS
 Thread 8 -   6001 64 bit Integer MIPS
 Thread 1 -   5923 64 bit Integer MIPS
 Total -     49162 64 Bit Integer MIPS
 Aggregate - 47387 64 Bit Integer MIPS, based on last to finish

 SSE Floating Point Additions 8 Threads

 Thread 7 -   7767 32 Bit SSE MFLOPS
 Thread 8 -   7765 32 Bit SSE MFLOPS
 Thread 3 -   7752 32 Bit SSE MFLOPS
 Thread 4 -   7749 32 Bit SSE MFLOPS
 Thread 5 -   7738 32 Bit SSE MFLOPS
 Thread 2 -   7727 32 Bit SSE MFLOPS
 Thread 1 -   7725 32 Bit SSE MFLOPS
 Thread 6 -   7693 32 Bit SSE MFLOPS
 Total -     61916 32 Bit SSE MFLOPS
 Aggregate - 61540 32 Bit SSE MFLOPS, based on last to finish

 Single Thread

             11937 64 Bit Integer MIPS
             15450 32 Bit SSE MFLOPS

 Two Threads

             23069 64 Bit Integer MIPS
             30887 32 Bit SSE MFLOPS

 Four Threads

             24717 64 Bit Integer MIPS
             24167 64 Bit Integer MIPS, based on last to finish
             46409 32 Bit SSE MFLOPS
             30903 32 Bit SSE MFLOPS, based on last to finish

Whetstone MP Benchmark

 Multithreading Double Precision Whetstones 64-Bit Version 1.0
 Using 8 threads - Sat Nov 8 14:58:12 2014

          MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
 Thread              1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

    1      3828   1321   1320    959     92     62   3156   2963    629
    2      3803   1270   1321    952     92     61   3155   2930    628
    3      3811   1315   1282    956     92     61   3125   2990    630
    4      3807   1259   1280    952     92     62   3145   2958    629
    5      3821   1286   1287    961     92     62   3087   2926    629
    6      3815   1283   1284    962     91     62   3134   2933    629
    7      3818   1300   1306    956     92     62   3135   2929    629
    8      3821   1286   1304    958     92     62   3143   2931    629

 Total    30524  10321  10384   7657    733    494  25079  23559   5033

 Total
 1 Thrd    4648   1331   1331    977    122     70   4720   5855    983
 2 Thrd    9274   2661   2660   1945    243    140   9769  11717   1964
 4 Thrd   18078   5263   5229   3907    488    265  15620  17408   3929

MP MFLOPS Benchmark

                       MFLOPS 1 to 8 Threads

   4 Byte  Ops/ Repeat     SSE  ------- SSE -------  ------- AVX -------
    Words  Word Passes   1 CPU      1      4      8      1      4      8

   100000     2   2500    9918   9681  45340  54621  12542  62273  60258
  1000000     2    250    9688   9759  21688  41832  11404  23031  44329
 10000000     2     25    5870   5990   9237  10026   5991   8970   9977
   100000     8   2500   24448  24533  49320  92086  35982 159040 173224
  1000000     8    250   24465  24570  49918  92352  36180  80096 151909
 10000000     8     25   20055  19975  36638  39982  23299  40124  40153
   100000    32   2500   23251  23269  46942  92408  46400  90572 173372
  1000000    32    250   23265  23307  89676  93282  46572  91058 177831
 10000000    32     25   23063  23052  91029  92050  44729  88877 158594

MP Memory Speed

 x[m]=x[m]+s*y[m]

              64b 1 Thread  64b 8 Thread   64b AVX 1 T   64b AVX 8 T  32b 1 Thread
     KBytes   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl
       Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

          4  29668  15246  37397  22021  16828  10053  38396  31823  22323  11275  L1
          8  30422  15420  52063  33134  16865  10130  46928  32871  22744  11317
         16  30754  15503  69122  44818  16891  10136  53801  37870  22887  11340
         32  30680  15459  98246  51419  16102  10134  66372  37707  22870  11324
         64  28867  15292 103196  54739  16872  10132  68113  39620  22352  11281  L2
        128  28955  15286 115996  53402  16895  10132  61264  36423  22359  11296
        256  28741  15287 113644  60777  16785  10134  68244  40618  22165  11296
        512  24664  15200 116243  60628  16580  10128  65631  37589  21408  11285  L3
       1024  24662  15207 117177  57777  16620  10087  63796  37746  21288  11270
       2048  24424  15207  95433  58470  16444   9827  64988  40739  21305  11268
       4096  24408  14253  98608  57900  15592   9839  63209  36650  20958  11141
       8192  24213  14940  99671  56541  15666   8823  67851  38623  20297  11030
      16384  14983  11747  28689  28004  12310   9117  30911  28600  15179  10297  RAM
      32768  14667  11464  25857  25885  12253   9098  24926  24294  15075   9576
      65536  14523  11772  24875  24963  11968   9016  24070  22805  14547   9738
     131072  14433  11570  24789  24833  12564   9180  23856  25190  15249  10246
     262144  14266  11165  25525  24575  12529   8851  25236  22608  15273  10252
     524288  14386  11824  25054  24707  12338   8931  24974  24490  15295  10268
    1048576  14452  11468  25402  25735  11954   8972  24917  24153  15308  10278
    2097152  14908  11769  25100  25402  12396   8901  24545  25061
    4194304  14938  11916  24785  24556  12284   9007  24608  25285

 Max GFLOPS    3.8    3.9   14.6   15.2    2.1    2.5    8.5   10.2    2.9    2.8

 x[m]=x[m]+y[m]

              64b 1 Thread  64b 8 Thread   64b AVX 1 T   64b AVX 8 T  32b 1 Thread
     KBytes   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl
       Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

         16  41065  20688  82924  46075  61385  61280 116710  90819  27816  14030  L1
        128  34323  20476 140036  76202  48299  47771 226979 230972  26712  13977  L2
       8192  26045  19106 108046  80815  28005  27977 121758 113292  22607  13535  L3
     131072  15644  14115  25675  25639  14893  14915  24319  25609  15862  12099  RAM

 Max GFLOPS    2.6    2.6    8.8   10.1    3.8    7.7   14.2   28.9    1.7    1.8

MP Memory Bus Speed

 MP Bus Speeds 64 bit Version 1.0, 4 Threads, Sun Nov 23 10:35:01 2014

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    76609    51101    75602   140546   104501   167163   205782  L1
       24   120982   107268   113828   153185   170288   149892   248761
       96    41962    40737    43299    73311   123250   160399   240730  L2
      384    19664    20262    20831    38942    75517   128002   160495  L3
      768    19242    19941    20676    39821    73897   127177   152781
     1536    19103    19854    20683    39137    54701   127196   152980
    16380     6210     6913     8363    14942    29204    56919    56522  RAM
   131070     5901     6947     8368    15029    29096    51843    61776
   393210     5909     5426     8370    12684    29097    58307    59609

 1 Thread
        6    31501    31266    31243    41117    36617    41277    61526
      768     5303     5386     5499    10808    19429    33765    38337
   131070     1229     1470     2054     4514     8754    18043    18094


 MP Bus Speeds 64 bit Version 2.0, 4 Threads, Sun Nov 23 10:35:44 2014

 Same as Version 1.0, except each thread starts at different address

   Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

        6    28749    29616    58739    64451    61610   129160   231735
       24   114043   117435   119746   108799   143160   163902   245756
       96    39170    40423    42705    76442   110895   154667   240928
      384    19631    20232    20793    40066    69429   126075   158417
      768    19212    19923    20648    39748    72952   125329   151560
     1536    19086    19296    20661    39791    73469   120135   152311
    16380     5843     6857     8210    14523    27776    55150    59064
   131070     2038     3108     5197    10201    20004    38092    40726
   393210     2090     3101     5072     9867    19538    39489    39824
   786420     2083     2943     5082    10133    20016    37592    40764
  1572840     2025     3011     5091    10207    19039    39479    40781

 1 Thread
        6    31501    31266    31243    41117    36617    41277    61526
      768     5303     5386     5499    10808    19429    33765    38337
   131070     1226     1484     2096     4411     8462    18188    18382

MP Memory Random Access

 RandMemMP Speeds 64 Bit Version 1, 8 Threads, Sat Nov 8 12:41:51 2014

            ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    37112   77469   94806   94862   90795   86826   65882   56315
 Serial RW     8924   29533   54380   47712   51176   69146   68008   22145
 Random RD    36944   76814   62245   33838   24552   21588   13472    3341
 Random RW     2000    6016    9058   17412   16237   16733   10066    2806
 Mutex SRW     7829   16705   19723   16432   16331   16570   11550   10669
 Mutex RRW    10672   20797    8933    5659    4844    4561    2659     940

 RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sat Nov 8 12:39:21 2014

 Serial RD    28021   27808   20268   19318   19231   19255   12455   11589
 Serial RW    29972   30232   21894   17867   17410   17420   12242   11581
 Random RD    27479   27463   13595    8251    6228    5605    2470    1011
 Random RW    30429   30076    9224    6120    5177    4782    2800     982
 Mutex SRW    29987   30245   21895   17875   17419   17249   12373   11495
 Mutex RRW    30417   30027    9199    6117    5175    4780    2796     982




Image Processing Benchmarks

SDL_bmpspd32 and SDL_bmpspd64 benchmarks execute the same tests as the Windows version, where details and results can be found in bmpspeed results.htm. They are 32 bit and 64 bit varieties compiled to run under Linux using Simple DirectMedia Layer (SDL) functions. The benchmarks generate BMP files and measure speed of saving, loading, scrolling, rotating and editing of 0.5, 1, 2, 4 etc. to 512 MB images.

The programs automatically adjust maximum image size used, depending on available main memory, but run time parameters can be used to change this. The execution files, source code, compilation and running instructions can be found in linux_image_processing_benchmarks.tar.gz with further details in linux image processing benchmarks.htm. Example results are below. Besides the standard Configuration Details shown earlier, additional attributes, obtained for this benchmark, are determined and included in the following example results.

Hardware benchmarked for the main report comprised desktops, a laptop and a netbook, using internal and external (eSATA) disk drives plus USB flash memory and disk drives. Linux versions used were 32-Bit and 64-Bit Ubuntu 10.10 with GNOME 2, 64-Bit Ubuntu 11.04 with Unity on two different graphics arrangements, 64-Bit Fedora 14 with GNOME 2 and 64-Bit OpenSuse 11.4 with KDE.


 Additional System Details

 #####################################################################

  Memory stats from /proc/meminfo

     MemTotal:   3963.8 MB   A
      MemFree:   3181.8 MB   B
      Buffers:     46.5 MB   C
       Cached:    297.5 MB   D

  Memory Used:    438.0 MB = A - B - C - D

  Current Directory Path (getcwd) and drive space (statvfs): 
  /home/roy/all64/bmpspd
  Total MB   11263, Free MB    9446, Used MB    1817
  See files hd1.txt and hd2.txt for details of drive used

  SDL_GetVideoInfo
  hw_available flag is 0 - cannot create hardware surfaces
  Display size 1280 x 1024 pixels at 32 bits

  SDL_VideoDriverName = x11

  Graphics (command - lspci | grep -i vga > vga.txt)
  VGA compatible controller: nVidia Corporation G84 [GeForce 8600 GT] (rev a1)

 #####################################################################

   Image Editing Speeds 64 Bit Version 1, Sat Aug  6 09:45:47 2011

   Input Enlarge    Save    Load  Scroll  Scroll  Rotate  Max MB
   Image Display         Display  Repeat Overall  90 deg  Memory
  Mbytes    Secs    Secs    Secs   msecs  MB/Sec    Secs    Used

     0.5    0.02    0.01    0.01    0.83  601.15    0.01   440.2
     1.0    0.02    0.05    0.02    1.63  612.30    0.02   441.9
     2.0    0.02    0.02    0.03    3.31  634.52    0.02   445.4
     4.0    0.03    0.04    0.06    5.66  625.44    0.03   451.6
     8.0    0.05    0.08    0.11    6.73  584.70    0.05   464.7
    16.0    0.09    0.16    0.20    6.77  580.53    0.08   489.5
    32.0    0.16    0.29    0.31    6.70  587.05    0.16   541.1
    64.0    0.29    0.59    0.71    6.94  566.85    0.32   672.4
   128.0    0.59    1.32    1.22    6.64  592.54    0.65   785.3
   256.0    1.14    2.35    2.60    6.63  593.46    3.51  1129.9
   512.0    2.27    4.90    4.73    6.65  591.47    3.91  1822.9

                   End at Sat Aug  6 09:46:58 2011
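
The Memory Used line in the log above is a simple subtraction of /proc/meminfo fields. The following is a minimal Python sketch of that calculation, not the benchmark's own code. It assumes the fields are reported in kB and converted at 1024 kB per MB; the function name and the sample text are hypothetical, with sample values chosen only for illustration.

```python
def memory_used_mb(meminfo_text):
    """Return MemTotal - MemFree - Buffers - Cached, in MB."""
    wanted = {"MemTotal", "MemFree", "Buffers", "Cached"}
    kb = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        name = name.strip()
        # Exact name match, so "SwapCached" is not mistaken for "Cached".
        if sep and name in wanted:
            kb[name] = int(rest.split()[0])   # first token is the kB value
    used_kb = kb["MemTotal"] - kb["MemFree"] - kb["Buffers"] - kb["Cached"]
    return used_kb / 1024.0

# Hypothetical sample in /proc/meminfo layout (not measured data).
SAMPLE = """\
MemTotal:        4194304 kB
MemFree:         3145728 kB
Buffers:           51200 kB
Cached:           307200 kB
SwapCached:            0 kB
"""

print(f"Sample: {memory_used_mb(SAMPLE):.1f} MB used")

# The same subtraction applied to the MB values in the log above.
log_used = round(3963.8 - 3181.8 - 46.5 - 297.5, 1)
print(f"Log: {log_used} MB used")
```

On a Linux system the function could be applied to live data with memory_used_mb(open("/proc/meminfo").read()).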



OpenGL Benchmark

The benchmarks, videogl32 and videogl64, are 32-Bit and 64-Bit Linux compilations of OpenGL code used for testing via Windows. Details and results can be found in Linux OpenGL Benchmarks.htm. The benchmarks measure graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

The textures are obtained from 24 bit BMP files that can be up to 256 x 256 pixels at 192 KB. The BMP files and Linux execution files can be found in linux_opengl_benchmarks.tar.gz, along with source code, compilation and running instructions. Windows benchmarks from the same source code are also included.

The benchmarks were run on a variety of Ubuntu, Fedora and OpenSuse distros and different PC hardware, with nVidia, ATI and Intel graphics. Newly installed Linux systems do not [so far] provide OpenGL hardware acceleration and, except for nVidia, finding such a driver that works with a particular release is seemingly impossible, in some cases. As a default, the benchmark runs using a full screen window, but input parameters allow different sized windows to be used, via Terminal commands or a script file. Following are example log files from tests using a Core 2 Duo CPU and GeForce 8600 GT graphics, using a default driver and one from nVidia. Decreasing performance, as the window size increases, suggests a graphics speed limitation, with constant performance indicating that processor speed is the limiting factor.


 #####################################################################

 Linux OpenGL Benchmark 64 Bit Version 1, Wed Oct 26 22:29:24 2011

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    221.7    158.1    162.4    109.3     72.1     48.0
   640   480     60.9     53.5     46.2     37.6     52.7     22.2
  1024   768     23.7     22.0     18.4     15.6     34.9     10.7
  1280  1024     15.6     14.6     12.0     10.3     28.5      7.4

                   End at Wed Oct 26 22:31:38 2011

 #####################################################################

 Linux OpenGL Benchmark 64 Bit Version 1, Tue Oct 25 18:36:45 2011

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240   3670.2   2326.6   1160.9    678.8    401.0    229.2
   640   480   2463.1   2033.9    896.3    666.3    414.5    231.3
  1024   768   1089.2    987.3    541.6    440.9    401.8    214.6
  1280  1024    727.0    680.8    412.1    338.3    400.2    194.0

                   End at Tue Oct 25 18:38:58 2011
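
The rule of thumb for the limiting factor can be applied to the columns of the second log above. This is a rough sketch only: the 1.1 threshold for treating FPS as constant is an assumption made for illustration, and the function name is hypothetical.

```python
# FPS at window sizes 320x240, 640x480, 1024x768 and 1280x1024,
# copied from the second log above.
coloured_few = [3670.2, 2463.1, 1089.2, 727.0]
wireframe_kitchen = [401.0, 414.5, 401.8, 400.2]

FLAT_RATIO = 1.1  # assumed: up to 10% spread counts as constant FPS

def limiting_factor(fps_by_window):
    """Classify a test as processor or graphics limited."""
    spread = max(fps_by_window) / min(fps_by_window)
    return "processor" if spread <= FLAT_RATIO else "graphics"

print("Coloured Few:", limiting_factor(coloured_few))
print("WireFrm Kitchen:", limiting_factor(wireframe_kitchen))
```

With these figures, the coloured objects test falls by a factor of about five as the window grows, while the wireframe kitchen test stays near 400 FPS at every size.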



On-Line Benchmarks

A Java version of the Whetstone Classic Benchmark, executed via a downloaded HTML page, was produced in 1997. Because of the timing considerations in those days, the benchmark ran for 100 seconds. It also included a measurement of graphics speed. Running this via Firefox and Linux identified some unacceptable text displays and measured speeds, due to over-optimisation. For a new version, compiled via Java installed under Linux, the code was modified slightly to avoid this, running time was reduced and graphics tests were excluded.

The benchmark is run via WhetJava2.html or indirectly from online benchmarks.html, which also includes tests to measure downloading speed of images (see below). Performance results are produced in graphics format, which can be kept using Take ScreenShot. A version of the new benchmark was also compiled that runs from a Terminal command, to produce text output to the window and a log file. Format is the same as the graphics display and an example is given below.

Results via Linux and Windows are available in Whetstone Benchmark Results - Java. These show differences in 32 bit vs 64 bit, Windows vs Linux, On-line vs Off-line and same results with different browsers. The benchmarks, including source code, can be downloaded from onlinetests.zip or onlinetests.tar.gz.


  *************************************************************

     Whetstone Benchmark Java Version, Dec 8 2011, 23:38:14

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    894.69             0.0215
  N2 floating point  -1.131330490    732.82             0.1834
  N3 if then else     1.000000000            1027.81    0.1007
  N4 fixed point     12.000000000            1735.54    0.1815
  N5 sin,cos etc.     0.499110132              41.15    2.0220
  N6 floating point   0.999999821    496.69             1.0860
  N7 assignments      3.000000000             582.23    0.3174
  N8 exp,sqrt etc.    0.825148463              33.54    1.1090

  MWIPS                             1991.45             5.0215

  Operating System    Linux, Arch. amd64, Version 2.6.34-12-desktop
  Java Vendor         Sun Microsystems Inc., Version  1.6.0_26

Online Benchmark Downloading Tests measure the downloading time of 1 MByte or 100 KByte BMP, GIF and JPG files, and of 200 or 400 GIF files of 70 bytes each. Of particular note, the typical loading time of the 400 GIFs (28 KB in total) is twice as long as that for the 1 MB image files.


Booting Time

Below are booting times on two PCs, from boot menu selection to loaded desktop. The two PCs are a Netbook with a 1.66 GHz Atom CPU, originally running Windows XP, and a desktop PC with a 2.4 GHz Core 2 Duo and Windows Vista. Besides seconds to boot, MB/second reading speed of the drives is provided, derived from the Image Processing Benchmark results. The first results show Windows booting time, for comparison purposes, the Core 2 Duo being particularly slow. The second and fastest results are for 64-Bit Ubuntu 10.10, booting from the Windows disk in the Netbook, and a fast (for 2009) eSATA disk on the desktop.

Figures for the next six entries are from USB sticks, booting 32-Bit and 64-Bit Ubuntu 10.10, 64-Bit Ubuntu 11.04, 64-Bit Fedora 14 and 64-Bit OpenSuse 11.4. On moving the drives between systems, it seems that booting time of the next system used can be considerably longer than normal (needs to use alternative drivers?). Also, the first Linux installations were with Ubuntu and nVidia drivers were installed in order to run CUDA based benchmarks, probably the reason why these would only fully boot on using Recovery Mode on the Netbook, with its Intel graphics.

On the desktop, all Linux loading times are faster than Windows, even when using much slower drives, but the fastest flash drive does not necessarily produce the shortest booting time. Repeating the tests a number of times indicates that booting time depends on differing hardware/distro combinations. The last result is with OpenSuse on a USB disk drive, where the faster data transfer speed, compared to a flash drive, does not improve booting time much.


                              Netbook, WinXP, 5400      Desktop, Vista 7200 RPM
                              RPM Local Disk            SATA and eSATA Disks

 Drive         Linux           Boot1 Boot2  Disk  Mode   Boot1 Boot2  Disk  Mode
                                Secs  Secs  MB/s          Secs  Secs  MB/s

 Windows Disk                     64    50  70.0  Norm     170   170  47.8  Norm


 Local Disk    Ubuntu 10.10       37    35  56.0  Norm      22    23 108.0  Norm


 Old Staples   Ubuntu 10.10      100    66   9.3   Rec      76    71   8.8  Norm
 4 GB Stick    64 Bit                                       95    71        Rec

 PNY Attache   Ubuntu 10.10      100    77  18.2   Rec     103    62  20.4  Norm
 4 GB Stick    32 Bit

 Cruzer U3     Ubuntu 10.10       50    51  16.4   Rec      57    57  16.9  Norm
 4 GB Stick    64 Bit

 Patriot Rage  Ubuntu 11.04       46    57  24.3  Norm      76    48  26.8  Norm
 8 GB Stick    64 Bit

 Cruzer U3     Fedora 14         110    98  22.0  Norm      73    70  23.8  Norm
 16 GB         64 Bit

 Cruzer Blade  OpenSuse 11.4      82    70  19.1  Norm      70    44  20.8  Norm
 8 GB Stick    64 Bit

 USB Disk      OpenSuse 11.4      59    60  28.4  Norm      48    42  34.8  Norm
               64 Bit






Roy Longbottom December 2014



The Official Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection