Linux AVX Benchmarks

Roy Longbottom


Contents


General
Classic Benchmarks
Maximum CPU Speeds
OpenMP & MemSpeed
MultiThread MemSpeed
MP MFLOPS


SSE and AVX Instructions


General

My original Linux benchmarks, described in Linux Benchmarks.htm, were compiled with an early version of GCC 4 under Ubuntu 10.10. Ubuntu 14.04 provides GCC 4.8.2, which can generate later Intel CPU instructions, including AVX1 (see SSE and AVX Instructions). Benchmarks that could benefit from AVX are being recompiled and at least tested on a new 3.7 GHz Core i7 based PC. The latter was used to provide the following comparisons of original and new benchmark results, under Ubuntu 14.04. This CPU can run at 3.9 GHz using Turbo Boost, and maximum speed in GFLOPS per core (4 available) is GHz x 4 (SSE single precision) x 2 (with linked multiply and add), or 31.2 GFLOPS, and 62.4 using AVX1. Using double precision, the best possible scores are 15.6 and 31.2 GFLOPS respectively. The system has four memory channels, with a maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width in bytes), or 51.2 GB/second.
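The peak-speed arithmetic above can be spelled out as a small helper. This is only an illustration of the calculation quoted in the text; the function name is mine, not part of any benchmark.

```c
/* Theoretical peak GFLOPS = clock GHz x SIMD lanes x linked operations.
   Figures in the comments are those quoted in the text for the 3.9 GHz
   Turbo Boost clock. */
double peak_gflops(double ghz, int simd_lanes, int linked_ops)
{
    return ghz * simd_lanes * linked_ops;
}

/* peak_gflops(3.9, 4, 2) = 31.2  SP SSE  (4 floats per 128 bit register)
   peak_gflops(3.9, 8, 2) = 62.4  SP AVX1 (8 floats per 256 bit register)
   peak_gflops(3.9, 2, 2) = 15.6  DP SSE2
   peak_gflops(3.9, 4, 2) = 31.2  DP AVX1
   RAM: 0.8 GHz bus x 2 (DDR) x 4 channels x 8 bytes = 51.2 GB/second */
```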

When recompiled benchmarks produce significantly different results from the older ones, they are made available in AVX_benchmarks.tar.gz. This also contains source code with the changes that enable error free compilation and correct execution.

To Start

Classic Benchmarks

These are Whetstone, Dhrystone, Linpack and Livermore Loops, described in more detail in Classic.htm. The original benchmarks and source code are available in classic_benchmarks.tar.gz.

Whetstone Benchmark - Comprises 8 tests and an overall MWIPS (Whetstone MIPS) rating. This compiled with numerous warning messages, such as “incompatible implicit declaration of built-in function sin [enabled by default]”, but the defaults appear to be correct. The data arrays used are too small to benefit from SIMD and, except for the maths functions, speed is virtually the same for all x86 and 64 bit calculations. The x64 maths functions are faster than the hard wired ones used with x86. The last test just copies array data and is more sensitive to variations in compiled code.

Dhrystone Benchmark - There are two versions of this benchmark, the second produced to minimise unwarranted optimisation. Non-optimised and optimised speeds are provided. Dhrystone 2 produced two unexpected compiler warning messages. These benchmarks make no use of SSE or AVX instructions, and all 64 bit version speeds are the same.

Linpack Benchmark - The original 64 bit version, with SSE2 instructions for double precision, used SISD code. Although the new compilation made use of SIMD, surprisingly there was no speed gain, but SIMD AVX produced the fastest speed on the PC in question.

Livermore Loops Benchmark - The compiler indicated errors, necessitating changes to the data structure. The benchmark has 24 double precision test loops. The original 64 bit Linux version, with SISD SSE2, produced slightly faster speeds than the 32 bit x86/87 version and similar speeds to the recompilation. The AVX program made little difference. It is rather surprising that the new SSE2 and AVX benchmarks did not have SIMD implementations. However, at least the first loop can achieve 9.4 GFLOPS with SSE2 and 13.6 GFLOPS using AVX, in a slightly different structure. The benchmark counts the number of passes in a standard function that only takes any action when all passes have been run. The higher speeds were obtained by using an outer loop to control the pass count.
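The restructure described above can be sketched as follows, using Livermore Kernel 1 (the hydro fragment). The constants and array size are illustrative, not the benchmark's own values; the point is that an outer loop controlling the pass count leaves a clean inner loop for the compiler to vectorise.

```c
#define N 1001

/* Livermore Kernel 1: x[k] = q + y[k]*(r*z[k+10] + t*z[k+11]).
   Running the repeat passes as an outer loop here, rather than through
   the benchmark's pass-counting function, keeps the inner loop in a
   form the compiler can turn into straight SIMD code. */
double kernel1(int passes, double x[], const double y[], const double z[])
{
    double q = 0.5, r = 1.5, t = 2.5;   /* illustrative constants */
    for (int p = 0; p < passes; p++)
        for (int k = 0; k < N - 11; k++)
            x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
    return x[0];   /* use the result so the loops are not optimised away */
}
```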

The new Linpack AVX benchmark and revised Livermore Loops benchmark C source code are included in AVX_benchmarks.tar.gz.


                     Whetstone Benchmark Optimised

          MWIPS  MFLOP  MFLOP  MFLOP    COS    EXP  FIXPT    IF   EQUAL
                    1      2      3    MOPS   MOPS   MOPS   MOPS   MOPS

 Old x86   3959   1331   1331    938     97   42.1   6516  10967   5851
 Old x64   4880   1331   1324    977    129   64.2   6517  11657   1812
 New x64   4891   1330   1323    977    120   64.5   6505  11638   3903
 New AVX   4897   1325   1323    977    120   64.5   6515  11649   3909


            Dhrystone Benchmarks                 Linpack Benchmark

            Dhry1  Dhry1  Dhry2  Dhry2
            NoOpt    Opt  NoOpt    Opt
              VAX    VAX    VAX    VAX           No Opt       Opt
             MIPS   MIPS   MIPS   MIPS           MFLOPS    MFLOPS   

 Old x86     7108  29277   7478  16356              988      2534
 Old x64     8436  32659   8481  23607              900      3672
 New x64     8441  32499   8381  24140              946      3631
 New AVX     8441  32575   8395  23626              935      5413
 

                                Livermore Loops MFLOPS
               LOOP
CPU             1     2     3     4     5     6     7     8     9    10    11    12
               13    14    15    16    17    18    19    20    21    22    23    24

 Old x86     4327  3661  2622  2642   527  2250  4217  5549  5223  2511  1311  1279
              450  1036   730  2038  2479  2835   810   783  2820   419  2022   967
 Old x64     4707  3434  2629  2657   565  2155  4592  6131  5442  2602  1314  1296
              937  1239  2288  2293  2392  3538   839   968  2792   939  2034  1720
 New x64     4729  3422  2639  2657   565  2164  4599  5714  4984  2446  1310  1879
             1018  1267  2287  2012  2397  5343   836   969  3042   940  2011  1840
 New AVX     4692  3488  2638  2654   564  2160  4471  5717  4978  2619  1308  1863
              978  1305  2285  2043  2492  6418   836   968  3069   938  2010  1558
   

To Start

Maximum CPU Speeds and CPUID - AVXid64

This benchmark follows those in WhatCPU results.htm, where various assembly code integer and floating point add instructions, using 1, 2, 3 and 4 registers, are executed and attempt to demonstrate certain maximum speeds. The benchmark and source codes are included in AVX_benchmarks.tar.gz.

Following are results for a 3.7 GHz Core i7 that appears to run at the Turbo Boost speed of 3.9 GHz. Here the maximum single precision speed is 8 x 3.9 or 31.2 GFLOPS, increasing to 62.4 GFLOPS through the hardware's ability to link associated add and multiply instructions, as in x = x + y * z.
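The 1, 2 and 4 register tests rely on independent dependency chains, which can be illustrated in C with multiple accumulators. This is only a sketch of the idea behind the assembly code tests, with illustrative names and counts:

```c
/* One accumulator: every add depends on the previous result, so the
   loop runs at one add per instruction latency. */
float add_1reg(int n, float inc)
{
    float a = 0.0f;
    for (int i = 0; i < n; i++)
        a += inc;
    return a;
}

/* Four independent accumulators: up to four adds can be in flight at
   once, approaching the throughput limits the benchmark demonstrates.
   n is assumed to be a multiple of 4. */
float add_4reg(int n, float inc)
{
    float a = 0.0f, b = 0.0f, c = 0.0f, d = 0.0f;
    for (int i = 0; i < n; i += 4) {
        a += inc;
        b += inc;
        c += inc;
        d += inc;
    }
    return a + b + c + d;
}
```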


##############################################

  Assembler CPUID and RDTSC       
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4 
         Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz 
  Measured - Minimum 3711 MHz, Maximum 3711 MHz 
  Linux Functions 
  get_nprocs() - CPUs 8, Configured CPUs 8 
  get_phys_pages() and size - RAM Size 31.36 GB, Page Size 4096 Bytes 
  uname() - Linux, roy-i7UB14, 3.13.0-24-generic 
  #46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014, x86_64 

##############################################

 AVX1 ID and Speed Test 64 bit Version 1.0 Wed Nov  5 10:22:42 2014
 
 Test                                           GFLOPS    Sumcheck

 SP 1-2 Register add vr0+1                       10.43        OK
 SP 2-4 Register add vr0+1 vr2+3                 20.86        OK
 SP 4-8 Register add vr0+1 vr2+3 v4+5 vr6+7      30.92        OK
 As SP 4-8 with add and multiply vrx+y*z         62.00        OK
 DP 1-2 Register add vr0+1                        5.21        OK
 DP 2-4 Register add vr0+1 vr2+3                 10.43        OK
 DP 4-8 Register add vr0+1 vr2+3 v4+5 vr6+7      15.46        OK
 As DP 4-8 with add and multiply vrx+y*z         31.00        OK

   

To Start

OpenMP & MemSpeed - memory_speed64AVX, memory_speed64AVXOMP

This is a variation of my MemSpeed benchmark, using the calculations shown below. The first floating point test is the same as the performance dependent code in the Linpack benchmark, but covers caches and RAM. Calculations use double precision (DP) and single precision (SP) floating point, then integer numbers. The same program was compiled without and with OpenMP directives. Both benchmarks are also available in standard 64 bit format (SSE/SSE2), which can be found in linux_openmp.tar.gz. Results and comparisons are shown below.

All older SSE/SSE2 version results are shown below, where DP/SP speeds are a long way from the possible 15.6/31.2 GFLOPS. Then there is a summary of AVX version results, where appreciable gains are produced from L1 cache with floating point, but no difference using integers. DP/SP gains are smaller using the other caches, and speeds are virtually the same using RAM.

The next results are for the SSE/SSE2 benchmark using OpenMP. Here, performance is worse from L1 cache, due to overheads and, probably, cache flushing when accessing a shared data array. The best improvements, and highest GFLOPS, are with L3 cache data sizes. The benchmark shows that multiple cores need to be used to exploit the high memory bus bandwidth, where the 25 GB/second MP speed is quite respectable against the maximum specification of 51.2 GB/second.
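The OpenMP variant amounts to splitting the same triad loop across cores with a directive, along the following lines. This is a hedged sketch, not the benchmark source; with L1-sized data the thread start-up and sharing overheads can outweigh the extra cores, which is consistent with the slower small-buffer figures.

```c
/* The triad x[m] = x[m] + s*y[m] with the iteration space divided
   among threads. Compile with gcc -fopenmp; without the flag the
   pragma is ignored and the loop runs serially with the same result. */
void triad_omp(int words, double x[], const double y[], double s)
{
    #pragma omp parallel for
    for (int m = 0; m < words; m++)
        x[m] = x[m] + s * y[m];
}
```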

The final results are for AVX with OpenMP, where the compiler fails to implement AVX instructions, and the code is essentially the same as the old SSE/SSE2 version. However, there are some performance gains on using AVX functions.


 #####################################################

       Memory Reading Speed Test 64 Bit Version 4.1 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   39311  24057  52483  40345  24058  52352  28687  15957  29066 L1
       8   39076  24566  57022  40079  24470  57001  30005  15794  30071
      16   39851  24795  59688  40773  24768  59685  30605  15683  30691
      32   39859  24862  60824  41216  24876  61083  28148  15675  30978
      64   32844  24462  47825  34369  24707  47838  23819  15646  29441 L2
     128   32879  24498  48223  34308  24841  48325  23978  15603  29659
     256   30516  23886  43374  31823  24290  43355  20623  15412  26554
     512   25604  22420  30617  26141  22961  30617  15299  13893  17247 L3
    1024   25565  22368  30352  26103  22992  30275  15124  13823  17145
    2048   25589  22479  30344  26056  23017  30339  15120  13793  17155
    4096   25600  22405  30332  26136  23025  30249  15122  13829  17159
    8192   25593  22460  30297  26025  22997  30299  15110  13832  17160
   16384   15083  14415  14745  15085  14690  14752   7484   7601   7464 RAM
   32768   14845  14293  14391  14840  14313  14382   7331   7480   7330
   65536   14959  14424  14498  14961  14466  14490   7387   7518   7343
  131072   15041  14492  14607  15048  14592  14608   7416   7550   7371
  262144   15023  14491  14598  15017  14595  14601   7406   7523   7377
  524288   15053  14520  14645  15096  14666  14659   7424   7570   7395
 1048576   15085  14534  14659  15093  14675  14650   7432   7565   7396
 2097152   15096  14538  14670  15109  14687  14649   7433   7573   7401
 4194304   15096  14544  14665  15108  14684  14673   7434   7570   7402

 Max GFLOPS  5.0    6.2

       Memory Reading Speed Test 64 Bit AVX v4.1 by Roy Longbottom

       8   61747  57965  57139  60342  60007  60148  39695  39314  39280 L1
      16   62152  59667  59718  61363  61268  61332  40675  40425  40426
  Gain      1.57   2.38   1.00   1.51   2.46   1.04   1.33   2.53   1.31
     128   47554  41906  47884  48347  48245  47791  29682  29561  29698 L2
     256   39989  36923  40809  41397  41077  40806  26011  25996  25973
  Gain      1.38   1.63   0.97   1.36   1.82   0.97   1.25   1.79   0.99
    2048   30093  29338  30337  30667  30641  30339  17175  17173  17171 L3
    4096   30083  29362  30301  30654  30650  30313  17183  17186  17185
  Gain      1.18   1.31   1.00   1.17   1.33   1.00   1.14   1.24   1.00
   65536   14807  14959  14656  14590  14594  14656   7361   7352   7350 RAM
  131072   14857  15026  14654  14612  14621  14666   7392   7381   7377
  Gain      0.99   1.04   1.01   0.97   1.01   1.01   1.00   0.98   1.00

 Max GFLOPS  7.8   14.9 

       Memory Reading Speed Test 64 Bit OpenMP v4.1 by Roy Longbottom
                          Gain over  No OpenMP

       8    5058   4962   5001   5184   5038   5005   2665   2604   2609 L1
      16    9662   9412   9612  10322   9790   9637   5234   5045   5060
  Gain      0.19   0.29   0.12   0.19   0.30   0.12   0.13   0.24   0.13
     128   51235  36875  36401  58443  44166  44785  31628  24488  24758 L2
     256   70872  47353  45667  82647  58676  57787  46448  32315  32563
  Gain      1.94   1.74   0.90   2.15   2.10   1.13   1.79   1.83   1.03
    2048   96621  58092  56214 105074  75938  74497  57895  43166  42881 L3
    4096   87122  60230  57450 108329  79312  76890  60543  44350  44679
  Gain      3.59   2.64   1.87   4.09   3.37   2.50   3.92   3.17   2.55
   65536   24868  25137  24941  24889  25022  25066  12683  12623  12598 RAM
  131072   25625  25696  25301  25566  25606  25593  12904  12729  12793
  Gain      1.68   1.76   1.73   1.68   1.74   1.74   1.73   1.68   1.73

 Max GFLOPS 12.1   15.1

       Memory Reading Speed Test 64 Bit AVX OMP v4.1 by Roy Longbottom
                          Gain over  AVX No OpenMP

       8    4584   4964   5000   5239   5083   5057   2647   2622   2626 L1
      16    9056   9413   9449  10368   9793   9667   5236   5068   5068
  Gain      0.11   0.12   0.12   0.13   0.12   0.12   0.10   0.10   0.10
     128   51222  34104  37139  59422  45145  46240  31286  24311  24595 L2
     256   65935  47007  45285  84331  58294  56615  46805  32592  32531
  Gain      1.36   1.04   0.94   1.63   1.18   1.18   1.43   1.04   1.04
    2048   73879  57775  56074 106129  76487  75147  59042  43105  42679 L3
    4096  100558  59738  57252 108882  79594  77143  61164  44567  44514
  Gain      2.90   2.00   1.87   3.51   2.55   2.51   3.50   2.55   2.54
   65536   25198  25230  24848  25202  25191  25191  12691  12632  12587
  131072   25621  25685  25370  25679  25604  25682  12956  12874  12859 RAM
  Gain      1.71   1.70   1.71   1.74   1.74   1.73   1.74   1.73   1.73

 Max GFLOPS 12.6   14.9
  

To Start

MultiThread MemSpeed - MPmemspeed64AVX, MPmemspeed64

This benchmark carries out the same calculations as memory_speed64, but uses Pthreads to execute the functions with multiple threads. In this case, all threads share the same data arrays, but each uses a different segment. An input parameter specifies the number of threads, up to 64, an example command being MPmemspeed64AVX Threads 2. The default is the number of identified CPUs, or 8 for a quad core Core i7 4820K with Hyper-Threading. Results below include comparisons with the original version (MPmemspeed64), available in linux_multithreading_apps.tar.gz.
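The shared-array, separate-segment scheme can be sketched with Pthreads as follows. Names and sizes are illustrative, not the benchmark's own code; the point is that each thread writes only its own contiguous range, so no locking is needed.

```c
#include <pthread.h>

#define WORDS 102400
static double x[WORDS], y[WORDS];
static double s = 1.5;                      /* illustrative multiplier */

typedef struct { int first, last; } Segment;

/* Each thread runs the triad over its own segment of the shared arrays. */
static void *triad_segment(void *arg)
{
    Segment *seg = arg;
    for (int m = seg->first; m < seg->last; m++)
        x[m] = x[m] + s * y[m];
    return NULL;
}

void run_threads(int nthreads)
{
    pthread_t tid[64];
    Segment seg[64];
    int chunk = WORDS / nthreads;
    for (int t = 0; t < nthreads; t++) {
        seg[t].first = t * chunk;
        seg[t].last  = (t == nthreads - 1) ? WORDS : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, triad_segment, &seg[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```

Link with -lpthread. Segment boundaries falling on cache line multiples avoid false sharing between adjacent threads.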

Results using 8 threads start off clearly much faster than those using OpenMP. Maximum speeds are again with data larger than L1 cache, where OpenMP can be faster. Performance is virtually the same with data in RAM.

Of special note, the performance of single threaded MPmemspeed64AVX is often worse than the stand alone 64 bit MemSpeed. Full SIMD AVX instructions are implemented, but numerous extra instructions are used, such as shuffle, unpack and insert (4 vector multiplies, 4 vector adds and 80 other vector instructions), perhaps needed to cater for an unknown number of threads. At least, the multithreaded speeds can be four times those of a single threaded run, and twice as fast using RAM based data.


     MP Memory Reading Speed Test 64 Bit AVX Version 1.1 Using 8 Threads

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   37793  28722  51156  48994  52362  68206  35695  32938  36361 L1
       8   45769  32166  52015  76043  69787  61836  64106  57154  59198
      16   53369  37318  67629 126171 116852  72240  83390  81532  80529
      32   61317  39091 101048 190721 154129 202827 135056 124602 139214 L2
      65   61917  39098  97595 230050 216223 222696 156878 150286 137167
     131   64036  39272 110665 216488 221720 269585 149667 148116 148231
     262   67037  39317 114213 194473 193210 203301 115279 114005 118131 L3
     524   68566  39044 109261 178279 190967 180976 106867 109279 108726
    1048   68665  39866 112433 172563 153711 170144  98248  97462  96250
    2097   65163  37914 102215 115502 125560 127970  65827  69361  69114
    4194   62412  40752  94501 123285 110270 118960  63359  64405  64734
    8388   65018  37625  97453 109678 119597 111028  64608  65434  67057
   16777   28312  28192  30281  56647  59461  54978  26426  27921  15173 RAM
   33554   25865  24951  26054  31416  25794  26228  13986  12989  13306
   67108   25362  23514  25377  25089  25348  25514  12516  12674  12416
  134217   25286  23348  25355  24784  25420  24585  12082  12671  12475
  268435   23482  24032  24977  24292  24489  24812  12837  12637  12635
  536870   24650  23942  25309  25072  25428  25601  12789  12471  12756
 1073741   25124  24729  24310  24854  24576  25209  12612  12415  12524
 2147483   24921  24840  24774  24925  25069  24468  12790  12320  12408
 4294967   24939  24346  23925  25028  24669  25179  12275  12216  12436

 Max GFLOPS  8.6   10.2          14.4   27.7

     MP Memory Reading Speed Test 64 Bit AVX Version 1.1 Using 1 Threads

       8   16874  10136  29980  60061  59981  73657  39651  39127  39645 L1
      16   16901  10137  30260  61288  61171  78154  40569  40385  40608

     131   16891  10113  28484  48351  48134  47964  29242  29294  29285 L2
     262   16845  10094  27490  45215  44739  46562  27383  27371  27377

    2097   16725   9985  23943  30302  30323  30669  16994  16576  17163 L3
    4194   16549   9912  23767  30294  30231  30720  16977  16461  17136

  536870   11805   8651  14877  13817  13983  14446   7545   7081   7203 RAM
 1073741   12168   8818  14636  14692  14973  14163   7329   7393   7202

 Max GFLOPS  2.1    2.5           3.8    7.6

    MP Memory Reading Speed Test 64 Bit Version 1.1 Using 1 Threads

       8   30422  15420  27836  40569  20504  34929  19735   9939  19614 L1
      16   30754  15503  28069  41065  20688  35335  19977  10011  19895

     131   28955  15286  27122  34323  20476  30926  20086  10078  20069 L2
     262   28741  15287  27017  33579  20373  30875  19760  10060  19758

    2097   24424  15207  23963  26358  19342  25851  14665   9648  14824 L3
    4194   24408  14253  23951  26355  19366  25531  14655   9334  14821

  536870   14386  11824  14302  14704  13442  14732   7652   7426   7416 RAM
 1073741   14452  11468  14715  14861  13439  15189   7348   7828   7394

 Max GFLOPS  3.8    3.9           2.6    2.6

     MP Memory Reading Speed Test 64 Bit Version 1.1 Using 8 Threads


       8   52063  33134  52441  60254  36470  57610  45343  30815  45180 L1
      16   69122  44818  65876  82924  46075  69534  57760  36707  57120

     131  115996  53402 102715 140036  76202 116053  68671  35659  72891 L2
     262  113644  60777 104488 132590  81609 111435  72205  37232  72061

    2097   95433  58470  99412 115476  72032 109176  60839  36350  56557 L3
    4194   98608  57900 102912 105228  78041 106928  59517  36122  58749

  536870   25054  24707  24623  24592  25130  25430  12805  11899  11850 RAM
 1073741   25402  25735  24886  25412  25128  24711  12662  12367  12617

 Max GFLOPS 14.5   15.2           8.8   10.2
   

To Start

MP MFLOPS

MPmflops64 (original), MPmflops64AVX, openMPmflops64, openMPmflops64AVX, notOMPmflops64

The same calculations are carried out via three different procedures: using programmed threading with a parameter specifying 1 or more threads, then with OpenMP and, finally, stand alone. All are compiled for 64 bit working using SSE single precision floating point, with the MP versions also produced with the AVX option. Most were compiled using GCC 4.8.2, as even some SSE compilations are faster than those from earlier compilers, and are included in AVX_benchmarks.tar.gz. More details can be found in Linux Multithreading Benchmarks.htm and linux_openmp benchmarks.htm. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element, out of 0.1, 1 and 10 MB. On the Core i7 4820K, the first will use L2 cache, with L3 for the second and sometimes the last.
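The 8 operations per word form quoted above can be sketched as follows; the constants are illustrative, not the benchmark's own values.

```c
/* x[i] = (x[i] + a)*b - (x[i] + c)*d + (x[i] + e)*f is 3 adds,
   3 multiplies, 1 subtract and 1 add: 8 floating point operations
   per data word. MFLOPS = words * 8 * passes / (seconds * 1e6). */
void mflops8(int words, int passes, float x[])
{
    float a = 0.1f, b = 0.9f, c = 0.2f, d = 0.8f, e = 0.3f, f = 0.7f;
    for (int p = 0; p < passes; p++)
        for (int i = 0; i < words; i++)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

The 2 and 32 operation variants shorten or extend the same expression, changing the ratio of arithmetic to memory traffic per word.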

MP MFLOPS - Bearing in mind that the maximum SIMD speed using SSE instructions is 31.2 GFLOPS per core, or 124.8 with four, and twice these via AVX, measured performance is quite respectable, but needs eight threads to squeeze out the last drop, at up to 93.2 GFLOPS with SSE and 177.8 using AVX. Performance is more dependent on cache memory speed with two operations per word, and doubling the number of threads does not necessarily double the throughput.

Open MP - The default uses all CPU cores, but an extra run was carried out with an affinity setting to use one CPU core. Disassemblies showed that the compiled code had more save and load instructions than MP MFLOPS, leading to slower performance at 2 and 8 operations per word. At least, AVX tests could be twice as fast as those using SSE. At 32 operations per word, SIMD instructions were not used, producing much slower performance, with SSE and AVX tests running at the same speed.

SSE - Compiling the latter benchmark without the OpenMP and AVX directives produced virtually the same speeds as MP MFLOPS using a single thread.


  64 Bit MP AVX MFLOPS Benchmark 1, 8 Threads, Mon Dec  8 15:51:50 2014

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     102400     2    20000   0.067975    60258    0.620974   Yes
 Data in & out    1024000     2     2000   0.092400    44329    0.942935   Yes
 Data in & out   10240000     2      200   0.410527     9977    0.994032   Yes

 Data in & out     102400     8    20000   0.094583   173224    0.749971   Yes
 Data in & out    1024000     8     2000   0.107854   151909    0.965360   Yes
 Data in & out   10240000     8      200   0.408042    40153    0.996409   Yes

 Data in & out     102400    32    20000   0.378009   173372    0.498060   Yes
 Data in & out    1024000    32     2000   0.368530   177831    0.910573   Yes
 Data in & out   10240000    32      200   0.413231   158594    0.990447   Yes


               ----- MP MFLOPS 1 to 8 Threads -----          -------- OpenMP ---------
               ----- SSE -----      ----- AVX -----    SSE   --- SSE ---   --- AVX ---
 M 4B  Ops     1      4      8      1      4      8      1  aff1       8  aff1       8
 Words Word

 0.1    2   9681  45340  54621  12542  62273  60258   9918   6061  13742  10196  19577
 1.0    2   9759  21688  41832  11404  23031  44329   9688   6215  19477  10025  37906
 10.2   2   5990   9237  10026   5991   8970   9977   5870   5059   9137   5880   7782

 0.1    8  24533  49320  92086  35982 159040 173224  24448  13220  44104  26481  88370
 1.0    8  24570  49918  92352  36180  80096 151909  24465  13373  49499  27045  90579
 10.2   8  19975  36638  39982  23299  40124  40153  20055  12719  38369  20593  35607

 0.1   32  23269  46942  92408  46400  90572 173372  23251   5854  22858   5865  22845
 1.0   32  23307  89676  93282  46572  91058 177831  23265   5863  23234   5870  23141
 10.2  32  23052  91029  92050  44729  88877 158594  23063   5860  23127   5854  23077
  

To Start

SSE and AVX Instructions

Programs compiled for the 64 bit Operating Systems do not use old style i387 floating point instructions, but SSE varieties instead. These use 128 bit registers that can contain up to 4 single precision numbers (SSE) or 2 double precision (SSE2). The fastest mode of operation is Single Instruction Multiple Data (SIMD), where the same arithmetic instruction operates on all contained numbers at the same time. The slowest is with single data words (SISD), where, possibly due to an inefficient compiler, data cannot be organised for SIMD.

Originally, the maximum speed of SIMD, in Millions of FLoating point Operations Per Second (MFLOPS), was CPU MHz x 4 (x 2 with SSE2). Later, linked instructions, such as multiply and add, could produce 8 x MHz MFLOPS. Then, with multithreading, this can be multiplied by the number of CPU cores. AVX1 can double these speeds using 256 bit registers, and later AVX-512 doubles them again at 512 bits.

Below are examples of SSE and AVX single precision assembly code instructions. For double precision, the final s is replaced by d. Besides the benefit of larger registers, additional optimisation is possible with AVX instructions, as three registers can be specified.


  Example compile command for AVX and variation to produce assembly code.  

  gcc whets.c cpuidc64.o cpuida64.o -O3 -mavx -lrt -lc -lm -o whetAVX
  gcc whets.c cpuidc64.o cpuida64.o -O3 -mavx -lrt -lc -lm -S


  SSE - SISD                   SSE - SIMD                   SSE2 - SIMD DP

  addss    xmm1, xmm8          addps    xmm1, xmm10         addpd    xmm1, xmm10
  addss    xmm2, xmm4          addps    xmm2, xmm6          etc.
  addss    xmm0, xmm6          addps    xmm0, xmm8
  mulss    xmm1, xmm9          mulps    xmm1, xmm11
  mulss    xmm0, xmm7          mulps    xmm2, xmm7
  mulss    xmm2, xmm5          mulps    xmm0, xmm9
  subss    xmm2, xmm0          subps    xmm2, xmm0
  addss    xmm2, xmm1          addps    xmm2, xmm1


  AVX1 - SISD                  AVX1  - SIMD                 AVX1  - SIMD DP

  vaddss   ymm5, ymm6,  ymm13  vaddps   ymm5, ymm6,  ymm13  vaddpd   ymm5, ymm6,  ymm13
  vaddss   ymm3, ymm6,  ymm7   vaddps   ymm3, ymm6,  ymm7   etc.
  vaddss   ymm1, ymm6,  ymm6   vaddps   ymm1, ymm6,  ymm6
  vmulss   ymm4, ymm13, ymm13  vmulps   ymm4, ymm13, ymm13
  vmulss   ymm2, ymm7,  ymm7   vmulps   ymm2, ymm7,  ymm7
  vmulss   ymm0, ymm6,  ymm6   vmulps   ymm0, ymm6,  ymm6
  vsubss   ymm7, ymm13, ymm7   vsubps   ymm7, ymm13, ymm7
  vaddss   ymm6, ymm7,  ymm6   vaddps   ymm6, ymm7,  ymm6

  vaddps   ymm3, ymm6,  ymm7   means add ymm6 and ymm7 with result in ymm3
   

To Start


Roy Longbottom, December 2014

The Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection