General
My original Linux benchmarks, described in
Linux Benchmarks.htm,
were compiled by an earlier version of GCC 4, under Ubuntu 10.10. Ubuntu 14.04 has GCC 4.8.2 that can handle later Intel CPU instructions, including AVX1
(See SSE and AVX Instructions).
Benchmarks that could benefit from AVX are being recompiled and at least tested on a new 3.7 GHz Core i7 based PC.
The latter was used to provide the following comparisons of original and new benchmark results, via Ubuntu 14.04.
This CPU can run at 3.9 GHz using Turbo Boost and maximum speed in GFLOPS per core (4 available) is GHz x 4 (SSE single precision) x 2 (with multiply and add) or 31.2 GFLOPS and 62.4 using AVX1. Using double precision, the best possible scores are 15.6 and 31.2 GFLOPS respectively.
The system has four memory channels, with maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width) or 51.2 GB/second.
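The peak figures quoted above are simple products, and a minimal sketch of the arithmetic is given here. The function and parameter names are illustrative only and do not come from the benchmark sources.

```c
#include <stdio.h>

/* Peak GFLOPS per core = clock in GHz x SIMD words per register
   x operations per word per clock (2 when add and multiply overlap). */
double peak_gflops(double ghz, int simd_words, int ops_per_clock)
{
    return ghz * simd_words * ops_per_clock;
}

/* Peak memory bandwidth in GB/second = bus MHz x transfers per clock (2 for DDR)
   x channels x bus width in bytes, divided by 1000 to convert from MB/second. */
double peak_gb_per_sec(double bus_mhz, int transfers_per_clock,
                       int channels, int bus_bytes)
{
    return bus_mhz * transfers_per_clock * channels * bus_bytes / 1000.0;
}

/* Prints the figures used in the text for the 3.9 GHz Turbo Boost speed. */
void print_peaks(void)
{
    printf("SSE SP  %.1f GFLOPS\n", peak_gflops(3.9, 4, 2));
    printf("AVX SP  %.1f GFLOPS\n", peak_gflops(3.9, 8, 2));
    printf("SSE2 DP %.1f GFLOPS\n", peak_gflops(3.9, 2, 2));
    printf("AVX DP  %.1f GFLOPS\n", peak_gflops(3.9, 4, 2));
    printf("Memory  %.1f GB/second\n", peak_gb_per_sec(800.0, 2, 4, 8));
}
```

Calling print_peaks() from a main function should reproduce the 31.2, 62.4, 15.6, 31.2 GFLOPS and 51.2 GB/second values given above.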
When recompiled benchmarks produce significantly different results from the older ones, they are available in
AVX_benchmarks.tar.gz.
This also contains source codes with changes that enable error free compiling and correct execution.
Classic Benchmarks
These are Whetstone, Dhrystone, Linpack and Livermore Loops, described in more detail in
Classic.htm.
The original benchmarks and source code are available in
classic_benchmarks.tar.gz.
Whetstone Benchmark - Comprises 8 tests and an overall MWIPS (Whetstone MIPS) rating.
This compiled with numerous warning messages, such as “incompatible implicit declaration of built-in function sin [enabled by default]”, but the defaults appear to be correct. The data arrays used are too small to benefit from SIMD so, except for the maths functions, speed is virtually the same for all x86 and 64-bit compilations. The x64 maths functions are faster than the hard-wired ones used with x86. The last test just copies array data and is more sensitive to variations in compiled code.
Dhrystone Benchmark - There are two versions of this benchmark, the second being produced to minimise unwarranted optimisation. Non-optimised and optimised speeds are provided. Dhrystone 2 produced two unexpected compiler warning messages. These benchmarks make no use of SSE or AVX instructions, with all 64 bit version speeds being the same.
Linpack Benchmark - The original 64 bit version, with its SSE2 instructions for double precision, used SISD instructions. Although the new SSE2 version made use of SIMD, surprisingly, there was no speed gain, but SIMD AVX produced the fastest speed on the PC in question.
Livermore Loops Benchmark - The compiler indicated errors, necessitating changes to the data structure.
The benchmark has 24 double precision test loops. The original 64 bit Linux version, with SISD SSE2, produced slightly faster speeds than the 32 bit x86/87 version and similar to the recompilation. The AVX program made little difference. It is rather surprising that the new SSE2 and AVX benchmarks did not have SIMD implementations.
However, at least the first loop can achieve 9.4 GFLOPS with SSE2 and 13.6 GFLOPS using AVX, in a slightly different structure.
The benchmark counts the number of passes in a standard function that only takes any action when all passes have been run. The higher speeds were obtained by using an outer loop to control the pass count.
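As a rough illustration of the restructuring described above, the following sketch applies an outer pass loop to the first Livermore kernel (hydro fragment). The kernel expression follows the published loop 1, but the function name, argument list and pass control are assumptions and are not copied from the revised benchmark source.

```c
#include <stddef.h>

/* Livermore loop 1 (hydro fragment) with the pass count controlled by a
   plain outer loop instead of a call to a pass counting function.
   z must hold at least n + 11 elements. Five floating point operations
   (2 adds, 3 multiplies) are executed per element per pass. */
void hydro_passes(double *x, const double *y, const double *z, size_t n,
                  double q, double r, double t, long passes)
{
    for (long pass = 0; pass < passes; pass++) {
        for (size_t k = 0; k < n; k++) {
            x[k] = q + y[k] * (r * z[k + 10] + t * z[k + 11]);
        }
    }
}
```

With no function call inside the timed region, the compiler has a simple loop nest to work on, which may be why SIMD instructions were generated in this form.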
The new Linpack AVX benchmark and revised Livermore Loops benchmark C source code are included in
AVX_benchmarks.tar.gz.
Whetstone Benchmark Optimised
          MWIPS  MFLOP  MFLOP  MFLOP    COS    EXP  FIXPT     IF  EQUAL
                     1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
Old x86    3959   1331   1331    938     97   42.1   6516  10967   5851
Old x64    4880   1331   1324    977    129   64.2   6517  11657   1812
New x64    4891   1330   1323    977    120   64.5   6505  11638   3903
New AVX    4897   1325   1323    977    120   64.5   6515  11649   3909
          Dhrystone Benchmarks      Linpack Benchmark
          Dhry1  Dhry1  Dhry2  Dhry2
          NoOpt    Opt  NoOpt    Opt
            VAX    VAX    VAX    VAX  No Opt     Opt
           MIPS   MIPS   MIPS   MIPS  MFLOPS  MFLOPS
Old x86    7108  29277   7478  16356     988    2534
Old x64    8436  32659   8481  23607     900    3672
New x64    8441  32499   8381  24140     946    3631
New AVX    8441  32575   8395  23626     935    5413
Livermore Loops MFLOPS
LOOP
CPU          1     2     3     4     5     6     7     8     9    10    11    12
            13    14    15    16    17    18    19    20    21    22    23    24
Old x86   4327  3661  2622  2642   527  2250  4217  5549  5223  2511  1311  1279
           450  1036   730  2038  2479  2835   810   783  2820   419  2022   967
Old x64   4707  3434  2629  2657   565  2155  4592  6131  5442  2602  1314  1296
           937  1239  2288  2293  2392  3538   839   968  2792   939  2034  1720
New x64   4729  3422  2639  2657   565  2164  4599  5714  4984  2446  1310  1879
          1018  1267  2287  2012  2397  5343   836   969  3042   940  2011  1840
New AVX   4692  3488  2638  2654   564  2160  4471  5717  4978  2619  1308  1863
           978  1305  2285  2043  2492  6418   836   968  3069   938  2010  1558
Maximum CPU Speeds and CPUID - AVXid64
This benchmark follows those in WhatCPU results.htm, where various assembly code integer and floating point add instructions, using 1, 2, 3 and 4 registers, are executed in an attempt to demonstrate certain maximum speeds. The benchmark and source codes are included in
AVX_benchmarks.tar.gz.
The following results are for a 3.7 GHz Core i7 that appears to run at the Turbo Boost speed of 3.9 GHz. Here, maximum single precision speed is 8 x 3.9 or 31.2 GFLOPS, increasing to 62.4 GFLOPS through the hardware's ability to link associated add and multiply instructions, as in x=x+y*z.
##############################################
Assembler CPUID and RDTSC
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4
Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
Measured - Minimum 3711 MHz, Maximum 3711 MHz
Linux Functions
get_nprocs() - CPUs 8, Configured CPUs 8
get_phys_pages() and size - RAM Size 31.36 GB, Page Size 4096 Bytes
uname() - Linux, roy-i7UB14, 3.13.0-24-generic
#46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014, x86_64
##############################################
AVX1 ID and Speed Test 64 bit Version 1.0 Wed Nov 5 10:22:42 2014
Test GFLOPS Sumcheck
SP 1-2 Register add vr0+1 10.43 OK
SP 2-4 Register add vr0+1 vr2+3 20.86 OK
SP 4-8 Register add vr0+1 vr2+3 v4+5 vr6+7 30.92 OK
As SP 4-8 with add and multiply vrx+y*z 62.00 OK
DP 1-2 Register add vr0+1 5.21 OK
DP 2-4 Register add vr0+1 vr2+3 10.43 OK
DP 4-8 Register add vr0+1 vr2+3 v4+5 vr6+7 15.46 OK
As DP 4-8 with add and multiply vrx+y*z 31.00 OK
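The Linux Functions part of the listing above names get_nprocs(), get_phys_pages() and uname(). A minimal sketch of how such a report could be produced with glibc follows. The function name print_system_info is hypothetical, and using get_nprocs_conf() for the configured CPU count, sysconf() for the page size and 1024 x 1024 x 1024 bytes per GB are assumptions, as the benchmark source is not shown here.

```c
#include <stdio.h>
#include <sys/sysinfo.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Prints CPU count, RAM size and kernel identification in a layout
   similar to the benchmark listing. Returns 0 on success, -1 on failure. */
int print_system_info(void)
{
    struct utsname u;
    long pages = get_phys_pages();           /* number of physical pages */
    long page_size = sysconf(_SC_PAGESIZE);  /* bytes per page (assumed source) */

    if (pages < 0 || page_size < 0 || uname(&u) != 0) {
        return -1;
    }
    printf("get_nprocs() - CPUs %d, Configured CPUs %d\n",
           get_nprocs(), get_nprocs_conf());
    /* GB taken as 1024 x 1024 x 1024 bytes, which is an assumption */
    printf("get_phys_pages() and size - RAM Size %.2f GB, Page Size %ld Bytes\n",
           (double)pages * (double)page_size / (1024.0 * 1024.0 * 1024.0),
           page_size);
    printf("uname() - %s, %s, %s\n", u.sysname, u.nodename, u.release);
    printf("%s, %s\n", u.version, u.machine);
    return 0;
}
```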
OpenMP & MemSpeed - memory_speed64AVX, memory_speed64AVXOMP
This is a variation of my
MemSpeed benchmark,
using the calculations shown below, the first floating point tests being the same as the performance dependent code in the Linpack benchmark, but covering caches and RAM.
Calculations use double precision (DP) and single precision (SP) floating point, then with integer numbers.
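The three calculations named in the results headings can be sketched as below for the double precision case. The function names are illustrative, not those of the benchmark. The conversion from MB/second to GFLOPS is inferred from the Max GFLOPS lines in the results (it appears consistent with both arrays being counted as data read), so should be treated as an assumption.

```c
#include <stddef.h>

/* x[m] = x[m] + s * y[m], the Linpack style multiply and add */
void triad_dp(double *x, const double *y, double s, size_t n)
{
    for (size_t m = 0; m < n; m++) {
        x[m] = x[m] + s * y[m];
    }
}

/* x[m] = x[m] + y[m] */
void add_dp(double *x, const double *y, size_t n)
{
    for (size_t m = 0; m < n; m++) {
        x[m] = x[m] + y[m];
    }
}

/* x[m] = y[m] */
void copy_dp(double *x, const double *y, size_t n)
{
    for (size_t m = 0; m < n; m++) {
        x[m] = y[m];
    }
}

/* Assumed relation for the multiply and add test: two words are read and
   two operations executed per element, so GFLOPS = MB/second divided by
   bytes per word, divided by 1000. */
double triad_gflops(double mb_per_sec, int word_bytes)
{
    return mb_per_sec / word_bytes / 1000.0;
}
```

For example, triad_gflops(39859.0, 8) gives just under 5.0, in line with the maximum double precision figure reported for the SSE2 version.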
The same program was compiled without and with OpenMP directives. Both benchmarks are also available in standard 64 bit format (SSE/SSE2) which can be found in
linux_openmp.tar.gz.
Results and comparisons are shown below.
Full results for the older SSE/SSE2 version are shown first, where DP/SP speeds are a long way from the possible 15.6/31.2 GFLOPS. Then there is a summary of AVX version results, where floating point gains of around 1.5 times (DP) and 2.5 times (SP) are produced using L1 cache, with little or no difference for the integer calculations. DP/SP gains are smaller using the other caches, and speeds are virtually the same using RAM.
The next results are for the SSE/SSE2 benchmark using OpenMP. Here, performance is worse via L1 cache, due to overheads and, probably, cache flushing when accessing a shared data array. The best improvements, and highest GFLOPS, are with L3 cache data sizes.
The benchmark shows that multiple cores need to be used to make use of the high memory bus bandwidth, where the 25 GB/second MP speed is quite respectable, out of the maximum specification of 51.2 GB/second.
The final results are for AVX with OpenMP, where the compiler fails to implement AVX instructions, and code is essentially the same as the old SSE/SSE2 version. However, there are some performance gains on using AVX functions.
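The actual OpenMP directives used in the benchmark are not reproduced here. As a minimal sketch, assuming a simple parallel for directive on the multiply and add loop, the OpenMP variation might look like this (the function name is illustrative):

```c
#include <stddef.h>

/* x[m] = x[m] + s * y[m] with the loop iterations shared between threads.
   Compile with gcc -O3 -fopenmp, or without -fopenmp for a serial version,
   where the pragma is ignored. */
void triad_dp_omp(double *x, const double *y, double s, long n)
{
    long m;
    #pragma omp parallel for
    for (m = 0; m < n; m++) {
        x[m] = x[m] + s * y[m];
    }
}
```

Each call starts and synchronises a team of threads, a fixed cost that is large relative to the work when only a few KB of data are processed, which is consistent with the poor L1 cache results.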
#####################################################
Memory Reading Speed Test 64 Bit Version 4.1 by Roy Longbottom
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 39311 24057 52483 40345 24058 52352 28687 15957 29066 L1
8 39076 24566 57022 40079 24470 57001 30005 15794 30071
16 39851 24795 59688 40773 24768 59685 30605 15683 30691
32 39859 24862 60824 41216 24876 61083 28148 15675 30978
64 32844 24462 47825 34369 24707 47838 23819 15646 29441 L2
128 32879 24498 48223 34308 24841 48325 23978 15603 29659
256 30516 23886 43374 31823 24290 43355 20623 15412 26554
512 25604 22420 30617 26141 22961 30617 15299 13893 17247 L3
1024 25565 22368 30352 26103 22992 30275 15124 13823 17145
2048 25589 22479 30344 26056 23017 30339 15120 13793 17155
4096 25600 22405 30332 26136 23025 30249 15122 13829 17159
8192 25593 22460 30297 26025 22997 30299 15110 13832 17160
16384 15083 14415 14745 15085 14690 14752 7484 7601 7464 RAM
32768 14845 14293 14391 14840 14313 14382 7331 7480 7330
65536 14959 14424 14498 14961 14466 14490 7387 7518 7343
131072 15041 14492 14607 15048 14592 14608 7416 7550 7371
262144 15023 14491 14598 15017 14595 14601 7406 7523 7377
524288 15053 14520 14645 15096 14666 14659 7424 7570 7395
1048576 15085 14534 14659 15093 14675 14650 7432 7565 7396
2097152 15096 14538 14670 15109 14687 14649 7433 7573 7401
4194304 15096 14544 14665 15108 14684 14673 7434 7570 7402
Max GFLOPS 5.0 6.2
Memory Reading Speed Test 64 Bit AVX v4.1 by Roy Longbottom
8 61747 57965 57139 60342 60007 60148 39695 39314 39280 L1
16 62152 59667 59718 61363 61268 61332 40675 40425 40426
Gain 1.57 2.38 1.00 1.51 2.46 1.04 1.33 2.53 1.31
128 47554 41906 47884 48347 48245 47791 29682 29561 29698 L2
256 39989 36923 40809 41397 41077 40806 26011 25996 25973
Gain 1.38 1.63 0.97 1.36 1.82 0.97 1.25 1.79 0.99
2048 30093 29338 30337 30667 30641 30339 17175 17173 17171 L3
4096 30083 29362 30301 30654 30650 30313 17183 17186 17185
Gain 1.18 1.31 1.00 1.17 1.33 1.00 1.14 1.24 1.00
65536 14807 14959 14656 14590 14594 14656 7361 7352 7350 RAM
131072 14857 15026 14654 14612 14621 14666 7392 7381 7377
Gain 0.99 1.04 1.01 0.97 1.01 1.01 1.00 0.98 1.00
Max GFLOPS 7.8 14.9
Memory Reading Speed Test 64 Bit OPenMP v4.1 by Roy Longbottom
Gain over No OpenMP
8 5058 4962 5001 5184 5038 5005 2665 2604 2609 L1
16 9662 9412 9612 10322 9790 9637 5234 5045 5060
Gain 0.19 0.29 0.12 0.19 0.30 0.12 0.13 0.24 0.13
128 51235 36875 36401 58443 44166 44785 31628 24488 24758 L2
256 70872 47353 45667 82647 58676 57787 46448 32315 32563
Gain 1.94 1.74 0.90 2.15 2.10 1.13 1.79 1.83 1.03
2048 96621 58092 56214 105074 75938 74497 57895 43166 42881 L3
4096 87122 60230 57450 108329 79312 76890 60543 44350 44679
Gain 3.59 2.64 1.87 4.09 3.37 2.50 3.92 3.17 2.55
65536 24868 25137 24941 24889 25022 25066 12683 12623 12598 RAM
131072 25625 25696 25301 25566 25606 25593 12904 12729 12793
Gain 1.68 1.76 1.73 1.68 1.74 1.74 1.73 1.68 1.73
Max GFLOPS 12.1 15.1
Memory Reading Speed Test 64 Bit AVX OMP v4.1 by Roy Longbottom
Gain over AVX No OpenMP
8 4584 4964 5000 5239 5083 5057 2647 2622 2626 L1
16 9056 9413 9449 10368 9793 9667 5236 5068 5068
Gain 0.11 0.12 0.12 0.13 0.12 0.12 0.10 0.10 0.10
128 51222 34104 37139 59422 45145 46240 31286 24311 24595 L2
256 65935 47007 45285 84331 58294 56615 46805 32592 32531
Gain 1.36 1.04 0.94 1.63 1.18 1.18 1.43 1.04 1.04
2048 73879 57775 56074 106129 76487 75147 59042 43105 42679 L3
4096 100558 59738 57252 108882 79594 77143 61164 44567 44514
Gain 2.90 2.00 1.87 3.51 2.55 2.51 3.50 2.55 2.54
65536 25198 25230 24848 25202 25191 25191 12691 12632 12587 RAM
131072 25621 25685 25370 25679 25604 25682 12956 12874 12859
Gain 1.71 1.70 1.71 1.74 1.74 1.73 1.74 1.73 1.73
Max GFLOPS 12.6 14.9
MultiThread MemSpeed - MPmemspeed64AVX, MPmemspeed64
This benchmark carries out the same calculations as memory_speed64, but uses Pthreads to execute the functions using multiple threads. In this case, all threads share the same data arrays, but each uses a different segment. There is an input parameter that specifies the number of threads, up to 64, an example of the command being MPmemspeed64AVX Threads 2. Default is the number of identified CPUs, or 8 with a quad Core i7 4820K with hyperthreading.
Results below include comparisons with the original version (MPmemspeed64), available in
linux_multithreading_apps.tar.gz.
With the smallest data sizes, results using 8 threads are clearly much faster than those using OpenMP. Maximum speeds are again with data greater than L1 cache size, when OpenMP can be faster. The performance is virtually the same with data in RAM.
Of special note, performance of single threaded MPmemspeed64AVX is often worse than the stand alone 64 bit MemSpeed. Full SIMD AVX instructions are implemented, but numerous extra instructions are used, such as shuffle, unpack and insert (4 vector multiply, 4 vector add, 80 other vector instructions), possibly needed to allow for an unknown number of threads.
At least, the multithreaded speeds can be four times those of a single threaded run, and twice as fast using RAM based data.
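The arrangement described above, where all threads share the same arrays but each works on its own segment, can be sketched with Pthreads as follows. The structure, function names and the way segments are divided are assumptions for illustration and are not taken from the MPmemspeed64 source. Link with -lpthread on older systems.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64   /* matches the maximum thread count quoted in the text */

/* Work description for one thread: shared arrays plus its own index range */
typedef struct {
    double *x;
    const double *y;
    double s;
    size_t start;
    size_t end;
} segment_t;

/* Thread function: x[m] = x[m] + s * y[m] over one segment only */
static void *triad_segment(void *arg)
{
    segment_t *seg = arg;
    for (size_t m = seg->start; m < seg->end; m++) {
        seg->x[m] = seg->x[m] + seg->s * seg->y[m];
    }
    return NULL;
}

/* Splits n elements into roughly equal segments, one per thread.
   Returns 0 on success, -1 on a bad thread count or thread creation failure. */
int triad_threads(double *x, const double *y, double s, size_t n, int threads)
{
    pthread_t tid[MAX_THREADS];
    segment_t seg[MAX_THREADS];
    int started = 0;
    int rc = 0;

    if (threads < 1 || threads > MAX_THREADS) {
        return -1;
    }
    for (int t = 0; t < threads; t++) {
        seg[t].x = x;
        seg[t].y = y;
        seg[t].s = s;
        seg[t].start = n * (size_t)t / (size_t)threads;
        seg[t].end = n * (size_t)(t + 1) / (size_t)threads;
        if (pthread_create(&tid[t], NULL, triad_segment, &seg[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++) {
        pthread_join(tid[t], NULL);   /* wait for every started thread */
    }
    return rc;
}
```

Because the segment bounds are only known at run time, a compiler may generate extra code to deal with arbitrary start positions and lengths, which might be related to the additional vector instructions noted above.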
MP Memory Reading Speed Test 64 Bit AVX Version 1.1 Using 8 Threads
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 37793 28722 51156 48994 52362 68206 35695 32938 36361 L1
8 45769 32166 52015 76043 69787 61836 64106 57154 59198
16 53369 37318 67629 126171 116852 72240 83390 81532 80529
32 61317 39091 101048 190721 154129 202827 135056 124602 139214 L2
65 61917 39098 97595 230050 216223 222696 156878 150286 137167
131 64036 39272 110665 216488 221720 269585 149667 148116 148231
262 67037 39317 114213 194473 193210 203301 115279 114005 118131 L3
524 68566 39044 109261 178279 190967 180976 106867 109279 108726
1048 68665 39866 112433 172563 153711 170144 98248 97462 96250
2097 65163 37914 102215 115502 125560 127970 65827 69361 69114
4194 62412 40752 94501 123285 110270 118960 63359 64405 64734
8388 65018 37625 97453 109678 119597 111028 64608 65434 67057
16777 28312 28192 30281 56647 59461 54978 26426 27921 15173 RAM
33554 25865 24951 26054 31416 25794 26228 13986 12989 13306
67108 25362 23514 25377 25089 25348 25514 12516 12674 12416
134217 25286 23348 25355 24784 25420 24585 12082 12671 12475
268435 23482 24032 24977 24292 24489 24812 12837 12637 12635
536870 24650 23942 25309 25072 25428 25601 12789 12471 12756
1073741 25124 24729 24310 24854 24576 25209 12612 12415 12524
2147483 24921 24840 24774 24925 25069 24468 12790 12320 12408
4294967 24939 24346 23925 25028 24669 25179 12275 12216 12436
Max GFLOPS 8.6 10.2 14.4 27.7
MP Memory Reading Speed Test 64 Bit AVX Version 1.1 Using 1 Threads
8 16874 10136 29980 60061 59981 73657 39651 39127 39645 L1
16 16901 10137 30260 61288 61171 78154 40569 40385 40608
131 16891 10113 28484 48351 48134 47964 29242 29294 29285 L2
262 16845 10094 27490 45215 44739 46562 27383 27371 27377
2097 16725 9985 23943 30302 30323 30669 16994 16576 17163 L3
4194 16549 9912 23767 30294 30231 30720 16977 16461 17136
536870 11805 8651 14877 13817 13983 14446 7545 7081 7203 RAM
1073741 12168 8818 14636 14692 14973 14163 7329 7393 7202
Max GFLOPS 2.1 2.5 3.8 7.6
MP Memory Reading Speed Test 64 Bit Version 1.1 Using 1 Threads
8 30422 15420 27836 40569 20504 34929 19735 9939 19614 L1
16 30754 15503 28069 41065 20688 35335 19977 10011 19895
131 28955 15286 27122 34323 20476 30926 20086 10078 20069 L2
262 28741 15287 27017 33579 20373 30875 19760 10060 19758
2097 24424 15207 23963 26358 19342 25851 14665 9648 14824 L3
4194 24408 14253 23951 26355 19366 25531 14655 9334 14821
536870 14386 11824 14302 14704 13442 14732 7652 7426 7416 RAM
1073741 14452 11468 14715 14861 13439 15189 7348 7828 7394
Max GFLOPS 3.8 3.9 2.6 2.6
MP Memory Reading Speed Test 64 Bit Version 1.1 Using 8 Threads
8 52063 33134 52441 60254 36470 57610 45343 30815 45180 L1
16 69122 44818 65876 82924 46075 69534 57760 36707 57120
131 115996 53402 102715 140036 76202 116053 68671 35659 72891 L2
262 113644 60777 104488 132590 81609 111435 72205 37232 72061
2097 95433 58470 99412 115476 72032 109176 60839 36350 56557 L3
4194 98608 57900 102912 105228 78041 106928 59517 36122 58749
536870 25054 24707 24623 24592 25130 25430 12805 11899 11850 RAM
1073741 25402 25735 24886 25412 25128 24711 12662 12367 12617
Max GFLOPS 14.5 15.2 8.8 10.2
MP MFLOPS
MPmflops64 (original), MPmflops64AVX, openMPmflops64, openMPmflops64AVX, notOMPmflops64
The same calculations are carried out via three different procedures, using programmed threading with a parameter to use 1 or more threads, then with OpenMP and, finally, stand alone. All are compiled for 64 bit working using SSE single precision floating point, with the MP versions also produced with the AVX option. Most were compiled using GCC 4.8.2, as even some SSE compilations are faster than those from earlier compilers, and are included in
AVX_benchmarks.tar.gz. More details can be found in
Linux Multithreading Benchmarks.htm
and
linux_openmp benchmarks.htm
Arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f (the 8 operation case), with 2, 8 or 32 adds or subtracts and multiplies on each data element, using 0.1, 1 and 10 million 4 byte words (about 0.4, 4 and 41 MB).
On the Core i7 4820K, the first will use L2 cache, with L3 for the second and sometimes the last.
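A minimal single threaded sketch of the 8 operations per word calculation, and of how an MFLOPS figure can be derived from the reported word count, operations per word, repeat passes and seconds, is shown here. The function names are illustrative and the MFLOPS formula is an assumption that appears to agree with the result listings.

```c
#include <stddef.h>

/* 8 operations per word: 3 adds and 3 multiplies, then 1 subtract and 1 add */
void ops8(float *x, size_t n, float a, float b, float c,
          float d, float e, float f)
{
    for (size_t i = 0; i < n; i++) {
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
    }
}

/* MFLOPS = words x operations per word x repeat passes / seconds / 1,000,000 */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For example, mflops(102400, 2, 20000, 0.067975) gives a little over 60250, close to the figure reported for the first AVX test using 8 threads.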
MP MFLOPS - Bearing in mind that maximum SIMD speed using SSE instructions is 31.2 GFLOPS per core or 124.8 with four, and twice these via AVX, measured performance is quite respectable, but needs eight threads to squeeze the last drop, at up to 93.2 GFLOPS with SSE and 177.8 using AVX. Performance is more cache memory speed dependent with two operations per word and doubling the number of threads does not necessarily double the throughput.
OpenMP - The default uses all CPU cores but an extra run was carried out with an affinity setting to use one CPU core. Disassemblies showed that the compiled code had more saving and loading instructions than MP MFLOPS, leading to slower performance for 2 and 8 operations per word.
At least, AVX tests could be twice as fast as the ones using SSE. At 32 operations per word, SIMD instructions were not used, producing much slower performance, with SSE and AVX tests running at the same speed.
SSE - Compiling the latter benchmark, without the OpenMP and AVX directives, produced virtually the same speeds as MP MFLOPS using a single thread.
64 Bit MP AVX MFLOPS Benchmark 1, 8 Threads, Mon Dec 8 15:51:50 2014
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 2 20000 0.067975 60258 0.620974 Yes
Data in & out 1024000 2 2000 0.092400 44329 0.942935 Yes
Data in & out 10240000 2 200 0.410527 9977 0.994032 Yes
Data in & out 102400 8 20000 0.094583 173224 0.749971 Yes
Data in & out 1024000 8 2000 0.107854 151909 0.965360 Yes
Data in & out 10240000 8 200 0.408042 40153 0.996409 Yes
Data in & out 102400 32 20000 0.378009 173372 0.498060 Yes
Data in & out 1024000 32 2000 0.368530 177831 0.910573 Yes
Data in & out 10240000 32 200 0.413231 158594 0.990447 Yes
            ------- MP MFLOPS 1 to 8 Threads -------         --------- OpenMP ---------
            ------- SSE -------  ------- AVX -------    SSE  --- SSE ----  --- AVX ----
 M 4B  Ops      1      4      8      1      4      8      1   aff1      8   aff1      8
Words Word
  0.1    2   9681  45340  54621  12542  62273  60258   9918   6061  13742  10196  19577
  1.0    2   9759  21688  41832  11404  23031  44329   9688   6215  19477  10025  37906
 10.2    2   5990   9237  10026   5991   8970   9977   5870   5059   9137   5880   7782
  0.1    8  24533  49320  92086  35982 159040 173224  24448  13220  44104  26481  88370
  1.0    8  24570  49918  92352  36180  80096 151909  24465  13373  49499  27045  90579
 10.2    8  19975  36638  39982  23299  40124  40153  20055  12719  38369  20593  35607
  0.1   32  23269  46942  92408  46400  90572 173372  23251   5854  22858   5865  22845
  1.0   32  23307  89676  93282  46572  91058 177831  23265   5863  23234   5870  23141
 10.2   32  23052  91029  92050  44729  88877 158594  23063   5860  23127   5854  23077
SSE and AVX Instructions
64-bit compilations do not normally use old style i387 floating point instructions, using SSE varieties instead. These use 128 bit registers that can contain up to 4 single precision numbers (SSE) or 2 at double precision (SSE2). The fastest mode of operation is Single Instruction Multiple Data (SIMD), where the same arithmetic instruction can operate on all contained numbers at the same time. The slowest is with single data words (SISD), where, possibly due to an inefficient compiler, data cannot be organised to use SIMD.
Originally, maximum speed of SIMD, in Millions of FLoating point Operations Per Second (MFLOPS), was CPU MHz x 4 (x 2 with SSE2). Later, a multiply and an add could be linked together, to produce 8 x MHz MFLOPS. Then, with multithreading, this can be multiplied by the number of CPU cores. AVX1 can double these speeds using 256 bit registers, with a further doubling available from the later AVX-512, using 512 bit registers.
Below are examples of SSE and AVX single precision assembly code instructions. For double precision, the last s is replaced by d. Besides providing the benefit of larger registers, additional optimisation is possible using AVX instructions, as three registers can be used.
Example compile command for AVX and variation to produce assembly code.
gcc whets.c cpuidc64.o cpuida64.o -O3 -mavx -lrt -lc -lm -o whetAVX
gcc whets.c cpuidc64.o cpuida64.o -O3 -mavx -lrt -lc -lm -S
SSE - SISD                  SSE - SIMD                  SSE2 - SIMD DP
addss xmm1, xmm8            addps xmm1, xmm10           addpd xmm1, xmm10
addss xmm2, xmm4            addps xmm2, xmm6            etc.
addss xmm0, xmm6            addps xmm0, xmm8
mulss xmm1, xmm9            mulps xmm1, xmm11
mulss xmm0, xmm7            mulps xmm2, xmm7
mulss xmm2, xmm5            mulps xmm0, xmm9
subss xmm2, xmm0            subps xmm2, xmm0
addss xmm2, xmm1            addps xmm2, xmm1
AVX1 - SISD                 AVX1 - SIMD                 AVX1 - SIMD DP
vaddss ymm5, ymm6, ymm13    vaddps ymm5, ymm6, ymm13    vaddpd ymm5, ymm6, ymm13
vaddss ymm3, ymm6, ymm7     vaddps ymm3, ymm6, ymm7     etc.
vaddss ymm1, ymm6, ymm6     vaddps ymm1, ymm6, ymm6
vmulss ymm4, ymm13, ymm13   vmulps ymm4, ymm13, ymm13
vmulss ymm2, ymm7, ymm7     vmulps ymm2, ymm7, ymm7
vmulss ymm0, ymm6, ymm6     vmulps ymm0, ymm6, ymm6
vsubss ymm7, ymm13, ymm7    vsubps ymm7, ymm13, ymm7
vaddss ymm6, ymm7, ymm6     vaddps ymm6, ymm7, ymm6
vaddps ymm3, ymm6, ymm7 means add ymm6 and ymm7 with result in ymm3
Roy Longbottom December 2014
The Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection