Linux OpenMP Benchmark Results
General
OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the C/C++ compiler included in the Linux Ubuntu Distribution.
In using OpenMP functions, I assumed that they would work in the same way as a vectorizing compiler, as used on supercomputers. The most popular program used to demonstrate supercomputer performance is the Linpack Benchmark; the scalar version is included in this collection. Performance is governed by the following linked triad in a loop:
for(i=0; i < n; i++)
{
    dy[i] = dy[i] + da * dx[i];
}
All that is required to parallelise the code is the following and a -fopenmp parameter in the compile command. Without this parameter, the program compiles normally, with the #pragma being ignored.
#pragma omp parallel for
for(i=0; i < n; i++)
{
    dy[i] = dy[i] + da * dx[i];
}
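As a minimal sketch, the parallelised triad can be wrapped in a complete function. The name daxpy_omp and its argument list are illustrative assumptions and are not taken from the benchmark source.

```c
#include <stddef.h>

/* Linked triad dy[i] = dy[i] + da * dx[i], as in the loop above.
   daxpy_omp is a hypothetical name used only for this sketch.
   Build with "gcc -fopenmp"; without -fopenmp the pragma is ignored
   and the loop runs on one thread, producing the same results. */
void daxpy_omp(int n, double da, const double *dx, double *dy)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
    {
        dy[i] = dy[i] + da * dx[i];
    }
}
```

Each iteration touches only its own element of dy, so the iterations can be divided among threads without any locking.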
The benchmark execution files and source code, along with compile and run instructions, can be downloaded in linux_openmp.tar.gz.
MemSpeed
The MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers, via two data arrays:
Sum to register   r = r + x[m] * y[m] (Integer + y[m])
Sum to memory     x[m] = x[m] + y[m]
Memory to memory  x[m] = y[m]
MemSpd2K, the latest standard version, uses assembly code to execute the same instructions as the original MemSpeed benchmark. This special version for OpenMP is again all C code, with the first linked triad tests returning results to memory via:
Sum to memory x[m] = x[m] + r * y[m]
Memory tested doubles up from 4 KB to 25% of RAM size, to use all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For the linked triad tests, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision.
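The MB/second to MFLOPS conversion described above can be sketched as a small helper. The function name triad_mflops is a hypothetical one, and the reasoning in the comment (two words read and two operations per step) is an assumption that is consistent with the stated divisors of 4 and 8.

```c
/* Convert a linked triad result in MB/second to MFLOPS.
   Assumption for this sketch: each x[m] = x[m] + r * y[m] step reads
   two data words and performs two floating point operations, so there
   is one operation per word read. This gives MB/second divided by 4
   for single precision and by 8 for double precision. */
double triad_mflops(double mb_per_sec, int bytes_per_word)
{
    return mb_per_sec / (double)bytes_per_word;
}
```

For example, a double precision result of 23066 MB/second corresponds to about 2883 MFLOPS, and a single precision result of 17978 MB/second to about 4494 MFLOPS.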
Four MemSpeed execution files are provided with normal and OpenMP calculations compiled for 32-Bit and 64-Bit hardware -
memory_speed32, memory_speed32OMP, memory_speed64 and memory_speed64OMP.
Integer tests for the latter use 64 bit data words and SSE/SSE2 instructions are generated for floating point instead of the old x87 codes.
Note that the compiled code does not necessarily run SSE/SSE2 instructions efficiently. See 32-Bit and 64-Bit Differences and the later sections.
Example Log Files
Below are results produced from running on a Quad CPU Phenom processor using 64-Bit Ubuntu.
It can be seen that performance is influenced by cache sizes, in this case, 64 KB L1 cache and 512 KB L2 cache per CPU core and 6 MB shared L3 cache.
#####################################################
Assembler CPUID and RDTSC
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor
Measured - Minimum 3013 MHz, Maximum 3013 MHz
Linux Functions
get_nprocs() - CPUs 4, Configured CPUs 4
get_phys_pages() and size - RAM Size 7.81 GB, Page Size 4096 Bytes
uname() - Linux, roy-64Bit, 2.6.35-22-generic
#33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010, x86_64
#####################################################
Normal Compilation
Memory Reading Speed Test 64 Bit Version 4 by Roy Longbottom
Start of test Tue Dec 7 09:40:17 2010
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 21946 17235 35860 22906 18867 35859 22313 13661 18091
8 22605 17675 37302 23577 19332 37301 23305 13977 18738
16 22873 17868 37917 23834 19439 37916 23687 14150 18974
32 22998 17970 38233 23965 19526 38233 23892 14241 19124
64 23066 17978 38391 24029 19620 38390 23990 8021 19199
128 17491 15870 22707 17582 17096 22709 11432 7881 11307
256 17435 15738 22512 17313 16852 22512 11359 7968 11189
512 16530 15378 19598 16577 16264 19604 9923 9806 9856
1024 10120 10138 9962 10201 10196 9963 5000 4982 4974
2048 10145 10158 9970 10220 10215 9968 5004 4986 4974
4096 9859 9794 9751 9957 9939 9758 4887 4859 4879
8192 6593 6240 6769 6699 6577 6807 3459 3409 3449
16384 6310 6070 6430 6302 6280 6498 3209 3253 3265
32768 6378 6032 6484 6353 6341 6480 3267 3190 3265
65536 6340 6010 6462 6390 6287 6463 3252 3234 3246
131072 6350 6044 6484 6389 6308 6482 3257 3247 3289
262144 6390 6088 6478 6419 6323 6458 3224 3249 3304
524288 6403 6065 6475 6326 6374 6498 3224 3185 3275
1048576 6390 6032 6454 6457 6366 6373 3230 3212 3306
2097152 6376 6051 6465 6377 6329 6468 3239 3233 3279
End of test Tue Dec 7 09:40:58 2010
OpenMP Version
Memory Reading Speed Test 64 Bit Version 1 by Roy Longbottom
Start of test Sun Dec 5 12:26:36 2010
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 2413 2340 2426 2408 2371 2593 1301 1302 1306
8 4642 4379 4655 4739 4488 5045 2562 2478 2583
16 8321 7942 8513 9215 8412 9668 4989 4695 4982
32 15714 12698 15446 16397 14036 17359 9112 7963 9159
64 25533 18268 24526 26971 21394 28979 16033 12269 16032
128 36147 23064 34023 40018 28460 42871 23255 16389 23172
256 45821 26908 42782 21679 34353 57114 31501 20370 31889
512 46924 28555 46191 55514 35557 54808 33583 22754 33376
1024 45478 28681 45098 48798 34662 47103 25081 22172 24993
2048 36642 26993 36187 36523 32366 36917 18354 17985 18388
4096 30960 24342 30259 32057 26483 32862 17172 15049 17153
8192 22963 20257 22754 23462 21376 23910 12203 11223 12176
16384 8927 8774 8888 8947 8803 8951 4469 4454 4487
32768 8938 8817 8875 8963 3681 8964 4494 4465 4488
65536 8956 8863 8910 8959 8849 8981 4500 4474 4502
131072 8979 8918 8951 8830 8808 9022 4513 4494 4517
262144 8784 8657 8706 8760 8826 8919 4436 4422 4433
524288 8774 8478 8789 8732 8643 8864 4374 3703 4435
1048576 8664 8559 8617 8689 8612 8678 4368 4360 4336
2097152 8661 8631 8643 8611 8597 8692 4364 4368 4367
End of test Sun Dec 5 12:27:13 2010
64 Bit, 32 Bit and Windows Comparisons, Phenom
Normal Compilation - Following are results on the above 3 GHz Phenom based system, for the first three tests at 64 bits and 32 bits using Ubuntu, plus 32 bits via Windows 7 (see OpenMP Speeds.htm).
Using Ubuntu and looking at floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS), with 64 bit compilations, using SSE and SSE2, maximum double precision speed approaches one result per CPU clock cycle at 2883 MFLOPS but single precision comes out better at nearly 4500 MFLOPS. The maximum x87 speeds, at 32 bit working, are similar for both single and double precision at 2800 MFLOPS.
The integer code is not compiled as was expected (it is over optimised), but this does not matter for comparison purposes. Both 64 bit and 32 bit versions are translated to eleven integer instructions per eight data words. Therefore, speed in Millions of Instructions Per Second (MIPS) is up to 6600 MIPS at 64 bits, or 4000 MIPS at 32 bits.
The Windows based results are similar to those for Ubuntu, with data in L1 cache, but somewhat slower with data in L2 cache, L3 cache and RAM.
OpenMP Compilations - It was indeed a shock to see the pathetically slow speeds at the smaller data sizes when using four CPUs, confirmed by the performance monitor showing 100% utilisation on all. Analysis of results indicates that there is a startup delay involved, of around 1.7 microseconds in the case of the Ubuntu compilation/run. This is not as high as the 9 microseconds via Windows, but the latter appears to have a lower ongoing overhead. At 32 bits, maximum gains of the three measurements are 3.7, 2.3 and 2.0 times with Ubuntu, then 4.6, 4.3 and 4.4 times with Windows. Gains greater than four times are possible when L3 cache sized data is shared amongst four L2 caches.
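One rough way to arrive at startup delays of this order is to take the time for a single pass over the smallest (4 KB) data size in the OpenMP results. This is an illustrative reading of the tables, not necessarily the analysis actually used; the function name pass_microseconds is hypothetical, and MB/second is assumed to mean 10^6 bytes per second.

```c
/* Time in microseconds for one pass over kbytes of data at a measured
   speed of mb_per_sec. Assumes 1 KB = 1024 bytes and 1 MB/second =
   10^6 bytes per second, so bytes divided by MB/second gives microseconds. */
double pass_microseconds(double kbytes, double mb_per_sec)
{
    return (kbytes * 1024.0) / mb_per_sec;
}
```

With the 4 KB OpenMP figures of around 2400 MB/second for Ubuntu and around 430 MB/second for Windows, this gives roughly 1.7 and 9.4 microseconds per pass respectively, most of which would be overhead rather than calculation.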
Below are results from a PC with a 2.4 GHz Core 2 Duo processor. Here, maximum gains on using both CPUs are up to 1.6 times using Windows but, other than with data in RAM, Ubuntu is often slower using both cores.
By using the gcc -S parameter to provide an assembly listing, it was found that the 64 bit compilation, when not requesting OpenMP, resulted in some full SIMD SSE instructions, whereby four multiplies or adds can be executed per clock cycle. This leads to the relative MP comparisons being less than they could be for both types of processor.
64 Bit Ubuntu 32 Bit Ubuntu 32 Bit Windows
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+s*y[m] Int+
KBytes Dble Sngl Int Dble Sngl Int Dble Sngl Int
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 21946 17235 35860 21428 11040 11551 22924 11651 12725
8 22605 17675 37302 22078 11168 11555 23536 11839 13242
16 22873 17868 37917 22336 11231 11647 23834 11887 12790
32 22998 17970 38233 22460 11261 11684 23407 11902 13659
64 23066 17978 38391 22519 11275 11575 23669 11847 12910
128 17491 15870 22707 17075 9468 10026 14703 9926 10290
256 17435 15738 22512 16989 9431 9945 14644 9906 10175
512 16530 15378 19598 16597 9376 9847 14302 9921 10171
1024 10120 10138 9962 10092 8434 8270 8246 7091 7026
2048 10145 10158 9970 10109 8429 8269 8166 6976 7142
4096 9859 9794 9751 9855 8184 8045 8006 6898 6984
8192 6593 6240 6769 6531 5989 6267 4416 3983 4175
16384 6310 6070 6430 6274 5861 6148 4244 3888 3993
32768 6378 6032 6484 6180 5844 6119 4249 3885 3958
65536 6340 6010 6462 6169 5803 6095 4235 3892 3929
131072 6350 6044 6484 6112 5840 6124 4264 3894 3965
262144 6390 6088 6478 6158 5840 6104 4279 3870 3991
524288 6403 6065 6475 6085 5817 5980 4235 3873 3968
OpenMP Using 4 CPUs
4 2413 2340 2426 2387 2108 2060 418 436 439
8 4642 4379 4655 4469 3872 3698 874 862 866
16 8321 7942 8513 8194 6297 6122 1727 1713 1700
32 15714 12698 15446 13347 8921 8657 3341 3234 3263
64 25533 18268 24526 18662 11155 10740 6123 5792 5978
128 36147 23064 34023 23357 12784 12286 10822 9932 10085
256 45821 26908 42782 27111 13916 13245 17639 15485 16134
512 46924 28555 46191 27574 14148 13196 25742 22009 22123
1024 45478 28681 45098 28696 14428 13452 33657 27622 26572
2048 36642 26993 36187 29056 14493 5942 37554 30171 31756
4096 30960 24342 30259 26326 14192 12892 24280 22284 23117
8192 22963 20257 22754 23863 13865 12563 16476 13555 15907
16384 8927 8774 8888 8777 8709 8658 7394 7137 7077
32768 8938 8817 8875 8887 8700 8679 7387 6969 7184
65536 8956 8863 8910 8915 8716 8701 7486 7188 7240
131072 8979 8918 8951 8931 8719 8735 7462 7163 7249
262144 8784 8657 8706 8866 8591 8294 7578 7207 7280
524288 8774 8478 8789 8710 8197 8264 7652 7405 7344
Core 2 Duo Results
64 Bit Ubuntu 32 Bit Ubuntu 32 Bit Windows
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+s*y[m] Int+
KBytes Dble Sngl Int Dble Sngl Int Dble Sngl Int
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 17307 12410 18880 17333 8611 9470 18490 9185 9190
8 17230 12597 18835 17224 8698 9536 18631 9349 9490
16 17395 12701 19021 17396 8737 9569 18903 9467 9367
32 17481 12755 19104 17471 8751 9571 18739 9487 9440
64 11757 11813 12069 11731 7632 7993 11535 7631 7874
128 11769 11826 12099 11735 7629 7991 11626 7584 7890
256 11773 11834 12106 11747 7635 7997 11634 7686 7868
512 11777 11838 12109 11749 7638 8000 11632 7524 7897
1024 11779 11835 12111 11745 7617 7996 11605 7638 7929
2048 11663 11666 11998 11668 7530 7908 11408 7298 7743
4096 9505 9534 9724 8895 6843 7149 8626 7057 7242
8192 4496 4502 4449 4580 4580 4594 4287 4222 4305
16384 3922 3903 3884 4036 4052 4041 3690 3532 3678
32768 3888 3868 3847 3926 3902 3943 3284 3431 3539
65536 3894 3877 3852 3933 3938 3950 3572 3460 3550
131072 3851 3837 3805 3897 3898 3907 3485 3550 3528
262144 3867 3847 3823 3911 3912 3922 3504 3570 3400
524288 3879 3856 3829 3906 3908 3919 3650 3533 3440
OpenMP Using 2 CPUs
4 3753 3188 3728 2423 1953 1908 425 413 440
8 6063 4722 5937 4074 2903 2751 842 830 821
16 8992 6258 8678 6024 3758 3586 1681 1630 1607
32 11799 7391 11120 7942 4372 4204 2558 2640 2908
64 13713 8112 12689 8960 4773 4512 5148 4751 4700
128 15128 8636 13996 9975 4890 4622 7553 6765 6987
256 16109 8881 14739 10557 5028 4752 10645 8937 8613
512 16560 9034 15156 10837 5130 4820 12273 10469 11132
1024 16790 9101 15353 11024 5140 4844 13733 9631 11788
2048 16682 9092 15310 11057 5166 4856 15255 11028 12647
4096 8333 7509 8566 9226 5065 4781 13525 10211 11712
8192 4513 4560 4590 4722 4445 4306 4367 4318 3582
16384 4106 4081 4110 4103 4045 4016 3409 3718 3706
32768 3960 3951 3966 4077 4043 3998 2815 3017 3575
65536 3961 3956 3948 4105 4071 4055 3570 3602 3401
131072 3991 3996 3983 4027 4025 4011 3656 3466 3614
262144 3992 4016 3988 4066 4084 4034 3727 3564 3720
524288 4036 4018 4017 3939 4042 4030 3737 3637 3641
Results Using 1 to 4 CPUs
The OpenMP execution codes can be run with a parameter that dictates which CPU or CPUs are used, as with Windows Affinity options. With Ubuntu, an example to run on the first CPU is:
taskset 0x00000001 ./memory_speed64OMP
Then it is 0x00000003 for the first two cores and 0x00000007 for the first three. The graphs below show performance gains and losses of the 32 bit and 64 bit OpenMP compilations using single precision floating point variables.
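The taskset masks follow a simple bit pattern, with one bit set per CPU. A small sketch of the calculation, using a hypothetical helper name first_cpus_mask:

```c
/* Affinity mask selecting the first n CPUs, suitable as the taskset
   argument: 1 -> 0x1, 2 -> 0x3, 3 -> 0x7, 4 -> 0xf.
   This sketch is only valid for n from 1 to 31. */
unsigned int first_cpus_mask(int n)
{
    return (1u << n) - 1u;
}
```

Printing the result with a "0x%08x" format gives the same form as used in the taskset command above.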
[Graph: 32 Bit Single Precision Floating Point x[m]=x[m]+s*y[m]]
[Graph: 64 Bit Single Precision Floating Point x[m]=x[m]+s*y[m]]
Original OpenMP Benchmark
My first OpenMP performance tests executed the same instruction sequences as used in a CUDA Benchmark.
The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words.
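A minimal sketch of the eight operations per word case is shown here. The function name triad8 and the passing of the constants as arguments are assumptions for illustration; the benchmark source may organise this differently.

```c
/* Eight operations per input word: three adds inside the brackets,
   three multiplies, one subtract and one final add.
   triad8 is a hypothetical name used only for this sketch. */
void triad8(int n, float *x, float a, float b, float c,
            float d, float e, float f)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
    {
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
    }
}
```

The 2 and 32 operation variants would use shorter or longer expressions of the same form.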
The program checks for consistent numeric results, primarily to show that all calculations are carried out. These checks are quite sensitive: a change of one in the number of Repeat Passes changes 0.929475 to 0.929448. However, SSE and x87 calculations produce slightly different answers.
Four execution files are again provided, with normal and OpenMP calculations compiled for 32-Bit and 64-Bit hardware - openMPmflops32, openMPmflops64, notOMPmflops32 and notOMPmflops64. Below is an example of results on the 3 GHz Quad Core Phenom, showing a maximum speed of nearly 15 GFLOPS.
 32 Bit OpenMP MFLOPS Benchmark 1 Thu Dec  9 16:48:08 2010

 Via Ubuntu 32 Bit Compiler

 Test             4 Byte  Ops/  Repeat   Seconds  MFLOPS     First  All
                   Words  Word  Passes                     Results  Same

 Data in & out    100000     2    2500  0.086841    5758  0.929475  Yes
 Data in & out   1000000     2     250  0.074519    6710  0.992543  Yes
 Data in & out  10000000     2      25  0.163307    3062  0.999249  Yes
 Data in & out    100000     8    2500  0.149781   13353  0.957164  Yes
 Data in & out   1000000     8     250  0.147888   13524  0.995525  Yes
 Data in & out  10000000     8      25  0.176824   11311  0.999550  Yes
 Data in & out    100000    32    2500  0.537051   14896  0.890377  Yes
 Data in & out   1000000    32     250  0.534446   14969  0.988102  Yes
 Data in & out  10000000    32      25  0.537515   14883  0.998799  Yes
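The MFLOPS column appears to follow from words, operations per word, repeat passes and seconds. The relationship below is inferred from the printed figures rather than from the source code, and the function name benchmark_mflops is hypothetical.

```c
/* MFLOPS = total floating point operations / seconds / 10^6, where the
   total is words * operations per word * repeat passes.
   Inferred from the log figures, not taken from the benchmark source. */
double benchmark_mflops(double words, double ops_per_word,
                        double passes, double seconds)
{
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For the first line of the log, 100000 words at 2 operations per word and 2500 passes in 0.086841 seconds gives close to the 5758 MFLOPS shown.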
OpenMP MFLOPS Comparisons
Following are MFLOPS of normal compilations for a single CPU compared with OpenMP speeds for all processors, on a quad core Phenom and Core 2 Duo, at 32 bits and 64 bits. The 32 bit versions, using x87 instructions, behave in the same way as the Windows version, where quad core OpenMP throughput is up to nearly four times faster than the normal single CPU speeds. This is not the case at 64 bit working, using SSE instructions, where dual core OpenMP speeds can be much slower than the single CPU normal compilation, and four cores are required to obtain the same performance.
The second set of results show OpenMP speeds using 1, 2, 3 and 4 cores on the Phenom (using the taskset execution parameter - see above). Here, performance can be nearly proportional to the number of cores requested.
See below for an explanation of why the 64 bit normal compilation produces much higher speeds using one CPU.
                        3 GHz Phenom                   2.4 GHz Core 2 Duo
                   32 Bits         64 Bits         32 Bits         64 Bits
    Data  Ops/   1 CPU  4 CPUs   1 CPU  4 CPUs   1 CPU  2 CPUs   1 CPU  2 CPUs
   Words  Word    i387     OMP     SSE     OMP    i387     OMP     SSE     OMP

  100000     2    2439    5758    7624    5769    1602    2815    5556    2904
 1000000     2    2231    6710    4686    6674    1590    2280    4292    2780
10000000     2    1739    3062    2195    2944    1173    1233    1251    1258
  100000     8    3348   13353   14357   12126    3142    4665   13061    5212
 1000000     8    3195   13524   13376   12420    3129    4878   11591    5152
10000000     8    3080   11311    7473   10976    3067    4430    4997    4861
  100000    32    3881   14896   15336   13494    3316    5922   14486    6421
 1000000    32    3853   14969   15009   13540    3341    5932   14320    6411
10000000    32    3817   14883   14318   13450    3334    5842   13750    6396
                        3 GHz Phenom                      3 GHz Phenom
                        32 Bits i387                      64 Bits SSE
    Data  Ops/   1 CPU  2 CPUs  3 CPUs  4 CPUs   1 CPU  2 CPUs  3 CPUs  4 CPUs
   Words  Word     OMP     OMP     OMP     OMP     OMP     OMP     OMP     OMP

  100000     2    1903    3575    4678    5758    1974    3597    4770    5769
 1000000     2    1787    3588    5300    6710    1913    3843    5564    6674
10000000     2    1509    2490    2832    3062    1590    2566    2867    2944
  100000     8    3518    6963   10342   13353    3437    6835   10100   12126
 1000000     8    3453    6943   10367   13524    3375    6802   10121   12420
10000000     8    3308    6541    9522   11311    3219    6379    9395   10976
  100000    32    3794    7566   11310   14896    3552    7084   10601   13494
 1000000    32    3774    7554   11317   14969    3533    7079   10616   13540
10000000    32    3735    7465   11166   14883    3490    6970   10440   13450
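How close these results come to being proportional to the number of cores can be checked with a simple speedup and efficiency calculation. The function names here are hypothetical and used only for this sketch.

```c
/* Speedup of an n CPU run relative to the 1 CPU OpenMP run. */
double speedup(double mflops_n_cpus, double mflops_1_cpu)
{
    return mflops_n_cpus / mflops_1_cpu;
}

/* Parallel efficiency: speedup divided by the number of CPUs used,
   where 1.0 would mean perfectly proportional scaling. */
double efficiency(double mflops_n_cpus, double mflops_1_cpu, int n_cpus)
{
    return speedup(mflops_n_cpus, mflops_1_cpu) / (double)n_cpus;
}
```

For the 32 bit results with 100000 words at 32 operations per word, 14896 MFLOPS on four CPUs against 3794 on one gives a speedup of about 3.93, or an efficiency of about 98%.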
Netbook Hyperthreading MFLOPS
The 32 bit and 64 bit benchmarks were also run on a Netbook, having a 1.66 GHz Intel Atom N455 processor, both running via the 64 bit version of Ubuntu. This has a single core but includes Hyperthreading, where two processors are identified as available.
Floating point MFLOPS speed of the Atom is not very good compared with CPU MHz, but Hyperthreading can nearly double the speed. The processor supports SSE instructions, again not as efficiently as desktop CPUs, but this leads to best performance from the single thread 64 bit benchmark.
The first table below gives OpenMP MFLOPS benchmark results. The second gives one and two thread MemSpeed OpenMP speeds, converted to MFLOPS, plus those without using OpenMP. Single precision L1 cache results can be broadly compared with OpenMP MFLOPS at two operations per word.
                  64 Bit                 32 Bit
    Data     OMP    OMP    SSE      OMP    OMP   i387
   Words    1CPU   2CPU   1CPU     1CPU   2CPU   1CPU

 2 Ops/Wd
  100000     232    386    784      203    340    203
 1000000     227    383    657      201    339    201
10000000     228    391    662      201    352    202
 8 Ops/Wd
  100000     505    892   1811      294    547    303
 1000000     500    883   1736      293    547    302
10000000     498    900   1745      294    552    302
 32 Ops/Wd
  100000     506    963   1832      403    783    404
 1000000     505    965   1817      403    762    404
10000000     506    971   1819      403    784    403
At 64 bits, the only areas where OpenMP provides the best performance are double precision calculations with data in L2 cache and RAM. With 32 bit compilation, the same applies at both single and double precision.
                      64 Bit MFLOPS                        32 Bit MFLOPS
            OMP   OMP   OMP   OMP  SSE2   SSE    OMP   OMP   OMP   OMP  SSE2   SSE
           1CPU  1CPU  2CPU  2CPU  1CPU  1CPU   1CPU  1CPU  2CPU  2CPU  1CPU  1CPU
           Dble  Sngl  Dble  Sngl  Dble  Sngl   Dble  Sngl  Dble  Sngl  Dble  Sngl

 L1 Cache   106   150   114   235   262   745    108   157    89   165   243   246
            137   181   225   330   265   748    151   187   162   231   246   246
 L2 Cache   182   223   364   464   226   604    186   212   287   323   213   228
            171   214   379   468   226   606    187   206   296   330   213   228
 RAM        175   220   248   445   208   515    175   204   246   327   192   215
            175   221   254   445   208   513    176   205   247   328   196   217
Assembly Code
Below are disassemblies of the loops using 32 operations per data word, compiled for 64 bit working, where SSE instructions are produced. The normal compilation uses full SIMD instructions, where four adds, subtractions or multiplies (addps, subps or mulps) are executed at the same time, and the loop increment is 4 words or 16 bytes (addq $16). The OpenMP code produced uses Single Instruction Single Data (SISD) instructions (addss, subss or mulss) with a loop increment of 1 word (addq $4). The SISD sequence is also included along with the SIMD code, for the last 1 to 3 words, if required.
A 32 bit non-OpenMP version, compiled to use SSE instructions, was also produced (gcc -msse parameter). This also employed SIMD and generated the same high levels of performance as the 64 bit normal variety.
64 Bit Normal Compilation   64 Bit OpenMP Compilation

.L7:                        .L14:
movaps (%r8,%rdx), %xmm1    movss (%rdx), %xmm1
addl $1, %ecx               addl $1, %ecx
movaps %xmm1, %xmm0         movss 12(%rbx), %xmm0
movaps %xmm1, %xmm2         movss 20(%rbx), %xmm2
addps %xmm15, %xmm0         addss %xmm1, %xmm0
addps %xmm13, %xmm2         addss %xmm1, %xmm2
mulps %xmm14, %xmm0         mulss 16(%rbx), %xmm0
mulps %xmm12, %xmm2         mulss 24(%rbx), %xmm2
subps %xmm2, %xmm0          subss %xmm2, %xmm0
movaps %xmm1, %xmm2         movss 28(%rbx), %xmm2
addps %xmm11, %xmm2         addss %xmm1, %xmm2
mulps %xmm10, %xmm2         mulss 32(%rbx), %xmm2
addps %xmm2, %xmm0          addss %xmm2, %xmm0
movaps %xmm1, %xmm2         movss 36(%rbx), %xmm2
addps %xmm9, %xmm2          addss %xmm1, %xmm2
mulps %xmm8, %xmm2          mulss 40(%rbx), %xmm2
subps %xmm2, %xmm0          subss %xmm2, %xmm0
movaps %xmm1, %xmm2         movss 44(%rbx), %xmm2
addps %xmm7, %xmm2          addss %xmm1, %xmm2
mulps %xmm6, %xmm2          mulss 48(%rbx), %xmm2
addps %xmm2, %xmm0          addss %xmm2, %xmm0
movaps %xmm1, %xmm2         movss 52(%rbx), %xmm2
addps %xmm5, %xmm2          addss %xmm1, %xmm2
mulps %xmm4, %xmm2          mulss 56(%rbx), %xmm2
subps %xmm2, %xmm0          subss %xmm2, %xmm0
movaps %xmm1, %xmm2         movss 60(%rbx), %xmm2
addps %xmm3, %xmm2          addss %xmm1, %xmm2
mulps 40(%rsp), %xmm2       mulss 64(%rbx), %xmm2
addps %xmm2, %xmm0          addss %xmm2, %xmm0
movaps 24(%rsp), %xmm2      movss 68(%rbx), %xmm2
addps %xmm1, %xmm2          addss %xmm1, %xmm2
mulps 8(%rsp), %xmm2        mulss 72(%rbx), %xmm2
subps %xmm2, %xmm0          subss %xmm2, %xmm0
movaps -8(%rsp), %xmm2      movss 76(%rbx), %xmm2
addps %xmm1, %xmm2          addss %xmm1, %xmm2
mulps -24(%rsp), %xmm2      mulss 80(%rbx), %xmm2
addps %xmm2, %xmm0          addss %xmm2, %xmm0
movaps -40(%rsp), %xmm2     movss 84(%rbx), %xmm2
addps %xmm1, %xmm2          addss %xmm1, %xmm2
addps -72(%rsp), %xmm1      addss 92(%rbx), %xmm1
mulps -56(%rsp), %xmm2      mulss 88(%rbx), %xmm2
mulps -88(%rsp), %xmm1      mulss 96(%rbx), %xmm1
subps %xmm2, %xmm0          subss %xmm2, %xmm0
addps %xmm1, %xmm0          addss %xmm1, %xmm0
movaps %xmm0, (%r8,%rdx)    movss %xmm0, (%rdx)
addq $16, %rdx              addq $4, %rdx
cmpl %r9d, %ecx             cmpl %ecx, %eax
jb .L7                      jg .L14
Roy Longbottom March 2011
The new Internet Home for my PC Benchmarks is via the link Roy Longbottom's PC Benchmark Collection.