Linux MultiThreading Benchmarks
Summary
Six benchmarks are provided that can run using up to 64 concurrent threads, with versions compiled to run using 64 bit or 32 bit systems. Performance is mainly measured as Millions of Instructions Per Second (MIPS), Millions of Floating Point Operations Per Second (MFLOPS) or Millions of Bytes per Second (MB/S).
Simple Add Tests - execute 32 bit or 64 bit integer instructions and 128 bit SSE floating point functions via assembly language. These use simple add operations with little access to external data. Resultant performance is generally proportional to the number of CPU cores with some gains also identified when Hyperthreading is available. Each thread executes independent code.
Whetstone Benchmark - is the first general purpose benchmark that set industry standards of computer system performance, mainly dependent on floating point speed but with some independently timed integer test functions. Data used is generally contained in L1 cache with performance gains again proportional to the number of cores. Each thread again executes independent code.
MP MFLOPS Program - uses the same functions as my CUDA and OpenMP benchmarks, comprising routines with 2, 8 and 32 add or multiply floating point operations per data word, with data from higher level caches or RAM. The 64 bit version compiles using SSE floating point, where up to 6 MFLOPS per CPU MHz per core can be produced. The 32 bit program uses the much slower original 80387 FPU instructions. These programs can also be used as burn-in/reliability tests. Each thread executes the same functions but on a different segment of the data.
MP Memory Speed Tests - employ three sequences of operations, using double and single precision floating point numbers and integers, on data sized between 4 KB and 25% of RAM size. The operations are memory to memory transfers with 0, 1 and 2 arithmetic calculations. The 64 bit version again uses SSE functions but not as efficiently as MP MFLOPS. Again each thread has the same procedures using different segments of the data.
MP Memory Bus Speed Tests - read data at a range of sizes covering caches and RAM. Data is accessed with varying address increments to identify reading data in bursts over the bus and allow estimation of maximum bus/memory speed. This time, each thread reads all the data. The 64 bit version uses the double size 8 byte words, where data transfer speed can be twice that of the 32 bit compilation, demonstrating that 32 and 64 bit integer instructions can execute at the same speed.
MP Memory Random Access Speed Benchmark - comprises serial and random access read and read/write tests that cover cache and RAM data sizes. All threads access the same data but starting at different points. In this case, data could be corrupted with concurrent updates, but the Operating System appears to flush caches to avoid this, producing extremely slow performance. Extra tests avoid this conflict by executing one read/write test at a time, leading to some slower and some faster speeds. Random access can be affected by burst reading/writing with associated poor performance.
General
These tests are intended to measure Linux and hardware performance at high speeds using multithreading. The programs were compiled at both 32 bits and 64 bits. The execution files, source code, compilation and running instructions can be found in
linux_multithreading_apps.tar.gz.
All provide the following information on the system under test. They are based on versions available for running under Windows and described in
quad core 8 thread.htm.
##############################################
Assembler CPUID and RDTSC
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor
Measured - Minimum 3014 MHz, Maximum 3014 MHz
Linux Functions
get_nprocs() - CPUs 4, Configured CPUs 4
get_phys_pages() and size - RAM Size 7.81 GB, Page Size 4096 Bytes
uname() - Linux, roy-64, 2.6.35-22-generic
#33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010, x86_64
Description Simple Add Tests
CPUMaxMP32 and CPUMaxMP64 execute simple integer and floating point add instructions via assembly language. Floating point arithmetic is identical on the two versions, via SSE instructions that handle four calculations at a time via 128 bit registers. Performance is measured in Millions of Floating Point Operations Per Second (MFLOPS) with expected maximum speed of four adds per CPU clock cycle. Integer calculations are the same, except one uses 32 bit instructions/registers and the other the 64 bit varieties. For these, speed is measured in Millions of Instructions Per Second (MIPS). Results are logged in file MPadds.txt.
The assembly code loops execute two billion add instructions to ensure that elapsed times of a single thread are significant (like 0.5 seconds or more for SSE tests on current CPUs). A command line run time variable is available to specify the number of threads to use, between 1 and 64, with a default of four.
Below are example full results, using four threads, and MIPS from a run with 64 threads. Besides total MIPS and MFLOPS, a second sum (Aggregate) is provided, based on the time for the last thread to finish.
As seen for both examples, completion times are not based on first in first out, but the time is shared fairly evenly, even with 64 threads.
Command ./cpumaxmp64 Threads 4 (or T 4 or t 4)
Phenom 4 CPUs Available
##############################################
Multithreading Add Test 64 bit Version 1.0 Thu May 5 11:35:18 2011
Integer Additions 4 Threads
Thread 4 - 8281 64 bit Integer MIPS
Thread 2 - 7996 64 bit Integer MIPS
Thread 1 - 7815 64 bit Integer MIPS
Thread 3 - 7800 64 bit Integer MIPS
Total - 31892 64 Bit Integer MIPS
Aggregate - 31201 64 Bit Integer MIPS, based on last to finish
SSE Floating Point Additions 4 Threads
Thread 2 - 12030 32 Bit SSE MFLOPS
Thread 3 - 11976 32 Bit SSE MFLOPS
Thread 4 - 11861 32 Bit SSE MFLOPS
Thread 1 - 11692 32 Bit SSE MFLOPS
Total - 47559 32 Bit SSE MFLOPS
Aggregate - 46770 32 Bit SSE MFLOPS, based on last to finish
End of test Thu May 5 11:35:23 2011
Integer MIPS 64 Threads
Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS
1 528 10 522 41 518 11 517 49 516 21 515 52 514 48 514
24 527 55 521 56 518 5 517 32 515 45 515 47 514 44 514
57 526 29 521 26 518 28 516 8 515 9 515 27 514 31 514
62 525 23 520 20 518 58 516 17 515 38 515 63 514 3 514
6 525 39 520 25 518 7 516 36 515 33 515 2 514 19 513
59 523 61 520 12 518 51 516 53 515 54 515 34 514 35 513
46 523 50 519 16 517 37 516 14 515 22 515 64 514 15 513
13 522 60 519 43 517 42 516 4 515 18 515 30 514 40 513
Comparison Simple Add Tests
Following are sample results on a range of systems with one, two and four CPUs, using 1, 2, 4 and 64 threads. The range of speeds of individual threads is also shown for the latter.
Atom - This is a netbook, where the single CPU has HyperThreading and 64 bit capability. With HT, two CPUs are identified and, in this case, integer addition throughput using multiple threads is 40% higher than from a single thread and 20% faster with SSE floating point calculations.
Core 2 Duo - Results from the 32-Bit and 64-Bit compilations are shown for this dual core processor, where 32 bit integer speed is somewhat faster than at 64 bits. Integer additions are executed at up to 2.75 per CPU clock cycle (or MIPS/MHz) with SSE calculations at the maximum rate of four per clock cycle.
As with earlier tests, this system runs at 1.6 GHz when one CPU is being used under Linux with the default “On-Demand” Frequency Scaling (see Linux Peculiarities in linux burn-in apps.htm). Results provided are for a “Performance” setting.
Phenom - Results of the 64-Bit version are shown for this quad core processor, via Linux Ubuntu and Fedora. There appear to be some differences between the two versions of Linux but these might be normal variations due to other influences. They at least show that the quad core processor can increase throughput by four times with these tests.
                Atom 1.7 GHz      Core 2 Duo 2.4 GHz           Phenom X4 3.0 GHz
                      64 bit        64 bit        32 bit        64 bit        64 bit
                      Ubuntu        Ubuntu        Ubuntu        Ubuntu        Fedora
                 MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS
Threads
  1 Total        1844   5418   5268   9605   6597   9591   8052  12046   8213  12030
    Aggregate    1844   5418   5268   9605   6597   9591   8052  12046   8213  12030
    %           100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0
  2 Total        2631   6460  10290  18992  12782  18707  15964  24022  16447  24052
    Aggregate    2598   6441  10156  18898  12621  18344  15810  23946  16446  24050
    %            98.7   99.7   98.7   99.5   98.7   98.1   99.0   99.7  100.0  100.0
  4 Total        2652   6473  10508  19159  13011  19047  31892  47559  32701  47889
    Aggregate    2630   6449  10416  19070  12933  18940  31201  46770  32344  47620
    %            99.2   99.6   99.1   99.5   99.4   99.4   97.8   98.3   98.9   99.4
 64 Min            42    101    164    299    205    299    513    749    510    655
    Max            43    103    173    310    229    310    528    798    529    763
    Total        2719   6526  10696  19443  13556  19435  33094  48974  33012  43339
    Aggregate    2657   6466  10503  19111  13129  19120  32840  47932  32664  41938
    %            97.7   99.1   98.2   98.3   96.9   98.4   99.2   97.9   98.9   96.8
Whetstone MP Benchmark
The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Details and performance of early to modern systems can be found in
Whetstone Benchmark History And Results and
Results On PCs.
The overall performance rating is in Millions of Whetstone Instructions Per Second (MWIPS). Later, it was found necessary to measure the speed of the eight different test functions used, to demonstrate that compilers were not over optimising and to allow code tweaks to avoid this situation. The additional measurements are in terms of Millions of Operations Per Second (MOPS) or MFLOPS for straight floating point calculations.
As the design authority, nominated by the original author, I have to say that versions that do not provide these separate measurements cannot be taken as valid.
This multithreading benchmark also has a run time parameter to specify the number of threads (up to 64) with a default identified as Configured CPUs in gathered system information (see above).
An initial calibration determines the number of passes needed for an overall execution time of 10 seconds. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs.
The same calculations are carried out on each thread but using dedicated variables. The numeric results of calculations are logged for the first thread with others checked for the same values. Actual results might be different on repeated runs as they are dependent on the number of passes.
Four versions are available, whetsMP64, whetsMP64DP, whetsMP32 and whetsMP32DP, for 32-Bit or 64-Bit systems using Single or Double Precision floating point. Results are logged in file MPwhetres.txt.
Equivalent command ./whetsMP64 Threads 4
Phenom 4 CPUs Available
#####################################################################
Multithreading Single Precision Whetstones 64-Bit Version 1.0
Using 4 threads - Sat May 14 12:03:51 2011
          MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
Thread               1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
   1       2861    927    872    747     71     38   2947   2259    629
   2       2865    875    892    745     71     38   3294   2198    641
   3       2875    869    892    744     71     38   3408   2202    645
   4       2896    906    895    744     72     38   3141   2232    651
Total     11496   3577   3550   2979    285    151  12790   8891   2566
MWIPS     11389 Based on time for last thread to finish
Results Of Calculations Thread 1
MFLOPS 1 -1.12475013732910156 MFLOPS 2 -1.13133049011230469
IFMOPS 1.00000000000000000 FIXPMOPS 12.00000000000000000
COSMOPS 0.49911013245582581 MFLOPS 3 0.99999982118606567
EQUMOPS 3.00000000000000000 EXPMOPS 0.93536460399627686
Numeric results of the other 3 threads were same as above
End of test Sat May 14 12:04:09 2011
Comparison Whetstone MP Benchmark
Following are results of the four versions of the benchmark, using 1, 2, 4 and 16 threads, via the Atom, Core 2 Duo and Phenom CPUs noted above. The 32-Bit compilations use the original i87 floating point instructions, where arithmetic calculations are at double precision, normally producing the same speeds with both precision options. As 64-Bit compilations use SSE floating point by default rather than i87 instructions, SSE instructions are used for single precision and SSE2 for double. Using Single Instruction Multiple Data (SIMD) mode, as in the above Add Tests, SSE can be twice as fast as SSE2, with four 32 bit arithmetic calculations at a time, compared with two at 64 bits. Here, however, the source code is unsuitable for SIMD compilation, so scalar or SISD (Single Instruction Single Data) instructions are used, and single precision calculations can be at the same speed or slightly faster than using double precision. This scalar operation means that 64-Bit and 32-Bit compilations can produce similar performance.
There are differences with the 32-Bit double precision version where speed can be much faster. For the one headed Equal MOPS, the single precision code uses mov instructions rather than store on the faster compilation. For Fixpt MOPS, integer calculations are the same but the faster one involves integer conversion to double precision rather than single precision. The speed difference remains a mystery but this has little effect on the overall performance rating.
As indicated earlier, the single core Atom has Hyperthreading. In this case, some floating point calculations can be twice as fast using more than one thread. One anomaly is the high speed result during the four thread fixed point test. Here, Linux appeared to have run one thread twice as fast as the others, distorting the total.
Results on the Core 2 Duo include one for a test using 64 threads. Speeds are also shown using Fedora on the Phenom, rather than Ubuntu.
Threads MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
1 2 3 MOPS MOPS MOPS MOPS MOPS
Atom 1 751 397 395 363 17 8 698 1230 141
1.6 GHz 2 1284 747 697 698 31 14 948 1657 190
64 Bit 4 1301 768 760 715 31 14 1207 1661 190
SP 16 1309 801 773 726 32 14 956 1698 191
Aggregate 16 1305
Atom 1 748 381 381 324 17 8 700 1235 141
1.6 GHz 2 1287 732 714 634 31 14 950 1662 191
64 Bit 4 1259 781 748 593 32 13 963 1686 186
DP 16 1307 765 742 647 32 14 958 1691 191
Aggregate 16 1302
Atom 1 698 330 329 282 17 7 758 1230 118
1.6 GHz 2 1182 594 588 478 29 13 987 1654 178
32 Bit 4 1193 613 614 483 30 13 998 1690 178
SP 16 1202 618 589 483 30 13 995 1688 178
Aggregate 16 1199
Atom 1 757 330 330 282 17 7 1468 837 299
1.6 GHz 2 1312 600 592 480 29 13 2420 1248 506
32 Bit 4 1319 611 604 482 30 13 2504 1263 505
DP 16 1329 610 615 485 30 13 2575 1268 507
Aggregate 16 1324
Core2 Duo 1 2501 876 876 600 68 29 3198 3601 600
2.4 GHz 2 4926 1726 1632 1192 135 58 6102 6930 1193
64 Bit 4 4963 1733 1748 1196 135 58 6328 7158 1196
SP 16 4982 1758 1757 1199 136 58 6420 7212 1198
Aggregate 16 4966
64 5054 1783 1782 1215 138 59 6566 7292 1218
Aggregate 64 4973
Core2 Duo 1 2364 803 803 533 61 30 3005 3601 600
2.4 GHz 2 4688 1589 1586 1059 121 60 5911 7082 1196
64 Bit 4 4698 1599 1601 1062 122 60 5976 7089 1197
DP 16 4714 1609 1613 1062 122 61 6068 7213 1197
Aggregate 16 4704
Core2 Duo 1 2165 817 817 576 58 23 3169 3600 623
2.4 GHz 2 4270 1564 1558 1130 114 45 6149 6823 1234
32 Bit 4 4330 1616 1628 1149 116 45 6636 7168 1253
SP 16 4331 1628 1638 1151 116 45 6586 7229 1256
Aggregate 16 4317
Core2 Duo 1 2244 817 817 576 58 23 5140 3596 1028
2.4 GHz 2 4452 1621 1578 1144 115 46 10028 7120 2049
32 Bit 4 4450 1624 1630 1150 113 46 10399 7176 2065
DP 16 4472 1634 1636 1149 115 46 10301 7227 2051
Aggregate 16 4465
Phenom x4 1 2909 925 927 753 72 38 3229 2258 644
3.0 GHz 2 5787 1832 1825 1504 144 76 6375 4488 1253
64 Bit 4 11496 3577 3550 2979 285 151 12790 8891 2566
SP 16 11655 3700 3718 3006 289 153 13395 9039 2635
Aggregate 16 11578
Fedora 16 11842 3705 3715 3010 296 158 13474 9067 2552
Aggregate 16 11725
Phenom x4 1 3002 927 927 753 75 42 3228 2253 601
3.0 GHz 2 5977 1819 1829 1498 150 83 6410 4491 1184
64 Bit 4 11810 3583 3610 2976 297 163 12492 8875 2372
DP 16 11992 3694 3715 3008 300 166 12945 9068 2429
Aggregate 16 11929
Phenom x4 1 2586 927 926 695 64 31 3132 2259 621
3.0 GHz 2 5141 1819 1827 1389 129 62 6213 4484 1200
32 Bit 4 10178 3564 3623 2747 255 124 11567 8893 2390
SP 16 10300 3695 3691 2780 256 125 12584 9070 2460
Aggregate 16 10233
Phenom x4 1 2768 926 927 695 63 32 7525 1807 1806
3.0 GHz 2 5504 1815 1824 1388 126 64 14367 3570 3613
32 Bit 4 10853 3596 3594 2758 249 125 27371 7162 7110
DP 16 10960 3703 3701 2777 249 127 30629 7212 7177
Aggregate 16 10903
MP MFLOPS Benchmark
This benchmark executes identical functions as my CUDA and OpenMP performance tests. Details and results can be found in
linux_cuda_mflops.htm and
OpenMP Speeds.htm.
The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The program checks for consistent numeric results, primarily to show that all calculations are carried out.
This variation can be run using between 1 and 64 threads, the default being equal to the number of identified CPUs. Each thread carries out the same calculations but on a different segment of the data. The data size starts at 102400 words, rather than 100000, to ensure that each thread uses the same amount of data. For example, with 64 threads, each will use 1600 words or 6400 bytes.
Two versions, MPmflops64 and MPmflops32, were compiled, the first involving the default SSE floating point instructions and the second using the original i87 functions. Speed of the 64-Bit version was so fast that a second 32-Bit benchmark, MPmflops32SSE, was produced.
Results are logged in file MPMflopsLog.txt, with examples shown below. These show that the 64-Bit and 32-Bit SSE versions produce the same numeric results and the same speeds (within normal variations). Then the i87 program produces slightly different answers and much slower speeds.
Phenom Results
##############################################
64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:43 2011
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 2 10000 0.091754 22321 0.764063 Yes
Data in & out 1024000 2 1000 0.136134 15044 0.970753 Yes
Data in & out 10240000 2 100 0.632075 3240 0.997008 Yes
Data in & out 102400 8 10000 0.167023 49047 0.850923 Yes
Data in & out 1024000 8 1000 0.176219 46488 0.982342 Yes
Data in & out 10240000 8 100 0.658828 12434 0.998200 Yes
Data in & out 102400 32 10000 0.558509 58670 0.660143 Yes
Data in & out 1024000 32 1000 0.556450 58888 0.953631 Yes
Data in & out 10240000 32 100 0.722131 45377 0.995203 Yes
##############################################
32 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Fri May 20 12:57:17 2011
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 2 10000 0.092236 22204 0.764063 Yes
Data in & out 1024000 2 1000 0.135243 15143 0.970753 Yes
Data in & out 10240000 2 100 0.638202 3209 0.997008 Yes
Data in & out 102400 8 10000 0.164866 49689 0.850923 Yes
Data in & out 1024000 8 1000 0.183847 44559 0.982342 Yes
Data in & out 10240000 8 100 0.677530 12091 0.998200 Yes
Data in & out 102400 32 10000 0.604816 54178 0.660143 Yes
Data in & out 1024000 32 1000 0.613424 53418 0.953631 Yes
Data in & out 10240000 32 100 0.756550 43312 0.995203 Yes
##############################################
32 Bit MP i87 MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:59 2011
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 2 10000 0.278444 7355 0.763849 Yes
Data in & out 1024000 2 1000 0.287133 7133 0.970727 Yes
Data in & out 10240000 2 100 0.673376 3041 0.997006 Yes
Data in & out 102400 8 10000 0.625873 13089 0.851082 Yes
Data in & out 1024000 8 1000 0.629958 13004 0.982363 Yes
Data in & out 10240000 8 100 0.740114 11069 0.998204 Yes
Data in & out 102400 32 10000 2.172758 15081 0.660653 Yes
Data in & out 1024000 32 1000 2.186809 14984 0.953702 Yes
Data in & out 10240000 32 100 2.236048 14654 0.995214 Yes
Comparison MP MFLOPS Benchmark
As previously, following are results of the 64-Bit and 32-Bit versions of the benchmark, using 1, 2, 4 and 16 threads, via the Atom, Core 2 Duo and Phenom CPUs. Performance of the single CPU tests is virtually the same as that from the OpenMP benchmark, as expected, using the same C statements but without OpenMP directives. Multiple processor tests are a little faster on the i87 version but significantly faster on the SSE varieties. The OpenMP compilation only produced SISD SSE instructions. The generated code for this MP MFLOPS benchmark not only used full SIMD functions but also clearly included linked add and multiply instructions to produce more than four results per clock cycle. Best case is the Core 2 Duo where up to six adds or multiplies were recorded per clock, per CPU.
The Phenom shows the highest throughput here at 60 GFLOPS, with four cores at five results per clock cycle. Performance gains on the Atom again reflect Hyperthreading effects but some are more influenced by smaller cache sizes.
Numeric results of calculations are constant for a given number of repeat passes, but these are arranged to increase in proportion to the number of identified CPUs, to maintain similar running times. Rounding effects also produce slight differences on i87 and SSE versions. Default answers are shown below for systems with 2, 4, 8 and 16 cores.
Besides defining the number of threads, command line input parameters are available to specify the number of repeat passes, either to extend running time or to check for consistent answers.
Run Time Parameters
t N or T N or Threads N where N is between 1 and 64
r P or R P or repeats P or Repeats P for P Repeat Passes
m T or M T or minutes T or Minutes T for T minutes burn-in test
Examples ./MPmflops32 Threads 64 ./MPmflops64 T 4 ./MPmflops64 T 8, R 20000
Atom 1.7 GHz Core 2 Duo 2.4 GHz Phenom X4 3.0 GHz
Thds 1 2 4 16 1 2 4 16 1 2 4 16
64 Bit SSE MFLOPS
a2 800 1430 1501 1508 5545 8503 8581 12567 7237 13870 22321 25742
b2 648 610 640 1396 3779 4290 8929 9374 4611 9135 15044 27084
c2 660 629 624 628 1248 1248 1242 2192 2152 2819 3240 3649
a8 1810 3372 3396 3405 13036 23704 23904 26636 13815 26435 49047 51692
b8 1741 2486 2553 3304 10787 15437 23931 25331 13168 25751 46488 54633
c8 1746 2536 2504 2528 5003 4970 4993 8546 7152 10816 12434 13898
a32 1832 3530 3560 3577 14405 28155 28240 27827 15110 30093 58670 59810
b32 1818 3502 3521 3535 14212 27492 28084 28577 14897 29867 58888 60311
c32 1820 3504 3531 3535 13620 19913 19964 25607 14208 27760 45377 47678
32 Bit i87 MFLOPS
a2 204 327 369 369 1602 3568 3523 3185 1950 3841 7355 7535
b2 201 354 361 364 1799 3136 3613 3618 1885 3804 7133 7686
c2 202 358 363 363 1236 1252 1251 2048 1582 2505 3041 3240
a8 303 557 565 567 3188 6346 6193 6278 3361 6676 13089 13363
b8 301 550 564 565 3162 6280 6213 6304 3292 6648 13004 13404
c8 302 556 566 565 3081 4988 4959 5860 3168 6211 11069 11382
a32 404 777 794 794 3362 6696 6649 6704 3813 7613 15081 15175
b32 403 777 790 784 3357 6680 6645 6689 3775 7566 14984 15197
c32 403 776 790 785 3338 6628 6592 6620 3715 7411 14654 14848
Numeric Results
Repeats 5000 10000 20000 40000
Version SSE i87 SSE i87 SSE i87 SSE i87
a2 0.867359 0.867238 0.764063 0.763849 0.620974 0.620631 0.481454 0.481096
b2 0.985193 0.985180 0.970753 0.970727 0.942935 0.942883 0.891302 0.891203
c2 0.998502 0.998501 0.997008 0.997006 0.994032 0.994027 0.988125 0.988114
a8 0.918220 0.918307 0.850923 0.851082 0.749971 0.750239 0.635325 0.635706
b8 0.991084 0.991095 0.982342 0.982363 0.965360 0.965401 0.933325 0.933397
c8 0.999099 0.999101 0.998200 0.998204 0.996409 0.996416 0.992853 0.992862
a32 0.798973 0.799276 0.660143 0.660653 0.498060 0.498797 0.385106 0.384777
b32 0.976383 0.976422 0.953631 0.953702 0.910573 0.910709 0.833458 0.833707
c32 0.997595 0.997602 0.995203 0.995214 0.990447 0.990463 0.981037 0.981068
Key - Words a=102400, b=1024000, c=10240000 - Operations per word 2, 8 and 32
MP MFLOPS Burn-In Test
As the benchmarks generated exceptionally high speeds from a single program, it was decided to include a burn-in/reliability test function. This is initiated by including a “Minutes” input parameter. This test just uses the 32 operations per word, 102400 word procedures, with an initial calibration run to identify the number of repeat passes to generate four results per minute.
The first results below are for the quad core Phenom. Overall throughput and CPU temperatures were almost identical to those running four copies of the BurnInSSE program. See -
Linux burn-in apps.htm.
The second results are from running on a 1.83 GHz Core 2 Duo based laptop. As with the earlier burn-in apps results, the CPU switched to lower clock speeds when the core temperatures reached around 95°C.
Command ./MPmflops64 Minutes 2
##############################################
Reliability Test around 2 Minutes
4 CPUs Available
##############################################
64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Fri May 20 12:41:07 2011
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 32 266688 14.334887 60962 0.352168 Yes
Data in & out 102400 32 266688 14.334628 60963 0.352168 Yes
Data in & out 102400 32 266688 14.506037 60243 0.352168 Yes
Data in & out 102400 32 266688 14.400784 60683 0.352168 Yes
Data in & out 102400 32 266688 14.354242 60880 0.352168 Yes
Data in & out 102400 32 266688 14.418992 60606 0.352168 Yes
Data in & out 102400 32 266688 14.536283 60117 0.352168 Yes
Data in & out 102400 32 266688 14.499469 60270 0.352168 Yes
Data in & out 102400 32 266688 14.583635 59922 0.352168 Yes
End of test Fri May 20 12:43:18 2011
##############################################
64 Bit MP SSE MFLOPS Benchmark 1, 2 Threads, Sat May 21 16:45:59 2011
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 102400 32 94075 14.361988 21464 0.352474 Yes
Data in & out 102400 32 94075 14.255409 21624 0.352474 Yes
Data in & out 102400 32 94075 23.311603 13224 0.352474 Yes
Data in & out 102400 32 94075 33.148504 9300 0.352474 Yes
Data in & out 102400 32 94075 33.139577 9302 0.352474 Yes
Data in & out 102400 32 94075 33.111674 9310 0.352474 Yes
Data in & out 102400 32 94075 33.140281 9302 0.352474 Yes
Data in & out 102400 32 94075 33.586864 9178 0.352474 Yes
Data in & out 102400 32 94075 14.385304 21429 0.352474 Yes
Data in & out 102400 32 94075 14.276475 21593 0.352474 Yes
End of test Sat May 21 16:50:07 2011
MP Memory Speed
This is based on my original MemSpeed benchmark. It employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 or 64 bit integers via two data arrays:
Result to memory x[m] = x[m] + s * y[m]
Sum to memory x[m] = x[m] + y[m]
Memory to memory x[m] = y[m]
Add is used instead of multiply for the first integer tests. Memory tested doubles up from 4 KB to 25% of RAM size, to use all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For tests using two arithmetic operations, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision. The C programming calculations are identical to those used in an OpenMP version. See -
OpenMP Speeds.htm.
The execution files are MPmemspeed32 and MPmemspeed64. The 32 bit version uses the old i87 floating point instructions and 32 bit integers. The other, as expected, compiles to use SSE instructions, but these are the slow SISD variety. It is also coded to use 64 bit integers.
There is an input parameter to use 1, 2, 4, 8, 16, 32 or 64 threads, the default being the number of identified CPUs, possibly rounded up.
Again, each thread carries out the same calculations but on a different segment of the data. Results are saved in memSpeedMP.txt, examples below being for the 3.0 GHz quad core Phenom.
The data and calculations on each array element are identical. The program checks results for consistency and reports any errors.
Below are 64-Bit and 32-Bit results on the 3.0 GHz quad core Phenom using four threads. These show that the SSE floating point speeds are somewhat faster than tests using i87 instructions, except where performance becomes dependent on memory speed. MB/second rates using 64-Bit integers can be much faster than at 32-Bits, firstly, as the CPU can execute both types of instructions at the same speed and, secondly, as more registers are available for optimisation.
The number of measurements at 32-Bits is limited as the full 8 GB of RAM cannot be recognised.
get_phys_pages() and size - RAM Size 7.81 GB
MP Memory Reading Speed Test 64 Bit Version 1 Using 4 Threads
Start of test Tue Jun 7 11:32:54 2011
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 18253 12913 18066 18667 14409 27989 14221 11201 14643
8 26068 17651 25448 29410 20679 39706 21694 16463 22343
16 39834 25431 38289 47167 29614 57648 21134 21446 35492
32 50545 29588 47840 65341 37153 76190 46231 24825 45564
65 57307 31898 53119 71962 40593 86253 48212 25858 48779
131 64285 33405 56454 83929 42824 93889 51601 26317 52109
262 65111 32902 58563 85910 43904 96199 52272 26517 52156
524 58699 32056 53683 67177 39149 66647 44137 26617 44123
1048 59967 32531 53638 67332 39808 67046 43401 26310 44172
2097 48409 31709 51453 59630 37829 59008 32561 25079 32687
4194 36529 27079 37052 37380 32077 37280 18682 18694 18732
8388 27898 21163 25293 27011 23273 27070 14253 12800 13768
16777 9006 8909 8869 8978 8806 9023 4488 4462 4516
33554 8946 8875 8887 8606 8855 8921 4525 4497 4508
67108 8717 8458 8325 8516 8452 8756 4287 4366 4379
134217 8688 8339 8362 8696 8473 8698 4276 4357 4355
268435 8703 8608 8516 8659 8393 8648 4280 4268 4328
536870 8700 8591 8421 8673 8514 8690 4308 4290 4264
1073741 8596 8471 8584 8628 8619 8698 4398 4329 4395
2147483 8825 8790 8835 8790 8763 8842 4397 4402 4468
No errors found
End of test Tue Jun 7 11:33:52 2011
##############################################
get_phys_pages() and size - RAM Size 3.20 GB
MP Memory Reading Speed Test 32 Bit Version 1 Using 4 Threads
Start of test Tue Jun 7 14:23:02 2011
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 15704 11347 10961 17813 12518 15904 13744 8714 8758
8 24188 15367 14929 26770 17870 21025 20789 10866 10234
16 33319 19229 18266 38724 23589 23124 31390 13114 13157
32 40697 20675 21180 51120 27260 25282 39385 13921 13960
65 45013 22913 22267 57143 30132 24875 42247 14314 14241
131 45569 23573 22953 61979 31356 27585 44688 14427 13289
262 48701 23759 22666 63235 32103 27892 44447 14200 14453
524 44900 22996 20417 53167 30753 25832 36085 14671 13403
1048 44929 23357 20300 54596 30302 25790 36207 14708 13590
2097 42017 22864 20927 42429 28809 24778 26734 13125 12659
4194 34909 20379 19542 36402 25268 21093 18592 12625 12821
8388 22498 17592 17006 23354 19577 18854 12489 9400 9657
16777 8906 8697 8781 8884 8841 8844 4433 4217 4440
33554 8848 8684 8606 8877 8436 8843 4412 4293 4422
67108 8423 8445 8433 8685 8506 8526 4228 4296 4273
134217 8704 8453 8572 8563 8426 8485 4383 4303 4346
268435 8623 8579 8539 8731 8652 8612 4408 4301 4322
536870 8683 8331 8534 8724 8658 8444 4371 4330 4325
No errors found
End of test Tue Jun 7 14:24:05 2011
MP Memory Speeds Comparison
Below are 64-Bit results on the 3.0 GHz quad core Phenom and 2.4 GHz Core 2 Duo for the multiply and add tests using 1 CPU, all CPUs and with 64 threads.
On the single thread tests, although the speeds are dependent on CPU GHz, variations generally reflect cache sizes.
The full gain in throughput through using more than one CPU is not achieved at the lower data sizes, mainly due to higher overheads. For example, at 4 KB there are two arrays of 2 KB, producing a segment of 32 bytes for each of 64 threads.
There are significant additional performance gains using an increasing number of threads with mid to large data sets. This is due to the relatively small data segments being repetitively processed from a lower level cache.
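The segment arithmetic above can be sketched as follows. This is only an illustration, assuming the data is split equally between the threads and into two arrays as described; the function name `segment_bytes` is hypothetical and is not taken from the benchmark source.

```python
def segment_bytes(total_kb, threads, arrays=2):
    """Bytes of each array handled by one thread, assuming an equal split."""
    array_bytes = total_kb * 1024 // arrays   # e.g. 4 KB -> two arrays of 2 KB
    return array_bytes // threads

# 4 KB of data shared between 4 and 64 threads
for threads in (4, 64):
    print(threads, "threads:", segment_bytes(4, threads), "bytes per array segment")
```

With 64 threads, each segment at 4 KB is 32 bytes, so thread start-up overheads dominate at the smallest sizes.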
Further below are results on the Netbook with an Atom CPU, running 64-Bit Ubuntu 11.04. It can be seen that Hyperthreading provides significant gains in throughput using floating point instructions.
Quad Core Phenom - Caches L1 64 KB/CPU, L2 512 KB/CPU, L3 6144 KB shared
Commands ./MPmemspeed64 Threads 1
./MPmemspeed64
./MPmemspeed64 Threads 64
1 thread 4 threads 64 threads
KBytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 16834 8588 16753 18253 12913 18066 17040 12902 21602
8 17196 8671 16039 26068 17651 25448 30712 18072 31164
16 17188 8670 18148 39834 25431 38289 34169 21795 41192
32 17314 8703 16669 50545 29588 47840 39691 21845 35582
65 15023 8634 15211 57307 31898 53119 45213 25520 52872
131 15274 8155 13675 64285 33405 56454 53787 29396 58835
262 15335 8143 13508 65111 32902 58563 56811 31077 59534
524 14512 8013 13242 58699 32056 53683 62895 32310 58189
1048 10911 7355 10791 59967 32531 53638 64335 32520 60720
2097 10909 7350 10784 48409 31709 51453 65249 32553 59855
4194 10561 7169 10411 36529 27079 37052 65800 32975 57665
8388 6642 5610 6315 27898 21163 25293 57438 30135 52237
16777 6128 5410 5853 9006 8909 8869 57345 30844 50350
33554 6311 5427 5677 8946 8875 8887 54636 30033 49051
67108 5789 5160 5698 8717 8458 8325 31347 24887 30860
134217 5969 5138 5908 8688 8339 8362 26846 19845 26308
268435 5922 5449 5779 8703 8608 8516 8510 8359 8558
536870 6121 5090 5811 8700 8591 8421 8638 8525 8387
1073741 6020 5481 5832 8596 8471 8584 8569 8334 8422
2147483 6264 5630 6028 8825 8790 8835 8834 8699 8625
Core 2 Duo - Caches L1 32 KB/CPU, L2 4096 KB shared
1 thread 2 threads 64 threads
Kbytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 12572 6264 12613 19400 10834 20763 6272 4371 8553
8 12534 6337 12613 22252 11696 22188 5644 3760 6362
16 12639 6364 12692 23343 12208 23485 10681 6790 15783
32 12468 6318 12483 24406 12409 24317 13725 8918 15230
65 9434 5733 9453 24573 12548 24520 17721 11117 18778
131 9584 5770 9612 18347 11708 17212 21622 11731 22935
262 9645 5824 9648 18389 11700 18191 23737 11938 23593
524 9674 5834 9667 18286 11635 18153 24032 12329 24204
1048 9696 5843 9684 18324 11663 18206 24579 12484 24682
2097 9548 5777 9623 18326 11593 18203 24377 12261 24413
4194 8188 5520 8302 13296 10585 13927 17971 11380 17734
8388 4381 4219 4407 4701 4597 4664 17905 11450 17698
16777 3788 3830 3847 3948 3921 3944 17803 11413 17646
33554 3817 3827 3806 3903 3868 3893 17580 11294 17428
67108 3845 3872 3856 3908 3888 3917 16531 10438 14364
134217 3876 3856 3798 3886 3918 3922 9007 8009 9152
268435 3885 3894 3889 3893 3885 3882 4092 4088 4102
536870 3827 3816 3829 3923 3922 3918 3900 3893 3887
Atom 1 CPU with Hyperthreading - Caches L1 24 KB, L2 512 KB
get_phys_pages() and size - RAM Size 0.96 GB, Page Size 4096 Bytes
uname() - Linux, roy-Ubuntu-11, 2.6.38-8-generic
#42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011, x86_64
1 thread 2 threads 64 threads
Kbytes Dble Sngl Int64 Dble Sngl Int64 Dble Sngl Int64
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 2112 1149 5817 3849 2085 6203 3329 1896 5237
8 2117 1148 5922 3254 2106 6284 3519 1939 5653
16 2113 1150 5938 3870 2085 6307 3608 1970 5882
32 1742 1050 3764 3182 1879 4926 3682 1935 6005
65 1770 869 3863 3157 1908 4728 3571 1975 5912
131 1804 1050 3869 3173 1905 4652 3619 1992 6052
262 1802 1043 3833 3181 1900 4711 3597 1977 6104
524 1731 1017 3575 3021 1838 4440 3633 2002 6000
1048 1656 983 2622 2668 1746 2656 3035 1787 4710
2097 1652 962 2222 2087 1769 2059 3043 1822 4516
4194 1655 946 2123 1969 1755 1950 3023 1807 4528
8388 1538 981 2128 1958 1747 1956 2956 1788 4483
16777 1571 979 2118 1948 1769 1948 2788 1711 3364
33554 1661 969 2119 1957 1760 1921 1978 1652 2170
67108 1606 986 2176 1930 1762 1951 1763 1644 1929
134217 1660 975 2149 1932 1747 1966 1882 1672 1692
MP Memory Bus Speed
MPbusspeed64 and MPbusspeed32 are based on my old BusSpeed2K Benchmark and are essentially the same as the Windows Multithreading Version.
Data is read using AND instructions at a range of data sizes covering caches and RAM. The program starts by reading words with 32 word address increments, then reduces the increment to eventually read all words sequentially. Speed reductions of around 50% at each higher increment suggest reading in bursts over the bus. This is normal for reading from RAM and is sometimes found reading cached data.
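The reading pattern described above can be sketched in a few lines. This is not the benchmark's assembly code, only an illustration: `strided_and` and `useful_fraction` are hypothetical names, and the 64 byte burst size is an assumption used to show why speed might halve as the increment doubles.

```python
def strided_and(words, inc):
    """AND together every inc-th word, like one Inc32wds..ReadAll pass."""
    acc = 0xFFFFFFFF
    for i in range(0, len(words), inc):
        acc &= words[i]
    return acc

def useful_fraction(n_words, inc, word_bytes, burst_bytes=64):
    """Fraction of transferred bytes actually used, assuming whole bursts are fetched."""
    reads = range(0, n_words, inc)
    bursts = {(i * word_bytes) // burst_bytes for i in reads}  # distinct bursts touched
    return (len(reads) * word_bytes) / (len(bursts) * burst_bytes)

data = [0xFFFFFFFF] * 1024          # 1024 words, treated as 32 bit
for inc in (32, 16, 8, 4, 2, 1):
    result = strided_and(data, inc)
    print(inc, hex(result), useful_fraction(len(data), inc, word_bytes=4))
```

Under that assumption, the useful fraction doubles each time the increment is halved, once the increment is smaller than a burst, which is consistent with the 50% steps in the results.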
The final results use SSE2 integer AND instructions to read the data into a 16 byte register, with the 32 bit and 64 bit procedures using the same assembly code.
Again there is an input parameter to use 1, 2, 4, 8, 16, 32 or 64 threads, the default being the number of identified CPUs. This time, each thread reads all the data. Results are saved in busSpeedMP.txt, examples below being for the 3.0 GHz quad core Phenom.
Using L1 and L2 caches, data transfer speed with 64 bit integers is around twice as fast as using 32 bit numbers, suggesting a CPU speed limitation. From burst reading, estimates of maximum RAM speed are 904 x 8 = 7232 MB/second and 448 x 16 = 7168 MB/s. Cache examples are - L2 2989 x 8 = 23912 MB/s and L3 1432 x 8 = 11456 MB/s.
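The burst estimates above are simple multiplications, reproduced here as a check. The interpretation of the multiplier as the number of words per burst (8 x 64 bit or 16 x 32 bit words) is an assumption; the measured figures are those quoted in the text, and `burst_estimate` is a hypothetical helper name.

```python
def burst_estimate(measured_mb_s, words_per_burst):
    """Estimated transfer rate if a whole burst is moved for each word read."""
    return measured_mb_s * words_per_burst

estimates = {
    "RAM, 64 bit Inc8wds": burst_estimate(904, 8),
    "RAM, 32 bit Inc16wds": burst_estimate(448, 16),
    "L2, 64 bit Inc8wds": burst_estimate(2989, 8),
    "L3, 64 bit Inc8wds": burst_estimate(1432, 8),
}
for name, mb_s in estimates.items():
    print(f"{name}: {mb_s} MB/s")
```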
MP Bus Speeds 64 bit Version 1.0, 1 Threads, Fri Jun 17 16:51:46 2011
  Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2
       6     22196     22641     26239     26979     26580     26409     23762
      24     23507     23118     27665     27323     27081     25424     23850
      96      2959      2964      2989      5987     11983     21134     23868
     384      2917      2917      2869      5853     11732     21898     23264
     768      1362      1359      1352      2699      5408     10617     10803
    1536      1322      1293      1432      2856      5764     11098     12081
   16380       862       886       902      1745      3637      6019      7249
  131070       854       885       902      1777      3431      5853      6619
  393210       858       830       904      1757      3602      5995      7074
64 bit words, Speed in MB/Second - MIPS divide by 8
End at Fri Jun 17 16:52:21 2011
MP Bus Speeds 32 bit Version 1.0, 1 Threads, Fri Jun 17 16:45:12 2011
  Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2
       6      8318     12102     12848     13317     13413     13452     23033
      24      9876     12979     13877     13288     13436     13640     23876
      96      1495      1496      2675      5979     11081     13201     23852
     384      1209      1238      2454      4946      8816     12459     18874
     768       721       726      1480      3046      5638      9741     11846
    1536       699       699      1513      3032      5708      9722     11924
   16380       413       423       860      1805      2993      5022      7075
  131070       411       444       841      1793      3024      4881      7060
  393210       427       448       887      1826      3052      4946      6897
32 bit words, Speed in MB/Second - MIPS divide by 4
End at Fri Jun 17 16:45:47 2011
MPbusSpeed Comparisons
Below are cache and RAM speeds obtained on the Quad Core 3.0 GHz Phenom, the 2.4 GHz Core 2 Duo and the 1.66 GHz Atom. These are for 1, 2 or 4 (based on identified CPUs) and 64 threads.
Comparisons of performance gains using multiple threads, and of 64 bit versus 32 bit compilations, are best limited to the last two columns, as the earlier burst reading tests produce all sorts of timing peculiarities.
Although measured RAM speeds could be improved with multiple threads reading shared data from caches, those provided have mainly been confirmed with programs using dedicated data.
Phenom - some performance gains from cached data were only 3 times with 4 threads but nearer 4 times with 64 threads. Using 4 or 64 threads can double throughput on the memory bus. Maximum measured speed was 15.8 GB/second, compared with a specification of 21.3 GB/second, comprising 667 MHz x 2 (DDR) x 8 (bus width in bytes) x 2 (dual channel).
Core 2 Duo - generally achieved peak performance using two threads and, if anything, was slightly slower using a concurrency of 64. Compared with the Phenom, some tests came out much faster and others significantly slower.
Atom - Hyperthreading had little impact on performance via L1 cache but did provide improvements via L2 cache and RAM.
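The Phenom memory specification figure quoted above can be reproduced with the following sketch, which also expresses the measured maximum as a percentage of it. The function name `peak_mb_s` is hypothetical; the input values are those given in the text.

```python
def peak_mb_s(clock_mhz, transfers_per_clock, bus_bytes, channels):
    """Theoretical peak memory bandwidth in MB/second."""
    return clock_mhz * transfers_per_clock * bus_bytes * channels

spec = peak_mb_s(667, 2, 8, 2)      # 667 MHz x 2 (DDR) x 8 bytes x 2 channels
measured = 15800                    # 15.8 GB/second quoted as the maximum measured speed
print(f"Specification {spec / 1000:.1f} GB/s, measured {100 * measured / spec:.0f}% of it")
```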
L1 Cache Results in MBytes/Second - 6 KB
CPUs/ MHz Inc Inc Inc Inc Inc Read 128b
HTs 32wds 16wds 8wds 4wds 2wds All SSE2
Phenom II 32b 4/0 3000 8318 12102 12848 13317 13413 13452 23033
4 Threads 3901 7614 14703 28644 29313 34882 74424
64 Threads 10098 14599 20588 30042 32606 37702 76707
Core 2 Duo 32b 2/0 2400 8069 8772 9036 9283 9369 9390 37413
2 Threads 13921 16387 16361 16996 18147 18380 61694
64 Threads 15315 16271 17162 17587 17848 18004 60044
Atom N455 32b 1/1 1667 5092 5813 5959 6272 6290 6289 24663
2 Threads 5638 6157 6323 6450 6439 6488 25608
64 Threads 5011 5655 5609 5785 5867 5610 22419
64 Bit Version
Phenom II 64b 4/0 3000 22196 22641 26239 26979 26580 26409 23762
4 Threads 4478 17301 15108 29950 58038 58706 76500
64 Threads 19577 43542 35782 52893 74743 73745 76027
Core 2 Duo 64b 2/0 2400 15931 17513 18140 18542 18715 18813 37391
2 Threads 30486 32243 35126 36209 35493 35615 73979
64 Threads 29474 32288 33585 34516 34670 35792 71354
Atom N455 64b 1/1 1667 9004 10592 11500 12051 12565 12735 24731
2 Threads 10224 11743 12283 12767 12948 13053 25668
64 Threads 8574 10100 11089 10638 11905 11649 21789
L2 Cache Results in MBytes/Second - 96 KB
CPUs/ MHz Inc Inc Inc Inc Inc Read 128b
HTs 32wds 16wds 8wds 4wds 2wds All SSE2
Phenom II 32b 4/0 3000 1495 1496 2675 5979 11081 13201 23852
4 Threads 4648 5085 8422 19230 33948 39486 74050
64 Threads 5247 5317 10388 21080 36499 45574 89766
Core 2 Duo 32b 2/0 2400 2065 2044 3275 4562 6700 8095 19153
2 Threads 3138 3050 5218 7776 11911 15415 32030
64 Threads 3099 2963 5209 7617 11721 15081 31010
Atom N455 32b 1/1 1667 505 415 788 1481 2577 3915 5909
2 Threads 597 665 1243 2285 3657 4660 8598
64 Threads 455 534 996 1861 3133 4100 7417
64 Bit Version
Phenom II 64b 4/0 3000 2959 2964 2989 5987 11983 21134 23868
4 Threads 4478 17301 15108 29950 58038 58706 76500
64 Threads 10290 10607 10548 20940 40361 75398 89396
Core 2 Duo 64b 2/0 2400 4171 4170 4098 6715 9120 13430 19157
2 Threads 6322 6435 6127 10810 15448 23521 32104
64 Threads 6090 6237 5972 10720 15200 23342 31444
Atom N455 64b 1/1 1667 993 1020 833 1564 2954 5156 5919
2 Threads 1066 1127 1386 2501 4608 7240 8722
64 Threads 1003 1033 1055 1991 3674 6250 7222
L3 Cache Results in MBytes/Second - 1536 KB
Phenom II 32b 4/0 3000 699 699 1513 3032 5708 9722 11924
4 Threads 2407 2543 4943 10058 17570 29261 41159
64 Threads 2541 2571 5022 10100 18811 30768 41078
64 Bit Version
Phenom II 64b 4/0 3000 1322 1293 1432 2856 5764 11098 12081
4 Threads 5112 4899 5083 10018 19866 37136 38700
64 Threads 5092 5101 5101 10051 20203 37850 41309
RAM Results in MBytes/Second - 128 MB
CPUs/ MHz Inc Inc Inc Inc Inc Read 128b
HTs 32wds 16wds 8wds 4wds 2wds All SSE2
Phenom II 32b 4/0 3000 411 444 841 1793 3024 4881 7060
4 Threads 786 813 1605 3444 6259 12161 14950
64 Threads 891 969 1869 3678 6930 12678 15564
Core 2 Duo 32b 2/0 2400 353 399 814 1467 2725 5021 5842
2 Threads 621 808 1181 2217 4108 7686 9952
64 Threads 395 598 1080 1947 3541 6540 8133
Atom N455 32b 1/1 1667 122 256 514 1029 1978 3256 4122
2 Threads 131 318 684 1312 2434 4189 5307
64 Threads 125 265 577 1159 2280 4435 4636
64 Bit Version
Phenom II 64b 4/0 3000 858 830 904 1757 3602 5995 7074
4 Threads 1561 1648 1701 3330 6964 13025 14027
64 Threads 1808 1854 1947 3773 7488 14176 15516
Core 2 Duo 64b 2/0 2400 699 711 803 1622 2946 5414 5861
2 Threads 1210 1336 1632 2436 4635 7947 10028
64 Threads 706 813 1226 2177 3919 7014 8127
Atom N455 64b 1/1 1667 125 256 514 1038 2024 3994 4057
2 Threads 136 294 677 1327 2530 4924 5230
64 Threads 142 249 523 1129 2312 4565 4639
MP Memory Random Access
MPrandmem64 and MPrandmem32 are based on my old RandMem Benchmark and are essentially the same as the Windows Multithreading Version, except there are added tests identified as Mutex SRW and Mutex RRW.
The program uses the same code for serial and random access, via a complex indexing structure, and comprises Read (RD) and Read/Write (RW) tests. They are run using data from L1 cache, L2 cache and RAM with 1 to 64 threads, selected by a run time parameter, the default being equal to the number of identified CPUs.
This benchmark uses data from the same array for all threads, but starting at different points. Results are saved in file randmemMP.txt.
In this case, both the 64 bit and 32 bit versions use 32 bit integer data arrays.
Below are logged results on a 3.0 GHz quad core Phenom using one and four threads.
On serial and random read only tests, performance gains are up to four times using dedicated caches, with the smaller data sizes slower due to overheads. Random reading is much slower than serial data transfers where burst reading leads to more data being transferred than is requested.
With reading and writing, there is a possibility that data can be corrupted when more than one thread updates the same data. Although it cannot be proven with this benchmark, it seems that shared data is not allowed to be updated independently in local caches, which are flushed so that updates take place in shared data areas, producing significant performance degradation, particularly on random access. This behaviour is normally enforced by the hardware cache coherency mechanism rather than by the Operating System.
The extra tests with Mutex, or mutual exclusion, functions avoid the updating conflict by only allowing one thread at a time to access common data. This can still lead to four threads being slower than one but, with random access, there can be a significant improvement compared with multiple thread speeds without Mutex protection, except when accessing RAM.
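The idea of threads updating one shared array from different starting points, protected by a Mutex, can be sketched as below. This is not the benchmark's code: Python's `threading.Lock` stands in for the mutex functions, and the array size, number of passes and locking of every single update are assumptions made for illustration.

```python
import threading

def run_shared_update(threads=4, size=64, passes=100):
    """Each thread reads and writes every word of one shared array, under a lock."""
    data = [0] * size
    lock = threading.Lock()

    def worker(start):
        for _ in range(passes):
            for k in range(size):
                i = (start + k) % size      # same array, different starting point
                with lock:                  # only one thread updates at a time
                    data[i] += 1

    workers = [threading.Thread(target=worker, args=(t * size // threads,))
               for t in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return data

result = run_shared_update()
print(min(result), max(result))
```

With the lock in place, every word ends up updated exactly threads x passes times, but the threads effectively take turns, which is why such tests can be slower with four threads than with one.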
RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sun Jun 26 18:01:19 2011
          ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
Serial RD    15757   15827   12155   11879    8534    8511    4392    4385
Serial RW     9263    9534    8875    8868    7591    7493    3740    3601
Random RD    14396   14271    7504    3159    2269    1751     622     341
Random RW     9231    9510    6136    2993    2087    1507     532     319
Mutex SRW     9264    9534    8875    8869    7591    7492    3740    3608
Mutex RRW     9231    9510    6138    2993    2087    1507     532     320
RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sun Jun 26 18:00:21 2011
          ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB
Serial RD    29630   53166   44120   44829   29620   29671   12108   11987
Serial RW     5040    7334    7442    7402    7353    7395    8532    6247
Random RD    28388   41211   27807   12265    8866    6611    2103    1271
Random RW      657    1096    1229    1283    1288    1376    1648     993
Mutex SRW     5962    8654    7998    7882    6982    6853    3579    3415
Mutex RRW     6243    8594    5838    2815    1970    1370     486     310
MPrandmem Comparisons
Below, again, are cache and RAM speeds obtained on the Quad Core 3.0 GHz Phenom, the 2.4 GHz Core 2 Duo and the 1.66 GHz Atom. These are for 1, 2 or 4 and 64 threads plus others for the Phenom at 64 bits.
L1 Cache - Probably due to the overheads involved, performance using 64 threads is noticeably slow, with the Core 2 Duo performing better than the quad core Phenom when writing is involved. Hyperthreading does not lead to much performance gain on the Atom.
L2 Cache - Reading tests can show appropriate performance gains using all processors and not as much degradation with reading and writing. Although dealing with 32 bit integers, and unlike the Phenom, the Core 2 Duo and Atom produce much faster speeds with the 64 bit version using 64 threads. The Mutex tests produce different performance characteristics than when using L1 cache.
RAM - There are some performance gains with multiple threads making better use of memory bandwidth and excessive numbers of threads do not necessarily reduce performance.
L1 Cache Results in MBytes/Second - 6 KB
CPUs/ MHz Serial Serial Random Random Mutex Mutex
HTs RD RW RD RW SRW RRW
Phenom II 32b 4/0 3000 15791 11495 14746 11409 11478 11394
4 Threads 29613 5014 29595 610 7152 7569
64 Threads 2321 188 2234 42 257 255
Core 2 Duo 32b 2/0 2400 6327 8474 6314 8305 12642 8306
2 Threads 13285 3559 13432 1312 6935 9327
64 Threads 800 452 802 93 346 433
Atom N455 32b 1/1 1667 3500 4742 4422 5028 5032 5022
2 Threads 4902 4770 4895 1153 677 3101
64 Threads 307 296 301 69 55 207
64 Bit Version
Phenom II 64b 4/0 3000 15757 9263 14396 9231 9264 9231
4 Threads 29630 5040 28388 657 5962 6243
8 Threads 14933 2120 14892 338 2465 2867
16 Threads 8514 846 8284 174 910 1041
64 Threads 2247 189 2173 45 225 214
Core 2 Duo 64b 2/0 2400 9579 12619 6385 7720 12623 7600
2 Threads 14257 3553 14073 1505 7018 7718
64 Threads 875 893 875 112 348 358
Atom N455 64b 1/1 1667 3838 4222 3834 4222 4233 4215
2 Threads 4438 4779 4481 1218 970 3130
64 Threads 285 291 281 68 42 167
L2 Cache Results in MBytes/Second - 96 KB
CPUs/ MHz Serial Serial Random Random Mutex Mutex
HTs RD RW RD RW SRW RRW
Phenom II 32b 4/0 3000 12476 10387 7552 6241 10385 6241
4 Threads 45484 7488 27712 1238 9645 5810
64 Threads 31650 7033 22035 1162 1989 1431
Core 2 Duo 32b 2/0 2400 5228 5892 4245 3852 5935 2619
2 Threads 15026 16440 7009 2896 7009 3054
64 Threads 3304 2662 3268 497 1300 1506
Atom N455 32b 1/1 1667 2768 3349 855 1175 3464 1173
2 Threads 4642 4424 1317 1570 2805 966
64 Threads 1177 1138 1118 665 423 584
64 Bit Version
Phenom II 64b 4/0 3000 12155 8875 7504 6136 8875 6138
4 Threads 44120 7442 27807 1229 7998 5838
8 Threads 42685 7413 27567 1240 6875 4867
16 Threads 42004 7443 27870 1237 5335 3760
64 Threads 30435 7057 21892 1157 1686 1329
Core 2 Duo 64b 2/0 2400 6234 5992 4320 3777 5947 3779
2 Threads 14542 15153 7145 2932 7190 3113
64 Threads 11741 12994 6426 2767 4714 2317
Atom N455 64b 1/1 1667 2813 3064 845 1103 3063 1122
2 Threads 4613 4576 1352 1615 3044 1111
64 Threads 3551 3506 1179 1312 1759 874
L3 Cache Results in MBytes/Second - 1536 KB
CPUs/ MHz Serial Serial Random Random Mutex Mutex
HTs RD RW RD RW SRW RRW
Phenom II 32b 4/0 3000 8756 7918 1743 1505 7919 1503
4 Threads 29961 7491 6617 1332 7414 1391
64 Threads 30159 7763 6643 1333 2394 472
64 Bit Version
Phenom II 64b 4/0 3000 8511 7493 1751 1507 7492 1507
4 Threads 29671 7395 6611 1376 6853 1370
8 Threads 29330 7549 6558 1342 6229 1234
16 Threads 29827 7627 6623 1361 4763 988
64 Threads 29812 7733 6650 1337 2163 473
RAM Results in MBytes/Second - 96 MB
CPUs/ MHz Serial Serial Random Random Mutex Mutex
HTs RD RW RD RW SRW RRW
Phenom II 32b 4/0 3000 4407 3845 344 320 3826 320
4 Threads 12009 6305 1273 994 3615 308
64 Threads 12010 6641 1298 1003 2881 308
Core 2 Duo 32b 2/0 2400 4803 2147 449 282 2232 310
2 Threads 5492 2512 621 401 2239 308
64 Threads 6206 2567 635 411 2294 304
Atom N455 32b 1/1 1667 2253 1159 38 54 1275 54
2 Threads 3926 1257 63 79 1109 42
64 Threads 3951 1274 64 79 1322 54
64 Bit Version
Phenom II 64b 4/0 3000 4385 3601 341 319 3608 320
4 Threads 11987 6247 1271 993 3415 310
8 Threads 11930 6248 1274 990 3321 307
16 Threads 11860 6557 1281 997 2777 304
64 Threads 11863 6651 1288 1002 2774 302
Core 2 Duo 64b 2/0 2400 3971 2141 416 283 2148 314
2 Threads 5448 2508 632 404 2425 298
64 Threads 6065 2612 639 416 2318 305
Atom N455 64b 1/1 1667 2284 1286 39 54 1298 54
2 Threads 3717 1241 62 78 1336 54
64 Threads 3785 1270 64 78 1122 42
Roy Longbottom July 2011
The new Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection