Linux MultiThreading Benchmarks

General	Description Simple Add Tests	Comparison Simple Add Tests
Whetstone MP Benchmark	Comparison Whetstone MP	MP MFLOPS Description
Comparison MP MFLOPS	MP MFLOPS Burn-In Test	MP Memory Speed
MPmemSpeed Comparisons	MP Memory Bus Speed	MPbusSpeed Comparisons
MP Memory Random Access	MPrandMem Comparisons

Summary

Six benchmarks are provided that can run using up to 64 concurrent threads, with versions compiled to run using 64 bit or 32 bit systems. Performance is mainly measured as Millions of Instructions Per Second (MIPS), Millions of Floating Point Operations Per Second (MFLOPS) or Millions of Bytes per Second (MB/S).

Simple Add Tests - execute 32 bit or 64 bit integer instructions and 128 bit SSE floating point functions via assembly language. These use simple add operations with little access to external data. Resultant performance is generally proportional to the number of CPU cores with some gains also identified when Hyperthreading is available. Each thread executes independent code.

Whetstone Benchmark - is the first general purpose benchmark that set industry standards of computer system performance, mainly dependent on floating point speed but with some independently timed integer test functions. Data used is generally contained in L1 cache with performance gains again proportional to the number of cores. Each thread again executes independent code.

MP MFLOPS Program - uses the same functions as my CUDA and OpenMP benchmarks, comprising routines with 2, 8 and 32 add or multiply floating point calculations with data from higher level caches or RAM. The 64 bit version compiles using SSE floating point, where up to 6 MFLOPS per CPU MHz per core can be produced. The 32 bit program uses the much slower original 80387 FPU instructions. These programs can also be used as burn-in/reliability tests. Each thread executes the same functions but on a different segment of the data.

MP Memory Speed Tests - employ three sequences of operations, using double and single precision floating point numbers and integers, on data sized between 4 KB and 25% of RAM size. The operations are memory to memory transfers with 0, 1 and 2 arithmetic calculations. The 64 bit version again uses SSE functions but not as efficiently as MP MFLOPS. Again each thread has the same procedures using different segments of the data.

MP Memory Bus Speed Tests - read data at a range of sizes covering caches and RAM. Data is accessed with varying address increments to identify reading data in bursts over the bus and allow estimation of maximum bus/memory speed. This time, each thread reads all the data. The 64 bit version uses the double size 8 byte words, where data transfer speed can be twice that of the 32 bit compilation, demonstrating that 32 and 64 bit integer instructions can execute at the same speed. A second version provides the alternative of each thread starting to read at a different data address, to avoid overestimation of maximum speed due to large L3 caches.

MP Memory Random Access Speed Benchmark - comprises serial and random access read and read/write tests that cover cache and RAM data sizes. All threads access the same data but starting at different points. In this case, data could be corrupted with concurrent updates, but the Operating System appears to flush caches to avoid this, producing extremely slow performance. Extra tests avoid this conflict by executing one read/write test at a time, leading to some slower and some faster speeds. Random access can be affected by burst reading/writing with associated poor performance.


General

These tests are intended to measure Linux and hardware performance at high speeds using multithreading. The programs were compiled at both 32 bits and 64 bits. The execution files, source code, compilation and running instructions can be found in linux_multithreading_apps.tar.gz. This includes 2014 revised benchmarks, produced by a later compiler. The latter was also used to produce versions using the newer AVX instructions, available in AVX_benchmarks.tar.gz and described in Linux AVX benchmarks.htm. All provide the following information on the system under test. They are based on versions available for running under Windows and described in quad core 8 thread.htm.

 ##############################################

  Assembler CPUID and RDTSC
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
  AMD Phenom(tm) II X4 945 Processor
  Measured - Minimum 3014 MHz, Maximum 3014 MHz
  Linux Functions
  get_nprocs() - CPUs 4, Configured CPUs 4
  get_phys_pages() and size - RAM Size 7.81 GB, Page Size 4096 Bytes
  uname() - Linux, roy-64, 2.6.35-22-generic
  #33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010, x86_64
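As an illustration only, and not the benchmark's own source, the Linux part of this report could be gathered along the following lines. The struct and function names are hypothetical, and the use of get_nprocs_conf() for "Configured CPUs" and sysconf(_SC_PAGESIZE) for the page size are assumptions. The CPUID and RDTSC measurements are not covered.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/sysinfo.h>   /* get_nprocs, get_nprocs_conf, get_phys_pages (glibc) */
#include <sys/utsname.h>   /* uname */
#include <unistd.h>        /* sysconf */

/* Hypothetical container for the values shown in the log header. */
typedef struct {
    int            cpus_online;      /* get_nprocs()                      */
    int            cpus_configured;  /* get_nprocs_conf() (assumed)       */
    long           page_bytes;       /* sysconf(_SC_PAGESIZE) (assumed)   */
    double         ram_gb;           /* physical pages x page size, in GB */
    struct utsname uts;              /* filled in by uname()              */
} sys_summary;

/* Fill in the summary; returns 0 on success, -1 if uname() fails. */
int read_sys_summary(sys_summary *s)
{
    assert(s != NULL);
    s->cpus_online     = get_nprocs();
    s->cpus_configured = get_nprocs_conf();
    s->page_bytes      = sysconf(_SC_PAGESIZE);
    s->ram_gb = (double)get_phys_pages() * (double)s->page_bytes
                / (1024.0 * 1024.0 * 1024.0);
    return uname(&s->uts) == 0 ? 0 : -1;
}

/* Print the summary in a layout similar to the log above. */
void print_sys_summary(const sys_summary *s)
{
    printf("  get_nprocs() - CPUs %d, Configured CPUs %d\n",
           s->cpus_online, s->cpus_configured);
    printf("  get_phys_pages() and size - RAM Size %5.2f GB, Page Size %ld Bytes\n",
           s->ram_gb, s->page_bytes);
    printf("  uname() - %s, %s, %s\n  %s, %s\n",
           s->uts.sysname, s->uts.nodename, s->uts.release,
           s->uts.version, s->uts.machine);
}
```

A caller would declare a sys_summary, pass its address to read_sys_summary() and then to print_sys_summary().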


Description Simple Add Tests

CPUMaxMP32 and CPUMaxMP64 execute simple integer and floating point add instructions via assembly language. Floating point arithmetic is identical on the two versions, via SSE instructions that handle four calculations at a time via 128 bit registers. Performance is measured in Millions of Floating Point Operations Per Second (MFLOPS) with expected maximum speed of four adds per CPU clock cycle. Integer calculations are the same, except one uses 32 bit instructions/registers and the other the 64 bit varieties. For these, speed is measured in Millions of Instructions Per Second (MIPS). Results are logged in file MPadds.txt.

The assembly code loops execute two billion add instructions to ensure that elapsed times of a single thread are significant (such as 0.5 seconds or more for SSE tests on current CPUs). A command line run time variable is available to specify the number of threads to use, between 1 and 64, with a default of four. Below are example full results, using four threads, and MIPS from a run with 64 threads. Besides total MIPS and MFLOPS, a second sum (Aggregate) is provided, based on the time for the last thread to finish. As seen in both examples, threads do not finish in the order they were started, but the time is shared fairly evenly, even with 64 threads.
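The difference between the two sums can be sketched as follows. This is a minimal illustration, assuming every thread starts together and executes the same number of operations; the function names are hypothetical and this is not the benchmark's code. With those assumptions the aggregate equals the thread count multiplied by the slowest thread's speed, which is consistent with the four thread integer example in this section (4 x 7800 = 31200 against a reported 31201).

```c
#include <assert.h>
#include <stddef.h>

/* Total: sum of each thread's own rate (operations / its own elapsed time).
   With ops_per_thread in millions, the result is in MIPS or MFLOPS. */
double total_rate(const double *secs, size_t n, double ops_per_thread)
{
    assert(secs != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += ops_per_thread / secs[i];
    return sum;
}

/* Aggregate: all operations divided by the time of the last thread to finish. */
double aggregate_rate(const double *secs, size_t n, double ops_per_thread)
{
    assert(secs != NULL && n > 0);
    double slowest = secs[0];
    for (size_t i = 1; i < n; i++)
        if (secs[i] > slowest)
            slowest = secs[i];
    return (double)n * ops_per_thread / slowest;
}
```

The aggregate can never exceed the total, and the two are equal only when all threads take the same time.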

 Command ./cpumaxmp64 Threads 4 (or T 4 or t 4)

 Phenom
 4 CPUs Available
 ##############################################

  Multithreading Add Test 64 bit Version 1.0 Thu May 5 11:35:18 2011

  Integer Additions 4 Threads
  Thread 4  -   8281 64 bit Integer MIPS
  Thread 2  -   7996 64 bit Integer MIPS
  Thread 1  -   7815 64 bit Integer MIPS
  Thread 3  -   7800 64 bit Integer MIPS
  Total     -  31892 64 Bit Integer MIPS
  Aggregate -  31201 64 Bit Integer MIPS, based on last to finish

  SSE Floating Point Additions 4 Threads
  Thread 2  -  12030 32 Bit SSE MFLOPS
  Thread 3  -  11976 32 Bit SSE MFLOPS
  Thread 4  -  11861 32 Bit SSE MFLOPS
  Thread 1  -  11692 32 Bit SSE MFLOPS
  Total     -  47559 32 Bit SSE MFLOPS
  Aggregate -  46770 32 Bit SSE MFLOPS, based on last to finish

  End of test Thu May 5 11:35:23 2011

  Integer MIPS 64 Threads

  Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS Thrd MIPS

     1  528   10  522   41  518   11  517   49  516   21  515   52  514   48  514
    24  527   55  521   56  518    5  517   32  515   45  515   47  514   44  514
    57  526   29  521   26  518   28  516    8  515    9  515   27  514   31  514
    62  525   23  520   20  518   58  516   17  515   38  515   63  514    3  514
     6  525   39  520   25  518    7  516   36  515   33  515    2  514   19  513
    59  523   61  520   12  518   51  516   53  515   54  515   34  514   35  513
    46  523   50  519   16  517   37  516   14  515   22  515   64  514   15  513
    13  522   60  519   43  517   42  516    4  515   18  515   30  514   40  513


Comparison Simple Add Tests

Following are sample results on a range of systems with one, two and four CPUs, using 1, 2, 4 and 64 threads. The range of speeds of individual threads is also shown for the latter.

Atom - This is a netbook, where the single CPU has HyperThreading and 64 bit capability. With HT, two CPUs are identified and, in this case, integer addition throughput using multiple threads is 40% higher than from a single thread and 20% faster with SSE floating point calculations.

Core 2 Duo - Results from the 32-Bit and 64-Bit compilations are shown for this dual core processor, where 32 bit integer speed is somewhat faster than at 64 bits. Integer additions are executed at up to 2.75 per CPU clock cycle (or MIPS/MHz) with SSE calculations at the maximum rate of four per clock cycle. As with earlier tests, this system runs at 1.6 GHz when one CPU is being used under Linux and default “On-Demand” Frequency Scaling is used (see Linux Peculiarities in linux burn-in apps.htm). Results provided are for a “Performance” setting.

Phenom - Results of the 64-Bit version are shown for this quad core processor, via Linux Ubuntu and Fedora. There appear to be some differences between the two versions of Linux but these might be normal variations due to other influences. They at least show that the quad core processor can increase throughput by four times with these tests.

Core i7 - This is a 4 core/8 thread 3.7 GHz 4820K, running at 3.9 GHz Turbo Boost speed. It seems that all 8 threads are needed for a four times performance improvement. With 4 cores and 4 SSE adds at a time, maximum speed would be 62.4 GFLOPS, and that was nearly achieved. With integer addition, nearly 1.6 results per clock cycle is demonstrated.

                Atom 1.7 GHz     Core 2 Duo 2.4 GHz           Phenom X4 3.0 GHz
                      64 bit        64 bit        32 bit        64 bit        64 bit
                      Ubuntu        Ubuntu        Ubuntu        Ubuntu        Fedora
 Threads         MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS

   1 Total       1844   5418   5268   9605   6597   9591   8052  12046   8213  12030
     Aggregate   1844   5418   5268   9605   6597   9591   8052  12046   8213  12030
     %          100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0  100.0

   2 Total       2631   6460  10290  18992  12782  18707  15964  24022  16447  24052
     Aggregate   2598   6441  10156  18898  12621  18344  15810  23946  16446  24050
     %           98.7   99.7   98.7   99.5   98.7   98.1   99.0   99.7  100.0  100.0

   4 Total       2652   6473  10508  19159  13011  19047  31892  47559  32701  47889
     Aggregate   2630   6449  10416  19070  12933  18940  31201  46770  32344  47620
     %           99.2   99.6   99.1   99.5   99.4   99.4   97.8   98.3   98.9   99.4

  64 Min           42    101    164    299    205    299    513    749    510    655
     Max           43    103    173    310    229    310    528    798    529    763
     Total       2719   6526  10696  19443  13556  19435  33094  48974  33012  43339
     Aggregate   2657   6466  10503  19111  13129  19120  32840  47932  32664  41938
     %           97.7   99.1   98.2   98.3   96.9   98.4   99.2   97.9   98.9   96.8

 ------------------------------------------------------------------------

 3.7 GHz Core i7 4820K

                       1 Thread     2 Threads     4 Threads     8 Threads 8 Thread Gains
                    MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS   MIPS MFLOPS

 32 bit Total      11694  15450  12154  30907  36237  30993  48733  61911    4.2    4.0
        Aggregate  11694  15450  11953  30907  24342  30902  47020  61763    4.0    4.0
        %          100.0  100.0   98.3  100.0   67.2   99.7   96.5   99.8

 64 bit Total      11937  15450  23069  30887  24717  46409  49162  61916    4.1    4.0
        Aggregate  11937  15450  23061  30874  24167  30903  47387  61540    4.0    4.0
        %          100.0  100.0  100.0  100.0   97.8   66.6   96.4   99.4   96.4   99.4


Whetstone MP Benchmark

The Whetstone programs, initially used in 1972, were the first general purpose benchmarks that set industry standards of computer system performance. Details and performance of early to modern systems can be found in Whetstone Benchmark History And Results and Results On PCs. The overall performance rating is in Millions of Whetstone Instructions Per Second (MWIPS). Later, it was found necessary to measure the speed of the eight different test functions used, to demonstrate that compilers were not over optimising and to allow code tweaks to avoid this situation. The additional measurements are in terms of Millions of Operations Per Second (MOPS) or MFLOPS for straight floating point calculations. As the design authority, nominated by the original author, I have to say that versions that do not provide these separate measurements cannot be taken as valid.

This multithreading benchmark also has a run time parameter to specify the number of threads (up to 64) with a default identified as Configured CPUs in gathered system information (see above). An initial calibration determines the number of passes needed for an overall execution time of 10 seconds. Then all threads are run using the same pass count, running time being extended when there are more threads than CPUs. The same calculations are carried out on each thread but using dedicated variables. The numeric results of calculations are logged for the first thread with others checked for the same values. Actual results might be different on repeated runs as they are dependent on the number of passes.
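The calibration step might be sketched as below. This is an assumption about the general technique (probe with a doubling pass count, then scale linearly to the target time), not the Whetstone source. The callback type, the function names and the stand-in workload linear_cost_run are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: runs `passes` passes and returns elapsed seconds. */
typedef double (*timed_run_fn)(long passes, void *ctx);

/* Stand-in workload for demonstration only: pretends that each pass
   costs *(double *)ctx seconds instead of timing real calculations. */
double linear_cost_run(long passes, void *ctx)
{
    return (double)passes * *(double *)ctx;
}

/* Double the probe pass count until a run lasts at least min_probe_secs,
   then scale that count linearly to reach target_secs. */
long calibrate_passes(timed_run_fn run, void *ctx,
                      double target_secs, double min_probe_secs)
{
    assert(run != NULL && target_secs > 0.0 && min_probe_secs > 0.0);
    long passes = 1;
    double secs = run(passes, ctx);
    while (secs < min_probe_secs && passes < (1L << 30)) {
        passes *= 2;
        secs = run(passes, ctx);
    }
    if (secs <= 0.0)
        return passes;               /* timer too coarse to scale from */
    long scaled = (long)((double)passes * target_secs / secs);
    return scaled > 0 ? scaled : 1;
}
```

With the stand-in workload at 0.25 seconds per pass, a 1 second probe threshold and a 10 second target, the probe stops at 4 passes and the scaled count is 40. Every thread would then be given the same count.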

Four versions are available, whetsMP64, whetsMP64DP, whetsMP32 and whetsMP32DP, for 32-Bit or 64-Bit systems using Single or Double Precision floating point. Results are logged in file MPwhetres.txt.

 Equivalent command ./whetsMP64 Threads 4

 Phenom
 4 CPUs Available
 #####################################################################

  Multithreading Single Precision Whetstones 64-Bit Version 1.0

  Using 4 threads - Sat May 14 12:03:51 2011

          MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
 Thread              1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

   1       2861    927    872    747     71     38   2947   2259    629
   2       2865    875    892    745     71     38   3294   2198    641
   3       2875    869    892    744     71     38   3408   2202    645
   4       2896    906    895    744     72     38   3141   2232    651

 Total    11496   3577   3550   2979    285    151  12790   8891   2566
 MWIPS    11389 Based on time for last thread to finish

  Results Of Calculations Thread 1

  MFLOPS 1   -1.12475013732910156    MFLOPS 2   -1.13133049011230469
  IFMOPS      1.00000000000000000    FIXPMOPS   12.00000000000000000
  COSMOPS     0.49911013245582581    MFLOPS 3    0.99999982118606567
  EQUMOPS     3.00000000000000000    EXPMOPS     0.93536460399627686

  Numeric results of the other 3 threads were same as above

  End of test Sat May 14 12:04:09 2011


Comparison Whetstone MP Benchmark

Following are results of the four versions of the benchmark, using 1, 2, 4 and 16 threads, via the Atom, Core 2 Duo and Phenom CPUs noted above. The 32-Bit compilations use the original i87 floating point instructions, where arithmetic calculations are at double precision, normally producing the same speeds with both precision options. With i87 mode not being available at 64-Bits, SSE instructions are used for single precision and SSE2 for double. Using Single Instruction Multiple Data (SIMD) mode, as in the above Add Tests, SSE can be twice as fast as SSE2, with four 32 bit arithmetic calculations at a time, compared with two at 64 bits. Here, however, the source code is unsuitable for SIMD compilation, so scalar or SISD (Single Instruction Single Data) instructions are used, and single precision calculations can be at the same speed as, or slightly faster than, double precision. This scalar operation also means that 64-Bit and 32-Bit compilations can produce similar performance.

There are differences with the 32-Bit double precision version where speed can be much faster. For the one headed Equal MOPS, the single precision code uses mov instructions rather than store on the faster compilation. For Fixpt MOPS, integer calculations are the same but the faster one involves integer conversion to double precision rather than single precision. The speed difference remains a mystery but this has little effect on the overall performance rating.

As indicated earlier, the single core Atom has Hyperthreading. In this case, some floating point calculations can be twice as fast using more than one thread. One anomaly is the high speed result during the four thread fixed point test. Here, Linux appeared to have run one thread twice as fast as the others, distorting the total. Results on the Core 2 Duo include one for a test using 64 threads. Speeds are also shown using Fedora on the Phenom, rather than Ubuntu.

Later results shown are for a Core i7, with 4 cores and 8 threads, via Hyperthreading. It is also rated at 3.7 GHz but mainly runs at 3.9 GHz via Turbo Boost. On this PC at 64 bits, Cos and Exp tests are much faster than the 32 bit compilations, and this has a significant effect on the MWIPS rating. In some cases, Hyperthreading doubles speed on eight threads, compared with four.

        Threads  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
                            1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

 Atom         1    751    397    395    363     17      8    698   1230    141
 1.6 GHz      2   1284    747    697    698     31     14    948   1657    190
 64 Bit       4   1301    768    760    715     31     14   1207   1661    190
 SP          16   1309    801    773    726     32     14    956   1698    191
 Aggregate   16   1305

 Atom         1    748    381    381    324     17      8    700   1235    141
 1.6 GHz      2   1287    732    714    634     31     14    950   1662    191
 64 Bit       4   1259    781    748    593     32     13    963   1686    186
 DP          16   1307    765    742    647     32     14    958   1691    191
 Aggregate   16   1302

 Atom         1    698    330    329    282     17      7    758   1230    118
 1.6 GHz      2   1182    594    588    478     29     13    987   1654    178
 32 Bit       4   1193    613    614    483     30     13    998   1690    178
 SP          16   1202    618    589    483     30     13    995   1688    178
 Aggregate   16   1199

 Atom         1    757    330    330    282     17      7   1468    837    299
 1.6 GHz      2   1312    600    592    480     29     13   2420   1248    506
 32 Bit       4   1319    611    604    482     30     13   2504   1263    505
 DP          16   1329    610    615    485     30     13   2575   1268    507
 Aggregate   16   1324

 Core2 Duo    1   2501    876    876    600     68     29   3198   3601    600
 2.4 GHz      2   4926   1726   1632   1192    135     58   6102   6930   1193
 64 Bit       4   4963   1733   1748   1196    135     58   6328   7158   1196
 SP          16   4982   1758   1757   1199    136     58   6420   7212   1198
 Aggregate   16   4966
             64   5054   1783   1782   1215    138     59   6566   7292   1218
 Aggregate   64   4973

 Core2 Duo    1   2364    803    803    533     61     30   3005   3601    600
 2.4 GHz      2   4688   1589   1586   1059    121     60   5911   7082   1196
 64 Bit       4   4698   1599   1601   1062    122     60   5976   7089   1197
 DP          16   4714   1609   1613   1062    122     61   6068   7213   1197
 Aggregate   16   4704

 Core2 Duo    1   2165    817    817    576     58     23   3169   3600    623
 2.4 GHz      2   4270   1564   1558   1130    114     45   6149   6823   1234
 32 Bit       4   4330   1616   1628   1149    116     45   6636   7168   1253
 SP          16   4331   1628   1638   1151    116     45   6586   7229   1256
 Aggregate   16   4317

 Core2 Duo    1   2244    817    817    576     58     23   5140   3596   1028
 2.4 GHz      2   4452   1621   1578   1144    115     46  10028   7120   2049
 32 Bit       4   4450   1624   1630   1150    113     46  10399   7176   2065
 DP          16   4472   1634   1636   1149    115     46  10301   7227   2051
 Aggregate   16   4465

 Phenom x4    1   2909    925    927    753     72     38   3229   2258    644
 3.0 GHz      2   5787   1832   1825   1504    144     76   6375   4488   1253
 64 Bit       4  11496   3577   3550   2979    285    151  12790   8891   2566
 SP          16  11655   3700   3718   3006    289    153  13395   9039   2635
 Aggregate   16  11578
 Fedora      16  11842   3705   3715   3010    296    158  13474   9067   2552
 Aggregate   16  11725

 Phenom x4    1   3002    927    927    753     75     42   3228   2253    601
 3.0 GHz      2   5977   1819   1829   1498    150     83   6410   4491   1184
 64 Bit       4  11810   3583   3610   2976    297    163  12492   8875   2372
 DP          16  11992   3694   3715   3008    300    166  12945   9068   2429
 Aggregate   16  11929

 Phenom x4    1   2586    927    926    695     64     31   3132   2259    621
 3.0 GHz      2   5141   1819   1827   1389    129     62   6213   4484   1200
 32 Bit       4  10178   3564   3623   2747    255    124  11567   8893   2390
 SP          16  10300   3695   3691   2780    256    125  12584   9070   2460
 Aggregate   16  10233

 Phenom x4    1   2768    926    927    695     63     32   7525   1807   1806
 3.0 GHz      2   5504   1815   1824   1388    126     64  14367   3570   3613
 32 Bit       4  10853   3596   3594   2758    249    125  27371   7162   7110
 DP          16  10960   3703   3701   2777    249    127  30629   7212   7177
 Aggregate   16  10903

 Core i7-4820K CPU 3.7 GHz mainly at 3.9 GHz Turbo Boost speed - 4 Cores, 8 Threads

        Threads  MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
                            1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

 Core i7      1   4663   1330   1328    977    132     65   4874   5855    986
 3.9 GHz      2   9323   2660   2657   1954    263    129   9764  11709   1965
 64 bit       4  17005   5271   5281   3875    446    247  12573  17569   3929
 SP           8  30080  10366  10214   7707    731    466  24948  23501   5033
 Aggregate    8  29839

 Core i7      1   4648   1331   1331    977    122     70   4720   5855    983
 3.9 GHz      2   9274   2661   2660   1945    243    140   9769  11717   1964
 64 bit       4  18078   5263   5229   3907    488    265  15620  17408   3929
 DP           8  30524  10321  10384   7657    733    494  25079  23559   5033
 Aggregate    8  30312

 Core i7      1   3663   1331   1330    938     95     42   4601   5852    950
 3.9 GHz      2   7312   2660   2658   1877    189     85   9200  11703   1868
 32 bit       4  14256   5275   5295   3731    377    163  15413  17216   3642
 SP           8  24612  10513  10417   7468    577    312  25897  23424   4728
 Aggregate    8  24463

 Core i7      1   3881   1330   1330    938     94     43   9749   5856   2345
 3.9 GHz      2   7762   2661   2660   1876    187     86  19489  11714   4686
 32 bit       4  14869   5275   5150   3738    373    164  29334  17585   7031
 DP           8  26022  10372  10266   7465    569    314  38564  23700   9383
 Aggregate    8  25910


MP MFLOPS Benchmark

This benchmark executes the same functions as my CUDA and OpenMP performance tests. Details and results can be found in linux_cuda_mflops.htm and OpenMP Speeds.htm. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. Array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The program checks for consistent numeric results, primarily to show that all calculations are carried out. This variation can be run using between 1 and 64 threads, the default being equal to the number of identified CPUs. Each thread carries out the same calculations but on a different segment of the data. The data size starts at 102400 words, rather than 100000, to ensure that each thread uses the same amount of data. For example, with 64 threads, each will use 1600 words or 6400 bytes.
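The segment arithmetic and the quoted expression can be sketched as follows. This is an illustration under stated assumptions, not the benchmark source: the function names are hypothetical, and the expression as written above contains eight add, subtract or multiply operations per word, so it is taken here as the 8 operation case. In a threaded program each segment call would run in its own thread, for example via pthread_create(); the sketch leaves threading out to stay short.

```c
#include <assert.h>
#include <stddef.h>

/* Words handled by each thread when the array divides evenly,
   which is why 102400 is used rather than 100000. */
size_t words_per_thread(size_t total_words, size_t threads)
{
    assert(threads > 0 && total_words % threads == 0);
    return total_words / threads;
}

/* Apply the expression from the text to one segment of the array:
   3 adds, 3 multiplies, 1 subtract and 1 further add per word. */
void mflops8_segment(float *x, size_t first, size_t count,
                     float a, float b, float c, float d, float e, float f)
{
    assert(x != NULL);
    for (size_t i = first; i < first + count; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

Thread t of n would be given first = t * words_per_thread(total, n) and count = words_per_thread(total, n), so no two threads touch the same words.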

Two versions, MPmflops64 and MPmflops32, were compiled, the first involving the default SSE floating point instructions and the second using the original i87 functions. Speed of the 64-Bit version was so fast that a second 32-Bit benchmark, MPmflops32SSE, was produced. Results are logged in file MPMflopsLog.txt, with examples shown below. These show that the 64-Bit and 32-Bit SSE versions produce the same numeric results and the same speeds (within normal variations). Then the i87 program produces slightly different answers and much slower speeds.

 Phenom Results

 ##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:43 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     102400     2    10000   0.091754    22321   0.764063   Yes
 Data in & out    1024000     2     1000   0.136134    15044   0.970753   Yes
 Data in & out   10240000     2      100   0.632075     3240   0.997008   Yes
 Data in & out     102400     8    10000   0.167023    49047   0.850923   Yes
 Data in & out    1024000     8     1000   0.176219    46488   0.982342   Yes
 Data in & out   10240000     8      100   0.658828    12434   0.998200   Yes
 Data in & out     102400    32    10000   0.558509    58670   0.660143   Yes
 Data in & out    1024000    32     1000   0.556450    58888   0.953631   Yes
 Data in & out   10240000    32      100   0.722131    45377   0.995203   Yes

 ##############################################

  32 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Fri May 20 12:57:17 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     102400     2    10000   0.092236    22204   0.764063   Yes
 Data in & out    1024000     2     1000   0.135243    15143   0.970753   Yes
 Data in & out   10240000     2      100   0.638202     3209   0.997008   Yes
 Data in & out     102400     8    10000   0.164866    49689   0.850923   Yes
 Data in & out    1024000     8     1000   0.183847    44559   0.982342   Yes
 Data in & out   10240000     8      100   0.677530    12091   0.998200   Yes
 Data in & out     102400    32    10000   0.604816    54178   0.660143   Yes
 Data in & out    1024000    32     1000   0.613424    53418   0.953631   Yes
 Data in & out   10240000    32      100   0.756550    43312   0.995203   Yes

 ##############################################

  32 Bit MP i87 MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:59 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     102400     2    10000   0.278444     7355   0.763849   Yes
 Data in & out    1024000     2     1000   0.287133     7133   0.970727   Yes
 Data in & out   10240000     2      100   0.673376     3041   0.997006   Yes
 Data in & out     102400     8    10000   0.625873    13089   0.851082   Yes
 Data in & out    1024000     8     1000   0.629958    13004   0.982363   Yes
 Data in & out   10240000     8      100   0.740114    11069   0.998204   Yes
 Data in & out     102400    32    10000   2.172758    15081   0.660653   Yes
 Data in & out    1024000    32     1000   2.186809    14984   0.953702   Yes
 Data in & out   10240000    32      100   2.236048    14654   0.995214   Yes


Comparison MP MFLOPS Benchmark

As previously, following are results of the 64-Bit and 32-Bit versions of the benchmark, using 1, 2, 4 and 16 threads, via the Atom, Core 2 Duo and Phenom CPUs. Performance of the single CPU tests is virtually the same as that from the OpenMP benchmark, as expected, using the same C statements but without OpenMP directives. Multiple processor tests are a little faster on the i87 version but significantly faster on the SSE varieties. The OpenMP compilation only produced SISD SSE instructions. The generated code for this MP MFLOPS benchmark not only used full SIMD functions but also clearly included linked add and multiply instructions to produce more than four results per clock cycle. Best case is the Core 2 Duo where up to six adds or multiplies were recorded per clock, per CPU. The Phenom shows the highest throughput here at 60 GFLOPS, with four cores at five results per clock cycle. Performance gains on the Atom again reflect Hyperthreading effects but some are more influenced by smaller cache sizes.

Numeric results of calculations are constant for a given number of repeat passes, but these are arranged to increase in proportion to the number of identified CPUs, to maintain similar running times. Rounding effects also produce slight differences on i87 and SSE versions. Default answers are shown below for systems with 2, 4, 8 and 16 cores.

Besides defining the number of threads, command line input parameters are available to use specified repeat passes, either to extend running time or to check for consistent answers.

Later results, provided below, are for a 3.7 GHz Core i7 4820K, normally running at Turbo Boost speed of 3.9 GHz. Maximum speed per CPU, for a 64 bit Single Precision SSE compilation, is 3.9 x 4 (register width) x 2 (linked multiply and add) = 31.2 GFLOPS. The benchmark was run as a stand alone test and that produced the same performance as a test with a single thread. In both cases, maximum speeds were around 24.5 GFLOPS. Then, it seems that 8 threads are needed to maximise performance at 92 GFLOPS. Both of these demonstrate some linking of multiply and add instructions. As usual, loading and saving data leads to lower performance with two operations per data word.

The benchmark was recompiled to use the AVX instructions, available on this Core i7. These use 256 bit registers, or SIMD on eight 32 bit numbers, with a potential maximum speed of 62.4 GFLOPS per core. The results demonstrate up to 46.5 GFLOPS using one core and 177.8 GFLOPS via 8 threads. These are from a compiled C program.
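The peak figures quoted for this Core i7 follow from clock speed x SIMD register width x operations issued per lane per cycle. The sketch below simply restates that arithmetic; the function names are hypothetical. It gives 3.9 x 4 x 2 = 31.2 GFLOPS per core for SSE and 3.9 x 8 x 2 = 62.4 for AVX, so the single thread measurements of about 24.5 and 46.5 GFLOPS are roughly 79% and 75% of those peaks.

```c
#include <assert.h>

/* Theoretical peak single precision GFLOPS per core:
   clock (GHz) x values per SIMD register x operations per lane per cycle
   (2 when a multiply and an add can be issued together, as in the text). */
double peak_gflops_per_core(double ghz, int simd_lanes, int ops_per_lane_per_cycle)
{
    assert(ghz > 0.0 && simd_lanes > 0 && ops_per_lane_per_cycle > 0);
    return ghz * (double)simd_lanes * (double)ops_per_lane_per_cycle;
}

/* Measured speed as a percentage of the theoretical peak. */
double percent_of_peak(double measured_gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return 100.0 * measured_gflops / peak_gflops;
}
```

For example, peak_gflops_per_core(3.9, 4, 2) covers the SSE case and peak_gflops_per_core(3.9, 8, 2) the AVX case; multiplying by the number of cores gives the whole processor limit.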

 Run Time Parameters

 t N or T N or Threads N where N is between 1 and 64
 r P or R P or repeats P or Repeats P for P Repeat Passes
 m T or M T or minutes T or Minutes T for T minutes burn-in test

 Examples
 ./MPmflops32 Threads 64
 ./MPmflops64 T 4
 ./MPmflops64 T 8, R 20000

             Atom 1.7 GHz             Core 2 Duo 2.4 GHz           Phenom X4 3.0 GHz
 Thds      1      2      4     16      1      2      4     16      1      2      4     16

 64 Bit SSE MFLOPS

 a2      800   1430   1501   1508   5545   8503   8581  12567   7237  13870  22321  25742
 b2      648    610    640   1396   3779   4290   8929   9374   4611   9135  15044  27084
 c2      660    629    624    628   1248   1248   1242   2192   2152   2819   3240   3649
 a8     1810   3372   3396   3405  13036  23704  23904  26636  13815  26435  49047  51692
 b8     1741   2486   2553   3304  10787  15437  23931  25331  13168  25751  46488  54633
 c8     1746   2536   2504   2528   5003   4970   4993   8546   7152  10816  12434  13898
 a32    1832   3530   3560   3577  14405  28155  28240  27827  15110  30093  58670  59810
 b32    1818   3502   3521   3535  14212  27492  28084  28577  14897  29867  58888  60311
 c32    1820   3504   3531   3535  13620  19913  19964  25607  14208  27760  45377  47678

 32 Bit i87 MFLOPS

 a2      204    327    369    369   1602   3568   3523   3185   1950   3841   7355   7535
 b2      201    354    361    364   1799   3136   3613   3618   1885   3804   7133   7686
 c2      202    358    363    363   1236   1252   1251   2048   1582   2505   3041   3240
 a8      303    557    565    567   3188   6346   6193   6278   3361   6676  13089  13363
 b8      301    550    564    565   3162   6280   6213   6304   3292   6648  13004  13404
 c8      302    556    566    565   3081   4988   4959   5860   3168   6211  11069  11382
 a32     404    777    794    794   3362   6696   6649   6704   3813   7613  15081  15175
 b32     403    777    790    784   3357   6680   6645   6689   3775   7566  14984  15197
 c32     403    776    790    785   3338   6628   6592   6620   3715   7411  14654  14848

 ---------------------------------------------------------------------------------------

 3.7 GHz Core i7 4820K

 MFLOPS 1 to 8 Threads

   4 Byte  Ops/  Repeat     SSE  -------- SSE ---------  -------- AVX ---------
    Words  Word  Passes   1 CPU       1       4       8       1       4       8

   100000     2    2500    9918    9681   45340   54621   12542   62273   60258
  1000000     2     250    9688    9759   21688   41832   11404   23031   44329
 10000000     2      25    5870    5990    9237   10026    5991    8970    9977
   100000     8    2500   24448   24533   49320   92086   35982  159040  173224
  1000000     8     250   24465   24570   49918   92352   36180   80096  151909
 10000000     8      25   20055   19975   36638   39982   23299   40124   40153
   100000    32    2500   23251   23269   46942   92408   46400   90572  173372
  1000000    32     250   23265   23307   89676   93282   46572   91058  177831
 10000000    32      25   23063   23052   91029   92050   44729   88877  158594

 -------------------------------------------------------------------------------------

 Numeric Results

 Repeats        5000                10000               20000               40000
 Version       SSE       i87       SSE       i87       SSE       i87       SSE       i87

 a2       0.867359  0.867238  0.764063  0.763849  0.620974  0.620631  0.481454  0.481096
 b2       0.985193  0.985180  0.970753  0.970727  0.942935  0.942883  0.891302  0.891203
 c2       0.998502  0.998501  0.997008  0.997006  0.994032  0.994027  0.988125  0.988114
 a8       0.918220  0.918307  0.850923  0.851082  0.749971  0.750239  0.635325  0.635706
 b8       0.991084  0.991095  0.982342  0.982363  0.965360  0.965401  0.933325  0.933397
 c8       0.999099  0.999101  0.998200  0.998204  0.996409  0.996416  0.992853  0.992862
 a32      0.798973  0.799276  0.660143  0.660653  0.498060  0.498797  0.385106  0.384777
 b32      0.976383  0.976422  0.953631  0.953702  0.910573  0.910709  0.833458  0.833707
 c32      0.997595  0.997602  0.995203  0.995214  0.990447  0.990463  0.981037  0.981068

 Key - Words a=102400, b=1024000, c=10240000 - Operations per word 2, 8 and 32


MP MFLOPS Burn-In Test

As the benchmarks generated exceptionally high speeds from a single program, it was decided to include a burn-in/reliability test function. This is initiated by including a “Minutes” input parameter. This test just uses the 32 operations per word, 102400 word procedures, with an initial calibration run to identify the number of repeat passes to generate four results per minute.

The first results below are for the quad core Phenom. Overall throughput and CPU temperatures were almost identical to those running four copies of the BurnInSSE program. See - Linux burn-in apps.htm. The second results are from running on a 1.83 GHz Core 2 Duo based laptop. As with the earlier burn-in apps results, the CPU switched to lower clock speeds when the CPU core temperatures reached around 95°C.

 Command ./MPmflops64 Minutes 2

 ##############################################

  Reliability Test around 2 Minutes

  4 CPUs Available
 ##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Fri May 20 12:41:07 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     102400    32   266688  14.334887    60962   0.352168   Yes
 Data in & out     102400    32   266688  14.334628    60963   0.352168   Yes
 Data in & out     102400    32   266688  14.506037    60243   0.352168   Yes
 Data in & out     102400    32   266688  14.400784    60683   0.352168   Yes
 Data in & out     102400    32   266688  14.354242    60880   0.352168   Yes
 Data in & out     102400    32   266688  14.418992    60606   0.352168   Yes
 Data in & out     102400    32   266688  14.536283    60117   0.352168   Yes
 Data in & out     102400    32   266688  14.499469    60270   0.352168   Yes
 Data in & out     102400    32   266688  14.583635    59922   0.352168   Yes

  End of test Fri May 20 12:43:18 2011

 ##############################################

  64 Bit MP SSE MFLOPS Benchmark 1, 2 Threads, Sat May 21 16:45:59 2011

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     102400    32    94075  14.361988    21464   0.352474   Yes
 Data in & out     102400    32    94075  14.255409    21624   0.352474   Yes
 Data in & out     102400    32    94075  23.311603    13224   0.352474   Yes
 Data in & out     102400    32    94075  33.148504     9300   0.352474   Yes
 Data in & out     102400    32    94075  33.139577     9302   0.352474   Yes
 Data in & out     102400    32    94075  33.111674     9310   0.352474   Yes
 Data in & out     102400    32    94075  33.140281     9302   0.352474   Yes
 Data in & out     102400    32    94075  33.586864     9178   0.352474   Yes
 Data in & out     102400    32    94075  14.385304    21429   0.352474   Yes
 Data in & out     102400    32    94075  14.276475    21593   0.352474   Yes

  End of test Sat May 21 16:50:07 2011


MP Memory Speed

This is based on my original MemSpeed benchmark. It employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 or 64 bit integers via two data arrays:

   Result to memory     x[m] = x[m] + s * y[m]     
   Sum to memory        x[m] = x[m] + y[m]         
   Memory to memory     x[m] = y[m]

Add is used instead of multiply for the first integer tests. Memory tested doubles up from 4 KB to 25% of RAM size, to use all caches and RAM. Speed measurements are data reading speeds in MegaBytes Per Second. For tests using two arithmetic operations, speed in MFLOPS can be calculated as MB/second divided by 4 for single precision floating point tests and divided by 8 for those using double precision. The C programming calculations are identical to those used in an OpenMP version. See - OpenMP Speeds.htm.
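The MB/second to MFLOPS conversion above can be written as a small C helper. This is a minimal sketch, not code from the benchmark: the function name and parameters are hypothetical, and it assumes that each result involves reading two words (x[m] and y[m]), which is what the quoted divisors of 4 and 8 imply.

```c
#include <assert.h>

/* Convert a reported data reading speed in MB/second to MFLOPS.
 *   bytes_per_word    4 for single precision, 8 for double precision
 *   words_per_result  words read for each result (assumed 2: x[m] and y[m])
 *   flops_per_result  arithmetic operations for each result
 *                     (2 for x[m] + s * y[m], 1 for x[m] + y[m]) */
double mflops_from_mbps(double mb_per_sec, int bytes_per_word,
                        int words_per_result, int flops_per_result)
{
    assert(bytes_per_word > 0 && words_per_result > 0 && flops_per_result >= 0);
    double bytes_per_result = (double)bytes_per_word * words_per_result;
    return mb_per_sec * flops_per_result / bytes_per_result;
}
```

With two operations per result, the expression reduces to MB/second divided by 4 for single precision and by 8 for double precision, as stated in the text. With one operation per result the divisors double.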

The execution files are MPmemspeed32 and MPmemspeed64. The 32 bit version uses the old i87 floating point instructions and 32 bit integers. The other, as expected, compiles to use SSE instructions, but these are the slow SISD variety. It is also coded to use 64 bit integers. There is an input parameter to use 1, 2, 4, 8, 16, 32 or 64 threads, the default being the number of identified CPUs, possibly rounded up. Again, each thread carries out the same calculations but on a different segment of the data. Results are saved in memSpeedMP.txt, examples below being for the 3.0 GHz quad core Phenom. The data and calculations on each array element are identical. The program checks results for consistency and reports any errors.

Below are 64-Bit and 32-Bit results on the 3.0 GHz quad core Phenom using four threads. These show that the SSE floating point speeds are somewhat faster than tests using i87 instructions, except where performance becomes dependent on memory speed. MB/second rates using 64-Bit integers can be much faster than at 32-Bits, firstly, as the CPU can execute both types of instructions at the same speed and, secondly, as more registers are available for optimisation. The number of measurements at 32-Bits is limited as the full 8 GB of RAM cannot be recognised.

  get_phys_pages() and size - RAM Size  7.81 GB

   MP Memory Reading Speed Test 64 Bit Version 1 Using 4 Threads

           Start of test Tue Jun  7 11:32:54 2011

  Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]           x[m]=y[m]
  KBytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  18253  12913  18066  18667  14409  27989  14221  11201  14643
       8  26068  17651  25448  29410  20679  39706  21694  16463  22343
      16  39834  25431  38289  47167  29614  57648  21134  21446  35492
      32  50545  29588  47840  65341  37153  76190  46231  24825  45564
      65  57307  31898  53119  71962  40593  86253  48212  25858  48779
     131  64285  33405  56454  83929  42824  93889  51601  26317  52109
     262  65111  32902  58563  85910  43904  96199  52272  26517  52156
     524  58699  32056  53683  67177  39149  66647  44137  26617  44123
    1048  59967  32531  53638  67332  39808  67046  43401  26310  44172
    2097  48409  31709  51453  59630  37829  59008  32561  25079  32687
    4194  36529  27079  37052  37380  32077  37280  18682  18694  18732
    8388  27898  21163  25293  27011  23273  27070  14253  12800  13768
   16777   9006   8909   8869   8978   8806   9023   4488   4462   4516
   33554   8946   8875   8887   8606   8855   8921   4525   4497   4508
   67108   8717   8458   8325   8516   8452   8756   4287   4366   4379
  134217   8688   8339   8362   8696   8473   8698   4276   4357   4355
  268435   8703   8608   8516   8659   8393   8648   4280   4268   4328
  536870   8700   8591   8421   8673   8514   8690   4308   4290   4264
 1073741   8596   8471   8584   8628   8619   8698   4398   4329   4395
 2147483   8825   8790   8835   8790   8763   8842   4397   4402   4468

 No errors found

           End of test Tue Jun  7 11:33:52 2011

 ##############################################

  get_phys_pages() and size - RAM Size  3.20 GB

   MP Memory Reading Speed Test 32 Bit Version 1 Using 4 Threads

           Start of test Tue Jun  7 14:23:02 2011

  Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]           x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  15704  11347  10961  17813  12518  15904  13744   8714   8758
       8  24188  15367  14929  26770  17870  21025  20789  10866  10234
      16  33319  19229  18266  38724  23589  23124  31390  13114  13157
      32  40697  20675  21180  51120  27260  25282  39385  13921  13960
      65  45013  22913  22267  57143  30132  24875  42247  14314  14241
     131  45569  23573  22953  61979  31356  27585  44688  14427  13289
     262  48701  23759  22666  63235  32103  27892  44447  14200  14453
     524  44900  22996  20417  53167  30753  25832  36085  14671  13403
    1048  44929  23357  20300  54596  30302  25790  36207  14708  13590
    2097  42017  22864  20927  42429  28809  24778  26734  13125  12659
    4194  34909  20379  19542  36402  25268  21093  18592  12625  12821
    8388  22498  17592  17006  23354  19577  18854  12489   9400   9657
   16777   8906   8697   8781   8884   8841   8844   4433   4217   4440
   33554   8848   8684   8606   8877   8436   8843   4412   4293   4422
   67108   8423   8445   8433   8685   8506   8526   4228   4296   4273
  134217   8704   8453   8572   8563   8426   8485   4383   4303   4346
  268435   8623   8579   8539   8731   8652   8612   4408   4301   4322
  536870   8683   8331   8534   8724   8658   8444   4371   4330   4325

 No errors found

           End of test Tue Jun  7 14:24:05 2011


MP Memory Speeds Comparison

Below are 64-Bit results on the 3.0 GHz quad core Phenom and 2.4 GHz Core 2 Duo for the multiply and add tests using 1 CPU, all CPUs and with 64 threads. On the single thread tests, although the speeds are dependent on CPU GHz, variations generally reflect cache sizes. The full gain in throughput through using more than one CPU is not achieved at the lower data sizes, mainly due to higher overheads. For example, at 4 KB there are two arrays of 2 KB, producing a segment of 32 bytes for each of 64 threads. There are significant additional performance gains using an increasing number of threads with mid to large data sets. This is due to the relatively small data segments being repetitively processed from a lower level cache.
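The segment size example above (4 KB, two arrays, 64 threads, 32 bytes each) can be checked with a short C sketch. The structure and function names are hypothetical and an even split between threads is assumed; the benchmark's own partitioning code is not shown in this document.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t first_byte;  /* offset of the segment within each array */
    size_t bytes;       /* segment length in bytes */
} segment;

/* Split each of the two arrays evenly between the threads.
 * total_kb is the "KBytes Used" figure, covering both arrays. */
segment thread_segment(size_t total_kb, unsigned threads, unsigned thread_no)
{
    assert(threads > 0 && thread_no < threads);
    size_t array_bytes = total_kb * 1024 / 2;   /* two arrays, x and y */
    segment s;
    s.bytes = array_bytes / threads;            /* assumes an even split */
    s.first_byte = s.bytes * thread_no;
    return s;
}
```

For 4 KB and 64 threads this gives 32 byte segments, so each thread has very little work per pass relative to the cost of starting and synchronising it.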

Next are results on the Netbook with an Atom CPU running via 64-Bit Ubuntu 11.04. It can be seen that Hyperthreading provides significant gains in throughput using floating point instructions.

Later are speeds obtained on a Core i7 4820K, with Hyperthreading on 4 cores, normally running at 3.9 GHz Turbo Boost speed. It has 32 GB RAM on 4 memory channels, with a maximum speed of 51.2 GB/second. The original benchmark again shows slower MP speeds via data in L1 cache, with maximum via other caches. Then MP speed using RAM was quite good at up to 25.4 GB/second. As indicated earlier, only SISD instructions were used, where a single thread only produced up to one result per clock cycle (3.9 GFLOPS) and MP speed was 3.8 times faster.

Results for a recompiled version, using AVX directives, are also shown. Of special note, performance of the single threaded AVX version is often worse than that without AVX. Full SIMD AVX instructions are implemented, but numerous extra instructions are used, such as shuffle, unpack and insert (4 vector multiply, 4 vector add, 80 other vector instructions), possibly needed to allow for an unknown number of threads. At least the multithreaded speeds can be four times that of a single threaded run, and twice as fast using RAM based data.

 Quad Core Phenom - Caches L1 64 KB/CPU, L2 512 KB/CPU, L3 6144 KB shared

 Commands  ./MPmemspeed64 Threads 1
           ./MPmemspeed64
           ./MPmemspeed64 Threads 64

                1 thread              4 threads             64 threads
  KBytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  16834   8588  16753  18253  12913  18066  17040  12902  21602
       8  17196   8671  16039  26068  17651  25448  30712  18072  31164
      16  17188   8670  18148  39834  25431  38289  34169  21795  41192
      32  17314   8703  16669  50545  29588  47840  39691  21845  35582
      65  15023   8634  15211  57307  31898  53119  45213  25520  52872
     131  15274   8155  13675  64285  33405  56454  53787  29396  58835
     262  15335   8143  13508  65111  32902  58563  56811  31077  59534
     524  14512   8013  13242  58699  32056  53683  62895  32310  58189
    1048  10911   7355  10791  59967  32531  53638  64335  32520  60720
    2097  10909   7350  10784  48409  31709  51453  65249  32553  59855
    4194  10561   7169  10411  36529  27079  37052  65800  32975  57665
    8388   6642   5610   6315  27898  21163  25293  57438  30135  52237
   16777   6128   5410   5853   9006   8909   8869  57345  30844  50350
   33554   6311   5427   5677   8946   8875   8887  54636  30033  49051
   67108   5789   5160   5698   8717   8458   8325  31347  24887  30860
  134217   5969   5138   5908   8688   8339   8362  26846  19845  26308
  268435   5922   5449   5779   8703   8608   8516   8510   8359   8558
  536870   6121   5090   5811   8700   8591   8421   8638   8525   8387
 1073741   6020   5481   5832   8596   8471   8584   8569   8334   8422
 2147483   6264   5630   6028   8825   8790   8835   8834   8699   8625

 Core 2 Duo - Caches L1 32 KB/CPU, L2 4096 KB shared

                1 thread              2 threads             64 threads
  Kbytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4  12572   6264  12613  19400  10834  20763   6272   4371   8553
       8  12534   6337  12613  22252  11696  22188   5644   3760   6362
      16  12639   6364  12692  23343  12208  23485  10681   6790  15783
      32  12468   6318  12483  24406  12409  24317  13725   8918  15230
      65   9434   5733   9453  24573  12548  24520  17721  11117  18778
     131   9584   5770   9612  18347  11708  17212  21622  11731  22935
     262   9645   5824   9648  18389  11700  18191  23737  11938  23593
     524   9674   5834   9667  18286  11635  18153  24032  12329  24204
    1048   9696   5843   9684  18324  11663  18206  24579  12484  24682
    2097   9548   5777   9623  18326  11593  18203  24377  12261  24413
    4194   8188   5520   8302  13296  10585  13927  17971  11380  17734
    8388   4381   4219   4407   4701   4597   4664  17905  11450  17698
   16777   3788   3830   3847   3948   3921   3944  17803  11413  17646
   33554   3817   3827   3806   3903   3868   3893  17580  11294  17428
   67108   3845   3872   3856   3908   3888   3917  16531  10438  14364
  134217   3876   3856   3798   3886   3918   3922   9007   8009   9152
  268435   3885   3894   3889   3893   3885   3882   4092   4088   4102
  536870   3827   3816   3829   3923   3922   3918   3900   3893   3887

 Atom 1 CPU with Hyperthreading - Caches L1 24 KB, L2 512 KB

  get_phys_pages() and size - RAM Size  0.96 GB, Page Size 4096 Bytes
  uname() - Linux, roy-Ubuntu-11, 2.6.38-8-generic
  #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011, x86_64

                1 thread              2 threads             64 threads
  Kbytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   2112   1149   5817   3849   2085   6203   3329   1896   5237
       8   2117   1148   5922   3254   2106   6284   3519   1939   5653
      16   2113   1150   5938   3870   2085   6307   3608   1970   5882
      32   1742   1050   3764   3182   1879   4926   3682   1935   6005
      65   1770    869   3863   3157   1908   4728   3571   1975   5912
     131   1804   1050   3869   3173   1905   4652   3619   1992   6052
     262   1802   1043   3833   3181   1900   4711   3597   1977   6104
     524   1731   1017   3575   3021   1838   4440   3633   2002   6000
    1048   1656    983   2622   2668   1746   2656   3035   1787   4710
    2097   1652    962   2222   2087   1769   2059   3043   1822   4516
    4194   1655    946   2123   1969   1755   1950   3023   1807   4528
    8388   1538    981   2128   1958   1747   1956   2956   1788   4483
   16777   1571    979   2118   1948   1769   1948   2788   1711   3364
   33554   1661    969   2119   1957   1760   1921   1978   1652   2170
   67108   1606    986   2176   1930   1762   1951   1763   1644   1929
  134217   1660    975   2149   1932   1747   1966   1882   1672   1692

 ------------------------------------------------------------------------

 3.7 GHz Core i7 4820K

  Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]           x[m]=y[m]
  KBytes   Dble   Sngl  Int64   Dble   Sngl  Int64   Dble   Sngl  Int64
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

 64 Bit SSE 1 Thread

       8  30422  15420  27836  40569  20504  34929  19735   9939  19614  L1
      16  30754  15503  28069  41065  20688  35335  19977  10011  19895
     131  28955  15286  27122  34323  20476  30926  20086  10078  20069  L2
     262  28741  15287  27017  33579  20373  30875  19760  10060  19758
    2097  24424  15207  23963  26358  19342  25851  14665   9648  14824  L3
    4194  24408  14253  23951  26355  19366  25531  14655   9334  14821
  536870  14386  11824  14302  14704  13442  14732   7652   7426   7416  RAM
 1073741  14452  11468  14715  14861  13439  15189   7348   7828   7394

 Max GFLOPS 3.8    3.9           2.6    2.6

 64 Bit SSE 8 Threads

       8  52063  33134  52441  60254  36470  57610  45343  30815  45180  L1
      16  69122  44818  65876  82924  46075  69534  57760  36707  57120
     131 115996  53402 102715 140036  76202 116053  68671  35659  72891  L2
     262 113644  60777 104488 132590  81609 111435  72205  37232  72061
    2097  95433  58470  99412 115476  72032 109176  60839  36350  56557  L3
    4194  98608  57900 102912 105228  78041 106928  59517  36122  58749
  536870  25054  24707  24623  24592  25130  25430  12805  11899  11850  RAM
 1073741  25402  25735  24886  25412  25128  24711  12662  12367  12617

 Max GFLOPS14.5   15.2           8.8   10.2

 64 Bit AVX 1 Thread

       8  16874  10136  29980  60061  59981  73657  39651  39127  39645  L1
      16  16901  10137  30260  61288  61171  78154  40569  40385  40608
     131  16891  10113  28484  48351  48134  47964  29242  29294  29285  L2
     262  16845  10094  27490  45215  44739  46562  27383  27371  27377
    2097  16725   9985  23943  30302  30323  30669  16994  16576  17163  L3
    4194  16549   9912  23767  30294  30231  30720  16977  16461  17136
  536870  11805   8651  14877  13817  13983  14446   7545   7081   7203  RAM
 1073741  12168   8818  14636  14692  14973  14163   7329   7393   7202

 Max GFLOPS 2.1    2.5           3.8    7.6

 64 Bit AVX 8 Threads

       8  45769  32166  52015  76043  69787  61836  64106  57154  59198  L1
      16  53369  37318  67629 126171 116852  72240  83390  81532  80529
     131  64036  39272 110665 216488 221720 269585 149667 148116 148231  L2
     262  67037  39317 114213 194473 193210 203301 115279 114005 118131
    2097  65163  37914 102215 115502 125560 127970  65827  69361  69114  L3
    4194  62412  40752  94501 123285 110270 118960  63359  64405  64734
  536870  24650  23942  25309  25072  25428  25601  12789  12471  12756  RAM
 1073741  25124  24729  24310  24854  24576  25209  12612  12415  12524

 Max GFLOPS 8.6   10.2          14.4   27.7


MP Memory Bus Speed

MPbusspeed64 and MPbusspeed32 are based on my old BusSpeed2K Benchmark and are essentially the same as the Windows Multithreading Version. Data is read using AND instructions at a range of data sizes covering caches and RAM. The program starts by reading words with 32 word address increments, then reduces the increment to eventually read all words sequentially. Speed reductions of around 50% at each higher increment suggest reading in bursts over the bus. This is normal for reading from RAM and is sometimes found reading cached data. The final results use SSE2 integer AND instructions to read the data into the 16 byte register, the 32 bit and 64 bit procedures using the same assembly code. Note that, except for SSE2 tests, the CPU can execute 64 bit integer calculations at the same speed as those at 32 bits, resulting in higher MB/second at 64 bits.
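The reading pattern described above can be illustrated in C. This is only a sketch of the idea under stated assumptions: the function names are hypothetical, a plain loop is used, and the benchmark's real loops (including any unrolling and the SSE2 assembly code) are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AND together every inc-th 64 bit word, so inc = 32 touches one word
 * in 32 and inc = 1 reads all words sequentially. */
uint64_t and_read(const uint64_t *data, size_t words, size_t inc)
{
    assert(data != NULL && inc > 0);
    uint64_t result = UINT64_MAX;
    for (size_t i = 0; i < words; i += inc)
        result &= data[i];
    return result;
}

/* Number of words actually requested for a given increment. */
size_t words_requested(size_t words, size_t inc)
{
    assert(inc > 0);
    return (words + inc - 1) / inc;
}
```

Halving the increment doubles the number of words requested from the same memory area. If whole bursts are fetched regardless of how many words are used, MB/second based on requested words roughly doubles at each step, which is the pattern the text refers to.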

Again there is an input parameter to use 1, 2, 4, 8, 16, 32 or 64 threads, the default being the number of identified CPUs. This time, each thread reads all the data. Results are saved in busSpeedMP.txt, examples below being for the 3.0 GHz quad core Phenom. Using L1 and L2 caches, data transfer speed with 64 bit integers is around twice as fast as using 32 bit numbers, suggesting a CPU speed limitation. From burst reading, estimates of maximum RAM speed are 904 x 8 = 7232 MB/second and 448 x 16 = 7168 MB/s. Cache examples are - L2 2989 x 8 = 23912 MB/s and L3 1432 x 8 = 11456 MB/s.
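The burst reading estimates above are simple products, shown here as a C sketch with a hypothetical function name. The interpretation is an assumption on my part: 8 words of 8 bytes and 16 words of 4 bytes both span 64 bytes, so each requested word is taken to cause a 64 byte transfer.

```c
#include <assert.h>

/* Estimate the underlying transfer rate when one word in every
 * inc_words is requested but a whole burst is assumed to be fetched. */
unsigned long burst_estimate_mbps(unsigned long measured_mbps,
                                  unsigned inc_words)
{
    assert(inc_words > 0);
    return measured_mbps * inc_words;
}
```

Using the figures quoted in the text, burst_estimate_mbps(904, 8) gives 7232 and burst_estimate_mbps(448, 16) gives 7168 for RAM, with 23912 and 11456 for the L2 and L3 examples.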

Later additions are for a 3.7 GHz Core i7 4820K CPU, normally working at Turbo Boost speed of 3.9 GHz, with 4 cores and Hyperthreading, plus a 10 MB L3 cache. The system has four memory channels, with maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width) or 51.2 GB/second. With the original benchmark, the same data is shared by all threads, with each starting reading at the beginning. With this arrangement, and a large L3 cache, multiple threads can read from there, leading to what seem like impossible memory data transfer speeds. For Version 2 of the benchmarks [MPbusspeed64V2 and 32V2], data is still shared but each thread is started at a different point (for example, with 4 threads there are 4 starting points 25% apart). The result is more realistic memory speeds but higher overheads affecting early tests. The i7 results also further demonstrate that multiple threads are required to produce speeds approaching the maximum specification. The benchmarks and source code are included in the tar.gz file.
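The Version 2 arrangement of staggered starting points can be sketched in C as follows. The function names are hypothetical, and wrapping round to the start of the data, so that each thread still reads all of it, is an assumption rather than something stated in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Starting word for a thread, with starting points spread evenly
 * through the shared data (4 threads give points 25% apart). */
size_t start_word_v2(size_t words, unsigned threads, unsigned thread_no)
{
    assert(threads > 0 && thread_no < threads);
    return (words / threads) * thread_no;
}

/* Position of the i-th word read by a thread, wrapping at the end
 * of the data (assumed behaviour). */
size_t word_index(size_t start, size_t i, size_t words)
{
    assert(words > 0);
    return (start + i) % words;
}
```

With 1000 words and 4 threads the starting points are 0, 250, 500 and 750, so at any moment the threads are working in different parts of the data and are less likely to be served by a line that another thread has just brought into the shared L3 cache.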

Note that Version 1 is still valid, representing an application where multiple threads search data for different values. In order to show that the latest results are representative of maximum RAM speeds, eight copies of BusSpeed RAM reliability tests [IntBurn64] were run at the same time. Results are shown below.

                        3.0 GHz quad core Phenom

   MP Bus Speeds 64 bit Version 1.0, 1 Threads, Fri Jun 17 16:51:46 2011

  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

       6    22196    22641    26239    26979    26580    26409    23762
      24    23507    23118    27665    27323    27081    25424    23850
      96     2959     2964     2989     5987    11983    21134    23868
     384     2917     2917     2869     5853    11732    21898    23264
     768     1362     1359     1352     2699     5408    10617    10803
    1536     1322     1293     1432     2856     5764    11098    12081
   16380      862      886      902     1745     3637     6019     7249
  131070      854      885      902     1777     3431     5853     6619
  393210      858      830      904     1757     3602     5995     7074

        64 bit words, Speed in MB/Second - MIPS divide by 8

                 End at Fri Jun 17 16:52:21 2011

   MP Bus Speeds 32 bit Version 1.0, 1 Threads, Fri Jun 17 16:45:12 2011

  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

       6     8318    12102    12848    13317    13413    13452    23033
      24     9876    12979    13877    13288    13436    13640    23876
      96     1495     1496     2675     5979    11081    13201    23852
     384     1209     1238     2454     4946     8816    12459    18874
     768      721      726     1480     3046     5638     9741    11846
    1536      699      699     1513     3032     5708     9722    11924
   16380      413      423      860     1805     2993     5022     7075
  131070      411      444      841     1793     3024     4881     7060
  393210      427      448      887     1826     3052     4946     6897

        32 bit words, Speed in MB/Second - MIPS divide by 4

                 End at Fri Jun 17 16:45:47 2011

             3.7 GHz Core i7 - 4 Cores, 8 Threads, Original
          Memory Maximum Speed Specification 51.2 GB/second

   MP Bus Speeds 64 bit Version 1.0, 8 Threads, Sun Nov 23 10:37:14 2014

  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

       6    91241    82686    89258   121404   128237   134549   232859  L1 x 4
      24    90260    96688    94783   131484   133668   150865   236821
      96    45332    45942    46346    67232   107906   121757   219034  L2 x 4
     384    20465    28119    29021    51632    76028   119540   157096  L3 x 1
     768    20462    25185    19519    37238    66783   110456   153193
    1536    19354    22512    21843    35804    71371   112147   142987
   16380     6771     8432    10752    16948    40643    73808    65614  RAM
  131070     4030     5140     5916    12464    25868    42665    56575
  393210     3182     3971     5796    11936    24963    49436    52118

        64 bit words, Speed in MB/Second - MIPS divide by 8

                 End at Sun Nov 23 10:37:44 2014

             3.7 GHz Core i7 - 4 Cores, 8 Threads, Revised

   MP Bus Speeds 64 bit Version 2.0, 8 Threads, Sun Nov 23 10:45:05 2014
   Same as Version 1.0, except each thread starts at different address

  Kbytes Inc32wds Inc16wds  Inc8wds  Inc4wds  Inc2wds  ReadAll 128bSSE2

       6    12369    12496    21021    39879    87865   105814   182821
      24    45987    46458    77473    88435   123606   122993   260009
      96    39888    43919    48699    79316   117534   139709   242467
     384    24258    19148    25700    46969    86177   117704   163429
     768    20185    22569    19117    37035    80091   103124   175760
    1536    20967    21061    20910    39814    77704   116114   156833
   16380     5780     6537    10217    18113    30917    79641    80375
  131070     3073     3818     4822    10452    20528    38770    46567
  393210     2348     3147     4793    11090    20306    39280    38707
  786420     2152     3062     4834    10111    19119    37950    39038
 1572840     2061     2987     4794     9738    19272    37597    38428

        64 bit words, Speed in MB/Second - MIPS divide by 8

                 End at Sun Nov 23 10:46:09 2014

                8 BusSpeed Programs 40 MB Each

  4891 + 4992 + 4917 + 4854 + 4772 + 4840 + 4908 + 4933 = 39037 MB/second


MPbusSpeed Comparisons

Below are cache and RAM speeds obtained on the Quad Core 3.0 GHz Phenom and 3.7 GHz Core i7, the 2.4 GHz Core 2 Duo and the 1.66 GHz Atom. These are for 1, 2 or 4 (based on identified CPUs) and 64 threads. Comparison of performance gains using multiple threads and 64 bit compilations versus 32 bit are best limited to the last two columns as the earlier burst reading tests produce all sorts of timing peculiarities. Although measured RAM speeds could be improved with multiple threads reading shared data from caches, those provided have mainly been confirmed with programs using dedicated data.

Core i7 - results for 8 threads are included, often slower than with 4 threads. RAM speed at 384 MB is also shown, to show possible effects of the large L3 cache.

Phenom - some performance gains from cached data were only 3 times with 4 threads but nearer 4 times with 64 threads. Using 4 or 64 threads can double throughput on memory bus. Maximum measured speed was 15.8 GB/second, compared with specification of 21.3 GB/second, comprising 667 MHz x 2 (DDR) x 8 (bus width) x 2 (dual channel).

Core 2 Duo - generally achieved peak performance using two threads and, if anything, was slightly slower using a concurrency of 64. Compared with the Phenom, some tests came out much faster and others significantly slower.

Atom - Hyperthreading had little impact on performance via L1 cache but did provide improvements via L2 cache and RAM.
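The memory specification arithmetic quoted for the Phenom above, and for the Core i7 earlier, follows the same product of bus speed, transfers per clock, bus width and channels. A minimal C sketch, with hypothetical function names:

```c
#include <assert.h>

/* Theoretical peak memory bandwidth in MB/second. */
double peak_mb_per_sec(double bus_mhz, unsigned transfers_per_clock,
                       unsigned bus_bytes, unsigned channels)
{
    assert(bus_mhz > 0.0);
    return bus_mhz * transfers_per_clock * bus_bytes * channels;
}

/* Measured speed as a fraction of the specification. */
double fraction_of_peak(double measured_mb_per_sec, double peak)
{
    assert(peak > 0.0);
    return measured_mb_per_sec / peak;
}
```

peak_mb_per_sec(667.0, 2, 8, 2) gives 21344 MB/second (21.3 GB/second) for the Phenom and peak_mb_per_sec(800.0, 2, 8, 4) gives 51200 MB/second (51.2 GB/second) for the Core i7. The Phenom's 15.8 GB/second maximum is then about 74% of its specification.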

L1 Cache Results in MBytes/Second - 6 KB
                  CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b      4/4   3700   14199   15008   18856   21606   21604   20462   61577
   4 Threads                   44729   50128   67114   79464   82422   80118  200833
   8 Threads                   43681   49006   68593   82414   84016   90261  357561
  64 Threads                   53262   55670   82638   90202   93360   93827  347210
 Core i7 V2 32b   4/4   3700   11417   13524   17525   19258   19598   18921   61579
   4 Threads                   10372   17363   38891   43445   50634   62156  238316
   8 Threads                    5470    8950   20258   35494   40223   63398  280884
  64 Threads                     868    1319    2922    5015    9176   18896   38113
 Phenom II 32b    4/0   3000    8318   12102   12848   13317   13413   13452   23033
   4 Threads                    3901    7614   14703   28644   29313   34882   74424
  64 Threads                   10098   14599   20588   30042   32606   37702   76707
 Core 2 Duo 32b   2/0   2400    8069    8772    9036    9283    9369    9390   37413
   2 Threads                   13921   16387   16361   16996   18147   18380   61694
  64 Threads                   15315   16271   17162   17587   17848   18004   60044
 Atom N455 32b    1/1   1667    5092    5813    5959    6272    6290    6289   24663
   2 Threads                    5638    6157    6323    6450    6439    6488   25608
  64 Threads                    5011    5655    5609    5785    5867    5610   22419

 64 Bit Version

 Core i7 64b      4/4   3700   31210   31251   31251   42437   43392   43667   61539
   4 Threads                   76609   51101   75602  140546  104501  167163  205782
   8 Threads                   91241   82686   89258  121404  128237  134549  232859
  64 Threads                  114113  112881  116740  179847  190265  192408  341135
 Core i7 V2 64b   4/4   3700   31501   31266   31243   41117   36617   41277   61526
   4 Threads                   28749   29616   58739   64451   61610  129160  231735
   8 Threads                   12369   12496   21021   39879   87865  105814  182821
  64 Threads                    1961    1981    3954    7413   15514   31971   35788
 Phenom II 64b    4/0   3000   22196   22641   26239   26979   26580   26409   23762
   4 Threads                    4478   17301   15108   29950   58038   58706   76500
  64 Threads                   19577   43542   35782   52893   74743   73745   76027
 Core 2 Duo 64b   2/0   2400   15931   17513   18140   18542   18715   18813   37391
   2 Threads                   30486   32243   35126   36209   35493   35615   73979
  64 Threads                   29474   32288   33585   34516   34670   35792   71354
 Atom N455 64b    1/1   1667    9004   10592   11500   12051   12565   12735   24731
   2 Threads                   10224   11743   12283   12767   12948   13053   25668
  64 Threads                    8574   10100   11089   10638   11905   11649   21789
L2 Cache Results in MBytes/Second - 96 KB
                  CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b      4/4   3700    7165    7133   10898   16546   20138   20473   60657
   4 Threads                   23189   23575   40872   62006   78956   81301  241144
   8 Threads                   26914   29097   47495   67700   79412   89292  322563
  64 Threads                   33001   35645   53618   72906   85615   91768  318366
 Core i7 V2 32b   4/4   3700    6535    6897   10249   15313   18679   19052   60683
   4 Threads                   21398   22214   36104   55228   68446   72179  241137
   8 Threads                   22385   25289   42997   60219   68834   75464  322513
  64 Threads                   12054   15849   34363   38072   59987   69942  425475
 Phenom II 32b    4/0   3000    1495    1496    2675    5979   11081   13201   23852
   4 Threads                    4648    5085    8422   19230   33948   39486   74050
  64 Threads                    5247    5317   10388   21080   36499   45574   89766
 Core 2 Duo 32b   2/0   2400    2065    2044    3275    4562    6700    8095   19153
   2 Threads                    3138    3050    5218    7776   11911   15415   32030
  64 Threads                    3099    2963    5209    7617   11721   15081   31010
 Atom N455 32b    1/1   1667     505     415     788    1481    2577    3915    5909
   2 Threads                     597     665    1243    2285    3657    4660    8598
  64 Threads                     455     534     996    1861    3133    4100    7417

 64 Bit Version

 Core i7 64b      4/4   3700   13596   14421   15252   24216   33143   40470   60420
   4 Threads                   41962   40737   43299   73311  123250  160399  240730
   8 Threads                   45332   45942   46346   67232  107906  121757  219034
  64 Threads                   64186   66500   65554  102211  146587  170442  293072
 Core i7 V2 64b   4/4   3700   12815   13950   14344   23717   32150   39251   59683
   4 Threads                   39170   40423   42705   76442  110895  154667  240928
   8 Threads                   39888   43919   48699   79316  117534  139709  242467
  64 Threads                   23157   24129   40748   86708  101514  163061  397091
 Phenom II 64b    4/0   3000    2959    2964    2989    5987   11983   21134   23868
   4 Threads                    4478   17301   15108   29950   58038   58706   76500
  64 Threads                   10290   10607   10548   20940   40361   75398   89396
 Core 2 Duo 64b   2/0   2400    4171    4170    4098    6715    9120   13430   19157
   2 Threads                    6322    6435    6127   10810   15448   23521   32104
  64 Threads                    6090    6237    5972   10720   15200   23342   31444
 Atom N455 64b    1/1   1667     993    1020     833    1564    2954    5156    5919
   2 Threads                    1066    1127    1386    2501    4608    7240    8722
  64 Threads                    1003    1033    1055    1991    3674    6250    7222
L3 Cache Results in MBytes/Second - 1536 KB
                  CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b      4/4   3700    2735    2792    5450    9694   16764   20399   38274
   4 Threads                    9816   10382   19816   36661   62386   81197  151963
   8 Threads                   16717   17096   32534   54967   75466   86510  234195
  64 Threads                   14769   15185   28392   48591   71997   88559  213647
 Core i7 V2 32b   4/4   3700    2735    2793    5445    9573   16373   18990   38263
   4 Threads                    9802   10371   19738   35609   61112   73520  150464
   8 Threads                   15355   15282   27981   51300   66966   75178  236662
  64 Threads                   13224   14216   28137   46946   65679   74795  219197
 Phenom II 32b    4/0   3000     699     699    1513    3032    5708    9722   11924
   4 Threads                    2407    2543    4943   10058   17570   29261   41159
  64 Threads                    2541    2571    5022   10100   18811   30768   41078

 64 Bit Version

 Core i7 64b      4/4   3700    5316    5421    5554   10925   19443   33253   38363
   4 Threads                   19103   19854   20683   39137   54701  127196  152980
   8 Threads                   19354   22512   21843   35804   71371  112147  142987
  64 Threads                   31324   32675   29819   51003   95378  146538  203624
 Core i7 V2 64b   4/4   3700    5269    5367    5499   10811   19434   33779   38372
   4 Threads                   19086   19296   20661   39791   73469  120135  152311
   8 Threads                   20967   21061   20910   39814   77704  116114  156833
  64 Threads                   29225   31754   29372   50516   90849  147624  197627
 Phenom II 64b    4/0   3000    1322    1293    1432    2856    5764   11098   12081
   4 Threads                    5112    4899    5083   10018   19866   37136   38700
  64 Threads                    5092    5101    5101   10051   20203   37850   41309
RAM Results in MBytes/Second - 128 MB
                  CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b      4/4   3700     731    1040    2188    4264    8773   15159   17640
   4 Threads                    3420    4173    7526   14623   28847   52396   61043
   8 Threads                    4388    5953   11378   22221   41937   69685   94501
  64 Threads                    3991    4486    8871   17654   34047   59458   67917
 Core i7 V2 32b   4/4   3700     737    1048    2236    4313    9022   15072   18058
   4 Threads                    1526    2510    4964    9880   19395   38001   40618
   8 Threads                    1488    2410    5166    9560   18878   37571   39464
  64 Threads                    2892    4051    8141   17176   29628   53519   68451
 Phenom II 32b    4/0   3000     411     444     841    1793    3024    4881    7060
   4 Threads                     786     813    1605    3444    6259   12161   14950
  64 Threads                     891     969    1869    3678    6930   12678   15564
 Core 2 Duo 32b   2/0   2400     353     399     814    1467    2725    5021    5842
   2 Threads                     621     808    1181    2217    4108    7686    9952
  64 Threads                     395     598    1080    1947    3541    6540    8133
 Atom N455 32b    1/1   1667     122     256     514    1029    1978    3256    4122
   2 Threads                     131     318     684    1312    2434    4189    5307
  64 Threads                     125     265     577    1159    2280    4435    4636

 64 Bit Version

 Core i7 64b      4/4   3700    1229    1470    2054    4514    8754   18043   18094
   4 Threads                    5901    6947    8368   15029   29096   51843   61776
   8 Threads                    4030    5140    5916   12464   25868   42665   56575
  64 Threads                    4262    5272    7059   14648   29005   58071   58266
 Core i7 V2 64b   4/4   3700    1226    1484    2096    4411    8462   18188   18382
   4 Threads                    2038    3108    5197   10201   20004   38092   40726
   8 Threads                    3073    3818    4822   10452   20528   38770   46567
  64 Threads                    3020    3618    5513   11215   21773   44187   44666
 Phenom II 64b    4/0   3000     858     830     904    1757    3602    5995    7074
   4 Threads                    1561    1648    1701    3330    6964   13025   14027
  64 Threads                    1808    1854    1947    3773    7488   14176   15516
 Core 2 Duo 64b   2/0   2400     699     711     803    1622    2946    5414    5861
   2 Threads                    1210    1336    1632    2436    4635    7947   10028
  64 Threads                     706     813    1226    2177    3919    7014    8127
 Atom N455 64b    1/1   1667     125     256     514    1038    2024    3994    4057
   2 Threads                     136     294     677    1327    2530    4924    5230
  64 Threads                     142     249     523    1129    2312    4565    4639
RAM Results in MBytes/Second - 384 MB
                  CPUs/  MHz     Inc     Inc     Inc     Inc     Inc    Read    128b
                  HTs          32wds   16wds    8wds    4wds    2wds     All    SSE2

 Core i7 32b      4/4   3700     740    1043    2191    4267    8785   15164   17652
   4 Threads                    3425    4187    7525   14625   28846   52165   61051
   8 Threads                    3841    5504   11461   21286   39250   69341   85998
  64 Threads                    2115    3304    6884   13623   27005   50799   56297
 Core i7 V2 32b   4/4   3700     738    1049    2237    4314    9041   15058   18065
   4 Threads                    1507    2496    4928    9821   19225   37965   40423
   8 Threads                    1481    2393    4786    9488   18923   37398   38729
  64 Threads                    1712    2633    5189   10464   20499   40335   41751
 Core i7 64b      4/4   3700    2823    3214    4190    8287   16053   32489   33564
   4 Threads                    5909    5426    8370   12684   29097   58307   59609
   8 Threads                    3182    3971    5796   11936   24963   49436   52118
  64 Threads                    2723    3732    5896   12059   24079   46997   47480
 Core i7 V2 64b   4/4   3700    1724    2609    3858    8029   15699   32236   32708
   4 Threads                    2090    3101    5072    9867   19538   39489 L 39824
   8 Threads                    2348    3147    4793   11090   20306   39280 L 38707
  64 Threads                    2592    3168    5019    9857   20329   40428 M 40392

 L Less L3 cache effects, M More L3 cache effects


MP Memory Random Access

MPrandmem64 and MPrandmem32 are based on my old RandMem Benchmark and are essentially the same as the Windows Multithreading Version, except there are added tests identified as Mutex SRW and Mutex RRW. The program uses the same code for serial and random access via a complex indexing structure and comprises Read (RD) and Read/Write (RW) tests. They are run to use data from L1 cache, L2 cache and RAM using 1 to 64 threads, using a run time parameter, the default being equal to the number of identified CPUs. This benchmark uses data from the same array for all threads, but starting at different points. Results are saved in file randmemMP.txt. In this case, both the 64 bit and 32 bit versions use 32 bit integer data arrays.
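One way of using the same code for serial and random access is to read the data through a table of positions, so that only the table contents differ between the two tests. The C sketch below illustrates that idea with 32 bit integers. It is an assumption for illustration only: all names are hypothetical and the benchmark's own indexing structure, described above as complex, is not reproduced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Visit order 0, 1, 2 ... n-1. */
void fill_serial(uint32_t *order, size_t n)
{
    for (size_t i = 0; i < n; i++)
        order[i] = (uint32_t)i;
}

/* Visit order as a pseudo-random permutation (Fisher-Yates shuffle
 * driven by a small linear congruential generator, so runs repeat). */
void fill_random(uint32_t *order, size_t n, uint32_t seed)
{
    fill_serial(order, n);
    uint32_t state = seed;
    for (size_t i = n; i > 1; i--) {
        state = state * 1664525u + 1013904223u;
        size_t j = (state >> 8) % i;
        uint32_t tmp = order[i - 1];
        order[i - 1] = order[j];
        order[j] = tmp;
    }
}

/* One read loop serves both tests; only the contents of order differ. */
uint32_t read_pass(const uint32_t *data, const uint32_t *order, size_t n)
{
    assert(data != NULL && order != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[order[i]];
    return sum;
}
```

Both orders visit every word exactly once, so the returned sums match and can double as a consistency check, while the memory access pattern is sequential in one case and scattered in the other.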

Below are logged results on a 3.0 GHz quad core Phenom using one and four threads. On serial and random read only tests, performance gains are up to four times using dedicated caches, with the smaller data sizes slower due to overheads. Random reading is much slower than serial data transfers where burst reading leads to more data being transferred than is requested. With reading and writing, there is a possibility that data can be corrupted when more than one thread updates the same data. Although it cannot be proven with this benchmark, it seems that shared data cannot be updated independently in local caches, which are flushed so that updates take place in shared data areas (on these CPUs that would normally be the work of the hardware cache coherency mechanism rather than the Operating System), producing significant performance degradation, particularly on random access.

The extra tests with Mutex, or mutual exclusion, functions avoid the updating conflict by only allowing one thread at a time to access common data. This can still lead to using four threads being slower than one but, with random access, there can be a significant improvement compared with untethered multiple thread speeds, except when accessing RAM.
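The mutex arrangement described above can be sketched with POSIX threads. This is a minimal illustration, not the benchmark's code: the array size, the names and the choice of holding the lock for a whole pass are all assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS       1024   /* hypothetical shared array size */
#define MAX_THREADS 64

static uint32_t shared_data[WORDS];
static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    size_t   start;    /* first word for this thread */
    unsigned passes;   /* read/write passes over the array */
} job;

/* Read/write passes over the shared array, one thread at a time. */
static void *rw_worker(void *arg)
{
    const job *j = arg;
    for (unsigned p = 0; p < j->passes; p++) {
        pthread_mutex_lock(&data_lock);
        for (size_t i = 0; i < WORDS; i++)
            shared_data[(j->start + i) % WORDS] += 1;
        pthread_mutex_unlock(&data_lock);
    }
    return NULL;
}

/* Run the update with the given number of threads; returns 0 on success. */
int run_mutex_rw(unsigned threads, unsigned passes)
{
    assert(threads > 0 && threads <= MAX_THREADS);
    pthread_t tid[MAX_THREADS];
    job jobs[MAX_THREADS];
    unsigned started = 0;

    for (size_t i = 0; i < WORDS; i++)
        shared_data[i] = 0;
    for (; started < threads; started++) {
        jobs[started].start = (WORDS / threads) * started;
        jobs[started].passes = passes;
        if (pthread_create(&tid[started], NULL, rw_worker,
                           &jobs[started]) != 0)
            break;
    }
    for (unsigned t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return started == threads ? 0 : -1;
}
```

Because only one thread holds the lock at a time, the updates are effectively serialised, which is consistent with the mutex results being close to, or slower than, single thread speeds. On older toolchains the program may need to be built with the compiler's -pthread option.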

All results are provided for the Core i7, mentioned previously, for 1, 4 and 8 threads. Appropriate performance gains can be produced on read only tests and, with writing, for shared L3 cache based data. This 10 MB cache is also probably responsible for the rather excessive serial memory reading speeds, due to threads reading the same data. All mutex based tests, at 8 threads, are slower than those using a single thread.

  RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sun Jun 26 18:01:19 2011

           ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD   15757   15827   12155   11879    8534    8511    4392    4385
 Serial RW    9263    9534    8875    8868    7591    7493    3740    3601
 Random RD   14396   14271    7504    3159    2269    1751     622     341
 Random RW    9231    9510    6136    2993    2087    1507     532     319
 Mutex SRW    9264    9534    8875    8869    7591    7492    3740    3608
 Mutex RRW    9231    9510    6138    2993    2087    1507     532     320

  RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sun Jun 26 18:00:21 2011

           ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD   29630   53166   44120   44829   29620   29671   12108   11987
 Serial RW    5040    7334    7442    7402    7353    7395    8532    6247
 Random RD   28388   41211   27807   12265    8866    6611    2103    1271
 Random RW     657    1096    1229    1283    1288    1376    1648     993
 Mutex SRW    5962    8654    7998    7882    6982    6853    3579    3415
 Mutex RRW    6243    8594    5838    2815    1970    1370     486     310

 ------------------------------------------------------------------------

 3.7 GHz Core i7 4820K

  RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sat Nov  8 12:39:21 2014

           ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD   28021   27808   20268   19318   19231   19255   12455   11589
 Serial RW   29972   30232   21894   17867   17410   17420   12242   11581
 Random RD   27479   27463   13595    8251    6228    5605    2470    1011
 Random RW   30429   30076    9224    6120    5177    4782    2800     982
 Mutex SRW   29987   30245   21895   17875   17419   17249   12373   11495
 Mutex RRW   30417   30027    9199    6117    5175    4780    2796     982

  RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sat Nov  8 12:40:59 2014

           ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD   72892  106273   80223   75623   75330   75842   43100   37773
 Serial RW   14198   33187   38290   43442   45145   63266   49480   34706
 Random RD   72819  104511   52719   32436   24475   11447    9297    3543
 Random RW    3092    5558    8188   11156   11734   12527    9136    1811
 Mutex SRW   20510   29213   21157   17705   16794   16835   11969   11250
 Mutex RRW   21625   29271    8982    5912    5060    4699    2720     973

  RandMemMP Speeds 64 Bit Version 1, 8 Threads, Sat Nov  8 12:41:51 2014

           ------------------ MBytes Per Second At --------------------
              6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD   37112   77469   94806   94862   90795   86826   65882   56315
 Serial RW    8924   29533   54380   47712   51176   69146   68008   22145
 Random RD   36944   76814   62245   33838   24552   21588   13472    3341
 Random RW    2000    6016    9058   17412   16237   16733   10066    2806
 Mutex SRW    7829   16705   19723   16432   16331   16570   11550   10669
 Mutex RRW   10672   20797    8933    5659    4844    4561    2659     940


MPrandMem Comparisons

Below, again, are cache and RAM speeds obtained on the Quad Core 3.0 GHz Phenom, the 2.4 GHz Core 2 Duo and the 1.66 GHz Atom. These are for a single thread, for 2 or 4 threads (matching the number of cores or hardware threads) and for 64 threads, plus 8 and 16 threads for the Phenom at 64 bits.

L1 Cache - Probably due to the overheads involved, performance using 64 threads is noticeably slow, with the Core 2 Duo performing better than the quad core Phenom when writing is involved. Hyperthreading does not lead to much performance gain on the Atom.

L2 Cache - Reading tests can show appropriate performance gains using all processors, and there is not as much degradation as with L1 cache when reading and writing. Although dealing with 32 bit integers, and unlike the Phenom, the Core 2 Duo and Atom produce much faster speeds with the 64 bit version using 64 threads. The Mutex tests produce different performance characteristics than when using L1 cache.

RAM - There are some performance gains with multiple threads making better use of memory bandwidth, and excessive numbers of threads do not necessarily reduce performance.
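
The scaling described in these three notes can be checked directly from the tables. The following Python sketch is not part of the benchmark. It copies the 64 bit Phenom II Serial RD figures from the comparison tables and divides each by the single thread result. The function name `scaling` and the dictionary layout are illustrative assumptions.

```python
# Serial RD speeds in MBytes/second, copied from the 64 bit Phenom II rows
# of the comparison tables. Inner keys are thread counts.
PHENOM_64B_SERIAL_RD = {
    "L1 6 KB": {1: 15757, 4: 29630, 64: 2247},
    "L2 96 KB": {1: 12155, 4: 44120, 64: 30435},
    "RAM 96 MB": {1: 4385, 4: 11987, 64: 11863},
}


def scaling(speeds, baseline_threads=1):
    """Return each thread count's speed relative to the baseline run."""
    base = speeds[baseline_threads]
    return {threads: round(mb_s / base, 2) for threads, mb_s in speeds.items()}


# Print one line of ratios per memory level.
for level, speeds in PHENOM_64B_SERIAL_RD.items():
    print(level, scaling(speeds))
```

On these figures, 64 threads deliver about 0.14 times the single thread speed at 6 KB, about 2.5 times at 96 KB and about 2.7 times at 96 MB, which matches the pattern of heavy overheads with L1 cache and little penalty from excess threads with RAM.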

L1 Cache Results in MBytes/Second - 6 KB
                CPUs/   MHz  Serial  Serial  Random  Random   Mutex   Mutex
                HTs              RD      RW      RD      RW     SRW     RRW

Phenom II 32b   4/0    3000   15791   11495   14746   11409   11478   11394
   4 Threads                  29613    5014   29595     610    7152    7569
  64 Threads                   2321     188    2234      42     257     255
Core 2 Duo 32b  2/0    2400    6327    8474    6314    8305   12642    8306
   2 Threads                  13285    3559   13432    1312    6935    9327
  64 Threads                    800     452     802      93     346     433
Atom N455 32b   1/1    1667    3500    4742    4422    5028    5032    5022
   2 Threads                   4902    4770    4895    1153     677    3101
  64 Threads                    307     296     301      69      55     207

64 Bit Version

Phenom II 64b   4/0    3000   15757    9263   14396    9231    9264    9231
   4 Threads                  29630    5040   28388     657    5962    6243
   8 Threads                  14933    2120   14892     338    2465    2867
  16 Threads                   8514     846    8284     174     910    1041
  64 Threads                   2247     189    2173      45     225     214
Core 2 Duo 64b  2/0    2400    9579   12619    6385    7720   12623    7600
   2 Threads                  14257    3553   14073    1505    7018    7718
  64 Threads                    875     893     875     112     348     358
Atom N455 64b   1/1    1667    3838    4222    3834    4222    4233    4215
   2 Threads                   4438    4779    4481    1218     970    3130
  64 Threads                    285     291     281      68      42     167

L2 Cache Results in MBytes/Second - 96 KB
                CPUs/   MHz  Serial  Serial  Random  Random   Mutex   Mutex
                HTs              RD      RW      RD      RW     SRW     RRW

Phenom II 32b   4/0    3000   12476   10387    7552    6241   10385    6241
   4 Threads                  45484    7488   27712    1238    9645    5810
  64 Threads                  31650    7033   22035    1162    1989    1431
Core 2 Duo 32b  2/0    2400    5228    5892    4245    3852    5935    2619
   2 Threads                  15026   16440    7009    2896    7009    3054
  64 Threads                   3304    2662    3268     497    1300    1506
Atom N455 32b   1/1    1667    2768    3349     855    1175    3464    1173
   2 Threads                   4642    4424    1317    1570    2805     966
  64 Threads                   1177    1138    1118     665     423     584

64 Bit Version

Phenom II 64b   4/0    3000   12155    8875    7504    6136    8875    6138
   4 Threads                  44120    7442   27807    1229    7998    5838
   8 Threads                  42685    7413   27567    1240    6875    4867
  16 Threads                  42004    7443   27870    1237    5335    3760
  64 Threads                  30435    7057   21892    1157    1686    1329
Core 2 Duo 64b  2/0    2400    6234    5992    4320    3777    5947    3779
   2 Threads                  14542   15153    7145    2932    7190    3113
  64 Threads                  11741   12994    6426    2767    4714    2317
Atom N455 64b   1/1    1667    2813    3064     845    1103    3063    1122
   2 Threads                   4613    4576    1352    1615    3044    1111
  64 Threads                   3551    3506    1179    1312    1759     874

L3 Cache Results in MBytes/Second - 1536 KB
                CPUs/   MHz  Serial  Serial  Random  Random   Mutex   Mutex
                HTs              RD      RW      RD      RW     SRW     RRW

Phenom II 32b   4/0    3000    8756    7918    1743    1505    7919    1503
   4 Threads                  29961    7491    6617    1332    7414    1391
  64 Threads                  30159    7763    6643    1333    2394     472

64 Bit Version

Phenom II 64b   4/0    3000    8511    7493    1751    1507    7492    1507
   4 Threads                  29671    7395    6611    1376    6853    1370
   8 Threads                  29330    7549    6558    1342    6229    1234
  16 Threads                  29827    7627    6623    1361    4763     988
  64 Threads                  29812    7733    6650    1337    2163     473

RAM Results in MBytes/Second - 96 MB
                CPUs/   MHz  Serial  Serial  Random  Random   Mutex   Mutex
                HTs              RD      RW      RD      RW     SRW     RRW

Phenom II 32b   4/0    3000    4407    3845     344     320    3826     320
   4 Threads                  12009    6305    1273     994    3615     308
  64 Threads                  12010    6641    1298    1003    2881     308
Core 2 Duo 32b  2/0    2400    4803    2147     449     282    2232     310
   2 Threads                   5492    2512     621     401    2239     308
  64 Threads                   6206    2567     635     411    2294     304
Atom N455 32b   1/1    1667    2253    1159      38      54    1275      54
   2 Threads                   3926    1257      63      79    1109      42
  64 Threads                   3951    1274      64      79    1322      54

64 Bit Version

Phenom II 64b   4/0    3000    4385    3601     341     319    3608     320
   4 Threads                  11987    6247    1271     993    3415     310
   8 Threads                  11930    6248    1274     990    3321     307
  16 Threads                  11860    6557    1281     997    2777     304
  64 Threads                  11863    6651    1288    1002    2774     302
Core 2 Duo 64b  2/0    2400    3971    2141     416     283    2148     314
   2 Threads                   5448    2508     632     404    2425     298
  64 Threads                   6065    2612     639     416    2318     305
Atom N455 64b   1/1    1667    2284    1286      39      54    1298      54
   2 Threads                   3717    1241      62      78    1336      54
  64 Threads                   3785    1270      64      78    1122      42


Roy Longbottom December 2014

The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection