Linux PC Benchmarks Ubuntu - Roy Longbottom's PC benchmark Collection

Linux PC Benchmarks

General	Configuration Details	32-Bit and 64-Bit Differences
Classic Benchmarks	Classic Benchmark Results	Maximum CPU Speeds
OpenMP Benchmark	MemSpeed Benchmark	BusSpeed Benchmark
RandMem Benchmark	SSEfpu Benchmark	nVidia CUDA Benchmarks
Disk, Bus and LAN Benchmarks	Burn-In and Reliability Apps	Multithreading Benchmarks
Image Processing Benchmarks	OpenGL Benchmark	On-Line Benchmarks
JavaDraw On/Off-Line Benchmark	Booting Time

General

Both 32-Bit and 64-Bit versions of Ubuntu Linux were installed on an eSATA/USB hard disk and on USB Flash drives, to compile and assemble existing PC benchmarks via the compiler and assembler that are included in the package. The booting method used also enabled loading Ubuntu on a range of different PCs and laptops.

The benchmark programs, including source code and compile/link commands, are compressed in .tar.gz format. Copy the archive to your home directory or a subdirectory for extraction, then examine the README file for further directions. The benchmarks are simple executable files and do not need installing. The first ones run in a Terminal window via the normal ./name command or by clicking on a shell script containing the commands. Details are displayed while the tests are running and performance results are saved in a .txt file.

The benchmarks were recompiled under Ubuntu 14.04 using GCC 4.8.2, which can handle later Intel CPU instructions, including AVX1, and results are included below. Where recompiled benchmarks produced significantly different results from the older ones, they are available in AVX_benchmarks.tar.gz. This also contains source code with changes that enable error free compiling and correct execution. Further details are in Linux AVX benchmarks.htm.

Latest results are for a quad core/8 thread 3.7 GHz Core i7 4820K with 10 MB L3 cache, normally running at the Turbo Boost speed of 3.9 GHz. It has 32 GB of DDR3 RAM on 4 memory channels, with a maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width in bytes), or 51.2 GB/second.
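The memory speed arithmetic above can be checked with a few lines of Python. This is only a worked example of the quoted formula; the function and parameter names are made up for illustration.

```python
def peak_memory_gb_per_second(bus_mhz, transfers_per_clock, channels, bus_width_bytes):
    """Theoretical peak memory speed in GB/second (1 GB = 1000 MB)."""
    mb_per_second = bus_mhz * transfers_per_clock * channels * bus_width_bytes
    return mb_per_second / 1000.0

# Values quoted in the text: 800 MHz bus, DDR (x2), 4 channels, 8 byte bus width.
print(peak_memory_gb_per_second(800, 2, 4, 8))  # 51.2
```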


Configuration Details

All benchmarks include the same configuration details, some of which are produced via assembly language code. Example details shown are for an AMD Phenom quad core processor via 32-Bit Ubuntu and an Intel Core 2 Duo using the 64-Bit version.

 ######################################################################

  Assembler CPUID and RDTSC
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000306E4
  Intel(R) Core(TM) i7-4820K CPU @ 3.70GHz
  Measured - Minimum 3711 MHz, Maximum 3711 MHz
  Linux Functions
  get_nprocs() - CPUs 8, Configured CPUs 8
  get_phys_pages() and size - RAM Size 31.51 GB, Page Size 4096 Bytes
  uname() - Linux, roy-WD32, 2.6.35-24-generic-pae
  #42-Ubuntu SMP Thu Dec 2 03:21:31 UTC 2010, i686

  Assembler CPUID and RDTSC
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
  AMD Phenom(tm) II X4 945 Processor
  Measured - Minimum 2978 MHz, Maximum 3008 MHz
  Linux Functions
  get_nprocs() - CPUs 4, Configured CPUs 4
  get_phys_pages() and size - RAM Size 7.88 GB, Page Size 4096 Bytes
  uname() - Linux, roy-C2D, 2.6.35-22-generic-pae
  #35-Ubuntu SMP Sat Oct 16 22:16:51 UTC 2010, i686

  Assembler CPUID and RDTSC
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz
  Measured - Minimum 2407 MHz, Maximum 2407 MHz
  Linux Functions
  get_nprocs() - CPUs 2, Configured CPUs 2
  get_phys_pages() and size - RAM Size 3.87 GB, Page Size 4096 Bytes
  uname() - Linux, roy-64Bit, 2.6.35-22-generic
  #33-Ubuntu SMP Sun Sep 19 20:32:27 UTC 2010, x86_64

  Identified with Fedora Linux
  uname() - Linux, localhost.localdomain, 2.6.34.7-61.fc13.x86_64
  #1 SMP Tue Oct 19 04:06:30 UTC 2010, x86_64

 ######################################################################


32-Bit and 64-Bit Differences

The main advantage of 64-Bit working is that the amount of main memory that can be installed and accessed is much larger than with 32-Bit operation. The downside can be worse performance if integer array variables are defined as 64 bits, leading to twice the data volume being read and written.

The original x87 floating point instructions are not normally used in 64-Bit compilations. Instead, SSE instructions are used for 32-Bit Single Precision (SP) floating point numbers and SSE2 for 64-Bit Double Precision (DP). These are potentially Single Instruction Multiple Data (SIMD) instructions, where four SP results or two DP results can be produced per clock cycle or, with linked adds and multiplies, eight or four results. Unfortunately, it seems that only Single Instruction Single Data (SISD) operations are issued, where only one number is used in the 128 bit registers, and this can lead to slower performance than a program compiled for 32-Bits with x87 instructions.

The main performance gain at 64-Bits appears to come from the provision of twice as many general purpose and SSE registers which, with optimisation options, provides faster speeds by reducing the need to save and reload variables, an activity that involves access to slower memory.

Some of these effects, for better and for worse, are reflected in the tables below.


Classic Benchmarks

The Classic Benchmarks are the first programs used to measure relative performance of computers. They are:

Livermore Kernels (Livermore Loops) - Produced for the first supercomputers and comprising 14 kernels in 1970, then 24 in the 1980s. The 24 kernels are run at three different data sizes. Results are in Millions of Floating Point Operations Per Second (MFLOPS) with one measurement for each kernel and some overall figures, where Geometric Mean is the official overall rating.

Whetstone Benchmark - the first general purpose benchmark that set industry standards of performance, particularly for minicomputers, and introduced in 1972. The benchmark produced speed ratings in terms of Thousands of Whetstone Instructions Per Second (KWIPS). In 1978, self timing versions (by yours truly) produced speed ratings, for each of the eight test procedures, in MOPS (Millions of Operations Per Second) or MFLOPS, with an overall rating in MWIPS.

Dhrystone Benchmarks 1.1 and 2.1 - The Dhrystone benchmark, a sort of Whetstone without floating point, became the key standard benchmark, from 1984, with the growth of Unix systems. The second version (2.1) was produced to avoid over-optimisation problems encountered with version 1.1. Original performance ratings were in terms of Dhrystones per second. This was later changed to VAX MIPS by dividing Dhrystones per second by 1757, the DEC VAX 11/780 result.

Linpack Benchmark - This benchmark was produced from the "LINPACK" package of linear algebra routines. It became the primary benchmark for scientific applications from the mid 1980's with a slant towards supercomputer performance, with speed measured in MFLOPS.

Further details and references can be found in classic.htm
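The Dhrystone conversion to VAX MIPS described above is a simple division by the DEC VAX 11/780 score. A minimal Python sketch follows; the function names are hypothetical and only the divisor 1757 comes from the text.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # divisor quoted in the text

def to_vax_mips(dhrystones_per_second):
    """Convert a Dhrystones per second score to a VAX MIPS rating."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND

def to_dhrystones_per_second(vax_mips):
    """Reverse conversion, for comparing with older published scores."""
    return vax_mips * VAX_11_780_DHRYSTONES_PER_SECOND

print(to_vax_mips(1757))               # 1.0, the VAX 11/780 itself
print(to_dhrystones_per_second(13599)) # 23893443
```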

On starting execution, the programs go through a calibration phase to determine the number of passes needed to run for more than 2 seconds with Dhrystone, 1 second for each of 8 tests with Linpack, 1 second for each of 72 tests with Livermore Loops and 10 seconds overall with Whetstone. Displayed results demonstrate that running time is proportional to the number of passes.
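A calibration phase of this kind can be sketched as below. This is an illustration only: the doubling strategy, the names and the fake workload are assumptions, not taken from the benchmark sources, which may calibrate differently.

```python
def calibrate(timed_run, target_seconds, start_passes=1, max_passes=1 << 40):
    """Double the pass count until one timed run lasts at least target_seconds.

    timed_run(passes) must execute the workload and return elapsed seconds.
    """
    passes = start_passes
    while passes <= max_passes:
        elapsed = timed_run(passes)
        if elapsed >= target_seconds:
            return passes, elapsed
        passes *= 2
    raise RuntimeError("workload too fast to calibrate")

# Hypothetical workload where each pass takes 3 milliseconds, so that
# running time is proportional to the number of passes.
def fake_run(passes):
    return passes * 0.003

passes, seconds = calibrate(fake_run, target_seconds=2.0)
print(passes)  # 1024
```

With a real benchmark, timed_run would wrap the test loop with a timer such as time.perf_counter().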

For the benchmark execution codes and source files, download classic_benchmarks.tar.gz. Four execution files are provided for each benchmark. They comprise 32-Bit and 64-Bit compilations, non-optimised and optimised varieties. On downloading to Windows, the file appeared as classic_benchmarks.tar.tar but seemed to be fine with the name changed to classic_benchmarks.tar.gz.


Classic Benchmark Results

Results of these Linux based benchmarks are included with those run via Windows in the following reports. Some examples are given below, all using 1 CPU, for a 2.4 GHz Core 2 Duo and, from 2014, a 3.7 GHz Core i7 running at the Turbo Boost speed of 3.9 GHz.

The benchmarks were recompiled under Ubuntu 14.04 using GCC 4.8.2, which can handle later Intel CPU instructions, including AVX1, and results are shown below (New x64 with SSE/SSE2 and New AVX). The Core i7 maximum speed in GFLOPS per core (4 available) is GHz x 4 (SSE single precision) x 2 (with multiply and add), or 31.2 GFLOPS, and 62.4 using AVX1. Using double precision, the best possible scores are 15.6 and 31.2 GFLOPS respectively.
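The per core maximum speeds quoted above follow from clock speed, numbers per register and linked multiply and add. A small worked example in Python, with hypothetical names and assuming 128 bit SSE and 256 bit AVX1 registers as described in the text:

```python
def peak_gflops_per_core(ghz, register_bits, word_bits, ops_per_word=2):
    """Peak GFLOPS = GHz x words per register x operations per word.

    ops_per_word=2 assumes a linked multiply and add every clock cycle.
    """
    words_per_register = register_bits // word_bits
    return ghz * words_per_register * ops_per_word

GHZ = 3.9  # Core i7 4820K Turbo Boost speed from the text
print(round(peak_gflops_per_core(GHZ, 128, 32), 1))  # SSE  SP 31.2
print(round(peak_gflops_per_core(GHZ, 256, 32), 1))  # AVX1 SP 62.4
print(round(peak_gflops_per_core(GHZ, 128, 64), 1))  # SSE2 DP 15.6
print(round(peak_gflops_per_core(GHZ, 256, 64), 1))  # AVX1 DP 31.2
```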

The only real beneficiary of the recompilation is the Linpack benchmark via AVX options. Some of the Livermore Loops should also benefit, given the really simple structure used, but this is presently beyond the capabilities of the compiler.

Dhrystone Benchmark Results On PCs
Whetstone Benchmark Results On PCs

Linpack Benchmark Results On PCs
Livermore Loops Benchmark Results On PCs

 Whetstone Benchmark Optimised

                    MWIPS  MFLOP  MFLOP  MFLOP    COS    EXP  FIXPT     IF  EQUAL
                               1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS
 2.4 GHz Core 2
   32 Bit            2280    815    811    576   56.5   22.6   4011   7413   3651
   64 Bit            2560    865    885    589   65.7   29.1   3851   5314   1078
 3.7 GHz (TB 3.9) Core i7
   32 Bit            3959   1331   1331    938     97   42.1   6516  10967   5851
   64 Bit            4880   1331   1324    977    129   64.2   6517  11657   1812
   New x64           4891   1330   1323    977    120   64.5   6505  11638   3903
   New AVX           4897   1325   1323    977    120   64.5   6515  11649   3909


 Livermore Loops MFLOPS 24 Kernels Optimised

   Loop                 1     2     3     4     5     6     7     8     9    10    11    12
                       13    14    15    16    17    18    19    20    21    22    23    24
 2.4 GHz Core 2
   32 Bit            1953  1223  1584  1534   343  1238  2192  2385  2147  1187   795   479
                      161   396   276   956  1368   959   509   385  1385   165  1182   560
   64 Bit            1702  1340  1593  1531   341  1199  2422  3060  2057   770   798   861
                      481   673   444   992  1029  1222   461   423  1251   351  1184   819
 3.7 GHz (TB 3.9) Core i7
   32 Bit            4327  3661  2622  2642   527  2250  4217  5549  5223  2511  1311  1279
                      450  1036   730  2038  2479  2835   810   783  2820   419  2022   967
   64 Bit            4707  3434  2629  2657   565  2155  4592  6131  5442  2602  1314  1296
                      937  1239  2288  2293  2392  3538   839   968  2792   939  2034  1720
   New x64           4729  3422  2639  2657   565  2164  4599  5714  4984  2446  1310  1879
                     1018  1267  2287  2012  2397  5343   836   969  3042   940  2011  1840
   New AVX           4692  3488  2638  2654   564  2160  4471  5717  4978  2619  1308  1863
                      978  1305  2285  2043  2492  6418   836   968  3069   938  2010  1558


 Dhrystone and Linpack

                    Dhry1  Dhry1  Dhry2  Dhry2     Linpack
                    NoOpt    Opt  NoOpt    Opt
                      VAX    VAX    VAX    VAX  No Opt     Opt
                     MIPS   MIPS   MIPS   MIPS  MFLOPS  MFLOPS
 2.4 GHz Core 2
   32 Bit            3428  13599   3348   5852     404    1288
   64 Bit            3643  18738   3288  12265     378    1577
 3.7 GHz (TB 3.9) Core i7
   32 Bit            7108  29277   7478  16356     988    2534
   64 Bit            8436  32659   8481  23607     900    3672
   New x64           8441  32499   8381  24140     946    3631
   New AVX           8441  32575   8395  23626     935    5413


Maximum CPU Speeds

Benchmarks whatcpu32 and whatcpu64 are essentially the same as cpuid and cpuid64, produced for Windows, with description and results in WhatCPU results.htm. The programs were written with a view towards demonstrating maximum CPU performance executing all types of arithmetic instructions. The execution files and source code are available for download in max_cpu_speeds.tar.gz.

The benchmark programs use assembler level instructions, including full SIMD operations where appropriate, to simply add values via 1, 2, 3 and 4 registers. Results are in MIPS and MFLOPS, millions of adds per second in both cases. The programs also check that the end totals are correct. The 32 bit version adds 32 bit integers, then 32 bit single precision and 64 bit double precision floating point numbers using the original x87 instructions. This is followed by adding 32 bit integers using MMX and SSE2 instructions and 64 bit integers also using SSE2 functions. Finally there are 32 bit floating point additions using SSE instructions plus 3DNow, using AMD processors, and 64 bit floating point sums with SSE2 operations.

MMX, x87 and 3DNow instructions are not available at 64 bit working, but normal integer instructions are provided to use 64 bit numbers which, in the case of this register based program, mainly run at the same speed as with 32 bit arithmetic.

Results below are for an AMD Phenom X4 and Intel Core 2 Duo, using one CPU in each case. These suggest three integer adds and two 64 bit MMX operations can be executed per clock cycle. Then SSE/SSE2 floating point calculation speed is based on one 128 bit register dealt with per cycle. Best is eight 32 bit SSE integer adds per cycle. Here, the AMD processor appears to be more efficient than the Intel CPU, but later Intel i7 32 bit and 64 bit results correct some of this anomaly.

Results from a later Core i7 are also shown. This CPU has AVX1 instructions included, with 256 bit registers, producing up to eight 32 bit floating point results per CPU cycle (31.2 GFLOPS at 3.9 GHz), on addition and twice this with linked multiply and add instructions. The latter were included in a new AVX test (AVXid64), demonstrating 62 GFLOPS at 3.9 GHz. Details of the latter can be found in Linux AVX benchmarks.htm.
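The per clock cycle figures mentioned above can be reproduced by dividing the reported millions of operations per second by CPU MHz. The Python sketch below uses 4 register results from the table that follows and assumes the nominal clock speeds (3000 MHz Phenom, 3900 MHz Core i7) rather than the measured ones; the function name is hypothetical.

```python
def results_per_clock(millions_per_second, cpu_mhz):
    """Operations completed per CPU clock cycle."""
    return millions_per_second / cpu_mhz

PHENOM_MHZ = 3000
I7_MHZ = 3900

# Phenom, 32 bit OS version, 4 registers
print(round(results_per_clock(9040, PHENOM_MHZ), 2))       # integer adds, 3.01
print(round(results_per_clock(12054, PHENOM_MHZ) / 2, 2))  # 64 bit MMX ops (2 x 32 bit adds each), 2.01
print(round(results_per_clock(24107, PHENOM_MHZ), 2))      # 32 bit SSE2 integer adds, 8.04

# Core i7, 64 bit OS version, 32 bit AVX1 adds with 4 registers
print(round(results_per_clock(30920, I7_MHZ), 2))          # 7.93
```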

 Word                       32 bit OS Version              64 bit OS Version
 Size                    1 Reg  2 Reg  3 Reg  4 Reg     1 Reg  2 Reg  3 Reg  4 Reg

 Core i7 3.7 GHz at up to 3.9 GHz via Turbo Boost

 32 bit Integer MIPS      4301   8551  11994  12292      4302   8559  11996  12293
 64 bit Integer MIPS         -      -      -      -      4302   8553  11996  12293
 32 bit x87 MFLOPS        1303   2607   3865   3864         -      -      -      -
 64 bit x87 MFLOPS        1303   2607   3865   3864         -      -      -      -
 32 bit MMX Int MIPS      7822  14900  14932  14900         -      -      -      -
 32 bit SSE2 Int MIPS    15642  29800  29868  29800     15643  29802  29870  29805
 64 bit SSE2 Int MIPS     7821  14899  14934  14900      7822  14901  14935  14901
 32 bit SSE MFLOPS        5214  10427  15459  15457      5214  10429  15460  15459
 64 bit SSE2 MFLOPS       2607   5214   7730   7729      2607   5215   7730   7729
 32 bit 3DNow MFLOPS         -      -      -      -         -      -      -      -
 32 bit AVX1 MFLOPS          -      -      -      -     10430  20860      -  30920
 64 bit AVX1 MFLOPS          -      -      -      -      5210  10430      -  15460
 32 bit AVX1 +* MFLOPS       -      -      -      -         -      -      -  62000
 64 bit AVX1 +* MFLOPS       -      -      -      -         -      -      -  31000

 Phenom II X4 3.0 GHz

 32 bit Integer MIPS      3314   6629   8664   9040      3315   6629   8664   9040
 64 bit Integer MIPS         -      -      -      -      3315   6629   7701   8287
 32 bit x87 MFLOPS         753   1506   2259   3013         -      -      -      -
 64 bit x87 MFLOPS         753   1506   2259   3013         -      -      -      -
 32 bit MMX Int MIPS      3012   6026   9036  12054         -      -      -      -
 32 bit SSE2 Int MIPS     6024  12050  18073  24107      6025  12053  18081  24107
 64 bit SSE2 Int MIPS     3012   6025   9037  12053      3013   6027   9040  12053
 32 bit SSE MFLOPS        3012   6024   9037  12050      3013   6025   9040  12053
 64 bit SSE2 MFLOPS       1506   3012   4518   6025      1506   3013   4519   6027
 32 bit 3DNow MFLOPS      1506   3012   4518   6025         -      -      -      -

 Core 2 Duo 2.4 GHz

 32 bit Integer MIPS      2629   4915   5356   6605      2601   4410   5226   6606
 64 bit Integer MIPS         -      -      -      -      2612   3908   5525   5285
 32 bit x87 MFLOPS         801   1601   2402   2402         -      -      -      -
 64 bit x87 MFLOPS         801   1601   2402   2402         -      -      -      -
 32 bit MMX Int MIPS      4726   7116   8772   8734         -      -      -      -
 32 bit SSE2 Int MIPS     9490  13769  17545  17469      9490  14641  17527  17471
 64 bit SSE2 Int MIPS     2402   4575   4586   4575      2402   4576   4585   4576
 32 bit SSE MFLOPS        3202   6405   9608   9608      3202   6405   9609   9609
 64 bit SSE2 MFLOPS       1601   3202   4804   4804      1601   3202   4804   4804
 32 bit 3DNow MFLOPS         -      -      -      -         -      -      -      -


OpenMP Benchmark

OpenMP is a system independent set of procedures and software that arranges automatic parallel processing of shared memory data when more than one processor is provided. This option is available in the C/C++ compiler included in the Linux Ubuntu Distribution. In each case, four benchmarks are provided, compiled with and without OpenMP options, to run on 32 bit and 64 bit systems. The execution files and source code along with compile and run instructions can be downloaded in linux_openmp.tar.gz. Details and results are provided in linux_openmp benchmarks.htm and a summary follows.

Original OpenMP Benchmark

The original benchmark used larger data array sizes of 0.4, 4.0 and 40 MBytes with 2, 8 and 32 floating point calculations per word (4 Bytes). The 32 bit version behaved in a similar way to the Windows compilation, showing performance gains of a four core processor of up to four times that of a single CPU. The 64 bit OpenMP version behaved in a similar manner to the 32 bit variation but appears to be relatively worse on comparing with speeds produced by the normal compilation. The reason is that the normal compilation produces full SIMD operation, with four calculations per clock cycle, and the OpenMP version produces SISD, with one calculation per clock (see above, where SIMD was not produced). Examples of results are given below.

Later results are for the 64 bit version running on the Core i7. In this case, for comparative purposes, those obtained by a multithreading version, MP MFLOPS (see below), are also shown. Also included are the non-OpenMP version and new compilations for SSE and with AVX functions. The former produces the same speeds as MP MFLOPS using one thread, with a maximum speed of around 24.5 GFLOPS, demonstrating SIMD, where the maximum possible is 31.2 GFLOPS [CPU GHz x 4 (register width) x 2 (linked multiply and add)]. Multithreaded MP MFLOPS speeds show appropriate gains, producing up to 93.2 GFLOPS, but this can require the use of the 8 threads available via Hyperthreading. The AVX benchmark shows suitable gains, with 8 word registers, where the maximum demonstrated is 177.8 GFLOPS.

Note that the i7 SSE OpenMP speeds, shown below, are from a version recompiled by GCC 4.8.2, as this produces SIMD instructions. The new versions are included in AVX_benchmarks.tar.gz, along with the new AVX benchmark. The original SSE version, in linux_openmp.tar.gz, produces SISD instructions, with maximum speeds shown underneath the i7 table. The new compilations produce SIMD instructions for 2 and 8 operations per word, but performance is degraded due to data handling overheads. Then, at least, AVX scores are double those produced via SSE arithmetic. All the complex data handling seems to lead to SIMD instructions not being generated for the 32 operations tests, leading to SSE and AVX speeds being the same (single data word handling).
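The multiprocessor gains discussed above can be expressed as simple ratios. The Python sketch below uses Phenom figures from the table that follows (100000 words); the function name is hypothetical.

```python
def speedup(multi_cpu_mflops, single_cpu_mflops):
    """Ratio of multiple CPU speed to single CPU speed."""
    return multi_cpu_mflops / single_cpu_mflops

# 32 bit version, 32 operations per word: 4 CPUs OMP versus 1 CPU OMP
print(round(speedup(14896, 3794), 2))   # 3.93, close to four times

# 32 bit version, 2 operations per word: gains are lower
print(round(speedup(5758, 1903), 2))    # 3.03

# 64 bit version, 32 operations per word: 4 CPUs OMP (SISD)
# versus 1 CPU normal compilation (SIMD)
print(round(speedup(13494, 15336), 2))  # 0.88, four cores slower than one
```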

 Linux OpenMP MFLOPS 3 GHz Quad Core Phenom

                              32 Bits                        64 Bits

     Data  Ops/   1 CPU  1 CPU 2 CPUs 4 CPUs     1 CPU  1 CPU 2 CPUs 4 CPUs
    Words  Word   *Norm    OMP    OMP    OMP     *Norm    OMP    OMP    OMP

   100000     2    2439   1903   3575   5758      7624   1974   3597   5769
  1000000     2    2231   1787   3588   6710      4686   1913   3843   6674
 10000000     2    1739   1509   2490   3062      2195   1590   2566   2944

   100000     8    3348   3518   6963  13353     14357   3437   6835  12126
  1000000     8    3195   3453   6943  13524     13376   3375   6802  12420
 10000000     8    3080   3308   6541  11311      7473   3219   6379  10976

   100000    32    3881   3794   7566  14896     15336   3552   7084  13494
  1000000    32    3853   3774   7554  14969     15009   3533   7079  13540
 10000000    32    3817   3735   7465  14883     14318   3490   6970  13450

 Instructions       FPU    FPU    FPU    FPE      SIMD   SISD   SISD   SISD
                    x87    x87    x87    x87       SSE    SSE    SSE    SSE

 *Norm OpenMP Directives not used - 1 CPU core SSE


 Core i7 3.7 GHz at up to 3.9 GHz via Turbo Boost

              ----- MP MFLOPS 1 to 8 Threads -----    -------- OpenMP ---------
              ----- SSE -----    ----- AVX -----      SSE   --- SSE ---   --- AVX ---
 M 4B   Ops      1      4      8      1      4      8     1*   aff1      8   aff1      8
 Words  Word                                                     ##     ##

   0.1     2  9681  45340  54621  12542  62273  60258   9918   6061  13742  10196  19577
   1.0     2  9759  21688  41832  11404  23031  44329   9688   6215  19477  10025  37906
  10.2     2  5990   9237  10026   5991   8970   9977   5870   5059   9137   5880   7782

   0.1     8 24533  49320  92086  35982 159040 173224  24448  13220  44104  26481  88370
   1.0     8 24570  49918  92352  36180  80096 151909  24465  13373  49499  27045  90579
  10.2     8 19975  36638  39982  23299  40124  40153  20055  12719  38369  20593  35607

   0.1    32 23269  46942  92408  46400  90572 173372  23251   5854  22858   5865  22845
   1.0    32 23307  89676  93282  46572  91058 177831  23265   5863  23234   5870  23141
  10.2    32 23052  91029  92050  44729  88877 158594  23063   5860  23127   5854  23077

 2&8 Ops     ------- SIMD ------  ------- SIMD ------   SIMD  --- SIMD --  --- SIMD --
 32 Ops      ------- SIMD ------  ------- SIMD ------   SIMD  --- SISD --  --- SISD --

 ## new version, Original SISD all cores - 2 Ops 3400, 8 Ops 6100, 32 Ops 5900


OpenMP MemSpeed Benchmark

MemSpeed benchmark employs three different sequences of operations, on 64 bit double precision floating point numbers, 32 bit single precision numbers and 32 bit integers via two data arrays. It uses data volumes of 4 KBytes upwards to indicate performance via caches and RAM. This version is a variation with evaluation mainly concentrating on the formula x[m] = x[m] + r * y[m]. Below is a sample log file with the 64 bit benchmark using four CPUs. The extremely slow performance at the smaller data sizes is due to the relatively high startup overheads of OpenMP and, probably, cache flushing because shared data is being updated. The 32 bit version produces even slower performance relative to the non-OpenMP compilation. See also Multithreading version.
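The three sequences of operations that appear as column headings in the log can be written out as below. This is a plain Python illustration, not the benchmark code (which is compiled C); the function names are hypothetical, the scalar is called s as in the log headings, and how many arrays the program counts when reporting MB/second is left as an explicit assumption.

```python
def multiply_add(x, y, s):
    """x[m] = x[m] + s * y[m]"""
    for m in range(len(x)):
        x[m] = x[m] + s * y[m]

def add(x, y):
    """x[m] = x[m] + y[m]"""
    for m in range(len(x)):
        x[m] = x[m] + y[m]

def copy(x, y):
    """x[m] = y[m]"""
    for m in range(len(x)):
        x[m] = y[m]

def mb_per_second(words, bytes_per_word, arrays_counted, passes, seconds):
    """Data volume handled divided by running time, in millions of bytes."""
    return words * bytes_per_word * arrays_counted * passes / seconds / 1.0e6

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
multiply_add(x, y, 0.5)
print(x)  # [6.0, 12.0, 18.0]

# Assumed example: 1 million doubles in each of 2 arrays, 100 passes in 1 second
print(mb_per_second(1000000, 8, 2, 100, 1.0))  # 1600.0
```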

Selected results for the Core i7 include those for the benchmark, compiled without OpenMP directives, plus with and without OpenMP, produced by the later compiler that generates AVX instructions. The CPU has 4 cores plus Hyperthreading. The non-OpenMP versions are compiled to use SIMD instructions, but performance is restricted due to overheads of loading, storing and inserting data. With these, AVX produced suitable gains for cache based data. SISD was generated by OpenMP compilations, leading to SSE and AVX speeds being the same. At least, many MP speeds were appropriately faster than those for single core tests, and maximum memory throughput was excellent. Further details are in OpenMP MFLOPS.htm. The same program was also compiled using Pthread multithreading functions, see MP Memory Speed later.

 Phenom II X4 3000 MHz

 OpenMP Memory Reading Speed Test 64 Bit Version 1 by Roy Longbottom

 Start of test Sun Dec 5 12:26:36 2010

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int64    Dble   Sngl  Int64    Dble   Sngl  Int64
    Used    MB/S   MB/S   MB/S    MB/S   MB/S   MB/S    MB/S   MB/S   MB/S

       4    2413   2340   2426    2408   2371   2593    1301   1302   1306
       8    4642   4379   4655    4739   4488   5045    2562   2478   2583
      16    8321   7942   8513    9215   8412   9668    4989   4695   4982
      32   15714  12698  15446   16397  14036  17359    9112   7963   9159
      64   25533  18268  24526   26971  21394  28979   16033  12269  16032
     128   36147  23064  34023   40018  28460  42871   23255  16389  23172
     256   45821  26908  42782   21679  34353  57114   31501  20370  31889
     512   46924  28555  46191   55514  35557  54808   33583  22754  33376
    1024   45478  28681  45098   48798  34662  47103   25081  22172  24993
    2048   36642  26993  36187   36523  32366  36917   18354  17985  18388
    4096   30960  24342  30259   32057  26483  32862   17172  15049  17153
    8192   22963  20257  22754   23462  21376  23910   12203  11223  12176
   16384    8927   8774   8888    8947   8803   8951    4469   4454   4487
   32768    8938   8817   8875    8963   3681   8964    4494   4465   4488
   65536    8956   8863   8910    8959   8849   8981    4500   4474   4502
  131072    8979   8918   8951    8830   8808   9022    4513   4494   4517
  262144    8784   8657   8706    8760   8826   8919    4436   4422   4433
  524288    8774   8478   8789    8732   8643   8864    4374   3703   4435
 1048576    8664   8559   8617    8689   8612   8678    4368   4360   4336
 2097152    8661   8631   8643    8611   8597   8692    4364   4368   4367


 Core i7 4820K 3900 MHz Turbo Boost - x[m]=x[m]+s*y[m] Int+

                64 bit      64 bit OMP      64 bit AVX     64b AVX OMP        32 bit
  KBytes    Dble   Sngl    Dble   Sngl    Dble   Sngl     Dble   Sngl    Dble   Sngl
    Used    MB/S   MB/S    MB/S   MB/S    MB/S   MB/S     MB/S   MB/S    MB/S   MB/S

       4   39311  24057    2666   2628   60212  56633     2716   2670   34682  17663  L1
       8   39076  24566    5058   4962   61736  58608     5163   5100   35342  17780
      16   39851  24795    9662   9412   62061  59459     9818   9526   35555  17824
      32   39859  24862   18780  17122   61951  59466    19317  17272   35391  17781
      64   32844  24462   33953  26599   47441  40896    34221  26564   30900  17303  L2
     128   32879  24498   51235  36875   46181  40101    52329  37762   31022  17313
     256   30516  23886   70872  47353   41612  36928    71102  47183   29852  17324
     512   25604  22420   90020  53395   31463  30294    90080  54397   24994  17127  L3
    1024   25565  22368   97333  57510   30155  29099    97835  57372   24903  17129
    2048   25589  22479   96621  58092   30044  29144    93511  58513   24909  17125
    4096   25600  22405   87122  60230   30056  29218    93758  60141   24951  17194
    8192   25593  22460   94138  60267   29891  29223   104996  59273   24864  17203
   16384   15083  14415   27817  27128   15577  15790    27302  27169   14951  13705  RAM
   32768   14845  14293   24666  24563   15191  15371    24620  24175   14890  13704
   65536   14959  14424   24868  25137   15215  15401    24763  24725   14856  13695
  131072   15041  14492   25625  25696   15230  15401    25636  25597   14880  13726
  262144   15023  14491   25603  25435   15247  15410    25507  25348   14958  13773
  524288   15053  14520   25603  25634   15204  15445    25646  25396   15016  13824
 1048576   15085  14534   25569  25690   15198  15438    25160  25678   15025  13846
 2097152   15096  14538   25634  25814   15254  15462    25656  25700
 4194304   15096  14544   25344  25266   15252  15452    25413  25421

 Max GFLOPS  5.0    6.2    12.2   15.1     7.8   14.9     13.1   15.0     4.4    4.5


BusSpeed Benchmark

This benchmark is particularly designed to identify reading data in bursts over buses, with a 32 bit version using 32 bit integer words and one for 64 bits using 64 bit numbers. The program starts by reading a word, with address increments of 32 words for the next data. The increment is reduced to 16 words then halving until all data is read. The last test reads all data but using SSE2 instructions.
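The sequence of address increments described above can be illustrated with Python slicing. This is only a sketch of the access pattern under assumed names, not the benchmark code, and it omits the final SSE2 test.

```python
def stride_reads(data, start_increment=32):
    """Read data with increments of 32, 16, 8, 4, 2 and 1 words.

    Returns a list of (increment, words read, sum of words read).
    """
    results = []
    increment = start_increment
    while increment >= 1:
        words = data[::increment]  # every increment-th word
        results.append((increment, len(words), sum(words)))
        increment //= 2
    return results

for increment, count, total in stride_reads(list(range(64))):
    print(increment, count, total)
```

With 64 words, the increments of 32 down to 1 read 2, 4, 8, 16, 32 and 64 words respectively, the last pass corresponding to ReadAll.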

Below are 64 bit results on a Core i7, a Core 2 Duo, with sample results at 32 bits and both varieties on a Phenom processor. The data burst size over the memory bus is indicated at the point where performance becomes constant, like Inc8wds at 64 bits and Inc16wds at 32 bits, both suggesting 512 bits or 64 bytes. Burst reading speed is eight times the constant speed at 64 bits and 16 times at 32 bits, or around 6400 MB/second for the Core 2 Duo and 7200 for the Phenom. There also appears to be some burst reading from data in L2 cache.
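The burst size and burst reading speed estimates above are simple multiplications, sketched below in Python using RAM speeds (393210 KBytes) from the tables that follow. The function names are hypothetical.

```python
def burst_bytes(increment_words, bytes_per_word):
    """Burst size implied by the increment where speed becomes constant."""
    return increment_words * bytes_per_word

def burst_mb_per_second(constant_mb_per_second, increment_words):
    """Only one word per burst is used, so the bus carries increment_words
    times the measured speed."""
    return constant_mb_per_second * increment_words

print(burst_bytes(8, 8), burst_bytes(16, 4))  # 64 64

# Core 2 Duo: Inc8wds at 64 bits, Inc16wds at 32 bits
print(burst_mb_per_second(803, 8), burst_mb_per_second(401, 16))  # 6424 6416

# Phenom: Inc8wds at 64 bits, Inc16wds at 32 bits
print(burst_mb_per_second(918, 8), burst_mb_per_second(453, 16))  # 7344 7248
```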

Speeds via L1 cache are fairly constant up to ReadAll, indicating no burst reading but, with the data transfer speed at 32 bits being half that for 64 bits, a constant instruction execution speed is suggested. This, in MIPS, is slightly less than CPU MHz for the Core 2 Duo and somewhat higher than MHz on the Phenom. The SSE2 test is identical at both bit versions, with the Core 2 Duo showing better efficiency at nearly four 32 bit results (1 SSE register full) per CPU clock cycle. Maximum speed of the Core i7, based on burst speed, is suggested to be around 18 GB/second, a long way from the 51.2 GB/second specification, but i7 Multithreading Benchmarks (below) are needed to approach this.

The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in memory_benchmarks.tar.gz, with more details and results in Linux Results BusSpeed.

Speed in MB/Second - For MIPS 64 bit divide by 8 and 32 bit divide by 4
Core i7 4820K 3900 MHz Turbo Boost - 1 CPU
 Bus Speed Test 64 bit Version 2.0 Sat Nov 8 12:08:24 2014

   Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2

        6     31233     31271     31267     42205     38182     42586     61438  L1
       24     31300     31277     31262     41632     39363     42724     62272
       96     14511     15005     15180     24371     33172     40471     60769  L2
      384      5367      5423      5502     10797     19594     33646     39043  L3
      768      5280      5366      5435     10797     19322     33431     38081
     1536      5247      5348      5493     10799     19399     33625     38234
    16380      1282      1569      2170      4762      9130     18547     19124  RAM
   131070      1223      1484      2098      4543      8731     18096     18349
   393210      1223      1484      2098      4542      8733     18095     18344

 Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM

        6     15308     15463     20502     18262     20300     21300     60627
       96      7434      7593     11491     16540     20013     21082     60633
     1536      2677      2757      5381      9694     16801     21026     38206
   393210       742      1048      2245      4360      9063     16342     18263
Core 2 Duo 2400 MHz - 1 CPU
 Bus Speed Test 64 bit Version 2.0 Thu Dec 16 23:09:19 2010

   Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2

        6     15997     17525     18167     18540     18734     18804     37355
       24     17759     18484     17865     17822     18531     18526     37980
       96      4189      4158      4107      6724      9128     13435     19175
      384      4182      4137      4091      6721      9133     13450     19206
      768      4109      4123      4094      6723      9129     13448     19229
     1536      3883      4086      4039      6643      9011     13280     18913
    16380       657       691       800      1626      2949      5445      5882
   131070       693       711       803      1622      2942      5440      5874
   393210       698       713       803      1623      2948      5444      5865

 Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM

        6      8568      9076      9176      9315      9412      9433     37350
       96      2112      2053      3277      4561      6714      8097     19170
   393210       356       401       815      1474      2730      5091      5870
Phenom II X4 3000 MHz - 1 CPU
 Bus Speed Test 64 bit Version 2.0 - L1 cache, L2 cache and RAM

        6     21407     22690     26285     27053     27050     26435     23784
       96      2992      2973      2991      5992     11780     20725     23813
   393210       869       901       918      1791      3729      6264      7391

 Bus Speed Test 32 bit Version 2.0 - L1 cache, L2 cache and RAM

        6     11287     12793     13466     13625     13407     13281     23648
       96      1494      1490      2974      5854     10509     13147     23781
   393210       447       453       901      1830      3097      5206      7276


RandMem Benchmark

RandMem benchmark carries out eight tests at increasing data sizes to produce data transfer speeds in MBytes Per Second from caches and memory. Serial and random address selections are employed, using the same program structure, with read and read/write tests for 32 bit integers and 64 bit floating point numbers. In both cases, 32 bit integers are used. The main purpose is to demonstrate how much slower performance can be through using random access. Here, speed can be considerably influenced by reading and writing in bursts, where much of the data is redundant, and by the size of preceding caches.
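Using the same program structure for serial and random address selection can be illustrated with an index array, as in the Python sketch below. This is an assumed arrangement for illustration only; the names, the seed and the use of a shuffled index list are not taken from the benchmark source.

```python
import random

def build_indices(words, randomise, seed=1):
    """Index array holding each address once, in serial or random order."""
    indices = list(range(words))
    if randomise:
        random.Random(seed).shuffle(indices)
    return indices

def read_test(data, indices):
    """Read every word via the index array and return the total."""
    total = 0
    for i in indices:
        total += data[i]
    return total

def read_write_test(data, indices, delta=1):
    """Read, modify and write back every word via the index array."""
    for i in indices:
        data[i] = data[i] + delta

data = list(range(1000))
serial = build_indices(len(data), randomise=False)
scattered = build_indices(len(data), randomise=True)

# Same loop and same data, so totals match; only the access order differs.
print(read_test(data, serial) == read_test(data, scattered))  # True
```

In a compiled benchmark, the random order defeats burst reading and prefetching, which is where the speed differences in the tables come from. An interpreted sketch like this one will not show the effect clearly.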

Below, all 64 bit results are shown for a Phenom along with sample speeds at 32 bits and for a Core 2 Duo at 64 bits. Many of the low order speeds are similar at 32 bits and 64 bits but, using RAM, some relationships change, with integer random access becoming progressively worse at 64 bits. The lower GHz Core 2 Duo performs better on some tests. Later results are for the Core i7, which is much faster than the earlier systems, particularly relative to CPU clock speed.

The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in memory_benchmarks.tar.gz with more details and results in Linux Results RandMem.

 Core i7 4820K 3.9 GHz Turbo Boost - 1 CPU

 Random/Serial Memory Test 64 Bit Version 2 Sat Nov 8 12:10:51 2014

          Integer.......................  Double/Integer................
          Serial........  Random........  Serial........  Random........
     RAM    Read  Rd/Wrt    Read  Rd/Wrt    Read  Rd/Wrt    Read  Rd/Wrt
      KB  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec

       6   26914   28379   26521   25259   30506   43477   30389   43791  L1
      12   26984   28876   26341   28078   29905   43462   29909   43020
      24   27062   29098   26526   28219   29865   43649   29832   42931
      48   23161   23723   18749   12718   29702   33997   29670   30451  L2
      96   23203   23731   13790    8816   29766   33586   22909   14830
     192   23378   23626   11539    7634   29685   32647   18371   12232
     384   22366   18631    8073    5883   27876   24687   14813   10078  L3
     768   22290   18024    6043    4978   27801   23322   10159    8041
    1536   22305   18023    5407    4576   27801   23316    8801    7311
    3072   22449   18119    5170    4374   27443   23151    8202    6887
    6144   22392   18111    5040    4269   27867   23187    7970    6683
   12288   15007   11910    2499    2698   20487   16022    4276    4837  RAM
   24576   13928   11206    1332    1336   17949   13729    2324    2389
   49152   13987   11299    1068    1061   17771   13626    1750    1774
   98304   14041   11331     971     864   18568   13699    1586    1558
  196608   14031   11379     927     685   18627   13752    1491    1175
  393216   14044   11397     908     623   18637   13741    1450     992
  786432   14037   11373     898     603   18579   13650    1430     935
 1572864   13844   11407     890     614   18624   13720    1418     924

 At 32 bits

       6   24759   28651   24162   27110   30309   42529   30315   42969
      96   22385   23808   13417    8855   29721   34194   23310   14622
    1536   21480   18032    5369    4573   26884   23312    8845    7302
  393216   13743   11378     906     693   18574   13708    1450    1097
  786432   13809   11398     896     670   18578   13753    1430    1033


 AMD Phenom(tm) II X4 945 Processor 3.0 GHz

 Random/Serial Memory Test 64 Bit Version 2 Tue Dec 14 17:21:46 2010

          Integer.......................  Double/Integer................
          Serial........  Random........  Serial........  Random........
     RAM    Read  Rd/Wrt    Read  Rd/Wrt    Read  Rd/Wrt    Read  Rd/Wrt
      KB  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec  MB/Sec

       6   12542    9137   12636    9066   16812   13621   16795   13621
      12   12613    9165   12676    9137   17022   13705   17013   13673
      24   12647    9179   12734    9157   17129   13720   17130   13694
      48   12664    9186   12775    9161   17183   13728   17183   13719
      96   11989    8464    6866    5221   16934   11776   16496   11888
     192    7778    8434    3703    3177   16902   11747    7146    6132
     384    7778    8437    3001    2749   16918   11671    5116    4730
     768    4956    7348    1954    1900    9978    9459    3670    3591
    1536    4763    7201    1404    1388    9748    9346    2488    2474
    3072    4016    6914    1078    1045    9531    9200    2048    2043
    6144    3668    6769     750     661    9004    8719    1405    1280
   12288    2771    3636     590     502    6688    5495    1012     848
   24576    2850    3592     504     450    6706    5506     841     736
   49152    2858    3583     439     402    6719    5332     727     659
   98304    2679    3536     333     307    6697    5490     612     564
  196608    2729    3548     266     241    6945    5445     459     422
  393216    2866    3559     229     200    6931    5490     377     336
  786432    2870    3547     192     167    6938    5499     327     283

 At 32 bits

       6   14488   11399   12852   11133   16741   20258   16789   19825
      96   11088    9912    6861    5520   16960   16197   16554   14645
    1536    8044    7528    1410    1390    9668    9223    2475    2461
  393216    4296    3575     281     258    6668    5497     491     458
  786432    4296    3562     238     212    6841    5492     396     361


 Intel Core 2 CPU 6600 @ 2.40GHz

 At 64 bits

       6    9142   12213    9154    5161   13728   16211   13727   15654
      96    8019    9473    4113    3701   11381   11971    7382    6419
    1536    7978    8586    2691    2497   11269   11044    4760    4222
  393216    3285    2273     238     207    5705    2999     503     374
  786432    3297    2277     149     152    5637    3001     297     281


SSEfpu Benchmark

This is a variation of the SSE3DNow Benchmark with extensions but excluding AMD 3DNow tests. The benchmark measures Single Precision (SP) and Double Precision (DP) floating point speeds, with data streaming from caches and RAM. It uses SSE (SP) and SSE2 (DP) assembly code instructions, along with compiled C code that produces the old x87 instructions at 32 bits and SSE instructions on a 64 bit system. The additional tests avoid intermediate register to register operations, using s=(s+x[m])*y[m] and s=s+x[m]+y[m] to produce much faster speeds. The AMD processor performs relatively better on the extra test with linked add and multiply, at 7.11 floating point results per clock cycle on the Phenom, against 6.33 on the Core i7. On the other tests, the Core i7 regains the lost ground and also obtains a high throughput on RAM based data.
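The MFLOPS and MFLOPS/MHz summary rows in the results below appear to follow from the MB/second columns by counting floating point operations per byte read. The following is a minimal sketch of that arithmetic, assuming each loop pass reads one x[m] and one y[m] word; the function names and the operation counts are inferred from the loop expressions and the published figures, not taken from the benchmark source code.

```python
# Sketch only: relate the MB/second columns to the "Maximum MFLOPS" rows.
# The flops-per-pass counts are inferred from the loop expressions quoted
# in the text, not taken from the benchmark source code.

def mflops_from_mb_per_sec(mb_per_sec, bytes_per_word, flops_per_pass):
    """Each pass is assumed to read two words, x[m] and y[m]."""
    bytes_per_pass = 2 * bytes_per_word
    return mb_per_sec * flops_per_pass / bytes_per_pass


def mflops_per_mhz(mflops, cpu_mhz):
    return mflops / cpu_mhz


# Core i7 4820K at 3900 MHz, fastest cache based figures from the table.
linked_sp = mflops_from_mb_per_sec(98675, bytes_per_word=4, flops_per_pass=2)
sse2_dp = mflops_from_mb_per_sec(41447, bytes_per_word=8, flops_per_pass=2)

print(round(linked_sp), round(mflops_per_mhz(linked_sp, 3900), 2))  # 24669 6.33
print(round(sse2_dp), round(mflops_per_mhz(sse2_dp, 3900), 2))      # 5181 1.33
```

Under these assumptions the sketch reproduces the Core i7 summary values for the linked add and multiply test and for the SSE2 double precision test.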

The 32 bit and 64 bit benchmarks, source code and instructions can be downloaded in memory_benchmarks.tar.gz with more details and results in Linux Results SSEfpu.

Core i7 4820K 3.9 GHz Turbo Boost - 1 CPU

SSE & SSE2 Memory Reading Speed Test 64-Bit Version 2.1
0.100 seconds per test, Start Tue Dec 2 17:46:19 2014

    Memory      --s=s+x[m]*y[m]---      --x[m]=x[m]+y[m]--   (s+x[m])?y[m]
    KBytes    SSE2     SSE    Sngl    SSE2     SSE    Sngl   +*SSE   ++SSE
      Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S

         4   40997   41051   10763   78459   75013   28446   87877   59793  L1
         8   41168   41321   10588   78338   78301   27627   96326   60640
        16   41366   41444   10505   80368   80739   27706   98675   60593
        32   41423   41511   10462   80669   81160   27759   92764   60609
        64   41432   41427   10445   50083   50209   27169   57689   57389  L2
       128   41447   41508   10412   49595   49500   27192   55731   56598
       256   39673   39746   10414   46176   46119   26167   48386   48563
       512   37246   37301   10417   32252   32250   24247   39595   39688  L3
      1024   36639   36601   10425   31307   31197   24044   38688   38794
      2048   36640   36824   10421   31262   31328   24138   38804   38750
      4096   36900   36899   10393   31379   31381   24227   38739   38942
      8192   36585   36615   10403   31076   31063   24076   38442   38534
     16384   23186   23097    9577   15371   15292   16067   22518   22562  RAM
     32768   22592   22574    9573   14973   15013   15743   21935   22058
     65536   22603   22504    9596   15041   14972   15718   22061   22052
    131072   22612   22612    9582   15038   15030   15672   22096   22003
    262144   22629   22610    9584   15049   15044   15698   22040   22109
    524288   22638   22654    9592   15057   15056   15682   22101   22101
   1048576   22618   22481    9598   15038   15049   15605   22110   22104
   2097152   22671   22648    9608   15050   15051   15546   22094   22129
   4194304   22671   22668    9597   15044   15056   15691   22112   22128

              SSE2     SSE    Norm    SSE2     SSE    Norm     SSE     SSE
Maximum         DP      SP      SP      DP      SP      SP      SP      SP
MFLOPS        5181   10378    2691    5042   10145    3556   24669   15160
MFLOPS/MHz    1.33    2.66    0.69    1.29    2.60    0.91    6.33    3.89

 MB/sec at 32 bits
         8   41382   41382   10592   79081   78697   20892   92411   61511
       128   41604   41586   10436   49128   49126   18239   55914   56067
      1024   36098   35957   10425   31113   31127   16998   38204   38336
    131072   21010   20979   10092   14783   14774   12497   20655   20626

AMD Phenom(tm) II X4 945 Processor 3.0 GHz

SSE & SSE2 Memory Reading Speed Test 64-Bit Version 2.0
0.100 seconds per test, Start Tue Dec 21 12:18:05 2010

    Memory      --s=s+x[m]*y[m]---      --x[m]=x[m]+y[m]--   (s+x[m])?y[m]
    KBytes    SSE2     SSE    Sngl    SSE2     SSE    Sngl   +*SSE   ++SSE
      Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S

         4   22773   22689    6156   43460   42950   23333   66361   41700
         8   23421   23377    6089   45716   45433   23624   78620   44642
        16   23623   23691    6059   42561   42562   23724   84534   45885
        32   23834   23827    6043   45141   45140   23797   82980   46315
        64   23921   23918    6035   44686   45478   23823   85405   46897
       128   23859   23901    6029   22154   22157   17973   23785   23782
       256   23821   23764    6027   21555   21535   18026   23888   23889
       512   19300   19264    6010   17865   17840   16359   19219   19222
      1024   10376   10379    5965   10168   10168   10228   10371   10373
      2048   10369   10372    5966   10163   10163   10236   10369   10368
      4096   10261   10281    5862    9975    9975   10025   10278   10278
      8192    8053    8190    5362    6841    6836    6863    8029    8027
     16384    7985    8095    5327    6572    6569    6651    7848    7883
     32768    8074    8099    5314    6424    6531    6660    7858    7928
     65536    8148    8151    5321    6599    6607    6674    7961    7961
    131072    8092    8159    5320    6585    6412    6484    7891    7936
    262144    8112    8173    5318    6580    6556    6665    7887    7960
    524288    8117    8042    5327    6607    6604    6689    7861    7961
   1048576    8147    8108    5328    6535    6581    6668    7941    7816

              SSE2     SSE    Norm    SSE2     SSE    Norm     SSE     SSE
Maximum         DP      SP      SP      DP      SP      SP      SP      SP
MFLOPS        2990    5980    1539    2857    5685    2978   21351   11724
MFLOPS/MHz    0.99    1.99    0.51    0.95    1.95    0.99    7.11    3.90

 MB/sec at 32 bits    Different #####
         8   23188   23276    6057   45641   43156   11688   78703   44729
       128   23634   23692    5997   22418   22250    9893   23671   23664
      1024   10248   10254    5930   10056   10053    8682   10253   10253
    131072    8258    8276    5389    6680    6698    6098    7909    8091

Intel Core 2 CPU 6600 @ 2.40GHz

 At 64 bits    Different ##### ##### #####
         8   25420   25368    6506   37691   37692   13152   36503   36637
       128   18481   18655    6406   17105   17107   12704   19725   19744
      1024   18517   18749    6391   17136   17137   12690   19803   19822
    131072    6444    6419    5455    3955    3956    3863    6399    6393

Maximum
MFLOPS/MHz    1.32    2.64    0.68    0.98    1.96    0.68    3.80    3.81


nVidia CUDA Benchmarks and Burn-in Tests

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions can be executed simultaneously on sections of data from a data array. For maximum speeds, the data array has to be large and with little or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache like memory.

The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance, using varying data array sizes and increasing numbers of processing instructions per data access. There are five scenarios - New Calculations with data in and out, Update Data with just data out, Graphics Only Data using only graphics RAM and two extra tests with lower overheads. The tests are run at three different data sizes, with defaults of 100,000 words repeated 2500 times, 1M words 250 times and 10M words 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
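As an illustration of the arithmetic described above, here is a small sketch of the 8 operation form of the update (three adds, three multiplies, one subtract and one add per word) and of how an MFLOPS figure follows from words, operations per word, repeat passes and elapsed seconds. It runs on the CPU in plain Python rather than CUDA, and the constants a to f and the function names are hypothetical, chosen only for the example.

```python
# Illustrative sketch of the 8-operation update and the MFLOPS calculation.
# Constants a to f are hypothetical; the benchmark's own values are not
# given in this document.

def update_8_ops(x, a, b, c, d, e, f):
    """Apply x[i] = (x[i]+a)*b - (x[i]+c)*d + (x[i]+e)*f to every element."""
    return [(v + a) * b - (v + c) * d + (v + e) * f for v in x]


def mflops(words, ops_per_word, repeats, seconds):
    """Millions of floating point operations per second."""
    return words * ops_per_word * repeats / seconds / 1e6


print(update_8_ops([1.0, 2.0], a=0.5, b=2.0, c=0.25, d=1.0, e=0.125, f=0.5))
# [2.3125, 3.8125]

# 100,000 words, 2 operations per word, 2500 passes in 1.035893 seconds
print(round(mflops(100000, 2, 2500, 1.035893)))   # 483
```

The last line uses the words, passes and seconds from the first GTS 250 result line below and gives the 483 MFLOPS reported there.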

The benchmarks, source code and instructions can be downloaded in linux_cuda_mflops.tar.gz, with more details and results in linux_cuda_mflops.htm, the latter showing how to use the benchmarks as reliability/burn-in tests. Results added in 2014 are for a mid range GeForce GTX 650, with a 3.7 GHz Core i7, via Windows 8.1 and Ubuntu 14.04. A maximum of 412 GFLOPS was demonstrated, making it more than twice as fast as a more expensive GTS 250 from three years earlier. The i7 Asus P9X79 LE motherboard has PCI Express 3.0 x16 which, along with faster RAM and CPU GHz, produces the fastest speeds so far where data in and out, or data out only, is involved. Earlier systems probably had PCIe 1, with a maximum bandwidth of 4 GB/s, or PCIe 2 at 8 GB/s, compared with 15.74 GB/s for PCIe 3. Below are full results for the GTS 250 and the GTX 650.
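The PCIe bandwidth figures quoted above can be derived from the per lane transfer rate and the line encoding of each generation. This sketch assumes the standard rates of 2.5, 5 and 8 GT/s, with 8b/10b encoding for PCIe 1 and 2 and 128b/130b for PCIe 3; the function name is illustrative only.

```python
# Sketch: theoretical one-direction bandwidth of a 16 lane PCIe slot.

def pcie_gb_per_sec(gigatransfers_per_sec, payload_bits, line_bits, lanes=16):
    """Usable GB/second after removing line encoding overhead."""
    usable_gbits_per_lane = gigatransfers_per_sec * payload_bits / line_bits
    return usable_gbits_per_lane * lanes / 8   # 8 bits per byte


print(round(pcie_gb_per_sec(2.5, 8, 10), 2))     # PCIe 1 x16: 4.0
print(round(pcie_gb_per_sec(5.0, 8, 10), 2))     # PCIe 2 x16: 8.0
print(round(pcie_gb_per_sec(8.0, 128, 130), 2))  # PCIe 3 x16: 15.75
```

The PCIe 3 result is within rounding of the 15.74 GB/s quoted in the text.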

Phenom II 3.0 GHz GeForce GTS 250

Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Wed Dec 29 15:35:35 2010

CUDA devices found
Device 0: GeForce GTS 250 with 16 Processors 128 cores
Global Memory 999 MB, Shared Memory/Block 16384 B, Max Threads/Block 512

Using 256 Threads

Test              4 Byte  Ops  Repeat    Seconds   MFLOPS            First   All
                   Words  /Wd  Passes                              Results  Same

Data in & out     100000    2    2500   1.035893      483  0.9295383095741   Yes
Data out only     100000    2    2500   0.514445      972  0.9295383095741   Yes
Calculate only    100000    2    2500   0.082464     6063  0.9295383095741   Yes

Data in & out    1000000    2     250   0.706176      708  0.9925497770309   Yes
Data out only    1000000    2     250   0.380928     1313  0.9925497770309   Yes
Calculate only   1000000    2     250   0.051266     9753  0.9925497770309   Yes

Data in & out   10000000    2      25   0.639933      781  0.9992496371269   Yes
Data out only   10000000    2      25   0.339051     1475  0.9992496371269   Yes
Calculate only  10000000    2      25   0.041672    11999  0.9992496371269   Yes

Data in & out     100000    8    2500   1.013196     1974  0.9569796919823   Yes
Data out only     100000    8    2500   0.490317     4079  0.9569796919823   Yes
Calculate only    100000    8    2500   0.088028    22720  0.9569796919823   Yes

Data in & out    1000000    8     250   0.666709     3000  0.9955092668533   Yes
Data out only    1000000    8     250   0.351320     5693  0.9955092668533   Yes
Calculate only   1000000    8     250   0.052704    37948  0.9955092668533   Yes

Data in & out   10000000    8      25   0.620265     3224  0.9995486140251   Yes
Data out only   10000000    8      25   0.335467     5962  0.9995486140251   Yes
Calculate only  10000000    8      25   0.044453    44992  0.9995486140251   Yes

Data in & out     100000   32    2500   1.057142     7568  0.8900792598724   Yes
Data out only     100000   32    2500   0.531691    15046  0.8900792598724   Yes
Calculate only    100000   32    2500   0.128706    62157  0.8900792598724   Yes

Data in & out    1000000   32     250   0.688714    11616  0.9880728721619   Yes
Data out only    1000000   32     250   0.375411    21310  0.9880728721619   Yes
Calculate only   1000000   32     250   0.075172   106423  0.9880728721619   Yes

Data in & out   10000000   32      25   0.644074    12421  0.9987990260124   Yes
Data out only   10000000   32      25   0.357000    22409  0.9987990260124   Yes
Calculate only  10000000   32      25   0.062001   129029  0.9987990260124   Yes

Extra tests - loop in main CUDA Function

Calculate       10000000    2      25   0.050288     9943  0.9992496371269   Yes
Shared Memory   10000000    2      25   0.009206    54313  0.9992496371269   Yes

Calculate       10000000    8      25   0.049608    40316  0.9995486140251   Yes
Shared Memory   10000000    8      25   0.017254   115916  0.9995486140251   Yes

Calculate       10000000   32      25   0.050531   158320  0.9987990260124   Yes
Shared Memory   10000000   32      25   0.046626   171580  0.9987990260124   Yes

##############################################################################

Core i7 4820K 3.9 GHz Turbo Boost GeForce GTX 650

Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Tue Dec 30 22:50:52 2014

CUDA devices found
Device 0: GeForce GTX 650 with 2 Processors 16 cores
Global Memory 999 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

Using 256 Threads

Test              4 Byte  Ops  Repeat    Seconds   MFLOPS            First   All
                   Words  /Wd  Passes                              Results  Same

Data in & out     100000    2    2500   0.837552      597  0.9295383095741   Yes
Data out only     100000    2    2500   0.389646     1283  0.9295383095741   Yes
Calculate only    100000    2    2500   0.085709     5834  0.9295383095741   Yes

Data in & out    1000000    2     250   0.441478     1133  0.9925497770309   Yes
Data out only    1000000    2     250   0.229017     2183  0.9925497770309   Yes
Calculate only   1000000    2     250   0.051727     9666  0.9925497770309   Yes

Data in & out   10000000    2      25   0.369060     1355  0.9992496371269   Yes
Data out only   10000000    2      25   0.201172     2485  0.9992496371269   Yes
Calculate only  10000000    2      25   0.048027    10411  0.9992496371269   Yes

Data in & out     100000    8    2500   0.708377     2823  0.9571172595024   Yes
Data out only     100000    8    2500   0.388206     5152  0.9571172595024   Yes
Calculate only    100000    8    2500   0.092254    21679  0.9571172595024   Yes

Data in & out    1000000    8     250   0.478644     4178  0.9955183267593   Yes
Data out only    1000000    8     250   0.231182     8651  0.9955183267593   Yes
Calculate only   1000000    8     250   0.053854    37138  0.9955183267593   Yes

Data in & out   10000000    8      25   0.370669     5396  0.9995489120483   Yes
Data out only   10000000    8      25   0.202392     9882  0.9995489120483   Yes
Calculate only  10000000    8      25   0.049263    40599  0.9995489120483   Yes

Data in & out     100000   32    2500   0.725027    11034  0.8902152180672   Yes
Data out only     100000   32    2500   0.407579    19628  0.8902152180672   Yes
Calculate only    100000   32    2500   0.113188    70679  0.8902152180672   Yes

Data in & out    1000000   32     250   0.497855    16069  0.9880878329277   Yes
Data out only    1000000   32     250   0.261461    30597  0.9880878329277   Yes
Calculate only   1000000   32     250   0.060132   133042  0.9880878329277   Yes

Data in & out   10000000   32      25   0.375882    21283  0.9987964630127   Yes
Data out only   10000000   32      25   0.207640    38528  0.9987964630127   Yes
Calculate only  10000000   32      25   0.054718   146204  0.9987964630127   Yes

Extra tests - loop in main CUDA Function

Calculate       10000000    2      25   0.018107    27613  0.9992496371269   Yes
Shared Memory   10000000    2      25   0.007775    64308  0.9992496371269   Yes

Calculate       10000000    8      25   0.025103    79671  0.9995489120483   Yes
Shared Memory   10000000    8      25   0.008724   229241  0.9995489120483   Yes

Calculate       10000000   32      25   0.036397   219797  0.9987964630127   Yes
Shared Memory   10000000   32      25   0.019414   412070  0.9987964630127   Yes


Disk, Bus and LAN Benchmarks

These benchmark tests are based on those produced for Windows, where details and results can be found in DiskGraf Results.htm and CDDVDSpd Results.htm. The tests comprise:

Writing and Reading Large Files - Five files each of 8 MB, 16 MB and 32 MB are used.
System is instructed not to cache the data.

Writing and Reading Cached Data - Five files of 8 MB are used. Performance normally
reflects memory speed.

Reading Bus Speed - The same data is read repetitively at block sizes between 64 KB and
1 MB. This normally reads data from the disk’s buffer to show maximum bus speeds.

Random Reading Speed - 1 KB blocks are read randomly from 7 file sizes between 2 MB
and 128 MB. Results reflect the disk's buffer size and rotation speed.

Writing and Reading Small Files - 500 files are written, read and deleted at 6 different
file sizes each between 2 KB and 64 KB. Besides speed, milliseconds per file is provided
to reflect overheads.

Run time parameters - These are provided to write and read larger files and to specify
the drive and file path to be used.

Besides testing disk and flash memory drives, it was intended to use the (drivespeed) benchmarks for measuring speed over connections such as Local Area Networks (LANs). In order to avoid data being cached in main memory by the Operating System, the program uses direct I/O (file open parameter O_DIRECT for Linux). This prevented the tests running on directories mounted over a LAN, so a second program (lanspeed) was produced, identical except with no direct I/O parameter. Compilations at both 32 bits and 64 bits were produced - drivespeed32, lanspeed32, drivespeed64 and lanspeed64. The lanspeed tests can be used to measure speeds between Linux platforms and also between Linux and Windows systems. A Windows program, drivespeed32.exe, is also provided and this can also be used as a LAN speed test.
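The difference between the two programs can be sketched as follows. This is not the benchmark's C code, just a minimal Python illustration with hypothetical function names: it opens a file with O_DIRECT where the flag exists and the filesystem accepts it, and otherwise falls back to ordinary buffered reads, which is in effect what lanspeed does. Direct I/O needs an aligned buffer, so an anonymous memory map (page aligned) is used, and os.readv is only available on Unix like systems.

```python
import mmap
import os


def _read_blocks(fd, block_size):
    """Read the whole file in blocks through a page aligned buffer."""
    buf = mmap.mmap(-1, block_size)   # anonymous map, page aligned
    total = 0
    try:
        while True:
            count = os.readv(fd, [buf])
            if count == 0:            # end of file
                return total
            total += count
    finally:
        buf.close()


def read_all(path, block_size=64 * 1024, direct=True):
    """Return bytes read, using O_DIRECT if possible, else buffered I/O."""
    flags = os.O_RDONLY
    if direct:
        flags |= getattr(os, "O_DIRECT", 0)   # 0 where the flag is missing
    try:
        fd = os.open(path, flags)
        try:
            return _read_blocks(fd, block_size)
        finally:
            os.close(fd)
    except OSError:
        if not direct:
            raise
        # Direct I/O refused (as on some network mounts): retry buffered.
        return read_all(path, block_size, direct=False)
```

Calling read_all("somefile") returns the number of bytes read; timing the call with and without direct=True gives a rough indication of how much the Operating System's file cache contributes.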

The execution files and source code, along with compiling and running instructions, can be downloaded in linux_disk_usb_lan_benchmarks.tar.gz, with linux_disk_usb_lan_benchmarks.htm providing details and results. Example results are below.

The latest version has an added test to measure Random Writing Speed. The second set of results below is from 2014, on the 3.7 GHz Core i7 via Ubuntu 14.04, using a Seagate Expansion USB 3.0 disk drive. Further details and comparisons with a number of Flash Drives are in the results report.

Current Directory Path: /media/f816ec76-8bf2-4dd3-9e98-62934909a779/roy/all64/drivespeed2
Total MB 11263, Free MB 9513, Used MB 1750

Linux Storage Speed Test 64-Bit Version 1.1, Tue Feb 1 14:20:39 2011

 8 MB File             1        2        3        4        5
 Writing MB/sec     4.33    76.73    76.15    82.40   105.84
 Reading MB/sec    57.37    86.62    83.40    80.74    82.34

 16 MB File            1        2        3        4        5
 Writing MB/sec    73.94   108.16    72.53   116.19   116.12
 Reading MB/sec    70.39   103.31   120.31   121.53   121.48

 32 MB File            1        2        3        4        5
 Writing MB/sec   113.01    76.67    73.20   115.83   116.05
 Reading MB/sec   105.19   102.41   113.15   121.55   120.59

 ---------------------------------------------------------------------
 8 MB Cached File      1        2        3        4        5
 Writing MB/sec  1271.71  1503.73  1496.38  1493.27  1491.68
 Reading MB/sec  3406.70  4015.11  4079.82  4081.24  4080.77

 ---------------------------------------------------------------------
 Bus Speed Block KB   64      128      256      512     1024
 Reading MB/sec    84.93   102.31   112.31   121.03   116.41

 ---------------------------------------------------------------------
 1 KB Reads File MB >    2      4      8     16     32     64    128
 Random Read msecs    0.43   0.39   0.45   3.01   4.49   5.93   6.69

 ---------------------------------------------------------------------
 500 Files    Write             Read            Delete
 File KB     MB/sec  ms/File   MB/sec  ms/File  Seconds
       2       7.54     0.27     7.67     0.27    0.015
       4      17.19     0.24    22.27     0.18    0.018
       8      20.24     0.40    27.21     0.30    0.017
      16      33.27     0.49    47.16     0.35    0.019
      32      52.67     0.62    67.20     0.49    0.016
      64      55.43     1.18    75.49     0.87    0.015

######################################################################

3.7 GHz Core i7, Seagate Expansion USB 3.0 Disk Drive

Current Directory Path: /home/roy/benchmarks/Old/drivespeed
Total MB 446040, Free MB 435358, Used MB 10681

Linux Storage Speed Test 64-Bit Version 1.2, Sun Dec 28 11:36:15 2014

 8 MB File             1        2        3        4        5
 Writing MB/sec   165.25    70.00    29.78    26.55    41.54
 Reading MB/sec    28.61    68.77    74.49    89.81   148.71

 16 MB File            1        2        3        4        5
 Writing MB/sec    94.83   105.93    90.70   101.86    88.25
 Reading MB/sec    70.23    90.52    84.74    43.40    98.24

 32 MB File            1        2        3        4        5
 Writing MB/sec   118.93   102.33    95.05    94.94   105.92
 Reading MB/sec    85.99   102.28    99.45   104.34   112.30

 ---------------------------------------------------------------------
 8 MB Cached File      1        2        3        4        5
 Writing MB/sec  2388.78  2453.24  2468.73  2351.90  2472.20
 Reading MB/sec  7077.93  8329.63  8966.46  8957.32  8925.51

 ---------------------------------------------------------------------
 Bus Speed Block KB   64      128      256      512     1024
 Reading MB/sec   165.98   146.73   177.92   197.84   202.40

 ---------------------------------------------------------------------
 1 KB Blocks File MB >   2      4      8     16     32     64    128
 Random Read msecs    0.17   0.15   0.17   2.33   6.44   6.90   8.04
 Random Write msecs   0.12   0.19   0.14   1.70  13.34   2.39   8.61

 ---------------------------------------------------------------------
 500 Files    Write             Read            Delete
 File KB     MB/sec  ms/File   MB/sec  ms/File  Seconds
       2       7.48     0.27    12.00     0.17    0.004
       4      25.33     0.16    29.68     0.14    0.004
       8      48.45     0.17    32.84     0.25    0.008
      16      73.08     0.22    37.87     0.43    0.004
      32      80.54     0.41    55.88     0.59    0.004
      64     107.98     0.61    82.93     0.79    0.009


Burn-In and Reliability Testing Apps

A new set of programs has been designed for soak testing Linux based PCs. The execution files and source code, along with compile and run instructions, can be downloaded in linux_burn-in_apps.tar.gz. Full details and results are provided in linux burn-in apps.htm.

These programs are intended to stress test CPUs, caches, RAM, buses, disks and other drives using high processing speeds, to induce heating effects, and varying data bit order, to investigate possible pattern conscious faults. Common features are command line options to specify memory/storage demands, running time and different results log file names, for use in multiprocessor tests. Data read and results of calculations are also checked for correct or consistent values. Versions compiled to run on 32-Bit and 64-Bit processors are provided.
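The idea of varying the data bit order and checking what is read back can be sketched as below. This is a simplified Python illustration, not the assembly code based IntBurn program: the six 64 bit patterns are those shown in the IntBurn log further down, while the function names and the 512 word (4 KB) buffer size are assumptions made for the example.

```python
from array import array

# 64 bit test patterns, as listed in the IntBurn results log.
PATTERNS = [
    0x0000000000000000, 0xFFFFFFFFFFFFFFFF, 0xA5A5A5A5A5A5A5A5,
    0x5555555555555555, 0x3333333333333333, 0xF0F0F0F0F0F0F0F0,
]


def pattern_pass(pattern, words=512):
    """Fill a buffer with one pattern, read it back, count mismatches."""
    buffer = array("Q", [pattern]) * words      # 512 x 8 bytes = 4 KB
    return sum(1 for value in buffer if value != pattern)


def run_patterns(words=512):
    """Return 'OK' or 'ERROR' for each pattern, keyed by its hex form."""
    return {
        f"{pattern:016X}": "OK" if pattern_pass(pattern, words) == 0 else "ERROR"
        for pattern in PATTERNS
    }


for name, result in run_patterns().items():
    print("Pattern", name, "Result", result)
```

A real soak test repeats such passes for minutes or hours at the highest speed the hardware allows, so that heating effects have time to show up as wrong values.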

Three new programs provided are BurnInSSE, IntBurn and DriveStress, but they can also be used in conjunction with programs produced earlier. BurnInSSE64 and BurnInSSE32 were compiled to use the same range of SSE floating point instructions, where GCC generates fast execution speeds. The IntBurn tests are based on assembly code, with IntBurn32 using 32 bit integers and IntBurn64 accessing a larger number of 64 bit registers. DriveStress32 and DriveStress64 were compiled from the same C code and measure drive and bus speeds (e.g. SATA or USB) whilst checking data read for correct values. Earlier programs that also have reliability testing options, and are included in the package, are Livermore Loops and nVidia CUDA Benchmarks.

Successes - Three significant problems were identified during testing. The first was apparently excessive temperatures on a desktop PC, compared with earlier measurements via Windows. This was cured by clearing dust out of the CPU heatsink using a compressed air sprayer. Then there were two Linux peculiarities that seem to be affected by power saving options. A desktop PC with a Core 2 Duo CPU showed a throughput increase of three times using both cores. Here, using one core with “On-Demand” CPU GHz (via Frequency Scaling Monitor), the processor was running at 1.6 GHz instead of 2.4 GHz. Then a laptop, again with a Core 2 Duo CPU, overheated, causing the CPU to run at less than half speed. Unlike using Windows, on powering up into Ubuntu, initial CPU temperatures were high, with the fan not appearing to run as fast as it might. On an apparently random basis, the laptop started at a lower temperature and did not overheat, with the fan apparently running at high speed.

Paging/Swapping Tests - Running multiple copies of the processor exercise programs, with appropriate parameters to demand more main memory capacity than is available, will lead to data being swapped out/in to/from disk. However, with excessive demands, running times can be unpredictable.

Multitasking Scripts - Examples are provided showing how to mix and match programs and run time parameters to soak test complete systems for as long as is required. They also demonstrate how to organise dynamically displayed results in multiple X terminal windows.

The test programs display and log results of calculations and speeds at regular intervals. Examples are shown below, with interpretation and more details in linux burn-in apps.htm.

The htm report includes results on the Core i7, showing variances caused by Hyperthreading. The tests comprised six copies of BurnInSSE and the most demanding CUDA Shared Memory test, over 10 minutes. Temperatures were measured using Psensor and CPU results shown are averages over readings for four cores. CPU GFLOPS are total from the six different streams. The CUDA program uses more than 100% of one core and the CPU produces more GFLOPS than 4 times that from one core, due to hyperthreading effects. Maximum temperatures are not excessive.

IntBurn

 Test 4 KB at 10x2 seconds per test, Start at Thu Mar 17 12:00:59 2011

 Write/Read
  1  10529 MB/sec  Pattern 0000000000000000 Result OK 25705389 passes
  2  10579 MB/sec  Pattern FFFFFFFFFFFFFFFF Result OK 25826660 passes
  3  10592 MB/sec  Pattern A5A5A5A5A5A5A5A5 Result OK 25858754 passes
  4  10587 MB/sec  Pattern 5555555555555555 Result OK 25846727 passes
  5  10601 MB/sec  Pattern 3333333333333333 Result OK 25880968 passes
  6  10602 MB/sec  Pattern F0F0F0F0F0F0F0F0 Result OK 25883259 passes
 Max  2236 64 bit MIPS

 Read
  1  16941 MB/sec  Pattern 0000000000000000 Result OK 82719400 passes
  2  16946 MB/sec  Pattern FFFFFFFFFFFFFFFF Result OK 82744300 passes
  3  16932 MB/sec  Pattern A5A5A5A5A5A5A5A5 Result OK 82676600 passes
  4  16927 MB/sec  Pattern 5555555555555555 Result OK 82653700 passes
  5  16883 MB/sec  Pattern 3333333333333333 Result OK 82439400 passes
  6  16857 MB/sec  Pattern F0F0F0F0F0F0F0F0 Result OK 82311300 passes
 Max  2515 64 bit MIPS

BurnInSSE

 Using 400 KBytes, 32 Operations Per Word, For Approximately 1 Minutes

 Pass    4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All
          Words  Word   Passes                            Results  Same

    1    100000    32    67500      15.10    14304    0.356166393   Yes
    2    100000    32    67500      15.11    14296    0.356166393   Yes
    3    100000    32    67500      15.09    14312    0.356166393   Yes
    4    100000    32    67500      15.33    14091    0.356166393   Yes

DriveStress

 File size 10.25 MB x 4 files, minimum reading time 1 minutes

 File 1  10.25 MB written in 0.12 seconds
 File 2  10.25 MB written in 0.14 seconds
 File 3  10.25 MB written in 0.11 seconds
 File 4  10.25 MB written in 0.14 seconds

 Start Reading Sun Apr 17 20:06:07 2011
 Read passes 18 x 4 Files x 10.25 MB in 0.25 minutes
 Read passes 36 x 4 Files x 10.25 MB in 0.51 minutes
 Read passes 54 x 4 Files x 10.25 MB in 0.76 minutes
 Read passes 72 x 4 Files x 10.25 MB in 1.01 minutes

 Start Repeat Read Sun Apr 17 20:08:08 2011
 Passes in 1 second(s) for each of 164 blocks of 64KB:

  1440  1480  1480  1480  1480  1400  1480  1480  1480  1460
  1380  1480  1480  1460  1480  1440  1440  1480  1480  1480
  1440  1460  1480  1440  1480  1460  1500  1460  1480  1760
  1540  1480  1480  1440  1480  1480  1480  1480  1460  1440
  1480  1480  1480  1460  + another 120 results

 No errors found during reading tests

############################################################################

Core i7 3.7 GHz, GeForce GTX 650

                                                                 Stand
                                                           Max   Alone
                                                           Over
        ------------------ GFLOPS ------------------       15s     4

 CPU      90   86   99   83   96   86   86   88   97   99   109   116
 GPU     430  430  430  430  430  430  430  430  430  430   430   430

 Minute    0    1    2    3    4    5    6    7    8    9   10   Rise
        ------------------- °C -----------------------

 CPUs     32   55   58   60   61   62   62   63   63   63   63     31
 GPU      30   46   53   56   58   59   59   60   60   60   60     30


Multithreading Benchmarks

These multithreading tests are based on the above benchmarks, in turn, Maximum CPU Speeds, Whetstone Classic Benchmark, Original OpenMP Benchmark, MemSpeed Benchmark, BusSpeed Benchmark and RandMem Benchmark. For further details, sample results, benchmark programs, source code and instructions see linux multithreading benchmarks.htm and linux_multithreading_apps.tar.gz.

Six benchmarks are provided that can run using up to 64 concurrent threads, with versions compiled to run using 64 bit or 32 bit systems. Performance is mainly measured as Millions of Instructions Per Second (MIPS), Millions of Floating Point Operations Per Second (MFLOPS) or Millions of Bytes per Second (MB/S).

Simple Add Tests - execute 32 bit or 64 bit integer instructions and 128 bit SSE floating point functions via assembly language. These use simple add operations with little access to external data. Resultant performance is generally proportional to the number of CPU cores with some gains also identified when Hyperthreading is available. Each thread executes independent code.

Whetstone Benchmark - is the first general purpose benchmark that set industry standards of computer system performance, mainly dependent on floating point speed but with some independently timed integer test functions. Data used is generally contained in L1 cache with performance gains again proportional to the number of cores. Each thread again executes independent code.

MP MFLOPS Program - uses the same functions as my CUDA and OpenMP benchmarks, comprising routines with 2, 8 and 32 add or multiply floating point calculations with data from higher level caches or RAM. The 64 bit version compiles using SSE floating point, where up to 6 MFLOPS per CPU MHz per core can be produced. The 32 bit program uses the much slower original 80387 FPU instructions. These programs can also be used as burn-in/reliability tests. Each thread executes the same functions but on a different segment of the data.
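The arrangement where each thread runs the same function on its own segment of the data can be sketched as follows. This is an illustration in Python, not the benchmark's C code; the function names, the constants and the 2 operation form (x + a) * b are assumptions for the example. Python threads will not show the speed gains of the compiled benchmark, as the interpreter normally runs one thread at a time, so the sketch only demonstrates the partitioning.

```python
from concurrent.futures import ThreadPoolExecutor


def update_segment(data, start, stop, a=0.5, b=2.0):
    """Assumed 2 operation form: one add and one multiply per word."""
    for i in range(start, stop):
        data[i] = (data[i] + a) * b


def run_threads(data, threads=4):
    """Split data into contiguous segments, one per thread, update in place."""
    step = max(1, (len(data) + threads - 1) // threads)
    bounds = [(s, min(s + step, len(data))) for s in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(update_segment, data, s, e) for s, e in bounds]
        for future in futures:
            future.result()        # re-raise any error from a worker
    return data


print(run_threads([1.0] * 10, threads=4))   # every element becomes 3.0
```

Because the segments do not overlap, no locking is needed and the result is the same whatever order the threads finish in.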

MP Memory Speed Tests - employ three sequences of operations, using double and single precision floating point numbers and integers, on data sized between 4 KB and 25% of RAM size. The operations are memory to memory transfers with 0, 1 and 2 arithmetic calculations. The 64 bit version again uses SSE functions but not as efficiently as MP MFLOPS. Again each thread has the same procedures using different segments of the data. Calculations are the same as MemSpeed Benchmark, used with OpenMP, where there is no programmable control on the order in which data is accessed.

MP Memory Bus Speed Tests - read data at a range of sizes covering caches and RAM. Data is accessed with varying address increments to identify reading data in bursts over the bus and allow estimation of maximum bus/memory speed. This time, each thread reads all the data. The 64 bit version uses the double size 8 byte words, where data transfer speed can be twice that of the 32 bit compilation, demonstrating that 32 and 64 bit integer instructions can execute at the same speed.

MP Memory Random Access Speed Benchmark - comprises serial and random access read and read/write tests that cover cache and RAM data sizes. All threads access the same data but starting at different points. In this case, data could be corrupted with concurrent updates, but the Operating System appears to flush caches to avoid this, producing extremely slow performance. Extra tests (Mutex) avoid this conflict by executing one read/write test at a time, leading to some slower and some faster speeds. Random access can be affected by burst reading/writing with associated poor performance.
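The Mutex variation can be pictured with the sketch below, where all threads share one array and start at different points, but a lock allows only one read/write pass to run at a time. It is a minimal Python illustration with hypothetical names, not the benchmark code, and it shows the correctness aspect rather than the speed effects described above.

```python
import threading


def mutex_run(words=1024, threads=4):
    """Each thread read/writes every word once, one thread at a time."""
    data = [0] * words
    lock = threading.Lock()

    def worker(start):
        with lock:                          # one read/write pass at a time
            for k in range(words):
                i = (start + k) % words     # wrap round from the start point
                data[i] = data[i] + 1

    pool = [
        threading.Thread(target=worker, args=(t * words // threads,))
        for t in range(threads)
    ]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return data


print(set(mutex_run(64, 4)))   # {4}: every word updated once per thread
```

With the lock in place, each word ends up incremented exactly once per thread, whatever the starting points.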

Examples of results log format on a quad core 3.0 GHz Phenom II are given below.

Simple Add Tests

 Multithreading Add Test 64 bit Version 1.0 Thu May 5 11:35:18 2011

 Integer Additions 4 Threads
   Thread 4 -  8281 64 bit Integer MIPS
   Thread 2 -  7996 64 bit Integer MIPS
   Thread 1 -  7815 64 bit Integer MIPS
   Thread 3 -  7800 64 bit Integer MIPS
 Total     - 31892 64 Bit Integer MIPS
 Aggregate - 31201 64 Bit Integer MIPS, based on last to finish

 SSE Floating Point Additions 4 Threads
   Thread 2 - 12030 32 Bit SSE MFLOPS
   Thread 3 - 11976 32 Bit SSE MFLOPS
   Thread 4 - 11861 32 Bit SSE MFLOPS
   Thread 1 - 11692 32 Bit SSE MFLOPS
 Total     - 47559 32 Bit SSE MFLOPS
 Aggregate - 46770 32 Bit SSE MFLOPS, based on last to finish

Whetstone MP Benchmark

 Multithreading Single Precision Whetstones 64-Bit Version 1.0
 Using 4 threads - Sat May 14 12:03:51 2011

          MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
 Thread               1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

   1       2861     927     872     747      71      38    2947    2259     629
   2       2865     875     892     745      71      38    3294    2198     641
   3       2875     869     892     744      71      38    3408    2202     645
   4       2896     906     895     744      72      38    3141    2232     651

 Total    11496    3577    3550    2979     285     151   12790    8891    2566
 MWIPS    11389 Based on time for last thread to finish

MP MFLOPS Benchmark

 64 Bit MP SSE MFLOPS Benchmark 1, 4 Threads, Tue May 17 19:00:43 2011

 Test              4 Byte  Ops/  Repeat    Seconds   MFLOPS     First   All
                    Words  Word  Passes                       Results  Same

 Data in & out     102400     2   10000   0.091754    22321  0.764063   Yes
 Data in & out    1024000     2    1000   0.136134    15044  0.970753   Yes
 Data in & out   10240000     2     100   0.632075     3240  0.997008   Yes

 Data in & out     102400     8   10000   0.167023    49047  0.850923   Yes
 Data in & out    1024000     8    1000   0.176219    46488  0.982342   Yes
 Data in & out   10240000     8     100   0.658828    12434  0.998200   Yes

 Data in & out     102400    32   10000   0.558509    58670  0.660143   Yes
 Data in & out    1024000    32    1000   0.556450    58888  0.953631   Yes
 Data in & out   10240000    32     100   0.722131    45377  0.995203   Yes

MP Memory Speed

 MP Memory Reading Speed Test 64 Bit Version 1 Using 4 Threads
 Start of test Tue Jun 7 11:32:54 2011

  Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]          x[m]=y[m]
  KBytes    Dble    Sngl   Int64    Dble    Sngl   Int64    Dble    Sngl   Int64
    Used    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S    MB/S

       4   15704   11347   10961   17813   12518   15904   13744    8714    8758
       8   24188   15367   14929   26770   17870   21025   20789   10866   10234
      16   33319   19229   18266   38724   23589   23124   31390   13114   13157
      32   40697   20675   21180   51120   27260   25282   39385   13921   13960
      65   45013   22913   22267   57143   30132   24875   42247   14314   14241
     131   45569   23573   22953   61979   31356   27585   44688   14427   13289
     262   48701   23759   22666   63235   32103   27892   44447   14200   14453
     524   44900   22996   20417   53167   30753   25832   36085   14671   13403
    1048   44929   23357   20300   54596   30302   25790   36207   14708   13590
    2097   42017   22864   20927   42429   28809   24778   26734   13125   12659
    4194   34909   20379   19542   36402   25268   21093   18592   12625   12821
    8388   22498   17592   17006   23354   19577   18854   12489    9400    9657
   16777    8906    8697    8781    8884    8841    8844    4433    4217    4440
   33554    8848    8684    8606    8877    8436    8843    4412    4293    4422
   67108    8423    8445    8433    8685    8506    8526    4228    4296    4273
  134217    8704    8453    8572    8563    8426    8485    4383    4303    4346
  268435    8623    8579    8539    8731    8652    8612    4408    4301    4322
  536870    8683    8331    8534    8724    8658    8444    4371    4330    4325

MP Memory Bus Speed

 MP Bus Speeds 32 bit Version 1.0, 4 Threads, Fri Jun 17 16:44:21 2011

   Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2

        6      3901      7614     14703     28644     29313     34882     74424
       24      7466     14648     28660     29468     37750     40926     79860
       96      4648      5085      8422     19230     33948     39486     74050
      384      4774      5131      9864     19142     32406     41067     82021
      768      2726      2746      5361      9874     17152     30193     42259
     1536      2407      2543      4943     10058     17570     29261     41159
    16380       812       837      1684      3635      6772     12743     16252
   131070       786       813      1605      3444      6259     12161     14950
   393210       807       855      1649      3333      6234     11625     14892

MP Memory Random Access

 RandMemMP Speeds 64 Bit Version 1, 4 Threads, Sun Jun 26 18:00:21 2011

            ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    29630   53166   44120   44829   29620   29671   12108   11987
 Serial RW     5040    7334    7442    7402    7353    7395    8532    6247
 Random RD    28388   41211   27807   12265    8866    6611    2103    1271
 Random RW      657    1096    1229    1283    1288    1376    1648     993
 Mutex SRW     5962    8654    7998    7882    6982    6853    3579    3415
 Mutex RRW     6243    8594    5838    2815    1970    1370     486     310
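The Simple Add log reports both a Total and an Aggregate figure, the latter "based on last to finish". One way to read this, offered here as an assumption rather than a description of the program's code, is that every thread does the same amount of work, so the slowest thread sets the elapsed time and the aggregate is the thread count multiplied by the slowest rate. The function names are illustrative.

```python
# Sketch: total versus aggregate throughput for a multithreaded run.
# Assumes every thread performs the same amount of work (not confirmed
# by the source), so the last thread to finish sets the elapsed time.

def total_rate(thread_rates):
    """Sum of the rates measured separately by each thread."""
    return sum(thread_rates)


def aggregate_rate(thread_rates):
    """All the work divided by the time taken by the slowest thread."""
    return len(thread_rates) * min(thread_rates)


integer_mips = [8281, 7996, 7815, 7800]
sse_mflops = [12030, 11976, 11861, 11692]

print(total_rate(integer_mips), aggregate_rate(integer_mips))  # 31892 31200
print(total_rate(sse_mflops), aggregate_rate(sse_mflops))      # 47559 46768
```

The totals match the log exactly and the aggregates come within a few units of the logged 31201 and 46770, the small differences presumably being due to rounding of the per thread rates.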


Core i7 Multithreading Benchmarks

This is a quad core/8 thread 3.7 GHz Core i7 4820K with 10 MB L3 cache, normally running at the Turbo Boost speed of 3.9 GHz. It has 4 memory channels with a maximum speed of 800 MHz (bus speed) x 2 (DDR) x 4 (channels) x 8 (bus width) or 51.2 GB/second.
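The peak memory bandwidth calculation above can be written out as a short sketch, which also shows how a measured speed compares with the theoretical maximum. The function and parameter names are illustrative, and the 40 GB/second measurement is the approximate figure reported below for the revised MP BusSpeed test.

```python
# Sketch of the peak memory bandwidth arithmetic quoted in the text.

def peak_memory_mb_per_sec(bus_mhz, transfers_per_clock, channels, bus_bytes):
    """Theoretical peak in MB/second (1 MB = 10**6 bytes)."""
    return bus_mhz * transfers_per_clock * channels * bus_bytes


def fraction_of_peak(measured_mb_per_sec, peak_mb_per_sec):
    return measured_mb_per_sec / peak_mb_per_sec


peak = peak_memory_mb_per_sec(bus_mhz=800, transfers_per_clock=2,
                              channels=4, bus_bytes=8)
print(peak / 1000)                      # 51.2 GB/second
print(fraction_of_peak(40000, peak))    # 0.78125, about 78% of peak
```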

Simple Add Tests - See also Maximum CPU Speed Tests, where the stand-alone speeds are slightly faster than those for single threads here. For these particular code sequences, it also seems that eight threads are required to come near a four times performance improvement, where throughput is 12.2 MIPS/MHz and 15.8 MFLOPS/MHz.
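As a rough check of these figures, the sketch below recomputes the eight-thread gains and the per-MHz throughput from the single thread and eight thread aggregate results in the Simple Add Tests log. It assumes the per-MHz values are based on the 3.9 GHz Turbo Boost clock, which is what the quoted numbers appear to correspond to; the variable names are illustrative.

```python
CLOCK_MHZ = 3900  # assumption: Turbo Boost speed, not the 3.7 GHz base clock

# Results copied from the Simple Add Tests log
single_thread = {"MIPS": 11937, "MFLOPS": 15450}
eight_threads = {"MIPS": 47387, "MFLOPS": 61540}  # aggregate, based on last to finish

speedup = {}
per_mhz = {}
for unit in single_thread:
    speedup[unit] = eight_threads[unit] / single_thread[unit]
    per_mhz[unit] = eight_threads[unit] / CLOCK_MHZ
    print(f"{unit}: {speedup[unit]:.2f} times gain, {per_mhz[unit]:.1f} {unit}/MHz")
```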

Whetstone MP Benchmark - The single core version of this benchmark does not use pipelines very efficiently but, using 8 threads, performance of the MFLOPS tests is increased by 7.8 times, compared with 4 to 5 times on the integer routines.
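The gains quoted here can be recomputed from the Whetstone MP log by dividing the eight thread totals by the single thread results. The sketch below does this for the first MFLOPS test and the three integer tests; the selection of columns is an assumption about which figures the text refers to.

```python
# 1 Thrd and 8 thread Total rows copied from the Whetstone MP log
one_thread = {"MFLOPS 1": 1331, "Fixpt": 4720, "If": 5855, "Equal": 983}
eight_total = {"MFLOPS 1": 10321, "Fixpt": 25079, "If": 23559, "Equal": 5033}

gains = {test: eight_total[test] / one_thread[test] for test in one_thread}
for test, gain in gains.items():
    print(f"{test:9s} {gain:.1f} times")
```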

MP MFLOPS Benchmark - This uses the same basic C code as the OpenMP variety. See comparisons above. Note that there is a second version, compiled to use AVX instructions. Maximum speed of one core, with linked multiply and add, is 31.2 GFLOPS using SSE instructions, and twice that with AVX. With 4 cores, SSE and AVX maximum GFLOPS are 124.8 and 249.6, with 75% and 71% of these being demonstrated.
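The theoretical maxima can be derived as clock speed x SIMD width x operations per clock. The sketch below assumes 4 single precision values per 128 bit SSE register, 8 per 256 bit AVX register, one multiply plus one add per clock and the 3.9 GHz Turbo Boost speed. The "demonstrated" figures are taken to be the highest 8 thread results in the log below (93282 and 177831 MFLOPS), which is an interpretation rather than something the text states.

```python
CLOCK_GHZ = 3.9          # assumption: Turbo Boost speed
OPS_PER_CLOCK = 2        # linked multiply and add
CORES = 4


def peak_gflops(simd_lanes, cores=1):
    """Peak single precision GFLOPS for the given SIMD width and core count."""
    return CLOCK_GHZ * simd_lanes * OPS_PER_CLOCK * cores


sse_one_core = peak_gflops(4)
sse_all = peak_gflops(4, CORES)
avx_all = peak_gflops(8, CORES)

# Highest measured MFLOPS in the MP MFLOPS log, converted to GFLOPS
sse_pct = 100 * (93282 / 1000) / sse_all
avx_pct = 100 * (177831 / 1000) / avx_all
print(f"SSE {sse_all:.1f} GFLOPS peak, {sse_pct:.0f}% demonstrated")
print(f"AVX {avx_all:.1f} GFLOPS peak, {avx_pct:.0f}% demonstrated")
```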

MP BusSpeed - This did not benefit from running with 8 threads, compared with four. Measured maximum RAM speed was greater than the 51.2 GB/second specification. This was due to all threads reading the same data, in conjunction with the 10 MB shared cache. A new version was produced to minimise the effect, with threads starting to read from different addresses, still in the same data array, reducing maximum speed to 40 GB/second or less.

MP MemSpeed - This firstly shows single and double precision multiply + add tests, using one and eight threads, with normal 64 bit compilation and, again, with AVX options, then with one thread for a 32 bit Operating System. There are some start up overheads, providing slower performance than MemSpeed Benchmark above, using one thread, but, as each thread handles a unique segment of data, cache flushing is minimised with multiple threads. The benchmarks’ assembly code listings show that full SIMD SSE and AVX instructions are used but, possibly because of compiling for multiple threads, there are excessive numbers of addition instructions generated. This leads to some speeds being slower than OpenMP MemSpeed and to SSE/SSE2 being faster than AVX.

The additional results, for the second tests with just addition, show that the compiled code is much better, with SSE/SSE2 speeds similar to MemSpeed via OpenMP and AVX instructions providing appropriate performance gains. Even so, none of these GFLOPS speeds are close to the maximum potential of 31.2 single precision GFLOPS per core with SSE, or twice that using AVX instructions (half these figures with double precision).
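The Max GFLOPS rows in the MP Memory Speed results appear to follow from the MB/second figures if each element processed is counted as two words of data (one from x and one from y). The sketch below is an inference that reproduces the reported values, not the benchmark's own code; the function name is illustrative.

```python
def gflops_from_mb_per_s(mb_per_s, bytes_per_word, flops_per_element):
    """Convert a measured MB/second figure to GFLOPS.

    Assumes each element of x[m]=x[m]+s*y[m] (2 flops) or
    x[m]=x[m]+y[m] (1 flop) accounts for two words of data.
    """
    elements_per_s = mb_per_s / (2 * bytes_per_word)  # millions of elements
    return elements_per_s * flops_per_element / 1000.0


# Maximum MB/second values copied from the results tables
mult_add_dble_1t = gflops_from_mb_per_s(30754, 8, 2)    # reported 3.8
mult_add_sngl_8t = gflops_from_mb_per_s(60777, 4, 2)    # reported 15.2
add_sngl_avx_8t = gflops_from_mb_per_s(230972, 4, 1)    # reported 28.9
print(round(mult_add_dble_1t, 1), round(mult_add_sngl_8t, 1), round(add_sngl_avx_8t, 1))
```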

MP Random Access Benchmark - As expected, multithreading performance can be worse than using a single thread when write back to memory is used, but reasonable performance and improvements were possible with data in the large L3 cache. Using Mutex restrictions led to no real gains from multithreading.

Simple Add Tests
 Multithreading Add Test 64 bit Version 1.0 Sat Nov 8 12:16:25 2014

 Integer Additions 8 Threads

 Thread 3  -   6318 64 bit Integer MIPS
 Thread 5  -   6307 64 bit Integer MIPS
 Thread 2  -   6241 64 bit Integer MIPS
 Thread 6  -   6212 64 bit Integer MIPS
 Thread 7  -   6124 64 bit Integer MIPS
 Thread 4  -   6036 64 bit Integer MIPS
 Thread 8  -   6001 64 bit Integer MIPS
 Thread 1  -   5923 64 bit Integer MIPS
 Total     -  49162 64 Bit Integer MIPS
 Aggregate -  47387 64 Bit Integer MIPS, based on last to finish

 SSE Floating Point Additions 8 Threads

 Thread 7  -   7767 32 Bit SSE MFLOPS
 Thread 8  -   7765 32 Bit SSE MFLOPS
 Thread 3  -   7752 32 Bit SSE MFLOPS
 Thread 4  -   7749 32 Bit SSE MFLOPS
 Thread 5  -   7738 32 Bit SSE MFLOPS
 Thread 2  -   7727 32 Bit SSE MFLOPS
 Thread 1  -   7725 32 Bit SSE MFLOPS
 Thread 6  -   7693 32 Bit SSE MFLOPS
 Total     -  61916 32 Bit SSE MFLOPS
 Aggregate -  61540 32 Bit SSE MFLOPS, based on last to finish

 Single Thread
              11937 64 Bit Integer MIPS
              15450 32 Bit SSE MFLOPS

 Two Threads
              23069 64 Bit Integer MIPS
              30887 32 Bit SSE MFLOPS

 Four Threads
              24717 64 Bit Integer MIPS
              24167 64 Bit Integer MIPS, based on last to finish
              46409 32 Bit SSE MFLOPS
              30903 32 Bit SSE MFLOPS, based on last to finish
Whetstone MP Benchmark
 Multithreading Double Precision Whetstones 64-Bit Version 1.0
 Using 8 threads - Sat Nov 8 14:58:12 2014

          MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
 Thread               1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS

   1       3828    1321    1320     959      92      62    3156    2963     629
   2       3803    1270    1321     952      92      61    3155    2930     628
   3       3811    1315    1282     956      92      61    3125    2990     630
   4       3807    1259    1280     952      92      62    3145    2958     629
   5       3821    1286    1287     961      92      62    3087    2926     629
   6       3815    1283    1284     962      91      62    3134    2933     629
   7       3818    1300    1306     956      92      62    3135    2929     629
   8       3821    1286    1304     958      92      62    3143    2931     629

 Total    30524   10321   10384    7657     733     494   25079   23559    5033

 Total
 1 Thrd    4648    1331    1331     977     122      70    4720    5855     983
 2 Thrd    9274    2661    2660    1945     243     140    9769   11717    1964
 4 Thrd   18078    5263    5229    3907     488     265   15620   17408    3929
MP MFLOPS Benchmark
 MFLOPS 1 to 8 Threads

   4 Byte  Ops/  Repeat     SSE  ------- SSE --------  ------- AVX --------
    Words  Word  Passes   1 CPU       1       4       8       1       4       8

   100000     2    2500    9918    9681   45340   54621   12542   62273   60258
  1000000     2     250    9688    9759   21688   41832   11404   23031   44329
 10000000     2      25    5870    5990    9237   10026    5991    8970    9977

   100000     8    2500   24448   24533   49320   92086   35982  159040  173224
  1000000     8     250   24465   24570   49918   92352   36180   80096  151909
 10000000     8      25   20055   19975   36638   39982   23299   40124   40153

   100000    32    2500   23251   23269   46942   92408   46400   90572  173372
  1000000    32     250   23265   23307   89676   93282   46572   91058  177831
 10000000    32      25   23063   23052   91029   92050   44729   88877  158594
MP Memory Speed
 x[m]=x[m]+s*y[m]

           64b 1 Thread  64b 8 Thread   64b AVX 1 T   64b AVX 8 T  32b 1 Thread
   KBytes   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl
     Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

        4  29668  15246  37397  22021  16828  10053  38396  31823  22323  11275  L1
        8  30422  15420  52063  33134  16865  10130  46928  32871  22744  11317
       16  30754  15503  69122  44818  16891  10136  53801  37870  22887  11340
       32  30680  15459  98246  51419  16102  10134  66372  37707  22870  11324
       64  28867  15292 103196  54739  16872  10132  68113  39620  22352  11281  L2
      128  28955  15286 115996  53402  16895  10132  61264  36423  22359  11296
      256  28741  15287 113644  60777  16785  10134  68244  40618  22165  11296
      512  24664  15200 116243  60628  16580  10128  65631  37589  21408  11285  L3
     1024  24662  15207 117177  57777  16620  10087  63796  37746  21288  11270
     2048  24424  15207  95433  58470  16444   9827  64988  40739  21305  11268
     4096  24408  14253  98608  57900  15592   9839  63209  36650  20958  11141
     8192  24213  14940  99671  56541  15666   8823  67851  38623  20297  11030
    16384  14983  11747  28689  28004  12310   9117  30911  28600  15179  10297  RAM
    32768  14667  11464  25857  25885  12253   9098  24926  24294  15075   9576
    65536  14523  11772  24875  24963  11968   9016  24070  22805  14547   9738
   131072  14433  11570  24789  24833  12564   9180  23856  25190  15249  10246
   262144  14266  11165  25525  24575  12529   8851  25236  22608  15273  10252
   524288  14386  11824  25054  24707  12338   8931  24974  24490  15295  10268
  1048576  14452  11468  25402  25735  11954   8972  24917  24153  15308  10278
  2097152  14908  11769  25100  25402  12396   8901  24545  25061
  4194304  14938  11916  24785  24556  12284   9007  24608  25285

 Max GFLOPS  3.8    3.9   14.6   15.2    2.1    2.5    8.5   10.2    2.9    2.8

 x[m]=x[m]+y[m]

           64b 1 Thread  64b 8 Thread   64b AVX 1 T   64b AVX 8 T  32b 1 Thread
   KBytes   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl   Dble   Sngl
     Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       16  41065  20688  82924  46075  61385  61280 116710  90819  27816  14030  L1
      128  34323  20476 140036  76202  48299  47771 226979 230972  26712  13977  L2
     8192  26045  19106 108046  80815  28005  27977 121758 113292  22607  13535  L3
   131072  15644  14115  25675  25639  14893  14915  24319  25609  15862  12099  RAM

 Max GFLOPS  2.6    2.6    8.8   10.1    3.8    7.7   14.2   28.9    1.7    1.8
MP Memory Bus Speed
 MP Bus Speeds 64 bit Version 1.0, 4 Threads, Sun Nov 23 10:35:01 2014

   Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2

        6     76609     51101     75602    140546    104501    167163    205782  L1
       24    120982    107268    113828    153185    170288    149892    248761
       96     41962     40737     43299     73311    123250    160399    240730  L2
      384     19664     20262     20831     38942     75517    128002    160495  L3
      768     19242     19941     20676     39821     73897    127177    152781
     1536     19103     19854     20683     39137     54701    127196    152980
    16380      6210      6913      8363     14942     29204     56919     56522  RAM
   131070      5901      6947      8368     15029     29096     51843     61776
   393210      5909      5426      8370     12684     29097     58307     59609

 1 Thread
        6     31501     31266     31243     41117     36617     41277     61526
      768      5303      5386      5499     10808     19429     33765     38337
   131070      1229      1470      2054      4514      8754     18043     18094

 MP Bus Speeds 64 bit Version 2.0, 4 Threads, Sun Nov 23 10:35:44 2014
 Same as Version 1.0, except each thread starts at different address

   Kbytes  Inc32wds  Inc16wds   Inc8wds   Inc4wds   Inc2wds   ReadAll  128bSSE2

        6     28749     29616     58739     64451     61610    129160    231735
       24    114043    117435    119746    108799    143160    163902    245756
       96     39170     40423     42705     76442    110895    154667    240928
      384     19631     20232     20793     40066     69429    126075    158417
      768     19212     19923     20648     39748     72952    125329    151560
     1536     19086     19296     20661     39791     73469    120135    152311
    16380      5843      6857      8210     14523     27776     55150     59064
   131070      2038      3108      5197     10201     20004     38092     40726
   393210      2090      3101      5072      9867     19538     39489     39824
   786420      2083      2943      5082     10133     20016     37592     40764
  1572840      2025      3011      5091     10207     19039     39479     40781

 1 Thread
        6     31501     31266     31243     41117     36617     41277     61526
      768      5303      5386      5499     10808     19429     33765     38337
   131070      1226      1484      2096      4411      8462     18188     18382
MP Memory Random Access
 RandMemMP Speeds 64 Bit Version 1, 8 Threads, Sat Nov 8 12:41:51 2014

            ------------------ MBytes Per Second At --------------------
               6 KB   24 KB   96 KB  384 KB  768 KB 1536 KB   12 MB   96 MB

 Serial RD    37112   77469   94806   94862   90795   86826   65882   56315
 Serial RW     8924   29533   54380   47712   51176   69146   68008   22145
 Random RD    36944   76814   62245   33838   24552   21588   13472    3341
 Random RW     2000    6016    9058   17412   16237   16733   10066    2806
 Mutex SRW     7829   16705   19723   16432   16331   16570   11550   10669
 Mutex RRW    10672   20797    8933    5659    4844    4561    2659     940

 RandMemMP Speeds 64 Bit Version 1, 1 Threads, Sat Nov 8 12:39:21 2014

 Serial RD    28021   27808   20268   19318   19231   19255   12455   11589
 Serial RW    29972   30232   21894   17867   17410   17420   12242   11581
 Random RD    27479   27463   13595    8251    6228    5605    2470    1011
 Random RW    30429   30076    9224    6120    5177    4782    2800     982
 Mutex SRW    29987   30245   21895   17875   17419   17249   12373   11495
 Mutex RRW    30417   30027    9199    6117    5175    4780    2796     982


Image Processing Benchmarks

SDL_bmpspd32 and SDL_bmpspd64 benchmarks execute the same tests as the Windows version, where details and results can be found in bmpspeed results.htm. They are 32 bit and 64 bit varieties compiled to run under Linux using Simple DirectMedia Layer (SDL) functions. The benchmarks generate BMP files and measure speed of saving, loading, scrolling, rotating and editing of 0.5, 1, 2, 4 etc. to 512 MB images.

The programs automatically adjust the maximum image size used, depending on available main memory, but run time parameters can be used to change this. The execution files, source code, compilation and running instructions can be found in linux_image_processing_benchmarks.tar.gz with further details in linux image processing benchmarks.htm. Example results are below. Besides the standard Configuration Details shown earlier, the benchmark obtains additional system attributes, which are included in the following example results.

Hardware benchmarked for the main report comprised desktops, a laptop and a netbook using internal and external (eSATA) disk drives plus USB flash memory and disk drives. Linux versions used were 32-Bit and 64-Bit Ubuntu 10.10 with GNOME 2, 64-Bit Ubuntu 11.04 with Unity on two different graphics arrangements, 64-Bit Fedora 14 with GNOME 2 and 64-Bit OpenSuse 11.4 with KDE.

Results are also provided for the Core i7, with a USB 3.0 disk drive, plus faster CPUs, memory and graphics card.

 Additional System Details

 #####################################################################

 2.4 GHz Core 2 Duo, eSATA disk

 Memory stats from /proc/meminfo
 MemTotal:       3963.8 MB  A
 MemFree:        3181.8 MB  B
 Buffers:          46.5 MB  C
 Cached:          297.5 MB  D
 Memory Used:     438.0 MB  = A - B - C - D

 Current Directory Path (getcwd) and drive space (statvfs):
 /home/roy/all64/bmpspd
 Total MB   11263, Free MB    9446, Used MB    1817
 See files hd1.txt and hd2.txt for details of drive used

 SDL_GetVideoInfo hw_available flag is 0 - cannot create hardware surfaces
 Display size 1280 x 1024 pixels at 32 bits
 SDL_VideoDriverName = x11

 Graphics (command - lspci | grep -i vga > vga.txt)
 VGA compatible controller: nVidia Corporation G84 [GeForce 8600 GT] (rev a1)

 #####################################################################

 Image Editing Speeds 64 Bit Version 1, Sat Aug 6 09:45:47 2011

   Input  Enlarge     Save     Load   Scroll   Scroll   Rotate   Max MB
   Image  Display           Display   Repeat  Overall   90 deg   Memory
  Mbytes     Secs     Secs     Secs    msecs   MB/Sec     Secs     Used

     0.5     0.02     0.01     0.01     0.83   601.15     0.01    440.2
     1.0     0.02     0.05     0.02     1.63   612.30     0.02    441.9
     2.0     0.02     0.02     0.03     3.31   634.52     0.02    445.4
     4.0     0.03     0.04     0.06     5.66   625.44     0.03    451.6
     8.0     0.05     0.08     0.11     6.73   584.70     0.05    464.7
    16.0     0.09     0.16     0.20     6.77   580.53     0.08    489.5
    32.0     0.16     0.29     0.31     6.70   587.05     0.16    541.1
    64.0     0.29     0.59     0.71     6.94   566.85     0.32    672.4
   128.0     0.59     1.32     1.22     6.64   592.54     0.65    785.3
   256.0     1.14     2.35     2.60     6.63   593.46     3.51   1129.9
   512.0     2.27     4.90     4.73     6.65   591.47     3.91   1822.9

 #####################################################################

 3.7 GHz Core i7 (3.9 GHz Turbo Boost), USB 3 disk

 Memory stats from /proc/meminfo
 MemTotal:      32114.1 MB  A
 MemFree:       30952.5 MB  B
 Buffers:          40.2 MB  C
 Cached:          376.1 MB  D
 Memory Used:     745.4 MB  = A - B - C - D

 Current Directory Path (getcwd) and drive space (statvfs):
 /home/roy/benchmarks/Old/bmpspd/bin64
 Total MB  446040, Free MB  435462, Used MB   10577
 See files hd1.txt and hd2.txt for details of drive used

 SDL_GetVideoInfo hw_available flag is 0 - cannot create hardware surfaces
 Display size 1920 x 1080 pixels at 32 bits
 SDL_VideoDriverName = x11

 Graphics (command - lspci | grep -i vga > vga.txt)
 VGA compatible controller: NVIDIA Corporation GK107 [GeForce GTX 650] (rev a1)

 #####################################################################

 Image Editing Speeds 64 Bit Version 1, Sat Dec 27 09:58:41 2014

   Input  Enlarge     Save     Load   Scroll   Scroll   Rotate   Max MB
   Image  Display           Display   Repeat  Overall   90 deg   Memory
  Mbytes     Secs     Secs     Secs    msecs   MB/Sec     Secs     Used

     0.5     0.01     0.01     0.02     0.65   774.44     0.00    751.4
     1.0     0.01     0.11     0.01     1.04   957.30     0.01    752.2
     2.0     0.02     0.01     0.03     1.87  1121.19     0.01    756.3
     4.0     0.02     0.03     0.03     3.37  1108.22     0.02    763.0
     8.0     0.03     0.05     0.15     4.72  1119.93     0.02    774.7
    16.0     0.05     0.09     0.26     5.61  1108.62     0.04    800.4
    32.0     0.06     0.31     0.51     5.02  1239.99     0.05    853.0
    64.0     0.11     0.56     0.62     5.52  1126.91     0.12    983.3
   128.0     0.20     1.32     1.28     5.87  1059.86     0.23   1095.7
   256.0     0.38     2.78     2.67     5.86  1062.25     0.58   1443.1
   512.0     0.74     5.42     5.07     6.35   979.01     0.83   2135.9
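The Memory Used figure in these logs is MemTotal less MemFree, Buffers and Cached. A minimal Python sketch of the same calculation is given below. It parses text in the format of /proc/meminfo, where values are in kB, and assumes conversion to MB by dividing by 1024; the sample text contains made-up values for illustration, not measurements from the systems above.

```python
def memory_used_mb(meminfo_text):
    """Return MemTotal - MemFree - Buffers - Cached in MB.

    meminfo_text is the content of /proc/meminfo (values in kB).
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    used_kb = (fields["MemTotal"] - fields["MemFree"]
               - fields["Buffers"] - fields["Cached"])
    return used_kb / 1024.0


# Hypothetical sample, not taken from the benchmark logs
SAMPLE_MEMINFO = """MemTotal:        4096000 kB
MemFree:         2048000 kB
Buffers:           51200 kB
Cached:           307200 kB
"""

sample_used = memory_used_mb(SAMPLE_MEMINFO)
print(f"Memory Used: {sample_used:.1f} MB")
```

On a Linux system the function could be applied to the live file with memory_used_mb(open("/proc/meminfo").read()).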


OpenGL Benchmark

The benchmarks, videogl32 and videogl64, are 32-Bit and 64-Bit Linux compilations of OpenGL code used for testing via Windows. Details and results can be found in Linux OpenGL Benchmarks.htm. The benchmarks measure graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines. The second has colours and textures applied to the surfaces.

The textures are obtained from 24 bit BMP files that can be up to 256 x 256 pixels at 192 KB. The BMP files and Linux execution files can be found in linux_opengl_benchmarks.tar.gz, along with source code, compilation and running instructions. Windows benchmarks from the same source code are also included.
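The 192 KB figure follows from the pixel count and colour depth, as in this small sketch. It covers the pixel data only, ignoring the BMP file header, and takes 1 KB as 1024 bytes.

```python
def texture_kbytes(width, height, bits_per_pixel):
    """Size of uncompressed pixel data in KB (1 KB = 1024 bytes)."""
    return width * height * (bits_per_pixel // 8) / 1024


max_texture_kb = texture_kbytes(256, 256, 24)
print(max_texture_kb)
```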

The benchmarks were run on a variety of Ubuntu, Fedora and OpenSuse distros and different PC hardware, with nVidia, ATI and Intel graphics. Newly installed Linux systems do not [so far] provide OpenGL hardware acceleration and, except for nVidia, finding such a driver that works with a particular release is seemingly impossible, in some cases. As a default, the benchmark runs using a full screen window, but input parameters allow different sized windows to be used, via Terminal commands or a script file. Following are example log files from tests using a Core 2 Duo CPU and GeForce 8600 GT graphics, using a default driver and one from nVidia. Decreasing performance, as the window size increases, suggests a graphics speed limitation, with constant performance indicating that processor speed is the limiting factor.
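The rule of thumb in the last sentence can be expressed as a simple comparison of FPS at the smallest and largest window sizes. The sketch below is one possible way of applying it; the 20% tolerance is an arbitrary assumption, and the sample values are taken from the Core 2 Duo example logs that follow.

```python
def limiting_factor(fps_by_window_size, tolerance=0.2):
    """Classify results ordered from smallest to largest window.

    Returns "graphics" if FPS falls by more than the tolerance as the
    window grows, otherwise "processor". The tolerance is an assumption.
    """
    first, last = fps_by_window_size[0], fps_by_window_size[-1]
    drop = (first - last) / first
    return "graphics" if drop > tolerance else "processor"


# Coloured Objects, Few, first log: 320x240 up to 1280x1024
few_coloured = limiting_factor([221.7, 60.9, 23.7, 15.6])
# WireFrm Kitchen, second log
wireframe = limiting_factor([401.0, 414.5, 401.8, 400.2])
print(few_coloured, wireframe)
```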

2014 results for the Core i7 system are also provided below, where speeds can all be twice those on the Core 2 Duo.

 #####################################################################

 Linux OpenGL Benchmark 64 Bit Version 1, Wed Oct 26 22:29:24 2011

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240    221.7    158.1    162.4    109.3     72.1     48.0
   640   480     60.9     53.5     46.2     37.6     52.7     22.2
  1024   768     23.7     22.0     18.4     15.6     34.9     10.7
  1280  1024     15.6     14.6     12.0     10.3     28.5      7.4

                   End at Wed Oct 26 22:31:38 2011

 #####################################################################

 Linux OpenGL Benchmark 64 Bit Version 1, Tue Oct 25 18:36:45 2011

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240   3670.2   2326.6   1160.9    678.8    401.0    229.2
   640   480   2463.1   2033.9    896.3    666.3    414.5    231.3
  1024   768   1089.2    987.3    541.6    440.9    401.8    214.6
  1280  1024    727.0    680.8    412.1    338.3    400.2    194.0

                   End at Tue Oct 25 18:38:58 2011

 #####################################################################

 3.7 GHz Core i7, Ubuntu 14.04, GeForce GTX 650

 Linux OpenGL Benchmark 64 Bit Version 1, Fri Jan 2 11:16:35 2015

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   320   240   7488.1   4641.5   2094.8   1249.6    774.4    398.7
   640   480   6630.8   5549.2   2217.1   1250.0    744.9    395.3
  1024   768   3399.2   3174.1   1958.8   1195.9    655.6    342.4
  1280  1024   2151.5   2075.0   1481.5   1158.0    762.3    376.7
  1680  1050   1753.3   1692.3   1289.0   1036.6    696.7    361.3
  1920  1080   1563.3   1512.1   1189.6    986.0    779.2    375.5

                   End at Fri Jan 2 11:19:54 2015


On-Line Benchmarks

A Java version of the Whetstone Classic Benchmark, executed via a downloaded HTML page, was produced in 1997. Because of the timing considerations in those days, the benchmark ran for 100 seconds. It also included a measurement of graphics speed. Running this via Firefox and Linux identified some unacceptable text displays and measured speeds, due to over-optimisation. The code was modified slightly to avoid this, running time was reduced and graphics tests were excluded, for a new version, compiled via Java installed under Linux.

The benchmark is run via WhetJava2.html or indirectly from online benchmarks.html, which also includes tests to measure downloading speed of images (see below). Performance results are produced in graphics format, which can be kept using Take ScreenShot. A version of the new benchmark was also compiled that runs from a Terminal command, to produce text output to the window and a log file. The format is the same as the graphics display and an example is given below.

Results via Linux and Windows are available in Whetstone Benchmark Results - Java. These show differences in 32 bit vs 64 bit, Windows vs Linux, On-line vs Off-line and same results with different browsers. The benchmarks, including source code, can be downloaded from onlinetests.zip or onlinetests.tar.gz.

 *************************************************************

 3.0 GHz Phenom

 Whetstone Benchmark Java Version, Dec 8 2011, 23:38:14

                                                       1 Pass
 Test                  Result       MFLOPS     MOPS  millisecs

 N1 floating point  -1.124750137    894.69             0.0215
 N2 floating point  -1.131330490    732.82             0.1834
 N3 if then else     1.000000000            1027.81    0.1007
 N4 fixed point     12.000000000            1735.54    0.1815
 N5 sin,cos etc.     0.499110132              41.15    2.0220
 N6 floating point   0.999999821    496.69             1.0860
 N7 assignments      3.000000000             582.23    0.3174
 N8 exp,sqrt etc.    0.825148463              33.54    1.1090

 MWIPS                             1991.45             5.0215

 Operating System    Linux, Arch. amd64, Version 2.6.34-12-desktop
 Java Vendor         Sun Microsystems Inc., Version 1.6.0_26

 *************************************************************

 3.7 GHz Core i7

 Whetstone Benchmark Java Version, Jan 4 2015, 11:53:10

                                                       1 Pass
 Test                  Result       MFLOPS     MOPS  millisecs

 N1 floating point  -1.124750137   1280.00             0.0150
 N2 floating point  -1.131330490   1150.68             0.1168
 N3 if then else     1.000000000            1358.98    0.0762
 N4 fixed point     12.000000000            3118.81    0.1010
 N5 sin,cos etc.     0.499110132              73.76    1.1280
 N6 floating point   0.999999821    658.45             0.8192
 N7 assignments      3.000000000            1133.74    0.1630
 N8 exp,sqrt etc.    0.935364604              46.60    0.7982

 MWIPS                             3108.14             3.2174

 Operating System    Linux, Arch. amd64, Version 3.13.0-24-generic
 Java Vendor         Oracle Corporation, Version 1.8.0_25

Online Benchmark Downloading Tests measure the downloading time of 1 MByte or 100 KByte BMP, GIF and JPG files and of 200 or 400 70 Byte GIF files. Of particular note, the typical loading time of the 400 GIFs (28 KB in total) is twice as long as that for the 1 MB image files.
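To put the last observation in perspective, the sketch below works out the total size of the small files and the implied difference in loading time per byte. It assumes 1 MB is 1,000,000 bytes and uses the "twice as long" ratio quoted above, so the result is only indicative of per-file overheads dominating.

```python
GIF_COUNT = 400
GIF_BYTES = 70
LARGE_FILE_BYTES = 1_000_000   # assumption: 1 MB taken as 1,000,000 bytes
TIME_RATIO = 2.0               # 400 GIFs take twice as long as the 1 MB file

small_total_bytes = GIF_COUNT * GIF_BYTES
# Relative loading time per byte, small files versus the large file
per_byte_ratio = TIME_RATIO * LARGE_FILE_BYTES / small_total_bytes

print(f"{small_total_bytes / 1000:.0f} KB in {GIF_COUNT} files")
print(f"about {per_byte_ratio:.0f} times longer per byte transferred")
```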


JavaDraw Benchmarks

Versions (.class files) compiled with Java JDK 6 and 7 are available to execute off-line, via a terminal command, and on-line, using a browser. There are two benchmarks for each of these, the original (Swing) and, to avoid Windows issues with this, a new version (AWT). For details and results see JavaDraw.htm. Java source codes, class files and images used are in: Java PC Benchmarks.zip.

As shown in the example results below, the benchmark has five test procedures with increasing activity, each one running for 10 seconds, with the first one repeated to identify start up overheads. Note that the benchmark is designed to measure speed and displays might have flashing and missing objects, particularly with the on-line versions. The latter require specific permissions to execute, with IcedTea JRE appearing to be the only one that enables this ability under Ubuntu (14.04).

Following the example output are a series of results on the 3.7 GHz Core i7 with a GeForce GTX 650 graphics card, running under Ubuntu 14.04, including comparisons with the Java code compiled using JDK 8 and running via JRE 1.8. It can be seen that compiling with JDK 7 and 8 leads to similar speeds, running via JRE 1.8. Then, running via JRE 1.7 produced a completely different performance profile. The much faster JRE 1.8 performance, with the lighter loading, appears to be associated with a higher level of multithreading but this also applies with the heaviest loading, suggesting different graphics processor utilisation or CPU to GPU communication. Further results are available using other processors and Windows.

 ******************************************************************

 Java AWT Drawing Benchmark, Jan 5 2015, 10:32:15
 Produced by javac 1.8.0_25

 Test                              Frames      FPS

 Display PNG Bitmap Twice Pass 1    19201  1920.10
 Display PNG Bitmap Twice Pass 2    20826  2082.60
 Plus 2 SweepGradient Circles       20478  2047.80
 Plus 200 Random Small Circles       9620   962.00
 Plus 320 Long Lines                 3830   383.00
 Plus 4000 Random Small Circles       435    43.30

 Total Elapsed Time  60.1 seconds

 Operating System    Linux, Arch. amd64, Version 3.13.0-24-generic
 Java Vendor         Oracle Corporation, Version 1.8.0_25

 ******************************************************************

                         On-line  ------ Off-line -------

 JDK Compiler                  7       7       7       8
 JRE                         1.7     1.7     1.8     1.8

 PNG Bitmaps 1               984     779    1971    1920
 PNG Bitmaps 2              1006     979    2032    2083
 + SweepGradient Circle      485     453    1923    2048
 + 200 Small Circles         474     403     909     962
 + 320 Long Lines            412     307     312     383
 + 4000 Small Circles        306     219      41      43


Booting Time

Below are booting times on two PCs, from boot menu selection to loaded desktop. The two PCs are a Netbook with a 1.66 GHz Atom CPU, originally running Windows XP, and a desktop PC with a 2.4 GHz Core 2 Duo and Windows Vista. Besides seconds to boot, MB/second reading speed of the drives is provided, derived from the Image Processing Benchmark results. The first results show Windows booting time, for comparison purposes, the Core 2 Duo being particularly slow. The second and fastest results are for 64-Bit Ubuntu 10.10, booting from the Windows disk in the Netbook, and a fast (for 2009) eSATA disk on the desktop.

Figures for the next six entries are from USB sticks, booting 32-Bit and 64-Bit Ubuntu 10.10, 64-Bit Ubuntu 11.04, 64-Bit Fedora 14 and 64-Bit OpenSuse 11.4. On moving the drives between systems, it seems that booting time of the next system used can be considerably longer than normal (needs to use alternative drivers?). Also, the first Linux installations were with Ubuntu and nVidia drivers were installed in order to run CUDA based benchmarks, probably the reason why these would only fully boot on using Recovery Mode on the Netbook, with its Intel graphics.

On the desktop, all Linux loading times are faster than Windows, using much slower drives, but the fastest flash drive does not necessarily produce the shortest booting time. Repeating the tests for a number of times indicates that booting time depends on differing hardware/distro combinations. The last result is with OpenSuse on a USB disk drive, where the faster data transfer speed, compared to a flash drive, does not improve booting time much.

Later results, loading Ubuntu 14.04, are for the 3.7 GHz Core i7, using a USB 3.0 Seagate Expansion STBX1000101 disk drive and a cheap USB 3.0 Lexar Flash Drive, plus a WD CAVIAR BLACK WD1003FZEX SATA disk for Windows 8.1. All booting times are after a BIOS based menu that takes around 20 seconds to appear after switch on.

                                 Netbook, WinXP, 5400      Desktop, Vista 7200 RPM
                                 RPM Local Disk            SATA and eSATA Disks

 Drive            Linux          Boot1 Boot2  Disk  Mode   Boot1 Boot2  Disk  Mode
                                  Secs  Secs  MB/s          Secs  Secs  MB/s

 Windows Disk                       64    50  70.0  Norm     170   170  47.8  Norm

 Local Disk       Ubuntu 10.10      37    35  56.0  Norm      22    23 108.0  Norm

 Old Staples      Ubuntu 10.10     100    66   9.3  Rec       76    71   8.8  Norm
 4 GB Stick       64 Bit            95    71        Rec

 PNY Attache      Ubuntu 10.10     100    77  18.2  Rec      103    62  20.4  Norm
 4 GB Stick       32 Bit

 Cruzer U3        Ubuntu 10.10      50    51  16.4  Rec       57    57  16.9  Norm
 4 GB Stick       64 Bit

 Patriot Rage     Ubuntu 11.04      46    57  24.3  Norm      76    48  26.8  Norm
 8 GB Stick       64 Bit

 Cruzer U3        Fedora 14        110    98  22.0  Norm      73    70  23.8  Norm
 16 GB            64 Bit

 Cruzer Blade     OpenSuse 11.4     82    70  19.1  Norm      70    44  20.8  Norm
 8 GB Stick       64 Bit

 USB Disk         OpenSuse 11.4     59    60  28.4  Norm      48    42  34.8  Norm
                  64 Bit

 Rec = Recovery Mode

 ################################################################################

                                 Desktop 3.7 GHz Core i7

 Drive            Linux          Boot1 Boot2  Disk  Mode
                                  Secs  Secs  MB/s

 Windows 8.1 Disk                   57    58   139  Norm
 USB 3 Disk       Ubuntu 14.04      32    32   112  Norm
 USB 3 Flash      Ubuntu 14.04      26    26    94  Norm


Roy Longbottom January 2015

The Official Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection