
CUDA GPU Parallel Computing Benchmarks

Contents


  General
  Maximum Performance
  CUDA Programming
  CudaMFLOPS1 Benchmark
  Shared Memory and Repeat Calculations
  More Comparisons
  Run Time Parameters
  Reliability Testing
  Errors

Summary

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. Maximum speeds, in terms of billions of floating point operations per second or GFLOPS, can be higher on a laptop graphics processor than on CPUs such as dual core models. This applies to Single Instruction Multiple Data (SIMD) operation, where the same instructions are executed simultaneously on sections of data from a data array. For maximum speed, the data array has to be large, with little or no reference to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed, cache-like memory.

The CudaMFLOPS1 benchmark exploits multiple registers, large data sizes and, now, the fast shared memory. An intention is to demonstrate poor performance as well as high speed operation. All tests originally comprised three sets of measurements, but additional tests are now included:

  • New Calculations - Copy data to graphics RAM, execute instructions, copy back to host RAM
  • Update Data - Execute further instructions on data in graphics RAM, copy back to host RAM
  • Graphics Only Data - Execute further instructions on data in graphics RAM, leave it there
  • Extra Test 1 - Just graphics data, repeat loop in CUDA function
  • Extra Test 2 - Just graphics data, repeat loop in CUDA function but using Shared Memory

These are run at three data sizes, the defaults being 100,000 words repeated 2500 times, 1M words repeated 250 times and 10M words repeated 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.

A graphics card used has 32 processor cores working at 1.19 GHz and, executing an add and a multiply per clock cycle, could run at 76 GFLOPS. Maximum speed obtained by the benchmark, for graphics only data, 10M words and 32 instructions, was 29 GFLOPS, considerably degraded by each single word being transferred in and out of graphics RAM. This increased to 31 GFLOPS with Extra Test 1, then 34 GFLOPS using Shared Memory. Calculations to exclude data transfer overheads indicated up to 51 GFLOPS.

Slowest speed for graphics only data, at 100K words and 2 instructions per word, was 1.6 GFLOPS. With data in and out, the speed range was 0.2 to 6.0 GFLOPS over all data sizes. Note that the same calculations, running on both processors of the 2.4 GHz Core 2 Duo (OpenMP, using the “#pragma omp parallel for” statement and not SSE instructions), produced 1.15 to 5.0 GFLOPS, data in and out - see OpenMP MFLOPS.htm.
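For reference, below is a minimal sketch of that CPU comparison, assuming the same form of arithmetic as above - an illustration only, not the OpenMP MFLOPS source code:

   // Spread the same style of calculation across both CPU cores.
   // Compile with OpenMP enabled (for example -fopenmp or /openmp).
   void cpuCalc(int n, float a, float b, float *x)
   {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
      {
         x[i] = (x[i] + a) * b;
      }
   }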

Later results are for a card with 128 processors working at 1.836 GHz, with a maximum specification of 705 GFLOPS. This produced up to 360 GFLOPS, with minimum speeds of 0.3 to 11.1 GFLOPS for data in and out. Even later, a 2010 top end card, with a maximum specification of 1345 GFLOPS, ran at up to 983 GFLOPS, and 0.5 to 15.5 GFLOPS with data in and out.

The benchmark can be downloaded via CudaMflops.zip. No installation is necessary - Extract All and click on CudaMFLOPS1.exe but see ReadMe.txt first. The ZIP file also includes the CUDA C++ source code. A double precision version is now available. For details and results see CUDA2.htm.

The first single and double precision benchmarks were compiled using CUDA Toolkit 2.3. These have been replaced with updated versions, providing a little extra detail on graphics memory utilisation, and were compiled for 32 bit (x86) PCs. Versions have also been produced using CUDA Toolkit 3.1, particularly to see if double precision calculations are faster on later GeForce GPUs. The revision exercise started by compiling for 64 bit (x64) PCs, which was not straightforward, and different procedures were needed for CUDA 2.3 and 3.1. Details of the revised versions and problems are in CUDA3 x64.htm, with source code and all benchmark EXE files in CudaMflops.zip.

The benchmarks have now been ported to 32-bit and 64-bit versions of Ubuntu Linux. Details and results are provided in linux_cuda_mflops.htm.

See GigaFLOPS Benchmarks.htm for further details and results, including comparisons with MP MFLOPS, a threaded C version, OpenMP MFLOPS, and Qpar MFLOPS, where Qpar is Microsoft’s proprietary equivalent of OpenMP and faster via Windows. The benchmarks and source codes can be obtained via gigaflops-benchmarks.zip.




General

CUDA is an nVidia general purpose parallel computing architecture that uses the large number of Graphics Processing Units (GPUs) available on modern GeForce graphics cards. Free software can be downloaded, for 32 and 64 bit operation, initially for use in conjunction with a C/C++ compiler. The software comes in three parts, the first being a new graphics driver with an option to turn CUDA processing on and off. Part two is a Toolkit that includes functions needed to run compiled programs - in this case cudart.dll, where it is not clear whether this has to be compatible with a specific driver version. The third part is the Software Development Kit (SDK) that includes example source and compiled codes.

Some CUDA functions can be compiled with a C/C++ compiler (64 or 32 bit). Most example programs have an extension of .CU instead of .CPP and might contain functions that need the CUDA compiler nvcc. In the case of Windows, the latter has to be interfaced to an MS compiler and does not work with older versions. For 32 bit working, the free MS Visual C++ 2008 Express Edition is compatible.

The CUDA drivers and software were installed on a PC with a GeForce 8600 GT card using 64-Bit Vista and a laptop with 8400M GS graphics and 32-Bit Vista. Visual C++ 2008 Express Edition was also installed on both.

In principle, the example projects can be loaded for compilation with VC 2008 by clicking on a .vcproj file. In my case, Build produced an error, but BuildLog.htm showed the compile commands used. These could be run from a Command Prompt window after executing the supplied vcvarsall.bat and the further bat commands for CUDA, as below. The compile command line used was as follows, although nvcc -c CudaMFLOPS1.cu (compile only) also worked. The link command needed /NODEFAULTLIB:libc.lib CUDA.LIB CUDART.LIB CudaMFLOPS1.obj.

nvcc -arch sm_10 -ccbin D:\MSCompile\MSVC8\VC\bin -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT" -c CudaMFLOPS1.cu
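Based on that, the full link line would take a form such as the following, where the /OUT file name is an assumption and the libraries are found via the LIB settings shown below:

link /OUT:CudaMFLOPS1.exe /NODEFAULTLIB:libc.lib CUDA.LIB CUDART.LIB CudaMFLOPS1.obj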

The software was copied to a PC using Windows XP - in my case, folders MSVC8 and v6.0A for the C++ compiler, plus those for CUDA. A BAT file with the following is executed from a Command Prompt to allow the same compile and link lines to work successfully.


  Set PATH=D:\MSCompile\MSVC8\Common7\IDE;%PATH%
  Set PATH=D:\MSCompile\MSVC8\VC\BIN;%PATH%
  Set PATH=D:\MSCompile\MSVC8\Common7\Tools;%PATH%
  Set PATH=D:\MSCompile\MSVC8\VC\VCPackages;%PATH%
  Set PATH=C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin;%PATH%
  Set LIB=D:\MSCompile\MSVC8\VC\LIB;%LIB%
  Set LIB=C:\Program Files\Microsoft SDKs\Windows\v6.0A\lib;%LIB%
  Set LIBPATH=D:\MSCompile\MSVC8\VC\LIB;%LIBPATH%
  Set INCLUDE=D:\MSCompile\MSVC8\VC\INCLUDE;%INCLUDE%
  Set INCLUDE=C:\Program Files\Microsoft SDKs\Windows\v6.0A\include;%INCLUDE%

  Set PATH=D:\CUDA\bin;%PATH%
  Set INCLUDE=D:\CUDA\include;%INCLUDE%
  Set INCLUDE=D:\CUDA\SDK\common\inc;%INCLUDE%
  Set LIB=D:\CUDA\lib;%LIB%
  Set LIB=D:\CUDA\SDK\common\lib;%LIB%




Maximum Performance

Maximum floating point execution speed in GFLOPS (32 bits) is calculated as shader clock GHz x shader stream processors x 3, the three representing a linked multiply and add (MADD) plus another multiply that can be executed per clock cycle. In my case, maximum speeds would be 1.19 x 32 x 3 = 114.24 GFLOPS for the 8600 GT and 0.8 x 16 x 3 = 38.4 GFLOPS for the 8400M GS. Googling indicates that the second multiply did not work on graphics hardware of this vintage, which would reduce maximum speeds to 76.16 and 25.6 GFLOPS.

Note that the theoretical maximum speed of the test PC Core 2 Duo, with 32 bit SSE instructions, is 2.4 (GHz) x 4 (SSE register size) x 2 (CPUs) x 2 (MADD) = 38.4 GFLOPS, but the best I could demonstrate with data from cache is around 6.5 GFLOPS per CPU.

One of the provided CUDA sample programs does produce high GFLOPS. The program is nbody.exe (a popular particle simulation method). It produces a graphics display but can also be command line driven to report GFLOPS without the graphics. Results are 56 GFLOPS on the desktop with 2.4 GHz Core 2 Duo and GeForce 8600 GT, then 19 GFLOPS on the laptop with 1.8 GHz Core 2 Duo and 8400M GS. As indicated below, Nbody performance does not appear to be limited by graphics memory or PCI-E bus speeds. The code indicates that there are 20 FLOPS per interaction, and it will be using some of the available pool of 8192 registers for this. It also makes use of local memory, providing high speed, cache-like performance.

BandwidthTest.exe, another of the CUDA sample programs, provides MBytes/second speeds transferring 33 MB of data from main memory to graphics RAM, graphics RAM to main memory and graphics RAM to graphics RAM. The three measurements for the 8600 GT were 1905, 1450 and 15942 MB/sec, with 1145, 848 and 4140 MB/sec for the 8400M. For 4 byte words, and assuming one floating point calculation for each of these memory accesses, these translate into maximum speeds of 0.3 to 0.5 GFLOPS with main memory access and 4 GFLOPS using only graphics RAM on the desktop, with 0.2 to 0.3 GFLOPS and 1 GFLOPS on the laptop. The CudaMFLOPS1 Benchmark demonstrates this level of performance and the impact of additional calculations per memory access.
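As a worked example of that translation, using the 8600 GT figures:

    1905 MB/sec / 4 bytes per word =  476 M words/sec, or ~0.5 GFLOPS at 1 op/word
   15942 MB/sec / 4 bytes per word = 3986 M words/sec, or ~4.0 GFLOPS at 1 op/word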

Later results are for a GeForce GTS 250 at 1.836 GHz x 128 processors x 3, or 705 GFLOPS. This was using a 3.0 GHz quad core Phenom II CPU. Results produced were 285 GFLOPS with Nbody, then 3033 and 54129 MB/sec (758 to 13532 MFLOPS) with BandwidthTest.

Latest results are for a 2010 top end GTX 480 at 1.4 GHz x 480 x 2, or 1344 GFLOPS (or x 3 for 2016 GFLOPS linked multiply/add/multiply?). The system had a quad core 2.8 GHz Core i7 processor. Results for this were 752 GFLOPS with Nbody, then 3766 and 120236 MB/sec (942 to 20059 MFLOPS) with BandwidthTest.


  Summary of Maximum Speeds - from above and CudaMFLOPS1 below 

                                       8600 GT   8400M GS    GTS 250    GTX 480
                                        MFLOPS     MFLOPS     MFLOPS     MFLOPS

   Maximum Specification                114240      38400     705024    1344960
   Or                                    76160      25600
   NBody                                 55970      18590     285593     751637
   Bandwidth From Host CPU RAM 1 Op        500        300        758        942
   Bandwidth From Graphics RAM 1 Op       4000       1030      13532      20059
   CudaMFLOPS1 Data In & Out   2 Ops       414        220        714        987
   CudaMFLOPS1 Data out only   2 Ops       677        340       1338       1911
   CudaMFLOPS1 Graphics RAM    2 Ops      3866       1030      13038      32286
   CudaMFLOPS1 Data In & Out  32 Ops      5926       3098      11182      15485
   CudaMFLOPS1 Data out only  32 Ops      9006       4459      19975      29608
   CudaMFLOPS1 Graphics RAM   32 Ops     28994       9876     135041     440451
   CudaMFLOPS1 Shared Memory  32 Ops     34094      11454     173291     769658

   CudaMFLOPS1 Calculations # 24 Ops     43782      18792     297354     ******
   CudaMFLOPS1 Calculations # 30 Ops     51016      23093     358989     983220

            #       Based on time difference between 32 Ops and 8 or 2 Ops
            ******  Too fast to estimate




CUDA Programming

CUDA programming for Single Instruction Multiple Data (SIMD) operation to use the graphics processors for general calculations on data arrays can be quite simple. The number of processing threads has to be specified (maximum here 512), with blocks of threads calculated as array words / threads. Functions are called using, for example, calc<<< blocks, threads >>>(words, aval, bval, gpu_array);. The simple function, in this case, could be:

            __global__ void calc(int n, float a, float b, float *data)
            {
               int i = blockIdx.x * blockDim.x + threadIdx.x;
               if( i < n ) data[i] = (data[i] + a) * b;
            }

This leads to automatic execution of threads on available graphics processor cores for each block in turn. Special functions for allocating/releasing graphics RAM, copying data from the host CPU memory and arranging synchronisation of threads are cudaMalloc, cudaFree, cudaMemcpy and cudaThreadSynchronize. Further details, using Shared Memory and other calculations, are given below.
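A minimal self-contained sketch showing how these calls fit together follows - illustrative only, not the benchmark source, with arbitrary array size and constants:

   #include <stdio.h>
   #include <stdlib.h>
   #include <cuda_runtime.h>

   __global__ void calc(int n, float a, float b, float *data)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if( i < n ) data[i] = (data[i] + a) * b;
   }

   int main()
   {
      const int words   = 100000;                  // 4 byte words
      const int threads = 256;                     // threads per block
      const int blocks  = (words + threads - 1) / threads;
      size_t    bytes   = words * sizeof(float);

      float *x = (float *)malloc(bytes);
      for (int i = 0; i < words; i++) x[i] = 0.999999f;

      float *x_gpu;
      cudaMalloc((void **)&x_gpu, bytes);                   // graphics RAM
      cudaMemcpy(x_gpu, x, bytes, cudaMemcpyHostToDevice);  // data in

      calc<<< blocks, threads >>>(words, 0.000497f, 0.9995f, x_gpu);
      cudaThreadSynchronize();                              // wait for GPU

      cudaMemcpy(x, x_gpu, bytes, cudaMemcpyDeviceToHost);  // data out
      printf(" First result %8.6f\n", x[0]);

      cudaFree(x_gpu);
      free(x);
      return 0;
   }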



CudaMFLOPS1 Benchmark

CudaMFLOPS1.exe has 3 x 3 x 3 tests, with increasing array sizes, increasing numbers of arithmetic operations per memory access and different data access modes. For the latter, speed is first calculated for copying the data from host CPU memory to graphics RAM, processing and copying results back to the host. Next, the test is repeated, only copying from the host on the first of multiple passes. The last test excludes repeated copying of results to and from the host but does include graphics RAM/GPU data transfers. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations. Default array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The latest version has additional calculation and shared memory tests where repetition is controlled within the main CUDA functions (see below, and the example kernel that follows).
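As an illustration of how the operations are counted, the 8 operations per word case corresponds to a kernel of the following form - a sketch based on the formula above, with assumed parameter names:

   __global__ void calc8(int n, float a, float b, float c,
                         float d, float e, float f, float *x)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      // 3 adds in brackets + 3 multiplies + 1 subtract + 1 add
      // = 8 floating point operations per data element
      if( i < n ) x[i] = (x[i]+a)*b - (x[i]+c)*d + (x[i]+e)*f;
   }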

On running, results are displayed and saved in file CudaLog.txt. Examples are given below; the layout is somewhat different to that of the initial version.

Initially, each word in the data array is set to 0.999999 and has the same calculations applied, producing a final result somewhat lower. This result is now constant for a given number of operations per word and repeat passes (for example 0.998799 for 32 operations and 25 repeats). The results are also useful for comparing numerical accuracy when the same calculations are executed on a CPU. The revised benchmark displays the first result and checks that all words are identical - see example logs below and Errors later.

Below are results for PCs with Core 2 Duo 2.4 GHz, Vista 64, GeForce 8600 GT, Mobile Core 2 Duo 1.8 GHz, Vista 32, GeForce 8400M GS and Phenom II X4 3.0 GHz, 64-Bit Windows 7, GeForce GTS 250. Results became available in 2010 for a GeForce GTX 480 top end graphics card, using a Core i7 930 processor and 64-Bit Windows 7. These were from a revised version of the benchmark, still using CUDA Toolkit 2.3, but with the same layout as the double precision version and providing additional details on graphics memory use.

Assuming a constant non-overlapped overhead for data transfer, with calculations based on the difference in fastest times between those for 32 and 8 or 2 operations, actual execution speed comes out at between 40 and 51 GFLOPS on the 8600 GT, 13 to 23 GFLOPS on the 8400M and 203 to 360 GFLOPS using the GTS 250 (see calculations below). Slowest speeds, with continuous data transfer from/to the host CPU and two calculations for each word transferred, are 215, 132 and 328 MFLOPS for the three systems.


 Core 2 Duo 2.4 GHz, Vista 64, GeForce 8600 GT

 ********************************************************

  CUDA MFLOPS Benchmark 1.1 Wed Sep 23 11:52:04 2009

  CUDA devices found 
  Device 0: GeForce 8600 GT  with 4 Processors 32 cores 

  Using 256 Threads

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     100000     2     2500   2.328578      215   0.929538   Yes
 Data out only     100000     2     2500   1.368496      365   0.929538   Yes
 Calculate only    100000     2     2500   0.302413     1653   0.929538   Yes

 Data in & out    1000000     2      250   1.466895      341   0.992550   Yes
 Data out only    1000000     2      250   0.828149      604   0.992550   Yes
 Calculate only   1000000     2      250   0.143866     3475   0.992550   Yes

 Data in & out   10000000     2       25   1.207633      414   0.999250   Yes
 Data out only   10000000     2       25   0.738722      677   0.999250   Yes
 Calculate only  10000000     2       25   0.129336     3866   0.999250   Yes

 Data in & out     100000     8     2500   2.352154      850   0.956980   Yes
 Data out only     100000     8     2500   1.399414     1429   0.956980   Yes
 Calculate only    100000     8     2500   0.303826     6583   0.956980   Yes

 Data in & out    1000000     8      250   1.481691     1350   0.995509   Yes
 Data out only    1000000     8      250   0.844532     2368   0.995509   Yes
 Calculate only   1000000     8      250   0.151419    13208   0.995509   Yes

 Data in & out   10000000     8       25   1.212482     1650   0.999549   Yes
 Data out only   10000000     8       25   0.751451     2662   0.999549   Yes
 Calculate only  10000000     8       25   0.138901    14399   0.999549   Yes

 Data in & out     100000    32     2500   2.521739     3172   0.890079   Yes
 Data out only     100000    32     2500   1.601592     4995   0.890079   Yes
 Calculate only    100000    32     2500   0.506698    15789   0.890079   Yes

 Data in & out    1000000    32      250   1.622790     4930   0.988073   Yes
 Data out only    1000000    32      250   0.986792     8107   0.988073   Yes
 Calculate only   1000000    32      250   0.298806    26773   0.988073   Yes

 Data in & out   10000000    32       25   1.350080     5926   0.998799   Yes
 Data out only   10000000    32       25   0.888331     9006   0.998799   Yes
 Calculate only  10000000    32       25   0.275918    28994   0.998799   Yes

 Extra tests - loop in main CUDA Function

 Calculate       10000000     2       25   0.133918     3734   0.999250   Yes
 Shared Memory   10000000     2       25   0.046697    10707   0.999250   Yes

 Calculate       10000000     8       25   0.132696    15072   0.999549   Yes
 Shared Memory   10000000     8       25   0.084822    23579   0.999549   Yes

 Calculate       10000000    32       25   0.255621    31296   0.998799   Yes
 Shared Memory   10000000    32       25   0.234647    34094   0.998799   Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz Measured 2402 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  Intel processor architecture, 2 CPUs 
  Windows NT  Version 6.0, build 6002, Service Pack 2
  Memory 4095 MB, Free 2878 MB
  User Virtual Space 4096 MB, Free 4047 MB
 

 Core 2 Duo 1.8 GHz Laptop, Vista 32, GeForce 8400M GS

 ********************************************************

  CUDA MFLOPS Benchmark 1.1 Tue Sep 22 18:09:43 2009

  CUDA devices found 
  Device 0: GeForce 8400M GS  with 2 Processors 16 cores 

  Using 256 Threads

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     100000     2     2500   3.774912      132   0.929538   Yes
 Data out only     100000     2     2500   2.358369      212   0.929538   Yes
 Calculate only    100000     2     2500   0.744282      672   0.929538   Yes

 Data in & out    1000000     2      250   2.486853      201   0.992550   Yes
 Data out only    1000000     2      250   1.615676      309   0.992550   Yes
 Calculate only   1000000     2      250   0.509336      982   0.992550   Yes

 Data in & out   10000000     2       25   2.272207      220   0.999250   Yes
 Data out only   10000000     2       25   1.469490      340   0.999250   Yes
 Calculate only  10000000     2       25   0.485254     1030   0.999250   Yes

 Data in & out     100000     8     2500   3.736443      535   0.956980   Yes
 Data out only     100000     8     2500   2.376323      842   0.956980   Yes
 Calculate only    100000     8     2500   0.757495     2640   0.956980   Yes

 Data in & out    1000000     8      250   2.530923      790   0.995509   Yes
 Data out only    1000000     8      250   1.632063     1225   0.995509   Yes
 Calculate only   1000000     8      250   0.514978     3884   0.995509   Yes

 Data in & out   10000000     8       25   2.267129      882   0.999549   Yes
 Data out only   10000000     8       25   1.491643     1341   0.999549   Yes
 Calculate only  10000000     8       25   0.490735     4076   0.999549   Yes

 Data in & out     100000    32     2500   4.134719     1935   0.890079   Yes
 Data out only     100000    32     2500   2.776958     2881   0.890079   Yes
 Calculate only    100000    32     2500   1.146639     6977   0.890079   Yes

 Data in & out    1000000    32      250   2.803619     2853   0.988073   Yes
 Data out only    1000000    32      250   1.946918     4109   0.988073   Yes
 Calculate only   1000000    32      250   0.834157     9591   0.988073   Yes

 Data in & out   10000000    32       25   2.582265     3098   0.998799   Yes
 Data out only   10000000    32       25   1.794135     4459   0.998799   Yes
 Calculate only  10000000    32       25   0.810025     9876   0.998799   Yes

 Extra tests - loop in main CUDA Function

 Calculate       10000000     2       25   0.503554      993   0.999250   Yes
 Shared Memory   10000000     2       25   0.139127     3594   0.999250   Yes

 Calculate       10000000     8       25   0.516617     3871   0.999549   Yes
 Shared Memory   10000000     8       25   0.253509     7889   0.999549   Yes

 Calculate       10000000    32       25   0.759397    10535   0.998799   Yes
 Shared Memory   10000000    32       25   0.698451    11454   0.998799   Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006FD
  Intel(R) Core(TM)2 Duo CPU     T5550  @ 1.83GHz Measured 1828 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  Intel processor architecture, 2 CPUs 
  Windows NT  Version 6.0, build 6001, Service Pack 1
  Memory 2046 MB, Free 906 MB
  User Virtual Space 2048 MB, Free 2006 MB


 Phenom II X4 3.0 GHz, 64-Bit Windows 7, GeForce GTS 250

 ********************************************************

  CUDA MFLOPS Benchmark 1.1 Wed Nov 04 18:44:35 2009

  CUDA devices found 
  Device 0: GeForce GTS 250  with 16 Processors 128 cores 

  Using 256 Threads

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     100000     2     2500   1.522759      328   0.929538   Yes
 Data out only     100000     2     2500   0.759221      659   0.929538   Yes
 Calculate only    100000     2     2500   0.163718     3054   0.929538   Yes

 Data in & out    1000000     2      250   0.799499      625   0.992550   Yes
 Data out only    1000000     2      250   0.441804     1132   0.992550   Yes
 Calculate only   1000000     2      250   0.051695     9672   0.992550   Yes

 Data in & out   10000000     2       25   0.700515      714   0.999250   Yes
 Data out only   10000000     2       25   0.373559     1338   0.999250   Yes
 Calculate only  10000000     2       25   0.038349    13038   0.999250   Yes

 Data in & out     100000     8     2500   1.496596     1336   0.956980   Yes
 Data out only     100000     8     2500   0.740724     2700   0.956980   Yes
 Calculate only    100000     8     2500   0.163493    12233   0.956980   Yes

 Data in & out    1000000     8      250   0.839552     2382   0.995509   Yes
 Data out only    1000000     8      250   0.451659     4428   0.995509   Yes
 Calculate only   1000000     8      250   0.050657    39481   0.995509   Yes

 Data in & out   10000000     8       25   0.678276     2949   0.999549   Yes
 Data out only   10000000     8       25   0.364515     5487   0.999549   Yes
 Calculate only  10000000     8       25   0.039063    51199   0.999549   Yes

 Data in & out     100000    32     2500   1.555722     5142   0.890079   Yes
 Data out only     100000    32     2500   0.813645     9832   0.890079   Yes
 Calculate only    100000    32     2500   0.221733    36080   0.890079   Yes

 Data in & out    1000000    32      250   0.848588     9427   0.988073   Yes
 Data out only    1000000    32      250   0.472604    16927   0.988073   Yes
 Calculate only   1000000    32      250   0.073957   108170   0.988073   Yes

 Data in & out   10000000    32       25   0.715423    11182   0.998799   Yes
 Data out only   10000000    32       25   0.400496    19975   0.998799   Yes
 Calculate only  10000000    32       25   0.059241   135041   0.998799   Yes

 Extra tests - loop in main CUDA Function

 Calculate       10000000     2       25   0.050168     9967   0.999250   Yes
 Shared Memory   10000000     2       25   0.009245    54082   0.999250   Yes

 Calculate       10000000     8       25   0.050910    39285   0.999549   Yes
 Shared Memory   10000000     8       25   0.016720   119617   0.999549   Yes

 Calculate       10000000    32       25   0.050587   158143   0.998799   Yes
 Shared Memory   10000000    32       25   0.046165   173291   0.998799   Yes

 Hardware  Information
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
  AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow, 
 Windows  Information
  Intel processor architecture, 4 CPUs 
  Windows NT  Version 6.1, build 7600, 
  Memory 4096 MB, Free 4096 MB
  User Virtual Space 4096 MB, Free 4050 MB


 Core i7 Quad 2.8 GHz, Windows 7-64, GeForce GTX 480

 ********************************************************
  CUDA 2.3 x86 Single Precision MFLOPS Benchmark 1.3 Fri Jul 30 15:23:30 2010

  CUDA devices found 
  Device 0: GeForce GTX 480  with 15 Processors 120 cores 
  Global Memory 1468 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  0.959230      521   0.9295383095741  Yes
 Data out only    100000    2    2500  0.513680      973   0.9295383095741  Yes
 Calculate only   100000    2    2500  0.087057     5743   0.9295383095741  Yes

 Data in & out   1000000    2     250  0.607810      823   0.9925497770309  Yes
 Data out only   1000000    2     250  0.295149     1694   0.9925497770309  Yes
 Calculate only  1000000    2     250  0.022970    21767   0.9925497770309  Yes

 Data in & out  10000000    2      25  0.506478      987   0.9992496371269  Yes
 Data out only  10000000    2      25  0.261632     1911   0.9992496371269  Yes
 Calculate only 10000000    2      25  0.015486    32286   0.9992496371269  Yes

 Data in & out    100000    8    2500  0.966099     2070   0.9571172595024  Yes
 Data out only    100000    8    2500  0.533396     3750   0.9571172595024  Yes
 Calculate only   100000    8    2500  0.090899    22002   0.9571172595024  Yes

 Data in & out   1000000    8     250  0.608723     3286   0.9955183267593  Yes
 Data out only   1000000    8     250  0.296098     6755   0.9955183267593  Yes
 Calculate only  1000000    8     250  0.023509    85074   0.9955183267593  Yes

 Data in & out  10000000    8      25  0.503973     3968   0.9995489120483  Yes
 Data out only  10000000    8      25  0.261953     7635   0.9995489120483  Yes
 Calculate only 10000000    8      25  0.015559   128542   0.9995489120483  Yes

 Data in & out    100000   32    2500  1.003045     7976   0.8902152180672  Yes
 Data out only    100000   32    2500  0.556316    14380   0.8902152180672  Yes
 Calculate only   100000   32    2500  0.125837    63574   0.8902152180672  Yes

 Data in & out   1000000   32     250  0.629748    12703   0.9880878329277  Yes
 Data out only   1000000   32     250  0.310655    25752   0.9880878329277  Yes
 Calculate only  1000000   32     250  0.029831   268179   0.9880878329277  Yes

 Data in & out  10000000   32      25  0.516623    15485   0.9987964630127  Yes
 Data out only  10000000   32      25  0.270197    29608   0.9987964630127  Yes
 Calculate only 10000000   32      25  0.018163   440451   0.9987964630127  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.004954   100937   0.9992496371269  Yes
 Shared Memory  10000000    2      25  0.002766   180734   0.9992496371269  Yes

 Calculate      10000000    8      25  0.006198   322696   0.9995489120483  Yes
 Shared Memory  10000000    8      25  0.004850   412376   0.9995489120483  Yes

 Calculate      10000000   32      25  0.012249   653139   0.9987964630127  Yes
 Shared Memory  10000000   32      25  0.010394   769658   0.9987964630127  Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000106A5
  Intel(R) Core(TM) i7 CPU         930  @ 2.80GHz Measured 2807 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  Intel processor architecture, 8 CPUs 
  Windows NT  Version 6.1, build 7600, 
  Memory 4096 MB, Free 4096 MB
  User Virtual Space 4096 MB, Free 4042 MB

  1468 MB Graphics RAM, Used 261 Minimum, 297 Maximum




Shared Memory and Repeat Calculations

The original version ran repeated calculations from within the main C++ code, as shown in the "Calculate Only" section below. For the first extra test, "Calculate", the repetition loop was moved to the CUDA function, where it was initially expected to produce faster speeds due to fewer function calls. In practice, this is not necessarily the case and, bearing in mind that a data size of up to 40M bytes is used, it appears that the CUDA software must arrange multiple internal calls to transfer part of the data at a time.

For the third example, and in this case, the size of Shared Memory is defined in the main program as 1024 bytes (per thread?) and is allocated at the start of the CUDA function. Data is then copied from graphics RAM before calculation and back to graphics RAM at the end. As shown below, this can produce significant performance gains when limited calculations are used.
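Note that, in CUDA, the third kernel launch parameter specifies dynamic shared memory bytes per block, so 1024 bytes matches 256 threads x 4 byte words per block. A likely form of the calculation - an assumption, not the benchmark source - is:

   // one float of shared memory for each thread in a block
   // (assumed calculation: 256 threads x 4 bytes = 1024 bytes)
   int sharedMemSize = threadsPerBlock * sizeof(float);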


 Calculate only - Repeat

 for (i=0; i < repeats; i++)
 {
    calc2<<< blocks, threads >>>(words, aval, xval, x_gpu);
 }
 cutilSafeCall( cudaThreadSynchronize() );
 cutilCheckMsg("Kernel execution failed");

 ----------------------------------------------------------

 __global__ void calc2(int n, float a, float b, float *x)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if( i < n ) x[i] = (x[i]+a)*b;
 }
 ##########################################################
 
 Calculate - Repeat in CUDA Function 

 calc2r<<< blocks, threads >>>(words, repeats, aval, xval, x_gpu);
 cutilSafeCall( cudaThreadSynchronize() );
 cutilCheckMsg("Kernel execution failed");

 ----------------------------------------------------------

 __global__ void calc2r(int n, int rep, float a, float b, float *x)
 {
     for (int rr=0; rr < rep; rr++)
     {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if( i < n ) x[i] = (x[i]+a)*b;
     }
 }
 ##########################################################

 Shared Memory - Repeat in CUDA Function 

 sharecalc2<<< numBlocks, threadsPerBlock, sharedMemSize >>>
              (x_gpu, repeats,  aval, xval);
 cutilSafeCall( cudaThreadSynchronize() );
 cutilCheckMsg("Kernel execution failed");

 ----------------------------------------------------------

 __global__ void sharecalc2(float *xi, int rep, float a, float b)
 {
     extern __shared__ float x[];
     int inOffset  = blockDim.x * blockIdx.x;
     int in  = inOffset + threadIdx.x;
     x[threadIdx.x] = xi[in];
     __syncthreads();
     int i = threadIdx.x;
     for (int rr=0; rr < rep; rr++)
     {
        x[i] = (x[i]+a)*b;
     }
     xi[in] = x[threadIdx.x];
 }



More Comparisons

It is not clear how the CUDA software arranges for data to be transferred using the above code, but identical numeric results for all tests, at given operations per word and repetition rates, show that exactly the same number of calculations has been executed.

As indicated above, speed does not vary much whether calculations are repeated via the main program or the CUDA function, and data transfers between host and graphics memory are likely to be similar. On the other hand, speed using shared memory is faster, particularly with fewer arithmetic operations per word. The calculations below show that shared memory produces lower GFLOPS for the calculations themselves, but the overheads for transferring the data are much smaller.


 All 10M Wds 25 Reps  GeF 8600 GT     GeF 8400M GS    GeF GTS 250     GeF GTX 480
 
 Test           Ops   Secs  GFLOPS    Secs  GFLOPS    Secs  GFLOPS    Secs  GFLOPS

 Calculate only   2  0.129    3.86    0.485   1.03   0.038   13.04   0.015   32.29
 Shared Memory    2  0.047   10.71    0.139   3.59   0.009   54.08   0.003  180.73

 Calculate only   8  0.139   14.40    0.491   4.08   0.039   51.20   0.016  128.54
 Shared Memory    8  0.085   23.58    0.254   7.89   0.016  119.62   0.004  412.38

 Calculate only  32  0.276   29.00    0.810   9.88   0.059  135.04   0.018  440.45
 Shared Memory   32  0.235   34.09    0.698  11.45   0.046  173.29   0.010  769.66


 Calculations 32 Ops minus 2 Ops - 30 Ops, 25 Repeats

            GeF 8600 GT         GeF 8400M GS        GeF GTS 250         GeF GTX 480

 Test     Secs    G   Secs    Secs    G   Secs    Secs    G   Secs    Secs    G   Secs
         Calcs  FLOPS  O/H   Calcs  FLOPS  O/H   Calcs  FLOPS  O/H   Calcs  FLOPS  O/H

 Calc    0.147   51  0.119   0.325   23  0.463   0.021  359  0.037   0.003* ***  0.015
 Sh Mem  0.188   40  0.034   0.559   13  0.102   0.037  203  0.007   0.008  983  0.002

                                    * *** too fast to estimate
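 As a worked example of the method, using the 8600 GT Calculate times
 from the first table above:

   Calcs  = 0.276 - 0.129 = 0.147 seconds for the extra 30 operations
   GFLOPS = 30 Ops x 10M words x 25 passes / 0.147 seconds = 51
   O/H    = 0.276 - 0.147 x 32 / 30 = 0.119 seconds transfer overhead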




Run Time Parameters

Run time parameters are available for use via a Command Prompt window or BAT file. See ExampleRun.bat in CudaMflops.zip. Parameters are:

Threads (or t or T) - Default 256, Maximum currently 512
Words (or w or W) - Default 100000 (increases to 1M and 10M)
Repeats (or r or R) - Default 2500 (decreases to 250 and 25), Minimum 100

Examples:

CudaMFLOPS1 Threads 128, Words 50000, Repeats 2500
CudaMFLOPS1 t 256 W 1000 r 100
CudaMFLOPS1 Threads 512

Varying Number Of Threads - Following are results from the last Calculate Only test, using the same data size and repeat passes but varying numbers of threads, with parameters Words 600000, Repeats 100. Performance decreases using fewer threads in a similar fashion for the systems used, although the 8600 GT has 32 processor cores and the 8400M GS 16 cores. A surprise is the marked decrease in speed at 16 threads on all systems.


    Threads    Words   Ops   Repeats   MFLOPS   MFLOPS   MFLOPS  
                                       8600 GT  8400MGS  GTS250

      256     600000    32      100     25153     9136    94109
      128     600000    32      100     24868     9525    91595
       64     600000    32      100     24575     9381    97762   
       32     600000    32      100     21584     7815    78845
       16     600000    32      100     13620     4728    47930




Reliability Testing

The revised version also has facilities to run a reliability/burn-in test. Default operation is the 10M word, 32 operations per word, extra Calculate test, running for one minute with output every 15 seconds, via the simple command CudaMFLOPS1 M. The test is first calibrated so that the loop within the CUDA function runs for less than 1.75 seconds, in order to avoid a timeout error; the number of times this is repeated externally is then calculated (a sketch of this calibration follows the parameter list below). All data is checked for the same numeric result, with the first error value shown and the number of errors. The default number of threads and words can be changed as above and additional parameters are:

Minutes (or M or m or Mins) - Example for below CudaMFLOPS1 Mins 10
Seconds (or S or s or Secs) - for result display rate
FC - to use alternative Shared Memory test.
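A rough sketch of the calibration approach follows - the timing helper and thresholds here are assumptions for illustration, not the benchmark's actual source:

   #include <time.h>

   /* Double the repeat count passed to the CUDA function until one
      call takes around half the 1.75 second limit, so each kernel
      launch stays safely below it. runKernel is a hypothetical
      helper that launches the CUDA function with the given internal
      repeat count and waits for completion. */
   int calibrateRepeats(void (*runKernel)(int reps))
   {
      int reps = 1;
      for (;;)
      {
         clock_t start = clock();
         runKernel(reps);
         double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
         if (secs > 0.875 || reps >= 1000000) return reps;
         reps = reps * 2;
      }
   }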

Below are example results for a 10 minute test on a GeForce 8600 GT. Monitored temperatures are also shown (room temperature 21.5 °C). Graphics temperatures were measured using nVidia System Monitor. This also showed GPU Utilisation 99% to 100%, Bus Utilisation 38%, with CPU Utilisation 60% to 70%. A Shared Memory test produced higher MFLOPS (34500) but slightly lower temperatures at GPU Utilisation 99% and Bus Utilisation 3%.

See also example results below for CudaMFLOPS1 W 50000, T 512, M 1, S 30, FC

Later are results of a 10 minute test on the GeForce GTS 250. Here, temperature and fan speed were monitored using Vtune/TBPanel, supplied with the graphics driver. Recorded temperature increased from 42°C to 72°C after 5 minutes. Following this, it varied between 70°C and 72°C, controlled by the fan speed increasing from the shown 35% to 40%.



 Core 2 Duo 2.4 GHz, Vista 64, GeForce 8600 GT
 ********************************************************

  Command - CudaMFLOPS1 Mins 10

  CUDA MFLOPS Benchmark 1.1 Mon Sep 28 16:22:51 2009

  CUDA devices found 
  Device 0: GeForce 8600 GT  with 4 Processors 32 cores 

  Using 256 Threads

  Calculate      Reliability Test 10 minutes, report every 15 seconds

  Repeat CUDA 155 times at  1.57 seconds. Repeat former 9 times

  Results of all calculations should be 0.936169

  Test  4 Byte  Ops/   Repeat  Seconds   MFLOPS    Errors     First      Value
         Words  Word   Passes                                  Word 

   1  10000000    32     1395   14.164    31517   None found
   2  10000000    32     1395   14.163    31518   None found
   3  10000000    32     1395   14.162    31521   None found
   4  10000000    32     1395   14.162    31520   None found
   5  10000000    32     1395   14.161    31522   None found
   6  10000000    32     1395   14.162    31521   None found
  to 
  40  10000000    32     1395   14.161    31522   None found

  Minutes          0  0.5  1.0  1.5  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
  Temperature °C  47   59   61   62   64   65   66   66   66   66   66   66


 ********************************************************

  Command - CudaMFLOPS1 W 50000, T 512, M 1, S 30, FC

  CUDA MFLOPS Benchmark 1.1 Tue Sep 29 12:03:14 2009

  CUDA devices found 
  Device 0: GeForce 8400M GS  with 2 Processors 16 cores 

  Using 512 Threads

  Shared Memory  Reliability Test 1 minutes, report every 30 seconds

  Repeat CUDA 10163 times at  1.40 seconds. Repeat former 21 times

  Results of all calculations should be 0.351382

  Test  4 Byte  Ops/   Repeat  Seconds   MFLOPS    Errors     First      Value
         Words  Word   Passes                                  Word 

   1     50000    32   213423   29.531    11563   None found
   2     50000    32   213423   29.531    11563   None found

 ********************************************************

  Device 0: GeForce 8600 GT  with 4 Processors 32 cores 
  Repeat CUDA 34576 times at  1.64 seconds. Repeat former 18 times

   1     50000    32   622368   29.501    33754   None found
   2     50000    32   622368   29.505    33749   None found


 Phenom II X4 3.0 GHz, 64-Bit Windows 7, GeForce GTS 250
 ********************************************************

  CUDA MFLOPS Benchmark 1.1 Mon Nov 09 11:11:31 2009

  CUDA devices found 
  Device 0: GeForce GTS 250  with 16 Processors 128 cores 

  Using 256 Threads

  Calculate      Reliability Test 10 minutes, report every 15 seconds

  Repeat CUDA 761 times at  1.52 seconds. Repeat former 9 times

  Results of all calculations should be 0.741250

  Test  4 Byte  Ops/   Repeat  Seconds   MFLOPS    Errors     First      Value
         Words  Word   Passes                                  Word 

   1  10000000    32     6849   13.736   159563   None found
   2  10000000    32     6849   13.732   159605   None found
   3  10000000    32     6849   13.730   159623   None found
   4  10000000    32     6849   13.730   159629   None found
   5  10000000    32     6849   13.734   159575   None found
   6  10000000    32     6849   13.732   159607   None found
  to 
  40  10000000    32     6849   13.753   159362   None found

  Minutes          0  0.5  1.0  1.5  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
  Temperature °C  42   59   63   67   69   71   71   72   72   71   70   71





Errors

As indicated above, the performance tests check each word for consistent values. Following is an example of the error reports, where details of the first ten comparison failures are provided. In developing the benchmark, it was found that the program could finish with some of the data unchanged from the initial value of 0.999999; these checks will show if this happens. Reliability tests provide a count of the number of errors, plus the array position and value of the first wrong word. Examples of CUDA generated error messages are also shown.
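A minimal sketch of such a consistency check - illustrative of the approach, with assumed names, not the benchmark's actual source:

   #include <stdio.h>

   /* Check that all words hold the same value as the first,
      reporting up to the first ten comparison failures. */
   int checkResults(float *x, int words)
   {
      int errors = 0;
      for (int i = 1; i < words; i++)
      {
         if (x[i] != x[0])
         {
            if (errors < 10)
               printf(" word %d  was %9.6f not %9.6f\n", i, x[i], x[0]);
            errors++;
         }
      }
      return errors;
   }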


  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Shared Memory   10000000    32      250   2.314890    34559   See later   No

 First Unexpected Results
 Shared Memory   10000000    32      250 word 38528  was  0.010145 not  0.988073
 Shared Memory   10000000    32      250 word 38529  was  0.010145 not  0.988073


 Error display if GPU is used continuously for more than 2 seconds.

 cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file CudaMFLOPS1.cu
 cudaSafeCall() Runtime API error in file < CudaMFLOPS1.cu >, line 232 :
 the launch timed out and was terminated.


 Error display if attempt is made to use more than 512 threads.

 cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file CudaMFLOPS1.cu
 line 225 : invalid configuration argument.




Roy Longbottom, October 2014



The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection