
CUDA GPU Parallel Computing Benchmarks

Contents


  General
  Maximum Performance
  CUDA Programming
  CudaMFLOPS1 Benchmark
  Shared Memory and Repeat Calculations
  More Comparisons
  Run Time Parameters
  Reliability Testing
  Errors

Summary

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. Maximum speeds, in terms of billions of floating point operations per second or GFLOPS, can be higher on a laptop graphics processor than on CPUs such as dual core models. This applies to Single Instruction Multiple Data (SIMD) operation, where the same instructions are executed simultaneously on sections of data from a data array. For maximum speed, the data array has to be large, with little or no reference to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed, cache-like memory.

The CudaMFLOPS1 benchmark exploits multiple registers, large data sizes and, now, the fast shared memory. An intention is to demonstrate poor performance as well as high speed operation. All tests originally comprised three sets of measurements, but additional tests are now included:

  • New Calculations - Copy data to graphics RAM, execute instructions, copy back to host RAM
  • Update Data - Execute further instructions on data in graphics RAM, copy back to host RAM
  • Graphics Only Data - Execute further instructions on data in graphics RAM, leave it there
  • Extra Test 1 - Just graphics data, repeat loop in CUDA function
  • Extra Test 2 - Just graphics data, repeat loop in CUDA function but using Shared Memory

These are run at three data sizes, the defaults being 100,000 words repeated 2500 times, 1M words repeated 250 times and 10M words repeated 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.

A graphics card used has 32 processor cores working at 1.19 GHz and, executing an add and a multiply per clock cycle, could run at 76 GFLOPS. Maximum speed obtained by the benchmark, for graphics only data, 10M words and 32 instructions, was 29 GFLOPS, considerably degraded by each single word being transferred in and out of graphics RAM. This increased to 31 GFLOPS with Extra Test 1, then 34 GFLOPS using Shared Memory. Calculations to exclude data transfer overheads indicated up to 51 GFLOPS.

Slowest speed for graphics only data, at 100K words and 2 instructions per word, was 1.6 GFLOPS. With data in and out, the speed range was 0.2 to 6.0 GFLOPS over all data sizes. Note that the same calculations, running on both processors of the 2.4 GHz Core 2 Duo (OpenMP, using the “#pragma omp parallel for” statement and not SSE instructions), produced 1.15 to 5.0 GFLOPS, data in and out - see OpenMP MFLOPS.htm.
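For reference, below is a minimal sketch of that CPU comparison, assuming the same form of arithmetic as above - an illustration only, not the OpenMP MFLOPS source code:

   // Spread the same style of calculation across both CPU cores.
   // Compile with OpenMP enabled (for example -fopenmp or /openmp).
   void cpuCalc(int n, float a, float b, float *x)
   {
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
      {
         x[i] = (x[i] + a) * b;
      }
   }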

Later results are for a card with 128 processors working at 1.836 GHz, with a maximum specification of 705 GFLOPS. This produced up to 360 GFLOPS, with minimum speeds of 0.3 to 11.1 GFLOPS for data in and out. Even later, a 2010 top end card, with a maximum specification of 1345 GFLOPS, ran at up to 983 GFLOPS, and 0.5 to 15.5 GFLOPS with data in and out.

The benchmark can be downloaded via CudaMflops.zip. No installation is necessary - Extract All and click on CudaMFLOPS1.exe but see ReadMe.txt first. The ZIP file also includes the CUDA C++ source code. A double precision version is now available. For details and results see CUDA2.htm.

The first single and double precision benchmarks were compiled using CUDA Toolkit 2.3. These have been replaced with updated versions, providing a little extra detail on graphics memory utilisation, and were compiled for 32 bit (x86) PCs. Versions have also been produced using CUDA Toolkit 3.1, particularly to see if double precision calculations are faster on later GeForce GPUs. The revision exercise started by compiling for 64 bit (x64) PCs, which was not straightforward, and different procedures were needed for CUDA 2.3 and 3.1. Details of the revised versions and problems are in CUDA3 x64.htm, with source code and all benchmark EXE files in CudaMflops.zip.

The benchmarks have now been ported to 32-bit and 64-bit versions of Ubuntu Linux. Details and results are provided in linux_cuda_mflops.htm.

See GigaFLOPS Benchmarks.htm for further details and results, including comparisons with MP MFLOPS, a threaded C version, OpenMP MFLOPS, and Qpar MFLOPS, where Qpar is Microsoft’s proprietary equivalent of OpenMP and faster via Windows. The benchmarks and source codes can be obtained via gigaflops-benchmarks.zip.




General

CUDA is an nVidia general purpose parallel computing architecture that uses the large number of Graphics Processing Units (GPUs) available on modern GeForce graphics cards. Free software can be downloaded, for 32 and 64 bit operation, initially for use in conjunction with a C/C++ compiler. The software comes in three parts, the first being a new graphics driver with an option to turn CUDA processing on and off. Part two is a Toolkit that includes functions needed to run compiled programs - in this case cudart.dll, where it is not clear whether this has to be compatible with a specific driver version. The third part is the Software Development Kit (SDK) that includes example source and compiled codes.

Some CUDA functions can be compiled with a C/C++ compiler (64 or 32 bit). Most example programs have an extension of .CU instead of .CPP and might contain functions that need the CUDA compiler nvcc. In the case of Windows, the latter has to be interfaced to an MS compiler and does not work with older versions. For 32 bit working, the free MS Visual C++ 2008 Express Edition is compatible.

The CUDA drivers and software were installed on a PC with a GeForce 8600 GT card using 64-Bit Vista and a laptop with 8400M GS graphics and 32-Bit Vista. Visual C++ 2008 Express Edition was also installed on both.

In principle, the example projects can be loaded for compilation with VC 2008 by clicking on a .vcproj file. In my case, Build produced an error, but BuildLog.htm showed the compile commands used. These could be run from a Command Prompt window after executing the supplied vcvarsall.bat and the further bat commands for CUDA, as below. The compile command line used was as follows, although nvcc -c CudaMFLOPS1.cu (compile only) also worked. The link command needed /NODEFAULTLIB:libc.lib CUDA.LIB CUDART.LIB CudaMFLOPS1.obj.

nvcc -arch sm_10 -ccbin D:\MSCompile\MSVC8\VC\bin -Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT" -c CudaMFLOPS1.cu
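Based on that, the full link line would take a form such as the following, where the /OUT file name is an assumption and the libraries are found via the LIB settings shown below:

link /OUT:CudaMFLOPS1.exe /NODEFAULTLIB:libc.lib CUDA.LIB CUDART.LIB CudaMFLOPS1.obj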

The software was copied to a PC using Windows XP - in my case, folders MSVC8 and v6.0A for the C++ compiler, plus those for CUDA. A BAT file with the following is executed from a Command Prompt to allow the same compile and link lines to work successfully.


  Set PATH=D:\MSCompile\MSVC8\Common7\IDE;%PATH%
  Set PATH=D:\MSCompile\MSVC8\VC\BIN;%PATH%
  Set PATH=D:\MSCompile\MSVC8\Common7\Tools;%PATH%
  Set PATH=D:\MSCompile\MSVC8\VC\VCPackages;%PATH%
  Set PATH=C:\Program Files\Microsoft SDKs\Windows\v6.0A\bin;%PATH%
  Set LIB=D:\MSCompile\MSVC8\VC\LIB;%LIB%
  Set LIB=C:\Program Files\Microsoft SDKs\Windows\v6.0A\lib;%LIB%
  Set LIBPATH=D:\MSCompile\MSVC8\VC\LIB;%LIBPATH%
  Set INCLUDE=D:\MSCompile\MSVC8\VC\INCLUDE;%INCLUDE%
  Set INCLUDE=C:\Program Files\Microsoft SDKs\Windows\v6.0A\include;%INCLUDE%

  Set PATH=D:\CUDA\bin;%PATH%
  Set INCLUDE=D:\CUDA\include;%INCLUDE%
  Set INCLUDE=D:\CUDA\SDK\common\inc;%INCLUDE%
  Set LIB=D:\CUDA\lib;%LIB%
  Set LIB=D:\CUDA\SDK\common\lib;%LIB%




Maximum Performance

Maximum floating point execution speed in GFLOPS (32 bits) is calculated as shader clock GHz x shader stream processors x 3, the three representing a linked multiply and add (MADD) plus another multiply that can be executed per clock cycle. In my case, maximum speeds would be 1.19 x 32 x 3 = 114.24 GFLOPS for the 8600 GT and 0.8 x 16 x 3 = 38.4 GFLOPS for the 8400M GS. Googling indicates that the second multiply did not work on graphics hardware of this vintage, which would reduce maximum speeds to 76.16 and 25.6 GFLOPS.

Note that the theoretical maximum speed of the test PC Core 2 Duo, with 32 bit SSE instructions, is 2.4 (GHz) x 4 (SSE register size) x 2 (CPUs) x 2 (MADD) = 38.4 GFLOPS, but the best I could demonstrate with data from cache is around 6.5 GFLOPS per CPU.

One of the provided CUDA sample programs does produce high GFLOPS. The program is nbody.exe (a popular particle simulation method). It produces a graphics display but can also be command line driven to report GFLOPS without the graphics. Results are 56 GFLOPS on the desktop with 2.4 GHz Core 2 Duo and GeForce 8600 GT, then 19 GFLOPS on the laptop with 1.8 GHz Core 2 Duo and 8400M GS. As indicated below, Nbody performance does not appear to be limited by graphics memory or PCI-E bus speeds. The code indicates that there are 20 FLOPS per interaction, and it will be using some of the available pool of 8192 registers for this. It also makes use of local memory, providing high speed, cache-like performance.

BandwidthTest.exe, another of the CUDA sample programs, provides MBytes/second speeds transferring 33 MB of data from main memory to graphics RAM, graphics RAM to main memory and graphics RAM to graphics RAM. The three measurements for the 8600 GT were 1905, 1450 and 15942 MB/sec, with 1145, 848 and 4140 MB/sec for the 8400M. For 4 byte words, and assuming one floating point calculation for each of these memory accesses, these translate into maximum speeds of 0.3 to 0.5 GFLOPS with main memory access and 4 GFLOPS using only graphics RAM on the desktop, with 0.2 to 0.3 GFLOPS and 1 GFLOPS on the laptop. The CudaMFLOPS1 Benchmark demonstrates this level of performance and the impact of additional calculations per memory access.
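As a worked example of that translation, using the 8600 GT figures:

    1905 MB/sec / 4 bytes per word =  476 M words/sec, or ~0.5 GFLOPS at 1 op/word
   15942 MB/sec / 4 bytes per word = 3986 M words/sec, or ~4.0 GFLOPS at 1 op/word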

Later results are for a GeForce GTS 250 at 1.836 GHz x 128 processors x 3, or 705 GFLOPS. This was using a 3.0 GHz quad core Phenom II CPU. Results produced were 285 GFLOPS with Nbody, then 3033 and 54129 MB/sec (758 to 13532 MFLOPS) with BandwidthTest.

Latest results are for a 2010 top end GTX 480 at 1.4 GHz x 480 x 2, or 1344 GFLOPS (or x 3 for 2016 GFLOPS linked multiply/add/multiply?). The system had a quad core 2.8 GHz Core i7 processor. Results for this were 752 GFLOPS with Nbody, then 3766 and 120236 MB/sec (942 to 20059 MFLOPS) with BandwidthTest.


  Summary of Maximum Speeds - from above and CudaMFLOPS1 below 

                                       8600 GT   8400M GS    GTS 250    GTX 480
                                        MFLOPS     MFLOPS     MFLOPS     MFLOPS

   Maximum Specification                114240      38400     705024    1344960
   Or                                    76160      25600
   NBody                                 55970      18590     285593     751637
   Bandwidth From Host CPU RAM 1 Op        500        300        758        942
   Bandwidth From Graphics RAM 1 Op       4000       1030      13532      20059
   CudaMFLOPS1 Data In & Out   2 Ops       414        220        714        987
   CudaMFLOPS1 Data out only   2 Ops       677        340       1338       1911
   CudaMFLOPS1 Graphics RAM    2 Ops      3866       1030      13038      32286
   CudaMFLOPS1 Data In & Out  32 Ops      5926       3098      11182      15485
   CudaMFLOPS1 Data out only  32 Ops      9006       4459      19975      29608
   CudaMFLOPS1 Graphics RAM   32 Ops     28994       9876     135041     440451
   CudaMFLOPS1 Shared Memory  32 Ops     34094      11454     173291     769658

   CudaMFLOPS1 Calculations # 24 Ops     43782      18792     297354     ******
   CudaMFLOPS1 Calculations # 30 Ops     51016      23093     358989     983220

            #       Based on time difference between 32 Ops and 8 or 2 Ops
            ******  Too fast to estimate




CUDA Programming

CUDA programming for Single Instruction Multiple Data (SIMD) operation to use the graphics processors for general calculations on data arrays can be quite simple. The number of processing threads has to be specified (maximum here 512), with blocks of threads calculated as array words / threads. Functions are called using, for example, calc<<< blocks, threads >>>(words, aval, bval, gpu_array);. The simple function, in this case, could be:

            __global__ void calc(int n, float a, float b, float *data)
            {
               int i = blockIdx.x * blockDim.x + threadIdx.x;
               if( i < n ) data[i] = (data[i] + a) * b;
            }

This leads to automatic execution of threads on available graphics processor cores for each block in turn. Special functions for allocating/releasing graphics RAM, copying data from the host CPU memory and arranging synchronisation of threads are cudaMalloc, cudaFree, cudaMemcpy and cudaThreadSynchronize. Further details, using Shared Memory and other calculations, are given below.
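A minimal self-contained sketch showing how these calls fit together follows - illustrative only, not the benchmark source, with arbitrary array size and constants:

   #include <stdio.h>
   #include <stdlib.h>
   #include <cuda_runtime.h>

   __global__ void calc(int n, float a, float b, float *data)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if( i < n ) data[i] = (data[i] + a) * b;
   }

   int main()
   {
      const int words   = 100000;                  // 4 byte words
      const int threads = 256;                     // threads per block
      const int blocks  = (words + threads - 1) / threads;
      size_t    bytes   = words * sizeof(float);

      float *x = (float *)malloc(bytes);
      for (int i = 0; i < words; i++) x[i] = 0.999999f;

      float *x_gpu;
      cudaMalloc((void **)&x_gpu, bytes);                   // graphics RAM
      cudaMemcpy(x_gpu, x, bytes, cudaMemcpyHostToDevice);  // data in

      calc<<< blocks, threads >>>(words, 0.000497f, 0.9995f, x_gpu);
      cudaThreadSynchronize();                              // wait for GPU

      cudaMemcpy(x, x_gpu, bytes, cudaMemcpyDeviceToHost);  // data out
      printf(" First result %8.6f\n", x[0]);

      cudaFree(x_gpu);
      free(x);
      return 0;
   }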



CudaMFLOPS1 Benchmark

CudaMFLOPS1.exe has 3 x 3 x 3 tests, with increasing array sizes, increasing numbers of arithmetic operations per memory access and different data access modes. For the latter, speed is first calculated for copying the data from host CPU memory to graphics RAM, processing and copying results back to the host. Next, the test is repeated, only copying from the host on the first of multiple passes. The last test excludes repeated copying of results to and from the host but does include graphics RAM/GPU data transfers. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 operations. Default array sizes used are 0.1, 1 or 10 million 4 byte single precision floating point words. The latest version has additional calculation and shared memory tests where repetition is controlled within the main CUDA functions (see below, and the example kernel that follows).
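As an illustration of how the operations are counted, the 8 operations per word case corresponds to a kernel of the following form - a sketch based on the formula above, with assumed parameter names:

   __global__ void calc8(int n, float a, float b, float c,
                         float d, float e, float f, float *x)
   {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      // 3 adds in brackets + 3 multiplies + 1 subtract + 1 add
      // = 8 floating point operations per data element
      if( i < n ) x[i] = (x[i]+a)*b - (x[i]+c)*d + (x[i]+e)*f;
   }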

On running, results are displayed and saved in file CudaLog.txt. Examples are given below; the layout is somewhat different to that of the initial version.

Initially, each word in the data array is set to 0.999999 and has the same calculations applied, producing a final result somewhat lower. This result is now constant for a given number of operations per word and repeat passes (for example 0.998799 for 32 operations and 25 repeats). The results are also useful for comparing numerical accuracy when the same calculations are executed on a CPU. The revised benchmark displays the first result and checks that all words are identical - see example logs below and Errors later.

Below are results for PCs with Core 2 Duo 2.4 GHz, Vista 64, GeForce 8600 GT, Mobile Core 2 Duo 1.8 GHz, Vista 32, GeForce 8400M GS and Phenom II X4 3.0 GHz, 64-Bit Windows 7, GeForce GTS 250. Results became available in 2010 for a GeForce GTX 480 top end graphics card, using a Core i7 930 processor and 64-Bit Windows 7. These were from a revised version of the benchmark, still using CUDA Toolkit 2.3, but with the same layout as the double precision version and providing additional details on graphics memory use.

Assuming a constant non-overlapped overhead for data transfer, with calculations based on the difference in fastest times between those for 32 and 8 or 2 operations, actual execution speed comes out at between 40 and 51 GFLOPS on the 8600 GT, 13 to 23 GFLOPS on the 8400M and 203 to 360 GFLOPS using the GTS 250 (see calculations below). Slowest speeds, with continuous data transfer from/to the host CPU and two calculations for each word transferred, are 215, 132 and 328 MFLOPS for the three systems.


 Core 2 Duo 2.4 GHz, Vista 64, GeForce 8600 GT

 ********************************************************

  CUDA MFLOPS Benchmark 1.1 Wed Sep 23 11:52:04 2009

  CUDA devices found 
  Device 0: GeForce 8600 GT  with 4 Processors 32 cores 

  Using 256 Threads

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     100000     2     2500   2.328578      215   0.929538   Yes
 Data out only     100000     2     2500   1.368496      365   0.929538   Yes
 Calculate only    100000     2     2500   0.302413     1653   0.929538   Yes

 Data in & out    1000000     2      250   1.466895      341   0.992550   Yes
 Data out only    1000000     2      250   0.828149      604   0.992550   Yes
 Calculate only   1000000     2      250   0.143866     3475   0.992550   Yes

 Data in & out   10000000     2       25   1.207633      414   0.999250   Yes
 Data out only   10000000     2       25   0.738722      677   0.999250   Yes
 Calculate only  10000000     2       25   0.129336     3866   0.999250   Yes

 Data in & out     100000     8     2500   2.352154      850   0.956980   Yes
 Data out only     100000     8     2500   1.399414     1429   0.956980   Yes
 Calculate only    100000     8     2500   0.303826     6583   0.956980   Yes

 Data in & out    1000000     8      250   1.481691     1350   0.995509   Yes
 Data out only    1000000     8      250   0.844532     2368   0.995509   Yes
 Calculate only   1000000     8      250   0.151419    13208   0.995509   Yes

 Data in & out   10000000     8       25   1.212482     1650   0.999549   Yes
 Data out only   10000000     8       25   0.751451     2662   0.999549   Yes
 Calculate only  10000000     8       25   0.138901    14399   0.999549   Yes

 Data in & out     100000    32     2500   2.521739     3172   0.890079   Yes
 Data out only     100000    32     2500   1.601592     4995   0.890079   Yes
 Calculate only    100000    32     2500   0.506698    15789   0.890079   Yes

 Data in & out    1000000    32      250   1.622790     4930   0.988073   Yes
 Data out only    1000000    32      250   0.986792     8107   0.988073   Yes
 Calculate only   1000000    32      250   0.298806    26773   0.988073   Yes

 Data in & out   10000000    32       25   1.350080     5926   0.998799   Yes
 Data out only   10000000    32       25   0.888331     9006   0.998799   Yes
 Calculate only  10000000    32       25   0.275918    28994   0.998799   Yes

 Extra tests - loop in main CUDA Function

 Calculate       10000000     2       25   0.133918     3734   0.999250   Yes
 Shared Memory   10000000     2       25   0.046697    10707   0.999250   Yes

 Calculate       10000000     8       25   0.132696    15072   0.999549   Yes
 Shared Memory   10000000     8       25   0.084822    23579   0.999549   Yes

 Calculate       10000000    32       25   0.255621    31296   0.998799   Yes
 Shared Memory   10000000    32       25   0.234647    34094   0.998799   Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
  Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz Measured 2402 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  Intel processor architecture, 2 CPUs 
  Windows NT  Version 6.0, build 6002, Service Pack 2
  Memory 4095 MB, Free 2878 MB
  User Virtual Space 4096 MB, Free 4047 MB
 

 Core 2 Duo 1.8 GHz Laptop, Vista 32, GeForce 8400M GS

 ********************************************************

  CUDA MFLOPS Benchmark 1.1 Tue Sep 22 18:09:43 2009

  CUDA devices found 
  Device 0: GeForce 8400M GS  with 2 Processors 16 cores 

  Using 256 Threads

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     100000     2     2500   3.774912      132   0.929538   Yes
 Data out only     100000     2     2500   2.358369      212   0.929538   Yes
 Calculate only    100000     2     2500   0.744282      672   0.929538   Yes

 Data in & out    1000000     2      250   2.486853      201   0.992550   Yes
 Data out only    1000000     2      250   1.615676      309   0.992550   Yes
 Calculate only   1000000     2      250   0.509336      982   0.992550   Yes

 Data in & out   10000000     2       25   2.272207      220   0.999250   Yes
 Data out only   10000000     2       25   1.469490      340   0.999250   Yes
 Calculate only  10000000     2       25   0.485254     1030   0.999250   Yes

 Data in & out     100000     8     2500   3.736443      535   0.956980   Yes
 Data out only     100000     8     2500   2.376323      842   0.956980   Yes
 Calculate only    100000     8     2500   0.757495     2640   0.956980   Yes

 Data in & out    1000000     8      250   2.530923      790   0.995509   Yes
 Data out only    1000000     8      250   1.632063     1225   0.995509   Yes
 Calculate only   1000000     8      250   0.514978     3884   0.995509   Yes

 Data in & out   10000000     8       25   2.267129      882   0.999549   Yes
 Data out only   10000000     8       25   1.491643     1341   0.999549   Yes
 Calculate only  10000000     8       25   0.490735     4076   0.999549   Yes

 Data in & out     100000    32     2500   4.134719     1935   0.890079   Yes
 Data out only     100000    32     2500   2.776958     2881   0.890079   Yes
 Calculate only    100000    32     2500   1.146639     6977   0.890079   Yes

 Data in & out    1000000    32      250   2.803619     2853   0.988073   Yes
 Data out only    1000000    32      250   1.946918     4109   0.988073   Yes
 Calculate only   1000000    32      250   0.834157     9591   0.988073   Yes

 Data in & out   10000000    32       25   2.582265     3098   0.998799   Yes
 Data out only   10000000    32       25   1.794135     4459   0.998799   Yes
 Calculate only  10000000    32       25   0.810025     9876   0.998799   Yes

 Extra tests - loop in main CUDA Function

 Calculate       10000000     2       25   0.503554      993   0.999250   Yes
 Shared Memory   10000000     2       25   0.139127     3594   0.999250   Yes

 Calculate       10000000     8       25   0.516617     3871   0.999549   Yes
 Shared Memory   10000000     8       25   0.253509     7889   0.999549   Yes

 Calculate       10000000    32       25   0.759397    10535   0.998799   Yes
 Shared Memory   10000000    32       25   0.698451    11454   0.998799   Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006FD
  Intel(R) Core(TM)2 Duo CPU     T5550  @ 1.83GHz Measured 1828 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  Intel processor architecture, 2 CPUs 
  Windows NT  Version 6.0, build 6001, Service Pack 1
  Memory 2046 MB, Free 906 MB
  User Virtual Space 2048 MB, Free 2006 MB


 Phenom II X4 3.0 GHz, 64-Bit Windows 7, GeForce GTS 250

 ********************************************************

  CUDA MFLOPS Benchmark 1.1 Wed Nov 04 18:44:35 2009

  CUDA devices found 
  Device 0: GeForce GTS 250  with 16 Processors 128 cores 

  Using 256 Threads

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Data in & out     100000     2     2500   1.522759      328   0.929538   Yes
 Data out only     100000     2     2500   0.759221      659   0.929538   Yes
 Calculate only    100000     2     2500   0.163718     3054   0.929538   Yes

 Data in & out    1000000     2      250   0.799499      625   0.992550   Yes
 Data out only    1000000     2      250   0.441804     1132   0.992550   Yes
 Calculate only   1000000     2      250   0.051695     9672   0.992550   Yes

 Data in & out   10000000     2       25   0.700515      714   0.999250   Yes
 Data out only   10000000     2       25   0.373559     1338   0.999250   Yes
 Calculate only  10000000     2       25   0.038349    13038   0.999250   Yes

 Data in & out     100000     8     2500   1.496596     1336   0.956980   Yes
 Data out only     100000     8     2500   0.740724     2700   0.956980   Yes
 Calculate only    100000     8     2500   0.163493    12233   0.956980   Yes

 Data in & out    1000000     8      250   0.839552     2382   0.995509   Yes
 Data out only    1000000     8      250   0.451659     4428   0.995509   Yes
 Calculate only   1000000     8      250   0.050657    39481   0.995509   Yes

 Data in & out   10000000     8       25   0.678276     2949   0.999549   Yes
 Data out only   10000000     8       25   0.364515     5487   0.999549   Yes
 Calculate only  10000000     8       25   0.039063    51199   0.999549   Yes

 Data in & out     100000    32     2500   1.555722     5142   0.890079   Yes
 Data out only     100000    32     2500   0.813645     9832   0.890079   Yes
 Calculate only    100000    32     2500   0.221733    36080   0.890079   Yes

 Data in & out    1000000    32      250   0.848588     9427   0.988073   Yes
 Data out only    1000000    32      250   0.472604    16927   0.988073   Yes
 Calculate only   1000000    32      250   0.073957   108170   0.988073   Yes

 Data in & out   10000000    32       25   0.715423    11182   0.998799   Yes
 Data out only   10000000    32       25   0.400496    19975   0.998799   Yes
 Calculate only  10000000    32       25   0.059241   135041   0.998799   Yes

 Extra tests - loop in main CUDA Function

 Calculate       10000000     2       25   0.050168     9967   0.999250   Yes
 Shared Memory   10000000     2       25   0.009245    54082   0.999250   Yes

 Calculate       10000000     8       25   0.050910    39285   0.999549   Yes
 Shared Memory   10000000     8       25   0.016720   119617   0.999549   Yes

 Calculate       10000000    32       25   0.050587   158143   0.998799   Yes
 Shared Memory   10000000    32       25   0.046165   173291   0.998799   Yes

 Hardware  Information
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
  AMD Phenom(tm) II X4 945 Processor Measured 3013 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow, 
 Windows  Information
  Intel processor architecture, 4 CPUs 
  Windows NT  Version 6.1, build 7600, 
  Memory 4096 MB, Free 4096 MB
  User Virtual Space 4096 MB, Free 4050 MB


 Core i7 Quad 2.8 GHz, Windows 7-64, GeForce GTX 480

 ********************************************************
  CUDA 2.3 x86 Single Precision MFLOPS Benchmark 1.3 Fri Jul 30 15:23:30 2010

  CUDA devices found 
  Device 0: GeForce GTX 480  with 15 Processors 120 cores 
  Global Memory 1468 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  0.959230      521   0.9295383095741  Yes
 Data out only    100000    2    2500  0.513680      973   0.9295383095741  Yes
 Calculate only   100000    2    2500  0.087057     5743   0.9295383095741  Yes

 Data in & out   1000000    2     250  0.607810      823   0.9925497770309  Yes
 Data out only   1000000    2     250  0.295149     1694   0.9925497770309  Yes
 Calculate only  1000000    2     250  0.022970    21767   0.9925497770309  Yes

 Data in & out  10000000    2      25  0.506478      987   0.9992496371269  Yes
 Data out only  10000000    2      25  0.261632     1911   0.9992496371269  Yes
 Calculate only 10000000    2      25  0.015486    32286   0.9992496371269  Yes

 Data in & out    100000    8    2500  0.966099     2070   0.9571172595024  Yes
 Data out only    100000    8    2500  0.533396     3750   0.9571172595024  Yes
 Calculate only   100000    8    2500  0.090899    22002   0.9571172595024  Yes

 Data in & out   1000000    8     250  0.608723     3286   0.9955183267593  Yes
 Data out only   1000000    8     250  0.296098     6755   0.9955183267593  Yes
 Calculate only  1000000    8     250  0.023509    85074   0.9955183267593  Yes

 Data in & out  10000000    8      25  0.503973     3968   0.9995489120483  Yes
 Data out only  10000000    8      25  0.261953     7635   0.9995489120483  Yes
 Calculate only 10000000    8      25  0.015559   128542   0.9995489120483  Yes

 Data in & out    100000   32    2500  1.003045     7976   0.8902152180672  Yes
 Data out only    100000   32    2500  0.556316    14380   0.8902152180672  Yes
 Calculate only   100000   32    2500  0.125837    63574   0.8902152180672  Yes

 Data in & out   1000000   32     250  0.629748    12703   0.9880878329277  Yes
 Data out only   1000000   32     250  0.310655    25752   0.9880878329277  Yes
 Calculate only  1000000   32     250  0.029831   268179   0.9880878329277  Yes

 Data in & out  10000000   32      25  0.516623    15485   0.9987964630127  Yes
 Data out only  10000000   32      25  0.270197    29608   0.9987964630127  Yes
 Calculate only 10000000   32      25  0.018163   440451   0.9987964630127  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.004954   100937   0.9992496371269  Yes
 Shared Memory  10000000    2      25  0.002766   180734   0.9992496371269  Yes

 Calculate      10000000    8      25  0.006198   322696   0.9995489120483  Yes
 Shared Memory  10000000    8      25  0.004850   412376   0.9995489120483  Yes

 Calculate      10000000   32      25  0.012249   653139   0.9987964630127  Yes
 Shared Memory  10000000   32      25  0.010394   769658   0.9987964630127  Yes

 Hardware  Information
  CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000106A5
  Intel(R) Core(TM) i7 CPU         930  @ 2.80GHz Measured 2807 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
 Windows  Information
  Intel processor architecture, 8 CPUs 
  Windows NT  Version 6.1, build 7600, 
  Memory 4096 MB, Free 4096 MB
  User Virtual Space 4096 MB, Free 4042 MB

  1468 MB Graphics RAM, Used 261 Minimum, 297 Maximum




Shared Memory and Repeat Calculations

The original version ran repeated calculations from within the main C++ code, as shown in the "Calculate Only" section below. For the first extra test, "Calculate", the repetition loop was moved to the CUDA function, where it was initially expected to produce faster speeds due to fewer function calls. In practice, this is not necessarily the case and, bearing in mind that a data size of up to 40M bytes is used, it appears that the CUDA software must arrange multiple internal calls to transfer part of the data at a time.

For the third example, and in this case, the size of Shared Memory is defined in the main program as 1024 bytes (per thread?) and is allocated at the start of the CUDA function. Data is then copied from graphics RAM before calculation and back to graphics RAM at the end. As shown below, this can produce significant performance gains when limited calculations are used.
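Note that, in CUDA, the third kernel launch parameter specifies dynamic shared memory bytes per block, so 1024 bytes matches 256 threads x 4 byte words per block. A likely form of the calculation - an assumption, not the benchmark source - is:

   // one float of shared memory for each thread in a block
   // (assumed calculation: 256 threads x 4 bytes = 1024 bytes)
   int sharedMemSize = threadsPerBlock * sizeof(float);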


 Calculate only - Repeat

 for (i=0; i < repeats; i++)
 {
    calc2<<< blocks, threads >>>(words, aval, xval, x_gpu);
 }
 cutilSafeCall( cudaThreadSynchronize() );
 cutilCheckMsg("Kernel execution failed");

 ----------------------------------------------------------

 __global__ void calc2(int n, float a, float b, float *x)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if( i < n ) x[i] = (x[i]+a)*b;
 }
 ##########################################################
 
 Calculate - Repeat in CUDA Function 

 calc2r<<< blocks, threads >>>(words, repeats, aval, xval, x_gpu);
 cutilSafeCall( cudaThreadSynchronize() );
 cutilCheckMsg("Kernel execution failed");

 ----------------------------------------------------------

 __global__ void calc2r(int n, int rep, float a, float b, float *x)
 {
     for (int rr=0; rr < rep; rr++)
     {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if( i < n ) x[i] = (x[i]+a)*b;
     }
 }
 ##########################################################

 Shared Memory - Repeat in CUDA Function 

 sharecalc2<<< numBlocks, threadsPerBlock, sharedMemSize >>>
              (x_gpu, repeats,  aval, xval);
 cutilSafeCall( cudaThreadSynchronize() );
 cutilCheckMsg("Kernel execution failed");

 ----------------------------------------------------------

 __global__ void sharecalc2(float *xi, int rep, float a, float b)
 {
     extern __shared__ float x[];
     int inOffset  = blockDim.x * blockIdx.x;
     int in  = inOffset + threadIdx.x;
     x[threadIdx.x] = xi[in];
     __syncthreads();
     int i = threadIdx.x;
     for (int rr=0; rr < rep; rr++)
     {
        x[i] = (x[i]+a)*b;
     }
     xi[in] = x[threadIdx.x];
 }



More Comparisons

It is not clear how the CUDA software arranges for data to be transferred using the above code, but identical numeric results for all tests, at given operations per word and repetition rates, show that exactly the same number of calculations has been executed.

As indicated above, speed does not vary much whether calculations are repeated via the main program or the CUDA function, and data transfers between host and graphics memory are likely to be similar. On the other hand, speed using shared memory is faster, particularly with fewer arithmetic operations per word. The calculations below show that shared memory produces lower GFLOPS for the calculations themselves, but the overheads for transferring the data are much smaller.


 All 10M Wds 25 Reps  GeF 8600 GT     GeF 8400M GS    GeF GTS 250     GeF GTX 480
 
 Test           Ops   Secs  GFLOPS    Secs  GFLOPS    Secs  GFLOPS    Secs  GFLOPS

 Calculate only   2  0.129    3.86    0.485   1.03   0.038   13.04   0.015   32.29
 Shared Memory    2  0.047   10.71    0.139   3.59   0.009   54.08   0.003  180.73

 Calculate only   8  0.139   14.40    0.491   4.08   0.039   51.20   0.016  128.54
 Shared Memory    8  0.085   23.58    0.254   7.89   0.016  119.62   0.004  412.38

 Calculate only  32  0.276   29.00    0.810   9.88   0.059  135.04   0.018  440.45
 Shared Memory   32  0.235   34.09    0.698  11.45   0.046  173.29   0.010  769.66


 Calculations 32 Ops minus 2 Ops - 30 Ops, 25 Repeats

            GeF 8600 GT         GeF 8400M GS        GeF GTS 250         GeF GTX 480

 Test     Secs    G   Secs    Secs    G   Secs    Secs    G   Secs    Secs    G   Secs
         Calcs  FLOPS  O/H   Calcs  FLOPS  O/H   Calcs  FLOPS  O/H   Calcs  FLOPS  O/H

 Calc    0.147   51  0.119   0.325   23  0.463   0.021  359  0.037   0.003* ***  0.015
 Sh Mem  0.188   40  0.034   0.559   13  0.102   0.037  203  0.007   0.008  983  0.002

                                    * *** too fast to estimate
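 As a worked example of the method, using the 8600 GT Calculate times
 from the first table above:

   Calcs  = 0.276 - 0.129 = 0.147 seconds for the extra 30 operations
   GFLOPS = 30 Ops x 10M words x 25 passes / 0.147 seconds = 51
   O/H    = 0.276 - 0.147 x 32 / 30 = 0.119 seconds transfer overhead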




Run Time Parameters

Run time parameters are available for use via a Command Prompt window or BAT file. See ExampleRun.bat in CudaMflops.zip. Parameters are:

Threads (or t or T) - Default 256, Maximum currently 512
Words (or w or W) - Default 100000 (increases to 1M and 10M)
Repeats (or r or R) - Default 2500 (decreases to 250 and 25), Minimum 100

Examples:

CudaMFLOPS1 Threads 128, Words 50000, Repeats 2500
CudaMFLOPS1 t 256 W 1000 r 100
CudaMFLOPS1 Threads 512

Varying Number Of Threads - Following are results from the last Calculate Only test, using the same data size and repeat passes but varying numbers of threads, with parameters Words 600000, Repeats 100. Performance decreases using fewer threads in a similar fashion for the systems used, although the 8600 GT has 32 processor cores and the 8400M GS 16 cores. A surprise is the marked decrease in speed at 16 threads on all systems.


    Threads    Words   Ops   Repeats   MFLOPS   MFLOPS   MFLOPS  
                                       8600 GT  8400MGS  GTS250

      256     600000    32      100     25153     9136    94109
      128     600000    32      100     24868     9525    91595
       64     600000    32      100     24575     9381    97762   
       32     600000    32      100     21584     7815    78845
       16     600000    32      100     13620     4728    47930




Reliability Testing

The revised version also has facilities to run a reliability/burn-in test. Default operation is the 10M word, 32 operations per word, extra Calculate test, running for one minute with output every 15 seconds, via the simple command CudaMFLOPS1 M. The test is first calibrated so that the loop within the CUDA function runs for less than 1.75 seconds, in order to avoid a timeout error; the number of times this is repeated externally is then calculated (a sketch of this calibration follows the parameter list below). All data is checked for the same numeric result, with the first error value shown and the number of errors. The default number of threads and words can be changed as above and additional parameters are:

Minutes (or M or m or Mins) - Example for below CudaMFLOPS1 Mins 10
Seconds (or S or s or Secs) - for result display rate
FC - to use alternative Shared Memory test.
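A rough sketch of the calibration approach follows - the timing helper and thresholds here are assumptions for illustration, not the benchmark's actual source:

   #include <time.h>

   /* Double the repeat count passed to the CUDA function until one
      call takes around half the 1.75 second limit, so each kernel
      launch stays safely below it. runKernel is a hypothetical
      helper that launches the CUDA function with the given internal
      repeat count and waits for completion. */
   int calibrateRepeats(void (*runKernel)(int reps))
   {
      int reps = 1;
      for (;;)
      {
         clock_t start = clock();
         runKernel(reps);
         double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
         if (secs > 0.875 || reps >= 1000000) return reps;
         reps = reps * 2;
      }
   }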

Below are example results for a 10 minute test on a GeForce 8600 GT. Monitored temperatures are also shown (room temperature 21.5 °C). Graphics temperatures were measured using nVidia System Monitor. This also showed GPU Utilisation 99% to 100%, Bus Utilisation 38%, with CPU Utilisation 60% to 70%. A Shared Memory test produced higher MFLOPS (34500) but slightly lower temperatures at GPU Utilisation 99% and Bus Utilisation 3%.

See also example results below for CudaMFLOPS1 W 50000, T 512, M 1, S 30, FC

Later are results of a 10 minute test on the GeForce GTS 250. Here, temperature and fan speed were monitored using Vtune/TBPanel, supplied with the graphics driver. Recorded temperature increased from 42°C to 72°C after 5 minutes. Following this, it varied between 70°C and 72°C, controlled by the fan speed increasing from the shown 35% to 40%.



 Core 2 Duo 2.4 GHz, Vista 64, GeForce 8600 GT
 ********************************************************

  Command - CudaMFLOPS1 Mins 10

  CUDA MFLOPS Benchmark 1.1 Mon Sep 28 16:22:51 2009

  CUDA devices found 
  Device 0: GeForce 8600 GT  with 4 Processors 32 cores 

  Using 256 Threads

  Calculate      Reliability Test 10 minutes, report every 15 seconds

  Repeat CUDA 155 times at  1.57 seconds. Repeat former 9 times

  Results of all calculations should be 0.936169

  Test  4 Byte  Ops/   Repeat  Seconds   MFLOPS    Errors     First      Value
         Words  Word   Passes                                  Word 

   1  10000000    32     1395   14.164    31517   None found
   2  10000000    32     1395   14.163    31518   None found
   3  10000000    32     1395   14.162    31521   None found
   4  10000000    32     1395   14.162    31520   None found
   5  10000000    32     1395   14.161    31522   None found
   6  10000000    32     1395   14.162    31521   None found
  to 
  40  10000000    32     1395   14.161    31522   None found

  Minutes          0  0.5  1.0  1.5  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
  Temperature °C  47   59   61   62   64   65   66   66   66   66   66   66


 ********************************************************

  Command - CudaMFLOPS1 W 50000, T 512, M 1, S 30, FC

  CUDA MFLOPS Benchmark 1.1 Tue Sep 29 12:03:14 2009

  CUDA devices found 
  Device 0: GeForce 8400M GS  with 2 Processors 16 cores 

  Using 512 Threads

  Shared Memory  Reliability Test 1 minutes, report every 30 seconds

  Repeat CUDA 10163 times at  1.40 seconds. Repeat former 21 times

  Results of all calculations should be 0.351382

  Test  4 Byte  Ops/   Repeat  Seconds   MFLOPS    Errors     First      Value
         Words  Word   Passes                                  Word 

   1     50000    32   213423   29.531    11563   None found
   2     50000    32   213423   29.531    11563   None found

 ********************************************************

  Device 0: GeForce 8600 GT  with 4 Processors 32 cores 
  Repeat CUDA 34576 times at  1.64 seconds. Repeat former 18 times

   1     50000    32   622368   29.501    33754   None found
   2     50000    32   622368   29.505    33749   None found


 Phenom II X4 3.0 GHz, 64-Bit Windows 7, GeForce GTS 250
 ********************************************************

  CUDA MFLOPS Benchmark 1.1 Mon Nov 09 11:11:31 2009

  CUDA devices found 
  Device 0: GeForce GTS 250  with 16 Processors 128 cores 

  Using 256 Threads

  Calculate      Reliability Test 10 minutes, report every 15 seconds

  Repeat CUDA 761 times at  1.52 seconds. Repeat former 9 times

  Results of all calculations should be 0.741250

  Test  4 Byte  Ops/   Repeat  Seconds   MFLOPS    Errors     First      Value
         Words  Word   Passes                                  Word 

   1  10000000    32     6849   13.736   159563   None found
   2  10000000    32     6849   13.732   159605   None found
   3  10000000    32     6849   13.730   159623   None found
   4  10000000    32     6849   13.730   159629   None found
   5  10000000    32     6849   13.734   159575   None found
   6  10000000    32     6849   13.732   159607   None found
  to 
  40  10000000    32     6849   13.753   159362   None found

  Minutes          0  0.5  1.0  1.5  2.0  3.0  4.0  5.0  6.0  7.0  8.0  9.0
  Temperature °C  42   59   63   67   69   71   71   72   72   71   70   71





Errors

As indicated above, the performance tests check each word for consistent values. Following is an example of the error reports, where details of the first ten comparison failures are provided. In developing the benchmark, it was found that the program could finish with some of the data unchanged from the initial value of 0.999999; these checks will show if this happens. Reliability tests provide a count of the number of errors, plus the array position and value of the first wrong word. Examples of CUDA generated error messages are also shown.
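A minimal sketch of such a consistency check - illustrative of the approach, with assumed names, not the benchmark's actual source:

   #include <stdio.h>

   /* Check that all words hold the same value as the first,
      reporting up to the first ten comparison failures. */
   int checkResults(float *x, int words)
   {
      int errors = 0;
      for (int i = 1; i < words; i++)
      {
         if (x[i] != x[0])
         {
            if (errors < 10)
               printf(" word %d  was %9.6f not %9.6f\n", i, x[i], x[0]);
            errors++;
         }
      }
      return errors;
   }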


  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS      First   All
                    Words  Word   Passes                        Results  Same

 Shared Memory   10000000    32      250   2.314890    34559   See later   No

 First Unexpected Results
 Shared Memory   10000000    32      250 word 38528  was  0.010145 not  0.988073
 Shared Memory   10000000    32      250 word 38529  was  0.010145 not  0.988073


 Error display if GPU is used continuously for more than 2 seconds.

 cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file CudaMFLOPS1.cu
 cudaSafeCall() Runtime API error in file < CudaMFLOPS1.cu >, line 232 :
 the launch timed out and was terminated.


 Error display if attempt is made to use more than 512 threads.

 cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file CudaMFLOPS1.cu
 line 225 : invalid configuration argument.




Roy Longbottom, October 2014



The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection