CUDA MFLOPS Benchmarks

Roy Longbottom

Contents

General

Single Precision Log

Double Precision Log

Speed and Numeric Results Comparisons

Summary

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS), with versions for processing single and double precision variables, compiled for both Windows and Linux based PCs.

The arithmetic operations executed are, as used in MP MFLOPS (see MultiThreading Benchmarks.htm), of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. Five different modes of operation are used, with varying levels of data input and output. These combinations demonstrate some worst and best case performance.

The graphics adapter at the centre of this report has a specified maximum speed of 813 GFLOPS single precision (SP) and 33.9 GFLOPS double precision (DP). At 32 operations per data word, the worst measurements were 15.6 GFLOPS SP and 6.4 GFLOPS DP, both worse than the host CPU. Best performance was 404 GFLOPS SP and 24.6 GFLOPS DP, with no communication to the outside world.

Links are provided to download the benchmarks and to reports with many more results.


General

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. Maximum speeds, in terms of billions of floating point operations per second or GFLOPS, can be higher via a graphics processor than from multi-core CPUs. This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions can be executed simultaneously on sections of data from an array. For maximum speeds, the data array has to be large, with few or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache-like memory.
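
The device details reported at the top of the logs below (processors, global memory, shared memory per block and maximum threads per block) can be obtained through the CUDA runtime API. The following is a minimal sketch with illustrative formatting, not the benchmark's own code:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      int count = 0;
      cudaGetDeviceCount(&count);
      printf("CUDA devices found %d\n", count);

      for (int d = 0; d < count; d++)
      {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, d);
          printf("Device %d: %s with %d Processors\n",
                 d, prop.name, prop.multiProcessorCount);
          printf("Global Memory %lu MB, Shared Memory/Block %lu B, Max Threads/Block %d\n",
                 (unsigned long)(prop.totalGlobalMem / (1024 * 1024)),
                 (unsigned long)prop.sharedMemPerBlock, prop.maxThreadsPerBlock);
      }
      return 0;
  }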

The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance by varying the data array size and increasing the number of processing instructions per data access. The benchmarks use nVidia CUDA programming functions that only execute on nVidia graphics hardware with a compatible driver. There are five scenarios, sketched in host code after the list:

  • New Calculations - Copy data to graphics RAM, execute instructions, copy back
    to host RAM [Data in & out]

  • Update Data - Execute further instructions on data in graphics RAM, copy
    back to host RAM [Data out only]

  • Graphics Only Data - Execute further instructions on data in graphics RAM, leave
    it there [Calculate only]

  • Extra Test 1 - Just graphics data, repeat loop in CUDA function [Calculate]

  • Extra Test 2 - Just graphics data, repeat loop in CUDA function but using
    Shared Memory [Shared Memory]
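
A hedged host-side sketch of the first three modes follows. The names calcKernel, hostX and devX are illustrative rather than the benchmark's own identifiers, and the arithmetic kernel itself is sketched after the next paragraph.

  #include <cuda_runtime.h>

  __global__ void calcKernel(float *x, int n, float a, float b, float c,
                             float d, float e, float f);   // body sketched below

  // One pass of each of the three main modes on an n word array. Timing a loop
  // of such passes gives the Seconds column in the logs that follow.
  void onePass(float *hostX, float *devX, int n, int blocks, int threads,
               float a, float b, float c, float d, float e, float f)
  {
      size_t bytes = (size_t)n * sizeof(float);

      // Data in & out: copy data to graphics RAM, execute, copy the result back
      cudaMemcpy(devX, hostX, bytes, cudaMemcpyHostToDevice);
      calcKernel<<<blocks, threads>>>(devX, n, a, b, c, d, e, f);
      cudaMemcpy(hostX, devX, bytes, cudaMemcpyDeviceToHost);

      // Data out only: execute on data already in graphics RAM, copy the result back
      calcKernel<<<blocks, threads>>>(devX, n, a, b, c, d, e, f);
      cudaMemcpy(hostX, devX, bytes, cudaMemcpyDeviceToHost);

      // Calculate only: execute on graphics RAM data and leave it there
      calcKernel<<<blocks, threads>>>(devX, n, a, b, c, d, e, f);
      cudaDeviceSynchronize();   // make sure the work has finished before timing
  }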

These are run at three data sizes, the default 100,000 words repeated 2500 times, 1M words 250 times and 10M words 25 times. The arithmetic operations executed are, as used in MP MFLOPS from MultiThreading Benchmarks.htm, of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
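
Counted as three adds, three multiplies and two combining add or subtract operations, the quoted expression corresponds to the 8 operations per word case; the 2 and 32 operation variants presumably use shorter and longer expressions of the same style. A minimal kernel sketch for it, assuming one thread per array element and the 256 threads per block reported in the logs (again not the benchmark's own source), is:

  __global__ void calcKernel(float *x, int n, float a, float b, float c,
                             float d, float e, float f)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per data word
      if (i < n)
          x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
  }

  // Example launch covering n words with 256 threads per block:
  // calcKernel<<<(n + 255) / 256, 256>>>(devX, n, a, b, c, d, e, f);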

Four Windows and Linux versions are available, with 32-Bit and 64-Bit compilations, using Single and Double Precision floating point numbers. The latest execution files and source code, along with compiling and running instructions, can be downloaded in gigaflops-benchmarks.zip for Windows and linux cuda mflops.tar.gz for Linux.

Detailed notes for installing and setting up nVidia software are included in my reports cuda1.htm, cuda2.htm, cuda3 x64.htm and linux_cuda_mflops.htm, which also contain results and comparisons from tests on various PCs with GeForce graphics adapters.

The next two pages contain log files of single precision and double precision results for a PC with a 3.9 GHz Core i7 CPU, using a 2012 mid range GeForce GTX 650, where the top end card was up to 4 times faster, with one from 2017 being at least 3 times faster than that. The fourth page provides comparisons of 64 bit and 32 bit performance and numeric results, plus some comparisons between Windows and Linux results.

Go To Start


Single Precision Log

Specified maximum single precision speed claim for the GeForce GTX 650 is 812.5 GFLOPS. The benchmark demonstrated a respectable 403.6 GFLOPS. The same calculations, used by MP MFLOPS in MultiThreading Benchmarks.htm, demonstrated a maximum of 178 GFLOPS on the Core i7 used for the CUDA benchmark runs. The procedures used there equate to “Data in & out” here, where the best CUDA speed was 15.6 GFLOPS, slower than the i7 CPU using the old i387 floating point instructions.
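
For reference, the MFLOPS figures in the logs are consistent with MFLOPS = Words * Ops/Word * Repeat Passes / (Seconds * 1000000); for example, the first line below gives 100000 * 2 * 2500 / (1.125333 * 1000000) = 444.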

CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.31 Mon Jun 30 17:34:14 2014

  CUDA devices found 
  Device 0: GeForce GTX 650  with 2 Processors 16 cores 
  Global Memory 1000 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  1.125333      444   0.9295383095741  Yes
 Data out only    100000    2    2500  0.468638     1067   0.9295383095741  Yes
 Calculate only   100000    2    2500  0.142182     3517   0.9295383095741  Yes

 Data in & out   1000000    2     250  0.551554      907   0.9925497770309  Yes
 Data out only   1000000    2     250  0.271324     1843   0.9925497770309  Yes
 Calculate only  1000000    2     250  0.056439     8859   0.9925497770309  Yes

 Data in & out  10000000    2      25  0.505138      990   0.9992496371269  Yes
 Data out only  10000000    2      25  0.267062     1872   0.9992496371269  Yes
 Calculate only 10000000    2      25  0.047600    10504   0.9992496371269  Yes

 Data in & out    100000    8    2500  0.836211     2392   0.9571172595024  Yes
 Data out only    100000    8    2500  0.476396     4198   0.9571172595024  Yes
 Calculate only   100000    8    2500  0.150029    13331   0.9571172595024  Yes

 Data in & out   1000000    8     250  0.558576     3581   0.9955183267593  Yes
 Data out only   1000000    8     250  0.273134     7322   0.9955183267593  Yes
 Calculate only  1000000    8     250  0.058554    34157   0.9955183267593  Yes

 Data in & out  10000000    8      25  0.503466     3972   0.9995489120483  Yes
 Data out only  10000000    8      25  0.268260     7455   0.9995489120483  Yes
 Calculate only 10000000    8      25  0.048742    41032   0.9995489120483  Yes

 Data in & out    100000   32    2500  0.860386     9298   0.8902152180672  Yes
 Data out only    100000   32    2500  0.505322    15831   0.8902152180672  Yes
 Calculate only   100000   32    2500  0.178712    44765   0.8902152180672  Yes

 Data in & out   1000000   32     250  0.560442    14274   0.9880878329277  Yes
 Data out only   1000000   32     250  0.281167    28453   0.9880878329277  Yes
 Calculate only  1000000   32     250  0.065909   121379   0.9880878329277  Yes

 Data in & out  10000000   32      25  0.512846    15599   0.9987964630127  Yes
 Data out only  10000000   32      25  0.269143    29724   0.9987964630127  Yes
 Calculate only 10000000   32      25  0.054317   147283   0.9987964630127  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.018608    26870   0.9992496371269  Yes
 Shared Memory  10000000    2      25  0.006495    76980   0.9992496371269  Yes

 Calculate      10000000    8      25  0.024547    81476   0.9995489120483  Yes
 Shared Memory  10000000    8      25  0.011040   181159   0.9995489120483  Yes

 Calculate      10000000   32      25  0.036803   217371   0.9987964630127  Yes
 Shared Memory  10000000   32      25  0.019822   403590   0.9987964630127  Yes
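
The fastest speeds above come from the Extra Tests, where the repeat loop runs inside the CUDA function and, for the final pair, block local Shared Memory is used. A speculative sketch of such a kernel follows; the benchmark's actual kernel may be organised differently, but the sketch illustrates the principle of staging one word per thread in on-chip shared memory, repeating the arithmetic there and writing back to graphics RAM once.

  __global__ void sharedCalcKernel(float *x, int n, int repeats,
                                   float a, float b, float c,
                                   float d, float e, float f)
  {
      __shared__ float tile[256];              // one word per thread, 256 threads per block
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
      {
          tile[threadIdx.x] = x[i];            // single read from graphics RAM
          for (int r = 0; r < repeats; r++)    // repeat loop inside the CUDA function
              tile[threadIdx.x] = (tile[threadIdx.x] + a) * b
                                - (tile[threadIdx.x] + c) * d
                                + (tile[threadIdx.x] + e) * f;
          x[i] = tile[threadIdx.x];            // single write back to graphics RAM
      }
  }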
  
Go To Start


Double Precision Log

Specified maximum double precision speed claim for the GeForce GTX 650 is 33.9 GFLOPS. The benchmark demonstrated up to 24.6 GFLOPS, but only 6.4 GFLOPS with “Data in & out”, which can easily be beaten by the main CPU.
  
  CUDA 3.1 x64 Double Precision MFLOPS Benchmark 1.31 Mon Jun 30 17:33:45 2014
  
  CUDA devices found 
  Device 0: GeForce GTX 650  with 2 Processors 16 cores 
  Global Memory 1000 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            8 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  1.439831      347   0.9294744580218  Yes
 Data out only    100000    2    2500  0.748076      668   0.9294744580218  Yes
 Calculate only   100000    2    2500  0.161219     3101   0.9294744580218  Yes

 Data in & out   1000000    2     250  1.026529      487   0.9925431921162  Yes
 Data out only   1000000    2     250  0.505108      990   0.9925431921162  Yes
 Calculate only  1000000    2     250  0.075140     6654   0.9925431921162  Yes

 Data in & out  10000000    2      25  0.969507      516   0.9992492055877  Yes
 Data out only  10000000    2      25  0.493658     1013   0.9992492055877  Yes
 Calculate only 10000000    2      25  0.066215     7551   0.9992492055877  Yes

 Data in & out    100000    8    2500  1.461447     1369   0.9571642109917  Yes
 Data out only    100000    8    2500  0.806252     2481   0.9571642109917  Yes
 Calculate only   100000    8    2500  0.219439     9114   0.9571642109917  Yes

 Data in & out   1000000    8     250  1.080003     1852   0.9955252302690  Yes
 Data out only   1000000    8     250  0.556391     3595   0.9955252302690  Yes
 Calculate only  1000000    8     250  0.126047    15867   0.9955252302690  Yes

 Data in & out  10000000    8      25  1.018470     1964   0.9995496465632  Yes
 Data out only  10000000    8      25  0.543857     3677   0.9995496465632  Yes
 Calculate only 10000000    8      25  0.116258    17203   0.9995496465632  Yes

 Data in & out    100000   32    2500  1.738967     4600   0.8903768345465  Yes
 Data out only    100000   32    2500  1.066385     7502   0.8903768345465  Yes
 Calculate only   100000   32    2500  0.478614    16715   0.8903768345465  Yes

 Data in & out   1000000   32     250  1.321268     6055   0.9881014965491  Yes
 Data out only   1000000   32     250  0.793547    10081   0.9881014965491  Yes
 Calculate only  1000000   32     250  0.363371    22016   0.9881014965491  Yes

 Data in & out  10000000   32      25  1.251858     6391   0.9987993043723  Yes
 Data out only  10000000   32      25  0.784260    10201   0.9987993043723  Yes
 Calculate only 10000000   32      25  0.351443    22763   0.9987993043723  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.051005     9803   0.9992492055877  Yes
 Shared Memory  10000000    2      25  0.030731    16270   0.9992492055877  Yes

 Calculate      10000000    8      25  0.108356    18458   0.9995496465632  Yes
 Shared Memory  10000000    8      25  0.089620    22316   0.9995496465632  Yes

 Calculate      10000000   32      25  0.342569    23353   0.9987993043723  Yes
 Shared Memory  10000000   32      25  0.325699    24563   0.9987993043723  Yes

  
Go To Start


Speed and Numeric Results Comparisons

The following are single precision results for 32 bit and 64 bit Windows compilations and 64 bit Linux tests. There are some variations in recorded performance, while other results are effectively the same. Multiple runs would probably be needed to identify any real differences.

All array elements are initialised with the same value of nearly 1.0. Subsequent calculations, which are the same for each element, reduce this value. The final numeric results should be identical for a given combination of operations per word, number of repeat passes and precision (single or double), as can be checked against the extra tests at 10M words. The program checks that all results are numerically identical and reports any errors.
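
A minimal sketch of that kind of check, with assumed names and output formatting rather than the benchmark's own code, is:

  #include <cstdio>

  // After the timed passes every element should hold the same value as the first
  // one, so an exact comparison is appropriate.
  int checkResults(const float *hostX, int n)
  {
      int errors = 0;
      for (int i = 1; i < n; i++)
          if (hostX[i] != hostX[0])
              errors++;
      printf(" First Results %17.13f  All Same %s\n",
             hostX[0], errors == 0 ? "Yes" : "No");
      return errors;
  }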


                                  Windows 32 bit     Windows 64 bit     Linux 64 bit
Test            4 Byte Ops Repeat MFLOPS    Results  MFLOPS    Results  MFLOPS    Results
                 Words /Wd Passes

Data in & out    100000  2  2500     594  0.9295383     444  0.9295383     597  0.9295383
Data out only    100000  2  2500    1066  0.9295383    1067  0.9295383    1283  0.9295383
Calculate only   100000  2  2500    3890  0.9295383    3517  0.9295383    5834  0.9295383

Data in & out   1000000  2   250     886  0.9925498     907  0.9925498    1133  0.9925498
Data out only   1000000  2   250    1812  0.9925498    1843  0.9925498    2183  0.9925498
Calculate only  1000000  2   250    9272  0.9925498    8859  0.9925498    9666  0.9925498

Data in & out  10000000  2    25     975  0.9992496     990  0.9992496    1355  0.9992496
Data out only  10000000  2    25    1833  0.9992496    1872  0.9992496    2485  0.9992496
Calculate only 10000000  2    25   10846  0.9992496   10504  0.9992496   10411  0.9992496

Data in & out    100000  8  2500    2361  0.9571173    2392  0.9571173    2823  0.9571173
Data out only    100000  8  2500    4164  0.9571173    4198  0.9571173    5152  0.9571173
Calculate only   100000  8  2500   14567  0.9571173   13331  0.9571173   21679  0.9571173

Data in & out   1000000  8   250    3529  0.9955183    3581  0.9955183    4178  0.9955183
Data out only   1000000  8   250    7203  0.9955183    7322  0.9955183    8651  0.9955183
Calculate only  1000000  8   250   35912  0.9955183   34157  0.9955183   37138  0.9955183

Data in & out  10000000  8    25    3900  0.9995489    3972  0.9995489    5396  0.9995489
Data out only  10000000  8    25    7291  0.9995489    7455  0.9995489    9882  0.9995489
Calculate only 10000000  8    25   42235  0.9995489   41032  0.9995489   40599  0.9995489

Data in & out    100000 32  2500    9185  0.8902152    9298  0.8902152   11034  0.8902152
Data out only    100000 32  2500   15696  0.8902152   15831  0.8902152   19628  0.8902152
Calculate only   100000 32  2500   48404  0.8902152   44765  0.8902152   70679  0.8902152

Data in & out   1000000 32   250   13954  0.9880878   14274  0.9880878   16069  0.9880878
Data out only   1000000 32   250   27974  0.9880878   28453  0.9880878   30597  0.9880878
Calculate only  1000000 32   250  125372  0.9880878  121379  0.9880878  133042  0.9880878

Data in & out  10000000 32    25   15494  0.9987965   15599  0.9987965   21283  0.9987965
Data out only  10000000 32    25   29055  0.9987965   29724  0.9987965   38528  0.9987965
Calculate only 10000000 32    25  150111  0.9987965  147283  0.9987965  146204  0.9987965

Extra tests - loop in main CUDA Function

Calculate      10000000  2    25   30115  0.9992496   26870  0.9992496   27613  0.9992496
Shared Memory  10000000  2    25   77105  0.9992496   76980  0.9992496   64308  0.9992496

Calculate      10000000  8    25  110900  0.9995489   81476  0.9995489   79671  0.9995489
Shared Memory  10000000  8    25  230804  0.9995489  181159  0.9995489  229241  0.9995489

Calculate      10000000 32    25  254474  0.9987965  217371  0.9987965  219797  0.9987965
Shared Memory  10000000 32    25  415035  0.9987965  403590  0.9987965  412070  0.9987965
  

Go To Start