CUDA MFLOPS Benchmarks

Roy Longbottom

Contents

General

Single Precision Log

Double Precision Log

Speed and Numeric Results Comparisons

Summary

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS), with versions for processing single and double precision variables, compiled for both Windows and Linux based PCs.

The arithmetic operations executed are, as used in MP MFLOPS (see MultiThreading Benchmarks.htm), of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. Five different modes of operation are used, with varying levels of data input and output. These combinations demonstrate some worst and best case performance.

The graphics adapter at the centre of this report has a specified maximum speed of 813 GFLOPS single precision (SP) and 33.9 GFLOPS double precision (DP). At 32 operations per data word, the worst measurements were 15.6 GFLOPS SP and 6.4 GFLOPS DP, both worse than the host CPU. Best performance was 404 GFLOPS SP and 24.6 GFLOPS DP, with no communication to the outside world.

Links are provided to download the benchmarks and to reports with many more results.


General

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. Maximum speeds, in terms of billions of floating point operations per second or GFLOPS, can be higher via a graphics processor than from multi-core CPUs. This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions can be executed simultaneously on sections of data from an array. For maximum speeds, the data array has to be large, with few or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache-like memory.
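
The device details reported at the top of the logs below (processors, global memory, shared memory per block and maximum threads per block) can be obtained through the CUDA runtime API. The following is a minimal sketch with illustrative formatting, not the benchmark's own code:

  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      int count = 0;
      cudaGetDeviceCount(&count);
      printf("CUDA devices found %d\n", count);

      for (int d = 0; d < count; d++)
      {
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, d);
          printf("Device %d: %s with %d Processors\n",
                 d, prop.name, prop.multiProcessorCount);
          printf("Global Memory %lu MB, Shared Memory/Block %lu B, Max Threads/Block %d\n",
                 (unsigned long)(prop.totalGlobalMem / (1024 * 1024)),
                 (unsigned long)prop.sharedMemPerBlock, prop.maxThreadsPerBlock);
      }
      return 0;
  }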

The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance by varying the data array size and increasing the number of processing instructions per data access. The benchmarks use nVidia CUDA programming functions that only execute on nVidia graphics hardware with a compatible driver. There are five scenarios, sketched in host code after the list:

  • New Calculations - Copy data to graphics RAM, execute instructions, copy back
    to host RAM [Data in & out]

  • Update Data - Execute further instructions on data in graphics RAM, copy
    back to host RAM [Data out only]

  • Graphics Only Data - Execute further instructions on data in graphics RAM, leave
    it there [Calculate only]

  • Extra Test 1 - Just graphics data, repeat loop in CUDA function [Calculate]

  • Extra Test 2 - Just graphics data, repeat loop in CUDA function but using
    Shared Memory [Shared Memory]
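
A hedged host-side sketch of the first three modes follows. The names calcKernel, hostX and devX are illustrative rather than the benchmark's own identifiers, and the arithmetic kernel itself is sketched after the next paragraph.

  #include <cuda_runtime.h>

  __global__ void calcKernel(float *x, int n, float a, float b, float c,
                             float d, float e, float f);   // body sketched below

  // One pass of each of the three main modes on an n word array. Timing a loop
  // of such passes gives the Seconds column in the logs that follow.
  void onePass(float *hostX, float *devX, int n, int blocks, int threads,
               float a, float b, float c, float d, float e, float f)
  {
      size_t bytes = (size_t)n * sizeof(float);

      // Data in & out: copy data to graphics RAM, execute, copy the result back
      cudaMemcpy(devX, hostX, bytes, cudaMemcpyHostToDevice);
      calcKernel<<<blocks, threads>>>(devX, n, a, b, c, d, e, f);
      cudaMemcpy(hostX, devX, bytes, cudaMemcpyDeviceToHost);

      // Data out only: execute on data already in graphics RAM, copy the result back
      calcKernel<<<blocks, threads>>>(devX, n, a, b, c, d, e, f);
      cudaMemcpy(hostX, devX, bytes, cudaMemcpyDeviceToHost);

      // Calculate only: execute on graphics RAM data and leave it there
      calcKernel<<<blocks, threads>>>(devX, n, a, b, c, d, e, f);
      cudaDeviceSynchronize();   // make sure the work has finished before timing
  }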

These are run at three data sizes, the default 100,000 words repeated 2500 times, 1M words 250 times and 10M words 25 times. The arithmetic operations executed are, as used in MP MFLOPS from MultiThreading Benchmarks.htm, of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
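
Counted as three adds, three multiplies and two combining add or subtract operations, the quoted expression corresponds to the 8 operations per word case; the 2 and 32 operation variants presumably use shorter and longer expressions of the same style. A minimal kernel sketch for it, assuming one thread per array element and the 256 threads per block reported in the logs (again not the benchmark's own source), is:

  __global__ void calcKernel(float *x, int n, float a, float b, float c,
                             float d, float e, float f)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per data word
      if (i < n)
          x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
  }

  // Example launch covering n words with 256 threads per block:
  // calcKernel<<<(n + 255) / 256, 256>>>(devX, n, a, b, c, d, e, f);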

Four Windows and Linux versions are available, with 32-Bit and 64-Bit compilations, using Single and Double Precision floating point numbers. The latest execution files and source code, along with compiling and running instructions, can be downloaded in gigaflops-benchmarks.zip for Windows and linux cuda mflops.tar.gz for Linux.

Detailed notes for installing and setting up nVidia software are included in my reports cuda1.htm, cuda2.htm, cuda3 x64.htm and linux_cuda_mflops.htm, which also contain results and comparisons from tests on various PCs with GeForce graphics adapters.

The next two pages contain log files of single precision and double precision results for a PC with a 3.9 GHz Core i7 CPU, using a 2012 mid range GeForce GTX 650, where the top end card was up to 4 times faster, with one from 2017 being at least 3 times faster than that. The fourth page provides comparisons of 64 bit and 32 bit performance and numeric results, plus some comparisons between Windows and Linux results.

Go To Start


Single Precision Log

Specified maximum single precision speed claim for the GeForce GTX 650 is 812.5 GFLOPS. The benchmark demonstrated a respectable 403.6 GFLOPS. The same calculations, used by MP MFLOPS in MultiThreading Benchmarks.htm, demonstrated a maximum of 178 GFLOPS on the Core i7 used for the CUDA benchmark runs. The procedures used there equate to “Data in & out” here, where the best CUDA speed was 15.6 GFLOPS, slower than the i7 CPU using the old i387 floating point instructions.
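
For reference, the MFLOPS figures in the logs are consistent with MFLOPS = Words * Ops/Word * Repeat Passes / (Seconds * 1000000); for example, the first line below gives 100000 * 2 * 2500 / (1.125333 * 1000000) = 444.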

CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.31 Mon Jun 30 17:34:14 2014

  CUDA devices found 
  Device 0: GeForce GTX 650  with 2 Processors 16 cores 
  Global Memory 1000 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  1.125333      444   0.9295383095741  Yes
 Data out only    100000    2    2500  0.468638     1067   0.9295383095741  Yes
 Calculate only   100000    2    2500  0.142182     3517   0.9295383095741  Yes

 Data in & out   1000000    2     250  0.551554      907   0.9925497770309  Yes
 Data out only   1000000    2     250  0.271324     1843   0.9925497770309  Yes
 Calculate only  1000000    2     250  0.056439     8859   0.9925497770309  Yes

 Data in & out  10000000    2      25  0.505138      990   0.9992496371269  Yes
 Data out only  10000000    2      25  0.267062     1872   0.9992496371269  Yes
 Calculate only 10000000    2      25  0.047600    10504   0.9992496371269  Yes

 Data in & out    100000    8    2500  0.836211     2392   0.9571172595024  Yes
 Data out only    100000    8    2500  0.476396     4198   0.9571172595024  Yes
 Calculate only   100000    8    2500  0.150029    13331   0.9571172595024  Yes

 Data in & out   1000000    8     250  0.558576     3581   0.9955183267593  Yes
 Data out only   1000000    8     250  0.273134     7322   0.9955183267593  Yes
 Calculate only  1000000    8     250  0.058554    34157   0.9955183267593  Yes

 Data in & out  10000000    8      25  0.503466     3972   0.9995489120483  Yes
 Data out only  10000000    8      25  0.268260     7455   0.9995489120483  Yes
 Calculate only 10000000    8      25  0.048742    41032   0.9995489120483  Yes

 Data in & out    100000   32    2500  0.860386     9298   0.8902152180672  Yes
 Data out only    100000   32    2500  0.505322    15831   0.8902152180672  Yes
 Calculate only   100000   32    2500  0.178712    44765   0.8902152180672  Yes

 Data in & out   1000000   32     250  0.560442    14274   0.9880878329277  Yes
 Data out only   1000000   32     250  0.281167    28453   0.9880878329277  Yes
 Calculate only  1000000   32     250  0.065909   121379   0.9880878329277  Yes

 Data in & out  10000000   32      25  0.512846    15599   0.9987964630127  Yes
 Data out only  10000000   32      25  0.269143    29724   0.9987964630127  Yes
 Calculate only 10000000   32      25  0.054317   147283   0.9987964630127  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.018608    26870   0.9992496371269  Yes
 Shared Memory  10000000    2      25  0.006495    76980   0.9992496371269  Yes

 Calculate      10000000    8      25  0.024547    81476   0.9995489120483  Yes
 Shared Memory  10000000    8      25  0.011040   181159   0.9995489120483  Yes

 Calculate      10000000   32      25  0.036803   217371   0.9987964630127  Yes
 Shared Memory  10000000   32      25  0.019822   403590   0.9987964630127  Yes
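
The fastest speeds above come from the Extra Tests, where the repeat loop runs inside the CUDA function and, for the final pair, block local Shared Memory is used. A speculative sketch of such a kernel follows; the benchmark's actual kernel may be organised differently, but the sketch illustrates the principle of staging one word per thread in on-chip shared memory, repeating the arithmetic there and writing back to graphics RAM once.

  __global__ void sharedCalcKernel(float *x, int n, int repeats,
                                   float a, float b, float c,
                                   float d, float e, float f)
  {
      __shared__ float tile[256];              // one word per thread, 256 threads per block
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
      {
          tile[threadIdx.x] = x[i];            // single read from graphics RAM
          for (int r = 0; r < repeats; r++)    // repeat loop inside the CUDA function
              tile[threadIdx.x] = (tile[threadIdx.x] + a) * b
                                - (tile[threadIdx.x] + c) * d
                                + (tile[threadIdx.x] + e) * f;
          x[i] = tile[threadIdx.x];            // single write back to graphics RAM
      }
  }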
  
Go To Start


Double Precision Log

Specified maximum double precision speed claim for the GeForce GTX 650 is 33.9 GFLOPS. The benchmark demonstrated up to 24.6 GFLOPS, but only 6.4 GFLOPS with “Data in & out”, which can easily be beaten by the main CPU.
  
  CUDA 3.1 x64 Double Precision MFLOPS Benchmark 1.31 Mon Jun 30 17:33:45 2014
  
  CUDA devices found 
  Device 0: GeForce GTX 650  with 2 Processors 16 cores 
  Global Memory 1000 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

  Using 256 Threads

  Test            8 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  1.439831      347   0.9294744580218  Yes
 Data out only    100000    2    2500  0.748076      668   0.9294744580218  Yes
 Calculate only   100000    2    2500  0.161219     3101   0.9294744580218  Yes

 Data in & out   1000000    2     250  1.026529      487   0.9925431921162  Yes
 Data out only   1000000    2     250  0.505108      990   0.9925431921162  Yes
 Calculate only  1000000    2     250  0.075140     6654   0.9925431921162  Yes

 Data in & out  10000000    2      25  0.969507      516   0.9992492055877  Yes
 Data out only  10000000    2      25  0.493658     1013   0.9992492055877  Yes
 Calculate only 10000000    2      25  0.066215     7551   0.9992492055877  Yes

 Data in & out    100000    8    2500  1.461447     1369   0.9571642109917  Yes
 Data out only    100000    8    2500  0.806252     2481   0.9571642109917  Yes
 Calculate only   100000    8    2500  0.219439     9114   0.9571642109917  Yes

 Data in & out   1000000    8     250  1.080003     1852   0.9955252302690  Yes
 Data out only   1000000    8     250  0.556391     3595   0.9955252302690  Yes
 Calculate only  1000000    8     250  0.126047    15867   0.9955252302690  Yes

 Data in & out  10000000    8      25  1.018470     1964   0.9995496465632  Yes
 Data out only  10000000    8      25  0.543857     3677   0.9995496465632  Yes
 Calculate only 10000000    8      25  0.116258    17203   0.9995496465632  Yes

 Data in & out    100000   32    2500  1.738967     4600   0.8903768345465  Yes
 Data out only    100000   32    2500  1.066385     7502   0.8903768345465  Yes
 Calculate only   100000   32    2500  0.478614    16715   0.8903768345465  Yes

 Data in & out   1000000   32     250  1.321268     6055   0.9881014965491  Yes
 Data out only   1000000   32     250  0.793547    10081   0.9881014965491  Yes
 Calculate only  1000000   32     250  0.363371    22016   0.9881014965491  Yes

 Data in & out  10000000   32      25  1.251858     6391   0.9987993043723  Yes
 Data out only  10000000   32      25  0.784260    10201   0.9987993043723  Yes
 Calculate only 10000000   32      25  0.351443    22763   0.9987993043723  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.051005     9803   0.9992492055877  Yes
 Shared Memory  10000000    2      25  0.030731    16270   0.9992492055877  Yes

 Calculate      10000000    8      25  0.108356    18458   0.9995496465632  Yes
 Shared Memory  10000000    8      25  0.089620    22316   0.9995496465632  Yes

 Calculate      10000000   32      25  0.342569    23353   0.9987993043723  Yes
 Shared Memory  10000000   32      25  0.325699    24563   0.9987993043723  Yes

  
Go To Start


Speed and Numeric Results Comparisons

The following are single precision results for 32 bit and 64 bit Windows compilations and 64 bit Linux tests. There are some variations in recorded performance, while other results are effectively the same. Multiple runs would probably be needed to identify any real differences.

All array elements are initialised with the same value of nearly 1.0. Subsequent calculations, which are the same for each element, reduce this value. The final numeric results should be identical for a given combination of operations per word, number of repeat passes and precision (single or double), as can be checked against the extra tests at 10M words. The program checks that all results are numerically identical and reports any errors.
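
A minimal sketch of that kind of check, with assumed names and output formatting rather than the benchmark's own code, is:

  #include <cstdio>

  // After the timed passes every element should hold the same value as the first
  // one, so an exact comparison is appropriate.
  int checkResults(const float *hostX, int n)
  {
      int errors = 0;
      for (int i = 1; i < n; i++)
          if (hostX[i] != hostX[0])
              errors++;
      printf(" First Results %17.13f  All Same %s\n",
             hostX[0], errors == 0 ? "Yes" : "No");
      return errors;
  }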


                                  Windows 32 bit     Windows 64 bit     Linux 64 bit
Test            4 Byte Ops Repeat MFLOPS    Results  MFLOPS    Results  MFLOPS    Results
                 Words /Wd Passes

Data in & out    100000  2  2500     594  0.9295383     444  0.9295383     597  0.9295383
Data out only    100000  2  2500    1066  0.9295383    1067  0.9295383    1283  0.9295383
Calculate only   100000  2  2500    3890  0.9295383    3517  0.9295383    5834  0.9295383

Data in & out   1000000  2   250     886  0.9925498     907  0.9925498    1133  0.9925498
Data out only   1000000  2   250    1812  0.9925498    1843  0.9925498    2183  0.9925498
Calculate only  1000000  2   250    9272  0.9925498    8859  0.9925498    9666  0.9925498

Data in & out  10000000  2    25     975  0.9992496     990  0.9992496    1355  0.9992496
Data out only  10000000  2    25    1833  0.9992496    1872  0.9992496    2485  0.9992496
Calculate only 10000000  2    25   10846  0.9992496   10504  0.9992496   10411  0.9992496

Data in & out    100000  8  2500    2361  0.9571173    2392  0.9571173    2823  0.9571173
Data out only    100000  8  2500    4164  0.9571173    4198  0.9571173    5152  0.9571173
Calculate only   100000  8  2500   14567  0.9571173   13331  0.9571173   21679  0.9571173

Data in & out   1000000  8   250    3529  0.9955183    3581  0.9955183    4178  0.9955183
Data out only   1000000  8   250    7203  0.9955183    7322  0.9955183    8651  0.9955183
Calculate only  1000000  8   250   35912  0.9955183   34157  0.9955183   37138  0.9955183

Data in & out  10000000  8    25    3900  0.9995489    3972  0.9995489    5396  0.9995489
Data out only  10000000  8    25    7291  0.9995489    7455  0.9995489    9882  0.9995489
Calculate only 10000000  8    25   42235  0.9995489   41032  0.9995489   40599  0.9995489

Data in & out    100000 32  2500    9185  0.8902152    9298  0.8902152   11034  0.8902152
Data out only    100000 32  2500   15696  0.8902152   15831  0.8902152   19628  0.8902152
Calculate only   100000 32  2500   48404  0.8902152   44765  0.8902152   70679  0.8902152

Data in & out   1000000 32   250   13954  0.9880878   14274  0.9880878   16069  0.9880878
Data out only   1000000 32   250   27974  0.9880878   28453  0.9880878   30597  0.9880878
Calculate only  1000000 32   250  125372  0.9880878  121379  0.9880878  133042  0.9880878

Data in & out  10000000 32    25   15494  0.9987965   15599  0.9987965   21283  0.9987965
Data out only  10000000 32    25   29055  0.9987965   29724  0.9987965   38528  0.9987965
Calculate only 10000000 32    25  150111  0.9987965  147283  0.9987965  146204  0.9987965

Extra tests - loop in main CUDA Function

Calculate      10000000  2    25   30115  0.9992496   26870  0.9992496   27613  0.9992496
Shared Memory  10000000  2    25   77105  0.9992496   76980  0.9992496   64308  0.9992496

Calculate      10000000  8    25  110900  0.9995489   81476  0.9995489   79671  0.9995489
Shared Memory  10000000  8    25  230804  0.9995489  181159  0.9995489  229241  0.9995489

Calculate      10000000 32    25  254474  0.9987965  217371  0.9987965  219797  0.9987965
Shared Memory  10000000 32    25  415035  0.9987965  403590  0.9987965  412070  0.9987965
  

Go To Start