Contents
General
Single Precision Log
Double Precision Log
Speed and Numeric Results Comparisons
Summary
CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use for executing arithmetic instructions on numerous processing elements simultaneously.
The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS), with versions for processing single and double precision variables, compiled for both Windows and Linux based PCs.
The arithmetic operations executed are, as used in MP MFLOPS, see -
MultiThreading Benchmarks.htm,
of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 adds or subtracts and multiplies on each data element.
Five different modes of operation are used, with varying levels of data input and output. These combinations demonstrate some worst and best case performance.
The graphics adapter at the centre of this report has a specified maximum speed of 813 GFLOPS single precision (SP) and 33.9 GFLOPS double precision (DP). At 32 operations per data word, the worst measurements were 15.6 GFLOPS SP and 6.4 GFLOPS DP, both worse than the host CPU. Best performance was 404 GFLOPS SP and 24.6 GFLOPS DP, with no communication to the outside world.
Links are provided to download the benchmarks and to reports with many more results.
General
CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use for executing arithmetic instructions on numerous processing elements simultaneously.
Maximum speeds, in terms of billions of floating point operations per second or GFLOPS, can be higher via a graphics processor than from multi-core CPUs.
This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions can be executed simultaneously on sections of data from an array. For maximum speeds, the data array has to be large, with few or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed, cache-like memory.
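As an illustration of that facility (a generic sketch, not code from the benchmark), a CUDA kernel can stage each block's slice of the data array in fast on-chip shared memory before operating on it; the kernel name, block size and arithmetic below are purely hypothetical:

    // Hypothetical sketch only: stage one block's slice of x[] in on-chip
    // __shared__ memory, operate on it there, then write back to graphics RAM.
    #define THREADS 256

    __global__ void sharedMemSketch(float *x, int n)
    {
        __shared__ float s[THREADS];                  // fast on-chip store
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            s[threadIdx.x] = x[i];                            // load from graphics RAM
            s[threadIdx.x] = (s[threadIdx.x] + 0.5f) * 0.5f;  // placeholder arithmetic
            x[i] = s[threadIdx.x];                            // store result back
        }
    }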
The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance using varying data array sizes and increasing numbers of processing instructions per data access. The benchmarks use nVidia CUDA programming functions that only execute on nVidia graphics hardware with a compatible driver. There are five scenarios (a host-side sketch of the data movement follows the data sizes below):
- New Calculations - Copy data to graphics RAM, execute instructions, copy back to host RAM [Data in & out]
- Update Data - Execute further instructions on data in graphics RAM, copy back to host RAM [Data out only]
- Graphics Only Data - Execute further instructions on data in graphics RAM, leave it there [Calculate only]
- Extra Test 1 - Just graphics data, repeat loop in CUDA function [Calculate]
- Extra Test 2 - Just graphics data, repeat loop in CUDA function but using Shared Memory [Shared Memory]
These are run at three different data sizes: the default of 100,000 words repeated 2500 times, 1M words repeated 250 times and 10M words repeated 25 times.
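At the default size with 32 operations per word, that amounts to roughly 100,000 x 2500 x 32 = 8 billion floating point operations per test, assuming MFLOPS are derived from operations executed divided by elapsed time. The outline below is a hypothetical host-side sketch, not the benchmark source, of how the New Calculations and Graphics Only Data scenarios could differ in data movement; ops_kernel, the block size and the constants are illustrative stand-ins:

    // Hypothetical sketch of the data movement patterns, not the benchmark
    // source code. ops_kernel is a stand-in for the real arithmetic function.
    #include <cuda_runtime.h>
    #include <cstdlib>

    __global__ void ops_kernel(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = (x[i] + 0.999999f) * 0.499999f;   // placeholder arithmetic
    }

    int main()
    {
        const int n = 100000;                        // default data size in words
        const int repeats = 2500;                    // default repeat passes
        const size_t bytes = n * sizeof(float);

        float *hostX = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) hostX[i] = 0.999999f;   // all words start near 1.0

        float *devX;
        cudaMalloc((void **)&devX, bytes);

        // New Calculations [Data in & out]: copy in, calculate, copy results back
        for (int pass = 0; pass < repeats; pass++)
        {
            cudaMemcpy(devX, hostX, bytes, cudaMemcpyHostToDevice);
            ops_kernel<<<(n + 255) / 256, 256>>>(devX, n);
            cudaMemcpy(hostX, devX, bytes, cudaMemcpyDeviceToHost);
        }

        // Graphics Only Data [Calculate only]: data stays in graphics RAM throughout
        for (int pass = 0; pass < repeats; pass++)
        {
            ops_kernel<<<(n + 255) / 256, 256>>>(devX, n);
        }
        cudaDeviceSynchronize();                     // wait for queued kernels to finish

        cudaFree(devX);
        free(hostX);
        return 0;
    }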
The arithmetic operations executed are, as used in MP MFLOPS from
MultiThreading Benchmarks.htm,
of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
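As a sketch of the arithmetic itself (again illustrative rather than the benchmark source), the 8 operations per word case could be written as a CUDA kernel like the one below, with the 2 operation case reduced to (x[i] + a) * b and the 32 operation case extending the same expression; the kernel name and launch configuration are assumptions:

    // Illustrative kernel for 8 operations per data word:
    // 5 adds or subtracts and 3 multiplies on each element.
    __global__ void ops8_kernel(float *x, int n, float a, float b, float c,
                                float d, float e, float f)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
        {
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
        }
    }

    // Launch with one thread per data word (block size of 256 is illustrative):
    // ops8_kernel<<<(n + 255) / 256, 256>>>(devX, n, a, b, c, d, e, f);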
Four Windows and Linux versions are available, with 32-Bit and 64-Bit compilations, using Single and Double Precision floating point numbers. The latest execution files and source code, along with compiling and running instructions, can be downloaded in
gigaflops-benchmarks.zip for Windows
and
linux cuda mflops.tar.gz.
Detailed notes for installing and setting up nVidia software are included in my reports -
cuda1.htm,
cuda2.htm,
cuda3 x64.htm
and
linux_cuda_mflops.htm,
which also contain results and comparisons from tests on various PCs with GeForce graphics adapters.
The next two pages contain log files of single precision and double precision results for a PC with a 3.9 GHz Core i7 CPU, using a 2012 mid range GeForce GTX 650, where the top end card was up to 4 times faster, and one from 2017 at least 3 times faster than that.
The fourth page provides comparisons of 64 bit and 32 bit performance and numeric results, plus some comparing Windows and Linux benchmark results.
All array elements are initialised with the same value of nearly 1.0. Subsequent calculations, which are the same for each element, reduce this value. The final numeric results should be identical for a given combination of operations per word, number of repeat passes and arithmetic precision, single or double (check the Extra Tests at 10M words). The program checks all results for identical numeric values and reports any errors.
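A minimal sketch of such a check, assuming as above that every word should finish with the same value, might compare each element against the first and report any differences (the function name and message format are hypothetical):

    // Hypothetical host-side check: with identical starting values and identical
    // operations per element, every word should hold the same final result.
    #include <cstdio>

    int checkResults(const float *x, int n)
    {
        int errors = 0;
        float expected = x[0];                 // first word used as reference value
        for (int i = 1; i < n; i++)
        {
            if (x[i] != expected) errors++;    // results should match exactly
        }
        if (errors > 0)
            printf("ERROR: %d words differ from first result %12.9f\n",
                   errors, expected);
        return errors;
    }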