CUDA GPU Double Precision Benchmark
Summary
CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously.
Maximum speeds, in terms of billions of floating point operations per second (GFLOPS), can be higher on a laptop graphics processor than on CPUs such as dual core processors.
This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions can be executed simultaneously on sections of data from a data array. For maximum speeds, the data array has to be large, with few or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache-like memory.
The CudaMFLOPS1 benchmark exploits multiple registers, large data sizes and, now, the fast memory. The intention is to demonstrate poor performance as well as high speed operation. Originally, all tests comprised three sets of measurements, but additional tests are now included:
- New Calculations - Copy data to graphics RAM, execute instructions, copy back to host RAM
- Update Data - Execute further instructions on data in graphics RAM, copy back to host RAM
- Graphics Only Data - Execute further instructions on data in graphics RAM, leave it there
- Extra Test 1 - Just graphics data, repeat loop in CUDA function
- Extra Test 2 - Just graphics data, repeat loop in CUDA function but using Shared Memory
These are run at three different data sizes, with defaults of 100,000 words repeated 2500 times, 1M words repeated 250 times and 10M words repeated 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds, subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
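As an illustration of the 8-operation form quoted above, the following is a minimal host-side C++ sketch of one pass over the data. It is not the benchmark's CUDA kernel: the function name `updatePass8` is hypothetical, and the constants a to f are passed in by the caller because their real values are not given here.

```cpp
#include <cstddef>
#include <vector>

// Host-side stand-in for one pass of the 8-operation test.
// Illustrative only: this is a plain CPU loop, not the CUDA kernel,
// and the constants a..f are supplied by the caller.
void updatePass8(std::vector<double>& x,
                 double a, double b, double c,
                 double d, double e, double f)
{
    for (std::size_t i = 0; i < x.size(); ++i) {
        // 3 adds inside the brackets, 3 multiplies, then 1 subtract
        // and 1 add to combine the terms: 8 operations per word.
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
    }
}
```

On a GPU, each thread would typically handle one value of i instead of looping, but the arithmetic per word is the same.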
This benchmark is a simple conversion from the original single precision version to use double precision numbers.
For more details and single precision results see CUDA1.htm.
The executable file and source code can be downloaded via CudaMflops.zip.
No installation is necessary - Extract All and click on CudaMFLOPS1DP.exe but see ReadMe.txt first. The ZIP file also includes the CUDA C++ source code.
As might be expected, double precision speed was generally at most half that of the single precision version for New Calculations or Update Data, where data is copied to and from the graphics card, or from it only. One surprise was that, using Graphics Only Data, double precision was up to 11.5 times slower, in tests where graphics memory speed would be a major influence. The other surprise was that, using Shared Memory (Extra Test 2), double precision speeds were nearly as fast as those for single precision.
The first single and double precision benchmarks were compiled using CUDA Toolkit 2.3. These have been replaced with updated versions, providing a little extra detail on graphics memory utilisation. They were also compiled to use 32 bit (x86) PCs. They have also been produced using CUDA Toolkit 3.1, particularly to see if double precision calculations are faster on later GeForce GPUs. The revision exercise was started by compiling to use 64 bit (x64) PCs, which was not straightforward and different procedures were needed for CUDA 2.3 and 3.1. Details of the revised versions and problems are in CUDA3 x64.htm, with source code and all benchmark EXE files in CudaMflops.zip.
Latest results, below, are for the revised double precision program using CUDA Toolkit 2.3 on a GTX 480 graphics card. This card has double precision hardware but some results are disappointing.
The benchmarks have now been ported to 32-bit and 64-bit versions of Ubuntu Linux. Details and results are provided in linux_cuda_mflops.htm.
See GigaFLOPS Benchmarks.htm for further details and results, including comparisons with MP MFLOPS, a threaded C version, OpenMP MFLOPS, and Qpar MFLOPS, where Qpar is Microsoft’s proprietary equivalent of OpenMP and faster via Windows. The benchmarks and source codes can be obtained via gigaflops-benchmarks.zip.
Results
Following are results from a 3.0 GHz Phenom II X4 with 64-Bit Windows 7 and GeForce GTS 250, a 2.8 GHz Core i7 with 64-Bit Windows 7 and GeForce GTX 480, a 2.4 GHz Core 2 Duo with 64-Bit Vista and GeForce 8600 GT, and a laptop using a 1.8 GHz Core 2 Duo with 32-Bit Vista and GeForce 8400M GS.
Except for the shared memory tests, the most notable features are the almost constant running times with 2, 8 and 32 floating point operations per data word, and the associated fourfold increases in MFLOPS. This demonstrates that performance is limited by data transfer time.
The effect of using double precision for these calculations is also demonstrated. Initial data values are declared as 0.999999 and the final calculations should produce similar results [e.g. (0.999999 + 0.00002) * 0.99998]. Single precision calculations give a best case result of 0.999549, compared with 0.9999988079071 from worst case double precision calculations.
The extra tests are not run on the laptop as the program calculates that a test might take more than 2 seconds, where a CUDA timeout is likely due to uninterrupted use of the GPU being longer than permitted.
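The exact estimate used by the program is not described here. As a rough sketch of the idea, assuming the time for a single kernel launch is predicted from a measured time per pass, the decision could look like this (the function name and parameters are hypothetical):

```cpp
// Returns true when running all passes inside one kernel launch is
// predicted to exceed the time limit (2 seconds by default), so the
// extra tests should be skipped to avoid a possible CUDA timeout.
bool shouldSkipExtraTests(double secondsPerPass, int passes,
                          double limitSeconds = 2.0)
{
    return secondsPerPass * passes > limitSeconds;
}
```

Using the Calculate only times at 10M words and 2 operations from the logs below, the laptop needs about 4.04 seconds for 25 passes and would be skipped, while the GTS 250 needs about 0.44 seconds and would run.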
Phenom II X4 3.0 GHz, 64-Bit Windows 7, GeForce GTS 250
********************************************************
CUDA Double Precision MFLOPS Benchmark 1.2 Sun Jun 06 17:00:48 2010
CUDA devices found
Device 0: GeForce GTS 250 with 16 Processors 128 cores
Using 256 Threads
Test  8 Byte Words  Ops/Wd  Repeat Passes  Seconds  MFLOPS  First Results  All Same
Data in & out 100000 2 2500 2.499293 200 0.9999988079071 Yes
Data out only 100000 2 2500 1.500436 333 0.9999988079071 Yes
Calculate only 100000 2 2500 0.534337 936 0.9999988079071 Yes
Data in & out 1000000 2 250 1.878829 266 0.9999988079071 Yes
Data out only 1000000 2 250 1.173691 426 0.9999988079071 Yes
Calculate only 1000000 2 250 0.449066 1113 0.9999988079071 Yes
Data in & out 10000000 2 25 1.843320 271 0.9999988079071 Yes
Data out only 10000000 2 25 1.109989 450 0.9999988079071 Yes
Calculate only 10000000 2 25 0.441100 1134 0.9999988079071 Yes
Data in & out 100000 8 2500 2.434207 822 0.9999988079071 Yes
Data out only 100000 8 2500 1.495548 1337 0.9999988079071 Yes
Calculate only 100000 8 2500 0.569520 3512 0.9999988079071 Yes
Data in & out 1000000 8 250 1.879104 1064 0.9999988079071 Yes
Data out only 1000000 8 250 1.171405 1707 0.9999988079071 Yes
Calculate only 1000000 8 250 0.453054 4414 0.9999988079071 Yes
Data in & out 10000000 8 25 1.755349 1139 0.9999990100041 Yes
Data out only 10000000 8 25 1.094636 1827 0.9999990100041 Yes
Calculate only 10000000 8 25 0.439384 4552 0.9999990100041 Yes
Data in & out 100000 32 2500 2.433104 3288 0.9999988079071 Yes
Data out only 100000 32 2500 1.509020 5301 0.9999988079071 Yes
Calculate only 100000 32 2500 0.610137 13112 0.9999988079071 Yes
Data in & out 1000000 32 250 1.866450 4286 0.9999988079071 Yes
Data out only 1000000 32 250 1.229134 6509 0.9999988079071 Yes
Calculate only 1000000 32 250 0.443122 18054 0.9999988079071 Yes
Data in & out 10000000 32 25 1.819324 4397 0.9999988079071 Yes
Data out only 10000000 32 25 1.116542 7165 0.9999988079071 Yes
Calculate only 10000000 32 25 0.433849 18440 0.9999988079071 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.515275 970 0.9999988079071 Yes
Shared Memory 10000000 2 25 0.013449 37177 0.9999988079071 Yes
Calculate 10000000 8 25 0.513707 3893 0.9999990100041 Yes
Shared Memory 10000000 8 25 0.018692 106997 0.9999990100041 Yes
Calculate 10000000 32 25 0.512674 15604 0.9999988079071 Yes
Shared Memory 10000000 32 25 0.049006 163245 0.9999988079071 Yes
Hardware Information
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor Measured 3000 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows Information
Intel processor architecture, 4 CPUs
Windows NT Version 6.1, build 7600,
Memory 4096 MB, Free 4096 MB
User Virtual Space 4096 MB, Free 4050 MB
Core i7 Quad 2.8 GHz, Windows 7-64, GeForce GTX 480
********************************************************
CUDA 2.3 x86 Double Precision MFLOPS Benchmark 1.3 Fri Jul 30 15:23:02 2010
CUDA devices found
Device 0: GeForce GTX 480 with 15 Processors 120 cores
Global Memory 1468 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024
Using 256 Threads
Test  8 Byte Words  Ops/Wd  Repeat Passes  Seconds  MFLOPS  First Results  All Same
Data in & out 100000 2 2500 1.685190 297 0.9294744580218 Yes
Data out only 100000 2 2500 0.869685 575 0.9294744580218 Yes
Calculate only 100000 2 2500 0.094995 5263 0.9294744580218 Yes
Data in & out 1000000 2 250 1.114482 449 0.9925431921162 Yes
Data out only 1000000 2 250 0.567602 881 0.9925431921162 Yes
Calculate only 1000000 2 250 0.035579 14053 0.9925431921162 Yes
Data in & out 10000000 2 25 0.990978 505 0.9992492055877 Yes
Data out only 10000000 2 25 0.518927 964 0.9992492055877 Yes
Calculate only 10000000 2 25 0.027536 18158 0.9992492055877 Yes
Data in & out 100000 8 2500 1.703644 1174 0.9571642109917 Yes
Data out only 100000 8 2500 0.866748 2307 0.9571642109917 Yes
Calculate only 100000 8 2500 0.130193 15362 0.9571642109917 Yes
Data in & out 1000000 8 250 1.111342 1800 0.9955252302690 Yes
Data out only 1000000 8 250 0.551637 3626 0.9955252302690 Yes
Calculate only 1000000 8 250 0.035633 56128 0.9955252302690 Yes
Data in & out 10000000 8 25 0.992191 2016 0.9995496465632 Yes
Data out only 10000000 8 25 0.523850 3818 0.9995496465632 Yes
Calculate only 10000000 8 25 0.027778 71999 0.9995496465632 Yes
Data in & out 100000 32 2500 1.793422 4461 0.8903768345465 Yes
Data out only 100000 32 2500 0.974425 8210 0.8903768345465 Yes
Calculate only 100000 32 2500 0.183113 43689 0.8903768345465 Yes
Data in & out 1000000 32 250 1.157246 6913 0.9881014965491 Yes
Data out only 1000000 32 250 0.619567 12912 0.9881014965491 Yes
Calculate only 1000000 32 250 0.080926 98855 0.9881014965491 Yes
Data in & out 10000000 32 25 1.032130 7751 0.9987993043723 Yes
Data out only 10000000 32 25 0.560146 14282 0.9987993043723 Yes
Calculate only 10000000 32 25 0.070265 113855 0.9987993043723 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 0.013806 36216 0.9992492055877 Yes
Shared Memory 10000000 2 25 0.006157 81214 0.9992492055877 Yes
Calculate 10000000 8 25 0.018377 108830 0.9995496465632 Yes
Shared Memory 10000000 8 25 0.017984 111212 0.9995496465632 Yes
Calculate 10000000 32 25 0.065742 121689 0.9987993043723 Yes
Shared Memory 10000000 32 25 0.065526 122088 0.9987993043723 Yes
Hardware Information
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000106A5
Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz Measured 2806 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows Information
Intel processor architecture, 8 CPUs
Windows NT Version 6.1, build 7600,
Memory 4096 MB, Free 4096 MB
User Virtual Space 4096 MB, Free 4042 MB
1468 MB Graphics RAM, Used 261 Minimum, 334 Maximum
Core 2 Duo 2.4 GHz, Vista 64, GeForce 8600 GT
********************************************************
CUDA Double Precision MFLOPS Benchmark 1.2 Sun Jun 06 17:08:37 2010
CUDA devices found
Device 0: GeForce 8600 GT with 4 Processors 32 cores
Using 256 Threads
Test  8 Byte Words  Ops/Wd  Repeat Passes  Seconds  MFLOPS  First Results  All Same
Data in & out 100000 2 2500 4.766313 105 0.9999988079071 Yes
Data out only 100000 2 2500 3.027592 165 0.9999988079071 Yes
Calculate only 100000 2 2500 1.102929 453 0.9999988079071 Yes
Data in & out 1000000 2 250 3.430963 146 0.9999988079071 Yes
Data out only 1000000 2 250 2.371367 211 0.9999988079071 Yes
Calculate only 1000000 2 250 1.089442 459 0.9999988079071 Yes
Data in & out 10000000 2 25 3.153290 159 0.9999988079071 Yes
Data out only 10000000 2 25 2.271246 220 0.9999988079071 Yes
Calculate only 10000000 2 25 1.050520 476 0.9999988079071 Yes
Data in & out 100000 8 2500 4.713221 424 0.9999988079071 Yes
Data out only 100000 8 2500 3.044560 657 0.9999988079071 Yes
Calculate only 100000 8 2500 1.104363 1811 0.9999988079071 Yes
Data in & out 1000000 8 250 3.450819 580 0.9999988079071 Yes
Data out only 1000000 8 250 2.402212 833 0.9999988079071 Yes
Calculate only 1000000 8 250 1.114893 1794 0.9999988079071 Yes
Data in & out 10000000 8 25 3.206385 624 0.9999990100041 Yes
Data out only 10000000 8 25 2.311201 865 0.9999990100041 Yes
Calculate only 10000000 8 25 1.108218 1805 0.9999990100041 Yes
Data in & out 100000 32 2500 4.790173 1670 0.9999988079071 Yes
Data out only 100000 32 2500 3.098194 2582 0.9999988079071 Yes
Calculate only 100000 32 2500 1.176393 6800 0.9999988079071 Yes
Data in & out 1000000 32 250 3.485163 2295 0.9999988079071 Yes
Data out only 1000000 32 250 2.440628 3278 0.9999988079071 Yes
Calculate only 1000000 32 250 1.158169 6907 0.9999988079071 Yes
Data in & out 10000000 32 25 3.240463 2469 0.9999988079071 Yes
Data out only 10000000 32 25 2.366616 3380 0.9999988079071 Yes
Calculate only 10000000 32 25 1.144864 6988 0.9999988079071 Yes
Extra tests - loop in main CUDA Function
Calculate 10000000 2 25 1.135575 440 0.9999988079071 Yes
Shared Memory 10000000 2 25 0.055957 8935 0.9999988079071 Yes
Calculate 10000000 8 25 1.113948 1795 0.9999990100041 Yes
Shared Memory 10000000 8 25 0.094749 21108 0.9999990100041 Yes
Calculate 10000000 32 25 1.129894 7080 0.9999988079071 Yes
Shared Memory 10000000 32 25 0.248978 32131 0.9999988079071 Yes
Hardware Information
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows Information
Intel processor architecture, 2 CPUs
Windows NT Version 6.0, build 6002, Service Pack 2
Memory 4095 MB, Free 2512 MB
User Virtual Space 4096 MB, Free 4047 MB
Core 2 Duo 1.8 GHz Laptop, Vista 32, GeForce 8400M GS
********************************************************
CUDA Double Precision MFLOPS Benchmark 1.2 Sun Jun 06 17:12:30 2010
CUDA devices found
Device 0: GeForce 8400M GS with 2 Processors 16 cores
Using 256 Threads
Test  8 Byte Words  Ops/Wd  Repeat Passes  Seconds  MFLOPS  First Results  All Same
Data in & out 100000 2 2500 9.476113 53 0.9999988079071 Yes
Data out only 100000 2 2500 7.111541 70 0.9999988079071 Yes
Calculate only 100000 2 2500 4.348053 115 0.9999988079071 Yes
Data in & out 1000000 2 250 7.665548 65 0.9999988079071 Yes
Data out only 1000000 2 250 6.223459 80 0.9999988079071 Yes
Calculate only 1000000 2 250 4.066144 123 0.9999988079071 Yes
Data in & out 10000000 2 25 7.488595 67 0.9999988079071 Yes
Data out only 10000000 2 25 6.096964 82 0.9999988079071 Yes
Calculate only 10000000 2 25 4.036956 124 0.9999988079071 Yes
Data in & out 100000 8 2500 9.536672 210 0.9999988079071 Yes
Data out only 100000 8 2500 7.161488 279 0.9999988079071 Yes
Calculate only 100000 8 2500 4.358269 459 0.9999988079071 Yes
Data in & out 1000000 8 250 7.702518 260 0.9999988079071 Yes
Data out only 1000000 8 250 6.306045 317 0.9999988079071 Yes
Calculate only 1000000 8 250 4.107744 487 0.9999988079071 Yes
Data in & out 10000000 8 25 7.784999 257 0.9999990100041 Yes
Data out only 10000000 8 25 6.267146 319 0.9999990100041 Yes
Calculate only 10000000 8 25 4.104738 487 0.9999990100041 Yes
Data in & out 100000 32 2500 9.769602 819 0.9999988079071 Yes
Data out only 100000 32 2500 7.361504 1087 0.9999988079071 Yes
Calculate only 100000 32 2500 4.363039 1834 0.9999988079071 Yes
Data in & out 1000000 32 250 7.765321 1030 0.9999988079071 Yes
Data out only 1000000 32 250 6.187865 1293 0.9999988079071 Yes
Calculate only 1000000 32 250 4.020476 1990 0.9999988079071 Yes
Data in & out 10000000 32 25 7.583945 1055 0.9999988079071 Yes
Data out only 10000000 32 25 6.112571 1309 0.9999988079071 Yes
Calculate only 10000000 32 25 4.078046 1962 0.9999988079071 Yes
Extra tests not run as 2 second timeout is possible
Hardware Information
CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006FD
Intel(R) Core(TM)2 Duo CPU T5550 @ 1.83GHz Measured 1829 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows Information
Intel processor architecture, 2 CPUs
Windows NT Version 6.0, build 6002, Service Pack 2
Memory 2046 MB, Free 917 MB
User Virtual Space 2048 MB, Free 2014 MB
Double and Single Precision Comparison
Below are single and double precision MFLOPS measurements for the three desktop systems. Here, the much faster single precision tests (mainly Calculate Only) become more limited by graphics processing unit (GPU) speed at 32 operations per data word, where the speed gain over 8 operations is less than fourfold.
Ignoring the extra tests, GTS 250 single precision (SP) speed averages around six times that of double precision (DP) on Calculate Only and 2.3 times with data output and/or input, the latter influenced by the doubled data volume as well as slower calculations. GTX 480 SP calculations are just over twice as fast as on the GTS 250, but DP calculations are around eight times faster. As a result, GTX 480 DP speed is closer to its SP speed, but still considerably slower in some cases. GTX 480 SP shared memory performance is more than three times faster than on the GTS 250, but DP does not fare as well and is slower at 32 operations per word. In this test, GTS 250 speeds at SP and DP are similar.
                                          GTS 250         8600 GT         GTX 480
Test            4/8 Byte  Ops  Repeat      SP      DP      SP      DP      SP      DP
                   Words  /Wd  Passes  MFLOPS  MFLOPS  MFLOPS  MFLOPS  MFLOPS  MFLOPS
Data in & out     100000    2    2500     328     200     215     105     521     297
Data out only     100000    2    2500     659     333     365     165     973     575
Calculate only    100000    2    2500    3054     936    1653     453    5743    5263
Data in & out    1000000    2     250     625     266     341     146     823     449
Data out only    1000000    2     250    1132     426     604     211    1694     881
Calculate only   1000000    2     250    9672    1113    3475     459   21767   14053
Data in & out   10000000    2      25     714     271     414     159     987     505
Data out only   10000000    2      25    1338     450     677     220    1911     964
Calculate only  10000000    2      25   13038    1134    3866     476   32286   18158
Data in & out     100000    8    2500    1336     822     850     424    2070    1174
Data out only     100000    8    2500    2700    1337    1429     657    3750    2307
Calculate only    100000    8    2500   12233    3512    6583    1811   22002   15362
Data in & out    1000000    8     250    2382    1064    1350     580    3286    1800
Data out only    1000000    8     250    4428    1707    2368     833    6755    3626
Calculate only   1000000    8     250   39481    4414   13208    1794   85074   56128
Data in & out   10000000    8      25    2949    1139    1650     624    3968    2016
Data out only   10000000    8      25    5487    1827    2662     865    7635    3818
Calculate only  10000000    8      25   51199    4552   14399    1805  128542   71999
Data in & out     100000   32    2500    5142    3288    3172    1670    7976    4461
Data out only     100000   32    2500    9832    5301    4995    2582   14380    8210
Calculate only    100000   32    2500   36080   13112   15789    6800   63574   43689
Data in & out    1000000   32     250    9427    4286    4930    2295   12703    6913
Data out only    1000000   32     250   16927    6509    8107    3278   25752   12912
Calculate only   1000000   32     250  108170   18054   26773    6907  268179   98855
Data in & out   10000000   32      25   11182    4397    5926    2469   15485    7751
Data out only   10000000   32      25   19975    7165    9006    3380   29608   14282
Calculate only  10000000   32      25  135041   18440   28994    6988  440451  113855
Extra tests - loop in main CUDA Function
Calculate       10000000    2      25    9967     970    3734     440  100937   36216
Shared Memory   10000000    2      25   54082   37177   10707    8935  180734   81214
Calculate       10000000    8      25   39285    3893   15072    1795  322696  108830
Shared Memory   10000000    8      25  119617  106997   23579   21108  412376  111212
Calculate       10000000   32      25  158143   15604   31296    7080  653139  121689
Shared Memory   10000000   32      25  173291  163245   34094   32131  769658  122088
Reliability/Burn-In Testing
The benchmark can also be used in reliability testing mode, as with the single precision version, but the output layout is slightly different. The following results were from a 5 minute run using a BAT file with the command “CudaMFLOPS1DP Mins 5 FC”, where FC selects the alternative Shared Memory test.
Temperature and fan speeds were monitored during the test using VTune. This showed 48°C at the start, increasing to 61°C after half a minute, 65°C after one minute and 71°C after 4 minutes. Then the fan kicked in, increasing in speed from 35% of maximum to 40%. Throughout, CPU utilisation was 25%, or 100% of one CPU. The same test under single precision was 6% faster, with a maximum temperature 1°C higher.
CUDA Double Precision MFLOPS Benchmark 1.2 Tue Jun 08 12:55:26 2010
CUDA devices found
Device 0: GeForce GTS 250 with 16 Processors 128 cores
Using 256 Threads
Shared Memory Reliability Test 5 minutes, report every 15 seconds
Repeat CUDA 219 times at 0.42 seconds. Repeat former 35 times
Tests - 10000000 8 Byte Words, 32 Operations Per Word, 7665 Repeat Passes
Results of all calculations should be - 0.9999988079071044
Test  Seconds  MFLOPS  Errors  First Word  Value
1 14.799 165736 None found
2 14.793 165810 None found
3 14.795 165781 None found
4 14.797 165763 None found
5 14.798 165758 None found
6 14.793 165810 None found
7 14.798 165755 None found
8 14.796 165770 None found
9 14.792 165822 None found
10 14.795 165781 None found
11 14.793 165804 None found
12 14.794 165800 None found
13 14.793 165808 None found
14 14.794 165800 None found
15 14.795 165787 None found
16 14.794 165793 None found
17 14.794 165799 None found
18 14.793 165810 None found
19 14.819 165521 None found
20 14.798 165755 None found
Hardware Information
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor Measured 3000 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows Information
Intel processor architecture, 4 CPUs
Windows NT Version 6.1, build 7600,
Memory 4096 MB, Free 4096 MB
User Virtual Space 4096 MB, Free 4050 MB
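In the log above, each 15 second report covers 219 passes per CUDA call repeated 35 times, which gives the 7665 repeat passes shown (219 x 35 = 7665). The error columns suggest that every result word is compared with the expected value. The following C++ sketch shows one way such a check could be written; it is an assumption about the approach, not the benchmark's code, and the function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first word that differs from the expected
// value, or -1 when every word matches (reported as "None found").
// Exact comparison is assumed here because all words start from the
// same value and undergo the same sequence of operations.
long firstErrorWord(const std::vector<double>& words, double expected)
{
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i] != expected) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

A burn-in run would call this after each report interval with the expected value from the log (0.9999988079071044) and print the offending word and its value if the return is not -1.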
Roy Longbottom October 2014
The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection