CUDA nVidia Graphics Processor Double Precision Parallel Benchmark

CUDA GPU Double Precision Benchmark

Summary

CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use for executing arithmetic instructions on numerous processing elements simultaneously. Maximum speeds, in terms of billions of floating point operations per second or GFLOPS, can be higher on a laptop graphics processor than on CPUs such as dual core models. This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions can be executed simultaneously on sections of data from a data array. For maximum speeds, the data array has to be large, with little or no reference to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache-like memory.

The CudaMFLOPS1 benchmark exploits multiple registers and large data sizes and now also the fast memory. One intention is to demonstrate poor performance as well as high speed operation. Originally, each test comprised three sets of measurements; two extra tests are now included:

New Calculations - Copy data to graphics RAM, execute instructions, copy back to host RAM
Update Data - Execute further instructions on data in graphics RAM, copy back to host RAM
Graphics Only Data - Execute further instructions on data in graphics RAM, leave it there
Extra Test 1 - Just graphics data, repeat loop in CUDA function
Extra Test 2 - Just graphics data, repeat loop in CUDA function but using Shared Memory

These are run at three different data sizes, defaults 100,000 words repeated 2500 times, 1M words 250 times and 10M words 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
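
As a rough illustration of the arithmetic described above, the following Python sketch applies the 2 and 8 operation forms to every element of a small list. The function names are assumptions made for illustration, and the 2 operation form is inferred from the example expression quoted in the Results section. The benchmark itself is CUDA C++, and its 32 operation expression and full set of constants are not reproduced here.

```python
def update_2_ops(x, a, b):
    """One add and one multiply per word (2 operations)."""
    return [(v + a) * b for v in x]


def update_8_ops(x, a, b, c, d, e, f):
    """Form quoted in the text: 3 inner adds, 3 multiplies, 1 subtract, 1 outer add."""
    return [(v + a) * b - (v + c) * d + (v + e) * f for v in x]


if __name__ == "__main__":
    data = [0.999999] * 4      # initial value quoted in the Results section
    a, b = 0.00002, 0.99998    # example constants quoted in the Results section
    print(update_2_ops(data, a, b))
```

On a GPU each element would be handled by a separate thread rather than by a loop; the list comprehension here only shows which operations are counted per word.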

This benchmark is a simple conversion from the original single precision version to use double precision numbers. For more details and single precision results see CUDA1.htm. The execution file and source code can be downloaded via CudaMflops.zip. No installation is necessary - Extract All and click on CudaMFLOPS1DP.exe but see ReadMe.txt first. The ZIP file also includes the CUDA C++ source code.

As could be expected, double precision speed was generally at most half that of the single precision version for New Calculations or Update Data, where data is copied to and from, or only from, graphics RAM. A surprise was that, using Graphics Only Data, double precision was up to 11.5 times slower, where graphics memory speed would be a major influence. The other surprise is that, using Shared Memory in Extra Test 2, double precision speeds were nearly as fast as those for single precision.

The first single and double precision benchmarks were compiled using CUDA Toolkit 2.3. These have been replaced with updated versions, providing a little extra detail on graphics memory utilisation. They were also compiled to use 32 bit (x86) PCs. They have also been produced using CUDA Toolkit 3.1, particularly to see if double precision calculations are faster on later GeForce GPUs. The revision exercise was started by compiling to use 64 bit (x64) PCs, which was not straightforward and different procedures were needed for CUDA 2.3 and 3.1. Details of the revised versions and problems are in CUDA3 x64.htm with source code and all benchmark EXE files in CudaMflops.zip.

Latest results, below, are for the revised double precision program using CUDA Toolkit 2.3 on a GTX 480 graphics card. This had double precision hardware but some results are disappointing.

The benchmarks have now been ported to 32-bit and 64-bit versions of Ubuntu Linux. Details and results are provided in linux_cuda_mflops.htm.

See GigaFLOPS Benchmarks.htm for further details and results, including comparisons with MP MFLOPS, a threaded C version, OpenMP MFLOPS, and Qpar MFLOPS, where Qpar is Microsoft’s proprietary equivalent of OpenMP and faster via Windows. The benchmarks and source codes can be obtained via gigaflops-benchmarks.zip.

Results

Results follow for a 3.0 GHz Phenom II X4 with 64-Bit Windows 7 and GeForce GTS 250, a 2.4 GHz Core 2 Duo with 64-Bit Vista and GeForce 8600 GT, and a laptop using a 1.8 GHz Core 2 Duo with 32-Bit Vista and GeForce 8400M GS. Results from a 2.8 GHz Core i7 with 64-Bit Windows 7 and GeForce GTX 480, using the revised version 1.3 program, are also included. Except for the shared memory tests, the most notable features are the almost constant running times with 2, 8 and 32 floating point operations per data word, and the associated fourfold increases in MFLOPS. This demonstrates that performance is limited by data transfer time.
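
The reported MFLOPS figures appear consistent with words x operations per word x passes, divided by seconds and by one million. The Python sketch below applies that formula, which is inferred from the listed figures rather than taken from the benchmark source, to three GTS 250 "Data in & out" lines. The function name is an assumption.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test line."""
    return words * ops_per_word * passes / seconds / 1.0e6


if __name__ == "__main__":
    # Seconds taken from the GTS 250 "Data in & out" lines at 10,000,000 words.
    for ops, secs in [(2, 1.843320), (8, 1.755349), (32, 1.819324)]:
        print(ops, "ops/word:", round(mflops(10_000_000, ops, 25, secs)), "MFLOPS")
```

With running time nearly constant at about 1.8 seconds, the result scales almost directly with the operations per word, which is the fourfold pattern noted above.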

The effect of using double precision for these calculations is also demonstrated. Initial data values are declared as 0.999999 and the final calculations should produce similar results [e.g. (0.999999 + 0.00002) * 0.99998]. Single precision calculations give a best case result of 0.999549, compared with 0.9999988079071 from worst case double precision calculations.
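
To give a feel for why the two precisions drift apart, the following Python sketch repeats the example expression while optionally rounding every intermediate value to single precision. It is an illustration only: the function names and the pass count are assumptions, and it does not attempt to reproduce the figures reported by the benchmark.

```python
import struct


def to_single(value):
    """Round a Python float (a 64-bit double) to the nearest 32-bit single precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def repeat_2_ops(x, a, b, passes, single=False):
    """Apply x = (x + a) * b repeatedly, rounding each step to single precision if requested."""
    if single:
        x, a, b = to_single(x), to_single(a), to_single(b)
    for _ in range(passes):
        if single:
            x = to_single(to_single(x + a) * b)
        else:
            x = (x + a) * b
    return x


if __name__ == "__main__":
    # Initial value and constants as quoted in the text; 25 passes is an arbitrary choice.
    print("double:", repeat_2_ops(0.999999, 0.00002, 0.99998, 25))
    print("single:", repeat_2_ops(0.999999, 0.00002, 0.99998, 25, single=True))
```

Rounding after each add and multiply mimics single precision hardware arithmetic, so the rounding errors accumulate with every pass instead of being applied once at the end.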

The extra tests are not run on the laptop as the program calculates that a test might take more than 2 seconds, where a CUDA timeout is likely due to uninterrupted use of the GPU being longer than permitted.

Phenom II X4 3.0 GHz, 64-Bit Windows 7, GeForce GTS 250

********************************************************

CUDA Double Precision MFLOPS Benchmark 1.2 Sun Jun 06 17:00:48 2010

CUDA devices found
Device 0: GeForce GTS 250 with 16 Processors 128 cores

Using 256 Threads

Test              8 Byte  Ops  Repeat    Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                               Results Same

Data in & out     100000    2    2500   2.499293      200   0.9999988079071  Yes
Data out only     100000    2    2500   1.500436      333   0.9999988079071  Yes
Calculate only    100000    2    2500   0.534337      936   0.9999988079071  Yes
Data in & out    1000000    2     250   1.878829      266   0.9999988079071  Yes
Data out only    1000000    2     250   1.173691      426   0.9999988079071  Yes
Calculate only   1000000    2     250   0.449066     1113   0.9999988079071  Yes
Data in & out   10000000    2      25   1.843320      271   0.9999988079071  Yes
Data out only   10000000    2      25   1.109989      450   0.9999988079071  Yes
Calculate only  10000000    2      25   0.441100     1134   0.9999988079071  Yes
Data in & out     100000    8    2500   2.434207      822   0.9999988079071  Yes
Data out only     100000    8    2500   1.495548     1337   0.9999988079071  Yes
Calculate only    100000    8    2500   0.569520     3512   0.9999988079071  Yes
Data in & out    1000000    8     250   1.879104     1064   0.9999988079071  Yes
Data out only    1000000    8     250   1.171405     1707   0.9999988079071  Yes
Calculate only   1000000    8     250   0.453054     4414   0.9999988079071  Yes
Data in & out   10000000    8      25   1.755349     1139   0.9999990100041  Yes
Data out only   10000000    8      25   1.094636     1827   0.9999990100041  Yes
Calculate only  10000000    8      25   0.439384     4552   0.9999990100041  Yes
Data in & out     100000   32    2500   2.433104     3288   0.9999988079071  Yes
Data out only     100000   32    2500   1.509020     5301   0.9999988079071  Yes
Calculate only    100000   32    2500   0.610137    13112   0.9999988079071  Yes
Data in & out    1000000   32     250   1.866450     4286   0.9999988079071  Yes
Data out only    1000000   32     250   1.229134     6509   0.9999988079071  Yes
Calculate only   1000000   32     250   0.443122    18054   0.9999988079071  Yes
Data in & out   10000000   32      25   1.819324     4397   0.9999988079071  Yes
Data out only   10000000   32      25   1.116542     7165   0.9999988079071  Yes
Calculate only  10000000   32      25   0.433849    18440   0.9999988079071  Yes

Extra tests - loop in main CUDA Function

Calculate       10000000    2      25   0.515275      970   0.9999988079071  Yes
Shared Memory   10000000    2      25   0.013449    37177   0.9999988079071  Yes
Calculate       10000000    8      25   0.513707     3893   0.9999990100041  Yes
Shared Memory   10000000    8      25   0.018692   106997   0.9999990100041  Yes
Calculate       10000000   32      25   0.512674    15604   0.9999988079071  Yes
Shared Memory   10000000   32      25   0.049006   163245   0.9999988079071  Yes

Hardware Information
 CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
 AMD Phenom(tm) II X4 945 Processor Measured 3000 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows Information
 Intel processor architecture, 4 CPUs
 Windows NT Version 6.1, build 7600,
 Memory 4096 MB, Free 4096 MB
 User Virtual Space 4096 MB, Free 4050 MB


Core i7 Quad 2.8 GHz, Windows 7-64, GeForce GTX 480

********************************************************

CUDA 2.3 x86 Double Precision MFLOPS Benchmark 1.3 Fri Jul 30 15:23:02 2010

CUDA devices found
Device 0: GeForce GTX 480 with 15 Processors 120 cores
Global Memory 1468 MB, Shared Memory/Block 49152 B, Max Threads/Block 1024

Using 256 Threads

Test              8 Byte  Ops  Repeat    Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                               Results Same

Data in & out     100000    2    2500   1.685190      297   0.9294744580218  Yes
Data out only     100000    2    2500   0.869685      575   0.9294744580218  Yes
Calculate only    100000    2    2500   0.094995     5263   0.9294744580218  Yes
Data in & out    1000000    2     250   1.114482      449   0.9925431921162  Yes
Data out only    1000000    2     250   0.567602      881   0.9925431921162  Yes
Calculate only   1000000    2     250   0.035579    14053   0.9925431921162  Yes
Data in & out   10000000    2      25   0.990978      505   0.9992492055877  Yes
Data out only   10000000    2      25   0.518927      964   0.9992492055877  Yes
Calculate only  10000000    2      25   0.027536    18158   0.9992492055877  Yes
Data in & out     100000    8    2500   1.703644     1174   0.9571642109917  Yes
Data out only     100000    8    2500   0.866748     2307   0.9571642109917  Yes
Calculate only    100000    8    2500   0.130193    15362   0.9571642109917  Yes
Data in & out    1000000    8     250   1.111342     1800   0.9955252302690  Yes
Data out only    1000000    8     250   0.551637     3626   0.9955252302690  Yes
Calculate only   1000000    8     250   0.035633    56128   0.9955252302690  Yes
Data in & out   10000000    8      25   0.992191     2016   0.9995496465632  Yes
Data out only   10000000    8      25   0.523850     3818   0.9995496465632  Yes
Calculate only  10000000    8      25   0.027778    71999   0.9995496465632  Yes
Data in & out     100000   32    2500   1.793422     4461   0.8903768345465  Yes
Data out only     100000   32    2500   0.974425     8210   0.8903768345465  Yes
Calculate only    100000   32    2500   0.183113    43689   0.8903768345465  Yes
Data in & out    1000000   32     250   1.157246     6913   0.9881014965491  Yes
Data out only    1000000   32     250   0.619567    12912   0.9881014965491  Yes
Calculate only   1000000   32     250   0.080926    98855   0.9881014965491  Yes
Data in & out   10000000   32      25   1.032130     7751   0.9987993043723  Yes
Data out only   10000000   32      25   0.560146    14282   0.9987993043723  Yes
Calculate only  10000000   32      25   0.070265   113855   0.9987993043723  Yes

Extra tests - loop in main CUDA Function

Calculate       10000000    2      25   0.013806    36216   0.9992492055877  Yes
Shared Memory   10000000    2      25   0.006157    81214   0.9992492055877  Yes
Calculate       10000000    8      25   0.018377   108830   0.9995496465632  Yes
Shared Memory   10000000    8      25   0.017984   111212   0.9995496465632  Yes
Calculate       10000000   32      25   0.065742   121689   0.9987993043723  Yes
Shared Memory   10000000   32      25   0.065526   122088   0.9987993043723  Yes

Hardware Information
 CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000106A5
 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz Measured 2806 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows Information
 Intel processor architecture, 8 CPUs
 Windows NT Version 6.1, build 7600,
 Memory 4096 MB, Free 4096 MB
 User Virtual Space 4096 MB, Free 4042 MB

 1468 MB Graphics RAM, Used 261 Minimum, 334 Maximum


Core 2 Duo 2.4 GHz, Vista 64, GeForce 8600 GT

********************************************************

CUDA Double Precision MFLOPS Benchmark 1.2 Sun Jun 06 17:08:37 2010

CUDA devices found
Device 0: GeForce 8600 GT with 4 Processors 32 cores

Using 256 Threads

Test              8 Byte  Ops  Repeat    Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                               Results Same

Data in & out     100000    2    2500   4.766313      105   0.9999988079071  Yes
Data out only     100000    2    2500   3.027592      165   0.9999988079071  Yes
Calculate only    100000    2    2500   1.102929      453   0.9999988079071  Yes
Data in & out    1000000    2     250   3.430963      146   0.9999988079071  Yes
Data out only    1000000    2     250   2.371367      211   0.9999988079071  Yes
Calculate only   1000000    2     250   1.089442      459   0.9999988079071  Yes
Data in & out   10000000    2      25   3.153290      159   0.9999988079071  Yes
Data out only   10000000    2      25   2.271246      220   0.9999988079071  Yes
Calculate only  10000000    2      25   1.050520      476   0.9999988079071  Yes
Data in & out     100000    8    2500   4.713221      424   0.9999988079071  Yes
Data out only     100000    8    2500   3.044560      657   0.9999988079071  Yes
Calculate only    100000    8    2500   1.104363     1811   0.9999988079071  Yes
Data in & out    1000000    8     250   3.450819      580   0.9999988079071  Yes
Data out only    1000000    8     250   2.402212      833   0.9999988079071  Yes
Calculate only   1000000    8     250   1.114893     1794   0.9999988079071  Yes
Data in & out   10000000    8      25   3.206385      624   0.9999990100041  Yes
Data out only   10000000    8      25   2.311201      865   0.9999990100041  Yes
Calculate only  10000000    8      25   1.108218     1805   0.9999990100041  Yes
Data in & out     100000   32    2500   4.790173     1670   0.9999988079071  Yes
Data out only     100000   32    2500   3.098194     2582   0.9999988079071  Yes
Calculate only    100000   32    2500   1.176393     6800   0.9999988079071  Yes
Data in & out    1000000   32     250   3.485163     2295   0.9999988079071  Yes
Data out only    1000000   32     250   2.440628     3278   0.9999988079071  Yes
Calculate only   1000000   32     250   1.158169     6907   0.9999988079071  Yes
Data in & out   10000000   32      25   3.240463     2469   0.9999988079071  Yes
Data out only   10000000   32      25   2.366616     3380   0.9999988079071  Yes
Calculate only  10000000   32      25   1.144864     6988   0.9999988079071  Yes

Extra tests - loop in main CUDA Function

Calculate       10000000    2      25   1.135575      440   0.9999988079071  Yes
Shared Memory   10000000    2      25   0.055957     8935   0.9999988079071  Yes
Calculate       10000000    8      25   1.113948     1795   0.9999990100041  Yes
Shared Memory   10000000    8      25   0.094749    21108   0.9999990100041  Yes
Calculate       10000000   32      25   1.129894     7080   0.9999988079071  Yes
Shared Memory   10000000   32      25   0.248978    32131   0.9999988079071  Yes

Hardware Information
 CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006F6
 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz Measured 2402 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows Information
 Intel processor architecture, 2 CPUs
 Windows NT Version 6.0, build 6002, Service Pack 2
 Memory 4095 MB, Free 2512 MB
 User Virtual Space 4096 MB, Free 4047 MB


Core 2 Duo 1.8 GHz Laptop, Vista 32, GeForce 8400M GS

********************************************************

CUDA Double Precision MFLOPS Benchmark 1.2 Sun Jun 06 17:12:30 2010

CUDA devices found
Device 0: GeForce 8400M GS with 2 Processors 16 cores

Using 256 Threads

Test              8 Byte  Ops  Repeat    Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                               Results Same

Data in & out     100000    2    2500   9.476113       53   0.9999988079071  Yes
Data out only     100000    2    2500   7.111541       70   0.9999988079071  Yes
Calculate only    100000    2    2500   4.348053      115   0.9999988079071  Yes
Data in & out    1000000    2     250   7.665548       65   0.9999988079071  Yes
Data out only    1000000    2     250   6.223459       80   0.9999988079071  Yes
Calculate only   1000000    2     250   4.066144      123   0.9999988079071  Yes
Data in & out   10000000    2      25   7.488595       67   0.9999988079071  Yes
Data out only   10000000    2      25   6.096964       82   0.9999988079071  Yes
Calculate only  10000000    2      25   4.036956      124   0.9999988079071  Yes
Data in & out     100000    8    2500   9.536672      210   0.9999988079071  Yes
Data out only     100000    8    2500   7.161488      279   0.9999988079071  Yes
Calculate only    100000    8    2500   4.358269      459   0.9999988079071  Yes
Data in & out    1000000    8     250   7.702518      260   0.9999988079071  Yes
Data out only    1000000    8     250   6.306045      317   0.9999988079071  Yes
Calculate only   1000000    8     250   4.107744      487   0.9999988079071  Yes
Data in & out   10000000    8      25   7.784999      257   0.9999990100041  Yes
Data out only   10000000    8      25   6.267146      319   0.9999990100041  Yes
Calculate only  10000000    8      25   4.104738      487   0.9999990100041  Yes
Data in & out     100000   32    2500   9.769602      819   0.9999988079071  Yes
Data out only     100000   32    2500   7.361504     1087   0.9999988079071  Yes
Calculate only    100000   32    2500   4.363039     1834   0.9999988079071  Yes
Data in & out    1000000   32     250   7.765321     1030   0.9999988079071  Yes
Data out only    1000000   32     250   6.187865     1293   0.9999988079071  Yes
Calculate only   1000000   32     250   4.020476     1990   0.9999988079071  Yes
Data in & out   10000000   32      25   7.583945     1055   0.9999988079071  Yes
Data out only   10000000   32      25   6.112571     1309   0.9999988079071  Yes
Calculate only  10000000   32      25   4.078046     1962   0.9999988079071  Yes

Extra tests not run as 2 second timeout is possible

Hardware Information
 CPU GenuineIntel, Features Code BFEBFBFF, Model Code 000006FD
 Intel(R) Core(TM)2 Duo CPU T5550 @ 1.83GHz Measured 1829 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, No 3DNow,
Windows Information
 Intel processor architecture, 2 CPUs
 Windows NT Version 6.0, build 6002, Service Pack 2
 Memory 2046 MB, Free 917 MB
 User Virtual Space 2048 MB, Free 2014 MB

Double and Single Precision Comparison

Below are single and double precision MFLOPS speed measurements for the three desktop systems. Here, the much faster single precision tests (mainly Calculate Only) become more limited by graphics processing unit (GPU) speed at 32 operations per data word, where the speed gain over 8 operations is less than fourfold.

Ignoring the extra tests, GTS 250 single precision (SP) speed averages around six times that of double precision (DP) on Calculate Only and 2.3 times with data output and/or input, the latter influenced by twice as much data and slower calculations. GTX 480 SP calculations are just over twice as fast as on the GTS 250, but DP is around eight times faster. This leads to GTX 480 DP speed being closer to that for SP, but still quite a bit slower in some cases. GTX 480 SP shared memory performance is more than three times faster than on the GTS 250 but not so good on DP, where it is slower with 32 operations per word. Here, GTS 250 speeds at SP and DP are similar.

                                              GTS 250         8600 GT         GTX 480
Test            4/8 Byte  Ops  Repeat      SP      DP      SP      DP      SP      DP
                   Words  /Wd  Passes  MFLOPS  MFLOPS  MFLOPS  MFLOPS  MFLOPS  MFLOPS

Data in & out     100000    2    2500     328     200     215     105     521     297
Data out only     100000    2    2500     659     333     365     165     973     575
Calculate only    100000    2    2500    3054     936    1653     453    5743    5263
Data in & out    1000000    2     250     625     266     341     146     823     449
Data out only    1000000    2     250    1132     426     604     211    1694     881
Calculate only   1000000    2     250    9672    1113    3475     459   21767   14053
Data in & out   10000000    2      25     714     271     414     159     987     505
Data out only   10000000    2      25    1338     450     677     220    1911     964
Calculate only  10000000    2      25   13038    1134    3866     476   32286   18158
Data in & out     100000    8    2500    1336     822     850     424    2070    1174
Data out only     100000    8    2500    2700    1337    1429     657    3750    2307
Calculate only    100000    8    2500   12233    3512    6583    1811   22002   15362
Data in & out    1000000    8     250    2382    1064    1350     580    3286    1800
Data out only    1000000    8     250    4428    1707    2368     833    6755    3626
Calculate only   1000000    8     250   39481    4414   13208    1794   85074   56128
Data in & out   10000000    8      25    2949    1139    1650     624    3968    2016
Data out only   10000000    8      25    5487    1827    2662     865    7635    3818
Calculate only  10000000    8      25   51199    4552   14399    1805  128542   71999
Data in & out     100000   32    2500    5142    3288    3172    1670    7976    4461
Data out only     100000   32    2500    9832    5301    4995    2582   14380    8210
Calculate only    100000   32    2500   36080   13112   15789    6800   63574   43689
Data in & out    1000000   32     250    9427    4286    4930    2295   12703    6913
Data out only    1000000   32     250   16927    6509    8107    3278   25752   12912
Calculate only   1000000   32     250  108170   18054   26773    6907  268179   98855
Data in & out   10000000   32      25   11182    4397    5926    2469   15485    7751
Data out only   10000000   32      25   19975    7165    9006    3380   29608   14282
Calculate only  10000000   32      25  135041   18440   28994    6988  440451  113855

Extra tests - loop in main CUDA Function

Calculate       10000000    2      25    9967     970    3734     440  100937   36216
Shared Memory   10000000    2      25   54082   37177   10707    8935  180734   81214
Calculate       10000000    8      25   39285    3893   15072    1795  322696  108830
Shared Memory   10000000    8      25  119617  106997   23579   21108  412376  111212
Calculate       10000000   32      25  158143   15604   31296    7080  653139  121689
Shared Memory   10000000   32      25  173291  163245   34094   32131  769658  122088
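
The SP to DP ratios discussed above can be checked directly from the table. The following Python sketch does this for the GTS 250 Calculate Only lines at 10,000,000 words; the dictionary layout and function name are assumptions made for illustration, and the MFLOPS values are copied from the table.

```python
# Operations per word -> (SP MFLOPS, DP MFLOPS), GTS 250, Calculate only, 10,000,000 words.
GTS250_CALC_ONLY = {
    2: (13038, 1134),
    8: (51199, 4552),
    32: (135041, 18440),
}


def sp_dp_ratio(sp_mflops, dp_mflops):
    """How many times faster single precision is than double precision."""
    return sp_mflops / dp_mflops


if __name__ == "__main__":
    for ops, (sp, dp) in GTS250_CALC_ONLY.items():
        print(ops, "ops/word: SP is", round(sp_dp_ratio(sp, dp), 1), "times DP")
```

The 2 operations per word line gives the "up to 11.5 times slower" figure quoted in the Summary.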

Reliability/Burn-In Testing

The benchmark can also be used in reliability testing mode, as with the single precision version, but the output layout is slightly different. The following results were from a 5 minute run using a BAT file with the command “CudaMFLOPS1DP Mins 5 FC”, where FC selects the alternative Shared Memory test. Temperature and fan speeds were monitored during the test using VTune. This showed 48°C at the start, increasing to 61°C after half a minute, 65°C after one minute and 71°C after 4 minutes. Then the fan kicked in, increasing in speed from 35% of maximum to 40%. Throughout, CPU utilisation was 25%, or 100% of one CPU. The same test under single precision was 6% faster, with maximum temperature 1°C higher.

CUDA Double Precision MFLOPS Benchmark 1.2 Tue Jun 08 12:55:26 2010

CUDA devices found
Device 0: GeForce GTS 250 with 16 Processors 128 cores

Using 256 Threads

Shared Memory Reliability Test 5 minutes, report every 15 seconds

Repeat CUDA 219 times at 0.42 seconds. Repeat former 35 times
Tests - 10000000 8 Byte Words, 32 Operations Per Word, 7665 Repeat Passes

Results of all calculations should be - 0.9999988079071044

Test  Seconds   MFLOPS  Errors      First Word  Value

   1   14.799   165736  None found
   2   14.793   165810  None found
   3   14.795   165781  None found
   4   14.797   165763  None found
   5   14.798   165758  None found
   6   14.793   165810  None found
   7   14.798   165755  None found
   8   14.796   165770  None found
   9   14.792   165822  None found
  10   14.795   165781  None found
  11   14.793   165804  None found
  12   14.794   165800  None found
  13   14.793   165808  None found
  14   14.794   165800  None found
  15   14.795   165787  None found
  16   14.794   165793  None found
  17   14.794   165799  None found
  18   14.793   165810  None found
  19   14.819   165521  None found
  20   14.798   165755  None found

Hardware Information
 CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
 AMD Phenom(tm) II X4 945 Processor Measured 3000 MHz
 Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows Information
 Intel processor architecture, 4 CPUs
 Windows NT Version 6.1, build 7600,
 Memory 4096 MB, Free 4096 MB
 User Virtual Space 4096 MB, Free 4050 MB
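
The log reports an error count together with the first wrong word and its value. A check of that kind might look like the following Python sketch. This is an assumption about the general approach, not the benchmark's actual code, and the function name is hypothetical.

```python
def check_results(values, expected):
    """Return (error_count, first_bad_index, first_bad_value); the last two are None if clean."""
    errors = 0
    first_index = None
    first_value = None
    for i, v in enumerate(values):
        if v != expected:
            errors += 1
            if first_index is None:
                first_index, first_value = i, v
    return errors, first_index, first_value


if __name__ == "__main__":
    expected = 0.9999988079071044   # value quoted in the log above
    words = [expected] * 8          # stand-in for the data copied back from the GPU
    errors, index, value = check_results(words, expected)
    print("None found" if errors == 0 else f"{errors} errors, first at word {index}: {value}")
```

An exact comparison is used because every word goes through identical arithmetic, so any difference from the expected value indicates a fault rather than normal rounding.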

Roy Longbottom October 2014

The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection
