CUDA GPU Benchmarks Compiled For x64 and CUDA Toolkit 3.1

Contents

 32 Bit Version
 64 Bit Version
 VCVARSAMD64.BAT
 VSVARS32.BAT
 X64 Compiler Path
 MSVCR90.DLL
 Windows Driver Kit
 CUDA Toolkit 3.1
 32 Bit 3.1 Version
 64 Bit 3.1 Version
 Common Code and Errors
 Comparisons GTS 250
 Comparisons GTX 480
 Comparisons GTX 680

General

I had no real problems compiling my CudaMFLOPS1 benchmark to run on a 32 bit PC. For details and results see CUDA1.htm for the single precision calculations and CUDA2.htm for those using double precision. On the other hand, compiling a version that would run at 64 bits appeared to be impossible using my usual procedures.

The benchmarks that I write are free to download and run. So I am inclined to use free Microsoft compilers, some of which do not include a compatible Visual Studio. This leads to my normal procedure of using Command Prompt to compile and link programs.

The benchmarks have now been ported to 32-bit and 64-bit versions of Ubuntu Linux. Details and results are provided in linux_cuda_mflops.htm.

See GigaFLOPS Benchmarks.htm for further details and results, including comparisons with MP MFLOPS, a threaded C version, OpenMP MFLOPS, and Qpar MFLOPS, where Qpar is Microsoft’s proprietary equivalent of OpenMP and is faster under Windows. The benchmarks and source codes can be obtained via gigaflops-benchmarks.zip.


32 Bit Version

The starting point was Microsoft Visual C++ 2008 Express which initially executes vcvars32.bat to set up paths. In my case, an example of the command used in a BAT file (with key variables VSINSTALLDIR, VCINSTALLDIR and WindowsSdkDir) is:

 @SET VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio 9.0
 @SET VCINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC

 cmd.exe /k C:\"Program Files (x86)"\"Microsoft Visual Studio 9.0"\VC\bin\vcvars32.bat

The next step is to go to the CUDA folder and execute CUDA.bat to set paths as follows, where CUDA is installed on D:

 Set PATH=D:\MSCompile\CUDA\bin;%PATH%
 Set INCLUDE=D:\MSCompile\CUDA\include;%INCLUDE%
 Set INCLUDE=D:\MSCompile\include;%INCLUDE%
 Set INCLUDE=D:\MSCompile\CUDA\SDK\common\inc;%INCLUDE%
 Set LIB=D:\MSCompile\CUDA\lib;%LIB%
 Set LIB=D:\MSCompile\CUDA\SDK\common\lib;%LIB%

The compiler used, compile command (in Compile.bat) and link command (in Linkit.bat) were as follows.

 Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86

 nvcc -arch sm_10 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" 
      -Xcompiler  "/EHsc /W3 /nologo /O2 /Zi /MT" -c CudaMFLOPS1.cu

 link /LARGEADDRESSAWARE /NODEFAULTLIB:libc.lib /NODEFAULTLIB:uuid.lib CUDA.LIB
      CUDART.LIB CUTIL32.LIB CudaMFLOPS1.obj asmtime.obj CPUasm.obj


64 Bit Version

C++ 2008 Express does not include a 64 bit compiler, but one is available in the free Windows Software Development Kit (SDK) for Windows Server 2008 and .NET Framework 3.5. This is kicked off using vcvars64.bat that mainly sets paths to amd64 or x64 folders. Exceptions are VSINSTALLDIR and VCINSTALLDIR, which are the same as those for the 32 bit version, and common folders such as %VSINSTALLDIR%\Common7\Tools.

VCVARSAMD64.BAT

Below are details of the compiler used and the first error message produced on executing the appropriate compile.bat. The error is overcome by copying vcvars64.bat to the folder containing the x64 compiler and renaming it as vcvarsamd64.bat. If vcvars64.bat is still used on starting, much of the content of vcvarsamd64.bat can be deleted. The alternative is to point to vcvarsamd64.bat on starting.

 Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64

 nvcc fatal:
 Visual Studio configuration file '(null)' could not be found for installation 
 at 'C:\Program Files (x86)\Microsoft Visual Studio 9.0/VC/bin/..'


VSVARS32.BAT

The second error message is shown below. Experiments with the 32 bit benchmark showed that the CUDA compiler examines the file vsvars32.bat for the required data shown below, particularly the year. In this case the file is in folder Microsoft Visual Studio 9.0\Common7\Tools. The solution was to copy the Common7 folder to Microsoft Visual Studio 9.0\VC, with vsvars32.bat containing only the required data.

 nvcc fatal:
 nvcc cannot find a supported cl version. Only MSVC 8.0 and MSVC 9.0 are supported

 Required data
 @echo Setting environment for using Microsoft Visual Studio 2008 x86 tools

X64 Compiler Path

On attempting the next compilation, the third error report shown below was produced. This provides a clue as to the reason for the first two errors. The CUDA compiler makes false assumptions about the location of the required Microsoft C++ compiler, declared in paths or the nvcc command line, by stepping back two folders from the declared location. Using a path statement pointing to the 32 bit CL.exe compiles successfully, but at 32 bits. This was fixed in the nvcc command by pointing -ccbin to the 32 bit \bin folder and not \bin\amd64, giving the same compile statement as for 32 bits, as shown above.

 32 bit CL compiler is in:
    C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin
 64 bit CL compiler is in:
    C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\amd64 

  Specifying folder containing 64 bit compiler
  nvcc fatal: 
  Visual Studio configuration file '(null)' could not be found for installation
  at 'C:/Program Files (x86)/Microsoft Visual Studio 9.0/VC/bin/amd64/../..'

 C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64 /../.. 
 Steps backwards to:
 C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC>
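
The stepping back seen in the error messages can be reproduced with a short Python sketch. It only illustrates the path arithmetic and makes no claim about how nvcc works internally. The function name resolved_install_dir is hypothetical, and the two folder names are those quoted above.

```python
import ntpath  # Windows path rules, usable on any platform

# -ccbin folders quoted in the text above.
CCBIN_32 = r"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"
CCBIN_64 = r"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\amd64"


def resolved_install_dir(ccbin, steps_back):
    """Append one '..' per step to ccbin and collapse the result."""
    return ntpath.normpath(ntpath.join(ccbin, *([".."] * steps_back)))


# One step back from \VC\bin, as in the first error message.
print(resolved_install_dir(CCBIN_32, 1))
# Two steps back from \VC\bin\amd64, as in the third error message.
print(resolved_install_dir(CCBIN_64, 2))
```

Both calls collapse to the same \VC folder, which fits the observation that the same configuration files were being looked for in each case.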


MSVCR90.DLL

The benchmark then compiled, linked and ran properly. However, a copied or renamed version of the EXE file failed to run, with a message box indicating "The program can’t start because MSVCR90.DLL is missing". An x86 version of this DLL is in a VS 2008 \redist folder but there is not one for amd64 (x64). There are 64 bit versions in a Windows\winsxs system folder, but copying one to the same folder as the EXE file produced a different error.

Windows Driver Kit

I also have the free Windows Driver Kit Version 7.0.0 that includes the version of the 64 bit compiler shown below. With this, MSVCR90.DLL is in the same folder as the x64 CL.exe. This software is installed in D:\WinDDK, where there is a Common7 folder and \Tools has vcvars.bat. As with the above, \Common7 was copied to \bin and the same minimal vsvars32.bat included. Vcvars.bat was renamed as vcvarsamd64.bat, copied to the CL.exe folder and its paths pointed to amd64 items.

This software does not have the standard \include and \lib folders, but sub-folders in \inc and \lib. The paths were found on searching for missing functions reported by the compiler. Anyway, the compiled programs can be renamed and copied to other PCs, so far, with no complications. VCVARSAMD64.bat changes are shown below, along with the compile command, again pointing one step back from the x64 CL.exe.

 Microsoft (R) C/C++ Optimizing Compiler Version 15.00.30729.207 for x64

 Extra Path for compiler
 %VCINSTALLDIR%\BIN\x86\amd64;

 Extra Includes
 %VCINSTALLDIR%\ATLMFC\INCLUDE;
 %VCINSTALLDIR%\INC\CRT;
 %VCINSTALLDIR%\INC\API;
 %VCINSTALLDIR%\INC\API\CRT\STL70;%INCLUDE%

 Extra Libraries
 %VCINSTALLDIR%\ATLMFC\LIB\amd64;
 %VCINSTALLDIR%\LIB\CRT\amd64;
 %VCINSTALLDIR%\LIB\WLH\amd64;%LIB%

 Compile Command
 nvcc -arch sm_10 -ccbin "D:\WinDDK\bin\x86"  -Xcompiler  "/EHsc 
 /nologo /O2 /Zi /MD" -c CudaMFLOPS1.cu


CUDA Toolkit 3.1 and Driver

The first benchmarks were produced using CUDA Toolkit 2.3 (Cuda compilation tools, release 2.3, V0.2.1221). Then 3.1 (Cuda compilation tools, release 3.1, V0.2.1221) became available, where there is a possibility of faster double precision speeds on newer graphics cards. This was downloaded but compiled benchmarks did not work until compatible desktop and laptop drivers were installed (Developer Drivers for WinVista and Win7 - 257.21). Unlike the earlier version, the new one sets paths in Windows Environment Variables, making it difficult to develop 32 bit and 64 bit applications on the same PC. Anyway, it appears that 64 bit applications cannot be compiled on a 32 bit PC.

No CUDA SDK files were included in either the new 32 bit or 64 bit packages, so paths to version 2.3 SDK files were used. For both sets of compilations, VCVARS and/or VSVARS BAT files defined path, library and include statements that are not required for this sort of calculation, including some for Windows SDK and Framework.


32 Bit 3.1 Version

The 32 bit single and double precision benchmarks were compiled on a 32 bit laptop and ran without any real difficulty using Visual C++ 2008 Express (in this case from D:\MSCompile). BAT files executed are shown below, the first being VCVARS32.bat, but all this did was call VSVARS32.bat using the System Environment Variable "%VS90COMNTOOLS%vsvars32.bat". The latter needed to access WindowsSdkDir for files such as windows.h and some library files. As before, nvcc -ccbin points to the location of the Microsoft CL C++ compiler (\VC\bin) where VSVARS32.bat also resides.

 32 Bit BAT Files

 VSVARS32.BAT (Main Part)

 @echo Setting environment for using Microsoft Visual Studio 2008 x86 tools.
 @call :GetWindowsSdkDir
 set "INCLUDE=%WindowsSdkDir%include;%INCLUDE%"
 set "LIB=%WindowsSdkDir%lib;%LIB%"
 @set PATH=D:\MSCompile\MSVC8\Common7\IDE;D:\MSCompile\MSVC8\VC\BIN;%PATH%
 @set INCLUDE=D:\MSCompile\MSVC8\VC\INCLUDE;%INCLUDE%
 @set LIB=D:\MSCompile\MSVC8\VC\LIB;%LIB%
 @goto end

 :GetWindowsSdkDir
 Code to Get WindowsSdkDir
 :end
 
 CUDA32.BAT

 Set INCLUDE=D:\MSCompile\CUDA32\include;%INCLUDE%
 Set INCLUDE=D:\MSCompile\CUDA\SDK\common\inc;%INCLUDE%
 Set LIB=D:\MSCompile\CUDA32\lib;%LIB%
 Set LIB=D:\MSCompile\CUDA\SDK\common\lib;%LIB%

 Compiler Used, COMPILE.BAT, LINKIT.BAT

 Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.207 for 80x86
 nvcc -arch sm_10 -ccbin D:\MSCompile\MSVC8\VC\bin -Xcompiler 
     "/EHsc /W3 /nologo /O2 /Zi /MT" -c CudaMFLOPS1.cu
 link /LARGEADDRESSAWARE /NODEFAULTLIB:libc.lib /NODEFAULTLIB:uuid.lib CUDA.LIB
     CUDART.LIB CUTIL32.LIB CudaMFLOPS1.obj asmtime.obj CPUasm.obj


64 Bit 3.1 Version

The 64 bit versions were again compiled using Windows Driver Kit Version 7.0.0, but preventing the Visual Studio configuration file (null) compilation failure message, shown above, proved to be more difficult. The set up starts by specifying the same paths, includes and libraries as the earlier WDK example, although not all were really needed. The problem was associated with the required VSVARS32.bat and VCVARSAMD64.bat files. Compiling again with -ccbin "D:\WinDDK\bin\x86" picked up a VSVARS file (not from the same Common7 folder as above, but the original one) but not VCVARSAMD64.bat. The solution was to point -ccbin to the Visual C++ 2008 Express 32 bit compiler folder, where both BAT files were used to confirm the Visual Studio version. So this -ccbin does not identify the C++ compiler being used.

 64 Bit BAT Files

 VSVARS64.BAT (Main Part)

 @set VCINSTALLDIR=D:\WinDDK
 @set PATH=%VCINSTALLDIR%\BIN\x86\amd64;%PATH%
 @set INCLUDE=%VCINSTALLDIR%\INC\CRT;%VCINSTALLDIR%\INC\API;
      %VCINSTALLDIR%\INC\API\CRT\STL70;%INCLUDE%
 @set LIB=%VCINSTALLDIR%\LIB\CRT\amd64;%VCINSTALLDIR%\LIB\WLH\amd64;%LIB%

 CUDA64.BAT

 Set INCLUDE=D:\MSCompile\CUDA64b\include;%INCLUDE%
 Set INCLUDE=D:\MSCompile\CUDA64\SDK\C\common\inc;%INCLUDE%
 Set LIB=D:\MSCompile\CUDA64b\lib64;%LIB%
 Set LIB=D:\MSCompile\CUDA64\SDK\C\common\lib;%LIB%

 Compiler Used, COMPILE.BAT, LINKIT.BAT

 Microsoft (R) C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
 nvcc -arch sm_10 -ccbin "D:\MSCompile\MSVC8\VC\bin"  -Xcompiler  
     "/EHsc  /nologo /O2 /Zi /MD" -c CudaMFLOPS1.cu
 link /LARGEADDRESSAWARE  /NODEFAULTLIB:libc.lib /NODEFAULTLIB:uuid.lib CUDA.LIB 
     CUDART.LIB  CUTIL64.LIB CudaMFLOPS1.obj asmtime.obj CPUasm.obj

 VSVARS32.BAT From D:\MSCompile\MSVC8\Common7\Tools

 @echo Setting environment for using Microsoft Visual Studio 2008 x86 tools.

 VCVARSAMD64.BAT From D:\MSCompile\MSVC8\VC\bin\amd64
                    Not from D:\WinDDK\bin\x86\amd64

 @echo Setting environment for using Microsoft Visual Studio 2008 x64 tools.
    set "INCLUDE=%WindowsSdkDir%include;%INCLUDE%"
or  set "INCLUDE=D:\MSCompile\SDK2008\Include;%INCLUDE%"


Common Code and Errors

Originally, there were two versions of the source code, one with variables declared for single precision float numbers and the other for double precision. Now there is one program with variable parameters defined at the start. As the output format is identical, single precision numbers are shown with more digits than needed. The version run is now identified by one of the eight following titles.

The eight EXE benchmark files, supporting DLLs and source code can be downloaded from CudaMflops.zip. Besides the original configuration information, details of graphics RAM use are now provided, along with maximum threads, as shown below. This example is for using up to 10 million 8 byte words, where 148 MB is used.

The benchmarks can be executed using run time parameters (see ZIP file) and memory demands might exceed the maximum possible. The number of blocks is limited to 65535. With 256 threads, the maximum number of words is 16,776,960 (65535 x 256), or 134 MB using double precision. This is increased to 268 MB using 512 threads. The program now checks for these demands and reduces the number of words used to avoid run time errors.

If there is insufficient video RAM (or other issues), the program can fail with no error message provided in the results log file. To see any error message, the Command Prompt window needs to be open. This can be achieved as shown at the end of this section. This command might also be needed when attempting to run the CUDA 3 versions on a system using a CUDA 2 driver.
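
The word and memory limits quoted above can be checked with a few lines of Python. This is a sketch of the arithmetic only, not the benchmark's own code. It assumes one word is handled per thread and that MB means 1,000,000 bytes, which is what reproduces the 134 MB and 268 MB figures. The names max_words and clamp_words are hypothetical.

```python
MAX_BLOCKS = 65535  # the 65535 limit quoted in the text
BYTES_PER_WORD = {"single": 4, "double": 8}


def max_words(threads_per_block):
    """Largest word count for one launch, assuming one word per thread."""
    return MAX_BLOCKS * threads_per_block


def clamp_words(requested, threads_per_block):
    """Reduce a requested word count to the limit (assumed behaviour)."""
    return min(requested, max_words(threads_per_block))


for threads in (256, 512):
    words = max_words(threads)
    megabytes = words * BYTES_PER_WORD["double"] / 1e6  # MB taken as 10^6 bytes
    print(f"{threads} threads: {words:,} words, {megabytes:.0f} MB double precision")
```

With 256 threads a request for 20 million words would be cut back to 16,776,960 by clamp_words, while the 10 million word example fits without change.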


  Output Heading                                          Benchmark File

  CUDA 2.3 x86 Single Precision MFLOPS Benchmark 1.3      CUDA2MFLOPS-x86SP.exe
  CUDA 2.3 x86 Double Precision MFLOPS Benchmark 1.3      CUDA2MFLOPS-x86DP.exe
  CUDA 2.3 x64 Single Precision MFLOPS Benchmark 1.3      CUDA2MFLOPS-x64SP.exe
  CUDA 2.3 x64 Double Precision MFLOPS Benchmark 1.3      CUDA2MFLOPS-x64DP.exe
  CUDA 3.1 x86 Single Precision MFLOPS Benchmark 1.3      CUDA3MFLOPS-x86SP.exe
  CUDA 3.1 x86 Double Precision MFLOPS Benchmark 1.3      CUDA3MFLOPS-x86DP.exe
  CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.3      CUDA3MFLOPS-x64SP.exe
  CUDA 3.1 x64 Double Precision MFLOPS Benchmark 1.3      CUDA3MFLOPS-x64DP.exe

  Common Configuration Details

  CUDA devices found 
  Device 0: GeForce GTS 250  with 16 Processors 128 cores 
  Global Memory 982 MB, Shared Memory/Block 16384 B, Max Threads/Block 512

  Using 256 Threads

 Hardware  Information
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
  AMD Phenom(tm) II X4 945 Processor Measured 3000 MHz
  Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow, 
 Windows  Information
  AMD64 processor architecture, 4 CPUs 
  Windows NT  Version 6.1, build 7600, 
  Memory 7936 MB, Free 6447 MB
  User Virtual Space 8388608 MB, Free 8388555 MB

  982 MB Graphics RAM, Used 112 Minimum, 148 Maximum

  Example BAT file command to run and keep Command Prompt window open:

  cmd.exe /k Cuda3MFLOPS-x86SP


Comparisons GTS 250

Following are speeds in Millions of Floating Point Operations Per Second (MFLOPS) for the eight benchmark varieties, running on the system identified above. This shows that, at least on this set up, CUDA Toolkit 3.1 produces the same speeds as CUDA 2.3. As should probably be expected, compilations to use 64 bit PC instructions produce the same speeds as 32 bit versions. Also, double precision calculations using the GeForce GPU are much slower than single precision arithmetic.


                              Single Precision MFLOPS     Double Precision MFLOPS
 Test     100K x Words x      2.3    2.3    3.1    3.1    2.3    2.3    3.1    3.1
            Ops x Passes      32b    64b    32b    64b    32b    64b    32b    64b

 Data in & out  1x2x2500      339    373    349    347    196    199    193    201
 Data out only  1x2x2500      641    695    651    751    328    349    320    352
 Calculate only 1x2x2500     3073   3173   3120   2990    912    887    851    960

 Data in & out  10x2x250      638    580    638    605    250    242    252    255
 Data out only  10x2x250     1194   1026   1183   1118    406    387    403    393
 Calculate only 10x2x250     9927   9994   9949   9989   1109   1113   1108   1109

 Data in & out  100x2x25      737    615    736    680    270    261    268    255
 Data out only  100x2x25     1295   1154   1329   1248    421    409    415    407
 Calculate only 100x2x25    13102  13209  12821  12881   1139   1138   1136   1127

 Data in & out  1x8x2500     1417   1380   1366   1331    758    814    748    792
 Data out only  1x8x2500     2594   2799   2582   2955   1270   1279   1195   1380
 Calculate only 1x8x2500    12312  12611  12057  11685   3584   3555   3440   3892

 Data in & out  10x8x250     2561   2254   2553   2428   1031    988   1038   1057
 Data out only  10x8x250     4782   4020   4713   4562   1590   1578   1699   1692
 Calculate only 10x8x250    39843  38850  36285  38792   4412   4450   4381   4429

 Data in & out  100x8x25     2859   2514   2717   2856   1128   1037   1066   1075
 Data out only  100x8x25     5410   4868   5324   5144   1708   1601   1671   1758
 Calculate only 100x8x25    52899  52811  52665  51550   4539   4554   4557   4562

 Data in & out  1x32x2500    5496   5326   5376   5895   3032   3103   2910   3306
 Data out only  1x32x2500    9976  11418  10036  10687   4995   5275   4553   5496
 Calculate only 1x32x2500   39604  37732  36975  38843  14574  15198  13239  15087

 Data in & out  10x32x250    9808   8923   9690   9152   4147   3963   4050   4040
 Data out only  10x32x250   17716  16204  17385  16849   6670   6368   6673   6770
 Calculate only 10x32x250  109969 111397 106807 108303  18013  18054  17713  18091

 Data in & out  100x32x25   11349   9904  10751  10792   4635   4167   4433   4451
 Data out only  100x32x25   20581  17870  20083  19033   6946   6510   6973   7102
 Calculate only 100x32x25  133174 133218 135006 135130  18453  18513  18486  18495

 Extra tests - loop in main CUDA Function

 Calculate      100x2x25    10023  10046  10037  10044    961    963    972    965
 Shared Memory  100x2x25    54077  54021  54101  54062  37309  37396  37201  37286

 Calculate      100x8x25    39928  40268  40304  40262   3916   3770   3902   3761
 Shared Memory  100x8x25   119566 119596 119556 119569 106947 106940 106842 106938

 Calculate      100x32x25  158224 158239 158067 158195  15258  15205  15597  15079
 Shared Memory  100x32x25  173284 172907 173118 172911 163273 163811 163140 163780
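
The test labels in the table above can be unpacked with the following Python sketch. It assumes a label such as 10x8x250 means 10 x 100K words, 8 operations per word and 250 passes, as the column heading suggests, and that MFLOPS is total operations divided by run time in seconds and by one million. The function names and the run time used are hypothetical, not taken from the benchmark.

```python
WORD_UNIT = 100_000  # the "100K" in the table heading


def total_operations(test):
    """Parse a label such as '10x8x250' into words x ops x passes."""
    units, ops_per_word, passes = (int(part) for part in test.split("x"))
    return units * WORD_UNIT * ops_per_word * passes


def mflops(test, seconds):
    """Millions of floating point operations per second for one test."""
    return total_operations(test) / seconds / 1e6


for label in ("1x2x2500", "10x2x250", "100x2x25", "100x8x25", "100x32x25"):
    print(f"{label:10s} {total_operations(label):>14,} operations")

# A hypothetical run time of 0.05 seconds, not a measured one.
print(f"{mflops('100x2x25', 0.05):.0f} MFLOPS")
```

On this reading, the three word counts at a given number of operations per word all execute the same total number of operations, so differences in MFLOPS between them reflect the cost of passes and data transfers rather than the amount of arithmetic.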


Comparisons GTX 480

Following are results from tests on the 2010 top end GeForce GTX 480 graphics card, where it was thought that CUDA Toolkit 3.1 might produce faster double precision speeds, but this is not the case. Comparisons of GTS 250 and GTX 480 are available in CUDA2.htm. The GTX 480 is clearly much faster except for double precision calculations from data in shared memory. Here, the GTS single and double precision calculations run at similar execution speeds whereas GTX double precision is much slower. At least, the single precision shared memory speed is demonstrated as being up to 0.77 TeraFLOPS.


                              Single Precision MFLOPS     Double Precision MFLOPS
 Test     100K x Words x      2.3    2.3    3.1    3.1    2.3    2.3    3.1    3.1
            Ops x Passes      32b    64b    32b    64b    32b    64b    32b    64b

 Data in & out  1x2x2500      521    521    520    511    297    300    298    299
 Data out only  1x2x2500      973    965    960    936    575    581    583    557
 Calculate only 1x2x2500     5743   5554   5574   4084   5263   5055   5196   4875

 Data in & out  10x2x250      823    819    826    832    449    463    452    455
 Data out only  10x2x250     1694   1643   1673   1704    881    944    903    922
 Calculate only 10x2x250    21767  21493  21609  18791  14053  14168  14247  14072

 Data in & out  100x2x25      987   1014    989    991    505    504    504    505
 Data out only  100x2x25     1911   1979   1903   1934    964    983    967    973
 Calculate only 100x2x25    32286  31991  32458  31348  18158  17967  18143  18085

 Data in & out  1x8x2500     2070   2058   2064   1939   1174   1170   1158   1162
 Data out only  1x8x2500     3750   3749   3783   3511   2307   2209   2231   2178
 Calculate only 1x8x2500    22002  20129  16843  14109  15362  13805  17585  16481

 Data in & out  10x8x250     3286   3306   3212   3305   1800   1833   1799   1823
 Data out only  10x8x250     6755   6792   6589   6784   3626   3710   3545   3707
 Calculate only 10x8x250    85074  82132  75085  79136  56128  54356  55712  53651

 Data in & out  100x8x25     3968   4057   3875   3970   2016   2062   1980   2059
 Data out only  100x8x25     7635   7914   7451   7762   3818   4004   3869   4011
 Calculate only 100x8x25   128542 125413 125393 122580  71999  70789  71939  70715

 Data in & out  1x32x2500    7976   7768   7398   7181   4461   4427   4451   4330
 Data out only  1x32x2500   14380  13597  13223  12760   8210   8017   8060   7771
 Calculate only 1x32x2500   63574  52230  46278  37085  43689  37802  41026  35705

 Data in & out  10x32x250   12703  13190  12857  13191   6913   7072   6830   6913
 Data out only  10x32x250   25752  27027  26294  26278  12912  13597  13242  13692
 Calculate only 10x32x250  268179 254306 271978 212808  98855  95247  97562  93702

 Data in & out  100x32x25   15485  16077  15507  15775   7751   7825   7742   7896
 Data out only  100x32x25   29608  31292  29547  30816  14282  14511  14294  14766
 Calculate only 100x32x25  440451 425237 435768 414020 113855 112129 113645 112501

 Extra tests - loop in main CUDA Function

 Calculate      100x2x25   100937  80817 100841  80702  36216  50998  36360  50860
 Shared Memory  100x2x25   180734 142817 180472 142312  81214  80679  81171  80755

 Calculate      100x8x25   322696 262490 322354 262176 108830 103164 108797 103214
 Shared Memory  100x8x25   412376 389187 411974 386225 111212 110139 111199 110153
 
 Calculate      100x32x25  653139 586440 648695 585688 121689 121401 121684 121398
 Shared Memory  100x32x25  769658 710336 767771 709577 122088 121897 122077 121878
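
The difference in shared memory behaviour between the two cards can be put into numbers with this small Python sketch. The values are copied from the Shared Memory 100x32x25 rows, CUDA 2.3 32 bit columns, of the GTS 250 and GTX 480 tables. The function name sp_to_dp_ratio is hypothetical.

```python
# Shared Memory 100x32x25 results in MFLOPS, CUDA 2.3 32 bit columns,
# copied from the GTS 250 and GTX 480 tables.
SHARED_MEMORY_MFLOPS = {
    "GTS 250": {"single": 173284, "double": 163273},
    "GTX 480": {"single": 769658, "double": 122088},
}


def sp_to_dp_ratio(card):
    """Single precision speed divided by double precision speed."""
    result = SHARED_MEMORY_MFLOPS[card]
    return result["single"] / result["double"]


for card, result in SHARED_MEMORY_MFLOPS.items():
    teraflops = result["single"] / 1e6  # MFLOPS to TeraFLOPS
    print(f"{card}: SP/DP ratio {sp_to_dp_ratio(card):.2f}, "
          f"single precision {teraflops:.2f} TeraFLOPS")
```

The ratio is close to 1 for the GTS 250 and above 6 for the GTX 480, in line with the comments at the start of this section.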


Comparisons GTX 680

Following are results from tests on the 2012 top end GeForce GTX 680 graphics card.


                            Single Precision MFLOPS        Double Precision MFLOPS
 Test          Wds x Ops    2.3 SP  2.3 SP  3.1 SP  3.1 SP 2.3 DP 2.3 DP 3.1 DP 3.1 DP
                x Passes       32b     64b     32b     64b    32b    64b    32b    64b

 Data in & out  1x2x2500       649     445     622     517    336    373    354    353
 Data out only  1x2x2500      1168    1231    1241    1291    723    736    667    712
 Calculate only 1x2x2500      5174    7385    7310    7099   6320   6499   4126   6140

 Data in & out  10x2x250       923     959     931     949    479    496    479    489
 Data out only  10x2x250      1954    2050    1960    2040   1009   1042   1000   1049
 Calculate only 10x2x250     21739   26138   21914   25299  16382  16594  16143  16226

 Data in & out  100x2x25      1005    1035     999    1036    504    515    513    516
 Data out only  100x2x25      2025    2077    2005    2083   1034   1039   1040   1052
 Calculate only 100x2x25     37817   36790   36175   36475  19799  19806  19876  19882

 Data in & out  1x8x2500      2437    2715    2445    2633   1458   1484   1415   1430
 Data out only  1x8x2500      4287    5217    4280    4870   2858   2984   2724   2906
 Calculate only 1x8x2500     25047   26963   25502   27261  22803  24113  15501  22388

 Data in & out  10x8x250      3720    3789    3722    3811   1914   1965   1901   1949
 Data out only  10x8x250      7895    8159    7694    8143   4072   4201   4058   4143
 Calculate only 10x8x250     86859  101933   85455   97589  61628  61509  60592  60056

 Data in & out  100x8x25      4042    4133    4015    4043   2044   2087   2061   2052
 Data out only  100x8x25      7779    8405    8177    8305   4106   4232   4055   4130
 Calculate only 100x8x25    144468  144451  140800  144575  76075  75718  75558  74333

 Data in & out  1x32x2500    10208   10697   10080   10460   5374   5503   5396   5485
 Data out only  1x32x2500    19145   19655   18672   19259   9645  10180   9978  10053
 Calculate only 1x32x2500    91276   90090   83989   58833  48673  51722  43334  47974

 Data in & out  10x32x250    14653   15031   14629   15096   7364   7471   7269   7440
 Data out only  10x32x250    30683   31764   30977   32371  14781  15098  14810  14977
 Calculate only 10x32x250   305847  353992  347405  333000  96653  94686  96044  92231

 Data in & out  100x32x25    16292   16150   15995   16215   7903   7985   7739   7765
 Data out only  100x32x25    31667   33556   31586   32983  14740  15157  14956  15037
 Calculate only 100x32x25   516621  530827  519153  527122 105885 102998 105419 101771

 Extra tests - loop in main CUDA Function

 Calculate      100x2x25    127808  110609  126843  111041  50394  47530  50749  44017
 Shared Memory  100x2x25    237600  227664  236915  243601  73545  72991  73186  72369

 Calculate      100x8x25    436349  350130  470867  349289  89325  86644  89172  83696
 Shared Memory  100x8x25    966567  720773  969416  721272 100983 100384 100825 100254

 Calculate      100x32x25  1092531  949533 1154649  950255 109886 106505 109800 105142
 Shared Memory  100x32x25  1842112 1793478 1714512 1746493 111408 111044 111425 110982


Roy Longbottom at Linkedin  Roy Longbottom July 2014



The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection