CUDA GPU Benchmarks Compiled For x64
and CUDA Toolkit 3.1
General
I had no real problems compiling my CUDAMFLOPS1 benchmark to run on a 32 bit PC. For details and results see CUDA1.htm for the single precision calculations and CUDA2.htm for those using double precision. On the other hand, compiling successfully for 64 bit working appeared to be impossible using my usual procedures.
The benchmarks that I write are free to download and run. So I am inclined to use free Microsoft compilers, some of which do not include a compatible Visual Studio. This leads to my normal procedure of using Command Prompt to compile and link programs.
The benchmarks have now been ported to 32-bit and 64-bit versions of Ubuntu Linux. Details and results are provided in linux_cuda_mflops.htm.
See GigaFLOPS Benchmarks.htm for further details and results, including comparisons with MP MFLOPS (a threaded C version), OpenMP MFLOPS and Qpar MFLOPS, where Qpar is Microsoft’s proprietary equivalent of OpenMP and is faster under Windows. The benchmarks and source codes can be obtained via gigaflops-benchmarks.zip.
32 Bit Version
The starting point was Microsoft Visual C++ 2008 Express which initially executes vcvars32.bat to set up paths. In my case, an example of the command used in a BAT file (with key variables VSINSTALLDIR, VCINSTALLDIR and WindowsSdkDir) is:
@SET VSINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio 9.0
@SET VCINSTALLDIR=C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC
cmd.exe /k C:\"Program Files (x86)"\"Microsoft Visual Studio 9.0"\VC\bin\vcvars32.bat
The next step is to go to the CUDA folder and execute CUDA.bat to set paths as follows, where CUDA is installed on D:
Set PATH=D:\MSCompile\CUDA\bin;%PATH%
Set INCLUDE=D:\MSCompile\CUDA\include;%INCLUDE%
Set INCLUDE=D:\MSCompile\include;%INCLUDE%
Set INCLUDE=D:\MSCompile\CUDA\SDK\common\inc;%INCLUDE%
Set LIB=D:\MSCompile\CUDA\lib;%LIB%
Set LIB=D:\MSCompile\CUDA\SDK\common\lib;%LIB%
The compiler used, compile command (in Compile.bat) and link command (in Linkit.bat) were as follows.
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.01 for 80x86
nvcc -arch sm_10 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin"
-Xcompiler "/EHsc /W3 /nologo /O2 /Zi /MT" -c CudaMFLOPS1.cu
link /LARGEADDRESSAWARE /NODEFAULTLIB:libc.lib /NODEFAULTLIB:uuid.lib CUDA.LIB
CUDART.LIB CUTIL32.LIB CudaMFLOPS1.obj asmtime.obj CPUasm.obj
64 Bit Version
C++ 2008 Express does not include a 64 bit compiler, but one is available in the free Windows Software Development Kit (SDK) for Windows Server 2008 and .NET Framework 3.5. This is kicked off using vcvars64.bat that mainly sets paths to amd64 or x64 folders. Exceptions are VSINSTALLDIR and VCINSTALLDIR, which are the same as those for the 32 bit version, and common folders such as %VSINSTALLDIR%\Common7\Tools.
VCVARSAMD64.BAT
Below are details of the compiler used and the first error message on executing the appropriate compile.bat.
This is overcome by copying vcvars64.bat to the folder containing the x64 compiler and renaming it as vcvarsamd64.bat. If vcvars64.bat is still used on starting, much of the content of vcvarsamd64.bat can be deleted. The alternative is to point to vcvarsamd64.bat on starting.
Microsoft (R) C/C++ Optimizing Compiler Version 15.00.21022.08 for x64
nvcc fatal:
Visual Studio configuration file '(null)' could not be found for installation
at 'C:\Program Files (x86)\Microsoft Visual Studio 9.0/VC/bin/..'
VSVARS32.BAT
The second error message is shown below. Experiments with the 32 bit benchmark showed that the CUDA compiler examines file vsvars32.bat for the required data shown below, particularly the year. In this case the file is in folder Microsoft Visual Studio 9.0\Common7\Tools.
The solution, to avoid the error, was to copy the Common7 folder to Microsoft Visual Studio 9.0\VC with vsvars32.bat containing only the required data.
nvcc fatal:
nvcc cannot find a supported cl version. Only MSVC 8.0 and MSVC 9.0 are supported
Required data
@echo Setting environment for using Microsoft Visual Studio 2008 x86 tools
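The workaround can also be scripted. The following is a minimal Python sketch, not the procedure actually used here: it writes a vsvars32.bat holding only the required line into a Common7\Tools folder below a given VC folder. The function name and the scratch directory in the demonstration are illustrative assumptions; only the echo line comes from the listing above.

```python
import tempfile
from pathlib import Path

# The single line that nvcc appears to need (copied from the listing above).
REQUIRED_LINE = "@echo Setting environment for using Microsoft Visual Studio 2008 x86 tools"


def write_minimal_vsvars32(vc_dir):
    """Create <vc_dir>/Common7/Tools/vsvars32.bat holding only the required line.

    Hypothetical helper: the folder layout follows the text (Common7 copied
    into the VC folder). Nothing here is part of the CUDA toolkit.
    """
    tools_dir = Path(vc_dir) / "Common7" / "Tools"
    tools_dir.mkdir(parents=True, exist_ok=True)
    bat_file = tools_dir / "vsvars32.bat"
    bat_file.write_text(REQUIRED_LINE + "\n", encoding="ascii")
    return bat_file


if __name__ == "__main__":
    # Demonstrate in a scratch directory rather than a real Visual Studio tree.
    with tempfile.TemporaryDirectory() as scratch:
        created = write_minimal_vsvars32(Path(scratch) / "VC")
        print(created.read_text(encoding="ascii").strip())
```

On a real system the argument would be the Microsoft Visual Studio 9.0\VC folder, and an existing vsvars32.bat should be backed up first.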
X64 Compiler Path
On attempting the next compilation, the third error report shown below was produced. This provides a clue as to the reason for the first two errors. The CUDA compiler makes false assumptions about the location of the required Microsoft C++ compiler, declared in paths or on the nvcc command line, by stepping back two folders from the declared location. Using a path statement pointing to the 32 bit CL.exe compiles successfully, but at 32 bits.
This was fixed in the nvcc command by pointing -ccbin to the 32 bit \bin folder and not \bin\amd64, giving the same compile statement as for 32 bits, shown above.
32 bit CL compiler is in:
C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin
64 bit CL compiler is in:
C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\amd64
Specifying folder containing 64 bit compiler
nvcc fatal:
Visual Studio configuration file '(null)' could not be found for installation
at 'C:/Program Files (x86)/Microsoft Visual Studio 9.0/VC/bin/amd64/../..'
C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64 /../..
Steps backwards to:
C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC>
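The stepping back can be checked by normalising the folder strings quoted in the first and third error messages. This is a small Python sketch that only resolves those two strings with Windows path rules; it makes no claim about how nvcc itself searches, and the constant and function names are illustrative.

```python
import ntpath  # Windows path rules, usable on any platform

# Folder strings exactly as quoted in the two nvcc error messages.
FIRST_ERROR = r"C:\Program Files (x86)\Microsoft Visual Studio 9.0/VC/bin/.."
THIRD_ERROR = "C:/Program Files (x86)/Microsoft Visual Studio 9.0/VC/bin/amd64/../.."


def resolve(folder_from_message):
    """Collapse the '..' steps and convert forward slashes to backslashes."""
    return ntpath.normpath(folder_from_message)


if __name__ == "__main__":
    print(resolve(FIRST_ERROR))
    print(resolve(THIRD_ERROR))
```

Both strings resolve to the same \VC folder, which is consistent with nvcc looking for its Visual Studio files relative to that folder whichever bin folder -ccbin names.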
MSVCR90.DLL
The benchmark then compiled, linked and ran properly. However, a copied or renamed version of the EXE file failed to run, with a message box indicating "The program can’t start because MSVCR90.DLL is missing".
An x86 version of this DLL is in a VS 2008 \redist folder but there is not one for amd64 (x64). There are 64 bit versions in a Windows\winsxs system folder, but copying one of these to the same folder as the EXE file produced a different error.
Windows Driver Kit
I also have the free Windows Driver Kit Version 7.0.0 that includes the version of the 64 bit compiler shown below. With this, MSVCR90.DLL is in the same folder as x64 CL.exe.
This software is installed in D:\WinDDK where there is a Common7 folder and \Tools has vcvars.bat. As with the above, \Common7 was copied to \bin and the same minimal vsvars32.bat included. Vcvars.bat was renamed as vcvarsamd64.bat and copied to the CL.exe folder, with its paths pointed to amd64 items.
This software does not have the standard \include and \lib folders, but sub-folders in \inc and \lib. The paths were found on searching for missing functions reported by the compiler. Anyway, the compiled programs can be renamed and copied to other PCs, so far, with no complications.
VCVARSAMD64.bat changes are shown below, along with the compile command, again pointing one step back from the x64 CL.exe.
Microsoft (R) C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
Extra Path for compiler
%VCINSTALLDIR%\BIN\x86\amd64;
Extra Includes
%VCINSTALLDIR%\ATLMFC\INCLUDE;
%VCINSTALLDIR%\INC\CRT;
%VCINSTALLDIR%\INC\API;
%VCINSTALLDIR%\INC\API\CRT\STL70;%INCLUDE%
Extra Libraries
%VCINSTALLDIR%\ATLMFC\LIB\amd64;
%VCINSTALLDIR%\LIB\CRT\amd64;
%VCINSTALLDIR%\LIB\WLH\amd64;%LIB%
Compile Command
nvcc -arch sm_10 -ccbin "D:\WinDDK\bin\x86" -Xcompiler "/EHsc
/nologo /O2 /Zi /MD" -c CudaMFLOPS1.cu
CUDA Toolkit 3.1 and Driver
The first benchmarks were produced using CUDA Toolkit 2.3 (Cuda compilation tools, release 2.3, V0.2.1221). Then 3.1 (Cuda compilation tools, release 3.1, V0.2.1221) became available, where there is a possibility of faster double precision speeds on newer graphics cards. This was downloaded but compiled benchmarks did not work until compatible desktop and laptop drivers were installed (Developer Drivers for WinVista and Win7 - 257.21). Unlike the earlier version, the new one sets paths in Windows Environment Variables, making it difficult to develop 32 bit and 64 bit applications on the same PC. Anyway, it appears that 64 bit applications cannot be compiled on a 32 bit PC.
No CUDA SDK files were included in either the new 32 bit or the new 64 bit package, so paths to the version 2.3 SDK files were used.
For both sets of compilations, VCVARS and/or VSVARS BAT files defined path, library and include statements that are not required for this sort of calculation, including some for Windows SDK and Framework.
32 Bit 3.1 Version
The 32 bit single and double precision benchmarks were compiled on a 32 bit laptop and ran without any real difficulty using Visual C++ 2008 Express (in this case from D:\MSCompile).
BAT files executed are shown below, the first being VCVARS32.bat, but all this did was call VSVARS32.bat using the System Environment Variable "%VS90COMNTOOLS%vsvars32.bat". The latter needed to access WindowsSdkDir for files such as windows.h and some library files. As before, nvcc -ccbin points to the location of the Microsoft CL C++ compiler (\VC\bin) where VSVARS32.bat also resides.
32 Bit BAT Files
VSVARS32.BAT (Main Part)
@echo Setting environment for using Microsoft Visual Studio 2008 x86 tools.
@call :GetWindowsSdkDir
set "INCLUDE=%WindowsSdkDir%include;%INCLUDE%"
set "LIB=%WindowsSdkDir%lib;%LIB%"
@set PATH=D:\MSCompile\MSVC8\Common7\IDE;D:\MSCompile\MSVC8\VC\BIN;%PATH%
@set INCLUDE=D:\MSCompile\MSVC8\VC\INCLUDE;%INCLUDE%
@set LIB=D:\MSCompile\MSVC8\VC\LIB;%LIB%
@goto end
:GetWindowsSdkDir
Code to Get WindowsSdkDir
:end
CUDA32.BAT
Set INCLUDE=D:\MSCompile\CUDA32\include;%INCLUDE%
Set INCLUDE=D:\MSCompile\CUDA\SDK\common\inc;%INCLUDE%
Set LIB=D:\MSCompile\CUDA32\lib;%LIB%
Set LIB=D:\MSCompile\CUDA\SDK\common\lib;%LIB%
Compiler Used, COMPILE.BAT, LINKIT.BAT
Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 15.00.30729.207 for 80x86
nvcc -arch sm_10 -ccbin D:\MSCompile\MSVC8\VC\bin -Xcompiler
"/EHsc /W3 /nologo /O2 /Zi /MT" -c CudaMFLOPS1.cu
link /LARGEADDRESSAWARE /NODEFAULTLIB:libc.lib /NODEFAULTLIB:uuid.lib CUDA.LIB
CUDART.LIB CUTIL32.LIB CudaMFLOPS1.obj asmtime.obj CPUasm.obj
64 Bit 3.1 Version
The 64 bit versions were again compiled using Windows Driver Kit Version 7.0.0, but preventing the Visual Studio configuration file '(null)' compilation failure message, shown above, proved to be more difficult. The procedure starts by specifying the same paths, includes and libraries as the earlier WDK example, although not all of them were really needed. The problem was associated with the required VSVARS32.bat and VCVARSAMD64.bat files. Compiling again with -ccbin "D:\WinDDK\bin\x86" picked up a VSVARS file (the original one, not the one in the copied Common7 folder described above) but not VCVARSAMD64.bat. The solution was to point -ccbin to the Visual C++ 2008 Express 32 bit compiler folder, where both BAT files were used to confirm the Visual Studio version. So, in this case, -ccbin does not identify the C++ compiler being used.
64 Bit BAT Files
VSVARS64.BAT (Main Part)
@set VCINSTALLDIR=D:\WinDDK
@set PATH=%VCINSTALLDIR%\BIN\x86\amd64;%PATH%
@set INCLUDE=%VCINSTALLDIR%\INC\CRT;%VCINSTALLDIR%\INC\API;
%VCINSTALLDIR%\INC\API\CRT\STL70;%INCLUDE%
@set LIB=%VCINSTALLDIR%\LIB\CRT\amd64;%VCINSTALLDIR%\LIB\WLH\amd64;%LIB%
CUDA64.BAT
Set INCLUDE=D:\MSCompile\CUDA64b\include;%INCLUDE%
Set INCLUDE=D:\MSCompile\CUDA64\SDK\C\common\inc;%INCLUDE%
Set LIB=D:\MSCompile\CUDA64b\lib64;%LIB%
Set LIB=D:\MSCompile\CUDA64\SDK\C\common\lib;%LIB%
Compiler Used, COMPILE.BAT, LINKIT.BAT
Microsoft (R) C/C++ Optimizing Compiler Version 15.00.30729.207 for x64
nvcc -arch sm_10 -ccbin "D:\MSCompile\MSVC8\VC\bin" -Xcompiler
"/EHsc /nologo /O2 /Zi /MD" -c CudaMFLOPS1.cu
link /LARGEADDRESSAWARE /NODEFAULTLIB:libc.lib /NODEFAULTLIB:uuid.lib CUDA.LIB
CUDART.LIB CUTIL64.LIB CudaMFLOPS1.obj asmtime.obj CPUasm.obj
VSVARS32.BAT From D:\MSCompile\MSVC8\Common7\Tools
@echo Setting environment for using Microsoft Visual Studio 2008 x86 tools.
VCVARSAMD64.BAT From D:\MSCompile\MSVC8\VC\bin\amd64
Not from D:\WinDDK\bin\x86\amd64
@echo Setting environment for using Microsoft Visual Studio 2008 x64 tools.
set "INCLUDE=%WindowsSdkDir%include;%INCLUDE%"
or set "INCLUDE=D:\MSCompile\SDK2008\Include;%INCLUDE%"
Common Code and Errors
Originally, there were two versions of the source code, one with variables declared for single precision float numbers and the other for double precision. Now there is one program with variable parameters defined at the start. With identical output format, single precision numbers are given with a precision greater than needed. The version run is now identified by one of the eight following titles.
The eight EXE benchmark files, supporting DLLs and source code can be downloaded from CudaMflops.zip.
Besides the original configuration information, details of graphics RAM use are now provided, along with maximum threads, as shown below. This example is for using up to 10 million 8 byte words, where 148 MB is used. The benchmarks can be executed using run time parameters (see ZIP file) and memory demands might exceed the maximum possible. The number of blocks is limited to 65535. With 256 threads per block, the maximum number of words is 16,776,960 (65535 x 256), or 134 MB using double precision. This is increased to 268 MB using 512 threads. The program now checks for these demands and reduces the number of words used to avoid run time errors. If there is insufficient video RAM (or other issues), the program can fail with no error message provided in the results log file. To see any error message, the Command Prompt window needs to be kept open. This can be achieved as shown at the end of this section.
This command might also be needed when attempting to run the CUDA 3 versions on a system using a CUDA 2 driver.
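The limits quoted above can be reproduced with a few lines of arithmetic. This Python sketch is not taken from the benchmark source; the function names are hypothetical, the 65535 limit is assumed to be a count of thread blocks, and MB is taken as one million bytes, which matches the 134 MB and 268 MB figures in the text.

```python
MAX_BLOCKS = 65535  # limit quoted in the text, assumed here to be a block count


def max_words(threads_per_block):
    """Largest number of words that one launch can cover."""
    return MAX_BLOCKS * threads_per_block


def clamp_words(requested_words, threads_per_block):
    """Reduce a request to the largest supported size, as the text describes."""
    return min(requested_words, max_words(threads_per_block))


def megabytes(words, bytes_per_word):
    """Data size in MB, with 1 MB taken as 1,000,000 bytes."""
    return words * bytes_per_word / 1_000_000


if __name__ == "__main__":
    for threads in (256, 512):
        words = max_words(threads)
        print(threads, "threads:", words, "words,",
              round(megabytes(words, 8)), "MB double precision")
    # A request for 20 million words with 256 threads is cut back.
    print(clamp_words(20_000_000, 256))
```

The 10 million word example in the text is below the 256 thread limit, so it would pass through unchanged.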
Output Heading                                        Benchmark File
CUDA 2.3 x86 Single Precision MFLOPS Benchmark 1.3    CUDA2MFLOPS-x86SP.exe
CUDA 2.3 x86 Double Precision MFLOPS Benchmark 1.3    CUDA2MFLOPS-x86DP.exe
CUDA 2.3 x64 Single Precision MFLOPS Benchmark 1.3    CUDA2MFLOPS-x64SP.exe
CUDA 2.3 x64 Double Precision MFLOPS Benchmark 1.3    CUDA2MFLOPS-x64DP.exe
CUDA 3.1 x86 Single Precision MFLOPS Benchmark 1.3    CUDA3MFLOPS-x86SP.exe
CUDA 3.1 x86 Double Precision MFLOPS Benchmark 1.3    CUDA3MFLOPS-x86DP.exe
CUDA 3.1 x64 Single Precision MFLOPS Benchmark 1.3    CUDA3MFLOPS-x64SP.exe
CUDA 3.1 x64 Double Precision MFLOPS Benchmark 1.3    CUDA3MFLOPS-x64DP.exe
Common Configuration Details
CUDA devices found
Device 0: GeForce GTS 250 with 16 Processors 128 cores
Global Memory 982 MB, Shared Memory/Block 16384 B, Max Threads/Block 512
Using 256 Threads
Hardware Information
CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42
AMD Phenom(tm) II X4 945 Processor Measured 3000 MHz
Has MMX, Has SSE, Has SSE2, Has SSE3, Has 3DNow,
Windows Information
AMD64 processor architecture, 4 CPUs
Windows NT Version 6.1, build 7600,
Memory 7936 MB, Free 6447 MB
User Virtual Space 8388608 MB, Free 8388555 MB
982 MB Graphics RAM, Used 112 Minimum, 148 Maximum
Example BAT file command to run and keep Command Prompt window open:
cmd.exe /k Cuda3MFLOPS-x86SP
Comparisons GTS 250
Following are speeds in Millions of Floating Point Operations Per Second (MFLOPS) for the eight benchmark varieties, running on the system identified above. These show that, at least on this setup, CUDA Toolkit 3.1 produces much the same speeds as CUDA 2.3. As should probably be expected, compilations to use 64 bit PC instructions produce much the same speeds as the 32 bit versions. Also, double precision calculations using the GeForce GPU are generally much slower than single precision arithmetic.
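The test labels in the results tables can be turned into operation counts and run times. The Python sketch below assumes that a label such as 1x2x2500 means 1 x 100K words, 2 operations per word and 2500 passes, as the column heading suggests, and that MFLOPS is total operations divided by seconds and by one million. The function names are illustrative and are not from the benchmark source.

```python
def total_operations(label):
    """Floating point operations for a label like '1x2x2500' (assumed layout)."""
    units_of_100k, ops_per_word, passes = (int(part) for part in label.split("x"))
    return units_of_100k * 100_000 * ops_per_word * passes


def seconds_for(label, mflops):
    """Running time implied by a measured MFLOPS figure."""
    return total_operations(label) / (mflops * 1_000_000)


if __name__ == "__main__":
    for label in ("1x2x2500", "10x2x250", "100x2x25", "100x32x25"):
        print(label, total_operations(label))
    # 339 MFLOPS is the first single precision result for the GTS 250.
    print(round(seconds_for("1x2x2500", 339), 2), "seconds")
```

Under these assumptions the three word counts for a given number of operations per word all carry out the same total work, so differences in MFLOPS reflect data transfer and launch overheads rather than the amount of arithmetic.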
                                  Single Precision MFLOPS           Double Precision MFLOPS
Test            100K x Words x     2.3     2.3     3.1     3.1       2.3     2.3     3.1     3.1
                Ops x Passes       32b     64b     32b     64b       32b     64b     32b     64b
Data in & out   1x2x2500           339     373     349     347       196     199     193     201
Data out only   1x2x2500           641     695     651     751       328     349     320     352
Calculate only  1x2x2500          3073    3173    3120    2990       912     887     851     960
Data in & out   10x2x250           638     580     638     605       250     242     252     255
Data out only   10x2x250          1194    1026    1183    1118       406     387     403     393
Calculate only  10x2x250          9927    9994    9949    9989      1109    1113    1108    1109
Data in & out   100x2x25           737     615     736     680       270     261     268     255
Data out only   100x2x25          1295    1154    1329    1248       421     409     415     407
Calculate only  100x2x25         13102   13209   12821   12881      1139    1138    1136    1127
Data in & out   1x8x2500          1417    1380    1366    1331       758     814     748     792
Data out only   1x8x2500          2594    2799    2582    2955      1270    1279    1195    1380
Calculate only  1x8x2500         12312   12611   12057   11685      3584    3555    3440    3892
Data in & out   10x8x250          2561    2254    2553    2428      1031     988    1038    1057
Data out only   10x8x250          4782    4020    4713    4562      1590    1578    1699    1692
Calculate only  10x8x250         39843   38850   36285   38792      4412    4450    4381    4429
Data in & out   100x8x25          2859    2514    2717    2856      1128    1037    1066    1075
Data out only   100x8x25          5410    4868    5324    5144      1708    1601    1671    1758
Calculate only  100x8x25         52899   52811   52665   51550      4539    4554    4557    4562
Data in & out   1x32x2500         5496    5326    5376    5895      3032    3103    2910    3306
Data out only   1x32x2500         9976   11418   10036   10687      4995    5275    4553    5496
Calculate only  1x32x2500        39604   37732   36975   38843     14574   15198   13239   15087
Data in & out   10x32x250         9808    8923    9690    9152      4147    3963    4050    4040
Data out only   10x32x250        17716   16204   17385   16849      6670    6368    6673    6770
Calculate only  10x32x250       109969  111397  106807  108303     18013   18054   17713   18091
Data in & out   100x32x25        11349    9904   10751   10792      4635    4167    4433    4451
Data out only   100x32x25        20581   17870   20083   19033      6946    6510    6973    7102
Calculate only  100x32x25       133174  133218  135006  135130     18453   18513   18486   18495
Extra tests - loop in main CUDA Function
Calculate       100x2x25         10023   10046   10037   10044       961     963     972     965
Shared Memory   100x2x25         54077   54021   54101   54062     37309   37396   37201   37286
Calculate       100x8x25         39928   40268   40304   40262      3916    3770    3902    3761
Shared Memory   100x8x25        119566  119596  119556  119569    106947  106940  106842  106938
Calculate       100x32x25       158224  158239  158067  158195     15258   15205   15597   15079
Shared Memory   100x32x25       173284  172907  173118  172911    163273  163811  163140  163780
Comparisons GTX 480
Following are results from tests on the 2010 top end GeForce GTX 480 graphics card, where it was thought that CUDA Toolkit 3.1 might produce faster double precision speeds, but this is not the case. Comparisons of GTS 250 and GTX 480 are available in CUDA2.htm.
The GTX 480 is clearly much faster except for double precision calculations from data in shared memory. Here, the GTS single and double precision calculations run at similar execution speeds whereas GTX double precision is much slower. At least, the single precision shared memory speed is demonstrated as being up to 0.77 TeraFLOPS.
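The shared memory comparison can be put into numbers using the 100x32x25 Shared Memory results (CUDA 2.3, 32 bit columns) from the GTS 250 and GTX 480 tables. This is an illustrative Python sketch; the dictionary layout and function names are assumptions, and only the four MFLOPS values come from the tables.

```python
# Shared Memory test, 100x32x25, CUDA 2.3 32 bit columns, in MFLOPS.
SHARED_MEMORY_MFLOPS = {
    "GTS 250": {"single": 173284, "double": 163273},
    "GTX 480": {"single": 769658, "double": 122088},
}


def teraflops(mflops):
    """Convert MFLOPS to TeraFLOPS."""
    return mflops / 1_000_000


def single_to_double_ratio(card):
    """How many times faster single precision is than double precision."""
    speeds = SHARED_MEMORY_MFLOPS[card]
    return speeds["single"] / speeds["double"]


if __name__ == "__main__":
    for card in SHARED_MEMORY_MFLOPS:
        print(card, "single/double ratio:", round(single_to_double_ratio(card), 2))
    print("GTX 480 single precision:",
          round(teraflops(SHARED_MEMORY_MFLOPS["GTX 480"]["single"]), 2), "TeraFLOPS")
```

The ratio is close to 1 for the GTS 250 and a little over 6 for the GTX 480, and the GTX 480 single precision figure rounds to the 0.77 TeraFLOPS quoted above.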
                                  Single Precision MFLOPS           Double Precision MFLOPS
Test            100K x Words x     2.3     2.3     3.1     3.1       2.3     2.3     3.1     3.1
                Ops x Passes       32b     64b     32b     64b       32b     64b     32b     64b
Data in & out   1x2x2500           521     521     520     511       297     300     298     299
Data out only   1x2x2500           973     965     960     936       575     581     583     557
Calculate only  1x2x2500          5743    5554    5574    4084      5263    5055    5196    4875
Data in & out   10x2x250           823     819     826     832       449     463     452     455
Data out only   10x2x250          1694    1643    1673    1704       881     944     903     922
Calculate only  10x2x250         21767   21493   21609   18791     14053   14168   14247   14072
Data in & out   100x2x25           987    1014     989     991       505     504     504     505
Data out only   100x2x25          1911    1979    1903    1934       964     983     967     973
Calculate only  100x2x25         32286   31991   32458   31348     18158   17967   18143   18085
Data in & out   1x8x2500          2070    2058    2064    1939      1174    1170    1158    1162
Data out only   1x8x2500          3750    3749    3783    3511      2307    2209    2231    2178
Calculate only  1x8x2500         22002   20129   16843   14109     15362   13805   17585   16481
Data in & out   10x8x250          3286    3306    3212    3305      1800    1833    1799    1823
Data out only   10x8x250          6755    6792    6589    6784      3626    3710    3545    3707
Calculate only  10x8x250         85074   82132   75085   79136     56128   54356   55712   53651
Data in & out   100x8x25          3968    4057    3875    3970      2016    2062    1980    2059
Data out only   100x8x25          7635    7914    7451    7762      3818    4004    3869    4011
Calculate only  100x8x25        128542  125413  125393  122580     71999   70789   71939   70715
Data in & out   1x32x2500         7976    7768    7398    7181      4461    4427    4451    4330
Data out only   1x32x2500        14380   13597   13223   12760      8210    8017    8060    7771
Calculate only  1x32x2500        63574   52230   46278   37085     43689   37802   41026   35705
Data in & out   10x32x250        12703   13190   12857   13191      6913    7072    6830    6913
Data out only   10x32x250        25752   27027   26294   26278     12912   13597   13242   13692
Calculate only  10x32x250       268179  254306  271978  212808     98855   95247   97562   93702
Data in & out   100x32x25        15485   16077   15507   15775      7751    7825    7742    7896
Data out only   100x32x25        29608   31292   29547   30816     14282   14511   14294   14766
Calculate only  100x32x25       440451  425237  435768  414020    113855  112129  113645  112501
Extra tests - loop in main CUDA Function
Calculate       100x2x25        100937   80817  100841   80702     36216   50998   36360   50860
Shared Memory   100x2x25        180734  142817  180472  142312     81214   80679   81171   80755
Calculate       100x8x25        322696  262490  322354  262176    108830  103164  108797  103214
Shared Memory   100x8x25        412376  389187  411974  386225    111212  110139  111199  110153
Calculate       100x32x25       653139  586440  648695  585688    121689  121401  121684  121398
Shared Memory   100x32x25       769658  710336  767771  709577    122088  121897  122077  121878
Comparisons GTX 680
Following are results from tests on the 2012 top end GeForce GTX 680 graphics card.
                                  Single Precision MFLOPS           Double Precision MFLOPS
Test            100K x Words x     2.3     2.3     3.1     3.1       2.3     2.3     3.1     3.1
                Ops x Passes       32b     64b     32b     64b       32b     64b     32b     64b
Data in & out   1x2x2500           649     445     622     517       336     373     354     353
Data out only   1x2x2500          1168    1231    1241    1291       723     736     667     712
Calculate only  1x2x2500          5174    7385    7310    7099      6320    6499    4126    6140
Data in & out   10x2x250           923     959     931     949       479     496     479     489
Data out only   10x2x250          1954    2050    1960    2040      1009    1042    1000    1049
Calculate only  10x2x250         21739   26138   21914   25299     16382   16594   16143   16226
Data in & out   100x2x25          1005    1035     999    1036       504     515     513     516
Data out only   100x2x25          2025    2077    2005    2083      1034    1039    1040    1052
Calculate only  100x2x25         37817   36790   36175   36475     19799   19806   19876   19882
Data in & out   1x8x2500          2437    2715    2445    2633      1458    1484    1415    1430
Data out only   1x8x2500          4287    5217    4280    4870      2858    2984    2724    2906
Calculate only  1x8x2500         25047   26963   25502   27261     22803   24113   15501   22388
Data in & out   10x8x250          3720    3789    3722    3811      1914    1965    1901    1949
Data out only   10x8x250          7895    8159    7694    8143      4072    4201    4058    4143
Calculate only  10x8x250         86859  101933   85455   97589     61628   61509   60592   60056
Data in & out   100x8x25          4042    4133    4015    4043      2044    2087    2061    2052
Data out only   100x8x25          7779    8405    8177    8305      4106    4232    4055    4130
Calculate only  100x8x25        144468  144451  140800  144575     76075   75718   75558   74333
Data in & out   1x32x2500        10208   10697   10080   10460      5374    5503    5396    5485
Data out only   1x32x2500        19145   19655   18672   19259      9645   10180    9978   10053
Calculate only  1x32x2500        91276   90090   83989   58833     48673   51722   43334   47974
Data in & out   10x32x250        14653   15031   14629   15096      7364    7471    7269    7440
Data out only   10x32x250        30683   31764   30977   32371     14781   15098   14810   14977
Calculate only  10x32x250       305847  353992  347405  333000     96653   94686   96044   92231
Data in & out   100x32x25        16292   16150   15995   16215      7903    7985    7739    7765
Data out only   100x32x25        31667   33556   31586   32983     14740   15157   14956   15037
Calculate only  100x32x25       516621  530827  519153  527122    105885  102998  105419  101771
Extra tests - loop in main CUDA Function
Calculate       100x2x25        127808  110609  126843  111041     50394   47530   50749   44017
Shared Memory   100x2x25        237600  227664  236915  243601     73545   72991   73186   72369
Calculate       100x8x25        436349  350130  470867  349289     89325   86644   89172   83696
Shared Memory   100x8x25        966567  720773  969416  721272    100983  100384  100825  100254
Calculate       100x32x25      1092531  949533 1154649  950255    109886  106505  109800  105142
Shared Memory   100x32x25      1842112 1793478 1714512 1746493    111408  111044  111425  110982
Roy Longbottom July 2014
The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection