Linux CUDA GPU Parallel Computing Benchmarks

Contents


  General
  Example Results Log
  Installing Software
  Compiling/Running Programs
  Comparative Results
  Burn-In Tests

General

This exercise involved installing 32-bit and 64-bit CUDA software on eSATA and USB drives, which were used for compiling the programs and running them on various PCs.

CUDA, from nVidia, provides programming functions that use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. Maximum speeds, in terms of billions of floating point operations per second or GFLOPS, can be higher on a laptop graphics processor than on dual core CPUs. This is for Single Instruction Multiple Data (SIMD) operation, where the same instructions are executed simultaneously on sections of data from a data array. For maximum speeds, the data array has to be large, with little or no reference to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache-like memory.

The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance using varying data array sizes and increasing processing instructions per data access. The benchmarks use nVidia CUDA programming functions that only execute on nVidia graphics hardware with a compatible driver. There are five scenarios:

  • New Calculations - Copy data to graphics RAM, execute instructions, copy back
    to host RAM [Data in & out]


  • Update Data - Execute further instructions on data in graphics RAM, copy
    back to host RAM [Data out only]


  • Graphics Only Data - Execute further instructions on data in graphics RAM, leave
    it there [Calculate only]


  • Extra Test 1 - Just graphics data, repeat loop in CUDA function [Calculate]


  • Extra Test 2 - Just graphics data, repeat loop in CUDA function but using
    Shared Memory [Shared Memory]

These are run at three different data sizes: the default 100,000 words repeated 2500 times, 1M words 250 times and 10M words 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element. The Extra Tests are only run using 10M words repeated 25 times.
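As an illustration, the kernel is of the general form shown in the following minimal sketch. This is not the benchmark source code: the function and parameter names are assumed, with one thread per data element and 256 threads per block as reported in the logs below.

   // Sketch of a CUDA kernel for the documented arithmetic. The real
   // benchmark repeats such statements to give 2, 8 or 32 operations
   // per word on each pass.
   __global__ void ops_kernel(float *x, int n, float a, float b,
                              float c, float d, float e, float f)
   {
       int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per word
       if (i < n)
           x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
   }

   // Host side for the Data in & out scenario (sketch, error checks
   // omitted): copy to graphics RAM, execute, copy back.
   //
   //   cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
   //   ops_kernel<<<(n + 255) / 256, 256>>>(d_x, n, a, b, c, d, e, f);
   //   cudaMemcpy(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
   //
   // Data out only omits the first copy and Calculate only omits both.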

The main benchmark code is as used for Windows versions, as described in CUDA1.htm, CUDA2.htm and CUDA3 x64.htm, where further technical descriptions and comparative results are provided.

Four versions were produced via Ubuntu Linux, with 32-bit and 64-bit compilations using Single and Double Precision floating point numbers. The execution files and source code, along with compiling and running instructions, can be downloaded in linux_cuda_mflops.tar.gz. The benchmarks are simple execution files and do not need installing. They run in a Terminal window via the normal ./name command. Details are displayed as the tests are running and performance results are saved in a .txt file. Details of other Linux benchmarks can be found in linux benchmarks.htm.



Example Results Log

Following is an example log file of the 64-Bit Single Precision version running on a 3 GHz AMD CPU with GeForce GTS 250 graphics. Some of the CUDA programming code is rather strange, so it was felt necessary to check that all array elements had been used, as reflected in the last two columns. The data checking also led to including parameters for using the programs in burn-in/reliability tests (see later). Note that the maximum speed shown here is nearly 172 GFLOPS.
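The check amounts to comparing every returned word with the first, along the lines of the following host-side sketch (assumed code, not the benchmark source):

   // Host-side data check (sketch): all words should finish with the
   // same value, reported in the First Results and All Same columns.
   int allSame = 1;
   for (int i = 1; i < n; i++)
       if (h_x[i] != h_x[0]) { allSame = 0; break; }
   printf(" %17.13f  %s\n", h_x[0], allSame ? "Yes" : "No");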


 #####################################################

  Assembler CPUID and RDTSC       
  CPU AuthenticAMD, Features Code 178BFBFF, Model Code 00100F42 
  AMD Phenom(tm) II X4 945 Processor 
  Measured - Minimum 3000 MHz, Maximum 3000 MHz 
  Linux Functions 
  get_nprocs() - CPUs 4, Configured CPUs 4 
  get_phys_pages() and size - RAM Size  7.81 GB, Page Size 4096 Bytes 
  uname() - Linux, roy-AMD4, 2.6.35-22-generic 
  #35-Ubuntu SMP Sat Oct 16 20:45:36 UTC 2010, x86_64 

 #####################################################

  Linux CUDA 3.2 x64 32 Bits SP MFLOPS Benchmark 1.4 Wed Dec 29 15:35:35 2010

  CUDA devices found 
  Device 0: GeForce GTS 250  with 16 Processors 128 cores 
  Global Memory 999 MB, Shared Memory/Block 16384 B, Max Threads/Block 512

  Using 256 Threads

  Test            4 Byte  Ops  Repeat   Seconds   MFLOPS             First  All
                   Words  /Wd  Passes                              Results Same

 Data in & out    100000    2    2500  1.035893      483   0.9295383095741  Yes
 Data out only    100000    2    2500  0.514445      972   0.9295383095741  Yes
 Calculate only   100000    2    2500  0.082464     6063   0.9295383095741  Yes

 Data in & out   1000000    2     250  0.706176      708   0.9925497770309  Yes
 Data out only   1000000    2     250  0.380928     1313   0.9925497770309  Yes
 Calculate only  1000000    2     250  0.051266     9753   0.9925497770309  Yes

 Data in & out  10000000    2      25  0.639933      781   0.9992496371269  Yes
 Data out only  10000000    2      25  0.339051     1475   0.9992496371269  Yes
 Calculate only 10000000    2      25  0.041672    11999   0.9992496371269  Yes

 Data in & out    100000    8    2500  1.013196     1974   0.9569796919823  Yes
 Data out only    100000    8    2500  0.490317     4079   0.9569796919823  Yes
 Calculate only   100000    8    2500  0.088028    22720   0.9569796919823  Yes

 Data in & out   1000000    8     250  0.666709     3000   0.9955092668533  Yes
 Data out only   1000000    8     250  0.351320     5693   0.9955092668533  Yes
 Calculate only  1000000    8     250  0.052704    37948   0.9955092668533  Yes

 Data in & out  10000000    8      25  0.620265     3224   0.9995486140251  Yes
 Data out only  10000000    8      25  0.335467     5962   0.9995486140251  Yes
 Calculate only 10000000    8      25  0.044453    44992   0.9995486140251  Yes

 Data in & out    100000   32    2500  1.057142     7568   0.8900792598724  Yes
 Data out only    100000   32    2500  0.531691    15046   0.8900792598724  Yes
 Calculate only   100000   32    2500  0.128706    62157   0.8900792598724  Yes

 Data in & out   1000000   32     250  0.688714    11616   0.9880728721619  Yes
 Data out only   1000000   32     250  0.375411    21310   0.9880728721619  Yes
 Calculate only  1000000   32     250  0.075172   106423   0.9880728721619  Yes

 Data in & out  10000000   32      25  0.644074    12421   0.9987990260124  Yes
 Data out only  10000000   32      25  0.357000    22409   0.9987990260124  Yes
 Calculate only 10000000   32      25  0.062001   129029   0.9987990260124  Yes

 Extra tests - loop in main CUDA Function

 Calculate      10000000    2      25  0.050288     9943   0.9992496371269  Yes
 Shared Memory  10000000    2      25  0.009206    54313   0.9992496371269  Yes

 Calculate      10000000    8      25  0.049608    40316   0.9995486140251  Yes
 Shared Memory  10000000    8      25  0.017254   115916   0.9995486140251  Yes

 Calculate      10000000   32      25  0.050531   158320   0.9987990260124  Yes
 Shared Memory  10000000   32      25  0.046626   171580   0.9987990260124  Yes




Installing Software

There are uncertainties when installing CUDA. In my case, I am using Ubuntu 10.10, but the only current choice is CUDA Toolkit 3.2 on Ubuntu 10.04 Linux. This Tutorial provides detailed information on how to do it. To stand the best chance of working, the first step is to download the compatible nVidia graphics driver - Developer Drivers for Linux (260.19.26), with 32-bit and 64-bit versions, from nVidia’s Linux Developer Zone. Further down the page, there are links to download 32-bit and 64-bit versions of the CUDA Toolkit for Ubuntu Linux 10.04 and the GPU Computing SDK code samples. It is advisable to install the latter to show that the software runs properly on existing hardware. This also requires the installation of an OpenGL driver, as indicated in the Tutorial.

Before installing the new graphics driver, the system might need reconfiguring to use the basic graphics driver. The Tutorial also provides details of a script file to blacklist various modules, particularly nouveau. In my case, this did not work, but the installer took care of it. The Tutorial commands shown to install the driver required amending for version 260.19.26. In my case, with the 64-bit version, on rebooting and unlike the basic driver, the nVidia software did not detect the correct monitor pixel settings. Resetting the default meant that initial settings were incorrect when the appropriate USB drive was used on another PC.

The sample program source codes are in a /C/src directory and are all compiled with a single make command, with execution files saved in /C/bin/linux/release.



Compiling and Running Programs

Each sample program has its own makefile. The easiest way to compile a new program is to create a new folder in /src to contain the source code and associated files, then copy and modify a makefile from another directory in /src. An example makefile is shown below, the .o entries being standard object files used to obtain PC configuration details. Later, the includes required for the nvcc compile command were determined, also shown below. In this case, the execution file appears in the same directory as the source files, and this can be in a more convenient location.

Compiling the benchmarks under Windows required different nvcc path directives for producing 32-bit and 64-bit execution files, and separate run time DLL library files had to be included for redistribution, but these did not need path settings. Under Linux, the same simple nvcc format can be used for both bit size compilations, but a separate library file (libcudart.so.3) is needed for redistribution. With this in the execution file directory, the library is made accessible using the export command shown below.

Both the 32-bit and 64-bit benchmarks could be compiled and run using the nVidia 260.19.26 graphics driver with GeForce 8600 GT and GeForce GTS 250 graphics cards on different PCs. The 32-bit and 64-bit nVidia graphics driver versions 260.19.06, recommended by Ubuntu System - Administration - Additional Drivers, were installed on other USB drives and these ran the benchmarks successfully on the same PCs.

Although the Windows versions of the benchmarks ran successfully wherever tried, the Linux varieties sometimes failed with the error message “cudaSafeCall() Runtime API error : all CUDA-capable devices are busy or unavailable”, particularly on a laptop. This occurred randomly under 32-bit and 64-bit Ubuntu with both drivers. In this state, CUDA can produce device statistics but will not run most of the provided sample programs. It is as though the hardware is stuck in a bad state, which can mainly be overcome by powering off/on and rebooting. Googling shows that this type of error is quite common, including under Windows.
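The device details shown at the top of the log come from standard CUDA runtime calls, and these still succeed in the failed state described above. A sketch of the sort of code involved (assumed, not the benchmark source):

   // Report CUDA devices as in the benchmark log header (sketch)
   int devices;
   cudaGetDeviceCount(&devices);
   printf("  CUDA devices found\n");
   for (int d = 0; d < devices; d++)
   {
       cudaDeviceProp p;
       cudaGetDeviceProperties(&p, d);
       printf("  Device %d: %s with %d Processors\n",
              d, p.name, p.multiProcessorCount);
       printf("  Global Memory %d MB, Shared Memory/Block %d B, "
              "Max Threads/Block %d\n",
              (int)(p.totalGlobalMem / 1048576),
              (int)p.sharedMemPerBlock, p.maxThreadsPerBlock);
   }

   // After a kernel launch, any failure is visible via the runtime:
   //   cudaError_t e = cudaGetLastError();
   //   if (e != cudaSuccess) printf("%s\n", cudaGetErrorString(e));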


   #########################################################################
   Makefile

   # Add source files here
   EXECUTABLE	                               := cudamflops
   # Cuda source files (compiled with cudacc)
   CUFILES		                       := cudaMFLOPS1.cu
   # CUDA dependency files
   CU_DEPS		                       :=
   # C/C++ source files (compiled with gcc/c++)
   CCFILES		                       := cpuida64.o cpuidc64.o

   # Rules and Targets
   include ../../common/common.mk

   #########################################################################
   Compile and Link Command

   nvcc cudaMFLOPS1.cu -I ~/NVIDIA_GPU_Computing_SDK/C/common/inc 
                       -I ~/NVIDIA_GPU_Computing_SDK/shared/inc
                       cpuida64.o cpuidc64.o -o cudamflops

   #########################################################################
   Set Library File Path

   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:./




Comparative Results

The GTS 250 is on a 16x PCI-E bus whose maximum speed is 4 GB/second, or 1G 32-bit words per second. The Data Out Only tests provide 2, 8 or 32 single precision floating point operations per word. Dividing the faster MFLOPS speeds by these numbers indicates a data transfer speed of around 700 M words or 2.8 GB per second (for example, 22409 MFLOPS ÷ 32 operations per word ≈ 700 M words/second, or 2.8 GB/second at 4 bytes per word), the bus speed clearly being the limiting factor. With data in and out, speed is reduced by half, as would be expected. Maximum data size is 40 MB, which can easily be accommodated in the available graphics RAM, but the speed of the latter will affect the Calculate and Shared Memory tests.

Video RAM throughput is 70.4 GB/second or 17.6 G words per second. With data read and written, 8.8 G words/second can be assumed. Multiplying this by 2, 8 and 32 operations indicates maximum speeds of 17600, 70400 and 281600 MFLOPS. Actual results of Calculate tests are up to 70% of these speeds. Shared Memory tests use somewhat faster cache-like storage. The maximum speed specification of the GTS 250 is 1.836 GHz x 128 processors x 3 (linked multiply, add and multiply) or 705 GFLOPS, but this would be difficult to sustain with memory access.
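The Shared Memory variant presumably works along the following lines, shown as a sketch with assumed names: each block copies its words into the fast on-chip shared memory, executes the repeat loop there, then writes the results back to graphics RAM.

   // Sketch of an Extra Test style kernel with the repeat loop inside
   // the CUDA function, using on-chip shared memory (names assumed).
   __global__ void shared_ops(float *x, int n, int repeats, float a, float b)
   {
       __shared__ float s[256];               // one word per thread, 1 KB
       int i = blockIdx.x * blockDim.x + threadIdx.x;
       if (i < n)
       {
           s[threadIdx.x] = x[i];             // global to shared once
           for (int r = 0; r < repeats; r++)  // arithmetic in shared memory
               s[threadIdx.x] = (s[threadIdx.x] + a) * b;
           x[i] = s[threadIdx.x];             // write back once
       }
   }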

Results below are for GeForce 8600 GT graphics on a Core 2 Duo based PC and GeForce GTS 250 graphics on a 3.0 GHz quad core Phenom based system. The 64b and 32b keys refer to benchmarks compiled to run on 64-bit or 32-bit versions of the Operating System. The first and fifth columns are for tests compiled for Windows using CUDA Toolkit 3.1, and the following set is for Linux using the 3.2 Toolkit. Some Linux speeds are faster, but this might be due to the Toolkit version. The other Single Precision (SP) and Double Precision (DP) results are typical of 64-bit and 32-bit compilations where, as the tests mainly depend on the speed of the graphics processors, performance is similar when using the same precision. The 32-bit compilations produced the same speeds when run via 32-bit and 64-bit versions of Ubuntu, and the alternative graphics driver made no difference. Of particular note, DP speed (shown below) is much slower than that for SP calculations.

Later results are for faster cards running at up to 790 GFLOPS. Results added in 2014 are for a mid range GeForce GTX 650, with a 3.7 GHz Core i7, via Windows 8.1 and Ubuntu 14.04. A maximum of 412 GFLOPS was demonstrated, making it more than twice as fast as the more expensive GTS 250 from three years earlier. The i7's Asus P9X79 LE motherboard has PCI Express 3.0 x16 which, along with faster RAM and CPU GHz, produces the fastest speeds so far where data in and out, or out only, are involved. Earlier systems probably had PCIe 1, with a maximum bandwidth of 4 GB/s, or PCIe 2 at 8 GB/s, compared with 15.74 GB/s for PCIe 3.


                                           Speed In MFLOPS

                            GTS250 GTS250   8600  8600  GTS250 GTS250   8600   8600
                               Win  Linux  Linux  Linux    Win  Linux  Linux  Linux
 Test       100K Words x    3.1 SP 3.2 SP 3.2 SP 3.2 SP 3.1 DP 3.2 DP 3.2 DP 3.2 DP
            Ops x Passes       64b    64b    64b    32b    64b    64b    64b    32b

 Data in & out   1x2x2500      347    496    266    265    201    238    116    115
 Data out only   1x2x2500      751    979    462    458    352    395    180    179
 Calculate only  1x2x2500     2990   6030   2797   2739    960   1155    496    493

 Data in & out   10x2x250      605    714    390    388    255    297    155    154
 Data out only   10x2x250     1118   1312    632    629    393    463    215    215
 Calculate only  10x2x250     9989   9809   3529   3546   1109   1125    457    455

 Data in & out   100x2x25      680    796    445    441    255    309    165    163
 Data out only   100x2x25     1248   1469    694    691    407    483    225    223
 Calculate only  100x2x25    12881  11935   3943   3913   1127   1147    463    476

 Data in & out   1x8x2500     1331   1906   1062   1053    792    999    460    458
 Data out only   1x8x2500     2955   4086   1827   1811   1380   1649    715    711
 Calculate only  1x8x2500    11685  22809  10174   9928   3892   4588   1962   1956

 Data in & out   10x8x250     2428   3075   1547   1537   1057   1264    616    614
 Data out only   10x8x250     4562   5834   2499   2480   1692   2037    849    847
 Calculate only  10x8x250    38792  38811  13109  13056   4429   4517   1795   1790

 Data in & out   100x8x25     2856   3174   1764   1750   1075   1241    649    644
 Data out only   100x8x25     5144   5901   2726   2722   1758   1939    872    873
 Calculate only  100x8x25    51550  49304  14481  14588   4562   4591   1791   1786

 Data in & out   1x32x2500    5895   7332   3902   3857   3306   3971   1823   1815
 Data out only   1x32x2500   10687  15111   6356   6245   5496   6499   2818   2801
 Calculate only  1x32x2500   38843  62060  22228  21261  15087  17780   7544   7446

 Data in & out   10x32x250    9152  11828   5586   5553   4040   5063   2439   2429
 Data out only   10x32x250   16849  21985   8505   8448   6770   8184   3338   3331
 Calculate only  10x32x250  108303 104855  27363  27226  18091  18220   6892   6911

 Data in & out   100x32x25   10792  12274   6293   6243   4451   4990   2548   2534
 Data out only   100x32x25   19033  22096   9215   9170   7102   7783   3421   3411
 Calculate only  100x32x25  135130 117655  29034  29177  18495  18640   6892   6930

 Extra tests - loop in main CUDA Function

 Calculate       100x2x25    10044  10021   3825   3825    965    943    443    445
 Shared Memory   100x2x25    54062  52088  10710  10717  37286  37414   8940   8947

 Calculate       100x8x25    40262  40233  15144  15199   3761   3862   1772   1777
 Shared Memory   100x8x25   119569 117125  23384  23591 106938 107308  21113  21117

 Calculate       100x32x25  158195 158537  31317  31309  15079  15430   7149   7226
 Shared Memory   100x32x25  172911 171721  34033  34108 163780 164046  32243  32139


                             Corei7  Corei7 Phenom Corei7 Corei7
                             2.8GHz  2.8GHz 3.0GHz 3.7GHz 3.7GHz
                             GTX580  GTX580 GTX570 GTX650 GTX650
                                Win     Win  Linux    Win  Linux
 Test        100K Words x    3.1 DP  3.1 SP 3.2 SP 3.2 SP 3.2 SP
             Ops x Passes       64b     64b    64b    64b    64b

 Data in & out    1x2x2500      299     511    403    459    597
 Data out only    1x2x2500      557     936   1010   1059   1283
 Calculate only   1x2x2500     4875    4084  13882   3449   5834

 Data in & out    10x2x250      455     832    654    893   1133
 Data out only    10x2x250      922    1704   1245   1790   2183
 Calculate only   10x2x250    14072   18791  29634   8806   9666

 Data in & out    100x2x25      505     991    709    980   1355
 Data out only    100x2x25      973    1934   1269   1852   2485
 Calculate only   100x2x25    18085   31348  34996  10530  10411

 Data in & out    1x8x2500     1162    1939   1949   2375   2823
 Data out only    1x8x2500     2178    3511   4068   4151   5152
 Calculate only   1x8x2500    16481   14109  45930  13056  21679

 Data in & out    10x8x250     1823    3305   2708   3545   4178
 Data out only    10x8x250     3707    6784   5502   7107   8651
 Calculate only   10x8x250    53651   79136 113472  34014  37138

 Data in & out    100x8x25     2059    3970   2839   3896   5396
 Data out only    100x8x25     4011    7762   4938   7283   9882
 Calculate only   100x8x25    70715  122580 138639  40905  40599

 Data in & out   1x32x2500     4330    7181   7835   9183  11034
 Data out only   1x32x2500     7771   12760  15013  15769  19628
 Calculate only  1x32x2500    35705   37085 109243  43975  70679

 Data in & out   10x32x250     6913   13191  11043  14006  16069
 Data out only   10x32x250    13692   26278  20691  27684  30597
 Calculate only  10x32x250    93702  212808 375859 120972 133042

 Data in & out   100x32x25     7896   15775  10925  15499  21283
 Data out only   100x32x25    14766   30816  20591  28906  38528
 Calculate only  100x32x25   112501  414020 510582 147100 146204

 Extra tests - loop in main

 Calculate        100x2x25    50860   80702  88987  26876  27613
 Shared Memory    100x2x25    80755  142312 160615  77049  64308

 Calculate        100x8x25   103214  262176 289222  81484  79671
 Shared Memory    100x8x25   110153  386225 426749 181190 229241

 Calculate       100x32x25   121398  585688 650986 216570 219797
 Shared Memory   100x32x25   121878  709577 790930 400966 412070






Burn-In/Reliability Tests

The program has run time parameters to vary the number of threads, words and repeat passes. Details are provided in CUDA1.htm, which also covers the parameters available to run a reliability/burn-in test. These are the running time in minutes and the seconds between logged results, default 15 seconds. Calculate is the default routine, but the Shared Memory test can be selected with an added parameter of FC. Results are checked for consistent values and performance is measured.
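In outline, the reliability test simply repeats the chosen routine until the requested time expires, logging a line per reporting interval. The following simplified sketch uses hypothetical helper names, the real control code being in the downloadable source:

   // Simplified burn-in control loop (sketch)
   #include <stdio.h>
   #include <time.h>

   double run_passes(void);    // hypothetical: times one block of passes
   int    check_results(void); // hypothetical: counts wrong data values

   void burn_in(int minutes, long n, int ops, int passes)
   {
       time_t start = time(NULL);
       int test = 1;
       while (difftime(time(NULL), start) < minutes * 60)
       {
           double secs   = run_passes();
           double mflops = (double)n * ops * passes / secs / 1.0e6;
           int errors    = check_results();
           printf(" %3d  %9.3f  %7.0f   %s\n", test++, secs, mflops,
                  errors ? "Errors" : "None found");
       }
   }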

Below are results of a 10 minute test, where GPU temperature and fan speed are also shown, as provided by nVidia X Server Statistics for Thermal Settings via System, Preferences, Monitors. CPU utilisation was also noted from System Monitor and, surprisingly, showed the equivalent of two of the four Phenom cores running flat out. When running this test, it is advisable to set power saving options to “Never”, and five second reporting would be more appropriate to minimise the time that the display is frozen.

See also linux burn-in apps.htm.


  Command: ./cudamflops32SP Mins 10, FC

  Linux CUDA 3.2 x86 32 Bits SP MFLOPS Benchmark 1.4 Thu Jan  6 18:36:23 2011

  CUDA devices found 
  Device 0: GeForce GTS 250  with 16 Processors 128 cores 
  Global Memory 999 MB, Shared Memory/Block 16384 B, Max Threads/Block 512

  Using 256 Threads

  Shared Memory  Reliability Test 10 minutes, report every 15 seconds

  Repeat CUDA 788 times at  1.43 seconds. Repeat former 10 times
  Tests - 10000000 4 Byte Words, 32 Operations Per Word, 7880 Repeat Passes

  Results of all calculations should be -    0.7124574184417725

  Test Seconds   MFLOPS    Errors     First               Value
                                       Word 

   1    14.326   176011   None found
   2    14.326   176020   None found
   3    14.326   176020   None found
   4    14.326   176019   None found
   5    14.326   176017   None found

  To

  36    14.326   176018   None found
  37    14.326   176018   None found
  38    14.326   176018   None found
  39    14.326   176014   None found
  40    14.326   176016   None found


  Minutes          0  0.5    1  1.5    2    3    4    5    6    7    8    9   10

  Temperature C   48   60   67   70   72   74   74   74   74   74   74   75   74
  Fan Speed %     35   35   35   35   40   42   45   45   46   46   45   47   47





Roy Longbottom, January 2015





The Internet Home for my PC Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection