General

This exercise involved installing 32-bit and 64-bit CUDA software on eSATA and USB drives used for compiling programs and running them on various PCs. CUDA, from nVidia, provides programming functions to use GeForce graphics processors for general purpose computing. These functions are easy to use in executing arithmetic instructions on numerous processing elements simultaneously. Maximum speeds, in terms of billions of floating point operations per second or GFLOPS, can be higher on a laptop graphics processor than on CPUs such as dual core processors. This applies to Single Instruction Multiple Data (SIMD) operation, where the same instructions are executed simultaneously on sections of data from a data array. For maximum speeds, the data array has to be large, with few or no references to graphics or host CPU RAM. To assist in this, CUDA hardware provides a large number of registers and high speed cache-like memory.

The benchmarks measure floating point speeds in Millions of Floating Point Operations Per Second (MFLOPS). They demonstrate some best and worst case performance using varying data array sizes and increasing processing instructions per data access. The benchmarks use nVidia CUDA programming functions that only execute on nVidia graphics hardware with a compatible driver. There are five scenarios.
These are run at three different data sizes: defaults of 100,000 words repeated 2500 times, 1M words 250 times and 10M words 25 times. The arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f, with 2, 8 or 32 adds or subtracts and multiplies on each data element (a kernel sketch is shown below). The Extra Tests are only run using 10M words repeated 25 times. The main benchmark code is as used for the Windows versions, described in CUDA1.htm, CUDA2.htm and CUDA3 x64.htm, where further technical descriptions and comparative results are provided. Four versions were produced via Ubuntu Linux, with 32-Bit and 64-Bit compilations using Single and Double Precision floating point numbers. The execution files and source code, along with compiling and running instructions, can be downloaded in
linux_cuda_mflops.tar.gz.
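Following is a minimal sketch of the 8-operation arithmetic form quoted above, written as a self-contained CUDA program. It is not the benchmark's actual source; names, sizes and constants are illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    // 3 multiplies plus 5 adds/subtracts = 8 floating point operations per element
    __global__ void calc8(float *x, int n, float a, float b, float c,
                          float d, float e, float f)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
    }

    int main()
    {
        const int n = 100000;                    // default data size in words
        float *h = new float[n];
        for (int i = 0; i < n; i++) h[i] = 0.999999f;

        float *dx;
        cudaMalloc(&dx, n * sizeof(float));
        cudaMemcpy(dx, h, n * sizeof(float), cudaMemcpyHostToDevice);

        for (int pass = 0; pass < 2500; pass++)  // default repeat count
            calc8<<<(n + 255) / 256, 256>>>(dx, n, 0.1f, 0.99f,
                                            0.2f, 0.98f, 0.3f, 0.97f);

        cudaMemcpy(h, dx, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("x[0] = %f\n", h[0]);             // spot check one result
        cudaFree(dx);
        delete[] h;
        return 0;
    }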
The benchmarks are simple execution files and do not need installing. The first ones run in a Terminal window via the normal ./name command. Details are displayed while the tests are running and performance results are saved in a .txt file.
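For example, after downloading, the archive can be unpacked and a benchmark run as follows (the directory and execution file names here are illustrative; use those in the downloaded archive):

    tar -xzf linux_cuda_mflops.tar.gz
    cd linux_cuda_mflops
    ./cudamflops32SP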
Details of other Linux benchmarks can be found in
linux benchmarks.htm.
Example Results Log

Following is an example log file of the 64-Bit Single Precision version running on a 3 GHz AMD CPU with GeForce GTS 250 graphics.
Some of the CUDA programming code is rather strange, so it was felt necessary to check that all array elements had been used, as reflected in the last two columns. The data checking also led to including parameters to use the programs for burn-in/reliability tests (see later). Note that the maximum speed shown here is nearly 172 GFLOPS.
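The sort of host-side check involved might look like the following sketch, run after copying results back from the graphics card. The variable names are hypothetical and the benchmark's real checking code may differ.

    /* Confirm the first result and whether all elements hold the same value */
    float first = h[0];
    int allSame = 1;
    for (int i = 1; i < n; i++)
        if (h[i] != first) { allSame = 0; break; }
    printf("First Result %f, All Same %s\n", first, allSame ? "Yes" : "No");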
Installing Software

There are uncertainties when installing CUDA. In my case, I am using Ubuntu 10.10, but the only current choice is CUDA Toolkit 3.2 on Ubuntu 10.04 Linux. This Tutorial provides detailed information on how to do it. To stand the best chance of working, the first step is to download the compatible nVidia graphics driver - Developer Drivers for Linux (260.19.26), with 32-bit and 64-bit versions, from nVidia's Linux Developer Zone. Further down the page are links to download 32-bit and 64-bit versions of the CUDA Toolkit for Ubuntu Linux 10.04 and the GPU Computing SDK code samples. It is advisable to install the latter to show that the software runs properly on existing hardware. This also requires the installation of an OpenGL driver, as indicated in the Tutorial.

Before installing the new graphics driver, the system might need reconfiguring to use the basic graphics driver. The Tutorial also provides details of a script file to blacklist various modules, particularly nouveau (a representative blacklist file is shown below). In my case, this did not work, but the installer took care of it. The Tutorial commands shown to install the driver required amending for 260.19.26. In my case, with the 64-bit version, on rebooting and unlike the basic driver, the nVidia software did not detect the correct monitor pixel settings. Resetting the default meant that initial settings were incorrect when the appropriate USB drive was used on another PC.

The sample program source codes are in a /C/src directory and are all compiled with a single make command, with execution files saved in /C/bin/linux/release.
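The blacklist step typically places a file such as the following in /etc/modprobe.d. The contents shown are the commonly used form for disabling the nouveau driver; the Tutorial's exact script may differ.

    # /etc/modprobe.d/blacklist-nouveau.conf
    blacklist nouveau
    options nouveau modeset=0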
Compiling and Running Programs

Each sample program has its own makefile. The easiest way to compile a new program is to create a new folder in /src to contain the source code and associated files, then copy and modify a makefile from another directory in /src. An example makefile is shown below, the .o entries being standard object files to obtain PC configuration details. Later, the includes required for the nvcc compile command were determined, also shown below. In this case, the execution file appears in the same directory as the source files, which can be a more convenient location.

Compiling the benchmarks under Windows required different nvcc path directives for producing 32-bit and 64-bit execution files, and separate run time DLL library files had to be included for redistribution, although these did not need path settings. Using Linux, the same simple nvcc format can be used for both bit size compilations, but separate library files (libcudart.so.3) are needed for redistribution. With this in the execution file directory, the library is made accessible using the export command shown below.

Both the 32-bit and 64-bit benchmarks could be compiled and run using the nVidia 260.19.26 graphics driver with GeForce 8600 GT and GeForce GTS 250 graphics cards on different PCs. The 32-bit and 64-bit nVidia graphics driver versions 260.19.06, recommended by Ubuntu System - Administration - Additional Drivers, were installed on other USB drives and these ran the benchmarks successfully on the same PCs.

Although Windows versions of the benchmarks ran successfully wherever tried, the Linux varieties sometimes failed with the error message "cudaSafeCall() Runtime API error : all CUDA-capable devices are busy or unavailable", particularly on a laptop. This occurred randomly using 32-bit and 64-bit Ubuntu and both drivers. In this state, CUDA can produce device statistics but will not run most of the provided sample programs. It is as though the hardware is stuck in a bad state, but this can usually be overcome by powering off/on and rebooting. Googling shows that this type of error is quite common, including via Windows.
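Representative versions of the makefile, nvcc command and export command referred to above follow. All file names and paths are illustrative and should be adjusted to match the actual installation.

    # Example SDK-style makefile, copied and modified from another /src directory
    EXECUTABLE := cudamflops
    CUFILES    := cudamflops.cu
    include ../../common/common.mk

    # Example nvcc command with include and library paths, plus a standard
    # object file for PC configuration details (names illustrative)
    nvcc cudamflops.cu cpuidc.o -o cudamflops64SP \
         -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart

    # Make libcudart.so.3 in the current directory visible, then run
    export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH
    ./cudamflops64SP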
Comparative Results

The GTS 250 is on a 16x PCI-E bus whose maximum speed is 4 GB/second, or 1G 32 bit words per second. The Data Out Only tests provide 2, 8 or 32 single precision floating point operations per word. Dividing the faster MFLOPS speeds by these numbers indicates a data transfer speed of around 700 MW or 2.8 GB per second, the bus speed clearly being the limiting factor. With data in and out, speed is reduced by half, as would be expected.

Maximum data size is 40 MB, which can easily be accommodated in the available graphics RAM, but the speed of the latter will affect the Calculate and Shared Memory Tests. Video RAM throughput is 70.4 GB/second or 17.6 G words per second. With data in and out, 8.8 Gw/second can be assumed. Multiplying this by 2, 8 and 32 operations indicates maximum speeds of 17600, 70400 and 281600 MFLOPS (this arithmetic is worked through below). Actual results of Calculate tests are up to 70% of these speeds. Shared Memory tests use somewhat faster cache. The maximum speed specification of the GTS 250 is 1.836 GHz x 128 processors x 3 (a linked multiply-add plus multiply) or 705 GFLOPS, but this would be difficult to sustain with memory access.

Results below are for GeForce 8600 GT graphics on a Core 2 Duo based PC and GeForce GTS 250 on a 3.0 GHz Quad Core Phenom based system. The 64b and 32b keys refer to benchmarks compiled to run on 64 or 32 bit versions of the operating system. The first and fifth columns are for tests compiled for Windows using CUDA Toolkit 3.1 and the following set is for Linux using the 3.2 Toolkit. Some Linux speeds are faster, but this might be due to the toolkit version. The other Single Precision (SP) and Double Precision (DP) results are typical of 64-bit and 32-bit compilations where, as the tests mainly depend on the speed of the graphics processors, performance is similar when using the same precision. Compilations for 32 bit tests produced the same speeds when run via 32-Bit and 64-Bit versions of Ubuntu, and the alternative graphics driver made no difference. Of particular note, DP speed (shown below) is much slower than that for SP calculations. Later results are for faster cards running at a maximum of up to 790 GFLOPS.
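As a check on the figures above, the video RAM bandwidth arithmetic can be worked through as follows, using the values from the text:

    #include <stdio.h>

    int main(void)
    {
        /* Video RAM: 17.6 G words/second; data in and out halves this */
        double gwords_per_sec = 17.6 / 2;        /* 8.8 Gw/second */
        int    ops_per_word[] = { 2, 8, 32 };
        for (int i = 0; i < 3; i++)              /* 17600, 70400, 281600 MFLOPS */
            printf("%8.0f MFLOPS\n", gwords_per_sec * ops_per_word[i] * 1000);
        return 0;
    }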
Results added in 2014 are for a mid-range GeForce GTX 650, with a 3.7 GHz Core i7, via Windows 8.1 and Ubuntu 14.04. A maximum of 412 GFLOPS was demonstrated, making it more than twice as fast as the more expensive GTS 250 from three years earlier. The i7 Asus P9X79 LE motherboard has PCI Express 3.0 x 16 which, along with faster RAM and CPU GHz, produces the fastest speeds so far where data in and out, or out only, are involved. Earlier systems probably had PCIe 1, with a maximum bandwidth of 4 GB/s, or PCIe 2 at 8 GB/s, compared with 15.74 GB/s for PCIe 3.
Burn-In/Reliability Tests

The program has run time parameters to vary the number of threads, words and repeat passes. Details are provided in CUDA1.htm. This also details other parameters available to run a reliability/burn-in test: the running time in minutes and the interval in seconds between logged results, default 15 seconds. Calculate is the default routine, but the Shared Memory test can be used with an added parameter of FC. Results are checked for consistent values and performance is measured.

Below are results of a 10 minute test, where GPU temperature and fan speed are also shown, as provided by nVidia X Server Settings for Thermal Settings via System, Preferences, Monitors. CPU utilisation was also noted from System Monitor and, surprisingly, showed the equivalent of two of the four Phenom cores running flat out. On running this test, it is advisable to set power saving options to "Never", and five second reporting would be more appropriate to minimise the time that the display is frozen. An example command line is given after the following link. See also
linux burn-in apps.htm.
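As an illustration, a 10 minute Shared Memory burn-in run with five second reporting might be invoked along the following lines. The parameter keywords are described in CUDA1.htm; the spelling used here is illustrative.

    ./cudamflops64SP Mins 10 Secs 5 FC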