Description
The FFTGraf benchmark runs single and double precision Fast Fourier Transforms (FFTs) of size 1024 to 1048576 (1K to 1024K) and produces a graph of the results. As the time for a single calculation can vary, each test is run a number of times (default 5). Results given here are minimum times in milliseconds. Three versions are available:
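The minimum-of-several-runs approach can be sketched as follows. This is only an illustration and not FFTGraf's own timing code: the names work_fn, count_call and min_time_ms are hypothetical, and the standard clock() function stands in for whatever timer the benchmark really uses.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical signature for the routine being timed (not from FFTGraf). */
typedef void (*work_fn)(void *ctx);

/* Trivial example workload: counts how many times it has been called. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}

/* Run work() `passes` times and return the fastest pass in milliseconds.
   Taking the minimum discards passes slowed down by other system activity. */
double min_time_ms(work_fn work, void *ctx, int passes)
{
    assert(work != 0 && passes > 0);

    double best = -1.0; /* negative means "no pass recorded yet" */
    for (int i = 0; i < passes; i++) {
        clock_t start = clock();
        work(ctx);
        clock_t end = clock();

        double ms = 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || ms < best)
            best = ms;
    }
    return best;
}
```

Calling min_time_ms(count_call, &calls, 5) would mirror the default of 5 runs. In a real benchmark, the workload would be the FFT itself.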
Version 1 is all C code using optimised procedures produced by Scott Taylor. For much of the time, the FFT data is not loaded from or stored to sequential memory addresses. As the hardware transfers this data in bursts of 32 or 64 bytes (or more), much of each burst goes unused, resulting in slow performance.
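The cost of non-sequential access can be illustrated with a simple model that counts how many bursts are fetched for a given access stride. This is a rough sketch under stated assumptions, not a measurement of FFTGraf: the function name bursts_touched and the stride values are hypothetical, and the model assumes the data starts on a burst boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Count the bursts fetched when reading `count` elements of `elem_bytes`
   each, spaced `stride` elements apart, starting at byte offset 0.
   Assumes elem_bytes divides burst_bytes, so no element straddles
   two bursts. Addresses only increase, so comparing with the previous
   burst index is enough to detect a new burst. */
size_t bursts_touched(size_t count, size_t stride,
                      size_t elem_bytes, size_t burst_bytes)
{
    assert(stride > 0 && elem_bytes > 0 && burst_bytes % elem_bytes == 0);

    size_t bursts = 0;
    size_t last = (size_t)-1; /* sentinel: no burst fetched yet */
    for (size_t i = 0; i < count; i++) {
        size_t burst = (i * stride * elem_bytes) / burst_bytes;
        if (burst != last) {
            bursts++;
            last = burst;
        }
    }
    return bursts;
}
```

In this model, reading 1024 single precision values (4 bytes each) sequentially with 64 byte bursts fetches 64 bursts, while reading the same number of values at a stride of 16 elements fetches 1024 bursts, with only 4 of the 64 bytes in each burst being used.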
Version 2 of the program produces significant speed improvements, mainly by making more efficient use of caches and by using all of the burst data. Scott’s new code divides the data from RAM into L2 cache sized segments, leading to a 2x improvement on large FFTs. Roy’s new code unrolls the critical loop to make effective use of burst data read into caches, leading to a further improvement of up to 2x, or more on PCs that use 64 byte bursts. Roy has also supplied assembly code for the main calculations, which helps a little.
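The idea of working on cache sized segments can be sketched with a toy example. This is not Scott's or Roy's code: the function names are hypothetical, and a simple element-by-element update stands in for the FFT passes, which in reality depend on each other and need the algorithm itself to be restructured. The sketch only shows the change in loop order, where all passes are completed on one segment before moving to the next.

```c
#include <assert.h>
#include <stddef.h>

/* Unsegmented: every pass sweeps the whole array, so for large n the
   data is fetched from RAM again on each pass. */
void passes_whole(float *data, size_t n, int passes)
{
    for (int p = 0; p < passes; p++)
        for (size_t i = 0; i < n; i++)
            data[i] = data[i] * 0.5f + 1.0f;
}

/* Segmented: all passes are applied to one segment of `seg` elements
   before moving on, so a segment that fits in cache is fetched from
   RAM only once. */
void passes_segmented(float *data, size_t n, size_t seg, int passes)
{
    assert(seg > 0);

    for (size_t base = 0; base < n; base += seg) {
        size_t end = (base + seg < n) ? base + seg : n;
        for (int p = 0; p < passes; p++)
            for (size_t i = base; i < end; i++)
                data[i] = data[i] * 0.5f + 1.0f;
    }
}
```

Both functions produce the same results for this toy update. The segment size would be chosen so that one segment fits in the L2 cache of the target CPU.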
Version 3 includes additional code from Roy to use SSE or 3DNow assembly instructions for single precision calculations and SSE2 instructions for double precision, when these are provided by the CPU.
Pre-compiled versions of the benchmarks can be found in FFTGraf.zip, which also contains the source code and more detailed explanations. The three versions have also been compiled to run at 64 bits under Windows 64. Version 1 is the same code but is compiled to use SSE/SSE2 instructions. Version 2 uses C for the main calculations, as i386 floating point instructions are not available under Win64. Version 3 is the same as before, except that it has no i386 floating point or 3DNow facilities. The 64 bit versions are in More64Bit.zip.
Then there is My Main Page for other PC benchmarks and results.
Version 1 memory demands in bytes are up to 16 times the FFT size for single precision (SP) and 28 times for double precision (DP). These also apply to Versions 2 and 3 up to size 8K. Above this, they can be up to 28 times for SP and 52 times for DP, giving a maximum of 52 MB. Data spanned in the critical timing loop is 8 times (SP) and 16 times (DP) the FFT size at 8K and below, then 16 times and 32 times above 8K.
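The arithmetic behind these limits can be checked with a short sketch. The function and type names are hypothetical, 8K is taken to mean 8192, and the multipliers are simply those quoted above.

```c
#include <assert.h>
#include <stddef.h>

enum fft_precision { FFT_SP, FFT_DP };

/* Upper bound on memory demand in bytes, using the multipliers quoted
   in the text: 16x (SP) and 28x (DP) for Version 1 at all sizes and for
   Versions 2 and 3 up to 8K, then 28x (SP) and 52x (DP) above 8K. */
size_t max_memory_bytes(int version, size_t fft_size, enum fft_precision p)
{
    assert(version >= 1 && version <= 3);
    assert(fft_size >= 1024 && fft_size <= 1048576);

    size_t multiplier;
    if (version == 1 || fft_size <= 8192)
        multiplier = (p == FFT_SP) ? 16 : 28;
    else
        multiplier = (p == FFT_SP) ? 28 : 52;

    return fft_size * multiplier;
}
```

For Version 2 or 3 at size 1024K in double precision this gives 1048576 x 52 = 54525952 bytes, which is the 52 MB maximum.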
The following is an example of Version 3 output for a 3.0 GHz Pentium 4E. The output shows Scott's MagSq'd[n/16], Peak Noise and Average Noise accuracy checks for FFTs (at 1024K). Note that these might vary slightly with different compilers, particularly for single precision. Examples below show that SSE instructions in Version 3 produce different checksums.