Description
The SSE3DNow benchmark measures Single Precision (SP) and Double Precision (DP) floating point speeds, with data streaming from caches and RAM. It uses SSE (SP) and/or SSE2 (DP) and/or AMD 3DNow (SP) Single Instruction Multiple Data (SIMD) floating point instructions. Tests using normal i386 single precision floating point are also included to demonstrate the performance gains of the SIMD instructions.
Note: SSE instructions are available on Intel Pentium III onwards and SSE2 came with the Pentium 4. AMD Athlon 4 CPUs also incorporated SSE instructions and Athlon 64 has SSE2.
The procedures used are memory to register operations (s = s + x[m] * y[m]) and memory to register to memory operations (x[m] = x[m] + y[m]). SSE instructions use 128 bit registers holding 4 single precision floating point words, with SSE2 using them for 2 double precision words. 3DNow uses the 64 bit MMX registers, holding two single precision words. The tests are run with a range of data sizes to provide measurements via L1 cache, L2 cache and RAM, where up to 25% of main memory size is automatically selected.
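As a rough illustration of the two procedures, here is a scalar Python sketch. It is not the benchmark's code, which uses SSE, SSE2 and 3DNow assembly; the function names and the accumulator name `acc` are made up for this example, and only the register widths are taken from the text.

```python
def mem_to_reg(x, y):
    """Memory to register: accumulate x[m] * y[m] into a running sum."""
    acc = 0.0
    for m in range(len(x)):
        acc = acc + x[m] * y[m]
    return acc


def mem_to_reg_to_mem(x, y):
    """Memory to register to memory: x[m] = x[m] + y[m], stored back in place."""
    for m in range(len(x)):
        x[m] = x[m] + y[m]
    return x


# Words processed per SIMD register, as described in the text.
WORDS_PER_REGISTER = {
    "SSE": 128 // 32,    # 4 single precision words in a 128 bit register
    "SSE2": 128 // 64,   # 2 double precision words in a 128 bit register
    "3DNow": 64 // 32,   # 2 single precision words in a 64 bit MMX register
}
```

The SIMD versions carry out the same arithmetic, but on 2 or 4 words per instruction instead of one.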
A pre-compiled version of the benchmark can be found in SSE3DNow.zip which also contains the source code, providing further explanatory comments.
SSE64, a version compiled for 64 bit operation, is in More64bit.zip, where i387 normal floating point and 3DNow instructions are not supported. The SSE/SSE2 assembly code is the same as the original (see results below).
Information on maximum speeds when less processing is involved can be obtained from BusSpd2K results.htm and RandMem results.htm.
Then there is My Main Page for other PC benchmarks and results.
Results are given as Millions of Bytes Per Second (MB/s) memory reading speed. The second set of figures can be multiplied by 1.5 for read/write speed. The results can be converted to Millions of Floating Point Operations Per Second (MFLOPS) as follows: for memory to register, divide SSE2 MB/s by 8 and the others by 4; for memory to register to memory, divide SSE2 MB/s by 16 and the others by 8. The test results output includes maximum MFLOPS speeds.
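The conversion rules can be written as a small Python helper. This is only a sketch: the function and key names are invented here, and the divisors are the ones quoted above. They are consistent with the first test reading two words for every two operations (a multiply and an add), and the second reading two words for every one add. The 1.5 factor presumably reflects one word written for every two read.

```python
# Bytes read per floating point operation, using the divisors given in the text.
# "DP" is SSE2 (8 byte words); "SP" covers SSE, 3DNow and normal floating point.
BYTES_PER_FLOP = {
    ("mem_to_reg", "DP"): 8,
    ("mem_to_reg", "SP"): 4,
    ("mem_to_reg_to_mem", "DP"): 16,
    ("mem_to_reg_to_mem", "SP"): 8,
}


def mflops(mb_per_sec, test, precision):
    """Convert a memory reading speed in MB/s to MFLOPS."""
    return mb_per_sec / BYTES_PER_FLOP[(test, precision)]


def read_write_mb_per_sec(read_mb_per_sec):
    """Read/write speed for the memory to register to memory test."""
    return read_mb_per_sec * 1.5


print(mflops(17286, "mem_to_reg", "DP"))        # 2160.75
print(mflops(4072, "mem_to_reg_to_mem", "SP"))  # 509.0
```

The two inputs are the best SSE2 figure for the first test and the best normal floating point figure for the second test in the Athlon 64 example output, which reports 2161 and 509 maximum MFLOPS for them.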
Following is an example output for an Athlon 64 CPU. Variations in performance identify L1 and L2 cache sizes.
      Memory         s=s+x[m]*y[m]                 x[m]=x[m]+y[m]
      KBytes   SSE2    SSE  3DNow   Sngl     SSE2    SSE  3DNow   Sngl
        Used   MB/S   MB/S   MB/S   MB/S     MB/S   MB/S   MB/S   MB/S
L1         4  15431  15023  13503   7732    17119  16934  14091   4072
           8  16382  16280  14518   7678    17531  17320  13931   4020
          16  16720  16915  15187   7838    17421  17212  14000   3925
          32  16921  16867  15399   7723    17720  17435  14001   3741
          64  17286  17279  15253   7605    17235  17545  13798   3787
L2       128   7822   7802   7522   4186     4792   4939   4851   2226
         256   7331   7522   7262   4178     4728   4645   4656   2218
         512   7366   7320   6962   4095     4597   4575   4540   2174
RAM     1024   3464   3357   3359   2928     2063   2104   2094   1791
        2048   3385   3445   3351   2914     2054   2092   2053   1779
        4096   3334   3281   3295   2853     2069   2107   2028   1780
        8192   3325   3329   3339   2935     2074   2080   2071   1804
       16384   3314   3320   3280   2830     2075   2077   2098   1781
       32768   3298   3313   3273   2822     2074   2043   2040   1806
       65536   3356   3352   3319   2827     2066   2049   2068   1821
      131072   3303   3325   3329   2918     2059   2072   2105   1807
      262144   3318   3326   3375   2880     2072   2071   2062   1808
               SSE2    SSE  3DNow   Norm     SSE2    SSE  3DNow   Norm
Maximum          DP     SP     SP     SP       DP     SP     SP     SP
MFLOPS         2161   4320   3850   1960     1108   2193   1761    509
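To illustrate how variations in performance identify the cache sizes, the following Python sketch scans the SSE2 column of the first test for sharp drops in speed. The 0.7 threshold and the function name are assumptions chosen for this example, not part of the benchmark.

```python
# (KBytes used, SSE2 MB/s for s=s+x[m]*y[m]), copied from the example output.
SSE2_READ = [
    (4, 15431), (8, 16382), (16, 16720), (32, 16921), (64, 17286),
    (128, 7822), (256, 7331), (512, 7366),
    (1024, 3464), (2048, 3385), (4096, 3334), (8192, 3325),
    (16384, 3314), (32768, 3298), (65536, 3356), (131072, 3303),
    (262144, 3318),
]


def level_boundaries(samples, drop_ratio=0.7):
    """Return the data sizes (KB) after which the speed falls below
    drop_ratio times the previous measurement."""
    boundaries = []
    for (size, speed), (_, next_speed) in zip(samples, samples[1:]):
        if next_speed < drop_ratio * speed:
            boundaries.append(size)
    return boundaries


print(level_boundaries(SSE2_READ))  # [64, 512]
```

The two sizes found, 64 KB and 512 KB, are the last rows labelled L1 and L2 in the example output.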
Results
Separate tables of speeds obtained via L1 cache, L2 cache and RAM are given below. Except where a cache is connected via the memory bus, performance via caches tends to be proportional to CPU MHz for a given type of processor, so only a sample of results is provided. Details of cache sizes, speed and range of CPU MHz can be found in CPUSpeed.htm.
The latter also provides a range of performance comparisons based on %MFLOPS/MHz, which include SSE3DNow results.
The tables also show the highest value for each column of results and associated MFLOPS. This makes it easier to see that one CPU is not the fastest on all tests.
L1 Cache Results - Comparing Intel Pentium 4 results with those from AMD CPUs of similar MHz shows that the latter have superior floating point performance, particularly on the old i386 instructions. The same applies to the Pentium M, and both have larger L1 caches than the Pentium 4. The results also show that the later Pentium 4E (Prescott) performs significantly faster than earlier P4s on these tests, and that AMD Athlon 64 CPUs produce similar performance to earlier Athlons of the same MHz. Later results for Core 2 Duo show that it has outstanding performance on SSE and SSE2 tests relative to CPU MHz.
Results are sorted by the first i386 FP speeds (Sngl).
L2 Cache Results - These are sorted in the same order as for L1 cache. The i386 FP columns indicate relative improvements in Pentium 4 results, although AMD CPUs of the same MHz are still faster. The superior Pentium 4 (and Pentium M) L2 cache performance is reflected in SSE and SSE2 results, where AMD CPUs of the same MHz are slower. Some Athlon 64/Opteron results are much better than those of earlier AMD processors. Again, Core 2 Duo results are outstanding.
RAM Speed Results - Results include four Pentium 4s at 2533 MHz, showing the impact of different types of RAM and variations due to different mainboards. Performance depends on CPU and RAM speed, with Pentium 4s showing the best scores on all tests. Athlon 64s do at least take the lead on %MFLOPS/MHz. Later, the first Core 2 Duo results were obtained on a board using the nForce 570 chipset, where most speeds were disappointing. Further results were obtained with the Intel 965 chipset, which produced the best performance.