Document:         MemSpd2K.txt
File Group:       Benchmarks/Standards
Creation Date:    20 April 2000
Revision Date:    23 June 2004

Title:            Memory Data Transfer Rate Tests For Windows
Keywords:         BENCHMARK PERFORMANCE MEMORY CACHE  

Abstract:         This document contains details of a pre-compiled C
                  program that measures memory and cache data transfer
                  rates via Windows 9x/NT. A secondary purpose is to
                  identify hardware peculiarities such as a cache not
                  being used properly.

                  See also source code.

Note:             The program is still under test and should be treated
                  as beta test software.  Please submit feedback directly
                  to the contributor, or as a posting in Section 10.

Location:         Compuserve PC Hardware Forum

Contributor:      Roy_Longbottom@compuserve.com


Revision:         Version 3.11 April 2001 - configuration details detected.
                  Change to timing calibration to avoid hangs with large
                  RAM size and fast CPUs.

                  Version 3.12 September 2001 - another change to avoid
                  running time being excessive.

 * WARNING - The two arrays are allocated with addresses in multiples of
 * 1048576 bytes apart. This exposes a design limitation with the Intel
 * P4 CPU, producing false cache flushing and some very slow speeds.

 
MEMORY DATA TRANSFER RATE TESTS

0.  SUMMARY

The benchmark is an upgraded version of those in MemSpeed.zip for use when 
caches are greater than 1 MByte.

The program employs three different sequences of operations, on 64 bit 
double precision floating point numbers, 32 bit single precision numbers 
and 32 bit integers via data arrays. The memory loading speed is calculated 
in terms of millions of bytes per second (MB/S). Measurements are made at 
4000, 8000, 16000 etc. memory bytes up to 25% of the main RAM size to 
produce speed ratings via data from different levels of cache and from RAM. 
Running time is normally about one minute to produce 108 results testing 
up to 8MB.

Results are displayed as the program is running and saved in file 
XferMBPS.txt which should appear in the same directory as the EXE file. 
Besides MB/S figures, the program produces results used for comparing CPU 
speeds in file CPUSpeed.txt.

Before running, all other applications should be closed. To run, click on 
the appropriate EXE icon and the Run button. The program can also be run 
from a BAT file or command line to include parameters for test running 
time, maximum memory used and to run, log the results and exit 
automatically. See later examples.

Results should be sent to Roy_Longbottom@compuserve.com and details of the 
system under test should be included. 


1.  INTRODUCTION

The original MemSpeed Benchmark employed fixed arrays occupying 6 MB of 
memory, used in such a way (data in the stack) that addressing overheads 
were minimal. Because of increasing cache sizes, it was decided to use a 
variable memory size, where the maximum used is allocated at run time 
(using VirtualAlloc). To keep the addressing overheads equally low, hand 
coded assembly language was used for the dominant testing functions. The 
instruction sequence is identical to that in MemSpeed and the instructions 
themselves are virtually the same.

The measurements made are for 64 bit double precision floating point 
numbers, 32 bit single precision and 32 bit integers with the following 
assignments and calculations in inner timing loops:

               sum  = sum  + x[m] * y[m] (+ y[m] for integers, as faster)
               x[m] = x[m] + y[m]
               x[m] = y[m]

These represent, first, calculations with data read from memory, second, 
calculations with results returned to memory and, third, memory to memory 
copying. The calculated data transfer rates are for memory reading speeds 
but could be adjusted for total throughput (second one times 1.5, third 
times 2).  


2.  PROGRAM DETAILS

The program first allocates two 64 bit double precision arrays, each with 
a size of half the maximum memory to be used. Pointers are used to fix 
addresses for the 32 bit floating point and integer arrays as for double 
precision. The first measurements are made at 4000 bytes. Thus, logically, 
one array starts at the lowest address and the other at the half maximum 
point. In practice, Windows may well only allocate real memory when data 
transfer requests to specific array addresses are made. Before each timing 
test the required array space is filled with data to avoid any such 
overheads in the timed loops.

The sequence of tests uses 4000, 8000, 16000, 32000, 64000, 128000, 256000, 
512000, 1024000 and 2048000 bytes as in MemSpeed, doubling for subsequent 
tests up to the maximum requested. 

Each measurement is self calibrating, via an outer loop, to run for a 
minimum of approximately 0.1 seconds using the high resolution timer 
(QueryPerformanceCounter). The tests were 5 seconds each in the original 
MemSpeed.


2.1  ARRAY AND ARITHMETIC TO VARIABLE

The double and single precision tests carry out the following calculations 
(using the appropriate arrays). The multiply and add produce a 
significantly higher data throughput than two adds. The integer test is of 
the form sumi = sumi + xi[m] + yi[m] as multiplication produces lower 
throughput with integers. The original C code was:

  for (m=0; m<kd; m=m+4)
  {
      sumd = sumd + xd[m]   * yd[m];
      sumd = sumd + xd[m+1] * yd[m+1];
      sumd = sumd + xd[m+2] * yd[m+2];
      sumd = sumd + xd[m+3] * yd[m+3];
  }


2.1.1  DOUBLE PRECISION INSTRUCTIONS

See Source Code

The loop has 18 instructions and executes 8 floating point arithmetic 
operations per 64 bytes counted in the MB per second transfer rate, so:

  Millions of Instructions per second    (MIPS)   = MB per sec * 18/64
  Millions of Floating Point Ops Per Sec (MFLOPS) = MB per sec *  8/64


2.1.2  SINGLE PRECISION INSTRUCTIONS

This is similar to the above but uses 4 byte double words with 16 
instructions per 32 bytes.

      MIPS   = MB per sec * 16/32
      MFLOPS = MB per sec *  8/32

2.1.3  INTEGER INSTRUCTIONS

This test carries out 8 integer adds (OPS) to register edx, each 4 bytes 
long, and uses 11 instructions. 
      
      MIPS = MB per sec * 11/32
      MOPS = MB per sec *  8/32

Original was:

  for (m=0; m<ki; m=m+4)
  {
      sumi = sumi + xi[m]   + yi[m];
      sumi = sumi + xi[m+1] + yi[m+1];
      sumi = sumi + xi[m+2] + yi[m+2];
      sumi = sumi + xi[m+3] + yi[m+3];
  }


2.2 ARRAY PLUS ARRAY TO ARRAY

In this case, all three sections carry out the same operations with the 
appropriate data type. Original was:

  for (m=0; m<kd; m=m+4)
  {
      xd[m]   = xd[m]   + yd[m];
      xd[m+1] = xd[m+1] + yd[m+1];
      xd[m+2] = xd[m+2] + yd[m+2];
      xd[m+3] = xd[m+3] + yd[m+3];
  }


2.2.1  DOUBLE PRECISION INSTRUCTIONS

This time 21 instructions are used to execute four floating point 
arithmetic operations on 64 bytes.

See Source Code

      MIPS   = MB per sec * 21/64
      MFLOPS = MB per sec *  4/64



2.2.2  SINGLE PRECISION INSTRUCTIONS

As double precision but 8 * 4 bytes.

      MIPS   = MB per sec * 21/32
      MFLOPS = MB per sec *  4/32


2.2.3  INTEGER INSTRUCTIONS

This uses 14 instructions for 4 operations on 8 * 4 bytes.

      MIPS = MB per sec * 14/32
      MOPS = MB per sec *  4/32


2.3 ARRAY TO ARRAY

Again all three sections carry out the same operations on the appropriate 
data types. For comparison with Assignment MOPS results in the Whetstone 
Benchmark, each assignment is considered to be an operation (OP). Original 
was:

  for (m=0; m<kd; m=m+4)
  {
      xd[m]   = yd[m];
      xd[m+1] = yd[m+1];
      xd[m+2] = yd[m+2];
      xd[m+3] = yd[m+3];
  }


2.3.1 DOUBLE PRECISION INSTRUCTIONS

This first loads four elements of yd[] sequentially then stores them to 
xd[] sequentially. It uses 13 instructions for the 4 operations on 32 bytes 
loaded.

See Source Code

      MIPS = MB per sec * 13/32
      MOPS = MB per sec *  4/32


2.3.2 SINGLE PRECISION INSTRUCTIONS

As double precision but on 16 bytes loaded.

      MIPS = MB per sec * 13/16
      MOPS = MB per sec *  4/16


2.3.3  INTEGER INSTRUCTIONS

This uses 12 instructions to read 16 bytes with alternate loads and stores.

      MIPS = MB per sec * 12/16
      MOPS = MB per sec *  4/16


3.  OUTPUT DISPLAY

When the program is running, the results are displayed in the same format 
as in the results file. 


4.  RESULTS FILE

Results are appended to file XFERMBPS.TXT which should appear in the same 
directory as the EXE file. The results are in the following format. Example 
is Celeron 450/100 MHz with PC100 RAM, 16KB L1 data cache and 128KB L2 
cache. 

Note slow results at 128000 bytes. This is thought to be due to Windows 
effects which can also produce other inconsistencies. Because of 
differences in loop sizes and positions of code, some results are different 
to the original MemSpeed programs.


      Memory Reading Speed Test Version 3.0 by Roy Longbottom

      0.100 seconds per test, Start Fri Apr 21 11:19:53 2000

  Memory    s=s+x[m]*y[m] Int+     x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl   Int    Dble   Sngl   Int    Dble   Sngl   Int
   Used     MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

      4     1951   1078   1283   2003   1095   1578   1732    883    786
      8     2019   1092   1297   2029    933   1546   1766    888    792
     16     2048   1087   1294   2033   1099   1557   1763    882    791
     32     1410    939   1142   1127    845    984    607    545    489
     64     1411    938   1138   1130    857    980    601    537    488
    128     1078    556    725    843    513    565    517    420    366
    256      612    329    375    303    202    203    156    154    153
    512      611    328    375    301    202    203    153    153    152
   1024      611    329    375    300    202    203    152    152    151
   2048      611    329    375    300    202    203    152    152    151
   4096      599    310    362    300    200    204    152    147    150
   8192      595    315    364    299    202    211    151    150    150
  16384      575    311    362    295    201    213    149    149    148

                End of test Fri Apr 21 11:20:26 2000


 **************** Roy's Memory Ratio ***************
 Plus cache and RAM speeds to update CPUSpeed tables

                       Integer        Floating Point
                   Average  StdDev   Average  StdDev

 Original to 2MB     818     408       801     436
 New one to 16MB     764     388       758     419


 Millions of Instructions Per Second Integer Tests

  MemKB  r=r+m+m   m=m+m    m=m    Average  Ratio

      4     441     690     589      554    1232 
      8     446     676     594      555    1110 
     16     445     681     593      555    1683 
     32     393     430     367      395    1234 
     64     391     429     366      394    1312 
    128     249     247     275      256     916 
    256     129      89     114      108     470 
    512     129      89     114      108     568 
   1024     129      89     113      108     598 
   2048     129      89     113      108     633 
   4096     125      89     113      107     628 
   8192     125      92     112      108     637 
  16384     124      93     111      108     635 

 Maximum MFLOPS     256 DP,     273 SP


RMR ratings and integer ratios are derived by using results from a Pentium 
100 MHz CPU using this version of the benchmark.

Figures to add to table 2 in CPUSpeed.txt should be the average and ratio 
at 8KB for L1 results (4KB is too variable), those at 64KB or greater for 
best L2 results and those at 2048 KB or greater for RAM results.

Version 3.1 - configuration details added e.g.

 Windows 98 Version 4.10, build 2222,  A 
 CPU AuthenticAMD, Features Code 0183F9FF, Model Code 00000630, 949 MHz
 From GlobalMemoryStatus: Size 130416 KB, Free 50180 KB, at end 17964 KB


5. RUN TIME PARAMETERS

The following run time parameters can be included in a BAT file or typed at 
the command prompt:

          MemSpd2k  KB kkkk, ms mmmm, Auto

Where kkkk is the maximum memory to be tested, mmmm is the minimum run time 
in milliseconds for each test and Auto runs the program, logs the results 
and exits.


6. SUBMITTING RESULTS

Results should be sent to Roy_Longbottom@compuserve.com and the following 
information on the system under test should be included.

               PC Supplier/model 
               CPU chip           
               Clock MHz              CPU and bus         
               Cache size             and memory type
               Chipset & H/W Options  and mainboard type
               Windows version       
               Run by (name)   
               E-Mail address 



