OpenSUSE and SUSE Raspberry Pi 3 64 Bit Stress Tests
Roy Longbottom


Contents


General
FPU MFLOPS Test
Integer MB/second Test
Monitoring and Tests Run
FPU Stress Test
Integer Stress Test
SUSE Enterprise Server
Assembly Code


General

Raspberry Pi 3 has a 64 bit ARM Cortex-A53 CPU, but the available operating systems have been 32 bit versions. The first reference I have seen to a 64 bit variety was for OpenSUSE. Different distros are available, including one for SUSE Linux Enterprise Server. The intention is to recompile existing benchmarks to work at 64 bits.

On running the first conversions, it was found that performance could be disappointingly slow and inconsistent. The problem is that the system boots with a parameter that sets the CPU frequency governor to on demand, varying the clock between 600 and 1200 MHz, and this caused the speed changes. In config.txt, on the SD card boot partition, the parameter force_turbo=0 selects on demand. Changing this to force_turbo=1 gives performance mode, with the clock, so it seemed, running continuously at 1200 MHz. The benchmarks then ran at full speed.
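For reference, the entry concerned looks like the following fragment (other config.txt entries omitted; comments added here for illustration):

```
# config.txt, in the SD card boot partition
# force_turbo=0 selects on demand (600 to 1200 MHz); change to:
force_turbo=1
```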

A second, temporary, workaround was suggested in the OpenSUSE forum, indicating that “performance” or “on demand” can be set via the cpupower terminal command (see below). Being able to demonstrate maximum performance of a single threaded benchmark, at a fixed 1200 MHz, raises the question of what happens when all cores run at maximum speed, as demonstrated in Raspberry Pi 2 and 3 Stress Tests, where the CPU clock gradually reduces in speed to avoid overheating.

The most important floating point and integer stress tests have been compiled to run at 64 bits, both having parameters that can be used to run with data from caches or RAM. The first results below, for single core tests, further demonstrate variable speeds using the on demand CPU frequency setting, consistent speeds with the performance option, and example results via caches and RAM. This is followed by multi-thread test results, using all four cores. In this case, CPU speed is reduced in small increments as temperature increases, in the same way as when running earlier tests under Raspbian OS.

It could be concluded that OpenSUSE behaves as might be expected when using multiple cores, and that it would seem more appropriate to boot the system with the “performance” CPU frequency setting.

The stress test programs and source code can be downloaded from SuseRpi3Stress.tar.gz

Other results for 64 bit benchmark conversions will be included in the report on Raspberry Pi, Pi 2 and Pi 3 Benchmarks.


Maximum MFLOPS - burninfpuPi64

Arithmetic operations executed are of the form x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f with 2, 8 or 32 operations per input data word. The same variables are used for each word and final results are checked for consistency, any errors being reported. The benchmark has input parameters for KWords, Section 1, 2 or 3 (for 2, 8 or 32 operations per word) and log number (0 to 99).
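The inner loop can be sketched in C as follows. This is a minimal sketch assuming the form described above; the constants, array size and checking details are illustrative, not the benchmark's actual code.

```c
#include <assert.h>

/* One pass applies the same six variables to every word; first results
   from each pass are later compared for consistency. */
#define WORDS 3200

static float x[WORDS];                   /* zero at the start */

float fpu_pass(float a, float b, float c,
               float d, float e, float f)
{
    for (int i = 0; i < WORDS; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
    return x[0];                         /* first result, for checking */
}
```

With words, operations per word and repeat passes known, MFLOPS is then passes x words x operations / seconds.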

Following are results from running one copy of the program, demonstrating that, in this case, the average MFLOPS speed using the on demand frequency setting is 52% of that with the performance option. The latter is also 31% faster than the original 32 bit scores.

Different Parameters - Results provided below show that maximum speeds are produced with data from caches, using eight operations per word. Speeds using 32 operations per word are particularly disappointing; see Assembly Code for an explanation.


 ############################### On demand ###############################

   Default Setting
   cpupower -c all frequency-set -g ondemand
   Default Parameters for ./burninfpuPi64

    Burn-In-FPU SUSE/ARM 64 Sat Jan 14 15:50:39 2017

 Using 12 KBytes, 8 Operations Per Word, For Approximately 1 Minutes

   Pass     4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All
             Words  Word   Passes                            Results  Same

      1       3200     8  1180000      15.07     2005    0.539296687   Yes
      2       3200     8  1180000      15.78     1915    0.539296687   Yes
      3       3200     8  1180000      14.43     2094    0.539296687   Yes
      4       3200     8  1180000      15.33     1971    0.539296687   Yes

                   End at Sat Jan 14 15:51:39 2017


 ############################## Performance ##############################

   cpupower -c all frequency-set -g performance
   Default Parameters for ./burninfpuPi64

    Burn-In-FPU SUSE/ARM 64 Sat Jan 14 15:54:26 2017

 Using 12 KBytes, 8 Operations Per Word, For Approximately 1 Minutes

   Pass     4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All
             Words  Word   Passes                            Results  Same

      1       3200     8  2250000      15.03     3832    0.539296687   Yes
      2       3200     8  2250000      15.03     3832    0.539296687   Yes
      3       3200     8  2250000      15.03     3832    0.539296687   Yes
      4       3200     8  2250000      15.03     3832    0.539296687   Yes

                   End at Sat Jan 14 15:55:26 2017
 
 ######################### Different Parameters ##########################

   Area     4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All
             Words  Word   Passes                            Results  Same

  L1 cache    8000     2  1592000      15.00     1698    0.400158763   Yes
              8000     8   876000      15.02     3733    0.539296687   Yes
              8000    32    92000      15.05     1564    0.352537483   Yes

  L2 cache   40000     2   308800      15.00     1647    0.400158763   Yes
             40000     8   177200      15.02     3774    0.539296687   Yes
             40000    32    18400      15.08     1562    0.516640484   Yes

     RAM   2000000     2     1672      15.04      445    0.951906323   Yes
           2000000     8     1664      15.06     1768    0.970994949   Yes
           2000000    32      360      15.06     1530    0.982915401   Yes
  


Maximum Integer MB/Second - stressintPi64

This has six tests that alternately write and read data, and six tests that write once and read many times, each test using two data patterns out of 24 variations. Some are shown in the results. The read phase comprises an equal number of additions and subtractions, with the data being unchanged afterwards. This is checked for correctness at the end of each test, and any errors are reported. Run time parameters are provided for KBytes of memory used, seconds for each of the twelve tests, and log number for use in multitasking. Default parameters are shown below.
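One write/read test can be sketched as below. This is an assumed shape based on the description, not the actual benchmark source; the size and return convention are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Write a pattern to every word, read back with an equal number of adds
   and subtracts that cancel, then confirm the data is unchanged. */
#define IWORDS 2048                      /* 8 KB of 4 byte words */

static uint32_t data[IWORDS];

int int_pass(uint32_t pattern)
{
    for (int i = 0; i < IWORDS; i++)     /* write phase */
        data[i] = pattern;

    uint32_t sum = 0;
    for (int i = 0; i < IWORDS; i += 2) {
        sum += data[i];                  /* equal additions ...      */
        sum -= data[i + 1];              /* ... and subtractions     */
    }
    if (sum != 0)
        return 0;                        /* reads should cancel exactly */

    for (int i = 0; i < IWORDS; i++)     /* data unchanged afterwards?  */
        if (data[i] != pattern)
            return 0;
    return 1;                            /* Result OK */
}
```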

Below are results of the on demand and performance settings, using default run time parameters, demonstrating the wide performance variation in the former, worst case being around half speed.

Different Parameters - Speeds via L1 cache and L2 cache are similar, but the latter could be much slower using multiple tests, where data from all copies overfills the shared cache.


 ############################### On demand ###############################
 
  Default Setting
  cpupower -c all frequency-set -g ondemand
  Default Parameters ./stressintPi64

  Integer Stress Test SUSE/ARM 64 Sat Jan 14 15:52:12 2017

   8 KBytes Cache or RAM Space, 1 Seconds Per Test, 12 Tests

 Write/Read
  1    2368 MB/sec  Pattern 00000000 Result OK     144536 passes
  2    2581 MB/sec  Pattern FFFFFFFF Result OK     157550 passes
  3    2582 MB/sec  Pattern A5A5A5A5 Result OK     157565 passes
  4    1424 MB/sec  Pattern 55555555 Result OK      86907 passes
  5    1290 MB/sec  Pattern 33333333 Result OK      78734 passes
  6    1741 MB/sec  Pattern F0F0F0F0 Result OK     106258 passes
 Read
  1    1468 MB/sec  Pattern 00000000 Result OK     179200 passes
  2    2521 MB/sec  Pattern FFFFFFFF Result OK     307800 passes
  3    2937 MB/sec  Pattern A5A5A5A5 Result OK     358600 passes
  4    2554 MB/sec  Pattern 55555555 Result OK     311800 passes
  5    2937 MB/sec  Pattern 33333333 Result OK     358600 passes
  6    2937 MB/sec  Pattern F0F0F0F0 Result OK     358600 passes

                   End at Sat Jan 14 15:52:24 2017

 ############################## Performance ##############################

  cpupower -c all frequency-set -g performance
  Default Parameters ./stressintPi64

  Integer Stress Test SUSE/ARM 64 Sat Jan 14 16:00:17 2017

   8 KBytes Cache or RAM Space, 1 Seconds Per Test, 12 Tests

 Write/Read
  1    2578 MB/sec  Pattern 00000000 Result OK     157369 passes
  2    2580 MB/sec  Pattern FFFFFFFF Result OK     157441 passes
  3    2580 MB/sec  Pattern A5A5A5A5 Result OK     157488 passes
  4    2580 MB/sec  Pattern 55555555 Result OK     157469 passes
  5    2580 MB/sec  Pattern 33333333 Result OK     157458 passes
  6    2580 MB/sec  Pattern F0F0F0F0 Result OK     157464 passes
 Read
  1    2937 MB/sec  Pattern 00000000 Result OK     358500 passes
  2    2936 MB/sec  Pattern FFFFFFFF Result OK     358500 passes
  3    2937 MB/sec  Pattern A5A5A5A5 Result OK     358500 passes
  4    2936 MB/sec  Pattern 55555555 Result OK     358500 passes
  5    2936 MB/sec  Pattern 33333333 Result OK     358500 passes
  6    2936 MB/sec  Pattern F0F0F0F0 Result OK     358400 passes

                   End at Sat Jan 14 16:00:29 2017


  ######################### Different Parameters ##########################

 Write/Read ./stressintPi64 KB 8 Secs  1
  1    2483 MB/sec  Pattern 00000000 Result OK     151542 passes
 Read
  1    2842 MB/sec  Pattern 00000000 Result OK     346900 passes

 Write/Read ./stressintPi64 KB 40 Secs  1
  1    2582 MB/sec  Pattern 00000000 Result OK      31517 passes
 Read
  1    2776 MB/sec  Pattern 00000000 Result OK      67800 passes

 Write/Read ./stressintPi64 KB 8000 Secs  1
  1    1077 MB/sec  Pattern 00000000 Result OK         66 passes
 Read
  1    1239 MB/sec  Pattern 00000000 Result OK        160 passes
  


Monitoring and Tests Run

CPU temperatures and MHz need to be measured, on running these stress tests, to help identify reasons for performance variations. The main issue is the CPU clock speed being reduced (throttling) if temperature becomes too high. Under Raspbian, the controlling frequency is identified via the measure_clock arm command, which is not available under SUSE. Instead, data from the normal cpuinfo_cur_freq storage location appears to be used. Also, CPU temperature measurement is via a different procedure, where the sensors application needs to be installed.

In this case, MHz and temperature were noted manually, using the watch command with a 1 second sampling delay, as shown below. This updates the measurements on screen, without scrolling. The stress tests were run by starting four copies of the program, each in its own Terminal window. Originally, the procedures were kicked off using a script file.
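A launch script of the kind mentioned might look like the following. This is a hypothetical sketch (the report does not show the original script): it prints the commands for four copies of the FPU stress test, each with its own log number, rather than starting them directly.

```shell
#!/bin/sh
# Emit commands for four copies, one per core, with log numbers 21 to 24.
# Parameters follow the example run command used elsewhere in this report.
launch_cmds() {
    for LOG in 21 22 23 24
    do
        echo "./burninfpuPi64 Kwords 10 Section 2 Minutes 15 Log $LOG &"
    done
}
launch_cmds
```

Piping the output through sh would start the four copies in the background.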

The last RPi 3 tests were run with the system board fitted in a FLIRC enclosure, where the whole aluminium case becomes the heatsink, considerably reducing CPU temperature. This was also used for these tests, where performance remained constant over the test periods. Further tests were run using a copper heatsink (see the earlier stress test report). These were repeated using the on demand and performance CPU frequency options.


 ########################### Measure CPU MHz #############################

 watch -n 1 -p cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

 Every 1.0s: cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

 1200000 (or less)
 1200000
 1200000
 1200000


 ######################### Measure Temperature ###########################

 watch -n 1 sensors

 Every 1.0s: sensors

 bcm2835_thermal-virtual-0
 Adapter: Virtual device
 temp1:        +39.2°C  


  


FPU Stress Tests

Unlike the single core tests, essentially the same performance was demonstrated using the on demand and performance CPU frequency options. With the FLIRC case, performance was also effectively constant over the 15 minute test runs, temperatures still rising but remaining containable.

On using the copper heatsink, performance degraded as the temperature reached 80°C, with CPU MHz reductions being displayed. Note that the MHz and temperature displays were then constantly flickering, with slight variations in readings.

The on demand results confirm that the system did not perform correctly when running the single core tests, as these 4 core tests appear to be at least six times faster.


 Example run command  ./burninfpuPi64 Kwords 10 Section 2 Minutes 15 Log 21 
 Log has 4 lines per minute, test is 8 operations per word, using 40 KB of data

 Minute   FLIRC Performance   FLIRC On Demand     Copper Performance  Copper On Demand
          MFLOPS    °C   MHz  MFLOPS    °C   MHz  MFLOPS    °C   MHz  MFLOPS    °C   MHz

    0             47.0                48.0                46.2                46.2
    1      14564  51.5  1200   14450  55.3  1200   14436  65.5  1200   14391  65.5  1200
    2      14584  53.7  1200   14531  57.5  1200   13264  74.1  1200   14431  72.5  1200
    3      14568  55.8  1200   14374  59.1  1200   14499  77.9  1200   14426  76.8  1200
    4      14565  56.9  1200   14478  60.1  1200   12779  80.6  1181   14397  79.5  1200
    5      14548  58.0  1200   14589  61.2  1200   13411  81.7  1141   13593  81.1  1157
    6      14568  59.1  1200   14519  62.3  1200   12963  81.7  1080   13180  81.7  1110
    7      14556  60.1  1200   14749  62.8  1200   12789  81.7  1055   12923  81.7  1077
    8      14575  60.7  1200   14570  63.4  1200   12781  81.7  1038   12906  81.7  1072
    9      14578  60.7  1200   14483  63.9  1200   12634  81.7  1037   12826  81.7  1066
    10     14575  61.8  1200   14431  64.5  1200   12781  81.7  1032   12926  81.7  1060
    11     14555  62.8  1200   14529  65.0  1200   12556  81.7  1026   12666  82.2  1049
    12     14570  63.4  1200   14504  65.5  1200   12560  82.2  1031   12580  82.2  1040
    13     14554  63.4  1200   14533  66.1  1200   12460  82.2  1021   12539  82.2  1041
    14     14561  63.9  1200   14452  66.6  1200   12383  82.2  1022   12568  82.2  1042
    15     14597  64.5  1200   14275  66.6  1200   12445  82.2  1017   12650  82.2  1033

 Average   14566               14514               13021               13311
 Min       14548  51.5  1200   14275  55.3  1200   12383  65.5  1017   12539  65.5  1033
 Max       14597  64.5  1200   14749  66.6  1200   14499  82.2  1200   14431  82.2  1200
 Min/Max    1.00  0.80  1.00    0.97  0.83  1.00    0.85  0.80  0.85    0.87  0.80  0.86

Speed and Temperature

    



Integer Stress Tests

As for the floating point tests, the same performance was demonstrated using the on demand and performance CPU frequency options, and performance was constant (within 8%) using the FLIRC case. Surprisingly, the integer tests produced lower CPU throttling MHz than the floating point program, with associated higher temperatures and slower performance.


 Example run command - ./stressintPi64 KB 40 Seconds 80 Log 31
    12 tests at 80 seconds each, 16 minutes overall, results displayed every 10 seconds

   Test   FLIRC Performance   FLIRC On Demand     Copper Performance  Copper On Demand
          MB/sec    °C   MHz  MB/sec    °C   MHz  MB/sec    °C   MHz  MB/sec    °C   MHz

    0             42.9  1200          42.9  1200          46.2  1200          47.6  1200
    1      10116  59.2  1200   10071  59.8  1200    9863  78.9  1200    9901  75.6  1200
    2      10104  62.0  1200    9645  62.3  1200    9954  80.6  1197    9978  81.7  1138
    3      10085  63.4  1200    9653  63.4  1200    8640  82.2   991    8700  81.7   983
    4       9900  64.9  1200   10071  65.0  1200    8162  82.7   955    8219  82.7   964
    5      10097  66.6  1200   10065  66.9  1200    7983  82.7   936    8030  82.7   939
    6       9825  67.1  1200    9655  66.6  1200    7901  82.7   930    7952  82.7   936
    7      10501  66.4  1200   10434  67.3  1200    8539  82.7   962    8675  82.7   968
    8      10691  68.2  1200    9996  68.6  1200    8267  83.3   927    8414  82.7   932
    9      10249  68.8  1200    9564  68.8  1200    7998  83.3   893    8127  82.7   907
    10     10463  69.7  1200   10446  70.6  1200    7887  83.3   875    8067  83.3   894
    11     10447  70.3  1200    9998  70.5  1200    7876  83.3   887    8059  83.3   857
    12     10495  70.9  1200    9807  69.8  1200    7923  83.3   871    8080  83.3   890

 average   10248                9950                8416                8517
 min        9825  59.2  1200    9564  59.8  1200    7876  78.9   871    7952  75.6   857
 max       10691  70.9  1200   10446  70.6  1200    9954  83.3  1200    9978  83.3  1200
 min/max    0.92  0.84  1.00    0.92  0.85  1.00    0.79  0.95  0.73    0.80  0.91  0.71

Speed and Temperature
   


SUSE Linux Enterprise Server (SLES)

The stress tests were repeated using SLES, to see if the same on demand and performance settings were provided. They were, with default again being on demand, and measured speed variations no different.

Following are average speeds of stressintPi64, running 6 minute tests at 8 KB, using 1, 2 and 4 cores. With the performance setting, MB/second results per core were the same, except for a little degradation due to heat effects with 4 threads. On the other hand, the default on demand option produced better performance per core as the load increased. That seems to be the wrong way round.


             On Demand     Performance

 Program   Total Average  Total  Average
  Copies  MB/sec  MB/sec  MB/sec  MB/sec
 
     1      2079    2079    2758    2758
     2      4651    2325    5519    2759
     4     10811    2703   10806    2701

  


Assembly Code or Meanderings of an Octogenarian Ancient GEEK

Below are the assembly instructions used for the floating point stress tests. The source code has inner loops with 2, 8 or 32 arithmetic operations, but the compiler unrolls the loops to make full use of the quad word registers. The first two are unrolled a further four times, resulting in four (x 4 way) loads at the start and four (x 4 way) stores at the end, with calculations in between. It seems that there are insufficient registers for the 32 operations test, which ends up with alternating load and arithmetic instructions. The result is that the fastest code is via 8 operations per word.

Of particular interest are vector fmla and fmls instructions, or fused multiply and add or subtract, with the potential of producing 8 operations per CPU clock cycle. Best case here was 3.19 MFLOPS/MHz, for 1 core and 12.16 with 4 cores.
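The arithmetic behind those figures can be spelled out as follows. This is a check on the numbers quoted above, not benchmark code; the MFLOPS values are the best cases from this report.

```c
#include <assert.h>

/* A 4 lane fmla or fmls performs a multiply plus an add (or subtract)
   in each lane, so one such instruction per clock cycle would give
   8 floating point operations per cycle. */
double peak_flops_per_cycle(void)
{
    return 4.0 * 2.0;                   /* 4 lanes x (multiply + add) */
}

/* MFLOPS/MHz equals operations per clock cycle actually achieved. */
double mflops_per_mhz(double mflops, double mhz)
{
    return mflops / mhz;                /* 3832 and 14597 MFLOPS at 1200 MHz */
}
```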

The integer stress test, read only section, has an inner loop that loads 32 four byte words, at a maximum speed of 2937 MB/second or 734 MW/second. With 32 arithmetic adds or subtracts, MOPS (Million Operations Per Second) is also 734. Including the 32 arithmetic operations, 32 data loads, indexing and loop overheads, a total of 99 instructions is used, or about 3 per word loaded, leading to an execution speed of 2202 MIPS (Million Instructions Per Second), at 1.835 MIPS/MHz.
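That MIPS calculation, using the rounded figures from the text, works out as below; again this is just a check on the arithmetic, not benchmark code.

```c
#include <assert.h>

/* 2937 MB/second of 4 byte words is about 734 million words/second. */
double million_words_per_sec(double mb_per_sec)
{
    return mb_per_sec / 4.0;            /* 4 bytes per word */
}

/* At about 3 instructions per word loaded, 734 MW/second is 2202 MIPS,
   and dividing by the 1200 MHz clock gives 1.835 MIPS/MHz. */
double mips_per_mhz(double mwords, double inst_per_word, double mhz)
{
    return mwords * inst_per_word / mhz;
}
```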


 Ops/word        2      2      8      8     32     32
              inst    ops   inst    ops   inst    ops

 overheads       4      4      4      4      4      4
 fadd            4     16     12     48     11     44
 fmla                          4     32      5     40
 fmls                          4     32      5     40
 fmul            4     16      4     16      1      4
 ldr             4     16      4     16     16     64
 str             4     16      4     16      1      4

 Total          20     68     36    164     43    200

 arithmetic            32           128           128
     %                 47            78            64
 unroll                16            16             4

 Overheads   = 2 adds, compare, branch
 Example 64 Bit Instructions
               fxxx    v16.4s, v18.4s, v10.4s
               ldr     q0,  [sp, 272]
               str     q16, [x7, x4]
  



Roy Longbottom, January 2017



The Official Internet Home for my Benchmarks is via the link
Roy Longbottom's PC Benchmark Collection