Raspberry Pi 3B and 3B+ High Performance Linpack and Error Tests

Roy Longbottom

Contents

Background
Original Pre-Compiled HPL
ATLAS BLAS HPL Compilation
Performance Expectations
Multiple Thread Benchmark Results
Results Raspbian Stretch & Jessie
Raspberry Pi 3B+ HPL Results
Raspberry Pi 3B HPL Results
Stress Tests Using All Four Cores
Pi 3B Raspbian Stretch Stress Tests
Pi 3B Raspbian Jessie Stress Tests
3B+ Raspbian Stretch Stress Tests
3B+ Raspbian Jessie Stress Tests
Pi 3B 64 Bit Gentoo Stress Tests
Pi 3B+ 64 Bit Gentoo Stress Tests

Summary

In 2016, I ran a precompiled version of the High Performance Linpack Benchmark on my Raspberry Pi 3B. As seen by others, this indicated that the program could produce wrong and inconsistent numeric results, and also system crashes. I decided to repeat this exercise using a later Raspbian release, to run a later version of the program, compiled to use ATLAS (an alternative set of Basic Linear Algebra Subprograms), and to provide comparisons with the newer Raspberry Pi 3B+. The ATLAS based benchmark had to be built from scratch, taking an unbelievable 14 hours on the Pi 3B+, including hundreds of MFLOPS speed tuning calculations.

I also decided to see if I could reproduce the failures using my stress testing programs, particularly running four copies of one that carries out floating point calculations. At the time of writing this report, I was unable to find a 64 bit HPL benchmark for the Raspberry Pi but decided, in the event of failures occurring in the 32 bit stress tests, to also run the 64 bit versions under Gentoo. Voltage might be particularly important, as a claimed solution to the original Pi 3 failures was to set an overvoltage parameter in the booting configuration file. Voltage changes again came into play on the later Raspbian Stretch release and Pi 3B+, where CPU MHz was reduced at a lower temperature than previously encountered.

It should be noted that the results reproduced here are for my particular systems and may not be representative of other Raspberry Pi configurations. To identify possible causes of failures, my CPU MHz, core volts and temperature monitoring program was run at the same time as the other benchmarks. Both the Pi 3B and 3B+ boards were installed in FLIRC cases, that provide efficient cooling arrangements.

HPL Tests - Four data sizes were used, controlled by N at 1000, 2000, 4000 and 8000, running via 1, 2 and 4 cores (possibly not intended to be used that way). My program that monitors CPU MHz, core voltage and temperature was mainly run at the same time. Result sumchecks were noted. These were expected to differ depending on N, but to be constant at a given N, independent of the number of cores used.

HPL Older Pi 3B - Initially, running both the original HPL benchmark and ATLAS version, via Raspbian Jessie and Stretch Operating Systems, gave rise to wrong numeric sumchecks or system crashes at all data N sizes. The only failures noted were on using all four cores, but performance using fewer cores often did not achieve anticipated performance levels. Using the later recommended over voltage setting, errors and crashes only occurred using the largest N=8000 setting, the exception being via Stretch and ATLAS, where performance could be expected to be slower than using the original HPL (see 3B+) and at lower maximum CPU temperatures. Also, some of the failures were noted after short running times, when recorded temperatures were low.

HPL 3B+ - There were no sumcheck failures or system crashes using the Pi 3B+. Sumchecks were identical using 1, 2 and 4 cores, except when using an alternative procedure to specify how many should be used. There were no high temperatures (FLIRC case effect?) but, unexpectedly, using Stretch, core voltage and CPU MHz were reduced at 60C, producing slower performance that could be no better than that from the old Pi 3B. The default temperature limit had been lowered from 70C with Stretch. The 70C limit applied under Jessie, with a continuous 1400 MHz, temperature well under 70C, and the highest voltage. The fix for Stretch was another booting option. With this, the MHz and voltage were constant and temperatures reasonable. These tests confirmed that, at N=8000, the ATLAS version was slower than the original HPL benchmark.

Stress Tests - I ran my floating point stress tests, nominally in 15 minute sessions, using four independent programs, attempting to reproduce the HPL Pi 3B failures. After limited testing, I found that wrong numeric calculations and system crashes could occur when each program used 160 KB of data, overfilling the shared L2 cache. The performance impact of the latter was confirmed by running 1, 3 and 4 programs, where three improved throughput by nearly 3 times, but throughput was slower using all four cores.

Pi 3B - These sessions were run without the power boost, the main tests being under Jessie, with Stretch essentially producing the same performance, but at slightly different recorded voltages. Results from four tests indicated rather strange results, all starting at 1200 MHz with some throttling at 80C, mainly at a constant 1.2625 volts. Test 1 had 3 errors. Test 2 was hotter, with MHz throttling, but higher total MFLOPS and 25 errors. Test 3, after rebooting, had more typical speed, higher 1.2750 volts and throttling, but no errors. Test 4, after power off/on, produced results similar to Test 1.

Pi 3B+ - No data comparison failures were detected on this system. Initially under Stretch, performance was degraded due to CPU MHz being reduced, as with HPL, from 1400 to 1200 but voltage slightly different, from 1.3563 to 1.2500 at 60C. After implementing the 70C limit option, noted CPU MHz was a constant 1400 MHz, with voltage briefly indicating 1.2500, near the end of the tests. Using Jessie, temperatures were similar to the first Stretch tests, but 1400 MHz was recorded continuously. Then, voltages appeared to increase slightly, from 1.3375 to 1.3438, above 60C. Measured MFLOPS performance appeared to be slower than when using Stretch, maybe due to different arrangement with cached data.

64 Bit Version - Considering the old Pi 3B, running stand alone, the 64 bit benchmark indicated performance 28% faster than the 32 bit version but was slower running four copies of the program, becoming faster after repetitive runs (fewer RAM accesses?). Then, above 74C at 1.2625 volts, data comparison failures were detected, the system crashing at 76.3C. Following power off/on, there were no errors, with slow performance, not reaching 74C. There were no detected errors using the Pi 3B+. As during the HPL Jessie tests, no excessive temperatures were recorded, voltage increased at 60C and MHz was a constant 1400, but four core MFLOPS was particularly slow.


Background

In 2016 I ran a precompiled version of High Performance Linpack (HPL) on my Raspberry Pi 3. As seen by others, this indicated that the program could produce wrong and inconsistent numeric results, and also system crashes. I ran with commands that specified the use of 1, 2 and 4 CPU cores, the failures only occurring using the latter. See details in the Raspberry Pi Forum.

The errors were indicated in the topic Pi3 incorrect results under load (possibly heat related). Here, it is indicated that the area and frequency of failures can vary considerably from one Raspberry Pi 3 to another, some appearing to be due to heating effects and others to power fluctuations. For the latter, a config.txt power setting over_voltage=2 was said to avoid the problem, but not so in other cases. In my case, the power setting improved the failure rate.

Instructions for using the benchmark are (at this time) available from this howtoforge.com tutorial. That document provides details of how to download, compile and run the benchmark, a procedure that I preferred, hopefully to include compile options to use later technology than those available for an earlier pre-compiled version. Unfortunately, this required installing MPICH2 (message-passing for distributed-memory applications), where the specified link failed to provide the package.

The benchmarks were run on both the Raspberry Pi 3B and 3B+. Both systems were installed in FLIRC aluminium cases, which have a built in heatsink. These lead to much lower CPU temperatures than standard plastic varieties, as demonstrated in Raspberry Pi 3B+ 32 bit and 64 bit Benchmarks and Stress Tests.pdf from ResearchGate and Raspberry Pi 2 and 3 Stress Tests (htm) archived copy. The stress tests were repeated, using different parameters, successfully reproducing incorrect numeric results and system crashes similar to those observed on running the HPL programs.


Original Pre-Compiled Version

Later comments in the howtoforge article provided a reminder of how to download and run the pre-compiled version. I wanted this to see if the errors occurred running on a Raspberry Pi 3B+ (my original copy was no longer available). This one uses a different MPICH, appropriate instructions being:

 sudo apt-get install libmpich-dev                          
 wget http://web.eece.maine.edu/~vweaver/junk/pi3_hpl.tar.gz
 tar -xvzf pi3_hpl.tar.gz                                   
 chmod +x xhpl                                              
 ./xhpl                                                     
 
The ./xhpl run command automatically uses all four Pi 3B CPU cores. Besides this, I ran tests using taskset 0x00000001 ./xhpl and taskset 0x00000003 ./xhpl to run using 1 and 2 cores. These are important to show that the provided sumcheck indicates the same numeric results on varying the number of processors used.
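
As a rough illustration of how the taskset masks above map to cores, the following Python sketch decodes a hexadecimal affinity mask into the list of CPU numbers it enables. The function name mask_to_cores is hypothetical, chosen for this example only, and is not part of taskset or the benchmark.

```python
def mask_to_cores(mask):
    """Return the CPU numbers enabled by a taskset-style affinity bit mask."""
    cores = []
    cpu = 0
    while mask:
        if mask & 1:  # bit n set means CPU n may be used
            cores.append(cpu)
        mask >>= 1
        cpu += 1
    return cores


# The two masks used in the text, plus the mask that would allow all four cores.
for mask in (0x00000001, 0x00000003, 0x0000000F):
    print(f"0x{mask:08X} -> cores {mask_to_cores(mask)}")
```

On this reading, 0x00000001 restricts the program to core 0 and 0x00000003 to cores 0 and 1.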


ATLAS - Compiling Version Using Different BLAS (unbelievably slow to produce)

Another version of HPL was found, available from Building HPL and ATLAS for the Raspberry Pi, where ATLAS provides an alternative implementation of the Basic Linear Algebra Subprograms. The build was completed after 14 hours on a Pi 3B+, and included hundreds of MFLOPS tuning calculations. Of little avail, resultant performance was worse than the original and still produced errors using the old model 3B. The command to run this version, which uses all four cores, is as follows, along with an example of that used initially to exercise fewer cores:

 mpiexec -f nodes-1pi ./xhpl                   
 mpiexec -f nodes-1pi taskset 0x00000003 ./xhpl
 
Further probing indicated that the supplied file nodes-1pi had four identical localhost statements, to use four cores, but this was dependent on P and Q values, each of 2, in file HPL.dat. Changing these to one localhost and 1,1 for P,Q, then two localhosts and 2,1, provided an alternative way to run using 1 or 2 cores. As shown later, these produced different execution speeds and numeric results to those using taskset.
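
The relationship described above can be sketched as a simple consistency check: the P x Q process grid in HPL.dat should equal the number of host entries in the nodes file. This Python fragment is illustrative only. The function name is hypothetical, and it assumes one process is started per localhost line, as the text implies.

```python
def check_hpl_layout(p, q, node_lines):
    """True when the P x Q process grid matches the number of host entries."""
    hosts = [line.strip() for line in node_lines if line.strip()]
    return p * q == len(hosts)


# The three combinations described in the text.
layouts = [
    (2, 2, ["localhost"] * 4),  # four cores
    (1, 1, ["localhost"]),      # one core
    (2, 1, ["localhost"] * 2),  # two cores
]
for p, q, hosts in layouts:
    print(f"P={p} Q={q} hosts={len(hosts)} consistent={check_hpl_layout(p, q, hosts)}")
```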


Performance Expectations

My expectations of variance in performance are reflected in Intel Atom HPL results, provided in a Raspberry Pi Forum topic. This shows performance increasing by more than 1.5 times and less than twice, on doubling the number of cores used, then performance increasing on using larger data N sizes. Also shown are numeric result sumchecks varying on increasing data N size, but constant irrespective of the number of threads used, at a particular size. In other circumstances, performance can be degraded if use of higher level slower memory is required or extended running time leads to reduced CPU MHz due to high temperatures.

These relationships are confirmed by results from my MP MFLOPS benchmark in MultiThreading Benchmarks.pdf, as shown in Table 1 below, measuring performance using 1, 2, 4 and 8 threads, the first set being for normal operation. The others were run to see what happens if the number of CPU cores used is restricted to one and two, via taskset parameters. In the one core case, speeds using 1 to 8 threads were reasonably constant and similar to the earlier one thread test (maybe a slight overhead). With two cores, 2, 4 and 8 thread speeds were effectively the same. For all three runs, numeric sumchecks were identical.



Example Benchmark Running Multiple Threads Using 1, 2 and 4 Cores

When CPU speed limited, and using all cores, doubling the number of threads, up to four, improved MFLOPS throughput by nearly twice.

Specifying that only one core should be used, produced constant performance with variable numbers of threads, defined by the program.

Specifying two cores produced constant performance on using 2, 4, and 8 threads.

Note constant sumchecks.


              Table 1 MP-MFLOPS Benchmark

 Add and Multiply instructions using 1, 2, 4, 8 Threads 
 Each thread deals with a dedicated segment of data.
 Identical data is initialised in each word and the same
 calculations applied. This leads to constant sumchecks, 
 independent of the number of threads used. Sumchecks 
 vary due to different number of calculations, depending
 on data size. Third column MFLOPS depend on RAM speed.

             2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800

          All Cores Command ./MP-MFLOPSPiA7g6
 
 MFLOPS
 1T      213     213     186     783     798     779
 2T      410     424     340    1555    1587    1506
 4T      681     731     372    3064    3008    2941
 8T      783     788     407    3083    3135    2849
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951
 2T    76406   97075   99969   66015   95363   99951
 4T    76406   97075   99969   66015   95363   99951
 8T    76406   97075   99969   66015   95363   99951

 1 Core Command taskset 0x00000001 ./MP-MFLOPSPiA7g6

 MFLOPS
 1T      210     209     182     784     783     760
 2T      209     208     182     783     783     757
 4T      209     208     182     783     782     757
 8T      208     209     181     781     782     758
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951
 2T    76406   97075   99969   66015   95363   99951
 4T    76406   97075   99969   66015   95363   99951
 8T    76406   97075   99969   66015   95363   99951

 2 Cores Command taskset 0x00000003 ./MP-MFLOPSPiA7g6

 MFLOPS
 1T      206     208     182     795     783     759
 2T      407     416     334    1568    1566    1506
 4T      410     395     343    1567    1540    1509
 8T      372     370     344    1562    1535    1506
 Results x 100000
 1T    76406   97075   99969   66015   95363   99951
 2T    76406   97075   99969   66015   95363   99951
 4T    76406   97075   99969   66015   95363   99951
 8T    76406   97075   99969   66015   95363   99951
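
The scaling pattern in Table 1 can be checked with a few lines of arithmetic. This Python sketch uses the all cores MFLOPS figures from the 32 Ops/Word, 12.8 KB column and reports the gain on each doubling of the thread count. The function name is hypothetical.

```python
# All cores, 32 Ops/Word, 12.8 KB column of Table 1 (threads -> MFLOPS).
all_cores_mflops = {1: 783, 2: 1555, 4: 3064, 8: 3083}


def doubling_gains(results):
    """Return the speed ratio for each step from one thread count to the next."""
    threads = sorted(results)
    return {
        f"{a}T->{b}T": round(results[b] / results[a], 2)
        for a, b in zip(threads, threads[1:])
    }


for step, gain in doubling_gains(all_cores_mflops).items():
    print(f"{step}: x{gain}")
```

The gains are close to two up to four threads, then level off at eight threads, as there are only four cores.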
  



Raspberry Pi 3B+ HPL Results

Table 2 provides results run via Raspbian Stretch on a Raspberry Pi 3B+, with N values of 1000, 2000, 4000 and 8000 defined in file HPL.dat (and 256 NBs), as used for earlier runs. Variations in these parameters (such as 8192 and 128) made little difference in performance. Three results from tests that use all four cores are provided to show some possible variations. All tests ran without any wrong results or system crashes.
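
To give a feel for what the N values imply, the following Python sketch estimates the memory needed for the N x N matrix and the approximate running time at a given speed. It assumes 8 byte double precision elements and the operation count of 2/3 N^3 + 3/2 N^2 conventionally used by HPL to report speed. The function names are hypothetical and the figures are estimates, not measurements.

```python
def hpl_matrix_mbytes(n):
    """Approximate size of the N x N double precision matrix in MB (10**6 bytes)."""
    return n * n * 8 / 1e6


def hpl_flops(n):
    """Assumed HPL floating point operation count for problem size N."""
    return (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2


def hpl_seconds(n, mflops):
    """Estimated solve time in seconds at the given MFLOPS rate."""
    return hpl_flops(n) / (mflops * 1e6)


for n in (1000, 2000, 4000, 8000):
    print(f"N={n}: matrix about {hpl_matrix_mbytes(n):.0f} MB")

# Example: N=8000 at 5769 MFLOPS, one of the four core results in Table 2.
print(f"N=8000 at 5769 MFLOPS: about {hpl_seconds(8000, 5769):.0f} seconds")
```

On these assumptions, N=8000 needs about 512 MB for the matrix alone, a large share of the memory on these boards, which is one reason larger sizes were not practical.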

The original HPL program run command included taskset, to restrict the number of cores used. This produced consistent sumchecks at a particular N size. Particularly at the smaller data sizes, expected performance differences between tests using 1, 2 and 4 cores are not demonstrated, but the results can be useful for other comparisons. The three sets of tests produced similar MFLOPS speeds. The third one followed a boot including a new parameter that reduced CPU MHz at a higher temperature (see Table 3).

The first ATLAS HPL tests were run using the taskset parameter (last 2 columns). The performance pattern was similar to that from the original HPL benchmark but, in spite of a later compilation and all those MFLOPS calibration tests in setting it up, speed was much slower. The N 8000 performance was affected by the lower MHz (see table 3).

The other ATLAS tests made use of the nodes-1pi and HPL.dat parameters, mentioned earlier, to control the number of cores used. This produced different sumchecks using 1 and 2 cores, possibly invalidating the apparent higher performance.

                   Table 2 3B+ Results using Raspbian Stretch

            Original HPL taskset       ATLAS HPL input params  ATLAS HPL taskset
  N  Cores      MFLOPS       SumCheck    MFLOPS     SumCheck   MFLOPS  SumCheck
    
 1000  1    76    79    78  0.0052233  1066  1057  0.0066595      31  0.0069506
 1000  2   177   149   172  0.0052233  1244  1220  0.0066480     237  0.0069506
 1000  4  2586  2637  2650  0.0052233  1504  1458  0.0069506    1496  0.0069506
       4  2608  2606  2661  0.0052233  1479  1512  0.0069506    1504  0.0069506
       4  2451  2660  2642  0.0052233  1481  1440  0.0069506    1499  0.0069506

 2000  1   226   227   226  0.0044702  1330  1331  0.0042812     118  0.0050602
 2000  2   518   430   519  0.0044702  1767  1768  0.0043077     755  0.0050602
 2000  4  3844  4047  3997  0.0044702  2434  2463  0.0050602    2448  0.0050602
       4  3906  4046  4015  0.0044702  2469  2461  0.0050602    2479  0.0050602
       4  3862  4066  4056  0.0044702  2390  2449  0.0050602    2461  0.0050602

 4000  1   626   623   623  0.0029620  1474  1475  0.0033552     392  0.0028653
 4000  2  1228  1253  1251  0.0029620  2397  2282  0.0033594    1310  0.0028653
 4000  4  4966  5205  5238  0.0029620  3199  3435  0.0028653    3376  0.0028653
       4  5169  5249  5154  0.0029620  3426  3401  0.0028653    3369  0.0028653
       4  5004  5232  5202  0.0029620  3327  3328  0.0028653    3385  0.0028653

 8000  1  1182  1168  1167  0.0025941  1571  1568  0.0022596     786  0.0024910
 8000  2  1957  2200  2149  0.0025941  2654  2635  0.0022581    1782  0.0024910
 8000  4  5769  5735  5813  0.0025941  3894  3815  0.0024910    3010  0.0024910
       4  5795  5417  5835  0.0025941  3962  4076  0.0024910    3293  0.0024910
       4  5809  5063  5607  0.0025941  4063  4022  0.0024910    3294  0.0024910

The following results (columns marked XXX in Table 3) were from tests on a RPi 3B+ in a FLIRC case, at N=8000 via 4 cores, after three or four runs. They demonstrated that excessive temperatures were not produced. However, unlike running via Raspbian Jessie, using Stretch, CPU MHz was reduced from 1400 to 1200 above temperatures of 60C, compared with 70C as originally specified for the Pi 3B+. As described here, this change was included as the Stretch default to avoid problems using unstable boards (or inadequate cooling). The solution was to include temp_soft_limit=70 in the /boot/config.txt file. Results using this limit change are included below.

In case it means anything, note the differences in recorded voltages, where even using Jessie, a slight increase was indicated above 60C. Then, under Stretch, it was decreased by 7.5% to run at the lower MHz, then increased slightly for the constant 1400 MHz, all being lower than that measured with the Jessie tests. The MFLOPS speeds are representative of likely performance differences without and with the config.txt change.

   Table 3 Inconsistent Four Core Performance (samples over test periods)

  ------- Stretch Original -------   --------- Stretch ATLAS ---------   Jessie Original

  MFLOPS       5063           5607                 3120           3914           5311
                     
  XXX                   limit=70      XXX                   limit=70 
                      All 1400 MHz                        All 1400 MHz   All 1400 MHz
  Volts   MHz    C    Volts    C    Volts   MHz    C    Volts    C    Volts    C

 1.3500  1400  53.7   1.3563  49.4   1.3500  1400  56.9   1.3563  53.7   1.3875  53.7
 1.3500  1400  57.5   1.3563  49.9   1.3500  1400  60.1   1.3563  55.8   1.3875  56.9
 1.3500  1399  58.0   1.3563  55.9   1.2563  1200  60.1   1.3563  58.0   1.3875  58.0
 1.3500  1400  59.1   1.3563  59.1   1.2563  1200  60.1   1.3563  60.7   1.3875  58.5
 1.3500  1400  59.1   1.3563  60.1   1.2563  1200  60.7   1.3563  62.3   1.3875  59.1
 1.2563  1200  59.6   1.3563  61.2   1.2563  1200  60.1   1.3563  62.3   1.3938  60.1
 1.3500  1400  59.1   1.3563  62.3   1.2563  1200  60.1   1.3563  62.3   1.3938  60.7
 1.3500  1200  60.1   1.3563  62.3   1.2563  1200  60.1   1.3563  62.8   1.3938  60.7
 1.2563  1200  59.1   1.3563  62.8   1.2563  1200  60.1   1.3563  63.4   1.3938  61.2
 1.2563  1200  60.1   1.3563  63.4   1.2563  1200  60.1   1.3563  62.8   1.3938  62.3
 1.3500  1200  59.6   1.3563  63.9   1.2563  1199  60.7   1.3563  63.4   1.3938  62.3
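
As a rough cross-check of Table 3, the following Python sketch compares the drop in measured MFLOPS, without and with the temp_soft_limit=70 setting, against the 1200/1400 clock ratio. The variable and function names are hypothetical. The throttled runs were not at 1200 MHz throughout, so only approximate agreement should be expected.

```python
def ratio(slow, fast):
    """Return slow/fast rounded to three decimal places."""
    return round(slow / fast, 3)


clock_ratio = ratio(1200, 1400)      # throttled MHz / full MHz
original_ratio = ratio(5063, 5607)   # Stretch Original MFLOPS, XXX / limit=70
atlas_ratio = ratio(3120, 3914)      # Stretch ATLAS MFLOPS, XXX / limit=70

print(f"Clock ratio        {clock_ratio}")
print(f"Original HPL ratio {original_ratio}")
print(f"ATLAS HPL ratio    {atlas_ratio}")
```

The original HPL run lost less than the clock ratio suggests, in line with its samples showing 1400 MHz for part of the time, while the ATLAS run lost somewhat more.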



Raspberry Pi 3B HPL Results (With Errors and Crashes)

Besides under Raspbian Stretch, the original HPL code was run via the earlier Jessie Operating System and performance was essentially identical. Using normal voltage settings, all tests ran successfully using 1 and 2 cores and all failed utilising all four cores, in some rather random cases producing invalid final numeric results, or causing a system crash (frozen display) that needed a reboot. Of particular note, runs producing wrong numeric results indicated the same MFLOPS speeds as successful runs. The benchmarks were also run using the over volts setting, when the only failures were at N=8000. The ATLAS version was also run, with the same failures as normal runs, but did achieve complete success using over voltage. Table 5, below, shows CPU temperatures, core voltage and MHz recorded during some of the tests.

Note that these issues apply to my particular systems, other users having reported different behaviour.

                     Table 4 Old Pi 3B Performance, Errors and Crashes

            Stretch Original HPL      Jessie  Original HPL      Stretch ATLAS HPL
            Normal       Volts+       Normal       Volts+       Normal       Volts+
  N  Cores  MFLOPS Sumck MFLOPS Sumck MFLOPS Sumck MFLOPS Sumck MFLOPS Sumck MFLOPS Sumck

 1000  1        80   OK      79   OK      82   OK      78   OK     970   OK     978   OK
 1000  2       178   OK     159   OK     158   OK     177   OK    1124   OK    1097   OK
 1000  4      2468  NO*    2477   OK    2494  NO*    2512   OK    1390  NO*    1385   OK
       4      2426  NO*    2479   OK    2464   OK    2527   OK    1385   OK    1374   OK
       4                   2445   OK    2524   OK    2499   OK    1387   OK    1362   OK

 2000  1       221   OK     222   OK     219   OK     225   OK    1139   OK    1142   OK
 2000  2       506   OK     439   OK     472   OK     496   OK    1609   OK    1613   OK
 2000  4     CRASH         3727   OK    3804  NO*    3799   OK    2159  NO*    2226   OK
       4      3775  NO*    3799   OK    3747   OK    3810   OK    2259  NO*    2260   OK
       4                   3782   OK    3826  NO*    3797   OK    2267  NO*    2215   OK

 4000  1       602   OK     598   OK     601   OK     601   OK    1294   OK    1292   OK
 4000  2      1298   OK    1311   OK    1310   OK    1057   OK    2039   OK    2025   OK
 4000  4     CRASH         4831   OK   CRASH         4864   OK   ERROR         3077   OK
       4     CRASH         4813   OK   CRASH         4873   OK   CRASH         3119   OK
       4                   4736   OK                 4777   OK                 3142   OK

 8000  1      1085   OK    1088   OK    1094   OK    1096   OK    1358   OK    1354   OK
 8000  2      1996   OK    2009   OK    2036   OK    2027   OK    2278   OK    2355   OK
 8000  4     CRASH         5056  NO*   CRASH        CRASH        ERROR         3514   OK
       4                                                         CRASH         3620   OK
       4                                                                       3665   OK

         NO*   SumCheck  such as  86232467 5841191 or 1281583765 12822
          OK   See 3B+ Sumcheck results
       ERROR   Fatal error indication
       CRASH   Frozen display  reboot required
 
The following were from tests on the older RPi 3B in a FLIRC case, at N=8000 via 4 cores, after three or four runs. The samples are at around 4 second intervals.

The first example indicates that a system crash does not appear to be caused by a high temperature. The other two, for completely successful runs, have the config.txt over_voltage=2 setting (note higher voltage) with constant voltage and MHz. For these, CPU temperatures are high but do not reach the point where CPU MHz is throttled.

                Table 5 Raspberry Pi 3B High Temperatures

     Original HPL CRASH       Original HPL              ATLAS HPL
     Normal Volts             Over Volts                Over Volts
     
     MHz   Volts      C       MHz   Volts      C       MHz   Volts      C

    1200  1.2563    44.5      1200  1.3188    52.6      1200  1.3188    59.1
    1200  1.2563    48.3      1200  1.3188    61.2      1200  1.3188    61.2
    1200  1.2563    50.5      1200  1.3188    65.5      1200  1.3188    65.5
    1200  1.2563    51.5      1200  1.3188    68.8      1200  1.3188    65.5
    1200  1.2563    51.5      1200  1.3188    70.9      1200  1.3188    67.7
    1200  1.2563    53.2      1200  1.3188    71.4      1199  1.3188    67.7
                              1200  1.3188    73.1      1200  1.3188    68.8
                              1200  1.3188    74.1      1200  1.3188    69.3
                              1200  1.3188    75.2      1200  1.3188    70.4
                              1200  1.3188    76.8      1199  1.3188    70.4
                              1200  1.3188    76.3      1200  1.3188    70.9
                              1200  1.3188    77.4      1200  1.3188    71.4
                              1200  1.3188    78.4      1200  1.3188    72.5
                              1200  1.3188    79.0      1200  1.3188    72.5
                              1200  1.3188    79.0      1200  1.3188    72.0
                              1200  1.3188    79.5      1200  1.3188    73.1
                              1200  1.3188    79.5      1200  1.3188    73.1
                              1200  1.3188    79.5      1199  1.3188    73.6
                              1200  1.3188    75.2      1200  1.3188    73.6
                              1200  1.3188    71.4      1200  1.3188    73.6



Stress Tests Using All Four Cores

As High Performance Linpack errors were only shown when using all four Raspberry Pi 3 cores, it was decided to determine if failures could be identified using my floating point stress testing programs. Descriptions and results of these can be found in Raspberry Pi 3B+ 32 bit and 64 bit Benchmarks and Stress Tests.pdf from ResearchGate and Raspberry Pi 2 and 3 Stress Tests (htm) archived copy.

Four copies of the stress test program were run at the same time, along with another that measures CPU MHz, core voltage and temperature. In my case, the tests could be run, via Raspbian Stretch, using the following commands in a script file, but separate terminal windows had to be opened, for Raspbian Jessie, and individual commands used. The main programs were run using 40k words or 160k bytes each, with total address accesses greater than the shared L2 cache size. Then section 2 is specified, for 8 floating point operations per data word, running for a minimum of 15 minutes. On starting, the number of passes is calibrated to produce 15 second tests and a final numeric result for checking purposes, identical data and calculations being used for each data word. These results are displayed and logged on an ongoing basis.
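
The cache argument above can be put as simple arithmetic. This Python sketch compares the combined data size of several program copies with the shared L2 cache, assuming the 512 KB L2 size usually quoted for the Pi 3 processor (an assumption not stated in this report). The names are hypothetical.

```python
L2_CACHE_BYTES = 512 * 1024  # assumption: shared L2 size usually quoted for the Pi 3
WORDS_PER_PROGRAM = 40_000   # "Kwds 40" in the run commands
BYTES_PER_WORD = 4           # 4 byte words, as shown in the log headings


def working_set_bytes(programs):
    """Combined data size for the given number of program copies."""
    return programs * WORDS_PER_PROGRAM * BYTES_PER_WORD


def fits_in_l2(programs):
    """True if the combined data could fit in the assumed shared L2 cache."""
    return working_set_bytes(programs) <= L2_CACHE_BYTES


for copies in (1, 3, 4):
    size = working_set_bytes(copies)
    print(f"{copies} copies: {size} bytes, fits in L2: {fits_in_l2(copies)}")
```

On this assumption, three copies (480,000 bytes) could fit in L2, while four copies (640,000 bytes) cannot, consistent with the slower throughput reported when using all four cores.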

As all the programs cannot be started at the same time, later running times cannot be a constant 15 seconds per pass, and this can be affected by multiprocessing overheads and CPU clock speed reductions. This also introduces complications in synchronising MFLOPS speed calculations with the measured MHz. Furthermore, the way in which the Operating System handles an over utilised L2 cache can change the running time. In some cases, the OS appears to improve access efficiency to produce higher measured MFLOPS, even when CPU MHz has decreased. The main concern is to use the same calculation passes for each of these logged tests, enabling numeric results to be verified. The variations are reflected in the following log entries. If wrong numeric results are identified, only the first one is reported, to avoid excessive log entries. The sumcheck is dependent on the number of passes and is likely to be different for each of the four tests.

 lxterminal --geometry=80x15 -e ./RPiHeatMHzVolts passes 63, seconds 15 &    
 lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 40 Sect 2 Mins 15 Log 5 &
 lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 40 Sect 2 Mins 15 Log 6 &
 lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 40 Sect 2 Mins 15 Log 7 &
 lxterminal --geometry=80x15 -e ./burninfpuPi2 Kwds 40 Sect 2 Mins 15 Log 8  

 Seconds                                                                    
   15.0     1200 scaling MHz,   1200 ARM MHz, core volt=1.2688V, temp=57.5'C 

 Pass     4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All     
           Words  Word   Passes                            Results  Same     
  
    1      40000     8    94400      15.05     2007    0.549930990   Yes     

 Error Example                                                               
   36      40000     8    94400      19.16     1577    See later      No     

 At End                                                                      
 First Unexpected Results                                                    
 test1     40000     8    94400 word     24113 was   0.578973 not   0.549931 
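
The MFLOPS figures in the log entries above appear consistent with words x operations per word x repeat passes, divided by elapsed seconds. The following Python sketch reproduces the two logged values on that assumption. It is a reconstruction from the log, with a hypothetical function name, and is not taken from the program source.

```python
def logged_mflops(words, ops_per_word, passes, seconds):
    """MFLOPS implied by the pass parameters and the elapsed time."""
    return words * ops_per_word * passes / seconds / 1e6


# Pass 1 and the error example (pass 36) from the log above.
for label, seconds in (("Pass 1", 15.05), ("Pass 36", 19.16)):
    speed = logged_mflops(40000, 8, 94400, seconds)
    print(f"{label}: {speed:.0f} MFLOPS")
```

As the number of repeat passes is fixed after calibration, a longer pass time translates directly into a lower MFLOPS figure.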
 


Pi 3B Raspbian Stretch Stress Tests (With Errors)

Results below are from running using a single CPU core, three and four cores. Performance, using three cores, was around three times that for a warmed up single core, voltage and CPU MHz were constant but with temperatures up to 73.6C. Running to utilise all four cores, shortly after 3, still indicated the same voltage but temperatures reached 81.1C, with associated CPU clock throttling, down to 1087 MHz. Note that recorded MFLOPS were less than when using three cores, the likely influence of increased out of cache accesses. Also note that the MFLOPS measurements are approximate, based on averages over slightly different intervals.

Data comparison failures only occurred during minute 7 at 80.6C.

   Single Test Running
 
   Pass     4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All
             Words  Word   Passes                            Results  Same

      1      40000     8   128000      15.05     2722    0.541245401   Yes
      2      40000     8   128000      14.23     2879    0.541245401   Yes

          3 Tests Running                 4 Tests Running
          ARM                   Total     ARM                   Total
 Minute   MHz   Volts      C  MFLOPS     MHz   Volts      C  MFLOPS

    0    1200  1.2688    47.2            1200  1.2688    56.4
    1    1199  1.2688    62.3    8360    1200  1.2688    73.6    7743
    2    1200  1.2688    65.5    8548    1200  1.2688    75.8    7990
    3    1199  1.2688    67.1    8545    1200  1.2688    77.9    7918
    4    1200  1.2688    68.2    8514    1200  1.2688    79.0    7949
    5    1200  1.2688    68.8    8481    1200  1.2688    79.5    7775
    6    1200  1.2688    68.8    8503    1141  1.2688    80.6    8052
    7    1200  1.2688    70.9    8537    1141  1.2688    80.6    7930 ERROR
    8    1200  1.2688    72.0    8533    1141  1.2688    80.6    7908
    9    1200  1.2688    71.4    8528    1087  1.2688    80.6    7870
   10    1200  1.2688    72.0    8535    1141  1.2688    81.1    7725
   11    1200  1.2688    73.1    8503    1141  1.2688    81.1    7891
   12    1200  1.2688    73.6    8491    1140  1.2688    81.1    7795
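
The comparisons quoted for this table can be checked from the per minute totals. This Python sketch averages the Total MFLOPS columns and relates them to the warmed up single test speed of 2879 MFLOPS. The variable and function names are hypothetical.

```python
# Total MFLOPS per minute (minutes 1 to 12) from the table above.
three_tests = [8360, 8548, 8545, 8514, 8481, 8503, 8537, 8533, 8528, 8535, 8503, 8491]
four_tests = [7743, 7990, 7918, 7949, 7775, 8052, 7930, 7908, 7870, 7725, 7891, 7795]
single_warm = 2879  # second pass of the single test


def average(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


mean3 = average(three_tests)
mean4 = average(four_tests)
print(f"3 tests: {mean3:.0f} MFLOPS, {mean3 / single_warm:.2f} x single test")
print(f"4 tests: {mean4:.0f} MFLOPS, {mean4 / mean3:.2f} x three tests")
```

The three test average is just under three times the single test speed, and the four test average is lower than that for three tests.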



Pi 3B, Raspbian Jessie Stress Tests (With Errors)

The first test below ran without CPU MHz being throttled and wrong numeric results were detected with temperature above 71C. There were more errors than in the Stretch tests, in spite of the lower temperature, along with lower recorded MFLOPS speeds. A possible influence was the lower recorded core voltage.

The second test started at a higher temperature and suffered from MHz throttling, as expected, starting at 80C. This time numerous errors were detected. Voltages were the same as in the first test, but recorded MFLOPS were much higher, with no significant change on throttling, probably due to better organisation of L2 cached data, providing a higher hit rate.

The system was rebooted before the third test, with the recorded voltage surprisingly higher. No errors were observed with the temperatures increasing to MHz throttling level, but MFLOPS were again lower.

Before the fourth test, the system was powered off then on, after a delay, to start at the lowest temperature. This time the voltage was restored to the Test 1 level, no MHz throttling was indicated, but data comparison failures were detected. MFLOPS were slightly better than during Test 3.

           Test 1 - errors                         Test 2 - errors
           Start at Sat Feb 16 16:09:03 2019       Start at Sat Feb 16 16:28:45 2019

                                  Total                                   Total
  Minute     MHz   Volts      C  MFLOPS  Errors     MHz   Volts      C  MFLOPS  Errors

       0    1200  1.2625    42.4                    1200  1.2625    52.6
       1    1200  1.2625    61.8    6516   0        1200  1.2625    73.1    8171   0
       2    1200  1.2625    65.0    6231   0        1200  1.2625    75.8    8142   2
       3    1200  1.2625    67.7    6325   0        1200  1.2625    76.8    7843   2
       4    1200  1.2625    69.3    6314   0        1200  1.2625    77.9    8042   5
       5    1200  1.2625    70.9    6321   0        1200  1.2625    79.5    8178   5
       6    1200  1.2625    72.0    6335   1        1195  1.2625    80.6    8002   3
       7    1199  1.2625    73.1    6313   0        1141  1.2625    80.6    8019   3
       8    1200  1.2625    73.1    6264   1        1141  1.2625    81.1    8075   3
       9    1199  1.2625    74.1    6251   0        1087  1.2625    80.6    8046   2
      10    1200  1.2625    75.2    6266   0        1087  1.2625    80.6    7967   0
      11    1200  1.2625    75.2    6405   1        1087  1.2625    80.6    7894   0
      12    1200  1.2625    76.3    6368   0        1140  1.2625    81.1    7844   0
      13    1200  1.2625    73.6    6349   2        1087  1.2625    81.1    7931   0
      14    1200  1.2625    69.8    6777   2        1199  1.2625    78.4    7886   0
      15    1200  1.2625    64.5                    1200  1.2625    68.8        
 Total                                     7                                      25
 Min        1199  1.2625    42.4    6231            1087  1.2625    52.6    7843
 Max        1200  1.2625    76.3    6777            1200  1.2625    81.1    8171

           Start at Sat Feb 16 17:03:02 2019       Start at Sat Feb 16 20:58:47 2019
           Test 3 Reboot time 17:03 - no errors    Test 4 Power off/on time 20:58 - errors

       0    1200   1.275    48.3                    1200  1.2625    40.8
       1    1200   1.275    69.3    6835   0        1200  1.2625    61.2    7199   0
       2    1200   1.275    72.5    6857   0        1200  1.2625    65.0    7037   0
       3    1200   1.275    74.7    7015   0        1200  1.2625    67.1    7041   0
       4    1199   1.275    75.8    6860   0        1200  1.2625    68.8    6997   0
       5    1200   1.275    76.8    6658   0        1200  1.2625    70.4    6904   0
       6    1200   1.275    77.4    7011   0        1200  1.2625    72.0    6830   0
       7    1199   1.275    79.0    7016   0        1199  1.2625    73.1    6980   0
       8    1200   1.275    79.5    6731   0        1200  1.2625    73.6    7133   1
       9    1200   1.275    79.5    6659   0        1200  1.2625    74.7    7188   1
      10    1194   1.275    79.5    6712   0        1200  1.2625    75.2    7185   2
      11    1200   1.275    80.6    6949   0        1200  1.2625    75.8    7256   0
      12    1195   1.275    81.1    6936   0        1200  1.2625    77.4    7200   2
      13    1199   1.275    78.4    6813   0        1200  1.2625    74.1    7342   0
      14    1200   1.275    72.5           0        1200  1.2625    69.8           1
      15    1200   1.275    60.1                    1200  1.2625    63.4            
 Total                                     0                                       7
 Min        1194   1.275    48.3    6658            1199  1.2625    40.8    6830
 Max        1200   1.275    81.1    7016            1200  1.2625    77.4    7342
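
The error counts above come from comparing the calculated results of all data words, which should be identical when every word starts with the same value and has the same operations applied. The Python sketch below illustrates that idea only. The formula, constants and function names are hypothetical and are not those of the actual stress test program.

```python
def run_pass(data, repeats, a=0.000020, b=0.999980):
    """Apply the same 8 floating point operations to every word, repeats times.

    The expression is an assumed stand-in with 4 add/subtract and 4 multiply
    operations per word, not the benchmark's real calculation.
    """
    for _ in range(repeats):
        data = [((((x + a) * b - a) * b + a) * b - a) * b for x in data]
    return data

def all_same(data):
    """Return the first result and whether every word matches it."""
    first = data[0]
    return first, all(x == first for x in data)

def count_mismatches(data):
    """Number of words that differ from the first result."""
    first = data[0]
    return sum(1 for x in data if x != first)

words = [0.999999] * 1000          # every word starts with the same value
results = run_pass(words, 100)
first, same = all_same(results)
print(first, "Yes" if same else "No")

results[500] += 1e-9               # simulate one corrupted word
print(count_mismatches(results), "error(s) detected")
```

On correctly functioning hardware the first report should say Yes, as in the All Same column of the single test tables, and the simulated fault should be reported as one error.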
  



Pi 3B+, Raspbian Stretch Stress Tests (No Errors or Crashes)

The first example results below are from only running one copy of the stress testing program. This indicates that four cores, using all L2 cache based data, could achieve more than 12 single precision GFLOPS. The other details are from running four copies of the program, where some RAM accesses are inevitable, resulting in slower performance. The main consideration is that no final data comparison errors were detected.

Two pairs of MP tests were run, to show the effects of increasing temperature on repeating the procedure. The first pair were from using the Pi as delivered, with CPU voltage and MHz reducing at 60C (this would have been earlier without the FLIRC case). For the second pair, the system was booted with the limit=70 change, reported above for the HPL benchmark. Voltage and MHz were then constant until 70C was reached. Minimum MFLOPS indicated the improvement. Maximum speeds shown, some occurring after as little as 10 minutes, were probably affected by higher L2 cache hit rates and some programs finishing earlier.

 1 Program  4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All
   Pass      Words  Word   Passes                            Results  Same

      1      40000     8   150400      15.03     3203    0.540749788   Yes
      2      40000     8   150400      14.27     3373    0.540749788   Yes

4 Core Test 1 ----------------   Test 2 ----------------  Test 3 ----   Test 4 -----------
                                                          limit=70      limit=70
                                                          All 1400 MHz  All 1400 MHz
                                                          All 1.3563 V
Minute MHz  Volts   C  MFLOPS   MHz  Volts   C  MFLOPS   C  MFLOPS   Volts   C  MFLOPS

   0  1400 1.3563  45.1         1400 1.3563  53.7         40.2         1.3563  52.1
   1  1400 1.3563  55.3  8849   1400 1.2500  60.1  8378   51.5   9408  1.3563  62.3  8504
   2  1400 1.3563  57.5  9176   1200 1.2500  60.1  8232   54.2   9430  1.3563  64.5  8623
   3  1400 1.3563  58.5  9170   1200 1.2500  61.2  8247   56.4   9382  1.3563  65.5  8638
   4  1200 1.3563  60.1  9075   1200 1.2500  61.2  8227   56.9   9414  1.3563  66.6  8628
   5  1200 1.3563  59.6  8956   1200 1.2500  61.8  8218   59.1   9390  1.3563  67.7  8630
   6  1200 1.3563  60.1  8573   1200 1.2500  61.2  8116   60.7   9410  1.3563  67.7  8447
   7  1200 1.2500  60.1  8707   1200 1.2500  62.3  8098   61.2   9471  1.3563  68.8  8214
   8  1200 1.2500  60.1  8626   1200 1.2500  62.3  8174   62.3   9419  1.3563  68.8  8203
   9  1200 1.2500  60.1  8261   1200 1.2500  62.3  8145   63.4   9427  1.3563  69.8  8251
  10  1200 1.2500  60.1  8610   1200 1.2500  62.8  8223   64.5   9569  1.3563  69.8  8505
  11  1200 1.2500  61.2  8558   1200 1.2500  62.3  9488   66.6  11228  1.3563  69.8  9417
  12  1200 1.2500  60.7  8624   1200 1.2500  63.4  9624   64.5         1.3563  70.4  9651
  13  1200 1.2500  61.2  8483   1200 1.2500  63.4  9636   59.1         1.2500  70.4  9853
  14  1200 1.2500  61.2         1200 1.2500  61.2         53.7         1.3563  64.5  9760
  15  1400 1.3563  56.9         1400 1.3563  55.8         52.6         1.3563  58.5

Min   1200 1.2500  45.1  8261   1200 1.2500  53.7  8098   40.2   9382  1.2500  52.1  8203
Max   1400 1.3563  61.2  9416   1400 1.3563  63.4  9636   66.6  11228  1.3563  70.4  9853
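
For readers wanting to log similar MHz, volts and temperature columns on their own boards, the following is a minimal Python sketch. It assumes that the standard Raspberry Pi vcgencmd utility is available and that its output has the usual forms (temp=47.2'C, volt=1.2688V and frequency(45)=1200000000). It is not the monitoring program used for these tables, and the function names are hypothetical.

```python
import re
import subprocess

def parse_temp(text):
    """Extract degrees C from output such as "temp=47.2'C"."""
    return float(re.search(r"temp=([0-9.]+)", text).group(1))

def parse_volts(text):
    """Extract volts from output such as "volt=1.2688V"."""
    return float(re.search(r"volt=([0-9.]+)", text).group(1))

def parse_mhz(text):
    """Convert output such as "frequency(45)=1200000000" from Hz to MHz."""
    hz = int(text.strip().split("=")[1])
    return round(hz / 1e6)

def sample():
    """Read one (MHz, volts, C) sample. Only works where vcgencmd exists."""
    def vc(*args):
        done = subprocess.run(["vcgencmd", *args], capture_output=True,
                              text=True, check=True)
        return done.stdout
    return (parse_mhz(vc("measure_clock", "arm")),
            parse_volts(vc("measure_volts", "core")),
            parse_temp(vc("measure_temp")))

# Example use on a Raspberry Pi, one sample per minute:
#     import time
#     for minute in range(16):
#         print(minute, *sample())
#         time.sleep(60)
```

Keeping the parsing separate from the vcgencmd calls allows the former to be checked on any system, using sample strings.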
  


Pi 3B+, Raspbian Jessie Stress Tests (No Errors or Crashes)

Using Jessie, again no errors were observed on the 3B+. Temperatures were similar to the first Stretch tests, but 1400 MHz was recorded continuously. Voltages appeared to increase slightly above 60C. Measured MFLOPS performance appeared to be slower than when using Stretch, maybe due to a different arrangement of cached data.

            Start at Sat Feb 16 23:12:35    Start at Sat Feb 16 23:36:47

                                   Total                                   Total
  Minute     MHz   Volts      C  MFLOPS  Errors     MHz   Volts      C  MFLOPS  Errors

       0    1400  1.3375    41.9                    1400  1.3375    46.2
       1    1400  1.3375    50.5    7827   0        1400  1.3375    56.4    7557   0
       2    1400  1.3375    53.7    7862   0        1400  1.3375    58.5    7380   0
       3    1400  1.3375    54.8    7956   0        1400  1.3375    60.1    7570   0
       4    1400  1.3375    56.4    7951   0        1400  1.3438    60.1    7409   0
       5    1400  1.3375    56.9    7916   0        1400  1.3438    61.2    7402   0
       6    1399  1.3375    57.5    8042   0        1400  1.3438    62.3    7447   0
       7    1400  1.3375    59.6    7931   0        1400  1.3438    62.3    7448   0
       8    1400  1.3438    59.1    7841   0        1400  1.3438    63.4    7438   0
       9    1400  1.3438    60.7    7800   0        1400  1.3438    63.4    7453   0
      10    1400  1.3438    61.2    7972   0        1400  1.3438    64.5    7463   0
      11    1400  1.3438    61.8    7996   0        1400  1.3438    63.9           0
      12    1400  1.3438    62.3    7857   0        1400  1.3438    63.4           0
      13    1400  1.3438    61.8           0        1400  1.3438    63.4           0
      14    1400  1.3375    58.0           0        1400  1.3438    63.4           0
      15    1400  1.3375    55.3                    1400  1.3438    60.7            
 Total                                     0                                       0
 Min        1399  1.3375    41.9    7800            1400  1.3375    46.2    7380
 Max        1400  1.3438    62.3    8042            1400  1.3438    64.5    7570   



64 Bit Gentoo RPi 3B Stress Tests (With Errors)

As far as I was aware, access to a 64 bit HPL benchmark was not available for the Raspberry Pi 3, at the time of writing this report. But I wondered if the same MHz and voltage variations and errors might occur, as shown by my stress tests. To find out, the 64 bit version was run, under Gentoo, using the same parameters as at 32 bits, of 40K words (160K Bytes) with 8 floating point operations per data word. Further details of the benchmark can be found, at ResearchGate, in Raspberry Pi 3B+ 32 bit and 64 bit Benchmarks and Stress Tests.pdf, along with the benchmark execution codes (burninfpuPi64 and RPiHeatMHzVolts64G) in Rpi3-64-Bit-Benchmarks.tar.gz.

Considering the old Pi 3B, running stand alone, the 64 bit benchmark indicated performance 28% faster than the 32 bit version. As shown in the following results (at 4 minute intervals), this version did not demonstrate the same throughput improvement using four cores. Maximum recorded CPU temperatures were not as high, but data comparison errors were detected, after warming up, noting that the recorded core voltage was 1.2625 volts, the same as when failures occurred at 32 bits. On rebooting, following a system crash, the voltage was indicated as slightly higher, and no errors were detected (if that means anything). Running times were also more variable, with some tests finishing early. With fewer than four cores in use, improved throughput would be due to little or no out of cache accesses.

 
                              Single Core Average 3688 MFLOPS  

           Test 1                  Test 2                  Test 3
                         Total                   Total                           Total
 Seconds   MHz      C  MFLOPS     MHz      C  MFLOPS     MHz   Volts      C  MFLOPS Errors

     0    1200    44.0            1200    47.2            1200  1.2625    46.2
   240    1200    63.4    3506    1200    67.1    5151    1200  1.2625    68.2    6716     0
   480    1200    66.6    3489    1200    70.4    5145    1200  1.2625    72.0    6698     0
   720    1200    69.8    3962    1200    72.0    5149    1200  1.2625    74.1    6709     6
   960    1200    66.1   >7391    1200    72.5   >7367    1200  1.2625    74.1   >7630     5

             Test 4                                  Test 5 Power Off/On
                                   Total                                   Total
  Seconds    MHz   Volts      C  MFLOPS  Errors     MHz   Volts      C  MFLOPS  Errors

       0    1200  1.2625    56.9                    1200  1.2688    47.2
     240    1200  1.2625    74.1    6248     2      1200  1.2688    68.2    4680     0
     480    1200  1.2625    76.3    6246     8      1200  1.2688    71.4    4679     0
     720    1200  1.2625   CRASH             3      1200  1.2688    73.6    4342     0
     960    1200                                    1200  1.2688    69.3   >7200     0
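
The quoted 28% single core gain can be reproduced from figures in this report, on the assumption that the comparison is between the 64 bit single core average above (3688 MFLOPS) and the warmed up 32 bit Stretch pass (2879 MFLOPS). The short Python sketch below shows the arithmetic, with a hypothetical function name.

```python
def percent_faster(new_mflops, old_mflops):
    """Percentage by which new_mflops exceeds old_mflops."""
    return (new_mflops / old_mflops - 1.0) * 100.0

# Pi 3B: 64 bit single core average versus 32 bit second pass
gain_3b = percent_faster(3688, 2879)
print(f"Pi 3B 64 bit gain {gain_3b:.1f}%")
```

Using the first 32 bit pass (2722 MFLOPS) as the baseline instead would give a larger percentage, so the choice of baseline matters.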


64 Bit Gentoo Rpi 3B+ Stress Tests (No Errors or Crashes)

As for the limited 32 bit Pi 3B+ tests, no data comparison failures were detected during the six tests reported below. The 64 bit benchmark was again faster running one copy but, on the limited tests, was shown to be slower using programs that use all four cores. Slight increases in voltage were also indicated above 60C, as during the Jessie HPL tests; maximum temperatures were similar and there was no CPU MHz throttling. Note that recorded voltages changed on rebooting.

                                          1 Core   2 Cores   3 Cores
                        Average MFLOPS      4316      8273      7462

         Test 1                       Test 2                       Test 3
                             Total                        Total                        Total
 Seconds  MHz   Volts   C  MFLOPS     MHz   Volts   C  MFLOPS     MHz   Volts   C  MFLOPS

    0    1400  1.3375  39.2           1400  1.3375  47.8           1400  1.3375  50.5
  240    1400  1.3375  52.1   5980    1400  1.3375  58.0   4298    1400  1.3438  61.2   5570
  480    1400  1.3375  54.8   5980    1400  1.3438  60.1   4290    1400  1.3438  62.8   5595
  720    1400  1.3375  58.5   5994    1400  1.3438  61.8   4788    1400  1.3438  63.9   5522
  960    1400  1.3438  60.1  >7400    1400  1.3438  60.1  >8770    1400  1.3438  65.0   5545

         Test 4                       Test 5  Reboot               Test 6  Power Off/On

    0    1400  1.3375  54.8           1400  1.3500  47.2           1400  1.3500  46.2
  240    1400  1.3438  63.4   4477    1400  1.3500  58.5   5817    1400  1.3500  59.6   6327
  480    1400  1.3438  64.5   4441    1400  1.3563  61.2   5805    1400  1.3563  61.8   6341
  720    1400  1.3438  64.5   4505    1400  1.3563  64.5   6500    1400  1.3563  64.5   7183
  960    1400  1.3438  63.4  >5720    1400  1.3563  65.0  >9500    1400  1.3563  62.3 >10000
 
