Raspberry Pi 3B+ 32 Bit and 64 Bit Benchmarks and Stress Tests
Roy Longbottom
 System            MHz   VAX MIPS   MIPS/MHz

 32 Bit
 RPi 3   v8-A53   1200     2469       2.06
 RPi 3B+ v8-A53   1400     2881       2.06
 Ratio            1.17     1.17

 64 Bit
 RPi 3   v8-A53   1200     3475       2.90
 RPi 3B+ v8-A53   1400     4021       2.87
 Ratio            1.17     1.16

 3B+ 64/32 bit             1.40
The Linpack Benchmark was produced from the "LINPACK" package of linear algebra routines. It became the primary benchmark for scientific applications, particularly under Unix, from the mid 1980's, with a slant towards supercomputer performance. The original double precision C version, used here, operates on 100x100 matrices. Performance is governed by an inner loop in function daxpy() with a linked triad dy[i] = dy[i] + da * dx[i], and is measured in Millions of Floating Point Operations Per Second (MFLOPS).
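The linked triad described above, and the way MFLOPS follows from it, can be sketched as below. This is a minimal illustration in Python rather than the benchmark's C source; the `mflops` helper, the pass count and the array contents are assumptions made for the example, and the speed it prints says nothing about the Raspberry Pi results.

```python
import time

def daxpy(n, da, dx, dy):
    """Linked triad from the text: dy[i] = dy[i] + da * dx[i]."""
    for i in range(n):
        dy[i] = dy[i] + da * dx[i]

def mflops(flop_count, seconds):
    """Millions of floating point operations per second (hypothetical helper)."""
    return flop_count / seconds / 1e6

n = 100                      # the benchmark works on 100x100 matrices
dx = [1.0] * n
dy = [2.0] * n
passes = 1000                # assumed repeat count, chosen only for the demo

start = time.perf_counter()
for _ in range(passes):
    daxpy(n, 0.5, dx, dy)
elapsed = max(time.perf_counter() - start, 1e-9)

# Each element update is one multiply plus one add, so 2 operations.
total_flops = 2 * n * passes
print(f"{mflops(total_flops, elapsed):.1f} MFLOPS (interpreted Python)")
```

Counting two operations per element is what ties the inner loop to the MFLOPS figure.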
Programming procedures and displayed output are the same as the original version for PCs (My accepted conversion at Netlib - 1996), where the bloated detail was needed due to using a low resolution timer. As for the original Fortran version, two sets of results are produced, with different memory alignment, and the lowest MFLOPS selected as the speed rating. This can lead to variation over multiple runs.
The benchmark produces a set of numeric results of calculations that demonstrate accuracy and consistency. These can vary, usually only slightly, as a result of the compiler generating different scalar or vector instructions. The source code includes the option of changing the values used for comparison purposes, to suit particular situations. In this case, some benchmark programs have not been modified and result in an error message (see below). The range of results encountered is also shown.
See also Linpack Benchmark Results On PCs and Later Devices.
Besides the versions compiled from standard C code, a new version is included that uses NEON Intrinsic Functions for the daxpy function. This produces a significant performance gain with 32 bit compilation, but the vector instructions generated at 64 bits for the standard code provide similar speed gains.
Note the 64 bit performance gains in the table, of up to nearly 2.5 times. Model 3B+/3B performance ratios are again mainly proportional to those for CPU MHz.
MFLOPS per MHz ratios are also shown, now better than the Whetstone benchmark at up to 0.43 single precision and 0.29 double precision, for 64 bit programs.
                          ------ MFLOPS ------   --- MFLOPS/MHz ---
 System            MHz      DP     SP  NEON SP     DP    SP  NEON SP

 32 Bit
 RPi 3   v8-A53   1200     180    194     486    0.15  0.16     0.41
 RPi 3B+ v8-A53   1400     210    226     562    0.15  0.16     0.40
 Ratio            1.17    1.17   1.16    1.16

 64 Bit
 RPi 3   v8-A53   1200     343    484     521    0.29  0.40     0.43
 RPi 3B+ v8-A53   1400     397    563     605    0.29  0.40     0.43
 Ratio            1.17    1.20   1.17    1.17

 3B+ 64/32 bit            1.89   2.49    1.08

 Error Message Example

 Variable norm. resid  Non-standard result was 1.9 instead of 1.7
 Variable resid        Non-standard result was 8.46778499e-14 instead of 7.41628980e-14
 Variable x[0]-1       Non-standard result was -1.11799459e-13 instead of -1.49880108e-14
 Variable x[n-1]-1     Non-standard result was -9.60342916e-14 instead of -1.89848137e-14

 Results of Calculations

              norm resid       resid            x[0]-1          x[n-1]-1

 DP Pi           1.7     7.41628980e-14  -1.49880108e-14  -1.89848137e-14
 DP Pi 2-3       1.9     8.46778499E-14  -1.11799459E-13  -9.60342916E-14
 DP Pi 64        1.9     8.46778499e-14  -1.11799459e-13  -9.60342916e-14

 SP Pi           1.6     3.80277634e-05  -1.38282776e-05  -7.51018524e-06
 SP Pi NEON      2.2     5.16722466e-05  -2.38418579e-07  -5.06639481e-06
 SP Pi 2-3       2.0     4.69621336E-05  -1.31130219E-05  -1.30534172E-05
 SP Pi 64        2.0     4.69621336e-05  -1.31130219e-05  -1.30534172e-05
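The 64/32 bit ratios and MFLOPS/MHz figures quoted for the Linpack results are simple quotients. The short sketch below recomputes a few of them from the 3B+ rows of the table above; the function names and the dictionary layout are illustrative choices, not part of the benchmark.

```python
def ratio(new, old):
    """Speed ratio rounded to two decimals, as in the tables."""
    return round(new / old, 2)

def per_mhz(mflops, mhz):
    """MFLOPS per MHz rounded to two decimals."""
    return round(mflops / mhz, 2)

# Raspberry Pi 3B+ Linpack MFLOPS from the table: (32 bit, 64 bit)
linpack_3bplus = {
    "DP": (210, 397),
    "SP": (226, 563),
    "NEON SP": (562, 605),
}

for name, (m32, m64) in linpack_3bplus.items():
    print(f"{name:8s} 64/32 bit ratio {ratio(m64, m32):.2f}")

print("NEON SP 64 bit MFLOPS/MHz", per_mhz(605, 1400))
```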
The speed of the original Raspberry Pi could be rated as 4.5 times faster than the Cray 1 supercomputer - see my quote on Cost and Physical Differences. Now, one core of the Raspberry Pi 3B+ produces performance equivalent to 24 Cray 1 computers.
Some of the program's 3 x 24 kernels produce inconsistent speeds, particularly for the minimum value, but CPU MHz ratios still broadly apply to the performance summaries. The official 64 bit average MFLOPS rating is shown as being 32% faster than at 32 bits, with double precision MFLOPS/MHz at 0.20. The latter, for maximum speed, is 0.51.
See also Livermore Loops Benchmark Results On PCs and Later Devices.
 32 Bit Summary
                         -------------- DP MFLOPS --------------  Per MHz
 System            MHz   Maximum Average Geomean Harmean Minimum  Geomean

 RPi 3   v8-A53   1200     398.4   210.6   185.9   160.2    56.5     0.15
 RPi 3B+ v8-A53   1400     462.5   243.8   215.2   185.7    65.6     0.15
 Ratio            1.17      1.16    1.16    1.16    1.16    1.16

 64 Bit Summary

 RPi 3   v8-A53   1200     633.1   275.8   245.2   215.5    81.3     0.20
 RPi 3B+ v8-A53   1400     720.6   320.2   285.6   251.9    94.4     0.20
 Ratio            1.17      1.18    1.16    1.15    1.14    1.04

 3B+ 64/32 bit              1.59    1.31    1.32    1.35    1.44

 32 Bit DP MFLOPS 24 Loops

 Raspberry Pi 3 1200 MHz
   192.9  228.0  398.4  337.4  124.6  167.5  359.7  384.3
   347.7  171.6  132.5   74.7   83.9  109.1  225.4  221.2
   307.9  288.6  202.2  211.9  114.7   56.9  300.2  170.1

 Raspberry Pi 3B+ 1400 MHz
   223.8  264.4  462.5  392.9  146.0  159.4  416.0  446.3
   406.7  199.1  153.8   86.7   99.5  126.7  261.8  256.8
   357.4  333.6  234.7  239.5  132.9   66.0  345.4  197.5

 Ratios 0.95 to 1.19, average 1.15 (Normal variations)

 64 Bit DP MFLOPS 24 Loops

 Raspberry Pi 3 1200 MHz
   463.4  256.0  465.9  455.0  194.9  181.3  633.1  410.3
   417.9  196.2  146.2  211.4  104.5  139.5  250.8  222.1
   379.5  447.1  286.4  238.0  239.3   82.0  312.6  179.9

 Raspberry Pi 3B+ 1400 MHz
   538.9  297.5  539.8  528.6  225.5  208.6  720.6  477.9
   475.9  252.1  169.7  245.2  127.2  159.7  290.9  258.2
   441.1  509.4  332.9  279.9  302.9   95.5  337.4  208.9

 Ratios 1.03 to 1.31, average 1.16 (Normal variations)

 3B+ 64/32 bit 1.00 to 2.83, average 1.40
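The summary columns above are different averages of the same set of kernel speeds. The sketch below shows how arithmetic, geometric and harmonic means are conventionally computed; it is not taken from the benchmark source. The sample values are the first four 64 bit loop speeds for the 3B+, used only as input data, so the printed means will not match the summary, which covers all 3 x 24 kernels.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # Mean of the logarithms, converted back with exp().
    return math.exp(sum(math.log(v) for v in values) / len(values))

def harmonic_mean(values):
    # Reciprocal of the mean reciprocal; dominated by the slowest kernels.
    return len(values) / sum(1.0 / v for v in values)

sample = [538.9, 297.5, 539.8, 528.6]   # MFLOPS, illustrative subset only

print(f"Average {arithmetic_mean(sample):.1f}")
print(f"Geomean {geometric_mean(sample):.1f}")
print(f"Harmean {harmonic_mean(sample):.1f}")
```

For any set of positive speeds the harmonic mean is the lowest and the arithmetic mean the highest, which is the ordering seen in the summary rows.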
The MemSpeed benchmark measures data reading speeds in megabytes per second, carrying out calculations on arrays of cache and RAM data, normally sized 2 x 4 KB to 2 x 4 MB. Calculations are as shown in the result headings. For the first two double precision tests, speed in Millions of Floating Point Operations Per Second (MFLOPS) can be calculated by dividing MB/second by 8 and 16. For single precision, divide by 4 and 8. There is also a version that instructs the compiler to use NEON code. The 32 bit version results are provided below, but the particular compile options used were not accepted by the 64 bit compiler.
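The divisors quoted above follow from the bytes read per floating point operation: the first calculation performs two operations per pair of elements read, the second performs one. A minimal sketch of the conversion follows; the labels "triad" and "add" are names invented here for the first two calculations, and the sample speeds are the 16 KB figures from the Raspberry Pi 3 32 bit results.

```python
# MB/second divisors stated in the text, keyed by (precision, calculation).
# "triad" = x[m]=x[m]+s*y[m], "add" = x[m]=x[m]+y[m] (hypothetical labels).
DIVISORS = {
    ("double", "triad"): 8,
    ("double", "add"): 16,
    ("single", "triad"): 4,
    ("single", "add"): 8,
}

def mb_per_sec_to_mflops(mb_per_sec, precision, calculation):
    """Convert a MemSpeed MB/second figure to MFLOPS."""
    return mb_per_sec / DIVISORS[(precision, calculation)]

# Raspberry Pi 3, 32 bit, 16 KB: 1621 MB/s double and 1814 MB/s single
print(mb_per_sec_to_mflops(1621, "double", "triad"))   # close to the 203 shown
print(mb_per_sec_to_mflops(1814, "single", "triad"))   # close to the 454 shown
```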
In this case, relative 3B/3B+ speed ratios were calculated as separate averages for tests that use L1 cache, L2 cache and RAM. The cache based measurements were, as usual, equivalent to those derived from CPU MHz, but indicate that RAM could be slightly slower.
As would be expected, the use of NEON instructions provided a performance gain using single precision floating point (32 bit system only). The 64/32 bit ratios are provided below for the normal MemSpeed benchmarks. They indicate that the highest 64 bit gains were for double precision calculations. Single precision MFLOPS/MHz ratios were then similar to those derived from the NEON benchmark, with 64 bit floating point calculations benefiting from the use of vector instructions. See details of Assembly Code.
The first two calculations are essentially the same as those in the performance dependent daxpy function of the Linpack benchmark, but the speed is not deflated by frequent calls to a function. This increases 64 bit MFLOPS/MHz to 0.43 double precision and 0.52 single precision.
 ############################ RPi 3 32 Bit ###############################

     Memory Reading Speed Test vfpv4 32 Bit Version 1

  Memory   x[m]=x[m]+s*y[m]   Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

     Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

       8   1619   1812   3448   2375   2237   3793   2698   3121   3147
      16   1621   1814   3459   2379   2240   3793   2710   3136   3162
      32   1577   1743   3243   2277   2132   3138   2702   3123   3131
      64   1537   1690   3126   2196   2047   3362   2566   2890   2917
     128   1570   1714   3257   2243   2076   3502   2624   2993   3027
     256   1573   1720   3285   2261   2084   3522   2652   3071   2930
     512   1453   1598   2785   2055   1906   2081   2430   2783   2815
    1024    918   1097   1327   1204   1185   1355   1606   1261   1263
    2048    891   1032   1224   1133   1113   1191    882    811    817
    4096    885   1023   1223   1127   1104   1201    787    756    755
    8192    876   1019   1225   1118    954   1203    876    871    873

Max MFLOPS  203    454
Per MHz    0.17   0.38

 ########################### RPi 3B+ 32 Bit ##############################

     Raspberry Pi 3B+ CPU 1400 MHz, SDRAM ?                              Avg.
                                                                         Gain
       8   1899   2125   4041   2783   2624   4448   3164   3693   3693   1.17 L1
      16   1901   2128   4058   2791   2628   4462   3177   3703   3707
      32   1852   2049   3817   2686   2508   4161   3186   3715   3711
      64   1796   1959   3574   2542   2367   3855   2945   3347   3347   1.16 L2
     128   1826   1989   3741   2600   2408   4031   3042   3506   3508
     256   1833   1995   3771   2617   2414   4068   2860   3616   3617
     512   1517   1618   2587   2039   1911   2687   2459   2825   2832
    1024    968   1098   1221   1172   1140   1211   1455   1144   1137   0.98 RAM
    2048    911    980   1060   1038   1026   1062   1013    941    935
    4096    913    993   1064   1047   1038    948    992    902    903
    8192    926   1013   1077   1074   1065   1085    782    784    783

Max MFLOPS  238    532
Per MHz    0.17   0.38

                                More Below

 ######################### RPi 3 NEON 32 Bit ##############################

     Memory Reading Speed Test NEON 32 Bit Version 1 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m]   Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

     Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

       8   1627   2387   3467   2387   3181   3812   2713   3164   3149
      16   1621   2377   3457   2377   3169   3805   2713   3164   3165
      32   1577   2273   3238   2280   2985   3535   2647   3103   3105
      64   1526   2150   3018   2157   2793   3256   2568   2921   2921
     128   1554   2217   3190   2216   2925   3436   2631   3028   3029
     256   1561   2228   3225   2221   2948   3471   2654   3077   3077
     512   1434   2010   2742   1978   2534   2313   2468   2840   2840
    1024    950   1227   1324   1182   1306   1339   1581   1298   1298
    2048    935   1136   1215   1128   1212   1214    915    880    885
    4096    913   1121   1180   1131   1213   1212    825    844    842
    8192    926   1134   1212   1126    936   1199    792    774    790

Max MFLOPS  203    594
Per MHz    0.17   0.50

 ######################### RPi 3B+ NEON 32 Bit ###########################

     Raspberry Pi 3B+ CPU 1400 MHz, SDRAM ?                              Avg.
                                                                         Gain
       8   1890   2774   4027   2773   3694   4427   3130   3674   3676   1.16 L1
      16   1885   2778   4037   2758   3702   4439   3155   3693   3691
      32   1813   2581   3646   2590   3366   3951   3130   3586   3591
      64   1808   2565   3653   2575   3370   3943   2987   3363   3366   1.17 L2
     128   1790   2534   3606   2536   3334   3893   3040   3485   3485
     256   1796   2538   3638   2544   3360   3914   3079   3572   3569
     512   1654   2273   3163   2301   2945   3333   3010   3435   3447
    1024    959   1166   1185   1165   1209   1213   1438   1141   1130   0.97 RAM
    2048    918   1059   1080   1061   1088   1081   1073    890    889
    4096    922   1076   1082   1069   1069   1084   1015    867    871
    8192    929   1089   1091   1083   1102   1081    786    774    774

Max MFLOPS  236    695
Per MHz    0.17   0.50

 ######################### RPi 3 Gentoo 64 Bit ##########################

     Memory Reading Speed Test armv8 64 Bit

     Raspberry Pi 3B CPU 1200 MHz, SDRAM 900 MHz

  Memory   x[m]=x[m]+s*y[m]   Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   4161   2506   3749   5347   3393   4166   4641   3730   3731
      16   4032   2506   3758   5357   3419   4162   4674   3753   3750
      32   4016   2486   3721   5311   3390   4137   4673   3714   3727
      64   3372   2342   3361   4232   3123   3685   4244   3522   3499
     128   3352   2393   3454   4266   3189   3789   4359   3564   3563
     256   3227   2398   3463   4266   3224   3769   4246   3525   3525
     512    633   2010   2885   3603   2674   2457   3733   3081   3084
    1024    560    889   1217   1192   1202   1011    857   1094   1095
    2048    565    880   1145   1131    991   1156    844    885    788
    4096    514   1092    987   1127   1134   1159    873    944    951
    8192    531    887   1150   1139   1038   1162    782    799    704

Max MFLOPS  520    627
Per MHz    0.43   0.52

                                More Below

 ######################## RPi 3B+ Gentoo 64 Bit ##########################

     Raspberry Pi 3B+ CPU 1400 MHz, SDRAM ?                              Avg.
                                                                         Gain
       8   4822   2888   4346   6190   3955   4830   5372   4324   4325   1.16 L1
      16   4684   2904   4337   6197   3955   4833   5389   4340   4343
      32   4471   2898   4345   6172   3951   4824   5438   4323   4347
      64   3814   2630   3721   4671   3467   4052   5272   4238   4208   1.18 L2
     128   3866   2727   3905   4797   3601   4257   4935   4102   4103
     256   3891   2765   3975   4877   3700   4296   4901   4096   4102
     512    671   2305   3252   3791   3003   3530   3638   3721   3718
    1024    694   1263   1324   1320   1317   1277   1192   1482   1477   1.18 RAM
    2048    645   1213   1255   1245   1132   1269    840    921    925
    4096    617   1204   1122   1230   1238   1120    968    990    990
    8192    658   1210   1256   1224   1101   1271   1011   1082   1084

Max MFLOPS  602    726
Per MHz    0.43   0.52

 ###################### RPi 3B+ Gentoo NEON 64 Bit ########################

     Compile options to use NEON instructions are not available at 64 bit working.

 ##################### Compare 64 bit / 32 bit Pi 3 #######################

     Memory Reading Speed Test

  Memory   x[m]=x[m]+s*y[m]   Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   2.57   1.38   1.09   2.25   1.52   1.10   1.72   1.20   1.19
     256   2.05   1.39   1.05   1.89   1.55   1.07   1.60   1.15   1.20
    8192   0.61   0.87   0.94   1.02   1.09   0.97   0.89   0.92   0.81

 #################### Compare 64 bit / 32 bit Pi 3B+ ######################

       8   2.54   1.36   1.08   2.22   1.51   1.09   1.70   1.17   1.17
     256   2.12   1.39   1.05   1.86   1.53   1.06   1.71   1.13   1.13
    8192   0.71   1.19   1.17   1.14   1.03   1.17   1.29   1.38   1.38
This was my first benchmark produced to measure speed using NEON instructions on ARM v7 CPUs using Android. It executes some of the code used in Memory Speed Benchmark, with additional tests recoded using NEON intrinsic functions. In this case there are no double precision calculations.
Pi 3B+ CPU/Cache performance gains are again proportional to MHz, with some results worse when using RAM. Single precision MFLOPS per MHz increased up to 0.92 through using NEON intrinsic functions. These were compiled as different vector instructions, including the use of the Fused Multiply Accumulate variety. See details of Assembly Code.
No significant 64 bit performance gains were indicated, using these test functions, as similar instructions were generated. Some of the 32 bit functions were also somewhat faster.
 ##################### RPi 3 32 Bit #########################

  NEON Speed Test V 1.0

  Raspberry Pi 3 CPU 1200 MHz

  Vector Reading Speed in MBytes/Second

  Memory   Float v=v+s*v   Int v=v+v+s     Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   2720   4001   3459   4225   4474   4750
      32   2598   3706   3268   3879   4091   4320
      64   2453   3389   3069   3526   3675   3859
     128   2503   3466   3178   3598   3718   3918
     256   2530   3516   3230   3649   3779   3950
     512   2221   2923   2718   2964   3104   3217
    1024   1262   1326   1317   1316   1324   1316
    4096   1170   1213   1204   1213   1210   1195
   16384   1177   1229   1218   1147   1222   1215
   65536   1181   1226   1221    916   1208   1218

Max MFLOPS  680   1000
Per MHz    0.57   0.84

 ##################### RPi 3B+ 32 Bit #######################

  Raspberry Pi 3B+ CPU 1400 MHz                       Avg
                                                     Gain
      16   3188   4690   4055   4953   5243   5570   1.17 L1
      32   3143   4578   3990   4811   5120   5431
      64   2927   4089   3693   4253   4446   4674   1.16 L2
     128   2864   3912   3588   4060   4172   4478
     256   2905   3953   3632   4119   4213   4524
     512   2255   2835   2661   2873   2922   3035
    1024   1234   1264   1263   1265   1248   1232   0.93 RAM
    4096   1099   1114   1110   1106   1091   1088
   16384   1116   1128   1116   1117   1102   1092
   65536   1113   1132   1122    837   1107   1090

Max MFLOPS  797   1173
Per MHz    0.57   0.84

 ################### RPi 3 Gentoo 64 Bit ####################

  NEON Speed Test armv8 64 Bit V 1.0

  Raspberry Pi 3 CPU 1200 MHz

  Vector Reading Speed in MBytes/Second

  Memory   Float v=v+s*v   Int v=v+v+s     Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   2350   4419   3415   4176   4686   4843
      32   2247   3991   3216   3831   4258   4348
      64   2161   3631   3038   3559   3886   3882
     128   2212   3744   3148   3648   3980   3966
     256   2230   3766   3171   3677   4009   3962
     512   1931   2736   2685   2663   3267   3322
    1024   1116   1116   1223   1135   1156   1213
    4096   1065   1075   1146   1040   1117   1162
   16384   1065   1072   1149    978   1106   1078
   65536   1007   1150   1076    824   1103   1137

Max MFLOPS  588   1105
Per MHz    0.49   0.92

                       More Below

 ################## RPi 3B+ Gentoo 64 Bit ###################

  Raspberry Pi 3B+ CPU 1400 MHz                       Avg
                                                     Gain
      16   2724   5109   3961   4841   5446   5607   1.16 L1
      32   2612   4645   3726   4450   4968   5036
      64   2523   4247   3540   4150   4521   4519   1.16 L2
     128   2583   4363   3666   4253   4616   4635
     256   2576   4314   3674   4254   4591   4631
     512   1852   2871   2608   2466   2916   2698
    1024   1222   1207   1305   1179   1280   1216   1.08 RAM
    4096   1157   1144   1214   1109   1181   1160
   16384   1175   1245   1244   1134   1191   1180
   65536   1143   1258   1185    909   1144   1260

Max MFLOPS  681   1277
Per MHz    0.49   0.91

 ############### Compare 64 bit / 32 bit Pi 3 #################

  Memory   Float v=v+s*v   Int v=v+v+s     Neon v=v+v
  KBytes   Norm   Neon   Norm   Neon  Float    Int

      16   0.86   1.10   0.99   0.99   1.05   1.02
     256   0.88   1.07   0.98   1.01   1.06   1.00
   65536   0.85   0.94   0.88   0.90   0.91   0.93

 ############## Compare 64 bit / 32 bit Pi 3B+ ################

      16   0.85   1.09   0.98   0.98   1.04   1.01
     256   0.89   1.09   1.01   1.03   1.09   1.02
   65536   1.03   1.11   1.06   1.09   1.03   1.16
This benchmark is designed to identify reading data in bursts over buses and possible maximum data transfer speed from RAM (using 1 core - see MP version). The program starts by reading a word (4 bytes) with an address increment of 32 words (128 bytes) before reading another word. The increment is reduced by half on successive tests, until all data is read.
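The access pattern described above, one 4 byte word read per increment with the increment halved on each test, can be sketched as follows. This is an illustration in Python under assumed names (`stride_schedule`, `read_with_stride`); summing the words is a stand-in for whatever operation the real benchmark applies to the data it reads, which the text does not specify.

```python
WORD_BYTES = 4

def stride_schedule(start_words=32):
    """Yield address increments in words: 32, 16, 8, 4, 2, 1 (read all)."""
    inc = start_words
    while inc >= 1:
        yield inc
        inc //= 2

def read_with_stride(data, inc_words):
    """Read one word every inc_words words and return a running total."""
    total = 0
    for i in range(0, len(data), inc_words):
        total += data[i]
    return total

data = list(range(64))          # 64 words = 256 bytes of sample data
for inc in stride_schedule():
    words_read = len(range(0, len(data), inc))
    print(f"Inc{inc:<2d} step {inc * WORD_BYTES:3d} bytes, "
          f"{words_read:2d} words read, total {read_with_stride(data, inc)}")
```

With a 32 word increment only one word in every 128 bytes is used, so data transferred in bursts is mostly wasted, which is the effect the benchmark is designed to expose.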
Model 3B+ speed gains are provided for reading all data. As usual, these are similar to the increase in MHz, with the RAM speed ratio much lower, but showing no loss. The 64 bit compiler produced unexpectedly slower speeds on reading all data from L1 cache, compared with an addressing increment of 2 words. This leads to an indication that the 32 bit program is faster, using this test. As indicated for the MP-BusSpd Benchmark, the 64 bit compiler did not identify that vector SIMD instructions could be used.
 ##################### RPi 3 32 Bit #########################

  BusSpeed vfpv4 32b V1

  Raspberry Pi 3 CPU 1200 MHz

  Reading Speed 4 Byte Words in MBytes/Second

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3335   3741   4075   4371   4388   4413
      32   1964   2229   2787   4271   4308   4311
      64    612    615   1121   1932   2880   3546
     128    570    573   1034   1803   2756   3467
     256    541    544    995   1758   2737   3457
     512    382    408    794   1360   2269   3105
    1024    128    136    256    533   1025   1945
    4096    109    125    245    482    961   1585
   16384    120    125    241    477    964   1744
   65536    120    125    243    477    947   1881

 ##################### RPi 3B+ 32 Bit #######################

  Raspberry Pi 3B+ CPU 1400 MHz                      Gain
                                                 Read All
      16   3751   4125   4755   4965   5083   5104   1.16 L1
      32   1983   2177   2819   4258   4681   4958
      64    719    728   1333   2298   3428   4165   1.17 L2
     128    664    666   1201   2130   3285   4084
     256    625    635   1163   2055   3197   4032
     512    329    360    702   1309   2297   3342
    1024    128    143    279    548   1061   2128   1.00 RAM
    4096    115    131    256    498    978   1694
   16384    124    130    254    489    994   1620
   65536    126    129    253    492   1003   1728

 ################### RPi 3 Gentoo 64 Bit ####################

  Raspberry Pi 3 CPU 1200 MHz

  BusSpeed armv8 64 Bit Mon

  Reading Speed 4 Byte Words in MBytes/Second

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   3312   3684   4007   4341   4390   3341
      32   2019   2158   2687   4172   4235   3294
      64    577    595   1124   1861   2836   3062
     128    546    556   1040   1754   2733   3062
     256    516    530   1000   1696   2692   3094
     512    341    272    708   1264   2099   2626
    1024     77    126    251    488    847   1860
    4096     85    115    222    446    908   1685
   16384     99    115    231    393    902   1704
   65536     98    115    229    443    810   1700

                       More Below

 ################## RPi 3B+ Gentoo 64 Bit ###################

  Raspberry Pi 3B+ CPU 1400 MHz                      Gain
                                                 Read All
      16   3823   4251   4638   4945   5045   3854   1.15 L1
      32   1543   1677   2423   3331   4152   3680
      64    672    694   1306   2169   3300   3577   1.17 L2
     128    635    648   1211   2055   3202   3604
     256    600    615   1163   1971   3152   3612
     512    328    278    695   1272   2256   2978
    1024     94    140    281    543    960   2075   1.12 RAM
    4096     99    128    259    448   1016   1931
   16384    125    129    258    500    898   1863
   65536    125    114    257    500   1015   1898

 ############### Compare 64 bit / 32 bit Pi 3 #################

  Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
  KBytes  Words  Words  Words  Words  Words    All

      16   0.99   0.98   0.98   0.99   1.00   0.76
     256   0.95   0.97   1.01   0.96   0.98   0.89
   65536   0.82   0.92   0.94   0.93   0.86   0.90

 ############## Compare 64 bit / 32 bit Pi 3B+ ################

      16   1.02   1.03   0.98   1.00   0.99   0.76
     256   0.96   0.97   1.00   0.96   0.99   0.90
   65536   0.99   0.88   1.02   1.02   1.01   1.10
There are two benchmarks, FFT1, the original, and FFT3c, optimised, with 32 bit and 64 bit versions, when appropriate. Performance is measured in milliseconds, for FFTs sized 1K to 1024K, with three measurements using both single and double precision floating point data, plus some sumchecks for the largest ones.
The second of the three measurements is provided below. Note that three of the smaller FFT tests can be executed in less than a millisecond, when the CPU MHz scaling governor can produce a lower frequency (64 bit system), leading to increased running time, until the high MHz kicks in (see example below). For full speed, the scaling governor setting should be performance (sudo su, then echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor).
Much of the data is accessed on a skipped sequential basis, where only part of data transferred in bursts, over buses, is likely to be used. The 3C version was optimised to use more of the burst data, producing much improved performance.
Data transfer covers caches and RAM. RPi 3B+ gains are provided for each FFT size, indicating where RAM transfers apply, when the performance ratio is less than that derived from CPU MHz. Comparisons of 64/32 bit performance are also shown, with some good and some bad. Note that the processing activity is unlikely to produce absolutely consistent speeds, particularly when data size is near cache capacity or execution time is very low.
 ################### FFT V 1 32 Bit ####################

                RPi3              RPi3B+           Compare
   Size  -------- milliseconds --------            3B+ Gain
      K   Single   Double   Single   Double   Single   Double

      1     0.16     0.16     0.14     0.14     1.16     1.15
      2     0.37     0.42     0.34     0.34     1.09     1.24
      4     1.01     1.09     0.88     0.94     1.14     1.16
      8     2.25     2.51     1.96     2.17     1.15     1.15
     16     5.29     5.85     4.56     5.16     1.16     1.13
     32    12.57    22.48    10.48    19.52     1.20     1.15
     64    44.59   110.41    36.67   130.32     1.22     0.85
    128   217.33   269.62   239.81   314.27     0.91     0.86
    256   525.92   615.26   584.42   705.36     0.90     0.87
    512  1199.23  1364.15  1324.23  1534.86     0.91     0.89
   1024  2538.17  2831.33  2740.23  3152.95     0.93     0.90

 ################### FFT V 3C 32 Bit ###################

      1     0.20     0.16     0.17     0.14     1.16     1.19
      2     0.46     0.37     0.38     0.32     1.21     1.17
      4     1.28     0.89     1.07     0.77     1.19     1.15
      8     2.32     2.05     2.13     1.89     1.09     1.08
     16     5.36     5.98     4.57     5.83     1.17     1.03
     32    12.47    15.48    10.77    15.48     1.16     1.00
     64    31.08    36.99    29.05    37.25     1.07     0.99
    128    72.02    84.24    70.05    85.03     1.03     0.99
    256   160.48   193.81   160.68   199.34     1.00     0.97
    512   367.71   426.24   364.53   437.72     1.01     0.97
   1024   799.23   948.54   794.54   974.48     1.01     0.97

 RPi3B+ FFT3C scaling_governor ondemand

      1                       0.40     0.14
      2                       0.93     0.32
      4                       1.97     0.75
      8                       4.64     1.76
     16                       4.47     5.83

 ################### FFT V 1 64 Bit ####################

                RPi3              RPi3B+           Compare
   Size  -------- milliseconds --------            3B+ Gain
      K   Single   Double   Single   Double   Single   Double

      1     0.18     0.18     0.15     0.15     1.18     1.17
      2     0.35     0.39     0.29     0.41     1.21     0.96
      4     0.87     1.64     0.79     0.99     1.10     1.65
      8     2.08     3.18     1.87     2.45     1.12     1.30
     16     4.68     7.18     3.86     5.23     1.21     1.37
     32    10.76    29.77    10.20    23.64     1.05     1.26
     64    39.65   126.03    50.50   105.28     0.79     1.20
    128   174.53   302.94   148.80   262.45     1.17     1.15
    256   408.05   700.83   352.60   603.27     1.16     1.16
    512   956.18  1543.35   836.47  1362.20     1.14     1.13
   1024  2055.48  3278.12  1841.71  2848.54     1.12     1.15

                      More Below

 ################### FFT V 3C 64 Bit ###################

      1     0.18     0.19     0.14     0.18     1.30     1.05
      2     0.41     0.43     0.36     0.37     1.13     1.17
      4     0.80     0.99     0.70     0.86     1.15     1.15
      8     2.10     2.32     2.60     1.95     0.81     1.19
     16     6.22     5.66     4.66     5.05     1.33     1.12
     32    10.38    15.08     9.05    13.20     1.15     1.14
     64    27.59    35.71    24.54    31.38     1.12     1.14
    128    71.14    81.19    56.37    72.85     1.26     1.11
    256   139.33   190.07   124.65   170.73     1.12     1.11
    512   321.96   428.64   295.34   385.23     1.09     1.11
   1024   705.92   938.42   629.95   838.97     1.12     1.12

 ############## Compare 64 bit / 32 bit ################

                           RPi3            RPi3B+
           K          Single  Double   Single  Double

 FFT1      1 to 8       1.05    0.86     1.06    0.90
           16 to 128    1.17    0.83     1.14    1.06
           256 to 1K    1.26    0.88     1.58    1.13

 FFT3C     1 to 8       1.24    0.89     1.17    0.88
           16 to 128    1.05    1.04     1.15    1.17
           256 to 1K    1.14    1.01     1.26    1.16
As before, details of the benchmarks, other results and download links are available from ResearchGate in this PDF file.
On running my multithreading benchmarks, I noted unusually slow performance from certain tests. The first was the MP-Whetstone Benchmark, which runs independent copies of the program using 1, 2, 4 and 8 threads. The running time should then not increase much using up to 4 threads, but should be just over twice as long using 8. As shown in the example below, the 4 thread test was too slow and this was particularly due to the long running COS test.
  MP-Whetstone Benchmark armv8 64 Bit Mon Jun 18 23:09:29 2018

                  Using 1, 2, 4 and 8 Threads

       MWIPS  MFLOPS  MFLOPS  MFLOPS    Cos    Exp      Fixpt       If   Equal
                   1       2       3   MOPS   MOPS    ## MOPS     MOPS    MOPS

 1T   1112.9   352.1   379.0   319.2   22.0   12.7  1641076.6   2722.5  1328.7
 2T   2250.7   717.5   767.4   656.4   44.5   25.5  2684285.1   5456.3  2652.7
 4T   2899.0  1342.3  1525.3  1048.1   42.6   46.1  1959513.0   4497.3  4319.1
 8T   3433.1  1654.1  1804.6  1106.4   55.2   47.8  2453184.1  10960.3  4994.0

 Overall Seconds   5.14 1T,   5.11 2T,   8.08 4T,  13.66 8T

 ## over optimised but always had little effect on overall MWIPS
An unofficial 2.5 amp power supply was used, connected via a digital meter that measures current and voltage. During the tests, this reported a constant voltage of over 5 volts and a current of less than 1 amp. I suspected overheating and ran my RPiHeatMHz program at the same time, producing the results below and showing that the CPU MHz flipped to 600 MHz at the time of the slow recorded performance (see OpenGL Power Cable Tests for core volts). Although the temperature was not excessive, I carried out further tests with the system wrapped in bags of frozen food. The failures still occurred with recorded temperatures of less than 30°C.
 Temperature and CPU MHz Measurement

 Start at Mon Jun 18 23:09:26 2018

 Using 40 samples at 1 second intervals

     Seconds
       0.0   1400 scaling MHz,   1400 ARM MHz, temp=55.8°C
       1.0   1400 scaling MHz,   1400 ARM MHz, temp=55.8°C
       2.2   1400 scaling MHz,   1400 ARM MHz, temp=55.8°C
       3.3   1400 scaling MHz,   1400 ARM MHz, temp=56.4°C
 1T    4.5   1400 scaling MHz,   1400 ARM MHz, temp=56.9°C
       5.7   1400 scaling MHz,   1400 ARM MHz, temp=56.9°C
       6.9   1400 scaling MHz,   1400 ARM MHz, temp=56.9°C
       8.0   1400 scaling MHz,   1400 ARM MHz, temp=57.5°C
 2T    9.2   1400 scaling MHz,   1400 ARM MHz, temp=58.0°C
      10.4   1400 scaling MHz,   1400 ARM MHz, temp=58.0°C
      11.7   1400 scaling MHz,   1399 ARM MHz, temp=59.1°C
      12.9   1400 scaling MHz,   1400 ARM MHz, temp=59.1°C
 4T   14.1   1400 scaling MHz,    600 ARM MHz, temp=59.1°C
      15.4   1400 scaling MHz,    600 ARM MHz, temp=59.1°C
      16.8   1400 scaling MHz,    600 ARM MHz, temp=58.0°C
      18.3   1400 scaling MHz,   1400 ARM MHz, temp=60.1°C
      19.6   1400 scaling MHz,   1400 ARM MHz, temp=60.7°C
      20.8   1400 scaling MHz,   1400 ARM MHz, temp=61.2°C
      22.0   1400 scaling MHz,   1400 ARM MHz, temp=61.8°C
 8T   23.3   1400 scaling MHz,    600 ARM MHz, temp=60.1°C
      24.9   1400 scaling MHz,    600 ARM MHz, temp=60.1°C
      26.4   1400 scaling MHz,   1400 ARM MHz, temp=60.7°C
      27.6   1400 scaling MHz,   1400 ARM MHz, temp=61.2°C
 To   38.8   1400 scaling MHz,   1400 ARM MHz, temp=60.1°C
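A log in the format above can be scanned for samples where the measured ARM MHz falls well below the scaling MHz. The following is a hypothetical post-processing sketch, not part of the RPiHeatMHz program; the regular expression, the function names and the 90% tolerance are assumptions made for the example.

```python
import re

# Matches one sample: seconds, scaling MHz, ARM MHz and temperature.
SAMPLE = re.compile(
    r"(\d+\.\d+)\s+(\d+) scaling MHz,\s+(\d+) ARM MHz,\s+temp=(\d+\.\d+)"
)

def parse_samples(log_text):
    """Return (seconds, scaling_mhz, arm_mhz, temp_c) tuples found in the log."""
    return [
        (float(secs), int(scaling), int(arm), float(temp))
        for secs, scaling, arm, temp in SAMPLE.findall(log_text)
    ]

def throttled(samples, tolerance=0.9):
    """Samples whose ARM MHz is below tolerance x scaling MHz (assumed threshold)."""
    return [s for s in samples if s[2] < tolerance * s[1]]

log = """
      11.7   1400 scaling MHz,   1399 ARM MHz, temp=59.1°C
      12.9   1400 scaling MHz,   1400 ARM MHz, temp=59.1°C
 4T   14.1   1400 scaling MHz,    600 ARM MHz, temp=59.1°C
      15.4   1400 scaling MHz,    600 ARM MHz, temp=59.1°C
"""

for secs, scaling, arm, temp in throttled(parse_samples(log)):
    print(f"{secs:5.1f} s: {arm} MHz instead of {scaling} MHz at {temp}°C")
```

The tolerance keeps a reading such as 1399 MHz from being flagged, while the drops to 600 MHz are reported.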
Next, I tried using my official Pi 2 amp power supply and that seemed to be fine, but caused the failures when the meter was included, needing connection using a longer wire. It also failed when just the wire extension was included.
Now all multithreading programs have been run to verify the results, using a directly connected official 2.5 amp power supply. Even with this, four thread performance can be inconsistent, mainly when the running time is not very long, influenced by other system activity and the programs calculating performance based on the last thread to finish. In these cases, four threads carry out the same number of instructions as a single thread, potentially reducing running time by a quarter. The Whetstone benchmark is probably the best one to identify the power drop, with each thread (of up to four) taking around 5 seconds to execute the same functions.
This benchmark was intended to demonstrate near maximum throughput using single precision floating point calculations. It nearly did on an Intel Core i7 CPU, compiled with gcc under Linux, obtaining 23 out of 32 MFLOPS/MHz with SSE instructions (4 cores, quad word registers, linked multiply and add). The latter arrangement (I believe) also applies to the ARM Cortex-A53 where, with the same efficiency, a Raspberry Pi 3B, at 1200 MHz, would be expected to achieve 27600 MFLOPS and a 3B+ 32200 MFLOPS, at 1400 MHz. For ARM, and probably Intel, as shown below, 20 instructions could be executed at the full speed, with 12 at half speed, nearly corresponding with the 72% (23*100/32) efficiency obtained with Intel.
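The arithmetic behind these expectations can be laid out as below. The parameter names are illustrative, and the reading of "20 instructions at full speed, 12 at half speed" as 20 + 2 x 12 instruction slots is an interpretation made for this sketch, not something stated in the benchmark documentation.

```python
def peak_mflops_per_mhz(cores=4, words_per_register=4, ops_per_instruction=2):
    """4 cores x quad word registers x linked multiply and add."""
    return cores * words_per_register * ops_per_instruction

def expected_mflops(cpu_mhz, achieved_per_mhz=23):
    """MFLOPS at the Intel-measured efficiency of 23 MFLOPS/MHz."""
    return cpu_mhz * achieved_per_mhz

def mixed_speed_efficiency(full_speed=20, half_speed=12):
    """Fraction of peak when some instructions take twice as long (assumed model)."""
    return (full_speed + half_speed) / (full_speed + 2 * half_speed)

print("Peak MFLOPS/MHz     ", peak_mflops_per_mhz())
print("RPi 3B  at 1200 MHz ", expected_mflops(1200))
print("RPi 3B+ at 1400 MHz ", expected_mflops(1400))
print(f"Intel efficiency     {23 * 100 / 32:.1f}%")
print(f"20 full + 12 half    {mixed_speed_efficiency() * 100:.1f}%")
```

Under that model, 32 instructions occupy 44 slots, giving an efficiency close to the 72% measured on Intel.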
Single Precision and Double Precision Raspberry Pi 3B+ MFLOPS results are shown below for existing compiled 32 bit and 64 bit benchmarks and one that uses Single Precision NEON Intrinsic functions, then those from a new compilation using gcc 7. None achieve the levels of performance suggested above. Source code and benchmarks for the new MP-MFLOPS, compiled by gcc 7, are in this zip file and this tar.gz file. Speeds of later results from the OpenMP-MFLOPS Benchmark are included in the table.
Performance using one and four threads is shown, along with the gain via the latter. Note that four thread performance, in particular, can vary significantly, even when using a reliable power supply - See Above.
Source and Assembly Codes for these benchmark runs are shown below, where explanations of the differences are provided. Next are the detailed results.

 Raspberry Pi 3B+ MFLOPS at 32 Operations Per Data Word

                                           NEON      64 bit         OpenMP
               32 bit        64 bit       64 bit                    64 bit
                                                      gcc7       gcc6    gcc7
              SP     DP     SP     DP       SP      SP     DP      SP      SP

 1 Thread    813    798   1793   1405     2999    2800   1403    1692    2781
 4 Threads  3189   3109   6981   4398    11563   10608   4492    6469   10007
 4T/1T      3.92   3.90   3.89   3.13     3.86    3.79   3.20    3.82    3.60
3B+ to 3B performance gains are provided following the detailed results. This benchmark tends to be limited by processor speed, producing gains proportional to CPU MHz, but subject to random variations.
64 bit speed gains are also shown (before gcc 7). Excluding tests using RAM, these are greater than 2.1 times for single precision and 1.5 times for double precision.
 ################# MP-MFLOPS Raspbian RPi 3B 32 Bit #################

  Raspberry Pi 3 CPU 1200 MHz, SDRAM 900 MHz

  MP-MFLOPS Linux/ARM V7A v1.0 Sun Jul 15 14:45:33 2018

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

              2 Ops/Word              32 Ops/Word
 KB        12.8     128   12800    12.8     128   12800
 MFLOPS
 1T         184     181     172     697     697     691
 2T         367     360     339    1393    1373    1379
 4T         642     714     411    2702    2652    2650
 8T         597     689     429    2635    2623    2590
 Results x 100000
 1T       76406   97075   99969   66015   95363   99951

 ########### RPi 3 V7A2 Double Precision ############

  MP-MFLOPS Double Precision v1.0 Sun Jul 15 14:44:57 2018

 1T         182     183     160     684     684     670
 2T         354     361     216    1365    1342    1320
 4T         590     709     215    2695    2695    2544
 8T         609     612     219    2576    2662    2529
 Results x 100000
 1T       76384   97072   99969   66065   95370   99951
None of the test functions are suitable for SIMD operation, with the simpler instructions being used, possibly leading to some 32 bit tests being faster than those compiled for 64 bits. The Fixed Point MIPS loops are clearly over optimised but, in any case, the time taken has little influence on the overall MWIPS rating.
For both 32 and 64 bit versions, overall single core MWIPS were 17% faster on the 3B+, proportional to CPU MHz ratios. MP speed improvements can be judged by the overall running times shown, which should be similar for 1, 2 and 4 threads and double with 8 threads.
 ################# MP-Whetstone Raspbian RPi 3B 32 Bit #################

  MP-Whetstone Benchmark Linux/ARM v1.0 Sun Jun 17 20:55:06 2018

       MWIPS  MFLOPS  MFLOPS  MFLOPS    Cos    Exp      Fixpt       If   Equal
                   1       2       3   MOPS   MOPS       MOPS     MOPS    MOPS

 1T    924.2   335.9   276.8   298.5   18.5   10.4     5817.2   1035.3   719.4
 2T   1864.8   672.5   664.3   594.1   37.3   20.7    11726.4   2386.9  1438.8
 4T   3718.4  1286.3  1303.9  1193.5   74.3   41.5    19961.4   4698.4  2862.7
 8T   3908.9  1639.8  1746.6  1274.2   75.9   43.6    29809.6   6321.5  3002.2

 Overall Seconds   5.02 1T,   4.97 2T,   5.07 4T,  10.08 8T

 ################# MP-Whetstone Raspbian RPi 3B+ 32 Bit #################

  MP-Whetstone Benchmark Linux/ARM v1.0 Sun Jun 17 22:56:26 2018

       MWIPS  MFLOPS  MFLOPS  MFLOPS    Cos    Exp      Fixpt       If   Equal
                   1       2       3   MOPS   MOPS       MOPS     MOPS    MOPS

 1T   1084.2   391.0   384.9   348.6   21.7   12.1     6967.0   1013.1   822.3
 2T   2174.4   778.3   775.7   691.9   43.5   24.2    13762.0   2787.4  1675.0
 4T   4343.8  1540.9  1558.3  1389.5   86.6   48.4    27529.5   5549.8  3338.5
 8T   4548.4  1895.1  1896.0  1504.7   88.0   51.0    39107.6   7287.7  3440.6

 Overall Seconds   5.05 1T,   5.00 2T,   5.06 4T,  10.10 8T

 ################## MP-Whetstone Gentoo RPi 3B 64 Bit ##################

  MP-Whetstone Benchmark armv8 64 Bit Tue Jun 19 00:00:13 2018

       MWIPS  MFLOPS  MFLOPS  MFLOPS    Cos    Exp      Fixpt       If   Equal
                   1       2       3   MOPS   MOPS       MOPS     MOPS    MOPS

 1T    979.8   330.4   322.9   281.5   20.0   10.8  1368033.3   2335.5  1177.1
 2T   1986.0   623.9   659.3   564.2   40.0   22.4  2311401.6   4675.1  2355.7
 4T   3914.5  1206.2  1295.8  1122.2   78.3   44.3  3007162.3   9230.2  4636.6
 8T   4039.7  1498.5  1670.6  1170.2   79.5   45.3  1183764.2  12054.6  5082.7

 Overall Seconds   5.04 1T,   5.01 2T,   5.27 4T,  10.22 8T

 ################## MP-Whetstone Gentoo RPi 3B+ 64 Bit ##################

  MP-Whetstone Benchmark armv8 64 Bit Tue Jun 26 12:02:45 2018

       MWIPS  MFLOPS  MFLOPS  MFLOPS    Cos    Exp      Fixpt       If   Equal
                   1       2       3   MOPS   MOPS       MOPS     MOPS    MOPS

 1T   1151.6   383.0   382.7   327.6   23.2   13.0  1717931.5   2720.5  1364.5
 2T   2311.6   766.5   766.8   657.2   46.5   26.0  3478249.3   5460.9  2738.4
 4T   4579.6  1505.5  1525.7  1304.4   92.0   51.6  4647842.5  10777.1  5448.5
 8T   4788.4  1814.9  1961.4  1381.9   95.0   53.3  5689217.0  13827.3  5810.6

 Overall Seconds   4.96 1T,   4.95 2T,   5.05 4T,  10.07 8T
The average performance gain of the 3B+ over the older 3B was, as usual, the same as the CPU MHz ratio. Single thread performance, at 64 bits, was 55% faster than at 32 bits but, on both boards, the gain reduced to 10% using four threads.
################# MP-Dhrystone Raspbian RPi 3B 32 Bit #################

MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Sun Jun 17 20:36:41 2018

Threads                        1         2         4         8
Seconds                     0.78      0.92      1.27      2.52
Dhrystones per Second    4107750   6949821  10067546  10156278
VAX MIPS rating             2338      3956      5730      5780

End of test Sun Jun 17 20:36:48 2018

################# MP-Dhrystone Raspbian RPi 3B+ 32 Bit #################

MP-Dhrystone Benchmark Linux/ARM V7A v1.0 Mon Jun 18 10:05:26 2018

Threads                        1         2         4         8
Seconds                     0.85      0.96      1.36      2.71
Dhrystones per Second    4732954   8293353  11799850  11823294
VAX MIPS rating             2694      4720      6716      6729

End of test Mon Jun 18 10:05:33 2018

################## MP-Dhrystone Gentoo RPi 3B 64 Bit ##################

MP-Dhrystone Benchmark armv8 64 Bit Tue Jun 19 00:02:49 2018

Threads                        1         2         4         8
Seconds                     0.63      0.79      1.45      2.86
Dhrystones per Second    6364104  10106501  11050923  11173626
VAX MIPS rating             3622      5752      6290      6359

End of test Tue Jun 19 00:02:55 2018

################## MP-Dhrystone Gentoo RPi 3B+ 64 Bit ##################

MP-Dhrystone Benchmark armv8 64 Bit Mon Jun 18 23:11:33 2018

Threads                        1         2         4         8
Seconds                     0.54      0.74      1.24      2.46
Dhrystones per Second    7376153  10819564  12921258  13021546
VAX MIPS rating             4198      6158      7354      7411

End of test Mon Jun 18 23:11:39 2018
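The 64 bit versus 32 bit percentages quoted for MP-Dhrystone can be rechecked from the VAX MIPS ratings above. The following is only an arithmetic sketch in Python; the dictionary layout and the function name are illustrative and not part of the benchmark.

```python
# VAX MIPS ratings copied from the MP-Dhrystone results above (1 and 4 threads).
vax_mips = {
    ("3B", 32): {1: 2338, 4: 5730},
    ("3B+", 32): {1: 2694, 4: 6716},
    ("3B", 64): {1: 3622, 4: 6290},
    ("3B+", 64): {1: 4198, 4: 7354},
}


def gain_64_over_32(board, threads):
    """Return the 64 bit / 32 bit speed ratio for one board and thread count."""
    return vax_mips[(board, 64)][threads] / vax_mips[(board, 32)][threads]


for board in ("3B", "3B+"):
    for threads in (1, 4):
        ratio = gain_64_over_32(board, threads)
        print(f"{board:3s} {threads}T 64/32 bit ratio {ratio:.2f}")
```

The one thread ratios come out at about 1.55 and the four thread ratios at about 1.10, in line with the 55% and 10% figures in the text.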
Performance can vary somewhat with this benchmark but reflects the usual 3B+ speed gains, at least on averaging all results. On the same basis, average 64 bit speeds are suggested as being the same as those at 32 bits, but some indicate slower performance. Similar performance could be expected as the compiled code is derived from high level NEON SIMD vector functions.
The poor performance, even using a single thread, is due to the frequent starting and stopping of threads to execute the critical calculations. Consistent threaded speed indicates a dependency on shared data being written back to RAM. Threaded speed increases with larger matrices, probably because more calculations are carried out during each threaded function call.
MFLOPS 0 to 4 Threads, N 100, 500, 1000

################# MP-Linpack Raspbian RPi 3B 32 Bit #################

Using NEON Intrinsics, Sun Jun 17 20:32:04 2018

Threads      None        1        2        4

N  100     542.22    61.00    60.67    60.74
N  500     480.55   311.06   316.00   303.48
N 1000     364.07   272.49   231.10   232.07

NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

N               100              500             1000

NR             2.17             5.42             9.50
RE   5.16722466e-05   6.46698638e-04   2.26586126e-03
MA   1.19209290e-07   1.19209290e-07   1.19209290e-07
X0  -2.38418579e-07  -5.54323196e-05  -1.26898289e-04
XN  -5.06639481e-06  -4.70876694e-06   1.41978264e-04

################# MP-Linpack Raspbian RPi 3B+ 32 Bit #################

Using NEON Intrinsics, Mon Jun 18 10:00:08 2018

Threads      None        1        2        4

N  100     633.11    70.82    70.13    70.20
N  500     505.37   323.24   326.81   327.73
N 1000     378.29   337.34   337.01   337.80

SumChecks as above but note 64 bit differences - rounding effects?

################## MP-Linpack Gentoo RPi 3B 64 Bit ##################

64 Bit NEON Intrinsics, Tue Jun 19 00:04:01 2018

MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads      None        1        2        4

N  100     551.48    87.43    81.66    82.68
N  500     359.51   258.43   242.92   255.61
N 1000     296.11   281.75   279.20   282.71

NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

N               100              500             1000

NR             1.97             5.40            13.51
RE   4.69621336e-05   6.44138840e-04   3.22485110e-03
MA   1.19209290e-07   1.19209290e-07   1.19209290e-07
X0  -1.31130219e-05   5.79357147e-05  -3.08930874e-04
XN  -1.30534172e-05   3.51667404e-05   1.90019608e-04

################## MP-Linpack Gentoo RPi 3B+ 64 Bit ##################

64 Bit NEON Intrinsics, Mon Jun 18 23:13:26 2018

Threads      None        1        2        4

N  100     639.82   100.30    95.24    95.25
N  500     430.41   292.80   291.12   290.04
N 1000     349.47   313.59   312.38   313.40

SumChecks as above but note 32 bit differences - rounding effects?
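The cost of starting and stopping threads can be estimated roughly from the MP-Linpack figures above. The sketch below uses the standard Linpack nominal operation count, 2n³/3 + 2n², to convert MFLOPS into seconds per solution, then treats the whole difference between the unthreaded and one thread timings as threading overhead. That attribution is an assumption made for illustration, and the function names are hypothetical.

```python
def linpack_ops(n):
    """Nominal floating point operation count for one n x n Linpack solution."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def seconds_per_solve(n, mflops):
    """Time for one solution at the reported MFLOPS rate."""
    return linpack_ops(n) / (mflops * 1.0e6)


def overhead_ms(n, unthreaded_mflops, threaded_mflops):
    """Extra milliseconds per solution when the threaded path is used."""
    extra = seconds_per_solve(n, threaded_mflops) - seconds_per_solve(n, unthreaded_mflops)
    return extra * 1000.0


# RPi 3B+ 32 bit results from the table above: N -> (no threads, 1 thread) MFLOPS.
results = {100: (633.11, 70.82), 500: (505.37, 323.24), 1000: (378.29, 337.34)}

for n, (unthreaded, threaded) in results.items():
    total_ms = seconds_per_solve(n, threaded) * 1000.0
    extra_ms = overhead_ms(n, unthreaded, threaded)
    share = 100.0 * extra_ms / total_ms
    print(f"N {n:4d}: {extra_ms:7.1f} ms extra, {share:4.1f}% of threaded run time")
```

On these figures the extra time dominates at N = 100 and becomes a much smaller share at N = 1000, which is consistent with more calculations being carried out per threaded function call.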
In the original version, each thread started reading data from the same starting point. This produced acceptable results until shared L2 caches appeared. Then it produced excessive RAM speeds, using more than one thread. With version 2, as used for the following, each thread starts reading from different addresses, providing more realistic results.
Considering just the ReadAll speeds, and MP performance variability, the usual 3B+/3B gains applied. Compared with the BusSpeed benchmark results, the 64 bit one thread performance was much slower and many old 3B cache based speeds significantly faster. In this case, disassembly code was examined to identify why. The ReadAll C code comprises a loop with 64 read statements, using AND. The 64 bit compiler produced code with 64 scalar instructions (e.g. and w3, w3, w0) and 64 loads, compared with 32 bits, with 16 four way SIMD instructions (e.g. vand q15, q15, q6), 16 vector loads, but lots of other adds (for indexing?).
At least, performance on reading data from RAM could be nearly doubled using multithreading.
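The version 2 change described above, where each thread starts reading from a different address, can be pictured with a small Python sketch. It is not the benchmark's code: the evenly spaced start offsets, the wrap-around read of the whole array and the function names are all assumptions for illustration, and Python threads will not show the speed gains of the compiled program.

```python
import threading


def read_all(words, start):
    """AND together every word, beginning at index `start` and wrapping round."""
    acc = 0xFFFFFFFF
    count = len(words)
    for i in range(count):
        acc &= words[(start + i) % count]
    return acc


def run_staggered(words, thread_count):
    """Run read_all in several threads, each with a different start offset."""
    results = [None] * thread_count

    def worker(index):
        # Hypothetical scheme: spread the start points evenly over the array,
        # so threads do not all walk through the same cache lines together.
        start = index * len(words) // thread_count
        results[index] = read_all(words, start)

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(thread_count)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


data = [0xFFFFFFFF] * 4096
data[1234] = 0x0F0F0F0F
print([hex(r) for r in run_staggered(data, 4)])
```

Every thread still reads all of the data, so each returns the same AND result, which gives a simple check that the staggered starts have not changed what is calculated.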
################# MP-BusSpd Raspbian RPi 3B 32 Bit #################

MP-BusSpd ARM V7A v2 Fri Jul 13 18:29:45 2018

MB/Second Reading Data, 1, 2, 4 and 8 Threads

     KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

   12.3 1T    2690   3768   3793   4081   4387   4223
        2T    5086   6856   7148   7710   8571   8159
        4T    8285  11814  13335  15091  16656  15720
        8T    6381   8690  10777  11997  14310  13789
  122.9 1T     567    557   1059   1802   2804   3934
        2T     888    903   1746   3287   5379   7686
        4T     895    928   1810   3671   7205  13860
        8T     909    927   1837   3691   7049  13125
  12288 1T     120    124    240    475    963   1906
        2T     135    123    245    505   1010   1978
        4T     135    132    259    467   1080   2135
        8T     126    124    255    500    973   2158

End of test Fri Jul 13 18:29:57 2018

################# MP-BusSpd Raspbian RPi 3B+ 32 Bit #################

MP-BusSpd ARM V7A v2 Fri Jul 13 20:18:36 2018

     KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

   12.3 1T    3510   4345   4419   4731   5031   4928
        2T    6010   7992   8384   9018  10024   9648
        4T   10127  13748  15247  17581  19516  18252
        8T    7165  10780  13100  14043  16201  16504
  122.9 1T     662    648   1247   2090   3246   4565
        2T    1030   1024   2047   3829   6317   8962
        4T    1040   1078   2167   4340   8380  15935
        8T    1052   1077   2122   4263   8362  15826
  12288 1T     129    133    267    516   1044   2085
        2T     141    139    280    544   1115   2126
        4T     141    159    301    530   1075   2338
        8T     153    140    273    618   1190   2488

End of test Fri Jul 13 20:18:48 2018

################## MP-BusSpd Gentoo RPi 3B 64 Bit ##################

MP-BusSpd armv8 64 Bit Tue Jun 19 00:06:12 2018

     KB      Inc32  Inc16   Inc8   Inc4   Inc2  RdAll

   12.3 1T    1462   2407   2584   2038   1461   1492
        2T    4412   4081   4820   3867   2822   2928
        4T    6446   6019   8348   6814   5330   5346
        8T    2598   3924   6114   5788   3827   5016
  122.9 1T     535    569   1016   1578   1425   1470
        2T     687    859   1708   3013   2829   2932
        4T     721    878   1829   3573   4369   5261
        8T     780    897   1827   3588   4949   5271
  12288 1T      30    111    213    365    835   1024
        2T      45     65    143    337    798   1590
        4T      58     71    253    341    663   1546
        8T      47     97    147    443    904   1821

End of test Tue Jun 19 00:06:25 2018
There can be a lot of variability on 4 thread/1 thread performance gains and many runs might be required to provide accurate comparisons. On all tests, 3B+/3B performance gains were as expected for cache based results, with averages between 1.16 and 1.17, with RAM performance being similar. Read only MP gains were mainly greater than 3.5 times for cache tests, except on random access to L2, at around 2.4 times, understandably lower using a shared cache. There was also some increase in MP throughput using RAM. Raspbian based results indicate slightly improved performance over those using 64 bit Gentoo.
################# MP-RandMem Raspbian RPi 3B 32 Bit #################

MP-RandMem Linux/ARM v1.0 Sun Jul 15 10:54:39 2018

MB/Second Using 1, 2, 4 and 8 Threads

     KB       SerRD SerRDWR   RndRD RndRDWR

   12.3 1T     4078    3814    4018    3798
        2T     8045    3768    8043    3777
        4T    15622    3724   15625    3730
        8T    15208    3723   15020    3724
  122.9 1T     3289    3393     827     891
        2T     6556    3379    1512     880
        4T    12125    3364    2078     886
        8T    12309    3364    2042     886
  12288 1T     1669     878      65      64
        2T     3485     872     121      65
        4T     4296     876     146      65
        8T     2435     878     147      65

End of test Sun Jul 15 10:55:24 2018

################# MP-RandMem Raspbian RPi 3B+ 32 Bit #################

MP-RandMem Linux/ARM v1.0 Sun Jul 15 11:03:26 2018

     KB       SerRD SerRDWR   RndRD RndRDWR

   12.3 1T     4747    4447    4776    4435
        2T     9253    4362    9378    4362
        4T    18114    4343   18080    4322
        8T    17813    4345   17788    4321
  122.9 1T     3871    3893     948    1016
        2T     7612    3954    1742    1021
        4T    14399    3929    2383    1025
        8T    14089    3935    2367    1023
  12288 1T     1850     860      67      68
        2T     3670     867     126      67
        4T     4097     874     146      68
        8T     2919     873     148      68

End of test Sun Jul 15 11:04:10 2018

################## MP-RandMem Gentoo RPi 3B 64 Bit ##################

MP-RandMem armv8 64 Bit Tue Jun 19 00:08:43 2018

     KB       SerRD SerRDWR   RndRD RndRDWR

   12.3 1T     4260    3071    4261    3081
        2T     7500    3054    7496    3059
        4T    15092    3018   14794    3019
        8T    14315    2977   14544    2989
  122.9 1T     3385    2861     867     837
        2T     6323    2653    1543     838
        4T    10638    2873    2009     835
        8T    10810    2841    1947     834
  12288 1T     1607     746      71      60
        2T     1605     696     123      59
        4T     1939     766     129      58
        8T     1682     681     141      58

End of test Tue Jun 19 00:09:34 2018
The usual proportional to MHz 3B+ versus 3B gains were provided, with data from caches, and RAM throughput slightly faster. 64 bit/32 bit NotOpenMP performance ratios were similar to those of the Memory Speed Benchmark, but some were worse using OpenMP. As with some other benchmarks, a new gcc 7 compilation might provide an improvement, by including more efficient 64 bit instructions.
############## OpenMP-MemSpeed Raspbian RPi 3B 32 Bit ##############

Memory Reading Speed Test Not OpenMP Version 2 by Roy Longbottom

Start of test Sun Jun 17 20:56:39 2018

  Memory   x[m]=x[m]+s*y[m] Int+    x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4   1577   2537   3790   2360   3449   3789   2673   2694   2692
       8   1594   2547   3811   2388   3469   3812   2717   2716   2716
      16   1595   2553   3825   2393   3478   3825   2728   2728   2728
      32   1556   2435   3566   2312   3272   3566   2730   2712   2715
      64   1508   2300   3304   2177   3065   3303   2542   2485   2485
     128   1515   2305   3353   2183   3108   3356   2644   2573   2574
     256   1527   2341   3431   2226   3183   3432   2673   2615   2616
     512   1406   2083   2869   1983   2702   2873   2558   2495   2404
    1024    935   1228   1295   1194   1300   1315   1561   1360   1349
    2048    889   1091   1170   1083   1162   1167   1211   1096   1099
    4096    890   1109   1169   1089   1167   1168    911    895    903
    8192    906   1141   1188   1116   1194   1168    811    804    802
   16384    916   1159   1202   1132   1209   1206    766    761    761
   32768    928   1166   1206   1119   1224   1206    760    746    746
   65536    970   1171   1210   1140   1225   1212    811    810    808
  131072    966   1172   1207   1141   1230   1146    953    908    883

End of test Sun Jun 17 20:57:07 2018

Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

Start of test Mon Jul 9 10:33:36 2018

       4   5535   2990   1372   8773   4728   1478  15869   7828   1261
       8   6068   3107   1382  10109   5056   1486  16438   8104   1258
      16   5739   3119   1317  10193   5114   1393  16624   7862   1220
      32   5689   3121   1405  10216   5150   1473  16737   8624   1302
      64   5416   3055   1303   8618   4928   1403  12254   8045   1218
     128   5396   3050   1359   9101   4932   1379   9496   8089   1249
     256   5399   3049   1361   8980   4921   1488   8361   7625   1294
     512   4418   2770   1458   6865   4226   1421   5432   5042   1130
    1024   3785   2477   1110   4361   3461   1202   1533   1573   1158
    2048   3729   2466    975   4268   3439   1200   1017   1017   1150
    4096   3714   2477   1144   4228   3370   1431    986    979   1041
    8192   3799   2368   1157   3968   3366   1484    961    950   1142
   16384   1477   2341   1079   4107   3047   1547    982    985   1037
   32768   3351   2499   1080   2089   3216   1437   1005   1001    794
   65536   3820    614   1026   3901   3078   1209   1006   1008    954
  131072    944    614    746   1160    858    765   1074   1034    566

End of test Mon Jul 9 10:34:05 2018
There are variabilities in measured speeds, but the usual 3B+/3B performance ratios can be assumed. Multiprocessor performance gains were disappointing with the 32 bit Raspbian version, but up to scratch at 64 bits, via Gentoo. The latter benchmark was recompiled using gcc 7 to produce best case performance similar to that of MP-MFLOPS - see table above. For these particular benchmarks, the only real 64 bit/32 bit gains are on using 32 operations per word (at between 2.3 and 3.0 times).
The gcc 7 versions, OpenMP-MFLOPS64G7 and notOpenMP-MFLOPS64G7, can be downloaded from ResearchGate in the ompmflops7.tar.gz file.
############## OpenMP-MFLOPS Raspbian RPi 3B 32 Bit ##############

Not OpenMP MFLOPS Benchmark 1 Sun Jun 17 20:58:49 2018

Test              4 Byte  Ops/  Repeat   Seconds  MFLOPS     First  All
                   Words  Word  Passes                     Results Same

Data in & out     100000     2    2500  0.763554     655  0.929538  Yes
Data in & out    1000000     2     250  1.206237     415  0.992550  Yes
Data in & out   10000000     2      25  1.134379     441  0.999250  Yes
Data in & out     100000     8    2500  1.161077    1723  0.957126  Yes
Data in & out    1000000     8     250  1.453741    1376  0.995524  Yes
Data in & out   10000000     8      25  1.435932    1393  0.999550  Yes
Data in & out     100000    32    2500  5.024988    1592  0.890268  Yes
Data in & out    1000000    32     250  5.158612    1551  0.988078  Yes
Data in & out   10000000    32      25  5.275346    1516  0.998806  Yes

End of test Sun Jun 17 20:59:12 2018

OpenMP MFLOPS Benchmark 1 Sun Jun 17 21:02:32 2018

Data in & out     100000     2    2500  0.277303    1803  0.929538  Yes
Data in & out    1000000     2     250  1.183362     423  0.992550  Yes
Data in & out   10000000     2      25  1.138538     439  0.999250  Yes
Data in & out     100000     8    2500  0.445954    4485  0.957126  Yes
Data in & out    1000000     8     250  1.299288    1539  0.995524  Yes
Data in & out   10000000     8      25  1.407459    1421  0.999550  Yes
Data in & out     100000    32    2500  4.305910    1858  0.890232  Yes
Data in & out    1000000    32     250  3.822810    2093  0.988068  Yes
Data in & out   10000000    32      25  3.757323    2129  0.998785  Yes

End of test Sun Jun 17 21:02:51 2018

############## OpenMP-MFLOPS Raspbian RPi 3B+ 32 Bit ##############

Not OpenMP MFLOPS Benchmark 1 Mon Jun 18 10:20:24 2018

Test              4 Byte  Ops/  Repeat   Seconds  MFLOPS     First  All
                   Words  Word  Passes                     Results Same

Data in & out     100000     2    2500  0.682055     733  0.929538  Yes
Data in & out    1000000     2     250  1.200001     417  0.992550  Yes
Data in & out   10000000     2      25  1.120259     446  0.999250  Yes
Data in & out     100000     8    2500  0.997494    2005  0.957126  Yes
Data in & out    1000000     8     250  1.314719    1521  0.995524  Yes
Data in & out   10000000     8      25  1.262752    1584  0.999550  Yes
Data in & out     100000    32    2500  4.307349    1857  0.890268  Yes
Data in & out    1000000    32     250  4.438297    1802  0.988078  Yes
Data in & out   10000000    32      25  4.432952    1805  0.998806  Yes

End of test Mon Jun 18 10:20:44 2018
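The MFLOPS column in the OpenMP-MFLOPS results above is consistent with words × operations per word × repeat passes, divided by the elapsed seconds. A minimal Python sketch of that arithmetic follows; the function name is illustrative and the relationship is inferred from the listed figures, not taken from the benchmark source.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test line."""
    return words * ops_per_word * passes / seconds / 1.0e6


# (words, ops/word, passes, seconds, reported MFLOPS) from the RPi 3B results above.
rows = [
    (100000, 2, 2500, 0.763554, 655),     # Not OpenMP
    (1000000, 8, 250, 1.453741, 1376),    # Not OpenMP
    (100000, 32, 2500, 5.024988, 1592),   # Not OpenMP
    (100000, 8, 2500, 0.445954, 4485),    # OpenMP
]

for words, ops, passes, seconds, reported in rows:
    print(f"{words:9d} words {ops:2d} ops: {mflops(words, ops, passes, seconds):7.1f} "
          f"calculated, {reported} reported")
```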
As can be seen in the results, the Gentoo 64 bit versions are much slower than those using 32 bit Raspbian, probably a current driver issue. Different drivers and hardware might also have led to unlike 3B+/3B comparisons, averaging 1.08 times using Raspbian and 1.27 via Gentoo.
Produced by javac 1.7.0_02, run with java 1.8.0_65
Operating System Linux, Arch. arm, Version 4.14.34-v7+

################# JavaDraw Raspbian RPi 3B 32 Bit #################

Test                              Frames      FPS

Display PNG Bitmap Twice Pass 1      522    52.19
Display PNG Bitmap Twice Pass 2      617    61.66
Plus 2 SweepGradient Circles         627    62.64
Plus 200 Random Small Circles        603    60.22
Plus 320 Long Lines                  425    42.44
Plus 4000 Random Small Circles       306    30.54

Total Elapsed Time 60.1 seconds

################# JavaDraw Raspbian RPi 3B+ 32 Bit #################

Display PNG Bitmap Twice Pass 1      570    56.91
Display PNG Bitmap Twice Pass 2      663    66.25
Plus 2 SweepGradient Circles         673    67.29
Plus 200 Random Small Circles        664    66.38
Plus 320 Long Lines                  450    44.97
Plus 4000 Random Small Circles       336    33.51

Total Elapsed Time 60.1 seconds

Produced by javac 1.7.0_02, run with java 1.8.0_161
Operating System Linux, Arch. aarch64, Version 4.14.44-v8-4fca48b7612d-bis+

################## JavaDraw Gentoo RPi 3B 64 Bit ##################

Display PNG Bitmap Twice Pass 1      326    32.59
Display PNG Bitmap Twice Pass 2      529    52.88
Plus 2 SweepGradient Circles         500    49.97
Plus 200 Random Small Circles        306    30.55
Plus 320 Long Lines                   92     9.18
Plus 4000 Random Small Circles        45     4.46

Total Elapsed Time 60.2 seconds

################## JavaDraw Gentoo RPi 3B+ 64 Bit ##################

Display PNG Bitmap Twice Pass 1      391    39.05
Display PNG Bitmap Twice Pass 2      592    59.18
Plus 2 SweepGradient Circles         538    53.75
Plus 200 Random Small Circles        378    37.78
Plus 320 Long Lines                  167    16.67
Plus 4000 Random Small Circles        53     5.29

Total Elapsed Time 60.1 seconds
The first tests tend to be limited by graphics hardware speed, where 3B+/3B comparisons are less than the CPU MHz ratio, with the kitchen tests approaching this 16.7% improvement. Although probably affected by different drivers, 64/32 bit comparisons suggest similar graphics speeds, but 64 bit CPU instructions indicated performance gains of more than 30% on the textured kitchen.
As usual, see the original ResearchGate PDF file for more details, links for downloads and other reports.
Example Script File

export vblank_mode=0
./videogl32 Width 320, Height 240, NoEnd
./videogl32 Width 640, Height 480, NoHeading, NoEnd
./videogl32 Width 1024, Height 768, NoHeading, NoEnd
./videogl32 NoHeading

NoEnd prevents logging of configuration. Last command uses default resolution.

################# OpenGL GLUT Raspbian RPi 3B 32 Bit #################

GLUT OpenGL Benchmark 32 Bit Version 1, Fri Jul 27 11:56:04 2018

 Window Size   Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels         Few      All      Few      All  Kitchen  Kitchen
  Wide   High      FPS      FPS      FPS      FPS      FPS      FPS

   320    240    327.8    191.9     81.6     51.3     21.1     13.4
   640    480    245.1    161.1     75.1     48.5     21.0     13.5
  1024    768    110.8    102.0     63.8     45.1     21.1     13.4
  1920   1080     49.4     47.4     37.0     32.9     20.7     13.2

End at Fri Jul 27 11:58:18 2018

################# OpenGL GLUT Raspbian RPi 3B+ 32 Bit #################

GLUT OpenGL Benchmark 32 Bit Version 1, Fri Jul 27 11:44:59 2018

   320    240    343.2    199.7     88.7     56.6     23.7     15.2
   640    480    241.0    168.2     79.9     52.5     23.8     15.1
  1024    768    110.5    101.7     63.8     47.1     24.2     15.4
  1920   1080     49.7     47.4     36.9     32.8     23.8     15.2

End at Fri Jul 27 11:47:13 2018

################## OpenGL GLUT Gentoo RPi 3B 64 Bit ##################

GLUT OpenGL Benchmark 64 Bit Version 1, Tue Jul 17 19:26:36 2018

   160    120    382.3    214.5    118.7     72.3     24.9     18.5
   320    240    328.3    199.7    108.9     69.6     24.9     18.4
   640    480    220.4    162.2     89.7     62.3     24.9     18.4
  1024    768    104.1     96.5     61.1     49.9     24.5     17.9
  1920   1080     50.1     47.4     36.6     32.6     23.8     17.7

End at Tue Jul 17 19:29:26 2018

################## OpenGL GLUT Gentoo RPi 3B+ 64 Bit ##################

GLUT OpenGL Benchmark 64 Bit Version 1, Fri Jul 27 11:28:58 2018

   160    120    427.2    239.7    132.6     81.3     28.6     21.2
   320    240    365.7    224.1    121.5     77.5     28.9     21.3
   640    480    247.0    181.6     98.6     68.3     28.5     20.9
  1024    768    116.6    107.0     68.5     56.0     28.2     20.7
  1920   1080     53.8     51.9     40.3     36.1     27.8     20.5

End at Fri Jul 27 11:31:47 2018
64 Bit Benchmark - As with earlier attempts to run DriveSpeed64, it failed, providing error reports. This appears to be due to the “do not cache” open file options, as proved by running LanSpeed64 on local drives. Results of the latter are provided below, showing high speed cached data transfers, particularly using default large file sizes. However, large file writing and reading speeds can be measured by specifying much larger files that are too large to be cached in RAM. See the run command and results on another SanDisk Ultra microSDHC card, showing similar data transfer speeds as the other one, but using 0.5 and 1 Mbyte files.
Results from random access and small file tests are not influenced by the large file parameter.
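The kind of failure described for DriveSpeed64 can be probed in advance by trying an uncached open on the target directory. The Python sketch below assumes that the “do not cache” option corresponds to the Linux O_DIRECT open flag, which the text does not name; the function and the probe file name are hypothetical.

```python
import os
import tempfile


def direct_io_supported(directory):
    """Return True if a file in `directory` can be opened with O_DIRECT."""
    flag = getattr(os, "O_DIRECT", None)  # not defined on every platform
    if flag is None:
        return False
    path = os.path.join(directory, "direct_io_probe.tmp")  # hypothetical name
    fd = None
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | flag, 0o600)
        return True
    except OSError:
        # Some file systems reject uncached opens and report an error.
        return False
    finally:
        if fd is not None:
            os.close(fd)
        if os.path.exists(path):
            os.remove(path)


print("Uncached open supported:", direct_io_supported(tempfile.gettempdir()))
```

Where the probe returns False, a benchmark relying on uncached opens would be expected to fail in the same way, leaving the cached route, with files too large to fit in RAM, as the practical alternative.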
################# DriveSpeed Raspbian RPi 3B 32 Bit #################

DriveSpeed RasPi 1.1 Mon Jul 30 14:37:47 2018

Current Directory Path: /home/pi/benchmarks/DriveSpeed
Total MB 14845, Free MB 10483, Used MB 4363

                      MBytes/Second
     MB    Write1   Write2   Write3    Read1    Read2    Read3

      8     18.80    18.81    11.18    23.36    23.45    23.45
     16      8.62    11.26    10.62    23.42    23.49    23.51
Cached
      8    264.48   261.59   272.90   707.81   599.52   753.99

Random              Read                       Write
From MB         4        8       16        4        8       16
msecs       0.323    0.311    0.288     2.56     1.63     1.57

200 Files           Write                      Read             Delete
File KB         4        8       16        4        8       16    secs
MB/sec       2.36     3.19     2.42     5.96    11.02    12.59
ms/file      1.74     2.57     6.77     0.69     0.74     1.30   0.024

End of test Mon Jul 30 14:38:18 2018

################# DriveSpeed Raspbian RPi 3B+ 32 Bit #################

DriveSpeed RasPi 1.1 Mon Jul 30 14:53:47 2018

      8     19.14     6.37    10.66    23.38    23.53    23.61
     16     10.47    10.63    12.90    23.52    23.27    23.60
Cached
      8    226.44   303.78   299.19   547.29   865.25   921.83

Random              Read                       Write
From MB         4        8       16        4        8       16
msecs       0.356    0.401    0.322     1.62     8.13     1.54

200 Files           Write                      Read             Delete
File KB         4        8       16        4        8       16    secs
MB/sec       2.38     3.01     2.53     8.56     7.86    12.74
ms/file      1.72     2.72     6.48     0.48     1.04     1.29   0.012

End of test Mon Jul 30 14:54:19 2018
As shown in the above PDF report, mount statements are required, whereby the benchmarks are run as local programs on the Raspberry Pi and other Linux based systems. Also, Samba was installed to connect the Pi to a Windows Workgroup, in order to run the Intel EXE based benchmark.
The 32 bit benchmarks were run under Raspbian (Stretch) and 64 bit varieties via Gentoo, with a Raspberry Pi 3B+ communicating with a PC running Windows 7 and a dual booted Windows 10/Linux Ubuntu system, also using the older model 3B to Windows 7 to provide comparisons.
Below is a summary of all of the test results that generally provide best case examples. Even so, it is clear that wide variations in performance make it difficult to provide accurate comparisons. Just dealing with the Pi based programs, note the slow random writing speeds to Windows 10. Considering 3B+ to 3B comparisons to Windows 7, later detailed results include some for reading and writing 512 MB files, where the 3B+ is indicated as being 3.3 to 3.4 times faster on reading with 2.2 times improvement on writing. There appears to be some gain on random writing and 200 short file tests, but not much with the smaller data sizes.
16 MB Files MBytes/Second Write1 Write2 Write3 Read1 Read2 Read3 Raspbian 3B >W7 11.42 11.44 11.44 11.67 11.67 11.67 3B+>W7 35.46 36.18 36.22 25.79 25.74 25.59 3B+>W10 35.55 35.95 36.00 26.95 26.95 27.65 3B+>Ubu 34.58 34.46 34.54 27.19 27.29 27.28 W7 >3B+ 25.67 25.34 16.71 11.49 8.77 7.09 W10>3B+ 25.02 16.97 16.93 11.44 8.66 6.97 Ubu>3B+ 27.52 27.49 27.59 38.91 39.01 39.08 Gentoo 3B >W7 11.22 11.30 11.30 11.63 11.61 11.56 3B+>W7 33.73 35.52 35.19 24.70 22.54 23.39 3B+>W10 33.58 35.30 35.42 13.70 26.50 9.31 3B+>Ubu 33.67 34.73 34.74 20.64 26.90 27.66 W7 >3B+ 25.17 23.77 23.82 14.48 10.39 8.05 W10>3B+ 17.17 25.31 25.26 14.90 10.56 8.15 Ubu>3B+ 21.62 29.01 17.14 39.37 39.62 39.55 Random Read milliseconds Write milliseconds From MB 4 8 16 4 8 16 Raspbian 3B >W7 0.014 0.685 0.829 1.49 1.22 1.35 3B+>W7 0.005 0.659 0.857 0.85 0.91 0.99 3B+>W10 0.005 0.660 1.118 11.79 12.84 14.38 3B+>Ubu 0.005 0.019 0.456 0.49 0.49 0.49 W7 >3B+ 0.338 0.335 0.330 0.422 0.422 0.404 W10>3B+ 0.471 0.457 0.385 0.474 0.463 0.488 Ubu>3B+ 0.49 0.50 0.50 Gentoo 3B >W7 0.022 0.746 0.894 1.58 1.43 1.47 3B+>W7 0.024 0.864 0.706 1.09 1.06 1.04 3B+>W10 0.013 0.556 0.775 23.98 15.66 30.23 3B+>Ubu 0.006 0.067 0.552 0.52 0.52 0.52 W7 >3B+ 0.617 0.613 0.507 0.694 0.651 0.687 W10>3B+ 0.518 0.505 0.499 0.589 0.622 0.609 Ubu>3B+ 0.87 0.63 0.61 200 Files Write ms/file Read ms/file Delete File KB 4 8 16 4 8 16 secs Raspbian 3B >W7 4.60 4.42 6.08 2.61 3.22 5.20 0.547 3B+>W7 3.30 3.38 3.68 2.29 2.03 2.71 0.385 3B+>W10 4.49 4.41 4.81 2.15 2.34 2.54 0.274 3B+>Ubu 5.00 5.02 5.33 2.41 2.72 4.19 0.311 W7 >3B+ 4.83 4.95 6.01 2.55 2.57 2.95 0.831 W10>3B+ 4.74 5.07 5.96 3.20 2.53 3.03 0.841 Ubu>3B+ 4.12 5.02 4.96 2.41 2,53 2.83 1.479 Gentoo 3B >W7 4.78 5.05 6.36 3.07 3.68 6.34 0.833 3B+>W7 3.44 3.88 4.30 2.48 2.16 2.47 0.254 3B+>W10 4.11 4.71 6.81 1.77 2.09 2.84 0.415 3B+>Ubu 5.37 5.56 5.62 3.16 4.32 4.43 0.317 W7 >3B+ 4.49 5.37 5.76 2.76 2.73 3.24 0.812 W10>3B+ 5.28 5.60 6.21 3.67 3.21 3.52 0.849 Ubu>3B+ 3.45 3.69 4.05 
2.72 2.72 3.09 1.299 |
16 MB Files MBytes/Second Average Gain Write1 Write2 Write3 Read1 Read2 Read3 Write Read Raspbian 3B >W7 4.96 5.07 5.06 5.23 6.76 6.63 3B+>W7 11.34 13.82 14.14 8.98 9.97 9.77 2.60 1.54 3B+>W10 11.24 13.78 14.19 8.67 10.22 8.57 3B+>Ubu 11.31 13.30 13.62 10.72 13.25 8.93 W7 >3B+ 9.27 7.34 10.53 5.51 3.92 3.04 W10>3B+ 13.84 13.94 13.77 6.77 4.53 3.41 Ubu>3B+ 12.51 11.27 11.88 14.55 15.53 15.64 Gentoo 3B >W7 4.98 5.04 5.11 6.35 6.28 6.11 3B+>W7 11.54 11.59 11.78 4.27 4.10 4.16 2.31 0.67 3B+>W10 10.04 11.00 11.38 3.77 4.19 3.67 3B+>Ubu 9.94 11.04 11.31 4.21 3.78 4.18 W7 >3B+ 3.91 3.97 3.98 2.75 2.13 1.74 W10>3B+ 4.08 4.06 2.57 2.07 1.76 1.53 Ubu>3B+ 4.19 4.20 4.22 10.24 11.51 11.68 Random Read milliseconds Write milliseconds Average Gain From MB 4 8 16 4 8 16 Read Write Raspbian 3B >W7 3.360 3.445 3.654 3.64 3.39 3.35 3B+>W7 2.275 2.755 2.782 2.93 2.68 2.76 1.34 1.24 3B+>W10 10.197 6.838 2.785 20.95 18.85 16.46 3B+>Ubu 2.429 2.778 2.829 1.39 1.39 1.39 W7 >3B+ 1.375 1.344 1.329 1.570 1.561 1.539 W10>3B+ 1.275 1.262 1.271 1.526 1.495 1.510 Ubu>3B+ 2.11 2.12 2.12 Gentoo 3B >W7 3.194 3.472 3.750 3.83 3.59 3.61 3B+>W7 2.824 2.884 2.964 3.11 2.87 2.90 1.20 1.24 3B+>W10 2.779 2.768 2.740 20.67 20.91 20.38 3B+>Ubu 2.991 3.160 3.385 1.67 1.61 1.62 W7 >3B+ 1.487 1.421 1.435 1.860 1.779 1.799 W10>3B+ 1.518 1.458 1.400 2.072 2.236 1.980 Ubu>3B+ 2.29 2.29 2.31 200 Files Write ms/file Read ms/file Delete Average Gain File KB 4 8 16 4 8 16 secs Read Write Raspbian 3B >W7 13.80 15.00 18.23 12.20 16.37 14.66 2.616 3B+>W7 10.36 11.25 11.22 9.91 10.64 11.11 1.910 1.43 1.37 3B+>W10 11.18 11.91 13.13 10.03 10.52 11.22 1.259 3B+>Ubu 13.46 13.46 14.05 24.57 11.88 12.19 1.509 W7 >3B+ 12.76 13.15 13.93 4.73 5.83 6.39 1.969 W10>3B+ 21.54 20.53 23.41 6.04 7.48 7.99 2.735 Ubu>3B+ 9.96 10.59 14.38 6.46 7.37 7.68 2.603 Gentoo 3B >W7 14.21 17.01 18.07 12.79 15.85 14.83 2.530 3B+>W7 10.86 11.88 13.69 11.11 12.51 14.31 2.139 1.35 1.15 3B+>W10 11.92 12.94 13.30 10.82 12.00 13.92 2.055 3B+>Ubu 13.98 
14.25 15.05 12.75 13.37 15.74 1.711 W7 >3B+ 14.17 14.91 17.20 5.36 6.35 7.71 2.442 W10>3B+ 35.64 27.95 29.61 7.09 7.78 8.88 3.572 Ubu>3B+ 9.75 11.00 12.77 7.16 7.91 8.87 2.516 |
During the tests, another program was available to measure CPU MHz and core temperature, on a sampling basis. In view of voltage related problems identified during MultiThreading Benchmarks, measurement of this has been included in new 32 bit and 64 bit versions. As shown below, the voltage option can be of importance in considering heating effects on performance.
Command ./RPiHeatMHzVolts64G passes 60, seconds 16

Temperature and CPU MHz Measurement

Start at Tue Jul 31 21:14:36 2018

Seconds
   0.0   1400 scaling MHz, 1400 ARM MHz, core volt=1.3438V, temp=58.0°C
  16.0   1400 scaling MHz, 1400 ARM MHz, core volt=1.3500V, temp=65.0°C
  32.5   1400 scaling MHz, 1400 ARM MHz, core volt=1.3563V, temp=69.3°C
  49.1   1400 scaling MHz, 1200 ARM MHz, core volt=1.3563V, temp=70.4°C
  65.6   1400 scaling MHz, 1199 ARM MHz, core volt=1.2375V, temp=70.9°C
  82.1   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=71.4°C
  98.7   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=72.0°C
 115.2   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=73.1°C
 131.7   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=74.1°C
 148.3   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=74.1°C
 164.9   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=74.7°C
 181.5   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=75.2°C
 197.9   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=75.2°C
 214.4   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=75.2°C
 230.9   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=76.3°C
 247.5   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=77.4°C
 264.0   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=77.4°C
 280.5   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=77.4°C
 297.1   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=77.4°C
 313.6   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=77.4°C
 330.1   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=77.4°C
 346.7   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=78.4°C
 363.2   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=78.4°C
 379.8   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=78.4°C
 396.4   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 413.0   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 429.6   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 446.1   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 462.6   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 479.2   1400 scaling MHz, 1195 ARM MHz, core volt=1.2375V, temp=80.1°C
 495.9   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 512.3   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 528.8   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 545.3   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 561.8   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=78.4°C
 578.3   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.0°C
 594.9   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=79.5°C
 611.6   1400 scaling MHz, 1195 ARM MHz, core volt=1.2375V, temp=79.5°C
 628.1   1400 scaling MHz, 1195 ARM MHz, core volt=1.2375V, temp=79.5°C
 644.6   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=80.6°C
 661.1   1400 scaling MHz, 1200 ARM MHz, core volt=1.2375V, temp=80.6°C
 677.6   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.1°C
 694.2   1400 scaling MHz, 1194 ARM MHz, core volt=1.2375V, temp=80.6°C
 710.7   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.6°C
 727.3   1400 scaling MHz, 1195 ARM MHz, core volt=1.2375V, temp=80.6°C
 743.8   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.6°C
 760.3   1400 scaling MHz, 1087 ARM MHz, core volt=1.2375V, temp=80.1°C
 776.8   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.6°C
 793.3   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.6°C
 809.8   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.6°C
 826.3   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=81.1°C
 842.9   1400 scaling MHz, 1087 ARM MHz, core volt=1.2375V, temp=80.6°C
 859.5   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.6°C
 876.0   1400 scaling MHz, 1087 ARM MHz, core volt=1.2375V, temp=81.1°C
 892.5   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.6°C
 909.1   1400 scaling MHz, 1140 ARM MHz, core volt=1.2375V, temp=80.6°C
 925.6   1400 scaling MHz, 1087 ARM MHz, core volt=1.2375V, temp=81.1°C
 942.2   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=80.6°C
 958.7   1400 scaling MHz, 1141 ARM MHz, core volt=1.2375V, temp=81.1°C
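A log in the format above can be scanned to find when the ARM clock first drops below the scaling frequency. The Python sketch below parses a few of the logged samples; the regular expression and function names are illustrative and are not taken from the RPiHeatMHzVolts program.

```python
import re

# One sample: seconds, scaling MHz, ARM MHz, core volts and temperature.
LOG_LINE = re.compile(
    r"(?P<secs>\d+\.\d+)\s+(?P<scaling>\d+) scaling MHz,\s+(?P<arm>\d+) ARM MHz,"
    r"\s+core volt=(?P<volts>\d+\.\d+)V,\s+temp=(?P<temp>\d+\.\d+)"
)

# First five samples copied from the log above.
SAMPLE_LOG = """\
0.0 1400 scaling MHz, 1400 ARM MHz, core volt=1.3438V, temp=58.0°C
16.0 1400 scaling MHz, 1400 ARM MHz, core volt=1.3500V, temp=65.0°C
32.5 1400 scaling MHz, 1400 ARM MHz, core volt=1.3563V, temp=69.3°C
49.1 1400 scaling MHz, 1200 ARM MHz, core volt=1.3563V, temp=70.4°C
65.6 1400 scaling MHz, 1199 ARM MHz, core volt=1.2375V, temp=70.9°C
"""


def parse_log(text):
    """Return one dictionary per logged sample."""
    return [
        {
            "secs": float(m["secs"]),
            "scaling_mhz": int(m["scaling"]),
            "arm_mhz": int(m["arm"]),
            "volts": float(m["volts"]),
            "temp_c": float(m["temp"]),
        }
        for m in LOG_LINE.finditer(text)
    ]


def first_throttled(samples):
    """Return the first sample where ARM MHz is below scaling MHz, or None."""
    for sample in samples:
        if sample["arm_mhz"] < sample["scaling_mhz"]:
            return sample
    return None


hit = first_throttled(parse_log(SAMPLE_LOG))
print(f"First reduction at {hit['secs']} s: {hit['arm_mhz']} MHz, "
      f"{hit['volts']} V, {hit['temp_c']} °C")
```

In this log the clock reduction to 1200 MHz is first sampled at 49.1 seconds and 70.4°C, with the core voltage reduction appearing in the following sample.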
Next, in the table, are the first to start results from a 64 bit Gentoo session that reported the above MHz, voltage and temperature measurements (started 3 seconds earlier). The test program is intended to run each part for the same number of seconds, leading to lower pass counts as the CPU speed reduces. This is followed by performance obtained on running a single copy of the benchmarks with L1 cache, L2 cache and RAM based data. In this case, the 32 bit versions are shown to be faster than the 64 bit compilations.
Finally, there is a range of 3B+ stress test results, where no heat sink was used and the board was installed in a plastic case. The cool ones were from first runs when the room temperature was 23 °C, and the hot ones from a second test when the room temperature was 3 to 4 °C higher.
The critical 80 °C was breached in all cases, but not until the Read Only section with the cooler tests. Performance degradation is shown to be quite similar at 32 and 64 bits, at least with Read/Write tests, also on comparing MB/second and CPU MHz reductions.
Shell Script

 lxterminal --geometry=80x15 -e ./RPiHeatMHzVolts passes 60, seconds 16
 lxterminal --geometry=80x15 -e ./stressIntPiA7 KB 16 Secs 80 Log 21
 lxterminal --geometry=80x15 -e ./stressIntPiA7 KB 16 Secs 80 Log 22
 lxterminal --geometry=80x15 -e ./stressIntPiA7 KB 16 Secs 80 Log 23
 lxterminal --geometry=80x15 -e ./stressIntPiA7 KB 16 Secs 80 Log 24

 Gentoo Integer Stress Test RPi 64 Tue Jul 31 21:14:39 2018

 16 KBytes Cache or RAM Space, 80 Seconds Per Test, 12 Tests

 Write/Read
  1   2748 MB/sec  Pattern 00000000 Result OK   6708355 passes
  2   2529 MB/sec  Pattern FFFFFFFF Result OK   6173859 passes
  3   2504 MB/sec  Pattern A5A5A5A5 Result OK   6113617 passes
  4   2508 MB/sec  Pattern 55555555 Result OK   6124102 passes
  5   2515 MB/sec  Pattern 33333333 Result OK   6141199 passes
  6   2504 MB/sec  Pattern F0F0F0F0 Result OK   6113284 passes
 Read
  1   2806 MB/sec  Pattern 00000000 Result OK  13702300 passes
  2   2826 MB/sec  Pattern FFFFFFFF Result OK  13797100 passes
  3   2740 MB/sec  Pattern A5A5A5A5 Result OK  13378700 passes
  4   2676 MB/sec  Pattern 55555555 Result OK  13068900 passes
  5   2656 MB/sec  Pattern 33333333 Result OK  12967300 passes
  6   2658 MB/sec  Pattern F0F0F0F0 Result OK  12977800 passes

 Single Core Speeds, 32 Bit Raspbian and 64 Bit Gentoo, MB/second

              Write/Read                            Read
          32 Bit           64 Bit           32 Bit           64 Bit
 KB     16   64 2048     16   64 2048     16   64 2048     16   64 2048
 3B+  3883 3786 1681   2991 2910 1480   4246 3625 1907   3344 2985 1800

 #################################################################################

                     Raspbian 32 Bit                  Gentoo 64 Bit
                  Cool            Hot              Cool            Hot
             MB/s  MHz   °C  MB/s  MHz   °C   MB/s  MHz   °C  MB/s  MHz   °C
 Write/Read       1400 44.0       1400 59.1        1400 47.8       1399 65.5
     1       3699 1400 68.2  3240 1200 75.2   2927 1400 69.8  2589 1200 78.4
     2       3415 1200 70.9  3297 1200 79.0   2611 1200 70.9  2509 1141 81.1
     3       3218 1200 72.9  3247 1195 80.6   2536 1200 72.0  2375 1087 81.7
     4       3299 1200 73.1  3045 1141 81.1   2525 1200 75.2  2302 1034 81.7
     5       3288 1200 75.2  2929 1087 81.7   2533 1200 75.8  2263 1034 81.7
     6       3291 1200 76.3  2882 1033 81.7   2533 1200 77.4  2231 1034 81.7
 Read
     1       3620 1200 75.8  3343 1141 81.7   2832 1200 77.4  2646 1141 81.7
     2       3602 1200 78.4  3153 1034 81.7   2841 1200 78.4  2539 1034 81.7
     3       3592 1195 79.5  3015  980 82.2   2829 1195 80.1  2470 1034 81.7
     4       3567 1141 80.5  2938  926 82.2   2790 1141 80.6  2444 1033 82.2
     5       3500 1141 80.6  2922  980 82.7   2733 1141 80.6  2414 1034 82.2
     6       3432 1087 80.6  2876  980 82.2   2679 1087 80.1  2152  980 82.2

 1 Core RW   3883 1400       3883 1400        2991 1400       2991 1400
 1 Core Rd   4246 1400       4246 1400        3344 1400       3344 1400
 Min RW      3218 1200       2882 1033        2525 1200       2231 1034
 Min Rd      3432 1087       2876  926        2679 1087       2152  980
 %1 Core RW    83   86         74   74          84   86         75   74
 %1 Core Rd    81   78         68   66          80   78         64   70
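The write/read pattern test and its MB/sec figures can be pictured with the short C sketch below. It is only an illustration, not the source of stressIntPi: the function names and the array `stress_patterns` are hypothetical, and 1 MB is assumed to be 1,000,000 bytes. That assumption at least agrees with the first Gentoo log line, where 6708355 write plus read passes over 16384 bytes in 80 seconds give about 2748 MB/sec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The six data patterns listed in the log above. */
const uint32_t stress_patterns[6] = {
    0x00000000u, 0xFFFFFFFFu, 0xA5A5A5A5u,
    0x55555555u, 0x33333333u, 0xF0F0F0F0u
};

/* Hypothetical helper: fill the buffer with one pattern, then read
   every word back. Returns the number of words that did not match. */
size_t write_read_check(uint32_t *buf, size_t words, uint32_t pattern)
{
    size_t errors = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern;              /* write phase */
    for (size_t i = 0; i < words; i++)
        if (buf[i] != pattern)         /* read and compare phase */
            errors++;
    return errors;
}

/* Data rate in MB/second, assuming 1 MB = 1,000,000 bytes. */
double mb_per_second(double bytes_moved, double seconds)
{
    assert(seconds > 0.0);
    return bytes_moved / seconds / 1.0e6;
}
```

For a 16 KByte test, `write_read_check(buf, 4096, stress_patterns[2])` would cover 4096 four byte words with pattern A5A5A5A5. A real stress test would repeat this for the whole run time and would need to stop the compiler optimising the comparisons away.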
Below is the start of the program output where, in this case, each pass carries out the same number of calculations, resulting in longer time when the core MHz reduces. This is followed by performance, in MFLOPS, for the three increasing operations per word, using L1 cache, L2 cache and RAM, via both 32 bit and 64 bit compilations. Note that the latter obtains the highest maximum speeds, but is slower using certain test functions.
No Heatsink
The next table provides results of fifteen minute stress tests on the Raspberry Pi 3B+, in a plastic case with no heatsink, using 32 bit Raspbian and 64 bit Gentoo. In this case, the cool runs immediately followed powering on, with room temperature around 23°C, and the hot ones started shortly after the others finished. With the different timing procedures, the MFLOPS, MHz and temperature measurements shown are at approximately the same time. MFLOPS is the average speed over 15 to 17 minutes, with temperature and MHz sampled instantaneously at around 16 minute intervals, both varying up and down within a test.
Although the MFLOPS speed is slower, the 32 bit test appears to generate higher temperatures, with earlier degradation to 1200 MHz and further reductions on exceeding 80°C (worst cases 81.7°C, 1033 MHz, 2681 from 3439 MFLOPS).
The 32 bit version was compiled to use NEON four way SIMD instructions, with the other employing the more recent 64 bit SIMD vector functions, apparently less demanding power wise.
Part example one of four 64 bit programs

 Command ./burninfpuPi64 KWords 4, Section 2, minutes 15, Log 21

 Burn-In-FPU RPi 64 Sat Aug 11 13:04:10 2018

 Using 16 KBytes, 8 Operations Per Word, For Approximately 15 Minutes

   Pass    4 Byte  Ops/   Repeat    Seconds   MFLOPS          First   All
            Words  Word   Passes                            Results  Same

      1      4000     8  1996000      15.03     4251    0.539296687   Yes
      2      4000     8  1996000      15.56     4104    0.539296687   Yes
      3      4000     8  1996000      16.58     3851    0.539296687   Yes
      4      4000     8  1996000      17.16     3721    0.539296687   Yes
      5      4000     8  1996000      17.40     3671    0.539296687   Yes
      6      4000     8  1996000      17.74     3600    0.539296687   Yes
      7      4000     8  1996000      17.87     3574    0.539296687   Yes
      8      4000     8  1996000      17.72     3605    0.539296687   Yes
      9      4000     8  1996000      17.59     3630    0.539296687   Yes
     10      4000     8  1996000      17.55     3640    0.539296687   Yes

 Single Core MFLOPS, 32 Bit Raspbian and 64 Bit Gentoo

 Ops/Word      2     2     2      8     8     8     32    32    32
 K Bytes      16    64  2048     16    64  2048     16    64  2048
 K Words       4    16   512      4    16   512      4    16   512

 32 Bit     1788  1672   413   3439  3365  1636   2011  2000  1846
 64 Bit     2070  1924   405   4360  4278  1617   1781  1775  1696
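The MFLOPS column in the log above appears to follow directly from the other columns: words, multiplied by operations per word and repeat passes, divided by the seconds taken. The sketch below is a minimal illustration of that arithmetic; the function name `pass_mflops` is hypothetical and not taken from the benchmark source.

```c
#include <assert.h>

/* Hypothetical helper: MFLOPS for one pass, where every pass executes
   words * ops_per_word * repeat_passes floating point operations. */
double pass_mflops(double words, double ops_per_word,
                   double repeat_passes, double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * repeat_passes / seconds / 1.0e6;
}
```

With the pass 1 values, `pass_mflops(4000, 8, 1996000, 15.03)` gives roughly 4250, within rounding of the logged 4251. As the operation count per pass is fixed, the longer times recorded for later passes translate directly into lower MFLOPS as the CPU MHz is throttled.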
Whilst running, results from all tests are displayed, with the logged summary shown below.
Processor MHz and temperature measurements, for both 32 bit and 64 bit programs, follow the log. The cool versions were run immediately after powering on, with a room temperature around 22°C, and the hot tests followed shortly afterwards. As this program has 72 different variations in code executed, temperature can go up and down, with a maximum of just about 80°C. For most of the time, in all cases, the CPU was at 1200 MHz, and this is reflected in little difference between hot and cool overall performance ratings (provided below). Here, single core performance results generally indicate faster speeds, proportional to the MHz speed increase.
Example Log File Entry ./liverloopsPi64 Seconds 12

 Livermore Loops Benchmark armv8 64 Bit via C/C++ Fri Aug 17 12:33:02 2018

 Reliability test 12 seconds each loop x 24 x 3

 Part 1 of 3 start at Fri Aug 17 12:33:03 2018
 Part 2 of 3 start at Fri Aug 17 12:38:31 2018
 Part 3 of 3 start at Fri Aug 17 12:43:14 2018

 Numeric results were as expected

 MFLOPS for 24 loops
   530.4   296.7   513.6   450.1   198.5   191.2
   629.0   424.8   450.5   229.1   145.8   208.1
   105.1   128.3   247.8   219.5   374.4   440.5
   284.2   237.5   260.8    78.5   307.1   176.3

 Overall Ratings
             Maximum Average Geomean Harmean Minimum
               629.0   274.7   243.8   213.4    76.8
 ====================================================================================
 64b Hot       622.0   272.0   241.8   212.4    77.7
 64b 1 Core    720.6   320.2   285.6   251.9    94.4
 32b Cool      428.4   210.4   187.2   164.4    66.0
 32b Hot       386.3   209.5   187.0   164.5    66.2
 32b 1 Core    462.5   243.8   215.2   185.7    65.6

             Raspbian 32 Bit            Gentoo 64 Bit
            Cool        Hot            Cool        Hot
 Secs    MHz   °C    MHz   °C       MHz   °C    MHz   °C

    0   1400 53.7   1400 58.0      1400 52.6   1399 60.1
   15   1400 65.5   1199 69.8      1400 60.1   1400 69.3
   31   1400 69.8   1200 70.4      1400 66.6   1200 70.4
   46   1200 70.9   1200 73.1      1200 70.9   1200 74.1
   62   1200 70.4   1200 72.5      1200 70.9   1200 76.8
   77   1199 70.4   1200 71.4      1200 69.8   1200 74.1
   93   1200 70.4   1200 72.5      1200 70.9   1200 74.1
  280   1200 72.0   1200 74.1      1200 70.9   1200 74.1
  296   1200 72.0   1200 74.1      1200 72.0   1200 75.8
  311   1200 73.6   1200 75.2      1200 70.9   1200 75.2
  327   1200 74.1   1200 76.3      1200 72.0   1200 76.3
  343   1200 73.6   1200 76.3      1200 74.1   1200 77.4
  358   1200 75.8   1200 77.4      1200 73.6   1200 77.4
  374   1200 75.2   1200 78.4      1200 77.4   1195 79.5
  389   1200 74.7   1200 76.3      1200 75.2   1195 79.5
  607   1200 76.3   1200 77.4      1199 74.1   1200 76.8
  623   1200 75.8   1200 77.9      1200 75.2   1200 77.9
  639   1200 77.4   1200 79.0      1200 75.8   1200 78.4
  654   1200 77.4   1200 79.5      1200 78.4   1200 79.5
  670   1200 77.4   1200 78.4      1200 78.4   1141 80.1
  685   1200 76.3   1200 78.4      1200 76.3   1199 79.5
  701   1200 76.8   1200 78.4      1200 76.8   1200 79.0
  763   1200 76.3   1200 78.4      1200 76.8   1200 78.4
  779   1200 75.2   1200 77.9      1200 77.4   1200 78.4
  794   1200 76.8   1200 78.4      1200 75.2   1200 77.9
  810   1200 75.8   1200 78.4      1200 76.3   1199 77.4
  826   1200 75.2   1200 77.4      1200 76.3   1200 77.9
  841   1200 75.2   1200 77.4      1200 77.4   1200 77.4
  857   1200 75.2   1200 76.3      1200 74.1   1200 77.4
  872   1200 76.3   1200 77.9      1200 75.8   1200 76.8
Hot Test - The Tiled Kitchen test was then run for 15 minute test periods, when room temperature was about 22°C. As indicated, a first run showed a constant 20 FPS speed (rounded up or down), nearly reaching 70°C, where MHz reduces to 1200. However, the short term FPS, displayed during the tests, sometimes showed 19 FPS, indicating that the MHz can vary quite rapidly. The hot run followed almost immediately afterwards, when the display recorded between 17 and 20 FPS. The unsynchronised variations in temperatures, voltage and CPU MHz again suggest rapid variations. Recorded temperature reached 70.9°C.
Extended Power Cable Test - As mentioned with MultiThreading Benchmarks, the CPU cores can run slowly with longer than normal power supply cables. A short videogl64 test was run with a one metre extension cable, on the 2.5A power supply. This time, core voltage measurements were included, indicating 1.2 volts, instead of 1.35 (or thereabouts), with 600 MHz and 8 FPS. The Pi 3B+ deserves a commendation for actually running in these circumstances (a permanent way of running cool?).
Commands

 ./RPiHeatMHzVolts64G passes 7 seconds 10
 ./videogl64 test n, mins 1, where n = 1 to 6
 ./vmstat 10 7 - for 7 samples every 10 seconds, example output next

 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
  1  0      0 530084  22812 216736    0    0     0    15 3029 1131  9  3 89  0  0

                                  Raspbian 32 bit       Gentoo 64 bit
 Test                             FPS  %CPUt    °C     FPS  %CPUt    °C

 1 Few Objects                     49     48  58.5      54     64  62.8
 2 All Objects, No Textures        47     60  59.1      52     80  63.4
 3 Few Objects, With Textures      36     80  60.1      39     96  64.5
 4 All Objects, With Textures      32    100  61.2      35     96  65.5
 5 WireFrame Kitchen               23    116  61.2      27    124  66.6
 6 Tiled Kitchen                   14    112  61.2      20    116  66.6
 6 Tiled Kitchen 15 minutes        14    112  69.3      20    116  69.8

 Hot Run

 OpenGL Reliability Test 64 Bit Version 1, Mon Aug 20 17:12:38 2018
 Display 1920 x 1080 Tiled Kitchen, Test for 15 minutes

 Normal Output Part                         RPiHeatMHzVolts Results

 Start                                      1400 ARM MHz, core volt=1.3438V, temp=62.8°C
 Test 6 Tiled Kitchen, 30 seconds, 20 FPS   1400 ARM MHz, core volt=1.3500V, temp=67.7°C
 Test 6 Tiled Kitchen, 30 seconds, 20 FPS   1400 ARM MHz, core volt=1.3500V, temp=69.3°C
 Test 6 Tiled Kitchen, 30 seconds, 20 FPS   1400 ARM MHz, core volt=1.2375V, temp=69.8°C
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   1400 ARM MHz, core volt=1.3500V, temp=69.8°C
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   1200 ARM MHz, core volt=1.2375V, temp=69.8°C
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   Continued 19 FPS next 14 entries
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   5 at 1200 MHz, 3 at 1.2375V
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   Temperatures 69.8°C and 70.9°C up and down
 Test 6 Tiled Kitchen, 30 seconds, 18 FPS   1400 ARM MHz, core volt=1.2375V, temp=69.8°C
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   1400 ARM MHz, core volt=1.3500V, temp=70.4°C
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   Continued 19 FPS 10 entries to end
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   2 at 1200 MHz, 5 at 1.2375V
 Test 6 Tiled Kitchen, 30 seconds, 19 FPS   Temperatures 69.8°C and 70.9°C up and down

 Extended Power Cable Test

 OpenGL Reliability Program 64 Bit Version 1

 Test 6 Tiled Kitchen, 30 seconds, 8 FPS
 Test 6 Tiled Kitchen, 30 seconds, 8 FPS

 Temperature and CPU MHz Measurement

 Using 70 samples at 1 second intervals

 Seconds
    0.0   1400 scaling MHz,  1400 ARM MHz, core volt=1.2000V, temp=52.6°C
    1.0   1400 scaling MHz,   600 ARM MHz, core volt=1.2000V, temp=52.6°C
    2.4   1400 scaling MHz,   600 ARM MHz, core volt=1.2000V, temp=52.6°C
    3.8   1400 scaling MHz,  1400 ARM MHz, core volt=1.2000V, temp=52.6°C
 To end   53.2°C to 55.3°C, mainly 600 MHz, 1.2000V, some 1400 MHz, 1.3438V
The stand alone OpenGL Tests CPU utilisation report indicated that more than one core was being used for some of the time. This probably led to greater floating point and OpenGL performance reductions, compared with the integer tests, with similar average temperatures. The run using Livermore Loops was the least affected.
These tests were run with the main board in a simple plastic case and the CPU having no heatsink. The following section includes repeats of the tests, with the system in a FLIRC case, where the whole aluminium case becomes the heatsink, and which led to significantly lower CPU temperatures during earlier Raspberry Pi 3B stress tests - see Raspberry Pi 2 and 3 Stress Tests (possibly slow to load archived copy) and ResearchGate Pi 3B Report.
Raspberry Pi 3B+ 64 Bit Stress Tests

          stressIntPi64+OGL   burninfpuPi64+OGL   liverloopsPi64+OGL
 Secs      MHz   FPS    °C     MHz   FPS    °C     MHz   FPS    °C

    0     1400        55.8    1400        66.1    1400        49.9
   30     1200    16  69.8    1200     9  72.5    1400    17  67.7
   60     1200    17  73.1    1200    11  75.2    1200    17  70.9
   90     1200    16  75.2    1200    10  76.3    1200    16  70.9
  120     1200    16  76.3    1200    10  75.8    1200    16  70.4
  150     1200    16  77.4    1200    11  77.4    1200    16  70.9
  180     1200    16  78.4    1200    10  77.9    1200    16  73.1
  210     1200    16  79.5    1200    10  78.4    1200    15  73.6
  240     1200    16  80.1    1200    11  79.5    1200    16  74.1
  270     1141    16  80.6    1200    12  80.1    1200    16  74.1
  300     1034    15  81.1    1195    13  80.6    1200    16  74.7
  330     1141    15  80.6    1195    11  80.6    1200    16  79.0
  360     1141    15  81.7    1087    11  80.6    1200    16  77.9
  390     1034    15  81.1    1141    12  80.6    1200    16  77.4
  420     1034    15  81.7    1141    11  81.1    1200    15  78.4
  450     1034    14  80.6    1141    11  80.6    1200    15  79.5
  480     1141    15  80.6    1141    10  80.6    1200    17  79.0
  510      980    15  81.7    1141    11  81.1    1200    16  78.4
  540     1034    15  80.6    1034    11  80.6    1200    16  79.0
  570     1034    14  81.7    1087    11  80.6    1195    16  80.6
  600     1034    15  81.1    1141    11  81.7    1034    17  80.6
  630     1034    14  81.7    1087    11  81.1    1141    16  81.1
  660     1034    14  81.1    1087    10  80.6    1195    16  80.6
  690     1034    14  81.7    1034     9  81.1    1200    15  79.5
  720     1034    14  81.7    1140    11  81.7    1141    16  80.6
  750      980    14  81.7    1141    10  81.7    1141    16  79.5
  780     1034    14  81.7    1034    10  81.1    1141    16  80.6
  810     1034    14  82.7    1034     8  82.2    1141    16  80.6
  840     1034    14  82.2    1141    10  81.1    1141    16  81.1
  870     1034    14  82.2    1033    10  81.7    1200    16  76.8
  900     1141    13  79.5    1141    10  81.1    1200    16  75.2

 Average  1093  14.9  80.0    1137  10.5  79.8    1189  16.0  76.9
 %Av/Max    78    75            81    53            85    80

 Performance    MB/sec              MFLOPS              MFLOPS
 Average Was      2376                2880               229.7
 Max              3168                4360               285.6
 %                75.0                66.1                  80

 Max is typical average performance testing a single core
Performance using the FLIRC case was clearly superior to that from using a plastic case with no heatsink fitted on the Pi board but, with varying starting conditions, it is difficult to be precise. In the latest results, there is little sign of core temperatures reaching 70°C, with the CPU almost always at 1400 MHz. Average graphics Frames Per Second were quite close to the maximum possible, with all cores fully utilised (at 19 FPS), and 27% to 75% faster than using the plastic case. Average MB/second, from the integer tests, was also 24% to 27% faster, with floating point MFLOPS providing a 42% to 44% improvement.
Raspberry Pi 3B+ 64 Bit Stress Tests

          stressIntPi64+OGL   stressIntPi64+OGL   burninfpuPi64+OGL   burninfpuPi64+OGL
 Secs      MHz   FPS    °C     MHz   FPS    °C     MHz   FPS    °C     MHz   FPS    °C

    0     1400        37.6    1400        53.7    1400        46.2    1400        50.5
   30     1400    18  47.2    1400    19  60.1    1400    19  54.2    1400    19  56.4
   60     1400    19  49.9    1400    19  63.9    1400    19  56.4    1400    19  60.1
   90     1400    19  51.5    1400    19  64.5    1400    19  58.5    1400    19  61.8
  120     1400    19  52.1    1400    19  66.6    1400    19  58.5    1400    18  62.3
  150     1399    19  53.7    1400    19  66.6    1400    19  60.1    1400    19  63.4
  180     1400    19  54.2    1400    18  68.2    1399    18  60.7    1400    18  63.4
  210     1400    19  55.8    1400    19  67.7    1400    19  61.8    1400    19  64.5
  240     1400    19  55.8    1400    18  68.8    1400    18  62.3    1400    19  65.0
  270     1399    19  56.9    1399    19  68.8    1400    19  62.3    1400    19  65.5
  300     1400    19  58.0    1400    19  69.3    1400    19  62.8    1400    19  66.1
  330     1400    18  58.0    1400    19  69.3    1400    19  62.8    1400    19  66.6
  360     1400    19  59.1    1400    19  69.3    1400    19  63.4    1400    18  66.6
  390     1400    19  59.1    1400    18  69.8    1400    18  64.5    1400    19  67.1
  420     1400    19  60.1    1200    18  69.3    1400    18  65.0    1400    19  67.7
  450     1400    19  60.7    1400    18  70.4    1400    18  65.0    1400    19  67.7
  480     1400    19  61.2    1400    19  69.8    1400    19  65.5    1400    19  67.7
  510     1400    19  60.7    1400    19  69.8    1400    19  65.5    1400    18  68.8
  540     1400    19  61.2    1400    19  69.8    1400    19  66.6    1400    19  68.8
  570     1400    19  61.2    1400    19  69.8    1400    19  66.6    1400    19  69.3
  600     1400    19  62.3    1400    18  69.8    1400    18  66.6    1400    19  68.8
  630     1400    19  62.3    1200    18  70.4    1400    18  67.1    1400    18  68.8
  660     1400    19  63.4    1400    18  70.4    1400    18  67.7    1400    19  69.3
  690     1400    19  64.5    1399    18  69.8    1400    18  68.2    1400    19  69.3
  720     1400    19  64.5    1400    18  69.8    1400    18  67.7    1400    19  69.3
  750     1400    19  65.0    1400    18  70.4    1400    18  67.7    1400    19  69.8
  780     1400    19  65.5    1400    18  69.8    1400    18  67.7    1400    19  69.3
  810     1400    19  65.5    1400    18  70.4    1400    18  68.8    1200    19  69.8
  840     1400    19  66.6    1400    18  69.8    1400    18  68.8    1400    18  69.8
  870     1400    19  66.1    1400    18  69.8    1400    18  69.3    1400    19  69.8
  900     1400    19  67.1    1400    17  70.4    1400    17  69.3    1400    19  69.8

 Average  1400  18.9  59.6    1387  18.4  68.8    1400  18.4  64.4    1393  18.8  66.8
 %Av/Max   100    95            99    92           100    92            99    94

 Performance    MB/sec              MB/sec              MFLOPS              MFLOPS
 Average Was      3025                2954                4162                4096
 Max              3168                3168                4360                4360
 %                95.5                93.3                95.5                93.9

 Max is typical average performance testing a single core
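The percentage rows in these tables, and the "faster than the plastic case" comparisons quoted in the text, look like simple ratios. The sketch below shows the assumed arithmetic; both function names are hypothetical and the formulas are my reading of the tables, not taken from the benchmark programs.

```c
#include <assert.h>

/* Average as a percentage of the single core maximum (the "%" rows). */
double percent_of_max(double average, double maximum)
{
    assert(maximum > 0.0);
    return 100.0 * average / maximum;
}

/* How much faster, in percent, one result is than another. */
double percent_faster(double new_value, double old_value)
{
    assert(old_value > 0.0);
    return 100.0 * (new_value / old_value - 1.0);
}
```

For example, `percent_of_max(3025, 3168)` gives about 95.5, matching the first FLIRC column. `percent_faster(3025, 2376)` gives about 27 for the integer test against the plastic case average, and `percent_faster(4162, 2880)` about 44 for the floating point test.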
As shown in the code below, I have disassembled my MemSpeed type benchmarks. There are calculations using intrinsic functions and normal four way unrolled C code. Each of the four Vector Multiply Accumulate intrinsic statements should lead to execution of four multiplies and four adds (16 multiply-accumulate operations in total, or 32 floating point operations). The C code loop has four multiplies and four adds, but the compilers might be expected to unroll this further, where appropriate (they didn’t - is there a parameter to force this?). This led to the fastest speeds being produced by intrinsics, using the assembly code instructions shown below.
NeonSpeed - At 32 bits, the Vector Multiply Accumulate intrinsics were directly converted to NEON vmla.f32 instructions using quad word registers. The 64 bit compiler converted the intrinsics to the A64 instruction “Floating-point fused multiply-add to accumulator” (fmla), using 128 bit vector registers. Next are instructions generated for normal C code, using the neon and funsafe compiler directives at 32 bits and standard parameters at 64 bits, acting on single precision calculations. At 32 bits, a single SIMD NEON instruction is used - vfma.f32 (Vector Fused Multiply Accumulate) with four calculations. At 64 bits, fmla is generated again. The difference in speed is apparent from using a single SIMD instruction in the loop, compared with four with intrinsics.
MemSpeed - Single Precision vs Double Precision - For the 32 bit version, using the NEON compiling parameter shown, NEON instructions were not generated; four scalar floating-point multiply-accumulate instructions (fmacs or fmacd) were produced instead, giving the slowest speeds. Adding the funsafe parameter produced the same vfma.f32 NEON instruction as NeonSpeed for four single precision calculations, but four vfma.f64 instructions were generated for double precision. Yes, these are NEON instructions, but SISD (Single Instruction Single Data), each with data in 64 bit scalar registers.
64 Bit MemSpeed - For the four sets of calculations, the fmla vector instructions were again produced, requiring two for double precision and speed closer to that from single precision calculations.
Program Code

 NEON Intrinsics                      MemSpeed and NEONSpeed C Code for Compilation
 {                                    Single and Double Precision
   x41 = vld1q_f32(ptrx1);
   x42 = vld1q_f32(ptrx2);            for (m=0; m<kd; m=m+inc)
   x43 = vld1q_f32(ptrx3);            {
   x44 = vld1q_f32(ptrx4);               xn[m]   = xn[m]   + sumn * yn[m];
   y41 = vld1q_f32(ptry1);               xn[m+1] = xn[m+1] + sumn * yn[m+1];
   y42 = vld1q_f32(ptry2);               xn[m+2] = xn[m+2] + sumn * yn[m+2];
   y43 = vld1q_f32(ptry3);               xn[m+3] = xn[m+3] + sumn * yn[m+3];
   y44 = vld1q_f32(ptry4);            }
   z41 = vmlaq_f32(x41, y41, c4);
   z42 = vmlaq_f32(x42, y42, c4);
   z43 = vmlaq_f32(x43, y43, c4);
   z44 = vmlaq_f32(x44, y44, c4);
   vst1q_f32(ptrx1, z41);
   vst1q_f32(ptrx2, z42);
   vst1q_f32(ptrx3, z43);
   vst1q_f32(ptrx4, z44);
   ptrx1 = ptrx1 + 16;  ptry1 = ptry1 + 16;
   ptrx2 = ptrx2 + 16;  ptry2 = ptry2 + 16;
   ptrx3 = ptrx3 + 16;  ptry3 = ptry3 + 16;
   ptrx4 = ptrx4 + 16;  ptry4 = ptry4 + 16;
 }

 ######################################################################

 NEON Speed Intrinsics

 32 Bit                               64 Bit
 1173 MFLOPS                          1277 MFLOPS

 .L75:                                .L13:
   add      r0, r3, #48                 ldr   q4, [x3, -16]
   add      ip, r3, #32                 add   x3, x3, 64
   add      lr, r3, #16                 ldr   q3, [x3, -64]
   add      r10, r2, #48                add   x1, x1, 64
   add      r7, r2, #32                 ldr   q2, [x3, -48]
   add      r4, r2, #16                 ldr   q1, [x3, -32]
   vld1.32  {d24-d25}, [r3]             cmp   x3, x2
   vld1.32  {d18-d19}, [r0]             ldr   q16, [x1, -64]
   vld1.32  {d20-d21}, [ip]             ldr   q7, [x1, -48]
   vld1.32  {d22-d23}, [lr]             ldr   q6, [x1, -32]
   vld1.32  {d26-d27}, [r2]             ldr   q5, [x1, -16]
   vld1.32  {d6-d7}, [r10]              fmla  v4.4s, v0.4s, v16.4s
   vld1.32  {d30-d31}, [r7]             fmla  v3.4s, v0.4s, v7.4s
   vld1.32  {d28-d29}, [r4]             fmla  v2.4s, v0.4s, v6.4s
   vmla.f32 q9, q3, q8                  fmla  v1.4s, v0.4s, v5.4s
   vmla.f32 q10, q15, q8                str   q4, [x3, -80]
   vmla.f32 q11, q14, q8                str   q3, [x3, -64]
   vmla.f32 q12, q13, q8                str   q2, [x3, -48]
   add      r1, r1, #1                  str   q1, [x3, -32]
   add      r2, r2, #64                 bne   .L13
   cmp      r1, r5
   vst1.32  {d24-d25}, [r3]
   vst1.32  {d22-d23}, [lr]
   add      r3, r3, #64
   vst1.32  {d20-d21}, [ip]
   vst1.32  {d18-d19}, [r0]
   bne      .L75

 ######################################################################

 NEON Speed Normal

 -mfpu=neon-vfpv4                     -march=armv8-a
 -funsafe-math-optimizations
 797 MFLOPS                           681 MFLOPS
 32 Bit                               64 bit

 .L54:                                .L37:
   vld1.32  {q9}, [r2]                  ldr   q0, [x0, x26]
   vld1.32  {q8}, [r3]                  add   w1, w1, 1
   add      r1, r1, #1                  ldr   q1, [x0, x28]
   add      r2, r2, #16                 cmp   w24, w1
   cmp      r1, r4                      fmla  v0.4s, v1.4s, v2.4s
   vfma.f32 q8, q9, q7                  str   q0, [x0, x26]
   vst1.32  {q8}, [r3]                  add   x0, x0, 16
   add      r3, r3, #16                 bhi   .L37
   bcc      .L54

 ######################################################################

 MemSpeed 32 Bit Single and Double Precision
 Parameters -mfpu=neon-vfpv4

 Single Precision                     Double Precision
 MFLOPS 532                           MFLOPS 238

 .L45:                                .L31:
   mov    ip, r2                        fldd   d5, [r2, #-24]
   flds   s15, [r3]                     fldd   d6, [r3, #-24]
   flds   s11, [ip]                     fldd   d7, [r3, #-16]
   flds   s12, [r3, #-12]               fldd   d4, [r3, #-8]
   flds   s13, [r3, #-8]                mov    r6, r2
   flds   s14, [r3, #-4]                fmacd  d6, d5, d8
   flds   s8, [r2, #-12]                fldd   d3, [r3]
   flds   s9, [r2, #-8]                 add    r2, r2, #32
   flds   s10, [r2, #-4]                fstd   d6, [r3, #-24]
   fmacs  s15, s11, s30                 fldd   d6, [r2, #-48]
   fmacs  s12, s8, s30                  fmacd  d7, d6, d8
   fmacs  s13, s9, s30                  fstd   d7, [r3, #-16]
   fmacs  s14, s10, s30                 fldd   d7, [r2, #-40]
   add    r2, r2, #16                   fmacd  d4, d7, d8
   fmrs   ip, s15                       fstd   d4, [r3, #-8]
   fsts   s12, [r3, #-12]               fldd   d7, [r6]
   fsts   s13, [r3, #-8]                fmacd  d3, d7, d8
   fsts   s14, [r3, #-4]                fmrrd  r8, r9, d3
   str    ip, [r3], #16 @ float         strd   r8, [r3], #32
   cmp    r3, r6                        cmp    r3, r1
   bne    .L45                          bne    .L31

 ######################################################################

 More MemSpeed 32 Bit Single and Double Precision
 Parameters -mfpu=neon-vfpv4 -funsafe-math-optimizations

 Single Precision                     Double Precision
 MFLOPS 695                           MFLOPS 236

 .L44:                                .L28:
   vld1.64  {d16-d17}, [r3:64]          fldd     d17, [r2, #-24]
   vld1.64  {d18-d19}, [r1:64]          fldd     d16, [r3, #-24]
   add      r2, r2, #1                  fldd     d18, [r3, #-16]
   add      r1, r1, #16                 vfma.f64 d16, d17, d8
   cmp      r4, r2                      mov      r4, r2
   add      r3, r3, #16                 fldd     d17, [r3, #-8]
   vfma.f32 q8, q9, q7                  add      r2, r2, #32
   vstr     d16, [r3, #-16]             fcpyd    d19, d16
   vstr     d17, [r3, #-8]              fldd     d16, [r3]
   bhi      .L44                        fstd     d19, [r3, #-24]
                                        fldd     d19, [r2, #-48]
                                        vfma.f64 d18, d19, d8
                                        fstd     d18, [r3, #-16]
                                        fldd     d18, [r2, #-40]
                                        vfma.f64 d17, d18, d8
                                        fstd     d17, [r3, #-8]
                                        fldd     d17, [r4]
                                        vfma.f64 d16, d17, d8
                                        fmrrd    r4, r5, d16
                                        strd     r4, [r3], #32
                                        cmp      r3, r1
                                        bne      .L28
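For readers who want to experiment with the C loop listed above, the sketch below restates it as two plain C functions: a simple loop and the four way unrolled form. It is a minimal illustration, not the MemSpeed source. The function names are hypothetical, `inc` is assumed to be 4, and the word count is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Simple triad loop: one multiply and one add per word. */
void triad_simple(float *xn, const float *yn, float sumn, size_t kd)
{
    for (size_t m = 0; m < kd; m++)
        xn[m] = xn[m] + sumn * yn[m];
}

/* Four way unrolled form, as in the listing (inc assumed to be 4).
   kd must be a multiple of 4 in this sketch. */
void triad_unrolled4(float *xn, const float *yn, float sumn, size_t kd)
{
    assert(kd % 4 == 0);
    for (size_t m = 0; m < kd; m += 4) {
        xn[m]     = xn[m]     + sumn * yn[m];
        xn[m + 1] = xn[m + 1] + sumn * yn[m + 1];
        xn[m + 2] = xn[m + 2] + sumn * yn[m + 2];
        xn[m + 3] = xn[m + 3] + sumn * yn[m + 3];
    }
}
```

Both functions should produce the same results, so either can be compiled with different options and the generated assembly compared. On the question of forcing further unrolling, GCC documents a -funroll-loops option; whether it changes the code generated for this particular loop was not tested here.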
In the original single precision benchmarks, the NEON version produced significantly faster performance, where the compiler converted the 32 intrinsic calculating functions into 22 instructions, with those fused operations, and a total in-loop count of 27. Performance of the first 64 bit version was degraded through making use of only 12 vector registers, for a programming function involving 23 variables, necessitating frequent load instructions. The gcc 7 compiler made use of 25 vector registers, with out of loop loads, to achieve similar performance to the hand coded NEON benchmark. Both of the 64 bit double precision benchmarks included the more efficient code, with external data loading, but best speed was, as expected, half that for single precision SIMD calculations.
Function triadplus2

 for(i=0; i<n; i++) x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f-(x[i]+g)*h+(x[i]+j)*k
                          -(x[i]+l)*m+(x[i]+o)*p-(x[i]+q)*r+(x[i]+s)*t-(x[i]+u)*v+(x[i]+w)*y;

 ######################################################################

 gcc 4.9 32 bit

 SP MFLOPS                            DP MFLOPS
 797 1T  3134 4T                      798 1T  3119 4T

 .L21:                                .L21:
   flds     s23, [r3]                   fldd     d17, [r1]
   fadds    s15, s8, s23                faddd    d16, d17, d2
   fadds    s24, s10, s23               faddd    d18, d17, d0
   fadds    s31, s6, s23                faddd    d25, d17, d4
   fadds    s30, s4, s23                faddd    d24, d17, d6
   fnmuls   s15, s15, s7                fnmuld   d16, d3, d16
   fadds    s29, s3, s23                faddd    d23, d17, d15
   fadds    s28, s1, s23                faddd    d22, d17, d13
   fadds    s27, s0, s23                faddd    d21, d17, d11
   vfma.f32 s15, s9, s24                faddd    d20, d17, d9
   fadds    s26, s17, s23               faddd    d19, d17, d31
   fadds    s25, s18, s23               vfma.f64 d16, d18, d1
   fadds    s24, s20, s23               faddd    d18, d17, d29
   fadds    s23, s21, s23               faddd    d17, d17, d27
   vfma.f32 s15, s5, s31                vfma.f64 d16, d25, d5
   vfma.f32 s15, s14, s30               vfms.f64 d16, d24, d7
   vfma.f32 s15, s2, s29                vfma.f64 d16, d23, d14
   vfma.f32 s15, s13, s28               vfms.f64 d16, d22, d12
   vfma.f32 s15, s16, s27               vfma.f64 d16, d21, d10
   vfma.f32 s15, s12, s26               vfms.f64 d16, d20, d8
   vfma.f32 s15, s19, s25               vfma.f64 d16, d19, d30
   vfma.f32 s15, s11, s24               vfms.f64 d16, d18, d26
   vfma.f32 s15, s22, s23               vfma.f64 d16, d17, d28
   fstmias  r3!, {s15}                  fstmiad  r1!, {d16}
   cmp      r3, r2                      cmp      r1, r0
   bne      .L9                         bne      .L21
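The triadplus2 expression above has eleven (x[i] + constant) * constant terms, with alternating signs between them. That gives 11 adds, 11 multiplies and 10 adds or subtracts, or 32 floating point operations per word. The sketch below is one way to write it as a self-contained C function. Packing the 22 constants into two arrays is my own arrangement for brevity, and the array names `addc` and `mulc` are hypothetical; the original uses separate scalar variables.

```c
#include <assert.h>
#include <stddef.h>

#define TRIAD_TERMS 11

/* addc[] holds a, c, e, g, j, l, o, q, s, u, w and
   mulc[] holds b, d, f, h, k, m, p, r, t, v, y from the expression. */
void triadplus2(float *x, size_t n,
                const float addc[TRIAD_TERMS],
                const float mulc[TRIAD_TERMS])
{
    assert(x != NULL && addc != NULL && mulc != NULL);
    for (size_t i = 0; i < n; i++) {
        float xi = x[i];
        float sum = (xi + addc[0]) * mulc[0];
        for (int t = 1; t < TRIAD_TERMS; t++) {
            float term = (xi + addc[t]) * mulc[t];
            /* odd numbered terms are subtracted, even ones added */
            sum = (t % 2) ? sum - term : sum + term;
        }
        x[i] = sum;
    }
}
```

With all add constants at 0 and all multipliers at 1, the six added terms and five subtracted terms leave each x[i] unchanged, which makes a convenient sanity check. A compiler may not generate the same instruction sequence for this looped form as for the fully written out expression in the listing.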
The benchmarks obtain information on CPU hardware characteristics and version of Linux from files /proc/cpuinfo and /proc/version. Below are details provided from 32 bit Raspbian and 64 bit Gentoo; they are the same for both the Raspberry Pi 3B and 3B+. Raspbian BogoMIPS appears to depend on the CPU MHz frequency governor: with the default “On-Demand” setting, 38.4 was indicated, but 89.6 with the “Performance” option, or 76.8 using an earlier version of Raspbian.
                   32 bit Raspbian                      64 Bit Gentoo

 processor         0 to 3                               0 to 3
 model name        ARMv7 Processor rev 4 (v7l)
 BogoMIPS          38.40 or 89.6                        38.40
 Features          half thumb fastmult vfp edsp neon    fp asimd evtstrm crc32 cpuid
                   vfpv3 tls vfpv4 idiva idivt vfpd32
                   lpae evtstrm crc32
 CPU implementer   0x41                                 0x41
 CPU architecture: 7                                    8
 CPU variant       0x0                                  0x0
 CPU part          0xd03                                0xd03
 CPU revision      4                                    4

 Linux version, 32 bit Raspbian
   4.14.34-v7+ (dc4@dc4-XPS13-9333)
   (gcc version 4.9.3 (crosstool-NG crosstool-ng-1.22.0-88-g8460611))
   #1110 SMP Mon Apr 16 15:18:51 BST 2018

 Linux version, 64 Bit Gentoo
   4.14.31-v8-b36f4e9e1984+ (sakaki@chiyo)
   (gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3))
   #1 SMP PREEMPT Sun Apr 1 14:15:34 BST 2018
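A program can pick up these details by scanning /proc/cpuinfo for "key : value" lines. The sketch below is one possible approach, not the code used in the benchmarks; the function names `cpuinfo_field` and `read_cpuinfo` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* If line is "key<spaces or tabs>: value", copy the value (without
   the trailing newline) into out and return 1, otherwise return 0. */
int cpuinfo_field(const char *line, const char *key,
                  char *out, size_t out_size)
{
    size_t key_len = strlen(key);
    const char *p;
    size_t len;

    assert(out != NULL && out_size > 0);
    if (strncmp(line, key, key_len) != 0)
        return 0;
    p = line + key_len;
    while (*p == ' ' || *p == '\t')     /* padding before the colon */
        p++;
    if (*p != ':')
        return 0;                       /* key was only a prefix */
    p++;
    while (*p == ' ' || *p == '\t')     /* padding after the colon */
        p++;
    len = strcspn(p, "\r\n");
    if (len >= out_size)
        len = out_size - 1;             /* truncate to fit */
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}

/* Return 1 and fill out with the first matching /proc/cpuinfo entry. */
int read_cpuinfo(const char *key, char *out, size_t out_size)
{
    char line[512];
    int found = 0;
    FILE *fp = fopen("/proc/cpuinfo", "r");

    if (fp == NULL)
        return 0;
    while (!found && fgets(line, sizeof line, fp) != NULL)
        found = cpuinfo_field(line, key, out, out_size);
    fclose(fp);
    return found;
}
```

On the systems described here, `read_cpuinfo("CPU part", value, sizeof value)` would be expected to return 0xd03. Field names differ between kernels, as the two columns above show, so a missing key should be treated as normal.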