www.riscos.com - Acorn Computer Archives

Introduction to the StrongARM Revision 3, 04-Oct-96

Performance

Performance issues

The StrongARM has significantly different performance characteristics to older ARM processors. It is clocked 5 times faster than any previous ARM, and many instructions execute in fewer cycles. In particular:

B/BL take 2 cycles, rather than 3
MOV PC,Rn and ADD PC,PC,Rn,LSL #2 etc take 2 cycles rather than 3
LDR takes 2 cycles (from the cache) rather than 3, and will take only 1 cycle if the result is not used in the next instruction.
STR takes 1 cycle rather than 2, if the write buffer isn't full
MUL/MLA take 1-3 cycles rather than 2-17 cycles.
Many instructions will in fact take only one cycle provided the result is not used in the next instruction.
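
As a rough illustration of the figures above, the following sketch totals best-case cycle counts for a short instruction sequence on an older ARM and on StrongARM. It is only a back-of-envelope model: the table name, the function and the sample loop body are hypothetical, the MUL entry takes the best case of each quoted range, and the one-cycle cost for a simple data-processing instruction is an assumption not stated in the list.

```python
# Best-case cycle counts taken from the list above: (older ARM, StrongARM).
CYCLES = {
    "B":   (3, 2),
    "LDR": (3, 1),   # cached load, result not used by the next instruction
    "STR": (2, 1),   # write buffer not full
    "MUL": (2, 1),   # best case of the quoted 2-17 and 1-3 ranges
    "ALU": (1, 1),   # assumption: simple data-processing op, 1 cycle on both
}

def total_cycles(sequence):
    """Return (older_arm_cycles, strongarm_cycles) for a list of mnemonics."""
    old = sum(CYCLES[op][0] for op in sequence)
    new = sum(CYCLES[op][1] for op in sequence)
    return old, new

# Hypothetical loop body: load, two ALU operations, store, branch back.
loop_body = ["LDR", "ALU", "ALU", "STR", "B"]
old, new = total_cycles(loop_body)
print(f"older ARM: {old} cycles, StrongARM: {new} cycles")
```

Any saving in cycles comes on top of the higher clock rate, and real code will see stalls that this model ignores.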

For fuller information see the StrongARM Technical Reference Manual, available from Digital Semiconductor's WWW site (currently at http://www.digital.com/info/semiconductor/dsc-strongarm.html).

The StrongARM's cache and write buffer are also significantly better than those of previous ARMs, allowing an average fivefold speed increase despite the unaltered system bus. Bulk data transfer is still limited by the system bus, but the write buffer can be used to interleave a large amount of processing with memory accesses. For example, on StrongARM it is quicker to plot a 4bpp sprite to a 32bpp mode than to plot a 32bpp sprite to a 32bpp mode; the latter case is pure data transfer, while the former involves less data transfer, with the processing interleaved (ie effectively free).
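
The sprite example can be made concrete by counting the bytes that have to cross the system bus in each case. The sketch below is a simplified estimate under stated assumptions: the function name is hypothetical, row padding and masks are ignored, and any palette lookup table for the 4bpp case is assumed to be served from the cache rather than the bus.

```python
def bus_bytes(width, height, src_bpp, dst_bpp):
    """Estimate bytes moved over the bus when plotting a sprite.

    Counts one read of every source pixel and one write of every
    destination pixel; nothing else is modelled.
    """
    pixels = width * height
    src = (pixels * src_bpp + 7) // 8   # source data read, rounded up to bytes
    dst = (pixels * dst_bpp + 7) // 8   # screen data written
    return src + dst

# Hypothetical 640x480 sprite plotted to a 32bpp mode.
same_depth = bus_bytes(640, 480, 32, 32)
from_4bpp = bus_bytes(640, 480, 4, 32)
print(same_depth, from_4bpp, from_4bpp / same_depth)
```

On these assumptions the 4bpp source needs a little over half the bus traffic of the 32bpp source, which is consistent with the observation that the extra pixel conversion work is effectively hidden behind the memory accesses.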

The long cache lines of the ARM710 and StrongARM can impact performance. A random read or instruction fetch from a cached area will load 8 words into the cache; this can make traversal of a long linked list inefficient, since each node visited may pull in a whole line of which only a word or two is used. It is also often worth aligning code to an 8-word boundary. In current versions of RISC OS, modules are loaded at an address 16*n+4. Future versions of RISC OS will probably load modules at an address 32*n+4, so it is worth aligning your service call entries appropriately in preparation for this change.
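
Because modules are loaded 4 bytes past a 16-byte (or, in future, 32-byte) boundary, aligning an entry point within the module means allowing for that offset. A minimal sketch of the arithmetic follows; the function and parameter names are hypothetical, and the example offset is arbitrary.

```python
def padding_to_line(module_offset, load_residue=4, line_bytes=32):
    """Bytes of padding needed before the code at module_offset so that it
    starts on a cache-line boundary, for a module loaded at an address of
    the form line_bytes*n + load_residue."""
    return -(load_residue + module_offset) % line_bytes

# Hypothetical service call entry currently at offset &40 in the module.
offset = 0x40
print(padding_to_line(offset, line_bytes=32))  # load address 32*n+4
print(padding_to_line(offset, line_bytes=16))  # load address 16*n+4
```

Put another way, with a load address of 32*n+4 an entry is line-aligned when its offset within the module is 28 modulo 32, and with 16*n+4 when its offset is 12 modulo 16.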

Two significant disadvantages of StrongARM compared with previous processors are:

Burst reads are not performed from uncached areas. In particular this means that reads from the screen are slower on the StrongARM than on previous ARMs. A future version of RISC OS may address this by marking the screen cacheable before reading (eg in a block copy operation). Also, burst writes are not performed to unbuffered areas.
Code modification is expensive. You can modify code, but a complete SynchroniseCodeAreas can take of the order of half a millisecond (ie 100000 processor cycles) to execute, and will flush the entire instruction cache. Thus use of self-modifying code is strongly deprecated; a static alternative will almost always be faster. Synchronisation of a single word (eg modifying a hardware vector) is cheaper (of the order of 100 processor cycles) but still requires the whole instruction cache to be flushed.
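
To see why a static alternative will almost always win, it helps to work out how many times modified code must run before the synchronisation cost is repaid. The sketch below uses the order-of-magnitude figures quoted above; the function name and the per-iteration savings are hypothetical, and the cost of refilling the flushed instruction cache is not modelled, so the real break-even point will be higher.

```python
FULL_SYNC_CYCLES = 100_000    # complete SynchroniseCodeAreas, order of magnitude
SINGLE_WORD_SYNC_CYCLES = 100 # single-word synchronisation, order of magnitude

def break_even_iterations(cycles_saved_per_iteration,
                          sync_cycles=FULL_SYNC_CYCLES):
    """Iterations needed before self-modified code repays its sync cost."""
    if cycles_saved_per_iteration <= 0:
        raise ValueError("self-modification must save at least one cycle")
    # Ceiling division using integers only.
    return -(-sync_cycles // cycles_saved_per_iteration)

# Hypothetical inner loop that is 2 cycles shorter once patched.
print(break_even_iterations(2))
# The same saving after a single-word synchronisation.
print(break_even_iterations(2, SINGLE_WORD_SYNC_CYCLES))
```

A loop that saves only a couple of cycles per pass therefore has to run tens of thousands of times after each full synchronisation before the modification pays for itself.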

Note that future processors will no doubt have different performance characteristics again; you shouldn't optimise your code too much for one particular architecture at the expense of others. However, you should now have a clearer idea of how to get better performance from your StrongARM.
