www.riscos.com Technical Support
The following example assembly language fragments show ways in which the basic ARM instructions can combine to give efficient code. None of the techniques illustrated saves a great deal of execution time (although they all save some); mostly they just save code.
Note that, when optimising code for execution speed, consideration should be given to the different hardware it may run on. Some changes which optimise speed on one machine may slow the code on another. An example is unrolling loops (eg divide loops), which speeds execution on an ARM2 but can slow execution on an ARM3, which has a cache.
        CMP     Rn,#p               ; IF Rn=p OR Rm=q THEN GOTO Label
        BEQ     Label
        CMP     Rm,#q
        BEQ     Label
can be replaced by:
        CMP     Rn,#p
        CMPNE   Rm,#q               ; If condition not satisfied try
        BEQ     Label               ; another test.
        TEQ     Rn,#0               ; Test sign
        RSBMI   Rn,Rn,#0            ; and 2's complement if necessary.
        TEQ     Rc,#127             ; discrete test
        CMPNE   Rc,#" "-1           ; range test
        MOVLS   Rc,#"."             ; IF Rc<" " OR Rc=CHR$127 THEN Rc:="."
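The flag logic in this fragment is easy to misread, so here is a small Python model of it. This is a sketch only: the function name `sanitise` is invented for illustration and does not appear in the original code. TEQ sets Z when Rc is 127, in which case CMPNE is skipped and the LS condition holds because Z is set; otherwise CMP performs the unsigned range test.

```python
def sanitise(rc):
    """Model of TEQ / CMPNE / MOVLS: map control codes and 127 to '.'."""
    if rc == 127:
        # TEQ set Z, CMPNE was skipped, and LS (C clear OR Z set) is true.
        ls = True
    else:
        # CMP Rc,#" "-1 : LS means Rc is unsigned lower than or same as 31.
        ls = rc <= ord(" ") - 1
    return ord(".") if ls else rc   # MOVLS Rc,#"."
```

Printable characters such as `ord("A")` pass through unchanged, while 0 to 31 and 127 all become `ord(".")`.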
; Enter with dividend in Ra, divisor in Rb.
; Divisor must not be zero.
        MOV     Rd,Rb               ; Put the divisor in Rd.
        CMP     Rd,Ra,LSR #1        ; Then double it until
Div1    MOVLS   Rd,Rd,LSL #1        ; 2 * Rd > dividend.
        CMP     Rd,Ra,LSR #1
        BLS     Div1
        MOV     Rc,#0               ; Initialise the quotient
Div2    CMP     Ra,Rd               ; Can we subtract Rd?
        SUBCS   Ra,Ra,Rd            ; If we can, do so
        ADC     Rc,Rc,Rc            ; Double quotient and add new bit
        MOV     Rd,Rd,LSR #1        ; Halve Rd.
        CMP     Rd,Rb               ; And loop until we've gone
        BHS     Div2                ; past the original divisor.
; Now Ra holds remainder, Rb holds original divisor,
; Rc holds quotient and Rd holds junk.
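To make the two loops easier to follow, the routine can be modelled in Python. This is a minimal sketch, assuming unsigned 32 bit operands; the function name `divide` and the variables named after the registers are illustrative only.

```python
MASK32 = 0xFFFFFFFF

def divide(dividend, divisor):
    """Model of the shift-and-subtract routine: returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must not be zero")
    ra, rb = dividend & MASK32, divisor & MASK32
    rd = rb
    # Div1 loop: double Rd while Rd <= Ra >> 1, so that 2 * Rd > dividend.
    while rd <= (ra >> 1):
        rd = (rd << 1) & MASK32
    rc = 0
    # Div2 loop: the body always runs at least once (test is at the bottom).
    while True:
        carry = 1 if ra >= rd else 0        # CMP Ra,Rd sets C when Ra >= Rd
        if carry:
            ra -= rd                        # SUBCS Ra,Ra,Rd
        rc = ((rc << 1) | carry) & MASK32   # ADC Rc,Rc,Rc
        rd >>= 1                            # MOV Rd,Rd,LSR #1
        if rd < rb:                         # BHS Div2 loops while Rd >= Rb
            break
    return rc, ra
```

For instance, `divide(100, 7)` goes through four passes of the second loop and returns `(14, 2)`.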
It is often necessary to generate (pseudo-) random numbers, and the most efficient algorithms are based on shift generators with feedback, rather like a cyclic redundancy check generator. Unfortunately, the sequence of a 32 bit generator needs more than one feedback tap to be maximal length (that is, 2^32-1 cycles before repetition). A 33 bit shift generator with taps at bits 20 and 33 is therefore used instead.
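The basic algorithm is:
        newbit := bit 33 EOR bit 20
        shift the 33 bit number left
        put newbit in at the bottom
This is done for all the new bits needed, that is, 32 of them.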
All this can be done in five S cycles:
; Enter with seed in Ra (32 bits),Rb (1 bit in Rb lsb)
; Uses Rc
        TST     Rb,Rb,LSR #1        ; top bit into carry
        MOVS    Rc,Ra,RRX           ; 33 bit rotate right
        ADC     Rb,Rb,Rb            ; carry into lsb of Rb
        EOR     Rc,Rc,Ra,LSL#12     ; (involved!)
        EOR     Ra,Rc,Rc,LSR#20     ; (similarly involved!)
; New seed in Ra, Rb as before
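Because the five instructions lean heavily on the carry flag, a register-level model may help when checking a port of this routine. The following Python sketch mirrors each instruction in turn; the name `prbs_step` is hypothetical, and the model assumes the usual ARM behaviour that a shifted operand in TST and MOVS sets C from the last bit shifted out.

```python
MASK32 = 0xFFFFFFFF

def prbs_step(ra, rb):
    """One call of the five-instruction sequence: returns the new (Ra, Rb)."""
    carry = rb & 1                            # TST Rb,Rb,LSR #1 : C := Rb bit 0
    rc = ((carry << 31) | (ra >> 1)) & MASK32 # MOVS Rc,Ra,RRX
    carry = ra & 1                            #   ...C := old Ra bit 0
    rb = (rb + rb + carry) & MASK32           # ADC Rb,Rb,Rb
    rc ^= (ra << 12) & MASK32                 # EOR Rc,Rc,Ra,LSL #12
    ra = rc ^ (rc >> 20)                      # EOR Ra,Rc,Rc,LSR #20
    return ra, rb
```

Starting from `ra = 1, rb = 0`, one step gives `(0x1000, 1)`. Only the bottom bit of the returned Rb is significant, as in the assembly version.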
        MOV     Ra,Ra,LSL #n        ; Multiply by 2^n
        ADD     Ra,Ra,Ra,LSL #n     ; Multiply by 2^n+1
        RSB     Ra,Ra,Ra,LSL #n     ; Multiply by 2^n-1
        ADD     Ra,Ra,Ra,LSL #1     ; Multiply by 3
        MOV     Ra,Ra,LSL #1        ; and then by 2.
        ADD     Ra,Ra,Ra,LSL #2     ; Multiply by 5
        ADD     Ra,Rc,Ra,LSL #1     ; Multiply by 2 and add in next digit
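The last pair of instructions is the inner step of a decimal-to-binary conversion. As an illustration, here is a Python sketch of that use; `add_digit` and `parse_decimal` are invented names, and the digit is assumed to arrive in Rc already converted from its character code.

```python
MASK32 = 0xFFFFFFFF

def add_digit(ra, rc):
    """Return Ra*10 + Rc using the same two shift-and-add steps."""
    ra = (ra + (ra << 2)) & MASK32      # ADD Ra,Ra,Ra,LSL #2 : Ra := Ra*5
    return (rc + (ra << 1)) & MASK32    # ADD Ra,Rc,Ra,LSL #1 : Ra := Rc + Ra*2

def parse_decimal(text):
    """Accumulate a string of decimal digits, most significant first."""
    value = 0
    for ch in text:
        value = add_digit(value, ord(ch) - ord("0"))
    return value
```

Calling `parse_decimal("1234")` applies the step four times and yields 1234.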
If C even, say C = 2^n×D, D odd:
        D=1 :   MOV     Rb,Ra,LSL #n
        D<>1:   {Rb := Ra*D}
                MOV     Rb,Rb,LSL #n
If C MOD 4 = 1, say C = 2^n×D+1, D odd, n>1:
        D=1 :   ADD     Rb,Ra,Ra,LSL #n
        D<>1:   {Rb := Ra*D}
                ADD     Rb,Ra,Rb,LSL #n
If C MOD 4 = 3, say C = 2^n×D-1, D odd, n>1:
        D=1 :   RSB     Rb,Ra,Ra,LSL #n
        D<>1:   {Rb := Ra*D}
                RSB     Rb,Ra,Rb,LSL #n
This is not quite optimal, but close. An example of its non-optimal use is multiplication by 45, which the method does by:
        RSB     Rb,Ra,Ra,LSL #2     ; Multiply by 3
        RSB     Rb,Ra,Rb,LSL #2     ; Multiply by 4*3-1 = 11
        ADD     Rb,Ra,Rb,LSL #2     ; Multiply by 4*11+1 = 45
rather than by:
        ADD     Rb,Ra,Ra,LSL #3     ; Multiply by 9
        ADD     Rb,Rb,Rb,LSL #2     ; Multiply by 5*9 = 45
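The recursive method can be expressed as a short planner that emits the instruction sequence for a given constant. The Python sketch below is one possible reading of the three rules, restricted to constants of 2 or more; the names `multiply_steps`, `to_asm` and `run` are hypothetical helpers, and `run` is only a 32 bit model used to check the plan.

```python
MASK32 = 0xFFFFFFFF

def multiply_steps(c):
    """Plan Rb := Ra*c as a list of (op, shifted_register, n) tuples."""
    if c < 2:
        raise ValueError("this sketch handles constants of 2 or more")
    if c % 2 == 0:
        op, m = "MOV", c          # C = 2^n * D
    elif c % 4 == 1:
        op, m = "ADD", c - 1      # C = 2^n * D + 1
    else:
        op, m = "RSB", c + 1      # C = 2^n * D - 1
    n = (m & -m).bit_length() - 1 # number of trailing zero bits in m
    d = m >> n                    # the odd factor D
    if d == 1:
        return [(op, "Ra", n)]
    return multiply_steps(d) + [(op, "Rb", n)]

def to_asm(step):
    """Format one planned step in the notation used in the text."""
    op, reg, n = step
    if op == "MOV":
        return f"MOV Rb,{reg},LSL #{n}"
    return f"{op} Rb,Ra,{reg},LSL #{n}"

def run(steps, ra):
    """Evaluate a plan on a 32 bit value of Ra and return Rb."""
    rb = 0
    for op, reg, n in steps:
        shifted = ((ra if reg == "Ra" else rb) << n) & MASK32
        if op == "MOV":
            rb = shifted
        elif op == "ADD":
            rb = (ra + shifted) & MASK32
        else:                     # RSB Rb,Ra,X computes X - Ra
            rb = (shifted - ra) & MASK32
    return rb
```

For C = 45 the planner reproduces the three-instruction sequence quoted above, not the shorter two-instruction one, which is the non-optimal case the text describes.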
There is no instruction to load a word from an unknown alignment. To do this requires some code (which can be a macro) along the following lines:
; Enter with 32-bit address in Ra
; Uses Rb, Rc; result in Rd
; Note d must be less than c (LDMIA loads the lower-numbered
; register from the lower address)
        BIC     Rb,Ra,#3            ; Get word-aligned address
        LDMIA   Rb,{Rd,Rc}          ; Get 64 bits containing answer
        AND     Rb,Ra,#3            ; Correction factor in bytes
        MOVS    Rb,Rb,LSL #3        ; ...now in bits and test if aligned
        MOVNE   Rd,Rd,LSR Rb        ; If not aligned, produce bottom
                                    ; of result word
        RSBNE   Rb,Rb,#32           ; Get other shift amount
        ORRNE   Rd,Rd,Rc,LSL Rb     ; Combine two halves to get result
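The same sequence can be modelled in Python to check the shift amounts. This is a sketch under stated assumptions: memory is a little-endian byte string, the function name `load_unaligned` is invented, and (like the LDMIA) it always reads the two words starting at the aligned address.

```python
import struct

MASK32 = 0xFFFFFFFF

def load_unaligned(memory, address):
    """Return the 32 bit little-endian word at any byte address."""
    base = address & ~3                               # BIC Rb,Ra,#3
    rd, rc = struct.unpack_from("<II", memory, base)  # LDMIA Rb,{Rd,Rc}
    shift = (address & 3) * 8                         # AND, then MOVS ..LSL #3
    if shift:                                         # the NE-conditional part
        rd = (rd >> shift) | ((rc << (32 - shift)) & MASK32)
    return rd
```

With `memory = bytes(range(16))`, loading from address 1 combines bytes 1 to 4 into `0x04030201`.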
        MOV     Ra,Ra,LSL #16       ; Move to top,
        MOV     Ra,Ra,LSR #16       ; and back to bottom
                                    ; Use ASR to get sign extended version
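The difference between the LSR and ASR variants can be seen in a small Python model. The function names here are illustrative only, and values are treated as unsigned 32 bit register contents.

```python
MASK32 = 0xFFFFFFFF

def zero_extend16(ra):
    """MOV Ra,Ra,LSL #16 followed by MOV Ra,Ra,LSR #16."""
    top = (ra << 16) & MASK32
    return top >> 16

def sign_extend16(ra):
    """MOV Ra,Ra,LSL #16 followed by MOV Ra,Ra,ASR #16."""
    top = (ra << 16) & MASK32
    result = top >> 16
    if top & 0x80000000:          # ASR copies the sign bit into the top half
        result |= 0xFFFF0000
    return result
```

For the input `0x12348001`, the zero-extended result is `0x00008001` and the sign-extended result is `0xFFFF8001`.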
CFLAG   *       &20000000
        BICS    PC,R14,#CFLAG       ; Returns clearing C flag
                                    ; from link register
        ORRCCS  PC,R14,#CFLAG       ; Conditionally returns setting C flag
This code should not be used except in user mode, since in other modes it also restores the interrupt disable flags and processor mode bits to the state which existed when R14 was set up. This rule applies generally to non-user mode programming.
For example, in supervisor mode:
MOV PC,R14
is safer than
MOVS PC,R14
However, note that MOVS PC,R14 is required by the ARM Procedure Call Standard, used by code compiled from the high level language C. Such code, of course, runs in user mode.
The ARM's multiply instruction multiplies two 32 bit numbers together and produces the least significant 32 bits of the result. These 32 bits are the same regardless of whether the numbers are signed or unsigned.
To produce the full 64 bits of a product of two unsigned 32 bit numbers, the following code can be used:
; Enter with two unsigned numbers in Ra and Rb.
        MOVS    Rd,Ra,LSR #16       ; Rd is ms 16 bits of Ra
        BIC     Ra,Ra,Rd,LSL #16    ; Ra is ls 16 bits
        MOV     Re,Rb,LSR #16       ; Re is ms 16 bits of Rb
        BIC     Rb,Rb,Re,LSL #16    ; Rb is ls 16 bits
        MUL     Rc,Ra,Rb            ; Low partial product
        MUL     Rb,Rd,Rb            ; First middle partial product
        MUL     Ra,Re,Ra            ; Second middle partial product
        MULNE   Rd,Re,Rd            ; High partial product - NE
                                    ; condition reduces time taken
                                    ; if Rd is zero
        ADDS    Ra,Ra,Rb            ; Add middle partial products -
                                    ; could not use MLA because we
                                    ; need carry
        ADDCS   Rd,Rd,#&10000       ; Add carry into high partial
                                    ; product
        ADDS    Rc,Rc,Ra,LSL #16    ; Add middle partial product
        ADC     Rd,Rd,Ra,LSR #16    ; sum into low and high words
                                    ; of result
; Now Rc holds the low word of the product, Rd its high word,
; and Ra, Rb and Re hold junk.
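The carry handling in this routine is the part most easily got wrong, so a Python model of the four partial products may be useful for checking. It is a sketch only: `umul64` is a hypothetical name, the variables follow the register names above, and every intermediate is masked to 32 bits to mimic the registers.

```python
MASK32 = 0xFFFFFFFF

def umul64(a, b):
    """Model of the 16 bit partial-product multiply: returns (high, low)."""
    ra, rb = a & MASK32, b & MASK32
    rd = ra >> 16                       # ms 16 bits of Ra
    ra &= 0xFFFF                        # ls 16 bits of Ra
    re = rb >> 16                       # ms 16 bits of Rb
    rb &= 0xFFFF                        # ls 16 bits of Rb
    rc = ra * rb                        # low partial product
    rb = rd * rb                        # first middle partial product
    ra = re * ra                        # second middle partial product
    rd = re * rd                        # high partial product (0 if Rd was 0)
    total = ra + rb                     # ADDS Ra,Ra,Rb
    ra = total & MASK32
    if total > MASK32:                  # ADDCS Rd,Rd,#&10000
        rd = (rd + 0x10000) & MASK32
    total = rc + ((ra << 16) & MASK32)  # ADDS Rc,Rc,Ra,LSL #16
    rc = total & MASK32
    carry = total >> 32
    rd = (rd + (ra >> 16) + carry) & MASK32   # ADC Rd,Rd,Ra,LSR #16
    return rd, rc
```

The carry out of the middle sum is worth 2^32 in units of 2^16, that is 2^48, which is why it is added to the high word as &10000.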
Of course, the ARM7M core provides the Multiply Long class of instructions to perform a 64 bit signed or unsigned multiply or multiply-accumulate (see Multiply Long and Multiply-Accumulate Long).