The Art of
ASSEMBLY LANGUAGE PROGRAMMING

Chapter Twenty Five (Part 1)

Table of Content

CHAPTER TWENTY FIVE:
OPTIMIZING YOUR PROGRAMM (Part 2)
25.5 Improving the Implementation of an Algorithm

One easy way to partially demonstrate how to optimize a piece of code is to provide an example of some program and the optimization steps you can apply to that program. This section will present a short program that blurs an eight-bit gray scale image. Then this section will lead though through several optimization steps and show you how to get that program running over 16 times faster.

The following code assumes that you provide it with a file containing a 251x256 gray scale photographic image. The data structure for this file is as follows:

	Image: array [0..250
0..255] of byte;

Each byte contains a value in the range 0..255 with zero denoting black 255 representing white and the other values representing even shades of gray between these two extremes.

The blurring algorithm averages a pixel with its eight closest neighbors. A single blur operation applies this average to all interior pixels of an image (that is it does not apply to the pixels on the boundary of the image because they do not have the same number of neighbors as the other pixels). The following Pascal program implements the blurring algorithm and lets the user specify the amount of blurring (by looping through the algorithm the number of times the user specifies):

program PhotoFilter(input
output);

(* Here is the raw file data type produced by the Photoshop program *)

type
image = array [0..250] of array [0..255] of byte;

(* The variables we will use. Note that the "datain" and "dataout" *)
(* variables are pointers because Turbo Pascal will not allow us to *)
(* allocate more than 64K data in the one global data segment it *)
(* supports. *)

var
h
i
j
k
l
sum
iterations:integer;
datain
dataout: ^image;
f
g:file of image;

begin

(* Open the files and real the input data *)

assign(f
'roller1.raw');
assign(g
'roller2.raw');
reset(f);
rewrite(g);
new(datain);
new(dataout);
read(f
datain^);

(* Get the number of iterations from the user *)

write('Enter number of iterations:');
readln(iterations);

writeln('Computing result');

(* Copy the data from the input array to the output array. *)
(* This is a really lame way to copy the border from the *)
(* input array to the output array. *)

for i := 0 to 250 do
for j := 0 to 255 do
dataout^ [i][j] := datain^ [i][j];

(* Okay
here's where all the work takes place. The outside *)
(* loop repeats this blurring operation the number of *)
(* iterations specified by the user. *)

for h := 1 to iterations do begin

(* For each row except the first and the last
compute *)
(* a new value for each element. *)

for i := 1 to 249 do

(* For each column except the first and the last
com- *)
(* pute a new value for each element. *)

for j := 1 to 254 do begin

(* For each element in the array
compute a new
blurred value by adding up the eight cells
around an array element along with eight times
the current cell's value. Then divide this by
sixteen to compute a weighted average of the
nine cells forming a square around the current
cell. The current cell has a 50% weighting

the other eight cells around the current cel
provide the other 50% weighting (6.25% each). *)

sum := 0;
for k := -1 to 1 do
for l := -1 to 1 do
sum := sum + datain^ [i+k][j+l];

(* Sum currently contains the sum of the nine     *)
(* cells
add in seven times the current cell so  *)
(* we get a total of eight times the current cell. *)

dataout^ [i][j] := (sum + datain^ [i][j]*7) div 16;

end;

(* Copy the output cell values back to the input cells *)
(* so we can perform the blurring on this new data on *)
(* the next iteration. *)

for i := 0 to 250 do
for j := 0 to 255 do
datain^ [i][j] := dataout^ [i][j];

end;

writeln('Writing result');
write(g
dataout^);
close(f);
close(g);

end.

The Pascal program above compiled with Turbo Pascal v7.0 takes 45 seconds to compute 100 iterations of the blurring algorithm. A comparable program written in C and compiled with Borland C++ v4.02 takes 29 seconds to run. The same source file compiled with Microsoft C++ v8.00 runs in 21 seconds. Obviously the C compilers produce better code than Turbo Pascal. It took about three hours to get the Pascal version running and tested. The C versions took about another hour to code and test. The following two images provide a "before" and "after" example of this program's function:

Before blurring:

After blurring (10 iterations):

The following is a crude translation from Pascal directly into assembly language of the above program. It requires 36 seconds to run. Yes the C compilers did a better job but once you see how bad this code is you'll wonder what it is that Turbo Pascal is doing to run so slow. It took about an hour to translate the Pascal version into this assembly code and debug it to the point it produced the same output as the Pascal version.

; IMGPRCS.ASM
;
; An image processing program.
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16
weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K)
the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
;       Performance comparisons (66 MHz 80486 DX/2 system).
;
;       This code-                              36 seconds.
;       Borland Pascal v7.0-                    45 seconds.
;       Borland C++ v4.02-                      29 seconds.
;       Microsoft C++ v8.00-                    21 seconds.

.xlist
include         stdlib.a
includelib      stdlib.lib
.list
.286

dseg            segment para public 'data'

; Loop control variables and other variables:

h               word    ?
i               word    ?
j               word    ?
k               word    ?
l               word    ?
sum             word    ?
iterations      word    ?

; File names:

InName          byte    "roller1.raw"
0
OutName         byte    "roller2.raw"
0

dseg            ends


; Here is the input data that we operate on.

InSeg           segment para public 'indata'

DataIn          byte    251 dup (256 dup (?))

InSeg           ends


; Here is the output array that holds the result.

OutSeg          segment para public 'outdata'

DataOut         byte    251 dup (256 dup (?))

OutSeg          ends




cseg            segment para public 'code'
assume  cs:cseg
ds:dseg

Main            proc
mov     ax
dseg
mov     ds
ax
meminit

mov     ax
3d00h               ;Open input file for reading.
lea     dx
InName
int     21h
jnc     GoodOpen
print
byte    "Could not open input file."
cr
lf
0
jmp     Quit

GoodOpen:       mov     bx
ax                  ;File handle.
mov     dx
InSeg               ;Where to put the data.
mov     ds
dx
lea     dx
DataIn
mov     cx
256*251             ;Size of data file to read.
mov     ah
3Fh
int     21h
cmp     ax
256*251             ;See if we read the data.
je      GoodRead
print
byte    "Did not read the file properly"
cr
lf
0
jmp     Quit

GoodRead:       mov     ax
dseg
mov     ds
ax
print
byte    "Enter number of iterations: "
0
getsm
atoi
free
mov     iterations
ax
print
byte    "Computing Result"
cr
lf
0

; Copy the input data to the output buffer.

mov     i
0
iloop0:         cmp     i
250
ja      iDone0
mov     j
0
jloop0:         cmp     j
255
ja      jDone0

mov     bx
i                   ;Compute index into both
shl     bx
8                   ; arrays using the formula
add     bx
j                   ; i*256+j (row major).

mov     cx
InSeg               ;Point at input segment.
mov     es
cx
mov     al
es:DataIn[bx]       ;Get DataIn[i][j].

mov     cx
OutSeg              ;Point at output segment.
mov     es
cx
mov     es:DataOut[bx]
al      ;Store into DataOut[i][j]

inc     j                       ;Next iteration of j loop.
jmp     jloop0

jDone0:         inc     i                       ;Next iteration of i loop.
jmp     iloop0

iDone0:

; for h := 1 to iterations-

mov     h
1
hloop:          mov     ax
h
cmp     ax
iterations
ja      hloopDone



; for i := 1 to 249 -

mov     i
1
iloop:          cmp     i
249
ja      iloopDone

; for j := 1 to 254 -
mov     j
1
jloop:          cmp     j
254
ja      jloopDone


; sum := 0;
; for k := -1 to 1 do for l := -1 to 1 do

mov     ax
InSeg               ;Gain access to InSeg.
mov     es
ax

mov     sum
0
mov     k
-1
kloop:          cmp     k
1
jg      kloopDone

mov     l
-1
lloop:          cmp     l
1
jg      lloopDone

; sum := sum + datain [i+k][j+l]

mov     bx
i
add     bx
k
shl     bx
8                   ;Multiply by 256.
add     bx
j
add     bx
l

mov     al
es:DataIn[bx]
mov     ah
0
add     Sum
ax

inc     l
jmp     lloop

lloopDone:      inc     k
jmp     kloop


; dataout [i][j] := (sum + datain[i][j]*7) div 16;

kloopDone:      mov     bx
i
shl     bx
8                   ;*256
add     bx
j
mov     al
es:DataIn[bx]
mov     ah
0
imul    ax
7
add     ax
sum
shr     ax
4                   ;div 16

mov     bx
OutSeg
mov     es
bx

mov     bx
i
shl     bx
8
add     bx
j
mov     es:DataOut[bx]
al

inc     j
jmp     jloop

jloopDone:              inc     i
jmp     iloop

iloopDone:
; Copy the output data to the input buffer.

mov     i
0
iloop1:         cmp     i
250
ja      iDone1
mov     j
0
jloop1:         cmp     j
255
ja      jDone1

mov     bx
i                   ;Compute index into both
shl     bx
8                   ; arrays using the formula
add     bx
j                   ; i*256+j (row major).

mov     cx
OutSeg              ;Point at input segment.
mov     es
cx
mov     al
es:DataOut[bx]      ;Get DataIn[i][j].

mov     cx
InSeg               ;Point at output segment.
mov     es
cx
mov     es:DataIn[bx]
al       ;Store into DataOut[i][j]

inc     j                       ;Next iteration of j loop.
jmp     jloop1

jDone1:         inc     i                       ;Next iteration of i loop.
jmp     iloop1

iDone1:         inc     h
jmp     hloop

hloopDone:      print
byte    "Writing result"
cr
lf
0


; Okay
write the data to the output file:

mov     ah
3ch         ;Create output file.
mov     cx
0           ;Normal file attributes.
lea     dx
OutName
int     21h
jnc     GoodCreate
print
byte    "Could not create output file."
cr
lf
0
jmp     Quit

GoodCreate:     mov     bx
ax          ;File handle.
push    bx
mov     dx
OutSeg      ;Where the data can be found.
mov     ds
dx
lea     dx
DataOut
mov     cx
256*251     ;Size of data file to write.
mov     ah
40h         ;Write operation.
int     21h
pop     bx              ;Retrieve handle for close.
cmp     ax
256*251     ;See if we wrote the data.
je      GoodWrite
print
byte    "Did not write the file properly"
cr
lf
0
jmp     Quit

GoodWrite:      mov     ah
3eh         ;Close operation.
int     21h


Quit:           ExitPgm                 ;DOS macro to quit program.
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             byte    1024 dup ("stack ")
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       byte    16 dup (?)
zzzzzzseg       ends
end     Main

This assembly code is a very straight-forward line by line translation of the previous Pascal code. Even beginning programmers (who've read and understand Chapters Eight and Nine) should easily be able to improve the performance of this code.

While we could run a profiler on this program to determine where the "hot spots" are in this code a little analysis particularly of the Pascal version should make it obvious that there are a lot of nested loops in this code. As Chapter Ten points out when optimizing code you should always start with the innermost loops. The major change between the code above and the following assembly language version is that we've unrolled the innermost loops and we've replaced the array index computations with some constant computations. These minor changes speed up the execution by a factor of six! The assembly version now runs in six seconds rather than 36. A Microsoft C++ version of the same program with comparable optimzations runs in eight seconds. It required nearly four hours to develop test and debug this code. It required an additional hour to apply these same modifications to the C version.

; IMGPRCS2.ASM
;
; An image processing program (First optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16
weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K)
the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
;
;
;       Performance comparisons (66 MHz 80486 DX/2 system).
;
;       This code-                              6 seconds.
;       Original ASM code-                      36 seconds.
;       Borland Pascal v7.0-                    45 seconds.
;       Borland C++ v4.02-                      29 seconds.
;       Microsoft C++ v8.00-                    21 seconds.


;       // Lots of omitted code goes here
see the previous version//


print
byte    "Computing Result"
cr
lf
0

; for h := 1 to iterations-

mov     h
1
hloop:

; Copy the input data to the output buffer.
; Optimization step #1: Replace with movs instruction.

push    ds
mov     ax
OutSeg
mov     ds
ax
mov     ax
InSeg
mov     es
ax
lea     si
DataOut
lea     di
DataIn
mov     cx
(251*256)/4
rep     movsd
pop     ds


; Optimization Step #1: Convert loops to repeat..until form.

; for i := 1 to 249 -

mov     i
1
iloop:

; for j := 1 to 254 -

mov     j
1
jloop:


; Optimization. Unroll the innermost two loops:

mov     bh
byte ptr i          ;i is always less than 256.
mov     bl
byte ptr j          ;Computes i*256+j!

push    ds
mov     ax
InSeg               ;Gain access to InSeg.
mov     ds
ax

mov     cx
0                   ;Compute sum here.
mov     ah
ch
mov     cl
ds:DataIn[bx-257]   ;DataIn[i-1][j-1]
mov     al
ds:DataIn[bx-256]   ;DataIn[i-1][j]
add     cx
ax
mov     al
ds:DataIn[bx-255]   ;DataIn[i-1][j+1]
add     cx
ax
mov     al
ds:DataIn[bx-1]     ;DataIn[i][j-1]
add     cx
ax
mov     al
ds:DataIn[bx+1]     ;DataIn[i][j+1]
add     cx
ax
mov     al
ds:DataIn[bx+255]   ;DataIn[i+1][j-1]
add     cx
ax
mov     al
ds:DataIn[bx+256]   ;DataIn[i+1][j]
add     cx
ax
mov     al
ds:DataIn[bx+257]   ;DataIn[i+1][j+1]
add     cx
ax

mov     al
ds:DataIn[bx]       ;DataIn[i][j]
shl     ax
3                   ;DataIn[i][j]*8
add     cx
ax
shr     cx
4                   ;Divide by 16
mov     ax
OutSeg
mov     ds
ax
mov     ds:DataOut[bx]
cl
pop     ds

inc     j
cmp     j
254
jbe     jloop

inc     i
cmp     i
249
jbe     iloop

inc     h
mov     ax
h
cmp     ax
Iterations
jnbe    Done
jmp     hloop

Done:           print
byte    "Writing result"
cr
lf
0

;       //More omitted code goes here
see the previous version//

The second version above still uses memory variables for most computations. The optimizations applied to the original code were mainly language-independent optimizations. The next step was to begin applying some assembly language specific optimizations to the code. The first optimization we need to do is to move as many variables as possible into the 80x86's register set. The following code provides this optimization. Although this only improves the running time by 2 seconds that is a 33% improvement (six seconds down to four)!

; IMGPRCS.ASM
;
; An image processing program (Second optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16
weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K)
the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
; Version #3: Used registers for all variables. Set up segment registers
;        once and for all through the execution of the main loop so
;        the code didn't have to reload ds each time through. Computed
;        index into each row only once (outside the j loop).
;
;
;       Performance comparisons (66 MHz 80486 DX/2 system).
;
;       This code-                              4 seconds.
;       1st optimization pass-                  6 seconds.
;       Original ASM code-                      36 seconds.


;       <<Lots of delete code goes here»

print
byte    "Computing Result"
cr
lf
0


; Copy the input data to the output buffer.


hloop:          mov     ax
InSeg
mov     es
ax
mov     ax
OutSeg
mov     ds
ax
lea     si
DataOut
lea     di
DataIn
mov     cx
(251*256)/4
rep     movsd

assume  ds:InSeg
es:OutSeg
mov     ax
InSeg
mov     ds
ax
mov     ax
OutSeg
mov     es
ax



mov     cl
249
iloop:          mov     bh
cl                  ;i*256
mov     bl
1                   ;Start at j=1.
mov     ch
254                 ;# of times through loop.
jloop:
mov     dx
0                   ;Compute sum here.
mov     ah
dh
mov     dl
DataIn[bx-257]      ;DataIn[i-1][j-1]
mov     al
DataIn[bx-256]      ;DataIn[i-1][j]
add     dx
ax
mov     al
DataIn[bx-255]      ;DataIn[i-1][j+1]
add     dx
ax
mov     al
DataIn[bx-1]        ;DataIn[i][j-1]
add     dx
ax
mov     al
DataIn[bx+1]        ;DataIn[i][j+1]
add     dx
ax
mov     al
DataIn[bx+255]      ;DataIn[i+1][j-1]
add     dx
ax
mov     al
DataIn[bx+256]      ;DataIn[i+1][j]
add     dx
ax
mov     al
DataIn[bx+257]      ;DataIn[i+1][j+1]
add     dx
ax

mov     al
DataIn[bx]          ;DataIn[i][j]
shl     ax
3                   ;DataIn[i][j]*8
add     dx
ax
shr     dx
4                   ;Divide by 16
mov     DataOut[bx]
dl

inc     bx
dec     ch
jne     jloop

dec     cl
jne     iloop

dec     bp
jne     hloop

Done:           print
byte    "Writing result"
cr
lf
0

;       //More deleted code goes here
see the original version//

Note that on each iteration the code above still copies the output data back to the input data. That's almost six and a half megabytes of data movement for 100 iterations! The following version of the blurring program unrolls the hloop twice. The first occurrence copies the data from DataIn to DataOut while computing the blur the second instance copies the data from DataOut back to DataIn while blurring the image. By using these two code sequences the program save copying the data from one point to another. This version also maintains some common computations between two adjacent cells to save a few instructions in the innermost loop. This version arranges instructions in the innermost loop to help avoid data hazards on 80486 and later processors. The end result is almost 40% faster than the previous version (down to 2.5 seconds from four seconds).

; IMGPRCS.ASM
;
; An image processing program (Third optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16
weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K)
the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
;
; Version #3: Used registers for all variables. Set up segment registers
;        once and for all through the execution of the main loop so
;        the code didn't have to reload ds each time through. Computed
;        index into each row only once (outside the j loop).
;
; Version #4: Eliminated copying data from DataOut to DataIn on each pass.
;        Removed hazards. Maintained common subexpressions. Did some
;        more loop unrolling.
;
;
;       Performance comparisons (66 MHz 80486 DX/2 system
100 iterations).
;
;       This code-                              2.5 seconds.
;       2nd optimization pass-                  4 seconds.
;       1st optimization pass-                  6 seconds.
;       Original ASM code-                      36 seconds.

;       <<Lots of deleted code here
see the original version»

print
byte    "Computing Result"
cr
lf
0


assume  ds:InSeg
es:OutSeg

mov     ax
InSeg
mov     ds
ax
mov     ax
OutSeg
mov     es
ax

; Copy the data once so we get the edges in both arrays.

mov     cx
(251*256)/4
lea     si
DataIn
lea     di
DataOut
rep     movsd


; "hloop" repeats once for each iteration.

hloop:
mov     ax
InSeg
mov     ds
ax
mov     ax
OutSeg
mov     es
ax

; "iloop" processes the rows in the matrices.

mov     cl
249
iloop:          mov     bh
cl                  ;i*256
mov     bl
1                   ;Start at j=1.
mov     ch
254/2               ;# of times through loop.
mov     si
bx
mov     dh
0                   ;Compute sum here.
mov     bh
0
mov     ah
0

; "jloop" processes the individual elements of the array.
; This loop has been unrolled once to allow the two portions to share
; some common computations.

jloop:

; The sum of DataIn [i-1][j] + DataIn[i-1][j+1] + DataIn[i+1][j] +
; DataIn [i+1][j+1] will be used in the second half of this computation.
; So save its value in a register (di) until we need it again.

mov     dl
DataIn[si-256]      ;[i-1
j]
mov     al
DataIn[si-255]      ;[i-1
j+1]
mov     bl
DataIn[si+257]      ;[i+1
j+1]
add     dx
ax
mov     al
DataIn[si+256]      ;[I+1
j]
add     dx
bx
mov     bl
DataIn[si+1]        ;[i
j+1]
add     dx
ax
mov     al
DataIn[si+255]      ;[i+1
j-1]

mov     di
dx                  ;Save partial result.

add     dx
bx
mov     bl
DataIn[si-1]        ;[i
j-1]
add     dx
ax
mov     al
DataIn[si]          ;[i
j]
add     dx
bx
mov     bl
DataIn[si-257]      ;[i-1
j-1]
shl     ax
3                   ;DataIn[i
j] * 8.
add     dx
bx
add     dx
ax
shr     ax
3                   ;Restore DataIn[i
j].
shr     dx
4                   ;Divide by 16.
add     di
ax
mov     DataOut[si]
dl

; Okay
process the next cell over. Note that we've got a partial sum
; sitting in DI already. Don't forget
we haven't bumped SI at this point

; so the offsets are off by one. (This is the second half of the unrolled
; loop.)

mov     dx
di                  ;Partial sum.
mov     bl
DataIn[si-254]      ;[i-1
j+1]
mov     al
DataIn[si+2]        ;[i
j+1]
add     dx
bx
mov     bl
DataIn[si+258]      ;[i+1
j+1];
add     dx
ax
mov     al
DataIn[si+1]        ;[i
j]
add     dx
bx
shl     ax
3                   ;DataIn[i][j]*8
add     si
2                   ;Bump array index.
add     dx
ax
mov     ah
0                   ;Clear for next iter.
shr     dx
4                   ;Divide by 16
dec     ch
mov     DataOut[si-1]
dl
jne     jloop

dec     cl
jne     iloop

dec     bp
je      Done


; Special case so we don't have to move the data between the two arrays.
; This is an unrolled version of the hloop that swaps the input and output
; arrays so we don't have to move data around in memory.

mov     ax
OutSeg
mov     ds
ax
mov     ax
InSeg
mov     es
ax
assume  es:InSeg
ds:OutSeg

hloop2:

mov     cl
249
iloop2:         mov     bh
cl
mov     bl
1
mov     ch
254/2
mov     si
bx
mov     dh
0
mov     bh
0
mov     ah
0
jloop2:
mov     dl
DataOut[si-256]
mov     al
DataOut[si-255]
mov     bl
DataOut[si+257]
add     dx
ax
mov     al
DataOut[si+256]
add     dx
bx
mov     bl
DataOut[si+1]
add     dx
ax
mov     al
DataOut[si+255]

mov     di
dx

add     dx
bx
mov     bl
DataOut[si-1]
add     dx
ax
mov     al
DataOut[si]
add     dx
bx
mov     bl
DataOut[si-257]
shl     ax
3
add     dx
bx
add     dx
ax
shr     ax
3
shr     dx
4
mov     DataIn[si]
dl

mov     dx
di
mov     bl
DataOut[si-254]
add     dx
ax
mov     al
DataOut[si+2]
add     dx
bx
mov     bl
DataOut[si+258]
add     dx
ax
mov     al
DataOut[si+1]
add     dx
bx
shl     ax
3
add     si
2
add     dx
ax
mov     ah
0
shr     dx
4
dec     ch
mov     DataIn[si-1]
dl
jne     jloop2

dec     cl
jne     iloop2

dec     bp
je      Done2
jmp     hloop


; Kludge to guarantee that the data always resides in the output segment.

Done2:
mov     ax
InSeg
mov     ds
ax
mov     ax
OutSeg
mov     es
ax
mov     cx
(251*256)/4
lea     si
DataIn
lea     di
DataOut
rep     movsd

Done:           print
byte    "Writing result"
cr
lf
0

;       //Lots of deleted code here
see the original program//

This code provides a good example of the kind of optimization that scares a lot of people. There is a lot of cycle counting instruction scheduling and other crazy stuff that makes program very difficult to read and understand. This is the kind of optimization for which assembly language programmers are famous; the stuff that spawned the phrase "never optimize early." You should never try this type of optimization until you feel you've exhausted all other possibilities. Once you write your code in this fashion it is going to be very difficult to make further changes to it. By the way the above code took about 15 hours to develop and debug (debugging took the most time). That works out to a 0.1 second improvement (for 100 iterations) for each hour of work. Although this code certainly isn't optimal yet it is difficult to justify more time attempting to improve this code by mechanical means (e.g. moving instructions around etc.) because the performance gains would be so little.

In the four steps above we've reduced the running time of the assembly code from 36 seconds down to 2.5 seconds. Quite an impressive feat. However you shouldn't get the idea that this was easy or even that there were only four steps involved. During the actual development of this example there were many attempts that did not improve performance (in fact some modifications wound up reducing performance) and others did not improve performance enough to justify their inclusion. Just to demonstrate this last point the following code included a major change in the way the program organized data. The main loop operates on 16 bit objects in memory rather than eight bit objects. On some machines with large external caches (256K or better) this algorithm provides a slight improvement in performance (2.4 seconds down from 2.5). However on other machines it runs slower. Therefore this code was not chosen as the final implementation:

; IMGPRCS.ASM
;
; An image processing program (Fourth optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16
weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K)
the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
;
; Version #3: Used registers for all variables. Set up segment registers
;        once and for all through the execution of the main loop so
;        the code didn't have to reload ds each time through. Computed
;        index into each row only once (outside the j loop).
;
; Version #4: Eliminated copying data from DataOut to DataIn on each pass.
;        Removed hazards. Maintained common subexpressions. Did some
;        more loop unrolling.
;
; Version #5: Converted data arrays to words rather than bytes and operated
;        on 16-bit values. Yielded minimal speedup.
;
;       Performance comparisons (66 MHz 80486 DX/2 system).
;
;       This code-                              2.4 seconds.
;       3rd optimization pass-                  2.5 seconds.
;       2nd optimization pass-                  4 seconds.
;       1st optimization pass-                  6 seconds.
;       Original ASM code-                      36 seconds.

.xlist
include         stdlib.a
includelib      stdlib.lib
.list
.386
option          segment:use16



dseg            segment para public 'data'

ImgData         byte    251 dup (256 dup (?))

InName          byte    "roller1.raw"
0
OutName         byte    "roller2.raw"
0
Iterations      word    0

dseg            ends


; This code makes the naughty assumption that the following
; segments are loaded contiguously in memory! Also
because these
; segments are paragraph aligned
this code assumes that these segments
; will contain a full 65
536 bytes. You cannot declare a segment with
; exactly 65
536 bytes in MASM. However
the paragraph alignment option
; ensures that the extra byte of padding is added to the end of each
; segment.

DataSeg1        segment para public 'ds1'
Data1a          byte    65535 dup (?)
DataSeg1        ends

DataSeg2        segment para public 'ds2'
Data1b          byte    65535 dup (?)
DataSeg2        ends

DataSeg3        segment para public 'ds3'
Data2a          byte    65535 dup (?)
DataSeg3        ends

DataSeg4        segment para public 'ds4'
Data2b          byte    65535 dup (?)
DataSeg4        ends




cseg            segment para public 'code'
assume  cs:cseg
ds:dseg

Main            proc
mov     ax
dseg
mov     ds
ax
meminit

mov     ax
3d00h       ;Open input file for reading.
lea     dx
InName
int     21h
jnc     GoodOpen
print
byte    "Could not open input file."
cr
lf
0
jmp     Quit

GoodOpen:       mov     bx
ax          ;File handle.
lea     dx
ImgData
mov     cx
256*251     ;Size of data file to read.
mov     ah
3Fh
int     21h
cmp     ax
256*251     ;See if we read the data.
je      GoodRead
print
byte    "Did not read the file properly"
cr
lf
0
jmp     Quit

GoodRead:       print
byte    "Enter number of iterations: "
0
getsm
atoi
free
mov     Iterations
ax
cmp     ax
0
jle     Quit

printf
byte    "Computing Result for %d iterations"
cr
lf
0
dword   Iterations



; Copy the data and expand it from eight bits to sixteen bits.
; The first loop handles the first 32
768 bytes
the second loop
; handles the remaining bytes.

mov     ax
DataSeg1
mov     es
ax
mov     ax
DataSeg3
mov     fs
ax

mov     ah
0
mov     cx
32768
lea     si
ImgData
xor     di
di          ;Output data is at ofs zero.
CopyLoop:       lodsb                   ;Read a byte
mov     fs:[di]
ax     ;Store a word in DataSeg3
stosw                   ;Store a word in DataSeg1
dec     cx
jne     CopyLoop

mov     di
DataSeg2
mov     es
di
mov     di
DataSeg4
mov     fs
di
mov     cx
(251*256) - 32768
xor     di
di
CopyLoop1:      lodsb                   ;Read a byte
mov     fs:[di]
ax     ;Store a word in DataSeg4
stosw                   ;Store a word in DataSeg2
dec     cx
jne     CopyLoop1

; hloop completes one iteration on the data moving it from Data1a/Data1b
; to Data2a/Data2b

hloop:          mov     ax
DataSeg1
mov     ds
ax
mov     ax
DataSeg3
mov     es
ax

; Process the first 127 rows (65
024 bytes) of the array):

mov     cl
127
lea     si
Data1a+202h ;Start at [1
1]
iloop0:         mov     ch
254/2       ;# of times through loop.
jloop0:         mov     dx
[si]        ;[i
j]
mov     bx
[si-200h]   ;[i-1
j]
mov     ax
dx
shl     dx
3           ;[i
j] * 8
add     bx
[si-1feh]   ;[i-1
j+1]
mov     bp
[si+2]      ;[i
j+1]
add     bx
[si+200h]   ;[i+1
j]
add     dx
bp
add     bx
[si+202h]   ;[i+1
j+1]
add     dx
[si-202h]   ;[i-1
j-1]
mov     di
[si-1fch]   ;[i-1
j+2]
add     dx
[si-2]      ;[i
j-1]
add     di
[si+4]      ;[i
j+2]
add     dx
[si+1feh]   ;[i+1
j-1]
add     di
[si+204h]   ;[i+1
j+2]
shl     bp
3           ;[i
j+1] * 8
add     dx
bx
add     bp
ax
shr     dx
4           ;Divide by 16.
add     bp
bx
mov     es:[si]
dx     ;Store [i
j] entry.
add     bp
di
add     si
4           ;Affects next store operation!
shr     bp
4           ;Divide by 16.
dec     ch
mov     es:[si-2]
bp   ;Store [i
j+1] entry.
jne     jloop0

add     si
4           ;Skip to start of next row.

dec     cl
jne     iloop0

; Process the last 124 rows of the array). This requires that we switch from
; one segment to the next. Note that the segments overlap.

mov     ax
DataSeg2
sub     ax
40h         ;Back up to last 2 rows in DS2
mov     ds
ax
mov     ax
DataSeg4
sub     ax
40h         ;Back up to last 2 rows in DS4
mov     es
ax

mov     cl
251-127-1   ;Remaining rows to process.
mov     si
202h        ;Continue with next row.
iloop1:         mov     ch
254/2       ;# of times through loop.
jloop1:         mov     dx
[si]        ;[i
j]
mov     bx
[si-200h]   ;[i-1
j]
mov     ax
dx
shl     dx
3           ;[i
j] * 8
add     bx
[si-1feh]   ;[i-1
j+1]
mov     bp
[si+2]      ;[i
j+1]
add     bx
[si+200h]   ;[i+1
j]
add     dx
bp
add     bx
[si+202h]   ;[i+1
j+1]
add     dx
[si-202h]   ;[i-1
j-1]
mov     di
[si-1fch]   ;[i-1
j+2]
add     dx
[si-2]      ;[i
j-1]
add     di
[si+4]      ;[i
j+2]
add     dx
[si+1feh]   ;[i+1
j-1]
add     di
[si+204h]   ;[i+1
j+2]
shl     bp
3           ;[i
j+1] * 8
add     dx
bx
add     bp
ax
shr     dx
4           ;Divide by 16
add     bp
bx
mov     es:[si]
dx     ;Store [i
j] entry.
add     bp
di
add     si
4           ;Affects next store operation!
shr     bp
4
dec     ch
mov     es:[si-2]
bp   ;Store [i
j+1] entry.
jne     jloop1

add     si
4           ;Skip to start of next row.

dec     cl
jne     iloop1

mov     ax
dseg
mov     ds
ax
assume  ds:dseg

dec     Iterations
je      Done0

; Unroll the iterations loop so we can move the data from DataSeg2/4 back
; to DataSeg1/3 without wasting extra time. Other than the direction of the
; data movement
this code is virtually identical to the above.

mov     ax
DataSeg3
mov     ds
ax
mov     ax
DataSeg1
mov     es
ax

mov     cl
127
lea     si
Data1a+202h
iloop2:         mov     ch
254/2
jloop2:         mov     dx
[si]
mov     bx
[si-200h]
mov     ax
dx
shl     dx
3
add     bx
[si-1feh]
mov     bp
[si+2]
add     bx
[si+200h]
add     dx
bp
add     bx
[si+202h]
add     dx
[si-202h]
mov     di
[si-1fch]
add     dx
[si-2]
add     di
[si+4]
add     dx
[si+1feh]
add     di
[si+204h]
shl     bp
3
add     dx
bx
add     bp
ax
shr     dx
4
add     bp
bx
mov     es:[si]
dx
add     bp
di
add     si
4
shr     bp
4
dec     ch
mov     es:[si-2]
bp
jne     jloop2

add     si
4

dec     cl
jne     iloop2


mov     ax
DataSeg4
sub     ax
40h
mov     ds
ax
mov     ax
DataSeg2
sub     ax
40h
mov     es
ax

mov     cl
251-127-1
mov     si
202h
iloop3:         mov     ch
254/2
jloop3:         mov     dx
[si]
mov     bx
[si-200h]
mov     ax
dx
shl     dx
3
add     bx
[si-1feh]
mov     bp
[si+2]
add     bx
[si+200h]
add     dx
bp
add     bx
[si+202h]
add     dx
[si-202h]
mov     di
[si-1fch]
add     dx
[si-2]
add     di
[si+4]
add     dx
[si+1feh]
add     di
[si+204h]
shl     bp
3
add     dx
bx
add     bp
ax
shr     dx
4
add     bp
bx
mov     es:[si]
dx
add     bp
di
add     si
4
shr     bp
4
dec     ch
mov     es:[si-2]
bp
jne     jloop3

add     si
4

dec     cl
jne     iloop3

mov     ax
dseg
mov     ds
ax
assume  ds:dseg

dec     Iterations
je      Done2
jmp     hloop

Done2:          mov     ax
DataSeg1
mov     bx
DataSeg2
jmp     Finish

Done0:          mov     ax
DataSeg3
mov     bx
DataSeg4
Finish:         mov     ds
ax
print
byte    "Writing result"
cr
lf
0

; Convert data back to byte form and write to the output file:

mov     ax
dseg
mov     es
ax

mov     cx
32768
lea     di
ImgData
xor     si
si          ;Output data is at offset zero.
CopyLoop3:      lodsw                   ;Read a word from final array.
stosb                   ;Write a byte to output array.
dec     cx
jne     CopyLoop3

mov     ds
bx
mov     cx
(251*256) - 32768
xor     si
si
CopyLoop4:      lodsw                   ;Read final data word.
stosb                   ;Write data byte to output array.
dec     cx
jne     CopyLoop4


; Okay
write the data to the output file:

mov     ah
3ch         ;Create output file.
mov     cx
0           ;Normal file attributes.
mov     dx
dseg
mov     ds
dx
lea     dx
OutName
int     21h
jnc     GoodCreate
print
byte    "Could not create output file."
cr
lf
0
jmp     Quit

GoodCreate:     mov     bx
ax          ;File handle.
push    bx
mov     dx
dseg        ;Where the data can be found.
mov     ds
dx
lea     dx
ImgData
mov     cx
256*251     ;Size of data file to write.
mov     ah
40h         ;Write operation.
int     21h
pop     bx              ;Retrieve handle for close.
cmp     ax
256*251     ;See if we wrote the data.
je      GoodWrite
print
byte    "Did not write the file properly"
cr
lf
0
jmp     Quit

GoodWrite:      mov     ah
3eh         ;Close operation.
int     21h


Quit:           ExitPgm                 ;DOS macro to quit program.
Main            endp

cseg            ends

sseg            segment para stack 'stack'
stk             byte    1024 dup ("stack ")
sseg            ends

zzzzzzseg       segment para public 'zzzzzz'
LastBytes       byte    16 dup (?)
zzzzzzseg       ends
end     Main

Of course the absolute best way to improve the performance of any piece of code is with a better algorithm. All of the above assembly language versions were limited by a single requirement - they all must produce the same output file as the original Pascal program. Often programmers lose sight of what it is that they are trying to accomplish and get so caught up in the computations they are performing that they fail to see other possibilities. The optimization example above is a perfect example. The assembly code faithfully preserves the semantics of the original Pascal program; it computes the weighted average of all interior pixels as the sum of the eight neighbors around a pixel plus eight times the current pixel's value with the entire sum divided by 16. Now this is a good blurring function but it is not the only blurring function. A Photoshop (or other image processing program) user doesn't care about algorithms or such. When that user selects "blur image" they want it to go out of focus. Exactly how much out of focus is generally immaterial. In fact the less the better because the user can always run the blur algorithm again (or specify some number of iterations). The following assembly language program shows how to get better performance by modifying the blurring algorithm to reduce the number of instructions it needs to execute in the innermost loops. It computes blurring by averaging a pixel with the four neighbors above below to the left and to the right of the current pixel. This modification yields a program that runs 100 iterations in 2.2 seconds a 12% improvement over the previous version:

; IMGPRCS.ASM
;
; An image processing program (Fifth optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16
weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K)
the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
; Version #2: Three major optimizations. (1) used movsd instruction rather
;        than a loop to copy data from DataOut back to DataIn.
;        (2) Used repeat..until forms for all loops. (3) unrolled
;        the innermost two loops (which is responsible for most of
;        the performance improvement).
;
; Version #3: Used registers for all variables. Set up segment registers
;        once and for all through the execution of the main loop so
;        the code didn't have to reload ds each time through. Computed
;        index into each row only once (outside the j loop).
;
; Version #4: Eliminated copying data from DataOut to DataIn on each pass.
;        Removed hazards. Maintained common subexpressions. Did some
;        more loop unrolling.
;
; Version #6: Changed the blurring algorithm to use fewer computations.
;        This version does *NOT* produce the same data as the other
;        programs.
;
;
;       Performance comparisons (66 MHz 80486 DX/2 system
100 iterations).
;
;       This code-                              2.2 seconds.
;       3rd optmization pass-                   2.5 seconds.
;       2nd optimization pass-                  4 seconds.
;       1st optimization pass-                  6 seconds.
;       Original ASM code-                      36 seconds.

;       <<Lots of deleted code here
see the original program»


print
byte    "Computing Result"
cr
lf
0


assume  ds:InSeg
es:OutSeg

mov     ax
InSeg
mov     ds
ax
mov     ax
OutSeg
mov     es
ax

; Copy the data once so we get the edges in both arrays.

mov     cx
(251*256)/4
lea     si
DataIn
lea     di
DataOut
rep     movsd


; "hloop" repeats once for each iteration.

hloop:
mov     ax
InSeg
mov     ds
ax
mov     ax
OutSeg
mov     es
ax

; "iloop" processes the rows in the matrices.

mov     cl
249
iloop:          mov     bh
cl                  ;i*256
mov     bl
1                   ;Start at j=1.
mov     ch
254/2               ;# of times through loop.
mov     si
bx
mov     dh
0                   ;Compute sum here.
mov     bh
0
mov     ah
0

; "jloop" processes the individual elements of the array.
; This loop has been unrolled once to allow the two portions to share
; some common computations.

jloop:

; The sum of DataIn [i-1][j] + DataIn[i-1][j+1] + DataIn[i+1][j] +
; DataIn [i+1][j+1] will be used in the second half of this computation.
; So save its value in a register (di) until we need it again.

mov     dl
DataIn[si]          ;[i
j]
mov     al
DataIn[si-256]      ;[I-1
j]
shl     dx
2                   ;[i
j]*4
mov     bl
DataIn[si-1]        ;[i
j-1]
add     dx
ax
mov     al
DataIn[si+1]        ;[i
j+1]
add     dx
bx
mov     bl
DataIn[si+256]      ;[i+1
j]
add     dx
ax
shl     ax
2                   ;[i
j+1]*4
add     dx
bx
mov     bl
DataIn[si-255]      ;[i-1
j+1]
shr     dx
3                   ;Divide by 8.
add     ax
bx
mov     DataOut[si]
dl
mov     bl
DataIn[si+2]        ;[i
j+2]
mov     dl
DataIn[si+257]      ;[i+1
j+1]
add     ax
bx
mov     bl
DataIn[si]          ;[i
j]
add     ax
dx
add     ax
bx
shr     ax
3
dec     ch
mov     DataOut[si+1]
al
jne     jloop

dec     cl
jne     iloop

dec     bp
je      Done


; Special case so we don't have to move the data between the two arrays.
; This is an unrolled version of the hloop that swaps the input and output
; arrays so we don't have to move data around in memory.

mov     ax
OutSeg
mov     ds
ax
mov     ax
InSeg
mov     es
ax
assume  es:InSeg
ds:OutSeg

hloop2:

mov     cl
249
iloop2:         mov     bh
cl
mov     bl
1
mov     ch
254/2
mov     si
bx
mov     dh
0
mov     bh
0
mov     ah
0
jloop2:
mov     dl
DataOut[si-256]
mov     al
DataOut[si-255]
mov     bl
DataOut[si+257]
add     dx
ax
mov     al
DataOut[si+256]
add     dx
bx
mov     bl
DataOut[si+1]
add     dx
ax
mov     al
DataOut[si+255]

mov     di
dx

add     dx
bx
mov     bl
DataOut[si-1]
add     dx
ax
mov     al
DataOut[si]
add     dx
bx
mov     bl
DataOut[si-257]
shl     ax
3
add     dx
bx
add     dx
ax
shr     ax
3
shr     dx
4
mov     DataIn[si]
dl

mov     dx
di
mov     bl
DataOut[si-254]
add     dx
ax
mov     al
DataOut[si+2]
add     dx
bx
mov     bl
DataOut[si+258]
add     dx
ax
mov     al
DataOut[si+1]
add     dx
bx
shl     ax
3
add     si
2
add     dx
ax
mov     ah
0
shr     dx
4
dec     ch
mov     DataIn[si-1]
dl
jne     jloop2

dec     cl
jne     iloop2

dec     bp
je      Done2
jmp     hloop


; Kludge to guarantee that the data always resides in the output segment.

Done2:
mov     ax
InSeg
mov     ds
ax
mov     ax
OutSeg
mov     es
ax
mov     cx
(251*256)/4
lea     si
DataIn
lea     di
DataOut
rep     movsd

Done:           print
byte    "Writing result"
cr
lf
0


;       //Lots of delete code here
see the original program//

One very important thing to keep in mind about the codein this section is that we've optimized it for 100 iterations. While it turns out that these optimizations apply equally well to more iterations this isn't necessarily true for fewer iterations. In particular if we run only one iteration any copying of data at the end of the operation will easily consume a large part of the time we save by the optimizations. Since it is very rare for a user to blur an image 100 times in a row our optimizations may not be as good as we could make them. However this section does provide a good example of the steps you must go through in order to optimize a given program. One hundred iterations was a good choice for this example because it was easy to measure the running time of all versions of the program. However you must keep in mind that you should optimize your programs for the expected case not an arbitrary case.

Chapter Twenty Five (Part 1)

Table of Content

Chapter Twenty Five: Optimizing Your Programm (Part 2)
29 SEP 1996