|
Table of Content | |
One easy way to partially demonstrate how to optimize a piece of code is to provide an example of some program and the optimization steps you can apply to that program. This section will present a short program that blurs an eight-bit gray scale image. Then this section will lead though through several optimization steps and show you how to get that program running over 16 times faster.
The following code assumes that you provide it with a file containing a 251x256 gray scale photographic image. The data structure for this file is as follows:
Image: array [0..250 0..255] of byte;
Each byte contains a value in the range 0..255 with zero denoting black 255 representing white and the other values representing even shades of gray between these two extremes.
The blurring algorithm averages a pixel with its eight closest neighbors. A single blur operation applies this average to all interior pixels of an image (that is it does not apply to the pixels on the boundary of the image because they do not have the same number of neighbors as the other pixels). The following Pascal program implements the blurring algorithm and lets the user specify the amount of blurring (by looping through the algorithm the number of times the user specifies):
program PhotoFilter(input
output);
(* Here is the raw file data type produced by the Photoshop program *)
type
image = array [0..250] of array [0..255] of byte;
(* The variables we will use. Note that the "datain" and "dataout" *)
(* variables are pointers because Turbo Pascal will not allow us to *)
(* allocate more than 64K data in the one global data segment it *)
(* supports. *)
var
h
i
j
k
l
sum
iterations:integer;
datain
dataout: ^image;
f
g:file of image;
begin
(* Open the files and real the input data *)
assign(f
'roller1.raw');
assign(g
'roller2.raw');
reset(f);
rewrite(g);
new(datain);
new(dataout);
read(f
datain^);
(* Get the number of iterations from the user *)
write('Enter number of iterations:');
readln(iterations);
writeln('Computing result');
(* Copy the data from the input array to the output array. *)
(* This is a really lame way to copy the border from the *)
(* input array to the output array. *)
for i := 0 to 250 do
for j := 0 to 255 do
dataout^ [i][j] := datain^ [i][j];
(* Okay
here's where all the work takes place. The outside *)
(* loop repeats this blurring operation the number of *)
(* iterations specified by the user. *)
for h := 1 to iterations do begin
(* For each row except the first and the last
compute *)
(* a new value for each element. *)
for i := 1 to 249 do
(* For each column except the first and the last
com- *)
(* pute a new value for each element. *)
for j := 1 to 254 do begin
(* For each element in the array
compute a new
blurred value by adding up the eight cells
around an array element along with eight times
the current cell's value. Then divide this by
sixteen to compute a weighted average of the
nine cells forming a square around the current
cell. The current cell has a 50% weighting
the other eight cells around the current cel
provide the other 50% weighting (6.25% each). *)
sum := 0;
for k := -1 to 1 do
for l := -1 to 1 do
sum := sum + datain^ [i+k][j+l];
(* Sum currently contains the sum of the nine *)
(* cells
add in seven times the current cell so *)
(* we get a total of eight times the current cell. *)
dataout^ [i][j] := (sum + datain^ [i][j]*7) div 16;
end;
(* Copy the output cell values back to the input cells *)
(* so we can perform the blurring on this new data on *)
(* the next iteration. *)
for i := 0 to 250 do
for j := 0 to 255 do
datain^ [i][j] := dataout^ [i][j];
end;
writeln('Writing result');
write(g
dataout^);
close(f);
close(g);
end.
The Pascal program above compiled with Turbo Pascal v7.0 takes 45 seconds to compute 100 iterations of the blurring algorithm. A comparable program written in C and compiled with Borland C++ v4.02 takes 29 seconds to run. The same source file compiled with Microsoft C++ v8.00 runs in 21 seconds. Obviously the C compilers produce better code than Turbo Pascal. It took about three hours to get the Pascal version running and tested. The C versions took about another hour to code and test. The following two images provide a "before" and "after" example of this program's function:
Before blurring:
After blurring (10 iterations):
The following is a crude translation from Pascal directly into assembly language of the above program. It requires 36 seconds to run. Yes the C compilers did a better job but once you see how bad this code is you'll wonder what it is that Turbo Pascal is doing to run so slow. It took about an hour to translate the Pascal version into this assembly code and debug it to the point it produced the same output as the Pascal version.
; IMGPRCS.ASM
;
; An image processing program.
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16
weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K)
the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
; Performance comparisons (66 MHz 80486 DX/2 system).
;
; This code- 36 seconds.
; Borland Pascal v7.0- 45 seconds.
; Borland C++ v4.02- 29 seconds.
; Microsoft C++ v8.00- 21 seconds.
.xlist
include stdlib.a
includelib stdlib.lib
.list
.286
dseg segment para public 'data'
; Loop control variables and other variables:
h word ?
i word ?
j word ?
k word ?
l word ?
sum word ?
iterations word ?
; File names:
InName byte "roller1.raw"
0
OutName byte "roller2.raw"
0
dseg ends
; Here is the input data that we operate on.
InSeg segment para public 'indata'
DataIn byte 251 dup (256 dup (?))
InSeg ends
; Here is the output array that holds the result.
OutSeg segment para public 'outdata'
DataOut byte 251 dup (256 dup (?))
OutSeg ends
cseg segment para public 'code'
assume cs:cseg
ds:dseg
Main proc
mov ax
dseg
mov ds
ax
meminit
mov ax
3d00h ;Open input file for reading.
lea dx
InName
int 21h
jnc GoodOpen
print
byte "Could not open input file."
cr
lf
0
jmp Quit
GoodOpen: mov bx
ax ;File handle.
mov dx
InSeg ;Where to put the data.
mov ds
dx
lea dx
DataIn
mov cx
256*251 ;Size of data file to read.
mov ah
3Fh
int 21h
cmp ax
256*251 ;See if we read the data.
je GoodRead
print
byte "Did not read the file properly"
cr
lf
0
jmp Quit
GoodRead: mov ax
dseg
mov ds
ax
print
byte "Enter number of iterations: "
0
getsm
atoi
free
mov iterations
ax
print
byte "Computing Result"
cr
lf
0
; Copy the input data to the output buffer.
mov i
0
iloop0: cmp i
250
ja iDone0
mov j
0
jloop0: cmp j
255
ja jDone0
mov bx
i ;Compute index into both
shl bx
8 ; arrays using the formula
add bx
j ; i*256+j (row major).
mov cx
InSeg ;Point at input segment.
mov es
cx
mov al
es:DataIn[bx] ;Get DataIn[i][j].
mov cx
OutSeg ;Point at output segment.
mov es
cx
mov es:DataOut[bx]
al ;Store into DataOut[i][j]
inc j ;Next iteration of j loop.
jmp jloop0
jDone0: inc i ;Next iteration of i loop.
jmp iloop0
iDone0:
; for h := 1 to iterations-
mov h
1
hloop: mov ax
h
cmp ax
iterations
ja hloopDone
; for i := 1 to 249 -
mov i
1
iloop: cmp i
249
ja iloopDone
; for j := 1 to 254 -
mov j
1
jloop: cmp j
254
ja jloopDone
; sum := 0;
; for k := -1 to 1 do for l := -1 to 1 do
mov ax
InSeg ;Gain access to InSeg.
mov es
ax
mov sum
0
mov k
-1
kloop: cmp k
1
jg kloopDone
mov l
-1
lloop: cmp l
1
jg lloopDone
; sum := sum + datain [i+k][j+l]
mov bx
i
add bx
k
shl bx
8 ;Multiply by 256.
add bx
j
add bx
l
mov al
es:DataIn[bx]
mov ah
0
add Sum
ax
inc l
jmp lloop
lloopDone: inc k
jmp kloop
; dataout [i][j] := (sum + datain[i][j]*7) div 16;
kloopDone: mov bx
i
shl bx
8 ;*256
add bx
j
mov al
es:DataIn[bx]
mov ah
0
imul ax
7
add ax
sum
shr ax
4 ;div 16
mov bx
OutSeg
mov es
bx
mov bx
i
shl bx
8
add bx
j
mov es:DataOut[bx]
al
inc j
jmp jloop
jloopDone: inc i
jmp iloop
iloopDone:
; Copy the output data to the input buffer.
mov i
0
iloop1: cmp i
250
ja iDone1
mov j
0
jloop1: cmp j
255
ja jDone1
mov bx
i ;Compute index into both
shl bx
8 ; arrays using the formula
add bx
j ; i*256+j (row major).
mov cx
OutSeg ;Point at input segment.
mov es
cx
mov al
es:DataOut[bx] ;Get DataIn[i][j].
mov cx
InSeg ;Point at output segment.
mov es
cx
mov es:DataIn[bx]
al ;Store into DataOut[i][j]
inc j ;Next iteration of j loop.
jmp jloop1
jDone1: inc i ;Next iteration of i loop.
jmp iloop1
iDone1: inc h
jmp hloop
hloopDone: print
byte "Writing result"
cr
lf
0
; Okay
write the data to the output file:
mov ah
3ch ;Create output file.
mov cx
0 ;Normal file attributes.
lea dx
OutName
int 21h
jnc GoodCreate
print
byte "Could not create output file."
cr
lf
0
jmp Quit
GoodCreate: mov bx
ax ;File handle.
push bx
mov dx
OutSeg ;Where the data can be found.
mov ds
dx
lea dx
DataOut
mov cx
256*251 ;Size of data file to write.
mov ah
40h ;Write operation.
int 21h
pop bx ;Retrieve handle for close.
cmp ax
256*251 ;See if we wrote the data.
je GoodWrite
print
byte "Did not write the file properly"
cr
lf
0
jmp Quit
GoodWrite: mov ah
3eh ;Close operation.
int 21h
Quit: ExitPgm ;DOS macro to quit program.
Main endp
cseg ends
sseg segment para stack 'stack'
stk byte 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes byte 16 dup (?)
zzzzzzseg ends
end Main
This assembly code is a very straight-forward line by line translation of the previous Pascal code. Even beginning programmers (who've read and understand Chapters Eight and Nine) should easily be able to improve the performance of this code.
While we could run a profiler on this program to determine where the "hot spots" are in this code a little analysis particularly of the Pascal version should make it obvious that there are a lot of nested loops in this code. As Chapter Ten points out when optimizing code you should always start with the innermost loops. The major change between the code above and the following assembly language version is that we've unrolled the innermost loops and we've replaced the array index computations with some constant computations. These minor changes speed up the execution by a factor of six! The assembly version now runs in six seconds rather than 36. A Microsoft C++ version of the same program with comparable optimzations runs in eight seconds. It required nearly four hours to develop test and debug this code. It required an additional hour to apply these same modifications to the C version.
; IMGPRCS2.ASM ; ; An image processing program (First optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16 weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K) the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; ; ; Performance comparisons (66 MHz 80486 DX/2 system). ; ; This code- 6 seconds. ; Original ASM code- 36 seconds. ; Borland Pascal v7.0- 45 seconds. ; Borland C++ v4.02- 29 seconds. ; Microsoft C++ v8.00- 21 seconds. ; // Lots of omitted code goes here see the previous version// print byte "Computing Result" cr lf 0 ; for h := 1 to iterations- mov h 1 hloop: ; Copy the input data to the output buffer. ; Optimization step #1: Replace with movs instruction. push ds mov ax OutSeg mov ds ax mov ax InSeg mov es ax lea si DataOut lea di DataIn mov cx (251*256)/4 rep movsd pop ds ; Optimization Step #1: Convert loops to repeat..until form. ; for i := 1 to 249 - mov i 1 iloop: ; for j := 1 to 254 - mov j 1 jloop: ; Optimization. Unroll the innermost two loops: mov bh byte ptr i ;i is always less than 256. mov bl byte ptr j ;Computes i*256+j! push ds mov ax InSeg ;Gain access to InSeg. mov ds ax mov cx 0 ;Compute sum here. mov ah ch mov cl ds:DataIn[bx-257] ;DataIn[i-1][j-1] mov al ds:DataIn[bx-256] ;DataIn[i-1][j] add cx ax mov al ds:DataIn[bx-255] ;DataIn[i-1][j+1] add cx ax mov al ds:DataIn[bx-1] ;DataIn[i][j-1] add cx ax mov al ds:DataIn[bx+1] ;DataIn[i][j+1] add cx ax mov al ds:DataIn[bx+255] ;DataIn[i+1][j-1] add cx ax mov al ds:DataIn[bx+256] ;DataIn[i+1][j] add cx ax mov al ds:DataIn[bx+257] ;DataIn[i+1][j+1] add cx ax mov al ds:DataIn[bx] ;DataIn[i][j] shl ax 3 ;DataIn[i][j]*8 add cx ax shr cx 4 ;Divide by 16 mov ax OutSeg mov ds ax mov ds:DataOut[bx] cl pop ds inc j cmp j 254 jbe jloop inc i cmp i 249 jbe iloop inc h mov ax h cmp ax Iterations jnbe Done jmp hloop Done: print byte "Writing result" cr lf 0 ; //More omitted code goes here see the previous version//
The second version above still uses memory variables for most computations. The optimizations applied to the original code were mainly language-independent optimizations. The next step was to begin applying some assembly language specific optimizations to the code. The first optimization we need to do is to move as many variables as possible into the 80x86's register set. The following code provides this optimization. Although this only improves the running time by 2 seconds that is a 33% improvement (six seconds down to four)!
; IMGPRCS.ASM ; ; An image processing program (Second optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16 weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K) the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; Version #3: Used registers for all variables. Set up segment registers ; once and for all through the execution of the main loop so ; the code didn't have to reload ds each time through. Computed ; index into each row only once (outside the j loop). ; ; ; Performance comparisons (66 MHz 80486 DX/2 system). ; ; This code- 4 seconds. ; 1st optimization pass- 6 seconds. ; Original ASM code- 36 seconds. ; <<Lots of delete code goes here» print byte "Computing Result" cr lf 0 ; Copy the input data to the output buffer. hloop: mov ax InSeg mov es ax mov ax OutSeg mov ds ax lea si DataOut lea di DataIn mov cx (251*256)/4 rep movsd assume ds:InSeg es:OutSeg mov ax InSeg mov ds ax mov ax OutSeg mov es ax mov cl 249 iloop: mov bh cl ;i*256 mov bl 1 ;Start at j=1. mov ch 254 ;# of times through loop. jloop: mov dx 0 ;Compute sum here. mov ah dh mov dl DataIn[bx-257] ;DataIn[i-1][j-1] mov al DataIn[bx-256] ;DataIn[i-1][j] add dx ax mov al DataIn[bx-255] ;DataIn[i-1][j+1] add dx ax mov al DataIn[bx-1] ;DataIn[i][j-1] add dx ax mov al DataIn[bx+1] ;DataIn[i][j+1] add dx ax mov al DataIn[bx+255] ;DataIn[i+1][j-1] add dx ax mov al DataIn[bx+256] ;DataIn[i+1][j] add dx ax mov al DataIn[bx+257] ;DataIn[i+1][j+1] add dx ax mov al DataIn[bx] ;DataIn[i][j] shl ax 3 ;DataIn[i][j]*8 add dx ax shr dx 4 ;Divide by 16 mov DataOut[bx] dl inc bx dec ch jne jloop dec cl jne iloop dec bp jne hloop Done: print byte "Writing result" cr lf 0 ; //More deleted code goes here see the original version//
Note that on each iteration
the code above still copies
the output data back to the input data. That's almost six and a half megabytes of data
movement for 100 iterations! The following version of the blurring program unrolls the hloop
twice. The first occurrence copies the data from DataIn to DataOut while computing the
blur
the second instance copies the data from DataOut back to DataIn while blurring the
image. By using these two code sequences
the program save copying the data from one point
to another. This version also maintains some common computations between two adjacent
cells to save a few instructions in the innermost loop. This version arranges instructions
in the innermost loop to help avoid data hazards on 80486 and later processors. The end
result is almost 40% faster than the previous version (down to 2.5 seconds from four
seconds).
; IMGPRCS.ASM ; ; An image processing program (Third optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16 weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K) the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; ; Version #3: Used registers for all variables. Set up segment registers ; once and for all through the execution of the main loop so ; the code didn't have to reload ds each time through. Computed ; index into each row only once (outside the j loop). ; ; Version #4: Eliminated copying data from DataOut to DataIn on each pass. ; Removed hazards. Maintained common subexpressions. Did some ; more loop unrolling. ; ; ; Performance comparisons (66 MHz 80486 DX/2 system 100 iterations). ; ; This code- 2.5 seconds. ; 2nd optimization pass- 4 seconds. ; 1st optimization pass- 6 seconds. ; Original ASM code- 36 seconds. ; <<Lots of deleted code here see the original version» print byte "Computing Result" cr lf 0 assume ds:InSeg es:OutSeg mov ax InSeg mov ds ax mov ax OutSeg mov es ax ; Copy the data once so we get the edges in both arrays. mov cx (251*256)/4 lea si DataIn lea di DataOut rep movsd ; "hloop" repeats once for each iteration. hloop: mov ax InSeg mov ds ax mov ax OutSeg mov es ax ; "iloop" processes the rows in the matrices. mov cl 249 iloop: mov bh cl ;i*256 mov bl 1 ;Start at j=1. mov ch 254/2 ;# of times through loop. mov si bx mov dh 0 ;Compute sum here. mov bh 0 mov ah 0 ; "jloop" processes the individual elements of the array. ; This loop has been unrolled once to allow the two portions to share ; some common computations. jloop: ; The sum of DataIn [i-1][j] + DataIn[i-1][j+1] + DataIn[i+1][j] + ; DataIn [i+1][j+1] will be used in the second half of this computation. ; So save its value in a register (di) until we need it again. mov dl DataIn[si-256] ;[i-1 j] mov al DataIn[si-255] ;[i-1 j+1] mov bl DataIn[si+257] ;[i+1 j+1] add dx ax mov al DataIn[si+256] ;[I+1 j] add dx bx mov bl DataIn[si+1] ;[i j+1] add dx ax mov al DataIn[si+255] ;[i+1 j-1] mov di dx ;Save partial result. add dx bx mov bl DataIn[si-1] ;[i j-1] add dx ax mov al DataIn[si] ;[i j] add dx bx mov bl DataIn[si-257] ;[i-1 j-1] shl ax 3 ;DataIn[i j] * 8. add dx bx add dx ax shr ax 3 ;Restore DataIn[i j]. shr dx 4 ;Divide by 16. add di ax mov DataOut[si] dl ; Okay process the next cell over. Note that we've got a partial sum ; sitting in DI already. Don't forget we haven't bumped SI at this point ; so the offsets are off by one. (This is the second half of the unrolled ; loop.) mov dx di ;Partial sum. mov bl DataIn[si-254] ;[i-1 j+1] mov al DataIn[si+2] ;[i j+1] add dx bx mov bl DataIn[si+258] ;[i+1 j+1]; add dx ax mov al DataIn[si+1] ;[i j] add dx bx shl ax 3 ;DataIn[i][j]*8 add si 2 ;Bump array index. add dx ax mov ah 0 ;Clear for next iter. shr dx 4 ;Divide by 16 dec ch mov DataOut[si-1] dl jne jloop dec cl jne iloop dec bp je Done ; Special case so we don't have to move the data between the two arrays. ; This is an unrolled version of the hloop that swaps the input and output ; arrays so we don't have to move data around in memory. mov ax OutSeg mov ds ax mov ax InSeg mov es ax assume es:InSeg ds:OutSeg hloop2: mov cl 249 iloop2: mov bh cl mov bl 1 mov ch 254/2 mov si bx mov dh 0 mov bh 0 mov ah 0 jloop2: mov dl DataOut[si-256] mov al DataOut[si-255] mov bl DataOut[si+257] add dx ax mov al DataOut[si+256] add dx bx mov bl DataOut[si+1] add dx ax mov al DataOut[si+255] mov di dx add dx bx mov bl DataOut[si-1] add dx ax mov al DataOut[si] add dx bx mov bl DataOut[si-257] shl ax 3 add dx bx add dx ax shr ax 3 shr dx 4 mov DataIn[si] dl mov dx di mov bl DataOut[si-254] add dx ax mov al DataOut[si+2] add dx bx mov bl DataOut[si+258] add dx ax mov al DataOut[si+1] add dx bx shl ax 3 add si 2 add dx ax mov ah 0 shr dx 4 dec ch mov DataIn[si-1] dl jne jloop2 dec cl jne iloop2 dec bp je Done2 jmp hloop ; Kludge to guarantee that the data always resides in the output segment. Done2: mov ax InSeg mov ds ax mov ax OutSeg mov es ax mov cx (251*256)/4 lea si DataIn lea di DataOut rep movsd Done: print byte "Writing result" cr lf 0 ; //Lots of deleted code here see the original program//
This code provides a good example of the kind of optimization that scares a lot of people. There is a lot of cycle counting instruction scheduling and other crazy stuff that makes program very difficult to read and understand. This is the kind of optimization for which assembly language programmers are famous; the stuff that spawned the phrase "never optimize early." You should never try this type of optimization until you feel you've exhausted all other possibilities. Once you write your code in this fashion it is going to be very difficult to make further changes to it. By the way the above code took about 15 hours to develop and debug (debugging took the most time). That works out to a 0.1 second improvement (for 100 iterations) for each hour of work. Although this code certainly isn't optimal yet it is difficult to justify more time attempting to improve this code by mechanical means (e.g. moving instructions around etc.) because the performance gains would be so little.
In the four steps above we've reduced the running time of the assembly code from 36 seconds down to 2.5 seconds. Quite an impressive feat. However you shouldn't get the idea that this was easy or even that there were only four steps involved. During the actual development of this example there were many attempts that did not improve performance (in fact some modifications wound up reducing performance) and others did not improve performance enough to justify their inclusion. Just to demonstrate this last point the following code included a major change in the way the program organized data. The main loop operates on 16 bit objects in memory rather than eight bit objects. On some machines with large external caches (256K or better) this algorithm provides a slight improvement in performance (2.4 seconds down from 2.5). However on other machines it runs slower. Therefore this code was not chosen as the final implementation:
; IMGPRCS.ASM
;
; An image processing program (Fourth optimization pass).
;
; This program blurs an eight-bit grayscale image by averaging a pixel
; in the image with the eight pixels around it. The average is computed
; by (CurCell*8 + other 8 cells)/16
weighting the current cell by 50%.
;
; Because of the size of the image (almost 64K)
the input and output
; matrices are in different segments.
;
; Version #1: Straight-forward translation from Pascal to Assembly.
;
; Version #2: Three major optimizations. (1) used movsd instruction rather
; than a loop to copy data from DataOut back to DataIn.
; (2) Used repeat..until forms for all loops. (3) unrolled
; the innermost two loops (which is responsible for most of
; the performance improvement).
;
; Version #3: Used registers for all variables. Set up segment registers
; once and for all through the execution of the main loop so
; the code didn't have to reload ds each time through. Computed
; index into each row only once (outside the j loop).
;
; Version #4: Eliminated copying data from DataOut to DataIn on each pass.
; Removed hazards. Maintained common subexpressions. Did some
; more loop unrolling.
;
; Version #5: Converted data arrays to words rather than bytes and operated
; on 16-bit values. Yielded minimal speedup.
;
; Performance comparisons (66 MHz 80486 DX/2 system).
;
; This code- 2.4 seconds.
; 3rd optimization pass- 2.5 seconds.
; 2nd optimization pass- 4 seconds.
; 1st optimization pass- 6 seconds.
; Original ASM code- 36 seconds.
.xlist
include stdlib.a
includelib stdlib.lib
.list
.386
option segment:use16
dseg segment para public 'data'
ImgData byte 251 dup (256 dup (?))
InName byte "roller1.raw"
0
OutName byte "roller2.raw"
0
Iterations word 0
dseg ends
; This code makes the naughty assumption that the following
; segments are loaded contiguously in memory! Also
because these
; segments are paragraph aligned
this code assumes that these segments
; will contain a full 65
536 bytes. You cannot declare a segment with
; exactly 65
536 bytes in MASM. However
the paragraph alignment option
; ensures that the extra byte of padding is added to the end of each
; segment.
DataSeg1 segment para public 'ds1'
Data1a byte 65535 dup (?)
DataSeg1 ends
DataSeg2 segment para public 'ds2'
Data1b byte 65535 dup (?)
DataSeg2 ends
DataSeg3 segment para public 'ds3'
Data2a byte 65535 dup (?)
DataSeg3 ends
DataSeg4 segment para public 'ds4'
Data2b byte 65535 dup (?)
DataSeg4 ends
cseg segment para public 'code'
assume cs:cseg
ds:dseg
Main proc
mov ax
dseg
mov ds
ax
meminit
mov ax
3d00h ;Open input file for reading.
lea dx
InName
int 21h
jnc GoodOpen
print
byte "Could not open input file."
cr
lf
0
jmp Quit
GoodOpen: mov bx
ax ;File handle.
lea dx
ImgData
mov cx
256*251 ;Size of data file to read.
mov ah
3Fh
int 21h
cmp ax
256*251 ;See if we read the data.
je GoodRead
print
byte "Did not read the file properly"
cr
lf
0
jmp Quit
GoodRead: print
byte "Enter number of iterations: "
0
getsm
atoi
free
mov Iterations
ax
cmp ax
0
jle Quit
printf
byte "Computing Result for %d iterations"
cr
lf
0
dword Iterations
; Copy the data and expand it from eight bits to sixteen bits.
; The first loop handles the first 32
768 bytes
the second loop
; handles the remaining bytes.
mov ax
DataSeg1
mov es
ax
mov ax
DataSeg3
mov fs
ax
mov ah
0
mov cx
32768
lea si
ImgData
xor di
di ;Output data is at ofs zero.
CopyLoop: lodsb ;Read a byte
mov fs:[di]
ax ;Store a word in DataSeg3
stosw ;Store a word in DataSeg1
dec cx
jne CopyLoop
mov di
DataSeg2
mov es
di
mov di
DataSeg4
mov fs
di
mov cx
(251*256) - 32768
xor di
di
CopyLoop1: lodsb ;Read a byte
mov fs:[di]
ax ;Store a word in DataSeg4
stosw ;Store a word in DataSeg2
dec cx
jne CopyLoop1
; hloop completes one iteration on the data moving it from Data1a/Data1b
; to Data2a/Data2b
hloop: mov ax
DataSeg1
mov ds
ax
mov ax
DataSeg3
mov es
ax
; Process the first 127 rows (65
024 bytes) of the array):
mov cl
127
lea si
Data1a+202h ;Start at [1
1]
iloop0: mov ch
254/2 ;# of times through loop.
jloop0: mov dx
[si] ;[i
j]
mov bx
[si-200h] ;[i-1
j]
mov ax
dx
shl dx
3 ;[i
j] * 8
add bx
[si-1feh] ;[i-1
j+1]
mov bp
[si+2] ;[i
j+1]
add bx
[si+200h] ;[i+1
j]
add dx
bp
add bx
[si+202h] ;[i+1
j+1]
add dx
[si-202h] ;[i-1
j-1]
mov di
[si-1fch] ;[i-1
j+2]
add dx
[si-2] ;[i
j-1]
add di
[si+4] ;[i
j+2]
add dx
[si+1feh] ;[i+1
j-1]
add di
[si+204h] ;[i+1
j+2]
shl bp
3 ;[i
j+1] * 8
add dx
bx
add bp
ax
shr dx
4 ;Divide by 16.
add bp
bx
mov es:[si]
dx ;Store [i
j] entry.
add bp
di
add si
4 ;Affects next store operation!
shr bp
4 ;Divide by 16.
dec ch
mov es:[si-2]
bp ;Store [i
j+1] entry.
jne jloop0
add si
4 ;Skip to start of next row.
dec cl
jne iloop0
; Process the last 124 rows of the array). This requires that we switch from
; one segment to the next. Note that the segments overlap.
mov ax
DataSeg2
sub ax
40h ;Back up to last 2 rows in DS2
mov ds
ax
mov ax
DataSeg4
sub ax
40h ;Back up to last 2 rows in DS4
mov es
ax
mov cl
251-127-1 ;Remaining rows to process.
mov si
202h ;Continue with next row.
iloop1: mov ch
254/2 ;# of times through loop.
jloop1: mov dx
[si] ;[i
j]
mov bx
[si-200h] ;[i-1
j]
mov ax
dx
shl dx
3 ;[i
j] * 8
add bx
[si-1feh] ;[i-1
j+1]
mov bp
[si+2] ;[i
j+1]
add bx
[si+200h] ;[i+1
j]
add dx
bp
add bx
[si+202h] ;[i+1
j+1]
add dx
[si-202h] ;[i-1
j-1]
mov di
[si-1fch] ;[i-1
j+2]
add dx
[si-2] ;[i
j-1]
add di
[si+4] ;[i
j+2]
add dx
[si+1feh] ;[i+1
j-1]
add di
[si+204h] ;[i+1
j+2]
shl bp
3 ;[i
j+1] * 8
add dx
bx
add bp
ax
shr dx
4 ;Divide by 16
add bp
bx
mov es:[si]
dx ;Store [i
j] entry.
add bp
di
add si
4 ;Affects next store operation!
shr bp
4
dec ch
mov es:[si-2]
bp ;Store [i
j+1] entry.
jne jloop1
add si
4 ;Skip to start of next row.
dec cl
jne iloop1
mov ax
dseg
mov ds
ax
assume ds:dseg
dec Iterations
je Done0
; Unroll the iterations loop so we can move the data from DataSeg2/4 back
; to DataSeg1/3 without wasting extra time. Other than the direction of the
; data movement
this code is virtually identical to the above.
mov ax
DataSeg3
mov ds
ax
mov ax
DataSeg1
mov es
ax
mov cl
127
lea si
Data1a+202h
iloop2: mov ch
254/2
jloop2: mov dx
[si]
mov bx
[si-200h]
mov ax
dx
shl dx
3
add bx
[si-1feh]
mov bp
[si+2]
add bx
[si+200h]
add dx
bp
add bx
[si+202h]
add dx
[si-202h]
mov di
[si-1fch]
add dx
[si-2]
add di
[si+4]
add dx
[si+1feh]
add di
[si+204h]
shl bp
3
add dx
bx
add bp
ax
shr dx
4
add bp
bx
mov es:[si]
dx
add bp
di
add si
4
shr bp
4
dec ch
mov es:[si-2]
bp
jne jloop2
add si
4
dec cl
jne iloop2
mov ax
DataSeg4
sub ax
40h
mov ds
ax
mov ax
DataSeg2
sub ax
40h
mov es
ax
mov cl
251-127-1
mov si
202h
iloop3: mov ch
254/2
jloop3: mov dx
[si]
mov bx
[si-200h]
mov ax
dx
shl dx
3
add bx
[si-1feh]
mov bp
[si+2]
add bx
[si+200h]
add dx
bp
add bx
[si+202h]
add dx
[si-202h]
mov di
[si-1fch]
add dx
[si-2]
add di
[si+4]
add dx
[si+1feh]
add di
[si+204h]
shl bp
3
add dx
bx
add bp
ax
shr dx
4
add bp
bx
mov es:[si]
dx
add bp
di
add si
4
shr bp
4
dec ch
mov es:[si-2]
bp
jne jloop3
add si
4
dec cl
jne iloop3
mov ax
dseg
mov ds
ax
assume ds:dseg
dec Iterations
je Done2
jmp hloop
Done2: mov ax
DataSeg1
mov bx
DataSeg2
jmp Finish
Done0: mov ax
DataSeg3
mov bx
DataSeg4
Finish: mov ds
ax
print
byte "Writing result"
cr
lf
0
; Convert data back to byte form and write to the output file:
mov ax
dseg
mov es
ax
mov cx
32768
lea di
ImgData
xor si
si ;Output data is at offset zero.
CopyLoop3: lodsw ;Read a word from final array.
stosb ;Write a byte to output array.
dec cx
jne CopyLoop3
mov ds
bx
mov cx
(251*256) - 32768
xor si
si
CopyLoop4: lodsw ;Read final data word.
stosb ;Write data byte to output array.
dec cx
jne CopyLoop4
; Okay
write the data to the output file:
mov ah
3ch ;Create output file.
mov cx
0 ;Normal file attributes.
mov dx
dseg
mov ds
dx
lea dx
OutName
int 21h
jnc GoodCreate
print
byte "Could not create output file."
cr
lf
0
jmp Quit
GoodCreate: mov bx
ax ;File handle.
push bx
mov dx
dseg ;Where the data can be found.
mov ds
dx
lea dx
ImgData
mov cx
256*251 ;Size of data file to write.
mov ah
40h ;Write operation.
int 21h
pop bx ;Retrieve handle for close.
cmp ax
256*251 ;See if we wrote the data.
je GoodWrite
print
byte "Did not write the file properly"
cr
lf
0
jmp Quit
GoodWrite: mov ah
3eh ;Close operation.
int 21h
Quit: ExitPgm ;DOS macro to quit program.
Main endp
cseg ends
sseg segment para stack 'stack'
stk byte 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes byte 16 dup (?)
zzzzzzseg ends
end Main
Of course the absolute best way to improve the performance of any piece of code is with a better algorithm. All of the above assembly language versions were limited by a single requirement - they all must produce the same output file as the original Pascal program. Often programmers lose sight of what it is that they are trying to accomplish and get so caught up in the computations they are performing that they fail to see other possibilities. The optimization example above is a perfect example. The assembly code faithfully preserves the semantics of the original Pascal program; it computes the weighted average of all interior pixels as the sum of the eight neighbors around a pixel plus eight times the current pixel's value with the entire sum divided by 16. Now this is a good blurring function but it is not the only blurring function. A Photoshop (or other image processing program) user doesn't care about algorithms or such. When that user selects "blur image" they want it to go out of focus. Exactly how much out of focus is generally immaterial. In fact the less the better because the user can always run the blur algorithm again (or specify some number of iterations). The following assembly language program shows how to get better performance by modifying the blurring algorithm to reduce the number of instructions it needs to execute in the innermost loops. It computes blurring by averaging a pixel with the four neighbors above below to the left and to the right of the current pixel. This modification yields a program that runs 100 iterations in 2.2 seconds a 12% improvement over the previous version:
; IMGPRCS.ASM ; ; An image processing program (Fifth optimization pass). ; ; This program blurs an eight-bit grayscale image by averaging a pixel ; in the image with the eight pixels around it. The average is computed ; by (CurCell*8 + other 8 cells)/16 weighting the current cell by 50%. ; ; Because of the size of the image (almost 64K) the input and output ; matrices are in different segments. ; ; Version #1: Straight-forward translation from Pascal to Assembly. ; ; Version #2: Three major optimizations. (1) used movsd instruction rather ; than a loop to copy data from DataOut back to DataIn. ; (2) Used repeat..until forms for all loops. (3) unrolled ; the innermost two loops (which is responsible for most of ; the performance improvement). ; ; Version #3: Used registers for all variables. Set up segment registers ; once and for all through the execution of the main loop so ; the code didn't have to reload ds each time through. Computed ; index into each row only once (outside the j loop). ; ; Version #4: Eliminated copying data from DataOut to DataIn on each pass. ; Removed hazards. Maintained common subexpressions. Did some ; more loop unrolling. ; ; Version #6: Changed the blurring algorithm to use fewer computations. ; This version does *NOT* produce the same data as the other ; programs. ; ; ; Performance comparisons (66 MHz 80486 DX/2 system 100 iterations). ; ; This code- 2.2 seconds. ; 3rd optmization pass- 2.5 seconds. ; 2nd optimization pass- 4 seconds. ; 1st optimization pass- 6 seconds. ; Original ASM code- 36 seconds. ; <<Lots of deleted code here see the original program» print byte "Computing Result" cr lf 0 assume ds:InSeg es:OutSeg mov ax InSeg mov ds ax mov ax OutSeg mov es ax ; Copy the data once so we get the edges in both arrays. mov cx (251*256)/4 lea si DataIn lea di DataOut rep movsd ; "hloop" repeats once for each iteration. hloop: mov ax InSeg mov ds ax mov ax OutSeg mov es ax ; "iloop" processes the rows in the matrices. mov cl 249 iloop: mov bh cl ;i*256 mov bl 1 ;Start at j=1. mov ch 254/2 ;# of times through loop. mov si bx mov dh 0 ;Compute sum here. mov bh 0 mov ah 0 ; "jloop" processes the individual elements of the array. ; This loop has been unrolled once to allow the two portions to share ; some common computations. jloop: ; The sum of DataIn [i-1][j] + DataIn[i-1][j+1] + DataIn[i+1][j] + ; DataIn [i+1][j+1] will be used in the second half of this computation. ; So save its value in a register (di) until we need it again. mov dl DataIn[si] ;[i j] mov al DataIn[si-256] ;[I-1 j] shl dx 2 ;[i j]*4 mov bl DataIn[si-1] ;[i j-1] add dx ax mov al DataIn[si+1] ;[i j+1] add dx bx mov bl DataIn[si+256] ;[i+1 j] add dx ax shl ax 2 ;[i j+1]*4 add dx bx mov bl DataIn[si-255] ;[i-1 j+1] shr dx 3 ;Divide by 8. add ax bx mov DataOut[si] dl mov bl DataIn[si+2] ;[i j+2] mov dl DataIn[si+257] ;[i+1 j+1] add ax bx mov bl DataIn[si] ;[i j] add ax dx add ax bx shr ax 3 dec ch mov DataOut[si+1] al jne jloop dec cl jne iloop dec bp je Done ; Special case so we don't have to move the data between the two arrays. ; This is an unrolled version of the hloop that swaps the input and output ; arrays so we don't have to move data around in memory. mov ax OutSeg mov ds ax mov ax InSeg mov es ax assume es:InSeg ds:OutSeg hloop2: mov cl 249 iloop2: mov bh cl mov bl 1 mov ch 254/2 mov si bx mov dh 0 mov bh 0 mov ah 0 jloop2: mov dl DataOut[si-256] mov al DataOut[si-255] mov bl DataOut[si+257] add dx ax mov al DataOut[si+256] add dx bx mov bl DataOut[si+1] add dx ax mov al DataOut[si+255] mov di dx add dx bx mov bl DataOut[si-1] add dx ax mov al DataOut[si] add dx bx mov bl DataOut[si-257] shl ax 3 add dx bx add dx ax shr ax 3 shr dx 4 mov DataIn[si] dl mov dx di mov bl DataOut[si-254] add dx ax mov al DataOut[si+2] add dx bx mov bl DataOut[si+258] add dx ax mov al DataOut[si+1] add dx bx shl ax 3 add si 2 add dx ax mov ah 0 shr dx 4 dec ch mov DataIn[si-1] dl jne jloop2 dec cl jne iloop2 dec bp je Done2 jmp hloop ; Kludge to guarantee that the data always resides in the output segment. Done2: mov ax InSeg mov ds ax mov ax OutSeg mov es ax mov cx (251*256)/4 lea si DataIn lea di DataOut rep movsd Done: print byte "Writing result" cr lf 0 ; //Lots of delete code here see the original program//
One very important thing to keep in mind about the codein this section is that we've optimized it for 100 iterations. While it turns out that these optimizations apply equally well to more iterations this isn't necessarily true for fewer iterations. In particular if we run only one iteration any copying of data at the end of the operation will easily consume a large part of the time we save by the optimizations. Since it is very rare for a user to blur an image 100 times in a row our optimizations may not be as good as we could make them. However this section does provide a good example of the steps you must go through in order to optimize a given program. One hundred iterations was a good choice for this example because it was easy to measure the running time of all versions of the program. However you must keep in mind that you should optimize your programs for the expected case not an arbitrary case.
|
Table of Content | |
Chapter Twenty Five: Optimizing Your
Programm (Part 2)
29 SEP 1996