|
Table of Content | Chapter Fifteen (Part 5) |
| CHAPTER
FIFTEEN: STRINGS AND CHARACTER SETS (Part 4) |
|
| 15.3 -
Character String Functions 15.3.1 - Substr 15.3.2 - Index 15.3.3 - Repeat |
15.3.4
- Insert 15.3.5 - Delete 15.3.6 - Concatenation |
| 15.3 Character String Functions | |
Most high level languages like Pascal BASIC "C" and PL/I provide several string functions and procedures (either built into the language or as part of a standard library). Other than the five string operations provided above the 80x86 doesn't support any string functions. Therefore if you need a particular string function you'll have to write it yourself. The following sections describe many of the more popular string functions and how to implement them in assembly language.
The Substr (substring) function copies a
portion of one string to another. In a high level language
this function usually takes
the form:
DestStr := Substr(SrcStr Index Length);
where:
DestStr is the name of the string variable
where you want to store the substring
SrcStr is the name of the source string (from
which the substring is to be taken)
Index is the starting character position within
the string (1..length(SrcStr))
and DestStr.
The following examples show how Substr works.
SrcStr := 'This is an example of a string'; DestStr := Substr(SrcStr 11 7); write(DestStr);
This prints 'example'. The index value is eleven
so
the Substr
function will begin copying data starting at the eleventh character in the string. The
eleventh character is the 'e' in 'example'. The length of the string is seven.
This invocation copies the seven characters 'example' to DestStr.
SrcStr := 'This is an example of a string'; DestStr := Substr(SrcStr 1 10); write(DestStr);
This prints 'This is an'. Since the index is one
this
occurrence of the Substr function starts copying 10 characters starting with
the first character in the string.
SrcStr := 'This is an example of a string'; DestStr := Substr(SrcStr 20 11); write(DestStr);
This prints 'of a string'. This call to Substr
extracts the last eleven characters in the string.
What happens if the index and length values are out of
bounds? For example
what happens if Index is zero or is greater than the
length of the string? What happens if Index is fine
but the sum of Index
and Length is greater than the length of the source string? You can handle
these abnormal situations in one of three ways: (1)ignore the possibility of error;
(2)abort the program with a run-time error; (3)process some reasonable number of
characters in response to the request.
The first solution operates under the assumption that the
caller never makes a mistake computing the values for the parameters to the Substr
function. It blindly assumes that the values passed to the Substr function
are correct and processes the string based on that assumption. This can produce some
bizarre effects. Consider the following examples
which use length-prefixed strings:
SourceStr :='1234567890ABCDEFGHIJKLMNOPQRSTUVWXYZ';
DestStr := Substr(SourceStr
0
5);
Write('DestStr');
prints '$1234'. The reason
of course
is that SourceStr
is a length-prefixed string. Therefore the length
36
appears at offset zero within the
string. If Substr uses the illegal index of zero then the length of the
string will be returned as the first character. In this particular case
the length of the
string
36
just happened to correspond to the ASCII code for the '$' character.
The situation is considerably worse if the value specified
for Index is negative or is greater than the length of the string. In such a
case
the Substr function would be returning a substring containing
characters appearing before or after the source string. This is not a reasonable result.
Despite the problems with ignoring the possibility of error
in the Substr function
there is one big advantage to processing substrings
in this manner: the resulting Substr code is more efficient if it doesn't
have to perform any run-time checking on the data. If you know that the index and length
values are always within an acceptable range
then there is no need to do this checking
within Substr function. If you can guarantee that an error will not occur
your programs will run (somewhat) faster by eliminating the run-time check.
Since most programs are rarely error-free
you're taking a
big gamble if you assume that all calls to the Substr routine are passing
reasonable values. Therefore
some sort of run-time check is often necessary to catch
errors in your program. An error occurs under the following conditions:
Index) is less than one. Index is greater than the length of the string.
Substr length parameter (Length)
is greater than the length of the string. Index and Length is
greater than the length of the string. An alternative to ignoring any of these errors is to abort
with an error message. This is probably fine during the program development phase
but
once your program is in the hands of users it could be a real disaster. Your customers
wouldn't be very happy if they'd spent all day entering data into a program and it
aborted
causing them to lose the data they've entered. An alternative to aborting when an
error occurs is to have the Substr function return an error condition. Then
leave it up to the calling code to determine if an error has occurred. This technique
works well with the third alternative to handling errors: processing the substring as best
you can.
The third alternative handling the error as best you can is probably the best alternative. Handle the error conditions in the following manner:
Index) is less than one.
There are two ways to handle this error condition. One way is to automatically set the Index
parameter to one and return the substring beginning with the first character of the source
string. The other alternative is to return the empty string
a string of length zero
as
the substring. Variations on this theme are also possible. You might return the substring
beginning with the first character if the index is zero and an empty string if the index
is negative. Another alternative is to use unsigned numbers. Then you've only got to worry
about the case where Index is zero. A negative number
should the calling
code accidentally generate one
would look like a large positive number. Substr function should return an empty string.
Intuitively
this is the proper response in this situation. Substr length parameter (Length)
is greater than the length of the string. -or- Index and Length is
greater than the length of the string. Points three and four are the same problem
the
length of the desired substring extends beyond the end of the source string. In this
event
Substr should return the substring consisting of those characters
starting at Index through the end of the source string. The following code for the Substr function
expects four parameters: the addresses of the source and destination strings
the starting
index
and the length of the desired substring. Substr expects the parameters
in the following registers:
ds:si- The address of the source string.
es:di- The address of the destination string.
ch- The starting index.
cl- The length of the substring.
Substr returns the following values:
es:di. Substr clears the carry flag if there were no
errors. Substr sets the carry flag if there was an error. Substr preserves all the registers. If an error occurs
then the calling code must examine the
values in si
di and cx to determine the exact cause of the
error (if this is necessary). In the event of an error
the Substr function
returns the following substrings:
Index parameter (ch) is
zero
Substr uses one instead. Index and Length parameters
are both unsigned byte values
therefore they are never negative. Index parameter is greater than the
length of the source string
Substr returns an empty string. Index and Length
parameters is greater than the length of the source string
Substr returns
only those characters from Index through the end of the source string. The
following code realizes the substring function. ; Substring function. ; ; HLL form: ; ;procedure substring(var Src:string; ; Index Length:integer; ; var Dest:string); ; ; Src- Address of a source string. ; Index- Index into the source string. ; Length- Length of the substring to extract. ; Dest- Address of a destination string. ; ; Copies the source string from address [Src+index] of length ; Length to the destination string. ; ; If an error occurs the carry flag is returned set otherwise ; clear. ; ; Parameters are passed as follows: ; ; DS:SI- Source string address. ; ES:DI- Destination string address. ; CH- Index into source string. ; CL- Length of source string. ; ; Note: the strings pointed at by the SI and DI registers are ; length-prefixed strings. That is the first byte of each ; string contains the length of that string. Substring proc near push ax push cx push di push si clc ;Assume no error. pushf ;Save direction flag status. ; Check the validity of the parameters. cmp ch [si] ;Is index beyond the length of ja ReturnEmpty ; the source string? mov al ch ;See if the sum of index and dec al ; length is beyond the end of the add al cl ; string. jc TooLong ;Error if > 255. cmp al [si] ;Beyond the length of the source? jbe OkaySoFar ; If the substring isn't completely contained within the source ; string truncate it: TooLong: popf stc ;Return an error flag. pushf mov al [si] ;Get maximum length. sub al ch ;Subtract index value. inc al ;Adjust as appropriate. mov cl al ;Save as new length. OkaySoFar: mov es:[di] cl ;Save destination string length. inc di mov al ch ;Get index into source. mov ch 0 ;Zero extend length value into CX. mov ah 0 ;Zero extend index into AX. add si ax ;Compute address of substring. cld rep movsb ;Copy the substring. popf SubStrDone: pop si pop di pop cx pop ax ret ; Return an empty string here: ReturnEmpty: mov byte ptr es:[di] 0 popf stc jmp SubStrDone SubString endp
The Index string function searches for the
first occurrence of one string within another and returns the offset to that occurrence.
Consider the following HLL form:
SourceStr := 'Hello world'; TestStr := 'world'; I := INDEX(SourceStr TestStr);
The Index function scans through the source
string looking for the first occurrence of the test string. If found
it returns the index
into the source string where the test string begins. In the example above
the Index
function would return seven since the substring 'world' starts at the seventh
character position in the source string.
The only possible error occurs if Index cannot
find the test string in the source string. In such a situation
most implementations
return zero. Our version will do likewise. The Index function which follows
operates in the following fashion:
1) It compares the length of the test string to the length
of the source string. If the test string is longer
Index immediately returns
zero since there is no way the test string will be found in the source string in this
situation.
2) The index function operates as follows:
i := 1; while (i < (length(source)-length(test)) and test <> substr(source i length(test)) do i := i+1;
When this loop terminates if (i < length(source)-length(test)) then it contains the index into source where test begins. Otherwise test is not a substring of source. Using the previous example this loop compares test to source in the following manner:
i=1 test: world No match source: Hello world i=2 test: world No match source: Hello world i=3 test: world No match source: Hello world i=4 test: world No match source: Hello world i=5 test: world No match source: Hello world i=6 test: world No match source: Hello world i=7 test: world Match source: Hello world
There are (algorithmically) better ways to do this
comparison[7]
however
the algorithm above lends itself to the
use of 80x86 string instructions and is very easy to understand. Index's code
follows:
; INDEX- computes the offset of one string within another. ; ; On entry: ; ; ES:DI- Points at the test string that INDEX will search for ; in the source string. ; DS:SI- Points at the source string which (presumably) ; contains the string INDEX is searching for. ; ; On exit: ; ; AX- Contains the offset into the source string where the ; test string was found. INDEX proc near push si push di push bx push cx pushf ;Save direction flag value. cld mov al es:[di] ;Get the length of the test string. cmp al [si] ;See if it is longer than the length ja NotThere ; of the source string. ; Compute the index of the last character we need to compare the ; test string against in the source string. mov al es:[di] ;Length of test string. mov cl al ;Save for later. mov ch 0 sub al [si] ;Length of source string. mov bl al ;# of times to repeat loop. inc di ;Skip over length byte. xor ax ax ;Init index to zero. CmpLoop: inc ax ;Bump index by one. inc si ;Move on to the next char in source. push si ;Save string pointers and the push di ; length of the test string. push cx rep cmpsb ;Compare the strings. pop cx ;Restore string pointers pop di ; and length. pop si je Foundindex ;If we found the substring. dec bl jnz CmpLoop ;Try next entry in source string. ; If we fall down here the test string doesn't appear inside the ; source string. NotThere: xor ax ax ;Return INDEX = 0 ; If the substring was found in the loop above remove the ; garbage left on the stack FoundIndex: popf pop cx pop bx pop di pop si ret INDEX endp
The Repeat string function expects three
parameters- the address of a string
a length
and a character. It constructs a string of
the specified length containing "length" copies of the specified character. For
example
Repeat(STR
5
'*') stores the string '*****' into the STR
string variable. This is a very easy string function to write
thanks to the stosb instruction:
; REPEAT- Constructs a string of length CX where each element ; is initialized to the character passed in AL. ; ; On entry: ; ; ES:DI- Points at the string to be constructed. ; CX- Contains the length of the string. ; AL- Contains the character with which each element of ; the string is to be initialized. REPEAT proc near push di push ax push cx pushf ;Save direction flag value. cld mov es:[di] cl ;Save string length. mov ch 0 ;Just in case. inc di ;Start string at next location. rep stosb popf pop cx pop ax pop di ret REPEAT endp
The Insert string function inserts one string
into another. It expects three parameters
a source string
a destination string
and an
index. Insert inserts the source string into the destination string starting
at the offset specified by the index parameter. HLLs usually call the Insert procedure
as follows:
source := ' there'; dest := 'Hello world'; INSERT(source dest 6);
The call to Insert above would change source
to contain the string 'Hello there world'. It does this by inserting the string ' there'
before the sixth character in 'Hello world'.
The insert procedure using the following algorithm:
Insert(Src
dest
index);
1) Move the characters from location dest+index
through the end of the destination string length (Src) bytes up
in memory.
2) Copy the characters from the Src string to
location dest+index.
3) Adjust the length of the destination string so that it is the sum of the destination and source lengths. The following code implements this algorithm:
; INSERT- Inserts one string into another. ; ; On entry: ; ; DS:SI Points at the source string to be inserted ; ; ES:DI Points at the destination string into which the source ; string will be inserted. ; ; DX Contains the offset into the destination string where the ; source string is to be inserted. ; ; ; All registers are preserved. ; ; Error condition- ; ; If the length of the newly created string is greater than 255 ; the insert operation will not be performed and the carry flag ; will be returned set. ; ; If the index is greater than the length of the destination ; string ; then the source string will be appended to the end of the destin- ; ation string. INSERT proc near push si push di push dx push cx push bx push ax clc ;Assume no error. pushf mov dh 0 ;Just to be safe. ; First see if the new string will be too long. mov ch 0 mov ah ch mov bh ch mov al es:[di] ;AX = length of dest string. mov cl [si] ;CX = length of source string. mov bl al ;BX = length of new string. add bl cl jc TooLong ;Abort if too long. mov es:[di] bl ;Update length. ; See if the index value is too large: cmp dl al jbe IndexIsOK mov dl al IndexIsOK: ; Now make room for the string that's about to be inserted. push si ;Save for later. push cx mov si di ;Point SI at the end of current add si ax ; destination string. add di bx ;Point DI at the end of new str. std rep movsb ;Open up space for new string. ; Now copy the source string into the space opened up. pop cx pop si add si cx ;Point at end of source string. rep movsb jmp INSERTDone TooLong: popf stc pushf INSERTDone: popf pop ax pop bx pop cx pop dx pop di pop si ret INSERT endp
The Delete string removes characters from a
string. It expects three parameters - the address of a string
an index into that string
and the number of characters to remove from that string. A HLL call to Delete usually
takes the form:
Delete(Str index length);
For example
Str := 'Hello there world'; Delete(str 7 6);
This call to Delete will leave str containing
'Hello world'. The algorithm for the delete operation is the following:
1) Subtract the length parameter value from the length of the destination string and update the length of the destination string with this new value.
2) Copy any characters following the deleted substring over the top of the deleted substring.
There are a couple of errors that may occur when using the
delete procedure. The index value could be zero or larger than the size of the specified
string. In this case
the Delete procedure shouldn't do anything to the
string. If the sum of the index and length parameters is greater than the length of the
string
then the Delete procedure should delete all the characters to the end
of the string. The following code implements the Delete procedure:
; DELETE - removes some substring from a string. ; ; On entry: ; ; DS:SI Points at the source string. ; DX Index into the string of the start of the substring ; to delete. ; CX Length of the substring to be deleted. ; ; Error conditions- ; ; If DX is greater than the length of the string then the ; operation is aborted. ; ; If DX+CX is greater than the length of the string DELETE only ; deletes those characters from DX through the end of the string. DELETE proc near push es push si push di push ax push cx push dx pushf ;Save direction flag. mov ax ds ;Source and destination strings mov es ax ; are the same. mov ah 0 mov dh ah ;Just to be safe. mov ch ah ; See if any error conditions exist. mov al [si] ;Get the string length cmp dl al ;Is the index too big? ja TooBig mov al dl ;Now see if INDEX+LENGTH add al cl ;is too large jc Truncate cmp al [si] jbe LengthIsOK ; If the substring is too big truncate it to fit. Truncate: mov cl [si] ;Compute maximum length sub cl dl inc cl ; Compute the length of the new string. LengthIsOK: mov al [si] sub al cl mov [si] al ; Okay now delete the specified substring. add si dx ;Compute address of the substring mov di si ; to be deleted and the address of add di cx ; the first character following it. cld rep movsb ;Delete the string. TooBig: popf pop dx pop cx pop ax pop di pop si pop es ret DELETE endp
The concatenation operation takes two strings and appends
one to the end of the other. For example
Concat('Hello '
'world') produces
the string 'Hello world'. Some high level languages treat concatenation as a function
call
others as a procedure call. Since in assembly language everything is a procedure
call anyway
we'll adopt the procedural syntax. Our Concat procedure will
take the following form:
Concat(source1 source2 dest);
This procedure will copy source1 to dest
then it will concatenate source2 to the end of dest. Concat
follows:
; Concat- Copies the string pointed at by SI to the string ; rointed at by DI and then concatenates the string; ; pointed at by BX to the destination string. ; ; On entry- ; ; DS:SI- Points at the first source string ; DS:BX- Points at the second source string ; ES:DI- Points at the destination string. ; ; Error condition- ; ; The sum of the lengths of the two strings is greater than 255. ; In this event the second string will be truncated so that the ; entire string is less than 256 characters in length. CONCAT proc near push si push di push cx push ax pushf ; Copy the first string to the destination string: mov al [si] mov cl al mov ch 0 mov ah ch add al [bx] ;Compute the sum of the string's adc ah 0 ; lengths. cmp ax 256 jb SetNewLength mov ah [si] ;Save original string length. mov al 255 ;Fix string length at 255. SetNewLength: mov es:[di] al ;Save new string length. inc di ;Skip over length bytes. inc si rep movsb ;Copy source1 to dest string. ; If the sum of the two strings is too long the second string ; must be truncated. mov cl [bx] ;Get length of second string. cmp ax 256 jb LengthsAreOK mov cl ah ;Compute truncated length. neg cl ;CL := 256-Length(Str1). LengthsAreOK: lea si 1[bx] ;Point at second string and ; ; skip the string length. cld rep movsb ;Perform the concatenation. popf pop ax pop cx pop di pop si ret CONCAT endp
[7] The interested reader should look up the Knuth-Morris-Pratt algorithm in "Data Structure Techniques" by Thomas A. Standish. The Boyer-Moore algorithm is another fast string search routine although somewhat more complex.
|
Table of Content | Chapter Fifteen (Part 5) |
Chapter Fifteen: Strings And
Character Sets (Part 4)
28 SEP 1996