CHAPTER FIFTEEN: STRINGS AND CHARACTER SETS (Part 6)

15.5 - The Character Set Routines in the UCR Standard Library
15.6 - Using the String Instructions on Other Data Types
    15.6.1 - Multi-precision Integer Strings
    15.6.2 - Dealing with Whole Arrays and Records

15.5 The Character Set Routines in the UCR Standard Library
The UCR Standard Library provides an extensive collection of character set routines. These routines let you create sets, clear sets (set them to the empty set), add and remove one or more items, test for set membership, copy sets, compute the union, intersection, or difference, and extract items from a set. Although intended to manipulate sets of characters, you can use the StdLib character set routines to manipulate any set with 256 or fewer possible items.
The first unusual thing to note about the StdLib's sets is their storage format. A 256-bit array would normally consume 32 consecutive bytes. For performance reasons, the UCR Standard Library's set format packs eight separate sets into 272 bytes (256 bytes for the eight sets plus 16 bytes overhead). To declare set variables in your data segment, you should use the set macro. This macro takes the form:
                set     SetName1, SetName2, ..., SetName8
SetName1..SetName8 represent the names of up to eight set variables. You may have fewer than eight names in the operand field, but doing so will waste some bits in the set array.
The CreateSets routine provides another mechanism for creating set variables. Unlike the set macro, which you would use to create set variables in your data segment, the CreateSets routine allocates storage for up to eight sets dynamically at run time. It returns a pointer to the first set variable in es:di. The remaining seven sets follow at locations es:di+1, es:di+2, ..., es:di+7. A typical program that allocates set variables dynamically might use the following code:
Set0            dword   ?
Set1            dword   ?
Set2            dword   ?
Set3            dword   ?
Set4            dword   ?
Set5            dword   ?
Set6            dword   ?
Set7            dword   ?
                 .
                 .
                 .
                CreateSets
                mov     word ptr Set0+2, es
                mov     word ptr Set1+2, es
                mov     word ptr Set2+2, es
                mov     word ptr Set3+2, es
                mov     word ptr Set4+2, es
                mov     word ptr Set5+2, es
                mov     word ptr Set6+2, es
                mov     word ptr Set7+2, es
                mov     word ptr Set0, di
                inc     di
                mov     word ptr Set1, di
                inc     di
                mov     word ptr Set2, di
                inc     di
                mov     word ptr Set3, di
                inc     di
                mov     word ptr Set4, di
                inc     di
                mov     word ptr Set5, di
                inc     di
                mov     word ptr Set6, di
                inc     di
                mov     word ptr Set7, di
                inc     di
This code segment creates eight different sets on the heap, all empty, and stores pointers to them in the appropriate pointer variables.
The SHELL.ASM file provides a commented-out line of code in the data segment that includes the file STDSETS.A. This include file provides the bit definitions for eight commonly used character sets. They are alpha (upper and lower case alphabetics), lower (lower case alphabetics), upper (upper case alphabetics), digits ("0".."9"), xdigits ("0".."9", "A".."F", and "a".."f"), alphanum (upper and lower case alphabetics plus the digits), whitespace (space, tab, carriage return, and line feed), and delimiters (whitespace plus commas, semicolons, less than, greater than, and vertical bar). If you would like to use these standard character sets in your program, you need to remove the semicolon from the beginning of the include statement in the SHELL.ASM file.
The UCR Standard Library provides 16 character set routines: CreateSets, EmptySet, RangeSet, AddStr, AddStrl, RmvStr, RmvStrl, AddChar, RmvChar, Member, CopySet, SetUnion, SetIntersect, SetDifference, NextItem, and RmvItem. All of these routines except CreateSets require a pointer to a character set variable in the es:di registers. Specific routines may require other parameters as well.
The EmptySet routine clears all the bits in a set, producing the empty set. This routine requires the address of the set variable in es:di. The following example clears the set pointed at by Set1:
                les     di, Set1
                EmptySet
RangeSet unions in a range of values into the set variable pointed at by es:di. The al register contains the lower bound of the range of items; ah contains the upper bound. Note that al must be less than or equal to ah. The following example constructs the set of all control characters (ASCII codes one through 31; the null character [ASCII code zero] is not allowed in sets):
                les     di, CtrlCharSet ;Ptr to ctrl char set.
                mov     al, 1
                mov     ah, 31
                RangeSet
AddStr and AddStrl add all the characters in a zero terminated string to a character set. For AddStr, the dx:si register pair points at the zero terminated string. For AddStrl, the zero terminated string follows the call to AddStrl in the code stream. These routines union each character of the specified string into the set. The following examples add the digits and some special characters into the FPDigits set:
Digits          byte    "0123456789",0
                set     FPDigitsSet
FPDigits        dword   FPDigitsSet
                 .
                 .
                 .
                ldxi    Digits          ;Loads DX:SI with adrs of Digits.
                les     di, FPDigits
                AddStr
                 .
                 .
                 .
                les     di, FPDigits
                AddStrl
                byte    "Ee.+-",0
RmvStr and RmvStrl remove characters from a set. You supply the characters in a zero terminated string. For RmvStr, dx:si points at the string of characters to remove from the set. For RmvStrl, the zero terminated string follows the call. The following example uses RmvStrl to remove the special symbols from FPDigits above:
                les     di, FPDigits
                RmvStrl
                byte    "Ee.+-",0
The AddChar and RmvChar routines let you add or remove individual characters. As usual, es:di points at the set; the al register contains the character you wish to add to the set or remove from the set. The following example adds a space to the set FPDigits and removes the "," character (if present):
                les     di, FPDigits
                mov     al, ' '
                AddChar
                 .
                 .
                 .
                les     di, FPDigits
                mov     al, ','
                RmvChar
The Member function checks to see if a character is in a set. On entry, es:di must point at the set and al must contain the character to check. On exit, the zero flag is set if the character is a member of the set; the zero flag will be clear if the character is not in the set. The following example reads characters from the keyboard until the user presses a key that is not a whitespace character:
SkipWS:         get                     ;Read char from user into AL.
                lesi    WhiteSpace      ;Address of WS set into es:di.
                member
                je      SkipWS
The CopySet, SetUnion, SetIntersect, and SetDifference routines all operate on two sets of characters. The es:di register pair points at the destination character set; the dx:si register pair points at a source character set. CopySet copies the bits from the source set to the destination set, replacing the original bits in the destination set. SetUnion computes the union of the two sets and stores the result into the destination set. SetIntersect computes the set intersection and stores the result into the destination set. Finally, the SetDifference routine computes DestSet := DestSet - SrcSet.
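As a rough sketch of how the two-set routines combine, the fragment below derives the hexadecimal letters from the standard sets. It assumes the STDSETS.A include line in SHELL.ASM has been enabled and that HexLetters (a hypothetical name) was declared with the set macro. It reloads es:di and dx:si before every call because the text above does not say whether the routines preserve them.

```asm
; Assumed declaration in the data segment (HexLetters is hypothetical):
;               set     HexLetters

                lesi    HexLetters      ;ES:DI -> destination set
                ldxi    xdigits         ;DX:SI -> source set
                CopySet                 ;HexLetters := xdigits

                lesi    HexLetters
                ldxi    digits
                SetDifference           ;HexLetters := HexLetters - digits
                                        ; ("A".."F" and "a".."f" remain)

                lesi    HexLetters
                ldxi    upper
                SetIntersect            ;Keep only the upper case letters
                                        ; ("A".."F" remain)
```

Copying into a scratch set first leaves the standard sets untouched, since every one of these routines overwrites its destination.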
The NextItem and RmvItem routines
let you extract elements from a set. NextItem returns in al the ASCII code of
the first character it finds in a set. RmvItem does the same thing except it
also removes the character from the set. These routines return zero in al if
the set is empty (StdLib sets cannot contain the NULL character). You can use the RmvItem
routine to build a rudimentary iterator for a character set.
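A minimal sketch of such an iterator follows. SetPtr is a hypothetical dword variable holding the address of a set, and putc is the StdLib routine that prints the character in al. The loop reloads es:di on every pass rather than assuming RmvItem preserves it.

```asm
; SetPtr (hypothetical) holds the far address of the set to walk.
; Each pass removes one item, so the set is empty when the loop ends.

PrintLoop:      les     di, SetPtr      ;ES:DI -> the set
                RmvItem                 ;AL := an item, now removed from the set
                cmp     al, 0           ;Zero means the set is empty
                je      PrintDone
                putc                    ;Print the character in AL
                jmp     PrintLoop
PrintDone:
```

Because the iteration destroys the set, a program that needs the set afterward could first CopySet it into a scratch set and iterate over the copy.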
The UCR Standard Library's character set routines are very powerful. With them, you can easily manipulate character string data, especially when searching for different patterns within a string. We will consider these routines again when we study pattern matching later in this text.
15.6 Using the String Instructions on Other Data Types
The string instructions work with other data types besides character strings. You can use the string instructions to copy whole arrays from one variable to another, to initialize large data structures to a single value, or to compare entire data structures for equality or inequality. Anytime you're dealing with data structures containing several bytes, you may be able to use the string instructions.
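For instance, initializing a block of memory to a single value might look like the following sketch. Buffer and BufWords are hypothetical names, and the code assumes Buffer lives in the segment addressed by ds.

```asm
; Hypothetical declarations in the data segment:
; BufWords      =       256
; Buffer        word    BufWords dup (?)

                mov     ax, ds
                mov     es, ax          ;stos writes through ES:DI
                lea     di, Buffer
                mov     cx, BufWords    ;Number of words to fill
                mov     ax, 0FFFFh      ;Fill value
                cld                     ;Move from low to high addresses
        rep     stosw                   ;Store AX into CX consecutive words
```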
15.6.1 Multi-precision Integer Strings
The cmps instruction is useful for comparing (very) large integer values. Unlike character strings, we cannot compare integers with cmps from the L.O. byte through the H.O. byte. Instead, we must compare them from the H.O. byte down to the L.O. byte. The following code compares two 12-byte integers:
                lea     di, integer1+10 ;ES:DI -> H.O. word of integer1
                lea     si, integer2+10 ;DS:SI -> H.O. word of integer2
                mov     cx, 6           ;12 bytes = 6 words
                std                     ;Work from the H.O. word downward
        repe    cmpsw
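After the execution of the cmpsw instruction, the flags will contain the result of the comparison. Because cmps compares the operand at ds:si against the operand at es:di, the flags here describe integer2 relative to integer1; for unsigned values, test them with the unsigned branches (ja, jb, je). Since std leaves the direction flag set, execute cld afterward if the code that follows expects the string instructions to increment their pointers.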
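A hedged sketch of a complete three-way unsigned comparison appears below. It assumes both integers live in the segment addressed by ds, and it places integer1 at ds:si so that the branches read naturally as "integer1 versus integer2". The labels Int1Bigger and Int1Smaller are hypothetical and would be defined elsewhere in the program.

```asm
                mov     ax, ds
                mov     es, ax          ;cmps reads its second operand via ES:DI
                lea     si, integer1+10 ;DS:SI -> H.O. word of integer1
                lea     di, integer2+10 ;ES:DI -> H.O. word of integer2
                mov     cx, 6           ;12 bytes = 6 words
                std                     ;Work from the H.O. word downward
        repe    cmpsw                   ;Stops after the first differing pair
                cld                     ;Restore DF; cld changes no other flag
                ja      Int1Bigger      ;integer1 > integer2 (unsigned)
                jb      Int1Smaller     ;integer1 < integer2 (unsigned)
                                        ;Fall through: the integers are equal
```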
You can easily assign one long integer string to another using the movs instruction. Nothing tricky here: just load up the si, di, and cx registers and have at it. You must do other operations, including arithmetic and logical operations, using the extended precision methods described in the chapter on arithmetic operations.
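As a short illustration, copying one 12-byte integer to another could be written as follows, assuming both variables are in the segment addressed by ds:

```asm
                mov     ax, ds
                mov     es, ax          ;movs writes through ES:DI
                lea     si, integer1    ;Source
                lea     di, integer2    ;Destination
                mov     cx, 6           ;12 bytes = 6 words
                cld                     ;Direction does not matter for a copy
                                        ; between separate variables
        rep     movsw                   ;integer2 := integer1
```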
15.6.2 Dealing with Whole Arrays and Records
The only operations that apply, in general, to all array and record structures are assignment and comparison (for equality/inequality only). You can use the movs and cmps instructions for these operations.
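The equality test might be sketched as shown here. Array1, Array2, ArrayBytes, and the label ArraysDiffer are hypothetical; ArrayBytes is the size of each array in bytes and must be nonzero, because repe cmpsb leaves the flags untouched when cx is zero on entry.

```asm
                mov     ax, ds
                mov     es, ax          ;Assumes both arrays live in DS
                lea     si, Array1
                lea     di, Array2
                mov     cx, ArrayBytes  ;Size of each array, in bytes (> 0)
                cld
        repe    cmpsb                   ;Stops after the first mismatch
                jne     ArraysDiffer    ;ZF clear: some byte pair differed
                                        ;Fall through: the arrays are equal
```

A byte-by-byte comparison answers only "equal or not"; which operand is "greater" has no general meaning for arrays and records.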
Operations such as scalar addition, transposition, etc., may be easily synthesized using the lods and stos instructions. The following code shows how you can easily add the value 20 to each element of the integer array A:
                lea     si, A
                mov     di, si          ;Read and write the same array (needs ES = DS)
                mov     cx, SizeOfA     ;Number of elements (words) in A, not bytes
                cld
AddLoop:        lodsw
                add     ax, 20
                stosw
                loop    AddLoop
You can implement other operations in a similar fashion.
28 SEP 1996