The Art of
ASSEMBLY LANGUAGE PROGRAMMING

CHAPTER FIFTEEN:
STRINGS AND CHARACTER SETS (Part 6)
15.5 - The Character Set Routines in the UCR Standard Library
15.6 - Using the String Instructions on Other Data Types
15.6.1 - Multi-precision Integer Strings
15.6.2 - Dealing with Whole Arrays and Records
15.5 The Character Set Routines in the UCR Standard Library

The UCR Standard Library provides an extensive collection of character set routines. These routines let you create sets, clear sets (set them to the empty set), add and remove one or more items, test for set membership, copy sets, compute the union, intersection, or difference, and extract items from a set. Although they are intended for sets of characters, you can use the StdLib character set routines to manipulate any set with 256 or fewer possible items.

The first unusual thing to note about the StdLib's sets is their storage format. A 256-bit array would normally consume 32 consecutive bytes. For performance reasons, the UCR Standard Library's set format packs eight separate sets into 272 bytes (256 bytes for the eight sets plus 16 bytes of overhead). To declare set variables in your data segment, you should use the set macro. This macro takes the form:

                set     SetName1, SetName2, ..., SetName8

SetName1..SetName8 represent the names of up to eight set variables. You may have fewer than eight names in the operand field, but doing so will waste some bits in the set array.
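To see why one declaration covers up to eight sets, it can help to model the packed format in C. The sketch below assumes that set number k owns bit k of each of the 256 data bytes, so a membership test is one indexed load plus one mask. The text does not describe the layout of the 16 overhead bytes, so they are left out. The names PackedSets, packed_add, and packed_member are hypothetical; they are not StdLib routines.

```c
#include <stdint.h>

/* Hypothetical model: eight sets share 256 bytes; set k owns bit k of every byte. */
typedef struct {
    uint8_t bits[256];              /* indexed by character code */
} PackedSets;

/* Add character ch to set number k (0..7). */
static void packed_add(PackedSets *p, unsigned k, uint8_t ch)
{
    p->bits[ch] |= (uint8_t)(1u << k);
}

/* Return nonzero if ch is a member of set number k (0..7). */
static int packed_member(const PackedSets *p, unsigned k, uint8_t ch)
{
    return (p->bits[ch] >> k) & 1u;
}
```

Under this assumption, declaring only three sets leaves bits 3 through 7 of every byte unused, which is the waste mentioned above.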

The CreateSets routine provides another mechanism for creating set variables. Unlike the set macro, which you would use to create set variables in your data segment, the CreateSets routine allocates storage for up to eight sets dynamically at run time. It returns a pointer to the first set variable in es:di. The remaining seven sets follow at locations es:di+1, es:di+2, ..., es:di+7. A typical program that allocates set variables dynamically might use the following code:

Set0            dword   ?
Set1            dword   ?
Set2            dword   ?
Set3            dword   ?
Set4            dword   ?
Set5            dword   ?
Set6            dword   ?
Set7            dword   ?
                 .
                 .
                 .
                CreateSets                      ;Ptr to first set in es:di.
                mov     word ptr Set0+2, es     ;All eight sets share a segment.
                mov     word ptr Set1+2, es
                mov     word ptr Set2+2, es
                mov     word ptr Set3+2, es
                mov     word ptr Set4+2, es
                mov     word ptr Set5+2, es
                mov     word ptr Set6+2, es
                mov     word ptr Set7+2, es

                mov     word ptr Set0, di       ;Offsets are di, di+1, ..., di+7.
                inc     di
                mov     word ptr Set1, di
                inc     di
                mov     word ptr Set2, di
                inc     di
                mov     word ptr Set3, di
                inc     di
                mov     word ptr Set4, di
                inc     di
                mov     word ptr Set5, di
                inc     di
                mov     word ptr Set6, di
                inc     di
                mov     word ptr Set7, di
                inc     di

This code segment creates eight different sets on the heap, all empty, and stores pointers to them in the appropriate pointer variables.

The SHELL.ASM file provides a commented-out line of code in the data segment that includes the file STDSETS.A. This include file provides the bit definitions for eight commonly used character sets. They are alpha (upper and lower case alphabetics), lower (lower case alphabetics), upper (upper case alphabetics), digits ("0".."9"), xdigits ("0".."9", "A".."F", and "a".."f"), alphanum (upper and lower case alphabetics plus the digits), whitespace (space, tab, carriage return, and line feed), and delimiters (whitespace plus commas, semicolons, less than, greater than, and vertical bar). If you would like to use these standard character sets in your program, you need to remove the semicolon from the beginning of the include statement in the SHELL.ASM file.

The UCR Standard Library provides 16 character set routines: CreateSets, EmptySet, RangeSet, AddStr, AddStrl, RmvStr, RmvStrl, AddChar, RmvChar, Member, CopySet, SetUnion, SetIntersect, SetDifference, NextItem, and RmvItem. All of these routines except CreateSets require a pointer to a character set variable in the es:di registers. Specific routines may require other parameters as well.

The EmptySet routine clears all the bits in a set, producing the empty set. This routine requires the address of the set variable in es:di. The following example clears the set pointed at by Set1:

                les     di, Set1
                EmptySet

RangeSet unions a range of values into the set variable pointed at by es:di. The al register contains the lower bound of the range of items; ah contains the upper bound. Note that al must be less than or equal to ah. The following example constructs the set of all control characters (ASCII codes one through 31; the null character [ASCII code zero] is not allowed in sets):

                les     di, CtrlCharSet         ;Ptr to ctrl char set.
                mov     al, 1
                mov     ah, 31
                RangeSet
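As a rough model of what "unions a range into the set" means, the following C sketch sets the bits for every code from the lower bound through the upper bound and leaves existing members alone. It assumes a plain 32-byte bit array rather than StdLib's packed format, and the names CharSet, range_set, and is_member are hypothetical.

```c
#include <stdint.h>

typedef struct { uint8_t bits[32]; } CharSet;   /* 256 bits, one per character code */

/* Union the codes lo..hi (inclusive, lo <= hi) into s; existing members are kept. */
static void range_set(CharSet *s, uint8_t lo, uint8_t hi)
{
    /* c is wider than a byte so the loop still terminates when hi is 255. */
    for (unsigned c = lo; c <= hi; c++)
        s->bits[c >> 3] |= (uint8_t)(1u << (c & 7));
}

/* Return nonzero if character code c is in s. */
static int is_member(const CharSet *s, uint8_t c)
{
    return (s->bits[c >> 3] >> (c & 7)) & 1u;
}
```

Calling range_set(&ctrl, 1, 31) on an empty set mirrors the control character example: codes 1 through 31 become members, while 0 and 32 do not.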

AddStr and AddStrl add all the characters in a zero terminated string to a character set. For AddStr, the dx:si register pair points at the zero terminated string. For AddStrl, the zero terminated string follows the call to AddStrl in the code stream. These routines union each character of the specified string into the set. The following examples add the digits and some special characters into the FPDigits set:

Digits          byte    "0123456789",0
                set     FPDigitsSet
FPDigits        dword   FPDigitsSet
                 .
                 .
                 .
                ldxi    Digits          ;Loads DX:SI with adrs of Digits.
                les     di, FPDigits
                AddStr
                 .
                 .
                 .
                les     di, FPDigits
                AddStrL
                byte    "Ee.+-",0

RmvStr and RmvStrl remove characters from a set. You supply the characters in a zero terminated string. For RmvStr, dx:si points at the string of characters to remove from the set. For RmvStrl, the zero terminated string follows the call. The following example uses RmvStrl to remove the special symbols from FPDigits above:

                les     di, FPDigits
                RmvStrl
                byte    "Ee.+-",0
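The effect of adding and removing a zero terminated string can be modeled in C as a loop over the string's characters. The sketch assumes a plain 32-byte bit array rather than StdLib's packed format, and the names CharSet, add_str, rmv_str, and is_member are hypothetical. It also shows why a zero byte can never be added this way: it ends the string.

```c
#include <stdint.h>

typedef struct { uint8_t bits[32]; } CharSet;   /* 256 bits, one per character code */

/* Union every character of the zero terminated string z into s. */
static void add_str(CharSet *s, const char *z)
{
    for (; *z != '\0'; z++) {
        uint8_t c = (uint8_t)*z;
        s->bits[c >> 3] |= (uint8_t)(1u << (c & 7));
    }
}

/* Remove every character of the zero terminated string z from s. */
static void rmv_str(CharSet *s, const char *z)
{
    for (; *z != '\0'; z++) {
        uint8_t c = (uint8_t)*z;
        s->bits[c >> 3] &= (uint8_t)~(1u << (c & 7));
    }
}

/* Return nonzero if character code c is in s. */
static int is_member(const CharSet *s, uint8_t c)
{
    return (s->bits[c >> 3] >> (c & 7)) & 1u;
}
```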

The AddChar and RmvChar routines let you add or remove individual characters. As usual, es:di points at the set; the al register contains the character you wish to add to the set or remove from the set. The following example adds a space to the set FPDigits and removes the "," character (if present):

                les     di, FPDigits
                mov     al, ' '
                AddChar
                 .
                 .
                 .
                les     di, FPDigits
                mov     al, ','
                RmvChar

The Member function checks to see if a character is in a set. On entry, es:di must point at the set and al must contain the character to check. On exit, the zero flag is set if the character is a member of the set; the zero flag will be clear if the character is not in the set. The following example reads characters from the keyboard until the user presses a key that is not a whitespace character:

SkipWS:         get                     ;Read char from user into AL.
                lesi    WhiteSpace      ;Address of WS set into es:di.
                member
                je      SkipWS

The CopySet, SetUnion, SetIntersect, and SetDifference routines all operate on two sets of characters. The es:di register pair points at the destination character set; the dx:si register pair points at a source character set. CopySet copies the bits from the source set to the destination set, replacing the original bits in the destination set. SetUnion computes the union of the two sets and stores the result into the destination set. SetIntersect computes the set intersection and stores the result into the destination set. Finally, the SetDifference routine computes DestSet := DestSet - SrcSet.
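On a bit-array representation, these four operations reduce to byte-wise copy, OR, AND, and AND-NOT. The C sketch below models them on a plain 32-byte bit array; it is not StdLib's packed format, and the names CharSet, copy_set, set_union, set_intersect, and set_difference are hypothetical. In every case the destination comes first and receives the result, matching the es:di and dx:si roles described above.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t bits[32]; } CharSet;   /* 256 bits, one per character code */

/* dest := src */
static void copy_set(CharSet *dest, const CharSet *src)
{
    memcpy(dest->bits, src->bits, sizeof dest->bits);
}

/* dest := dest + src (union) */
static void set_union(CharSet *dest, const CharSet *src)
{
    for (int i = 0; i < 32; i++)
        dest->bits[i] |= src->bits[i];
}

/* dest := dest * src (intersection) */
static void set_intersect(CharSet *dest, const CharSet *src)
{
    for (int i = 0; i < 32; i++)
        dest->bits[i] &= src->bits[i];
}

/* dest := dest - src (members of dest that are not in src) */
static void set_difference(CharSet *dest, const CharSet *src)
{
    for (int i = 0; i < 32; i++)
        dest->bits[i] &= (uint8_t)~src->bits[i];
}
```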

The NextItem and RmvItem routines let you extract elements from a set. NextItem returns in al the ASCII code of the first character it finds in a set. RmvItem does the same thing, except it also removes the character from the set. These routines return zero in al if the set is empty (StdLib sets cannot contain the NULL character). You can use the RmvItem routine to build a rudimentary iterator for a character set.

The UCR Standard Library's character set routines are very powerful. With them, you can easily manipulate character string data, especially when searching for different patterns within a string. We will consider these routines again when we study pattern matching later in this text.
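The iterator idea can be sketched in C: keep removing an item until the routine reports zero. The model below uses a plain 32-byte bit array and assumes that "the first character it finds" means the lowest character code, which the text does not actually specify. The names CharSet, next_item, rmv_item, and drain_set are hypothetical.

```c
#include <stdint.h>

typedef struct { uint8_t bits[32]; } CharSet;   /* 256 bits, one per character code */

/* Return the lowest character code in s, or 0 if s is empty (0 is never a member). */
static uint8_t next_item(const CharSet *s)
{
    for (unsigned c = 1; c < 256; c++)
        if ((s->bits[c >> 3] >> (c & 7)) & 1u)
            return (uint8_t)c;
    return 0;
}

/* Like next_item, but also removes the returned character from s. */
static uint8_t rmv_item(CharSet *s)
{
    uint8_t c = next_item(s);
    if (c != 0)
        s->bits[c >> 3] &= (uint8_t)~(1u << (c & 7));
    return c;
}

/* Rudimentary iterator: drain s into out as a zero terminated string.
   out must have room for 256 bytes. Returns the number of items removed. */
static unsigned drain_set(CharSet *s, char *out)
{
    unsigned n = 0;
    uint8_t c;
    while ((c = rmv_item(s)) != 0)
        out[n++] = (char)c;
    out[n] = '\0';
    return n;
}
```

Because iterating this way destroys the set, a caller that still needs the set afterwards would iterate over a copy.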

15.6 Using the String Instructions on Other Data Types

The string instructions work with other data types besides character strings. You can use the string instructions to copy whole arrays from one variable to another, to initialize large data structures to a single value, or to compare entire data structures for equality or inequality. Anytime you're dealing with data structures containing several bytes, you may be able to use the string instructions.

15.6.1 Multi-precision Integer Strings

The cmps instruction is useful for comparing (very) large integer values. Unlike character strings, we cannot compare integers with cmps from the L.O. byte through the H.O. byte. Instead, we must compare them from the H.O. byte down to the L.O. byte. The following code compares two 12-byte integers:

                lea     di, integer1+10         ;Point at the H.O. words.
                lea     si, integer2+10
                mov     cx, 6                   ;Six words = 12 bytes.
                std                             ;Work from H.O. down to L.O.
        repe    cmpsw

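After the repe cmpsw instruction finishes, the flags contain the result of the comparison: they reflect the most significant pair of words that differ, or indicate equality if all six words match. Note that cmps computes [si] - [di], so here the flags describe integer2 relative to integer1.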
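The same high-to-low strategy can be written out in C, which makes it easy to see why the scan must start at the H.O. word: the first difference found decides the result. This sketch assumes unsigned integers stored least significant word first, as on the 80x86, and the name cmp_multiword is hypothetical.

```c
#include <stdint.h>

#define WORDS 6     /* 12 bytes = six 16-bit words, as in the example above */

/* Compare two unsigned 96-bit integers stored least significant word first.
   Returns -1, 0, or +1 for a < b, a == b, a > b. */
static int cmp_multiword(const uint16_t a[WORDS], const uint16_t b[WORDS])
{
    /* Start at the H.O. word and work down, like std followed by repe cmpsw. */
    for (int i = WORDS - 1; i >= 0; i--) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    }
    return 0;       /* every word matched */
}
```

Scanning from the L.O. word instead would stop at the least significant difference, which says nothing about which value is larger.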

You can easily assign one long integer string to another using the movs instruction. Nothing tricky here, just load up the si, di, and cx registers and have at it. You must do other operations, including arithmetic and logical operations, using the extended precision methods described in the chapter on arithmetic operations.

15.6.2 Dealing with Whole Arrays and Records

The only operations that apply in general to all array and record structures are assignment and comparison (for equality/inequality only). You can use the movs and cmps instructions for these operations.

Operations such as scalar addition, transposition, etc., may be easily synthesized using the lods and stos instructions. The following code shows how you can easily add the value 20 to each element of the integer array A:

                lea     si, A
                mov     di, si                  ;Store back where we load from.
                mov     cx, SizeOfA             ;Number of words in A.
                cld
AddLoop:        lodsw
                add     ax, 20
                stosw
                loop    AddLoop

You can implement other operations in a similar fashion.
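For comparison, the load, modify, store pattern of the lodsw/stosw loop corresponds to an ordinary element-wise loop in C. The sketch assumes 16-bit elements and takes the element count (not the byte count) as n; the name add_scalar is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Add k to each 16-bit element of a[0..n-1]; results wrap modulo 65536,
   just as the add instruction wraps in a 16-bit register. */
static void add_scalar(uint16_t *a, size_t n, uint16_t k)
{
    for (size_t i = 0; i < n; i++)
        a[i] = (uint16_t)(a[i] + k);    /* load, modify, store */
}
```

One difference worth noting: this C loop does nothing when n is zero, whereas the loop instruction decrements cx before testing it and would run 65,536 times if cx were zero on entry.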

Chapter Fifteen: Strings And Character Sets (Part 6)
28 SEP 1996