|
Table of Content | Chapter Sixteen (Part 7) |
| CHAPTER SIXTEEN: PATTERN MATCHING (Part 6) |
| 16.4 -
Designing Your Own Pattern Matching Routines 16.5 - Extracting Substrings from Matched Patterns |
| 16.4 Designing Your Own Pattern Matching Routines |
Although the UCR Standard Library provides a wide variety of matching functions there is no way to anticipate the needs of all applications. Therefore you will probably discover that the library does not support some particular pattern matching function you need. Fortunately it is very easy for you to create your own pattern matching functions to augment those available in the UCR Standard Library. When you specify a matching function name in the pattern data structure the match routine calls the specified address using a far call and passing the following parameters:
es:di- Points at the next character in the
input string. You should not look at any characters before this address. Furthermore
you
should never look beyond the end of the string (see cx below).
ds:si- Contains the four byte parameter found
in the matchparm field.
cx- Contains the last position
plus one
in
the input string you're allowed to look at. Note that your pattern matching routine should
not look beyond location es:cx or the zero terminating byte; whichever comes
first in the input string.
On return from the function
ax must contain
the offset into the string (di's value) of the last character matched plus
one
if your matching function is successful. It must also set the carry flag to denote
success. After your pattern matches
the match routine might call another matching
function (the one specified by the next pattern field) and that function begins matching
at location es:ax.
If the pattern match fails
then you must return the
original di value in the ax register and return with the carry
flag clear. Note that your matching function must preserve all other registers.
There is one very important detail you must never forget
with writing your own pattern matching routines - ds does not point at your
data segment
it contains the H.O. word of the matchparm parameter.
Therefore
if you are going to access global variables in your data segment you will need
to push ds
load it with the address of dseg
and pop ds
before leaving. Several examples throughout this chapter demonstrate how to do this.
There are some obvious omissions from (the current version
of) the UCR Standard Library's repertoire. For example
there should probably be matchtoistr
matchichar
and matchtoichar pattern functions. The following
example code demonstrates how to add a matchtoistr (match up to a string
doing a case insensitive comparison) routine.
.xlist
include stdlib.a
includelib stdlib.lib
matchfuncs
.list
dseg segment para public 'data'
TestString byte "This is the string 'xyz' in it"
cr
lf
0
TestPat pattern {matchtoistr
xyz}
xyz byte "XYZ"
0
dseg ends
cseg segment para public 'code'
assume cs:cseg
ds:dseg
; MatchToiStr- Matches all characters in a string up to
and including
the
; specified parameter string. The parameter string must be
; all upper case characters. This guy matches string using
; a case insensitive comparison.
;
; inputs:
; es:di- Source string
; ds:si- String to match
; cx- Maximum match position
;
; outputs:
; ax- Points at first character beyond the end of the
; matched string if success
contains the initial DI
; value if failure occurs.
; carry- 0 if failure
1 if success.
MatchToiStr proc far
pushf
push di
push si
cld
; Check to see if we're already past the point were we're allowed
; to scan in the input string.
cmp di
cx
jae MTiSFailure
; If the pattern string is the empty string
always match.
cmp byte ptr ds:[si]
0
je MTSsuccess
; The following loop scans through the input string looking for
; the first character in the pattern string.
ScanLoop: push si
lodsb ;Get first char of string
dec di
FindFirst: inc di ;Move on to next (or 1st) char.
cmp di
cx ;If at cx
then we've got to
jae CantFind1st ; fail.
mov ah
es:[di] ;Get input character.
cmp ah
'a' ;Convert input character to
jb DoCmp ; upper case if it's a lower
cmp ah
'z' ; case character.
ja DoCmp
and ah
5fh
DoCmp: cmp al
ah ;Compare input character against
jne FindFirst ; pattern string.
; At this point
we've located the first character in the input string
; that matches the first character of the pattern string. See if the
; strings are equal.
push di ;Save restart point.
CmpLoop: cmp di
cx ;See if we've gone beyond the
jae StrNotThere ; last position allowable.
lodsb ;Get next input character.
cmp al
0 ;At the end of the parameter
je MTSsuccess2 ; string? If so
succeed.
inc di
mov ah
es:[di] ;Get the next input character.
cmp ah
'a' ;Convert input character to
jb DoCmp2 ; upper case if it's a lower
cmp ah
'z' ; case character.
ja DoCmp2
and ah
5fh
DoCmp2: cmp al
ah ;Compare input character against
je CmpLoop
pop di
pop si
jmp ScanLoop
StrNotThere: add sp
2 ;Remove di from stack.
CantFind1st: add sp
2 ;Remove si from stack.
MTiSFailure: pop si
pop di
mov ax
di ;Return failure position in AX.
popf
clc ;Return failure.
ret
MTSSuccess2: add sp
2 ;Remove DI value from stack.
MTSSuccess: add sp
2 ;Remove SI value from stack.
mov ax
di ;Return next position in AX.
pop si
pop di
popf
stc ;Return success.
ret
MatchToiStr endp
Main proc
mov ax
dseg
mov ds
ax
mov es
ax
meminit
lesi TestString
ldxi TestPat
xor cx
cx
match
jnc NoMatch
print
byte "Matched"
cr
lf
0
jmp Quit
NoMatch: print
byte "Did not match"
cr
lf
0
Quit: ExitPgm
Main endp
cseg ends
sseg segment para stack 'stack'
stk db 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes db 16 dup (?)
zzzzzzseg ends
end Main
Often simply determining that a string matches a given pattern is insufficient. You may want to perform various operations that depend upon the actual information in that string. However the pattern matching facilities described thus far do not provide a mechanism for testing individual components of the input string. In this section you will see how to extract portions of a pattern for further processing.
Perhaps an example may help clarify the need to extract portions of a string. Suppose you are writing a stock buy/sell program and you want it to process commands described by the following regular expression:
(buy | sell) [0-9]+ shares of (ibm | apple | hp | dec)
While it is easy to devise a Standard Library pattern that
recognizes strings of this form
calling the match routine would only tell
you that you have a legal buy or sell command. It does not tell you if you are to buy or
sell
who to buy or sell
or how many shares to buy or sell. Of course
you could take the
cross product of (buy | sell) with (ibm | apple | hp | dec) and generate eight different
regular expressions that uniquely determine whether you're buying or selling and whose
stock you're trading
but you can't process the integer values this way (unless you
willing to have millions of regular expressions). A better solution would be to extract
substrings from the legal pattern and process these substrings after you verify that you
have a legal buy or sell command. For example
you could extract buy or sell into one
string
the digits into another
and the company name into a third. After verifying the
syntax of the command
you could process the individual strings you've extracted. The UCR
Standard Library patgrab routine provides this capability for you.
You normally call patgrab after calling match
and verifying that it matches the input string. Patgrab expects a single
parameter - a pointer to a pattern recently processed by match. Patgrab
creates a string on the heap consisting of the characters matched by the given pattern and
returns a pointer to this string in es:di. Note that patgrab
only returns a string associated with a single pattern data structure
not a chain of
pattern data structures. Consider the following pattern:
PatToGrab pattern {matchstr
str1
0
Pat2}
Pat2 pattern {matchstr
str2}
str1 byte "Hello"
0
str2 byte " there"
0
Calling match on PatToGrab will
match the string "Hello there". However
if after calling match you
call patgrab and pass it the address of PatToGrab
patgrab
will return a pointer to the string "Hello".
Of course
you might want to collect a string that is the
concatenation of several strings matched within your pattern (i.e.
a portion of the
pattern list). This is where calling the sl_match2 pattern matching function
comes in handy. Consider the following pattern:
Numbers pattern {sl_match2
FirstNumber}
FirstNumber pattern {anycset
digits
0
OtherDigs}
OtherDigs pattern {spancset
digits}
This pattern matches the same strings as
Numbers pattern {anycset
digits
0
OtherDigs}
OtherDigs pattern {spancset
digits}
So why bother with the extra pattern that calls sl_match2?
Well
as it turns out the sl_match2 matching function lets you create
parenthetical patterns. A parenthetical pattern is a pattern list that the pattern
matching routines (especially patgrab) treat as a single pattern. Although
the match routine will match the same strings regardless of which version of Numbers
you use
patgrab will produce two entirely different strings depending upon
your choice of the above patterns. If you use the latter version
patgrab
will only return the first digit of the number. If you use the former version (with the
call to sl_match2)
then patgrab returns the entire string
matched by sl_match2
and that turns out to be the entire string of digits.
The following sample program demonstrates how to use parenthetical patterns to extract the pertinent information from the stock command presented earlier. It uses parenthetical patterns for the buy/sell command the number of shares and the company name.
.xlist
include stdlib.a
includelib stdlib.lib
matchfuncs
.list
dseg segment para public 'data'
; Variables used to hold the number of shares bought/sold
a pointer to
; a string containing the buy/sell command
and a pointer to a string
; containing the company name.
Count word 0
CmdPtr dword ?
CompPtr dword ?
; Some test strings to try out:
Cmd1 byte "Buy 25 shares of apple stock"
0
Cmd2 byte "Sell 50 shares of hp stock"
0
Cmd3 byte "Buy 123 shares of dec stock"
0
Cmd4 byte "Sell 15 shares of ibm stock"
0
BadCmd0 byte "This is not a buy/sell command"
0
; Patterns for the stock buy/sell command:
;
; StkCmd matches buy or sell and creates a parenthetical pattern
; that contains the string "buy" or "sell".
StkCmd pattern {sl_match2
buyPat
0
skipspcs1}
buyPat pattern {matchistr
buystr
sellpat}
buystr byte "BUY"
0
sellpat pattern {matchistr
sellstr}
sellstr byte "SELL"
0
; Skip zero or more white space characters after the buy command.
skipspcs1 pattern {spancset
whitespace
0
CountPat}
; CountPat is a parenthetical pattern that matches one or more
; digits.
CountPat pattern {sl_match2
Numbers
0
skipspcs2}
Numbers pattern {anycset
digits
0
RestOfNum}
RestOfNum pattern {spancset
digits}
; The following patterns match " shares of " allowing any amount
; of white space between the words.
skipspcs2 pattern {spancset
whitespace
0
sharesPat}
sharesPat pattern {matchistr
sharesStr
0
skipspcs3}
sharesStr byte "SHARES"
0
skipspcs3 pattern {spancset
whitespace
0
ofPat}
ofPat pattern {matchistr
ofStr
0
skipspcs4}
ofStr byte "OF"
0
skipspcs4 pattern {spancset
whitespace
0
CompanyPat}
; The following parenthetical pattern matches a company name.
; The patgrab-available string will contain the corporate name.
CompanyPat pattern {sl_match2
ibmpat}
ibmpat pattern {matchistr
ibm
applePat}
ibm byte "IBM"
0
applePat pattern {matchistr
apple
hpPat}
apple byte "APPLE"
0
hpPat pattern {matchistr
hp
decPat}
hp byte "HP"
0
decPat pattern {matchistr
decstr}
decstr byte "DEC"
0
include stdsets.a
dseg ends
cseg segment para public 'code'
assume cs:cseg
ds:dseg
; DoBuySell- This routine processes a stock buy/sell command.
; After matching the command
it grabs the components
; of the command and outputs them as appropriate.
; This routine demonstrates how to use patgrab to
; extract substrings from a pattern string.
;
; On entry
es:di must point at the buy/sell command
; you want to process.
DoBuySell proc near
ldxi StkCmd
xor cx
cx
match
jnc NoMatch
lesi StkCmd
patgrab
mov word ptr CmdPtr
di
mov word ptr CmdPtr+2
es
lesi CountPat
patgrab
atoi ;Convert digits to integer
mov Count
ax
free ;Return storage to heap.
lesi CompanyPat
patgrab
mov word ptr CompPtr
di
mov word ptr CompPtr+2
es
printf
byte "Stock command: %^s\n"
byte "Number of shares: %d\n"
byte "Company to trade: %^s\n\n"
0
dword CmdPtr
Count
CompPtr
les di
CmdPtr
free
les di
CompPtr
free
ret
NoMatch: print
byte "Illegal buy/sell command"
cr
lf
0
ret
DoBuySell endp
Main proc
mov ax
dseg
mov ds
ax
mov es
ax
meminit
lesi Cmd1
call DoBuySell
lesi Cmd2
call DoBuySell
lesi Cmd3
call DoBuySell
lesi Cmd4
call DoBuySell
lesi BadCmd0
call DoBuySell
Quit: ExitPgm
Main endp
cseg ends
sseg segment para stack 'stack'
stk db 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes db 16 dup (?)
zzzzzzseg ends
end Main
Sample program output:
Stock command: Buy
Number of shares: 25
Company to trade: apple
Stock command: Sell
Number of shares: 50
Company to trade: hp
Stock command: Buy
Number of shares: 123
Company to trade: dec
Stock command: Sell
Number of shares: 15
Company to trade: ibm
Illegal buy/sell command
|
Table of Content | Chapter Sixteen (Part 7) |
Chapter Sixteen: Pattern Matching
(Part 6)
29 SEP 1996