|
Table of Content | Chapter Sixteen (Part 6) |
| CHAPTER
SIXTEEN: PATTERN MATCHING (Part 5) |
|
| 16.2 -
The UCR Standard Library Pattern Matching Routines 16.3 - The Standard Library Pattern Matching Functions 16.3.1 - Spancset 16.3.2 - Brkcset 16.3.3 - Anycset 16.3.4 - Notanycset 16.3.5 - MatchStr 16.3.6 - MatchiStr 16.3.7 - MatchToStr 16.3.8 - MatchChar 16.3.9 - MatchToChar |
16.3.10
- MatchChars 16.3.11 - MatchToPat 16.3.12 - EOS 16.3.13 - ARB 16.3.14 - ARBNUM 16.3.15 - Skip 16.3.16 - Pos 16.3.17 - RPos 16.3.18 - GotoPos 16.3.19 - RGotoPos 16.3.20 - SL_Match2 |
| 16.2 The UCR Standard Library Pattern Matching Routines | |
The UCR Standard Library provides a very sophisticated set of pattern matching routines. They are patterned after the pattern matching facilities of SNOBOL4 support CFGs and provide fully automatic backtracking as necessary. Furthermore by writing only five assembly language statements you can match simple or complex patterns.
There is very little assembly language code to worry about
when using the Standard Library's pattern matching routines because most of the work
occurs in the data segment. To use the pattern matching routines
you first construct a
pattern data structure in the data segment. You then pass the address of this pattern and
the string you wish to test to the Standard Library match routine. The match
routine returns failure or success depending on the state of the comparison. This isn't
quite as easy as it sounds
though; learning how to construct the pattern data structure
is almost like learning a new programming language. Fortunately
if you've followed the
discussion on context free languages
learning this new "language" is a breeze.
The Standard Library pattern data structure takes the following form:
Pattern struct MatchFunction dword ? MatchParm dword ? MatchAlt dword ? NextPattern dword ? EndPattern word ? StartPattern word ? StrSeg word ? Pattern ends
The MatchFunction field contains the address
of a routine to call to perform some sort of comparison. The success or failure of this
function determines whether the pattern matches the input string. For example
the UCR
Standard Library provides a MatchStr function that compares the next n
characters of the input string against some other character string.
The MatchParm field contains the address or
value of a parameter (if appropriate) for the MatchFunction routine. For
example
if the MatchFunction routine is MatchStr
then the MatchParm
field contains the address of the string to compare the input characters against.
Likewise
the MatchChar routine compares the next input character in the
string against the L.O. byte of the MatchParm field. Some matching functions
do not require any parameters
they will ignore any value you assign to MatchParm
field. By convention
most programmers store a zero in unused fields of the Pattern
structure.
The MatchAlt field contains either zero (NULL)
or the address of some other pattern data structure. If the current pattern matches the
input characters
the pattern matching routines ignore this field. However
if the current
pattern fails to match the input string
then the pattern matching routines will attempt
to match the pattern whose address appears in this field. If this alternate pattern
returns success
then the pattern matching routine returns success to the caller
otherwise it returns failure. If the MatchAlt field contains NULL
then the
pattern matching routine immediately fails if the main pattern does not match.
The Pattern data structure only matches one
item. For example
it might match a single character
a single string
or a character from
a set of characters. A real world pattern will probably contain several small patterns
concatenated together
e.g.
the pattern for a Pascal identifier consists of a single
character from the set of alphabetic characters followed by one or more characters from
the set [a-zA-Z0-9_]. The NextPattern field lets you create a composite
pattern as the concatenation of two individual patterns. For such a composite pattern to
return success
the current pattern must match and then the pattern specified by the NextPattern
field must also match. Note that you can chain as many patterns together as you please
using this field.
The last three fields
EndPattern
StartPattern
and StrSeg are for the internal use of the pattern matching routine. You
should not modify or examine these fields.
Once you create a pattern
it is very easy to test a string
to see if it matches that pattern. The calling sequence for the UCR Standard Library match
routine is
lesi << Input string to match » ldxi << Pattern to match string against » mov cx 0 match jc Success
The Standard Library match routine expects a
pointer to the input string in the es:di registers; it expects a pointer to
the pattern you want to match in the dx:si register pair. The cx
register should contain the length of the string you want to test. If cx
contains zero
the match routine will test the entire input string. If cx
contains a nonzero value
the match routine will only test the first cx
characters in the string. Note that the end of the string (the zero terminating byte) must
not appear in the string before the position specified in cx. For most
applications
loading cx with zero before calling match is the most
appropriate operation.
On return from the match routine
the carry
flag denotes success or failure. If the carry flag is set
the pattern matches the string;
if the carry flag is clear
the pattern does not match the string. Unlike the examples
given in earlier sections
the match routine does not modify the di
register
even if the match succeeds. Instead
it returns the failure/success position in
the ax register. The is the position of the first character after the match
if match succeeds
it is the position of the first unmatched character if match
fails.
The UCR Standard Library provides about 20 built-in pattern matching functions. These functions are based on the pattern matching facilities provided by the SNOBOL4 programming language so they are very powerful indeed! You will probably discover that these routines solve all your pattern matching need although it is easy to write your own pattern matching routines if an appropriate one is not available. The following subsections describe each of these pattern matching routines in detail.
There are two things you should note if you're using the Standard Library's SHELL.ASM file when creating programs that use pattern matching and character sets. First there is a line at the very beginning of the SHELL.ASM file that contains the statement "matchfuncs". This line is currently a comment because it contains a semicolon in column one. If you are going to be using the pattern matching facilities of the UCR Standard Library you need to uncomment this line by deleting the semicolon in column one. If you are going to be using the character set facilities of the UCR Standard Library (very common when using the pattern matching facilities) you may want to uncomment the line containing "include stdsets.a" in the data segment. The "stdsets.a" file includes several common character sets including alphabetics digits alphanumerics whitespace and so on.
The spancset routine skips over all characters
belonging to a character set. This routine will match zero or more characters in the
specified set and
therefore
always succeeds. The MatchParm field of the
pattern data structure must point at a UCR Standard Library character set variable.
Example:
SkipAlphas pattern {spancset
alpha}
.
.
.
lesi StringWAlphas
ldxi SkipAlphas
xor cx
cx
match
Brkcset
is the dual to spancset - it
matches zero or more characters in the input string which are not members of a specified
character set. Another way of viewing brkcset is that it will match all
characters in the input string up to a character in the specified character set (or to the
end of the string). The matchparm field contains the address of the character
set to match.
Example:
DoDigits pattern {brkcset
digits
0
DoDigits2}
DoDigits2 pattern {spancset
digits}
.
.
.
lesi StringWDigits
ldxi DoDigits
xor cx
cx
match
jnc NoDigits
The code above matches any string that contains a string of one or more digits somewhere in the string.
Anycset
matches a single character in the input
string from a set of characters. The matchparm field contains the address of
a character set variable. If the next character in the input string is a member of this
set
anycset set accepts the string and skips over than character. If the
next input character is not a member of that set
anycset returns failure.
Example:
DoID pattern {anycset
alpha
0
DoID2}
DoID2 pattern {spancset
alphanum}
.
.
.
lesi StringWID
ldxi DoID
xor cx
cx
match
jnc NoID
This code segment checks the string StringWID
to see if it begins with an identifier specified by the regular expression
[a-zA-Z][a-zA-Z0-9]*. The first subpattern with anycset makes sure there is
an alphabetic character at the beginning of the string (alpha is the
stdsets.a set variable that has all the alphabetic characters as members). If the string
does not begin with an alphabetic
the DoID pattern fails. The second
subpattern
DoID2
skips over any following alphanumeric characters using the
spancset matching function. Note that spancset always succeeds.
The above code does not simply match a string that is an identifier; it matches strings that begin with a valid identifier. For example it would match "ThisIsAnID" as well as "ThisIsAnID+SoIsThis - 5". If you only want to match a single identifier and nothing else you must explicitly check for the end of string in your pattern.
Notanycset
provides the complement to anycset
- it matches a single character in the input string that is not a member of a character
set. The matchparm field
as usual
contains the address of the character set
whose members must not appear as the next character in the input string. If notanycset
successfully matches a character (that is
the next input character is not in the
designated character set)
the function skips the character and returns success; otherwise
it returns failure.
Example:
DoSpecial pattern {notanycset
digits
0
DoSpecial2}
DoSpecial2 pattern {spancset
alphanum}
.
.
.
lesi StringWSpecial
ldxi DoSpecial
xor cx
cx
match
jnc NoSpecial
This code is similar to the DoID pattern in
the previous example. It matches a string containing any character except a digit and then
matches a string of alphanumeric characters.
Matchstr
compares the next set of input characters
against a character string. The matchparm field contains the address of a
zero terminated string to compare against. If matchstr succeeds
it returns
the carry set and skips over the characters it matched; if it fails
it tries the
alternate matching function or returns failure if there is no alternate.
Example:
DoString pattern {matchstr
MyStr}
MyStr byte "Match this!"
0
.
.
.
lesi String
ldxi DoString
xor cx
cx
match
jnc NotMatchThis
This sample code matches any string that begins with the characters "Match This!"
Matchistr
is like matchstr insofar as
it compares the next several characters against a zero terminated string value. However
matchistr
does a case insensitive comparison. During the comparison it converts the characters in
the input string to upper case before comparing them to the characters that the matchparm
field points at. Therefore
the string pointed at by the matchparm field must
contain uppercase wherever alphabetics appear. If the matchparm string
contains any lower case characters
the matchistr function will always fail.
Example:
DoString pattern {matchistr
MyStr}
MyStr byte "MATCH THIS!"
0
.
.
.
lesi String
ldxi DoString
xor cx
cx
match
jnc NotMatchThis
This example is identical to the one in the previous section except it will match the characters "match this!" using any combination of upper and lower case characters.
Matchtostr
matches all characters in an input string
up to and including the characters specified by the matchparm parameter. This
routine succeeds if the specified string appears somewhere in the input string
it fails
if the string does not appear in the input string. This pattern function is quite useful
for locating a substring and ignoring everything that came before the substring.
Example:
DoString pattern {matchtostr
MyStr}
MyStr byte "Match this!"
0
.
.
.
lesi String
ldxi DoString
xor cx
cx
match
jnc NotMatchThis
Like the previous two examples
this code segment matches
the string "Match this!" However
it does not require that the input string (String)
begin with "Match this!" Instead
it only requires that "Match this!"
appear somewhere in the string.
The matchchar function matches a single
character. The matchparm field's L.O. byte contains the character you want to
match. If the next character in the input string is that character
then this function
succeeds
otherwise it fails.
Example:
DoSpace pattern {matchchar
' '}
.
.
.
lesi String
ldxi DoSpace
xor cx
cx
match
jnc NoSpace
This code segment matches any string that begins with a
space. Keep in mind that the match routine only checks the prefix of a
string. If you wanted to see if the string contained only a space (rather than a string
that begins with a space)
you would need to explicitly check for an end of string after
the space. Of course
it would be far more efficient to use strcmp rather
than match for this purpose!
Note that unlike matchstr
you encode the
character you want to match directly into the matchparm field. This lets you
specify the character you want to test directly in the pattern definition.
Like matchtostr
matchtochar
matches all characters up to and including a character you specify. This is similar to brkcset
except you don't have to create a character set containing a single member and brkcset
skips up to but not including the specified character(s). Matchtochar fails
if it cannot find the specified character in the input string.
Example:
DoToSpace pattern {matchtochar
' '}
.
.
.
lesi String
ldxi DoSpace
xor cx
cx
match
jnc NoSpace
This call to match will fail if there are no
spaces left in the input string. If there are
the call to matchtochar will
skip over all characters up to
and including
the first space. This is a useful pattern
for skipping over words in a string.
Matchchars
skips zero or more occurrences of a singe
character in an input string. It is similar to spancset except you can
specify a single character rather than an entire character set with a single member. Like matchchar
matchchars expects a single character in the L.O. byte of the matchparm
field. Since this routine matches zero or more occurrences of that character
it always
succeeds.
Example:
Skip2NextWord pattern {matchtochar
' '
0
SkipSpcs}
SkipSpcs pattern {matchchars
' '}
.
.
.
lesi String
ldxi Skip2NextWord
xor cx
cx
match
jnc NoWord
The code segment skips to the beginning of the next word in a string. It fails if there are no additional words in the string (i.e. the string contains no spaces).
Matchtopat
matches all characters in a string up to
and including the substring matched by some other pattern. This is one of the two
facilities the UCR Standard Library pattern matching routines provide to allow the
implementation of nonterminal function calls. This matching function succeeds if it finds
a string matching the specified pattern somewhere on the line. If it succeeds
it skips
the characters through the last character matched by the pattern parameter. As you would
expect
the matchparm field contains the address of the pattern to match.
Example:
; Assume there is a pattern "expression" that matches arithmetic
; expressions. The following pattern determines if there is such an
; expression on the line followed by a semicolon.
FindExp pattern {matchtopat
expression
0
MatchSemi}
MatchSemi pattern {matchchar
';'}
.
.
.
lesi String
ldxi FindExp
xor cx
cx
match
jnc NoExp
The EOS pattern matches the end of a string.
This pattern
which must obviously appear at the end of a pattern list if it appears at
all
checks for the zero terminating byte. Since the Standard Library routines only match
prefixes
you should stick this pattern at the end of a list if you want to ensure that a
pattern exactly matches a string with no left over characters at the end. EOS
succeeds if it matches the zero terminating byte
it fails otherwise.
Example:
SkipNumber pattern {anycset
digits
0
SkipDigits}
SkipDigits pattern {spancset
digits
0
EOSPat}
EOSPat pattern {EOS}
.
.
.
lesi String
ldxi SkipNumber
xor cx
cx
match
jnc NoNumber
The SkipNumber pattern matches strings that
contain only decimal digits (from the start of the match to the end of the string). Note
that EOS requires no parameters
not even a matchparm parameter.
ARB
matches any number of arbitrary characters. This
pattern matching function is equivalent to ARB
is a very inefficient routine to use. It works by assuming it can match all remaining
characters in the string and then tries to match the pattern specified by the nextpattern
field. If the nextpattern item fails
ARB backs up one character
and tries matching nextpattern again. This continues until the pattern
specified by nextpattern succeeds or ARB backs up to its initial
starting position. ARB succeeds if the pattern specified by nextpattern
succeeds
it fails if it backs up to its initial starting position.
Given the enormous amount of backtracking that can occur
with ARB (especially on long strings)
you should try to avoid using this
pattern if at all possible. The matchtostr
matchtochar
and matchtopat
functions accomplish much of what ARB accomplishes
but they work forward
rather than backward in the source string and may be more efficient. ARB is
useful mainly if you're sure the following pattern appears late in the string you're
matching or if the string you want to match occurs several times and you want to match the
last occurrence (matchtostr
matchtochar
and matchtopat
always match the first occurrence they find).
Example:
SkipNumber pattern {ARB
0
0
SkipDigit}
SkipDigit pattern {anycset
digits
0
SkipDigits}
SkipDigits pattern {spancset
digits}
.
.
.
lesi String
ldxi SkipNumber
xor cx
cx
match
jnc NoNumber
This code example matches the last number that appears on
an input line. Note that ARB does not use the matchparm field
so you should set it to zero by default.
ARBNUM
matches an arbitrary number (zero or more) of
patterns that occur in the input string. If R represents some nonterminal number
(pattern matching function)
then ARBNUM(R ) is equivalent to the
production ARBNUM The matchparm field contains the address of
the pattern that ARBNUM attempts to match.
Example:
SkipNumbers pattern {ARBNUM
SkipNumber}
SkipNumber pattern {anycset
digits
0
SkipDigits}
SkipDigits pattern {spancset
digits
0
EndDigits}
EndDigits pattern {matchchars
' '
EndString}
EndString pattern {EOS}
.
.
.
lesi String
ldxi SkipNumbers
xor cx
cx
match
jnc IllegalNumbers
This code accepts the input string if it consists of a
sequence of zero or more numbers separated by spaces and terminated with the EOS
pattern. Note the use of the matchalt field in the EndDigits
pattern to select EOS rather than a space for the last number in the string.
Skip
matches n arbitrary characters in the input
string. The matchparm field is an integer value containing the number of
characters to skip. Although the matchparm field is a double word
this
routine limits the number of characters you can skip to 16 bits (65
535 characters); that
is
n is the L.O. word of the matchparm field. This should prove sufficient
for most needs.
Skip succeeds if there are at least n
characters left in the input string; it fails if there are fewer than n characters left in
the input string.
Example:
Skip1st6 pattern {skip
6
0
SkipNumber}
SkipNumber pattern {anycset
digits
0
SkipDigits}
SkipDigits pattern {spancset
digits
0
EndDigits}
EndDigits pattern {EOS}
.
.
.
lesi String
ldxi Skip1st6
xor cx
cx
match
jnc IllegalItem
This example matches a string containing six arbitrary characters followed by one or more decimal digits and a zero terminating byte.
Pos
succeeds if the matching functions are currently
at the nth character in the string
where n is the value in the L.O. word of the matchparm
field. Pos fails if the matching functions are not currently at position n in
the string. Unlike the pattern matching functions you've seen so far
pos
does not consume any input characters. Note that the string starts out at position zero.
So when you use the pos function
it succeeds if you've matched n characters
at that point.
Example:
SkipNumber pattern {anycset
digits
0
SkipDigits}
SkipDigits pattern {spancset
digits
0
EndDigits}
EndDigits pattern {pos
4}
.
.
.
lesi String
ldxi SkipNumber
xor cx
cx
match
jnc IllegalItem
This code matches a string that begins with exactly 4 decimal digits.
Rpos
works quite a bit like the pos
function except it succeeds if the current position is n character positions from the end
of the string. Like pos
n is the L.O. 16 bits of the matchparm
field. Also like pos
rpos does not consume any input
characters.
Example:
SkipNumber pattern {anycset
digits
0
SkipDigits}
SkipDigits pattern {spancset
digits
0
EndDigits}
EndDigits pattern {rpos
4}
.
.
.
lesi String
ldxi SkipNumber
xor cx
cx
match
jnc IllegalItem
This code matches any string that is all decimal digits except for the last four characters of the string. The string must be at least five characters long for the above pattern match to succeed.
Gotopos
skips over any characters in the string
until it reaches character position n in the string. This function fails if the pattern is
already beyond position n in the string. The L.O. word of the matchparm field
contains the value for n.
Example:
SkipNumber pattern {gotopos
10
0
MatchNmbr}
MatchNmbr pattern {anycset
digits
0
SkipDigits}
SkipDigits pattern {spancset
digits
0
EndDigits}
EndDigits pattern {rpos
4}
.
.
.
lesi String
ldxi SkipNumber
xor cx
cx
match
jnc IllegalItem
This example code skips to position 10 in the string and attempts to match a string of digits starting with the 11th character. This pattern succeeds if the there are four characters remaining in the string after processing all the digits.
Rgotopos
works like gotopos except it
goes to the position specified from the end of the string. Rgotopos fails if
the matching routines are already beyond position n from the end of the string. As with gotopos
the L.O. word of the matchparm field contains the value for n.
Example:
SkipNumber pattern {rgotopos
10
0
MatchNmbr}
MatchNmbr pattern {anycset
digits
0
SkipDigits}
SkipDigits pattern {spancset
digits}
.
.
.
lesi String
ldxi SkipNumber
xor cx
cx
match
jnc IllegalItem
This example skips to ten characters from the end of the string and then attempts to match one or digits starting at that point. It fails if there aren't at least 11 characters in the string or the last 10 characters don't begin with a string of one or more digits.
The sl_match2 routine is nothing more than a
recursive call to match. The matchparm field contains the address of pattern
to match. This is quite useful for simulating parenthesis around a pattern in a pattern
expression. As far as matching strings are concerned
pattern1 and pattern2
below
are equivalent:
Pattern2 pattern {sl_match2
Pattern1}
Pattern1 pattern {matchchar
'a'}
The only difference between invoking a pattern directly and
invoking it with sl_match2 is that sl_match2 tweaks some
internal variables to keep track of matching positions within the input string. Later
you
can extract the character string matched by sl_match2 using the patgrab
routine.
|
Table of Content | Chapter Sixteen (Part 6) |
Chapter Sixteen: Pattern Matching
(Part 5)
29 SEP 1996