|
Table of Content | Chapter Sixteen (Part 5) |
Context free languages provide a superset of the regular languages - if you can specify a class of patterns with a regular expression you can express the same language using a context free grammar. In addition you can specify many languages that are not regular using context free grammars (CFGs).
Examples of languages that are context free
but not
regular
include the set of all strings representing common arithmetic expressions
legal
Pascal or C source files
and MASM macros. Context free languages are characterized by
balance and nesting. For example
arithmetic expression have balanced sets of parenthesis.
High level language statements like repeat...until allow nesting
and are always balanced (e.g.
for every repeat there is a corresponding until
statement later in the source file).
There is only a slight extension to the regular languages to handle context free languages - function calls. In a regular expression we only allow the objects we want to match and the specific RE operators like "|" "*" concatenation and so on. To extend regular languages to context free languages we need only add recursive function calls to regular expressions. Although it would be simple to create a syntax allowing function calls within a regular expression computer scientists use a different notation altogether for context free languages - a context free grammar.
A context free grammar contains two types of symbols: terminal
symbols and nonterminal symbols. Terminal symbols are the individual
characters and strings that the context free grammar matches plus the empty string
. Context free
grammars use nonterminal symbols for function calls and definitions. In our context free
grammars we will use italic characters to denote nonterminal symbols and standard
characters to denote terminal symbols.
To match this string
we begin by calling the starting
symbol function
expression
using the function expression
expression
+ factor. The first plus sign suggests that the expression term
must match "7" and the factor term must match "5*(2+1)". Now
we need to match our input string with the pattern expression + factor.
To do this
we call the expression function once
again
this time using the expression
factor
production. This give us the reduction:
expression
expression
+ factor
factor +
factor
The
symbol
denotes the application of a nonterminal function call (a reduction).
Next
we call the factor function
using the production factor
term to
yield the reduction:
expression
expression
+ factor
factor +
factor
term +
factor
Continuing we call the term function to produce the reduction:
expression
expression
+ factor
factor +
factor
term +
factor
IntegerConstant
+ factor
Next we call the IntegerConstant function to yield:
expression
expression
+ factor
factor +
factor
term +
factor
IntegerConstant
+ factor
7 + factor
At this point the first two symbols of our generated string match the first two characters of the input string so we can remove them from the input and concentrate on the items that follow. In succession we call the factor function to produce the reduction 7 + factor * term and then we call factor term and IntegerConstant to yield 7 + 5 * term. In a similar fashion we can reduce the term to "( expression )" and reduce expression to "2+1". The complete derivation for this string is
expressionexpression + factor
factor + factor
term + factor
IntegerConstant + factor
7 + factor
7 + factor * term
7 + term * term
7 + IntegerConstant * term
7 + 5 * term
7 + 5 * ( expression )
7 + 5 * ( expression + factor )
7 + 5 * ( factor + factor )
7 + 5 * ( IntegerConstant + factor )
7 + 5 * ( 2 + factor )
7 + 5 * ( 2 + term )
7 + 5 * ( 2 + IntegerConstant )
7 + 5 * ( 2 + 1 )
The final reduction completes the derivation of our input string so the string 7+5*(2+1) is in the language specified by the context free grammar.
16.1.4 Eliminating Left Recursion and Left Factoring CFGs
In the next section we will discuss how to convert a CFG to an assembly language program. However the technique we are going to use to do this conversion will require that we modify certain grammars before converting them. The arithmetic expression grammar in the previous section is a good example of such a grammar - one that is left recursive.
Left recursive grammars pose a problem for us because the way we will typically convert a production to assembly code is to call a function corresponding to a nonterminal and compare against the terminal symbols. However we will run into trouble if we attempt to convert a production like the following using this technique:
expressionexpression + factor
Such a conversion would yield some assembly code that looks roughly like the following:
expression proc near call expression jnc fail cmp byte ptr es:[di] '+' jne fail inc di call factor jnc fail stc ret Fail: clc ret expression endp
The obvious problem with this code is that it will generate
an infinite loop. Upon entering the expression function this code immediately
calls expression recursively
which immediately calls expression
recursively
which immediately calls expression recursively
... Clearly
we
need to resolve this problem if we are going to write any real code to match this
production.
The trick to resolving left recursion is to note that if there is a production that suffers from left recursion there must be some production with the same left hand side that is not left recursive. All we need do is rewrite the left recursive call in terms of the production that does not have any left recursion. This sound like a difficult task but it's actually quite easy.
To see how to eliminate left recursion let Xi and Yj represent any set of terminal symbols or nonterminal symbols that do not have a right hand side beginning with the nonterminal A. If you have some productions of the form:
AAX1 | AX2 | | AXn | Y1 | Y2 | | Ym
You will be able to translate this to an equivalent grammar
without left recursion by replacing each term of the form A ![]()
Yi
by A ![]()
Yi
A and each term of the form A ![]()
AXi
by A' ![]()
Xi A' |
. For example
consider three of the productions from the arithmetic grammar:
expressionexpression + factor
expressionexpression - factor
expressionfactor
In this example A corresponds to expression
X1 corresponds to "+ factor "
X2 corresponds
to "- factor "
and Y1 corresponds to "factor ".
The equivalent grammar without left recursion is
expressionfactor E'
E'- factor E'
E'+ factor E'
E'![]()
The complete arithmetic grammar with left recursion removed is
expressionfactor E'
E'+ factor E' | - factor E' |
factorterm F'
F'* term F' | / term F' |
termIntegerConstant | ( expression )
IntegerConstantdigit | digit IntegerConstant
digit0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Another useful transformation on a grammar is to left factor the grammar. This can reduce the need for backtracking improving the performance of your pattern matching code. Consider the following CFG fragment:
stmtif expression then stmt endif
stmtif expression then stmt else stmt endif
These two productions begin with the same set of symbols.
Either production will match all the characters in an if statement up to the
point the matching algorithm encounters the first else or endif.
If the matching algorithm processes the first statement up to the point of the endif
terminal symbol and encounters the else terminal symbol instead
it must
backtrack all the way to the if symbol and start over. This can be terribly
inefficient because of the recursive call to stmt (imagine a 10
000 line
program that has a single if statement around the entire 10
000 lines
a compiler using
this pattern matching technique would have to recompile the entire program from scratch if
it used backtracking in this fashion). However
by left factoring the grammar before
converting it to program code
you can eliminate the need for backtracking.
To left factor a grammar you collect all productions that have the same left hand side and begin with the same symbols on the right hand side. In the two productions above the common symbols are "if expression then stmt ". You combine the common strings into a single production and then append a new nonterminal symbol to the end of this new production e.g.
stmtif expression then stmt NewNonTerm
Finally you create a new set of productions using this new nonterminal for each of the suffixes to the common production:
NewNonTermendif | else stmt endif
This eliminates backtracking because the matching algorithm
can process the if
the expression
the then
and
the stmt before it has to choose between endif and else.
Since the context free languages are a superset of the regular languages it should come as no surprise that it is possible to convert regular expressions to context free grammars. Indeed this is a very easy process involving only a few intuitive rules.
1) If a regular expression simply consists of a sequence of
characters
xyz
you can easily create a production for this regular expression of the
form P
xyz. This
applies equally to the empty string
.
2) If r and s are two regular expression
that you've converted to CFG productions R and S
and you have a
regular expression rs that you want to convert to a production
simply create a
new production of the form T
R S.
3) If r and s are two regular expression
that you've converted to CFG productions R and S
and you have a
regular expression r | s that you want to convert to a production
simply create
a new production of the form T
R | S.
4) If r is a regular expression that you've
converted to a production
R
and you want to create a production for r*
simply use the production RStar
R RStar |
.
5) If r is a regular expression that you've
converted to a production
R
and you want to create a production for r+
simply use the production RPlus
R RPlus |
R.
6) For regular expressions there are operations with various precedences. Regular expressions also allow parenthesis to override the default precedence. This notion of precedence does not carry over into CFGs. Instead you must encode the precedence directly into the grammar. For example to encode R S* you would probably use productions of the form:
TR SStar
SStarS SStar |
Likewise to handle a grammar of the form (RS )* you could use productions of the form:
TR S T |
RSR S
If you have removed left recursion and you've left factored a grammar it is very easy to convert such a grammar to an assembly language program that recognizes strings in the context free language.
The first convention we will adopt is that es:di
always points at the start of the string we want to match. The second convention we will
adopt is to create a function for each nonterminal. This function returns success (carry
set) if it matches an associated subpattern
it returns failure (carry clear) otherwise.
If it succeeds
it leaves di pointing at the next character is the staring
after the matched pattern; if it fails
it preserves the value in di across
the function call.
To convert a set of productions to their corresponding assembly code we need to be able to handle four things: terminal symbols nonterminal symbols alternation and the empty string. First we will consider simple functions (nonterminals) which do not have multiple productions (i.e. alternation).
If a production takes the form T ![]()
and there are
no other productions associated with T
then this production always succeeds. The
corresponding assembly code is simply:
T
proc
near
stc
ret
T endp
Of course there is no real need to ever call T and test the returned result since we know it will always succeed. On the other hand if T is a stub that you intend to fill in later you should call T.
If a production takes the form T
xyz
where
xyz is a string of one or more terminal symbols
then the function returns success if the
next several input characters match xyz
it returns failure otherwise. Remember
if the
prefix of the input string matches xyz
then the matching function must advance di
beyond these characters. If the first characters of the input string does not match xyz
it must preserve di. The following routines demonstrate two cases
where xyz
is a single character and where xyz is a string of characters:
T1 proc near cmp byte ptr es:[di] 'x' ;Single char. je Success clc ;Return Failure. ret Success: inc di ;Skip matched char. stc ;Return success. ret T1 endp T2 proc near call MatchPrefix byte 'xyz' 0 ret T2 endp
MatchPrefix
is a routine that matches the prefix of
the string pointed at by es:di against the string following the call in the code stream.
It returns the carry set and adjusts di if the string in the code stream is a
prefix of the input string
it returns the carry flag clear and preserves di
if the literal string is not a prefix of the input. The MatchPrefix code
follows:
MatchPrefix proc far ;Must be far! push bp mov bp sp push ax push ds push si push di lds si 2[bp] ;Get the return address. CmpLoop: mov al ds:[si] ;Get string to match. cmp al 0 ;If at end of prefix je Success ; we succeed. cmp al es:[di] ;See if it matches prefix jne Failure ; if not immediately fail. inc si inc di jmp CmpLoop Success: add sp 2 ;Don't restore di. inc si ;Skip zero terminating byte. mov 2[bp] si ;Save as return address. pop si pop ds pop ax pop bp stc ;Return success. ret Failure: inc si ;Need to skip to zero byte. cmp byte ptr ds:[si] 0 jne Failure inc si mov 2[bp] si ;Save as return address. pop di pop si pop ds pop ax pop bp clc ;Return failure. ret MatchPrefix endp
If a production takes the form T
R
where R is a nonterminal
then the T function calls R and
returns whatever status R returns
e.g.
T proc near call R ret T endp
If the right hand side of a production contains a string of
terminal and nonterminal symbols
the corresponding assembly code checks each item in
turn. If any check fails
then the function returns failure. If all items succeed
then
the function returns success. For example
if you have a production of the form T
R
abc S you could implement this in assembly language as
T proc near push di ;If we fail must preserve di. call R jnc Failure call MatchPrefix byte "abc" 0 jnc Failure call S jnc Failure add sp 2 ;Don't preserve di if we succeed. stc ret Failure: pop di clc ret T endp
Note how this code preserves di if it fails but does not preserve di if it succeeds.
If you have multiple productions with the same left hand
side (i.e.
alternation)
then writing an appropriate matching function for the
productions is only slightly more complex than the single production case. If you have
multiple productions associated with a single nonterminal on the left hand side
then
create a sequence of code to match each of the individual productions. To combine them
into a single matching function
simply write the function so that it succeeds if any one
of these code sequences succeeds. If one of the productions is of the form T
e
then test
the other conditions first. If none of them could be selected
the function succeeds. For
example
consider the productions:
E'+ factor E' | - factor E' |
This translates to the following assembly code:
EPrime proc near push di cmp byte ptr es:[di] '+' jne TryMinus inc di call factor jnc EP_Failed call EPrime jnc EP_Failed Success: add sp 2 stc ret TryMinus: cmp byte ptr es:[di] '-' jne EP_Failed inc di call factor jnc EP_Failed call EPrime jnc EP_Failed add sp 2 stc ret EP_Failed: pop di stc ;Succeed because of E' -> e ret EPrime endp
This routine always succeeds because it has the production E'
![]()
. This is why
the stc instruction appears after the EP_Failed label.
To invoke a pattern matching function
simply load es:di
with the address of the string you want to test and call the pattern matching function. On
return
the carry flag will contain one if the pattern matches the string up to the point
returned in di. If you want to see if the entire string matches the pattern
simply check
to see if es:di is pointing at a zero byte when you get back from the
function call. If you want to see if a string belongs to a context free language
you
should call the function associated with the starting symbol for the given context free
grammar.
The following program implements the arithmetic grammar we've been using as examples throughout the past several sections. The complete implementation is
; ARITH.ASM
;
; A simple recursive descent parser for arithmetic strings.
.xlist
include stdlib.a
includelib stdlib.lib
.list
dseg segment para public 'data'
; Grammar for simple arithmetic grammar (supports +
-
*
/):
;
; E -> FE'
; E' -> + F E' | - F E' | <empty string>
; F -> TF'
; F' -> * T F' | / T F' | <empty string>
; T -> G | (E)
; G -> H | H G
; H -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
;
InputLine byte 128 dup (0)
dseg ends
cseg segment para public 'code'
assume cs:cseg
ds:dseg
; Matching functions for the grammar.
; These functions return the carry flag set if they match their
; respective item. They return the carry flag clear if they fail.
; If they fail
they preserve di. If they succeed
di points to
; the first character after the match.
; E -> FE'
E proc near
push di
call F ;See if F
then E'
succeeds.
jnc E_Failed
call EPrime
jnc E_Failed
add sp
2 ;Success
don't restore di.
stc
ret
E_Failed: pop di ;Failure
must restore di.
clc
ret
E endp
; E' -> + F E' | - F E' | e
EPrime proc near
push di
; Try + F E' here
cmp byte ptr es:[di]
'+'
jne TryMinus
inc di
call F
jnc EP_Failed
call EPrime
jnc EP_Failed
Success: add sp
2
stc
ret
; Try - F E' here.
TryMinus: cmp byte ptr es:[di]
'-'
jne Success
inc di
call F
jnc EP_Failed
call EPrime
jnc EP_Failed
add sp
2
stc
ret
; If none of the above succeed
return success anyway because we have
; a production of the form E' -> e.
EP_Failed: pop di
stc
ret
EPrime endp
; F -> TF'
F proc near
push di
call T
jnc F_Failed
call FPrime
jnc F_Failed
add sp
2 ;Success
don't restore di.
stc
ret
F_Failed: pop di
clc
ret
F endp
; F -> * T F' | / T F' | e
FPrime proc near
push di
cmp byte ptr es:[di]
'*' ;Start with "*"?
jne TryDiv
inc di ;Skip the "*".
call T
jnc FP_Failed
call FPrime
jnc FP_Failed
Success: add sp
2
stc
ret
; Try F -> / T F' here
TryDiv: cmp byte ptr es:[di]
'/' ;Start with "/"?
jne Success ;Succeed anyway.
inc di ;Skip the "/".
call T
jnc FP_Failed
call FPrime
jnc FP_Failed
add sp
2
stc
ret
; If the above both fail
return success anyway because we've got
; a production of the form F -> e
FP_Failed: pop di
stc
ret
FPrime endp
; T -> G | (E)
T proc near
; Try T -> G here.
call G
jnc TryParens
ret
; Try T -> (E) here.
TryParens: push di ;Preserve if we fail.
cmp byte ptr es:[di]
'(' ;Start with "("?
jne T_Failed ;Fail if no.
inc di ;Skip "(" char.
call E
jnc T_Failed
cmp byte ptr es:[di]
')' ;End with ")"?
jne T_Failed ;Fail if no.
inc di ;Skip ")"
add sp
2 ;Don't restore di
stc ; we've succeeded.
ret
T_Failed: pop di
clc
ret
T endp
; The following is a free-form translation of
;
; G -> H | H G
; H -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
;
; This routine checks to see if there is at least one digit. It fails if there
; isn't at least one digit; it succeeds and skips over all digits if there are
; one or more digits.
G proc near
cmp byte ptr es:[di]
'0' ;Check for at least
jb G_Failed ; one digit.
cmp byte ptr es:[di]
'9'
ja G_Failed
DigitLoop: inc di ;Skip any remaining
cmp byte ptr es:[di]
'0' ; digits found.
jb G_Succeeds
cmp byte ptr es:[di]
'9'
jbe DigitLoop
G_Succeeds: stc
ret
G_Failed: clc ;Fail if no digits
ret ; at all.
G endp
; This main program tests the matching functions above and demonstrates
; how to call the matching functions.
Main proc
mov ax
seg dseg ;Set up the segment registers
mov ds
ax
mov es
ax
printf
byte "Enter an arithmetic expression: "
0
lesi InputLine
gets
call E
jnc BadExp
; Good so far
but are we at the end of the string?
cmp byte ptr es:[di]
0
jne BadExp
; Okay
it truly is a good expression at this point.
printf
byte "'%s' is a valid expression"
cr
lf
0
dword InputLine
jmp Quit
BadExp: printf
byte "'%s' is an invalid arithmetic expression"
cr
lf
0
dword InputLine
Quit: ExitPgm
Main endp
cseg ends
sseg segment para stack 'stack'
stk byte 1024 dup ("stack ")
sseg ends
zzzzzzseg segment para public 'zzzzzz'
LastBytes byte 16 dup (?)
zzzzzzseg ends
end Main
The techniques presented in this chapter for converting CFGs to assembly code do not work for all CFGs. They only work for a (large) subset of the CFGs known as LL(1) grammars. The code that these techniques produce is a recursive descent predictive parser. Although the set of context free languages recognizable by an LL(1) grammar is a subset of the context free languages it is a very large subset and you shouldn't run into too many difficulties using this technique.
One important feature of predictive parsers is that they do not require any backtracking. If you are willing to live with the inefficiencies associated with backtracking it is easy to extended a recursive descent parser to handle any CFG. Note that when you use backtracking the predictive adjective goes away you wind up with a nondeterministic system rather than a deterministic system (predictive and deterministic are very close in meaning in this case).
There are other CFG systems as well as LL(1). The so-called operator precedence and LR(k) CFGs are two examples. For more information about parsing and grammars consult a good text on formal language theory or compiler construction (see the bibliography).
16.1.8 Beyond Context Free Languages
Although most patterns you will probably want to process will be regular or context free there may be times when you need to recognize certain types of patterns that are beyond these two (e.g. context sensitive languages). As it turns out the finite state automata are the simplest machines; the pushdown automata (that recognize context free languages) are the next step up. After pushdown automata the next step up in power is the Turing machine. However Turing machines are equivalent in power to the 80x86 so matching patterns recognized by Turing machines is no different than writing a normal program.
The key to writing functions that recognize patterns that are not context free is to maintain information in variables and use the variables to decide which of several productions you want to use at any one given time. This technique introduces context sensitivity. Such techniques are very useful in artificial intelligence programs (like natural language processing) where ambiguity resolution depends on past knowledge or the current context of a pattern matching operation. However the uses for such types of pattern matching quickly go beyond the scope of a text on assembly language programming so we will let some other text continue this discussion.
|
Table of Content | Chapter Sixteen (Part 5) |
Chapter Sixteen: Pattern Matching
(Part 4)
29 SEP 1996