View previous topic :: View next topic |
Author |
Message |
Michael Pro
Joined: 20 Aug 2007 Posts: 3
|
Posted: Tue Sep 18, 2007 5:33 am Post subject: Regular expressions in lexer grammar |
|
|
Let me explain my problem.
Now I'm writing modified lexer for Pascal Script and I've wrote some grammar constructions for my new lexer.
For example, here is a part of my definitions
IdentList = <Identifier> (',' <Identifier>)*;
TypeRes = ':' <Identifier>;
IdentWithType = IdentList TypeRes;
IdentWithTypeList = (IdentWithType ';')+;
It's defining a kind of typical pascal declarations
FMode, FStatus: Boolean;
Counter: Integer;
Velocity: Double;
etc
Although, as you see, this definitions use greedy algorithm for token finding - it will find only two first strings in memo, ignoring the last one with Velocity.
I've tried to use standard '+?' for IdentWithTypeList to use non-greedy method:
IdentWithTypeList = (IdentWithType ';')+?;
But it doesn't work - grammar isn't compiling. What should I do in this situation? How should I use non-greedy algorithm in grammar? It's vital, because there is no other way in defining recursive constructions.
Best Regards,
Michael. |
|
Back to top |
|
|
Michael Pro
Joined: 20 Aug 2007 Posts: 3
|
Posted: Tue Sep 18, 2007 5:45 am Post subject: |
|
|
Well, I've fixed some definitions in Rules.
I'm using rule for variable list detecting with this grammar and Self closing range.
Despite this the analyzer doesn't use greedy algorythm at all.
In Memo
FMode, FStatus: Boolean;
Counter: Integer;
Velocity: Double;
It finds only two first as desired block and doesn't affects the last block - it seems like it's work as "non-greedy". But declaration
IdentWithTypeList = (IdentWithType ';')+;
means that all three strings must be found - because selection method is "greedy". |
|
Back to top |
|
|
econtrol Site Admin
Joined: 09 Jun 2006 Posts: 202
|
Posted: Tue Sep 18, 2007 6:16 pm Post subject: |
|
|
Hello.
Back scanning algorithm is used, so if you link range (with "Initially closed" flag) to rule
IdentWithTypeList = (IdentWithType ';')+;
there will be created 3 ranges.
FMode, FStatus: Boolean;
FMode, FStatus: Boolean;
Counter: Integer;
FMode, FStatus: Boolean;
Counter: Integer;
Velocity: Double;
This occurs due to rule is checked at each token and in back order.
It is not true grammar, it is only extended test token sequence algorithm.
It is intended to improve condition checking, but not for real parsing of text.
To correctly extract this text you should define terminate condition, for example:
IdentWithTypeList = (IdentWithType ';')+ NextSectionStart;
NextSectionStart = 'begin' | 'const' | 'var';
In this case first will be checked 'NextSectionStart' and then (IdentWithType ';')+.
Disadvantage 'NextSectionStart' will be included in range, that is not good.
I think it will be good to combine grammar rule and usual conditions. For example, if usual condition is defined and is tested successfully grammar rule is tested.
LALR algorithm is not appropriate for purpose of sequence testing (it requiers code consistency around tested sequence). Implemented algorithm is intended for testing sequences independently on surrounding text.
P.S. I have plans to implement LALR algorithm as additional method of text analysis.
Michael. |
|
Back to top |
|
|
Michael Pro
Joined: 20 Aug 2007 Posts: 3
|
Posted: Wed Sep 19, 2007 5:03 am Post subject: |
|
|
Thanks a lot.
Yesterday I've solved this problem in similiar way - created a list with standard Script Pascal "ending sequences" for variables section |
|
Back to top |
|
|
|