Convert a tag pattern to a regular expression pattern. A tag pattern is a
modified version of a regular expression, designed for matching sequences
of tags. The differences between regular expression patterns and tag
patterns are:
- In tag patterns, '<' and '>' act as parentheses; so '<NN>+' matches
  one or more repetitions of '<NN>', not '<NN' followed by one or more
  repetitions of '>'.
- Whitespace in tag patterns is ignored. So '<DT> | <NN>' is equivalent
  to '<DT>|<NN>'.
- In tag patterns, '.' is equivalent to '[^{}<>]'; so '<NN.*>' matches
  any single tag starting with 'NN'.
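The differences listed above can be checked with Python's standard `re` module alone. The snippet below is only an illustration: the regular expressions on the "tag pattern" side are written out by hand to show the intended reading, and the variable names are made up for this example.

```python
import re

# Read as a plain regular expression, '+' in '<NN>+' applies only to '>'.
plain = re.compile(r"<NN>+")
# Read as a tag pattern, '<' and '>' group the tag, so '+' repeats '<NN>'.
grouped = re.compile(r"(<(NN)>)+")

print(plain.fullmatch("<NN>>>") is not None)      # True
print(plain.fullmatch("<NN><NN>") is not None)    # False
print(grouped.fullmatch("<NN><NN>") is not None)  # True

# Whitespace is ignored, so these two tag patterns are the same.
stripped = re.sub(r"\s", "", "<DT> | <NN>")
print(stripped == "<DT>|<NN>")                    # True

# A plain '.' can run across tag boundaries; '[^{}<>]' cannot.
dot_plain = re.compile(r"<NN.*>")
dot_tag = re.compile(r"(<(NN[^{}<>]*)>)")
print(dot_plain.fullmatch("<NN><VB>") is not None)  # True
print(dot_tag.fullmatch("<NN><VB>") is not None)    # False
print(dot_tag.fullmatch("<NNS>") is not None)       # True
```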
In particular, tag_pattern2re_pattern performs the
following transformations on the given pattern:
- Replace '.' with '[^<>{}]'.
- Remove any whitespace.
- Add extra parentheses around '<' and '>', so that '<' and '>' act
  like parentheses. For example, in '<NN>+' the '+' has scope over the
  entire '<NN>', and in '<NN|IN>' the '|' has scope over 'NN' and 'IN'
  but not over '<' or '>'.
- Check that the resulting pattern is valid.
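The transformations above can be sketched in a few lines of Python. This is a simplified illustration and not the library's actual implementation: the name `sketch_tag_pattern_to_re` is hypothetical, escaped dots ('\.') are not treated specially, and the steps run in a different order than listed so that the '<' and '>' inside the inserted character class are not themselves wrapped in parentheses.

```python
import re

# What '.' stands for inside a tag pattern (taken from the description above).
TAG_CHAR = "[^<>{}]"

def sketch_tag_pattern_to_re(tag_pattern):
    """Hypothetical, simplified take on the documented conversion."""
    # Remove any whitespace.
    pattern = re.sub(r"\s", "", tag_pattern)

    # Reject braces and nested or mismatched angle brackets.
    if "{" in pattern or "}" in pattern:
        raise ValueError("Bad tag pattern: %r" % tag_pattern)
    depth = 0
    for ch in pattern:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if depth not in (0, 1):
            raise ValueError("Bad tag pattern: %r" % tag_pattern)
    if depth != 0:
        raise ValueError("Bad tag pattern: %r" % tag_pattern)

    # Add extra parentheses so '<' and '>' act like parentheses.
    pattern = pattern.replace("<", "(<(").replace(">", ")>)")

    # Replace '.' with the restricted character class. Assumption: every
    # '.' is unescaped; a full implementation would skip '\.'.
    pattern = pattern.replace(".", TAG_CHAR)

    # Check that the result is a valid regular expression.
    try:
        re.compile(pattern)
    except re.error:
        raise ValueError("Bad tag pattern: %r" % tag_pattern)
    return pattern

print(sketch_tag_pattern_to_re("<NN>+"))        # (<(NN)>)+
print(sketch_tag_pattern_to_re("<DT> | <NN>"))  # (<(DT)>)|(<(NN)>)
print(sketch_tag_pattern_to_re("<NN.*>"))       # (<(NN[^<>{}]*)>)
```

With this sketch, inputs such as '<<NN>>', '<NN' or '<NN>{2}' raise ValueError, which mirrors the restrictions on braces and angle brackets described for the real function.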
- Parameters:
  tag_pattern (string) - The tag pattern to convert to a regular
  expression pattern.
- Returns:
  string - A regular expression pattern corresponding to tag_pattern.
- Raises:
  ValueError - If tag_pattern is not a valid tag pattern. In
  particular, tag_pattern should not include braces, and it should not
  contain nested or mismatched angle brackets.