"The asteroid to kill this dinosaur is still in orbit." – Lex Manual
"Optimize: this currently has no effect in Boost.Regex." – Boost Manual
"Reflex: a thing that is determined by and reproduces the essential features or qualities of something else." – Oxford Dictionary
RE/flex is a flexible scanner-generator framework for generating regex-centric, Flex-compatible scanners. The RE/flex command-line tool is compatible with the Flex command-line tool. RE/flex accepts standard Flex specification syntax and supports Flex options.
Features:
RE/flex is not merely designed to fix the limitations of Flex and Lex! RE/flex balances efficiency with flexibility by offering a choice of regex engines that are used by the generated scanner. The choice includes Boost.Regex and RE/flex matcher engines that offer a rich regex syntax. The RE/flex POSIX matcher adds lazy quantifiers, word boundary anchors, and other useful patterns to the POSIX mode of matching. Also Unicode character sets and ASCII/UTF-8/16/32 file input is supported by RE/flex, without any additional coding required. RE/flex regex patterns are converted to efficient deterministic finite state machines. These machines differ from Flex in supporting the new pattern-matching features.
RE/flex incorporates proper object-oriented design principles and does not rely on macros and globals as Flex does. Macros and globals are only added when the Flex-compatibility option −−flex
is used when invoking the reflex scanner generator. However, in all cases the reflex scanner generator produces C++ scanner classes derived from a base lexer class template, with a matcher engine as the template parameter. This offers an extensible approach that permits new regex matching engines to be included in this framework in the future.
Use the reflex scanner generator with two options −−flex
and −−bison
to output Flex C-compatible code. These options generate the global non-reentrant "yy" functions and variables, such as yylex()
and yytext
.
In this document we refer to a regex as a shorthand for regular expression. Some of you may not agree with this broad use of terminology. The term regular expressions refers to the formal concept of regular languages, wheras regex refers to backtracking-based regex matching that Perl introduced, that could no longer be said to be regular in a true mathematical sense.
In summary, RE/flex is
Lex, Flex and variants are powerful scanner generators that generate scanners (a.k.a. lexical analyzers and lexers) from lex specifications. These lex specifications define patterns with user-defined actions that are executed when their patterns match the input stream. The scanner repeatedly matches patterns and triggers these actions until the end of the input stream is reached.
Both Lex and Flex are popular to develop tokenizers in which the user-defined actions emit or return a token when the corresponding pattern matches. These tokenizers are typically implemented to scan and tokenize the source code for a compiler or an interpreter of a programming language. The regular expression patterns in a tokenizer define the make-up of identifiers, constants, keywords, punctuation, and to skip over white space in the source code that is scanned.
Consider for example the following patterns and associated actions defined in a lex specification:
When the tokenizer matches a pattern, the corresponding action is invoked. The example above returns tokens to the compiler's parser, which repeatedly invokes the tokenizer for more tokens until the tokenizer reaches the end of the input. The tokenizer returns zero (0) when the end of the input is reached.
Lex and Flex have remained relatively stable (inert) tools while the demand has increased for tokenizing Unicode texts encoded in common wide character formats such as UTF-8, UCS/UTF-16, and UTF-32. Lex/Flex still use 8-bit character sets for regex patterns. Regex pattern syntax in Lex/Flex is also limited. No lazy repetitions. No word boundary anchors. No indent and dedent matching.
It is possible, but not trivial to implement scanners with Lex/Flex to tokenize the source code of more modern programming languages with Unicode-based lexical structures, such as Java, C#, and C++11.
A possible approach is to use UTF-8 in patterns and reformat the input to UTF-8 for matching. However, the UTF-8 patterns for common Unicode character classes are unrecognizable by humans and are prone to errors when written by hand. The UTF-8 pattern to match a Unicode letter \p{L}
is hundreds of lines long!
Furthermore, the regular expression syntax in Lex/Flex is limited to meet POSIX mode matching constraints. Scanners should use POSIX mode matching, as we will explain below. To make things even more interesting, scanners should avoid the "greedy trap" when matching input.
Lex/Flex scanners use POSIX pattern matching, meaning that the leftmost longest match is returned (among a set of patterns that match the same input). Because POSIX matchers produce the longest match for any given input text, we should be careful when using patterns with "greedy" repetitions (X*
, X+
etc.) because our pattern may gobble up more input than intended. We end up falling into the "greedy trap".
To illustrate this trap consider matching HTML comments <!−− ... −−>
with the pattern <!−−.*−−>
. The problem is that the repetition X*
is greedy and the .*−−>
pattern matches everything until the last −−>
while moving over −−>
that are between the <!−−
and the last −−>
.
.
normally does not match newline \n
in Lex/Flex patterns, unless we use dot-all mode that is sometimes confusingly called "single line mode".We can use much more complex patterns such as <!−−([^−]|−[^−]|−−+[^−>])*−*−−>
just to match comments in HTML, by ensuring the pattern ends at the first match of a −−>
in the input and not at the very last −−>
in the input. The POSIX leftmost longest match can be surprisingly effective in rendering our tokenizer into works of ASCII art!
We may claim our intricate pattern trophies as high achievements to the project team, but our team will quickly point out that a regex <!−−.*?−−>
suffices to match HTML comments with the lazy repetition X*?
construct, also known as a non-greedy repeat. The ?
is a lazy quantifier that modifies the behavior of the X*?
repeat to match only X
repeately if the rest of the pattern does not match. Therefore, the regex <!−−.*?−−>
matches HTML comments and nothing more.
But Lex/Flex does not permit us to be lazy!
Not surprising, even the Flex manual shows ad-hoc code rather than a pattern to scan over C/C++ source code input to match multiline comments that start with a /*
and end with the first occurrence of a */
. The Flex manual recommends:
Another argument to use this code with Flex is that the internal Flex buffer is limited to 16KB. By contrast, RE/flex buffers are dynamically resized and will accept long matches.
Workarounds such as these are not necessary with RE/flex. The RE/flex scanners use regex libraries with expressive pattern syntax. We can use lazy repetition to write a regex pattern for multiline comments as follows:
Most regex libraries support syntaxes and features that we have come to rely on for pattern matching. A regex with lazy quantifiers can be much easier to read and comprehend compared to a greedy variant. Most regex libraries that support lazy quantifiers run in Perl mode, using backtracking over the input. Scanners use POSIX mode matching, meaning that the leftmost longest match is found. The difference is important as we saw earlier and even more so when we consider the problems with Perl mode matching when specifying patterns to tokenize input, as we will explain next.
Consider the lex specification at the start of this section. Suppose the input to tokenize is iflag = 1
. In POSIX mode we return ASCII_IDENTIFIER
for the name iflag
, OP_ASSIGN
for =
, and NUMBER
for 1
. In Perl mode, we find that iflag
matches if
and the rest of the name is not consumed, which gives KEYWORD_IF
for if
, ASCII_IDENTIFIER
for lag
, OP_ASSIGN
for =
, and a NUMBER
for 1
. Perl mode matching greedely returns leftmost matches.
Using Perl mode in a scanner requires all overlapping patterns to be defined in a lex specification with longer patterns defined first to avoid partial matches of tokens as demonstrated. By contrast, POSIX mode is declarative and allows you to define the patterns in the specification in any order. Perhaps the only ordering constraint on patterns is for patterns that match the same input, such such as matching the keyword if
in the example: KEYWORD_IF
must be matched before ASCII_IDENTIFIER
.
For this reason, RE/flex scanners use a regex library in POSIX mode by default.
In summary, the advantages that RE/flex has to offer include:
−−flex
and/or −−bison
options. This eliminates a learning curve to use RE/flex.−−bison
generates a scanner compatible with Bison. RE/flex also supports Bison bridge (pure reentrant and MT-safe) parsers.The RE/flex scanner generator section has more details on the RE/flex scanner generator tool.
In the next part of this manual, we will take a quick look at the RE/flex regex API that can be used as a stand-alone library for matching, searching, scanning and splitting input from strings, files and streams in regular C++ applications (i.e. applications that are not necessarily tokenizers for compilers).
The RE/flex regex pattern matching classes include two classes for Boost.Regex, two classes for C++11 std::regex, and a RE/flex class:
Engine | Header file to include | reflex matcher classes |
---|---|---|
RE/flex regex | reflex/matcher.h | Matcher |
Boost.Regex | reflex/boostmatcher.h | BoostMatcher , BoostPosixMatcher |
std::regex | reflex/stdmatcher.h | StdMatcher , StdPosixMatcher |
The RE/flex reflex::Matcher
class compiles regex patterns to efficient finite state machines (FSMs) when instantiated. These deterministic automata speed up matching considerably, at the cost of the initial FSM construction (see further below for hints on how to avoid this run time overhead).
C++11 std::regex supports ECMAScript and AWK POSIX syntax with the StdMatcher
and StdPosixMatcher
classes respectively. The std::regex syntax is therefore a lot more limited compared to Boost.Regex and RE/flex.
The RE/flex regex common interface API is implemented in an abstract base class template reflex::AbstractMatcher
from which regex matchers are derived. This regex API offers a common interface that is used in the generated scanner. You can also use this API in your C++ application for pattern matching.
The RE/flex abstract matcher offers four operations for matching with the regex engines that are derived from this base abstract class:
Method | Result |
---|---|
matches() | true if the input from begin to end matches the regex pattern |
find() | search input and return true if a match was found |
scan() | scan input and return true if input at current position matches |
split() | split input at the next match |
These methods are repeatable, where the last three return additional matches.
For example, to check if a string is a valid date:
To search a string for words \w+
:
When executed this code prints:
Found How Found now Found brown Found cow
The scan
method is similar to the find
method, but scan
matches only from the current position in the input. It fails when the next match does not start from the current position. Scanning an input source means that matches must be continuous in this input source.
The split
method is roughly the inverse of the find
method and returns text located between matches. For example using non-word matching \W+
:
When executed this code prints:
Found How Found now Found brown Found cow Found
Note that split also returns the (possibly empty) remaining text after the last match, as you can see in the output above: the last split with \W+
returns an empty string, which is the remaining input after the period in the sentence.
The regex engines currently available as classes in the reflex
namespace are:
Class | Mode | Engine | Performance |
---|---|---|---|
Matcher | POSIX | RE/flex lib | deterministic finite automaton |
BoostMatcher | Perl | Boost.Regex | regex backtracking |
BoostPerlMatcher | Perl | Boost.Regex | regex backtracking |
BoostPosixMatcher | POSIX | Boost.Regex | regex backtracking |
StdMatcher | ECMA | std::regex | regex backtracking |
StdEcmaMatcher | ECMA | std::regex | regex backtracking |
StdPosixMatcher | POSIX | std::regex | regex backtracking |
The RE/flex regex engine uses a deterministic finite state machine (FSM) to get the best performance when matching. However, constructing a FSM adds overhead. This matcher is better suitable for searching long texts. The FSM construction overhead can be eliminated by pre-converting the regex to C++ code tables ahead of time as we will see later.
The Boost.Regex engines normally use Perl mode matching. We added a POSIX mode Boost.Regex engine class for the RE/flex scanner generator. Scanners typically use POSIX mode matching. See POSIX versus Perl matching for more information.
The Boost.Regex engines are all initialized with match_not_dot_newline
, which disables dotall matching as the default setting. Dotall can be re-enabled with the (?s)
regex mode modifier. This is done for compatibility with scanners.
A matcher can be applied to strings and wide strings, such as std::string
and std::wstring
, char*
and wchar_t*
. Wide strings are converted to UTF-8 to enable matching with regular expressions that contain Unicode patterns.
You can pattern match text in files. File contents are streamed and not loaded as a whole into memory, meaning that the data stream is not limited in size and matching happens immediately. Interactive mode permits matching the input from a console (a TTY device generates a potentially endless stream of characters):
We can also pattern match text from FILE
descriptors. The additional benefit of using FILE
descriptors is the automatic decoding of UTF-16/32 input to UTF-8 by the reflex::Input
class that manages input sources and their state.
For example, pattern matching the content of "cows.txt" that may use UTF-8, 16, or 32 encodings:
The find
, scan
, and split
methods are also implemented as input iterators that apply filtering tokenization, and splitting:
Iterator range | Acts as a | Iterates over |
---|---|---|
find.begin() ...find.end() | filter | all matches |
scan.begin() ...scan.end() | tokenizer | continuous matches |
split.begin() ...split.end() | splitter | text between matches |
The type reflex::AbstractMatcher::Operation
is a functor that defines find
, scan
, and split
. The functor operation returns true upon success. The use of an iterator is simply supported by invoking begin()
and end()
methods of the functor, which return reflex::AbstractMatcher::iterator
. Likewise, there are also cbegin()
and cend()
methods that return a const_iterator
.
We can use these RE/flex iterators in C++ for many tasks, including to populate containers by stuffing the iterator's text matches into it:
As a result, the words
vector contains "How", "now", "brown", "cow".
Casting a matcher object to const char*
is the same as invoking text()
, and the cast is implicitly applied in the example above where the matcher object is cast to the matched text that is then used to instantiate std::string
for the words
vector. Beware that text()
or a cast to const char*
simply returns a pointer to an internal buffer. Because this pointer is no longer valid after the matcher continued, we should store the matched text in a std::string
.
RE/flex iterators are useful in C++11 range-based loops. For example:
When executed this code prints:
Found How Found now Found brown Found cow
And RE/flex iterators are also useful with STL algorithms and lambdas, for example to compute a histogram of word frequencies:
As a result, the freq
array contains 0, 1, 1, and 2.
Casting the matcher object to a size_t
returns the group capture index, which is used in the example shown above. We also us it in the example below that is capturing all regex pattern groupings into a vector:
As a result, the vector contains the group captures 3, 1, 3, and 2.
Casting the matcher object to size_t
is the same as invoking accept()
.
You can use this method and other methods to obtain the details of a match:
Method | Result |
---|---|
accept() | returns group capture or zero if not captured/matched |
text() | returns \0 -terminated const char* string match |
pair() | returns std::pair<size_t,const char*>(accept(), text()) |
size() | returns the length of the match in bytes |
wsize() | returns the length of the match in number of (wide) characters |
rest() | returns \0 -terminated const char* of the rest of the input |
more() | tells the matcher to append the next match (adjacent matches) |
less(n) | cuts text() to n bytes and repositions the matcher |
lineno() | returns line number of the match, starting with line 1 |
columno() | returns column number of the match, starting with 0 |
first() | returns position of the first character of the match |
last() | returns position of the last + 1 character of the match |
at_bol() | true if matcher reached the begin of a new line \n |
at_bob() | true if matcher is at the start of input, no matches consumed |
at_end() | true if matcher is at the end of input |
The first three methods return values that can also be obtained by type casting the matcher object to size_t
(or to an integer type), to const char*
(or to std::string
), and to std::pair<size_t,const char*>
(or to a type-compatible form such as std::pair<uint64_t,std::string>
.
reflex::Matcher
class, the accept()
method returns the accepted pattern among the alternations in the regex that are specified only at the top level in the regex. For example, the regex "(a(b)c)|([A-Z])"
has two groups, because only the outer top-level groups are recognized. Because groups are specified at the top level only, the grouping parenthesis are optional. We can simplify the regex to "a(b)c|[A-Z]"
and still capture the two patterns.text()
method returns the match by pointing to the const char*
string that is stored in an internal buffer. This pointer should not be used after matching continues and when the matcher object is deallocated. To retain the text()
string value we recommend to instantiate a std::string
.Three special methods can be used to manipulate the input stream directly:
Method | Result |
---|---|
input() | returns the next character from the input, matcher then skips it |
unput(c) | put character c back unto the stream, matcher then takes it |
peek() | returns the next character on the input without consuming it |
To initialize a matcher for interactive use, to assign a new input source or to change its pattern, you can use the following methods:
Method | Result |
---|---|
input(i) | set input to reflex::Input i (string, stream, or FILE* ) |
pattern(p) | set pattern p string, reflex::Pattern or boost::regex |
has_pattern() | true if the matcher has a pattern assigned to it |
own_pattern() | true if the matcher has a pattern to manage and to delete |
pattern() | get the pattern object, reflex::Pattern or boost::regex |
buffer() | buffer all input at once, returns true if successful |
buffer(n) | set the adaptive buffer size to n bytes to buffer input |
interactive() | sets buffer size to 1 for console-based (TTY) input |
flush() | flush the remaining input from the internal buffer |
reset() | resets the matcher, restarting it from the remaining input |
reset(o) | resets the matcher with new options string o ("A?N?T?") |
A reflex::Input
object represents the source of input for a matcher, which is either a file FILE*
, or a string (with UTF-8 character data) of const char*
or std::string
type, or a stream pointer std::istream*
. The reflex::Input
object is implicitly constructed from one of these input sources, for example:
So far we explained how to use reflex::BoostMatcher
for pattern matching. We can also use the RE/flex reflex::Matcher
class for pattern matching. The API is exactly the same. The reflex::Matcher
class uses reflex::Pattern
, which internally represents an efficient finite state machine that is compiled from a regex. These state machines are used for fast matching.
The construction of deterministic finite state machines (FSMs) is optimized but can take some time and therefore adds overhead before matching can start. This FSM construction should not be executed repeatedly if it can be avoided. So we recommend to construct static pattern objects to create the FSMs only once:
The RE/flex matcher only supports POSIX mode matching and does not support Perl mode matching. See POSIX versus Perl matching for more information.
The RE/flex reflex::Pattern
class has several options that control the regex. Options and modes for the regex are set as a string, for example:
The f=graph.gv
option emits a Graphviz .gv file that can be visually rendered with the open source Graphviz dot tool by converting the deterministic finite state machine (FSM) to PDF, PNG, or other formats:
The f=machine.cpp
option emits opcode tables for the finite state machine, which in this case is the following table of 11 code words:
Option o
can be used with f=machine.cpp
to emit optimized native C++ code for the finite state machine:
The compact FSM opcode tables or the optimized larger FSM code may be used directly in your code. This omits the FSM construction overhead at runtime. You can simply include this generated file in your source code and pass it on to the reflex::Pattern
constructor:
The RE/flex reflex::Pattern
construction options are given as a string:
Option | Effect |
---|---|
b | bracket lists are parsed without converting escapes |
e=c; | redefine the escape character |
f=file.cpp; | save finite state machine code to file.cpp |
f=file.gv; | save deterministic finite state machine to file.gv |
i | case-insensitive matching, same as (?i)X |
l | Lex-style trailing context with / , same as (?l)X |
m | multiline mode, same as (?m)X |
n=name; | use reflex_code_name for the machine (instead of FSM ) |
o | only with option f : generate optimized FSM native C++ code |
q | Lex-style quotations "..." equal \Q...\E , same as (?q)X |
r | throw regex syntax error exceptions (not just fatal errors) |
s | dot matches all (aka. single line mode), same as (?s)X |
x | free space mode with inline comments, same as (?x)X |
w | display regex syntax errors before raising them as exceptions |
For example, reflex::Pattern pattern(pattern, "isr")
enables case-insensitive dot-all matching with syntax errors thrown as reflex::Pattern::Error
types of exceptions.
In summary:
The RE/flex regex library section has more information about the RE/flex regex library.
The RE/flex scanner generator tool takes a lex specification and generates a regex-based C++ lexer class that is saved in lex.yy.cpp, or saved to the file specified by the -o
command-line option. This file is then compiled and linked with a regex-library to produce a scanner. A scanner can be a stand-alone application or part of a larger program such as a compiler:
The RE/flex-generated scanners use the RE/flex regex library API for pattern matching. The RE/flex regex library API is defined by the abstract class reflex::AbstractMatcher
.
There are two regex matching engines to choose from for the generated scanner: the Boost.Regex library (assuming Boost.Regex is installed) or the RE/flex POSIX matcher engine. The libreflex
library should be linked and also libboost_regex
when needed.
The input class reflex::Input
of the libreflex
library manages input from strings, wide strings, streams, and data from FILE
descriptors. File data may be encoded in ASCII, binary or in UTF-8/16/32. UTF-16/32 is automatically decoded and converted to UTF-8 for UTF-8-based regex matching:
The generated scanner executes actions (typically to produce tokens for a parser). The actions are triggered by matching patterns to the input.
The reflex command generates a C++ scanner class in a source code file given a lex specification. The reflex command accepts −−flex
and −−bison
options for compatibility. These options allow reflex to be used as a replacement of the classic Flex and Lex tools:
$ reflex −−flex −−bison lexspec.l
The first option −−flex
specifies that lexspec.l
is a classic Lex specification with yytext
or YYText()
and the usual "yy" variables and functions.
The second option −−bison
generates a scanner class and auxiliary code for compatibility with Bison parsers. See Interfacing with Bison and Yacc for more details.
The source code output is intentionally structured in sections that are clean, readable, and reusable.
A lex specification consists of three sections that are divided by %%
delimiters:
The Definitions section is used to define named regex patterns, to set options for the scanner, and for including C++ declarations.
The Rules section is the main workhorse of the scanner and consists of patterns and actions, where patterns may use named regex patterns that are defined in the definitions section. The actions are executed when patterns match. For example, the following lex specification replaces all occurrences of cow
by chick
:
The default rule is to echo any input character that is read from input that does not match a rule in the rules section, so all other text is faithfully reproduced.
Because the pattern cow
also matches words partly we get chicks
for cows
. But we also get badly garbled output for words such as coward
and we are skipping capitalized Cows. We can improve this with a pattern that anchors word boundaries and accepts a lower or upper case C:
Note that we defined a named pattern cow
in the definitions section to match the start and end of (capitalized) "cow" word with the regex \<[Cc]ow\>
. We use {cow}
in our rule for matching. The matched text first character is emitted with text()[0]
and we simply add "hick" to complete our chick.
Note that regex grouping with parenthesis to capture text matched by a parenthesized sub-regex is generally not supported by scanner generators, so we have to use the entire matched text()
string.
Flex and Lex do not support word boundary anchors \<
, \>
, \b
, and \B
so this example only works with RE/flex.
If you are wondering about the action code in our example not exactly reflecting the C code expected with Flex, then rest assured that RE/flex supports the classic Flex and Lex actions such as yytext
instead of text()
and *yyout
instead of out()
. Simply use option −−flex
to regress to the C-style Flex names and actions. Use options −−flex
and −−bison
to regress even further to generated a global yylex()
function and "yy" variables.
To create a stand-alone scanner, we add main
to the user code section:
The main function instantiates the lexer class and invokes the scanner, which will not return until the entire input is processed.
The Definitions section includes name-pattern pairs to define names for patterns. Named patterns can be referenced in regex patterns by embracing them in {
and }
.
The following example defines two names for two patterns, where the second regex pattern uses the previously named pattern:
Names must be defined before being referenced. Names are expanded as macros in regex patterns. For example, {digit}+
is expanded into [0-9]+
.
X
, X
is placed in a non-capturing group (?:X)
to preserve its structure. For example, {number}
expands to (?:{digit}+)
which in turn expands to (?:(?:[0-9])+)
.To inject code into the generated scanner, indent the code or place the code within a %{
and %}
. The %{
and %}
should be placed at the start of a new line. To inject code at the very top of the generated scanner, place this code within %top{
and %}
:
The Definition section may also contain one or more options with %option
(or %o
for short). For example:
Multiple options can be grouped on the same line as is shown above. See Command-line options for a list of available options.
Consider the following example. Say we want to count the number of occurrences of the word "cow" in some text. We declare a global counter, increment the counter when we see a "cow", and finally report the total tally when we reach the end of the input marked by the <<EOF>>
rule:
The above works fine, but we are using a global counter which is not a best practice and is not MT-safe or reentrant: multiple Lexer class instances may compete to bump the counter. Another problem is that the Lexer can only be used once, there is no proper initialization to restart the Lexer on new input.
RE/flex allows you to inject code in the generated Lexer class, meaning that class members and constructor code can be added to manage the Lexer class state. All Lexer class members are visible in actions, even when private. New Lexers can be instantiated given some input to scan. Lexers can run in parallel in threads without requiring synchronization when their state is part of the instance and not managed by global variables.
To inject Lexer class member declarations, place the declarations within %class{
and %}
. The %class{
and %}
should be placed at the start of a new line.
Likewise, to inject Lexer class constructor code, for example to initialize members, place the code within %init{
and %}
. The %init{
and %}
should be placed at the start of a new line.
For example, we use these code injectors to make our cow counter herd
part of the Lexer class state:
Note that nothing else needed to be changed, because the actions are part of the generated Lexer class and can access the Lexer class members, in this example the member variable herd
.
To modularize lex specifications use %include
(or %i
for short) to include files into the definitions section of a lex specification. For example:
This includes examples/jdefs.l with Java patterns into the current specification so you can match Java lexical structures, such as copying Java identifiers to the output given some Java source program as input:
To declare start condition state names use %state
(%s
for short) to declare inclusive states and use %xstate
(%x
for short) to declare exclusive states:
See Start condition states for more information about states.
Each rule in the Rules section consists of a pattern-action pair. For example, the following defines an action for a pattern:
To add action code that spans multiple lines, indent the code or place the code within a {
and }
code block. When local variables are declared in an action then the code should always be placed in a code block.
Actions in the Rules section can use predefined RE/flex variables and functions. With reflex option −−flex
, the variables and functions are the classic Flex actions shown in the second column of this table:
RE/flex action | Flex action | Result |
---|---|---|
text() | YYText() , yytext | \0 -terminated text match |
size() | YYLeng() , yyleng | size of the match in bytes |
wsize() | n/a | number of wide chars matched |
lineno() | yylineno | line number of match (>=1) |
columno() | n/a | column number of match (>=0) |
echo() | ECHO | out().write(text(), size()) |
start() | YY_START | get current start condition |
start(n) | BEGIN n | set start condition to n |
push_state(n) | yy_push_state(n) | push current state, start n |
pop_state() | yy_pop_state() | pop state and make it current |
top_state() | yy_top_state() | get top state start condition |
in(i) | yyin = &i | set input to reflex::Input i |
in() | *yyin | get reflex::Input object |
out(s) | yyout = &s | set output to std::ostream s |
out() | *yyout | get std::ostream object |
out().write(s, n) | LexerOutput(s, n) | output chars s[0..n-1] |
out().put(c) | output(c) | output char c |
matcher().accept() | yy_act | number of the matched rule |
matcher().input() | yyinput() | get next character from input |
matcher().unput(c) | unput(c) | put back character c |
matcher().peek() | n/a | peek at character on input |
matcher().more() | yymore() | concat next match to the match |
matcher().less(n) | yyless(n) | shrink match's length to n |
matcher().first() | n/a | first pos of match in input |
matcher().last() | n/a | last pos+1 of match in input |
matcher().rest() | n/a | get rest of input until end |
matcher().at_bob() | n/a | true if at the begin of input |
matcher().at_end() | n/a | true if at the end of input |
matcher().at_bol() | YY_AT_BOL() | true if at begin of a newline |
set_debug(n) | set_debug(n) | reflex option -d sets n=1 |
debug() | debug() | nonzero when debugging |
Note that Flex switch_streams(i, o)
is the same as invoking the in(i)
and out(o)
methods. Flex yyrestart(i)
is the same as setting the input yyin=&i
or invoking in(i)
to set input to a file, stream, or string.
Use reflex options −−flex
and −−bison
to enable global Flex actions and variables. This makes Flex actions and variables globally accessible outside of the rules section, with the exception of yy_push_state
, yy_pop_state
, yy_top_state
. Outside the rules section you must use the global action yyinput
instead of input
, global action yyunput
instead of unput
, and global action yyoutput
instead of output
. Because yyin
and yyout
are macros they cannot be (re)declared or accessed as global variables, but they can be used as if these were variables. To avoid compilation errors, use reflex option −−header-file
to generate a header file lex.yy.h
to include in your code to use the global use Flex actions and variables. See Interfacing with Bison and Yacc for more details on the −−bison
options to use.
From the first entries in the table shown above you may have guessed correctly that text()
is just a shorthand for matcher().text()
, since matcher()
is the matcher object associated with the generated Lexer class. The same shorthands apply to size()
, wsize()
, lineno()
and columno()
.
Because matcher()
returns the current matcher object, the following Flex-like actions are also supported:
RE/flex action | Flex action | Result |
---|---|---|
matcher().buffer() | n/a | buffer entire input |
matcher().buffer(n) | n/a | set buffer size to n |
matcher().interactive() | yy_set_interactive(1) | set interactive input |
matcher().flush() | YY_FLUSH_BUFFER | flush input buffer |
matcher().get(s, n) | LexerInput(s, n) | read s[0..n-1] |
matcher().set_bol(b) | yy_set_bol(b) | set begin of line |
matcher().set_eof(b) | n/a | set EOF flag to b |
You can switch to a new matcher while scanning input, and use operations to create a new matcher, push/pop a matcher on/from a stack, and delete a matcher:
RE/flex action | Flex action | Result |
---|---|---|
matcher(m) | yy_switch_to_buffer(m) | use matcher m |
new_matcher(i) | yy_create_buffer(i, n) | new matcher reflex::Input i |
del_matcher(m) | yy_delete_buffer(m) | delete matcher m |
push_matcher(m) | yypush_buffer_state(m) | push current matcher, use m |
pop_matcher() | yypop_buffer_state() | pop matcher and delete current |
ptr_matcher() | YY_CURRENT_BUFFER | pointer to current matcher |
The matcher type m
is a Lexer class-specific Matcher
type, which depends on the underlying matcher used by the scanner. Therefore, new_matcher(i)
instantiates a reflex::Matcher
or reflex::BoostPosixMatcher
depending on the −−matcher
option.
The Flex yy_scan_string(string)
and yy_scan_bytes(bytes, len)
functions are also supported with reflex option −−flex
and are simply creating setting the matcher's in(i)
input to a string.
The generated scanner reads from the standard input by default or from an input source specified as a reflex::Input
object, such as a string, wide string, file, or a stream.
See Switching input sources for more details on managing the input to a scanner.
To inject code at the end of the generated scanner, such as a main
function, we can use the User Code section. All of the code in the User Code section is copied to the generated scanner.
Below is a User Code section example with main
that invokes the lexer to read from standard input (the default input) and display all numbers found:
You can also automatically generate a main
with the reflex −−main
option, which will produce the same main
function shown in the example above. This creates a stand-alone scanner that instantiates a Lexer that reads input from standard input.
To scan from other input than standard input, such as from files, streams, and strings, instantiate the Lexer class with the input source as the first argument. To set an alternative output stream than standard output, pass a std::ostream
object as the second argument to the Lexer class constructor:
The above uses a FILE
descriptor to read input from, which has the advantage of automatically decoding UTF-8/16/32 input. Other permissible input sources are std::istream
, std::string
, std::wstring
, char*
, and wchar_t*
.
A pattern is an extended set of regular expressions, with nested sub-expression patterns φ
and ψ
:
Pattern | Matches |
---|---|
x | matches the character x , where x is not a special character |
. | matches any single character except newline (unless in dotall mode) |
\. | matches . (dot), special characters are escaped with a backslash |
\n | matches a newline, others are \a (BEL), \b (BS), \t (HT), \v (VT), \f (FF), and \r (CR) |
\0 | matches the NUL character |
\cX | matches the control character X mod 32 (e.g. \cA is \x01 ) |
\0177 | matches an 8-bit character with octal value 177 (see below) |
\x7f | matches an 8-bit character with hexadecimal value 7f |
\x{7f} | matches an 8-bit character with hexadecimal value 7f |
\p{C} | matches a character in class C (see below) |
\Q..\E | matches the quoted content between \Q and \E literally |
[abc] | matches one of a , b , or c (bracket list) |
[0-9] | matches a digit 0 to 9 (range) |
[^0-9] | matches anything but a digit (negative list) |
φ? | matches φ zero or one time (optional) |
φ* | matches φ zero or more times (repetition) |
φ+ | matches φ one or more times (repetition) |
φ{2,5} | matches φ two to five times (repetition) |
φ{2,} | matches φ at least two times (repetition) |
φ{2} | matches φ exactly two times (repetition) |
φ?? | matches φ zero or once as needed (lazy optional) |
φ*? | matches φ a minimum number of times as needed (lazy repetition) |
φ+? | matches φ a minimum number of times at least once as needed (lazy repetition) |
φ{2,5}? | matches φ two to five times as needed (lazy repetition) |
φ{2,}? | matches φ at least two times or more as needed (lazy repetition) |
φψ | matches φ followed by ψ (concatenation) |
φ⎮ψ | matches φ or matches ψ (alternation) |
(φ) | matches φ as a group (capture) |
(?:φ) | matches φ without group capture |
(?=φ) | matches φ without consuming it (lookahead) |
(?^φ) | matches φ and ignore it to continue matching (ignore, RE/flex only) |
^φ | matches φ at the start of input or start of a line (multi-line mode) |
φ$ | matches φ at the end of input or end of a line (multi-line mode) |
\Aφ | matches φ at the start of input |
φ\Z | matches φ at the end of input |
\bφ | matches φ starting at a word boundary |
φ\b | matches φ ending at a word boundary |
\Bφ | matches φ starting at a non-word boundary |
φ\B | matches φ ending at a non-word boundary |
\<φ | matches φ that starts as a word |
\>φ | matches φ that starts as a non-word |
φ\< | matches φ that ends as a non-word |
φ\> | matches φ that ends as a word |
(?i:φ) | case insensitive mode: matches φ ignoring case |
(?m:φ) | multi-line mode: ^ and $ in φ match begin and end of a line |
(?s:φ) | dotall mode: . (dot) in φ matches newline |
(?x:φ) | free space mode: ignore all whitespace and comments in φ |
(?#:..) | commenting: all of .. is skipped as a comment |
In addition, the following patterns can be used in lex specifications for the reflex scanner generator:
Pattern | Matches |
---|---|
x (UTF-8) | matches wide character x encoded in UTF-8, requires the −−unicode option |
\u{2318} | matches Unicode character U+2318, requires the −−unicode option |
\p{C} | matches a character in class C, ASCII or Unicode with the −−unicode option |
\177 | matches an 8-bit character with octal value 177 |
".." | matches the quoted content literally |
φ/ψ | matches φ if followed by ψ (trailing context, same as lookahead) |
<S>φ | matches φ only if state S is enabled |
<S1,S2,S3>φ | matches φ only if state S1 , S2 , or state S3 is enabled |
<*>φ | matches φ in any state |
<<EOF>> | matches EOF in any state |
<S><<EOF>> | matches EOF only if state S is enabled |
The order of precedence for composing larger patterns from sub-patterns is as follows, from high to low precedence:
(φ)
, (?:φ)
, (?=φ)
, and inline modifiers (?imsx:φ)
?
, *
, +
, {n,m}
^
, $
, \<
, \>
, \b
, \B
, \A
, \Z
|
(?imsx)φ
Note that the characters .
(dot), \
, ?
, *
, +
, |
, (
, )
, [
, ]
, {
, }
, ^
, and $
are meta-characters and should be escaped to match. Lex specifications also include the "
and /
as meta-characters and these should be escaped to match.
Bracket lists cannot be empty, so []
and [^]
are invalid. In fact, the first character after the bracket is always part of the list. So [][]
is a list that matches a ]
and a [
, [^][]
is a list that matches anything but ]
and [
, and [-^]
is a list that matches a -
and a ^
.
Multi-line mode is the default mode in lex specifications. Free space mode can be useful to improve readability. To enable free space mode in reflex use the −−freespace
option. Be warned that this requires all actions in lex specifications to be placed within {
and }
blocks and other code to be placed in %{
and %}
blocks.
Patterns ending in an escape \
in lex specifications continue on the next line. This permits layout out long patterns, for example in free space mode.
?
for optional patterns φ??
and repetitions φ*?
φ+?
is not supported by Boost.Regex in POSIX mode. In general, POSIX matchers do not support lazy quantifiers due to POSIX limitations that are rooted in the theory of formal languages.??
at all cost in C++ strings. Instead, use at least one escaped question mark, such as ?\?
, which the compiler will translate to ??
. Otherwise, lazy optional pattern constructs will appear broken. Fortunately, most C++ compilers will warn about trigraph translation before causing trouble.The 7-bit ASCII character categories are:
Category | List form | Matches |
---|---|---|
\p{Space} | [[:space:]] | matches a white space character [ \t\n\v\f\r\x85] same as \s |
\p{Xdigit} | [[:xdigit:]] | matches a hex digit [0-9A-Fa-f] |
\p{Cntrl} | [[:cntrl:]] | matches a control character [\x00-\0x1f\x7f] |
\p{Print} | [[:print:]] | matches a printable character [\x20-\x7e] |
\p{Alnum} | [[:alnum:]] | matches a alphanumeric character [0-9A-Za-z] |
\p{Alpha} | [[:alpha:]] | matches a letter [A-Za-z] |
\p{Blank} | [[:blank:]] | matches a blank [ \t] same as \h |
\p{Digit} | [[:digit:]] | matches a digit [0-9] same as \d |
\p{Graph} | [[:graph:]] | matches a visible character [\x21-\x7e] |
\p{Lower} | [[:lower:]] | matches a lower case letter [a-z] |
\p{Punct} | [[:punct:]] | matches a punctuation character [\x21-\x2f\x3a-\x40\x5b-\x60\x7b-\x7e] |
\p{Upper} | [[:upper:]] | matches an upper case letter [A-Z] |
\p{Word} | [[:word:]] | matches a word character [0-9A-Za-z_] |
\d | [[:digit:]] | matches a digit |
\D | [^[:digit:]] | matches a non-digit |
\h | [[:blank:]] | matches a blank character |
\H | [^[:blank:]] | matches a non-blank character |
\s | [[:space:]] | matches a white space character |
\S | [^[:space:]] | matches a non-white space |
\l | [[:lower:]] | matches a lower case letter [a-z] |
\L | [^[:lower:]] | matches a non-lower case letter [^a-z] |
\u | [[:upper:]] | matches an upper case letter [A-Z] |
\U | [^[:upper:]] | matches a nonupper case letter [^A-Z] |
\w | [[:word:]] | matches a word character |
\W | [^[:word:]] | matches a non-word character |
The following Unicode character categories are enabled with the reflex −−unicode
option:
Category | Matches |
---|---|
. | matches any Unicode character |
\X | matches any Unicode character with or without the −−unicode option enabled |
\s | matches a Unicode white space character |
\l , \p{Ll} | matches a lower case letter with Unicode sub-property Ll |
\u , \p{Lu} | matches an upper case letter with Unicode sub-property Lu |
\w , \p{Word} | matches a Unicode word character with property L, Nd, or Pc |
\p{Unicode} | matches any character (Unicode U+0000 to U+10FFFF) |
\p{ASCII} | matches an ASCII character U+0000 to U+007F) |
\p{Non_ASCII_Unicode} | matches a non-ASCII Unicode character U+0080 to U+10FFFF) |
\p{Letter} | matches a character with Unicode property Letter |
\p{Mark} | matches a character with Unicode property Mark |
\p{Separator} | matches a character with Unicode property Separator |
\p{Symbol} | matches a character with Unicode property Symbol |
\p{Number} | matches a character with Unicode property Number |
\p{Punctuation} | matches a character with Unicode property Punctuation |
\p{Other} | matches a character with Unicode property Other |
\p{Lowercase_Letter} , \p{Ll} | matches a character with Unicode sub-property Ll |
\p{Uppercase_Letter} , \p{Lu} | matches a character with Unicode sub-property Lu |
\p{Titlecase_Letter} , \p{Lt} | matches a character with Unicode sub-property Lt |
\p{Modifier_Letter} , \p{Lm} | matches a character with Unicode sub-property Lm |
\p{Other_Letter} , \p{Lo} | matches a character with Unicode sub-property Lo |
\p{Non_Spacing_Mark} , \p{Mn} | matches a character with Unicode sub-property Mn |
\p{Spacing_Combining_Mark} , \p{Mc} | matches a character with Unicode sub-property Mc |
\p{Enclosing_Mark} , \p{Me} | matches a character with Unicode sub-property Me |
\p{Space_Separator} , \p{Zs} | matches a character with Unicode sub-property Zs |
\p{Line_Separator} , \p{Zl} | matches a character with Unicode sub-property Zl |
\p{Paragraph_Separator} , \p{Zp} | matches a character with Unicode sub-property Zp |
\p{Math_Symbol} , \p{Sm} | matches a character with Unicode sub-property Sm |
\p{Currency_Symbol} , \p{Sc} | matches a character with Unicode sub-property Sc |
\p{Modifier_Symbol} , \p{Sk} | matches a character with Unicode sub-property Sk |
\p{Other_Symbol} , \p{So} | matches a character with Unicode sub-property So |
\p{Decimal_Digit_Number} , \p{Nd} | matches a character with Unicode sub-property Nd |
\p{Letter_Number} , \p{Nl} | matches a character with Unicode sub-property Nl |
\p{Other_Number} , \p{No} | matches a character with Unicode sub-property No |
\p{Dash_Punctuation} , \p{Pd} | matches a character with Unicode sub-property Pd |
\p{Open_Punctuation} , \p{Ps} | matches a character with Unicode sub-property Ps |
\p{Close_Punctuation} , \p{Pe} | matches a character with Unicode sub-property Pe |
\p{Initial_Punctuation} , \p{Pi} | matches a character with Unicode sub-property Pi |
\p{Final_Punctuation} , \p{Pf} | matches a character with Unicode sub-property Pf |
\p{Connector_Punctuation} , \p{Pc} | matches a character with Unicode sub-property Pc |
\p{Other_Punctuation} , \p{Po} | matches a character with Unicode sub-property Po |
\p{Control} , \p{Cc} | matches a character with Unicode sub-property Cc |
\p{Format} , \p{Cf} | matches a character with Unicode sub-property Cf |
\p{UnicodeIdentifierStart} | matches a character in the Unicode IdentifierStart class |
\p{UnicodeIdentifierPart} | matches a character in the Unicode IdentifierPart class |
\p{IdentifierIgnorable} | matches a character in the IdentifierIgnorable class |
\p{JavaIdentifierStart} | matches a character in the Java IdentifierStart class |
\p{JavaIdentifierPart} | matches a character in the Java IdentifierPart class |
\p{CsIdentifierStart} | matches a character in the C# IdentifierStart class |
\p{CsIdentifierPart} | matches a character in the C# IdentifierPart class |
\p{PythonIdentifierStart} | matches a character in the Python IdentifierStart class |
\p{PythonIdentifierPart} | matches a character in the Python IdentifierPart class |
In addition, the −−unicode
option enables Unicode language scripts:
\p{Arabic}
, \p{Armenian}
, \p{Avestan}
, \p{Balinese}
, \p{Bamum}
, \p{Bassa_Vah}
, \p{Batak}
, \p{Bengali}
, \p{Bopomofo}
, \p{Brahmi}
, \p{Braille}
, \p{Buginese}
, \p{Buhid}
, \p{Canadian_Aboriginal}
, \p{Carian}
, \p{Caucasian_Albanian}
, \p{Chakma}
, \p{Cham}
, \p{Cherokee}
, \p{Common}
, \p{Coptic}
, \p{Cuneiform}
, \p{Cypriot}
, \p{Cyrillic}
, \p{Deseret}
, \p{Devanagari}
, \p{Duployan}
, \p{Egyptian_Hieroglyphs}
, \p{Elbasan}
, \p{Ethiopic}
, \p{Georgian}
, \p{Glagolitic}
, \p{Gothic}
, \p{Grantha}
, \p{Greek}
, \p{Gujarati}
, \p{Gurmukhi}
, \p{Han}
, \p{Hangul}
, \p{Hanunoo}
, \p{Hebrew}
, \p{Hiragana}
, \p{Imperial_Aramaic}
, \p{Inherited}
, \p{Inscriptional_Pahlavi}
, \p{Inscriptional_Parthian}
, \p{Javanese}
, \p{Kaithi}
, \p{Kannada}
, \p{Katakana}
, \p{Kayah_Li}
, \p{Kharoshthi}
, \p{Khmer}
, \p{Khojki}
, \p{Khudawadi}
, \p{Lao}
, \p{Latin}
, \p{Lepcha}
, \p{Limbu}
, \p{Linear_A}
, \p{Linear_B}
, \p{Lisu}
, \p{Lycian}
, \p{Lydian}
, \p{Mahajani}
, \p{Malayalam}
, \p{Mandaic}
, \p{Manichaean}
, \p{Meetei_Mayek}
, \p{Mende_Kikakui}
, \p{Meroitic_Cursive}
, \p{Meroitic_Hieroglyphs}
, \p{Miao}
, \p{Modi}
, \p{Mongolian}
, \p{Mro}
, \p{Myanmar}
, \p{Nabataean}
, \p{New_Tai_Lue}
, \p{Nko}
, \p{Ogham}
, \p{Ol_Chiki}
, \p{Old_Italic}
, \p{Old_North_Arabian}
, \p{Old_Permic}
, \p{Old_Persian}
, \p{Old_South_Arabian}
, \p{Old_Turkic}
, \p{Oriya}
, \p{Osmanya}
, \p{Pahawh_Hmong}
, \p{Palmyrene}
, \p{Pau_Cin_Hau}
, \p{Phags_Pa}
, \p{Phoenician}
, \p{Psalter_Pahlavi}
, \p{Rejang}
, \p{Runic}
, \p{Samaritan}
, \p{Saurashtra}
, \p{Sharada}
, \p{Shavian}
, \p{Siddham}
, \p{Sinhala}
, \p{Sora_Sompeng}
, \p{Sundanese}
, \p{Syloti_Nagri}
, \p{Syriac}
, \p{Tagalog}
, \p{Tagbanwa}
, \p{Tai_Le}
, \p{Tai_Tham}
, \p{Tai_Viet}
, \p{Takri}
, \p{Tamil}
, \p{Telugu}
, \p{Thaana}
, \p{Thai}
, \p{Tibetan}
, \p{Tifinagh}
, \p{Tirhuta}
, \p{Ugaritic}
, \p{Vai}
, \p{Warang_Citi}
, \p{Yi}
.
Automatic indent and dedent matching is a new feature in RE/flex and is enabled when the RE/flex matcher engine is used (not the Boost.Regex matcher).
Pattern | Matches |
---|---|
\i | indent: matches the next indent position |
\j | dedent: matches the previous indent position |
These patterns should be used in combination with the start of a line anchor ^
followed by a pattern that represents the spacing for indentations. The pattern may include any characters that are considered part of indentation margins, but should exclude \n
. For example:
The \h
pattern matches space and tabs, where tabs advance to the next column that is a multiple of 8. The tab multiplier can be changed by setting the −−tabs=N
option where N
must be a positive integer that is a power of 2.
To add a pattern that consumes line continuations without affecting the indentation levels, use a negative match, which is a new RE/flex feature:
The negative pattern (?^\\\n\h+)
consumes input internally as if we are repeately calling input()
(or yyinput()
with −−flex
). We used it here to consume the line-ending \
and the indent that followed it, as if this text was not part of the input, which ensures that the current indent positions are not affected.
To scan input that continues on the next new line(s) while preserving the current indent positions, use the RE/flex matcher matcher().push_stops()
and matcher().pop_stops()
. For example, to continue scanning after a /*
for multiple lines without indentation matching and up to a */
you can save the current indent positions and transition to a new start condition state to scan the content between /*
and */
:
See Start condition states for more information about start condition states.
By default, reflex produces a Lexer class with a virtual lex scanner function. The name of this function as well as the Lexer class name and the namespace can be set with options:
Option | RE/flex default name | Flex default name |
---|---|---|
namespace | n/a | n/a |
lexer | Lexer class | yyFlexLexer class |
lex | lex() function | yylex() function |
To customize the Lexer class use these options and code injection.
To understand the impact of these options, consider the following lex specification template:
This produces the following Lexer class with the template parts filled in:
The Lexer class produced with option −−flex
is compatible with Flex (assuming Flex with option -+
for C++):
To use a new Lexer class that is derived from the generated base Lexer class, use option option −−class=CLASS
.
When using this option, you must also explicitly define a CLASS
class that is derived from the base lexer class. For example, the simplest derived Lexer class would be:
The base Lexer class LEXER
is the same class as was shown earlier, except that the LEX()
function is now a pure virtual method and the Lexer class is abstract. The CLASS::LEX()
scanner function is generated by reflex.
You can omit the lexer
and lex
options shown above, which means that the default Lexer class name and lex scanner function name are used.
To control the output of the reflex scanner generator, use command-line options. These options can also be specified in the lex specification with %option
(or %o
for short):
The above is equivalent to the −−flex
, −−bison
, and −−graphs-file=mygraph
command-line options. Multiple options can be grouped on a single line:
Option parameters should be quoted when parameters contain special characters:
Shorter forms can be used, with each option on a separate line:
Options that affect the regular expressions such as %option unicode
and %option dotall
MUST be specified before the regular expressions are defined and used in the specification.
The reflex command-line options are listed below.
−+
, −−flex
This generates a yyFlexLexer
scanner class that is compatible with the Flex-generated yyFlexLexer
scanner class (assuming Flex with option −+
for C++). The generated yyFlexLexer
class has the usual yytext
and other "yy" variables and functions, as defined by the lex specification standard. Without this option, RE/flex actions should be used that are lexer class methods such as echo()
and reflex::AbstractMatcher
class methods such as matcher.more()
, see Lex specifications for more details.
−−bison
This generates a scanner that works with Bison parsers, by defining global (non-MT-safe and non-reentrant) "yy" variables and functions. See Interfacing with Bison and Yacc for more details. Use option −−noyywrap
to remove the dependency on the global yywrap()
function.
−−bison-bridge
This generates a scanner that works with Bison pure (MT-safe and reentrant) parsers using a Bison bridge for one ore more scanner objects. See Interfacing with Bison and Yacc for more details.
−−bison-locations
This generates a scanner that works with Bison with locations enabled. See Interfacing with Bison and Yacc for more details.
-R
, −−reentrant
This generates additional Flex-compatible yylex()
reentrant scanner functions. RE/flex scanners are always reentrant, assuming that %class
variables are used instead of global variables in the scanner's user code. This is a Flex-compatibility option and should only be used with options −−flex
and −−bison
. See Interfacing with Bison and Yacc .
−−main
This generates a main
function to create a stand-alone scanner that scans data from standard input (using stdin
).
−−nostdinit
This initializes input to std::cin
instead of using stdin
. Automatic UTF decoding is not supported. Use stdin
for automatic UTF BOM detection and decoding.
-u
, −−unicode
This makes .
(dot), \s
, \w
, \l
, and \u
match Unicode and also groups UTF-8 sequences in the regex, such that each UTF-8 encoded character in a regex is properly matched as one wide character. Note that \S
and \W
are not affected by this switch.
-x
, −−freespace
This switches the reflex scanner generator to free space mode. Regular expressions in free space mode may contain spacing to improve readability. Spacing within regular expressions is ignored, so use " "
or [ ]
to match a space and \h
to match a space or a tab character. Actions in free space mode MUST be placed in {
and }
blocks and all other code must be placed in %{
and %}
blocks. Patterns ending in an escape \
continue on the next line.
-a
, −−dotall
This makes dot (.
) in patterns match newline. Normally dot matches a single character except a newline (\n
ASCII 0x0A).
-i
, −−case_insensitive
This ignore case in patterns. Patterns match lower and upper case letters. This switch only applies to ASCII letters.
-B
, −−batch
This generates a batch input scanner that reads the entire input all at once when possible. This scanner is fast, but consumes more memory depending on the input data size.
-I
, −−interactive
, −−always-interactive
This generates an interactive scanner and permits console input by sacrificing speed. By contrast, the default buffered input strategy is more efficient.
-d
, −−debug
This enables debug mode in the generated scanner. Running the scanner produces debug messages on std::cerr
standard error and the debug()
function returns nonzero. To temporarily turn off debug messages, use set_debug(0)
in your action code. To turn debug messages back on, use set_debug(1)
. The set_debug()
and debug()
functions are virtual, so you can override their behavior in a derived lexer class. Also enables assertions that check for errors.
-s
, −−nodefault
This suppresses the default rule that ECHOs all unmatched text when no rule matches. With the −−flex
option, the scanner reports "scanner jammed" when no rule matches. Without the −−flex
option, unmatched input is silently ignored.
-o FILE
, −−outfile=FILE
This saves the scanner to FILE instead of yy.lex.cpp.
-t
, −−stdout
This writes the scanner to stdout instead of to the yy.lex.cpp file.
-h
, −−help
This displays helpful information about reflex.
-V
, −−version
This displays the current reflex release version.
-v
, −−verbose
This displays a summary of scanner statistics.
-w
, −−nowarn
This disables warnings.
-P NAME
, −−prefix=NAME
This specifies NAME as a prefix for the generated yyFlexLexer
class to replace the default yy
prefix. This option is only useful with the −−flex
option and can be used in combination with −−bison
.
-L
, −−noline
This suppresses the #line directives in the generated scanner code.
−−namespace=NAME
This places the generated scanner class in the C++ namespace NAME scope. Which means NAME::Lexer
(and NAME::yyFlexLexer
when option −−flex
is used).
−−lexer=NAME
This defines the NAME of the generated scanner class and replaces the default name Lexer
(and replaces yyFlexLexer
when option −−flex
is used).
−−lex=NAME
This defines the NAME of the generated scanner function to replace the function name lex()
(and yylex()
when option −−flex
is used).
−−class=NAME
This defines the NAME of the scanner class that should be derived from the generated base Lexer
class. Use this option when defining your own scanner class named NAME. You can declare a custom lexer class in the first section of the lex specification. Because the custom lexer class is user-defined, reflex generates the implementation of the lex()
scanner function for this specified class.
−−yyclass=NAME
This option combines options −−flex
and −−class=NAME
.
-m reflex
, −−matcher=reflex
This generates a scanner that uses the RE/flex reflex::Matcher
class with a POSIX matcher engine. This is the default matcher for scanning. This option is best for Flex compatibility. The matcher supports Unicode, lazy quantifiers, word boundaries, indent/nodent/dedent matching, and supports FSM output for visualization with Graphviz.
-m boost
, −−matcher=boost
This generates a scanner that uses the reflex::BoostPosixMatcher
class with a Boost.Regex POSIX matcher engine for scanning. The matcher supports Unicode and word boundary anchors, but not lazy quantifiers. No Graphviz output.
-m boost-perl
, −−matcher=boost-perl
This generates a scanner that uses the reflex::BoostPerlMatcher
class with a Boost.Regex normal (Perl) matcher engine for scanning. The matching behavior differs from the POSIX leftmost longest rule and results in the first matching rule to be applied instead of the rule that produces the longest match. The matcher supports lazy quantifiers and word boundary anchors. No Graphviz output.
-f
, −−full
(RE/flex matcher only). This option adds the FSM to the generated code as a static opcode table, thus generating the scanner in full. FSM construction overhead is eliminated when the scanner is initialized, resulting in a scanner that starts scanning the input immediately. This option has no effect when option −−fast
is used.
-F
, −−fast
(RE/flex matcher only). This option adds the FSM to the generated code as optimized native C++ code. FSM construction overhead is eliminated when the scanner is initialized, resulting in a scanner that starts scanning the input immediately. The generated code takes more space compared to the −−full
option.
−−graphs-file[=FILE]
(RE/flex matcher only). This generates a Graphviz file FILE.gv, where FILE is optional. When FILE is omitted the reflex command generates the file reflex.S.gv for each start condition state S defined in the lex specification. This includes reflex.INITIAL.gv for the INITIAL start condition state. This option can be used to visualize the RE/flex matcher's finite state machine with the Graphviz dot tool. For example:
In case you are curious: the specification for this FSM digraph has two patterns: [1] a pattern to match dollar amounts with the regex \$\d+(\.\d{2})?
and [2] the regex .|\n
to skip a character and advance to the next match.
−−tables-file[=FILE]
(RE/flex matcher only). This generates a C++ file FILE.cpp with the finite state machine in source code form, where FILE is optional. When FILE is omitted the reflex command generates reflex.S.cpp for each start condition state S. This includes the file reflex.INITIAL.cpp for the INITIAL start condition state. When this option is used in combination with −−full
or −−fast
, the reflex::Pattern
is instantiated with the code table defined in this file. Therefore, when you combine this option with −−full
or −−fast
then you should compile the generated table file with the scanner. Options −−full
and −−fast
eliminate the FSM construction overhead when the scanner is initialized.
−−header-file[=FILE]
This generates a C++ header file FILE.h that declares the lexer class, in addition to the generated lexer class code, where FILE is optional. When FILE is omitted the reflex command generates lex.yy.h.
−−regexp-file[=FILE]
This generates a text file FILE.txt that contains the scanner's regular expression patterns, where FILE is optional. When FILE is omitted the reflex command generates reflex.S.txt for each start condition state S. The regular expression patterns are converted from the lex specification and translated into valid C++ strings that can be used with a regex library for pattern matching.
−−pattern=NAME
This defines a custom pattern class NAME for a customer matcher defined with option -m
.
−−tabs=N
This sets the tab size to N, where N > 0 must be a power of 2. The tab size is used internally to determine the column position in indent \i
and dedent \j
matching. It has no effect otherwise.
−−yywrap
and −−noyywrap
Option −−yywrap
generates a scanner that calls the global int yywrap()
function when EOF is reached. Only applicable when −−flex
is used for compatibility. The same applies when options −−flex
and −−bison
are used together. Use −−noyywrap
to disable the dependence on this global function. This option has no effect for C++ lexer classes, which have a virtual int wrap()
(or yywrap()
with option −−flex
) method.
−−lineno
, −−yymore
, −−batch
These options are enabled by default and have no effect.
The reflex scanner generator gives you a choice of matchers to use in the generated scanner, where the default is the POSIX RE/flex matcher engine. Other options are the Boost.Regex matcher in POSIX mode or in Perl mode.
To use a matcher for the generated scanner, use one of these three choices:
Option | Matcher class used | Mode | Engine |
---|---|---|---|
-m reflex | Matcher | POSIX | RE/flex lib (default choice) |
-m boost | BoostPosixMatcher | POSIX | Boost.Regex |
-m boost-perl | BoostPerlMatcher | Perl | Boost.Regex |
The POSIX matchers look for the longest possible match among the given set of alternative patterns. Perl matchers look for the first match among the given set of alternative patterns.
POSIX is generally preferred for scanners, since it is easier to arrange rules that may have partially overlapping patterns. Since we are looking for the longest match anyway, it does not matter which rule comes first. The order does not matter as long as the length of the matches differ. When matches are of the same length because multiple patterns match, then the first rule is selected.
Consider for example the following lex specification with rules that are intended to match keywords and identifiers in some input text:
When the input to the scanner is the text integer
then a POSIX matcher selects the last rule, which is what we want because it is an identifier.
By contrast, a Perl matcher selects the first rule because it matches the first part int
of integer
. This is NOT what we want. The same problem occurs when the text interface
is encountered on the input, which we want to recognize as a separate keyword and not match against int
. Switching the rules for int
and interface
fixes that problem. But note that we cannot do the same to fix matching integer
as an identifier: when moving the last rule up to the top we cannot match int
any longer!
int(?=[^A-Za-z0-9_])
which is written in a lex specification as int/[^A-Za-z0-9_]
with the /
lookahead meta symbol.Basically, a Perl matcher works in an operational mode by working the regex pattern as a sequence of operations for matching, usually using backtracking to find a matching pattern.
A POSIX matcher on the other hand is declarative and have a deeper foundation in formal language theory. An advantage of POSIX matchers is that regular expressions can be compiled to deterministic finite state machines for efficient matching.
int
matching rule before the identifier matching rule. Otherwise an int
on the input will be matched by the identifier rule.Perl matchers generally support lazy quantifiers and group captures, while most POSIX matchers do not (Boost.Regex in POSIX mode does not support lazy quantifiers). The RE/flex POSIX matcher supports lazy quantifiers, but not group captures. The added support for lazy quantifiers and word boundary anchors in RE/flex matching offers a reasonably new and useful feature for scanners that require POSIX mode matching.
To create a Lexer class instance that reads from a designated input source instead of standard input, pass the input source as the first argument to its constructor and use the second argument to optionally set an std::ostream
:
the same with the −−flex
option becomes:
where input
is a reflex::Input
object. The reflex::Input
constructor takes a FILE*
descriptor, std::istream
, a string std::string
or char*
, or a wide string std::wstring
or wchar_t*
.
To switch input to another source while using the scanner, use in(input)
:
However, switching input this way discards all remaining input from a previous input source. To switch input without affecting the current input source, switch matchers instead. The matchers buffer the input and manage the input state, in addition to pattern matching the input.
To switch to a matcher that scans from a new input source, use:
the same with the −−flex
option becomes:
This switches the scanner's input by switching to another matcher. Note that matcher(m)
may be used by the virtual wrap()
method (or yywrap()
when option −−flex
is used) if you use input wrapping after EOF to set things up for continued scanning.
Switching input sources (via either matcher(m)
or in(input)
) does not change the current start condition state.
When the scanner reaches the end of the input, it will check the int wrap()
method to detetermine if scanning should continue. If wrap()
returns one (1) the scanner terminates and returns zero to its caller. If wrap()
returns zero (0) then the scanner continues. In this case wrap()
should set up a new input source to scan.
To implement a wrap()
(and yywrap()
when option −−flex
is used), create a derived Lexer class using option class=NAME
(or yyclass=NAME
) and override the wrap()
(or yywrap()
) method:
You can override the wrap()
method to set up a new input source when the current input is exhausted. Do not use matcher().input(i)
to set a new input source i
, because that resets the internal matcher state.
With the −−flex
options your can override the yyFlexLexer::yywrap()
method that returns an integer 0 (more input available) or 1 (we're done).
With the −−flex
and −−bison
options you should define a global yywrap()
function that returns an integer 0 (more input available) or 1 (we're done).
A more typical scenario is to process an include
directive in the source input that should include the source of another file before continuing with the current input.
For example, the following lex specification processes #include
directives by switching matchers and using the stack of matchers to permit nested #include
directives up to a depth of 99 files:
With option −−flex
, the statement push_matcher(new_matcher(fd))
above becomes yypush_buffer_state(yy_create_buffer(fd, YY_BUF_SIZE))
and pop_matcher()
becomes yypop_buffer_state()
.
To set the current input as interactive, such as input from a console, use matcher().interactive()
(yy_set_interactive(1)
with option −−flex
). This disables buffering of the input and makes the scanner responsive.
To read from the input without pattern matching, use matcher().input()
to read one character at a time (8-bit, ASCII or UTF-8). This function returns EOF if the end of the input was reached. But be careful, the Flex yyinput()
and input()
functions return 0 instead of an EOF
(-1)!
To put back one character unto the input stream, use matcher().unput(c)
For example, to crudily scan a C/C++ multiline comment we can use the rule:
Instead of the ugly solution above, a better alternative is to use a regex /\*.*?\*/
or perhaps use start condition states, see Start condition states .
To grab the rest of the input as a string, use matcher().rest()
which returns a const char*
string that points to an internal buffer. Copy the string before using the matcher again.
To read a number of bytes n
into a string buffer s[0..n-1]
, use matcher().in.get(s, n)
, which is the same as invoking the virtual method matcher().get(s, n)
. This matcher method can be overriden by a derived matcher class (to customize reading.
The Flex YY_INPUT
macro is not supported by RE/flex. It is recommended to use YY_BUFFER_STATE
(Flex), which is a reflex::FlexLexer::Matcher
class in RE/flex that holds the matcher state and the state of the current input, including the line and column number positions (so unlike Flex, yylineno
does not have to be saved and restored when switching buffers). See also section Lex specifications on the actions to use.
To implement a custom input handler you can use a proper object-oriented approach: create a derived class of reflex::Matcher
(or reflex::BoostPosixMatcher
) and in the derived class override the size_t reflex::Matcher::get(char *s, size_t n)
method for input handling. This function is called with a string buffer s
of size n
bytes. Fill the string buffer s
up to n
bytes and return the number of bytes stored in s
. Return zero upon EOF. Use reflex options −−matcher=NAME
and −−pattern=reflex::Pattern
to use your new matcher class NAME
(or leave out −−pattern
for Boost.Regex derived matchers).
The FlexLexer
lexer class that is the base class of the yyFlexLexer
lexer class generated with reflex option −−flex
defines a virtual size_t LexerInput(char*, size_t)
method. This approach is compatible with Flex. The virtual method can be redefined in the generated yyFlexLexer
lexer to consume input from some source of text:
The LexerInput
method may be invoked multiple times until it returns zero indicating EOF.
Start conditions are used to group rules and selectively activate rules when the start condition state becomes active.
A rule with a pattern that is prefixed with one ore more start conditions will only be active when the scanner is in one of these start condition states.
For example:
When the scanner is in state A
rules 1 and 2 are active. When the scanner is in state B
rules 1 and 3 are active.
Start conditions are declared in the definitions section (the first section) of the lex specification using %state
or %xstate
(or %s
and %x
for short) followed by a list of names called start symbols. Start conditions declared with %s
are inclusive start conditions. Start conditions declared with %x
are exclusive start conditions.
If a start condition is inclusive, then all rules without a start condition and rules with the corresponding start condition will be active.
If a start condition is exclusive, only the rules with the corresponding start condition will be active.
When declaring start symbol names it is recommended to use all upper case to avoid name clashes with other Lexer class members. For example, we cannot use text
as a start symbol name because text()
is a Lexer method. When option −−flex
is used, start symbol names are simple macros for compatibility.
The scanner is initially in the INITIAL
start condition state. The INITIAL
start condtion is inclusive: all rules without a start condition and those prefixed with the INITIAL
start condition are active when the scanner is in the INITIAL
start condition state.
The special start condition prefix <*>
matches every start condition. The prefix <*>
is not needed for <<EOF>>
rules, because unprefixed <<EOF>>
rules are always active, as a special case (<<EOF>>
and this special exception were introduced by Flex).
For example:
When the scanner is in state INITIAL
rules 4, 5, and 6 are active. When the scanner is in state A
rules 1, 2, 4, 5, and 6 are active. When the scanner is in state B
rules 1, 3, 4, and 6 are active. Note that A
is inclusive whereas B
is exclusive.
To switch to a start condition state, use start(START)
(or BEGIN START
when option −−flex
is used). To get the current state use start()
(or YY_START
when option −−flex
is used). Switching start condition states in your scanner allows you to create "mini-scanners" to scan portions of the input that are syntactically different from the rest of the input, such as comments:
Start symbols are actually integer values, where INITIAL
is 0. This means that you can store a start symbol value in a variable. You can also push the current start condition on a stack and transition to start condition START
with push_state(START)
. To transition to a start condition that is on the top of the stack and pop it use pop_state()
. The top_state()
returns the start condition that is on the top of the stack:
When many rules are prefixed by the same start conditions, you can simplify the rules by placing them in a start condition scope:
Contrary to some Flex manuals, rules cannot be indented in a start condition scope. Indentation is used to specify code blocks. Code blocks can be indented or placed within %{
and %}
. These code blocks are executed when the scanner is invoked. A code block at the start of a condition scope is executed when the scanner is invoked and if it is in the corresponding start condition state:
Start condition scopes may be nested.
The Bison tools generate parsers that invoke the global C function yylex()
to get the next token. Tokens are integer values returned by yylex()
.
To support Bison parsers use reflex option −−bison
. This option generates a scanner with a global lexer object YY_SCANNER
and a global YY_EXTERN_C int yylex()
function. When the Bison parser is compiled in C and the scanner is compiled in C++, you must set YY_EXTERN_C
in the lex specification to extern "C"
to enable C linkage rules:
Note that noyywrap
is used to remove the dependency on the global yywrap()
function that is not defined.
This example sets the global yylval.num
to the integer scanned or yylval.str
to the string scanned. It assumes that the yacc grammar specification defines the tokens CONST_NUMBER
and CONST_STRING
and the type YYSTYPE
of yylval
, which is a union. For example:
When option −−flex
is used with −−bison
, the yytext
, yyleng
, and yylineno
globals are accessible to the Bison/Yacc parser. In fact, all Flex actions and variables are globally accessible (outside the rules section of the lex specification) with the exception of yy_push_state
, yy_pop_state
, and yy_top_state
that are class methods. Furthermore, yyin
and yyout
are macros and cannot be (re)declared or accessed as global variables, but these can be used as if they are variables to assign a new input source and to set the output stream. To avoid compilation errors when using globals such as yyin
, use reflex option −−header-file
to generate a header file lex.yy.h
to include in your code. Finally, in code outside of the rules section you must use yyinput
instead of input
, use the global action yyunput
instead of unput
, and use the global action yyoutput
instead of output
.
See the generated lex.yy.cpp
BISON section, which contains declarations specific to Bison when the −−bison
option is used.
There are two approaches for a Bison parser to work with a scanner. Either the yacc grammar specification for Bison should include the externs we need to import from the scanner:
or a better approach is to generate a lex.yy.h
header file with option --header-file
and use this header file in the yacc grammar specification:
The second option requires the generated parser to be compiled in C++, because lex.yy.h
contains C++ declarations.
The ugly Flex macro YY_DECL
is not supported by RE/flex. This macro is needed with Flex to redeclare the yylex()
function signature, for example to take an additional yylval
parameter that must be passed through from yyparse()
to yylex()
. Because the generated scanner uses a Lexer class for scanning, the class can be extended with %class{
and %}
to hold state information and additional token-related values. These values can then be exchanged with the parser using getters and setters, which is preferred over changing the yylex()
function signature.
Yacc and Bison use the global variable yylval
to exchange token values. Reentrant Bison parsers pass the yylval
to the yylex()
function as a parameter. RE/flex supports all of these Bison-specific features with command-line options.
To change the generated code for use with Bison, use the following options:
Options | Method | Global functions and variables |
---|---|---|
int Lexer::lex() | ||
−−flex | int yyFlexLexer::yylex() | |
−−bison | int Lexer::lex() | Lexer YY_SCANNER , int yylex() , YYSTYPE yylval |
−−flex −−bison | int yyFlexLexer::yylex() | yyFlexLexer YY_SCANNER , int yylex() , YYSTYPE yylval , char *yytext , yy_size_t yyleng , int yylineno |
−−bison −−reentrant | int Lexer::lex() | int yylex(yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−flex −−bison −−reentrant | int yyFlexLexer::lex() | int yylex(yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−bison-bridge | int Lexer::lex(YYSTYPE& yylval) | int yylex(YYSTYPE*, yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−flex −−bison-bridge | int yyFlexLexer::yylex(YYSTYPE& yylval) | int yylex(YYSTYPE*, yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−bison-locations | int Lexer::lex(YYSTYPE& yylval) | Lexer YY_SCANNER , int yylex(YYSTYPE*, YYLTYPE*) |
−−flex −−bison-locations | int yyFlexLexer::yylex(YYSTYPE& yylval) | yyFlexLexer YY_SCANNER , int yylex(YYSTYPE*, YYLTYPE*) |
−−bison-locations −−bison-bridge | int Lexer::lex(YYSTYPE& yylval) | int yylex(YYSTYPE*, YYLTYPE*, yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
−−flex −−bison-locations −−bison-bridge | int yyFlexLexer::yylex(YYSTYPE& yylval) | int yylex(YYSTYPE*, YYLTYPE*, yyscan_t) , void yylex_init(yyscan_t*) , void yylex_destroy(yyscan_t) |
Option −−prefix
may be used with option −−flex
to change the prefix of the generated yyFlexLexer
and yylex
. This option can be combined with option −−bison
to also change the prefix of the generated yytext
, yyleng
, and yylineno
.
The reflex option −−bison-bridge
expects a Bison "pure parser":
For the −−bison-bridge
option, the yyscan_t
argument type of yylex()
is a void*
type that passes the scanner object to this global function (as defined by YYPARSE_PARAM
and YYLEX_PARAM
). The function then invokes this scanner's lex function. This option also passes the yylval
value to the lex function, which is a reference to an YYSTYPE
value.
For the −−reentrant
and −−bison-bridge
options two additional functions are generated that can be used to create a new scanner and delete the scanner:
The option −−bison-locations
expects a Bison parser with the locations feature enabled. This feature provides line and column numbers of the matched text for error reporting. For example:
The yylval
value is passed to the lex function. The yylloc
structure is automatically set by the RE/flex scanner, so you do not need to define a YY_USER_ACTION
macro as you have to with Flex.
And a final example that combines options −−bison-locations
and −−bison-bridge
, which expects a Bison pure-parser with locations enabled:
When Bison locations
with define api.pure full
is used, yyerror
has the signature void yyerror(YYLTYPE *locp, char const *msg)
. This function signature is required to obtain the location information with Bison pure-parsers.
yylval
is not a pointer argument but is always passed by reference and can be used as such in the scanner's rules.YYSTYPE
is declared by the parser, do not forget to add a #include "y.tab.h"
to the top of your lex specification:Some lex specification examples to generate scanners with RE/flex.
The following Flex specification counts the lines, words, and characters on the input. We use yyleng
match text length to count 8-bit characters (bytes).
To build this example with RE/flex, use reflex option −−flex
to generate Flex-compatible "yy" variables and functions. This generates a C++ scanner class yyFlexLexer
that is compatible with the Flex scanner class (assuming Flex with option -+
for C++).
To generate a scanner with a global yylex()
function similar to Flex in C mode (i.e. without Flex option -+
), use reflex option −−bison
with the lex specification above. This option when combined with −−flex
produces the global "yy" functions and variables. This means that you can use RE/flex scanners with Bison (Yacc) and with any other C code, assuming everything is compiled together with a C++ compiler.
An improved implementation drops the use of global variables in favor of Lexer class member variables. We also want to count Unicode letters with the wd
counter instead of ASCII letters, which are single bytes while Unicode UTF-8 encodings vary in size. So we add the Unicode option and use \w
to match Unicode word characters. Note that .
(dot) matches Unicode, so the match length may be longer than one character that must be counted. We drop the −−flex
option and use RE/flex Lexer methods instead of the Flex "yy" functions:
This simple word count program differs slightly from the Unix wc utility, because the wc utility counts words delimited by wide character spaces (iswspace
) whereas this program counts words made up from word characters combined with punctuation.
The following RE/flex specification filters tags from XML documents and verifies whether or not the tags are properly balanced. Note that this example uses the lazy repetitions to keep the patterns simple. The XML document scanned should not include invalid XML characters such as /
, <
, or >
in attributes (otherwise the tags will not match properly). The dotall
option allows .
(dot) to match newline in all patterns, similar to the (?s)
modifier in regexes.
Note thay we restrict XML tag names to valid characters, including all UTF-8 sequences that run in the range \x80
-\xFF
per 8-bit character. This matches all Unicode characters U+0080 to U+10FFFF.
The ATTRIBUTES
state is used to scan attributes and their quoted values separately from the INITIAL
state. The INITIAL
state permits quotes to freely occur in character data, whereas the ATTRIBUTES
state matches quoted attribute values.
We use matcher().less(size() - 1)
to remove the ending >
from the match in text()
. The >
will be matched again, this time by the <*>.
rule that ignores it. We could also have used a lookahead pattern "\</"{name}/"\>"
where X/Y
means look ahead for Y
after X
.
This example lex specification in Flex syntax scans C/C++ source code. It uses free space mode to enhance readability.
Free space mode permits spacing between concatenations and alternations. To match a single space, use " "
or [ ]
. Long patterns can continue on the next line(s) when lines ends with an escape \
.
In free space mode you MUST place actions in {
and }
blocks and other code in %{
and %}
.
When used with option unicode
, the scanner automatically recognizes and scans Unicode identifier names. Note that we can use matcher().columno
in the error message. The matcher()
object associated with the Lexer offers several other methods that Flex does not support.
The RE/flex matcher engine uses an efficient FSM. There are known limitations to FSM matching that apply to Lex/Flex and therefore also apply to RE/flex:
a/b
and a(?=b)
are permitted, but (a/b)c
and a(?=b)c
are not.zx*/xy*
, where the x*
matches the x
at the beginning of the lookahead pattern.\A
and ^
, end of buffer/line anchors \Z
and $
and the word boundary anchors must start or end a pattern. For example, \<cow\>
is permitted, but .*\Bboy
is not.Current reflex tool limitations that may be removed in future versions:
[a-z]{-}[aeiou]
.[^...]
cannot contain Unicode characters.REJECT
action is not supported.T
are not supported.Boost.Regex issues to be resolved in future versions (this is not affecting the default use of the RE/flex regex engine by the reflex command-line tool):
.*
are used. Under certain conditions greedy repetitions may behave as lazy repetitions. For example, the Boost.Regex engine may return the short match abc
when the regex a.*c
is applied to abc abc
, instead of returning the full match abc abc
. The problem is caused by the limitations of Boost.Regex match_partial
matching algorithm. To work around this limitation, we suggest to make the repetition pattern as specific as possible and not overlap with the pattern that follows the repetition. The easiest solution is to read the entire input using reflex option -B
(batch input). For a stand-alone BoostMatcher
, use the buffer()
method. We consider this Boost.Regex partial match behavior a bug, not a restriction, because as long as backtracking on a repetition pattern is possible given some partial text, Boost.Regex should flag the result as a partial match instead of a full match.The RE/flex regex library consists of a set of C++ templates and classes that encapsulate regex engines in a standard API for scanning, tokenizing, searching, and splitting of strings, wide strings, files, and streams.
The RE/flex regex library is a class hierarchy that has at the root an abstract class reflex::AbstractMatcher
. Pattern types may differ between for matchers so the reflex::PatternMatcher
template class takes a pattern type and creates a class that is complete except for the implementation of the reflex::match()
virtual method that requires a regex engine, such as Boost.Regex or the RE/flex engine.
The reflex::BoostMatcher
inherits reflex::PatternMatcher<boost::regex>
, and in turn the reflex::BoostPerlMatcher
and reflex::BoostPosixMatcher
are both derived from reflex::BoostMatcher
:
An instance of reflex::BoostPerlMatcher
is initialized with flag match_perl
and the flag match_not_dot_newline
, these are boost::regex_constants
flags. These flags are the only difference with the plain reflex::BoostMatcher
.
An instance of reflex::BoostPosixMatcher
creates a POSIX matcher. This means that lazy quantifiers are not supported and the leftmost longest rule applies to pattern matching. This instance is initialized with the flags match_posix
and match_not_dot_newline
.
Boost.Regex is a powerful library. The RE/flex regex API enhances this library with operations to match, search, scan, and split data from a given input. The input can be a file, a string, or a stream. Files that are UTF-8/16/32-encoded are automatically decoded. Further, streams can be of unlimited length because internal buffering is used by the RE/flex regex API enhancements to efficiently apply Boost.Regex pattern matching to streaming data. This enhancement permits pattern matching of interactive input from the console, such that searching and scanning interactive input for matches will return these matches immediately.
reflex::BoostMatcher
extends the capabilities of Boost.Regex, which does not natively support streaming input:match_partial
flag and boost::regex_iterator
. Incremental matching can be used to support matching on (possibly infinite) streams. To use this method correctly, a buffer should be created that is large enough to hold the text for each match. The buffer must adjust with a growing size of the matched text, to ensure that long matches that do not fit the buffer are not discared.Boost.IOStreams
with regex_filter
loads the entire stream into memory which does not permit pattern matching of streaming and interactive input data.A reflex::BoostMatcher
(or reflex::BoostPerlMatcher
) engine is created from a boost::regex
object, or string regex, and some given input for normal (Perl mode) matching:
A reflex::BoostPosixMatcher
engine is created from a boost::regex
object, or string regex, and some given input for POSIX mode matching:
For input you can specify a string, a wide string, a file, or a stream object.
Use option "N"
to permit empty matches (nullable results).
See Patterns for more details on regex patterns.
See The Input class for more details on the reflex::Input
class.
See Pattern matching methods for more details on pattern matching methods.
The reflex::StdMatcher
class inherits reflex::PatternMatcher<std::regex>
as a base. The reflex::StdEcmaMatcher
and reflex::StdPosixMatcher
are derived classes from reflex::StdMatcher
:
An instance of reflex::StdEcmaMatcher
is initialized with regex syntax option std::regex::ECMAScript
. This is also the default std::regex syntax.
An instance of reflex::StdPosixMatcher
creates a POSIX AWK-based matcher. So that lazy quantifiers are not supported and the leftmost longest rule applies to pattern matching. This instance is initialized with the regex syntax option std::regex::awk
.
The C++11 std::regex library does not support match_partial
that is needed to match patterns on real streams with an adaptive internal buffer that grows when longer matches are made when more input becomes available. Therefore all input is buffered with the C++11 std::regex class matchers.
The std::regex syntax is more limited than Boost.Regex and RE/flex regex. Also the matching behavior differs and cannot be controlled with mode modifiers:
.
(dot) matches anything except \0
(NUL);\177
is erroneously interpreted as a backreference, \0177
does not match;\x7f
is not supported in POSIX mode;\cX
is not supported in POSIX mode;\Q..\E
is not supported;(?imsx:φ)
;\A
, \Z
, \<
and \>
anchors;\b
and \B
anchors in POSIX mode;(?:φ)
in POSIX mode;"N"
(nullable) may cause issues;interactive()
is not supported.With respect to performance, as of this time of writing, std::regex matching is much slower than Boost.Regex, slower by a factor 10 or more.
See Patterns for more details on regex patterns.
See The Input class for more details on the reflex::Input
class.
See Pattern matching methods for more details on pattern matching methods.
The RE/flex framework includes a POSIX regex matching library reflex::Matcher
that inherits the API from reflex::PatternMatcher<reflex::Pattern>
:
where the RE/flex reflex::Pattern
class represents a regex pattern. Patterns as regex texts are internally compiled into deterministic finite state machines by the reflex::Pattern
class. The machines are used by the reflex::Matcher
for fast matching of regex patterns on some given input. The reflex::Matcher
can be much faster than the Boost.Regex matchers.
A reflex::Matcher
engine is constructed from a reflex::Pattern
object, or a string regex, and some given input:
Use option "N"
to permit empty matches (nullable results). Option "T=8"
sets the tab size to 8 for indent \i
and dedent \j
matching.
For input you can specify a string, a wide string, a file, or a stream object.
See Patterns for more details on regex patterns.
See The Input class for more details on the reflex::Input
class.
See Pattern matching methods for more details on pattern matching methods.
The reflex::Pattern
class converts a regex pattern to an efficient FSM and takes a regex string and options to construct the FSM internally:
It is recommended to create a static instance of the pattern if the regex string is fixed. This avoids repeated FSM construction.
The following options are combined in a string and passed to the contructor:
Option | Effect |
---|---|
b | bracket lists are parsed without converting escapes |
e=c; | redefine the escape character |
f=file.cpp; | save finite state machine code to file.cpp |
f=file.gv; | save deterministic finite state machine to file.gv |
i | case-insensitive matching, same as (?i)X |
l | Lex-style trailing context with / , same as (?l)X |
m | multiline mode, same as (?m)X |
n=name; | use reflex_code_name for the machine (instead of FSM) |
q | Lex-style quotations "..." equals \Q...\E , same as (?q)X |
r | throw regex syntax error exceptions (not just fatal errors) |
s | dot matches all (aka. single line mode), same as (?s)X |
x | inline comments, same as (?x)X |
w | display regex syntax errors before raising them as exceptions |
The RE/flex abstract matcher offers four operations for matching with the regex engines that are derived from this base abstract class:
Method | Result |
---|---|
matches() | true if the input from begin to end matches the regex pattern |
find() | search input and return true if a match was found |
scan() | scan input and return true if input at current position matches |
split() | split input at the next match |
The find
, scan
, and split
methods are also implemented as input iterators that apply filtering tokenization, and splitting:
Iterator range | Acts as a | Iterates over |
---|---|---|
find.begin() ...find.end() | filter | all matches |
scan.begin() ...scan.end() | tokenizer | continuous matches |
split.begin() ...split.end() | splitter | text between matches |
To obtain details of a match use the following methods:
Method | Result |
---|---|
accept() | returns group capture or zero if not captured/matched |
text() | returns \0 -terminated const char* string match |
pair() | returns std::pair<size_t,const char*>(accept(), text()) |
size() | returns the length of the text match in bytes |
wsize() | returns the length of the match in number of (wide) characters |
rest() | returns \0 -terminated const char* of the rest of the input |
more() | tells the matcher to append the next match (adjacent matches) |
less(n) | cuts text() to n bytes and repositions the matcher |
lineno() | returns line number of the match, starting with line 1 |
columno() | returns column number of the match, starting with 0 |
first() | returns position of the first character of the match |
last() | returns position of the last + 1 character of the match |
at_bol() | true if matcher reached the begin of a new line \n |
at_bob() | true if matcher is at the start of input, no matches consumed |
at_end() | true if matcher is at the end of input |
Two special methods can be used to manipulate the input stream directly:
Method | Result |
---|---|
input() | returns the next character from the input, matcher then skips it |
unput(c) | put character c back unto the stream, matcher then takes it |
These methods operate on the in
public instance variable of a matcher:
Variable | Usage |
---|---|
in | the reflex::Input object used by the matcher |
find | the reflex::AbstractMatcher::Operation functor for searching |
scan | the reflex::AbstractMatcher::Operation functor for scanning |
split | the reflex::AbstractMatcher::Operation functor for splitting |
To initialize a matcher for interactive use, to assign a new input source or to change its pattern, you can use the following methods:
Method | Result |
---|---|
input(i) | set input to reflex::Input i (string, stream, or FILE* ) |
pattern(p) | set pattern to p (string regex or reflex::Pattern ) |
has_pattern() | true if the matcher has a pattern assigned to it |
own_pattern() | true if the matcher has a pattern to manage and to delete |
pattern() | get the pattern object, reflex::Pattern or boost::regex |
buffer() | buffer all input at once, returns true if successful |
buffer(n) | set the adaptive buffer size to n bytes to buffer input |
interactive() | sets buffer size to 1 for console-based (TTY) input |
flush() | flush the remaining input from the internal buffer |
reset() | resets the matcher, restarting it from the remaining input |
reset(o) | resets the matcher with new options string o ("A?N?T?") |
The first two methods return a reference to the matcher, so multiple method invocations can be chained together.
A matcher reads from the specified input source using its virtual method size_t get(char *s, size_t n)
that simply returns in.get(s, n)
. This method can be overriden by a derived matcher class to customize reading.
Method | Result |
---|---|
get(s, n) | fill s[0..n-1] with next input, returns number of bytes added |
wrap() | returns false (can be overriden to wrap input) |
When a matcher reaches the end of input, it invokes the virtual method wrap()
to check if more input is available. This method returns false by default, but this behavior can be changed by overring wrap()
to set a new input source and return true
, for example:
Note that the constructor in this example does not specify a pattern and input. We can set a pattern for the matcher after its instantiation, by using the pattern(p)
method. In this case the input does not need to be specified to immediately force reading the sources of input that we assigned in our wrap()
.
StdMatcher
that uses std::regex
, may not work properly with wrap()
when input was assigned prior with the matcher's instantiation, because buffer()
in that case only sees the initial input. RE/flex matchers handle this fine.A matcher may accept several types of input, but can only read from one input source at a time. Input to a matcher is represented by a single reflex::Input
class instance that the matcher uses internally.
An input object can be assigned std::string
and char*
strings, wide strings std::wstring
and wchar_t*
, a FILE*
, or a std::istream
.
Wide strings are internally converted to UTF-8 for matching, which effectively normalizes the content.
Conversion from wide string to UTF-8 is shown in the example below. The copyright symbol ©
with Unicode U+00A9 is matched against its UTF-8 sequence C2 A9
:
To obtain the properties of an input object use the following methods:
Method | Result |
---|---|
size() | size of the input in (UTF-encoded) bytes, or zero when unknown |
good() | input is available to read (no error and not EOF) |
eof() | end of input (but use only at_end() with matchers!) |
cstring() | the current const char* (of a std::string ) or NULL |
wstring() | the current const wchar_t* (of a std::wstring ) or NULL |
file() | the current FILE* file descriptor or NULL |
istream() | a std::istream* pointer to the current stream object or NULL |
File content specified with a FILE*
file descriptor can be encoded in ASCII, binary, or UTF-8/16/32 formats. UTF-16/32 file content is automatically decoded and converted to UTF-8 (i.e. streaming) to normalize the input for matching.
A UTF byte order mark (BOM) is detected in the content of a file by the matcher, which enables UTF-8 normalization of the input automatically.
The file encoding is obtained with the file_encoding()
method of a reflex::Input
object and it returns one of:
Encoding constant | Effect when set |
---|---|
Input::Const::plain | plain octets with ASCII, binary, or UTF-8 |
Input::Const::utf16be | UCS-2/UTF-16 big endian |
Input::Const::utf16le | UCS-2/UTF-16 little endian |
Input::Const::utf32be | UCS-4/UTF-32 big endian |
Input::Const::utf32le | UCS-4/UTF-32 little endian |
To override the file encoding, use file_encoding(encoding constant)
.
Some examples to demonstrate the concepts discussed.
This example illustrates the find
and split
methods and iterators with a RE/flex reflex::Matcher
and reflex::BoostMatcher
:
When executed this code prints:
Monty at 1,0 spans 0..5 Python at 2,1 spans 7..13 4 words Monty Python's Flying Circus Circus Flying Monty Python's
This example shows how input can be reassigned in each iteration of a loop that matches wide strings against a word pattern \w+
:
When executed this code prints:
Monty, Flying, Circus,
This example counts the number of words, lines, and chars from the std::cin
stream:
This example tokenizes a string by grouping the subpatterns in a regex and by using the group index of the capture obtained with accept()
:
When executed this code prints:
Token = 1: matched 'cats' with '(\\w*cat\\w*)' Token = 4: matched ' ' with '(.)' Token = 3: matched 'love' with '(\\w+)' Token = 4: matched ' ' with '(.)' Token = 2: matched 'hotdogs' with '(\\w*dog\\w*)' Token = 4: matched '!' with '(.)'
This example reads a file with embedded credit card numbers to extract. The numbers are sorted into five sets for each type of major credit card:
When executed this code prints:
0: 5212345678901234 1: 4123456789012 1: 4123456789012345 2: 371234567890123 3: 601112345678901234 4: 38812345678901
This example shows how a FILE*
file descriptor is used as input. The file encoding is obtained from the UTF BOM, when present in the file. Note that the file's state is accessed through the matcher's member variable in
:
This example shows how to override the file encoding, such as in cases when a file with wide character content has no UTF BOM or when a binary file is read that could start with a BOM that should be ignored:
Download RE/flex from SourceForge and visit the GitHub RE/flex repository.
RE/flex software is released under the BSD-3 license. All parts of the software have reasonable copyright terms permitting free redistribution. This includes the ability to reuse all or parts of the RE/flex source tree.
Copyright (c) 2016, Robert van Engelen, Genivia Inc. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
(1) Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
(2) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
(3) The name of the author may not be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
The Free Software Foundation maintains a BSD-3 License Wiki.