When using xpressive, the first thing you'll do is create a basic_regex<> object. This section goes over the nuts and bolts of building a regular expression in the two dialects xpressive supports: static and dynamic.
The feature that really sets xpressive apart from other C/C++ regular expression libraries is the ability to author a regular expression using C++ expressions. xpressive achieves this through operator overloading, using a technique called expression templates to embed a mini-language dedicated to pattern matching within C++. These "static regexes" have many advantages over their string-based brethren. In particular, static regexes:
Since we compose static regexes using C++ expressions, we are constrained by the rules for legal C++ expressions. Unfortunately, that means that "classic" regular expression syntax cannot always be mapped cleanly into C++. Rather, we map the regex constructs, picking new syntax that is legal C++.
You create a static regex by assigning one to an object of type basic_regex<>. For instance, the following defines a regex that can be used to find patterns in objects of type std::string:
sregex re = '$' >> +_d >> '.' >> _d >> _d;
Assignment works similarly.
In static regexes, character and string literals match themselves. For instance, in the regex above, '$' and '.' match the characters '$' and '.' respectively. Don't be confused by the fact that $ and . are meta-characters in Perl. In xpressive, literals always represent themselves.
When using literals in static regexes, you must take care that at least one operand is not a literal. For instance, the following are not valid regexes:
sregex re1 = 'a' >> 'b'; // ERROR!
sregex re2 = +'a'; // ERROR!
The two operands to the binary >> operator are both literals, and the operand of the unary + operator is also a literal, so these statements will call the native C++ binary right-shift and unary plus operators, respectively. That's not what we want. To get operator overloading to kick in, at least one operand must be a user-defined type. We can use xpressive's as_xpr() helper function to "taint" an expression with regex-ness, forcing operator overloading to find the correct operators. The two regexes above should be written as:
sregex re1 = as_xpr('a') >> 'b'; // OK
sregex re2 = +as_xpr('a'); // OK
As you've probably already noticed, sub-expressions in static regexes must be separated by the sequencing operator, >>. You can read this operator as "followed by".
// Match an 'a' followed by a digit
sregex re = 'a' >> _d;
Alternation works just as it does in Perl with the | operator. You can read this operator as "or". For example:
// match a digit character or a word character one or more times
sregex re = +( _d | _w );
In Perl, parentheses () have special meaning. They group, but as a side-effect they also create back-references like $1 and $2. In C++, parentheses only group -- there is no way to give them side-effects. To get the same effect, we use the special s1, s2, etc. tokens. Assigning to one creates a back-reference. You can then use the back-reference later in your expression, like using \1 and \2 in Perl. For example, consider the following regex, which finds matching HTML tags:
"<(\\w+)>.*?</\\1>"
In static xpressive, this would be:
'<' >> (s1= +_w) >> '>' >> -*_ >> "</" >> s1 >> '>'
Notice how you capture a back-reference by assigning to s1, and then you use s1 later in the pattern to find the matching end tag.
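Tip |
---|
Grouping without capturing a back-reference: because parentheses in C++ only group, plain parentheses with no assignment to s1 give you grouping without a capture. |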
Perl lets you make part of your regular expression case-insensitive by using the (?i:) pattern modifier. xpressive also has a case-insensitivity pattern modifier, called icase. You can use it as follows:
sregex re = "this" >> icase( "that" );
In this regular expression, "this" will be matched exactly, but "that" will be matched irrespective of case.
Case-insensitive regular expressions raise the issue of internationalization: how should case-insensitive character comparisons be evaluated? Also, many character classes are locale-specific. Which characters are matched by digit and which are matched by alpha? The answer depends on the std::locale object the regular expression object is using. By default, all regular expression objects use the global locale. You can override the default by using the imbue() pattern modifier, as follows:
std::locale my_locale = /* initialize a std::locale object */;
sregex re = imbue( my_locale )( +alpha >> +digit );
This regular expression will evaluate alpha and digit according to my_locale. See the section on Localization and Regex Traits for more information about how to customize the behavior of your regexes.
The table below lists the familiar regex constructs and their equivalents in static xpressive.
Table 1.4. Perl syntax vs. Static xpressive syntax
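Perl | Static xpressive | Meaning |
---|---|---|
`.` | `_` | any character (assuming Perl's /s modifier). |
`ab` | `a >> b` | sequencing of `a` and `b` sub-expressions. |
`a\|b` | `a \| b` | alternation of `a` and `b` sub-expressions. |
`(a)` | `(s1= a)` | group and capture a back-reference. |
`(?:a)` | `(a)` | group and do not capture a back-reference. |
`\1` | `s1` | a previously captured back-reference. |
`a*` | `*a` | zero or more times, greedy. |
`a+` | `+a` | one or more times, greedy. |
`a?` | `!a` | zero or one time, greedy. |
`a{n,m}` | `repeat<n,m>(a)` | between `n` and `m` times, greedy. |
`a*?` | `-*a` | zero or more times, non-greedy. |
`a+?` | `-+a` | one or more times, non-greedy. |
`a??` | `-!a` | zero or one time, non-greedy. |
`a{n,m}?` | `-repeat<n,m>(a)` | between `n` and `m` times, non-greedy. |
`^` | `bos` | beginning of sequence assertion. |
`$` | `eos` | end of sequence assertion. |
`\b` | `_b` | word boundary assertion. |
`\B` | `~_b` | not word boundary assertion. |
`\n` | `_n` | literal newline. |
`.` | `~_n` | any character except a literal newline (without Perl's /s modifier). |
`\r?\n\|\r` | `_ln` | logical newline. |
`[^\r\n]` | `~_ln` | any single character not a logical newline. |
`\w` | `_w` | a word character, equivalent to `set[alnum \| '_']`. |
`\W` | `~_w` | not a word character, equivalent to `~set[alnum \| '_']`. |
`\d` | `_d` | a digit character. |
`\D` | `~_d` | not a digit character. |
`\s` | `_s` | a space character. |
`\S` | `~_s` | not a space character. |
`[:alnum:]` | `alnum` | an alpha-numeric character. |
`[:alpha:]` | `alpha` | an alphabetic character. |
`[:blank:]` | `blank` | a horizontal white-space character. |
`[:cntrl:]` | `cntrl` | a control character. |
`[:digit:]` | `digit` | a digit character. |
`[:graph:]` | `graph` | a graphable character. |
`[:lower:]` | `lower` | a lower-case character. |
`[:print:]` | `print` | a printing character. |
`[:punct:]` | `punct` | a punctuation character. |
`[:space:]` | `space` | a white-space character. |
`[:upper:]` | `upper` | an upper-case character. |
`[:xdigit:]` | `xdigit` | a hexadecimal digit character. |
`[0-9]` | `range('0','9')` | characters in range `'0'` through `'9'`. |
`[abc]` | `as_xpr('a') \| 'b' \| 'c'` | characters `'a'`, `'b'`, or `'c'`. |
`[abc]` | `(set= 'a','b','c')` | same as above |
`[0-9abc]` | `set[ range('0','9') \| 'a' \| 'b' \| 'c' ]` | characters `'a'`, `'b'`, `'c'` or in range `'0'` through `'9'`. |
`[0-9abc]` | `set[ range('0','9') \| (set= 'a','b','c') ]` | same as above |
`[^abc]` | `~(set= 'a','b','c')` | not characters `'a'`, `'b'`, or `'c'`. |
`(?i:stuff)` | `icase(stuff)` | match stuff disregarding case. |
`(?>stuff)` | `keep(stuff)` | independent sub-expression, match stuff and turn off backtracking. |
`(?=stuff)` | `before(stuff)` | positive look-ahead assertion, match if before stuff but don't include stuff in the match. |
`(?!stuff)` | `~before(stuff)` | negative look-ahead assertion, match if not before stuff. |
`(?<=stuff)` | `after(stuff)` | positive look-behind assertion, match if after stuff but don't include stuff in the match. (stuff must be constant-width.) |
`(?<!stuff)` | `~after(stuff)` | negative look-behind assertion, match if not after stuff. (stuff must be constant-width.) |
`(?P<name>stuff)` | `mark_tag name(n); ... (name= stuff)` | Create a named capture. |
`(?P=name)` | `mark_tag name(n); ... name` | Refer back to a previously created named capture. |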