String Substitutions

Regular expressions are not only good for searching text; they're good at manipulating it. And one of the most common text manipulation tasks is search-and-replace. xpressive provides the regex_replace() algorithm for searching and replacing.

regex_replace()

Performing search-and-replace using regex_replace() is simple. All you need is an input sequence, a regex object, and a format string or a formatter object. There are several versions of the regex_replace() algorithm. Some accept the input sequence as a bidirectional container such as std::string and return the result in a new container of the same type. Others accept the input as a null-terminated string and return a std::string. Still others accept the input sequence as a pair of iterators and write the result into an output iterator. The substitution may be specified as a string with format sequences or as a formatter object. Below are some simple examples of using string-based substitutions.

std::string input("This is his face");
sregex re = as_xpr("his");                // find all occurrences of "his" ...
std::string format("her");                // ... and replace them with "her"

// use the version of regex_replace() that operates on strings
std::string output = regex_replace( input, re, format );
std::cout << output << '\n';

// use the version of regex_replace() that operates on iterators
std::ostream_iterator< char > out_iter( std::cout );
regex_replace( out_iter, input.begin(), input.end(), re, format );

The above program prints out the following:

Ther is her face
Ther is her face

Notice that all the occurrences of "his" have been replaced with "her".

See the complete example program that shows how to use regex_replace(), and check the regex_replace() reference for a complete list of the available overloads.

Replace Options

The regex_replace() algorithm takes an optional bitmask parameter to control the formatting. The possible values of the bitmask are:

Table 1.7. Format Flags

Flag                 Meaning
format_default       Recognize the ECMA-262 format sequences (see below).
format_first_only    Only replace the first match, not all of them.
format_no_copy       Don't copy the parts of the input sequence that didn't match the regex to the output sequence.
format_literal       Treat the format string as a literal; that is, don't recognize any escape sequences.
format_perl          Recognize the Perl format sequences (see below).
format_sed           Recognize the sed format sequences (see below).
format_all           In addition to the Perl format sequences, recognize some Boost-specific format sequences.


These flags live in the xpressive::regex_constants namespace. If the substitution parameter is a function object instead of a string, the flags format_literal, format_perl, format_sed, and format_all are ignored.

The ECMA-262 Format Sequences

When you haven't specified a substitution string dialect with one of the format flags above, you get the dialect defined by ECMA-262, the standard for ECMAScript. The table below shows the escape sequences recognized in ECMA-262 mode.

Table 1.8. Format Escape Sequences

Escape Sequence    Meaning
$1, $2, etc.       The corresponding sub-match
$&                 The full match
$`                 The match prefix
$'                 The match suffix
$$                 A literal '$' character


Any other sequence beginning with '$' simply represents itself. For example, if the format string were "$a" then "$a" would be inserted into the output sequence.

The Sed Format Sequences

When you pass the format_sed flag to regex_replace(), the following escape sequences are recognized:

Table 1.9. Sed Format Escape Sequences

Escape Sequence    Meaning
\1, \2, etc.       The corresponding sub-match
&                  The full match
\a                 A literal '\a'
\e                 A literal char_type(27)
\f                 A literal '\f'
\n                 A literal '\n'
\r                 A literal '\r'
\t                 A literal '\t'
\v                 A literal '\v'
\xFF               A literal char_type(0xFF), where F is any hex digit
\x{FFFF}           A literal char_type(0xFFFF), where F is any hex digit
\cX                The control character X


The Perl Format Sequences

When you pass the format_perl flag to regex_replace(), the following escape sequences are recognized:

Table 1.10. Perl Format Escape Sequences

Escape Sequence    Meaning
$1, $2, etc.       The corresponding sub-match
$&                 The full match
$`                 The match prefix
$'                 The match suffix
$$                 A literal '$' character
\a                 A literal '\a'
\e                 A literal char_type(27)
\f                 A literal '\f'
\n                 A literal '\n'
\r                 A literal '\r'
\t                 A literal '\t'
\v                 A literal '\v'
\xFF               A literal char_type(0xFF), where F is any hex digit
\x{FFFF}           A literal char_type(0xFFFF), where F is any hex digit
\cX                The control character X
\l                 Make the next character lowercase
\L                 Make the rest of the substitution lowercase until the next \E
\u                 Make the next character uppercase
\U                 Make the rest of the substitution uppercase until the next \E
\E                 Terminate \L or \U
\1, \2, etc.       The corresponding sub-match
\g<name>           The named backref name


The Boost-Specific Format Sequences

When you pass the format_all flag to regex_replace(), the escape sequences recognized are the same as those above for format_perl. In addition, conditional expressions of the following form are recognized:

?Ntrue-expression:false-expression

where N is a decimal digit representing a sub-match. If the corresponding sub-match participated in the full match, then the substitution is true-expression. Otherwise, it is false-expression. In this mode, you can use parens () for grouping. If you want a literal paren, you must escape it as \(.

Formatter Objects

Format strings are not always expressive enough for all your text substitution needs. Consider a simple case: mapping input strings to output strings, as you might with environment variables. For this you would use a formatter object rather than a format string. The following code finds embedded environment variables of the form "$(XYZ)" and computes the substitution string by looking up the environment variable in a map.

#include <map>
#include <string>
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
using namespace boost;
using namespace xpressive;

std::map<std::string, std::string> env;

std::string const &format_fun(smatch const &what)
{
    return env[what[1].str()];
}

int main()
{
    env["X"] = "this";
    env["Y"] = "that";

    std::string input("\"$(X)\" has the value \"$(Y)\"");

    // replace strings like "$(XYZ)" with the result of env["XYZ"]
    sregex envar = "$(" >> (s1 = +_w) >> ')';
    std::string output = regex_replace(input, envar, format_fun);
    std::cout << output << std::endl;
}

In this case, we use a function, format_fun(), to compute the substitution string on the fly. It accepts a match_results<> object which contains the results of the current match. format_fun() uses the first submatch as a key into the global env map. The above code displays:

"this" has the value "that"

The formatter need not be an ordinary function. It may be an object of class type. And rather than return a string, it may accept an output iterator into which it writes the substitution. Consider the following, which is functionally equivalent to the above.

#include <algorithm>   // std::copy
#include <map>
#include <string>
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
using namespace boost;
using namespace xpressive;

struct formatter
{
    typedef std::map<std::string, std::string> env_map;
    env_map env;

    template<typename Out>
    Out operator()(smatch const &what, Out out) const
    {
        env_map::const_iterator where = env.find(what[1]);
        if(where != env.end())
        {
            std::string const &sub = where->second;
            out = std::copy(sub.begin(), sub.end(), out);
        }
        return out;
    }

};

int main()
{
    formatter fmt;
    fmt.env["X"] = "this";
    fmt.env["Y"] = "that";

    std::string input("\"$(X)\" has the value \"$(Y)\"");

    sregex envar = "$(" >> (s1 = +_w) >> ')';
    std::string output = regex_replace(input, envar, fmt);
    std::cout << output << std::endl;
}

The formatter must be a callable object (a function or a function object) that has one of three possible signatures, detailed in the table below. For the table, fmt is a function pointer or function object, what is a match_results<> object, out is an OutputIterator, and flags is a value of regex_constants::match_flag_type:

Table 1.11. Formatter Signatures

Formatter Invocation: fmt(what)
Return Type: Range of characters (e.g. std::string) or null-terminated string
Semantics: The string matched by the regex is replaced with the string returned by the formatter.

Formatter Invocation: fmt(what, out)
Return Type: OutputIterator
Semantics: The formatter writes the replacement string into out and returns out.

Formatter Invocation: fmt(what, out, flags)
Return Type: OutputIterator
Semantics: The formatter writes the replacement string into out and returns out. The flags parameter is the value of the match flags passed to the regex_replace() algorithm.


Formatter Expressions

In addition to format strings and formatter objects, regex_replace() also accepts formatter expressions. A formatter expression is a lambda expression that generates a string. It uses the same syntax as that for Semantic Actions, which are covered later. The above example, which uses regex_replace() to substitute strings for environment variables, is repeated here using a formatter expression.

#include <map>
#include <string>
#include <iostream>
#include <boost/xpressive/xpressive.hpp>
#include <boost/xpressive/regex_actions.hpp>
using namespace boost::xpressive;

int main()
{
    std::map<std::string, std::string> env;
    env["X"] = "this";
    env["Y"] = "that";

    std::string input("\"$(X)\" has the value \"$(Y)\"");

    sregex envar = "$(" >> (s1 = +_w) >> ')';
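    // ref() is qualified so that argument-dependent lookup cannot also find std::ref
    std::string output = regex_replace(input, envar, boost::xpressive::ref(env)[s1]);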
    std::cout << output << std::endl;
}

In the above, the formatter expression is ref(env)[s1]. It says: use the value of the first submatch, s1, as a key into the env map. The purpose of xpressive::ref() here is to make the reference to the env local variable lazy so that the index operation is deferred until we know what to replace s1 with.
