C++11 Regex Library
by Kevin Heifner, Principal Software Engineer
July 2013
Introduction
The new C++11 regular expression library <regex>
brings the power of regular expressions to the C++ standard library. Previously part of TR1 (C++ Standards Committee Technical Report 1) and also available via Boost, the library is now under namespace std
in C++11.
Bjarnew Stroustrup said: "Surprisingly, C++0x [C++11] feels like a new language...". As such, it is important, even for experienced C++ developers, to study the new features. You can find many references online and many of the classic C++ books are even now hitting the bookstores with new editions that are rewritten to explain this "new" language. See the reference section at the end of this article for a list of some of these.
In this article, we will explore the C++11 regex library and look at some of the new C++11 language features by examing the implementation of FuncFinder, a useful command line utility.
Function Regex Finder Utility (FuncFinder) Overview
The FuncFinder command line application takes a regular expression and a list of C++ source files and lists the entire function definition of any function that contains the regular expression. It works similarly to grep with -A NUM, --after-context=NUM
and -B NUM, --before-context=NUM
options. Except, instead of printing a specified number of trailing and leading lines, it prints the entire function definition regardless of the number of lines. In addition to the function definition, the output includes the line with the regex as well as the range of lines of the function in the file.
> FuncFinder
Usage: FuncFinder regex source_file ...
It can be tedious to specify each source file individually, but under Unix-like environments like Linux or Mac OS X or even Windows with Cygwin, find
and xargs
can specify each file for you.
> find . -name "*.cpp" | xargs FuncFinder regex_search_string
Here is an example of running it on the TAO services source code looking for uses of std::ifstream
.
> cd ACE_wrappers-2.0/TAO/orbsvcs
> find . -name "*.cpp" | xargs FuncFinder std::ifstream
== ./performance-tests/RTEvent/Colocated_Roundtrip/compare_histo.cpp(30) range [24,41]==
void
load_file (Vector &vector,
const char *filename)
{
std::ifstream is (filename);
if (!is)
throw "Cannot open file";
while (is && !is.eof ())
{
long i; double v;
is >> i >> v;
Vector::value_type e (i, v);
vector.insert (e);
}
}
Note the use of std::ifstream
in the first line of the load_file
function. Since FuncFinder found the search string in the compare_histo.cpp
file, it reported the number of the line that matched std::ifstream
and printed the entire function definition. Here our search regular expression is just the simple string, "std::ifstream"
, but it could have been any regular expression.
Even though, as we will see, FuncFinder is not written with performance particularly in mind, performance is not completely ignored. FuncFinder runs sufficiently quick for most use cases. You can speed up the previous example by incorporating the amazing performance of grep. The following produces the same output as above, but runs much faster.
> cd ACE_wrappers-2.0/TAO/orbsvcs
> find . -name "*.cpp" | xargs grep -l std::ifstream | xargs FuncFinder std::ifstream
Regex Library Overview
The implementation of FuncFinder utilizes the new C++ regex library. The interface for the regex library is rather small as library interfaces go, but as is the case for regular expression libraries, the power is in the regular expression syntax itself and not in the actual API. This article assumes you are familiar with basic regular expression syntax and will not cover it in detail. If this is not the case, there are many good references in the Reference section at the end of this article.
The main elements of the C++ Regex library consist of only one class (not counting sub_match
and match_results
), three functions, and two iterators.
Let's start with the class: regex
.
- template<class CharT, class Traits = regex_traits<CharT>>
- class basic_regex {
- public:
- // traits & operations not shown, typically only the constructor is used
-
- // overloads for basic_string, basic_regex
- // ECMAScript is ECMA-262 i.e. JavaScript with minor modifications
- explicit basic_regex(const CharT* p,
- flag_type f = regex_constants::ECMAScript);
- };
-
- // same pattern that is used for std::string
- using regex = basic_regex<char>;
The regex
class holds the regular expression and is typically created with a string. The standard implementation includes not only the default, JavaScript, but also basic POSIX, extended POSIX, awk, grep, and egrep grammar support. These non-default grammars are supported by specifying regex_constants
passed in as flags to the constructor. Insensitive case comparison is also supported by passing regex_constants::icase
as a flag to the constructor. Consult your reference for a complete list of regex_constants
flags that are supported.
Note the use of "using regex = basic_regex;
" instead of "typedef basic_regex regex;
" in the above code snippet. The C++11 standard defines them as equivalent.
Along with the regex
class, the library includes the functions: regex_match
, regex_search
, and regex_replace
. regex_match
and regex_search
have the same function signatures/overloads. The only difference between the two is that regex_match
must match the entire input sequence and regex_search
looks for the regular expression pattern anywhere in the input sequence.
- // params beg & end are begin and end iterators for an input string
- // param m is an out parameter of sub-matches
- // param e is the regex
- // param flags are any regex flags to the function
- template <class BidirectionalIterator, class Allocator, class charT, class traits>
- bool regex_match(BidirectionalIterator beg, BidirectionalIterator end,
- match_results<BidirectionalIterator, Allocator>& m,
- const basic_regex<charT, traits>& e,
- regex_constants::match_flag_type flags = regex_constants::match_default);
-
- // same as above without sub-matches
- template <class BidirectionalIterator, class charT, class traits>
- bool regex_match(BidirectionalIterator beg, BidirectionalIterator end,
- const basic_regex<charT, traits>& e,
- regex_constants::match_flag_type flags = regex_constants::match_default);
-
- // overload for std::string not shown
- // same as above only string passed instead of begin, end iterators
- template <class charT, class Allocator, class traits>
- bool regex_match(const charT* str,
- match_results<const charT*, Allocator>& m,
- const basic_regex<charT, traits>& e,
- regex_constants::match_flag_type flags = regex_constants::match_default);
-
- // overload for std::string not shown
- // same as above only string passed instead of begin, end iterators
- template <class charT, class traits>
- bool regex_match(const charT* str,
- const basic_regex<charT, traits>& e,
- regex_constants::match_flag_type flags = regex_constants::match_default);
The optional match_results
out argument in the above regex_match
and regex_search
algorithms provides access into the regular expression capture groups (sub_matches
). The empty
method of match_results
as well as the return value of the algorithm indicates a successful match. Once a match is found, the match_results
sub_matches
can be indexed via operator[]
and size
or iterated via begin
and end
iterators.
regex_replace
is the other available library function. It uses a format string to replace matched regular expressions in a string. The format string fmt
can be a simple string or can contain references into the regex e
.
- // param s is string to search
- // param e is the regex
- // param fmt is the replacement string, can contain references into regex e
- // param flags are any regex flags to the function
- template <class Traits, class CharT>
- basic_string<CharT> regex_replace(const basic_string<CharT>& s,
- const basic_regex<CharT, Traits>& e,
- const basic_string<CharT>& fmt,
- match_flag_type flags = match_default);
-
- // param out is an output iterator, can be std::back_inserter
- // params beg & end are begin and end iterators for an input string
- template <class OutputIter, class BidirectionalIter, class Traits, class CharT>
- OutputIter regex_replace(OutputIter out,
- BidirectionalIter beg, BidirectionalIter end,
- const basic_regex<CharT, Traits>& e,
- const basic_string<CharT>& fmt,
- match_flag_type flags = match_default);
Rounding out the Regex library are two iterators: regex_iterator
and regex_token_iterator
.regex_iterator
is used to iterate through matches of a regular expression in a string. regex_token_iterator
is used to iterate through sub_matches
(capture groups) or to iterate through non-matches providing advanced tokenization capabilities. I will not be covering regex_token_iterator in this article. However, you can find a nice simple example here: http://en.cppreference.com/w/cpp/regex/regex_token_iterator. We will see regex_iterator
in the implementation of FuncFinder.
FuncFinder Implementation
As we saw in the usage of FuncFinder above, FuncFinder takes a regular expression to find and a list of files to search. There is nothing special about the implementation to make this happen. It is just a simple loop through the command line arguments as can be seen in the listing of main
below.
- // funcfinder.cpp
- #include <algorithm>
- #include <cstddef>
- #include <functional>
- #include <fstream>
- #include <iostream>
- #include <regex> // C++11 regular expression library
- #include <sstream>
- #include <string>
- #include <tuple>
-
- // ... (we will see the rest of the code shortly)
-
- int main(int argc, char* argv[])
- {
- try {
- if (argc < 3) {
- std::cerr << "Usage: " << argv[0] << " regex source_file ..." << std::endl;
- return 1;
- }
-
- for (int i = 2; i < argc; ++i) {
- processFile(argv[1], argv[i]);
- }
-
- return 0;
- } catch (std::exception& e) {
- std::cerr << "Exception: " << e.what() << std::endl;
- } catch (...) {
- std::cerr << "Unknown exception." << std::endl;
- }
- return 1;
- }
More interesting code is found in processFile
.
- void processFile(const std::string& regExpStr, const std::string& fileName)
- {
- std::ifstream ifs(fileName);
- if (!ifs) {
- std::cerr << "Unable to open file: " << fileName << std::endl;
- return;
- }
-
- // much faster not to store lines as file is processed, verify regex found first
- std::regex grepExp(regExpStr);
- bool found = false;
- for (std::string s; !found && std::getline(ifs, s);) {
- found = std::regex_search(s, grepExp);
- }
- if (!found) return;
-
- ifs.seekg(0);
- for (size_t total = 0;;) {
- // including findFunction memory usage, using up to 2x size of file in memory plus overhead
- std::stringstream ss;
- auto result = findFunction(ifs, grepExp, ss);
- if (std::get<0>(result) == -1) break;
- std::cout << "== " << fileName << "("
- << total + std::get<0>(result) << ") range ["
- << total + std::get<1>(result) << ","
- << total + std::get<2>(result) << "] ==n";
- std::cout << ss.str();
- total += std::get<2>(result);
- }
- }
After ProcessFile
opens the input file (lines 3-7), it does a quick search through the file to see if the search regular expression exists in the file. ProcessFile
does a two pass approach since, as we will see, findFunction
is memory and process intensive.
On lines 10 and 13 we have the first use of the C++11 regex library. On line 10 a std::regex
object, grepExp
, is created from the command line regular expression string. On line 13, this regex object is passed along with each line of the input file to determine if the regular expression matches any line in the file. std::regex_search
will return true if the grepExp
matches any value in the line s
. If the regular expression is not in the file, then there is no need to call findFunction
. Note, lines 11-15 could be removed and FuncFinder would return the exact same results, it would just take longer when the input file does not contain the search regular expression. If we always preprocessed our input files on the command line via grep, this two pass approach would actually slow us down.
On line 17 we start the second pass through the file. This time we pass the input stream to findFunction
, which does all the hard work. Line 21 uses the new C++11 auto
capability. result
is whatever type findFunction
returns, in this case, the new C++11 std::tuple
. std::tuple
is just like the old std::pair
only it can have more than two values, each retrievable by std::get
as we see on lines: 22, 24-26, 28. To determine the values stored in these three tuple values, lets look at the function declaration of findFunction
.
- // Finds regExp in function definitions and places entire function in output stream.
- // Stores up to the complete input stream in memory plus overhead if os != nullptr.
- // Does not handle multi-line raw strings or preprocessor continuation lines.
- // @param is input stream
- // @param regExp regular expression to search for within function definitions
- // @param os output stream, if nullptr then greatly reduced memory usage.
- // @return get<0> is line offset in input stream where regExp found or -1
- // get<1> is line offset in input stream of the first line of the function
- // get<2> is line offset in input stream of last line read
- std::tuple<size_t,size_t,size_t>
- findFunction(std::istream& is, const std::regex& regExp, std::ostream* os = nullptr)
The first tuple value is the line number where the regExp
is found or -1
if not found. The second and third tuple values are the first and last line of the function. Since these values are offsets we keep a total from the last call to findFunction
in processFile
so that we can report the correct line number.
Now we are ready to look at the implementation of findFunction
.
- std::tuple<size_t,size_t,size_t>
- findFunction(std::istream& is, const std::regex& regExp, std::ostream* os = nullptr)
- {
- // "namespace {", "namespace a{", "namespace a{namespace b {", "using namespace std;",
- // "namespace alias = oldname;"
- // "class a {", "class a : public b {"
- std::regex nsClassExp("\\b(?:(?:namespace)|(?:class))\\b");
- std::regex usingNSClassExp("(?:\\busing\\s+namespace\\b)|" // using namespace
- "(?:\\bnamespace\\s*[^\\d\\W]\\w*\\s*=)|" // namespace alias
- "(?:\\bclass\\s*[^\\d\\W]\\w*\\s*)[;,>]" // class forward declare, or template
- );
- std::regex lineCommentExp("\\/\\/.*"); // rest of line
- // handling multi-line comments that begin/end on same line separately simplifies logic
- std::regex spanCommentExp("\\/\\*(?:.*?)\\*\\/"); // a multiline comment on one line: /* stuff */
- std::regex begMultiLineCommentExp("(.*?)\\/\\*.*"); // [stuff]/* ...
- std::regex endMultiLineCommentExp(".*?\\*\\/(.*)"); // ... */[stuff]
- std::regex includeExp("^\\s*#include\\b"); // #include
- // "*" and '*' which allows for embedded quotes: "hi\"there\""
- // This would be more readable if C++11 raw string support was available.
- std::regex quotedStrExp("\"(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"|" // double quotes with possible escapes
- "\\'(?:[^\\'\\\\]*(?:\\\\.[^\\'\\\\]*)*)\\'"); // single quotes with possible escapes
At the beginning of findFunction
we define the regex
objects needed.
nsClassExp
- Matches the whole words:namespace
orclass
\\b
indicates a word boundary(?:
indicates a non-capture parentheses group
usingNSClassExp
- Matches use of anamespace
orclass
keyword that is not a namespace or class definition, e.g.,using namespace
ornamespace b=
orclass Foo;
\\s
indicates whitespace\\d
indicates digits [0-9]\\w
indicates any word character including [0-9]\\W
indicates any non-word character
lineCommentExp
- Matches//
to the end of the linespanCommentExp
- Matches entire comment/* .. */
when on one line.*?
indicates non-greedy so that match ends at first*/
begMultiLineCommentExp
- Matches/*
to end of lineendMultiLineCommentExp
- Matches everything up to and including*/
includeExp
- Matches#include
at beginning of line (whitespace allowed before)quotedStrExp
- Matches" .. "
or' .. '
allowing for escaped quotes" .. \" .. "
- All of these regular expression strings would be easier to read if the new C++11 raw string was more widely supported. However, at this time, support is very limited. With raw string support this could be written as:
std::regex quotedStrExp(R""(?:[^"\\]*(?:\\.[^"\\]*)*"|\'(?:[^\'\\]*(?:\\.[^\'>\\]*)*)\'");
- Still rather complicated, but easier to read without all the C++ required escaped characters.
- All of these regular expression strings would be easier to read if the new C++11 raw string was more widely supported. However, at this time, support is very limited. With raw string support this could be written as:
Next we declare some other variables needed.
- std::vector<std::string> lines;
- std::string s;
- int numOpenBrackets = 0; // int required, using negative numbers to handle namespace/class
- size_t foundLoc = 0, begLineNum = 1, endLineNum = 1;
- bool inMultiLineComment = false;
-
- // factor out common reset logic into helper function
- std::function<void()> reset = [&]() {
- lines.clear();
- numOpenBrackets = 0;
- begLineNum = endLineNum + 1;
- };
lines
- Used to store input lines in case we find the searchregExp
. The vector is cleared for each new function we encounter in the input stream.s
- Used to store the current input line. We manipulates
as needed, which is okay because a copy is stored inlines
for possible output.numOpenBrackets
-findFunction
tracks open/close brackets in order to know when it is inside a function definition.foundLoc, begLineNum, endLineNum
- tracks the three return values.inMultiLineComment
- Since we process the input stream one line at a time, we have to know if we are currently inside a multi-line comment.reset
- A C++11 lambda function used to factor out some common code needed to reset our local variables when we reach namespace, class, or file scope. The new C++11std::function
allows us to store the generated lambda function for later invocation. We use[&]
to provide reference access tolines, numOpenBrackets, begLineNum,
andendLineNum
. In this case, only a new C++11 lambda with closure support could provide the ease of use that makes this worthwhile. This technique provides a new tool for fighting code duplication.
And now for the rest of findFunction
.
- for (; std::getline(is, s); ++endLineNum) {
- if (os) lines.push_back(s);
- s = std::regex_replace(s, spanCommentExp, " "); // remove span comments /* .. */ on same line
- // handle multi-line comments
- std::smatch m;
- if (!inMultiLineComment && std::regex_search(s, m, begMultiLineCommentExp)) {
- s = m[1].str(); // smatch uses iterators, m invalid after assignment to s
- inMultiLineComment = true;
- } else if (inMultiLineComment && std::regex_search(s, m, endMultiLineCommentExp)) {
- s = m[1].str();
- inMultiLineComment = false;
- }
- if (inMultiLineComment) continue;
- s = std::regex_replace(s, lineCommentExp, ""); // remove single line comments
- s = std::regex_replace(s, quotedStrExp, ""); // remove quoted strings
- // ignore #includes in file, assume they are always at file scope
- if (foundLoc == 0 && std::regex_search(s, includeExp)) { reset(); continue; }
- if (std::regex_search(s, regExp)) foundLoc = endLineNum; // found, print at end of function
- // assumes an open namespace statement not on the same line as a using namespace statement
- if (std::regex_search(s, nsClassExp) && !std::regex_search(s, usingNSClassExp)) {
- reset(); // reset on namespace/class
- // subtract out namespace/class open brackets so we don't report the entire namespace/class
- numOpenBrackets -= countMatches(s, nsClassExp);
- }
- //numOpenBrackets += std::count(s.cbegin(), s.cend(), '{'); // alternative implementation
- //auto numCloseBrackets = std::count(s.cbegin(), s.cend(), '}'); // instead of the for_each
- decltype(numOpenBrackets) numCloseBrackets = 0;
- std::for_each(s.cbegin(), s.cend(),
- [&] (char c) {
- if (c == '{') ++numOpenBrackets;
- if (c == '}') ++numCloseBrackets;
- });
-
- if (numCloseBrackets > 0) {
- // only clear at '}' so that function name, comments, etc. included in output of find
- numOpenBrackets -= numCloseBrackets;
- if (numOpenBrackets <= 0) {
- if (foundLoc > 0) {
- // pre-C++11: std::copy(lines.begin(), lines.end(), std::ostream_iterator<std::string>(os, "\n"));
- if (os) for (const auto& x : lines) *os << x << '\n';
- return std::make_tuple(foundLoc, begLineNum, endLineNum);
- }
- reset();
- }
- }
- }
-
- return std::make_tuple(-1, begLineNum, endLineNum);
- }
The function consists of a for
loop reading one line at a time. Each loop iteration increments the endLineNum
which we need in order to properly report the function range. If the caller provided an output stream, then we keep track of each read line in lines
, so that on line 82 we can copy them into the output stream. We do not immediately place the line in the output stream because we do not know if we will find the regExp
.
Next, on lines 45-56, findFunction removes comments. findFunction
does not look for the search regExp
in comments. If we wanted findFunction
to search comments, as well as code, we could move line 60 up to line 45 so that it searches for the regExp
before removing comments. On line 45, if there are any /* .. */
comments on the same line we remove them from our line, s
, so they are not included in the search. std::regex_replace
replaces all matches of spanCommentExp
in s
with a single space. We have already seen std::regex_search used in processFile, however, on lines 48 & 51 we use the overloaded version of regex_search
that takes a std::smatch
. The std::smatch
provides access to capture groups in the found regular expression. In the case of begMultiLineCommentExp
, the first capture group is all the characters before /*
. In the case of endMultiLineCommentExp
, the first capture group is all the characters after */
. [0]
is reserved for the complete regex match. s
is set to either all the characters before/*
(line 49) or all the characters after */
(line 52).
Since quoted strings could include {
or }
characters we remove them from the search as well on line 57. On line 59 we reset our search any time we find #include
. This is done so that functions at the top of files that match our search do not include all the #include
statements. When functions are reported, findFunction
reports all the comments above the function, but we do not want to include all the #include
statements. Since #includes
are rarely found in functions this works nicely for the nominal case.
Line 60 is the regex search for the command line regular expression. If found, we keep track of the line number which is the current last line read, stored in endLineNum
. We keep processing the file until we find the closing }
for the function with the found regular expression.
Lines 61-79 handle processing of {
and }
brackets. findFunction
keeps running totals of the number of open brackets and the number of closed brackets in order to determine the end of the function with the found regular expression. On line 62, if we find a namespace
or class
definition (ignoring any namespace
or class
keywords not used in the context of a definition), then we reset and subtract out the number of namespaces or classes from numOpenBrackets
. This will likely cause numOpenBrackets
to become negative so that when we find the {
associated with the namespace/class
, numOpenBrackets
is incremented to 0
.
countMatches
is not provided by the C++11 regex library. However, it is easy to create one.
- size_t countMatches(const std::string& s, const std::regex& rexp) {
- // iteration of each of the matches of rexp in s
- std::sregex_iterator begin(s.cbegin(), s.cend(), rexp), end;
- // only interested in number of matches
- return std::distance(begin, end);
- }
std::sregex_iterator
iterates through a string providing access to each of the matched regular expressions. In this case, all we are interested in are the number of matches. std::distance
reports the number of elements (matches) between begin
and end
.
On lines 67-74 findFunction
counts the number of { }
brackets. This could be done by a simple std::count
, however, here we use std::for_each
with a lambda to count both { }
on one pass through the string.
Once we reach the end of a function definition (line 79) and we have found the search regular expression (line 80), findFunction
reports the find. If an output stream is provided, then each of the stored lines are streamed via the new C++11 range-based for loop (line 82). std::vector
provides iterators that the new range-based for loop can use to iterate over the vector. x is a const reference to each value of the vector. Personally, I like the range-base for loop syntax better than std::copy
.
Just like std::make_pair
, C++11 provides a std::make_tuple
which we use to return a constructed tuple with the needed values. If the search regular expression was not found in the current function then we simply reset (line 85) and go on to the next line.
Summary
I hope you find the FuncFinder utility useful. The full source can be found on GitHub: https://github.com/objectcomputing/FuncFinder.
<regex> Supported Platforms
- Microsoft Visual Studio 2012 - Under namespace
std
- Microsoft Visual Studio 2008 msdn - Under namespace
std::tr1
- GCC - Not implemented as of GCC 4.8.0, March 2013 - Use Boost - Under namespace
boost
- Many others via Boost Regex library - Under namespace
boost
References
- [1] FuncFinder implementation on GitHub
https://github.com/objectcomputing/FuncFinder - [2] Wikipedia article on regular expressions
http://en.wikipedia.org/wiki/Regular_expression - [3] MSDN regular expression overview
http://msdn.microsoft.com/en-us/library/bb982727.aspx - [4] Boost regular expression documentation
http://www.boost.org/doc/libs/1_53_0/libs/regex/doc/html/boost_regex/syntax.html - [5] Online regex library reference
http://en.cppreference.com/w/cpp/regex - [6] Printable C++11 regex cheatsheet
http://cpprocks.com/regex-cheatsheet - [7] Online regex expression tester
http://regexpal.com - [8] Online regex expression tester
http://www.myregextester.com - [9] Excellent book on regular expressions
http://goo.gl/NXDko - [10] The C++ Programming Language, 4th Edition, by Bjarne Stroustrup - Amazon.com link
http://goo.gl/McTCy Completely rewritten for C++11. - [11] C++ Primer, 5th Edition, by Stanley B. Lippman, Josee Lajoie, and Barbara E. Moo - Amazon.com link
http://goo.gl/t0dpf Completely rewritten for C++11. - [12] The C++ Standard Library: A Tutorial and Reference, 2nd Edition, by Nicolai M. Josuttis - Amazon.com link
http://goo.gl/WMsIj Covers all the new C++11 libraries.