C++11 Regex Library

by Kevin Heifner, Principal Software Engineer

July 2013

Introduction

The new C++11 regular expression library <regex> brings the power of regular expressions to the C++ standard library. Previously part of TR1 (C++ Standards Committee Technical Report 1) and also available via Boost, the library is now under namespace std in C++11.

Bjarnew Stroustrup said: "Surprisingly, C++0x [C++11] feels like a new language...". As such, it is important, even for experienced C++ developers, to study the new features. You can find many references online and many of the classic C++ books are even now hitting the bookstores with new editions that are rewritten to explain this "new" language. See the reference section at the end of this article for a list of some of these.

In this article, we will explore the C++11 regex library and look at some of the new C++11 language features by examing the implementation of FuncFinder, a useful command line utility.

Function Regex Finder Utility (FuncFinder) Overview

The FuncFinder command line application takes a regular expression and a list of C++ source files and lists the entire function definition of any function that contains the regular expression. It works similarly to grep with -A NUM, --after-context=NUM and -B NUM, --before-context=NUM options. Except, instead of printing a specified number of trailing and leading lines, it prints the entire function definition regardless of the number of lines. In addition to the function definition, the output includes the line with the regex as well as the range of lines of the function in the file.

> FuncFinder
Usage: FuncFinder regex source_file ...

It can be tedious to specify each source file individually, but under Unix-like environments like Linux or Mac OS X or even Windows with Cygwin, find and xargs can specify each file for you.

> find . -name "*.cpp" | xargs FuncFinder regex_search_string

Here is an example of running it on the TAO services source code looking for uses of std::ifstream.

> cd ACE_wrappers-2.0/TAO/orbsvcs
> find . -name "*.cpp" | xargs FuncFinder std::ifstream
== ./performance-tests/RTEvent/Colocated_Roundtrip/compare_histo.cpp(30) range [24,41]==
 
 
void
load_file (Vector &vector,
           const char *filename)
{
  std::ifstream is (filename);
  if (!is)
    throw "Cannot open file";
 
  while (is && !is.eof ())
    {
      long i; double v;
      is >> i >> v;
      Vector::value_type e (i, v);
      vector.insert (e);
    }
}

Note the use of std::ifstream in the first line of the load_file function. Since FuncFinder found the search string in the compare_histo.cpp file, it reported the number of the line that matched std::ifstream and printed the entire function definition. Here our search regular expression is just the simple string, "std::ifstream", but it could have been any regular expression.

Even though, as we will see, FuncFinder is not written with performance particularly in mind, performance is not completely ignored. FuncFinder runs sufficiently quick for most use cases. You can speed up the previous example by incorporating the amazing performance of grep. The following produces the same output as above, but runs much faster.

> cd ACE_wrappers-2.0/TAO/orbsvcs
> find . -name "*.cpp" | xargs grep -l std::ifstream | xargs FuncFinder std::ifstream

Regex Library Overview

The implementation of FuncFinder utilizes the new C++ regex library. The interface for the regex library is rather small as library interfaces go, but as is the case for regular expression libraries, the power is in the regular expression syntax itself and not in the actual API. This article assumes you are familiar with basic regular expression syntax and will not cover it in detail. If this is not the case, there are many good references in the Reference section at the end of this article.

The main elements of the C++ Regex library consist of only one class (not counting sub_match and match_results), three functions, and two iterators.

Let's start with the class: regex.

template<class CharT, class Traits = regex_traits<CharT>> 
class basic_regex {
  public:
    // traits & operations not shown, typically only the constructor is used
 
    // overloads for basic_string, basic_regex
    // ECMAScript is ECMA-262 i.e. JavaScript with minor modifications
    explicit basic_regex(const CharT* p, 
        flag_type f = regex_constants::ECMAScript); 
};
 
// same pattern that is used for std::string
using regex = basic_regex<char>;

The regex class holds the regular expression and is typically created with a string. The standard implementation includes not only the default, JavaScript, but also basic POSIX, extended POSIX, awk, grep, and egrep grammar support. These non-default grammars are supported by specifying regex_constants passed in as flags to the constructor. Insensitive case comparison is also supported by passing regex_constants::icase as a flag to the constructor. Consult your reference for a complete list of regex_constants flags that are supported.

Note the use of "using regex = basic_regex;" instead of "typedef basic_regex regex;" in the above code snippet. The C++11 standard defines them as equivalent.

Along with the regex class, the library includes the functions: regex_match, regex_search, and regex_replace. regex_match and regex_search have the same function signatures/overloads. The only difference between the two is that regex_match must match the entire input sequence and regex_search looks for the regular expression pattern anywhere in the input sequence.

// params beg & end are begin and end iterators for an input string
// param m is an out parameter of sub-matches
// param e is the regex
// param flags are any regex flags to the function
template <class BidirectionalIterator, class Allocator, class charT, class traits>
  bool regex_match(BidirectionalIterator beg, BidirectionalIterator end,
                   match_results<BidirectionalIterator, Allocator>& m,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags = regex_constants::match_default);
 
// same as above without sub-matches
template <class BidirectionalIterator, class charT, class traits>
  bool regex_match(BidirectionalIterator beg, BidirectionalIterator end,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags = regex_constants::match_default);
 
// overload for std::string not shown
// same as above only string passed instead of begin, end iterators
template <class charT, class Allocator, class traits>
  bool regex_match(const charT* str, 
                   match_results<const charT*, Allocator>& m,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags = regex_constants::match_default);
 
// overload for std::string not shown
// same as above only string passed instead of begin, end iterators
template <class charT, class traits>
  bool regex_match(const charT* str,
                   const basic_regex<charT, traits>& e,
                   regex_constants::match_flag_type flags = regex_constants::match_default);

The optional match_results out argument in the above regex_match and regex_search algorithms provides access into the regular expression capture groups (sub_matches). The empty method of match_results as well as the return value of the algorithm indicates a successful match. Once a match is found, the match_results sub_matches can be indexed via operator[] and size or iterated via begin and end iterators.

regex_replace is the other available library function. It uses a format string to replace matched regular expressions in a string. The format string fmt can be a simple string or can contain references into the regex e.

// param s is string to search
// param e is the regex
// param fmt is the replacement string, can contain references into regex e
// param flags are any regex flags to the function
template <class Traits, class CharT>
basic_string<CharT> regex_replace(const basic_string<CharT>& s,
                                  const basic_regex<CharT, Traits>& e,
                                  const basic_string<CharT>& fmt,
                                  match_flag_type flags = match_default);
 
// param out is an output iterator, can be std::back_inserter
// params beg & end are begin and end iterators for an input string
template <class OutputIter, class BidirectionalIter, class Traits, class CharT>
OutputIter regex_replace(OutputIter out,
                         BidirectionalIter beg, BidirectionalIter end,
                         const basic_regex<CharT, Traits>& e,
                         const basic_string<CharT>& fmt,
                         match_flag_type flags = match_default);

Rounding out the Regex library are two iterators: regex_iterator and regex_token_iterator.regex_iterator is used to iterate through matches of a regular expression in a string. regex_token_iterator is used to iterate through sub_matches (capture groups) or to iterate through non-matches providing advanced tokenization capabilities. I will not be covering regex_token_iterator in this article. However, you can find a nice simple example here: http://en.cppreference.com/w/cpp/regex/regex_token_iterator. We will see regex_iterator in the implementation of FuncFinder.

FuncFinder Implementation

As we saw in the usage of FuncFinder above, FuncFinder takes a regular expression to find and a list of files to search. There is nothing special about the implementation to make this happen. It is just a simple loop through the command line arguments as can be seen in the listing of main below.

// funcfinder.cpp 
#include <algorithm>
#include <cstddef>
#include <functional>
#include <fstream>
#include <iostream>
#include <regex> // C++11 regular expression library
#include <sstream>
#include <string>
#include <tuple>
 
// ... (we will see the rest of the code shortly)
 
int main(int argc, char* argv[]) 
{
  try {
    if (argc < 3) {
      std::cerr << "Usage: " << argv[0] << " regex source_file ..." << std::endl;
      return 1;
    }
 
    for (int i = 2; i < argc; ++i) {
      processFile(argv[1], argv[i]);
    }
 
    return 0;
  } catch (std::exception& e) {
    std::cerr << "Exception: " << e.what() << std::endl;
  } catch (...) {
    std::cerr << "Unknown exception." << std::endl;
  }
  return 1;
}

More interesting code is found in processFile.

void processFile(const std::string& regExpStr, const std::string& fileName)
{
  std::ifstream ifs(fileName);
  if (!ifs) {
    std::cerr << "Unable to open file: " << fileName << std::endl;
    return;
  }
 
  // much faster not to store lines as file is processed, verify regex found first
  std::regex grepExp(regExpStr);
  bool found = false;
  for (std::string s; !found && std::getline(ifs, s);) {
    found = std::regex_search(s, grepExp);
  }
  if (!found) return;
 
  ifs.seekg(0);
  for (size_t total = 0;;) {
    // including findFunction memory usage, using up to 2x size of file in memory plus overhead
    std::stringstream ss;
    auto result = findFunction(ifs, grepExp, ss);
    if (std::get<0>(result) == -1) break;
    std::cout << "== " << fileName << "("
      << total + std::get<0>(result) << ") range ["
      << total + std::get<1>(result) << ","
      << total + std::get<2>(result) << "] ==n";
    std::cout << ss.str();
    total += std::get<2>(result);
  }
}

After ProcessFile opens the input file (lines 3-7), it does a quick search through the file to see if the search regular expression exists in the file. ProcessFile does a two pass approach since, as we will see, findFunction is memory and process intensive.

On lines 10 and 13 we have the first use of the C++11 regex library. On line 10 a std::regex object, grepExp, is created from the command line regular expression string. On line 13, this regex object is passed along with each line of the input file to determine if the regular expression matches any line in the file. std::regex_search will return true if the grepExp matches any value in the line s. If the regular expression is not in the file, then there is no need to call findFunction. Note, lines 11-15 could be removed and FuncFinder would return the exact same results, it would just take longer when the input file does not contain the search regular expression. If we always preprocessed our input files on the command line via grep, this two pass approach would actually slow us down.

On line 17 we start the second pass through the file. This time we pass the input stream to findFunction, which does all the hard work. Line 21 uses the new C++11 auto capability. result is whatever type findFunction returns, in this case, the new C++11 std::tuple. std::tuple is just like the old std::pair only it can have more than two values, each retrievable by std::get as we see on lines: 22, 24-26, 28. To determine the values stored in these three tuple values, lets look at the function declaration of findFunction.

// Finds regExp in function definitions and places entire function in output stream.
// Stores up to the complete input stream in memory plus overhead if os != nullptr.
// Does not handle multi-line raw strings or preprocessor continuation lines.
// @param is input stream
// @param regExp regular expression to search for within function definitions
// @param os output stream, if nullptr then greatly reduced memory usage.
// @return get<0> is line offset in input stream where regExp found or -1
//         get<1> is line offset in input stream of the first line of the function
//         get<2> is line offset in input stream of last line read
std::tuple<size_t,size_t,size_t> 
  findFunction(std::istream& is, const std::regex& regExp, std::ostream* os = nullptr)

The first tuple value is the line number where the regExp is found or -1 if not found. The second and third tuple values are the first and last line of the function. Since these values are offsets we keep a total from the last call to findFunction in processFile so that we can report the correct line number.

Now we are ready to look at the implementation of findFunction.

std::tuple<size_t,size_t,size_t> 
  findFunction(std::istream& is, const std::regex& regExp, std::ostream* os = nullptr) 
{
  // "namespace {", "namespace a{", "namespace a{namespace b {", "using namespace std;", 
  // "namespace alias = oldname;"
  // "class a {", "class a : public b {"
  std::regex nsClassExp("\\b(?:(?:namespace)|(?:class))\\b");
  std::regex usingNSClassExp("(?:\\busing\\s+namespace\\b)|"           // using namespace
                             "(?:\\bnamespace\\s*[^\\d\\W]\\w*\\s*=)|" // namespace alias
                             "(?:\\bclass\\s*[^\\d\\W]\\w*\\s*)[;,>]"  // class forward declare, or template
                            ); 
  std::regex lineCommentExp("\\/\\/.*"); // rest of line
  // handling multi-line comments that begin/end on same line separately simplifies logic
  std::regex spanCommentExp("\\/\\*(?:.*?)\\*\\/"); // a multiline comment on one line: /* stuff */
  std::regex begMultiLineCommentExp("(.*?)\\/\\*.*"); // [stuff]/* ...
  std::regex endMultiLineCommentExp(".*?\\*\\/(.*)"); // ... */[stuff]
  std::regex includeExp("^\\s*#include\\b"); // #include 
  // "*" and '*' which allows for embedded quotes: "hi\"there\""
  // This would be more readable if C++11 raw string support was available.
  std::regex quotedStrExp("\"(?:[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"|"      // double quotes with possible escapes
                          "\\'(?:[^\\'\\\\]*(?:\\\\.[^\\'\\\\]*)*)\\'"); // single quotes with possible escapes

At the beginning of findFunction we define the regex objects needed.

nsClassExp - Matches the whole words: namespace or class
- \\b indicates a word boundary
- (?: indicates a non-capture parentheses group
usingNSClassExp - Matches use of a namespace or class keyword that is not a namespace or class definition, e.g., using namespace or namespace b= or class Foo;
- \\s indicates whitespace
- \\d indicates digits [0-9]
- \\w indicates any word character including [0-9]
- \\W indicates any non-word character
lineCommentExp - Matches // to the end of the line
spanCommentExp - Matches entire comment /* .. */ when on one line
- .*? indicates non-greedy so that match ends at first */
begMultiLineCommentExp - Matches /* to end of line
endMultiLineCommentExp - Matches everything up to and including */
includeExp - Matches #include at beginning of line (whitespace allowed before)
quotedStrExp - Matches " .. " or ' .. ' allowing for escaped quotes " .. \" .. "
- All of these regular expression strings would be easier to read if the new C++11 raw string was more widely supported. However, at this time, support is very limited. With raw string support this could be written as:
  - std::regex quotedStrExp(R""(?:[^"\\]*(?:\\.[^"\\]*)*"|\'(?:[^\'\\]*(?:\\.[^\'>\\]*)*)\'");
  - Still rather complicated, but easier to read without all the C++ required escaped characters.

Next we declare some other variables needed.

std::vector<std::string> lines;
std::string s;
int numOpenBrackets = 0; // int required, using negative numbers to handle namespace/class
size_t foundLoc = 0, begLineNum = 1, endLineNum = 1;
bool inMultiLineComment = false;
 
// factor out common reset logic into helper function
std::function<void()> reset = [&]() {
  lines.clear();
  numOpenBrackets = 0;
  begLineNum = endLineNum + 1;
};

lines - Used to store input lines in case we find the search regExp. The vector is cleared for each new function we encounter in the input stream.
s - Used to store the current input line. We manipulate s as needed, which is okay because a copy is stored in lines for possible output.
numOpenBrackets - findFunction tracks open/close brackets in order to know when it is inside a function definition.
foundLoc, begLineNum, endLineNum - tracks the three return values.
inMultiLineComment - Since we process the input stream one line at a time, we have to know if we are currently inside a multi-line comment.
reset - A C++11 lambda function used to factor out some common code needed to reset our local variables when we reach namespace, class, or file scope. The new C++11 std::function allows us to store the generated lambda function for later invocation. We use [&] to provide reference access to lines, numOpenBrackets, begLineNum, and endLineNum. In this case, only a new C++11 lambda with closure support could provide the ease of use that makes this worthwhile. This technique provides a new tool for fighting code duplication.

And now for the rest of findFunction.

for (; std::getline(is, s); ++endLineNum) {
    if (os) lines.push_back(s);
    s = std::regex_replace(s, spanCommentExp, " "); // remove span comments /* .. */ on same line
    // handle multi-line comments
    std::smatch m;
    if (!inMultiLineComment && std::regex_search(s, m, begMultiLineCommentExp)) {
      s = m[1].str(); // smatch uses iterators, m invalid after assignment to s
      inMultiLineComment = true;
    } else if (inMultiLineComment && std::regex_search(s, m, endMultiLineCommentExp)) {
      s = m[1].str();
      inMultiLineComment = false;
    }
    if (inMultiLineComment) continue;
    s = std::regex_replace(s, lineCommentExp, ""); // remove single line comments
    s = std::regex_replace(s, quotedStrExp, ""); // remove quoted strings
    // ignore #includes in file, assume they are always at file scope
    if (foundLoc == 0 && std::regex_search(s, includeExp)) { reset(); continue; }
    if (std::regex_search(s, regExp)) foundLoc = endLineNum; // found, print at end of function
    // assumes an open namespace statement not on the same line as a using namespace statement
    if (std::regex_search(s, nsClassExp) && !std::regex_search(s, usingNSClassExp)) {      
      reset(); // reset on namespace/class
      // subtract out namespace/class open brackets so we don't report the entire namespace/class
      numOpenBrackets -= countMatches(s, nsClassExp);
    }
    //numOpenBrackets += std::count(s.cbegin(), s.cend(), '{');      // alternative implementation
    //auto numCloseBrackets = std::count(s.cbegin(), s.cend(), '}'); // instead of the for_each
    decltype(numOpenBrackets) numCloseBrackets = 0;
    std::for_each(s.cbegin(), s.cend(), 
      [&] (char c) { 
        if (c == '{') ++numOpenBrackets;
        if (c == '}') ++numCloseBrackets;
    });
 
    if (numCloseBrackets > 0) {
      // only clear at '}' so that function name, comments, etc. included in output of find
      numOpenBrackets -= numCloseBrackets;
      if (numOpenBrackets <= 0) {
        if (foundLoc > 0) {
          // pre-C++11: std::copy(lines.begin(), lines.end(), std::ostream_iterator<std::string>(os, "\n"));
          if (os) for (const auto& x : lines) *os << x << '\n';
          return std::make_tuple(foundLoc, begLineNum, endLineNum);
        }
        reset();
      }
    }
  }
 
  return std::make_tuple(-1, begLineNum, endLineNum);
}

The function consists of a for loop reading one line at a time. Each loop iteration increments the endLineNum which we need in order to properly report the function range. If the caller provided an output stream, then we keep track of each read line in lines, so that on line 82 we can copy them into the output stream. We do not immediately place the line in the output stream because we do not know if we will find the regExp.

Next, on lines 45-56, findFunction removes comments. findFunction does not look for the search regExp in comments. If we wanted findFunction to search comments, as well as code, we could move line 60 up to line 45 so that it searches for the regExp before removing comments. On line 45, if there are any /* .. */ comments on the same line we remove them from our line, s, so they are not included in the search. std::regex_replace replaces all matches of spanCommentExp in s with a single space. We have already seen std::regex_search used in processFile, however, on lines 48 & 51 we use the overloaded version of regex_search that takes a std::smatch. The std::smatch provides access to capture groups in the found regular expression. In the case of begMultiLineCommentExp, the first capture group is all the characters before /*. In the case of endMultiLineCommentExp, the first capture group is all the characters after */. [0] is reserved for the complete regex match. s is set to either all the characters before/* (line 49) or all the characters after */ (line 52).

Since quoted strings could include { or } characters we remove them from the search as well on line 57. On line 59 we reset our search any time we find #include. This is done so that functions at the top of files that match our search do not include all the #include statements. When functions are reported, findFunction reports all the comments above the function, but we do not want to include all the #include statements. Since #includes are rarely found in functions this works nicely for the nominal case.

Line 60 is the regex search for the command line regular expression. If found, we keep track of the line number which is the current last line read, stored in endLineNum. We keep processing the file until we find the closing } for the function with the found regular expression.

Lines 61-79 handle processing of { and } brackets. findFunction keeps running totals of the number of open brackets and the number of closed brackets in order to determine the end of the function with the found regular expression. On line 62, if we find a namespace or class definition (ignoring any namespace or class keywords not used in the context of a definition), then we reset and subtract out the number of namespaces or classes from numOpenBrackets. This will likely cause numOpenBrackets to become negative so that when we find the { associated with the namespace/class, numOpenBrackets is incremented to 0.

countMatches is not provided by the C++11 regex library. However, it is easy to create one.

size_t countMatches(const std::string& s, const std::regex& rexp) {
 // iteration of each of the matches of rexp in s
 std::sregex_iterator begin(s.cbegin(), s.cend(), rexp), end;
 // only interested in number of matches
 return std::distance(begin, end);
}

std::sregex_iterator iterates through a string providing access to each of the matched regular expressions. In this case, all we are interested in are the number of matches. std::distance reports the number of elements (matches) between begin and end.

On lines 67-74 findFunction counts the number of { } brackets. This could be done by a simple std::count, however, here we use std::for_each with a lambda to count both { } on one pass through the string.

Once we reach the end of a function definition (line 79) and we have found the search regular expression (line 80), findFunction reports the find. If an output stream is provided, then each of the stored lines are streamed via the new C++11 range-based for loop (line 82). std::vector provides iterators that the new range-based for loop can use to iterate over the vector. x is a const reference to each value of the vector. Personally, I like the range-base for loop syntax better than std::copy.

Just like std::make_pair, C++11 provides a std::make_tuple which we use to return a constructed tuple with the needed values. If the search regular expression was not found in the current function then we simply reset (line 85) and go on to the next line.

Summary

I hope you find the FuncFinder utility useful. The full source can be found on GitHub: https://github.com/objectcomputing/FuncFinder.

<regex> Supported Platforms

Microsoft Visual Studio 2012 - Under namespace std
Microsoft Visual Studio 2008 msdn - Under namespace std::tr1
GCC - Not implemented as of GCC 4.8.0, March 2013 - Use Boost - Under namespace boost
Many others via Boost Regex library - Under namespace boost

References

[1] FuncFinder implementation on GitHub
https://github.com/objectcomputing/FuncFinder
[2] Wikipedia article on regular expressions
http://en.wikipedia.org/wiki/Regular_expression
[3] MSDN regular expression overview
http://msdn.microsoft.com/en-us/library/bb982727.aspx
[4] Boost regular expression documentation
http://www.boost.org/doc/libs/1_53_0/libs/regex/doc/html/boost_regex/syntax.html
[5] Online regex library reference
http://en.cppreference.com/w/cpp/regex
[6] Printable C++11 regex cheatsheet
http://cpprocks.com/regex-cheatsheet
[7] Online regex expression tester
http://regexpal.com
[8] Online regex expression tester
http://www.myregextester.com
[9] Excellent book on regular expressions
http://goo.gl/NXDko
[10] The C++ Programming Language, 4th Edition, by Bjarne Stroustrup - Amazon.com link
http://goo.gl/McTCy Completely rewritten for C++11.
[11] C++ Primer, 5th Edition, by Stanley B. Lippman, Josee Lajoie, and Barbara E. Moo - Amazon.com link
http://goo.gl/t0dpf Completely rewritten for C++11.
[12] The C++ Standard Library: A Tutorial and Reference, 2nd Edition, by Nicolai M. Josuttis - Amazon.com link
http://goo.gl/WMsIj Covers all the new C++11 libraries.