Regular Expressions in Java

By Chip Jones, OCI Senior Software Engineer 

October 2001


Introduction

Regular expressions (regexes or REs) are strings that describe patterns of characters in other strings. They are effective tools for searching and manipulating text and are built-in features of Perl, Python, and many other scripting languages. They are a ubiquitous aspect of programming and computer science, but they have been absent from the standard Java API until now.

This article discusses the java.util.regex package, which is a new feature of J2SE 1.4.

The J2SE 1.4 Regex API

Package java.util.regex provides regular expression-based pattern matching. It contains a Pattern class and a Matcher class. 

Matcher instances use Pattern instances to find, replace, and otherwise compare a regex to an input character sequence.

Patterns

The Pattern class is a regex container. An expression is "compiled" into a pattern with either

Pattern pat = Pattern.compile("x*yz*");

or

boolean isMatch = Pattern.matches("x*yz*", "xxxyzzz");

Both examples create a pattern that matches any number of 'x' characters followed by a single 'y' character and then by any number of 'z' characters. The strings "y", "xxy" and "xyz" are valid matches for the pattern.

The matches() method in the second example is a shortcut that compiles the expression and tests the input in one step. For one-time tests of a pattern, this static method of the Pattern class eliminates the need to instantiate a Matcher and call its matches() method. However, because the compiled expression is discarded and cannot be reused, it is less efficient when the same pattern is tested repeatedly.
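
The trade-off between the two forms can be sketched as follows. This is only an illustration: the class name PatternReuseDemo and the helper countMatches() are invented for this example and are not part of the API. The pattern is compiled once and reused for every input, while the one-off test at the end compiles the expression again on each call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption made for this sketch.
class PatternReuseDemo {
    // Compiled once and shared. Pattern instances are immutable,
    // so a single instance can serve any number of matchers.
    private static final Pattern XYZ = Pattern.compile("x*yz*");

    // Counts how many inputs match the whole pattern.
    static int countMatches(String[] inputs) {
        int count = 0;
        for (int i = 0; i < inputs.length; i++) {
            Matcher m = XYZ.matcher(inputs[i]);
            if (m.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] inputs = { "y", "xxy", "xyz", "xz", "yy" };
        System.out.println(countMatches(inputs)); // prints 3

        // One-off test: the expression is compiled again on every call.
        System.out.println(Pattern.matches("x*yz*", "xxxyzzz")); // prints true
    }
}
```

The inputs "xz" and "yy" are rejected because the pattern requires exactly one 'y'.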

The regex grammar recognized by the Pattern class is similar in many respects to Perl. The list below shows some of its common expressions.

Summary of Regex Grammar

Expression      Matches
.               any character
x               character x
x?              zero or one of character x
x*              zero or more of character x
\t \n \r \f     tab, line feed, carriage return, form feed
\d \D           digit, non-digit
\s \S           whitespace, non-whitespace
\w \W           word character, non-word character
[a-z]           the lowercase characters a through z, inclusive
[A-Z]           the uppercase characters A through Z, inclusive
^               the beginning of a line
$               the end of a line
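
A few of these expressions can be tried out with a short program such as the one below. It is a minimal sketch: the class GrammarDemo and its matches() helper are names made up for this illustration, and the sample strings are arbitrary. Note that each backslash in a regex must be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption made for this sketch.
class GrammarDemo {
    // Thin wrapper around Pattern.matches(), which tests the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matches("colou?r", "color"));   // true: 'u' is optional
        System.out.println(matches("colou?r", "colour"));  // true
        System.out.println(matches("\\d\\d\\d", "318"));   // true: three digits
        System.out.println(matches("\\d\\d\\d", "31a"));   // false: 'a' is not a digit
        System.out.println(matches("[a-z]*", "regex"));    // true: all lowercase
        System.out.println(matches("[a-z]*", "Regex"));    // false: 'R' is uppercase
        System.out.println(matches("\\w\\s\\w", "a b"));   // true: word, space, word
        System.out.println(matches(".", "?"));             // true: any one character
    }
}
```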

Matchers

A Matcher object is instantiated by invoking the matcher() method of a Pattern instance.

Matchers are used to search or manipulate a specified string.

Suppose a programmer needs to swap one substring for another. He might choose to deconstruct the string either with a StringTokenizer or with methods in the String class. He would have to write logic to disassemble the string, swap the tokens and reassemble the string.

Alternatively, in C fashion, he could "walk" the string a character at a time and make the appropriate character substitutions. He would have to look ahead and account for shifting characters if the swapped string and substring were not the same length.

Matcher objects, in contrast, make it easy to replace one substring with another.

// replace 'are' with 'is'
Pattern pat = Pattern.compile("are");
Matcher mat = pat.matcher("Java are fun.");
String sentence = mat.replaceAll("is"); // 'Java is fun.'

In addition to substitution, the Matcher class provides methods that find the next matching substring, test complete and partial string matches and return matched substrings.
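
The following sketch shows some of those methods side by side. The class MatcherMethodsDemo and the helper findAll() are hypothetical names chosen for this example. The find() method advances to the next matching substring, group() returns the text of the most recent match, lookingAt() tests whether the input starts with a match, and matches() requires the entire input to match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption made for this sketch.
class MatcherMethodsDemo {
    // Collects every substring of the input that matches the regex.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<String>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {          // advance to the next match, if any
            found.add(m.group());   // text of the match just found
        }
        return found;
    }

    public static void main(String[] args) {
        String phoneRegex = "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d";
        String text = "Call 318-555-3147 or 636-555-8970.";
        System.out.println(findAll(phoneRegex, text));
        // prints [318-555-3147, 636-555-8970]

        Matcher m = Pattern.compile("Java").matcher("Java is fun.");
        System.out.println(m.lookingAt()); // true: the input starts with "Java"
        System.out.println(m.matches());   // false: the whole input is not "Java"
    }
}
```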

Regex Additions to the String Class

The split() method, also new in J2SE 1.4, is a regex-based addition to the String class. It is similar to the split routine in Perl. It uses an input regex as a delimiter and deconstructs the contents of a string into an array of strings.

This method is useful for parsing character-delimited text files or user input. To parse a colon-delimited text string from a user password file, one could use:

String x = "joe:x:670:500::/home/joe:/bin/false";
String[] arr = x.split(":");

The contents of arr are { "joe", "x", "670", "500", "", "/home/joe", "/bin/false" }.
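
Because the delimiter is a regex, it can describe more than a single character. The sketch below, in which the class SplitDemo and its helper methods are names invented for illustration, splits on a comma surrounded by optional whitespace. It also shows one behavior worth knowing: the one-argument split() drops trailing empty strings, while the two-argument form with a negative limit keeps them.

```java
// Hypothetical demo class; the name is an assumption made for this sketch.
class SplitDemo {
    // Splits a colon-delimited record such as a password file entry.
    static String[] fields(String record) {
        return record.split(":");
    }

    // Splits on a comma with optional whitespace on either side.
    static String[] commaList(String line) {
        return line.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        String[] arr = fields("joe:x:670:500::/home/joe:/bin/false");
        System.out.println(arr.length);          // prints 7
        System.out.println(arr[4].length());     // prints 0 (the empty field)

        String[] colors = commaList("red , green,blue");
        System.out.println(colors[1]);           // prints green

        // Trailing empty strings are removed unless a negative limit is given.
        System.out.println("a:b::".split(":").length);     // prints 2
        System.out.println("a:b::".split(":", -1).length); // prints 4
    }
}
```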

A Form Validation Example

Often, websites relegate form validation to the browser and JavaScript. Authors of heavily trafficked sites may choose browser-side form validation to offload server-side processing. Problems can arise when users disable scripting in their browsers, and server cycles ultimately may be traded for developer cycles, as the JavaScript is a separate codebase that must be maintained.

With regular expressions, server-side form validation is easily implemented.

Consider a small web application developed in anticipation of an area code change from 318 to 543. The user enters his name and phone number. A servlet or JSP verifies the proper format of the name and phone number and checks a data source to see if the area code is changing. If so, it displays the phone number with the new area code.

A war file of this application is available for download.

Below is the bean that implements its regular expression logic.

package com.ociweb.jnb;

import java.util.regex.*;

public class RegexValidate {
    private String name = "";
    private String phone = "";
    private String response =
        "Your name and phone number are not on record.";

    // area code 318 will change to 543; 314 will not change
    private static final String OLD318 = "318";
    private static final String NEW318 = "543";

    // precompiled patterns
    // notice the double escape of special characters
    private Pattern phonePattern =
        Pattern.compile("\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d");
    private Pattern namePattern =
        Pattern.compile("^[A-Z]\\s[A-Z][a-z]*");
    private Pattern newAreaCodePattern =
        Pattern.compile("^31\\d");

    // each record is "name:phoneNumber"
    private String[] nameList = {
        "S White:314-555-3141",
        "J Smith:318-555-3147",
        "C Brown:314-555-6543",
        "T Hogan:318-555-3180",
        "E James:636-555-8970"
    };

    private int NUMNAMES = nameList.length;

    public RegexValidate() { }

    public void setName(String name) {
        this.name = name;
        if (!isValidName())
            throw new IllegalArgumentException("Invalid Name.");
    }

    public void setPhone(String phone) {
        this.phone = phone;
        if (!isValidPhone())
            throw new IllegalArgumentException("Invalid Phone Number.");
    }

    private boolean isValidPhone() {
        Matcher phoneMat = phonePattern.matcher(this.phone);
        return phoneMat.matches();
    }

    private boolean isValidName() {
        Matcher nameMat = namePattern.matcher(this.name);
        return nameMat.matches();
    }

    public String getResponse() {
        boolean found = false;

        for (int ctr = 0; ctr < NUMNAMES && !found; ctr++) {
            // split the record into its name and phone number
            String[] nameData = this.nameList[ctr].split(":");

            if (nameData[0].equals(this.name) &&
                nameData[1].equals(this.phone)) {

                found = true;
                this.response = "Your phone number, " +
                    this.phone + ", will not change.";

                // check old and new area code
                Matcher mat = newAreaCodePattern.matcher(this.phone);

                if (mat.find() && mat.group().equals(OLD318)) {
                    this.response = "Your new phone number is " +
                        mat.replaceAll(NEW318);
                }
            }
        }

        return this.response;
    }
}

The isValidPhone() and isValidName() methods use precompiled patterns to limit allowed input for the phone number and name fields. The first constrains phone numbers to a 10-digit hyphenated format. The second ensures that names are properly capitalized and entered as a first initial, a space, and a last name.
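
To see what these two patterns accept and reject, they can be exercised in isolation. The sketch below copies the expressions from the bean into a hypothetical stand-alone class, ValidationDemo, whose name and static helpers are assumptions made for this illustration. The sample inputs are arbitrary.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; it reuses the two expressions from the bean.
class ValidationDemo {
    private static final Pattern PHONE =
        Pattern.compile("\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d");
    private static final Pattern NAME =
        Pattern.compile("^[A-Z]\\s[A-Z][a-z]*");

    static boolean isValidPhone(String phone) {
        return PHONE.matcher(phone).matches();
    }

    static boolean isValidName(String name) {
        return NAME.matcher(name).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPhone("318-555-3147")); // true
        System.out.println(isValidPhone("3185553147"));   // false: no hyphens
        System.out.println(isValidName("J Smith"));       // true
        System.out.println(isValidName("John Smith"));    // false: not an initial
        System.out.println(isValidName("j smith"));       // false: not capitalized
        System.out.println(isValidName("J McDonald"));    // false: second capital
    }
}
```

The last case shows that the name pattern is deliberately strict: a last name with a capital letter after its first character is rejected.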

The getResponse() method uses the split() method to deconstruct a list of names and phone numbers. The split phone numbers are matched against a pattern that selects the area code. While the match may be performed against each record in the data source, all the matches are against the same pattern, so the code is more efficient if the pattern is compiled outside of the getResponse() method and reused.

The pattern '^31\\d' matches both '314' and '318'. Since 318 is the only area code that changes, the example could use a pattern that matched only 318. The partial match pattern was chosen to show the versatility of the regex implementation and to demonstrate the group() method.

The example uses the group() method of the Matcher class to return the substring of the last match. It then checks the substring and calls replaceAll() if the substring is '318'. The replaceAll() method changes the area code to '543'. Since the pattern only matches digits at the start of the line, replaceAll() does not replace any '318' substrings that occur later in the string.
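
The area code logic can be isolated in a few lines. This is a sketch rather than the bean's actual code: the class AreaCodeDemo and the method updateAreaCode() are hypothetical names, although the pattern and the area codes are taken from the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name is an assumption made for this sketch.
class AreaCodeDemo {
    // Matches a three-digit code beginning with "31" at the start of the input.
    private static final Pattern AREA_CODE = Pattern.compile("^31\\d");

    // Returns the phone number with area code 318 changed to 543;
    // any other number is returned unchanged.
    static String updateAreaCode(String phone) {
        Matcher mat = AREA_CODE.matcher(phone);
        if (mat.find() && mat.group().equals("318")) {
            // The pattern is anchored, so only the leading "318" is replaced.
            return mat.replaceAll("543");
        }
        return phone;
    }

    public static void main(String[] args) {
        System.out.println(updateAreaCode("318-555-3180")); // 543-555-3180
        System.out.println(updateAreaCode("314-555-3141")); // 314-555-3141
        System.out.println(updateAreaCode("636-555-8970")); // 636-555-8970
    }
}
```

In the first case the trailing "3180" still contains "318", but it is left alone because the pattern can match only at the start of the input.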

Summary

The new java.util.regex package provides regular expression functionality that has been absent from the standard Java API.

The methods of the new Matcher and Pattern classes and the grammar they support let developers describe and manipulate sequences of characters succinctly. They can replace string machinations with simpler regex constructs that are more powerful and easier to use and maintain.

References

Other Regex Implementations

For those who don't have access to the J2SE 1.4 or are tied to earlier versions of the JDK, there are several third-party regular expression packages. Among them are ORO and Regexp. Both are part of the Jakarta project.

ORO boasts more features and seems to have more active development. The Free Software Foundation distributes the gnu.regex package, and Pat is a regex package compatible with JDK 1.0.


Software Engineering Tech Trends (SETT) is a regular publication featuring emerging trends in software engineering.